* [announce] [patch] limiting IRQ load, irq-rewrite-2.4.11-B5
From: Ingo Molnar @ 2001-10-01 22:16 UTC
To: linux-kernel
Cc: Linus Torvalds, Alan Cox, Alexey Kuznetsov, Andrea Arcangeli,
Simon Kirby
[-- Attachment #1: Type: TEXT/PLAIN, Size: 6202 bytes --]
to sum things up, we have three main problem areas connected to hardirq
and softirq processing:
- a little utility written by Simon Kirby proved that no matter how much
  softirq throttling is done, it's easy to lock up a pretty powerful Linux
  box via a high rate of network interrupts, even from relatively
  low-powered clients. 2.4.6, 2.4.7 and 2.4.10 all lock up. Alexey has
  said as well that it's still easy to lock up low-powered Linux routers
  via more or less normal traffic.
- prior to 2.4.7 we used to 'leak' softirq handling, so we ended up
  missing softirqs in a number of circumstances. Stock 2.4.10 still has a
  number of places that do this.
- a number of people have reported gigabit performance problems (some
  people reported a 10-20% drop in performance under load) since
  ksoftirqd was added - which was itself added to fix some of the 2.4.6
  softirq-handling latency problems.
we also have another problem that often pops up when the BIOS goes bad or
a device driver makes a mistake:
- Linux often 'locks up' if it gets into an 'interrupt storm' - when an
  interrupt source sends a very high rate of interrupts. This can be seen
  as boot-time hangs and module-insert-time hangs as well.
the attached patch, while a bit radical, is i believe a robust solution to
all four problems. It gives gigabit performance back, avoids the lockups
and attempts to keep softirq-processing latency as short as possible.
the new mechanism:
- the irq handling code has been extended to support 'soft mitigation',
ie. to mitigate the rate of hardware interrupts, without support from
the actual hardware. There is a reasonable default, but the value can
also be decreased/increased on a per-irq basis via /proc/irq/NR/max_rate.
the method is the following. We count the number of interrupts serviced,
and if within a jiffy there are more than max_rate interrupts, the code
disables the IRQ source and marks it as IRQ_MITIGATED. On the next timer
interrupt the irq_rate_check() function is called, which makes sure that
'blocked' irqs are restarted & handled properly. The interrupt is disabled
in the interrupt controller, which has the nice side effect of blocking
interrupt storms as well. (The support code for 'soft mitigation' is
designed to be very lightweight: it's a decrement and a test in the IRQ
handling hot path.)
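condensed from the attached patch's do_IRQ() and irq_rate_check() changes,
the hot path and the per-tick refill boil down to roughly this:

	/* in do_IRQ(): one decrement and one test per interrupt */
	if (unlikely(!--desc->count))
		goto mitigate;	/* mark IRQ_MITIGATED, disable the source */

	/* in irq_rate_check(), called from do_timer() once per jiffy:
	   re-enable mitigated sources and refill every budget */
	desc->count = irq_rate[irq] / HZ;  /* eg. 20000/100 = 200 irqs/jiffy */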
(note that in case of shared interrupts, another 'innocent' device might
stay disabled for some short amount of time as well - but this is not an
issue because this mitigation does not make that device inoperable, it
just delays its interrupt by up to 10 msecs. Plus, modern systems have
properly distributed interrupts.)
- softirq code got simplified significantly. The concept is to 'handle all
pending softirqs' - just as the hardware IRQ code 'handles all hardware
interrupts that were passed to it'. Since most of the time there is a
direct relationship between softirq work and hardirq work, the
mitigation of hardirqs mitigates softirq load as well.
- ksoftirqd is gone, there is never any softirq pending while
softirq-unaware code is executing.
- the tasklet code needed some cleanup along the way, and it also gained
  restart-on-enable and restart-on-unlock properties that it lacked
  before (but which are desirable).
due to these changes, the linecount in softirq.c got smaller by 25%.
[i dropped the unwakeup change - but that one could be useful in the VM,
to eg. unwakeup bdflush or kswapd.]
- drivers can optionally use the set_irq_rate(irq, new_rate) call to
change the current IRQ rate. Drivers are the ones who know best what
kind of loads to expect from the hardware, so they might want to
influence this value. Also, drivers that implement IRQ mitigation
themselves in hardware, can effectively disable the soft-mitigation code
by using a very high rate value.
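a minimal usage sketch (the driver init context is hypothetical;
set_irq_rate() and its 2*HZ floor are from the patch):

	/* hypothetical NIC init: the hardware mitigates interrupts
	   itself, so effectively opt out of soft mitigation */
	set_irq_rate(dev->irq, 1000000);

	/* or clamp a known-noisy device down; set_irq_rate() rounds
	   values below 2*HZ up to 2*HZ */
	set_irq_rate(dev->irq, 2000);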
what is the concept behind all this? Simplicity, and conceptual
consistency. We were clearly heading in the wrong direction: putting more
complexity into the core softirq code to handle some really extreme and
unusual cases. Also, softirqs were slowly morphing into something
process-ish - but in Linux we already have a concept of processes, so we
would have had two dueling concepts. (We still have tasklets, which are
not really processes - they are single-threaded paths of execution.)
with this patch, softirqs can again be what they should be: lightweight
'interrupt code' that processes hard-IRQ events but still does this with
interrupts enabled, to allow for low hard-IRQ latencies. Anything that is
conceptually heavyweight IMO does not belong in softirqs; it should be
moved into process context. That will take care of CPU-time usage
accounting, CPU-time limiting and priority issues as well.
(the patch also imports the latency and softirq-restart fixes from my
previous softirq patches.)
i've tested the patch on UP and SMP, XT-PIC and APIC systems; it
correctly limits network interrupt rates (and other device interrupt
rates) to the given limit. I've done stress-testing as well. The patch is
against 2.4.11-pre1, but it applies just fine to the -ac tree as well.
with a high irq-rate limit set, ping flooding has this effect on the
test-system:
[root@mars /root]# vmstat 1
   procs                     memory    swap          io
 r  b  w   swpd   free   buff  cache  si  so    bi    bo    in
 0  0  0      0 877024   1140  11364   0   0    12     0 30960
 0  0  0      0 877024   1140  11364   0   0     0     0 30950
 0  0  0      0 877024   1140  11364   0   0     0     0 30520
ie. 30k interrupts/sec. With the max_rate set to 1000 interrupts/sec:
[root@mars /root]# echo 1000 > /proc/irq/21/max_rate
[root@mars /root]# vmstat 1
   procs                     memory    swap          io
 r  b  w   swpd   free   buff  cache  si  so    bi    bo    in
 0  0  0      0 877004   1144  11372   0   0     0     0  1112
 0  0  0      0 877004   1144  11372   0   0     0     0  1111
 0  0  0      0 877004   1144  11372   0   0     0     0  1111
so it works just fine here. Interactive tasks are still snappy over the
same interface.
Comments, reports, suggestions and testing feedback are more than welcome,
Ingo
[-- Attachment #2: Type: TEXT/PLAIN, Size: 26601 bytes --]
--- linux/kernel/ksyms.c.orig Mon Oct 1 21:52:32 2001
+++ linux/kernel/ksyms.c Mon Oct 1 21:52:43 2001
@@ -538,8 +538,6 @@
EXPORT_SYMBOL(tasklet_kill);
EXPORT_SYMBOL(__run_task_queue);
EXPORT_SYMBOL(do_softirq);
-EXPORT_SYMBOL(raise_softirq);
-EXPORT_SYMBOL(cpu_raise_softirq);
EXPORT_SYMBOL(__tasklet_schedule);
EXPORT_SYMBOL(__tasklet_hi_schedule);
--- linux/kernel/softirq.c.orig Mon Oct 1 21:52:32 2001
+++ linux/kernel/softirq.c Mon Oct 1 21:53:52 2001
@@ -44,26 +44,11 @@
static struct softirq_action softirq_vec[32] __cacheline_aligned;
-/*
- * we cannot loop indefinitely here to avoid userspace starvation,
- * but we also don't want to introduce a worst case 1/HZ latency
- * to the pending events, so lets the scheduler to balance
- * the softirq load for us.
- */
-static inline void wakeup_softirqd(unsigned cpu)
-{
- struct task_struct * tsk = ksoftirqd_task(cpu);
-
- if (tsk && tsk->state != TASK_RUNNING)
- wake_up_process(tsk);
-}
-
asmlinkage void do_softirq()
{
int cpu = smp_processor_id();
__u32 pending;
long flags;
- __u32 mask;
if (in_interrupt())
return;
@@ -75,7 +60,6 @@
if (pending) {
struct softirq_action *h;
- mask = ~pending;
local_bh_disable();
restart:
/* Reset the pending bitmask before enabling irqs */
@@ -95,152 +79,130 @@
local_irq_disable();
pending = softirq_pending(cpu);
- if (pending & mask) {
- mask &= ~pending;
+ if (pending)
goto restart;
- }
__local_bh_enable();
-
- if (pending)
- wakeup_softirqd(cpu);
}
local_irq_restore(flags);
}
-/*
- * This function must run with irq disabled!
- */
-inline void cpu_raise_softirq(unsigned int cpu, unsigned int nr)
-{
- __cpu_raise_softirq(cpu, nr);
-
- /*
- * If we're in an interrupt or bh, we're done
- * (this also catches bh-disabled code). We will
- * actually run the softirq once we return from
- * the irq or bh.
- *
- * Otherwise we wake up ksoftirqd to make sure we
- * schedule the softirq soon.
- */
- if (!(local_irq_count(cpu) | local_bh_count(cpu)))
- wakeup_softirqd(cpu);
-}
-
-void raise_softirq(unsigned int nr)
-{
- long flags;
-
- local_irq_save(flags);
- cpu_raise_softirq(smp_processor_id(), nr);
- local_irq_restore(flags);
-}
-
void open_softirq(int nr, void (*action)(struct softirq_action*), void *data)
{
softirq_vec[nr].data = data;
softirq_vec[nr].action = action;
}
-
/* Tasklets */
struct tasklet_head tasklet_vec[NR_CPUS] __cacheline_aligned;
struct tasklet_head tasklet_hi_vec[NR_CPUS] __cacheline_aligned;
-void __tasklet_schedule(struct tasklet_struct *t)
+static inline void __tasklet_enable(struct tasklet_struct *t,
+ struct tasklet_head *vec, int softirq)
{
int cpu = smp_processor_id();
- unsigned long flags;
- local_irq_save(flags);
- t->next = tasklet_vec[cpu].list;
- tasklet_vec[cpu].list = t;
- cpu_raise_softirq(cpu, TASKLET_SOFTIRQ);
- local_irq_restore(flags);
+ smp_mb__before_atomic_dec();
+ if (!atomic_dec_and_test(&t->count))
+ return;
+
+ local_irq_disable();
+ /*
+ * Being able to clear the SCHED bit from 1 to 0 means
+ * we got the right to handle this tasklet.
+ * Setting it from 0 to 1 means we can queue it.
+ */
+ if (test_and_clear_bit(TASKLET_STATE_SCHED, &t->state) && !t->next) {
+ if (!test_and_set_bit(TASKLET_STATE_SCHED, &t->state)) {
+
+ t->next = (vec + cpu)->list;
+ (vec + cpu)->list = t;
+ __cpu_raise_softirq(cpu, softirq);
+ }
+ }
+ local_irq_enable();
+ rerun_softirqs(cpu);
}
-void __tasklet_hi_schedule(struct tasklet_struct *t)
+void tasklet_enable(struct tasklet_struct *t)
+{
+ __tasklet_enable(t, tasklet_vec, TASKLET_SOFTIRQ);
+}
+
+void tasklet_hi_enable(struct tasklet_struct *t)
+{
+ __tasklet_enable(t, tasklet_hi_vec, HI_SOFTIRQ);
+}
+
+static inline void __tasklet_sched(struct tasklet_struct *t,
+ struct tasklet_head *vec, int softirq)
{
int cpu = smp_processor_id();
unsigned long flags;
local_irq_save(flags);
- t->next = tasklet_hi_vec[cpu].list;
- tasklet_hi_vec[cpu].list = t;
- cpu_raise_softirq(cpu, HI_SOFTIRQ);
+ t->next = (vec + cpu)->list;
+ (vec + cpu)->list = t;
+ __cpu_raise_softirq(cpu, softirq);
local_irq_restore(flags);
+ rerun_softirqs(cpu);
}
-static void tasklet_action(struct softirq_action *a)
+void __tasklet_schedule(struct tasklet_struct *t)
{
- int cpu = smp_processor_id();
- struct tasklet_struct *list;
-
- local_irq_disable();
- list = tasklet_vec[cpu].list;
- tasklet_vec[cpu].list = NULL;
- local_irq_enable();
-
- while (list) {
- struct tasklet_struct *t = list;
-
- list = list->next;
-
- if (tasklet_trylock(t)) {
- if (!atomic_read(&t->count)) {
- if (!test_and_clear_bit(TASKLET_STATE_SCHED, &t->state))
- BUG();
- t->func(t->data);
- tasklet_unlock(t);
- continue;
- }
- tasklet_unlock(t);
- }
+ __tasklet_sched(t, tasklet_vec, TASKLET_SOFTIRQ);
+}
- local_irq_disable();
- t->next = tasklet_vec[cpu].list;
- tasklet_vec[cpu].list = t;
- __cpu_raise_softirq(cpu, TASKLET_SOFTIRQ);
- local_irq_enable();
- }
+void __tasklet_hi_schedule(struct tasklet_struct *t)
+{
+ __tasklet_sched(t, tasklet_hi_vec, HI_SOFTIRQ);
}
-static void tasklet_hi_action(struct softirq_action *a)
+static inline void __tasklet_action(struct softirq_action *a,
+ struct tasklet_head *vec)
{
int cpu = smp_processor_id();
struct tasklet_struct *list;
local_irq_disable();
- list = tasklet_hi_vec[cpu].list;
- tasklet_hi_vec[cpu].list = NULL;
+ list = (vec + cpu)->list;
+ (vec + cpu)->list = NULL;
local_irq_enable();
while (list) {
struct tasklet_struct *t = list;
list = list->next;
+ t->next = NULL;
- if (tasklet_trylock(t)) {
- if (!atomic_read(&t->count)) {
- if (!test_and_clear_bit(TASKLET_STATE_SCHED, &t->state))
- BUG();
- t->func(t->data);
- tasklet_unlock(t);
- continue;
- }
+repeat:
+ if (!tasklet_trylock(t))
+ continue;
+ if (atomic_read(&t->count)) {
tasklet_unlock(t);
+ continue;
}
-
- local_irq_disable();
- t->next = tasklet_hi_vec[cpu].list;
- tasklet_hi_vec[cpu].list = t;
- __cpu_raise_softirq(cpu, HI_SOFTIRQ);
- local_irq_enable();
+ if (test_and_clear_bit(TASKLET_STATE_SCHED, &t->state)) {
+ t->func(t->data);
+ tasklet_unlock(t);
+ if (test_bit(TASKLET_STATE_SCHED, &t->state))
+ goto repeat;
+ continue;
+ }
+ tasklet_unlock(t);
}
}
+static void tasklet_action(struct softirq_action *a)
+{
+ __tasklet_action(a, tasklet_vec);
+}
+
+static void tasklet_hi_action(struct softirq_action *a)
+{
+ __tasklet_action(a, tasklet_hi_vec);
+}
void tasklet_init(struct tasklet_struct *t,
void (*func)(unsigned long), unsigned long data)
@@ -268,8 +230,6 @@
clear_bit(TASKLET_STATE_SCHED, &t->state);
}
-
-
/* Old style BHs */
static void (*bh_base[32])(void);
@@ -325,7 +285,7 @@
{
int i;
- for (i=0; i<32; i++)
+ for (i = 0; i < 32; i++)
tasklet_init(bh_task_vec+i, bh_action, i);
open_softirq(TASKLET_SOFTIRQ, tasklet_action, NULL);
@@ -358,61 +318,3 @@
f(data);
}
}
-
-static int ksoftirqd(void * __bind_cpu)
-{
- int bind_cpu = *(int *) __bind_cpu;
- int cpu = cpu_logical_map(bind_cpu);
-
- daemonize();
- current->nice = 19;
- sigfillset(&current->blocked);
-
- /* Migrate to the right CPU */
- current->cpus_allowed = 1UL << cpu;
- while (smp_processor_id() != cpu)
- schedule();
-
- sprintf(current->comm, "ksoftirqd_CPU%d", bind_cpu);
-
- __set_current_state(TASK_INTERRUPTIBLE);
- mb();
-
- ksoftirqd_task(cpu) = current;
-
- for (;;) {
- if (!softirq_pending(cpu))
- schedule();
-
- __set_current_state(TASK_RUNNING);
-
- while (softirq_pending(cpu)) {
- do_softirq();
- if (current->need_resched)
- schedule();
- }
-
- __set_current_state(TASK_INTERRUPTIBLE);
- }
-}
-
-static __init int spawn_ksoftirqd(void)
-{
- int cpu;
-
- for (cpu = 0; cpu < smp_num_cpus; cpu++) {
- if (kernel_thread(ksoftirqd, (void *) &cpu,
- CLONE_FS | CLONE_FILES | CLONE_SIGNAL) < 0)
- printk("spawn_ksoftirqd() failed for cpu %d\n", cpu);
- else {
- while (!ksoftirqd_task(cpu_logical_map(cpu))) {
- current->policy |= SCHED_YIELD;
- schedule();
- }
- }
- }
-
- return 0;
-}
-
-__initcall(spawn_ksoftirqd);
--- linux/kernel/timer.c.orig Tue Aug 21 14:26:19 2001
+++ linux/kernel/timer.c Mon Oct 1 21:52:43 2001
@@ -674,6 +674,7 @@
void do_timer(struct pt_regs *regs)
{
(*(unsigned long *)&jiffies)++;
+ irq_rate_check();
#ifndef CONFIG_SMP
/* SMP process accounting uses the local APIC timer */
--- linux/include/linux/netdevice.h.orig Mon Oct 1 21:52:28 2001
+++ linux/include/linux/netdevice.h Mon Oct 1 23:07:44 2001
@@ -486,8 +486,9 @@
local_irq_save(flags);
dev->next_sched = softnet_data[cpu].output_queue;
softnet_data[cpu].output_queue = dev;
- cpu_raise_softirq(cpu, NET_TX_SOFTIRQ);
+ __cpu_raise_softirq(cpu, NET_TX_SOFTIRQ);
local_irq_restore(flags);
+ rerun_softirqs(cpu);
}
}
@@ -535,8 +536,9 @@
local_irq_save(flags);
skb->next = softnet_data[cpu].completion_queue;
softnet_data[cpu].completion_queue = skb;
- cpu_raise_softirq(cpu, NET_TX_SOFTIRQ);
+ __cpu_raise_softirq(cpu, NET_TX_SOFTIRQ);
local_irq_restore(flags);
+ rerun_softirqs(cpu);
}
}
--- linux/include/linux/interrupt.h.orig Mon Oct 1 21:52:32 2001
+++ linux/include/linux/interrupt.h Mon Oct 1 23:07:33 2001
@@ -74,9 +74,15 @@
asmlinkage void do_softirq(void);
extern void open_softirq(int nr, void (*action)(struct softirq_action*), void *data);
extern void softirq_init(void);
-#define __cpu_raise_softirq(cpu, nr) do { softirq_pending(cpu) |= 1UL << (nr); } while (0)
-extern void FASTCALL(cpu_raise_softirq(unsigned int cpu, unsigned int nr));
-extern void FASTCALL(raise_softirq(unsigned int nr));
+extern void show_stack(unsigned long* esp);
+#define __cpu_raise_softirq(cpu, nr) \
+ do { softirq_pending(cpu) |= 1UL << (nr); } while (0)
+
+#define rerun_softirqs(cpu) \
+do { \
+ if (!(local_irq_count(cpu) | local_bh_count(cpu))) \
+ do_softirq(); \
+} while (0);
@@ -182,18 +188,8 @@
smp_mb();
}
-static inline void tasklet_enable(struct tasklet_struct *t)
-{
- smp_mb__before_atomic_dec();
- atomic_dec(&t->count);
-}
-
-static inline void tasklet_hi_enable(struct tasklet_struct *t)
-{
- smp_mb__before_atomic_dec();
- atomic_dec(&t->count);
-}
-
+extern void tasklet_enable(struct tasklet_struct *t);
+extern void tasklet_hi_enable(struct tasklet_struct *t);
extern void tasklet_kill(struct tasklet_struct *t);
extern void tasklet_init(struct tasklet_struct *t,
void (*func)(unsigned long), unsigned long data);
@@ -263,5 +259,6 @@
extern unsigned long probe_irq_on(void); /* returns 0 on failure */
extern int probe_irq_off(unsigned long); /* returns 0 or negative on failure */
extern unsigned int probe_irq_mask(unsigned long); /* returns mask of ISA interrupts */
+extern void irq_rate_check(void);
#endif
--- linux/include/linux/irq.h.orig Mon Oct 1 21:52:32 2001
+++ linux/include/linux/irq.h Mon Oct 1 23:07:19 2001
@@ -31,6 +31,7 @@
#define IRQ_LEVEL 64 /* IRQ level triggered */
#define IRQ_MASKED 128 /* IRQ masked - shouldn't be seen again */
#define IRQ_PER_CPU 256 /* IRQ is per CPU */
+#define IRQ_MITIGATED 512 /* IRQ got rate-limited */
/*
* Interrupt controller descriptor. This is all we need
@@ -62,6 +63,7 @@
struct irqaction *action; /* IRQ action list */
unsigned int depth; /* nested irq disables */
spinlock_t lock;
+ unsigned int count;
} ____cacheline_aligned irq_desc_t;
extern irq_desc_t irq_desc [NR_IRQS];
--- linux/include/asm-i386/irq.h.orig Mon Oct 1 23:06:53 2001
+++ linux/include/asm-i386/irq.h Mon Oct 1 23:07:06 2001
@@ -33,6 +33,7 @@
extern void disable_irq(unsigned int);
extern void disable_irq_nosync(unsigned int);
extern void enable_irq(unsigned int);
+extern void set_irq_rate(unsigned int irq, unsigned int rate);
#ifdef CONFIG_X86_LOCAL_APIC
#define ARCH_HAS_NMI_WATCHDOG /* See include/linux/nmi.h */
--- linux/include/asm-mips/softirq.h.orig Mon Oct 1 21:52:32 2001
+++ linux/include/asm-mips/softirq.h Mon Oct 1 21:52:43 2001
@@ -40,6 +40,4 @@
#define in_softirq() (local_bh_count(smp_processor_id()) != 0)
-#define __cpu_raise_softirq(cpu, nr) set_bit(nr, &softirq_pending(cpu))
-
#endif /* _ASM_SOFTIRQ_H */
--- linux/include/asm-mips64/softirq.h.orig Mon Oct 1 21:52:32 2001
+++ linux/include/asm-mips64/softirq.h Mon Oct 1 21:52:43 2001
@@ -39,19 +39,4 @@
#define in_softirq() (local_bh_count(smp_processor_id()) != 0)
-extern inline void __cpu_raise_softirq(int cpu, int nr)
-{
- unsigned int *m = (unsigned int *) &softirq_pending(cpu);
- unsigned int temp;
-
- __asm__ __volatile__(
- "1:\tll\t%0, %1\t\t\t# __cpu_raise_softirq\n\t"
- "or\t%0, %2\n\t"
- "sc\t%0, %1\n\t"
- "beqz\t%0, 1b"
- : "=&r" (temp), "=m" (*m)
- : "ir" (1UL << nr), "m" (*m)
- : "memory");
-}
-
#endif /* _ASM_SOFTIRQ_H */
--- linux/net/core/dev.c.orig Mon Oct 1 21:52:32 2001
+++ linux/net/core/dev.c Mon Oct 1 21:52:43 2001
@@ -1218,8 +1218,9 @@
dev_hold(skb->dev);
__skb_queue_tail(&queue->input_pkt_queue,skb);
/* Runs from irqs or BH's, no need to wake BH */
- cpu_raise_softirq(this_cpu, NET_RX_SOFTIRQ);
+ __cpu_raise_softirq(this_cpu, NET_RX_SOFTIRQ);
local_irq_restore(flags);
+ rerun_softirqs(this_cpu);
#ifndef OFFLINE_SAMPLE
get_sample_stats(this_cpu);
#endif
@@ -1529,8 +1530,9 @@
local_irq_disable();
netdev_rx_stat[this_cpu].time_squeeze++;
/* This already runs in BH context, no need to wake up BH's */
- cpu_raise_softirq(this_cpu, NET_RX_SOFTIRQ);
+ __cpu_raise_softirq(this_cpu, NET_RX_SOFTIRQ);
local_irq_enable();
+ rerun_softirqs(this_cpu);
NET_PROFILE_LEAVE(softnet_process);
return;
--- linux/arch/i386/kernel/irq.c.orig Mon Oct 1 21:52:28 2001
+++ linux/arch/i386/kernel/irq.c Mon Oct 1 23:06:26 2001
@@ -18,6 +18,7 @@
*/
#include <linux/config.h>
+#include <linux/compiler.h>
#include <linux/ptrace.h>
#include <linux/errno.h>
#include <linux/signal.h>
@@ -68,7 +69,24 @@
irq_desc_t irq_desc[NR_IRQS] __cacheline_aligned =
{ [0 ... NR_IRQS-1] = { 0, &no_irq_type, NULL, 0, SPIN_LOCK_UNLOCKED}};
-static void register_irq_proc (unsigned int irq);
+#define DEFAULT_IRQ_RATE 20000
+
+/*
+ * Maximum number of interrupts allowed, per second.
+ * Individual values can be set via echoing the new
+ * decimal value into /proc/irq/IRQ/max_rate.
+ */
+static unsigned int irq_rate [NR_IRQS] =
+ { [0 ... NR_IRQS-1] = DEFAULT_IRQ_RATE };
+
+/*
+ * Print warnings only once. We reset it to 1 if rate
+ * limit has been changed.
+ */
+static unsigned int rate_warning [NR_IRQS] =
+ { [0 ... NR_IRQS-1] = 1 };
+
+static void register_irq_proc(unsigned int irq);
/*
* Special irq handlers.
@@ -230,35 +248,8 @@
show_stack(NULL);
printk("\n");
}
-
-#define MAXCOUNT 100000000
-/*
- * I had a lockup scenario where a tight loop doing
- * spin_unlock()/spin_lock() on CPU#1 was racing with
- * spin_lock() on CPU#0. CPU#0 should have noticed spin_unlock(), but
- * apparently the spin_unlock() information did not make it
- * through to CPU#0 ... nasty, is this by design, do we have to limit
- * 'memory update oscillation frequency' artificially like here?
- *
- * Such 'high frequency update' races can be avoided by careful design, but
- * some of our major constructs like spinlocks use similar techniques,
- * it would be nice to clarify this issue. Set this define to 0 if you
- * want to check whether your system freezes. I suspect the delay done
- * by SYNC_OTHER_CORES() is in correlation with 'snooping latency', but
- * i thought that such things are guaranteed by design, since we use
- * the 'LOCK' prefix.
- */
-#define SUSPECTED_CPU_OR_CHIPSET_BUG_WORKAROUND 0
-
-#if SUSPECTED_CPU_OR_CHIPSET_BUG_WORKAROUND
-# define SYNC_OTHER_CORES(x) udelay(x+1)
-#else
-/*
- * We have to allow irqs to arrive between __sti and __cli
- */
-# define SYNC_OTHER_CORES(x) __asm__ __volatile__ ("nop")
-#endif
+#define MAXCOUNT 100000000
static inline void wait_on_irq(int cpu)
{
@@ -276,7 +267,7 @@
break;
/* Duh, we have to loop. Release the lock to avoid deadlocks */
- clear_bit(0,&global_irq_lock);
+ clear_bit(0, &global_irq_lock);
for (;;) {
if (!--count) {
@@ -284,7 +275,8 @@
count = ~0;
}
__sti();
- SYNC_OTHER_CORES(cpu);
+ /* Allow irqs to arrive */
+ __asm__ __volatile__ ("nop");
__cli();
if (irqs_running())
continue;
@@ -467,6 +459,13 @@
* controller lock.
*/
+inline void __disable_irq(irq_desc_t *desc, unsigned int irq)
+{
+ if (!desc->depth++) {
+ desc->status |= IRQ_DISABLED;
+ desc->handler->disable(irq);
+ }
+}
/**
* disable_irq_nosync - disable an irq without waiting
* @irq: Interrupt to disable
@@ -485,10 +484,7 @@
unsigned long flags;
spin_lock_irqsave(&desc->lock, flags);
- if (!desc->depth++) {
- desc->status |= IRQ_DISABLED;
- desc->handler->disable(irq);
- }
+ __disable_irq(desc, irq);
spin_unlock_irqrestore(&desc->lock, flags);
}
@@ -516,23 +512,8 @@
}
}
-/**
- * enable_irq - enable handling of an irq
- * @irq: Interrupt to enable
- *
- * Undoes the effect of one call to disable_irq(). If this
- * matches the last disable, processing of interrupts on this
- * IRQ line is re-enabled.
- *
- * This function may be called from IRQ context.
- */
-
-void enable_irq(unsigned int irq)
+static inline void __enable_irq(irq_desc_t *desc, unsigned int irq)
{
- irq_desc_t *desc = irq_desc + irq;
- unsigned long flags;
-
- spin_lock_irqsave(&desc->lock, flags);
switch (desc->depth) {
case 1: {
unsigned int status = desc->status & ~IRQ_DISABLED;
@@ -551,9 +532,69 @@
printk("enable_irq(%u) unbalanced from %p\n", irq,
__builtin_return_address(0));
}
+}
+
+/**
+ * enable_irq - enable handling of an irq
+ * @irq: Interrupt to enable
+ *
+ * Undoes the effect of one call to disable_irq(). If this
+ * matches the last disable, processing of interrupts on this
+ * IRQ line is re-enabled.
+ *
+ * This function may be called from IRQ context.
+ */
+
+void enable_irq(unsigned int irq)
+{
+ irq_desc_t *desc = irq_desc + irq;
+ unsigned long flags;
+
+ spin_lock_irqsave(&desc->lock, flags);
+ __enable_irq(desc, irq);
spin_unlock_irqrestore(&desc->lock, flags);
}
+void set_irq_rate(unsigned int irq, unsigned int rate)
+{
+ if (rate < 2*HZ)
+ rate = 2*HZ;
+ if (irq_rate[irq] != rate)
+ rate_warning[irq] = 1;
+ irq_rate[irq] = rate;
+}
+
+static inline void __handle_mitigated(irq_desc_t *desc, unsigned int irq)
+{
+ desc->status &= ~IRQ_MITIGATED;
+ __enable_irq(desc, irq);
+}
+
+/*
+ * This function, provided by every architecture, resets
+ * the irq-limit counters in every jiffy. Overhead is
+ * fairly small, since it gets the spinlock only if the IRQ
+ * got mitigated.
+ */
+
+void irq_rate_check(void)
+{
+ unsigned long flags;
+ irq_desc_t *desc;
+ int i;
+
+ for (i = 0; i < NR_IRQS; i++) {
+ desc = irq_desc + i;
+ if (desc->count <= 1) {
+ spin_lock_irqsave(&desc->lock, flags);
+ if (desc->status & IRQ_MITIGATED)
+ __handle_mitigated(desc, i);
+ spin_unlock_irqrestore(&desc->lock, flags);
+ }
+ desc->count = irq_rate[i] / HZ;
+ }
+}
+
/*
* do_IRQ handles all normal device IRQ's (the special
* SMP cross-CPU interrupts have their own specific
@@ -585,6 +626,13 @@
WAITING is used by probe to mark irqs that are being tested
*/
status = desc->status & ~(IRQ_REPLAY | IRQ_WAITING);
+ /*
+ * One decrement and one branch (test for zero) into
+ * an unlikely-predicted branch. It cannot be cheaper
+ * than this.
+ */
+ if (unlikely(!--desc->count))
+ goto mitigate;
status |= IRQ_PENDING; /* we _want_ to handle it */
/*
@@ -639,6 +687,27 @@
if (softirq_pending(cpu))
do_softirq();
return 1;
+
+mitigate:
+ /*
+ * We take a slightly longer path here to not put
+ * overhead into the IRQ hotpath:
+ */
+ desc->count = 1;
+ if (status & IRQ_MITIGATED)
+ goto out;
+ /*
+ * Disable interrupt source. It will be re-enabled
+ * by the next timer interrupt - and possibly be
+ * restarted if needed.
+ */
+ desc->status |= IRQ_MITIGATED | IRQ_PENDING;
+ __disable_irq(desc, irq);
+ if (rate_warning[irq]) {
+ printk(KERN_WARNING "Rate limit of %d irqs/sec exceeded for IRQ%d! Throttling irq source.\n", irq_rate[irq], irq);
+ rate_warning[irq] = 0;
+ }
+ goto out;
}
/**
@@ -809,7 +878,7 @@
* something may have generated an irq long ago and we want to
* flush such a longstanding irq before considering it as spurious.
*/
- for (i = NR_IRQS-1; i > 0; i--) {
+ for (i = NR_IRQS-1; i > 0; i--) {
desc = irq_desc + i;
spin_lock_irq(&desc->lock);
@@ -1030,9 +1099,49 @@
static struct proc_dir_entry * root_irq_dir;
static struct proc_dir_entry * irq_dir [NR_IRQS];
+#define DEC_DIGITS 9
+
+/*
+ * Parses from 0 to 999999999. More than enough for IRQ purposes.
+ */
+static unsigned int parse_dec_value(const char *buffer,
+ unsigned long count, unsigned long *ret)
+{
+ unsigned char decnum [DEC_DIGITS];
+ unsigned long value;
+ int i;
+
+ if (!count)
+ return -EINVAL;
+ if (count > DEC_DIGITS)
+ count = DEC_DIGITS;
+ if (copy_from_user(decnum, buffer, count))
+ return -EFAULT;
+
+ /*
+ * Parse the first 9 characters as a decimal string,
+ * any non-decimal char is end-of-string.
+ */
+ value = 0;
+
+ for (i = 0; i < count; i++) {
+ unsigned int c = decnum[i];
+
+ switch (c) {
+ case '0' ... '9': c -= '0'; break;
+ default:
+ goto out;
+ }
+ value = value * 10 + c;
+ }
+out:
+ *ret = value;
+ return 0;
+}
+
#define HEX_DIGITS 8
-static unsigned int parse_hex_value (const char *buffer,
+static unsigned int parse_hex_value(const char *buffer,
unsigned long count, unsigned long *ret)
{
unsigned char hexnum [HEX_DIGITS];
@@ -1071,18 +1180,17 @@
#if CONFIG_SMP
-static struct proc_dir_entry * smp_affinity_entry [NR_IRQS];
-
static unsigned long irq_affinity [NR_IRQS] = { [0 ... NR_IRQS-1] = ~0UL };
-static int irq_affinity_read_proc (char *page, char **start, off_t off,
+
+static int irq_affinity_read_proc(char *page, char **start, off_t off,
int count, int *eof, void *data)
{
- if (count < HEX_DIGITS+1)
+ if (count <= HEX_DIGITS)
return -EINVAL;
return sprintf (page, "%08lx\n", irq_affinity[(long)data]);
}
-static int irq_affinity_write_proc (struct file *file, const char *buffer,
+static int irq_affinity_write_proc(struct file *file, const char *buffer,
unsigned long count, void *data)
{
int irq = (long) data, full_count = count, err;
@@ -1109,16 +1217,16 @@
#endif
-static int prof_cpu_mask_read_proc (char *page, char **start, off_t off,
+static int prof_cpu_mask_read_proc(char *page, char **start, off_t off,
int count, int *eof, void *data)
{
unsigned long *mask = (unsigned long *) data;
- if (count < HEX_DIGITS+1)
+ if (count <= HEX_DIGITS)
return -EINVAL;
return sprintf (page, "%08lx\n", *mask);
}
-static int prof_cpu_mask_write_proc (struct file *file, const char *buffer,
+static int prof_cpu_mask_write_proc(struct file *file, const char *buffer,
unsigned long count, void *data)
{
unsigned long *mask = (unsigned long *) data, full_count = count, err;
@@ -1132,10 +1240,45 @@
return full_count;
}
+static int irq_rate_read_proc(char *page, char **start, off_t off,
+ int count, int *eof, void *data)
+{
+ int irq = (int) data;
+ if (count <= DEC_DIGITS)
+ return -EINVAL;
+ return sprintf (page, "%d\n", irq_rate[irq]);
+}
+
+static int irq_rate_write_proc(struct file *file, const char *buffer,
+ unsigned long count, void *data)
+{
+ int irq = (int) data;
+ unsigned long full_count = count, err;
+ unsigned long new_value;
+
+ /* do not allow the timer interrupt to be rate-limited ... :-| */
+ if (!irq)
+ return -EINVAL;
+ err = parse_dec_value(buffer, count, &new_value);
+ if (err)
+ return err;
+
+ /*
+ * Do not allow a frequency to be lower than 1 interrupt
+ * per jiffy.
+ */
+ if (!new_value)
+ return -EINVAL;
+
+ set_irq_rate(irq, new_value);
+ return full_count;
+}
+
#define MAX_NAMELEN 10
-static void register_irq_proc (unsigned int irq)
+static void register_irq_proc(unsigned int irq)
{
+ struct proc_dir_entry *entry;
char name [MAX_NAMELEN];
if (!root_irq_dir || (irq_desc[irq].handler == &no_irq_type) ||
@@ -1148,28 +1291,32 @@
/* create /proc/irq/1234 */
irq_dir[irq] = proc_mkdir(name, root_irq_dir);
-#if CONFIG_SMP
- {
- struct proc_dir_entry *entry;
+ /* create /proc/irq/1234/max_rate */
+ entry = create_proc_entry("max_rate", 0600, irq_dir[irq]);
- /* create /proc/irq/1234/smp_affinity */
- entry = create_proc_entry("smp_affinity", 0600, irq_dir[irq]);
+ if (entry) {
+ entry->nlink = 1;
+ entry->data = (void *)irq;
+ entry->read_proc = irq_rate_read_proc;
+ entry->write_proc = irq_rate_write_proc;
+ }
- if (entry) {
- entry->nlink = 1;
- entry->data = (void *)(long)irq;
- entry->read_proc = irq_affinity_read_proc;
- entry->write_proc = irq_affinity_write_proc;
- }
+#if CONFIG_SMP
+ /* create /proc/irq/1234/smp_affinity */
+ entry = create_proc_entry("smp_affinity", 0600, irq_dir[irq]);
- smp_affinity_entry[irq] = entry;
+ if (entry) {
+ entry->nlink = 1;
+ entry->data = (void *)(long)irq;
+ entry->read_proc = irq_affinity_read_proc;
+ entry->write_proc = irq_affinity_write_proc;
}
#endif
}
unsigned long prof_cpu_mask = -1;
-void init_irq_proc (void)
+void init_irq_proc(void)
{
struct proc_dir_entry *entry;
int i;
@@ -1181,7 +1328,7 @@
entry = create_proc_entry("prof_cpu_mask", 0600, root_irq_dir);
if (!entry)
- return;
+ return;
entry->nlink = 1;
entry->data = (void *)&prof_cpu_mask;
* Re: [announce] [patch] limiting IRQ load, irq-rewrite-2.4.11-B5
From: Tim Hockin @ 2001-10-01 22:26 UTC
To: mingo
Cc: linux-kernel, Linus Torvalds, Alan Cox, Alexey Kuznetsov,
Andrea Arcangeli, Simon Kirby
> - a little utility written by Simon Kirby proved that no matter how much
>   softirq throttling is done, it's easy to lock up a pretty powerful Linux
>   box via a high rate of network interrupts, even from relatively
>   low-powered clients. 2.4.6, 2.4.7 and 2.4.10 all lock up. Alexey has
>   said as well that it's still easy to lock up low-powered Linux routers
>   via more or less normal traffic.
We proved this a year+ ago. We've got some code brewing to do fair sharing
of IRQs for heavy load situations. I don't have all the details, but
eventually...
> i've tested the patch on both UP, SMP, XT-PIC and APIC systems, it
> correctly limits network interrupt rates (and other device interrupt
> rates) to the given limit. I've done stress-testing as well. The patch is
> against 2.4.11-pre1, but it applies just fine to the -ac tree as well.
Our solution/needs are slightly different - we want to service as many
interrupts as possible and do as much network traffic as possible, and
interactive tasks be damned.
* Re: [announce] [patch] limiting IRQ load, irq-rewrite-2.4.11-B5
From: Ingo Molnar @ 2001-10-01 22:50 UTC
To: Tim Hockin
Cc: linux-kernel, Linus Torvalds, Alan Cox, Alexey Kuznetsov,
Andrea Arcangeli, Simon Kirby
On Mon, 1 Oct 2001, Tim Hockin wrote:
> Our solution/needs are slightly different - we want to service as many
> interrupts as possible and do as much network traffic as possible, and
> interactive-tasks be damned.
the patch in fact enables this too: you can more aggressively get irqs
and softirqs executed by increasing max_rate to just above the 'critical'
rate you can measure. (and the blocked-interrupts period of time will be
enough to let the softirq work finish.) So in fact you might even end up
having higher performance by blocking interrupts for a certain portion of
a timer tick - backlogged work will be processed. Via max_rate you can
partition the percentage of CPU time dedicated to softirq and process
work. (which in your case would be softirq-only work - which should not
be underestimated either.)
Ingo
* Re: [announce] [patch] limiting IRQ load, irq-rewrite-2.4.11-B5
From: Andreas Dilger @ 2001-10-01 22:36 UTC
To: Ingo Molnar
Cc: linux-kernel, Linus Torvalds, Alan Cox, Alexey Kuznetsov,
Andrea Arcangeli, Simon Kirby
On Oct 02, 2001 00:16 +0200, Ingo Molnar wrote:
> - the irq handling code has been extended to support 'soft mitigation',
> ie. to mitigate the rate of hardware interrupts, without support from
> the actual hardware. There is a reasonable default, but the value can
> also be decreased/increased on a per-irq basis via /proc/irq/NR/max_rate.
>
> the method is the following. We count the number of interrupts serviced,
> and if within a jiffy there are more than max_rate interrupts, the code
> disables the IRQ source and marks it as IRQ_MITIGATED. On the next timer
> interrupt the irq_rate_check() function is called, which makes sure that
> 'blocked' irqs are restarted & handled properly.
How far is it to go from a mitigated IRQ (because of too high an interrupt
rate) to a polled interface (e.g. for network cards)? This has been
discussed a number of times as a way to improve overall performance on
busy network systems. Conceivably, a network card could tune max_rate to
a value where it is more efficient (CPU-wise) to poll the interface
instead of using IRQs.
However, waiting for the next regular timer interrupt may be too long
(resulting in lost packets) as buffers overflow. Would it also be
possible for a driver to register a "maximum delay" between servicing
interrupts (within reason, on a non-RT system), so that it can say "I
have X kB of buffers, and the maximum line rate is Y kB/s, so I need
to be serviced within X/Y s when polling without losing data"?
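(as a worked example, with assumed numbers: a 100 Mbit NIC receives at
most ~12500 kB/s, so a card with 64 kB of receive buffering would need
servicing within 64/12500 s, i.e. roughly every 5 ms - only half of a
10 ms jiffy on HZ=100 systems.)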
Cheers, Andreas
--
Andreas Dilger \ "If a man ate a pound of pasta and a pound of antipasto,
\ would they cancel out, leaving him still hungry?"
http://www-mddsp.enel.ucalgary.ca/People/adilger/ -- Dogbert
* Re: [announce] [patch] limiting IRQ load, irq-rewrite-2.4.11-B5
From: Ben Greear @ 2001-10-01 22:50 UTC
To: mingo; +Cc: linux-kernel
Ingo Molnar wrote:
> (note that in case of shared interrupts, another 'innocent' device might
> stay disabled for some short amount of time as well - but this is not an
> issue because this mitigation does not make that device inoperable, it
> just delays its interrupt by up to 10 msecs. Plus, modern systems have
> properly distributed interrupts.)
I'm all for anything that speeds up (and makes more reliable) high network
speeds, but I often run with 8+ ethernet devices, so IRQs have to be
shared, and a 10ms lockdown on an interface could lose lots of packets.
Although it's not a perfect solution, maybe you could (in the kernel)
multiply the max by the number of things using that IRQ? For example, if
you have four ethernet drivers on one IRQ, then let that IRQ fire 4 times
faster than normal before putting it in lockdown...
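a rough sketch of that idea against the patch's data structures (purely
illustrative, not tested):

	/* in irq_rate_check(): count the actions sharing this line
	   and scale the per-jiffy budget accordingly */
	struct irqaction *a;
	int nshared = 0;

	for (a = desc->action; a; a = a->next)
		nshared++;
	desc->count = nshared * (irq_rate[i] / HZ);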
Do you have any idea how many packets-per-second you can get out of a
system (obviously, your system of choice) using your updated code?
(I'm running about 7k packets-per-second tx, and 7k rx, on 3 EEPRO ports
simultaneously on a 1GHz PIII and 2.4.9-pre10... This is from user-space,
so much of the CPU is spent hauling my packets to and from the device.)
Ben
--
Ben Greear <greearb@candelatech.com> <Ben_Greear@excite.com>
President of Candela Technologies Inc http://www.candelatech.com
ScryMUD: http://scry.wanfear.com http://scry.wanfear.com/~greear
* Re: [announce] [patch] limiting IRQ load, irq-rewrite-2.4.11-B5
From: Alan Cox @ 2001-10-02 14:30 UTC
To: Ben Greear; +Cc: mingo, linux-kernel
> I'm all for anything that speeds up (and makes more reliable) high network
> speeds, but I often run with 8+ ethernet devices, so IRQs have to be shared,
> and a 10ms lockdown on an interface could lose lots of packets. Although
> it's not a perfect solution, maybe you could (in the kernel) multiply the
> max by the number of things using that IRQ? For example, if you have four
> ethernet drivers on one IRQ, then let that IRQ fire 4 times faster than
> normal before putting it in lockdown...
What you really care about is limiting the total amount of CPU time used
for interrupt processing so that usermode progress is made. The network
layer shows this up particularly badly because (and it's kind of hard to
avoid this) it frees resources on the hardware before userspace has
processed them.
Silencing a specific target cannot be done by IRQ masking; you have to
ask the controller to shut up. It may be that the default "shut up"
handler is disable_irq, but that is non-optimal.
Having driver callbacks as part of the irq handler also massively improves
the effect of the event, because faced with an IRQ storm a card can

 - decide if it is the guilty party

If so:
 - consider switching to polled mode
 - change its ring buffer size to reduce IRQ load and up latency
   as a tradeoff
 - anything else magical the hardware has (like retuning irq
   mitigation registers)

Alan
* Re: [announce] [patch] limiting IRQ load, irq-rewrite-2.4.11-B5
From: Ingo Molnar @ 2001-10-02 20:51 UTC
To: Alan Cox; +Cc: Ben Greear, linux-kernel
On Tue, 2 Oct 2001, Alan Cox wrote:
> What you really care about is limiting the total amount of CPU time
> used for interrupt processing so that usermode progress is made.
> [...]
exactly. The estimator in -D9 tries to achieve precisely this, both
hardirqs and softirqs are measured.
> Silencing a specific target cannot be done by IRQ masking; you have to
> ask the controller to shut up. It may be that the default "shut up"
> handler is disable_irq, but that is non-optimal.
this could be done later on, but i think it is out of the question for
2.4, as it needs extensive changes in the irq handler and network driver
APIs.
* Re: [announce] [patch] limiting IRQ load, irq-rewrite-2.4.11-B5
From: Linus Torvalds @ 2001-10-01 23:03 UTC
To: Ingo Molnar
Cc: linux-kernel, Alan Cox, Alexey Kuznetsov, Andrea Arcangeli, Simon Kirby
On Tue, 2 Oct 2001, Ingo Molnar wrote:
>
> - the irq handling code has been extended to support 'soft mitigation',
> ie. to mitigate the rate of hardware interrupts, without support from
> the actual hardware. There is a reasonable default, but the value can
> also be decreased/increased on a per-irq basis via /proc/irq/NR/max_rate.
And how do you select max_rate sanely? It depends on how heavy each
interrupt is, the speed of the CPU, etc. A rate that works for a
network card with a certain packet size may be completely ineffective on
the same machine with the same network card but a different packet size.
When you select the wrong number, you slow the system down for no good
reason (too low a number), or your mitigation has zero effect because the
system can't do that many interrupts per tick anyway (too high a number).
Saying "hey, that's the users problem", is _not_ a solution. It needs to
have some automatic cut-off that finds the right sustainable rate
automatically, instead of hardcoding random default values and asking the
user to know the unknowable.
Automatically doing the right thing may be hard, but it should be
solvable. In particular, something like the following _may_ be a workable
approach, rather than having a hardcoded limit:
- have a notion of "made progress". Certain events count as progress, and
  will reset the interrupt count.

  Examples of "progress":
   - idle task loop
   - a context switch

- depend on the fact that on a PC, the timer interrupt has the highest
  priority, and make the timer interrupt do something like

	if (!made_progress) {
		disable_next_irq = 1;
	} else
		made_progress = 0;

- have all other interrupts do something like

	if (disable_next_irq)
		goto mitigate;

  which just says that we mitigate an irq _only_ if we didn't make any
  progress at all. Rather than mitigating on some random count that can
  never be perfect.
(Tweak to suit your own definition of "made progress" - maybe you'd like
to require more than just a context switch).
Linus
* [patch] auto-limiting IRQ load, IRQ-polling, irq-rewrite-2.4.11-D9
From: Ingo Molnar @ 2001-10-02 20:34 UTC
To: Linus Torvalds
Cc: linux-kernel, Alan Cox, Arjan van de Ven, Alexey Kuznetsov,
Andrea Arcangeli, Simon Kirby
[-- Attachment #1: Type: TEXT/PLAIN, Size: 6046 bytes --]
On Mon, 1 Oct 2001, Linus Torvalds wrote:
> And how do you select max_rate sanely? [...]
> Saying "hey, that's the users problem", is _not_ a solution. It needs
> to have some automatic cut-off that finds the right sustainable rate
> automatically, instead of hardcoding random default values and asking
> the user to know the unknowable.
good point. I did not ignore this problem, i was just unable to find any
solution that felt robust, so i convinced myself that max_rate was the
best idea :-)
but a fresh start today, and a good idea from Arjan, resulted in a pretty
good 'irq load' estimator that implements the above cut-off dynamically:
the method is to detect in do_IRQ() whether we have interrupted an
interrupt context of any sort or not. The number of 'total' and 'irq'
interruptions is counted, and the 'irq load' is "irq / total". The irq
code goes into per-irq and per-cpu 'overload mode' if the 'irq load' is
higher than ~97%.
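a condensed sketch of the estimator (the counter and helper names are
illustrative - the real patch keeps the counters per-irq and per-cpu):

	/* in do_IRQ(): classify what this interrupt interrupted */
	desc->total_contexts++;
	if (in_interrupt())
		desc->irq_contexts++;

	/* per timer tick: overloaded if more than ~97% of the samples
	   landed in interrupt context */
	if (desc->irq_contexts * 100 > desc->total_contexts * 97)
		enter_overload_mode(desc);	/* illustrative name */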
There is one case of 'underestimation': hardirqs that do not enable
interrupts (SA_INTERRUPT handlers) will not be interrupted. Fortunately,
99.9% of network drivers (and other high-rate irq drivers) enable
interrupts in their hardirq handlers. The only significant SA_INTERRUPT
user is the SCSI layer - which is not an 'external' device.
but in the loads we care about, the irqs are irq-enabled and trigger
softirqs - both are measured precisely via the in_interrupt() method.
this load-calculation method has a few 'false triggers': eg.
syscall-level code that uses local_bh_disable() will be counted - but
such code is not very common, and overestimating irq load slightly is
always better than underestimating it.
Other estimators, like context switches and RDTSC, have more serious
problems i believe. Context switches are imo not a 'perfect' indicator of
'progress' in a number of important situations - eg. when a userspace
process is not scheduling at all. There is no other indicator of progress
in this case but the fact that it's userspace that we interrupt.
RDTSC, while a 'perfect' indicator of actual system, irq and user load, is
not generic and adds quite some overhead to the lowlevel code. The only
'generic' and accurate time measurement method, do_gettimeofday(), has way
too much overhead to be used in lowlevel irq code.
another advantage of the in_interrupt() method is finegrained metrics:
the counters are per-irq and per-cpu, so low-frequency interrupts like the
keyboard or mouse interrupt are much less likely to be mitigated
needlessly. Separate mitigation does not mean that global effects will not
be measured correctly: if eg. 16 devices each produce a 10k irqs/sec load
(which, in isolation, is not enough to trigger an overload), together they
will starve non-irq contexts and will cause an overload in all 16 active
irq handlers.
unfortunately there is also another, new problem, which got reported and
which i can reproduce as well: due to the milliseconds-long disabling of
ethernet interrupts, the receiver can overflow easily, producing overruns
and lost packets. The result of this phenomenon is an effectively frozen
network: no incoming or outgoing TCP connection ever makes any reasonable
progress. So by auto-mitigation alone we have only exchanged a 'box
lockup' for a 'network lockup' - a different kind of DoS, but still a
DoS.
the new patch (attached) provides a solution for this problem too, by
introducing a hardirq-polling kernel thread: 'kpolld'. (kpolld is
significantly different from ksoftirqd: it only gets triggered in truly
hopeless situations, and it handles hardirq load in such cases. I've never
seen it run under any 'good' loads i care about. Plus, polling can have
significant performance advantages in dedicated networking environments.)
while this inevitably caused the introduction of a device-polling
framework, it's hard to separate the two things - auto-mitigation alone is
not useful without going into poll mode, unfortunately. Only the
networking code uses the polling framework currently.
Another option would be to use the interrupt handlers themselves to do the
polling - but this puts certain assumptions into existing IRQ handlers,
which we cannot do for 2.4 i believe. Plus, the ->poll_controller() driver
extension is also used by the netconsole, so we get free testing for it
:-) Another reason is that i think subsystems should have close control
over the way they do polling. There are also a few examples of
'synchronous polling' points i added to the networking code: the device
will only be polled once, and only if we are in IRQ overload mode.
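such a synchronous polling point might look roughly like this (the
overload predicate's name is illustrative; ->poll_controller() is the
driver hook the patch introduces):

	/* poll the device once, but only while its irq is in
	   overload mode */
	if (dev->poll_controller && irq_overloaded(dev->irq))
		dev->poll_controller(dev);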
about performance: while it can certainly be tuned further, the estimator
works pretty well under the loads i tested. 97% proved to be a reasonable
limit, which i was unable to reach via 'normal' loads - it took dedicated
tools like updspam to trigger the overload. In overload mode, performance
is still pretty good, and TCP connections over the network are snappy. But
it would be very nice if those who reported packet drops and bad network
performance with yesterday's patch could re-test the same load situation
with this patch applied to 2.4.11-pre2.
note: i still kept max_rate, but it's now scaled along cpu_khz - a 300 MHz
box will have a default value of 30k irqs/sec, a 1 GHz box will have a
100k irqs/sec limit. This limit still has the advantage of potentially
catching runaway devices that cause irq storms. Another reason to keep
max_rate was to enable router & appliance vendors to set it to a low value
to force the system into 'polling mode'. For dedicated boxes this makes
perfect sense.
note2: the patch includes the eepro100 driver patches from the -ac tree as
well (Arjan's receiver(?)-hangup fix) - without those fixes i could easily
hang my eepro100 cards after a few minutes of extreme load.
the patch can also be downloaded from:
http://redhat.com/~mingo/irq-rewrite/
i've stress-tested the patch on 2.4.11-pre1, on UP-PIC, UP-APIC, and
SMP-APIC systems.
Comments, testing feedback welcome,
Ingo
[-- Attachment #2: Type: APPLICATION/x-bzip2, Size: 13213 bytes --]
* [patch] auto-limiting IRQ load take #2, irq-rewrite-2.4.11-F4
From: Ingo Molnar @ 2001-10-03 14:51 UTC
To: Linus Torvalds
Cc: linux-kernel, Alan Cox, Arjan van de Ven, Alexey Kuznetsov, netdev
[-- Attachment #1: Type: TEXT/PLAIN, Size: 1178 bytes --]
the attached patch contains a cleaned-up version of IRQ auto-mitigation.

- i've removed the max_rate limit and have streamlined the impact of the
  load estimator on do_IRQ() to this piece of code:

	desc->total_contexts++;
	if (unlikely(in_interrupt()))
		goto mitigate_irqload;

  i don't think we can get much cheaper than this. (We could perhaps avoid
  the total_contexts counter by saving a 'snapshot' of the existing
  kstat.irqs array of counters on every timer tick and comparing the
  snapshot to the current kstat.irqs values. That looked pretty unclean
  though.)
- the per-cpu irq counting in -D9 was incorrect, as it collapsed all irq
  handlers into a single counter.

- i've removed the net-polling hacks - they are unrelated to this problem.

the patch is against 2.4.11-pre2. (the eepro100.c fixes from the -ac tree
are already included in -pre2; i only included them in this patch to make
patching & testing against 2.4.10 easier.)
(i'd like to stress the point again that the goal of this approach is
*not* to be nice. This is an airbag mechanism - it can and will hurt
performance. But my box does not lock up anymore.)
Ingo
[-- Attachment #2: Type: APPLICATION/x-bzip2, Size: 10395 bytes --]
* Re: [patch] auto-limiting IRQ load take #2, irq-rewrite-2.4.11-F4
From: jamal @ 2001-10-03 15:16 UTC
To: Ingo Molnar
Cc: Linus Torvalds, linux-kernel, Alan Cox, Arjan van de Ven,
Alexey Kuznetsov, netdev
Your approach is still wrong. Please do not accept this patch.
cheers,
jamal
On Wed, 3 Oct 2001, Ingo Molnar wrote:
>
> the attached patch contains a cleaned up version of IRQ auto-mitigation.
>
> - i've removed the max_rate limit and have streamlined the impact of the
> load-estimator on do_IRQ() to this piece of code:
>
> desc->total_contexts++;
> if (unlikely(in_interrupt()))
> goto mitigate_irqload;
>
> i don't think we can get much cheaper than this. (We could perhaps avoid
> the total_contexts counter by saving a 'snapshot' of the existing
> kstat.irqs array of counters in every timer tick and comparing the
> snapshot to the current kstat.irqs values. That looked pretty unclean
> though.)
>
> - the per-cpu irq counting in -D9 was incorrect as it collapsed all irq
> handlers into a single counter.
>
> - i've removed the net-polling hacks - they are unrelated to this problem.
>
> the patch is against 2.4.11-pre2. (the eepro100.c fixes from the -ac tree
> are already included in -pre2; i only included them in this patch to make
> patching & testing against 2.4.10 easier.)
>
> (i'd like to stress the point again that the goal of this approach is
> *not* to be nice. This is an airbag mechanism - it can and will hurt
> performance. But my box does not lock up anymore.)
>
> Ingo
>
* Re: [patch] auto-limiting IRQ load take #2, irq-rewrite-2.4.11-F4
From: Rik van Riel @ 2001-10-03 16:51 UTC
To: jamal
Cc: Ingo Molnar, Linus Torvalds, linux-kernel, Alan Cox,
Arjan van de Ven, Alexey Kuznetsov, netdev
On Wed, 3 Oct 2001, jamal wrote:
> Your approach is still wrong. Please do not accept this patch.
I rather like the fact that Ingo's approach will keep the
system alive regardless of what driver is used.
Rik
--
DMCA, SSSCA, W3C? Who cares? http://thefreeworld.net/ (volunteers needed)
http://www.surriel.com/ http://distro.conectiva.com/
* Re: [announce] [patch] limiting IRQ load, irq-rewrite-2.4.11-B5
From: Marcus Sundberg @ 2001-10-02 6:50 UTC
To: linux-kernel
mingo@elte.hu (Ingo Molnar) writes:
> (note that in case of shared interrupts, another 'innocent' device might
> stay disabled for some short amount of time as well - but this is not an
> issue because this mitigation does not make that device inoperable, it
> just delays its interrupt by up to 10 msecs. Plus, modern systems have
> properly distributed interrupts.)
Guess my P3-based laptop doesn't count as modern then:
  0:    7602983    XT-PIC  timer
  1:      10575    XT-PIC  keyboard
  2:          0    XT-PIC  cascade
  8:          1    XT-PIC  rtc
 11:    1626004    XT-PIC  Toshiba America Info Systems ToPIC95 PCI to Cardbus Bridge with ZV Support, Toshiba America Info Systems ToPIC95 PCI to Cardbus Bridge with ZV Support (#2), usb-uhci, eth0, BreezeCom Card, Intel 440MX, irda0
 12:       1342    XT-PIC  PS/2 Mouse
 14:      23605    XT-PIC  ide0
I can't even imagine why they did it like this...
//Marcus
--
---------------------------------+---------------------------------
Marcus Sundberg | Phone: +46 707 452062
Embedded Systems Consultant | Email: marcus@cendio.se
Cendio Systems AB | http://www.cendio.com
* Re: [announce] [patch] limiting IRQ load, irq-rewrite-2.4.11-B5
From: Ingo Molnar @ 2001-10-03 8:47 UTC
To: Marcus Sundberg; +Cc: linux-kernel
On 2 Oct 2001, Marcus Sundberg wrote:
> Guess my P3-based laptop doesn't count as modern then:
>
> 0: 7602983 XT-PIC timer
> 1: 10575 XT-PIC keyboard
> 2: 0 XT-PIC cascade
> 8: 1 XT-PIC rtc
> 11: 1626004 XT-PIC Toshiba America Info Systems ToPIC95 PCI to
> Cardbus Bridge with ZV Support, Toshiba America Info Systems ToPIC95
> PCI to Cardbus Bridge with ZV Support (#2), usb-uhci, eth0, BreezeCom
> Card, Intel 440MX, irda0
ugh!
> I can't even imagine why they did it like this...
well, you aren't going to be using it as a webserver, i guess? :) But the
costs on desktops are minimal. It's the high-irq-rate server environments
that want separate irq sources.
Ingo