linux-kernel.vger.kernel.org archive mirror
 help / color / mirror / Atom feed
* [patch] Add allowed_affinity to the irq_desc to make it possible to have restricted irqs
@ 2006-12-13 13:53 Arjan van de Ven
  2006-12-13 19:41 ` Eric W. Biederman
  0 siblings, 1 reply; 15+ messages in thread
From: Arjan van de Ven @ 2006-12-13 13:53 UTC (permalink / raw)
  To: ebiederm, mingo; +Cc: linux-kernel

[due to a broken libata in current -git I've not been able to test this patch enough]


This patch adds an "allowed_affinity" mask to each interrupt, in addition to the 
existing actual affinity mask. In addition this new mask is exported to userspace
in a similar way as the actual affinity is exported. (this is so that irqbalance
can find out about the restriction and take it into account)

The purpose for having this mask is to allow for the situation where interrupts
just can't or shouldn't go to all cpus; one example is the "per cpu" IRQ thing 
that powerpc and others have. Another soon-to-come example is MSIX devices that 
can generate a different MSI interrupt for each cpu; in that case the MSI needs to
be strictly constrained to it's designated cpu.

Signed-off-by: Arjan van de Ven <arjan@linux.intel.com>

---
 include/linux/irq.h |    2 ++
 kernel/irq/chip.c   |    1 +
 kernel/irq/handle.c |    3 ++-
 kernel/irq/manage.c |    5 ++++-
 kernel/irq/proc.c   |   27 +++++++++++++++++++++++++++
 5 files changed, 36 insertions(+), 2 deletions(-)

Index: linux-2.6/include/linux/irq.h
===================================================================
--- linux-2.6.orig/include/linux/irq.h
+++ linux-2.6/include/linux/irq.h
@@ -137,6 +137,7 @@ struct irq_chip {
  * @irqs_unhandled:	stats field for spurious unhandled interrupts
  * @lock:		locking for SMP
  * @affinity:		IRQ affinity on SMP
+ * @allowed_affinity	The allowed affinity for this IRQ
  * @cpu:		cpu index useful for balancing
  * @pending_mask:	pending rebalanced interrupts
  * @dir:		/proc/irq/ procfs entry
@@ -160,6 +161,7 @@ struct irq_desc {
 	spinlock_t		lock;
 #ifdef CONFIG_SMP
 	cpumask_t		affinity;
+	cpumask_t		allowed_affinity;
 	unsigned int		cpu;
 #endif
 #if defined(CONFIG_GENERIC_PENDING_IRQ) || defined(CONFIG_IRQBALANCE)
Index: linux-2.6/kernel/irq/chip.c
===================================================================
--- linux-2.6.orig/kernel/irq/chip.c
+++ linux-2.6/kernel/irq/chip.c
@@ -46,6 +46,7 @@ void dynamic_irq_init(unsigned int irq)
 	desc->irqs_unhandled = 0;
 #ifdef CONFIG_SMP
 	desc->affinity = CPU_MASK_ALL;
+	desc->allowed_affinity = CPU_MASK_ALL;
 #endif
 	spin_unlock_irqrestore(&desc->lock, flags);
 }
Index: linux-2.6/kernel/irq/handle.c
===================================================================
--- linux-2.6.orig/kernel/irq/handle.c
+++ linux-2.6/kernel/irq/handle.c
@@ -56,7 +56,8 @@ struct irq_desc irq_desc[NR_IRQS] __cach
 		.depth = 1,
 		.lock = __SPIN_LOCK_UNLOCKED(irq_desc->lock),
 #ifdef CONFIG_SMP
-		.affinity = CPU_MASK_ALL
+		.affinity = CPU_MASK_ALL,
+		.allowed_affinity = CPU_MASK_ALL
 #endif
 	}
 };
Index: linux-2.6/kernel/irq/manage.c
===================================================================
--- linux-2.6.orig/kernel/irq/manage.c
+++ linux-2.6/kernel/irq/manage.c
@@ -278,8 +278,13 @@ int setup_irq(unsigned int irq, struct i
 
 	*p = new;
 #if defined(CONFIG_IRQ_PER_CPU)
-	if (new->flags & IRQF_PERCPU)
+	if (new->flags & IRQF_PERCPU) {
 		desc->status |= IRQ_PER_CPU;
+		/* don't allow affinity to be set for per cpu interrupts */
+#ifdef CONFIG_SMP
+		desc->allowed_affinity = CPU_MASK_NONE;
+#endif
+	}
 #endif
 	if (!shared) {
 		irq_chip_set_defaults(desc->chip);
Index: linux-2.6/kernel/irq/proc.c
===================================================================
--- linux-2.6.orig/kernel/irq/proc.c
+++ linux-2.6/kernel/irq/proc.c
@@ -47,6 +47,20 @@ static int irq_affinity_read_proc(char *
 	return len;
 }
 
+
+static int irq_affinity_read_allowed_proc(char *page, char **start, off_t off,
+				  int count, int *eof, void *data)
+{
+	int len = cpumask_scnprintf(page, count, irq_desc[(long)data].allowed_affinity);
+
+	if (count - len < 2)
+		return -EINVAL;
+	len += sprintf(page + len, "\n");
+	return len;
+}
+
+
+
 int no_irq_affinity;
 static int irq_affinity_write_proc(struct file *file, const char __user *buffer,
 				   unsigned long count, void *data)
@@ -62,6 +76,9 @@ static int irq_affinity_write_proc(struc
 	if (err)
 		return err;
 
+	/* mask off the allowed_affinity mask to only leave legal cpus */
+	cpus_and(new_value, new_value, irq_desc[irq].allowed_affinity);
+
 	/*
 	 * Do not allow disabling IRQs completely - it's a too easy
 	 * way to make the system unusable accidentally :-) At least
@@ -141,6 +158,16 @@ void register_irq_proc(unsigned int irq)
 			entry->read_proc = irq_affinity_read_proc;
 			entry->write_proc = irq_affinity_write_proc;
 		}
+
+		/* create /proc/irq/<irq>/smp_affinity */
+		entry = create_proc_entry("allowed_affinity", 0400, irq_desc[irq].dir);
+
+		if (entry) {
+			entry->nlink = 1;
+			entry->data = (void *)(long)irq;
+			entry->read_proc = irq_affinity_read_allowed_proc;
+		}
+
 	}
 #endif
 }


^ permalink raw reply	[flat|nested] 15+ messages in thread

* Re: [patch] Add allowed_affinity to the irq_desc to make it possible to have restricted irqs
  2006-12-13 13:53 [patch] Add allowed_affinity to the irq_desc to make it possible to have restricted irqs Arjan van de Ven
@ 2006-12-13 19:41 ` Eric W. Biederman
  2006-12-13 19:43   ` Ingo Molnar
  2006-12-13 20:02   ` Arjan van de Ven
  0 siblings, 2 replies; 15+ messages in thread
From: Eric W. Biederman @ 2006-12-13 19:41 UTC (permalink / raw)
  To: Arjan van de Ven; +Cc: mingo, linux-kernel

Arjan van de Ven <arjan@linux.intel.com> writes:

> [due to a broken libata in current -git I've not been able to test this patch
> enough]
>
>
> This patch adds an "allowed_affinity" mask to each interrupt, in addition to the
> existing actual affinity mask. In addition this new mask is exported to
> userspace
> in a similar way as the actual affinity is exported. (this is so that irqbalance
> can find out about the restriction and take it into account)
>
> The purpose for having this mask is to allow for the situation where interrupts
> just can't or shouldn't go to all cpus; one example is the "per cpu" IRQ thing 
> that powerpc and others have. Another soon-to-come example is MSIX devices that
> can generate a different MSI interrupt for each cpu; in that case the MSI needs
> to
> be strictly constrained to it's designated cpu.

I don't like this.  If we really have constraints that limit which cpus
we can handle a irqs on we should be taking advantage of them to make
the data structure smaller, not making it bigger. 

This feels like a forced fit.  To fit the situations you describe
into allowed_affinity you have to throw away information.  Such
as the fact it is a NUMA node or a MSI-X interrupt.

In addition the cases I can think of allowed_affinity is the wrong
name.  suggested_affinity sounds like what you are trying to implement
and when it is merely a suggestion and not a hard limit it doesn't
make sense to export like this.

The restriction with MSIX is that we want the irq that describes
the flow of data and the user space process that consumes that flow of
data scheduled on the same cpu.  Yes this is mostly one per cpu but it
is not strictly one per cpu so allowed cpus doesn't make sense in that
case.

The more I think about describing what we are schedule, we are
scheduling a more or less periodic realtime process don't we already
have schedulers in the kernel to schedule this kind of thing?  Don't
we want to take into consideration the cache penalty of migration
before we decided to move the irq to another cpu?

Anyway if we are going to start exporting this kind of information
please let's export the whole context and export it in a way that
applications besides irqbalance can take advantage of.  Otherwise
we are just committing to support a wide kernel/user space interface
for the purposes of a single application and that seems ridiculous.

Eric

^ permalink raw reply	[flat|nested] 15+ messages in thread

* Re: [patch] Add allowed_affinity to the irq_desc to make it possible to have restricted irqs
  2006-12-13 19:41 ` Eric W. Biederman
@ 2006-12-13 19:43   ` Ingo Molnar
  2006-12-13 20:06     ` Eric W. Biederman
  2006-12-13 20:02   ` Arjan van de Ven
  1 sibling, 1 reply; 15+ messages in thread
From: Ingo Molnar @ 2006-12-13 19:43 UTC (permalink / raw)
  To: Eric W. Biederman; +Cc: Arjan van de Ven, linux-kernel


* Eric W. Biederman <ebiederm@xmission.com> wrote:

> In addition the cases I can think of allowed_affinity is the wrong 
> name.  suggested_affinity sounds like what you are trying to implement 
> and when it is merely a suggestion and not a hard limit it doesn't 
> make sense to export like this.

well, there are interrupts that must be tied to a single CPU and must 
never be moved away. For example per-CPU clock-events-source interrupts 
are such. So allowed_affinity very much exists.

also there might be hardware that can only route a given IRQ to a subset 
of CPUs. While setting set_affinity allows the irqbalance-daemon to 
'probe' this mask, it's a far from optimal API.

	Ingo

^ permalink raw reply	[flat|nested] 15+ messages in thread

* Re: [patch] Add allowed_affinity to the irq_desc to make it possible to have restricted irqs
  2006-12-13 19:41 ` Eric W. Biederman
  2006-12-13 19:43   ` Ingo Molnar
@ 2006-12-13 20:02   ` Arjan van de Ven
  2006-12-13 20:16     ` Eric W. Biederman
  1 sibling, 1 reply; 15+ messages in thread
From: Arjan van de Ven @ 2006-12-13 20:02 UTC (permalink / raw)
  To: Eric W. Biederman; +Cc: mingo, linux-kernel

> .
> 
> In addition the cases I can think of allowed_affinity is the wrong
> name.  suggested_affinity sounds like what you are trying to implement
> and when it is merely a suggestion and not a hard limit it doesn't
> make sense to export like this.

it really IS a hard limit. 


^ permalink raw reply	[flat|nested] 15+ messages in thread

* Re: [patch] Add allowed_affinity to the irq_desc to make it possible to have restricted irqs
  2006-12-13 19:43   ` Ingo Molnar
@ 2006-12-13 20:06     ` Eric W. Biederman
  2006-12-13 20:19       ` Ingo Molnar
  2006-12-13 20:23       ` Arjan van de Ven
  0 siblings, 2 replies; 15+ messages in thread
From: Eric W. Biederman @ 2006-12-13 20:06 UTC (permalink / raw)
  To: Ingo Molnar; +Cc: Arjan van de Ven, linux-kernel

Ingo Molnar <mingo@elte.hu> writes:

> * Eric W. Biederman <ebiederm@xmission.com> wrote:
>
>> In addition the cases I can think of allowed_affinity is the wrong 
>> name.  suggested_affinity sounds like what you are trying to implement 
>> and when it is merely a suggestion and not a hard limit it doesn't 
>> make sense to export like this.
>
> well, there are interrupts that must be tied to a single CPU and must 
> never be moved away. For example per-CPU clock-events-source interrupts 
> are such. So allowed_affinity very much exists.

Although in that case since it is a single cpu there is a much
more elegant implementation.  We don't need a full cpumask_t to
describe it.

> also there might be hardware that can only route a given IRQ to a subset 
> of CPUs. While setting set_affinity allows the irqbalance-daemon to 
> 'probe' this mask, it's a far from optimal API.

I agree, I am just arguing that adding another awkward interface to
the current situation does not really make the situation better, and
it increases our support burden.

For a bunch of this it is arguable that the way to go is simply to
parse the irq type in /proc/interrupts.  All of the really weird cases
will have a distinct type there.  This certainly captures the MSI-X
case.  There is still a question of how to handle the NUMA case but...

Eric

^ permalink raw reply	[flat|nested] 15+ messages in thread

* Re: [patch] Add allowed_affinity to the irq_desc to make it possible to have restricted irqs
  2006-12-13 20:02   ` Arjan van de Ven
@ 2006-12-13 20:16     ` Eric W. Biederman
  0 siblings, 0 replies; 15+ messages in thread
From: Eric W. Biederman @ 2006-12-13 20:16 UTC (permalink / raw)
  To: Arjan van de Ven; +Cc: mingo, linux-kernel

Arjan van de Ven <arjan@linux.intel.com> writes:

>> .
>> 
>> In addition the cases I can think of allowed_affinity is the wrong
>> name.  suggested_affinity sounds like what you are trying to implement
>> and when it is merely a suggestion and not a hard limit it doesn't
>> make sense to export like this.
>
> it really IS a hard limit. 

Ok.  Which generally makes it uninteresting.  The only cases that I know
with a hard limit are completely unrouteable.

In addition upon reflection you don't handle PER_CPU irqs properly.  As
I recall ia64 uses a different per cpu irq source to target each
individual processor.  But because they are the same the share the
irq source.

I don't think allowed_affinity can even describe the case above.

Eric



^ permalink raw reply	[flat|nested] 15+ messages in thread

* Re: [patch] Add allowed_affinity to the irq_desc to make it possible to have restricted irqs
  2006-12-13 20:06     ` Eric W. Biederman
@ 2006-12-13 20:19       ` Ingo Molnar
  2006-12-13 20:23       ` Arjan van de Ven
  1 sibling, 0 replies; 15+ messages in thread
From: Ingo Molnar @ 2006-12-13 20:19 UTC (permalink / raw)
  To: Eric W. Biederman; +Cc: Arjan van de Ven, linux-kernel


* Eric W. Biederman <ebiederm@xmission.com> wrote:

> > also there might be hardware that can only route a given IRQ to a 
> > subset of CPUs. While setting set_affinity allows the 
> > irqbalance-daemon to 'probe' this mask, it's a far from optimal API.
> 
> I agree, I am just arguing that adding another awkward interface to 
> the current situation does not really make the situation better, and 
> it increases our support burden.

well, please suggest a better interface then.

> For a bunch of this it is arguable that the way to go is simply to 
> parse the irq type in /proc/interrupts.  All of the really weird cases 
> will have a distinct type there.  This certainly captures the MSI-X 
> case.  There is still a question of how to handle the NUMA case but...

... so parsing /proc/interrupts should be that interface? That is a 
historically very volatile interface. It's mostly human-parsed, and we 
frequently twiddle it - genirq changed it too. In v2.6.19 we had fasteio 
instead of fasteoi there.

	Ingo

^ permalink raw reply	[flat|nested] 15+ messages in thread

* Re: [patch] Add allowed_affinity to the irq_desc to make it possible to have restricted irqs
  2006-12-13 20:06     ` Eric W. Biederman
  2006-12-13 20:19       ` Ingo Molnar
@ 2006-12-13 20:23       ` Arjan van de Ven
  2006-12-14  4:14         ` Eric W. Biederman
  1 sibling, 1 reply; 15+ messages in thread
From: Arjan van de Ven @ 2006-12-13 20:23 UTC (permalink / raw)
  To: Eric W. Biederman; +Cc: Ingo Molnar, linux-kernel

Eric W. Biederman wrote:

> There is still a question of how to handle the NUMA case but...
>

the numa case is already handled; the needed info for that is exposed 
already enough... at least for irqbalance

^ permalink raw reply	[flat|nested] 15+ messages in thread

* Re: [patch] Add allowed_affinity to the irq_desc to make it possible to have restricted irqs
  2006-12-13 20:23       ` Arjan van de Ven
@ 2006-12-14  4:14         ` Eric W. Biederman
  2006-12-14  6:24           ` Yinghai Lu
  2006-12-14  7:55           ` Arjan van de Ven
  0 siblings, 2 replies; 15+ messages in thread
From: Eric W. Biederman @ 2006-12-14  4:14 UTC (permalink / raw)
  To: Arjan van de Ven; +Cc: Ingo Molnar, linux-kernel

Arjan van de Ven <arjan@linux.intel.com> writes:

> Eric W. Biederman wrote:
>
>> There is still a question of how to handle the NUMA case but...
>>
>
> the numa case is already handled; the needed info for that is exposed already
> enough... at least for irqbalance

What is the problem you are trying to solve?

If it is just interrupts irqbalanced can't help with we can do that
with a single bit.

My basic problem with understanding what this patch is trying to
solve is that I've seen some theoretical cases raised but I don't see
the real world problem.

Eric

^ permalink raw reply	[flat|nested] 15+ messages in thread

* Re: [patch] Add allowed_affinity to the irq_desc to make it possible to have restricted irqs
  2006-12-14  4:14         ` Eric W. Biederman
@ 2006-12-14  6:24           ` Yinghai Lu
  2006-12-14  7:55           ` Arjan van de Ven
  1 sibling, 0 replies; 15+ messages in thread
From: Yinghai Lu @ 2006-12-14  6:24 UTC (permalink / raw)
  To: Eric W. Biederman; +Cc: Arjan van de Ven, Ingo Molnar, linux-kernel

On 12/13/06, Eric W. Biederman <ebiederm@xmission.com> wrote:
> Arjan van de Ven <arjan@linux.intel.com> writes:
> > the numa case is already handled; the needed info for that is exposed already
> > enough... at least for irqbalance
>
> What is the problem you are trying to solve?
>
> If it is just interrupts irqbalanced can't help with we can do that
> with a single bit.
>
> My basic problem with understanding what this patch is trying to
> solve is that I've seen some theoretical cases raised but I don't see
> the real world problem.

Good question.

irqbalance could use set_affinity from boot cpu to final cpu.

YH

^ permalink raw reply	[flat|nested] 15+ messages in thread

* Re: [patch] Add allowed_affinity to the irq_desc to make it possible to have restricted irqs
  2006-12-14  4:14         ` Eric W. Biederman
  2006-12-14  6:24           ` Yinghai Lu
@ 2006-12-14  7:55           ` Arjan van de Ven
  2006-12-14  8:22             ` Eric W. Biederman
  1 sibling, 1 reply; 15+ messages in thread
From: Arjan van de Ven @ 2006-12-14  7:55 UTC (permalink / raw)
  To: Eric W. Biederman; +Cc: Ingo Molnar, linux-kernel

Eric W. Biederman wrote:
> What is the problem you are trying to solve?

2 problems
1) irq's that irqbalance should not touch at all
2) irqs that can only go to a subset of processors.

1) is very real today
2) is partially real on some of the bigger numa stuff already.

^ permalink raw reply	[flat|nested] 15+ messages in thread

* Re: [patch] Add allowed_affinity to the irq_desc to make it possible to have restricted irqs
  2006-12-14  7:55           ` Arjan van de Ven
@ 2006-12-14  8:22             ` Eric W. Biederman
  2006-12-14  8:23               ` Arjan van de Ven
  0 siblings, 1 reply; 15+ messages in thread
From: Eric W. Biederman @ 2006-12-14  8:22 UTC (permalink / raw)
  To: Arjan van de Ven; +Cc: Ingo Molnar, linux-kernel

Arjan van de Ven <arjan@linux.intel.com> writes:

> Eric W. Biederman wrote:
>> What is the problem you are trying to solve?
>
> 2 problems
> 1) irq's that irqbalance should not touch at all

This is easy we just need a single bit.  Not 128+ bytes on the huge
machines.

> 2) irqs that can only go to a subset of processors.
>
> 1) is very real today
> 2) is partially real on some of the bigger numa stuff already.

You have said you the NUMA cases is handled in another way already?
In which case irqs that can only got to a subset of processors
shouldn't be a problem.

Eric

^ permalink raw reply	[flat|nested] 15+ messages in thread

* Re: [patch] Add allowed_affinity to the irq_desc to make it possible to have restricted irqs
  2006-12-14  8:22             ` Eric W. Biederman
@ 2006-12-14  8:23               ` Arjan van de Ven
  2006-12-14  8:34                 ` Eric W. Biederman
  0 siblings, 1 reply; 15+ messages in thread
From: Arjan van de Ven @ 2006-12-14  8:23 UTC (permalink / raw)
  To: Eric W. Biederman; +Cc: Ingo Molnar, linux-kernel

Eric W. Biederman wrote:
>> 1) is very real today
>> 2) is partially real on some of the bigger numa stuff already.
> 
> You have said you the NUMA cases is handled in another way already?

the numa case of "I prefer that cpu" is handled. Not the "I cannot 
work on those".

^ permalink raw reply	[flat|nested] 15+ messages in thread

* Re: [patch] Add allowed_affinity to the irq_desc to make it possible to have restricted irqs
  2006-12-14  8:23               ` Arjan van de Ven
@ 2006-12-14  8:34                 ` Eric W. Biederman
  2006-12-14  8:40                   ` Arjan van de Ven
  0 siblings, 1 reply; 15+ messages in thread
From: Eric W. Biederman @ 2006-12-14  8:34 UTC (permalink / raw)
  To: Arjan van de Ven; +Cc: Ingo Molnar, linux-kernel

Arjan van de Ven <arjan@linux.intel.com> writes:

> Eric W. Biederman wrote:
>>> 1) is very real today
>>> 2) is partially real on some of the bigger numa stuff already.
>>
>> You have said you the NUMA cases is handled in another way already?
>
> the numa case of "I prefer that cpu" is handled. Not the "I cannot work on
> those".

How is the NUMA case of I prefer that cpu handled?

Eric

^ permalink raw reply	[flat|nested] 15+ messages in thread

* Re: [patch] Add allowed_affinity to the irq_desc to make it possible to have restricted irqs
  2006-12-14  8:34                 ` Eric W. Biederman
@ 2006-12-14  8:40                   ` Arjan van de Ven
  0 siblings, 0 replies; 15+ messages in thread
From: Arjan van de Ven @ 2006-12-14  8:40 UTC (permalink / raw)
  To: Eric W. Biederman; +Cc: Ingo Molnar, linux-kernel


> > the numa case of "I prefer that cpu" is handled. Not the "I cannot work on
> > those".
> 
> How is the NUMA case of I prefer that cpu handled?

it's exported via /sys/bus/pci/devices/<device>/local_cpus
(and the irq is in the /irq directory next to local_cpus)


^ permalink raw reply	[flat|nested] 15+ messages in thread

end of thread, other threads:[~2006-12-14  8:40 UTC | newest]

Thread overview: 15+ messages (download: mbox.gz / follow: Atom feed)
-- links below jump to the message on this page --
2006-12-13 13:53 [patch] Add allowed_affinity to the irq_desc to make it possible to have restricted irqs Arjan van de Ven
2006-12-13 19:41 ` Eric W. Biederman
2006-12-13 19:43   ` Ingo Molnar
2006-12-13 20:06     ` Eric W. Biederman
2006-12-13 20:19       ` Ingo Molnar
2006-12-13 20:23       ` Arjan van de Ven
2006-12-14  4:14         ` Eric W. Biederman
2006-12-14  6:24           ` Yinghai Lu
2006-12-14  7:55           ` Arjan van de Ven
2006-12-14  8:22             ` Eric W. Biederman
2006-12-14  8:23               ` Arjan van de Ven
2006-12-14  8:34                 ` Eric W. Biederman
2006-12-14  8:40                   ` Arjan van de Ven
2006-12-13 20:02   ` Arjan van de Ven
2006-12-13 20:16     ` Eric W. Biederman

This is a public inbox, see mirroring instructions
for how to clone and mirror all data and code used for this inbox;
as well as URLs for NNTP newsgroup(s).