linux-kernel.vger.kernel.org archive mirror
 help / color / mirror / Atom feed
* [PATCH] x86_64: resize NR_IRQS for large machines
@ 2008-03-24 14:31 Alan Mayer
  2008-03-25 16:19 ` Ingo Molnar
  2008-03-25 23:10 ` Pavel Machek
  0 siblings, 2 replies; 8+ messages in thread
From: Alan Mayer @ 2008-03-24 14:31 UTC (permalink / raw)
  To: torvalds, mingo
  Cc: linux-kernel list, Robin Holt, Jack Steiner, Russ Anderson

Subject: [PATCH] x86_64: resize NR_IRQS for large machines

From: Alan Mayer <ajm@sgi.com>

On machines with very large numbers of cpus, tables that are dimensioned
by NR_IRQS get very large, especially the irq_desc table.  They are also
very sparsely used.  When the cpu count is > MAX_IO_APICS, use MAX_IO_APICS
to set NR_IRQS, otherwise use NR_CPUS.

Signed-off-by: Alan Mayer <ajm@sgi.com>

Reviewed-by: Christoph Lameter <clameter@sgi.com>

---

Index: v2.6.25-rc6/include/asm-x86/irq_64.h
===================================================================
--- v2.6.25-rc6.orig/include/asm-x86/irq_64.h	2008-03-19 16:52:52.000000000 -0500
+++ v2.6.25-rc6/include/asm-x86/irq_64.h	2008-03-20 16:46:51.000000000 -0500
@@ -10,6 +10,10 @@
  *	<tomsoft@informatik.tu-chemnitz.de>
  */
 
+#if !defined(MAX_IO_APICS)
+#include <asm/apicdef.h>
+#endif
+
 #define TIMER_IRQ 0
 
 /*
@@ -31,7 +35,11 @@
 
 #define FIRST_SYSTEM_VECTOR	0xef   /* duplicated in hw_irq.h */
 
-#define NR_IRQS (NR_VECTORS + (32 *NR_CPUS))
+#if NR_CPUS < MAX_IO_APICS
+#define NR_IRQS (NR_VECTORS + (32 * NR_CPUS))
+#else
+#define NR_IRQS (NR_VECTORS + (32 * MAX_IO_APICS))
+#endif
 #define NR_IRQ_VECTORS NR_IRQS
 
 static __inline__ int irq_canonicalize(int irq)
Index: v2.6.25-rc6/include/linux/kernel_stat.h
===================================================================
--- v2.6.25-rc6.orig/include/linux/kernel_stat.h	2008-03-19 16:53:00.000000000 -0500
+++ v2.6.25-rc6/include/linux/kernel_stat.h	2008-03-20 11:12:27.000000000 -0500
@@ -1,11 +1,11 @@
 #ifndef _LINUX_KERNEL_STAT_H
 #define _LINUX_KERNEL_STAT_H
 
-#include <asm/irq.h>
 #include <linux/smp.h>
 #include <linux/threads.h>
 #include <linux/percpu.h>
 #include <linux/cpumask.h>
+#include <asm/irq.h>
 #include <asm/cputime.h>
 
 /*


^ permalink raw reply	[flat|nested] 8+ messages in thread

* Re: [PATCH] x86_64: resize NR_IRQS for large machines
  2008-03-24 14:31 [PATCH] x86_64: resize NR_IRQS for large machines Alan Mayer
@ 2008-03-25 16:19 ` Ingo Molnar
  2008-03-25 16:24   ` Alan Mayer
  2008-03-25 23:10 ` Pavel Machek
  1 sibling, 1 reply; 8+ messages in thread
From: Ingo Molnar @ 2008-03-25 16:19 UTC (permalink / raw)
  To: Alan Mayer
  Cc: torvalds, linux-kernel list, Robin Holt, Jack Steiner, Russ Anderson


* Alan Mayer <ajm@sgi.com> wrote:

> Subject: [PATCH] x86_64: resize NR_IRQS for large machines
> 
> From: Alan Mayer <ajm@sgi.com>
> 
> On machines with very large numbers of cpus, tables that are 
> dimensioned by NR_IRQS get very large, especially the irq_desc table.  
> They are also very sparsely used.  When the cpu count is > 
> MAX_IO_APICS, use MAX_IO_APICS to set NR_IRQS, otherwise use NR_CPUS.
> 
> Signed-off-by: Alan Mayer <ajm@sgi.com>
> Reviewed-by: Christoph Lameter <clameter@sgi.com>

thanks, applied. I suspect this can wait until .26, as there appears to 
be a number of other patches that you need to get these boxes running on 
vanilla, right?

	Ingo

^ permalink raw reply	[flat|nested] 8+ messages in thread

* Re: [PATCH] x86_64: resize NR_IRQS for large machines
  2008-03-25 16:19 ` Ingo Molnar
@ 2008-03-25 16:24   ` Alan Mayer
  0 siblings, 0 replies; 8+ messages in thread
From: Alan Mayer @ 2008-03-25 16:24 UTC (permalink / raw)
  To: Ingo Molnar
  Cc: Alan Mayer, torvalds, linux-kernel list, Robin Holt,
	Jack Steiner, Russ Anderson

On Tue, 25 Mar 2008, Ingo Molnar wrote:

> 
> * Alan Mayer <ajm@sgi.com> wrote:
> 
> > Subject: [PATCH] x86_64: resize NR_IRQS for large machines
> > 
> > From: Alan Mayer <ajm@sgi.com>
> > 
> > On machines with very large numbers of cpus, tables that are 
> > dimensioned by NR_IRQS get very large, especially the irq_desc table.  
> > They are also very sparsely used.  When the cpu count is > 
> > MAX_IO_APICS, use MAX_IO_APICS to set NR_IRQS, otherwise use NR_CPUS.
> > 
> > Signed-off-by: Alan Mayer <ajm@sgi.com>
> > Reviewed-by: Christoph Lameter <clameter@sgi.com>
> 
> thanks, applied. I suspect this can wait until .26, as there appears to 
> be a number of other patches that you need to get these boxes running on 
> vanilla, right?
> 
> 	Ingo

That's right.  Thanks.

		--ajm

> --
> To unsubscribe from this list: send the line "unsubscribe linux-kernel" in
> the body of a message to majordomo@vger.kernel.org
> More majordomo info at  http://vger.kernel.org/majordomo-info.html
> Please read the FAQ at  http://www.tux.org/lkml/
> 

Spirits are using me,
A larger voice is calling me.
--
Alan J. Mayer
SGI
ajm@sgi.com
WORK: 651-683-3131
HOME: 651-407-0134
--


^ permalink raw reply	[flat|nested] 8+ messages in thread

* Re: [PATCH] x86_64: resize NR_IRQS for large machines
  2008-03-24 14:31 [PATCH] x86_64: resize NR_IRQS for large machines Alan Mayer
  2008-03-25 16:19 ` Ingo Molnar
@ 2008-03-25 23:10 ` Pavel Machek
  2008-03-26 16:29   ` Linus Torvalds
  2008-03-26 18:49   ` Alan Mayer
  1 sibling, 2 replies; 8+ messages in thread
From: Pavel Machek @ 2008-03-25 23:10 UTC (permalink / raw)
  To: Alan Mayer
  Cc: torvalds, mingo, linux-kernel list, Robin Holt, Jack Steiner,
	Russ Anderson

Hi!

> Subject: [PATCH] x86_64: resize NR_IRQS for large machines
> 
> From: Alan Mayer <ajm@sgi.com>
> 
> On machines with very large numbers of cpus, tables that are dimensioned
> by NR_IRQS get very large, especially the irq_desc table.  They are also
> very sparsely used.  When the cpu count is > MAX_IO_APICS, use MAX_IO_APICS
> to set NR_IRQS, otherwise use NR_CPUS.
> 
> Signed-off-by: Alan Mayer <ajm@sgi.com>
> 
> Reviewed-by: Christoph Lameter <clameter@sgi.com>

> ===================================================================
> --- v2.6.25-rc6.orig/include/asm-x86/irq_64.h	2008-03-19 16:52:52.000000000 -0500
> +++ v2.6.25-rc6/include/asm-x86/irq_64.h	2008-03-20 16:46:51.000000000 -0500
> @@ -10,6 +10,10 @@
>   *	<tomsoft@informatik.tu-chemnitz.de>
>   */
>  
> +#if !defined(MAX_IO_APICS)
> +#include <asm/apicdef.h>
> +#endif
> +

This is very ugly. Why not include it unconditionally -- with guard in
apicdef.h?
							Pavel
-- 
(english) http://www.livejournal.com/~pavelmachek
(cesky, pictures) http://atrey.karlin.mff.cuni.cz/~pavel/picture/horses/blog.html

^ permalink raw reply	[flat|nested] 8+ messages in thread

* Re: [PATCH] x86_64: resize NR_IRQS for large machines
  2008-03-25 23:10 ` Pavel Machek
@ 2008-03-26 16:29   ` Linus Torvalds
  2008-03-26 18:55     ` Alan Mayer
  2008-03-26 18:49   ` Alan Mayer
  1 sibling, 1 reply; 8+ messages in thread
From: Linus Torvalds @ 2008-03-26 16:29 UTC (permalink / raw)
  To: Pavel Machek
  Cc: Alan Mayer, mingo, linux-kernel list, Robin Holt, Jack Steiner,
	Russ Anderson



On Wed, 26 Mar 2008, Pavel Machek wrote:
> 
> This is very ugly. Why not include it unconditionally -- with guard in
> apicdef.h?

I do agree that it's ugly, but I think the ugliness is more serious than 
that.

What I think we should do is to make NR_IRQS no longer be a compile-time 
constant, but instead just do something like

	unsigned int NR_IRQS __read_mostly;

and then just set it early in the boot sequence depending on the real CPU 
numbers etc.

I realize that this will require some changes to a few arrays that are 
statically allocated and depend on NR_IRQ's (notably "irq_desc"), but 
don't you guys think that this would be a cleaner thing?

[ I suspect that irq_desc[] itself could quite reasonably be a rather much 
  smaller __read_mostly hash-table of dynamically allocated entries - the 
  thing would be only modified at boot, so it should cache beautifully 
  even across hundreds of CPU's ]

Whatever. I'm not opposed to this whole static thing, but I do wonder if 
it's worth doing that way. There *may* be performance reasons for doing it 
the way we're doing it, but quite frankly, I think the #define is mostly 
purely historical, from when it was just a fixed number (originally 16!) 
and it made sense to think of it as a small static array.

			Linus

^ permalink raw reply	[flat|nested] 8+ messages in thread

* Re: [PATCH] x86_64: resize NR_IRQS for large machines
  2008-03-25 23:10 ` Pavel Machek
  2008-03-26 16:29   ` Linus Torvalds
@ 2008-03-26 18:49   ` Alan Mayer
  1 sibling, 0 replies; 8+ messages in thread
From: Alan Mayer @ 2008-03-26 18:49 UTC (permalink / raw)
  To: Pavel Machek
  Cc: Alan Mayer, torvalds, mingo, linux-kernel list, Robin Holt,
	Jack Steiner, Russ Anderson


On Wed, 26 Mar 2008, Pavel Machek wrote:

> Hi!
> 
> > Subject: [PATCH] x86_64: resize NR_IRQS for large machines
> > 
> > From: Alan Mayer <ajm@sgi.com>
> > 
> > On machines with very large numbers of cpus, tables that are dimensioned
> > by NR_IRQS get very large, especially the irq_desc table.  They are also
> > very sparsely used.  When the cpu count is > MAX_IO_APICS, use MAX_IO_APICS
> > to set NR_IRQS, otherwise use NR_CPUS.
> > 
> > Signed-off-by: Alan Mayer <ajm@sgi.com>
> > 
> > Reviewed-by: Christoph Lameter <clameter@sgi.com>
> 
> > ===================================================================
> > --- v2.6.25-rc6.orig/include/asm-x86/irq_64.h	2008-03-19 16:52:52.000000000 -0500
> > +++ v2.6.25-rc6/include/asm-x86/irq_64.h	2008-03-20 16:46:51.000000000 -0500
> > @@ -10,6 +10,10 @@
> >   *	<tomsoft@informatik.tu-chemnitz.de>
> >   */
> >  
> > +#if !defined(MAX_IO_APICS)
> > +#include <asm/apicdef.h>
> > +#endif
> > +
> 
> This is very ugly. Why not include it unconditionally -- with guard in
> apicdef.h?
> 							Pavel

Okay, I can change that.

		--ajm

> -- 
> (english) http://www.livejournal.com/~pavelmachek
> (cesky, pictures) http://atrey.karlin.mff.cuni.cz/~pavel/picture/horses/blog.html
> 

--
Alan J. Mayer
SGI
ajm@sgi.com
WORK: 651-683-3131
HOME: 651-407-0134
--


^ permalink raw reply	[flat|nested] 8+ messages in thread

* Re: [PATCH] x86_64: resize NR_IRQS for large machines
  2008-03-26 16:29   ` Linus Torvalds
@ 2008-03-26 18:55     ` Alan Mayer
  2008-04-12 20:12       ` Eric W. Biederman
  0 siblings, 1 reply; 8+ messages in thread
From: Alan Mayer @ 2008-03-26 18:55 UTC (permalink / raw)
  To: Linus Torvalds
  Cc: Pavel Machek, Alan Mayer, mingo, linux-kernel list, Robin Holt,
	Jack Steiner, Russ Anderson

On Wed, 26 Mar 2008, Linus Torvalds wrote:

> 
> 
> On Wed, 26 Mar 2008, Pavel Machek wrote:
> > 
> > This is very ugly. Why not include it unconditionally -- with guard in
> > apicdef.h?
> 
> I do agree that it's ugly, but I think the ugliness is more serious than 
> that.
> 
> What I think we should do is to make NR_IRQS no longer be a compile-time 
> constant, but instead just do something like
> 
> 	unsigned int NR_IRQS __read_mostly;
> 
> and then just set it early in the boot sequence depending on the real CPU 
> numbers etc.
> 
> I realize that this will require some changes to a few arrays that are 
> statically allocated and depend on NR_IRQ's (notably "irq_desc"), but 
> don't you guys think that this would be a cleaner thing?
> 
> [ I suspect that irq_desc[] itself could quite reasonably be a rather much 
>   smaller __read_mostly hash-table of dynamically allocated entries - the 
>   thing would be only modified at boot, so it should cache beautifully 
>   even across hundreds of CPU's ]
> 
> Whatever. I'm not opposed to this whole static thing, but I do wonder if 
> it's worth doing that way. There *may* be performance reasons for doing it 
> the way we're doing it, but quite frankly, I think the #define is mostly 
> purely historical, from when it was just a fixed number (originally 16!) 
> and it made sense to think of it as a small static array.
> 
> 			Linus
> 

Well, I was looking at it from that point of view.  But, when I found myself
looking at code, particularly in drivers, that indexed into the irq_desc array
and started modifying the descriptor in place and then calling setup_irq(),
I realized that what was needed was a redesign of the whole mess from first
principals.  I still think that's what needs to be done, but by some one with
more experience and credibility than me.  Maybe in a year I'd be willing
to attempt it, but not today.

		--ajm


--
Alan J. Mayer
SGI
ajm@sgi.com
WORK: 651-683-3131
HOME: 651-407-0134
--


^ permalink raw reply	[flat|nested] 8+ messages in thread

* Re: [PATCH] x86_64: resize NR_IRQS for large machines
  2008-03-26 18:55     ` Alan Mayer
@ 2008-04-12 20:12       ` Eric W. Biederman
  0 siblings, 0 replies; 8+ messages in thread
From: Eric W. Biederman @ 2008-04-12 20:12 UTC (permalink / raw)
  To: Alan Mayer
  Cc: Linus Torvalds, Pavel Machek, mingo, linux-kernel list,
	Robin Holt, Jack Steiner, Russ Anderson

On Wed, 2008-03-26 at 13:55 -0500, Alan Mayer wrote:
> On Wed, 26 Mar 2008, Linus Torvalds wrote:
> 
> > 
> > 
> > On Wed, 26 Mar 2008, Pavel Machek wrote:
> > > 
> > > This is very ugly. Why not include it unconditionally -- with guard in
> > > apicdef.h?
> > 
> > I do agree that it's ugly, but I think the ugliness is more serious than 
> > that.
> > 
> > What I think we should do is to make NR_IRQS no longer be a compile-time 
> > constant, but instead just do something like
> > 
> > 	unsigned int NR_IRQS __read_mostly;
> > 
> > and then just set it early in the boot sequence depending on the real CPU 
> > numbers etc.
> > 
> > I realize that this will require some changes to a few arrays that are 
> > statically allocated and depend on NR_IRQ's (notably "irq_desc"), but 
> > don't you guys think that this would be a cleaner thing?
> > 
> > [ I suspect that irq_desc[] itself could quite reasonably be a rather much 
> >   smaller __read_mostly hash-table of dynamically allocated entries - the 
> >   thing would be only modified at boot, so it should cache beautifully 
> >   even across hundreds of CPU's ]
> > 
> > Whatever. I'm not opposed to this whole static thing, but I do wonder if 
> > it's worth doing that way. There *may* be performance reasons for doing it 
> > the way we're doing it, but quite frankly, I think the #define is mostly 
> > purely historical, from when it was just a fixed number (originally 16!) 
> > and it made sense to think of it as a small static array.
> > 
> > 			Linus
> > 
> 
> Well, I was looking at it from that point of view.  But, when I found myself
> looking at code, particularly in drivers, that indexed into the irq_desc array
> and started modifying the descriptor in place and then calling setup_irq(),
> I realized that what was needed was a redesign of the whole mess from first
> principals.  I still think that's what needs to be done, but by some one with
> more experience and credibility than me.  Maybe in a year I'd be willing
> to attempt it, but not today.

Well I will agree with Linus and go one farther and say that NR_IRQS
needs to die.  I started on that once and x86 is just about ready to
accomodate it.

There is a size issue on small machines.  And there is very bad NUMA
affinity on large machines.  So the current structure really is not
optimal for anyone.  All of this gets especially bad for distro kernels
that try and support everything.

Also MAX_IOAPICS is very much the wrong factor to be using on large
machines to size the irq array.  New machines are moving towards MSI and
cards can have an unreasonable number of MSI irqs.  In practice the top
end I have seen is 20-30 per card but it is still a lot.    So I think
you may get a nasty surprise when you plug in a bunch of high
performance cards with multiple queues into a big box.  The 32*NR_CPUS
as a rule of thumb comes from IBM boxes that are a little better
balanced when it comes to compute vs. I/O capablility.  

For actual irq reception we have our per cpu array of vectors that point
to the irq_desc so even if the global list of irqs was a linked list we
should not have performance problems.

I need to do some sorting out of sysfs first but I will certainly see if
I can look at this again.  And I will very much be willing to work with
someone else who wants to work on this and has more time then I do at
the moment.

The basic idea is moving the generic irq apis to a point where we can
refer to irqs in the generic code with a struct irq * instead of by
number.  We really only the need the number for talking about irqs to
user space.

Eric



^ permalink raw reply	[flat|nested] 8+ messages in thread

end of thread, other threads:[~2008-04-12 20:13 UTC | newest]

Thread overview: 8+ messages (download: mbox.gz / follow: Atom feed)
-- links below jump to the message on this page --
2008-03-24 14:31 [PATCH] x86_64: resize NR_IRQS for large machines Alan Mayer
2008-03-25 16:19 ` Ingo Molnar
2008-03-25 16:24   ` Alan Mayer
2008-03-25 23:10 ` Pavel Machek
2008-03-26 16:29   ` Linus Torvalds
2008-03-26 18:55     ` Alan Mayer
2008-04-12 20:12       ` Eric W. Biederman
2008-03-26 18:49   ` Alan Mayer

This is a public inbox, see mirroring instructions
for how to clone and mirror all data and code used for this inbox;
as well as URLs for NNTP newsgroup(s).