From: Peter P Waskiewicz Jr <peter.p.waskiewicz.jr@intel.com>
To: Thomas Gleixner <tglx@linutronix.de>
Cc: Peter Zijlstra <peterz@infradead.org>,
	Yong Zhang <yong.zhang0@gmail.com>,
	"linux-kernel@vger.kernel.org" <linux-kernel@vger.kernel.org>,
	"arjan@linux.jf.intel.com" <arjan@linux.jf.intel.com>,
	"davem@davemloft.net" <davem@davemloft.net>,
	"netdev@vger.kernel.org" <netdev@vger.kernel.org>,
	Jesse Barnes <jbarnes@virtuousgeek.org>
Subject: Re: [PATCH] irq: Add node_affinity CPU masks for smarter irqbalance hints
Date: Tue, 24 Nov 2009 09:55:19 -0800
Message-ID: <1259085319.2631.46.camel@ppwaskie-mobl2>
In-Reply-To: <alpine.LFD.2.00.0911241027140.24119@localhost.localdomain>

On Tue, 2009-11-24 at 03:07 -0700, Thomas Gleixner wrote:
> On Tue, 24 Nov 2009, Peter P Waskiewicz Jr wrote:
> > On Tue, 2009-11-24 at 01:38 -0700, Peter Zijlstra wrote:
> > > On Mon, 2009-11-23 at 15:32 -0800, Waskiewicz Jr, Peter P wrote:
> > > 
> > > > Unfortunately, a driver can't.  The irq_set_affinity() function isn't 
> > > > exported.  I proposed a patch on netdev to export it, and then to tie down 
> > > > an interrupt using IRQF_NOBALANCING, so irqbalance won't touch it.  That 
> > > > was rejected, since the driver is enforcing policy of the interrupt 
> > > > balancing, not irqbalance.
> > > 
> > > Why would a patch touching the irq subsystem go to netdev?
> > 
> > The only change to the IRQ subsystem was:
> > 
> > EXPORT_SYMBOL(irq_set_affinity);
> 
> Which is still touching the generic irq subsystem and needs the ack of
> the relevant maintainer. If there is a need to expose such an
> interface to drivers then the maintainer wants to know exactly why and
> needs to be part of the discussion of alternative solutions. Otherwise
> you waste time on implementing stuff like the current patch which is
> definitely not going anywhere near the irq subsystem.
> 

Understood, and duly noted.

> > > If all you want is to expose policy to userspace then you don't need any
> > > of this, simply expose the NICs home node through a sysfs device thingy
> > > (I was under the impression its already there somewhere, but I can't
> > > ever find anything in /sys).
> > > 
> > > No need what so ever to poke at the IRQ subsystem.
> > 
> > The point is we need something common that the kernel side (whether a
> > driver or /proc can modify) that irqbalance can use.
> 
> /sys/class/net/ethX/device/numa_node 
> 
> perhaps ?

What I'm trying to do, though, is one-to-many NUMA node assignments.  See
below for a better overview of the issue we're trying to solve.
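
To illustrate why that file alone doesn't cover the one-to-many case,
here is a minimal userspace sketch of reading it.  read_numa_node() is
a hypothetical helper name, not an existing API; the only thing taken
from the thread is the sysfs path quoted above.  The file carries one
integer per device, so it can name at most one node.

```c
#include <assert.h>
#include <stdio.h>

/*
 * Read a single NUMA node id from a sysfs-style file such as
 * /sys/class/net/ethX/device/numa_node.  The file holds one signed
 * integer; -1 conventionally means "no node affinity known".
 *
 * read_numa_node() is a hypothetical helper, not an existing API.
 * Returns 0 on success and stores the id in *node, -1 on error.
 */
int read_numa_node(const char *path, int *node)
{
	FILE *f;
	int ok;

	assert(path != NULL && node != NULL);

	f = fopen(path, "r");
	if (!f)
		return -1;

	/* Exactly one integer is expected in the file. */
	ok = (fscanf(f, "%d", node) == 1);
	fclose(f);

	return ok ? 0 : -1;
}
```

A caller gets back a single node id (or -1), which is fine for a device
that lives on one node but has no way to say "queues 0-3 on node 0,
queues 4-7 on node 1".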

>  
> > > > Also, if you use the /proc interface to change smp_affinity on an 
> > > > interrupt without any of these changes, irqbalance will override it on its 
> > > > next poll interval.  This also is not desirable.
> > > 
> > > This all sounds backwards.. we've got a perfectly functional interface
> > > for affinity -- which people object to being used for some reason. So
> > > you add another interface on top, and that is ok?
> > > 
> > 
> > But it's not functional.  If I set the affinity in smp_affinity, then
> > irqbalance will override it 10 seconds later.
> 
> And to work around the brain wreckage of irqbalanced you want to
> fiddle in the irq code instead of teaching irqbalanced to handle node
> affinities ?
> 
> The only thing which is worth to investigate is whether the irq core
> code should honour the dev->numa_node setting and restrict the
> possible irq affinity settings to that node. If a device is tied to a
> node it makes a certain amount of sense to do that.
> 
> But such a change would not need a new interface in the irq core and
> definitely not a new cpumask_t member in the irq_desc structure to
> store a node affinity which can be expressed with a simple
> integer.
> 
> But this needs more thoughts and I want to know more about the
> background and the reasoning for such a change.
> 

I'll use the ixgbe driver as my example, since that is where my
immediate problems are.  This is our 10GbE device; it supports 128 Rx
queues and 128 Tx queues, and has a maximum of 64 MSI-X vectors.  In a
typical case, let's say an 8-core machine (Nehalem-EP with
hyperthreading off) brings one port online.  We'll allocate 8 Rx and 8
Tx queues.  When these allocations occur, we want to allocate the memory
for our descriptor rings, buffer structs, and DMA areas across the
various NUMA nodes.  This spreads the load not just across CPUs, but
also across the memory controllers.
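
As a rough illustration of that spreading (a userspace sketch, not
ixgbe code), the queue-to-node choice could look like the following.
queue_to_node() and assign_queue_nodes() are hypothetical names, and
round-robin is only an assumed policy; a block split (queues 0-3 on
node 0, 4-7 on node 1) would serve equally well.

```c
#include <assert.h>

/*
 * Pick the NUMA node for queue q by round-robin over nr_nodes nodes.
 * Hypothetical helper; round-robin is only one possible policy.
 */
unsigned int queue_to_node(unsigned int q, unsigned int nr_nodes)
{
	assert(nr_nodes > 0);
	return q % nr_nodes;
}

/* Record the chosen node for each of nr_queues queues in node_of[]. */
void assign_queue_nodes(unsigned int *node_of, unsigned int nr_queues,
			unsigned int nr_nodes)
{
	unsigned int q;

	assert(node_of != 0);
	for (q = 0; q < nr_queues; q++)
		node_of[q] = queue_to_node(q, nr_nodes);
}
```

With 8 queues and 2 nodes this puts 4 queues on each node.  In a
driver, each ring would then be allocated with a node-aware allocator
such as kzalloc_node() or vmalloc_node(), passing the chosen node.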

If we were to just run like that and have irqbalance move our vectors to
a single node, then we'd have half of our network resources creating
cross-node traffic, which is undesirable, since the OS may have to take
locks node to node to get the memory it's looking for.

The bottom line is we need some mechanism that allows a driver/user to
deterministically assign the underlying interrupt resources to the
correct NUMA node for each interrupt.  And in the example above, we may
have more than one NUMA node we need to balance into.
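
To make the per-interrupt assignment concrete, here is a small sketch
of the CPU mask each vector would want, one mask per node.
node_cpu_mask() is a hypothetical helper, and it ASSUMES CPUs are
numbered contiguously within a node (CPUs 0-3 on node 0, 4-7 on node
1), which real machines do not guarantee; the actual layout would have
to come from the topology, for example
/sys/devices/system/node/nodeN/cpumap.

```c
#include <assert.h>
#include <stdint.h>

/*
 * Build an smp_affinity-style bitmask covering every CPU on one node.
 *
 * ASSUMPTION: CPUs are numbered contiguously per node, cpus_per_node
 * CPUs each, and the whole machine fits in 64 bits.  Hypothetical
 * helper for illustration only.
 */
uint64_t node_cpu_mask(unsigned int node, unsigned int cpus_per_node)
{
	uint64_t bits;

	assert(cpus_per_node > 0 && cpus_per_node <= 64);
	assert(((uint64_t)node + 1) * cpus_per_node <= 64);

	/* Avoid the undefined 64-bit shift when one node owns all bits. */
	if (cpus_per_node == 64)
		bits = UINT64_MAX;
	else
		bits = (UINT64_C(1) << cpus_per_node) - 1;

	return bits << (node * cpus_per_node);
}
```

On the 8-core, 2-node example, vectors whose rings live on node 0 would
want mask 0x0f and those on node 1 would want 0xf0.  Those are the
values one would write to /proc/irq/<N>/smp_affinity, if irqbalance
would leave them in place.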

Please let me know if I've explained this well enough.  I appreciate the
time.

Cheers,
-PJ Waskiewicz

