LinuxPPC-Dev Archive on lore.kernel.org
From: Peter Zijlstra <peterz@infradead.org>
To: Yunsheng Lin <linyunsheng@huawei.com>
Cc: dalias@libc.org, linux-sh@vger.kernel.org,
	catalin.marinas@arm.com, dave.hansen@linux.intel.com,
	heiko.carstens@de.ibm.com, jiaxun.yang@flygoat.com,
	Michal Hocko <mhocko@kernel.org>,
	mwb@linux.vnet.ibm.com, paulus@samba.org, hpa@zytor.com,
	sparclinux@vger.kernel.org, chenhc@lemote.com, will@kernel.org,
	cai@lca.pw, linux-s390@vger.kernel.org,
	ysato@users.sourceforge.jp, x86@kernel.org, rppt@linux.ibm.com,
	borntraeger@de.ibm.com, dledford@redhat.com, mingo@redhat.com,
	jeffrey.t.kirsher@intel.com, jhogan@kernel.org,
	mattst88@gmail.com, linux-mips@vger.kernel.org,
	len.brown@intel.com, gor@linux.ibm.com,
	anshuman.khandual@arm.com, gregkh@linuxfoundation.org,
	bp@alien8.de, luto@kernel.org, tglx@linutronix.de,
	naveen.n.rao@linux.vnet.ibm.com,
	linux-arm-kernel@lists.infradead.org, rth@twiddle.net,
	axboe@kernel.dk, linuxppc-dev@lists.ozlabs.org,
	linux-kernel@vger.kernel.org, ralf@linux-mips.org,
	tbogendoerfer@suse.de, paul.burton@mips.com,
	linux-alpha@vger.kernel.org, rafael@kernel.org,
	ink@jurassic.park.msu.ru, akpm@linux-foundation.org,
	robin.murphy@arm.com, davem@davemloft.net
Subject: Re: [PATCH v6] numa: make node_to_cpumask_map() NUMA_NO_NODE aware
Date: Tue, 24 Sep 2019 13:28:11 +0200
Message-ID: <20190924112811.GK2332@hirez.programming.kicks-ass.net>
In-Reply-To: <c816abbe-155b-504b-cef1-6413f7cdd20c@huawei.com>

On Tue, Sep 24, 2019 at 07:07:36PM +0800, Yunsheng Lin wrote:
> On 2019/9/24 17:25, Peter Zijlstra wrote:
> > On Tue, Sep 24, 2019 at 09:29:50AM +0800, Yunsheng Lin wrote:
> >> On 2019/9/24 4:34, Peter Zijlstra wrote:
> > 
> >>> I'm saying the ACPI standard is wrong. Explain to me how it is
> >>> physically possible to have a device without NUMA affinity in a NUMA
> >>> system?
> >>>
> >>>  1) The fundamental interconnect is not uniform.
> >>>  2) The device needs to actually be somewhere.
> >>>
> >>
> >> From what I can see, NUMA_NO_NODE may make sense in the below case:
> >>
> >> 1) Theoretically, there would be a device that can access all the memory
> >> uniformly and can be accessed by all cpus uniformly even in a NUMA system.
> >> Suppose we have two nodes, and the device just sits in the middle of the
> >> interconnect between the two nodes.
> >>
> >> Even if we define a third node solely for the device, we may need to look at
> >> the node distance to decide the device can be accessed uniformly.
> >>
> >> Or we can decide that the device can be accessed uniformly by setting
> >> its node to NUMA_NO_NODE.
> > 
> > This is indeed a theoretical case; it doesn't scale. The moment you're
> > adding multiple sockets or even board interconnects this all goes out
> > the window.
> > 
> > And in this case, forcing the device to either node is fine.
> 
> Not really.
> For packet sending and receiving, the buffer memory may be allocated
> dynamically. The node of tx buffer memory is mainly based on the cpu
> that is sending, the node of rx buffer memory is mainly based on
> the cpu the interrupt handler of the device is running on, and the
> device's interrupt affinity is mainly based on the node id of the device.
> 
> We can bind the processes that are using the device to both nodes
> in order to utilize memory on both nodes for packet sending.
> 
> But for packet receiving, node 1 may not be used because the node
> id of device is forced to node 0, which is the default way to bind
> the interrupt to the cpu of the same node.
> 
> If node_to_cpumask_map() returns all usable cpus when the device's node
> id is NUMA_NO_NODE, then interrupt can be binded to the cpus on both nodes.

s/binded/bound/

Sure; the data can be allocated wherever, but the control structures are
not dynamically allocated every time. They are persistent, and they will
be local to some node.

Anyway, are you saying this stupid corner case is actually relevant?
Because how does it scale out? What if you have 8 sockets, with each
socket having 2 nodes and 1 such magic device? Then returning all CPUs
is just plain wrong.
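
To make the scale-out objection concrete, here is a minimal userspace
sketch of the 8-socket, 2-nodes-per-socket case. It is a model only, not
kernel code: the 4 CPUs per node, the 64-bit mask type, and every helper
name (node_mask(), socket_mask(), model_cpumask_of_node(), mask_weight())
are assumptions made up for illustration. Only NUMA_NO_NODE and the
"return all CPUs" behaviour come from the discussion above.

```c
/* Userspace model only; all helper names below are hypothetical. */
#include <assert.h>
#include <stdint.h>

#define NUMA_NO_NODE      (-1)
#define NR_SOCKETS        8
#define NODES_PER_SOCKET  2
#define CPUS_PER_NODE     4   /* assumed; 16 nodes * 4 CPUs = 64 CPUs */
#define NR_NODES          (NR_SOCKETS * NODES_PER_SOCKET)

typedef uint64_t model_cpumask_t;   /* one bit per CPU, 64 CPUs total */

/* CPUs that belong to a single node. */
static model_cpumask_t node_mask(int node)
{
	model_cpumask_t per_node = ((model_cpumask_t)1 << CPUS_PER_NODE) - 1;

	assert(node >= 0 && node < NR_NODES);
	return per_node << (node * CPUS_PER_NODE);
}

/* CPUs of all nodes on one socket: the set that is actually "uniformly
 * close" to a device sitting between the nodes of that socket. */
static model_cpumask_t socket_mask(int socket)
{
	model_cpumask_t mask = 0;
	int i;

	assert(socket >= 0 && socket < NR_SOCKETS);
	for (i = 0; i < NODES_PER_SOCKET; i++)
		mask |= node_mask(socket * NODES_PER_SOCKET + i);
	return mask;
}

/* Models the semantic under discussion: NUMA_NO_NODE maps to all CPUs. */
static model_cpumask_t model_cpumask_of_node(int node)
{
	if (node == NUMA_NO_NODE)
		return ~(model_cpumask_t)0;
	return node_mask(node);
}

/* Number of CPUs set in a mask. */
static int mask_weight(model_cpumask_t mask)
{
	int n = 0;

	for (; mask; mask &= mask - 1)
		n++;
	return n;
}
```

In this model a device on socket 3 is equidistant from the 8 CPUs in
socket_mask(3), while model_cpumask_of_node(NUMA_NO_NODE) hands back all
64, so 56 of the CPUs offered for its interrupts are on remote sockets.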

> >> 2) For many virtual devices, such as tun or loopback netdevice, they
> >> are also accessed uniformly by all cpus.
> > 
> > Not true; the virtual device will sit in memory local to some node.
> > 
> > And as with physical devices, you probably want at least one (virtual)
> > queue per node.
> 
> There may be similar handling as above for virtual device too.

And it'd be similarly broken.
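
The "at least one (virtual) queue per node" idea quoted above can be
sketched in userspace as follows. This is an illustration under stated
assumptions, not how any existing driver does it: the two-node topology,
the 4 CPUs per node, and the names struct vqueue, struct vdev,
cpu_to_node_model(), vdev_init(), vdev_pick_queue() and vdev_enqueue()
are all hypothetical.

```c
/* Userspace model only; all types and helpers here are hypothetical. */
#include <assert.h>

#define NR_NODES       2
#define CPUS_PER_NODE  4   /* assumed for illustration */

/* One queue per node; its control structure lives on home_node. */
struct vqueue {
	int home_node;
	unsigned long packets;
};

struct vdev {
	struct vqueue queue[NR_NODES];
};

/* Assumed topology: CPUs are numbered contiguously within each node. */
static int cpu_to_node_model(int cpu)
{
	assert(cpu >= 0 && cpu < NR_NODES * CPUS_PER_NODE);
	return cpu / CPUS_PER_NODE;
}

static void vdev_init(struct vdev *dev)
{
	int node;

	for (node = 0; node < NR_NODES; node++) {
		dev->queue[node].home_node = node;
		dev->queue[node].packets = 0;
	}
}

/* A CPU always uses the queue that is local to its own node. */
static struct vqueue *vdev_pick_queue(struct vdev *dev, int cpu)
{
	return &dev->queue[cpu_to_node_model(cpu)];
}

static void vdev_enqueue(struct vdev *dev, int cpu)
{
	vdev_pick_queue(dev, cpu)->packets++;
}
```

With this layout no queue needs a "no node" answer: each queue has a
definite home node, and a sender on CPU 5 only ever touches the queue for
node 1.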
