LinuxPPC-Dev Archive on lore.kernel.org
From: Peter Zijlstra <peterz@infradead.org>
To: Robin Murphy <robin.murphy@arm.com>
Cc: dalias@libc.org, linux-sh@vger.kernel.org,
	catalin.marinas@arm.com, dave.hansen@linux.intel.com,
	heiko.carstens@de.ibm.com, jiaxun.yang@flygoat.com,
	Michal Hocko <mhocko@kernel.org>,
	mwb@linux.vnet.ibm.com, paulus@samba.org, hpa@zytor.com,
	sparclinux@vger.kernel.org, chenhc@lemote.com, will@kernel.org,
	cai@lca.pw, linux-s390@vger.kernel.org,
	ysato@users.sourceforge.jp, x86@kernel.org,
	Yunsheng Lin <linyunsheng@huawei.com>,
	rppt@linux.ibm.com, borntraeger@de.ibm.com, dledford@redhat.com,
	mingo@redhat.com, jeffrey.t.kirsher@intel.com, jhogan@kernel.org,
	mattst88@gmail.com, linux-mips@vger.kernel.org,
	len.brown@intel.com, gor@linux.ibm.com,
	anshuman.khandual@arm.com, bp@alien8.de, luto@kernel.org,
	tglx@linutronix.de, naveen.n.rao@linux.vnet.ibm.com,
	linux-arm-kernel@lists.infradead.org, rth@twiddle.net,
	axboe@kernel.dk, gregkh@linuxfoundation.org,
	linux-kernel@vger.kernel.org, ralf@linux-mips.org,
	tbogendoerfer@suse.de, paul.burton@mips.com,
	linux-alpha@vger.kernel.org, rafael@kernel.org,
	ink@jurassic.park.msu.ru, akpm@linux-foundation.org,
	linuxppc-dev@lists.ozlabs.org, davem@davemloft.net
Subject: Re: [PATCH v6] numa: make node_to_cpumask_map() NUMA_NO_NODE aware
Date: Thu, 10 Oct 2019 10:56:16 +0200
Message-ID: <20191010085616.GQ2311@hirez.programming.kicks-ass.net>
In-Reply-To: <bec80499-86d9-bf1f-df23-9044a8099992@arm.com>

On Wed, Oct 09, 2019 at 01:25:14PM +0100, Robin Murphy wrote:
> On 2019-10-08 9:38 am, Yunsheng Lin wrote:
> > On 2019/9/25 18:41, Peter Zijlstra wrote:
> > > On Wed, Sep 25, 2019 at 05:14:20PM +0800, Yunsheng Lin wrote:
> > > > From the discussion above, it seems making the node_to_cpumask_map()
> > > > NUMA_NO_NODE aware is the most feasible way to move forward.
> > > 
> > > That's still wrong.
> > 
> > Hi, Peter
> > 
> > It seems this has become stuck in a dead loop.
> > 
> > From my understanding, NUMA_NO_NODE, which means no NUMA node preference,
> > is the state that describes the node of a virtual device, or of a physical
> > device that is equidistant from all CPUs.

So I _really_ don't believe in the equidistant physical device. Physics
just doesn't allow that. Or rather, you can, but then it'll be so slow
it doesn't matter.

The only possible option is equidistant to a _small_ number of nodes,
and if that is a reality, then we should look at that. So far however
it's purely been a hypothetical device.

> > We can be stricter if the device does have a nearer node, but we cannot
> > rule out that a device has no NUMA node preference or node affinity,
> > which also means the control or data buffer can be allocated at the node where
> > the process is running.
> > 
> > As you have proposed, making it -2 and having dev_to_node() warn if the device
> > does have a nearer node that was not set by the fw is a way to be stricter.

Because it is 100% guaranteed (we have proof) that BIOS is shit and
doesn't set node affinity for devices that really should have it.

So we're trading a hypothetical shared device vs not reporting actual
BIOS bugs. That's no contest.

Worse, we have virtual devices that have clear node affinity without it
set.

So we're growing shit, allowing bugs, and what do we get in return? Warm
fuzzies is not it.

> > Any better suggestion to move this forward?
> 
> FWIW (since this is in my inbox), it sounds like the fundamental issue is
> that NUMA_NO_NODE is conflated for at least two different purposes, so
> trying to sort that out would be a good first step. AFAICS we have genuine
> "don't care" cases like alloc_pages_node(), where if the producer says it
> doesn't matter then the consumer is free to make its own judgement on what
> to do, and fundamentally different "we expect this thing to have an affinity
> but it doesn't, so we can't say what's appropriate" cases which could really
> do with some separate indicator like "NUMA_INVALID_NODE".

It could possibly be a 3-state:

 - UNKNOWN; overridden by parent/bus/etc..
   ERROR when still UNKNOWN on register.

 - INVALID; ERROR on devm usage.
   for virtual devices / pure sysfs nodes

 - NO_NODE; may only be set on virtual devices (we can check against PCI
   bus etc..) when there really is no better option.

But I only want to see the NO_NODE crap at the end, after all other
possible avenues have been exhausted.
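
To make the three states above a bit more concrete, here is a rough
userspace C sketch, not kernel code and not a proposed patch. Everything
named sketch_* / SKETCH_* is hypothetical and exists only for
illustration. The only value taken from the kernel is NUMA_NO_NODE being
-1; the numbers picked for UNKNOWN and INVALID are assumptions, as are
the struct layout and the "walk up the parent chain" rule used for the
parent/bus override.

```c
#include <stddef.h>
#include <stdio.h>

/* Hypothetical sentinels; only -1 (NUMA_NO_NODE) matches the kernel. */
enum {
	SKETCH_NODE_UNKNOWN = -3, /* undecided; parent/bus may override */
	SKETCH_NODE_INVALID = -2, /* pure sysfs node; devm usage is an error */
	SKETCH_NO_NODE      = -1, /* virtual device, really no better option */
};

/* Hypothetical stand-in for a device; node >= 0 is a real node id. */
struct sketch_dev {
	const char *name;
	int node;
	int is_virtual;
	struct sketch_dev *parent;
};

/* Return the first decided node found walking from dev up its parents. */
static int sketch_resolve_node(const struct sketch_dev *dev)
{
	for (; dev; dev = dev->parent)
		if (dev->node != SKETCH_NODE_UNKNOWN)
			return dev->node;
	return SKETCH_NODE_UNKNOWN;
}

/* Returns 0 on success, -1 when registration has to be refused. */
static int sketch_register(struct sketch_dev *dev)
{
	int node = sketch_resolve_node(dev);

	if (node == SKETCH_NODE_UNKNOWN) {
		fprintf(stderr, "%s: node still UNKNOWN on register\n",
			dev->name);
		return -1;
	}
	if (node == SKETCH_NO_NODE && !dev->is_virtual) {
		fprintf(stderr, "%s: NO_NODE on a non-virtual device\n",
			dev->name);
		return -1;
	}
	dev->node = node;
	return 0;
}

/* devm-style allocations are refused for INVALID devices. */
static int sketch_devm_allowed(const struct sketch_dev *dev)
{
	return dev->node != SKETCH_NODE_INVALID;
}
```

The point of the sketch is only that the three states fail in different
places: UNKNOWN fails at register time unless a parent supplied a node,
INVALID fails on devm usage, and NO_NODE is accepted solely for devices
flagged as virtual.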

> The tricky part is then bestowed on the producers to decide whether they can
> downgrade "invalid" to "don't care". You can technically build 'a device'
> whose internal logic is distributed between nodes and thus appears to have
> equal affinity - interrupt controllers, for example, may have per-CPU or
> per-node interfaces that end up looking like that - so although it's
> unlikely it's not outright nonsensical.

I'm thinking we should/do create per cpu/node devices for such
distributed stuff. For instance, we create per-cpu clockevent devices
(where appropriate).

> Similarly a 'device' that's actually emulated behind a firmware call
> interface may well effectively have no real affinity.

Emulated devices are typically slow as heck and should be avoided if at
all possible. I don't see NUMA affinity being important for them.
