Re: [PATCH v3 2/7] RISC-V: Add a syscall for HW probing

From: Evan Green <evan@rivosinc.com>
To: Arnd Bergmann <arnd@arndb.de>
Cc: "Palmer Dabbelt" <palmer@rivosinc.com>,
	"Heiko Stübner" <heiko@sntech.de>,
	"Conor Dooley" <conor@kernel.org>,
	slewis@rivosinc.com, "Vineet Gupta" <vineetg@rivosinc.com>,
	"Albert Ou" <aou@eecs.berkeley.edu>,
	"Andrew Bresticker" <abrestic@rivosinc.com>,
	"Andrew Jones" <ajones@ventanamicro.com>,
	"Anup Patel" <apatel@ventanamicro.com>,
	"Atish Patra" <atishp@rivosinc.com>,
	"Bagas Sanjaya" <bagasdotme@gmail.com>,
	"Celeste Liu" <coelacanthus@outlook.com>,
	"Conor.Dooley" <conor.dooley@microchip.com>,
	guoren <guoren@kernel.org>, "Jonathan Corbet" <corbet@lwn.net>,
	"Niklas Cassel" <niklas.cassel@wdc.com>,
	"Palmer Dabbelt" <palmer@dabbelt.com>,
	"Paul Walmsley" <paul.walmsley@sifive.com>,
	"Randy Dunlap" <rdunlap@infradead.org>,
	"Ruizhe Pan" <c141028@gmail.com>,
	"Sunil V L" <sunilvl@ventanamicro.com>,
	"Tobias Klauser" <tklauser@distanz.ch>,
	linux-doc@vger.kernel.org, linux-kernel@vger.kernel.org,
	linux-riscv@lists.infradead.org
Subject: Re: [PATCH v3 2/7] RISC-V: Add a syscall for HW probing
Date: Fri, 31 Mar 2023 10:51:55 -0700	[thread overview]
Message-ID: <CALs-HsuiyocyK+RW_5O3JniOE7ROHHDJ4K=+6NZRzGad7wJpMQ@mail.gmail.com> (raw)
In-Reply-To: <d91ffb1e-261b-4b2f-a78f-f2846600a3e7@app.fastmail.com>

Hi Arnd,

On Fri, Mar 31, 2023 at 6:21 AM Arnd Bergmann <arnd@arndb.de> wrote:
>
> On Thu, Mar 30, 2023, at 20:30, Evan Green wrote:
> > On Thu, Feb 23, 2023 at 2:06 AM Arnd Bergmann <arnd@arndb.de> wrote:
> >> > +    long sys_riscv_hwprobe(struct riscv_hwprobe *pairs, size_t
> >> > pair_count,
> >> > +                           size_t cpu_count, cpu_set_t *cpus,
> >> > +                           unsigned long flags);
> >>
> >> The cpu set argument worries me more: there should never be a
> >> need to optimize for broken hardware that has an asymmetric set
> >> of features. Just let the kernel figure out the minimum set
> >> of features that works across all CPUs and report that like we
> >> do with HWCAP. If there is a SoC that is so broken that it has
> >> important features on a subset of cores that some user might
> >> actually want to rely on, then have them go through the slow
> >> sysfs interface for probing the CPUs indidually, but don't make
> >> the broken case easier at the expense of normal users that
> >> run on working hardware.
> >
> > I'm not so sure. While I agree with you for major classes of features
> > (eg one CPU has floating point support but another does not), I expect
> > these bits to contain more subtle details as well, which might vary
> > across asymmetric implementations without breaking ABI compatibility
> > per-se. Maybe some vendor has implemented exotic video decoding
> > acceleration instructions that only work on the big core. Or maybe the
> > big cores support v3.1 of some extension (where certain things run
> > faster), but the little cores only have v3.0, where it's a little
> > slower. Certain apps would likely want to know these things so they
> > can allocate their work optimally across cores.
>
> Do you have a specific feature in mind where hardware would be
> intentionally designed this way? I still can't come up with a
> scenario where this would actually work in practice, as having
> asymmetric features is incompatible with so many other things
> we normally do.

Isn't this already a reality, with the ARM folks building systems
where only some cores can run arm32 code, and the rest are aarch64
only?
https://lwn.net/Articles/838339/

The way I see it, it's going to be close to impossible to _not_ design
hardware this way. As long as die area and energy efficiency are a
constraint, big.LITTLE and other asymmetric architectures seem
inevitable. Given this interface is sort of a software cpuid
instruction, pretending there won't be differences between cores feels
like an incomplete API.

>
> - In a virtual machine, the VCPU tents to get scheduled arbitrarily
>   to physical CPUs, so setting affinity in a guest won't actually
>   guarantee that the feature is still there.

Sure, but I wouldn't expect the guest to be able to observe physical
properties from a transient host CPU (I'd consider that a bug). I'd
expect either the VCPU is pinned to a physical CPU with no chance of
migration, in which case you might reasonably plumb down some physical
properties to the guest, or the guest sees a watered down generic CPU
containing the least common denominator of features.

>
> - Using a CPU feature from library code is practically impossible
>   if it requires special CPU affinity, as the application may
>   already be started on specific CPUs for another reason, and
>   having a library call sched_setaffinity will conflict with those.
>
> - Even in the simplest case of having a standalone application
>   without any shared libraries try to pick a sensible CPU to
>   run on is hard to do in a generic way, as it would need to
>   weigh availabilty of features on certain cores against the
>   number of cores with or without the feature and their current
>   and expected system load.

In ChromeOS at least, we struggled a lot with chrome getting scheduled
on the big vs little cores on some ARM platforms, as we'd get things
like dropped video frames if we stayed stuck on the wrong cores. I
think we managed to solve that with uclamp, but again that
acknowledges that there are differences between cores, which I think
justifies the cpuset parameter to a cpu feature interface like this.

>
> As long as there isn't a specific requirement, I think it's better
> to not actually encourage hardware vendors to implement designs
> like that, or at least not designing an interface to make getting
> this information a few microseconds faster that what already
> exists.

As this is ABI, there is a bit of "crystal ball gazing" I'm doing,
though I hope I've backed it up with enough to show it's not a
completely wild parameter.

-Evan