All of lore.kernel.org
 help / color / mirror / Atom feed
* vector status when vlen doesn't match
@ 2024-02-28 16:02 Conor Dooley
  2024-02-28 21:56 ` Palmer Dabbelt
  0 siblings, 1 reply; 6+ messages in thread
From: Conor Dooley @ 2024-02-28 16:02 UTC (permalink / raw)
  To: linux-riscv, andy.chiu


[-- Attachment #1.1: Type: text/plain, Size: 2289 bytes --]

Yo Andy,

I was wondering if you could clarify something that came up today on
IRC.

When we enable vector we check vlenb and if it doesn't match the kernel
is suppose to not support vector. We do this in a few places, the first
is in riscv_fill_hwcap():

	if (elf_hwcap & COMPAT_HWCAP_ISA_V) {
		riscv_v_setup_vsize();
		/*
		 * ISA string in device tree might have 'v' flag, but
		 * CONFIG_RISCV_ISA_V is disabled in kernel.
		 * Clear V flag in elf_hwcap if CONFIG_RISCV_ISA_V is disabled.
		 */
		if (!IS_ENABLED(CONFIG_RISCV_ISA_V))
			elf_hwcap &= ~COMPAT_HWCAP_ISA_V;
	}

Why does this riscv_v_setup_vsize() failing have no impact? Because this
is called only once on the boot CPU, right? I feel like it deserves a
comment as to why.

But the main thing I was wondering about was the other time it is called,
in smp_callin():
	if (has_vector()) {
		if (riscv_v_setup_vsize())
			elf_hwcap &= ~COMPAT_HWCAP_ISA_V;
	}

If this fails, we clear the HWCAP bit, but I am not immediately grasping
how this affects the rest of the kernel.
has_vector() just checks if the extension is supported by all cores on
the cpu using either an alternative or the extension support bitmap.
There are lots of sites in the kernel that use has_vector() as the
gating check (as far as I can tell), so if vlen does not match between
CPUs should we not be returning an error for has_vector()?

The alternative that has_vector() relies on is patched before the non-boot
CPUs are enabled, so we cannot modify the result once the non-boot CPUs
are in their callin functions and detect a vlen mismatch while at at the
same time using it in smp_calling(), so should this code be changed to
something along the lines of:

	if (riscv_has_extension_unlikely(v)) {
		if (riscv_v_setup_vsize()) {
			elf_hwcap &= ~COMPAT_HWCAP_ISA_V;
			riscv_v_vlen_mismatch = true;
		}
	}
and then has_vector() becomes something like
static __always_inline bool has_vector(void)
{
	return riscv_has_extension_unlikely(RISCV_ISA_EXT_v) && unlikely(riscv_v_vlen_mismatch);
}
Probably there's value to be gained in static branches etc here, but I
was just trying to explain what I was getting at. Maybe I am missing
something though? I do remember talking about this back when the vector
patches were still in review.

Thanks,
Conor.

[-- Attachment #1.2: signature.asc --]
[-- Type: application/pgp-signature, Size: 228 bytes --]

[-- Attachment #2: Type: text/plain, Size: 161 bytes --]

_______________________________________________
linux-riscv mailing list
linux-riscv@lists.infradead.org
http://lists.infradead.org/mailman/listinfo/linux-riscv

^ permalink raw reply	[flat|nested] 6+ messages in thread

* Re: vector status when vlen doesn't match
  2024-02-28 16:02 vector status when vlen doesn't match Conor Dooley
@ 2024-02-28 21:56 ` Palmer Dabbelt
  2024-02-28 23:32   ` Conor Dooley
  0 siblings, 1 reply; 6+ messages in thread
From: Palmer Dabbelt @ 2024-02-28 21:56 UTC (permalink / raw)
  To: Conor Dooley; +Cc: linux-riscv, andy.chiu

On Wed, 28 Feb 2024 08:02:47 PST (-0800), Conor Dooley wrote:
> Yo Andy,
>
> I was wondering if you could clarify something that came up today on
> IRC.
>
> When we enable vector we check vlenb and if it doesn't match the kernel
> is suppose to not support vector. We do this in a few places, the first
> is in riscv_fill_hwcap():
>
> 	if (elf_hwcap & COMPAT_HWCAP_ISA_V) {
> 		riscv_v_setup_vsize();
> 		/*
> 		 * ISA string in device tree might have 'v' flag, but
> 		 * CONFIG_RISCV_ISA_V is disabled in kernel.
> 		 * Clear V flag in elf_hwcap if CONFIG_RISCV_ISA_V is disabled.
> 		 */
> 		if (!IS_ENABLED(CONFIG_RISCV_ISA_V))
> 			elf_hwcap &= ~COMPAT_HWCAP_ISA_V;
> 	}
>
> Why does this riscv_v_setup_vsize() failing have no impact? Because this
> is called only once on the boot CPU, right? I feel like it deserves a
> comment as to why.

Unless I'm missing something, that first call can't fail: there's no 
previous riscv_v_size set, so there's nothing to trigger the failure.

Might warrant a comment, though...

> But the main thing I was wondering about was the other time it is called,
> in smp_callin():
> 	if (has_vector()) {
> 		if (riscv_v_setup_vsize())
> 			elf_hwcap &= ~COMPAT_HWCAP_ISA_V;
> 	}
>
> If this fails, we clear the HWCAP bit, but I am not immediately grasping
> how this affects the rest of the kernel.

I think it's actually a bug: we're not clearing the bit in riscv_isa, so 
this only ends up disabling the V bit in AT_HWCAP and thus doesn't do 
anything in the kernel.

> has_vector() just checks if the extension is supported by all cores on
> the cpu using either an alternative or the extension support bitmap.
> There are lots of sites in the kernel that use has_vector() as the
> gating check (as far as I can tell), so if vlen does not match between
> CPUs should we not be returning an error for has_vector()?
>
> The alternative that has_vector() relies on is patched before the non-boot
> CPUs are enabled, so we cannot modify the result once the non-boot CPUs
> are in their callin functions and detect a vlen mismatch while at at the
> same time using it in smp_calling(), so should this code be changed to
> something along the lines of:
>
> 	if (riscv_has_extension_unlikely(v)) {
> 		if (riscv_v_setup_vsize()) {
> 			elf_hwcap &= ~COMPAT_HWCAP_ISA_V;
> 			riscv_v_vlen_mismatch = true;
> 		}
> 	}
> and then has_vector() becomes something like
> static __always_inline bool has_vector(void)
> {
> 	return riscv_has_extension_unlikely(RISCV_ISA_EXT_v) && unlikely(riscv_v_vlen_mismatch);
> }
>
> Probably there's value to be gained in static branches etc here, but I
> was just trying to explain what I was getting at. Maybe I am missing
> something though? I do remember talking about this back when the vector
> patches were still in review.

I actually don't think that alone fixes it: even if we make has_vector() 
respect the changes from the new CPU (whether we re-apply the 
alternatives or make a more dynamic check), we'd need to make sure the 
code that has already looked at has_vector() doesn't get scheduled on 
one of the shorter-VLENB CPUs.

So maybe the right answer here is to flip things around, and just refuse 
to enable CPUs that come in with a VLENB different than the boot CPU?  
That way we could avoid being broken by systems that behave this way, if 
someone ends up building one then they can figure out how to make it 
work better.

> Thanks,
> Conor.
> _______________________________________________
> linux-riscv mailing list
> linux-riscv@lists.infradead.org
> http://lists.infradead.org/mailman/listinfo/linux-riscv

_______________________________________________
linux-riscv mailing list
linux-riscv@lists.infradead.org
http://lists.infradead.org/mailman/listinfo/linux-riscv

^ permalink raw reply	[flat|nested] 6+ messages in thread

* Re: vector status when vlen doesn't match
  2024-02-28 21:56 ` Palmer Dabbelt
@ 2024-02-28 23:32   ` Conor Dooley
  2024-02-29 13:32     ` Andy Chiu
  2024-02-29 16:59     ` Palmer Dabbelt
  0 siblings, 2 replies; 6+ messages in thread
From: Conor Dooley @ 2024-02-28 23:32 UTC (permalink / raw)
  To: Palmer Dabbelt; +Cc: linux-riscv, andy.chiu


[-- Attachment #1.1: Type: text/plain, Size: 5108 bytes --]

On Wed, Feb 28, 2024 at 01:56:51PM -0800, Palmer Dabbelt wrote:
> On Wed, 28 Feb 2024 08:02:47 PST (-0800), Conor Dooley wrote:
> > Yo Andy,
> > 
> > I was wondering if you could clarify something that came up today on
> > IRC.
> > 
> > When we enable vector we check vlenb and if it doesn't match the kernel
> > is suppose to not support vector. We do this in a few places, the first
> > is in riscv_fill_hwcap():
> > 
> > 	if (elf_hwcap & COMPAT_HWCAP_ISA_V) {
> > 		riscv_v_setup_vsize();
> > 		/*
> > 		 * ISA string in device tree might have 'v' flag, but
> > 		 * CONFIG_RISCV_ISA_V is disabled in kernel.
> > 		 * Clear V flag in elf_hwcap if CONFIG_RISCV_ISA_V is disabled.
> > 		 */
> > 		if (!IS_ENABLED(CONFIG_RISCV_ISA_V))
> > 			elf_hwcap &= ~COMPAT_HWCAP_ISA_V;
> > 	}
> > 
> > Why does this riscv_v_setup_vsize() failing have no impact? Because this
> > is called only once on the boot CPU, right? I feel like it deserves a
> > comment as to why.
> 
> Unless I'm missing something, that first call can't fail: there's no
> previous riscv_v_size set, so there's nothing to trigger the failure.
> 
> Might warrant a comment, though...

Yeah. I did write (or maybe moreso remodel) that function so I know it
only gets called on the boot CPU, I was just pointing out what is not
immediately obvious about that particular callsite.

> > But the main thing I was wondering about was the other time it is called,
> > in smp_callin():
> > 	if (has_vector()) {
> > 		if (riscv_v_setup_vsize())
> > 			elf_hwcap &= ~COMPAT_HWCAP_ISA_V;
> > 	}
> > 
> > If this fails, we clear the HWCAP bit, but I am not immediately grasping
> > how this affects the rest of the kernel.
> 
> I think it's actually a bug: we're not clearing the bit in riscv_isa, so
> this only ends up disabling the V bit in AT_HWCAP and thus doesn't do
> anything in the kernel.

Yeah, that's what I was getting at - I could not see how disabling it in
hwcap would make a difference either in the kernel or in hwprobe. I was
not sure if I was missing something in terms of the
userspace-using-vector side of things, but I was pretty sure that
hwprobe and in-kernel vector had this bug.

> > has_vector() just checks if the extension is supported by all cores on
> > the cpu using either an alternative or the extension support bitmap.
> > There are lots of sites in the kernel that use has_vector() as the
> > gating check (as far as I can tell), so if vlen does not match between
> > CPUs should we not be returning an error for has_vector()?
> > 
> > The alternative that has_vector() relies on is patched before the non-boot
> > CPUs are enabled, so we cannot modify the result once the non-boot CPUs
> > are in their callin functions and detect a vlen mismatch while at at the
> > same time using it in smp_calling(), so should this code be changed to
> > something along the lines of:
> > 
> > 	if (riscv_has_extension_unlikely(v)) {
> > 		if (riscv_v_setup_vsize()) {
> > 			elf_hwcap &= ~COMPAT_HWCAP_ISA_V;
> > 			riscv_v_vlen_mismatch = true;
> > 		}
> > 	}
> > and then has_vector() becomes something like
> > static __always_inline bool has_vector(void)
> > {
> > 	return riscv_has_extension_unlikely(RISCV_ISA_EXT_v) && unlikely(riscv_v_vlen_mismatch);
> > }
> > 
> > Probably there's value to be gained in static branches etc here, but I
> > was just trying to explain what I was getting at. Maybe I am missing
> > something though? I do remember talking about this back when the vector
> > patches were still in review.
> 
> I actually don't think that alone fixes it: even if we make has_vector()
> respect the changes from the new CPU (whether we re-apply the alternatives
> or make a more dynamic check), we'd need to make sure the code that has
> already looked at has_vector() doesn't get scheduled on one of the
> shorter-VLENB CPUs.

Is it possible to have that happen given this is in smp_callin()?
I guess the concern is some sort of hotplug situations, given that a CPU
could be hotplugged into a fully running system with a vlen mismatch.

> So maybe the right answer here is to flip things around, and just refuse to
> enable CPUs that come in with a VLENB different than the boot CPU?  That way
> we could avoid being broken by systems that behave this way,

smp_callin() doesn't return an error to report a failure, but it does
have a completion. I suppose we would just return early from
smp_callin() so that the completion never gets completed, in turn
causing __cpu_up() to fail?

> if someone ends
> up building one then they can figure out how to make it work better.

Apparently the next gen sophgo product does. Not sure if people will wnt
to run across all the cores though, apparently they different by a lot
more than vlen.
A DT property would allow us to get the vlen info before the CPUs have
themselves been brought up. Apparently there are Zvl extensions for
this, but I'd rather just have a property like we have for the cbo
sizes if it comes to DT properties.

Cheers,
Conor.

[-- Attachment #1.2: signature.asc --]
[-- Type: application/pgp-signature, Size: 228 bytes --]

[-- Attachment #2: Type: text/plain, Size: 161 bytes --]

_______________________________________________
linux-riscv mailing list
linux-riscv@lists.infradead.org
http://lists.infradead.org/mailman/listinfo/linux-riscv

^ permalink raw reply	[flat|nested] 6+ messages in thread

* Re: vector status when vlen doesn't match
  2024-02-28 23:32   ` Conor Dooley
@ 2024-02-29 13:32     ` Andy Chiu
  2024-02-29 17:00       ` Conor Dooley
  2024-02-29 16:59     ` Palmer Dabbelt
  1 sibling, 1 reply; 6+ messages in thread
From: Andy Chiu @ 2024-02-29 13:32 UTC (permalink / raw)
  To: Conor Dooley; +Cc: Palmer Dabbelt, linux-riscv

On Thu, Feb 29, 2024 at 7:32 AM Conor Dooley <conor@kernel.org> wrote:
>
> On Wed, Feb 28, 2024 at 01:56:51PM -0800, Palmer Dabbelt wrote:
> > On Wed, 28 Feb 2024 08:02:47 PST (-0800), Conor Dooley wrote:
> > > Yo Andy,
> > >
> > > I was wondering if you could clarify something that came up today on
> > > IRC.
> > >
> > > When we enable vector we check vlenb and if it doesn't match the kernel
> > > is suppose to not support vector. We do this in a few places, the first
> > > is in riscv_fill_hwcap():
> > >
> > >     if (elf_hwcap & COMPAT_HWCAP_ISA_V) {
> > >             riscv_v_setup_vsize();
> > >             /*
> > >              * ISA string in device tree might have 'v' flag, but
> > >              * CONFIG_RISCV_ISA_V is disabled in kernel.
> > >              * Clear V flag in elf_hwcap if CONFIG_RISCV_ISA_V is disabled.
> > >              */
> > >             if (!IS_ENABLED(CONFIG_RISCV_ISA_V))
> > >                     elf_hwcap &= ~COMPAT_HWCAP_ISA_V;
> > >     }
> > >
> > > Why does this riscv_v_setup_vsize() failing have no impact? Because this
> > > is called only once on the boot CPU, right? I feel like it deserves a
> > > comment as to why.
> >
> > Unless I'm missing something, that first call can't fail: there's no
> > previous riscv_v_size set, so there's nothing to trigger the failure.
> >
> > Might warrant a comment, though...
>
> Yeah. I did write (or maybe moreso remodel) that function so I know it
> only gets called on the boot CPU, I was just pointing out what is not
> immediately obvious about that particular callsite.

Yes, this call can't fall. Ok, I'll update the comment.

>
> > > But the main thing I was wondering about was the other time it is called,
> > > in smp_callin():
> > >     if (has_vector()) {
> > >             if (riscv_v_setup_vsize())
> > >                     elf_hwcap &= ~COMPAT_HWCAP_ISA_V;
> > >     }
> > >
> > > If this fails, we clear the HWCAP bit, but I am not immediately grasping
> > > how this affects the rest of the kernel.
> >
> > I think it's actually a bug: we're not clearing the bit in riscv_isa, so
> > this only ends up disabling the V bit in AT_HWCAP and thus doesn't do
> > anything in the kernel.
>
> Yeah, that's what I was getting at - I could not see how disabling it in
> hwcap would make a difference either in the kernel or in hwprobe. I was
> not sure if I was missing something in terms of the
> userspace-using-vector side of things, but I was pretty sure that
> hwprobe and in-kernel vector had this bug.

hwprobe still reports the hardware has Vector. Shouldn't this be ok
because hwprobe reflects what is supported by the hardware? I think
we'd need another bit in hwprobe to reflect the value in hwcap though.

Userspace context switch code is gated by VS == 0. I am not sure if we
need something more obvious.

The in-kernel vector part was not well-considered. It sounds ok for
the non-preemptible vector. Userspace won't be able to execute any
user vector on a non-symmetric vector platform. So all in-kernel
vector code doesn't have to save/restore userspace vector registers.
The task executing non-preemptible in-kernel vector code has no chance
to be rescheduled on other cores during vector execution. So it should
be fine as long as multiple parts of kernel_vector_{begin,end}() do
not assume using the same vlen. Most vector code doesn't assume
running on a particular vlen (vlen agnostic), so ideally it should be
safe.

Parts of the kernel blindly assume riscv_v_size != 0. Those should be
changed. However, race conditions may happen if userspace is already
up and running at this stage (cpu hotplug?). We must not reschedule a
userspace process which is actively executing vector to a core that
runs in a different vlen.

It is not possible to execute the preemptible vector in a
non-symmetric vector platform. This should be gated by failing the
allocation of kernel_vstate.datap.

>
> > > has_vector() just checks if the extension is supported by all cores on
> > > the cpu using either an alternative or the extension support bitmap.
> > > There are lots of sites in the kernel that use has_vector() as the
> > > gating check (as far as I can tell), so if vlen does not match between
> > > CPUs should we not be returning an error for has_vector()?

No, has_vector() still returns true because it is backed by
alternative and determined during boot.

> > >
> > > The alternative that has_vector() relies on is patched before the non-boot
> > > CPUs are enabled, so we cannot modify the result once the non-boot CPUs
> > > are in their callin functions and detect a vlen mismatch while at at the
> > > same time using it in smp_calling(), so should this code be changed to
> > > something along the lines of:
> > >
> > >     if (riscv_has_extension_unlikely(v)) {
> > >             if (riscv_v_setup_vsize()) {
> > >                     elf_hwcap &= ~COMPAT_HWCAP_ISA_V;
> > >                     riscv_v_vlen_mismatch = true;
> > >             }
> > >     }
> > > and then has_vector() becomes something like
> > > static __always_inline bool has_vector(void)
> > > {
> > >     return riscv_has_extension_unlikely(RISCV_ISA_EXT_v) && unlikely(riscv_v_vlen_mismatch);
> > > }
> > >
> > > Probably there's value to be gained in static branches etc here, but I
> > > was just trying to explain what I was getting at. Maybe I am missing
> > > something though? I do remember talking about this back when the vector
> > > patches were still in review.
> >
> > I actually don't think that alone fixes it: even if we make has_vector()
> > respect the changes from the new CPU (whether we re-apply the alternatives
> > or make a more dynamic check), we'd need to make sure the code that has
> > already looked at has_vector() doesn't get scheduled on one of the
> > shorter-VLENB CPUs.

I agree that flipping off the alternative for has_vector() could be
dangerous at this point.

>
> Is it possible to have that happen given this is in smp_callin()?
> I guess the concern is some sort of hotplug situations, given that a CPU
> could be hotplugged into a fully running system with a vlen mismatch.
>
> > So maybe the right answer here is to flip things around, and just refuse to
> > enable CPUs that come in with a VLENB different than the boot CPU?  That way
> > we could avoid being broken by systems that behave this way,
>
> smp_callin() doesn't return an error to report a failure, but it does
> have a completion. I suppose we would just return early from
> smp_callin() so that the completion never gets completed, in turn
> causing __cpu_up() to fail?

I agree disabling the entire vector or fail bringing up the cpu is the
safe way though.

>
> > if someone ends
> > up building one then they can figure out how to make it work better.
>
> Apparently the next gen sophgo product does. Not sure if people will wnt
> to run across all the cores though, apparently they different by a lot
> more than vlen.

The vector patchset assumes running on an SMP system. Supporting
non-symmetric vlen or non-symmetric isa may require some efforts.

> A DT property would allow us to get the vlen info before the CPUs have
> themselves been brought up. Apparently there are Zvl extensions for
> this, but I'd rather just have a property like we have for the cbo
> sizes if it comes to DT properties.

Yes, perhaps having a DT property is the first step to go. Because we
can prevent race set/get-ing riscv_v_vsize in such case.

>
> Cheers,
> Conor.

Thanks,
Andy

_______________________________________________
linux-riscv mailing list
linux-riscv@lists.infradead.org
http://lists.infradead.org/mailman/listinfo/linux-riscv

^ permalink raw reply	[flat|nested] 6+ messages in thread

* Re: vector status when vlen doesn't match
  2024-02-28 23:32   ` Conor Dooley
  2024-02-29 13:32     ` Andy Chiu
@ 2024-02-29 16:59     ` Palmer Dabbelt
  1 sibling, 0 replies; 6+ messages in thread
From: Palmer Dabbelt @ 2024-02-29 16:59 UTC (permalink / raw)
  To: Conor Dooley; +Cc: linux-riscv, andy.chiu

On Wed, 28 Feb 2024 15:32:47 PST (-0800), Conor Dooley wrote:
> On Wed, Feb 28, 2024 at 01:56:51PM -0800, Palmer Dabbelt wrote:
>> On Wed, 28 Feb 2024 08:02:47 PST (-0800), Conor Dooley wrote:
>> > Yo Andy,
>> > 
>> > I was wondering if you could clarify something that came up today on
>> > IRC.
>> > 
>> > When we enable vector we check vlenb and if it doesn't match the kernel
>> > is suppose to not support vector. We do this in a few places, the first
>> > is in riscv_fill_hwcap():
>> > 
>> > 	if (elf_hwcap & COMPAT_HWCAP_ISA_V) {
>> > 		riscv_v_setup_vsize();
>> > 		/*
>> > 		 * ISA string in device tree might have 'v' flag, but
>> > 		 * CONFIG_RISCV_ISA_V is disabled in kernel.
>> > 		 * Clear V flag in elf_hwcap if CONFIG_RISCV_ISA_V is disabled.
>> > 		 */
>> > 		if (!IS_ENABLED(CONFIG_RISCV_ISA_V))
>> > 			elf_hwcap &= ~COMPAT_HWCAP_ISA_V;
>> > 	}
>> > 
>> > Why does this riscv_v_setup_vsize() failing have no impact? Because this
>> > is called only once on the boot CPU, right? I feel like it deserves a
>> > comment as to why.
>> 
>> Unless I'm missing something, that first call can't fail: there's no
>> previous riscv_v_size set, so there's nothing to trigger the failure.
>> 
>> Might warrant a comment, though...
>
> Yeah. I did write (or maybe moreso remodel) that function so I know it
> only gets called on the boot CPU, I was just pointing out what is not
> immediately obvious about that particular callsite.

OK, so sounds like a comment then ;)

>> > But the main thing I was wondering about was the other time it is called,
>> > in smp_callin():
>> > 	if (has_vector()) {
>> > 		if (riscv_v_setup_vsize())
>> > 			elf_hwcap &= ~COMPAT_HWCAP_ISA_V;
>> > 	}
>> > 
>> > If this fails, we clear the HWCAP bit, but I am not immediately grasping
>> > how this affects the rest of the kernel.
>> 
>> I think it's actually a bug: we're not clearing the bit in riscv_isa, so
>> this only ends up disabling the V bit in AT_HWCAP and thus doesn't do
>> anything in the kernel.
>
> Yeah, that's what I was getting at - I could not see how disabling it in
> hwcap would make a difference either in the kernel or in hwprobe. I was
> not sure if I was missing something in terms of the
> userspace-using-vector side of things, but I was pretty sure that
> hwprobe and in-kernel vector had this bug.

IIRC we used to have this stuff hooked into elf_hwcap back when it was 
the only user-visible interface.  The V support dates back to around 
then, so we probably just missed the in-flight conflict.

>> > has_vector() just checks if the extension is supported by all cores on
>> > the cpu using either an alternative or the extension support bitmap.
>> > There are lots of sites in the kernel that use has_vector() as the
>> > gating check (as far as I can tell), so if vlen does not match between
>> > CPUs should we not be returning an error for has_vector()?
>> > 
>> > The alternative that has_vector() relies on is patched before the non-boot
>> > CPUs are enabled, so we cannot modify the result once the non-boot CPUs
>> > are in their callin functions and detect a vlen mismatch while at at the
>> > same time using it in smp_calling(), so should this code be changed to
>> > something along the lines of:
>> > 
>> > 	if (riscv_has_extension_unlikely(v)) {
>> > 		if (riscv_v_setup_vsize()) {
>> > 			elf_hwcap &= ~COMPAT_HWCAP_ISA_V;
>> > 			riscv_v_vlen_mismatch = true;
>> > 		}
>> > 	}
>> > and then has_vector() becomes something like
>> > static __always_inline bool has_vector(void)
>> > {
>> > 	return riscv_has_extension_unlikely(RISCV_ISA_EXT_v) && unlikely(riscv_v_vlen_mismatch);
>> > }
>> > 
>> > Probably there's value to be gained in static branches etc here, but I
>> > was just trying to explain what I was getting at. Maybe I am missing
>> > something though? I do remember talking about this back when the vector
>> > patches were still in review.
>> 
>> I actually don't think that alone fixes it: even if we make has_vector()
>> respect the changes from the new CPU (whether we re-apply the alternatives
>> or make a more dynamic check), we'd need to make sure the code that has
>> already looked at has_vector() doesn't get scheduled on one of the
>> shorter-VLENB CPUs.
>
> Is it possible to have that happen given this is in smp_callin()?
> I guess the concern is some sort of hotplug situations, given that a CPU
> could be hotplugged into a fully running system with a vlen mismatch.

Ya, something along those lines.  We essentially need to defer this sort 
of alternative processing until after all CPUs have booted and run their 
V detection, but IIUC that could be arbitrarily late if we're talking 
the hotplug world.

>> So maybe the right answer here is to flip things around, and just refuse to
>> enable CPUs that come in with a VLENB different than the boot CPU?  That way
>> we could avoid being broken by systems that behave this way,
>
> smp_callin() doesn't return an error to report a failure, but it does
> have a completion. I suppose we would just return early from
> smp_callin() so that the completion never gets completed, in turn
> causing __cpu_up() to fail?

Ya, I think something along those lines would do it.  I haven't really 
thought it through so I bet something hairy is lurking around in there.

>> if someone ends
>> up building one then they can figure out how to make it work better.
>
> Apparently the next gen sophgo product does. Not sure if people will wnt
> to run across all the cores though, apparently they different by a lot
> more than vlen.
> A DT property would allow us to get the vlen info before the CPUs have
> themselves been brought up. Apparently there are Zvl extensions for
> this, but I'd rather just have a property like we have for the cbo
> sizes if it comes to DT properties.

That might be the right answer here.  In general the "just run some code 
on the CPU to figure out what it implements" style of detection comes 
with a lot of these long-tail issues, DT just fixes that because we can 
look at every core from early in boot.

> Cheers,
> Conor.

_______________________________________________
linux-riscv mailing list
linux-riscv@lists.infradead.org
http://lists.infradead.org/mailman/listinfo/linux-riscv

^ permalink raw reply	[flat|nested] 6+ messages in thread

* Re: vector status when vlen doesn't match
  2024-02-29 13:32     ` Andy Chiu
@ 2024-02-29 17:00       ` Conor Dooley
  0 siblings, 0 replies; 6+ messages in thread
From: Conor Dooley @ 2024-02-29 17:00 UTC (permalink / raw)
  To: Andy Chiu; +Cc: Palmer Dabbelt, linux-riscv


[-- Attachment #1.1: Type: text/plain, Size: 9256 bytes --]

On Thu, Feb 29, 2024 at 09:32:56PM +0800, Andy Chiu wrote:
> On Thu, Feb 29, 2024 at 7:32 AM Conor Dooley <conor@kernel.org> wrote:
> >
> > On Wed, Feb 28, 2024 at 01:56:51PM -0800, Palmer Dabbelt wrote:
> > > On Wed, 28 Feb 2024 08:02:47 PST (-0800), Conor Dooley wrote:
> > > > Yo Andy,
> > > >
> > > > I was wondering if you could clarify something that came up today on
> > > > IRC.
> > > >
> > > > When we enable vector we check vlenb and if it doesn't match the kernel
> > > > is suppose to not support vector. We do this in a few places, the first
> > > > is in riscv_fill_hwcap():
> > > >
> > > >     if (elf_hwcap & COMPAT_HWCAP_ISA_V) {
> > > >             riscv_v_setup_vsize();
> > > >             /*
> > > >              * ISA string in device tree might have 'v' flag, but
> > > >              * CONFIG_RISCV_ISA_V is disabled in kernel.
> > > >              * Clear V flag in elf_hwcap if CONFIG_RISCV_ISA_V is disabled.
> > > >              */
> > > >             if (!IS_ENABLED(CONFIG_RISCV_ISA_V))
> > > >                     elf_hwcap &= ~COMPAT_HWCAP_ISA_V;
> > > >     }
> > > >
> > > > Why does this riscv_v_setup_vsize() failing have no impact? Because this
> > > > is called only once on the boot CPU, right? I feel like it deserves a
> > > > comment as to why.
> > >
> > > Unless I'm missing something, that first call can't fail: there's no
> > > previous riscv_v_size set, so there's nothing to trigger the failure.
> > >
> > > Might warrant a comment, though...
> >
> > Yeah. I did write (or maybe moreso remodel) that function so I know it
> > only gets called on the boot CPU, I was just pointing out what is not
> > immediately obvious about that particular callsite.
> 
> Yes, this call can't fall. Ok, I'll update the comment.

Thanks.

> > > > But the main thing I was wondering about was the other time it is called,
> > > > in smp_callin():
> > > >     if (has_vector()) {
> > > >             if (riscv_v_setup_vsize())
> > > >                     elf_hwcap &= ~COMPAT_HWCAP_ISA_V;
> > > >     }
> > > >
> > > > If this fails, we clear the HWCAP bit, but I am not immediately grasping
> > > > how this affects the rest of the kernel.
> > >
> > > I think it's actually a bug: we're not clearing the bit in riscv_isa, so
> > > this only ends up disabling the V bit in AT_HWCAP and thus doesn't do
> > > anything in the kernel.
> >
> > Yeah, that's what I was getting at - I could not see how disabling it in
> > hwcap would make a difference either in the kernel or in hwprobe. I was
> > not sure if I was missing something in terms of the
> > userspace-using-vector side of things, but I was pretty sure that
> > hwprobe and in-kernel vector had this bug.
> 
> hwprobe still reports the hardware has Vector. Shouldn't this be ok
> because hwprobe reflects what is supported by the hardware? I think
> we'd need another bit in hwprobe to reflect the value in hwcap though.

No. The hwprobe test was never whether or not the hardware has vector,
given has_vector() does not actually represent that. Rather,
has_vector() should only return true when the running kernel on hardware
that supports vector is capable of supporting vector.

> Userspace context switch code is gated by VS == 0. I am not sure if we
> need something more obvious.

Could you point me out where exactly this is happening? Also, what is
"VS" in this context?

> The in-kernel vector part was not well-considered. It sounds ok for
> the non-preemptible vector. Userspace won't be able to execute any
> user vector on a non-symmetric vector platform.

How does the kernel prevent this?

> So all in-kernel
> vector code doesn't have to save/restore userspace vector registers.
> The task executing non-preemptible in-kernel vector code has no chance
> to be rescheduled on other cores during vector execution. So it should
> be fine as long as multiple parts of kernel_vector_{begin,end}() do
> not assume using the same vlen. Most vector code doesn't assume
> running on a particular vlen (vlen agnostic), so ideally it should be
> safe.
> 
> Parts of the kernel blindly assume riscv_v_size != 0. Those should be
> changed. However, race conditions may happen if userspace is already
> up and running at this stage (cpu hotplug?). We must not reschedule a
> userspace process which is actively executing vector to a core that
> runs in a different vlen.

> It is not possible to execute the preemptible vector in a
> non-symmetric vector platform. This should be gated by failing the
> allocation of kernel_vstate.datap.

It sounds to me like the kernel does not currently do this?
However, I don't think we should even get as far as a failed allocation,
given we _know_ far before any in-kernel vector users appear whether or
not we have identical vlen across all CPUs.

> > > > has_vector() just checks if the extension is supported by all cores on
> > > > the cpu using either an alternative or the extension support bitmap.
> > > > There are lots of sites in the kernel that use has_vector() as the
> > > > gating check (as far as I can tell), so if vlen does not match between
> > > > CPUs should we not be returning an error for has_vector()?
> 
> No, has_vector() still returns true because it is backed by
> alternative and determined during boot.

In case you misunderstood, I know how has_vector() works - I was
suggesting a change to its behaviour.

> > > > The alternative that has_vector() relies on is patched before the non-boot
> > > > CPUs are enabled, so we cannot modify the result once the non-boot CPUs
> > > > are in their callin functions and detect a vlen mismatch while at at the
> > > > same time using it in smp_calling(), so should this code be changed to
> > > > something along the lines of:
> > > >
> > > >     if (riscv_has_extension_unlikely(v)) {
> > > >             if (riscv_v_setup_vsize()) {
> > > >                     elf_hwcap &= ~COMPAT_HWCAP_ISA_V;
> > > >                     riscv_v_vlen_mismatch = true;
> > > >             }
> > > >     }
> > > > and then has_vector() becomes something like
> > > > static __always_inline bool has_vector(void)
> > > > {
> > > >     return riscv_has_extension_unlikely(RISCV_ISA_EXT_v) && unlikely(riscv_v_vlen_mismatch);
> > > > }
> > > >
> > > > Probably there's value to be gained in static branches etc here, but I
> > > > was just trying to explain what I was getting at. Maybe I am missing
> > > > something though? I do remember talking about this back when the vector
> > > > patches were still in review.
> > >
> > > I actually don't think that alone fixes it: even if we make has_vector()
> > > respect the changes from the new CPU (whether we re-apply the alternatives
> > > or make a more dynamic check), we'd need to make sure the code that has
> > > already looked at has_vector() doesn't get scheduled on one of the
> > > shorter-VLENB CPUs.
> 
> I agree that flipping off the alternative for has_vector() could be
> dangerous at this point.
> 
> >
> > Is it possible to have that happen given this is in smp_callin()?
> > I guess the concern is some sort of hotplug situations, given that a CPU
> > could be hotplugged into a fully running system with a vlen mismatch.
> >
> > > So maybe the right answer here is to flip things around, and just refuse to
> > > enable CPUs that come in with a VLENB different than the boot CPU?  That way
> > > we could avoid being broken by systems that behave this way,
> >
> > smp_callin() doesn't return an error to report a failure, but it does
> > have a completion. I suppose we would just return early from
> > smp_callin() so that the completion never gets completed, in turn
> > causing __cpu_up() to fail?
> 
> I agree disabling the entire vector or fail bringing up the cpu is the
> safe way though.
> 
> >
> > > if someone ends
> > > up building one then they can figure out how to make it work better.
> >
> > Apparently the next gen sophgo product does. Not sure if people will wnt
> > to run across all the cores though, apparently they different by a lot
> > more than vlen.
> 
> The vector patchset assumes running on an SMP system. Supporting
> non-symmetric vlen or non-symmetric isa may require some efforts.

Supporting non symmetric extensions is actually not that hard, depending
on what the variance is - we already support enabling the intersection
of what all CPUs support. Supporting something like that for vlen
though, I agree.

> > A DT property would allow us to get the vlen info before the CPUs have
> > themselves been brought up. Apparently there are Zvl extensions for
> > this, but I'd rather just have a property like we have for the cbo
> > sizes if it comes to DT properties.
> 
> Yes, perhaps having a DT property is the first step to go. Because we
> can prevent race set/get-ing riscv_v_vsize in such case.

I think it is the second step. The first step I think is making
smp_callin() fail if vlen doesn't match, since that can be applied as a
quick fix to avoid issues on systems that have different vlens and would
also prevent an issue on a non-DT system.

Cheers,
Conor.

[-- Attachment #1.2: signature.asc --]
[-- Type: application/pgp-signature, Size: 228 bytes --]

[-- Attachment #2: Type: text/plain, Size: 161 bytes --]

_______________________________________________
linux-riscv mailing list
linux-riscv@lists.infradead.org
http://lists.infradead.org/mailman/listinfo/linux-riscv

^ permalink raw reply	[flat|nested] 6+ messages in thread

end of thread, other threads:[~2024-02-29 17:00 UTC | newest]

Thread overview: 6+ messages (download: mbox.gz / follow: Atom feed)
-- links below jump to the message on this page --
2024-02-28 16:02 vector status when vlen doesn't match Conor Dooley
2024-02-28 21:56 ` Palmer Dabbelt
2024-02-28 23:32   ` Conor Dooley
2024-02-29 13:32     ` Andy Chiu
2024-02-29 17:00       ` Conor Dooley
2024-02-29 16:59     ` Palmer Dabbelt

This is an external index of several public inboxes,
see mirroring instructions on how to clone and mirror
all data and code used by this external index.