* arm64 equivalents of PR_SET_TSC/ARCH_SET_CPUID
@ 2022-05-21 20:07 ` Kyle Huey
  0 siblings, 0 replies; 8+ messages in thread
From: Kyle Huey @ 2022-05-21 20:07 UTC (permalink / raw)
  To: open list
  Cc: moderated list:ARM PORT,
	maintainer:X86 ARCHITECTURE (32-BIT AND 64-BIT),
	yyc1992, Keno Fischer, Robert O'Callahan, Thomas Gleixner,
	Borislav Petkov, Marc Zyngier, Suzuki K Poulose, Will Deacon

There is ongoing work by Yichao Yu to make rr, a userspace record and
replay debugger[0], production quality on arm64[1]. One of the bigger
remaining issues is the kernel's emulation of accesses to certain
system registers[2] that reflect timing and CPU capabilities and are
either non-deterministic or can vary from processor to processor. We
would like to add the ability to tell the kernel to decline to emulate
these instructions for a given task and pass that responsibility onto
the supervising rr ptracer. There are analogous processor features and
disabling mechanisms on x86. The RDTSC instruction is controlled by
prctl(PR_SET_TSC) and the CPUID instruction is controlled (when the
hardware allows) by arch_prctl(ARCH_SET_CPUID).

The questions I'd like to raise are:

1. Is it appropriate to reuse PR_SET_TSC for roughly equivalent
functionality on AArch64? (even if the AArch64 feature is not actually
named Time Stamp Counter).
2. Likewise for ARCH_SET_CPUID
3. Since arch_prctl is x86-only, does it make more sense to add
arch_prctl to arm64 or to duplicate ARCH_SET_CPUID into the prctl
world? (e.g. a PR_SET_CPUID that works on both x86/arm64)

- Kyle

[0] https://rr-project.org/
[1] https://github.com/rr-debugger/rr/issues/3234
[2] e.g. CNTVCT_EL0 and MIDR_EL1, among others

* Re: arm64 equivalents of PR_SET_TSC/ARCH_SET_CPUID
  2022-05-21 20:07 ` Kyle Huey
@ 2022-05-22 15:35   ` Marc Zyngier
  -1 siblings, 0 replies; 8+ messages in thread
From: Marc Zyngier @ 2022-05-22 15:35 UTC (permalink / raw)
  To: Kyle Huey
  Cc: open list, moderated list:ARM PORT,
	maintainer:X86 ARCHITECTURE (32-BIT AND 64-BIT),
	yyc1992, Keno Fischer, Robert O'Callahan, Thomas Gleixner,
	Borislav Petkov, Suzuki K Poulose, Will Deacon

On Sat, 21 May 2022 21:07:14 +0100,
Kyle Huey <me@kylehuey.com> wrote:
> 
> There is ongoing work by Yichao Yu to make rr, a userspace record and
> replay debugger[0], production quality on arm64[1]. One of the bigger
> remaining issues is the kernel's emulation of accesses to certain
> system registers[2] that reflect timing and CPU capabilities and are
> either non-deterministic or can vary from processor to processor.

Just to make things clear: the kernel usually doesn't provide any
emulation for registers such as CNTVCT_EL0. On sane HW, userspace is
free to access it directly without any mediation (we only use the trap
for the sake of dealing with HW bugs).

> We
> would like to add the ability to tell the kernel to decline to emulate
> these instructions for a given task and pass that responsibility onto
> the supervising rr ptracer. There are analogous processor features and
> disabling mechanisms on x86. The RDTSC instruction is controlled by
> prctl(PR_SET_TSC) and the CPUID instruction is controlled (when the
> hardware allows) by arch_prctl(ARCH_SET_CPUID).
> 
> The questions I'd like to raise are:
> 
> 1. Is it appropriate to reuse PR_SET_TSC for roughly equivalent
> functionality on AArch64? (even if the AArch64 feature is not actually
> named Time Stamp Counter).

My gut feeling is that you really don't want to hijack an existing
API, because this is fundamentally different. The Linux arm64 ABI
mandates that the counter (and the frequency register associated with
it) are accessible, and you can't make them disappear.

From what I understand, you are relying on the TSC being disabled in
the tracee and intercepting the signal that gets delivered when it
accesses the counter. Is that correct?

Assuming I'm right, I think it'd make a lot more sense if there was a
first class ptrace option, if only because this would mandate the
kernel to start trapping things that are not trapped today.

It also begs the question of the fate of CNTFRQ_EL0, since you want to
be able to replay traces from one system to another (and the counter
is meaningless without the frequency).

Finally, what of the VDSO, which is by far the most common user of the
counter? I can totally imagine the VDSO getting stuck if emulation is
used and the sequence counter moves synchronously with the traps
(which is why we disable the VDSO when trapping CNTVCT_EL0).
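
For context on the concern above, the VDSO time functions read
kernel-published data under a sequence counter and retry when that
counter changed during the read. The sketch below is a simplified
userspace model of such a retry loop, not the kernel's VDSO code.
`struct time_page`, `read_time()` and `demo_counter()` are hypothetical
names, and one counter tick is assumed to equal one nanosecond. If the
sequence counter were to change every time the counter read trapped, as
suggested above, this loop would never exit.

```c
#include <stdatomic.h>
#include <stdint.h>

/* Hypothetical stand-in for the data page shared with the VDSO. */
struct time_page {
    _Atomic uint32_t seq;   /* odd while a writer is mid-update */
    uint64_t base_ns;       /* time at the last update */
    uint64_t base_cycles;   /* counter value at the last update */
};

/* Counter source; on arm64 this would be a CNTVCT_EL0 read. */
typedef uint64_t (*read_counter_fn)(void);

/* Deterministic counter used only for demonstration. */
uint64_t demo_counter(void) { return 1500; }

/*
 * Seqcount-style reader: sample seq, read the data and the counter,
 * then re-check seq. Any change means a writer interfered, so retry.
 * *retries reports how many attempts were thrown away.
 */
uint64_t read_time(struct time_page *p, read_counter_fn read_counter,
                   unsigned *retries)
{
    *retries = 0;
    for (;;) {
        uint32_t start = atomic_load_explicit(&p->seq,
                                              memory_order_acquire);
        if (!(start & 1)) {
            uint64_t ns = p->base_ns +
                          (read_counter() - p->base_cycles);
            atomic_thread_fence(memory_order_acquire);
            if (atomic_load_explicit(&p->seq,
                                     memory_order_relaxed) == start)
                return ns;
        }
        (*retries)++;
    }
}
```

With a quiescent page the loop exits on the first pass; a writer that
bumps `seq` during every `read_counter()` call would keep it spinning.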

> 2. Likewise for ARCH_SET_CPUID

We don't just emulate a single register, but a whole class of them. If
you are to present a different view for any of those, you'll need to
handle the lot (I really can't see why one would be more important
than the others).

So SET_CPUID really is the wrong tool. I'd rather there was (again) an
API that described exactly that.

> 3. Since arch_prctl is x86-only, does it make more sense to add
> arch_prctl to arm64 or to duplicate ARCH_SET_CPUID into the prctl
> world? (e.g. a PR_SET_CPUID that works on both x86/arm64)

I don't think any applies here. Different architectures have different
ABI requirements, and you can't really merge the two. Because the next
thing you know, you'll ask for the same thing for PMU registers, and
try to map them onto something else.

Overall, this would be better served by a framework for userspace
delegation of sysreg access by a ptrace'd process. Let's try to look
at it in those terms rather than casting arm64 into a seemingly
unrelated API.

Thanks,

	M.

> 
> - Kyle
> 
> [0] https://rr-project.org/
> [1] https://github.com/rr-debugger/rr/issues/3234
> [2] e.g. CNTVCT_EL0 and MIDR_EL1, among others
> 

-- 
Without deviation from the norm, progress is not possible.

* Re: arm64 equivalents of PR_SET_TSC/ARCH_SET_CPUID
  2022-05-22 15:35   ` Marc Zyngier
@ 2022-05-22 18:22     ` Keno Fischer
  -1 siblings, 0 replies; 8+ messages in thread
From: Keno Fischer @ 2022-05-22 18:22 UTC (permalink / raw)
  To: Marc Zyngier
  Cc: Kyle Huey, open list, moderated list:ARM PORT,
	maintainer:X86 ARCHITECTURE (32-BIT AND 64-BIT),
	yyc1992, Robert O'Callahan, Thomas Gleixner, Borislav Petkov,
	Suzuki K Poulose, Will Deacon

On Sun, May 22, 2022 at 11:35 AM Marc Zyngier <maz@kernel.org> wrote:
> From what I understand, you are relying on the TSC being disabled in
> the tracee and intercepting the signal that gets delivered when it
> accesses the counter. Is that correct?

Yes, this is correct. The way that these kernel APIs work is that they turn
any use of `rdtsc` (respectively `cpuid`) into SIGSEGV signals that the
ptracer intercepts and emulates. It's not particularly pretty, but it
works reasonably well in practice.
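
The x86 behaviour can be observed without a ptracer. The sketch below
assumes Linux on x86-64 and uses the process's own SIGSEGV handler in
place of a tracer; under ptrace the supervisor would see the signal
stop instead. `rdtsc_traps_when_disabled()` is a hypothetical helper
name. The TSC is re-enabled before returning so later code that relies
on `rdtsc` keeps working.

```c
#define _GNU_SOURCE
#include <setjmp.h>
#include <signal.h>
#include <stdint.h>
#include <string.h>
#if defined(__linux__)
#include <sys/prctl.h>
#endif

static sigjmp_buf tsc_trap_env;

static void on_sigsegv(int sig)
{
    (void)sig;
    siglongjmp(tsc_trap_env, 1);   /* skip the faulting rdtsc */
}

/*
 * Returns 1 if rdtsc raised SIGSEGV after PR_TSC_SIGSEGV was set,
 * 0 if it executed normally, -1 if the request is unsupported here.
 */
int rdtsc_traps_when_disabled(void)
{
#if defined(__linux__) && defined(__x86_64__) && defined(PR_SET_TSC)
    struct sigaction sa, old;
    int trapped;

    memset(&sa, 0, sizeof sa);
    sa.sa_handler = on_sigsegv;
    sigemptyset(&sa.sa_mask);
    if (sigaction(SIGSEGV, &sa, &old) != 0)
        return -1;
    if (prctl(PR_SET_TSC, PR_TSC_SIGSEGV) != 0) {
        sigaction(SIGSEGV, &old, NULL);
        return -1;
    }

    if (sigsetjmp(tsc_trap_env, 1) == 0) {
        uint32_t lo, hi;
        __asm__ volatile("rdtsc" : "=a"(lo), "=d"(hi));
        (void)lo;
        (void)hi;
        trapped = 0;
    } else {
        trapped = 1;
    }

    /* Restore normal behaviour for the rest of the process. */
    prctl(PR_SET_TSC, PR_TSC_ENABLE);
    sigaction(SIGSEGV, &old, NULL);
    return trapped;
#else
    return -1;
#endif
}
```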

> Assuming I'm right, I think it'd make a lot more sense if there was a
> first class ptrace option, if only because this would mandate the
> kernel to start trapping things that are not trapped today.

I'm a bit nervous about "first class ptrace option" if only because ptrace
is already a complicated mess and having spent a significant amount
of time hunting down architecture-specific ptrace quirks, I'd be quite
hesitant to introduce another one without a very strong justification.
If the proposed mechanism is essentially signal-equivalent
(i.e. it causes a ptrace stop and lets the ptracer emulate the instruction),
then I'd strongly advocate for making it an actual, proper signal which
has well-understood quirks (as the PR_SET_TSC/ARCH_SET_CPUID
do on x86).

The other consideration here is that disabling these sorts of counters
may have non-ptrace applications e.g. sandboxes may want to disable
these sorts of capabilities to harden against timing attacks, which
may suggest that ptrace isn't the right place for it.

If we're considering something more fancy, that's a different story of
course. Naturally causing a ptrace trap on these instructions has
significant overhead, but because they're usually fast, existing userspace
is not particularly judicious in their use (the same issue happens on x86
of course). One could imagine a light-weight kernel-level record/replay
capability where all accesses to these registers are traced and dumped
into a buffer (with the corresponding capability to feed the values from
a buffer). That kind of capability feels like a more natural fit for the perf
subsystem, which already has capabilities to shuffle trace buffers around.

> It also begs the question of the fate of CNTFRQ_EL0, since you want to
> be able to replay traces from one system to another (and the counter
> is meaningless without the frequency).

Yes, it'd have to be interceptable also.

> Finally, what of the VDSO, which is by far the most common user of the
> counter? I can totally imagine the VDSO getting stuck if emulation is
> used and the sequence counter moves synchronously with the traps
> (which is why we disable the VDSO when trapping CNTVCT_EL0).

Could you elaborate on this concern? rr does disable the vdso currently,
so it wouldn't be a problem from that perspective, but I don't understand
what you mean by the VDSO getting "stuck".

> > 2. Likewise for ARCH_SET_CPUID
>
> We don't just emulate a single register, but a whole class of them. If
> you are to present a different view for any of those, you'll need to
> handle the lot (I really can't see why one would be more important
> than the others).
>
> So SET_CPUID really is the wrong tool. I'd rather there was (again) an
> API that described exactly that.

I'm assuming these register values are all fixed as long as the process
doesn't get migrated between CPU cores? In that case, it seems quite
doable to introduce another ptrace regset that just has the register
values for everything that could potentially be emulated (and is extensible
for future additions). We'd need to think through the exact semantics
in the ordinary course if one of the emulated registers does change,
but it seems like a solvable issue.

Keno

* Re: arm64 equivalents of PR_SET_TSC/ARCH_SET_CPUID
  2022-05-22 15:35   ` Marc Zyngier
@ 2022-05-23 19:27     ` Kyle Huey
  -1 siblings, 0 replies; 8+ messages in thread
From: Kyle Huey @ 2022-05-23 19:27 UTC (permalink / raw)
  To: Marc Zyngier
  Cc: open list, moderated list:ARM PORT,
	maintainer:X86 ARCHITECTURE (32-BIT AND 64-BIT),
	yyc1992, Keno Fischer, Robert O'Callahan, Thomas Gleixner,
	Borislav Petkov, Suzuki K Poulose, Will Deacon

On Sun, May 22, 2022 at 8:35 AM Marc Zyngier <maz@kernel.org> wrote:
>
> On Sat, 21 May 2022 21:07:14 +0100,
> Kyle Huey <me@kylehuey.com> wrote:
> >
> > There is ongoing work by Yichao Yu to make rr, a userspace record and
> > replay debugger[0], production quality on arm64[1]. One of the bigger
> > remaining issues is the kernel's emulation of accesses to certain
> > system registers[2] that reflect timing and CPU capabilities and are
> > either non-deterministic or can vary from processor to processor.
>
> Just to make things clear: the kernel usually doesn't provide any
> emulation for registers such as CNTVCT_EL0. On sane HW, userspace is
> free to access it directly without any mediation (we only use the trap
> for the sake of dealing with HW bugs).

Right, in this case we would also want to set e.g.
CNTKCTL_EL1.EL0VCTEN = 0 whenever a task that has used PR_SET_TSC is
running. Then the hardware would deliver the fault, and the kernel
would decline to emulate the instruction, ultimately leading to the rr
supervisor seeing the signal, looking at the offending instruction,
and deciding how to handle it itself.
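
As an illustration of the "looking at the offending instruction" step,
a supervisor could decode the A64 MRS encoding to find out which
system register was read and which general-purpose register receives
the result. This is a minimal sketch; `struct sysreg_access`,
`decode_mrs()` and `is_cntvct_el0()` are hypothetical names and are not
taken from rr.

```c
#include <stdbool.h>
#include <stdint.h>

/* Fields of an A64 system register read (MRS Xt, <sysreg>). */
struct sysreg_access {
    uint8_t op0, op1, crn, crm, op2;
    uint8_t rt;   /* destination general-purpose register */
};

/* Returns true and fills *out if insn is an MRS instruction. */
bool decode_mrs(uint32_t insn, struct sysreg_access *out)
{
    /* Bits [31:20] are 0xD53 for MRS; bit 19 selects op0 = 2 or 3. */
    if ((insn & 0xFFF00000u) != 0xD5300000u)
        return false;
    out->op0 = 2 + ((insn >> 19) & 1);
    out->op1 = (insn >> 16) & 7;
    out->crn = (insn >> 12) & 15;
    out->crm = (insn >> 8) & 15;
    out->op2 = (insn >> 5) & 7;
    out->rt  = insn & 31;
    return true;
}

/* CNTVCT_EL0 is encoded as S3_3_C14_C0_2. */
bool is_cntvct_el0(const struct sysreg_access *a)
{
    return a->op0 == 3 && a->op1 == 3 && a->crn == 14 &&
           a->crm == 0 && a->op2 == 2;
}
```

For example, `mrs x0, cntvct_el0` assembles to 0xD53BE040, which
decodes to CNTVCT_EL0 with `rt == 0`.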

This is how RDTSC is handled on x86.  prctl(PR_SET_TSC,
PR_TSC_SIGSEGV) sets TIF_NOTSC on the task, and then at context
switches that leads to setting and clearing CR4.TSD as appropriate.

My understanding is that some of these (e.g. MIDR_EL1) actually are
always emulated by the kernel.

> > We
> > would like to add the ability to tell the kernel to decline to emulate
> > these instructions for a given task and pass that responsibility onto
> > the supervising rr ptracer. There are analogous processor features and
> > disabling mechanisms on x86. The RDTSC instruction is controlled by
> > prctl(PR_SET_TSC) and the CPUID instruction is controlled (when the
> > hardware allows) by arch_prctl(ARCH_SET_CPUID).
> >
> > The questions I'd like to raise are:
> >
> > 1. Is it appropriate to reuse PR_SET_TSC for roughly equivalent
> > functionality on AArch64? (even if the AArch64 feature is not actually
> > named Time Stamp Counter).
>
> My gut feeling is that you really don't want to hijack an existing
> API, because this is fundamentally different. The Linux arm64 ABI
> mandates that the counter (and the frequency register associated with
> it) are accessible, and you can't make them disappear.

We don't want to make them disappear per se. rr would emulate accesses
to these features. On x86 rr itself runs an rdtsc during recording
after the hardware traps for the tracee and notes the value it
returns. rr then sticks that value into the supervised process and
resumes execution. During replay we inject the same rdtsc value we
noted during recording to produce deterministic behavior.
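
The record/replay handling of a trapped counter read can be sketched
as a small log. This is only a model of the idea described above, not
rr's implementation; `struct counter_log` and `emulate_counter_read()`
are hypothetical names, and the fixed-size array stands in for a real
trace file.

```c
#include <stddef.h>
#include <stdint.h>

enum log_mode { MODE_RECORD, MODE_REPLAY };

struct counter_log {
    enum log_mode mode;
    uint64_t values[64];
    size_t len;   /* entries recorded so far */
    size_t pos;   /* next entry to hand out during replay */
};

/*
 * Called when the tracee traps on a counter read. live_value is what
 * the supervisor itself read from the hardware; it is used while
 * recording and ignored while replaying. On success, *result holds
 * the value to place in the tracee's register and 0 is returned.
 * Returns -1 if the log is full (record) or exhausted (replay).
 */
int emulate_counter_read(struct counter_log *log, uint64_t live_value,
                         uint64_t *result)
{
    if (log->mode == MODE_RECORD) {
        if (log->len == sizeof log->values / sizeof log->values[0])
            return -1;
        log->values[log->len++] = live_value;
        *result = live_value;
        return 0;
    }
    if (log->pos == log->len)
        return -1;
    *result = log->values[log->pos++];
    return 0;
}
```

During replay the tracee receives the recorded values in order,
regardless of what the hardware counter reads at that point.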

> From what I understand, you are relying on the TSC being disabled in
> the tracee and intercepting the signal that gets delivered when it
> accesses the counter. Is that correct?

Yes.

> Assuming I'm right, I think it'd make a lot more sense if there was a
> first class ptrace option, if only because this would mandate the
> kernel to start trapping things that are not trapped today.
>
> It also begs the question of the fate of CNTFRQ_EL0, since you want to
> be able to replay traces from one system to another (and the counter
> is meaningless without the frequency).

Yes, we'd want that too.

> Finally, what of the VDSO, which is by far the most common user of the
> counter? I can totally imagine the VDSO getting stuck if emulation is
> used and the sequence counter moves synchronously with the traps
> (which is why we disable the VDSO when trapping CNTVCT_EL0).

prctl(PR_SET_TSC, PR_TSC_SIGSEGV) will break RDTSC-using VDSO syscalls
like gettimeofday entirely on x86, so this isn't a new issue. rr
disables the VDSO anyway for other reasons, so this doesn't actually
apply to us. Like Keno, I am curious what problem you expect if the
counter is actually emulated, though.

> > 2. Likewise for ARCH_SET_CPUID
>
> We don't just emulate a single register, but a whole class of them. If
> you are to present a different view for any of those, you'll need to
> handle the lot (I really can't see why one would be more important
> than the others).
>
> So SET_CPUID really is the wrong tool. I'd rather there was (again) an
> API that described exactly that.

I don't think I communicated accurately what this does. PR_SET_TSC and
ARCH_SET_CPUID don't actually set the values returned from RDTSC and
CPUID. They instead control whether the instruction is enabled or uses
hardware capabilities to trap when executed. The rr supervisor then
figures out what to do. x86's CPUID is also quite complicated and has
many different sets of information that can be accessed based on the
values of %eax and %ecx when the CPUID instruction is executed.

> > 3. Since arch_prctl is x86-only, does it make more sense to add
> > arch_prctl to arm64 or to duplicate ARCH_SET_CPUID into the prctl
> > world? (e.g. a PR_SET_CPUID that works on both x86/arm64)
>
> I don't think any applies here. Different architectures have different
> ABI requirements, and you can't really merge the two. Because the next
> thing you know, you'll ask for the same thing for PMU registers, and
> try to map them onto something else.
>
> Overall, this would be better served by a framework for userspace
> delegation of sysreg access by a ptrace'd process. Let's try to look
> at it in those terms rather than casting arm64 into a seemingly
> unrelated API.

Explicit mechanisms to provide control over the hardware level trap
features provided by AArch64 (e.g. SCTLR_EL1.UCT and
CNTKCTL_EL1.EL0VCTEN) and the kernel's emulation of other sysregs
(e.g. MIDR_EL1) might make more sense. I share Keno's wariness of
running it through ptrace specifically (as opposed to a prctl and
signal like on x86) though.
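
To make the two hardware controls named above concrete, the sketch
below models how a per-task policy might map onto them. Only the bit
positions (CNTKCTL_EL1.EL0VCTEN is bit 1, SCTLR_EL1.UCT is bit 15) come
from the architecture; `struct sysreg_trap_policy` and the two helper
functions are hypothetical and do not correspond to any existing
kernel interface.

```c
#include <stdbool.h>
#include <stdint.h>

/* EL0 reads of CNTVCT_EL0 trap when this bit is 0. */
#define CNTKCTL_EL1_EL0VCTEN (UINT64_C(1) << 1)
/* EL0 reads of CTR_EL0 trap when this bit is 0. */
#define SCTLR_EL1_UCT        (UINT64_C(1) << 15)

/* Hypothetical per-task settings, e.g. requested through a prctl. */
struct sysreg_trap_policy {
    bool trap_virtual_counter;
    bool trap_cache_type;
};

/* Value to program into CNTKCTL_EL1 when switching to the task. */
uint64_t cntkctl_for_task(uint64_t default_value,
                          const struct sysreg_trap_policy *p)
{
    if (p->trap_virtual_counter)
        return default_value & ~CNTKCTL_EL1_EL0VCTEN;
    return default_value;
}

/* Value to program into SCTLR_EL1 when switching to the task. */
uint64_t sctlr_for_task(uint64_t default_value,
                        const struct sysreg_trap_policy *p)
{
    if (p->trap_cache_type)
        return default_value & ~SCTLR_EL1_UCT;
    return default_value;
}
```

This mirrors the x86 approach of a per-task flag that is applied at
context switch, with the system default restored for other tasks.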

- Kyle

> Thanks,
>
>         M.
>
> >
> > - Kyle
> >
> > [0] https://rr-project.org/
> > [1] https://github.com/rr-debugger/rr/issues/3234
> > [2] e.g. CNTVCT_EL0 and MIDR_EL1, among others
> >
>
> --
> Without deviation from the norm, progress is not possible.
