Linux-man Archive on lore.kernel.org
 help / color / Atom feed
* [PATCH] ptrace.2: Describe PTRACE_SET/GETREGSET on NT_X86_XSTATE
@ 2020-05-18  3:00 Keno Fischer
  2020-05-19 20:44 ` Michael Kerrisk (man-pages)
  2020-05-20  1:19 ` Andi Kleen
  0 siblings, 2 replies; 9+ messages in thread
From: Keno Fischer @ 2020-05-18  3:00 UTC (permalink / raw)
  To: mtk.manpages
  Cc: linux-man, Andy Lutomirski, Dave Hansen, Peter Zijlstra,
	Thomas Gleixner, Ingo Molnar, x86, H. Peter Anvin,
	Borislav Petkov, Dave Hansen, Andi Kleen

Correctly using the result of this operation is quite hard,
because the layout is not fixed and depends on the kernel
configuration. Furthermore, because of the initial state
optimization, parts of the layout may be missing. If ptrace
users are not careful, it is easy to get unexpected results.
This documents everything I know about how to use NT_X86_XSTATE
"correctly". This should probably have been documented earlier,
since every single ptrace application I looked at gets this wrong
in one way or another, but hopefully having documentation will at
least help future users cover the relevant corner cases.

Signed-off-by: Keno Fischer <keno@juliacomputing.com>

---
I'm hoping this will help. I recently had occasion to read up on
how this actually works (finding I, too, had used it incorrectly),
for patch in https://lkml.org/lkml/2020/4/6/1161. I'm cc'ing the
folks who took part in that review here, since I think they would
be interested in making sure the status quo is adequately documented.
Please let me know if I got anything wrong, or if anything is confusing.

 man2/ptrace.2 | 57 +++++++++++++++++++++++++++++++++++++++++++++++++++
 1 file changed, 57 insertions(+)

diff --git a/man2/ptrace.2 b/man2/ptrace.2
index 575062971..57958119b 100644
--- a/man2/ptrace.2
+++ b/man2/ptrace.2
@@ -2322,6 +2322,63 @@ result, to the real parent (to the real parent only when the
 whole multithreaded process exits).
 If the tracer and the real parent are the same process,
 the report is sent only once.
+.SS The layout and operation of the NT_X86_XSTATE regset
+On x86(_64), the values of extended registers can be obtained as an xstate buffer,
+accessed through the NT_X86_XSTATE option to
+.B PTRACE_GETREGSET.
+The layout of this area is relatively complex (and described below). It is
+in general not safe to assume that a buffer obtained using
+.B PTRACE_GETREGSET
+may be set back to any task using
+.B PTRACE_SETREGSET
+while resulting in a task that has equivalent register state (see below for
+details). It is also not safe to assume that the buffer is a valid xsave area
+that may be restored using the
+.I xrstor
+instruction, nor is it safe to assume that any extended state component is
+located at a particular fixed offset. Instead the following algorithm should be used to
+compute the offset of any given state component within the xsave buffer:
+.IP 1. 3
+Obtain the kernel xsave component bitmask from the software-reserved area of the
+xstate buffer. The software-reserved area beings at offset 464 into the xsave
+buffer and the first 64 bits of this area contain the kernel xsave component bitmask
+.IP 2.
+Compute the offset of each state component by adding the sizes of all prior state
+components that are enabled in the kernel xsave component bitmask, aligning to 64 byte boundaries along the way. This
+format matches that of a compacted xsave area with XCOMP_BV set to the
+kernel component bitmask. Further details on the layout of the compacted xsave
+area may be found in the Intel architecture manual, but note that the xsave
+buffer returned from ptrace will have its XCOMP_BV set to 0.
+.IP 3.
+For the given state component of interest, check the corresponding bit
+in the xsave header's XSTATE_BV bitfield. If this bit is zero, the corresponding
+component is in initial state and was not written to the buffer (i.e. the kernel
+does not touch the memory corresponding to this state component at all,
+the start offset next active state component will not be affected unless
+the bit is also missing from the kernel component bitmask obtained in step 1).
+The initial state for any state component is defined in the Intel architecture manual (for
+most state components it is the zero state).
+.PP
+
+In particular, the third of these considerations results in a buffer that does
+not round-trip through
+.B PTRACE_SETREGSET.
+If a given state component is missing from the XSTATE_BV bitfield, it will
+be ignored by
+.B PTRACE_SETREGSET
+even if the corresponding register in the target task is currently not in
+initial state.
+
+Thus, to obtain an xsave area that may be set back to the tracee, all unused
+state components must first be re-set to the correct initial state for the
+corresponding state component, and the XSTATE_BV bitfield must subsequently
+be adjusted to match the kernel xstate component bitmask (obtained as
+described above).
+
+The value of the kernel's state component bitmask is determined on boot and
+need not be equivalent to the maximal set of state components supported by the
+CPU (as enumerated through CPUID).
+
 .SH RETURN VALUE
 On success, the
 .B PTRACE_PEEK*
-- 
2.25.1


^ permalink raw reply	[flat|nested] 9+ messages in thread

* Re: [PATCH] ptrace.2: Describe PTRACE_SET/GETREGSET on NT_X86_XSTATE
  2020-05-18  3:00 [PATCH] ptrace.2: Describe PTRACE_SET/GETREGSET on NT_X86_XSTATE Keno Fischer
@ 2020-05-19 20:44 ` Michael Kerrisk (man-pages)
  2020-05-19 21:29   ` Denys Vlasenko
  2020-05-20  1:19 ` Andi Kleen
  1 sibling, 1 reply; 9+ messages in thread
From: Michael Kerrisk (man-pages) @ 2020-05-19 20:44 UTC (permalink / raw)
  To: Keno Fischer
  Cc: linux-man, Andy Lutomirski, Dave Hansen, Peter Zijlstra,
	Thomas Gleixner, Ingo Molnar, x86, H. Peter Anvin,
	Borislav Petkov, Dave Hansen, Andi Kleen, Denys Vlasenko,
	Michael Kerrisk

[CC += Denys, since he's had a lot of input to ptrace(2) in the past,
and perhaps might also have a comment to this patch]

On Mon, 18 May 2020 at 05:00, Keno Fischer <keno@juliacomputing.com> wrote:
>
> Correctly using the result of this operation is quite hard,
> because the layout is not fixed and depends on the kernel
> configuration. Furthermore, because of the initial state
> optimization, parts of the layout may be missing. If ptrace
> users are not careful, it is easy to get unexpected results.
> This documents everything I know about how to use NT_X86_XSTATE
> "correctly". This should probably have been documented earlier,
> since every single ptrace application I looked at gets this wrong
> in one way or another, but hopefully having documentation will at
> least help future users cover the relevant corner cases.
>
> Signed-off-by: Keno Fischer <keno@juliacomputing.com>
>
> ---
> I'm hoping this will help. I recently had occasion to read up on
> how this actually works (finding I, too, had used it incorrectly),
> for patch in https://lkml.org/lkml/2020/4/6/1161. I'm cc'ing the
> folks who took part in that review here, since I think they would
> be interested in making sure the status quo is adequately documented.
> Please let me know if I got anything wrong, or if anything is confusing.
>
>  man2/ptrace.2 | 57 +++++++++++++++++++++++++++++++++++++++++++++++++++
>  1 file changed, 57 insertions(+)
>
> diff --git a/man2/ptrace.2 b/man2/ptrace.2
> index 575062971..57958119b 100644
> --- a/man2/ptrace.2
> +++ b/man2/ptrace.2
> @@ -2322,6 +2322,63 @@ result, to the real parent (to the real parent only when the
>  whole multithreaded process exits).
>  If the tracer and the real parent are the same process,
>  the report is sent only once.
> +.SS The layout and operation of the NT_X86_XSTATE regset
> +On x86(_64), the values of extended registers can be obtained as an xstate buffer,
> +accessed through the NT_X86_XSTATE option to
> +.B PTRACE_GETREGSET.
> +The layout of this area is relatively complex (and described below). It is
> +in general not safe to assume that a buffer obtained using
> +.B PTRACE_GETREGSET
> +may be set back to any task using
> +.B PTRACE_SETREGSET
> +while resulting in a task that has equivalent register state (see below for
> +details). It is also not safe to assume that the buffer is a valid xsave area
> +that may be restored using the
> +.I xrstor
> +instruction, nor is it safe to assume that any extended state component is
> +located at a particular fixed offset. Instead the following algorithm should be used to
> +compute the offset of any given state component within the xsave buffer:
> +.IP 1. 3
> +Obtain the kernel xsave component bitmask from the software-reserved area of the
> +xstate buffer. The software-reserved area beings at offset 464 into the xsave
> +buffer and the first 64 bits of this area contain the kernel xsave component bitmask
> +.IP 2.
> +Compute the offset of each state component by adding the sizes of all prior state
> +components that are enabled in the kernel xsave component bitmask, aligning to 64 byte boundaries along the way. This
> +format matches that of a compacted xsave area with XCOMP_BV set to the
> +kernel component bitmask. Further details on the layout of the compacted xsave
> +area may be found in the Intel architecture manual, but note that the xsave
> +buffer returned from ptrace will have its XCOMP_BV set to 0.
> +.IP 3.
> +For the given state component of interest, check the corresponding bit
> +in the xsave header's XSTATE_BV bitfield. If this bit is zero, the corresponding
> +component is in initial state and was not written to the buffer (i.e. the kernel
> +does not touch the memory corresponding to this state component at all,
> +the start offset next active state component will not be affected unless
> +the bit is also missing from the kernel component bitmask obtained in step 1).
> +The initial state for any state component is defined in the Intel architecture manual (for
> +most state components it is the zero state).
> +.PP
> +
> +In particular, the third of these considerations results in a buffer that does
> +not round-trip through
> +.B PTRACE_SETREGSET.
> +If a given state component is missing from the XSTATE_BV bitfield, it will
> +be ignored by
> +.B PTRACE_SETREGSET
> +even if the corresponding register in the target task is currently not in
> +initial state.
> +
> +Thus, to obtain an xsave area that may be set back to the tracee, all unused
> +state components must first be re-set to the correct initial state for the
> +corresponding state component, and the XSTATE_BV bitfield must subsequently
> +be adjusted to match the kernel xstate component bitmask (obtained as
> +described above).
> +
> +The value of the kernel's state component bitmask is determined on boot and
> +need not be equivalent to the maximal set of state components supported by the
> +CPU (as enumerated through CPUID).
> +
>  .SH RETURN VALUE
>  On success, the
>  .B PTRACE_PEEK*
> --
> 2.25.1
>


-- 
Michael Kerrisk
Linux man-pages maintainer; http://www.kernel.org/doc/man-pages/
Linux/UNIX System Programming Training: http://man7.org/training/

^ permalink raw reply	[flat|nested] 9+ messages in thread

* Re: [PATCH] ptrace.2: Describe PTRACE_SET/GETREGSET on NT_X86_XSTATE
  2020-05-19 20:44 ` Michael Kerrisk (man-pages)
@ 2020-05-19 21:29   ` Denys Vlasenko
  2020-05-19 22:46     ` Keno Fischer
  0 siblings, 1 reply; 9+ messages in thread
From: Denys Vlasenko @ 2020-05-19 21:29 UTC (permalink / raw)
  To: Michael Kerrisk
  Cc: Keno Fischer, linux-man, Andy Lutomirski, Dave Hansen,
	Peter Zijlstra, Thomas Gleixner, Ingo Molnar, X86 ML,
	H. Peter Anvin, Borislav Petkov, Dave Hansen, Andi Kleen

On Tue, May 19, 2020 at 10:44 PM Michael Kerrisk (man-pages)
<mtk.manpages@gmail.com> wrote:
>
> [CC += Denys, since he's had a lot of input to ptrace(2) in the past,
> and perhaps might also have a comment to this patch]
>
> On Mon, 18 May 2020 at 05:00, Keno Fischer <keno@juliacomputing.com> wrote:
> >
> > Correctly using the result of this operation is quite hard,
> > because the layout is not fixed and depends on the kernel
> > configuration. Furthermore, because of the initial state
> > optimization, parts of the layout may be missing. If ptrace
> > users are not careful, it is easy to get unexpected results.
> > This documents everything I know about how to use NT_X86_XSTATE
> > "correctly". This should probably have been documented earlier,
> > since every single ptrace application I looked at gets this wrong
> > in one way or another, but hopefully having documentation will at
> > least help future users cover the relevant corner cases.
> >
> > Signed-off-by: Keno Fischer <keno@juliacomputing.com>
> >
> > ---
> > I'm hoping this will help. I recently had occasion to read up on
> > how this actually works (finding I, too, had used it incorrectly),
> > for patch in https://lkml.org/lkml/2020/4/6/1161. I'm cc'ing the
> > folks who took part in that review here, since I think they would
> > be interested in making sure the status quo is adequately documented.
> > Please let me know if I got anything wrong, or if anything is confusing.
> >
> >  man2/ptrace.2 | 57 +++++++++++++++++++++++++++++++++++++++++++++++++++
> >  1 file changed, 57 insertions(+)
> >
> > diff --git a/man2/ptrace.2 b/man2/ptrace.2
> > index 575062971..57958119b 100644
> > --- a/man2/ptrace.2
> > +++ b/man2/ptrace.2
> > @@ -2322,6 +2322,63 @@ result, to the real parent (to the real parent only when the
> >  whole multithreaded process exits).
> >  If the tracer and the real parent are the same process,
> >  the report is sent only once.
> > +.SS The layout and operation of the NT_X86_XSTATE regset
> > +On x86(_64), the values of extended registers can be obtained as an xstate buffer,
> > +accessed through the NT_X86_XSTATE option to
> > +.B PTRACE_GETREGSET.
> > +The layout of this area is relatively complex (and described below). It is
> > +in general not safe to assume that a buffer obtained using
> > +.B PTRACE_GETREGSET
> > +may be set back to any task using
> > +.B PTRACE_SETREGSET
> > +while resulting in a task that has equivalent register state (see below for
> > +details). It is also not safe to assume that the buffer is a valid xsave area
> > +that may be restored using the
> > +.I xrstor
> > +instruction, nor is it safe to assume that any extended state component is
> > +located at a particular fixed offset. Instead the following algorithm should be used to
> > +compute the offset of any given state component within the xsave buffer:
> > +.IP 1. 3
> > +Obtain the kernel xsave component bitmask from the software-reserved area of the
> > +xstate buffer. The software-reserved area beings at offset 464 into the xsave
> > +buffer and the first 64 bits of this area contain the kernel xsave component bitmask
> > +.IP 2.
> > +Compute the offset of each state component by adding the sizes of all prior state
> > +components that are enabled in the kernel xsave component bitmask, aligning to 64 byte boundaries along the way. This
> > +format matches that of a compacted xsave area with XCOMP_BV set to the
> > +kernel component bitmask. Further details on the layout of the compacted xsave
> > +area may be found in the Intel architecture manual, but note that the xsave
> > +buffer returned from ptrace will have its XCOMP_BV set to 0.

I propose s/will have its XCOMP_BV set to 0/may have its XCOMP_BV set to 0/
(future-proofing for the case that kernel may change this behavior).

> > +.IP 3.
> > +For the given state component of interest, check the corresponding bit
> > +in the xsave header's XSTATE_BV bitfield.

"...(the 64-bit field at 512 byte offset in the area)".

> If this bit is zero, the corresponding
> > +component is in initial state and was not written to the buffer (i.e. the kernel
> > +does not touch the memory corresponding to this state component at all,

We probably can not guarantee these "untouched" parts
retain their contents before syscall, it's possible future kernels may zero-fill
it - I propose to not over-promise here. How about "the contents is undefined,
it may remain untouched, or be filled with dummy data"?

> > +the start offset next active state component will not be affected unless
> > +the bit is also missing from the kernel component bitmask obtained in step 1).
> > +The initial state for any state component is defined in the Intel architecture manual (for
> > +most state components it is the zero state).
> > +.PP
> > +
> > +In particular, the third of these considerations results in a buffer that does
> > +not round-trip through
> > +.B PTRACE_SETREGSET.
> > +If a given state component is missing from the XSTATE_BV bitfield, it will
> > +be ignored by
> > +.B PTRACE_SETREGSET
> > +even if the corresponding register in the target task is currently not in
> > +initial state.
> > +
> > +Thus, to obtain an xsave area that may be set back to the tracee, all unused
> > +state components must first be re-set to the correct initial state for the
> > +corresponding state component, and the XSTATE_BV bitfield must subsequently
> > +be adjusted to match the kernel xstate component bitmask (obtained as
> > +described above).

The above paragraph needs a better wording. Are you saying the following? -

"If a state component is not saved (its XSTATE_BV bit is zero) but you
want to modify corresponding registers in the tracee, you need to set
this bit to 1 and initialize the component to the desired state."

> > +
> > +The value of the kernel's state component bitmask is determined on boot and
> > +need not be equivalent to the maximal set of state components supported by the
> > +CPU (as enumerated through CPUID).

^ permalink raw reply	[flat|nested] 9+ messages in thread

* Re: [PATCH] ptrace.2: Describe PTRACE_SET/GETREGSET on NT_X86_XSTATE
  2020-05-19 21:29   ` Denys Vlasenko
@ 2020-05-19 22:46     ` Keno Fischer
  2020-05-20 10:03       ` Denys Vlasenko
  0 siblings, 1 reply; 9+ messages in thread
From: Keno Fischer @ 2020-05-19 22:46 UTC (permalink / raw)
  To: Denys Vlasenko
  Cc: Michael Kerrisk, linux-man, Andy Lutomirski, Dave Hansen,
	Peter Zijlstra, Thomas Gleixner, Ingo Molnar, X86 ML,
	H. Peter Anvin, Borislav Petkov, Dave Hansen, Andi Kleen

> > > Further details on the layout of the compacted xsave
> > > +area may be found in the Intel architecture manual, but note that the xsave
> > > +buffer returned from ptrace will have its XCOMP_BV set to 0.
>
> I propose s/will have its XCOMP_BV set to 0/may have its XCOMP_BV set to 0/
> (future-proofing for the case that kernel may change this behavior).

Sounds good, though I would hope that this behavior change comes with some
flag to enable it :).

> > > +.IP 3.
> > > +For the given state component of interest, check the corresponding bit
> > > +in the xsave header's XSTATE_BV bitfield.
>
> "...(the 64-bit field at 512 byte offset in the area)".

applied. Will send out with v2 once I have all the comments in v1.

> > If this bit is zero, the corresponding
> > > +component is in initial state and was not written to the buffer (i.e. the kernel
> > > +does not touch the memory corresponding to this state component at all,
>
> We probably can not guarantee these "untouched" parts
> retain their contents before syscall, it's possible future kernels may zero-fill
> it - I propose to not over-promise here. How about "the contents is undefined,
> it may remain untouched, or be filled with dummy data"?

Sounds reasonable. Applied.

> > > +
> > > +Thus, to obtain an xsave area that may be set back to the tracee, all unused
> > > +state components must first be re-set to the correct initial state for the
> > > +corresponding state component, and the XSTATE_BV bitfield must subsequently
> > > +be adjusted to match the kernel xstate component bitmask (obtained as
> > > +described above).
>
> The above paragraph needs a better wording. Are you saying the following? -
>
> "If a state component is not saved (its XSTATE_BV bit is zero) but you
> want to modify corresponding registers in the tracee, you need to set
> this bit to 1 and initialize the component to the desired state."

Kind of, what I want to get across is a warning that the following pattern:

struct iov = { ... };
ptrace(PTRACE_GETREGSET, pid1, NT_X86_XSTATE, &iov);
ptrace(PTRACE_SETREGSET, pid2, NT_X86_XSTATE, &iov);

will not necessarily result in pid1 and pid2 having identical register states.
If a state component was in its initial state in pid1, the XSTATE_BV
bit will be cleared, resulting in the registers in pid2 not being modified.
Perhaps the easiest thing is just to include this example as well as
some pseudo-code like the following:

struct user_xstateregs *buffer = ...;
struct iov = { iov_base=(uint8_t*)buffer, iov_len=... };
ptrace(PTRACE_GETREGSET, pid1, NT_X86_XSTATE, &iov);
uint64_t kernel_xmask = buffer->i387.xstate_fx_sw[0];
uint64_t active_mask = buffer->header.xfeatures;
for (int i = 0; i < XFEATURES_MAX; ++i) {
    uint64_t bit = (uint64_t)1 << i;
    if ((kernel_mask & big) && !(active_mask & bit)) {
        reset_state_component(buffer, i);
    }
}
ptrace(PTRACE_SETREGSET, pid2, NT_X86_XSTATE, &iov);

to show how to make sure that pid2 ends up with the same register
state as pid1.

^ permalink raw reply	[flat|nested] 9+ messages in thread

* Re: [PATCH] ptrace.2: Describe PTRACE_SET/GETREGSET on NT_X86_XSTATE
  2020-05-18  3:00 [PATCH] ptrace.2: Describe PTRACE_SET/GETREGSET on NT_X86_XSTATE Keno Fischer
  2020-05-19 20:44 ` Michael Kerrisk (man-pages)
@ 2020-05-20  1:19 ` Andi Kleen
  2020-05-20  3:30   ` Keno Fischer
  1 sibling, 1 reply; 9+ messages in thread
From: Andi Kleen @ 2020-05-20  1:19 UTC (permalink / raw)
  To: Keno Fischer
  Cc: mtk.manpages, linux-man, Andy Lutomirski, Dave Hansen,
	Peter Zijlstra, Thomas Gleixner, Ingo Molnar, x86,
	H. Peter Anvin, Borislav Petkov, Dave Hansen, Andi Kleen

> diff --git a/man2/ptrace.2 b/man2/ptrace.2
> index 575062971..57958119b 100644
> --- a/man2/ptrace.2
> +++ b/man2/ptrace.2
> @@ -2322,6 +2322,63 @@ result, to the real parent (to the real parent only when the
>  whole multithreaded process exits).
>  If the tracer and the real parent are the same process,
>  the report is sent only once.
> +.SS The layout and operation of the NT_X86_XSTATE regset

Should rather have a complete table of NT_* entries first. The others
can be dummies for now.

> +On x86(_64), the values of extended registers can be obtained as an xstate buffer,
> +accessed through the NT_X86_XSTATE option to
> +.B PTRACE_GETREGSET.
> +The layout of this area is relatively complex (and described below). It is
> +in general not safe to assume that a buffer obtained using
> +.B PTRACE_GETREGSET
> +may be set back to any task using
> +.B PTRACE_SETREGSET
> +while resulting in a task that has equivalent register state (see below for
> +details). It is also not safe to assume that the buffer is a valid xsave area
> +that may be restored using the
> +.I xrstor
> +instruction, nor is it safe to assume that any extended state component is
> +located at a particular fixed offset. Instead the following algorithm should be used to
> +compute the offset of any given state component within the xsave buffer:
> +.IP 1. 3
> +Obtain the kernel xsave component bitmask from the software-reserved area of the
> +xstate buffer. The software-reserved area beings at offset 464 into the xsave

It would be better to put some struct defining this into the kernel uapi
and then reference that instead of magic numbers.

> +buffer and the first 64 bits of this area contain the kernel xsave component bitmask
> +.IP 2.
> +Compute the offset of each state component by adding the sizes of all prior state
> +components that are enabled in the kernel xsave component bitmask, aligning to 64 byte boundaries along the way. This
> +format matches that of a compacted xsave area with XCOMP_BV set to the

The sizes of these areas should probably also be in the uapi include

> +kernel component bitmask. Further details on the layout of the compacted xsave
> +area may be found in the Intel architecture manual, but note that the xsave
> +buffer returned from ptrace will have its XCOMP_BV set to 0.

The note seems disconnected. You'll have to explain it here.

> +Thus, to obtain an xsave area that may be set back to the tracee, all unused
> +state components must first be re-set to the correct initial state for the
> +corresponding state component, and the XSTATE_BV bitfield must subsequently
> +be adjusted to match the kernel xstate component bitmask (obtained as
> +described above).

I wonder if we shouldn't just fix the kernel to do this properly on its
own.  Presumably it won't break any existing user space.

It seems more a bug than something that should be a documented ABI.

> +
> +The value of the kernel's state component bitmask is determined on boot and
> +need not be equivalent to the maximal set of state components supported by the
> +CPU (as enumerated through CPUID).

Okay so how should someone get it? Looks like that's a hole in the
kernel API that we need to fix somehow.

> +
>  .SH RETURN VALUE
>  On success, the
>  .B PTRACE_PEEK*
> -- 
> 2.25.1
> 

^ permalink raw reply	[flat|nested] 9+ messages in thread

* Re: [PATCH] ptrace.2: Describe PTRACE_SET/GETREGSET on NT_X86_XSTATE
  2020-05-20  1:19 ` Andi Kleen
@ 2020-05-20  3:30   ` Keno Fischer
  2020-05-20  5:08     ` Michael Kerrisk (man-pages)
  2020-05-20 13:56     ` Andi Kleen
  0 siblings, 2 replies; 9+ messages in thread
From: Keno Fischer @ 2020-05-20  3:30 UTC (permalink / raw)
  To: Andi Kleen
  Cc: Michael Kerrisk-manpages, linux-man, Andy Lutomirski,
	Dave Hansen, Peter Zijlstra, Thomas Gleixner, Ingo Molnar,
	maintainer:X86 ARCHITECTURE (32-BIT AND 64-BIT),
	H. Peter Anvin, Borislav Petkov, Dave Hansen

> > +.SS The layout and operation of the NT_X86_XSTATE regset
>
> Should rather have a complete table of NT_* entries first. The others
> can be dummies for now.

Oh boy, I'm not sure my man-page-formatting-fu is up to the task of
creating a nice looking table :).
Michael, can you help?

> > +Obtain the kernel xsave component bitmask from the software-reserved area of the
> > +xstate buffer. The software-reserved area beings at offset 464 into the xsave
>
> It would be better to put some struct defining this into the kernel uapi
> and then reference that instead of magic numbers.

We have user_xstateregs in the kernel, but that's not in the uapi.
I suppose we should move it, given that it is exposed here.

> > +buffer and the first 64 bits of this area contain the kernel xsave component bitmask
> > +.IP 2.
> > +Compute the offset of each state component by adding the sizes of all prior state
> > +components that are enabled in the kernel xsave component bitmask, aligning to 64 byte boundaries along the way. This
> > +format matches that of a compacted xsave area with XCOMP_BV set to the
>
> The sizes of these areas should probably also be in the uapi include

Yes, that seems like a good idea.
What's the policy on helper functions in uapi includes?
Can we have helper functions that given a buffer and the kernel xstate mask,
does this computation for you?

> > +kernel component bitmask. Further details on the layout of the compacted xsave
> > +area may be found in the Intel architecture manual, but note that the xsave
> > +buffer returned from ptrace will have its XCOMP_BV set to 0.
>
> The note seems disconnected. You'll have to explain it here.

Ok, I'll elaborate. The point I wanted to make is that even though the buffer
looks compressed, XCOMP_BV is 0, so it's not a valid compressed buffer
that can be xrstor'ed.

> > +Thus, to obtain an xsave area that may be set back to the tracee, all unused
> > +state components must first be re-set to the correct initial state for the
> > +corresponding state component, and the XSTATE_BV bitfield must subsequently
> > +be adjusted to match the kernel xstate component bitmask (obtained as
> > +described above).
>
> I wonder if we shouldn't just fix the kernel to do this properly on its
> own.  Presumably it won't break any existing user space.
>
> It seems more a bug than something that should be a documented ABI.

I'd be happy to see this interface improved, since I do think it wasn't quite
intended to work this way when originally conceived (i.e. originally, before
the init optimization and before we had flags that turn off various xstate
components resulting in a compressed buffer).

As I said in the other email thread, I think it would be reasonable to change
the offsets to always be non-compressed, which would at least make this
a normal xsave buffer. No ptracer that I looked at knows that this buffer
can be compressed, so changing the kernel behavior here would actually
make it closer to what existing userspace expects ;).

I'm not sure what to do about the getregset/setregset mismatch. On the one
hand it's pretty bad, but on the other hand, I'm not really sure what to do
about it, short of introducing a different NT_X86_* constant that behaves
differently.

> > +
> > +The value of the kernel's state component bitmask is determined on boot and
> > +need not be equivalent to the maximal set of state components supported by the
> > +CPU (as enumerated through CPUID).
>
> Okay so how should someone get it? Looks like that's a hole in the
> kernel API that we need to fix somehow.

The cpuid enumerated value does still represent a maximum so that can be used
to size the buffer and the actual value can then be read from the software saved
area as described here. Is that what you were asking? Not sure I understood
correctly.



Keno

^ permalink raw reply	[flat|nested] 9+ messages in thread

* Re: [PATCH] ptrace.2: Describe PTRACE_SET/GETREGSET on NT_X86_XSTATE
  2020-05-20  3:30   ` Keno Fischer
@ 2020-05-20  5:08     ` Michael Kerrisk (man-pages)
  2020-05-20 13:56     ` Andi Kleen
  1 sibling, 0 replies; 9+ messages in thread
From: Michael Kerrisk (man-pages) @ 2020-05-20  5:08 UTC (permalink / raw)
  To: Keno Fischer
  Cc: Andi Kleen, linux-man, Andy Lutomirski, Dave Hansen,
	Peter Zijlstra, Thomas Gleixner, Ingo Molnar,
	maintainer:X86 ARCHITECTURE (32-BIT AND 64-BIT),
	H. Peter Anvin, Borislav Petkov, Dave Hansen

On Wed, 20 May 2020 at 05:30, Keno Fischer <keno@juliacomputing.com> wrote:
>
> > > +.SS The layout and operation of the NT_X86_XSTATE regset
> >
> > Should rather have a complete table of NT_* entries first. The others
> > can be dummies for now.
>
> Oh boy, I'm not sure my man-page-formatting-fu is up to the task of
> creating a nice looking table :).
> Michael, can you help?

The general form is simple enough. Assuming, say, a three-column
table, write something like this:

.TS
l l l
---
l l l.
HEAD1<tab>HEAD2<tab>HEAD3
val<tab>val<tab>val<tab>
val<tab>val<tab>val<tab>
...
.TE

I've simplified things somewhat, but if you write your table like
that, I'll fix up any needed formatting afterwards.

For a more complicated example, see the source of syscalls(2).

Thanks,

Michael

^ permalink raw reply	[flat|nested] 9+ messages in thread

* Re: [PATCH] ptrace.2: Describe PTRACE_SET/GETREGSET on NT_X86_XSTATE
  2020-05-19 22:46     ` Keno Fischer
@ 2020-05-20 10:03       ` Denys Vlasenko
  0 siblings, 0 replies; 9+ messages in thread
From: Denys Vlasenko @ 2020-05-20 10:03 UTC (permalink / raw)
  To: Keno Fischer
  Cc: Michael Kerrisk, linux-man, Andy Lutomirski, Dave Hansen,
	Peter Zijlstra, Thomas Gleixner, Ingo Molnar, X86 ML,
	H. Peter Anvin, Borislav Petkov, Dave Hansen, Andi Kleen

On Wed, May 20, 2020 at 12:47 AM Keno Fischer <keno@juliacomputing.com> wrote:
> > > > +Thus, to obtain an xsave area that may be set back to the tracee, all unused
> > > > +state components must first be re-set to the correct initial state for the
> > > > +corresponding state component, and the XSTATE_BV bitfield must subsequently
> > > > +be adjusted to match the kernel xstate component bitmask (obtained as
> > > > +described above).
> >
> > The above paragraph needs a better wording. Are you saying the following? -
> >
> > "If a state component is not saved (its XSTATE_BV bit is zero) but you
> > want to modify corresponding registers in the tracee, you need to set
> > this bit to 1 and initialize the component to the desired state."
>
> Kind of, what I want to get across is a warning that the following pattern:
>
> struct iov = { ... };
> ptrace(PTRACE_GETREGSET, pid1, NT_X86_XSTATE, &iov);
> ptrace(PTRACE_SETREGSET, pid2, NT_X86_XSTATE, &iov);
>
> will not necessarily result in pid1 and pid2 having identical register states.
> If a state component was in its initial state in pid1, the XSTATE_BV
> bit will be cleared, resulting in the registers in pid2 not being modified.

Wanting to set the registers to initial state is a subset of the case where
you want to set them to some state (initial or not), so my proposed
explanation covers it too.

But your example with two separate pids makes it clearer when you would
need to be aware of it even if the state you are setting is the initial one:
you need explicitly set it, can't assume just "copying" will do.

I propose:

If a state component is not saved (its XSTATE_BV bit is zero) but you
want to modify corresponding registers in the tracee, you need to set
this bit to 1 and initialize the component to the desired state.

In particular, it means that on PTRACE_SETREGSET, not saved component
does not cause corresponding registers to be re-initialized: a naive
"copy registers from pid1 to pid2":

ptrace(PTRACE_GETREGSET, pid1, NT_X86_XSTATE, &iov);
ptrace(PTRACE_SETREGSET, pid2, NT_X86_XSTATE, &iov);

will not copy register sets which are in initial state in pid1,
they will remain unmodified in pid2.

^ permalink raw reply	[flat|nested] 9+ messages in thread

* Re: [PATCH] ptrace.2: Describe PTRACE_SET/GETREGSET on NT_X86_XSTATE
  2020-05-20  3:30   ` Keno Fischer
  2020-05-20  5:08     ` Michael Kerrisk (man-pages)
@ 2020-05-20 13:56     ` Andi Kleen
  1 sibling, 0 replies; 9+ messages in thread
From: Andi Kleen @ 2020-05-20 13:56 UTC (permalink / raw)
  To: Keno Fischer
  Cc: Andi Kleen, Michael Kerrisk-manpages, linux-man, Andy Lutomirski,
	Dave Hansen, Peter Zijlstra, Thomas Gleixner, Ingo Molnar,
	maintainer:X86 ARCHITECTURE (32-BIT AND 64-BIT),
	H. Peter Anvin, Borislav Petkov, Dave Hansen

On Tue, May 19, 2020 at 11:30:02PM -0400, Keno Fischer wrote:
> > > +.SS The layout and operation of the NT_X86_XSTATE regset
> >
> > Should rather have a complete table of NT_* entries first. The others
> > can be dummies for now.
> 
> Oh boy, I'm not sure my man-page-formatting-fu is up to the task of
> creating a nice looking table :).
> Michael, can you help?


'\" t in the first line

then 

.TS
tab(:);
c s 
l l.
header:header
entry:entry
...
.TE



> 
> > > +Obtain the kernel xsave component bitmask from the software-reserved area of the
> > > +xstate buffer. The software-reserved area beings at offset 464 into the xsave
> >
> > It would be better to put some struct defining this into the kernel uapi
> > and then reference that instead of magic numbers.
> 
> We have user_xstateregs in the kernel, but that's not in the uapi.
> I suppose we should move it, given that it is exposed here.

Yes.

> 
> > > +buffer and the first 64 bits of this area contain the kernel xsave component bitmask
> > > +.IP 2.
> > > +Compute the offset of each state component by adding the sizes of all prior state
> > > +components that are enabled in the kernel xsave component bitmask, aligning to 64 byte boundaries along the way. This
> > > +format matches that of a compacted xsave area with XCOMP_BV set to the
> >
> > The sizes of these areas should probably also be in the uapi include
> 
> Yes, that seems like a good idea.
> What's the policy on helper functions in uapi includes?
> Can we have helper functions that given a buffer and the kernel xstate mask,
> does this computation for you?

Yes that sounds like a good idea. Just needs to be an inline.

I'm not aware of any policy against that.

> > > +Thus, to obtain an xsave area that may be set back to the tracee, all unused
> > > +state components must first be re-set to the correct initial state for the
> > > +corresponding state component, and the XSTATE_BV bitfield must subsequently
> > > +be adjusted to match the kernel xstate component bitmask (obtained as
> > > +described above).
> >
> > I wonder if we shouldn't just fix the kernel to do this properly on its
> > own.  Presumably it won't break any existing user space.
> >
> > It seems more a bug than something that should be a documented ABI.
> 
> I'd be happy to see this interface improved, since I do think it wasn't quite
> intended to work this way when originally conceived (i.e. originally, before
> the init optimization and before we had flags that turn off various xstate
> components resulting in a compressed buffer).

Yes we should fix the kernel to make sure that anything copied out
can be straight copied in again (plus probably some self tests for this)

> 
> As I said in the other email thread, I think it would be reasonable to change
> the offsets to always be non-compressed, which would at least make this
> a normal xsave buffer. No ptracer that I looked at knows that this buffer
> can be compressed, so changing the kernel behavior here would actually
> make it closer to what existing userspace expects ;).

Hmm, maybe. It seems a bit risky to mess with an ABI.
I guess could add a new NT_* for this and deprecate the old one.

> 
> I'm not sure what to do about the getregset/setregset mismatch. On the one
> hand it's pretty bad, but on the other hand, I'm not really sure what to do
> about it, short of introducing a different NT_X86_* constant that behaves
> differently.


The kernel just needs to fill in the right fields on GET?

Or perhaps it can be only fixed in a new NT_*

> 
> > > +
> > > +The value of the kernel's state component bitmask is determined on boot and
> > > +need not be equivalent to the maximal set of state components supported by the
> > > +CPU (as enumerated through CPUID).
> >
> > Okay so how should someone get it? Looks like that's a hole in the
> > kernel API that we need to fix somehow.
> 
> The cpuid enumerated value does still represent a maximum so that can be used
> to size the buffer and the actual value can then be read from the software saved
> area as described here. Is that what you were asking? Not sure I understood
> correctly.

Okay so it's already exported. Never mind.

-Andi

^ permalink raw reply	[flat|nested] 9+ messages in thread

end of thread, back to index

Thread overview: 9+ messages (download: mbox.gz / follow: Atom feed)
-- links below jump to the message on this page --
2020-05-18  3:00 [PATCH] ptrace.2: Describe PTRACE_SET/GETREGSET on NT_X86_XSTATE Keno Fischer
2020-05-19 20:44 ` Michael Kerrisk (man-pages)
2020-05-19 21:29   ` Denys Vlasenko
2020-05-19 22:46     ` Keno Fischer
2020-05-20 10:03       ` Denys Vlasenko
2020-05-20  1:19 ` Andi Kleen
2020-05-20  3:30   ` Keno Fischer
2020-05-20  5:08     ` Michael Kerrisk (man-pages)
2020-05-20 13:56     ` Andi Kleen

Linux-man Archive on lore.kernel.org

Archives are clonable:
	git clone --mirror https://lore.kernel.org/linux-man/0 linux-man/git/0.git

	# If you have public-inbox 1.1+ installed, you may
	# initialize and index your mirror using the following commands:
	public-inbox-init -V2 linux-man linux-man/ https://lore.kernel.org/linux-man \
		linux-man@vger.kernel.org
	public-inbox-index linux-man

Example config snippet for mirrors

Newsgroup available over NNTP:
	nntp://nntp.lore.kernel.org/org.kernel.vger.linux-man


AGPL code for this site: git clone https://public-inbox.org/public-inbox.git