* [PATCH v2 0/3] rseq: minor optimizations
@ 2021-04-13 16:22 Eric Dumazet
  2021-04-13 16:22 ` [PATCH v2 1/3] rseq: optimize rseq_update_cpu_id() Eric Dumazet
                   ` (2 more replies)
  0 siblings, 3 replies; 26+ messages in thread
From: Eric Dumazet @ 2021-04-13 16:22 UTC (permalink / raw)
  To: Ingo Molnar, Mathieu Desnoyers, Peter Zijlstra
  Cc: Paul E . McKenney, Boqun Feng, Arjun Roy, linux-kernel,
	Eric Dumazet, Eric Dumazet

From: Eric Dumazet <edumazet@google.com>

rseq makes heavy use of copying data to/from user space in fast paths.
This series tries to reduce that cost.

v2: Addressed feedback from Peter and Mathieu, thanks!

Eric Dumazet (3):
  rseq: optimize rseq_update_cpu_id()
  rseq: remove redundant access_ok()
  rseq: optimise rseq_get_rseq_cs() and clear_rseq_cs()

 kernel/rseq.c | 61 +++++++++++++++++++++++++++++++++++++--------------
 1 file changed, 45 insertions(+), 16 deletions(-)

-- 
2.31.1.295.g9ea45b61b8-goog



* [PATCH v2 1/3] rseq: optimize rseq_update_cpu_id()
  2021-04-13 16:22 [PATCH v2 0/3] rseq: minor optimizations Eric Dumazet
@ 2021-04-13 16:22 ` Eric Dumazet
  2021-04-13 16:22 ` [PATCH v2 2/3] rseq: remove redundant access_ok() Eric Dumazet
  2021-04-13 16:22 ` [PATCH v2 3/3] rseq: optimise rseq_get_rseq_cs() and clear_rseq_cs() Eric Dumazet
  2 siblings, 0 replies; 26+ messages in thread
From: Eric Dumazet @ 2021-04-13 16:22 UTC (permalink / raw)
  To: Ingo Molnar, Mathieu Desnoyers, Peter Zijlstra
  Cc: Paul E . McKenney, Boqun Feng, Arjun Roy, linux-kernel,
	Eric Dumazet, Eric Dumazet

From: Eric Dumazet <edumazet@google.com>

The two put_user() calls in rseq_update_cpu_id() are replaced
by a pair of unsafe_put_user() calls within a single
user_write_access_begin()/user_write_access_end() section.

This removes one stac/clac pair on x86 in the fast path.

Signed-off-by: Eric Dumazet <edumazet@google.com>
Cc: Mathieu Desnoyers <mathieu.desnoyers@efficios.com>
Cc: Peter Zijlstra <peterz@infradead.org>
Cc: "Paul E. McKenney" <paulmck@kernel.org>
Cc: Boqun Feng <boqun.feng@gmail.com>
Cc: Arjun Roy <arjunroy@google.com>
Cc: Ingo Molnar <mingo@kernel.org>
---
 kernel/rseq.c | 15 +++++++++++----
 1 file changed, 11 insertions(+), 4 deletions(-)

diff --git a/kernel/rseq.c b/kernel/rseq.c
index a4f86a9d6937cdfa2f13d1dcc9be863c1943d06f..f020f18f512a3f6241c3c9b104ce50e4d2c6188c 100644
--- a/kernel/rseq.c
+++ b/kernel/rseq.c
@@ -84,13 +84,20 @@
 static int rseq_update_cpu_id(struct task_struct *t)
 {
 	u32 cpu_id = raw_smp_processor_id();
+	struct rseq __user *rseq = t->rseq;
 
-	if (put_user(cpu_id, &t->rseq->cpu_id_start))
-		return -EFAULT;
-	if (put_user(cpu_id, &t->rseq->cpu_id))
-		return -EFAULT;
+	if (!user_write_access_begin(rseq, sizeof(*rseq)))
+		goto efault;
+	unsafe_put_user(cpu_id, &rseq->cpu_id_start, efault_end);
+	unsafe_put_user(cpu_id, &rseq->cpu_id, efault_end);
+	user_write_access_end();
 	trace_rseq_update(t);
 	return 0;
+
+efault_end:
+	user_write_access_end();
+efault:
+	return -EFAULT;
 }
 
 static int rseq_reset_rseq_cpu_id(struct task_struct *t)
-- 
2.31.1.295.g9ea45b61b8-goog



* [PATCH v2 2/3] rseq: remove redundant access_ok()
  2021-04-13 16:22 [PATCH v2 0/3] rseq: minor optimizations Eric Dumazet
  2021-04-13 16:22 ` [PATCH v2 1/3] rseq: optimize rseq_update_cpu_id() Eric Dumazet
@ 2021-04-13 16:22 ` Eric Dumazet
  2021-04-13 16:22 ` [PATCH v2 3/3] rseq: optimise rseq_get_rseq_cs() and clear_rseq_cs() Eric Dumazet
  2 siblings, 0 replies; 26+ messages in thread
From: Eric Dumazet @ 2021-04-13 16:22 UTC (permalink / raw)
  To: Ingo Molnar, Mathieu Desnoyers, Peter Zijlstra
  Cc: Paul E . McKenney, Boqun Feng, Arjun Roy, linux-kernel,
	Eric Dumazet, Eric Dumazet

From: Eric Dumazet <edumazet@google.com>

After commit 8f2817701492 ("rseq: Use get_user/put_user rather
than __get_user/__put_user"), we no longer need the
access_ok() call in __rseq_handle_notify_resume().

Mathieu pointed out that the same cleanup can be done
in rseq_syscall().
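
For reference, get_user()/put_user() perform the user-pointer range
check themselves, so the explicit access_ok() is redundant. A
simplified sketch of the generic pattern (illustrative only, not any
particular arch's implementation):

	/* Conceptual shape of put_user(): range check + raw store. */
	#define put_user(x, ptr)					\
	({								\
		__typeof__(*(ptr)) __user *__p = (ptr);			\
		access_ok(__p, sizeof(*__p)) ?				\
			__put_user((x), __p) : -EFAULT;			\
	})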

Signed-off-by: Eric Dumazet <edumazet@google.com>
Cc: Mathieu Desnoyers <mathieu.desnoyers@efficios.com>
Cc: Peter Zijlstra <peterz@infradead.org>
Cc: "Paul E. McKenney" <paulmck@kernel.org>
Cc: Boqun Feng <boqun.feng@gmail.com>
Cc: Arjun Roy <arjunroy@google.com>
Cc: Ingo Molnar <mingo@kernel.org>
---
 kernel/rseq.c | 5 +----
 1 file changed, 1 insertion(+), 4 deletions(-)

diff --git a/kernel/rseq.c b/kernel/rseq.c
index f020f18f512a3f6241c3c9b104ce50e4d2c6188c..cfe01ab5253c1c424c0e8b25acbb6a8e1b41a5b6 100644
--- a/kernel/rseq.c
+++ b/kernel/rseq.c
@@ -273,8 +273,6 @@ void __rseq_handle_notify_resume(struct ksignal *ksig, struct pt_regs *regs)
 
 	if (unlikely(t->flags & PF_EXITING))
 		return;
-	if (unlikely(!access_ok(t->rseq, sizeof(*t->rseq))))
-		goto error;
 	ret = rseq_ip_fixup(regs);
 	if (unlikely(ret < 0))
 		goto error;
@@ -301,8 +299,7 @@ void rseq_syscall(struct pt_regs *regs)
 
 	if (!t->rseq)
 		return;
-	if (!access_ok(t->rseq, sizeof(*t->rseq)) ||
-	    rseq_get_rseq_cs(t, &rseq_cs) || in_rseq_cs(ip, &rseq_cs))
+	if (rseq_get_rseq_cs(t, &rseq_cs) || in_rseq_cs(ip, &rseq_cs))
 		force_sig(SIGSEGV);
 }
 
-- 
2.31.1.295.g9ea45b61b8-goog



* [PATCH v2 3/3] rseq: optimise rseq_get_rseq_cs() and clear_rseq_cs()
  2021-04-13 16:22 [PATCH v2 0/3] rseq: minor optimizations Eric Dumazet
  2021-04-13 16:22 ` [PATCH v2 1/3] rseq: optimize rseq_update_cpu_id() Eric Dumazet
  2021-04-13 16:22 ` [PATCH v2 2/3] rseq: remove redundant access_ok() Eric Dumazet
@ 2021-04-13 16:22 ` Eric Dumazet
  2021-04-13 16:54   ` Mathieu Desnoyers
  2 siblings, 1 reply; 26+ messages in thread
From: Eric Dumazet @ 2021-04-13 16:22 UTC (permalink / raw)
  To: Ingo Molnar, Mathieu Desnoyers, Peter Zijlstra
  Cc: Paul E . McKenney, Boqun Feng, Arjun Roy, linux-kernel,
	Eric Dumazet, Eric Dumazet

From: Eric Dumazet <edumazet@google.com>

Commit ec9c82e03a74 ("rseq: uapi: Declare rseq_cs field as union,
update includes") introduced a performance regression on our servers.

Using copy_from_user() and clear_user() for 64bit values
is suboptimal; we can use the faster put_user() and get_user() instead.

32bit arches can be changed to use the ptr32 field,
since the padding field must always be zero.

v2: added ideas from Peter and Mathieu about making this
    generic, since my initial patch only dealt with
    64bit arches.
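
For context, after commit ec9c82e03a74 the rseq_cs field is declared
as a union in the uapi, roughly as follows (paraphrased from
include/uapi/linux/rseq.h; the big-endian field order is elided):

	union {
		__u64 ptr64;
	#ifdef __LP64__
		__u64 ptr;
	#else
		struct {
			__u32 ptr32;
			__u32 padding;	/* Initialized to zero by user-space. */
		} ptr;
	#endif
	} rseq_cs;

A 32-bit kernel can therefore load/store ptr32 alone only because the
padding word is required to be zero.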

Signed-off-by: Eric Dumazet <edumazet@google.com>
Cc: Mathieu Desnoyers <mathieu.desnoyers@efficios.com>
Cc: Peter Zijlstra <peterz@infradead.org>
Cc: "Paul E. McKenney" <paulmck@kernel.org>
Cc: Boqun Feng <boqun.feng@gmail.com>
Cc: Arjun Roy <arjunroy@google.com>
Cc: Ingo Molnar <mingo@kernel.org>
---
 kernel/rseq.c | 41 +++++++++++++++++++++++++++++++++--------
 1 file changed, 33 insertions(+), 8 deletions(-)

diff --git a/kernel/rseq.c b/kernel/rseq.c
index cfe01ab5253c1c424c0e8b25acbb6a8e1b41a5b6..f2eee3f7f5d330688c81cb2e57d47ca6b843873e 100644
--- a/kernel/rseq.c
+++ b/kernel/rseq.c
@@ -119,23 +119,46 @@ static int rseq_reset_rseq_cpu_id(struct task_struct *t)
 	return 0;
 }
 
+#ifdef CONFIG_64BIT
+static int rseq_get_cs_ptr(struct rseq_cs __user **uptrp,
+			   const struct rseq __user *rseq)
+{
+	u64 ptr;
+
+	if (get_user(ptr, &rseq->rseq_cs.ptr64))
+		return -EFAULT;
+	*uptrp = (struct rseq_cs __user *)ptr;
+	return 0;
+}
+#else
+static int rseq_get_cs_ptr(struct rseq_cs __user **uptrp,
+			   const struct rseq __user *rseq)
+{
+	u32 ptr;
+
+	if (get_user(ptr, &rseq->rseq_cs.ptr.ptr32))
+		return -EFAULT;
+	*uptrp = (struct rseq_cs __user *)ptr;
+	return 0;
+}
+#endif
+
 static int rseq_get_rseq_cs(struct task_struct *t, struct rseq_cs *rseq_cs)
 {
 	struct rseq_cs __user *urseq_cs;
-	u64 ptr;
 	u32 __user *usig;
 	u32 sig;
 	int ret;
 
-	if (copy_from_user(&ptr, &t->rseq->rseq_cs.ptr64, sizeof(ptr)))
+	if (rseq_get_cs_ptr(&urseq_cs, t->rseq))
 		return -EFAULT;
-	if (!ptr) {
+	if (!urseq_cs) {
 		memset(rseq_cs, 0, sizeof(*rseq_cs));
 		return 0;
 	}
-	if (ptr >= TASK_SIZE)
+	if ((unsigned long)urseq_cs >= TASK_SIZE)
 		return -EINVAL;
-	urseq_cs = (struct rseq_cs __user *)(unsigned long)ptr;
+
 	if (copy_from_user(rseq_cs, urseq_cs, sizeof(*rseq_cs)))
 		return -EFAULT;
 
@@ -211,9 +234,11 @@ static int clear_rseq_cs(struct task_struct *t)
 	 *
 	 * Set rseq_cs to NULL.
 	 */
-	if (clear_user(&t->rseq->rseq_cs.ptr64, sizeof(t->rseq->rseq_cs.ptr64)))
-		return -EFAULT;
-	return 0;
+#ifdef CONFIG_64BIT
+	return put_user(0UL, &t->rseq->rseq_cs.ptr64);
+#else
+	return put_user(0UL, &t->rseq->rseq_cs.ptr.ptr32);
+#endif
 }
 
 /*
-- 
2.31.1.295.g9ea45b61b8-goog



* Re: [PATCH v2 3/3] rseq: optimise rseq_get_rseq_cs() and clear_rseq_cs()
  2021-04-13 16:22 ` [PATCH v2 3/3] rseq: optimise rseq_get_rseq_cs() and clear_rseq_cs() Eric Dumazet
@ 2021-04-13 16:54   ` Mathieu Desnoyers
  2021-04-13 16:57     ` Eric Dumazet
  0 siblings, 1 reply; 26+ messages in thread
From: Mathieu Desnoyers @ 2021-04-13 16:54 UTC (permalink / raw)
  To: Eric Dumazet, David Laight
  Cc: Ingo Molnar, Peter Zijlstra, paulmck, Boqun Feng, Arjun Roy,
	linux-kernel, Eric Dumazet

----- On Apr 13, 2021, at 12:22 PM, Eric Dumazet eric.dumazet@gmail.com wrote:

> From: Eric Dumazet <edumazet@google.com>
> 
> Commit ec9c82e03a74 ("rseq: uapi: Declare rseq_cs field as union,
> update includes") added regressions for our servers.
> 
> Using copy_from_user() and clear_user() for 64bit values
> is suboptimal.
> 
> We can use faster put_user() and get_user().
> 
> 32bit arches can be changed to use the ptr32 field,
> since the padding field must always be zero.
> 
> v2: added ideas from Peter and Mathieu about making this
>    generic, since my initial patch was only dealing with
>    64bit arches.

Ah, now I remember the reason why reading and clearing the entire 64-bit
field is important: it's because we don't want to allow user-space processes
to use this change in behavior to figure out whether they are running on a
32-bit kernel or in 32-bit compat mode on a 64-bit kernel.

So although I'm fine with making 64-bit kernels faster, we'll want to keep
updating the entire 64-bit ptr field on 32-bit kernels as well.

Thanks,

Mathieu

> 
> Signed-off-by: Eric Dumazet <edumazet@google.com>
> Cc: Mathieu Desnoyers <mathieu.desnoyers@efficios.com>
> Cc: Peter Zijlstra <peterz@infradead.org>
> Cc: "Paul E. McKenney" <paulmck@kernel.org>
> Cc: Boqun Feng <boqun.feng@gmail.com>
> Cc: Arjun Roy <arjunroy@google.com>
> Cc: Ingo Molnar <mingo@kernel.org>
> ---
> kernel/rseq.c | 41 +++++++++++++++++++++++++++++++++--------
> 1 file changed, 33 insertions(+), 8 deletions(-)
> 
> diff --git a/kernel/rseq.c b/kernel/rseq.c
> index cfe01ab5253c1c424c0e8b25acbb6a8e1b41a5b6..f2eee3f7f5d330688c81cb2e57d47ca6b843873e 100644
> --- a/kernel/rseq.c
> +++ b/kernel/rseq.c
> @@ -119,23 +119,46 @@ static int rseq_reset_rseq_cpu_id(struct task_struct *t)
> 	return 0;
> }
> 
> +#ifdef CONFIG_64BIT
> +static int rseq_get_cs_ptr(struct rseq_cs __user **uptrp,
> +			   const struct rseq __user *rseq)
> +{
> +	u64 ptr;
> +
> +	if (get_user(ptr, &rseq->rseq_cs.ptr64))
> +		return -EFAULT;
> +	*uptrp = (struct rseq_cs __user *)ptr;
> +	return 0;
> +}
> +#else
> +static int rseq_get_cs_ptr(struct rseq_cs __user **uptrp,
> +			   const struct rseq __user *rseq)
> +{
> +	u32 ptr;
> +
> +	if (get_user(ptr, &rseq->rseq_cs.ptr.ptr32))
> +		return -EFAULT;
> +	*uptrp = (struct rseq_cs __user *)ptr;
> +	return 0;
> +}
> +#endif
> +
> static int rseq_get_rseq_cs(struct task_struct *t, struct rseq_cs *rseq_cs)
> {
> 	struct rseq_cs __user *urseq_cs;
> -	u64 ptr;
> 	u32 __user *usig;
> 	u32 sig;
> 	int ret;
> 
> -	if (copy_from_user(&ptr, &t->rseq->rseq_cs.ptr64, sizeof(ptr)))
> +	if (rseq_get_cs_ptr(&urseq_cs, t->rseq))
> 		return -EFAULT;
> -	if (!ptr) {
> +	if (!urseq_cs) {
> 		memset(rseq_cs, 0, sizeof(*rseq_cs));
> 		return 0;
> 	}
> -	if (ptr >= TASK_SIZE)
> +	if ((unsigned long)urseq_cs >= TASK_SIZE)
> 		return -EINVAL;
> -	urseq_cs = (struct rseq_cs __user *)(unsigned long)ptr;
> +
> 	if (copy_from_user(rseq_cs, urseq_cs, sizeof(*rseq_cs)))
> 		return -EFAULT;
> 
> @@ -211,9 +234,11 @@ static int clear_rseq_cs(struct task_struct *t)
> 	 *
> 	 * Set rseq_cs to NULL.
> 	 */
> -	if (clear_user(&t->rseq->rseq_cs.ptr64, sizeof(t->rseq->rseq_cs.ptr64)))
> -		return -EFAULT;
> -	return 0;
> +#ifdef CONFIG_64BIT
> +	return put_user(0UL, &t->rseq->rseq_cs.ptr64);
> +#else
> +	return put_user(0UL, &t->rseq->rseq_cs.ptr.ptr32);
> +#endif
> }
> 
> /*
> --
> 2.31.1.295.g9ea45b61b8-goog

-- 
Mathieu Desnoyers
EfficiOS Inc.
http://www.efficios.com


* Re: [PATCH v2 3/3] rseq: optimise rseq_get_rseq_cs() and clear_rseq_cs()
  2021-04-13 16:54   ` Mathieu Desnoyers
@ 2021-04-13 16:57     ` Eric Dumazet
  2021-04-13 17:01       ` Eric Dumazet
  2021-04-13 17:06       ` Mathieu Desnoyers
  0 siblings, 2 replies; 26+ messages in thread
From: Eric Dumazet @ 2021-04-13 16:57 UTC (permalink / raw)
  To: Mathieu Desnoyers
  Cc: Eric Dumazet, David Laight, Ingo Molnar, Peter Zijlstra, paulmck,
	Boqun Feng, Arjun Roy, linux-kernel

On Tue, Apr 13, 2021 at 6:54 PM Mathieu Desnoyers
<mathieu.desnoyers@efficios.com> wrote:
>
> ----- On Apr 13, 2021, at 12:22 PM, Eric Dumazet eric.dumazet@gmail.com wrote:
>
> > From: Eric Dumazet <edumazet@google.com>
> >
> > Commit ec9c82e03a74 ("rseq: uapi: Declare rseq_cs field as union,
> > update includes") added regressions for our servers.
> >
> > Using copy_from_user() and clear_user() for 64bit values
> > is suboptimal.
> >
> > We can use faster put_user() and get_user().
> >
> > 32bit arches can be changed to use the ptr32 field,
> > since the padding field must always be zero.
> >
> > v2: added ideas from Peter and Mathieu about making this
> >    generic, since my initial patch was only dealing with
> >    64bit arches.
>
> Ah, now I remember the reason why reading and clearing the entire 64-bit
> is important: it's because we don't want to allow user-space processes to
> use this change in behavior to figure out whether they are running on a
> 32-bit or in a 32-bit compat mode on a 64-bit kernel.
>
> So although I'm fine with making 64-bit kernels faster, we'll want to keep
> updating the entire 64-bit ptr field on 32-bit kernels as well.
>
> Thanks,
>

So... back to V1 then ?


* Re: [PATCH v2 3/3] rseq: optimise rseq_get_rseq_cs() and clear_rseq_cs()
  2021-04-13 16:57     ` Eric Dumazet
@ 2021-04-13 17:01       ` Eric Dumazet
  2021-04-13 17:07         ` Eric Dumazet
  2021-04-13 17:06       ` Mathieu Desnoyers
  1 sibling, 1 reply; 26+ messages in thread
From: Eric Dumazet @ 2021-04-13 17:01 UTC (permalink / raw)
  To: Mathieu Desnoyers
  Cc: Eric Dumazet, David Laight, Ingo Molnar, Peter Zijlstra, paulmck,
	Boqun Feng, Arjun Roy, linux-kernel

On Tue, Apr 13, 2021 at 6:57 PM Eric Dumazet <edumazet@google.com> wrote:
>
> On Tue, Apr 13, 2021 at 6:54 PM Mathieu Desnoyers
> <mathieu.desnoyers@efficios.com> wrote:
> >
> > ----- On Apr 13, 2021, at 12:22 PM, Eric Dumazet eric.dumazet@gmail.com wrote:
> >
> > > From: Eric Dumazet <edumazet@google.com>
> > >
> > > Commit ec9c82e03a74 ("rseq: uapi: Declare rseq_cs field as union,
> > > update includes") added regressions for our servers.
> > >
> > > Using copy_from_user() and clear_user() for 64bit values
> > > is suboptimal.
> > >
> > > We can use faster put_user() and get_user().
> > >
> > > 32bit arches can be changed to use the ptr32 field,
> > > since the padding field must always be zero.
> > >
> > > v2: added ideas from Peter and Mathieu about making this
> > >    generic, since my initial patch was only dealing with
> > >    64bit arches.
> >
> > Ah, now I remember the reason why reading and clearing the entire 64-bit
> > is important: it's because we don't want to allow user-space processes to
> > use this change in behavior to figure out whether they are running on a
> > 32-bit or in a 32-bit compat mode on a 64-bit kernel.
> >
> > So although I'm fine with making 64-bit kernels faster, we'll want to keep
> > updating the entire 64-bit ptr field on 32-bit kernels as well.
> >
> > Thanks,
> >
>
> So... back to V1 then ?

Or add more stuff as in :

#ifdef CONFIG_64BIT
	return put_user(0UL, &t->rseq->rseq_cs.ptr64);
#else
	return put_user(0UL, &t->rseq->rseq_cs.ptr.ptr32) |
	       put_user(0UL, &t->rseq->rseq_cs.ptr.padding);
#endif


* Re: [PATCH v2 3/3] rseq: optimise rseq_get_rseq_cs() and clear_rseq_cs()
  2021-04-13 16:57     ` Eric Dumazet
  2021-04-13 17:01       ` Eric Dumazet
@ 2021-04-13 17:06       ` Mathieu Desnoyers
  1 sibling, 0 replies; 26+ messages in thread
From: Mathieu Desnoyers @ 2021-04-13 17:06 UTC (permalink / raw)
  To: Eric Dumazet
  Cc: Eric Dumazet, David Laight, Ingo Molnar, Peter Zijlstra, paulmck,
	Boqun Feng, Arjun Roy, linux-kernel

----- On Apr 13, 2021, at 12:57 PM, Eric Dumazet edumazet@google.com wrote:

> On Tue, Apr 13, 2021 at 6:54 PM Mathieu Desnoyers
> <mathieu.desnoyers@efficios.com> wrote:
>>
>> ----- On Apr 13, 2021, at 12:22 PM, Eric Dumazet eric.dumazet@gmail.com wrote:
>>
>> > From: Eric Dumazet <edumazet@google.com>
>> >
>> > Commit ec9c82e03a74 ("rseq: uapi: Declare rseq_cs field as union,
>> > update includes") added regressions for our servers.
>> >
>> > Using copy_from_user() and clear_user() for 64bit values
>> > is suboptimal.
>> >
>> > We can use faster put_user() and get_user().
>> >
>> > 32bit arches can be changed to use the ptr32 field,
>> > since the padding field must always be zero.
>> >
>> > v2: added ideas from Peter and Mathieu about making this
>> >    generic, since my initial patch was only dealing with
>> >    64bit arches.
>>
>> Ah, now I remember the reason why reading and clearing the entire 64-bit
>> is important: it's because we don't want to allow user-space processes to
>> use this change in behavior to figure out whether they are running on a
>> 32-bit or in a 32-bit compat mode on a 64-bit kernel.
>>
>> So although I'm fine with making 64-bit kernels faster, we'll want to keep
>> updating the entire 64-bit ptr field on 32-bit kernels as well.
>>
>> Thanks,
>>
> 
> So... back to V1 then ?

In terms of behavior, yes. And it's probably the "easy" fix, but I hate that
it adds lots of preprocessor ifdefs into the rseq code.

But this would require auditing get_user()/put_user() for each architecture
supported by rseq to ensure they support 8-byte load/store. And it would become
an added burden on architecture maintainers wishing to add rseq support for their
architecture.

One alternative would be to implement rseq_get_user_u64 and rseq_put_user_u64
wrappers as static functions within rseq.c to hide the preprocessor ifdeffery
from the higher-level code. I try very hard to avoid mixing preprocessor ifdefs
with C code logic whenever I can.
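
A minimal sketch of the shape I have in mind (untested; the 32-bit
fallback to copy_from_user()/copy_to_user() is an assumption):

	static int rseq_get_user_u64(u64 *valp, u64 __user *uptr)
	{
	#ifdef CONFIG_64BIT
		return get_user(*valp, uptr);
	#else
		/* 32-bit keeps the full 64-bit access. */
		return copy_from_user(valp, uptr, sizeof(*valp)) ? -EFAULT : 0;
	#endif
	}

	static int rseq_put_user_u64(u64 val, u64 __user *uptr)
	{
	#ifdef CONFIG_64BIT
		return put_user(val, uptr);
	#else
		return copy_to_user(uptr, &val, sizeof(val)) ? -EFAULT : 0;
	#endif
	}

Callers then stay free of ifdefs, e.g. clear_rseq_cs() becomes a
single rseq_put_user_u64(0, &t->rseq->rseq_cs.ptr64) call.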

Thanks,

Mathieu

-- 
Mathieu Desnoyers
EfficiOS Inc.
http://www.efficios.com


* Re: [PATCH v2 3/3] rseq: optimise rseq_get_rseq_cs() and clear_rseq_cs()
  2021-04-13 17:01       ` Eric Dumazet
@ 2021-04-13 17:07         ` Eric Dumazet
  2021-04-13 17:20           ` Mathieu Desnoyers
  0 siblings, 1 reply; 26+ messages in thread
From: Eric Dumazet @ 2021-04-13 17:07 UTC (permalink / raw)
  To: Mathieu Desnoyers
  Cc: Eric Dumazet, David Laight, Ingo Molnar, Peter Zijlstra, paulmck,
	Boqun Feng, Arjun Roy, linux-kernel

On Tue, Apr 13, 2021 at 7:01 PM Eric Dumazet <edumazet@google.com> wrote:
>
> On Tue, Apr 13, 2021 at 6:57 PM Eric Dumazet <edumazet@google.com> wrote:
> >
> > On Tue, Apr 13, 2021 at 6:54 PM Mathieu Desnoyers
> > <mathieu.desnoyers@efficios.com> wrote:
> > >
> > > ----- On Apr 13, 2021, at 12:22 PM, Eric Dumazet eric.dumazet@gmail.com wrote:
> > >
> > > > From: Eric Dumazet <edumazet@google.com>
> > > >
> > > > Commit ec9c82e03a74 ("rseq: uapi: Declare rseq_cs field as union,
> > > > update includes") added regressions for our servers.
> > > >
> > > > Using copy_from_user() and clear_user() for 64bit values
> > > > is suboptimal.
> > > >
> > > > We can use faster put_user() and get_user().
> > > >
> > > > 32bit arches can be changed to use the ptr32 field,
> > > > since the padding field must always be zero.
> > > >
> > > > v2: added ideas from Peter and Mathieu about making this
> > > >    generic, since my initial patch was only dealing with
> > > >    64bit arches.
> > >
> > > Ah, now I remember the reason why reading and clearing the entire 64-bit
> > > is important: it's because we don't want to allow user-space processes to
> > > use this change in behavior to figure out whether they are running on a
> > > 32-bit or in a 32-bit compat mode on a 64-bit kernel.
> > >
> > > So although I'm fine with making 64-bit kernels faster, we'll want to keep
> > > updating the entire 64-bit ptr field on 32-bit kernels as well.
> > >
> > > Thanks,
> > >
> >
> > So... back to V1 then ?
>
> Or add more stuff as in :

diff against v2, WDYT ?

diff --git a/kernel/rseq.c b/kernel/rseq.c
index f2eee3f7f5d330688c81cb2e57d47ca6b843873e..537b1f684efa11069990018ffa3642c209993011 100644
--- a/kernel/rseq.c
+++ b/kernel/rseq.c
@@ -136,6 +136,10 @@ static int rseq_get_cs_ptr(struct rseq_cs __user **uptrp,
 {
        u32 ptr;

+       if (get_user(ptr, &rseq->rseq_cs.ptr.padding))
+               return -EFAULT;
+       if (ptr)
+               return -EINVAL;
        if (get_user(ptr, &rseq->rseq_cs.ptr.ptr32))
                return -EFAULT;
        *uptrp = (struct rseq_cs __user *)ptr;
@@ -150,8 +154,9 @@ static int rseq_get_rseq_cs(struct task_struct *t, struct rseq_cs *rseq_cs)
        u32 sig;
        int ret;

-       if (rseq_get_cs_ptr(&urseq_cs, t->rseq))
-               return -EFAULT;
+       ret = rseq_get_cs_ptr(&urseq_cs, t->rseq);
+       if (ret)
+               return ret;
        if (!urseq_cs) {
                memset(rseq_cs, 0, sizeof(*rseq_cs));
                return 0;
@@ -237,7 +242,8 @@ static int clear_rseq_cs(struct task_struct *t)
 #ifdef CONFIG_64BIT
        return put_user(0UL, &t->rseq->rseq_cs.ptr64);
 #else
-       return put_user(0UL, &t->rseq->rseq_cs.ptr.ptr32);
+       return put_user(0UL, &t->rseq->rseq_cs.ptr.ptr32) |
+              put_user(0UL, &t->rseq->rseq_cs.ptr.padding);
 #endif
 }


* Re: [PATCH v2 3/3] rseq: optimise rseq_get_rseq_cs() and clear_rseq_cs()
  2021-04-13 17:07         ` Eric Dumazet
@ 2021-04-13 17:20           ` Mathieu Desnoyers
  2021-04-13 17:33             ` Eric Dumazet
  0 siblings, 1 reply; 26+ messages in thread
From: Mathieu Desnoyers @ 2021-04-13 17:20 UTC (permalink / raw)
  To: Eric Dumazet
  Cc: Eric Dumazet, David Laight, Ingo Molnar, Peter Zijlstra, paulmck,
	Boqun Feng, Arjun Roy, linux-kernel

----- On Apr 13, 2021, at 1:07 PM, Eric Dumazet edumazet@google.com wrote:

> On Tue, Apr 13, 2021 at 7:01 PM Eric Dumazet <edumazet@google.com> wrote:
>>
>> On Tue, Apr 13, 2021 at 6:57 PM Eric Dumazet <edumazet@google.com> wrote:
>> >
>> > On Tue, Apr 13, 2021 at 6:54 PM Mathieu Desnoyers
>> > <mathieu.desnoyers@efficios.com> wrote:
>> > >
>> > > ----- On Apr 13, 2021, at 12:22 PM, Eric Dumazet eric.dumazet@gmail.com wrote:
>> > >
>> > > > From: Eric Dumazet <edumazet@google.com>
>> > > >
>> > > > Commit ec9c82e03a74 ("rseq: uapi: Declare rseq_cs field as union,
>> > > > update includes") added regressions for our servers.
>> > > >
>> > > > Using copy_from_user() and clear_user() for 64bit values
>> > > > is suboptimal.
>> > > >
>> > > > We can use faster put_user() and get_user().
>> > > >
>> > > > 32bit arches can be changed to use the ptr32 field,
>> > > > since the padding field must always be zero.
>> > > >
>> > > > v2: added ideas from Peter and Mathieu about making this
>> > > >    generic, since my initial patch was only dealing with
>> > > >    64bit arches.
>> > >
>> > > Ah, now I remember the reason why reading and clearing the entire 64-bit
>> > > is important: it's because we don't want to allow user-space processes to
>> > > use this change in behavior to figure out whether they are running on a
>> > > 32-bit or in a 32-bit compat mode on a 64-bit kernel.
>> > >
>> > > So although I'm fine with making 64-bit kernels faster, we'll want to keep
>> > > updating the entire 64-bit ptr field on 32-bit kernels as well.
>> > >
>> > > Thanks,
>> > >
>> >
>> > So... back to V1 then ?
>>
>> Or add more stuff as in :
> 
> diff against v2, WDYT ?

I like this approach slightly better, because it moves the preprocessor ifdefs into
rseq_get_rseq_cs and clear_rseq_cs, while keeping the same behavior for a 32-bit
process running on native 32-bit kernel and as compat task on a 64-bit kernel.

That being said, I don't expect anyone to care much about performance of 32-bit
kernels, so we could use copy_from_user() on 32-bit kernels to remove special-cases
in 32-bit specific code. This would eliminate the 32-bit specific "padding" read, and
let the TASK_SIZE comparison handle the check for both 32-bit and 64-bit kernels.

As for clear_user(), I wonder whether we could simply keep using it, but change the
clear_user() macro to figure out that it can use a faster 8-byte put_user ? I find it
odd that performance optimizations which would be relevant elsewhere creep into the
rseq code.
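
Something like this hypothetical wrapper illustrates the idea (the
real change would go into the clear_user() machinery itself):

	static inline unsigned long
	clear_user_small(void __user *to, unsigned long n)
	{
		/* Hypothetical fast path for 8-byte constant sizes. */
		if (__builtin_constant_p(n) && n == sizeof(u64))
			return put_user(0ULL, (u64 __user *)to) ? n : 0;
		return clear_user(to, n);
	}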

Thanks,

Mathieu

> 
> diff --git a/kernel/rseq.c b/kernel/rseq.c
> index f2eee3f7f5d330688c81cb2e57d47ca6b843873e..537b1f684efa11069990018ffa3642c209993011 100644
> --- a/kernel/rseq.c
> +++ b/kernel/rseq.c
> @@ -136,6 +136,10 @@ static int rseq_get_cs_ptr(struct rseq_cs __user **uptrp,
> {
>        u32 ptr;
> 
> +       if (get_user(ptr, &rseq->rseq_cs.ptr.padding))
> +               return -EFAULT;
> +       if (ptr)
> +               return -EINVAL;
>        if (get_user(ptr, &rseq->rseq_cs.ptr.ptr32))
>                return -EFAULT;
>        *uptrp = (struct rseq_cs __user *)ptr;
> @@ -150,8 +154,9 @@ static int rseq_get_rseq_cs(struct task_struct *t, struct rseq_cs *rseq_cs)
>        u32 sig;
>        int ret;
> 
> -       if (rseq_get_cs_ptr(&urseq_cs, t->rseq))
> -               return -EFAULT;
> +       ret = rseq_get_cs_ptr(&urseq_cs, t->rseq);
> +       if (ret)
> +               return ret;
>        if (!urseq_cs) {
>                memset(rseq_cs, 0, sizeof(*rseq_cs));
>                return 0;
> @@ -237,7 +242,8 @@ static int clear_rseq_cs(struct task_struct *t)
> #ifdef CONFIG_64BIT
>        return put_user(0UL, &t->rseq->rseq_cs.ptr64);
> #else
> -       return put_user(0UL, &t->rseq->rseq_cs.ptr.ptr32);
> +       return put_user(0UL, &t->rseq->rseq_cs.ptr.ptr32) |
> +              put_user(0UL, &t->rseq->rseq_cs.ptr.padding);
> #endif
>  }

-- 
Mathieu Desnoyers
EfficiOS Inc.
http://www.efficios.com


* Re: [PATCH v2 3/3] rseq: optimise rseq_get_rseq_cs() and clear_rseq_cs()
  2021-04-13 17:20           ` Mathieu Desnoyers
@ 2021-04-13 17:33             ` Eric Dumazet
  2021-04-13 18:00               ` Mathieu Desnoyers
  0 siblings, 1 reply; 26+ messages in thread
From: Eric Dumazet @ 2021-04-13 17:33 UTC (permalink / raw)
  To: Mathieu Desnoyers
  Cc: Eric Dumazet, David Laight, Ingo Molnar, Peter Zijlstra, paulmck,
	Boqun Feng, Arjun Roy, linux-kernel

On Tue, Apr 13, 2021 at 7:20 PM Mathieu Desnoyers
<mathieu.desnoyers@efficios.com> wrote:
>
> ----- On Apr 13, 2021, at 1:07 PM, Eric Dumazet edumazet@google.com wrote:
>
> > On Tue, Apr 13, 2021 at 7:01 PM Eric Dumazet <edumazet@google.com> wrote:
> >>
> >> On Tue, Apr 13, 2021 at 6:57 PM Eric Dumazet <edumazet@google.com> wrote:
> >> >
> >> > On Tue, Apr 13, 2021 at 6:54 PM Mathieu Desnoyers
> >> > <mathieu.desnoyers@efficios.com> wrote:
> >> > >
> >> > > ----- On Apr 13, 2021, at 12:22 PM, Eric Dumazet eric.dumazet@gmail.com wrote:
> >> > >
> >> > > > From: Eric Dumazet <edumazet@google.com>
> >> > > >
> >> > > > Commit ec9c82e03a74 ("rseq: uapi: Declare rseq_cs field as union,
> >> > > > update includes") added regressions for our servers.
> >> > > >
> >> > > > Using copy_from_user() and clear_user() for 64bit values
> >> > > > is suboptimal.
> >> > > >
> >> > > > We can use faster put_user() and get_user().
> >> > > >
> >> > > > 32bit arches can be changed to use the ptr32 field,
> >> > > > since the padding field must always be zero.
> >> > > >
> >> > > > v2: added ideas from Peter and Mathieu about making this
> >> > > >    generic, since my initial patch was only dealing with
> >> > > >    64bit arches.
> >> > >
> >> > > Ah, now I remember the reason why reading and clearing the entire 64-bit
> >> > > is important: it's because we don't want to allow user-space processes to
> >> > > use this change in behavior to figure out whether they are running on a
> >> > > 32-bit or in a 32-bit compat mode on a 64-bit kernel.
> >> > >
> >> > > So although I'm fine with making 64-bit kernels faster, we'll want to keep
> >> > > updating the entire 64-bit ptr field on 32-bit kernels as well.
> >> > >
> >> > > Thanks,
> >> > >
> >> >
> >> > So... back to V1 then ?
> >>
> >> Or add more stuff as in :
> >
> > diff against v2, WDYT ?
>
> I like this approach slightly better, because it moves the preprocessor ifdefs into
> rseq_get_rseq_cs and clear_rseq_cs, while keeping the same behavior for a 32-bit
> process running on native 32-bit kernel and as compat task on a 64-bit kernel.
>
> That being said, I don't expect anyone to care much about performance of 32-bit
> kernels, so we could use copy_from_user() on 32-bit kernels to remove special-cases
> in 32-bit specific code. This would eliminate the 32-bit specific "padding" read, and
> let the TASK_SIZE comparison handle the check for both 32-bit and 64-bit kernels.
>
> As for clear_user(), I wonder whether we could simply keep using it, but change the
> clear_user() macro to figure out that it can use a faster 8-byte put_user ? I find it
> odd that performance optimizations which would be relevant elsewhere creep into the
> rseq code.


clear_user() is a maze of arch-dependent macros/functions/assembly

I guess the same could be said of copy_in_user(), but apparently we removed
that special-casing, as in commit a41e0d754240fe8ca9c4f2070bf67e3b0228aa22.

It definitely seems odd to have to carefully choose between multiple methods.


>
> Thanks,
>
> Mathieu
>
> >
> > diff --git a/kernel/rseq.c b/kernel/rseq.c
> > index f2eee3f7f5d330688c81cb2e57d47ca6b843873e..537b1f684efa11069990018ffa3642c209993011 100644
> > --- a/kernel/rseq.c
> > +++ b/kernel/rseq.c
> > @@ -136,6 +136,10 @@ static int rseq_get_cs_ptr(struct rseq_cs __user **uptrp,
> > {
> >        u32 ptr;
> >
> > +       if (get_user(ptr, &rseq->rseq_cs.ptr.padding))
> > +               return -EFAULT;
> > +       if (ptr)
> > +               return -EINVAL;
> >        if (get_user(ptr, &rseq->rseq_cs.ptr.ptr32))
> >                return -EFAULT;
> >        *uptrp = (struct rseq_cs __user *)ptr;
> > @@ -150,8 +154,9 @@ static int rseq_get_rseq_cs(struct task_struct *t, struct rseq_cs *rseq_cs)
> >        u32 sig;
> >        int ret;
> >
> > -       if (rseq_get_cs_ptr(&urseq_cs, t->rseq))
> > -               return -EFAULT;
> > +       ret = rseq_get_cs_ptr(&urseq_cs, t->rseq);
> > +       if (ret)
> > +               return ret;
> >        if (!urseq_cs) {
> >                memset(rseq_cs, 0, sizeof(*rseq_cs));
> >                return 0;
> > @@ -237,7 +242,8 @@ static int clear_rseq_cs(struct task_struct *t)
> > #ifdef CONFIG_64BIT
> >        return put_user(0UL, &t->rseq->rseq_cs.ptr64);
> > #else
> > -       return put_user(0UL, &t->rseq->rseq_cs.ptr.ptr32);
> > +       return put_user(0UL, &t->rseq->rseq_cs.ptr.ptr32) |
> > +              put_user(0UL, &t->rseq->rseq_cs.ptr.padding);
> > #endif
> >  }
>
> --
> Mathieu Desnoyers
> EfficiOS Inc.
> http://www.efficios.com


* Re: [PATCH v2 3/3] rseq: optimise rseq_get_rseq_cs() and clear_rseq_cs()
  2021-04-13 17:33             ` Eric Dumazet
@ 2021-04-13 18:00               ` Mathieu Desnoyers
  2021-04-13 18:22                 ` Eric Dumazet
  0 siblings, 1 reply; 26+ messages in thread
From: Mathieu Desnoyers @ 2021-04-13 18:00 UTC (permalink / raw)
  To: Eric Dumazet
  Cc: Eric Dumazet, David Laight, Ingo Molnar, Peter Zijlstra, paulmck,
	Boqun Feng, Arjun Roy, linux-kernel

----- On Apr 13, 2021, at 1:33 PM, Eric Dumazet edumazet@google.com wrote:

> On Tue, Apr 13, 2021 at 7:20 PM Mathieu Desnoyers
> <mathieu.desnoyers@efficios.com> wrote:
>>
>> ----- On Apr 13, 2021, at 1:07 PM, Eric Dumazet edumazet@google.com wrote:
>>
>> > On Tue, Apr 13, 2021 at 7:01 PM Eric Dumazet <edumazet@google.com> wrote:
>> >>
>> >> On Tue, Apr 13, 2021 at 6:57 PM Eric Dumazet <edumazet@google.com> wrote:
>> >> >
>> >> > On Tue, Apr 13, 2021 at 6:54 PM Mathieu Desnoyers
>> >> > <mathieu.desnoyers@efficios.com> wrote:
>> >> > >
>> >> > > ----- On Apr 13, 2021, at 12:22 PM, Eric Dumazet eric.dumazet@gmail.com wrote:
>> >> > >
>> >> > > > From: Eric Dumazet <edumazet@google.com>
>> >> > > >
>> >> > > > Commit ec9c82e03a74 ("rseq: uapi: Declare rseq_cs field as union,
>> >> > > > update includes") added regressions for our servers.
>> >> > > >
>> >> > > > Using copy_from_user() and clear_user() for 64bit values
>> >> > > > is suboptimal.
>> >> > > >
>> >> > > > We can use faster put_user() and get_user().
>> >> > > >
>> >> > > > 32bit arches can be changed to use the ptr32 field,
>> >> > > > since the padding field must always be zero.
>> >> > > >
>> >> > > > v2: added ideas from Peter and Mathieu about making this
>> >> > > >    generic, since my initial patch was only dealing with
>> >> > > >    64bit arches.
>> >> > >
>> >> > > Ah, now I remember the reason why reading and clearing the entire 64-bit
>> >> > > is important: it's because we don't want to allow user-space processes to
>> >> > > use this change in behavior to figure out whether they are running on a
>> >> > > 32-bit or in a 32-bit compat mode on a 64-bit kernel.
>> >> > >
>> >> > > So although I'm fine with making 64-bit kernels faster, we'll want to keep
>> >> > > updating the entire 64-bit ptr field on 32-bit kernels as well.
>> >> > >
>> >> > > Thanks,
>> >> > >
>> >> >
>> >> > So... back to V1 then ?
>> >>
>> >> Or add more stuff as in :
>> >
>> > diff against v2, WDYT ?
>>
>> I like this approach slightly better, because it moves the preprocessor ifdefs into
>> rseq_get_rseq_cs and clear_rseq_cs, while keeping the same behavior for a 32-bit
>> process running on native 32-bit kernel and as compat task on a 64-bit kernel.
>>
>> That being said, I don't expect anyone to care much about performance of 32-bit
>> kernels, so we could use copy_from_user() on 32-bit kernels to remove special-cases
>> in 32-bit specific code. This would eliminate the 32-bit specific "padding" read, and
>> let the TASK_SIZE comparison handle the check for both 32-bit and 64-bit kernels.
>>
>> As for clear_user(), I wonder whether we could simply keep using it, but change the
>> clear_user() macro to figure out that it can use a faster 8-byte put_user ? I find it
>> odd that performance optimizations which would be relevant elsewhere creep into the
>> rseq code.
> 
> 
> clear_user() is a maze of arch-dependent macros/functions/assembly
> 
> I guess the same could be said from  copy_in_user(), but apparently we removed
> special-casing, like in commit a41e0d754240fe8ca9c4f2070bf67e3b0228aa22
> 
> Definitely it seems odd having to carefully choose between multiple methods.

As long as the ifdefs are localized within clearly identified wrappers in the
rseq code I don't mind doing the special-casing there.

The point which remains is that I don't think we want to optimize for speed
on 32-bit architectures when it adds special-casing and complexity to the 32-bit
build. I suspect there is less and less testing performed on 32-bit architectures
nowadays, and it's good that as much code as possible is shared between 32-bit and
64-bit builds to share the test coverage.

Thanks,

Mathieu

> 
> 
>>
>> Thanks,
>>
>> Mathieu
>>
>> >
>> > diff --git a/kernel/rseq.c b/kernel/rseq.c
>> > index f2eee3f7f5d330688c81cb2e57d47ca6b843873e..537b1f684efa11069990018ffa3642c209993011 100644
>> > --- a/kernel/rseq.c
>> > +++ b/kernel/rseq.c
>> > @@ -136,6 +136,10 @@ static int rseq_get_cs_ptr(struct rseq_cs __user **uptrp,
>> > {
>> >        u32 ptr;
>> >
>> > +       if (get_user(ptr, &rseq->rseq_cs.ptr.padding))
>> > +               return -EFAULT;
>> > +       if (ptr)
>> > +               return -EINVAL;
>> >        if (get_user(ptr, &rseq->rseq_cs.ptr.ptr32))
>> >                return -EFAULT;
>> >        *uptrp = (struct rseq_cs __user *)ptr;
>> > @@ -150,8 +154,9 @@ static int rseq_get_rseq_cs(struct task_struct *t, struct rseq_cs *rseq_cs)
>> >        u32 sig;
>> >        int ret;
>> >
>> > -       if (rseq_get_cs_ptr(&urseq_cs, t->rseq))
>> > -               return -EFAULT;
>> > +       ret = rseq_get_cs_ptr(&urseq_cs, t->rseq);
>> > +       if (ret)
>> > +               return ret;
>> >        if (!urseq_cs) {
>> >                memset(rseq_cs, 0, sizeof(*rseq_cs));
>> >                return 0;
>> > @@ -237,7 +242,8 @@ static int clear_rseq_cs(struct task_struct *t)
>> > #ifdef CONFIG_64BIT
>> >        return put_user(0UL, &t->rseq->rseq_cs.ptr64);
>> > #else
>> > -       return put_user(0UL, &t->rseq->rseq_cs.ptr.ptr32);
>> > +       return put_user(0UL, &t->rseq->rseq_cs.ptr.ptr32) |
>> > +              put_user(0UL, &t->rseq->rseq_cs.ptr.padding);
>> > #endif
>> >  }
>>
>> --
>> Mathieu Desnoyers
>> EfficiOS Inc.
> > http://www.efficios.com

-- 
Mathieu Desnoyers
EfficiOS Inc.
http://www.efficios.com


* Re: [PATCH v2 3/3] rseq: optimise rseq_get_rseq_cs() and clear_rseq_cs()
  2021-04-13 18:00               ` Mathieu Desnoyers
@ 2021-04-13 18:22                 ` Eric Dumazet
  2021-04-13 18:35                   ` Arjun Roy
  2021-04-13 19:13                   ` Mathieu Desnoyers
  0 siblings, 2 replies; 26+ messages in thread
From: Eric Dumazet @ 2021-04-13 18:22 UTC (permalink / raw)
  To: Mathieu Desnoyers
  Cc: Eric Dumazet, David Laight, Ingo Molnar, Peter Zijlstra, paulmck,
	Boqun Feng, Arjun Roy, linux-kernel

On Tue, Apr 13, 2021 at 8:00 PM Mathieu Desnoyers
<mathieu.desnoyers@efficios.com> wrote:
>

> As long as the ifdefs are localized within clearly identified wrappers in the
> rseq code I don't mind doing the special-casing there.
>
> The point which remains is that I don't think we want to optimize for speed
> on 32-bit architectures when it adds special-casing and complexity to the 32-bit
> build. I suspect there is less and less testing performed on 32-bit architectures
> nowadays, and it's good that as much code as possible is shared between 32-bit and
> 64-bit builds to share the test coverage.
>

Quite frankly V1 was fine, I can't really make it looking better.

> Thanks,
>
> Mathieu
>
> >
> >
> >>
> >> Thanks,
> >>
> >> Mathieu
> >>
> >> >
> >> > diff --git a/kernel/rseq.c b/kernel/rseq.c
> >> > index f2eee3f7f5d330688c81cb2e57d47ca6b843873e..537b1f684efa11069990018ffa3642c209993011 100644
> >> > --- a/kernel/rseq.c
> >> > +++ b/kernel/rseq.c
> >> > @@ -136,6 +136,10 @@ static int rseq_get_cs_ptr(struct rseq_cs __user **uptrp,
> >> > {
> >> >        u32 ptr;
> >> >
> >> > +       if (get_user(ptr, &rseq->rseq_cs.ptr.padding))
> >> > +               return -EFAULT;
> >> > +       if (ptr)
> >> > +               return -EINVAL;
> >> >        if (get_user(ptr, &rseq->rseq_cs.ptr.ptr32))
> >> >                return -EFAULT;
> >> >        *uptrp = (struct rseq_cs __user *)ptr;
> >> > @@ -150,8 +154,9 @@ static int rseq_get_rseq_cs(struct task_struct *t, struct rseq_cs *rseq_cs)
> >> >        u32 sig;
> >> >        int ret;
> >> >
> >> > -       if (rseq_get_cs_ptr(&urseq_cs, t->rseq))
> >> > -               return -EFAULT;
> >> > +       ret = rseq_get_cs_ptr(&urseq_cs, t->rseq);
> >> > +       if (ret)
> >> > +               return ret;
> >> >        if (!urseq_cs) {
> >> >                memset(rseq_cs, 0, sizeof(*rseq_cs));
> >> >                return 0;
> >> > @@ -237,7 +242,8 @@ static int clear_rseq_cs(struct task_struct *t)
> >> > #ifdef CONFIG_64BIT
> >> >        return put_user(0UL, &t->rseq->rseq_cs.ptr64);
> >> > #else
> >> > -       return put_user(0UL, &t->rseq->rseq_cs.ptr.ptr32);
> >> > +       return put_user(0UL, &t->rseq->rseq_cs.ptr.ptr32) |
> >> > +              put_user(0UL, &t->rseq->rseq_cs.ptr.padding);
> >> > #endif
> >> >  }
> >>
> >> --
> >> Mathieu Desnoyers
> >> EfficiOS Inc.
> > > http://www.efficios.com
>
> --
> Mathieu Desnoyers
> EfficiOS Inc.
> http://www.efficios.com


* Re: [PATCH v2 3/3] rseq: optimise rseq_get_rseq_cs() and clear_rseq_cs()
  2021-04-13 18:22                 ` Eric Dumazet
@ 2021-04-13 18:35                   ` Arjun Roy
  2021-04-13 21:19                     ` David Laight
  2021-04-13 19:13                   ` Mathieu Desnoyers
  1 sibling, 1 reply; 26+ messages in thread
From: Arjun Roy @ 2021-04-13 18:35 UTC (permalink / raw)
  To: Eric Dumazet
  Cc: Mathieu Desnoyers, Eric Dumazet, David Laight, Ingo Molnar,
	Peter Zijlstra, paulmck, Boqun Feng, linux-kernel

On Tue, Apr 13, 2021 at 11:22 AM Eric Dumazet <edumazet@google.com> wrote:
>
> On Tue, Apr 13, 2021 at 8:00 PM Mathieu Desnoyers
> <mathieu.desnoyers@efficios.com> wrote:
> >
>
> > As long as the ifdefs are localized within clearly identified wrappers in the
> > rseq code I don't mind doing the special-casing there.
> >
> > The point which remains is that I don't think we want to optimize for speed
> > on 32-bit architectures when it adds special-casing and complexity to the 32-bit
> > build. I suspect there is less and less testing performed on 32-bit architectures
> > nowadays, and it's good that as much code as possible is shared between 32-bit and
> > 64-bit builds to share the test coverage.
> >
>
> Quite frankly V1 was fine, I can't really make it looking better.
>
> > Thanks,
> >
> > Mathieu
> >
> > >
> > >
> > >>
> > >> Thanks,
> > >>
> > >> Mathieu
> > >>

If we're special-casing 64-bit architectures anyway - unrolling the
32B copy_from_user() for struct rseq_cs appears to yield roughly 5-10%
savings on x86-64 when I measured it (well, in a microbenchmark, not
in rseq_get_rseq_cs() directly). Perhaps that could be an additional
avenue for improvement here.

-Arjun

> > >> >
> > >> > diff --git a/kernel/rseq.c b/kernel/rseq.c
> > >> > index f2eee3f7f5d330688c81cb2e57d47ca6b843873e..537b1f684efa11069990018ffa3642c209993011 100644
> > >> > --- a/kernel/rseq.c
> > >> > +++ b/kernel/rseq.c
> > >> > @@ -136,6 +136,10 @@ static int rseq_get_cs_ptr(struct rseq_cs __user **uptrp,
> > >> > {
> > >> >        u32 ptr;
> > >> >
> > >> > +       if (get_user(ptr, &rseq->rseq_cs.ptr.padding))
> > >> > +               return -EFAULT;
> > >> > +       if (ptr)
> > >> > +               return -EINVAL;
> > >> >        if (get_user(ptr, &rseq->rseq_cs.ptr.ptr32))
> > >> >                return -EFAULT;
> > >> >        *uptrp = (struct rseq_cs __user *)ptr;
> > >> > @@ -150,8 +154,9 @@ static int rseq_get_rseq_cs(struct task_struct *t, struct rseq_cs *rseq_cs)
> > >> >        u32 sig;
> > >> >        int ret;
> > >> >
> > >> > -       if (rseq_get_cs_ptr(&urseq_cs, t->rseq))
> > >> > -               return -EFAULT;
> > >> > +       ret = rseq_get_cs_ptr(&urseq_cs, t->rseq);
> > >> > +       if (ret)
> > >> > +               return ret;
> > >> >        if (!urseq_cs) {
> > >> >                memset(rseq_cs, 0, sizeof(*rseq_cs));
> > >> >                return 0;
> > >> > @@ -237,7 +242,8 @@ static int clear_rseq_cs(struct task_struct *t)
> > >> > #ifdef CONFIG_64BIT
> > >> >        return put_user(0UL, &t->rseq->rseq_cs.ptr64);
> > >> > #else
> > >> > -       return put_user(0UL, &t->rseq->rseq_cs.ptr.ptr32);
> > >> > +       return put_user(0UL, &t->rseq->rseq_cs.ptr.ptr32) |
> > >> > +              put_user(0UL, &t->rseq->rseq_cs.ptr.padding);
> > >> > #endif
> > >> >  }
> > >>
> > >> --
> > >> Mathieu Desnoyers
> > >> EfficiOS Inc.
> > > > http://www.efficios.com
> >
> > --
> > Mathieu Desnoyers
> > EfficiOS Inc.
> > http://www.efficios.com


* Re: [PATCH v2 3/3] rseq: optimise rseq_get_rseq_cs() and clear_rseq_cs()
  2021-04-13 18:22                 ` Eric Dumazet
  2021-04-13 18:35                   ` Arjun Roy
@ 2021-04-13 19:13                   ` Mathieu Desnoyers
  1 sibling, 0 replies; 26+ messages in thread
From: Mathieu Desnoyers @ 2021-04-13 19:13 UTC (permalink / raw)
  To: Eric Dumazet
  Cc: Eric Dumazet, David Laight, Ingo Molnar, Peter Zijlstra, paulmck,
	Boqun Feng, Arjun Roy, linux-kernel



----- On Apr 13, 2021, at 2:22 PM, Eric Dumazet edumazet@google.com wrote:

> On Tue, Apr 13, 2021 at 8:00 PM Mathieu Desnoyers
> <mathieu.desnoyers@efficios.com> wrote:
>>
> 
>> As long as the ifdefs are localized within clearly identified wrappers in the
>> rseq code I don't mind doing the special-casing there.
>>
>> The point which remains is that I don't think we want to optimize for speed
>> on 32-bit architectures when it adds special-casing and complexity to the 32-bit
>> build. I suspect there is less and less testing performed on 32-bit architectures
>> nowadays, and it's good that as much code as possible is shared between 32-bit and
>> 64-bit builds to share the test coverage.
>>
> 
> Quite frankly V1 was fine, I can't really make it looking better.

Yes, I'm OK with V1 of that patch.

Thanks,

Mathieu

-- 
Mathieu Desnoyers
EfficiOS Inc.
http://www.efficios.com


* RE: [PATCH v2 3/3] rseq: optimise rseq_get_rseq_cs() and clear_rseq_cs()
  2021-04-13 18:35                   ` Arjun Roy
@ 2021-04-13 21:19                     ` David Laight
  2021-04-13 22:03                       ` Arjun Roy
  0 siblings, 1 reply; 26+ messages in thread
From: David Laight @ 2021-04-13 21:19 UTC (permalink / raw)
  To: 'Arjun Roy', Eric Dumazet
  Cc: Mathieu Desnoyers, Eric Dumazet, Ingo Molnar, Peter Zijlstra,
	paulmck, Boqun Feng, linux-kernel

> If we're special-casing 64-bit architectures anyways - unrolling the
> 32B copy_from_user() for struct rseq_cs appears to be roughly 5-10%
> savings on x86-64 when I measured it (well, in a microbenchmark, not
> in rseq_get_rseq_cs() directly). Perhaps that could be an additional
> avenue for improvement here.

The killer is usually 'user copy hardening'.
It significantly slows down sendmsg() and recvmsg().
I've got measurable performance improvements by
using __copy_from_user() when the buffer size has
already been checked - but isn't a compile-time constant.

There is also scope for using __get_user() when reading
iovec[] (instead of copy_from_user()) and doing all the
bounds checks (etc) in the loop.
That gives a measurable improvement for writev("/dev/null").
I must sort those patches out again.

	David

-
Registered Address Lakeside, Bramley Road, Mount Farm, Milton Keynes, MK1 1PT, UK
Registration No: 1397386 (Wales)


* Re: [PATCH v2 3/3] rseq: optimise rseq_get_rseq_cs() and clear_rseq_cs()
  2021-04-13 21:19                     ` David Laight
@ 2021-04-13 22:03                       ` Arjun Roy
  2021-04-14  7:55                         ` David Laight
  0 siblings, 1 reply; 26+ messages in thread
From: Arjun Roy @ 2021-04-13 22:03 UTC (permalink / raw)
  To: David Laight
  Cc: Eric Dumazet, Mathieu Desnoyers, Eric Dumazet, Ingo Molnar,
	Peter Zijlstra, paulmck, Boqun Feng, linux-kernel

On Tue, Apr 13, 2021 at 2:19 PM David Laight <David.Laight@aculab.com> wrote:
>
> > If we're special-casing 64-bit architectures anyways - unrolling the
> > 32B copy_from_user() for struct rseq_cs appears to be roughly 5-10%
> > savings on x86-64 when I measured it (well, in a microbenchmark, not
> > in rseq_get_rseq_cs() directly). Perhaps that could be an additional
> > avenue for improvement here.
>
> The killer is usually 'user copy hardening'.
> It significantly slows down sendmsg() and recvmsg().
> I've got measurable performance improvements by
> using __copy_from_user() when the buffer since has
> already been checked - but isn't a compile-time constant.
>
> There is also scope for using _get_user() when reading
> iovec[] (instead of copy_from_user()) and doing all the
> bound checks (etc) in the loop.
> That gives a measurable improvement for writev("/dev/null").
> I must sort those patches out again.
>
>         David
>

In this case I mean replacing copy_from_user(rseq_cs, urseq_cs,
sizeof(*rseq_cs)) with 4 (x8B = 32B total) unsafe_get_user() calls
(wrapped in user_read_access_begin/end()), which I think would just
bypass user copy hardening (as far as I can tell).
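
Roughly like this sketch for a 64-bit arch (untested; the helper name
is made up, and struct rseq_cs is version/flags plus three __u64s):

	/* Sketch: 32-byte struct rseq_cs read as four 8-byte user loads. */
	static int rseq_copy_cs_unrolled(struct rseq_cs *rseq_cs,
					 struct rseq_cs __user *urseq_cs)
	{
		u64 *dst = (u64 *)rseq_cs;
		u64 __user *src = (u64 __user *)urseq_cs;

		if (!user_read_access_begin(urseq_cs, sizeof(*urseq_cs)))
			return -EFAULT;
		unsafe_get_user(dst[0], &src[0], efault_end); /* version + flags */
		unsafe_get_user(dst[1], &src[1], efault_end); /* start_ip */
		unsafe_get_user(dst[2], &src[2], efault_end); /* post_commit_offset */
		unsafe_get_user(dst[3], &src[3], efault_end); /* abort_ip */
		user_read_access_end();
		return 0;

	efault_end:
		user_read_access_end();
		return -EFAULT;
	}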

-Arjun

> -
> Registered Address Lakeside, Bramley Road, Mount Farm, Milton Keynes, MK1 1PT, UK
> Registration No: 1397386 (Wales)


* RE: [PATCH v2 3/3] rseq: optimise rseq_get_rseq_cs() and clear_rseq_cs()
  2021-04-13 22:03                       ` Arjun Roy
@ 2021-04-14  7:55                         ` David Laight
  2021-04-14 16:00                           ` Eric Dumazet
  0 siblings, 1 reply; 26+ messages in thread
From: David Laight @ 2021-04-14  7:55 UTC (permalink / raw)
  To: 'Arjun Roy'
  Cc: Eric Dumazet, Mathieu Desnoyers, Eric Dumazet, Ingo Molnar,
	Peter Zijlstra, paulmck, Boqun Feng, linux-kernel

From: Arjun Roy
> Sent: 13 April 2021 23:04
> 
> On Tue, Apr 13, 2021 at 2:19 PM David Laight <David.Laight@aculab.com> wrote:
> >
> > > If we're special-casing 64-bit architectures anyways - unrolling the
> > > 32B copy_from_user() for struct rseq_cs appears to be roughly 5-10%
> > > savings on x86-64 when I measured it (well, in a microbenchmark, not
> > > in rseq_get_rseq_cs() directly). Perhaps that could be an additional
> > > avenue for improvement here.
> >
> > The killer is usually 'user copy hardening'.
> > It significantly slows down sendmsg() and recvmsg().
> > I've got measurable performance improvements by
> > using __copy_from_user() when the buffer since has
> > already been checked - but isn't a compile-time constant.
> >
> > There is also scope for using _get_user() when reading
> > iovec[] (instead of copy_from_user()) and doing all the
> > bound checks (etc) in the loop.
> > That gives a measurable improvement for writev("/dev/null").
> > I must sort those patches out again.
> >
> >         David
> >
> 
> In this case I mean replacing copy_from_user(rseq_cs, urseq_cs,
> sizeof(*rseq_cs)) with  4 (x8B=32 total) unsafe_get_user() (wrapped in
> user_read_access_begin/end()) which I think would just bypass user
> copy hardening (as far as I can tell).

Yes that is one advantage over any of the get_user() calls.
You also lose all the 'how shall we optimise this' checks
in copy_from_user().

Repeated unsafe_get_user() calls are crying out for an optimisation.
You get something like:
	failed = 0;
	copy();
	if (failed) goto error;
	copy();
	if (failed) goto error;
Where 'failed' is set by the fault handler.

This could be optimised to:
	failed = 0;
	copy();
	copy();
	if (failed) goto error;
Even if it faults on every invalid address it probably
doesn't matter - no one cares about that path.

I've not really looked at how it could be achieved though.

It might be that the 'asm goto with outputs' variant
manages to avoid the compare and jump.

	David

-
Registered Address Lakeside, Bramley Road, Mount Farm, Milton Keynes, MK1 1PT, UK
Registration No: 1397386 (Wales)


* Re: [PATCH v2 3/3] rseq: optimise rseq_get_rseq_cs() and clear_rseq_cs()
  2021-04-14  7:55                         ` David Laight
@ 2021-04-14 16:00                           ` Eric Dumazet
  2021-04-14 16:08                             ` David Laight
  0 siblings, 1 reply; 26+ messages in thread
From: Eric Dumazet @ 2021-04-14 16:00 UTC (permalink / raw)
  To: David Laight
  Cc: Arjun Roy, Mathieu Desnoyers, Eric Dumazet, Ingo Molnar,
	Peter Zijlstra, paulmck, Boqun Feng, linux-kernel

On Wed, Apr 14, 2021 at 9:55 AM David Laight <David.Laight@aculab.com> wrote:
>
> From: Arjun Roy
> > Sent: 13 April 2021 23:04
> >
> > On Tue, Apr 13, 2021 at 2:19 PM David Laight <David.Laight@aculab.com> wrote:
> > >
> > > > If we're special-casing 64-bit architectures anyways - unrolling the
> > > > 32B copy_from_user() for struct rseq_cs appears to be roughly 5-10%
> > > > savings on x86-64 when I measured it (well, in a microbenchmark, not
> > > > in rseq_get_rseq_cs() directly). Perhaps that could be an additional
> > > > avenue for improvement here.
> > >
> > > The killer is usually 'user copy hardening'.
> > > It significantly slows down sendmsg() and recvmsg().
> > > I've got measurable performance improvements by
> > > using __copy_from_user() when the buffer since has
> > > already been checked - but isn't a compile-time constant.
> > >
> > > There is also scope for using _get_user() when reading
> > > iovec[] (instead of copy_from_user()) and doing all the
> > > bound checks (etc) in the loop.
> > > That gives a measurable improvement for writev("/dev/null").
> > > I must sort those patches out again.
> > >
> > >         David
> > >
> >
> > In this case I mean replacing copy_from_user(rseq_cs, urseq_cs,
> > sizeof(*rseq_cs)) with  4 (x8B=32 total) unsafe_get_user() (wrapped in
> > user_read_access_begin/end()) which I think would just bypass user
> > copy hardening (as far as I can tell).
>
> Yes that is one advantage over any of the get_user() calls.
> You also lose all the 'how shall we optimise this' checks
> in copy_from_user().
>
> Repeated unsafe_get_user() calls are crying out for an optimisation.
> You get something like:
>         failed = 0;
>         copy();
>         if (failed) goto error;
>         copy();
>         if (failed) goto error;
> Where 'failed' is set by the fault handler.
>
> This could be optimised to:
>         failed = 0;
>         copy();
>         copy();
>         if (failed) goto error;
> Even if it faults on every invalid address it probably
> doesn't matter - no one cares about that path.


On which arch are you looking at ?

On x86_64 at least, code generation is just perfect.
Not even a conditional jmp, it is all handled by exceptions (if any)

stac
copy();
copy();
clac


<out_of_line>
efault_end: do error recovery.



>
> I've not really looked at how it could be achieved though.
>
> It might be that the 'asm goto with outputs' variant
> manages to avoid the compare and jump.
>
>         David
>
> -
> Registered Address Lakeside, Bramley Road, Mount Farm, Milton Keynes, MK1 1PT, UK
> Registration No: 1397386 (Wales)


* RE: [PATCH v2 3/3] rseq: optimise rseq_get_rseq_cs() and clear_rseq_cs()
  2021-04-14 16:00                           ` Eric Dumazet
@ 2021-04-14 16:08                             ` David Laight
  2021-04-14 16:10                               ` Eric Dumazet
  0 siblings, 1 reply; 26+ messages in thread
From: David Laight @ 2021-04-14 16:08 UTC (permalink / raw)
  To: 'Eric Dumazet'
  Cc: Arjun Roy, Mathieu Desnoyers, Eric Dumazet, Ingo Molnar,
	Peter Zijlstra, paulmck, Boqun Feng, linux-kernel

From: Eric Dumazet
> Sent: 14 April 2021 17:00
...
> > Repeated unsafe_get_user() calls are crying out for an optimisation.
> > You get something like:
> >         failed = 0;
> >         copy();
> >         if (failed) goto error;
> >         copy();
> >         if (failed) goto error;
> > Where 'failed' is set by the fault handler.
> >
> > This could be optimised to:
> >         failed = 0;
> >         copy();
> >         copy();
> >         if (failed) goto error;
> > Even if it faults on every invalid address it probably
> > doesn't matter - no one cares about that path.
> 
> 
> On which arch are you looking at ?
> 
> On x86_64 at least, code generation is just perfect.
> Not even a conditional jmp, it is all handled by exceptions (if any)
> 
> stac
> copy();
> copy();
> clac
> 
> 
> <out_of_line>
> efault_end: do error recovery.

It will be x86_64.
I'm definitely seeing repeated tests of (IIRC) %rdx.

It may well be because the compiler isn't very new.
It will be an Ubuntu build of gcc 9.3.0.
Does that support 'asm goto with outputs'? That
may be the difference.

	David

-
Registered Address Lakeside, Bramley Road, Mount Farm, Milton Keynes, MK1 1PT, UK
Registration No: 1397386 (Wales)

^ permalink raw reply	[flat|nested] 26+ messages in thread

* Re: [PATCH v2 3/3] rseq: optimise rseq_get_rseq_cs() and clear_rseq_cs()
  2021-04-14 16:08                             ` David Laight
@ 2021-04-14 16:10                               ` Eric Dumazet
  2021-04-14 17:15                                 ` Arjun Roy
  0 siblings, 1 reply; 26+ messages in thread
From: Eric Dumazet @ 2021-04-14 16:10 UTC (permalink / raw)
  To: David Laight
  Cc: Arjun Roy, Mathieu Desnoyers, Eric Dumazet, Ingo Molnar,
	Peter Zijlstra, paulmck, Boqun Feng, linux-kernel

On Wed, Apr 14, 2021 at 6:08 PM David Laight <David.Laight@aculab.com> wrote:
>
> ...
>
> It will be x86_64.
> I'm definitely seeing repeated tests of (IIRC) %rdx.
>
> It may well be because the compiler isn't very new;
> it will be an Ubuntu build of gcc 9.3.0.
> Does that support 'asm goto with outputs'? That
> may be the difference.
>

Yep, probably. I am using some recent clang version.


^ permalink raw reply	[flat|nested] 26+ messages in thread

* Re: [PATCH v2 3/3] rseq: optimise rseq_get_rseq_cs() and clear_rseq_cs()
  2021-04-14 16:10                               ` Eric Dumazet
@ 2021-04-14 17:15                                 ` Arjun Roy
  2021-04-14 17:35                                   ` Eric Dumazet
  0 siblings, 1 reply; 26+ messages in thread
From: Arjun Roy @ 2021-04-14 17:15 UTC (permalink / raw)
  To: Eric Dumazet
  Cc: David Laight, Mathieu Desnoyers, Eric Dumazet, Ingo Molnar,
	Peter Zijlstra, paulmck, Boqun Feng, linux-kernel

On Wed, Apr 14, 2021 at 9:10 AM Eric Dumazet <edumazet@google.com> wrote:
>
> ...
>
> Yep, probably. I am using some recent clang version.
>

On x86-64 I can confirm, for me it (4 x unsafe_get_user()) compiles
down to stac + lfence + 8 x mov + clac as straight-line code, and
results in roughly a 5%-10% speedup over copy_from_user().

-Arjun



^ permalink raw reply	[flat|nested] 26+ messages in thread

* Re: [PATCH v2 3/3] rseq: optimise rseq_get_rseq_cs() and clear_rseq_cs()
  2021-04-14 17:15                                 ` Arjun Roy
@ 2021-04-14 17:35                                   ` Eric Dumazet
  2021-04-14 20:15                                     ` Arjun Roy
  0 siblings, 1 reply; 26+ messages in thread
From: Eric Dumazet @ 2021-04-14 17:35 UTC (permalink / raw)
  To: Arjun Roy
  Cc: David Laight, Mathieu Desnoyers, Eric Dumazet, Ingo Molnar,
	Peter Zijlstra, paulmck, Boqun Feng, linux-kernel

On Wed, Apr 14, 2021 at 7:15 PM Arjun Roy <arjunroy@google.com> wrote:
>
> ...
>
> On x86-64 I can confirm, for me it (4 x unsafe_get_user()) compiles
> down to stac + lfence + 8 x mov + clac as straight-line code, and
> results in roughly a 5%-10% speedup over copy_from_user().
>

But rseq_get_rseq_cs() would still need three different copies,
with 3 stac+lfence+clac sequences.

Maybe we need to enclose all __rseq_handle_notify_resume() operations
in a single section.
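
Roughly like this (an illustrative sketch only: the function name is
made up, the rseq_cs field spelling follows the uapi struct rseq, and
the 32-bit compat handling discussed earlier in the thread is ignored):

        /*
         * Read and, if set, clear rseq->rseq_cs inside one
         * user_access_begin() section: a single stac/clac pair
         * covers both the unsafe_get_user() and unsafe_put_user().
         */
        static int rseq_clear_cs(struct rseq __user *urseq)
        {
                u64 ptr;

                if (!user_access_begin(urseq, sizeof(*urseq)))
                        return -EFAULT;
                unsafe_get_user(ptr, &urseq->rseq_cs.ptr64, efault_end);
                if (ptr)
                        unsafe_put_user(0ULL, &urseq->rseq_cs.ptr64, efault_end);
                user_access_end();
                return 0;

        efault_end:
                user_access_end();
                return -EFAULT;
        }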

^ permalink raw reply	[flat|nested] 26+ messages in thread

* Re: [PATCH v2 3/3] rseq: optimise rseq_get_rseq_cs() and clear_rseq_cs()
  2021-04-14 17:35                                   ` Eric Dumazet
@ 2021-04-14 20:15                                     ` Arjun Roy
  2021-04-14 20:25                                       ` Eric Dumazet
  0 siblings, 1 reply; 26+ messages in thread
From: Arjun Roy @ 2021-04-14 20:15 UTC (permalink / raw)
  To: Eric Dumazet
  Cc: David Laight, Mathieu Desnoyers, Eric Dumazet, Ingo Molnar,
	Peter Zijlstra, paulmck, Boqun Feng, linux-kernel

On Wed, Apr 14, 2021 at 10:35 AM Eric Dumazet <edumazet@google.com> wrote:
>
> > ...
> > On x86-64 I can confirm, for me it (4 x unsafe_get_user()) compiles
> > down to stac + lfence + 8 x mov + clac as straight-line code, and
> > results in roughly a 5%-10% speedup over copy_from_user().
> >
>
> But rseq_get_rseq_cs() would still need three different copies,
> with 3 stac+lfence+clac sequences.
>
> Maybe we need to enclose all __rseq_handle_notify_resume() operations
> in a single section.
>
>

To provide a bit of further exposition on this point: with 4x
unsafe_get_user(), recall I mentioned a 5-10% improvement. On the
other hand, with 4x normal get_user() I saw something like a 100%
regression (i.e. a doubling of measured sys time).

I assume that's the fault of multiple stac+clac.
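
At the call site the regressing variant is basically this (illustrative
names; compare with the single-section sketches earlier in the thread):

        /*
         * 4 x get_user(): each call opens and closes its own user
         * access section, so four stac/clac pairs, versus a single
         * pair (plus one lfence) for the unsafe_get_user() version.
         */
        static int read4_get_user(u64 __user *src, u64 *dst)
        {
                if (get_user(dst[0], &src[0]) ||
                    get_user(dst[1], &src[1]) ||
                    get_user(dst[2], &src[2]) ||
                    get_user(dst[3], &src[3]))
                        return -EFAULT;
                return 0;
        }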

-Arjun


^ permalink raw reply	[flat|nested] 26+ messages in thread

* Re: [PATCH v2 3/3] rseq: optimise rseq_get_rseq_cs() and clear_rseq_cs()
  2021-04-14 20:15                                     ` Arjun Roy
@ 2021-04-14 20:25                                       ` Eric Dumazet
  2021-04-14 20:35                                         ` Arjun Roy
  0 siblings, 1 reply; 26+ messages in thread
From: Eric Dumazet @ 2021-04-14 20:25 UTC (permalink / raw)
  To: Arjun Roy
  Cc: David Laight, Mathieu Desnoyers, Eric Dumazet, Ingo Molnar,
	Peter Zijlstra, paulmck, Boqun Feng, linux-kernel

On Wed, Apr 14, 2021 at 10:15 PM Arjun Roy <arjunroy@google.com> wrote:
>
> > ...
> >
> > But rseq_get_rseq_cs() would still need three different copies,
> > with 3 stac+lfence+clac sequences.
> >
> > Maybe we need to enclose all __rseq_handle_notify_resume() operations
> > in a single section.
> >
> >
>
> To provide a bit of further exposition on this point: with 4x
> unsafe_get_user(), recall I mentioned a 5-10% improvement. On the
> other hand, with 4x normal get_user() I saw something like a 100%
> regression (i.e. a doubling of measured sys time).
>
> I assume that's the fault of multiple stac+clac.


I was suggesting only using unsafe_get_user() and unsafe_put_user(),
and one surrounding stac/clac

Basically what we had (partially) in our old Google kernels, before
commit 8f2817701492 ("rseq: Use get_user/put_user rather than
__get_user/__put_user")
but with all the needed modern stuff.

^ permalink raw reply	[flat|nested] 26+ messages in thread

* Re: [PATCH v2 3/3] rseq: optimise rseq_get_rseq_cs() and clear_rseq_cs()
  2021-04-14 20:25                                       ` Eric Dumazet
@ 2021-04-14 20:35                                         ` Arjun Roy
  0 siblings, 0 replies; 26+ messages in thread
From: Arjun Roy @ 2021-04-14 20:35 UTC (permalink / raw)
  To: Eric Dumazet
  Cc: David Laight, Mathieu Desnoyers, Eric Dumazet, Ingo Molnar,
	Peter Zijlstra, paulmck, Boqun Feng, linux-kernel

On Wed, Apr 14, 2021 at 1:25 PM Eric Dumazet <edumazet@google.com> wrote:
>
> > ...
> >
> > To provide a bit of further exposition on this point: with 4x
> > unsafe_get_user(), recall I mentioned a 5-10% improvement. On the
> > other hand, with 4x normal get_user() I saw something like a 100%
> > regression (i.e. a doubling of measured sys time).
> >
> > I assume that's the fault of multiple stac+clac.
>
>
> I was suggesting only using unsafe_get_user() and unsafe_put_user(),
> and one surrounding stac/clac
>
> Basically what we had (partially) in our old Google kernels, before
> commit 8f2817701492 ("rseq: Use get_user/put_user rather than
> __get_user/__put_user")
> but with all the needed modern stuff.

Yep - in agreement with that approach.

-Arjun

^ permalink raw reply	[flat|nested] 26+ messages in thread

end of thread, other threads:[~2021-04-14 20:35 UTC | newest]

Thread overview: 26+ messages (download: mbox.gz / follow: Atom feed)
-- links below jump to the message on this page --
2021-04-13 16:22 [PATCH v2 0/3] rseq: minor optimizations Eric Dumazet
2021-04-13 16:22 ` [PATCH v2 1/3] rseq: optimize rseq_update_cpu_id() Eric Dumazet
2021-04-13 16:22 ` [PATCH v2 2/3] rseq: remove redundant access_ok() Eric Dumazet
2021-04-13 16:22 ` [PATCH v2 3/3] rseq: optimise rseq_get_rseq_cs() and clear_rseq_cs() Eric Dumazet
2021-04-13 16:54   ` Mathieu Desnoyers
2021-04-13 16:57     ` Eric Dumazet
2021-04-13 17:01       ` Eric Dumazet
2021-04-13 17:07         ` Eric Dumazet
2021-04-13 17:20           ` Mathieu Desnoyers
2021-04-13 17:33             ` Eric Dumazet
2021-04-13 18:00               ` Mathieu Desnoyers
2021-04-13 18:22                 ` Eric Dumazet
2021-04-13 18:35                   ` Arjun Roy
2021-04-13 21:19                     ` David Laight
2021-04-13 22:03                       ` Arjun Roy
2021-04-14  7:55                         ` David Laight
2021-04-14 16:00                           ` Eric Dumazet
2021-04-14 16:08                             ` David Laight
2021-04-14 16:10                               ` Eric Dumazet
2021-04-14 17:15                                 ` Arjun Roy
2021-04-14 17:35                                   ` Eric Dumazet
2021-04-14 20:15                                     ` Arjun Roy
2021-04-14 20:25                                       ` Eric Dumazet
2021-04-14 20:35                                         ` Arjun Roy
2021-04-13 19:13                   ` Mathieu Desnoyers
2021-04-13 17:06       ` Mathieu Desnoyers
