* [RFC PATCH for 5.8 0/4] rseq cpu_id ABI fix @ 2020-07-06 20:49 Mathieu Desnoyers 2020-07-06 20:49 ` [RFC PATCH for 5.8 1/4] sched: Fix unreliable rseq cpu_id for new tasks Mathieu Desnoyers ` (5 more replies) 0 siblings, 6 replies; 25+ messages in thread From: Mathieu Desnoyers @ 2020-07-06 20:49 UTC (permalink / raw) To: Thomas Gleixner Cc: linux-kernel, Peter Zijlstra, Paul E . McKenney, Boqun Feng, H . Peter Anvin, Paul Turner, linux-api, Florian Weimer, Mathieu Desnoyers Hi, Recent integration of rseq into glibc unearthed an issue with inaccurate cpu_id field for newly created tasks. This series includes a fix for the underlying issue (meant to be backported to stable), as well as new rseq flags to let user-space know that the kernel implements this fix, so glibc and other rseq users can use this flag to know whether they can safely use rseq without risk of corrupting their per-cpu data. This new flag could either be added only to the master branch (no stable backport) or backported to stable, depending on what seems the most appropriate. This is an RFC aiming for quick inclusion into the Linux kernel, unless we prefer reverting the entire rseq glibc integration and try again in 6 months. Their upcoming release is on August 3rd, so we need to take a decision on this matter quickly. Thanks, Mathieu Mathieu Desnoyers (4): sched: Fix unreliable rseq cpu_id for new tasks rseq: Introduce RSEQ_FLAG_REGISTER rseq: Introduce RSEQ_FLAG_RELIABLE_CPU_ID rseq: selftests: Expect reliable cpu_id field include/uapi/linux/rseq.h | 15 +++++- kernel/rseq.c | 81 ++++++++++++++++------------- kernel/sched/core.c | 2 + tools/testing/selftests/rseq/rseq.c | 10 +++- 4 files changed, 71 insertions(+), 37 deletions(-) -- 2.17.1 ^ permalink raw reply [flat|nested] 25+ messages in thread
* [RFC PATCH for 5.8 1/4] sched: Fix unreliable rseq cpu_id for new tasks 2020-07-06 20:49 [RFC PATCH for 5.8 0/4] rseq cpu_id ABI fix Mathieu Desnoyers @ 2020-07-06 20:49 ` Mathieu Desnoyers 2020-07-07 7:30 ` Florian Weimer 2020-07-06 20:49 ` [RFC PATCH for 5.8 2/4] rseq: Introduce RSEQ_FLAG_REGISTER Mathieu Desnoyers ` (4 subsequent siblings) 5 siblings, 1 reply; 25+ messages in thread From: Mathieu Desnoyers @ 2020-07-06 20:49 UTC (permalink / raw) To: Thomas Gleixner Cc: linux-kernel, Peter Zijlstra, Paul E . McKenney, Boqun Feng, H . Peter Anvin, Paul Turner, linux-api, Florian Weimer, Mathieu Desnoyers, Dmitry Vyukov, Neel Natu, stable While integrating rseq into glibc and replacing glibc's sched_getcpu implementation with rseq, glibc's tests discovered an issue with incorrect __rseq_abi.cpu_id field value right after the first time a newly created process issues sched_setaffinity. For the records, it triggers after building glibc and running tests, and then issuing: for x in {1..2000} ; do posix/tst-affinity-static & done and shows up as: error: Unexpected CPU 2, expected 0 error: Unexpected CPU 2, expected 0 error: Unexpected CPU 2, expected 0 error: Unexpected CPU 2, expected 0 error: Unexpected CPU 138, expected 0 error: Unexpected CPU 138, expected 0 error: Unexpected CPU 138, expected 0 error: Unexpected CPU 138, expected 0 This is caused by the scheduler invoking __set_task_cpu() directly from sched_fork() and wake_up_new_task(), thus bypassing rseq_migrate() which is done by set_task_cpu(). Add the missing rseq_migrate() to both functions. The only other direct use of __set_task_cpu() is done by init_idle(), which does not involve a user-space task. Based on my testing with the glibc test-case, just adding rseq_migrate() to wake_up_new_task() is sufficient to fix the observed issue. Also add it to sched_fork() to keep things consistent. The reason why this never triggered so far with the rseq/basic_test selftest is unclear. The current use of sched_getcpu(3) does not typically require it to be always accurate. However, use of the __rseq_abi.cpu_id field within rseq critical sections requires it to be accurate. If it is not accurate, it can cause corruption in the per-cpu data targeted by rseq critical sections in user-space. Link: https://sourceware.org/pipermail/libc-alpha/2020-July/115816.html Signed-off-by: Mathieu Desnoyers <mathieu.desnoyers@efficios.com> Cc: Peter Zijlstra (Intel) <peterz@infradead.org> Cc: Thomas Gleixner <tglx@linutronix.de> Cc: Florian Weimer <fw@deneb.enyo.de> Cc: "Paul E. McKenney" <paulmck@linux.ibm.com> Cc: Boqun Feng <boqun.feng@gmail.com> Cc: "H . Peter Anvin" <hpa@zytor.com> Cc: Paul Turner <pjt@google.com> Cc: Dmitry Vyukov <dvyukov@google.com> Cc: Neel Natu <neelnatu@google.com> Cc: linux-api@vger.kernel.org Cc: stable@vger.kernel.org # v4.18+ --- kernel/sched/core.c | 2 ++ 1 file changed, 2 insertions(+) diff --git a/kernel/sched/core.c b/kernel/sched/core.c index ca5db40392d4..86a855bd4d90 100644 --- a/kernel/sched/core.c +++ b/kernel/sched/core.c @@ -2962,6 +2962,7 @@ int sched_fork(unsigned long clone_flags, struct task_struct *p) * Silence PROVE_RCU. */ raw_spin_lock_irqsave(&p->pi_lock, flags); + rseq_migrate(p); /* * We're setting the CPU for the first time, we don't migrate, * so use __set_task_cpu(). @@ -3026,6 +3027,7 @@ void wake_up_new_task(struct task_struct *p) * as we're not fully set-up yet. */ p->recent_used_cpu = task_cpu(p); + rseq_migrate(p); __set_task_cpu(p, select_task_rq(p, task_cpu(p), SD_BALANCE_FORK, 0)); #endif rq = __task_rq_lock(p, &rf); -- 2.17.1 ^ permalink raw reply related [flat|nested] 25+ messages in thread
* Re: [RFC PATCH for 5.8 1/4] sched: Fix unreliable rseq cpu_id for new tasks 2020-07-06 20:49 ` [RFC PATCH for 5.8 1/4] sched: Fix unreliable rseq cpu_id for new tasks Mathieu Desnoyers @ 2020-07-07 7:30 ` Florian Weimer 2020-07-07 10:51 ` Mathieu Desnoyers 0 siblings, 1 reply; 25+ messages in thread From: Florian Weimer @ 2020-07-07 7:30 UTC (permalink / raw) To: Mathieu Desnoyers Cc: Thomas Gleixner, linux-kernel, Peter Zijlstra, Paul E . McKenney, Boqun Feng, H . Peter Anvin, Paul Turner, linux-api, Dmitry Vyukov, Neel Natu, stable * Mathieu Desnoyers: > While integrating rseq into glibc and replacing glibc's sched_getcpu > implementation with rseq, glibc's tests discovered an issue with > incorrect __rseq_abi.cpu_id field value right after the first time > a newly created process issues sched_setaffinity. > > For the records, it triggers after building glibc and running tests, and > then issuing: > > for x in {1..2000} ; do posix/tst-affinity-static & done > > and shows up as: > > error: Unexpected CPU 2, expected 0 > error: Unexpected CPU 2, expected 0 > error: Unexpected CPU 2, expected 0 > error: Unexpected CPU 2, expected 0 > error: Unexpected CPU 138, expected 0 > error: Unexpected CPU 138, expected 0 > error: Unexpected CPU 138, expected 0 > error: Unexpected CPU 138, expected 0 As far as I can tell, the glibc reproducer no longer shows the issue with this patch applied. Tested-By: Florian Weimer <fweimer@redhat.com> ^ permalink raw reply [flat|nested] 25+ messages in thread
* Re: [RFC PATCH for 5.8 1/4] sched: Fix unreliable rseq cpu_id for new tasks 2020-07-07 7:30 ` Florian Weimer @ 2020-07-07 10:51 ` Mathieu Desnoyers 0 siblings, 0 replies; 25+ messages in thread From: Mathieu Desnoyers @ 2020-07-07 10:51 UTC (permalink / raw) To: Florian Weimer Cc: Thomas Gleixner, linux-kernel, Peter Zijlstra, paulmck, Boqun Feng, H. Peter Anvin, Paul Turner, linux-api, Dmitry Vyukov, Neel Natu, stable ----- On Jul 7, 2020, at 3:30 AM, Florian Weimer fw@deneb.enyo.de wrote: > * Mathieu Desnoyers: > >> While integrating rseq into glibc and replacing glibc's sched_getcpu >> implementation with rseq, glibc's tests discovered an issue with >> incorrect __rseq_abi.cpu_id field value right after the first time >> a newly created process issues sched_setaffinity. >> >> For the records, it triggers after building glibc and running tests, and >> then issuing: >> >> for x in {1..2000} ; do posix/tst-affinity-static & done >> >> and shows up as: >> >> error: Unexpected CPU 2, expected 0 >> error: Unexpected CPU 2, expected 0 >> error: Unexpected CPU 2, expected 0 >> error: Unexpected CPU 2, expected 0 >> error: Unexpected CPU 138, expected 0 >> error: Unexpected CPU 138, expected 0 >> error: Unexpected CPU 138, expected 0 >> error: Unexpected CPU 138, expected 0 > > As far as I can tell, the glibc reproducer no longer shows the issue > with this patch applied. > > Tested-By: Florian Weimer <fweimer@redhat.com> Thanks a lot Florian for your thorough review and testing ! Mathieu -- Mathieu Desnoyers EfficiOS Inc. http://www.efficios.com ^ permalink raw reply [flat|nested] 25+ messages in thread
* [RFC PATCH for 5.8 2/4] rseq: Introduce RSEQ_FLAG_REGISTER 2020-07-06 20:49 [RFC PATCH for 5.8 0/4] rseq cpu_id ABI fix Mathieu Desnoyers 2020-07-06 20:49 ` [RFC PATCH for 5.8 1/4] sched: Fix unreliable rseq cpu_id for new tasks Mathieu Desnoyers @ 2020-07-06 20:49 ` Mathieu Desnoyers 2020-07-06 20:49 ` [RFC PATCH for 5.8 3/4] rseq: Introduce RSEQ_FLAG_RELIABLE_CPU_ID Mathieu Desnoyers ` (3 subsequent siblings) 5 siblings, 0 replies; 25+ messages in thread From: Mathieu Desnoyers @ 2020-07-06 20:49 UTC (permalink / raw) To: Thomas Gleixner Cc: linux-kernel, Peter Zijlstra, Paul E . McKenney, Boqun Feng, H . Peter Anvin, Paul Turner, linux-api, Florian Weimer, Mathieu Desnoyers, Dmitry Vyukov, Neel Natu Introduce RSEQ_FLAG_REGISTER with the same behavior as the flag value "0". The main advantage of introducing this flag as a non-zero (1 << 1) value is that it can be combined with other flags to register and check for features with a single system call. Considering that this system call needs to be performed for each new thread in glibc, minimize the amount of overhead required. This is needed for introducing a new RSEQ_FLAG_RELIABLE_CPU_ID flag in a later change. Signed-off-by: Mathieu Desnoyers <mathieu.desnoyers@efficios.com> Cc: Peter Zijlstra (Intel) <peterz@infradead.org> Cc: Thomas Gleixner <tglx@linutronix.de> Cc: Florian Weimer <fw@deneb.enyo.de> Cc: "Paul E. McKenney" <paulmck@linux.ibm.com> Cc: Boqun Feng <boqun.feng@gmail.com> Cc: "H . Peter Anvin" <hpa@zytor.com> Cc: Paul Turner <pjt@google.com> Cc: Dmitry Vyukov <dvyukov@google.com> Cc: Neel Natu <neelnatu@google.com> Cc: linux-api@vger.kernel.org --- include/uapi/linux/rseq.h | 10 ++++- kernel/rseq.c | 77 +++++++++++++++++++++------------------ 2 files changed, 51 insertions(+), 36 deletions(-) diff --git a/include/uapi/linux/rseq.h b/include/uapi/linux/rseq.h index 9a402fdb60e9..3b5fba25461a 100644 --- a/include/uapi/linux/rseq.h +++ b/include/uapi/linux/rseq.h @@ -18,8 +18,16 @@ enum rseq_cpu_id_state { RSEQ_CPU_ID_REGISTRATION_FAILED = -2, }; +/* + * RSEQ_FLAG_UNREGISTER: Unregister rseq ABI for caller thread. + * RSEQ_FLAG_REGISTER: Register rseq ABI for caller thread. + * + * Flag value 0 has the same behavior as RSEQ_FLAG_REGISTER, but cannot be + * combined with other flags. This behavior is kept for backward compatibility. + */ enum rseq_flags { - RSEQ_FLAG_UNREGISTER = (1 << 0), + RSEQ_FLAG_UNREGISTER = (1 << 0), + RSEQ_FLAG_REGISTER = (1 << 1), }; enum rseq_cs_flags_bit { diff --git a/kernel/rseq.c b/kernel/rseq.c index a4f86a9d6937..47ce221cd6f9 100644 --- a/kernel/rseq.c +++ b/kernel/rseq.c @@ -309,9 +309,16 @@ SYSCALL_DEFINE4(rseq, struct rseq __user *, rseq, u32, rseq_len, { int ret; - if (flags & RSEQ_FLAG_UNREGISTER) { - if (flags & ~RSEQ_FLAG_UNREGISTER) - return -EINVAL; + /* + * Flag value 0 has the same behavior as RSEQ_FLAG_REGISTER, but cannot + * be combined with other flags. This behavior is kept for backward + * compatibility. + */ + if (!flags) + flags = RSEQ_FLAG_REGISTER; + + switch (flags) { + case RSEQ_FLAG_UNREGISTER: /* Unregister rseq for current thread. */ if (current->rseq != rseq || !current->rseq) return -EINVAL; @@ -324,43 +331,43 @@ SYSCALL_DEFINE4(rseq, struct rseq __user *, rseq, u32, rseq_len, return ret; current->rseq = NULL; current->rseq_sig = 0; - return 0; - } - - if (unlikely(flags)) - return -EINVAL; + break; + case RSEQ_FLAG_REGISTER: + if (current->rseq) { + /* + * If rseq is already registered, check whether + * the provided address differs from the prior + * one. + */ + if (current->rseq != rseq || rseq_len != sizeof(*rseq)) + return -EINVAL; + if (current->rseq_sig != sig) + return -EPERM; + /* Already registered. */ + return -EBUSY; + } - if (current->rseq) { /* - * If rseq is already registered, check whether - * the provided address differs from the prior - * one. + * If there was no rseq previously registered, + * ensure the provided rseq is properly aligned and valid. */ - if (current->rseq != rseq || rseq_len != sizeof(*rseq)) + if (!IS_ALIGNED((unsigned long)rseq, __alignof__(*rseq)) || + rseq_len != sizeof(*rseq)) return -EINVAL; - if (current->rseq_sig != sig) - return -EPERM; - /* Already registered. */ - return -EBUSY; - } - - /* - * If there was no rseq previously registered, - * ensure the provided rseq is properly aligned and valid. - */ - if (!IS_ALIGNED((unsigned long)rseq, __alignof__(*rseq)) || - rseq_len != sizeof(*rseq)) + if (!access_ok(rseq, rseq_len)) + return -EFAULT; + current->rseq = rseq; + current->rseq_sig = sig; + /* + * If rseq was previously inactive, and has just been + * registered, ensure the cpu_id_start and cpu_id fields + * are updated before returning to user-space. + */ + rseq_set_notify_resume(current); + break; + default: return -EINVAL; - if (!access_ok(rseq, rseq_len)) - return -EFAULT; - current->rseq = rseq; - current->rseq_sig = sig; - /* - * If rseq was previously inactive, and has just been - * registered, ensure the cpu_id_start and cpu_id fields - * are updated before returning to user-space. - */ - rseq_set_notify_resume(current); + } return 0; } -- 2.17.1 ^ permalink raw reply related [flat|nested] 25+ messages in thread
* [RFC PATCH for 5.8 3/4] rseq: Introduce RSEQ_FLAG_RELIABLE_CPU_ID 2020-07-06 20:49 [RFC PATCH for 5.8 0/4] rseq cpu_id ABI fix Mathieu Desnoyers 2020-07-06 20:49 ` [RFC PATCH for 5.8 1/4] sched: Fix unreliable rseq cpu_id for new tasks Mathieu Desnoyers 2020-07-06 20:49 ` [RFC PATCH for 5.8 2/4] rseq: Introduce RSEQ_FLAG_REGISTER Mathieu Desnoyers @ 2020-07-06 20:49 ` Mathieu Desnoyers 2020-07-07 7:29 ` Florian Weimer 2020-07-06 20:49 ` [RFC PATCH for 5.8 4/4] rseq: selftests: Expect reliable cpu_id field Mathieu Desnoyers ` (2 subsequent siblings) 5 siblings, 1 reply; 25+ messages in thread From: Mathieu Desnoyers @ 2020-07-06 20:49 UTC (permalink / raw) To: Thomas Gleixner Cc: linux-kernel, Peter Zijlstra, Paul E . McKenney, Boqun Feng, H . Peter Anvin, Paul Turner, linux-api, Florian Weimer, Mathieu Desnoyers, Dmitry Vyukov, Neel Natu commit 93b585c08d16 ("Fix: sched: unreliable rseq cpu_id for new tasks") addresses an issue with cpu_id field of newly created processes. Expose a flag which can be used by user-space to query whether the kernel implements this fix. Considering that this issue can cause corruption of user-space per-cpu data updated with rseq, it is recommended that user-space detects availability of this fix by using the RSEQ_FLAG_RELIABLE_CPU_ID flag either combined with registration or on its own before using rseq. Signed-off-by: Mathieu Desnoyers <mathieu.desnoyers@efficios.com> Cc: Peter Zijlstra (Intel) <peterz@infradead.org> Cc: Thomas Gleixner <tglx@linutronix.de> Cc: Florian Weimer <fw@deneb.enyo.de> Cc: "Paul E. McKenney" <paulmck@linux.ibm.com> Cc: Boqun Feng <boqun.feng@gmail.com> Cc: "H . Peter Anvin" <hpa@zytor.com> Cc: Paul Turner <pjt@google.com> Cc: Dmitry Vyukov <dvyukov@google.com> Cc: Neel Natu <neelnatu@google.com> Cc: linux-api@vger.kernel.org --- include/uapi/linux/rseq.h | 5 +++++ kernel/rseq.c | 4 ++++ 2 files changed, 9 insertions(+) diff --git a/include/uapi/linux/rseq.h b/include/uapi/linux/rseq.h index 3b5fba25461a..a548b77c9520 100644 --- a/include/uapi/linux/rseq.h +++ b/include/uapi/linux/rseq.h @@ -21,13 +21,18 @@ enum rseq_cpu_id_state { /* * RSEQ_FLAG_UNREGISTER: Unregister rseq ABI for caller thread. * RSEQ_FLAG_REGISTER: Register rseq ABI for caller thread. + * RSEQ_FLAG_RELIABLE_CPU_ID: rseq provides a reliable cpu_id field. * * Flag value 0 has the same behavior as RSEQ_FLAG_REGISTER, but cannot be * combined with other flags. This behavior is kept for backward compatibility. + * + * The flag RSEQ_FLAG_REGISTER can be combined with the RSEQ_FLAG_RELIABLE_CPU_ID + * flag. */ enum rseq_flags { RSEQ_FLAG_UNREGISTER = (1 << 0), RSEQ_FLAG_REGISTER = (1 << 1), + RSEQ_FLAG_RELIABLE_CPU_ID = (1 << 2), }; enum rseq_cs_flags_bit { diff --git a/kernel/rseq.c b/kernel/rseq.c index 47ce221cd6f9..32cc2e0d961f 100644 --- a/kernel/rseq.c +++ b/kernel/rseq.c @@ -333,6 +333,8 @@ SYSCALL_DEFINE4(rseq, struct rseq __user *, rseq, u32, rseq_len, current->rseq_sig = 0; break; case RSEQ_FLAG_REGISTER: + fallthrough; + case RSEQ_FLAG_REGISTER | RSEQ_FLAG_RELIABLE_CPU_ID: if (current->rseq) { /* * If rseq is already registered, check whether @@ -365,6 +367,8 @@ SYSCALL_DEFINE4(rseq, struct rseq __user *, rseq, u32, rseq_len, */ rseq_set_notify_resume(current); break; + case RSEQ_FLAG_RELIABLE_CPU_ID: + break; default: return -EINVAL; } -- 2.17.1 ^ permalink raw reply related [flat|nested] 25+ messages in thread
* Re: [RFC PATCH for 5.8 3/4] rseq: Introduce RSEQ_FLAG_RELIABLE_CPU_ID 2020-07-06 20:49 ` [RFC PATCH for 5.8 3/4] rseq: Introduce RSEQ_FLAG_RELIABLE_CPU_ID Mathieu Desnoyers @ 2020-07-07 7:29 ` Florian Weimer 2020-07-07 10:48 ` Mathieu Desnoyers 0 siblings, 1 reply; 25+ messages in thread From: Florian Weimer @ 2020-07-07 7:29 UTC (permalink / raw) To: Mathieu Desnoyers Cc: Thomas Gleixner, linux-kernel, Peter Zijlstra, Paul E . McKenney, Boqun Feng, H . Peter Anvin, Paul Turner, linux-api, Dmitry Vyukov, Neel Natu * Mathieu Desnoyers: > commit 93b585c08d16 ("Fix: sched: unreliable rseq cpu_id for new tasks") > addresses an issue with cpu_id field of newly created processes. Expose > a flag which can be used by user-space to query whether the kernel > implements this fix. > > Considering that this issue can cause corruption of user-space per-cpu > data updated with rseq, it is recommended that user-space detects > availability of this fix by using the RSEQ_FLAG_RELIABLE_CPU_ID flag > either combined with registration or on its own before using rseq. Presumably, the intent is that glibc uses RSEQ_FLAG_RELIABLE_CPU_ID to register the rseq area. That will surely prevent glibc itself from activating rseq on broken kernels. But if another rseq library performs registration and has not been updated to use RSEQ_FLAG_RELIABLE_CPU_ID, we still end up with an active rseq area (and incorrect CPU IDs from sched_getcpu in glibc). So further glibc changes will be needed. I suppose we could block third-party rseq registration with a registration of a hidden rseq area (not __rseq_abi). But then the question is if any of the third-party rseq users are expecting the EINVAL error code from their failed registration. The rseq registration state machine is quite tricky already, and the need to use RSEQ_FLAG_RELIABLE_CPU_ID would make it even more complicated. Even if we implemented all the changes, it's all going to be essentially dead, untestable code in a few months, when the broken kernels are out of circulation. It does not appear to be good investment to me. ^ permalink raw reply [flat|nested] 25+ messages in thread
* Re: [RFC PATCH for 5.8 3/4] rseq: Introduce RSEQ_FLAG_RELIABLE_CPU_ID 2020-07-07 7:29 ` Florian Weimer @ 2020-07-07 10:48 ` Mathieu Desnoyers 2020-07-07 11:32 ` Florian Weimer 0 siblings, 1 reply; 25+ messages in thread From: Mathieu Desnoyers @ 2020-07-07 10:48 UTC (permalink / raw) To: Florian Weimer Cc: Thomas Gleixner, linux-kernel, Peter Zijlstra, paulmck, Boqun Feng, H. Peter Anvin, Paul Turner, linux-api, Dmitry Vyukov, Neel Natu ----- On Jul 7, 2020, at 3:29 AM, Florian Weimer fw@deneb.enyo.de wrote: > * Mathieu Desnoyers: > >> commit 93b585c08d16 ("Fix: sched: unreliable rseq cpu_id for new tasks") >> addresses an issue with cpu_id field of newly created processes. Expose >> a flag which can be used by user-space to query whether the kernel >> implements this fix. >> >> Considering that this issue can cause corruption of user-space per-cpu >> data updated with rseq, it is recommended that user-space detects >> availability of this fix by using the RSEQ_FLAG_RELIABLE_CPU_ID flag >> either combined with registration or on its own before using rseq. > > Presumably, the intent is that glibc uses RSEQ_FLAG_RELIABLE_CPU_ID to > register the rseq area. That will surely prevent glibc itself from > activating rseq on broken kernels. But if another rseq library > performs registration and has not been updated to use > RSEQ_FLAG_RELIABLE_CPU_ID, we still end up with an active rseq area > (and incorrect CPU IDs from sched_getcpu in glibc). So further glibc > changes will be needed. I suppose we could block third-party rseq > registration with a registration of a hidden rseq area (not > __rseq_abi). But then the question is if any of the third-party rseq > users are expecting the EINVAL error code from their failed > registration. > > The rseq registration state machine is quite tricky already, and the > need to use RSEQ_FLAG_RELIABLE_CPU_ID would make it even more > complicated. Even if we implemented all the changes, it's all going > to be essentially dead, untestable code in a few months, when the > broken kernels are out of circulation. It does not appear to be good > investment to me. Those are very good points. One possibility we have would be to let glibc do the rseq registration without the RSEQ_FLAG_RELIABLE_CPU_ID flag. On kernels with the bug present, the cpu_id field is still good enough for typical uses of sched_getcpu() which does not appear to have a very strict correctness requirement on returning the right cpu number. Then libraries and applications which require a reliable cpu_id field could check this on their own by calling rseq with the RSEQ_FLAG_RELIABLE_CPU_ID flag. This would not make the state more complex in __rseq_abi, and let each rseq user decide about its own fate: whether it uses rseq or keeps using an rseq-free fallback. I am still tempted to allow combining RSEQ_FLAG_REGISTER | RSEQ_FLAG_RELIABLE_CPU_ID for applications which would not be using glibc, and want to check this flag on thread registration. Thoughts ? Thanks, Mathieu -- Mathieu Desnoyers EfficiOS Inc. http://www.efficios.com ^ permalink raw reply [flat|nested] 25+ messages in thread
* Re: [RFC PATCH for 5.8 3/4] rseq: Introduce RSEQ_FLAG_RELIABLE_CPU_ID 2020-07-07 10:48 ` Mathieu Desnoyers @ 2020-07-07 11:32 ` Florian Weimer 2020-07-07 12:06 ` Mathieu Desnoyers 0 siblings, 1 reply; 25+ messages in thread From: Florian Weimer @ 2020-07-07 11:32 UTC (permalink / raw) To: Mathieu Desnoyers Cc: Thomas Gleixner, linux-kernel, Peter Zijlstra, paulmck, Boqun Feng, H. Peter Anvin, Paul Turner, linux-api, Dmitry Vyukov, Neel Natu * Mathieu Desnoyers: > Those are very good points. One possibility we have would be to let > glibc do the rseq registration without the RSEQ_FLAG_RELIABLE_CPU_ID > flag. On kernels with the bug present, the cpu_id field is still good > enough for typical uses of sched_getcpu() which does not appear to > have a very strict correctness requirement on returning the right > cpu number. > > Then libraries and applications which require a reliable cpu_id > field could check this on their own by calling rseq with the > RSEQ_FLAG_RELIABLE_CPU_ID flag. This would not make the state more > complex in __rseq_abi, and let each rseq user decide about its own > fate: whether it uses rseq or keeps using an rseq-free fallback. > > I am still tempted to allow combining RSEQ_FLAG_REGISTER | > RSEQ_FLAG_RELIABLE_CPU_ID for applications which would not be using > glibc, and want to check this flag on thread registration. Well, you could add a bug fix level field to the __rseq_abi variable. Then applications could check if the kernel has the appropriate level of non-buggyness. But the same thing could be useful for many other kernel interfaces, and I haven't seen such a fix level value for them. What makes rseq so special? It won't help with the present bug, but maybe we should add an rseq API sub-call that blocks future rseq registration for the thread. Then we can add a glibc tunable that flips off rseq reliably if people do not want to use it for some reason (and switch to the non-rseq fallback code instead). But that's going to help with future bugs only. ^ permalink raw reply [flat|nested] 25+ messages in thread
* Re: [RFC PATCH for 5.8 3/4] rseq: Introduce RSEQ_FLAG_RELIABLE_CPU_ID 2020-07-07 11:32 ` Florian Weimer @ 2020-07-07 12:06 ` Mathieu Desnoyers 2020-07-07 18:53 ` Carlos O'Donell 0 siblings, 1 reply; 25+ messages in thread From: Mathieu Desnoyers @ 2020-07-07 12:06 UTC (permalink / raw) To: Florian Weimer Cc: Thomas Gleixner, linux-kernel, Peter Zijlstra, paulmck, Boqun Feng, H. Peter Anvin, Paul Turner, linux-api, Dmitry Vyukov, Neel Natu ----- On Jul 7, 2020, at 7:32 AM, Florian Weimer fw@deneb.enyo.de wrote: > * Mathieu Desnoyers: > >> Those are very good points. One possibility we have would be to let >> glibc do the rseq registration without the RSEQ_FLAG_RELIABLE_CPU_ID >> flag. On kernels with the bug present, the cpu_id field is still good >> enough for typical uses of sched_getcpu() which does not appear to >> have a very strict correctness requirement on returning the right >> cpu number. >> >> Then libraries and applications which require a reliable cpu_id >> field could check this on their own by calling rseq with the >> RSEQ_FLAG_RELIABLE_CPU_ID flag. This would not make the state more >> complex in __rseq_abi, and let each rseq user decide about its own >> fate: whether it uses rseq or keeps using an rseq-free fallback. >> >> I am still tempted to allow combining RSEQ_FLAG_REGISTER | >> RSEQ_FLAG_RELIABLE_CPU_ID for applications which would not be using >> glibc, and want to check this flag on thread registration. > > Well, you could add a bug fix level field to the __rseq_abi variable. Even though I initially planned to make the struct rseq_abi extensible, the __rseq_abi variable ends up being fix-sized, so we need to be very careful in choosing what we place in the remaining 12 bytes of padding. I suspect we'd want to keep 8 bytes to express a pointer to an "extended" structure. I wonder if a bug fix level "version" is the right approach. We could instead have a bitmask of fixes, which the application could independently check. For instance, some applications may care about cpu_id field reliability, and others not. > Then applications could check if the kernel has the appropriate level > of non-buggyness. But the same thing could be useful for many other > kernel interfaces, and I haven't seen such a fix level value for them. > What makes rseq so special? I guess my only answer is because I care as a user of the system call, and what is a system call without users ? I have real applications which work today with end users deploying them on various kernels, old and new, and I want to take advantage of this system call to speed them up. However, if I have to choose between speed and correctness (in other words, not crashing a critical application), I will choose correctness. So if I cannot detect that I can safely use the system call, it becomes pretty much useless even for my own use-cases. > It won't help with the present bug, but maybe we should add an rseq > API sub-call that blocks future rseq registration for the thread. > Then we can add a glibc tunable that flips off rseq reliably if people > do not want to use it for some reason (and switch to the non-rseq > fallback code instead). But that's going to help with future bugs > only. I don't think it's needed. All I really need is to have _some_ way to let lttng-ust or liburcu query whether the cpu_id field is reliable. This state does not have to be made quickly accessible to other libraries, nor does it have to be shared between libraries. It would allow each user library or application to make its own mind on whether it can use rseq or not. Thanks, Mathieu -- Mathieu Desnoyers EfficiOS Inc. http://www.efficios.com ^ permalink raw reply [flat|nested] 25+ messages in thread
* Re: [RFC PATCH for 5.8 3/4] rseq: Introduce RSEQ_FLAG_RELIABLE_CPU_ID 2020-07-07 12:06 ` Mathieu Desnoyers @ 2020-07-07 18:53 ` Carlos O'Donell 2020-07-07 18:59 ` Mathieu Desnoyers 2020-07-07 19:55 ` Florian Weimer 0 siblings, 2 replies; 25+ messages in thread From: Carlos O'Donell @ 2020-07-07 18:53 UTC (permalink / raw) To: Mathieu Desnoyers, Florian Weimer Cc: Thomas Gleixner, linux-kernel, Peter Zijlstra, paulmck, Boqun Feng, H. Peter Anvin, Paul Turner, linux-api, Dmitry Vyukov, Neel Natu On 7/7/20 8:06 AM, Mathieu Desnoyers wrote: > ----- On Jul 7, 2020, at 7:32 AM, Florian Weimer fw@deneb.enyo.de wrote: > >> * Mathieu Desnoyers: >> >>> Those are very good points. One possibility we have would be to let >>> glibc do the rseq registration without the RSEQ_FLAG_RELIABLE_CPU_ID >>> flag. On kernels with the bug present, the cpu_id field is still good >>> enough for typical uses of sched_getcpu() which does not appear to >>> have a very strict correctness requirement on returning the right >>> cpu number. >>> >>> Then libraries and applications which require a reliable cpu_id >>> field could check this on their own by calling rseq with the >>> RSEQ_FLAG_RELIABLE_CPU_ID flag. This would not make the state more >>> complex in __rseq_abi, and let each rseq user decide about its own >>> fate: whether it uses rseq or keeps using an rseq-free fallback. >>> >>> I am still tempted to allow combining RSEQ_FLAG_REGISTER | >>> RSEQ_FLAG_RELIABLE_CPU_ID for applications which would not be using >>> glibc, and want to check this flag on thread registration. >> >> Well, you could add a bug fix level field to the __rseq_abi variable. > > Even though I initially planned to make the struct rseq_abi extensible, > the __rseq_abi variable ends up being fix-sized, so we need to be very > careful in choosing what we place in the remaining 12 bytes of padding. > I suspect we'd want to keep 8 bytes to express a pointer to an > "extended" structure. > > I wonder if a bug fix level "version" is the right approach. We could > instead have a bitmask of fixes, which the application could independently > check. For instance, some applications may care about cpu_id field > reliability, and others not. I agree with Florian. Developers are not interested in a bitmask of fixes. They want a version they can check and that at a given version everything works as expected. In reality today this is the kernel version. So rseq is broken from a developer perspective until they can get a new kernel or an agreement from their downstream vendor that revision Z of the kernel they are using has the fix you've just committed. Essentially this problem solves itself at level higher than the interfaces we're talking about today. Encoding this as a bitmask of fixes is an overengineered solution for a problem that the downstream communities already know how to solve. I would strongly suggest a "version" or nothing. >> Then applications could check if the kernel has the appropriate level >> of non-buggyness. But the same thing could be useful for many other >> kernel interfaces, and I haven't seen such a fix level value for them. >> What makes rseq so special? > > I guess my only answer is because I care as a user of the system call, and > what is a system call without users ? I have real applications which work > today with end users deploying them on various kernels, old and new, and I > want to take advantage of this system call to speed them up. However, if I > have to choose between speed and correctness (in other words, not crashing > a critical application), I will choose correctness. So if I cannot detect > that I can safely use the system call, it becomes pretty much useless even > for my own use-cases. Yes. In the case of RHEL we have *tons* of users in the same predicament. They just wait until the RHEL kernel team releases a fixed kernel version and check for that version and deploy with that version. Likewise every other user of a kernel. They solve it by asking their kernel provider, internal or external, to verify the fix is applied to the deployment kernel. If they are an ISV they have to test and deploy similar strategies for multiple kernel version. By trying to automate this you are encoding into the API some level of package management concepts e.g. patch level, revision level, etc. This is difficult to do without a more generalized API. Why do it just for rseq? Why do it with the few bits you have? It's not a great fit IMO. Just let the kernel version be the arbiter of correctness. >> It won't help with the present bug, but maybe we should add an rseq >> API sub-call that blocks future rseq registration for the thread. >> Then we can add a glibc tunable that flips off rseq reliably if people >> do not want to use it for some reason (and switch to the non-rseq >> fallback code instead). But that's going to help with future bugs >> only. > > I don't think it's needed. All I really need is to have _some_ way to > let lttng-ust or liburcu query whether the cpu_id field is reliable. This > state does not have to be made quickly accessible to other libraries, > nor does it have to be shared between libraries. It would allow each > user library or application to make its own mind on whether it can use > rseq or not. That check is "kernel version > x.y.z" :-) or "My kernel vendor told me it's fixed." -- Cheers, Carlos. ^ permalink raw reply [flat|nested] 25+ messages in thread
* Re: [RFC PATCH for 5.8 3/4] rseq: Introduce RSEQ_FLAG_RELIABLE_CPU_ID 2020-07-07 18:53 ` Carlos O'Donell @ 2020-07-07 18:59 ` Mathieu Desnoyers 2020-07-08 8:31 ` Florian Weimer 2020-07-07 19:55 ` Florian Weimer 1 sibling, 1 reply; 25+ messages in thread From: Mathieu Desnoyers @ 2020-07-07 18:59 UTC (permalink / raw) To: Carlos O'Donell Cc: Florian Weimer, Thomas Gleixner, linux-kernel, Peter Zijlstra, paulmck, Boqun Feng, H. Peter Anvin, Paul Turner, linux-api, Dmitry Vyukov, Neel Natu ----- On Jul 7, 2020, at 2:53 PM, Carlos O'Donell carlos@redhat.com wrote: > On 7/7/20 8:06 AM, Mathieu Desnoyers wrote: >> ----- On Jul 7, 2020, at 7:32 AM, Florian Weimer fw@deneb.enyo.de wrote: >> >>> * Mathieu Desnoyers: >>> >>>> Those are very good points. One possibility we have would be to let >>>> glibc do the rseq registration without the RSEQ_FLAG_RELIABLE_CPU_ID >>>> flag. On kernels with the bug present, the cpu_id field is still good >>>> enough for typical uses of sched_getcpu() which does not appear to >>>> have a very strict correctness requirement on returning the right >>>> cpu number. >>>> >>>> Then libraries and applications which require a reliable cpu_id >>>> field could check this on their own by calling rseq with the >>>> RSEQ_FLAG_RELIABLE_CPU_ID flag. This would not make the state more >>>> complex in __rseq_abi, and let each rseq user decide about its own >>>> fate: whether it uses rseq or keeps using an rseq-free fallback. >>>> >>>> I am still tempted to allow combining RSEQ_FLAG_REGISTER | >>>> RSEQ_FLAG_RELIABLE_CPU_ID for applications which would not be using >>>> glibc, and want to check this flag on thread registration. >>> >>> Well, you could add a bug fix level field to the __rseq_abi variable. >> >> Even though I initially planned to make the struct rseq_abi extensible, >> the __rseq_abi variable ends up being fix-sized, so we need to be very >> careful in choosing what we place in the remaining 12 bytes of padding. >> I suspect we'd want to keep 8 bytes to express a pointer to an >> "extended" structure. >> >> I wonder if a bug fix level "version" is the right approach. We could >> instead have a bitmask of fixes, which the application could independently >> check. For instance, some applications may care about cpu_id field >> reliability, and others not. > > I agree with Florian. > > Developers are not interested in a bitmask of fixes. > > They want a version they can check and that at a given version everything > works as expected. > > In reality today this is the kernel version. > > So rseq is broken from a developer perspective until they can get a new > kernel or an agreement from their downstream vendor that revision Z of > the kernel they are using has the fix you've just committed. > > Essentially this problem solves itself at level higher than the interfaces > we're talking about today. > > Encoding this as a bitmask of fixes is an overengineered solution for a > problem that the downstream communities already know how to solve. > > I would strongly suggest a "version" or nothing. That's a good point. > >>> Then applications could check if the kernel has the appropriate level >>> of non-buggyness. But the same thing could be useful for many other >>> kernel interfaces, and I haven't seen such a fix level value for them. >>> What makes rseq so special? >> >> I guess my only answer is because I care as a user of the system call, and >> what is a system call without users ? I have real applications which work >> today with end users deploying them on various kernels, old and new, and I >> want to take advantage of this system call to speed them up. However, if I >> have to choose between speed and correctness (in other words, not crashing >> a critical application), I will choose correctness. So if I cannot detect >> that I can safely use the system call, it becomes pretty much useless even >> for my own use-cases. > > Yes. > > In the case of RHEL we have *tons* of users in the same predicament. > > They just wait until the RHEL kernel team releases a fixed kernel version > and check for that version and deploy with that version. > > Likewise every other user of a kernel. They solve it by asking their > kernel provider, internal or external, to verify the fix is applied to > the deployment kernel. > > If they are an ISV they have to test and deploy similar strategies for > multiple kernel version. > > By trying to automate this you are encoding into the API some level of > package management concepts e.g. patch level, revision level, etc. > > This is difficult to do without a more generalized API. Why do it just > for rseq? Why do it with the few bits you have? > > It's not a great fit IMO. Just let the kernel version be the arbiter of > correctness. > >>> It won't help with the present bug, but maybe we should add an rseq >>> API sub-call that blocks future rseq registration for the thread. >>> Then we can add a glibc tunable that flips off rseq reliably if people >>> do not want to use it for some reason (and switch to the non-rseq >>> fallback code instead). But that's going to help with future bugs >>> only. >> >> I don't think it's needed. All I really need is to have _some_ way to >> let lttng-ust or liburcu query whether the cpu_id field is reliable. This >> state does not have to be made quickly accessible to other libraries, >> nor does it have to be shared between libraries. It would allow each >> user library or application to make its own mind on whether it can use >> rseq or not. > > That check is "kernel version > x.y.z" :-) > > or > > "My kernel vendor told me it's fixed." Allright, thanks for the insight! I'll drop these patches and focus only on the bugfix. Thanks, Mathieu > > -- > Cheers, > Carlos. -- Mathieu Desnoyers EfficiOS Inc. http://www.efficios.com ^ permalink raw reply [flat|nested] 25+ messages in thread
* Re: [RFC PATCH for 5.8 3/4] rseq: Introduce RSEQ_FLAG_RELIABLE_CPU_ID 2020-07-07 18:59 ` Mathieu Desnoyers @ 2020-07-08 8:31 ` Florian Weimer 0 siblings, 0 replies; 25+ messages in thread From: Florian Weimer @ 2020-07-08 8:31 UTC (permalink / raw) To: Mathieu Desnoyers Cc: Carlos O'Donell, Thomas Gleixner, linux-kernel, Peter Zijlstra, paulmck, Boqun Feng, H. Peter Anvin, Paul Turner, linux-api, Dmitry Vyukov, Neel Natu * Mathieu Desnoyers: > Allright, thanks for the insight! I'll drop these patches and focus only > on the bugfix. Thanks, much appreciated! ^ permalink raw reply [flat|nested] 25+ messages in thread
* Re: [RFC PATCH for 5.8 3/4] rseq: Introduce RSEQ_FLAG_RELIABLE_CPU_ID 2020-07-07 18:53 ` Carlos O'Donell 2020-07-07 18:59 ` Mathieu Desnoyers @ 2020-07-07 19:55 ` Florian Weimer 2020-07-08 15:33 ` Mathieu Desnoyers 1 sibling, 1 reply; 25+ messages in thread From: Florian Weimer @ 2020-07-07 19:55 UTC (permalink / raw) To: Carlos O'Donell Cc: Mathieu Desnoyers, Thomas Gleixner, linux-kernel, Peter Zijlstra, paulmck, Boqun Feng, H. Peter Anvin, Paul Turner, linux-api, Dmitry Vyukov, Neel Natu * Carlos O'Donell: > It's not a great fit IMO. Just let the kernel version be the arbiter of > correctness. For manual review, sure. But checking it programmatically does not yield good results due to backports. Even those who use the stable kernel series sometimes pick up critical fixes beforehand, so it's not reliable possible for a program to say, “I do not want to run on this kernel because it has a bad version”. We had a recent episode of this with the Go runtime, which tried to do exactly this. ^ permalink raw reply [flat|nested] 25+ messages in thread
* Re: [RFC PATCH for 5.8 3/4] rseq: Introduce RSEQ_FLAG_RELIABLE_CPU_ID 2020-07-07 19:55 ` Florian Weimer @ 2020-07-08 15:33 ` Mathieu Desnoyers 2020-07-08 16:22 ` Christian Brauner 0 siblings, 1 reply; 25+ messages in thread From: Mathieu Desnoyers @ 2020-07-08 15:33 UTC (permalink / raw) To: Florian Weimer, Linus Torvalds Cc: carlos, Thomas Gleixner, linux-kernel, Peter Zijlstra, paulmck, Boqun Feng, H. Peter Anvin, Paul Turner, linux-api, Dmitry Vyukov, Neel Natu [ Context for Linus: I am dropping this RFC patch, but am curious to hear your point of view on exposing to user-space which system call behavior fixes are present in the kernel, either through feature flags or system-call versioning. The intent is to allow user-space to make better decisions on whether it should use a system call or rely on fallback behavior. ] ----- On Jul 7, 2020, at 3:55 PM, Florian Weimer fw@deneb.enyo.de wrote: > * Carlos O'Donell: > >> It's not a great fit IMO. Just let the kernel version be the arbiter of >> correctness. > > For manual review, sure. But checking it programmatically does not > yield good results due to backports. Even those who use the stable > kernel series sometimes pick up critical fixes beforehand, so it's not > reliable possible for a program to say, “I do not want to run on this > kernel because it has a bad version”. We had a recent episode of this > with the Go runtime, which tried to do exactly this. FWIW, the kernel fix backport issue would also be a concern if we exposed a numeric "fix level version" with specific system calls: what should we do if a distribution chooses to include one fix in the sequence, but not others ? Identifying fixes are "feature flags" allow cherry-picking specific fixes in a backport, but versions would not allow that. That being said, maybe it's not such a bad thing to _require_ the entire series of fixes to be picked in backports, which would be a fortunate side-effect of the per-syscall-fix-version approach. But I'm under the impression that such a scheme ends up versioning a system call, which I suspect will be a no-go from Linus' perspective. Thanks, Mathieu -- Mathieu Desnoyers EfficiOS Inc. http://www.efficios.com ^ permalink raw reply [flat|nested] 25+ messages in thread
* Re: [RFC PATCH for 5.8 3/4] rseq: Introduce RSEQ_FLAG_RELIABLE_CPU_ID 2020-07-08 15:33 ` Mathieu Desnoyers @ 2020-07-08 16:22 ` Christian Brauner 2020-07-08 16:36 ` Florian Weimer 2020-07-08 17:34 ` Mathieu Desnoyers 0 siblings, 2 replies; 25+ messages in thread From: Christian Brauner @ 2020-07-08 16:22 UTC (permalink / raw) To: Mathieu Desnoyers Cc: Florian Weimer, Linus Torvalds, carlos, Thomas Gleixner, linux-kernel, Peter Zijlstra, paulmck, Boqun Feng, H. Peter Anvin, Paul Turner, linux-api, Dmitry Vyukov, Neel Natu On Wed, Jul 08, 2020 at 11:33:51AM -0400, Mathieu Desnoyers wrote: > [ Context for Linus: I am dropping this RFC patch, but am curious to > hear your point of view on exposing to user-space which system call > behavior fixes are present in the kernel, either through feature > flags or system-call versioning. The intent is to allow user-space > to make better decisions on whether it should use a system call or > rely on fallback behavior. ] > > ----- On Jul 7, 2020, at 3:55 PM, Florian Weimer fw@deneb.enyo.de wrote: > > > * Carlos O'Donell: > > > >> It's not a great fit IMO. Just let the kernel version be the arbiter of > >> correctness. > > > > For manual review, sure. But checking it programmatically does not > > yield good results due to backports. Even those who use the stable > > kernel series sometimes pick up critical fixes beforehand, so it's not > > reliable possible for a program to say, “I do not want to run on this > > kernel because it has a bad version”. We had a recent episode of this > > with the Go runtime, which tried to do exactly this. > > FWIW, the kernel fix backport issue would also be a concern if we exposed > a numeric "fix level version" with specific system calls: what should > we do if a distribution chooses to include one fix in the sequence, > but not others ? Identifying fixes are "feature flags" allow > cherry-picking specific fixes in a backport, but versions would not > allow that. > > That being said, maybe it's not such a bad thing to _require_ the > entire series of fixes to be picked in backports, which would be a > fortunate side-effect of the per-syscall-fix-version approach. > > But I'm under the impression that such a scheme ends up versioning > a system call, which I suspect will be a no-go from Linus' perspective. I've been following this a little bit. The kernel version itself doesn't really mean anything and the kernel version is imho not at all interesting to userspace applications. Especially for cross-distro programs. We can't go around and ask Red Hat, SUSE, Ubuntu, Archlinux, openSUSE and god knows who what other distro what their fixed kernel version is. That's not feasible at all and not how must programs do it. Sure, a lot of programs name a minimal kernel version they require but realistically we can't keep bumping it all the time. So the best strategy for userspace imho has been to introduce a re-versioned flag or enum that indicates the fixed behavior. So I would suggest to just introduce RSEQ_FLAG_REGISTER_2 = (1 << 2), that's how these things are usually done (Netlink etc.). So not introducing a fix bit or whatever but simply reversion your flag/enum. We already deal with this today. (Also, as a side-note. I see that you're passing struct rseq *rseq with a length argument but you are not versioning by size. Is that intentional? That basically somewhat locks you to the current struct rseq layout and means users might run into problems when you extend struct rseq in the future as they can't pass the new struct down to older kernels. The way we deal with this is now - rseq might preceed this - is copy_struct_from_user() (for example in sched_{get,set}attr(), openat2(), bpf(), clone3(), etc.). Maybe you want to switch to that to keep rseq extensible? Users can detect the new rseq version by just passing a larger struct down to the kernel with the extra bytes set to 0 and if rseq doesn't complain they know they're dealing with an rseq that knows larger struct sizes. Might be worth it if you have any reason to belive that struct rseq might need to grow.) Christian ^ permalink raw reply [flat|nested] 25+ messages in thread
* Re: [RFC PATCH for 5.8 3/4] rseq: Introduce RSEQ_FLAG_RELIABLE_CPU_ID 2020-07-08 16:22 ` Christian Brauner @ 2020-07-08 16:36 ` Florian Weimer 2020-07-08 17:34 ` Mathieu Desnoyers 1 sibling, 0 replies; 25+ messages in thread From: Florian Weimer @ 2020-07-08 16:36 UTC (permalink / raw) To: Christian Brauner Cc: Mathieu Desnoyers, Linus Torvalds, carlos, Thomas Gleixner, linux-kernel, Peter Zijlstra, paulmck, Boqun Feng, H. Peter Anvin, Paul Turner, linux-api, Dmitry Vyukov, Neel Natu * Christian Brauner: > I've been following this a little bit. The kernel version itself doesn't > really mean anything and the kernel version is imho not at all > interesting to userspace applications. Especially for cross-distro > programs. We can't go around and ask Red Hat, SUSE, Ubuntu, Archlinux, > openSUSE and god knows who what other distro what their fixed kernel > version is. And Red Hat Enterprise Linux only has a dozen or two kernel branches under active maintenance, each with their own succession of version numbers. It's just not feasible. Even figuring out the branch based on the kernel version can be tricky! > (Also, as a side-note. I see that you're passing struct rseq *rseq with > a length argument but you are not versioning by size. Is that > intentional? That basically somewhat locks you to the current struct > rseq layout and means users might run into problems when you extend > struct rseq in the future as they can't pass the new struct down to > older kernels. The way we deal with this is now - rseq might preceed > this - is copy_struct_from_user() The kernel retains the pointer after the system call returns. Basically, ownership of the memory area is transferred to the kernel. It's like set_robust_list in this regard. ^ permalink raw reply [flat|nested] 25+ messages in thread
* Re: [RFC PATCH for 5.8 3/4] rseq: Introduce RSEQ_FLAG_RELIABLE_CPU_ID 2020-07-08 16:22 ` Christian Brauner 2020-07-08 16:36 ` Florian Weimer @ 2020-07-08 17:34 ` Mathieu Desnoyers 2020-07-09 12:49 ` Christian Brauner 1 sibling, 1 reply; 25+ messages in thread From: Mathieu Desnoyers @ 2020-07-08 17:34 UTC (permalink / raw) To: Christian Brauner Cc: Florian Weimer, Linus Torvalds, carlos, Thomas Gleixner, linux-kernel, Peter Zijlstra, paulmck, Boqun Feng, H. Peter Anvin, Paul Turner, linux-api, Dmitry Vyukov, Neel Natu ----- On Jul 8, 2020, at 12:22 PM, Christian Brauner christian.brauner@ubuntu.com wrote: [...] > I've been following this a little bit. The kernel version itself doesn't > really mean anything and the kernel version is imho not at all > interesting to userspace applications. Especially for cross-distro > programs. We can't go around and ask Red Hat, SUSE, Ubuntu, Archlinux, > openSUSE and god knows who what other distro what their fixed kernel > version is. That's not feasible at all and not how must programs do it. > Sure, a lot of programs name a minimal kernel version they require but > realistically we can't keep bumping it all the time. So the best > strategy for userspace imho has been to introduce a re-versioned flag or > enum that indicates the fixed behavior. > > So I would suggest to just introduce > RSEQ_FLAG_REGISTER_2 = (1 << 2), > that's how these things are usually done (Netlink etc.). So not > introducing a fix bit or whatever but simply reversion your flag/enum. > We already deal with this today. Because rseq is effectively a per-thread resource shared across application and libraries, it is not practical to merge the notion of version with the registration. Typically __rseq_abi is registered by libc, and can be used by the application and by many libraries. Early adopter libraries and applications (e.g. librseq, tcmalloc) can also choose to handle registration if it's not already done by libc. For instance, it is acceptable for glibc to register rseq for all threads, even in the presence of the cpu_id field inaccuracy, for use by the sched_getcpu(3) implementation. However, users of rseq which need to implement critical sections performing per-cpu data updates may want to know whether the cpu_id field is reliable to ensure they do not crash the process due to per-cpu data corruption. This led me to consider exposing a feature-specific flag which can be queried by specific users to know whether rseq has specific set of correct behaviors implemented. > (Also, as a side-note. I see that you're passing struct rseq *rseq with > a length argument but you are not versioning by size. Is that > intentional? That basically somewhat locks you to the current struct > rseq layout and means users might run into problems when you extend > struct rseq in the future as they can't pass the new struct down to > older kernels. The way we deal with this is now - rseq might preceed > this - is copy_struct_from_user() (for example in sched_{get,set}attr(), > openat2(), bpf(), clone3(), etc.). Maybe you want to switch to that to > keep rseq extensible? Users can detect the new rseq version by just > passing a larger struct down to the kernel with the extra bytes set to 0 > and if rseq doesn't complain they know they're dealing with an rseq that > knows larger struct sizes. Might be worth it if you have any reason to > belive that struct rseq might need to grow.) In the initial iterations of the rseq patch set, we initially had the rseq_len argument hoping we would eventually be able to extend struct rseq. However, it was finally decided against making it extensible, so the rseq ABI merged into the Linux kernel with a fixed-size. One of the key reasons for this is explained in commit 83b0b15bcb0f ("rseq: Remove superfluous rseq_len from task_struct") The rseq system call, when invoked with flags of "0" or "RSEQ_FLAG_UNREGISTER" values, expects the rseq_len parameter to be equal to sizeof(struct rseq), which is fixed-size and fixed-layout, specified in uapi linux/rseq.h. Expecting a fixed size for rseq_len is a design choice that ensures multiple libraries and application defining __rseq_abi in the same process agree on its exact size. The issue here is caused by the fact that the __rseq_abi variable is shared across application/libraries for a given thread. So it's not enough to agree between kernel and user-space on the extensibility scheme, but we'd also have to find a way for all users within a process to agree on the layout. The "rseq_len" parameter could eventually become useful when combined with additional flags, but currently its size is fixed. There are indeed clear use-cases for extending struct rseq (or add a new similar TLS structure), for instance exposing the current numa node id, or to implement a fast signal-disabling scheme. But the fact that struct rseq is shared across application/libraries makes it tricky to "just extend" struct rseq. Thanks, Mathieu -- Mathieu Desnoyers EfficiOS Inc. http://www.efficios.com ^ permalink raw reply [flat|nested] 25+ messages in thread
* Re: [RFC PATCH for 5.8 3/4] rseq: Introduce RSEQ_FLAG_RELIABLE_CPU_ID 2020-07-08 17:34 ` Mathieu Desnoyers @ 2020-07-09 12:49 ` Christian Brauner 2020-07-09 15:15 ` Mathieu Desnoyers 0 siblings, 1 reply; 25+ messages in thread From: Christian Brauner @ 2020-07-09 12:49 UTC (permalink / raw) To: Mathieu Desnoyers Cc: Florian Weimer, Linus Torvalds, carlos, Thomas Gleixner, linux-kernel, Peter Zijlstra, paulmck, Boqun Feng, H. Peter Anvin, Paul Turner, linux-api, Dmitry Vyukov, Neel Natu On Wed, Jul 08, 2020 at 01:34:48PM -0400, Mathieu Desnoyers wrote: > ----- On Jul 8, 2020, at 12:22 PM, Christian Brauner christian.brauner@ubuntu.com wrote: > [...] > > I've been following this a little bit. The kernel version itself doesn't > > really mean anything and the kernel version is imho not at all > > interesting to userspace applications. Especially for cross-distro > > programs. We can't go around and ask Red Hat, SUSE, Ubuntu, Archlinux, > > openSUSE and god knows who what other distro what their fixed kernel > > version is. That's not feasible at all and not how must programs do it. > > Sure, a lot of programs name a minimal kernel version they require but > > realistically we can't keep bumping it all the time. So the best > > strategy for userspace imho has been to introduce a re-versioned flag or > > enum that indicates the fixed behavior. > > > > So I would suggest to just introduce > > RSEQ_FLAG_REGISTER_2 = (1 << 2), > > that's how these things are usually done (Netlink etc.). So not > > introducing a fix bit or whatever but simply reversion your flag/enum. > > We already deal with this today. > > Because rseq is effectively a per-thread resource shared across application > and libraries, it is not practical to merge the notion of version with the > registration. Typically __rseq_abi is registered by libc, and can be used > by the application and by many libraries. Early adopter libraries and > applications (e.g. librseq, tcmalloc) can also choose to handle registration > if it's not already done by libc. I'm probably missing the elephant in the room but I was briefly looking at github.com/compudj/librseq and it seems to me that the registration you're talking about is: extern __thread struct rseq __rseq_abi; extern int __rseq_handled; and it's done in int rseq_register_current_thread(void) afaict and currently registration is done with flags set to 0. What is the problem with either adding a - I don't know - RSEG_FLAG_REGISTER/RSEQ_RELIABLE_CPU_FIELD flag that is also recorded in __rseq_abi.flags. If the kernel doesn't support the flag it will fail registration with EINVAL. So the registering program can detect it. If a caller needs to know whether another thread uses the new flag it can query __rseq_abi.flags. Some form of coordination must be possible in userspace otherwise you'll have trouble with any new feature you add. I general, I don't see how this is different from adding a new feature to rseq. It should be the same principle. I also don't understand the "not practical to merge the notion of version with the registration". I'm not sure what that means to be honest. :) But just thinking about adding a new feature to rseq. Then you're in the same spot, I think. When you register a bumped rseq - because you added a new flag or whatever - you register a new version one way or the other since a new feature - imho - is always a version bump. In fact, you could think of your "reliable cpu" as a new feature not a bug. ;) Also, you seem to directly version struct rseq_cs already through the "version" member. So even if you are against the new flag I wouldn't know what would stop you from directly versioning struct rseq itself. And it's not that we don't version syscalls. We're doing it in multiple ways to be honest, syscalls with a flag argument that reject unknown flags are bumped in their version every time you add a new flag that they accept. We don't spell this out but this is effectively what it is. Think of it as a minor version bump. Extensible syscalls are versioned by size and when their struct grows are bumped in their (minor) version. In fact extensible syscalls with flags argument embedded in the struct can be version bumped in two ways: growing a new flag argument or growing a new struct member. > > For instance, it is acceptable for glibc to register rseq for all threads, > even in the presence of the cpu_id field inaccuracy, for use by the > sched_getcpu(3) implementation. However, users of rseq which need to > implement critical sections performing per-cpu data updates may want > to know whether the cpu_id field is reliable to ensure they do not crash > the process due to per-cpu data corruption. > > This led me to consider exposing a feature-specific flag which can be > queried by specific users to know whether rseq has specific set of correct > behaviors implemented. > > > (Also, as a side-note. I see that you're passing struct rseq *rseq with > > a length argument but you are not versioning by size. Is that > > intentional? That basically somewhat locks you to the current struct > > rseq layout and means users might run into problems when you extend > > struct rseq in the future as they can't pass the new struct down to > > older kernels. The way we deal with this is now - rseq might preceed > > this - is copy_struct_from_user() (for example in sched_{get,set}attr(), > > openat2(), bpf(), clone3(), etc.). Maybe you want to switch to that to > > keep rseq extensible? Users can detect the new rseq version by just > > passing a larger struct down to the kernel with the extra bytes set to 0 > > and if rseq doesn't complain they know they're dealing with an rseq that > > knows larger struct sizes. Might be worth it if you have any reason to > > belive that struct rseq might need to grow.) > > In the initial iterations of the rseq patch set, we initially had the rseq_len > argument hoping we would eventually be able to extend struct rseq. However, > it was finally decided against making it extensible, so the rseq ABI merged > into the Linux kernel with a fixed-size. > > One of the key reasons for this is explained in > commit 83b0b15bcb0f ("rseq: Remove superfluous rseq_len from task_struct") > > The rseq system call, when invoked with flags of "0" or > "RSEQ_FLAG_UNREGISTER" values, expects the rseq_len parameter to > be equal to sizeof(struct rseq), which is fixed-size and fixed-layout, > specified in uapi linux/rseq.h. > > Expecting a fixed size for rseq_len is a design choice that ensures > multiple libraries and application defining __rseq_abi in the same > process agree on its exact size. > > The issue here is caused by the fact that the __rseq_abi variable is > shared across application/libraries for a given thread. So it's not > enough to agree between kernel and user-space on the extensibility > scheme, but we'd also have to find a way for all users within a process > to agree on the layout. But you're in the same boat if you add any new feature, no? In your model, wouldn't you need all users to agree on the feature set as well? Not just the struct rseq size. If that's the case then rseq would be unextendable (for now). But specifically about the size-versioning part. Well, one way to solve this - imho - would be to add a output size parameter to struct rseq and introduce a little more vetting than we have right now. So the kernel is what ultimately registers struct rseq iiuc. If there were a size output parameter the kernel could set the size of the struct it knows about before registering it. So a caller passing down a larger struct with e.g. a new field set to a non-zero value would get an error from the kernel and the size of the supported struct. The caller can then adjust and simply zero out the unsupported field and retry. Other callers in userspace can query the size and find out what size of struct is registered. If it's larger than what they know about they can infer they are on a newer kernel with new features but they can simply ignore the unknown fields. If it's smaller than they know what fields to ignore. I hope I'm not derailing this discussion. I'm just trying to show that there's some hope. In short, imho, I think adding a flag indicating the new "reliable cpu" feature is probably the best way to go. Think of it not as the kernel indicating the absence of a bug but as the presence of a feature. :) Thanks! Christian ^ permalink raw reply [flat|nested] 25+ messages in thread
* Re: [RFC PATCH for 5.8 3/4] rseq: Introduce RSEQ_FLAG_RELIABLE_CPU_ID 2020-07-09 12:49 ` Christian Brauner @ 2020-07-09 15:15 ` Mathieu Desnoyers 2020-07-11 15:54 ` Christian Brauner 0 siblings, 1 reply; 25+ messages in thread From: Mathieu Desnoyers @ 2020-07-09 15:15 UTC (permalink / raw) To: Christian Brauner Cc: Florian Weimer, Linus Torvalds, carlos, Thomas Gleixner, linux-kernel, Peter Zijlstra, paulmck, Boqun Feng, H. Peter Anvin, Paul Turner, linux-api, Dmitry Vyukov, Neel Natu ----- On Jul 9, 2020, at 8:49 AM, Christian Brauner christian.brauner@ubuntu.com wrote: > On Wed, Jul 08, 2020 at 01:34:48PM -0400, Mathieu Desnoyers wrote: >> ----- On Jul 8, 2020, at 12:22 PM, Christian Brauner >> christian.brauner@ubuntu.com wrote: >> [...] >> > I've been following this a little bit. The kernel version itself doesn't >> > really mean anything and the kernel version is imho not at all >> > interesting to userspace applications. Especially for cross-distro >> > programs. We can't go around and ask Red Hat, SUSE, Ubuntu, Archlinux, >> > openSUSE and god knows who what other distro what their fixed kernel >> > version is. That's not feasible at all and not how must programs do it. >> > Sure, a lot of programs name a minimal kernel version they require but >> > realistically we can't keep bumping it all the time. So the best >> > strategy for userspace imho has been to introduce a re-versioned flag or >> > enum that indicates the fixed behavior. >> > >> > So I would suggest to just introduce >> > RSEQ_FLAG_REGISTER_2 = (1 << 2), >> > that's how these things are usually done (Netlink etc.). So not >> > introducing a fix bit or whatever but simply reversion your flag/enum. >> > We already deal with this today. >> >> Because rseq is effectively a per-thread resource shared across application >> and libraries, it is not practical to merge the notion of version with the >> registration. Typically __rseq_abi is registered by libc, and can be used >> by the application and by many libraries. Early adopter libraries and >> applications (e.g. librseq, tcmalloc) can also choose to handle registration >> if it's not already done by libc. > > I'm probably missing the elephant in the room but I was briefly looking > at github.com/compudj/librseq and it seems to me that the registration > you're talking about is: > > extern __thread struct rseq __rseq_abi; > extern int __rseq_handled; Note that __rseq_handled has now vanished, adapting to glibc's ABI. I just updated librseq's header accordingly. > > and it's done in int rseq_register_current_thread(void) afaict and > currently registration is done with flags set to 0. Correct, however that registration will become a no-op when linked against a glibc 2.32+, because the glibc will already have handled the registration at thread creation. > > What is the problem with either adding a - I don't know - > RSEG_FLAG_REGISTER/RSEQ_RELIABLE_CPU_FIELD flag that is also recorded in > __rseq_abi.flags. If the kernel doesn't support the flag it will fail > registration with EINVAL. So the registering program can detect it. If a > caller needs to know whether another thread uses the new flag it can > query __rseq_abi.flags. Some form of coordination must be possible in > userspace otherwise you'll have trouble with any new feature you add. I > general, I don't see how this is different from adding a new feature to > rseq. It should be the same principle. The problem with "extending" struct rseq is that it becomes complex because it is shared between libraries and application. Let's suppose the library doing the rseq registration does the scheme you describe: queries the kernel for features, and stores them in the __rseq_abi.flags. We end up with the following upgrade transition headhaches for an application using __rseq_abi: Kernel | glibc | librseq | __rseq_abi registration owner ---------------------------------------------------------------------- 4.18 | 2.31 | no | application (reliable cpu_id = false) 4.18 | 2.31 | yes | librseq (reliable cpu_id = false) 5.8 | 2.31 | yes | librseq (reliable cpu_id = true) 5.8 | 2.32 | yes | glibc (reliable cpu_id = false) 5.8 | 2.33+ | yes | glibc (reliable cpu_id = true) This kind of transition regressing feature-wise when upgrading a glibc can be confusing for users. One possibility would be to have the kernel store the "reliable cpu_id" flag directly into a new __rseq_abi.kernel_flags (because __rseq_abi.flags is documented as only read by the kernel). This would remove the registration owner from the upgrade scenarios. But what would we gain by exposing this flag within struct rseq ? The only real reason for doing so over using an explicit system call is typically speed, and querying the kernel for a feature does not need to be done often, so this is why I originally favored exposing this information through a new system call flag without changing the content of struct rseq_cs. One additional thing to keep in mind: the application can itself choose to define the __rseq_abi TLS, which AFAIU (please let me know if I am wrong) would take precedence over glibc's copy. So extending the size of struct rseq seems rather tricky because the application may provide a smaller __rseq_abi, even if both the kernel and glibc agree on a larger size. > > I also don't understand the "not practical to merge the notion of > version with the registration". I'm not sure what that means to be > honest. :) The notion of "version" here would be to replace the "RELIABLE_CPU_FIELD" flag I proposed with a steadily-increasing "fix" version instead. For both approaches, we could either pass them as parameters with rseq registration, and make rseq registration success conditional on the kernel implementing those feature/fix-version, or validate the flag/version separately from registration. If this is done on registration, it means glibc will eventually have to handle this. This prevents user libraries with specific needs to query whether their features are available. Doing the feature/version validation separately from registration allows each user library to make its own queries and take advantage of new kernel features before glibc is upgraded to be made aware of them. > But just thinking about adding a new feature to rseq. Then you're in the > same spot, I think. When you register a bumped rseq - because you added > a new flag or whatever - you register a new version one way or the other > since a new feature - imho - is always a version bump. In fact, you > could think of your "reliable cpu" as a new feature not a bug. ;) Indeed. > Also, you seem to directly version struct rseq_cs already through the > "version" member. So even if you are against the new flag I wouldn't > know what would stop you from directly versioning struct rseq itself. struct rseq needs to be shared between application and libraries, with issues about what to do if size changes when we have an application defining a small struct rseq (taking precedence over glibc's), and glibc agreeing with the kernel on a larger structure. So there is little hope in changing that layout. The case of struct rseq_cs is simpler: it is only used as interface between a specific library/application user and the kernel, which allows us to version the structure and create new layouts as needed. > > And it's not that we don't version syscalls. We're doing it in multiple > ways to be honest, syscalls with a flag argument that reject unknown > flags are bumped in their version every time you add a new flag that > they accept. We don't spell this out but this is effectively what it is. > Think of it as a minor version bump. Extensible syscalls are versioned > by size and when their struct grows are bumped in their (minor) version. > In fact extensible syscalls with flags argument embedded in the struct > can be version bumped in two ways: growing a new flag argument or > growing a new struct member. The issue with struct rseq extensibility is not so much ABI between kernel and user-space, but rather ABI between userspace libraries/application users. > >> >> For instance, it is acceptable for glibc to register rseq for all threads, >> even in the presence of the cpu_id field inaccuracy, for use by the >> sched_getcpu(3) implementation. However, users of rseq which need to >> implement critical sections performing per-cpu data updates may want >> to know whether the cpu_id field is reliable to ensure they do not crash >> the process due to per-cpu data corruption. >> >> This led me to consider exposing a feature-specific flag which can be >> queried by specific users to know whether rseq has specific set of correct >> behaviors implemented. >> >> > (Also, as a side-note. I see that you're passing struct rseq *rseq with >> > a length argument but you are not versioning by size. Is that >> > intentional? That basically somewhat locks you to the current struct >> > rseq layout and means users might run into problems when you extend >> > struct rseq in the future as they can't pass the new struct down to >> > older kernels. The way we deal with this is now - rseq might preceed >> > this - is copy_struct_from_user() (for example in sched_{get,set}attr(), >> > openat2(), bpf(), clone3(), etc.). Maybe you want to switch to that to >> > keep rseq extensible? Users can detect the new rseq version by just >> > passing a larger struct down to the kernel with the extra bytes set to 0 >> > and if rseq doesn't complain they know they're dealing with an rseq that >> > knows larger struct sizes. Might be worth it if you have any reason to >> > belive that struct rseq might need to grow.) >> >> In the initial iterations of the rseq patch set, we initially had the rseq_len >> argument hoping we would eventually be able to extend struct rseq. However, >> it was finally decided against making it extensible, so the rseq ABI merged >> into the Linux kernel with a fixed-size. >> >> One of the key reasons for this is explained in >> commit 83b0b15bcb0f ("rseq: Remove superfluous rseq_len from task_struct") >> >> The rseq system call, when invoked with flags of "0" or >> "RSEQ_FLAG_UNREGISTER" values, expects the rseq_len parameter to >> be equal to sizeof(struct rseq), which is fixed-size and fixed-layout, >> specified in uapi linux/rseq.h. >> >> Expecting a fixed size for rseq_len is a design choice that ensures >> multiple libraries and application defining __rseq_abi in the same >> process agree on its exact size. >> >> The issue here is caused by the fact that the __rseq_abi variable is >> shared across application/libraries for a given thread. So it's not >> enough to agree between kernel and user-space on the extensibility >> scheme, but we'd also have to find a way for all users within a process >> to agree on the layout. > > But you're in the same boat if you add any new feature, no? In your > model, wouldn't you need all users to agree on the feature set as well? > Not just the struct rseq size. If that's the case then rseq would be > unextendable (for now). As a consequence of this, my current approach to add a "node_id" field to rseq (in a prototype patch) is far from ideal: it defines another TLS symbol, e.g. __rseq_abi2, with an extended layout, and registers it with new rseq flags. I would really like to be able to extend struct rseq, but because of ABI compatibility required between user-space libraries/applications, it seems rather tricky to do so. > But specifically about the size-versioning part. Well, one way to solve > this - imho - would be to add a output size parameter to struct rseq and > introduce a little more vetting than we have right now. > So the kernel is what ultimately registers struct rseq iiuc. If there > were a size output parameter the kernel could set the size of the struct > it knows about before registering it. > So a caller passing down a larger struct with e.g. a new field set to a > non-zero value would get an error from the kernel and the size of the > supported struct. The caller can then adjust and simply zero out the > unsupported field and retry. Other callers in userspace can query the > size and find out what size of struct is registered. If it's larger than > what they know about they can infer they are on a newer kernel with new > features but they can simply ignore the unknown fields. If it's smaller > than they know what fields to ignore. How would that work in the case of an application defining its own copy of struct rseq __rseq_abi TLS with sizeof(struct rseq) == 32, and then upgrading its glibc to a new glibc which implements a larger structure, which agrees with the kernel on that larger layout ? > > I hope I'm not derailing this discussion. I'm just trying to show that > there's some hope. > In short, imho, I think adding a flag indicating the new "reliable cpu" > feature is probably the best way to go. Think of it not as the kernel > indicating the absence of a bug but as the presence of a feature. :) Indeed, the "reliable cpu_id" feature is simpler, because it does not require any layout change nor size extension whatsoever. But I think it's good that we discuss those extensibility challenges right away, because there are a lot of moving pieces involved on the user-space side. Thanks, Mathieu > > Thanks! > Christian -- Mathieu Desnoyers EfficiOS Inc. http://www.efficios.com ^ permalink raw reply [flat|nested] 25+ messages in thread
* Re: [RFC PATCH for 5.8 3/4] rseq: Introduce RSEQ_FLAG_RELIABLE_CPU_ID 2020-07-09 15:15 ` Mathieu Desnoyers @ 2020-07-11 15:54 ` Christian Brauner 2020-07-13 18:40 ` Mathieu Desnoyers 0 siblings, 1 reply; 25+ messages in thread From: Christian Brauner @ 2020-07-11 15:54 UTC (permalink / raw) To: Mathieu Desnoyers Cc: Florian Weimer, Linus Torvalds, carlos, Thomas Gleixner, linux-kernel, Peter Zijlstra, paulmck, Boqun Feng, H. Peter Anvin, Paul Turner, linux-api, Dmitry Vyukov, Neel Natu On Thu, Jul 09, 2020 at 11:15:57AM -0400, Mathieu Desnoyers wrote: > ----- On Jul 9, 2020, at 8:49 AM, Christian Brauner christian.brauner@ubuntu.com wrote: > > > On Wed, Jul 08, 2020 at 01:34:48PM -0400, Mathieu Desnoyers wrote: > >> ----- On Jul 8, 2020, at 12:22 PM, Christian Brauner > >> christian.brauner@ubuntu.com wrote: > >> [...] > >> > I've been following this a little bit. The kernel version itself doesn't > >> > really mean anything and the kernel version is imho not at all > >> > interesting to userspace applications. Especially for cross-distro > >> > programs. We can't go around and ask Red Hat, SUSE, Ubuntu, Archlinux, > >> > openSUSE and god knows who what other distro what their fixed kernel > >> > version is. That's not feasible at all and not how must programs do it. > >> > Sure, a lot of programs name a minimal kernel version they require but > >> > realistically we can't keep bumping it all the time. So the best > >> > strategy for userspace imho has been to introduce a re-versioned flag or > >> > enum that indicates the fixed behavior. > >> > > >> > So I would suggest to just introduce > >> > RSEQ_FLAG_REGISTER_2 = (1 << 2), > >> > that's how these things are usually done (Netlink etc.). So not > >> > introducing a fix bit or whatever but simply reversion your flag/enum. > >> > We already deal with this today. > >> > >> Because rseq is effectively a per-thread resource shared across application > >> and libraries, it is not practical to merge the notion of version with the > >> registration. Typically __rseq_abi is registered by libc, and can be used > >> by the application and by many libraries. Early adopter libraries and > >> applications (e.g. librseq, tcmalloc) can also choose to handle registration > >> if it's not already done by libc. > > > > I'm probably missing the elephant in the room but I was briefly looking > > at github.com/compudj/librseq and it seems to me that the registration > > you're talking about is: > > > > extern __thread struct rseq __rseq_abi; > > extern int __rseq_handled; > > Note that __rseq_handled has now vanished, adapting to glibc's ABI. I just > updated librseq's header accordingly. > > > > > and it's done in int rseq_register_current_thread(void) afaict and > > currently registration is done with flags set to 0. > > Correct, however that registration will become a no-op when linked against a > glibc 2.32+, because the glibc will already have handled the registration > at thread creation. The registration is a thread-group property I'll assume, right, i.e. all threads will have rseq TLS or no thread will have it? Some things I seem to be able to assume (correct me if I'm wrong): - rseq registration is expected to be done at thread creation time - rseq registration _should_ only be done once, i.e. if a caller detects that rseq is already registered for a thread, then they could technically re-register rseq - risking a feature mismatch if doing so - but they shouldn't re-register but simply assume that someone else is in control of rseq. If they violate that assumption than you're hosed anyway. So it seems as long as callers leave rseq registration alone whenever they detect that it is already registered then one can assume that rseq is under consistent control of a single library/program. If that's the case it should safe to assume that the library will use the same rseq registration for all threads bounded by the available kernel features or by the set of features it is aware of. > > > > > What is the problem with either adding a - I don't know - > > RSEG_FLAG_REGISTER/RSEQ_RELIABLE_CPU_FIELD flag that is also recorded in > > __rseq_abi.flags. If the kernel doesn't support the flag it will fail > > registration with EINVAL. So the registering program can detect it. If a > > caller needs to know whether another thread uses the new flag it can > > query __rseq_abi.flags. Some form of coordination must be possible in > > userspace otherwise you'll have trouble with any new feature you add. I > > general, I don't see how this is different from adding a new feature to > > rseq. It should be the same principle. > > The problem with "extending" struct rseq is that it becomes complex > because it is shared between libraries and application. Let's suppose > the library doing the rseq registration does the scheme you describe: > queries the kernel for features, and stores them in the __rseq_abi.flags. > We end up with the following upgrade transition headhaches for an > application using __rseq_abi: > > Kernel | glibc | librseq | __rseq_abi registration owner > ---------------------------------------------------------------------- > 4.18 | 2.31 | no | application (reliable cpu_id = false) > 4.18 | 2.31 | yes | librseq (reliable cpu_id = false) > 5.8 | 2.31 | yes | librseq (reliable cpu_id = true) > 5.8 | 2.32 | yes | glibc (reliable cpu_id = false) > 5.8 | 2.33+ | yes | glibc (reliable cpu_id = true) > > This kind of transition regressing feature-wise when upgrading a glibc > can be confusing for users. Sure, but that's seems like a problem that is not specific to rseq. That's a problem that any interface has which introduces new features or fixes bugs. > > One possibility would be to have the kernel store the "reliable cpu_id" > flag directly into a new __rseq_abi.kernel_flags (because __rseq_abi.flags > is documented as only read by the kernel). This would remove the registration Ah, there we go. That's a missing piece of information I didn't have. Thank you. > owner from the upgrade scenarios. But what would we gain by exposing this > flag within struct rseq ? The only real reason for doing so over using an I proposed that specific scheme because I was under the impression that you are in need of a mechanism for a caller (at thread creation I assume) to check what feature set is supported by rseq _without_ issung a system call. If you were to record the requested flags in struct rseq or in some other way, then another library which tries to register rseq for a thread but detects it has already been registered can look at e.g. whether the reliable cpu feature is around and then adjust it's behavior accordingly. Another reason why this seems worthwhile is because of how rseq works in general. Since it registers a piece of userspace memory which userspace can trivially access enforcing that userspace issue a syscall to get at the feature list seems odd when you can just record it in the struct. But that's a matter of style, I guess. Also, I'm thinking about the case of adding one or two new features that introduce mutually exclusive behavior for rseq, i.e. if you register rseq with FEAT1 and someone else registers it with FEAT2 and FEAT1 and FEAT2 would lead to incompatible behavior for an aspect or all of rseq. Even if you had a way to query the kernel for FEAT1 and FEAT2 in the rseq syscall it still be a problem since a caller wouldn't know at rseq registration time whether the library registered rseq with FEAT1 or FEAT2. If you record the behavior somewhere - kernel_flags or whatever - then the caller could check at registration time "ah, rseq is registered with this behavior" I need to adjust my behavior. > explicit system call is typically speed, and querying the kernel for a > feature does not need to be done often, so this is why I originally favored > exposing this information through a new system call flag without changing > the content of struct rseq_cs. Sure, there are multiple ways of doing this. What you're proposing is one. Though that is not as simple as it seems, I fear. :) If you add a new flag to the system call that effectively only exists to query the supported features then you need to decide whether checking for a feature means the system call just returns and all it does is to give you back the features it supports, i.e. it doesn't do any real work or if it actually does real work. That becomes a little more tricky actually. If the system call does the feature checking and registration in one shot you need to decide whether it should fail registration if a feature is not supported or whether it should just move one. If the latter then you likely want a way for the kernel to report back on the features it supports. That's not a big deal if it's only a single feature but it becomes more tricky if you introduce multiple features because then you don't know what feature the kernel doesn't support and you likely want to know what feature made the system call fail, i.e. what feature is missing. :) Thinking about this a little bit I would think what makes the most sense for such a scheme is the following: - introduce a flag CHECK_FEATURES flag wich by default makes the system call a noop effectively and just reports back (somewhere in struct rseq) the features it has and e.g. masking out any features it doesn't know about. - introduce another flag (again, dummy names here) IGNORE_UNKNOWN_FEATURES which makes the system call succeed making use of all the features requested that it supports, masking out any features it doesn't know about. This way a library/caller can decide what behavior it wants to have. That's advanced behavior though and maybe that "just introduce a flag and make it fail if not supported by the kernel" is enough for your use-case now and the other stuff can be added when more features happen. > > One additional thing to keep in mind: the application can itself choose > to define the __rseq_abi TLS, which AFAIU (please let me know if I am > wrong) would take precedence over glibc's copy. So extending the You mean either that an application could simply choose to ignore that e.g. glibc has already registered rseq and e.g. unregister it and register it's own or that it registers it's own rseq before glibc comes into the play? I mean, if I interpreted what you're trying to say correctly, I think one needs to work under the assumption that if someone else has already registered rseq than it becomes the single source of truth. I think that basically needs to be the contract. Same way you expect a user of pthreads to not suddenly go and call clone3() with CLONE_THREAD | CLONE_VM | [...] and assume that this won't mess with glibc's internal state. > size of struct rseq seems rather tricky because the application may > provide a smaller __rseq_abi, even if both the kernel and glibc agree > on a larger size. > > > > > I also don't understand the "not practical to merge the notion of > > version with the registration". I'm not sure what that means to be > > honest. :) > > The notion of "version" here would be to replace the "RELIABLE_CPU_FIELD" > flag I proposed with a steadily-increasing "fix" version instead. Sure. > > For both approaches, we could either pass them as parameters with rseq > registration, and make rseq registration success conditional on the > kernel implementing those feature/fix-version, or validate the flag/version > separately from registration. > > If this is done on registration, it means glibc will eventually have to > handle this. This prevents user libraries with specific needs to query > whether their features are available. Doing the feature/version validation > separately from registration allows each user library to make its own > queries and take advantage of new kernel features before glibc is > upgraded to be made aware of them. Why isn't there a "dual scheme"? I.e. you record the features somewhere in struct rseq or some other place so userspace can query an already registered thread for the features it was registered with and have a way to query the kernel for the supported features through the system call (See what I suggested above with the feature checking flags.). > > > But just thinking about adding a new feature to rseq. Then you're in the > > same spot, I think. When you register a bumped rseq - because you added > > a new flag or whatever - you register a new version one way or the other > > since a new feature - imho - is always a version bump. In fact, you > > could think of your "reliable cpu" as a new feature not a bug. ;) > > Indeed. > > > Also, you seem to directly version struct rseq_cs already through the > > "version" member. So even if you are against the new flag I wouldn't > > know what would stop you from directly versioning struct rseq itself. > > struct rseq needs to be shared between application and libraries, with > issues about what to do if size changes when we have an application > defining a small struct rseq (taking precedence over glibc's), and glibc > agreeing with the kernel on a larger structure. So there is little hope > in changing that layout. I really think this is not an rseq specific problem. This seems to be a generic problem any interface has when it e.g. makes use of an extended struct and e.g. decides to make its own syscalls without going through the glibc wrappers (If there are any...). That's the reality right now and will likely always be that way short of us blocking non-libc syscalls similar to some bsds at which point we need a 1:1 kernel + libc development. :) That's not going to happen. The problem ranges from struct statx to struct clone_args and struct open_how and so on. But one question. Musn't the assumption be that all threads in a thread-group if they have registered rseq then the same application/library has done that registration? And if that's the case then the application/library doing the registration is what defines the layout for that thread-group and becomes the one source of truth. Basically, if an application uses it's own rseq then glibc must be out of the picture. If that's not part of the contract then it feels to me that rseq cannot be extended (for now). > > The case of struct rseq_cs is simpler: it is only used as interface between > a specific library/application user and the kernel, which allows us to > version the structure and create new layouts as needed. > > > > > And it's not that we don't version syscalls. We're doing it in multiple > > ways to be honest, syscalls with a flag argument that reject unknown > > flags are bumped in their version every time you add a new flag that > > they accept. We don't spell this out but this is effectively what it is. > > Think of it as a minor version bump. Extensible syscalls are versioned > > by size and when their struct grows are bumped in their (minor) version. > > In fact extensible syscalls with flags argument embedded in the struct > > can be version bumped in two ways: growing a new flag argument or > > growing a new struct member. > > The issue with struct rseq extensibility is not so much ABI between kernel > and user-space, but rather ABI between userspace libraries/application users. > > > > >> > >> For instance, it is acceptable for glibc to register rseq for all threads, > >> even in the presence of the cpu_id field inaccuracy, for use by the > >> sched_getcpu(3) implementation. However, users of rseq which need to > >> implement critical sections performing per-cpu data updates may want > >> to know whether the cpu_id field is reliable to ensure they do not crash > >> the process due to per-cpu data corruption. > >> > >> This led me to consider exposing a feature-specific flag which can be > >> queried by specific users to know whether rseq has specific set of correct > >> behaviors implemented. > >> > >> > (Also, as a side-note. I see that you're passing struct rseq *rseq with > >> > a length argument but you are not versioning by size. Is that > >> > intentional? That basically somewhat locks you to the current struct > >> > rseq layout and means users might run into problems when you extend > >> > struct rseq in the future as they can't pass the new struct down to > >> > older kernels. The way we deal with this is now - rseq might preceed > >> > this - is copy_struct_from_user() (for example in sched_{get,set}attr(), > >> > openat2(), bpf(), clone3(), etc.). Maybe you want to switch to that to > >> > keep rseq extensible? Users can detect the new rseq version by just > >> > passing a larger struct down to the kernel with the extra bytes set to 0 > >> > and if rseq doesn't complain they know they're dealing with an rseq that > >> > knows larger struct sizes. Might be worth it if you have any reason to > >> > belive that struct rseq might need to grow.) > >> > >> In the initial iterations of the rseq patch set, we initially had the rseq_len > >> argument hoping we would eventually be able to extend struct rseq. However, > >> it was finally decided against making it extensible, so the rseq ABI merged > >> into the Linux kernel with a fixed-size. > >> > >> One of the key reasons for this is explained in > >> commit 83b0b15bcb0f ("rseq: Remove superfluous rseq_len from task_struct") > >> > >> The rseq system call, when invoked with flags of "0" or > >> "RSEQ_FLAG_UNREGISTER" values, expects the rseq_len parameter to > >> be equal to sizeof(struct rseq), which is fixed-size and fixed-layout, > >> specified in uapi linux/rseq.h. > >> > >> Expecting a fixed size for rseq_len is a design choice that ensures > >> multiple libraries and application defining __rseq_abi in the same > >> process agree on its exact size. > >> > >> The issue here is caused by the fact that the __rseq_abi variable is > >> shared across application/libraries for a given thread. So it's not > >> enough to agree between kernel and user-space on the extensibility > >> scheme, but we'd also have to find a way for all users within a process > >> to agree on the layout. > > > > But you're in the same boat if you add any new feature, no? In your > > model, wouldn't you need all users to agree on the feature set as well? > > Not just the struct rseq size. If that's the case then rseq would be > > unextendable (for now). > > As a consequence of this, my current approach to add a "node_id" field to rseq > (in a prototype patch) is far from ideal: it defines another TLS symbol, e.g. > __rseq_abi2, with an extended layout, and registers it with new rseq flags. > > I would really like to be able to extend struct rseq, but because of ABI > compatibility required between user-space libraries/applications, it seems > rather tricky to do so. > > > But specifically about the size-versioning part. Well, one way to solve > > this - imho - would be to add a output size parameter to struct rseq and > > introduce a little more vetting than we have right now. > > So the kernel is what ultimately registers struct rseq iiuc. If there > > were a size output parameter the kernel could set the size of the struct > > it knows about before registering it. > > So a caller passing down a larger struct with e.g. a new field set to a > > non-zero value would get an error from the kernel and the size of the > > supported struct. The caller can then adjust and simply zero out the > > unsupported field and retry. Other callers in userspace can query the > > size and find out what size of struct is registered. If it's larger than > > what they know about they can infer they are on a newer kernel with new > > features but they can simply ignore the unknown fields. If it's smaller > > than they know what fields to ignore. > > How would that work in the case of an application defining its own copy of > struct rseq __rseq_abi TLS with sizeof(struct rseq) == 32, and then upgrading > its glibc to a new glibc which implements a larger structure, which agrees with > the kernel on that larger layout ? Wouldn't the non-updated application just access fields and methods of the smaller struct? The way struct extensions work is that we only extend them after the last member and always correctly 64 bit aligned. And as long as you only extend the struct at the end wouldn't that be ok? An application with a non-updated/smaller struct rseq would just access fields that the larger struct supports, no? Thanks! Christian ^ permalink raw reply [flat|nested] 25+ messages in thread
* Re: [RFC PATCH for 5.8 3/4] rseq: Introduce RSEQ_FLAG_RELIABLE_CPU_ID 2020-07-11 15:54 ` Christian Brauner @ 2020-07-13 18:40 ` Mathieu Desnoyers 0 siblings, 0 replies; 25+ messages in thread From: Mathieu Desnoyers @ 2020-07-13 18:40 UTC (permalink / raw) To: Christian Brauner Cc: Florian Weimer, Linus Torvalds, carlos, Thomas Gleixner, linux-kernel, Peter Zijlstra, paulmck, Boqun Feng, H. Peter Anvin, Paul Turner, linux-api, Dmitry Vyukov, Neel Natu ----- On Jul 11, 2020, at 11:54 AM, Christian Brauner christian.brauner@ubuntu.com wrote: > > The registration is a thread-group property I'll assume, right, i.e. all > threads will have rseq TLS or no thread will have it? No, rseq registration is a per-thread property, but it would arguably be a bit weird for a thread-group to observe different registration states for different threads. > Some things I seem to be able to assume (correct me if I'm wrong): > - rseq registration is expected to be done at thread creation time True. > - rseq registration _should_ only be done once, i.e. if a caller detects > that rseq is already registered for a thread, then they could > technically re-register rseq - risking a feature mismatch if doing so > - but they shouldn't re-register but simply assume that someone else > is in control of rseq. If they violate that assumption than you're > hosed anyway. Right. > So it seems as long as callers leave rseq registration alone whenever > they detect that it is already registered then one can assume that rseq > is under consistent control of a single library/program. If that's the > case it should safe to assume that the library will use the same rseq > registration for all threads bounded by the available kernel features or > by the set of features it is aware of. The rseq registration is per-thread. But yes, as soon as one user registers rseq, other users for that thread are expected to piggy-back on that registration. > I proposed that specific scheme because I was under the impression that > you are in need of a mechanism for a caller (at thread creation I > assume) to check what feature set is supported by rseq _without_ > issung a system call. If you were to record the requested flags in > struct rseq or in some other way, then another library which tries to > register rseq for a thread but detects it has already been registered > can look at e.g. whether the reliable cpu feature is around and then > adjust it's behavior accordingly. > Another reason why this seems worthwhile is because of how rseq works in > general. Since it registers a piece of userspace memory which userspace > can trivially access enforcing that userspace issue a syscall to get at > the feature list seems odd when you can just record it in the struct. > But that's a matter of style, I guess. Good points. > > Also, I'm thinking about the case of adding one or two new features that > introduce mutually exclusive behavior for rseq, i.e. if you register > rseq with FEAT1 and someone else registers it with FEAT2 and FEAT1 and > FEAT2 would lead to incompatible behavior for an aspect or all of rseq. > Even if you had a way to query the kernel for FEAT1 and FEAT2 in the > rseq syscall it still be a problem since a caller wouldn't know at rseq > registration time whether the library registered rseq with FEAT1 or > FEAT2. If you record the behavior somewhere - kernel_flags or whatever - > then the caller could check at registration time "ah, rseq is registered > with this behavior" I need to adjust my behavior. I think what we want here is to be able to extend the feature set, but not "pick and choose" different incompatible features. [...] >> >> One additional thing to keep in mind: the application can itself choose >> to define the __rseq_abi TLS, which AFAIU (please let me know if I am >> wrong) would take precedence over glibc's copy. So extending the > > You mean either that an application could simply choose to ignore that e.g. > glibc has already registered rseq and e.g. unregister it and register > it's own or that it registers it's own rseq before glibc comes into the > play? No quite. I mean that an application binary or a preloaded library is allowed to interpose with glibc and expose a __rseq_abi symbol with a size smaller than glibc's __rseq_abi. The issue is that glibc (or other library responsible for rseq registration) is unaware of that symbol's size. This means that extending __rseq_abi cannot be done by means of additional flags or parameters passed directly from the registration owner to the rseq system call. > I mean, if I interpreted what you're trying to say correctly, I think > one needs to work under the assumption that if someone else has already > registered rseq than it becomes the single source of truth. I think that > basically needs to be the contract. Same way you expect a user of > pthreads to not suddenly go and call clone3() with CLONE_THREAD | > CLONE_VM | [...] and assume that this won't mess with glibc's internal > state. Right. The issue is not about which library owns the registration, but rather making sure everyone agree on the size of __rseq_abi, including interposition scenarios. [...] >> >> For both approaches, we could either pass them as parameters with rseq >> registration, and make rseq registration success conditional on the >> kernel implementing those feature/fix-version, or validate the flag/version >> separately from registration. >> >> If this is done on registration, it means glibc will eventually have to >> handle this. This prevents user libraries with specific needs to query >> whether their features are available. Doing the feature/version validation >> separately from registration allows each user library to make its own >> queries and take advantage of new kernel features before glibc is >> upgraded to be made aware of them. > > Why isn't there a "dual scheme"? I.e. you record the features somewhere > in struct rseq or some other place so userspace can query an already > registered thread for the features it was registered with and have a way > to query the kernel for the supported features through the system call > (See what I suggested above with the feature checking flags.). This discussion got my head into gears over the weekend, and I think I came up with something that could elegantly solve all the "rseq extensibility" problem. More below. [...] > I really think this is not an rseq specific problem. This seems to be a > generic problem any interface has when it e.g. makes use of an extended > struct and e.g. decides to make its own syscalls without going through > the glibc wrappers (If there are any...). That's the reality right now > and will likely always be that way short of us blocking non-libc > syscalls similar to some bsds at which point we need a 1:1 kernel + libc > development. :) That's not going to happen. The problem ranges from > struct statx to struct clone_args and struct open_how and so on. AFAIU the only "special" thing about rseq is that its __rseq_abi is a TLS symbol shared between application/libraries, and interposition is allowed. > > But one question. Musn't the assumption be that all threads in a > thread-group if they have registered rseq then the same > application/library has done that registration? No, __rseq_abi is a TLS, and the registration is per-thread. > And if that's the case > then the application/library doing the registration is what defines the > layout for that thread-group and becomes the one source of truth. > Basically, if an application uses it's own rseq then glibc must be out > of the picture. If that's not part of the contract then it feels to me > that rseq cannot be extended (for now). Indeed, the new scheme I have in mind for rseq extensibility would allow new features to be used between new application/library and kernel even with an older glibc which would contain the rseq registration code, but be unaware of those new features. [...] > > Wouldn't the non-updated application just access fields and methods of > the smaller struct? The way struct extensions work is that we only > extend them after the last member and always correctly 64 bit aligned. > And as long as you only extend the struct at the end wouldn't that be > ok? An application with a non-updated/smaller struct rseq would just > access fields that the larger struct supports, no? The issue is symbol interposition, as discussed above. So here is the idea which emerged through the weekend as I was thinking about your email: * Current technical constraints - struct rseq __rseq_abi can be interposed by preloaded libraries and application, without knowledge from the registration "owner" (typically glibc). * Objectives: - Allow extending the size of struct rseq to add additional features, such as node_id field, signal-disabling fields, and so on. - Allow extending this size without requiring an upgrade of the library performing rseq registration. Simply upgrading the rseq "user" application or library and the kernel should suffice. * Proposed ABI - Reserve a bit in the field (struct rseq *)->flags of the TLS __rseq_abi, named e.g.: RSEQ_CS_FLAG_SIZE = (1U << 3). - A definition wishing to extend struct rseq would be required to initialize __rseq_abi with this bit set in the flags field. - When RSEQ_CS_FLAG_SIZE is set, struct rseq is guaranteed to have two new fields after the flags field: a __u32 user_size and a __u32 kernel_size field. - The user_size field is meant to be initialized to sizeof(struct rseq) by the __rseq_abi definition. In an interposition scenario, a kernel supporting this size feature will know about the size of the symbol by checking both the RSEQ_CS_FLAG_SIZE flag and the user_size field. - On registration, if the __rseq_abi.flags RSEQ_CS_FLAG_SIZE flag is set (and this flag is supported by the kernel), the kernel updates the kernel_size field to min(sizeof(struct rseq), __rseq_abi.user_size), effectively the subset of size supported by both the user-space __rseq_abi definition and by the kernel. - Users wishing to use additional fields beyond __rseq_abi.flags would need to check that __rseq_abi->flags & RSEQ_CS_FLAG_SIZE is true, and that __rseq_abi.kernel_size >= offsetof(struct rseq, feature_field) + sizeof(__rseq_abi.feature_field) This would ensure the fields are only used if the symbol is large enough to hold them *and* the kernel supports them. With this kind of scheme, we could then easily extend struct rseq to cover additional use-cases such as: - adding a new "node_id" field to speed up getcpu(3), user-space locking on NUMA, and possibly memory allocators, - adding fields allowing to quickly disable/enable signals, - adding a "__u64 features" field, which would contain for instance RSEQ_FEATURE_RELIABLE_CPU_ID. I'm not sure why I did not think of this earlier, but it all seems to fit nicely. Any comments ? Thanks, Mathieu -- Mathieu Desnoyers EfficiOS Inc. http://www.efficios.com ^ permalink raw reply [flat|nested] 25+ messages in thread
* [RFC PATCH for 5.8 4/4] rseq: selftests: Expect reliable cpu_id field 2020-07-06 20:49 [RFC PATCH for 5.8 0/4] rseq cpu_id ABI fix Mathieu Desnoyers ` (2 preceding siblings ...) 2020-07-06 20:49 ` [RFC PATCH for 5.8 3/4] rseq: Introduce RSEQ_FLAG_RELIABLE_CPU_ID Mathieu Desnoyers @ 2020-07-06 20:49 ` Mathieu Desnoyers 2020-07-07 6:26 ` [RFC PATCH for 5.8 0/4] rseq cpu_id ABI fix Florian Weimer 2020-07-07 14:54 ` Florian Weimer 5 siblings, 0 replies; 25+ messages in thread From: Mathieu Desnoyers @ 2020-07-06 20:49 UTC (permalink / raw) To: Thomas Gleixner Cc: linux-kernel, Peter Zijlstra, Paul E . McKenney, Boqun Feng, H . Peter Anvin, Paul Turner, linux-api, Florian Weimer, Mathieu Desnoyers, Shuah Khan, Dmitry Vyukov, Neel Natu, linux-kselftest The rseq selftests should discover whether the kernel implements the RSEQ_FLAG_RELIABLE_CPU_ID flag, which indicates that the __rseq_abi.cpu_id field is reliable. Signed-off-by: Mathieu Desnoyers <mathieu.desnoyers@efficios.com> Cc: Shuah Khan <skhan@linuxfoundation.org> Cc: Peter Zijlstra (Intel) <peterz@infradead.org> Cc: Thomas Gleixner <tglx@linutronix.de> Cc: Florian Weimer <fw@deneb.enyo.de> Cc: "Paul E. McKenney" <paulmck@linux.ibm.com> Cc: Boqun Feng <boqun.feng@gmail.com> Cc: "H . Peter Anvin" <hpa@zytor.com> Cc: Paul Turner <pjt@google.com> Cc: Dmitry Vyukov <dvyukov@google.com> Cc: Neel Natu <neelnatu@google.com> Cc: linux-api@vger.kernel.org Cc: linux-kselftest@vger.kernel.org --- tools/testing/selftests/rseq/rseq.c | 10 +++++++++- 1 file changed, 9 insertions(+), 1 deletion(-) diff --git a/tools/testing/selftests/rseq/rseq.c b/tools/testing/selftests/rseq/rseq.c index 7159eb777fd3..55f1edb0649c 100644 --- a/tools/testing/selftests/rseq/rseq.c +++ b/tools/testing/selftests/rseq/rseq.c @@ -73,6 +73,11 @@ static int sys_rseq(volatile struct rseq *rseq_abi, uint32_t rseq_len, return syscall(__NR_rseq, rseq_abi, rseq_len, flags, sig); } +static bool rseq_reliable_cpu_id(void) +{ + return sys_rseq(NULL, 0, RSEQ_FLAG_RELIABLE_CPU_ID, 0) == 0; +} + int rseq_register_current_thread(void) { int rc, ret = 0; @@ -87,7 +92,8 @@ int rseq_register_current_thread(void) } if (__rseq_refcount++) goto end; - rc = sys_rseq(&__rseq_abi, sizeof(struct rseq), 0, RSEQ_SIG); + rc = sys_rseq(&__rseq_abi, sizeof(struct rseq), + RSEQ_FLAG_REGISTER | RSEQ_FLAG_RELIABLE_CPU_ID, RSEQ_SIG); if (!rc) { assert(rseq_current_cpu_raw() >= 0); goto end; @@ -96,6 +102,8 @@ int rseq_register_current_thread(void) __rseq_abi.cpu_id = RSEQ_CPU_ID_REGISTRATION_FAILED; ret = -1; __rseq_refcount--; + if (errno == EINVAL && !rseq_reliable_cpu_id()) + fprintf(stderr, "Error: rseq does not provide a reliable cpu_id field.\n"); end: signal_restore(oldset); return ret; -- 2.17.1 ^ permalink raw reply related [flat|nested] 25+ messages in thread
* Re: [RFC PATCH for 5.8 0/4] rseq cpu_id ABI fix 2020-07-06 20:49 [RFC PATCH for 5.8 0/4] rseq cpu_id ABI fix Mathieu Desnoyers ` (3 preceding siblings ...) 2020-07-06 20:49 ` [RFC PATCH for 5.8 4/4] rseq: selftests: Expect reliable cpu_id field Mathieu Desnoyers @ 2020-07-07 6:26 ` Florian Weimer 2020-07-07 14:54 ` Florian Weimer 5 siblings, 0 replies; 25+ messages in thread From: Florian Weimer @ 2020-07-07 6:26 UTC (permalink / raw) To: Mathieu Desnoyers Cc: Thomas Gleixner, linux-kernel, Peter Zijlstra, Paul E . McKenney, Boqun Feng, H . Peter Anvin, Paul Turner, linux-api * Mathieu Desnoyers: > This is an RFC aiming for quick inclusion into the Linux kernel, unless > we prefer reverting the entire rseq glibc integration and try again in 6 > months. Their upcoming release is on August 3rd, so we need to take a > decision on this matter quickly. Just to clarify here, I don't think it's necessary to change glibc so that it only enables the rseq functionality if this particular bug is not present. From our perspective, it's just another kernel bug. If you truly feel we must not enable this feature in its present kernel state, then we need a working patch on all sides by the end of the week because the hard ABI freeze for glibc 2.32 is coming up, and at without any other patches, the only safe choice to prevent glibc from using slightly broken rseq support would be to back out the rseq patches. But again, I don't think such drastic action is necessary. ^ permalink raw reply [flat|nested] 25+ messages in thread
* Re: [RFC PATCH for 5.8 0/4] rseq cpu_id ABI fix 2020-07-06 20:49 [RFC PATCH for 5.8 0/4] rseq cpu_id ABI fix Mathieu Desnoyers ` (4 preceding siblings ...) 2020-07-07 6:26 ` [RFC PATCH for 5.8 0/4] rseq cpu_id ABI fix Florian Weimer @ 2020-07-07 14:54 ` Florian Weimer 5 siblings, 0 replies; 25+ messages in thread From: Florian Weimer @ 2020-07-07 14:54 UTC (permalink / raw) To: Mathieu Desnoyers Cc: Thomas Gleixner, linux-kernel, Peter Zijlstra, Paul E . McKenney, Boqun Feng, H . Peter Anvin, Paul Turner, linux-api I would like to point out that the subject is misleading: This is not an ABI change. It fixes the contents of the __rseq_abi TLS variable (as glibc calls it), but that's it. (Sorry, I should have mentioned this earlier.) ^ permalink raw reply [flat|nested] 25+ messages in thread
end of thread, other threads:[~2020-07-13 18:40 UTC | newest] Thread overview: 25+ messages (download: mbox.gz / follow: Atom feed) -- links below jump to the message on this page -- 2020-07-06 20:49 [RFC PATCH for 5.8 0/4] rseq cpu_id ABI fix Mathieu Desnoyers 2020-07-06 20:49 ` [RFC PATCH for 5.8 1/4] sched: Fix unreliable rseq cpu_id for new tasks Mathieu Desnoyers 2020-07-07 7:30 ` Florian Weimer 2020-07-07 10:51 ` Mathieu Desnoyers 2020-07-06 20:49 ` [RFC PATCH for 5.8 2/4] rseq: Introduce RSEQ_FLAG_REGISTER Mathieu Desnoyers 2020-07-06 20:49 ` [RFC PATCH for 5.8 3/4] rseq: Introduce RSEQ_FLAG_RELIABLE_CPU_ID Mathieu Desnoyers 2020-07-07 7:29 ` Florian Weimer 2020-07-07 10:48 ` Mathieu Desnoyers 2020-07-07 11:32 ` Florian Weimer 2020-07-07 12:06 ` Mathieu Desnoyers 2020-07-07 18:53 ` Carlos O'Donell 2020-07-07 18:59 ` Mathieu Desnoyers 2020-07-08 8:31 ` Florian Weimer 2020-07-07 19:55 ` Florian Weimer 2020-07-08 15:33 ` Mathieu Desnoyers 2020-07-08 16:22 ` Christian Brauner 2020-07-08 16:36 ` Florian Weimer 2020-07-08 17:34 ` Mathieu Desnoyers 2020-07-09 12:49 ` Christian Brauner 2020-07-09 15:15 ` Mathieu Desnoyers 2020-07-11 15:54 ` Christian Brauner 2020-07-13 18:40 ` Mathieu Desnoyers 2020-07-06 20:49 ` [RFC PATCH for 5.8 4/4] rseq: selftests: Expect reliable cpu_id field Mathieu Desnoyers 2020-07-07 6:26 ` [RFC PATCH for 5.8 0/4] rseq cpu_id ABI fix Florian Weimer 2020-07-07 14:54 ` Florian Weimer
This is a public inbox, see mirroring instructions for how to clone and mirror all data and code used for this inbox; as well as URLs for NNTP newsgroup(s).