Hi Geert, your patch "futex: switch to USER_DS for futex test" in linux-next now causes s390 to die on boot. As reference your patch: futex: switch to USER_DS for futex test Since e4f2dfbb5e92b ("m68k: implement futex.h to support userspace robust futexes and PI mutexes"), the kernel crashes during boot up on MC68030: Data read fault at 0x00000000 in Super Data (pc=0x3afec) BAD KERNEL BUSERR [...] [<000020ac>] do_one_initcall+0xa4/0x144 [<00001000>] kernel_pg_dir+0x0/0x1000 [<00002008>] do_one_initcall+0x0/0x144 [<002b3a56>] kernel_init_freeable+0xca/0x152 [<002b8ca4>] futex_init+0x0/0x54 [<001e316a>] kernel_init+0x0/0xc8 [<001e3172>] kernel_init+0x8/0xc8 [<001e316a>] kernel_init+0x0/0xc8 [<000025d4>] ret_from_kernel_thread+0xc/0x14 This happens because the futex test in futex_init() lacks a switch to the USER_DS address space, while cmpxchg_futex_value_locked() and futex_atomic_cmpxchg_inatomic() operate on userspace pointers (albeit NULL for this particular test). Fix this by switching to USER_DS before running the test, and restoring the old address space afterwards. diff --git a/kernel/futex.c b/kernel/futex.c index f6ff0191ecf7..66d23727c6ab 100644 --- a/kernel/futex.c +++ b/kernel/futex.c @@ -63,6 +63,7 @@ #include <linux/sched/rt.h> #include <linux/hugetlb.h> #include <linux/freezer.h> +#include <linux/uaccess.h> #include <asm/futex.h> @@ -2733,6 +2734,7 @@ SYSCALL_DEFINE6(futex, u32 __user *, uaddr, int, op, u32, val, static int __init futex_init(void) { + mm_segment_t fs; u32 curval; int i; @@ -2746,8 +2748,11 @@ static int __init futex_init(void) * implementation, the non-functional ones will return * -ENOSYS. */ + fs = get_fs(); + set_fs(USER_DS); if (cmpxchg_futex_value_locked(&curval, NULL, 0, 0) == -EFAULT) futex_cmpxchg_enabled = 1; + set_fs(fs); for (i = 0; i < ARRAY_SIZE(futex_queues); i++) { plist_head_init(&futex_queues[i].chain); Now, this seems to be wrong. It was intended to cause a fault while in kernel space. When accessing user space current->mm must not be NULL, but it is, since this is early code and we have no user space context to operate with. Hence at least s390's page tables aren't setup yet to handle this correctly. Actually our code dies when walking page tables and trying to acquire current's mm->page_table_lock, which points to nowhere. I'm wondering why m68k's exception handling for put/get_user doesn't fixup the instruction pointers and these functions simply return -EFAULT? Also m68k's futex_atomic_cmpxchg_inatomic() implementation seems to miss a page_fault_disable()/page_fault_enable() pair. Since this is already the second or third time this specific futex code causes problems on s390, it would be nice if we could get rid of it. E.g. with the following patch: >From d3b5585ebc302a7b94edff7267f9ec7f63b57141 Mon Sep 17 00:00:00 2001 From: Heiko Carstens <heiko.carstens@de.ibm.com> Date: Fri, 3 Jan 2014 14:16:19 +0100 Subject: [PATCH] futex: allow architectures to skip futex_atomic_cmpxchg_inatomic() test If an architecture has futex_atomic_cmpxchg_inatomic() implemented and there is no runtume check necessary if it is working allow to skip the test within futex_init(). This allows to get rid of some code which would always give the same result. Signed-off-by: Heiko Carstens <heiko.carstens@de.ibm.com> --- arch/s390/Kconfig | 1 + include/linux/futex.h | 4 ++++ init/Kconfig | 6 ++++++ kernel/futex.c | 14 ++++++++++++-- 4 files changed, 23 insertions(+), 2 deletions(-) diff --git a/arch/s390/Kconfig b/arch/s390/Kconfig index 1e1a03d2d19f..e00a76224537 100644 --- a/arch/s390/Kconfig +++ b/arch/s390/Kconfig @@ -117,6 +117,7 @@ config S390 select HAVE_FUNCTION_GRAPH_TRACER select HAVE_FUNCTION_TRACER select HAVE_FUNCTION_TRACE_MCOUNT_TEST + select HAVE_FUTEX_CMPXCHG if FUTEX select HAVE_KERNEL_BZIP2 select HAVE_KERNEL_GZIP select HAVE_KERNEL_LZ4 diff --git a/include/linux/futex.h b/include/linux/futex.h index b0d95cac826e..6435f46d6e13 100644 --- a/include/linux/futex.h +++ b/include/linux/futex.h @@ -55,7 +55,11 @@ union futex_key { #ifdef CONFIG_FUTEX extern void exit_robust_list(struct task_struct *curr); extern void exit_pi_state_list(struct task_struct *curr); +#ifdef CONFIG_HAVE_FUTEX_CMPXCHG +#define futex_cmpxchg_enabled 1 +#else extern int futex_cmpxchg_enabled; +#endif #else static inline void exit_robust_list(struct task_struct *curr) { diff --git a/init/Kconfig b/init/Kconfig index 8d402e33b7fc..5699a638e1ca 100644 --- a/init/Kconfig +++ b/init/Kconfig @@ -1398,6 +1398,12 @@ config FUTEX support for "fast userspace mutexes". The resulting kernel may not run glibc-based applications correctly. +config HAVE_FUTEX_CMPXCHG + bool + help + Architectures should select this if futex_atomic_cmpxchg_inatomic() + is implemented and always working. + config EPOLL bool "Enable eventpoll support" if EXPERT default y diff --git a/kernel/futex.c b/kernel/futex.c index 66d23727c6ab..7c7dfb233515 100644 --- a/kernel/futex.c +++ b/kernel/futex.c @@ -69,7 +69,9 @@ #include "locking/rtmutex_common.h" +#ifndef CONFIG_HAVE_FUTEX_CMPXCHG int __read_mostly futex_cmpxchg_enabled; +#endif #define FUTEX_HASHBITS (CONFIG_BASE_SMALL ? 4 : 8) @@ -2732,11 +2734,11 @@ SYSCALL_DEFINE6(futex, u32 __user *, uaddr, int, op, u32, val, return do_futex(uaddr, op, val, tp, uaddr2, val2, val3); } -static int __init futex_init(void) +static void __init futex_detect_cmpxchg(void) { +#ifndef CONFIG_HAVE_FUTEX_CMPXCHG mm_segment_t fs; u32 curval; - int i; /* * This will fail and we want it. Some arch implementations do @@ -2753,6 +2755,14 @@ static int __init futex_init(void) if (cmpxchg_futex_value_locked(&curval, NULL, 0, 0) == -EFAULT) futex_cmpxchg_enabled = 1; set_fs(fs); +#endif /* CONFIG_HAVE_FUTEX_CMPXCHG */ +} + +static int __init futex_init(void) +{ + int i; + + futex_detect_cmpxchg(); for (i = 0; i < ARRAY_SIZE(futex_queues); i++) { plist_head_init(&futex_queues[i].chain); -- 1.8.4.5
Hi Heiko, On Fri, Jan 3, 2014 at 3:19 PM, Heiko Carstens <heiko.carstens@de.ibm.com> wrote: > your patch "futex: switch to USER_DS for futex test" in linux-next now causes > s390 to die on boot Oops, sorry for that. So Andrew was right to let it cook a bit longer in -next, for 3.14. > Now, this seems to be wrong. It was intended to cause a fault while in kernel > space. When accessing user space current->mm must not be NULL, but it is, since > this is early code and we have no user space context to operate with. > > Hence at least s390's page tables aren't setup yet to handle this correctly. > Actually our code dies when walking page tables and trying to acquire current's > mm->page_table_lock, which points to nowhere. m68k do_page_fault() has: /* * If we're in an interrupt or have no user * context, we must not take the fault.. */ if (in_atomic() || !mm) goto no_context; so we abort if there's no mm (but in this specific case it was already aborted due to the in_atomic() test). Actually s390 do_exception() has similar checks: /* * Verify that the fault happened in user space, that * we are not in an interrupt and that there is a * user context. */ fault = VM_FAULT_BADCONTEXT; if (unlikely(!user_space_fault(trans_exc_code) || in_atomic() || !mm)) goto out; But as it fails for you, it crashes before you get to that point? > I'm wondering why m68k's exception handling for put/get_user doesn't fixup > the instruction pointers and these functions simply return -EFAULT? The fixup only happens for pointers in userspace context. In kernel context, we die. > Also m68k's futex_atomic_cmpxchg_inatomic() implementation seems to miss a > page_fault_disable()/page_fault_enable() pair. I'm not a futex hero, but FWIW, cmpxchg_futex_value_locked() calls page_fault_disable() before calling futex_atomic_cmpxchg_inatomic(). The inatomic suffix indicates we're already in an atomic context, right? > Since this is already the second or third time this specific futex code causes > problems on s390, it would be nice if we could get rid of it. E.g. with the > following patch: Yeah, recently Linus was optimizing the code so the compiler would eliminate the test for him... Gr{oetje,eeting}s, Geert -- Geert Uytterhoeven -- There's lots of Linux beyond ia32 -- geert@linux-m68k.org In personal conversations with technical people, I call myself a hacker. But when I'm talking to journalists I just say "programmer" or something like that. -- Linus Torvalds
On Fri, Jan 03, 2014 at 03:42:30PM +0100, Geert Uytterhoeven wrote: > Hi Heiko, [...] > m68k do_page_fault() has: > > /* > * If we're in an interrupt or have no user > * context, we must not take the fault.. > */ > if (in_atomic() || !mm) > goto no_context; > > so we abort if there's no mm (but in this specific case it was already > aborted due > to the in_atomic() test). > > Actually s390 do_exception() has similar checks: > > /* > * Verify that the fault happened in user space, that > * we are not in an interrupt and that there is a > * user context. > */ > fault = VM_FAULT_BADCONTEXT; > if (unlikely(!user_space_fault(trans_exc_code) || in_atomic() || !mm)) > goto out; > > But as it fails for you, it crashes before you get to that point? Yes, for futex cmpxchg operations we walk the page tables 'by hand' and do not let the hardware do that. This is usually not a problem since current->mm is (must be) valid. > > I'm wondering why m68k's exception handling for put/get_user doesn't fixup > > the instruction pointers and these functions simply return -EFAULT? > > The fixup only happens for pointers in userspace context. > In kernel context, we die. But... you *should* fix that also up in kernel space. That's the only reason why the futex check works at all on any architecture so far. There is also other code that relies on this: e.g. copy_mount_options() my be called with KERNEL_DS. If DEBUG_PAGEALLOC is turned on, it would crash badly in kernel space if it crosses page boundaries and touches an invalid page, even though it should survive... > > Also m68k's futex_atomic_cmpxchg_inatomic() implementation seems to miss a > > page_fault_disable()/page_fault_enable() pair. > > I'm not a futex hero, but FWIW, cmpxchg_futex_value_locked() calls > page_fault_disable() before calling futex_atomic_cmpxchg_inatomic(). > The inatomic suffix indicates we're already in an atomic context, right? Yes, you're right, I was wrong :)
Heiko Carstens <heiko.carstens@de.ibm.com> writes: > There is also other code that relies on this: e.g. copy_mount_options() my be > called with KERNEL_DS. With KERNEL_DS you can *only* access kernel memory, which is unpagable. If you want to access user memory, you _must_ use USER_DS. > If DEBUG_PAGEALLOC is turned on, it would crash badly in kernel space > if it crosses page boundaries and touches an invalid page, even though > it should survive... Accessing an invalid page in kernel space is _always_ a bug. Andreas. -- Andreas Schwab, schwab@linux-m68k.org GPG Key fingerprint = 58CA 54C7 6D53 942B 1756 01D3 44D5 214B 8276 4ED5 "And now for something completely different."
On Fri, Jan 03, 2014 at 04:41:10PM +0100, Andreas Schwab wrote: > Heiko Carstens <heiko.carstens@de.ibm.com> writes: > > > There is also other code that relies on this: e.g. copy_mount_options() my be > > called with KERNEL_DS. > > With KERNEL_DS you can *only* access kernel memory, which is unpagable. > If you want to access user memory, you _must_ use USER_DS. I didn't say anything else. copy_mount_options() will be called with KERNEL_DS from e.g. do_mount_root(). > > If DEBUG_PAGEALLOC is turned on, it would crash badly in kernel space > > if it crosses page boundaries and touches an invalid page, even though > > it should survive... > > Accessing an invalid page in kernel space is _always_ a bug. Even though the current futex check relies on working exception handling for this case. If the patch I posted gets merged as well, it really doesn't matter for me. Martin and I discussed this today and we will change the s390 code so that it will also survive very early USER_DS accesses (without valid current->mm) since we also discovered a couple of other oddities in our code. But theses changes would be too complex for -stable, imho.
On Fri, Jan 03, 2014 at 05:09:24PM +0100, Heiko Carstens wrote:
> On Fri, Jan 03, 2014 at 04:41:10PM +0100, Andreas Schwab wrote:
> > Heiko Carstens <heiko.carstens@de.ibm.com> writes:
> >
> > > There is also other code that relies on this: e.g. copy_mount_options() my be
> > > called with KERNEL_DS.
> >
> > With KERNEL_DS you can *only* access kernel memory, which is unpagable.
> > If you want to access user memory, you _must_ use USER_DS.
>
> I didn't say anything else. copy_mount_options() will be called with KERNEL_DS
> from e.g. do_mount_root().
Just to be more precise: when sys_mount() gets called from kernel space (with
KERNEL_DS) it will call copy_mount_options() which in turn will call
exact_copy_from_user() which will usually copy a whole page, unless a fault
happens.
E.g. devtmpfsd() (drivers/base/devtmpfs.c) calls sys_mount() with mount options
being on the kernel stack. Now if these mount options are at the beginning of
the kernel stack, copy_mount_options() _will_ cross page boundaries and leave
the kernel stack. In case of DEBUG_PAGEALLOC it's not very unlikely that the
next page has an invalid mapping and a fault happens.
I'm not saying that the devtmpfsd() code is correct, however this code would
crash on m68k as well if it doesn't fixup it's instructions pointer for
exceptions in kernel space.