linux-kernel.vger.kernel.org archive mirror
* clock_gettime64 vdso bug on 32-bit arm, rpi-4
@ 2020-05-19 19:54 Arnd Bergmann
  2020-05-19 20:24 ` Adhemerval Zanella
  0 siblings, 1 reply; 8+ messages in thread
From: Arnd Bergmann @ 2020-05-19 19:54 UTC (permalink / raw)
  To: Vincenzo Frascino, Russell King - ARM Linux
  Cc: Will Deacon, Rich Felker, Jack Schmidt, Linux ARM,
	Linux Kernel Mailing List, Szabolcs Nagy, Thomas Gleixner,
	Stephen Boyd, Adhemerval Zanella

Jack Schmidt reported a bug for the arm32 clock_gettime64 vdso call last
month: https://github.com/richfelker/musl-cross-make/issues/96 and
https://github.com/raspberrypi/linux/issues/3579

As Will Deacon pointed out, this was never reported on the mailing list,
so I'll try to summarize what we know, so this can hopefully be resolved soon.

- This happened reproducibly on Linux-5.6 on a 32-bit Raspberry Pi patched
   kernel running on a 64-bit Raspberry Pi 4b (bcm2711) when calling
   clock_gettime64(CLOCK_REALTIME)

- The kernel tree is at https://github.com/raspberrypi/linux/, but I could
  see no relevant changes compared to a mainline kernel.

- From the report, I see that the returned time value is larger than the
  expected time, by 3.4 to 14.5 million seconds in four samples. My
  guess is that a random number gets added in at some point.

- From other sources, I found that the Raspberry Pi clocksource runs
  at 54 MHz, with a mask value of 0xffffffffffffff. From these numbers
  I would expect that reading a completely random hardware register
  value would result in an offset of up to 1.33 billion seconds, which is
  around a factor of 100 more than the error we see, though similar.

- The test case calls the musl clock_gettime() function, which falls back to
  the clock_gettime64() syscall on kernels prior to 5.5, or to the 32-bit
  clock_gettime() prior to Linux-5.1. As reported in the bug, Linux-4.19 does
  not show the bug.

- The behavior was not reproduced on the same user space in qemu,
  though I cannot tell whether the exact same kernel binary was used.

- glibc-2.31 calls the same clock_gettime64() vdso function on arm to
  implement clock_gettime(), but earlier versions did not. I have not
  seen any reports of this bug, which could be explained by users
  generally being on older versions.

- As far as I can tell, there are no reports of this bug from other users,
  and so far nobody could reproduce it.

- The current musl git tree has been patched to not call clock_gettime64
   on ARM because of this problem, so it cannot be used for reproducing it.
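
The estimate in the clocksource item above can be checked with a few lines of arithmetic. This is only a back-of-the-envelope sketch using the mask width, the 54 MHz rate, and the largest observed error quoted in this message; the variable names are made up for illustration.

```python
# Figures taken from the message: a 56-bit clocksource mask and a 54 MHz counter.
MASK = (1 << 56) - 1           # 0xffffffffffffff
FREQ_HZ = 54_000_000

# Seconds covered by the full counter range, i.e. the largest offset a
# completely random counter value could produce.
max_offset_s = (MASK + 1) / FREQ_HZ

# Largest error among the four reported samples.
observed_max_s = 14.5e6

print(f"full counter range: {max_offset_s / 1e9:.2f} billion seconds")
print(f"ratio to largest observed error: {max_offset_s / observed_max_s:.0f}x")
```

The ratio comes out a little below 100, which matches the "around factor 100" estimate.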

If anyone has other information that may help figure out what is going
on, please share.

        Arnd


* Re: clock_gettime64 vdso bug on 32-bit arm, rpi-4
  2020-05-19 19:54 clock_gettime64 vdso bug on 32-bit arm, rpi-4 Arnd Bergmann
@ 2020-05-19 20:24 ` Adhemerval Zanella
  2020-05-19 20:31   ` Arnd Bergmann
  2020-05-19 20:41   ` Rich Felker
  0 siblings, 2 replies; 8+ messages in thread
From: Adhemerval Zanella @ 2020-05-19 20:24 UTC (permalink / raw)
  To: Arnd Bergmann, Vincenzo Frascino, Russell King - ARM Linux
  Cc: Will Deacon, Rich Felker, Jack Schmidt, Linux ARM,
	Linux Kernel Mailing List, Szabolcs Nagy, Thomas Gleixner,
	Stephen Boyd



On 19/05/2020 16:54, Arnd Bergmann wrote:
> Jack Schmidt reported a bug for the arm32 clock_gettime64 vdso call last
> month: https://github.com/richfelker/musl-cross-make/issues/96 and
> https://github.com/raspberrypi/linux/issues/3579
> 
> As Will Deacon pointed out, this was never reported on the mailing list,
> so I'll try to summarize what we know, so this can hopefully be resolved soon.
> 
> - This happened reproducibly on Linux-5.6 on a 32-bit Raspberry Pi patched
>    kernel running on a 64-bit Raspberry Pi 4b (bcm2711) when calling
>    clock_gettime64(CLOCK_REALTIME)

Does it happen with other clocks as well?

> 
> - The kernel tree is at https://github.com/raspberrypi/linux/, but I could
>   see no relevant changes compared to a mainline kernel.

Is this bug reproducible with a mainline kernel, or can a mainline kernel
not be booted on bcm2711?

> 
> - From the report, I see that the returned time value is larger than the
>   expected time, by 3.4 to 14.5 million seconds in four samples, my
>   guess is that a random number gets added in at some point.

What kind of code are you using to reproduce it? Is it threaded, or does it
issue clock_gettime from signal handlers?

> 
> - From other sources, I found that the Raspberry Pi clocksource runs
>   at 54 MHz, with a mask value of 0xffffffffffffff. From these numbers
>   I would expect that reading a completely random hardware register
>   value would result in an offset up to 1.33 billion seconds, which is
>   around factor 100 more than the error we see, though similar.
> 
> - The test case calls the musl clock_gettime() function, which falls back to
>   the clock_gettime64() syscall on kernels prior to 5.5, or to the 32-bit
>   clock_gettime() prior to Linux-5.1. As reported in the bug, Linux-4.19 does
>   not show the bug.
> 
> - The behavior was not reproduced on the same user space in qemu,
>   though I cannot tell whether the exact same kernel binary was used.
> 
> - glibc-2.31 calls the same clock_gettime64() vdso function on arm to
>   implement clock_gettime(), but earlier versions did not. I have not
>   seen any reports of this bug, which could be explained by users
>   generally being on older versions.
> 
> - As far as I can tell, there are no reports of this bug from other users,
>   and so far nobody could reproduce it.
> 
> - The current musl git tree has been patched to not call clock_gettime64
>    on ARM because of this problem, so it cannot be used for reproducing it.

So should glibc follow musl and remove arm clock_gettime64 vDSO support,
or is this bug localized to a specific kernel version running on
specific hardware?

> 
> If anyone has other information that may help figure out what is going
> on, please share.
> 
>         Arnd
> 


* Re: clock_gettime64 vdso bug on 32-bit arm, rpi-4
  2020-05-19 20:24 ` Adhemerval Zanella
@ 2020-05-19 20:31   ` Arnd Bergmann
  2020-05-20 15:41     ` Szabolcs Nagy
  2020-05-19 20:41   ` Rich Felker
  1 sibling, 1 reply; 8+ messages in thread
From: Arnd Bergmann @ 2020-05-19 20:31 UTC (permalink / raw)
  To: Adhemerval Zanella
  Cc: Vincenzo Frascino, Russell King - ARM Linux, Will Deacon,
	Rich Felker, Jack Schmidt, Linux ARM, Linux Kernel Mailing List,
	Szabolcs Nagy, Thomas Gleixner, Stephen Boyd

On Tue, May 19, 2020 at 10:24 PM Adhemerval Zanella
<adhemerval.zanella@linaro.org> wrote:
> On 19/05/2020 16:54, Arnd Bergmann wrote:
> > Jack Schmidt reported a bug for the arm32 clock_gettime64 vdso call last
> > month: https://github.com/richfelker/musl-cross-make/issues/96 and
> > https://github.com/raspberrypi/linux/issues/3579
> >
> > As Will Deacon pointed out, this was never reported on the mailing list,
> > so I'll try to summarize what we know, so this can hopefully be resolved soon.
> >
> > - This happened reproducibly on Linux-5.6 on a 32-bit Raspberry Pi patched
> >    kernel running on a 64-bit Raspberry Pi 4b (bcm2711) when calling
> >    clock_gettime64(CLOCK_REALTIME)
>
> Does it happen with other clocks as well?

Unclear.

> > - The kernel tree is at https://github.com/raspberrypi/linux/, but I could
> >   see no relevant changes compared to a mainline kernel.
>
> Is this bug reproducible with mainline kernel or mainline kernel can't be
> booted on bcm2711?

Mainline linux-5.6 should boot on that machine but might not have
all the other features, so I think users tend to use the raspberry pi
kernel sources for now.

> > - From the report, I see that the returned time value is larger than the
> >   expected time, by 3.4 to 14.5 million seconds in four samples, my
> >   guess is that a random number gets added in at some point.
>
> What kind code are you using to reproduce it? It is threaded or issue
> clock_gettime from signal handlers?

The reproducer is very simple without threads or signals,
see the start of https://github.com/richfelker/musl-cross-make/issues/96

It does rely on calling into the musl wrapper, not the direct vdso
call.

> > - From other sources, I found that the Raspberry Pi clocksource runs
> >   at 54 MHz, with a mask value of 0xffffffffffffff. From these numbers
> >   I would expect that reading a completely random hardware register
> >   value would result in an offset up to 1.33 billion seconds, which is
> >   around factor 100 more than the error we see, though similar.
> >
> > - The test case calls the musl clock_gettime() function, which falls back to
> >   the clock_gettime64() syscall on kernels prior to 5.5, or to the 32-bit
> >   clock_gettime() prior to Linux-5.1. As reported in the bug, Linux-4.19 does
> >   not show the bug.
> >
> > - The behavior was not reproduced on the same user space in qemu,
> >   though I cannot tell whether the exact same kernel binary was used.
> >
> > - glibc-2.31 calls the same clock_gettime64() vdso function on arm to
> >   implement clock_gettime(), but earlier versions did not. I have not
> >   seen any reports of this bug, which could be explained by users
> >   generally being on older versions.
> >
> > - As far as I can tell, there are no reports of this bug from other users,
> >   and so far nobody could reproduce it.
> >
> > - The current musl git tree has been patched to not call clock_gettime64
> >    on ARM because of this problem, so it cannot be used for reproducing it.
>
> So should glibc follow musl and remove arm clock_gettime64 vDSO support,
> or is this bug localized to a specific kernel version running on
> specific hardware?

I hope we can figure out what is actually going on soon; there is probably
no need to change glibc before we have.

          Arnd


* Re: clock_gettime64 vdso bug on 32-bit arm, rpi-4
  2020-05-19 20:24 ` Adhemerval Zanella
  2020-05-19 20:31   ` Arnd Bergmann
@ 2020-05-19 20:41   ` Rich Felker
  1 sibling, 0 replies; 8+ messages in thread
From: Rich Felker @ 2020-05-19 20:41 UTC (permalink / raw)
  To: Adhemerval Zanella
  Cc: Arnd Bergmann, Vincenzo Frascino, Russell King - ARM Linux,
	Will Deacon, Jack Schmidt, Linux ARM, Linux Kernel Mailing List,
	Szabolcs Nagy, Thomas Gleixner, Stephen Boyd

On Tue, May 19, 2020 at 05:24:18PM -0300, Adhemerval Zanella wrote:
> 
> 
> On 19/05/2020 16:54, Arnd Bergmann wrote:
> > Jack Schmidt reported a bug for the arm32 clock_gettime64 vdso call last
> > month: https://github.com/richfelker/musl-cross-make/issues/96 and
> > https://github.com/raspberrypi/linux/issues/3579
> > 
> > As Will Deacon pointed out, this was never reported on the mailing list,
> > so I'll try to summarize what we know, so this can hopefully be resolved soon.
> > 
> > - This happened reproducibly on Linux-5.6 on a 32-bit Raspberry Pi patched
> >    kernel running on a 64-bit Raspberry Pi 4b (bcm2711) when calling
> >    clock_gettime64(CLOCK_REALTIME)
> 
> Does it happen with other clocks as well?
> 
> > 
> > - The kernel tree is at https://github.com/raspberrypi/linux/, but I could
> >   see no relevant changes compared to a mainline kernel.
> 
> Is this bug reproducible with mainline kernel or mainline kernel can't be
> booted on bcm2711?
> 
> > 
> > - From the report, I see that the returned time value is larger than the
> >   expected time, by 3.4 to 14.5 million seconds in four samples, my
> >   guess is that a random number gets added in at some point.
> 
> What kind code are you using to reproduce it? It is threaded or issue
> clock_gettime from signal handlers?

Original report thread is here:

https://github.com/richfelker/musl-cross-make/issues/96

The reporter originally misunderstood the issue and wrongly attributed
it to a difference between gettimeofday and clock_gettime, but it was
just big jumps between successive vdso clock_gettime64 calls.

No transformation was being done on the output of the vdso function;
as long as it succeeds musl just returns directly with the value it
stored in the timespec. No threads or anything fancy were involved.
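
A check for the symptom described above (big jumps between successive readings) could be sketched as follows. This is not the reporter's test program. Whether Python's time.clock_gettime() ends up in the clock_gettime64 vdso call depends on the libc and kernel in use, so treat that as an assumption; the helper names and the one-second threshold are made up for illustration.

```python
import time


def find_jumps(samples, max_step_s=1.0):
    """Return (index, delta) for consecutive samples that go backwards
    or move forward by more than max_step_s seconds."""
    jumps = []
    for i in range(1, len(samples)):
        delta = samples[i] - samples[i - 1]
        if delta < 0 or delta > max_step_s:
            jumps.append((i, delta))
    return jumps


def sample_realtime(count=1000):
    """Read CLOCK_REALTIME back to back, count times."""
    return [time.clock_gettime(time.CLOCK_REALTIME) for _ in range(count)]


if __name__ == "__main__":
    for index, delta in find_jumps(sample_realtime()):
        print(f"sample {index}: jumped by {delta:.3f} s")
```

On a healthy system the loop should print nothing; an offset of millions of seconds between two neighbouring samples would show up immediately.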

Current musl will no longer call it but you should be able to
dlopen("linux-gate.so.1", RTLD_NOW|RTLD_LOCAL) then use dlsym to get
its address and call it (not tested; I've never used it this way).

> > - The current musl git tree has been patched to not call clock_gettime64
> >    on ARM because of this problem, so it cannot be used for reproducing it.
> 
> So should glibc follow musl and remove arm clock_gettime64 vDSO support,
> or is this bug localized to a specific kernel version running on
> specific hardware?

For musl it was important to disable it asap pending a fix, because
users are expected to generate static binaries, and these could make
it into the wild without anyone realizing they're broken until much
later when run on an affected kernel (especially since pre-5.6 kernels
would hide the issue entirely due to lacking vdso). Ideally a fix will
be something we can detect (e.g. new symbol version) so as not to risk
calling the broken one, but whether that's necessary may depend on
what's affected.

I'm not sure if glibc should do the same; it's not often used in
static linking, and replacing libc (shared lib, or re-static-linking
which LGPL requires you to facilitate to distribute static binaries)
could solve the issue on affected systems.

Rich


* Re: clock_gettime64 vdso bug on 32-bit arm, rpi-4
  2020-05-19 20:31   ` Arnd Bergmann
@ 2020-05-20 15:41     ` Szabolcs Nagy
  2020-05-20 16:08       ` Rich Felker
  0 siblings, 1 reply; 8+ messages in thread
From: Szabolcs Nagy @ 2020-05-20 15:41 UTC (permalink / raw)
  To: Arnd Bergmann
  Cc: Adhemerval Zanella, Vincenzo Frascino, Russell King - ARM Linux,
	Will Deacon, Rich Felker, Jack Schmidt, Linux ARM,
	Linux Kernel Mailing List, Thomas Gleixner, Stephen Boyd, nd

The 05/19/2020 22:31, Arnd Bergmann wrote:
> On Tue, May 19, 2020 at 10:24 PM Adhemerval Zanella
> <adhemerval.zanella@linaro.org> wrote:
> > On 19/05/2020 16:54, Arnd Bergmann wrote:
> > > Jack Schmidt reported a bug for the arm32 clock_gettime64 vdso call last
> > > month: https://github.com/richfelker/musl-cross-make/issues/96 and
> > > https://github.com/raspberrypi/linux/issues/3579
> > >
> > > As Will Deacon pointed out, this was never reported on the mailing list,
> > > so I'll try to summarize what we know, so this can hopefully be resolved soon.
> > >
> > > - This happened reproducibly on Linux-5.6 on a 32-bit Raspberry Pi patched
> > >    kernel running on a 64-bit Raspberry Pi 4b (bcm2711) when calling
> > >    clock_gettime64(CLOCK_REALTIME)
> >
> > Does it happen with other clocks as well?
> 
> Unclear.
> 
> > > - The kernel tree is at https://github.com/raspberrypi/linux/, but I could
> > >   see no relevant changes compared to a mainline kernel.
> >
> > Is this bug reproducible with mainline kernel or mainline kernel can't be
> > booted on bcm2711?
> 
> Mainline linux-5.6 should boot on that machine but might not have
> all the other features, so I think users tend to use the raspberry pi
> kernel sources for now.
> 
> > > - From the report, I see that the returned time value is larger than the
> > >   expected time, by 3.4 to 14.5 million seconds in four samples, my
> > >   guess is that a random number gets added in at some point.
> >
> > What kind code are you using to reproduce it? It is threaded or issue
> > clock_gettime from signal handlers?
> 
> The reproducer is very simple without threads or signals,
> see the start of https://github.com/richfelker/musl-cross-make/issues/96
> 
> It does rely on calling into the musl wrapper, not the direct vdso
> call.
> 
> > > - From other sources, I found that the Raspberry Pi clocksource runs
> > >   at 54 MHz, with a mask value of 0xffffffffffffff. From these numbers
> > >   I would expect that reading a completely random hardware register
> > >   value would result in an offset up to 1.33 billion seconds, which is
> > >   around factor 100 more than the error we see, though similar.
> > >
> > > - The test case calls the musl clock_gettime() function, which falls back to
> > >   the clock_gettime64() syscall on kernels prior to 5.5, or to the 32-bit
> > >   clock_gettime() prior to Linux-5.1. As reported in the bug, Linux-4.19 does
> > >   not show the bug.
> > >
> > > - The behavior was not reproduced on the same user space in qemu,
> > >   though I cannot tell whether the exact same kernel binary was used.
> > >
> > > - glibc-2.31 calls the same clock_gettime64() vdso function on arm to
> > >   implement clock_gettime(), but earlier versions did not. I have not
> > >   seen any reports of this bug, which could be explained by users
> > >   generally being on older versions.
> > >
> > > - As far as I can tell, there are no reports of this bug from other users,
> > >   and so far nobody could reproduce it.

note: i could not reproduce it in qemu-system with these configs:

qemu-system-aarch64 + arm64 kernel + compat vdso
qemu-system-aarch64 + kvm accel (on cortex-a72) + 32bit arm kernel
qemu-system-arm + cpu max + 32bit arm kernel

so i think it's something specific to that user's setup
(maybe rpi hw bug or gcc miscompiled the vdso or something
with that particular linux, i built my own linux 5.6 because
i did not know the exact kernel version where the bug was seen)

i don't have access to rpi (or other cortex-a53 where i
can install my own kernel) so this is as far as i got.

> > >
> > > - The current musl git tree has been patched to not call clock_gettime64
> > >    on ARM because of this problem, so it cannot be used for reproducing it.
> >
> > So should glibc follow musl and remove arm clock_gettime64 vDSO support,
> > or is this bug localized to a specific kernel version running on
> > specific hardware?
> 
> I hope we can figure out what is actually going on soon, there is probably
> no need to change glibc before we have.
> 
>           Arnd

-- 


* Re: clock_gettime64 vdso bug on 32-bit arm, rpi-4
  2020-05-20 15:41     ` Szabolcs Nagy
@ 2020-05-20 16:08       ` Rich Felker
  2020-05-20 17:09         ` Rich Felker
  0 siblings, 1 reply; 8+ messages in thread
From: Rich Felker @ 2020-05-20 16:08 UTC (permalink / raw)
  To: Szabolcs Nagy
  Cc: Arnd Bergmann, Adhemerval Zanella, Vincenzo Frascino,
	Russell King - ARM Linux, Will Deacon, Jack Schmidt, Linux ARM,
	Linux Kernel Mailing List, Thomas Gleixner, Stephen Boyd, nd

On Wed, May 20, 2020 at 04:41:29PM +0100, Szabolcs Nagy wrote:
> The 05/19/2020 22:31, Arnd Bergmann wrote:
> > On Tue, May 19, 2020 at 10:24 PM Adhemerval Zanella
> > <adhemerval.zanella@linaro.org> wrote:
> > > On 19/05/2020 16:54, Arnd Bergmann wrote:
> > > > Jack Schmidt reported a bug for the arm32 clock_gettime64 vdso call last
> > > > month: https://github.com/richfelker/musl-cross-make/issues/96 and
> > > > https://github.com/raspberrypi/linux/issues/3579
> > > >
> > > > As Will Deacon pointed out, this was never reported on the mailing list,
> > > > so I'll try to summarize what we know, so this can hopefully be resolved soon.
> > > >
> > > > - This happened reproducibly on Linux-5.6 on a 32-bit Raspberry Pi patched
> > > >    kernel running on a 64-bit Raspberry Pi 4b (bcm2711) when calling
> > > >    clock_gettime64(CLOCK_REALTIME)
> > >
> > > Does it happen with other clocks as well?
> > 
> > Unclear.
> > 
> > > > - The kernel tree is at https://github.com/raspberrypi/linux/, but I could
> > > >   see no relevant changes compared to a mainline kernel.
> > >
> > > Is this bug reproducible with mainline kernel or mainline kernel can't be
> > > booted on bcm2711?
> > 
> > Mainline linux-5.6 should boot on that machine but might not have
> > all the other features, so I think users tend to use the raspberry pi
> > kernel sources for now.
> > 
> > > > - From the report, I see that the returned time value is larger than the
> > > >   expected time, by 3.4 to 14.5 million seconds in four samples, my
> > > >   guess is that a random number gets added in at some point.
> > >
> > > What kind code are you using to reproduce it? It is threaded or issue
> > > clock_gettime from signal handlers?
> > 
> > The reproducer is very simple without threads or signals,
> > see the start of https://github.com/richfelker/musl-cross-make/issues/96
> > 
> > It does rely on calling into the musl wrapper, not the direct vdso
> > call.
> > 
> > > > - From other sources, I found that the Raspberry Pi clocksource runs
> > > >   at 54 MHz, with a mask value of 0xffffffffffffff. From these numbers
> > > >   I would expect that reading a completely random hardware register
> > > >   value would result in an offset up to 1.33 billion seconds, which is
> > > >   around factor 100 more than the error we see, though similar.
> > > >
> > > > - The test case calls the musl clock_gettime() function, which falls back to
> > > >   the clock_gettime64() syscall on kernels prior to 5.5, or to the 32-bit
> > > >   clock_gettime() prior to Linux-5.1. As reported in the bug, Linux-4.19 does
> > > >   not show the bug.
> > > >
> > > > - The behavior was not reproduced on the same user space in qemu,
> > > >   though I cannot tell whether the exact same kernel binary was used.
> > > >
> > > > - glibc-2.31 calls the same clock_gettime64() vdso function on arm to
> > > >   implement clock_gettime(), but earlier versions did not. I have not
> > > >   seen any reports of this bug, which could be explained by users
> > > >   generally being on older versions.
> > > >
> > > > - As far as I can tell, there are no reports of this bug from other users,
> > > >   and so far nobody could reproduce it.
> 
> note: i could not reproduce it in qemu-system with these configs:
> 
> qemu-system-aarch64 + arm64 kernel + compat vdso
> qemu-system-aarch64 + kvm accel (on cortex-a72) + 32bit arm kernel
> qemu-system-arm + cpu max + 32bit arm kernel
> 
> so i think it's something specific to that user's setup
> (maybe rpi hw bug or gcc miscompiled the vdso or something
> with that particular linux, i built my own linux 5.6 because
> i did not know the exact kernel version where the bug was seen)
> 
> i don't have access to rpi (or other cortex-a53 where i
> can install my own kernel) so this is as far as i got.

If we have a binary of the kernel that's known to be failing on the
hardware, it would be useful to dump its vdso and examine the
disassembly to see if it was miscompiled.
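
One possible way to get at the vdso image is to copy it out of a running process on the affected kernel. The sketch below assumes a Linux system where /proc/self/maps lists a "[vdso]" mapping and /proc/self/mem is readable; the function names and the output file name are hypothetical.

```python
def find_vdso_range(maps_text):
    """Return (start, end) of the [vdso] mapping found in the text of
    /proc/<pid>/maps, or None if there is no such mapping."""
    for line in maps_text.splitlines():
        fields = line.split()
        if fields and fields[-1] == "[vdso]":
            start, end = (int(addr, 16) for addr in fields[0].split("-"))
            return start, end
    return None


def dump_vdso(path="vdso.so"):
    """Copy this process's vdso mapping into a file; return its size."""
    with open("/proc/self/maps") as maps:
        found = find_vdso_range(maps.read())
    if found is None:
        raise RuntimeError("no [vdso] mapping found")
    start, end = found
    # /proc/self/mem exposes the address space as a seekable file.
    with open("/proc/self/mem", "rb", buffering=0) as mem:
        mem.seek(start)
        image = mem.read(end - start)
    with open(path, "wb") as out:
        out.write(image)
    return len(image)


if __name__ == "__main__":
    print(f"wrote {dump_vdso()} bytes")
```

The resulting file is an ordinary ELF shared object, so it can be inspected with tools such as readelf -s or objdump -d.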

Rich


* Re: clock_gettime64 vdso bug on 32-bit arm, rpi-4
  2020-05-20 16:08       ` Rich Felker
@ 2020-05-20 17:09         ` Rich Felker
  2020-05-20 20:52           ` Arnd Bergmann
  0 siblings, 1 reply; 8+ messages in thread
From: Rich Felker @ 2020-05-20 17:09 UTC (permalink / raw)
  To: Szabolcs Nagy
  Cc: Arnd Bergmann, Adhemerval Zanella, Vincenzo Frascino,
	Russell King - ARM Linux, Will Deacon, Jack Schmidt, Linux ARM,
	Linux Kernel Mailing List, Thomas Gleixner, Stephen Boyd, nd

On Wed, May 20, 2020 at 12:08:10PM -0400, Rich Felker wrote:
> On Wed, May 20, 2020 at 04:41:29PM +0100, Szabolcs Nagy wrote:
> > The 05/19/2020 22:31, Arnd Bergmann wrote:
> > > On Tue, May 19, 2020 at 10:24 PM Adhemerval Zanella
> > > <adhemerval.zanella@linaro.org> wrote:
> > > > On 19/05/2020 16:54, Arnd Bergmann wrote:
> > > > > Jack Schmidt reported a bug for the arm32 clock_gettime64 vdso call last
> > > > > month: https://github.com/richfelker/musl-cross-make/issues/96 and
> > > > > https://github.com/raspberrypi/linux/issues/3579
> > > > >
> > > > > As Will Deacon pointed out, this was never reported on the mailing list,
> > > > > so I'll try to summarize what we know, so this can hopefully be resolved soon.
> > > > >
> > > > > - This happened reproducibly on Linux-5.6 on a 32-bit Raspberry Pi patched
> > > > >    kernel running on a 64-bit Raspberry Pi 4b (bcm2711) when calling
> > > > >    clock_gettime64(CLOCK_REALTIME)
> > > >
> > > > Does it happen with other clocks as well?
> > > 
> > > Unclear.
> > > 
> > > > > - The kernel tree is at https://github.com/raspberrypi/linux/, but I could
> > > > >   see no relevant changes compared to a mainline kernel.
> > > >
> > > > Is this bug reproducible with mainline kernel or mainline kernel can't be
> > > > booted on bcm2711?
> > > 
> > > Mainline linux-5.6 should boot on that machine but might not have
> > > all the other features, so I think users tend to use the raspberry pi
> > > kernel sources for now.
> > > 
> > > > > - From the report, I see that the returned time value is larger than the
> > > > >   expected time, by 3.4 to 14.5 million seconds in four samples, my
> > > > >   guess is that a random number gets added in at some point.
> > > >
> > > > What kind code are you using to reproduce it? It is threaded or issue
> > > > clock_gettime from signal handlers?
> > > 
> > > The reproducer is very simple without threads or signals,
> > > see the start of https://github.com/richfelker/musl-cross-make/issues/96
> > > 
> > > It does rely on calling into the musl wrapper, not the direct vdso
> > > call.
> > > 
> > > > > - From other sources, I found that the Raspberry Pi clocksource runs
> > > > >   at 54 MHz, with a mask value of 0xffffffffffffff. From these numbers
> > > > >   I would expect that reading a completely random hardware register
> > > > >   value would result in an offset up to 1.33 billion seconds, which is
> > > > >   around factor 100 more than the error we see, though similar.
> > > > >
> > > > > - The test case calls the musl clock_gettime() function, which falls back to
> > > > >   the clock_gettime64() syscall on kernels prior to 5.5, or to the 32-bit
> > > > >   clock_gettime() prior to Linux-5.1. As reported in the bug, Linux-4.19 does
> > > > >   not show the bug.
> > > > >
> > > > > - The behavior was not reproduced on the same user space in qemu,
> > > > >   though I cannot tell whether the exact same kernel binary was used.
> > > > >
> > > > > - glibc-2.31 calls the same clock_gettime64() vdso function on arm to
> > > > >   implement clock_gettime(), but earlier versions did not. I have not
> > > > >   seen any reports of this bug, which could be explained by users
> > > > >   generally being on older versions.
> > > > >
> > > > > - As far as I can tell, there are no reports of this bug from other users,
> > > > >   and so far nobody could reproduce it.
> > 
> > note: i could not reproduce it in qemu-system with these configs:
> > 
> > qemu-system-aarch64 + arm64 kernel + compat vdso
> > qemu-system-aarch64 + kvm accel (on cortex-a72) + 32bit arm kernel
> > qemu-system-arm + cpu max + 32bit arm kernel
> > 
> > so i think it's something specific to that user's setup
> > (maybe rpi hw bug or gcc miscompiled the vdso or something
> > with that particular linux, i built my own linux 5.6 because
> > i did not know the exact kernel version where the bug was seen)
> > 
> > i don't have access to rpi (or other cortex-a53 where i
> > can install my own kernel) so this is as far as i got.
> 
> If we have a binary of the kernel that's known to be failing on the
> hardware, it would be useful to dump its vdso and examine the
> disassembly to see if it was miscompiled.

OK, OP posted it and I think we've solved this. See
https://github.com/richfelker/musl-cross-make/issues/96#issuecomment-631604410

And my analysis:

<@dalias> see what i just found on the tracker
<@dalias> patch_vdso/vdso_nullpatch_one in arch/arm/kernel/vdso.c patches out the time32 functions in this case
<@dalias> but not the time64 one
<@dalias> this looks like a real kernel bug that's not hw-specific except breaking on all hardware where the patching-out is needed
<@dalias> we could possibly work around it by refusing to use the time64 vdso unless the time32 one is also present
<@dalias> yep
<@dalias> so i think we've solved this. the kernel thought it wasnt using vdso anymore because it patched it out
<@dalias> but it forgot to patch out the time64 one
<@dalias> so it stopped updating the data needed for vdso to work
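
The workaround floated in the log (only trust the time64 vdso entry when the time32 one is also present) might look roughly like this in a libc-style symbol lookup. The symbol names are assumed to be the ones the ARM vdso normally exports, and the helper name is made up for illustration.

```python
# Assumed names of the ARM vdso entry points.
TIME32_SYMBOL = "__vdso_clock_gettime"
TIME64_SYMBOL = "__vdso_clock_gettime64"


def time64_vdso_is_trustworthy(exported_symbols):
    """Use the time64 vdso entry only if the time32 entry is exported too.

    A vdso that still exports the time64 entry after the time32 one was
    patched out is treated as suspect, and the caller should use the
    syscall instead.
    """
    return (TIME64_SYMBOL in exported_symbols
            and TIME32_SYMBOL in exported_symbols)
```

With both symbols present the vdso path stays enabled; with only the time64 symbol left, as in the affected configuration, the check returns False.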



* Re: clock_gettime64 vdso bug on 32-bit arm, rpi-4
  2020-05-20 17:09         ` Rich Felker
@ 2020-05-20 20:52           ` Arnd Bergmann
  0 siblings, 0 replies; 8+ messages in thread
From: Arnd Bergmann @ 2020-05-20 20:52 UTC (permalink / raw)
  To: Rich Felker
  Cc: Szabolcs Nagy, Adhemerval Zanella, Vincenzo Frascino,
	Russell King - ARM Linux, Will Deacon, Jack Schmidt, Linux ARM,
	Linux Kernel Mailing List, Thomas Gleixner, Stephen Boyd, nd,
	Florian Fainelli, Mark Rutland, Marc Zyngier

On Wed, May 20, 2020 at 7:09 PM Rich Felker <dalias@libc.org> wrote:
>
> On Wed, May 20, 2020 at 12:08:10PM -0400, Rich Felker wrote:
> > On Wed, May 20, 2020 at 04:41:29PM +0100, Szabolcs Nagy wrote:
> > > The 05/19/2020 22:31, Arnd Bergmann wrote:
> > > > On Tue, May 19, 2020 at 10:24 PM Adhemerval Zanella
> > > > <adhemerval.zanella@linaro.org> wrote:
> > > > > On 19/05/2020 16:54, Arnd Bergmann wrote:
> > > note: i could not reproduce it in qemu-system with these configs:
> > >
> > > qemu-system-aarch64 + arm64 kernel + compat vdso
> > > qemu-system-aarch64 + kvm accel (on cortex-a72) + 32bit arm kernel
> > > qemu-system-arm + cpu max + 32bit arm kernel
> > >
> > > so i think it's something specific to that user's setup
> > > (maybe rpi hw bug or gcc miscompiled the vdso or something
> > > with that particular linux, i built my own linux 5.6 because
> > > i did not know the exact kernel version where the bug was seen)
> > >
> > > i don't have access to rpi (or other cortex-a53 where i
> > > can install my own kernel) so this is as far as i got.
> >
> > If we have a binary of the kernel that's known to be failing on the
> > hardware, it would be useful to dump its vdso and examine the
> > disassembly to see if it was miscompiled.
>
> OK, OP posted it and I think we've solved this. See
> https://github.com/richfelker/musl-cross-make/issues/96#issuecomment-631604410

Thanks a lot everyone for figuring this out.

> And my analysis:
>
> <@dalias> see what i just found on the tracker
> <@dalias> patch_vdso/vdso_nullpatch_one in arch/arm/kernel/vdso.c patches out the time32 functions in this case
> <@dalias> but not the time64 one
> <@dalias> this looks like a real kernel bug that's not hw-specific except breaking on all hardware where the patching-out is needed
> <@dalias> we could possibly work around it by refusing to use the time64 vdso unless the time32 one is also present
> <@dalias> yep
> <@dalias> so i think we've solved this. the kernel thought it wasnt using vdso anymore because it patched it out
> <@dalias> but it forgot to patch out the time64 one
> <@dalias> so it stopped updating the data needed for vdso to work

As you mentioned in the issue tracker, the patching was meant as
an optimization, and missing it for clock_gettime64 was a mistake, but
that by itself should not have caused incorrect data to be returned.

I would assume that there is another bug that leads to clock_gettime64
not entering the syscall fallback path as it should but instead returning
bogus data.

Here are some more things I found:

- From reading the linux-5.6 code that was tested, I see that a condition
  that leads to patching out the clock_gettime() vdso should also lead to
  clock_gettime64() falling back to the syscall after
  __arch_get_hw_counter() returns an error, but for some reason that
  does not happen. Presumably the presence of the patching meant that
  this code path was never much exercised.
  A missing 45939ce292b4 ("ARM: 8957/1: VDSO: Match ARMv8 timer in
  cntvct_functional()") would explain the problem, if it happened on
  linux-5.6-rc7 or earlier. The fix was merged in the final v5.6 though.

- The patching may actually be counterproductive because it means that
   clock_gettime(CLOCK_*COARSE, ...) has to go through the system call
   when it could just return the time of the last timer tick regardless of the
   clocksource.

- We may get bitten by errata handling on 32-bit kernels running on 64-bit
  hardware that has errata workarounds in arch/arm64 for compat mode
  but not in native arm kernels. ARM64_ERRATUM_1418040,
  ARM64_ERRATUM_858921 or SUN50I_ERRATUM_UNKNOWN1
  are examples of workarounds that are not used on 32-bit kernels running
  on 64-bit hardware.
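
The control flow expected in the first item above can be pictured with a toy model. This is not kernel code: the function and parameter names are hypothetical, and it only illustrates that an unusable counter should send clock_gettime64() to the syscall rather than to vdso data that is no longer being updated.

```python
def clock_gettime64_model(counter_usable, read_vdso_time, syscall_fallback):
    """Toy model of the expected vdso behaviour.

    counter_usable stands in for __arch_get_hw_counter() succeeding.
    When it is False, the vdso data may be stale, so the only safe
    answer comes from the syscall.
    """
    if not counter_usable:
        return syscall_fallback()
    return read_vdso_time()


# Stale vdso data versus the value the syscall would report
# (both numbers are arbitrary placeholders).
stale_vdso = lambda: 1_600_000_000
real_time = lambda: 1_590_000_000

# Counter unusable: the model must not return the stale vdso value.
print(clock_gettime64_model(False, stale_vdso, real_time))
```

The reported symptom corresponds to the stale branch being taken even though the counter was not usable.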

         Arnd

