* Random, rare, but reproducible segmentation faults
@ 2020-07-08 20:28 Aurelien Jarno
  2020-07-09 10:49 ` Alex Ghiti
  0 siblings, 1 reply; 5+ messages in thread
From: Aurelien Jarno @ 2020-07-08 20:28 UTC (permalink / raw)
  To: linux-riscv; +Cc: Alexandre Ghiti

Hi all,

For some time I have seen random but relatively rare segmentation faults
when building Debian packages, either in a QEMU virtual machine or on a
HiFive Unleashed board. I have recently been able to reproduce the issue
by building Qt [1]. It usually fails after roughly one hour of building,
not always on the same C++ file, but always on the same set of files.
Running make again usually lets the build succeed. When it happens, I
get the following error from GCC:

| g++: internal compiler error: Segmentation fault signal terminated program cc1plus
| Please submit a full bug report,
| with preprocessed source if appropriate.

And the following kernel error:

| [267054.967857] cc1plus[1171618]: unhandled signal 11 code 0x1 at 0x000000156888a430 in cc1plus[10000+e0e000]
| [267054.976759] CPU: 3 PID: 1171618 Comm: cc1plus Not tainted 5.7.7+ #1
| [267054.983101] epc: 0000000000a70e3e ra : 0000000000a71dbe sp : 0000003ffff3c870
| [267054.990293]  gp : 0000000000e33158 tp : 000000156a0f0720 t0 : 0000001569feb0d0
| [267054.997593]  t1 : 0000000000182a2c t2 : 0000000000e30620 s0 : 000000000003b7c0
| [267055.004880]  s1 : 000000000003b7c0 a0 : 000000156888a420 a1 : 000000000003b7c0
| [267055.012176]  a2 : 0000000000000002 a3 : 000000000003b7c0 a4 : 000000000003b7c0
| [267055.019473]  a5 : 0000000000000001 a6 : 61746e656d676553 a7 : 73737350581f0402
| [267055.026763]  s2 : 0000003ffff3c8c8 s3 : 000000156888a420 s4 : 000000007fffffff
| [267055.034052]  s5 : 0000000070000000 s6 : 0000000000000000 s7 : 0000003ffff3d638
| [267055.041345]  s8 : 0000000000e9ab18 s9 : 0000000000000000 s10: 0000000000e9a9d0
| [267055.048636]  s11: 0000000000000000 t3 : 000000156888a420 t4 : 0000000000000001
| [267055.055930]  t5 : 0000000000000001 t6 : 0000000000000000
| [267055.061311] status: 8000000200006020 badaddr: 000000156888a430 cause: 000000000000000d

I have been able to bisect the issue and found that it was introduced by
this commit:

| commit 54c95a11cc1b5e1d578209e027adf5775395dafd
| Author: Alexandre Ghiti <alex@ghiti.fr>
| Date:   Mon Sep 23 15:39:21 2019 -0700
|
|     riscv: make mmap allocation top-down by default
|
|     In order to avoid wasting user address space by using bottom-up mmap
|     allocation scheme, prefer top-down scheme when possible.

Reverting this commit, even on 5.7.7, fixes the issue. I debugged things
a bit further and found that the problem does not come from the top-down
allocation itself (i.e. setting vm.legacy_va_layout to 1 doesn't change
anything). It comes from the randomization instead: the following patch
is enough to fix (or work around?) the issue:

| --- a/mm/util.c
| +++ b/mm/util.c
| @@ -397,8 +397,8 @@ void arch_pick_mmap_layout(struct mm_struct *mm, struct rlimit *rlim_stack)
|  {
|         unsigned long random_factor = 0UL;
|  
| -       if (current->flags & PF_RANDOMIZE)
| -               random_factor = arch_mmap_rnd();
| +/*     if (current->flags & PF_RANDOMIZE)
| +               random_factor = arch_mmap_rnd();*/
|  
|         if (mmap_is_legacy(rlim_stack)) {
|                 mm->mmap_base = TASK_UNMAPPED_BASE + random_factor;
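
For reference, the random_factor disabled above comes from the generic
arch_mmap_rnd() helper used with this mmap layout (it is not riscv
specific); a simplified sketch, with the compat case omitted and not the
exact kernel code, is roughly:

	unsigned long arch_mmap_rnd(void)
	{
		unsigned long rnd;

		/* keep vm.mmap_rnd_bits bits of randomness ... */
		rnd = get_random_long() & ((1UL << mmap_rnd_bits) - 1);

		/* ... and turn them into a page-aligned offset */
		return rnd << PAGE_SHIFT;
	}

In other words, vm.mmap_rnd_bits directly controls how many page bits of
the mmap base are randomized.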

I have also tried playing with vm.mmap_rnd_bits, but it doesn't seem to
have any real effect. I did notice, however, that setting it to 24 (the
maximum) may move the heap next to the stack in some cases, although
that seems unrelated to the original issue:

| $ cat /proc/self/maps
| 340fde4000-340fe06000 rw-p 00000000 00:00 0 
| 340fe06000-340ff08000 r-xp 00000000 fe:01 1835219                        /lib/riscv64-linux-gnu/libc-2.30.so
| 340ff08000-340ff09000 ---p 00102000 fe:01 1835219                        /lib/riscv64-linux-gnu/libc-2.30.so
| 340ff09000-340ff0c000 r--p 00102000 fe:01 1835219                        /lib/riscv64-linux-gnu/libc-2.30.so
| 340ff0c000-340ff0f000 rw-p 00105000 fe:01 1835219                        /lib/riscv64-linux-gnu/libc-2.30.so
| 340ff0f000-340ff14000 rw-p 00000000 00:00 0 
| 340ff1d000-340ff1f000 r-xp 00000000 00:00 0                              [vdso]
| 340ff1f000-340ff36000 r-xp 00000000 fe:01 1835039                        /lib/riscv64-linux-gnu/ld-2.30.so
| 340ff36000-340ff37000 r--p 00016000 fe:01 1835039                        /lib/riscv64-linux-gnu/ld-2.30.so
| 340ff37000-340ff38000 rw-p 00017000 fe:01 1835039                        /lib/riscv64-linux-gnu/ld-2.30.so
| 340ff38000-340ff39000 rw-p 00000000 00:00 0 
| 3924505000-392450b000 r-xp 00000000 fe:01 1048640                        /bin/cat
| 392450b000-392450c000 r--p 00005000 fe:01 1048640                        /bin/cat
| 392450c000-392450d000 rw-p 00006000 fe:01 1048640                        /bin/cat
| 3955e23000-3955e44000 rw-p 00000000 00:00 0                              [heap]
| 3fffa2b000-3fffa4c000 rw-p 00000000 00:00 0                              [stack]
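
As a side note, a single process can also be run with the randomization
disabled without patching the kernel, by setting the ADDR_NO_RANDOMIZE
personality flag (which makes the kernel skip PF_RANDOMIZE at exec
time); this is what "setarch --addr-no-randomize" does. A minimal,
untested wrapper sketch:

	/* norand.c: run a command with address-space randomization off.
	 * Sketch only, error handling kept to a minimum. */
	#include <stdio.h>
	#include <unistd.h>
	#include <sys/personality.h>

	int main(int argc, char **argv)
	{
		if (argc < 2) {
			fprintf(stderr, "usage: %s command [args...]\n", argv[0]);
			return 1;
		}
		/* Keep the current persona, only add ADDR_NO_RANDOMIZE. */
		if (personality(personality(0xffffffffUL) | ADDR_NO_RANDOMIZE) == -1)
			perror("personality");
		execvp(argv[1], &argv[1]);
		perror("execvp");
		return 1;
	}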

To come back to the original issue, I don't know how to debug it
further. I have already tried to get a core dump and analyze it with
GDB, but the code triggering the failure is not part of the binary. Any
hint or help would be welcome.

Thanks,
Aurelien


[1] Technically the qtbase-opensource-src Debian package:
    https://packages.debian.org/source/sid/qtbase-opensource-src

-- 
Aurelien Jarno                          GPG: 4096R/1DDD8C9B
aurelien@aurel32.net                 http://www.aurel32.net


* Re: Random, rare, but reproducible segmentation faults
  2020-07-08 20:28 Random, rare, but reproducible segmentation faults Aurelien Jarno
@ 2020-07-09 10:49 ` Alex Ghiti
  2020-07-10  5:15   ` Alex Ghiti
  0 siblings, 1 reply; 5+ messages in thread
From: Alex Ghiti @ 2020-07-09 10:49 UTC (permalink / raw)
  To: Aurelien Jarno, linux-riscv

Hi Aurélien,

On 7/8/20 4:28 PM, Aurelien Jarno wrote:
> Hi all,
> 
> For some time I have seen random but relatively rare segmentation faults
> when building Debian packages, either in a QEMU virtual machine or an
> Hifive Unleashed board. I have recently been able to reproduce the issue
> by building Qt [1]. It usually fails after roughly one hour of build,
> not always on the same C++ file, but always on the same set of files.
> Trying to run make again the build is usually successful. When it
> happens one get the following error from GCC:
> 
> | g++: internal compiler error: Segmentation fault signal terminated program cc1plus
> | Please submit a full bug report,
> | with preprocessed source if appropriate.
> 
> And the following kernel error:
> 
> | [267054.967857] cc1plus[1171618]: unhandled signal 11 code 0x1 at 0x000000156888a430 in cc1plus[10000+e0e000]
> | [267054.976759] CPU: 3 PID: 1171618 Comm: cc1plus Not tainted 5.7.7+ #1
> | [267054.983101] epc: 0000000000a70e3e ra : 0000000000a71dbe sp : 0000003ffff3c870
> | [267054.990293]  gp : 0000000000e33158 tp : 000000156a0f0720 t0 : 0000001569feb0d0
> | [267054.997593]  t1 : 0000000000182a2c t2 : 0000000000e30620 s0 : 000000000003b7c0
> | [267055.004880]  s1 : 000000000003b7c0 a0 : 000000156888a420 a1 : 000000000003b7c0
> | [267055.012176]  a2 : 0000000000000002 a3 : 000000000003b7c0 a4 : 000000000003b7c0
> | [267055.019473]  a5 : 0000000000000001 a6 : 61746e656d676553 a7 : 73737350581f0402
> | [267055.026763]  s2 : 0000003ffff3c8c8 s3 : 000000156888a420 s4 : 000000007fffffff
> | [267055.034052]  s5 : 0000000070000000 s6 : 0000000000000000 s7 : 0000003ffff3d638
> | [267055.041345]  s8 : 0000000000e9ab18 s9 : 0000000000000000 s10: 0000000000e9a9d0
> | [267055.048636]  s11: 0000000000000000 t3 : 000000156888a420 t4 : 0000000000000001
> | [267055.055930]  t5 : 0000000000000001 t6 : 0000000000000000
> | [267055.061311] status: 8000000200006020 badaddr: 000000156888a430 cause: 000000000000000d
> 
> I have been able to bisect the issue and found it has been introduced by
> this commit:
> 
> | commit 54c95a11cc1b5e1d578209e027adf5775395dafd
> | Author: Alexandre Ghiti <alex@ghiti.fr>
> | Date:   Mon Sep 23 15:39:21 2019 -0700
> |
> |     riscv: make mmap allocation top-down by default
> |
> |     In order to avoid wasting user address space by using bottom-up mmap
> |     allocation scheme, prefer top-down scheme when possible.
> 
> Reverting this commit, even on 5.7.7, fixes the issue. I debugged things
> a bit more, and found that the problem doesn't come from the top-down
> allocation (i.e. setting vm.legacy_va_layout to 1 doesn't change
> anything). However I have found it comes from the randomization, I mean
> that the following patch is enough to fix (or workaround?) the issue:
> 
> | --- a/mm/util.c
> | +++ b/mm/util.c
> | @@ -397,8 +397,8 @@ void arch_pick_mmap_layout(struct mm_struct *mm, struct rlimit *rlim_stack)
> |  {
> |         unsigned long random_factor = 0UL;
> |
> | -       if (current->flags & PF_RANDOMIZE)
> | -               random_factor = arch_mmap_rnd();
> | +/*     if (current->flags & PF_RANDOMIZE)
> | +               random_factor = arch_mmap_rnd();*/
> |
> |         if (mmap_is_legacy(rlim_stack)) {
> |                 mm->mmap_base = TASK_UNMAPPED_BASE + random_factor;
> 
> I have also tried to play with vm.mmap_rnd_bits, but it seems that it
> doesn't have any real effect. 

That's too bad, because this is the only riscv-specific thing in this
feature. It is also used by arm, arm64 and mips, so we should be looking
at something riscv-specific here, but right now I don't know what.

I will take a look at that.

Thanks,

Alex

> I however noticed that setting this value
> to 24 (the maximum) might move the heap next to the stack in some cases,
> although it seems unrelated to the original issue:
> 
> | $ cat /proc/self/maps
> | 340fde4000-340fe06000 rw-p 00000000 00:00 0
> | 340fe06000-340ff08000 r-xp 00000000 fe:01 1835219                        /lib/riscv64-linux-gnu/libc-2.30.so
> | 340ff08000-340ff09000 ---p 00102000 fe:01 1835219                        /lib/riscv64-linux-gnu/libc-2.30.so
> | 340ff09000-340ff0c000 r--p 00102000 fe:01 1835219                        /lib/riscv64-linux-gnu/libc-2.30.so
> | 340ff0c000-340ff0f000 rw-p 00105000 fe:01 1835219                        /lib/riscv64-linux-gnu/libc-2.30.so
> | 340ff0f000-340ff14000 rw-p 00000000 00:00 0
> | 340ff1d000-340ff1f000 r-xp 00000000 00:00 0                              [vdso]
> | 340ff1f000-340ff36000 r-xp 00000000 fe:01 1835039                        /lib/riscv64-linux-gnu/ld-2.30.so
> | 340ff36000-340ff37000 r--p 00016000 fe:01 1835039                        /lib/riscv64-linux-gnu/ld-2.30.so
> | 340ff37000-340ff38000 rw-p 00017000 fe:01 1835039                        /lib/riscv64-linux-gnu/ld-2.30.so
> | 340ff38000-340ff39000 rw-p 00000000 00:00 0
> | 3924505000-392450b000 r-xp 00000000 fe:01 1048640                        /bin/cat
> | 392450b000-392450c000 r--p 00005000 fe:01 1048640                        /bin/cat
> | 392450c000-392450d000 rw-p 00006000 fe:01 1048640                        /bin/cat
> | 3955e23000-3955e44000 rw-p 00000000 00:00 0                              [heap]
> | 3fffa2b000-3fffa4c000 rw-p 00000000 00:00 0                              [stack]
> 
> To come back to the original issue, I don't know how to debug it
> further. I have already tried to get a core dump and analyze it with
> GDB, but the code triggering the failure is not part of the binary. Any
> hint or help would be welcomed.
> 
> Thanks,
> Aurelien
> 
> 
> [1] Technically the qtbase-opensource-src Debian package:
>      https://packages.debian.org/source/sid/qtbase-opensource-src
> 


* Re: Random, rare, but reproducible segmentation faults
  2020-07-09 10:49 ` Alex Ghiti
@ 2020-07-10  5:15   ` Alex Ghiti
  2020-07-10 19:12     ` Aurelien Jarno
  0 siblings, 1 reply; 5+ messages in thread
From: Alex Ghiti @ 2020-07-10  5:15 UTC (permalink / raw)
  To: Aurelien Jarno, linux-riscv

Hi Aurélien,

On 7/9/20 6:49 AM, Alex Ghiti wrote:
> Hi Aurélien,
> 
> On 7/8/20 4:28 PM, Aurelien Jarno wrote:
>> Hi all,
>>
>> For some time I have seen random but relatively rare segmentation faults
>> when building Debian packages, either in a QEMU virtual machine or an
>> Hifive Unleashed board. I have recently been able to reproduce the issue
>> by building Qt [1]. It usually fails after roughly one hour of build,
>> not always on the same C++ file, but always on the same set of files.
>> Trying to run make again the build is usually successful. When it
>> happens one get the following error from GCC:
>>
>> | g++: internal compiler error: Segmentation fault signal terminated 
>> program cc1plus
>> | Please submit a full bug report,
>> | with preprocessed source if appropriate.
>>
>> And the following kernel error:
>>
>> | [267054.967857] cc1plus[1171618]: unhandled signal 11 code 0x1 at 
>> 0x000000156888a430 in cc1plus[10000+e0e000]
>> | [267054.976759] CPU: 3 PID: 1171618 Comm: cc1plus Not tainted 5.7.7+ #1
>> | [267054.983101] epc: 0000000000a70e3e ra : 0000000000a71dbe sp : 
>> 0000003ffff3c870
>> | [267054.990293]  gp : 0000000000e33158 tp : 000000156a0f0720 t0 : 
>> 0000001569feb0d0
>> | [267054.997593]  t1 : 0000000000182a2c t2 : 0000000000e30620 s0 : 
>> 000000000003b7c0
>> | [267055.004880]  s1 : 000000000003b7c0 a0 : 000000156888a420 a1 : 
>> 000000000003b7c0
>> | [267055.012176]  a2 : 0000000000000002 a3 : 000000000003b7c0 a4 : 
>> 000000000003b7c0
>> | [267055.019473]  a5 : 0000000000000001 a6 : 61746e656d676553 a7 : 
>> 73737350581f0402
>> | [267055.026763]  s2 : 0000003ffff3c8c8 s3 : 000000156888a420 s4 : 
>> 000000007fffffff
>> | [267055.034052]  s5 : 0000000070000000 s6 : 0000000000000000 s7 : 
>> 0000003ffff3d638
>> | [267055.041345]  s8 : 0000000000e9ab18 s9 : 0000000000000000 s10: 
>> 0000000000e9a9d0
>> | [267055.048636]  s11: 0000000000000000 t3 : 000000156888a420 t4 : 
>> 0000000000000001
>> | [267055.055930]  t5 : 0000000000000001 t6 : 0000000000000000
>> | [267055.061311] status: 8000000200006020 badaddr: 000000156888a430 
>> cause: 000000000000000d
>>
>> I have been able to bisect the issue and found it has been introduced by
>> this commit:
>>
>> | commit 54c95a11cc1b5e1d578209e027adf5775395dafd
>> | Author: Alexandre Ghiti <alex@ghiti.fr>
>> | Date:   Mon Sep 23 15:39:21 2019 -0700
>> |
>> |     riscv: make mmap allocation top-down by default
>> |
>> |     In order to avoid wasting user address space by using bottom-up 
>> mmap
>> |     allocation scheme, prefer top-down scheme when possible.
>>
>> Reverting this commit, even on 5.7.7, fixes the issue. I debugged things
>> a bit more, and found that the problem doesn't come from the top-down
>> allocation (i.e. setting vm.legacy_va_layout to 1 doesn't change
>> anything). However I have found it comes from the randomization, I mean
>> that the following patch is enough to fix (or workaround?) the issue:
>>
>> | --- a/mm/util.c
>> | +++ b/mm/util.c
>> | @@ -397,8 +397,8 @@ void arch_pick_mmap_layout(struct mm_struct *mm, 
>> struct rlimit *rlim_stack)
>> |  {
>> |         unsigned long random_factor = 0UL;
>> |
>> | -       if (current->flags & PF_RANDOMIZE)
>> | -               random_factor = arch_mmap_rnd();
>> | +/*     if (current->flags & PF_RANDOMIZE)
>> | +               random_factor = arch_mmap_rnd();*/
>> |
>> |         if (mmap_is_legacy(rlim_stack)) {
>> |                 mm->mmap_base = TASK_UNMAPPED_BASE + random_factor;
>>
>> I have also tried to play with vm.mmap_rnd_bits, but it seems that it
>> doesn't have any real effect. 
> 
> That's too bad because this is the only riscv specific thing in this 
> feature. And it is used by arm, arm64 and mips so we should be looking 
> at something riscv specific here, but right now I don't know what.
> 
> I will take a look at that.
> 
> Thanks,
> 
> Alex
> 
>> I however noticed that setting this value
>> to 24 (the maximum) might move the heap next to the stack in some cases,
>> although it seems unrelated to the original issue:
>>
>> | $ cat /proc/self/maps
>> | 340fde4000-340fe06000 rw-p 00000000 00:00 0
>> | 340fe06000-340ff08000 r-xp 00000000 fe:01 
>> 1835219                        /lib/riscv64-linux-gnu/libc-2.30.so
>> | 340ff08000-340ff09000 ---p 00102000 fe:01 
>> 1835219                        /lib/riscv64-linux-gnu/libc-2.30.so
>> | 340ff09000-340ff0c000 r--p 00102000 fe:01 
>> 1835219                        /lib/riscv64-linux-gnu/libc-2.30.so
>> | 340ff0c000-340ff0f000 rw-p 00105000 fe:01 
>> 1835219                        /lib/riscv64-linux-gnu/libc-2.30.so
>> | 340ff0f000-340ff14000 rw-p 00000000 00:00 0
>> | 340ff1d000-340ff1f000 r-xp 00000000 00:00 
>> 0                              [vdso]
>> | 340ff1f000-340ff36000 r-xp 00000000 fe:01 
>> 1835039                        /lib/riscv64-linux-gnu/ld-2.30.so
>> | 340ff36000-340ff37000 r--p 00016000 fe:01 
>> 1835039                        /lib/riscv64-linux-gnu/ld-2.30.so
>> | 340ff37000-340ff38000 rw-p 00017000 fe:01 
>> 1835039                        /lib/riscv64-linux-gnu/ld-2.30.so
>> | 340ff38000-340ff39000 rw-p 00000000 00:00 0
>> | 3924505000-392450b000 r-xp 00000000 fe:01 
>> 1048640                        /bin/cat
>> | 392450b000-392450c000 r--p 00005000 fe:01 
>> 1048640                        /bin/cat
>> | 392450c000-392450d000 rw-p 00006000 fe:01 
>> 1048640                        /bin/cat
>> | 3955e23000-3955e44000 rw-p 00000000 00:00 
>> 0                              [heap]
>> | 3fffa2b000-3fffa4c000 rw-p 00000000 00:00 
>> 0                              [stack]
>>
>> To come back to the original issue, I don't know how to debug it
>> further. I have already tried to get a core dump and analyze it with
>> GDB, but the code triggering the failure is not part of the binary. Any
>> hint or help would be welcomed.
>>
>> Thanks,
>> Aurelien
>>
>>
>> [1] Technically the qtbase-opensource-src Debian package:
>>      https://packages.debian.org/source/sid/qtbase-opensource-src
>>
> 

I have a Debian kernel downloaded from https://people.debian.org/~gio/dqib/
that runs using the following qemu command:

qemu-system-riscv64 -machine virt -cpu rv64 -m 1G -device 
virtio-blk-device,drive=hd -drive file=image.qcow2,if=none,id=hd -device 
virtio-net-device,netdev=net -netdev user,id=net,hostfwd=tcp::2222-:22 
-bios ~/wip/lpc/buildroot/build_rv64/images/fw_jump.elf -kernel kernel 
-initrd initrd -object rng-random,filename=/dev/urandom,id=rng -device 
virtio-rng-device,rng=rng -nographic -append "root=/dev/vda1 console=ttyS0"

First, is this kernel version OK to reproduce the bug? Or should I
download another image? I'd like to avoid having to rebuild the kernel
myself if possible.

Now I would like to reproduce the bug: can you give me instructions on
how to compile the Qt package?

Is the page fault address always in the same area? It might be
interesting to find some pattern in those addresses; maybe you could
also print the random offset to try to link the two. It would also help
to print the entire virtual memory mapping at the time of the fault (I
don't know how to do that) to check what the address is close to. The
0xd cause implies that the virtual address does not exist at all, which
is weird; my guess is that the randomization "reveals" the bug, but that
the bug is still there once the randomization is disabled.
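
To print the random offset, a hypothetical (untested) one-liner at the
end of arch_pick_mmap_layout() in mm/util.c might be enough, for
example:

	/* Sketch: log the mmap randomization chosen for each exec. */
	pr_info("%s[%d]: random_factor=%#lx mmap_base=%#lx\n",
		current->comm, task_pid_nr(current),
		random_factor, mm->mmap_base);

That would make it easy to correlate the faulting addresses with the
per-exec offset.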

I'm sorry I don't have much more to propose here :(

Alex



* Re: Random, rare, but reproducible segmentation faults
  2020-07-10  5:15   ` Alex Ghiti
@ 2020-07-10 19:12     ` Aurelien Jarno
  2020-07-10 20:37       ` Alex Ghiti
  0 siblings, 1 reply; 5+ messages in thread
From: Aurelien Jarno @ 2020-07-10 19:12 UTC (permalink / raw)
  To: Alex Ghiti; +Cc: linux-riscv

Hi Alex,

On 2020-07-10 01:15, Alex Ghiti wrote:
> I have a debian kernel downloaded from here
> https://people.debian.org/~gio/dqib/ that runs using the following qemu
> command:
> 
> qemu-system-riscv64 -machine virt -cpu rv64 -m 1G -device
> virtio-blk-device,drive=hd -drive file=image.qcow2,if=none,id=hd -device
> virtio-net-device,netdev=net -netdev user,id=net,hostfwd=tcp::2222-:22 -bios
> ~/wip/lpc/buildroot/build_rv64/images/fw_jump.elf -kernel kernel -initrd
> initrd -object rng-random,filename=/dev/urandom,id=rng -device
> virtio-rng-device,rng=rng -nographic -append "root=/dev/vda1 console=ttyS0"
> 
> First is this kernel version ok to reproduce the bug ? Or should I download
> another image ? I'd like to avoid having to rebuild the kernel myself if
> possible.

Yes, that should do it; it's running kernel 5.7.6, which is recent
enough to reproduce the issue. You just need to increase the memory a
bit more (4 to 8 GB) and add more CPUs, for example with -smp 4.

> Now I would like to reproduce the bug: can you give me instructions on how
> to compile the qt package ?

The following sequence should allow you to build it:
- sudo apt-get update
- sudo apt-get install build-essential
- sudo apt-get build-dep qtbase-opensource-src
- apt-get source qtbase-opensource-src
- cd qtbase-opensource-src-5.14.2+dfsg/
- dpkg-buildpackage -B

Alternatively, I can prepare an image for you with everything ready.

> Is the page fault address always in the same area ? It might be interesting
> to find some pattern in those addresses, maybe you could also print the
> random offset to try to link both ?

It seems really random to me, with 3 outliers:
0x0000003fe7ef3140
0x0000003fcd16cff0
0x0000003fb9e96170
0x0000003fd3f4a120
0x448173f67cdbc8b0
0x0000003fdfe093f0
0x0000003fdfe093f0
0x0000003fe1d4aa70
0x0000003fc2cfef90
0x0000003fc0f5d050
0x0000003fe1d879d0
0x0000003fe9d3e990
0xf0ef4585be2ae01f
0x00000034484f71b0
0x0000003fde30e960
0x000000156888a430
0x0000003eb8560936
0x0000003fb121a490
0x0000003fb9abddd0
0x0000003fe41fc5d0

> Also print the entire virtual memory
> mapping at the time of the fault (I don't know how to do that) to check what
> the address is close to ?

Yes, I'll try to find a way to do that.

> The 0xd cause implies that the virtual address
> does not exist at all, which is weird, my guess is that the randomization
> "reveals" the bug but that the bug is still there once the randomization is
> disabled.

I have that feeling too. It could even be a userland issue, with
userland unable to cope with some memory mappings.

Aurelien

-- 
Aurelien Jarno                          GPG: 4096R/1DDD8C9B
aurelien@aurel32.net                 http://www.aurel32.net


* Re: Random, rare, but reproducible segmentation faults
  2020-07-10 19:12     ` Aurelien Jarno
@ 2020-07-10 20:37       ` Alex Ghiti
  0 siblings, 0 replies; 5+ messages in thread
From: Alex Ghiti @ 2020-07-10 20:37 UTC (permalink / raw)
  To: Aurelien Jarno; +Cc: linux-riscv

Hi Aurélien,

On 7/10/20 3:12 PM, Aurelien Jarno wrote:
> Hi Alex,
> 
> On 2020-07-10 01:15, Alex Ghiti wrote:
>> I have a debian kernel downloaded from here
>> https://people.debian.org/~gio/dqib/ that runs using the following qemu
>> command:
>>
>> qemu-system-riscv64 -machine virt -cpu rv64 -m 1G -device
>> virtio-blk-device,drive=hd -drive file=image.qcow2,if=none,id=hd -device
>> virtio-net-device,netdev=net -netdev user,id=net,hostfwd=tcp::2222-:22 -bios
>> ~/wip/lpc/buildroot/build_rv64/images/fw_jump.elf -kernel kernel -initrd
>> initrd -object rng-random,filename=/dev/urandom,id=rng -device
>> virtio-rng-device,rng=rng -nographic -append "root=/dev/vda1 console=ttyS0"
>>
>> First is this kernel version ok to reproduce the bug ? Or should I download
>> another image ? I'd like to avoid having to rebuild the kernel myself if
>> possible.
> 
> Yes, that should do it, it's running kernel 5.7.6 so enough to reproduce
> the issue. You just need to increase the memory a bit more (4 to 8GB)
> and add more CPU with for example -smp 4.

Ok thanks.

> 
>> Now I would like to reproduce the bug: can you give me instructions on how
>> to compile the qt package ?
> 
> The following sequence should allow you to build it:
> - sudo apt-get update
> - sudo apt-get install build-essential
> - sudo apt-get build-dep qtbase-opensource-src
> - apt-get source qtbase-opensource-src
> - cd qtbase-opensource-src-5.14.2+dfsg/
> - dpkg-buildpackage -B
> 
> Alternatively I can prepare you an image with everything ready.

I hope my laptop will survive that :)

> 
>> Is the page fault address always in the same area ? It might be interesting
>> to find some pattern in those addresses, maybe you could also print the
>> random offset to try to link both ?
> 
> It seems really random to me, with 3 outliers:
> 0x0000003fe7ef3140
> 0x0000003fcd16cff0
> 0x0000003fb9e96170
> 0x0000003fd3f4a120
> 0x448173f67cdbc8b0
> 0x0000003fdfe093f0
> 0x0000003fdfe093f0
> 0x0000003fe1d4aa70
> 0x0000003fc2cfef90
> 0x0000003fc0f5d050
> 0x0000003fe1d879d0
> 0x0000003fe9d3e990
> 0xf0ef4585be2ae01f
> 0x00000034484f71b0
> 0x0000003fde30e960
> 0x000000156888a430
> 0x0000003eb8560936
> 0x0000003fb121a490
> 0x0000003fb9abddd0
> 0x0000003fe41fc5d0

Indeed, that's random. And the outliers are weird; at first sight I
would say the userspace program is responsible for those, but that
deserves more thought.

> 
>> Also print the entire virtual memory
>> mapping at the time of the fault (I don't know how to do that) to check what
>> the address is close to ?
> 
> Yes, I'll try to find a way to do that.
> 

In case you want to try:

diff --git a/arch/riscv/mm/fault.c b/arch/riscv/mm/fault.c
index ae7b7fe24658..32efe9e750d6 100644
--- a/arch/riscv/mm/fault.c
+++ b/arch/riscv/mm/fault.c
@@ -13,6 +13,7 @@
 #include <linux/perf_event.h>
 #include <linux/signal.h>
 #include <linux/uaccess.h>
+#include <linux/dcache.h>
 
 #include <asm/pgalloc.h>
 #include <asm/ptrace.h>
@@ -166,6 +167,18 @@ asmlinkage void do_page_fault(struct pt_regs *regs)
 	mmap_read_unlock(mm);
 	/* User mode accesses just cause a SIGSEGV */
 	if (user_mode(regs)) {
+		char filename[512], *f;
+
+		for (vma = mm->mmap; vma; vma = vma->vm_next) {
+			f = filename;
+			if (vma->vm_file)
+				f = dentry_path_raw(vma->vm_file->f_path.dentry, filename, 512);
+			else
+				strcpy(filename, "none");
+			pr_err("%px -> %px %s\n",
+			       (void *)vma->vm_start, (void *)vma->vm_end, f);
+		}
+
 		do_trap(regs, SIGSEGV, code, addr);
 		return;
 	}

which will result in something like that:
  # segfault
[   44.297718] 0000000000010000 -> 0000000000011000 /usr/bin/segfault
[   44.298067] 0000000000011000 -> 0000000000012000 /usr/bin/segfault
[   44.298346] 0000000000012000 -> 0000000000013000 /usr/bin/segfault
[   44.298623] 0000003fc89e8000 -> 0000003fc8b26000 /lib/libc-2.29.so
[   44.298897] 0000003fc8b26000 -> 0000003fc8b27000 /lib/libc-2.29.so
[   44.299171] 0000003fc8b27000 -> 0000003fc8b2b000 /lib/libc-2.29.so
[   44.299444] 0000003fc8b2b000 -> 0000003fc8b2d000 /lib/libc-2.29.so
[   44.299770] 0000003fc8b2d000 -> 0000003fc8b33000 none
[   44.300000] 0000003fc8b33000 -> 0000003fc8b34000 none
[   44.300225] 0000003fc8b34000 -> 0000003fc8b35000 none
[   44.300454] 0000003fc8b35000 -> 0000003fc8b52000 /lib/ld-linux-riscv64-lp64.so.1
[   44.300974] 0000003fc8b52000 -> 0000003fc8b53000 /lib/ld-linux-riscv64-lp64.so.1
[   44.301323] 0000003fc8b53000 -> 0000003fc8b54000 /lib/ld-linux-riscv64-lp64.so.1
[   44.301708] 0000003fc8b54000 -> 0000003fc8b55000 none
[   44.301986] 0000003fffbe1000 -> 0000003fffc02000 none
[   44.302684] segfault[123]: unhandled signal 11 code 0x1 at 0x00000000007b3238 in segfault[10000+1000]
[   44.304009] CPU: 2 PID: 123 Comm: segfault Tainted: G      D    5.8.0-rc4 #23
[   44.304448] epc: 0000000000010450 ra : 0000003fc8a08250 sp : 0000003fffc01c30

>> The 0xd cause implies that the virtual address
>> does not exist at all, which is weird, my guess is that the randomization
>> "reveals" the bug but that the bug is still there once the randomization is
>> disabled.
> 
> I have also that feeling. It could even be a userland issue, with the
> userland not able to cope with some memory mapping.
> 
> Aurelien
> 

Alex

