* Error on loading some network Kernel modules @ 2020-01-28 19:22 Carlos Eduardo de Paula 2020-01-30 3:00 ` Paul Walmsley 0 siblings, 1 reply; 30+ messages in thread From: Carlos Eduardo de Paula @ 2020-01-28 19:22 UTC (permalink / raw) To: linux-riscv I currently run stock Kernel 5.5.0-rc7 on my Unleashed board with most network features I need baked into the kernel instead of modules. I tried building a kernel with these network features as modules but when loading some of them I got this error: root@unleashed:~# uname -a Linux unleashed 5.5.0-rc7-dirty #4 SMP Fri Jan 24 18:16:43 -02 2020 riscv64 GNU/Linux root@unleashed:~# modprobe br_netfilter [ 139.290533] br_netfilter: target ffffffe0000422d8 can not be addressed by the 32-bit offset from PC = 000000003dfd6deb root@unleashed:~# modprobe openvswitch [ 101.830567] openvswitch: target ffffffe0000422d8 can not be addressed by the 32-bit offset from PC = 000000007e236a39 modprobe: ERROR: could not insert 'openvswitch': Invalid argument The config I used to build the modules is on: https://gist.github.com/carlosedp/9f7a6ecde1dc9cebdda62b60d823b15a The config I'm using that has all features worked is on: https://github.com/carlosedp/riscv-bringup/raw/master/unleashed/unleashed_config -- ________________________________________ Carlos Eduardo de Paula me@carlosedp.com http://carlosedp.com http://twitter.com/carlosedp Linkedin ________________________________________ ^ permalink raw reply [flat|nested] 30+ messages in thread
* Re: Error on loading some network Kernel modules 2020-01-28 19:22 Error on loading some network Kernel modules Carlos Eduardo de Paula @ 2020-01-30 3:00 ` Paul Walmsley 2020-01-30 16:20 ` David Abdurachmanov 0 siblings, 1 reply; 30+ messages in thread From: Paul Walmsley @ 2020-01-30 3:00 UTC (permalink / raw) To: Carlos Eduardo de Paula Cc: vincent.chen, linux-riscv, Romain Dolbeau, Aurelien Jarno On Tue, 28 Jan 2020, Carlos Eduardo de Paula wrote: > I currently run stock Kernel 5.5.0-rc7 on my Unleashed board with most > network features I need baked into the kernel instead of modules. > > I tried building a kernel with these network features as modules but > when loading some of them I got this error: > > root@unleashed:~# uname -a > Linux unleashed 5.5.0-rc7-dirty #4 SMP Fri Jan 24 18:16:43 -02 2020 > riscv64 GNU/Linux > > root@unleashed:~# modprobe br_netfilter > [ 139.290533] br_netfilter: target ffffffe0000422d8 can not be > addressed by the 32-bit offset from PC = 000000003dfd6deb This is a known issue: https://lore.kernel.org/linux-riscv/1572281840733.3517@european-processor-initiative.eu/ https://lore.kernel.org/linux-riscv/20191029105055.GA20736@aurel32.net/#r Vincent is looking into it and I expect there will be a fix soon. - Paul ^ permalink raw reply [flat|nested] 30+ messages in thread
* Re: Error on loading some network Kernel modules 2020-01-30 3:00 ` Paul Walmsley @ 2020-01-30 16:20 ` David Abdurachmanov 2020-01-31 14:12 ` Carlos Eduardo de Paula 2020-01-31 20:11 ` Aurelien Jarno 0 siblings, 2 replies; 30+ messages in thread From: David Abdurachmanov @ 2020-01-30 16:20 UTC (permalink / raw) To: Paul Walmsley Cc: Romain Dolbeau, Vincent Chen, Carlos Eduardo de Paula, linux-riscv, Aurelien Jarno On Thu, Jan 30, 2020 at 4:00 AM Paul Walmsley <paul@pwsan.com> wrote: > > On Tue, 28 Jan 2020, Carlos Eduardo de Paula wrote: > > > I currently run stock Kernel 5.5.0-rc7 on my Unleashed board with most > > network features I need baked into the kernel instead of modules. > > > > I tried building a kernel with these network features as modules but > > when loading some of them I got this error: > > > > root@unleashed:~# uname -a > > Linux unleashed 5.5.0-rc7-dirty #4 SMP Fri Jan 24 18:16:43 -02 2020 > > riscv64 GNU/Linux > > > > root@unleashed:~# modprobe br_netfilter > > [ 139.290533] br_netfilter: target ffffffe0000422d8 can not be > > addressed by the 32-bit offset from PC = 000000003dfd6deb > > This is a known issue: > > https://lore.kernel.org/linux-riscv/1572281840733.3517@european-processor-initiative.eu/ > > https://lore.kernel.org/linux-riscv/20191029105055.GA20736@aurel32.net/#r > > Vincent is looking into it and I expect there will be a fix soon. > Is this patch solving the problem? https://github.com/bjoto/linux/commit/8a56d1c8e8e91c1bc3893946d52b9217c96e1589 It was incl. into https://github.com/bjoto/linux/commits/rv64-bpf-jit-bcc (not posted upstream). david ^ permalink raw reply [flat|nested] 30+ messages in thread
* Re: Error on loading some network Kernel modules 2020-01-30 16:20 ` David Abdurachmanov @ 2020-01-31 14:12 ` Carlos Eduardo de Paula 2020-01-31 20:11 ` Aurelien Jarno 1 sibling, 0 replies; 30+ messages in thread From: Carlos Eduardo de Paula @ 2020-01-31 14:12 UTC (permalink / raw) To: David Abdurachmanov Cc: Romain Dolbeau, Vincent Chen, Paul Walmsley, linux-riscv, Aurelien Jarno I just built v5.3-rc4 (tag I currently use) with the patch pointed by David (https://github.com/bjoto/linux/commit/8a56d1c8e8e91c1bc3893946d52b9217c96e1589) and report that it works fine. I've been able to load network modules (nf_conntrack, openvswitch and others) with no problem. Also tested with Docker that requires netfilter and others, all working: > lsmod Module Size Used by xt_nat 4371 4 veth 21849 0 vxlan 71685 0 xt_policy 4248 0 iptable_mangle 2375 0 xt_mark 1510 0 xt_u32 2272 0 xt_conntrack 4517 4 xt_MASQUERADE 1893 4 nf_conntrack_netlink 48190 0 nfnetlink 8648 2 nf_conntrack_netlink xfrm_user 39665 1 xfrm_algo 6723 1 xfrm_user xt_addrtype 4589 2 iptable_filter 2484 1 iptable_nat 2669 2 openvswitch 194024 0 nsh 3693 1 openvswitch nf_conncount 11932 1 openvswitch nf_nat 41006 4 xt_nat,openvswitch,iptable_nat,xt_MASQUERADE nf_conntrack 146281 7 xt_conntrack,nf_nat,xt_nat,openvswitch,nf_conntrack_netlink,nf_conncount,xt_MASQUERADE nf_defrag_ipv6 10727 2 nf_conntrack,openvswitch nf_defrag_ipv4 2538 1 nf_conntrack nbd 39492 2 overlay 137654 0 br_netfilter 22333 0 bridge 221254 1 br_netfilter stp 2801 1 bridge llc 6044 2 bridge,stp ip_tables 17472 3 iptable_filter,iptable_nat,iptable_mangle Carlos On Thu, Jan 30, 2020 at 1:21 PM David Abdurachmanov <david.abdurachmanov@gmail.com> wrote: > > On Thu, Jan 30, 2020 at 4:00 AM Paul Walmsley <paul@pwsan.com> wrote: > > > > On Tue, 28 Jan 2020, Carlos Eduardo de Paula wrote: > > > > > I currently run stock Kernel 5.5.0-rc7 on my Unleashed board with most > > > network features I need baked into the kernel instead of modules. 
> > > > > > I tried building a kernel with these network features as modules but > > > when loading some of them I got this error: > > > > > > root@unleashed:~# uname -a > > > Linux unleashed 5.5.0-rc7-dirty #4 SMP Fri Jan 24 18:16:43 -02 2020 > > > riscv64 GNU/Linux > > > > > > root@unleashed:~# modprobe br_netfilter > > > [ 139.290533] br_netfilter: target ffffffe0000422d8 can not be > > > addressed by the 32-bit offset from PC = 000000003dfd6deb > > > > This is a known issue: > > > > https://lore.kernel.org/linux-riscv/1572281840733.3517@european-processor-initiative.eu/ > > > > https://lore.kernel.org/linux-riscv/20191029105055.GA20736@aurel32.net/#r > > > > Vincent is looking into it and I expect there will be a fix soon. > > > > Is this patch solving the problem? > > https://github.com/bjoto/linux/commit/8a56d1c8e8e91c1bc3893946d52b9217c96e1589 > > It was incl. into > https://github.com/bjoto/linux/commits/rv64-bpf-jit-bcc (not posted > upstream). > > david -- ________________________________________ Carlos Eduardo de Paula me@carlosedp.com http://carlosedp.com http://twitter.com/carlosedp Linkedin ________________________________________ ^ permalink raw reply [flat|nested] 30+ messages in thread
* Re: Error on loading some network Kernel modules 2020-01-30 16:20 ` David Abdurachmanov 2020-01-31 14:12 ` Carlos Eduardo de Paula @ 2020-01-31 20:11 ` Aurelien Jarno 2020-02-01 7:52 ` Anup Patel 1 sibling, 1 reply; 30+ messages in thread From: Aurelien Jarno @ 2020-01-31 20:11 UTC (permalink / raw) To: David Abdurachmanov Cc: Romain Dolbeau, Vincent Chen, Paul Walmsley, Carlos Eduardo de Paula, linux-riscv On 2020-01-30 17:20, David Abdurachmanov wrote: > On Thu, Jan 30, 2020 at 4:00 AM Paul Walmsley <paul@pwsan.com> wrote: > > > > On Tue, 28 Jan 2020, Carlos Eduardo de Paula wrote: > > > > > I currently run stock Kernel 5.5.0-rc7 on my Unleashed board with most > > > network features I need baked into the kernel instead of modules. > > > > > > I tried building a kernel with these network features as modules but > > > when loading some of them I got this error: > > > > > > root@unleashed:~# uname -a > > > Linux unleashed 5.5.0-rc7-dirty #4 SMP Fri Jan 24 18:16:43 -02 2020 > > > riscv64 GNU/Linux > > > > > > root@unleashed:~# modprobe br_netfilter > > > [ 139.290533] br_netfilter: target ffffffe0000422d8 can not be > > > addressed by the 32-bit offset from PC = 000000003dfd6deb > > > > This is a known issue: > > > > https://lore.kernel.org/linux-riscv/1572281840733.3517@european-processor-initiative.eu/ > > > > https://lore.kernel.org/linux-riscv/20191029105055.GA20736@aurel32.net/#r > > > > Vincent is looking into it and I expect there will be a fix soon. > > > > Is this patch solving the problem? > > https://github.com/bjoto/linux/commit/8a56d1c8e8e91c1bc3893946d52b9217c96e1589 I have just tried that patch on top of a 5.4.13 kernel, and I confirm it fixes the issue. Thanks! Aurelien -- Aurelien Jarno GPG: 4096R/1DDD8C9B aurelien@aurel32.net http://www.aurel32.net ^ permalink raw reply [flat|nested] 30+ messages in thread
* Re: Error on loading some network Kernel modules 2020-01-31 20:11 ` Aurelien Jarno @ 2020-02-01 7:52 ` Anup Patel 2020-02-01 13:59 ` Alex Ghiti 0 siblings, 1 reply; 30+ messages in thread From: Anup Patel @ 2020-02-01 7:52 UTC (permalink / raw) To: Aurelien Jarno Cc: Carlos Eduardo de Paula, David Abdurachmanov, Paul Walmsley, Vincent Chen, Romain Dolbeau, linux-riscv On Sat, Feb 1, 2020 at 1:41 AM Aurelien Jarno <aurelien@aurel32.net> wrote: > > On 2020-01-30 17:20, David Abdurachmanov wrote: > > On Thu, Jan 30, 2020 at 4:00 AM Paul Walmsley <paul@pwsan.com> wrote: > > > > > > On Tue, 28 Jan 2020, Carlos Eduardo de Paula wrote: > > > > > > > I currently run stock Kernel 5.5.0-rc7 on my Unleashed board with most > > > > network features I need baked into the kernel instead of modules. > > > > > > > > I tried building a kernel with these network features as modules but > > > > when loading some of them I got this error: > > > > > > > > root@unleashed:~# uname -a > > > > Linux unleashed 5.5.0-rc7-dirty #4 SMP Fri Jan 24 18:16:43 -02 2020 > > > > riscv64 GNU/Linux > > > > > > > > root@unleashed:~# modprobe br_netfilter > > > > [ 139.290533] br_netfilter: target ffffffe0000422d8 can not be > > > > addressed by the 32-bit offset from PC = 000000003dfd6deb > > > > > > This is a known issue: > > > > > > https://lore.kernel.org/linux-riscv/1572281840733.3517@european-processor-initiative.eu/ > > > > > > https://lore.kernel.org/linux-riscv/20191029105055.GA20736@aurel32.net/#r > > > > > > Vincent is looking into it and I expect there will be a fix soon. > > > > > > > Is this patch solving the problem? > > > > https://github.com/bjoto/linux/commit/8a56d1c8e8e91c1bc3893946d52b9217c96e1589 > > I have just tried that patch on top of a 5.4.13 kernel, and I confirm it > fixes the issue. Thanks! I guess this patch was lost over time. First of all we need to rebase this patch on latest kernel. 
The MODULES_xyz defines are totally redundant and create confusion because with these defines it seems we are creating a separate virtual memory region for modules, which is not true. In fact, the patch only tries to ensure that modules are allocated from the end of the VMALLOC region. I suggest dropping all MODULES_xyz defines and implementing module_alloc() as follows: void *module_alloc(unsigned long size) { return __vmalloc_node_range(size, 1, VMALLOC_END - SZ_128M, VMALLOC_END, GFP_KERNEL, PAGE_KERNEL_EXEC, 0, NUMA_NO_NODE, __builtin_return_address(0)); } Maybe we can even have a Kconfig option for MODULES_SIZE which can be used in place of SZ_128M in the above code. Should I send a patch against the latest kernel? Regards, Anup ^ permalink raw reply [flat|nested] 30+ messages in thread
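[Editorial note: Anup's closing suggestion of a configurable MODULES_SIZE could look something like the following Kconfig fragment — purely a hypothetical sketch; the symbol name, default, and help text are invented here and do not come from any posted patch.]

```kconfig
# Hypothetical: make the module allocation window size configurable,
# replacing the hard-coded SZ_128M in module_alloc() above.
config MODULES_REGION_SIZE
	hex "Size of the module allocation region below VMALLOC_END"
	default 0x8000000
	help
	  Modules are allocated from the range
	  [VMALLOC_END - MODULES_REGION_SIZE, VMALLOC_END] so that
	  module text stays within PC-relative (auipc) reach of the
	  kernel image mapped just above the vmalloc area.
```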
* Re: Error on loading some network Kernel modules 2020-02-01 7:52 ` Anup Patel @ 2020-02-01 13:59 ` Alex Ghiti 2020-02-02 14:10 ` Anup Patel 2020-02-02 16:27 ` Romain Dolbeau 0 siblings, 2 replies; 30+ messages in thread From: Alex Ghiti @ 2020-02-01 13:59 UTC (permalink / raw) To: Anup Patel, Aurelien Jarno Cc: Carlos Eduardo de Paula, David Abdurachmanov, Paul Walmsley, Vincent Chen, Romain Dolbeau, linux-riscv On 2/1/20 2:52 AM, Anup Patel wrote: > On Sat, Feb 1, 2020 at 1:41 AM Aurelien Jarno <aurelien@aurel32.net> wrote: >> On 2020-01-30 17:20, David Abdurachmanov wrote: >>> On Thu, Jan 30, 2020 at 4:00 AM Paul Walmsley <paul@pwsan.com> wrote: >>>> On Tue, 28 Jan 2020, Carlos Eduardo de Paula wrote: >>>> >>>>> I currently run stock Kernel 5.5.0-rc7 on my Unleashed board with most >>>>> network features I need baked into the kernel instead of modules. >>>>> >>>>> I tried building a kernel with these network features as modules but >>>>> when loading some of them I got this error: >>>>> >>>>> root@unleashed:~# uname -a >>>>> Linux unleashed 5.5.0-rc7-dirty #4 SMP Fri Jan 24 18:16:43 -02 2020 >>>>> riscv64 GNU/Linux >>>>> >>>>> root@unleashed:~# modprobe br_netfilter >>>>> [ 139.290533] br_netfilter: target ffffffe0000422d8 can not be >>>>> addressed by the 32-bit offset from PC = 000000003dfd6deb >>>> This is a known issue: >>>> >>>> https://lore.kernel.org/linux-riscv/1572281840733.3517@european-processor-initiative.eu/ >>>> >>>> https://lore.kernel.org/linux-riscv/20191029105055.GA20736@aurel32.net/#r >>>> >>>> Vincent is looking into it and I expect there will be a fix soon. >>>> >>> Is this patch solving the problem? >>> >>> https://github.com/bjoto/linux/commit/8a56d1c8e8e91c1bc3893946d52b9217c96e1589 >> I have just tried that patch on top of a 5.4.13 kernel, and I confirm it >> fixes the issue. Thanks! > I guess this patch was lost over time. > > First of all we need to rebase this patch on latest kernel. 
> > The MODULES_xyz defines are totally redundant and create > confusion because with these defines it seems we are creating > separate virtual memory region for modules which is not true. > In fact, the patch only tries to ensure that modules are allocated > from the end-of VMALLOC region. > > I suggest to drop all MODULES_xyz defines and implement > module_alloc() as follows: > > void *module_alloc(unsigned long size) > { > return __vmalloc_node_range(size, 1, VMALLOC_END - SZ_128M, > VMALLOC_END, GFP_KERNEL, > PAGE_KERNEL_EXEC, 0, > NUMA_NO_NODE, > __builtin_return_address(0)); > } Why restrict to 128M whereas we have 2GB offset available to the end of the kernel ? I think the cleanest solution is to use the following range [&_end - 2 * SZ_1G; VMALLOC_END]. Alex > Maybe we can even have a kconfig options for MODULES_SIZE > which can be used in-place of SZ_128M in above code. > > Should I send a patch on latest kernel ?? > > Regards, > Anup > ^ permalink raw reply [flat|nested] 30+ messages in thread
* Re: Error on loading some network Kernel modules 2020-02-01 13:59 ` Alex Ghiti @ 2020-02-02 14:10 ` Anup Patel 2020-02-02 15:21 ` Troy Benjegerdes 2020-02-02 16:27 ` Romain Dolbeau 1 sibling, 1 reply; 30+ messages in thread From: Anup Patel @ 2020-02-02 14:10 UTC (permalink / raw) To: Alex Ghiti Cc: Carlos Eduardo de Paula, David Abdurachmanov, Paul Walmsley, Vincent Chen, Romain Dolbeau, linux-riscv, Aurelien Jarno On Sat, Feb 1, 2020 at 7:30 PM Alex Ghiti <alex@ghiti.fr> wrote: > > On 2/1/20 2:52 AM, Anup Patel wrote: > > On Sat, Feb 1, 2020 at 1:41 AM Aurelien Jarno <aurelien@aurel32.net> wrote: > >> On 2020-01-30 17:20, David Abdurachmanov wrote: > >>> On Thu, Jan 30, 2020 at 4:00 AM Paul Walmsley <paul@pwsan.com> wrote: > >>>> On Tue, 28 Jan 2020, Carlos Eduardo de Paula wrote: > >>>> > >>>>> I currently run stock Kernel 5.5.0-rc7 on my Unleashed board with most > >>>>> network features I need baked into the kernel instead of modules. > >>>>> > >>>>> I tried building a kernel with these network features as modules but > >>>>> when loading some of them I got this error: > >>>>> > >>>>> root@unleashed:~# uname -a > >>>>> Linux unleashed 5.5.0-rc7-dirty #4 SMP Fri Jan 24 18:16:43 -02 2020 > >>>>> riscv64 GNU/Linux > >>>>> > >>>>> root@unleashed:~# modprobe br_netfilter > >>>>> [ 139.290533] br_netfilter: target ffffffe0000422d8 can not be > >>>>> addressed by the 32-bit offset from PC = 000000003dfd6deb > >>>> This is a known issue: > >>>> > >>>> https://lore.kernel.org/linux-riscv/1572281840733.3517@european-processor-initiative.eu/ > >>>> > >>>> https://lore.kernel.org/linux-riscv/20191029105055.GA20736@aurel32.net/#r > >>>> > >>>> Vincent is looking into it and I expect there will be a fix soon. > >>>> > >>> Is this patch solving the problem? > >>> > >>> https://github.com/bjoto/linux/commit/8a56d1c8e8e91c1bc3893946d52b9217c96e1589 > >> I have just tried that patch on top of a 5.4.13 kernel, and I confirm it > >> fixes the issue. Thanks! 
> > I guess this patch was lost over time. > > > > First of all we need to rebase this patch on latest kernel. > > > > The MODULES_xyz defines are totally redundant and create > > confusion because with these defines it seems we are creating > > separate virtual memory region for modules which is not true. > > In fact, the patch only tries to ensure that modules are allocated > > from the end-of VMALLOC region. > > > > I suggest to drop all MODULES_xyz defines and implement > > module_alloc() as follows: > > > > void *module_alloc(unsigned long size) > > { > > return __vmalloc_node_range(size, 1, VMALLOC_END - SZ_128M, > > VMALLOC_END, GFP_KERNEL, > > PAGE_KERNEL_EXEC, 0, > > NUMA_NO_NODE, > > __builtin_return_address(0)); > > } > > > Why restrict to 128M whereas we have 2GB offset available to the end of > the kernel ? > > I think the cleanest solution is to use the following range [&_end - 2 * > SZ_1G; VMALLOC_END]. I agree we should not restrict to just SZ_128M. You can go ahead and send a patch for this (if you want). Regards, Anup ^ permalink raw reply [flat|nested] 30+ messages in thread
* Re: Error on loading some network Kernel modules 2020-02-02 14:10 ` Anup Patel @ 2020-02-02 15:21 ` Troy Benjegerdes 0 siblings, 0 replies; 30+ messages in thread From: Troy Benjegerdes @ 2020-02-02 15:21 UTC (permalink / raw) To: Anup Patel, Paul Walmsley, David Abdurachmanov Cc: Carlos Eduardo de Paula, Alex Ghiti, Vincent Chen, Romain Dolbeau, linux-riscv, Aurelien Jarno Do we have a good place to put a community-supported ‘riscv-stable’ tree somewhere that we can put stuff like this we need for basic functionality so we don’t lose it again? I also feel like there’s a need to backport some patches like this to a -stable/lts kernel series, and 4.19 seems a little old to keep maintaining.. > On Feb 2, 2020, at 8:10 AM, Anup Patel <anup@brainfault.org> wrote: > > On Sat, Feb 1, 2020 at 7:30 PM Alex Ghiti <alex@ghiti.fr> wrote: >> >> On 2/1/20 2:52 AM, Anup Patel wrote: >>> On Sat, Feb 1, 2020 at 1:41 AM Aurelien Jarno <aurelien@aurel32.net> wrote: >>>> On 2020-01-30 17:20, David Abdurachmanov wrote: >>>>> On Thu, Jan 30, 2020 at 4:00 AM Paul Walmsley <paul@pwsan.com> wrote: >>>>>> On Tue, 28 Jan 2020, Carlos Eduardo de Paula wrote: >>>>>> >>>>>>> I currently run stock Kernel 5.5.0-rc7 on my Unleashed board with most >>>>>>> network features I need baked into the kernel instead of modules. 
>>>>>>> >>>>>>> I tried building a kernel with these network features as modules but >>>>>>> when loading some of them I got this error: >>>>>>> >>>>>>> root@unleashed:~# uname -a >>>>>>> Linux unleashed 5.5.0-rc7-dirty #4 SMP Fri Jan 24 18:16:43 -02 2020 >>>>>>> riscv64 GNU/Linux >>>>>>> >>>>>>> root@unleashed:~# modprobe br_netfilter >>>>>>> [ 139.290533] br_netfilter: target ffffffe0000422d8 can not be >>>>>>> addressed by the 32-bit offset from PC = 000000003dfd6deb >>>>>> This is a known issue: >>>>>> >>>>>> https://lore.kernel.org/linux-riscv/1572281840733.3517@european-processor-initiative.eu/ >>>>>> >>>>>> https://lore.kernel.org/linux-riscv/20191029105055.GA20736@aurel32.net/#r >>>>>> >>>>>> Vincent is looking into it and I expect there will be a fix soon. >>>>>> >>>>> Is this patch solving the problem? >>>>> >>>>> https://github.com/bjoto/linux/commit/8a56d1c8e8e91c1bc3893946d52b9217c96e1589 >>>> I have just tried that patch on top of a 5.4.13 kernel, and I confirm it >>>> fixes the issue. Thanks! >>> I guess this patch was lost over time. >>> >>> First of all we need to rebase this patch on latest kernel. >>> >>> The MODULES_xyz defines are totally redundant and create >>> confusion because with these defines it seems we are creating >>> separate virtual memory region for modules which is not true. >>> In fact, the patch only tries to ensure that modules are allocated >>> from the end-of VMALLOC region. >>> >>> I suggest to drop all MODULES_xyz defines and implement >>> module_alloc() as follows: >>> >>> void *module_alloc(unsigned long size) >>> { >>> return __vmalloc_node_range(size, 1, VMALLOC_END - SZ_128M, >>> VMALLOC_END, GFP_KERNEL, >>> PAGE_KERNEL_EXEC, 0, >>> NUMA_NO_NODE, >>> __builtin_return_address(0)); >>> } >> >> >> Why restrict to 128M whereas we have 2GB offset available to the end of >> the kernel ? >> >> I think the cleanest solution is to use the following range [&_end - 2 * >> SZ_1G; VMALLOC_END]. 
> > I agree we should not restrict to just SZ_128M. > > You can go ahead and send a patch this (if you want). > > Regards, > Anup ^ permalink raw reply [flat|nested] 30+ messages in thread
* Re: Error on loading some network Kernel modules 2020-02-01 13:59 ` Alex Ghiti 2020-02-02 14:10 ` Anup Patel @ 2020-02-02 16:27 ` Romain Dolbeau 2020-02-03 10:04 ` Vincent Chen 2020-02-03 20:57 ` Alex Ghiti 1 sibling, 2 replies; 30+ messages in thread From: Romain Dolbeau @ 2020-02-02 16:27 UTC (permalink / raw) To: Alex Ghiti, Anup Patel, Aurelien Jarno Cc: Vincent Chen, Carlos Eduardo de Paula, Paul Walmsley, linux-riscv, David Abdurachmanov On 2020-02-01 14:59, Alex Ghiti wrote: > Why restrict to 128M whereas we have 2GB offset available to the end of > the kernel ? Isn't that 2 GiB offset to whatever the module requires in the kernel, rather than to the end of the kernel space? Is there some guarantee that symbols accessible by modules are at the end of the kernel? If so, wouldn't the maximum offset for this patch still be (2 GiB - <total size of accessible symbols>)? Cordially, -- Romain Dolbeau ^ permalink raw reply [flat|nested] 30+ messages in thread
* Re: Error on loading some network Kernel modules 2020-02-02 16:27 ` Romain Dolbeau @ 2020-02-03 10:04 ` Vincent Chen 2020-02-04 3:55 ` Zong Li 2020-02-03 20:57 ` Alex Ghiti 1 sibling, 1 reply; 30+ messages in thread From: Vincent Chen @ 2020-02-03 10:04 UTC (permalink / raw) To: Romain Dolbeau Cc: Carlos Eduardo de Paula, Alex Ghiti, David Abdurachmanov, Anup Patel, Paul Walmsley, linux-riscv, Aurelien Jarno On Mon, Feb 3, 2020 at 12:27 AM Romain Dolbeau <romain.dolbeau@european-processor-initiative.eu> wrote: > > On 2020-02-01 14:59, Alex Ghiti wrote: > > Why restrict to 128M whereas we have 2GB offset available to the end of > > the kernel ? > > Isn't that 2 GiB offset to whatever the module requires in the kernel, > rather than to the end of the kernel space? > > Is there some guarantee that symbols accessible by modules are at the > end of the kernel? If so, wouldn't the maximum offset for this patch > still be (2 GiB - <total size of accessible symbols>)? > > Cordially, > > -- > Romain Dolbeau It took me some time to find the root cause of this problem, please allow me to share some observations before the discussion. The root cause of this issue is that the percpu data is declared as a static variable. The "static" attribute will make the compiler think that this symbol is close to the .text section at runtime. Hence, the compiler uses "auipc" to access this percpu data instead of using GOT. In this case, the access range is limited to + -2G. However, in practice, these percpu data are placed at a pre-allocated region created by the memblock_allocate() function. In other words, the distance between the pre-allocated region (>PAGE_OFFSET ) and the .text section of the kernel module (in VMALLOC region) is much larger than 2G. I agree that the original patch, https://github.com/bjoto/linux/commit/8a56d1c8e8e91c1bc3893946d52b9217c96e1589, can solve most cases. 
However, I do not think the patch will still work if the kernel module region is determined by _end or <total size of accessible symbols>. The reason is that the pre-allocated region for module percpu data comes from the memblock function at runtime. Hence, we cannot know the actual address of this region at compile time, and this issue may occur again in this case. By the way, I think maybe we can refer to the implementation of MIPS. 1. For general cases, we can use this patch to solve this issue. 2. For a large kernel image (>2G) or enabling the KASLR feature, we may need a new code model to deal with this issue. ^ permalink raw reply [flat|nested] 30+ messages in thread
* Re: Error on loading some network Kernel modules 2020-02-03 10:04 ` Vincent Chen @ 2020-02-04 3:55 ` Zong Li 2020-02-04 6:50 ` Alex Ghiti 0 siblings, 1 reply; 30+ messages in thread From: Zong Li @ 2020-02-04 3:55 UTC (permalink / raw) To: Vincent Chen Cc: Paul Walmsley, Alex Ghiti, David Abdurachmanov, Anup Patel, Carlos Eduardo de Paula, Romain Dolbeau, Zong Li, linux-riscv, Aurelien Jarno Vincent Chen <vincent.chen@sifive.com> 於 2020年2月3日 週一 下午6:04寫道: > > On Mon, Feb 3, 2020 at 12:27 AM Romain Dolbeau > <romain.dolbeau@european-processor-initiative.eu> wrote: > > > > On 2020-02-01 14:59, Alex Ghiti wrote: > > > Why restrict to 128M whereas we have 2GB offset available to the end of > > > the kernel ? > > > > Isn't that 2 GiB offset to whatever the module requires in the kernel, > > rather than to the end of the kernel space? > > > > Is there some guarantee that symbols accessible by modules are at the > > end of the kernel? If so, wouldn't the maximum offset for this patch > > still be (2 GiB - <total size of accessible symbols>)? > > > > Cordially, > > > > -- > > Romain Dolbeau > > It took me some time to find the root cause of this problem, please > allow me to share some observations before the discussion. > The root cause of this issue is that the percpu data is declared as a > static variable. The "static" attribute will make the compiler think > that this symbol is close to the .text section at runtime. Hence, the > compiler uses "auipc" to access this percpu data instead of using GOT. > In this case, the access range is limited to + -2G. However, in > practice, these percpu data are placed at a pre-allocated region > created by the memblock_allocate() function. In other words, the > distance between the pre-allocated region (>PAGE_OFFSET ) and the > .text section of the kernel module (in VMALLOC region) is much larger > than 2G. 
> I agree that the original patch, > https://github.com/bjoto/linux/commit/8a56d1c8e8e91c1bc3893946d52b9217c96e1589, > can solve most cases. However, I do not think the patch still works if > the kernel module region is determined by _end or <total size of > accessible symbols>. The reason is that the pre-allocated region for > module percpu data comes from the memblock function at runtime. Hence, > we cannot know the actual address of this region at compile-time, and > this issue probably may occur again in this case. > > By the way, I think maybe we can refer to the implementation of MIPS. > 1. For general cases, we can use this patch to solve this issue. > 2. For a large kernel image (>2G) or enabling the KASLR feature, we > may need a new code mode to deal with this issue. > The range [&_end - 2G, VMALLOC_END] won't be suitable when KASLR is enabled. Unless we limit the randomized range to a small area, the module region start address will be bigger than VMALLOC_END. So if we carve out a region for module use, we also have to provide module randomization at the same time when KASLR is enabled. It would be fine by me if there were a new code model that uses the GOT for static variables; then everything would be easy. ^ permalink raw reply [flat|nested] 30+ messages in thread
* Re: Error on loading some network Kernel modules 2020-02-04 3:55 ` Zong Li @ 2020-02-04 6:50 ` Alex Ghiti 2020-02-04 7:19 ` Zong Li 0 siblings, 1 reply; 30+ messages in thread From: Alex Ghiti @ 2020-02-04 6:50 UTC (permalink / raw) To: Zong Li, Vincent Chen Cc: Paul Walmsley, David Abdurachmanov, Anup Patel, Carlos Eduardo de Paula, Romain Dolbeau, Zong Li, linux-riscv, Aurelien Jarno Hi Zong, On 2/3/20 10:55 PM, Zong Li wrote: > Vincent Chen <vincent.chen@sifive.com> 於 2020年2月3日 週一 下午6:04寫道: >> On Mon, Feb 3, 2020 at 12:27 AM Romain Dolbeau >> <romain.dolbeau@european-processor-initiative.eu> wrote: >>> On 2020-02-01 14:59, Alex Ghiti wrote: >>>> Why restrict to 128M whereas we have 2GB offset available to the end of >>>> the kernel ? >>> Isn't that 2 GiB offset to whatever the module requires in the kernel, >>> rather than to the end of the kernel space? >>> >>> Is there some guarantee that symbols accessible by modules are at the >>> end of the kernel? If so, wouldn't the maximum offset for this patch >>> still be (2 GiB - <total size of accessible symbols>)? >>> >>> Cordially, >>> >>> -- >>> Romain Dolbeau >> It took me some time to find the root cause of this problem, please >> allow me to share some observations before the discussion. >> The root cause of this issue is that the percpu data is declared as a >> static variable. The "static" attribute will make the compiler think >> that this symbol is close to the .text section at runtime. Hence, the >> compiler uses "auipc" to access this percpu data instead of using GOT. >> In this case, the access range is limited to + -2G. However, in >> practice, these percpu data are placed at a pre-allocated region >> created by the memblock_allocate() function. In other words, the >> distance between the pre-allocated region (>PAGE_OFFSET ) and the >> .text section of the kernel module (in VMALLOC region) is much larger >> than 2G. 
>> I agree that the original patch, >> https://github.com/bjoto/linux/commit/8a56d1c8e8e91c1bc3893946d52b9217c96e1589, >> can solve most cases. However, I do not think the patch still works if >> the kernel module region is determined by _end or <total size of >> accessible symbols>. The reason is that the pre-allocated region for >> module percpu data comes from the memblock function at runtime. Hence, >> we cannot know the actual address of this region at compile-time, and >> this issue probably may occur again in this case. >> >> By the way, I think maybe we can refer to the implementation of MIPS. >> 1. For general cases, we can use this patch to solve this issue. >> 2. For a large kernel image (>2G) or enabling the KASLR feature, we >> may need a new code mode to deal with this issue. >> > The range [&_end - 2G, VMALLOC_END] won't be suitable when KASLR is > enabled. Unless we limit the randomized range in small area, the > module region start address will be bigger than VMALLOC_END. Actually, the relocatable patch I proposed already moves "everything" up at the same time: the kernel itself but also all the "zones" below (vmalloc, vmemmap, fixup...etc) since all those zones are defined from PAGE_OFFSET that is now dynamic. So the modules will remain at the same offset to the kernel, unless explicitly randomized in the vmalloc zone. > So if we > divide a region into module use, we also have to provide module > randomization at the same time when KASLR is enabled. It good to me if > there is a new code model to use GOT for static variable, everything > will be easy. GOT usage seems indeed to be easier, but it seems a bit overkill to me: arm64 already uses the range I proposed for modules (arch/arm64/kernel/kaslr.c, module_alloc_base definition). Anyway, even if GOT is chosen, it won't be anytime soon, right?
So we could, for the moment, constrain the modules to be close to the kernel with any range you like and that will still work with our implementation of KASLR. It seems urgent to fix those module loading issues for 5.6. Alex ^ permalink raw reply [flat|nested] 30+ messages in thread
* Re: Error on loading some network Kernel modules 2020-02-04 6:50 ` Alex Ghiti @ 2020-02-04 7:19 ` Zong Li 2020-02-04 9:32 ` Alexandre Ghiti 0 siblings, 1 reply; 30+ messages in thread From: Zong Li @ 2020-02-04 7:19 UTC (permalink / raw) To: Alex Ghiti Cc: Paul Walmsley, David Abdurachmanov, Anup Patel, Carlos Eduardo de Paula, Vincent Chen, Romain Dolbeau, Zong Li, linux-riscv, Aurelien Jarno Alex Ghiti <alex@ghiti.fr> 於 2020年2月4日 週二 下午2:50寫道: > > Hi Zong, > > On 2/3/20 10:55 PM, Zong Li wrote: > > Vincent Chen <vincent.chen@sifive.com> 於 2020年2月3日 週一 下午6:04寫道: > >> On Mon, Feb 3, 2020 at 12:27 AM Romain Dolbeau > >> <romain.dolbeau@european-processor-initiative.eu> wrote: > >>> On 2020-02-01 14:59, Alex Ghiti wrote: > >>>> Why restrict to 128M whereas we have 2GB offset available to the end of > >>>> the kernel ? > >>> Isn't that 2 GiB offset to whatever the module requires in the kernel, > >>> rather than to the end of the kernel space? > >>> > >>> Is there some guarantee that symbols accessible by modules are at the > >>> end of the kernel? If so, wouldn't the maximum offset for this patch > >>> still be (2 GiB - <total size of accessible symbols>)? > >>> > >>> Cordially, > >>> > >>> -- > >>> Romain Dolbeau > >> It took me some time to find the root cause of this problem, please > >> allow me to share some observations before the discussion. > >> The root cause of this issue is that the percpu data is declared as a > >> static variable. The "static" attribute will make the compiler think > >> that this symbol is close to the .text section at runtime. Hence, the > >> compiler uses "auipc" to access this percpu data instead of using GOT. > >> In this case, the access range is limited to + -2G. However, in > >> practice, these percpu data are placed at a pre-allocated region > >> created by the memblock_allocate() function. 
In other words, the > >> distance between the pre-allocated region (>PAGE_OFFSET ) and the > >> .text section of the kernel module (in VMALLOC region) is much larger > >> than 2G. > >> I agree that the original patch, > >> https://github.com/bjoto/linux/commit/8a56d1c8e8e91c1bc3893946d52b9217c96e1589, > >> can solve most cases. However, I do not think the patch still works if > >> the kernel module region is determined by _end or <total size of > >> accessible symbols>. The reason is that the pre-allocated region for > >> module percpu data comes from the memblock function at runtime. Hence, > >> we cannot know the actual address of this region at compile-time, and > >> this issue probably may occur again in this case. > >> > >> By the way, I think maybe we can refer to the implementation of MIPS. > >> 1. For general cases, we can use this patch to solve this issue. > >> 2. For a large kernel image (>2G) or enabling the KASLR feature, we > >> may need a new code mode to deal with this issue. > >> > > The range [&_end - 2G, VMALLOC_END] won't be suitable when KASLR is > > enabled. Unless we limit the randomized range in small area, the > > module region start address will be bigger than VMALLOC_END. > > > Actually, the relocatable patch I proposed already moves "everything" up > at the same > time: the kernel itself but also all the "zones" below (vmalloc, > vmemmap, fixup...etc) > since all those zones are defined from PAGE_OFFSET that is now dynamic. > So the modules > ill remain at the same offset to the kernel, unless explicitly > randomized in the vmalloc zone. > OK, it makes sense. Having the module region move along with the kernel avoids the concern I mentioned. So now the problem is that the pre-allocated region of percpu data is located after the _end symbol, as Vincent mentioned; a module region starting at &_end - 2G may then be more than 2G away from it.
> > > So if we > > divide a region into module use, we also have to provide module > > randomization at the same time when KASLR is enabled. It good to me if > > there is a new code model to use GOT for static variable, everything > > will be easy. > > > GOT usage seems indeed to be easier, but it seems a bit overkill to me: > arm64 already > uses the range I proposed for modules (arch/arm64/kernel/kaslr.c, > module_alloc_base > definition). > > Anyway, even if GOT is chosen, it won't be anytime soon right ? So we > could, for the moment, > constraint the modules to be close to the kernel with any range you like > and that will still > work with our implementation of KASLR. Yes, I'm with you on that, we need a resolution in kernel first. > > It seems urgent to fix those modules loading issues for 5.6. > > Alex > ^ permalink raw reply [flat|nested] 30+ messages in thread
* Re: Error on loading some network Kernel modules 2020-02-04 7:19 ` Zong Li @ 2020-02-04 9:32 ` Alexandre Ghiti 2020-02-04 10:46 ` Vincent Chen 0 siblings, 1 reply; 30+ messages in thread From: Alexandre Ghiti @ 2020-02-04 9:32 UTC (permalink / raw) To: Zong Li Cc: Paul Walmsley, David Abdurachmanov, Anup Patel, Carlos Eduardo de Paula, Vincent Chen, Romain Dolbeau, Zong Li, linux-riscv, Aurelien Jarno On 2/4/20 8:19 AM, Zong Li wrote: > Alex Ghiti <alex@ghiti.fr> 於 2020年2月4日 週二 下午2:50寫道: >> Hi Zong, >> >> On 2/3/20 10:55 PM, Zong Li wrote: >>> Vincent Chen <vincent.chen@sifive.com> 於 2020年2月3日 週一 下午6:04寫道: >>>> On Mon, Feb 3, 2020 at 12:27 AM Romain Dolbeau >>>> <romain.dolbeau@european-processor-initiative.eu> wrote: >>>>> On 2020-02-01 14:59, Alex Ghiti wrote: >>>>>> Why restrict to 128M whereas we have 2GB offset available to the end of >>>>>> the kernel ? >>>>> Isn't that 2 GiB offset to whatever the module requires in the kernel, >>>>> rather than to the end of the kernel space? >>>>> >>>>> Is there some guarantee that symbols accessible by modules are at the >>>>> end of the kernel? If so, wouldn't the maximum offset for this patch >>>>> still be (2 GiB - <total size of accessible symbols>)? >>>>> >>>>> Cordially, >>>>> >>>>> -- >>>>> Romain Dolbeau >>>> It took me some time to find the root cause of this problem, please >>>> allow me to share some observations before the discussion. >>>> The root cause of this issue is that the percpu data is declared as a >>>> static variable. The "static" attribute will make the compiler think >>>> that this symbol is close to the .text section at runtime. Hence, the >>>> compiler uses "auipc" to access this percpu data instead of using GOT. >>>> In this case, the access range is limited to + -2G. However, in >>>> practice, these percpu data are placed at a pre-allocated region >>>> created by the memblock_allocate() function. 
In other words, the >>>> distance between the pre-allocated region (>PAGE_OFFSET ) and the >>>> .text section of the kernel module (in VMALLOC region) is much larger >>>> than 2G. >>>> I agree that the original patch, >>>> https://github.com/bjoto/linux/commit/8a56d1c8e8e91c1bc3893946d52b9217c96e1589, >>>> can solve most cases. However, I do not think the patch still works if >>>> the kernel module region is determined by _end or <total size of >>>> accessible symbols>. The reason is that the pre-allocated region for >>>> module percpu data comes from the memblock function at runtime. Hence, >>>> we cannot know the actual address of this region at compile-time, and >>>> this issue probably may occur again in this case. >>>> >>>> By the way, I think maybe we can refer to the implementation of MIPS. >>>> 1. For general cases, we can use this patch to solve this issue. >>>> 2. For a large kernel image (>2G) or enabling the KASLR feature, we >>>> may need a new code mode to deal with this issue. >>>> >>> The range [&_end - 2G, VMALLOC_END] won't be suitable when KASLR is >>> enabled. Unless we limit the randomized range in small area, the >>> module region start address will be bigger than VMALLOC_END. >> >> Actually, the relocatable patch I proposed already moves "everything" up >> at the same >> time: the kernel itself but also all the "zones" below (vmalloc, >> vmemmap, fixup...etc) >> since all those zones are defined from PAGE_OFFSET that is now dynamic. >> So the modules >> ill remain at the same offset to the kernel, unless explicitly >> randomized in the vmalloc zone. >> > OK, it makes sense. The module region moves along with kernel seems to > stay away from the concern I mentioned. > > So now, the problem is that the pre-allocated region of percpu data is > located after _end symbol as Vincent mentioned, the 2G distance seems > to be too far for module region start address. (i.e. &_end - 2G). 
Actually, I don't understand this issue: we are limited to addressing symbols within +/- 2GB from PC. So as long as the percpu symbol exists in the kernel, it is below _end and then we don't care that its content is initialized dynamically, as long as we can 'compute' its address from PC, right ? Can you point out where I can take a look at this pre-allocated region for percpu data please ? Thanks, Alex > >>> So if we >>> divide a region into module use, we also have to provide module >>> randomization at the same time when KASLR is enabled. It good to me if >>> there is a new code model to use GOT for static variable, everything >>> will be easy. >> >> GOT usage seems indeed to be easier, but it seems a bit overkill to me: >> arm64 already >> uses the range I proposed for modules (arch/arm64/kernel/kaslr.c, >> module_alloc_base >> definition). >> >> Anyway, even if GOT is chosen, it won't be anytime soon right ? So we >> could, for the moment, >> constraint the modules to be close to the kernel with any range you like >> and that will still >> work with our implementation of KASLR. > Yes, I'm with you on that, we need a resolution in kernel first. > >> It seems urgent to fix those modules loading issues for 5.6. >> >> Alex >> ^ permalink raw reply [flat|nested] 30+ messages in thread
* Re: Error on loading some network Kernel modules 2020-02-04 9:32 ` Alexandre Ghiti @ 2020-02-04 10:46 ` Vincent Chen 2020-02-04 11:30 ` Anup Patel 0 siblings, 1 reply; 30+ messages in thread From: Vincent Chen @ 2020-02-04 10:46 UTC (permalink / raw) To: Alexandre Ghiti Cc: Carlos Eduardo de Paula, David Abdurachmanov, Anup Patel, Paul Walmsley, Zong Li, Romain Dolbeau, Zong Li, linux-riscv, Aurelien Jarno On Tue, Feb 4, 2020 at 5:32 PM Alexandre Ghiti <alex@ghiti.fr> wrote: > > > On 2/4/20 8:19 AM, Zong Li wrote: > > Alex Ghiti <alex@ghiti.fr> 於 2020年2月4日 週二 下午2:50寫道: > >> Hi Zong, > >> > >> On 2/3/20 10:55 PM, Zong Li wrote: > >>> Vincent Chen <vincent.chen@sifive.com> 於 2020年2月3日 週一 下午6:04寫道: > >>>> On Mon, Feb 3, 2020 at 12:27 AM Romain Dolbeau > >>>> <romain.dolbeau@european-processor-initiative.eu> wrote: > >>>>> On 2020-02-01 14:59, Alex Ghiti wrote: > >>>>>> Why restrict to 128M whereas we have 2GB offset available to the end of > >>>>>> the kernel ? > >>>>> Isn't that 2 GiB offset to whatever the module requires in the kernel, > >>>>> rather than to the end of the kernel space? > >>>>> > >>>>> Is there some guarantee that symbols accessible by modules are at the > >>>>> end of the kernel? If so, wouldn't the maximum offset for this patch > >>>>> still be (2 GiB - <total size of accessible symbols>)? > >>>>> > >>>>> Cordially, > >>>>> > >>>>> -- > >>>>> Romain Dolbeau > >>>> It took me some time to find the root cause of this problem, please > >>>> allow me to share some observations before the discussion. > >>>> The root cause of this issue is that the percpu data is declared as a > >>>> static variable. The "static" attribute will make the compiler think > >>>> that this symbol is close to the .text section at runtime. Hence, the > >>>> compiler uses "auipc" to access this percpu data instead of using GOT. > >>>> In this case, the access range is limited to + -2G. 
However, in > >>>> practice, these percpu data are placed at a pre-allocated region > >>>> created by the memblock_allocate() function. In other words, the > >>>> distance between the pre-allocated region (>PAGE_OFFSET ) and the > >>>> .text section of the kernel module (in VMALLOC region) is much larger > >>>> than 2G. > >>>> I agree that the original patch, > >>>> https://github.com/bjoto/linux/commit/8a56d1c8e8e91c1bc3893946d52b9217c96e1589, > >>>> can solve most cases. However, I do not think the patch still works if > >>>> the kernel module region is determined by _end or <total size of > >>>> accessible symbols>. The reason is that the pre-allocated region for > >>>> module percpu data comes from the memblock function at runtime. Hence, > >>>> we cannot know the actual address of this region at compile-time, and > >>>> this issue probably may occur again in this case. > >>>> > >>>> By the way, I think maybe we can refer to the implementation of MIPS. > >>>> 1. For general cases, we can use this patch to solve this issue. > >>>> 2. For a large kernel image (>2G) or enabling the KASLR feature, we > >>>> may need a new code mode to deal with this issue. > >>>> > >>> The range [&_end - 2G, VMALLOC_END] won't be suitable when KASLR is > >>> enabled. Unless we limit the randomized range in small area, the > >>> module region start address will be bigger than VMALLOC_END. > >> > >> Actually, the relocatable patch I proposed already moves "everything" up > >> at the same > >> time: the kernel itself but also all the "zones" below (vmalloc, > >> vmemmap, fixup...etc) > >> since all those zones are defined from PAGE_OFFSET that is now dynamic. > >> So the modules > >> ill remain at the same offset to the kernel, unless explicitly > >> randomized in the vmalloc zone. > >> > > OK, it makes sense. The module region moves along with kernel seems to > > stay away from the concern I mentioned. 
> > > > So now, the problem is that the pre-allocated region of percpu data is > > located after _end symbol as Vincent mentioned, the 2G distance seems > > to be too far for module region start address. (i.e. &_end - 2G). > > > Actually, I don't understand this issue: we are limited to addressing > > symbols within +/- 2GB > > from PC. So as long as the percpu symbol exists in the kernel, it is > > below _end and then we don't > > care that its content is initialized dynamically, as long as we can > > 'compute' its address from PC, > > right ? In this case, the static percpu symbols of this issue are declared in the kernel module, not in the kernel image. When the kernel loads a kernel module, in general, it continuously places all the module sections in the VMALLOC memory area. However, the ".data..percpu" section is an exception. The kernel places the ".data..percpu" section in a pre-allocated memory region. This region is used to place the percpu data instances for each CPU and is created by memblock(). Hence, the instances of these per-cpu symbols are above _end. In this case, if we place the module region at [_end-2G, PAGE_OFFSET], the distance between these percpu symbols (address >_end) and the module text section will probably exceed the 2G limitation. > > Can you point out where I can take a look at this pre-allocated region > for percpu data please ? > In mm/percpu.c, you can find how the kernel allocates the percpu data region during the initialization phase. In kernel/module.c's simplify_symbols function, you can find that the kernel treats the ".data..percpu" section as an exception when loading a module. If I have any misunderstandings, please let me know. Thanks Best regards, Vincent > Thanks, > Alex > > > > >>> So if we > >>> divide a region into module use, we also have to provide module > >>> randomization at the same time when KASLR is enabled. It good to me if > >>> there is a new code model to use GOT for static variable, everything > >>> will be easy.
> >> > >> GOT usage seems indeed to be easier, but it seems a bit overkill to me: > >> arm64 already > >> uses the range I proposed for modules (arch/arm64/kernel/kaslr.c, > >> module_alloc_base > >> definition). > >> > >> Anyway, even if GOT is chosen, it won't be anytime soon right ? So we > >> could, for the moment, > >> constraint the modules to be close to the kernel with any range you like > >> and that will still > >> work with our implementation of KASLR. > > Yes, I'm with you on that, we need a resolution in kernel first. > > > >> It seems urgent to fix those modules loading issues for 5.6. > >> > >> Alex > >> ^ permalink raw reply [flat|nested] 30+ messages in thread
* Re: Error on loading some network Kernel modules 2020-02-04 10:46 ` Vincent Chen @ 2020-02-04 11:30 ` Anup Patel 2020-02-04 14:03 ` Vincent Chen 2020-02-04 17:48 ` Romain Dolbeau 0 siblings, 2 replies; 30+ messages in thread From: Anup Patel @ 2020-02-04 11:30 UTC (permalink / raw) To: Vincent Chen Cc: Carlos Eduardo de Paula, Alexandre Ghiti, David Abdurachmanov, Paul Walmsley, Zong Li, Romain Dolbeau, Zong Li, linux-riscv, Aurelien Jarno On Tue, Feb 4, 2020 at 4:16 PM Vincent Chen <vincent.chen@sifive.com> wrote: > > On Tue, Feb 4, 2020 at 5:32 PM Alexandre Ghiti <alex@ghiti.fr> wrote: > > > > > > On 2/4/20 8:19 AM, Zong Li wrote: > > > Alex Ghiti <alex@ghiti.fr> 於 2020年2月4日 週二 下午2:50寫道: > > >> Hi Zong, > > >> > > >> On 2/3/20 10:55 PM, Zong Li wrote: > > >>> Vincent Chen <vincent.chen@sifive.com> 於 2020年2月3日 週一 下午6:04寫道: > > >>>> On Mon, Feb 3, 2020 at 12:27 AM Romain Dolbeau > > >>>> <romain.dolbeau@european-processor-initiative.eu> wrote: > > >>>>> On 2020-02-01 14:59, Alex Ghiti wrote: > > >>>>>> Why restrict to 128M whereas we have 2GB offset available to the end of > > >>>>>> the kernel ? > > >>>>> Isn't that 2 GiB offset to whatever the module requires in the kernel, > > >>>>> rather than to the end of the kernel space? > > >>>>> > > >>>>> Is there some guarantee that symbols accessible by modules are at the > > >>>>> end of the kernel? If so, wouldn't the maximum offset for this patch > > >>>>> still be (2 GiB - <total size of accessible symbols>)? > > >>>>> > > >>>>> Cordially, > > >>>>> > > >>>>> -- > > >>>>> Romain Dolbeau > > >>>> It took me some time to find the root cause of this problem, please > > >>>> allow me to share some observations before the discussion. > > >>>> The root cause of this issue is that the percpu data is declared as a > > >>>> static variable. The "static" attribute will make the compiler think > > >>>> that this symbol is close to the .text section at runtime. 
Hence, the > > >>>> compiler uses "auipc" to access this percpu data instead of using GOT. > > >>>> In this case, the access range is limited to + -2G. However, in > > >>>> practice, these percpu data are placed at a pre-allocated region > > >>>> created by the memblock_allocate() function. In other words, the > > >>>> distance between the pre-allocated region (>PAGE_OFFSET ) and the > > >>>> .text section of the kernel module (in VMALLOC region) is much larger > > >>>> than 2G. > > >>>> I agree that the original patch, > > >>>> https://github.com/bjoto/linux/commit/8a56d1c8e8e91c1bc3893946d52b9217c96e1589, > > >>>> can solve most cases. However, I do not think the patch still works if > > >>>> the kernel module region is determined by _end or <total size of > > >>>> accessible symbols>. The reason is that the pre-allocated region for > > >>>> module percpu data comes from the memblock function at runtime. Hence, > > >>>> we cannot know the actual address of this region at compile-time, and > > >>>> this issue probably may occur again in this case. > > >>>> > > >>>> By the way, I think maybe we can refer to the implementation of MIPS. > > >>>> 1. For general cases, we can use this patch to solve this issue. > > >>>> 2. For a large kernel image (>2G) or enabling the KASLR feature, we > > >>>> may need a new code mode to deal with this issue. > > >>>> > > >>> The range [&_end - 2G, VMALLOC_END] won't be suitable when KASLR is > > >>> enabled. Unless we limit the randomized range in small area, the > > >>> module region start address will be bigger than VMALLOC_END. > > >> > > >> Actually, the relocatable patch I proposed already moves "everything" up > > >> at the same > > >> time: the kernel itself but also all the "zones" below (vmalloc, > > >> vmemmap, fixup...etc) > > >> since all those zones are defined from PAGE_OFFSET that is now dynamic. 
> > >> So the modules > > >> ill remain at the same offset to the kernel, unless explicitly > > >> randomized in the vmalloc zone. > > >> > > > OK, it makes sense. The module region moves along with kernel seems to > > > stay away from the concern I mentioned. > > > > > > So now, the problem is that the pre-allocated region of percpu data is > > > located after _end symbol as Vincent mentioned, the 2G distance seems > > > to be too far for module region start address. (i.e. &_end - 2G). > > > > > > Actually, I don't understand this issue: we are limited to addressing > > symbols within +/- 2GB > > from PC. So as long as the percpu symbol exists in the kernel, it is > > below _end and then we don't > > care that its content is initialized dynamically, as long as we can > > 'compute' its address from PC, > > right ? > > In this case, the static percpu symbols of this issue are declared in > the kernel module, not in the kernel image. > When kernel loads the kernel module, in general, it continuously > places all the module sections in the VMALLOC memory area. However, > the ".data..percpu" section is an exception. The kernel places the > ".data..percpu" section in a pre-allocated memory region. This region > is used to place the percpu data instances for each CPU and is created > by the memblock(). Hence, the instance of these per-cpu symbols is > above the _end. In this case, if we make the module region locate at > [_end-2G, PAGE_OFFSET], the distance between these percpu symbols (its > address >_end) and module text section will probably exceed the 2G > limitation. The static percpu symbols are a particular problem for loadable modules on RISC-V, but dynamic percpu variables are perfectly fine. I had faced this issue with the KVM RISC-V module and I converted the static percpu symbol into a dynamic percpu variable.
In fact, in the Linux kernel dynamic percpu variables are preferred over static percpu symbols, so wherever we see an issue with a static percpu symbol in any module we should just send a patch to convert it into a dynamic percpu variable. In general, any memory access via a pointer (just like dynamic percpu variables) is fine as long as the pointer itself is within 2GB of relative addressing. Overall, Alex's suggestion will address most cases of loadable modules except modules having static percpu symbols, and for such modules just convert these percpu symbols into dynamic percpu variables. > > > > > > Can you point out where I can take a look at this pre-allocated region > > for percpu data please ? > > > In mm/percpu.c, you can find how the kernel allocates the percpu data > region during the initialization phase. > In kernel/module.c's simplify_symbols function, you can find kernel > treats .data.percpu section as an exception when loading module. > > If I have any misunderstandings, please let me know. Thanks Please see above comment. We should prefer dynamic percpu variables in loadable modules over static percpu symbols. I am sure there will be very few kernel loadable modules having static percpu symbols. Regards, Anup ^ permalink raw reply [flat|nested] 30+ messages in thread
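The conversion Anup describes, from a static percpu symbol to a dynamic percpu variable, can be sketched as below. This is an illustrative, hypothetical module (not a patch to openvswitch or any real driver); the percpu APIs used (`DEFINE_PER_CPU`, `alloc_percpu`, `this_cpu_inc`, `free_percpu`) are the standard kernel ones.

```c
/* Illustrative kernel-module sketch only; names are hypothetical. */
#include <linux/module.h>
#include <linux/percpu.h>

/* Before: a static percpu symbol. Its storage ends up in the module's
 * .data..percpu section, which the loader copies into the pre-allocated
 * percpu region above _end, so code must reach it PC-relatively
 * (auipc/addi, +/- 2 GiB) -- the failure discussed in this thread. */
static DEFINE_PER_CPU(u64, demo_counter);

/* After: a dynamic percpu variable. Only this pointer lives inside the
 * module (well within PC-relative reach); the far-away percpu storage
 * is always reached through the pointer, so no PC-relative relocation
 * against the percpu region is emitted. */
static u64 __percpu *demo_counter_dyn;

static int __init demo_init(void)
{
	demo_counter_dyn = alloc_percpu(u64);
	if (!demo_counter_dyn)
		return -ENOMEM;
	this_cpu_inc(*demo_counter_dyn);	/* pointer-based access */
	return 0;
}

static void __exit demo_exit(void)
{
	free_percpu(demo_counter_dyn);
}

module_init(demo_init);
module_exit(demo_exit);
MODULE_LICENSE("GPL");
```

Under this conversion the `demo_counter` symbol disappears entirely; every access goes through `demo_counter_dyn`, which is exactly the pointer-based pattern Anup notes is safe regardless of where the percpu area lands.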
* Re: Error on loading some network Kernel modules 2020-02-04 11:30 ` Anup Patel @ 2020-02-04 14:03 ` Vincent Chen 2020-02-04 19:10 ` Alex Ghiti [not found] ` <a55f265e-71b2-5ebb-b079-6345007a442e@ghiti.fr> 2020-02-04 17:48 ` Romain Dolbeau 1 sibling, 2 replies; 30+ messages in thread From: Vincent Chen @ 2020-02-04 14:03 UTC (permalink / raw) To: Anup Patel Cc: Carlos Eduardo de Paula, Alexandre Ghiti, David Abdurachmanov, Paul Walmsley, Zong Li, Romain Dolbeau, Zong Li, linux-riscv, Aurelien Jarno On Tue, Feb 4, 2020 at 7:31 PM Anup Patel <anup@brainfault.org> wrote: > > On Tue, Feb 4, 2020 at 4:16 PM Vincent Chen <vincent.chen@sifive.com> wrote: > > > > On Tue, Feb 4, 2020 at 5:32 PM Alexandre Ghiti <alex@ghiti.fr> wrote: > > > > > > > > > On 2/4/20 8:19 AM, Zong Li wrote: > > > > Alex Ghiti <alex@ghiti.fr> 於 2020年2月4日 週二 下午2:50寫道: > > > >> Hi Zong, > > > >> > > > >> On 2/3/20 10:55 PM, Zong Li wrote: > > > >>> Vincent Chen <vincent.chen@sifive.com> 於 2020年2月3日 週一 下午6:04寫道: > > > >>>> On Mon, Feb 3, 2020 at 12:27 AM Romain Dolbeau > > > >>>> <romain.dolbeau@european-processor-initiative.eu> wrote: > > > >>>>> On 2020-02-01 14:59, Alex Ghiti wrote: > > > >>>>>> Why restrict to 128M whereas we have 2GB offset available to the end of > > > >>>>>> the kernel ? > > > >>>>> Isn't that 2 GiB offset to whatever the module requires in the kernel, > > > >>>>> rather than to the end of the kernel space? > > > >>>>> > > > >>>>> Is there some guarantee that symbols accessible by modules are at the > > > >>>>> end of the kernel? If so, wouldn't the maximum offset for this patch > > > >>>>> still be (2 GiB - <total size of accessible symbols>)? > > > >>>>> > > > >>>>> Cordially, > > > >>>>> > > > >>>>> -- > > > >>>>> Romain Dolbeau > > > >>>> It took me some time to find the root cause of this problem, please > > > >>>> allow me to share some observations before the discussion. 
> > > >>>> The root cause of this issue is that the percpu data is declared as a > > > >>>> static variable. The "static" attribute will make the compiler think > > > >>>> that this symbol is close to the .text section at runtime. Hence, the > > > >>>> compiler uses "auipc" to access this percpu data instead of using GOT. > > > >>>> In this case, the access range is limited to + -2G. However, in > > > >>>> practice, these percpu data are placed at a pre-allocated region > > > >>>> created by the memblock_allocate() function. In other words, the > > > >>>> distance between the pre-allocated region (>PAGE_OFFSET ) and the > > > >>>> .text section of the kernel module (in VMALLOC region) is much larger > > > >>>> than 2G. > > > >>>> I agree that the original patch, > > > >>>> https://github.com/bjoto/linux/commit/8a56d1c8e8e91c1bc3893946d52b9217c96e1589, > > > >>>> can solve most cases. However, I do not think the patch still works if > > > >>>> the kernel module region is determined by _end or <total size of > > > >>>> accessible symbols>. The reason is that the pre-allocated region for > > > >>>> module percpu data comes from the memblock function at runtime. Hence, > > > >>>> we cannot know the actual address of this region at compile-time, and > > > >>>> this issue probably may occur again in this case. > > > >>>> > > > >>>> By the way, I think maybe we can refer to the implementation of MIPS. > > > >>>> 1. For general cases, we can use this patch to solve this issue. > > > >>>> 2. For a large kernel image (>2G) or enabling the KASLR feature, we > > > >>>> may need a new code mode to deal with this issue. > > > >>>> > > > >>> The range [&_end - 2G, VMALLOC_END] won't be suitable when KASLR is > > > >>> enabled. Unless we limit the randomized range in small area, the > > > >>> module region start address will be bigger than VMALLOC_END. 
> > > >> > > > >> Actually, the relocatable patch I proposed already moves "everything" up > > > >> at the same > > > >> time: the kernel itself but also all the "zones" below (vmalloc, > > > >> vmemmap, fixup...etc) > > > >> since all those zones are defined from PAGE_OFFSET that is now dynamic. > > > >> So the modules > > > >> ill remain at the same offset to the kernel, unless explicitly > > > >> randomized in the vmalloc zone. > > > >> > > > > OK, it makes sense. The module region moves along with kernel seems to > > > > stay away from the concern I mentioned. > > > > > > > > So now, the problem is that the pre-allocated region of percpu data is > > > > located after _end symbol as Vincent mentioned, the 2G distance seems > > > > to be too far for module region start address. (i.e. &_end - 2G). > > > > > > > > > Actually, I don't understand this issue: we are limited to addressing > > > symbols within +/- 2GB > > > from PC. So as long as the percpu symbol exists in the kernel, it is > > > below _end and then we don't > > > care that its content is initialized dynamically, as long as we can > > > 'compute' its address from PC, > > > right ? > > > > In this case, the static percpu symbols of this issue are declared in > > the kernel module, not in the kernel image. > > When kernel loads the kernel module, in general, it continuously > > places all the module sections in the VMALLOC memory area. However, > > the ".data..percpu" section is an exception. The kernel places the > > ".data..percpu" section in a pre-allocated memory region. This region > > is used to place the percpu data instances for each CPU and is created > > by the memblock(). Hence, the instance of these per-cpu symbols is > > above the _end. In this case, if we make the module region locate at > > [_end-2G, PAGE_OFFSET], the distance between these percpu symbols (its > > address >_end) and module text section will probably exceed the 2G > > limitation. 
> > The static percpu symbols are particularly problem for loadable modules > on RISC-V but dynamic percpu variable are perfectly fine. I had faced this > issue with KVM RISC-V module and I converted static percpu symbol into > dynamic percpu variable. In fact, in Linux kernel dynamic percpu variables > are preferred over static percpu symbols so wherever we see issue with > static perpcu symbol in any module we should just send patch to convert > it into dynamic percpu. > > In general, any memory access via pointers (just like dynamic percpu > variables) are fine as long as the pointer itself is within 2GB of relative > addressing. > As far as I know, the kernel module uses PIC as the code model instead of medany. Therefore, I don't think most pointers in kernel modules have a 2GB limit. That is why the modules without static percpu variables do not encounter the 32-bit offset issue. > Overall, Alex's suggestion will address most cases of loadable modules > except modules having static percpu symbols and for such modules just > convert these percpu symbols into dynamic percpu variables. > > > > > > > > Can you point out where I can take a look at this pre-allocated region > > > for percpu data please ? > > > > > In mm/percpu.c, you can find how the kernel allocates the percpu data > > region during the initialization phase. > > In kernel/module.c's simplify_symbols function, you can find kernel > > treats .data.percpu section as an exception when loading module. > > > > If I have any misunderstandings, please let me know. Thanks > > Please see above comment. We should prefer dynamic percpu variable > in loadable modules over static percpu symbols. I am sure there will > be very few kernel loadable modules having static percpu symbols. > Thanks for your comments. I agree that Alex's suggestion can address most cases of loadable modules.
The reason I pointed out the static percpu variable case is that the percpu variable is declared as a static symbol in the reduced case provided by Romain in https://lore.kernel.org/linux-riscv/1572281840733.3517@european-processor-initiative.eu/. Also, I found two static percpu variables declared in the openvswitch module. After removing the static attribute, the kernel can insert the module successfully. Hence, I think the static percpu variable causes the issue. (According to the result of 'grep', very few kernel loadable modules use static percpu data, as you mentioned. However, I'm not sure whether this is a reason to prohibit RISC-V users from using static percpu variables in kernel modules, so I have no comment on this.) If we can ignore the existence of static percpu variables in kernel modules, I agree with Alex's suggestions. The reason is not that we can use it to solve bugs, but that we can change the code model from PIC to medany by creating a specific kernel module region, improving access performance. (Maybe we can keep the current implementation for some corner cases, such as when the size of the kernel image is close to 2GB.) ^ permalink raw reply [flat|nested] 30+ messages in thread
* Re: Error on loading some network Kernel modules 2020-02-04 14:03 ` Vincent Chen @ 2020-02-04 19:10 ` Alex Ghiti [not found] ` <a55f265e-71b2-5ebb-b079-6345007a442e@ghiti.fr> 1 sibling, 0 replies; 30+ messages in thread From: Alex Ghiti @ 2020-02-04 19:10 UTC (permalink / raw) To: Vincent Chen, Anup Patel Cc: Carlos Eduardo de Paula, David Abdurachmanov, Paul Walmsley, Zong Li, Romain Dolbeau, Zong Li, linux-riscv, Aurelien Jarno On 2/4/20 9:03 AM, Vincent Chen wrote: > On Tue, Feb 4, 2020 at 7:31 PM Anup Patel <anup@brainfault.org> wrote: >> >> On Tue, Feb 4, 2020 at 4:16 PM Vincent Chen <vincent.chen@sifive.com> wrote: >>> >>> On Tue, Feb 4, 2020 at 5:32 PM Alexandre Ghiti <alex@ghiti.fr> wrote: >>>> >>>> >>>> On 2/4/20 8:19 AM, Zong Li wrote: >>>>> Alex Ghiti <alex@ghiti.fr> 於 2020年2月4日 週二 下午2:50寫道: >>>>>> Hi Zong, >>>>>> >>>>>> On 2/3/20 10:55 PM, Zong Li wrote: >>>>>>> Vincent Chen <vincent.chen@sifive.com> 於 2020年2月3日 週一 下午6:04寫道: >>>>>>>> On Mon, Feb 3, 2020 at 12:27 AM Romain Dolbeau >>>>>>>> <romain.dolbeau@european-processor-initiative.eu> wrote: >>>>>>>>> On 2020-02-01 14:59, Alex Ghiti wrote: >>>>>>>>>> Why restrict to 128M whereas we have 2GB offset available to the end of >>>>>>>>>> the kernel ? >>>>>>>>> Isn't that 2 GiB offset to whatever the module requires in the kernel, >>>>>>>>> rather than to the end of the kernel space? >>>>>>>>> >>>>>>>>> Is there some guarantee that symbols accessible by modules are at the >>>>>>>>> end of the kernel? If so, wouldn't the maximum offset for this patch >>>>>>>>> still be (2 GiB - <total size of accessible symbols>)? >>>>>>>>> >>>>>>>>> Cordially, >>>>>>>>> >>>>>>>>> -- >>>>>>>>> Romain Dolbeau >>>>>>>> It took me some time to find the root cause of this problem, please >>>>>>>> allow me to share some observations before the discussion. >>>>>>>> The root cause of this issue is that the percpu data is declared as a >>>>>>>> static variable. 
The "static" attribute will make the compiler think >>>>>>>> that this symbol is close to the .text section at runtime. Hence, the >>>>>>>> compiler uses "auipc" to access this percpu data instead of using GOT. >>>>>>>> In this case, the access range is limited to + -2G. However, in >>>>>>>> practice, these percpu data are placed at a pre-allocated region >>>>>>>> created by the memblock_allocate() function. In other words, the >>>>>>>> distance between the pre-allocated region (>PAGE_OFFSET ) and the >>>>>>>> .text section of the kernel module (in VMALLOC region) is much larger >>>>>>>> than 2G. >>>>>>>> I agree that the original patch, >>>>>>>> https://github.com/bjoto/linux/commit/8a56d1c8e8e91c1bc3893946d52b9217c96e1589, >>>>>>>> can solve most cases. However, I do not think the patch still works if >>>>>>>> the kernel module region is determined by _end or <total size of >>>>>>>> accessible symbols>. The reason is that the pre-allocated region for >>>>>>>> module percpu data comes from the memblock function at runtime. Hence, >>>>>>>> we cannot know the actual address of this region at compile-time, and >>>>>>>> this issue probably may occur again in this case. >>>>>>>> >>>>>>>> By the way, I think maybe we can refer to the implementation of MIPS. >>>>>>>> 1. For general cases, we can use this patch to solve this issue. >>>>>>>> 2. For a large kernel image (>2G) or enabling the KASLR feature, we >>>>>>>> may need a new code mode to deal with this issue. >>>>>>>> >>>>>>> The range [&_end - 2G, VMALLOC_END] won't be suitable when KASLR is >>>>>>> enabled. Unless we limit the randomized range in small area, the >>>>>>> module region start address will be bigger than VMALLOC_END. >>>>>> >>>>>> Actually, the relocatable patch I proposed already moves "everything" up >>>>>> at the same >>>>>> time: the kernel itself but also all the "zones" below (vmalloc, >>>>>> vmemmap, fixup...etc) >>>>>> since all those zones are defined from PAGE_OFFSET that is now dynamic. 
>>>>>> So the modules >>>>>> ill remain at the same offset to the kernel, unless explicitly >>>>>> randomized in the vmalloc zone. >>>>>> >>>>> OK, it makes sense. The module region moves along with kernel seems to >>>>> stay away from the concern I mentioned. >>>>> >>>>> So now, the problem is that the pre-allocated region of percpu data is >>>>> located after _end symbol as Vincent mentioned, the 2G distance seems >>>>> to be too far for module region start address. (i.e. &_end - 2G). >>>> >>>> >>>> Actually, I don't understand this issue: we are limited to addressing >>>> symbols within +/- 2GB >>>> from PC. So as long as the percpu symbol exists in the kernel, it is >>>> below _end and then we don't >>>> care that its content is initialized dynamically, as long as we can >>>> 'compute' its address from PC, >>>> right ? >>> >>> In this case, the static percpu symbols of this issue are declared in >>> the kernel module, not in the kernel image. >>> When kernel loads the kernel module, in general, it continuously >>> places all the module sections in the VMALLOC memory area. However, >>> the ".data..percpu" section is an exception. The kernel places the >>> ".data..percpu" section in a pre-allocated memory region. This region >>> is used to place the percpu data instances for each CPU and is created >>> by the memblock(). Hence, the instance of these per-cpu symbols is >>> above the _end. In this case, if we make the module region locate at >>> [_end-2G, PAGE_OFFSET], the distance between these percpu symbols (its >>> address >_end) and module text section will probably exceed the 2G >>> limitation. >> >> The static percpu symbols are particularly problem for loadable modules >> on RISC-V but dynamic percpu variable are perfectly fine. I had faced this >> issue with KVM RISC-V module and I converted static percpu symbol into >> dynamic percpu variable. 
In fact, in Linux kernel dynamic percpu variables >> are preferred over static percpu symbols so wherever we see issue with >> static perpcu symbol in any module we should just send patch to convert >> it into dynamic percpu. >> >> In general, any memory access via pointers (just like dynamic percpu >> variables) are fine as long as the pointer itself is within 2GB of relative >> addressing. >> > > As far as I know, the kernel module uses PIC as the code mode instead > of medany. Therefore, I don't think most pointers in kernel modules > have a 2GB limit. That is why the modules without static percpu > variables do not encounter the 32-bit offset issue. > >> Overall, Alex's suggestion will address most cases of loadable modules >> except modules having static percpu symbols and for such modules just >> convert these percpu symbols into dynamic percpu variables. >> >>> >>>> >>>> Can you point out where I can take a look at this pre-allocated region >>>> for percpu data please ? >>>> >>> In mm/percpu.c, you can find how the kernel allocates the percpu data >>> region during the initialization phase. >>> In kernel/module.c's simplify_symbols function, you can find kernel >>> treats .data.percpu section as an exception when loading module. >>> Thanks for your explanations, I was miles away from understanding the real problem, good finding Vincent. >>> If I have any misunderstandings, please let me know. Thanks >> >> Please see above comment. We should prefer dynamic percpu variable >> in loadable modules over static percpu symbols. I am sure there will >> be very few kernel loadable modules having static percpu symbols. >> > Thanks for your comments. > I agree that Alex's suggestion can address most cases of loadable > modules. 
The reason why I pointed out the static percpu variable case > is that the percpu variable is declared as a static symbol in the > reduced case provided by Romain in > https://lore.kernel.org/linux-riscv/1572281840733.3517@european-processor-initiative.eu/. > Also, I found two static percpu variables declared in the openvswitch > module. After removing the static attribute, the kernel can load > the module successfully. Hence, I think the static percpu variable causes > the issue. > > (According to the result of 'grep', very few kernel loadable modules > use static percpu data, as you mentioned. However, I'm not sure whether this > is a reason to prohibit RISC-V users from using static percpu > variables in kernel modules. Therefore, I have no comment on this.) > If we can ignore the existence of static percpu variables in > kernel modules, I agree with Alex's suggestions. The reason is not > that we can use it to solve the bug, but that we can change the code > model from PIC to medany by creating a specific kernel module region > to improve access performance. If I understand well, you mean that having this module zone at the end of the vmalloc zone does not allow us to fix the bug regarding static percpu variables, but allows better performance than PIC, which uses an additional redirection through the GOT, right ? > (Maybe we can keep the current implementation for some corner cases > such as when the size of the kernel image is close to 2GB) > Either we follow Anup's advice and we don't care about those static percpu variables, and if ever they fall beyond the 2GB limit we cannot load the module, or we find a way to make sure that the module range comprises the area for module percpu variables which is allocated in setup_per_cpu_areas. I'm taking a look at this solution and how other architectures with the same code model handle that. Thanks again, Alex ^ permalink raw reply [flat|nested] 30+ messages in thread
[parent not found: <a55f265e-71b2-5ebb-b079-6345007a442e@ghiti.fr>]
* Re: Error on loading some network Kernel modules [not found] ` <a55f265e-71b2-5ebb-b079-6345007a442e@ghiti.fr> @ 2020-02-05 3:22 ` Vincent Chen 2020-02-05 4:24 ` Anup Patel 0 siblings, 1 reply; 30+ messages in thread From: Vincent Chen @ 2020-02-05 3:22 UTC (permalink / raw) To: Alex Ghiti Cc: Carlos Eduardo de Paula, David Abdurachmanov, Anup Patel, Paul Walmsley, Zong Li, Romain Dolbeau, Zong Li, linux-riscv, Aurelien Jarno On Wed, Feb 5, 2020 at 3:07 AM Alex Ghiti <alex@ghiti.fr> wrote: > > > On 2/4/20 9:03 AM, Vincent Chen wrote: > > On Tue, Feb 4, 2020 at 7:31 PM Anup Patel <anup@brainfault.org> wrote: > > On Tue, Feb 4, 2020 at 4:16 PM Vincent Chen <vincent.chen@sifive.com> wrote: > > On Tue, Feb 4, 2020 at 5:32 PM Alexandre Ghiti <alex@ghiti.fr> wrote: > > On 2/4/20 8:19 AM, Zong Li wrote: > > Alex Ghiti <alex@ghiti.fr> 於 2020年2月4日 週二 下午2:50寫道: > > Hi Zong, > > On 2/3/20 10:55 PM, Zong Li wrote: > > Vincent Chen <vincent.chen@sifive.com> 於 2020年2月3日 週一 下午6:04寫道: > > On Mon, Feb 3, 2020 at 12:27 AM Romain Dolbeau > <romain.dolbeau@european-processor-initiative.eu> wrote: > > On 2020-02-01 14:59, Alex Ghiti wrote: > > Why restrict to 128M whereas we have 2GB offset available to the end of > the kernel ? > > Isn't that 2 GiB offset to whatever the module requires in the kernel, > rather than to the end of the kernel space? > > Is there some guarantee that symbols accessible by modules are at the > end of the kernel? If so, wouldn't the maximum offset for this patch > still be (2 GiB - <total size of accessible symbols>)? > > Cordially, > > -- > Romain Dolbeau > > It took me some time to find the root cause of this problem, please > allow me to share some observations before the discussion. > The root cause of this issue is that the percpu data is declared as a > static variable. The "static" attribute will make the compiler think > that this symbol is close to the .text section at runtime. 
Hence, the > compiler uses "auipc" to access this percpu data instead of using GOT. > In this case, the access range is limited to + -2G. However, in > practice, these percpu data are placed at a pre-allocated region > created by the memblock_allocate() function. In other words, the > distance between the pre-allocated region (>PAGE_OFFSET ) and the > .text section of the kernel module (in VMALLOC region) is much larger > than 2G. > I agree that the original patch, > https://github.com/bjoto/linux/commit/8a56d1c8e8e91c1bc3893946d52b9217c96e1589, > can solve most cases. However, I do not think the patch still works if > the kernel module region is determined by _end or <total size of > accessible symbols>. The reason is that the pre-allocated region for > module percpu data comes from the memblock function at runtime. Hence, > we cannot know the actual address of this region at compile-time, and > this issue probably may occur again in this case. > > By the way, I think maybe we can refer to the implementation of MIPS. > 1. For general cases, we can use this patch to solve this issue. > 2. For a large kernel image (>2G) or enabling the KASLR feature, we > may need a new code mode to deal with this issue. > > The range [&_end - 2G, VMALLOC_END] won't be suitable when KASLR is > enabled. Unless we limit the randomized range in small area, the > module region start address will be bigger than VMALLOC_END. > > Actually, the relocatable patch I proposed already moves "everything" up > at the same > time: the kernel itself but also all the "zones" below (vmalloc, > vmemmap, fixup...etc) > since all those zones are defined from PAGE_OFFSET that is now dynamic. > So the modules > ill remain at the same offset to the kernel, unless explicitly > randomized in the vmalloc zone. > > OK, it makes sense. The module region moves along with kernel seems to > stay away from the concern I mentioned. 
> > So now, the problem is that the pre-allocated region of percpu data is > located after _end symbol as Vincent mentioned, the 2G distance seems > to be too far for module region start address. (i.e. &_end - 2G). > > Actually, I don't understand this issue: we are limited to addressing > symbols within +/- 2GB > from PC. So as long as the percpu symbol exists in the kernel, it is > below _end and then we don't > care that its content is initialized dynamically, as long as we can > 'compute' its address from PC, > right ? > > In this case, the static percpu symbols of this issue are declared in > the kernel module, not in the kernel image. > When kernel loads the kernel module, in general, it continuously > places all the module sections in the VMALLOC memory area. However, > the ".data..percpu" section is an exception. The kernel places the > ".data..percpu" section in a pre-allocated memory region. This region > is used to place the percpu data instances for each CPU and is created > by the memblock(). Hence, the instance of these per-cpu symbols is > above the _end. In this case, if we make the module region locate at > [_end-2G, PAGE_OFFSET], the distance between these percpu symbols (its > address >_end) and module text section will probably exceed the 2G > limitation. > > The static percpu symbols are particularly problem for loadable modules > on RISC-V but dynamic percpu variable are perfectly fine. I had faced this > issue with KVM RISC-V module and I converted static percpu symbol into > dynamic percpu variable. In fact, in Linux kernel dynamic percpu variables > are preferred over static percpu symbols so wherever we see issue with > static perpcu symbol in any module we should just send patch to convert > it into dynamic percpu. > > In general, any memory access via pointers (just like dynamic percpu > variables) are fine as long as the pointer itself is within 2GB of relative > addressing. 
> > As far as I know, the kernel module uses PIC as the code mode instead > of medany. Therefore, I don't think most pointers in kernel modules > have a 2GB limit. That is why the modules without static percpu > variables do not encounter the 32-bit offset issue. > > Overall, Alex's suggestion will address most cases of loadable modules > except modules having static percpu symbols and for such modules just > convert these percpu symbols into dynamic percpu variables. > > Can you point out where I can take a look at this pre-allocated region > for percpu data please ? > > In mm/percpu.c, you can find how the kernel allocates the percpu data > region during the initialization phase. > In kernel/module.c's simplify_symbols function, you can find kernel > treats .data.percpu section as an exception when loading module. > > > Thanks for your explanations, I was miles away from understanding the real problem, > good finding Vincent. > > > If I have any misunderstandings, please let me know. Thanks > > Please see above comment. We should prefer dynamic percpu variable > in loadable modules over static percpu symbols. I am sure there will > be very few kernel loadable modules having static percpu symbols. > > Thanks for your comments. > I agree that Alex's suggestion can address most cases of loadable > modules. The reason why I pointed out the static percpu variable case > is that the percpu variable is declared as a static symbol in the > reduced case provided by Romain in > https://lore.kernel.org/linux-riscv/1572281840733.3517@european-processor-initiative.eu/. > Also, I found two static percpu variables declared in openvswitch > module. After removing the static attribute, the kernel can insert > them successfully. Hence, I think the static percpu variable causes > the issue. > > (According to the result of 'grep', very few kernel loadable modules > use static percpu data, as you mentioned. 
However, I'm not sure whether this > is a reason to prohibit RISC-V users from using static percpu > variables in kernel modules. Therefore, I have no comment on this.) > If we can ignore the existence of static percpu variables in > kernel modules, I agree with Alex's suggestions. The reason is not > that we can use it to solve the bug, but that we can change the code > model from PIC to medany by creating a specific kernel module region > to improve access performance. > > > If I understand well, you mean that having this module zone at the end of the vmalloc zone > does not allow us to fix the bug regarding static percpu variables, but allows better > performance than PIC, which uses an additional redirection through the GOT, right ? > Yes. Currently, I think the PIC model can resolve most pointer references except for the static percpu variables. However, as you mentioned, it will lose some performance due to the additional redirection through the GOT. > > (Maybe we can keep the current implementation for some corner cases > such as when the size of the kernel image is close to 2GB) > > > Either we follow Anup's advice and we don't care about those static percpu variables, and > if ever they fall beyond the 2GB limit we cannot load the module, or we find a way to make > sure that the module range comprises the area for module percpu variables which is allocated > in setup_per_cpu_areas. I'm taking a look at this solution and how other architectures with > the same code model handle that. > As far as I know, ARM64 and MIPS64 with CONFIG_KBUILD_64BIT_SYM32 also use a PC-relative way to access static percpu variables. I quickly describe their implementations below; please correct me if I have any misunderstandings, thanks. 1. ARM64 When CONFIG_RANDOMIZE_BASE is disabled, they place the module region at "(u64)_etext - MODULES_VSIZE(128MB)", and the range limitation of PC-relative addressing (instruction adrp) is +-4GB.
When CONFIG_RANDOMIZE_BASE is enabled, the maximum access range between a module region symbol and a kernel symbol is limited to 2GB. There is still 2GB of room to access the per-cpu data. Therefore, I think it is also safe to access the per-cpu variables. 2. MIPS64 with CONFIG_KBUILD_64BIT_SYM32 The VMALLOC region is placed after the kernel image. Therefore, the start address of the kernel image is 0xffffffff80100000, and the kernel module region is at 0xffffffffc0000000, which is the beginning of VMALLOC. They do not encounter this issue because the range limitation of PC-relative addressing is +-2GB. Currently, I have three solutions to fix the percpu data problem, and I list them below. Any feedback is welcome. 1. Create a specific module region located at [PAGE_OFFSET-SIZE, PAGE_OFFSET]. I think this way can solve most cases because the kernel image is generally much smaller than 2GB, and we can still change the code model from PIC to medany. However, it will still encounter the same problem if the kernel image is close to 2GB. 2. Request a new code model for the kernel. Maybe we can ask the compiler folks to generate a new PIC model for the kernel, which uses the GOT to access every static per-cpu variable instead of PC-relative addressing. However, I am not sure whether this idea would be accepted by the compiler folks or not. If accepted, it may take a long time to realize. 3. Jump to a resolver function. Zong and I came up with this rough idea to address the problem, but we have not implemented it. The following is the current code pattern for accessing the percpu data. 64: 00000917 auipc s2,0x0 68: 00093903 ld s2,0(s2) # 64 <.L0 > We plan to change the " ld s2,0(s2) " to "jalr xxx(s2)" to redirect the code to the corresponding resolver function. Each static percpu variable owns a resolver function. These resolver functions go and get the variable's content, and then store it to the target register by decoding the auipc instruction.
We think this method may solve the problem, but it will hurt performance due to the redirection. I prefer the solution combining method 1 and method 2. 1. In general, users can choose method 1 to improve the performance of data access. 2. For safety, users can use the new code model to access static percpu data without any limitation. We can create a new kernel config that looks a bit like MIPS CONFIG_KBUILD_64BIT_SYM32 for users to choose. ^ permalink raw reply [flat|nested] 30+ messages in thread
* RE: Error on loading some network Kernel modules 2020-02-05 3:22 ` Vincent Chen @ 2020-02-05 4:24 ` Anup Patel 2020-02-05 10:37 ` Vincent Chen 0 siblings, 1 reply; 30+ messages in thread From: Anup Patel @ 2020-02-05 4:24 UTC (permalink / raw) To: Vincent Chen, Alex Ghiti Cc: Carlos Eduardo de Paula, David Abdurachmanov, Anup Patel, Paul Walmsley, Zong Li, Romain Dolbeau, Zong Li, linux-riscv, Aurelien Jarno > -----Original Message----- > From: linux-riscv <linux-riscv-bounces@lists.infradead.org> On Behalf Of > Vincent Chen > Sent: Wednesday, February 5, 2020 8:52 AM > To: Alex Ghiti <alex@ghiti.fr> > Cc: Carlos Eduardo de Paula <me@carlosedp.com>; David Abdurachmanov > <david.abdurachmanov@gmail.com>; Anup Patel <anup@brainfault.org>; > Paul Walmsley <paul@pwsan.com>; Zong Li <zongbox@gmail.com>; Romain > Dolbeau <romain.dolbeau@european-processor-initiative.eu>; Zong Li > <zong.li@sifive.com>; linux-riscv <linux-riscv@lists.infradead.org>; Aurelien > Jarno <aurelien@aurel32.net> > Subject: Re: Error on loading some network Kernel modules > > On Wed, Feb 5, 2020 at 3:07 AM Alex Ghiti <alex@ghiti.fr> wrote: > > > > > > On 2/4/20 9:03 AM, Vincent Chen wrote: > > > > On Tue, Feb 4, 2020 at 7:31 PM Anup Patel <anup@brainfault.org> wrote: > > > > On Tue, Feb 4, 2020 at 4:16 PM Vincent Chen <vincent.chen@sifive.com> > wrote: > > > > On Tue, Feb 4, 2020 at 5:32 PM Alexandre Ghiti <alex@ghiti.fr> wrote: > > > > On 2/4/20 8:19 AM, Zong Li wrote: > > > > Alex Ghiti <alex@ghiti.fr> 於 2020年2月4日 週二 下午2:50寫道: > > > > Hi Zong, > > > > On 2/3/20 10:55 PM, Zong Li wrote: > > > > Vincent Chen <vincent.chen@sifive.com> 於 2020年2月3日 週一 下午 > 6:04寫道: > > > > On Mon, Feb 3, 2020 at 12:27 AM Romain Dolbeau > > <romain.dolbeau@european-processor-initiative.eu> wrote: > > > > On 2020-02-01 14:59, Alex Ghiti wrote: > > > > Why restrict to 128M whereas we have 2GB offset available to the end > > of the kernel ? 
> > > > Isn't that 2 GiB offset to whatever the module requires in the kernel, > > rather than to the end of the kernel space? > > > > Is there some guarantee that symbols accessible by modules are at the > > end of the kernel? If so, wouldn't the maximum offset for this patch > > still be (2 GiB - <total size of accessible symbols>)? > > > > Cordially, > > > > -- > > Romain Dolbeau > > > > It took me some time to find the root cause of this problem, please > > allow me to share some observations before the discussion. > > The root cause of this issue is that the percpu data is declared as a > > static variable. The "static" attribute will make the compiler think > > that this symbol is close to the .text section at runtime. Hence, the > > compiler uses "auipc" to access this percpu data instead of using GOT. > > In this case, the access range is limited to + -2G. However, in > > practice, these percpu data are placed at a pre-allocated region > > created by the memblock_allocate() function. In other words, the > > distance between the pre-allocated region (>PAGE_OFFSET ) and the > > .text section of the kernel module (in VMALLOC region) is much larger > > than 2G. > > I agree that the original patch, > > > https://github.com/bjoto/linux/commit/8a56d1c8e8e91c1bc3893946d52b921 > 7 > > c96e1589, can solve most cases. However, I do not think the patch > > still works if the kernel module region is determined by _end or > > <total size of accessible symbols>. The reason is that the > > pre-allocated region for module percpu data comes from the memblock > > function at runtime. Hence, we cannot know the actual address of this > > region at compile-time, and this issue probably may occur again in > > this case. > > > > By the way, I think maybe we can refer to the implementation of MIPS. > > 1. For general cases, we can use this patch to solve this issue. > > 2. 
For a large kernel image (>2G) or enabling the KASLR feature, we > > may need a new code mode to deal with this issue. > > > > The range [&_end - 2G, VMALLOC_END] won't be suitable when KASLR is > > enabled. Unless we limit the randomized range in small area, the > > module region start address will be bigger than VMALLOC_END. > > > > Actually, the relocatable patch I proposed already moves "everything" > > up at the same > > time: the kernel itself but also all the "zones" below (vmalloc, > > vmemmap, fixup...etc) since all those zones are defined from > > PAGE_OFFSET that is now dynamic. > > So the modules > > ill remain at the same offset to the kernel, unless explicitly > > randomized in the vmalloc zone. > > > > OK, it makes sense. The module region moves along with kernel seems to > > stay away from the concern I mentioned. > > > > So now, the problem is that the pre-allocated region of percpu data is > > located after _end symbol as Vincent mentioned, the 2G distance seems > > to be too far for module region start address. (i.e. &_end - 2G). > > > > Actually, I don't understand this issue: we are limited to addressing > > symbols within +/- 2GB from PC. So as long as the percpu symbol exists > > in the kernel, it is below _end and then we don't care that its > > content is initialized dynamically, as long as we can 'compute' its > > address from PC, right ? > > > > In this case, the static percpu symbols of this issue are declared in > > the kernel module, not in the kernel image. > > When kernel loads the kernel module, in general, it continuously > > places all the module sections in the VMALLOC memory area. However, > > the ".data..percpu" section is an exception. The kernel places the > > ".data..percpu" section in a pre-allocated memory region. This region > > is used to place the percpu data instances for each CPU and is created > > by the memblock(). Hence, the instance of these per-cpu symbols is > > above the _end. 
In this case, if we make the module region locate at > > [_end-2G, PAGE_OFFSET], the distance between these percpu symbols (its > > address >_end) and module text section will probably exceed the 2G > > limitation. > > > > The static percpu symbols are particularly problem for loadable > > modules on RISC-V but dynamic percpu variable are perfectly fine. I > > had faced this issue with KVM RISC-V module and I converted static > > percpu symbol into dynamic percpu variable. In fact, in Linux kernel > > dynamic percpu variables are preferred over static percpu symbols so > > wherever we see issue with static perpcu symbol in any module we > > should just send patch to convert it into dynamic percpu. > > > > In general, any memory access via pointers (just like dynamic percpu > > variables) are fine as long as the pointer itself is within 2GB of > > relative addressing. > > > > As far as I know, the kernel module uses PIC as the code mode instead > > of medany. Therefore, I don't think most pointers in kernel modules > > have a 2GB limit. That is why the modules without static percpu > > variables do not encounter the 32-bit offset issue. > > > > Overall, Alex's suggestion will address most cases of loadable modules > > except modules having static percpu symbols and for such modules just > > convert these percpu symbols into dynamic percpu variables. > > > > Can you point out where I can take a look at this pre-allocated region > > for percpu data please ? > > > > In mm/percpu.c, you can find how the kernel allocates the percpu data > > region during the initialization phase. > > In kernel/module.c's simplify_symbols function, you can find kernel > > treats .data.percpu section as an exception when loading module. > > > > > > Thanks for your explanations, I was miles away from understanding the > > real problem, good finding Vincent. > > > > > > If I have any misunderstandings, please let me know. Thanks > > > > Please see above comment. 
We should prefer dynamic percpu variable in > > loadable modules over static percpu symbols. I am sure there will be > > very few kernel loadable modules having static percpu symbols. > > > > Thanks for your comments. > > I agree that Alex's suggestion can address most cases of loadable > > modules. The reason why I pointed out the static percpu variable case > > is that the percpu variable is declared as a static symbol in the > > reduced case provided by Romain in > > https://lore.kernel.org/linux-riscv/1572281840733.3517@european- > processor-initiative.eu/. > > Also, I found two static percpu variables declared in openvswitch > > module. After removing the static attribute, the kernel can insert > > them successfully. Hence, I think the static percpu variable causes > > the issue. > > > > (According to the result of 'grep', very few kernel loadable modules > > use static percpu data, as you mentioned. However, I'm not sure this > > might be a reason to prohibit RISC-V users from using static percpu > > variables in kernel modules. Therefore, I have no comment on this.) If > > we can ignore the existence of static percpu variable in the kernels > > module, I agree with Alex's suggestions. The reason is not that we can > > use it to solve bugs, but that we can change the code model from PIC > > to medany by creating a specific kernel module region to improve the > > access performance. > > > > > > If I understand well, you mean that having this module zone at the end > > of vmalloc zone does not allow to fix the bug regarding static percpu > > variables but allows to have better performance than PIC which uses an > additional redirection using the GOT, right ? > > > > Yes. Currently, I think PIC model can resolve most of the pointer reference > except for the static percpu variables. However, as you mentioned, It will > lose some performance due to the additional redirection using the GOT. 
> > > > > (Maybe we can keep the current implementation for some corner cases > > such as the size of the kernel image is close to 2GB) > > > > > > Either we follow Anup's advice and we don't care about those static > > percpu variables and if ever they fall behind the 2GB limit, we can > > not load the module, or we find a way to make sure that the module > > range comprises the area for module percpu variables which is > > allocated in setup_per_cpu_areas. I'm taking a look at this solution and > how other architectures with same code model handle that. > > > As far as I know, ARM64 and MIPS64 with CONFIG_KBUILD_64BIT_SYM32 > also use PC-relative way to access the static perpcu variables. I quickly > describe their implementation below, please correct me if I have any > misunderstandings, thanks. > 1. ARM64 > When CONFIG_RANDOMIZE_BASE is disabled, they place the module > region at "(u64)_etext - MODULES_VSIZE(128MB)", and the range limitation > of PC-relative ( instruction adrp) is +-4GB. When CONFIG_RANDOMIZE_BASE > is enabled, the maximum access range between module region symbol and > kernel symbol is limited to 2GB. There is still 2GB room to access the per-cpu > data. Therefore, I think it is also safe to access the per-cpu variables. > 2. MIPS64 with CONFIG_KBUILD_64BIT_SYM32 > The VMALLOC region is placed after the kernel image. Therefore, the start > address of the kernel image is at the 0xfffffff80100000, and the kernel > module region is at 0xffffffffc0000000 which is the beginning of the > VMALLOC. They do not encounter this issue because the range limitation of > PC-relative is +-2GB. > > Currently, I have three solutions to fix the percpu data problem, and I list > them below. Any feedback is welcome. > > 1. Create a specific module region locating at [PAGE_OFFSET-SIZE, > PAGE_OFFSET]. 
> I think this way can solve most cases because the general size of the kernel
> image is much smaller than 2G, and we still can change the code model from
> PIC to medany. However, it will still encounter the same problem if the
> kernel image is close to 2GB.
>
> 2. Request a new code model for the kernel
> Maybe we can ask the compiler guys to generate a new PIC mode for the
> kernel, which uses the GOT to access every static per-cpu datum instead of
> PC-relative addressing. However, I am not sure whether this idea would be
> accepted by the compiler guys or not. If accepted, it may take a long time
> to realize.
>
> 3. Jump to a resolved function.
> Zong and I came up with this rough idea to address the problem, but we have
> not implemented it.
>
> The following is the current code pattern for accessing percpu data:
>   64: 00000917  auipc s2,0x0
>   68: 00093903  ld    s2,0(s2) # 64 <.L0>
>
> We plan to change the "ld s2,0(s2)" to "jalr xxx(s2)" to redirect the code
> to the corresponding resolved function. Each static percpu datum owns a
> resolved function. These resolved functions fetch the variable content and
> then store it to the target register by decoding the auipc instruction. We
> think this method may solve the problem, but it will degrade performance
> due to the redirection.
>
> I prefer the solution combining method 1 and method 2:
> 1. In general, users can choose method 1 to improve the performance of
> data access.
> 2. For safety, users can use the new code model to access static percpu
> data without any limitation.
> We can create a new kernel config that looks a bit like MIPS
> CONFIG_KBUILD_64BIT_SYM32 for users to choose.

I think we should not waste time in our quest to find the perfect solution. We should first have a module_alloc() in place (based on various suggestions in this email thread) so that most people don't see issues. We can fix module loading for modules with static per-cpu symbols separately.
For point1, we don't need a dedicated region as long as we are allocating modules from the VMALLOC area. Let's avoid another virtual memory region if possible.

For point2, loading the address of a per-cpu static symbol is always PC-relative on most RISC architectures, and static per-cpu symbols in loadable modules are a known issue (refer to https://lwn.net/Articles/1509/). I don't agree that this should be solved in the compilers, because per-cpu data management is Linux-specific (or OS-specific), so we should rather explore ways to handle this in Linux itself. The dynamic per-cpu alloc APIs (i.e. alloc_percpu() and friends) were added primarily to benefit loadable modules and NUMA systems. The static per-cpu symbols (i.e. DEFINE_PER_CPU() and friends) should only be used in built-in modules or code linked into the kernel. It is generally very easy to upgrade code using static per-cpu symbols to use dynamic per-cpu symbols.

Regards,
Anup

^ permalink raw reply	[flat|nested] 30+ messages in thread
* Re: Error on loading some network Kernel modules 2020-02-05 4:24 ` Anup Patel @ 2020-02-05 10:37 ` Vincent Chen 2020-02-07 14:39 ` Vincent Chen 0 siblings, 1 reply; 30+ messages in thread From: Vincent Chen @ 2020-02-05 10:37 UTC (permalink / raw) To: Anup Patel Cc: Carlos Eduardo de Paula, Alex Ghiti, David Abdurachmanov, Anup Patel, Zong Li, Romain Dolbeau, Zong Li, linux-riscv, Paul Walmsley, Aurelien Jarno On Wed, Feb 5, 2020 at 12:24 PM Anup Patel <Anup.Patel@wdc.com> wrote: > > > > > -----Original Message----- > > From: linux-riscv <linux-riscv-bounces@lists.infradead.org> On Behalf Of > > Vincent Chen > > Sent: Wednesday, February 5, 2020 8:52 AM > > To: Alex Ghiti <alex@ghiti.fr> > > Cc: Carlos Eduardo de Paula <me@carlosedp.com>; David Abdurachmanov > > <david.abdurachmanov@gmail.com>; Anup Patel <anup@brainfault.org>; > > Paul Walmsley <paul@pwsan.com>; Zong Li <zongbox@gmail.com>; Romain > > Dolbeau <romain.dolbeau@european-processor-initiative.eu>; Zong Li > > <zong.li@sifive.com>; linux-riscv <linux-riscv@lists.infradead.org>; Aurelien > > Jarno <aurelien@aurel32.net> > > Subject: Re: Error on loading some network Kernel modules > > > > On Wed, Feb 5, 2020 at 3:07 AM Alex Ghiti <alex@ghiti.fr> wrote: > > > > > > > > > On 2/4/20 9:03 AM, Vincent Chen wrote: > > > > > > On Tue, Feb 4, 2020 at 7:31 PM Anup Patel <anup@brainfault.org> wrote: > > > > > > On Tue, Feb 4, 2020 at 4:16 PM Vincent Chen <vincent.chen@sifive.com> > > wrote: > > > > > > On Tue, Feb 4, 2020 at 5:32 PM Alexandre Ghiti <alex@ghiti.fr> wrote: > > > > > > On 2/4/20 8:19 AM, Zong Li wrote: > > > > > > Alex Ghiti <alex@ghiti.fr> 於 2020年2月4日 週二 下午2:50寫道: > > > > > > Hi Zong, > > > > > > On 2/3/20 10:55 PM, Zong Li wrote: > > > > > > Vincent Chen <vincent.chen@sifive.com> 於 2020年2月3日 週一 下午 > > 6:04寫道: > > > > > > On Mon, Feb 3, 2020 at 12:27 AM Romain Dolbeau > > > <romain.dolbeau@european-processor-initiative.eu> wrote: > > > > > > On 2020-02-01 14:59, Alex Ghiti wrote: > > > > > > Why 
restrict to 128M whereas we have 2GB offset available to the end > > > of the kernel ? > > > > > > Isn't that 2 GiB offset to whatever the module requires in the kernel, > > > rather than to the end of the kernel space? > > > > > > Is there some guarantee that symbols accessible by modules are at the > > > end of the kernel? If so, wouldn't the maximum offset for this patch > > > still be (2 GiB - <total size of accessible symbols>)? > > > > > > Cordially, > > > > > > -- > > > Romain Dolbeau > > > > > > It took me some time to find the root cause of this problem, please > > > allow me to share some observations before the discussion. > > > The root cause of this issue is that the percpu data is declared as a > > > static variable. The "static" attribute will make the compiler think > > > that this symbol is close to the .text section at runtime. Hence, the > > > compiler uses "auipc" to access this percpu data instead of using GOT. > > > In this case, the access range is limited to + -2G. However, in > > > practice, these percpu data are placed at a pre-allocated region > > > created by the memblock_allocate() function. In other words, the > > > distance between the pre-allocated region (>PAGE_OFFSET ) and the > > > .text section of the kernel module (in VMALLOC region) is much larger > > > than 2G. > > > I agree that the original patch, > > > > > https://github.com/bjoto/linux/commit/8a56d1c8e8e91c1bc3893946d52b921 > > 7 > > > c96e1589, can solve most cases. However, I do not think the patch > > > still works if the kernel module region is determined by _end or > > > <total size of accessible symbols>. The reason is that the > > > pre-allocated region for module percpu data comes from the memblock > > > function at runtime. Hence, we cannot know the actual address of this > > > region at compile-time, and this issue probably may occur again in > > > this case. > > > > > > By the way, I think maybe we can refer to the implementation of MIPS. > > > 1. 
For general cases, we can use this patch to solve this issue. > > > 2. For a large kernel image (>2G) or enabling the KASLR feature, we > > > may need a new code mode to deal with this issue. > > > > > > The range [&_end - 2G, VMALLOC_END] won't be suitable when KASLR is > > > enabled. Unless we limit the randomized range in small area, the > > > module region start address will be bigger than VMALLOC_END. > > > > > > Actually, the relocatable patch I proposed already moves "everything" > > > up at the same > > > time: the kernel itself but also all the "zones" below (vmalloc, > > > vmemmap, fixup...etc) since all those zones are defined from > > > PAGE_OFFSET that is now dynamic. > > > So the modules > > > ill remain at the same offset to the kernel, unless explicitly > > > randomized in the vmalloc zone. > > > > > > OK, it makes sense. The module region moves along with kernel seems to > > > stay away from the concern I mentioned. > > > > > > So now, the problem is that the pre-allocated region of percpu data is > > > located after _end symbol as Vincent mentioned, the 2G distance seems > > > to be too far for module region start address. (i.e. &_end - 2G). > > > > > > Actually, I don't understand this issue: we are limited to addressing > > > symbols within +/- 2GB from PC. So as long as the percpu symbol exists > > > in the kernel, it is below _end and then we don't care that its > > > content is initialized dynamically, as long as we can 'compute' its > > > address from PC, right ? > > > > > > In this case, the static percpu symbols of this issue are declared in > > > the kernel module, not in the kernel image. > > > When kernel loads the kernel module, in general, it continuously > > > places all the module sections in the VMALLOC memory area. However, > > > the ".data..percpu" section is an exception. The kernel places the > > > ".data..percpu" section in a pre-allocated memory region. 
This region > > > is used to place the percpu data instances for each CPU and is created > > > by the memblock(). Hence, the instance of these per-cpu symbols is > > > above the _end. In this case, if we make the module region locate at > > > [_end-2G, PAGE_OFFSET], the distance between these percpu symbols (its > > > address >_end) and module text section will probably exceed the 2G > > > limitation. > > > > > > The static percpu symbols are particularly problem for loadable > > > modules on RISC-V but dynamic percpu variable are perfectly fine. I > > > had faced this issue with KVM RISC-V module and I converted static > > > percpu symbol into dynamic percpu variable. In fact, in Linux kernel > > > dynamic percpu variables are preferred over static percpu symbols so > > > wherever we see issue with static perpcu symbol in any module we > > > should just send patch to convert it into dynamic percpu. > > > > > > In general, any memory access via pointers (just like dynamic percpu > > > variables) are fine as long as the pointer itself is within 2GB of > > > relative addressing. > > > > > > As far as I know, the kernel module uses PIC as the code mode instead > > > of medany. Therefore, I don't think most pointers in kernel modules > > > have a 2GB limit. That is why the modules without static percpu > > > variables do not encounter the 32-bit offset issue. > > > > > > Overall, Alex's suggestion will address most cases of loadable modules > > > except modules having static percpu symbols and for such modules just > > > convert these percpu symbols into dynamic percpu variables. > > > > > > Can you point out where I can take a look at this pre-allocated region > > > for percpu data please ? > > > > > > In mm/percpu.c, you can find how the kernel allocates the percpu data > > > region during the initialization phase. > > > In kernel/module.c's simplify_symbols function, you can find kernel > > > treats .data.percpu section as an exception when loading module. 
> > > > > > > > > Thanks for your explanations, I was miles away from understanding the > > > real problem, good finding Vincent. > > > > > > > > > If I have any misunderstandings, please let me know. Thanks > > > > > > Please see above comment. We should prefer dynamic percpu variable in > > > loadable modules over static percpu symbols. I am sure there will be > > > very few kernel loadable modules having static percpu symbols. > > > > > > Thanks for your comments. > > > I agree that Alex's suggestion can address most cases of loadable > > > modules. The reason why I pointed out the static percpu variable case > > > is that the percpu variable is declared as a static symbol in the > > > reduced case provided by Romain in > > > https://lore.kernel.org/linux-riscv/1572281840733.3517@european- > > processor-initiative.eu/. > > > Also, I found two static percpu variables declared in openvswitch > > > module. After removing the static attribute, the kernel can insert > > > them successfully. Hence, I think the static percpu variable causes > > > the issue. > > > > > > (According to the result of 'grep', very few kernel loadable modules > > > use static percpu data, as you mentioned. However, I'm not sure this > > > might be a reason to prohibit RISC-V users from using static percpu > > > variables in kernel modules. Therefore, I have no comment on this.) If > > > we can ignore the existence of static percpu variable in the kernels > > > module, I agree with Alex's suggestions. The reason is not that we can > > > use it to solve bugs, but that we can change the code model from PIC > > > to medany by creating a specific kernel module region to improve the > > > access performance. > > > > > > > > > If I understand well, you mean that having this module zone at the end > > > of vmalloc zone does not allow to fix the bug regarding static percpu > > > variables but allows to have better performance than PIC which uses an > > additional redirection using the GOT, right ? 
> > > > > > > Yes. Currently, I think PIC model can resolve most of the pointer reference > > except for the static percpu variables. However, as you mentioned, It will > > lose some performance due to the additional redirection using the GOT. > > > > > > > > (Maybe we can keep the current implementation for some corner cases > > > such as the size of the kernel image is close to 2GB) > > > > > > > > > Either we follow Anup's advice and we don't care about those static > > > percpu variables and if ever they fall behind the 2GB limit, we can > > > not load the module, or we find a way to make sure that the module > > > range comprises the area for module percpu variables which is > > > allocated in setup_per_cpu_areas. I'm taking a look at this solution and > > how other architectures with same code model handle that. > > > > > As far as I know, ARM64 and MIPS64 with CONFIG_KBUILD_64BIT_SYM32 > > also use PC-relative way to access the static perpcu variables. I quickly > > describe their implementation below, please correct me if I have any > > misunderstandings, thanks. > > 1. ARM64 > > When CONFIG_RANDOMIZE_BASE is disabled, they place the module > > region at "(u64)_etext - MODULES_VSIZE(128MB)", and the range limitation > > of PC-relative ( instruction adrp) is +-4GB. When CONFIG_RANDOMIZE_BASE > > is enabled, the maximum access range between module region symbol and > > kernel symbol is limited to 2GB. There is still 2GB room to access the per-cpu > > data. Therefore, I think it is also safe to access the per-cpu variables. > > 2. MIPS64 with CONFIG_KBUILD_64BIT_SYM32 > > The VMALLOC region is placed after the kernel image. Therefore, the start > > address of the kernel image is at the 0xfffffff80100000, and the kernel > > module region is at 0xffffffffc0000000 which is the beginning of the > > VMALLOC. They do not encounter this issue because the range limitation of > > PC-relative is +-2GB. 
> > > > Currently, I have three solutions to fix the percpu data problem, and I list > > them below. Any feedback is welcome. > > > > 1. Create a specific module region locating at [PAGE_OFFSET-SIZE, > > PAGE_OFFSET]. > > I think this way can solve most cases because the general size of kernel > > image is much smaller than 2G, and we still can change the code model from > > PIC to medany. However, it will still encounter the same problem if the > > kernel image is near to 2GB. > > > > 2. Request a new code model for kernel > > Maybe we request compiler guys to generate a new PIC mode for kernel, > > which uses GOT to assess every static per-cpu data instead of PC-relative. > > However, I am not sure whether this idea is accepted by the compiler guys or > > not. If accepted, It may need a long time to realize it. > > > > 3. Jump to a resolved function. > > Zong and I came out with this rough idea to address the problem, but we do > > not go to realize it. > > > > The following is the current code pattern for accessing the percpu data. > > 64: 00000917 auipc s2,0x0 > > 68: 00093903 ld s2,0(s2) # 64 <.L0 > > > > > We plan to change the " ld s2,0(s2) " to "jalr xxx(s2)" to redirect the code to > > the corresponding resolved function. Each static percpu data owns a resolved > > function. These resolved functions will go to get the variable content, and > > then it stores the variable to the target register by decoding the auipc > > instruction. We think this method may solve this problem, but it will drop the > > performance due to the redirection. > > > > I preferred the solution combining method 1 and method 2. > > 1. In general, users can choose method 1 to improve the performance of > > data access. > > 2. For safety, users can use the new code model to access the static percpu > > data without any limitation. > > We can create a new kernel config that looks a bit like MIPS > > CONFIG_KBUILD_64BIT_SYM32 for users to chose. 
> I think we should not waste time in our quest to find a perfect solution. We
> should first have a module_alloc() in-place (based on various suggestions in
> this email thread) so that most people don't see issues. We can fix module
> loading for modules with static per-cpu symbols separately.

I mean that "static" here is the attribute in the C language, not the type of the percpu symbol. Sorry, the description I provided was not accurate and may have caused a misunderstanding, and I also misunderstood the meaning of the static percpu symbols that you mentioned earlier.

> For point1, we don't need a dedicated region as long as we are allocating
> modules from VMALLOC area. Let's avoid another virtual memory region if
> possible.
>

Your comments inspire me: if we can know the end of the percpu data region, we can dynamically calculate the start address in each module allocation by "<end of the percpu data region> - 2GB" if needed. I am finding a way to derive the <end of the percpu data region>.

> For point2, loading symbol address for per-cpu static symbols is always
> PC-relative for most RISC architectures and static per-cpu symbols in
> loadable modules is a known issue (Refer, https://lwn.net/Articles/1509/).
> I don't agree that this should be solved in the compilers because per-cpu data
> management is Linux specific (or OS-specific) so we should rather explore
> ways to handle this in Linux itself. The dynamic per-cpu alloc APIs (i.e.
> alloc_percpu() and friends) were added to primarily benefit loadable
> modules and NUMA systems. The static per-cpu symbols (i.e.
> DEFINE_PER_CPU() and friends) should only be used in built-in modules
> or code linked to the kernel. It is generally very easy to upgrade code using
> static per-cpu symbols to use dynamic per-cpu symbols.
>

I found that we want to deal with the same problem from different viewpoints, and I can understand what you mean now.
I found that other RISC-V users also want the compiler to use the GOT to access static symbols instead of PC-relative addressing:

https://github.com/riscv/riscv-gcc/issues/158
https://groups.google.com/a/groups.riscv.org/forum/#!msg/sw-dev/Xa3FN6gQhPI/59xUWUpBCQAJ

If this feature is supported, I think this issue can be easily resolved :)

^ permalink raw reply	[flat|nested] 30+ messages in thread
* Re: Error on loading some network Kernel modules
  2020-02-05 10:37 ` Vincent Chen
@ 2020-02-07 14:39 ` Vincent Chen
  2020-02-07 14:51 ` Vincent Chen
  0 siblings, 1 reply; 30+ messages in thread
From: Vincent Chen @ 2020-02-07 14:39 UTC (permalink / raw)
To: Anup Patel
Cc: Carlos Eduardo de Paula, Alex Ghiti, David Abdurachmanov, Anup Patel, Zong Li, Romain Dolbeau, Zong Li, linux-riscv, Paul Walmsley, Aurelien Jarno

> > For point1, we don't need a dedicated region as long as we are allocating
> > modules from VMALLOC area. Let's avoid another virtual memory region if
> > possible.
> >
>
> Your comments inspire me: if we can know the end of the percpu data
> region, we can dynamically calculate the start address in each module
> allocation by "<end of the percpu data region> - 2GB" if needed. I am
> finding a way to derive the <end of the percpu data region>.
>

In the implementation of the idea, I found that I did have some misunderstandings about the mechanism of accessing static percpu symbols. Currently, I think this issue can be solved by modifying module_alloc() as follows:

#define MODULE_START max(PFN_ALIGN(&_end - 2 * SZ_1G), \
                         PFN_ALIGN(&_text + 8 * SZ_1K - 2 * SZ_1G))

void *module_alloc(unsigned long size)
{
        return __vmalloc_node_range(size, 1, MODULE_START, VMALLOC_END,
                                    GFP_KERNEL, PAGE_KERNEL_EXEC, 0,
                                    NUMA_NO_NODE,
                                    __builtin_return_address(0));
}

After applying a similar modification, the kernel can successfully insert the test module provided by Romain. Any feedback is welcome. If someone is curious about the reason, I roughly describe it at the end of this mail. I am sorry if anyone was misled by the earlier discussion.

As mentioned earlier, the kernel pre-allocates a memory region for each CPU to place the instances of the percpu data. According to the declaration type, each memory region can be divided into three chunks. These three chunks, from low memory to high memory, are the static chunk, the reserved chunk, and the dynamic chunk.
a. The static chunk is used for the static percpu symbols declared in the kernel.
b. The reserved chunk is used for the static percpu symbols declared in kernel modules.
c. The dynamic chunk is used for all dynamic percpu symbols.

The order of the percpu symbols in each memory region is the same and is recorded in a symbol list maintained by the kernel. Therefore, the address of a specific CPU's percpu symbol is "the address of this percpu symbol in the symbol list" + "the offset from the symbol list to the specific CPU region".
a. The symbol list is created based on the ".data..percpu" section. In other words, the start address of this symbol list is the start address of ".data..percpu". When a dynamic percpu symbol is created or a kernel module imports a static percpu symbol, the kernel follows the above chunk rule to add this percpu symbol to its suitable region.
b. When each memory region was created, "the offset from the symbol list to the specific CPU region" was recorded.

Back to the issue: according to the rules above, the kernel relocates the ".data..percpu" section of the loaded module to the end of the ".data..percpu" section of the kernel image. The end of the ".data..percpu" section of the kernel image almost equals _text, and the current size of the reserved chunk is 8KB. Therefore, the appropriate range to place the loaded module is [max(PFN_ALIGN(&_end - 2 * SZ_1G), PFN_ALIGN(&_text + 8 * SZ_1K - 2 * SZ_1G)), VMALLOC_END].

^ permalink raw reply	[flat|nested] 30+ messages in thread
* Re: Error on loading some network Kernel modules 2020-02-07 14:39 ` Vincent Chen @ 2020-02-07 14:51 ` Vincent Chen 2020-02-10 6:37 ` Alex Ghiti 0 siblings, 1 reply; 30+ messages in thread From: Vincent Chen @ 2020-02-07 14:51 UTC (permalink / raw) To: Anup Patel Cc: Carlos Eduardo de Paula, Alex Ghiti, David Abdurachmanov, Anup Patel, Zong Li, Romain Dolbeau, Zong Li, linux-riscv, Paul Walmsley, Aurelien Jarno On Fri, Feb 7, 2020 at 10:39 PM Vincent Chen <vincent.chen@sifive.com> wrote: > > > > For point1, we don't need a dedicated region as long as we are allocating > > > modules from VMALLOC area. Let's avoid another virtual memory region if > > > possible. > > > > > > > Your comments inspire me if we can know the end of the percpu data > > region, we can dynamically calculate the start address in each module > > allocation by "<end of the percpu data region> - 2GB" if needed. I am > > finding a way to derive the <end of the percpu data region>. > > > > In the implementation of the idea, I found that I did have some > misunderstandings about the mechanism of accessing static percpu > symbol. Currently, I think this issue can be solved by modifying the > module_allocate() to the following. > #define MODULE_START max(PFN_ALIGN(&_end - 2 * > SZ_1G), PFN_ALIGN(&_text + 8 * SZ_1K - 2 * SZ_1G)) > void *module_alloc(unsigned long size) > { > return __vmalloc_node_range(size, 1, MODULE_START, > VMALLOC_END, GFP_KERNEL, > PAGE_KERNEL_EXEC, 0, > NUMA_NO_NODE, > __builtin_return_address(0)); > } > After applying a similar modification above, the kernel can > successfully insert the test module provided by Romain. Any feedback > is welcome. If someone is curious about the reason, I roughly > described it at the end of this mail. I am sorry if someone was misled > by early discussion. > > > As mentioned early, kernel pre-allocates a memory region for each CPU > to place the instances of the percpu data. 
According to the > declaration type, each memory region can be divided into three chunks. > These three chunks from low memory to high memory are static chunk, > reserved chunk, and dynamic chunk. > a. The static chunk is used for the static percpu symbols declared in the kernel > b. The reserved chunk is used for the static percpu symbols declared > in the kernel module > c. The dynamic chunk is used for all dynamic percpu symbols. > > The order of the percpu symbols in each memory region is the same and > is recorded in a symbol list maintained by the kernel. Therefore, the > address of a specific CPU's percpu symbol is "The address of this > percpu symbol in the symbol list" + "The offset from symbols list to > the specific CPU region". > a. The symbol list is created based on the ".data..percpu" section. In > other words, the start address of this symbol list is the start > address of ".data..percpu". When a dynamic percpu symbol is created or > a kernel module import a static percpu symbol, the kernel will follow > the above chunk rule to add this percpu symbol to its suitable region. > b. When each memory area was created, "The offset from symbols list to > the specific CPU region" has been recorded. > > Back to the issue, according to the rules above, the kernel will > relocate the ".data..percpu" section of the loaded module to the end > of the ".data..percpu" section of the kernel image. The end of the > ".data..percpu" section of the kernel image almost equals _text, and > the current size of the reserved chunk is 8KB. Therefore, the > appropriate range to place the loaded module is [max(PFN_ALIGN(&_end - > 2 *SZ_1G), PFN_ALIGN(&_text + 8 * SZ_1K - 2 * SZ_1G)), VMALLOC_END]. I did not notice that the size of the kernel .text section is impossible below 2 PAGES. Therefore, the modification of module_alloc function can be reduced to the following. 
void *module_alloc(unsigned long size)
{
        return __vmalloc_node_range(size, 1, PFN_ALIGN(&_end - 2 * SZ_1G),
                                    VMALLOC_END, GFP_KERNEL,
                                    PAGE_KERNEL_EXEC, 0, NUMA_NO_NODE,
                                    __builtin_return_address(0));
}

It is the same as Alex's suggestion. Is Alex willing to send this patch to resolve this bug?

^ permalink raw reply	[flat|nested] 30+ messages in thread
* Re: Error on loading some network Kernel modules 2020-02-07 14:51 ` Vincent Chen @ 2020-02-10 6:37 ` Alex Ghiti 2020-02-10 9:53 ` Vincent Chen 0 siblings, 1 reply; 30+ messages in thread From: Alex Ghiti @ 2020-02-10 6:37 UTC (permalink / raw) To: Vincent Chen, Anup Patel Cc: Carlos Eduardo de Paula, David Abdurachmanov, Anup Patel, Paul Walmsley, Zong Li, Romain Dolbeau, Zong Li, linux-riscv, Aurelien Jarno Hi Vincent, On 2/7/20 9:51 AM, Vincent Chen wrote: > On Fri, Feb 7, 2020 at 10:39 PM Vincent Chen <vincent.chen@sifive.com> wrote: >> >>>> For point1, we don't need a dedicated region as long as we are allocating >>>> modules from VMALLOC area. Let's avoid another virtual memory region if >>>> possible. >>>> >>> >>> Your comments inspire me if we can know the end of the percpu data >>> region, we can dynamically calculate the start address in each module >>> allocation by "<end of the percpu data region> - 2GB" if needed. I am >>> finding a way to derive the <end of the percpu data region>. >>> >> >> In the implementation of the idea, I found that I did have some >> misunderstandings about the mechanism of accessing static percpu >> symbol. Currently, I think this issue can be solved by modifying the >> module_allocate() to the following. >> #define MODULE_START max(PFN_ALIGN(&_end - 2 * >> SZ_1G), PFN_ALIGN(&_text + 8 * SZ_1K - 2 * SZ_1G)) >> void *module_alloc(unsigned long size) >> { >> return __vmalloc_node_range(size, 1, MODULE_START, >> VMALLOC_END, GFP_KERNEL, >> PAGE_KERNEL_EXEC, 0, >> NUMA_NO_NODE, >> __builtin_return_address(0)); >> } >> After applying a similar modification above, the kernel can >> successfully insert the test module provided by Romain. Any feedback >> is welcome. If someone is curious about the reason, I roughly >> described it at the end of this mail. I am sorry if someone was misled >> by early discussion. >> >> >> As mentioned early, kernel pre-allocates a memory region for each CPU >> to place the instances of the percpu data. 
According to the >> declaration type, each memory region can be divided into three chunks. >> These three chunks from low memory to high memory are static chunk, >> reserved chunk, and dynamic chunk. >> a. The static chunk is used for the static percpu symbols declared in the kernel >> b. The reserved chunk is used for the static percpu symbols declared >> in the kernel module >> c. The dynamic chunk is used for all dynamic percpu symbols. >> >> The order of the percpu symbols in each memory region is the same and >> is recorded in a symbol list maintained by the kernel. Therefore, the >> address of a specific CPU's percpu symbol is "The address of this >> percpu symbol in the symbol list" + "The offset from symbols list to >> the specific CPU region". >> a. The symbol list is created based on the ".data..percpu" section. In >> other words, the start address of this symbol list is the start >> address of ".data..percpu". When a dynamic percpu symbol is created or >> a kernel module import a static percpu symbol, the kernel will follow >> the above chunk rule to add this percpu symbol to its suitable region. >> b. When each memory area was created, "The offset from symbols list to >> the specific CPU region" has been recorded. >> >> Back to the issue, according to the rules above, the kernel will >> relocate the ".data..percpu" section of the loaded module to the end >> of the ".data..percpu" section of the kernel image. The end of the >> ".data..percpu" section of the kernel image almost equals _text, and >> the current size of the reserved chunk is 8KB. Therefore, the >> appropriate range to place the loaded module is [max(PFN_ALIGN(&_end - >> 2 *SZ_1G), PFN_ALIGN(&_text + 8 * SZ_1K - 2 * SZ_1G)), VMALLOC_END]. > > I did not notice that the size of the kernel .text section is > impossible below 2 PAGES. Therefore, the modification of module_alloc > function can be reduced to the following. 
> void *module_alloc(unsigned long size)
> {
>         return __vmalloc_node_range(size, 1, PFN_ALIGN(&_end - 2 * SZ_1G),
>                                     VMALLOC_END, GFP_KERNEL,
>                                     PAGE_KERNEL_EXEC, 0, NUMA_NO_NODE,
>                                     __builtin_return_address(0));
> }
> It is the same as Alex's suggestion. Is Alex willing to send this
> patch to resolve this bug?
>

You did all the work, please send a patch explaining what you learnt and feel free to add a Suggested-By. Anyway, I'll add a Reviewed-by when you propose something.

Thank you Vincent for all your research,

Alex

^ permalink raw reply	[flat|nested] 30+ messages in thread
* Re: Error on loading some network Kernel modules 2020-02-10 6:37 ` Alex Ghiti @ 2020-02-10 9:53 ` Vincent Chen 2020-02-19 6:46 ` Alex Ghiti 0 siblings, 1 reply; 30+ messages in thread From: Vincent Chen @ 2020-02-10 9:53 UTC (permalink / raw) To: Alex Ghiti Cc: Carlos Eduardo de Paula, David Abdurachmanov, Anup Patel, Anup Patel, Paul Walmsley, Zong Li, Romain Dolbeau, Zong Li, linux-riscv, Aurelien Jarno On Mon, Feb 10, 2020 at 2:37 PM Alex Ghiti <alex@ghiti.fr> wrote: > > Hi Vincent, > > On 2/7/20 9:51 AM, Vincent Chen wrote: > > On Fri, Feb 7, 2020 at 10:39 PM Vincent Chen <vincent.chen@sifive.com> wrote: > >> > >>>> For point1, we don't need a dedicated region as long as we are allocating > >>>> modules from VMALLOC area. Let's avoid another virtual memory region if > >>>> possible. > >>>> > >>> > >>> Your comments inspire me if we can know the end of the percpu data > >>> region, we can dynamically calculate the start address in each module > >>> allocation by "<end of the percpu data region> - 2GB" if needed. I am > >>> finding a way to derive the <end of the percpu data region>. > >>> > >> > >> In the implementation of the idea, I found that I did have some > >> misunderstandings about the mechanism of accessing static percpu > >> symbol. Currently, I think this issue can be solved by modifying the > >> module_allocate() to the following. > >> #define MODULE_START max(PFN_ALIGN(&_end - 2 * > >> SZ_1G), PFN_ALIGN(&_text + 8 * SZ_1K - 2 * SZ_1G)) > >> void *module_alloc(unsigned long size) > >> { > >> return __vmalloc_node_range(size, 1, MODULE_START, > >> VMALLOC_END, GFP_KERNEL, > >> PAGE_KERNEL_EXEC, 0, > >> NUMA_NO_NODE, > >> __builtin_return_address(0)); > >> } > >> After applying a similar modification above, the kernel can > >> successfully insert the test module provided by Romain. Any feedback > >> is welcome. If someone is curious about the reason, I roughly > >> described it at the end of this mail. 
I am sorry if someone was misled > >> by the earlier discussion. > >> > >> > >> As mentioned earlier, the kernel pre-allocates a memory region for each CPU > >> to place the instances of the percpu data. According to the > >> declaration type, each memory region can be divided into three chunks. > >> These three chunks, from low memory to high memory, are the static chunk, > >> the reserved chunk, and the dynamic chunk. > >> a. The static chunk is used for the static percpu symbols declared in the kernel > >> b. The reserved chunk is used for the static percpu symbols declared > >> in kernel modules > >> c. The dynamic chunk is used for all dynamic percpu symbols. > >> > >> The order of the percpu symbols in each memory region is the same and > >> is recorded in a symbol list maintained by the kernel. Therefore, the > >> address of a specific CPU's percpu symbol is "the address of this > >> percpu symbol in the symbol list" + "the offset from the symbol list to > >> the specific CPU region". > >> a. The symbol list is created based on the ".data..percpu" section. In > >> other words, the start address of this symbol list is the start > >> address of ".data..percpu". When a dynamic percpu symbol is created or > >> a kernel module imports a static percpu symbol, the kernel will follow > >> the above chunk rule to add this percpu symbol to its suitable region. > >> b. When each memory region was created, "the offset from the symbol list to > >> the specific CPU region" was recorded. > >> > >> Back to the issue, according to the rules above, the kernel will > >> relocate the ".data..percpu" section of the loaded module to the end > >> of the ".data..percpu" section of the kernel image. The end of the > >> ".data..percpu" section of the kernel image almost equals _text, and > >> the current size of the reserved chunk is 8KB. Therefore, the > >> appropriate range to place the loaded module is [max(PFN_ALIGN(&_end - > >> 2 * SZ_1G), PFN_ALIGN(&_text + 8 * SZ_1K - 2 * SZ_1G)), VMALLOC_END].
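The chunk layout described above can be sketched as a small executable model. All sizes and offsets here are invented for illustration; the real logic lives in the kernel's percpu allocator (mm/percpu.c):

```python
# One per-CPU memory region holds, from low to high addresses:
# the static chunk, the reserved chunk, then the dynamic chunk.
# A percpu symbol's address on a given CPU is
#   <address of the symbol in the symbol list> + <that CPU's region offset>.
STATIC_SIZE = 0x800        # hypothetical size of the kernel's .data..percpu
RESERVED_SIZE = 8 * 1024   # reserved chunk, for module static percpu symbols

symbol_list_base = 0x1000                    # start of .data..percpu
per_cpu_offset = {0: 0x100000, 1: 0x200000}  # recorded when each region is created

def kernel_static_percpu_addr(sym_offset: int, cpu: int) -> int:
    """Address on `cpu` of a static percpu symbol declared in the kernel."""
    return symbol_list_base + sym_offset + per_cpu_offset[cpu]

def module_static_percpu_addr(sym_offset: int, cpu: int) -> int:
    """Module static percpu symbols land in the reserved chunk,
    which starts right after the kernel's static chunk."""
    return symbol_list_base + STATIC_SIZE + sym_offset + per_cpu_offset[cpu]

# The same symbol resolves to a different address on each CPU, but the
# distance between two CPUs' copies is the same for every symbol.
assert kernel_static_percpu_addr(0x10, 1) - kernel_static_percpu_addr(0x10, 0) == 0x100000
assert module_static_percpu_addr(0x10, 0) == 0x1000 + 0x800 + 0x10 + 0x100000
```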
> > > I did not notice that the size of the kernel .text section cannot > > be smaller than 2 pages. Therefore, the modification of the module_alloc > > function can be reduced to the following. > > void *module_alloc(unsigned long size) > > { > > return __vmalloc_node_range(size, 1, PFN_ALIGN(&_end - 2 * SZ_1G), > > VMALLOC_END, GFP_KERNEL, > > PAGE_KERNEL_EXEC, 0, > > NUMA_NO_NODE, > > __builtin_return_address(0)); > > } > > It is the same as Alex's suggestion. Is Alex willing to send this > > patch to resolve this bug? > > > You did all the work, please send a patch explaining what you learnt and > feel free to add a Suggested-by. Anyway, I'll add a Reviewed-by when you > propose something. > > Thank you Vincent for all your research, > > Alex OK, I understood. Thank you and Anup for giving me some suggestions from different viewpoints. I will add Suggested-by for you all if possible :) Vincent ^ permalink raw reply [flat|nested] 30+ messages in thread
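The simplification Vincent describes — dropping the `&_text + 8 * SZ_1K - 2 * SZ_1G` term from the `max()` — holds because `_end - _text` spans the whole kernel image, which is always at least as large as the 8 KiB reserved chunk (the .text section alone is at least 2 pages). A quick sanity check of that argument, using hypothetical layout values:

```python
SZ_1G = 1 << 30
SZ_8K = 8 << 10

def module_start(text: int, end: int) -> int:
    """MODULE_START from the earlier proposal, before the simplification."""
    return max(end - 2 * SZ_1G, text + SZ_8K - 2 * SZ_1G)

# Whenever _end >= _text + 8K (guaranteed, since the image is at least
# 2 pages of .text), the first term dominates and the max() is redundant.
_text = 0xffffffe000000000
for image_size in (SZ_8K, 1 << 20, 16 << 20):
    _end = _text + image_size
    assert module_start(_text, _end) == _end - 2 * SZ_1G
```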
* Re: Error on loading some network Kernel modules 2020-02-10 9:53 ` Vincent Chen @ 2020-02-19 6:46 ` Alex Ghiti 2020-02-19 7:30 ` Vincent Chen 0 siblings, 1 reply; 30+ messages in thread From: Alex Ghiti @ 2020-02-19 6:46 UTC (permalink / raw) To: Vincent Chen Cc: Carlos Eduardo de Paula, David Abdurachmanov, Anup Patel, Anup Patel, Paul Walmsley, Zong Li, Romain Dolbeau, Zong Li, linux-riscv, Aurelien Jarno Hi Vincent, On 2/10/20 4:53 AM, Vincent Chen wrote: > On Mon, Feb 10, 2020 at 2:37 PM Alex Ghiti <alex@ghiti.fr> wrote: >> >> Hi Vincent, >> >> On 2/7/20 9:51 AM, Vincent Chen wrote: >>> On Fri, Feb 7, 2020 at 10:39 PM Vincent Chen <vincent.chen@sifive.com> wrote: >>>> >>>>>> For point1, we don't need a dedicated region as long as we are allocating >>>>>> modules from VMALLOC area. Let's avoid another virtual memory region if >>>>>> possible. >>>>>> >>>>> >>>>> Your comments inspire me if we can know the end of the percpu data >>>>> region, we can dynamically calculate the start address in each module >>>>> allocation by "<end of the percpu data region> - 2GB" if needed. I am >>>>> finding a way to derive the <end of the percpu data region>. >>>>> >>>> >>>> In the implementation of the idea, I found that I did have some >>>> misunderstandings about the mechanism of accessing static percpu >>>> symbol. Currently, I think this issue can be solved by modifying the >>>> module_allocate() to the following. >>>> #define MODULE_START max(PFN_ALIGN(&_end - 2 * >>>> SZ_1G), PFN_ALIGN(&_text + 8 * SZ_1K - 2 * SZ_1G)) >>>> void *module_alloc(unsigned long size) >>>> { >>>> return __vmalloc_node_range(size, 1, MODULE_START, >>>> VMALLOC_END, GFP_KERNEL, >>>> PAGE_KERNEL_EXEC, 0, >>>> NUMA_NO_NODE, >>>> __builtin_return_address(0)); >>>> } >>>> After applying a similar modification above, the kernel can >>>> successfully insert the test module provided by Romain. Any feedback >>>> is welcome. 
If someone is curious about the reason, I roughly >>>> described it at the end of this mail. I am sorry if someone was misled >>>> by early discussion. >>>> >>>> >>>> As mentioned early, kernel pre-allocates a memory region for each CPU >>>> to place the instances of the percpu data. According to the >>>> declaration type, each memory region can be divided into three chunks. >>>> These three chunks from low memory to high memory are static chunk, >>>> reserved chunk, and dynamic chunk. >>>> a. The static chunk is used for the static percpu symbols declared in the kernel >>>> b. The reserved chunk is used for the static percpu symbols declared >>>> in the kernel module >>>> c. The dynamic chunk is used for all dynamic percpu symbols. >>>> >>>> The order of the percpu symbols in each memory region is the same and >>>> is recorded in a symbol list maintained by the kernel. Therefore, the >>>> address of a specific CPU's percpu symbol is "The address of this >>>> percpu symbol in the symbol list" + "The offset from symbols list to >>>> the specific CPU region". >>>> a. The symbol list is created based on the ".data..percpu" section. In >>>> other words, the start address of this symbol list is the start >>>> address of ".data..percpu". When a dynamic percpu symbol is created or >>>> a kernel module import a static percpu symbol, the kernel will follow >>>> the above chunk rule to add this percpu symbol to its suitable region. >>>> b. When each memory area was created, "The offset from symbols list to >>>> the specific CPU region" has been recorded. >>>> >>>> Back to the issue, according to the rules above, the kernel will >>>> relocate the ".data..percpu" section of the loaded module to the end >>>> of the ".data..percpu" section of the kernel image. The end of the >>>> ".data..percpu" section of the kernel image almost equals _text, and >>>> the current size of the reserved chunk is 8KB. 
Therefore, the >>>> appropriate range to place the loaded module is [max(PFN_ALIGN(&_end - >>>> 2 *SZ_1G), PFN_ALIGN(&_text + 8 * SZ_1K - 2 * SZ_1G)), VMALLOC_END]. >>> >>> I did not notice that the size of the kernel .text section is >>> impossible below 2 PAGES. Therefore, the modification of module_alloc >>> function can be reduced to the following. >>> void *module_alloc(unsigned long size) >>> { >>> return __vmalloc_node_range(size, 1, PFN_ALIGN(&_end - 2 *SZ_1G), >>> VMALLOC_END, GFP_KERNEL, >>> PAGE_KERNEL_EXEC, 0, >>> NUMA_NO_NODE, >>> __builtin_return_address(0)); >>> } >>> It is the same as Alex's suggestion. Is Alex willing to send this >>> patch to resolve this bug? >>> >> >> You did all the work, please send a patch explaining what you learnt and >> feel free to add a Suggested-By. Anyway, I'll add a Reviewed-by when you >> propose something. >> >> Thank you Vincent for all your research, >> >> Alex > > OK, I understood. Thank you and Anup for giving me some suggestions > from different viewpoints. I will add Suggested-By for you all if > possible :) > > Vincent > Are you going to propose something soon regarding this issue ? I strongly believe we need to fix modules loading for 5.6. Thanks, Alex ^ permalink raw reply [flat|nested] 30+ messages in thread
* Re: Error on loading some network Kernel modules 2020-02-19 6:46 ` Alex Ghiti @ 2020-02-19 7:30 ` Vincent Chen 0 siblings, 0 replies; 30+ messages in thread From: Vincent Chen @ 2020-02-19 7:30 UTC (permalink / raw) To: Alex Ghiti Cc: Carlos Eduardo de Paula, David Abdurachmanov, Anup Patel, Anup Patel, Paul Walmsley, Zong Li, Romain Dolbeau, Zong Li, linux-riscv, Aurelien Jarno On Wed, Feb 19, 2020 at 2:46 PM Alex Ghiti <alex@ghiti.fr> wrote: > > Hi Vincent, > > On 2/10/20 4:53 AM, Vincent Chen wrote: > > On Mon, Feb 10, 2020 at 2:37 PM Alex Ghiti <alex@ghiti.fr> wrote: > >> > >> Hi Vincent, > >> > >> On 2/7/20 9:51 AM, Vincent Chen wrote: > >>> On Fri, Feb 7, 2020 at 10:39 PM Vincent Chen <vincent.chen@sifive.com> wrote: > >>>> > >>>>>> For point1, we don't need a dedicated region as long as we are allocating > >>>>>> modules from VMALLOC area. Let's avoid another virtual memory region if > >>>>>> possible. > >>>>>> > >>>>> > >>>>> Your comments inspire me if we can know the end of the percpu data > >>>>> region, we can dynamically calculate the start address in each module > >>>>> allocation by "<end of the percpu data region> - 2GB" if needed. I am > >>>>> finding a way to derive the <end of the percpu data region>. > >>>>> > >>>> > >>>> In the implementation of the idea, I found that I did have some > >>>> misunderstandings about the mechanism of accessing static percpu > >>>> symbol. Currently, I think this issue can be solved by modifying the > >>>> module_allocate() to the following. 
> >>>> #define MODULE_START max(PFN_ALIGN(&_end - 2 * > >>>> SZ_1G), PFN_ALIGN(&_text + 8 * SZ_1K - 2 * SZ_1G)) > >>>> void *module_alloc(unsigned long size) > >>>> { > >>>> return __vmalloc_node_range(size, 1, MODULE_START, > >>>> VMALLOC_END, GFP_KERNEL, > >>>> PAGE_KERNEL_EXEC, 0, > >>>> NUMA_NO_NODE, > >>>> __builtin_return_address(0)); > >>>> } > >>>> After applying a similar modification above, the kernel can > >>>> successfully insert the test module provided by Romain. Any feedback > >>>> is welcome. If someone is curious about the reason, I roughly > >>>> described it at the end of this mail. I am sorry if someone was misled > >>>> by early discussion. > >>>> > >>>> > >>>> As mentioned early, kernel pre-allocates a memory region for each CPU > >>>> to place the instances of the percpu data. According to the > >>>> declaration type, each memory region can be divided into three chunks. > >>>> These three chunks from low memory to high memory are static chunk, > >>>> reserved chunk, and dynamic chunk. > >>>> a. The static chunk is used for the static percpu symbols declared in the kernel > >>>> b. The reserved chunk is used for the static percpu symbols declared > >>>> in the kernel module > >>>> c. The dynamic chunk is used for all dynamic percpu symbols. > >>>> > >>>> The order of the percpu symbols in each memory region is the same and > >>>> is recorded in a symbol list maintained by the kernel. Therefore, the > >>>> address of a specific CPU's percpu symbol is "The address of this > >>>> percpu symbol in the symbol list" + "The offset from symbols list to > >>>> the specific CPU region". > >>>> a. The symbol list is created based on the ".data..percpu" section. In > >>>> other words, the start address of this symbol list is the start > >>>> address of ".data..percpu". 
When a dynamic percpu symbol is created or > >>>> a kernel module import a static percpu symbol, the kernel will follow > >>>> the above chunk rule to add this percpu symbol to its suitable region. > >>>> b. When each memory area was created, "The offset from symbols list to > >>>> the specific CPU region" has been recorded. > >>>> > >>>> Back to the issue, according to the rules above, the kernel will > >>>> relocate the ".data..percpu" section of the loaded module to the end > >>>> of the ".data..percpu" section of the kernel image. The end of the > >>>> ".data..percpu" section of the kernel image almost equals _text, and > >>>> the current size of the reserved chunk is 8KB. Therefore, the > >>>> appropriate range to place the loaded module is [max(PFN_ALIGN(&_end - > >>>> 2 *SZ_1G), PFN_ALIGN(&_text + 8 * SZ_1K - 2 * SZ_1G)), VMALLOC_END]. > >>> > >>> I did not notice that the size of the kernel .text section is > >>> impossible below 2 PAGES. Therefore, the modification of module_alloc > >>> function can be reduced to the following. > >>> void *module_alloc(unsigned long size) > >>> { > >>> return __vmalloc_node_range(size, 1, PFN_ALIGN(&_end - 2 *SZ_1G), > >>> VMALLOC_END, GFP_KERNEL, > >>> PAGE_KERNEL_EXEC, 0, > >>> NUMA_NO_NODE, > >>> __builtin_return_address(0)); > >>> } > >>> It is the same as Alex's suggestion. Is Alex willing to send this > >>> patch to resolve this bug? > >>> > >> > >> You did all the work, please send a patch explaining what you learnt and > >> feel free to add a Suggested-By. Anyway, I'll add a Reviewed-by when you > >> propose something. > >> > >> Thank you Vincent for all your research, > >> > >> Alex > > > > OK, I understood. Thank you and Anup for giving me some suggestions > > from different viewpoints. I will add Suggested-By for you all if > > possible :) > > > > Vincent > > > > Are you going to propose something soon regarding this issue ? I > strongly believe we need to fix modules loading for 5.6. 
> > Thanks, > > Alex Thanks for your reminder. I just posted the patch to the mailing list. Vincent ^ permalink raw reply [flat|nested] 30+ messages in thread
* Re: Error on loading some network Kernel modules 2020-02-04 11:30 ` Anup Patel 2020-02-04 14:03 ` Vincent Chen @ 2020-02-04 17:48 ` Romain Dolbeau 1 sibling, 0 replies; 30+ messages in thread From: Romain Dolbeau @ 2020-02-04 17:48 UTC (permalink / raw) To: Anup Patel, Vincent Chen Cc: Carlos Eduardo de Paula, Alexandre Ghiti, David Abdurachmanov, Paul Walmsley, Zong Li, Zong Li, linux-riscv, Aurelien Jarno On 2020-02-04 12:30, Anup Patel wrote: > for such modules just > convert these percpu symbols into dynamic percpu variables. Indeed you are right for my case; converting from static to dynamic percpu variable solves the problem in my out-of-tree case of interest, thanks! Cordially, -- Romain Dolbeau ^ permalink raw reply [flat|nested] 30+ messages in thread
* Re: Error on loading some network Kernel modules 2020-02-02 16:27 ` Romain Dolbeau 2020-02-03 10:04 ` Vincent Chen @ 2020-02-03 20:57 ` Alex Ghiti 1 sibling, 0 replies; 30+ messages in thread From: Alex Ghiti @ 2020-02-03 20:57 UTC (permalink / raw) To: Romain Dolbeau, Anup Patel, Aurelien Jarno Cc: Vincent Chen, Carlos Eduardo de Paula, Paul Walmsley, linux-riscv, David Abdurachmanov Hi Romain, On 2/2/20 11:27 AM, Romain Dolbeau wrote: > On 2020-02-01 14:59, Alex Ghiti wrote: >> Why restrict to 128M whereas we have 2GB offset available to the end >> of the kernel ? > > Isn't that 2 GiB offset to whatever the module requires in the kernel, > rather than to the end of the kernel space? Yes it is, _end is defined as the last symbol in the kernel: so the modules need to lie somewhere they can, at worst, access this last symbol; having the modules in the vmalloc zone from [_end - 2GB; VMALLOC_END] allows that, right? Or am I mistaken? Alex > > Is there some guarantee that symbols accessible by modules are at the > end of the kernel? If so, wouldn't the maximum offset for this patch > still be (2 GiB - <total size of accessible symbols>)? > > Cordially, > ^ permalink raw reply [flat|nested] 30+ messages in thread
end of thread, other threads:[~2020-02-19 7:30 UTC | newest]

Thread overview: 30+ messages (download: mbox.gz / follow: Atom feed)
-- links below jump to the message on this page --
2020-01-28 19:22 Error on loading some network Kernel modules Carlos Eduardo de Paula
2020-01-30  3:00 ` Paul Walmsley
2020-01-30 16:20 ` David Abdurachmanov
2020-01-31 14:12 ` Carlos Eduardo de Paula
2020-01-31 20:11 ` Aurelien Jarno
2020-02-01  7:52 ` Anup Patel
2020-02-01 13:59 ` Alex Ghiti
2020-02-02 14:10 ` Anup Patel
2020-02-02 15:21 ` Troy Benjegerdes
2020-02-02 16:27 ` Romain Dolbeau
2020-02-03 10:04 ` Vincent Chen
2020-02-04  3:55 ` Zong Li
2020-02-04  6:50 ` Alex Ghiti
2020-02-04  7:19 ` Zong Li
2020-02-04  9:32 ` Alexandre Ghiti
2020-02-04 10:46 ` Vincent Chen
2020-02-04 11:30 ` Anup Patel
2020-02-04 14:03 ` Vincent Chen
2020-02-04 19:10 ` Alex Ghiti
[not found] ` <a55f265e-71b2-5ebb-b079-6345007a442e@ghiti.fr>
2020-02-05  3:22 ` Vincent Chen
2020-02-05  4:24 ` Anup Patel
2020-02-05 10:37 ` Vincent Chen
2020-02-07 14:39 ` Vincent Chen
2020-02-07 14:51 ` Vincent Chen
2020-02-10  6:37 ` Alex Ghiti
2020-02-10  9:53 ` Vincent Chen
2020-02-19  6:46 ` Alex Ghiti
2020-02-19  7:30 ` Vincent Chen
2020-02-04 17:48 ` Romain Dolbeau
2020-02-03 20:57 ` Alex Ghiti