Linux-RISC-V Archive on lore.kernel.org
* Error on loading some network Kernel modules
@ 2020-01-28 19:22 Carlos Eduardo de Paula
  2020-01-30  3:00 ` Paul Walmsley
  0 siblings, 1 reply; 30+ messages in thread
From: Carlos Eduardo de Paula @ 2020-01-28 19:22 UTC (permalink / raw)
  To: linux-riscv

I currently run a stock 5.5.0-rc7 kernel on my Unleashed board with most
of the network features I need baked into the kernel instead of built
as modules.

I tried building a kernel with these network features as modules, but
when loading some of them I got this error:

root@unleashed:~# uname -a
Linux unleashed 5.5.0-rc7-dirty #4 SMP Fri Jan 24 18:16:43 -02 2020
riscv64 GNU/Linux

root@unleashed:~# modprobe br_netfilter
[  139.290533] br_netfilter: target ffffffe0000422d8 can not be
addressed by the 32-bit offset from PC = 000000003dfd6deb

root@unleashed:~# modprobe openvswitch
[  101.830567] openvswitch: target ffffffe0000422d8 can not be
addressed by the 32-bit offset from PC = 000000007e236a39
modprobe: ERROR: could not insert 'openvswitch': Invalid argument

The config I used to build the modules is on:
https://gist.github.com/carlosedp/9f7a6ecde1dc9cebdda62b60d823b15a

The config I use with all features built in, which works, is at:
https://github.com/carlosedp/riscv-bringup/raw/master/unleashed/unleashed_config
-- 
________________________________________
Carlos Eduardo de Paula
me@carlosedp.com
http://carlosedp.com
http://twitter.com/carlosedp
Linkedin
________________________________________


^ permalink raw reply	[flat|nested] 30+ messages in thread

* Re: Error on loading some network Kernel modules
  2020-01-28 19:22 Error on loading some network Kernel modules Carlos Eduardo de Paula
@ 2020-01-30  3:00 ` Paul Walmsley
  2020-01-30 16:20   ` David Abdurachmanov
  0 siblings, 1 reply; 30+ messages in thread
From: Paul Walmsley @ 2020-01-30  3:00 UTC (permalink / raw)
  To: Carlos Eduardo de Paula
  Cc: vincent.chen, linux-riscv, Romain Dolbeau, Aurelien Jarno

On Tue, 28 Jan 2020, Carlos Eduardo de Paula wrote:

> I currently run stock Kernel 5.5.0-rc7 on my Unleashed board with most
> network features I need baked into the kernel instead of modules.
> 
> I tried building a kernel with these network features as modules but
> when loading some of them I got this error:
> 
> root@unleashed:~# uname -a
> Linux unleashed 5.5.0-rc7-dirty #4 SMP Fri Jan 24 18:16:43 -02 2020
> riscv64 GNU/Linux
> 
> root@unleashed:~# modprobe br_netfilter
> [  139.290533] br_netfilter: target ffffffe0000422d8 can not be
> addressed by the 32-bit offset from PC = 000000003dfd6deb

This is a known issue:

https://lore.kernel.org/linux-riscv/1572281840733.3517@european-processor-initiative.eu/

https://lore.kernel.org/linux-riscv/20191029105055.GA20736@aurel32.net/#r

Vincent is looking into it and I expect there will be a fix soon.


- Paul



* Re: Error on loading some network Kernel modules
  2020-01-30  3:00 ` Paul Walmsley
@ 2020-01-30 16:20   ` David Abdurachmanov
  2020-01-31 14:12     ` Carlos Eduardo de Paula
  2020-01-31 20:11     ` Aurelien Jarno
  0 siblings, 2 replies; 30+ messages in thread
From: David Abdurachmanov @ 2020-01-30 16:20 UTC (permalink / raw)
  To: Paul Walmsley
  Cc: Romain Dolbeau, Vincent Chen, Carlos Eduardo de Paula,
	linux-riscv, Aurelien Jarno

On Thu, Jan 30, 2020 at 4:00 AM Paul Walmsley <paul@pwsan.com> wrote:
>
> On Tue, 28 Jan 2020, Carlos Eduardo de Paula wrote:
>
> > I currently run stock Kernel 5.5.0-rc7 on my Unleashed board with most
> > network features I need baked into the kernel instead of modules.
> >
> > I tried building a kernel with these network features as modules but
> > when loading some of them I got this error:
> >
> > root@unleashed:~# uname -a
> > Linux unleashed 5.5.0-rc7-dirty #4 SMP Fri Jan 24 18:16:43 -02 2020
> > riscv64 GNU/Linux
> >
> > root@unleashed:~# modprobe br_netfilter
> > [  139.290533] br_netfilter: target ffffffe0000422d8 can not be
> > addressed by the 32-bit offset from PC = 000000003dfd6deb
>
> This is a known issue:
>
> https://lore.kernel.org/linux-riscv/1572281840733.3517@european-processor-initiative.eu/
>
> https://lore.kernel.org/linux-riscv/20191029105055.GA20736@aurel32.net/#r
>
> Vincent is looking into it and I expect there will be a fix soon.
>

Does this patch solve the problem?

https://github.com/bjoto/linux/commit/8a56d1c8e8e91c1bc3893946d52b9217c96e1589

It was included in
https://github.com/bjoto/linux/commits/rv64-bpf-jit-bcc (not posted
upstream).

david



* Re: Error on loading some network Kernel modules
  2020-01-30 16:20   ` David Abdurachmanov
@ 2020-01-31 14:12     ` Carlos Eduardo de Paula
  2020-01-31 20:11     ` Aurelien Jarno
  1 sibling, 0 replies; 30+ messages in thread
From: Carlos Eduardo de Paula @ 2020-01-31 14:12 UTC (permalink / raw)
  To: David Abdurachmanov
  Cc: Romain Dolbeau, Vincent Chen, Paul Walmsley, linux-riscv, Aurelien Jarno

I just built v5.3-rc4 (the tag I currently use) with the patch David
pointed to (https://github.com/bjoto/linux/commit/8a56d1c8e8e91c1bc3893946d52b9217c96e1589)
and can report that it works fine.

I've been able to load network modules (nf_conntrack, openvswitch and
others) with no problems. I also tested with Docker, which requires
netfilter among others; everything works:

> lsmod
Module                  Size  Used by
xt_nat                  4371  4
veth                   21849  0
vxlan                  71685  0
xt_policy               4248  0
iptable_mangle          2375  0
xt_mark                 1510  0
xt_u32                  2272  0
xt_conntrack            4517  4
xt_MASQUERADE           1893  4
nf_conntrack_netlink    48190  0
nfnetlink               8648  2 nf_conntrack_netlink
xfrm_user              39665  1
xfrm_algo               6723  1 xfrm_user
xt_addrtype             4589  2
iptable_filter          2484  1
iptable_nat             2669  2
openvswitch           194024  0
nsh                     3693  1 openvswitch
nf_conncount           11932  1 openvswitch
nf_nat                 41006  4 xt_nat,openvswitch,iptable_nat,xt_MASQUERADE
nf_conntrack          146281  7 xt_conntrack,nf_nat,xt_nat,openvswitch,nf_conntrack_netlink,nf_conncount,xt_MASQUERADE
nf_defrag_ipv6         10727  2 nf_conntrack,openvswitch
nf_defrag_ipv4          2538  1 nf_conntrack
nbd                    39492  2
overlay               137654  0
br_netfilter           22333  0
bridge                221254  1 br_netfilter
stp                     2801  1 bridge
llc                     6044  2 bridge,stp
ip_tables              17472  3 iptable_filter,iptable_nat,iptable_mangle

Carlos

On Thu, Jan 30, 2020 at 1:21 PM David Abdurachmanov
<david.abdurachmanov@gmail.com> wrote:
>
> On Thu, Jan 30, 2020 at 4:00 AM Paul Walmsley <paul@pwsan.com> wrote:
> >
> > On Tue, 28 Jan 2020, Carlos Eduardo de Paula wrote:
> >
> > > I currently run stock Kernel 5.5.0-rc7 on my Unleashed board with most
> > > network features I need baked into the kernel instead of modules.
> > >
> > > I tried building a kernel with these network features as modules but
> > > when loading some of them I got this error:
> > >
> > > root@unleashed:~# uname -a
> > > Linux unleashed 5.5.0-rc7-dirty #4 SMP Fri Jan 24 18:16:43 -02 2020
> > > riscv64 GNU/Linux
> > >
> > > root@unleashed:~# modprobe br_netfilter
> > > [  139.290533] br_netfilter: target ffffffe0000422d8 can not be
> > > addressed by the 32-bit offset from PC = 000000003dfd6deb
> >
> > This is a known issue:
> >
> > https://lore.kernel.org/linux-riscv/1572281840733.3517@european-processor-initiative.eu/
> >
> > https://lore.kernel.org/linux-riscv/20191029105055.GA20736@aurel32.net/#r
> >
> > Vincent is looking into it and I expect there will be a fix soon.
> >
>
> Is this patch solving the problem?
>
> https://github.com/bjoto/linux/commit/8a56d1c8e8e91c1bc3893946d52b9217c96e1589
>
> It was incl. into
> https://github.com/bjoto/linux/commits/rv64-bpf-jit-bcc (not posted
> upstream).
>
> david



-- 
________________________________________
Carlos Eduardo de Paula
me@carlosedp.com
http://carlosedp.com
http://twitter.com/carlosedp
Linkedin
________________________________________



* Re: Error on loading some network Kernel modules
  2020-01-30 16:20   ` David Abdurachmanov
  2020-01-31 14:12     ` Carlos Eduardo de Paula
@ 2020-01-31 20:11     ` Aurelien Jarno
  2020-02-01  7:52       ` Anup Patel
  1 sibling, 1 reply; 30+ messages in thread
From: Aurelien Jarno @ 2020-01-31 20:11 UTC (permalink / raw)
  To: David Abdurachmanov
  Cc: Romain Dolbeau, Vincent Chen, Paul Walmsley,
	Carlos Eduardo de Paula, linux-riscv

On 2020-01-30 17:20, David Abdurachmanov wrote:
> On Thu, Jan 30, 2020 at 4:00 AM Paul Walmsley <paul@pwsan.com> wrote:
> >
> > On Tue, 28 Jan 2020, Carlos Eduardo de Paula wrote:
> >
> > > I currently run stock Kernel 5.5.0-rc7 on my Unleashed board with most
> > > network features I need baked into the kernel instead of modules.
> > >
> > > I tried building a kernel with these network features as modules but
> > > when loading some of them I got this error:
> > >
> > > root@unleashed:~# uname -a
> > > Linux unleashed 5.5.0-rc7-dirty #4 SMP Fri Jan 24 18:16:43 -02 2020
> > > riscv64 GNU/Linux
> > >
> > > root@unleashed:~# modprobe br_netfilter
> > > [  139.290533] br_netfilter: target ffffffe0000422d8 can not be
> > > addressed by the 32-bit offset from PC = 000000003dfd6deb
> >
> > This is a known issue:
> >
> > https://lore.kernel.org/linux-riscv/1572281840733.3517@european-processor-initiative.eu/
> >
> > https://lore.kernel.org/linux-riscv/20191029105055.GA20736@aurel32.net/#r
> >
> > Vincent is looking into it and I expect there will be a fix soon.
> >
> 
> Is this patch solving the problem?
> 
> https://github.com/bjoto/linux/commit/8a56d1c8e8e91c1bc3893946d52b9217c96e1589

I have just tried that patch on top of a 5.4.13 kernel, and I confirm it
fixes the issue. Thanks!

Aurelien

-- 
Aurelien Jarno                          GPG: 4096R/1DDD8C9B
aurelien@aurel32.net                 http://www.aurel32.net



* Re: Error on loading some network Kernel modules
  2020-01-31 20:11     ` Aurelien Jarno
@ 2020-02-01  7:52       ` Anup Patel
  2020-02-01 13:59         ` Alex Ghiti
  0 siblings, 1 reply; 30+ messages in thread
From: Anup Patel @ 2020-02-01  7:52 UTC (permalink / raw)
  To: Aurelien Jarno
  Cc: Carlos Eduardo de Paula, David Abdurachmanov, Paul Walmsley,
	Vincent Chen, Romain Dolbeau, linux-riscv

On Sat, Feb 1, 2020 at 1:41 AM Aurelien Jarno <aurelien@aurel32.net> wrote:
>
> On 2020-01-30 17:20, David Abdurachmanov wrote:
> > On Thu, Jan 30, 2020 at 4:00 AM Paul Walmsley <paul@pwsan.com> wrote:
> > >
> > > On Tue, 28 Jan 2020, Carlos Eduardo de Paula wrote:
> > >
> > > > I currently run stock Kernel 5.5.0-rc7 on my Unleashed board with most
> > > > network features I need baked into the kernel instead of modules.
> > > >
> > > > I tried building a kernel with these network features as modules but
> > > > when loading some of them I got this error:
> > > >
> > > > root@unleashed:~# uname -a
> > > > Linux unleashed 5.5.0-rc7-dirty #4 SMP Fri Jan 24 18:16:43 -02 2020
> > > > riscv64 GNU/Linux
> > > >
> > > > root@unleashed:~# modprobe br_netfilter
> > > > [  139.290533] br_netfilter: target ffffffe0000422d8 can not be
> > > > addressed by the 32-bit offset from PC = 000000003dfd6deb
> > >
> > > This is a known issue:
> > >
> > > https://lore.kernel.org/linux-riscv/1572281840733.3517@european-processor-initiative.eu/
> > >
> > > https://lore.kernel.org/linux-riscv/20191029105055.GA20736@aurel32.net/#r
> > >
> > > Vincent is looking into it and I expect there will be a fix soon.
> > >
> >
> > Is this patch solving the problem?
> >
> > https://github.com/bjoto/linux/commit/8a56d1c8e8e91c1bc3893946d52b9217c96e1589
>
> I have just tried that patch on top of a 5.4.13 kernel, and I confirm it
> fixes the issue. Thanks!

I guess this patch was lost over time.

First of all, we need to rebase this patch on the latest kernel.

The MODULES_xyz defines are redundant and create confusion:
they suggest we are creating a separate virtual memory region
for modules, which is not true. In fact, the patch only ensures
that modules are allocated from the end of the VMALLOC region.

I suggest dropping all the MODULES_xyz defines and implementing
module_alloc() as follows:

void *module_alloc(unsigned long size)
{
	return __vmalloc_node_range(size, 1, VMALLOC_END - SZ_128M,
				    VMALLOC_END, GFP_KERNEL,
				    PAGE_KERNEL_EXEC, 0,
				    NUMA_NO_NODE,
				    __builtin_return_address(0));
}

Maybe we can even have a Kconfig option, MODULES_SIZE, to use
in place of SZ_128M in the above code.

Should I send a patch against the latest kernel?

Regards,
Anup



* Re: Error on loading some network Kernel modules
  2020-02-01  7:52       ` Anup Patel
@ 2020-02-01 13:59         ` Alex Ghiti
  2020-02-02 14:10           ` Anup Patel
  2020-02-02 16:27           ` Romain Dolbeau
  0 siblings, 2 replies; 30+ messages in thread
From: Alex Ghiti @ 2020-02-01 13:59 UTC (permalink / raw)
  To: Anup Patel, Aurelien Jarno
  Cc: Carlos Eduardo de Paula, David Abdurachmanov, Paul Walmsley,
	Vincent Chen, Romain Dolbeau, linux-riscv

On 2/1/20 2:52 AM, Anup Patel wrote:
> On Sat, Feb 1, 2020 at 1:41 AM Aurelien Jarno <aurelien@aurel32.net> wrote:
>> On 2020-01-30 17:20, David Abdurachmanov wrote:
>>> On Thu, Jan 30, 2020 at 4:00 AM Paul Walmsley <paul@pwsan.com> wrote:
>>>> On Tue, 28 Jan 2020, Carlos Eduardo de Paula wrote:
>>>>
>>>>> I currently run stock Kernel 5.5.0-rc7 on my Unleashed board with most
>>>>> network features I need baked into the kernel instead of modules.
>>>>>
>>>>> I tried building a kernel with these network features as modules but
>>>>> when loading some of them I got this error:
>>>>>
>>>>> root@unleashed:~# uname -a
>>>>> Linux unleashed 5.5.0-rc7-dirty #4 SMP Fri Jan 24 18:16:43 -02 2020
>>>>> riscv64 GNU/Linux
>>>>>
>>>>> root@unleashed:~# modprobe br_netfilter
>>>>> [  139.290533] br_netfilter: target ffffffe0000422d8 can not be
>>>>> addressed by the 32-bit offset from PC = 000000003dfd6deb
>>>> This is a known issue:
>>>>
>>>> https://lore.kernel.org/linux-riscv/1572281840733.3517@european-processor-initiative.eu/
>>>>
>>>> https://lore.kernel.org/linux-riscv/20191029105055.GA20736@aurel32.net/#r
>>>>
>>>> Vincent is looking into it and I expect there will be a fix soon.
>>>>
>>> Is this patch solving the problem?
>>>
>>> https://github.com/bjoto/linux/commit/8a56d1c8e8e91c1bc3893946d52b9217c96e1589
>> I have just tried that patch on top of a 5.4.13 kernel, and I confirm it
>> fixes the issue. Thanks!
> I guess this patch was lost over time.
>
> First of all we need to rebase this patch on latest kernel.
>
> The MODULES_xyz defines are totally redundant and create
> confusion because with these defines it seems we are creating
> separate virtual memory region for modules which is not true.
> In fact, the patch only tries to ensure that modules are allocated
> from the end-of VMALLOC region.
>
> I suggest to drop all MODULES_xyz defines and implement
> module_alloc() as follows:
>
> void *module_alloc(unsigned long size)
> {
> return __vmalloc_node_range(size, 1, VMALLOC_END - SZ_128M,
> VMALLOC_END, GFP_KERNEL,
> PAGE_KERNEL_EXEC, 0,
> NUMA_NO_NODE,
> __builtin_return_address(0));
> }


Why restrict this to 128M when we have a 2GB offset available to the
end of the kernel?

I think the cleanest solution is to use the range
[&_end - 2 * SZ_1G; VMALLOC_END].

Alex


> Maybe we can even have a kconfig options for MODULES_SIZE
> which can be used in-place of SZ_128M in above code.
>
> Should I send a patch on latest kernel ??
>
> Regards,
> Anup
>



* Re: Error on loading some network Kernel modules
  2020-02-01 13:59         ` Alex Ghiti
@ 2020-02-02 14:10           ` Anup Patel
  2020-02-02 15:21             ` Troy Benjegerdes
  2020-02-02 16:27           ` Romain Dolbeau
  1 sibling, 1 reply; 30+ messages in thread
From: Anup Patel @ 2020-02-02 14:10 UTC (permalink / raw)
  To: Alex Ghiti
  Cc: Carlos Eduardo de Paula, David Abdurachmanov, Paul Walmsley,
	Vincent Chen, Romain Dolbeau, linux-riscv, Aurelien Jarno

On Sat, Feb 1, 2020 at 7:30 PM Alex Ghiti <alex@ghiti.fr> wrote:
>
> On 2/1/20 2:52 AM, Anup Patel wrote:
> > On Sat, Feb 1, 2020 at 1:41 AM Aurelien Jarno <aurelien@aurel32.net> wrote:
> >> On 2020-01-30 17:20, David Abdurachmanov wrote:
> >>> On Thu, Jan 30, 2020 at 4:00 AM Paul Walmsley <paul@pwsan.com> wrote:
> >>>> On Tue, 28 Jan 2020, Carlos Eduardo de Paula wrote:
> >>>>
> >>>>> I currently run stock Kernel 5.5.0-rc7 on my Unleashed board with most
> >>>>> network features I need baked into the kernel instead of modules.
> >>>>>
> >>>>> I tried building a kernel with these network features as modules but
> >>>>> when loading some of them I got this error:
> >>>>>
> >>>>> root@unleashed:~# uname -a
> >>>>> Linux unleashed 5.5.0-rc7-dirty #4 SMP Fri Jan 24 18:16:43 -02 2020
> >>>>> riscv64 GNU/Linux
> >>>>>
> >>>>> root@unleashed:~# modprobe br_netfilter
> >>>>> [  139.290533] br_netfilter: target ffffffe0000422d8 can not be
> >>>>> addressed by the 32-bit offset from PC = 000000003dfd6deb
> >>>> This is a known issue:
> >>>>
> >>>> https://lore.kernel.org/linux-riscv/1572281840733.3517@european-processor-initiative.eu/
> >>>>
> >>>> https://lore.kernel.org/linux-riscv/20191029105055.GA20736@aurel32.net/#r
> >>>>
> >>>> Vincent is looking into it and I expect there will be a fix soon.
> >>>>
> >>> Is this patch solving the problem?
> >>>
> >>> https://github.com/bjoto/linux/commit/8a56d1c8e8e91c1bc3893946d52b9217c96e1589
> >> I have just tried that patch on top of a 5.4.13 kernel, and I confirm it
> >> fixes the issue. Thanks!
> > I guess this patch was lost over time.
> >
> > First of all we need to rebase this patch on latest kernel.
> >
> > The MODULES_xyz defines are totally redundant and create
> > confusion because with these defines it seems we are creating
> > separate virtual memory region for modules which is not true.
> > In fact, the patch only tries to ensure that modules are allocated
> > from the end-of VMALLOC region.
> >
> > I suggest to drop all MODULES_xyz defines and implement
> > module_alloc() as follows:
> >
> > void *module_alloc(unsigned long size)
> > {
> > return __vmalloc_node_range(size, 1, VMALLOC_END - SZ_128M,
> > VMALLOC_END, GFP_KERNEL,
> > PAGE_KERNEL_EXEC, 0,
> > NUMA_NO_NODE,
> > __builtin_return_address(0));
> > }
>
>
> Why restrict to 128M whereas we have 2GB offset available to the end of
> the kernel ?
>
> I think the cleanest solution is to use the following range [&_end - 2 *
> SZ_1G; VMALLOC_END].

I agree we should not restrict this to just SZ_128M.

You can go ahead and send a patch for this (if you want).

Regards,
Anup



* Re: Error on loading some network Kernel modules
  2020-02-02 14:10           ` Anup Patel
@ 2020-02-02 15:21             ` Troy Benjegerdes
  0 siblings, 0 replies; 30+ messages in thread
From: Troy Benjegerdes @ 2020-02-02 15:21 UTC (permalink / raw)
  To: Anup Patel, Paul Walmsley, David Abdurachmanov
  Cc: Carlos Eduardo de Paula, Alex Ghiti, Vincent Chen,
	Romain Dolbeau, linux-riscv, Aurelien Jarno

Do we have a good place to host a community-supported ‘riscv-stable’ tree where we can collect patches like this that we need for basic functionality, so we don’t lose them again?

I also feel there’s a need to backport some patches like this to a -stable/LTS kernel series, and 4.19 seems a little old to keep maintaining.

> On Feb 2, 2020, at 8:10 AM, Anup Patel <anup@brainfault.org> wrote:
> 
> On Sat, Feb 1, 2020 at 7:30 PM Alex Ghiti <alex@ghiti.fr> wrote:
>> 
>> On 2/1/20 2:52 AM, Anup Patel wrote:
>>> On Sat, Feb 1, 2020 at 1:41 AM Aurelien Jarno <aurelien@aurel32.net> wrote:
>>>> On 2020-01-30 17:20, David Abdurachmanov wrote:
>>>>> On Thu, Jan 30, 2020 at 4:00 AM Paul Walmsley <paul@pwsan.com> wrote:
>>>>>> On Tue, 28 Jan 2020, Carlos Eduardo de Paula wrote:
>>>>>> 
>>>>>>> I currently run stock Kernel 5.5.0-rc7 on my Unleashed board with most
>>>>>>> network features I need baked into the kernel instead of modules.
>>>>>>> 
>>>>>>> I tried building a kernel with these network features as modules but
>>>>>>> when loading some of them I got this error:
>>>>>>> 
>>>>>>> root@unleashed:~# uname -a
>>>>>>> Linux unleashed 5.5.0-rc7-dirty #4 SMP Fri Jan 24 18:16:43 -02 2020
>>>>>>> riscv64 GNU/Linux
>>>>>>> 
>>>>>>> root@unleashed:~# modprobe br_netfilter
>>>>>>> [  139.290533] br_netfilter: target ffffffe0000422d8 can not be
>>>>>>> addressed by the 32-bit offset from PC = 000000003dfd6deb
>>>>>> This is a known issue:
>>>>>> 
>>>>>> https://lore.kernel.org/linux-riscv/1572281840733.3517@european-processor-initiative.eu/
>>>>>> 
>>>>>> https://lore.kernel.org/linux-riscv/20191029105055.GA20736@aurel32.net/#r
>>>>>> 
>>>>>> Vincent is looking into it and I expect there will be a fix soon.
>>>>>> 
>>>>> Is this patch solving the problem?
>>>>> 
>>>>> https://github.com/bjoto/linux/commit/8a56d1c8e8e91c1bc3893946d52b9217c96e1589
>>>> I have just tried that patch on top of a 5.4.13 kernel, and I confirm it
>>>> fixes the issue. Thanks!
>>> I guess this patch was lost over time.
>>> 
>>> First of all we need to rebase this patch on latest kernel.
>>> 
>>> The MODULES_xyz defines are totally redundant and create
>>> confusion because with these defines it seems we are creating
>>> separate virtual memory region for modules which is not true.
>>> In fact, the patch only tries to ensure that modules are allocated
>>> from the end-of VMALLOC region.
>>> 
>>> I suggest to drop all MODULES_xyz defines and implement
>>> module_alloc() as follows:
>>> 
>>> void *module_alloc(unsigned long size)
>>> {
>>> return __vmalloc_node_range(size, 1, VMALLOC_END - SZ_128M,
>>> VMALLOC_END, GFP_KERNEL,
>>> PAGE_KERNEL_EXEC, 0,
>>> NUMA_NO_NODE,
>>> __builtin_return_address(0));
>>> }
>> 
>> 
>> Why restrict to 128M whereas we have 2GB offset available to the end of
>> the kernel ?
>> 
>> I think the cleanest solution is to use the following range [&_end - 2 *
>> SZ_1G; VMALLOC_END].
> 
> I agree we should not restrict to just SZ_128M.
> 
> You can go ahead and send a patch this (if you want).
> 
> Regards,
> Anup




* Re: Error on loading some network Kernel modules
  2020-02-01 13:59         ` Alex Ghiti
  2020-02-02 14:10           ` Anup Patel
@ 2020-02-02 16:27           ` Romain Dolbeau
  2020-02-03 10:04             ` Vincent Chen
  2020-02-03 20:57             ` Alex Ghiti
  1 sibling, 2 replies; 30+ messages in thread
From: Romain Dolbeau @ 2020-02-02 16:27 UTC (permalink / raw)
  To: Alex Ghiti, Anup Patel, Aurelien Jarno
  Cc: Vincent Chen, Carlos Eduardo de Paula, Paul Walmsley,
	linux-riscv, David Abdurachmanov

On 2020-02-01 14:59, Alex Ghiti wrote:
> Why restrict to 128M whereas we have 2GB offset available to the end of 
> the kernel ?

Isn't that a 2 GiB offset to whatever symbol the module requires in the 
kernel, rather than to the end of the kernel space?

Is there some guarantee that symbols accessible by modules are at the 
end of the kernel? If so, wouldn't the maximum offset for this patch 
still be (2 GiB - <total size of accessible symbols>)?

Cordially,

-- 
Romain Dolbeau



* Re: Error on loading some network Kernel modules
  2020-02-02 16:27           ` Romain Dolbeau
@ 2020-02-03 10:04             ` Vincent Chen
  2020-02-04  3:55               ` Zong Li
  2020-02-03 20:57             ` Alex Ghiti
  1 sibling, 1 reply; 30+ messages in thread
From: Vincent Chen @ 2020-02-03 10:04 UTC (permalink / raw)
  To: Romain Dolbeau
  Cc: Carlos Eduardo de Paula, Alex Ghiti, David Abdurachmanov,
	Anup Patel, Paul Walmsley, linux-riscv, Aurelien Jarno

On Mon, Feb 3, 2020 at 12:27 AM Romain Dolbeau
<romain.dolbeau@european-processor-initiative.eu> wrote:
>
> On 2020-02-01 14:59, Alex Ghiti wrote:
> > Why restrict to 128M whereas we have 2GB offset available to the end of
> > the kernel ?
>
> Isn't that 2 GiB offset to whatever the module requires in the kernel,
> rather than to the end of the kernel space?
>
> Is there some guarantee that symbols accessible by modules are at the
> end of the kernel? If so, wouldn't the maximum offset for this patch
> still be (2 GiB - <total size of accessible symbols>)?
>
> Cordially,
>
> --
> Romain Dolbeau

It took me some time to find the root cause of this problem, so please
allow me to share some observations before the discussion.

The root cause of this issue is that the percpu data is declared as a
static variable. The "static" attribute makes the compiler assume
that this symbol is close to the .text section at runtime. Hence, the
compiler uses "auipc" to access this percpu data instead of going
through the GOT. In this case, the access range is limited to +/-2G.
However, in practice, these percpu data are placed in a pre-allocated
region created by the memblock_allocate() function. In other words, the
distance between the pre-allocated region (>PAGE_OFFSET) and the
.text section of the kernel module (in the VMALLOC region) is much
larger than 2G.

I agree that the original patch,
https://github.com/bjoto/linux/commit/8a56d1c8e8e91c1bc3893946d52b9217c96e1589,
can solve most cases. However, I do not think the patch still works if
the kernel module region is determined by _end or <total size of
accessible symbols>. The reason is that the pre-allocated region for
module percpu data comes from the memblock allocator at runtime. Hence,
we cannot know the actual address of this region at compile time, and
this issue may occur again in that case.

By the way, I think we can refer to the MIPS implementation:
1. For general cases, we can use this patch to solve the issue.
2. For a large kernel image (>2G), or when the KASLR feature is
enabled, we may need a new code model to deal with it.



* Re: Error on loading some network Kernel modules
  2020-02-02 16:27           ` Romain Dolbeau
  2020-02-03 10:04             ` Vincent Chen
@ 2020-02-03 20:57             ` Alex Ghiti
  1 sibling, 0 replies; 30+ messages in thread
From: Alex Ghiti @ 2020-02-03 20:57 UTC (permalink / raw)
  To: Romain Dolbeau, Anup Patel, Aurelien Jarno
  Cc: Vincent Chen, Carlos Eduardo de Paula, Paul Walmsley,
	linux-riscv, David Abdurachmanov

Hi Romain,

On 2/2/20 11:27 AM, Romain Dolbeau wrote:
> On 2020-02-01 14:59, Alex Ghiti wrote:
>> Why restrict to 128M whereas we have 2GB offset available to the end 
>> of the kernel ?
>
> Isn't that 2 GiB offset to whatever the module requires in the kernel, 
> rather than to the end of the kernel space?


Yes it is. _end is defined as the last symbol in the kernel, so the
modules need to lie somewhere they can, at worst, access this last
symbol. Having the modules in the vmalloc zone, in
[_end - 2GB; VMALLOC_END], allows that, right? Or am I mistaken?

Alex


>
> Is there some guarantee that symbols accessible by modules are at the 
> end of the kernel? If so, wouldn't the maximum offset for this patch 
> still be (2 GiB - <total size of accessible symbols>)?
>
> Cordially,
>



* Re: Error on loading some network Kernel modules
  2020-02-03 10:04             ` Vincent Chen
@ 2020-02-04  3:55               ` Zong Li
  2020-02-04  6:50                 ` Alex Ghiti
  0 siblings, 1 reply; 30+ messages in thread
From: Zong Li @ 2020-02-04  3:55 UTC (permalink / raw)
  To: Vincent Chen
  Cc: Paul Walmsley, Alex Ghiti, David Abdurachmanov, Anup Patel,
	Carlos Eduardo de Paula, Romain Dolbeau, Zong Li, linux-riscv,
	Aurelien Jarno

Vincent Chen <vincent.chen@sifive.com> 於 2020年2月3日 週一 下午6:04寫道:
>
> On Mon, Feb 3, 2020 at 12:27 AM Romain Dolbeau
> <romain.dolbeau@european-processor-initiative.eu> wrote:
> >
> > On 2020-02-01 14:59, Alex Ghiti wrote:
> > > Why restrict to 128M whereas we have 2GB offset available to the end of
> > > the kernel ?
> >
> > Isn't that 2 GiB offset to whatever the module requires in the kernel,
> > rather than to the end of the kernel space?
> >
> > Is there some guarantee that symbols accessible by modules are at the
> > end of the kernel? If so, wouldn't the maximum offset for this patch
> > still be (2 GiB - <total size of accessible symbols>)?
> >
> > Cordially,
> >
> > --
> > Romain Dolbeau
>
> It took me some time to find the root cause of this problem, please
> allow me to share some observations before the discussion.
> The root cause of this issue is that the percpu data is declared as a
> static variable. The "static" attribute will make the compiler think
> that this symbol is close to the .text section at runtime. Hence, the
> compiler uses "auipc" to access this percpu data instead of using GOT.
> In this case,  the access range is limited to + -2G. However, in
> practice, these percpu data are placed at a pre-allocated region
> created by the memblock_allocate() function. In other words, the
> distance between the pre-allocated region (>PAGE_OFFSET ) and the
> .text section of the kernel module (in VMALLOC region) is much larger
> than 2G.
> I agree that the original patch,
> https://github.com/bjoto/linux/commit/8a56d1c8e8e91c1bc3893946d52b9217c96e1589,
> can solve most cases. However, I do not think the patch still works if
> the kernel module region is determined by _end or <total size of
> accessible symbols>. The reason is that the pre-allocated region for
> module percpu data comes from the memblock function at runtime. Hence,
> we cannot know the actual address of this region at compile-time, and
> this issue probably may occur again in this case.
>
> By the way, I think maybe we can refer to the implementation of MIPS.
> 1. For general cases, we can use this patch to solve this issue.
> 2. For a large kernel image (>2G) or enabling the KASLR feature, we
> may need a new code mode to deal with this issue.
>

The range [&_end - 2G, VMALLOC_END] won't be suitable when KASLR is
enabled. Unless we limit the randomization to a small range, the
module region start address will be bigger than VMALLOC_END. So if we
carve out a region for module use, we also have to provide module
randomization at the same time when KASLR is enabled. It would be
good to have a new code model that uses the GOT for static variables;
then everything would be easy.



* Re: Error on loading some network Kernel modules
  2020-02-04  3:55               ` Zong Li
@ 2020-02-04  6:50                 ` Alex Ghiti
  2020-02-04  7:19                   ` Zong Li
  0 siblings, 1 reply; 30+ messages in thread
From: Alex Ghiti @ 2020-02-04  6:50 UTC (permalink / raw)
  To: Zong Li, Vincent Chen
  Cc: Paul Walmsley, David Abdurachmanov, Anup Patel,
	Carlos Eduardo de Paula, Romain Dolbeau, Zong Li, linux-riscv,
	Aurelien Jarno

Hi Zong,

On 2/3/20 10:55 PM, Zong Li wrote:
> Vincent Chen <vincent.chen@sifive.com> 於 2020年2月3日 週一 下午6:04寫道:
>> On Mon, Feb 3, 2020 at 12:27 AM Romain Dolbeau
>> <romain.dolbeau@european-processor-initiative.eu> wrote:
>>> On 2020-02-01 14:59, Alex Ghiti wrote:
>>>> Why restrict to 128M whereas we have 2GB offset available to the end of
>>>> the kernel ?
>>> Isn't that 2 GiB offset to whatever the module requires in the kernel,
>>> rather than to the end of the kernel space?
>>>
>>> Is there some guarantee that symbols accessible by modules are at the
>>> end of the kernel? If so, wouldn't the maximum offset for this patch
>>> still be (2 GiB - <total size of accessible symbols>)?
>>>
>>> Cordially,
>>>
>>> --
>>> Romain Dolbeau
>> It took me some time to find the root cause of this problem, please
>> allow me to share some observations before the discussion.
>> The root cause of this issue is that the percpu data is declared as a
>> static variable. The "static" attribute will make the compiler think
>> that this symbol is close to the .text section at runtime. Hence, the
>> compiler uses "auipc" to access this percpu data instead of using GOT.
>> In this case,  the access range is limited to + -2G. However, in
>> practice, these percpu data are placed at a pre-allocated region
>> created by the memblock_allocate() function. In other words, the
>> distance between the pre-allocated region (>PAGE_OFFSET ) and the
>> .text section of the kernel module (in VMALLOC region) is much larger
>> than 2G.
>> I agree that the original patch,
>> https://github.com/bjoto/linux/commit/8a56d1c8e8e91c1bc3893946d52b9217c96e1589,
>> can solve most cases. However, I do not think the patch still works if
>> the kernel module region is determined by _end or <total size of
>> accessible symbols>. The reason is that the pre-allocated region for
>> module percpu data comes from the memblock function at runtime. Hence,
>> we cannot know the actual address of this region at compile-time, and
>> this issue probably may occur again in this case.
>>
>> By the way, I think maybe we can refer to the implementation of MIPS.
>> 1. For general cases, we can use this patch to solve this issue.
>> 2. For a large kernel image (>2G) or enabling the KASLR feature, we
>> may need a new code mode to deal with this issue.
>>
> The range [&_end - 2G, VMALLOC_END] won't be suitable when KASLR is
> enabled. Unless we limit the randomized range in small area, the
> module region start address will be bigger than VMALLOC_END.


Actually, the relocatable patch I proposed already moves "everything"
up at the same time: the kernel itself but also all the "zones" below
(vmalloc, vmemmap, fixup...etc) since all those zones are defined from
PAGE_OFFSET, which is now dynamic. So the modules will remain at the
same offset to the kernel, unless explicitly randomized in the vmalloc
zone.


> So if we
> divide a region into module use, we also have to provide module
> randomization at the same time when KASLR is enabled. It good to me if
> there is a new code model to use GOT for static variable, everything
> will be easy.


GOT usage does indeed seem easier, but it looks a bit overkill to me:
arm64 already uses the range I proposed for modules (see
arch/arm64/kernel/kaslr.c, the module_alloc_base definition).

Anyway, even if GOT is chosen, it won't land anytime soon, right? So
we could, for the moment, constrain the modules to be close to the
kernel with any range you like, and that will still work with our
implementation of KASLR.

It seems urgent to fix those modules loading issues for 5.6.

Alex




* Re: Error on loading some network Kernel modules
  2020-02-04  6:50                 ` Alex Ghiti
@ 2020-02-04  7:19                   ` Zong Li
  2020-02-04  9:32                     ` Alexandre Ghiti
  0 siblings, 1 reply; 30+ messages in thread
From: Zong Li @ 2020-02-04  7:19 UTC (permalink / raw)
  To: Alex Ghiti
  Cc: Paul Walmsley, David Abdurachmanov, Anup Patel,
	Carlos Eduardo de Paula, Vincent Chen, Romain Dolbeau, Zong Li,
	linux-riscv, Aurelien Jarno

On Tue, Feb 4, 2020 at 2:50 PM Alex Ghiti <alex@ghiti.fr> wrote:
>
> Hi Zong,
>
> On 2/3/20 10:55 PM, Zong Li wrote:
> > On Mon, Feb 3, 2020 at 6:04 PM Vincent Chen <vincent.chen@sifive.com> wrote:
> >> On Mon, Feb 3, 2020 at 12:27 AM Romain Dolbeau
> >> <romain.dolbeau@european-processor-initiative.eu> wrote:
> >>> On 2020-02-01 14:59, Alex Ghiti wrote:
> >>>> Why restrict to 128M whereas we have 2GB offset available to the end of
> >>>> the kernel ?
> >>> Isn't that 2 GiB offset to whatever the module requires in the kernel,
> >>> rather than to the end of the kernel space?
> >>>
> >>> Is there some guarantee that symbols accessible by modules are at the
> >>> end of the kernel? If so, wouldn't the maximum offset for this patch
> >>> still be (2 GiB - <total size of accessible symbols>)?
> >>>
> >>> Cordially,
> >>>
> >>> --
> >>> Romain Dolbeau
> >> It took me some time to find the root cause of this problem, please
> >> allow me to share some observations before the discussion.
> >> The root cause of this issue is that the percpu data is declared as a
> >> static variable. The "static" attribute will make the compiler think
> >> that this symbol is close to the .text section at runtime. Hence, the
> >> compiler uses "auipc" to access this percpu data instead of using GOT.
> >> In this case,  the access range is limited to + -2G. However, in
> >> practice, these percpu data are placed at a pre-allocated region
> >> created by the memblock_allocate() function. In other words, the
> >> distance between the pre-allocated region (>PAGE_OFFSET ) and the
> >> .text section of the kernel module (in VMALLOC region) is much larger
> >> than 2G.
> >> I agree that the original patch,
> >> https://github.com/bjoto/linux/commit/8a56d1c8e8e91c1bc3893946d52b9217c96e1589,
> >> can solve most cases. However, I do not think the patch still works if
> >> the kernel module region is determined by _end or <total size of
> >> accessible symbols>. The reason is that the pre-allocated region for
> >> module percpu data comes from the memblock function at runtime. Hence,
> >> we cannot know the actual address of this region at compile-time, and
> >> this issue probably may occur again in this case.
> >>
> >> By the way, I think maybe we can refer to the implementation of MIPS.
> >> 1. For general cases, we can use this patch to solve this issue.
> >> 2. For a large kernel image (>2G) or enabling the KASLR feature, we
> >> may need a new code mode to deal with this issue.
> >>
> > The range [&_end - 2G, VMALLOC_END] won't be suitable when KASLR is
> > enabled. Unless we limit the randomized range in small area, the
> > module region start address will be bigger than VMALLOC_END.
>
>
> Actually, the relocatable patch I proposed already moves "everything" up
> at the same
> time: the kernel itself but also all the "zones" below (vmalloc,
> vmemmap, fixup...etc)
> since all those zones are defined from PAGE_OFFSET that is now dynamic.
> So the modules
> ill remain at the same offset to the kernel, unless explicitly
> randomized in the vmalloc zone.
>

OK, it makes sense. Having the module region move along with the
kernel seems to avoid the concern I mentioned.

So now, the problem is that the pre-allocated region for percpu data
is located after the _end symbol, as Vincent mentioned; the 2G
distance seems to be too far for the module region start address
(i.e. &_end - 2G).

>
> > So if we
> > divide a region into module use, we also have to provide module
> > randomization at the same time when KASLR is enabled. It good to me if
> > there is a new code model to use GOT for static variable, everything
> > will be easy.
>
>
> GOT usage seems indeed to be easier, but it seems a bit overkill to me:
> arm64 already
> uses the range I proposed for modules (arch/arm64/kernel/kaslr.c,
> module_alloc_base
> definition).
>
> Anyway, even if GOT is chosen, it won't be anytime soon right ? So we
> could, for the moment,
> constraint the modules to be close to the kernel with any range you like
> and that will still
> work with our implementation of KASLR.

Yes, I'm with you on that, we need a resolution in kernel first.

>
> It seems urgent to fix those modules loading issues for 5.6.
>
> Alex
>



* Re: Error on loading some network Kernel modules
  2020-02-04  7:19                   ` Zong Li
@ 2020-02-04  9:32                     ` Alexandre Ghiti
  2020-02-04 10:46                       ` Vincent Chen
  0 siblings, 1 reply; 30+ messages in thread
From: Alexandre Ghiti @ 2020-02-04  9:32 UTC (permalink / raw)
  To: Zong Li
  Cc: Paul Walmsley, David Abdurachmanov, Anup Patel,
	Carlos Eduardo de Paula, Vincent Chen, Romain Dolbeau, Zong Li,
	linux-riscv, Aurelien Jarno


On 2/4/20 8:19 AM, Zong Li wrote:
> On Tue, Feb 4, 2020 at 2:50 PM Alex Ghiti <alex@ghiti.fr> wrote:
>> Hi Zong,
>>
>> On 2/3/20 10:55 PM, Zong Li wrote:
>>> On Mon, Feb 3, 2020 at 6:04 PM Vincent Chen <vincent.chen@sifive.com> wrote:
>>>> On Mon, Feb 3, 2020 at 12:27 AM Romain Dolbeau
>>>> <romain.dolbeau@european-processor-initiative.eu> wrote:
>>>>> On 2020-02-01 14:59, Alex Ghiti wrote:
>>>>>> Why restrict to 128M whereas we have 2GB offset available to the end of
>>>>>> the kernel ?
>>>>> Isn't that 2 GiB offset to whatever the module requires in the kernel,
>>>>> rather than to the end of the kernel space?
>>>>>
>>>>> Is there some guarantee that symbols accessible by modules are at the
>>>>> end of the kernel? If so, wouldn't the maximum offset for this patch
>>>>> still be (2 GiB - <total size of accessible symbols>)?
>>>>>
>>>>> Cordially,
>>>>>
>>>>> --
>>>>> Romain Dolbeau
>>>> It took me some time to find the root cause of this problem, please
>>>> allow me to share some observations before the discussion.
>>>> The root cause of this issue is that the percpu data is declared as a
>>>> static variable. The "static" attribute will make the compiler think
>>>> that this symbol is close to the .text section at runtime. Hence, the
>>>> compiler uses "auipc" to access this percpu data instead of using GOT.
>>>> In this case,  the access range is limited to + -2G. However, in
>>>> practice, these percpu data are placed at a pre-allocated region
>>>> created by the memblock_allocate() function. In other words, the
>>>> distance between the pre-allocated region (>PAGE_OFFSET ) and the
>>>> .text section of the kernel module (in VMALLOC region) is much larger
>>>> than 2G.
>>>> I agree that the original patch,
>>>> https://github.com/bjoto/linux/commit/8a56d1c8e8e91c1bc3893946d52b9217c96e1589,
>>>> can solve most cases. However, I do not think the patch still works if
>>>> the kernel module region is determined by _end or <total size of
>>>> accessible symbols>. The reason is that the pre-allocated region for
>>>> module percpu data comes from the memblock function at runtime. Hence,
>>>> we cannot know the actual address of this region at compile-time, and
>>>> this issue probably may occur again in this case.
>>>>
>>>> By the way, I think maybe we can refer to the implementation of MIPS.
>>>> 1. For general cases, we can use this patch to solve this issue.
>>>> 2. For a large kernel image (>2G) or enabling the KASLR feature, we
>>>> may need a new code mode to deal with this issue.
>>>>
>>> The range [&_end - 2G, VMALLOC_END] won't be suitable when KASLR is
>>> enabled. Unless we limit the randomized range in small area, the
>>> module region start address will be bigger than VMALLOC_END.
>>
>> Actually, the relocatable patch I proposed already moves "everything" up
>> at the same
>> time: the kernel itself but also all the "zones" below (vmalloc,
>> vmemmap, fixup...etc)
>> since all those zones are defined from PAGE_OFFSET that is now dynamic.
>> So the modules
>> ill remain at the same offset to the kernel, unless explicitly
>> randomized in the vmalloc zone.
>>
> OK, it makes sense. The module region moves along with kernel seems to
> stay away from the concern I mentioned.
>
> So now, the problem is that the pre-allocated region of percpu data is
> located after _end symbol as Vincent mentioned, the 2G distance seems
> to be too far for module region start address. (i.e. &_end - 2G).


Actually, I don't understand this issue: we are limited to addressing
symbols within +/- 2GB from PC. So as long as the percpu symbol exists
in the kernel, it is below _end, and then we don't care that its
content is initialized dynamically, as long as we can 'compute' its
address from PC, right?

Can you point out where I can take a look at this pre-allocated region
for percpu data, please?

Thanks,

Alex


>
>>> So if we
>>> divide a region into module use, we also have to provide module
>>> randomization at the same time when KASLR is enabled. It good to me if
>>> there is a new code model to use GOT for static variable, everything
>>> will be easy.
>>
>> GOT usage seems indeed to be easier, but it seems a bit overkill to me:
>> arm64 already
>> uses the range I proposed for modules (arch/arm64/kernel/kaslr.c,
>> module_alloc_base
>> definition).
>>
>> Anyway, even if GOT is chosen, it won't be anytime soon right ? So we
>> could, for the moment,
>> constraint the modules to be close to the kernel with any range you like
>> and that will still
>> work with our implementation of KASLR.
> Yes, I'm with you on that, we need a resolution in kernel first.
>
>> It seems urgent to fix those modules loading issues for 5.6.
>>
>> Alex
>>



* Re: Error on loading some network Kernel modules
  2020-02-04  9:32                     ` Alexandre Ghiti
@ 2020-02-04 10:46                       ` Vincent Chen
  2020-02-04 11:30                         ` Anup Patel
  0 siblings, 1 reply; 30+ messages in thread
From: Vincent Chen @ 2020-02-04 10:46 UTC (permalink / raw)
  To: Alexandre Ghiti
  Cc: Carlos Eduardo de Paula, David Abdurachmanov, Anup Patel,
	Paul Walmsley, Zong Li, Romain Dolbeau, Zong Li, linux-riscv,
	Aurelien Jarno

On Tue, Feb 4, 2020 at 5:32 PM Alexandre Ghiti <alex@ghiti.fr> wrote:
>
>
> On 2/4/20 8:19 AM, Zong Li wrote:
> > On Tue, Feb 4, 2020 at 2:50 PM Alex Ghiti <alex@ghiti.fr> wrote:
> >> Hi Zong,
> >>
> >> On 2/3/20 10:55 PM, Zong Li wrote:
> >>> On Mon, Feb 3, 2020 at 6:04 PM Vincent Chen <vincent.chen@sifive.com> wrote:
> >>>> On Mon, Feb 3, 2020 at 12:27 AM Romain Dolbeau
> >>>> <romain.dolbeau@european-processor-initiative.eu> wrote:
> >>>>> On 2020-02-01 14:59, Alex Ghiti wrote:
> >>>>>> Why restrict to 128M whereas we have 2GB offset available to the end of
> >>>>>> the kernel ?
> >>>>> Isn't that 2 GiB offset to whatever the module requires in the kernel,
> >>>>> rather than to the end of the kernel space?
> >>>>>
> >>>>> Is there some guarantee that symbols accessible by modules are at the
> >>>>> end of the kernel? If so, wouldn't the maximum offset for this patch
> >>>>> still be (2 GiB - <total size of accessible symbols>)?
> >>>>>
> >>>>> Cordially,
> >>>>>
> >>>>> --
> >>>>> Romain Dolbeau
> >>>> It took me some time to find the root cause of this problem, please
> >>>> allow me to share some observations before the discussion.
> >>>> The root cause of this issue is that the percpu data is declared as a
> >>>> static variable. The "static" attribute will make the compiler think
> >>>> that this symbol is close to the .text section at runtime. Hence, the
> >>>> compiler uses "auipc" to access this percpu data instead of using GOT.
> >>>> In this case,  the access range is limited to + -2G. However, in
> >>>> practice, these percpu data are placed at a pre-allocated region
> >>>> created by the memblock_allocate() function. In other words, the
> >>>> distance between the pre-allocated region (>PAGE_OFFSET ) and the
> >>>> .text section of the kernel module (in VMALLOC region) is much larger
> >>>> than 2G.
> >>>> I agree that the original patch,
> >>>> https://github.com/bjoto/linux/commit/8a56d1c8e8e91c1bc3893946d52b9217c96e1589,
> >>>> can solve most cases. However, I do not think the patch still works if
> >>>> the kernel module region is determined by _end or <total size of
> >>>> accessible symbols>. The reason is that the pre-allocated region for
> >>>> module percpu data comes from the memblock function at runtime. Hence,
> >>>> we cannot know the actual address of this region at compile-time, and
> >>>> this issue probably may occur again in this case.
> >>>>
> >>>> By the way, I think maybe we can refer to the implementation of MIPS.
> >>>> 1. For general cases, we can use this patch to solve this issue.
> >>>> 2. For a large kernel image (>2G) or enabling the KASLR feature, we
> >>>> may need a new code mode to deal with this issue.
> >>>>
> >>> The range [&_end - 2G, VMALLOC_END] won't be suitable when KASLR is
> >>> enabled. Unless we limit the randomized range in small area, the
> >>> module region start address will be bigger than VMALLOC_END.
> >>
> >> Actually, the relocatable patch I proposed already moves "everything" up
> >> at the same
> >> time: the kernel itself but also all the "zones" below (vmalloc,
> >> vmemmap, fixup...etc)
> >> since all those zones are defined from PAGE_OFFSET that is now dynamic.
> >> So the modules
> >> ill remain at the same offset to the kernel, unless explicitly
> >> randomized in the vmalloc zone.
> >>
> > OK, it makes sense. The module region moves along with kernel seems to
> > stay away from the concern I mentioned.
> >
> > So now, the problem is that the pre-allocated region of percpu data is
> > located after _end symbol as Vincent mentioned, the 2G distance seems
> > to be too far for module region start address. (i.e. &_end - 2G).
>
>
> Actually, I don't understand this issue: we are limited to addressing
> symbols within +/- 2GB
> from PC. So as long as the percpu symbol exists in the kernel, it is
> below _end and then we don't
> care that its content is initialized dynamically, as long as we can
> 'compute' its address from PC,
> right ?

In this case, the static percpu symbols of this issue are declared in
the kernel module, not in the kernel image.
When the kernel loads a kernel module, in general, it places all the
module sections contiguously in the VMALLOC memory area. However, the
".data..percpu" section is an exception. The kernel places the
".data..percpu" section in a pre-allocated memory region. This region
is used to hold the percpu data instances for each CPU and is created
by memblock(). Hence, the instances of these percpu symbols are above
_end. In this case, if we make the module region span [_end - 2G,
PAGE_OFFSET], the distance between these percpu symbols (whose
addresses are > _end) and the module text section will probably exceed
the 2G limitation.

>
> Can you point out where I can take a look at this pre-allocated region
> for percpu data please ?
>
In mm/percpu.c, you can find how the kernel allocates the percpu data
region during the initialization phase.
In kernel/module.c's simplify_symbols() function, you can see that the
kernel treats the .data..percpu section as an exception when loading a
module.

If I have any misunderstandings, please let me know. Thanks.

Best regards,
Vincent

> Thanks,
>
> Alex
>
>
> >
> >>> So if we
> >>> divide a region into module use, we also have to provide module
> >>> randomization at the same time when KASLR is enabled. It good to me if
> >>> there is a new code model to use GOT for static variable, everything
> >>> will be easy.
> >>
> >> GOT usage seems indeed to be easier, but it seems a bit overkill to me:
> >> arm64 already
> >> uses the range I proposed for modules (arch/arm64/kernel/kaslr.c,
> >> module_alloc_base
> >> definition).
> >>
> >> Anyway, even if GOT is chosen, it won't be anytime soon right ? So we
> >> could, for the moment,
> >> constraint the modules to be close to the kernel with any range you like
> >> and that will still
> >> work with our implementation of KASLR.
> > Yes, I'm with you on that, we need a resolution in kernel first.
> >
> >> It seems urgent to fix those modules loading issues for 5.6.
> >>
> >> Alex
> >>



* Re: Error on loading some network Kernel modules
  2020-02-04 10:46                       ` Vincent Chen
@ 2020-02-04 11:30                         ` Anup Patel
  2020-02-04 14:03                           ` Vincent Chen
  2020-02-04 17:48                           ` Romain Dolbeau
  0 siblings, 2 replies; 30+ messages in thread
From: Anup Patel @ 2020-02-04 11:30 UTC (permalink / raw)
  To: Vincent Chen
  Cc: Carlos Eduardo de Paula, Alexandre Ghiti, David Abdurachmanov,
	Paul Walmsley, Zong Li, Romain Dolbeau, Zong Li, linux-riscv,
	Aurelien Jarno

On Tue, Feb 4, 2020 at 4:16 PM Vincent Chen <vincent.chen@sifive.com> wrote:
>
> On Tue, Feb 4, 2020 at 5:32 PM Alexandre Ghiti <alex@ghiti.fr> wrote:
> >
> >
> > On 2/4/20 8:19 AM, Zong Li wrote:
> > > On Tue, Feb 4, 2020 at 2:50 PM Alex Ghiti <alex@ghiti.fr> wrote:
> > >> Hi Zong,
> > >>
> > >> On 2/3/20 10:55 PM, Zong Li wrote:
> > >>> On Mon, Feb 3, 2020 at 6:04 PM Vincent Chen <vincent.chen@sifive.com> wrote:
> > >>>> On Mon, Feb 3, 2020 at 12:27 AM Romain Dolbeau
> > >>>> <romain.dolbeau@european-processor-initiative.eu> wrote:
> > >>>>> On 2020-02-01 14:59, Alex Ghiti wrote:
> > >>>>>> Why restrict to 128M whereas we have 2GB offset available to the end of
> > >>>>>> the kernel ?
> > >>>>> Isn't that 2 GiB offset to whatever the module requires in the kernel,
> > >>>>> rather than to the end of the kernel space?
> > >>>>>
> > >>>>> Is there some guarantee that symbols accessible by modules are at the
> > >>>>> end of the kernel? If so, wouldn't the maximum offset for this patch
> > >>>>> still be (2 GiB - <total size of accessible symbols>)?
> > >>>>>
> > >>>>> Cordially,
> > >>>>>
> > >>>>> --
> > >>>>> Romain Dolbeau
> > >>>> It took me some time to find the root cause of this problem, please
> > >>>> allow me to share some observations before the discussion.
> > >>>> The root cause of this issue is that the percpu data is declared as a
> > >>>> static variable. The "static" attribute will make the compiler think
> > >>>> that this symbol is close to the .text section at runtime. Hence, the
> > >>>> compiler uses "auipc" to access this percpu data instead of using GOT.
> > >>>> In this case,  the access range is limited to + -2G. However, in
> > >>>> practice, these percpu data are placed at a pre-allocated region
> > >>>> created by the memblock_allocate() function. In other words, the
> > >>>> distance between the pre-allocated region (>PAGE_OFFSET ) and the
> > >>>> .text section of the kernel module (in VMALLOC region) is much larger
> > >>>> than 2G.
> > >>>> I agree that the original patch,
> > >>>> https://github.com/bjoto/linux/commit/8a56d1c8e8e91c1bc3893946d52b9217c96e1589,
> > >>>> can solve most cases. However, I do not think the patch still works if
> > >>>> the kernel module region is determined by _end or <total size of
> > >>>> accessible symbols>. The reason is that the pre-allocated region for
> > >>>> module percpu data comes from the memblock function at runtime. Hence,
> > >>>> we cannot know the actual address of this region at compile-time, and
> > >>>> this issue probably may occur again in this case.
> > >>>>
> > >>>> By the way, I think maybe we can refer to the implementation of MIPS.
> > >>>> 1. For general cases, we can use this patch to solve this issue.
> > >>>> 2. For a large kernel image (>2G) or enabling the KASLR feature, we
> > >>>> may need a new code mode to deal with this issue.
> > >>>>
> > >>> The range [&_end - 2G, VMALLOC_END] won't be suitable when KASLR is
> > >>> enabled. Unless we limit the randomized range in small area, the
> > >>> module region start address will be bigger than VMALLOC_END.
> > >>
> > >> Actually, the relocatable patch I proposed already moves "everything" up
> > >> at the same
> > >> time: the kernel itself but also all the "zones" below (vmalloc,
> > >> vmemmap, fixup...etc)
> > >> since all those zones are defined from PAGE_OFFSET that is now dynamic.
> > >> So the modules
> > >> ill remain at the same offset to the kernel, unless explicitly
> > >> randomized in the vmalloc zone.
> > >>
> > > OK, it makes sense. The module region moves along with kernel seems to
> > > stay away from the concern I mentioned.
> > >
> > > So now, the problem is that the pre-allocated region of percpu data is
> > > located after _end symbol as Vincent mentioned, the 2G distance seems
> > > to be too far for module region start address. (i.e. &_end - 2G).
> >
> >
> > Actually, I don't understand this issue: we are limited to addressing
> > symbols within +/- 2GB
> > from PC. So as long as the percpu symbol exists in the kernel, it is
> > below _end and then we don't
> > care that its content is initialized dynamically, as long as we can
> > 'compute' its address from PC,
> > right ?
>
> In this case, the static percpu symbols of this issue are declared in
> the kernel module, not in the kernel image.
> When kernel loads the kernel module, in general, it continuously
> places all the module sections in the VMALLOC memory area. However,
> the ".data..percpu" section is an exception. The kernel places the
> ".data..percpu" section in a pre-allocated memory region. This region
> is used to place the percpu data instances for each CPU and is created
> by the memblock(). Hence, the instance of these per-cpu symbols is
> above the _end.  In this case, if we make the module region locate at
> [_end-2G, PAGE_OFFSET], the distance between these percpu symbols (its
> address >_end) and module text section will probably exceed the 2G
> limitation.

Static percpu symbols are a particular problem for loadable modules on
RISC-V, but dynamic percpu variables are perfectly fine. I had faced
this issue with the KVM RISC-V module, and I converted the static
percpu symbol into a dynamic percpu variable. In fact, in the Linux
kernel dynamic percpu variables are preferred over static percpu
symbols, so wherever we see an issue with a static percpu symbol in
any module we should just send a patch converting it to a dynamic
percpu variable.

In general, any memory access via pointers (just like dynamic percpu
variables) is fine as long as the pointer itself is within 2GB of
relative addressing.

Overall, Alex's suggestion will address most cases of loadable
modules, except modules having static percpu symbols, and for such
modules we can just convert those percpu symbols into dynamic percpu
variables.
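As a concrete illustration of the suggested conversion, here is a
hypothetical module-local percpu variable before and after. The
variable and function names are invented; DEFINE_PER_CPU(),
alloc_percpu(), this_cpu_inc() and free_percpu() are the standard
kernel percpu APIs:

```c
#include <linux/init.h>
#include <linux/module.h>
#include <linux/percpu.h>

/* Before (problematic): a static percpu symbol. The module's
 * auipc-relocated code must reach its instance, which lives in the
 * percpu area above _end, far beyond the +/- 2 GiB reach. */
static DEFINE_PER_CPU(unsigned long, my_counter);

/* After: a dynamic percpu allocation. The code only needs to reach
 * the module-local pointer; the instance is accessed through it. */
static unsigned long __percpu *my_counter_dyn;

static int __init example_init(void)
{
	my_counter_dyn = alloc_percpu(unsigned long);
	if (!my_counter_dyn)
		return -ENOMEM;

	this_cpu_inc(*my_counter_dyn);  /* instead of this_cpu_inc(my_counter) */
	return 0;
}

static void __exit example_exit(void)
{
	free_percpu(my_counter_dyn);
}

module_init(example_init);
module_exit(example_exit);
MODULE_LICENSE("GPL");
```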

>
> >
> > Can you point out where I can take a look at this pre-allocated region
> > for percpu data please ?
> >
> In mm/percpu.c, you can find how the kernel allocates the percpu data
> region during the initialization phase.
> In kernel/module.c's simplify_symbols function, you can find kernel
> treats .data.percpu section as an exception when loading module.
>
> If I have any misunderstandings, please let me know. Thanks

Please see the comment above. We should prefer dynamic percpu
variables in loadable modules over static percpu symbols. I am sure
there will be very few loadable kernel modules having static percpu
symbols.

Regards,
Anup



* Re: Error on loading some network Kernel modules
  2020-02-04 11:30                         ` Anup Patel
@ 2020-02-04 14:03                           ` Vincent Chen
  2020-02-04 19:10                             ` Alex Ghiti
       [not found]                             ` <a55f265e-71b2-5ebb-b079-6345007a442e@ghiti.fr>
  2020-02-04 17:48                           ` Romain Dolbeau
  1 sibling, 2 replies; 30+ messages in thread
From: Vincent Chen @ 2020-02-04 14:03 UTC (permalink / raw)
  To: Anup Patel
  Cc: Carlos Eduardo de Paula, Alexandre Ghiti, David Abdurachmanov,
	Paul Walmsley, Zong Li, Romain Dolbeau, Zong Li, linux-riscv,
	Aurelien Jarno

On Tue, Feb 4, 2020 at 7:31 PM Anup Patel <anup@brainfault.org> wrote:
>
> On Tue, Feb 4, 2020 at 4:16 PM Vincent Chen <vincent.chen@sifive.com> wrote:
> >
> > On Tue, Feb 4, 2020 at 5:32 PM Alexandre Ghiti <alex@ghiti.fr> wrote:
> > >
> > >
> > > On 2/4/20 8:19 AM, Zong Li wrote:
> > > > On Tue, Feb 4, 2020 at 2:50 PM Alex Ghiti <alex@ghiti.fr> wrote:
> > > >> Hi Zong,
> > > >>
> > > >> On 2/3/20 10:55 PM, Zong Li wrote:
> > > >>> On Mon, Feb 3, 2020 at 6:04 PM Vincent Chen <vincent.chen@sifive.com> wrote:
> > > >>>> On Mon, Feb 3, 2020 at 12:27 AM Romain Dolbeau
> > > >>>> <romain.dolbeau@european-processor-initiative.eu> wrote:
> > > >>>>> On 2020-02-01 14:59, Alex Ghiti wrote:
> > > >>>>>> Why restrict to 128M whereas we have 2GB offset available to the end of
> > > >>>>>> the kernel ?
> > > >>>>> Isn't that 2 GiB offset to whatever the module requires in the kernel,
> > > >>>>> rather than to the end of the kernel space?
> > > >>>>>
> > > >>>>> Is there some guarantee that symbols accessible by modules are at the
> > > >>>>> end of the kernel? If so, wouldn't the maximum offset for this patch
> > > >>>>> still be (2 GiB - <total size of accessible symbols>)?
> > > >>>>>
> > > >>>>> Cordially,
> > > >>>>>
> > > >>>>> --
> > > >>>>> Romain Dolbeau
> > > >>>> It took me some time to find the root cause of this problem, please
> > > >>>> allow me to share some observations before the discussion.
> > > >>>> The root cause of this issue is that the percpu data is declared as a
> > > >>>> static variable. The "static" attribute will make the compiler think
> > > >>>> that this symbol is close to the .text section at runtime. Hence, the
> > > >>>> compiler uses "auipc" to access this percpu data instead of using GOT.
> > > >>>> In this case,  the access range is limited to + -2G. However, in
> > > >>>> practice, these percpu data are placed at a pre-allocated region
> > > >>>> created by the memblock_allocate() function. In other words, the
> > > >>>> distance between the pre-allocated region (>PAGE_OFFSET ) and the
> > > >>>> .text section of the kernel module (in VMALLOC region) is much larger
> > > >>>> than 2G.
> > > >>>> I agree that the original patch,
> > > >>>> https://github.com/bjoto/linux/commit/8a56d1c8e8e91c1bc3893946d52b9217c96e1589,
> > > >>>> can solve most cases. However, I do not think the patch still works if
> > > >>>> the kernel module region is determined by _end or <total size of
> > > >>>> accessible symbols>. The reason is that the pre-allocated region for
> > > >>>> module percpu data comes from the memblock function at runtime. Hence,
> > > >>>> we cannot know the actual address of this region at compile-time, and
> > > >>>> this issue probably may occur again in this case.
> > > >>>>
> > > >>>> By the way, I think maybe we can refer to the implementation of MIPS.
> > > >>>> 1. For general cases, we can use this patch to solve this issue.
> > > >>>> 2. For a large kernel image (>2G) or enabling the KASLR feature, we
> > > >>>> may need a new code mode to deal with this issue.
> > > >>>>
> > > >>> The range [&_end - 2G, VMALLOC_END] won't be suitable when KASLR is
> > > >>> enabled. Unless we limit the randomized range in small area, the
> > > >>> module region start address will be bigger than VMALLOC_END.
> > > >>
> > > >> Actually, the relocatable patch I proposed already moves "everything" up
> > > >> at the same
> > > >> time: the kernel itself but also all the "zones" below (vmalloc,
> > > >> vmemmap, fixup...etc)
> > > >> since all those zones are defined from PAGE_OFFSET that is now dynamic.
> > > >> So the modules
> > > >> ill remain at the same offset to the kernel, unless explicitly
> > > >> randomized in the vmalloc zone.
> > > >>
> > > > OK, it makes sense. The module region moves along with kernel seems to
> > > > stay away from the concern I mentioned.
> > > >
> > > > So now, the problem is that the pre-allocated region of percpu data is
> > > > located after _end symbol as Vincent mentioned, the 2G distance seems
> > > > to be too far for module region start address. (i.e. &_end - 2G).
> > >
> > >
> > > Actually, I don't understand this issue: we are limited to addressing
> > > symbols within +/- 2GB
> > > from PC. So as long as the percpu symbol exists in the kernel, it is
> > > below _end and then we don't
> > > care that its content is initialized dynamically, as long as we can
> > > 'compute' its address from PC,
> > > right ?
> >
> > In this case, the static percpu symbols of this issue are declared in
> > the kernel module, not in the kernel image.
> > When kernel loads the kernel module, in general, it continuously
> > places all the module sections in the VMALLOC memory area. However,
> > the ".data..percpu" section is an exception. The kernel places the
> > ".data..percpu" section in a pre-allocated memory region. This region
> > is used to place the percpu data instances for each CPU and is created
> > by the memblock(). Hence, the instance of these per-cpu symbols is
> > above the _end.  In this case, if we make the module region locate at
> > [_end-2G, PAGE_OFFSET], the distance between these percpu symbols (its
> > address >_end) and module text section will probably exceed the 2G
> > limitation.
>
> The static percpu symbols are a particular problem for loadable modules
> on RISC-V, but dynamic percpu variables are perfectly fine. I had faced this
> issue with the KVM RISC-V module and I converted the static percpu symbol into
> a dynamic percpu variable. In fact, in the Linux kernel dynamic percpu variables
> are preferred over static percpu symbols, so wherever we see an issue with a
> static percpu symbol in any module we should just send a patch to convert
> it into a dynamic percpu variable.
>
> In general, any memory access via pointers (just like dynamic percpu
> variables) are fine as long as the pointer itself is within 2GB of relative
> addressing.
>

As far as I know, kernel modules use the PIC code model instead
of medany. Therefore, I don't think most pointers in kernel modules
have a 2GB limit. That is why modules without static percpu
variables do not encounter the 32-bit offset issue.

> Overall, Alex's suggestion will address most cases of loadable modules
> except modules having static percpu symbols and for such modules just
> convert these percpu symbols into dynamic percpu variables.
>
> >
> > >
> > > Can you point out where I can take a look at this pre-allocated region
> > > for percpu data please ?
> > >
> > In mm/percpu.c, you can find how the kernel allocates the percpu data
> > region during the initialization phase.
> > In kernel/module.c's simplify_symbols function, you can find kernel
> > treats .data.percpu section as an exception when loading module.
> >
> > If I have any misunderstandings, please let me know. Thanks
>
> Please see above comment. We should prefer dynamic percpu variable
> in loadable modules over static percpu symbols. I am sure there will
> be very few kernel loadable modules having static percpu symbols.
>
Thanks for your comments.
I agree that Alex's suggestion can address most cases of loadable
modules. The reason why I pointed out the static percpu variable case
is that the percpu variable is declared as a static symbol in the
reduced case provided by Romain in
https://lore.kernel.org/linux-riscv/1572281840733.3517@european-processor-initiative.eu/.
Also, I found two static percpu variables declared in the openvswitch
module. After removing the static attribute, the kernel can insert the
module successfully. Hence, I think the static percpu variables cause
the issue.

(According to the result of 'grep', very few kernel loadable modules
use static percpu data, as you mentioned. However, I'm not sure that
is a good reason to prohibit RISC-V users from using static percpu
variables in kernel modules, so I have no comment on this.)
If we can ignore the existence of static percpu variables in kernel
modules, I agree with Alex's suggestion. The reason is not that we can
use it to solve bugs, but that we can change the code model from PIC to
medany by creating a specific kernel module region to improve access
performance.
(Maybe we can keep the current implementation for some corner cases,
such as when the size of the kernel image is close to 2GB.)
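
For what it's worth, the conversion being discussed can be sketched
roughly as follows. This is only an illustration: "pkt_count" is a
made-up module-local counter, but DEFINE_PER_CPU, alloc_percpu,
free_percpu and this_cpu_inc are the standard kernel percpu APIs:

```c
#include <linux/init.h>
#include <linux/module.h>
#include <linux/percpu.h>

/*
 * Before: a static percpu symbol in a module.  Its per-CPU instance is
 * placed in the pre-allocated percpu area above _end, so an auipc-based
 * access from module text in the vmalloc region cannot reach it:
 *
 *     static DEFINE_PER_CPU(u64, pkt_count);
 *     ... this_cpu_inc(pkt_count);
 */

/* After: a dynamic percpu variable.  Only the pointer is a module
 * symbol; the data is reached through it, so the +/- 2GB PC-relative
 * range no longer applies to the data itself. */
static u64 __percpu *pkt_count;

static int __init example_init(void)
{
	pkt_count = alloc_percpu(u64);
	if (!pkt_count)
		return -ENOMEM;
	return 0;
}

static void __exit example_exit(void)
{
	free_percpu(pkt_count);
}

/* The access pattern changes accordingly: */
static inline void count_packet(void)
{
	this_cpu_inc(*pkt_count);
}

module_init(example_init);
module_exit(example_exit);
MODULE_LICENSE("GPL");
```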


^ permalink raw reply	[flat|nested] 30+ messages in thread

* Re: Error on loading some network Kernel modules
  2020-02-04 11:30                         ` Anup Patel
  2020-02-04 14:03                           ` Vincent Chen
@ 2020-02-04 17:48                           ` Romain Dolbeau
  1 sibling, 0 replies; 30+ messages in thread
From: Romain Dolbeau @ 2020-02-04 17:48 UTC (permalink / raw)
  To: Anup Patel, Vincent Chen
  Cc: Carlos Eduardo de Paula, Alexandre Ghiti, David Abdurachmanov,
	Paul Walmsley, Zong Li, Zong Li, linux-riscv, Aurelien Jarno

On 2020-02-04 12:30, Anup Patel wrote:
> for such modules just
> convert these percpu symbols into dynamic percpu variables.

Indeed you are right for my case; converting from a static to a dynamic
percpu variable solves the problem in my out-of-tree case of interest,
thanks!

Cordially,

-- 
Romain Dolbeau


^ permalink raw reply	[flat|nested] 30+ messages in thread

* Re: Error on loading some network Kernel modules
  2020-02-04 14:03                           ` Vincent Chen
@ 2020-02-04 19:10                             ` Alex Ghiti
       [not found]                             ` <a55f265e-71b2-5ebb-b079-6345007a442e@ghiti.fr>
  1 sibling, 0 replies; 30+ messages in thread
From: Alex Ghiti @ 2020-02-04 19:10 UTC (permalink / raw)
  To: Vincent Chen, Anup Patel
  Cc: Carlos Eduardo de Paula, David Abdurachmanov, Paul Walmsley,
	Zong Li, Romain Dolbeau, Zong Li, linux-riscv, Aurelien Jarno



On 2/4/20 9:03 AM, Vincent Chen wrote:
> On Tue, Feb 4, 2020 at 7:31 PM Anup Patel <anup@brainfault.org> wrote:
>>
>> On Tue, Feb 4, 2020 at 4:16 PM Vincent Chen <vincent.chen@sifive.com> wrote:
>>>
>>> On Tue, Feb 4, 2020 at 5:32 PM Alexandre Ghiti <alex@ghiti.fr> wrote:
>>>>
>>>>
>>>> On 2/4/20 8:19 AM, Zong Li wrote:
>>>>> Alex Ghiti <alex@ghiti.fr> 於 2020年2月4日 週二 下午2:50寫道:
>>>>>> Hi Zong,
>>>>>>
>>>>>> On 2/3/20 10:55 PM, Zong Li wrote:
>>>>>>> Vincent Chen <vincent.chen@sifive.com> 於 2020年2月3日 週一 下午6:04寫道:
>>>>>>>> On Mon, Feb 3, 2020 at 12:27 AM Romain Dolbeau
>>>>>>>> <romain.dolbeau@european-processor-initiative.eu> wrote:
>>>>>>>>> On 2020-02-01 14:59, Alex Ghiti wrote:
>>>>>>>>>> Why restrict to 128M whereas we have 2GB offset available to the end of
>>>>>>>>>> the kernel ?
>>>>>>>>> Isn't that 2 GiB offset to whatever the module requires in the kernel,
>>>>>>>>> rather than to the end of the kernel space?
>>>>>>>>>
>>>>>>>>> Is there some guarantee that symbols accessible by modules are at the
>>>>>>>>> end of the kernel? If so, wouldn't the maximum offset for this patch
>>>>>>>>> still be (2 GiB - <total size of accessible symbols>)?
>>>>>>>>>
>>>>>>>>> Cordially,
>>>>>>>>>
>>>>>>>>> --
>>>>>>>>> Romain Dolbeau
>>>>>>>> It took me some time to find the root cause of this problem, please
>>>>>>>> allow me to share some observations before the discussion.
>>>>>>>> The root cause of this issue is that the percpu data is declared as a
>>>>>>>> static variable. The "static" attribute will make the compiler think
>>>>>>>> that this symbol is close to the .text section at runtime. Hence, the
>>>>>>>> compiler uses "auipc" to access this percpu data instead of using GOT.
>>>>>>>> In this case,  the access range is limited to + -2G. However, in
>>>>>>>> practice, these percpu data are placed at a pre-allocated region
>>>>>>>> created by the memblock_allocate() function. In other words, the
>>>>>>>> distance between the pre-allocated region (>PAGE_OFFSET ) and the
>>>>>>>> .text section of the kernel module (in VMALLOC region) is much larger
>>>>>>>> than 2G.
>>>>>>>> I agree that the original patch,
>>>>>>>> https://github.com/bjoto/linux/commit/8a56d1c8e8e91c1bc3893946d52b9217c96e1589,
>>>>>>>> can solve most cases. However, I do not think the patch still works if
>>>>>>>> the kernel module region is determined by _end or <total size of
>>>>>>>> accessible symbols>. The reason is that the pre-allocated region for
>>>>>>>> module percpu data comes from the memblock function at runtime. Hence,
>>>>>>>> we cannot know the actual address of this region at compile-time, and
>>>>>>>> this issue probably may occur again in this case.
>>>>>>>>
>>>>>>>> By the way, I think maybe we can refer to the implementation of MIPS.
>>>>>>>> 1. For general cases, we can use this patch to solve this issue.
>>>>>>>> 2. For a large kernel image (>2G) or enabling the KASLR feature, we
>>>>>>>> may need a new code mode to deal with this issue.
>>>>>>>>
>>>>>>> The range [&_end - 2G, VMALLOC_END] won't be suitable when KASLR is
>>>>>>> enabled. Unless we limit the randomized range in small area, the
>>>>>>> module region start address will be bigger than VMALLOC_END.
>>>>>>
>>>>>> Actually, the relocatable patch I proposed already moves "everything" up
>>>>>> at the same
>>>>>> time: the kernel itself but also all the "zones" below (vmalloc,
>>>>>> vmemmap, fixup...etc)
>>>>>> since all those zones are defined from PAGE_OFFSET that is now dynamic.
>>>>>> So the modules
>>>>>> will remain at the same offset to the kernel, unless explicitly
>>>>>> randomized in the vmalloc zone.
>>>>>>
>>>>> OK, it makes sense. The module region moves along with kernel seems to
>>>>> stay away from the concern I mentioned.
>>>>>
>>>>> So now, the problem is that the pre-allocated region of percpu data is
>>>>> located after _end symbol as Vincent mentioned, the 2G distance seems
>>>>> to be too far for module region start address. (i.e. &_end - 2G).
>>>>
>>>>
>>>> Actually, I don't understand this issue: we are limited to addressing
>>>> symbols within +/- 2GB
>>>> from PC. So as long as the percpu symbol exists in the kernel, it is
>>>> below _end and then we don't
>>>> care that its content is initialized dynamically, as long as we can
>>>> 'compute' its address from PC,
>>>> right ?
>>>
>>> In this case, the static percpu symbols of this issue are declared in
>>> the kernel module, not in the kernel image.
>>> When kernel loads the kernel module, in general, it continuously
>>> places all the module sections in the VMALLOC memory area. However,
>>> the ".data..percpu" section is an exception. The kernel places the
>>> ".data..percpu" section in a pre-allocated memory region. This region
>>> is used to place the percpu data instances for each CPU and is created
>>> by the memblock(). Hence, the instance of these per-cpu symbols is
>>> above the _end.  In this case, if we make the module region locate at
>>> [_end-2G, PAGE_OFFSET], the distance between these percpu symbols (its
>>> address >_end) and module text section will probably exceed the 2G
>>> limitation.
>>
>> The static percpu symbols are a particular problem for loadable modules
>> on RISC-V, but dynamic percpu variables are perfectly fine. I had faced this
>> issue with the KVM RISC-V module and I converted the static percpu symbol into
>> a dynamic percpu variable. In fact, in the Linux kernel dynamic percpu variables
>> are preferred over static percpu symbols, so wherever we see an issue with a
>> static percpu symbol in any module we should just send a patch to convert
>> it into a dynamic percpu variable.
>>
>> In general, any memory access via pointers (just like dynamic percpu
>> variables) are fine as long as the pointer itself is within 2GB of relative
>> addressing.
>>
> 
> As far as I know, kernel modules use the PIC code model instead
> of medany. Therefore, I don't think most pointers in kernel modules
> have a 2GB limit. That is why modules without static percpu
> variables do not encounter the 32-bit offset issue.
> 
>> Overall, Alex's suggestion will address most cases of loadable modules
>> except modules having static percpu symbols and for such modules just
>> convert these percpu symbols into dynamic percpu variables.
>>
>>>
>>>>
>>>> Can you point out where I can take a look at this pre-allocated region
>>>> for percpu data please ?
>>>>
>>> In mm/percpu.c, you can find how the kernel allocates the percpu data
>>> region during the initialization phase.
>>> In kernel/module.c's simplify_symbols function, you can find kernel
>>> treats .data.percpu section as an exception when loading module.
>>>

Thanks for your explanations, I was miles away from understanding the
real problem. Good finding, Vincent.

>>> If I have any misunderstandings, please let me know. Thanks
>>
>> Please see above comment. We should prefer dynamic percpu variable
>> in loadable modules over static percpu symbols. I am sure there will
>> be very few kernel loadable modules having static percpu symbols.
>>
> Thanks for your comments.
> I agree that Alex's suggestion can address most cases of loadable
> modules. The reason why I pointed out the static percpu variable case
> is that the percpu variable is declared as a static symbol in the
> reduced case provided by Romain in
> https://lore.kernel.org/linux-riscv/1572281840733.3517@european-processor-initiative.eu/.
> Also, I found two static percpu variables declared in openvswitch
> module. After removing the static attribute, the kernel can insert
> them successfully. Hence, I think the static percpu variable causes
> the issue.
> 
> (According to the result of 'grep', very few kernel loadable modules
> use static percpu data, as you mentioned. However, I'm not sure this
> might be a reason to prohibit RISC-V users from using static percpu
> variables in kernel modules. Therefore, I have no comment on this.)
> If we can ignore the existence of static percpu variables in
> kernel modules, I agree with Alex's suggestion. The reason is not
> that we can use it to solve bugs, but that we can change the code
> model from PIC to medany by creating a specific kernel module region
> to improve the access performance.


If I understand correctly, you mean that having this module zone at the
end of the vmalloc zone does not fix the bug regarding static percpu
variables, but allows better performance than PIC, which uses an
additional indirection through the GOT, right?

> (Maybe we can keep the current implementation for some corner cases
> such as the size of the kernel image is close to 2GB)
> 

Either we follow Anup's advice and don't care about those static percpu
variables (if they ever fall beyond the 2GB limit, we cannot load the
module), or we find a way to make sure that the module range comprises
the area for module percpu variables, which is allocated in
setup_per_cpu_areas(). I'm taking a look at this solution and at how
other architectures with the same code model handle it.

Thanks again,
Alex


^ permalink raw reply	[flat|nested] 30+ messages in thread

* Re: Error on loading some network Kernel modules
       [not found]                             ` <a55f265e-71b2-5ebb-b079-6345007a442e@ghiti.fr>
@ 2020-02-05  3:22                               ` Vincent Chen
  2020-02-05  4:24                                 ` Anup Patel
  0 siblings, 1 reply; 30+ messages in thread
From: Vincent Chen @ 2020-02-05  3:22 UTC (permalink / raw)
  To: Alex Ghiti
  Cc: Carlos Eduardo de Paula, David Abdurachmanov, Anup Patel,
	Paul Walmsley, Zong Li, Romain Dolbeau, Zong Li, linux-riscv,
	Aurelien Jarno

On Wed, Feb 5, 2020 at 3:07 AM Alex Ghiti <alex@ghiti.fr> wrote:
>
>
> On 2/4/20 9:03 AM, Vincent Chen wrote:
>
> On Tue, Feb 4, 2020 at 7:31 PM Anup Patel <anup@brainfault.org> wrote:
>
> On Tue, Feb 4, 2020 at 4:16 PM Vincent Chen <vincent.chen@sifive.com> wrote:
>
> On Tue, Feb 4, 2020 at 5:32 PM Alexandre Ghiti <alex@ghiti.fr> wrote:
>
> On 2/4/20 8:19 AM, Zong Li wrote:
>
> Alex Ghiti <alex@ghiti.fr> 於 2020年2月4日 週二 下午2:50寫道:
>
> Hi Zong,
>
> On 2/3/20 10:55 PM, Zong Li wrote:
>
> Vincent Chen <vincent.chen@sifive.com> 於 2020年2月3日 週一 下午6:04寫道:
>
> On Mon, Feb 3, 2020 at 12:27 AM Romain Dolbeau
> <romain.dolbeau@european-processor-initiative.eu> wrote:
>
> On 2020-02-01 14:59, Alex Ghiti wrote:
>
> Why restrict to 128M whereas we have 2GB offset available to the end of
> the kernel ?
>
> Isn't that 2 GiB offset to whatever the module requires in the kernel,
> rather than to the end of the kernel space?
>
> Is there some guarantee that symbols accessible by modules are at the
> end of the kernel? If so, wouldn't the maximum offset for this patch
> still be (2 GiB - <total size of accessible symbols>)?
>
> Cordially,
>
> --
> Romain Dolbeau
>
> It took me some time to find the root cause of this problem, please
> allow me to share some observations before the discussion.
> The root cause of this issue is that the percpu data is declared as a
> static variable. The "static" attribute will make the compiler think
> that this symbol is close to the .text section at runtime. Hence, the
> compiler uses "auipc" to access this percpu data instead of using GOT.
> In this case,  the access range is limited to + -2G. However, in
> practice, these percpu data are placed at a pre-allocated region
> created by the memblock_allocate() function. In other words, the
> distance between the pre-allocated region (>PAGE_OFFSET ) and the
> .text section of the kernel module (in VMALLOC region) is much larger
> than 2G.
> I agree that the original patch,
> https://github.com/bjoto/linux/commit/8a56d1c8e8e91c1bc3893946d52b9217c96e1589,
> can solve most cases. However, I do not think the patch still works if
> the kernel module region is determined by _end or <total size of
> accessible symbols>. The reason is that the pre-allocated region for
> module percpu data comes from the memblock function at runtime. Hence,
> we cannot know the actual address of this region at compile-time, and
> this issue probably may occur again in this case.
>
> By the way, I think maybe we can refer to the implementation of MIPS.
> 1. For general cases, we can use this patch to solve this issue.
> 2. For a large kernel image (>2G) or enabling the KASLR feature, we
> may need a new code mode to deal with this issue.
>
> The range [&_end - 2G, VMALLOC_END] won't be suitable when KASLR is
> enabled. Unless we limit the randomized range in small area, the
> module region start address will be bigger than VMALLOC_END.
>
> Actually, the relocatable patch I proposed already moves "everything" up
> at the same
> time: the kernel itself but also all the "zones" below (vmalloc,
> vmemmap, fixup...etc)
> since all those zones are defined from PAGE_OFFSET that is now dynamic.
> So the modules
> will remain at the same offset to the kernel, unless explicitly
> randomized in the vmalloc zone.
>
> OK, it makes sense. The module region moves along with kernel seems to
> stay away from the concern I mentioned.
>
> So now, the problem is that the pre-allocated region of percpu data is
> located after _end symbol as Vincent mentioned, the 2G distance seems
> to be too far for module region start address. (i.e. &_end - 2G).
>
> Actually, I don't understand this issue: we are limited to addressing
> symbols within +/- 2GB
> from PC. So as long as the percpu symbol exists in the kernel, it is
> below _end and then we don't
> care that its content is initialized dynamically, as long as we can
> 'compute' its address from PC,
> right ?
>
> In this case, the static percpu symbols of this issue are declared in
> the kernel module, not in the kernel image.
> When kernel loads the kernel module, in general, it continuously
> places all the module sections in the VMALLOC memory area. However,
> the ".data..percpu" section is an exception. The kernel places the
> ".data..percpu" section in a pre-allocated memory region. This region
> is used to place the percpu data instances for each CPU and is created
> by the memblock(). Hence, the instance of these per-cpu symbols is
> above the _end.  In this case, if we make the module region locate at
> [_end-2G, PAGE_OFFSET], the distance between these percpu symbols (its
> address >_end) and module text section will probably exceed the 2G
> limitation.
>
> The static percpu symbols are a particular problem for loadable modules
> on RISC-V, but dynamic percpu variables are perfectly fine. I had faced this
> issue with the KVM RISC-V module and I converted the static percpu symbol into
> a dynamic percpu variable. In fact, in the Linux kernel dynamic percpu variables
> are preferred over static percpu symbols, so wherever we see an issue with a
> static percpu symbol in any module we should just send a patch to convert
> it into a dynamic percpu variable.
>
> In general, any memory access via pointers (just like dynamic percpu
> variables) are fine as long as the pointer itself is within 2GB of relative
> addressing.
>
> As far as I know, kernel modules use the PIC code model instead
> of medany. Therefore, I don't think most pointers in kernel modules
> have a 2GB limit. That is why modules without static percpu
> variables do not encounter the 32-bit offset issue.
>
> Overall, Alex's suggestion will address most cases of loadable modules
> except modules having static percpu symbols and for such modules just
> convert these percpu symbols into dynamic percpu variables.
>
> Can you point out where I can take a look at this pre-allocated region
> for percpu data please ?
>
> In mm/percpu.c, you can find how the kernel allocates the percpu data
> region during the initialization phase.
> In kernel/module.c's simplify_symbols function, you can find kernel
> treats .data.percpu section as an exception when loading module.
>
>
> Thanks for your explanations, I was miles away from understanding the real problem,
> good finding Vincent.
>
>
> If I have any misunderstandings, please let me know. Thanks
>
> Please see above comment. We should prefer dynamic percpu variable
> in loadable modules over static percpu symbols. I am sure there will
> be very few kernel loadable modules having static percpu symbols.
>
> Thanks for your comments.
> I agree that Alex's suggestion can address most cases of loadable
> modules. The reason why I pointed out the static percpu variable case
> is that the percpu variable is declared as a static symbol in the
> reduced case provided by Romain in
> https://lore.kernel.org/linux-riscv/1572281840733.3517@european-processor-initiative.eu/.
> Also, I found two static percpu variables declared in openvswitch
> module. After removing the static attribute, the kernel can insert
> them successfully. Hence, I think the static percpu variable causes
> the issue.
>
> (According to the result of 'grep', very few kernel loadable modules
> use static percpu data, as you mentioned. However, I'm not sure this
> might be a reason to prohibit RISC-V users from using static percpu
> variables in kernel modules. Therefore, I have no comment on this.)
> If we can ignore the existence of static percpu variables in
> kernel modules, I agree with Alex's suggestion. The reason is not
> that we can use it to solve bugs, but that we can change the code
> model from PIC to medany by creating a specific kernel module region
> to improve the access performance.
>
>
> If I understand well, you mean that having this module zone at the end of vmalloc zone
> does not allow to fix the bug regarding static percpu variables but allows to have better
> performance than PIC which uses an additional redirection using the GOT, right ?
>

Yes. Currently, I think the PIC model can resolve most pointer
references except for the static percpu variables. However, as you
mentioned, it loses some performance due to the additional
indirection through the GOT.

>
> (Maybe we can keep the current implementation for some corner cases
> such as the size of the kernel image is close to 2GB)
>
>
> Either we follow Anup's advice and we don't care about those static percpu variables and
> if ever they fall behind the 2GB limit, we can not load the module, or we find a way to make
> sure that the module range comprises the area for module percpu variables which is allocated
> in setup_per_cpu_areas. I'm taking a look at this solution and how other architectures with
> same code model handle that.
>
As far as I know, ARM64 and MIPS64 with CONFIG_KBUILD_64BIT_SYM32 also
use a PC-relative way to access static percpu variables. I quickly
describe their implementations below; please correct me if I have any
misunderstandings, thanks.
1. ARM64
    When CONFIG_RANDOMIZE_BASE is disabled, they place the module
region at "(u64)_etext - MODULES_VSIZE" (128MB), and the range
limitation of PC-relative addressing (the adrp instruction) is +-4GB.
When CONFIG_RANDOMIZE_BASE is enabled, the maximum access range between
a module region symbol and a kernel symbol is limited to 2GB. There is
still 2GB of room to access the per-cpu data. Therefore, I think it is
also safe to access the per-cpu variables.
2. MIPS64 with CONFIG_KBUILD_64BIT_SYM32
     The VMALLOC region is placed after the kernel image. Therefore,
the start address of the kernel image is at 0xffffffff80100000, and
the kernel module region is at 0xffffffffc0000000, which is the
beginning of VMALLOC. They do not encounter this issue because the
range limitation of PC-relative addressing is +-2GB.

Currently, I have three solutions to fix the percpu data problem, and
I list them below. Any feedback is welcome.

1. Create a specific module region located at [PAGE_OFFSET - SIZE, PAGE_OFFSET].
    I think this way can solve most cases because the general size of
the kernel image is much smaller than 2GB, and we can still change the
code model from PIC to medany. However, it will still encounter the
same problem if the kernel image is close to 2GB.

2. Request a new code model for the kernel
    Maybe we can ask the compiler developers to provide a new PIC mode
for the kernel, which uses the GOT to access every static percpu
variable instead of PC-relative addressing. However, I am not sure
whether this idea would be accepted by the compiler developers. If
accepted, it may take a long time to realize.

3. Jump to a resolver function
Zong and I came up with this rough idea to address the problem, but
we have not implemented it.

The following is the current code pattern for accessing the percpu data.
64: 00000917 auipc s2,0x0
68: 00093903 ld s2,0(s2) # 64 <.L0 >

We plan to change the "ld s2,0(s2)" to a "jalr xxx(s2)" that redirects
the code to the corresponding resolver function. Each static percpu
variable owns a resolver function. These resolver functions fetch the
variable's content and then store it into the target register by
decoding the auipc instruction. We think this method may solve the
problem, but it will cost some performance due to the redirection.

I prefer a solution combining method 1 and method 2:
1. In general, users can choose method 1 to improve the performance of
data access.
2. For safety, users can use the new code model to access static
percpu data without any limitation.
We can create a new kernel config, a bit like MIPS
CONFIG_KBUILD_64BIT_SYM32, for users to choose.
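
For reference, the two access patterns being compared can be sketched
as follows (illustrative RISC-V assembly; "var" stands for any static
symbol, and the actual offsets are filled in by the assembler/linker):

```asm
# medany / PC-relative (what static percpu access compiles to):
# the target must be within +/- 2GB of PC.
1:	auipc	s2, %pcrel_hi(var)
	ld	a0, %pcrel_lo(1b)(s2)	# a0 = var

# PIC / GOT-indirect: one extra load, but the variable itself can
# live anywhere; only the GOT entry must be within +/- 2GB of PC.
2:	auipc	s2, %got_pcrel_hi(var)
	ld	s2, %pcrel_lo(2b)(s2)	# s2 = &var, loaded from the GOT
	ld	a0, 0(s2)		# a0 = var
```

The extra GOT load is the indirection cost mentioned above, and the
first sequence is the one that cannot reach percpu instances placed
above _end.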


^ permalink raw reply	[flat|nested] 30+ messages in thread

* RE: Error on loading some network Kernel modules
  2020-02-05  3:22                               ` Vincent Chen
@ 2020-02-05  4:24                                 ` Anup Patel
  2020-02-05 10:37                                   ` Vincent Chen
  0 siblings, 1 reply; 30+ messages in thread
From: Anup Patel @ 2020-02-05  4:24 UTC (permalink / raw)
  To: Vincent Chen, Alex Ghiti
  Cc: Carlos Eduardo de Paula, David Abdurachmanov, Anup Patel,
	Paul Walmsley, Zong Li, Romain Dolbeau, Zong Li, linux-riscv,
	Aurelien Jarno



> -----Original Message-----
> From: linux-riscv <linux-riscv-bounces@lists.infradead.org> On Behalf Of
> Vincent Chen
> Sent: Wednesday, February 5, 2020 8:52 AM
> To: Alex Ghiti <alex@ghiti.fr>
> Cc: Carlos Eduardo de Paula <me@carlosedp.com>; David Abdurachmanov
> <david.abdurachmanov@gmail.com>; Anup Patel <anup@brainfault.org>;
> Paul Walmsley <paul@pwsan.com>; Zong Li <zongbox@gmail.com>; Romain
> Dolbeau <romain.dolbeau@european-processor-initiative.eu>; Zong Li
> <zong.li@sifive.com>; linux-riscv <linux-riscv@lists.infradead.org>; Aurelien
> Jarno <aurelien@aurel32.net>
> Subject: Re: Error on loading some network Kernel modules
> 
> On Wed, Feb 5, 2020 at 3:07 AM Alex Ghiti <alex@ghiti.fr> wrote:
> >
> >
> > On 2/4/20 9:03 AM, Vincent Chen wrote:
> >
> > On Tue, Feb 4, 2020 at 7:31 PM Anup Patel <anup@brainfault.org> wrote:
> >
> > On Tue, Feb 4, 2020 at 4:16 PM Vincent Chen <vincent.chen@sifive.com>
> wrote:
> >
> > On Tue, Feb 4, 2020 at 5:32 PM Alexandre Ghiti <alex@ghiti.fr> wrote:
> >
> > On 2/4/20 8:19 AM, Zong Li wrote:
> >
> > Alex Ghiti <alex@ghiti.fr> 於 2020年2月4日 週二 下午2:50寫道:
> >
> > Hi Zong,
> >
> > On 2/3/20 10:55 PM, Zong Li wrote:
> >
> > Vincent Chen <vincent.chen@sifive.com> 於 2020年2月3日 週一 下午
> 6:04寫道:
> >
> > On Mon, Feb 3, 2020 at 12:27 AM Romain Dolbeau
> > <romain.dolbeau@european-processor-initiative.eu> wrote:
> >
> > On 2020-02-01 14:59, Alex Ghiti wrote:
> >
> > Why restrict to 128M whereas we have 2GB offset available to the end
> > of the kernel ?
> >
> > Isn't that 2 GiB offset to whatever the module requires in the kernel,
> > rather than to the end of the kernel space?
> >
> > Is there some guarantee that symbols accessible by modules are at the
> > end of the kernel? If so, wouldn't the maximum offset for this patch
> > still be (2 GiB - <total size of accessible symbols>)?
> >
> > Cordially,
> >
> > --
> > Romain Dolbeau
> >
> > It took me some time to find the root cause of this problem, please
> > allow me to share some observations before the discussion.
> > The root cause of this issue is that the percpu data is declared as a
> > static variable. The "static" attribute will make the compiler think
> > that this symbol is close to the .text section at runtime. Hence, the
> > compiler uses "auipc" to access this percpu data instead of using GOT.
> > In this case,  the access range is limited to + -2G. However, in
> > practice, these percpu data are placed at a pre-allocated region
> > created by the memblock_allocate() function. In other words, the
> > distance between the pre-allocated region (>PAGE_OFFSET ) and the
> > .text section of the kernel module (in VMALLOC region) is much larger
> > than 2G.
> > I agree that the original patch,
> >
> > https://github.com/bjoto/linux/commit/8a56d1c8e8e91c1bc3893946d52b9217c96e1589,
> > can solve most cases. However, I do not think the patch
> > still works if the kernel module region is determined by _end or
> > <total size of accessible symbols>. The reason is that the
> > pre-allocated region for module percpu data comes from the memblock
> > function at runtime. Hence, we cannot know the actual address of this
> > region at compile-time, and this issue probably may occur again in
> > this case.
> >
> > By the way, I think maybe we can refer to the implementation of MIPS.
> > 1. For general cases, we can use this patch to solve this issue.
> > 2. For a large kernel image (>2G) or enabling the KASLR feature, we
> > may need a new code mode to deal with this issue.
> >
> > The range [&_end - 2G, VMALLOC_END] won't be suitable when KASLR is
> > enabled. Unless we limit the randomized range in small area, the
> > module region start address will be bigger than VMALLOC_END.
> >
> > Actually, the relocatable patch I proposed already moves "everything"
> > up at the same
> > time: the kernel itself but also all the "zones" below (vmalloc,
> > vmemmap, fixup...etc) since all those zones are defined from
> > PAGE_OFFSET that is now dynamic.
> > So the modules
> > will remain at the same offset to the kernel, unless explicitly
> > randomized in the vmalloc zone.
> >
> > OK, it makes sense. The module region moves along with kernel seems to
> > stay away from the concern I mentioned.
> >
> > So now, the problem is that the pre-allocated region of percpu data is
> > located after _end symbol as Vincent mentioned, the 2G distance seems
> > to be too far for module region start address. (i.e. &_end - 2G).
> >
> > Actually, I don't understand this issue: we are limited to addressing
> > symbols within +/- 2GB from PC. So as long as the percpu symbol exists
> > in the kernel, it is below _end and then we don't care that its
> > content is initialized dynamically, as long as we can 'compute' its
> > address from PC, right ?
> >
> > In this case, the static percpu symbols of this issue are declared in
> > the kernel module, not in the kernel image.
> > When kernel loads the kernel module, in general, it continuously
> > places all the module sections in the VMALLOC memory area. However,
> > the ".data..percpu" section is an exception. The kernel places the
> > ".data..percpu" section in a pre-allocated memory region. This region
> > is used to place the percpu data instances for each CPU and is created
> > by memblock(). Hence, the instances of these per-cpu symbols are
> > above _end. In this case, if we make the module region
> > [_end - 2G, PAGE_OFFSET], the distance between these percpu symbols
> > (whose addresses are > _end) and the module text section will probably
> > exceed the 2GB limitation.
> >
> > The static percpu symbols are particularly problematic for loadable
> > modules on RISC-V, but dynamic percpu variables are perfectly fine. I
> > had faced this issue with the KVM RISC-V module and converted a static
> > percpu symbol into a dynamic percpu variable. In fact, in the Linux
> > kernel dynamic percpu variables are preferred over static percpu
> > symbols, so wherever we see an issue with a static percpu symbol in
> > any module we should just send a patch to convert it into a dynamic
> > percpu variable.
> >
> > In general, any memory access via pointers (just like dynamic percpu
> > variables) are fine as long as the pointer itself is within 2GB of
> > relative addressing.
> >
> > As far as I know, the kernel module uses PIC as the code model instead
> > of medany. Therefore, I don't think most pointers in kernel modules
> > have a 2GB limit. That is why the modules without static percpu
> > variables do not encounter the 32-bit offset issue.
> >
> > Overall, Alex's suggestion will address most cases of loadable modules
> > except modules having static percpu symbols and for such modules just
> > convert these percpu symbols into dynamic percpu variables.
> >
> > Can you point out where I can take a look at this pre-allocated region
> > for percpu data please ?
> >
> > In mm/percpu.c, you can find how the kernel allocates the percpu data
> > region during the initialization phase.
> > In kernel/module.c's simplify_symbols() function, you can see that the
> > kernel treats the .data..percpu section as an exception when loading a
> > module.
> >
> >
> > Thanks for your explanations, I was miles away from understanding the
> > real problem, good finding Vincent.
> >
> >
> > If I have any misunderstandings, please let me know. Thanks
> >
> > Please see above comment. We should prefer dynamic percpu variable in
> > loadable modules over static percpu symbols. I am sure there will be
> > very few kernel loadable modules having static percpu symbols.
> >
> > Thanks for your comments.
> > I agree that Alex's suggestion can address most cases of loadable
> > modules. The reason why I pointed out the static percpu variable case
> > is that the percpu variable is declared as a static symbol in the
> > reduced case provided by Romain in
> > https://lore.kernel.org/linux-riscv/1572281840733.3517@european-processor-initiative.eu/.
> > Also, I found two static percpu variables declared in openvswitch
> > module. After removing the static attribute, the kernel can insert
> > them successfully. Hence, I think the static percpu variable causes
> > the issue.
> >
> > (According to the result of 'grep', very few kernel loadable modules
> > use static percpu data, as you mentioned. However, I'm not sure this
> > might be a reason to prohibit RISC-V users from using static percpu
> > variables in kernel modules. Therefore, I have no comment on this.) If
> > we can ignore the existence of static percpu variables in kernel
> > modules, I agree with Alex's suggestions. The reason is not that we can
> > use it to solve bugs, but that we can change the code model from PIC
> > to medany by creating a specific kernel module region to improve the
> > access performance.
> >
> >
> > If I understand correctly, you mean that having this module zone at the end
> > of vmalloc zone does not allow to fix the bug regarding static percpu
> > variables but allows to have better performance than PIC which uses an
> additional redirection using the GOT, right ?
> >
> 
> Yes. Currently, I think the PIC model can resolve most pointer references
> except for the static percpu variables. However, as you mentioned, it will
> lose some performance due to the additional redirection through the GOT.
> 
> >
> > (Maybe we can keep the current implementation for some corner cases
> > such as the size of the kernel image is close to 2GB)
> >
> >
> > Either we follow Anup's advice and don't care about those static
> > percpu variables (and if they ever fall outside the 2GB limit, we
> > cannot load the module), or we find a way to make sure that the module
> > range comprises the area for module percpu variables which is
> > allocated in setup_per_cpu_areas. I'm taking a look at this solution and
> how other architectures with same code model handle that.
> >
> As far as I know, ARM64 and MIPS64 with CONFIG_KBUILD_64BIT_SYM32
> also use a PC-relative way to access static percpu variables. I quickly
> describe their implementation below, please correct me if I have any
> misunderstandings, thanks.
> 1. ARM64
>     When CONFIG_RANDOMIZE_BASE is disabled, they place the module
> region at "(u64)_etext - MODULES_VSIZE(128MB)", and the range limitation
> of PC-relative addressing (the adrp instruction) is +-4GB. When CONFIG_RANDOMIZE_BASE
> is enabled, the maximum access range between module region symbol and
> kernel symbol is limited to 2GB. There is still 2GB room to access the per-cpu
> data. Therefore, I think it is also safe to access the per-cpu variables.
> 2. MIPS64 with CONFIG_KBUILD_64BIT_SYM32
>      The VMALLOC region is placed after the kernel image. Therefore, the start
> address of the kernel image is at 0xffffffff80100000, and the kernel
> module region is at 0xffffffffc0000000 which is the beginning of the
> VMALLOC. They do not encounter this issue because the range limitation of
> PC-relative is +-2GB.
> 
> Currently, I have three solutions to fix the percpu data problem, and I list
> them below. Any feedback is welcome.
> 
> 1. Create a specific module region located at [PAGE_OFFSET-SIZE,
> PAGE_OFFSET].
>     I think this can solve most cases because the kernel image is generally
> much smaller than 2GB, and we can still change the code model from PIC to
> medany. However, it will encounter the same problem if the kernel image
> size approaches 2GB.
> 
> 2. Request a new code model for kernel
>     Maybe we can ask the compiler developers to add a new PIC mode for the
> kernel that uses the GOT to access every static per-cpu datum instead of
> PC-relative addressing. However, I am not sure whether the compiler
> developers would accept this idea. If accepted, it may take a long time to
> implement.
> 
> 3. Jump to a resolved function.
> Zong and I came up with this rough idea to address the problem, but we
> have not implemented it.
> 
> The following is the current code pattern for accessing the percpu data.
> 64: 00000917 auipc s2,0x0
> 68: 00093903 ld s2,0(s2) # 64 <.L0 >
> 
> We plan to change the " ld s2,0(s2) " to "jalr xxx(s2)" to redirect the code to
> the corresponding resolved function. Each static percpu data owns a resolved
> function. These resolved functions will go to get the variable content, and
> then it stores the variable to the target register by decoding the auipc
> instruction. We think this method may solve this problem, but it will drop the
> performance due to the redirection.
> 
> I prefer the solution combining method 1 and method 2.
> 1. In general, users can choose method 1 to improve the performance of
> data access.
> 2. For safety, users can use the new code model to access the static percpu
> data without any limitation.
> We can create a new kernel config that looks a bit like MIPS
> CONFIG_KBUILD_64BIT_SYM32 for users to choose.

I think we should not waste time in our quest to find a perfect solution. We
should first have a module_alloc() in place (based on various suggestions in
this email thread) so that most people don't see issues. We can fix module
loading for modules with static per-cpu symbols separately.

For point1, we don't need a dedicated region as long as we are allocating
modules from VMALLOC area. Let's avoid another virtual memory region if
possible.

For point2, loading the symbol address of static per-cpu symbols is always
PC-relative on most RISC architectures, and static per-cpu symbols in
loadable modules are a known issue (Refer, https://lwn.net/Articles/1509/).
I don't agree that this should be solved in compilers, because per-cpu data
management is Linux-specific (or OS-specific), so we should rather explore
ways to handle this in Linux itself. The dynamic per-cpu alloc APIs (i.e.
alloc_percpu() and friends) were added to primarily benefit loadable
modules and NUMA systems. The static per-cpu symbols (i.e.
DEFINE_PER_CPU() and friends) should only be used in built-in modules
or code linked to kernel. It is generally very easy to upgrade code using
static per-cpu symbols to use dynamic per-cpu symbols.

Regards,
Anup



^ permalink raw reply	[flat|nested] 30+ messages in thread

* Re: Error on loading some network Kernel modules
  2020-02-05  4:24                                 ` Anup Patel
@ 2020-02-05 10:37                                   ` Vincent Chen
  2020-02-07 14:39                                     ` Vincent Chen
  0 siblings, 1 reply; 30+ messages in thread
From: Vincent Chen @ 2020-02-05 10:37 UTC (permalink / raw)
  To: Anup Patel
  Cc: Carlos Eduardo de Paula, Alex Ghiti, David Abdurachmanov,
	Anup Patel, Zong Li, Romain Dolbeau, Zong Li, linux-riscv,
	Paul Walmsley, Aurelien Jarno

On Wed, Feb 5, 2020 at 12:24 PM Anup Patel <Anup.Patel@wdc.com> wrote:
>
>
>
> > -----Original Message-----
> > From: linux-riscv <linux-riscv-bounces@lists.infradead.org> On Behalf Of
> > Vincent Chen
> > Sent: Wednesday, February 5, 2020 8:52 AM
> > To: Alex Ghiti <alex@ghiti.fr>
> > Cc: Carlos Eduardo de Paula <me@carlosedp.com>; David Abdurachmanov
> > <david.abdurachmanov@gmail.com>; Anup Patel <anup@brainfault.org>;
> > Paul Walmsley <paul@pwsan.com>; Zong Li <zongbox@gmail.com>; Romain
> > Dolbeau <romain.dolbeau@european-processor-initiative.eu>; Zong Li
> > <zong.li@sifive.com>; linux-riscv <linux-riscv@lists.infradead.org>; Aurelien
> > Jarno <aurelien@aurel32.net>
> > Subject: Re: Error on loading some network Kernel modules
> >
> > On Wed, Feb 5, 2020 at 3:07 AM Alex Ghiti <alex@ghiti.fr> wrote:
> > >
> > >
> > > On 2/4/20 9:03 AM, Vincent Chen wrote:
> > >
> > > On Tue, Feb 4, 2020 at 7:31 PM Anup Patel <anup@brainfault.org> wrote:
> > >
> > > On Tue, Feb 4, 2020 at 4:16 PM Vincent Chen <vincent.chen@sifive.com>
> > wrote:
> > >
> > > On Tue, Feb 4, 2020 at 5:32 PM Alexandre Ghiti <alex@ghiti.fr> wrote:
> > >
> > > On 2/4/20 8:19 AM, Zong Li wrote:
> > >
> > > Alex Ghiti <alex@ghiti.fr> 於 2020年2月4日 週二 下午2:50寫道:
> > >
> > > Hi Zong,
> > >
> > > On 2/3/20 10:55 PM, Zong Li wrote:
> > >
> > > Vincent Chen <vincent.chen@sifive.com> 於 2020年2月3日 週一 下午
> > 6:04寫道:
> > >
> > > On Mon, Feb 3, 2020 at 12:27 AM Romain Dolbeau
> > > <romain.dolbeau@european-processor-initiative.eu> wrote:
> > >
> > > On 2020-02-01 14:59, Alex Ghiti wrote:
> > >
> > > Why restrict to 128M whereas we have 2GB offset available to the end
> > > of the kernel ?
> > >
> > > Isn't that 2 GiB offset to whatever the module requires in the kernel,
> > > rather than to the end of the kernel space?
> > >
> > > Is there some guarantee that symbols accessible by modules are at the
> > > end of the kernel? If so, wouldn't the maximum offset for this patch
> > > still be (2 GiB - <total size of accessible symbols>)?
> > >
> > > Cordially,
> > >
> > > --
> > > Romain Dolbeau
> > >
> > > It took me some time to find the root cause of this problem, please
> > > allow me to share some observations before the discussion.
> > > The root cause of this issue is that the percpu data is declared as a
> > > static variable. The "static" attribute will make the compiler think
> > > that this symbol is close to the .text section at runtime. Hence, the
> > > compiler uses "auipc" to access this percpu data instead of using GOT.
> > > In this case,  the access range is limited to + -2G. However, in
> > > practice, these percpu data are placed at a pre-allocated region
> > > created by the memblock_allocate() function. In other words, the
> > > distance between the pre-allocated region (>PAGE_OFFSET ) and the
> > > .text section of the kernel module (in VMALLOC region) is much larger
> > > than 2G.
> > > I agree that the original patch,
> > >
> > https://github.com/bjoto/linux/commit/8a56d1c8e8e91c1bc3893946d52b921
> > 7
> > > c96e1589, can solve most cases. However, I do not think the patch
> > > still works if the kernel module region is determined by _end or
> > > <total size of accessible symbols>. The reason is that the
> > > pre-allocated region for module percpu data comes from the memblock
> > > function at runtime. Hence, we cannot know the actual address of this
> > > region at compile-time, and this issue probably may occur again in
> > > this case.
> > >
> > > By the way, I think maybe we can refer to the implementation of MIPS.
> > > 1. For general cases, we can use this patch to solve this issue.
> > > 2. For a large kernel image (>2G) or enabling the KASLR feature, we
> > > may need a new code mode to deal with this issue.
> > >
> > > The range [&_end - 2G, VMALLOC_END] won't be suitable when KASLR is
> > > enabled. Unless we limit the randomized range in small area, the
> > > module region start address will be bigger than VMALLOC_END.
> > >
> > > Actually, the relocatable patch I proposed already moves "everything"
> > > up at the same
> > > time: the kernel itself but also all the "zones" below (vmalloc,
> > > vmemmap, fixup...etc) since all those zones are defined from
> > > PAGE_OFFSET that is now dynamic.
> > > So the modules will remain at the same offset to the kernel, unless
> > > explicitly randomized in the vmalloc zone.
> > >
> > > OK, it makes sense. The module region moves along with kernel seems to
> > > stay away from the concern I mentioned.
> > >
> > > So now, the problem is that the pre-allocated region of percpu data is
> > > located after _end symbol as Vincent mentioned, the 2G distance seems
> > > to be too far for module region start address. (i.e. &_end - 2G).
> > >
> > > Actually, I don't understand this issue: we are limited to addressing
> > > symbols within +/- 2GB from PC. So as long as the percpu symbol exists
> > > in the kernel, it is below _end and then we don't care that its
> > > content is initialized dynamically, as long as we can 'compute' its
> > > address from PC, right ?
> > >
> > > In this case, the static percpu symbols of this issue are declared in
> > > the kernel module, not in the kernel image.
> > > When kernel loads the kernel module, in general, it continuously
> > > places all the module sections in the VMALLOC memory area. However,
> > > the ".data..percpu" section is an exception. The kernel places the
> > > ".data..percpu" section in a pre-allocated memory region. This region
> > > is used to place the percpu data instances for each CPU and is created
> > > by the memblock(). Hence, the instance of these per-cpu symbols is
> > > above the _end.  In this case, if we make the module region locate at
> > > [_end-2G, PAGE_OFFSET], the distance between these percpu symbols (its
> > > address >_end) and module text section will probably exceed the 2G
> > > limitation.
> > >
> > > The static percpu symbols are particularly problematic for loadable
> > > modules on RISC-V, but dynamic percpu variables are perfectly fine. I
> > > had faced this issue with KVM RISC-V module and I converted static
> > > percpu symbol into dynamic percpu variable. In fact, in Linux kernel
> > > dynamic percpu variables are preferred over static percpu symbols so
> > > wherever we see an issue with a static percpu symbol in any module we
> > > should just send a patch to convert it into a dynamic percpu variable.
> > >
> > > In general, any memory access via pointers (just like dynamic percpu
> > > variables) are fine as long as the pointer itself is within 2GB of
> > > relative addressing.
> > >
> > > As far as I know, the kernel module uses PIC as the code model instead
> > > of medany. Therefore, I don't think most pointers in kernel modules
> > > have a 2GB limit. That is why the modules without static percpu
> > > variables do not encounter the 32-bit offset issue.
> > >
> > > Overall, Alex's suggestion will address most cases of loadable modules
> > > except modules having static percpu symbols and for such modules just
> > > convert these percpu symbols into dynamic percpu variables.
> > >
> > > Can you point out where I can take a look at this pre-allocated region
> > > for percpu data please ?
> > >
> > > In mm/percpu.c, you can find how the kernel allocates the percpu data
> > > region during the initialization phase.
> > > In kernel/module.c's simplify_symbols function, you can find kernel
> > > treats .data.percpu section as an exception when loading module.
> > >
> > >
> > > Thanks for your explanations, I was miles away from understanding the
> > > real problem, good finding Vincent.
> > >
> > >
> > > If I have any misunderstandings, please let me know. Thanks
> > >
> > > Please see above comment. We should prefer dynamic percpu variable in
> > > loadable modules over static percpu symbols. I am sure there will be
> > > very few kernel loadable modules having static percpu symbols.
> > >
> > > Thanks for your comments.
> > > I agree that Alex's suggestion can address most cases of loadable
> > > modules. The reason why I pointed out the static percpu variable case
> > > is that the percpu variable is declared as a static symbol in the
> > > reduced case provided by Romain in
> > > https://lore.kernel.org/linux-riscv/1572281840733.3517@european-processor-initiative.eu/.
> > > Also, I found two static percpu variables declared in openvswitch
> > > module. After removing the static attribute, the kernel can insert
> > > them successfully. Hence, I think the static percpu variable causes
> > > the issue.
> > >
> > > (According to the result of 'grep', very few kernel loadable modules
> > > use static percpu data, as you mentioned. However, I'm not sure this
> > > might be a reason to prohibit RISC-V users from using static percpu
> > > variables in kernel modules. Therefore, I have no comment on this.) If
> > > we can ignore the existence of static percpu variables in kernel
> > > modules, I agree with Alex's suggestions. The reason is not that we can
> > > use it to solve bugs, but that we can change the code model from PIC
> > > to medany by creating a specific kernel module region to improve the
> > > access performance.
> > >
> > >
> > > If I understand well, you mean that having this module zone at the end
> > > of vmalloc zone does not allow to fix the bug regarding static percpu
> > > variables but allows to have better performance than PIC which uses an
> > additional redirection using the GOT, right ?
> > >
> >
> > Yes. Currently, I think the PIC model can resolve most pointer references
> > except for the static percpu variables. However, as you mentioned, it will
> > lose some performance due to the additional redirection through the GOT.
> >
> > >
> > > (Maybe we can keep the current implementation for some corner cases
> > > such as the size of the kernel image is close to 2GB)
> > >
> > >
> > > Either we follow Anup's advice and we don't care about those static
> > > percpu variables and if ever they fall behind the 2GB limit, we can
> > > not load the module, or we find a way to make sure that the module
> > > range comprises the area for module percpu variables which is
> > > allocated in setup_per_cpu_areas. I'm taking a look at this solution and
> > how other architectures with same code model handle that.
> > >
> > As far as I know, ARM64 and MIPS64 with CONFIG_KBUILD_64BIT_SYM32
> > also use a PC-relative way to access static percpu variables. I quickly
> > describe their implementation below, please correct me if I have any
> > misunderstandings, thanks.
> > 1. ARM64
> >     When CONFIG_RANDOMIZE_BASE is disabled, they place the module
> > region at "(u64)_etext - MODULES_VSIZE(128MB)", and the range limitation
> > of PC-relative ( instruction adrp) is +-4GB. When CONFIG_RANDOMIZE_BASE
> > is enabled, the maximum access range between module region symbol and
> > kernel symbol is limited to 2GB. There is still 2GB room to access the per-cpu
> > data. Therefore, I think it is also safe to access the per-cpu variables.
> > 2. MIPS64 with CONFIG_KBUILD_64BIT_SYM32
> >      The VMALLOC region is placed after the kernel image. Therefore, the start
> > address of the kernel image is at 0xffffffff80100000, and the kernel
> > module region is at 0xffffffffc0000000 which is the beginning of the
> > VMALLOC. They do not encounter this issue because the range limitation of
> > PC-relative is +-2GB.
> >
> > Currently, I have three solutions to fix the percpu data problem, and I list
> > them below. Any feedback is welcome.
> >
> > 1. Create a specific module region locating at [PAGE_OFFSET-SIZE,
> > PAGE_OFFSET].
> >     I think this way can solve most cases because the general size of kernel
> > image is much smaller than 2G, and we still can change the code model from
> > PIC to medany. However, it will still encounter the same problem if the
> > kernel image is near to 2GB.
> >
> > 2. Request a new code model for kernel
> >     Maybe we can ask the compiler developers to add a new PIC mode for
> > the kernel that uses the GOT to access every static per-cpu datum instead
> > of PC-relative addressing. However, I am not sure whether the compiler
> > developers would accept this idea. If accepted, it may take a long time
> > to implement.
> >
> > 3. Jump to a resolved function.
> > Zong and I came up with this rough idea to address the problem, but we
> > have not implemented it.
> >
> > The following is the current code pattern for accessing the percpu data.
> > 64: 00000917 auipc s2,0x0
> > 68: 00093903 ld s2,0(s2) # 64 <.L0 >
> >
> > We plan to change the " ld s2,0(s2) " to "jalr xxx(s2)" to redirect the code to
> > the corresponding resolved function. Each static percpu data owns a resolved
> > function. These resolved functions will go to get the variable content, and
> > then it stores the variable to the target register by decoding the auipc
> > instruction. We think this method may solve this problem, but it will drop the
> > performance due to the redirection.
> >
> > I preferred the solution combining method 1 and method 2.
> > 1. In general, users can choose method 1 to improve the performance of
> > data access.
> > 2. For safety, users can use the new code model to access the static percpu
> > data without any limitation.
> > We can create a new kernel config that looks a bit like MIPS
> > CONFIG_KBUILD_64BIT_SYM32 for users to choose.
>
> I think we should not waste time in our quest to find a perfect solution. We
> should first have a module_alloc() in place (based on various suggestions in
> this email thread) so that most people don't see issues. We can fix module
> loading for modules with static per-cpu symbols separately.

I mean "static" is the attribute in C language not the type of the
percpu symbol.
Sorry, the description provided is not accurate and may cause you
misunderstanding, and I also misunderstand the meaning of static
percpu symbols that you mentioned early.

>
> For point1, we don't need a dedicated region as long as we are allocating
> modules from VMALLOC area. Let's avoid another virtual memory region if
> possible.
>

Your comments inspire me: if we can know the end of the percpu data
region, we can dynamically calculate the start address in each module
allocation as "<end of the percpu data region> - 2GB" if needed. I am
looking for a way to derive the <end of the percpu data region>.



> For point2, loading the symbol address of static per-cpu symbols is always
> PC-relative on most RISC architectures, and static per-cpu symbols in
> loadable modules are a known issue (Refer, https://lwn.net/Articles/1509/).
> I don't agree that this should be solved in compilers, because per-cpu data
> management is Linux-specific (or OS-specific), so we should rather explore
> ways to handle this in Linux itself. The dynamic per-cpu alloc APIs (i.e.
> alloc_percpu() and friends) were added to primarily benefit loadable
> modules and NUMA systems. The static per-cpu symbols (i.e.
> DEFINE_PER_CPU() and friends) should only be used in built-in modules
> or code linked to kernel. It is generally very easy to upgrade code using
> static per-cpu symbols to use dynamic per-cpu symbols.
>
It seems we want to deal with the same problem from different
viewpoints, and I understand what you mean now. I found that other
RISC-V users also want the compiler to use the GOT to access static
symbols instead of PC-relative addressing:
https://github.com/riscv/riscv-gcc/issues/158
https://groups.google.com/a/groups.riscv.org/forum/#!msg/sw-dev/Xa3FN6gQhPI/59xUWUpBCQAJ
If this feature is supported, I think this issue can be easily resolved :)


^ permalink raw reply	[flat|nested] 30+ messages in thread

* Re: Error on loading some network Kernel modules
  2020-02-05 10:37                                   ` Vincent Chen
@ 2020-02-07 14:39                                     ` Vincent Chen
  2020-02-07 14:51                                       ` Vincent Chen
  0 siblings, 1 reply; 30+ messages in thread
From: Vincent Chen @ 2020-02-07 14:39 UTC (permalink / raw)
  To: Anup Patel
  Cc: Carlos Eduardo de Paula, Alex Ghiti, David Abdurachmanov,
	Anup Patel, Zong Li, Romain Dolbeau, Zong Li, linux-riscv,
	Paul Walmsley, Aurelien Jarno

> > For point1, we don't need a dedicated region as long as we are allocating
> > modules from VMALLOC area. Let's avoid another virtual memory region if
> > possible.
> >
>
> Your comments inspire me if we can know the end of the percpu data
> region, we can dynamically calculate the start address in each module
> allocation by "<end of the percpu data region> - 2GB" if needed. I am
> finding a way to derive the <end of the percpu data region>.
>

While implementing the idea, I found that I did have some
misunderstandings about the mechanism for accessing static percpu
symbols. Currently, I think this issue can be solved by modifying
module_alloc() as follows.
#define MODULE_START max(PFN_ALIGN(&_end - 2 * SZ_1G), \
                         PFN_ALIGN(&_text + 8 * SZ_1K - 2 * SZ_1G))

void *module_alloc(unsigned long size)
{
	return __vmalloc_node_range(size, 1, MODULE_START, VMALLOC_END,
				    GFP_KERNEL, PAGE_KERNEL_EXEC, 0,
				    NUMA_NO_NODE,
				    __builtin_return_address(0));
}
After applying the modification above, the kernel can successfully
insert the test module provided by Romain. Any feedback is welcome.
For those curious about the reason, I describe it roughly at the end
of this mail. I am sorry if anyone was misled by the earlier
discussion.


As mentioned earlier, the kernel pre-allocates a memory region for
each CPU to hold the instances of the percpu data. According to the
declaration type, each memory region can be divided into three chunks.
From low memory to high memory, these are the static chunk, the
reserved chunk, and the dynamic chunk.
a. The static chunk is used for the static percpu symbols declared in
the kernel image.
b. The reserved chunk is used for the static percpu symbols declared
in kernel modules.
c. The dynamic chunk is used for all dynamic percpu symbols.

The order of the percpu symbols in each memory region is the same and
is recorded in a symbol list maintained by the kernel. Therefore, the
address of a specific CPU's percpu symbol is "the address of this
percpu symbol in the symbol list" + "the offset from the symbol list
to the specific CPU region".
a. The symbol list is created based on the ".data..percpu" section. In
other words, the start address of this symbol list is the start
address of ".data..percpu". When a dynamic percpu symbol is created or
a kernel module imports a static percpu symbol, the kernel follows
the chunk rule above to add this percpu symbol to its suitable region.
b. "The offset from the symbol list to the specific CPU region" was
recorded when each memory region was created.

Back to the issue, according to the rules above, the kernel will
relocate the ".data..percpu" section of the loaded module to the end
of the ".data..percpu" section of the kernel image. The end of the
".data..percpu" section of the kernel image almost equals _text, and
the current size of the reserved chunk is 8KB. Therefore, the
appropriate range to place the loaded module is [max(PFN_ALIGN(&_end -
2 *SZ_1G), PFN_ALIGN(&_text + 8 * SZ_1K - 2 * SZ_1G)),  VMALLOC_END].


^ permalink raw reply	[flat|nested] 30+ messages in thread

* Re: Error on loading some network Kernel modules
  2020-02-07 14:39                                     ` Vincent Chen
@ 2020-02-07 14:51                                       ` Vincent Chen
  2020-02-10  6:37                                         ` Alex Ghiti
  0 siblings, 1 reply; 30+ messages in thread
From: Vincent Chen @ 2020-02-07 14:51 UTC (permalink / raw)
  To: Anup Patel
  Cc: Carlos Eduardo de Paula, Alex Ghiti, David Abdurachmanov,
	Anup Patel, Zong Li, Romain Dolbeau, Zong Li, linux-riscv,
	Paul Walmsley, Aurelien Jarno

On Fri, Feb 7, 2020 at 10:39 PM Vincent Chen <vincent.chen@sifive.com> wrote:
>
> > > For point1, we don't need a dedicated region as long as we are allocating
> > > modules from VMALLOC area. Let's avoid another virtual memory region if
> > > possible.
> > >
> >
> > Your comments inspire me if we can know the end of the percpu data
> > region, we can dynamically calculate the start address in each module
> > allocation by "<end of the percpu data region> - 2GB" if needed. I am
> > finding a way to derive the <end of the percpu data region>.
> >
>
> In the implementation of the idea, I found that I did have some
> misunderstandings about the mechanism of accessing static percpu
> symbol. Currently, I think this issue can be solved by modifying the
> module_allocate() to the following.
> #define MODULE_START max(PFN_ALIGN(&_end - 2 * SZ_1G), \
>                          PFN_ALIGN(&_text + 8 * SZ_1K - 2 * SZ_1G))
>
> void *module_alloc(unsigned long size)
> {
> 	return __vmalloc_node_range(size, 1, MODULE_START, VMALLOC_END,
> 				    GFP_KERNEL, PAGE_KERNEL_EXEC, 0,
> 				    NUMA_NO_NODE,
> 				    __builtin_return_address(0));
> }
> After applying a similar modification above, the kernel can
> successfully insert the test module provided by Romain. Any feedback
> is welcome. If someone is curious about the reason, I roughly
> described it at the end of this mail. I am sorry if someone was misled
> by early discussion.
>
>
> As mentioned early, kernel pre-allocates a memory region for each CPU
> to place the instances of the percpu data. According to the
> declaration type, each memory region can be divided into three chunks.
> These three chunks from low memory to high memory are static chunk,
> reserved chunk, and dynamic chunk.
> a. The static chunk is used for the static percpu symbols declared in the kernel
> b. The reserved chunk is used for the static percpu symbols declared
> in the kernel module
> c. The dynamic chunk is used for all dynamic percpu symbols.
>
> The order of the percpu symbols in each memory region is the same and
> is recorded in a symbol list maintained by the kernel. Therefore, the
> address of a specific CPU's percpu symbol is "The address of this
> percpu symbol in the symbol list" + "The offset from symbols list to
> the specific CPU region".
> a. The symbol list is created based on the ".data..percpu" section. In
> other words, the start address of this symbol list is the start
> address of ".data..percpu". When a dynamic percpu symbol is created or
> a kernel module import a static percpu symbol, the kernel will follow
> the above chunk rule to add this percpu symbol to its suitable region.
> b. When each memory area was created, "The offset from symbols list to
> the specific CPU region" has been recorded.
>
> Back to the issue, according to the rules above, the kernel will
> relocate the ".data..percpu" section of the loaded module to the end
> of the ".data..percpu" section of the kernel image. The end of the
> ".data..percpu" section of the kernel image almost equals _text, and
> the current size of the reserved chunk is 8KB. Therefore, the
> appropriate range to place the loaded module is [max(PFN_ALIGN(&_end -
> 2 * SZ_1G), PFN_ALIGN(&_text + 8 * SZ_1K - 2 * SZ_1G)), VMALLOC_END].

I had not noticed that the size of the kernel .text section cannot be
smaller than two pages (8 KB). Therefore, the modification of the
module_alloc function can be reduced to the following.
void *module_alloc(unsigned long size)
{
	return __vmalloc_node_range(size, 1, PFN_ALIGN(&_end - 2 * SZ_1G),
				    VMALLOC_END, GFP_KERNEL,
				    PAGE_KERNEL_EXEC, 0,
				    NUMA_NO_NODE,
				    __builtin_return_address(0));
}
It is the same as Alex's suggestion. Is Alex willing to send this
patch to resolve this bug?



* Re: Error on loading some network Kernel modules
  2020-02-07 14:51                                       ` Vincent Chen
@ 2020-02-10  6:37                                         ` Alex Ghiti
  2020-02-10  9:53                                           ` Vincent Chen
  0 siblings, 1 reply; 30+ messages in thread
From: Alex Ghiti @ 2020-02-10  6:37 UTC (permalink / raw)
  To: Vincent Chen, Anup Patel
  Cc: Carlos Eduardo de Paula, David Abdurachmanov, Anup Patel,
	Paul Walmsley, Zong Li, Romain Dolbeau, Zong Li, linux-riscv,
	Aurelien Jarno

Hi Vincent,

On 2/7/20 9:51 AM, Vincent Chen wrote:
> On Fri, Feb 7, 2020 at 10:39 PM Vincent Chen <vincent.chen@sifive.com> wrote:
>>
>>>> For point1, we don't need a dedicated region as long as we are allocating
>>>> modules from VMALLOC area. Let's avoid another virtual memory region if
>>>> possible.
>>>>
>>>
>>> Your comments inspire me if we can know the end of the percpu data
>>> region, we can dynamically calculate the start address in each module
>>> allocation by "<end of the percpu data region> - 2GB" if needed. I am
>>> finding a way to derive the <end of the percpu data region>.
>>>
>>
>> In the implementation of the idea, I found that I did have some
>> misunderstandings about the mechanism of accessing static percpu
>> symbol. Currently, I think this issue can be solved by modifying the
>> module_alloc() to the following.
>> #define MODULE_START max(PFN_ALIGN(&_end - 2 *
>> SZ_1G), PFN_ALIGN(&_text + 8 * SZ_1K - 2 * SZ_1G))
>> void *module_alloc(unsigned long size)
>> {
>> return __vmalloc_node_range(size, 1, MODULE_START,
>> VMALLOC_END, GFP_KERNEL,
>> PAGE_KERNEL_EXEC, 0,
>> NUMA_NO_NODE,
>> __builtin_return_address(0));
>> }
>> After applying a similar modification above, the kernel can
>> successfully insert the test module provided by Romain. Any feedback
>> is welcome. If someone is curious about the reason, I roughly
>> described it at the end of this mail. I am sorry if someone was misled
>> by the earlier discussion.
>>
>>
>> As mentioned earlier, the kernel pre-allocates a memory region for each CPU
>> to place the instances of the percpu data. According to the
>> declaration type, each memory region can be divided into three chunks.
>> These three chunks from low memory to high memory are static chunk,
>> reserved chunk, and dynamic chunk.
>> a. The static chunk is used for the static percpu symbols declared in the kernel
>> b. The reserved chunk is used for the static percpu symbols declared
>> in the kernel module
>> c. The dynamic chunk is used for all dynamic percpu symbols.
>>
>> The order of the percpu symbols in each memory region is the same and
>> is recorded in a symbol list maintained by the kernel. Therefore, the
>> address of a specific CPU's percpu symbol is "The address of this
>> percpu symbol in the symbol list" + "The offset from symbols list to
>> the specific CPU region".
>> a. The symbol list is created based on the ".data..percpu" section. In
>> other words, the start address of this symbol list is the start
>> address of ".data..percpu". When a dynamic percpu symbol is created or
>> a kernel module imports a static percpu symbol, the kernel will follow
>> the above chunk rule to add this percpu symbol to its suitable region.
>> b. When each memory area was created, "The offset from symbols list to
>> the specific CPU region" has been recorded.
>>
>> Back to the issue, according to the rules above, the kernel will
>> relocate the ".data..percpu" section of the loaded module to the end
>> of the ".data..percpu" section of the kernel image. The end of the
>> ".data..percpu" section of the kernel image almost equals _text, and
>> the current size of the reserved chunk is 8KB. Therefore, the
>> appropriate range to place the loaded module is [max(PFN_ALIGN(&_end -
>> 2 *SZ_1G), PFN_ALIGN(&_text + 8 * SZ_1K - 2 * SZ_1G)),  VMALLOC_END].
> 
> I had not noticed that the size of the kernel .text section cannot be
> smaller than two pages (8 KB). Therefore, the modification of the
> module_alloc function can be reduced to the following.
> void *module_alloc(unsigned long size)
>   {
>   return __vmalloc_node_range(size, 1, PFN_ALIGN(&_end - 2 *SZ_1G),
>   VMALLOC_END, GFP_KERNEL,
>   PAGE_KERNEL_EXEC, 0,
>   NUMA_NO_NODE,
>   __builtin_return_address(0));
>   }
> It is the same as Alex's suggestion. Is Alex willing to send this
> patch to resolve this bug?
> 

You did all the work; please send a patch explaining what you learnt and 
feel free to add a Suggested-by. Anyway, I'll add a Reviewed-by when you 
propose something.

Thank you Vincent for all your research,

Alex



* Re: Error on loading some network Kernel modules
  2020-02-10  6:37                                         ` Alex Ghiti
@ 2020-02-10  9:53                                           ` Vincent Chen
  2020-02-19  6:46                                             ` Alex Ghiti
  0 siblings, 1 reply; 30+ messages in thread
From: Vincent Chen @ 2020-02-10  9:53 UTC (permalink / raw)
  To: Alex Ghiti
  Cc: Carlos Eduardo de Paula, David Abdurachmanov, Anup Patel,
	Anup Patel, Paul Walmsley, Zong Li, Romain Dolbeau, Zong Li,
	linux-riscv, Aurelien Jarno

On Mon, Feb 10, 2020 at 2:37 PM Alex Ghiti <alex@ghiti.fr> wrote:
>
> Hi Vincent,
>
> On 2/7/20 9:51 AM, Vincent Chen wrote:
> > On Fri, Feb 7, 2020 at 10:39 PM Vincent Chen <vincent.chen@sifive.com> wrote:
> >>
> >>>> For point1, we don't need a dedicated region as long as we are allocating
> >>>> modules from VMALLOC area. Let's avoid another virtual memory region if
> >>>> possible.
> >>>>
> >>>
> >>> Your comments inspire me if we can know the end of the percpu data
> >>> region, we can dynamically calculate the start address in each module
> >>> allocation by "<end of the percpu data region> - 2GB" if needed. I am
> >>> finding a way to derive the <end of the percpu data region>.
> >>>
> >>
> >> In the implementation of the idea, I found that I did have some
> >> misunderstandings about the mechanism of accessing static percpu
> >> symbol. Currently, I think this issue can be solved by modifying the
> >> module_alloc() to the following.
> >> #define MODULE_START max(PFN_ALIGN(&_end - 2 *
> >> SZ_1G), PFN_ALIGN(&_text + 8 * SZ_1K - 2 * SZ_1G))
> >> void *module_alloc(unsigned long size)
> >> {
> >> return __vmalloc_node_range(size, 1, MODULE_START,
> >> VMALLOC_END, GFP_KERNEL,
> >> PAGE_KERNEL_EXEC, 0,
> >> NUMA_NO_NODE,
> >> __builtin_return_address(0));
> >> }
> >> After applying a similar modification above, the kernel can
> >> successfully insert the test module provided by Romain. Any feedback
> >> is welcome. If someone is curious about the reason, I roughly
> >> described it at the end of this mail. I am sorry if someone was misled
> >> by the earlier discussion.
> >>
> >>
> >> As mentioned earlier, the kernel pre-allocates a memory region for each CPU
> >> to place the instances of the percpu data. According to the
> >> declaration type, each memory region can be divided into three chunks.
> >> These three chunks from low memory to high memory are static chunk,
> >> reserved chunk, and dynamic chunk.
> >> a. The static chunk is used for the static percpu symbols declared in the kernel
> >> b. The reserved chunk is used for the static percpu symbols declared
> >> in the kernel module
> >> c. The dynamic chunk is used for all dynamic percpu symbols.
> >>
> >> The order of the percpu symbols in each memory region is the same and
> >> is recorded in a symbol list maintained by the kernel. Therefore, the
> >> address of a specific CPU's percpu symbol is "The address of this
> >> percpu symbol in the symbol list" + "The offset from symbols list to
> >> the specific CPU region".
> >> a. The symbol list is created based on the ".data..percpu" section. In
> >> other words, the start address of this symbol list is the start
> >> address of ".data..percpu". When a dynamic percpu symbol is created or
> >> a kernel module imports a static percpu symbol, the kernel will follow
> >> the above chunk rule to add this percpu symbol to its suitable region.
> >> b. When each memory area was created, "The offset from symbols list to
> >> the specific CPU region" has been recorded.
> >>
> >> Back to the issue, according to the rules above, the kernel will
> >> relocate the ".data..percpu" section of the loaded module to the end
> >> of the ".data..percpu" section of the kernel image. The end of the
> >> ".data..percpu" section of the kernel image almost equals _text, and
> >> the current size of the reserved chunk is 8KB. Therefore, the
> >> appropriate range to place the loaded module is [max(PFN_ALIGN(&_end -
> >> 2 *SZ_1G), PFN_ALIGN(&_text + 8 * SZ_1K - 2 * SZ_1G)),  VMALLOC_END].
> >
> > I had not noticed that the size of the kernel .text section cannot be
> > smaller than two pages (8 KB). Therefore, the modification of the
> > module_alloc function can be reduced to the following.
> > void *module_alloc(unsigned long size)
> >   {
> >   return __vmalloc_node_range(size, 1, PFN_ALIGN(&_end - 2 *SZ_1G),
> >   VMALLOC_END, GFP_KERNEL,
> >   PAGE_KERNEL_EXEC, 0,
> >   NUMA_NO_NODE,
> >   __builtin_return_address(0));
> >   }
> > It is the same as Alex's suggestion. Is Alex willing to send this
> > patch to resolve this bug?
> >
>
> You did all the work; please send a patch explaining what you learnt and
> feel free to add a Suggested-by. Anyway, I'll add a Reviewed-by when you
> propose something.
>
> Thank you Vincent for all your research,
>
> Alex

OK, understood. Thank you and Anup for giving me some suggestions
from different viewpoints. I will add a Suggested-by for you all if
possible :)

Vincent



* Re: Error on loading some network Kernel modules
  2020-02-10  9:53                                           ` Vincent Chen
@ 2020-02-19  6:46                                             ` Alex Ghiti
  2020-02-19  7:30                                               ` Vincent Chen
  0 siblings, 1 reply; 30+ messages in thread
From: Alex Ghiti @ 2020-02-19  6:46 UTC (permalink / raw)
  To: Vincent Chen
  Cc: Carlos Eduardo de Paula, David Abdurachmanov, Anup Patel,
	Anup Patel, Paul Walmsley, Zong Li, Romain Dolbeau, Zong Li,
	linux-riscv, Aurelien Jarno

Hi Vincent,

On 2/10/20 4:53 AM, Vincent Chen wrote:
> On Mon, Feb 10, 2020 at 2:37 PM Alex Ghiti <alex@ghiti.fr> wrote:
>>
>> Hi Vincent,
>>
>> On 2/7/20 9:51 AM, Vincent Chen wrote:
>>> On Fri, Feb 7, 2020 at 10:39 PM Vincent Chen <vincent.chen@sifive.com> wrote:
>>>>
>>>>>> For point1, we don't need a dedicated region as long as we are allocating
>>>>>> modules from VMALLOC area. Let's avoid another virtual memory region if
>>>>>> possible.
>>>>>>
>>>>>
>>>>> Your comments inspire me if we can know the end of the percpu data
>>>>> region, we can dynamically calculate the start address in each module
>>>>> allocation by "<end of the percpu data region> - 2GB" if needed. I am
>>>>> finding a way to derive the <end of the percpu data region>.
>>>>>
>>>>
>>>> In the implementation of the idea, I found that I did have some
>>>> misunderstandings about the mechanism of accessing static percpu
>>>> symbol. Currently, I think this issue can be solved by modifying the
> >>>> module_alloc() to the following.
>>>> #define MODULE_START max(PFN_ALIGN(&_end - 2 *
>>>> SZ_1G), PFN_ALIGN(&_text + 8 * SZ_1K - 2 * SZ_1G))
>>>> void *module_alloc(unsigned long size)
>>>> {
>>>> return __vmalloc_node_range(size, 1, MODULE_START,
>>>> VMALLOC_END, GFP_KERNEL,
>>>> PAGE_KERNEL_EXEC, 0,
>>>> NUMA_NO_NODE,
>>>> __builtin_return_address(0));
>>>> }
>>>> After applying a similar modification above, the kernel can
>>>> successfully insert the test module provided by Romain. Any feedback
>>>> is welcome. If someone is curious about the reason, I roughly
>>>> described it at the end of this mail. I am sorry if someone was misled
> >>>> by the earlier discussion.
>>>>
>>>>
> >>>> As mentioned earlier, the kernel pre-allocates a memory region for each CPU
>>>> to place the instances of the percpu data. According to the
>>>> declaration type, each memory region can be divided into three chunks.
>>>> These three chunks from low memory to high memory are static chunk,
>>>> reserved chunk, and dynamic chunk.
>>>> a. The static chunk is used for the static percpu symbols declared in the kernel
>>>> b. The reserved chunk is used for the static percpu symbols declared
>>>> in the kernel module
>>>> c. The dynamic chunk is used for all dynamic percpu symbols.
>>>>
>>>> The order of the percpu symbols in each memory region is the same and
>>>> is recorded in a symbol list maintained by the kernel. Therefore, the
>>>> address of a specific CPU's percpu symbol is "The address of this
>>>> percpu symbol in the symbol list" + "The offset from symbols list to
>>>> the specific CPU region".
>>>> a. The symbol list is created based on the ".data..percpu" section. In
>>>> other words, the start address of this symbol list is the start
>>>> address of ".data..percpu". When a dynamic percpu symbol is created or
> >>>> a kernel module imports a static percpu symbol, the kernel will follow
>>>> the above chunk rule to add this percpu symbol to its suitable region.
>>>> b. When each memory area was created, "The offset from symbols list to
>>>> the specific CPU region" has been recorded.
>>>>
>>>> Back to the issue, according to the rules above, the kernel will
>>>> relocate the ".data..percpu" section of the loaded module to the end
>>>> of the ".data..percpu" section of the kernel image. The end of the
>>>> ".data..percpu" section of the kernel image almost equals _text, and
>>>> the current size of the reserved chunk is 8KB. Therefore, the
>>>> appropriate range to place the loaded module is [max(PFN_ALIGN(&_end -
>>>> 2 *SZ_1G), PFN_ALIGN(&_text + 8 * SZ_1K - 2 * SZ_1G)),  VMALLOC_END].
>>>
> >>> I had not noticed that the size of the kernel .text section cannot be
> >>> smaller than two pages (8 KB). Therefore, the modification of the
> >>> module_alloc function can be reduced to the following.
>>> void *module_alloc(unsigned long size)
>>>    {
>>>    return __vmalloc_node_range(size, 1, PFN_ALIGN(&_end - 2 *SZ_1G),
>>>    VMALLOC_END, GFP_KERNEL,
>>>    PAGE_KERNEL_EXEC, 0,
>>>    NUMA_NO_NODE,
>>>    __builtin_return_address(0));
>>>    }
>>> It is the same as Alex's suggestion. Is Alex willing to send this
>>> patch to resolve this bug?
>>>
>>
>> You did all the work; please send a patch explaining what you learnt and
>> feel free to add a Suggested-by. Anyway, I'll add a Reviewed-by when you
>> propose something.
>>
>> Thank you Vincent for all your research,
>>
>> Alex
> 
> OK, understood. Thank you and Anup for giving me some suggestions
> from different viewpoints. I will add a Suggested-by for you all if
> possible :)
> 
> Vincent
> 

Are you going to propose something soon regarding this issue? I 
strongly believe we need to fix module loading for 5.6.

Thanks,

Alex



* Re: Error on loading some network Kernel modules
  2020-02-19  6:46                                             ` Alex Ghiti
@ 2020-02-19  7:30                                               ` Vincent Chen
  0 siblings, 0 replies; 30+ messages in thread
From: Vincent Chen @ 2020-02-19  7:30 UTC (permalink / raw)
  To: Alex Ghiti
  Cc: Carlos Eduardo de Paula, David Abdurachmanov, Anup Patel,
	Anup Patel, Paul Walmsley, Zong Li, Romain Dolbeau, Zong Li,
	linux-riscv, Aurelien Jarno

On Wed, Feb 19, 2020 at 2:46 PM Alex Ghiti <alex@ghiti.fr> wrote:
>
> Hi Vincent,
>
> On 2/10/20 4:53 AM, Vincent Chen wrote:
> > On Mon, Feb 10, 2020 at 2:37 PM Alex Ghiti <alex@ghiti.fr> wrote:
> >>
> >> Hi Vincent,
> >>
> >> On 2/7/20 9:51 AM, Vincent Chen wrote:
> >>> On Fri, Feb 7, 2020 at 10:39 PM Vincent Chen <vincent.chen@sifive.com> wrote:
> >>>>
> >>>>>> For point1, we don't need a dedicated region as long as we are allocating
> >>>>>> modules from VMALLOC area. Let's avoid another virtual memory region if
> >>>>>> possible.
> >>>>>>
> >>>>>
> >>>>> Your comments inspire me if we can know the end of the percpu data
> >>>>> region, we can dynamically calculate the start address in each module
> >>>>> allocation by "<end of the percpu data region> - 2GB" if needed. I am
> >>>>> finding a way to derive the <end of the percpu data region>.
> >>>>>
> >>>>
> >>>> In the implementation of the idea, I found that I did have some
> >>>> misunderstandings about the mechanism of accessing static percpu
> >>>> symbol. Currently, I think this issue can be solved by modifying the
> >>>> module_alloc() to the following.
> >>>> #define MODULE_START max(PFN_ALIGN(&_end - 2 *
> >>>> SZ_1G), PFN_ALIGN(&_text + 8 * SZ_1K - 2 * SZ_1G))
> >>>> void *module_alloc(unsigned long size)
> >>>> {
> >>>> return __vmalloc_node_range(size, 1, MODULE_START,
> >>>> VMALLOC_END, GFP_KERNEL,
> >>>> PAGE_KERNEL_EXEC, 0,
> >>>> NUMA_NO_NODE,
> >>>> __builtin_return_address(0));
> >>>> }
> >>>> After applying a similar modification above, the kernel can
> >>>> successfully insert the test module provided by Romain. Any feedback
> >>>> is welcome. If someone is curious about the reason, I roughly
> >>>> described it at the end of this mail. I am sorry if someone was misled
> >>>> by the earlier discussion.
> >>>>
> >>>>
> >>>> As mentioned earlier, the kernel pre-allocates a memory region for each CPU
> >>>> to place the instances of the percpu data. According to the
> >>>> declaration type, each memory region can be divided into three chunks.
> >>>> These three chunks from low memory to high memory are static chunk,
> >>>> reserved chunk, and dynamic chunk.
> >>>> a. The static chunk is used for the static percpu symbols declared in the kernel
> >>>> b. The reserved chunk is used for the static percpu symbols declared
> >>>> in the kernel module
> >>>> c. The dynamic chunk is used for all dynamic percpu symbols.
> >>>>
> >>>> The order of the percpu symbols in each memory region is the same and
> >>>> is recorded in a symbol list maintained by the kernel. Therefore, the
> >>>> address of a specific CPU's percpu symbol is "The address of this
> >>>> percpu symbol in the symbol list" + "The offset from symbols list to
> >>>> the specific CPU region".
> >>>> a. The symbol list is created based on the ".data..percpu" section. In
> >>>> other words, the start address of this symbol list is the start
> >>>> address of ".data..percpu". When a dynamic percpu symbol is created or
> >>>> a kernel module imports a static percpu symbol, the kernel will follow
> >>>> the above chunk rule to add this percpu symbol to its suitable region.
> >>>> b. When each memory area was created, "The offset from symbols list to
> >>>> the specific CPU region" has been recorded.
> >>>>
> >>>> Back to the issue, according to the rules above, the kernel will
> >>>> relocate the ".data..percpu" section of the loaded module to the end
> >>>> of the ".data..percpu" section of the kernel image. The end of the
> >>>> ".data..percpu" section of the kernel image almost equals _text, and
> >>>> the current size of the reserved chunk is 8KB. Therefore, the
> >>>> appropriate range to place the loaded module is [max(PFN_ALIGN(&_end -
> >>>> 2 *SZ_1G), PFN_ALIGN(&_text + 8 * SZ_1K - 2 * SZ_1G)),  VMALLOC_END].
> >>>
> >>> I had not noticed that the size of the kernel .text section cannot be
> >>> smaller than two pages (8 KB). Therefore, the modification of the
> >>> module_alloc function can be reduced to the following.
> >>> void *module_alloc(unsigned long size)
> >>>    {
> >>>    return __vmalloc_node_range(size, 1, PFN_ALIGN(&_end - 2 *SZ_1G),
> >>>    VMALLOC_END, GFP_KERNEL,
> >>>    PAGE_KERNEL_EXEC, 0,
> >>>    NUMA_NO_NODE,
> >>>    __builtin_return_address(0));
> >>>    }
> >>> It is the same as Alex's suggestion. Is Alex willing to send this
> >>> patch to resolve this bug?
> >>>
> >>
> >> You did all the work; please send a patch explaining what you learnt and
> >> feel free to add a Suggested-by. Anyway, I'll add a Reviewed-by when you
> >> propose something.
> >>
> >> Thank you Vincent for all your research,
> >>
> >> Alex
> >
> > OK, understood. Thank you and Anup for giving me some suggestions
> > from different viewpoints. I will add a Suggested-by for you all if
> > possible :)
> >
> > Vincent
> >
>
> Are you going to propose something soon regarding this issue? I
> strongly believe we need to fix module loading for 5.6.
>
> Thanks,
>
> Alex

Thanks for your reminder. I just posted the patch to the mailing list.

Vincent



end of thread, back to index

Thread overview: 30+ messages (download: mbox.gz / follow: Atom feed)
-- links below jump to the message on this page --
2020-01-28 19:22 Error on loading some network Kernel modules Carlos Eduardo de Paula
2020-01-30  3:00 ` Paul Walmsley
2020-01-30 16:20   ` David Abdurachmanov
2020-01-31 14:12     ` Carlos Eduardo de Paula
2020-01-31 20:11     ` Aurelien Jarno
2020-02-01  7:52       ` Anup Patel
2020-02-01 13:59         ` Alex Ghiti
2020-02-02 14:10           ` Anup Patel
2020-02-02 15:21             ` Troy Benjegerdes
2020-02-02 16:27           ` Romain Dolbeau
2020-02-03 10:04             ` Vincent Chen
2020-02-04  3:55               ` Zong Li
2020-02-04  6:50                 ` Alex Ghiti
2020-02-04  7:19                   ` Zong Li
2020-02-04  9:32                     ` Alexandre Ghiti
2020-02-04 10:46                       ` Vincent Chen
2020-02-04 11:30                         ` Anup Patel
2020-02-04 14:03                           ` Vincent Chen
2020-02-04 19:10                             ` Alex Ghiti
     [not found]                             ` <a55f265e-71b2-5ebb-b079-6345007a442e@ghiti.fr>
2020-02-05  3:22                               ` Vincent Chen
2020-02-05  4:24                                 ` Anup Patel
2020-02-05 10:37                                   ` Vincent Chen
2020-02-07 14:39                                     ` Vincent Chen
2020-02-07 14:51                                       ` Vincent Chen
2020-02-10  6:37                                         ` Alex Ghiti
2020-02-10  9:53                                           ` Vincent Chen
2020-02-19  6:46                                             ` Alex Ghiti
2020-02-19  7:30                                               ` Vincent Chen
2020-02-04 17:48                           ` Romain Dolbeau
2020-02-03 20:57             ` Alex Ghiti
