linux-riscv.lists.infradead.org archive mirror
From: Anup Patel <Anup.Patel@wdc.com>
To: Vincent Chen <vincent.chen@sifive.com>, Alex Ghiti <alex@ghiti.fr>
Cc: Carlos Eduardo de Paula <me@carlosedp.com>,
	David Abdurachmanov <david.abdurachmanov@gmail.com>,
	Anup Patel <anup@brainfault.org>, Paul Walmsley <paul@pwsan.com>,
	Zong Li <zongbox@gmail.com>,
	Romain Dolbeau <romain.dolbeau@european-processor-initiative.eu>,
	Zong Li <zong.li@sifive.com>,
	linux-riscv <linux-riscv@lists.infradead.org>,
	Aurelien Jarno <aurelien@aurel32.net>
Subject: RE: Error on loading some network Kernel modules
Date: Wed, 5 Feb 2020 04:24:23 +0000	[thread overview]
Message-ID: <MN2PR04MB6061ED6ABC668B59A7A544CF8D020@MN2PR04MB6061.namprd04.prod.outlook.com> (raw)
In-Reply-To: <CABvJ_xjVLJEebCac_sb6-Yd_iHU1x8Daqw-iFqcGn11YKktm8Q@mail.gmail.com>



> -----Original Message-----
> From: linux-riscv <linux-riscv-bounces@lists.infradead.org> On Behalf Of
> Vincent Chen
> Sent: Wednesday, February 5, 2020 8:52 AM
> To: Alex Ghiti <alex@ghiti.fr>
> Cc: Carlos Eduardo de Paula <me@carlosedp.com>; David Abdurachmanov
> <david.abdurachmanov@gmail.com>; Anup Patel <anup@brainfault.org>;
> Paul Walmsley <paul@pwsan.com>; Zong Li <zongbox@gmail.com>; Romain
> Dolbeau <romain.dolbeau@european-processor-initiative.eu>; Zong Li
> <zong.li@sifive.com>; linux-riscv <linux-riscv@lists.infradead.org>; Aurelien
> Jarno <aurelien@aurel32.net>
> Subject: Re: Error on loading some network Kernel modules
> 
> On Wed, Feb 5, 2020 at 3:07 AM Alex Ghiti <alex@ghiti.fr> wrote:
> >
> >
> > On 2/4/20 9:03 AM, Vincent Chen wrote:
> >
> > On Tue, Feb 4, 2020 at 7:31 PM Anup Patel <anup@brainfault.org> wrote:
> >
> > On Tue, Feb 4, 2020 at 4:16 PM Vincent Chen <vincent.chen@sifive.com>
> wrote:
> >
> > On Tue, Feb 4, 2020 at 5:32 PM Alexandre Ghiti <alex@ghiti.fr> wrote:
> >
> > On 2/4/20 8:19 AM, Zong Li wrote:
> >
> > On Tue, Feb 4, 2020 at 2:50 PM, Alex Ghiti <alex@ghiti.fr> wrote:
> >
> > Hi Zong,
> >
> > On 2/3/20 10:55 PM, Zong Li wrote:
> >
> > On Mon, Feb 3, 2020 at 6:04 PM, Vincent Chen <vincent.chen@sifive.com> wrote:
> >
> > On Mon, Feb 3, 2020 at 12:27 AM Romain Dolbeau
> > <romain.dolbeau@european-processor-initiative.eu> wrote:
> >
> > On 2020-02-01 14:59, Alex Ghiti wrote:
> >
> > Why restrict to 128M whereas we have 2GB offset available to the end
> > of the kernel ?
> >
> > Isn't that 2 GiB offset to whatever the module requires in the kernel,
> > rather than to the end of the kernel space?
> >
> > Is there some guarantee that symbols accessible by modules are at the
> > end of the kernel? If so, wouldn't the maximum offset for this patch
> > still be (2 GiB - <total size of accessible symbols>)?
> >
> > Cordially,
> >
> > --
> > Romain Dolbeau
> >
> > It took me some time to find the root cause of this problem, so please
> > allow me to share some observations before the discussion.
> > The root cause of this issue is that the percpu data is declared as a
> > static variable. The "static" attribute makes the compiler assume
> > that this symbol is close to the .text section at runtime. Hence, the
> > compiler uses "auipc" to access this percpu data instead of going
> > through the GOT. In this case, the access range is limited to +/-2GB.
> > However, in practice, this percpu data is placed in a pre-allocated
> > region created by the memblock allocator. In other words, the
> > distance between the pre-allocated region (above PAGE_OFFSET) and the
> > .text section of the kernel module (in the VMALLOC region) is much
> > larger than 2GB.
> > I agree that the original patch,
> > https://github.com/bjoto/linux/commit/8a56d1c8e8e91c1bc3893946d52b9217c96e1589,
> > can solve most cases. However, I do not think the patch still works
> > if the kernel module region is determined by _end or <total size of
> > accessible symbols>. The reason is that the pre-allocated region for
> > module percpu data comes from the memblock allocator at runtime.
> > Hence, we cannot know the actual address of this region at
> > compile-time, and this issue may occur again in that case.
> >
> > By the way, maybe we can refer to the MIPS implementation:
> > 1. For general cases, we can use this patch to solve the issue.
> > 2. For a large kernel image (>2GB) or when the KASLR feature is
> > enabled, we may need a new code model to deal with this issue.
> >
> > The range [&_end - 2G, VMALLOC_END] won't be suitable when KASLR is
> > enabled. Unless we limit the randomized range to a small area, the
> > module region start address will be bigger than VMALLOC_END.
> >
> > Actually, the relocatable patch I proposed already moves "everything"
> > up at the same time: the kernel itself but also all the "zones" below
> > (vmalloc, vmemmap, fixup, etc.) since all those zones are defined
> > from PAGE_OFFSET, which is now dynamic. So the modules will remain at
> > the same offset to the kernel, unless explicitly randomized in the
> > vmalloc zone.
> >
> > OK, it makes sense. The module region moves along with kernel seems to
> > stay away from the concern I mentioned.
> >
> > So now, the problem is that the pre-allocated region of percpu data
> > is located after the _end symbol, as Vincent mentioned, so it may lie
> > beyond the 2GB reach of a module region starting at &_end - 2G.
> >
> > Actually, I don't understand this issue: we are limited to addressing
> > symbols within +/-2GB of the PC. So as long as the percpu symbol
> > exists in the kernel, it is below _end, and then we don't care that
> > its content is initialized dynamically, as long as we can 'compute'
> > its address from the PC, right?
> >
> > In this case, the static percpu symbols of this issue are declared in
> > the kernel module, not in the kernel image.
> > When the kernel loads a module, it generally places all the module
> > sections contiguously in the VMALLOC memory area. However, the
> > ".data..percpu" section is an exception. The kernel places the
> > ".data..percpu" section in a pre-allocated memory region. This region
> > holds the percpu data instances for each CPU and is created by
> > memblock. Hence, the instances of these percpu symbols live above
> > _end. In this case, if we place the module region at
> > [_end-2G, PAGE_OFFSET], the distance between these percpu symbols
> > (whose addresses are above _end) and the module text section will
> > probably exceed the 2GB limit.
> >
> > The static percpu symbols are particularly a problem for loadable
> > modules on RISC-V, but dynamic percpu variables are perfectly fine. I
> > faced this issue with the KVM RISC-V module and converted its static
> > percpu symbol into a dynamic percpu variable. In fact, in the Linux
> > kernel dynamic percpu variables are preferred over static percpu
> > symbols, so wherever we see an issue with a static percpu symbol in
> > any module we should just send a patch to convert it into a dynamic
> > percpu variable.
> >
> > In general, any memory access via a pointer (as with dynamic percpu
> > variables) is fine as long as the pointer itself is within 2GB of
> > relative addressing.
> >
> > As far as I know, kernel modules are built with the PIC code model
> > instead of medany. Therefore, I don't think most pointers in kernel
> > modules have a 2GB limit. That is why modules without static percpu
> > variables do not encounter the 32-bit offset issue.
> >
> > Overall, Alex's suggestion will address most cases of loadable modules
> > except modules having static percpu symbols and for such modules just
> > convert these percpu symbols into dynamic percpu variables.
> >
> > Can you point out where I can take a look at this pre-allocated region
> > for percpu data please ?
> >
> > In mm/percpu.c, you can find how the kernel allocates the percpu data
> > region during the initialization phase.
> > In kernel/module.c's simplify_symbols() function, you can see that
> > the kernel treats the ".data..percpu" section as an exception when
> > loading a module.
> >
> >
> > Thanks for your explanations, I was miles away from understanding the
> > real problem. Good find, Vincent.
> >
> >
> > If I have any misunderstandings, please let me know. Thanks
> >
> > Please see above comment. We should prefer dynamic percpu variables
> > in loadable modules over static percpu symbols. I am sure there will
> > be very few loadable kernel modules having static percpu symbols.
> >
> > Thanks for your comments.
> > I agree that Alex's suggestion can address most cases of loadable
> > modules. The reason I pointed out the static percpu variable case is
> > that the percpu variable is declared as a static symbol in the
> > reduced test case provided by Romain in
> > https://lore.kernel.org/linux-riscv/1572281840733.3517@european-processor-initiative.eu/.
> > Also, I found two static percpu variables declared in the openvswitch
> > module. After removing the static attribute, the kernel can load the
> > module successfully. Hence, I think the static percpu variables cause
> > the issue.
> >
> > (According to a 'grep', very few loadable kernel modules use static
> > percpu data, as you mentioned. However, I'm not sure that is a reason
> > to prohibit RISC-V users from using static percpu variables in kernel
> > modules, so I have no comment on this.) If we can ignore the
> > existence of static percpu variables in kernel modules, I agree with
> > Alex's suggestion. The reason is not that it solves the bug, but that
> > we can change the code model from PIC to medany by creating a
> > specific kernel module region, improving access performance.
> >
> >
> > If I understand correctly, you mean that having this module zone at
> > the end of the vmalloc zone does not fix the bug regarding static
> > percpu variables, but gives better performance than PIC, which uses
> > an additional indirection through the GOT, right?
> >
> 
> Yes. Currently, I think the PIC model can resolve most pointer
> references except for the static percpu variables. However, as you
> mentioned, it loses some performance due to the additional indirection
> through the GOT.
> 
> >
> > (Maybe we can keep the current implementation for some corner cases,
> > such as when the kernel image size is close to 2GB.)
> >
> >
> > Either we follow Anup's advice and don't care about those static
> > percpu variables (if they ever fall outside the 2GB limit, we simply
> > cannot load the module), or we find a way to make sure that the
> > module range covers the area for module percpu variables, which is
> > allocated in setup_per_cpu_areas(). I'm taking a look at the latter
> > solution and at how other architectures with the same code model
> > handle this.
> >
> As far as I know, ARM64 and MIPS64 with CONFIG_KBUILD_64BIT_SYM32
> also access static percpu variables in a PC-relative way. I quickly
> describe their implementations below; please correct me if I have any
> misunderstandings, thanks.
> 1. ARM64
>     When CONFIG_RANDOMIZE_BASE is disabled, they place the module
> region at "(u64)_etext - MODULES_VSIZE (128MB)", and the range limit of
> PC-relative addressing (the adrp instruction) is +/-4GB. When
> CONFIG_RANDOMIZE_BASE is enabled, the maximum distance between a module
> region symbol and a kernel symbol is limited to 2GB, which still leaves
> 2GB of room to access the percpu data. Therefore, I think accessing the
> percpu variables is also safe there.
> 2. MIPS64 with CONFIG_KBUILD_64BIT_SYM32
>     The VMALLOC region is placed after the kernel image. The start
> address of the kernel image is 0xffffffff80100000, and the kernel
> module region is at 0xffffffffc0000000, the beginning of VMALLOC. They
> do not encounter this issue because the range limit of PC-relative
> addressing is +/-2GB.
> 
> Currently, I have three solutions to fix the percpu data problem, and I list
> them below. Any feedback is welcome.
> 
> 1. Create a specific module region located at [PAGE_OFFSET-SIZE,
> PAGE_OFFSET].
>     I think this can solve most cases because the kernel image is
> generally much smaller than 2GB, and we can still change the code model
> from PIC to medany. However, it will still hit the same problem if the
> kernel image approaches 2GB.
> 
> 2. Request a new code model for the kernel
>     Maybe we can ask the compiler folks for a new PIC variant for the
> kernel, which uses the GOT to access every static percpu variable
> instead of PC-relative addressing. However, I am not sure whether the
> compiler folks would accept this idea, and if accepted, it may take a
> long time to implement.
> 
> 3. Jump to a resolver function.
> Zong and I came up with this rough idea to address the problem, but we
> have not implemented it.
>
> The following is the current code pattern for accessing percpu data:
>
>   64: 00000917  auipc s2,0x0
>   68: 00093903  ld    s2,0(s2) # 64 <.L0 >
>
> We plan to change the "ld s2,0(s2)" to "jalr xxx(s2)" to redirect the
> code to the corresponding resolver function. Each static percpu
> variable owns a resolver function. The resolver fetches the variable's
> address and stores it into the target register by decoding the auipc
> instruction. We think this method may solve the problem, but it will
> cost some performance due to the redirection.
> 
> I prefer a solution combining method 1 and method 2:
> 1. In general, users can choose method 1 to improve data access
> performance.
> 2. For safety, users can use the new code model to access static percpu
> data without any limitation.
> We can create a new kernel config, a bit like MIPS's
> CONFIG_KBUILD_64BIT_SYM32, for users to choose.

I think we should not waste time in a quest for the perfect solution. We
should first have a module_alloc() in place (based on the various
suggestions in this email thread) so that most people don't see issues.
We can fix module loading for modules with static per-cpu symbols
separately.

For point 1, we don't need a dedicated region as long as we are
allocating modules from the VMALLOC area. Let's avoid another virtual
memory region if possible.

For point 2, loading the address of a static per-cpu symbol is always
PC-relative on most RISC architectures, and static per-cpu symbols in
loadable modules are a known issue (see https://lwn.net/Articles/1509/).
I don't agree that this should be solved in the compiler, because
per-cpu data management is Linux-specific (or OS-specific), so we should
rather explore ways to handle it in Linux itself. The dynamic per-cpu
alloc APIs (i.e. alloc_percpu() and friends) were added primarily to
benefit loadable modules and NUMA systems. The static per-cpu symbols
(i.e. DEFINE_PER_CPU() and friends) should only be used in built-in
modules or code linked into the kernel. It is generally very easy to
convert code using static per-cpu symbols to dynamic per-cpu variables.

Regards,
Anup




Thread overview: 30+ messages
2020-01-28 19:22 Error on loading some network Kernel modules Carlos Eduardo de Paula
2020-01-30  3:00 ` Paul Walmsley
2020-01-30 16:20   ` David Abdurachmanov
2020-01-31 14:12     ` Carlos Eduardo de Paula
2020-01-31 20:11     ` Aurelien Jarno
2020-02-01  7:52       ` Anup Patel
2020-02-01 13:59         ` Alex Ghiti
2020-02-02 14:10           ` Anup Patel
2020-02-02 15:21             ` Troy Benjegerdes
2020-02-02 16:27           ` Romain Dolbeau
2020-02-03 10:04             ` Vincent Chen
2020-02-04  3:55               ` Zong Li
2020-02-04  6:50                 ` Alex Ghiti
2020-02-04  7:19                   ` Zong Li
2020-02-04  9:32                     ` Alexandre Ghiti
2020-02-04 10:46                       ` Vincent Chen
2020-02-04 11:30                         ` Anup Patel
2020-02-04 14:03                           ` Vincent Chen
2020-02-04 19:10                             ` Alex Ghiti
     [not found]                             ` <a55f265e-71b2-5ebb-b079-6345007a442e@ghiti.fr>
2020-02-05  3:22                               ` Vincent Chen
2020-02-05  4:24                                 ` Anup Patel [this message]
2020-02-05 10:37                                   ` Vincent Chen
2020-02-07 14:39                                     ` Vincent Chen
2020-02-07 14:51                                       ` Vincent Chen
2020-02-10  6:37                                         ` Alex Ghiti
2020-02-10  9:53                                           ` Vincent Chen
2020-02-19  6:46                                             ` Alex Ghiti
2020-02-19  7:30                                               ` Vincent Chen
2020-02-04 17:48                           ` Romain Dolbeau
2020-02-03 20:57             ` Alex Ghiti
