* Re: [PATCH v3 4/4] RISC-V: Allow booting kernel from any 4KB aligned address
@ 2019-03-28 15:35 ` Anthony Coulter
  0 siblings, 0 replies; 26+ messages in thread
From: Anthony Coulter @ 2019-03-28 15:35 UTC (permalink / raw)
  To: anup, rppt
  Cc: Anup.Patel, aou, Atish.Patra, hch, linux-kernel, linux-riscv, me,
	palmer, paul.walmsley

If your goal is to be able to boot from any 4k-aligned address, then
disallowing boots between PAGE_OFFSET and PAGE_OFFSET + vmlinux_size
seems counterproductive, since there are a lot of 4k-aligned addresses
in that range that are now disallowed. (And worse, the specific range
of disallowed addresses now depends on how large the kernel is, which
makes things awkward. What happens if someone downloads a kernel update
that increases the size of vmlinux to a point where their boot loader
configuration is no longer valid? That would be crazy.)

Note that in order to boot from any 4k-aligned address, you will need
to set up trampoline_pg_dir to map a single 4k page. The rule is that the
trampoline page tables can map a single page of whatever size you're
working with, and that page needs to be mapped to the same virtual
address that it will have in the final swapper_pg_dir table. Since
swapper_pg_dir cannot use hugepages, trampoline_pg_dir cannot use a
hugepage either.

But that also means you can't do very much work between enabling the
trampoline page tables and switching over to swapper_pg_dir, because
during that period of time only 4k of memory is mapped. You can't call
any functions that live outside those four kilobytes, nor can you
modify any page tables (because the single page you have must cover the
code in _start, so it can't point to any memory that includes page
tables). So you need to set up both the trampoline and swapper page
tables before enabling either of them. The only complexity you can
postpone by splitting up setup_vm is the initialization of the fixmap
tables.

That said: I think that booting from 4k-aligned addresses is probably
still a pretty simple change, though I *also* have doubts about whether
it is worthwhile.

Why is it simple? Because all you have to do is add one extra level to
each of the trampoline and swapper page tables, and both of these
tables have simple structures. The code proposed in the latest draft
is complicated because the function calls have so many layers of
indirection and not enough attention is paid to using the contiguity
of the page tables to reduce work. But that's accidental complexity;
a more careful implementation would be a lot shorter.
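To make the "one extra level" point concrete: under Sv39, each page-table level indexes 9 bits of the virtual address, so a 4k mapping simply walks one level deeper than a 2M mapping. The helper below is our own illustrative sketch (the name and structure are not taken from the kernel):

```c
#include <stdint.h>

/* Extract the Sv39 VPN (virtual page number) field for a given level.
 * Level 2 is the root; each level covers 9 bits, and pages are 4k
 * (12 bits). A 2M mapping installs a leaf PTE at level 1; a 4k mapping
 * needs one extra table so the leaf can sit at level 0. */
static inline uint64_t sv39_vpn(uint64_t va, int level)
{
    return (va >> (12 + 9 * level)) & 0x1ff;
}
```

For the typical RISC-V load address 0x80200000, the indices are 2, 1, 0 at levels 2, 1, 0 respectively, which is all the structure the trampoline table needs.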

Why is it irrelevant? Because a memory-constrained kernel will want to
drop its .init segment after booting, but the memory that this frees up
will all be at the beginning of the kernel image (and not at the end).

Let's be concrete and talk about the HiFive Unleashed board, on which
RAM starts at address 0x80000000. The problem is that the Berkeley
Boot Loader itself gets loaded at 0x80000000, so it has to load the Linux
kernel at the next hugepage boundary, 0x80200000. Now, if you're short on RAM
you will want your kernel to drop its .init segment, which occupies the
first megabyte (?) of kernel space. (I don't know how large the .init
segment is, but I *do* know from the linker script that it's at the
beginning of memory. Let's call it a megabyte.) So Linux releases its
first megabyte of memory to applications, and now the kernel itself
starts somewhere around 0x80300000.

How is the kernel going to make use of the freed-up space between
0x80200000 and 0x80300000? That's a vm-system problem: somewhere in the
virtual memory code there will be data structures and algorithms that
are smart enough to make use of both the space *before* the kernel
image (i.e. before 0x80300000) and the space *after* the kernel (i.e.
all the space beyond 0x80200000 + vmlinux_size). Surely this
code already exists, because some architectures *do* drop their .init
sections after boot.

But, now, if the virtual memory system is already smart enough to make
use of physical memory that is located before the kernel image, then
there's no harm in booting at 0x80200000 because the virtual memory
system can figure out how to use the gap between the end of the boot
loader and the start of the kernel image. This is true whether the
kernel chooses to drop its .init segment or not, because the point is
that the Linux kernel virtual memory data management system is already
designed to make use of free space from before the kernel image.

So the best way to reclaim wasted space before 0x80200000 is probably
going to be to make your boot loader tell the kernel (via the device
tree) how much space is available between boot_loader_end and
vmlinux_start, and to make sure that this space gets used by the
virtual memory framework.
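The arithmetic of that reclaimable gap is trivial; here is a minimal sketch (boot_loader_end and vmlinux_start are hypothetical names for the values the boot loader would pass via the device tree, not existing kernel symbols):

```c
#include <stdint.h>

/* Size of the physical gap between the end of the boot loader and the
 * start of the kernel image; this is the region the boot loader could
 * describe to the kernel as additional usable RAM. Illustrative only. */
static uint64_t reclaimable_gap(uint64_t boot_loader_end,
                                uint64_t vmlinux_start)
{
    return vmlinux_start > boot_loader_end
            ? vmlinux_start - boot_loader_end : 0;
}
```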

I'm sorry my email is so long, but I've found that long emails lead to
less confusion than short ones.

Regards,
Anthony Coulter

^ permalink raw reply	[flat|nested] 26+ messages in thread

* Re: [PATCH v3 4/4] RISC-V: Allow booting kernel from any 4KB aligned address
  2019-03-28  9:52                 ` Anup Patel
@ 2019-03-28 10:24                   ` Anup Patel
  -1 siblings, 0 replies; 26+ messages in thread
From: Anup Patel @ 2019-03-28 10:24 UTC (permalink / raw)
  To: Mike Rapoport
  Cc: Christoph Hellwig, Palmer Dabbelt, Anup Patel, linux-kernel,
	Atish Patra, Albert Ou, Paul Walmsley, linux-riscv, me

On Thu, Mar 28, 2019 at 3:22 PM Anup Patel <anup@brainfault.org> wrote:
>
> On Thu, Mar 28, 2019 at 1:25 PM Mike Rapoport <rppt@linux.ibm.com> wrote:
> >
> > On Wed, Mar 27, 2019 at 12:54:41AM -0700, Christoph Hellwig wrote:
> > > On Mon, Mar 25, 2019 at 09:46:59PM +0530, Anup Patel wrote:
> > > > > Why do you even care about kernel mappings for non-existent RAM?
> > > >
> > > > We care because there will always be some buggy kernel driver/code going
> > > > out-of-bound and accessing non-existent RAM. If we by default map all
> > > > possible kernel virtual addresses, then the behaviour of buggy accesses will be
> > > > unpredictable.
> > > >
> > > > Further, I think we should also make .text and .rodata sections of kernel
> > > > as read-only. This will protect kernel code and rodata.
> > >
> > > All of that is useful at the final_setup_vm() time - but none of it
> > > matters during early setup_vm where life is complicated.
> > >
> > > Mike suggested on the previous iteration that you only do smaller
> > > mappings when setting up the final mapping to avoid the ops churn,
> > > and I fully agree with him.
> > >
> > > So I would suggest we avoid complicating the fiddly early boot changes
> > > that just add complexity, and you instead redirect your efforts to,
> > > say, implementing proper ro and non-executable sections using 4k mappings
> > > in the final VM setup only.  That should actually lead to less code
> > > and complexity, and provide more benefits.
> >
> > It might be worth keeping trampoline_pg_dir if we are to split setup_vm().
> > Then setup_vm() will only initialize the trampoline_pg_dir and
> > final_setup_vm() will setup the swapper_pg_dir and switch to it.
> > Otherwise final_setup_vm() would need to update live mappings which might
> > be fragile.
> >
>
> We finally know the purpose of the trampoline_pg_dir page table.
>
> The trampoline_pg_dir is supposed to contain only one 2M/4M mapping
> to handle the case where PAGE_OFFSET < load_address.
>
> For 64bit systems, PAGE_OFFSET is a very high value, typically
> 0xFFFFxxxxxxxxxxxx, compared to the RAM start 0x80000000. It is very
> unlikely that we will have enormous RAM ending somewhere
> 0xFFFFxxxxxxxxxxxx.
>
> For 32bit systems, it is quite possible that the bootloader loads the
> kernel at load_address > 0x80000000. Let's say PAGE_OFFSET = 0xC0000000
> and load_address = 0xC0100000. Now the instruction which enables the
> MMU will be at 0xC0100xxx, and after enabling it the CPU will try to
> fetch the next instruction at 0xC0100xxx + 4, but 0xC0100000 maps to
> 0xC0200000 since load_address > PAGE_OFFSET, hence we will see the
> wrong instruction after enabling the MMU.
>
> (Note: Above explanation was provided by Anthony)
>
> I guess we will have to keep both the trampoline_pg_dir and swapper_pg_dir
> init in setup_vm() because trampoline_pg_dir will contain just
> one 2M/4M mapping.
>
> Regards,
> Anup

I think we should go ahead with your suggestion but with the following
additional constraints:

1. setup_vm() will map only vmlinux_start to vmlinux_end, plus the
FDT, for early mapping in trampoline_pg_dir.
2. setup_vm() will hit a BUG_ON() if load_address is between
vmlinux_start and vmlinux_end.
3. setup_vm_final() will create swapper_pg_dir from scratch
to avoid updating live mappings.

Point 2 above essentially means that on 32bit/64bit systems
the bootloader cannot load the kernel in the physical address range
PAGE_OFFSET to PAGE_OFFSET + vmlinux_size even if
it is a valid RAM physical address range.
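A minimal sketch of the point-2 collision check (our own naming and an example PAGE_OFFSET value, not actual kernel code):

```c
#include <stdint.h>
#include <stdbool.h>

/* Example Sv39 value; the real PAGE_OFFSET is a kernel config detail. */
#define EXAMPLE_PAGE_OFFSET 0xffffffe000000000ULL

/* True if the physical load address falls inside the kernel's own
 * virtual range, which point 2 above proposes to reject via BUG_ON(). */
static bool load_address_collides(uint64_t load_pa, uint64_t vmlinux_size)
{
    return load_pa >= EXAMPLE_PAGE_OFFSET &&
           load_pa < EXAMPLE_PAGE_OFFSET + vmlinux_size;
}
```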

Suggestions??

Regards,
Anup

^ permalink raw reply	[flat|nested] 26+ messages in thread

* Re: [PATCH v3 4/4] RISC-V: Allow booting kernel from any 4KB aligned address
  2019-03-28  7:55               ` Mike Rapoport
@ 2019-03-28  9:52                 ` Anup Patel
  -1 siblings, 0 replies; 26+ messages in thread
From: Anup Patel @ 2019-03-28  9:52 UTC (permalink / raw)
  To: Mike Rapoport
  Cc: Christoph Hellwig, Palmer Dabbelt, Anup Patel, linux-kernel,
	Atish Patra, Albert Ou, Paul Walmsley, linux-riscv, me

On Thu, Mar 28, 2019 at 1:25 PM Mike Rapoport <rppt@linux.ibm.com> wrote:
>
> On Wed, Mar 27, 2019 at 12:54:41AM -0700, Christoph Hellwig wrote:
> > On Mon, Mar 25, 2019 at 09:46:59PM +0530, Anup Patel wrote:
> > > > Why do you even care about kernel mappings for non-existent RAM?
> > >
> > > We care because there will always be some buggy kernel driver/code going
> > > out-of-bound and accessing non-existent RAM. If we by default map all
> > > possible kernel virtual addresses, then the behaviour of buggy accesses will be
> > > unpredictable.
> > >
> > > Further, I think we should also make .text and .rodata sections of kernel
> > > as read-only. This will protect kernel code and rodata.
> >
> > All of that is useful at the final_setup_vm() time - but none of it
> > matters during early setup_vm where life is complicated.
> >
> > Mike suggested on the previous iteration that you only do smaller
> > mappings when setting up the final mapping to avoid the ops churn,
> > and I fully agree with him.
> >
> > So I would suggest we avoid complicating the fiddly early boot changes
> > that just add complexity, and you instead redirect your efforts to,
> > say, implementing proper ro and non-executable sections using 4k mappings
> > in the final VM setup only.  That should actually lead to less code
> > and complexity, and provide more benefits.
>
> It might be worth keeping trampoline_pg_dir if we are to split setup_vm().
> Then setup_vm() will only initialize the trampoline_pg_dir and
> final_setup_vm() will setup the swapper_pg_dir and switch to it.
> Otherwise final_setup_vm() would need to update live mappings which might
> be fragile.
>

We finally know the purpose of the trampoline_pg_dir page table.

The trampoline_pg_dir is supposed to contain only one 2M/4M mapping
to handle the case where PAGE_OFFSET < load_address.

For 64bit systems, PAGE_OFFSET is a very high value, typically
0xFFFFxxxxxxxxxxxx, compared to the RAM start 0x80000000. It is very
unlikely that we will have enormous RAM ending somewhere
0xFFFFxxxxxxxxxxxx.

For 32bit systems, it is quite possible that the bootloader loads the
kernel at load_address > 0x80000000. Let's say PAGE_OFFSET = 0xC0000000
and load_address = 0xC0100000. Now the instruction which enables the
MMU will be at 0xC0100xxx, and after enabling it the CPU will try to
fetch the next instruction at 0xC0100xxx + 4, but 0xC0100000 maps to
0xC0200000 since load_address > PAGE_OFFSET, hence we will see the
wrong instruction after enabling the MMU.

(Note: Above explanation was provided by Anthony)
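The 32-bit mismatch described above is easy to check with arithmetic. The sketch below uses the example values from the text; translate() models a simple linear PAGE_OFFSET-based mapping and all names are ours:

```c
#include <stdint.h>

#define EX_PAGE_OFFSET  0xC0000000UL  /* example value from the text */
#define EX_LOAD_ADDRESS 0xC0100000UL

/* Model of a linear kernel mapping: VA (PAGE_OFFSET + x) maps to PA
 * (load_address + x). With load_address > PAGE_OFFSET, a fetch at the
 * pre-MMU PC value 0xC0100000 lands on the wrong physical address. */
static uint32_t translate(uint32_t va)
{
    return EX_LOAD_ADDRESS + (va - EX_PAGE_OFFSET);
}
```

Fetching at VA 0xC0100000 (where the enabling instruction physically lives) resolves to PA 0xC0200000, one megabyte past the intended code, which is exactly the wrong-instruction failure.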

I guess we will have to keep both the trampoline_pg_dir and swapper_pg_dir
init in setup_vm() because trampoline_pg_dir will contain just
one 2M/4M mapping.

Regards,
Anup

^ permalink raw reply	[flat|nested] 26+ messages in thread

* Re: [PATCH v3 4/4] RISC-V: Allow booting kernel from any 4KB aligned address
  2019-03-27  7:54             ` Christoph Hellwig
@ 2019-03-28  7:55               ` Mike Rapoport
  -1 siblings, 0 replies; 26+ messages in thread
From: Mike Rapoport @ 2019-03-28  7:55 UTC (permalink / raw)
  To: Christoph Hellwig
  Cc: Anup Patel, Palmer Dabbelt, Anup Patel, linux-kernel,
	Atish Patra, Albert Ou, Paul Walmsley, linux-riscv

On Wed, Mar 27, 2019 at 12:54:41AM -0700, Christoph Hellwig wrote:
> On Mon, Mar 25, 2019 at 09:46:59PM +0530, Anup Patel wrote:
> > > Why do you even care about kernel mappings for non-existent RAM?
> > 
> > We care because there will always be some buggy kernel driver/code going
> > out-of-bound and accessing non-existent RAM. If we by default map all
> > possible kernel virtual addresses, then the behaviour of buggy accesses will be
> > unpredictable.
> > 
> > Further, I think we should also make .text and .rodata sections of kernel
> > as read-only. This will protect kernel code and rodata.
> 
> All of that is useful at the final_setup_vm() time - but none of it
> matters during early setup_vm where life is complicated.
> 
> Mike suggested on the previous iteration that you only do smaller
> mappings when setting up the final mapping to avoid the ops churn,
> and I fully agree with him. 
> 
> So I would suggest we avoid complicating the fiddly early boot changes
> that just add complexity, and you instead redirect your efforts to,
> say, implementing proper ro and non-executable sections using 4k mappings
> in the final VM setup only.  That should actually lead to less code
> and complexity, and provide more benefits.

It might be worth keeping trampoline_pg_dir if we are to split setup_vm().
Then setup_vm() will only initialize the trampoline_pg_dir and
final_setup_vm() will setup the swapper_pg_dir and switch to it.
Otherwise final_setup_vm() would need to update live mappings which might
be fragile.

-- 
Sincerely yours,
Mike.


^ permalink raw reply	[flat|nested] 26+ messages in thread

* Re: [PATCH v3 4/4] RISC-V: Allow booting kernel from any 4KB aligned address
  2019-03-25 16:16           ` Anup Patel
@ 2019-03-27  7:54             ` Christoph Hellwig
  -1 siblings, 0 replies; 26+ messages in thread
From: Christoph Hellwig @ 2019-03-27  7:54 UTC (permalink / raw)
  To: Anup Patel
  Cc: Christoph Hellwig, Palmer Dabbelt, Anup Patel, linux-kernel,
	Mike Rapoport, Atish Patra, Albert Ou, Paul Walmsley,
	linux-riscv

On Mon, Mar 25, 2019 at 09:46:59PM +0530, Anup Patel wrote:
> > Why do you even care about kernel mappings for non-existent RAM?
> 
> We care because there will always be some buggy kernel driver/code going
> out-of-bound and accessing non-existent RAM. If we by default map all
> possible kernel virtual addresses, then the behaviour of buggy accesses will be
> unpredictable.
> 
> Further, I think we should also make .text and .rodata sections of kernel
> as read-only. This will protect kernel code and rodata.

All of that is useful at the final_setup_vm() time - but none of it
matters during early setup_vm where life is complicated.

Mike suggested on the previous iteration that you only do smaller
mappings when setting up the final mapping to avoid the ops churn,
and I fully agree with him. 

So I would suggest we avoid complicating the fiddly early boot changes
that just add complexity, and you instead redirect your efforts to,
say, implementing proper ro and non-executable sections using 4k mappings
in the final VM setup only.  That should actually lead to less code
and complexity, and provide more benefits.

^ permalink raw reply	[flat|nested] 26+ messages in thread

* Re: [PATCH v3 4/4] RISC-V: Allow booting kernel from any 4KB aligned address
  2019-03-25 17:35           ` Gary Guo
@ 2019-03-26  9:43             ` Anup Patel
  0 siblings, 0 replies; 26+ messages in thread
From: Anup Patel @ 2019-03-26  9:43 UTC (permalink / raw)
  To: Gary Guo; +Cc: linux-riscv

On Mon, Mar 25, 2019 at 11:06 PM Gary Guo <gary@garyguo.net> wrote:
>
>
>
> On 25/03/2019 16:16, Anup Patel wrote:
>  > 1G mappings will give better performance compared 2M mappings.
>  >
>  > This will be very useful for performance hungry system with lot of RAM.
>  >
>  > This patch selects 1G or 2M mapping at runtime based on load address
>  > alignment.
>  >
>  > Regards,
>  > Anup
>  >
>
> Not always the case. In general, if set-associative or direct mapped
> TLBs are used, then either
> 1) separate TLBs are used for different granularities. Usually very few
> entries will be there for 1G mappings. E.g. Intel CPU's D-TLB has a
> separate array of 1G page TLB which is only 4 entries.
> 2) a smaller-granularity PTE is "faked" by page walker and inserted into
> the TLB. E.g. Intel CPU's I-TLB can't hold 1G entry at all.
>
> In either case the performance won't differ by much. In fact, if 1) is
> used, there are cases where performance can be negatively impacted by
> using larger pages.

The above is just one way of designing/organizing TLBs in HW. It is also
possible to have a unified TLB (as in ARM/ARM64-world SoCs) supporting
multiple page sizes in each TLB entry.

Depending on the TLB micro-architecture the performance gains will vary.

Irrespective of the TLB design, there will always be some gains because,
after all, we are saving page-table walks by creating bigger mappings.

In the case of virtualization with nested page tables, the savings can be
even bigger because it's a two-dimensional page table walk. For example,
let's assume M = "number of levels in guest page table" and N = "number
of levels in hypervisor page table". Now, the number of descriptor accesses
would be "MxN + M + N", so if we are able to reduce M or N or both then it
will save a lot of memory accesses.

Regards,
Anup

_______________________________________________
linux-riscv mailing list
linux-riscv@lists.infradead.org
http://lists.infradead.org/mailman/listinfo/linux-riscv

^ permalink raw reply	[flat|nested] 26+ messages in thread

* Re: [PATCH v3 4/4] RISC-V: Allow booting kernel from any 4KB aligned address
  2019-03-25 16:16           ` Anup Patel
  (?)
@ 2019-03-25 17:35           ` Gary Guo
  2019-03-26  9:43             ` Anup Patel
  -1 siblings, 1 reply; 26+ messages in thread
From: Gary Guo @ 2019-03-25 17:35 UTC (permalink / raw)
  To: linux-riscv



On 25/03/2019 16:16, Anup Patel wrote:
 > 1G mappings will give better performance compared 2M mappings.
 >
 > This will be very useful for performance hungry system with lot of RAM.
 >
 > This patch selects 1G or 2M mapping at runtime based on load address
 > alignment.
 >
 > Regards,
 > Anup
 >

Not always the case. In general, if set-associative or direct mapped 
TLBs are used, then either
1) separate TLBs are used for different granularities. Usually very few 
entries will be there for 1G mappings. E.g. Intel CPU's D-TLB has a 
separate array of 1G page TLB which is only 4 entries.
2) a smaller-granularity PTE is "faked" by page walker and inserted into 
the TLB. E.g. Intel CPU's I-TLB can't hold 1G entry at all.

In either case the performance won't differ by much. In fact, if 1) is
used, there are cases where performance can be negatively impacted by
using larger pages.

^ permalink raw reply	[flat|nested] 26+ messages in thread

* Re: [PATCH v3 4/4] RISC-V: Allow booting kernel from any 4KB aligned address
  2019-03-25 14:59         ` Christoph Hellwig
@ 2019-03-25 16:16           ` Anup Patel
  -1 siblings, 0 replies; 26+ messages in thread
From: Anup Patel @ 2019-03-25 16:16 UTC (permalink / raw)
  To: Christoph Hellwig
  Cc: Anup Patel, Palmer Dabbelt, Albert Ou, linux-kernel,
	Mike Rapoport, Atish Patra, Paul Walmsley, linux-riscv

On Mon, Mar 25, 2019 at 8:29 PM Christoph Hellwig <hch@infradead.org> wrote:
>
> On Mon, Mar 25, 2019 at 06:18:45PM +0530, Anup Patel wrote:
> > We are trying to address the following issues in the current code:
> > 1. The current setup_vm() maps all possible kernel virtual addresses (128GB
> > on a 64bit system and 1GB on a 32bit system). The amount of RAM present on
> > real systems might be much less, so we should not have kernel mappings for
> > non-existent RAM. Of course, we don't know the amount of RAM available in
> > setup_vm(), so we have to split page table setup into two parts and do the
> > minimal required mapping in setup_vm().
>
> Why do you even care about kernel mappings for non-existent RAM?

We care because there will always be some buggy kernel driver/code going
out of bounds and accessing non-existent RAM. If we map all possible
kernel virtual addresses by default, then the behaviour of such buggy
accesses will be unpredictable.

Further, I think we should also make the .text and .rodata sections of the
kernel read-only. This will protect kernel code and rodata.

The above things add more debuggability to the kernel and also help us
catch bugs faster.

>
> > 2. NOMMU kernel requires a swapper_pg_dir with identity mapping (VA == PA)
>
> Bullshit.  nommu per definition does not have page tables, and thus

Yes, I know, but we did see a kernel crash in the absence of a kernel page
table with identity mappings.

>
> > 3. For both NOMMU with tiny memory and MMU with tiny memory, the current
> > setup_vm() is not allowing us to place the kernel at non-2M (or non-4M)
> > aligned addresses, thereby causing memory below the kernel to be wasted.
>
> As mentioned a few times - nommu per definition does not have page
> tables, and in my up-to-date nommu port none of this code is even compiled
> in.  If someone still compiles that code in some codebase, that is just
> a bug that needs to be fixed.
>
> For MMU with tiny memory, it is a theoretical case, but I still haven't
> seen a good rationale why we'd care for that case.

Lots of RISC-V systems being built today target resource-constrained
use-cases (such as IoT or embedded) where more RAM means more
cost and more power.

It is very useful to have Linux RISC-V run in just a few MBs (if possible).

>
> > 4. For MMU based kernel, the current setup_vm() is hard-wired for fixed 2M
> > mapping size. It will require more changes if we want to do 1G mappings.
>
> And why do we care?  Even if we come up with a good reason for 1G
> mappings we'd better find a way to select it at run time and not require
> a gazillion of options to select how to map the kernel.

1G mappings will give better performance compared to 2M mappings.

This will be very useful for performance-hungry systems with lots of RAM.

This patch selects the 1G or 2M mapping size at runtime based on load
address alignment.

Regards,
Anup

^ permalink raw reply	[flat|nested] 26+ messages in thread

* Re: [PATCH v3 4/4] RISC-V: Allow booting kernel from any 4KB aligned address
  2019-03-25 12:48       ` Anup Patel
@ 2019-03-25 14:59         ` Christoph Hellwig
  -1 siblings, 0 replies; 26+ messages in thread
From: Christoph Hellwig @ 2019-03-25 14:59 UTC (permalink / raw)
  To: Anup Patel
  Cc: Christoph Hellwig, Anup Patel, Palmer Dabbelt, Albert Ou,
	linux-kernel, Mike Rapoport, Atish Patra, Paul Walmsley,
	linux-riscv

On Mon, Mar 25, 2019 at 06:18:45PM +0530, Anup Patel wrote:
> We are trying to address the following issues in the current code:
> 1. The current setup_vm() maps all possible kernel virtual addresses (128GB
> on a 64bit system and 1GB on a 32bit system). The amount of RAM present on
> real systems might be much less, so we should not have kernel mappings for
> non-existent RAM. Of course, we don't know the amount of RAM available in
> setup_vm(), so we have to split page table setup into two parts and do the
> minimal required mapping in setup_vm().

Why do you even care about kernel mappings for non-existent RAM?

> 2. NOMMU kernel requires a swapper_pg_dir with identity mapping (VA == PA)

Bullshit.  nommu per definition does not have page tables, and thus

> 3. For both NOMMU with tiny memory and MMU with tiny memory, the current
> setup_vm() is not allowing us to place the kernel at non-2M (or non-4M)
> aligned addresses, thereby causing memory below the kernel to be wasted.

As mentioned a few times - nommu per definition does not have page
tables, and in my up-to-date nommu port none of this code is even compiled
in.  If someone still compiles that code in some codebase, that is just
a bug that needs to be fixed.

For MMU with tiny memory, it is a theoretical case, but I still haven't
seen a good rationale why we'd care for that case.

> 4. For MMU based kernel, the current setup_vm() is hard-wired for fixed 2M
> mapping size. It will require more changes if we want to do 1G mappings.

And why do we care?  Even if we come up with a good reason for 1G
mappings we'd better find a way to select it at run time and not require
a gazillion of options to select how to map the kernel.

^ permalink raw reply	[flat|nested] 26+ messages in thread

* Re: [PATCH v3 4/4] RISC-V: Allow booting kernel from any 4KB aligned address
  2019-03-25 12:42       ` Anup Patel
@ 2019-03-25 14:55         ` Christoph Hellwig
  -1 siblings, 0 replies; 26+ messages in thread
From: Christoph Hellwig @ 2019-03-25 14:55 UTC (permalink / raw)
  To: Anup Patel
  Cc: Christoph Hellwig, Anup Patel, Palmer Dabbelt, Albert Ou,
	linux-kernel, Mike Rapoport, Atish Patra, Paul Walmsley,
	linux-riscv

On Mon, Mar 25, 2019 at 06:12:28PM +0530, Anup Patel wrote:
> It uses new code for 2M mappings as well.

That seems to be an odd view - yes, some code is shared, but we still
have two different code paths selected using ifdefs.

> The code is lot better, readable and flexible.

"Better" is in the eye of the beholder.  It is very much not easier to
read with the spurious indirect calls, at least.  And it is a lot
bigger.

^ permalink raw reply	[flat|nested] 26+ messages in thread

* Re: [PATCH v3 4/4] RISC-V: Allow booting kernel from any 4KB aligned address
  2019-03-25 11:39     ` Christoph Hellwig
@ 2019-03-25 12:48       ` Anup Patel
  -1 siblings, 0 replies; 26+ messages in thread
From: Anup Patel @ 2019-03-25 12:48 UTC (permalink / raw)
  To: Christoph Hellwig
  Cc: Anup Patel, Palmer Dabbelt, Albert Ou, linux-kernel,
	Mike Rapoport, Atish Patra, Paul Walmsley, linux-riscv

On Mon, Mar 25, 2019 at 5:09 PM Christoph Hellwig <hch@infradead.org> wrote:
>
> I'm still not sold on this at all.  It is a lot more code, a lot harder
> to read code and all for a very narrow corner case that isn't even
> going to be enabled in default configs.

In case you missed my previous response about why it's not just
about a very narrow corner case ......


We are trying to address the following issues in the current code:
1. The current setup_vm() maps all possible kernel virtual addresses (128GB
on a 64bit system and 1GB on a 32bit system). The amount of RAM present on
real systems might be much less, so we should not have kernel mappings for
non-existent RAM. Of course, we don't know the amount of RAM available in
setup_vm(), so we have to split page table setup into two parts and do the
minimal required mapping in setup_vm().
2. NOMMU kernel requires a swapper_pg_dir with identity mapping (VA == PA)
and without it we get a boot-time crash, so we cannot skip it for the NOMMU
case. For NOMMU, the PAGE_OFFSET will typically be 0x80020000 (or
0x80xxxxxx). This means the swapper_pmd array (which uses -PAGE_OFFSET)
will be over-sized, causing compile errors.
3. For both NOMMU with tiny memory and MMU with tiny memory, the current
setup_vm() is not allowing us to place the kernel at non-2M (or non-4M)
aligned addresses, thereby causing memory below the kernel to be wasted.
4. For an MMU based kernel, the current setup_vm() is hard-wired for a
fixed 2M mapping size. It will require more changes if we want to do 1G
mappings.

The above issues motivated us to re-write setup_vm().

We are trying to make the initial page table setup more flexible and
robust so that:
1. We don't have any unwanted mappings pointing to non-existent RAM
2. We can have any value of PAGE_OFFSET for the NOMMU case without the
page table arrays becoming oversized
3. We can create mappings of the best possible size to get good performance
4. We can boot from any 4K/2M/1G (or just 4K) aligned load address

Also, the end result of all this is much more readable page table setup
code shared between setup_vm() and setup_vm_final(), where the differences
are abstracted via mapping ops.


Regards,
Anup

^ permalink raw reply	[flat|nested] 26+ messages in thread

* Re: [PATCH v3 4/4] RISC-V: Allow booting kernel from any 4KB aligned address
  2019-03-25 11:39     ` Christoph Hellwig
@ 2019-03-25 12:42       ` Anup Patel
  -1 siblings, 0 replies; 26+ messages in thread
From: Anup Patel @ 2019-03-25 12:42 UTC (permalink / raw)
  To: Christoph Hellwig
  Cc: Anup Patel, Palmer Dabbelt, Albert Ou, linux-kernel,
	Mike Rapoport, Atish Patra, Paul Walmsley, linux-riscv

On Mon, Mar 25, 2019 at 5:09 PM Christoph Hellwig <hch@infradead.org> wrote:
>
> I'm still not sold on this at all.  It is a lot more code, a lot harder
> to read code and all for a very narrow corner case that isn't even
> going to be enabled in default configs.

The old page table setup code does not exist anymore.

It uses new code for 2M mappings as well.

In fact, it will also try to create 1G mappings if possible.

The code is a lot better: more readable and flexible.

Regards,
Anup

^ permalink raw reply	[flat|nested] 26+ messages in thread

* Re: [PATCH v3 4/4] RISC-V: Allow booting kernel from any 4KB aligned address
  2019-03-25  9:23   ` Anup Patel
@ 2019-03-25 11:39     ` Christoph Hellwig
  -1 siblings, 0 replies; 26+ messages in thread
From: Christoph Hellwig @ 2019-03-25 11:39 UTC (permalink / raw)
  To: Anup Patel
  Cc: Palmer Dabbelt, Albert Ou, linux-kernel, Mike Rapoport,
	Christoph Hellwig, Atish Patra, Paul Walmsley, linux-riscv

I'm still not sold on this at all.  It is a lot more code, a lot harder
to read code and all for a very narrow corner case that isn't even
going to be enabled in default configs.

^ permalink raw reply	[flat|nested] 26+ messages in thread

* [PATCH v3 4/4] RISC-V: Allow booting kernel from any 4KB aligned address
  2019-03-25  9:22 [PATCH v3 0/4] Boot RISC-V " Anup Patel
@ 2019-03-25  9:23   ` Anup Patel
  0 siblings, 0 replies; 26+ messages in thread
From: Anup Patel @ 2019-03-25  9:23 UTC (permalink / raw)
  To: Palmer Dabbelt, Albert Ou
  Cc: Atish Patra, Christoph Hellwig, Paul Walmsley, Mike Rapoport,
	linux-riscv, linux-kernel, Anup Patel

Currently, we have to boot the RISCV64 kernel from a 2MB aligned physical
address and the RISCV32 kernel from a 4MB aligned physical address. This
constraint exists because the initial pagetable setup (i.e. setup_vm())
maps the entire RAM using hugepages (i.e. 2MB for a 3-level pagetable and
4MB for a 2-level pagetable).

Further, the above booting constraint also results in memory wastage
because if we boot the kernel from some <xyz> address (which is not the
same as the RAM start address) then the RISCV kernel will map the
PAGE_OFFSET virtual address linearly to the <xyz> physical address and the
memory between RAM start and <xyz> will be reserved/unusable.

For example, a RISCV64 kernel booted from 0x80200000 will waste 2MB of RAM
and a RISCV32 kernel booted from 0x80400000 will waste 4MB of RAM.

This patch re-writes the initial pagetable setup code to allow booting
RISCV32 and RISCV64 kernels from any 4KB (i.e. PAGE_SIZE) aligned address.

To achieve this:
1. We add kconfig option BOOT_PAGE_ALIGNED. When it is enabled we use
   4KB mappings in initial page table setup otherwise we use 2MB/4MB
   mappings.
2. We map kernel and dtb (few MBs) in setup_vm() (called from head.S)
3. Once we reach paging_init() (called from setup_arch()) after
   memblock setup, we map all available memory banks.

With this patch in place, the booting constraint for the RISCV32 and
RISCV64 kernels is much more relaxed when CONFIG_BOOT_PAGE_ALIGNED=y and
we can now boot the kernel very close to RAM start, thereby minimizing
memory wastage.

Signed-off-by: Anup Patel <anup.patel@wdc.com>
---
 arch/riscv/Kconfig                  |  12 +
 arch/riscv/include/asm/fixmap.h     |   5 +
 arch/riscv/include/asm/pgtable-64.h |   5 +
 arch/riscv/include/asm/pgtable.h    |   5 +
 arch/riscv/kernel/head.S            |   1 +
 arch/riscv/kernel/setup.c           |   4 +-
 arch/riscv/mm/init.c                | 351 ++++++++++++++++++++++++----
 7 files changed, 335 insertions(+), 48 deletions(-)

diff --git a/arch/riscv/Kconfig b/arch/riscv/Kconfig
index eb56c82d8aa1..d7812b1f7c7e 100644
--- a/arch/riscv/Kconfig
+++ b/arch/riscv/Kconfig
@@ -172,6 +172,18 @@ config SMP
 
 	  If you don't know what to do here, say N.
 
+config BOOT_PAGE_ALIGNED
+	bool "Allow booting from page aligned address"
+	default n
+	help
+	  This enables support for booting the kernel from any page aligned
+	  address (i.e. 4KB aligned). This option is particularly useful on
+	  systems with a very small RAM (few MBs) as we can boot the kernel
+	  closer to the RAM start address, thereby reducing the amount of
+	  unusable RAM below the kernel.
+
+	  If you don't know what to do here, say N.
+
 config NR_CPUS
 	int "Maximum number of CPUs (2-32)"
 	range 2 32
diff --git a/arch/riscv/include/asm/fixmap.h b/arch/riscv/include/asm/fixmap.h
index c207f6634b91..9c66033c3a54 100644
--- a/arch/riscv/include/asm/fixmap.h
+++ b/arch/riscv/include/asm/fixmap.h
@@ -21,6 +21,11 @@
  */
 enum fixed_addresses {
 	FIX_HOLE,
+#define FIX_FDT_SIZE	SZ_1M
+	FIX_FDT_END,
+	FIX_FDT = FIX_FDT_END + FIX_FDT_SIZE / PAGE_SIZE - 1,
+	FIX_PTE,
+	FIX_PMD,
 	FIX_EARLYCON_MEM_BASE,
 	__end_of_fixed_addresses
 };
diff --git a/arch/riscv/include/asm/pgtable-64.h b/arch/riscv/include/asm/pgtable-64.h
index 7aa0ea9bd8bb..56ecc3dc939d 100644
--- a/arch/riscv/include/asm/pgtable-64.h
+++ b/arch/riscv/include/asm/pgtable-64.h
@@ -78,6 +78,11 @@ static inline pmd_t pfn_pmd(unsigned long pfn, pgprot_t prot)
 	return __pmd((pfn << _PAGE_PFN_SHIFT) | pgprot_val(prot));
 }
 
+static inline unsigned long _pmd_pfn(pmd_t pmd)
+{
+	return pmd_val(pmd) >> _PAGE_PFN_SHIFT;
+}
+
 #define pmd_ERROR(e) \
 	pr_err("%s:%d: bad pmd %016lx.\n", __FILE__, __LINE__, pmd_val(e))
 
diff --git a/arch/riscv/include/asm/pgtable.h b/arch/riscv/include/asm/pgtable.h
index 1141364d990e..c4968b47c37d 100644
* [PATCH v3 4/4] RISC-V: Allow booting kernel from any 4KB aligned address
@ 2019-03-25  9:23   ` Anup Patel
  0 siblings, 0 replies; 26+ messages in thread
From: Anup Patel @ 2019-03-25  9:23 UTC (permalink / raw)
  To: Palmer Dabbelt, Albert Ou
  Cc: Anup Patel, linux-kernel, Mike Rapoport, Christoph Hellwig,
	Atish Patra, Paul Walmsley, linux-riscv

Currently, we have to boot the RISCV64 kernel from a 2MB aligned physical
address and the RISCV32 kernel from a 4MB aligned physical address. This
constraint exists because the initial pagetable setup (i.e. setup_vm())
maps the entire RAM using hugepages (i.e. 2MB for a 3-level pagetable and
4MB for a 2-level pagetable).

Further, the above booting constraint also results in memory wastage:
if we boot the kernel from some <xyz> address (which is not the same as
the RAM start address), then the RISCV kernel will map the PAGE_OFFSET
virtual address linearly to the <xyz> physical address, and the memory
between the RAM start and <xyz> will be reserved/unusable.

For example, a RISCV64 kernel booted from 0x80200000 will waste 2MB of
RAM, and a RISCV32 kernel booted from 0x80400000 will waste 4MB of RAM.

This patch re-writes the initial pagetable setup code to allow booting
RISCV32 and RISCV64 kernels from any 4KB (i.e. PAGE_SIZE) aligned address.

To achieve this:
1. We add a kconfig option BOOT_PAGE_ALIGNED. When it is enabled, we use
   4KB mappings in the initial page table setup; otherwise we use 2MB/4MB
   mappings.
2. We map the kernel and DTB (a few MBs) in setup_vm() (called from head.S).
3. Once we reach paging_init() (called from setup_arch()) after the
   memblock setup, we map all available memory banks.

With this patch in place, the booting constraint for the RISCV32 and
RISCV64 kernels is much more relaxed when CONFIG_BOOT_PAGE_ALIGNED=y,
and we can now boot the kernel very close to the RAM start, thereby
minimizing memory wastage.

Signed-off-by: Anup Patel <anup.patel@wdc.com>
---
 arch/riscv/Kconfig                  |  12 +
 arch/riscv/include/asm/fixmap.h     |   5 +
 arch/riscv/include/asm/pgtable-64.h |   5 +
 arch/riscv/include/asm/pgtable.h    |   5 +
 arch/riscv/kernel/head.S            |   1 +
 arch/riscv/kernel/setup.c           |   4 +-
 arch/riscv/mm/init.c                | 351 ++++++++++++++++++++++++----
 7 files changed, 335 insertions(+), 48 deletions(-)

diff --git a/arch/riscv/Kconfig b/arch/riscv/Kconfig
index eb56c82d8aa1..d7812b1f7c7e 100644
--- a/arch/riscv/Kconfig
+++ b/arch/riscv/Kconfig
@@ -172,6 +172,18 @@ config SMP
 
 	  If you don't know what to do here, say N.
 
+config BOOT_PAGE_ALIGNED
+	bool "Allow booting from page aligned address"
+	default n
+	help
+	  This enables support for booting the kernel from any page aligned
+	  address (i.e. 4KB aligned). This option is particularly useful on
+	  systems with very small RAM (a few MBs), as we can boot the kernel
+	  closer to the RAM start address, thereby reducing the amount of
+	  unusable RAM below the kernel.
+
+	  If you don't know what to do here, say N.
+
 config NR_CPUS
 	int "Maximum number of CPUs (2-32)"
 	range 2 32
diff --git a/arch/riscv/include/asm/fixmap.h b/arch/riscv/include/asm/fixmap.h
index c207f6634b91..9c66033c3a54 100644
--- a/arch/riscv/include/asm/fixmap.h
+++ b/arch/riscv/include/asm/fixmap.h
@@ -21,6 +21,11 @@
  */
 enum fixed_addresses {
 	FIX_HOLE,
+#define FIX_FDT_SIZE	SZ_1M
+	FIX_FDT_END,
+	FIX_FDT = FIX_FDT_END + FIX_FDT_SIZE / PAGE_SIZE - 1,
+	FIX_PTE,
+	FIX_PMD,
 	FIX_EARLYCON_MEM_BASE,
 	__end_of_fixed_addresses
 };
diff --git a/arch/riscv/include/asm/pgtable-64.h b/arch/riscv/include/asm/pgtable-64.h
index 7aa0ea9bd8bb..56ecc3dc939d 100644
--- a/arch/riscv/include/asm/pgtable-64.h
+++ b/arch/riscv/include/asm/pgtable-64.h
@@ -78,6 +78,11 @@ static inline pmd_t pfn_pmd(unsigned long pfn, pgprot_t prot)
 	return __pmd((pfn << _PAGE_PFN_SHIFT) | pgprot_val(prot));
 }
 
+static inline unsigned long _pmd_pfn(pmd_t pmd)
+{
+	return pmd_val(pmd) >> _PAGE_PFN_SHIFT;
+}
+
 #define pmd_ERROR(e) \
 	pr_err("%s:%d: bad pmd %016lx.\n", __FILE__, __LINE__, pmd_val(e))
 
diff --git a/arch/riscv/include/asm/pgtable.h b/arch/riscv/include/asm/pgtable.h
index 1141364d990e..c4968b47c37d 100644
--- a/arch/riscv/include/asm/pgtable.h
+++ b/arch/riscv/include/asm/pgtable.h
@@ -127,6 +127,11 @@ static inline pgd_t pfn_pgd(unsigned long pfn, pgprot_t prot)
 	return __pgd((pfn << _PAGE_PFN_SHIFT) | pgprot_val(prot));
 }
 
+static inline unsigned long _pgd_pfn(pgd_t pgd)
+{
+	return pgd_val(pgd) >> _PAGE_PFN_SHIFT;
+}
+
 #define pgd_index(addr) (((addr) >> PGDIR_SHIFT) & (PTRS_PER_PGD - 1))
 
 /* Locate an entry in the page global directory */
diff --git a/arch/riscv/kernel/head.S b/arch/riscv/kernel/head.S
index 3449671ec867..61e253ae38b4 100644
--- a/arch/riscv/kernel/head.S
+++ b/arch/riscv/kernel/head.S
@@ -62,6 +62,7 @@ clear_bss_done:
 
 	/* Initialize page tables and relocate to virtual addresses */
 	la sp, init_thread_union + THREAD_SIZE
+	mv a0, s1
 	call setup_vm
 	call relocate
 
diff --git a/arch/riscv/kernel/setup.c b/arch/riscv/kernel/setup.c
index 540a331d1376..79670458527d 100644
--- a/arch/riscv/kernel/setup.c
+++ b/arch/riscv/kernel/setup.c
@@ -30,6 +30,7 @@
 #include <linux/sched/task.h>
 #include <linux/swiotlb.h>
 
+#include <asm/fixmap.h>
 #include <asm/setup.h>
 #include <asm/sections.h>
 #include <asm/pgtable.h>
@@ -54,7 +55,8 @@ unsigned long boot_cpu_hartid;
 
 void __init parse_dtb(unsigned int hartid, void *dtb)
 {
-	if (early_init_dt_scan(__va(dtb)))
+	dtb = (void *)fix_to_virt(FIX_FDT) + ((uintptr_t)dtb & ~PAGE_MASK);
+	if (early_init_dt_scan(dtb))
 		return;
 
 	pr_err("No DTB passed to the kernel\n");
diff --git a/arch/riscv/mm/init.c b/arch/riscv/mm/init.c
index f9add4381c73..56970dab3727 100644
--- a/arch/riscv/mm/init.c
+++ b/arch/riscv/mm/init.c
@@ -1,14 +1,7 @@
+/* SPDX-License-Identifier: GPL-2.0 */
 /*
  * Copyright (C) 2012 Regents of the University of California
- *
- *   This program is free software; you can redistribute it and/or
- *   modify it under the terms of the GNU General Public License
- *   as published by the Free Software Foundation, version 2.
- *
- *   This program is distributed in the hope that it will be useful,
- *   but WITHOUT ANY WARRANTY; without even the implied warranty of
- *   MERCHANTABILITY or FITNESS FOR A PARTICULAR PURPOSE.  See the
- *   GNU General Public License for more details.
+ * Copyright (C) 2019 Western Digital Corporation or its affiliates.
  */
 
 #include <linux/init.h>
@@ -49,13 +42,6 @@ void setup_zero_page(void)
 	memset((void *)empty_zero_page, 0, PAGE_SIZE);
 }
 
-void __init paging_init(void)
-{
-	setup_zero_page();
-	local_flush_tlb_all();
-	zone_sizes_init();
-}
-
 void __init mem_init(void)
 {
 #ifdef CONFIG_FLATMEM
@@ -152,16 +138,28 @@ EXPORT_SYMBOL(va_pa_offset);
 unsigned long pfn_base;
 EXPORT_SYMBOL(pfn_base);
 
+#define MAX_EARLY_MAPPING_SIZE	SZ_128M
+
 pgd_t swapper_pg_dir[PTRS_PER_PGD] __page_aligned_bss;
 
 #ifndef __PAGETABLE_PMD_FOLDED
-#define NUM_SWAPPER_PMDS ((uintptr_t)-PAGE_OFFSET >> PGDIR_SHIFT)
-pmd_t swapper_pmd[PTRS_PER_PMD*((-PAGE_OFFSET)/PGDIR_SIZE)] __page_aligned_bss;
+#if MAX_EARLY_MAPPING_SIZE < PGDIR_SIZE
+#define NUM_SWAPPER_PMDS	1UL
+#else
+#define NUM_SWAPPER_PMDS	(MAX_EARLY_MAPPING_SIZE/PGDIR_SIZE)
+#endif
+pmd_t swapper_pmd[PTRS_PER_PMD*NUM_SWAPPER_PMDS] __page_aligned_bss;
 pmd_t fixmap_pmd[PTRS_PER_PMD] __page_aligned_bss;
+#define NUM_SWAPPER_PTES	(MAX_EARLY_MAPPING_SIZE/PMD_SIZE)
+#else
+#define NUM_SWAPPER_PTES	(MAX_EARLY_MAPPING_SIZE/PGDIR_SIZE)
 #endif
 
+pte_t swapper_pte[PTRS_PER_PTE*NUM_SWAPPER_PTES] __page_aligned_bss;
 pte_t fixmap_pte[PTRS_PER_PTE] __page_aligned_bss;
 
+static uintptr_t map_size;
+
 void __set_fixmap(enum fixed_addresses idx, phys_addr_t phys, pgprot_t prot)
 {
 	unsigned long addr = __fix_to_virt(idx);
@@ -179,6 +177,201 @@ void __set_fixmap(enum fixed_addresses idx, phys_addr_t phys, pgprot_t prot)
 	}
 }
 
+struct mapping_ops {
+	pte_t *(*get_pte_virt)(phys_addr_t pa);
+	phys_addr_t (*alloc_pte)(uintptr_t va);
+	pmd_t *(*get_pmd_virt)(phys_addr_t pa);
+	phys_addr_t (*alloc_pmd)(uintptr_t va);
+};
+
+static phys_addr_t __init final_alloc_pgtable(void)
+{
+	return memblock_phys_alloc(PAGE_SIZE, PAGE_SIZE);
+}
+
+static pte_t *__init early_get_pte_virt(phys_addr_t pa)
+{
+	return (pte_t *)((uintptr_t)pa);
+}
+
+static pte_t *__init final_get_pte_virt(phys_addr_t pa)
+{
+	clear_fixmap(FIX_PTE);
+
+	return (pte_t *)set_fixmap_offset(FIX_PTE, pa);
+}
+
+static phys_addr_t __init early_alloc_pte(uintptr_t va)
+{
+	pte_t *base = swapper_pte;
+	uintptr_t pte_num = ((va - PAGE_OFFSET) >> PMD_SHIFT);
+
+	BUG_ON(pte_num >= NUM_SWAPPER_PTES);
+
+	return (uintptr_t)&base[pte_num * PTRS_PER_PTE];
+}
+
+static phys_addr_t __init final_alloc_pte(uintptr_t va)
+{
+	return final_alloc_pgtable();
+}
+
+static void __init create_pte_mapping(pte_t *ptep,
+				      uintptr_t va, phys_addr_t pa,
+				      phys_addr_t sz, pgprot_t prot)
+{
+	uintptr_t pte_index = pte_index(va);
+
+	BUG_ON(sz != PAGE_SIZE);
+
+	if (pte_none(ptep[pte_index]))
+		ptep[pte_index] = pfn_pte(PFN_DOWN(pa), prot);
+}
+
+#ifndef __PAGETABLE_PMD_FOLDED
+static pmd_t *__init early_get_pmd_virt(phys_addr_t pa)
+{
+	return (pmd_t *)((uintptr_t)pa);
+}
+
+static pmd_t *__init final_get_pmd_virt(phys_addr_t pa)
+{
+	clear_fixmap(FIX_PMD);
+
+	return (pmd_t *)set_fixmap_offset(FIX_PMD, pa);
+}
+
+static phys_addr_t __init early_alloc_pmd(uintptr_t va)
+{
+	pmd_t *base = swapper_pmd;
+	uintptr_t pmd_num = (va - PAGE_OFFSET) >> PGDIR_SHIFT;
+
+	BUG_ON(pmd_num >= NUM_SWAPPER_PMDS);
+
+	return (uintptr_t)&base[pmd_num * PTRS_PER_PMD];
+}
+
+static phys_addr_t __init final_alloc_pmd(uintptr_t va)
+{
+	return final_alloc_pgtable();
+}
+
+static void __init create_pmd_mapping(pmd_t *pmdp,
+				      uintptr_t va, phys_addr_t pa,
+				      phys_addr_t sz, pgprot_t prot,
+				      struct mapping_ops *ops)
+{
+	pte_t *ptep;
+	phys_addr_t pte_phys;
+	uintptr_t pmd_index = pmd_index(va);
+
+	if (sz == PMD_SIZE) {
+		if (pmd_none(pmdp[pmd_index]))
+			pmdp[pmd_index] = pfn_pmd(PFN_DOWN(pa), prot);
+		return;
+	}
+
+	if (pmd_none(pmdp[pmd_index])) {
+		pte_phys = ops->alloc_pte(va);
+		pmdp[pmd_index] = pfn_pmd(PFN_DOWN(pte_phys),
+					  __pgprot(_PAGE_TABLE));
+		ptep = ops->get_pte_virt(pte_phys);
+		memset(ptep, 0, PAGE_SIZE);
+	} else {
+		pte_phys = PFN_PHYS(_pmd_pfn(pmdp[pmd_index]));
+		ptep = ops->get_pte_virt(pte_phys);
+	}
+
+	create_pte_mapping(ptep, va, pa, sz, prot);
+}
+
+
+static void __init create_pgd_mapping(pgd_t *pgdp,
+				      uintptr_t va, phys_addr_t pa,
+				      phys_addr_t sz, pgprot_t prot,
+				      struct mapping_ops *ops)
+{
+	pmd_t *pmdp;
+	phys_addr_t pmd_phys;
+	uintptr_t pgd_index = pgd_index(va);
+
+	if (sz == PGDIR_SIZE) {
+		if (pgd_val(pgdp[pgd_index]) == 0)
+			pgdp[pgd_index] = pfn_pgd(PFN_DOWN(pa), prot);
+		return;
+	}
+
+	if (pgd_val(pgdp[pgd_index]) == 0) {
+		pmd_phys = ops->alloc_pmd(va);
+		pgdp[pgd_index] = pfn_pgd(PFN_DOWN(pmd_phys),
+					  __pgprot(_PAGE_TABLE));
+		pmdp = ops->get_pmd_virt(pmd_phys);
+		memset(pmdp, 0, PAGE_SIZE);
+	} else {
+		pmd_phys = PFN_PHYS(_pgd_pfn(pgdp[pgd_index]));
+		pmdp = ops->get_pmd_virt(pmd_phys);
+	}
+
+	create_pmd_mapping(pmdp, va, pa, sz, prot, ops);
+}
+#else
+static void __init create_pgd_mapping(pgd_t *pgdp,
+				      uintptr_t va, phys_addr_t pa,
+				      phys_addr_t sz, pgprot_t prot,
+				      struct mapping_ops *ops)
+{
+	pte_t *ptep;
+	phys_addr_t pte_phys;
+	uintptr_t pgd_index = pgd_index(va);
+
+	if (sz == PGDIR_SIZE) {
+		if (pgd_val(pgdp[pgd_index]) == 0)
+			pgdp[pgd_index] = pfn_pgd(PFN_DOWN(pa), prot);
+		return;
+	}
+
+	if (pgd_val(pgdp[pgd_index]) == 0) {
+		pte_phys = ops->alloc_pte(va);
+		pgdp[pgd_index] = pfn_pgd(PFN_DOWN(pte_phys),
+					  __pgprot(_PAGE_TABLE));
+		ptep = ops->get_pte_virt(pte_phys);
+		memset(ptep, 0, PAGE_SIZE);
+	} else {
+		pte_phys = PFN_PHYS(_pgd_pfn(pgdp[pgd_index]));
+		ptep = ops->get_pte_virt(pte_phys);
+	}
+
+	create_pte_mapping(ptep, va, pa, sz, prot);
+}
+#endif
+
+static uintptr_t __init best_map_size(uintptr_t load_pa, phys_addr_t size)
+{
+#ifdef CONFIG_BOOT_PAGE_ALIGNED
+	uintptr_t map_sz = PAGE_SIZE;
+#else
+#ifndef __PAGETABLE_PMD_FOLDED
+	uintptr_t map_sz = PMD_SIZE;
+#else
+	uintptr_t map_sz = PGDIR_SIZE;
+#endif
+#endif
+
+#ifndef __PAGETABLE_PMD_FOLDED
+	if (!(load_pa & (PMD_SIZE - 1)) &&
+	    (size >= PMD_SIZE) &&
+	    (map_sz < PMD_SIZE))
+		map_sz = PMD_SIZE;
+#endif
+
+	if (!(load_pa & (PGDIR_SIZE - 1)) &&
+	    (size >= PGDIR_SIZE) &&
+	    (map_sz < PGDIR_SIZE))
+		map_sz = PGDIR_SIZE;
+
+	return map_sz;
+}
+
 /*
  * The setup_vm() is called from head.S with MMU-off.
  *
@@ -192,46 +385,110 @@ void __set_fixmap(enum fixed_addresses idx, phys_addr_t phys, pgprot_t prot)
  * Currently, the above requirements are honoured by using custom CFLAGS
  * for init.o in mm/Makefile.
  */
-asmlinkage void __init setup_vm(void)
+asmlinkage void __init setup_vm(uintptr_t dtb_pa)
 {
-	uintptr_t i;
-	uintptr_t pa = (uintptr_t) &_start;
+	uintptr_t va, end_va;
+	uintptr_t load_pa = (uintptr_t)(&_start);
+	uintptr_t load_sz = (uintptr_t)(&_end) - load_pa;
+	pgprot_t tableprot = __pgprot(_PAGE_TABLE);
 	pgprot_t prot = __pgprot(pgprot_val(PAGE_KERNEL) | _PAGE_EXEC);
+	struct mapping_ops ops;
 
-	va_pa_offset = PAGE_OFFSET - pa;
-	pfn_base = PFN_DOWN(pa);
+	va_pa_offset = PAGE_OFFSET - load_pa;
+	pfn_base = PFN_DOWN(load_pa);
+	map_size = best_map_size(load_pa, PGDIR_SIZE);
 
 	/* Sanity check alignment and size */
 	BUG_ON((PAGE_OFFSET % PGDIR_SIZE) != 0);
-	BUG_ON((pa % (PAGE_SIZE * PTRS_PER_PTE)) != 0);
+	BUG_ON((load_pa % map_size) != 0);
+	BUG_ON(load_sz > MAX_EARLY_MAPPING_SIZE);
+
+	/* Setup swapper ops */
+	ops.get_pte_virt = early_get_pte_virt;
+	ops.alloc_pte = early_alloc_pte;
+	ops.get_pmd_virt = NULL;
+	ops.alloc_pmd = NULL;
 
 #ifndef __PAGETABLE_PMD_FOLDED
-	for (i = 0; i < (-PAGE_OFFSET)/PGDIR_SIZE; ++i) {
-		size_t o = (PAGE_OFFSET >> PGDIR_SHIFT) % PTRS_PER_PGD + i;
+	/* Update mapping ops for PMD */
+	ops.get_pmd_virt = early_get_pmd_virt;
+	ops.alloc_pmd = early_alloc_pmd;
+
+	/* Setup swapper PGD and PMD for fixmap */
+	create_pgd_mapping(swapper_pg_dir, FIXADDR_START,
+			   (uintptr_t)fixmap_pmd, PGDIR_SIZE, tableprot, &ops);
+	create_pmd_mapping(fixmap_pmd, FIXADDR_START,
+			   (uintptr_t)fixmap_pte, PMD_SIZE, tableprot, &ops);
+#else
+	/* Setup swapper PGD for fixmap */
+	create_pgd_mapping(swapper_pg_dir, FIXADDR_START,
+			   (uintptr_t)fixmap_pte, PGDIR_SIZE, tableprot, &ops);
+#endif
 
-		swapper_pg_dir[o] =
-			pfn_pgd(PFN_DOWN((uintptr_t)swapper_pmd) + i,
-				__pgprot(_PAGE_TABLE));
-	}
-	for (i = 0; i < ARRAY_SIZE(swapper_pmd); i++)
-		swapper_pmd[i] = pfn_pmd(PFN_DOWN(pa + i * PMD_SIZE), prot);
-
-	swapper_pg_dir[(FIXADDR_START >> PGDIR_SHIFT) % PTRS_PER_PGD] =
-		pfn_pgd(PFN_DOWN((uintptr_t)fixmap_pmd),
-				__pgprot(_PAGE_TABLE));
-	fixmap_pmd[(FIXADDR_START >> PMD_SHIFT) % PTRS_PER_PMD] =
-		pfn_pmd(PFN_DOWN((uintptr_t)fixmap_pte),
-				__pgprot(_PAGE_TABLE));
+	/*
+	 * Setup swapper PGD covering the entire kernel, which allows us
+	 * to reach paging_init(). We map all memory banks later in
+	 * setup_vm_final() below.
+	 */
+	end_va = PAGE_OFFSET + load_sz;
+	for (va = PAGE_OFFSET; va < end_va; va += map_size)
+		create_pgd_mapping(swapper_pg_dir, va,
+				   load_pa + (va - PAGE_OFFSET),
+				   map_size, prot, &ops);
+
+	/* Create fixed mapping for early FDT parsing */
+	end_va = __fix_to_virt(FIX_FDT) + FIX_FDT_SIZE;
+	for (va = __fix_to_virt(FIX_FDT); va < end_va; va += PAGE_SIZE)
+		create_pte_mapping(fixmap_pte, va,
+				   dtb_pa + (va - __fix_to_virt(FIX_FDT)),
+				   PAGE_SIZE, prot);
+}
+
+static void __init setup_vm_final(void)
+{
+	phys_addr_t pa, start, end;
+	struct memblock_region *reg;
+	struct mapping_ops ops;
+	pgprot_t prot = __pgprot(pgprot_val(PAGE_KERNEL) | _PAGE_EXEC);
+
+	/* Setup mapping ops */
+	ops.get_pte_virt = final_get_pte_virt;
+	ops.alloc_pte = final_alloc_pte;
+#ifndef __PAGETABLE_PMD_FOLDED
+	ops.get_pmd_virt = final_get_pmd_virt;
+	ops.alloc_pmd = final_alloc_pmd;
 #else
-	for (i = 0; i < (-PAGE_OFFSET)/PGDIR_SIZE; ++i) {
-		size_t o = (PAGE_OFFSET >> PGDIR_SHIFT) % PTRS_PER_PGD + i;
+	ops.get_pmd_virt = NULL;
+	ops.alloc_pmd = NULL;
+#endif
 
-		swapper_pg_dir[o] =
-			pfn_pgd(PFN_DOWN(pa + i * PGDIR_SIZE), prot);
+	/* Map all memory banks */
+	for_each_memblock(memory, reg) {
+		start = reg->base;
+		end = start + reg->size;
+
+		if (start >= end)
+			break;
+		if (memblock_is_nomap(reg))
+			continue;
+		if (start <= __pa(PAGE_OFFSET) &&
+		    __pa(PAGE_OFFSET) < end)
+			start = __pa(PAGE_OFFSET);
+
+		for (pa = start; pa < end; pa += map_size)
+			create_pgd_mapping(swapper_pg_dir,
+					   (uintptr_t)__va(pa), pa,
+					   map_size, prot, &ops);
 	}
 
-	swapper_pg_dir[(FIXADDR_START >> PGDIR_SHIFT) % PTRS_PER_PGD] =
-		pfn_pgd(PFN_DOWN((uintptr_t)fixmap_pte),
-				__pgprot(_PAGE_TABLE));
-#endif
+	clear_fixmap(FIX_PTE);
+	clear_fixmap(FIX_PMD);
+}
+
+void __init paging_init(void)
+{
+	setup_vm_final();
+	setup_zero_page();
+	local_flush_tlb_all();
+	zone_sizes_init();
 }
-- 
2.17.1


