All of lore.kernel.org
 help / color / mirror / Atom feed
* [PATCH RFC] binfmt_elf: fully allocate bss pages
@ 2023-09-14 15:59 Thomas Weißschuh
  2023-09-14 19:49 ` Eric W. Biederman
                   ` (4 more replies)
  0 siblings, 5 replies; 20+ messages in thread
From: Thomas Weißschuh @ 2023-09-14 15:59 UTC (permalink / raw)
  To: Alexander Viro, Christian Brauner, Eric Biederman, Kees Cook
  Cc: Mark Brown, Willy Tarreau, linux-fsdevel, linux-mm, linux-kernel,
	Sebastian Ott, stable, Thomas Weißschuh

When allocating the pages for bss the start address needs to be rounded
down instead of up.
Otherwise the start of the bss segment may be unmapped.

The was reported to happen on Aarch64:

Memory allocated by set_brk():
Before: start=0x420000 end=0x420000
After:  start=0x41f000 end=0x420000

The triggering binary looks like this:

    Elf file type is EXEC (Executable file)
    Entry point 0x400144
    There are 4 program headers, starting at offset 64

    Program Headers:
      Type           Offset             VirtAddr           PhysAddr
                     FileSiz            MemSiz              Flags  Align
      LOAD           0x0000000000000000 0x0000000000400000 0x0000000000400000
                     0x0000000000000178 0x0000000000000178  R E    0x10000
      LOAD           0x000000000000ffe8 0x000000000041ffe8 0x000000000041ffe8
                     0x0000000000000000 0x0000000000000008  RW     0x10000
      NOTE           0x0000000000000120 0x0000000000400120 0x0000000000400120
                     0x0000000000000024 0x0000000000000024  R      0x4
      GNU_STACK      0x0000000000000000 0x0000000000000000 0x0000000000000000
                     0x0000000000000000 0x0000000000000000  RW     0x10

     Section to Segment mapping:
      Segment Sections...
       00     .note.gnu.build-id .text .eh_frame
       01     .bss
       02     .note.gnu.build-id
       03

Reported-by: Sebastian Ott <sebott@redhat.com>
Closes: https://lore.kernel.org/lkml/5d49767a-fbdc-fbe7-5fb2-d99ece3168cb@redhat.com/
Fixes: 1da177e4c3f4 ("Linux-2.6.12-rc2")
Cc: stable@vger.kernel.org
Signed-off-by: Thomas Weißschuh <linux@weissschuh.net>
---

I'm not really familiar with the ELF loading process, so putting this
out as RFC.

A example binary compiled with aarch64-linux-gnu-gcc 13.2.0 is available
at https://test.t-8ch.de/binfmt-bss-repro.bin
---
 fs/binfmt_elf.c | 2 +-
 1 file changed, 1 insertion(+), 1 deletion(-)

diff --git a/fs/binfmt_elf.c b/fs/binfmt_elf.c
index 7b3d2d491407..4008a57d388b 100644
--- a/fs/binfmt_elf.c
+++ b/fs/binfmt_elf.c
@@ -112,7 +112,7 @@ static struct linux_binfmt elf_format = {
 
 static int set_brk(unsigned long start, unsigned long end, int prot)
 {
-	start = ELF_PAGEALIGN(start);
+	start = ELF_PAGESTART(start);
 	end = ELF_PAGEALIGN(end);
 	if (end > start) {
 		/*

---
base-commit: aed8aee11130a954356200afa3f1b8753e8a9482
change-id: 20230914-bss-alloc-f523fa61718c

Best regards,
-- 
Thomas Weißschuh <linux@weissschuh.net>


^ permalink raw reply related	[flat|nested] 20+ messages in thread

* Re: [PATCH RFC] binfmt_elf: fully allocate bss pages
  2023-09-14 15:59 [PATCH RFC] binfmt_elf: fully allocate bss pages Thomas Weißschuh
@ 2023-09-14 19:49 ` Eric W. Biederman
  2023-09-14 22:18   ` Thomas Weißschuh
  2023-09-15 19:35 ` Sebastian Ott
                   ` (3 subsequent siblings)
  4 siblings, 1 reply; 20+ messages in thread
From: Eric W. Biederman @ 2023-09-14 19:49 UTC (permalink / raw)
  To: Thomas Weißschuh
  Cc: Alexander Viro, Christian Brauner, Kees Cook, Mark Brown,
	Willy Tarreau, linux-fsdevel, linux-mm, linux-kernel,
	Sebastian Ott, stable

Thomas Weißschuh <linux@weissschuh.net> writes:

> When allocating the pages for bss the start address needs to be rounded
> down instead of up.
> Otherwise the start of the bss segment may be unmapped.
>
> The was reported to happen on Aarch64:

Those program headers you quote look corrupt.

The address 0x41ffe8 is not 0x10000 aligned.

I don't think anything in the elf specification allows that.

The most common way to have bss is for a elf segment to have a larger
memsize than filesize.  In which case rounding up is the correct way to
handle things.

We definitely need to verify the appended bss case works, before
taking this patch, or we will get random application failures
because parts of the data segment are being zeroed, or the binaries
won't load because the bss won't be able to map over the initialized data.


The note segment living at a conflicting virtual address also looks
suspicious.   It is probably harmless, as note segments are not
loaded.


Are you by any chance using an experimental linker?


In general every segment in an elf executable needs to be aligned to the
SYSVABI's architecture page size.  I think that is 64k on ARM.  Which it
looks like the linker tried to implement by setting the alignment to
0x10000, and then ignored by putting a byte offset beginning to the
page.

At a minimum someone needs to sort through what the elf specification
says needs to happen is a weird case like this where the start address
of a load segment does not match the alignment of the segment.

To see how common this is I looked at a binary known to be working, and
my /usr/bin/ls binary has one segment that has one of these unaligned
starts as well.

So it must be defined to work somewhere but I need to see the definition
to even have a good opinion on the nonsense of saying an unaligned value
should be aligned.


All I know is that we need to limit our support to what memory mapping
pieces from the elf executable can support.  Which at a minimum requires:
	virt_addr % ELF_MIN_ALIGN == file_offset % ELF_MIN_ALIGN



Eric










> Memory allocated by set_brk():
> Before: start=0x420000 end=0x420000
> After:  start=0x41f000 end=0x420000
>
> The triggering binary looks like this:
>
>     Elf file type is EXEC (Executable file)
>     Entry point 0x400144
>     There are 4 program headers, starting at offset 64
>
>     Program Headers:
>       Type           Offset             VirtAddr           PhysAddr
>                      FileSiz            MemSiz              Flags  Align
>       LOAD           0x0000000000000000 0x0000000000400000 0x0000000000400000
>                      0x0000000000000178 0x0000000000000178  R E    0x10000
>       LOAD           0x000000000000ffe8 0x000000000041ffe8 0x000000000041ffe8
>                      0x0000000000000000 0x0000000000000008  RW     0x10000
>       NOTE           0x0000000000000120 0x0000000000400120 0x0000000000400120
>                      0x0000000000000024 0x0000000000000024  R      0x4
>       GNU_STACK      0x0000000000000000 0x0000000000000000 0x0000000000000000
>                      0x0000000000000000 0x0000000000000000  RW     0x10
>
>      Section to Segment mapping:
>       Segment Sections...
>        00     .note.gnu.build-id .text .eh_frame
>        01     .bss
>        02     .note.gnu.build-id
>        03
>
> Reported-by: Sebastian Ott <sebott@redhat.com>
> Closes: https://lore.kernel.org/lkml/5d49767a-fbdc-fbe7-5fb2-d99ece3168cb@redhat.com/
> Fixes: 1da177e4c3f4 ("Linux-2.6.12-rc2")
> Cc: stable@vger.kernel.org
> Signed-off-by: Thomas Weißschuh <linux@weissschuh.net>
> ---
>
> I'm not really familiar with the ELF loading process, so putting this
> out as RFC.
>
> A example binary compiled with aarch64-linux-gnu-gcc 13.2.0 is available
> at https://test.t-8ch.de/binfmt-bss-repro.bin
> ---
>  fs/binfmt_elf.c | 2 +-
>  1 file changed, 1 insertion(+), 1 deletion(-)
>
> diff --git a/fs/binfmt_elf.c b/fs/binfmt_elf.c
> index 7b3d2d491407..4008a57d388b 100644
> --- a/fs/binfmt_elf.c
> +++ b/fs/binfmt_elf.c
> @@ -112,7 +112,7 @@ static struct linux_binfmt elf_format = {
>  
>  static int set_brk(unsigned long start, unsigned long end, int prot)
>  {
> -	start = ELF_PAGEALIGN(start);
> +	start = ELF_PAGESTART(start);
>  	end = ELF_PAGEALIGN(end);
>  	if (end > start) {
>  		/*
>
> ---
> base-commit: aed8aee11130a954356200afa3f1b8753e8a9482
> change-id: 20230914-bss-alloc-f523fa61718c
>
> Best regards,

^ permalink raw reply	[flat|nested] 20+ messages in thread

* Re: [PATCH RFC] binfmt_elf: fully allocate bss pages
  2023-09-14 19:49 ` Eric W. Biederman
@ 2023-09-14 22:18   ` Thomas Weißschuh
  0 siblings, 0 replies; 20+ messages in thread
From: Thomas Weißschuh @ 2023-09-14 22:18 UTC (permalink / raw)
  To: Eric W. Biederman
  Cc: Alexander Viro, Christian Brauner, Kees Cook, Mark Brown,
	Willy Tarreau, linux-fsdevel, linux-mm, linux-kernel,
	Sebastian Ott, stable

On 2023-09-14 14:49:44-0500, Eric W. Biederman wrote:
> Thomas Weißschuh <linux@weissschuh.net> writes:
> 
> > When allocating the pages for bss the start address needs to be rounded
> > down instead of up.
> > Otherwise the start of the bss segment may be unmapped.
> >
> > The was reported to happen on Aarch64:
> 
> Those program headers you quote look corrupt.

To reproduce:

$ cat test.c
char foo[1];

void __attribute__((weak, noreturn, optimize("Os", "omit-frame-pointer"))) _start(void)
{
	__asm__ volatile (
		"mov x0, 123\n"
		"mov x8, 93\n"       /* NR_exit == 93 */
		"svc #0\n"
	);
	__builtin_unreachable();
}

$ aarch64-linux-gnu-gcc -fno-stack-protector -o nolibc-test -nostdlib -static test.c

Note:
it works in qemu-user, newer versions need the workaround from
https://gitlab.com/qemu-project/qemu/-/issues/1854
The issue in qemu-user seems to be related to the question at hand.

> The address 0x41ffe8 is not 0x10000 aligned.
> 
> I don't think anything in the elf specification allows that.
> 
> The most common way to have bss is for a elf segment to have a larger
> memsize than filesize.  In which case rounding up is the correct way to
> handle things.
> 
> We definitely need to verify the appended bss case works, before
> taking this patch, or we will get random application failures
> because parts of the data segment are being zeroed, or the binaries
> won't load because the bss won't be able to map over the initialized data.

My hope in posting this patch was also for the bots to uncover any
obvious breakage. So far there were no reports.

> The note segment living at a conflicting virtual address also looks
> suspicious.   It is probably harmless, as note segments are not
> loaded.
> 
> 
> Are you by any chance using an experimental linker?

I'm using GNU ld 2.41 as supplied by my distro.
(ArchLinux, aarch64-linux-gnu-binutils 2.41-2)

> In general every segment in an elf executable needs to be aligned to the
> SYSVABI's architecture page size.  I think that is 64k on ARM.  Which it
> looks like the linker tried to implement by setting the alignment to
> 0x10000, and then ignored by putting a byte offset beginning to the
> page.

Looking at Figure A-5 of [0] this seems not to be the case.
It shows p_vaddr=0x8048100 and p_align=0x1000.
(On x86_64 with PAGE_SIZE=0x1000)

> At a minimum someone needs to sort through what the elf specification
> says needs to happen is a weird case like this where the start address
> of a load segment does not match the alignment of the segment.

I'll take a look.

> To see how common this is I looked at a binary known to be working, and
> my /usr/bin/ls binary has one segment that has one of these unaligned
> starts as well.

Same for my /usr/bin/busybox, also the .data and .bss segment.

> So it must be defined to work somewhere but I need to see the definition
> to even have a good opinion on the nonsense of saying an unaligned value
> should be aligned.

Figure 2-1 from [0]:

p_align:

Loadable process segments must have congruent values for p_vaddr and
p_offset, modulo the page size.This member gives the value to which the
segments are aligned in memory and in the file. Values 0 and 1 mean that no
alignment is required. Otherwise, p_align should be a positive, integral power of
2, and p_addr should equal p_offset, modulo p_align.

0x41ffe8 (p_vaddr)  % 0x1000 = 0xfe8
0x00ffe8 (p_offset) % 0x1000 = 0xfe8

0x41ffe8 (p_addr)   % 0x10000 = 0xffe8
0x00ffe8 (p_offset) % 0x10000 = 0xffe8

So this seems to be satisfied.

> All I know is that we need to limit our support to what memory mapping
> pieces from the elf executable can support.  Which at a minimum requires:
> 	virt_addr % ELF_MIN_ALIGN == file_offset % ELF_MIN_ALIGN

Aarch64 can also handle 4k pages so this invariant should be satisfied.
4k pages seems to be the default for the kernel, too.

[0] https://refspecs.linuxfoundation.org/elf/elf.pdf

> > Memory allocated by set_brk():
> > Before: start=0x420000 end=0x420000
> > After:  start=0x41f000 end=0x420000
> >
> > The triggering binary looks like this:
> >
> >     Elf file type is EXEC (Executable file)
> >     Entry point 0x400144
> >     There are 4 program headers, starting at offset 64
> >
> >     Program Headers:
> >       Type           Offset             VirtAddr           PhysAddr
> >                      FileSiz            MemSiz              Flags  Align
> >       LOAD           0x0000000000000000 0x0000000000400000 0x0000000000400000
> >                      0x0000000000000178 0x0000000000000178  R E    0x10000
> >       LOAD           0x000000000000ffe8 0x000000000041ffe8 0x000000000041ffe8
> >                      0x0000000000000000 0x0000000000000008  RW     0x10000
> >       NOTE           0x0000000000000120 0x0000000000400120 0x0000000000400120
> >                      0x0000000000000024 0x0000000000000024  R      0x4
> >       GNU_STACK      0x0000000000000000 0x0000000000000000 0x0000000000000000
> >                      0x0000000000000000 0x0000000000000000  RW     0x10
> >
> >      Section to Segment mapping:
> >       Segment Sections...
> >        00     .note.gnu.build-id .text .eh_frame
> >        01     .bss
> >        02     .note.gnu.build-id
> >        03
> >
> > Reported-by: Sebastian Ott <sebott@redhat.com>
> > Closes: https://lore.kernel.org/lkml/5d49767a-fbdc-fbe7-5fb2-d99ece3168cb@redhat.com/
> > Fixes: 1da177e4c3f4 ("Linux-2.6.12-rc2")
> > Cc: stable@vger.kernel.org
> > Signed-off-by: Thomas Weißschuh <linux@weissschuh.net>
> > ---
> >
> > I'm not really familiar with the ELF loading process, so putting this
> > out as RFC.
> >
> > A example binary compiled with aarch64-linux-gnu-gcc 13.2.0 is available
> > at https://test.t-8ch.de/binfmt-bss-repro.bin
> > ---
> >  fs/binfmt_elf.c | 2 +-
> >  1 file changed, 1 insertion(+), 1 deletion(-)
> >
> > diff --git a/fs/binfmt_elf.c b/fs/binfmt_elf.c
> > index 7b3d2d491407..4008a57d388b 100644
> > --- a/fs/binfmt_elf.c
> > +++ b/fs/binfmt_elf.c
> > @@ -112,7 +112,7 @@ static struct linux_binfmt elf_format = {
> >  
> >  static int set_brk(unsigned long start, unsigned long end, int prot)
> >  {
> > -	start = ELF_PAGEALIGN(start);
> > +	start = ELF_PAGESTART(start);
> >  	end = ELF_PAGEALIGN(end);
> >  	if (end > start) {
> >  		/*
> >
> > ---
> > base-commit: aed8aee11130a954356200afa3f1b8753e8a9482
> > change-id: 20230914-bss-alloc-f523fa61718c
> >
> > Best regards,

^ permalink raw reply	[flat|nested] 20+ messages in thread

* Re: [PATCH RFC] binfmt_elf: fully allocate bss pages
  2023-09-14 15:59 [PATCH RFC] binfmt_elf: fully allocate bss pages Thomas Weißschuh
  2023-09-14 19:49 ` Eric W. Biederman
@ 2023-09-15 19:35 ` Sebastian Ott
  2023-09-15 22:15 ` Pedro Falcato
                   ` (2 subsequent siblings)
  4 siblings, 0 replies; 20+ messages in thread
From: Sebastian Ott @ 2023-09-15 19:35 UTC (permalink / raw)
  To: Thomas Weißschuh
  Cc: Alexander Viro, Christian Brauner, Eric Biederman, Kees Cook,
	Mark Brown, Willy Tarreau, linux-fsdevel, linux-mm, linux-kernel,
	stable

[-- Attachment #1: Type: text/plain, Size: 796 bytes --]

Hej Thomas,

On Thu, 14 Sep 2023, Thomas Weißschuh wrote:
> fs/binfmt_elf.c | 2 +-
> 1 file changed, 1 insertion(+), 1 deletion(-)
>
> diff --git a/fs/binfmt_elf.c b/fs/binfmt_elf.c
> index 7b3d2d491407..4008a57d388b 100644
> --- a/fs/binfmt_elf.c
> +++ b/fs/binfmt_elf.c
> @@ -112,7 +112,7 @@ static struct linux_binfmt elf_format = {
>
> static int set_brk(unsigned long start, unsigned long end, int prot)
> {
> -	start = ELF_PAGEALIGN(start);
> +	start = ELF_PAGESTART(start);
> 	end = ELF_PAGEALIGN(end);
> 	if (end > start) {
> 		/*
>

My arm box failed to boot with that patch applied on top of 6.6-rc1 .
There was nothing suspicious on the serial console it just hung somewhere
in userspace initialization. Sadly there was also nothing in the system
logs. 6.6-rc1 worked fine.

Sebastian

^ permalink raw reply	[flat|nested] 20+ messages in thread

* Re: [PATCH RFC] binfmt_elf: fully allocate bss pages
  2023-09-14 15:59 [PATCH RFC] binfmt_elf: fully allocate bss pages Thomas Weißschuh
  2023-09-14 19:49 ` Eric W. Biederman
  2023-09-15 19:35 ` Sebastian Ott
@ 2023-09-15 22:15 ` Pedro Falcato
  2023-09-15 22:41   ` Thomas Weißschuh
  2023-09-18 14:11 ` kernel test robot
  2023-09-21 10:36 ` Sebastian Ott
  4 siblings, 1 reply; 20+ messages in thread
From: Pedro Falcato @ 2023-09-15 22:15 UTC (permalink / raw)
  To: Thomas Weißschuh
  Cc: Alexander Viro, Christian Brauner, Eric Biederman, Kees Cook,
	Mark Brown, Willy Tarreau, linux-fsdevel, linux-mm, linux-kernel,
	Sebastian Ott, stable

On Fri, Sep 15, 2023 at 4:54 AM Thomas Weißschuh <linux@weissschuh.net> wrote:
>
> When allocating the pages for bss the start address needs to be rounded
> down instead of up.
> Otherwise the start of the bss segment may be unmapped.
>
> The was reported to happen on Aarch64:
>
> Memory allocated by set_brk():
> Before: start=0x420000 end=0x420000
> After:  start=0x41f000 end=0x420000
>
> The triggering binary looks like this:
>
>     Elf file type is EXEC (Executable file)
>     Entry point 0x400144
>     There are 4 program headers, starting at offset 64
>
>     Program Headers:
>       Type           Offset             VirtAddr           PhysAddr
>                      FileSiz            MemSiz              Flags  Align
>       LOAD           0x0000000000000000 0x0000000000400000 0x0000000000400000
>                      0x0000000000000178 0x0000000000000178  R E    0x10000
>       LOAD           0x000000000000ffe8 0x000000000041ffe8 0x000000000041ffe8
>                      0x0000000000000000 0x0000000000000008  RW     0x10000
>       NOTE           0x0000000000000120 0x0000000000400120 0x0000000000400120
>                      0x0000000000000024 0x0000000000000024  R      0x4
>       GNU_STACK      0x0000000000000000 0x0000000000000000 0x0000000000000000
>                      0x0000000000000000 0x0000000000000000  RW     0x10
>
>      Section to Segment mapping:
>       Segment Sections...
>        00     .note.gnu.build-id .text .eh_frame
>        01     .bss
>        02     .note.gnu.build-id
>        03
>
> Reported-by: Sebastian Ott <sebott@redhat.com>
> Closes: https://lore.kernel.org/lkml/5d49767a-fbdc-fbe7-5fb2-d99ece3168cb@redhat.com/
> Fixes: 1da177e4c3f4 ("Linux-2.6.12-rc2")
> Cc: stable@vger.kernel.org
> Signed-off-by: Thomas Weißschuh <linux@weissschuh.net>
> ---
>
> I'm not really familiar with the ELF loading process, so putting this
> out as RFC.
>
> A example binary compiled with aarch64-linux-gnu-gcc 13.2.0 is available
> at https://test.t-8ch.de/binfmt-bss-repro.bin
> ---
>  fs/binfmt_elf.c | 2 +-
>  1 file changed, 1 insertion(+), 1 deletion(-)
>
> diff --git a/fs/binfmt_elf.c b/fs/binfmt_elf.c
> index 7b3d2d491407..4008a57d388b 100644
> --- a/fs/binfmt_elf.c
> +++ b/fs/binfmt_elf.c
> @@ -112,7 +112,7 @@ static struct linux_binfmt elf_format = {
>
>  static int set_brk(unsigned long start, unsigned long end, int prot)
>  {
> -       start = ELF_PAGEALIGN(start);
> +       start = ELF_PAGESTART(start);
>         end = ELF_PAGEALIGN(end);
>         if (end > start) {
>                 /*

I don't see how this change can be correct. set_brk takes the start of
.bss as the start, so doing ELF_PAGESTART(start) will give you what
may very well be another ELF segment. In the common case, you'd map an
anonymous page on top of someone's .data, which will misload the ELF.

The current logic looks OK to me (gosh this code would ideally take a
good refactoring...). I still can't quite tell how padzero() (in the
original report) is -EFAULTing though.

-- 
Pedro

^ permalink raw reply	[flat|nested] 20+ messages in thread

* Re: [PATCH RFC] binfmt_elf: fully allocate bss pages
  2023-09-15 22:15 ` Pedro Falcato
@ 2023-09-15 22:41   ` Thomas Weißschuh
  0 siblings, 0 replies; 20+ messages in thread
From: Thomas Weißschuh @ 2023-09-15 22:41 UTC (permalink / raw)
  To: Pedro Falcato
  Cc: Alexander Viro, Christian Brauner, Eric Biederman, Kees Cook,
	Mark Brown, Willy Tarreau, linux-fsdevel, linux-mm, linux-kernel,
	Sebastian Ott, stable

On 2023-09-15 23:15:05+0100, Pedro Falcato wrote:
> On Fri, Sep 15, 2023 at 4:54 AM Thomas Weißschuh <linux@weissschuh.net> wrote:
> >
> > When allocating the pages for bss the start address needs to be rounded
> > down instead of up.
> > Otherwise the start of the bss segment may be unmapped.
> >
> > The was reported to happen on Aarch64:
> >
> > Memory allocated by set_brk():
> > Before: start=0x420000 end=0x420000
> > After:  start=0x41f000 end=0x420000
> >
> > The triggering binary looks like this:
> >
> >     Elf file type is EXEC (Executable file)
> >     Entry point 0x400144
> >     There are 4 program headers, starting at offset 64
> >
> >     Program Headers:
> >       Type           Offset             VirtAddr           PhysAddr
> >                      FileSiz            MemSiz              Flags  Align
> >       LOAD           0x0000000000000000 0x0000000000400000 0x0000000000400000
> >                      0x0000000000000178 0x0000000000000178  R E    0x10000
> >       LOAD           0x000000000000ffe8 0x000000000041ffe8 0x000000000041ffe8
> >                      0x0000000000000000 0x0000000000000008  RW     0x10000
> >       NOTE           0x0000000000000120 0x0000000000400120 0x0000000000400120
> >                      0x0000000000000024 0x0000000000000024  R      0x4
> >       GNU_STACK      0x0000000000000000 0x0000000000000000 0x0000000000000000
> >                      0x0000000000000000 0x0000000000000000  RW     0x10
> >
> >      Section to Segment mapping:
> >       Segment Sections...
> >        00     .note.gnu.build-id .text .eh_frame
> >        01     .bss
> >        02     .note.gnu.build-id
> >        03
> >
> > Reported-by: Sebastian Ott <sebott@redhat.com>
> > Closes: https://lore.kernel.org/lkml/5d49767a-fbdc-fbe7-5fb2-d99ece3168cb@redhat.com/
> > Fixes: 1da177e4c3f4 ("Linux-2.6.12-rc2")
> > Cc: stable@vger.kernel.org
> > Signed-off-by: Thomas Weißschuh <linux@weissschuh.net>
> > ---
> >
> > I'm not really familiar with the ELF loading process, so putting this
> > out as RFC.
> >
> > A example binary compiled with aarch64-linux-gnu-gcc 13.2.0 is available
> > at https://test.t-8ch.de/binfmt-bss-repro.bin
> > ---
> >  fs/binfmt_elf.c | 2 +-
> >  1 file changed, 1 insertion(+), 1 deletion(-)
> >
> > diff --git a/fs/binfmt_elf.c b/fs/binfmt_elf.c
> > index 7b3d2d491407..4008a57d388b 100644
> > --- a/fs/binfmt_elf.c
> > +++ b/fs/binfmt_elf.c
> > @@ -112,7 +112,7 @@ static struct linux_binfmt elf_format = {
> >
> >  static int set_brk(unsigned long start, unsigned long end, int prot)
> >  {
> > -       start = ELF_PAGEALIGN(start);
> > +       start = ELF_PAGESTART(start);
> >         end = ELF_PAGEALIGN(end);
> >         if (end > start) {
> >                 /*
> 
> I don't see how this change can be correct. set_brk takes the start of
> .bss as the start, so doing ELF_PAGESTART(start) will give you what
> may very well be another ELF segment. In the common case, you'd map an
> anonymous page on top of someone's .data, which will misload the ELF.

That does make sense, and indeed it breaks more complex binaries.

> The current logic looks OK to me (gosh this code would ideally take a
> good refactoring...). I still can't quite tell how padzero() (in the
> original report) is -EFAULTing though.

As a test I replaced the asm clear_user() in padzero() with the generic
memset()-based implementation from include/asm-generic/uaccess.h.
It does provide better diagnostics, see below.

Who should have mapped this partial .bss page if there is no .data?
Maybe the logic needs to be a bit more complex and check if this page
has been already mapped for .data and in that case don't map it again.



[    5.620235] Run /init as init process
[    5.662763] CUSTOM DEBUG ELF_PAGEALIGN(start)=0x420000 ELF_PAGEALIGN(end)=0x420000 ELF_PAGESTART(0x41f000)
[    5.667176] Unable to handle kernel paging request at virtual address 000000000041ffe8
[    5.668062] Mem abort info:
[    5.668429]   ESR = 0x0000000096000045
[    5.669400]   EC = 0x25: DABT (current EL), IL = 32 bits
[    5.670119]   SET = 0, FnV = 0
[    5.670608]   EA = 0, S1PTW = 0
[    5.671172]   FSC = 0x05: level 1 translation fault
[    5.672024] Data abort info:
[    5.673273]   ISV = 0, ISS = 0x00000045, ISS2 = 0x00000000
[    5.674169]   CM = 0, WnR = 1, TnD = 0, TagAccess = 0
[    5.674991]   GCS = 0, Overlay = 0, DirtyBit = 0, Xs = 0
[    5.676871] user pgtable: 4k pages, 39-bit VAs, pgdp=0000000043f20000
[    5.677776] [000000000041ffe8] pgd=0800000043c62003, p4d=0800000043c62003, pud=0800000043c62003, pmd=0000000000000000
[    5.681522] Internal error: Oops: 0000000096000045 [#1] PREEMPT SMP
[    5.682604] Modules linked in:
[    5.683576] CPU: 0 PID: 1 Comm: init Not tainted 6.6.0-rc1+ #241 00a261b9689606c4fc0c90eb29739c5b0eec7b82
[    5.684706] Hardware name: linux,dummy-virt (DT)
[    5.685462] pstate: 00000005 (nzcv daif -PAN -UAO -TCO -DIT -SSBS BTYPE=--)
[    5.686094] pc : __memset+0x50/0x188
[    5.686572] lr : padzero+0x84/0xa0
[    5.686956] sp : ffffffc08003bc70
[    5.687307] x29: ffffffc08003bc70 x28: 0000000000000000 x27: ffffff80026afa00
[    5.688091] x26: 000000000041fff0 x25: 000000000041ffe8 x24: 0000000000400144
[    5.688698] x23: 0000000000000000 x22: 0000000000000000 x21: 0000000000000000
[    5.689275] x20: 0000000000000fe8 x19: 000000000041ffe8 x18: ffffffffffffffff
[    5.689928] x17: ffffffdf46cd2984 x16: ffffffdf46cd2880 x15: 0720072007200720
[    5.690597] x14: 0720072007200720 x13: 0720072007200720 x12: 0000000000000000
[    5.691192] x11: 00000000ffffefff x10: 0000000000000000 x9 : ffffffdf46e1ba28
[    5.691906] x8 : 000000000041ffe8 x7 : 0000000000000000 x6 : 0000000000057fa8
[    5.692496] x5 : 0000000000000fff x4 : 0000000000000008 x3 : 0000000000000000
[    5.693168] x2 : 0000000000000018 x1 : 0000000000000000 x0 : 000000000041ffe8
[    5.693985] Call trace:
[    5.694318]  __memset+0x50/0x188
[    5.694708]  load_elf_binary+0x630/0x15d0
[    5.695132]  bprm_execve+0x2bc/0x7c0
[    5.695505]  kernel_execve+0x144/0x1c8
[    5.695882]  run_init_process+0xf8/0x110
[    5.696264]  kernel_init+0x8c/0x200
[    5.696624]  ret_from_fork+0x10/0x20
[    5.697216] Code: d65f03c0 cb0803e4 f2400c84 54000080 (a9001d07)
[    5.698936] ---[ end trace 0000000000000000 ]---
[    5.701625] Kernel panic - not syncing: Attempted to kill init! exitcode=0x0000000b
[    5.702502] SMP: stopping secondary CPUs
[    5.703608] Kernel Offset: 0x1ec6a00000 from 0xffffffc080000000
[    5.704119] PHYS_OFFSET: 0x40000000
[    5.704491] CPU features: 0x0000000d,00020000,0000420b
[    5.705276] Memory Limit: none

^ permalink raw reply	[flat|nested] 20+ messages in thread

* Re: [PATCH RFC] binfmt_elf: fully allocate bss pages
  2023-09-14 15:59 [PATCH RFC] binfmt_elf: fully allocate bss pages Thomas Weißschuh
                   ` (2 preceding siblings ...)
  2023-09-15 22:15 ` Pedro Falcato
@ 2023-09-18 14:11 ` kernel test robot
  2023-09-21 10:36 ` Sebastian Ott
  4 siblings, 0 replies; 20+ messages in thread
From: kernel test robot @ 2023-09-18 14:11 UTC (permalink / raw)
  To: Thomas Weißschuh
  Cc: oe-lkp, lkp, Sebastian Ott, linux-mm, Alexander Viro,
	Christian Brauner, Eric Biederman, Kees Cook, Mark Brown,
	Willy Tarreau, linux-fsdevel, linux-kernel, stable,
	Thomas Weißschuh, oliver.sang



Hello,

kernel test robot noticed "segfault_at_ip_sp_error" on:

commit: 13bd7a228b281e5cef2f51a236cafaa3400592a5 ("[PATCH RFC] binfmt_elf: fully allocate bss pages")
url: https://github.com/intel-lab-lkp/linux/commits/Thomas-Wei-schuh/binfmt_elf-fully-allocate-bss-pages/20230915-000102
patch link: https://lore.kernel.org/all/20230914-bss-alloc-v1-1-78de67d2c6dd@weissschuh.net/
patch subject: [PATCH RFC] binfmt_elf: fully allocate bss pages

in testcase: boot

compiler: gcc-12
test machine: qemu-system-x86_64 -enable-kvm -cpu SandyBridge -smp 2 -m 16G

(please refer to attached dmesg/kmsg for entire log/backtrace)



If you fix the issue in a separate patch/commit (i.e. not just a new version of
the same patch/commit), kindly add following tags
| Reported-by: kernel test robot <oliver.sang@intel.com>
| Closes: https://lore.kernel.org/oe-lkp/202309181644.1932ad53-oliver.sang@intel.com


[   11.004901][    T1] ### dt-test ### EXPECT_NOT / : WARNING: <<all>>
[   11.005947][    T1] ### dt-test ### EXPECT_NOT / : ------------[ cut here ]------------
[   11.006784][    T1] ### dt-test ### pass of_unittest_lifecycle():3252
[   11.008735][    T1] ### dt-test ### pass of_unittest_lifecycle():3253
[   11.009666][    T1] ### dt-test ### pass of_unittest_check_tree_linkage():271
[   11.010598][    T1] ### dt-test ### pass of_unittest_check_tree_linkage():272
[   11.011531][    T1] ### dt-test ### FAIL of_unittest_overlay_high_level():3542 overlay_base_root not initialized
[   11.012852][    T1] ### dt-test ### end of unittest - 303 passed, 1 failed
[   11.022721][   T39] e1000: eth0 NIC Link is Up 1000 Mbps Full Duplex, Flow Control: RX
[   11.042019][    T1] Sending DHCP requests ., OK
[   12.032757][    T1] IP-Config: Got DHCP answer from 10.0.2.2, my address is 10.0.2.15
[   12.033736][    T1] IP-Config: Complete:
[   12.034229][    T1]      device=eth0, hwaddr=52:54:00:12:34:56, ipaddr=10.0.2.15, mask=255.255.255.0, gw=10.0.2.2
[   12.035554][    T1]      host=vm-meta-36, domain=, nis-domain=(none)
[   12.036331][    T1]      bootserver=10.0.2.2, rootserver=10.0.2.2, rootpath=
[   12.036337][    T1]      nameserver0=10.0.2.3
[   12.038817][    T1] clk: Disabling unused clocks
[   12.041570][    T1] Freeing unused kernel image (initmem) memory: 1036K
[   12.059292][    T1] Write protecting kernel text and read-only data: 10632k
[   12.075444][    T1] Run /init as init process
[   12.075883][    T1]   with arguments:
[   12.076211][    T1]     /init
[   12.076481][    T1]   with environment:
[   12.076818][    T1]     HOME=/
[   12.077095][    T1]     TERM=linux
[   12.077397][    T1]     RESULT_ROOT=/result/boot/1/vm-snb/debian-11.1-i386-20220923.cgz/i386-randconfig-016-20230915/gcc-12/13bd7a228b281e5cef2f51a236cafaa3400592a5/5
[   12.078684][    T1]     BOOT_IMAGE=/pkg/linux/i386-randconfig-016-20230915/gcc-12/13bd7a228b281e5cef2f51a236cafaa3400592a5/vmlinuz-6.6.0-rc1-00073-g13bd7a228b28
[   12.079910][    T1]     branch=linux-review/Thomas-Wei-schuh/binfmt_elf-fully-allocate-bss-pages/20230915-000102
[   12.080775][    T1]     job=/lkp/jobs/scheduled/vm-meta-36/boot-1-debian-11.1-i386-20220923.cgz-i386-randconfig-016-20230915-13bd7a228b28-20230917-97632-11h3y6y-5.yaml
[   12.082051][    T1]     user=lkp
[   12.082345][    T1]     ARCH=i386
[   12.082639][    T1]     kconfig=i386-randconfig-016-20230915
[   12.083177][    T1]     commit=13bd7a228b281e5cef2f51a236cafaa3400592a5
[   12.083743][    T1]     max_uptime=600
[   12.084074][    T1]     LKP_SERVER=internal-lkp-server
[   12.084522][    T1]     selinux=0
[   12.084820][    T1]     softlockup_panic=1
[   12.085181][    T1]     prompt_ramdisk=0
[   12.085551][    T1]     vga=normal
[   12.117728][    T1] [1]: RTC configured in localtime, applying delta of 0 minutes to system time.

Welcome to Debian GNU/Linux 11 (bullseye)!

[   12.189049][   T58] process 58 ((sd-executor)) attempted a POSIX timer syscall while CONFIG_POSIX_TIMERS is not set
[   12.234253][   T63] systemd-getty-g[63]: segfault at 484771 ip 00480047 sp bffb6e4c error 7 in true[480000+1000] likely on CPU 0 (core 0, socket 0)
[ 12.242969][ T63] Code: 00 00 00 b8 82 00 00 00 00 00 00 34 00 20 00 0b 00 28 00 1e 00 1d 00 06 00 00 00 34 00 00 00 34 00 00 00 34 00 00 00 60 01 00 <00> 60 01 00 00 04 00 00 00 04 00 00 00 03 00 00 00 94 01 00 00 94
All code
========
   0:	00 00                	add    %al,(%rax)
   2:	00 b8 82 00 00 00    	add    %bh,0x82(%rax)
   8:	00 00                	add    %al,(%rax)
   a:	00 34 00             	add    %dh,(%rax,%rax,1)
   d:	20 00                	and    %al,(%rax)
   f:	0b 00                	or     (%rax),%eax
  11:	28 00                	sub    %al,(%rax)
  13:	1e                   	(bad)  
  14:	00 1d 00 06 00 00    	add    %bl,0x600(%rip)        # 0x61a
  1a:	00 34 00             	add    %dh,(%rax,%rax,1)
  1d:	00 00                	add    %al,(%rax)
  1f:	34 00                	xor    $0x0,%al
  21:	00 00                	add    %al,(%rax)
  23:	34 00                	xor    $0x0,%al
  25:	00 00                	add    %al,(%rax)
  27:	60                   	(bad)  
  28:	01 00                	add    %eax,(%rax)
  2a:*	00 60 01             	add    %ah,0x1(%rax)		<-- trapping instruction
  2d:	00 00                	add    %al,(%rax)
  2f:	04 00                	add    $0x0,%al
  31:	00 00                	add    %al,(%rax)
  33:	04 00                	add    $0x0,%al
  35:	00 00                	add    %al,(%rax)
  37:	03 00                	add    (%rax),%eax
  39:	00 00                	add    %al,(%rax)
  3b:	94                   	xchg   %eax,%esp
  3c:	01 00                	add    %eax,(%rax)
  3e:	00                   	.byte 0x0
  3f:	94                   	xchg   %eax,%esp

Code starting with the faulting instruction
===========================================
   0:	00 60 01             	add    %ah,0x1(%rax)
   3:	00 00                	add    %al,(%rax)
   5:	04 00                	add    $0x0,%al
   7:	00 00                	add    %al,(%rax)
   9:	04 00                	add    $0x0,%al
   b:	00 00                	add    %al,(%rax)
   d:	03 00                	add    (%rax),%eax
   f:	00 00                	add    %al,(%rax)
  11:	94                   	xchg   %eax,%esp
  12:	01 00                	add    %eax,(%rax)
  14:	00                   	.byte 0x0
  15:	94                   	xchg   %eax,%esp
[   12.256651][   T62] systemd-fstab-g[62]: segfault at 0 ip 004a0004 sp bf81264b error 6 in systemd-fstab-generator[4a0000+2000] likely on CPU 0 (core 0, socket 0)
[ 12.257967][ T62] Code: Unable to access opcode bytes at 0x49ffda.

Code starting with the faulting instruction
===========================================
[   12.266578][   T60] systemd-cryptse[60]: segfault at 0 ip 00453004 sp bfeefa7b error 6 in systemd-cryptsetup-generator[453000+1000] likely on CPU 1 (core 1, socket 0)
[ 12.271885][ T60] Code: Unable to access opcode bytes at 0x452fda.

Code starting with the faulting instruction
===========================================
[   12.276875][   T61] systemd-debug-g[61]: segfault at fffff000 ip 00464004 sp bfd3675b error 7 in systemd-debug-generator[464000+1000] likely on CPU 1 (core 1, socket 0)
[ 12.278229][ T61] Code: Unable to access opcode bytes at 0x463fda.

Code starting with the faulting instruction
===========================================


The kernel config and materials to reproduce are available at:
https://download.01.org/0day-ci/archive/20230918/202309181644.1932ad53-oliver.sang@intel.com



-- 
0-DAY CI Kernel Test Service
https://github.com/intel/lkp-tests/wiki


^ permalink raw reply	[flat|nested] 20+ messages in thread

* Re: [PATCH RFC] binfmt_elf: fully allocate bss pages
  2023-09-14 15:59 [PATCH RFC] binfmt_elf: fully allocate bss pages Thomas Weißschuh
                   ` (3 preceding siblings ...)
  2023-09-18 14:11 ` kernel test robot
@ 2023-09-21 10:36 ` Sebastian Ott
  2023-09-25  0:50   ` Eric W. Biederman
  4 siblings, 1 reply; 20+ messages in thread
From: Sebastian Ott @ 2023-09-21 10:36 UTC (permalink / raw)
  To: Thomas Weißschuh
  Cc: Alexander Viro, Christian Brauner, Eric Biederman, Kees Cook,
	Mark Brown, Willy Tarreau, linux-fsdevel, linux-mm, linux-kernel,
	stable

Hej,

since we figured that the proposed patch is not going to work I've spent a
couple more hours looking at this (some static binaries on arm64 segfault
during load [0]). The segfault happens because of a failed clear_user()
call in load_elf_binary(). The address we try to write zeros to is mapped with
correct permissions.

After some experiments I've noticed that writing to anonymous mappings work
fine and all the error cases happend on file backed VMAs. Debugging showed that
in elf_map() we call vm_mmap() with a file offset of 15 pages - for a binary
that's less than 1KiB in size.

Looking at the ELF headers again that 15 pages offset originates from the offset
of the 2nd segment - so, I guess the loader did as instructed and that binary is
just too nasty?

Program Headers:
   Type           Offset             VirtAddr           PhysAddr
                  FileSiz            MemSiz              Flags  Align
   LOAD           0x0000000000000000 0x0000000000400000 0x0000000000400000
                  0x0000000000000178 0x0000000000000178  R E    0x10000
   LOAD           0x000000000000ffe8 0x000000000041ffe8 0x000000000041ffe8
                  0x0000000000000000 0x0000000000000008  RW     0x10000
   NOTE           0x0000000000000120 0x0000000000400120 0x0000000000400120
                  0x0000000000000024 0x0000000000000024  R      0x4
   GNU_STACK      0x0000000000000000 0x0000000000000000 0x0000000000000000
                  0x0000000000000000 0x0000000000000000  RW     0x10

As an additional test I've added a bunch of zeros at the end of that binary
so that the offset is within that file and it did load just fine.

On the other hand there is this section header:
   [ 4] .bss              NOBITS           000000000041ffe8  0000ffe8
        0000000000000008  0000000000000000  WA       0     0     1

"sh_offset
This member's value gives the byte offset from the beginning of the file to
the first byte in the section. One section type, SHT_NOBITS described
below, occupies no space in the file, and its sh_offset member locates
the conceptual placement in the file.
"

So, still not sure what to do here..

Sebastian

[0] https://lore.kernel.org/lkml/5d49767a-fbdc-fbe7-5fb2-d99ece3168cb@redhat.com/


^ permalink raw reply	[flat|nested] 20+ messages in thread

* Re: [PATCH RFC] binfmt_elf: fully allocate bss pages
  2023-09-21 10:36 ` Sebastian Ott
@ 2023-09-25  0:50   ` Eric W. Biederman
  2023-09-25  9:20     ` Sebastian Ott
  0 siblings, 1 reply; 20+ messages in thread
From: Eric W. Biederman @ 2023-09-25  0:50 UTC (permalink / raw)
  To: Sebastian Ott
  Cc: Thomas Weißschuh, Alexander Viro, Christian Brauner,
	Kees Cook, Mark Brown, Willy Tarreau, linux-fsdevel, linux-mm,
	linux-kernel, stable

Sebastian Ott <sebott@redhat.com> writes:

> Hej,
>
> since we figured that the proposed patch is not going to work I've spent a
> couple more hours looking at this (some static binaries on arm64 segfault
> during load [0]). The segfault happens because of a failed clear_user()
> call in load_elf_binary(). The address we try to write zeros to is mapped with
> correct permissions.
>
> After some experiments I've noticed that writing to anonymous mappings work
> fine and all the error cases happend on file backed VMAs. Debugging showed that
> in elf_map() we call vm_mmap() with a file offset of 15 pages - for a binary
> that's less than 1KiB in size.
>
> Looking at the ELF headers again that 15 pages offset originates from the offset
> of the 2nd segment - so, I guess the loader did as instructed and that binary is
> just too nasty?
>
> Program Headers:
>   Type           Offset             VirtAddr           PhysAddr
>                  FileSiz            MemSiz              Flags  Align
>   LOAD           0x0000000000000000 0x0000000000400000 0x0000000000400000
>                  0x0000000000000178 0x0000000000000178  R E    0x10000
>   LOAD           0x000000000000ffe8 0x000000000041ffe8 0x000000000041ffe8
>                  0x0000000000000000 0x0000000000000008  RW     0x10000
>   NOTE           0x0000000000000120 0x0000000000400120 0x0000000000400120
>                  0x0000000000000024 0x0000000000000024  R      0x4
>   GNU_STACK      0x0000000000000000 0x0000000000000000 0x0000000000000000
>                  0x0000000000000000 0x0000000000000000  RW     0x10
>
> As an additional test I've added a bunch of zeros at the end of that binary
> so that the offset is within that file and it did load just fine.
>
> On the other hand there is this section header:
>   [ 4] .bss              NOBITS           000000000041ffe8  0000ffe8
>        0000000000000008  0000000000000000  WA       0     0     1
>
> "sh_offset
> This member's value gives the byte offset from the beginning of the file to
> the first byte in the section. One section type, SHT_NOBITS described
> below, occupies no space in the file, and its sh_offset member locates
> the conceptual placement in the file.
> "
>
> So, still not sure what to do here..
>
> Sebastian
>
> [0] https://lore.kernel.org/lkml/5d49767a-fbdc-fbe7-5fb2-d99ece3168cb@redhat.com/

I think that .bss section that is being generated is atrocious.

At the same time I looked at what the linux elf loader is trying to do,
and the elf loader's handling of program segments with memsz > filesz
has serious remnants a.out of programs allocating memory with the brk
syscall.

Lots of the structure looks like it started with the assumption that
there would only be a single program header with memsz > filesz the way
and that was the .bss.   The way things were in the a.out days and
handling of other cases has been debugged in later.

So I have modified elf_map to always return successfully when there is
a zero filesz in the program header for an elf segment.

Then I have factored out a function clear_tail that ensures the zero
padding for an entire elf segment is present.

Please test this and see if it causes your test case to work.

Please also dig into gcc or whichever code generates that horrendous
.bss section and see if that can be fixed so the code can work on older
kernels.  A section that only contains .bss has no business not being
properly aligned.  Even if the data in that section doesn't start at the
beginning of the page, there is no reason to feed nasty data to other
programs.  It just increases the odds of complications for no good
reason.  At a minimum that is going to be needed to run that code on
older kernels.

Eric


diff --git a/fs/binfmt_elf.c b/fs/binfmt_elf.c
index 7b3d2d491407..f6608df75df6 100644
--- a/fs/binfmt_elf.c
+++ b/fs/binfmt_elf.c
@@ -110,43 +110,66 @@ static struct linux_binfmt elf_format = {
 
 #define BAD_ADDR(x) (unlikely((unsigned long)(x) >= TASK_SIZE))
 
-static int set_brk(unsigned long start, unsigned long end, int prot)
-{
-	start = ELF_PAGEALIGN(start);
-	end = ELF_PAGEALIGN(end);
-	if (end > start) {
-		/*
-		 * Map the last of the bss segment.
-		 * If the header is requesting these pages to be
-		 * executable, honour that (ppc32 needs this).
-		 */
-		int error = vm_brk_flags(start, end - start,
-				prot & PROT_EXEC ? VM_EXEC : 0);
-		if (error)
-			return error;
-	}
-	current->mm->start_brk = current->mm->brk = end;
-	return 0;
-}
-
 /* We need to explicitly zero any fractional pages
    after the data section (i.e. bss).  This would
    contain the junk from the file that should not
    be in memory
  */
-static int padzero(unsigned long elf_bss)
+static int padzero(unsigned long elf_bss, unsigned long elf_brk)
 {
 	unsigned long nbyte;
 
 	nbyte = ELF_PAGEOFFSET(elf_bss);
 	if (nbyte) {
 		nbyte = ELF_MIN_ALIGN - nbyte;
+		if (nbyte > elf_brk - elf_bss)
+			nbyte = elf_brk - elf_bss;
 		if (clear_user((void __user *) elf_bss, nbyte))
 			return -EFAULT;
 	}
 	return 0;
 }
 
+static int clear_tail(struct elf_phdr *phdr, unsigned long load_bias, int prot)
+{
+	unsigned long start, end;
+
+	/* Is there a tail to clear? */
+	if (phdr->p_filesz >= phdr->p_memsz)
+		return 0;
+
+	/* Where does the tail start? */
+	if (phdr->p_filesz)
+		start = load_bias + phdr->p_vaddr + phdr->p_filesz;
+	else
+		start = ELF_PAGESTART(load_bias + phdr->p_vaddr);
+
+	/* Where does the tail end? */
+	end = load_bias + phdr->p_vaddr + phdr->p_memsz;
+
+	/*
+	 * This bss-zeroing can fail if the ELF
+	 * file specifies odd protections. So
+	 * we don't check the return value
+	 */
+	padzero(start, end);
+
+	start = ELF_PAGEALIGN(start);
+	end = ELF_PAGEALIGN(end);
+	if (end > start) {
+		/*
+		 * Map the last of the bss segment.
+		 * If the header is requesting these pages to be
+		 * executable, honour that (ppc32 needs this).
+		 */
+		int error = vm_brk_flags(start, end - start,
+				prot & PROT_EXEC ? VM_EXEC : 0);
+		if (error)
+			return error;
+	}
+	return 0;
+}
+
 /* Let's use some macros to make this stack manipulation a little clearer */
 #ifdef CONFIG_STACK_GROWSUP
 #define STACK_ADD(sp, items) ((elf_addr_t __user *)(sp) + (items))
@@ -379,7 +402,7 @@ static unsigned long elf_map(struct file *filep, unsigned long addr,
 
 	/* mmap() will return -EINVAL if given a zero size, but a
 	 * segment with zero filesize is perfectly valid */
-	if (!size)
+	if (!eppnt->p_filesz)
 		return addr;
 
 	/*
@@ -596,8 +619,6 @@ static unsigned long load_elf_interp(struct elfhdr *interp_elf_ex,
 	struct elf_phdr *eppnt;
 	unsigned long load_addr = 0;
 	int load_addr_set = 0;
-	unsigned long last_bss = 0, elf_bss = 0;
-	int bss_prot = 0;
 	unsigned long error = ~0UL;
 	unsigned long total_size;
 	int i;
@@ -661,50 +682,13 @@ static unsigned long load_elf_interp(struct elfhdr *interp_elf_ex,
 				goto out;
 			}
 
-			/*
-			 * Find the end of the file mapping for this phdr, and
-			 * keep track of the largest address we see for this.
-			 */
-			k = load_addr + eppnt->p_vaddr + eppnt->p_filesz;
-			if (k > elf_bss)
-				elf_bss = k;
-
-			/*
-			 * Do the same thing for the memory mapping - between
-			 * elf_bss and last_bss is the bss section.
-			 */
-			k = load_addr + eppnt->p_vaddr + eppnt->p_memsz;
-			if (k > last_bss) {
-				last_bss = k;
-				bss_prot = elf_prot;
-			}
+			/* Map anonymous pages and clear the tail if needed */
+			error = clear_tail(eppnt, load_addr, elf_prot);
+			if (error)
+				goto out;
 		}
 	}
 
-	/*
-	 * Now fill out the bss section: first pad the last page from
-	 * the file up to the page boundary, and zero it from elf_bss
-	 * up to the end of the page.
-	 */
-	if (padzero(elf_bss)) {
-		error = -EFAULT;
-		goto out;
-	}
-	/*
-	 * Next, align both the file and mem bss up to the page size,
-	 * since this is where elf_bss was just zeroed up to, and where
-	 * last_bss will end after the vm_brk_flags() below.
-	 */
-	elf_bss = ELF_PAGEALIGN(elf_bss);
-	last_bss = ELF_PAGEALIGN(last_bss);
-	/* Finally, if there is still more bss to allocate, do it. */
-	if (last_bss > elf_bss) {
-		error = vm_brk_flags(elf_bss, last_bss - elf_bss,
-				bss_prot & PROT_EXEC ? VM_EXEC : 0);
-		if (error)
-			goto out;
-	}
-
 	error = load_addr;
 out:
 	return error;
@@ -828,8 +812,7 @@ static int load_elf_binary(struct linux_binprm *bprm)
 	unsigned long error;
 	struct elf_phdr *elf_ppnt, *elf_phdata, *interp_elf_phdata = NULL;
 	struct elf_phdr *elf_property_phdata = NULL;
-	unsigned long elf_bss, elf_brk;
-	int bss_prot = 0;
+	unsigned long elf_brk;
 	int retval, i;
 	unsigned long elf_entry;
 	unsigned long e_entry;
@@ -1020,7 +1003,6 @@ static int load_elf_binary(struct linux_binprm *bprm)
 	if (retval < 0)
 		goto out_free_dentry;
 
-	elf_bss = 0;
 	elf_brk = 0;
 
 	start_code = ~0UL;
@@ -1040,32 +1022,6 @@ static int load_elf_binary(struct linux_binprm *bprm)
 		if (elf_ppnt->p_type != PT_LOAD)
 			continue;
 
-		if (unlikely (elf_brk > elf_bss)) {
-			unsigned long nbyte;
-
-			/* There was a PT_LOAD segment with p_memsz > p_filesz
-			   before this one. Map anonymous pages, if needed,
-			   and clear the area.  */
-			retval = set_brk(elf_bss + load_bias,
-					 elf_brk + load_bias,
-					 bss_prot);
-			if (retval)
-				goto out_free_dentry;
-			nbyte = ELF_PAGEOFFSET(elf_bss);
-			if (nbyte) {
-				nbyte = ELF_MIN_ALIGN - nbyte;
-				if (nbyte > elf_brk - elf_bss)
-					nbyte = elf_brk - elf_bss;
-				if (clear_user((void __user *)elf_bss +
-							load_bias, nbyte)) {
-					/*
-					 * This bss-zeroing can fail if the ELF
-					 * file specifies odd protections. So
-					 * we don't check the return value
-					 */
-				}
-			}
-		}
 
 		elf_prot = make_prot(elf_ppnt->p_flags, &arch_state,
 				     !!interpreter, false);
@@ -1208,42 +1164,31 @@ static int load_elf_binary(struct linux_binprm *bprm)
 			goto out_free_dentry;
 		}
 
-		k = elf_ppnt->p_vaddr + elf_ppnt->p_filesz;
+		/* Map anonymous pages and clear the tail if needed */
+		retval = clear_tail(elf_ppnt, load_bias, elf_prot);
+		if (retval)
+			goto out_free_dentry;
 
-		if (k > elf_bss)
-			elf_bss = k;
+		k = elf_ppnt->p_vaddr + elf_ppnt->p_filesz;
 		if ((elf_ppnt->p_flags & PF_X) && end_code < k)
 			end_code = k;
 		if (end_data < k)
 			end_data = k;
 		k = elf_ppnt->p_vaddr + elf_ppnt->p_memsz;
 		if (k > elf_brk) {
-			bss_prot = elf_prot;
 			elf_brk = k;
 		}
 	}
 
 	e_entry = elf_ex->e_entry + load_bias;
 	phdr_addr += load_bias;
-	elf_bss += load_bias;
 	elf_brk += load_bias;
 	start_code += load_bias;
 	end_code += load_bias;
 	start_data += load_bias;
 	end_data += load_bias;
 
-	/* Calling set_brk effectively mmaps the pages that we need
-	 * for the bss and break sections.  We must do this before
-	 * mapping in the interpreter, to make sure it doesn't wind
-	 * up getting placed where the bss needs to go.
-	 */
-	retval = set_brk(elf_bss, elf_brk, bss_prot);
-	if (retval)
-		goto out_free_dentry;
-	if (likely(elf_bss != elf_brk) && unlikely(padzero(elf_bss))) {
-		retval = -EFAULT; /* Nobody gets to see this, but.. */
-		goto out_free_dentry;
-	}
+	current->mm->start_brk = current->mm->brk = ELF_PAGEALIGN(elf_brk);
 
 	if (interpreter) {
 		elf_entry = load_elf_interp(interp_elf_ex,
@@ -1369,7 +1314,6 @@ static int load_elf_library(struct file *file)
 {
 	struct elf_phdr *elf_phdata;
 	struct elf_phdr *eppnt;
-	unsigned long elf_bss, bss, len;
 	int retval, error, i, j;
 	struct elfhdr elf_ex;
 
@@ -1425,19 +1369,9 @@ static int load_elf_library(struct file *file)
 	if (error != ELF_PAGESTART(eppnt->p_vaddr))
 		goto out_free_ph;
 
-	elf_bss = eppnt->p_vaddr + eppnt->p_filesz;
-	if (padzero(elf_bss)) {
-		error = -EFAULT;
+	error = clear_tail(eppnt, 0, 0);
+	if (error)
 		goto out_free_ph;
-	}
-
-	len = ELF_PAGEALIGN(eppnt->p_filesz + eppnt->p_vaddr);
-	bss = ELF_PAGEALIGN(eppnt->p_memsz + eppnt->p_vaddr);
-	if (bss > len) {
-		error = vm_brk(len, bss - len);
-		if (error)
-			goto out_free_ph;
-	}
 	error = 0;
 
 out_free_ph:


^ permalink raw reply related	[flat|nested] 20+ messages in thread

* Re: [PATCH RFC] binfmt_elf: fully allocate bss pages
  2023-09-25  0:50   ` Eric W. Biederman
@ 2023-09-25  9:20     ` Sebastian Ott
  2023-09-25  9:50       ` Eric W. Biederman
  2023-09-25 12:59       ` [PATCH] binfmt_elf: Support segments with 0 filesz and misaligned starts Eric W. Biederman
  0 siblings, 2 replies; 20+ messages in thread
From: Sebastian Ott @ 2023-09-25  9:20 UTC (permalink / raw)
  To: Eric W. Biederman
  Cc: Thomas Weißschuh, Alexander Viro, Christian Brauner,
	Kees Cook, Mark Brown, Willy Tarreau, linux-fsdevel, linux-mm,
	linux-kernel, stable

On Sun, 24 Sep 2023, Eric W. Biederman wrote:
> Sebastian Ott <sebott@redhat.com> writes:
>
>> Hej,
>>
>> since we figured that the proposed patch is not going to work I've spent a
>> couple more hours looking at this (some static binaries on arm64 segfault
>> during load [0]). The segfault happens because of a failed clear_user()
>> call in load_elf_binary(). The address we try to write zeros to is mapped with
>> correct permissions.
>>
>> After some experiments I've noticed that writing to anonymous mappings work
>> fine and all the error cases happend on file backed VMAs. Debugging showed that
>> in elf_map() we call vm_mmap() with a file offset of 15 pages - for a binary
>> that's less than 1KiB in size.
>>
>> Looking at the ELF headers again that 15 pages offset originates from the offset
>> of the 2nd segment - so, I guess the loader did as instructed and that binary is
>> just too nasty?
>>
>> Program Headers:
>>   Type           Offset             VirtAddr           PhysAddr
>>                  FileSiz            MemSiz              Flags  Align
>>   LOAD           0x0000000000000000 0x0000000000400000 0x0000000000400000
>>                  0x0000000000000178 0x0000000000000178  R E    0x10000
>>   LOAD           0x000000000000ffe8 0x000000000041ffe8 0x000000000041ffe8
>>                  0x0000000000000000 0x0000000000000008  RW     0x10000
>>   NOTE           0x0000000000000120 0x0000000000400120 0x0000000000400120
>>                  0x0000000000000024 0x0000000000000024  R      0x4
>>   GNU_STACK      0x0000000000000000 0x0000000000000000 0x0000000000000000
>>                  0x0000000000000000 0x0000000000000000  RW     0x10
>>
>> As an additional test I've added a bunch of zeros at the end of that binary
>> so that the offset is within that file and it did load just fine.
>>
>> On the other hand there is this section header:
>>   [ 4] .bss              NOBITS           000000000041ffe8  0000ffe8
>>        0000000000000008  0000000000000000  WA       0     0     1
>>
>> "sh_offset
>> This member's value gives the byte offset from the beginning of the file to
>> the first byte in the section. One section type, SHT_NOBITS described
>> below, occupies no space in the file, and its sh_offset member locates
>> the conceptual placement in the file.
>> "
>>
>> So, still not sure what to do here..
>>
>> Sebastian
>>
>> [0] https://lore.kernel.org/lkml/5d49767a-fbdc-fbe7-5fb2-d99ece3168cb@redhat.com/
>
> I think that .bss section that is being generated is atrocious.
>
> At the same time I looked at what the linux elf loader is trying to do,
> and the elf loader's handling of program segments with memsz > filesz
> has serious remnants a.out of programs allocating memory with the brk
> syscall.
>
> Lots of the structure looks like it started with the assumption that
> there would only be a single program header with memsz > filesz the way
> and that was the .bss.   The way things were in the a.out days and
> handling of other cases has been debugged in later.
>
> So I have modified elf_map to always return successfully when there is
> a zero filesz in the program header for an elf segment.
>
> Then I have factored out a function clear_tail that ensures the zero
> padding for an entire elf segment is present.
>
> Please test this and see if it causes your test case to work.

Sadly, that causes issues for other programs:

[   44.164596] Run /init as init process
[   44.168763] Kernel panic - not syncing: Attempted to kill init! exitcode=0x0000000b
[   44.176409] CPU: 32 PID: 1 Comm: init Not tainted 6.6.0-rc2+ #89
[   44.182404] Hardware name: GIGABYTE R181-T92-00/MT91-FS4-00, BIOS F34 08/13/2020
[   44.189786] Call trace:
[   44.192220]  dump_backtrace+0xa4/0x130
[   44.195961]  show_stack+0x20/0x38
[   44.199264]  dump_stack_lvl+0x48/0x60
[   44.202917]  dump_stack+0x18/0x28
[   44.206219]  panic+0x2e0/0x350
[   44.209264]  do_exit+0x370/0x390
[   44.212481]  do_group_exit+0x3c/0xa0
[   44.216044]  get_signal+0x800/0x808
[   44.219521]  do_signal+0xfc/0x200
[   44.222824]  do_notify_resume+0xc8/0x418
[   44.226734]  el0_da+0x114/0x120
[   44.229866]  el0t_64_sync_handler+0xb8/0x130
[   44.234124]  el0t_64_sync+0x194/0x198
[   44.237776] SMP: stopping secondary CPUs
[   44.241740] Kernel Offset: disabled
[   44.245215] CPU features: 0x03000000,14028142,10004203
[   44.250342] Memory Limit: none
[   44.253383] ---[ end Kernel panic - not syncing: Attempted to kill init! exitcode=0x0000000b ]---


^ permalink raw reply	[flat|nested] 20+ messages in thread

* Re: [PATCH RFC] binfmt_elf: fully allocate bss pages
  2023-09-25  9:20     ` Sebastian Ott
@ 2023-09-25  9:50       ` Eric W. Biederman
  2023-09-25 12:59       ` [PATCH] binfmt_elf: Support segments with 0 filesz and misaligned starts Eric W. Biederman
  1 sibling, 0 replies; 20+ messages in thread
From: Eric W. Biederman @ 2023-09-25  9:50 UTC (permalink / raw)
  To: Sebastian Ott
  Cc: Thomas Weißschuh, Alexander Viro, Christian Brauner,
	Kees Cook, Mark Brown, Willy Tarreau, linux-fsdevel, linux-mm,
	linux-kernel, stable

Sebastian Ott <sebott@redhat.com> writes:

> On Sun, 24 Sep 2023, Eric W. Biederman wrote:
>> Sebastian Ott <sebott@redhat.com> writes:
>>
>>> Hej,
>>>
>>> since we figured that the proposed patch is not going to work I've spent a
>>> couple more hours looking at this (some static binaries on arm64 segfault
>>> during load [0]). The segfault happens because of a failed clear_user()
>>> call in load_elf_binary(). The address we try to write zeros to is mapped with
>>> correct permissions.
>>>
>>> After some experiments I've noticed that writing to anonymous mappings work
>>> fine and all the error cases happend on file backed VMAs. Debugging showed that
>>> in elf_map() we call vm_mmap() with a file offset of 15 pages - for a binary
>>> that's less than 1KiB in size.
>>>
>>> Looking at the ELF headers again that 15 pages offset originates from the offset
>>> of the 2nd segment - so, I guess the loader did as instructed and that binary is
>>> just too nasty?
>>>
>>> Program Headers:
>>>   Type           Offset             VirtAddr           PhysAddr
>>>                  FileSiz            MemSiz              Flags  Align
>>>   LOAD           0x0000000000000000 0x0000000000400000 0x0000000000400000
>>>                  0x0000000000000178 0x0000000000000178  R E    0x10000
>>>   LOAD           0x000000000000ffe8 0x000000000041ffe8 0x000000000041ffe8
>>>                  0x0000000000000000 0x0000000000000008  RW     0x10000
>>>   NOTE           0x0000000000000120 0x0000000000400120 0x0000000000400120
>>>                  0x0000000000000024 0x0000000000000024  R      0x4
>>>   GNU_STACK      0x0000000000000000 0x0000000000000000 0x0000000000000000
>>>                  0x0000000000000000 0x0000000000000000  RW     0x10
>>>
>>> As an additional test I've added a bunch of zeros at the end of that binary
>>> so that the offset is within that file and it did load just fine.
>>>
>>> On the other hand there is this section header:
>>>   [ 4] .bss              NOBITS           000000000041ffe8  0000ffe8
>>>        0000000000000008  0000000000000000  WA       0     0     1
>>>
>>> "sh_offset
>>> This member's value gives the byte offset from the beginning of the file to
>>> the first byte in the section. One section type, SHT_NOBITS described
>>> below, occupies no space in the file, and its sh_offset member locates
>>> the conceptual placement in the file.
>>> "
>>>
>>> So, still not sure what to do here..
>>>
>>> Sebastian
>>>
>>> [0] https://lore.kernel.org/lkml/5d49767a-fbdc-fbe7-5fb2-d99ece3168cb@redhat.com/
>>
>> I think that .bss section that is being generated is atrocious.
>>
>> At the same time I looked at what the linux elf loader is trying to do,
>> and the elf loader's handling of program segments with memsz > filesz
>> has serious remnants a.out of programs allocating memory with the brk
>> syscall.
>>
>> Lots of the structure looks like it started with the assumption that
>> there would only be a single program header with memsz > filesz the way
>> and that was the .bss.   The way things were in the a.out days and
>> handling of other cases has been debugged in later.
>>
>> So I have modified elf_map to always return successfully when there is
>> a zero filesz in the program header for an elf segment.
>>
>> Then I have factored out a function clear_tail that ensures the zero
>> padding for an entire elf segment is present.
>>
>> Please test this and see if it causes your test case to work.
>
> Sadly, that causes issues for other programs:

Bah.  Too much cleanup at once.

I will respin.

> [   44.164596] Run /init as init process
> [   44.168763] Kernel panic - not syncing: Attempted to kill init! exitcode=0x0000000b
> [   44.176409] CPU: 32 PID: 1 Comm: init Not tainted 6.6.0-rc2+ #89
> [   44.182404] Hardware name: GIGABYTE R181-T92-00/MT91-FS4-00, BIOS F34 08/13/2020
> [   44.189786] Call trace:
> [   44.192220]  dump_backtrace+0xa4/0x130
> [   44.195961]  show_stack+0x20/0x38
> [   44.199264]  dump_stack_lvl+0x48/0x60
> [   44.202917]  dump_stack+0x18/0x28
> [   44.206219]  panic+0x2e0/0x350
> [   44.209264]  do_exit+0x370/0x390
> [   44.212481]  do_group_exit+0x3c/0xa0
> [   44.216044]  get_signal+0x800/0x808
> [   44.219521]  do_signal+0xfc/0x200
> [   44.222824]  do_notify_resume+0xc8/0x418
> [   44.226734]  el0_da+0x114/0x120
> [   44.229866]  el0t_64_sync_handler+0xb8/0x130
> [   44.234124]  el0t_64_sync+0x194/0x198
> [   44.237776] SMP: stopping secondary CPUs
> [   44.241740] Kernel Offset: disabled
> [   44.245215] CPU features: 0x03000000,14028142,10004203
> [   44.250342] Memory Limit: none
> [   44.253383] ---[ end Kernel panic - not syncing: Attempted to kill init! exitcode=0x0000000b ]---

Eric

^ permalink raw reply	[flat|nested] 20+ messages in thread

* [PATCH] binfmt_elf: Support segments with 0 filesz and misaligned starts
  2023-09-25  9:20     ` Sebastian Ott
  2023-09-25  9:50       ` Eric W. Biederman
@ 2023-09-25 12:59       ` Eric W. Biederman
  2023-09-25 13:00         ` kernel test robot
                           ` (2 more replies)
  1 sibling, 3 replies; 20+ messages in thread
From: Eric W. Biederman @ 2023-09-25 12:59 UTC (permalink / raw)
  To: Sebastian Ott
  Cc: Thomas Weißschuh, Alexander Viro, Christian Brauner,
	Kees Cook, Mark Brown, Willy Tarreau, linux-fsdevel, linux-mm,
	linux-kernel, stable


Implement a helper elf_load that wraps elf_map and performs all
of the necessary work to ensure that when "memsz > filesz"
the bytes described by "memsz > filesz" are zeroed.

Link: https://lkml.kernel.org/r/20230914-bss-alloc-v1-1-78de67d2c6dd@weissschuh.net
Reported-by: Sebastian Ott <sebott@redhat.com>
Reported-by: Thomas Weißschuh <linux@weissschuh.net>
Signed-off-by: "Eric W. Biederman" <ebiederm@xmission.com>
---
 fs/binfmt_elf.c | 111 +++++++++++++++++++++---------------------------
 1 file changed, 48 insertions(+), 63 deletions(-)

Can you please test this one?

With this patch I can boot a machine, and I like the structure much
better.  Overall this seems a more reviewable and safer patch although
it is almost as aggressive in the cleanups.

diff --git a/fs/binfmt_elf.c b/fs/binfmt_elf.c
index 7b3d2d491407..8bea9d974361 100644
--- a/fs/binfmt_elf.c
+++ b/fs/binfmt_elf.c
@@ -110,25 +110,6 @@ static struct linux_binfmt elf_format = {
 
 #define BAD_ADDR(x) (unlikely((unsigned long)(x) >= TASK_SIZE))
 
-static int set_brk(unsigned long start, unsigned long end, int prot)
-{
-	start = ELF_PAGEALIGN(start);
-	end = ELF_PAGEALIGN(end);
-	if (end > start) {
-		/*
-		 * Map the last of the bss segment.
-		 * If the header is requesting these pages to be
-		 * executable, honour that (ppc32 needs this).
-		 */
-		int error = vm_brk_flags(start, end - start,
-				prot & PROT_EXEC ? VM_EXEC : 0);
-		if (error)
-			return error;
-	}
-	current->mm->start_brk = current->mm->brk = end;
-	return 0;
-}
-
 /* We need to explicitly zero any fractional pages
    after the data section (i.e. bss).  This would
    contain the junk from the file that should not
@@ -406,6 +387,51 @@ static unsigned long elf_map(struct file *filep, unsigned long addr,
 	return(map_addr);
 }
 
+static unsigned long elf_load(struct file *filep, unsigned long addr,
+		const struct elf_phdr *eppnt, int prot, int type,
+		unsigned long total_size)
+{
+	unsigned long zero_start, zero_end;
+	unsigned long map_addr;
+
+	if (eppnt->p_filesz) {
+		map_addr = elf_map(filep, addr, eppnt, prot, type, total_size);
+		if (BAD_ADDR(map_addr))
+			return map_addr;
+		if (eppnt->p_memsz > eppnt->p_filesz) {
+			zero_start = map_addr + ELF_PAGEOFFSET(eppnt->p_vaddr) +
+				eppnt->p_filesz;
+			zero_end = map_addr + ELF_PAGEOFFSET(eppnt->p_vaddr) +
+				eppnt->p_memsz;
+
+			/* Zero the end of the last mapped page */
+			padzero(zero_start);
+		}
+	} else {
+		zero_start = ELF_PAGESTART(addr);
+		zero_end = zero_start + ELF_PAGEOFFSET(eppnt->p_vaddr) +
+			eppnt->p_memsz;
+	}
+	if (eppnt->p_memsz > eppnt->p_filesz) {
+		/*
+		 * Map the last of the segment.
+		 * If the header is requesting these pages to be
+		 * executable, honour that (ppc32 needs this).
+		 */
+		int error;
+
+		zero_start = ELF_PAGEALIGN(zero_start);
+		zero_end = ELF_PAGEALIGN(zero_end);
+
+		error = vm_brk_flags(zero_start, zero_end - zero_start,
+				     prot & PROT_EXEC ? VM_EXEC : 0);
+		if (error)
+			map_addr = error;
+	}
+	return map_addr;
+}
+
+
 static unsigned long total_mapping_size(const struct elf_phdr *phdr, int nr)
 {
 	elf_addr_t min_addr = -1;
@@ -829,7 +855,6 @@ static int load_elf_binary(struct linux_binprm *bprm)
 	struct elf_phdr *elf_ppnt, *elf_phdata, *interp_elf_phdata = NULL;
 	struct elf_phdr *elf_property_phdata = NULL;
 	unsigned long elf_bss, elf_brk;
-	int bss_prot = 0;
 	int retval, i;
 	unsigned long elf_entry;
 	unsigned long e_entry;
@@ -1040,33 +1065,6 @@ static int load_elf_binary(struct linux_binprm *bprm)
 		if (elf_ppnt->p_type != PT_LOAD)
 			continue;
 
-		if (unlikely (elf_brk > elf_bss)) {
-			unsigned long nbyte;
-
-			/* There was a PT_LOAD segment with p_memsz > p_filesz
-			   before this one. Map anonymous pages, if needed,
-			   and clear the area.  */
-			retval = set_brk(elf_bss + load_bias,
-					 elf_brk + load_bias,
-					 bss_prot);
-			if (retval)
-				goto out_free_dentry;
-			nbyte = ELF_PAGEOFFSET(elf_bss);
-			if (nbyte) {
-				nbyte = ELF_MIN_ALIGN - nbyte;
-				if (nbyte > elf_brk - elf_bss)
-					nbyte = elf_brk - elf_bss;
-				if (clear_user((void __user *)elf_bss +
-							load_bias, nbyte)) {
-					/*
-					 * This bss-zeroing can fail if the ELF
-					 * file specifies odd protections. So
-					 * we don't check the return value
-					 */
-				}
-			}
-		}
-
 		elf_prot = make_prot(elf_ppnt->p_flags, &arch_state,
 				     !!interpreter, false);
 
@@ -1162,7 +1160,7 @@ static int load_elf_binary(struct linux_binprm *bprm)
 			}
 		}
 
-		error = elf_map(bprm->file, load_bias + vaddr, elf_ppnt,
+		error = elf_load(bprm->file, load_bias + vaddr, elf_ppnt,
 				elf_prot, elf_flags, total_size);
 		if (BAD_ADDR(error)) {
 			retval = IS_ERR_VALUE(error) ?
@@ -1217,10 +1215,8 @@ static int load_elf_binary(struct linux_binprm *bprm)
 		if (end_data < k)
 			end_data = k;
 		k = elf_ppnt->p_vaddr + elf_ppnt->p_memsz;
-		if (k > elf_brk) {
-			bss_prot = elf_prot;
+		if (k > elf_brk)
 			elf_brk = k;
-		}
 	}
 
 	e_entry = elf_ex->e_entry + load_bias;
@@ -1232,18 +1228,7 @@ static int load_elf_binary(struct linux_binprm *bprm)
 	start_data += load_bias;
 	end_data += load_bias;
 
-	/* Calling set_brk effectively mmaps the pages that we need
-	 * for the bss and break sections.  We must do this before
-	 * mapping in the interpreter, to make sure it doesn't wind
-	 * up getting placed where the bss needs to go.
-	 */
-	retval = set_brk(elf_bss, elf_brk, bss_prot);
-	if (retval)
-		goto out_free_dentry;
-	if (likely(elf_bss != elf_brk) && unlikely(padzero(elf_bss))) {
-		retval = -EFAULT; /* Nobody gets to see this, but.. */
-		goto out_free_dentry;
-	}
+	current->mm->start_brk = current->mm->brk = ELF_PAGEALIGN(elf_brk);
 
 	if (interpreter) {
 		elf_entry = load_elf_interp(interp_elf_ex,
-- 
2.41.0


^ permalink raw reply related	[flat|nested] 20+ messages in thread

* Re: [PATCH] binfmt_elf: Support segments with 0 filesz and misaligned starts
  2023-09-25 12:59       ` [PATCH] binfmt_elf: Support segments with 0 filesz and misaligned starts Eric W. Biederman
@ 2023-09-25 13:00         ` kernel test robot
  2023-09-25 13:07           ` Eric W. Biederman
  2023-09-25 15:27         ` Sebastian Ott
  2023-09-26 13:49         ` Dan Carpenter
  2 siblings, 1 reply; 20+ messages in thread
From: kernel test robot @ 2023-09-25 13:00 UTC (permalink / raw)
  To: Eric W. Biederman; +Cc: stable, oe-kbuild-all

Hi,

Thanks for your patch.

FYI: kernel test robot notices the stable kernel rule is not satisfied.

The check is based on https://www.kernel.org/doc/html/latest/process/stable-kernel-rules.html/#option-1

Rule: add the tag "Cc: stable@vger.kernel.org" in the sign-off area to have the patch automatically included in the stable tree.
Subject: [PATCH] binfmt_elf: Support segments with 0 filesz and misaligned starts
Link: https://lore.kernel.org/stable/87jzsemmsd.fsf_-_%40email.froward.int.ebiederm.org

-- 
0-DAY CI Kernel Test Service
https://github.com/intel/lkp-tests/wiki




^ permalink raw reply	[flat|nested] 20+ messages in thread

* Re: [PATCH] binfmt_elf: Support segments with 0 filesz and misaligned starts
  2023-09-25 13:00         ` kernel test robot
@ 2023-09-25 13:07           ` Eric W. Biederman
  0 siblings, 0 replies; 20+ messages in thread
From: Eric W. Biederman @ 2023-09-25 13:07 UTC (permalink / raw)
  To: kernel test robot; +Cc: stable, oe-kbuild-all

kernel test robot <lkp@intel.com> writes:

> Hi,
>
> Thanks for your patch.
>
> FYI: kernel test robot notices the stable kernel rule is not satisfied.
>
> The check is based on https://www.kernel.org/doc/html/latest/process/stable-kernel-rules.html/#option-1
>
> Rule: add the tag "Cc: stable@vger.kernel.org" in the sign-off area to have the patch automatically included in the stable tree.
> Subject: [PATCH] binfmt_elf: Support segments with 0 filesz and misaligned starts
> Link: https://lore.kernel.org/stable/87jzsemmsd.fsf_-_%40email.froward.int.ebiederm.org

My apologies kernel test robot I had realized stable was cc'd on this
conversation.

This patch as is, is most definitely not stable fodder.  Maybe after
being tested.  It is definitely not a regression fix, and I am not
certain even after being tested it could be considered a fix rather than
just a new feature. AKA "Support very weird compiler generated
exectuables".

Eric

^ permalink raw reply	[flat|nested] 20+ messages in thread

* Re: [PATCH] binfmt_elf: Support segments with 0 filesz and misaligned starts
  2023-09-25 12:59       ` [PATCH] binfmt_elf: Support segments with 0 filesz and misaligned starts Eric W. Biederman
  2023-09-25 13:00         ` kernel test robot
@ 2023-09-25 15:27         ` Sebastian Ott
  2023-09-25 17:06           ` Kees Cook
  2023-09-26 13:49         ` Dan Carpenter
  2 siblings, 1 reply; 20+ messages in thread
From: Sebastian Ott @ 2023-09-25 15:27 UTC (permalink / raw)
  To: Eric W. Biederman
  Cc: Thomas Weißschuh, Alexander Viro, Christian Brauner,
	Kees Cook, Mark Brown, Willy Tarreau, linux-fsdevel, linux-mm,
	linux-kernel, stable

[-- Attachment #1: Type: text/plain, Size: 881 bytes --]

On Mon, 25 Sep 2023, Eric W. Biederman wrote:
>
> Implement a helper elf_load that wraps elf_map and performs all
> of the necessary work to ensure that when "memsz > filesz"
> the bytes described by "memsz > filesz" are zeroed.
>
> Link: https://lkml.kernel.org/r/20230914-bss-alloc-v1-1-78de67d2c6dd@weissschuh.net
> Reported-by: Sebastian Ott <sebott@redhat.com>
> Reported-by: Thomas Weißschuh <linux@weissschuh.net>
> Signed-off-by: "Eric W. Biederman" <ebiederm@xmission.com>
> ---
> fs/binfmt_elf.c | 111 +++++++++++++++++++++---------------------------
> 1 file changed, 48 insertions(+), 63 deletions(-)
>
> Can you please test this one?
>

That one did the trick! The arm box booted successful, ran the binaries
that were used for the repo of this issue, and ran the nolibc compiled
binaries from kselftests that initially triggered the loader issues.

Thanks,
Sebastian

^ permalink raw reply	[flat|nested] 20+ messages in thread

* Re: [PATCH] binfmt_elf: Support segments with 0 filesz and misaligned starts
  2023-09-25 15:27         ` Sebastian Ott
@ 2023-09-25 17:06           ` Kees Cook
  2023-09-26  3:27             ` Eric W. Biederman
  0 siblings, 1 reply; 20+ messages in thread
From: Kees Cook @ 2023-09-25 17:06 UTC (permalink / raw)
  To: Sebastian Ott, Eric W. Biederman, Pedro Falcato
  Cc: Thomas Weißschuh, Alexander Viro, Christian Brauner,
	Mark Brown, Willy Tarreau, sam, Rich Felker, linux-fsdevel,
	linux-mm, linux-kernel, stable

On Mon, Sep 25, 2023 at 05:27:12PM +0200, Sebastian Ott wrote:
> On Mon, 25 Sep 2023, Eric W. Biederman wrote:
> > 
> > Implement a helper elf_load that wraps elf_map and performs all
> > of the necessary work to ensure that when "memsz > filesz"
> > the bytes described by "memsz > filesz" are zeroed.
> > 
> > Link: https://lkml.kernel.org/r/20230914-bss-alloc-v1-1-78de67d2c6dd@weissschuh.net
> > Reported-by: Sebastian Ott <sebott@redhat.com>
> > Reported-by: Thomas Weißschuh <linux@weissschuh.net>
> > Signed-off-by: "Eric W. Biederman" <ebiederm@xmission.com>
> > ---
> > fs/binfmt_elf.c | 111 +++++++++++++++++++++---------------------------
> > 1 file changed, 48 insertions(+), 63 deletions(-)
> > 
> > Can you please test this one?

Eric thanks for doing this refactoring! This does look similar to the
earlier attempt:
https://lore.kernel.org/lkml/20221106021657.1145519-1-pedro.falcato@gmail.com/
and it's a bit easier to review.

> That one did the trick! The arm box booted successful, ran the binaries
> that were used for the repo of this issue, and ran the nolibc compiled
> binaries from kselftests that initially triggered the loader issues.

Thanks for testing! I need to dig out the other "weird" binaries (like
the mentioned ppc32 case) and see if I can get those tested too.

Pedro, are you able to test ppc64le musl libc with this patch too?

-Kees

-- 
Kees Cook

^ permalink raw reply	[flat|nested] 20+ messages in thread

* Re: [PATCH] binfmt_elf: Support segments with 0 filesz and misaligned starts
  2023-09-25 17:06           ` Kees Cook
@ 2023-09-26  3:27             ` Eric W. Biederman
  2023-09-27  2:34               ` Kees Cook
  0 siblings, 1 reply; 20+ messages in thread
From: Eric W. Biederman @ 2023-09-26  3:27 UTC (permalink / raw)
  To: Kees Cook
  Cc: Sebastian Ott, Pedro Falcato, Thomas Weißschuh,
	Alexander Viro, Christian Brauner, Mark Brown, Willy Tarreau,
	sam, Rich Felker, linux-fsdevel, linux-mm, linux-kernel, stable

Kees Cook <keescook@chromium.org> writes:

> On Mon, Sep 25, 2023 at 05:27:12PM +0200, Sebastian Ott wrote:
>> On Mon, 25 Sep 2023, Eric W. Biederman wrote:
>> > 
>> > Implement a helper elf_load that wraps elf_map and performs all
>> > of the necessary work to ensure that when "memsz > filesz"
>> > the bytes described by "memsz > filesz" are zeroed.
>> > 
>> > Link: https://lkml.kernel.org/r/20230914-bss-alloc-v1-1-78de67d2c6dd@weissschuh.net
>> > Reported-by: Sebastian Ott <sebott@redhat.com>
>> > Reported-by: Thomas Weißschuh <linux@weissschuh.net>
>> > Signed-off-by: "Eric W. Biederman" <ebiederm@xmission.com>
>> > ---
>> > fs/binfmt_elf.c | 111 +++++++++++++++++++++---------------------------
>> > 1 file changed, 48 insertions(+), 63 deletions(-)
>> > 
>> > Can you please test this one?
>
> Eric thanks for doing this refactoring! This does look similar to the
> earlier attempt:
> https://lore.kernel.org/lkml/20221106021657.1145519-1-pedro.falcato@gmail.com/
> and it's a bit easier to review.

I need to context switch away for a while so Kees if you will
I will let you handle the rest of this one.


A couple of thoughts running through my head for anyone whose ambitious
might include cleaning up binfmt_elf.c

The elf_bss variable in load_elf_binary can be removed.

Work for a follow on patch is using my new elf_load in load_elf_interp
and possibly in load_elf_library.  (More code size reduction).

An outstanding issue is if the first segment has filesz 0, and has a
randomized locations.  But that is the same as today.

There is a whole question does it make sense for the elf loader
to have it's own helper vm_brk_flags in mm/mmap.c or would it
make more sense for binfmt_elf to do what binfmt_elf_fdpic does and
have everything to go through vm_mmap.

I think replacing vm_brk_flags with vm_mmap would allow fixing the
theoretical issue of filesz 0 and randomizing locations.



In this change I replaced an open coded padzero that did not clear
all of the way to the end of the page, with padzero that does.

I also stopped checking the return of padzero as there is at least
one known case where testing for failure is the wrong thing to do.
It looks like binfmt_elf_fdpic may have the proper set of tests
for when error handling can be safely completed.

I found a couple of commits in the old history
https://git.kernel.org/pub/scm/linux/kernel/git/tglx/history.git,
that look very interesting in understanding this code.

commit 39b56d902bf3 ("[PATCH] binfmt_elf: clearing bss may fail")
commit c6e2227e4a3e ("[SPARC64]: Missing user access return value checks in fs/binfmt_elf.c and fs/compat.c")
commit 5bf3be033f50 ("v2.4.10.1 -> v2.4.10.2")

Looking at commit 39b56d902bf3 ("[PATCH] binfmt_elf: clearing bss may fail"):
>  commit 39b56d902bf35241e7cba6cc30b828ed937175ad
>  Author: Pavel Machek <pavel@ucw.cz>
>  Date:   Wed Feb 9 22:40:30 2005 -0800
> 
>     [PATCH] binfmt_elf: clearing bss may fail
>     
>     So we discover that Borland's Kylix application builder emits weird elf
>     files which describe a non-writeable bss segment.
>     
>     So remove the clear_user() check at the place where we zero out the bss.  I
>     don't _think_ there are any security implications here (plus we've never
>     checked that clear_user() return value, so whoops if it is a problem).
>     
>     Signed-off-by: Pavel Machek <pavel@suse.cz>
>     Signed-off-by: Andrew Morton <akpm@osdl.org>
>     Signed-off-by: Linus Torvalds <torvalds@osdl.org>

It seems pretty clear that binfmt_elf_fdpic with skipping clear_user
for non-writable segments and otherwise calling clear_user (aka padzero)
and checking it's return code is the right thing to do.

I just skipped the error checking as that avoids breaking things.

It looks like Borland's Kylix died in 2005 so it might be safe to
just consider read-only segments with memsz > filesz an error.


Looking at commit 5bf3be033f50 ("v2.4.10.1 -> v2.4.10.2") the
binfmt_elf.c bits confirm my guess that the weird structure is because
before that point binfmt_elf.c assumed there would be only a single
segment with memsz > filesz.  Which is why the code was structured so
weirdly.

Looking a little farther it looks like the binfmt_elf.c was introduced
in Linux v1.0, with essentially the same structure in load_elf_binary as
it has now.  Prior to that Linux hard coded support for a.out binaries
in execve.  So if someone wants to add a Fixes tag it should be
"Fixes: v1.0"

Which finally explains to me why the code is so odd.  For the most part
the code has only received maintenance for the last 30 years or so.
Strictly 29 years, but 30 has a better ring to it.

Anyway those are my rambling thoughts that might help someone.
For now I will be happy if we can get my elf_load helper tested
to everyone's satisfaction and merged.

Eric

^ permalink raw reply	[flat|nested] 20+ messages in thread

* Re: [PATCH] binfmt_elf: Support segments with 0 filesz and misaligned starts
  2023-09-25 12:59       ` [PATCH] binfmt_elf: Support segments with 0 filesz and misaligned starts Eric W. Biederman
  2023-09-25 13:00         ` kernel test robot
  2023-09-25 15:27         ` Sebastian Ott
@ 2023-09-26 13:49         ` Dan Carpenter
  2023-09-26 14:42           ` [PATCH v2] " Eric W. Biederman
  2 siblings, 1 reply; 20+ messages in thread
From: Dan Carpenter @ 2023-09-26 13:49 UTC (permalink / raw)
  To: oe-kbuild, Eric W. Biederman, Sebastian Ott
  Cc: lkp, oe-kbuild-all, Thomas Weißschuh, Alexander Viro,
	Christian Brauner, Kees Cook, Mark Brown, Willy Tarreau,
	linux-fsdevel, linux-mm, linux-kernel, stable

Hi Eric,

kernel test robot noticed the following build warnings:

https://git-scm.com/docs/git-format-patch#_base_tree_information]

url:    https://github.com/intel-lab-lkp/linux/commits/Eric-W-Biederman/binfmt_elf-Support-segments-with-0-filesz-and-misaligned-starts/20230925-210022
base:   https://git.kernel.org/pub/scm/linux/kernel/git/kees/linux.git for-next/execve
patch link:    https://lore.kernel.org/r/87jzsemmsd.fsf_-_%40email.froward.int.ebiederm.org
patch subject: [PATCH] binfmt_elf: Support segments with 0 filesz and misaligned starts
config: i386-randconfig-141-20230926 (https://download.01.org/0day-ci/archive/20230926/202309261925.QvgPAYL7-lkp@intel.com/config)
compiler: gcc-12 (Debian 12.2.0-14) 12.2.0
reproduce: (https://download.01.org/0day-ci/archive/20230926/202309261925.QvgPAYL7-lkp@intel.com/reproduce)

If you fix the issue in a separate patch/commit (i.e. not just a new version of
the same patch/commit), kindly add following tags
| Reported-by: kernel test robot <lkp@intel.com>
| Reported-by: Dan Carpenter <dan.carpenter@linaro.org>
| Closes: https://lore.kernel.org/r/202309261925.QvgPAYL7-lkp@intel.com/

smatch warnings:
fs/binfmt_elf.c:431 elf_load() error: uninitialized symbol 'map_addr'.

vim +/map_addr +431 fs/binfmt_elf.c

a6409120b31666 Eric W. Biederman 2023-09-25  390  static unsigned long elf_load(struct file *filep, unsigned long addr,
a6409120b31666 Eric W. Biederman 2023-09-25  391  		const struct elf_phdr *eppnt, int prot, int type,
a6409120b31666 Eric W. Biederman 2023-09-25  392  		unsigned long total_size)
a6409120b31666 Eric W. Biederman 2023-09-25  393  {
a6409120b31666 Eric W. Biederman 2023-09-25  394  	unsigned long zero_start, zero_end;
a6409120b31666 Eric W. Biederman 2023-09-25  395  	unsigned long map_addr;
a6409120b31666 Eric W. Biederman 2023-09-25  396  
a6409120b31666 Eric W. Biederman 2023-09-25  397  	if (eppnt->p_filesz) {
a6409120b31666 Eric W. Biederman 2023-09-25  398  		map_addr = elf_map(filep, addr, eppnt, prot, type, total_size);
a6409120b31666 Eric W. Biederman 2023-09-25  399  		if (BAD_ADDR(map_addr))
a6409120b31666 Eric W. Biederman 2023-09-25  400  			return map_addr;
a6409120b31666 Eric W. Biederman 2023-09-25  401  		if (eppnt->p_memsz > eppnt->p_filesz) {
a6409120b31666 Eric W. Biederman 2023-09-25  402  			zero_start = map_addr + ELF_PAGEOFFSET(eppnt->p_vaddr) +
a6409120b31666 Eric W. Biederman 2023-09-25  403  				eppnt->p_filesz;
a6409120b31666 Eric W. Biederman 2023-09-25  404  			zero_end = map_addr + ELF_PAGEOFFSET(eppnt->p_vaddr) +
a6409120b31666 Eric W. Biederman 2023-09-25  405  				eppnt->p_memsz;
a6409120b31666 Eric W. Biederman 2023-09-25  406  
a6409120b31666 Eric W. Biederman 2023-09-25  407  			/* Zero the end of the last mapped page */
a6409120b31666 Eric W. Biederman 2023-09-25  408  			padzero(zero_start);
a6409120b31666 Eric W. Biederman 2023-09-25  409  		}
a6409120b31666 Eric W. Biederman 2023-09-25  410  	} else {
a6409120b31666 Eric W. Biederman 2023-09-25  411  		zero_start = ELF_PAGESTART(addr);
a6409120b31666 Eric W. Biederman 2023-09-25  412  		zero_end = zero_start + ELF_PAGEOFFSET(eppnt->p_vaddr) +
a6409120b31666 Eric W. Biederman 2023-09-25  413  			eppnt->p_memsz;

For this else path, map_addr is only set if there is an error.

a6409120b31666 Eric W. Biederman 2023-09-25  414  	}
a6409120b31666 Eric W. Biederman 2023-09-25  415  	if (eppnt->p_memsz > eppnt->p_filesz) {
a6409120b31666 Eric W. Biederman 2023-09-25  416  		/*
a6409120b31666 Eric W. Biederman 2023-09-25  417  		 * Map the last of the segment.
a6409120b31666 Eric W. Biederman 2023-09-25  418  		 * If the header is requesting these pages to be
a6409120b31666 Eric W. Biederman 2023-09-25  419  		 * executable, honour that (ppc32 needs this).
a6409120b31666 Eric W. Biederman 2023-09-25  420  		 */
a6409120b31666 Eric W. Biederman 2023-09-25  421  		int error;
a6409120b31666 Eric W. Biederman 2023-09-25  422  
a6409120b31666 Eric W. Biederman 2023-09-25  423  		zero_start = ELF_PAGEALIGN(zero_start);
a6409120b31666 Eric W. Biederman 2023-09-25  424  		zero_end = ELF_PAGEALIGN(zero_end);
a6409120b31666 Eric W. Biederman 2023-09-25  425  
a6409120b31666 Eric W. Biederman 2023-09-25  426  		error = vm_brk_flags(zero_start, zero_end - zero_start,
a6409120b31666 Eric W. Biederman 2023-09-25  427  				     prot & PROT_EXEC ? VM_EXEC : 0);
a6409120b31666 Eric W. Biederman 2023-09-25  428  		if (error)
a6409120b31666 Eric W. Biederman 2023-09-25  429  			map_addr = error;
a6409120b31666 Eric W. Biederman 2023-09-25  430  	}
a6409120b31666 Eric W. Biederman 2023-09-25 @431  	return map_addr;
a6409120b31666 Eric W. Biederman 2023-09-25  432  }

-- 
0-DAY CI Kernel Test Service
https://github.com/intel/lkp-tests/wiki


^ permalink raw reply	[flat|nested] 20+ messages in thread

* [PATCH v2] binfmt_elf: Support segments with 0 filesz and misaligned starts
  2023-09-26 13:49         ` Dan Carpenter
@ 2023-09-26 14:42           ` Eric W. Biederman
  0 siblings, 0 replies; 20+ messages in thread
From: Eric W. Biederman @ 2023-09-26 14:42 UTC (permalink / raw)
  To: Dan Carpenter
  Cc: oe-kbuild, Sebastian Ott, lkp, oe-kbuild-all,
	Thomas Weißschuh, Alexander Viro, Christian Brauner,
	Kees Cook, Mark Brown, Willy Tarreau, linux-fsdevel, linux-mm,
	linux-kernel


Implement a helper elf_load that wraps elf_map and performs all
of the necessary work to ensure that when "memsz > filesz"
the bytes described by "memsz > filesz" are zeroed.

Link: https://lkml.kernel.org/r/20230914-bss-alloc-v1-1-78de67d2c6dd@weissschuh.net
Reported-by: Sebastian Ott <sebott@redhat.com>
Reported-by: Thomas Weißschuh <linux@weissschuh.net>
Signed-off-by: "Eric W. Biederman" <ebiederm@xmission.com>
---
 fs/binfmt_elf.c | 111 +++++++++++++++++++++---------------------------
 1 file changed, 48 insertions(+), 63 deletions(-)

Dan Carpenter <dan.carpenter@linaro.org> writes:
> smatch warnings:
> fs/binfmt_elf.c:431 elf_load() error: uninitialized symbol 'map_addr'.

That was embarrassing.  I have added an initialization of map_error in
the p_filesz == 0 case.

I am a bit surprised this worked without map_error being set in testing,
but it looks like any "value < TASK_SIZE" would work.  So I guess odds
are about 50/50 that any random stack value would not cause
load_elf_binary to fail.

Thanks for pointing that out.

Eric



diff --git a/fs/binfmt_elf.c b/fs/binfmt_elf.c
index 7b3d2d491407..2a615f476e44 100644
--- a/fs/binfmt_elf.c
+++ b/fs/binfmt_elf.c
@@ -110,25 +110,6 @@ static struct linux_binfmt elf_format = {
 
 #define BAD_ADDR(x) (unlikely((unsigned long)(x) >= TASK_SIZE))
 
-static int set_brk(unsigned long start, unsigned long end, int prot)
-{
-	start = ELF_PAGEALIGN(start);
-	end = ELF_PAGEALIGN(end);
-	if (end > start) {
-		/*
-		 * Map the last of the bss segment.
-		 * If the header is requesting these pages to be
-		 * executable, honour that (ppc32 needs this).
-		 */
-		int error = vm_brk_flags(start, end - start,
-				prot & PROT_EXEC ? VM_EXEC : 0);
-		if (error)
-			return error;
-	}
-	current->mm->start_brk = current->mm->brk = end;
-	return 0;
-}
-
 /* We need to explicitly zero any fractional pages
    after the data section (i.e. bss).  This would
    contain the junk from the file that should not
@@ -406,6 +387,51 @@ static unsigned long elf_map(struct file *filep, unsigned long addr,
 	return(map_addr);
 }
 
+static unsigned long elf_load(struct file *filep, unsigned long addr,
+		const struct elf_phdr *eppnt, int prot, int type,
+		unsigned long total_size)
+{
+	unsigned long zero_start, zero_end;
+	unsigned long map_addr;
+
+	if (eppnt->p_filesz) {
+		map_addr = elf_map(filep, addr, eppnt, prot, type, total_size);
+		if (BAD_ADDR(map_addr))
+			return map_addr;
+		if (eppnt->p_memsz > eppnt->p_filesz) {
+			zero_start = map_addr + ELF_PAGEOFFSET(eppnt->p_vaddr) +
+				eppnt->p_filesz;
+			zero_end = map_addr + ELF_PAGEOFFSET(eppnt->p_vaddr) +
+				eppnt->p_memsz;
+
+			/* Zero the end of the last mapped page */
+			padzero(zero_start);
+		}
+	} else {
+		map_addr = zero_start = ELF_PAGESTART(addr);
+		zero_end = zero_start + ELF_PAGEOFFSET(eppnt->p_vaddr) +
+			eppnt->p_memsz;
+	}
+	if (eppnt->p_memsz > eppnt->p_filesz) {
+		/*
+		 * Map the last of the segment.
+		 * If the header is requesting these pages to be
+		 * executable, honour that (ppc32 needs this).
+		 */
+		int error;
+
+		zero_start = ELF_PAGEALIGN(zero_start);
+		zero_end = ELF_PAGEALIGN(zero_end);
+
+		error = vm_brk_flags(zero_start, zero_end - zero_start,
+				     prot & PROT_EXEC ? VM_EXEC : 0);
+		if (error)
+			map_addr = error;
+	}
+	return map_addr;
+}
+
+
 static unsigned long total_mapping_size(const struct elf_phdr *phdr, int nr)
 {
 	elf_addr_t min_addr = -1;
@@ -829,7 +855,6 @@ static int load_elf_binary(struct linux_binprm *bprm)
 	struct elf_phdr *elf_ppnt, *elf_phdata, *interp_elf_phdata = NULL;
 	struct elf_phdr *elf_property_phdata = NULL;
 	unsigned long elf_bss, elf_brk;
-	int bss_prot = 0;
 	int retval, i;
 	unsigned long elf_entry;
 	unsigned long e_entry;
@@ -1040,33 +1065,6 @@ static int load_elf_binary(struct linux_binprm *bprm)
 		if (elf_ppnt->p_type != PT_LOAD)
 			continue;
 
-		if (unlikely (elf_brk > elf_bss)) {
-			unsigned long nbyte;
-
-			/* There was a PT_LOAD segment with p_memsz > p_filesz
-			   before this one. Map anonymous pages, if needed,
-			   and clear the area.  */
-			retval = set_brk(elf_bss + load_bias,
-					 elf_brk + load_bias,
-					 bss_prot);
-			if (retval)
-				goto out_free_dentry;
-			nbyte = ELF_PAGEOFFSET(elf_bss);
-			if (nbyte) {
-				nbyte = ELF_MIN_ALIGN - nbyte;
-				if (nbyte > elf_brk - elf_bss)
-					nbyte = elf_brk - elf_bss;
-				if (clear_user((void __user *)elf_bss +
-							load_bias, nbyte)) {
-					/*
-					 * This bss-zeroing can fail if the ELF
-					 * file specifies odd protections. So
-					 * we don't check the return value
-					 */
-				}
-			}
-		}
-
 		elf_prot = make_prot(elf_ppnt->p_flags, &arch_state,
 				     !!interpreter, false);
 
@@ -1162,7 +1160,7 @@ static int load_elf_binary(struct linux_binprm *bprm)
 			}
 		}
 
-		error = elf_map(bprm->file, load_bias + vaddr, elf_ppnt,
+		error = elf_load(bprm->file, load_bias + vaddr, elf_ppnt,
 				elf_prot, elf_flags, total_size);
 		if (BAD_ADDR(error)) {
 			retval = IS_ERR_VALUE(error) ?
@@ -1217,10 +1215,8 @@ static int load_elf_binary(struct linux_binprm *bprm)
 		if (end_data < k)
 			end_data = k;
 		k = elf_ppnt->p_vaddr + elf_ppnt->p_memsz;
-		if (k > elf_brk) {
-			bss_prot = elf_prot;
+		if (k > elf_brk)
 			elf_brk = k;
-		}
 	}
 
 	e_entry = elf_ex->e_entry + load_bias;
@@ -1232,18 +1228,7 @@ static int load_elf_binary(struct linux_binprm *bprm)
 	start_data += load_bias;
 	end_data += load_bias;
 
-	/* Calling set_brk effectively mmaps the pages that we need
-	 * for the bss and break sections.  We must do this before
-	 * mapping in the interpreter, to make sure it doesn't wind
-	 * up getting placed where the bss needs to go.
-	 */
-	retval = set_brk(elf_bss, elf_brk, bss_prot);
-	if (retval)
-		goto out_free_dentry;
-	if (likely(elf_bss != elf_brk) && unlikely(padzero(elf_bss))) {
-		retval = -EFAULT; /* Nobody gets to see this, but.. */
-		goto out_free_dentry;
-	}
+	current->mm->start_brk = current->mm->brk = ELF_PAGEALIGN(elf_brk);
 
 	if (interpreter) {
 		elf_entry = load_elf_interp(interp_elf_ex,
-- 
2.41.0


^ permalink raw reply related	[flat|nested] 20+ messages in thread

* Re: [PATCH] binfmt_elf: Support segments with 0 filesz and misaligned starts
  2023-09-26  3:27             ` Eric W. Biederman
@ 2023-09-27  2:34               ` Kees Cook
  0 siblings, 0 replies; 20+ messages in thread
From: Kees Cook @ 2023-09-27  2:34 UTC (permalink / raw)
  To: Eric W. Biederman
  Cc: Sebastian Ott, Pedro Falcato, Thomas Weißschuh,
	Alexander Viro, Christian Brauner, Mark Brown, Willy Tarreau,
	sam, Rich Felker, linux-fsdevel, linux-mm, linux-kernel, stable

On Mon, Sep 25, 2023 at 10:27:02PM -0500, Eric W. Biederman wrote:
> Kees Cook <keescook@chromium.org> writes:
> 
> > On Mon, Sep 25, 2023 at 05:27:12PM +0200, Sebastian Ott wrote:
> >> On Mon, 25 Sep 2023, Eric W. Biederman wrote:
> >> > 
> >> > Implement a helper elf_load that wraps elf_map and performs all
> >> > of the necessary work to ensure that when "memsz > filesz"
> >> > the bytes described by "memsz > filesz" are zeroed.
> >> > 
> >> > Link: https://lkml.kernel.org/r/20230914-bss-alloc-v1-1-78de67d2c6dd@weissschuh.net
> >> > Reported-by: Sebastian Ott <sebott@redhat.com>
> >> > Reported-by: Thomas Weißschuh <linux@weissschuh.net>
> >> > Signed-off-by: "Eric W. Biederman" <ebiederm@xmission.com>
> >> > ---
> >> > fs/binfmt_elf.c | 111 +++++++++++++++++++++---------------------------
> >> > 1 file changed, 48 insertions(+), 63 deletions(-)
> >> > 
> >> > Can you please test this one?
> >
> > Eric thanks for doing this refactoring! This does look similar to the
> > earlier attempt:
> > https://lore.kernel.org/lkml/20221106021657.1145519-1-pedro.falcato@gmail.com/
> > and it's a bit easier to review.
> 
> I need to context switch away for a while so Kees if you will
> I will let you handle the rest of this one.
> 
> 
> A couple of thoughts running through my head for anyone whose ambitious
> might include cleaning up binfmt_elf.c
> 
> The elf_bss variable in load_elf_binary can be removed.
> 
> Work for a follow on patch is using my new elf_load in load_elf_interp
> and possibly in load_elf_library.  (More code size reduction).
> 
> An outstanding issue is if the first segment has filesz 0, and has a
> randomized locations.  But that is the same as today.
> 
> There is a whole question does it make sense for the elf loader
> to have it's own helper vm_brk_flags in mm/mmap.c or would it
> make more sense for binfmt_elf to do what binfmt_elf_fdpic does and
> have everything to go through vm_mmap.
> 
> I think replacing vm_brk_flags with vm_mmap would allow fixing the
> theoretical issue of filesz 0 and randomizing locations.
> 
> 
> 
> In this change I replaced an open coded padzero that did not clear
> all of the way to the end of the page, with padzero that does.

Yeah, the resulting code is way more readable now.

> I also stopped checking the return of padzero as there is at least
> one known case where testing for failure is the wrong thing to do.
> It looks like binfmt_elf_fdpic may have the proper set of tests
> for when error handling can be safely completed.
> 
> I found a couple of commits in the old history
> https://git.kernel.org/pub/scm/linux/kernel/git/tglx/history.git,
> that look very interesting in understanding this code.
> 
> commit 39b56d902bf3 ("[PATCH] binfmt_elf: clearing bss may fail")
> commit c6e2227e4a3e ("[SPARC64]: Missing user access return value checks in fs/binfmt_elf.c and fs/compat.c")
> commit 5bf3be033f50 ("v2.4.10.1 -> v2.4.10.2")
> 
> Looking at commit 39b56d902bf3 ("[PATCH] binfmt_elf: clearing bss may fail"):
> >  commit 39b56d902bf35241e7cba6cc30b828ed937175ad
> >  Author: Pavel Machek <pavel@ucw.cz>
> >  Date:   Wed Feb 9 22:40:30 2005 -0800
> > 
> >     [PATCH] binfmt_elf: clearing bss may fail
> >     
> >     So we discover that Borland's Kylix application builder emits weird elf
> >     files which describe a non-writeable bss segment.
> >     
> >     So remove the clear_user() check at the place where we zero out the bss.  I
> >     don't _think_ there are any security implications here (plus we've never
> >     checked that clear_user() return value, so whoops if it is a problem).
> >     
> >     Signed-off-by: Pavel Machek <pavel@suse.cz>
> >     Signed-off-by: Andrew Morton <akpm@osdl.org>
> >     Signed-off-by: Linus Torvalds <torvalds@osdl.org>
> 
> It seems pretty clear that binfmt_elf_fdpic with skipping clear_user
> for non-writable segments and otherwise calling clear_user (aka padzero)
> and checking it's return code is the right thing to do.
> 
> I just skipped the error checking as that avoids breaking things.
> 
> It looks like Borland's Kylix died in 2005 so it might be safe to
> just consider read-only segments with memsz > filesz an error.

I really feel like having a read-only BSS is a pathological state that
should be detected early?

> Looking at commit 5bf3be033f50 ("v2.4.10.1 -> v2.4.10.2") the
> binfmt_elf.c bits confirm my guess that the weird structure is because
> before that point binfmt_elf.c assumed there would be only a single
> segment with memsz > filesz.  Which is why the code was structured so
> weirdly.

Agreed.

> Looking a little farther it looks like the binfmt_elf.c was introduced
> in Linux v1.0, with essentially the same structure in load_elf_binary as
> it has now.  Prior to that Linux hard coded support for a.out binaries
> in execve.  So if someone wants to add a Fixes tag it should be
> "Fixes: v1.0"
> 
> Which finally explains to me why the code is so odd.  For the most part
> the code has only received maintenance for the last 30 years or so.
> Strictly 29 years, but 30 has a better ring to it.
> 
> Anyway those are my rambling thoughts that might help someone.
> For now I will be happy if we can get my elf_load helper tested
> to everyone's satisfaction and merged.

I'm probably going to pull most of this email into the commit log for
the v2 patch -- there's good history here worth capturing.

-- 
Kees Cook

^ permalink raw reply	[flat|nested] 20+ messages in thread

end of thread, other threads:[~2023-09-27  3:17 UTC | newest]

Thread overview: 20+ messages (download: mbox.gz / follow: Atom feed)
-- links below jump to the message on this page --
2023-09-14 15:59 [PATCH RFC] binfmt_elf: fully allocate bss pages Thomas Weißschuh
2023-09-14 19:49 ` Eric W. Biederman
2023-09-14 22:18   ` Thomas Weißschuh
2023-09-15 19:35 ` Sebastian Ott
2023-09-15 22:15 ` Pedro Falcato
2023-09-15 22:41   ` Thomas Weißschuh
2023-09-18 14:11 ` kernel test robot
2023-09-21 10:36 ` Sebastian Ott
2023-09-25  0:50   ` Eric W. Biederman
2023-09-25  9:20     ` Sebastian Ott
2023-09-25  9:50       ` Eric W. Biederman
2023-09-25 12:59       ` [PATCH] binfmt_elf: Support segments with 0 filesz and misaligned starts Eric W. Biederman
2023-09-25 13:00         ` kernel test robot
2023-09-25 13:07           ` Eric W. Biederman
2023-09-25 15:27         ` Sebastian Ott
2023-09-25 17:06           ` Kees Cook
2023-09-26  3:27             ` Eric W. Biederman
2023-09-27  2:34               ` Kees Cook
2023-09-26 13:49         ` Dan Carpenter
2023-09-26 14:42           ` [PATCH v2] " Eric W. Biederman

This is an external index of several public inboxes,
see mirroring instructions on how to clone and mirror
all data and code used by this external index.