qemu-devel.nongnu.org archive mirror
* Approaches for same-on-same linux-user execve?
@ 2021-10-07 14:32 Alex Bennée
  2021-10-07 16:28 ` Arnd Bergmann
                   ` (4 more replies)
  0 siblings, 5 replies; 8+ messages in thread
From: Alex Bennée @ 2021-10-07 14:32 UTC
  To: Laurent Vivier, Richard Henderson
  Cc: assad.hashmi, qemu-devel, James Bottomley, qemu-arm,
	Eric W. Biederman, Arnd Bergmann

Hi,

I came across a use-case this week for ARM although this may be also
applicable to architectures where QEMU's emulation is ahead of the
hardware currently widely available - for example if you want to
exercise SVE code on AArch64. When the linux-user architecture is not
the same as the host architecture then binfmt_misc works perfectly fine.

However in the case you are running same-on-same you can't use
binfmt_misc to redirect execution to using QEMU because any attempt to
trap native binaries will cause your userspace to hang as binfmt_misc
will be invoked to run the QEMU binary needed to run your application
and a deadlock ensues.

There are some hacks you can apply at a local level like tweaking the
elf header of the binaries you want to run under emulation and adjusting
the binfmt_misc mask appropriately. This works but is messy and a faff to
set-up.
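
To make the header-tweak hack concrete, here is a minimal sketch in Python
of how a tagged binary could be told apart from an untouched one by a
binfmt_misc magic/mask pair. The tag offset and value, the rule name and
the interpreter path are assumptions chosen for illustration, and the
matching function only models a masked byte comparison; it is not kernel
code.

```python
# Illustrative model only: TAG_OFFSET, TAG_VALUE and any paths are assumptions.

# First 20 bytes of a little-endian AArch64 ELF64 header (e_ident, e_type,
# e_machine). The mask ignores EI_OSABI and accepts both ET_EXEC and ET_DYN.
AARCH64_MAGIC = bytes.fromhex("7f454c46020101" + "00" * 9 + "0200b700")
AARCH64_MASK = bytes.fromhex("ffffffffffffff00" + "ff" * 8 + "feffffff")

TAG_OFFSET = 15    # hypothetical: last padding byte of e_ident
TAG_VALUE = 0x51   # hypothetical marker value


def tagged(magic, mask, offset=TAG_OFFSET, value=TAG_VALUE):
    """Return a (magic, mask) pair that additionally requires the tag byte."""
    new_magic, new_mask = bytearray(magic), bytearray(mask)
    new_magic[offset] = value
    new_mask[offset] = 0xFF
    return bytes(new_magic), bytes(new_mask)


def matches(header, magic, mask):
    """Masked comparison: bits cleared in the mask are ignored."""
    if len(header) < len(magic):
        return False
    return all((h ^ m) & k == 0 for h, m, k in zip(header, magic, mask))


def binfmt_rule(name, magic, mask, interpreter, flags=""):
    """Format a ':name:M::magic:mask:interpreter:flags' registration line."""
    def esc(data):
        return "".join(f"\\x{b:02x}" for b in data)
    return f":{name}:M::{esc(magic)}:{esc(mask)}:{interpreter}:{flags}"
```

With a rule built from tagged(), an untagged QEMU binary no longer matches,
so launching it would not recurse into binfmt_misc. Whether a given loader
tolerates a non-zero padding byte is something to check before relying on
it.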

An ideal setup would be for the kernel to catch a SIGILL from a
failing user space program and then to re-launch the process using QEMU
with the old process's maps and execution state so it could continue.
However I suspect there are enough moving parts to make this very
fragile (e.g. what happens to the results of library feature probing
code). So two approaches I can think of are:

Trap execve in QEMU linux-user
------------------------------

We could add a flag to QEMU so at the point of execve it manually
invokes the new process with QEMU, passing on the flag to persist this
behaviour.
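
A minimal sketch of the argument rewriting this would involve, written in
Python for brevity rather than as QEMU code. The flag name
--reexec-under-qemu and the function name are hypothetical; -0 is the
existing linux-user option for setting the guest's argv[0].

```python
def wrap_execve(qemu_path, path, argv, persist_flag="--reexec-under-qemu"):
    """Return the (path, argv) pair to hand to the host execve().

    Instead of executing `path` directly, QEMU re-executes itself with the
    (hypothetical) persist flag so that the child keeps the same behaviour
    for its own execve() calls.
    """
    # Preserve what the guest asked for as argv[0]; fall back to the path.
    argv0 = argv[0] if argv else path
    new_argv = [qemu_path, persist_flag, "-0", argv0, path, *argv[1:]]
    return qemu_path, new_argv
```

For example, wrap_execve("/usr/bin/qemu-aarch64", "/bin/ls", ["ls", "-l"])
yields an argv that starts QEMU again with /bin/ls as the guest binary.
The sketch does not handle "#!" scripts or non-ELF targets, which would
need to be detected before rewriting.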


Add path mask to binfmt_misc
----------------------------

The other option would be to extend binfmt_misc to have a path mask so
it only applies its alternative execution scheme to binaries in a
particular section of the file-system (or maybe some sort of pattern?).
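
As a rough model of the decision such a path mask would add, here is a
sketch in Python. The function name and the idea of a list of directory
prefixes are assumptions; binfmt_misc has no such field today, and a real
implementation would also have to decide how symlinks and mount namespaces
are treated.

```python
from pathlib import PurePosixPath


def path_rule_applies(binary, prefixes):
    """True if `binary` lives under one of the configured directory prefixes.

    The comparison is done per path component, so "/opt/sve" does not
    accidentally match "/opt/sve-other/bin/app".
    """
    path = PurePosixPath(binary)
    ancestors = {path, *path.parents}
    return any(PurePosixPath(prefix) in ancestors for prefix in prefixes)
```

With prefixes such as ["/opt/sve-root"], a binary in /opt/sve-root/bin
would be redirected while the QEMU binary in /usr/bin would still run
natively, which is what avoids the deadlock.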

Are there any other approaches you could take? Which do you think has
the most merit?

Thanks,

-- 
Alex Bennée



* Re: Approaches for same-on-same linux-user execve?
  2021-10-07 14:32 Approaches for same-on-same linux-user execve? Alex Bennée
@ 2021-10-07 16:28 ` Arnd Bergmann
  2021-10-08 10:44   ` Alex Bennée
  2021-10-07 18:59 ` Laurent Vivier
                   ` (3 subsequent siblings)
  4 siblings, 1 reply; 8+ messages in thread
From: Arnd Bergmann @ 2021-10-07 16:28 UTC
  To: Alex Bennée
  Cc: assad.hashmi, Richard Henderson, Laurent Vivier, QEMU Developers,
	James Bottomley, qemu-arm, Eric W. Biederman, Arnd Bergmann

On Thu, Oct 7, 2021 at 4:32 PM Alex Bennée <alex.bennee@linaro.org> wrote:
>
> I came across a use-case this week for ARM although this may be also
> applicable to architectures where QEMU's emulation is ahead of the
> hardware currently widely available - for example if you want to
> exercise SVE code on AArch64. When the linux-user architecture is not
> the same as the host architecture then binfmt_misc works perfectly fine.
>
> However in the case you are running same-on-same you can't use
> binfmt_misc to redirect execution to using QEMU because any attempt to
> trap native binaries will cause your userspace to hang as binfmt_misc
> will be invoked to run the QEMU binary needed to run your application
> and a deadlock ensues.

Can you clarify how the code would run in this case? Does qemu-user
still emulate every single instruction, both the compatible and the incompatible
ones, or is the idea here to run as much as possible natively and only
emulate the instructions that are not available natively, using either
SIGILL or searching through the object code for those instructions?

> Trap execve in QEMU linux-user
> ------------------------------
>
> We could add a flag to QEMU so at the point of execve it manually
> invokes the new process with QEMU, passing on the flag to persist this
> behaviour.

This sounds like the obvious approach if you already do a full
instruction emulation. You'd still need to run the parent process
by calling qemu-user manually, but I suppose you need to do
something like this in any case.

> Add path mask to binfmt_misc
> ----------------------------
>
> The other option would be to extend binfmt_misc to have a path mask so
> it only applies its alternative execution scheme to binaries in a
> particular section of the file-system (or maybe some sort of pattern?).

The main downside I see here is that it requires kernel modification, so
it would not work for old kernels.

> Are there any other approaches you could take? Which do you think has
> the most merit?

If we modify binfmt_misc in the kernel, it might be helpful to do it
by extending it with namespace support, so it could be constrained
to a single container without having to do the emulation outside.
Unfortunately that does not solve the problem of preventing the
qemu-user binary from triggering the binfmt_misc lookup.

       Arnd



* Re: Approaches for same-on-same linux-user execve?
  2021-10-07 14:32 Approaches for same-on-same linux-user execve? Alex Bennée
  2021-10-07 16:28 ` Arnd Bergmann
@ 2021-10-07 18:59 ` Laurent Vivier
  2021-10-07 19:13 ` Warner Losh
                   ` (2 subsequent siblings)
  4 siblings, 0 replies; 8+ messages in thread
From: Laurent Vivier @ 2021-10-07 18:59 UTC
  To: Alex Bennée, Richard Henderson
  Cc: assad.hashmi, qemu-devel, James Bottomley, qemu-arm,
	Eric W. Biederman, Arnd Bergmann

On 07/10/2021 at 16:32, Alex Bennée wrote:
> Hi,
> 
> I came across a use-case this week for ARM although this may be also
> applicable to architectures where QEMU's emulation is ahead of the
> hardware currently widely available - for example if you want to
> exercise SVE code on AArch64. When the linux-user architecture is not
> the same as the host architecture then binfmt_misc works perfectly fine.
> 
> However in the case you are running same-on-same you can't use
> binfmt_misc to redirect execution to using QEMU because any attempt to
> trap native binaries will cause your userspace to hang as binfmt_misc
> will be invoked to run the QEMU binary needed to run your application
> and a deadlock ensues.
> 
> There are some hacks you can apply at a local level like tweaking the
> elf header of the binaries you want to run under emulation and adjusting
> the binfmt_misc mask appropriately. This works but is messy and a faff to
> set-up.
> 
> An ideal setup would be for the kernel to catch a SIGILL from a
> failing user space program and then to re-launch the process using QEMU
> with the old process's maps and execution state so it could continue.
> However I suspect there are enough moving parts to make this very
> fragile (e.g. what happens to the results of library feature probing
> code). So two approaches I can think of are:
> 
> Trap execve in QEMU linux-user
> ------------------------------
> 
> We could add a flag to QEMU so at the point of execve it manually
> invokes the new process with QEMU, passing on the flag to persist this
> behaviour.

Another approach can be to use ptrace(PTRACE_SYSEMU) to catch syscalls.

We need a wrapper that loads the first target binary and forks; it attaches a ptrace() process and
intercepts the syscalls to emulate them, as is done in User-mode Linux.

I was thinking of this solution, for instance, to execute a big-endian program (like ppc64) on a
little-endian system (ppc64le).

But I'm not sure it fits in what you need...


> 
> Add path mask to binfmt_misc
> ----------------------------
> 
> The other option would be to extend binfmt_misc to have a path mask so
> it only applies its alternative execution scheme to binaries in a
> particular section of the file-system (or maybe some sort of pattern?).
> 
> Are there any other approaches you could take? Which do you think has
> the most merit?

I don't know if it applies to what you want, but years ago I wrote a binfmt namespace that applies
the binfmt configuration only to a container. I didn't finish the work, though (there seem to be some
security issues in what I did):

https://lore.kernel.org/lkml/20191216091220.465626-2-laurent@vivier.eu/T/

Thanks,
Laurent



* Re: Approaches for same-on-same linux-user execve?
  2021-10-07 14:32 Approaches for same-on-same linux-user execve? Alex Bennée
  2021-10-07 16:28 ` Arnd Bergmann
  2021-10-07 18:59 ` Laurent Vivier
@ 2021-10-07 19:13 ` Warner Losh
  2021-10-08 11:01 ` Daniel P. Berrangé
  2021-10-08 11:20 ` Arnd Bergmann
  4 siblings, 0 replies; 8+ messages in thread
From: Warner Losh @ 2021-10-07 19:13 UTC
  To: Alex Bennée
  Cc: assad.hashmi, Richard Henderson, Laurent Vivier, QEMU Developers,
	James Bottomley, qemu-arm, Eric W. Biederman, Arnd Bergmann

On Thu, Oct 7, 2021 at 8:56 AM Alex Bennée <alex.bennee@linaro.org> wrote:

> Hi,
>
> I came across a use-case this week for ARM although this may be also
> applicable to architectures where QEMU's emulation is ahead of the
> hardware currently widely available - for example if you want to
> exercise SVE code on AArch64. When the linux-user architecture is not
> the same as the host architecture then binfmt_misc works perfectly fine.
>
> However in the case you are running same-on-same you can't use
> binfmt_misc to redirect execution to using QEMU because any attempt to
> trap native binaries will cause your userspace to hang as binfmt_misc
> will be invoked to run the QEMU binary needed to run your application
> and a deadlock ensues.
>
> There are some hacks you can apply at a local level like tweaking the
> elf header of the binaries you want to run under emulation and adjusting
> the binfmt_misc mask appropriately. This works but is messy and a faff to
> set-up.
>
> An ideal setup would be for the kernel to catch a SIGILL from a
> failing user space program and then to re-launch the process using QEMU
> with the old process's maps and execution state so it could continue.
> However I suspect there are enough moving parts to make this very
> fragile (e.g. what happens to the results of library feature probing
> code). So two approaches I can think of are:
>

32-bit arm had an 'eabi' section in ELF binaries. There it would have been
possible to look at that and make a decision before the binary starts
executing to see whether it should just run, or fork the linux-user binary.
It would take kernel changes, though.


> Trap execve in QEMU linux-user
> ------------------------------
>
> We could add a flag to QEMU so at the point of execve it manually
> invokes the new process with QEMU, passing on the flag to persist this
> behaviour.
>

The bsd-user code differs a little from linux-user in that it looks at the
binary being exec'd to determine what to do. It works OK, but isn't really
for this situation (we use it to optimize our package builds with additional
path processing for our mixed binary situation where we have native binaries
execing emulated binaries that then exec native binaries again). It is a bit
of a hack, though, and I'm not completely happy with it.

> Add path mask to binfmt_misc
> ----------------------------
>
> The other option would be to extend binfmt_misc to have a path mask so
> it only applies its alternative execution scheme to binaries in a
> particular section of the file-system (or maybe some sort of pattern?).
>
> Are there any other approaches you could take? Which do you think has
> the most merit?
>

In by-gone times, brandelf has been used for situations where you wanted
to run an ELF binary with one emulation that looks like another. But that's
also kernel hacks and also touching the local binary.

There's also the option of doing a VM86-like thing that allowed people to
run 16-bit x86 binaries on 32-bit processors. There the system calls would
SEGV and you'd decode them inline, execute the emulation and move the IP to
execute the next instruction after the INT XX system call. You could create
a loader that knows how to load the binaries and catch SIGILL and then
emulate the new instructions on the old processor, but that's somewhat
different than how qemu user-mode works today. But knowing you'd need to do
this is hard, potentially. One could expand the kernel to load in SIGILL
handlers on-demand for programs that do this, but that wouldn't work with
old kernels and just feels weird...

Warner


* Re: Approaches for same-on-same linux-user execve?
  2021-10-07 16:28 ` Arnd Bergmann
@ 2021-10-08 10:44   ` Alex Bennée
  2021-10-14 13:01     ` Assad Hashmi
  0 siblings, 1 reply; 8+ messages in thread
From: Alex Bennée @ 2021-10-08 10:44 UTC
  To: Arnd Bergmann
  Cc: assad.hashmi, Richard Henderson, Laurent Vivier, QEMU Developers,
	James Bottomley, qemu-arm, Eric W. Biederman, Arnd Bergmann


Arnd Bergmann <arnd@arndb.de> writes:

> On Thu, Oct 7, 2021 at 4:32 PM Alex Bennée <alex.bennee@linaro.org> wrote:
>>
>> I came across a use-case this week for ARM although this may be also
>> applicable to architectures where QEMU's emulation is ahead of the
>> hardware currently widely available - for example if you want to
>> exercise SVE code on AArch64. When the linux-user architecture is not
>> the same as the host architecture then binfmt_misc works perfectly fine.
>>
>> However in the case you are running same-on-same you can't use
>> binfmt_misc to redirect execution to using QEMU because any attempt to
>> trap native binaries will cause your userspace to hang as binfmt_misc
>> will be invoked to run the QEMU binary needed to run your application
>> and a deadlock ensues.
>
> Can you clarify how the code would run in this case? Does qemu-user
> still emulate every single instruction, both the compatible and the incompatible
> ones, or is the idea here to run as much as possible natively and only
> emulate the instructions that are not available natively, using either
> SIGILL or searching through the object code for those instructions?

qemu-user only ever does a complete translation. The hope is of course
our translator is "fairly efficient" so for example integer SVE
operations should get unrolled into a series of AdvSIMD instructions on
the backend.

ARM's armie takes a different approach with the trap and emulate of
SIGILL instructions. This works well for the occasional "new"
instruction but will be less efficient overall if your instruction
stream is entirely novel.

>> Trap execve in QEMU linux-user
>> ------------------------------
>>
>> We could add a flag to QEMU so at the point of execve it manually
>> invokes the new process with QEMU, passing on the flag to persist this
>> behaviour.
>
> This sounds like the obvious approach if you already do a full
> instruction emulation. You'd still need to run the parent process
> by calling qemu-user manually, but I suppose you need to do
> something like this in any case.
>
>> Add path mask to binfmt_misc
>> ----------------------------
>>
>> The other option would be to extend binfmt_misc to have a path mask so
>> it only applies its alternative execution scheme to binaries in a
>> particular section of the file-system (or maybe some sort of pattern?).
>
> The main downside I see here is that it requires kernel modification, so
> it would not work for old kernels.
>
>> Are there any other approaches you could take? Which do you think has
>> the most merit?
>
> If we modify binfmt_misc in the kernel, it might be helpful to do it
> by extending it with namespace support, so it could be constrained
> to a single container without having to do the emulation outside.
> Unfortunately that does not solve the problem of preventing the
> qemu-user binary from triggering the binfmt_misc lookup.

I wonder how that would interact with the persistent ("P") mode of
binfmt_misc. The backend is identified at the start and gets re-used
rather than looked up each time.

>
>        Arnd


-- 
Alex Bennée



* Re: Approaches for same-on-same linux-user execve?
  2021-10-07 14:32 Approaches for same-on-same linux-user execve? Alex Bennée
                   ` (2 preceding siblings ...)
  2021-10-07 19:13 ` Warner Losh
@ 2021-10-08 11:01 ` Daniel P. Berrangé
  2021-10-08 11:20 ` Arnd Bergmann
  4 siblings, 0 replies; 8+ messages in thread
From: Daniel P. Berrangé @ 2021-10-08 11:01 UTC
  To: Alex Bennée
  Cc: assad.hashmi, Richard Henderson, Laurent Vivier, qemu-devel,
	James Bottomley, qemu-arm, Eric W. Biederman, Arnd Bergmann

On Thu, Oct 07, 2021 at 03:32:19PM +0100, Alex Bennée wrote:
> Hi,
> 
> I came across a use-case this week for ARM although this may be also
> applicable to architectures where QEMU's emulation is ahead of the
> hardware currently widely available - for example if you want to
> exercise SVE code on AArch64. When the linux-user architecture is not
> the same as the host architecture then binfmt_misc works perfectly fine.
> 
> However in the case you are running same-on-same you can't use
> binfmt_misc to redirect execution to using QEMU because any attempt to
> trap native binaries will cause your userspace to hang as binfmt_misc
> will be invoked to run the QEMU binary needed to run your application
> and a deadlock ensues.
> 
> There are some hacks you can apply at a local level like tweaking the
> elf header of the binaries you want to run under emulation and adjusting
> the binfmt_misc mask appropriately. This works but is messy and a faff to
> set-up.
> 
> An ideal setup would be for the kernel to catch a SIGILL from a
> failing user space program and then to re-launch the process using QEMU
> with the old process's maps and execution state so it could continue.
> However I suspect there are enough moving parts to make this very
> fragile (e.g. what happens to the results of library feature probing
> code). So two approaches I can think of are:
> 
> Trap execve in QEMU linux-user
> ------------------------------
> 
> We could add a flag to QEMU so at the point of execve it manually
> invokes the new process with QEMU, passing on the flag to persist this
> behaviour.
> 
> 
> Add path mask to binfmt_misc
> ----------------------------
> 
> The other option would be to extend binfmt_misc to have a path mask so
> it only applies its alternative execution scheme to binaries in a
> particular section of the file-system (or maybe some sort of pattern?).
> 
> Are there any other approaches you could take? Which do you think has
> the most merit?

Could a new Linux personality flag be useful in combination with a
new flag in binfmt_misc?

e.g. a flag "E" for binfmt_misc which indicates the rule must only be
applied if the process is execve()d with PER_USE_BINFMT personality
set.

That would let you add a native match rule to binfmt_misc without
it affecting your system initially. To then run native binaries via
qemu-user you just need to set the personality() flag and then only
that sub-process tree gets redirected.
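
A small model of the check this implies, sketched in Python. Neither the
"E" flag nor PER_USE_BINFMT exists today; the bit value below is an
arbitrary placeholder and not a real kernel constant.

```python
# Hypothetical personality bit; placeholder value, not a real kernel constant.
PER_USE_BINFMT = 0x0010000


def persona_rule_applies(rule_flags, persona):
    """Decide whether a binfmt_misc rule should fire for this process.

    Rules without the (hypothetical) "E" flag behave as they do today.
    Rules with "E" fire only if the caller's personality has opted in.
    """
    if "E" not in rule_flags:
        return True
    return bool(persona & PER_USE_BINFMT)
```

A launcher would set the personality bit once and then exec the workload.
QEMU itself would have to be started without the bit, or the rule would
match the QEMU binary again.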

Regards,
Daniel
-- 
|: https://berrange.com      -o-    https://www.flickr.com/photos/dberrange :|
|: https://libvirt.org         -o-            https://fstop138.berrange.com :|
|: https://entangle-photo.org    -o-    https://www.instagram.com/dberrange :|




* Re: Approaches for same-on-same linux-user execve?
  2021-10-07 14:32 Approaches for same-on-same linux-user execve? Alex Bennée
                   ` (3 preceding siblings ...)
  2021-10-08 11:01 ` Daniel P. Berrangé
@ 2021-10-08 11:20 ` Arnd Bergmann
  4 siblings, 0 replies; 8+ messages in thread
From: Arnd Bergmann @ 2021-10-08 11:20 UTC
  To: Alex Bennée
  Cc: assad.hashmi, Richard Henderson, Laurent Vivier, QEMU Developers,
	James Bottomley, qemu-arm, Eric W. Biederman, Arnd Bergmann

On Thu, Oct 7, 2021 at 4:32 PM Alex Bennée <alex.bennee@linaro.org> wrote:
>
> Are there any other approaches you could take? Which do you think has
> the most merit?

Reading through the ELF loader code in the kernel, I had another idea:
If qemu-user could be turned into a replacement for /lib/ld.so and act
as an ELF interpreter rather than a binfmt-misc helper, this might address
a lot of the issues automatically.

It would need to be a statically linked binary so it doesn't itself require
an interpreter. It would have to do the job of ld.so in addition to
the emulation, but it could do that by finding the real ld.so somewhere
else and running that in emulation mode. It would also not work at
all for statically linked executables.
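
For illustration, a minimal sketch in Python of how the interpreter path is
found in an executable: the PT_INTERP program header names the interpreter,
and a statically linked binary has none, which is why this approach would
not cover static executables. The sketch handles only little-endian ELF64,
and build_demo_image() is a hypothetical helper that fabricates a tiny
synthetic image for demonstration.

```python
import struct

PT_INTERP = 3
EHDR = struct.Struct("<16sHHIQQQIHHHHHH")   # ELF64 file header, 64 bytes
PHDR = struct.Struct("<IIQQQQQQ")           # ELF64 program header, 56 bytes


def elf_interpreter(image):
    """Return the PT_INTERP path of an ELF64 LE image, or None if absent."""
    if len(image) < EHDR.size or image[:4] != b"\x7fELF":
        raise ValueError("not an ELF image")
    if image[4] != 2 or image[5] != 1:
        raise ValueError("only little-endian ELF64 is handled here")
    fields = EHDR.unpack_from(image, 0)
    e_phoff, e_phentsize, e_phnum = fields[5], fields[9], fields[10]
    for index in range(e_phnum):
        phdr = PHDR.unpack_from(image, e_phoff + index * e_phentsize)
        p_type, p_offset, p_filesz = phdr[0], phdr[2], phdr[5]
        if p_type == PT_INTERP:
            raw = image[p_offset:p_offset + p_filesz]
            return raw.rstrip(b"\0").decode()
    return None


def build_demo_image(interp=None):
    """Build a tiny synthetic ELF64 image, with or without PT_INTERP."""
    ident = b"\x7fELF\x02\x01\x01" + b"\0" * 9
    phnum = 0 if interp is None else 1
    ehdr = EHDR.pack(ident, 2, 0xB7, 1, 0, EHDR.size, 0, 0,
                     EHDR.size, PHDR.size, phnum, 0, 0, 0)
    if interp is None:
        return ehdr
    data = interp.encode() + b"\0"
    phdr = PHDR.pack(PT_INTERP, 4, EHDR.size + PHDR.size, 0, 0,
                     len(data), len(data), 1)
    return ehdr + phdr + data
```

Running elf_interpreter() on a real dynamically linked binary's contents
shows which path an interpreter-style qemu-user would have to occupy, or
which path the binary would have to be relinked against.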

Not sure if that makes the tradeoffs better than your other suggestions,
but it seemed worth bringing up.

       Arnd



* Re: Approaches for same-on-same linux-user execve?
  2021-10-08 10:44   ` Alex Bennée
@ 2021-10-14 13:01     ` Assad Hashmi
  0 siblings, 0 replies; 8+ messages in thread
From: Assad Hashmi @ 2021-10-14 13:01 UTC
  To: Alex Bennée
  Cc: Arnd Bergmann, Richard Henderson, Laurent Vivier,
	QEMU Developers, James Bottomley, qemu-arm, Eric W. Biederman,
	Arnd Bergmann

> ARM's armie takes a different approach with the trap and emulate of
> SIGILL instructions. This works well for the occasional "new"
> instruction but will be less efficient overall if your instruction
> stream is entirely novel.

To clarify: earlier versions of armie did use the SIGILL trap-and-emulate
method, which was limited. Recent versions, including the latest release, are
based on the DynamoRIO platform which enables full emulation and
instrumentation (https://dynamorio.org). By default, DynamoRIO and by extension
armie, follow all child processes, see
https://dynamorio.org/page_deploy.html#op_children.

As new Arm architecture features are added to QEMU, e.g. SVE, SVE2, SME etc.
there is an expectation in the Arm community that QEMU can run large Arm
user-space applications on Arm hardware, making lack of same-on-same execve a
not insignificant blocker.

AIUI, given the open-source licensing of QEMU and DynamoRIO, there would be no
legal reason for QEMU not to borrow from DynamoRIO.


On Fri, 8 Oct 2021 at 11:49, Alex Bennée <alex.bennee@linaro.org> wrote:
>
>
> Arnd Bergmann <arnd@arndb.de> writes:
>
> > On Thu, Oct 7, 2021 at 4:32 PM Alex Bennée <alex.bennee@linaro.org> wrote:
> >>
> >> I came across a use-case this week for ARM although this may be also
> >> applicable to architectures where QEMU's emulation is ahead of the
> >> hardware currently widely available - for example if you want to
> >> exercise SVE code on AArch64. When the linux-user architecture is not
> >> the same as the host architecture then binfmt_misc works perfectly fine.
> >>
> >> However in the case you are running same-on-same you can't use
> >> binfmt_misc to redirect execution to using QEMU because any attempt to
> >> trap native binaries will cause your userspace to hang as binfmt_misc
> >> will be invoked to run the QEMU binary needed to run your application
> >> and a deadlock ensues.
> >
> > Can you clarify how the code would run in this case? Does qemu-user
> > still emulate every single instruction, both the compatible and the incompatible
> > ones, or is the idea here to run as much as possible natively and only
> > emulate the instructions that are not available natively, using either
> > SIGILL or searching through the object code for those instructions?
>
> qemu-user only ever does a complete translation. The hope is of course
> our translator is "fairly efficient" so for example integer SVE
> operations should get unrolled into a series of AdvSIMD instructions on
> the backend.
>
> ARM's armie takes a different approach with the trap and emulate of
> SIGILL instructions. This works well for the occasional "new"
> instruction but will be less efficient overall if your instruction
> stream is entirely novel.
>
> >> Trap execve in QEMU linux-user
> >> ------------------------------
> >>
> >> We could add a flag to QEMU so at the point of execve it manually
> >> invokes the new process with QEMU, passing on the flag to persist this
> >> behaviour.
> >
> > This sounds like the obvious approach if you already do a full
> > instruction emulation. You'd still need to run the parent process
> > by calling qemu-user manually, but I suppose you need to do
> > something like this in any case.
> >
> >> Add path mask to binfmt_misc
> >> ----------------------------
> >>
> >> The other option would be to extend binfmt_misc to have a path mask so
> >> it only applies its alternative execution scheme to binaries in a
> >> particular section of the file-system (or maybe some sort of pattern?).
> >
> > The main downside I see here is that it requires kernel modification, so
> > it would not work for old kernels.
> >
> >> Are there any other approaches you could take? Which do you think has
> >> the most merit?
> >
> > If we modify binfmt_misc in the kernel, it might be helpful to do it
> > by extending it with namespace support, so it could be constrained
> > to a single container without having to do the emulation outside.
> > Unfortunately that does not solve the problem of preventing the
> > qemu-user binary from triggering the binfmt_misc lookup.
>
> I wonder how that would interact with the persistent ("P") mode of
> binfmt_misc. The backend is identified at the start and gets re-used
> rather than looked up each time.
>
> >
> >        Arnd
>
>
> --
> Alex Bennée

