Jason Gunthorpe <jgunthorpe-ePGOBjL8dl3ta4EC/59zMFaTQe2KTcn/@public.gmane.org> wrote on 01/21/2016 02:30:49 PM:

> 
> On Thu, Jan 21, 2016 at 02:02:17PM -0500, Stefan Berger wrote:
> >    What is IMA namespace in relation to a device's name? The method is to
> >    read the major/minor numbers on the host and create /dev/tpm0 with the
> >    same major/minor numbers in the container's filesystem. The name
> >    doesn't matter I guess, but major/minor are important.
> 
> Ostensibly we number the /dev/tpmX's in relation to the tpm index
> number.
> 
> Internally to the kernel the TPM access is done by that tpm index.
> 
> Today, IMA hard codes that index value to 0 (IIRC). I could see a
> future IMA allowing user space to specify the index. The index is also
> how to associate the /dev/tpm node with the /sysfs files.
> 
> So the index is important, we'd want to control it for namespaces.
> 
> In any case the tpm index is part of the contract, and it would be
> ideal if the IMA namespace made tpm index 0 be the right vtpm.

Got it.

> 
> >    The problem I have run into in particular with Docker and golang is
> >    that Docker invokes the golang function to run an external program. The
> >    golang function does a clone(), a whole lot of other stuff after it,
> >    and in the end the execve().
> 
> Well, ultimately that is a docker problem, as you describe IMA has a
> special new requirement where the IMA NS has to be setup quickly.
> 
> >    The code is here:
> >    https://golang.org/src/syscall/exec_linux.go
> >    Look at the function forkAndExecInChild on line 56++.
> 
> Well, that is just bad API design, sorry. The unix model of fork()/exec()
> is that the app gets a chance to adjust the environment between
> fork/exec, and this design, while easy to use, locks the app into the
> hard wired customization that forkAndExecInChild does.

From a caller's perspective it's attractive to have it do all kinds of
stuff IF one doesn't have to go in between. Unfortunately we have to go in
between.

> 
> Maybe add a callback to SysProcAttr or something? Can't help you here.
> 
> You can't let this influence the kernel UAPI design.

The choice is between getting this working 'today' (even if just locally)
or discussing this with the golang designers, which in the ideal case would
mean waiting for the next version, dealing with that version dependency
etc., plus the delay. So, clearly, an additional ioctl() and ~50 lines of
code make this work 'now'. Doesn't this seem worth it?

> 
> >    available. So, the conclusion is, to accommodate golang (for example) we
> >    can create the device pair, sit the vTPM on top of the master, and
> >    reserve the device pair before the next clone() so that IMA finds it and
> >    can hook up to it.
> >    What is wrong with this scheme? The ioctl for 'reservation' before the
> >    clone()?
> 
> Yes, how does that sort of thing even make sense in a complex
> multi-threaded world?

controlfd = open("/dev/vtpmx", ...);
ioctl(controlfd, CREATE_VTPM, &inargs, &outargs);
[...]
ioctl(controlfd, RESERVE_VTPM_FOR_NS, ...params);
clone()

The controlfd can be passed between threads if clone() were to happen in
another thread (it currently doesn't), so this shouldn't be a problem.
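
Spelled out a bit more, the create-then-reserve step could look like the
sketch below. Everything UAPI-related in it is an assumption made for
illustration: the magic number, the ioctl numbers, the struct layout and
the helper name vtpm_create_and_reserve are placeholders, and a single
in/out struct stands in for the separate inargs/outargs above.

```c
#include <stdint.h>
#include <sys/ioctl.h>
#include <linux/ioctl.h>

/* Placeholder magic and layout; not an existing kernel interface. */
#define VTPM_IOC_MAGIC 'v'

struct vtpm_create_args {
    uint32_t flags;      /* in:  creation flags */
    uint32_t fd;         /* out: server-side fd for the vTPM emulator */
    uint32_t tpm_index;  /* out: X of the new /dev/tpmX */
    uint32_t dev_major;  /* out: major number of /dev/tpmX */
    uint32_t dev_minor;  /* out: minor number of /dev/tpmX */
};

#define CREATE_VTPM         _IOWR(VTPM_IOC_MAGIC, 0x00, struct vtpm_create_args)
#define RESERVE_VTPM_FOR_NS _IOW(VTPM_IOC_MAGIC, 0x01, uint32_t)

/* Create a device pair and reserve it so that the IMA namespace created
 * by the next clone() can find it. Returns 0 on success, -1 with errno
 * set if either ioctl fails. */
int vtpm_create_and_reserve(int controlfd, struct vtpm_create_args *args)
{
    if (ioctl(controlfd, CREATE_VTPM, args) != 0)
        return -1;
    /* [...] the vTPM emulator would be started on args->fd here */
    return ioctl(controlfd, RESERVE_VTPM_FOR_NS, &args->tpm_index);
}
```

Since controlfd is an ordinary file descriptor, whichever thread ends up
doing the clone() can make this call right before it.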

> 
> >    Should it work like this?
> 
> Sort of like this:
> 
>     controlfd = open("/dev/vtpmx", ...);
>     ioctl(controlfd, CREATE_VTPM, &inargs, &outargs);
>     serverfd = outargs.fd;
>     // /dev/tpmX exists. X is returned in outargs.tpm_index, maybe
>     // return major/minor too
> 
>     child = clone(...)
>     ioctl(??? , ASSIGN_VTPM_TO_NS, .. child->ima_ns .., to index = 0,
>           from index = outargs.tpm_index);

After the clone() you are in that IMA namespace. So the only argument
needed there is the index of the tpm_chip to hook up to the current IMA
namespace.
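
As a rough sketch of what I mean for the child side (the ioctl name, its
number and the helper vtpm_hook_to_current_ns are hypothetical
placeholders, not an existing interface):

```c
#include <stdint.h>
#include <sys/ioctl.h>
#include <linux/ioctl.h>

/* Placeholder ioctl number; not an existing kernel interface. */
#define HOOK_VTPM_TO_CURRENT_NS _IOW('v', 0x02, uint32_t)

/* Meant to be called in the child after clone(). The caller already
 * lives in the new IMA namespace, so no namespace reference is passed,
 * only the index of the tpm_chip to attach. Returns 0 on success,
 * -1 with errno set otherwise. */
int vtpm_hook_to_current_ns(int controlfd, uint32_t tpm_index)
{
    return ioctl(controlfd, HOOK_VTPM_TO_CURRENT_NS, &tpm_index);
}
```

The kernel side would take the namespace from the calling task, which is
why nothing like child->ima_ns shows up in the arguments.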

> 
>     /* tpm index X is destroyed, kernel prevents reuse of index X
>        until the NS is destroyed too. /dev/tpmX is removed by udev */
>     close(serverfd);
> 
> So, you'd probably make a vtpm daemon that took as execv args a
> reference to the IMA namespace to create the tpm in, and have docker
> launch it after the clone, but before the exec in the parent
> namespace.


I got that part with the fd, major & minor number. It seems to work.
I have one ioctl to reserve the vTPM before the clone and another ioctl to
hook IMA-NS and vTPM together after the clone, but that patch is for later.
So let's not just kill the ioctl for 'reservation' like that, please.

> 
> That is fairly similar to how net ns works, with the wrinkle you have
> to do this before the exec, I guess.
> 
> It also allows hw tpms to be routed to the ns.

How many hardware TPMs are going to be there? One? That is to be used for
the host, right? Presumably the containers will all get emulated TPMs. And
sharing the single hardware TPM between multiple containers just isn't
possible.

Routing a hardware TPM will not be possible when going through the vTPM
driver, but you have the ??? up there. I'd put the 'controlfd' in that
place. The vTPM driver will only know about the vtpm_dev->chip instances
that it created, and none of them is a hardware TPM.

    Stefan

> 
> The docker container just has the normal /dev/tpm0 
> 
> Jason
>