Jason Gunthorpe wrote on 01/20/2016 08:17:01 PM:

> On Wed, Jan 20, 2016 at 09:39:09AM -0500, Stefan Berger wrote:
> > Jason Gunthorpe wrote on 01/19/2016 06:51:07 PM:
> >
> > > On Thu, Jan 14, 2016 at 11:01:57AM -0500, Stefan Berger wrote:
> > > > +	pdev = platform_device_register_simple("tpm_vtpm", vtpm_dev->dev_num,
> > > > +	                                       NULL, 0);
> > >
> > > This seems strange, something like this should not be creating
> > > platform_devices.
> >
> > Should it be a misc_device then?
>
> No. Check what other virtual devices are doing.. register_chrdev maybe?
>
> Generally speaking, the tpm core should not insist on a parent device
> the way it does. That is an artifact of the clean up I did of the
> drivers that was never fully completed. To make the driver migration
> easier the TPM core is using devm in places it probably shouldn't, but
> unwinding that requires more difficult all-driver changes.
>
> If you get stuck here you will have to fix the core code.
>
> > > > +	vtpm_dev->chip = tpmm_chip_alloc_pattern(vtpm_dev->pdev,
> > > > +	                                         &vtpm_tpm_ops,
> > > > +	                                         VTPM_DEV_PREFIX_CLIENT"%d");
> > >
> > > I agree with Jarkko, I don't see why these deserve special names..
> >
> > I did that because it becomes confusing when there are many /dev/tpm%d
> > type of devices. This way we can distinguish between a hardware
> > /dev/tpm%d and all these types of devices.
>
> That isn't really consistent with how linux kernel naming works these
> days. Use a udev rule in userspace if you really want this.

Ok, then I will just try to call them /dev/tpm%d.

> > > Presumably some namespace magic can be used to make them show up as
> > > tpm0 in a container?
> >
> > The magic is to have the container management stack create the device
> > pair. From the ioctl it learns the names of the devices that were
> > created; it then finds out the major/minor number of the created
> > device and has /dev/tpm0 with that major/minor created in the
> > container's /dev directory.
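The container-setup step described above could be sketched in shell. The device name and rootfs path here are illustrative assumptions; in reality the names come back from the create ioctl:

```shell
# Hypothetical sketch: the management stack learned the client device
# name (here /dev/vtpmc1) from the create ioctl; it reads the device's
# major/minor and exposes it as /dev/tpm0 in the container's /dev.
dev=/dev/vtpmc1                               # name reported by the ioctl (assumed)
rootfs=/var/lib/containers/demo/rootfs        # container root (assumed)
major=$((16#$(stat -c '%t' "$dev")))          # stat prints the major in hex
minor=$((16#$(stat -c '%T' "$dev")))          # ditto for the minor
mknod "$rootfs/dev/tpm0" c "$major" "$minor"  # needs CAP_MKNOD
chmod 600 "$rootfs/dev/tpm0"
```

The hex-to-decimal conversion matters because `stat %t`/`%T` print hexadecimal, while `mknod` expects plain numbers.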
> Except that isn't good enough - the IMA kernel side doesn't know that
> this tpm is now acting as the 'main' 'default' TPM.

Hooking the vTPM up to IMA requires another patch that I haven't shown since IMA namespacing isn't public yet. Basically we implement another ioctl() that is to be called before the clone() in order to 'reserve' a vtpm device pair for the calling process. During the clone() call the IMA namespacing code can query the vTPM driver for a 'reserved' device pair. Hooking IMA up after the clone() may also work, but in the case of docker/golang it's better to do this before, since the language libraries do a lot automatically after the clone.

> > The issue with sysfs is that sysfs is showing directory entries for
> > all vTPMs in all containers.
>
> That is not a problem to solve here. The sysfs will also show entries
> for hardware tpms too. You need some kind of namespace related solution
> that works for all tpms.
>
> I don't know what became of it, but there used to be a 'device
> namespaces' patch series that dealt with this kind of problem.

That would be good to have now; for now, the best way for me is to work around the problem entirely -- not registering with sysfs.

> Again, I suspect strongly this will need some kind of TPM/IMA
> namespace to actually do what you want.

Here things aren't so clear to me, either. Sysfs should really only show the devices that are relevant to a container, but it seems to show a lot more than that (the serial bus, i8042, etc.) that doesn't necessarily have relevance inside the container. So if there's a hardware TPM on the system, it probably shouldn't show its sysfs entries in the container. If we add a vTPM for the container, it would ideally show up as tpm0 inside the container's sysfs entries, but it doesn't do that unless it is the first device created and there's no hardware TPM. Since all of this isn't possible with sysfs, I made the choice to hide the device from sysfs entirely.
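The device-cgroup confinement mentioned later in this thread can be sketched with the cgroup-v1 devices controller. The cgroup name and the 248:1 major:minor below are illustrative assumptions, not values from the patch:

```shell
# Hypothetical sketch: confine a container to its own vtpm client
# chardev via the cgroup-v1 devices controller (requires root).
cg=/sys/fs/cgroup/devices/demo-container      # cgroup name assumed
mkdir -p "$cg"
echo 'a *:* rwm' > "$cg/devices.deny"         # start from deny-all
echo 'c 248:1 rw' > "$cg/devices.allow"       # the container's TPM chardev (assumed numbers)
echo "$container_pid" > "$cg/cgroup.procs"    # move the container's init task in
```

With such a setup, mknod'ing some other TPM's device node inside the container would still not grant access to it.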
If the hw tpm0 shows, fine, we can report BIOS measurements then. It seems like device subsystem namespacing would be needed to realize part of this, so that multiple tpm0s could exist, one in each device namespace.

> > If we don't want users to see the PCRs of other containers we set
> > this flag. The side effect of setting this flag is that the
> > container will not be able to see its own TPM device, either, but
> > presumably the user has commands to read PCRs etc.
>
> Erm, sysfs is just an obvious symptom, the container could also just
> create the dev node and access any tpm it likes, or use some future
> (?) IMA API with a tpm device number.

No, the device cgroup prevents that. Only the major/minor number of its dedicated TPM (client) chardev can be accessed.

> > VTPM_FLAG_NO_LOG: Containers boot/start without firmware that
> > would write an ACPI log. So in most cases one will choose the
> > VTPM_FLAG_NO_LOG flag since there is no log. Maybe that's a flag
> > the user shouldn't have control over.
>
> ? But there is no ACPI handle so this should already be naturally
> defeated by the core code. Why a flag?
>
> > > Another way to handle this is to return the FD for the server side of
> > > the new TPM here and then route accesses from the /dev/tpmX side to
> > > that FD. Avoid the extra char device. There is some kernel
> > > infrastructure that helps do this (anon fd or something, IIRC)
> >
> > Hm, I would prefer to keep the server-side char device since that way
> > one isn't tied to fd passing, and the TPM could terminate and be
> > restarted later on top of that server-side device while keeping the
> > client-side device usable beyond the server side shutting down.
>
> Don't be afraid of FD passing.
>
> vTPM daemon restart isn't really feasible anyhow as TPMs are
> stateful. The kernel is not prepared to survive a HW TPM reboot. If
> that problem were to be solved somehow then a scheme like systemd's
> FDSTORE=1 to keep the fd alive during a daemon crash could be used.
>
> The huge downside to using a master side dev node is that these things
> will leak. Killing the vtpm daemon will not clean up the slave
> interface and another command and sysadmin interaction will be needed
> to sort out the inevitable mess.

This is a scenario that works, though. The server side opened /dev/vtpms1, then closed it, and the client side disappeared: both devices /dev/vtpms1 and /dev/vtpmc1 were removed. The client side then shows the following:

# ls -l /proc/self/fd
total 0
lrwx------. 1 root root 64 Jan 20 21:34 0 -> /dev/pts/1
lrwx------. 1 root root 64 Jan 20 21:34 1 -> /dev/pts/1
lrwx------. 1 root root 64 Jan 20 21:34 100 -> /dev/vtpmc1 (deleted)
lrwx------. 1 root root 64 Jan 20 21:34 2 -> /dev/pts/1
lr-x------. 1 root root 64 Jan 20 21:34 3 -> /proc/23266/fd
# echo -en '\x00\xc1\x00\x00\x00\x0a\x00\x00\x00\x00' >&100
-bash: echo: write error: Input/output error

Once the client did an 'exec 100>&-', i.e. closed the fd, /dev/vtpmc1 could actually reappear. There's a 'keepalive' flag in this vtpm driver that keeps the device pair around even if the server side shuts down. If that had been set, /dev/vtpmc1 wouldn't have been deleted upon server-side closure.

> > > Then del becomes close() on the special fd
> >
> > And with that the client side device would *have* to close as well
> > because the backend now is inaccessible?
>
> The client side would see failures on all TPM commands it tries to
>
> Jason
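As an aside, the 10 bytes written to fd 100 in the transcript above form a minimal TPM 1.2 request header: a 2-byte tag (0x00c1, TPM_TAG_RQU_COMMAND), a 4-byte big-endian total length (0x0000000a = 10), and a 4-byte ordinal (0 here, not a real command, which is enough to provoke the I/O error):

```shell
# Dump the 10-byte TPM 1.2 request header used in the transcript:
#   bytes 0-1: tag 0x00c1 (TPM_TAG_RQU_COMMAND)
#   bytes 2-5: total length 0x0000000a (10 bytes)
#   bytes 6-9: ordinal 0x00000000 (not a valid command ordinal)
printf '\x00\xc1\x00\x00\x00\x0a\x00\x00\x00\x00' | od -An -tx1
```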