Re: RFC(v2): Audit Kernel Container IDs

From: Casey Schaufler <casey@schaufler-ca.com>
To: Richard Guy Briggs <rgb@redhat.com>
Cc: cgroups@vger.kernel.org,
	Linux Containers <containers@lists.linux-foundation.org>,
	Linux API <linux-api@vger.kernel.org>,
	Linux Audit <linux-audit@redhat.com>,
	Linux FS Devel <linux-fsdevel@vger.kernel.org>,
	Linux Kernel <linux-kernel@vger.kernel.org>,
	Linux Network Development <netdev@vger.kernel.org>,
	mszeredi@redhat.com, Andy Lutomirski <luto@kernel.org>,
	jlayton@redhat.com, Carlos O'Donell <carlos@redhat.com>,
	Al Viro <viro@zeniv.linux.org.uk>,
	David Howells <dhowells@redhat.com>, Simo Sorce <simo@redhat.com>,
	trondmy@primarydata.com, Eric Paris <eparis@parisplace.org>,
	"Serge E. Hallyn" <serge@hallyn.com>,
	"Eric W. Biederman" <ebiederm@xmission.com>
Subject: Re: RFC(v2): Audit Kernel Container IDs
Date: Thu, 19 Oct 2017 06:32:30 -0700
Message-ID: <18cb69a5-f998-0e6e-85df-7f4b9b768a6f@schaufler-ca.com>
In-Reply-To: <20171019000527.eio6dfsmujmtioyt@madcap2.tricolour.ca>

On 10/18/2017 5:05 PM, Richard Guy Briggs wrote:
> On 2017-10-17 01:10, Casey Schaufler wrote:
>> On 10/16/2017 5:33 PM, Richard Guy Briggs wrote:
>>> On 2017-10-12 16:33, Casey Schaufler wrote:
>>>> On 10/12/2017 7:14 AM, Richard Guy Briggs wrote:
>>>>> Containers are a userspace concept.  The kernel knows nothing of them.
>>>>>
>>>>> The Linux audit system needs a way to be able to track the container
>>>>> provenance of events and actions.  Audit needs the kernel's help to do
>>>>> this.
>>>>>
>>>>> Since the concept of a container is entirely a userspace concept, a
>>>>> registration from the userspace container orchestration system initiates
>>>>> this.  This will define a point in time and a set of resources
>>>>> associated with a particular container with an audit container ID.
>>>>>
>>>>> The registration is a pseudo filesystem (proc, since PID tree already
>>>>> exists) write of a u8[16] UUID representing the container ID to a file
>>>>> representing a process that will become the first process in a new
>>>>> container.  This write might place restrictions on mount namespaces
>>>>> required to define a container, or at least careful checking of
>>>>> namespaces in the kernel to verify permissions of the orchestrator so it
>>>>> can't change its own container ID.  A bind mount of nsfs may be
>>>>> necessary in the container orchestrator's mntNS.
>>>>> Note: Use a 128-bit scalar rather than a string to make compares faster
>>>>> and simpler.
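
To illustrate the note about using a 128-bit scalar rather than a string, here is a minimal userspace sketch in C. The type and function names (struct model_contid, model_contid_from_bytes, model_contid_equal) are hypothetical and do not come from the RFC, and the big-endian byte order is an assumption made only for this example.

```c
#include <assert.h>
#include <stdbool.h>
#include <stddef.h>
#include <stdint.h>

/* Hypothetical type: a 128-bit container ID held as two 64-bit halves. */
struct model_contid {
    uint64_t hi;
    uint64_t lo;
};

/* Build an ID from the 16 raw bytes of the u8[16] write.
 * Assumption: bytes are interpreted in big-endian order. */
struct model_contid model_contid_from_bytes(const uint8_t raw[16])
{
    struct model_contid id = { 0, 0 };

    assert(raw != NULL);
    for (int i = 0; i < 8; i++) {
        id.hi = (id.hi << 8) | raw[i];
        id.lo = (id.lo << 8) | raw[8 + i];
    }
    return id;
}

/* Equality is two integer compares, with no string handling involved. */
bool model_contid_equal(struct model_contid a, struct model_contid b)
{
    return a.hi == b.hi && a.lo == b.lo;
}
```

With this layout, comparing two IDs costs two integer comparisons, which is the "faster and simpler" property the note is after.
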
>>>>>
>>>>> Require a new CAP_CONTAINER_ADMIN to be able to carry out the
>>>>> registration.
>>>> Hang on. If containers are a user space concept, how can
>>>> you want CAP_CONTAINER_ANYTHING? If there's no such thing as
>>>> a container, how can you be asking for a capability to manage
>>>> them?
>>> There is such a thing, but the kernel doesn't know about it yet.
>> Then how can it be the kernel's place to control access to a
>> container resource, that is, the containerID?
> Ok, let me try to address your objections.
>
> The kernel can know enough to refuse to set it again if it is already
> set, and to deny the action if the user doesn't have permission to set
> it.  How is this different from loginuid and sessionid?
>>>   This
>>> same situation exists for loginuid and sessionid which are userspace
>>> concepts that the kernel tracks for the convenience of userspace.
>> Ah, no. Loginuid identifies a user, which is a kernel concept in
>> that a user is defined by the uid.
> This simple explanation doesn't help me.  What makes that a kernel
> concept?  The fact that it is stored and compared in more than one
> place?
>
>> The session ID has well defined kernel semantics. You're trying to say
>> that the containerID is an opaque value that is meaningless to the
>> kernel, but you still want the kernel to protect it. How can the
>> kernel know if it is protecting it correctly?
> How so?  A userspace process triggers this.  Does the kernel know what
> these values mean?  Does it do anything with them other than report
> them or allow audit to filter them?  It is given some instructions on
> how to treat it.
>
> This is what we're trying to do with the containerID.
>
>>>   As
>>> for its name, I'm not particularly picky, so if you don't like
>>> CAP_CONTAINER_* then I'm fine with CAP_AUDIT_CONTAINERID.  It really
>>> needs to be distinct from CAP_AUDIT_WRITE and CAP_AUDIT_CONTROL since we
>>> don't want to give the ability to set a containerID to any process that
>>> is able to do audit logging (such as vsftpd) and similarly we don't want
>>> to give the orchestrator the ability to control the setup of the audit
>>> daemon.
>> Sorry, but what aspect of the kernel security policy is this
>> capability supposed to protect? That's what capabilities are
>> for, not the undefined support of undefined user-space behavior.
> Similarly, loginuids and sessionIDs are only used for audit tracking and
> filtering.

Tell me again why you're not reusing either of these?

>
>> If it's audit behavior, you want CAP_AUDIT_CONTROL. If it's
>> more than audit behavior you have to define what system security
>> policy you're dealing with in order to pick the right capability.
> It isn't audit behaviour (yet), it is audit reporting information, a
> level above simply writing logs and a level below controlling daemon
> behaviour.

You are changing audit information. That's CAP_AUDIT_CONTROL.

>
>> We get this request pretty regularly. "I need my own capability
>> because I have a niche thing that isn't part of the system security
>> policy but that is important!" Fit the containerID into the
>> system security policy, and if that results in using CAP_SYS_ADMIN,
>> oh well.
> There's far too much piled into CAP_SYS_ADMIN already, which is making
> capabilities less and less useful.

No. The value of capabilities is in separating privilege from DAC.
Granularity is a bonus. The current granularity is too fine in some
cases and too coarse in others.

> I realize that capabilities are
> limited compared with netlink message types, but this falls in between
> the abilities needed by CAP_AUDIT_CONTROL and CAP_AUDIT_WRITE.

There is *nothing* about your use that makes a compelling
argument for a new capability. If you can't decide between
CAP_AUDIT_CONTROL and CAP_AUDIT_WRITE, require both.
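
A minimal sketch of what "require both" amounts to, written as a userspace model in C rather than kernel code. The function names are hypothetical. The bit numbers 29 and 30 are the values of CAP_AUDIT_WRITE and CAP_AUDIT_CONTROL in <linux/capability.h>, renamed here so the example does not depend on that header.

```c
#include <assert.h>
#include <stdbool.h>
#include <stdint.h>

/* Bit numbers as in <linux/capability.h>, renamed for this sketch. */
#define MODEL_CAP_AUDIT_WRITE   29
#define MODEL_CAP_AUDIT_CONTROL 30

/* Turn a capability number into its bit in a 64-bit mask. */
uint64_t model_cap_bit(int cap)
{
    assert(cap >= 0 && cap < 64);
    return UINT64_C(1) << cap;
}

/* Hypothetical check: allow setting the containerID only when the
 * effective set holds BOTH audit capabilities. */
bool model_may_set_contid(uint64_t effective)
{
    uint64_t need = model_cap_bit(MODEL_CAP_AUDIT_WRITE) |
                    model_cap_bit(MODEL_CAP_AUDIT_CONTROL);

    return (effective & need) == need;
}
```

A process holding only CAP_AUDIT_WRITE (the vsftpd case raised earlier in the thread) fails this check, as does one holding only CAP_AUDIT_CONTROL.
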

>
> I'll continue on Steve Grubb's comment...
>
>>>>>   At that time, record the target container's user-supplied
>>>>> container identifier along with the target container's first process
>>>>> (which may become the target container's "init" process) process ID
>>>>> (referenced from the initial PID namespace), all namespace IDs (in the
>>>>> form of an nsfs device number and inode number tuple) in a new auxiliary
>>>>> record AUDIT_CONTAINER with a qualifying op=$action field.
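
For reference, the (device number, inode number) tuple mentioned above can already be read from userspace by calling stat(2) on a /proc/<pid>/ns/* entry. The sketch below is a hedged illustration in C. The struct and function names are hypothetical, and it shows only how such a tuple can be obtained and compared, not how the proposed audit record would be built.

```c
#include <assert.h>
#include <stdbool.h>
#include <stddef.h>
#include <stdint.h>
#include <sys/stat.h>

/* Hypothetical holder for a namespace ID: nsfs device plus inode. */
struct model_ns_id {
    uint64_t dev;
    uint64_t ino;
};

/* Fill *out from a path such as "/proc/self/ns/net".
 * Returns 0 on success, -1 if stat(2) fails. */
int model_ns_id_from_path(const char *path, struct model_ns_id *out)
{
    struct stat st;

    assert(path != NULL && out != NULL);
    if (stat(path, &st) != 0)
        return -1;
    out->dev = (uint64_t)st.st_dev;
    out->ino = (uint64_t)st.st_ino;
    return 0;
}

/* Two namespace IDs match only when both halves of the tuple match. */
bool model_ns_id_equal(struct model_ns_id a, struct model_ns_id b)
{
    return a.dev == b.dev && a.ino == b.ino;
}
```

Two processes in the same network namespace would yield equal tuples for their respective /proc/<pid>/ns/net entries.
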
>>>>>
>>>>> Issue a new auxiliary record AUDIT_CONTAINER_INFO for each valid
>>>>> container ID present on an auditable action or event.
>>>>>
>>>>> Forked and cloned processes inherit their parent's container ID,
>>>>> referenced in the process' task_struct.
>>>>>
>>>>> Mimic setns(2) and return an error if the process has already initiated
>>>>> threading or forked since this registration should happen before the
>>>>> process execution is started by the orchestrator and hence should not
>>>>> yet have any threads or children.  If this is deemed overly restrictive,
>>>>> switch all threads and children to the new containerID.
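
The registration rules described so far (set once, refuse if the task already has threads or children, inherit on fork) can be modelled in a few lines of userspace C. This is a sketch under stated assumptions: struct model_task and both functions are hypothetical stand-ins for task_struct handling, and the error codes (-EEXIST, -EBUSY) are placeholders chosen for the example, not values taken from the RFC or from setns(2).

```c
#include <assert.h>
#include <errno.h>
#include <stdbool.h>
#include <stddef.h>
#include <stdint.h>

/* Hypothetical stand-in for the relevant task_struct fields. */
struct model_task {
    uint64_t contid_hi;
    uint64_t contid_lo;
    bool contid_set;
    unsigned int nr_threads;   /* 1 means single-threaded */
    unsigned int nr_children;
};

/* Register a container ID on a task that has not started running yet. */
int model_set_contid(struct model_task *t, uint64_t hi, uint64_t lo)
{
    assert(t != NULL);
    if (t->contid_set)
        return -EEXIST;        /* placeholder: ID may not be reset */
    if (t->nr_threads > 1 || t->nr_children > 0)
        return -EBUSY;         /* placeholder: already threaded/forked */
    t->contid_hi = hi;
    t->contid_lo = lo;
    t->contid_set = true;
    return 0;
}

/* Forked children start with a copy of the parent's container ID. */
struct model_task model_fork(struct model_task *parent)
{
    struct model_task child;

    assert(parent != NULL);
    child = *parent;
    child.nr_threads = 1;
    child.nr_children = 0;
    parent->nr_children++;
    return child;
}
```

The less restrictive alternative mentioned above (switching all threads and children to the new containerID) would replace the -EBUSY branch with a walk over those tasks.
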
>>>>>
>>>>> Trust the orchestrator to judiciously use and restrict CAP_CONTAINER_ADMIN.
>>>>>
>>>>> Log the creation of every namespace, inheriting/adding its spawning
>>>>> process' containerID(s), if applicable.  Include the spawning and
>>>>> spawned namespace IDs (device and inode number tuples).
>>>>> [AUDIT_NS_CREATE, AUDIT_NS_DESTROY] [clone(2), unshare(2), setns(2)]
>>>>> Note: At this point it appears only network namespaces may need to track
>>>>> container IDs apart from processes since incoming packets may cause an
>>>>> auditable event before being associated with a process.
>>>>>
>>>>> Log the destruction of every namespace when it is no longer used by any
>>>>> process, including the namespace IDs (device and inode number tuples).
>>>>> [AUDIT_NS_DESTROY] [process exit, unshare(2), setns(2)]
>>>>>
>>>>> Issue a new auxiliary record AUDIT_NS_CHANGE listing (opt: op=$action)
>>>>> the parent and child namespace IDs for any changes to a process'
>>>>> namespaces. [setns(2)]
>>>>> Note: It may be possible to combine AUDIT_NS_* record formats and
>>>>> distinguish them with an op=$action field depending on the fields
>>>>> required for each message type.
>>>>>
>>>>> When a container ceases to exist because the last process in that
>>>>> container has exited, and hence the last namespace has been destroyed
>>>>> and its refcount has dropped to zero, log the fact.
>>>>> (This latter is likely needed for certification accountability.)  A
>>>>> container object may need a list of processes and/or namespaces.
>>>>>
>>>>> A namespace cannot directly migrate from one container to another but
>>>>> could be assigned to a newly spawned container.  A namespace can be
>>>>> moved from one container to another indirectly by having that namespace
>>>>> used in a second process in another container and then ending all the
>>>>> processes in the first container.
>>>>>
>>>>> (v2)
>>>>> - switch from u64 to u128 UUID
>>>>> - switch from "signal" and "trigger" to "register"
>>>>> - restrict registration to single process or force all threads and children into same container
>>>>>
>>>>> - RGB
>>> - RGB
>>>
>>> --
>>> Richard Guy Briggs <rgb@redhat.com>
>>> Sr. S/W Engineer, Kernel Security, Base Operating Systems
>>> Remote, Ottawa, Red Hat Canada
>>> IRC: rgb, SunRaycer
>>> Voice: +1.647.777.2635, Internal: (81) 32635
>>>
> - RGB