From mboxrd@z Thu Jan 1 00:00:00 1970
Return-Path:
Received: (majordomo@vger.kernel.org) by vger.kernel.org via listexpand id S1752796AbdJLOOm (ORCPT ); Thu, 12 Oct 2017 10:14:42 -0400
Received: from mx1.redhat.com ([209.132.183.28]:50624 "EHLO mx1.redhat.com" rhost-flags-OK-OK-OK-OK) by vger.kernel.org with ESMTP id S1751924AbdJLOOj (ORCPT ); Thu, 12 Oct 2017 10:14:39 -0400
DMARC-Filter: OpenDMARC Filter v1.3.2 mx1.redhat.com EF5C0A0363
Authentication-Results: ext-mx10.extmail.prod.ext.phx2.redhat.com; dmarc=none (p=none dis=none) header.from=redhat.com
Authentication-Results: ext-mx10.extmail.prod.ext.phx2.redhat.com; spf=fail smtp.mailfrom=rgb@redhat.com
Date: Thu, 12 Oct 2017 10:14:00 -0400
From: Richard Guy Briggs
To: cgroups@vger.kernel.org, Linux Containers, Linux API, Linux Audit, Linux FS Devel, Linux Kernel, Linux Network Development
Cc: Simo Sorce, "Carlos O'Donell", Aristeu Rozanski, David Howells, "Eric W. Biederman", Eric Paris, jlayton@redhat.com, Andy Lutomirski, mszeredi@redhat.com, Paul Moore, "Serge E. Hallyn", Steve Grubb, trondmy@primarydata.com, Al Viro
Subject: RFC(v2): Audit Kernel Container IDs
Message-ID: <20171012141359.saqdtnodwmbz33b2@madcap2.tricolour.ca>
MIME-Version: 1.0
Content-Type: text/plain; charset=us-ascii
Content-Disposition: inline
User-Agent: NeoMutt/20170914 (1.9.0)
X-Greylist: Sender IP whitelisted, not delayed by milter-greylist-4.5.16 (mx1.redhat.com [10.5.110.39]); Thu, 12 Oct 2017 14:14:39 +0000 (UTC)
Sender: linux-kernel-owner@vger.kernel.org
List-ID:
X-Mailing-List: linux-kernel@vger.kernel.org

Containers are a userspace concept.  The kernel knows nothing of them.

The Linux audit system needs a way to be able to track the container
provenance of events and actions.  Audit needs the kernel's help to do
this.

Since the concept of a container is entirely a userspace concept, a
registration from the userspace container orchestration system initiates
this.
This will define a point in time and a set of resources associated with
a particular container with an audit container ID.

The registration is a pseudo filesystem (proc, since PID tree already
exists) write of a u8[16] UUID representing the container ID to a file
representing a process that will become the first process in a new
container.  This write might place restrictions on mount namespaces
required to define a container, or at least require careful checking of
namespaces in the kernel to verify permissions of the orchestrator so it
can't change its own container ID.  A bind mount of nsfs may be
necessary in the container orchestrator's mntNS.
Note: Use a 128-bit scalar rather than a string to make comparisons
faster and simpler.

Require a new CAP_CONTAINER_ADMIN to be able to carry out the
registration.  At that time, record the target container's user-supplied
container identifier along with the process ID (referenced from the
initial PID namespace) of the target container's first process (which
may become the target container's "init" process) and all namespace IDs
(in the form of a nsfs device number and inode number tuple) in a new
auxiliary record AUDIT_CONTAINER with a qualifying op=$action field.

Issue a new auxiliary record AUDIT_CONTAINER_INFO for each valid
container ID present on an auditable action or event.

Forked and cloned processes inherit their parent's container ID,
referenced in the process' task_struct.

Mimic setns(2) and return an error if the process has already initiated
threading or forked since this registration should happen before the
process execution is started by the orchestrator and hence should not
yet have any threads or children.  If this is deemed overly restrictive,
switch all threads and children to the new containerID.

Trust the orchestrator to judiciously use and restrict
CAP_CONTAINER_ADMIN.

Log the creation of every namespace, inheriting/adding its spawning
process' containerID(s), if applicable.
Include the spawning and spawned namespace IDs (device and inode number
tuples).
[AUDIT_NS_CREATE, AUDIT_NS_DESTROY] [clone(2), unshare(2), setns(2)]
Note: At this point it appears only network namespaces may need to track
container IDs apart from processes since incoming packets may cause an
auditable event before being associated with a process.

Log the destruction of every namespace when it is no longer used by any
process.  Include the namespace IDs (device and inode number tuples).
[AUDIT_NS_DESTROY] [process exit, unshare(2), setns(2)]

Issue a new auxiliary record AUDIT_NS_CHANGE listing (opt: op=$action)
the parent and child namespace IDs for any changes to a process'
namespaces. [setns(2)]
Note: It may be possible to combine AUDIT_NS_* record formats and
distinguish them with an op=$action field depending on the fields
required for each message type.

When a container ceases to exist because the last process in that
container has exited, and hence the last namespace has been destroyed
and its refcount has dropped to zero, log the fact.
(This latter is likely needed for certification accountability.)  A
container object may need a list of processes and/or namespaces.

A namespace cannot directly migrate from one container to another but
could be assigned to a newly spawned container.  A namespace can be
moved from one container to another indirectly by having that namespace
used in a second process in another container and then ending all the
processes in the first container.

(v2)
- switch from u64 to u128 UUID
- switch from "signal" and "trigger" to "register"
- restrict registration to single process or force all threads and
  children into same container

- RGB

-- 
Richard Guy Briggs
Sr.
S/W Engineer, Kernel Security, Base Operating Systems
Remote, Ottawa, Red Hat Canada
IRC: rgb, SunRaycer
Voice: +1.647.777.2635, Internal: (81) 32635
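
For illustration only, here is a rough userspace sketch of two ideas from
the proposal: the u8[16] container ID compared as a 128-bit scalar, and
namespace IDs expressed as (device, inode) tuples.  It is not part of the
RFC.  The function names are made up for this example, and the
registration path is an assumption supplied by the caller, since the
proposal does not name the proc file.  The namespace lookup relies on
the existing /proc/<pid>/ns links and so only works on Linux.

```python
import os
import uuid


def container_id_payload(container_id: uuid.UUID) -> bytes:
    """Return the 16 raw bytes (the u8[16] form) of a container ID."""
    return container_id.bytes


def same_container(a: bytes, b: bytes) -> bool:
    """Compare two u8[16] IDs as 128-bit scalars rather than as strings."""
    return int.from_bytes(a, "big") == int.from_bytes(b, "big")


def namespace_ids(pid: str = "self") -> dict:
    """Map each namespace type of a process to its (device, inode) tuple.

    Uses stat() on the nsfs links under /proc/<pid>/ns (Linux only).
    """
    ns_dir = f"/proc/{pid}/ns"
    ids = {}
    for name in sorted(os.listdir(ns_dir)):
        st = os.stat(os.path.join(ns_dir, name))
        ids[name] = (st.st_dev, st.st_ino)
    return ids


def register(proc_file: str, container_id: uuid.UUID) -> None:
    """Write the raw 16-byte ID to a registration file.

    HYPOTHETICAL: the RFC does not name the file, so the caller passes
    the path.  A real kernel interface would also check the proposed
    CAP_CONTAINER_ADMIN capability and could reject the write.
    """
    with open(proc_file, "wb") as f:
        f.write(container_id_payload(container_id))
```

An orchestrator following this sketch would generate the ID with
uuid.uuid4(), call register() on the file for the process that will
become the container's first process, and could call namespace_ids()
to see the same device and inode tuples the audit records are meant to
carry.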