linux-kernel.vger.kernel.org archive mirror
 help / color / mirror / Atom feed
* [PATCH V4 0/8] namespaces: log namespaces per task
@ 2014-08-21  1:09 Richard Guy Briggs
  2014-08-21  1:09 ` [PATCH V4 1/8] namespaces: assign each namespace instance a serial number Richard Guy Briggs
                   ` (8 more replies)
  0 siblings, 9 replies; 30+ messages in thread
From: Richard Guy Briggs @ 2014-08-21  1:09 UTC (permalink / raw)
  To: linux-audit, linux-kernel, containers, linux-api
  Cc: Richard Guy Briggs, arozansk, eparis, sgrubb, ebiederm, serge

The purpose is to track namespace instances in use by logged processes from the
perspective of init_*_ns by assigning each a per-kernel, per-boot serial
number.

1/8 defines a function to generate them and assigns them.

Use a serial number per namespace (unique across one boot of one kernel)
instead of the inode number (which is claimed to have had the right to change
reserved and is not necessarily unique if there is more than one proc fs).  It
could be argued that the inode numbers have now become a defacto interface and
can't change now, but I'm proposing this approach to see if this helps address
some of the objections to the earlier patchset.

2/8 adds access functions to get to the serial numbers in a similar way to
inode access for namespace proc operations.

3/8 implements, as suggested by Serge Hallyn, making these serial numbers
available in /proc/self/ns/{ipc,mnt,net,pid,user,uts}_snum.  I chose "snum"
instead of "seq" for consistency with inum and there are a number of other uses
of "seq" in the namespace code.

4/8 Document proc's ns entries structure in Documentation/filesystems/proc.txt

5/8 exposes proc's ns entries structure which lists a number of useful
operations per namespace type for other subsystems to use.

6/8 provides an example of usage for audit_log_task_info() which is used by
syscall audits, among others.  audit_log_task() and audit_common_recv_message()
would be other potential use cases.

Proposed output format:
This differs slightly from Aristeu's patch because of the label conflict with
"pid=" due to including it in existing records rather than it being a seperate
record.  It has now returned to being a seperate record.  The serial numbers
are printed in hex.
	type=NS_INFO msg=audit(1408577535.306:82):  netns=8 utsns=2 ipcns=1 pidns=4 userns=3 mntns=5

7/8 tracks the creation and deletion of of namespaces, listing the type of
namespace instance, related namespace id if there is one and the newly minted
serial number.

Proposed output format for initial namespace creation:
	type=AUDIT_NS_INIT_UTS msg=audit(1408577534.868:5): pid=1 uid=0 auid=4294967295 ses=4294967295 subj=kernel old_utsns=0 utsns=2 res=1
	type=AUDIT_NS_INIT_USER msg=audit(1408577534.868:6): pid=1 uid=0 auid=4294967295 ses=4294967295 subj=kernel old_userns=0 userns=3 res=1
	type=AUDIT_NS_INIT_PID msg=audit(1408577534.868:7): pid=1 uid=0 auid=4294967295 ses=4294967295 subj=kernel old_pidns=0 pidns=4 res=1
	type=AUDIT_NS_INIT_MNT msg=audit(1408577534.868:8): pid=1 uid=0 auid=4294967295 ses=4294967295 subj=kernel old_mntns=0 mntns=5 res=1
	type=AUDIT_NS_INIT_IPC msg=audit(1408577534.868:9): pid=1 uid=0 auid=4294967295 ses=4294967295 subj=kernel old_ipcns=0 ipcns=1 res=1
	type=AUDIT_NS_INIT_NET msg=audit(1408577533.500:10): pid=1 uid=0 auid=4294967295 ses=4294967295 subj=kernel old_netns=0 netns=7 res=1

And a CLONE action would result in:
	type=type=AUDIT_NS_INIT_NET msg=audit(1408577535.306:81): pid=481 uid=0 auid=4294967295 ses=4294967295 subj=system_u:system_r:init_t:s0 old_netns=7 netns=8 res=1
	type=type=AUDIT_NS_INIT_MNT msg=audit(1408577535.307:83): pid=481 uid=0 auid=4294967295 ses=4294967295 subj=system_u:system_r:init_t:s0 old_mntns=5 mntns=9 res=1

While deleting a namespace would result in:
	type=type=AUDIT_NS_DEL_MNT msg=audit(1408577552.221:85): pid=481 uid=0 auid=4294967295 ses=4294967295 subj=system_u:system_r:init_t:s0 mntns=9 res=1

8/8 change audit startup from __initcall to subsys_initcall to get it started
earlier to be able to receive initial namespace log messages.


v3 -> v4:
	Seperate out the NS_INFO message from the SYSCALL message.
	Moved audit_log_namespace_info() out of audit_log_task_info().
	Use a seperate message type per namespace type for each of INIT/DEL.
	Make ns= easier to search across NS_INFO and NS_INIT/DEL_XXX msg types.
	Add /proc/<pid>/ns/ documentation.
	Fix dynamic initial ns logging.

v2 -> v3:
	Use atomic64_t in ns_serial to simplify it.
	Avoid funciton duplication in proc, keying on dentry.
	Squash down audit patch to avoid rcu sleep issues.
	Add tracking for creation and deletion of namespace instances.

v1 -> v2:
	Avoid rollover by switching from an int to a long long.
	Change rollover behaviour from simply avoiding zero to raising a BUG.
	Expose serial numbers in /proc/<pid>/ns/*_snum.
	Expose ns_entries and use it in audit.


Notes:
As for CAP_AUDIT_READ, a patchset has been accepted upstream to check
capabilities of userspace processes that try to join netlink broadcast groups.

This set does not try to solve the non-init namespace audit messages and
auditd problem yet.  That will come later, likely with additional auditd
instances running in another namespace with a limited ability to influence the
master auditd.  I echo Eric B's idea that messages destined for different
namespaces would have to be tailored for that namespace with references that
make sense (such as the right pid number reported to that pid namespace, and
not leaking info about parents or peers).

Questions:
Is there a way to link serial numbers of namespaces involved in migration of a
container to another kernel?  It sounds like what is needed is a part of a
mangement application that is able to pull the audit records from constituent
hosts to build an audit trail of a container.

What additional events should list this information?

Does this present any problematic information leaks?  Only CAP_AUDIT_CONTROL
(and now CAP_AUDIT_READ) in init_user_ns can get to this information in
the init namespace at the moment from audit.  *However*, the addition of the
proc/<pid>/ns/*_snum does make it available to other processes now.


Richard Guy Briggs (8):
  namespaces: assign each namespace instance a serial number
  namespaces: expose namespace instance serial number in proc_ns_operations
  namespaces: expose ns instance serial numbers in proc
  Documentation: add a section for /proc/<pid>/ns/
  namespaces: expose ns_entries
  audit: log namespace serial numbers
  audit: log creation and deletion of namespace instances
  audit: initialize at subsystem time rather than device time

 Documentation/filesystems/proc.txt |   16 +++++++
 fs/mount.h                         |    1 +
 fs/namespace.c                     |   20 +++++++++
 fs/proc/namespaces.c               |   35 ++++++++++++----
 include/linux/audit.h              |   15 +++++++
 include/linux/ipc_namespace.h      |    1 +
 include/linux/nsproxy.h            |    8 ++++
 include/linux/pid_namespace.h      |    1 +
 include/linux/proc_ns.h            |    2 +
 include/linux/user_namespace.h     |    1 +
 include/linux/utsname.h            |    1 +
 include/net/net_namespace.h        |    1 +
 include/uapi/linux/audit.h         |   13 ++++++
 init/version.c                     |    1 +
 ipc/msgutil.c                      |    1 +
 ipc/namespace.c                    |   20 +++++++++
 kernel/audit.c                     |   78 +++++++++++++++++++++++++++++++++++-
 kernel/auditsc.c                   |    2 +
 kernel/nsproxy.c                   |   17 ++++++++
 kernel/pid.c                       |    1 +
 kernel/pid_namespace.c             |   19 +++++++++
 kernel/user.c                      |    1 +
 kernel/user_namespace.c            |   20 +++++++++
 kernel/utsname.c                   |   21 ++++++++++
 net/core/net_namespace.c           |   27 ++++++++++++-
 security/integrity/ima/ima_api.c   |    2 +
 26 files changed, 314 insertions(+), 11 deletions(-)


^ permalink raw reply	[flat|nested] 30+ messages in thread

* [PATCH V4 1/8] namespaces: assign each namespace instance a serial number
  2014-08-21  1:09 [PATCH V4 0/8] namespaces: log namespaces per task Richard Guy Briggs
@ 2014-08-21  1:09 ` Richard Guy Briggs
  2014-08-21 21:22   ` Andy Lutomirski
  2014-08-23 12:05   ` Eric W. Biederman
  2014-08-21  1:09 ` [PATCH V4 2/8] namespaces: expose namespace instance serial number in proc_ns_operations Richard Guy Briggs
                   ` (7 subsequent siblings)
  8 siblings, 2 replies; 30+ messages in thread
From: Richard Guy Briggs @ 2014-08-21  1:09 UTC (permalink / raw)
  To: linux-audit, linux-kernel, containers, linux-api
  Cc: Richard Guy Briggs, arozansk, eparis, sgrubb, ebiederm, serge

Generate and assign a serial number per namespace instance since boot.

Use a serial number per namespace (unique across one boot of one kernel)
instead of the inode number (which is claimed to have had the right to change
reserved and is not necessarily unique if there is more than one proc fs) to
uniquely identify it per kernel boot.

Signed-off-by: Richard Guy Briggs <rgb@redhat.com>
---
 fs/mount.h                     |    1 +
 fs/namespace.c                 |    1 +
 include/linux/ipc_namespace.h  |    1 +
 include/linux/nsproxy.h        |    8 ++++++++
 include/linux/pid_namespace.h  |    1 +
 include/linux/user_namespace.h |    1 +
 include/linux/utsname.h        |    1 +
 include/net/net_namespace.h    |    1 +
 init/version.c                 |    1 +
 ipc/msgutil.c                  |    1 +
 ipc/namespace.c                |    2 ++
 kernel/nsproxy.c               |   17 +++++++++++++++++
 kernel/pid.c                   |    1 +
 kernel/pid_namespace.c         |    2 ++
 kernel/user.c                  |    1 +
 kernel/user_namespace.c        |    2 ++
 kernel/utsname.c               |    2 ++
 net/core/net_namespace.c       |    8 +++++++-
 18 files changed, 51 insertions(+), 1 deletions(-)

diff --git a/fs/mount.h b/fs/mount.h
index d55297f..c076f99 100644
--- a/fs/mount.h
+++ b/fs/mount.h
@@ -5,6 +5,7 @@
 struct mnt_namespace {
 	atomic_t		count;
 	unsigned int		proc_inum;
+	long long	serial_num;
 	struct mount *	root;
 	struct list_head	list;
 	struct user_namespace	*user_ns;
diff --git a/fs/namespace.c b/fs/namespace.c
index 182bc41..9af49ff 100644
--- a/fs/namespace.c
+++ b/fs/namespace.c
@@ -2486,6 +2486,7 @@ static struct mnt_namespace *alloc_mnt_ns(struct user_namespace *user_ns)
 		kfree(new_ns);
 		return ERR_PTR(ret);
 	}
+	new_ns->serial_num = ns_serial();
 	new_ns->seq = atomic64_add_return(1, &mnt_ns_seq);
 	atomic_set(&new_ns->count, 1);
 	new_ns->root = NULL;
diff --git a/include/linux/ipc_namespace.h b/include/linux/ipc_namespace.h
index 35e7eca..8ccfb2d 100644
--- a/include/linux/ipc_namespace.h
+++ b/include/linux/ipc_namespace.h
@@ -69,6 +69,7 @@ struct ipc_namespace {
 	struct user_namespace *user_ns;
 
 	unsigned int	proc_inum;
+	long long	serial_num;
 };
 
 extern struct ipc_namespace init_ipc_ns;
diff --git a/include/linux/nsproxy.h b/include/linux/nsproxy.h
index b4ec59d..8e5fe0d 100644
--- a/include/linux/nsproxy.h
+++ b/include/linux/nsproxy.h
@@ -66,6 +66,14 @@ static inline struct nsproxy *task_nsproxy(struct task_struct *tsk)
 	return rcu_dereference(tsk->nsproxy);
 }
 
+long long ns_serial(void);
+enum {
+	NS_IPC_INIT_SN	= 1,
+	NS_UTS_INIT_SN	= 2,
+	NS_USER_INIT_SN	= 3,
+	NS_PID_INIT_SN	= 4,
+};
+
 int copy_namespaces(unsigned long flags, struct task_struct *tsk);
 void exit_task_namespaces(struct task_struct *tsk);
 void switch_task_namespaces(struct task_struct *tsk, struct nsproxy *new);
diff --git a/include/linux/pid_namespace.h b/include/linux/pid_namespace.h
index 7246ef3..4d8023e 100644
--- a/include/linux/pid_namespace.h
+++ b/include/linux/pid_namespace.h
@@ -43,6 +43,7 @@ struct pid_namespace {
 	int hide_pid;
 	int reboot;	/* group exit code if this pidns was rebooted */
 	unsigned int proc_inum;
+	long long	serial_num;
 };
 
 extern struct pid_namespace init_pid_ns;
diff --git a/include/linux/user_namespace.h b/include/linux/user_namespace.h
index 4836ba3..159ac26 100644
--- a/include/linux/user_namespace.h
+++ b/include/linux/user_namespace.h
@@ -27,6 +27,7 @@ struct user_namespace {
 	kuid_t			owner;
 	kgid_t			group;
 	unsigned int		proc_inum;
+	long long	serial_num;
 
 	/* Register of per-UID persistent keyrings for this namespace */
 #ifdef CONFIG_PERSISTENT_KEYRINGS
diff --git a/include/linux/utsname.h b/include/linux/utsname.h
index 239e277..8490197 100644
--- a/include/linux/utsname.h
+++ b/include/linux/utsname.h
@@ -24,6 +24,7 @@ struct uts_namespace {
 	struct new_utsname name;
 	struct user_namespace *user_ns;
 	unsigned int proc_inum;
+	long long	serial_num;
 };
 extern struct uts_namespace init_uts_ns;
 
diff --git a/include/net/net_namespace.h b/include/net/net_namespace.h
index 361d260..5238a06 100644
--- a/include/net/net_namespace.h
+++ b/include/net/net_namespace.h
@@ -61,6 +61,7 @@ struct net {
 	struct user_namespace   *user_ns;	/* Owning user namespace */
 
 	unsigned int		proc_inum;
+	long long	serial_num;
 
 	struct proc_dir_entry 	*proc_net;
 	struct proc_dir_entry 	*proc_net_stat;
diff --git a/init/version.c b/init/version.c
index 1a4718e..cfdcb85 100644
--- a/init/version.c
+++ b/init/version.c
@@ -36,6 +36,7 @@ struct uts_namespace init_uts_ns = {
 	},
 	.user_ns = &init_user_ns,
 	.proc_inum = PROC_UTS_INIT_INO,
+	.serial_num = NS_UTS_INIT_SN /* ns_serial() */,
 };
 EXPORT_SYMBOL_GPL(init_uts_ns);
 
diff --git a/ipc/msgutil.c b/ipc/msgutil.c
index 7e70959..9aa66ae 100644
--- a/ipc/msgutil.c
+++ b/ipc/msgutil.c
@@ -32,6 +32,7 @@ struct ipc_namespace init_ipc_ns = {
 	.count		= ATOMIC_INIT(1),
 	.user_ns = &init_user_ns,
 	.proc_inum = PROC_IPC_INIT_INO,
+	.serial_num = NS_IPC_INIT_SN /* ns_serial() */,
 };
 
 atomic_t nr_ipc_ns = ATOMIC_INIT(1);
diff --git a/ipc/namespace.c b/ipc/namespace.c
index 59451c1..76dac5c 100644
--- a/ipc/namespace.c
+++ b/ipc/namespace.c
@@ -41,6 +41,8 @@ static struct ipc_namespace *create_ipc_ns(struct user_namespace *user_ns,
 	}
 	atomic_inc(&nr_ipc_ns);
 
+	ns->serial_num = ns_serial();
+
 	sem_init_ns(ns);
 	msg_init_ns(ns);
 	shm_init_ns(ns);
diff --git a/kernel/nsproxy.c b/kernel/nsproxy.c
index 8e78110..93cb380 100644
--- a/kernel/nsproxy.c
+++ b/kernel/nsproxy.c
@@ -41,6 +41,23 @@ struct nsproxy init_nsproxy = {
 #endif
 };
 
+/**
+ * ns_serial - compute a serial number for the namespace
+ *
+ * Compute a serial number for the namespace to uniquely identify it in
+ * audit records.
+ */
+long long ns_serial(void)
+{
+	static atomic64_t serial = ATOMIC_INIT(4); /* reserved for IPC, UTS, user, PID */
+	long long ret;
+
+	ret = atomic64_add_return(1, &serial);
+	BUG_ON(!ret);
+
+	return ret;
+}
+
 static inline struct nsproxy *create_nsproxy(void)
 {
 	struct nsproxy *nsproxy;
diff --git a/kernel/pid.c b/kernel/pid.c
index 9b9a266..3bf7127 100644
--- a/kernel/pid.c
+++ b/kernel/pid.c
@@ -80,6 +80,7 @@ struct pid_namespace init_pid_ns = {
 	.child_reaper = &init_task,
 	.user_ns = &init_user_ns,
 	.proc_inum = PROC_PID_INIT_INO,
+	.serial_num = NS_PID_INIT_SN /* ns_serial() */,
 };
 EXPORT_SYMBOL_GPL(init_pid_ns);
 
diff --git a/kernel/pid_namespace.c b/kernel/pid_namespace.c
index db95d8e..40a8b36 100644
--- a/kernel/pid_namespace.c
+++ b/kernel/pid_namespace.c
@@ -109,6 +109,8 @@ static struct pid_namespace *create_pid_namespace(struct user_namespace *user_ns
 	if (err)
 		goto out_free_map;
 
+	ns->serial_num = ns_serial();
+
 	kref_init(&ns->kref);
 	ns->level = level;
 	ns->parent = get_pid_ns(parent_pid_ns);
diff --git a/kernel/user.c b/kernel/user.c
index 4efa393..2f597e0 100644
--- a/kernel/user.c
+++ b/kernel/user.c
@@ -51,6 +51,7 @@ struct user_namespace init_user_ns = {
 	.owner = GLOBAL_ROOT_UID,
 	.group = GLOBAL_ROOT_GID,
 	.proc_inum = PROC_USER_INIT_INO,
+	.serial_num = NS_USER_INIT_SN /* ns_serial() */,
 #ifdef CONFIG_PERSISTENT_KEYRINGS
 	.persistent_keyring_register_sem =
 	__RWSEM_INITIALIZER(init_user_ns.persistent_keyring_register_sem),
diff --git a/kernel/user_namespace.c b/kernel/user_namespace.c
index fcc0256..5c5c399 100644
--- a/kernel/user_namespace.c
+++ b/kernel/user_namespace.c
@@ -92,6 +92,8 @@ int create_user_ns(struct cred *new)
 		return ret;
 	}
 
+	ns->serial_num = ns_serial();
+
 	atomic_set(&ns->count, 1);
 	/* Leave the new->user_ns reference with the new user namespace. */
 	ns->parent = parent_ns;
diff --git a/kernel/utsname.c b/kernel/utsname.c
index fd39312..d0cf7b5 100644
--- a/kernel/utsname.c
+++ b/kernel/utsname.c
@@ -48,6 +48,8 @@ static struct uts_namespace *clone_uts_ns(struct user_namespace *user_ns,
 		return ERR_PTR(err);
 	}
 
+	ns->serial_num = ns_serial();
+
 	down_read(&uts_sem);
 	memcpy(&ns->name, &old_ns->name, sizeof(ns->name));
 	ns->user_ns = get_user_ns(user_ns);
diff --git a/net/core/net_namespace.c b/net/core/net_namespace.c
index 85b6269..3b5cfdb 100644
--- a/net/core/net_namespace.c
+++ b/net/core/net_namespace.c
@@ -384,7 +384,13 @@ EXPORT_SYMBOL_GPL(get_net_ns_by_pid);
 
 static __net_init int net_ns_net_init(struct net *net)
 {
-	return proc_alloc_inum(&net->proc_inum);
+	int ret;
+
+	ret = proc_alloc_inum(&net->proc_inum);
+	if (ret)
+		return ret;
+	net->serial_num = ns_serial();
+	return 0;
 }
 
 static __net_exit void net_ns_net_exit(struct net *net)
-- 
1.7.1


^ permalink raw reply related	[flat|nested] 30+ messages in thread

* [PATCH V4 2/8] namespaces: expose namespace instance serial number in proc_ns_operations
  2014-08-21  1:09 [PATCH V4 0/8] namespaces: log namespaces per task Richard Guy Briggs
  2014-08-21  1:09 ` [PATCH V4 1/8] namespaces: assign each namespace instance a serial number Richard Guy Briggs
@ 2014-08-21  1:09 ` Richard Guy Briggs
  2014-08-21  1:09 ` [PATCH V4 3/8] namespaces: expose ns instance serial numbers in proc Richard Guy Briggs
                   ` (6 subsequent siblings)
  8 siblings, 0 replies; 30+ messages in thread
From: Richard Guy Briggs @ 2014-08-21  1:09 UTC (permalink / raw)
  To: linux-audit, linux-kernel, containers, linux-api
  Cc: Richard Guy Briggs, arozansk, eparis, sgrubb, ebiederm, serge

Expose the namespace instance serial number for each namespace type in the proc
namespace operations structure to make it available for the proc filesystem.

Signed-off-by: Richard Guy Briggs <rgb@redhat.com>
---
 fs/namespace.c           |    7 +++++++
 include/linux/proc_ns.h  |    1 +
 ipc/namespace.c          |    8 ++++++++
 kernel/pid_namespace.c   |    7 +++++++
 kernel/user_namespace.c  |    7 +++++++
 kernel/utsname.c         |    8 ++++++++
 net/core/net_namespace.c |    7 +++++++
 7 files changed, 45 insertions(+), 0 deletions(-)

diff --git a/fs/namespace.c b/fs/namespace.c
index 9af49ff..f433f21 100644
--- a/fs/namespace.c
+++ b/fs/namespace.c
@@ -3028,6 +3028,12 @@ static unsigned int mntns_inum(void *ns)
 	return mnt_ns->proc_inum;
 }
 
+static long long mntns_snum(void *ns)
+{
+	struct mnt_namespace *mnt_ns = ns;
+	return mnt_ns->serial_num;
+}
+
 const struct proc_ns_operations mntns_operations = {
 	.name		= "mnt",
 	.type		= CLONE_NEWNS,
@@ -3035,4 +3041,5 @@ const struct proc_ns_operations mntns_operations = {
 	.put		= mntns_put,
 	.install	= mntns_install,
 	.inum		= mntns_inum,
+	.snum		= mntns_snum,
 };
diff --git a/include/linux/proc_ns.h b/include/linux/proc_ns.h
index 34a1e10..aaafe3e 100644
--- a/include/linux/proc_ns.h
+++ b/include/linux/proc_ns.h
@@ -14,6 +14,7 @@ struct proc_ns_operations {
 	void (*put)(void *ns);
 	int (*install)(struct nsproxy *nsproxy, void *ns);
 	unsigned int (*inum)(void *ns);
+	long long (*snum)(void *ns);
 };
 
 struct proc_ns {
diff --git a/ipc/namespace.c b/ipc/namespace.c
index 76dac5c..36ce7ff 100644
--- a/ipc/namespace.c
+++ b/ipc/namespace.c
@@ -191,6 +191,13 @@ static unsigned int ipcns_inum(void *vp)
 	return ns->proc_inum;
 }
 
+static long long ipcns_snum(void *vp)
+{
+	struct ipc_namespace *ns = vp;
+
+	return ns->serial_num;
+}
+
 const struct proc_ns_operations ipcns_operations = {
 	.name		= "ipc",
 	.type		= CLONE_NEWIPC,
@@ -198,4 +205,5 @@ const struct proc_ns_operations ipcns_operations = {
 	.put		= ipcns_put,
 	.install	= ipcns_install,
 	.inum		= ipcns_inum,
+	.snum		= ipcns_snum,
 };
diff --git a/kernel/pid_namespace.c b/kernel/pid_namespace.c
index 40a8b36..059b330 100644
--- a/kernel/pid_namespace.c
+++ b/kernel/pid_namespace.c
@@ -370,6 +370,12 @@ static unsigned int pidns_inum(void *ns)
 	return pid_ns->proc_inum;
 }
 
+static long long pidns_snum(void *ns)
+{
+	struct pid_namespace *pid_ns = ns;
+	return pid_ns->serial_num;
+}
+
 const struct proc_ns_operations pidns_operations = {
 	.name		= "pid",
 	.type		= CLONE_NEWPID,
@@ -377,6 +383,7 @@ const struct proc_ns_operations pidns_operations = {
 	.put		= pidns_put,
 	.install	= pidns_install,
 	.inum		= pidns_inum,
+	.snum		= pidns_snum,
 };
 
 static __init int pid_namespaces_init(void)
diff --git a/kernel/user_namespace.c b/kernel/user_namespace.c
index 5c5c399..3f04df5 100644
--- a/kernel/user_namespace.c
+++ b/kernel/user_namespace.c
@@ -896,6 +896,12 @@ static unsigned int userns_inum(void *ns)
 	return user_ns->proc_inum;
 }
 
+static long long userns_snum(void *ns)
+{
+	struct user_namespace *user_ns = ns;
+	return user_ns->serial_num;
+}
+
 const struct proc_ns_operations userns_operations = {
 	.name		= "user",
 	.type		= CLONE_NEWUSER,
@@ -903,6 +909,7 @@ const struct proc_ns_operations userns_operations = {
 	.put		= userns_put,
 	.install	= userns_install,
 	.inum		= userns_inum,
+	.snum		= userns_snum,
 };
 
 static __init int user_namespaces_init(void)
diff --git a/kernel/utsname.c b/kernel/utsname.c
index d0cf7b5..ffeac1b 100644
--- a/kernel/utsname.c
+++ b/kernel/utsname.c
@@ -132,6 +132,13 @@ static unsigned int utsns_inum(void *vp)
 	return ns->proc_inum;
 }
 
+static long long utsns_snum(void *vp)
+{
+	struct uts_namespace *ns = vp;
+
+	return ns->serial_num;
+}
+
 const struct proc_ns_operations utsns_operations = {
 	.name		= "uts",
 	.type		= CLONE_NEWUTS,
@@ -139,4 +146,5 @@ const struct proc_ns_operations utsns_operations = {
 	.put		= utsns_put,
 	.install	= utsns_install,
 	.inum		= utsns_inum,
+	.snum		= utsns_snum,
 };
diff --git a/net/core/net_namespace.c b/net/core/net_namespace.c
index 3b5cfdb..c402eea 100644
--- a/net/core/net_namespace.c
+++ b/net/core/net_namespace.c
@@ -671,6 +671,12 @@ static unsigned int netns_inum(void *ns)
 	return net->proc_inum;
 }
 
+static long long netns_snum(void *ns)
+{
+	struct net *net = ns;
+	return net->serial_num;
+}
+
 const struct proc_ns_operations netns_operations = {
 	.name		= "net",
 	.type		= CLONE_NEWNET,
@@ -678,5 +684,6 @@ const struct proc_ns_operations netns_operations = {
 	.put		= netns_put,
 	.install	= netns_install,
 	.inum		= netns_inum,
+	.snum		= netns_snum,
 };
 #endif
-- 
1.7.1


^ permalink raw reply related	[flat|nested] 30+ messages in thread

* [PATCH V4 3/8] namespaces: expose ns instance serial numbers in proc
  2014-08-21  1:09 [PATCH V4 0/8] namespaces: log namespaces per task Richard Guy Briggs
  2014-08-21  1:09 ` [PATCH V4 1/8] namespaces: assign each namespace instance a serial number Richard Guy Briggs
  2014-08-21  1:09 ` [PATCH V4 2/8] namespaces: expose namespace instance serial number in proc_ns_operations Richard Guy Briggs
@ 2014-08-21  1:09 ` Richard Guy Briggs
  2014-08-21 21:13   ` Andy Lutomirski
  2014-08-21  1:09 ` [PATCH V4 4/8] Documentation: add a section for /proc/<pid>/ns/ Richard Guy Briggs
                   ` (5 subsequent siblings)
  8 siblings, 1 reply; 30+ messages in thread
From: Richard Guy Briggs @ 2014-08-21  1:09 UTC (permalink / raw)
  To: linux-audit, linux-kernel, containers, linux-api
  Cc: Richard Guy Briggs, arozansk, eparis, sgrubb, ebiederm, serge

Expose the namespace instace serial numbers in the proc filesystem at
/proc/<pid>/ns/<ns>_snum.  The link text gives the serial number in hex.

"snum" was chosen instead of "seq" for consistency with inum and there are a
number of other uses of "seq" in the namespace code.

Suggested-by: Serge E. Hallyn <serge@hallyn.com>
Signed-off-by: Richard Guy Briggs <rgb@redhat.com>
---
 fs/proc/namespaces.c |   33 +++++++++++++++++++++++++--------
 1 files changed, 25 insertions(+), 8 deletions(-)

diff --git a/fs/proc/namespaces.c b/fs/proc/namespaces.c
index 8902609..e953e0a 100644
--- a/fs/proc/namespaces.c
+++ b/fs/proc/namespaces.c
@@ -47,12 +47,15 @@ static char *ns_dname(struct dentry *dentry, char *buffer, int buflen)
 	struct inode *inode = dentry->d_inode;
 	const struct proc_ns_operations *ns_ops = PROC_I(inode)->ns.ns_ops;
 
-	return dynamic_dname(dentry, buffer, buflen, "%s:[%lu]",
-		ns_ops->name, inode->i_ino);
+	if (strstr(dentry->d_iname, "_snum"))
+		return dynamic_dname(dentry, buffer, buflen, "%s_snum:[%llx]",
+			ns_ops->name, ns_ops->snum(PROC_I(inode)->ns.ns));
+	else
+		return dynamic_dname(dentry, buffer, buflen, "%s:[%lu]",
+			ns_ops->name, inode->i_ino);
 }
 
-const struct dentry_operations ns_dentry_operations =
-{
+const struct dentry_operations ns_dentry_operations = {
 	.d_delete	= always_delete_dentry,
 	.d_dname	= ns_dname,
 };
@@ -160,7 +163,10 @@ static int proc_ns_readlink(struct dentry *dentry, char __user *buffer, int bufl
 	if (!ns)
 		goto out_put_task;
 
-	snprintf(name, sizeof(name), "%s:[%u]", ns_ops->name, ns_ops->inum(ns));
+	if (strstr(dentry->d_iname, "_snum"))
+		snprintf(name, sizeof(name), "%s_snum:[%llx]", ns_ops->name, ns_ops->snum(ns));
+	else
+		snprintf(name, sizeof(name), "%s:[%u]", ns_ops->name, ns_ops->inum(ns));
 	res = readlink_copy(buffer, buflen, name);
 	ns_ops->put(ns);
 out_put_task:
@@ -210,16 +216,23 @@ static int proc_ns_dir_readdir(struct file *file, struct dir_context *ctx)
 
 	if (!dir_emit_dots(file, ctx))
 		goto out;
-	if (ctx->pos >= 2 + ARRAY_SIZE(ns_entries))
+	if (ctx->pos >= 2 + 2 * ARRAY_SIZE(ns_entries))
 		goto out;
 	entry = ns_entries + (ctx->pos - 2);
 	last = &ns_entries[ARRAY_SIZE(ns_entries) - 1];
 	while (entry <= last) {
 		const struct proc_ns_operations *ops = *entry;
+		char name[50];
+
 		if (!proc_fill_cache(file, ctx, ops->name, strlen(ops->name),
 				     proc_ns_instantiate, task, ops))
 			break;
 		ctx->pos++;
+		snprintf(name, sizeof(name), "%s_snum", ops->name);
+		if (!proc_fill_cache(file, ctx, name, strlen(name),
+				     proc_ns_instantiate, task, ops))
+			break;
+		ctx->pos++;
 		entry++;
 	}
 out:
@@ -247,9 +260,13 @@ static struct dentry *proc_ns_dir_lookup(struct inode *dir,
 
 	last = &ns_entries[ARRAY_SIZE(ns_entries)];
 	for (entry = ns_entries; entry < last; entry++) {
-		if (strlen((*entry)->name) != len)
+		char name[50];
+
+		snprintf(name, sizeof(name), "%s_snum", (*entry)->name);
+		if (strlen((*entry)->name) != len && strlen(name) != len)
 			continue;
-		if (!memcmp(dentry->d_name.name, (*entry)->name, len))
+		if (!memcmp(dentry->d_name.name, (*entry)->name, len)
+		    || !memcmp(dentry->d_name.name, name, len))
 			break;
 	}
 	if (entry == last)
-- 
1.7.1


^ permalink raw reply related	[flat|nested] 30+ messages in thread

* [PATCH V4 4/8] Documentation: add a section for /proc/<pid>/ns/
  2014-08-21  1:09 [PATCH V4 0/8] namespaces: log namespaces per task Richard Guy Briggs
                   ` (2 preceding siblings ...)
  2014-08-21  1:09 ` [PATCH V4 3/8] namespaces: expose ns instance serial numbers in proc Richard Guy Briggs
@ 2014-08-21  1:09 ` Richard Guy Briggs
  2014-08-21  1:09 ` [PATCH V4 5/8] namespaces: expose ns_entries Richard Guy Briggs
                   ` (4 subsequent siblings)
  8 siblings, 0 replies; 30+ messages in thread
From: Richard Guy Briggs @ 2014-08-21  1:09 UTC (permalink / raw)
  To: linux-audit, linux-kernel, containers, linux-api
  Cc: Richard Guy Briggs, arozansk, eparis, sgrubb, ebiederm, serge

---
 Documentation/filesystems/proc.txt |   16 ++++++++++++++++
 1 files changed, 16 insertions(+), 0 deletions(-)

diff --git a/Documentation/filesystems/proc.txt b/Documentation/filesystems/proc.txt
index ddc531a..c4bfd6f 100644
--- a/Documentation/filesystems/proc.txt
+++ b/Documentation/filesystems/proc.txt
@@ -42,6 +42,7 @@ Table of Contents
   3.6	/proc/<pid>/comm  & /proc/<pid>/task/<tid>/comm
   3.7   /proc/<pid>/task/<tid>/children - Information about task children
   3.8   /proc/<pid>/fdinfo/<fd> - Information about opened file
+  3.9   /proc/<pid>/ns/<ns>{,_snum} - Information about process namespaces
 
   4	Configuring procfs
   4.1	Mount options
@@ -1744,6 +1745,21 @@ pair provide additional information particular to the objects they represent.
 	optional and may be omitted if no marks created yet.
 
 
+3.9	/proc/<pid>/ns/<nstype>{,_snum} - Information about process namespaces
+--------------------------------------------------------------------------
+These files provides information about the namespaces within which the process
+is contained.  The files named only with the namespace type <nstype> contain a
+link that lists the containing namespace' inode number in its proc filesystem.
+The files with suffix _snum contain a link that lists the containing
+namespace' instance serial number, unique per kernel since boot.  The
+namespace types are self-describing.
+
+The output format of the inode links is:
+	<nstype>:[<inode_number>]
+The output format of the serial number links is:
+	<nstype>_snum:[<serial_number>]
+
+
 ------------------------------------------------------------------------------
 Configuring procfs
 ------------------------------------------------------------------------------
-- 
1.7.1


^ permalink raw reply related	[flat|nested] 30+ messages in thread

* [PATCH V4 5/8] namespaces: expose ns_entries
  2014-08-21  1:09 [PATCH V4 0/8] namespaces: log namespaces per task Richard Guy Briggs
                   ` (3 preceding siblings ...)
  2014-08-21  1:09 ` [PATCH V4 4/8] Documentation: add a section for /proc/<pid>/ns/ Richard Guy Briggs
@ 2014-08-21  1:09 ` Richard Guy Briggs
  2014-08-21  1:09 ` [PATCH V4 6/8] audit: log namespace serial numbers Richard Guy Briggs
                   ` (3 subsequent siblings)
  8 siblings, 0 replies; 30+ messages in thread
From: Richard Guy Briggs @ 2014-08-21  1:09 UTC (permalink / raw)
  To: linux-audit, linux-kernel, containers, linux-api
  Cc: Richard Guy Briggs, arozansk, eparis, sgrubb, ebiederm, serge

Expose ns_entries so subsystems other than proc can use this set of namespace
operations.

Signed-off-by: Richard Guy Briggs <rgb@redhat.com>
---
 fs/proc/namespaces.c    |    2 +-
 include/linux/proc_ns.h |    1 +
 2 files changed, 2 insertions(+), 1 deletions(-)

diff --git a/fs/proc/namespaces.c b/fs/proc/namespaces.c
index e953e0a..29c3909 100644
--- a/fs/proc/namespaces.c
+++ b/fs/proc/namespaces.c
@@ -15,7 +15,7 @@
 #include "internal.h"
 
 
-static const struct proc_ns_operations *ns_entries[] = {
+const struct proc_ns_operations *ns_entries[] = {
 #ifdef CONFIG_NET_NS
 	&netns_operations,
 #endif
diff --git a/include/linux/proc_ns.h b/include/linux/proc_ns.h
index aaafe3e..f4563db 100644
--- a/include/linux/proc_ns.h
+++ b/include/linux/proc_ns.h
@@ -28,6 +28,7 @@ extern const struct proc_ns_operations ipcns_operations;
 extern const struct proc_ns_operations pidns_operations;
 extern const struct proc_ns_operations userns_operations;
 extern const struct proc_ns_operations mntns_operations;
+extern const struct proc_ns_operations *ns_entries[];
 
 /*
  * We always define these enumerators
-- 
1.7.1


^ permalink raw reply related	[flat|nested] 30+ messages in thread

* [PATCH V4 6/8] audit: log namespace serial numbers
  2014-08-21  1:09 [PATCH V4 0/8] namespaces: log namespaces per task Richard Guy Briggs
                   ` (4 preceding siblings ...)
  2014-08-21  1:09 ` [PATCH V4 5/8] namespaces: expose ns_entries Richard Guy Briggs
@ 2014-08-21  1:09 ` Richard Guy Briggs
  2014-08-21  1:09 ` [PATCH V4 7/8] audit: log creation and deletion of namespace instances Richard Guy Briggs
                   ` (2 subsequent siblings)
  8 siblings, 0 replies; 30+ messages in thread
From: Richard Guy Briggs @ 2014-08-21  1:09 UTC (permalink / raw)
  To: linux-audit, linux-kernel, containers, linux-api
  Cc: Richard Guy Briggs, arozansk, eparis, sgrubb, ebiederm, serge

Log the namespace serial numbers of a task in a new record type (1329) (usually
accompanies audit_log_task_info() type=SYSCALL record) which is used by syscall
audits, among others..

Idea first presented:
	https://www.redhat.com/archives/linux-audit/2013-March/msg00020.html

Typical output format would look something like:
	type=NS_INFO msg=audit(1408577535.306:82):  netns=8 utsns=2 ipcns=1 pidns=4 userns=3 mntns=5

The serial numbers are printed in hex.

Suggested-by: Aristeu Rozanski <arozansk@redhat.com>
Signed-off-by: Richard Guy Briggs <rgb@redhat.com>
Acked-by: Serge Hallyn <serge.hallyn@canonical.com>
---
 include/linux/audit.h            |    7 +++++++
 include/uapi/linux/audit.h       |    1 +
 kernel/audit.c                   |   29 +++++++++++++++++++++++++++++
 kernel/auditsc.c                 |    2 ++
 security/integrity/ima/ima_api.c |    2 ++
 5 files changed, 41 insertions(+), 0 deletions(-)

diff --git a/include/linux/audit.h b/include/linux/audit.h
index 22cfddb..5ea3609 100644
--- a/include/linux/audit.h
+++ b/include/linux/audit.h
@@ -101,6 +101,13 @@ extern int __weak audit_classify_compat_syscall(int abi, unsigned syscall);
 struct filename;
 
 extern void audit_log_session_info(struct audit_buffer *ab);
+#ifdef CONFIG_NAMESPACES
+extern void audit_log_namespace_info(struct task_struct *tsk);
+#else
+void audit_log_namespace_info(struct task_struct *tsk)
+{
+}
+#endif
 
 #ifdef CONFIG_AUDIT_COMPAT_GENERIC
 #define audit_is_compat(arch)  (!((arch) & __AUDIT_ARCH_64BIT))
diff --git a/include/uapi/linux/audit.h b/include/uapi/linux/audit.h
index cf67147..84bbcdb 100644
--- a/include/uapi/linux/audit.h
+++ b/include/uapi/linux/audit.h
@@ -110,6 +110,7 @@
 #define AUDIT_SECCOMP		1326	/* Secure Computing event */
 #define AUDIT_PROCTITLE		1327	/* Proctitle emit event */
 #define AUDIT_FEATURE_CHANGE	1328	/* audit log listing feature changes */
+#define AUDIT_NS_INFO		1329	/* Record process namespace IDs */
 
 #define AUDIT_AVC		1400	/* SE Linux avc denial or grant */
 #define AUDIT_SELINUX_ERR	1401	/* Internal SE Linux Errors */
diff --git a/kernel/audit.c b/kernel/audit.c
index 3ef2e0e..a4c39a0 100644
--- a/kernel/audit.c
+++ b/kernel/audit.c
@@ -65,6 +65,7 @@
 #include <linux/freezer.h>
 #include <linux/tty.h>
 #include <linux/pid_namespace.h>
+#include <linux/proc_ns.h>
 #include <net/netns/generic.h>
 
 #include "audit.h"
@@ -743,6 +744,8 @@ static void audit_log_feature_change(int which, u32 old_feature, u32 new_feature
 			 audit_feature_names[which], !!old_feature, !!new_feature,
 			 !!old_lock, !!new_lock, res);
 	audit_log_end(ab);
+
+	audit_log_namespace_info(current);
 }
 
 static int audit_set_feature(struct sk_buff *skb)
@@ -1661,6 +1664,30 @@ void audit_log_session_info(struct audit_buffer *ab)
 	audit_log_format(ab, " auid=%u ses=%u", auid, sessionid);
 }
 
+#ifdef CONFIG_NAMESPACES
+void audit_log_namespace_info(struct task_struct *tsk)
+{
+	const struct proc_ns_operations **entry;
+	bool end = false;
+	struct audit_buffer *ab;
+
+	if (!tsk)
+		return;
+	ab = audit_log_start(tsk->audit_context, GFP_KERNEL,
+			     AUDIT_NS_INFO);
+	if (!ab)
+		return;
+	for (entry = ns_entries; !end; entry++) {
+		void *ns = (*entry)->get(tsk);
+		audit_log_format(ab, " %sns=%llx", (*entry)->name,
+				 (*entry)->snum(ns));
+		(*entry)->put(ns);
+		end = (*entry)->type == CLONE_NEWNS;
+	}
+	audit_log_end(ab);
+}
+#endif /* CONFIG_NAMESPACES */
+
 void audit_log_key(struct audit_buffer *ab, char *key)
 {
 	audit_log_format(ab, " key=");
@@ -1933,6 +1960,8 @@ void audit_log_link_denied(const char *operation, struct path *link)
 	audit_log_format(ab, " res=0");
 	audit_log_end(ab);
 
+	audit_log_namespace_info(current);
+
 	/* Generate AUDIT_PATH record with object. */
 	name->type = AUDIT_TYPE_NORMAL;
 	audit_copy_inode(name, link->dentry, link->dentry->d_inode);
diff --git a/kernel/auditsc.c b/kernel/auditsc.c
index 21eae3c..08b9af9 100644
--- a/kernel/auditsc.c
+++ b/kernel/auditsc.c
@@ -1383,6 +1383,8 @@ static void audit_log_exit(struct audit_context *context, struct task_struct *ts
 	audit_log_key(ab, context->filterkey);
 	audit_log_end(ab);
 
+	audit_log_namespace_info(tsk);
+
 	for (aux = context->aux; aux; aux = aux->next) {
 
 		ab = audit_log_start(context, GFP_KERNEL, aux->type);
diff --git a/security/integrity/ima/ima_api.c b/security/integrity/ima/ima_api.c
index d9cd5ce..06d6897 100644
--- a/security/integrity/ima/ima_api.c
+++ b/security/integrity/ima/ima_api.c
@@ -323,6 +323,8 @@ void ima_audit_measurement(struct integrity_iint_cache *iint,
 	audit_log_task_info(ab, current);
 	audit_log_end(ab);
 
+	audit_log_namespace_info(current);
+
 	iint->flags |= IMA_AUDITED;
 }
 
-- 
1.7.1


^ permalink raw reply related	[flat|nested] 30+ messages in thread

* [PATCH V4 7/8] audit: log creation and deletion of namespace instances
  2014-08-21  1:09 [PATCH V4 0/8] namespaces: log namespaces per task Richard Guy Briggs
                   ` (5 preceding siblings ...)
  2014-08-21  1:09 ` [PATCH V4 6/8] audit: log namespace serial numbers Richard Guy Briggs
@ 2014-08-21  1:09 ` Richard Guy Briggs
  2014-08-21  1:09 ` [PATCH V4 8/8] audit: initialize at subsystem time rather than device time Richard Guy Briggs
  2014-08-21 20:05 ` [PATCH V4 0/8] namespaces: log namespaces per task Aristeu Rozanski
  8 siblings, 0 replies; 30+ messages in thread
From: Richard Guy Briggs @ 2014-08-21  1:09 UTC (permalink / raw)
  To: linux-audit, linux-kernel, containers, linux-api
  Cc: Richard Guy Briggs, arozansk, eparis, sgrubb, ebiederm, serge

Log the creation and deletion of namespace instances in all 6 types of
namespaces.

Twelve new audit message types have been introduced:
AUDIT_NS_INIT_MNT       1330    /* Record mount namespace instance creation */
AUDIT_NS_INIT_UTS       1331    /* Record UTS namespace instance creation */
AUDIT_NS_INIT_IPC       1332    /* Record IPC namespace instance creation */
AUDIT_NS_INIT_USER      1333    /* Record USER namespace instance creation */
AUDIT_NS_INIT_PID       1334    /* Record PID namespace instance creation */
AUDIT_NS_INIT_NET       1335    /* Record NET namespace instance creation */
AUDIT_NS_DEL_MNT        1336    /* Record mount namespace instance deletion */
AUDIT_NS_DEL_UTS        1337    /* Record UTS namespace instance deletion */
AUDIT_NS_DEL_IPC        1338    /* Record IPC namespace instance deletion */
AUDIT_NS_DEL_USER       1339    /* Record USER namespace instance deletion */
AUDIT_NS_DEL_PID        1340    /* Record PID namespace instance deletion */
AUDIT_NS_DEL_NET        1341    /* Record NET namespace instance deletion */

As suggested by Eric Paris, there are 12 message types, one for each of
creation and deletion, one for each type of namespace so that text searches are
easier in conjunction with the AUDIT_NS_INFO message type, being able to search
for all records such as "netns=7 " and to avoid fields disappearing per message
type to make ausearch more efficient.

A typical startup would look roughly like:

	type=AUDIT_NS_INIT_UTS msg=audit(1408577534.868:5): pid=1 uid=0 auid=4294967295 ses=4294967295 subj=kernel old_utsns=0 utsns=2 res=1
	type=AUDIT_NS_INIT_USER msg=audit(1408577534.868:6): pid=1 uid=0 auid=4294967295 ses=4294967295 subj=kernel old_userns=0 userns=3 res=1
	type=AUDIT_NS_INIT_PID msg=audit(1408577534.868:7): pid=1 uid=0 auid=4294967295 ses=4294967295 subj=kernel old_pidns=0 pidns=4 res=1
	type=AUDIT_NS_INIT_MNT msg=audit(1408577534.868:8): pid=1 uid=0 auid=4294967295 ses=4294967295 subj=kernel old_mntns=0 mntns=5 res=1
	type=AUDIT_NS_INIT_IPC msg=audit(1408577534.868:9): pid=1 uid=0 auid=4294967295 ses=4294967295 subj=kernel old_ipcns=0 ipcns=1 res=1
	type=AUDIT_NS_INIT_NET msg=audit(1408577533.500:10): pid=1 uid=0 auid=4294967295 ses=4294967295 subj=kernel old_netns=0 netns=7 res=1

And a CLONE action would result in:
	type=type=AUDIT_NS_INIT_NET msg=audit(1408577535.306:81): pid=481 uid=0 auid=4294967295 ses=4294967295 subj=system_u:system_r:init_t:s0 old_netns=7 netns=8 res=1
	type=type=AUDIT_NS_INIT_MNT msg=audit(1408577535.307:83): pid=481 uid=0 auid=4294967295 ses=4294967295 subj=system_u:system_r:init_t:s0 old_mntns=5 mntns=9 res=1

While deleting a namespace would result in:
	type=type=AUDIT_NS_DEL_MNT msg=audit(1408577552.221:85): pid=481 uid=0 auid=4294967295 ses=4294967295 subj=system_u:system_r:init_t:s0 mntns=9 res=1

If non-zero, old_snum lists the namespace from which it was cloned.

Signed-off-by: Richard Guy Briggs <rgb@redhat.com>
---
 fs/namespace.c             |   12 +++++++++++
 include/linux/audit.h      |    8 +++++++
 include/uapi/linux/audit.h |   12 +++++++++++
 ipc/namespace.c            |   10 +++++++++
 kernel/audit.c             |   47 ++++++++++++++++++++++++++++++++++++++++++++
 kernel/pid_namespace.c     |   10 +++++++++
 kernel/user_namespace.c    |   11 ++++++++++
 kernel/utsname.c           |   11 ++++++++++
 net/core/net_namespace.c   |   12 +++++++++++
 9 files changed, 133 insertions(+), 0 deletions(-)

diff --git a/fs/namespace.c b/fs/namespace.c
index f433f21..cb05b3d 100644
--- a/fs/namespace.c
+++ b/fs/namespace.c
@@ -24,6 +24,7 @@
 #include <linux/proc_ns.h>
 #include <linux/magic.h>
 #include <linux/bootmem.h>
+#include <linux/audit.h>
 #include "pnode.h"
 #include "internal.h"
 
@@ -2459,6 +2460,7 @@ dput_out:
 
 static void free_mnt_ns(struct mnt_namespace *ns)
 {
+	audit_log_ns_del(AUDIT_NS_DEL_MNT, ns->serial_num);
 	proc_free_inum(ns->proc_inum);
 	put_user_ns(ns->user_ns);
 	kfree(ns);
@@ -2519,6 +2521,7 @@ struct mnt_namespace *copy_mnt_ns(unsigned long flags, struct mnt_namespace *ns,
 	new_ns = alloc_mnt_ns(user_ns);
 	if (IS_ERR(new_ns))
 		return new_ns;
+	audit_log_ns_init(AUDIT_NS_INIT_MNT, ns->serial_num, new_ns->serial_num);
 
 	namespace_lock();
 	/* First pass: copy the tree topology */
@@ -2831,6 +2834,15 @@ static void __init init_mount_tree(void)
 	set_fs_root(current->fs, &root);
 }
 
+/* log the serial number of init mnt namespace after audit service starts */
+static int __init mnt_ns_init_log(void)
+{
+	struct mnt_namespace *init_mnt_ns = init_task.nsproxy->mnt_ns;
+	audit_log_ns_init(AUDIT_NS_INIT_MNT, 0, init_mnt_ns->serial_num);
+	return 0;
+}
+late_initcall(mnt_ns_init_log);
+
 void __init mnt_init(void)
 {
 	unsigned u;
diff --git a/include/linux/audit.h b/include/linux/audit.h
index 5ea3609..c245837 100644
--- a/include/linux/audit.h
+++ b/include/linux/audit.h
@@ -466,6 +466,9 @@ extern void		    audit_log_key(struct audit_buffer *ab,
 					  char *key);
 extern void		    audit_log_link_denied(const char *operation,
 						  struct path *link);
+extern int		    audit_log_ns_init(int type, long long old_snum,
+					      long long snum);
+extern int		    audit_log_ns_del(int type, long long snum);
 extern void		    audit_log_lost(const char *message);
 #ifdef CONFIG_SECURITY
 extern void 		    audit_log_secctx(struct audit_buffer *ab, u32 secid);
@@ -524,6 +527,11 @@ static inline void audit_log_key(struct audit_buffer *ab, char *key)
 static inline void audit_log_link_denied(const char *string,
 					 const struct path *link)
 { }
+static inline int audit_log_ns_init(int type, long long old_snum,
+				    long long snum)
+{ }
+static inline int audit_log_ns_del(int type, long long snum)
+{ }
 static inline void audit_log_secctx(struct audit_buffer *ab, u32 secid)
 { }
 static inline int audit_log_task_context(struct audit_buffer *ab)
diff --git a/include/uapi/linux/audit.h b/include/uapi/linux/audit.h
index 84bbcdb..fd13a02 100644
--- a/include/uapi/linux/audit.h
+++ b/include/uapi/linux/audit.h
@@ -111,6 +111,18 @@
 #define AUDIT_PROCTITLE		1327	/* Proctitle emit event */
 #define AUDIT_FEATURE_CHANGE	1328	/* audit log listing feature changes */
 #define AUDIT_NS_INFO		1329	/* Record process namespace IDs */
+#define AUDIT_NS_INIT_MNT	1330	/* Record mount namespace instance creation */
+#define AUDIT_NS_INIT_UTS	1331	/* Record UTS namespace instance creation */
+#define AUDIT_NS_INIT_IPC	1332	/* Record IPC namespace instance creation */
+#define AUDIT_NS_INIT_USER	1333	/* Record USER namespace instance creation */
+#define AUDIT_NS_INIT_PID	1334	/* Record PID namespace instance creation */
+#define AUDIT_NS_INIT_NET	1335	/* Record NET namespace instance creation */
+#define AUDIT_NS_DEL_MNT	1336	/* Record mount namespace instance deletion */
+#define AUDIT_NS_DEL_UTS	1337	/* Record UTS namespace instance deletion */
+#define AUDIT_NS_DEL_IPC	1338	/* Record IPC namespace instance deletion */
+#define AUDIT_NS_DEL_USER	1339	/* Record USER namespace instance deletion */
+#define AUDIT_NS_DEL_PID	1340	/* Record PID namespace instance deletion */
+#define AUDIT_NS_DEL_NET	1341	/* Record NET namespace instance deletion */
 
 #define AUDIT_AVC		1400	/* SE Linux avc denial or grant */
 #define AUDIT_SELINUX_ERR	1401	/* Internal SE Linux Errors */
diff --git a/ipc/namespace.c b/ipc/namespace.c
index 36ce7ff..538c4b9 100644
--- a/ipc/namespace.c
+++ b/ipc/namespace.c
@@ -13,6 +13,7 @@
 #include <linux/mount.h>
 #include <linux/user_namespace.h>
 #include <linux/proc_ns.h>
+#include <linux/audit.h>
 
 #include "util.h"
 
@@ -42,6 +43,7 @@ static struct ipc_namespace *create_ipc_ns(struct user_namespace *user_ns,
 	atomic_inc(&nr_ipc_ns);
 
 	ns->serial_num = ns_serial();
+	audit_log_ns_init(AUDIT_NS_INIT_IPC, old_ns->serial_num, ns->serial_num);
 
 	sem_init_ns(ns);
 	msg_init_ns(ns);
@@ -121,6 +123,7 @@ static void free_ipc_ns(struct ipc_namespace *ns)
 	 */
 	ipcns_notify(IPCNS_REMOVED);
 	put_user_ns(ns->user_ns);
+	audit_log_ns_del(AUDIT_NS_DEL_IPC, ns->serial_num);
 	proc_free_inum(ns->proc_inum);
 	kfree(ns);
 }
@@ -207,3 +210,10 @@ const struct proc_ns_operations ipcns_operations = {
 	.inum		= ipcns_inum,
 	.snum		= ipcns_snum,
 };
+
+static int __init ipc_namespaces_init(void)
+{
+	return audit_log_ns_init(AUDIT_NS_INIT_IPC, 0, init_ipc_ns.serial_num);
+}
+
+late_initcall(ipc_namespaces_init);
diff --git a/kernel/audit.c b/kernel/audit.c
index a4c39a0..6d95d1c 100644
--- a/kernel/audit.c
+++ b/kernel/audit.c
@@ -1970,6 +1970,53 @@ out:
 	kfree(name);
 }
 
+static char* ns_name[] = {
+	"mnt",
+	"uts",
+	"ipc",
+	"user",
+	"pid",
+	"net",
+};
+
+/**
+ * audit_log_ns_init - report a namespace instance creation
+ * @type: type of audit namespace instance created message
+ * @old_snum: the serial number of the cloned namespace instance
+ * @snum: the serial number of the new namespace instance
+ */
+int audit_log_ns_init(int type, long long old_snum, long long snum)
+{
+	struct audit_buffer *ab;
+	char* audit_ns_name = ns_name[type - AUDIT_NS_INIT_MNT];
+
+	if (type < AUDIT_NS_INIT_MNT || type > AUDIT_NS_INIT_NET)
+		return 0;
+	audit_log_common_recv_msg(&ab, type);
+	audit_log_format(ab, " old_%sns=%llx %sns=%llx res=1",
+			 audit_ns_name, old_snum, audit_ns_name, snum);
+	audit_log_end(ab);
+	return 0;
+}
+
+/**
+ * audit_log_ns_del - report a namespace instance deleted
+ * @type: type of audit namespace instance deleted message
+ * @snum: the serial number of the namespace instance
+ */
+int audit_log_ns_del(int type, long long snum)
+{
+	struct audit_buffer *ab;
+	char* audit_ns_name = ns_name[type - AUDIT_NS_DEL_MNT];
+
+	if (type < AUDIT_NS_DEL_MNT || type > AUDIT_NS_DEL_NET)
+		return 0;
+	audit_log_common_recv_msg(&ab, type);
+	audit_log_format(ab, " %ssn=%llx res=1", audit_ns_name, snum);
+	audit_log_end(ab);
+	return 0;
+}
+
 /**
  * audit_log_end - end one audit record
  * @ab: the audit_buffer
diff --git a/kernel/pid_namespace.c b/kernel/pid_namespace.c
index 059b330..7da351a 100644
--- a/kernel/pid_namespace.c
+++ b/kernel/pid_namespace.c
@@ -18,6 +18,7 @@
 #include <linux/proc_ns.h>
 #include <linux/reboot.h>
 #include <linux/export.h>
+#include <linux/audit.h>
 
 struct pid_cache {
 	int nr_ids;
@@ -110,6 +111,8 @@ static struct pid_namespace *create_pid_namespace(struct user_namespace *user_ns
 		goto out_free_map;
 
 	ns->serial_num = ns_serial();
+	audit_log_ns_init(AUDIT_NS_INIT_PID, parent_pid_ns->serial_num,
+			  ns->serial_num);
 
 	kref_init(&ns->kref);
 	ns->level = level;
@@ -144,6 +147,7 @@ static void destroy_pid_namespace(struct pid_namespace *ns)
 {
 	int i;
 
+	audit_log_ns_del(AUDIT_NS_DEL_PID, ns->serial_num);
 	proc_free_inum(ns->proc_inum);
 	for (i = 0; i < PIDMAP_ENTRIES; i++)
 		kfree(ns->pidmap[i].page);
@@ -397,3 +401,9 @@ static __init int pid_namespaces_init(void)
 }
 
 __initcall(pid_namespaces_init);
+
+static __init int pid_namespaces_late_init(void)
+{
+	return audit_log_ns_init(AUDIT_NS_INIT_PID, 0, init_pid_ns.serial_num);
+}
+late_initcall(pid_namespaces_late_init);
diff --git a/kernel/user_namespace.c b/kernel/user_namespace.c
index 3f04df5..24497a0 100644
--- a/kernel/user_namespace.c
+++ b/kernel/user_namespace.c
@@ -22,6 +22,7 @@
 #include <linux/ctype.h>
 #include <linux/projid.h>
 #include <linux/fs_struct.h>
+#include <linux/audit.h>
 
 static struct kmem_cache *user_ns_cachep __read_mostly;
 
@@ -93,6 +94,8 @@ int create_user_ns(struct cred *new)
 	}
 
 	ns->serial_num = ns_serial();
+	audit_log_ns_init(AUDIT_NS_INIT_USER, parent_ns->serial_num,
+			  ns->serial_num);
 
 	atomic_set(&ns->count, 1);
 	/* Leave the new->user_ns reference with the new user namespace. */
@@ -138,6 +141,7 @@ void free_user_ns(struct user_namespace *ns)
 #ifdef CONFIG_PERSISTENT_KEYRINGS
 		key_put(ns->persistent_keyring_register);
 #endif
+		audit_log_ns_del(AUDIT_NS_DEL_USER, ns->serial_num);
 		proc_free_inum(ns->proc_inum);
 		kmem_cache_free(user_ns_cachep, ns);
 		ns = parent;
@@ -918,3 +922,10 @@ static __init int user_namespaces_init(void)
 	return 0;
 }
 subsys_initcall(user_namespaces_init);
+
+static __init int user_namespaces_late_init(void)
+{
+	return audit_log_ns_init(AUDIT_NS_INIT_USER, 0,
+				 init_user_ns.serial_num);
+}
+late_initcall(user_namespaces_late_init);
diff --git a/kernel/utsname.c b/kernel/utsname.c
index ffeac1b..05ecc2d 100644
--- a/kernel/utsname.c
+++ b/kernel/utsname.c
@@ -16,6 +16,7 @@
 #include <linux/slab.h>
 #include <linux/user_namespace.h>
 #include <linux/proc_ns.h>
+#include <linux/audit.h>
 
 static struct uts_namespace *create_uts_ns(void)
 {
@@ -49,6 +50,7 @@ static struct uts_namespace *clone_uts_ns(struct user_namespace *user_ns,
 	}
 
 	ns->serial_num = ns_serial();
+	audit_log_ns_init(AUDIT_NS_INIT_UTS, old_ns->serial_num, ns->serial_num);
 
 	down_read(&uts_sem);
 	memcpy(&ns->name, &old_ns->name, sizeof(ns->name));
@@ -86,6 +88,7 @@ void free_uts_ns(struct kref *kref)
 
 	ns = container_of(kref, struct uts_namespace, kref);
 	put_user_ns(ns->user_ns);
+	audit_log_ns_del(AUDIT_NS_DEL_UTS, ns->serial_num);
 	proc_free_inum(ns->proc_inum);
 	kfree(ns);
 }
@@ -148,3 +151,11 @@ const struct proc_ns_operations utsns_operations = {
 	.inum		= utsns_inum,
 	.snum		= utsns_snum,
 };
+
+static int __init uts_namespaces_init(void)
+{
+	return audit_log_ns_init(AUDIT_NS_INIT_UTS, 0,
+				 init_uts_ns.serial_num);
+}
+
+late_initcall(uts_namespaces_init);
diff --git a/net/core/net_namespace.c b/net/core/net_namespace.c
index c402eea..295ecc3 100644
--- a/net/core/net_namespace.c
+++ b/net/core/net_namespace.c
@@ -17,6 +17,7 @@
 #include <linux/user_namespace.h>
 #include <net/net_namespace.h>
 #include <net/netns/generic.h>
+#include <linux/audit.h>
 
 /*
  *	Our network namespace constructor/destructor lists
@@ -253,6 +254,8 @@ struct net *copy_net_ns(unsigned long flags,
 	mutex_lock(&net_mutex);
 	rv = setup_net(net, user_ns);
 	if (rv == 0) {
+		audit_log_ns_init(AUDIT_NS_INIT_NET, old_net->serial_num,
+				  net->serial_num);
 		rtnl_lock();
 		list_add_tail_rcu(&net->list, &net_namespace_list);
 		rtnl_unlock();
@@ -395,6 +398,7 @@ static __net_init int net_ns_net_init(struct net *net)
 
 static __net_exit void net_ns_net_exit(struct net *net)
 {
+	audit_log_ns_del(AUDIT_NS_DEL_NET, net->serial_num);
 	proc_free_inum(net->proc_inum);
 }
 
@@ -441,6 +445,14 @@ static int __init net_ns_init(void)
 
 pure_initcall(net_ns_init);
 
+/* log the serial number of init_net namespace after audit service starts */
+static int __init net_ns_init_log(void)
+{
+	audit_log_ns_init(AUDIT_NS_INIT_NET, 0, init_net.serial_num);
+	return 0;
+}
+late_initcall(net_ns_init_log);
+
 #ifdef CONFIG_NET_NS
 static int __register_pernet_operations(struct list_head *list,
 					struct pernet_operations *ops)
-- 
1.7.1


^ permalink raw reply related	[flat|nested] 30+ messages in thread

* [PATCH V4 8/8] audit: initialize at subsystem time rather than device time
  2014-08-21  1:09 [PATCH V4 0/8] namespaces: log namespaces per task Richard Guy Briggs
                   ` (6 preceding siblings ...)
  2014-08-21  1:09 ` [PATCH V4 7/8] audit: log creation and deletion of namespace instances Richard Guy Briggs
@ 2014-08-21  1:09 ` Richard Guy Briggs
  2014-08-21 20:05 ` [PATCH V4 0/8] namespaces: log namespaces per task Aristeu Rozanski
  8 siblings, 0 replies; 30+ messages in thread
From: Richard Guy Briggs @ 2014-08-21  1:09 UTC (permalink / raw)
  To: linux-audit, linux-kernel, containers, linux-api
  Cc: Richard Guy Briggs, arozansk, eparis, sgrubb, ebiederm, serge

The audit subsystem should be initialized a bit earlier so that it is in place
in time for initial namespace serial number logging.
---
 kernel/audit.c |    2 +-
 1 files changed, 1 insertions(+), 1 deletions(-)

diff --git a/kernel/audit.c b/kernel/audit.c
index 6d95d1c..aa99518 100644
--- a/kernel/audit.c
+++ b/kernel/audit.c
@@ -1186,7 +1186,7 @@ static int __init audit_init(void)
 
 	return 0;
 }
-__initcall(audit_init);
+subsys_initcall(audit_init);
 
 /* Process kernel command-line parameter at boot time.  audit=0 or audit=1. */
 static int __init audit_enable(char *str)
-- 
1.7.1


^ permalink raw reply related	[flat|nested] 30+ messages in thread

* Re: [PATCH V4 0/8] namespaces: log namespaces per task
  2014-08-21  1:09 [PATCH V4 0/8] namespaces: log namespaces per task Richard Guy Briggs
                   ` (7 preceding siblings ...)
  2014-08-21  1:09 ` [PATCH V4 8/8] audit: initialize at subsystem time rather than device time Richard Guy Briggs
@ 2014-08-21 20:05 ` Aristeu Rozanski
  2014-08-21 22:32   ` Richard Guy Briggs
  8 siblings, 1 reply; 30+ messages in thread
From: Aristeu Rozanski @ 2014-08-21 20:05 UTC (permalink / raw)
  To: Richard Guy Briggs
  Cc: linux-audit, linux-kernel, containers, linux-api, eparis, sgrubb,
	ebiederm, serge

Hi Richard,
On Wed, Aug 20, 2014 at 09:09:33PM -0400, Richard Guy Briggs wrote:
> Is there a way to link serial numbers of namespaces involved in migration of a
> container to another kernel?  It sounds like what is needed is a part of a
> mangement application that is able to pull the audit records from constituent
> hosts to build an audit trail of a container.

since you're introducing a brand new serial number to make it unique
across different procfs mounts, why not instead of a simple counter,
use the hash output of say, $hostname-$creation_time-$random? Or perhaps
get a short hash of the hostname (generated once whenever hostname is
set) and append the serial number you're implementing? It'd be way less human
readable than your current proposal but it'd be unique "globally" (as long you
don't have machines with the same hostname migrating containers between them),
allowing the migrated namespaces to retain their unique identification across
audit logs. It'd of course be way more costly than just using an atomic counter,
but could be useful to anything that needs to refer to a namespace and could be
migrated to another machine.

What you think? Sounds too crazy? :)

-- 
Aristeu


^ permalink raw reply	[flat|nested] 30+ messages in thread

* Re: [PATCH V4 3/8] namespaces: expose ns instance serial numbers in proc
  2014-08-21  1:09 ` [PATCH V4 3/8] namespaces: expose ns instance serial numbers in proc Richard Guy Briggs
@ 2014-08-21 21:13   ` Andy Lutomirski
  2014-08-22  1:58     ` Richard Guy Briggs
  0 siblings, 1 reply; 30+ messages in thread
From: Andy Lutomirski @ 2014-08-21 21:13 UTC (permalink / raw)
  To: Richard Guy Briggs
  Cc: Eric W. Biederman, Eric Paris, Steve Grubb, arozansk,
	linux-audit, Linux API, Linux Containers, Serge E. Hallyn,
	linux-kernel

On Aug 20, 2014 8:12 PM, "Richard Guy Briggs" <rgb@redhat.com> wrote:
>
> Expose the namespace instace serial numbers in the proc filesystem at
> /proc/<pid>/ns/<ns>_snum.  The link text gives the serial number in hex.

What's the use case?

I understand the utility of giving unique numbers to the audit code,
but I don't think this part is necessary for that, and I'd like to
understand what else will use this before committing to a duplicative
API like this.

Note that this API is thoroughly incompatible with CRIU.  If we do
this, someone will ask for a namespace number namespace, and that way
lies madness.

--Andy

>
> "snum" was chosen instead of "seq" for consistency with inum and there are a
> number of other uses of "seq" in the namespace code.
>
> Suggested-by: Serge E. Hallyn <serge@hallyn.com>
> Signed-off-by: Richard Guy Briggs <rgb@redhat.com>
> ---
>  fs/proc/namespaces.c |   33 +++++++++++++++++++++++++--------
>  1 files changed, 25 insertions(+), 8 deletions(-)
>
> diff --git a/fs/proc/namespaces.c b/fs/proc/namespaces.c
> index 8902609..e953e0a 100644
> --- a/fs/proc/namespaces.c
> +++ b/fs/proc/namespaces.c
> @@ -47,12 +47,15 @@ static char *ns_dname(struct dentry *dentry, char *buffer, int buflen)
>         struct inode *inode = dentry->d_inode;
>         const struct proc_ns_operations *ns_ops = PROC_I(inode)->ns.ns_ops;
>
> -       return dynamic_dname(dentry, buffer, buflen, "%s:[%lu]",
> -               ns_ops->name, inode->i_ino);
> +       if (strstr(dentry->d_iname, "_snum"))
> +               return dynamic_dname(dentry, buffer, buflen, "%s_snum:[%llx]",
> +                       ns_ops->name, ns_ops->snum(PROC_I(inode)->ns.ns));
> +       else
> +               return dynamic_dname(dentry, buffer, buflen, "%s:[%lu]",
> +                       ns_ops->name, inode->i_ino);
>  }
>
> -const struct dentry_operations ns_dentry_operations =
> -{
> +const struct dentry_operations ns_dentry_operations = {
>         .d_delete       = always_delete_dentry,
>         .d_dname        = ns_dname,
>  };
> @@ -160,7 +163,10 @@ static int proc_ns_readlink(struct dentry *dentry, char __user *buffer, int bufl
>         if (!ns)
>                 goto out_put_task;
>
> -       snprintf(name, sizeof(name), "%s:[%u]", ns_ops->name, ns_ops->inum(ns));
> +       if (strstr(dentry->d_iname, "_snum"))
> +               snprintf(name, sizeof(name), "%s_snum:[%llx]", ns_ops->name, ns_ops->snum(ns));
> +       else
> +               snprintf(name, sizeof(name), "%s:[%u]", ns_ops->name, ns_ops->inum(ns));
>         res = readlink_copy(buffer, buflen, name);
>         ns_ops->put(ns);
>  out_put_task:
> @@ -210,16 +216,23 @@ static int proc_ns_dir_readdir(struct file *file, struct dir_context *ctx)
>
>         if (!dir_emit_dots(file, ctx))
>                 goto out;
> -       if (ctx->pos >= 2 + ARRAY_SIZE(ns_entries))
> +       if (ctx->pos >= 2 + 2 * ARRAY_SIZE(ns_entries))
>                 goto out;
>         entry = ns_entries + (ctx->pos - 2);
>         last = &ns_entries[ARRAY_SIZE(ns_entries) - 1];
>         while (entry <= last) {
>                 const struct proc_ns_operations *ops = *entry;
> +               char name[50];
> +
>                 if (!proc_fill_cache(file, ctx, ops->name, strlen(ops->name),
>                                      proc_ns_instantiate, task, ops))
>                         break;
>                 ctx->pos++;
> +               snprintf(name, sizeof(name), "%s_snum", ops->name);
> +               if (!proc_fill_cache(file, ctx, name, strlen(name),
> +                                    proc_ns_instantiate, task, ops))
> +                       break;
> +               ctx->pos++;
>                 entry++;
>         }
>  out:
> @@ -247,9 +260,13 @@ static struct dentry *proc_ns_dir_lookup(struct inode *dir,
>
>         last = &ns_entries[ARRAY_SIZE(ns_entries)];
>         for (entry = ns_entries; entry < last; entry++) {
> -               if (strlen((*entry)->name) != len)
> +               char name[50];
> +
> +               snprintf(name, sizeof(name), "%s_snum", (*entry)->name);
> +               if (strlen((*entry)->name) != len && strlen(name) != len)
>                         continue;
> -               if (!memcmp(dentry->d_name.name, (*entry)->name, len))
> +               if (!memcmp(dentry->d_name.name, (*entry)->name, len)
> +                   || !memcmp(dentry->d_name.name, name, len))
>                         break;
>         }
>         if (entry == last)
> --
> 1.7.1
>
> --
> To unsubscribe from this list: send the line "unsubscribe linux-api" in
> the body of a message to majordomo@vger.kernel.org
> More majordomo info at  http://vger.kernel.org/majordomo-info.html

^ permalink raw reply	[flat|nested] 30+ messages in thread

* Re: [PATCH V4 1/8] namespaces: assign each namespace instance a serial number
  2014-08-21  1:09 ` [PATCH V4 1/8] namespaces: assign each namespace instance a serial number Richard Guy Briggs
@ 2014-08-21 21:22   ` Andy Lutomirski
  2014-08-21 21:28     ` Richard Guy Briggs
  2014-08-23 12:05   ` Eric W. Biederman
  1 sibling, 1 reply; 30+ messages in thread
From: Andy Lutomirski @ 2014-08-21 21:22 UTC (permalink / raw)
  To: Richard Guy Briggs
  Cc: Eric W. Biederman, Eric Paris, Steve Grubb, arozansk,
	linux-audit, Linux API, Linux Containers, Serge E. Hallyn,
	linux-kernel

On Aug 20, 2014 8:12 PM, "Richard Guy Briggs" <rgb@redhat.com> wrote:
>
> Generate and assign a serial number per namespace instance since boot.
>
> Use a serial number per namespace (unique across one boot of one kernel)
> instead of the inode number (which is claimed to have had the right to change
> reserved and is not necessarily unique if there is more than one proc fs) to
> uniquely identify it per kernel boot.
>
> Signed-off-by: Richard Guy Briggs <rgb@redhat.com>
> ---
>  fs/mount.h                     |    1 +
>  fs/namespace.c                 |    1 +
>  include/linux/ipc_namespace.h  |    1 +
>  include/linux/nsproxy.h        |    8 ++++++++
>  include/linux/pid_namespace.h  |    1 +
>  include/linux/user_namespace.h |    1 +
>  include/linux/utsname.h        |    1 +
>  include/net/net_namespace.h    |    1 +
>  init/version.c                 |    1 +
>  ipc/msgutil.c                  |    1 +
>  ipc/namespace.c                |    2 ++
>  kernel/nsproxy.c               |   17 +++++++++++++++++
>  kernel/pid.c                   |    1 +
>  kernel/pid_namespace.c         |    2 ++
>  kernel/user.c                  |    1 +
>  kernel/user_namespace.c        |    2 ++
>  kernel/utsname.c               |    2 ++
>  net/core/net_namespace.c       |    8 +++++++-
>  18 files changed, 51 insertions(+), 1 deletions(-)
>
> diff --git a/fs/mount.h b/fs/mount.h
> index d55297f..c076f99 100644
> --- a/fs/mount.h
> +++ b/fs/mount.h
> @@ -5,6 +5,7 @@
>  struct mnt_namespace {
>         atomic_t                count;
>         unsigned int            proc_inum;
> +       long long       serial_num;
>         struct mount *  root;
>         struct list_head        list;
>         struct user_namespace   *user_ns;
> diff --git a/fs/namespace.c b/fs/namespace.c
> index 182bc41..9af49ff 100644
> --- a/fs/namespace.c
> +++ b/fs/namespace.c
> @@ -2486,6 +2486,7 @@ static struct mnt_namespace *alloc_mnt_ns(struct user_namespace *user_ns)
>                 kfree(new_ns);
>                 return ERR_PTR(ret);
>         }
> +       new_ns->serial_num = ns_serial();
>         new_ns->seq = atomic64_add_return(1, &mnt_ns_seq);
>         atomic_set(&new_ns->count, 1);
>         new_ns->root = NULL;
> diff --git a/include/linux/ipc_namespace.h b/include/linux/ipc_namespace.h
> index 35e7eca..8ccfb2d 100644
> --- a/include/linux/ipc_namespace.h
> +++ b/include/linux/ipc_namespace.h
> @@ -69,6 +69,7 @@ struct ipc_namespace {
>         struct user_namespace *user_ns;
>
>         unsigned int    proc_inum;
> +       long long       serial_num;
>  };
>
>  extern struct ipc_namespace init_ipc_ns;
> diff --git a/include/linux/nsproxy.h b/include/linux/nsproxy.h
> index b4ec59d..8e5fe0d 100644
> --- a/include/linux/nsproxy.h
> +++ b/include/linux/nsproxy.h
> @@ -66,6 +66,14 @@ static inline struct nsproxy *task_nsproxy(struct task_struct *tsk)
>         return rcu_dereference(tsk->nsproxy);
>  }
>
> +long long ns_serial(void);
> +enum {
> +       NS_IPC_INIT_SN  = 1,
> +       NS_UTS_INIT_SN  = 2,
> +       NS_USER_INIT_SN = 3,
> +       NS_PID_INIT_SN  = 4,

Please add an extra value here for the first dynamic value...
> +/**
> + * ns_serial - compute a serial number for the namespace
> + *
> + * Compute a serial number for the namespace to uniquely identify it in
> + * audit records.
> + */
> +long long ns_serial(void)
> +{
> +       static atomic64_t serial = ATOMIC_INIT(4); /* reserved for IPC, UTS, user, PID */

...and use it here.

Also, does this work on all architectures?


--Andy

^ permalink raw reply	[flat|nested] 30+ messages in thread

* Re: [PATCH V4 1/8] namespaces: assign each namespace instance a serial number
  2014-08-21 21:22   ` Andy Lutomirski
@ 2014-08-21 21:28     ` Richard Guy Briggs
  2014-08-21 21:30       ` Andy Lutomirski
  0 siblings, 1 reply; 30+ messages in thread
From: Richard Guy Briggs @ 2014-08-21 21:28 UTC (permalink / raw)
  To: Andy Lutomirski
  Cc: Eric W. Biederman, Eric Paris, Steve Grubb, arozansk,
	linux-audit, Linux API, Linux Containers, Serge E. Hallyn,
	linux-kernel

On 14/08/21, Andy Lutomirski wrote:
> On Aug 20, 2014 8:12 PM, "Richard Guy Briggs" <rgb@redhat.com> wrote:
> >
> > Generate and assign a serial number per namespace instance since boot.
> >
> > Use a serial number per namespace (unique across one boot of one kernel)
> > instead of the inode number (which is claimed to have had the right to change
> > reserved and is not necessarily unique if there is more than one proc fs) to
> > uniquely identify it per kernel boot.
> >
> > Signed-off-by: Richard Guy Briggs <rgb@redhat.com>
> > ---
> >  fs/mount.h                     |    1 +
> >  fs/namespace.c                 |    1 +
> >  include/linux/ipc_namespace.h  |    1 +
> >  include/linux/nsproxy.h        |    8 ++++++++
> >  include/linux/pid_namespace.h  |    1 +
> >  include/linux/user_namespace.h |    1 +
> >  include/linux/utsname.h        |    1 +
> >  include/net/net_namespace.h    |    1 +
> >  init/version.c                 |    1 +
> >  ipc/msgutil.c                  |    1 +
> >  ipc/namespace.c                |    2 ++
> >  kernel/nsproxy.c               |   17 +++++++++++++++++
> >  kernel/pid.c                   |    1 +
> >  kernel/pid_namespace.c         |    2 ++
> >  kernel/user.c                  |    1 +
> >  kernel/user_namespace.c        |    2 ++
> >  kernel/utsname.c               |    2 ++
> >  net/core/net_namespace.c       |    8 +++++++-
> >  18 files changed, 51 insertions(+), 1 deletions(-)
> >
> > diff --git a/fs/mount.h b/fs/mount.h
> > index d55297f..c076f99 100644
> > --- a/fs/mount.h
> > +++ b/fs/mount.h
> > @@ -5,6 +5,7 @@
> >  struct mnt_namespace {
> >         atomic_t                count;
> >         unsigned int            proc_inum;
> > +       long long       serial_num;
> >         struct mount *  root;
> >         struct list_head        list;
> >         struct user_namespace   *user_ns;
> > diff --git a/fs/namespace.c b/fs/namespace.c
> > index 182bc41..9af49ff 100644
> > --- a/fs/namespace.c
> > +++ b/fs/namespace.c
> > @@ -2486,6 +2486,7 @@ static struct mnt_namespace *alloc_mnt_ns(struct user_namespace *user_ns)
> >                 kfree(new_ns);
> >                 return ERR_PTR(ret);
> >         }
> > +       new_ns->serial_num = ns_serial();
> >         new_ns->seq = atomic64_add_return(1, &mnt_ns_seq);
> >         atomic_set(&new_ns->count, 1);
> >         new_ns->root = NULL;
> > diff --git a/include/linux/ipc_namespace.h b/include/linux/ipc_namespace.h
> > index 35e7eca..8ccfb2d 100644
> > --- a/include/linux/ipc_namespace.h
> > +++ b/include/linux/ipc_namespace.h
> > @@ -69,6 +69,7 @@ struct ipc_namespace {
> >         struct user_namespace *user_ns;
> >
> >         unsigned int    proc_inum;
> > +       long long       serial_num;
> >  };
> >
> >  extern struct ipc_namespace init_ipc_ns;
> > diff --git a/include/linux/nsproxy.h b/include/linux/nsproxy.h
> > index b4ec59d..8e5fe0d 100644
> > --- a/include/linux/nsproxy.h
> > +++ b/include/linux/nsproxy.h
> > @@ -66,6 +66,14 @@ static inline struct nsproxy *task_nsproxy(struct task_struct *tsk)
> >         return rcu_dereference(tsk->nsproxy);
> >  }
> >
> > +long long ns_serial(void);
> > +enum {
> > +       NS_IPC_INIT_SN  = 1,
> > +       NS_UTS_INIT_SN  = 2,
> > +       NS_USER_INIT_SN = 3,
> > +       NS_PID_INIT_SN  = 4,
> 
> Please add an extra value here for the first dynamic value...
> > +/**
> > + * ns_serial - compute a serial number for the namespace
> > + *
> > + * Compute a serial number for the namespace to uniquely identify it in
> > + * audit records.
> > + */
> > +long long ns_serial(void)
> > +{
> > +       static atomic64_t serial = ATOMIC_INIT(4); /* reserved for IPC, UTS, user, PID */
> 
> ...and use it here.

Yup, good idea.  Thanks.

> Also, does this work on all architectures?

I only know for certain x86_64.  There is discussion elsewhere about
changing ns_serial from returning a long long to returning a u64.  That
should help.  I can run standard compile and boot tests on several
others, but won't be able to check output.

> --Andy

- RGB

--
Richard Guy Briggs <rbriggs@redhat.com>
Senior Software Engineer, Kernel Security, AMER ENG Base Operating Systems, Red Hat
Remote, Ottawa, Canada
Voice: +1.647.777.2635, Internal: (81) 32635, Alt: +1.613.693.0684x3545

^ permalink raw reply	[flat|nested] 30+ messages in thread

* Re: [PATCH V4 1/8] namespaces: assign each namespace instance a serial number
  2014-08-21 21:28     ` Richard Guy Briggs
@ 2014-08-21 21:30       ` Andy Lutomirski
  2014-08-21 22:15         ` Richard Guy Briggs
  0 siblings, 1 reply; 30+ messages in thread
From: Andy Lutomirski @ 2014-08-21 21:30 UTC (permalink / raw)
  To: Richard Guy Briggs
  Cc: Eric W. Biederman, Eric Paris, Steve Grubb, arozansk,
	linux-audit, Linux API, Linux Containers, Serge E. Hallyn,
	linux-kernel

On Thu, Aug 21, 2014 at 2:28 PM, Richard Guy Briggs <rgb@redhat.com> wrote:
> On 14/08/21, Andy Lutomirski wrote:
>> On Aug 20, 2014 8:12 PM, "Richard Guy Briggs" <rgb@redhat.com> wrote:
>> >
>> > Generate and assign a serial number per namespace instance since boot.
>> >
>> > Use a serial number per namespace (unique across one boot of one kernel)
>> > instead of the inode number (which is claimed to have had the right to change
>> > reserved and is not necessarily unique if there is more than one proc fs) to
>> > uniquely identify it per kernel boot.
>> >
>> > Signed-off-by: Richard Guy Briggs <rgb@redhat.com>
>> > ---
>> >  fs/mount.h                     |    1 +
>> >  fs/namespace.c                 |    1 +
>> >  include/linux/ipc_namespace.h  |    1 +
>> >  include/linux/nsproxy.h        |    8 ++++++++
>> >  include/linux/pid_namespace.h  |    1 +
>> >  include/linux/user_namespace.h |    1 +
>> >  include/linux/utsname.h        |    1 +
>> >  include/net/net_namespace.h    |    1 +
>> >  init/version.c                 |    1 +
>> >  ipc/msgutil.c                  |    1 +
>> >  ipc/namespace.c                |    2 ++
>> >  kernel/nsproxy.c               |   17 +++++++++++++++++
>> >  kernel/pid.c                   |    1 +
>> >  kernel/pid_namespace.c         |    2 ++
>> >  kernel/user.c                  |    1 +
>> >  kernel/user_namespace.c        |    2 ++
>> >  kernel/utsname.c               |    2 ++
>> >  net/core/net_namespace.c       |    8 +++++++-
>> >  18 files changed, 51 insertions(+), 1 deletions(-)
>> >
>> > diff --git a/fs/mount.h b/fs/mount.h
>> > index d55297f..c076f99 100644
>> > --- a/fs/mount.h
>> > +++ b/fs/mount.h
>> > @@ -5,6 +5,7 @@
>> >  struct mnt_namespace {
>> >         atomic_t                count;
>> >         unsigned int            proc_inum;
>> > +       long long       serial_num;
>> >         struct mount *  root;
>> >         struct list_head        list;
>> >         struct user_namespace   *user_ns;
>> > diff --git a/fs/namespace.c b/fs/namespace.c
>> > index 182bc41..9af49ff 100644
>> > --- a/fs/namespace.c
>> > +++ b/fs/namespace.c
>> > @@ -2486,6 +2486,7 @@ static struct mnt_namespace *alloc_mnt_ns(struct user_namespace *user_ns)
>> >                 kfree(new_ns);
>> >                 return ERR_PTR(ret);
>> >         }
>> > +       new_ns->serial_num = ns_serial();
>> >         new_ns->seq = atomic64_add_return(1, &mnt_ns_seq);
>> >         atomic_set(&new_ns->count, 1);
>> >         new_ns->root = NULL;
>> > diff --git a/include/linux/ipc_namespace.h b/include/linux/ipc_namespace.h
>> > index 35e7eca..8ccfb2d 100644
>> > --- a/include/linux/ipc_namespace.h
>> > +++ b/include/linux/ipc_namespace.h
>> > @@ -69,6 +69,7 @@ struct ipc_namespace {
>> >         struct user_namespace *user_ns;
>> >
>> >         unsigned int    proc_inum;
>> > +       long long       serial_num;
>> >  };
>> >
>> >  extern struct ipc_namespace init_ipc_ns;
>> > diff --git a/include/linux/nsproxy.h b/include/linux/nsproxy.h
>> > index b4ec59d..8e5fe0d 100644
>> > --- a/include/linux/nsproxy.h
>> > +++ b/include/linux/nsproxy.h
>> > @@ -66,6 +66,14 @@ static inline struct nsproxy *task_nsproxy(struct task_struct *tsk)
>> >         return rcu_dereference(tsk->nsproxy);
>> >  }
>> >
>> > +long long ns_serial(void);
>> > +enum {
>> > +       NS_IPC_INIT_SN  = 1,
>> > +       NS_UTS_INIT_SN  = 2,
>> > +       NS_USER_INIT_SN = 3,
>> > +       NS_PID_INIT_SN  = 4,
>>
>> Please add an extra value here for the first dynamic value...
>> > +/**
>> > + * ns_serial - compute a serial number for the namespace
>> > + *
>> > + * Compute a serial number for the namespace to uniquely identify it in
>> > + * audit records.
>> > + */
>> > +long long ns_serial(void)
>> > +{
>> > +       static atomic64_t serial = ATOMIC_INIT(4); /* reserved for IPC, UTS, user, PID */
>>
>> ...and use it here.
>
> Yup, good idea.  Thanks.
>
>> Also, does this work on all architectures?
>
> I only know for certain x86_64.  There is discussion elsewhere about
> changing ns_serial from returning a long long to returning a u64.  That
> should help.  I can run standard compile and boot tests on several
> others, but won't be able to check output.

I was thinking of atomic64.  I guess there's a generic implementation
that works everywhere, if somewhat slowly.

--Andy

^ permalink raw reply	[flat|nested] 30+ messages in thread

* Re: [PATCH V4 1/8] namespaces: assign each namespace instance a serial number
  2014-08-21 21:30       ` Andy Lutomirski
@ 2014-08-21 22:15         ` Richard Guy Briggs
  0 siblings, 0 replies; 30+ messages in thread
From: Richard Guy Briggs @ 2014-08-21 22:15 UTC (permalink / raw)
  To: Andy Lutomirski
  Cc: Eric W. Biederman, Eric Paris, Steve Grubb, arozansk,
	linux-audit, Linux API, Linux Containers, Serge E. Hallyn,
	linux-kernel

On 14/08/21, Andy Lutomirski wrote:
> On Thu, Aug 21, 2014 at 2:28 PM, Richard Guy Briggs <rgb@redhat.com> wrote:
> > On 14/08/21, Andy Lutomirski wrote:
> >> On Aug 20, 2014 8:12 PM, "Richard Guy Briggs" <rgb@redhat.com> wrote:
> >> >
> >> > Generate and assign a serial number per namespace instance since boot.
> >> >
> >> > Use a serial number per namespace (unique across one boot of one kernel)
> >> > instead of the inode number (which is claimed to have had the right to change
> >> > reserved and is not necessarily unique if there is more than one proc fs) to
> >> > uniquely identify it per kernel boot.
> >> >
> >> > Signed-off-by: Richard Guy Briggs <rgb@redhat.com>
> >> > ---
> >> >  fs/mount.h                     |    1 +
> >> >  fs/namespace.c                 |    1 +
> >> >  include/linux/ipc_namespace.h  |    1 +
> >> >  include/linux/nsproxy.h        |    8 ++++++++
> >> >  include/linux/pid_namespace.h  |    1 +
> >> >  include/linux/user_namespace.h |    1 +
> >> >  include/linux/utsname.h        |    1 +
> >> >  include/net/net_namespace.h    |    1 +
> >> >  init/version.c                 |    1 +
> >> >  ipc/msgutil.c                  |    1 +
> >> >  ipc/namespace.c                |    2 ++
> >> >  kernel/nsproxy.c               |   17 +++++++++++++++++
> >> >  kernel/pid.c                   |    1 +
> >> >  kernel/pid_namespace.c         |    2 ++
> >> >  kernel/user.c                  |    1 +
> >> >  kernel/user_namespace.c        |    2 ++
> >> >  kernel/utsname.c               |    2 ++
> >> >  net/core/net_namespace.c       |    8 +++++++-
> >> >  18 files changed, 51 insertions(+), 1 deletions(-)
> >> >
> >> > diff --git a/fs/mount.h b/fs/mount.h
> >> > index d55297f..c076f99 100644
> >> > --- a/fs/mount.h
> >> > +++ b/fs/mount.h
> >> > @@ -5,6 +5,7 @@
> >> >  struct mnt_namespace {
> >> >         atomic_t                count;
> >> >         unsigned int            proc_inum;
> >> > +       long long       serial_num;
> >> >         struct mount *  root;
> >> >         struct list_head        list;
> >> >         struct user_namespace   *user_ns;
> >> > diff --git a/fs/namespace.c b/fs/namespace.c
> >> > index 182bc41..9af49ff 100644
> >> > --- a/fs/namespace.c
> >> > +++ b/fs/namespace.c
> >> > @@ -2486,6 +2486,7 @@ static struct mnt_namespace *alloc_mnt_ns(struct user_namespace *user_ns)
> >> >                 kfree(new_ns);
> >> >                 return ERR_PTR(ret);
> >> >         }
> >> > +       new_ns->serial_num = ns_serial();
> >> >         new_ns->seq = atomic64_add_return(1, &mnt_ns_seq);
> >> >         atomic_set(&new_ns->count, 1);
> >> >         new_ns->root = NULL;
> >> > diff --git a/include/linux/ipc_namespace.h b/include/linux/ipc_namespace.h
> >> > index 35e7eca..8ccfb2d 100644
> >> > --- a/include/linux/ipc_namespace.h
> >> > +++ b/include/linux/ipc_namespace.h
> >> > @@ -69,6 +69,7 @@ struct ipc_namespace {
> >> >         struct user_namespace *user_ns;
> >> >
> >> >         unsigned int    proc_inum;
> >> > +       long long       serial_num;
> >> >  };
> >> >
> >> >  extern struct ipc_namespace init_ipc_ns;
> >> > diff --git a/include/linux/nsproxy.h b/include/linux/nsproxy.h
> >> > index b4ec59d..8e5fe0d 100644
> >> > --- a/include/linux/nsproxy.h
> >> > +++ b/include/linux/nsproxy.h
> >> > @@ -66,6 +66,14 @@ static inline struct nsproxy *task_nsproxy(struct task_struct *tsk)
> >> >         return rcu_dereference(tsk->nsproxy);
> >> >  }
> >> >
> >> > +long long ns_serial(void);
> >> > +enum {
> >> > +       NS_IPC_INIT_SN  = 1,
> >> > +       NS_UTS_INIT_SN  = 2,
> >> > +       NS_USER_INIT_SN = 3,
> >> > +       NS_PID_INIT_SN  = 4,
> >>
> >> Please add an extra value here for the first dynamic value...
> >> > +/**
> >> > + * ns_serial - compute a serial number for the namespace
> >> > + *
> >> > + * Compute a serial number for the namespace to uniquely identify it in
> >> > + * audit records.
> >> > + */
> >> > +long long ns_serial(void)
> >> > +{
> >> > +       static atomic64_t serial = ATOMIC_INIT(4); /* reserved for IPC, UTS, user, PID */
> >>
> >> ...and use it here.
> >
> > Yup, good idea.  Thanks.
> >
> >> Also, does this work on all architectures?
> >
> > I only know for certain x86_64.  There is discussion elsewhere about
> > changing ns_serial from returning a long long to returning a u64.  That
> > should help.  I can run standard compile and boot tests on several
> > others, but won't be able to check output.
> 
> I was thinking of atomic64.  I guess there's a generic implementation
> that works everywhere, if somewhat slowly.

Yup.

>From include/linux/types.h:
	#ifdef CONFIG_64BIT
	typedef struct {
        	long counter;
	} atomic64_t;
	#endif

>From include/asm-generic/atomic64.h:
	typedef struct {
        	long long counter;
	} atomic64_t;

and for non-x86, there is lib/atomic64.c:
	atomic64_add_return()
which is in C while the x86_64 version is in ASM.

> --Andy

- RGB

--
Richard Guy Briggs <rbriggs@redhat.com>
Senior Software Engineer, Kernel Security, AMER ENG Base Operating Systems, Red Hat
Remote, Ottawa, Canada
Voice: +1.647.777.2635, Internal: (81) 32635, Alt: +1.613.693.0684x3545

^ permalink raw reply	[flat|nested] 30+ messages in thread

* Re: [PATCH V4 0/8] namespaces: log namespaces per task
  2014-08-21 20:05 ` [PATCH V4 0/8] namespaces: log namespaces per task Aristeu Rozanski
@ 2014-08-21 22:32   ` Richard Guy Briggs
  0 siblings, 0 replies; 30+ messages in thread
From: Richard Guy Briggs @ 2014-08-21 22:32 UTC (permalink / raw)
  To: Aristeu Rozanski
  Cc: linux-audit, linux-kernel, containers, linux-api, eparis, sgrubb,
	ebiederm, serge

On 14/08/21, Aristeu Rozanski wrote:
> Hi Richard,

Hi Aris,

> On Wed, Aug 20, 2014 at 09:09:33PM -0400, Richard Guy Briggs wrote:
> > Is there a way to link serial numbers of namespaces involved in migration of a
> > container to another kernel?  It sounds like what is needed is a part of a
> > mangement application that is able to pull the audit records from constituent
> > hosts to build an audit trail of a container.
> 
> since you're introducing a brand new serial number to make it unique
> across different procfs mounts, why not instead of a simple counter,
> use the hash output of say, $hostname-$creation_time-$random?

I had thought of this earlier on, but I could see many VMs started up
from an identical image, making the resulting hash possibly identical.

Besides, hostname isn't known yet when we are creating initial
namespaces.

> Or perhaps
> get a short hash of the hostname (generated once whenever hostname is
> set) and append the serial number you're implementing? It'd be way less human
> readable than your current proposal but it'd be unique "globally" (as long you
> don't have machines with the same hostname migrating containers between them),
> allowing the migrated namespaces to retain their unique identification across
> audit logs. It'd of course be way more costly than just using an atomic counter,
> but could be useful to anything that needs to refer to a namespace and could be
> migrated to another machine.

This also means that any namespace that is migrated would have to be
recreated on another host and inject an existing ID into it rather than
have the host creating it generate that ID.  Some namespaces are peers
that take the kernel default, while others are hierarchical and inherit
from their creating namespaces.

It was much easier at my layer to punt that management to a higher
layer that already knew about the other hosts in play and to manage that
information as it saw fit.

> What you think? Sounds too crazy? :)

Yup.  I was hoping there would be some kind of unique identifier per
running kernel, including CPU_ID (which may not exist or may be shut
off), RTC boot value (which may be identical for VMs), or initial random
state (which could be identical for VMs).

> Aristeu

- RGB

--
Richard Guy Briggs <rbriggs@redhat.com>
Senior Software Engineer, Kernel Security, AMER ENG Base Operating Systems, Red Hat
Remote, Ottawa, Canada
Voice: +1.647.777.2635, Internal: (81) 32635, Alt: +1.613.693.0684x3545

^ permalink raw reply	[flat|nested] 30+ messages in thread

* Re: [PATCH V4 3/8] namespaces: expose ns instance serial numbers in proc
  2014-08-21 21:13   ` Andy Lutomirski
@ 2014-08-22  1:58     ` Richard Guy Briggs
  2014-08-24 17:52       ` Andy Lutomirski
  0 siblings, 1 reply; 30+ messages in thread
From: Richard Guy Briggs @ 2014-08-22  1:58 UTC (permalink / raw)
  To: Andy Lutomirski
  Cc: Eric W. Biederman, Eric Paris, Steve Grubb, arozansk,
	linux-audit, Linux API, Linux Containers, Serge E. Hallyn,
	linux-kernel

On 14/08/21, Andy Lutomirski wrote:
> On Aug 20, 2014 8:12 PM, "Richard Guy Briggs" <rgb@redhat.com> wrote:
> > Expose the namespace instace serial numbers in the proc filesystem at
> > /proc/<pid>/ns/<ns>_snum.  The link text gives the serial number in hex.
> 
> What's the use case?
> 
> I understand the utility of giving unique numbers to the audit code,
> but I don't think this part is necessary for that, and I'd like to
> understand what else will use this before committing to a duplicative
> API like this.

How does a container manager get those numbers?  It could provoke a task
to cause an audit event that emits a NS_INFO message, or it could run a
task in that container to report its namespace serial numbers directly
from its /proc mount.

The discussion in this thread touches on the use cases:
	https://lkml.org/lkml/2014/4/22/662

> Note that this API is thoroughly incompatible with CRIU.  If we do
> this, someone will ask for a namespace number namespace, and that way
> lies madness.

I had a very brief look at CRIU, but not enough to understand the issue.
Others have hinted at this problem.

Do you have a suggestion of a different approach that would be
compatible with CRIU?

I'd originally considered some sort of UUID that would be globally
unique, but that would be very hard to devise or guarantee, and besides,
namespaces aren't only used by containers and could be shared in other
ways.  Tracking the usage and migration of namespaces should be the task
of an upper layer.

> --Andy
> 
> >
> > "snum" was chosen instead of "seq" for consistency with inum and there are a
> > number of other uses of "seq" in the namespace code.
> >
> > Suggested-by: Serge E. Hallyn <serge@hallyn.com>
> > Signed-off-by: Richard Guy Briggs <rgb@redhat.com>
> > ---
> >  fs/proc/namespaces.c |   33 +++++++++++++++++++++++++--------
> >  1 files changed, 25 insertions(+), 8 deletions(-)
> >
> > diff --git a/fs/proc/namespaces.c b/fs/proc/namespaces.c
> > index 8902609..e953e0a 100644
> > --- a/fs/proc/namespaces.c
> > +++ b/fs/proc/namespaces.c
> > @@ -47,12 +47,15 @@ static char *ns_dname(struct dentry *dentry, char *buffer, int buflen)
> >         struct inode *inode = dentry->d_inode;
> >         const struct proc_ns_operations *ns_ops = PROC_I(inode)->ns.ns_ops;
> >
> > -       return dynamic_dname(dentry, buffer, buflen, "%s:[%lu]",
> > -               ns_ops->name, inode->i_ino);
> > +       if (strstr(dentry->d_iname, "_snum"))
> > +               return dynamic_dname(dentry, buffer, buflen, "%s_snum:[%llx]",
> > +                       ns_ops->name, ns_ops->snum(PROC_I(inode)->ns.ns));
> > +       else
> > +               return dynamic_dname(dentry, buffer, buflen, "%s:[%lu]",
> > +                       ns_ops->name, inode->i_ino);
> >  }
> >
> > -const struct dentry_operations ns_dentry_operations =
> > -{
> > +const struct dentry_operations ns_dentry_operations = {
> >         .d_delete       = always_delete_dentry,
> >         .d_dname        = ns_dname,
> >  };
> > @@ -160,7 +163,10 @@ static int proc_ns_readlink(struct dentry *dentry, char __user *buffer, int bufl
> >         if (!ns)
> >                 goto out_put_task;
> >
> > -       snprintf(name, sizeof(name), "%s:[%u]", ns_ops->name, ns_ops->inum(ns));
> > +       if (strstr(dentry->d_iname, "_snum"))
> > +               snprintf(name, sizeof(name), "%s_snum:[%llx]", ns_ops->name, ns_ops->snum(ns));
> > +       else
> > +               snprintf(name, sizeof(name), "%s:[%u]", ns_ops->name, ns_ops->inum(ns));
> >         res = readlink_copy(buffer, buflen, name);
> >         ns_ops->put(ns);
> >  out_put_task:
> > @@ -210,16 +216,23 @@ static int proc_ns_dir_readdir(struct file *file, struct dir_context *ctx)
> >
> >         if (!dir_emit_dots(file, ctx))
> >                 goto out;
> > -       if (ctx->pos >= 2 + ARRAY_SIZE(ns_entries))
> > +       if (ctx->pos >= 2 + 2 * ARRAY_SIZE(ns_entries))
> >                 goto out;
> >         entry = ns_entries + (ctx->pos - 2);
> >         last = &ns_entries[ARRAY_SIZE(ns_entries) - 1];
> >         while (entry <= last) {
> >                 const struct proc_ns_operations *ops = *entry;
> > +               char name[50];
> > +
> >                 if (!proc_fill_cache(file, ctx, ops->name, strlen(ops->name),
> >                                      proc_ns_instantiate, task, ops))
> >                         break;
> >                 ctx->pos++;
> > +               snprintf(name, sizeof(name), "%s_snum", ops->name);
> > +               if (!proc_fill_cache(file, ctx, name, strlen(name),
> > +                                    proc_ns_instantiate, task, ops))
> > +                       break;
> > +               ctx->pos++;
> >                 entry++;
> >         }
> >  out:
> > @@ -247,9 +260,13 @@ static struct dentry *proc_ns_dir_lookup(struct inode *dir,
> >
> >         last = &ns_entries[ARRAY_SIZE(ns_entries)];
> >         for (entry = ns_entries; entry < last; entry++) {
> > -               if (strlen((*entry)->name) != len)
> > +               char name[50];
> > +
> > +               snprintf(name, sizeof(name), "%s_snum", (*entry)->name);
> > +               if (strlen((*entry)->name) != len && strlen(name) != len)
> >                         continue;
> > -               if (!memcmp(dentry->d_name.name, (*entry)->name, len))
> > +               if (!memcmp(dentry->d_name.name, (*entry)->name, len)
> > +                   || !memcmp(dentry->d_name.name, name, len))
> >                         break;
> >         }
> >         if (entry == last)
> > --
> > 1.7.1
> >
> > --
> > To unsubscribe from this list: send the line "unsubscribe linux-api" in
> > the body of a message to majordomo@vger.kernel.org
> > More majordomo info at  http://vger.kernel.org/majordomo-info.html

- RGB

--
Richard Guy Briggs <rbriggs@redhat.com>
Senior Software Engineer, Kernel Security, AMER ENG Base Operating Systems, Red Hat
Remote, Ottawa, Canada
Voice: +1.647.777.2635, Internal: (81) 32635, Alt: +1.613.693.0684x3545

^ permalink raw reply	[flat|nested] 30+ messages in thread

* Re: [PATCH V4 1/8] namespaces: assign each namespace instance a serial number
  2014-08-21  1:09 ` [PATCH V4 1/8] namespaces: assign each namespace instance a serial number Richard Guy Briggs
  2014-08-21 21:22   ` Andy Lutomirski
@ 2014-08-23 12:05   ` Eric W. Biederman
  2014-08-24 20:38     ` Richard Guy Briggs
  1 sibling, 1 reply; 30+ messages in thread
From: Eric W. Biederman @ 2014-08-23 12:05 UTC (permalink / raw)
  To: Richard Guy Briggs
  Cc: linux-audit, linux-kernel, containers, linux-api, arozansk,
	eparis, sgrubb, serge, Andy Lutomirski

Richard Guy Briggs <rgb@redhat.com> writes:

> Generate and assign a serial number per namespace instance since boot.
>
> Use a serial number per namespace (unique across one boot of one kernel)
> instead of the inode number (which is claimed to have had the right to change
> reserved and is not necessarily unique if there is more than one proc fs) to
> uniquely identify it per kernel boot.

This approach is just broken.

For this to work with migration (aka criu) you need to implement a
namespace of namespaces.  You haven't done this, and therefore
such an interface will break existing userspace.

Inside of audit I can understand not caring about these issues,
but you go foward and expose these serial numbers in proc,
and generally make this infrastructure available to others.

The deep issue with migration is that we move tasks from one machine
from another and on the destination machine we need to have all of the
same global identifiers for software to function properly.

My weasel words around the proc inode numbers is to preserve to allow us
room to be able to restore those ids if it every becomes relevant for
migration.

That is the proc inode numbers (technically) live in a pid namespace,
(aka a mount of proc).  So depending on the pid namespace you are in
or the mount of proc you look in the numbers could change.

Qualifications like that must exist to have a prayer of ever supporting
process migration in the crazy corner cases where people start caring
about inode numbers.

We currently don't and inode numbers for a namespace will never change
after a namespace is created.  So I think you really are ok using the
proc inode numbers.  I am happy declaring by fiat that the inode numbers
that audit uses are the numbers connected to the initial pid namespace.

At a fairly basic level anything that is used to identify namespaces for
any general purpose use needs to have most if not all of the same
properties of the proc inode numbers.  The most important of which is
being tied to some context/namespace so there is a ability if we ever
need it to migrate those numbers from one machine to another.

Eric

> diff --git a/kernel/nsproxy.c b/kernel/nsproxy.c
> index 8e78110..93cb380 100644
> --- a/kernel/nsproxy.c
> +++ b/kernel/nsproxy.c
> @@ -41,6 +41,23 @@ struct nsproxy init_nsproxy = {
>  #endif
>  };
>  
> +/**
> + * ns_serial - compute a serial number for the namespace
> + *
> + * Compute a serial number for the namespace to uniquely identify it in
> + * audit records.
> + */
> +long long ns_serial(void)
> +{
> +	static atomic64_t serial = ATOMIC_INIT(4); /* reserved for IPC, UTS, user, PID */
> +	long long ret;
> +
> +	ret = atomic64_add_return(1, &serial);
> +	BUG_ON(!ret);
> +
> +	return ret;
> +}
> +
>  static inline struct nsproxy *create_nsproxy(void)
>  {
>  	struct nsproxy *nsproxy;

^ permalink raw reply	[flat|nested] 30+ messages in thread

* Re: [PATCH V4 3/8] namespaces: expose ns instance serial numbers in proc
  2014-08-22  1:58     ` Richard Guy Briggs
@ 2014-08-24 17:52       ` Andy Lutomirski
  2014-08-24 20:28         ` Richard Guy Briggs
  2014-08-25 13:30         ` Nicolas Dichtel
  0 siblings, 2 replies; 30+ messages in thread
From: Andy Lutomirski @ 2014-08-24 17:52 UTC (permalink / raw)
  To: Richard Guy Briggs
  Cc: Eric W. Biederman, Eric Paris, Steve Grubb, arozansk,
	linux-audit, Linux API, Linux Containers, Serge E. Hallyn,
	linux-kernel

On Thu, Aug 21, 2014 at 6:58 PM, Richard Guy Briggs <rgb@redhat.com> wrote:
> On 14/08/21, Andy Lutomirski wrote:
>> On Aug 20, 2014 8:12 PM, "Richard Guy Briggs" <rgb@redhat.com> wrote:
>> > Expose the namespace instace serial numbers in the proc filesystem at
>> > /proc/<pid>/ns/<ns>_snum.  The link text gives the serial number in hex.
>>
>> What's the use case?
>>
>> I understand the utility of giving unique numbers to the audit code,
>> but I don't think this part is necessary for that, and I'd like to
>> understand what else will use this before committing to a duplicative
>> API like this.
>
> How does a container manager get those numbers?  It could provoke a task
> to cause an audit event that emits a NS_INFO message, or it could run a
> task in that container to report its namespace serial numbers directly
> from its /proc mount.

Why does a container manager need them?  Is there any reason that
keeping them entirely contained within the audit system would be a
problem?

>
> The discussion in this thread touches on the use cases:
>         https://lkml.org/lkml/2014/4/22/662
>
>> Note that this API is thoroughly incompatible with CRIU.  If we do
>> this, someone will ask for a namespace number namespace, and that way
>> lies madness.
>
> I had a very brief look at CRIU, but not enough to understand the issue.
> Others have hinted at this problem.
>
> Do you have a suggestion of a different approach that would be
> compatible with CRIU?
>
> I'd originally considered some sort of UUID that would be globally
> unique, but that would be very hard to devise or guarantee, and besides,
> namespaces aren't only used by containers and could be shared in other
> ways.  Tracking the usage and migration of namespaces should be the task
> of an upper layer.
>

CRIU wants to save the complete state of a namespace and then restore
it.  For that to work, any information exposed to things in the
namespace *cannot* be globally unique or unique per boot, since CRIU
needs to arrange for that information to match whatever it was when
CRIU saved it.

Also, I think that code running in a namespace has no business even
knowing a unique identity of that namespace from the perspective of
the host.

Here's a specific use case for *not* exposing this: Tor.  Ideally, Tor
clients would run in a namespace that does not know about any global
identity.  That means no IP addresses, but it also means no global
namespace serial numbers.

--Andy

^ permalink raw reply	[flat|nested] 30+ messages in thread

* Re: [PATCH V4 3/8] namespaces: expose ns instance serial numbers in proc
  2014-08-24 17:52       ` Andy Lutomirski
@ 2014-08-24 20:28         ` Richard Guy Briggs
  2014-08-25 13:30         ` Nicolas Dichtel
  1 sibling, 0 replies; 30+ messages in thread
From: Richard Guy Briggs @ 2014-08-24 20:28 UTC (permalink / raw)
  To: Andy Lutomirski
  Cc: Serge E. Hallyn, Linux API, Linux Containers, linux-kernel,
	linux-audit, Eric W. Biederman

On 14/08/24, Andy Lutomirski wrote:
> On Thu, Aug 21, 2014 at 6:58 PM, Richard Guy Briggs <rgb@redhat.com> wrote:
> > On 14/08/21, Andy Lutomirski wrote:
> >> On Aug 20, 2014 8:12 PM, "Richard Guy Briggs" <rgb@redhat.com> wrote:
> >> > Expose the namespace instace serial numbers in the proc filesystem at
> >> > /proc/<pid>/ns/<ns>_snum.  The link text gives the serial number in hex.
> >>
> >> What's the use case?
> >>
> >> I understand the utility of giving unique numbers to the audit code,
> >> but I don't think this part is necessary for that, and I'd like to
> >> understand what else will use this before committing to a duplicative
> >> API like this.
> >
> > How does a container manager get those numbers?  It could provoke a task
> > to cause an audit event that emits a NS_INFO message, or it could run a
> > task in that container to report its namespace serial numbers directly
> > from its /proc mount.
> 
> Why does a container manager need them?  Is there any reason that
> keeping them entirely contained within the audit system would be a
> problem?

The audit system is currently per-kernel.  If a container is migrated
from one kernel to another, the first audit system is no longer able to
monitor or care about it.  It is the container manager's scope that has
the capability to monitor and care about it.

This might be a good argument to augment the audit system as we
currently know it to be able to do this across kernels, but that isn't
currently the case.

> > The discussion in this thread touches on the use cases:
> >         https://lkml.org/lkml/2014/4/22/662
> >
> >> Note that this API is thoroughly incompatible with CRIU.  If we do
> >> this, someone will ask for a namespace number namespace, and that way
> >> lies madness.
> >
> > I had a very brief look at CRIU, but not enough to understand the issue.
> > Others have hinted at this problem.
> >
> > Do you have a suggestion of a different approach that would be
> > compatible with CRIU?
> >
> > I'd originally considered some sort of UUID that would be globally
> > unique, but that would be very hard to devise or guarantee, and besides,
> > namespaces aren't only used by containers and could be shared in other
> > ways.  Tracking the usage and migration of namespaces should be the task
> > of an upper layer.
> 
> CRIU wants to save the complete state of a namespace and then restore
> it.  For that to work, any information exposed to things in the
> namespace *cannot* be globally unique or unique per boot, since CRIU
> needs to arrange for that information to match whatever it was when
> CRIU saved it.

So are you agreeing with Eric Biederman's idea that its proc inode
number should be initially assigned serially, but reserve the right to
be settable on a restore of a namespace from another host?  What if that
inode number collides with an existing one?

Does CRIU have no lattitude at all to be able to track a new namespace
ID?

> Also, I think that code running in a namespace has no business even
> knowing a unique identity of that namespace from the perspective of
> the host.

Too late.  There is already the namespace proc inode numbers.  That
number is almost completely meaningless to the code running inside the
container/namespace.

> Here's a specific use case for *not* exposing this: Tor.  Ideally, Tor
> clients would run in a namespace that does not know about any global
> identity.  That means no IP addresses, but it also means no global
> namespace serial numbers.

Well, it already has an IP address (which might be masqueraded by the
host or another upstream router) and a namespace inode number.

I'm not aware of support for anonymous namespaces, let along anonymous
containers yet.

> --Andy

- RGB

--
Richard Guy Briggs <rbriggs@redhat.com>
Senior Software Engineer, Kernel Security, AMER ENG Base Operating Systems, Red Hat
Remote, Ottawa, Canada
Voice: +1.647.777.2635, Internal: (81) 32635, Alt: +1.613.693.0684x3545

^ permalink raw reply	[flat|nested] 30+ messages in thread

* Re: [PATCH V4 1/8] namespaces: assign each namespace instance a serial number
  2014-08-23 12:05   ` Eric W. Biederman
@ 2014-08-24 20:38     ` Richard Guy Briggs
  2014-08-28 20:05       ` Eric W. Biederman
  0 siblings, 1 reply; 30+ messages in thread
From: Richard Guy Briggs @ 2014-08-24 20:38 UTC (permalink / raw)
  To: Eric W. Biederman
  Cc: linux-api, containers, linux-kernel, Andy Lutomirski, linux-audit, serge

On 14/08/23, Eric W. Biederman wrote:
> Richard Guy Briggs <rgb@redhat.com> writes:
> 
> > Generate and assign a serial number per namespace instance since boot.
> >
> > Use a serial number per namespace (unique across one boot of one kernel)
> > instead of the inode number (which is claimed to have had the right to change
> > reserved and is not necessarily unique if there is more than one proc fs) to
> > uniquely identify it per kernel boot.
> 
> This approach is just broken.
> 
> For this to work with migration (aka criu) you need to implement a
> namespace of namespaces.  You haven't done this, and therefore
> such an interface will break existing userspace.
> 
> Inside of audit I can understand not caring about these issues,
> but you go foward and expose these serial numbers in proc,
> and generally make this infrastructure available to others.
> 
> The deep issue with migration is that we move tasks from one machine
> from another and on the destination machine we need to have all of the
> same global identifiers for software to function properly.
> 
> My weasel words around the proc inode numbers is to preserve to allow us
> room to be able to restore those ids if it every becomes relevant for
> migration.

What do you do if the inode number is already in use on the target host?

> That is the proc inode numbers (technically) live in a pid namespace,
> (aka a mount of proc).  So depending on the pid namespace you are in
> or the mount of proc you look in the numbers could change.
> 
> Qualifications like that must exist to have a prayer of ever supporting
> process migration in the crazy corner cases where people start caring
> about inode numbers.
> 
> We currently don't and inode numbers for a namespace will never change
> after a namespace is created.  So I think you really are ok using the
> proc inode numbers.  I am happy declaring by fiat that the inode numbers
> that audit uses are the numbers connected to the initial pid namespace.

But once a namespace/container is migrated, it is a different audit that
is looking at it (unless we create an audit manager or entity that
functions at the level of a container manager), so audit should not care.

> At a fairly basic level anything that is used to identify namespaces for
> any general purpose use needs to have most if not all of the same
> properties of the proc inode numbers.  The most important of which is
> being tied to some context/namespace so there is a ability if we ever
> need it to migrate those numbers from one machine to another.

Sooo...  does it make any sense to have those inode or serial numbers be
blank inside the namespace/container itself, but only visible to its
manager outside the container (unless it is the initial namespace)?

> Eric
> 
> > diff --git a/kernel/nsproxy.c b/kernel/nsproxy.c
> > index 8e78110..93cb380 100644
> > --- a/kernel/nsproxy.c
> > +++ b/kernel/nsproxy.c
> > @@ -41,6 +41,23 @@ struct nsproxy init_nsproxy = {
> >  #endif
> >  };
> >  
> > +/**
> > + * ns_serial - compute a serial number for the namespace
> > + *
> > + * Compute a serial number for the namespace to uniquely identify it in
> > + * audit records.
> > + */
> > +long long ns_serial(void)
> > +{
> > +	static atomic64_t serial = ATOMIC_INIT(4); /* reserved for IPC, UTS, user, PID */
> > +	long long ret;
> > +
> > +	ret = atomic64_add_return(1, &serial);
> > +	BUG_ON(!ret);
> > +
> > +	return ret;
> > +}
> > +
> >  static inline struct nsproxy *create_nsproxy(void)
> >  {
> >  	struct nsproxy *nsproxy;
> 
> --
> Linux-audit mailing list
> Linux-audit@redhat.com
> https://www.redhat.com/mailman/listinfo/linux-audit

- RGB

--
Richard Guy Briggs <rbriggs@redhat.com>
Senior Software Engineer, Kernel Security, AMER ENG Base Operating Systems, Red Hat
Remote, Ottawa, Canada
Voice: +1.647.777.2635, Internal: (81) 32635, Alt: +1.613.693.0684x3545

^ permalink raw reply	[flat|nested] 30+ messages in thread

* Re: [PATCH V4 3/8] namespaces: expose ns instance serial numbers in proc
  2014-08-24 17:52       ` Andy Lutomirski
  2014-08-24 20:28         ` Richard Guy Briggs
@ 2014-08-25 13:30         ` Nicolas Dichtel
  2014-08-25 14:04           ` Andy Lutomirski
  1 sibling, 1 reply; 30+ messages in thread
From: Nicolas Dichtel @ 2014-08-25 13:30 UTC (permalink / raw)
  To: Andy Lutomirski, Richard Guy Briggs
  Cc: Serge E. Hallyn, Linux API, Linux Containers, linux-kernel,
	linux-audit, Eric W. Biederman, netdev

Le 24/08/2014 19:52, Andy Lutomirski a écrit :
> On Thu, Aug 21, 2014 at 6:58 PM, Richard Guy Briggs <rgb@redhat.com> wrote:
>> On 14/08/21, Andy Lutomirski wrote:
>>> On Aug 20, 2014 8:12 PM, "Richard Guy Briggs" <rgb@redhat.com> wrote:
>>>> Expose the namespace instace serial numbers in the proc filesystem at
>>>> /proc/<pid>/ns/<ns>_snum.  The link text gives the serial number in hex.
>>>
>>> What's the use case?
>>>
>>> I understand the utility of giving unique numbers to the audit code,
>>> but I don't think this part is necessary for that, and I'd like to
>>> understand what else will use this before committing to a duplicative
>>> API like this.
>>
>> How does a container manager get those numbers?  It could provoke a task
>> to cause an audit event that emits a NS_INFO message, or it could run a
>> task in that container to report its namespace serial numbers directly
>> from its /proc mount.
>
> Why does a container manager need them?  Is there any reason that
> keeping them entirely contained within the audit system would be a
> problem?
>
>>
>> The discussion in this thread touches on the use cases:
>>          https://lkml.org/lkml/2014/4/22/662
>>
>>> Note that this API is thoroughly incompatible with CRIU.  If we do
>>> this, someone will ask for a namespace number namespace, and that way
>>> lies madness.
>>
>> I had a very brief look at CRIU, but not enough to understand the issue.
>> Others have hinted at this problem.
>>
>> Do you have a suggestion of a different approach that would be
>> compatible with CRIU?
>>
>> I'd originally considered some sort of UUID that would be globally
>> unique, but that would be very hard to devise or guarantee, and besides,
>> namespaces aren't only used by containers and could be shared in other
>> ways.  Tracking the usage and migration of namespaces should be the task
>> of an upper layer.
>>
>
> CRIU wants to save the complete state of a namespace and then restore
> it.  For that to work, any information exposed to things in the
> namespace *cannot* be globally unique or unique per boot, since CRIU
> needs to arrange for that information to match whatever it was when
> CRIU saved it.
How are ifindex of network devices managed? These ifindexes are unique per boot,
thus can change depending on the order in which netdev are created.
These ifindexes are unique per boot and exposed to userspace ...

>
> Also, I think that code running in a namespace has no business even
> knowing a unique identity of that namespace from the perspective of
> the host.
Another scenario is when you have virtual network devices across two netns. You
need to identify the peer netns to have a netlink message which is fully 
interpretable by the userspace.


Regards,
Nicolas

^ permalink raw reply	[flat|nested] 30+ messages in thread

* Re: [PATCH V4 3/8] namespaces: expose ns instance serial numbers in proc
  2014-08-25 13:30         ` Nicolas Dichtel
@ 2014-08-25 14:04           ` Andy Lutomirski
  2014-08-25 15:43             ` Nicolas Dichtel
  0 siblings, 1 reply; 30+ messages in thread
From: Andy Lutomirski @ 2014-08-25 14:04 UTC (permalink / raw)
  To: nicolas.dichtel
  Cc: Linux Containers, linux-kernel, Serge E. Hallyn,
	Eric W. Biederman, linux-audit, Linux API, Richard Guy Briggs,
	netdev

On Aug 25, 2014 6:30 AM, "Nicolas Dichtel" <nicolas.dichtel@6wind.com> wrote:
>> CRIU wants to save the complete state of a namespace and then restore
>> it.  For that to work, any information exposed to things in the
>> namespace *cannot* be globally unique or unique per boot, since CRIU
>> needs to arrange for that information to match whatever it was when
>> CRIU saved it.
>
> How are ifindex of network devices managed? These ifindexes are unique per boot,
> thus can change depending on the order in which netdev are created.
> These ifindexes are unique per boot and exposed to userspace ...
>

This does not appear to be true.

$ sudo unshare --net
# ip link add veth0 type veth peer name veth1
# ip link
1: lo: <LOOPBACK> mtu 65536 qdisc noop state DOWN mode DEFAULT group default
    link/loopback 00:00:00:00:00:00 brd 00:00:00:00:00:00
2: veth1: <BROADCAST,MULTICAST> mtu 1500 qdisc noop state DOWN mode
DEFAULT group default qlen 1000
    link/ether 06:0d:59:c7:a6:a8 brd ff:ff:ff:ff:ff:ff
3: veth0: <BROADCAST,MULTICAST> mtu 1500 qdisc noop state DOWN mode
DEFAULT group default qlen 1000
    link/ether b2:5c:8b:f2:12:28 brd ff:ff:ff:ff:ff:ff
# logout
$ ip link
1: lo: <LOOPBACK,UP,LOWER_UP> mtu 65536 qdisc noqueue state UNKNOWN
    link/loopback 00:00:00:00:00:00 brd 00:00:00:00:00:00
3: em1: <NO-CARRIER,BROADCAST,MULTICAST,UP> mtu 1500 qdisc pfifo_fast
state DOWN qlen 1000

>
>>
>> Also, I think that code running in a namespace has no business even
>> knowing a unique identity of that namespace from the perspective of
>> the host.
>
> Another scenario is when you have virtual network devices across two netns. You
> need to identify the peer netns to have a netlink message which is fully interpretable by the userspace.

Let me try again, with emphasis in the right place.

I think that *code running in a namespace* has no business even
knowing a unique identity of *that namespace* from the perspective of
the host.

In your example, if there's a veth device between netns A and netns B,
then code *in netns A* has no business knowing the identity of its
veth peer if its peer (B) is a sibling or ancestor.  It also IMO has
no business knowing the identity of its own netns (A) other than as
"my netns".

If A and B are siblings, then their parent needs to know where that
veth device goes, but I think this is already the case to a sufficient
extent today.

I feel like this discussion is falling into a common trap of new API
discussions.  Can one of you who wants this API please articulate,
with a reasonably precise example, what it is that you want to do, why
you can't easily do it already, and how this API helps?  I currently
understand how the API creates problems, but I don't understand how it
solves any problems, and I will NAK it (and I suspect that Eric will,
too, which is pretty much fatal) unless that changes.

Thanks,
Andy

>
>
> Regards,
> Nicolas

^ permalink raw reply	[flat|nested] 30+ messages in thread

* Re: [PATCH V4 3/8] namespaces: expose ns instance serial numbers in proc
  2014-08-25 14:04           ` Andy Lutomirski
@ 2014-08-25 15:43             ` Nicolas Dichtel
  2014-08-25 16:13               ` Andy Lutomirski
  0 siblings, 1 reply; 30+ messages in thread
From: Nicolas Dichtel @ 2014-08-25 15:43 UTC (permalink / raw)
  To: Andy Lutomirski
  Cc: Linux Containers, linux-kernel, Serge E. Hallyn,
	Eric W. Biederman, linux-audit, Linux API, Richard Guy Briggs,
	netdev

Le 25/08/2014 16:04, Andy Lutomirski a écrit :
> On Aug 25, 2014 6:30 AM, "Nicolas Dichtel" <nicolas.dichtel@6wind.com> wrote:
>>> CRIU wants to save the complete state of a namespace and then restore
>>> it.  For that to work, any information exposed to things in the
>>> namespace *cannot* be globally unique or unique per boot, since CRIU
>>> needs to arrange for that information to match whatever it was when
>>> CRIU saved it.
>>
>> How are ifindex of network devices managed? These ifindexes are unique per boot,
>> thus can change depending on the order in which netdev are created.
>> These ifindexes are unique per boot and exposed to userspace ...
>>
>
> This does not appear to be true.
>
> $ sudo unshare --net
> # ip link add veth0 type veth peer name veth1
> # ip link
> 1: lo: <LOOPBACK> mtu 65536 qdisc noop state DOWN mode DEFAULT group default
>      link/loopback 00:00:00:00:00:00 brd 00:00:00:00:00:00
> 2: veth1: <BROADCAST,MULTICAST> mtu 1500 qdisc noop state DOWN mode
> DEFAULT group default qlen 1000
>      link/ether 06:0d:59:c7:a6:a8 brd ff:ff:ff:ff:ff:ff
> 3: veth0: <BROADCAST,MULTICAST> mtu 1500 qdisc noop state DOWN mode
> DEFAULT group default qlen 1000
>      link/ether b2:5c:8b:f2:12:28 brd ff:ff:ff:ff:ff:ff
> # logout
> $ ip link
> 1: lo: <LOOPBACK,UP,LOWER_UP> mtu 65536 qdisc noqueue state UNKNOWN
>      link/loopback 00:00:00:00:00:00 brd 00:00:00:00:00:00
> 3: em1: <NO-CARRIER,BROADCAST,MULTICAST,UP> mtu 1500 qdisc pfifo_fast
> state DOWN qlen 1000
>
I've probably misunderstood what you're trying to say. ifindexes are unique per
boot and per netns. These ifindexes depend on the interface creation order:

$ ip netns add 1
$ ip link set eth1 netns 1
$ ip netns exec 1 ip link add veth0 type veth peer name veth1
$ ip netns exec 1 ip link
1: lo: <LOOPBACK> mtu 65536 qdisc noop state DOWN mode DEFAULT group default
     link/loopback 00:00:00:00:00:00 brd 00:00:00:00:00:00
2: veth1: <BROADCAST,MULTICAST> mtu 1500 qdisc noop state DOWN mode DEFAULT 
group default qlen 1000
     link/ether 9a:a0:89:99:a0:3c brd ff:ff:ff:ff:ff:ff
3: eth1: <BROADCAST,MULTICAST> mtu 1500 qdisc noop state DOWN mode DEFAULT group 
default qlen 1000
     link/ether 52:54:00:12:34:57 brd ff:ff:ff:ff:ff:ff
4: veth0: <BROADCAST,MULTICAST> mtu 1500 qdisc noop state DOWN mode DEFAULT 
group default qlen 1000
     link/ether 96:86:44:49:ce:a8 brd ff:ff:ff:ff:ff:ff
$ ip netns del 1
$ ip netns add 1
$ ip netns exec 1 ip link add veth0 type veth peer name veth1
$ ip link set eth1 netns 1
$ ip netns exec 1 ip link
1: lo: <LOOPBACK> mtu 65536 qdisc noop state DOWN mode DEFAULT group default
     link/loopback 00:00:00:00:00:00 brd 00:00:00:00:00:00
2: veth1: <BROADCAST,MULTICAST> mtu 1500 qdisc noop state DOWN mode DEFAULT 
group default qlen 1000
     link/ether 86:92:90:01:32:6b brd ff:ff:ff:ff:ff:ff
3: veth0: <BROADCAST,MULTICAST> mtu 1500 qdisc noop state DOWN mode DEFAULT 
group default qlen 1000
     link/ether ae:8b:d2:71:48:a2 brd ff:ff:ff:ff:ff:ff
4: eth1: <BROADCAST,MULTICAST> mtu 1500 qdisc noop state DOWN mode DEFAULT group 
default qlen 1000
     link/ether 52:54:00:12:34:57 brd ff:ff:ff:ff:ff:ff

Note: when an interface is moved to another netns, the ifindex is kept if
possible, else another ifindex is chosen.
I will dig a bit to understand how CRIU save these netns informations.

>>
>>>
>>> Also, I think that code running in a namespace has no business even
>>> knowing a unique identity of that namespace from the perspective of
>>> the host.
>>
>> Another scenario is when you have virtual network devices across two netns. You
>> need to identify the peer netns to have a netlink message which is fully interpretable by the userspace.
>
> Let me try again, with emphasis in the right place.
>
> I think that *code running in a namespace* has no business even
> knowing a unique identity of *that namespace* from the perspective of
> the host.
>
> In your example, if there's a veth device between netns A and netns B,
> then code *in netns A* has no business knowing the identity of its
> veth peer if its peer (B) is a sibling or ancestor.  It also IMO has
> no business knowing the identity of its own netns (A) other than as
> "my netns".
I do not agree (see the example below).

>
> If A and B are siblings, then their parent needs to know where that
> veth device goes, but I think this is already the case to a sufficient
> extent today.
I'm not aware of a hierarchy between netns. A daemon should be able to
got the full network configuration, even if it's started when this configuration
is already applied, ie even if it doesn't know what happen before it starts.

>
> I feel like this discussion is falling into a common trap of new API
> discussions.  Can one of you who wants this API please articulate,
> with a reasonably precise example, what it is that you want to do, why
> you can't easily do it already, and how this API helps?  I currently
> understand how the API creates problems, but I don't understand how it
> solves any problems, and I will NAK it (and I suspect that Eric will,
> too, which is pretty much fatal) unless that changes.
What I'm trying to solve is to have full info in netlink messages sent by the
kernel, thus beeing able to identify a peer netns (and this is close from what
audit guys are trying to have). Theorically, messages sent by the kernel can be
reused as is to have the same configuration. This is not the case with x-netns
devices. Here is an example, with ip tunnels:

$ ip netns add 1
$ ip link add ipip1 type ipip remote 10.16.0.121 local 10.16.0.249 dev eth0
$ ip -d link ls ipip1
8: ipip1@eth0: <POINTOPOINT,NOARP> mtu 1480 qdisc noop state DOWN mode DEFAULT 
group default
     link/ipip 10.16.0.249 peer 10.16.0.121 promiscuity 0
     ipip remote 10.16.0.121 local 10.16.0.249 dev eth0 ttl inherit pmtudisc
$ ip link set ipip1 netns 1
$ ip netns exec 1 ip -d link ls ipip1
8: ipip1@tunl0: <POINTOPOINT,NOARP,M-DOWN> mtu 1480 qdisc noop state DOWN mode 
DEFAULT group default
     link/ipip 10.16.0.249 peer 10.16.0.121 promiscuity 0
     ipip remote 10.16.0.121 local 10.16.0.249 dev tunl0 ttl inherit pmtudisc

Now informations got with 'ip link' are wrong and incomplete:
  - the link dev is now tunl0 instead of eth0, because we only got an ifindex
    from the kernel without any netns informations.
  - the encapsulation addresses are not part of this netns but the user doesn't
    known that (still because netns info is missing). These IPv4 addresses may
    exist into this netns.
  - it's not possible to create the same netdevice with these infos.


Hope it's more clear now.


Regards,
Nicolas

^ permalink raw reply	[flat|nested] 30+ messages in thread

* Re: [PATCH V4 3/8] namespaces: expose ns instance serial numbers in proc
  2014-08-25 15:43             ` Nicolas Dichtel
@ 2014-08-25 16:13               ` Andy Lutomirski
  2014-08-25 16:41                 ` Nicolas Dichtel
  0 siblings, 1 reply; 30+ messages in thread
From: Andy Lutomirski @ 2014-08-25 16:13 UTC (permalink / raw)
  To: Nicolas Dichtel
  Cc: Linux Containers, linux-kernel, Serge E. Hallyn,
	Eric W. Biederman, linux-audit, Linux API, Richard Guy Briggs,
	netdev

On Mon, Aug 25, 2014 at 8:43 AM, Nicolas Dichtel
<nicolas.dichtel@6wind.com> wrote:
> Le 25/08/2014 16:04, Andy Lutomirski a écrit :
>
>> On Aug 25, 2014 6:30 AM, "Nicolas Dichtel" <nicolas.dichtel@6wind.com>
>> wrote:
>>>>
>>>> CRIU wants to save the complete state of a namespace and then restore
>>>> it.  For that to work, any information exposed to things in the
>>>> namespace *cannot* be globally unique or unique per boot, since CRIU
>>>> needs to arrange for that information to match whatever it was when
>>>> CRIU saved it.
>>>
>>>
>>> How are ifindex of network devices managed? These ifindexes are unique
>>> per boot,
>>> thus can change depending on the order in which netdev are created.
>>> These ifindexes are unique per boot and exposed to userspace ...
>>>
>>
>> This does not appear to be true.
>>
>> $ sudo unshare --net
>> # ip link add veth0 type veth peer name veth1
>> # ip link
>> 1: lo: <LOOPBACK> mtu 65536 qdisc noop state DOWN mode DEFAULT group
>> default
>>      link/loopback 00:00:00:00:00:00 brd 00:00:00:00:00:00
>> 2: veth1: <BROADCAST,MULTICAST> mtu 1500 qdisc noop state DOWN mode
>> DEFAULT group default qlen 1000
>>      link/ether 06:0d:59:c7:a6:a8 brd ff:ff:ff:ff:ff:ff
>> 3: veth0: <BROADCAST,MULTICAST> mtu 1500 qdisc noop state DOWN mode
>> DEFAULT group default qlen 1000
>>      link/ether b2:5c:8b:f2:12:28 brd ff:ff:ff:ff:ff:ff
>> # logout
>> $ ip link
>> 1: lo: <LOOPBACK,UP,LOWER_UP> mtu 65536 qdisc noqueue state UNKNOWN
>>      link/loopback 00:00:00:00:00:00 brd 00:00:00:00:00:00
>> 3: em1: <NO-CARRIER,BROADCAST,MULTICAST,UP> mtu 1500 qdisc pfifo_fast
>> state DOWN qlen 1000
>>
> I've probably misunderstood what you're trying to say. ifindexes are unique
> per
> boot and per netns.

I think we both misunderstood each other.  The ifindexes are unique
*per netns*, which means that, if you're unprivileged in a netns,
global information doesn't leak to you.  I think this is good.

>>
>> Let me try again, with emphasis in the right place.
>>
>> I think that *code running in a namespace* has no business even
>> knowing a unique identity of *that namespace* from the perspective of
>> the host.
>>
>> In your example, if there's a veth device between netns A and netns B,
>> then code *in netns A* has no business knowing the identity of its
>> veth peer if its peer (B) is a sibling or ancestor.  It also IMO has
>> no business knowing the identity of its own netns (A) other than as
>> "my netns".
>
> I do not agree (see the example below).
>
>
>>
>> If A and B are siblings, then their parent needs to know where that
>> veth device goes, but I think this is already the case to a sufficient
>> extent today.
>
> I'm not aware of a hierarchy between netns. A daemon should be able to
> got the full network configuration, even if it's started when this
> configuration
> is already applied, ie even if it doesn't know what happen before it starts.
>

I don't know exactly which namespaces have an explicit hierarchy, but
there is certainly a hierarchy of *user* namespaces, and network
namespaces live in user namespaces, so they at least have somewhat of
a hierarchy.

>
>>
>> I feel like this discussion is falling into a common trap of new API
>> discussions.  Can one of you who wants this API please articulate,
>> with a reasonably precise example, what it is that you want to do, why
>> you can't easily do it already, and how this API helps?  I currently
>> understand how the API creates problems, but I don't understand how it
>> solves any problems, and I will NAK it (and I suspect that Eric will,
>> too, which is pretty much fatal) unless that changes.
>
> What I'm trying to solve is to have full info in netlink messages sent by
> the
> kernel, thus beeing able to identify a peer netns (and this is close from
> what
> audit guys are trying to have). Theorically, messages sent by the kernel can
> be
> reused as is to have the same configuration. This is not the case with
> x-netns
> devices. Here is an example, with ip tunnels:
>
> $ ip netns add 1
> $ ip link add ipip1 type ipip remote 10.16.0.121 local 10.16.0.249 dev eth0
> $ ip -d link ls ipip1
> 8: ipip1@eth0: <POINTOPOINT,NOARP> mtu 1480 qdisc noop state DOWN mode
> DEFAULT group default
>     link/ipip 10.16.0.249 peer 10.16.0.121 promiscuity 0
>     ipip remote 10.16.0.121 local 10.16.0.249 dev eth0 ttl inherit pmtudisc
> $ ip link set ipip1 netns 1
> $ ip netns exec 1 ip -d link ls ipip1
> 8: ipip1@tunl0: <POINTOPOINT,NOARP,M-DOWN> mtu 1480 qdisc noop state DOWN
> mode DEFAULT group default
>     link/ipip 10.16.0.249 peer 10.16.0.121 promiscuity 0
>     ipip remote 10.16.0.121 local 10.16.0.249 dev tunl0 ttl inherit pmtudisc
>
> Now informations got with 'ip link' are wrong and incomplete:
>  - the link dev is now tunl0 instead of eth0, because we only got an ifindex
>    from the kernel without any netns informations.
>  - the encapsulation addresses are not part of this netns but the user
> doesn't
>    known that (still because netns info is missing). These IPv4 addresses
> may
>    exist into this netns.
>  - it's not possible to create the same netdevice with these infos.
>

Aha.  That's a genuine problem.

Perhaps we need a concept of which netnses should be able to see each other.

I think I would be okay with a somewhat different outcome from your example:

$ ip netns exec 1 ip -d link ls ipip1
8: ipip1@[unknown device in another namespace]:
<POINTOPOINT,NOARP,M-DOWN> mtu 1480 qdisc noop state DOWN

I think this outcome is mandatory if netns 1 lives in a subsidiary
user namespace.

Certainly, if you do the 'ip link' in the original namespace, I agree
that this should work.

For most namespace types, this all works transparently, since
everything has an real identity all the way up the hierarchy.  Network
namespaces are different.

I don't think that exposing serial numbers in /proc is a good
solution, both for the reasons already described and because I don't
think that iproute2 should need to muck around with /proc to function
correctly.  Eric, any clever ideas here?  Do we need fancier netlink
messages for this?

--Andy

^ permalink raw reply	[flat|nested] 30+ messages in thread

* Re: [PATCH V4 3/8] namespaces: expose ns instance serial numbers in proc
  2014-08-25 16:13               ` Andy Lutomirski
@ 2014-08-25 16:41                 ` Nicolas Dichtel
  2014-08-25 16:50                   ` Andy Lutomirski
  0 siblings, 1 reply; 30+ messages in thread
From: Nicolas Dichtel @ 2014-08-25 16:41 UTC (permalink / raw)
  To: Andy Lutomirski
  Cc: Linux Containers, linux-kernel, Serge E. Hallyn,
	Eric W. Biederman, linux-audit, Linux API, Richard Guy Briggs,
	netdev

Le 25/08/2014 18:13, Andy Lutomirski a écrit :
> On Mon, Aug 25, 2014 at 8:43 AM, Nicolas Dichtel
> <nicolas.dichtel@6wind.com> wrote:
>> Le 25/08/2014 16:04, Andy Lutomirski a écrit :
>>
>>> On Aug 25, 2014 6:30 AM, "Nicolas Dichtel" <nicolas.dichtel@6wind.com>
>>> wrote:
>>>>>
>>>>> CRIU wants to save the complete state of a namespace and then restore
>>>>> it.  For that to work, any information exposed to things in the
>>>>> namespace *cannot* be globally unique or unique per boot, since CRIU
>>>>> needs to arrange for that information to match whatever it was when
>>>>> CRIU saved it.
>>>>
>>>>
>>>> How are ifindex of network devices managed? These ifindexes are unique
>>>> per boot,
>>>> thus can change depending on the order in which netdev are created.
>>>> These ifindexes are unique per boot and exposed to userspace ...
>>>>
>>>
>>> This does not appear to be true.
>>>
>>> $ sudo unshare --net
>>> # ip link add veth0 type veth peer name veth1
>>> # ip link
>>> 1: lo: <LOOPBACK> mtu 65536 qdisc noop state DOWN mode DEFAULT group
>>> default
>>>       link/loopback 00:00:00:00:00:00 brd 00:00:00:00:00:00
>>> 2: veth1: <BROADCAST,MULTICAST> mtu 1500 qdisc noop state DOWN mode
>>> DEFAULT group default qlen 1000
>>>       link/ether 06:0d:59:c7:a6:a8 brd ff:ff:ff:ff:ff:ff
>>> 3: veth0: <BROADCAST,MULTICAST> mtu 1500 qdisc noop state DOWN mode
>>> DEFAULT group default qlen 1000
>>>       link/ether b2:5c:8b:f2:12:28 brd ff:ff:ff:ff:ff:ff
>>> # logout
>>> $ ip link
>>> 1: lo: <LOOPBACK,UP,LOWER_UP> mtu 65536 qdisc noqueue state UNKNOWN
>>>       link/loopback 00:00:00:00:00:00 brd 00:00:00:00:00:00
>>> 3: em1: <NO-CARRIER,BROADCAST,MULTICAST,UP> mtu 1500 qdisc pfifo_fast
>>> state DOWN qlen 1000
>>>
>> I've probably misunderstood what you're trying to say. ifindexes are unique
>> per
>> boot and per netns.
>
> I think we both misunderstood each other.  The ifindexes are unique
> *per netns*, which means that, if you're unprivileged in a netns,
> global information doesn't leak to you.  I think this is good.
Ok, I agree. I think audit daemons are always running under privileged users.

>
>>>
>>> Let me try again, with emphasis in the right place.
>>>
>>> I think that *code running in a namespace* has no business even
>>> knowing a unique identity of *that namespace* from the perspective of
>>> the host.
>>>
>>> In your example, if there's a veth device between netns A and netns B,
>>> then code *in netns A* has no business knowing the identity of its
>>> veth peer if its peer (B) is a sibling or ancestor.  It also IMO has
>>> no business knowing the identity of its own netns (A) other than as
>>> "my netns".
>>
>> I do not agree (see the example below).
>>
>>
>>>
>>> If A and B are siblings, then their parent needs to know where that
>>> veth device goes, but I think this is already the case to a sufficient
>>> extent today.
>>
>> I'm not aware of a hierarchy between netns. A daemon should be able to
>> got the full network configuration, even if it's started when this
>> configuration
>> is already applied, ie even if it doesn't know what happen before it starts.
>>
>
> I don't know exactly which namespaces have an explicit hierarchy, but
> there is certainly a hierarchy of *user* namespaces, and network
> namespaces live in user namespaces, so they at least have somewhat of
> a hierarchy.
>
>>
>>>
>>> I feel like this discussion is falling into a common trap of new API
>>> discussions.  Can one of you who wants this API please articulate,
>>> with a reasonably precise example, what it is that you want to do, why
>>> you can't easily do it already, and how this API helps?  I currently
>>> understand how the API creates problems, but I don't understand how it
>>> solves any problems, and I will NAK it (and I suspect that Eric will,
>>> too, which is pretty much fatal) unless that changes.
>>
>> What I'm trying to solve is to have full info in netlink messages sent by
>> the
>> kernel, thus beeing able to identify a peer netns (and this is close from
>> what
>> audit guys are trying to have). Theorically, messages sent by the kernel can
>> be
>> reused as is to have the same configuration. This is not the case with
>> x-netns
>> devices. Here is an example, with ip tunnels:
>>
>> $ ip netns add 1
>> $ ip link add ipip1 type ipip remote 10.16.0.121 local 10.16.0.249 dev eth0
>> $ ip -d link ls ipip1
>> 8: ipip1@eth0: <POINTOPOINT,NOARP> mtu 1480 qdisc noop state DOWN mode
>> DEFAULT group default
>>      link/ipip 10.16.0.249 peer 10.16.0.121 promiscuity 0
>>      ipip remote 10.16.0.121 local 10.16.0.249 dev eth0 ttl inherit pmtudisc
>> $ ip link set ipip1 netns 1
>> $ ip netns exec 1 ip -d link ls ipip1
>> 8: ipip1@tunl0: <POINTOPOINT,NOARP,M-DOWN> mtu 1480 qdisc noop state DOWN
>> mode DEFAULT group default
>>      link/ipip 10.16.0.249 peer 10.16.0.121 promiscuity 0
>>      ipip remote 10.16.0.121 local 10.16.0.249 dev tunl0 ttl inherit pmtudisc
>>
>> Now informations got with 'ip link' are wrong and incomplete:
>>   - the link dev is now tunl0 instead of eth0, because we only got an ifindex
>>     from the kernel without any netns informations.
>>   - the encapsulation addresses are not part of this netns but the user
>> doesn't
>>     known that (still because netns info is missing). These IPv4 addresses
>> may
>>     exist into this netns.
>>   - it's not possible to create the same netdevice with these infos.
>>
>
> Aha.  That's a genuine problem.
>
> Perhaps we need a concept of which netnses should be able to see each other.
Yes, I agree. This is not required for all netns, only a subset of netns should
be able to see each other.

>
> I think I would be okay with a somewhat different outcome from your example:
>
> $ ip netns exec 1 ip -d link ls ipip1
> 8: ipip1@[unknown device in another namespace]:
> <POINTOPOINT,NOARP,M-DOWN> mtu 1480 qdisc noop state DOWN
>
> I think this outcome is mandatory if netns 1 lives in a subsidiary
> user namespace.
Yes.

>
> Certainly, if you do the 'ip link' in the original namespace, I agree
> that this should work.
And yes :)

I will update my previous proposal 
(http://thread.gmane.org/gmane.linux.network/315933/focus=321753)
to allow to get an id for a peer netns only when the user namespace is the same.

>
> For most namespace types, this all works transparently, since
> everything has an real identity all the way up the hierarchy.  Network
> namespaces are different.
>
> I don't think that exposing serial numbers in /proc is a good
> solution, both for the reasons already described and because I don't
> think that iproute2 should need to muck around with /proc to function
A netlink API is probably enough. But it will help only for the network
problem, not for audit. I was hoping to find a common solution.

> correctly.  Eric, any clever ideas here?  Do we need fancier netlink
> messages for this?
>
> --Andy
>


^ permalink raw reply	[flat|nested] 30+ messages in thread

* Re: [PATCH V4 3/8] namespaces: expose ns instance serial numbers in proc
  2014-08-25 16:41                 ` Nicolas Dichtel
@ 2014-08-25 16:50                   ` Andy Lutomirski
  2014-08-27 15:17                     ` Richard Guy Briggs
  0 siblings, 1 reply; 30+ messages in thread
From: Andy Lutomirski @ 2014-08-25 16:50 UTC (permalink / raw)
  To: Nicolas Dichtel
  Cc: Linux Containers, linux-kernel, Serge E. Hallyn,
	Eric W. Biederman, linux-audit, Linux API, Richard Guy Briggs,
	netdev

On Mon, Aug 25, 2014 at 9:41 AM, Nicolas Dichtel
<nicolas.dichtel@6wind.com> wrote:
> Le 25/08/2014 18:13, Andy Lutomirski a écrit :
>
>> On Mon, Aug 25, 2014 at 8:43 AM, Nicolas Dichtel
>> <nicolas.dichtel@6wind.com> wrote:
>>>
>>> Le 25/08/2014 16:04, Andy Lutomirski a écrit :
>>>
>>>> On Aug 25, 2014 6:30 AM, "Nicolas Dichtel" <nicolas.dichtel@6wind.com>
>>>> wrote:
>>>>>>
>>>>>>
>>>>>> CRIU wants to save the complete state of a namespace and then restore
>>>>>> it.  For that to work, any information exposed to things in the
>>>>>> namespace *cannot* be globally unique or unique per boot, since CRIU
>>>>>> needs to arrange for that information to match whatever it was when
>>>>>> CRIU saved it.
>>>>>
>>>>>
>>>>>
>>>>> How are ifindex of network devices managed? These ifindexes are unique
>>>>> per boot,
>>>>> thus can change depending on the order in which netdev are created.
>>>>> These ifindexes are unique per boot and exposed to userspace ...
>>>>>
>>>>
>>>> This does not appear to be true.
>>>>
>>>> $ sudo unshare --net
>>>> # ip link add veth0 type veth peer name veth1
>>>> # ip link
>>>> 1: lo: <LOOPBACK> mtu 65536 qdisc noop state DOWN mode DEFAULT group
>>>> default
>>>>       link/loopback 00:00:00:00:00:00 brd 00:00:00:00:00:00
>>>> 2: veth1: <BROADCAST,MULTICAST> mtu 1500 qdisc noop state DOWN mode
>>>> DEFAULT group default qlen 1000
>>>>       link/ether 06:0d:59:c7:a6:a8 brd ff:ff:ff:ff:ff:ff
>>>> 3: veth0: <BROADCAST,MULTICAST> mtu 1500 qdisc noop state DOWN mode
>>>> DEFAULT group default qlen 1000
>>>>       link/ether b2:5c:8b:f2:12:28 brd ff:ff:ff:ff:ff:ff
>>>> # logout
>>>> $ ip link
>>>> 1: lo: <LOOPBACK,UP,LOWER_UP> mtu 65536 qdisc noqueue state UNKNOWN
>>>>       link/loopback 00:00:00:00:00:00 brd 00:00:00:00:00:00
>>>> 3: em1: <NO-CARRIER,BROADCAST,MULTICAST,UP> mtu 1500 qdisc pfifo_fast
>>>> state DOWN qlen 1000
>>>>
>>> I've probably misunderstood what you're trying to say. ifindexes are
>>> unique
>>> per
>>> boot and per netns.
>>
>>
>> I think we both misunderstood each other.  The ifindexes are unique
>> *per netns*, which means that, if you're unprivileged in a netns,
>> global information doesn't leak to you.  I think this is good.
>
> Ok, I agree. I think audit daemons are always running under privileged
> users.
>
>
>>
>>>>
>>>> Let me try again, with emphasis in the right place.
>>>>
>>>> I think that *code running in a namespace* has no business even
>>>> knowing a unique identity of *that namespace* from the perspective of
>>>> the host.
>>>>
>>>> In your example, if there's a veth device between netns A and netns B,
>>>> then code *in netns A* has no business knowing the identity of its
>>>> veth peer if its peer (B) is a sibling or ancestor.  It also IMO has
>>>> no business knowing the identity of its own netns (A) other than as
>>>> "my netns".
>>>
>>>
>>> I do not agree (see the example below).
>>>
>>>
>>>>
>>>> If A and B are siblings, then their parent needs to know where that
>>>> veth device goes, but I think this is already the case to a sufficient
>>>> extent today.
>>>
>>>
>>> I'm not aware of a hierarchy between netns. A daemon should be able to
>>> got the full network configuration, even if it's started when this
>>> configuration
>>> is already applied, ie even if it doesn't know what happen before it
>>> starts.
>>>
>>
>> I don't know exactly which namespaces have an explicit hierarchy, but
>> there is certainly a hierarchy of *user* namespaces, and network
>> namespaces live in user namespaces, so they at least have somewhat of
>> a hierarchy.
>>
>>>
>>>>
>>>> I feel like this discussion is falling into a common trap of new API
>>>> discussions.  Can one of you who wants this API please articulate,
>>>> with a reasonably precise example, what it is that you want to do, why
>>>> you can't easily do it already, and how this API helps?  I currently
>>>> understand how the API creates problems, but I don't understand how it
>>>> solves any problems, and I will NAK it (and I suspect that Eric will,
>>>> too, which is pretty much fatal) unless that changes.
>>>
>>>
>>> What I'm trying to solve is to have full info in netlink messages sent by
>>> the
>>> kernel, thus beeing able to identify a peer netns (and this is close from
>>> what
>>> audit guys are trying to have). Theorically, messages sent by the kernel
>>> can
>>> be
>>> reused as is to have the same configuration. This is not the case with
>>> x-netns
>>> devices. Here is an example, with ip tunnels:
>>>
>>> $ ip netns add 1
>>> $ ip link add ipip1 type ipip remote 10.16.0.121 local 10.16.0.249 dev
>>> eth0
>>> $ ip -d link ls ipip1
>>> 8: ipip1@eth0: <POINTOPOINT,NOARP> mtu 1480 qdisc noop state DOWN mode
>>> DEFAULT group default
>>>      link/ipip 10.16.0.249 peer 10.16.0.121 promiscuity 0
>>>      ipip remote 10.16.0.121 local 10.16.0.249 dev eth0 ttl inherit
>>> pmtudisc
>>> $ ip link set ipip1 netns 1
>>> $ ip netns exec 1 ip -d link ls ipip1
>>> 8: ipip1@tunl0: <POINTOPOINT,NOARP,M-DOWN> mtu 1480 qdisc noop state DOWN
>>> mode DEFAULT group default
>>>      link/ipip 10.16.0.249 peer 10.16.0.121 promiscuity 0
>>>      ipip remote 10.16.0.121 local 10.16.0.249 dev tunl0 ttl inherit
>>> pmtudisc
>>>
>>> Now informations got with 'ip link' are wrong and incomplete:
>>>   - the link dev is now tunl0 instead of eth0, because we only got an
>>> ifindex
>>>     from the kernel without any netns informations.
>>>   - the encapsulation addresses are not part of this netns but the user
>>> doesn't
>>>     known that (still because netns info is missing). These IPv4
>>> addresses
>>> may
>>>     exist into this netns.
>>>   - it's not possible to create the same netdevice with these infos.
>>>
>>
>> Aha.  That's a genuine problem.
>>
>> Perhaps we need a concept of which netnses should be able to see each
>> other.
>
> Yes, I agree. This is not required for all netns, only a subset of netns
> should
>
> be able to see each other.
>
>>
>> I think I would be okay with a somewhat different outcome from your
>> example:
>>
>> $ ip netns exec 1 ip -d link ls ipip1
>> 8: ipip1@[unknown device in another namespace]:
>> <POINTOPOINT,NOARP,M-DOWN> mtu 1480 qdisc noop state DOWN
>>
>> I think this outcome is mandatory if netns 1 lives in a subsidiary
>> user namespace.
>
> Yes.
>
>
>>
>> Certainly, if you do the 'ip link' in the original namespace, I agree
>> that this should work.
>
> And yes :)
>
> I will update my previous proposal
> (http://thread.gmane.org/gmane.linux.network/315933/focus=321753)
> to allow to get an id for a peer netns only when the user namespace is the
> same.
>

I think it should work if the peer userns is the same or a descendent.
I also wonder whether the peer's ifindex should be suppressed if peer
userns is not the same or a descendent.

Now you just have to get Eric to be happy with the id allocation. :)
This may be nontrivial.

>
>>
>> For most namespace types, this all works transparently, since
>> everything has an real identity all the way up the hierarchy.  Network
>> namespaces are different.
>>
>> I don't think that exposing serial numbers in /proc is a good
>> solution, both for the reasons already described and because I don't
>> think that iproute2 should need to muck around with /proc to function
>
> A netlink API is probably enough. But it will help only for the network
> problem, not for audit. I was hoping to find a common solution.

I still don't understand why audit needs anything beyond the audit
part of this patch set.  I have no problem with audit seeing that
migrated/restored namespaces are really brand-new namespaces, as long
as the code in those namespaces isn't exposed to it.

>
>
>> correctly.  Eric, any clever ideas here?  Do we need fancier netlink
>> messages for this?
>>
>> --Andy
>>
>



-- 
Andy Lutomirski
AMA Capital Management, LLC

^ permalink raw reply	[flat|nested] 30+ messages in thread

* Re: [PATCH V4 3/8] namespaces: expose ns instance serial numbers in proc
  2014-08-25 16:50                   ` Andy Lutomirski
@ 2014-08-27 15:17                     ` Richard Guy Briggs
  0 siblings, 0 replies; 30+ messages in thread
From: Richard Guy Briggs @ 2014-08-27 15:17 UTC (permalink / raw)
  To: Andy Lutomirski
  Cc: Nicolas Dichtel, Linux Containers, linux-kernel, Serge E. Hallyn,
	Eric W. Biederman, linux-audit, Linux API, netdev

On 14/08/25, Andy Lutomirski wrote:
> On Mon, Aug 25, 2014 at 9:41 AM, Nicolas Dichtel
> <nicolas.dichtel@6wind.com> wrote:
> > Le 25/08/2014 18:13, Andy Lutomirski a écrit :
> >
> >> On Mon, Aug 25, 2014 at 8:43 AM, Nicolas Dichtel
> >> <nicolas.dichtel@6wind.com> wrote:
> >>>
> >>> Le 25/08/2014 16:04, Andy Lutomirski a écrit :
> >>>
> >>>> On Aug 25, 2014 6:30 AM, "Nicolas Dichtel" <nicolas.dichtel@6wind.com>
> >>>> wrote:
> >>>>>>
> >>>>>>
> >>>>>> CRIU wants to save the complete state of a namespace and then restore
> >>>>>> it.  For that to work, any information exposed to things in the
> >>>>>> namespace *cannot* be globally unique or unique per boot, since CRIU
> >>>>>> needs to arrange for that information to match whatever it was when
> >>>>>> CRIU saved it.
> >>>>>
> >>>>>
> >>>>>
> >>>>> How are ifindex of network devices managed? These ifindexes are unique
> >>>>> per boot,
> >>>>> thus can change depending on the order in which netdev are created.
> >>>>> These ifindexes are unique per boot and exposed to userspace ...
> >>>>>
> >>>>
> >>>> This does not appear to be true.
> >>>>
> >>>> $ sudo unshare --net
> >>>> # ip link add veth0 type veth peer name veth1
> >>>> # ip link
> >>>> 1: lo: <LOOPBACK> mtu 65536 qdisc noop state DOWN mode DEFAULT group
> >>>> default
> >>>>       link/loopback 00:00:00:00:00:00 brd 00:00:00:00:00:00
> >>>> 2: veth1: <BROADCAST,MULTICAST> mtu 1500 qdisc noop state DOWN mode
> >>>> DEFAULT group default qlen 1000
> >>>>       link/ether 06:0d:59:c7:a6:a8 brd ff:ff:ff:ff:ff:ff
> >>>> 3: veth0: <BROADCAST,MULTICAST> mtu 1500 qdisc noop state DOWN mode
> >>>> DEFAULT group default qlen 1000
> >>>>       link/ether b2:5c:8b:f2:12:28 brd ff:ff:ff:ff:ff:ff
> >>>> # logout
> >>>> $ ip link
> >>>> 1: lo: <LOOPBACK,UP,LOWER_UP> mtu 65536 qdisc noqueue state UNKNOWN
> >>>>       link/loopback 00:00:00:00:00:00 brd 00:00:00:00:00:00
> >>>> 3: em1: <NO-CARRIER,BROADCAST,MULTICAST,UP> mtu 1500 qdisc pfifo_fast
> >>>> state DOWN qlen 1000
> >>>>
> >>> I've probably misunderstood what you're trying to say. ifindexes are
> >>> unique
> >>> per
> >>> boot and per netns.
> >>
> >>
> >> I think we both misunderstood each other.  The ifindexes are unique
> >> *per netns*, which means that, if you're unprivileged in a netns,
> >> global information doesn't leak to you.  I think this is good.
> >
> > Ok, I agree. I think audit daemons are always running under privileged
> > users.
> >
> >
> >>
> >>>>
> >>>> Let me try again, with emphasis in the right place.
> >>>>
> >>>> I think that *code running in a namespace* has no business even
> >>>> knowing a unique identity of *that namespace* from the perspective of
> >>>> the host.
> >>>>
> >>>> In your example, if there's a veth device between netns A and netns B,
> >>>> then code *in netns A* has no business knowing the identity of its
> >>>> veth peer if its peer (B) is a sibling or ancestor.  It also IMO has
> >>>> no business knowing the identity of its own netns (A) other than as
> >>>> "my netns".
> >>>
> >>>
> >>> I do not agree (see the example below).
> >>>
> >>>
> >>>>
> >>>> If A and B are siblings, then their parent needs to know where that
> >>>> veth device goes, but I think this is already the case to a sufficient
> >>>> extent today.
> >>>
> >>>
> >>> I'm not aware of a hierarchy between netns. A daemon should be able to
> >>> got the full network configuration, even if it's started when this
> >>> configuration
> >>> is already applied, ie even if it doesn't know what happen before it
> >>> starts.
> >>>
> >>
> >> I don't know exactly which namespaces have an explicit hierarchy, but
> >> there is certainly a hierarchy of *user* namespaces, and network
> >> namespaces live in user namespaces, so they at least have somewhat of
> >> a hierarchy.
> >>
> >>>
> >>>>
> >>>> I feel like this discussion is falling into a common trap of new API
> >>>> discussions.  Can one of you who wants this API please articulate,
> >>>> with a reasonably precise example, what it is that you want to do, why
> >>>> you can't easily do it already, and how this API helps?  I currently
> >>>> understand how the API creates problems, but I don't understand how it
> >>>> solves any problems, and I will NAK it (and I suspect that Eric will,
> >>>> too, which is pretty much fatal) unless that changes.
> >>>
> >>>
> >>> What I'm trying to solve is to have full info in netlink messages sent by
> >>> the
> >>> kernel, thus beeing able to identify a peer netns (and this is close from
> >>> what
> >>> audit guys are trying to have). Theorically, messages sent by the kernel
> >>> can
> >>> be
> >>> reused as is to have the same configuration. This is not the case with
> >>> x-netns
> >>> devices. Here is an example, with ip tunnels:
> >>>
> >>> $ ip netns add 1
> >>> $ ip link add ipip1 type ipip remote 10.16.0.121 local 10.16.0.249 dev
> >>> eth0
> >>> $ ip -d link ls ipip1
> >>> 8: ipip1@eth0: <POINTOPOINT,NOARP> mtu 1480 qdisc noop state DOWN mode
> >>> DEFAULT group default
> >>>      link/ipip 10.16.0.249 peer 10.16.0.121 promiscuity 0
> >>>      ipip remote 10.16.0.121 local 10.16.0.249 dev eth0 ttl inherit
> >>> pmtudisc
> >>> $ ip link set ipip1 netns 1
> >>> $ ip netns exec 1 ip -d link ls ipip1
> >>> 8: ipip1@tunl0: <POINTOPOINT,NOARP,M-DOWN> mtu 1480 qdisc noop state DOWN
> >>> mode DEFAULT group default
> >>>      link/ipip 10.16.0.249 peer 10.16.0.121 promiscuity 0
> >>>      ipip remote 10.16.0.121 local 10.16.0.249 dev tunl0 ttl inherit
> >>> pmtudisc
> >>>
> >>> Now informations got with 'ip link' are wrong and incomplete:
> >>>   - the link dev is now tunl0 instead of eth0, because we only got an
> >>> ifindex
> >>>     from the kernel without any netns informations.
> >>>   - the encapsulation addresses are not part of this netns but the user
> >>> doesn't
> >>>     known that (still because netns info is missing). These IPv4
> >>> addresses
> >>> may
> >>>     exist into this netns.
> >>>   - it's not possible to create the same netdevice with these infos.
> >>>
> >>
> >> Aha.  That's a genuine problem.
> >>
> >> Perhaps we need a concept of which netnses should be able to see each
> >> other.
> >
> > Yes, I agree. This is not required for all netns, only a subset of netns
> > should
> >
> > be able to see each other.
> >
> >>
> >> I think I would be okay with a somewhat different outcome from your
> >> example:
> >>
> >> $ ip netns exec 1 ip -d link ls ipip1
> >> 8: ipip1@[unknown device in another namespace]:
> >> <POINTOPOINT,NOARP,M-DOWN> mtu 1480 qdisc noop state DOWN
> >>
> >> I think this outcome is mandatory if netns 1 lives in a subsidiary
> >> user namespace.
> >
> > Yes.
> >
> >
> >>
> >> Certainly, if you do the 'ip link' in the original namespace, I agree
> >> that this should work.
> >
> > And yes :)
> >
> > I will update my previous proposal
> > (http://thread.gmane.org/gmane.linux.network/315933/focus=321753)
> > to allow to get an id for a peer netns only when the user namespace is the
> > same.
> 
> I think it should work if the peer userns is the same or a descendent.
> I also wonder whether the peer's ifindex should be suppressed if peer
> userns is not the same or a descendent.
> 
> Now you just have to get Eric to be happy with the id allocation. :)
> This may be nontrivial.
> 
> >> For most namespace types, this all works transparently, since
> >> everything has an real identity all the way up the hierarchy.  Network
> >> namespaces are different.
> >>
> >> I don't think that exposing serial numbers in /proc is a good
> >> solution, both for the reasons already described and because I don't
> >> think that iproute2 should need to muck around with /proc to function
> >
> > A netlink API is probably enough. But it will help only for the network
> > problem, not for audit. I was hoping to find a common solution.
> 
> I still don't understand why audit needs anything beyond the audit
> part of this patch set.  I have no problem with audit seeing that
> migrated/restored namespaces are really brand-new namespaces, as long
> as the code in those namespaces isn't exposed to it.

Ok, I'm starting to get this...  Perhaps /proc wasn't the best place to
expose this.  Audit or an audit aggregator is the only one that needs to
know any of this information.  This could be accomplished with
CAP_AUDIT_CONTROL and a new netlink audit message type to fetch
individual or all namespace IDs for a particular PID via auditctl, or by
having a CAP_AUDIT_WRITE-capable application pull the trigger to simply
dump that information to the log.

> >> correctly.  Eric, any clever ideas here?  Do we need fancier netlink
> >> messages for this?
> >>
> >> --Andy
> 
> Andy Lutomirski

- RGB

--
Richard Guy Briggs <rbriggs@redhat.com>
Senior Software Engineer, Kernel Security, AMER ENG Base Operating Systems, Red Hat
Remote, Ottawa, Canada
Voice: +1.647.777.2635, Internal: (81) 32635, Alt: +1.613.693.0684x3545

^ permalink raw reply	[flat|nested] 30+ messages in thread

* Re: [PATCH V4 1/8] namespaces: assign each namespace instance a serial number
  2014-08-24 20:38     ` Richard Guy Briggs
@ 2014-08-28 20:05       ` Eric W. Biederman
  2014-09-02 21:40         ` Richard Guy Briggs
  0 siblings, 1 reply; 30+ messages in thread
From: Eric W. Biederman @ 2014-08-28 20:05 UTC (permalink / raw)
  To: Richard Guy Briggs
  Cc: linux-api, containers, linux-kernel, Andy Lutomirski, linux-audit, serge

Richard Guy Briggs <rgb@redhat.com> writes:

> On 14/08/23, Eric W. Biederman wrote:
>> Richard Guy Briggs <rgb@redhat.com> writes:
>> 
>> > Generate and assign a serial number per namespace instance since boot.
>> >
>> > Use a serial number per namespace (unique across one boot of one kernel)
>> > instead of the inode number (which is claimed to have had the right to change
>> > reserved and is not necessarily unique if there is more than one proc fs) to
>> > uniquely identify it per kernel boot.
>> 
>> This approach is just broken.
>> 
>> For this to work with migration (aka criu) you need to implement a
>> namespace of namespaces.  You haven't done this, and therefore
>> such an interface will break existing userspace.
>> 
>> Inside of audit I can understand not caring about these issues,
>> but you go foward and expose these serial numbers in proc,
>> and generally make this infrastructure available to others.
>> 
>> The deep issue with migration is that we move tasks from one machine
>> from another and on the destination machine we need to have all of the
>> same global identifiers for software to function properly.
>> 
>> My weasel words around the proc inode numbers is to preserve to allow us
>> room to be able to restore those ids if it every becomes relevant for
>> migration.
>
> What do you do if the inode number is already in use on the target
> host?

Since the inode numbers are relative to a superblock or a pid namespace
the numbers that are in use can be restored on the target system
by creating them in the appropriate namespace.

The support does not exist in the kernel today for doing that because no
one has cared but as architected the support can be added if needed to
support migration.

>> That is the proc inode numbers (technically) live in a pid namespace,
>> (aka a mount of proc).  So depending on the pid namespace you are in
>> or the mount of proc you look in the numbers could change.
>> 
>> Qualifications like that must exist to have a prayer of ever supporting
>> process migration in the crazy corner cases where people start caring
>> about inode numbers.
>> 
>> We currently don't and inode numbers for a namespace will never change
>> after a namespace is created.  So I think you really are ok using the
>> proc inode numbers.  I am happy declaring by fiat that the inode numbers
>> that audit uses are the numbers connected to the initial pid namespace.
>
> But once a namespace/container is migrated, it is a different audit that
> is looking at it (unless we create an audit manager or entity that
> functions at the level of a container manager), so audit should not care.

These numbers were exported to everyone as a general purpose facility in
proc.  If audit is global and audit doesn't migrate you are right it
doesn't matter.  However if these numbers are used by anyone else for
anything else it causes a problem.

Further given that people run entire distributions in containers we may
reach the point where we wish to run auditd in a container in the
future.  I would hate to paint ourselves into a corner with a design
that could never allow audit to migrate.  Support that case someday
seems a valid naive desire.

>> At a fairly basic level anything that is used to identify namespaces for
>> any general purpose use needs to have most if not all of the same
>> properties of the proc inode numbers.  The most important of which is
>> being tied to some context/namespace so there is a ability if we ever
>> need it to migrate those numbers from one machine to another.
>
> Sooo...  does it make any sense to have those inode or serial numbers be
> blank inside the namespace/container itself, but only visible to its
> manager outside the container (unless it is the initial namespace)?

Mostly I think it makes sense to use the inode numbers from the initial
pid namespace.  They already exist.  They already are unique.  (Which
means I don't need to maintain more code and more special cases).  And
the do what you need now.

I probably haven't followed closely enough but I don't see what makes
inode numbers undesirable.

Eric

^ permalink raw reply	[flat|nested] 30+ messages in thread

* Re: [PATCH V4 1/8] namespaces: assign each namespace instance a serial number
  2014-08-28 20:05       ` Eric W. Biederman
@ 2014-09-02 21:40         ` Richard Guy Briggs
  0 siblings, 0 replies; 30+ messages in thread
From: Richard Guy Briggs @ 2014-09-02 21:40 UTC (permalink / raw)
  To: Eric W. Biederman
  Cc: linux-api, containers, linux-kernel, Andy Lutomirski, linux-audit, serge

On 14/08/28, Eric W. Biederman wrote:
> Richard Guy Briggs <rgb@redhat.com> writes:
> > On 14/08/23, Eric W. Biederman wrote:
> >> Richard Guy Briggs <rgb@redhat.com> writes:
> >> 
> >> > Generate and assign a serial number per namespace instance since boot.
> >> >
> >> > Use a serial number per namespace (unique across one boot of one kernel)
> >> > instead of the inode number (which is claimed to have had the right to change
> >> > reserved and is not necessarily unique if there is more than one proc fs) to
> >> > uniquely identify it per kernel boot.
> >> 
> >> This approach is just broken.
> >> 
> >> For this to work with migration (aka criu) you need to implement a
> >> namespace of namespaces.  You haven't done this, and therefore
> >> such an interface will break existing userspace.
> >> 
> >> Inside of audit I can understand not caring about these issues,
> >> but you go foward and expose these serial numbers in proc,
> >> and generally make this infrastructure available to others.
> >> 
> >> The deep issue with migration is that we move tasks from one machine
> >> from another and on the destination machine we need to have all of the
> >> same global identifiers for software to function properly.
> >> 
> >> My weasel words around the proc inode numbers is to preserve to allow us
> >> room to be able to restore those ids if it every becomes relevant for
> >> migration.
> >
> > What do you do if the inode number is already in use on the target
> > host?
> 
> Since the inode numbers are relative to a superblock or a pid namespace
> the numbers that are in use can be restored on the target system
> by creating them in the appropriate namespace.

So you seem to be advocating for a namespace of namespaces, since
neither host can create a new namespace without consulting the others in
its pool for a new free number.

> The support does not exist in the kernel today for doing that because no
> one has cared but as architected the support can be added if needed to
> support migration.
> 
> >> That is the proc inode numbers (technically) live in a pid namespace,
> >> (aka a mount of proc).  So depending on the pid namespace you are in
> >> or the mount of proc you look in the numbers could change.
> >> 
> >> Qualifications like that must exist to have a prayer of ever supporting
> >> process migration in the crazy corner cases where people start caring
> >> about inode numbers.
> >> 
> >> We currently don't and inode numbers for a namespace will never change
> >> after a namespace is created.  So I think you really are ok using the
> >> proc inode numbers.  I am happy declaring by fiat that the inode numbers
> >> that audit uses are the numbers connected to the initial pid namespace.
> >
> > But once a namespace/container is migrated, it is a different audit that
> > is looking at it (unless we create an audit manager or entity that
> > functions at the level of a container manager), so audit should not care.
> 
> These numbers were exported to everyone as a general purpose facility in
> proc.  If audit is global and audit doesn't migrate you are right it
> doesn't matter.  However if these numbers are used by anyone else for
> anything else it causes a problem.

So let us restrict their use to audit, by removing them from
/proc/<pid>/ns/ and only exposing them via netlink calls to audit gated
by CAP_AUDIT_WRITE or CAP_AUDIT_CONTROL.

> Further given that people run entire distributions in containers we may
> reach the point where we wish to run auditd in a container in the
> future.  I would hate to paint ourselves into a corner with a design
> that could never allow audit to migrate.  Support that case someday
> seems a valid naive desire.

Agreed.  That is an option we do not want to rule out at this point.
I'll need to think about this one more.

> >> At a fairly basic level anything that is used to identify namespaces for
> >> any general purpose use needs to have most if not all of the same
> >> properties of the proc inode numbers.  The most important of which is
> >> being tied to some context/namespace so there is a ability if we ever
> >> need it to migrate those numbers from one machine to another.
> >
> > Sooo...  does it make any sense to have those inode or serial numbers be
> > blank inside the namespace/container itself, but only visible to its
> > manager outside the container (unless it is the initial namespace)?
> 
> Mostly I think it makes sense to use the inode numbers from the initial
> pid namespace.  They already exist.  They already are unique.  (Which
> means I don't need to maintain more code and more special cases).  And
> the do what you need now.

Will inode numbers never be re-used once they are freed?  Guaranteed?

> I probably haven't followed closely enough but I don't see what makes
> inode numbers undesirable.

This posting:
	https://www.redhat.com/archives/linux-audit/2013-March/msg00032.html

> Eric

- RGB

--
Richard Guy Briggs <rbriggs@redhat.com>
Senior Software Engineer, Kernel Security, AMER ENG Base Operating Systems, Red Hat
Remote, Ottawa, Canada
Voice: +1.647.777.2635, Internal: (81) 32635, Alt: +1.613.693.0684x3545

^ permalink raw reply	[flat|nested] 30+ messages in thread

end of thread, other threads:[~2014-09-02 21:40 UTC | newest]

Thread overview: 30+ messages (download: mbox.gz / follow: Atom feed)
-- links below jump to the message on this page --
2014-08-21  1:09 [PATCH V4 0/8] namespaces: log namespaces per task Richard Guy Briggs
2014-08-21  1:09 ` [PATCH V4 1/8] namespaces: assign each namespace instance a serial number Richard Guy Briggs
2014-08-21 21:22   ` Andy Lutomirski
2014-08-21 21:28     ` Richard Guy Briggs
2014-08-21 21:30       ` Andy Lutomirski
2014-08-21 22:15         ` Richard Guy Briggs
2014-08-23 12:05   ` Eric W. Biederman
2014-08-24 20:38     ` Richard Guy Briggs
2014-08-28 20:05       ` Eric W. Biederman
2014-09-02 21:40         ` Richard Guy Briggs
2014-08-21  1:09 ` [PATCH V4 2/8] namespaces: expose namespace instance serial number in proc_ns_operations Richard Guy Briggs
2014-08-21  1:09 ` [PATCH V4 3/8] namespaces: expose ns instance serial numbers in proc Richard Guy Briggs
2014-08-21 21:13   ` Andy Lutomirski
2014-08-22  1:58     ` Richard Guy Briggs
2014-08-24 17:52       ` Andy Lutomirski
2014-08-24 20:28         ` Richard Guy Briggs
2014-08-25 13:30         ` Nicolas Dichtel
2014-08-25 14:04           ` Andy Lutomirski
2014-08-25 15:43             ` Nicolas Dichtel
2014-08-25 16:13               ` Andy Lutomirski
2014-08-25 16:41                 ` Nicolas Dichtel
2014-08-25 16:50                   ` Andy Lutomirski
2014-08-27 15:17                     ` Richard Guy Briggs
2014-08-21  1:09 ` [PATCH V4 4/8] Documentation: add a section for /proc/<pid>/ns/ Richard Guy Briggs
2014-08-21  1:09 ` [PATCH V4 5/8] namespaces: expose ns_entries Richard Guy Briggs
2014-08-21  1:09 ` [PATCH V4 6/8] audit: log namespace serial numbers Richard Guy Briggs
2014-08-21  1:09 ` [PATCH V4 7/8] audit: log creation and deletion of namespace instances Richard Guy Briggs
2014-08-21  1:09 ` [PATCH V4 8/8] audit: initialize at subsystem time rather than device time Richard Guy Briggs
2014-08-21 20:05 ` [PATCH V4 0/8] namespaces: log namespaces per task Aristeu Rozanski
2014-08-21 22:32   ` Richard Guy Briggs

This is a public inbox, see mirroring instructions
for how to clone and mirror all data and code used for this inbox;
as well as URLs for NNTP newsgroup(s).