SELinux Archive on lore.kernel.org
* [PATCH v5] selinux: sidtab: reverse lookup hash table
@ 2019-11-07 10:17 Jeff Vander Stoep
  2019-11-07 12:01 ` Jeffrey Vander Stoep
                   ` (2 more replies)
  0 siblings, 3 replies; 11+ messages in thread
From: Jeff Vander Stoep @ 2019-11-07 10:17 UTC (permalink / raw)
  To: selinux; +Cc: omosnace, paul, sds, will, Jeff Vander Stoep, Jovana Knezevic

This replaces the reverse table lookup and reverse cache with a
hashtable which improves cache-miss reverse-lookup times from
O(n) to O(1)* and maintains the same performance as a reverse
cache hit.

This reduces the time needed to add a new sidtab entry from ~500us
to 5us on a Pixel 3 when there are ~10,000 sidtab entries.

The implementation uses the kernel's generic hashtable API. It uses
the context's string representation as the hash source, and the
kernel's generic string hashing algorithm full_name_hash() to
reduce the string to a 32-bit value.

This change also maintains the improvement introduced in
commit ee1a84fdfeed ("selinux: overhaul sidtab to fix bug and improve
performance") which removed the need to keep the current sidtab
locked during policy reload. It does, however, introduce periodic
locking of the target sidtab while converting the hashtable. Sidtab
entries are never modified or removed, so the context struct stored
in the sid_to_context tree can also be used for the context_to_sid
hashtable to reduce memory usage.

This bug was reported:
- On the selinux bug tracker:
  BUG: kernel softlockup due to too many SIDs/contexts #37
  https://github.com/SELinuxProject/selinux-kernel/issues/37
- By Jovana Knezevic on Android's bugtracker:
  Bug: 140252993
  "During multi-user performance testing, we create and remove users
  many times. selinux_android_restorecon_pkgdir goes from 1ms to over
  20ms after about 200 user creations and removals. Accumulated over
  ~280 packages, that adds a significant time to user creation,
  making perf benchmarks unreliable."

* Hashtable lookup is only O(1) when n < the number of buckets.

Changes in V2:
-The hashtable uses sidtab_entry_leaf objects directly so these
objects are shared between the sid_to_context lookup tree and the
context_to_sid hashtable. This simplifies memory allocation and
was suggested by Ondrej Mosnacek.
-The new sidtab hash stats file in selinuxfs has been moved out of
the avc dir and into a new "ss" dir.

V3:
-Add lock nesting notation.

V4:
-Moved to *_rcu variants of the various hashtable functions
as suggested by Ondrej Mosnacek and Will Deacon.
-Naming/spelling fixups.

Signed-off-by: Jeff Vander Stoep <jeffv@google.com>
Reported-by: Stephen Smalley <sds@tycho.nsa.gov>
Reported-by: Jovana Knezevic <jovanak@google.com>
---
 security/selinux/Kconfig            |  12 ++
 security/selinux/include/security.h |   1 +
 security/selinux/selinuxfs.c        |  65 +++++++
 security/selinux/ss/context.h       |  11 +-
 security/selinux/ss/policydb.c      |   5 +
 security/selinux/ss/services.c      |  83 +++++++--
 security/selinux/ss/services.h      |   4 +-
 security/selinux/ss/sidtab.c        | 264 ++++++++++++++--------------
 security/selinux/ss/sidtab.h        |  17 +-
 9 files changed, 302 insertions(+), 160 deletions(-)

diff --git a/security/selinux/Kconfig b/security/selinux/Kconfig
index 5711689deb6a..c9e576c430c2 100644
--- a/security/selinux/Kconfig
+++ b/security/selinux/Kconfig
@@ -85,3 +85,15 @@ config SECURITY_SELINUX_CHECKREQPROT_VALUE
 	  via /selinux/checkreqprot if authorized by policy.
 
 	  If you are unsure how to answer this question, answer 0.
+
+config SECURITY_SELINUX_SIDTAB_HASH_BITS
+	int "NSA SELinux sidtab hashtable size"
+	depends on SECURITY_SELINUX
+	range 8 13
+	default 9
+	help
+	  This option sets the number of buckets used in the sidtab hashtable
+	  to 2^SECURITY_SELINUX_SIDTAB_HASH_BITS buckets. The number of hash
+	  collisions may be viewed at /sys/fs/selinux/ss/sidtab_hash_stats. If
+	  chain lengths are high (e.g. > 20) then selecting a higher value here
+	  will ensure that lookup times are short and stable.
diff --git a/security/selinux/include/security.h b/security/selinux/include/security.h
index ae840634e3c7..8c0dbbd076c6 100644
--- a/security/selinux/include/security.h
+++ b/security/selinux/include/security.h
@@ -395,5 +395,6 @@ extern int selinux_nlmsg_lookup(u16 sclass, u16 nlmsg_type, u32 *perm);
 extern void avtab_cache_init(void);
 extern void ebitmap_cache_init(void);
 extern void hashtab_cache_init(void);
+extern int security_sidtab_hash_stats(struct selinux_state *state, char *page);
 
 #endif /* _SELINUX_SECURITY_H_ */
diff --git a/security/selinux/selinuxfs.c b/security/selinux/selinuxfs.c
index ee94fa469c29..dd7bb1f1dc99 100644
--- a/security/selinux/selinuxfs.c
+++ b/security/selinux/selinuxfs.c
@@ -1482,6 +1482,32 @@ static ssize_t sel_read_avc_hash_stats(struct file *filp, char __user *buf,
 	return length;
 }
 
+static ssize_t sel_read_sidtab_hash_stats(struct file *filp, char __user *buf,
+					size_t count, loff_t *ppos)
+{
+	struct selinux_fs_info *fsi = file_inode(filp)->i_sb->s_fs_info;
+	struct selinux_state *state = fsi->state;
+	char *page;
+	ssize_t length;
+
+	page = (char *)__get_free_page(GFP_KERNEL);
+	if (!page)
+		return -ENOMEM;
+
+	length = security_sidtab_hash_stats(state, page);
+	if (length >= 0)
+		length = simple_read_from_buffer(buf, count, ppos, page,
+						length);
+	free_page((unsigned long)page);
+
+	return length;
+}
+
+static const struct file_operations sel_sidtab_hash_stats_ops = {
+	.read		= sel_read_sidtab_hash_stats,
+	.llseek		= generic_file_llseek,
+};
+
 static const struct file_operations sel_avc_cache_threshold_ops = {
 	.read		= sel_read_avc_cache_threshold,
 	.write		= sel_write_avc_cache_threshold,
@@ -1599,6 +1625,37 @@ static int sel_make_avc_files(struct dentry *dir)
 	return 0;
 }
 
+static int sel_make_ss_files(struct dentry *dir)
+{
+	struct super_block *sb = dir->d_sb;
+	struct selinux_fs_info *fsi = sb->s_fs_info;
+	int i;
+	static struct tree_descr files[] = {
+		{ "sidtab_hash_stats", &sel_sidtab_hash_stats_ops, S_IRUGO },
+	};
+
+	for (i = 0; i < ARRAY_SIZE(files); i++) {
+		struct inode *inode;
+		struct dentry *dentry;
+
+		dentry = d_alloc_name(dir, files[i].name);
+		if (!dentry)
+			return -ENOMEM;
+
+		inode = sel_make_inode(dir->d_sb, S_IFREG|files[i].mode);
+		if (!inode) {
+			dput(dentry);
+			return -ENOMEM;
+		}
+
+		inode->i_fop = files[i].ops;
+		inode->i_ino = ++fsi->last_ino;
+		d_add(dentry, inode);
+	}
+
+	return 0;
+}
+
 static ssize_t sel_read_initcon(struct file *file, char __user *buf,
 				size_t count, loff_t *ppos)
 {
@@ -1963,6 +2020,14 @@ static int sel_fill_super(struct super_block *sb, struct fs_context *fc)
 	}
 
 	ret = sel_make_avc_files(dentry);
+
+	dentry = sel_make_dir(sb->s_root, "ss", &fsi->last_ino);
+	if (IS_ERR(dentry)) {
+		ret = PTR_ERR(dentry);
+		goto err;
+	}
+
+	ret = sel_make_ss_files(dentry);
 	if (ret)
 		goto err;
 
diff --git a/security/selinux/ss/context.h b/security/selinux/ss/context.h
index 513e67f48878..3ba044fe02ed 100644
--- a/security/selinux/ss/context.h
+++ b/security/selinux/ss/context.h
@@ -31,6 +31,7 @@ struct context {
 	u32 len;        /* length of string in bytes */
 	struct mls_range range;
 	char *str;	/* string representation if context cannot be mapped. */
+	u32 hash;	/* a hash of the string representation */
 };
 
 static inline void mls_context_init(struct context *c)
@@ -168,12 +169,13 @@ static inline int context_cpy(struct context *dst, struct context *src)
 		kfree(dst->str);
 		return rc;
 	}
+	dst->hash = src->hash;
 	return 0;
 }
 
 static inline void context_destroy(struct context *c)
 {
-	c->user = c->role = c->type = 0;
+	c->user = c->role = c->type = c->hash = 0;
 	kfree(c->str);
 	c->str = NULL;
 	c->len = 0;
@@ -182,6 +184,8 @@ static inline void context_destroy(struct context *c)
 
 static inline int context_cmp(struct context *c1, struct context *c2)
 {
+	if (c1->hash && c2->hash && (c1->hash != c2->hash))
+		return 0;
 	if (c1->len && c2->len)
 		return (c1->len == c2->len && !strcmp(c1->str, c2->str));
 	if (c1->len || c2->len)
@@ -192,5 +196,10 @@ static inline int context_cmp(struct context *c1, struct context *c2)
 		mls_context_cmp(c1, c2));
 }
 
+static inline unsigned int context_compute_hash(const char *s)
+{
+	return full_name_hash(NULL, s, strlen(s));
+}
+
 #endif	/* _SS_CONTEXT_H_ */
 
diff --git a/security/selinux/ss/policydb.c b/security/selinux/ss/policydb.c
index e20624a68f5d..e369b0092cdf 100644
--- a/security/selinux/ss/policydb.c
+++ b/security/selinux/ss/policydb.c
@@ -878,6 +878,11 @@ int policydb_load_isids(struct policydb *p, struct sidtab *s)
 			sidtab_destroy(s);
 			goto out;
 		}
+		rc = context_add_hash(p, &c->context[0]);
+		if (rc) {
+			sidtab_destroy(s);
+			goto out;
+		}
 
 		rc = sidtab_set_initial(s, c->sid[0], &c->context[0]);
 		if (rc) {
diff --git a/security/selinux/ss/services.c b/security/selinux/ss/services.c
index 3a29e7c24ba9..689dff4e756f 100644
--- a/security/selinux/ss/services.c
+++ b/security/selinux/ss/services.c
@@ -1257,6 +1257,17 @@ static int context_struct_to_string(struct policydb *p,
 
 #include "initial_sid_to_string.h"
 
+int security_sidtab_hash_stats(struct selinux_state *state, char *page)
+{
+	int rc;
+
+	read_lock(&state->ss->policy_rwlock);
+	rc = sidtab_hash_stats(state->ss->sidtab, page);
+	read_unlock(&state->ss->policy_rwlock);
+
+	return rc;
+}
+
 const char *security_get_initial_sid_context(u32 sid)
 {
 	if (unlikely(sid > SECINITSID_NUM))
@@ -1384,6 +1395,8 @@ static int string_to_context_struct(struct policydb *pol,
 	int rc = 0;
 
 	context_init(ctx);
+	/* hash the string before it gets mutated */
+	ctx->hash = context_compute_hash(scontext);
 
 	/* Parse the security context. */
 
@@ -1449,6 +1462,42 @@ static int string_to_context_struct(struct policydb *pol,
 	return rc;
 }
 
+int context_add_hash(struct policydb *policydb,
+		     struct context *context)
+{
+	int rc;
+	char *str;
+	int len;
+
+	if (context->str) {
+		context->hash = context_compute_hash(context->str);
+	} else {
+		rc = context_struct_to_string(policydb, context,
+					      &str, &len);
+		if (rc)
+			return rc;
+		context->hash = context_compute_hash(str);
+		kfree(str);
+	}
+	return 0;
+}
+
+static int context_struct_to_sid(struct selinux_state *state,
+				 struct context *context, u32 *sid)
+{
+	int rc;
+	struct sidtab *sidtab = state->ss->sidtab;
+	struct policydb *policydb = &state->ss->policydb;
+
+	if (!context->hash) {
+		rc = context_add_hash(policydb, context);
+		if (rc)
+			return rc;
+	}
+
+	return sidtab_context_to_sid(sidtab, context, sid);
+}
+
 static int security_context_to_sid_core(struct selinux_state *state,
 					const char *scontext, u32 scontext_len,
 					u32 *sid, u32 def_sid, gfp_t gfp_flags,
@@ -1501,7 +1550,7 @@ static int security_context_to_sid_core(struct selinux_state *state,
 		str = NULL;
 	} else if (rc)
 		goto out_unlock;
-	rc = sidtab_context_to_sid(sidtab, &context, sid);
+	rc = context_struct_to_sid(state, &context, sid);
 	context_destroy(&context);
 out_unlock:
 	read_unlock(&state->ss->policy_rwlock);
@@ -1805,7 +1854,7 @@ static int security_compute_sid(struct selinux_state *state,
 			goto out_unlock;
 	}
 	/* Obtain the sid for the context. */
-	rc = sidtab_context_to_sid(sidtab, &newcontext, out_sid);
+	rc = context_struct_to_sid(state, &newcontext, out_sid);
 out_unlock:
 	read_unlock(&state->ss->policy_rwlock);
 	context_destroy(&newcontext);
@@ -2026,6 +2075,10 @@ static int convert_context(struct context *oldc, struct context *newc, void *p)
 			goto bad;
 	}
 
+	rc = context_add_hash(args->newp, newc);
+	if (rc)
+		goto bad;
+
 	return 0;
 bad:
 	/* Map old representation to string and save it. */
@@ -2273,8 +2326,7 @@ int security_port_sid(struct selinux_state *state,
 
 	if (c) {
 		if (!c->sid[0]) {
-			rc = sidtab_context_to_sid(sidtab,
-						   &c->context[0],
+			rc = context_struct_to_sid(state, &c->context[0],
 						   &c->sid[0]);
 			if (rc)
 				goto out;
@@ -2367,8 +2419,7 @@ int security_ib_endport_sid(struct selinux_state *state,
 
 	if (c) {
 		if (!c->sid[0]) {
-			rc = sidtab_context_to_sid(sidtab,
-						   &c->context[0],
+			rc = context_struct_to_sid(state, &c->context[0],
 						   &c->sid[0]);
 			if (rc)
 				goto out;
@@ -2409,13 +2460,11 @@ int security_netif_sid(struct selinux_state *state,
 
 	if (c) {
 		if (!c->sid[0] || !c->sid[1]) {
-			rc = sidtab_context_to_sid(sidtab,
-						  &c->context[0],
-						  &c->sid[0]);
+			rc = context_struct_to_sid(state, &c->context[0],
+						   &c->sid[0]);
 			if (rc)
 				goto out;
-			rc = sidtab_context_to_sid(sidtab,
-						   &c->context[1],
+			rc = context_struct_to_sid(state, &c->context[1],
 						   &c->sid[1]);
 			if (rc)
 				goto out;
@@ -2594,7 +2643,7 @@ int security_get_user_sids(struct selinux_state *state,
 						 &usercon))
 				continue;
 
-			rc = sidtab_context_to_sid(sidtab, &usercon, &sid);
+			rc = context_struct_to_sid(state, &usercon, &sid);
 			if (rc)
 				goto out_unlock;
 			if (mynel < maxnel) {
@@ -2665,7 +2714,6 @@ static inline int __security_genfs_sid(struct selinux_state *state,
 				       u32 *sid)
 {
 	struct policydb *policydb = &state->ss->policydb;
-	struct sidtab *sidtab = state->ss->sidtab;
 	int len;
 	u16 sclass;
 	struct genfs *genfs;
@@ -2700,7 +2748,7 @@ static inline int __security_genfs_sid(struct selinux_state *state,
 		goto out;
 
 	if (!c->sid[0]) {
-		rc = sidtab_context_to_sid(sidtab, &c->context[0], &c->sid[0]);
+		rc = context_struct_to_sid(state, &c->context[0], &c->sid[0]);
 		if (rc)
 			goto out;
 	}
@@ -2763,7 +2811,7 @@ int security_fs_use(struct selinux_state *state, struct super_block *sb)
 	if (c) {
 		sbsec->behavior = c->v.behavior;
 		if (!c->sid[0]) {
-			rc = sidtab_context_to_sid(sidtab, &c->context[0],
+			rc = context_struct_to_sid(state, &c->context[0],
 						   &c->sid[0]);
 			if (rc)
 				goto out;
@@ -3019,8 +3067,7 @@ int security_sid_mls_copy(struct selinux_state *state,
 			goto out_unlock;
 		}
 	}
-
-	rc = sidtab_context_to_sid(sidtab, &newcon, new_sid);
+	rc = context_struct_to_sid(state, &newcon, new_sid);
 out_unlock:
 	read_unlock(&state->ss->policy_rwlock);
 	context_destroy(&newcon);
@@ -3613,7 +3660,7 @@ int security_netlbl_secattr_to_sid(struct selinux_state *state,
 		if (!mls_context_isvalid(policydb, &ctx_new))
 			goto out_free;
 
-		rc = sidtab_context_to_sid(sidtab, &ctx_new, sid);
+		rc = context_struct_to_sid(state, &ctx_new, sid);
 		if (rc)
 			goto out_free;
 
diff --git a/security/selinux/ss/services.h b/security/selinux/ss/services.h
index 9a36de860368..fc40640a9725 100644
--- a/security/selinux/ss/services.h
+++ b/security/selinux/ss/services.h
@@ -8,7 +8,7 @@
 #define _SS_SERVICES_H_
 
 #include "policydb.h"
-#include "sidtab.h"
+#include "context.h"
 
 /* Mapping for a single class */
 struct selinux_mapping {
@@ -39,4 +39,6 @@ void services_compute_xperms_drivers(struct extended_perms *xperms,
 void services_compute_xperms_decision(struct extended_perms_decision *xpermd,
 					struct avtab_node *node);
 
+int context_add_hash(struct policydb *policydb, struct context *context);
+
 #endif	/* _SS_SERVICES_H_ */
diff --git a/security/selinux/ss/sidtab.c b/security/selinux/ss/sidtab.c
index 7d49994e8d5f..c052e620246c 100644
--- a/security/selinux/ss/sidtab.c
+++ b/security/selinux/ss/sidtab.c
@@ -17,26 +17,42 @@
 #include "security.h"
 #include "sidtab.h"
 
+#define index_to_sid(index) (index + SECINITSID_NUM + 1)
+#define sid_to_index(sid) (sid - (SECINITSID_NUM + 1))
+
 int sidtab_init(struct sidtab *s)
 {
 	u32 i;
 
 	memset(s->roots, 0, sizeof(s->roots));
 
-	/* max count is SIDTAB_MAX so valid index is always < SIDTAB_MAX */
-	for (i = 0; i < SIDTAB_RCACHE_SIZE; i++)
-		s->rcache[i] = SIDTAB_MAX;
-
 	for (i = 0; i < SECINITSID_NUM; i++)
 		s->isids[i].set = 0;
 
 	s->count = 0;
 	s->convert = NULL;
+	hash_init(s->context_to_sid);
 
 	spin_lock_init(&s->lock);
 	return 0;
 }
 
+static u32 context_to_sid(struct sidtab *s, struct context *context)
+{
+	struct sidtab_entry_leaf *entry;
+
+	rcu_read_lock();
+	hash_for_each_possible_rcu(s->context_to_sid, entry, list,
+				   context->hash) {
+		if (context_cmp(&entry->context, context)) {
+			rcu_read_unlock();
+			return entry->sid;
+		}
+	}
+	rcu_read_unlock();
+	return 0;
+}
+
 int sidtab_set_initial(struct sidtab *s, u32 sid, struct context *context)
 {
 	struct sidtab_isid_entry *entry;
@@ -47,14 +63,61 @@ int sidtab_set_initial(struct sidtab *s, u32 sid, struct context *context)
 
 	entry = &s->isids[sid - 1];
 
-	rc = context_cpy(&entry->context, context);
+	rc = context_cpy(&entry->leaf.context, context);
 	if (rc)
 		return rc;
 
 	entry->set = 1;
+
+	/*
+	 * Multiple initial sids may map to the same context. Check that this
+	 * context is not already represented in the context_to_sid hashtable
+	 * to avoid duplicate entries and long linked lists upon hash
+	 * collision.
+	 */
+	if (!context_to_sid(s, context)) {
+		entry->leaf.sid = sid;
+		hash_add(s->context_to_sid, &entry->leaf.list, context->hash);
+	}
+
 	return 0;
 }
 
+int sidtab_hash_stats(struct sidtab *sidtab, char *page)
+{
+	int i;
+	int chain_len = 0;
+	int slots_used = 0;
+	int entries = 0;
+	int max_chain_len = 0;
+	int cur_bucket = 0;
+	struct sidtab_entry_leaf *entry;
+
+	rcu_read_lock();
+	hash_for_each_rcu(sidtab->context_to_sid, i, entry, list) {
+		entries++;
+		if (i == cur_bucket) {
+			chain_len++;
+			if (chain_len == 1)
+				slots_used++;
+		} else {
+			cur_bucket = i;
+			if (chain_len > max_chain_len)
+				max_chain_len = chain_len;
+			chain_len = 0;
+		}
+	}
+	rcu_read_unlock();
+
+	if (chain_len > max_chain_len)
+		max_chain_len = chain_len;
+
+	return scnprintf(page, PAGE_SIZE, "context_to_sid:  %d entries and "
+			 "%d/%d buckets used, longest chain length %d\n",
+			 entries, slots_used, SIDTAB_HASH_BUCKETS,
+			 max_chain_len);
+}
+
 static u32 sidtab_level_from_count(u32 count)
 {
 	u32 capacity = SIDTAB_LEAF_ENTRIES;
@@ -88,7 +151,8 @@ static int sidtab_alloc_roots(struct sidtab *s, u32 level)
 	return 0;
 }
 
-static struct context *sidtab_do_lookup(struct sidtab *s, u32 index, int alloc)
+static struct sidtab_entry_leaf *sidtab_do_lookup(struct sidtab *s, u32 index,
+						  int alloc)
 {
 	union sidtab_entry_inner *entry;
 	u32 level, capacity_shift, leaf_index = index / SIDTAB_LEAF_ENTRIES;
@@ -125,7 +189,7 @@ static struct context *sidtab_do_lookup(struct sidtab *s, u32 index, int alloc)
 		if (!entry->ptr_leaf)
 			return NULL;
 	}
-	return &entry->ptr_leaf->entries[index % SIDTAB_LEAF_ENTRIES].context;
+	return &entry->ptr_leaf->entries[index % SIDTAB_LEAF_ENTRIES];
 }
 
 static struct context *sidtab_lookup(struct sidtab *s, u32 index)
@@ -136,12 +200,12 @@ static struct context *sidtab_lookup(struct sidtab *s, u32 index)
 	if (index >= count)
 		return NULL;
 
-	return sidtab_do_lookup(s, index, 0);
+	return &sidtab_do_lookup(s, index, 0)->context;
 }
 
 static struct context *sidtab_lookup_initial(struct sidtab *s, u32 sid)
 {
-	return s->isids[sid - 1].set ? &s->isids[sid - 1].context : NULL;
+	return s->isids[sid - 1].set ? &s->isids[sid - 1].leaf.context : NULL;
 }
 
 static struct context *sidtab_search_core(struct sidtab *s, u32 sid, int force)
@@ -150,7 +214,7 @@ static struct context *sidtab_search_core(struct sidtab *s, u32 sid, int force)
 
 	if (sid != 0) {
 		if (sid > SECINITSID_NUM)
-			context = sidtab_lookup(s, sid - (SECINITSID_NUM + 1));
+			context = sidtab_lookup(s, sid_to_index(sid));
 		else
 			context = sidtab_lookup_initial(s, sid);
 		if (context && (!context->len || force))
@@ -170,117 +234,30 @@ struct context *sidtab_search_force(struct sidtab *s, u32 sid)
 	return sidtab_search_core(s, sid, 1);
 }
 
-static int sidtab_find_context(union sidtab_entry_inner entry,
-			       u32 *pos, u32 count, u32 level,
-			       struct context *context, u32 *index)
-{
-	int rc;
-	u32 i;
-
-	if (level != 0) {
-		struct sidtab_node_inner *node = entry.ptr_inner;
-
-		i = 0;
-		while (i < SIDTAB_INNER_ENTRIES && *pos < count) {
-			rc = sidtab_find_context(node->entries[i],
-						 pos, count, level - 1,
-						 context, index);
-			if (rc == 0)
-				return 0;
-			i++;
-		}
-	} else {
-		struct sidtab_node_leaf *node = entry.ptr_leaf;
-
-		i = 0;
-		while (i < SIDTAB_LEAF_ENTRIES && *pos < count) {
-			if (context_cmp(&node->entries[i].context, context)) {
-				*index = *pos;
-				return 0;
-			}
-			(*pos)++;
-			i++;
-		}
-	}
-	return -ENOENT;
-}
-
-static void sidtab_rcache_update(struct sidtab *s, u32 index, u32 pos)
-{
-	while (pos > 0) {
-		WRITE_ONCE(s->rcache[pos], READ_ONCE(s->rcache[pos - 1]));
-		--pos;
-	}
-	WRITE_ONCE(s->rcache[0], index);
-}
-
-static void sidtab_rcache_push(struct sidtab *s, u32 index)
-{
-	sidtab_rcache_update(s, index, SIDTAB_RCACHE_SIZE - 1);
-}
-
-static int sidtab_rcache_search(struct sidtab *s, struct context *context,
-				u32 *index)
-{
-	u32 i;
-
-	for (i = 0; i < SIDTAB_RCACHE_SIZE; i++) {
-		u32 v = READ_ONCE(s->rcache[i]);
-
-		if (v >= SIDTAB_MAX)
-			continue;
-
-		if (context_cmp(sidtab_do_lookup(s, v, 0), context)) {
-			sidtab_rcache_update(s, v, i);
-			*index = v;
-			return 0;
-		}
-	}
-	return -ENOENT;
-}
-
-static int sidtab_reverse_lookup(struct sidtab *s, struct context *context,
-				 u32 *index)
+int sidtab_context_to_sid(struct sidtab *s, struct context *context,
+			  u32 *sid)
 {
 	unsigned long flags;
-	u32 count, count_locked, level, pos;
+	u32 count;
 	struct sidtab_convert_params *convert;
-	struct context *dst, *dst_convert;
+	struct sidtab_entry_leaf *dst, *dst_convert;
 	int rc;
 
-	rc = sidtab_rcache_search(s, context, index);
-	if (rc == 0)
+	*sid = context_to_sid(s, context);
+	if (*sid)
 		return 0;
 
-	/* read entries only after reading count */
-	count = smp_load_acquire(&s->count);
-	level = sidtab_level_from_count(count);
-
-	pos = 0;
-	rc = sidtab_find_context(s->roots[level], &pos, count, level,
-				 context, index);
-	if (rc == 0) {
-		sidtab_rcache_push(s, *index);
-		return 0;
-	}
-
 	/* lock-free search failed: lock, re-search, and insert if not found */
 	spin_lock_irqsave(&s->lock, flags);
 
+	rc = 0;
+	*sid = context_to_sid(s, context);
+	if (*sid)
+		goto out_unlock;
+
+	/* read entries only after reading count */
+	count = smp_load_acquire(&s->count);
 	convert = s->convert;
-	count_locked = s->count;
-	level = sidtab_level_from_count(count_locked);
-
-	/* if count has changed before we acquired the lock, then catch up */
-	while (count < count_locked) {
-		if (context_cmp(sidtab_do_lookup(s, count, 0), context)) {
-			sidtab_rcache_push(s, count);
-			*index = count;
-			rc = 0;
-			goto out_unlock;
-		}
-		++count;
-	}
 
 	/* bail out if we already reached max entries */
 	rc = -EOVERFLOW;
@@ -293,7 +270,9 @@ static int sidtab_reverse_lookup(struct sidtab *s, struct context *context,
 	if (!dst)
 		goto out_unlock;
 
-	rc = context_cpy(dst, context);
+	dst->sid = index_to_sid(count);
+
+	rc = context_cpy(&dst->context, context);
 	if (rc)
 		goto out_unlock;
 
@@ -305,29 +284,36 @@ static int sidtab_reverse_lookup(struct sidtab *s, struct context *context,
 		rc = -ENOMEM;
 		dst_convert = sidtab_do_lookup(convert->target, count, 1);
 		if (!dst_convert) {
-			context_destroy(dst);
+			context_destroy(&dst->context);
 			goto out_unlock;
 		}
 
-		rc = convert->func(context, dst_convert, convert->args);
+		rc = convert->func(context, &dst_convert->context,
+				convert->args);
 		if (rc) {
-			context_destroy(dst);
+			context_destroy(&dst->context);
 			goto out_unlock;
 		}
+		dst_convert->sid = index_to_sid(count);
+		convert->target->count = count + 1;
 
 		/* at this point we know the insert won't fail */
-		convert->target->count = count + 1;
+		spin_lock_irqsave_nested(&convert->target->lock, flags,
+				SINGLE_DEPTH_NESTING);
+		hash_add_rcu(convert->target->context_to_sid,
+				&dst_convert->list, dst_convert->context.hash);
+		spin_unlock_irqrestore(&convert->target->lock, flags);
 	}
 
 	if (context->len)
 		pr_info("SELinux:  Context %s is not valid (left unmapped).\n",
 			context->str);
 
-	sidtab_rcache_push(s, count);
-	*index = count;
+	*sid = index_to_sid(count);
 
-	/* write entries before writing new count */
+	/* Write entries before updating count. */
 	smp_store_release(&s->count, count + 1);
+	hash_add_rcu(s->context_to_sid, &dst->list, dst->context.hash);
 
 	rc = 0;
 out_unlock:
@@ -335,25 +321,23 @@ static int sidtab_reverse_lookup(struct sidtab *s, struct context *context,
 	return rc;
 }
 
-int sidtab_context_to_sid(struct sidtab *s, struct context *context, u32 *sid)
+static void sidtab_convert_hashtable(struct sidtab *s, u32 count)
 {
-	int rc;
+	unsigned long flags;
+	struct sidtab_entry_leaf *entry;
 	u32 i;
 
-	for (i = 0; i < SECINITSID_NUM; i++) {
-		struct sidtab_isid_entry *entry = &s->isids[i];
+	for (i = 0; i < count; i++) {
+		entry = sidtab_do_lookup(s, i, 0);
+		entry->sid = index_to_sid(i);
 
-		if (entry->set && context_cmp(context, &entry->context)) {
-			*sid = i + 1;
-			return 0;
-		}
-	}
+		spin_lock_irqsave(&s->lock, flags);
+		hash_add_rcu(s->context_to_sid, &entry->list,
+				entry->context.hash);
+		spin_unlock_irqrestore(&s->lock, flags);
 
-	rc = sidtab_reverse_lookup(s, context, sid);
-	if (rc)
-		return rc;
-	*sid += SECINITSID_NUM + 1;
-	return 0;
+		cond_resched();
+	}
 }
 
 static int sidtab_convert_tree(union sidtab_entry_inner *edst,
@@ -400,6 +384,7 @@ static int sidtab_convert_tree(union sidtab_entry_inner *edst,
 		}
 		cond_resched();
 	}
+
 	return 0;
 }
 
@@ -449,8 +434,12 @@ int sidtab_convert(struct sidtab *s, struct sidtab_convert_params *params)
 		spin_lock_irqsave(&s->lock, flags);
 		s->convert = NULL;
 		spin_unlock_irqrestore(&s->lock, flags);
+		return rc;
 	}
-	return rc;
+
+	sidtab_convert_hashtable(params->target, count);
+
+	return 0;
 }
 
 static void sidtab_destroy_tree(union sidtab_entry_inner entry, u32 level)
@@ -474,7 +463,7 @@ static void sidtab_destroy_tree(union sidtab_entry_inner entry, u32 level)
 
 		for (i = 0; i < SIDTAB_LEAF_ENTRIES; i++)
 			context_destroy(&node->entries[i].context);
-		kfree(node);
+		kfree_rcu(node, rcu);
 	}
 }
 
@@ -484,11 +473,16 @@ void sidtab_destroy(struct sidtab *s)
 
 	for (i = 0; i < SECINITSID_NUM; i++)
 		if (s->isids[i].set)
-			context_destroy(&s->isids[i].context);
+			context_destroy(&s->isids[i].leaf.context);
 
 	level = SIDTAB_MAX_LEVEL;
 	while (level && !s->roots[level].ptr_inner)
 		--level;
 
 	sidtab_destroy_tree(s->roots[level], level);
+	/*
+	 * The context_to_sid hashtable's objects are all shared
+	 * with the isids array and context tree, and so don't need
+	 * to be cleaned up here.
+	 */
 }
diff --git a/security/selinux/ss/sidtab.h b/security/selinux/ss/sidtab.h
index 1f4763141aa1..f1f505df97ba 100644
--- a/security/selinux/ss/sidtab.h
+++ b/security/selinux/ss/sidtab.h
@@ -13,11 +13,14 @@
 
 #include <linux/spinlock_types.h>
 #include <linux/log2.h>
+#include <linux/hashtable.h>
 
 #include "context.h"
 
 struct sidtab_entry_leaf {
+	u32 sid;
 	struct context context;
+	struct hlist_node list;
 };
 
 struct sidtab_node_inner;
@@ -49,6 +52,7 @@ union sidtab_entry_inner {
 
 struct sidtab_node_leaf {
 	struct sidtab_entry_leaf entries[SIDTAB_LEAF_ENTRIES];
+	struct rcu_head rcu;
 };
 
 struct sidtab_node_inner {
@@ -57,7 +61,7 @@ struct sidtab_node_inner {
 
 struct sidtab_isid_entry {
 	int set;
-	struct context context;
+	struct sidtab_entry_leaf leaf;
 };
 
 struct sidtab_convert_params {
@@ -66,7 +70,8 @@ struct sidtab_convert_params {
 	struct sidtab *target;
 };
 
-#define SIDTAB_RCACHE_SIZE 3
+#define SIDTAB_HASH_BITS CONFIG_SECURITY_SELINUX_SIDTAB_HASH_BITS
+#define SIDTAB_HASH_BUCKETS (1 << SIDTAB_HASH_BITS)
 
 struct sidtab {
 	/*
@@ -83,11 +88,11 @@ struct sidtab {
 	struct sidtab_convert_params *convert;
 	spinlock_t lock;
 
-	/* reverse lookup cache - access atomically via {READ|WRITE}_ONCE() */
-	u32 rcache[SIDTAB_RCACHE_SIZE];
-
 	/* index == SID - 1 (no entry for SECSID_NULL) */
 	struct sidtab_isid_entry isids[SECINITSID_NUM];
+
+	/* Hash table for fast reverse context-to-sid lookups. */
+	DECLARE_HASHTABLE(context_to_sid, SIDTAB_HASH_BITS);
 };
 
 int sidtab_init(struct sidtab *s);
@@ -101,6 +106,8 @@ int sidtab_context_to_sid(struct sidtab *s, struct context *context, u32 *sid);
 
 void sidtab_destroy(struct sidtab *s);
 
+int sidtab_hash_stats(struct sidtab *sidtab, char *page);
+
 #endif	/* _SS_SIDTAB_H_ */
 
 
-- 
2.24.0.rc1.363.gb1bccd3e3d-goog



* Re: [PATCH v5] selinux: sidtab: reverse lookup hash table
  2019-11-07 10:17 [PATCH v5] selinux: sidtab: reverse lookup hash table Jeff Vander Stoep
@ 2019-11-07 12:01 ` Jeffrey Vander Stoep
  2019-11-14 23:35   ` Paul Moore
  2019-11-07 14:54 ` Stephen Smalley
  2019-11-07 16:12 ` Ondrej Mosnacek
  2 siblings, 1 reply; 11+ messages in thread
From: Jeffrey Vander Stoep @ 2019-11-07 12:01 UTC (permalink / raw)
  To: SElinux list
  Cc: Ondrej Mosnacek, Paul Moore, Stephen Smalley, will, Jovana Knezevic

Note this should be v4, not v5.

Paul, please let me know if you'd prefer my change applied on top of Ondrej's
 "selinux: cache the SID -> context string translation" patch.


* Re: [PATCH v5] selinux: sidtab: reverse lookup hash table
  2019-11-07 10:17 [PATCH v5] selinux: sidtab: reverse lookup hash table Jeff Vander Stoep
  2019-11-07 12:01 ` Jeffrey Vander Stoep
@ 2019-11-07 14:54 ` Stephen Smalley
  2019-11-07 15:42   ` Ondrej Mosnacek
  2019-11-07 15:49   ` Jeffrey Vander Stoep
  2019-11-07 16:12 ` Ondrej Mosnacek
  2 siblings, 2 replies; 11+ messages in thread
From: Stephen Smalley @ 2019-11-07 14:54 UTC (permalink / raw)
  To: Jeff Vander Stoep, selinux; +Cc: omosnace, paul, will, Jovana Knezevic

On 11/7/19 5:17 AM, Jeff Vander Stoep wrote:
> This replaces the reverse table lookup and reverse cache with a
> hashtable which improves cache-miss reverse-lookup times from
> O(n) to O(1)* and maintains the same performance as a reverse
> cache hit.
> 
> This reduces the time needed to add a new sidtab entry from ~500us
> to 5us on a Pixel 3 when there are ~10,000 sidtab entries.
> 
> The implementation uses the kernel's generic hashtable API. It uses
> the context's string representation as the hash source, and the
> kernel's generic string hashing algorithm full_name_hash() to
> reduce the string to a 32-bit value.
> 
> This change also maintains the improvement introduced in
> commit ee1a84fdfeed ("selinux: overhaul sidtab to fix bug and improve
> performance") which removed the need to keep the current sidtab
> locked during policy reload. It does however introduce periodic
> locking of the target sidtab while converting the hashtable. Sidtab
> entries are never modified or removed, so the context struct stored
> in the sid_to_context tree can also be used for the context_to_sid
> hashtable to reduce memory usage.
> 
> This bug was reported by:
> - On the selinux bug tracker.
>    BUG: kernel softlockup due to too many SIDs/contexts #37
>    https://github.com/SELinuxProject/selinux-kernel/issues/37
> - Jovana Knezevic on Android's bugtracker.
>    Bug: 140252993
>    "During multi-user performance testing, we create and remove users
>    many times. selinux_android_restorecon_pkgdir goes from 1ms to over
>    20ms after about 200 user creations and removals. Accumulated over
>    ~280 packages, that adds a significant time to user creation,
>    making perf benchmarks unreliable."
> 
> * Hashtable lookup is only O(1) when n < the number of buckets.
> 
> Changes in V2:
> -The hashtable uses sidtab_entry_leaf objects directly so these
> objects are shared between the sid_to_context lookup tree and the
> context_to_sid hashtable. This simplifies memory allocation and
> was suggested by Ondrej Mosnacek.
> -The new sidtab hash stats file in selinuxfs has been moved out of
> the avc dir and into a new "ss" dir.
> 
> V3:
> -Add lock nesting notation.
> 
> V4:
> -Moved to *_rcu variants of the various hashtable functions
> as suggested by Ondrej Mosnacek and Will Deacon.
> -Naming/spelling fixups.

This still triggers a splat for me when running Ondrej's reproducer (also 
ran the selinux-testsuite first; not sure if that mattered):

SELinux:  Converting 3784 SID table entries...
kernel: ------------[ cut here ]------------
syscall 1 left IRQs disabled
WARNING: CPU: 1 PID: 8313 at arch/x86/entry/common.c:261 
syscall_return_slowpathModules linked in: crypto_user 
scsi_transport_iscsi xt_multiport bluetooth ecdh_kernel:  crc32_pclmul 
snd_hda_codec snd_hwdep snd_hda_core ghash_clmulni_intel sCPU: 1 PID: 
8313 Comm: runcon Not tainted 5.4.0-rc1+ #57
Hardware name: Dell Inc. OptiPlex 7050/0NW6H5, BIOS 1.8.3 03/23/2018
RIP: 0010:syscall_return_slowpath+0x1e1/0x4a0
Code: ff ff e9 60 ff ff ff e8 2d ad 05 00 e9 2a ff ff ff 48 8d 7d 78 e8 
cf cf 50RSP: 0018:ffff8885a6adff28 EFLAGS: 00010082
RAX: 0000000000000000 RBX: 0000000000000080 RCX: 0000000000000000
RDX: 0000000000000003 RSI: 0000000000000000 RDI: ffffed10b4d5bfd7
RBP: ffff8885a6adff58 R08: ffffffffaa219f90 R09: fffffbfff5883615
R10: fffffbfff5883614 R11: ffffffffac41b0a3 R12: 0000000000000000
R13: 0000000000000000 R14: 0000000000000000 R15: 0000000000000000
FS:  00007f3e32c67080(0000) GS:ffff888822a40000(0000) knlGS:0000000000000000
CS:  0010 DS: 0000 ES: 0000 CR0: 0000000080050033
CR2: 00007f3e32f145d0 CR3: 00000005a7476005 CR4: 00000000003606e0
DR0: 0000000000000000 DR1: 0000000000000000 DR2: 0000000000000000
DR3: 0000000000000000 DR6: 00000000fffe0ff0 DR7: 0000000000000400
Call Trace:
kernel:  entry_SYSCALL_64_after_hwframe+0x49/0xbe
RIP: 0033:0x7f3e32e11467
Code: 64 89 02 48 c7 c0 ff ff ff ff eb bb 0f 1f 80 00 00 00 00 f3 0f 1e 
fa 64 8bRSP: 002b:00007ffec6482bc8 EFLAGS: 00000246 ORIG_RAX: 
0000000000000001
RAX: 000000000000002f RBX: 00007ffec6483d88 RCX: 00007f3e32e11467
RDX: 000000000000002f RSI: 000055b75e5a5600 RDI: 0000000000000003
RBP: 000055b75e5a5600 R08: 00000000ffffffff R09: 00007ffec6482a60
R10: 0000000000000000 R11: 0000000000000246 R12: 0000000000000003
R13: 00007ffec6483c4c R14: 0000000000000000 R15: 0000000000000000
irq event stamp: 7292
hardirqs last  enabled at (7291): [<ffffffffab26d9bb>] 
_raw_write_unlock_irqresthardirqs last disabled at (7292): 
[<ffffffffab26db61>] _raw_spin_lock_irqsave+0xsoftirqs last  enabled at 
(7036): [<ffffffffab600494>] __do_softirq+0x494/0x665
softirqs last disabled at (7025): [<ffffffffaa168b8f>] irq_exit+0x13f/0x160
kernel: ---[ end trace 1557497a64e0b001 ]---
kernel: ------------[ cut here ]------------

Some comments below.  Recommend running it by the RCU wizards e.g. Paul 
McKenney too given the subtleties of getting RCU right.

<snip>

> diff --git a/security/selinux/ss/sidtab.c b/security/selinux/ss/sidtab.c
> index 7d49994e8d5f..c052e620246c 100644
> --- a/security/selinux/ss/sidtab.c
> +++ b/security/selinux/ss/sidtab.c
> @@ -17,26 +17,42 @@
<snip>
> +static u32 context_to_sid(struct sidtab *s, struct context *context)
> +{
> +	struct sidtab_entry_leaf *entry;
> +
> +	rcu_read_lock();
> +	hash_for_each_possible_rcu(s->context_to_sid, entry, list,
> +				   context->hash) {
> +		if (context_cmp(&entry->context, context)) {
> +			rcu_read_unlock();
> +			return entry->sid;

This looks unsafe to me; I would have assumed you need to at least save 
entry->sid to a temporary before rcu_read_unlock()?

> @@ -305,29 +284,36 @@ static int sidtab_reverse_lookup(struct sidtab *s, struct context *context,
>   		rc = -ENOMEM;
>   		dst_convert = sidtab_do_lookup(convert->target, count, 1);
>   		if (!dst_convert) {
> -			context_destroy(dst);
> +			context_destroy(&dst->context);
>   			goto out_unlock;
>   		}
>   
> -		rc = convert->func(context, dst_convert, convert->args);
> +		rc = convert->func(context, &dst_convert->context,
> +				convert->args);
>   		if (rc) {
> -			context_destroy(dst);
> +			context_destroy(&dst->context);
>   			goto out_unlock;
>   		}
> +		dst_convert->sid = index_to_sid(count);
> +		convert->target->count = count + 1;
>   
>   		/* at this point we know the insert won't fail */
> -		convert->target->count = count + 1;
> +		spin_lock_irqsave_nested(&convert->target->lock, flags,
> +				SINGLE_DEPTH_NESTING);
> +		hash_add_rcu(convert->target->context_to_sid,
> +				&dst_convert->list, dst_convert->context.hash);
> +		spin_unlock_irqrestore(&convert->target->lock, flags);

Still having a hard time understanding why we need to lock the target. 
The target here is always the newsidtab allocated by 
security_load_policy(), not yet exposed to any other threads in the system?

>   	}
>   
>   	if (context->len)
>   		pr_info("SELinux:  Context %s is not valid (left unmapped).\n",
>   			context->str);
>   
> -	sidtab_rcache_push(s, count);
> -	*index = count;
> +	*sid = index_to_sid(count);
>   
> -	/* write entries before writing new count */
> +	/* Write entries before updating count. */
>   	smp_store_release(&s->count, count + 1);
> +	hash_add_rcu(s->context_to_sid, &dst->list, dst->context.hash);

Exactly what guarantee do we think we are getting from placing the 
hash_add_rcu() after the smp_store_release() on s->count, and is that 
guarantee well-founded?

> @@ -474,7 +463,7 @@ static void sidtab_destroy_tree(union sidtab_entry_inner entry, u32 level)
>   
>   		for (i = 0; i < SIDTAB_LEAF_ENTRIES; i++)
>   			context_destroy(&node->entries[i].context);
> -		kfree(node);
> +		kfree_rcu(node, rcu);

Is it safe to do this without having deleted it from the hash and 
without having taken the spinlock?

>   	}
>   }
>   

> diff --git a/security/selinux/ss/sidtab.h b/security/selinux/ss/sidtab.h
> index 1f4763141aa1..f1f505df97ba 100644
> --- a/security/selinux/ss/sidtab.h
> +++ b/security/selinux/ss/sidtab.h
> @@ -83,11 +88,11 @@ struct sidtab {
>   	struct sidtab_convert_params *convert;
>   	spinlock_t lock;
>   
> -	/* reverse lookup cache - access atomically via {READ|WRITE}_ONCE() */
> -	u32 rcache[SIDTAB_RCACHE_SIZE];
> -
>   	/* index == SID - 1 (no entry for SECSID_NULL) */
>   	struct sidtab_isid_entry isids[SECINITSID_NUM];
> +
> +	/* Hash table for fast reverse context-to-sid lookups. */
> +	DECLARE_HASHTABLE(context_to_sid, SIDTAB_HASH_BITS);

Should this and/or any of the associated RCU-protected structures be 
marked with __rcu for sparse checking?

>   };
>   
>   int sidtab_init(struct sidtab *s);



^ permalink raw reply	[flat|nested] 11+ messages in thread

* Re: [PATCH v5] selinux: sidtab: reverse lookup hash table
  2019-11-07 14:54 ` Stephen Smalley
@ 2019-11-07 15:42   ` Ondrej Mosnacek
  2019-11-07 15:59     ` Jeffrey Vander Stoep
  2019-11-07 15:49   ` Jeffrey Vander Stoep
  1 sibling, 1 reply; 11+ messages in thread
From: Ondrej Mosnacek @ 2019-11-07 15:42 UTC (permalink / raw)
  To: Stephen Smalley
  Cc: Jeff Vander Stoep, SElinux list, Paul Moore, will, Jovana Knezevic

On Thu, Nov 7, 2019 at 3:54 PM Stephen Smalley <sds@tycho.nsa.gov> wrote:
> On 11/7/19 5:17 AM, Jeff Vander Stoep wrote:
> > This replaces the reverse table lookup and reverse cache with a
> > hashtable which improves cache-miss reverse-lookup times from
> > O(n) to O(1)* and maintains the same performance as a reverse
> > cache hit.
> >
> > This reduces the time needed to add a new sidtab entry from ~500us
> > to 5us on a Pixel 3 when there are ~10,000 sidtab entries.
> >
> > The implementation uses the kernel's generic hashtable API.
> > It uses the context's string representation as the hash source,
> > and the kernel's generic string hashing algorithm full_name_hash()
> > to reduce the string to a 32-bit value.
> >
> > This change also maintains the improvement introduced in
> > commit ee1a84fdfeed ("selinux: overhaul sidtab to fix bug and improve
> > performance") which removed the need to keep the current sidtab
> > locked during policy reload. It does however introduce periodic
> > locking of the target sidtab while converting the hashtable. Sidtab
> > entries are never modified or removed, so the context struct stored
> > in the sid_to_context tree can also be used for the context_to_sid
> > hashtable to reduce memory usage.
> >
> > This bug was reported by:
> > - On the selinux bug tracker.
> >    BUG: kernel softlockup due to too many SIDs/contexts #37
> >    https://github.com/SELinuxProject/selinux-kernel/issues/37
> > - Jovana Knezevic on Android's bugtracker.
> >    Bug: 140252993
> >    "During multi-user performance testing, we create and remove users
> >    many times. selinux_android_restorecon_pkgdir goes from 1ms to over
> >    20ms after about 200 user creations and removals. Accumulated over
> >    ~280 packages, that adds a significant time to user creation,
> >    making perf benchmarks unreliable."
> >
> > * Hashtable lookup is only O(1) when n < the number of buckets.
> >
> > Changes in V2:
> > -The hashtable uses sidtab_entry_leaf objects directly so these
> > objects are shared between the sid_to_context lookup tree and the
> > context_to_sid hashtable. This simplifies memory allocation and
> > was suggested by Ondrej Mosnacek.
> > -The new sidtab hash stats file in selinuxfs has been moved out of
> > the avc dir and into a new "ss" dir.
> >
> > V3:
> > -Add lock nesting notation.
> >
> > V4:
> > -Moved to *_rcu variants of the various hashtable functions
> > as suggested by Ondrej Mosnacek and Will Deacon.
> > -Naming/spelling fixups.
>
> This still triggers a splat for me in running Ondrej's reproducer (also
> ran selinux-testsuite first, not sure if that mattered):
>
> SELinux:  Converting 3784 SID table entries...
> kernel: ------------[ cut here ]------------
> syscall 1 left IRQs disabled
> WARNING: CPU: 1 PID: 8313 at arch/x86/entry/common.c:261
> syscall_return_slowpathModules linked in: crypto_user
> scsi_transport_iscsi xt_multiport bluetooth ecdh_kernel:  crc32_pclmul
> snd_hda_codec snd_hwdep snd_hda_core ghash_clmulni_intel sCPU: 1 PID:
> 8313 Comm: runcon Not tainted 5.4.0-rc1+ #57
> Hardware name: Dell Inc. OptiPlex 7050/0NW6H5, BIOS 1.8.3 03/23/2018
> RIP: 0010:syscall_return_slowpath+0x1e1/0x4a0
> Code: ff ff e9 60 ff ff ff e8 2d ad 05 00 e9 2a ff ff ff 48 8d 7d 78 e8
> cf cf 50RSP: 0018:ffff8885a6adff28 EFLAGS: 00010082
> RAX: 0000000000000000 RBX: 0000000000000080 RCX: 0000000000000000
> RDX: 0000000000000003 RSI: 0000000000000000 RDI: ffffed10b4d5bfd7
> RBP: ffff8885a6adff58 R08: ffffffffaa219f90 R09: fffffbfff5883615
> R10: fffffbfff5883614 R11: ffffffffac41b0a3 R12: 0000000000000000
> R13: 0000000000000000 R14: 0000000000000000 R15: 0000000000000000
> FS:  00007f3e32c67080(0000) GS:ffff888822a40000(0000) knlGS:0000000000000000
> CS:  0010 DS: 0000 ES: 0000 CR0: 0000000080050033
> CR2: 00007f3e32f145d0 CR3: 00000005a7476005 CR4: 00000000003606e0
> DR0: 0000000000000000 DR1: 0000000000000000 DR2: 0000000000000000
> DR3: 0000000000000000 DR6: 00000000fffe0ff0 DR7: 0000000000000400
> Call Trace:
> kernel:  entry_SYSCALL_64_after_hwframe+0x49/0xbe
> RIP: 0033:0x7f3e32e11467
> Code: 64 89 02 48 c7 c0 ff ff ff ff eb bb 0f 1f 80 00 00 00 00 f3 0f 1e
> fa 64 8bRSP: 002b:00007ffec6482bc8 EFLAGS: 00000246 ORIG_RAX:
> 0000000000000001
> RAX: 000000000000002f RBX: 00007ffec6483d88 RCX: 00007f3e32e11467
> RDX: 000000000000002f RSI: 000055b75e5a5600 RDI: 0000000000000003
> RBP: 000055b75e5a5600 R08: 00000000ffffffff R09: 00007ffec6482a60
> R10: 0000000000000000 R11: 0000000000000246 R12: 0000000000000003
> R13: 00007ffec6483c4c R14: 0000000000000000 R15: 0000000000000000
> irq event stamp: 7292
> hardirqs last  enabled at (7291): [<ffffffffab26d9bb>]
> _raw_write_unlock_irqresthardirqs last disabled at (7292):
> [<ffffffffab26db61>] _raw_spin_lock_irqsave+0xsoftirqs last  enabled at
> (7036): [<ffffffffab600494>] __do_softirq+0x494/0x665
> softirqs last disabled at (7025): [<ffffffffaa168b8f>] irq_exit+0x13f/0x160
> kernel: ---[ end trace 1557497a64e0b001 ]---
> kernel: ------------[ cut here ]------------
>
> Some comments below.  Recommend running it by the RCU wizards e.g. Paul
> McKenney too given the subtleties of getting RCU right.
>
> <snip>
>
> > diff --git a/security/selinux/ss/sidtab.c b/security/selinux/ss/sidtab.c
> > index 7d49994e8d5f..c052e620246c 100644
> > --- a/security/selinux/ss/sidtab.c
> > +++ b/security/selinux/ss/sidtab.c
> > @@ -17,26 +17,42 @@
> <snip>
> > +static u32 context_to_sid(struct sidtab *s, struct context *context)
> > +{
> > +     struct sidtab_entry_leaf *entry;
> > +
> > +     rcu_read_lock();
> > +     hash_for_each_possible_rcu(s->context_to_sid, entry, list,
> > +                                context->hash) {
> > +             if (context_cmp(&entry->context, context)) {
> > +                     rcu_read_unlock();
> > +                     return entry->sid;
>
> This looks unsafe to me; I would have assumed you need to at least save
> entry->sid to a temporary before rcu_read_unlock()?
>
> > @@ -305,29 +284,36 @@ static int sidtab_reverse_lookup(struct sidtab *s, struct context *context,
> >               rc = -ENOMEM;
> >               dst_convert = sidtab_do_lookup(convert->target, count, 1);
> >               if (!dst_convert) {
> > -                     context_destroy(dst);
> > +                     context_destroy(&dst->context);
> >                       goto out_unlock;
> >               }
> >
> > -             rc = convert->func(context, dst_convert, convert->args);
> > +             rc = convert->func(context, &dst_convert->context,
> > +                             convert->args);
> >               if (rc) {
> > -                     context_destroy(dst);
> > +                     context_destroy(&dst->context);
> >                       goto out_unlock;
> >               }
> > +             dst_convert->sid = index_to_sid(count);
> > +             convert->target->count = count + 1;
> >
> >               /* at this point we know the insert won't fail */
> > -             convert->target->count = count + 1;
> > +             spin_lock_irqsave_nested(&convert->target->lock, flags,
> > +                             SINGLE_DEPTH_NESTING);
> > +             hash_add_rcu(convert->target->context_to_sid,
> > +                             &dst_convert->list, dst_convert->context.hash);
> > +             spin_unlock_irqrestore(&convert->target->lock, flags);
>
> Still having a hard time understanding why we need to lock the target.
> The target here is always the newsidtab allocated by
> security_load_policy(), not yet exposed to any other threads in the system?

I believe this is to avoid a race with the hash_add() in
sidtab_convert_hashtable(), which locks only the target table.
However, I don't like the fact that we need to mess with inconsistent
lock nesting here... I think it would be better to just lock the
master's spinlock in sidtab_convert_hashtable() - it will then block
more when entries would be added to sidtab, but that shouldn't be a
big concern in practice.

-- 
Ondrej Mosnacek <omosnace at redhat dot com>
Software Engineer, Security Technologies
Red Hat, Inc.



* Re: [PATCH v5] selinux: sidtab: reverse lookup hash table
  2019-11-07 14:54 ` Stephen Smalley
  2019-11-07 15:42   ` Ondrej Mosnacek
@ 2019-11-07 15:49   ` Jeffrey Vander Stoep
  1 sibling, 0 replies; 11+ messages in thread
From: Jeffrey Vander Stoep @ 2019-11-07 15:49 UTC (permalink / raw)
  To: Stephen Smalley
  Cc: SElinux list, Ondrej Mosnacek, Paul Moore, will, Jovana Knezevic

On Thu, Nov 7, 2019 at 3:54 PM Stephen Smalley <sds@tycho.nsa.gov> wrote:
>
> On 11/7/19 5:17 AM, Jeff Vander Stoep wrote:
> > This replaces the reverse table lookup and reverse cache with a
> > hashtable which improves cache-miss reverse-lookup times from
> > O(n) to O(1)* and maintains the same performance as a reverse
> > cache hit.
> >
> > This reduces the time needed to add a new sidtab entry from ~500us
> > to 5us on a Pixel 3 when there are ~10,000 sidtab entries.
> >
> > The implementation uses the kernel's generic hashtable API.
> > It uses the context's string representation as the hash source,
> > and the kernel's generic string hashing algorithm full_name_hash()
> > to reduce the string to a 32-bit value.
> >
> > This change also maintains the improvement introduced in
> > commit ee1a84fdfeed ("selinux: overhaul sidtab to fix bug and improve
> > performance") which removed the need to keep the current sidtab
> > locked during policy reload. It does however introduce periodic
> > locking of the target sidtab while converting the hashtable. Sidtab
> > entries are never modified or removed, so the context struct stored
> > in the sid_to_context tree can also be used for the context_to_sid
> > hashtable to reduce memory usage.
> >
> > This bug was reported by:
> > - On the selinux bug tracker.
> >    BUG: kernel softlockup due to too many SIDs/contexts #37
> >    https://github.com/SELinuxProject/selinux-kernel/issues/37
> > - Jovana Knezevic on Android's bugtracker.
> >    Bug: 140252993
> >    "During multi-user performance testing, we create and remove users
> >    many times. selinux_android_restorecon_pkgdir goes from 1ms to over
> >    20ms after about 200 user creations and removals. Accumulated over
> >    ~280 packages, that adds a significant time to user creation,
> >    making perf benchmarks unreliable."
> >
> > * Hashtable lookup is only O(1) when n < the number of buckets.
> >
> > Changes in V2:
> > -The hashtable uses sidtab_entry_leaf objects directly so these
> > objects are shared between the sid_to_context lookup tree and the
> > context_to_sid hashtable. This simplifies memory allocation and
> > was suggested by Ondrej Mosnacek.
> > -The new sidtab hash stats file in selinuxfs has been moved out of
> > the avc dir and into a new "ss" dir.
> >
> > V3:
> > -Add lock nesting notation.
> >
> > V4:
> > -Moved to *_rcu variants of the various hashtable functions
> > as suggested by Ondrej Mosnacek and Will Deacon.
> > -Naming/spelling fixups.
>
> This still triggers a splat for me in running Ondrej's reproducer (also
> ran selinux-testsuite first, not sure if that mattered):

Sorry, I am unable to reproduce this using Ondrej's reproducer. :(

I am sharing the same "flags" variable between the nested locks. That seems
like a bug. Anyway, based on Ondrej's reply I'll go with a different approach
and keep the current sidtab locked.

>
> SELinux:  Converting 3784 SID table entries...
> kernel: ------------[ cut here ]------------
> syscall 1 left IRQs disabled
> WARNING: CPU: 1 PID: 8313 at arch/x86/entry/common.c:261
> syscall_return_slowpathModules linked in: crypto_user
> scsi_transport_iscsi xt_multiport bluetooth ecdh_kernel:  crc32_pclmul
> snd_hda_codec snd_hwdep snd_hda_core ghash_clmulni_intel sCPU: 1 PID:
> 8313 Comm: runcon Not tainted 5.4.0-rc1+ #57
> Hardware name: Dell Inc. OptiPlex 7050/0NW6H5, BIOS 1.8.3 03/23/2018
> RIP: 0010:syscall_return_slowpath+0x1e1/0x4a0
> Code: ff ff e9 60 ff ff ff e8 2d ad 05 00 e9 2a ff ff ff 48 8d 7d 78 e8
> cf cf 50RSP: 0018:ffff8885a6adff28 EFLAGS: 00010082
> RAX: 0000000000000000 RBX: 0000000000000080 RCX: 0000000000000000
> RDX: 0000000000000003 RSI: 0000000000000000 RDI: ffffed10b4d5bfd7
> RBP: ffff8885a6adff58 R08: ffffffffaa219f90 R09: fffffbfff5883615
> R10: fffffbfff5883614 R11: ffffffffac41b0a3 R12: 0000000000000000
> R13: 0000000000000000 R14: 0000000000000000 R15: 0000000000000000
> FS:  00007f3e32c67080(0000) GS:ffff888822a40000(0000) knlGS:0000000000000000
> CS:  0010 DS: 0000 ES: 0000 CR0: 0000000080050033
> CR2: 00007f3e32f145d0 CR3: 00000005a7476005 CR4: 00000000003606e0
> DR0: 0000000000000000 DR1: 0000000000000000 DR2: 0000000000000000
> DR3: 0000000000000000 DR6: 00000000fffe0ff0 DR7: 0000000000000400
> Call Trace:
> kernel:  entry_SYSCALL_64_after_hwframe+0x49/0xbe
> RIP: 0033:0x7f3e32e11467
> Code: 64 89 02 48 c7 c0 ff ff ff ff eb bb 0f 1f 80 00 00 00 00 f3 0f 1e
> fa 64 8bRSP: 002b:00007ffec6482bc8 EFLAGS: 00000246 ORIG_RAX:
> 0000000000000001
> RAX: 000000000000002f RBX: 00007ffec6483d88 RCX: 00007f3e32e11467
> RDX: 000000000000002f RSI: 000055b75e5a5600 RDI: 0000000000000003
> RBP: 000055b75e5a5600 R08: 00000000ffffffff R09: 00007ffec6482a60
> R10: 0000000000000000 R11: 0000000000000246 R12: 0000000000000003
> R13: 00007ffec6483c4c R14: 0000000000000000 R15: 0000000000000000
> irq event stamp: 7292
> hardirqs last  enabled at (7291): [<ffffffffab26d9bb>]
> _raw_write_unlock_irqresthardirqs last disabled at (7292):
> [<ffffffffab26db61>] _raw_spin_lock_irqsave+0xsoftirqs last  enabled at
> (7036): [<ffffffffab600494>] __do_softirq+0x494/0x665
> softirqs last disabled at (7025): [<ffffffffaa168b8f>] irq_exit+0x13f/0x160
> kernel: ---[ end trace 1557497a64e0b001 ]---
> kernel: ------------[ cut here ]------------
>
> Some comments below.  Recommend running it by the RCU wizards e.g. Paul
> McKenney too given the subtleties of getting RCU right.

Sure, I've also cc'ed Will Deacon.
>
> <snip>
>
> > diff --git a/security/selinux/ss/sidtab.c b/security/selinux/ss/sidtab.c
> > index 7d49994e8d5f..c052e620246c 100644
> > --- a/security/selinux/ss/sidtab.c
> > +++ b/security/selinux/ss/sidtab.c
> > @@ -17,26 +17,42 @@
> <snip>
> > +static u32 context_to_sid(struct sidtab *s, struct context *context)
> > +{
> > +     struct sidtab_entry_leaf *entry;
> > +
> > +     rcu_read_lock();
> > +     hash_for_each_possible_rcu(s->context_to_sid, entry, list,
> > +                                context->hash) {
> > +             if (context_cmp(&entry->context, context)) {
> > +                     rcu_read_unlock();
> > +                     return entry->sid;
>
> This looks unsafe to me; I would have assumed you need to at least save
> entry->sid to a temporary before rcu_read_unlock()?

Good point.
>
> > @@ -305,29 +284,36 @@ static int sidtab_reverse_lookup(struct sidtab *s, struct context *context,
> >               rc = -ENOMEM;
> >               dst_convert = sidtab_do_lookup(convert->target, count, 1);
> >               if (!dst_convert) {
> > -                     context_destroy(dst);
> > +                     context_destroy(&dst->context);
> >                       goto out_unlock;
> >               }
> >
> > -             rc = convert->func(context, dst_convert, convert->args);
> > +             rc = convert->func(context, &dst_convert->context,
> > +                             convert->args);
> >               if (rc) {
> > -                     context_destroy(dst);
> > +                     context_destroy(&dst->context);
> >                       goto out_unlock;
> >               }
> > +             dst_convert->sid = index_to_sid(count);
> > +             convert->target->count = count + 1;
> >
> >               /* at this point we know the insert won't fail */
> > -             convert->target->count = count + 1;
> > +             spin_lock_irqsave_nested(&convert->target->lock, flags,
> > +                             SINGLE_DEPTH_NESTING);
> > +             hash_add_rcu(convert->target->context_to_sid,
> > +                             &dst_convert->list, dst_convert->context.hash);
> > +             spin_unlock_irqrestore(&convert->target->lock, flags);
>
> Still having a hard time understanding why we need to lock the target.
> The target here is always the newsidtab allocated by
> security_load_policy(), not yet exposed to any other threads in the system?

It is exposed to other threads, for example in sidtab_context_to_sid(),
which can be triggered by calls from userspace like restorecon, runcon, etc.

That means that hash_add_rcu can be called simultaneously on newsidtab in
sidtab_convert_hashtable() and sidtab_context_to_sid().

The previous implementation avoided this issue because sidtab_convert() and
sidtab_context_to_sid() wrote to non-overlapping parts of the newsidtab tree so
no locking was necessary.

>
> >       }
> >
> >       if (context->len)
> >               pr_info("SELinux:  Context %s is not valid (left unmapped).\n",
> >                       context->str);
> >
> > -     sidtab_rcache_push(s, count);
> > -     *index = count;
> > +     *sid = index_to_sid(count);
> >
> > -     /* write entries before writing new count */
> > +     /* Write entries before updating count. */
> >       smp_store_release(&s->count, count + 1);
> > +     hash_add_rcu(s->context_to_sid, &dst->list, dst->context.hash);
>
> Exactly what guarantee do we think we are getting from placing the
> hash_add_rcu() after the smp_store_release() on s->count, and is that
> guarantee well-founded?

None at all. I could move the hash_add_rcu() to before the
smp_store_release(); the ordering of these two statements is not important.
>
> > @@ -474,7 +463,7 @@ static void sidtab_destroy_tree(union sidtab_entry_inner entry, u32 level)
> >
> >               for (i = 0; i < SIDTAB_LEAF_ENTRIES; i++)
> >                       context_destroy(&node->entries[i].context);
> > -             kfree(node);
> > +             kfree_rcu(node, rcu);
>
> Is it safe to do this without having deleted it from the hash and
> without having taken the spinlock?

Yes and yes.

At this point we've already shifted over to using the new sidtab, so
only this thread has access to the old sidtab.
The objects within the sidtab are entirely shared between the tree and
the hashtable, so we should just free them once. Calling hash_del_rcu()
beforehand would just move pointers around on a list of objects which
will all be deleted anyway.
>
> >       }
> >   }
> >
>
> > diff --git a/security/selinux/ss/sidtab.h b/security/selinux/ss/sidtab.h
> > index 1f4763141aa1..f1f505df97ba 100644
> > --- a/security/selinux/ss/sidtab.h
> > +++ b/security/selinux/ss/sidtab.h
> > @@ -83,11 +88,11 @@ struct sidtab {
> >       struct sidtab_convert_params *convert;
> >       spinlock_t lock;
> >
> > -     /* reverse lookup cache - access atomically via {READ|WRITE}_ONCE() */
> > -     u32 rcache[SIDTAB_RCACHE_SIZE];
> > -
> >       /* index == SID - 1 (no entry for SECSID_NULL) */
> >       struct sidtab_isid_entry isids[SECINITSID_NUM];
> > +
> > +     /* Hash table for fast reverse context-to-sid lookups. */
> > +     DECLARE_HASHTABLE(context_to_sid, SIDTAB_HASH_BITS);
>
> Should this and/or any of the associated RCU-protected structures be
> marked with __rcu for sparse checking?

I'm unsure what you mean.
>
> >   };
> >
> >   int sidtab_init(struct sidtab *s);
>
>


* Re: [PATCH v5] selinux: sidtab: reverse lookup hash table
  2019-11-07 15:42   ` Ondrej Mosnacek
@ 2019-11-07 15:59     ` Jeffrey Vander Stoep
  0 siblings, 0 replies; 11+ messages in thread
From: Jeffrey Vander Stoep @ 2019-11-07 15:59 UTC (permalink / raw)
  To: Ondrej Mosnacek
  Cc: Stephen Smalley, SElinux list, Paul Moore, will, Jovana Knezevic

> > > @@ -305,29 +284,36 @@ static int sidtab_reverse_lookup(struct sidtab *s, struct context *context,
> > >               rc = -ENOMEM;
> > >               dst_convert = sidtab_do_lookup(convert->target, count, 1);
> > >               if (!dst_convert) {
> > > -                     context_destroy(dst);
> > > +                     context_destroy(&dst->context);
> > >                       goto out_unlock;
> > >               }
> > >
> > > -             rc = convert->func(context, dst_convert, convert->args);
> > > +             rc = convert->func(context, &dst_convert->context,
> > > +                             convert->args);
> > >               if (rc) {
> > > -                     context_destroy(dst);
> > > +                     context_destroy(&dst->context);
> > >                       goto out_unlock;
> > >               }
> > > +             dst_convert->sid = index_to_sid(count);
> > > +             convert->target->count = count + 1;
> > >
> > >               /* at this point we know the insert won't fail */
> > > -             convert->target->count = count + 1;
> > > +             spin_lock_irqsave_nested(&convert->target->lock, flags,
> > > +                             SINGLE_DEPTH_NESTING);
> > > +             hash_add_rcu(convert->target->context_to_sid,
> > > +                             &dst_convert->list, dst_convert->context.hash);
> > > +             spin_unlock_irqrestore(&convert->target->lock, flags);
> >
> > Still having a hard time understanding why we need to lock the target.
> > The target here is always the newsidtab allocated by
> > security_load_policy(), not yet exposed to any other threads in the system?
>
> I believe this is to avoid a race with the hash_add() in
> sidtab_convert_hashtable(), which locks only the target table.
> However, I don't like the fact that we need to mess with inconsistent
> lock nesting here... I think it would be better to just lock the
> master's spinlock in sidtab_convert_hashtable() - it will then block
> more when entries would be added to sidtab, but that shouldn't be a
> big concern in practice.
>

I'll try that. Thanks.


* Re: [PATCH v5] selinux: sidtab: reverse lookup hash table
  2019-11-07 10:17 [PATCH v5] selinux: sidtab: reverse lookup hash table Jeff Vander Stoep
  2019-11-07 12:01 ` Jeffrey Vander Stoep
  2019-11-07 14:54 ` Stephen Smalley
@ 2019-11-07 16:12 ` Ondrej Mosnacek
  2019-11-07 16:44   ` Will Deacon
  2 siblings, 1 reply; 11+ messages in thread
From: Ondrej Mosnacek @ 2019-11-07 16:12 UTC (permalink / raw)
  To: Jeff Vander Stoep
  Cc: SElinux list, Paul Moore, Stephen Smalley, will, Jovana Knezevic

On Thu, Nov 7, 2019 at 11:17 AM Jeff Vander Stoep <jeffv@google.com> wrote:
> This replaces the reverse table lookup and reverse cache with a
> hashtable which improves cache-miss reverse-lookup times from
> O(n) to O(1)* and maintains the same performance as a reverse
> cache hit.
>
> This reduces the time needed to add a new sidtab entry from ~500us
> to 5us on a Pixel 3 when there are ~10,000 sidtab entries.
>
> The implementation uses the kernel's generic hashtable API.
> It uses the context's string representation as the hash source,
> and the kernel's generic string hashing algorithm full_name_hash()
> to reduce the string to a 32-bit value.
>
> This change also maintains the improvement introduced in
> commit ee1a84fdfeed ("selinux: overhaul sidtab to fix bug and improve
> performance") which removed the need to keep the current sidtab
> locked during policy reload. It does however introduce periodic
> locking of the target sidtab while converting the hashtable. Sidtab
> entries are never modified or removed, so the context struct stored
> in the sid_to_context tree can also be used for the context_to_sid
> hashtable to reduce memory usage.
>
> This bug was reported by:
> - On the selinux bug tracker.
>   BUG: kernel softlockup due to too many SIDs/contexts #37
>   https://github.com/SELinuxProject/selinux-kernel/issues/37
> - Jovana Knezevic on Android's bugtracker.
>   Bug: 140252993
>   "During multi-user performance testing, we create and remove users
>   many times. selinux_android_restorecon_pkgdir goes from 1ms to over
>   20ms after about 200 user creations and removals. Accumulated over
>   ~280 packages, that adds a significant time to user creation,
>   making perf benchmarks unreliable."
>
> * Hashtable lookup is only O(1) when n < the number of buckets.
>
> Changes in V2:
> -The hashtable uses sidtab_entry_leaf objects directly so these
> objects are shared between the sid_to_context lookup tree and the
> context_to_sid hashtable. This simplifies memory allocation and
> was suggested by Ondrej Mosnacek.
> -The new sidtab hash stats file in selinuxfs has been moved out of
> the avc dir and into a new "ss" dir.
>
> V3:
> -Add lock nesting notation.
>
> V4:
> -Moved to *_rcu variants of the various hashtable functions
> as suggested by Ondrej Mosnacek and Will Deacon.

I may be misunderstanding the purpose of RCU, but isn't all this RCU
stuff a big overkill when these entries are never removed? (Well, they
are removed when the sidtab is being destroyed, at which point however
no other threads are accessing them any more.) In my understanding,
RCU basically makes sure that objects that you dereference in an RCU
critical section are not destroyed until you leave it. Yes, it also
helps you to dereference/update them safely, but that can be achieved
on its own in a simpler way. Instead of using RCU here, I would
instead suggest looking into adding an equivalent of
list_for_each_entry_lockless()* for hlist and maybe introduce some
suitable hlist_add_something() function that ensures consistency
w.r.t. the lockless traversal (perhaps the WRITE_ONCE() in hlist_add()
is already sufficient, but I'm not sure...).
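[A minimal sketch of the add-only lockless list being proposed, using C11 atomics as userspace analogues of the kernel's smp_store_release()/READ_ONCE(); the names are hypothetical, not existing kernel helpers.]

```c
#include <stdatomic.h>
#include <stddef.h>

struct lnode {
	int value;
	struct lnode *next;	/* written before publication, then immutable */
};

static _Atomic(struct lnode *) list_head = NULL;

/* Writer: initialise the node fully, then publish with release ordering. */
static void lockless_add(struct lnode *n, int value)
{
	n->value = value;
	n->next = atomic_load_explicit(&list_head, memory_order_relaxed);
	/* Release store: any reader that sees 'n' also sees its contents. */
	atomic_store_explicit(&list_head, n, memory_order_release);
}

/* Reader: acquire load pairs with the writer's release store. */
static int lockless_contains(int value)
{
	struct lnode *n = atomic_load_explicit(&list_head,
					       memory_order_acquire);
	for (; n; n = n->next)
		if (n->value == value)
			return 1;
	return 0;
}
```

Since entries are only ever added at the head and never removed, the `next` pointers are immutable after publication, so the traversal needs no locking.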

> -Naming/spelling fixups.
>
> Signed-off-by: Jeff Vander Stoep <jeffv@google.com>
> Reported-by: Stephen Smalley <sds@tycho.nsa.gov>
> Reported-by: Jovana Knezevic <jovanak@google.com>
> ---
>  security/selinux/Kconfig            |  12 ++
>  security/selinux/include/security.h |   1 +
>  security/selinux/selinuxfs.c        |  65 +++++++
>  security/selinux/ss/context.h       |  11 +-
>  security/selinux/ss/policydb.c      |   5 +
>  security/selinux/ss/services.c      |  83 +++++++--
>  security/selinux/ss/services.h      |   4 +-
>  security/selinux/ss/sidtab.c        | 264 ++++++++++++++--------------
>  security/selinux/ss/sidtab.h        |  17 +-
>  9 files changed, 302 insertions(+), 160 deletions(-)

<snip>

-- 
Ondrej Mosnacek <omosnace at redhat dot com>
Software Engineer, Security Technologies
Red Hat, Inc.


^ permalink raw reply	[flat|nested] 11+ messages in thread

* Re: [PATCH v5] selinux: sidtab: reverse lookup hash table
  2019-11-07 16:12 ` Ondrej Mosnacek
@ 2019-11-07 16:44   ` Will Deacon
  2019-11-07 18:41     ` Ondrej Mosnacek
  0 siblings, 1 reply; 11+ messages in thread
From: Will Deacon @ 2019-11-07 16:44 UTC (permalink / raw)
  To: Ondrej Mosnacek
  Cc: Jeff Vander Stoep, SElinux list, Paul Moore, Stephen Smalley,
	Jovana Knezevic

On Thu, Nov 07, 2019 at 05:12:53PM +0100, Ondrej Mosnacek wrote:
> On Thu, Nov 7, 2019 at 11:17 AM Jeff Vander Stoep <jeffv@google.com> wrote:
> > <snip>
> 
> I may be misunderstanding the purpose of RCU, but isn't all this RCU
> stuff a big overkill when these entries are never removed? (Well, they
> are removed when the sidtab is being destroyed, at which point however
> no other threads are accessing them any more.) In my understanding,
> RCU basically makes sure that objects that you dereference in an RCU
> critical section are not destroyed until you leave it. Yes, it also
> helps you to dereference/update them safely, but that can be achieved
> on its own in a simpler way. Instead of using RCU here, I would
> instead suggest looking into adding an equivalent of
> list_for_each_entry_lockless()* for hlist and maybe introduce some
> suitable hlist_add_something() function that ensures consistency
> w.r.t. the lockless traversal (perhaps the WRITE_ONCE() in hlist_add()
> is already sufficient, but I'm not sure...).

If you use the existing _rcu accessors you will get correctly enforced
dependency ordering on the reader side and correctly placed release
ordering on the updater side. I don't think that's a big overkill, and
you can use the RCU accessors to achieve the lockless traversal.

hlist_add() is not safe against a concurrent traversal because the
WRITE_ONCE() provides no memory ordering guarantees, allowing the readers
to see an uninitialised node.
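[A compact illustration of the hazard described above, with C11 atomics as userspace analogues: a relaxed store plays the role of WRITE_ONCE(), and a release store plays the role of smp_store_release(). Hypothetical code, not the kernel's hlist implementation.]

```c
#include <stdatomic.h>
#include <stddef.h>

struct node { int data; };

static struct node slot;
static _Atomic(struct node *) published = NULL;

static void publish_unsafe(int data)
{
	slot.data = data;
	/* Relaxed store: the compiler/CPU may reorder it ahead of the
	 * initialisation above, so a concurrent reader can observe the
	 * pointer while slot.data is still uninitialised. */
	atomic_store_explicit(&published, &slot, memory_order_relaxed);
}

static void publish_safe(int data)
{
	slot.data = data;
	/* Release store: everything before it is visible to any reader
	 * whose acquire load observes the new pointer. */
	atomic_store_explicit(&published, &slot, memory_order_release);
}

static int read_published(void)
{
	struct node *n = atomic_load_explicit(&published,
					      memory_order_acquire);
	return n ? n->data : -1;
}
```

Single-threaded both variants behave identically; the difference only matters to a concurrent reader, which is exactly the lockless-traversal case under discussion.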

How exactly would list_for_each_entry_lockless() and hlist_add_something()
differ from the RCU variants, assuming they're implemented correctly?

Will


* Re: [PATCH v5] selinux: sidtab: reverse lookup hash table
  2019-11-07 16:44   ` Will Deacon
@ 2019-11-07 18:41     ` Ondrej Mosnacek
  2019-11-07 20:06       ` Jeffrey Vander Stoep
  0 siblings, 1 reply; 11+ messages in thread
From: Ondrej Mosnacek @ 2019-11-07 18:41 UTC (permalink / raw)
  To: Will Deacon
  Cc: Jeff Vander Stoep, SElinux list, Paul Moore, Stephen Smalley,
	Jovana Knezevic

On Thu, Nov 7, 2019 at 5:44 PM Will Deacon <will@kernel.org> wrote:
> On Thu, Nov 07, 2019 at 05:12:53PM +0100, Ondrej Mosnacek wrote:
> > On Thu, Nov 7, 2019 at 11:17 AM Jeff Vander Stoep <jeffv@google.com> wrote:
> > > <snip>
> >
> > I may be misunderstanding the purpose of RCU, but isn't all this RCU
> > stuff a big overkill when these entries are never removed? (Well, they
> > are removed when the sidtab is being destroyed, at which point however
> > no other threads are accessing them any more.) In my understanding,
> > RCU basically makes sure that objects that you dereference in an RCU
> > critical section are not destroyed until you leave it. Yes, it also
> > helps you to dereference/update them safely, but that can be achieved
> > on its own in a simpler way. Instead of using RCU here, I would
> > instead suggest looking into adding an equivalent of
> > list_for_each_entry_lockless()* for hlist and maybe introduce some
> > suitable hlist_add_something() function that ensures consistency
> > w.r.t. the lockless traversal (perhaps the WRITE_ONCE() in hlist_add()
> > is already sufficient, but I'm not sure...).
>
> If you use the existing _rcu accessors you will get correctly enforced
> dependency ordering on the reader side and correctly placed release
> ordering on the updater side. I don't think that's a big overkill, and
> you can use the RCU accessors to achieve the lockless traversal.

OK, but you don't need the read_lock()/unlock(), rcu_head field in the
entries, and kfree_rcu(), right?

>
> hlist_add() is not safe against a concurrent traversal because the
> WRITE_ONCE() provides no memory ordering guarantees, allowing the readers
> to see an uninitialised node.

Right, so we would need a new function that does smp_store_release() instead.

>
> How exactly would list_for_each_entry_lockless() and hlist_add_something()
> differ from the RCU variants, assuming they're implemented correctly?

Looking at the implementation of rcu_dereference() and
rcu_assign_pointer(), they would probably be practically the same,
except the rcu_read_lock_held() check in rcu_dereference(). That and
they would clearly communicate that they are not doing actual RCU, but
just allow lockless traversal of an add-only hlist.
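[A sketch of the comparison above: a hypothetical "lockless" accessor pair would boil down to the RCU accessors minus the rcu_read_lock_held() lockdep check. rcu_assign_pointer() is effectively smp_store_release(), and rcu_dereference() is a dependency-ordered (acquire-like) load; C11 atomics stand in for the kernel primitives here, and the names are invented for illustration.]

```c
#include <stdatomic.h>
#include <stddef.h>

/* Analogue of rcu_assign_pointer() / the core of a new add helper. */
#define lockless_assign_pointer(p, v) \
	atomic_store_explicit((p), (v), memory_order_release)

/* Analogue of rcu_dereference(), without the lockdep check. */
#define lockless_dereference(p) \
	atomic_load_explicit((p), memory_order_acquire)

struct entry { int sid; };

static _Atomic(struct entry *) head = NULL;

static int lookup_sid(void)
{
	struct entry *e = lockless_dereference(&head);
	return e ? e->sid : -1;
}
```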

--
Ondrej Mosnacek <omosnace at redhat dot com>
Software Engineer, Security Technologies
Red Hat, Inc.



* Re: [PATCH v5] selinux: sidtab: reverse lookup hash table
  2019-11-07 18:41     ` Ondrej Mosnacek
@ 2019-11-07 20:06       ` Jeffrey Vander Stoep
  0 siblings, 0 replies; 11+ messages in thread
From: Jeffrey Vander Stoep @ 2019-11-07 20:06 UTC (permalink / raw)
  To: Ondrej Mosnacek
  Cc: Will Deacon, SElinux list, Paul Moore, Stephen Smalley, Jovana Knezevic

> > If you use the existing _rcu accessors you will get correctly enforced
> > dependency ordering on the reader side and correctly placed release
> > ordering on the updater side. I don't think that's a big overkill, and
> > you can use the RCU accessors to achieve the lockless traversal.
>
> OK, but you don't need the read_lock()/unlock(), rcu_head field in the
> entries, and kfree_rcu(), right?

Hmm... I can get rid of rcu_head/kfree_rcu in any case because we never
remove items during normal use so we don't require the "rcu grace period".


* Re: [PATCH v5] selinux: sidtab: reverse lookup hash table
  2019-11-07 12:01 ` Jeffrey Vander Stoep
@ 2019-11-14 23:35   ` Paul Moore
  0 siblings, 0 replies; 11+ messages in thread
From: Paul Moore @ 2019-11-14 23:35 UTC (permalink / raw)
  To: Jeffrey Vander Stoep
  Cc: SElinux list, Ondrej Mosnacek, Stephen Smalley, will, Jovana Knezevic

On Thu, Nov 7, 2019 at 7:01 AM Jeffrey Vander Stoep <jeffv@google.com> wrote:
> Note this should be v4, not v5.
>
> Paul, please let me know if you'd prefer my change applied on top of Ondrej's
>  "selinux: cache the SID -> context string translation" patch.

Sorry for the delay, travel had me a bit backlogged.  Regardless,
don't worry about rebasing on other in-flight patches; unless it gets
*really* messy I'll take care of that when I merge everything.

-- 
paul moore
www.paul-moore.com



Thread overview: 11+ messages (download: mbox.gz / follow: Atom feed)
2019-11-07 10:17 [PATCH v5] selinux: sidtab: reverse lookup hash table Jeff Vander Stoep
2019-11-07 12:01 ` Jeffrey Vander Stoep
2019-11-14 23:35   ` Paul Moore
2019-11-07 14:54 ` Stephen Smalley
2019-11-07 15:42   ` Ondrej Mosnacek
2019-11-07 15:59     ` Jeffrey Vander Stoep
2019-11-07 15:49   ` Jeffrey Vander Stoep
2019-11-07 16:12 ` Ondrej Mosnacek
2019-11-07 16:44   ` Will Deacon
2019-11-07 18:41     ` Ondrej Mosnacek
2019-11-07 20:06       ` Jeffrey Vander Stoep
