* [PATCH 0/2] SELinux: Improve hashing
@ 2020-04-29 20:29 siarhei.liakh
  2020-04-29 20:29 ` [PATCH 1/2] SELinux: Add median to debug output of hash table stats siarhei.liakh
  2020-04-29 20:29 ` [PATCH 2/2] SELinux: Introduce hash3() as alternative to shift-xor siarhei.liakh
  0 siblings, 2 replies; 8+ messages in thread
From: siarhei.liakh @ 2020-04-29 20:29 UTC (permalink / raw)
  To: selinux
  Cc: colin.king, eparis, gregkh, jeffv, omosnace, paul,
	stephen.smalley.work, tglx

From: Siarhei Liakh <siarhei.liakh@concurrent-rt.com>

This is a follow-up to the original discussion of hashing within SELinux [1].
Based on the initial feedback received from Paul Moore and Ondrej Mosnacek,
I've re-focused and reduced the patch set down to the following two patches:

1. SELinux: Add median to debug output of hash table stats
2. SELinux: Introduce hash3() as alternative to shift-xor

The first one simply adds some extra stats to the debug output generated by
the hash tables. I found it to be useful during the investigation; however,
the patch is not required for the actual hash improvement to work or be visible.

The second patch contains the actual proposed hash function alternative,
which improves bucket distribution in the AVC and latency in the AVTab.
After substantial research and testing, this new function (hash3()) ended
up being just as good at hashing 3x u32s as (or even *very* slightly
better than) jhash_3words() from jhash.h, avtab_hash() (a version of
MurmurHash3), and even a version of xxhash unrolled for just 3x 32-bit
values. Please note that all of the above holds strictly in the context of
SELinux and the way it addresses its Access Vectors by a 3-tuple of values
(Source ID, Target ID, Class ID). I suspect that a significant part of the
surprising performance of hash3() (or, depending on one's point of view,
the rather poor performance of other "good" hashes) can be attributed to
the sequential nature of these IDs within a limited range.

Please keep in mind that the goal of this patch set is not to replace any
particular hash function with something clearly superior in one way or
another, but rather to find an acceptable standard which could be used
throughout SELinux to eliminate the multiple, substantially similar, yet
slightly different local implementations of hash functions within each of
the SELinux components.

See the second patch for the actual discussion of the new hash function and
its performance.

Please CC me directly on all replies.

Thank you!

References:

[1] https://lore.kernel.org/selinux/20200408182416.30995-1-siarhei.liakh@concurrent-rt.com/

Signed-off-by: Siarhei Liakh <siarhei.liakh@concurrent-rt.com>

Siarhei Liakh (2):
  SELinux: Add median to debug output of hash table stats
  SELinux: Introduce hash3() as alternative to shift-xor

 security/selinux/Kconfig          | 10 ++++
 security/selinux/avc.c            | 50 +++++++++++++---
 security/selinux/include/hash3.h  | 44 +++++++++++++++
 security/selinux/include/median.h | 67 ++++++++++++++++++++++
 security/selinux/ss/avtab.c       | 94 ++++++++++++++++---------------
 security/selinux/ss/avtab.h       |  1 +
 security/selinux/ss/hashtab.c     | 28 ++++++++-
 security/selinux/ss/hashtab.h     |  5 ++
 security/selinux/ss/policydb.c    | 14 +++--
 9 files changed, 252 insertions(+), 61 deletions(-)
 create mode 100644 security/selinux/include/hash3.h
 create mode 100644 security/selinux/include/median.h

-- 
2.17.1



* [PATCH 1/2] SELinux: Add median to debug output of hash table stats
  2020-04-29 20:29 [PATCH 0/2] SELinux: Improve hashing siarhei.liakh
@ 2020-04-29 20:29 ` siarhei.liakh
  2020-05-13 21:55   ` Paul Moore
  2020-04-29 20:29 ` [PATCH 2/2] SELinux: Introduce hash3() as alternative to shift-xor siarhei.liakh
  1 sibling, 1 reply; 8+ messages in thread
From: siarhei.liakh @ 2020-04-29 20:29 UTC (permalink / raw)
  To: selinux
  Cc: colin.king, eparis, gregkh, jeffv, omosnace, paul,
	stephen.smalley.work, tglx

From: Siarhei Liakh <siarhei.liakh@concurrent-rt.com>

This change introduces a median() function which is then used to report
the 25th, 50th, and 75th percentiles of the distribution of hash table
bucket chain lengths. This makes it easier to assess and compare the
relative effectiveness of different hash functions. Specifically, it helps
ensure that new functions not only reduce the maximum chain length, but
also improve (or, at least, have no negative impact on) the median.

Sample output before change:

avc:
entries: 508
buckets used: 213/512
longest chain: 10

policydb:
SELinux: roles:  14 entries and 6/16 buckets used, longest chain length 5

Sample output after the change:

avc:
entries: 508
buckets used: 217/512
longest chain: 9
non-zero chain Q1/Med/Q3: 1/2/3

policydb:
SELinux: roles:  14 entries and 6/16 buckets used, longest chain length 5
non-zero Q1/Med/Q3 1/2/4

Signed-off-by: Siarhei Liakh <siarhei.liakh@concurrent-rt.com>
---
Please CC me directly on all replies.

 security/selinux/Kconfig          | 10 +++++
 security/selinux/avc.c            | 42 ++++++++++++++++---
 security/selinux/include/median.h | 67 +++++++++++++++++++++++++++++++
 security/selinux/ss/avtab.c       | 37 ++++++++++++++---
 security/selinux/ss/hashtab.c     | 28 ++++++++++++-
 security/selinux/ss/hashtab.h     |  5 +++
 security/selinux/ss/policydb.c    | 14 ++++---
 7 files changed, 185 insertions(+), 18 deletions(-)
 create mode 100644 security/selinux/include/median.h

diff --git a/security/selinux/Kconfig b/security/selinux/Kconfig
index 9e921fc72538..57c427e019c9 100644
--- a/security/selinux/Kconfig
+++ b/security/selinux/Kconfig
@@ -115,3 +115,13 @@ config SECURITY_SELINUX_SID2STR_CACHE_SIZE
 	  conversion.  Setting this option to 0 disables the cache completely.
 
 	  If unsure, keep the default value.
+
+config SECURITY_SELINUX_DEBUG_HASHES
+	bool "Print additional information about hash tables"
+	depends on SECURITY_SELINUX
+	default n
+	help
+	  This option enables gathering and displaying additional information
+	  about some of the key hash tables within SELinux.
+
+	  If unsure, keep the default value.
diff --git a/security/selinux/avc.c b/security/selinux/avc.c
index d18cb32a242a..c3bbdb083371 100644
--- a/security/selinux/avc.c
+++ b/security/selinux/avc.c
@@ -31,6 +31,10 @@
 #include "avc_ss.h"
 #include "classmap.h"
 
+#ifdef CONFIG_SECURITY_SELINUX_DEBUG_HASHES
+#include "median.h"
+#endif
+
 #define AVC_CACHE_SLOTS			512
 #define AVC_DEF_CACHE_THRESHOLD		512
 #define AVC_CACHE_RECLAIM		16
@@ -149,9 +153,13 @@ void __init avc_init(void)
 
 int avc_get_hash_stats(struct selinux_avc *avc, char *page)
 {
-	int i, chain_len, max_chain_len, slots_used;
+	int i, chain_len, max_chain_len, slots_used, ret;
 	struct avc_node *node;
 	struct hlist_head *head;
+#ifdef CONFIG_SECURITY_SELINUX_DEBUG_HASHES
+	u32 med_ct = 0;
+	u32 *counts = vmalloc(AVC_CACHE_SLOTS * sizeof(u32));
+#endif
 
 	rcu_read_lock();
 
@@ -164,6 +172,10 @@ int avc_get_hash_stats(struct selinux_avc *avc, char *page)
 			chain_len = 0;
 			hlist_for_each_entry_rcu(node, head, list)
 				chain_len++;
+#ifdef CONFIG_SECURITY_SELINUX_DEBUG_HASHES
+			if (counts && chain_len)
+				counts[med_ct++] = chain_len;
+#endif
 			if (chain_len > max_chain_len)
 				max_chain_len = chain_len;
 		}
@@ -171,10 +183,30 @@ int avc_get_hash_stats(struct selinux_avc *avc, char *page)
 
 	rcu_read_unlock();
 
-	return scnprintf(page, PAGE_SIZE, "entries: %d\nbuckets used: %d/%d\n"
-			 "longest chain: %d\n",
-			 atomic_read(&avc->avc_cache.active_nodes),
-			 slots_used, AVC_CACHE_SLOTS, max_chain_len);
+#ifdef CONFIG_SECURITY_SELINUX_DEBUG_HASHES
+	if (counts) {
+		u32 q1 = 0;
+		u32 q3 = 0;
+		u32 med = median(counts, med_ct, &q1, &q3);
+
+		vfree(counts);
+		ret = scnprintf(page, PAGE_SIZE,
+				"entries: %d\nbuckets used: %d/%d\n"
+				 "longest chain: %d\n"
+				 "non-zero chain Q1/Med/Q3: %u/%u/%u\n",
+				atomic_read(&avc->avc_cache.active_nodes),
+				slots_used, AVC_CACHE_SLOTS, max_chain_len,
+				q1, med, q3);
+	} else /* Falling through! */
+#endif
+	{
+		ret = scnprintf(page, PAGE_SIZE,
+				"entries: %d\nbuckets used: %d/%d\n"
+				 "longest chain: %d\n",
+				atomic_read(&avc->avc_cache.active_nodes),
+				slots_used, AVC_CACHE_SLOTS, max_chain_len);
+	}
+	return ret;
 }
 
 /*
diff --git a/security/selinux/include/median.h b/security/selinux/include/median.h
new file mode 100644
index 000000000000..e90565b9d7f6
--- /dev/null
+++ b/security/selinux/include/median.h
@@ -0,0 +1,67 @@
+/* SPDX-License-Identifier: GPL-2.0 */
+/* (C) Siarhei Liakh <Siarhei.Liakh@concurrent-rt.com>, 2020, GPL 2.0+ */
+
+#ifndef _LINUX_MEDIAN_H
+#define _LINUX_MEDIAN_H
+#include <linux/types.h>
+#include <linux/sort.h>
+
+/*
+ * Helper function to compare two u32s
+ */
+static int __cmp_u32(const void *a, const void *b)
+{
+	u32 x = *(const u32 *)(a);
+	u32 y = *(const u32 *)(b);
+
+	if (x < y)
+		return -1;
+
+	if (x > y)
+		return 1;
+
+	return 0;
+}
+
+/*
+ * median(): Find a median of an array containing "count" of u32s,
+ * optionally return 25th (q1) and 75th (q3) percentile.
+ */
+static inline u32 median(u32 val[], size_t count, u32 *q1, u32 *q3)
+{
+	u32 _q1 = 0;
+	u32 _q2 = 0;
+	u32 _q3 = 0;
+
+	if (count > 0) {
+		/*
+		 * Using existing sort() functionality as the easiest way
+		 * to get median with least amount of new code.
+		 *
+		 * This is not the most efficient way to find a median,
+		 * so feel free to implement a better one if performance is
+		 * of a primary concern. "Selection algorithm" Wikipedia
+		 * article is a good start:
+		 * https://en.m.wikipedia.org/wiki/Selection_algorithm
+		 */
+		sort(val, count, sizeof(u32), &__cmp_u32, NULL);
+
+		/*
+		 * Should really do some math if count is even,
+		 * but this is close enough for our purposes.
+		 */
+
+		_q1 = val[count / 4];
+		_q2 = val[count / 2];
+		_q3 = val[(count * 3) / 4];
+	}
+
+	if (q1)
+		*q1 = _q1;
+
+	if (q3)
+		*q3 = _q3;
+
+	return _q2;
+}
+#endif /* #ifndef _LINUX_MEDIAN_H */
diff --git a/security/selinux/ss/avtab.c b/security/selinux/ss/avtab.c
index 01b300a4a882..8baaddb01a95 100644
--- a/security/selinux/ss/avtab.c
+++ b/security/selinux/ss/avtab.c
@@ -23,6 +23,10 @@
 #include "avtab.h"
 #include "policydb.h"
 
+#ifdef CONFIG_SECURITY_SELINUX_DEBUG_HASHES
+#include "median.h"
+#endif
+
 static struct kmem_cache *avtab_node_cachep;
 static struct kmem_cache *avtab_xperms_cachep;
 
@@ -345,6 +349,10 @@ void avtab_hash_eval(struct avtab *h, char *tag)
 	int i, chain_len, slots_used, max_chain_len;
 	unsigned long long chain2_len_sum;
 	struct avtab_node *cur;
+#ifdef CONFIG_SECURITY_SELINUX_DEBUG_HASHES
+	u32 med_ct = 0;
+	u32 *counts = vmalloc(h->nslot * sizeof(u32));
+#endif
 
 	slots_used = 0;
 	max_chain_len = 0;
@@ -358,17 +366,36 @@ void avtab_hash_eval(struct avtab *h, char *tag)
 				chain_len++;
 				cur = cur->next;
 			}
-
+#ifdef CONFIG_SECURITY_SELINUX_DEBUG_HASHES
+			if (counts && chain_len)
+				counts[med_ct++] = chain_len;
+#endif
 			if (chain_len > max_chain_len)
 				max_chain_len = chain_len;
 			chain2_len_sum += chain_len * chain_len;
 		}
 	}
 
-	pr_debug("SELinux: %s:  %d entries and %d/%d buckets used, "
-	       "longest chain length %d sum of chain length^2 %llu\n",
-	       tag, h->nel, slots_used, h->nslot, max_chain_len,
-	       chain2_len_sum);
+#ifdef CONFIG_SECURITY_SELINUX_DEBUG_HASHES
+	if (counts) {
+		u32 q1 = 0;
+		u32 q3 = 0;
+		u32 med = median(counts, med_ct, &q1, &q3);
+
+		vfree(counts);
+		pr_debug("SELinux: %s:  %d entries and %d/%d buckets used, "
+			 "longest chain length %d non-zero Q1/Med/Q3 %u/%u/%u "
+			 "sum of chain length^2 %llu\n",
+			 tag, h->nel, slots_used, h->nslot, max_chain_len,
+			 q1, med, q3, chain2_len_sum);
+	} else /* Falling through! */
+#endif
+	{
+		pr_debug("SELinux: %s:  %d entries and %d/%d buckets used, "
+			 "longest chain length %d sum of chain length^2 %llu\n",
+			 tag, h->nel, slots_used, h->nslot, max_chain_len,
+			 chain2_len_sum);
+	}
 }
 
 static uint16_t spec_order[] = {
diff --git a/security/selinux/ss/hashtab.c b/security/selinux/ss/hashtab.c
index 883f19d32c28..e42d814067ba 100644
--- a/security/selinux/ss/hashtab.c
+++ b/security/selinux/ss/hashtab.c
@@ -8,8 +8,13 @@
 #include <linux/slab.h>
 #include <linux/errno.h>
 #include <linux/sched.h>
+#include <linux/vmalloc.h>
 #include "hashtab.h"
 
+#ifdef CONFIG_SECURITY_SELINUX_DEBUG_HASHES
+#include "median.h"
+#endif
+
 static struct kmem_cache *hashtab_node_cachep;
 
 /*
@@ -168,7 +173,10 @@ void hashtab_stat(struct hashtab *h, struct hashtab_info *info)
 {
 	u32 i, chain_len, slots_used, max_chain_len;
 	struct hashtab_node *cur;
-
+#ifdef CONFIG_SECURITY_SELINUX_DEBUG_HASHES
+	u32 med_ct = 0;
+	u32 *counts = vmalloc(h->size * sizeof(u32));
+#endif
 	slots_used = 0;
 	max_chain_len = 0;
 	for (i = 0; i < h->size; i++) {
@@ -180,7 +188,10 @@ void hashtab_stat(struct hashtab *h, struct hashtab_info *info)
 				chain_len++;
 				cur = cur->next;
 			}
-
+#ifdef CONFIG_SECURITY_SELINUX_DEBUG_HASHES
+			if (counts && chain_len)
+				counts[med_ct++] = chain_len;
+#endif
 			if (chain_len > max_chain_len)
 				max_chain_len = chain_len;
 		}
@@ -188,6 +199,19 @@ void hashtab_stat(struct hashtab *h, struct hashtab_info *info)
 
 	info->slots_used = slots_used;
 	info->max_chain_len = max_chain_len;
+#ifdef CONFIG_SECURITY_SELINUX_DEBUG_HASHES
+	if (counts) {
+		info->med_chain_len = median(counts, med_ct,
+					     &(info->q1_chain_len),
+					     &(info->q3_chain_len));
+
+		vfree(counts);
+	} else {
+		info->q1_chain_len = 0;
+		info->med_chain_len = 0;
+		info->q3_chain_len = 0;
+	}
+#endif
 }
 
 void __init hashtab_cache_init(void)
diff --git a/security/selinux/ss/hashtab.h b/security/selinux/ss/hashtab.h
index dde54d9ff01c..b660c0078dae 100644
--- a/security/selinux/ss/hashtab.h
+++ b/security/selinux/ss/hashtab.h
@@ -32,6 +32,11 @@ struct hashtab {
 struct hashtab_info {
 	u32 slots_used;
 	u32 max_chain_len;
+#ifdef CONFIG_SECURITY_SELINUX_DEBUG_HASHES
+	u32 q1_chain_len;
+	u32 med_chain_len;
+	u32 q3_chain_len;
+#endif
 };
 
 /*
diff --git a/security/selinux/ss/policydb.c b/security/selinux/ss/policydb.c
index c21b922e5ebe..420e347ac77c 100644
--- a/security/selinux/ss/policydb.c
+++ b/security/selinux/ss/policydb.c
@@ -41,9 +41,9 @@
 #include "mls.h"
 #include "services.h"
 
-#define _DEBUG_HASHES
+#ifdef CONFIG_SECURITY_SELINUX_DEBUG_HASHES
+#include "median.h"
 
-#ifdef DEBUG_HASHES
 static const char *symtab_name[SYM_NUM] = {
 	"common prefixes",
 	"classes",
@@ -623,15 +623,17 @@ static int (*index_f[SYM_NUM]) (void *key, void *datum, void *datap) =
 	cat_index,
 };
 
-#ifdef DEBUG_HASHES
+#ifdef CONFIG_SECURITY_SELINUX_DEBUG_HASHES
 static void hash_eval(struct hashtab *h, const char *hash_name)
 {
 	struct hashtab_info info;
 
 	hashtab_stat(h, &info);
-	pr_debug("SELinux: %s:  %d entries and %d/%d buckets used, longest chain length %d\n",
+	pr_debug("SELinux: %s:  %d entries and %d/%d buckets used, "
+		 "longest chain length %d non-zero Q1/Med/Q3 %u/%u/%u\n",
 		 hash_name, h->nel, info.slots_used, h->size,
-		 info.max_chain_len);
+		 info.max_chain_len, info.q1_chain_len,
+		 info.med_chain_len, info.q3_chain_len);
 }
 
 static void symtab_hash_eval(struct symtab *s)
@@ -670,7 +672,7 @@ static int policydb_index(struct policydb *p)
 	pr_debug("SELinux:  %d classes, %d rules\n",
 		 p->p_classes.nprim, p->te_avtab.nel);
 
-#ifdef DEBUG_HASHES
+#ifdef CONFIG_SECURITY_SELINUX_DEBUG_HASHES
 	avtab_hash_eval(&p->te_avtab, "rules");
 	symtab_hash_eval(p->symtab);
 #endif
-- 
2.17.1



* [PATCH 2/2] SELinux: Introduce hash3() as alternative to shift-xor
  2020-04-29 20:29 [PATCH 0/2] SELinux: Improve hashing siarhei.liakh
  2020-04-29 20:29 ` [PATCH 1/2] SELinux: Add median to debug output of hash table stats siarhei.liakh
@ 2020-04-29 20:29 ` siarhei.liakh
  2020-05-05 21:18   ` Paul Moore
  1 sibling, 1 reply; 8+ messages in thread
From: siarhei.liakh @ 2020-04-29 20:29 UTC (permalink / raw)
  To: selinux
  Cc: colin.king, eparis, gregkh, jeffv, omosnace, paul,
	stephen.smalley.work, tglx

From: Siarhei Liakh <siarhei.liakh@concurrent-rt.com>

This change improves the performance of the hashing functions used in the
AVC and AVTab to hash 3x 32-bit values. As compared to the original
shift-xor function used in the AVC, a substantial improvement in hashing
quality is gained by:
1. replacing shifts with rolls, thus preserving all of the incoming
   entropy
2. adjusting the shift/roll constants to reduce overlap between the input
   values
3. using arithmetic addition instead of XOR, as a way to spread out
   collisions
4. using hash_32() to level out the distribution of hash output values
   throughout the range

Within the scope of this particular application, these changes bring the
hash quality to that of the much more complex MurmurHash3 (which is
currently used in the AVTab), while maintaining the trivial simplicity of
the original code.

The only price paid for the improvements is the introduction of a single
32-bit multiplication instruction, upon which hash_32() is built; this is
not a problem on almost any modern platform where fast multiplication is
available. For example, 32-bit MULs have had a latency of 4 cycles on
pretty much all Intel x86 CPUs for at least a decade (while ADD/XOR/SHL/ROL
have a latency of 1). Platforms without fast multiplication provide their
own alternative implementation of hash_32().

The new hash function hash3() has been applied to two key parts of SELinux:
the Access Vector Cache (AVC) and the Access Vector Table (AVTab), with the
results that follow.

AVC:

Below is the content of hash_stats, as observed on Fedora 32 in its
default configuration with the targeted policy, averaged over 100 samples.
The stats show that the new hash function not only cuts the longest chain
almost in half, but also provides a much better distribution throughout
the table: about 39% more buckets are used, with at least half of all
non-zero buckets consistently staying at length 1, and 75% at 2 or below.

/sys/fs/selinux/avc/hash_stats:
            Old     New
Samples:   100      100
Entries:   504.43   503.67
Buckets:   219.73   304.61
1st Qu.:     1.00     1.00
Median :     1.39     1.00
3rd Qu.:     3.00     2.00
Maximum:     9.32     5.37

Next, the performance of avc_lookup() is analyzed by timing it with ftrace
on a system with the same configuration as above:

avc_lookup(), latency in us:
            Old     New
 Samples: 261534    244168
 Min.   : 0.1320    0.1310
 1st Qu.: 0.2410    0.2360
 Median : 0.4220    0.4240
 3rd Qu.: 0.5580    0.5600
 Max.   : 9.9300    9.9890

Considering the small size of the AVC in the default configuration, the
change does not show any latency improvements. In fact, the median and
75th percentile timings appear to be in line with the addition of 4 extra
clock cycles for MUL (roughly 2 ns on a 2.2 GHz CPU), or 0.47%. This
appears to be a small price to pay for a substantially better distribution
of hash values, which reduces the probability and/or severity of potential
pathological behavior in some larger configurations.

Note that the absolute max latency is likely not indicative, as it is
susceptible to one-off events such as interrupts, cache misses, etc.

AVTab:

Unlike the AVC, the AVTab hash table is much larger and much more densely
populated. In the past, the need for better hashing in the AVTab was
demonstrated, resulting in a transition to a local copy of a custom
implementation of the much more complicated (and better) MurmurHash3,
adapted for hashing 3x u32s. This change replaces MurmurHash3 with the
much simpler, faster, yet just as effective (at least in this particular
use case) hash3() function described above.

As evidenced by the debug output produced during the boot process, despite
being much simpler, hash3() yields quite similar (if not a tiny bit
better) hash distribution quality within the AVTab:

Old:
rules:  81014 entries and 29600/32768 buckets used, longest chain length
11 non-zero Q1/Med/Q3 2/2/4 sum of chain length^2 290030

New:
rules:  81014 entries and 29645/32768 buckets used, longest chain length
11 non-zero Q1/Med/Q3 2/2/4 sum of chain length^2 288810

Performance, though, is a different matter: clear savings of 8 ns to 10 ns
(75th and 50th percentiles, respectively) were measured with ftrace for
the most frequent AVTab lookup method:

avtab_search_node(), latency in us:
          Old       New
 Samples: 5953243   5458099
 Min.   : 0.136     0.129
 1st Qu.: 0.199     0.189
 Median : 0.219     0.209
 3rd Qu.: 0.294     0.286
 Max.   :10.000     9.990

The results are not as clear for the much less frequently used
avtab_search() (about 1500x lower call frequency), though they appear to
lean towards the positive side:

avtab_search(), latency in us:
            Old     New
 Samples: 3863      3638
 Min.   : 0.165     0.157
 1st Qu.: 0.297     0.293
 Median : 0.510     0.517
 3rd Qu.: 0.803     0.774
 Max.   : 9.343     7.701

Signed-off-by: Siarhei Liakh <siarhei.liakh@concurrent-rt.com>
---
Please CC me directly on all replies.

 security/selinux/avc.c           |  8 +++--
 security/selinux/include/hash3.h | 44 ++++++++++++++++++++++++
 security/selinux/ss/avtab.c      | 57 ++++++++++----------------------
 security/selinux/ss/avtab.h      |  1 +
 4 files changed, 67 insertions(+), 43 deletions(-)
 create mode 100644 security/selinux/include/hash3.h

diff --git a/security/selinux/avc.c b/security/selinux/avc.c
index c3bbdb083371..ed092324bef1 100644
--- a/security/selinux/avc.c
+++ b/security/selinux/avc.c
@@ -30,12 +30,14 @@
 #include "avc.h"
 #include "avc_ss.h"
 #include "classmap.h"
+#include "hash3.h"
 
 #ifdef CONFIG_SECURITY_SELINUX_DEBUG_HASHES
 #include "median.h"
 #endif
 
-#define AVC_CACHE_SLOTS			512
+#define AVC_CACHE_BITS			(9)
+#define AVC_CACHE_SLOTS			(1 << AVC_CACHE_BITS)
 #define AVC_DEF_CACHE_THRESHOLD		512
 #define AVC_CACHE_RECLAIM		16
 
@@ -125,9 +127,9 @@ static struct kmem_cache *avc_xperms_data_cachep;
 static struct kmem_cache *avc_xperms_decision_cachep;
 static struct kmem_cache *avc_xperms_cachep;
 
-static inline int avc_hash(u32 ssid, u32 tsid, u16 tclass)
+static inline u32 avc_hash(u32 ssid, u32 tsid, u16 tclass)
 {
-	return (ssid ^ (tsid<<2) ^ (tclass<<4)) & (AVC_CACHE_SLOTS - 1);
+	return hash3(ssid, tsid, tclass, AVC_CACHE_BITS);
 }
 
 /**
diff --git a/security/selinux/include/hash3.h b/security/selinux/include/hash3.h
new file mode 100644
index 000000000000..21e2408f5227
--- /dev/null
+++ b/security/selinux/include/hash3.h
@@ -0,0 +1,44 @@
+/* SPDX-License-Identifier: GPL-2.0 */
+/* (C) Siarhei Liakh <Siarhei.Liakh@concurrent-rt.com>, 2020, GPL 2.0+ */
+
+#ifndef _SELINUX_HASH3_H
+#define _SELINUX_HASH3_H
+
+#include <linux/hash.h>
+#include <linux/bitops.h>
+
+/*
+ * hash3(): Mix and hash 3 x u32's with minimal overhead,
+ * truncate result to requested number of bits.
+ *
+ * This hash function produces very good results for inputs where most of
+ * input entropy is contained within the lower 11 bits of each of the words.
+ *
+ * For example, AVC hash table (in avc.c) is indexed by a 3-tuple (u32 ssid,
+ * u32 tsid, u16 tclass), where (on Fedora 32 Beta, as of March 2020) ssid and
+ * tsid appear to be sequential indexes between 0x0 and 0xA00, while tclass
+ * appears to be confined to the lower 8 bits, resulting in almost perfect
+ * packing of the indexes into a single 32-bit value.
+ *
+ * The function still produces reasonable hash values even when input value
+ * ranges span beyond 11 bits, as long as the placement of entropy within the
+ * input values is roughly the same for each of the componets (a, b, c), and
+ * the address space (a, b, c) is sparsely populated. Such behaviour is the
+ * result of two conscious choices: (1) use of rol32() to preserve all of the
+ * incoming entropy (as opposed to simple shifts which discard some input bits)
+ * and (2) use of arithmetic addition which carries over colliding bits (as
+ * opposed to binary XOR, which does not carry).
+ *
+ * The function will produce horrible collisions if input entropy is distributed
+ * within (a, b, c) such that it ends up within the same bit ranges after
+ * rotations, and the address space is densely populated. If that is the case,
+ * then two options are available:
+ * 1. Try switching around some of the inputs. EX: (a, b, c) => (b, c, a)
+ * 2. Use a real hash, such as jhash_3words() from linux/jhash.h
+ */
+static inline u32 hash3(u32 a, u32 b, u32 c, int bits)
+{
+	return hash_32(a + rol32(b, 11) + rol32(c, 22), bits);
+}
+
+#endif /* #ifndef _SELINUX_HASH3_H */
diff --git a/security/selinux/ss/avtab.c b/security/selinux/ss/avtab.c
index 8baaddb01a95..d24f18ab9e4d 100644
--- a/security/selinux/ss/avtab.c
+++ b/security/selinux/ss/avtab.c
@@ -22,6 +22,7 @@
 #include <linux/errno.h>
 #include "avtab.h"
 #include "policydb.h"
+#include "hash3.h"
 
 #ifdef CONFIG_SECURITY_SELINUX_DEBUG_HASHES
 #include "median.h"
@@ -30,43 +31,10 @@
 static struct kmem_cache *avtab_node_cachep;
 static struct kmem_cache *avtab_xperms_cachep;
 
-/* Based on MurmurHash3, written by Austin Appleby and placed in the
- * public domain.
- */
-static inline int avtab_hash(struct avtab_key *keyp, u32 mask)
+static inline u32 avtab_hash(struct avtab_key *keyp, u32 bits)
 {
-	static const u32 c1 = 0xcc9e2d51;
-	static const u32 c2 = 0x1b873593;
-	static const u32 r1 = 15;
-	static const u32 r2 = 13;
-	static const u32 m  = 5;
-	static const u32 n  = 0xe6546b64;
-
-	u32 hash = 0;
-
-#define mix(input) { \
-	u32 v = input; \
-	v *= c1; \
-	v = (v << r1) | (v >> (32 - r1)); \
-	v *= c2; \
-	hash ^= v; \
-	hash = (hash << r2) | (hash >> (32 - r2)); \
-	hash = hash * m + n; \
-}
-
-	mix(keyp->target_class);
-	mix(keyp->target_type);
-	mix(keyp->source_type);
-
-#undef mix
-
-	hash ^= hash >> 16;
-	hash *= 0x85ebca6b;
-	hash ^= hash >> 13;
-	hash *= 0xc2b2ae35;
-	hash ^= hash >> 16;
-
-	return hash & mask;
+	return hash3(keyp->target_class, keyp->target_type,
+		     keyp->source_type, bits);
 }
 
 static struct avtab_node*
@@ -116,7 +84,7 @@ static int avtab_insert(struct avtab *h, struct avtab_key *key, struct avtab_dat
 	if (!h)
 		return -EINVAL;
 
-	hvalue = avtab_hash(key, h->mask);
+	hvalue = avtab_hash(key, h->bits);
 	for (prev = NULL, cur = h->htable[hvalue];
 	     cur;
 	     prev = cur, cur = cur->next) {
@@ -160,7 +128,7 @@ avtab_insert_nonunique(struct avtab *h, struct avtab_key *key, struct avtab_datu
 
 	if (!h)
 		return NULL;
-	hvalue = avtab_hash(key, h->mask);
+	hvalue = avtab_hash(key, h->bits);
 	for (prev = NULL, cur = h->htable[hvalue];
 	     cur;
 	     prev = cur, cur = cur->next) {
@@ -191,7 +159,7 @@ struct avtab_datum *avtab_search(struct avtab *h, struct avtab_key *key)
 	if (!h)
 		return NULL;
 
-	hvalue = avtab_hash(key, h->mask);
+	hvalue = avtab_hash(key, h->bits);
 	for (cur = h->htable[hvalue]; cur;
 	     cur = cur->next) {
 		if (key->source_type == cur->key.source_type &&
@@ -227,7 +195,7 @@ avtab_search_node(struct avtab *h, struct avtab_key *key)
 	if (!h)
 		return NULL;
 
-	hvalue = avtab_hash(key, h->mask);
+	hvalue = avtab_hash(key, h->bits);
 	for (cur = h->htable[hvalue]; cur;
 	     cur = cur->next) {
 		if (key->source_type == cur->key.source_type &&
@@ -301,6 +269,7 @@ void avtab_destroy(struct avtab *h)
 	h->htable = NULL;
 	h->nslot = 0;
 	h->mask = 0;
+	h->bits = 0;
 }
 
 void avtab_init(struct avtab *h)
@@ -313,6 +282,7 @@ void avtab_init(struct avtab *h)
 int avtab_alloc(struct avtab *h, u32 nrules)
 {
 	u32 mask = 0;
+	u32 bits = 0;
 	u32 shift = 0;
 	u32 work = nrules;
 	u32 nslot = 0;
@@ -339,6 +309,13 @@ int avtab_alloc(struct avtab *h, u32 nrules)
 	h->nel = 0;
 	h->nslot = nslot;
 	h->mask = mask;
+
+	while (mask) {
+		mask  = mask >> 1;
+		bits++;
+	}
+
+	h->bits = bits;
 	pr_debug("SELinux: %d avtab hash slots, %d rules.\n",
 	       h->nslot, nrules);
 	return 0;
diff --git a/security/selinux/ss/avtab.h b/security/selinux/ss/avtab.h
index 5fdcb6696bcc..bf24d8094019 100644
--- a/security/selinux/ss/avtab.h
+++ b/security/selinux/ss/avtab.h
@@ -85,6 +85,7 @@ struct avtab {
 	u32 nel;	/* number of elements */
 	u32 nslot;      /* number of hash slots */
 	u32 mask;       /* mask to compute hash func */
+	u32 bits;       /* number of bits in mask */
 };
 
 void avtab_init(struct avtab *h);
-- 
2.17.1



* Re: [PATCH 2/2] SELinux: Introduce hash3() as alternative to shift-xor
  2020-04-29 20:29 ` [PATCH 2/2] SELinux: Introduce hash3() as alternative to shift-xor siarhei.liakh
@ 2020-05-05 21:18   ` Paul Moore
  2020-05-06 21:42     ` Siarhei Liakh
  0 siblings, 1 reply; 8+ messages in thread
From: Paul Moore @ 2020-05-05 21:18 UTC (permalink / raw)
  To: siarhei.liakh
  Cc: selinux, colin.king, Eric Paris, gregkh, jeffv, omosnace,
	Stephen Smalley, tglx

On Wed, Apr 29, 2020 at 4:29 PM <siarhei.liakh@concurrent-rt.com> wrote:
> From: Siarhei Liakh <siarhei.liakh@concurrent-rt.com>
>
> This change improves performance of hashing functions used in AVC and
> AVTab to hash 3x 32-bit values. As compared to original shift-xor function
> used in AVC, substantial improvement in quality of hashing is gained by:
> 1. replacing shifts with rolls, thus preserving all of the incoming
>    entropy
> 2. adjusting shift/roll constants to reduce overlap between the input
>    values
> 3. use of arithmetic addition instead of XOR, as a way to spread out
>    collisions
> 4. use of hash_32() to level out the distribution of hash output values
>    throughout the range
>
> Within the scope of particular application, these changes bring hash
> quality to that of much more complex MurmurHash3 (which is currently used
> in AVTab), while maintaining trivial simplicity of the original code.
>
> The only price paid for the improvements is introduction of a single
> 32-bit multiplication instruction upon which hash_32() is built, which is
> not a problem on almost any modern platform where fast multiplication is
> available. For example, 32-bit MULs have had latency of 4 on pretty much
> all Intel x86 CPUs for at least a decade or so (while ADD/XOR/SHL/ROL have
> latency of 1). Platforms without fast multiplication provide their own
> alternative mechanism for hash_32().
>
> New hash function hash3() had been applied to two key parts of SELinux:
> Access Vector Cache (AVC) and Access Vector Table (AVTab) with results
> that follow.
>
> AVC:
>
> Below is the content of hash_stats, as observed on Fedora 32 in its
> default configuration with targeted policy, averaged over 100 samples. The
> stats show that new hash function not only cuts the longest chain almost
> in half, but also provides a much better distribution throughout the table:
> about 39% more buckets are being used, with at least half of all non-zero
> buckets consistently staying at length 1, with 75% at 2 or below.
>
> /sys/fs/selinux/avc/hash_stats:
>             Old     New
> Samples:   100      100
> Entries:   504.43   503.67
> Buckets:   219.73   304.61
> 1st Qu.:     1.00     1.00
> Median :     1.39     1.00
> 3rd Qu.:     3.00     2.00
> Maximum:     9.32     5.37
>
> Next, performance of avc_lookup() is analyzed by timing it with ftrace on
> a system with same configuration as above:
>
> acv_lookup(), latency in us:
>             Old     New
>  Samples: 261534    244168
>  Min.   : 0.1320    0.1310
>  1st Qu.: 0.2410    0.2360
>  Median : 0.4220    0.4240
>  3rd Qu.: 0.5580    0.5600
>  Max.   : 9.9300    9.9890
>
> Considering the small size of the AVC in the default configuration, the
> change does not show any latency improvement. In fact, the median and 75th
> percentile timings appear to be in line with the addition of 4 extra clock
> cycles for MUL (roughly 2 ns on a 2.2 GHz CPU), or 0.47%. This appears to
> be a small price to pay for a substantially better distribution of hash
> values, reducing the probability and/or severity of potential pathological
> behavior on some larger configurations.
>
> Note that absolute max latency is likely not indicative, as it is
> susceptible to one-off events such as interrupts, cache misses, etc.

Thanks for providing more performance information, this is helpful.
Right now as I look at the results, I'm trying to decide if I care at
all about the chain length information as ultimately the latency is
really what we care about, yes?  I'm also trying to reconcile the
chain length information with the latency information; based on the
difference in chain lengths I would have expected more of a difference
in the latency numbers.  As you mention the latency numbers are
basically the same (give or take two clock cycles) between the two
hash implementations, assuming we exclude the worst case.  If we
include the worst case the old implementation is better, at least in
the data you've provided.

Is the added complexity of the hash3 implementation, as compared to
the current avc hash, stealing from the performance improvement
brought by the improved hash table utilization?  If true, I would
expect the penalty to be constant regardless of the chain length, but
it seems like the latency difference between old and new gets worse
(from a "new" perspective) as the chain length grows.  Can you explain
that?  Or are we simply playing in the noise of the measurement since
we are only talking about a clock cycle or two?

> AVTab:
>
> Unlike the AVC, the AVTab hash table is much larger and much more densely
> populated. In the past, the need for better hashing in AVTab was
> demonstrated, resulting in a transition to a local copy of a custom
> implementation of the much more complicated (and better) MurmurHash3,
> adapted for hashing 3x u32s. This change replaces MurmurHash3 with a much
> simpler, faster, yet just as effective (at least in this particular use
> case) hash3() function described above.
>
> As evidenced by debug output produced during the boot process, despite
> being much simpler, hash3() yields quite similar (if not just a tiny bit
> better) hash distribution quality within AVTab:
>
> Old:
> rules:  81014 entries and 29600/32768 buckets used, longest chain length
> 11 non-zero Q1/Med/Q3 2/2/4 sum of chain length^2 290030
>
> New:
> rules:  81014 entries and 29645/32768 buckets used, longest chain length
> 11 non-zero Q1/Med/Q3 2/2/4 sum of chain length^2 288810

Oh, one nit against the first patch - you might as well use all
lower-case letters in the stats (e.g. "q1" not "Q1" and "med" instead
of "Med") to be consistent with the existing output.

> Performance, though, is a different matter: clear savings of 8 ns to 10 ns
> (75th and 50th percentiles respectively) were measured with ftrace for the
> most frequent AVTab lookup method:
>
> avtab_search_node(), latency in us:
>           Old       New
>  Samples: 5953243   5458099
>  Min.   : 0.136     0.129
>  1st Qu.: 0.199     0.189
>  Median : 0.219     0.209
>  3rd Qu.: 0.294     0.286
>  Max.   :10.000     9.990
>
> The results are not as clear for the much less frequently used
> avtab_search() (about a 1500x difference in call frequency), though they
> appear to lean toward the positive side:
>
> avtab_search(), latency in us:
>             Old     New
>  Samples: 3863      3638
>  Min.   : 0.165     0.157
>  1st Qu.: 0.297     0.293
>  Median : 0.510     0.517
>  3rd Qu.: 0.803     0.774
>  Max.   : 9.343     7.701

So with the avtab we see similar chain length numbers between the two
hash implementations but we manage to save a couple of clock cycles
across the board from best to worst.  I'm guessing this is almost
surely due to the simpler hash3 implementation over the current avtab
implementation, yes?

(more comments inline)

> Signed-off-by: Siarhei Liakh <siarhei.liakh@concurrent-rt.com>
> ---
> Please CC me directly on all replies.
>
>  security/selinux/avc.c           |  8 +++--
>  security/selinux/include/hash3.h | 44 ++++++++++++++++++++++++
>  security/selinux/ss/avtab.c      | 57 ++++++++++----------------------
>  security/selinux/ss/avtab.h      |  1 +
>  4 files changed, 67 insertions(+), 43 deletions(-)
>  create mode 100644 security/selinux/include/hash3.h
>
> diff --git a/security/selinux/avc.c b/security/selinux/avc.c
> index c3bbdb083371..ed092324bef1 100644
> --- a/security/selinux/avc.c
> +++ b/security/selinux/avc.c
> @@ -30,12 +30,14 @@
>  #include "avc.h"
>  #include "avc_ss.h"
>  #include "classmap.h"
> +#include "hash3.h"
>
>  #ifdef CONFIG_SECURITY_SELINUX_DEBUG_HASHES
>  #include "median.h"
>  #endif
>
> -#define AVC_CACHE_SLOTS                        512
> +#define AVC_CACHE_BITS                 (9)
> +#define AVC_CACHE_SLOTS                        (1 << AVC_CACHE_BITS)
>  #define AVC_DEF_CACHE_THRESHOLD                512
>  #define AVC_CACHE_RECLAIM              16
>
> @@ -125,9 +127,9 @@ static struct kmem_cache *avc_xperms_data_cachep;
>  static struct kmem_cache *avc_xperms_decision_cachep;
>  static struct kmem_cache *avc_xperms_cachep;
>
> -static inline int avc_hash(u32 ssid, u32 tsid, u16 tclass)
> +static inline u32 avc_hash(u32 ssid, u32 tsid, u16 tclass)
>  {
> -       return (ssid ^ (tsid<<2) ^ (tclass<<4)) & (AVC_CACHE_SLOTS - 1);
> +       return hash3(ssid, tsid, tclass, AVC_CACHE_BITS);
>  }
>
>  /**
> diff --git a/security/selinux/include/hash3.h b/security/selinux/include/hash3.h
> new file mode 100644
> index 000000000000..21e2408f5227
> --- /dev/null
> +++ b/security/selinux/include/hash3.h
> @@ -0,0 +1,44 @@
> +/* SPDX-License-Identifier: GPL-2.0 */
> +/* (C) Siarhei Liakh <Siarhei.Liakh@concurrent-rt.com>, 2020, GPL 2.0+ */
> +
> +#ifndef _SELINUX_HASH3_H
> +#define _SELINUX_HASH3_H
> +
> +#include <linux/hash.h>
> +#include <linux/bitops.h>
> +
> +/*
> + * hash3(): Mix and hash 3 x u32's with minimal overhead,
> + * truncate result to requested number of bits.
> + *
> + * This hash function produces very good results for inputs where most of
> + * input entropy is contained within the lower 11 bits of each of the words.
> + *
> + * For example, AVC hash table (in avc.c) is indexed by a 3-tuple (u32 ssid,
> + * u32 tsid, u16 tclass), where (on Fedora 32 Beta, as of March 2020) ssid and
> + * tsid appear to be sequential indexes between 0x0 and 0xA00, while tclass
> + * appears to be confined to the lower 8 bits, resulting in almost perfect
> + * packing of the indexes into a single 32-bit value.
> + *
> + * The function still produces reasonable hash values even when input value
> + * ranges span beyond 11 bits, as long as the placement of entropy within the
> + * input values is roughly the same for each of the componets (a, b, c), and
> + * the address space (a, b, c) is sparsely populated. Such behaviour is the
> + * result of two conscious choices: (1) use of rol32() to preserve all of the
> + * incoming entropy (as opposed to simple shifts which discard some input bits)
> + * and (2) use of arithmetic addition which carries over colliding bits (as
> + * opposed to binary XOR, which does not carry).
> + *
> + * The function will produce horrible collisions if input entropy is distrubuted
> + * within (a, b, c) such that it ends up within the same bit ranges after
> + * rotations, and the address space is densly populated. If that is the case,
> + * then two options are available:
> + * 1. Try switching around some of the inputs. EX: (a, b, c) => (b, c, a)
> + * 2. Use a real hash, such as jhash_3words() from linux/jhash.h
> + */

I'm not one to throw stones as my spelling is terrible, but I wouldn't
be doing my job if I didn't ask you to run spellcheck on the comment
above.

> +static inline u32 hash3(u32 a, u32 b, u32 c, int bits)
> +{
> +       return hash_32(a + rol32(b, 11) + rol32(c, 22), bits);
> +}
> +
> +#endif /* #ifndef _SELINUX_HASH3_H */

...

> diff --git a/security/selinux/ss/avtab.h b/security/selinux/ss/avtab.h
> index 5fdcb6696bcc..bf24d8094019 100644
> --- a/security/selinux/ss/avtab.h
> +++ b/security/selinux/ss/avtab.h
> @@ -85,6 +85,7 @@ struct avtab {
>         u32 nel;        /* number of elements */
>         u32 nslot;      /* number of hash slots */
>         u32 mask;       /* mask to compute hash func */
> +       u32 bits;       /* number of bits in mask */
>  };

We can get rid of the "mask" field, right?

-- 
paul moore
www.paul-moore.com

* Re: [PATCH 2/2] SELinux: Introduce hash3() as alternative to shift-xor
  2020-05-05 21:18   ` Paul Moore
@ 2020-05-06 21:42     ` Siarhei Liakh
  0 siblings, 0 replies; 8+ messages in thread
From: Siarhei Liakh @ 2020-05-06 21:42 UTC (permalink / raw)
  To: Paul Moore
  Cc: selinux, colin.king, Eric Paris, gregkh, jeffv, omosnace,
	Stephen Smalley, tglx

The 05/05/2020 17:18, Paul Moore wrote:
> On Wed, Apr 29, 2020 at 4:29 PM <siarhei.liakh@concurrent-rt.com> wrote:
> > From: Siarhei Liakh <siarhei.liakh@concurrent-rt.com>
> >
> > This change improves performance of hashing functions used in AVC and
> > AVTab to hash 3x 32-bit values. As compared to original shift-xor function
> > used in AVC, substantial improvement in quality of hashing is gained by:
> > 1. replacing shifts with rolls, thus preserving all of the incoming
> >    entropy
> > 2. adjusting shift/roll constants to reduce overlap between the input
> >    values
> > 3. use of arithmetic addition instead of XOR, as a way to spread out
> >    collisions
> > 4. use of hash_32() to level out the distribution of hash output values
> >    throughout the range
> >
> > Within the scope of particular application, these changes bring hash
> > quality to that of much more complex MurmurHash3 (which is currently used
> > in AVTab), while maintaining trivial simplicity of the original code.
>
> [ . . . ] 
>
> > /sys/fs/selinux/avc/hash_stats:
> >             Old     New
> > Samples:   100      100
> > Entries:   504.43   503.67
> > Buckets:   219.73   304.61
> > 1st Qu.:     1.00     1.00
> > Median :     1.39     1.00
> > 3rd Qu.:     3.00     2.00
> > Maximum:     9.32     5.37
> >
> > Next, performance of avc_lookup() is analyzed by timing it with ftrace on
> > a system with the same configuration as above:
> >
> > avc_lookup(), latency in us:
> >             Old     New
> >  Samples: 261534    244168
> >  Min.   : 0.1320    0.1310
> >  1st Qu.: 0.2410    0.2360
> >  Median : 0.4220    0.4240
> >  3rd Qu.: 0.5580    0.5600
> >  Max.   : 9.9300    9.9890
> >
> > Considering the small size of the AVC in the default configuration, the
> > change does not show any latency improvement. In fact, the median and 75th
> > percentile timings appear to be in line with the addition of 4 extra clock
> > cycles for MUL (roughly 2 ns on a 2.2 GHz CPU), or 0.47%. This appears to
> > be a small price to pay for a substantially better distribution of hash
> > values, reducing the probability and/or severity of potential pathological
> > behavior on some larger configurations.
> >
> > Note that absolute max latency is likely not indicative, as it is
> > susceptible to one-off events such as interrupts, cache misses, etc.
> 
> Thanks for providing more performance information, this is helpful.
> Right now as I look at the results, I'm trying to decide if I care at
> all about the chain length information as ultimately the latency is
> really what we care about, yes?

Correct, but there are different ways to look at it. One way is to only look
at the average/median we get right now in the current default configuration.
Another way is to think in terms of potential variability and possible
pathological cases down the road, which would be difficult to diagnose and/or
fix for an end-user. In my mind, this is the same type of decision as the
choice of a default sort algorithm: quicksort is faster on average, yet the
Linux kernel relies on heapsort because it does not have O(n^2) worst-case
behavior (as explained in lib/sort.c). I understand that a sort has a
guaranteed bound on worst-case performance, while a hash is probabilistic in
nature and the worst-case scenario can still happen even with a good hash.
However, hash3() does demonstrate a much better distribution compared to the
current hash function, thus reducing the probability and/or severity of any
potential pathological case. Think of this as proactive preventative
maintenance rather than a reactive bug fix.

> I'm also trying to reconcile the
> chain length information with the latency information; based on the
> difference in chain lengths I would have expected more of a difference
> in the latency numbers.  As you mention the latency numbers are
> basically the same (give or take two clock cycles) between the two
> hash implementations, assuming we exclude the worst case.  If we
> include the worst case the old implementation is better, at least in
> the data you've provided.
> 
> Is the added complexity of the hash3 implementation, as compared to
> the current avc hash, stealing from the performance improvement
> brought by the improved hash table utilization?  If true, I would
> expect the penalty to be constant regardless of the chain length, but
> it seems like the latency difference between old and new gets worse
> (from a "new" perspective) as the chain length grows.  Can you explain
> that?  Or are we simply playing in the noise of the measurement since
> we are only talking about a clock cycle or two?

TL;DR:
L1/L2/L3 cache latency is 4-6/14/50-70 cycles on Intel Skylake [1].
So, I think it is just noise.

[1] https://software.intel.com/sites/default/files/managed/9e/bc/64-ia-32-architectures-optimization-manual.pdf
Section 2.2.1.3, Table 2-6 on page 2-11.

Long version:

I suspect that what we see is the result of the whole AVC being able to fit
within the CPU cache without much cache line eviction between consecutive
inserts/lookups, combined with really good branch prediction, speculation,
etc.

Here is a fun experiment: I just tested a simple scan of a flat 512-element
array of packed 64-bit values {a:24; b:24; c:16}, and it comes out to roughly
500 cycles (~127 ns) per *full* scan on an Intel Core i7-7820HK CPU @ 3.90GHz.
That would translate to about 225 ns on the 2.2 GHz system benchmarked above,
which is just about what we get for the 25th percentile lowest latency of the
hash table! Thus, a straight scan would be best if the absolute max latency of
an isolated 512-entry AVC were all we cared about and cache pressure were not
a factor. Sounds pretty crazy, right?

However, such an approach not only does not scale well, but also churns
through up to 4 KB (64 cache lines) of memory on each full pass. In contrast,
a hash table with a bucket depth of 5 would only issue 6 cache line fetches in
the worst case. Thus, even if a flat scan is *faster* in this particular
configuration, it is not necessarily *better* for the system as a whole.

What we have is a confluence of the following factors:
1. The AVC is a central point of SELinux, used all the time by pretty much
everything running on every core.
2. An L3 hit on a cache line in shared mode is about 50% more expensive than
a hit in non-shared mode [2].
3. Remote DRAM access is about 1.5x - 2x more expensive than local [2].
4. Single-socket NUMA systems are becoming ubiquitous (ex: AMD Zen / EPYC).

Just think about this: on a single-socket AMD EPYC system, most AVC memory
accesses will end up either shared or remote. To me it only makes sense to
proactively reduce any unnecessary cache/DRAM references, even if there is no
obvious immediate payoff right this moment (provided, of course, that it does
not make things worse for a "regular" use case today).

[2] https://software.intel.com/sites/products/collateral/hpc/vtune/performance_analysis_guide.pdf
Table 2, page 22. (yes, this info is about 10 years old)

> > AVTab:
> [ . . . ]
> > Old:
> > rules:  81014 entries and 29600/32768 buckets used, longest chain length
> > 11 non-zero Q1/Med/Q3 2/2/4 sum of chain length^2 290030
> >
> > New:
> > rules:  81014 entries and 29645/32768 buckets used, longest chain length
> > 11 non-zero Q1/Med/Q3 2/2/4 sum of chain length^2 288810
> 
> Oh, one nit against the first patch - you might as well use all
> lower-case letters in the stats (e.g. "q1" not "Q1" and "med" instead
> of "Med") to be consistent with the existing output.

Will do.
 
> > Performance, though, is a different matter: clear savings of 8 ns to
> > 10 ns (75th and 50th percentiles respectively) were measured with ftrace
> > for the most frequent AVTab lookup method:
> >
> > avtab_search_node(), latency in us:
> >           Old       New
> >  Samples: 5953243   5458099
> >  Min.   : 0.136     0.129
> >  1st Qu.: 0.199     0.189
> >  Median : 0.219     0.209
> >  3rd Qu.: 0.294     0.286
> >  Max.   :10.000     9.990
> >
> > The results are not as clear for the much less frequently used
> > avtab_search() (about a 1500x difference in call frequency), though they
> > appear to lean toward the positive side:
> >
> > avtab_search(), latency in us:
> >             Old     New
> >  Samples: 3863      3638
> >  Min.   : 0.165     0.157
> >  1st Qu.: 0.297     0.293
> >  Median : 0.510     0.517
> >  3rd Qu.: 0.803     0.774
> >  Max.   : 9.343     7.701
> 
> So with the avtab we see similar chain length numbers between the two
> hash implementations but we manage to save a couple of clock cycles
> across the board from best to worst.  I'm guessing this is almost
> surely due to the simpler hash3 implementation over the current avtab
> implementation, yes?

Yes. The MurmurHash3 version translates into roughly 6x more instructions
than hash3(), so the fixed offset is really measurable. Note that Max for
avtab_search_node() is pretty much the same in both cases, which falls in
line with my earlier assertion that it is likely not indicative of real
performance differences, as it is a function of other more random and more
expensive processes (such as interrupts, cache misses, etc.).

> 
> (more comments inline)
> 
> [ . . . ]
> > +/*
> > + * hash3(): Mix and hash 3 x u32's with minimal overhead,
> > + * truncate result to requested number of bits.
> > + *
> > + * This hash function produces very good results for inputs where most of
> > + * input entropy is contained within the lower 11 bits of each of the words.
> > + *
> > + * For example, AVC hash table (in avc.c) is indexed by a 3-tuple (u32 ssid,
> > + * u32 tsid, u16 tclass), where (on Fedora 32 Beta, as of March 2020) ssid and
> > + * tsid appear to be sequential indexes between 0x0 and 0xA00, while tclass
> > + * appears to be confined to the lower 8 bits, resulting in almost perfect
> > + * packing of the indexes into a single 32-bit value.
> > + *
> > + * The function still produces reasonable hash values even when input value
> > + * ranges span beyond 11 bits, as long as the placement of entropy within the
> > + * input values is roughly the same for each of the componets (a, b, c), and
> > + * the address space (a, b, c) is sparsely populated. Such behaviour is the
> > + * result of two conscious choices: (1) use of rol32() to preserve all of the
> > + * incoming entropy (as opposed to simple shifts which discard some input bits)
> > + * and (2) use of arithmetic addition which carries over colliding bits (as
> > + * opposed to binary XOR, which does not carry).
> > + *
> > + * The function will produce horrible collisions if input entropy is distrubuted
> > + * within (a, b, c) such that it ends up within the same bit ranges after
> > + * rotations, and the address space is densly populated. If that is the case,
> > + * then two options are available:
> > + * 1. Try switching around some of the inputs. EX: (a, b, c) => (b, c, a)
> > + * 2. Use a real hash, such as jhash_3words() from linux/jhash.h
> > + */
> 
> I'm not one to throw stones as my spelling is terrible, but I wouldn't
> be doing my job if I didn't ask you to run spellcheck on the comment
> above.

Will do.
 
> > +static inline u32 hash3(u32 a, u32 b, u32 c, int bits)
> > +{
> > +       return hash_32(a + rol32(b, 11) + rol32(c, 22), bits);
> > +}
> > +
> > +#endif /* #ifndef _SELINUX_HASH3_H */
> 
> ...
> 
> > diff --git a/security/selinux/ss/avtab.h b/security/selinux/ss/avtab.h
> > index 5fdcb6696bcc..bf24d8094019 100644
> > --- a/security/selinux/ss/avtab.h
> > +++ b/security/selinux/ss/avtab.h
> > @@ -85,6 +85,7 @@ struct avtab {
> >         u32 nel;        /* number of elements */
> >         u32 nslot;      /* number of hash slots */
> >         u32 mask;       /* mask to compute hash func */
> > +       u32 bits;       /* number of bits in mask */
> >  };
> 
> We can get rid of the "mask" field, right?

Yes, I think we can.

-- 
Siarhei Liakh
Concurrent Real-Time

* Re: [PATCH 1/2] SELinux: Add median to debug output of hash table stats
  2020-04-29 20:29 ` [PATCH 1/2] SELinux: Add median to debug output of hash table stats siarhei.liakh
@ 2020-05-13 21:55   ` Paul Moore
  2020-06-02 20:42     ` Siarhei Liakh
  0 siblings, 1 reply; 8+ messages in thread
From: Paul Moore @ 2020-05-13 21:55 UTC (permalink / raw)
  To: siarhei.liakh
  Cc: selinux, colin.king, Eric Paris, gregkh, jeffv, omosnace,
	Stephen Smalley, tglx

On Wed, Apr 29, 2020 at 4:29 PM <siarhei.liakh@concurrent-rt.com> wrote:
>
> From: Siarhei Liakh <siarhei.liakh@concurrent-rt.com>
>
> This change introduces a median() function which is then used to report
> 25th, 50th, and 75th percentile metrics for the distributions of hash table
> bucket chain lengths. This makes it possible to better assess and compare
> the relative effectiveness of different hash functions. Specifically, it
> helps ensure that new functions not only reduce the maximum, but also
> improve (or, at least, do not degrade) the median.
>
> Sample output before change:
>
> avc:
> entries: 508
> buckets used: 213/512
> longest chain: 10
>
> policydb:
> SELinux: roles:  14 entries and 6/16 buckets used, longest chain length 5
>
> Sample output after the change:
>
> avc:
> entries: 508
> buckets used: 217/512
> longest chain: 9
> non-zero chain Q1/Med/Q3: 1/2/3
>
> policydb:
> SELinux: roles:  14 entries and 6/16 buckets used, longest chain length 5
> non-zero Q1/Med/Q3 1/2/4
>
> Signed-off-by: Siarhei Liakh <siarhei.liakh@concurrent-rt.com>
> ---
> Please CC me directly on all replies.
>
>  security/selinux/Kconfig          | 10 +++++
>  security/selinux/avc.c            | 42 ++++++++++++++++---
>  security/selinux/include/median.h | 67 +++++++++++++++++++++++++++++++
>  security/selinux/ss/avtab.c       | 37 ++++++++++++++---
>  security/selinux/ss/hashtab.c     | 28 ++++++++++++-
>  security/selinux/ss/hashtab.h     |  5 +++
>  security/selinux/ss/policydb.c    | 14 ++++---
>  7 files changed, 185 insertions(+), 18 deletions(-)
>  create mode 100644 security/selinux/include/median.h
>
> diff --git a/security/selinux/Kconfig b/security/selinux/Kconfig
> index 9e921fc72538..57c427e019c9 100644
> --- a/security/selinux/Kconfig
> +++ b/security/selinux/Kconfig
> @@ -115,3 +115,13 @@ config SECURITY_SELINUX_SID2STR_CACHE_SIZE
>           conversion.  Setting this option to 0 disables the cache completely.
>
>           If unsure, keep the default value.
> +
> +config SECURITY_SELINUX_DEBUG_HASHES
> +       bool "Print additional information about hash tables"
> +       depends on SECURITY_SELINUX
> +       default n
> +       help
> +         This option allows to gather and display additional information about
> +         some of the key hash tables within SELinux.
> +
> +         If unsure, keep the default value.

I forgot to mention this earlier, but I think this is another case
where we don't need to add another Kconfig option.

-- 
paul moore
www.paul-moore.com

* Re: [PATCH 1/2] SELinux: Add median to debug output of hash table stats
  2020-05-13 21:55   ` Paul Moore
@ 2020-06-02 20:42     ` Siarhei Liakh
  2020-06-06 13:05       ` Paul Moore
  0 siblings, 1 reply; 8+ messages in thread
From: Siarhei Liakh @ 2020-06-02 20:42 UTC (permalink / raw)
  To: Paul Moore
  Cc: selinux, colin.king, Eric Paris, gregkh, jeffv, omosnace,
	Stephen Smalley, tglx

The 05/13/2020 17:55, Paul Moore wrote:
> On Wed, Apr 29, 2020 at 4:29 PM <siarhei.liakh@concurrent-rt.com> wrote:
> >
> > From: Siarhei Liakh <siarhei.liakh@concurrent-rt.com>
> >
> > This change introduces a median() function which is then used to report
> > 25th, 50th, and 75th percentile metrics for the distributions of hash table
> > bucket chain lengths. This makes it possible to better assess and compare
> > the relative effectiveness of different hash functions. Specifically, it
> > helps ensure that new functions not only reduce the maximum, but also
> > improve (or, at least, do not degrade) the median.
[ . . . ]
> > diff --git a/security/selinux/Kconfig b/security/selinux/Kconfig
> > index 9e921fc72538..57c427e019c9 100644
> > --- a/security/selinux/Kconfig
> > +++ b/security/selinux/Kconfig
> > @@ -115,3 +115,13 @@ config SECURITY_SELINUX_SID2STR_CACHE_SIZE
> >           conversion.  Setting this option to 0 disables the cache completely.
> >
> >           If unsure, keep the default value.
> > +
> > +config SECURITY_SELINUX_DEBUG_HASHES
> > +       bool "Print additional information about hash tables"
> > +       depends on SECURITY_SELINUX
> > +       default n
> > +       help
> > +         This option allows to gather and display additional information about
> > +         some of the key hash tables within SELinux.
> > +
> > +         If unsure, keep the default value.
> 
> I forgot to mention this earlier, but I think this is another case
> where we don't need to add another Kconfig option.

Right. What is your preferred way to control conditional inclusion of
code spread out across several files?

My issue is that there already are two different symbols which require
coordination to activate this functionality: DEBUG_HASHES defined and used
locally within policydb.c and simple DEBUG which is needed for pr_debug()
statements throughout the code.

Personally, I prefer something global and controlled from a single well-known
place, hence the Kconfig. However, I also see your point about reducing
Kconfig... But if not Kconfig, then what? Should I just create an additional
.h file with all SELinux-specific debug symbols and have it included
everywhere in SELinux?

How would you approach this?

Thank you.
-- 
Siarhei Liakh
Concurrent Real-Time

* Re: [PATCH 1/2] SELinux: Add median to debug output of hash table stats
  2020-06-02 20:42     ` Siarhei Liakh
@ 2020-06-06 13:05       ` Paul Moore
  0 siblings, 0 replies; 8+ messages in thread
From: Paul Moore @ 2020-06-06 13:05 UTC (permalink / raw)
  To: Siarhei Liakh
  Cc: selinux, colin.king, Eric Paris, gregkh, jeffv, Ondrej Mosnacek,
	Stephen Smalley, tglx

On Tue, Jun 2, 2020 at 4:42 PM Siarhei Liakh
<siarhei.liakh@concurrent-rt.com> wrote:
> The 05/13/2020 17:55, Paul Moore wrote:
> > On Wed, Apr 29, 2020 at 4:29 PM <siarhei.liakh@concurrent-rt.com> wrote:
> > >
> > > From: Siarhei Liakh <siarhei.liakh@concurrent-rt.com>
> > >
> > > This change introduces a median() function which is then used to report
> > > 25th, 50th, and 75th percentile metrics within distributions of hash table
> > > bucket chain lengths. This allows to better assess and compare relative
> > > effectiveness of different hash functions. Specifically, it allows to
> > > ensure new functions not only reduce the maximum, but also improve (or, at
> > > least, have no negative impact) on the median.
> [ . . . ]
> > > diff --git a/security/selinux/Kconfig b/security/selinux/Kconfig
> > > index 9e921fc72538..57c427e019c9 100644
> > > --- a/security/selinux/Kconfig
> > > +++ b/security/selinux/Kconfig
> > > @@ -115,3 +115,13 @@ config SECURITY_SELINUX_SID2STR_CACHE_SIZE
> > >           conversion.  Setting this option to 0 disables the cache completely.
> > >
> > >           If unsure, keep the default value.
> > > +
> > > +config SECURITY_SELINUX_DEBUG_HASHES
> > > +       bool "Print additional information about hash tables"
> > > +       depends on SECURITY_SELINUX
> > > +       default n
> > > +       help
> > > +         This option allows to gather and display additional information about
> > > +         some of the key hash tables within SELinux.
> > > +
> > > +         If unsure, keep the default value.
> >
> > I forgot to mention this earlier, but I think this is another case
> > where we don't need to add another Kconfig option.
>
> Right. What is your preferred way to control conditional inclusion of
> code spread out across several files?

Sorry for the delay.

Instead of talking about the mechanics of how to make the code
conditional, I would first like to have a discussion about if the code
should even be conditional, and if it is unconditional, do we need it?
 More on this below.

> My issue is that there already are two different symbols which require
> coordination to activate this functionality: DEBUG_HASHES defined and used
> locally within policydb.c and simple DEBUG which is needed for pr_debug()
> statements throughout the code.

My general thinking is that if the information is useful to a user to
manage and tune their system then we should include the code.  If the
information is only useful to kernel developers to play with different
designs or implementation then we can, and should, leave that code
out.  Developers are likely going to need to add their own
instrumentation anyway for testing, no sense in us cluttering up the
kernel for something that may never be useful to anyone.

> Personally, I prefer something global and controlled from a single well-known
> place, hence the Kconfig. However, I also see your point about reducing
> Kconfig... But if not Kconfig, then what? Should I just create an additional
> .h file with all SELinux-specific debug symbols and have it included
> everywhere in SELinux?
>
> How would you approach this?

I *think* the answers above should help, but if not let me know :)

--
paul moore
www.paul-moore.com
