From: Siarhei Liakh <siarhei.liakh@concurrent-rt.com>
To: Paul Moore <paul@paul-moore.com>
Cc: selinux@vger.kernel.org, colin.king@canonical.com,
	Eric Paris <eparis@parisplace.org>,
	gregkh@linuxfoundation.org, jeffv@google.com,
	omosnace@redhat.com,
	Stephen Smalley <stephen.smalley.work@gmail.com>,
	tglx@linutronix.de
Subject: Re: [PATCH 2/2] SELinux: Introduce hash3() as alternative to shift-xor
Date: Wed, 6 May 2020 17:42:13 -0400
Message-ID: <20200506214208.GA3947@concurrent-rt.com>
In-Reply-To: <CAHC9VhRKpWzaPrF3SGwC1J_DGGuSE0Z0tiYrVV_Rdkam0JosGw@mail.gmail.com>

The 05/05/2020 17:18, Paul Moore wrote:
> On Wed, Apr 29, 2020 at 4:29 PM <siarhei.liakh@concurrent-rt.com> wrote:
> > From: Siarhei Liakh <siarhei.liakh@concurrent-rt.com>
> >
> > This change improves the performance of the hashing functions used in AVC
> > and AVTab to hash 3x 32-bit values. As compared to the original shift-xor
> > function used in AVC, a substantial improvement in hashing quality is
> > gained by:
> > 1. replacing shifts with rolls, thus preserving all of the incoming
> >    entropy
> > 2. adjusting shift/roll constants to reduce overlap between the input
> >    values
> > 3. using arithmetic addition instead of XOR, as a way to spread out
> >    collisions
> > 4. using hash_32() to level out the distribution of hash output values
> >    throughout the range
> >
> > Within the scope of this particular application, these changes bring hash
> > quality up to that of the much more complex MurmurHash3 (which is
> > currently used in AVTab), while maintaining the trivial simplicity of the
> > original code.
>
> [ . . . ] 
>
> > /sys/fs/selinux/avc/hash_stats:
> >             Old     New
> > Samples:   100      100
> > Entries:   504.43   503.67
> > Buckets:   219.73   304.61
> > 1st Qu.:     1.00     1.00
> > Median :     1.39     1.00
> > 3rd Qu.:     3.00     2.00
> > Maximum:     9.32     5.37
> >
> > Next, performance of avc_lookup() is analyzed by timing it with ftrace on
> > a system with same configuration as above:
> >
> > avc_lookup(), latency in us:
> >             Old     New
> >  Samples: 261534    244168
> >  Min.   : 0.1320    0.1310
> >  1st Qu.: 0.2410    0.2360
> >  Median : 0.4220    0.4240
> >  3rd Qu.: 0.5580    0.5600
> >  Max.   : 9.9300    9.9890
> >
> > Considering the small size of the AVC in the default configuration, the
> > change does not show any latency improvement. In fact, the median and
> > 75th-percentile timings appear to be in line with the addition of an
> > extra 4 clock cycles for MUL (roughly 2ns on a 2.2GHz CPU), or 0.47%.
> > This appears to be a small price to pay for a substantially better
> > distribution of hash values, reducing the probability and/or severity of
> > potential pathological behavior on some larger configurations.
> >
> > Note that absolute max latency is likely not indicative, as it is
> > susceptible to one-off events such as interrupts, cache misses, etc.
> 
> Thanks for providing more performance information, this is helpful.
> Right now as I look at the results, I'm trying to decide if I care at
> all about the chain length information as ultimately the latency is
> really what we care about, yes?

Correct, but there are different ways to look at it. One way is to look only
at the average / median we get right now, in the current default
configuration. Another way is to think in terms of potential variability and
possible pathological cases down the road, which would be difficult to
diagnose and/or fix for an end-user. In my mind, this is the same type of
decision as the choice of a default sort algorithm: quicksort is faster on
average, yet the Linux kernel relies on heapsort because heapsort does not
have O(n^2) worst-case behavior (as explained in lib/sort.c). I understand
that sort has a guaranteed bound on worst-case performance, while a hash is
probabilistic in nature and the worst-case scenario can still happen even
with a good hash. However, hash3() does demonstrate a much better
distribution than the current hash function, thus reducing the probability
and/or severity of any potential pathological case. Think of this as
proactive preventative maintenance rather than a reactive bug fix.
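
To make this concrete, compare the two side by side. The "old" function
below is (roughly) the current shift-xor from avc.c; AVC_CACHE_BITS is my
placeholder for log2(AVC_CACHE_SLOTS):

/*
 * Current approach: input bits shifted above the slot mask are
 * discarded outright, and XOR drops the carry whenever input bits
 * collide.
 */
static inline int avc_hash_old(u32 ssid, u32 tsid, u16 tclass)
{
	return (ssid ^ (tsid << 2) ^ (tclass << 4)) & (AVC_CACHE_SLOTS - 1);
}

/*
 * Proposed approach: rol32() keeps every input bit in play, '+'
 * carries colliding bits into neighboring positions, and hash_32()
 * spreads the result across the full bucket range.
 */
static inline int avc_hash_new(u32 ssid, u32 tsid, u16 tclass)
{
	return hash3(ssid, tsid, tclass, AVC_CACHE_BITS);
}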

> I'm also trying to reconcile the
> chain length information with the latency information; based on the
> difference in chain lengths I would have expected more of a difference
> in the latency numbers.  As you mention the latency numbers are
> basically the same (give or take two clock cycles) between the two
> hash implementations, assuming we exclude the worst case.  If we
> include the worst case the old implementation is better, at least in
> the data you've provided.
> 
> Is the added complexity of the hash3 implementation, as compared to
> the current avc hash, stealing from the performance improvement
> brought by the improved hash table utilization?  If true, I would
> expect the penalty to be constant regardless of the chain length, but
> it seems like the latency difference between old and new gets worse
> (from a "new" perspective) as the chain length grows.  Can you explain
> that?  Or are we simply playing in the noise of the measurement since
> we are only talking about a clock cycle or two?

TL;DR:
L1/L2/L3 cache latency is 4-6/14/50-70 cycles on Intel Skylake [1].
So, I think it is just noise.

[1] https://software.intel.com/sites/default/files/managed/9e/bc/64-ia-32-architectures-optimization-manual.pdf
Section 2.2.1.3, Table 2-6 on page 2-11.

Long version:

I suspect that what we see is the result of the whole AVC being able to fit
within the CPU cache without much cache line eviction between consecutive
inserts/lookups, combined with really good branch prediction, speculation,
etc.

Here is a fun experiment: I just tested a simple scan of a flat 512-element
array of packed 64-bit values {a:24; b:24; c:16}, and it comes out to roughly
500 cycles (~127ns) per *full* scan on an Intel Core i7-7820HK CPU @ 3.90GHz.
That would translate to about 225ns on the 2.2GHz system benchmarked above,
which is just about what we get for the lowest-latency 25th percentile of the
hash table lookups! Thus, a straight scan would be best if the absolute max
latency of an isolated 512-entry AVC were all we cared about and cache
pressure were not a factor. Sounds pretty crazy, right?
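
For reference, the experiment was along these lines (a minimal userspace
sketch; the struct layout and timing harness here are reconstructions, not
the exact test -- compile with -O2, results are ballpark):

#include <stdint.h>
#include <stdio.h>
#include <time.h>

/* packed {a:24; b:24; c:16} key: 8 bytes each, 512 * 8B = 4KB total */
struct packed_key {
	uint64_t a : 24;
	uint64_t b : 24;
	uint64_t c : 16;
};

#define NKEYS	512
#define PASSES	100000

static struct packed_key keys[NKEYS];
static volatile unsigned int sink;	/* defeat dead-code elimination */

int main(void)
{
	struct timespec t0, t1;
	unsigned int i, p, hits = 0;

	for (i = 0; i < NKEYS; i++) {
		keys[i].a = i;
		keys[i].b = i * 3;
		keys[i].c = i & 0xffff;
	}

	clock_gettime(CLOCK_MONOTONIC, &t0);
	for (p = 0; p < PASSES; p++)
		for (i = 0; i < NKEYS; i++)	/* one *full* scan per pass */
			if (keys[i].a == 123 && keys[i].b == 369)
				hits++;
	clock_gettime(CLOCK_MONOTONIC, &t1);
	sink = hits;

	printf("%ld ns per full scan\n",
	       ((t1.tv_sec - t0.tv_sec) * 1000000000L +
		(t1.tv_nsec - t0.tv_nsec)) / PASSES);
	return 0;
}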

However, such an approach not only fails to scale, it also churns through up
to 4KB (64 cache lines) of memory on each full pass. In contrast, a hash
table with a bucket depth of 5 would issue at most 6 cache line fetches in
the worst case. Thus, even if a flat scan is *faster* in this particular
configuration, that does not necessarily mean it is *better* for the system
as a whole.

What we have is a confluence of the following factors:
1. AVC is a central point of SELinux, used all the time by pretty much
everything running on every core
2. An L3 hit in shared mode is about 50% more expensive than a hit in
non-shared mode [2]
3. Remote DRAM access is about 1.5x - 2x more expensive than local access [2]
4. Single-socket NUMA systems are becoming ubiquitous (e.g. AMD Zen / EPYC)

Just think about this: on a single-socket AMD EPYC system, most of the AVC
memory accesses can end up either shared or remote. To me it only makes sense
to reduce unnecessary cache/DRAM references proactively, even if there is no
obvious immediate payoff right this moment (provided, of course, that it does
not make things worse for the "regular" use case today).

[2] https://software.intel.com/sites/products/collateral/hpc/vtune/performance_analysis_guide.pdf
Table 2, page 22. (yes, this info is about 10 years old)

> > AVTab:
> [ . . . ]
> > Old:
> > rules:  81014 entries and 29600/32768 buckets used, longest chain length
> > 11 non-zero Q1/Med/Q3 2/2/4 sum of chain length^2 290030
> >
> > New:
> > rules:  81014 entries and 29645/32768 buckets used, longest chain length
> > 11 non-zero Q1/Med/Q3 2/2/4 sum of chain length^2 288810
> 
> Oh, one nit against the first patch - you might as well use all
> lower-case letters in the stats (e.g. "q1" not "Q1" and "med" instead
> of "Med") to be consistent with the existing output.

Will do.
 
> > Performance, though, is a different matter: clear savings of 8ns to 10ns
> > (75th and 50th percentiles, respectively) have been measured with ftrace
> > for the most frequent AVTab lookup method:
> >
> > avtab_search_node(), latency in us:
> >           Old       New
> >  Samples: 5953243   5458099
> >  Min.   : 0.136     0.129
> >  1st Qu.: 0.199     0.189
> >  Median : 0.219     0.209
> >  3rd Qu.: 0.294     0.286
> >  Max.   :10.000     9.990
> >
> > The results are not as clear for the much less frequently used
> > avtab_search() (a 1500x difference in call frequency), though they appear
> > to lean towards the positive side:
> >
> > avtab_search(), latency in us:
> >             Old     New
> >  Samples: 3863      3638
> >  Min.   : 0.165     0.157
> >  1st Qu.: 0.297     0.293
> >  Median : 0.510     0.517
> >  3rd Qu.: 0.803     0.774
> >  Max.   : 9.343     7.701
> 
> So with the avtab we see similar chain length numbers between the two
> hash implementations but we manage to save a couple of clock cycles
> across the board from best to worst.  I'm guessing this is almost
> surely due to the simpler hash3 implementation over the current avtab
> implementation, yes?

Yes. The MurmurHash3() version translates into roughly 6x more instructions
than hash3(), so the fixed offset is really measurable. Note that Max for
avtab_search_node() is pretty much the same in both cases, which falls in
line with my earlier assertion that it is likely not indicative of real
performance differences, as it is a function of other, more random and more
expensive events (such as interrupts, cache misses, etc.).
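
For a rough sense of scale, here is hash3() expanded by hand, assuming
hash_32() is the multiplicative hash from include/linux/hash.h (this
expansion is illustrative, not the actual codegen):

static inline u32 hash3_expanded(u32 a, u32 b, u32 c, int bits)
{
	/* 2 rotates + 2 adds to mix the three words... */
	u32 mix = a + rol32(b, 11) + rol32(c, 22);

	/* ...then 1 multiply + 1 shift inside hash_32(),
	 * with GOLDEN_RATIO_32 = 0x61C88647 */
	return (mix * GOLDEN_RATIO_32) >> (32 - bits);
}

That is six cheap ALU ops end to end, versus the several multiply/rotate/xor
rounds per input word that a MurmurHash3-style mix performs.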

> 
> (more comments inline)
> 
> [ . . . ]
> > +/*
> > + * hash3(): Mix and hash 3 x u32's with minimal overhead,
> > + * truncate result to requested number of bits.
> > + *
> > + * This hash function produces very good results for inputs where most of
> > + * input entropy is contained within the lower 11 bits of each of the words.
> > + *
> > + * For example, AVC hash table (in avc.c) is indexed by a 3-tuple (u32 ssid,
> > + * u32 tsid, u16 tclass), where (on Fedora 32 Beta, as of March 2020) ssid and
> > + * tsid appear to be sequential indexes between 0x0 and 0xA00, while tclass
> > + * appears to be confined to the lower 8 bits, resulting in almost perfect
> > + * packing of the indexes into a single 32-bit value.
> > + *
> > + * The function still produces reasonable hash values even when input value
> > + * ranges span beyond 11 bits, as long as the placement of entropy within the
> > + * input values is roughly the same for each of the componets (a, b, c), and
> > + * the address space (a, b, c) is sparsely populated. Such behaviour is the
> > + * result of two conscious choices: (1) use of rol32() to preserve all of the
> > + * incoming entropy (as opposed to simple shifts which discard some input bits)
> > + * and (2) use of arithmetic addition which carries over colliding bits (as
> > + * opposed to binary XOR, which does not carry).
> > + *
> > + * The function will produce horrible collisions if input entropy is distrubuted
> > + * within (a, b, c) such that it ends up within the same bit ranges after
> > + * rotations, and the address space is densly populated. If that is the case,
> > + * then two options are available:
> > + * 1. Try switching around some of the inputs. EX: (a, b, c) => (b, c, a)
> > + * 2. Use a real hash, such as jhash_3words() from linux/jhash.h
> > + */
> 
> I'm not one to throw stones as my spelling is terrible, but I wouldn't
> be doing my job if I didn't ask you to run spellcheck on the comment
> above.

Will do.
 
> > +static inline u32 hash3(u32 a, u32 b, u32 c, int bits)
> > +{
> > +       return hash_32(a + rol32(b, 11) + rol32(c, 22), bits);
> > +}
> > +
> > +#endif /* #ifndef _SELINUX_HASH3_H */
> 
> ...
> 
> > diff --git a/security/selinux/ss/avtab.h b/security/selinux/ss/avtab.h
> > index 5fdcb6696bcc..bf24d8094019 100644
> > --- a/security/selinux/ss/avtab.h
> > +++ b/security/selinux/ss/avtab.h
> > @@ -85,6 +85,7 @@ struct avtab {
> >         u32 nel;        /* number of elements */
> >         u32 nslot;      /* number of hash slots */
> >         u32 mask;       /* mask to compute hash func */
> > +       u32 bits;       /* number of bits in mask */
> >  };
> 
> We can get rid of the "mask" field, right?

Yes, I think we can.
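
Something along these lines in avtab.c (a sketch, with the exact call sites
to be confirmed; nslot == 1 << bits, so the stored mask carries no extra
information):

	/* before: slot index derived from the hash via the stored mask */
	hvalue = avtab_hash(keyp, h->mask);

	/* after: hash3() truncates to 'bits' directly */
	hvalue = hash3(keyp->source_type, keyp->target_type,
		       keyp->target_class, h->bits);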

-- 
Siarhei Liakh
Concurrent Real-Time
