linux-kernel.vger.kernel.org archive mirror
 help / color / mirror / Atom feed
* Re: [kernel-hardening] Re: HalfSipHash Acceptable Usage
@ 2016-12-21 22:29 Jason A. Donenfeld
  2016-12-22  3:55 ` George Spelvin
  0 siblings, 1 reply; 6+ messages in thread
From: Jason A. Donenfeld @ 2016-12-21 22:29 UTC (permalink / raw)
  To: kernel-hardening, Theodore Ts'o, George Spelvin,
	Eric Dumazet, Jason, Andi Kleen, David Miller, David Laight,
	Daniel J . Bernstein, Eric Biggers, Hannes Frederic Sowa,
	Jean-Philippe Aumasson, Linux Crypto Mailing List, LKML,
	Andy Lutomirski, Netdev, Tom Herbert, Linus Torvalds,
	Vegard Nossum

On Wed, Dec 21, 2016 at 11:27 PM, Theodore Ts'o <tytso@mit.edu> wrote:
> And "with enough registers" includes ARM and MIPS, right?  So the only
> real problem is 32-bit x86, and you're right, at that point, only
> people who might care are people who are using a space-radiation
> hardened 386 --- and they're not likely to be doing high throughput
> TCP connections.  :-)

Plus the benchmark was bogus anyway, and when I built a more specific
harness -- actually comparing the TCP sequence number functions --
SipHash was faster than MD5, even on register starved x86. So I think
we're fine and this chapter of the discussion can come to a close, in
order to move on to more interesting things.

^ permalink raw reply	[flat|nested] 6+ messages in thread

* Re: [kernel-hardening] Re: HalfSipHash Acceptable Usage
  2016-12-21 22:29 [kernel-hardening] Re: HalfSipHash Acceptable Usage Jason A. Donenfeld
@ 2016-12-22  3:55 ` George Spelvin
  2016-12-22  4:40   ` Jason A. Donenfeld
  0 siblings, 1 reply; 6+ messages in thread
From: George Spelvin @ 2016-12-22  3:55 UTC (permalink / raw)
  To: ak, davem, David.Laight, djb, ebiggers3, eric.dumazet, hannes,
	Jason, jeanphilippe.aumasson, kernel-hardening, linux-crypto,
	linux-kernel, linux, luto, netdev, tom, torvalds, tytso,
	vegard.nossum

> Plus the benchmark was bogus anyway, and when I built a more specific
> harness -- actually comparing the TCP sequence number functions --
> SipHash was faster than MD5, even on register starved x86. So I think
> we're fine and this chapter of the discussion can come to a close, in
> order to move on to more interesting things.

Do we have to go through this?  No, the benchmark was *not* bogus.

Here's myresults from *your* benchmark.  I can't reboot some of my test
machines, so I took net/core/secure_seq.c, lib/siphash.c, lib/md5.c and
include/linux/siphash.h straight out of your test tree.

Then I replaced the kernel #includes with the necessary typedefs
and #defines to make it compile in user-space.  (Voluminous but
straightforward.)  E.g.

#define __aligned(x) __attribute__((__aligned__(x)))
#define ____cacheline_aligned __aligned(64)
#define CONFIG_INET 1
#define IS_ENABLED(x) 1
#define ktime_get_real_ns() 0
#define sysctl_tcp_timestamps 0

... etc.

Then I modified your benchmark code into the appended code.  The
differences are:
* I didn't iterate 100K times, I timed the functions *once*.
* I saved the times in a buffer and printed them all at the end
  so printf() wouldn't pollute the caches.
* Before every even-numbered iteration, I flushed the I-cache
  of everything from _init to _fini (i.e. all the non-library code).
  This cold-cache case is what is going to happen in the kernel.

In the results below, note that I did *not* re-flush between phases
of the test.  The effects of cacheing is clearly apparent in the tcpv4
results, where the tcpv6 code loaded the cache.

You can also see that the SipHash code benefits more from cacheing when
entered with a cold cache, as it iterates over the input words, while
the MD5 code is one big unrolled blob.

Order of computation is down the columns first, across second.

The P4 results were:
tcpv6 md5 cold:		4084	3488	3584	3584	3568
tcpv4 md5 cold:		1052	 996	 996	1060	 996
tcpv6 siphash cold:	4080	3296	3312	3296	3312
tcpv4 siphash cold:	2968	2748	2972	2716	2716
tcpv6 md5 hot:		 900	 712	 712	712	 712
tcpv4 md5 hot:		 632	 672	 672	672	 672
tcpv6 siphash hot:	2484	2292	2340	2340	2340
tcpv4 siphash hot:	1660	1560	1564	2340	1564

SipHash actually wins slightly in the cold-cache case, because
it iterates more.  In the hot-cache case, it loses horribly.

Core 2 duo:
tcpv6 md5 cold:		3396	2868	2964	3012	2832
tcpv4 md5 cold:		1368	1044	1320	1332	1308
tcpv6 siphash cold:	2940	2952	2916	2448	2604
tcpv4 siphash cold:	3192	2988	3576	3504	3624
tcpv6 md5 hot:		1116	1032	 996	1008	1008
tcpv4 md5 hot:		 936	 936	 936	 936	 936
tcpv6 siphash hot:	1200	1236	1236	1188	1188
tcpv4 siphash hot:	 936	 804	 804	 804	 804

Pretty much a tie, honestly.

Ivy Bridge:
tcpv6 md5 cold:		6086	6136	6962	6358	6060
tcpv4 md5 cold:		 816	 732	1046	1054	1012
tcpv6 siphash cold:	3756	1886	2152	2390	2566
tcpv4 siphash cold:	3264	2108	3026	3120	3526
tcpv6 md5 hot:		1062	 808	 824	 824	 832
tcpv4 md5 hot:		 730	 730	 740	 748	 748
tcpv6 siphash hot:	 960	 952	 936	1112	 926
tcpv4 siphash hot:	 638	 544	 562	 552	 560

Modern processors *hate* cold caches.  But notice how md5 is *faster*
than SipHash on hot-cache IPv6.

Ivy Bridge, -m64:
tcpv6 md5 cold:		4680	3672	3956	3616	3525
tcpv4 md5 cold:		1066	1416	1179	1179	1134
tcpv6 siphash cold:	 940	1258	1995	1609	2255
tcpv4 siphash cold:	1440	1269	1292	1870	1621
tcpv6 md5 hot:		1372	1111	1122	1088	1088
tcpv4 md5 hot:		 997	 997	 997	 997	 998
tcpv6 siphash hot:	 340	 340	 340	 352	 340
tcpv4 siphash hot:	 227	 238	 238	 238	 238

Of course, when you compile -m64, SipHash is unbeatable.


Here's the modified benchmark() code.  The entire package is
a bit voluminous for the mailing list, but anyone is welcome to it.

static void clflush(void)
{
	extern char const _init, _fini;
	char const *p = &_init;

	while (p < &_fini) {
		asm("clflush %0" : : "m" (*p));
		p += 64;
	}
}

typedef uint32_t cycles_t;
static cycles_t get_cycles(void)
{
	uint32_t eax, edx;
	asm volatile("rdtsc" : "=a" (eax), "=d" (edx));
	return eax;
}

static int benchmark(void)
{
	cycles_t start, finish;
	int i;
	u32 seq_number = 0;
	__be32 saddr6[4] = { 1, 4, 182, 393 }, daddr6[4] = { 9192, 18288, 2222222, 0xffffff10 };
	__be32 saddr4 = 28888, daddr4 = 182112;
	__be16 sport = 22, dport = 41992;
	u32 tsoff;
	cycles_t result[4];

	printf("seq num benchmark\n");

	for (i = 0; i < 10; i++) {
		if ((i & 1) == 0)
			clflush();

		start = get_cycles();
		seq_number += secure_tcpv6_sequence_number_md5(saddr6, daddr6, sport, dport, &tsoff);
		finish = get_cycles();
		result[0] = finish - start;

		start = get_cycles();
		seq_number += secure_tcp_sequence_number_md5(saddr4, daddr4, sport, dport, &tsoff);
		finish = get_cycles();
		result[1] = finish - start;

		start = get_cycles();
		seq_number += secure_tcpv6_sequence_number(saddr6, daddr6, sport, dport, &tsoff);
		finish = get_cycles();
		result[2] = finish - start;

		start = get_cycles();
		seq_number += secure_tcp_sequence_number(saddr4, daddr4, sport, dport, &tsoff);
		finish = get_cycles();
		result[3] = finish - start;

		printf("* Iteration %d results:\n", i);
		printf("secure_tcpv6_sequence_number_md5# cycles: %u\n", result[0]);
		printf("secure_tcp_sequence_number_md5# cycles: %u\n", result[1]);
		printf("secure_tcpv6_sequence_number_siphash# cycles: %u\n", result[2]);
		printf("secure_tcp_sequence_number_siphash# cycles: %u\n", result[3]);
		printf("benchmark result: %u\n", seq_number);
	}

	printf("benchmark result: %u\n", seq_number);
	return 0;
}
//device_initcall(benchmark);

int
main(void)
{
	memset(net_secret, 0xff, sizeof net_secret);
	memset(net_secret_md5, 0xff, sizeof net_secret_md5);
	return benchmark();
}

^ permalink raw reply	[flat|nested] 6+ messages in thread

* Re: [kernel-hardening] Re: HalfSipHash Acceptable Usage
  2016-12-22  3:55 ` George Spelvin
@ 2016-12-22  4:40   ` Jason A. Donenfeld
  0 siblings, 0 replies; 6+ messages in thread
From: Jason A. Donenfeld @ 2016-12-22  4:40 UTC (permalink / raw)
  To: George Spelvin
  Cc: Andi Kleen, David Miller, David Laight, Daniel J . Bernstein,
	Eric Biggers, Eric Dumazet, Hannes Frederic Sowa,
	Jean-Philippe Aumasson, kernel-hardening,
	Linux Crypto Mailing List, LKML, Andy Lutomirski, Netdev,
	Tom Herbert, Linus Torvalds, Theodore Ts'o, Vegard Nossum

Hi George,

On Thu, Dec 22, 2016 at 4:55 AM, George Spelvin
<linux@sciencehorizons.net> wrote:
> Do we have to go through this?  No, the benchmark was *not* bogus.
> Then I replaced the kernel #includes with the necessary typedefs
> and #defines to make it compile in user-space.
> * I didn't iterate 100K times, I timed the functions *once*.
> * I saved the times in a buffer and printed them all at the end
>   so printf() wouldn't pollute the caches.
> * Before every even-numbered iteration, I flushed the I-cache
>   of everything from _init to _fini (i.e. all the non-library code).
>   This cold-cache case is what is going to happen in the kernel.

Wow! Great. Thanks for the pointers on the right way to do this. Very
helpful, and enlightening results indeed. Think you could send me the
whole .c of what you finally came up with? I'd like to run this on
some other architectures; I've got a few odd boxes laying around here.

> The P4 results were:
> SipHash actually wins slightly in the cold-cache case, because
> it iterates more.  In the hot-cache case, it loses
> Core 2 duo:
> Pretty much a tie, honestly.
> Ivy Bridge:
> Modern processors *hate* cold caches.  But notice how md5 is *faster*
> than SipHash on hot-cache IPv6.
> Ivy Bridge, -m64:
> Of course, when you compile -m64, SipHash is unbeatable.

Okay, so I think these results are consistent with some of the
assessments from before -- that SipHash is really just fine as a
replacement for MD5. Not great on older 32-bit x86, but not too
horrible, and the performance improvements on every other architecture
and the security improvements everywhere are a net good.

> Here's the modified benchmark() code.  The entire package is
> a bit voluminous for the mailing list, but anyone is welcome to it.

Please do send! I'm sure I'll learn from reading it. Thanks again for
doing the hardwork of putting something proper together.

Thanks,
Jason

^ permalink raw reply	[flat|nested] 6+ messages in thread

* Re: [kernel-hardening] Re: HalfSipHash Acceptable Usage
  2016-12-21 16:39       ` [kernel-hardening] " Rik van Riel
@ 2016-12-21 17:08         ` Eric Dumazet
  0 siblings, 0 replies; 6+ messages in thread
From: Eric Dumazet @ 2016-12-21 17:08 UTC (permalink / raw)
  To: Rik van Riel
  Cc: kernel-hardening, Jason A. Donenfeld, George Spelvin,
	Theodore Ts'o, Andi Kleen, David Miller, David Laight,
	Daniel J . Bernstein, Eric Biggers, Hannes Frederic Sowa,
	Jean-Philippe Aumasson, Linux Crypto Mailing List, LKML,
	Andy Lutomirski, Netdev, Tom Herbert, Linus Torvalds,
	Vegard Nossum

On Wed, 2016-12-21 at 11:39 -0500, Rik van Riel wrote:

> Does anybody still have a P4?
> 
> If they do, they're probably better off replacing
> it with an Atom. The reduced power bills will pay
> for replacing that P4 within a year or two.

Well, maybe they have millions of units to replace.

> 
> In short, I am not sure how important the P4
> performance numbers are, especially if we can
> improve security for everybody else...

Worth adding that the ISN or syncookie generation are less than 10% of
the actual cost of handling a problematic (having to generate ISN or
syncookie) TCP packet anyway.

So we are talking of minors potential impact for '2000-era' cpus.

Definitely I vote for using SipHash in TCP ASAP.

^ permalink raw reply	[flat|nested] 6+ messages in thread

* Re: [kernel-hardening] Re: HalfSipHash Acceptable Usage
  2016-12-21 15:55 George Spelvin
@ 2016-12-21 16:41 ` Rik van Riel
  0 siblings, 0 replies; 6+ messages in thread
From: Rik van Riel @ 2016-12-21 16:41 UTC (permalink / raw)
  To: kernel-hardening, Jason, linux
  Cc: ak, davem, David.Laight, djb, ebiggers3, eric.dumazet, hannes,
	jeanphilippe.aumasson, linux-crypto, linux-kernel, luto, netdev,
	tom, torvalds, tytso, vegard.nossum

[-- Attachment #1: Type: text/plain, Size: 710 bytes --]

On Wed, 2016-12-21 at 10:55 -0500, George Spelvin wrote:
> Actually, DJB just made a very relevant suggestion.
> 
> As I've mentioned, the 32-bit performance problems are an x86-
> specific
> problem.  ARM does very well, and other processors aren't bad at all.
> 
> SipHash fits very nicely (and runs very fast) in the MMX registers.
> 
> They're 64 bits, and there are 8 of them, so the integer registers
> can
> be reserved for pointers and loop counters and all that.  And there's
> reference code available.
> 
> How much does kernel_fpu_begin()/kernel_fpu_end() cost?

Those can be very expensive. Almost certainly not
worth it for small amounts of data.

-- 
All Rights Reversed.

[-- Attachment #2: This is a digitally signed message part --]
[-- Type: application/pgp-signature, Size: 473 bytes --]

^ permalink raw reply	[flat|nested] 6+ messages in thread

* Re: [kernel-hardening] Re: HalfSipHash Acceptable Usage
  2016-12-21 15:56     ` Eric Dumazet
@ 2016-12-21 16:39       ` Rik van Riel
  2016-12-21 17:08         ` Eric Dumazet
  0 siblings, 1 reply; 6+ messages in thread
From: Rik van Riel @ 2016-12-21 16:39 UTC (permalink / raw)
  To: kernel-hardening, Jason A. Donenfeld
  Cc: George Spelvin, Theodore Ts'o, Andi Kleen, David Miller,
	David Laight, Daniel J . Bernstein, Eric Biggers,
	Hannes Frederic Sowa, Jean-Philippe Aumasson,
	Linux Crypto Mailing List, LKML, Andy Lutomirski, Netdev,
	Tom Herbert, Linus Torvalds, Vegard Nossum

[-- Attachment #1: Type: text/plain, Size: 1167 bytes --]

On Wed, 2016-12-21 at 07:56 -0800, Eric Dumazet wrote:
> On Wed, 2016-12-21 at 15:42 +0100, Jason A. Donenfeld wrote:
> George said :
> 
> > Cycles per byte on 1024 bytes of data:
> >                       Pentium Core 2  Ivy
> >                       4       Duo     Bridge
> > SipHash-2-4           38.9     8.3     5.8
> > HalfSipHash-2-4       12.7     4.5     3.2
> > MD5                    8.3     5.7     4.7
> 
> 
> That really was for 1024 bytes blocks, so pretty much useless for our
> discussion ?
> 
> Reading your numbers last week, I thought SipHash was faster, but
> George
> numbers are giving the opposite impression.
> 
> I do not have a P4 to make tests, so I only can trust you or George.

Does anybody still have a P4?

If they do, they're probably better off replacing
it with an Atom. The reduced power bills will pay
for replacing that P4 within a year or two.

In short, I am not sure how important the P4
performance numbers are, especially if we can
improve security for everybody else...

-- 
All Rights Reversed.

[-- Attachment #2: This is a digitally signed message part --]
[-- Type: application/pgp-signature, Size: 473 bytes --]

^ permalink raw reply	[flat|nested] 6+ messages in thread

end of thread, other threads:[~2016-12-22  4:40 UTC | newest]

Thread overview: 6+ messages (download: mbox.gz / follow: Atom feed)
-- links below jump to the message on this page --
2016-12-21 22:29 [kernel-hardening] Re: HalfSipHash Acceptable Usage Jason A. Donenfeld
2016-12-22  3:55 ` George Spelvin
2016-12-22  4:40   ` Jason A. Donenfeld
  -- strict thread matches above, loose matches on Subject: below --
2016-12-21 15:55 George Spelvin
2016-12-21 16:41 ` [kernel-hardening] " Rik van Riel
2016-12-21  3:28 George Spelvin
2016-12-21  5:29 ` Eric Dumazet
2016-12-21 14:42   ` Jason A. Donenfeld
2016-12-21 15:56     ` Eric Dumazet
2016-12-21 16:39       ` [kernel-hardening] " Rik van Riel
2016-12-21 17:08         ` Eric Dumazet

This is a public inbox, see mirroring instructions
for how to clone and mirror all data and code used for this inbox;
as well as URLs for NNTP newsgroup(s).