Re: [PATCH] lib/vsprintf.c: Further simplify uuid_string()

* Re: [PATCH] lib/vsprintf.c: Further simplify uuid_string()
       [not found] <1464594339.27624.45.camel@linux.intel.com>
@ 2016-05-30 17:32 ` George Spelvin
  2016-05-31 20:31   ` [PATCH] lib/uuid.c: eliminate uuid_[bl]e_index arrays George Spelvin
  0 siblings, 1 reply; 10+ messages in thread
From: George Spelvin @ 2016-05-30 17:32 UTC (permalink / raw)
  To: andriy.shevchenko; +Cc: akpm, bbjorn, linux, linux-kernel, tytso

On Mon, 30 May 2016 at 10:45:39 Andy Shevchenko wrote:
> You have to send this to public mailing list,

Okay.  lkml and commenters on recent patches added to Cc:.

To those just joining, I had a couple of local patches I had made to
lib/vsprintf.c:uuid_string() which had been sitting around ignored due
to not knowing who to submit them to.

Then I noticed (merge conflicts while rebasing my local patches) that
Andy was working on the area and sent them to him.

[Aside: Ted, I agree that whoever decided to make UUIDs non-universal
by introducing an ambiguous parse is richly deserving of scorn.]


> but looking briefly to the code it looks like trade bad for worse, on
> other words I don't see much beneficial from it. You are decreasing
> code size by increasing data size. Speed is not a bottleneck here
> anyway, since *printf() is slow by definition. Table should be shared
> via uuid.h header and code if you wish it look generic.

Wow, I didn't expect that!  I thought the improvement was obvious in
pretty much every possible way, which is why I didn't include much
explanation.  I'll explain my thinking in more detail here; could you
tell me where you disagree?


I agree with you that such "support" code shouldn't try to micro-optimize
its own performance at the expense of static resources (code size, large
tables), because CPU resources like L1 cache, TLB, and branch predictor
slots should be reserved for the hot code.

It's not that we don't care about speed, but given its infrequent use
we care even more about slowing down the *rest* of the kernel.

That's exactly why I headlined the space savings rather than the
performance gains.

As an example of a less useful "optimization", it's possible to unroll
the loop twice and print 16 bits per iteration.  You could even shrink
the tables to 8 bytes each, since since odd entries can be derived from
even entries by complementing the low-order bit:

 	for (i = 0; i < 8; i++) {
		p = hex_byte_pack(p, addr[index[i]]);
		p = hex_byte_pack(p, addr[index[i]^1]);
		if ((unsigned)i - 1 < 4)
			*p++ = '-';
	}



There are two optimzations here:

1) Replacing the "uc" boolean with a pointer.

diff --git a/lib/vsprintf.c b/lib/vsprintf.c
index 7332a5d7..231cf6d3 100644
--- a/lib/vsprintf.c
+++ b/lib/vsprintf.c
@@ -1316,24 +1316,24 @@ char *uuid_string(char *buf, char *end, const u8 *addr,
 	char *p = uuid;
 	int i;
 	const u8 *index = uuid_be_index;
-	bool uc = false;
+	const char *hex = hex_asc;
 
 	switch (*(++fmt)) {
 	case 'L':
-		uc = true;		/* fall-through */
+		hex = hex_asc_upper;	/* fall-through */
 	case 'l':
 		index = uuid_le_index;
 		break;
 	case 'B':
-		uc = true;
+		hex = hex_asc_upper;
 		break;
 	}
 
 	for (i = 0; i < 16; i++) {
-		if (uc)
-			p = hex_byte_pack_upper(p, addr[index[i]]);
-		else
-			p = hex_byte_pack(p, addr[index[i]]);
+		u8 byte = addr[index[i]];
+
+		*p++ = hex[byte >> 4];
+		*p++ = hex[byte & 0x0f];
 		switch (i) {
 		case 3:
 		case 5:

This seems like an obvious win.  It's performing exactly the same data
memory accesses as the conditional form (the tables it uses are the
tables which the hex_byte_pack() macros use internally), while reducing
the code size both statically and dynamically.

Dynamically, it eliminates 16 conditional branches.  The only extra
reesource consumed is the "hex" pointer variable which is larger than the
"uc" boolean, but as it's one register either way, that's nothing.

So it's saving L1I space and a branch predictor entry as well as being
faster, all without making the code any harder to understand.

I really don't see any downside at all.


2) Inverting the order of formatting.

This is a more complex change.  Rather than format the UUID in output
order, and use a table to get the permuted input byte order, I invert
the table to list the output position for each input byte.

Thus, the input is read in order and the output is written out of order.

diff --git a/lib/vsprintf.c b/lib/vsprintf.c
index 231cf6d3..aa6135cd 100644
--- a/lib/vsprintf.c
+++ b/lib/vsprintf.c
@@ -1313,38 +1313,35 @@ char *uuid_string(char *buf, char *end, const u8 *addr,
 		  struct printf_spec spec, const char *fmt)
 {
 	char uuid[UUID_STRING_LEN + 1];
-	char *p = uuid;
 	int i;
-	const u8 *index = uuid_be_index;
+	/* Offset in uuid[] where each byte is printed, two cases. */
+	static const u8 be[16] = {0,2,4,6,9,11,14,16,19,21,24,26,28,30,32,34};
+	static const u8 le[16] = {6,4,2,0,11,9,16,14,19,21,24,26,28,30,32,34};
+	const u8 *index = be;
 	const char *hex = hex_asc;
 
 	switch (*(++fmt)) {
 	case 'L':
 		hex = hex_asc_upper;	/* fall-through */
 	case 'l':
-		index = uuid_le_index;
+		index = le;
 		break;
 	case 'B':
 		hex = hex_asc_upper;
 		break;
 	}
 
+	/* Format each byte of the raw uuid into the buffer */
 	for (i = 0; i < 16; i++) {
-		u8 byte = addr[index[i]];
+		u8 byte = addr[i];
+		char *p = uuid + index[i];
 
-		*p++ = hex[byte >> 4];
-		*p++ = hex[byte & 0x0f];
-		switch (i) {
-		case 3:
-		case 5:
-		case 7:
-		case 9:
-			*p++ = '-';
-			break;
-		}
+		p[0] = hex[byte >> 4];
+		p[1] = hex[byte & 0x0f];
 	}
-
-	*p = 0;
+	/* Insert the fixed punctuation */
+	uuid[23] = uuid[18] = uuid[13] = uuid[8] = '-';
+	uuid[UUID_STRING_LEN] = '\0';
 
 	return string(buf, end, uuid, spec);
 }

When I first wrote it, it didn't increase table size at all, since it
was just swapping one set of tables for another.

Now that you're sharing the tables, it's possible to *reduce* table size.
My be[] table is identical to your si[] table in __uuid_to_bin(),
and using my le[] table there would let you eliminate uuid_le_index[]
and uuid_be_index[] completely, for a net 16 byte saving.

More significantly, it decreases both static and dynamic code size by
trading the switch() for the four explicit stores of the '-' punctuation.

Most of the work done by the switch() is folded into the table
showing which output positions to reserve for punctuation.

Again, the change is not only a speed boost, but speeds up the rest of
the kernel by reducing cache and branch predictor use.


I agree it's harder to understand, precisely because the table does two
things at once, but not hugely so; the comments resolve any questions the
reader might have.  In general, subtlety is a problem when it affects  a
large area of code (RCU being the poster child); this is a self-contained
micro-optimization which doesn't affect the understandability or
maintainability of the kernel as a whole, and someone who does care about
this particular fucntion doesn't have to read very much to figure it out.

^ permalink raw reply related	[flat|nested] 10+ messages in thread