From: David Howells <dhowells@redhat.com>
To: Al Viro <viro@ZenIV.linux.org.uk>
Cc: dhowells@redhat.com, linux-afs@lists.infradead.org,
linux-kernel@vger.kernel.org
Subject: Re: [PATCH 02/24] iov_iter: Renumber the ITER_* constants in uio.h
Date: Tue, 23 Oct 2018 14:20:39 +0100 [thread overview]
Message-ID: <31390.1540300839@warthog.procyon.org.uk> (raw)
In-Reply-To: <20181020045933.GI32577@ZenIV.linux.org.uk>
Al Viro <viro@ZenIV.linux.org.uk> wrote:
> On Sat, Oct 20, 2018 at 02:10:52AM +0100, David Howells wrote:
> > Renumber the ITER_* constants in uio.h to be contiguous to make comparing
> > them more efficient in a switch-statement.
>
> Are you sure that they *are* more efficient that way? Some of those paths
> are fairly hot, so much that I would really like to see profiles before
> and after these changes (both this and the previous one).
I updated my test program to try and overcome the effects of branch prediction
by going forwards through the set of iterator types and then back again inside
the inner loop rather than putting it on the outside.
I also made it run all the tests itself, do the stats calculations itself on
the output and only output at the end.
Firstly, with 32MiB bufsize to cause the dcache to be fully cycled (values are
times in nanoseconds, lower is better):
TEST TOTAL LOWEST HIGHEST MEAN STDDEV
0-AND 86690784677 10743312 11185591 10836348 36292
1-EQ 86718137278 10743467 11189117 10839767 34984
2-SWITC 86716408086 10724792 11278450 10839551 35031
3-ASM 86755866820 10742027 11214087 10844483 33932
4-ASM 86587355807 10728937 11291730 10823419 34628
5-SW-BT 86532876718 10725349 11303746 10816609 31168
And, secondly, with 1MiB bufsize to keep the data fully within the dcache:
TEST TOTAL LOWEST HIGHEST MEAN STDDEV
0-AND 665992395 81216 136894 83249 2598
1-EQ 665442229 81146 102292 83180 1767
2-SWITC 665263828 81133 103332 83157 1819
3-ASM 667402344 81525 119804 83425 2251
4-ASM 664493028 81178 104506 83061 2230
5-SW-BT 665792835 81378 115603 83224 2145
Each test was run 8000 times (samples parameter). The differences appear to
be around .1% and are quite variable, presumably because of interrupts.
Interestingly, if I comment out all the memcpy calls and set iter.type to 5, I
get something that looks like:
TEST TOTAL LOWEST HIGHEST MEAN STDDEV
0-AND 5091749489 569535 925088 636468 13459
1-EQ 4565513562 569496 620423 570689 3138
2-SWITC 4126907116 498480 650886 515863 10394
3-ASM 9126139444 1137597 1177709 1140767 5006
4-ASM 7411945298 924526 1123532 926493 4301
5-SW-BT 5135087754 640499 679302 641885 3217
This is a pretty consistent picture over all runs. The numbers are larger
than for the 1MiB buffer size because it's doing 32x as many chunks, even
those chunks aren't actually being copied. Reducing bufsize to 1MiB:
TEST TOTAL LOWEST HIGHEST MEAN STDDEV
0-AND 176123187 19299 54659 22015 2837
1-EQ 153711816 19123 36712 19213 810
2-SWITC 140564024 16917 32145 17570 947
3-ASM 296678820 36948 54856 37084 1113
4-ASM 242791274 30210 49183 30348 1188
5-SW-BT 171511145 21351 35625 21438 765
3-ASM sticks out like a sore thumb, presumably because of the adjacent jumps.
Letting gcc optimise a switch-statement with sequentially numbered cases seems
to be optimal.
David
---
#include <stdio.h>
#include <stdlib.h>
#include <string.h>
#include <time.h>
#include <limits.h>
#include <math.h>
#include <sys/mman.h>
/* Adjustable parameters */
#define bufsize (32ULL * 1024 * 1024)
#define chunksize (1 * 128)
#define samples (8000)
#define noinline __attribute__((noinline))
#define unlikely(x) __builtin_expect(!!(x), 0)
struct iov_iter {
int type;
size_t iov_offset;
size_t count;
char *p;
};
noinline void iterate_iovec(const void *addr, size_t bytes, struct iov_iter *i)
{
memcpy(i->p + i->iov_offset, addr, bytes);
}
noinline void iterate_kvec(const void *addr, size_t bytes, struct iov_iter *i)
{
memcpy(i->p + i->iov_offset, addr, bytes);
}
noinline void iterate_bvec(const void *addr, size_t bytes, struct iov_iter *i)
{
memcpy(i->p + i->iov_offset, addr, bytes);
}
noinline void iterate_pipe(const void *addr, size_t bytes, struct iov_iter *i)
{
memcpy(i->p + i->iov_offset, addr, bytes);
}
noinline void iterate_discard(const void *addr, size_t bytes, struct iov_iter *i)
{
memcpy(i->p + i->iov_offset, addr, bytes);
}
noinline void iterate_mapping(const void *addr, size_t bytes, struct iov_iter *i)
{
memcpy(i->p + i->iov_offset, addr, bytes);
}
enum b_iter_type {
B_ITER_IOVEC = 0,
B_ITER_KVEC = 2,
B_ITER_BVEC = 4,
B_ITER_PIPE = 8,
B_ITER_DISCARD = 16,
B_ITER_MAPPING = 32,
};
enum s_iter_type {
S_ITER_IOVEC = 0,
S_ITER_KVEC = 1,
S_ITER_BVEC = 2,
S_ITER_PIPE = 3,
S_ITER_DISCARD = 4,
S_ITER_MAPPING = 5,
};
/*
* 0-AND: Use bitwise-AND if-chain tests
*/
noinline void TEST0_copy_to_iter(const void *addr, size_t bytes, struct iov_iter *i)
{
if ( unlikely(i->type & B_ITER_BVEC)) iterate_bvec(addr, bytes, i);
else if (unlikely(i->type & B_ITER_KVEC)) iterate_kvec(addr, bytes, i);
else if (unlikely(i->type & B_ITER_PIPE)) iterate_pipe(addr, bytes, i);
else if (unlikely(i->type & B_ITER_DISCARD)) iterate_discard(addr, bytes, i);
else if (unlikely(i->type & B_ITER_MAPPING)) iterate_mapping(addr, bytes, i);
else iterate_iovec(addr, bytes, i);
i->iov_offset += bytes;
}
/*
* 1-EQ: Use integer-equals if-chain tests on sequential integers
*/
noinline void TEST1_copy_to_iter(const void *addr, size_t bytes, struct iov_iter *i)
{
if ( unlikely(i->type == S_ITER_IOVEC)) iterate_iovec(addr, bytes, i);
else if (unlikely(i->type == S_ITER_KVEC)) iterate_kvec(addr, bytes, i);
else if (unlikely(i->type == S_ITER_BVEC)) iterate_bvec(addr, bytes, i);
else if (unlikely(i->type == S_ITER_PIPE)) iterate_pipe(addr, bytes, i);
else if (unlikely(i->type == S_ITER_DISCARD)) iterate_discard(addr, bytes, i);
else if (unlikely(i->type == S_ITER_MAPPING)) iterate_mapping(addr, bytes, i);
else {
printf("unknown iter %d\n", i->type);
exit(2);
}
i->iov_offset += bytes;
}
/*
* 2-SWITC: Use switch with sequential integers
*/
noinline void TEST2_copy_to_iter(const void *addr, size_t bytes, struct iov_iter *i)
{
switch (i->type) {
case S_ITER_IOVEC: iterate_iovec(addr, bytes, i); break;
case S_ITER_BVEC: iterate_bvec(addr, bytes, i); break;
case S_ITER_KVEC: iterate_kvec(addr, bytes, i); break;
case S_ITER_PIPE: iterate_pipe(addr, bytes, i); break;
case S_ITER_DISCARD: iterate_discard(addr, bytes, i); break;
case S_ITER_MAPPING: iterate_mapping(addr, bytes, i); break;
default:
printf("unknown iter %d\n", i->type);
exit(2);
}
i->iov_offset += bytes;
}
/*
* 3-ASM: Like 1-EQ, but hand-optimised to remove half the CMP instructions.
*/
extern void TEST3_copy_to_iter(const void *addr, size_t bytes, struct iov_iter *i);
asm(
"TEST3_copy_to_iter: \n"
" push %rbp \n"
" mov %rsi,%rbp \n"
" push %rbx \n"
" mov %rdx,%rbx \n"
" sub $0x8,%rsp \n"
" mov (%rdx),%esi \n"
" cmp $0x1,%esi \n"
" jc 72f \n"
" je 96f \n"
" cmp $0x3,%esi \n"
" jc 112f \n"
" je 128f \n"
" cmp $0x5,%esi \n"
" jc 144f \n"
" je 160f \n"
" xor %eax,%eax \n"
" jmp 80f \n"
"72: \n"
" mov %rbp,%rsi \n"
" callq iterate_iovec \n"
"80: \n"
" add %rbp,0x8(%rbx) \n"
" add $0x8,%rsp \n"
" pop %rbx \n"
" pop %rbp \n"
" retq \n"
" nopl 0x0(%rax,%rax,1) \n"
"96: \n"
" mov %rbp,%rsi \n"
" callq iterate_kvec \n"
" jmp 80b \n"
"112: \n"
" mov %rbp,%rsi \n"
" callq iterate_bvec \n"
" jmp 80b \n"
"128: \n"
" mov %rbp,%rsi \n"
" callq iterate_pipe \n"
" jmp 80b \n"
"144: \n"
" mov %rbp,%rsi \n"
" callq iterate_discard \n"
" jmp 80b \n"
"160: \n"
" mov %rbp,%rsi \n"
" callq iterate_mapping \n"
" jmp 80b \n"
);
/*
* 4-ASM: Like 3-ASM, but further optimised to move some of the jumps out to
* branches.
*/
extern void TEST4_copy_to_iter(const void *addr, size_t bytes, struct iov_iter *i);
asm(
"TEST4_copy_to_iter: \n"
" push %rbp \n"
" mov %rsi,%rbp \n"
" push %rbx \n"
" mov %rdx,%rbx \n"
" sub $0x8,%rsp \n"
" mov (%rdx),%esi \n"
" cmp $0x1,%esi \n"
" jle 72f \n"
" cmp $0x3,%esi \n"
" jle 112f \n"
" cmp $0x5,%esi \n"
" jle 144f \n"
" xor %eax,%eax \n"
" jmp 80f \n"
"72: \n"
" je 96f \n"
" mov %rbp,%rsi \n"
" callq iterate_iovec \n"
"80: \n"
" add %rbp,0x8(%rbx) \n"
" add $0x8,%rsp \n"
" pop %rbx \n"
" pop %rbp \n"
" retq \n"
" nopl 0x0(%rax,%rax,1) \n"
"96: \n"
" mov %rbp,%rsi \n"
" callq iterate_kvec \n"
" jmp 80b \n"
"112: \n"
" je 128f \n"
" mov %rbp,%rsi \n"
" callq iterate_bvec \n"
" jmp 80b \n"
"128: \n"
" mov %rbp,%rsi \n"
" callq iterate_pipe \n"
" jmp 80b \n"
"144: \n"
" je 160f \n"
" mov %rbp,%rsi \n"
" callq iterate_discard \n"
" jmp 80b \n"
"160: \n"
" mov %rbp,%rsi \n"
" callq iterate_mapping \n"
" jmp 80b \n"
);
/*
* 5-SW-BIT: Like 1-SWITC, but using the bitwise-optimised constants used
* by 0-AND.
*/
noinline void TEST5_copy_to_iter(const void *addr, size_t bytes, struct iov_iter *i)
{
switch (i->type) {
case B_ITER_IOVEC: iterate_iovec(addr, bytes, i); break;
case B_ITER_BVEC: iterate_bvec(addr, bytes, i); break;
case B_ITER_KVEC: iterate_kvec(addr, bytes, i); break;
case B_ITER_PIPE: iterate_pipe(addr, bytes, i); break;
case B_ITER_DISCARD: iterate_discard(addr, bytes, i); break;
case B_ITER_MAPPING: iterate_mapping(addr, bytes, i); break;
default:
printf("unknown iter %d\n", i->type);
exit(2);
}
i->iov_offset += bytes;
}
#define nr_tests 6
const char *const test_names[nr_tests] = {
"0-AND",
"1-EQ",
"2-SWITC",
"3-ASM",
"4-ASM",
"5-SW-BT",
};
const int b_types[] = {
B_ITER_IOVEC,
B_ITER_KVEC,
B_ITER_BVEC,
B_ITER_PIPE,
B_ITER_DISCARD,
B_ITER_MAPPING,
-1
};
const int s_types[] = {
S_ITER_IOVEC,
S_ITER_KVEC,
S_ITER_BVEC,
S_ITER_PIPE,
S_ITER_DISCARD,
S_ITER_MAPPING,
-1
};
unsigned long long sample_buf[nr_tests][samples];
static __always_inline void test(unsigned int test, void *a, void *b)
{
unsigned long long tmp;
struct timespec tv_1, tv_2;
int i = 0, j, k;
for (j = 0; j < samples; j++) {
struct iov_iter iter = {
.p = a,
.iov_offset = 0,
};
clock_gettime(CLOCK_THREAD_CPUTIME_ID, &tv_1);
for (k = 0; k < bufsize / chunksize; k++) {
const int *types;
void *p;
int x;
switch (test) {
case 0: types = b_types; break;
case 1: types = s_types; break;
case 2: types = s_types; break;
case 3: types = s_types; break;
case 4: types = s_types; break;
case 5: types = b_types; break;
default: types = NULL; break;
}
/* For 6 tests, do 0 1 2 3 4 5 5 4 3 2 1 0 */
x = i++ % (nr_tests * 2);
if (x >= nr_tests)
x = nr_tests * 2 - 1 - x;
iter.type = types[x];
p = b + k * chunksize;
switch (test) {
case 0: TEST0_copy_to_iter(p, chunksize, &iter); break;
case 1: TEST1_copy_to_iter(p, chunksize, &iter); break;
case 2: TEST2_copy_to_iter(p, chunksize, &iter); break;
case 3: TEST3_copy_to_iter(p, chunksize, &iter); break;
case 4: TEST4_copy_to_iter(p, chunksize, &iter); break;
case 5: TEST5_copy_to_iter(p, chunksize, &iter); break;
}
}
clock_gettime(CLOCK_THREAD_CPUTIME_ID, &tv_2);
tmp = (tv_2.tv_sec - tv_1.tv_sec) * 1000ULL * 1000 * 1000;
tmp += tv_2.tv_nsec - tv_1.tv_nsec;
sample_buf[test][j] = tmp;
}
}
static noinline void test_0(void *a, void *b) { test(0, a, b); }
static noinline void test_1(void *a, void *b) { test(1, a, b); }
static noinline void test_2(void *a, void *b) { test(2, a, b); }
static noinline void test_3(void *a, void *b) { test(3, a, b); }
static noinline void test_4(void *a, void *b) { test(4, a, b); }
static noinline void test_5(void *a, void *b) { test(5, a, b); }
int main()
{
void *a, *b;
int i, j;
a = mmap(NULL, bufsize, PROT_READ | PROT_WRITE, MAP_PRIVATE | MAP_ANON, -1, 0);
if (a == MAP_FAILED) {
perror("mmap");
exit(1);
}
b = mmap(NULL, bufsize, PROT_READ | PROT_WRITE, MAP_PRIVATE | MAP_ANON, -1, 0);
if (a == MAP_FAILED) {
perror("mmap");
exit(1);
}
memset(a, 1, bufsize);
memset(b, 1, bufsize);
test_0(a, b);
test_1(a, b);
test_2(a, b);
test_3(a, b);
test_4(a, b);
test_5(a, b);
for (i = 0; i < nr_tests; i++) {
for (j = 0; j < samples; j++) {
printf("sample,%s,%llu\n", test_names[i], sample_buf[i][j]);
}
}
printf("\n");
printf("TEST TOTAL LOWEST HIGHEST MEAN STDDEV\n");
for (i = 0; i < nr_tests; i++) {
long long lowest = ULLONG_MAX, highest = 0;;
long long total = 0, mean;
double stddev;
for (j = 0; j < samples; j++) {
total += sample_buf[i][j];
if (sample_buf[i][j] < lowest)
lowest = sample_buf[i][j];
if (sample_buf[i][j] > highest)
highest = sample_buf[i][j];
}
mean = total / samples;
stddev = 0;
for (j = 0; j < samples; j++) {
long long x = sample_buf[i][j];
x -= mean;
stddev += (double)x * (double)x;
}
stddev /= (samples - 1);
stddev = sqrt(stddev);
printf("%7s %12llu %12llu %12llu %12llu %12llu\n",
test_names[i], total, lowest, highest, mean,
(unsigned long long)stddev);
}
return 0;
}
next prev parent reply other threads:[~2018-10-23 13:20 UTC|newest]
Thread overview: 32+ messages / expand[flat|nested] mbox.gz Atom feed top
2018-10-20 1:10 [PATCH 00/24] AFS development David Howells
2018-10-20 1:10 ` [PATCH 01/24] iov_iter: Separate type from direction and use accessor functions David Howells
2018-10-20 4:56 ` Al Viro
2018-10-22 13:00 ` David Howells
2018-10-20 1:10 ` [PATCH 02/24] iov_iter: Renumber the ITER_* constants in uio.h David Howells
2018-10-20 4:59 ` Al Viro
2018-10-22 15:54 ` David Howells
2018-10-23 13:20 ` David Howells [this message]
2018-10-20 1:10 ` [PATCH 03/24] iov_iter: Add I/O discard iterator David Howells
2018-10-20 5:05 ` Al Viro
2018-10-22 16:18 ` David Howells
2018-10-20 1:11 ` [PATCH 04/24] afs: Better tracing of protocol errors David Howells
2018-10-20 1:11 ` [PATCH 05/24] afs: Set up the iov_iter before calling afs_extract_data() David Howells
2018-10-20 1:11 ` [PATCH 06/24] afs: Improve FS server rotation error handling David Howells
2018-10-20 1:11 ` [PATCH 07/24] afs: Implement VL server rotation David Howells
2018-10-20 1:11 ` [PATCH 08/24] afs: Fix TTL on VL server and address lists David Howells
2018-10-20 1:11 ` [PATCH 09/24] afs: Handle EIO from delivery function David Howells
2018-10-20 1:11 ` [PATCH 10/24] afs: Add a couple of tracepoints to log I/O errors David Howells
2018-10-20 1:11 ` [PATCH 11/24] afs: Don't invoke the server to read data beyond EOF David Howells
2018-10-20 1:12 ` [PATCH 12/24] afs: Increase to 64-bit volume ID and 96-bit vnode ID for YFS David Howells
2018-10-20 1:12 ` [PATCH 13/24] afs: Commit the status on a new file/dir/symlink David Howells
2018-10-20 1:12 ` [PATCH 14/24] afs: Remove callback details from afs_callback_break struct David Howells
2018-10-20 1:12 ` [PATCH 15/24] afs: Implement the YFS cache manager service David Howells
2018-10-20 1:12 ` [PATCH 16/24] afs: Fix FS.FetchStatus delivery from updating wrong vnode David Howells
2018-10-20 1:12 ` [PATCH 17/24] afs: Calc callback expiry in op reply delivery David Howells
2018-10-20 1:12 ` [PATCH 18/24] afs: Get the target vnode in afs_rmdir() and get a callback on it David Howells
2018-10-20 1:12 ` [PATCH 19/24] afs: Expand data structure fields to support YFS David Howells
2018-10-20 1:13 ` [PATCH 20/24] afs: Implement YFS support in the fs client David Howells
2018-10-20 1:13 ` [PATCH 21/24] afs: Allow dumping of server cursor on operation failure David Howells
2018-10-20 1:13 ` [PATCH 22/24] afs: Eliminate the address pointer from the address list cursor David Howells
2018-10-20 1:13 ` [PATCH 23/24] afs: Fix callback handling David Howells
2018-10-20 1:13 ` [PATCH 24/24] afs: Probe multiple fileservers simultaneously David Howells
Reply instructions:
You may reply publicly to this message via plain-text email
using any one of the following methods:
* Save the following mbox file, import it into your mail client,
and reply-to-all from there: mbox
Avoid top-posting and favor interleaved quoting:
https://en.wikipedia.org/wiki/Posting_style#Interleaved_style
* Reply using the --to, --cc, and --in-reply-to
switches of git-send-email(1):
git send-email \
--in-reply-to=31390.1540300839@warthog.procyon.org.uk \
--to=dhowells@redhat.com \
--cc=linux-afs@lists.infradead.org \
--cc=linux-kernel@vger.kernel.org \
--cc=viro@ZenIV.linux.org.uk \
/path/to/YOUR_REPLY
https://kernel.org/pub/software/scm/git/docs/git-send-email.html
* If your mail client supports setting the In-Reply-To header
via mailto: links, try the mailto: link
Be sure your reply has a Subject: header at the top and a blank line
before the message body.
This is an external index of several public inboxes,
see mirroring instructions on how to clone and mirror
all data and code used by this external index.