x86: Static optimisations for copy_user

From: Chris Wilson <chris@chris-wilson.co.uk>
To: linux-kernel@vger.kernel.org
Cc: x86@kernel.org, intel-gfx@lists.freedesktop.org
Subject: x86: Static optimisations for copy_user
Date: Thu,  1 Jun 2017 07:58:40 +0100
Message-ID: <20170601065843.2392-1-chris@chris-wilson.co.uk>

I was looking at the overhead of drmIoctl() in a microbenchmark that
repeatedly did a copy_from_user(.size=8) followed by a
copy_to_user(.size=8) as part of the DRM_IOCTL_I915_GEM_BUSY. I found
that if I force-inlined get_user/put_user instead, the walltime of
the ioctl improved by about 20%. If copy_user_generic_unrolled was
used instead of copy_user_enhanced_fast_string, performance of the
microbenchmark improved by 10%. Benchmarking on a few machines gives

(Broadwell)
 benchmark_copy_user(hot):
       size   unrolled     string fast-string
          1        158         77         79
          2        306        154        158
          4        614        308        317
          6        926        462        476
          8       1344        298        635
         12       1773        482        952
         16       2797        602       1269
         24       4020        903       1906
         32       5055       1204       2540
         48       6150       1806       3810
         64       9564       2409       5082
         96      13583       3612       6483
        128      18108       4815       8434

(Broxton)
 benchmark_copy_user(hot):
       size   unrolled     string fast-string
          1        270         52         53
          2        364        106        109
          4        460        213        218
          6        486        305        312
          8       1250        253        437
         12       1009        332        625
         16       2059        514        897
         24       2624        672       1071
         32       3043       1014       1750
         48       3620       1499       2561
         64       7777       1971       3333
         96       7499       2876       4772
        128       9999       3733       6088

which shows that, for the cache-hot case, the rep mov microcode
noticeably underperforms, though once we pass a few cachelines, and
definitely after exceeding L1 cache, rep mov is the clear winner.
From cold, there is no difference in timings.

I can improve the microbenchmark by either force inlining the
raw_copy_*_user switches, or by switching to copy_user_generic_unrolled.
Both leave a sour taste. The switch is too big to be inlined, and if
called out-of-line the function call overhead negates its benefits.
Switching between fast-string and unrolled makes a presumption about
behaviour.

In the end, I limited this series to just adding a few extra
translations for statically known copy_*_user().
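
As a rough userspace illustration of what a translation for a statically
known size can look like (this is not the code from the series, and
generic_copy(), small_copy(), struct busy_arg and busy_roundtrip() are
hypothetical names made up for this sketch): when the length is a
compile-time constant matching a small fixed size, the copy is emitted
inline, and everything else goes to the out-of-line routine. It assumes
GCC or Clang for __builtin_constant_p and the function attributes.

```c
#include <assert.h>
#include <stddef.h>
#include <stdint.h>
#include <string.h>

/*
 * Hypothetical stand-in for the out-of-line generic copy routine.
 * Returns the number of bytes NOT copied, like copy_*_user().
 */
static __attribute__((noinline)) unsigned long
generic_copy(void *dst, const void *src, size_t len)
{
	memcpy(dst, src, len);
	return 0;
}

/*
 * Hypothetical wrapper: if the compiler can prove len is a constant
 * 1, 2, 4 or 8, emit the copy inline; otherwise (including builds
 * without optimisation) fall back to the function call.
 */
static inline __attribute__((always_inline)) unsigned long
small_copy(void *dst, const void *src, size_t len)
{
	if (__builtin_constant_p(len)) {
		switch (len) {
		case 1:
		case 2:
		case 4:
		case 8:
			/* Fixed-size memcpy, typically lowered to one load/store pair. */
			memcpy(dst, src, len);
			return 0;
		}
	}
	return generic_copy(dst, src, len);
}

/* Hypothetical 8-byte ioctl argument, same size as in the microbenchmark. */
struct busy_arg {
	uint32_t handle;
	uint32_t busy;
};

/* Copy the argument in, update it, copy it back out. */
static unsigned long busy_roundtrip(struct busy_arg *user)
{
	struct busy_arg args;

	assert(user != NULL);
	if (small_copy(&args, user, sizeof(args)))
		return sizeof(args);
	args.busy = 1;
	return small_copy(user, &args, sizeof(args));
}
```

Here both copies in busy_roundtrip() have a constant size of 8, so with
optimisation enabled neither should end up calling generic_copy(). A
copy with a runtime length still takes the out-of-line path.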
-Chris