From: Chris Wilson <chris@chris-wilson.co.uk>
To: linux-kernel@vger.kernel.org
Cc: x86@kernel.org, intel-gfx@lists.freedesktop.org
Subject: x86: Static optimisations for copy_user
Date: Thu,  1 Jun 2017 07:58:40 +0100
Message-ID: <20170601065843.2392-1-chris@chris-wilson.co.uk>

I was looking at the overhead of drmIoctl() in a microbenchmark that
repeatedly did a copy_from_user(.size=8) followed by a
copy_to_user(.size=8) as part of DRM_IOCTL_I915_GEM_BUSY. I found
that if I force-inlined get_user/put_user instead, the walltime of
the ioctl improved by about 20%. If copy_user_generic_unrolled was
used instead of copy_user_enhanced_fast_string, performance of the
microbenchmark improved by 10%. Benchmarking on a few machines gives

(Broadwell)
 benchmark_copy_user(hot):
       size   unrolled     string fast-string
          1        158         77         79
          2        306        154        158
          4        614        308        317
          6        926        462        476
          8       1344        298        635
         12       1773        482        952
         16       2797        602       1269
         24       4020        903       1906
         32       5055       1204       2540
         48       6150       1806       3810
         64       9564       2409       5082
         96      13583       3612       6483
        128      18108       4815       8434

(Broxton)
 benchmark_copy_user(hot):
       size   unrolled     string fast-string
          1        270         52         53
          2        364        106        109
          4        460        213        218
          6        486        305        312
          8       1250        253        437
         12       1009        332        625
         16       2059        514        897
         24       2624        672       1071
         32       3043       1014       1750
         48       3620       1499       2561
         64       7777       1971       3333
         96       7499       2876       4772
        128       9999       3733       6088

which says that in this cache-hot benchmark the rep mov microcode
noticeably underperforms. However, once we pass a few cachelines, and
definitely after exceeding the L1 cache, rep mov is the clear winner.
From cold, there is no difference in timings.

I can improve the microbenchmark either by force-inlining the
raw_copy_*_user switches or by switching to copy_user_generic_unrolled.
Both leave a sour taste. The switch is too big to be inlined, and if
it is called out-of-line the function call overhead negates its
benefits. Switching between fast-string and unrolled presumes a
particular behaviour of the hardware.
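
To make the idea of a size switch for statically known copies concrete,
here is a minimal userspace sketch. It is not the kernel code and not
the patches in this series: static_copy(), generic_copy() and
check_copy_sizes() are hypothetical names, memcpy() stands in for the
real user-access primitives, and no fault handling is modelled. It
only illustrates how a compile-time constant length can be translated
into a couple of fixed-width moves while everything else falls back to
an out-of-line routine.

```c
#include <assert.h>
#include <stddef.h>
#include <string.h>

/* Hypothetical stand-in for the generic out-of-line copy routine.
 * Returns the number of bytes NOT copied, as copy_*_user does. */
static size_t generic_copy(void *dst, const void *src, size_t len)
{
	memcpy(dst, src, len);
	return 0;
}

/* Hypothetical helper: when len is a compile-time constant, emit
 * fixed-size moves; otherwise call the generic routine. */
static inline __attribute__((always_inline)) size_t
static_copy(void *dst, const void *src, size_t len)
{
	if (!__builtin_constant_p(len))
		return generic_copy(dst, src, len);

	switch (len) {
	case 1:
	case 2:
	case 4:
	case 8:
		/* Constant-size memcpy is typically lowered to one move. */
		memcpy(dst, src, len);
		return 0;
	case 6:
		/* 4 + 2 bytes */
		memcpy(dst, src, 4);
		memcpy((char *)dst + 4, (const char *)src + 4, 2);
		return 0;
	case 12:
		/* 8 + 4 bytes */
		memcpy(dst, src, 8);
		memcpy((char *)dst + 8, (const char *)src + 8, 4);
		return 0;
	default:
		return generic_copy(dst, src, len);
	}
}

/* Small self-check exercising constant and runtime lengths. */
static int check_copy_sizes(void)
{
	const char src[16] = "0123456789abcde";
	char dst[16];
	volatile size_t runtime_len = 5; /* not a compile-time constant */

	memset(dst, 0, sizeof(dst));
	assert(static_copy(dst, src, 6) == 0);
	assert(memcmp(dst, src, 6) == 0 && dst[6] == 0);

	memset(dst, 0, sizeof(dst));
	assert(static_copy(dst, src, 12) == 0);
	assert(memcmp(dst, src, 12) == 0 && dst[12] == 0);

	memset(dst, 0, sizeof(dst));
	assert(static_copy(dst, src, runtime_len) == 0);
	assert(memcmp(dst, src, 5) == 0 && dst[5] == 0);
	return 0;
}
```

Whether the constant path is taken depends on the compiler resolving
__builtin_constant_p() after inlining, so without optimisation the
sketch may route every call through generic_copy(); the result is the
same either way.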

In the end, I limited this series to just adding a few extra
translations for statically known copy_*_user().
-Chris

Thread overview: 8+ messages
2017-06-01  6:58 Chris Wilson [this message]
2017-06-01  6:58 ` x86: Static optimisations for copy_user Chris Wilson
2017-06-01  6:58 ` [PATCH 1/3] x86-32: Teach copy_from_user to unroll .size=6/8 Chris Wilson
2017-06-01  6:58   ` Chris Wilson
2017-06-01  6:58 ` [PATCH 2/3] x86-32: Expand static copy_to_user() Chris Wilson
2017-06-01  6:58   ` Chris Wilson
2017-06-01  6:58 ` [PATCH 3/3] x86-64: Inline 6/12 byte copy_user Chris Wilson
2017-06-01  7:17 ` ✓ Fi.CI.BAT: success for series starting with [1/3] x86-32: Teach copy_from_user to unroll .size=6/8 Patchwork
