From: Vineet Gupta <Vineet.Gupta1@synopsys.com>
To: Linus Torvalds <torvalds@linux-foundation.org>
Cc: Al Viro <viro@zeniv.linux.org.uk>,
	"linux-arch@vger.kernel.org" <linux-arch@vger.kernel.org>,
	"linux-kernel@vger.kernel.org" <linux-kernel@vger.kernel.org>,
	Richard Henderson <rth@twiddle.net>,
	Russell King <linux@armlinux.org.uk>,
	Will Deacon <will.deacon@arm.com>,
	Haavard Skinnemoen <hskinnemoen@gmail.com>,
	Steven Miao <realmz6@gmail.com>,
	Jesper Nilsson <jesper.nilsson@axis.com>,
	Mark Salter <msalter@redhat.com>,
	Yoshinori Sato <ysato@users.sourceforge.jp>,
	"Richard Kuo" <rkuo@codeaurora.org>,
	Tony Luck <tony.luck@intel.com>,
	"Geert Uytterhoeven" <geert@linux-m68k.org>,
	James Hogan <james.hogan@imgtec.com>,
	Michal Simek <monstr@monstr.eu>,
	David Howells <dhowells@redhat.com>,
	"Ley Foon Tan" <lftan@altera.com>,
	Jonas Bonn <Jonas.Nilsson@synopsys>
Subject: Re: [RFC][CFT][PATCHSET v1] uaccess unification
Date: Thu, 30 Mar 2017 13:40:31 -0700
Message-ID: <efb7aaa4-7d25-0c68-ebf8-cdd7eb1297dc@synopsys.com>
In-Reply-To: <CA+55aFyGwYwdk8i7-GbXV7NLTn38e-bow3VD-hHcQmTr9ebAjw@mail.gmail.com>

On 03/29/2017 05:27 PM, Linus Torvalds wrote:
> On Wed, Mar 29, 2017 at 5:02 PM, Vineet Gupta
> <Vineet.Gupta1@synopsys.com> wrote:
>>
>> I guess I can in next day or two - but mind you the inline version for ARC is kind
>> of special vs. other arches. We have this "manual" constant propagation to elide
>> the unrolled LD/ST for 1-15 byte stragglers, when @sz is constant.
> 
> I don't think that's special. We do that on x86 too, and I suspect ARC
> copied it from there (or from somebody else who did it).

No, I (re)wrote that code and, AFAIR, didn't copy it from anyone, and AFAICS it
is certainly different from the others, if not special. If you look closely at
ARC's uaccess.h, it is not the trivial check for sizes 1, 2 and 4 as in the
commit you referred to. It actually tries to eliminate hunks of the inline
assembly at compile time when @sz is constant (so it is designed purely for the
inlined variants; whether that matters or not is a different story). The thing
is, from the hardware POV, having 4 LD/ST in flight is good (at least for ARC700
cores), so we wrap them up in a zero-delay loop. That takes care of multiples of
16 bytes; the last 15 bytes are the killer, since they require a bunch of
conditionals, which is what I try to eliminate.
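To make the idea concrete, here is a minimal plain-C sketch, assuming 32-bit
words as on ARC700. The name sketch_arc_copy() is hypothetical, and memcpy() on
local words stands in for the LD/ST instructions; the real ARC code is inline
assembly with a zero-delay loop and exception-table fixups, none of which is
modeled here. The point is the shape: a 16-byte bulk loop, then up to four
straggler tests that a compiler can fold away when @sz is a constant.

```c
#include <stddef.h>
#include <stdint.h>
#include <string.h>

/*
 * Hypothetical plain-C model of ARC's inline copy (not the real code).
 * Returns the number of bytes NOT copied, like copy_from_user(); this
 * sketch cannot fault, so it always returns 0.
 */
static inline size_t sketch_arc_copy(void *to, const void *from, size_t sz)
{
	unsigned char *d = to;
	const unsigned char *s = from;
	uint32_t w0, w1, w2, w3;
	size_t n;

	/* Bulk: 4 word loads, then 4 word stores, per 16-byte chunk. */
	for (n = sz >> 4; n; n--) {
		memcpy(&w0, s, 4);
		memcpy(&w1, s + 4, 4);
		memcpy(&w2, s + 8, 4);
		memcpy(&w3, s + 12, 4);
		memcpy(d, &w0, 4);
		memcpy(d + 4, &w1, 4);
		memcpy(d + 8, &w2, 4);
		memcpy(d + 12, &w3, 4);
		s += 16;
		d += 16;
	}

	/*
	 * Stragglers: the last 0..15 bytes need up to four conditionals.
	 * With a constant sz and inlining, every branch whose bit is
	 * clear can be dropped at compile time.
	 */
	if (sz & 8) {
		memcpy(&w0, s, 4);
		memcpy(&w1, s + 4, 4);
		memcpy(d, &w0, 4);
		memcpy(d + 4, &w1, 4);
		s += 8;
		d += 8;
	}
	if (sz & 4) {
		memcpy(&w0, s, 4);
		memcpy(d, &w0, 4);
		s += 4;
		d += 4;
	}
	if (sz & 2) {
		uint16_t h;

		memcpy(&h, s, 2);
		memcpy(d, &h, 2);
		s += 2;
		d += 2;
	}
	if (sz & 1)
		*d = *s;

	return 0;
}
```

For a call like sketch_arc_copy(dst, src, 13) that gets inlined, only the 8-, 4-
and 1-byte tail copies should survive, since 13 = 8 + 4 + 1 and the loop count
and the 2-byte test both fold to zero.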

FWIW, I experimented with uaccess inlining on ARC:
1. pristine 4.11-rc1 (all inline)
2. Inline + the "smart" const propagation disabled
3. Out-of-line-only variants (which already exist, and are the default, on ARC
   for -Os builds, but hacked in for the current -O3 build)

Numbers are LMBench FS latencies, run off tmpfs to avoid any device-related
perturbation. Note that LMBench already runs each test several times itself,
and each row below is from a fresh reboot, since the kernels were different.

So it seems 0K file create/delete gets worse without the smart inline, while
10K gets better. Mmap (16K) got worse as well. With out-of-line only, some
numbers got better and some got worse.


   File & VM system latencies in microseconds - smaller is better
   -------------------------------------------------------------------------------
   Host                 OS   0K File      10K File     Mmap    Prot   Page   100fd
                           Create Delete Create Delete Latency Fault  Fault  selct
   --------- ------------- ------ ------ ------ ------ ------- ----- ------- -----
   170329-v4 Linux 4.11.0-  124.3   75.3  734.2  147.8  2200.0 6.205    10.9  87.6
   170330-v4 Linux 4.11.0-  154.9   88.3  709.2  131.2  2494.0 4.056    11.0  91.1
   170330-v4 Linux 4.11.0-  157.7   69.8  622.7  140.8  2168.0 5.654    10.8  91.0

Compare that with data for:

1. pristine 4.11-rc1 (all inline)
2. Al's series + ARC forced inline
3. Al's series + ARC forced NOT inline

   File & VM system latencies in microseconds - smaller is better
   -------------------------------------------------------------------------------
   Host                 OS   0K File      10K File     Mmap    Prot   Page   100fd
                           Create Delete Create Delete Latency Fault  Fault  selct
   --------- ------------- ------ ------ ------ ------ ------- ----- ------- -----
   170329-v4 Linux 4.11.0-  124.3   75.3  734.2  147.8  2200.0 6.205    10.9  87.6
   170329-v4 Linux 4.11.0-  141.2   63.4  629.7  130.0  2172.0 5.796    10.8  90.0
   170329-v4 Linux 4.11.0-  154.9   89.2  691.6  147.7  2323.0 4.922    10.8  92.3

So it's a mixed bag, really. Maybe we need a more directed test to really drill
down into it.


> But at least on x86 it is limited entirely to the "__" versions, and
> it's almost entirely pointless. We actually removed some of that kind
> of code because it was *so* pointless, and it had just been copied
> around into the "atomic" versions too.
> 
> See for example commit bd28b14591b9 ("x86: remove more uaccess_32.h
> complexity"), which did that.
> 
> The basic "__" versions still do that constant-size thing, but they
> really are questionable. 

Perhaps because the scope of the constant-size special-casing was pretty narrow:
it would only help if *copy_from_user() were called with a size of 1, 2 or 4,
which is relatively unlikely since we already have __get_user() and friends for
that.
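For comparison, the "trivial check for 1-2-4" amounts to roughly the following.
This is a hedged plain-C sketch, not the actual x86 code (which used inline asm
with exception fixups): sketch_const_copy_from_user() and sketch_generic_copy()
are hypothetical names, memcpy() stands in for the user accesses, and
__builtin_constant_p() is the GCC/Clang builtin.

```c
#include <stddef.h>
#include <stdint.h>
#include <string.h>

/* Hypothetical stand-in for the out-of-line generic copy. */
static size_t sketch_generic_copy(void *to, const void *from, size_t n)
{
	memcpy(to, from, n);
	return 0;	/* bytes not copied; nothing can fault here */
}

/*
 * If n is a compile-time constant of 1, 2 or 4, do one sized access;
 * any other size (or a non-constant n) goes to the generic copy.
 */
static inline size_t sketch_const_copy_from_user(void *to, const void *from,
						 size_t n)
{
	if (__builtin_constant_p(n)) {
		switch (n) {
		case 1: { uint8_t v;  memcpy(&v, from, 1); memcpy(to, &v, 1); return 0; }
		case 2: { uint16_t v; memcpy(&v, from, 2); memcpy(to, &v, 2); return 0; }
		case 4: { uint32_t v; memcpy(&v, from, 4); memcpy(to, &v, 4); return 0; }
		}
	}
	return sketch_generic_copy(to, from, n);
}
```

Only three sizes ever take the fast path, and each of them is already what
__get_user() is for, which is why the win is so narrow.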

> Exactly because it's just the "__" versions -
> the *regular* "copy_to/from_user()" is an unconditional function call,
> because inlining it isn't just the access operations, it's the size
> check, and on modern x86 it's also the "set AC to mark the user access
> as safe".

So what you are saying is that it is relatively costly on x86 because of SMAP,
which may not be true for arches without such hardware support.
Note that I'm not arguing for or against inlining per se; it doesn't seem to
matter.
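To illustrate what inlining copy_from_user() would drag into every call site on
x86, here is a hedged plain-C model. Every sketch_* name is hypothetical:
sketch_stac()/sketch_clac() only flip a flag standing in for the STAC/CLAC
instructions that toggle EFLAGS.AC under SMAP, and sketch_access_ok() checks
against a fake "user memory" buffer rather than a real address-space limit.

```c
#include <stddef.h>
#include <stdint.h>
#include <string.h>

/* Hypothetical "user memory": a plain buffer, so the sketch runs anywhere. */
static unsigned char sketch_user_mem[256];

/* Models EFLAGS.AC: 1 while user accesses are permitted under SMAP. */
static int sketch_ac;

static void sketch_stac(void) { sketch_ac = 1; }	/* x86: STAC insn */
static void sketch_clac(void) { sketch_ac = 0; }	/* x86: CLAC insn */

/* Range check in the spirit of access_ok(): [addr, addr + n) is "user". */
static int sketch_access_ok(const void *addr, size_t n)
{
	uintptr_t base = (uintptr_t)sketch_user_mem;
	uintptr_t a = (uintptr_t)addr;

	return a >= base && n <= sizeof(sketch_user_mem) &&
	       a - base <= sizeof(sketch_user_mem) - n;
}

/*
 * What each call site would carry if this were inlined: the range
 * check, the AC toggling around the access, and the copy itself.
 * Returns the number of bytes not copied.
 */
static size_t sketch_copy_from_user(void *to, const void *from, size_t n)
{
	if (!sketch_access_ok(from, n)) {
		memset(to, 0, n);	/* destination is zero-filled on failure */
		return n;
	}
	sketch_stac();
	memcpy(to, from, n);
	sketch_clac();
	return 0;
}
```

On an arch without SMAP-like hardware the two flag flips disappear, leaving only
the range check and the copy, which is one reason the inlining trade-off may
differ there.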

-Vineet
