From: "'Naveen N. Rao'" <naveen.n.rao@linux.vnet.ibm.com>
To: Benjamin Herrenschmidt <benh@au1.ibm.com>
Cc: David Laight <David.Laight@ACULAB.COM>,
	"netdev@vger.kernel.org" <netdev@vger.kernel.org>,
	"linuxppc-dev@lists.ozlabs.org" <linuxppc-dev@lists.ozlabs.org>,
	"davem@davemloft.net" <davem@davemloft.net>,
	"daniel@iogearbox.net" <daniel@iogearbox.net>,
	"ast@fb.com" <ast@fb.com>,
	Madhavan Srinivasan <maddy@linux.vnet.ibm.com>,
	Michael Ellerman <mpe@ellerman.id.au>
Subject: Re: [PATCH 3/3] powerpc: bpf: implement in-register swap for 64-bit endian operations
Date: Tue, 24 Jan 2017 00:52:27 +0530
Message-ID: <20170123192227.GE3820@naverao1-tp.localdomain>
In-Reply-To: <1484492458.11927.17.camel@au1.ibm.com>

On 2017/01/15 09:00AM, Benjamin Herrenschmidt wrote:
> On Fri, 2017-01-13 at 23:22 +0530, 'Naveen N. Rao' wrote:
> > > That rather depends on whether the processor has a store to load forwarder
> > > that will satisfy the read from the store buffer.
> > > I don't know about ppc, but at least some x86 will do that.
> > 
> > Interesting - good to know that.
> > 
> > However, I don't think powerpc does that and in-register swap is likely 
> > faster regardless. Note also that gcc prefers this form at higher 
> > optimization levels.
> 
> Of course powerpc has a load-store forwarder these days, however, I
> wouldn't be surprised if the in-register form was still faster on some
> implementations, but this needs to be tested.

Thanks for clarifying! To test this, I wrote a simple (perhaps naive) 
test that just issues a whole lot of endian swaps and in _that_ test, it 
does look like the load-store forwarder is doing pretty well.

The tests:

bpf-bswap.S:
-----------
	.file   "bpf-bswap.S"
        .abiversion 2
        .section        ".text"
        .align 2
        .globl main
        .type   main, @function
main:
        mflr    0
        std     0,16(1)
        stdu    1,-32760(1)
	addi	3,1,32
	li	4,0
	li	5,32720
	li	11,32720
	mulli	11,11,8
	li	10,0
	li	7,16
1:	ldx	6,3,4
	stdx	6,1,7
	ldbrx	6,1,7
	stdx	6,3,4
	addi	4,4,8
	cmpd	4,5
	beq	2f
	b	1b
2:	addi	10,10,1
	li	4,0
	cmpd	10,11
	beq	3f
	b	1b
3:	li	3,0
        addi	1,1,32760
        ld      0,16(1)
	mtlr	0
	blr

bpf-bswap-reg.S:
---------------
	.file   "bpf-bswap-reg.S"
        .abiversion 2
        .section        ".text"
        .align 2
        .globl main
        .type   main, @function
main:
        mflr    0
        std     0,16(1)
        stdu    1,-32760(1)
	addi	3,1,32
	li	4,0
	li	5,32720
	li	11,32720
	mulli	11,11,8
	li	10,0
1:	ldx	6,3,4
	rldicl	7,6,32,32
	rlwinm	8,6,24,0,31
	rlwimi	8,6,8,8,15
	rlwinm	9,7,24,0,31
	rlwimi	8,6,8,24,31
	rlwimi	9,7,8,8,15
	rlwimi	9,7,8,24,31
	rldicr	8,8,32,31
	or	6,8,9
	stdx	6,3,4
	addi	4,4,8
	cmpd	4,5
	beq	2f
	b	1b
2:	addi	10,10,1
	li	4,0
	cmpd	10,11
	beq	3f
	b	1b
3:	li	3,0
        addi	1,1,32760
        ld      0,16(1)
	mtlr	0
	blr
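
For readers who do not read powerpc assembly, the two strategies can be
sketched in C. This is only an illustration, not the JIT code: the
function names are made up for this note, and a compiler is free to turn
either function into something quite different from the listings above.

```c
#include <stdint.h>
#include <string.h>

/*
 * Swap through memory: store the value, then read it back with the
 * bytes in reverse order. Mirrors the stdx + ldbrx pair in bpf-bswap.S.
 */
static uint64_t bswap64_via_memory(uint64_t v)
{
	unsigned char buf[8], rev[8];
	uint64_t out;
	int i;

	memcpy(buf, &v, sizeof(buf));		/* the "store" */
	for (i = 0; i < 8; i++)			/* the byte-reversed "load" */
		rev[i] = buf[7 - i];
	memcpy(&out, rev, sizeof(out));
	return out;
}

/* Swap one 32-bit word using shifts and masks only. */
static uint32_t bswap32_in_reg(uint32_t w)
{
	return (w >> 24) | ((w >> 8) & 0x0000ff00u) |
	       ((w << 8) & 0x00ff0000u) | (w << 24);
}

/*
 * Swap in registers: split into two words, swap each, then recombine
 * with the halves exchanged. Same shape as bpf-bswap-reg.S.
 */
static uint64_t bswap64_in_reg(uint64_t v)
{
	uint32_t hi = (uint32_t)(v >> 32);
	uint32_t lo = (uint32_t)v;

	return ((uint64_t)bswap32_in_reg(lo) << 32) | bswap32_in_reg(hi);
}
```

Both functions return the same value for any input; the question in this
thread is only which form the hardware executes faster.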

Profiling the two variants:

# perf stat ./bpf-bswap

 Performance counter stats for './bpf-bswap':

       1395.979224      task-clock (msec)         #    0.999 CPUs utilized          
                 0      context-switches          #    0.000 K/sec                  
                 0      cpu-migrations            #    0.000 K/sec                  
                45      page-faults               #    0.032 K/sec                  
     4,651,874,673      cycles                    #    3.332 GHz                      (66.87%)
         3,141,186      stalled-cycles-frontend   #    0.07% frontend cycles idle     (50.57%)
     1,117,289,485      stalled-cycles-backend    #   24.02% backend cycles idle      (50.57%)
     8,565,963,861      instructions              #    1.84  insn per cycle         
                                                  #    0.13  stalled cycles per insn  (67.05%)
     2,174,029,771      branches                  # 1557.351 M/sec                    (49.69%)
           262,656      branch-misses             #    0.01% of all branches          (50.05%)

       1.396893189 seconds time elapsed

# perf stat ./bpf-bswap-reg

 Performance counter stats for './bpf-bswap-reg':

       1819.758102      task-clock (msec)         #    0.999 CPUs utilized          
                 3      context-switches          #    0.002 K/sec                  
                 0      cpu-migrations            #    0.000 K/sec                  
                44      page-faults               #    0.024 K/sec                  
     6,034,777,602      cycles                    #    3.316 GHz                      (66.83%)
         2,010,983      stalled-cycles-frontend   #    0.03% frontend cycles idle     (50.47%)
     1,024,975,759      stalled-cycles-backend    #   16.98% backend cycles idle      (50.52%)
    16,043,732,849      instructions              #    2.66  insn per cycle         
                                                  #    0.06  stalled cycles per insn  (67.01%)
     2,148,710,750      branches                  # 1180.767 M/sec                    (49.57%)
           268,046      branch-misses             #    0.01% of all branches          (49.52%)

       1.821501345 seconds time elapsed


This is all in a POWER8 VM. On POWER7, the in-register variant is around 
4 times faster than the ldbrx variant.

So, yes, unless I've missed something, the ldbrx variant seems to 
perform at least on par with, if not better than, the in-register swap 
variant on POWER8.
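
To put the perf numbers on a per-swap basis, the loop bounds from the
listings can be combined with the counter values. A small sketch of that
arithmetic follows; the helper names are made up for this note.

```c
#include <stdint.h>

/* Loop bounds copied from the listings above. */
#define BUF_BYTES	32720u		/* li 5,32720: bytes walked per pass */
#define PASSES		(32720u * 8u)	/* li 11,32720; mulli 11,11,8 */

/* Total number of 64-bit swaps each test binary performs. */
static uint64_t total_swaps(void)
{
	return (uint64_t)(BUF_BYTES / 8u) * PASSES;
}

/* Average of a perf counter value per swap. */
static double per_swap(uint64_t counter)
{
	return (double)counter / (double)total_swaps();
}
```

Feeding in the counters reported above gives about 8.0 instructions and
4.3 cycles per swap for the ldbrx variant, against about 15.0
instructions and 5.6 cycles per swap for the in-register variant. The
counters were multiplexed (the percentages in parentheses), so these are
estimates rather than exact figures.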

> 
> Ideally, you'd want to try to "optimize" load+swap or swap+store
> though.

Agreed. This is already the case with BPF for packet access - those use 
skb helpers which issue the appropriate lhbrx/lwbrx/ldbrx. The newer 
BPF_FROM_LE/BPF_FROM_BE are for endian operations with other BPF 
programs.

We can probably implement an extra pass to detect use of endian swap and 
try to match it up with a previous load or a subsequent store though...
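
A rough sketch of what such a pass could look like, on a made-up toy
instruction format rather than the real BPF one. Everything here (the
toy_* names, the reg_dead_after flag) is hypothetical. It assumes
straight-line code with no branch targets inside a fused pair, and that
something else has already worked out whether the store's source
register is read again afterwards.

```c
#include <stddef.h>

/* Hypothetical toy IR, not the kernel's BPF instruction encoding. */
enum toy_op {
	TOY_LOAD64,	/* reg = *(u64 *)mem */
	TOY_STORE64,	/* *(u64 *)mem = reg */
	TOY_BSWAP64,	/* reg = byte-swapped reg */
	TOY_LOAD64_BR,	/* fused: byte-reversed load (ldbrx-like) */
	TOY_STORE64_BR,	/* fused: byte-reversed store (stdbrx-like) */
	TOY_NOP,
	TOY_OTHER,
};

struct toy_insn {
	enum toy_op op;
	int reg;
	int reg_dead_after;	/* stores only: reg is not read again */
};

/*
 * Fuse adjacent load+swap and swap+store pairs in place. Returns the
 * number of pairs fused. A swap+store pair is fused only when the
 * register is dead after the store, because the fused form leaves the
 * unswapped value in the register.
 */
static size_t toy_fuse_swaps(struct toy_insn *prog, size_t n)
{
	size_t i, fused = 0;

	for (i = 0; i + 1 < n; i++) {
		struct toy_insn *a = &prog[i], *b = &prog[i + 1];

		if (a->op == TOY_LOAD64 && b->op == TOY_BSWAP64 &&
		    a->reg == b->reg) {
			a->op = TOY_LOAD64_BR;
			b->op = TOY_NOP;
			fused++;
		} else if (a->op == TOY_BSWAP64 && b->op == TOY_STORE64 &&
			   a->reg == b->reg && b->reg_dead_after) {
			a->op = TOY_NOP;
			b->op = TOY_STORE64_BR;
			fused++;
		}
	}
	return fused;
}
```

A real pass would also have to check jump targets and register liveness
itself, which is where most of the work would be.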

Thanks!
- Naveen
