From: "David S. Miller" <davem@caip.rutgers.edu>
To: ecd@skynet.be
Cc: lmlinux@neteng.engr.sgi.com, sparclinux-cvs@vger.rutgers.edu,
alan@cymru.net, torvalds@cs.helsinki.fi
Subject: idea for csum_partial_copy on Viking/MXCC
Date: Sun, 19 May 1996 01:45:17 -0400
Message-ID: <199605190545.BAA21316@huahaga.rutgers.edu>
(Note: This is just another one of my crazy ideas; consider it
something to do possibly in the future, when someone has tons of
free time. For now I'm going to get the software version
working as fast as it can.)
(Some background for some of you: Viking/MXCC Sparc has a hardware
block copy facility which can copy cache sub-block aligned chunks of
RAM very quickly.)
This is silly, but it would get disgustingly fast numbers. (btw,
eddie, still waiting for the memcpy.s of yours so that I can do some
testing tonight...)
You use the MXCC stream copy stuff if you have a buffer bigger than
256 bytes and you can align it to 32 bytes. The unrolled loops right
now look like:
ldd [%src + offset + 0x18], %t0; ! multi-cycle cache stall
ldd [%src + offset + 0x10], %t2; ! 1 cycle, cache hit
ldd [%src + offset + 0x08], %t4; ! 1 cycle, cache hit
st %t0, [%dest + offset + 0x18]; ! multi-cycle cache stall
addxcc %t0, %accum, %accum; ! 1 cycle, does not pair
st %t1, [%dest + offset + 0x1c];
addxcc %t1, %accum, %accum; ! 1 cycle, cache hit
st %t2, [%dest + offset + 0x10];
addxcc %t2, %accum, %accum; ! 1 cycle, cache hit
ldd [%src + offset + 0x00], %t0; ! 1 cycle, cache hit
st %t3, [%dest + offset + 0x14]; ! 1 cycle, cache hit, cannot pair
addxcc %t3, %accum, %accum;
st %t4, [%dest + offset + 0x08]; ! 1 cycle, cache hit
addxcc %t4, %accum, %accum;
st %t5, [%dest + offset + 0x0c]; ! 1 cycle, cache hit
addxcc %t5, %accum, %accum;
st %t0, [%dest + offset + 0x00]; ! 1 cycle, cache hit
addxcc %t0, %accum, %accum;
st %t1, [%dest + offset + 0x04]; ! 1 cycle, cache hit
addxcc %t1, %accum, %accum;
! around 19 clock cycles
Bite me, those stores make this stuff impossible to schedule without
grabbing a register window which I refuse to do.
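For readers who do not read SPARC assembly, here is a minimal Python sketch of what one pass of the unrolled loop above computes: copy 32 bytes and fold each 32-bit word into a ones-complement accumulator. The function names are hypothetical, and the model folds the carry back in immediately after every add, whereas addxcc carries it into the next add; the final sum should be the same either way.

```python
import struct

def csum_copy_block(src, dst, offset, accum):
    """Copy one 32-byte block from src to dst and add its eight
    big-endian 32-bit words into accum with end-around carry."""
    words = struct.unpack_from(">8I", src, offset)        # the four 'ldd's
    for i, word in enumerate(words):
        struct.pack_into(">I", dst, offset + 4 * i, word)  # one 'st'
        accum += word                                      # one 'addxcc'
        accum = (accum & 0xFFFFFFFF) + (accum >> 32)       # wrap the carry
    return accum

def fold16(accum):
    """Fold a 32-bit ones-complement sum down to 16 bits."""
    while accum >> 16:
        accum = (accum & 0xFFFF) + (accum >> 16)
    return accum
```

Every word costs one store plus one add here, which mirrors the st/addxcc interleaving that makes the assembly version hard to schedule.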
Ok, on the MXCC you eat some cycles up front so that you have the
registers set up for the source (for the checksum calculations) and the
page numbers etc. for the stream operation for the entire chunk being
csum/copied. Then it looks like this:
st %stream_addr1, [%stream_addr2] ASI_MXCC
/* Processor stalls 3 or 4 clocks to get stream operation going. */
ldd [%src + offset + 0x18], %t0; ! cache hit
addxcc %t0, %accum, %accum; ! 1 cycle, does pair
addxcc %t1, %accum, %accum; ! 1 cycle, no pair
ldd [%src + offset + 0x10], %t2; ! cache hit
addxcc %t2, %accum, %accum; ! 1 cycle, does pair
addxcc %t3, %accum, %accum; ! 1 cycle, no pair
ldd [%src + offset + 0x08], %t4; ! cache hit
addxcc %t4, %accum, %accum; ! 1 cycle, does pair
addxcc %t5, %accum, %accum; ! 1 cycle, no pair
ldd [%src + offset + 0x00], %t0; ! 1 cycle
addxcc %t0, %accum, %accum; ! 1 cycle, does pair
addxcc %t1, %accum, %accum; ! 1 cycle, no pair
! around 12 clock cycles
MXCC does all those ugly and hard to schedule stores for us ;-) Note
that I could probably schedule that new sequence even better.
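The split described above (hardware moves the bytes, the CPU only loads and sums) can be sketched in Python as follows. This is a model under stated assumptions, not the real code: the slice assignment stands in for the ASI_MXCC stream store, can_use_stream and stream_copy_and_sum are hypothetical names, and the gate simply requires both addresses to be 32-byte aligned already, while the message only says the buffer must be alignable.

```python
import struct

BLOCK = 32      # stream copy granule used in the message
MIN_LEN = 256   # length threshold quoted in the message

def can_use_stream(src_addr, dst_addr, length):
    """Hypothetical gate: buffer long enough and both ends 32-byte aligned."""
    return (length >= MIN_LEN
            and src_addr % BLOCK == 0
            and dst_addr % BLOCK == 0)

def stream_copy_and_sum(src, dst, offset, accum):
    """Model one 32-byte block: bulk copy, then a loads-only checksum pass."""
    # Stand-in for the single store that kicks off the MXCC stream copy.
    dst[offset:offset + BLOCK] = src[offset:offset + BLOCK]
    # The CPU side: loads and adds only, no per-word stores.
    for word in struct.unpack_from(">8I", src, offset):
        accum += word
        accum = (accum & 0xFFFFFFFF) + (accum >> 32)  # end-around carry
    return accum
```

The point of the split is that the inner loop no longer contains any stores, so only the loads and adds have to be scheduled.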
That is a saving of 7 clock cycles for _every_ 32 byte aligned block we
csum. The overhead of setting up for the stream operation is fuzzed away
by the fact that we usually run this thing many times in a row (thus the
"only do optimization if len >= 256" rule above).
Let's assume in such an implementation that we eat around 13 or 14
cycles getting the registers ready for the stream operation. Fine,
then after two straight iterations of the above code sequence we are
breaking even (2 * 7 == 14 cycles saved), and we commonly run it many
times in a row.
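The break-even claim can be checked with a few lines of Python. The cycle counts are the approximate figures from the message (19 and 12 cycles per 32-byte block, 14 cycles of setup); the function names are only illustrative.

```python
import math

OLD_CYCLES_PER_BLOCK = 19   # software copy + checksum loop
NEW_CYCLES_PER_BLOCK = 12   # checksum-only loop with the stream copy running
SETUP_CYCLES = 14           # upper estimate for stream-op register setup

def net_cycles_saved(blocks):
    """Cycles saved after running `blocks` 32-byte blocks in a row."""
    per_block = OLD_CYCLES_PER_BLOCK - NEW_CYCLES_PER_BLOCK
    return blocks * per_block - SETUP_CYCLES

def break_even_blocks():
    """Smallest run length at which the setup cost is paid back."""
    per_block = OLD_CYCLES_PER_BLOCK - NEW_CYCLES_PER_BLOCK
    return math.ceil(SETUP_CYCLES / per_block)
```

With these inputs break_even_blocks() gives 2, and a 256-byte buffer (8 blocks) comes out 42 cycles ahead.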
Common case for full ethernet frame is that we do 128 bytes at a shot,
11 times. This works out to:
(7 clocks saved per iteration * (128 / 32) * 11) -
(14 stream-op setup cycles * 11 iterations)
== 308 saved cycles - 154 lost cycles
== 154 clocks less per csum on ethernet sized packet frame
Old code == 846 total cycles for ethernet sized packet frame
New code == 692 total cycles for ethernet sized packet frame
We go ~20% faster ;-) A possible issue is overhead of function
ptr dereference for the call, but based upon the performance of our
dynamic mmu code I doubt it would matter and it would give gcc some
dead cycles to fill in the networking code anyways.
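The ethernet-frame estimate can be reproduced from the per-block figures with the sketch below (the function name and dictionary keys are hypothetical). Note that multiplying out the "around 19" and "around 12" cycle counts gives totals of 836 and 682, 10 cycles lower than the 846 and 692 quoted above; the message does not say where the extra 10 cycles come from. The 154-cycle difference is the same either way, which is about 18% fewer cycles, or roughly 1.22x the throughput.

```python
BLOCK_BYTES = 32
OLD_CYCLES_PER_BLOCK = 19
NEW_CYCLES_PER_BLOCK = 12
SETUP_CYCLES = 14

def ethernet_frame_estimate(chunk_bytes=128, chunks=11):
    """Redo the message's arithmetic for 11 chunks of 128 bytes each."""
    blocks = (chunk_bytes // BLOCK_BYTES) * chunks
    saved = (OLD_CYCLES_PER_BLOCK - NEW_CYCLES_PER_BLOCK) * blocks
    setup = SETUP_CYCLES * chunks   # one stream-op setup per chunk
    return {
        "blocks": blocks,
        "saved": saved,
        "setup": setup,
        "net": saved - setup,
        "old_total": OLD_CYCLES_PER_BLOCK * blocks,
        "new_total": NEW_CYCLES_PER_BLOCK * blocks + setup,
    }
```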
As noted previously, this would be a "research" venture to see what
kind of numbers it would really get. Now that I think about it, I
would be very leery about putting this into the tree, because we could
hit the sun4d XBUS IOCACHE hardware bug (it is only triggered by MXCC
hardware block copy operations and certain types of DMA activity with
a certain set of bit patterns in the data; nasty bug).
For now I'm re-scheduling the software csum/copy code to work as it
should (I've found from Andy's numbers that I was hitting the cache in
the wrong way; fixing this right now).
Later,
David S. Miller
davem@caip.rutgers.edu