* rte_prefetch0() is effective?
@ 2015-12-24  6:35 Moon-Sang Lee
  2016-01-13 11:34 ` Bruce Richardson
  0 siblings, 1 reply; 4+ messages in thread
From: Moon-Sang Lee @ 2015-12-24  6:35 UTC
  To: dev

I see code like the following in the examples directory, and I wonder whether
it is effective.
Coherent I/O has been adopted in modern architectures,
so I think the DMA initiated by rte_eth_rx_burst() might already fill
the cache lines of the RX buffers.
Do I really need to call rte_prefetchX()?

            nb_rx = rte_eth_rx_burst(portid, queueid, pkts_burst,
                                     MAX_PKT_BURST);
            ...
            /* Prefetch and forward already prefetched packets */
            for (j = 0; j < (nb_rx - PREFETCH_OFFSET); j++) {
                rte_prefetch0(rte_pktmbuf_mtod(pkts_burst[
                        j + PREFETCH_OFFSET], void *));
                l3fwd_simple_forward(pkts_burst[j], portid,
                    qconf);
            }
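
The excerpt above shows only the steady-state loop. For context, here is a
minimal self-contained sketch of how the whole pattern can be structured, with
a warm-up and a drain phase around it. It is not the example application's
code: struct pkt, handle_packet() and forward_burst() are hypothetical names,
the PREFETCH_OFFSET value is an assumption, and the GCC/Clang builtin
__builtin_prefetch() stands in for rte_prefetch0() so that it compiles without
DPDK.

```c
#include <assert.h>
#include <stddef.h>
#include <stdint.h>

#define PREFETCH_OFFSET 3   /* assumed value; would need tuning per platform */

/* Hypothetical stand-in for an mbuf: just a pointer to the payload. */
struct pkt {
    uint8_t *data;
};

/* Hypothetical per-packet work: read the first payload byte. */
static uint32_t handle_packet(const struct pkt *p)
{
    return p->data[0];
}

/*
 * Process a burst while prefetching PREFETCH_OFFSET packets ahead.
 * nb_rx is signed on purpose: nb_rx - PREFETCH_OFFSET has to go negative,
 * not wrap around, for bursts shorter than the offset.
 */
static uint32_t forward_burst(struct pkt **pkts, int nb_rx)
{
    uint32_t acc = 0;
    int j;

    assert(nb_rx >= 0);

    /* Warm-up: prefetch the first few packets. */
    for (j = 0; j < PREFETCH_OFFSET && j < nb_rx; j++)
        __builtin_prefetch(pkts[j]->data, 0, 3);

    /* Steady state: prefetch ahead, handle an already prefetched packet. */
    for (j = 0; j < nb_rx - PREFETCH_OFFSET; j++) {
        __builtin_prefetch(pkts[j + PREFETCH_OFFSET]->data, 0, 3);
        acc += handle_packet(pkts[j]);
    }

    /* Drain: handle the remaining packets, which were all prefetched. */
    for (; j < nb_rx; j++)
        acc += handle_packet(pkts[j]);

    return acc;
}
```

With a burst shorter than PREFETCH_OFFSET the steady-state loop runs zero
times and the drain loop handles every packet, so no packet is skipped and no
index goes out of range.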


-- 
Moon-Sang Lee, SW Engineer
Email: sang0627@gmail.com
Wisdom begins in wonder. *Socrates*


* Re: rte_prefetch0() is effective?
  2015-12-24  6:35 rte_prefetch0() is effective? Moon-Sang Lee
@ 2016-01-13 11:34 ` Bruce Richardson
  2016-01-13 15:17   ` Polehn, Mike A
  2016-01-13 17:29   ` Matthew Hall
  0 siblings, 2 replies; 4+ messages in thread
From: Bruce Richardson @ 2016-01-13 11:34 UTC
  To: Moon-Sang Lee; +Cc: dev

On Thu, Dec 24, 2015 at 03:35:14PM +0900, Moon-Sang Lee wrote:
> I see codes as below in example directory, and I wonder it is effective.
> Coherent IO is adopted to modern architectures,
> so I think that DMA initiation by rte_eth_rx_burst() might already fulfills
> cache lines of RX buffers.
> Do I really need to call rte_prefetchX()?
> 
>             nb_rx = rte_eth_rx_burst(portid, queueid, pkts_burst,
> MAX_PKT_BURST);
>             ...
>             /* Prefetch and forward already prefetched packets */
>             for (j = 0; j < (nb_rx - PREFETCH_OFFSET); j++) {
>                 rte_prefetch0(rte_pktmbuf_mtod(pkts_burst[
>                         j + PREFETCH_OFFSET], void *));
>                 l3fwd_simple_forward(pkts_burst[j], portid,
>                     qconf);
>             }
> 

Good question.
When the first example apps using this style of prefetch were originally written,
yes, there was a noticeable performance increase from using the prefetch.
Since then, I'm not sure that anyone has checked on each generation of
platforms whether the prefetches are still necessary and how much they help, but
I suspect that they still help a bit and don't hurt performance.
It would be an interesting exercise to check whether the prefetch offsets used
in code like the above can be adjusted to give better performance on our latest
supported platforms.
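
A rough sketch of what such an exercise could look like is below. It only
shows the shape of the experiment: the loop takes the prefetch offset as a
runtime parameter (0 meaning no prefetch) and a small harness times it.
Everything here is an assumption for illustration: struct bench_pkt,
sum_burst() and time_offset() are hypothetical names, __builtin_prefetch()
(GCC/Clang) stands in for rte_prefetch0(), and clock() is used for timing.
Meaningful numbers would need the real RX path and a working set larger than
the cache.

```c
#include <assert.h>
#include <stddef.h>
#include <stdint.h>
#include <time.h>

/* Hypothetical stand-in for an mbuf: just a pointer to the payload. */
struct bench_pkt {
    uint8_t *data;
};

/* Touch the first byte of every packet, prefetching `offset` packets ahead.
 * offset == 0 disables prefetching, which gives the baseline. */
static uint32_t sum_burst(struct bench_pkt **pkts, int nb_rx, int offset)
{
    uint32_t acc = 0;
    int j;

    assert(nb_rx >= 0 && offset >= 0);

    /* Warm-up: prefetch the first `offset` packets. */
    for (j = 0; j < offset && j < nb_rx; j++)
        __builtin_prefetch(pkts[j]->data, 0, 3);

    for (j = 0; j < nb_rx; j++) {
        if (offset > 0 && j + offset < nb_rx)
            __builtin_prefetch(pkts[j + offset]->data, 0, 3);
        acc += pkts[j]->data[0];
    }
    return acc;
}

/* Run `rounds` bursts with the given offset and return the CPU time used,
 * in seconds. The checksum is handed back so the work cannot be dropped
 * by the compiler and so results can be compared across offsets. */
static double time_offset(struct bench_pkt **pkts, int nb_rx, int offset,
                          int rounds, uint32_t *sum_out)
{
    uint32_t acc = 0;
    clock_t start = clock();

    for (int r = 0; r < rounds; r++)
        acc += sum_burst(pkts, nb_rx, offset);

    clock_t end = clock();
    *sum_out = acc;
    return (double)(end - start) / CLOCKS_PER_SEC;
}
```

Calling time_offset() for a range of offsets on the same burst and comparing
the returned times against the offset 0 baseline would show whether a given
offset helps on a particular platform; the checksum should be identical for
every offset.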

/Bruce


* Re: rte_prefetch0() is effective?
  2016-01-13 11:34 ` Bruce Richardson
@ 2016-01-13 15:17   ` Polehn, Mike A
  2016-01-13 17:29   ` Matthew Hall
  1 sibling, 0 replies; 4+ messages in thread
From: Polehn, Mike A @ 2016-01-13 15:17 UTC
  To: Richardson, Bruce, Moon-Sang Lee; +Cc: dev

Prefetches make a big difference because a powerful CPU like IA is always trying to find items to prefetch, and the priority of these is not always easy to determine. This is especially a problem across subroutine calls: the compiler cannot determine what has priority in the other subroutines, and the CPU's runtime logic cannot always predict far enough ahead along all possible paths. That matters most on a cache miss, where the memory access takes eons of clock cycles and probably results in a CPU stall.

Until computers fully understand the logic of a program and write optimum code themselves (putting programmers out of business), it is the programmer who knows what becomes important as the program progresses, and therefore what is desirable to prefetch. It is hard to tell whether the CPU will give a prefetch the same priority on its own, so an explicit prefetch may or may not show up as a measurable performance improvement under some conditions. Having it in place, however, makes the prefetch priority decision correct in the other cases, and that is where the performance improvement comes from.

Removing a prefetch without thinking through and fully understanding why it is there, or what its added cost is, if any (for example, when calculating the address for the prefetch affects other current operations), is just plain amateur work. That is not to say people never misjudge what needs to be prefetched or place prefetches poorly; such a prefetch should be removed, but only if it is not logically proper for the expected runtime operation.
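
To illustrate the point about the cost of calculating the prefetch address:
rte_pktmbuf_mtod() derives the data pointer from fields stored in the mbuf
itself, so issuing a prefetch for the packet data first has to read the mbuf.
One possible way to hide that cost is to prefetch in two stages, the
descriptor further ahead and the payload closer in. The sketch below is only
an illustration under assumptions: struct mbuf_like, pkt_data(),
sum_two_stage() and both distances are hypothetical, and __builtin_prefetch()
(GCC/Clang) stands in for rte_prefetch0().

```c
#include <assert.h>
#include <stddef.h>
#include <stdint.h>

#define HDR_AHEAD  4   /* assumed distance for the descriptor prefetch */
#define DATA_AHEAD 2   /* assumed distance for the payload prefetch */

/* Hypothetical descriptor: the payload address has to be computed
 * from two fields that live in the descriptor itself. */
struct mbuf_like {
    uint8_t *buf;
    uint16_t data_off;
};

/* Computing the payload address reads the descriptor. */
static inline uint8_t *pkt_data(const struct mbuf_like *m)
{
    return m->buf + m->data_off;
}

/* Sum the first payload byte of each packet using two prefetch stages.
 * The warm-up for the first few packets is left out for brevity. */
static uint32_t sum_two_stage(struct mbuf_like **pkts, int nb_rx)
{
    uint32_t acc = 0;

    assert(nb_rx >= 0);

    for (int j = 0; j < nb_rx; j++) {
        /* Stage 1: request the descriptor; needs only the pointer value. */
        if (j + HDR_AHEAD < nb_rx)
            __builtin_prefetch(pkts[j + HDR_AHEAD], 0, 3);

        /* Stage 2: this descriptor was requested HDR_AHEAD - DATA_AHEAD
         * iterations ago, so reading it to compute the payload address
         * is less likely to miss. */
        if (j + DATA_AHEAD < nb_rx)
            __builtin_prefetch(pkt_data(pkts[j + DATA_AHEAD]), 0, 3);

        acc += pkt_data(pkts[j])[0];
    }
    return acc;
}
```

Whether the extra stage pays off depends on whether the descriptors are
already in cache when the loop runs, which is something to measure rather
than assume.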

Only more primitive CPUs with no prefetch capabilities don't benefit from properly placed prefetches. 

Mike


* Re: rte_prefetch0() is effective?
  2016-01-13 11:34 ` Bruce Richardson
  2016-01-13 15:17   ` Polehn, Mike A
@ 2016-01-13 17:29   ` Matthew Hall
  1 sibling, 0 replies; 4+ messages in thread
From: Matthew Hall @ 2016-01-13 17:29 UTC
  To: Bruce Richardson; +Cc: dev

On Wed, Jan 13, 2016 at 11:34:33AM +0000, Bruce Richardson wrote:
> When the first example apps using this style of prefetch were originally 
> written, yes, there was a noticeable performance increase achieved by using 
> the prefetch. Thereafter, I'm not sure that anyone has checked with each 
> generation of platforms whether the prefetches are still necessary and how 
> much they help, but I suspect that they still help a bit, and don't hurt 
> performance.

FYI, for me as a community member this paragraph describes one of my top 
irritations about DPDK.

The Intel accelerations, such as adding prefetches, or support for new
features like librte_power, are treated as one-off projects rather than as
ongoing technical efforts which need periodic retesting and maintenance. Thus,
after waiting over a month for a reply, I eventually discovered that
librte_power has probably not worked right at all since at least
Sandy Bridge, which is a very old chip by now for servers.

The accelerations are also treated like black magic: no comments are
put in the code about how and why they work. An outsider trying his best to
measure things in VTune, to help provide the ongoing testing and maintenance,
cannot tell why something was done or how it might be adjusted to work right
in their environment if their hardware is older or newer than whatever
undocumented hardware was used in developing the example. Nor do I know of
anywhere that states the reference platform and core generation used for
developing an example, so that I could get some idea whether it's current or
old code.

When I ask a high-level question, such as "Which Intel accelerations should
one make sure are enabled to get best performance?", it normally doesn't get
any reply. This makes life difficult because there are many dozens of
accelerations listed in the data sheet of a typical modern Intel core, and no
guidance is provided on the priority of the different accelerations for DPDK.
So I don't have a good idea about where to focus my time to get the best
acceleration out of all the technology it must have cost Intel millions or
billions to create. To me that's very sad.

I am hoping maybe there are some resources we could make available to help 
understand the principles behind the accelerations so it is easier for the 
community to take part in maintaining them and maybe even helping create new 
ones.

Note: I read through all the subchapters here:

http://dpdk.org/doc/guides/prog_guide/perf_opt_guidelines.html

None of them mention any CPU acceleration details whatsoever.

They don't explain any specifics on prefetch or branch prediction. Only that 
they exist and do things.

Sincerely,
Matthew.

