* XDP performance regression due to CONFIG_RETPOLINE Spectre V2
@ 2018-04-12 13:50 Jesper Dangaard Brouer
From: Jesper Dangaard Brouer @ 2018-04-12 13:50 UTC
  To: xdp-newbies, netdev
  Cc: brouer, Christoph Hellwig, David Woodhouse, William Tu,
	Björn Töpel, Karlsson, Magnus, Alexander Duyck,
	Arnaldo Carvalho de Melo

Heads-up XDP performance nerds!

I got an unpleasant surprise when I updated my GCC compiler (to get
support for the option -mindirect-branch=thunk-extern).  My XDP redirect
performance numbers were cut in half, from approx 13Mpps to 6Mpps
(single CPU core).  I've identified the cause: kernel CONFIG_RETPOLINE,
which only takes effect when the GCC compiler supports it.  This is the
mitigation for Spectre variant 2 (CVE-2017-5715), which is related to
indirect (function call) branches.

XDP_REDIRECT itself has only two primary (per-packet) indirect
function calls, ndo_xdp_xmit and invoking the bpf_prog, plus any
map_lookup_elem calls in the bpf_prog.  I implemented a PoC of bulking
for ndo_xdp_xmit, which helped, but not enough.  The real root cause is
all the DMA API calls, which use function pointers extensively.


Mitigation plan
---------------
Implement support for keeping the DMA mapping through the XDP return
call, to remove the RX map/unmap calls.  Implement bulking for the XDP
ndo_xdp_xmit and XDP return frame APIs.  Bulking makes it possible to
batch DMA operations via scatter-gather DMA calls, which XDP TX needs
for DMA map+unmap.  The driver's per-packet RX DMA-sync (to CPU) calls
are harder to mitigate via the bulking technique.  Ask the DMA
maintainer for a common-case direct call for the swiotlb DMA sync call ;-)

Root-cause verification
-----------------------
I have verified that the indirect DMA calls are the root cause by
removing the DMA sync calls from the code (for swiotlb they do
nothing) and manually inlining the DMA map calls (basically calling
phys_to_dma(dev, page_to_phys(page)) + offset).  For my ixgbe test,
performance "returned" to 11Mpps.

Perf reports
------------
This is not easy to diagnose with the perf tool.  I'm coordinating with
ACME (Arnaldo) to make it easier to pinpoint the hotspots.  Look out for
symbols such as __x86_indirect_thunk_r10, __indirect_thunk_start and
__x86_indirect_thunk_rdx.  Be aware that they might not rank very high
in perf top, but they stop CPU speculation.  Thus, use perf stat instead
and look at the drop in 'insn per cycle'.


If you want to understand retpoline at the ASM level, read this:
 https://support.google.com/faqs/answer/7625886

-- 
Best regards,
  Jesper Dangaard Brouer
  MSc.CS, Principal Kernel Engineer at Red Hat
  LinkedIn: http://www.linkedin.com/in/brouer

Thread overview: 20+ messages
2018-04-12 13:50 XDP performance regression due to CONFIG_RETPOLINE Spectre V2 Jesper Dangaard Brouer
2018-04-12 14:51 ` Christoph Hellwig
2018-04-12 14:56   ` Christoph Hellwig
2018-04-12 15:31     ` Jesper Dangaard Brouer
2018-04-13 16:49       ` Christoph Hellwig
2018-04-13 17:12     ` Tushar Dave
2018-04-13 17:26       ` Christoph Hellwig
2018-04-14 19:29         ` David Woodhouse
2018-04-16  6:02           ` Jesper Dangaard Brouer
2018-04-16 12:27 ` Christoph Hellwig
2018-04-16 12:27   ` Christoph Hellwig
2018-04-16 16:04   ` Alexander Duyck
2018-04-17  6:19     ` Christoph Hellwig
2018-04-16 18:05   ` dma-mapping: bypass dma_ops for direct mappings kbuild test robot
2018-04-16 18:26     ` Jesper Dangaard Brouer
2018-04-16 18:31   ` kbuild test robot
2018-04-16 21:07   ` XDP performance regression due to CONFIG_RETPOLINE Spectre V2 Jesper Dangaard Brouer
2018-04-17  6:15     ` Christoph Hellwig
2018-04-17  7:07       ` Jesper Dangaard Brouer
2018-04-17  7:13         ` Christoph Hellwig
