* [PATCH RFC v2 00/12] socket sendmsg MSG_ZEROCOPY
@ 2017-02-22 16:38 Willem de Bruijn
From: Willem de Bruijn @ 2017-02-22 16:38 UTC (permalink / raw)
  To: netdev; +Cc: Willem de Bruijn

From: Willem de Bruijn <willemb@google.com>

RFCv2:

I have received a few requests for status and rebased code of this
feature. We have been running this code internally, discovering and
fixing various bugs. With net-next closed, now seems like a good time
to share an updated patchset with fixes. The rebase from RFCv1/v4.2
was mostly straightforward: mainly iov_iter changes. Full changelog:

  RFC -> RFCv2:
    - review comment: do not loop skb with zerocopy frags onto rx:
          add skb_orphan_frags_rx to orphan even refcounted frags
	  call this in __netif_receive_skb_core, deliver_skb and tun:
	  the same as 1080e512d44d ("net: orphan frags on receive")
    - fix: hold an explicit sk reference on each notification skb.
          previously relied on the reference (or wmem) held by the
	  data skb that would trigger notification, but this breaks
	  on skb_orphan.
    - fix: when aborting a send, do not inc the zerocopy counter
          this caused gaps in the notification chain
    - fix: in packet with SOCK_DGRAM, pull ll headers before calling
          zerocopy_sg_from_iter
    - fix: if sock_zerocopy_realloc does not allow coalescing,
          do not fail, just allocate a new ubuf
    - fix: in tcp, check return value of second allocation attempt
    - chg: allocate notification skbs from optmem
          to avoid affecting tcp write queue accounting (TSQ)
    - chg: limit #locked pages (ulimit) per user instead of per process
    - chg: grow notification ids from 16 to 32 bit
      - pass range [lo, hi] through 32 bit fields ee_info and ee_data
    - chg: rebased to davem-net-next on top of v4.10-rc7
    - add: limit notification coalescing
          sharing ubufs limits overhead, but delays notification until
	  the last packet is released, possibly unbounded. Add a cap. 
    - tests: add snd_zerocopy_lo pf_packet test
    - tests: two bugfixes (add do_flush_tcp, ++sent not only in debug)

The change to allocate notification skbuffs from optmem requires
ensuring that net.core.optmem_max is at least a few hundred KB. To
experiment, run

  sysctl -w net.core.optmem_max=1048576

The snd_zerocopy_lo benchmarks reported in the individual patches were
rerun for RFCv2. To make them work, calls to skb_orphan_frags_rx were
replaced with skb_orphan_frags to allow looping to local sockets. The
netperf results below are also rerun with v2.

Under application load, copy avoidance shows a roughly 5% systemwide
reduction in cycles when streaming large flows and a 4-8% reduction in
wall clock time on early tensorflow test workloads.


Overview (from original RFC):

Add zerocopy socket sendmsg() support with flag MSG_ZEROCOPY.
Implement the feature for TCP, UDP, RAW and packet sockets. This is
a generalization of a previous packet socket RFC patch

  http://patchwork.ozlabs.org/patch/413184/

On a send call with MSG_ZEROCOPY, the kernel pins the user pages and
creates skbuff fragments directly from these pages. On tx completion,
it notifies the socket owner that it is safe to modify memory by
queuing a completion notification onto the socket error queue.

The kernel already implements such copy avoidance with vmsplice plus
splice and with ubuf_info for tun and virtio. Extend the second
with features required by TCP and others: reference counting to
support cloning (retransmit queue) and shared fragments (GSO) and
notification coalescing to handle corking.

Notifications are queued onto the socket error queue as a range
[N, N+m], where N is a per-socket counter incremented on each
successful zerocopy send call.
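
For illustration only, a minimal caller could look roughly like the
sketch below. It is not part of this series: the MSG_ZEROCOPY and
SO_EE_ORIGIN_ZEROCOPY values are hardcoded from the patches, error
handling is omitted, and the busy-wait on the error queue stands in
for the poll()/epoll() loop a real application would use.

  #include <errno.h>
  #include <sys/socket.h>
  #include <linux/errqueue.h>

  #ifndef MSG_ZEROCOPY
  #define MSG_ZEROCOPY            0x4000000  /* from this series */
  #endif
  #ifndef SO_EE_ORIGIN_ZEROCOPY
  #define SO_EE_ORIGIN_ZEROCOPY   5          /* from this series */
  #endif

  static int send_zerocopy(int fd, const void *buf, size_t len)
  {
          struct sock_extended_err *serr;
          struct msghdr msg = {0};
          struct cmsghdr *cm;
          char control[128];

          if (send(fd, buf, len, MSG_ZEROCOPY) != (ssize_t)len)
                  return -1;

          /* the completion arrives on the error queue; such reads never block */
          msg.msg_control = control;
          msg.msg_controllen = sizeof(control);
          while (recvmsg(fd, &msg, MSG_ERRQUEUE) == -1) {
                  if (errno != EAGAIN)
                          return -1;
          }

          for (cm = CMSG_FIRSTHDR(&msg); cm; cm = CMSG_NXTHDR(&msg, cm)) {
                  serr = (struct sock_extended_err *)CMSG_DATA(cm);
                  if (serr->ee_errno == 0 &&
                      serr->ee_origin == SO_EE_ORIGIN_ZEROCOPY)
                          return 0;  /* sends [ee_info, ee_data] completed:
                                      * buf may be modified again */
          }
          return -1;
  }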

* Performance

The below table shows cycles reported by perf for a netperf process
sending a single 10 Gbps TCP_STREAM. The first three columns show
Mcycles spent in the netperf process context. The second three columns
show time spent systemwide (-a -C A,B) on the two cpus that run the
process and interrupt handler. Reported is the median of at least 3
runs. std is a standard netperf, zc uses zerocopy and % is the ratio.
Netperf is pinned to cpu 2, network interrupts to cpu 3, RPS and RFS
are disabled and the kernel is booted with idle=halt.

NETPERF=./netperf -t TCP_STREAM -H $host -T 2 -l 30 -- -m $size

perf stat -e cycles $NETPERF
perf stat -C 2,3 -a -e cycles $NETPERF

        --process cycles--      ----cpu cycles----
           std      zc   %      std         zc   %
4K      27,609  11,217  41      49,217  39,175  79
16K     21,370   3,823  18      43,540  29,213  67
64K     20,557   2,312  11      42,189  26,910  64
256K    21,110   2,134  10      43,006  27,104  63
1M      20,987   1,610   8      42,759  25,931  61

Perf record indicates the main source of these differences. Process
cycles only at 1M writes (perf record; perf report -n):

std:
Samples: 42K of event 'cycles', Event count (approx.): 21258597313                                                   
 79.41%         33884  netperf  [kernel.kallsyms]  [k] copy_user_generic_string
  3.27%          1396  netperf  [kernel.kallsyms]  [k] tcp_sendmsg
  1.66%           694  netperf  [kernel.kallsyms]  [k] get_page_from_freelist
  0.79%           325  netperf  [kernel.kallsyms]  [k] tcp_ack
  0.43%           188  netperf  [kernel.kallsyms]  [k] __alloc_skb

zc:
Samples: 1K of event 'cycles', Event count (approx.): 1439509124                                                     
 30.36%           584  netperf.zerocop  [kernel.kallsyms]  [k] gup_pte_range
 14.63%           284  netperf.zerocop  [kernel.kallsyms]  [k] __zerocopy_sg_from_iter
  8.03%           159  netperf.zerocop  [kernel.kallsyms]  [k] skb_zerocopy_add_frags_iter
  4.84%            96  netperf.zerocop  [kernel.kallsyms]  [k] __alloc_skb
  3.10%            60  netperf.zerocop  [kernel.kallsyms]  [k] kmem_cache_alloc_node


* Safety

The number of pages that can be pinned on behalf of a user with
MSG_ZEROCOPY is bounded by the locked-memory ulimit (RLIMIT_MEMLOCK).
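
A process without CAP_IPC_LOCK that exceeds this bound will see its
zerocopy sends fail (-ENOBUFS in the TCP and UDP patches). A minimal
sketch for raising the limit beforehand, assuming the caller has
CAP_SYS_RESOURCE or a sufficiently high hard limit:

  #include <sys/resource.h>

  /* illustrative only: raise RLIMIT_MEMLOCK so that MSG_ZEROCOPY page
   * pinning is not rejected by the per-user bound added in this series
   */
  static int raise_memlock(rlim_t bytes)
  {
          struct rlimit rl = { .rlim_cur = bytes, .rlim_max = bytes };

          return setrlimit(RLIMIT_MEMLOCK, &rl);
  }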

While the kernel holds process memory pinned, a process cannot safely
reuse those pages for other purposes. Packets looped onto the receive
stack and queued to a socket can be held indefinitely. Avoid unbounded
notification latency by restricting user pages to egress paths only.
skb_orphan_frags_rx() will create a private copy of pages even for
refcounted packets when these are looped, as did skb_orphan_frags for
the original tun zerocopy implementation.

Pages are not remapped read-only. Processes can modify packet contents
while packets are in flight in the kernel path. Bytes on which kernel
control flow depends (headers) are copied to avoid TOCTTOU attacks.
Datapath integrity does not otherwise depend on payload, with three
exceptions: checksums, optional sk_filter/tc u32/.. and device +
driver logic. The effect of wrong checksums is limited to the
misbehaving process. TC filters that access contents may have to be
excluded by adding an skb_orphan_frags_rx call.

Processes can also safely avoid OOM conditions by bounding the number
of bytes passed with MSG_ZEROCOPY and by removing shared pages after
transmission from their own memory map.


* Limitations / Known Issues

- PF_INET6 is not yet supported.
- TCP does not build max GSO packets, especially for
     small send buffers (< 4 KB)

Willem de Bruijn (12):
  sock: allocate skbs from optmem
  sock: skb_copy_ubufs support for compound pages
  sock: add generic socket zerocopy
  sock: enable sendmsg zerocopy
  sock: sendmsg zerocopy notification coalescing
  sock: sendmsg zerocopy ulimit
  sock: sendmsg zerocopy limit bytes per notification
  tcp: enable sendmsg zerocopy
  udp: enable sendmsg zerocopy
  raw: enable sendmsg zerocopy with IP_HDRINCL
  packet: enable sendmsg zerocopy
  test: add sendmsg zerocopy tests

 drivers/net/tun.c                             |   2 +-
 drivers/vhost/net.c                           |   1 +
 include/linux/sched.h                         |   2 +-
 include/linux/skbuff.h                        |  94 +++-
 include/linux/socket.h                        |   1 +
 include/net/sock.h                            |   4 +
 include/uapi/linux/errqueue.h                 |   1 +
 net/core/datagram.c                           |  35 +-
 net/core/dev.c                                |   4 +-
 net/core/skbuff.c                             | 327 ++++++++++++--
 net/core/sock.c                               |  29 ++
 net/ipv4/ip_output.c                          |  34 +-
 net/ipv4/raw.c                                |  27 +-
 net/ipv4/tcp.c                                |  37 +-
 net/packet/af_packet.c                        |  52 ++-
 tools/testing/selftests/net/.gitignore        |   2 +
 tools/testing/selftests/net/Makefile          |   1 +
 tools/testing/selftests/net/snd_zerocopy.c    | 354 +++++++++++++++
 tools/testing/selftests/net/snd_zerocopy_lo.c | 596 ++++++++++++++++++++++++++
 19 files changed, 1536 insertions(+), 67 deletions(-)
 create mode 100644 tools/testing/selftests/net/snd_zerocopy.c
 create mode 100644 tools/testing/selftests/net/snd_zerocopy_lo.c

-- 
2.11.0.483.g087da7b7c-goog


* [PATCH RFC v2 01/12] sock: allocate skbs from optmem
From: Willem de Bruijn @ 2017-02-22 16:38 UTC (permalink / raw)
  To: netdev; +Cc: Willem de Bruijn

From: Willem de Bruijn <willemb@google.com>

Add sock_omalloc and sock_ofree to allocate control skbs, for
instance for looping errors onto sk_error_queue.

The transmit budget (sk_wmem_alloc) is involved in transmit skb
shaping, most notably in TCP Small Queues. Using this budget for
control packets would impact transmission.
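
A rough sketch of the intended call pattern, as used later in this
series for the zerocopy notification skbs:

  /* sketch: a small control skb charged to sk_omem_alloc, not wmem */
  static struct sk_buff *alloc_notify_skb(struct sock *sk)
  {
          struct sk_buff *skb;

          skb = sock_omalloc(sk, 0, GFP_KERNEL);
          if (!skb)
                  return NULL;

          /* fill skb->cb / SKB_EXT_ERR(skb) and eventually queue onto
           * sk->sk_error_queue; sock_ofree() uncharges sk_omem_alloc
           * when the skb is freed
           */
          return skb;
  }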

Signed-off-by: Willem de Bruijn <willemb@google.com>
---
 include/net/sock.h |  2 ++
 net/core/sock.c    | 27 +++++++++++++++++++++++++++
 2 files changed, 29 insertions(+)

diff --git a/include/net/sock.h b/include/net/sock.h
index 9ccefa5c5487..c1a8b2cbc75e 100644
--- a/include/net/sock.h
+++ b/include/net/sock.h
@@ -1531,6 +1531,8 @@ struct sk_buff *sock_wmalloc(struct sock *sk, unsigned long size, int force,
 			     gfp_t priority);
 void __sock_wfree(struct sk_buff *skb);
 void sock_wfree(struct sk_buff *skb);
+struct sk_buff *sock_omalloc(struct sock *sk, unsigned long size,
+			     gfp_t priority);
 void skb_orphan_partial(struct sk_buff *skb);
 void sock_rfree(struct sk_buff *skb);
 void sock_efree(struct sk_buff *skb);
diff --git a/net/core/sock.c b/net/core/sock.c
index e7d74940e863..57a7da46ac52 100644
--- a/net/core/sock.c
+++ b/net/core/sock.c
@@ -1772,6 +1772,33 @@ struct sk_buff *sock_wmalloc(struct sock *sk, unsigned long size, int force,
 }
 EXPORT_SYMBOL(sock_wmalloc);
 
+static void sock_ofree(struct sk_buff *skb)
+{
+	struct sock *sk = skb->sk;
+
+	atomic_sub(skb->truesize, &sk->sk_omem_alloc);
+}
+
+struct sk_buff *sock_omalloc(struct sock *sk, unsigned long size,
+			     gfp_t priority)
+{
+	struct sk_buff *skb;
+
+	/* small safe race: SKB_TRUESIZE may differ from final skb->truesize */
+	if (atomic_read(&sk->sk_omem_alloc) + SKB_TRUESIZE(size) >
+	    sysctl_optmem_max)
+		return NULL;
+
+	skb = alloc_skb(size, priority);
+	if (!skb)
+		return NULL;
+
+	atomic_add(skb->truesize, &sk->sk_omem_alloc);
+	skb->sk = sk;
+	skb->destructor = sock_ofree;
+	return skb;
+}
+
 /*
  * Allocate a memory block from the socket's option memory buffer.
  */
-- 
2.11.0.483.g087da7b7c-goog


* [PATCH RFC v2 02/12] sock: skb_copy_ubufs support for compound pages
From: Willem de Bruijn @ 2017-02-22 16:38 UTC (permalink / raw)
  To: netdev; +Cc: Willem de Bruijn

From: Willem de Bruijn <willemb@google.com>

Refine skb_copy_ubufs to support compound pages. With upcoming TCP
and UDP zerocopy sendmsg, such fragments may appear.

These skbuffs can have both kernel and zerocopy fragments, e.g., when
corking. Avoid unnecessary copying of fragments that have no userspace
reference.

It is not safe to modify skb frags when the skbuff is shared. This
should not happen. Fail loudly if we find an unexpected edge case.

Signed-off-by: Willem de Bruijn <willemb@google.com>
---
 net/core/skbuff.c | 24 +++++++++++++++++++++++-
 1 file changed, 23 insertions(+), 1 deletion(-)

diff --git a/net/core/skbuff.c b/net/core/skbuff.c
index f3557958e9bf..67e4216fca01 100644
--- a/net/core/skbuff.c
+++ b/net/core/skbuff.c
@@ -944,6 +944,9 @@ EXPORT_SYMBOL_GPL(skb_morph);
  *	If this function is called from an interrupt gfp_mask() must be
  *	%GFP_ATOMIC.
  *
+ *	skb_shinfo(skb) can only be safely modified when not accessed
+ *	concurrently. Fail if the skb is shared or cloned.
+ *
  *	Returns 0 on success or a negative error code on failure
  *	to allocate kernel memory to copy to.
  */
@@ -954,11 +957,29 @@ int skb_copy_ubufs(struct sk_buff *skb, gfp_t gfp_mask)
 	struct page *page, *head = NULL;
 	struct ubuf_info *uarg = skb_shinfo(skb)->destructor_arg;
 
+	if (skb_shared(skb) || skb_cloned(skb)) {
+		WARN_ON_ONCE(1);
+		return -EINVAL;
+	}
+
 	for (i = 0; i < num_frags; i++) {
 		u8 *vaddr;
+		unsigned int order = 0;
+		gfp_t mask = gfp_mask;
 		skb_frag_t *f = &skb_shinfo(skb)->frags[i];
 
-		page = alloc_page(gfp_mask);
+		page = skb_frag_page(f);
+		if (page_count(page) == 1) {
+			skb_frag_ref(skb, i);
+			goto copy_done;
+		}
+
+		if (f->size > PAGE_SIZE) {
+			order = get_order(f->size);
+			mask |= __GFP_COMP;
+		}
+
+		page = alloc_pages(mask, order);
 		if (!page) {
 			while (head) {
 				struct page *next = (struct page *)page_private(head);
@@ -971,6 +992,7 @@ int skb_copy_ubufs(struct sk_buff *skb, gfp_t gfp_mask)
 		memcpy(page_address(page),
 		       vaddr + f->page_offset, skb_frag_size(f));
 		kunmap_atomic(vaddr);
+copy_done:
 		set_page_private(page, (unsigned long)head);
 		head = page;
 	}
-- 
2.11.0.483.g087da7b7c-goog


* [PATCH RFC v2 03/12] sock: add generic socket zerocopy
From: Willem de Bruijn @ 2017-02-22 16:38 UTC (permalink / raw)
  To: netdev; +Cc: Willem de Bruijn

From: Willem de Bruijn <willemb@google.com>

The kernel supports zerocopy sendmsg in virtio and tap. Expand the
infrastructure to support other socket types. Introduce a completion
notification channel over the socket error queue. Notifications are
returned with ee_origin SO_EE_ORIGIN_ZEROCOPY. ee_errno is 0 to avoid
blocking the send/recv path on receiving notifications.

Add reference counting, to support the skb split, merge, resize and
clone operations possible with SOCK_STREAM and other socket types.

The patch does not yet modify any datapaths.
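
A protocol that adopts this interface (as the TCP and UDP patches
later in this series do) roughly follows the pattern below. This is a
sketch only: locking, error unwinding and the notification coalescing
added in a later patch are elided.

  /* sketch: expected use from a protocol sendmsg handler */
  static int zc_add_data(struct sock *sk, struct sk_buff *skb,
                         struct msghdr *msg, size_t size, int copy)
  {
          struct ubuf_info *uarg;
          int copied;

          uarg = sock_zerocopy_alloc(sk, size);
          if (!uarg)
                  return -ENOBUFS;
          sock_zerocopy_get(uarg);        /* the caller's reference */

          copied = skb_zerocopy_add_frags_iter(sk, skb, &msg->msg_iter,
                                               copy, uarg);
          /* on success the skb frags now reference user pages and the
           * skb took its own reference via skb_zcopy_set()
           */

          sock_zerocopy_put(uarg);        /* drop the caller's reference;
                                           * the final put, once the data
                                           * skb is released, runs
                                           * sock_zerocopy_callback() and
                                           * queues the notification
                                           */
          return copied;
  }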

Signed-off-by: Willem de Bruijn <willemb@google.com>
---
 include/linux/skbuff.h        |  46 ++++++++++++++++
 include/linux/socket.h        |   1 +
 include/net/sock.h            |   2 +
 include/uapi/linux/errqueue.h |   1 +
 net/core/datagram.c           |  35 ++++++++----
 net/core/skbuff.c             | 120 ++++++++++++++++++++++++++++++++++++++++++
 net/core/sock.c               |   2 +
 7 files changed, 196 insertions(+), 11 deletions(-)

diff --git a/include/linux/skbuff.h b/include/linux/skbuff.h
index 69ccd2636911..c99538b258c9 100644
--- a/include/linux/skbuff.h
+++ b/include/linux/skbuff.h
@@ -390,6 +390,7 @@ enum {
 	SKBTX_SCHED_TSTAMP = 1 << 6,
 };
 
+#define SKBTX_ZEROCOPY_FRAG	(SKBTX_DEV_ZEROCOPY | SKBTX_SHARED_FRAG)
 #define SKBTX_ANY_SW_TSTAMP	(SKBTX_SW_TSTAMP    | \
 				 SKBTX_SCHED_TSTAMP)
 #define SKBTX_ANY_TSTAMP	(SKBTX_HW_TSTAMP | SKBTX_ANY_SW_TSTAMP)
@@ -406,8 +407,27 @@ struct ubuf_info {
 	void (*callback)(struct ubuf_info *, bool zerocopy_success);
 	void *ctx;
 	unsigned long desc;
+	atomic_t refcnt;
 };
 
+#define skb_uarg(SKB)	((struct ubuf_info *)(skb_shinfo(SKB)->destructor_arg))
+
+struct ubuf_info *sock_zerocopy_alloc(struct sock *sk, size_t size);
+
+static inline void sock_zerocopy_get(struct ubuf_info *uarg)
+{
+	atomic_inc(&uarg->refcnt);
+}
+
+void sock_zerocopy_put(struct ubuf_info *uarg);
+
+void sock_zerocopy_callback(struct ubuf_info *uarg, bool success);
+
+bool skb_zerocopy_alloc(struct sk_buff *skb, size_t size);
+int skb_zerocopy_add_frags_iter(struct sock *sk, struct sk_buff *skb,
+				struct iov_iter *iter, int len,
+				struct ubuf_info *uarg);
+
 /* This data is invariant across clones and lives at
  * the end of the header data, ie. at skb->end.
  */
@@ -1230,6 +1250,32 @@ static inline struct skb_shared_hwtstamps *skb_hwtstamps(struct sk_buff *skb)
 	return &skb_shinfo(skb)->hwtstamps;
 }
 
+static inline struct ubuf_info *skb_zcopy(struct sk_buff *skb)
+{
+	bool is_zcopy = skb && skb_shinfo(skb)->tx_flags & SKBTX_DEV_ZEROCOPY;
+
+	return is_zcopy ? skb_uarg(skb) : NULL;
+}
+
+static inline void skb_zcopy_set(struct sk_buff *skb, struct ubuf_info *uarg)
+{
+	if (uarg) {
+		sock_zerocopy_get(uarg);
+		skb_shinfo(skb)->destructor_arg = uarg;
+		skb_shinfo(skb)->tx_flags |= SKBTX_ZEROCOPY_FRAG;
+	}
+}
+
+static inline void skb_zcopy_clear(struct sk_buff *skb)
+{
+	struct ubuf_info *uarg = skb_zcopy(skb);
+
+	if (uarg) {
+		sock_zerocopy_put(uarg);
+		skb_shinfo(skb)->tx_flags &= ~SKBTX_DEV_ZEROCOPY;
+	}
+}
+
 /**
  *	skb_queue_empty - check if a queue is empty
  *	@list: queue head
diff --git a/include/linux/socket.h b/include/linux/socket.h
index 082027457825..c2d6ec354bee 100644
--- a/include/linux/socket.h
+++ b/include/linux/socket.h
@@ -287,6 +287,7 @@ struct ucred {
 #define MSG_BATCH	0x40000 /* sendmmsg(): more messages coming */
 #define MSG_EOF         MSG_FIN
 
+#define MSG_ZEROCOPY	0x4000000	/* Use user data in kernel path */
 #define MSG_FASTOPEN	0x20000000	/* Send data in TCP SYN */
 #define MSG_CMSG_CLOEXEC 0x40000000	/* Set close_on_exec for file
 					   descriptor received through
diff --git a/include/net/sock.h b/include/net/sock.h
index c1a8b2cbc75e..74ad7d7c5eed 100644
--- a/include/net/sock.h
+++ b/include/net/sock.h
@@ -288,6 +288,7 @@ struct sock_common {
   *	@sk_stamp: time stamp of last packet received
   *	@sk_tsflags: SO_TIMESTAMPING socket options
   *	@sk_tskey: counter to disambiguate concurrent tstamp requests
+  *	@sk_zckey: counter to order MSG_ZEROCOPY notifications
   *	@sk_socket: Identd and reporting IO signals
   *	@sk_user_data: RPC layer private data
   *	@sk_frag: cached page frag
@@ -455,6 +456,7 @@ struct sock {
 	u16			sk_tsflags;
 	u8			sk_shutdown;
 	u32			sk_tskey;
+	atomic_t		sk_zckey;
 	struct socket		*sk_socket;
 	void			*sk_user_data;
 #ifdef CONFIG_SECURITY
diff --git a/include/uapi/linux/errqueue.h b/include/uapi/linux/errqueue.h
index 07bdce1f444a..0f15a77c9e39 100644
--- a/include/uapi/linux/errqueue.h
+++ b/include/uapi/linux/errqueue.h
@@ -18,6 +18,7 @@ struct sock_extended_err {
 #define SO_EE_ORIGIN_ICMP	2
 #define SO_EE_ORIGIN_ICMP6	3
 #define SO_EE_ORIGIN_TXSTATUS	4
+#define SO_EE_ORIGIN_ZEROCOPY	5
 #define SO_EE_ORIGIN_TIMESTAMPING SO_EE_ORIGIN_TXSTATUS
 
 #define SO_EE_OFFENDER(ee)	((struct sockaddr*)((ee)+1))
diff --git a/net/core/datagram.c b/net/core/datagram.c
index ea633342ab0d..79db53c6bcf4 100644
--- a/net/core/datagram.c
+++ b/net/core/datagram.c
@@ -564,17 +564,12 @@ EXPORT_SYMBOL(skb_copy_datagram_from_iter);
  *
  *	Returns 0, -EFAULT or -EMSGSIZE.
  */
-int zerocopy_sg_from_iter(struct sk_buff *skb, struct iov_iter *from)
+int __zerocopy_sg_from_iter(struct sock *sk, struct sk_buff *skb,
+			    struct iov_iter *from, size_t length)
 {
-	int len = iov_iter_count(from);
-	int copy = min_t(int, skb_headlen(skb), len);
-	int frag = 0;
-
-	/* copy up to skb headlen */
-	if (skb_copy_datagram_from_iter(skb, 0, from, copy))
-		return -EFAULT;
+	int frag = skb_shinfo(skb)->nr_frags;
 
-	while (iov_iter_count(from)) {
+	while (length && iov_iter_count(from)) {
 		struct page *pages[MAX_SKB_FRAGS];
 		size_t start;
 		ssize_t copied;
@@ -584,18 +579,24 @@ int zerocopy_sg_from_iter(struct sk_buff *skb, struct iov_iter *from)
 		if (frag == MAX_SKB_FRAGS)
 			return -EMSGSIZE;
 
-		copied = iov_iter_get_pages(from, pages, ~0U,
+		copied = iov_iter_get_pages(from, pages, length,
 					    MAX_SKB_FRAGS - frag, &start);
 		if (copied < 0)
 			return -EFAULT;
 
 		iov_iter_advance(from, copied);
+		length -= copied;
 
 		truesize = PAGE_ALIGN(copied + start);
 		skb->data_len += copied;
 		skb->len += copied;
 		skb->truesize += truesize;
-		atomic_add(truesize, &skb->sk->sk_wmem_alloc);
+		if (sk && sk->sk_type == SOCK_STREAM) {
+			sk->sk_wmem_queued += truesize;
+			sk_mem_charge(sk, truesize);
+		} else {
+			atomic_add(truesize, &skb->sk->sk_wmem_alloc);
+		}
 		while (copied) {
 			int size = min_t(int, copied, PAGE_SIZE - start);
 			skb_fill_page_desc(skb, frag++, pages[n], start, size);
@@ -606,6 +607,18 @@ int zerocopy_sg_from_iter(struct sk_buff *skb, struct iov_iter *from)
 	}
 	return 0;
 }
+EXPORT_SYMBOL(__zerocopy_sg_from_iter);
+
+int zerocopy_sg_from_iter(struct sk_buff *skb, struct iov_iter *from)
+{
+	int copy = min_t(int, skb_headlen(skb), iov_iter_count(from));
+
+	/* copy up to skb headlen */
+	if (skb_copy_datagram_from_iter(skb, 0, from, copy))
+		return -EFAULT;
+
+	return __zerocopy_sg_from_iter(NULL, skb, from, ~0U);
+}
 EXPORT_SYMBOL(zerocopy_sg_from_iter);
 
 static int skb_copy_and_csum_datagram(const struct sk_buff *skb, int offset,
diff --git a/net/core/skbuff.c b/net/core/skbuff.c
index 67e4216fca01..d566f85a7690 100644
--- a/net/core/skbuff.c
+++ b/net/core/skbuff.c
@@ -932,6 +932,126 @@ struct sk_buff *skb_morph(struct sk_buff *dst, struct sk_buff *src)
 }
 EXPORT_SYMBOL_GPL(skb_morph);
 
+/* must only be called from process context */
+struct ubuf_info *sock_zerocopy_alloc(struct sock *sk, size_t size)
+{
+	struct sk_buff *skb;
+	struct ubuf_info *uarg;
+
+	skb = sock_omalloc(sk, 0, GFP_KERNEL);
+	if (!skb)
+		return NULL;
+
+	BUILD_BUG_ON(sizeof(*uarg) > sizeof(skb->cb));
+	uarg = (void *)skb->cb;
+
+	uarg->callback = sock_zerocopy_callback;
+	uarg->desc = atomic_inc_return(&sk->sk_zckey) - 1;
+	atomic_set(&uarg->refcnt, 0);
+	sock_hold(sk);
+
+	return uarg;
+}
+EXPORT_SYMBOL_GPL(sock_zerocopy_alloc);
+
+static inline struct sk_buff *skb_from_uarg(struct ubuf_info *uarg)
+{
+	return container_of((void *)uarg, struct sk_buff, cb);
+}
+
+void sock_zerocopy_callback(struct ubuf_info *uarg, bool success)
+{
+	struct sock_exterr_skb *serr;
+	struct sk_buff *skb = skb_from_uarg(uarg);
+	struct sock *sk = skb->sk;
+	u16 id = uarg->desc;
+
+	serr = SKB_EXT_ERR(skb);
+	memset(serr, 0, sizeof(*serr));
+	serr->ee.ee_errno = 0;
+	serr->ee.ee_origin = SO_EE_ORIGIN_ZEROCOPY;
+	serr->ee.ee_data = id;
+
+	skb_queue_tail(&sk->sk_error_queue, skb);
+
+	if (!sock_flag(sk, SOCK_DEAD))
+		sk->sk_error_report(sk);
+
+	sock_put(sk);
+}
+EXPORT_SYMBOL_GPL(sock_zerocopy_callback);
+
+void sock_zerocopy_put(struct ubuf_info *uarg)
+{
+	if (uarg && atomic_dec_and_test(&uarg->refcnt)) {
+		if (uarg->callback)
+			uarg->callback(uarg, true);
+		else
+			consume_skb(skb_from_uarg(uarg));
+	}
+}
+EXPORT_SYMBOL_GPL(sock_zerocopy_put);
+
+bool skb_zerocopy_alloc(struct sk_buff *skb, size_t size)
+{
+	struct ubuf_info *uarg;
+
+	uarg = sock_zerocopy_alloc(skb->sk, size);
+	if (!uarg)
+		return false;
+
+	skb_zcopy_set(skb, uarg);
+	return true;
+}
+EXPORT_SYMBOL_GPL(skb_zerocopy_alloc);
+
+extern int __zerocopy_sg_from_iter(struct sock *sk, struct sk_buff *skb,
+				   struct iov_iter *from, size_t length);
+
+int skb_zerocopy_add_frags_iter(struct sock *sk, struct sk_buff *skb,
+				struct iov_iter *iter, int len,
+				struct ubuf_info *uarg)
+{
+	struct ubuf_info *orig_uarg = skb_zcopy(skb);
+	struct iov_iter orig_iter = *iter;
+	int ret, orig_len = skb->len;
+
+	if (orig_uarg && orig_uarg != uarg)
+		return -EEXIST;
+
+	ret = __zerocopy_sg_from_iter(sk, skb, iter, len);
+	if (ret && (ret != -EMSGSIZE || skb->len == orig_len)) {
+		*iter = orig_iter;
+		___pskb_trim(skb, orig_len);
+		return ret;
+	}
+
+	if (!orig_uarg)
+		skb_zcopy_set(skb, uarg);
+
+	return skb->len - orig_len;
+}
+EXPORT_SYMBOL_GPL(skb_zerocopy_add_frags_iter);
+
+/* unused only until next patch in the series; will remove attribute */
+static int __attribute__((unused))
+	   skb_zerocopy_clone(struct sk_buff *nskb, struct sk_buff *orig,
+			      gfp_t gfp_mask)
+{
+	if (skb_zcopy(orig)) {
+		if (skb_zcopy(nskb)) {
+			/* !gfp_mask callers are verified to !skb_zcopy(nskb) */
+			BUG_ON(!gfp_mask);
+			if (skb_uarg(nskb) == skb_uarg(orig))
+				return 0;
+			if (skb_copy_ubufs(nskb, GFP_ATOMIC))
+				return -EIO;
+		}
+		skb_zcopy_set(nskb, skb_uarg(orig));
+	}
+	return 0;
+}
+
 /**
  *	skb_copy_ubufs	-	copy userspace skb frags buffers to kernel
  *	@skb: the skb to modify
diff --git a/net/core/sock.c b/net/core/sock.c
index 57a7da46ac52..8f8203565ac4 100644
--- a/net/core/sock.c
+++ b/net/core/sock.c
@@ -1526,6 +1526,7 @@ struct sock *sk_clone_lock(const struct sock *sk, const gfp_t priority)
 		atomic_set(&newsk->sk_drops, 0);
 		newsk->sk_send_head	= NULL;
 		newsk->sk_userlocks	= sk->sk_userlocks & ~SOCK_BINDPORT_LOCK;
+		atomic_set(&newsk->sk_zckey, 0);
 
 		sock_reset_flag(newsk, SOCK_DONE);
 		skb_queue_head_init(&newsk->sk_error_queue);
@@ -2524,6 +2525,7 @@ void sock_init_data(struct socket *sock, struct sock *sk)
 	sk->sk_sndtimeo		=	MAX_SCHEDULE_TIMEOUT;
 
 	sk->sk_stamp = ktime_set(-1L, 0);
+	atomic_set(&sk->sk_zckey, 0);
 
 #ifdef CONFIG_NET_RX_BUSY_POLL
 	sk->sk_napi_id		=	0;
-- 
2.11.0.483.g087da7b7c-goog


* [PATCH RFC v2 04/12] sock: enable sendmsg zerocopy
From: Willem de Bruijn @ 2017-02-22 16:38 UTC (permalink / raw)
  To: netdev; +Cc: Willem de Bruijn

From: Willem de Bruijn <willemb@google.com>

Prepare the datapath for refcounted ubuf_info. Clone ubuf_info with
skb_zerocopy_clone() wherever needed due to skb split, merge, resize
or clone.

Split skb_orphan_frags into two variants. The split, merge, .. paths
support reference counted zerocopy buffers, so do not do a deep copy.
Add skb_orphan_frags_rx for paths that may loop packets to receive
sockets. That is not allowed, as it may cause unbounded latency.
Deep copy all zerocopy buffers, ref-counted or not, in this path.

The exact locations to modify were chosen by exhaustively searching
through all code that might modify skb_frag references and/or the
SKBTX_DEV_ZEROCOPY tx_flags bit.

The changes err on the safe side, in two ways.

(1) legacy ubuf_info paths virtio and tap are not modified. They keep
    a 1:1 ubuf_info to sk_buff relationship. Calls to skb_orphan_frags
    still call skb_copy_ubufs and thus copy frags in this case.

(2) not all copies deep in the stack are addressed yet. skb_shift,
    skb_split and skb_try_coalesce can be refined to avoid copying.
    These are not in the hot path and this patch is hairy enough as
    is, so that is left for future refinement.

Signed-off-by: Willem de Bruijn <willemb@google.com>
---
 drivers/net/tun.c      |  2 +-
 drivers/vhost/net.c    |  1 +
 include/linux/skbuff.h | 16 ++++++++++++++--
 net/core/dev.c         |  4 ++--
 net/core/skbuff.c      | 52 +++++++++++++++++++++-----------------------------
 5 files changed, 40 insertions(+), 35 deletions(-)

diff --git a/drivers/net/tun.c b/drivers/net/tun.c
index 30863e378925..b80c7fdcb05b 100644
--- a/drivers/net/tun.c
+++ b/drivers/net/tun.c
@@ -880,7 +880,7 @@ static netdev_tx_t tun_net_xmit(struct sk_buff *skb, struct net_device *dev)
 	    sk_filter(tfile->socket.sk, skb))
 		goto drop;
 
-	if (unlikely(skb_orphan_frags(skb, GFP_ATOMIC)))
+	if (unlikely(skb_orphan_frags_rx(skb, GFP_ATOMIC)))
 		goto drop;
 
 	skb_tx_timestamp(skb);
diff --git a/drivers/vhost/net.c b/drivers/vhost/net.c
index 2fe35354f20e..f7ff72ed892f 100644
--- a/drivers/vhost/net.c
+++ b/drivers/vhost/net.c
@@ -454,6 +454,7 @@ static void handle_tx(struct vhost_net *net)
 			ubuf->callback = vhost_zerocopy_callback;
 			ubuf->ctx = nvq->ubufs;
 			ubuf->desc = nvq->upend_idx;
+			atomic_set(&ubuf->refcnt, 1);
 			msg.msg_control = ubuf;
 			msg.msg_controllen = sizeof(ubuf);
 			ubufs = nvq->ubufs;
diff --git a/include/linux/skbuff.h b/include/linux/skbuff.h
index c99538b258c9..c7b42272b409 100644
--- a/include/linux/skbuff.h
+++ b/include/linux/skbuff.h
@@ -2448,7 +2448,7 @@ static inline void skb_orphan(struct sk_buff *skb)
 }
 
 /**
- *	skb_orphan_frags - orphan the frags contained in a buffer
+ *	skb_orphan_frags - make a local copy of non-refcounted user frags
  *	@skb: buffer to orphan frags from
  *	@gfp_mask: allocation mask for replacement pages
  *
@@ -2458,7 +2458,17 @@ static inline void skb_orphan(struct sk_buff *skb)
  */
 static inline int skb_orphan_frags(struct sk_buff *skb, gfp_t gfp_mask)
 {
-	if (likely(!(skb_shinfo(skb)->tx_flags & SKBTX_DEV_ZEROCOPY)))
+	if (likely(!skb_zcopy(skb)))
+		return 0;
+	if (skb_uarg(skb)->callback == sock_zerocopy_callback)
+		return 0;
+	return skb_copy_ubufs(skb, gfp_mask);
+}
+
+/* Frags must be orphaned, even if refcounted, if skb might loop to rx path */
+static inline int skb_orphan_frags_rx(struct sk_buff *skb, gfp_t gfp_mask)
+{
+	if (likely(!skb_zcopy(skb)))
 		return 0;
 	return skb_copy_ubufs(skb, gfp_mask);
 }
@@ -2890,6 +2900,8 @@ static inline int skb_add_data(struct sk_buff *skb,
 static inline bool skb_can_coalesce(struct sk_buff *skb, int i,
 				    const struct page *page, int off)
 {
+	if (skb_zcopy(skb))
+		return false;
 	if (i) {
 		const struct skb_frag_struct *frag = &skb_shinfo(skb)->frags[i - 1];
 
diff --git a/net/core/dev.c b/net/core/dev.c
index 304f2deae5f9..7879225818da 100644
--- a/net/core/dev.c
+++ b/net/core/dev.c
@@ -1801,7 +1801,7 @@ static inline int deliver_skb(struct sk_buff *skb,
 			      struct packet_type *pt_prev,
 			      struct net_device *orig_dev)
 {
-	if (unlikely(skb_orphan_frags(skb, GFP_ATOMIC)))
+	if (unlikely(skb_orphan_frags_rx(skb, GFP_ATOMIC)))
 		return -ENOMEM;
 	atomic_inc(&skb->users);
 	return pt_prev->func(skb, skb->dev, pt_prev, orig_dev);
@@ -4173,7 +4173,7 @@ static int __netif_receive_skb_core(struct sk_buff *skb, bool pfmemalloc)
 	}
 
 	if (pt_prev) {
-		if (unlikely(skb_orphan_frags(skb, GFP_ATOMIC)))
+		if (unlikely(skb_orphan_frags_rx(skb, GFP_ATOMIC)))
 			goto drop;
 		else
 			ret = pt_prev->func(skb, skb->dev, pt_prev, orig_dev);
diff --git a/net/core/skbuff.c b/net/core/skbuff.c
index d566f85a7690..fcbdc91b2d24 100644
--- a/net/core/skbuff.c
+++ b/net/core/skbuff.c
@@ -592,21 +592,10 @@ static void skb_release_data(struct sk_buff *skb)
 	for (i = 0; i < shinfo->nr_frags; i++)
 		__skb_frag_unref(&shinfo->frags[i]);
 
-	/*
-	 * If skb buf is from userspace, we need to notify the caller
-	 * the lower device DMA has done;
-	 */
-	if (shinfo->tx_flags & SKBTX_DEV_ZEROCOPY) {
-		struct ubuf_info *uarg;
-
-		uarg = shinfo->destructor_arg;
-		if (uarg->callback)
-			uarg->callback(uarg, true);
-	}
-
 	if (shinfo->frag_list)
 		kfree_skb_list(shinfo->frag_list);
 
+	skb_zcopy_clear(skb);
 	skb_free_head(skb);
 }
 
@@ -725,14 +714,7 @@ EXPORT_SYMBOL(kfree_skb_list);
  */
 void skb_tx_error(struct sk_buff *skb)
 {
-	if (skb_shinfo(skb)->tx_flags & SKBTX_DEV_ZEROCOPY) {
-		struct ubuf_info *uarg;
-
-		uarg = skb_shinfo(skb)->destructor_arg;
-		if (uarg->callback)
-			uarg->callback(uarg, false);
-		skb_shinfo(skb)->tx_flags &= ~SKBTX_DEV_ZEROCOPY;
-	}
+	skb_zcopy_clear(skb);
 }
 EXPORT_SYMBOL(skb_tx_error);
 
@@ -1033,9 +1015,7 @@ int skb_zerocopy_add_frags_iter(struct sock *sk, struct sk_buff *skb,
 }
 EXPORT_SYMBOL_GPL(skb_zerocopy_add_frags_iter);
 
-/* unused only until next patch in the series; will remove attribute */
-static int __attribute__((unused))
-	   skb_zerocopy_clone(struct sk_buff *nskb, struct sk_buff *orig,
+static int skb_zerocopy_clone(struct sk_buff *nskb, struct sk_buff *orig,
 			      gfp_t gfp_mask)
 {
 	if (skb_zcopy(orig)) {
@@ -1044,6 +1024,8 @@ static int __attribute__((unused))
 			BUG_ON(!gfp_mask);
 			if (skb_uarg(nskb) == skb_uarg(orig))
 				return 0;
+			/* nskb is always new, writable, so copy ubufs is ok */
+			BUG_ON(skb_shared(nskb) || skb_cloned(nskb));
 			if (skb_copy_ubufs(nskb, GFP_ATOMIC))
 				return -EIO;
 		}
@@ -1075,12 +1057,13 @@ int skb_copy_ubufs(struct sk_buff *skb, gfp_t gfp_mask)
 	int i;
 	int num_frags = skb_shinfo(skb)->nr_frags;
 	struct page *page, *head = NULL;
-	struct ubuf_info *uarg = skb_shinfo(skb)->destructor_arg;
 
-	if (skb_shared(skb) || skb_cloned(skb)) {
+	if (skb_shared(skb)) {
 		WARN_ON_ONCE(1);
 		return -EINVAL;
 	}
+	if (skb_unclone(skb, gfp_mask))
+		return -EINVAL;
 
 	for (i = 0; i < num_frags; i++) {
 		u8 *vaddr;
@@ -1121,8 +1104,6 @@ int skb_copy_ubufs(struct sk_buff *skb, gfp_t gfp_mask)
 	for (i = 0; i < num_frags; i++)
 		skb_frag_unref(skb, i);
 
-	uarg->callback(uarg, false);
-
 	/* skb frags point to kernel buffers */
 	for (i = num_frags - 1; i >= 0; i--) {
 		__skb_fill_page_desc(skb, i, head, 0,
@@ -1130,7 +1111,7 @@ int skb_copy_ubufs(struct sk_buff *skb, gfp_t gfp_mask)
 		head = (struct page *)page_private(head);
 	}
 
-	skb_shinfo(skb)->tx_flags &= ~SKBTX_DEV_ZEROCOPY;
+	skb_zcopy_clear(skb);
 	return 0;
 }
 EXPORT_SYMBOL_GPL(skb_copy_ubufs);
@@ -1291,11 +1272,13 @@ struct sk_buff *__pskb_copy_fclone(struct sk_buff *skb, int headroom,
 	if (skb_shinfo(skb)->nr_frags) {
 		int i;
 
-		if (skb_orphan_frags(skb, gfp_mask)) {
+		if (skb_orphan_frags(skb, gfp_mask) ||
+		    skb_zerocopy_clone(n, skb, gfp_mask)) {
 			kfree_skb(n);
 			n = NULL;
 			goto out;
 		}
+
 		for (i = 0; i < skb_shinfo(skb)->nr_frags; i++) {
 			skb_shinfo(n)->frags[i] = skb_shinfo(skb)->frags[i];
 			skb_frag_ref(skb, i);
@@ -1368,9 +1351,10 @@ int pskb_expand_head(struct sk_buff *skb, int nhead, int ntail,
 	 * be since all we did is relocate the values
 	 */
 	if (skb_cloned(skb)) {
-		/* copy this zero copy skb frags */
 		if (skb_orphan_frags(skb, gfp_mask))
 			goto nofrags;
+		if (skb_zcopy(skb))
+			atomic_inc(&skb_uarg(skb)->refcnt);
 		for (i = 0; i < skb_shinfo(skb)->nr_frags; i++)
 			skb_frag_ref(skb, i);
 
@@ -2466,6 +2450,7 @@ skb_zerocopy(struct sk_buff *to, struct sk_buff *from, int len, int hlen)
 		skb_tx_error(from);
 		return -ENOMEM;
 	}
+	skb_zerocopy_clone(to, from, GFP_ATOMIC);
 
 	for (i = 0; i < skb_shinfo(from)->nr_frags; i++) {
 		if (!len)
@@ -2762,6 +2747,7 @@ void skb_split(struct sk_buff *skb, struct sk_buff *skb1, const u32 len)
 	int pos = skb_headlen(skb);
 
 	skb_shinfo(skb1)->tx_flags = skb_shinfo(skb)->tx_flags & SKBTX_SHARED_FRAG;
+	skb_zerocopy_clone(skb1, skb, 0);
 	if (len < pos)	/* Split line is inside header. */
 		skb_split_inside_header(skb, skb1, len, pos);
 	else		/* Second chunk has no header, nothing to copy. */
@@ -2805,6 +2791,8 @@ int skb_shift(struct sk_buff *tgt, struct sk_buff *skb, int shiftlen)
 
 	if (skb_headlen(skb))
 		return 0;
+	if (skb_zcopy(tgt) || skb_zcopy(skb))
+		return 0;
 
 	todo = shiftlen;
 	from = 0;
@@ -3368,6 +3356,8 @@ struct sk_buff *skb_segment(struct sk_buff *head_skb,
 
 		skb_shinfo(nskb)->tx_flags = skb_shinfo(head_skb)->tx_flags &
 			SKBTX_SHARED_FRAG;
+		if (skb_zerocopy_clone(nskb, head_skb, GFP_ATOMIC))
+			goto err;
 
 		while (pos < offset + len) {
 			if (i >= nfrags) {
@@ -4446,6 +4436,8 @@ bool skb_try_coalesce(struct sk_buff *to, struct sk_buff *from,
 
 	if (skb_has_frag_list(to) || skb_has_frag_list(from))
 		return false;
+	if (skb_zcopy(to) || skb_zcopy(from))
+		return false;
 
 	if (skb_headlen(from) != 0) {
 		struct page *page;
-- 
2.11.0.483.g087da7b7c-goog


* [PATCH RFC v2 05/12] sock: sendmsg zerocopy notification coalescing
From: Willem de Bruijn @ 2017-02-22 16:38 UTC (permalink / raw)
  To: netdev; +Cc: Willem de Bruijn

From: Willem de Bruijn <willemb@google.com>

In the simple case, each sendmsg() call generates data and eventually
a zerocopy ready notification N, where N indicates the Nth successful
invocation of sendmsg() with the MSG_ZEROCOPY flag on this socket.

TCP and corked sockets can cause sendmsg() calls to append to a single
sk_buff and ubuf_info. Modify the notification path to return an
inclusive range of notifications [N..N+m].

Add skb_zerocopy_realloc() to reuse ubuf_info across sendmsg() calls
and modify the notification path to return a range.

For the case of reliable ordered transmission (TCP), only the upper
value of the range needs to be read, as the lower value is guaranteed
to be one above the last read notification.

Additionally, coalesce notifications in this common case: if an
skb_uarg [1, 1] is queued while [0, 0] is already on the queue,
just modify the head of the queue to read [0, 1].
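
On the receive side, the notification reports the completed range in
the ee_info (lower) and ee_data (upper) fields of the extended error.
The counter is an unsigned 32-bit value that can wrap, so a reader can
compute the number of completed sends roughly as in this illustrative
sketch:

  #include <linux/errqueue.h>
  #include <linux/types.h>

  /* illustrative: sendmsg calls covered by one zerocopy notification */
  static __u32 zerocopy_notify_count(const struct sock_extended_err *serr)
  {
          __u32 lo = serr->ee_info;
          __u32 hi = serr->ee_data;

          return hi - lo + 1;     /* unsigned arithmetic handles wraparound */
  }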

Signed-off-by: Willem de Bruijn <willemb@google.com>
---
 include/linux/skbuff.h | 21 +++++++++++-
 net/core/skbuff.c      | 92 +++++++++++++++++++++++++++++++++++++++++++++++---
 2 files changed, 107 insertions(+), 6 deletions(-)

diff --git a/include/linux/skbuff.h b/include/linux/skbuff.h
index c7b42272b409..eedac9fd3f0f 100644
--- a/include/linux/skbuff.h
+++ b/include/linux/skbuff.h
@@ -406,13 +406,21 @@ enum {
 struct ubuf_info {
 	void (*callback)(struct ubuf_info *, bool zerocopy_success);
 	void *ctx;
-	unsigned long desc;
+	union {
+		unsigned long desc;
+		struct {
+			u32 id;
+			u16 len;
+		};
+	};
 	atomic_t refcnt;
 };
 
 #define skb_uarg(SKB)	((struct ubuf_info *)(skb_shinfo(SKB)->destructor_arg))
 
 struct ubuf_info *sock_zerocopy_alloc(struct sock *sk, size_t size);
+struct ubuf_info *sock_zerocopy_realloc(struct sock *sk, size_t size,
+					struct ubuf_info *uarg);
 
 static inline void sock_zerocopy_get(struct ubuf_info *uarg)
 {
@@ -420,6 +428,7 @@ static inline void sock_zerocopy_get(struct ubuf_info *uarg)
 }
 
 void sock_zerocopy_put(struct ubuf_info *uarg);
+void sock_zerocopy_put_abort(struct ubuf_info *uarg);
 
 void sock_zerocopy_callback(struct ubuf_info *uarg, bool success);
 
@@ -1276,6 +1285,16 @@ static inline void skb_zcopy_clear(struct sk_buff *skb)
 	}
 }
 
+static inline void skb_zcopy_abort(struct sk_buff *skb)
+{
+	struct ubuf_info *uarg = skb_zcopy(skb);
+
+	if (uarg) {
+		sock_zerocopy_put_abort(uarg);
+		skb_shinfo(skb)->tx_flags &= ~SKBTX_DEV_ZEROCOPY;
+	}
+}
+
 /**
  *	skb_queue_empty - check if a queue is empty
  *	@list: queue head
diff --git a/net/core/skbuff.c b/net/core/skbuff.c
index fcbdc91b2d24..7a1d6e7703a6 100644
--- a/net/core/skbuff.c
+++ b/net/core/skbuff.c
@@ -928,7 +928,8 @@ struct ubuf_info *sock_zerocopy_alloc(struct sock *sk, size_t size)
 	uarg = (void *)skb->cb;
 
 	uarg->callback = sock_zerocopy_callback;
-	uarg->desc = atomic_inc_return(&sk->sk_zckey) - 1;
+	uarg->id = ((u32)atomic_inc_return(&sk->sk_zckey)) - 1;
+	uarg->len = 1;
 	atomic_set(&uarg->refcnt, 0);
 	sock_hold(sk);
 
@@ -941,24 +942,94 @@ static inline struct sk_buff *skb_from_uarg(struct ubuf_info *uarg)
 	return container_of((void *)uarg, struct sk_buff, cb);
 }
 
+struct ubuf_info *sock_zerocopy_realloc(struct sock *sk, size_t size,
+					struct ubuf_info *uarg)
+{
+	if (uarg) {
+		u32 next;
+
+		/* realloc only when socket is locked (TCP, UDP cork),
+		 * so uarg->len and sk_zckey access is serialized
+		 */
+		BUG_ON(!sock_owned_by_user(sk));
+
+		if (unlikely(uarg->len == USHRT_MAX - 1))
+			return NULL;
+
+		next = (u32)atomic_read(&sk->sk_zckey);
+		if ((u32)(uarg->id + uarg->len) == next) {
+			uarg->len++;
+			atomic_set(&sk->sk_zckey, ++next);
+			return uarg;
+		}
+	}
+
+	return sock_zerocopy_alloc(sk, size);
+}
+EXPORT_SYMBOL_GPL(sock_zerocopy_realloc);
+
+static bool skb_zerocopy_notify_extend(struct sk_buff *skb, u32 lo, u16 len)
+{
+	struct sock_exterr_skb *serr = SKB_EXT_ERR(skb);
+	s64 sum_len;
+	u32 old_lo, old_hi;
+
+	old_lo = serr->ee.ee_info;
+	old_hi = serr->ee.ee_data;
+	sum_len = old_hi - old_lo + 1 + len;
+	if (old_hi < old_lo)
+		sum_len += (1ULL << 32);
+
+	if (sum_len >= (1ULL << 32))
+		return false;
+
+	if (lo != old_hi + 1)
+		return false;
+
+	serr->ee.ee_data += len;
+	return true;
+}
+
 void sock_zerocopy_callback(struct ubuf_info *uarg, bool success)
 {
 	struct sock_exterr_skb *serr;
-	struct sk_buff *skb = skb_from_uarg(uarg);
+	struct sk_buff *head, *skb = skb_from_uarg(uarg);
 	struct sock *sk = skb->sk;
-	u16 id = uarg->desc;
+	struct sk_buff_head *q = &sk->sk_error_queue;
+	unsigned long flags;
+	u32 lo, hi;
+	u16 len;
+
+	/* if !len, there was only 1 call, and it was aborted
+	 * so do not queue a completion notification
+	 */
+	if (!uarg->len)
+		goto free;
+
+	len = uarg->len;
+	lo = uarg->id;
+	hi = uarg->id + len - 1;
 
 	serr = SKB_EXT_ERR(skb);
 	memset(serr, 0, sizeof(*serr));
 	serr->ee.ee_errno = 0;
 	serr->ee.ee_origin = SO_EE_ORIGIN_ZEROCOPY;
-	serr->ee.ee_data = id;
+	serr->ee.ee_data = hi;
+	serr->ee.ee_info = lo;
 
-	skb_queue_tail(&sk->sk_error_queue, skb);
+	spin_lock_irqsave(&q->lock, flags);
+	head = skb_peek(q);
+	if (!head || !skb_zerocopy_notify_extend(head, lo, len)) {
+		__skb_queue_tail(q, skb);
+		skb = NULL;
+	}
+	spin_unlock_irqrestore(&q->lock, flags);
 
 	if (!sock_flag(sk, SOCK_DEAD))
 		sk->sk_error_report(sk);
 
+free:
+	consume_skb(skb);
 	sock_put(sk);
 }
 EXPORT_SYMBOL_GPL(sock_zerocopy_callback);
@@ -974,6 +1045,17 @@ void sock_zerocopy_put(struct ubuf_info *uarg)
 }
 EXPORT_SYMBOL_GPL(sock_zerocopy_put);
 
+/* only called when sendmsg returns with error; no notification for this call */
+void sock_zerocopy_put_abort(struct ubuf_info *uarg)
+{
+	if (uarg) {
+		uarg->len--;
+		atomic_dec(&skb_from_uarg(uarg)->sk->sk_zckey);
+		sock_zerocopy_put(uarg);
+	}
+}
+EXPORT_SYMBOL_GPL(sock_zerocopy_put_abort);
+
 bool skb_zerocopy_alloc(struct sk_buff *skb, size_t size)
 {
 	struct ubuf_info *uarg;
-- 
2.11.0.483.g087da7b7c-goog


* [PATCH RFC v2 06/12] sock: sendmsg zerocopy ulimit
From: Willem de Bruijn @ 2017-02-22 16:38 UTC (permalink / raw)
  To: netdev; +Cc: Willem de Bruijn

From: Willem de Bruijn <willemb@google.com>

Bound the number of pages that a user may pin.

Follow the lead of perf tools to maintain a per-user bound on memory
locked pages commit 789f90fcf6b0 ("perf_counter: per user mlock gift")

Signed-off-by: Willem de Bruijn <willemb@google.com>
---
 include/linux/sched.h  |  2 +-
 include/linux/skbuff.h |  5 +++++
 net/core/skbuff.c      | 48 ++++++++++++++++++++++++++++++++++++++++++++++++
 3 files changed, 54 insertions(+), 1 deletion(-)

diff --git a/include/linux/sched.h b/include/linux/sched.h
index ad3ec9ec61f7..943714f8e91a 100644
--- a/include/linux/sched.h
+++ b/include/linux/sched.h
@@ -905,7 +905,7 @@ struct user_struct {
 	struct hlist_node uidhash_node;
 	kuid_t uid;
 
-#if defined(CONFIG_PERF_EVENTS) || defined(CONFIG_BPF_SYSCALL)
+#if defined(CONFIG_PERF_EVENTS) || defined(CONFIG_BPF_SYSCALL) || defined(CONFIG_NET)
 	atomic_long_t locked_vm;
 #endif
 };
diff --git a/include/linux/skbuff.h b/include/linux/skbuff.h
index eedac9fd3f0f..a38308b10d76 100644
--- a/include/linux/skbuff.h
+++ b/include/linux/skbuff.h
@@ -414,6 +414,11 @@ struct ubuf_info {
 		};
 	};
 	atomic_t refcnt;
+
+	struct mmpin {
+		struct user_struct *user;
+		int num_pg;
+	} mmp;
 };
 
 #define skb_uarg(SKB)	((struct ubuf_info *)(skb_shinfo(SKB)->destructor_arg))
diff --git a/net/core/skbuff.c b/net/core/skbuff.c
index 7a1d6e7703a6..b86e196d6dec 100644
--- a/net/core/skbuff.c
+++ b/net/core/skbuff.c
@@ -914,6 +914,44 @@ struct sk_buff *skb_morph(struct sk_buff *dst, struct sk_buff *src)
 }
 EXPORT_SYMBOL_GPL(skb_morph);
 
+static int mm_account_pinned_pages(struct mmpin *mmp, size_t size)
+{
+	unsigned long max_pg, num_pg, new_pg, old_pg;
+	struct user_struct *user;
+
+	if (capable(CAP_IPC_LOCK) || !size)
+		return 0;
+
+	num_pg = (size >> PAGE_SHIFT) + 2;	/* worst case */
+	max_pg = rlimit(RLIMIT_MEMLOCK) >> PAGE_SHIFT;
+	user = mmp->user ? : current_user();
+
+	do {
+		old_pg = atomic_long_read(&user->locked_vm);
+		new_pg = old_pg + num_pg;
+		if (new_pg > max_pg)
+			return -ENOMEM;
+	} while (atomic_long_cmpxchg(&user->locked_vm, old_pg, new_pg) !=
+		 old_pg);
+
+	if (!mmp->user) {
+		mmp->user = get_uid(user);
+		mmp->num_pg = num_pg;
+	} else {
+		mmp->num_pg += num_pg;
+	}
+
+	return 0;
+}
+
+static void mm_unaccount_pinned_pages(struct mmpin *mmp)
+{
+	if (mmp->user) {
+		atomic_long_sub(mmp->num_pg, &mmp->user->locked_vm);
+		free_uid(mmp->user);
+	}
+}
+
 /* must only be called from process context */
 struct ubuf_info *sock_zerocopy_alloc(struct sock *sk, size_t size)
 {
@@ -926,6 +964,12 @@ struct ubuf_info *sock_zerocopy_alloc(struct sock *sk, size_t size)
 
 	BUILD_BUG_ON(sizeof(*uarg) > sizeof(skb->cb));
 	uarg = (void *)skb->cb;
+	uarg->mmp.user = NULL;
+
+	if (mm_account_pinned_pages(&uarg->mmp, size)) {
+		kfree_skb(skb);
+		return NULL;
+	}
 
 	uarg->callback = sock_zerocopy_callback;
 	uarg->id = ((u32)atomic_inc_return(&sk->sk_zckey)) - 1;
@@ -958,6 +1002,8 @@ struct ubuf_info *sock_zerocopy_realloc(struct sock *sk, size_t size,
 
 		next = (u32)atomic_read(&sk->sk_zckey);
 		if ((u32)(uarg->id + uarg->len) == next) {
+			if (mm_account_pinned_pages(&uarg->mmp, size))
+				return NULL;
 			uarg->len++;
 			atomic_set(&sk->sk_zckey, ++next);
 			return uarg;
@@ -1037,6 +1083,8 @@ EXPORT_SYMBOL_GPL(sock_zerocopy_callback);
 void sock_zerocopy_put(struct ubuf_info *uarg)
 {
 	if (uarg && atomic_dec_and_test(&uarg->refcnt)) {
+		mm_unaccount_pinned_pages(&uarg->mmp);
+
 		if (uarg->callback)
 			uarg->callback(uarg, true);
 		else
-- 
2.11.0.483.g087da7b7c-goog


* [PATCH RFC v2 07/12] sock: sendmsg zerocopy limit bytes per notification
From: Willem de Bruijn @ 2017-02-22 16:38 UTC (permalink / raw)
  To: netdev; +Cc: Willem de Bruijn

From: Willem de Bruijn <willemb@google.com>

Zerocopy can coalesce notifications of up to 65535 send calls.
Excessive coalescing increases notification latency and process
working set size.

Experiments showed trains of 75 syscalls holding around 8 MB of data
per notification. On servers with many slower clients, this causes
many GB of user data waiting for acknowledgment and many seconds of
latency between send and notification reception.

Introduce a notification byte limit.

Implementation notes:
- Due to space constraints in struct ubuf_info, the internal
  calculation is approximate, in kilobytes, and capped to 64 MB. With
  the 512 KB limit used below, for example, a train of 64 KB sends
  starts a new notification chain after eight calls.

- The field is accessed only on initial allocation of ubuf_info, when
  the struct is private, or under the tcp lock.

- When breaking a chain, we create a new notification structure uarg.
  A chain can be broken in the middle of a large sendmsg. Each skbuff
  can only point to a single uarg, so skb_zerocopy_add_frags_iter will
  fail after breaking a chain. The (next) TCP patch is changed in v2
  to detect failure (EEXIST) and jump to new_segment to create a new
  skbuff that can point to the new uarg. As a result, packetization of
  the bytestream may differ from a send without zerocopy.

Signed-off-by: Willem de Bruijn <willemb@google.com>
---
 include/linux/skbuff.h |  1 +
 net/core/skbuff.c      | 11 ++++++++++-
 2 files changed, 11 insertions(+), 1 deletion(-)

diff --git a/include/linux/skbuff.h b/include/linux/skbuff.h
index a38308b10d76..6ad1724ceb60 100644
--- a/include/linux/skbuff.h
+++ b/include/linux/skbuff.h
@@ -411,6 +411,7 @@ struct ubuf_info {
 		struct {
 			u32 id;
 			u16 len;
+			u16 kbytelen;
 		};
 	};
 	atomic_t refcnt;
diff --git a/net/core/skbuff.c b/net/core/skbuff.c
index b86e196d6dec..6a07a20a91ed 100644
--- a/net/core/skbuff.c
+++ b/net/core/skbuff.c
@@ -974,6 +974,7 @@ struct ubuf_info *sock_zerocopy_alloc(struct sock *sk, size_t size)
 	uarg->callback = sock_zerocopy_callback;
 	uarg->id = ((u32)atomic_inc_return(&sk->sk_zckey)) - 1;
 	uarg->len = 1;
+	uarg->kbytelen = min_t(size_t, DIV_ROUND_UP(size, 1024u), USHRT_MAX);
 	atomic_set(&uarg->refcnt, 0);
 	sock_hold(sk);
 
@@ -990,6 +991,8 @@ struct ubuf_info *sock_zerocopy_realloc(struct sock *sk, size_t size,
 					struct ubuf_info *uarg)
 {
 	if (uarg) {
+		const size_t limit_kb = 512;	/* consider a sysctl */
+		size_t kbytelen;
 		u32 next;
 
 		/* realloc only when socket is locked (TCP, UDP cork),
@@ -997,8 +1000,13 @@ struct ubuf_info *sock_zerocopy_realloc(struct sock *sk, size_t size,
 		 */
 		BUG_ON(!sock_owned_by_user(sk));
 
+		kbytelen = uarg->kbytelen + DIV_ROUND_UP(size, 1024u);
+		if (unlikely(kbytelen > limit_kb))
+			goto new_alloc;
+		uarg->kbytelen = kbytelen;
+
 		if (unlikely(uarg->len == USHRT_MAX - 1))
-			return NULL;
+			goto new_alloc;
 
 		next = (u32)atomic_read(&sk->sk_zckey);
 		if ((u32)(uarg->id + uarg->len) == next) {
@@ -1010,6 +1018,7 @@ struct ubuf_info *sock_zerocopy_realloc(struct sock *sk, size_t size,
 		}
 	}
 
+new_alloc:
 	return sock_zerocopy_alloc(sk, size);
 }
 EXPORT_SYMBOL_GPL(sock_zerocopy_realloc);
-- 
2.11.0.483.g087da7b7c-goog


* [PATCH RFC v2 08/12] tcp: enable sendmsg zerocopy
From: Willem de Bruijn @ 2017-02-22 16:38 UTC (permalink / raw)
  To: netdev; +Cc: Willem de Bruijn

From: Willem de Bruijn <willemb@google.com>

Enable MSG_ZEROCOPY support in the TCP stack. Data sent to a remote
host is transmitted without an intermediate copy. TSO and GSO are
supported.

Tested:
  A 10x TCP_STREAM between two hosts showed a reduction in netserver
  process cycles by up to 70%, depending on packet size. Systemwide,
  savings are of course much less pronounced, at up to 20% best case.

  loopback test snd_zerocopy_lo -t -z produced:

  without zerocopy (-t):
    rx=102852 (6418 MB) tx=102852 txc=0
    rx=213216 (13305 MB) tx=213216 txc=0
    rx=325266 (20298 MB) tx=325266 txc=0
    rx=437082 (27275 MB) tx=437082 txc=0

  with zerocopy (-t -z):
    rx=238446 (14880 MB) tx=238446 txc=238434
    rx=500076 (31207 MB) tx=500076 txc=500060
    rx=763728 (47660 MB) tx=763728 txc=763706
    rx=1028184 (64163 MB) tx=1028184 txc=1028156

  This test opens a pair of local sockets; on one it calls sendmsg with
  64KB and optionally MSG_ZEROCOPY, and on the other it reads the initial
  bytes. The receiver truncates, so this is strictly an upper bound on
  what is achievable. It is more representative of sending data out of
  a physical NIC (when payload is not touched, either).

Signed-off-by: Willem de Bruijn <willemb@google.com>
---
 net/ipv4/tcp.c | 37 ++++++++++++++++++++++++++++++++++---
 1 file changed, 34 insertions(+), 3 deletions(-)

diff --git a/net/ipv4/tcp.c b/net/ipv4/tcp.c
index da385ae997a3..4884f4ff14d2 100644
--- a/net/ipv4/tcp.c
+++ b/net/ipv4/tcp.c
@@ -1051,13 +1051,17 @@ static int linear_payload_sz(bool first_skb)
 	return 0;
 }
 
-static int select_size(const struct sock *sk, bool sg, bool first_skb)
+static int select_size(const struct sock *sk, bool sg, bool first_skb,
+		       bool zerocopy)
 {
 	const struct tcp_sock *tp = tcp_sk(sk);
 	int tmp = tp->mss_cache;
 
 	if (sg) {
 		if (sk_can_gso(sk)) {
+			if (zerocopy)
+				return 0;
+
 			tmp = linear_payload_sz(first_skb);
 		} else {
 			int pgbreak = SKB_MAX_HEAD(MAX_TCP_HEADER);
@@ -1121,6 +1125,7 @@ int tcp_sendmsg(struct sock *sk, struct msghdr *msg, size_t size)
 	struct tcp_sock *tp = tcp_sk(sk);
 	struct sk_buff *skb;
 	struct sockcm_cookie sockc;
+	struct ubuf_info *uarg = NULL;
 	int flags, err, copied = 0;
 	int mss_now = 0, size_goal, copied_syn = 0;
 	bool process_backlog = false;
@@ -1190,6 +1195,21 @@ int tcp_sendmsg(struct sock *sk, struct msghdr *msg, size_t size)
 
 	sg = !!(sk->sk_route_caps & NETIF_F_SG);
 
+	if (sg && (flags & MSG_ZEROCOPY) && size && !uarg) {
+		skb = tcp_send_head(sk) ? tcp_write_queue_tail(sk) : NULL;
+		uarg = sock_zerocopy_realloc(sk, size, skb_zcopy(skb));
+		if (!uarg) {
+			if ((err = sk_stream_wait_memory(sk, &timeo)) != 0)
+				goto out_err;
+			uarg = sock_zerocopy_realloc(sk, size, skb_zcopy(skb));
+			if (!uarg) {
+				err = -ENOBUFS;
+				goto out_err;
+			}
+		}
+		sock_zerocopy_get(uarg);
+	}
+
 	while (msg_data_left(msg)) {
 		int copy = 0;
 		int max = size_goal;
@@ -1217,7 +1237,7 @@ int tcp_sendmsg(struct sock *sk, struct msghdr *msg, size_t size)
 			}
 			first_skb = skb_queue_empty(&sk->sk_write_queue);
 			skb = sk_stream_alloc_skb(sk,
-						  select_size(sk, sg, first_skb),
+						  select_size(sk, sg, first_skb, uarg),
 						  sk->sk_allocation,
 						  first_skb);
 			if (!skb)
@@ -1253,7 +1273,7 @@ int tcp_sendmsg(struct sock *sk, struct msghdr *msg, size_t size)
 			err = skb_add_data_nocache(sk, skb, &msg->msg_iter, copy);
 			if (err)
 				goto do_fault;
-		} else {
+		} else if (!uarg) {
 			bool merge = true;
 			int i = skb_shinfo(skb)->nr_frags;
 			struct page_frag *pfrag = sk_page_frag(sk);
@@ -1291,6 +1311,15 @@ int tcp_sendmsg(struct sock *sk, struct msghdr *msg, size_t size)
 				page_ref_inc(pfrag->page);
 			}
 			pfrag->offset += copy;
+		} else {
+			err = skb_zerocopy_add_frags_iter(sk, skb,
+							  &msg->msg_iter,
+							  copy, uarg);
+			if (err == -EMSGSIZE || err == -EEXIST)
+				goto new_segment;
+			if (err < 0)
+				goto do_error;
+			copy = err;
 		}
 
 		if (!copied)
@@ -1337,6 +1366,7 @@ int tcp_sendmsg(struct sock *sk, struct msghdr *msg, size_t size)
 		tcp_push(sk, flags, mss_now, tp->nonagle, size_goal);
 	}
 out_nopush:
+	sock_zerocopy_put(uarg);
 	release_sock(sk);
 	return copied + copied_syn;
 
@@ -1354,6 +1384,7 @@ int tcp_sendmsg(struct sock *sk, struct msghdr *msg, size_t size)
 	if (copied + copied_syn)
 		goto out;
 out_err:
+	sock_zerocopy_put_abort(uarg);
 	err = sk_stream_error(sk, flags, err);
 	/* make sure we wake any epoll edge trigger waiter */
 	if (unlikely(skb_queue_len(&sk->sk_write_queue) == 0 &&
-- 
2.11.0.483.g087da7b7c-goog

^ permalink raw reply related	[flat|nested] 37+ messages in thread

* [PATCH RFC v2 09/12] udp: enable sendmsg zerocopy
  2017-02-22 16:38 [PATCH RFC v2 00/12] socket sendmsg MSG_ZEROCOPY Willem de Bruijn
                   ` (7 preceding siblings ...)
  2017-02-22 16:38 ` [PATCH RFC v2 08/12] tcp: enable sendmsg zerocopy Willem de Bruijn
@ 2017-02-22 16:38 ` Willem de Bruijn
  2017-02-22 16:38 ` [PATCH RFC v2 10/12] raw: enable sendmsg zerocopy with IP_HDRINCL Willem de Bruijn
                   ` (5 subsequent siblings)
  14 siblings, 0 replies; 37+ messages in thread
From: Willem de Bruijn @ 2017-02-22 16:38 UTC (permalink / raw)
  To: netdev; +Cc: Willem de Bruijn

From: Willem de Bruijn <willemb@google.com>

Add MSG_ZEROCOPY support to inet/dgram. This includes udplite.

Tested:
  loopback test snd_zerocopy_lo -u -z produces

  without zerocopy (-u):
    rx=173940 (10854 MB) tx=173940 txc=0
    rx=367026 (22904 MB) tx=367026 txc=0
    rx=564078 (35201 MB) tx=564078 txc=0
    rx=756588 (47214 MB) tx=756588 txc=0

  with zerocopy (-u -z):
    rx=377994 (23588 MB) tx=377994 txc=377980
    rx=792654 (49465 MB) tx=792654 txc=792632
    rx=1209582 (75483 MB) tx=1209582 txc=1209552
    rx=1628376 (101618 MB) tx=1628376 txc=1628338

  loopback test currently fails with corking, due to
  CHECKSUM_PARTIAL being disabled with UDP_CORK after commit
  d749c9cbffd6 ("ipv4: no CHECKSUM_PARTIAL on MSG_MORE corked sockets")

  I will propose allowing it on NETIF_F_LOOPBACK devices.
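
  For reference, the corked runs mix copied and zerocopy sends into a
  single datagram roughly as follows (a sketch based on
  snd_zerocopy_lo; error handling omitted):

    int on = 1, off = 0;

    setsockopt(fd, IPPROTO_UDP, UDP_CORK, &on, sizeof(on));
    send(fd, buf, len, MSG_DONTWAIT | MSG_ZEROCOPY);  /* zerocopy frags */
    send(fd, buf, len, MSG_DONTWAIT);                 /* copied frags */
    setsockopt(fd, IPPROTO_UDP, UDP_CORK, &off, sizeof(off));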

Signed-off-by: Willem de Bruijn <willemb@google.com>
---
 include/linux/skbuff.h |  5 +++++
 net/ipv4/ip_output.c   | 34 +++++++++++++++++++++++++++++-----
 2 files changed, 34 insertions(+), 5 deletions(-)

diff --git a/include/linux/skbuff.h b/include/linux/skbuff.h
index 6ad1724ceb60..9e7386f3f7a8 100644
--- a/include/linux/skbuff.h
+++ b/include/linux/skbuff.h
@@ -424,6 +424,11 @@ struct ubuf_info {
 
 #define skb_uarg(SKB)	((struct ubuf_info *)(skb_shinfo(SKB)->destructor_arg))
 
+#define sock_can_zerocopy(sk, rt, csummode) \
+	((rt->dst.dev->features & NETIF_F_SG) && \
+	 ((sk->sk_type == SOCK_RAW) || \
+	  (sk->sk_type == SOCK_DGRAM && csummode & CHECKSUM_UNNECESSARY)))
+
 struct ubuf_info *sock_zerocopy_alloc(struct sock *sk, size_t size);
 struct ubuf_info *sock_zerocopy_realloc(struct sock *sk, size_t size,
 					struct ubuf_info *uarg);
diff --git a/net/ipv4/ip_output.c b/net/ipv4/ip_output.c
index 737ce826d7ec..9e0110d8a429 100644
--- a/net/ipv4/ip_output.c
+++ b/net/ipv4/ip_output.c
@@ -919,7 +919,7 @@ static int __ip_append_data(struct sock *sk,
 {
 	struct inet_sock *inet = inet_sk(sk);
 	struct sk_buff *skb;
-
+	struct ubuf_info *uarg = NULL;
 	struct ip_options *opt = cork->opt;
 	int hh_len;
 	int exthdrlen;
@@ -963,9 +963,16 @@ static int __ip_append_data(struct sock *sk,
 	    !exthdrlen)
 		csummode = CHECKSUM_PARTIAL;
 
+	if (flags & MSG_ZEROCOPY && length &&
+	    sock_can_zerocopy(sk, rt, skb ? skb->ip_summed : csummode)) {
+		uarg = sock_zerocopy_realloc(sk, length, skb_zcopy(skb));
+		if (!uarg)
+			return -ENOBUFS;
+	}
+
 	cork->length += length;
 	if ((((length + fragheaderlen) > mtu) || (skb && skb_is_gso(skb))) &&
-	    (sk->sk_protocol == IPPROTO_UDP) &&
+	    (sk->sk_protocol == IPPROTO_UDP) && !uarg &&
 	    (rt->dst.dev->features & NETIF_F_UFO) && !rt->dst.header_len &&
 	    (sk->sk_type == SOCK_DGRAM) && !sk->sk_no_check_tx) {
 		err = ip_ufo_append_data(sk, queue, getfrag, from, length,
@@ -1017,6 +1024,8 @@ static int __ip_append_data(struct sock *sk,
 			if ((flags & MSG_MORE) &&
 			    !(rt->dst.dev->features&NETIF_F_SG))
 				alloclen = mtu;
+			else if (uarg)
+				alloclen = min_t(int, fraglen, MAX_HEADER);
 			else
 				alloclen = fraglen;
 
@@ -1059,11 +1068,12 @@ static int __ip_append_data(struct sock *sk,
 			cork->tx_flags = 0;
 			skb_shinfo(skb)->tskey = tskey;
 			tskey = 0;
+			skb_zcopy_set(skb, uarg);
 
 			/*
 			 *	Find where to start putting bytes.
 			 */
-			data = skb_put(skb, fraglen + exthdrlen);
+			data = skb_put(skb, alloclen);
 			skb_set_network_header(skb, exthdrlen);
 			skb->transport_header = (skb->network_header +
 						 fragheaderlen);
@@ -1079,7 +1089,9 @@ static int __ip_append_data(struct sock *sk,
 				pskb_trim_unique(skb_prev, maxfraglen);
 			}
 
-			copy = datalen - transhdrlen - fraggap;
+			copy = min(datalen,
+				   alloclen - exthdrlen - fragheaderlen);
+			copy -= transhdrlen - fraggap;
 			if (copy > 0 && getfrag(from, data + transhdrlen, offset, copy, fraggap, skb) < 0) {
 				err = -EFAULT;
 				kfree_skb(skb);
@@ -1087,7 +1099,7 @@ static int __ip_append_data(struct sock *sk,
 			}
 
 			offset += copy;
-			length -= datalen - fraggap;
+			length -= copy + transhdrlen;
 			transhdrlen = 0;
 			exthdrlen = 0;
 			csummode = CHECKSUM_NONE;
@@ -1115,6 +1127,17 @@ static int __ip_append_data(struct sock *sk,
 				err = -EFAULT;
 				goto error;
 			}
+		} else if (uarg) {
+			struct iov_iter *iter;
+
+			if (sk->sk_type == SOCK_RAW)
+				iter = &((struct msghdr **)from)[0]->msg_iter;
+			else
+				iter = &((struct msghdr *)from)->msg_iter;
+			err = skb_zerocopy_add_frags_iter(sk, skb, iter, copy, uarg);
+			if (err < 0)
+				goto error;
+			copy = err;
 		} else {
 			int i = skb_shinfo(skb)->nr_frags;
 
@@ -1155,6 +1178,7 @@ static int __ip_append_data(struct sock *sk,
 error_efault:
 	err = -EFAULT;
 error:
+	sock_zerocopy_put_abort(uarg);
 	cork->length -= length;
 	IP_INC_STATS(sock_net(sk), IPSTATS_MIB_OUTDISCARDS);
 	return err;
-- 
2.11.0.483.g087da7b7c-goog

^ permalink raw reply related	[flat|nested] 37+ messages in thread

* [PATCH RFC v2 10/12] raw: enable sendmsg zerocopy with IP_HDRINCL
  2017-02-22 16:38 [PATCH RFC v2 00/12] socket sendmsg MSG_ZEROCOPY Willem de Bruijn
                   ` (8 preceding siblings ...)
  2017-02-22 16:38 ` [PATCH RFC v2 09/12] udp: " Willem de Bruijn
@ 2017-02-22 16:38 ` Willem de Bruijn
  2017-02-22 16:39 ` [PATCH RFC v2 11/12] packet: enable sendmsg zerocopy Willem de Bruijn
                   ` (4 subsequent siblings)
  14 siblings, 0 replies; 37+ messages in thread
From: Willem de Bruijn @ 2017-02-22 16:38 UTC (permalink / raw)
  To: netdev; +Cc: Willem de Bruijn

From: Willem de Bruijn <willemb@google.com>

Tested:
  raw loopback test snd_zerocopy_lo -r -z produces:

  without zerocopy (-r):
    rx=97632 (6092 MB) tx=97632 txc=0
    rx=208194 (12992 MB) tx=208194 txc=0
    rx=318714 (19889 MB) tx=318714 txc=0
    rx=429126 (26779 MB) tx=429126 txc=0

  with zerocopy (-r -z):
    rx=326160 (20353 MB) tx=326160 txc=326144
    rx=689244 (43012 MB) tx=689244 txc=689220
    rx=1049352 (65484 MB) tx=1049352 txc=1049320
    rx=1408782 (87914 MB) tx=1408782 txc=1408744

  raw hdrincl loopback test snd_zerocopy_lo -R -z produces:

  without zerocopy (-R):
    rx=167328 (10442 MB) tx=167328 txc=0
    rx=354942 (22150 MB) tx=354942 txc=0
    rx=542400 (33848 MB) tx=542400 txc=0
    rx=716442 (44709 MB) tx=716442 txc=0

  with zerocopy (-R -z):
    rx=340116 (21224 MB) tx=340116 txc=340102
    rx=712746 (44478 MB) tx=712746 txc=712726
    rx=1083732 (67629 MB) tx=1083732 txc=1083704
    rx=1457856 (90976 MB) tx=1457856 txc=1457820
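
  For reference, the hdrincl case passes the IP header and payload in a
  single call roughly as below (a sketch based on snd_zerocopy_lo, with
  fd a connected socket(PF_INET, SOCK_RAW, IPPROTO_RAW)). With
  MSG_ZEROCOPY only the first MAX_HEADER bytes are copied into the skb
  linear area; the remainder is attached as zerocopy frags:

    struct iphdr iph;
    struct iovec iov[2] = {
        { .iov_base = &iph, .iov_len = sizeof(iph) },
        { .iov_base = buf,  .iov_len = payload_len },
    };
    struct msghdr msg = { .msg_iov = iov, .msg_iovlen = 2 };

    setup_iph(&iph, payload_len);    /* as in snd_zerocopy_lo */
    sendmsg(fd, &msg, MSG_DONTWAIT | MSG_ZEROCOPY);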

Signed-off-by: Willem de Bruijn <willemb@google.com>
---
 net/ipv4/raw.c | 27 +++++++++++++++++++++++----
 1 file changed, 23 insertions(+), 4 deletions(-)

diff --git a/net/ipv4/raw.c b/net/ipv4/raw.c
index 8119e1f66e03..d21279b2f69e 100644
--- a/net/ipv4/raw.c
+++ b/net/ipv4/raw.c
@@ -351,7 +351,7 @@ static int raw_send_hdrinc(struct sock *sk, struct flowi4 *fl4,
 	unsigned int iphlen;
 	int err;
 	struct rtable *rt = *rtp;
-	int hlen, tlen;
+	int hlen, tlen, linear;
 
 	if (length > rt->dst.dev->mtu) {
 		ip_local_error(sk, EMSGSIZE, fl4->daddr, inet->inet_dport,
@@ -363,8 +363,14 @@ static int raw_send_hdrinc(struct sock *sk, struct flowi4 *fl4,
 
 	hlen = LL_RESERVED_SPACE(rt->dst.dev);
 	tlen = rt->dst.dev->needed_tailroom;
+	linear = length;
+
+	if (flags & MSG_ZEROCOPY && length &&
+	    sock_can_zerocopy(sk, rt, CHECKSUM_UNNECESSARY))
+		linear = min_t(int, length, MAX_HEADER);
+
 	skb = sock_alloc_send_skb(sk,
-				  length + hlen + tlen + 15,
+				  linear + hlen + tlen + 15,
 				  flags & MSG_DONTWAIT, &err);
 	if (!skb)
 		goto error;
@@ -377,7 +383,7 @@ static int raw_send_hdrinc(struct sock *sk, struct flowi4 *fl4,
 
 	skb_reset_network_header(skb);
 	iph = ip_hdr(skb);
-	skb_put(skb, length);
+	skb_put(skb, linear);
 
 	skb->ip_summed = CHECKSUM_NONE;
 
@@ -388,7 +394,7 @@ static int raw_send_hdrinc(struct sock *sk, struct flowi4 *fl4,
 
 	skb->transport_header = skb->network_header;
 	err = -EFAULT;
-	if (memcpy_from_msg(iph, msg, length))
+	if (memcpy_from_msg(iph, msg, linear))
 		goto error_free;
 
 	iphlen = iph->ihl * 4;
@@ -404,6 +410,17 @@ static int raw_send_hdrinc(struct sock *sk, struct flowi4 *fl4,
 	if (iphlen > length)
 		goto error_free;
 
+	if (length != linear) {
+		size_t datalen = length - linear;
+
+		if (!skb_zerocopy_alloc(skb, datalen))
+			goto error_zcopy;
+		err = skb_zerocopy_add_frags_iter(sk, skb, &msg->msg_iter,
+						  datalen, skb_uarg(skb));
+		if (err != datalen)
+			goto error_zcopy;
+	}
+
 	if (iphlen >= sizeof(*iph)) {
 		if (!iph->saddr)
 			iph->saddr = fl4->saddr;
@@ -430,6 +447,8 @@ static int raw_send_hdrinc(struct sock *sk, struct flowi4 *fl4,
 out:
 	return 0;
 
+error_zcopy:
+	sock_zerocopy_put_abort(skb_zcopy(skb));
 error_free:
 	kfree_skb(skb);
 error:
-- 
2.11.0.483.g087da7b7c-goog

^ permalink raw reply related	[flat|nested] 37+ messages in thread

* [PATCH RFC v2 11/12] packet: enable sendmsg zerocopy
  2017-02-22 16:38 [PATCH RFC v2 00/12] socket sendmsg MSG_ZEROCOPY Willem de Bruijn
                   ` (9 preceding siblings ...)
  2017-02-22 16:38 ` [PATCH RFC v2 10/12] raw: enable sendmsg zerocopy with IP_HDRINCL Willem de Bruijn
@ 2017-02-22 16:39 ` Willem de Bruijn
  2017-02-22 16:39 ` [PATCH RFC v2 12/12] test: add sendmsg zerocopy tests Willem de Bruijn
                   ` (3 subsequent siblings)
  14 siblings, 0 replies; 37+ messages in thread
From: Willem de Bruijn @ 2017-02-22 16:39 UTC (permalink / raw)
  To: netdev; +Cc: Willem de Bruijn

From: Willem de Bruijn <willemb@google.com>

Support MSG_ZEROCOPY on PF_PACKET transmission.

Tested:
  pf_packet loopback test snd_zerocopy_lo -p -z produces:

  without zerocopy (-p):
    rx=0 (0 MB) tx=221696 txc=0
    rx=0 (0 MB) tx=443880 txc=0
    rx=0 (0 MB) tx=661056 txc=0
    rx=0 (0 MB) tx=877152 txc=0

  with zerocopy (-p -z):
    rx=0 (0 MB) tx=528548 txc=528544
    rx=0 (0 MB) tx=1052364 txc=1052360
    rx=0 (0 MB) tx=1571956 txc=1571952
    rx=0 (0 MB) tx=2094144 txc=2094140

  Packets do not arrive at the Rx socket because they fail the martian
  destination check:

    IPv4: martian destination 127.0.0.1 from 127.0.0.1, dev lo

  I'll need to revise snd_zerocopy_lo to bypass that.
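
  For reference, the test injects at the link layer roughly as below
  (mirroring snd_zerocopy_lo; the Rx side is a separate PF_INET raw
  socket):

    struct sockaddr_ll laddr = {0};

    laddr.sll_family   = AF_PACKET;
    laddr.sll_ifindex  = 1;              /* lo */
    laddr.sll_protocol = htons(ETH_P_IP);
    laddr.sll_halen    = ETH_ALEN;

    msg.msg_name    = &laddr;
    msg.msg_namelen = sizeof(laddr);
    /* iov[0]: struct iphdr, iov[1]: payload */
    sendmsg(fd, &msg, MSG_DONTWAIT | MSG_ZEROCOPY);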

Signed-off-by: Willem de Bruijn <willemb@google.com>
---
 net/packet/af_packet.c | 52 ++++++++++++++++++++++++++++++++++++++++----------
 1 file changed, 42 insertions(+), 10 deletions(-)

diff --git a/net/packet/af_packet.c b/net/packet/af_packet.c
index 2bd0d1949312..af9ecc1edf72 100644
--- a/net/packet/af_packet.c
+++ b/net/packet/af_packet.c
@@ -2754,28 +2754,55 @@ static int tpacket_snd(struct packet_sock *po, struct msghdr *msg)
 
 static struct sk_buff *packet_alloc_skb(struct sock *sk, size_t prepad,
 				        size_t reserve, size_t len,
-				        size_t linear, int noblock,
+					size_t linear, int flags,
 				        int *err)
 {
 	struct sk_buff *skb;
+	size_t data_len;
 
-	/* Under a page?  Don't bother with paged skb. */
-	if (prepad + len < PAGE_SIZE || !linear)
-		linear = len;
+	if (flags & MSG_ZEROCOPY) {
+		/* Minimize linear, but respect header lower bound */
+		linear = reserve + min(len, max_t(size_t, linear, MAX_HEADER));
+		data_len = 0;
+	} else {
+		/* Under a page? Don't bother with paged skb. */
+		if (prepad + len < PAGE_SIZE || !linear)
+			linear = len;
+		data_len = len - linear;
+	}
 
-	skb = sock_alloc_send_pskb(sk, prepad + linear, len - linear, noblock,
-				   err, 0);
+	skb = sock_alloc_send_pskb(sk, prepad + linear, data_len,
+				   flags & MSG_DONTWAIT, err, 0);
 	if (!skb)
 		return NULL;
 
 	skb_reserve(skb, reserve);
 	skb_put(skb, linear);
-	skb->data_len = len - linear;
-	skb->len += len - linear;
+	skb->data_len = data_len;
+	skb->len += data_len;
 
 	return skb;
 }
 
+static int packet_zerocopy_sg_from_iovec(struct sk_buff *skb,
+					 struct msghdr *msg,
+					 int offset, size_t size)
+{
+	int ret;
+
+	/* if SOCK_DGRAM, head room was alloc'ed and holds ll-headers */
+	__skb_pull(skb, offset);
+	ret = zerocopy_sg_from_iter(skb, &msg->msg_iter);
+	__skb_push(skb, offset);
+	if (unlikely(ret))
+		return ret == -EMSGSIZE ? ret : -EIO;
+
+	if (!skb_zerocopy_alloc(skb, size))
+		return -ENOMEM;
+
+	return 0;
+}
+
 static int packet_snd(struct socket *sock, struct msghdr *msg, size_t len)
 {
 	struct sock *sk = sock->sk;
@@ -2853,7 +2880,7 @@ static int packet_snd(struct socket *sock, struct msghdr *msg, size_t len)
 	linear = __virtio16_to_cpu(vio_le(), vnet_hdr.hdr_len);
 	linear = max(linear, min_t(int, len, dev->hard_header_len));
 	skb = packet_alloc_skb(sk, hlen + tlen, hlen, len, linear,
-			       msg->msg_flags & MSG_DONTWAIT, &err);
+			       msg->msg_flags, &err);
 	if (skb == NULL)
 		goto out_unlock;
 
@@ -2867,7 +2894,11 @@ static int packet_snd(struct socket *sock, struct msghdr *msg, size_t len)
 	}
 
 	/* Returns -EFAULT on error */
-	err = skb_copy_datagram_from_iter(skb, offset, &msg->msg_iter, len);
+	if (msg->msg_flags & MSG_ZEROCOPY)
+		err = packet_zerocopy_sg_from_iovec(skb, msg, offset, len);
+	else
+		err = skb_copy_datagram_from_iter(skb, offset, &msg->msg_iter,
+						  len);
 	if (err)
 		goto out_free;
 
@@ -2913,6 +2944,7 @@ static int packet_snd(struct socket *sock, struct msghdr *msg, size_t len)
 	return len;
 
 out_free:
+	skb_zcopy_abort(skb);
 	kfree_skb(skb);
 out_unlock:
 	if (dev)
-- 
2.11.0.483.g087da7b7c-goog

^ permalink raw reply related	[flat|nested] 37+ messages in thread

* [PATCH RFC v2 12/12] test: add sendmsg zerocopy tests
  2017-02-22 16:38 [PATCH RFC v2 00/12] socket sendmsg MSG_ZEROCOPY Willem de Bruijn
                   ` (10 preceding siblings ...)
  2017-02-22 16:39 ` [PATCH RFC v2 11/12] packet: enable sendmsg zerocopy Willem de Bruijn
@ 2017-02-22 16:39 ` Willem de Bruijn
  2017-02-23 15:45 ` [PATCH RFC v2 00/12] socket sendmsg MSG_ZEROCOPY David Miller
                   ` (2 subsequent siblings)
  14 siblings, 0 replies; 37+ messages in thread
From: Willem de Bruijn @ 2017-02-22 16:39 UTC (permalink / raw)
  To: netdev; +Cc: Willem de Bruijn

From: Willem de Bruijn <willemb@google.com>

Introduce the tests used to verify MSG_ZEROCOPY behavior:

snd_zerocopy:
  send zerocopy fragments out over the default route.

snd_zerocopy_lo:
  send data between a pair of local sockets and report throughput.

These tests are not suitable for inclusion in tools/testing/selftests
as is, as they do not return a pass/fail verdict. They are included in
this RFC for demonstration only.
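
Example invocations (-z enables MSG_ZEROCOPY; see the individual
patches for the corresponding results):

  ./snd_zerocopy_lo -t          # tcp over loopback
  ./snd_zerocopy_lo -t -z       # tcp over loopback, zerocopy
  ./snd_zerocopy_lo -u -z       # udp
  ./snd_zerocopy_lo -U -z       # udp with corking (currently fails, see udp patch)
  ./snd_zerocopy_lo -r -z       # raw
  ./snd_zerocopy_lo -R -z       # raw with IP_HDRINCL
  ./snd_zerocopy_lo -p -z       # pf_packet

  ./snd_zerocopy -H $host -z    # out the default route to $host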

Signed-off-by: Willem de Bruijn <willemb@google.com>
---
 tools/testing/selftests/net/.gitignore        |   2 +
 tools/testing/selftests/net/Makefile          |   1 +
 tools/testing/selftests/net/snd_zerocopy.c    | 354 +++++++++++++++
 tools/testing/selftests/net/snd_zerocopy_lo.c | 596 ++++++++++++++++++++++++++
 4 files changed, 953 insertions(+)
 create mode 100644 tools/testing/selftests/net/snd_zerocopy.c
 create mode 100644 tools/testing/selftests/net/snd_zerocopy_lo.c

diff --git a/tools/testing/selftests/net/.gitignore b/tools/testing/selftests/net/.gitignore
index afe109e5508a..7dfb030f0c9b 100644
--- a/tools/testing/selftests/net/.gitignore
+++ b/tools/testing/selftests/net/.gitignore
@@ -5,3 +5,5 @@ reuseport_bpf
 reuseport_bpf_cpu
 reuseport_bpf_numa
 reuseport_dualstack
+snd_zerocopy
+snd_zerocopy_lo
diff --git a/tools/testing/selftests/net/Makefile b/tools/testing/selftests/net/Makefile
index e24e4c82542e..aa663c791f7a 100644
--- a/tools/testing/selftests/net/Makefile
+++ b/tools/testing/selftests/net/Makefile
@@ -7,6 +7,7 @@ NET_PROGS =  socket
 NET_PROGS += psock_fanout psock_tpacket
 NET_PROGS += reuseport_bpf reuseport_bpf_cpu reuseport_bpf_numa
 NET_PROGS += reuseport_dualstack
+NET_PROGS += snd_zerocopy snd_zerocopy_lo
 
 all: $(NET_PROGS)
 reuseport_bpf_numa: LDFLAGS += -lnuma
diff --git a/tools/testing/selftests/net/snd_zerocopy.c b/tools/testing/selftests/net/snd_zerocopy.c
new file mode 100644
index 000000000000..052d0d14e62d
--- /dev/null
+++ b/tools/testing/selftests/net/snd_zerocopy.c
@@ -0,0 +1,354 @@
+#define _GNU_SOURCE
+
+#include <arpa/inet.h>
+#include <error.h>
+#include <errno.h>
+#include <limits.h>
+#include <linux/errqueue.h>
+#include <poll.h>
+#include <stdbool.h>
+#include <stdio.h>
+#include <stdint.h>
+#include <stdlib.h>
+#include <string.h>
+#include <sys/socket.h>
+#include <sys/stat.h>
+#include <sys/time.h>
+#include <sys/types.h>
+#include <unistd.h>
+
+#define MSG_ZEROCOPY	0x4000000
+
+#define SK_FUDGE_FACTOR	2		/* allow for overhead in SNDBUF */
+#define BUFLEN		(400 * 1000)	/* max length of send call */
+#define DEST_PORT	9000
+
+uint32_t sent = UINT32_MAX, acked = UINT32_MAX;
+
+int cfg_batch_notify = 10;
+int cfg_num_runs = 16;
+size_t cfg_socksize = 1 << 20;
+int cfg_stress_sec;
+int cfg_verbose;
+bool cfg_zerocopy;
+
+static unsigned long gettime_now_ms(void)
+{
+	struct timeval tv;
+
+	gettimeofday(&tv, NULL);
+	return (tv.tv_sec * 1000) + (tv.tv_usec / 1000);
+}
+
+static void do_set_socksize(int fd)
+{
+	if (setsockopt(fd, SOL_SOCKET, SO_SNDBUFFORCE,
+		       &cfg_socksize, sizeof(cfg_socksize)))
+		error(1, 0, "setsockopt sndbufforce");
+
+	if (setsockopt(fd, SOL_SOCKET, SO_RCVBUFFORCE,
+		       &cfg_socksize, sizeof(cfg_socksize)))
+		error(1, 0, "setsockopt rcvbufforce");
+}
+
+static bool do_read_notification(int fd)
+{
+	struct sock_extended_err *serr;
+	struct cmsghdr *cm;
+	struct msghdr msg = {};
+	char control[100];
+	int64_t hi, lo;
+	int ret;
+
+	msg.msg_control = control;
+	msg.msg_controllen = sizeof(control);
+
+	ret = recvmsg(fd, &msg, MSG_DONTWAIT | MSG_ERRQUEUE);
+	if (ret == -1 && errno == EAGAIN)
+		return false;
+	if (ret == -1)
+		error(1, errno, "recvmsg notification");
+	if (msg.msg_flags & MSG_CTRUNC)
+		error(1, errno, "recvmsg notification: truncated");
+
+	cm = CMSG_FIRSTHDR(&msg);
+	if (!cm || cm->cmsg_level != SOL_IP ||
+	    (cm->cmsg_type != IP_RECVERR && cm->cmsg_type != IPV6_RECVERR))
+		error(1, 0, "cmsg: wrong type");
+
+	serr = (void *) CMSG_DATA(cm);
+	if (serr->ee_errno != 0 || serr->ee_origin != SO_EE_ORIGIN_ZEROCOPY)
+		error(1, 0, "serr: wrong type");
+
+	hi = serr->ee_data;
+	lo = serr->ee_info;
+	if (lo != (uint32_t) (acked + 1))
+		error(1, 0, "notify: %lu..%lu, expected %u\n",
+		      lo, hi, acked + 1);
+	acked = hi;
+
+	if (cfg_verbose)
+		fprintf(stderr, "completed: %lu..%lu\n", lo, hi);
+
+	return true;
+}
+
+static void do_poll(int fd, int events, int timeout)
+{
+	struct pollfd pfd;
+	int ret;
+
+	pfd.fd = fd;
+	pfd.events = events;
+	pfd.revents = 0;
+
+	ret = poll(&pfd, 1, timeout);
+	if (ret == -1)
+		error(1, errno, "poll");
+	if (ret != 1)
+		error(1, 0, "poll timeout. events=0x%x acked=%u sent=%u",
+		      pfd.events, acked, sent);
+
+	if (cfg_verbose >= 2)
+		fprintf(stderr, "poll ok. events=0x%x revents=0x%x\n",
+			pfd.events, pfd.revents);
+}
+
+static void do_send(int fd, int len, int flags)
+{
+	static char data[BUFLEN];
+	struct msghdr msg = {};
+	struct iovec iov = {};
+	int ret;
+
+	if (len > BUFLEN)
+		error(1, 0, "write out of bounds");
+
+	iov.iov_base = data;
+	iov.iov_len = len;
+	msg.msg_iov = &iov;
+	msg.msg_iovlen = 1;
+
+	ret = sendmsg(fd, &msg, flags);
+	if (ret == -1)
+		error(1, errno, "sendmsg");
+	if (ret != len)
+		error(1, errno, "sendmsg: %u < %u", ret, len);
+
+	if (cfg_verbose >= 2)
+		fprintf(stderr, "  sent %6u B\n", len);
+
+	if (flags & MSG_ZEROCOPY && len) {
+		sent++;
+		if (cfg_verbose)
+			fprintf(stderr, "    add %u\n", sent);
+		do_read_notification(fd);
+	}
+}
+
+/* wait for all outstanding notifications to arrive */
+static void wait_for_notifications(int fd)
+{
+	unsigned long tstop, tnow;
+
+	if (acked == sent)
+		return;
+
+	tnow = gettime_now_ms();
+	tstop = tnow + 10000;
+	do {
+		do_poll(fd, 0 /* POLLERR is always reported */, tstop - tnow);
+
+		while (do_read_notification(fd)) {}
+		if (acked == sent)
+			return;
+
+		tnow = gettime_now_ms();
+	} while (tnow < tstop);
+
+	error(1, 0, "notify timeout. acked=%u sent=%u", acked, sent);
+}
+
+static void run_test(int fd, int len_cp, int len_zc, int batch)
+{
+	int i;
+
+	fprintf(stderr, "\ncp=%u zc=%u batch=%u\n", len_cp, len_zc, batch);
+
+	if (acked != sent)
+		error(1, 0, "not empty when expected");
+
+	if (batch * BUFLEN * SK_FUDGE_FACTOR > cfg_socksize) {
+		batch = cfg_socksize / BUFLEN / SK_FUDGE_FACTOR;
+		if (!batch)
+			error(1, 0, "cannot batch: increase socksize ('-s')");
+	}
+
+	for (i = 0; i < cfg_num_runs; i++) {
+		if (len_cp) {
+			do_poll(fd, POLLOUT, 1000);
+			do_send(fd, len_cp, 0);
+		}
+
+		do_poll(fd, POLLOUT, 1000);
+		do_send(fd, len_zc, cfg_zerocopy ? MSG_ZEROCOPY : 0);
+
+		if (i % batch == 0)
+			wait_for_notifications(fd);
+	}
+
+	wait_for_notifications(fd);
+}
+
+static void run_single(int fd, int len, int batch)
+{
+	run_test(fd, 0, len, batch);
+}
+
+/* combine zerocopy fragments with regular fragments */
+static void run_mix_zerocopy(int fd, int len_cp, int len_zc)
+{
+	run_test(fd, len_cp, len_zc, 1);
+}
+
+static void run_tests(int fd)
+{
+	/* test basic use */
+	run_single(fd, 4096, 1);
+	run_single(fd, 1500, 1);
+	run_single(fd, 1472, 1);
+	run_single(fd, 32000, 1);
+	run_single(fd, 65000, 1);
+	run_single(fd, BUFLEN, 1);
+
+	/* test notification on copybreak: data fits in skb head, no frags */
+	run_single(fd, 1, 1);
+
+	/* test coalescing */
+	run_single(fd, 32000, 4);
+	run_single(fd, 3000, 10);
+	run_single(fd, 100, 100);
+
+	run_mix_zerocopy(fd, 2000, 2000);
+	run_mix_zerocopy(fd, 100, 100);
+	run_mix_zerocopy(fd, 100, 1500);	/* fits coalesce in skb head */
+	run_mix_zerocopy(fd, 100, BUFLEN - 100);
+	run_mix_zerocopy(fd, 2000, 2000);
+
+	run_mix_zerocopy(fd, 1000, 12000);
+	run_mix_zerocopy(fd, 12000, 1000);
+	run_mix_zerocopy(fd, 12000, 12000);
+	run_mix_zerocopy(fd, 16000, 16000);
+
+	/* test more realistic async notifications */
+	run_single(fd, 1472, cfg_batch_notify);
+	run_single(fd, 1, cfg_batch_notify);
+	run_single(fd, BUFLEN, cfg_batch_notify);
+}
+
+static void run_stress_test(int fd, int runtime_sec)
+{
+	const int max_batch = 32;
+	unsigned long tstop, i = 0;
+	int len, len_cp, batch;
+
+	cfg_socksize = BUFLEN * max_batch * SK_FUDGE_FACTOR;
+	do_set_socksize(fd);
+
+	tstop = gettime_now_ms() + (runtime_sec * 1000);
+	do {
+		len = random() % BUFLEN;
+
+		/* create some skbs with only zerocopy frags */
+		if (len && ((i % 200) < 100))
+			len_cp = random() % BUFLEN;
+		else
+			len_cp = 0;
+
+		batch = random() % max_batch;
+
+		fprintf(stderr, "stress: cnt=%lu len_cp=%u len=%u batch=%u\n",
+			i, len_cp, len, batch);
+		run_test(fd, len_cp, len, batch);
+
+		i++;
+	} while (gettime_now_ms() < tstop);
+}
+
+static void parse_opts(int argc, char **argv, struct in_addr *addr)
+{
+	int c;
+
+	addr->s_addr = 0;
+
+	while ((c = getopt(argc, argv, "b:H:n:s:S:vV:z")) != -1) {
+		switch (c) {
+		case 'b':
+			cfg_batch_notify = strtol(optarg, NULL, 0);
+			break;
+		case 'H':
+			if (inet_pton(AF_INET, optarg, addr) != 1)
+				error(1, 0, "inet_pton: could not parse host");
+			break;
+		case 'n':
+			cfg_num_runs = strtol(optarg, NULL, 0);
+			break;
+		case 's':
+			cfg_socksize = strtol(optarg, NULL, 0);
+			break;
+		case 'S':
+			cfg_stress_sec = strtol(optarg, NULL, 0);
+			break;
+		case 'v':
+			cfg_verbose = 1;
+			break;
+		case 'V':
+			cfg_verbose = strtol(optarg, NULL, 0);
+			break;
+		case 'z':
+			cfg_zerocopy = true;
+			break;
+		}
+	}
+
+	if (addr->s_addr == 0)
+		error(1, 0, "host ('-H') argument required");
+
+	if (cfg_verbose) {
+		fprintf(stderr, "batch_notify:  %u\n", cfg_batch_notify);
+		fprintf(stderr, "num_runs:      %u\n", cfg_num_runs);
+		fprintf(stderr, "socksize:      %lu\n", cfg_socksize);
+		fprintf(stderr, "stress:        %u\n", cfg_stress_sec);
+		fprintf(stderr, "zerocopy:      %s\n", cfg_zerocopy ? "ON" : "OFF");
+	}
+}
+
+int main(int argc, char **argv)
+{
+	struct sockaddr_in addr = {};
+	int fd;
+
+	parse_opts(argc, argv, &addr.sin_addr);
+
+	fd = socket(PF_INET, SOCK_STREAM, 0);
+	if (fd == -1)
+		error(1, errno, "socket");
+
+	do_set_socksize(fd);
+
+	addr.sin_family = AF_INET;
+	addr.sin_port = htons(DEST_PORT);
+	if (connect(fd, (void *) &addr, sizeof(addr)))
+		error(1, errno, "connect");
+
+	if (cfg_num_runs)
+		run_tests(fd);
+
+	if (cfg_stress_sec)
+		run_stress_test(fd, cfg_stress_sec);
+
+	if (close(fd))
+		error(1, errno, "close");
+
+	fprintf(stderr, "OK. All tests passed\n");
+	return 0;
+}
diff --git a/tools/testing/selftests/net/snd_zerocopy_lo.c b/tools/testing/selftests/net/snd_zerocopy_lo.c
new file mode 100644
index 000000000000..309b016a4fd5
--- /dev/null
+++ b/tools/testing/selftests/net/snd_zerocopy_lo.c
@@ -0,0 +1,596 @@
+/* evaluate MSG_ZEROCOPY over the loopback interface */
+
+#define _GNU_SOURCE
+
+#include <arpa/inet.h>
+#include <error.h>
+#include <errno.h>
+#include <limits.h>
+#include <linux/errqueue.h>
+#include <linux/if_packet.h>
+#include <linux/socket.h>
+#include <net/ethernet.h>
+#include <net/if.h>
+#include <netinet/ip.h>
+#include <netinet/tcp.h>
+#include <netinet/udp.h>
+#include <poll.h>
+#include <sched.h>
+#include <stdbool.h>
+#include <stdio.h>
+#include <stdint.h>
+#include <stdlib.h>
+#include <string.h>
+#include <sys/ioctl.h>
+#include <sys/socket.h>
+#include <sys/stat.h>
+#include <sys/time.h>
+#include <sys/types.h>
+#include <sys/wait.h>
+#include <unistd.h>
+
+#define MSG_ZEROCOPY	0x4000000
+
+#define NUM_LOOPS	4	/* MUST BE > 1 for corking to work */
+#define TXC_FUDGE	100
+
+static int  cfg_len_ms		= 4200;
+static int  cfg_report_len_ms	= 1000;
+static int  cfg_payload_len	= ((1 << 16) - 100);
+static bool cfg_test_packet;
+static bool cfg_test_raw;
+static bool cfg_test_raw_hdrincl;
+static bool cfg_test_tcp;
+static bool cfg_test_udp;
+static bool cfg_test_udp_cork;
+static bool cfg_verbose;
+static bool cfg_zerocopy;
+
+static bool flag_cork;
+
+static uint64_t tstop, treport;
+
+static unsigned long gettimeofday_ms(void)
+{
+	struct timeval tv;
+
+	gettimeofday(&tv, NULL);
+	return (tv.tv_sec * 1000) + (tv.tv_usec / 1000);
+}
+
+static uint16_t get_ip_csum(const uint16_t *start, int num_words)
+{
+	unsigned long sum = 0;
+	int i;
+
+	for (i = 0; i < num_words; i++)
+		sum += start[i];
+
+	while (sum >> 16)
+		sum = (sum & 0xFFFF) + (sum >> 16);
+
+	return ~sum;
+}
+
+static void timer_start(int timeout_ms)
+{
+	uint64_t tstart;
+
+	tstart = gettimeofday_ms();
+	treport = tstart + cfg_report_len_ms;
+	tstop = tstart + timeout_ms;
+}
+
+static bool timer_report(void)
+{
+	uint64_t tstart;
+
+	tstart = gettimeofday_ms();
+	if (tstart < treport)
+		return false;
+
+	treport = tstart + cfg_report_len_ms;
+	return true;
+}
+
+static bool timer_stop(void)
+{
+	return gettimeofday_ms() > tstop;
+}
+
+static int getnumcpus(void)
+{
+	int num = sysconf(_SC_NPROCESSORS_ONLN);
+
+	if (num < 1)
+		error(1, 0, "get num cpus\n");
+	return num;
+}
+
+static int setcpu(int cpu)
+{
+	cpu_set_t mask;
+
+	CPU_ZERO(&mask);
+	CPU_SET(cpu, &mask);
+	if (sched_setaffinity(0, sizeof(mask), &mask)) {
+		fprintf(stderr, "setaffinity %d\n", cpu);
+		return 1;
+	}
+
+	return 0;
+}
+
+static void test_mtu_is_max(int fd)
+{
+	struct ifreq ifr = {
+		.ifr_name = "lo",
+	};
+
+	if (ioctl(fd, SIOCGIFMTU, &ifr))
+		error(1, errno, "ioctl get mtu");
+
+	if (ifr.ifr_mtu != 1 << 16)
+		error(1, 0, "mtu=%u expected=2^16\n", ifr.ifr_mtu);
+}
+
+static void do_poll(int fd, int dir)
+{
+	struct pollfd pfd;
+	int ret;
+
+	pfd.events = dir;
+	pfd.revents = 0;
+	pfd.fd = fd;
+
+	ret = poll(&pfd, 1, 10);
+	if (ret == -1)
+		error(1, errno, "poll");
+	if (ret == 0)
+		error(1, 0, "poll: EAGAIN");
+}
+
+static bool do_write_once(int fd, struct msghdr *msg, int total_len, bool zcopy)
+{
+	int ret, flags;
+
+	flags = MSG_DONTWAIT;
+	if (zcopy)
+		flags |= MSG_ZEROCOPY;
+
+	ret = sendmsg(fd, msg, flags);
+	if (ret == -1 && (errno == EAGAIN || errno == ENOBUFS))
+		return false;
+
+	if (ret == -1)
+		error(1, errno, "send");
+	if (ret != total_len)
+		error(1, 0, "send: ret=%u\n", ret);
+
+	return true;
+}
+
+static void do_print_data_mismatch(char *tx, char *rx, int len)
+{
+	int i;
+
+	fprintf(stderr, "tx: ");
+	for (i = 0; i < len; i++)
+		fprintf(stderr, "%hx ", tx[i] & 0xff);
+	fprintf(stderr, "\nrx: ");
+	for (i = 0; i < len; i++)
+		fprintf(stderr, "%hx ", rx[i] & 0xff);
+	fprintf(stderr, "\n");
+}
+
+/* Flush @remaining bytes from the socket, blocking if necessary */
+static void do_flush_tcp(int fd, long remaining)
+{
+	unsigned long tstop;
+	int ret;
+
+	tstop = gettimeofday_ms() + 500;
+	while (remaining > 0 && gettimeofday_ms() < tstop) {
+		ret = recv(fd, NULL, remaining, MSG_TRUNC);
+		if (ret == -1)
+			error(1, errno, "recv (flush)");
+		remaining -= ret;
+		if (!remaining)
+			return;
+		fprintf(stderr, "recv (flush): %dB, %ldB left\n",
+			ret, remaining);
+	}
+
+	error(1, 0, "recv (flush): %ldB at timeout", remaining);
+}
+
+static bool do_read_once(int fd, char *tbuf, int type, bool corked, long *bytes)
+{
+	char rbuf[32], *payload;
+	int ret, len, expected, flags;
+
+	flags = MSG_DONTWAIT;
+	/* MSG_TRUNC differs on SOCK_STREAM: it flushes the buffer */
+	if (type != SOCK_STREAM)
+		flags |= MSG_TRUNC;
+
+	ret = recv(fd, rbuf, sizeof(rbuf), flags);
+	if (ret == -1 && errno == EAGAIN)
+		return false;
+	if (ret == -1)
+		error(1, errno, "recv");
+	if (type == SOCK_RAW)
+		ret -= sizeof(struct iphdr);
+
+	expected = sizeof(rbuf);
+	if (flags & MSG_TRUNC) {
+		expected = cfg_payload_len;
+		if (corked)
+			expected *= NUM_LOOPS;
+		*bytes += expected;
+	} else {
+		*bytes += cfg_payload_len;
+	}
+	if (ret != expected)
+		error(1, 0, "recv: ret=%u (exp=%u)\n", ret, expected);
+
+	payload = rbuf;
+	len = sizeof(rbuf);
+	if (type == SOCK_RAW) {
+		payload += sizeof(struct iphdr);
+		len -= sizeof(struct iphdr);
+	}
+
+	if (memcmp(payload, tbuf, len)) {
+		do_print_data_mismatch(tbuf, payload, len);
+		error(1, 0, "\nrecv: data mismatch\n");
+	}
+
+	/* Stream sockets are not truncated, so flush explicitly */
+	if (type == SOCK_STREAM)
+		do_flush_tcp(fd, cfg_payload_len - sizeof(rbuf));
+
+	return true;
+}
+
+static void setup_iph(struct iphdr *iph, uint16_t payload_len)
+{
+	memset(iph, 0, sizeof(*iph));
+	iph->version	= 4;
+	iph->tos	= 0;
+	iph->ihl	= 5;
+	iph->ttl	= 8;
+	iph->saddr	= htonl(INADDR_LOOPBACK);
+	iph->daddr	= htonl(INADDR_LOOPBACK);
+	iph->protocol	= IPPROTO_EGP;
+	iph->tot_len	= htons(sizeof(*iph) + payload_len);
+	iph->check	= get_ip_csum((void *) iph, iph->ihl << 1);
+	/* No need to calculate checksum: set by kernel */
+}
+
+static void do_cork(int fd, bool enable)
+{
+	int cork = !!enable;
+
+	if (setsockopt(fd, IPPROTO_UDP, UDP_CORK, &cork, sizeof(cork)))
+		error(1, errno, "cork %u", enable);
+}
+
+static int do_read_notification(int fd)
+{
+	struct sock_extended_err *serr;
+	struct cmsghdr *cm;
+	struct msghdr msg = {};
+	char control[100];
+	int64_t hi, lo, range;
+	int ret;
+
+	msg.msg_control = control;
+	msg.msg_controllen = sizeof(control);
+
+	ret = recvmsg(fd, &msg, MSG_DONTWAIT | MSG_ERRQUEUE);
+	if (ret == -1 && errno == EAGAIN)
+		return 0;
+
+	if (ret == -1)
+		error(1, errno, "recvmsg notification");
+	if (msg.msg_flags & MSG_CTRUNC)
+		error(1, errno, "recvmsg notification: truncated");
+
+	cm = CMSG_FIRSTHDR(&msg);
+	if (!cm)
+		error(1, 0, "cmsg: no cmsg");
+	if (!((cm->cmsg_level == SOL_IP && cm->cmsg_type == IP_RECVERR) ||
+	      (cm->cmsg_level == SOL_IPV6 && cm->cmsg_type == IPV6_RECVERR) ||
+	      (cm->cmsg_level == SOL_PACKET && cm->cmsg_type == PACKET_TX_TIMESTAMP)))
+		error(1, 0, "serr: wrong type");
+
+	serr = (void *) CMSG_DATA(cm);
+	if (serr->ee_errno != 0 || serr->ee_origin != SO_EE_ORIGIN_ZEROCOPY)
+		error(1, 0, "serr: wrong type");
+
+	hi = serr->ee_data;
+	lo = serr->ee_info;
+	range = hi - lo + 1;
+	if (range < 0)
+		range += UINT32_MAX;
+
+	if (cfg_verbose)
+		fprintf(stderr, "completed: %lu (h=%lu l=%lu)\n",
+			range, hi, lo);
+
+	return (int) range;
+}
+
+static int do_read_notifications(int fd)
+{
+	int ret, len = 0;
+
+	do {
+		ret = do_read_notification(fd);
+		len += ret;
+	} while (ret);
+
+	return len;
+}
+
+static void do_run(int fdt, int fdr, int domain, int type, int protocol)
+{
+	static char tbuf[1 << 16];
+	struct sockaddr_ll laddr;
+	struct msghdr msg;
+	struct iovec iov[2];
+	struct iphdr iph;
+	long numtx = 0, numrx = 0, bytesrx = 0, numtxc = 0, expected_txc = 0;
+	int cpu, i, total_len = 0, type_r = type;
+
+	memset(&msg, 0, sizeof(msg));
+	memset(&iov, 0, sizeof(iov));
+	for (i = 0; i < sizeof(tbuf); i++)
+		tbuf[i] = 'a' + (i % 26);
+
+	i = 0;
+
+	/* for packet sockets, must prepare link layer information */
+	if (domain == PF_PACKET) {
+		memset(&laddr, 0, sizeof(laddr));
+		laddr.sll_family	= AF_PACKET;
+		laddr.sll_ifindex	= 1;	/* lo */
+		laddr.sll_protocol	= htons(ETH_P_IP);
+		laddr.sll_halen		= ETH_ALEN;
+
+		msg.msg_name		= &laddr;
+		msg.msg_namelen		= sizeof(laddr);
+
+		/* with PF_PACKET tx, do not expect ip_hdr on Rx */
+		type_r			= SOCK_DGRAM;
+	}
+
+	if (domain == PF_PACKET || protocol == IPPROTO_RAW) {
+		setup_iph(&iph, cfg_payload_len);
+		iov[i].iov_base = (void *) &iph;
+		iov[i].iov_len = sizeof(iph);
+		total_len += iov[i].iov_len;
+		i++;
+	}
+	iov[i].iov_base = tbuf;
+	iov[i].iov_len = cfg_payload_len;
+	total_len += iov[i].iov_len;
+
+	msg.msg_iovlen = i + 1;
+	msg.msg_iov = iov;
+
+	cpu = getnumcpus() - 1;
+	setcpu(cpu);
+	fprintf(stderr, "cpu: %u\n", cpu);
+
+	do {
+		if (cfg_zerocopy)
+			numtxc += do_read_notifications(fdt);
+
+		if (flag_cork)
+			do_cork(fdt, true);
+
+		for (i = 0; i < NUM_LOOPS; i++) {
+			bool do_zcopy = cfg_zerocopy;
+
+			if (flag_cork && (i & 0x1))
+				do_zcopy = false;
+
+			if (!do_write_once(fdt, &msg, total_len, do_zcopy)) {
+				do_poll(fdt, POLLOUT);
+				break;
+			}
+
+			numtx++;
+			if (do_zcopy)
+				expected_txc++;
+		}
+		if (flag_cork)
+			do_cork(fdt, false);
+
+		while (do_read_once(fdr, tbuf, type_r, flag_cork, &bytesrx))
+			numrx++;
+
+		if (timer_report()) {
+			fprintf(stderr, "rx=%lu (%lu MB) tx=%lu txc=%lu\n",
+				numrx, bytesrx >> 20, numtx, numtxc);
+		}
+	} while (!timer_stop());
+
+	if (cfg_zerocopy)
+		numtxc += do_read_notifications(fdt);
+
+	if (flag_cork)
+		numtx /= NUM_LOOPS;
+
+	if (labs(numtx - numrx) > TXC_FUDGE)
+		error(1, 0, "missing packets: %lu != %lu\n", numrx, numtx);
+	if (cfg_zerocopy && labs(expected_txc - numtxc) > TXC_FUDGE)
+		error(1, 0, "missing completions: rx=%lu expected=%lu\n",
+			    numtxc, expected_txc);
+}
+
+static int do_setup_rx(int domain, int type, int protocol)
+{
+	int fdr;
+
+	if (domain == PF_PACKET) {
+		/* Even when testing PF_PACKET Tx, Rx on PF_INET */
+		domain = PF_INET;
+		type = SOCK_RAW;
+		protocol = IPPROTO_EGP;
+	} else if (protocol == IPPROTO_RAW) {
+		protocol = IPPROTO_EGP;
+	}
+
+	fdr = socket(domain, type, protocol);
+	if (fdr == -1)
+		error(1, errno, "socket r");
+
+	return fdr;
+}
+
+static void do_setup_and_run(int domain, int type, int protocol)
+{
+	struct sockaddr_in addr;
+	socklen_t alen;
+	int fdr, fdt, ret;
+
+	fprintf(stderr, "test socket(%u, %u, %u)\n", domain, type, protocol);
+
+	fdr = do_setup_rx(domain, type, protocol);
+	fdt = socket(domain, type, protocol);
+	if (fdt == -1)
+		error(1, errno, "socket t");
+
+	test_mtu_is_max(fdr);
+
+	if (domain != PF_PACKET) {
+		memset(&addr, 0, sizeof(addr));
+		addr.sin_family = AF_INET;
+		addr.sin_addr.s_addr = htonl(INADDR_LOOPBACK);
+		alen = sizeof(addr);
+
+		if (bind(fdr, (void *) &addr, sizeof(addr)))
+			error(1, errno, "bind");
+		if (type == SOCK_STREAM && listen(fdr, 1))
+			error(1, errno, "listen");
+		if (getsockname(fdr, (void *) &addr, &alen) ||
+		    alen != sizeof(addr))
+			error(1, 0, "getsockname");
+		if (connect(fdt, (void *) &addr, sizeof(addr)))
+			error(1, errno, "connect");
+	}
+
+	if (type == SOCK_STREAM) {
+		int fda = fdr;
+
+		fdr = accept(fda, NULL, NULL);
+		if (fdr == -1)
+			error(1, errno, "accept");
+		if (close(fda))
+			error(1, errno, "close listen sock");
+	}
+
+	ret = 1 << 21;
+	if (setsockopt(fdr, SOL_SOCKET, SO_RCVBUF, &ret, sizeof(ret)))
+		error(1, errno, "socklen r");
+	if (setsockopt(fdt, SOL_SOCKET, SO_SNDBUF, &ret, sizeof(ret)))
+		error(1, errno, "socklen t");
+
+	timer_start(cfg_len_ms);
+	do_run(fdt, fdr, domain, type, protocol);
+
+	if (close(fdt))
+		error(1, errno, "close t");
+	if (close(fdr))
+		error(1, errno, "close r");
+
+}
+
+static void parse_opts(int argc, char **argv)
+{
+	const char on[] = "ON", off[] = "OFF";
+	const int max_payload = IP_MAXPACKET - sizeof(struct iphdr);
+	int c;
+
+	while ((c = getopt(argc, argv, "l:prRs:tuUvz")) != -1) {
+		switch (c) {
+		case 'l':
+			cfg_len_ms = strtoul(optarg, NULL, 10) * 1000;
+			break;
+		case 'p':
+			cfg_test_packet = true;
+			break;
+		case 'r':
+			cfg_test_raw = true;
+			break;
+		case 'R':
+			cfg_test_raw_hdrincl = true;
+			break;
+		case 's':
+			cfg_payload_len = strtoul(optarg, NULL, 0);
+			break;
+		case 't':
+			cfg_test_tcp = true;
+			break;
+		case 'u':
+			cfg_test_udp = true;
+			break;
+		case 'U':
+			cfg_test_udp_cork = true;
+			break;
+		case 'v':
+			cfg_verbose = true;
+			break;
+		case 'z':
+			cfg_zerocopy = true;
+			break;
+		}
+	}
+
+	if (cfg_payload_len > max_payload)
+		error(1, 0, "-s: payload too long");
+	if (cfg_payload_len >= (max_payload - sizeof(struct tcphdr) - 10))
+		fprintf(stderr, "warn: len may exceed limit\n");
+
+	if (cfg_verbose) {
+		fprintf(stderr, "time:     %u ms\n"
+				"size:     %u B\n"
+				"zerocopy: %s\n",
+			cfg_len_ms,
+			cfg_payload_len,
+			cfg_zerocopy ? on : off);
+	}
+}
+
+int main(int argc, char **argv)
+{
+	parse_opts(argc, argv);
+
+	if (cfg_test_packet)
+		do_setup_and_run(PF_PACKET, SOCK_DGRAM, 0);
+	if (cfg_test_udp)
+		do_setup_and_run(PF_INET, SOCK_DGRAM, 0);
+	if (cfg_test_udp_cork) {
+		int saved_payload_len = cfg_payload_len;
+
+		cfg_payload_len /= NUM_LOOPS;
+
+		flag_cork = true;
+		do_setup_and_run(PF_INET, SOCK_DGRAM, 0);
+		flag_cork = false;
+
+		cfg_payload_len = saved_payload_len;
+	}
+	if (cfg_test_raw)
+		do_setup_and_run(PF_INET, SOCK_RAW, IPPROTO_EGP);
+	if (cfg_test_raw_hdrincl)
+		do_setup_and_run(PF_INET, SOCK_RAW, IPPROTO_RAW);
+	if (cfg_test_tcp)
+		do_setup_and_run(PF_INET, SOCK_STREAM, 0);
+
+	fprintf(stderr, "OK. All tests passed\n");
+	return 0;
+}
-- 
2.11.0.483.g087da7b7c-goog

^ permalink raw reply related	[flat|nested] 37+ messages in thread

* Re: [PATCH RFC v2 02/12] sock: skb_copy_ubufs support for compound pages
  2017-02-22 16:38 ` [PATCH RFC v2 02/12] sock: skb_copy_ubufs support for compound pages Willem de Bruijn
@ 2017-02-22 20:33   ` Eric Dumazet
  2017-02-23  1:51     ` Willem de Bruijn
  0 siblings, 1 reply; 37+ messages in thread
From: Eric Dumazet @ 2017-02-22 20:33 UTC (permalink / raw)
  To: Willem de Bruijn; +Cc: netdev, Willem de Bruijn

On Wed, 2017-02-22 at 11:38 -0500, Willem de Bruijn wrote:
> From: Willem de Bruijn <willemb@google.com>
> 
> Refine skb_copy_ubufs to support compound pages. With upcoming TCP
> and UDP zerocopy sendmsg, such fragments may appear.
> 
> These skbuffs can have both kernel and zerocopy fragments, e.g., when
> corking. Avoid unnecessary copying of fragments that have no userspace
> reference.
> 
> It is not safe to modify skb frags when the skbuff is shared. This
> should not happen. Fail loudly if we find an unexpected edge case.
> 
> Signed-off-by: Willem de Bruijn <willemb@google.com>
> ---
>  net/core/skbuff.c | 24 +++++++++++++++++++++++-
>  1 file changed, 23 insertions(+), 1 deletion(-)
> 
> diff --git a/net/core/skbuff.c b/net/core/skbuff.c
> index f3557958e9bf..67e4216fca01 100644
> --- a/net/core/skbuff.c
> +++ b/net/core/skbuff.c
> @@ -944,6 +944,9 @@ EXPORT_SYMBOL_GPL(skb_morph);
>   *	If this function is called from an interrupt gfp_mask() must be
>   *	%GFP_ATOMIC.
>   *
> + *	skb_shinfo(skb) can only be safely modified when not accessed
> + *	concurrently. Fail if the skb is shared or cloned.
> + *
>   *	Returns 0 on success or a negative error code on failure
>   *	to allocate kernel memory to copy to.
>   */
> @@ -954,11 +957,29 @@ int skb_copy_ubufs(struct sk_buff *skb, gfp_t gfp_mask)
>  	struct page *page, *head = NULL;
>  	struct ubuf_info *uarg = skb_shinfo(skb)->destructor_arg;
>  
> +	if (skb_shared(skb) || skb_cloned(skb)) {
> +		WARN_ON_ONCE(1);
> +		return -EINVAL;
> +	}
> +
>  	for (i = 0; i < num_frags; i++) {
>  		u8 *vaddr;
> +		unsigned int order = 0;
> +		gfp_t mask = gfp_mask;
>  		skb_frag_t *f = &skb_shinfo(skb)->frags[i];
>  
> -		page = alloc_page(gfp_mask);
> +		page = skb_frag_page(f);
> +		if (page_count(page) == 1) {
> +			skb_frag_ref(skb, i);

This could be : get_page(page);

> +			goto copy_done;
> +		}
> +
> +		if (f->size > PAGE_SIZE) {
> +			order = get_order(f->size);
> +			mask |= __GFP_COMP;

Note that this would probably fail under memory pressure.

We could instead try to explode the few segments into order-0 only
pages.

Hopefully this case should not be frequent.

^ permalink raw reply	[flat|nested] 37+ messages in thread

* Re: [PATCH RFC v2 02/12] sock: skb_copy_ubufs support for compound pages
  2017-02-22 20:33   ` Eric Dumazet
@ 2017-02-23  1:51     ` Willem de Bruijn
  0 siblings, 0 replies; 37+ messages in thread
From: Willem de Bruijn @ 2017-02-23  1:51 UTC (permalink / raw)
  To: Eric Dumazet; +Cc: Network Development, Willem de Bruijn

>>
>> -             page = alloc_page(gfp_mask);
>> +             page = skb_frag_page(f);
>> +             if (page_count(page) == 1) {
>> +                     skb_frag_ref(skb, i);
>
> This could be : get_page(page);

Ah, indeed. Thanks.

>
>> +                     goto copy_done;
>> +             }
>> +
>> +             if (f->size > PAGE_SIZE) {
>> +                     order = get_order(f->size);
>> +                     mask |= __GFP_COMP;
>
> Note that this would probably fail under memory pressure.
>
> We could instead try to explode the few segments into order-0 only
> pages.

Good point. I'll revise to use only order-0 here.
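
A rough sketch of that order-0-only fallback (hypothetical, not
necessarily the final revision; names follow the existing
skb_copy_ubufs loop):

	u32 f_off = f->page_offset, f_size = f->size, done = 0;

	while (done < f_size) {
		u32 chunk = min_t(u32, f_size - done, PAGE_SIZE);

		page = alloc_page(gfp_mask);
		if (!page)
			return -ENOMEM;	/* after releasing pages copied so far */

		vaddr = kmap_atomic(skb_frag_page(f));
		memcpy(page_address(page), vaddr + f_off + done, chunk);
		kunmap_atomic(vaddr);

		/* link page into the replacement frag list here */
		done += chunk;
	}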

^ permalink raw reply	[flat|nested] 37+ messages in thread

* Re: [PATCH RFC v2 00/12] socket sendmsg MSG_ZEROCOPY
  2017-02-22 16:38 [PATCH RFC v2 00/12] socket sendmsg MSG_ZEROCOPY Willem de Bruijn
                   ` (11 preceding siblings ...)
  2017-02-22 16:39 ` [PATCH RFC v2 12/12] test: add sendmsg zerocopy tests Willem de Bruijn
@ 2017-02-23 15:45 ` David Miller
  2017-02-24 23:03 ` Alexei Starovoitov
  2017-02-27 18:57 ` Michael Kerrisk
  14 siblings, 0 replies; 37+ messages in thread
From: David Miller @ 2017-02-23 15:45 UTC (permalink / raw)
  To: willemdebruijn.kernel; +Cc: netdev, willemb

From: Willem de Bruijn <willemdebruijn.kernel@gmail.com>
Date: Wed, 22 Feb 2017 11:38:49 -0500

> From: Willem de Bruijn <willemb@google.com>
> 
> RFCv2:
> 
> I have received a few requests for status and rebased code of this
> feature. We have been running this code internally, discovering and
> fixing various bugs. With net-next closed, now seems like a good time
> to share an updated patchset with fixes. The rebase from RFCv1/v4.2
> was mostly straightforward: mainly iov_iter changes. Full changelog:

I've been over this series once or twice and generally speaking it looks
fine to me.

^ permalink raw reply	[flat|nested] 37+ messages in thread

* Re: [PATCH RFC v2 00/12] socket sendmsg MSG_ZEROCOPY
  2017-02-22 16:38 [PATCH RFC v2 00/12] socket sendmsg MSG_ZEROCOPY Willem de Bruijn
                   ` (12 preceding siblings ...)
  2017-02-23 15:45 ` [PATCH RFC v2 00/12] socket sendmsg MSG_ZEROCOPY David Miller
@ 2017-02-24 23:03 ` Alexei Starovoitov
  2017-02-25  0:25   ` Willem de Bruijn
  2017-02-27 18:57 ` Michael Kerrisk
  14 siblings, 1 reply; 37+ messages in thread
From: Alexei Starovoitov @ 2017-02-24 23:03 UTC (permalink / raw)
  To: Willem de Bruijn; +Cc: netdev, Willem de Bruijn

On Wed, Feb 22, 2017 at 11:38:49AM -0500, Willem de Bruijn wrote:
> 
> * Limitations / Known Issues
> 
> - PF_INET6 is not yet supported.

we struggled so far to make it work in our setups which are ipv6 only.
Looking at patches it seems the code should just work.
What particularly is missing?

Great stuff. Looking forward to net-next reopening :)

^ permalink raw reply	[flat|nested] 37+ messages in thread

* Re: [PATCH RFC v2 00/12] socket sendmsg MSG_ZEROCOPY
  2017-02-24 23:03 ` Alexei Starovoitov
@ 2017-02-25  0:25   ` Willem de Bruijn
  0 siblings, 0 replies; 37+ messages in thread
From: Willem de Bruijn @ 2017-02-25  0:25 UTC (permalink / raw)
  To: Alexei Starovoitov; +Cc: Network Development, Willem de Bruijn

On Fri, Feb 24, 2017 at 6:03 PM, Alexei Starovoitov
<alexei.starovoitov@gmail.com> wrote:
> On Wed, Feb 22, 2017 at 11:38:49AM -0500, Willem de Bruijn wrote:
>>
>> * Limitations / Known Issues
>>
>> - PF_INET6 is not yet supported.
>
> we struggled so far to make it work in our setups which are ipv6 only.
> Looking at patches it seems the code should just work.
> What particularly is missing ?
>
> Great stuff. Looking forward to net-next reopening :)

Thanks for taking the feature for a spin!

The udp and raw paths require separate ipv6 patches. TCP should indeed
just work. I just ran a slightly modified snd_zerocopy_lo with good
results as well as a hacked netperf to another host.

I should have had ipv6 from the start, of course. Will add it before
resubmitting when net-next opens. For now, quick hack to
snd_zerocopy_lo.c:

diff --git a/tools/testing/selftests/net/snd_zerocopy_lo.c
b/tools/testing/selftests/net/snd_zerocopy_lo.c
index 309b016a4fd5..38a165e2af64 100644
--- a/tools/testing/selftests/net/snd_zerocopy_lo.c
+++ b/tools/testing/selftests/net/snd_zerocopy_lo.c
@@ -453,7 +453,7 @@ static int do_setup_rx(int domain, int type, int protocol)

 static void do_setup_and_run(int domain, int type, int protocol)
 {
-       struct sockaddr_in addr;
+       struct sockaddr_in6 addr;
        socklen_t alen;
        int fdr, fdt, ret;

@@ -468,8 +468,8 @@ static void do_setup_and_run(int domain, int type,
int protocol)

        if (domain != PF_PACKET) {
                memset(&addr, 0, sizeof(addr));
-               addr.sin_family = AF_INET;
-               addr.sin_addr.s_addr = htonl(INADDR_LOOPBACK);
+               addr.sin6_family = AF_INET6;
+               addr.sin6_addr = in6addr_loopback;
                alen = sizeof(addr);

                if (bind(fdr, (void *) &addr, sizeof(addr)))
@@ -589,7 +589,7 @@ int main(int argc, char **argv)
        if (cfg_test_raw_hdrincl)
                do_setup_and_run(PF_INET, SOCK_RAW, IPPROTO_RAW);
        if (cfg_test_tcp)
-               do_setup_and_run(PF_INET, SOCK_STREAM, 0);
+               do_setup_and_run(PF_INET6, SOCK_STREAM, 0);


Loopback zerocopy is disabled in RFCv2, so using snd_zerocopy_lo to
feature requires this hack in skb_orphan_frags_rx:

 static inline int skb_orphan_frags_rx(struct sk_buff *skb, gfp_t gfp_mask)
 {
-       if (likely(!skb_zcopy(skb)))
-               return 0;
-       return skb_copy_ubufs(skb, gfp_mask);
+       return skb_orphan_frags(skb, gfp_mask);
 }

With this change, I see

$ ./snd_zerocopy_lo_ipv6 -t
test socket(10, 1, 0)
cpu: 23
rx=106364 (6637 MB) tx=106364 txc=0
rx=209736 (13088 MB) tx=209736 txc=0
rx=314524 (19627 MB) tx=314524 txc=0
rx=419424 (26174 MB) tx=419424 txc=0
OK. All tests passed

$ ./snd_zerocopy_lo_ipv6 -t -z
test socket(10, 1, 0)
cpu: 23
rx=239792 (14964 MB) tx=239792 txc=239786
rx=477376 (29790 MB) tx=477376 txc=477370
rx=715016 (44620 MB) tx=715016 txc=715010
rx=952820 (59460 MB) tx=952820 txc=952814
OK. All tests passed

In comparison, the same without the change

$ ./snd_zerocopy_lo_ipv6 -t
test socket(10, 1, 0)
cpu: 23
rx=109908 (6858 MB) tx=109908 txc=0
rx=217100 (13548 MB) tx=217100 txc=0
rx=326584 (20380 MB) tx=326584 txc=0
rx=429568 (26807 MB) tx=429568 txc=0
OK. All tests passed

$ ./snd_zerocopy_lo_ipv6 -t  -z
test socket(10, 1, 0)
cpu: 23
rx=87636 (5468 MB) tx=87636 txc=87630
rx=174328 (10878 MB) tx=174328 txc=174322
rx=260360 (16247 MB) tx=260360 txc=260354
rx=346512 (21623 MB) tx=346512 txc=346506

Here the sk_buff hits the deep copy in skb_copy_ubufs called from
__netif_receive_skb_core, which actually degrades performance versus
copying as part of the sendmsg() syscall.

The netperf change is to add MSG_ZEROCOPY to send() in send_tcp_stream
and to add a recvmsg(send_socket, &msg, MSG_ERRQUEUE) call to the same
function, preferably called only once every N iterations. This does
not take any additional explicit references on the send_ring element
while data is in flight, so is really a hack, but ring contents should
be static throughout the test. I did not modify the omni tests, so
this requires building with --no-omni.
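
Roughly, as a hypothetical sketch (not the actual netperf change;
ring_ptr, sends and notify_every are stand-ins for the send_ring
element's buffer, an iteration counter and the chosen interval):

	send(send_socket, ring_ptr, send_size, MSG_ZEROCOPY);

	if ((++sends % notify_every) == 0) {
		struct msghdr msg = {0};
		char control[100];

		msg.msg_control = control;
		msg.msg_controllen = sizeof(control);
		recvmsg(send_socket, &msg, MSG_DONTWAIT | MSG_ERRQUEUE);
	}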

^ permalink raw reply related	[flat|nested] 37+ messages in thread

* Re: [PATCH RFC v2 00/12] socket sendmsg MSG_ZEROCOPY
  2017-02-22 16:38 [PATCH RFC v2 00/12] socket sendmsg MSG_ZEROCOPY Willem de Bruijn
                   ` (13 preceding siblings ...)
  2017-02-24 23:03 ` Alexei Starovoitov
@ 2017-02-27 18:57 ` Michael Kerrisk
  2017-02-28 19:46   ` Andy Lutomirski
  14 siblings, 1 reply; 37+ messages in thread
From: Michael Kerrisk @ 2017-02-27 18:57 UTC (permalink / raw)
  To: Willem de Bruijn; +Cc: netdev, Willem de Bruijn, Linux API

[CC += linux-api@vger.kernel.org]

Hi Willem

This is a change to the kernel-user-space API. Please CC
linux-api@vger.kernel.org on any future iterations of this patch.

Thanks,

Michael



On Wed, Feb 22, 2017 at 5:38 PM, Willem de Bruijn
<willemdebruijn.kernel@gmail.com> wrote:
> From: Willem de Bruijn <willemb@google.com>
>
> RFCv2:
>
> I have received a few requests for status and rebased code of this
> feature. We have been running this code internally, discovering and
> fixing various bugs. With net-next closed, now seems like a good time
> to share an updated patchset with fixes. The rebase from RFCv1/v4.2
> was mostly straightforward: mainly iov_iter changes. Full changelog:
>
>   RFC -> RFCv2:
>     - review comment: do not loop skb with zerocopy frags onto rx:
>           add skb_orphan_frags_rx to orphan even refcounted frags
>           call this in __netif_receive_skb_core, deliver_skb and tun:
>           the same as 1080e512d44d ("net: orphan frags on receive")
>     - fix: hold an explicit sk reference on each notification skb.
>           previously relied on the reference (or wmem) held by the
>           data skb that would trigger notification, but this breaks
>           on skb_orphan.
>     - fix: when aborting a send, do not inc the zerocopy counter
>           this caused gaps in the notification chain
>     - fix: in packet with SOCK_DGRAM, pull ll headers before calling
>           zerocopy_sg_from_iter
>     - fix: if sock_zerocopy_realloc does not allow coalescing,
>           do not fail, just allocate a new ubuf
>     - fix: in tcp, check return value of second allocation attempt
>     - chg: allocate notification skbs from optmem
>           to avoid affecting tcp write queue accounting (TSQ)
>     - chg: limit #locked pages (ulimit) per user instead of per process
>     - chg: grow notification ids from 16 to 32 bit
>       - pass range [lo, hi] through 32 bit fields ee_info and ee_data
>     - chg: rebased to davem-net-next on top of v4.10-rc7
>     - add: limit notification coalescing
>           sharing ubufs limits overhead, but delays notification until
>           the last packet is released, possibly unbounded. Add a cap.
>     - tests: add snd_zerocopy_lo pf_packet test
>     - tests: two bugfixes (add do_flush_tcp, ++sent not only in debug)
>
> The change to allocate notification skbuffs from optmem requires
> ensuring that net.core.optmem is at least a few 100KB. To
> experiment, run
>
>   sysctl -w net.core.optmem_max=1048576
>
> The snd_zerocopy_lo benchmarks reported in the individual patches were
> rerun for RFCv2. To make them work, calls to skb_orphan_frags_rx were
> replaced with skb_orphan_frags to allow looping to local sockets. The
> netperf results below are also rerun with v2.
>
> In application load, copy avoidance shows a roughly 5% systemwide
> reduction in cycles when streaming large flows and a 4-8% reduction in
> wall clock time on early tensorflow test workloads.
>
>
> Overview (from original RFC):
>
> Add zerocopy socket sendmsg() support with flag MSG_ZEROCOPY.
> Implement the feature for TCP, UDP, RAW and packet sockets. This is
> a generalization of a previous packet socket RFC patch
>
>   http://patchwork.ozlabs.org/patch/413184/
>
> On a send call with MSG_ZEROCOPY, the kernel pins the user pages and
> creates skbuff fragments directly from these pages. On tx completion,
> it notifies the socket owner that it is safe to modify memory by
> queuing a completion notification onto the socket error queue.
>
> The kernel already implements such copy avoidance with vmsplice plus
> splice and with ubuf_info for tun and virtio. Extend the second
> with features required by TCP and others: reference counting to
> support cloning (retransmit queue) and shared fragments (GSO) and
> notification coalescing to handle corking.
>
> Notifications are queued onto the socket error queue as a range
> [N, N+m], where N is a per-socket counter incremented on each
> successful zerocopy send call.
>
> * Performance
>
> The below table shows cycles reported by perf for a netperf process
> sending a single 10 Gbps TCP_STREAM. The first three columns show
> Mcycles spent in the netperf process context. The second three columns
> show time spent systemwide (-a -C A,B) on the two cpus that run the
> process and interrupt handler. Reported is the median of at least 3
> runs. std is a standard netperf, zc uses zerocopy and % is the ratio.
> Netperf is pinned to cpu 2, network interrupts to cpu3, rps and rfs
> are disabled and the kernel is booted with idle=halt.
>
> NETPERF=./netperf -t TCP_STREAM -H $host -T 2 -l 30 -- -m $size
>
> perf stat -e cycles $NETPERF
> perf stat -C 2,3 -a -e cycles $NETPERF
>
>         --process cycles--      ----cpu cycles----
>            std      zc   %      std         zc   %
> 4K      27,609  11,217  41      49,217  39,175  79
> 16K     21,370   3,823  18      43,540  29,213  67
> 64K     20,557   2,312  11      42,189  26,910  64
> 256K    21,110   2,134  10      43,006  27,104  63
> 1M      20,987   1,610   8      42,759  25,931  61
>
> Perf record indicates the main source of these differences. Process
> cycles only at 1M writes (perf record; perf report -n):
>
> std:
> Samples: 42K of event 'cycles', Event count (approx.): 21258597313
>  79.41%         33884  netperf  [kernel.kallsyms]  [k] copy_user_generic_string
>   3.27%          1396  netperf  [kernel.kallsyms]  [k] tcp_sendmsg
>   1.66%           694  netperf  [kernel.kallsyms]  [k] get_page_from_freelist
>   0.79%           325  netperf  [kernel.kallsyms]  [k] tcp_ack
>   0.43%           188  netperf  [kernel.kallsyms]  [k] __alloc_skb
>
> zc:
> Samples: 1K of event 'cycles', Event count (approx.): 1439509124
>  30.36%           584  netperf.zerocop  [kernel.kallsyms]  [k] gup_pte_range
>  14.63%           284  netperf.zerocop  [kernel.kallsyms]  [k] __zerocopy_sg_from_iter
>   8.03%           159  netperf.zerocop  [kernel.kallsyms]  [k] skb_zerocopy_add_frags_iter
>   4.84%            96  netperf.zerocop  [kernel.kallsyms]  [k] __alloc_skb
>   3.10%            60  netperf.zerocop  [kernel.kallsyms]  [k] kmem_cache_alloc_node
>
>
> * Safety
>
> The number of pages that can be pinned on behalf of a user with
> MSG_ZEROCOPY is bound by the locked memory ulimit.
>
> While the kernel holds process memory pinned, a process cannot safely
> reuse those pages for other purposes. Packets looped onto the receive
> stack and queued to a socket can be held indefinitely. Avoid unbounded
> notification latency by restricting user pages to egress paths only.
> skb_orphan_frags_rx() will create a private copy of pages even for
> refcounted packets when these are looped, as did skb_orphan_frags for
> the original tun zerocopy implementation.
>
> Pages are not remapped read-only. Processes can modify packet contents
> while packets are in flight in the kernel path. Bytes on which kernel
> control flow depends (headers) are copied to avoid TOCTTOU attacks.
> Datapath integrity does not otherwise depend on payload, with three
> exceptions: checksums, optional sk_filter/tc u32/.. and device +
> driver logic. The effect of wrong checksums is limited to the
> misbehaving process. TC filters that access contents may have to be
> excluded by adding an skb_orphan_frags_rx.
>
> Processes can also safely avoid OOM conditions by bounding the number
> of bytes passed with MSG_ZEROCOPY and by removing shared pages after
> transmission from their own memory map.
>
>
> * Limitations / Known Issues
>
> - PF_INET6 is not yet supported.
> - TCP does not build max GSO packets, especially for
>      small send buffers (< 4 KB)
>
> Willem de Bruijn (12):
>   sock: allocate skbs from optmem
>   sock: skb_copy_ubufs support for compound pages
>   sock: add generic socket zerocopy
>   sock: enable sendmsg zerocopy
>   sock: sendmsg zerocopy notification coalescing
>   sock: sendmsg zerocopy ulimit
>   sock: sendmsg zerocopy limit bytes per notification
>   tcp: enable sendmsg zerocopy
>   udp: enable sendmsg zerocopy
>   raw: enable sendmsg zerocopy with IP_HDRINCL
>   packet: enable sendmsg zerocopy
>   test: add sendmsg zerocopy tests
>
>  drivers/net/tun.c                             |   2 +-
>  drivers/vhost/net.c                           |   1 +
>  include/linux/sched.h                         |   2 +-
>  include/linux/skbuff.h                        |  94 +++-
>  include/linux/socket.h                        |   1 +
>  include/net/sock.h                            |   4 +
>  include/uapi/linux/errqueue.h                 |   1 +
>  net/core/datagram.c                           |  35 +-
>  net/core/dev.c                                |   4 +-
>  net/core/skbuff.c                             | 327 ++++++++++++--
>  net/core/sock.c                               |  29 ++
>  net/ipv4/ip_output.c                          |  34 +-
>  net/ipv4/raw.c                                |  27 +-
>  net/ipv4/tcp.c                                |  37 +-
>  net/packet/af_packet.c                        |  52 ++-
>  tools/testing/selftests/net/.gitignore        |   2 +
>  tools/testing/selftests/net/Makefile          |   1 +
>  tools/testing/selftests/net/snd_zerocopy.c    | 354 +++++++++++++++
>  tools/testing/selftests/net/snd_zerocopy_lo.c | 596 ++++++++++++++++++++++++++
>  19 files changed, 1536 insertions(+), 67 deletions(-)
>  create mode 100644 tools/testing/selftests/net/snd_zerocopy.c
>  create mode 100644 tools/testing/selftests/net/snd_zerocopy_lo.c
>
> --
> 2.11.0.483.g087da7b7c-goog
>
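
To illustrate the flow described in the quoted overview, a minimal userspace
sketch follows: send with MSG_ZEROCOPY, then reap the completion range from
the socket error queue. The sketch assumes the interface as it was eventually
merged upstream, so the SO_ZEROCOPY socket option, the SO_EE_ORIGIN_ZEROCOPY
origin code and the [ee_info, ee_data] range encoding are assumptions relative
to this RFC and may differ from the patches posted here.

  /*
   * Sketch only: zerocopy send plus completion reaping, using the
   * MSG_ZEROCOPY interface as eventually merged upstream.
   */
  #include <errno.h>
  #include <stdio.h>
  #include <sys/socket.h>
  #include <linux/errqueue.h>

  static void reap_completions(int fd)
  {
          char control[128];
          struct msghdr msg = { .msg_control = control,
                                .msg_controllen = sizeof(control) };
          struct cmsghdr *cm;
          struct sock_extended_err *serr;

          if (recvmsg(fd, &msg, MSG_ERRQUEUE) == -1)
                  return;                         /* nothing queued yet */

          for (cm = CMSG_FIRSTHDR(&msg); cm; cm = CMSG_NXTHDR(&msg, cm)) {
                  serr = (struct sock_extended_err *)CMSG_DATA(cm);
                  if (serr->ee_errno != 0 ||
                      serr->ee_origin != SO_EE_ORIGIN_ZEROCOPY)
                          continue;
                  /* sends [ee_info, ee_data] completed: buffers reusable */
                  printf("zerocopy completed: %u..%u\n",
                         serr->ee_info, serr->ee_data);
          }
  }

  int send_zerocopy(int fd, const void *buf, size_t len)
  {
          int one = 1;

          /* opt in once per socket (a post-RFC addition upstream) */
          setsockopt(fd, SOL_SOCKET, SO_ZEROCOPY, &one, sizeof(one));

          /*
           * The pages backing buf are pinned (charged against the locked
           * memory ulimit) and must not be reused until the matching
           * notification arrives on the error queue.
           */
          if (send(fd, buf, len, MSG_ZEROCOPY) == -1)
                  return -errno;

          reap_completions(fd);
          return 0;
  }

A real sender would poll the error queue and recycle a buffer only once its
send id falls inside an acknowledged [lo, hi] range.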



-- 
Michael Kerrisk Linux man-pages maintainer;
http://www.kernel.org/doc/man-pages/
Author of "The Linux Programming Interface", http://blog.man7.org/


* Re: [PATCH RFC v2 00/12] socket sendmsg MSG_ZEROCOPY
  2017-02-27 18:57 ` Michael Kerrisk
@ 2017-02-28 19:46   ` Andy Lutomirski
  2017-02-28 20:43     ` Willem de Bruijn
  2017-03-01  3:25     ` David Miller
  0 siblings, 2 replies; 37+ messages in thread
From: Andy Lutomirski @ 2017-02-28 19:46 UTC (permalink / raw)
  To: Michael Kerrisk; +Cc: Willem de Bruijn, netdev, Willem de Bruijn, Linux API

On Mon, Feb 27, 2017 at 10:57 AM, Michael Kerrisk
<mtk.manpages@gmail.com> wrote:
> [CC += linux-api@vger.kernel.org]
>
> Hi Willem
>

>> On a send call with MSG_ZEROCOPY, the kernel pins the user pages and
>> creates skbuff fragments directly from these pages. On tx completion,
>> it notifies the socket owner that it is safe to modify memory by
>> queuing a completion notification onto the socket error queue.

What happens if the user writes to the pages while it's not safe?

How about if you're writing to an interface or a route that has crypto
involved and a malicious user can make the data change in the middle
of a crypto operation, thus perhaps leaking the entire key?  (I
wouldn't be at all surprised if a lot of provably secure AEAD
constructions are entirely compromised if an attacker can get the
ciphertext and tag computed from a message that changed during the
computation.)

I can see this working if you have a special type of skb that
indicates that the data might be concurrently written and have all the
normal skb APIs (including, especially, anything that clones it) make
a copy first.

--Andy


* Re: [PATCH RFC v2 00/12] socket sendmsg MSG_ZEROCOPY
  2017-02-28 19:46   ` Andy Lutomirski
@ 2017-02-28 20:43     ` Willem de Bruijn
       [not found]       ` <CAF=yD-K_0zO3pMeXf-UKGTsD4sNOdyN9KJkUb5MnCO_J5pisrA-JsoAwUIsXosN+BqQ9rBEUg@public.gmane.org>
  2017-03-01  3:25     ` David Miller
  1 sibling, 1 reply; 37+ messages in thread
From: Willem de Bruijn @ 2017-02-28 20:43 UTC (permalink / raw)
  To: Andy Lutomirski; +Cc: Michael Kerrisk, netdev, Willem de Bruijn, Linux API

On Tue, Feb 28, 2017 at 2:46 PM, Andy Lutomirski <luto@amacapital.net> wrote:
> On Mon, Feb 27, 2017 at 10:57 AM, Michael Kerrisk
> <mtk.manpages@gmail.com> wrote:
>> [CC += linux-api@vger.kernel.org]
>>
>> Hi Willem
>>
>
>>> On a send call with MSG_ZEROCOPY, the kernel pins the user pages and
>>> creates skbuff fragments directly from these pages. On tx completion,
>>> it notifies the socket owner that it is safe to modify memory by
>>> queuing a completion notification onto the socket error queue.
>
> What happens if the user writes to the pages while it's not safe?
>
> How about if you're writing to an interface or a route that has crypto
> involved and a malicious user can make the data change in the middle
> of a crypto operation, thus perhaps leaking the entire key?  (I
> wouldn't be at all surprised if a lot of provably secure AEAD
> constructions are entirely compromised if an attacker can get the
> ciphertext and tag computed from a message that changed during the
> computation.

Operations that read or write payload, such as this crypto example,
but also ebpf in tc or iptables, for instance, demand a deep copy using
skb_copy_ubufs before the operation.

This blacklist approach requires caution, but these paths should be
few and countable. It is not possible to predict at the socket layer
whether a packet will encounter any such operation, so white-listing
a subset of end-to-end paths is not practical.

> I can see this working if you have a special type of skb that
> indicates that the data might be concurrently written and have all the
> normal skb APIs (including, especially, anything that clones it) make
> a copy first.

Support for cloned skbs is required for TCP, both at tcp_transmit_skb
and segmentation offload. Patch 4 especially adds reference counting
of shared pages across clones and other sk_buff operations like
pskb_expand_head. This still allows for deep copy (skb_copy_ubufs)
on clones in specific datapaths like the above.
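
To make the skb_copy_ubufs point concrete, here is a rough kernel-side sketch
of a payload-touching path. This is illustrative only: example_touch_payload
is a hypothetical helper, and the SKBTX_DEV_ZEROCOPY check simply mirrors what
skb_orphan_frags-style wrappers do; the real call sites and error handling in
the patches differ.

  #include <linux/skbuff.h>

  /* Hypothetical path that is about to read or rewrite packet payload. */
  static int example_touch_payload(struct sk_buff *skb)
  {
          /* frags still backed by pinned user pages? */
          if (skb_shinfo(skb)->tx_flags & SKBTX_DEV_ZEROCOPY) {
                  /* give the skb private copies of all frags */
                  int err = skb_copy_ubufs(skb, GFP_ATOMIC);

                  if (err)
                          return err;     /* caller drops or defers */
          }

          /* payload can no longer change underneath us */
          return 0;
  }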


* Re: [PATCH RFC v2 00/12] socket sendmsg MSG_ZEROCOPY
       [not found]       ` <CAF=yD-K_0zO3pMeXf-UKGTsD4sNOdyN9KJkUb5MnCO_J5pisrA-JsoAwUIsXosN+BqQ9rBEUg@public.gmane.org>
@ 2017-02-28 21:06         ` Andy Lutomirski
  2017-03-01  3:28           ` David Miller
  2017-02-28 21:09         ` Andy Lutomirski
  1 sibling, 1 reply; 37+ messages in thread
From: Andy Lutomirski @ 2017-02-28 21:06 UTC (permalink / raw)
  To: Willem de Bruijn; +Cc: Michael Kerrisk, netdev, Willem de Bruijn, Linux API

On Tue, Feb 28, 2017 at 12:43 PM, Willem de Bruijn
<willemdebruijn.kernel-Re5JQEeQqe8AvxtiuMwx3w@public.gmane.org> wrote:
> On Tue, Feb 28, 2017 at 2:46 PM, Andy Lutomirski <luto-kltTT9wpgjJwATOyAt5JVQ@public.gmane.org> wrote:
>> On Mon, Feb 27, 2017 at 10:57 AM, Michael Kerrisk
>> <mtk.manpages-Re5JQEeQqe8AvxtiuMwx3w@public.gmane.org> wrote:
>>> [CC += linux-api-u79uwXL29TY76Z2rM5mHXA@public.gmane.org]
>>>
>>> Hi Willem
>>>
>>
>>>> On a send call with MSG_ZEROCOPY, the kernel pins the user pages and
>>>> creates skbuff fragments directly from these pages. On tx completion,
>>>> it notifies the socket owner that it is safe to modify memory by
>>>> queuing a completion notification onto the socket error queue.
>>
>> What happens if the user writes to the pages while it's not safe?
>>
>> How about if you're writing to an interface or a route that has crypto
>> involved and a malicious user can make the data change in the middle
>> of a crypto operation, thus perhaps leaking the entire key?  (I
>> wouldn't be at all surprised if a lot of provably secure AEAD
>> constructions are entirely compromised if an attacker can get the
>> ciphertext and tag computed from a message that changed during the
>> computation.
>
> Operations that read or write payload, such as this crypto example,
> but also ebpf in tc or iptables, for instance, demand a deep copy using
> skb_copy_ubufs before the operation.
>
> This blacklist approach requires caution, but these paths should be
> few and countable. It is not possible to predict at the socket layer
> whether a packet will encounter any such operation, so white-listing
> a subset of end-to-end paths is not practical.

How about hardware that malfunctions if the packet changes out from
under it?  A whitelist seems quite a bit safer.

--Andy


* Re: [PATCH RFC v2 00/12] socket sendmsg MSG_ZEROCOPY
       [not found]       ` <CAF=yD-K_0zO3pMeXf-UKGTsD4sNOdyN9KJkUb5MnCO_J5pisrA-JsoAwUIsXosN+BqQ9rBEUg@public.gmane.org>
  2017-02-28 21:06         ` Andy Lutomirski
@ 2017-02-28 21:09         ` Andy Lutomirski
  2017-02-28 21:28           ` Willem de Bruijn
  2017-02-28 21:47           ` Eric Dumazet
  1 sibling, 2 replies; 37+ messages in thread
From: Andy Lutomirski @ 2017-02-28 21:09 UTC (permalink / raw)
  To: Willem de Bruijn; +Cc: Michael Kerrisk, netdev, Willem de Bruijn, Linux API

On Tue, Feb 28, 2017 at 12:43 PM, Willem de Bruijn
<willemdebruijn.kernel-Re5JQEeQqe8AvxtiuMwx3w@public.gmane.org> wrote:
>
>> I can see this working if you have a special type of skb that
>> indicates that the data might be concurrently written and have all the
>> normal skb APIs (including, especially, anything that clones it) make
>> a copy first.
>
> Support for cloned skbs is required for TCP, both at tcp_transmit_skb
> and segmentation offload. Patch 4 especially adds reference counting
> of shared pages across clones and other sk_buff operations like
> pskb_expand_head. This still allows for deep copy (skb_copy_ubufs)
> on clones in specific datapaths like the above.

Does this mean that a user program that does a zerocopy send can cause
a retransmitted segment to contain different data than the original
segment?  If so, is that okay?


* Re: [PATCH RFC v2 00/12] socket sendmsg MSG_ZEROCOPY
  2017-02-28 21:09         ` Andy Lutomirski
@ 2017-02-28 21:28           ` Willem de Bruijn
  2017-02-28 21:47           ` Eric Dumazet
  1 sibling, 0 replies; 37+ messages in thread
From: Willem de Bruijn @ 2017-02-28 21:28 UTC (permalink / raw)
  To: Andy Lutomirski; +Cc: Michael Kerrisk, netdev, Willem de Bruijn, Linux API

>>> I can see this working if you have a special type of skb that
>>> indicates that the data might be concurrently written and have all the
>>> normal skb APIs (including, especially, anything that clones it) make
>>> a copy first.
>>
>> Support for cloned skbs is required for TCP, both at tcp_transmit_skb
>> and segmentation offload. Patch 4 especially adds reference counting
>> of shared pages across clones and other sk_buff operations like
>> pskb_expand_head. This still allows for deep copy (skb_copy_ubufs)
>> on clones in specific datapaths like the above.
>
> Does this mean that a user program that does a zerocopy send can cause
> a retransmitted segment to contain different data than the original
> segment? If so, is that okay?

That is possible, indeed. The bytestream at the receiver is then
likely undefined, though the integrity of the TCP receive stack should
not be affected. A valid question is whether the stack should protect
against such users. The pattern is reminiscent of evasion attacks. In
the limited case, privileged users can already generate this data
pattern, of course.


* Re: [PATCH RFC v2 00/12] socket sendmsg MSG_ZEROCOPY
  2017-02-28 21:09         ` Andy Lutomirski
  2017-02-28 21:28           ` Willem de Bruijn
@ 2017-02-28 21:47           ` Eric Dumazet
       [not found]             ` <1488318476.9415.270.camel-XN9IlZ5yJG9HTL0Zs8A6p+yfmBU6pStAUsxypvmhUTTZJqsBc5GL+g@public.gmane.org>
  1 sibling, 1 reply; 37+ messages in thread
From: Eric Dumazet @ 2017-02-28 21:47 UTC (permalink / raw)
  To: Andy Lutomirski
  Cc: Willem de Bruijn, Michael Kerrisk, netdev, Willem de Bruijn, Linux API

On Tue, 2017-02-28 at 13:09 -0800, Andy Lutomirski wrote:

> Does this mean that a user program that does a zerocopy send can cause
> a retransmitted segment to contain different data than the original
> segment?  If so, is that okay?

Same remark applies to sendfile() already, or other zero copy modes
(vmsplice() + splice() )


* Re: [PATCH RFC v2 00/12] socket sendmsg MSG_ZEROCOPY
       [not found]             ` <1488318476.9415.270.camel-XN9IlZ5yJG9HTL0Zs8A6p+yfmBU6pStAUsxypvmhUTTZJqsBc5GL+g@public.gmane.org>
@ 2017-02-28 22:25               ` Andy Lutomirski
       [not found]                 ` <CALCETrVQj1AEsLEGGkWW1zApGz6_x2rDmE0wz4ft+O5h07f_Ug-JsoAwUIsXosN+BqQ9rBEUg@public.gmane.org>
  0 siblings, 1 reply; 37+ messages in thread
From: Andy Lutomirski @ 2017-02-28 22:25 UTC (permalink / raw)
  To: Eric Dumazet
  Cc: Willem de Bruijn, Michael Kerrisk, netdev, Willem de Bruijn, Linux API

On Tue, Feb 28, 2017 at 1:47 PM, Eric Dumazet <eric.dumazet-Re5JQEeQqe8AvxtiuMwx3w@public.gmane.org> wrote:
> On Tue, 2017-02-28 at 13:09 -0800, Andy Lutomirski wrote:
>
>> Does this mean that a user program that does a zerocopy send can cause
>> a retransmitted segment to contain different data than the original
>> segment?  If so, is that okay?
>
> Same remark applies to sendfile() already

True.

>, or other zero copy modes
> (vmsplice() + splice() )

I hate vmsplice().  I thought I remembered it being essentially
disabled at some point due to security problems.

>


* Re: [PATCH RFC v2 00/12] socket sendmsg MSG_ZEROCOPY
       [not found]                 ` <CALCETrVQj1AEsLEGGkWW1zApGz6_x2rDmE0wz4ft+O5h07f_Ug-JsoAwUIsXosN+BqQ9rBEUg@public.gmane.org>
@ 2017-02-28 22:40                   ` Eric Dumazet
  2017-02-28 22:52                     ` Andy Lutomirski
  0 siblings, 1 reply; 37+ messages in thread
From: Eric Dumazet @ 2017-02-28 22:40 UTC (permalink / raw)
  To: Andy Lutomirski
  Cc: Willem de Bruijn, Michael Kerrisk, netdev, Willem de Bruijn, Linux API

On Tue, 2017-02-28 at 14:25 -0800, Andy Lutomirski wrote:
> On Tue, Feb 28, 2017 at 1:47 PM, Eric Dumazet <eric.dumazet-Re5JQEeQqe8AvxtiuMwx3w@public.gmane.org> wrote:
> > On Tue, 2017-02-28 at 13:09 -0800, Andy Lutomirski wrote:
> >
> >> Does this mean that a user program that does a zerocopy send can cause
> >> a retransmitted segment to contain different data than the original
> >> segment?  If so, is that okay?
> >
> > Same remark applies to sendfile() already
> 
> True.
> 
> >, or other zero copy modes
> > (vmsplice() + splice() )
> 
> I hate vmsplice().  I thought I remembered it being essentially
> disabled at some point due to security problems.

Right, zero copy is hard ;)

vmsplice() is not disabled in current kernels, unless I missed
something.


* Re: [PATCH RFC v2 00/12] socket sendmsg MSG_ZEROCOPY
  2017-02-28 22:40                   ` Eric Dumazet
@ 2017-02-28 22:52                     ` Andy Lutomirski
  2017-02-28 23:22                       ` Eric Dumazet
  0 siblings, 1 reply; 37+ messages in thread
From: Andy Lutomirski @ 2017-02-28 22:52 UTC (permalink / raw)
  To: Eric Dumazet
  Cc: Willem de Bruijn, Michael Kerrisk, netdev, Willem de Bruijn, Linux API

On Tue, Feb 28, 2017 at 2:40 PM, Eric Dumazet <eric.dumazet@gmail.com> wrote:
> On Tue, 2017-02-28 at 14:25 -0800, Andy Lutomirski wrote:
>> On Tue, Feb 28, 2017 at 1:47 PM, Eric Dumazet <eric.dumazet@gmail.com> wrote:
>> > On Tue, 2017-02-28 at 13:09 -0800, Andy Lutomirski wrote:
>> >
>> >> Does this mean that a user program that does a zerocopy send can cause
>> >> a retransmitted segment to contain different data than the original
>> >> segment?  If so, is that okay?
>> >
>> > Same remark applies to sendfile() already
>>
>> True.
>>
>> >, or other zero copy modes
>> > (vmsplice() + splice() )
>>
>> I hate vmsplice().  I thought I remembered it being essentially
>> disabled at some point due to security problems.
>
> Right, zero copy is hard ;)
>
> vmsplice() is not disabled in current kernels, unless I missed
> something.
>

I think you're right.  That being said, from the man page:

The user pages are a gift to the kernel.  The application  may  not
modify this memory ever, otherwise the page cache and on-disk data may
differ.

This is just not okay IMO.


* Re: [PATCH RFC v2 00/12] socket sendmsg MSG_ZEROCOPY
  2017-02-28 22:52                     ` Andy Lutomirski
@ 2017-02-28 23:22                       ` Eric Dumazet
       [not found]                         ` <1488324131.9415.278.camel-XN9IlZ5yJG9HTL0Zs8A6p+yfmBU6pStAUsxypvmhUTTZJqsBc5GL+g@public.gmane.org>
  0 siblings, 1 reply; 37+ messages in thread
From: Eric Dumazet @ 2017-02-28 23:22 UTC (permalink / raw)
  To: Andy Lutomirski
  Cc: Willem de Bruijn, Michael Kerrisk, netdev, Willem de Bruijn, Linux API

On Tue, 2017-02-28 at 14:52 -0800, Andy Lutomirski wrote:

> The user pages are a gift to the kernel.  The application  may  not
> modify this memory ever, otherwise the page cache and on-disk data may
> differ.
> 
> This is just not okay IMO.

TCP works just fine in this case.

TX checksum will be computed by the NIC after/while data is copied.

If the application really does change the data, that will not cause any
problems other than user-side consistency.

This is why we require a copy (for all buffers that came from zero-copy)
if network stack hits a device that can not offload TX checksum.

Even pwrite() does not guarantee consistency if multiple threads are
using it on overlapping regions.


* Re: [PATCH RFC v2 00/12] socket sendmsg MSG_ZEROCOPY
       [not found]                         ` <1488324131.9415.278.camel-XN9IlZ5yJG9HTL0Zs8A6p+yfmBU6pStAUsxypvmhUTTZJqsBc5GL+g@public.gmane.org>
@ 2017-03-01  0:28                           ` Tom Herbert
       [not found]                             ` <CALx6S357ssnbEu7CMrczEjiX25QYBJh3WG=w8KuAoxGQS4aKLA-JsoAwUIsXosN+BqQ9rBEUg@public.gmane.org>
  0 siblings, 1 reply; 37+ messages in thread
From: Tom Herbert @ 2017-03-01  0:28 UTC (permalink / raw)
  To: Eric Dumazet
  Cc: Andy Lutomirski, Willem de Bruijn, Michael Kerrisk, netdev,
	Willem de Bruijn, Linux API

On Tue, Feb 28, 2017 at 3:22 PM, Eric Dumazet <eric.dumazet-Re5JQEeQqe8AvxtiuMwx3w@public.gmane.org> wrote:
> On Tue, 2017-02-28 at 14:52 -0800, Andy Lutomirski wrote:
>
>> The user pages are a gift to the kernel.  The application  may  not
>> modify this memory ever, otherwise the page cache and on-disk data may
>> differ.
>>
>> This is just not okay IMO.
>
> TCP works just fine in this case.
>
> TX checksum will be computed by the NIC after/while data is copied.
>
> If really the application changes the data, that will not cause any
> problems, other than user side consistency.
>
> This is why we require a copy (for all buffers that came from zero-copy)
> if network stack hits a device that can not offload TX checksum.
>
> Even pwrite() does not guarantee consistency if multiple threads are
> using it on overlapping regions.
>
The Mellanox team working on TLS offload pointed out to us that if
data is changed for a retransmit then it becomes trivial for someone
snooping to break the encryption. Sounds pretty scary and it would be
a shame if we couldn't use zero-copy in that use case :-( Hopefully we
can find a solution...

Tom

>
>


* Re: [PATCH RFC v2 00/12] socket sendmsg MSG_ZEROCOPY
       [not found]                             ` <CALx6S357ssnbEu7CMrczEjiX25QYBJh3WG=w8KuAoxGQS4aKLA-JsoAwUIsXosN+BqQ9rBEUg@public.gmane.org>
@ 2017-03-01  0:37                               ` Eric Dumazet
  2017-03-01  0:58                               ` Willem de Bruijn
  1 sibling, 0 replies; 37+ messages in thread
From: Eric Dumazet @ 2017-03-01  0:37 UTC (permalink / raw)
  To: Tom Herbert
  Cc: Andy Lutomirski, Willem de Bruijn, Michael Kerrisk, netdev,
	Willem de Bruijn, Linux API

On Tue, 2017-02-28 at 16:28 -0800, Tom Herbert wrote:

> The Mellanox team working on TLS offload pointed out to us that if
> data is changed for a retransmit then it becomes trivial for someone
> snooping to break the encryption. Sounds pretty scary and it would be
> a shame if we couldn't use zero-copy in that use case :-( Hopefully we
> can find a solution...

Right, this is why offloading encryption over TCP is also hard ;)


* Re: [PATCH RFC v2 00/12] socket sendmsg MSG_ZEROCOPY
       [not found]                             ` <CALx6S357ssnbEu7CMrczEjiX25QYBJh3WG=w8KuAoxGQS4aKLA-JsoAwUIsXosN+BqQ9rBEUg@public.gmane.org>
  2017-03-01  0:37                               ` Eric Dumazet
@ 2017-03-01  0:58                               ` Willem de Bruijn
  2017-03-01  1:50                                 ` Tom Herbert
  1 sibling, 1 reply; 37+ messages in thread
From: Willem de Bruijn @ 2017-03-01  0:58 UTC (permalink / raw)
  To: Tom Herbert
  Cc: Eric Dumazet, Andy Lutomirski, Michael Kerrisk, netdev,
	Willem de Bruijn, Linux API

On Tue, Feb 28, 2017 at 7:28 PM, Tom Herbert <tom-BjP2VixgY4xUbtYUoyoikg@public.gmane.org> wrote:
> On Tue, Feb 28, 2017 at 3:22 PM, Eric Dumazet <eric.dumazet-Re5JQEeQqe8AvxtiuMwx3w@public.gmane.org> wrote:
>> On Tue, 2017-02-28 at 14:52 -0800, Andy Lutomirski wrote:
>>
>>> The user pages are a gift to the kernel.  The application  may  not
>>> modify this memory ever, otherwise the page cache and on-disk data may
>>> differ.
>>>
>>> This is just not okay IMO.
>>
>> TCP works just fine in this case.
>>
>> TX checksum will be computed by the NIC after/while data is copied.
>>
>> If really the application changes the data, that will not cause any
>> problems, other than user side consistency.
>>
>> This is why we require a copy (for all buffers that came from zero-copy)
>> if network stack hits a device that can not offload TX checksum.
>>
>> Even pwrite() does not guarantee consistency if multiple threads are
>> using it on overlapping regions.
>>
> The Mellanox team working on TLS offload pointed out to us that if
> data is changed for a retransmit then it becomes trivial for someone
> snooping to break the encryption. Sounds pretty scary and it would be
> a shame if we couldn't use zero-copy in that use case :-( Hopefully we
> can find a solution...
>

This requires collusion by the process initiating the zerocopy send
to help the entity snooping the link. That could be an attack on admin
configured tunnels, but user-directed encryption offload like AF_TLS
can still use zerocopy.


* Re: [PATCH RFC v2 00/12] socket sendmsg MSG_ZEROCOPY
  2017-03-01  0:58                               ` Willem de Bruijn
@ 2017-03-01  1:50                                 ` Tom Herbert
  0 siblings, 0 replies; 37+ messages in thread
From: Tom Herbert @ 2017-03-01  1:50 UTC (permalink / raw)
  To: Willem de Bruijn
  Cc: Eric Dumazet, Andy Lutomirski, Michael Kerrisk, netdev,
	Willem de Bruijn, Linux API

On Tue, Feb 28, 2017 at 4:58 PM, Willem de Bruijn
<willemdebruijn.kernel@gmail.com> wrote:
> On Tue, Feb 28, 2017 at 7:28 PM, Tom Herbert <tom@herbertland.com> wrote:
>> On Tue, Feb 28, 2017 at 3:22 PM, Eric Dumazet <eric.dumazet@gmail.com> wrote:
>>> On Tue, 2017-02-28 at 14:52 -0800, Andy Lutomirski wrote:
>>>
>>>> The user pages are a gift to the kernel.  The application  may  not
>>>> modify this memory ever, otherwise the page cache and on-disk data may
>>>> differ.
>>>>
>>>> This is just not okay IMO.
>>>
>>> TCP works just fine in this case.
>>>
>>> TX checksum will be computed by the NIC after/while data is copied.
>>>
>>> If really the application changes the data, that will not cause any
>>> problems, other than user side consistency.
>>>
>>> This is why we require a copy (for all buffers that came from zero-copy)
>>> if network stack hits a device that can not offload TX checksum.
>>>
>>> Even pwrite() does not guarantee consistency if multiple threads are
>>> using it on overlapping regions.
>>>
>> The Mellanox team working on TLS offload pointed out to us that if
>> data is changed for a retransmit then it becomes trivial for someone
>> snooping to break the encryption. Sounds pretty scary and it would be
>> a shame if we couldn't use zero-copy in that use case :-( Hopefully we
>> can find a solution...
>>
>
> This requires collusion by the process initiating the zerocopy send
> to help the entity snooping the link. That could be an attack on admin
> configured tunnels, but user-directed encryption offload like AF_TLS
> can still use zerocopy.

Yes, but we can't trust the user to always understand or correctly
implement the semantic nuances when security is involved. If we can't
provide a robust API then the only recourse is to not allow zero copy
in that case. We could suggest COW to solve all problems, but I think
we know where the conversation will go ;-)

Tom


* Re: [PATCH RFC v2 00/12] socket sendmsg MSG_ZEROCOPY
  2017-02-28 19:46   ` Andy Lutomirski
  2017-02-28 20:43     ` Willem de Bruijn
@ 2017-03-01  3:25     ` David Miller
  1 sibling, 0 replies; 37+ messages in thread
From: David Miller @ 2017-03-01  3:25 UTC (permalink / raw)
  To: luto; +Cc: mtk.manpages, willemdebruijn.kernel, netdev, willemb, linux-api

From: Andy Lutomirski <luto@amacapital.net>
Date: Tue, 28 Feb 2017 11:46:23 -0800

> On Mon, Feb 27, 2017 at 10:57 AM, Michael Kerrisk
> <mtk.manpages@gmail.com> wrote:
>> [CC += linux-api@vger.kernel.org]
>>
>> Hi Willem
>>
> 
>>> On a send call with MSG_ZEROCOPY, the kernel pins the user pages and
>>> creates skbuff fragments directly from these pages. On tx completion,
>>> it notifies the socket owner that it is safe to modify memory by
>>> queuing a completion notification onto the socket error queue.
> 
> What happens if the user writes to the pages while it's not safe?

Just want to mention that this ability to write to data behind a
network send's back is not a new thing added by MSG_ZEROCOPY.

All of this is already possible with sendfile().  The pages can be
written to completely asynchronously to the data being pulled out of
the page cache into the transmit path.

The crypto case is interesting, but that is a separate discussion
about an existing problem rather than something specifically new to
the MSG_ZEROCOPY changes.

Thanks.


* Re: [PATCH RFC v2 00/12] socket sendmsg MSG_ZEROCOPY
  2017-02-28 21:06         ` Andy Lutomirski
@ 2017-03-01  3:28           ` David Miller
  2017-03-01  3:43             ` Eric Dumazet
  2017-03-02 19:26             ` Andy Lutomirski
  0 siblings, 2 replies; 37+ messages in thread
From: David Miller @ 2017-03-01  3:28 UTC (permalink / raw)
  To: luto; +Cc: willemdebruijn.kernel, mtk.manpages, netdev, willemb, linux-api

From: Andy Lutomirski <luto@amacapital.net>
Date: Tue, 28 Feb 2017 13:06:49 -0800

> On Tue, Feb 28, 2017 at 12:43 PM, Willem de Bruijn
> <willemdebruijn.kernel@gmail.com> wrote:
>> On Tue, Feb 28, 2017 at 2:46 PM, Andy Lutomirski <luto@amacapital.net> wrote:
>>> On Mon, Feb 27, 2017 at 10:57 AM, Michael Kerrisk
>>> <mtk.manpages@gmail.com> wrote:
>>>> [CC += linux-api@vger.kernel.org]
>>>>
>>>> Hi Willem
>>>>
>>>
>>>>> On a send call with MSG_ZEROCOPY, the kernel pins the user pages and
>>>>> creates skbuff fragments directly from these pages. On tx completion,
>>>>> it notifies the socket owner that it is safe to modify memory by
>>>>> queuing a completion notification onto the socket error queue.
>>>
>>> What happens if the user writes to the pages while it's not safe?
>>>
>>> How about if you're writing to an interface or a route that has crypto
>>> involved and a malicious user can make the data change in the middle
>>> of a crypto operation, thus perhaps leaking the entire key?  (I
>>> wouldn't be at all surprised if a lot of provably secure AEAD
>>> constructions are entirely compromised if an attacker can get the
>>> ciphertext and tag computed from a message that changed during the
>>> computation.
>>
>> Operations that read or write payload, such as this crypto example,
>> but also ebpf in tc or iptables, for instance, demand a deep copy using
>> skb_copy_ubufs before the operation.
>>
>> This blacklist approach requires caution, but these paths should be
>> few and countable. It is not possible to predict at the socket layer
>> whether a packet will encounter any such operation, so white-listing
>> a subset of end-to-end paths is not practical.
> 
> How about hardware that malfunctions if the packet changes out from
> under it?  A whitelist seems quite a bit safer.

These devices are already choking, because as I stated this can already
be done via sendfile().

Networking-card-wise this isn't an issue: chips bring the entire packet
into their FIFO, compute checksums on the fly mid-stream, and then write
the 16-bit checksum field before starting to write the packet onto the
wire.

I think this is completely a non-issue, and we thought about this right
from the start when sendfile() support was added nearly two decades ago.
If network cards from back then didn't crap out in this situation I
think the ones out there now are probably ok.


* Re: [PATCH RFC v2 00/12] socket sendmsg MSG_ZEROCOPY
  2017-03-01  3:28           ` David Miller
@ 2017-03-01  3:43             ` Eric Dumazet
  2017-03-02 19:26             ` Andy Lutomirski
  1 sibling, 0 replies; 37+ messages in thread
From: Eric Dumazet @ 2017-03-01  3:43 UTC (permalink / raw)
  To: David Miller
  Cc: luto, willemdebruijn.kernel, mtk.manpages, netdev, willemb, linux-api

On Tue, 2017-02-28 at 22:28 -0500, David Miller wrote:

> These device are already choking, because as I stated this can already
> be done via sendfile().
> 
> Networking card wise this isn't an issue, chips bring the entire packet
> into their FIFO, compute checksums on the fly mid-stream, and then write
> the 16-bit checksum field before starting to write the packet onto the
> wire.
> 
> I think this is completely a non-issue, and we thought about this right
> from the start when sendfile() support was added nearly two decades ago.
> If network cards from back then didn't crap out in this situation I
> think the ones out there now are probably ok.

Well, we had to fix one issue with the GSO fallback about 4 years ago.

https://git.kernel.org/cgit/linux/kernel/git/torvalds/linux.git/commit/?id=cef401de7be8c4e155c6746bfccf721a4fa5fab9

So extra scrutiny would be nice.

Zero copy is incredibly hard to get right.


* Re: [PATCH RFC v2 00/12] socket sendmsg MSG_ZEROCOPY
  2017-03-01  3:28           ` David Miller
  2017-03-01  3:43             ` Eric Dumazet
@ 2017-03-02 19:26             ` Andy Lutomirski
  1 sibling, 0 replies; 37+ messages in thread
From: Andy Lutomirski @ 2017-03-02 19:26 UTC (permalink / raw)
  To: David Miller
  Cc: Willem de Bruijn, Michael Kerrisk, Network Development,
	Willem de Bruijn, Linux API

On Tue, Feb 28, 2017 at 7:28 PM, David Miller <davem@davemloft.net> wrote:
> From: Andy Lutomirski <luto@amacapital.net>
> Date: Tue, 28 Feb 2017 13:06:49 -0800
>
>> On Tue, Feb 28, 2017 at 12:43 PM, Willem de Bruijn
>> <willemdebruijn.kernel@gmail.com> wrote:
>>> On Tue, Feb 28, 2017 at 2:46 PM, Andy Lutomirski <luto@amacapital.net> wrote:
>>>> On Mon, Feb 27, 2017 at 10:57 AM, Michael Kerrisk
>>>> <mtk.manpages@gmail.com> wrote:
>>>>> [CC += linux-api@vger.kernel.org]
>>>>>
>>>>> Hi Willem
>>>>>
>>>>
>>>>>> On a send call with MSG_ZEROCOPY, the kernel pins the user pages and
>>>>>> creates skbuff fragments directly from these pages. On tx completion,
>>>>>> it notifies the socket owner that it is safe to modify memory by
>>>>>> queuing a completion notification onto the socket error queue.
>>>>
>>>> What happens if the user writes to the pages while it's not safe?
>>>>
>>>> How about if you're writing to an interface or a route that has crypto
>>>> involved and a malicious user can make the data change in the middle
>>>> of a crypto operation, thus perhaps leaking the entire key?  (I
>>>> wouldn't be at all surprised if a lot of provably secure AEAD
>>>> constructions are entirely compromised if an attacker can get the
>>>> ciphertext and tag computed from a message that changed during the
>>>> computation.
>>>
>>> Operations that read or write payload, such as this crypto example,
>>> but also ebpf in tc or iptables, for instance, demand a deep copy using
>>> skb_copy_ubufs before the operation.
>>>
>>> This blacklist approach requires caution, but these paths should be
>>> few and countable. It is not possible to predict at the socket layer
>>> whether a packet will encounter any such operation, so white-listing
>>> a subset of end-to-end paths is not practical.
>>
>> How about hardware that malfunctions if the packet changes out from
>> under it?  A whitelist seems quite a bit safer.
>
> These device are already choking, because as I stated this can already
> be done via sendfile().
>
> Networking card wise this isn't an issue, chips bring the entire packet
> into their FIFO, compute checksums on the fly mid-stream, and then write
> the 16-bit checksum field before starting to write the packet onto the
> wire.
>
> I think this is completely a non-issue, and we thought about this right
> from the start when sendfile() support was added nearly two decades ago.
> If network cards from back then didn't crap out in this situation I
> think the ones out there now are probably ok.

Fair enough.



Thread overview: 37+ messages (download: mbox.gz / follow: Atom feed)
-- links below jump to the message on this page --
2017-02-22 16:38 [PATCH RFC v2 00/12] socket sendmsg MSG_ZEROCOPY Willem de Bruijn
2017-02-22 16:38 ` [PATCH RFC v2 01/12] sock: allocate skbs from optmem Willem de Bruijn
2017-02-22 16:38 ` [PATCH RFC v2 02/12] sock: skb_copy_ubufs support for compound pages Willem de Bruijn
2017-02-22 20:33   ` Eric Dumazet
2017-02-23  1:51     ` Willem de Bruijn
2017-02-22 16:38 ` [PATCH RFC v2 03/12] sock: add generic socket zerocopy Willem de Bruijn
2017-02-22 16:38 ` [PATCH RFC v2 04/12] sock: enable sendmsg zerocopy Willem de Bruijn
2017-02-22 16:38 ` [PATCH RFC v2 05/12] sock: sendmsg zerocopy notification coalescing Willem de Bruijn
2017-02-22 16:38 ` [PATCH RFC v2 06/12] sock: sendmsg zerocopy ulimit Willem de Bruijn
2017-02-22 16:38 ` [PATCH RFC v2 07/12] sock: sendmsg zerocopy limit bytes per notification Willem de Bruijn
2017-02-22 16:38 ` [PATCH RFC v2 08/12] tcp: enable sendmsg zerocopy Willem de Bruijn
2017-02-22 16:38 ` [PATCH RFC v2 09/12] udp: " Willem de Bruijn
2017-02-22 16:38 ` [PATCH RFC v2 10/12] raw: enable sendmsg zerocopy with IP_HDRINCL Willem de Bruijn
2017-02-22 16:39 ` [PATCH RFC v2 11/12] packet: enable sendmsg zerocopy Willem de Bruijn
2017-02-22 16:39 ` [PATCH RFC v2 12/12] test: add sendmsg zerocopy tests Willem de Bruijn
2017-02-23 15:45 ` [PATCH RFC v2 00/12] socket sendmsg MSG_ZEROCOPY David Miller
2017-02-24 23:03 ` Alexei Starovoitov
2017-02-25  0:25   ` Willem de Bruijn
2017-02-27 18:57 ` Michael Kerrisk
2017-02-28 19:46   ` Andy Lutomirski
2017-02-28 20:43     ` Willem de Bruijn
     [not found]       ` <CAF=yD-K_0zO3pMeXf-UKGTsD4sNOdyN9KJkUb5MnCO_J5pisrA-JsoAwUIsXosN+BqQ9rBEUg@public.gmane.org>
2017-02-28 21:06         ` Andy Lutomirski
2017-03-01  3:28           ` David Miller
2017-03-01  3:43             ` Eric Dumazet
2017-03-02 19:26             ` Andy Lutomirski
2017-02-28 21:09         ` Andy Lutomirski
2017-02-28 21:28           ` Willem de Bruijn
2017-02-28 21:47           ` Eric Dumazet
     [not found]             ` <1488318476.9415.270.camel-XN9IlZ5yJG9HTL0Zs8A6p+yfmBU6pStAUsxypvmhUTTZJqsBc5GL+g@public.gmane.org>
2017-02-28 22:25               ` Andy Lutomirski
     [not found]                 ` <CALCETrVQj1AEsLEGGkWW1zApGz6_x2rDmE0wz4ft+O5h07f_Ug-JsoAwUIsXosN+BqQ9rBEUg@public.gmane.org>
2017-02-28 22:40                   ` Eric Dumazet
2017-02-28 22:52                     ` Andy Lutomirski
2017-02-28 23:22                       ` Eric Dumazet
     [not found]                         ` <1488324131.9415.278.camel-XN9IlZ5yJG9HTL0Zs8A6p+yfmBU6pStAUsxypvmhUTTZJqsBc5GL+g@public.gmane.org>
2017-03-01  0:28                           ` Tom Herbert
     [not found]                             ` <CALx6S357ssnbEu7CMrczEjiX25QYBJh3WG=w8KuAoxGQS4aKLA-JsoAwUIsXosN+BqQ9rBEUg@public.gmane.org>
2017-03-01  0:37                               ` Eric Dumazet
2017-03-01  0:58                               ` Willem de Bruijn
2017-03-01  1:50                                 ` Tom Herbert
2017-03-01  3:25     ` David Miller
