* [PATCH v2 net-next 0/5]  xps_flows: XPS flow steering when there is no socket
@ 2016-09-29  3:54 Tom Herbert
  2016-09-29  3:54 ` [PATCH v2 net-next 1/5] net: Set SW hash in skb_set_hash_from_sk Tom Herbert
                   ` (4 more replies)
  0 siblings, 5 replies; 14+ messages in thread
From: Tom Herbert @ 2016-09-29  3:54 UTC (permalink / raw)
  To: davem, netdev; +Cc: kernel-team, rick.jones2, alexander.duyck

This patch set introduces transmit flow steering for socketless packets.
The idea is that we record the transmit queues in a flow table that is
indexed by skbuff hash.  The flow table entries have two values: the
queue_index and the head cnt of packets from the TX queue. We only allow
a queue to change for a flow if the tail cnt in the TX queue advances
beyond the recorded head cnt. That condition indicates that all
outstanding packets for the flow have completed transmission, so it is
safe to change queues.

Tracking the in-flight packets for a queue is performed as part of DQL.
Two fields are added to the dql structure: num_enqueue_ops and
num_completed_ops. num_enqueue_ops is incremented in dql_queued, and
num_completed_ops is incremented in dql_completed by the number of
operations completed (a new argument to the function).
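
For reference, the queue-change check works out to roughly the following
(a simplified sketch of the logic in patch 4, not the exact code;
xps_flow_may_move is just an illustrative name):

  /* ent.queue_ptr holds dql->num_enqueue_ops sampled when the flow last
   * picked ent.queue_index.  The flow may move to a different queue only
   * once the completion counter of the old queue has caught up with it.
   */
  static bool xps_flow_may_move(const struct netdev_queue *old_txq,
                                struct xps_dev_flow ent)
  {
          return (int)(old_txq->dql.num_completed_ops - ent.queue_ptr) >= 0;
  }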

This patch set creates /sys/class/net/eth*/xps_dev_flow_table_cnt,
which sets the number of entries in the XPS flow table.

Note that the functionality here is technically best effort (for
instance we don't obtain a lock while processing a flow table entry).
Under high load it is possible that OOO packets can still be generated
due to XPS if two threads are hammering on the same flow table entry.
The assumption of these patches is that OOO packets are not the end of
the world, and that these patches should prevent OOO in the most common
XPS use cases.

This is a followup to the previous RFC version. Fixes from the RFC are:

  - Move counters to DQL
  - Fixed typo
  - Simplified the get flow index function
  - Fixed sysfs flow_table_cnt to properly use DEVICE_ATTR_RW
  - Renamed the mechanism

V2:
  - Added documentation in scaling.txt and sysfs documentation
  - Call skb_tx_hash directly from get_xps_queue. This allows
    the socketless transmit flow steering to work properly if
    a flow is bouncing between non-XPS and XPS CPUs. (suggested
    by Alexander Duyck).
  - Added a whole bunch of test results provided by Rick Jones
    (Thanks Rick!)

Tested:
  Manually forced all packets to go through the xps_flows path.
  Observed that some flows were deferred from changing queues because
  packets were still in flight for the flow bucket.

Testing done by Rick Jones:

  Here is a quick look at performance results from trying the prototype
  fix for the packet reordering problem with VMs sending over an
  XPS-configured NIC, in particular the Emulex/Avago/Broadcom Skyhawk.
  The fix was applied to a 4.4 kernel.

  Before: 3884 Mbit/s
  After: 8897 Mbit/s

  That was from a VM on a node with a Skyhawk and 2 E5-2640 processors
  to baremetal E5-2640 with a BE3.  Physical MTU was 1500, the VM's
  vNIC's MTU was 1400.  Systems were HPE ProLiants in OS Control Mode
  for power management, with the "performance" frequency governor
  loaded. An OpenStack Mitaka setup with Distributed Virtual Router.

  We had some other NIC types in the setup as well.  XPS was also
  enabled on the ConnectX3-Pro.  It was not enabled on the 82599ES (a
  function of the kernel being used, which had it disabled from the
  first reports of XPS negatively affecting VM traffic at the beginning
  of the year).

  Average Mbit/s From NIC type To Bare Metal BE3:

   NIC Type,
   CPU on VM Host            Before        After
  ------------------------------------------------
  ConnectX-3 Pro,E5-2670v3    9224         9271
  BE3, E5-2640                9016         9022
  82599, E5-2640              9192         9003
  BCM57840, E5-2640           9213         9153
  Skyhawk, E5-2640            3884         8897

  For completeness:
  Average Mbit/s To NIC type from Bare Metal BE3:

  NIC Type,
  CPU on VM Host            Before        After
  ------------------------------------------------
  ConnectX-3 Pro,E5-2670v3    9322         9144
  BE3, E5-2640                9074         9017
  82599, E5-2640              8670         8564
  BCM57840, E5-2640           2468 *       7979
  Skyhawk, E5-2640            8897         9269

  * This is the busted bnx2x NIC FW GRO implementation issue.  It was
    not visible in the "After" column because the system was set up to
    disable the NIC FW GRO by the time it booted on the fixed kernel.

  Average Transactions/s Between NIC type and Bare Metal BE3:

  NIC Type,
  CPU on VM Host            Before        After
  ------------------------------------------------
  ConnectX-3 Pro,E5-2670v3   12421         12612
  BE3, E5-2640                8178          8484
  82599, E5-2640              8499          8549
  BCM57840, E5-2640           8544          8560
  Skyhawk, E5-2640            8537          8701

Tom Herbert (5):
  net: Set SW hash in skb_set_hash_from_sk
  dql: Add counters for number of queuing and completion operations
  net: Add xps_dev_flow_table_cnt
  xps_flows: XPS for packets that don't have a socket
  xps: Documentation for transmit socketless flow steering

 Documentation/ABI/testing/sysfs-class-net |   8 +++
 Documentation/networking/scaling.txt      |  26 ++++++++
 include/linux/dynamic_queue_limits.h      |   7 +-
 include/linux/netdevice.h                 |  26 +++++++-
 include/net/sock.h                        |   6 +-
 lib/dynamic_queue_limits.c                |   3 +-
 net/Kconfig                               |   6 ++
 net/core/dev.c                            |  87 +++++++++++++++++++------
 net/core/net-sysfs.c                      | 103 ++++++++++++++++++++++++++++++
 9 files changed, 246 insertions(+), 26 deletions(-)

-- 
2.9.3

* [PATCH v2 net-next 1/5] net: Set SW hash in skb_set_hash_from_sk
  2016-09-29  3:54 [PATCH v2 net-next 0/5] xps_flows: XPS flow steering when there is no socket Tom Herbert
@ 2016-09-29  3:54 ` Tom Herbert
  2016-09-29  3:54 ` [PATCH v2 net-next 2/5] dql: Add counters for number of queuing and completion operations Tom Herbert
                   ` (3 subsequent siblings)
  4 siblings, 0 replies; 14+ messages in thread
From: Tom Herbert @ 2016-09-29  3:54 UTC (permalink / raw)
  To: davem, netdev; +Cc: kernel-team, rick.jones2, alexander.duyck

Use __skb_set_sw_hash() to set the hash in an skbuff from the socket
txhash.

Signed-off-by: Tom Herbert <tom@herbertland.com>
---
 include/net/sock.h | 6 ++----
 1 file changed, 2 insertions(+), 4 deletions(-)

diff --git a/include/net/sock.h b/include/net/sock.h
index ebf75db..17d379a 100644
--- a/include/net/sock.h
+++ b/include/net/sock.h
@@ -1920,10 +1920,8 @@ static inline void sock_poll_wait(struct file *filp,
 
 static inline void skb_set_hash_from_sk(struct sk_buff *skb, struct sock *sk)
 {
-	if (sk->sk_txhash) {
-		skb->l4_hash = 1;
-		skb->hash = sk->sk_txhash;
-	}
+	if (sk->sk_txhash)
+		__skb_set_sw_hash(skb, sk->sk_txhash, true);
 }
 
 void skb_set_owner_w(struct sk_buff *skb, struct sock *sk);
-- 
2.9.3

* [PATCH v2 net-next 2/5] dql: Add counters for number of queuing and completion operations
  2016-09-29  3:54 [PATCH v2 net-next 0/5] xps_flows: XPS flow steering when there is no socket Tom Herbert
  2016-09-29  3:54 ` [PATCH v2 net-next 1/5] net: Set SW hash in skb_set_hash_from_sk Tom Herbert
@ 2016-09-29  3:54 ` Tom Herbert
  2016-09-29  3:54 ` [PATCH v2 net-next 3/5] net: Add xps_dev_flow_table_cnt Tom Herbert
                   ` (2 subsequent siblings)
  4 siblings, 0 replies; 14+ messages in thread
From: Tom Herbert @ 2016-09-29  3:54 UTC (permalink / raw)
  To: davem, netdev; +Cc: kernel-team, rick.jones2, alexander.duyck

Add two new counters to struct dql: num_enqueue_ops and
num_completed_ops. num_enqueue_ops is incremented by one in each call to
dql_queued. num_completed_ops is incremented in dql_completed, which
takes a new argument indicating the number of operations completed.
These counters are only intended for statistics and do not impact the
BQL algorithm.

We add a new sysfs entry in byte_queue_limits named inflight_pkts.
It reports the number of packets in flight for the queue, computed as
dql->num_enqueue_ops - dql->num_completed_ops.

Signed-off-by: Tom Herbert <tom@herbertland.com>
---
 include/linux/dynamic_queue_limits.h |  7 ++++++-
 include/linux/netdevice.h            |  2 +-
 lib/dynamic_queue_limits.c           |  3 ++-
 net/core/net-sysfs.c                 | 14 ++++++++++++++
 4 files changed, 23 insertions(+), 3 deletions(-)

diff --git a/include/linux/dynamic_queue_limits.h b/include/linux/dynamic_queue_limits.h
index a4be703..b6a4804 100644
--- a/include/linux/dynamic_queue_limits.h
+++ b/include/linux/dynamic_queue_limits.h
@@ -43,6 +43,8 @@ struct dql {
 	unsigned int	adj_limit;		/* limit + num_completed */
 	unsigned int	last_obj_cnt;		/* Count at last queuing */
 
+	unsigned int	num_enqueue_ops;	/* Number of queue operations */
+
 	/* Fields accessed only by completion path (dql_completed) */
 
 	unsigned int	limit ____cacheline_aligned_in_smp; /* Current limit */
@@ -55,6 +57,8 @@ struct dql {
 	unsigned int	lowest_slack;		/* Lowest slack found */
 	unsigned long	slack_start_time;	/* Time slacks seen */
 
+	unsigned int	num_completed_ops;	/* Number of complete ops */
+
 	/* Configuration */
 	unsigned int	max_limit;		/* Max limit */
 	unsigned int	min_limit;		/* Minimum limit */
@@ -83,6 +87,7 @@ static inline void dql_queued(struct dql *dql, unsigned int count)
 	barrier();
 
 	dql->num_queued += count;
+	dql->num_enqueue_ops++;
 }
 
 /* Returns how many objects can be queued, < 0 indicates over limit. */
@@ -92,7 +97,7 @@ static inline int dql_avail(const struct dql *dql)
 }
 
 /* Record number of completed objects and recalculate the limit. */
-void dql_completed(struct dql *dql, unsigned int count);
+void dql_completed(struct dql *dql, unsigned int count, unsigned int ops);
 
 /* Reset dql state */
 void dql_reset(struct dql *dql);
diff --git a/include/linux/netdevice.h b/include/linux/netdevice.h
index 136ae6bb..9567107 100644
--- a/include/linux/netdevice.h
+++ b/include/linux/netdevice.h
@@ -3015,7 +3015,7 @@ static inline void netdev_tx_completed_queue(struct netdev_queue *dev_queue,
 	if (unlikely(!bytes))
 		return;
 
-	dql_completed(&dev_queue->dql, bytes);
+	dql_completed(&dev_queue->dql, bytes, pkts);
 
 	/*
 	 * Without the memory barrier there is a small possiblity that
diff --git a/lib/dynamic_queue_limits.c b/lib/dynamic_queue_limits.c
index f346715..d5e7a27 100644
--- a/lib/dynamic_queue_limits.c
+++ b/lib/dynamic_queue_limits.c
@@ -14,7 +14,7 @@
 #define AFTER_EQ(A, B) ((int)((A) - (B)) >= 0)
 
 /* Records completed count and recalculates the queue limit */
-void dql_completed(struct dql *dql, unsigned int count)
+void dql_completed(struct dql *dql, unsigned int count, unsigned int ops)
 {
 	unsigned int inprogress, prev_inprogress, limit;
 	unsigned int ovlimit, completed, num_queued;
@@ -108,6 +108,7 @@ void dql_completed(struct dql *dql, unsigned int count)
 	dql->prev_ovlimit = ovlimit;
 	dql->prev_last_obj_cnt = dql->last_obj_cnt;
 	dql->num_completed = completed;
+	dql->num_completed_ops += ops;
 	dql->prev_num_queued = num_queued;
 }
 EXPORT_SYMBOL(dql_completed);
diff --git a/net/core/net-sysfs.c b/net/core/net-sysfs.c
index 6e4f347..ab7b0b6 100644
--- a/net/core/net-sysfs.c
+++ b/net/core/net-sysfs.c
@@ -1147,6 +1147,19 @@ static ssize_t bql_show_inflight(struct netdev_queue *queue,
 static struct netdev_queue_attribute bql_inflight_attribute =
 	__ATTR(inflight, S_IRUGO, bql_show_inflight, NULL);
 
+static ssize_t bql_show_inflight_pkts(struct netdev_queue *queue,
+				      struct netdev_queue_attribute *attr,
+				      char *buf)
+{
+	struct dql *dql = &queue->dql;
+
+	return sprintf(buf, "%u\n",
+		       dql->num_enqueue_ops - dql->num_completed_ops);
+}
+
+static struct netdev_queue_attribute bql_inflight_pkts_attribute =
+	__ATTR(inflight_pkts, S_IRUGO, bql_show_inflight_pkts, NULL);
+
 #define BQL_ATTR(NAME, FIELD)						\
 static ssize_t bql_show_ ## NAME(struct netdev_queue *queue,		\
 				 struct netdev_queue_attribute *attr,	\
@@ -1176,6 +1189,7 @@ static struct attribute *dql_attrs[] = {
 	&bql_limit_min_attribute.attr,
 	&bql_hold_time_attribute.attr,
 	&bql_inflight_attribute.attr,
+	&bql_inflight_pkts_attribute.attr,
 	NULL
 };
 
-- 
2.9.3

* [PATCH v2 net-next 3/5] net: Add xps_dev_flow_table_cnt
  2016-09-29  3:54 [PATCH v2 net-next 0/5] xps_flows: XPS flow steering when there is no socket Tom Herbert
  2016-09-29  3:54 ` [PATCH v2 net-next 1/5] net: Set SW hash in skb_set_hash_from_sk Tom Herbert
  2016-09-29  3:54 ` [PATCH v2 net-next 2/5] dql: Add counters for number of queuing and completion operations Tom Herbert
@ 2016-09-29  3:54 ` Tom Herbert
  2016-09-29  3:54 ` [PATCH v2 net-next 4/5] xps_flows: XPS for packets that don't have a socket Tom Herbert
  2016-09-29  3:54 ` [PATCH v2 net-next 5/5] xps: Documentation for transmit socketless flow steering Tom Herbert
  4 siblings, 0 replies; 14+ messages in thread
From: Tom Herbert @ 2016-09-29  3:54 UTC (permalink / raw)
  To: davem, netdev; +Cc: kernel-team, rick.jones2, alexander.duyck

Add infrastructure and definitions to create XPS flow tables. This
creates the new sysfs entry /sys/class/net/eth*/xps_dev_flow_table_cnt.

Signed-off-by: Tom Herbert <tom@herbertland.com>
---
 include/linux/netdevice.h | 24 +++++++++++++
 net/core/net-sysfs.c      | 89 +++++++++++++++++++++++++++++++++++++++++++++++
 2 files changed, 113 insertions(+)

diff --git a/include/linux/netdevice.h b/include/linux/netdevice.h
index 9567107..60063b3 100644
--- a/include/linux/netdevice.h
+++ b/include/linux/netdevice.h
@@ -736,6 +736,27 @@ struct xps_dev_maps {
     (nr_cpu_ids * sizeof(struct xps_map *)))
 #endif /* CONFIG_XPS */
 
+#ifdef CONFIG_XPS_FLOWS
+struct xps_dev_flow {
+	union {
+		u64	v64;
+		struct {
+			int		queue_index;
+			unsigned int	queue_ptr;
+		};
+	};
+};
+
+struct xps_dev_flow_table {
+	unsigned int mask;
+	struct rcu_head rcu;
+	struct xps_dev_flow flows[0];
+};
+#define XPS_DEV_FLOW_TABLE_SIZE(_num) (sizeof(struct xps_dev_flow_table) + \
+	((_num) * sizeof(struct xps_dev_flow)))
+
+#endif /* CONFIG_XPS_FLOWS */
+
 #define TC_MAX_QUEUE	16
 #define TC_BITMASK	15
 /* HW offloaded queuing disciplines txq count and offset maps */
@@ -1825,6 +1846,9 @@ struct net_device {
 #ifdef CONFIG_XPS
 	struct xps_dev_maps __rcu *xps_maps;
 #endif
+#ifdef CONFIG_XPS_FLOWS
+	struct xps_dev_flow_table __rcu *xps_flow_table;
+#endif
 #ifdef CONFIG_NET_CLS_ACT
 	struct tcf_proto __rcu  *egress_cl_list;
 #endif
diff --git a/net/core/net-sysfs.c b/net/core/net-sysfs.c
index ab7b0b6..0d00b9c 100644
--- a/net/core/net-sysfs.c
+++ b/net/core/net-sysfs.c
@@ -503,6 +503,92 @@ static ssize_t phys_switch_id_show(struct device *dev,
 }
 static DEVICE_ATTR_RO(phys_switch_id);
 
+#ifdef CONFIG_XPS_FLOWS
+static void xps_dev_flow_table_release(struct rcu_head *rcu)
+{
+	struct xps_dev_flow_table *table = container_of(rcu,
+	    struct xps_dev_flow_table, rcu);
+	vfree(table);
+}
+
+static int change_xps_dev_flow_table_cnt(struct net_device *dev,
+					 unsigned long count)
+{
+	unsigned long mask;
+	struct xps_dev_flow_table *table, *old_table;
+	static DEFINE_SPINLOCK(xps_dev_flow_lock);
+
+	if (!capable(CAP_NET_ADMIN))
+		return -EPERM;
+
+	if (count) {
+		mask = count - 1;
+		/* mask = roundup_pow_of_two(count) - 1;
+		 * without overflows...
+		 */
+		while ((mask | (mask >> 1)) != mask)
+			mask |= (mask >> 1);
+		/* On 64 bit arches, must check mask fits in table->mask (u32),
+		 * and on 32bit arches, must check
+		 * XPS_DEV_FLOW_TABLE_SIZE(mask + 1) doesn't overflow.
+		 */
+#if BITS_PER_LONG > 32
+		if (mask > (unsigned long)(u32)mask)
+			return -EINVAL;
+#else
+		if (mask > (ULONG_MAX - XPS_DEV_FLOW_TABLE_SIZE(1))
+				/ sizeof(struct xps_dev_flow)) {
+			/* Enforce a limit to prevent overflow */
+			return -EINVAL;
+		}
+#endif
+		table = vmalloc(XPS_DEV_FLOW_TABLE_SIZE(mask + 1));
+		if (!table)
+			return -ENOMEM;
+
+		table->mask = mask;
+		for (count = 0; count <= mask; count++)
+			table->flows[count].queue_index = -1;
+	} else
+		table = NULL;
+
+	spin_lock(&xps_dev_flow_lock);
+	old_table = rcu_dereference_protected(dev->xps_flow_table,
+					      lockdep_is_held(&xps_dev_flow_lock));
+	rcu_assign_pointer(dev->xps_flow_table, table);
+	spin_unlock(&xps_dev_flow_lock);
+
+	if (old_table)
+		call_rcu(&old_table->rcu, xps_dev_flow_table_release);
+
+	return 0;
+}
+
+static ssize_t xps_dev_flow_table_cnt_store(struct device *dev,
+					    struct device_attribute *attr,
+					    const char *buf, size_t len)
+{
+	return netdev_store(dev, attr, buf, len, change_xps_dev_flow_table_cnt);
+}
+
+static ssize_t xps_dev_flow_table_cnt_show(struct device *dev,
+			    struct device_attribute *attr, char *buf)
+{
+	struct net_device *netdev = to_net_dev(dev);
+	struct xps_dev_flow_table *table;
+	unsigned int cnt = 0;
+
+	rcu_read_lock();
+	table = rcu_dereference(netdev->xps_flow_table);
+	if (table)
+		cnt = table->mask + 1;
+	rcu_read_unlock();
+
+	return sprintf(buf, fmt_dec, cnt);
+}
+DEVICE_ATTR_RW(xps_dev_flow_table_cnt);
+#endif /* CONFIG_XPS_FLOWS */
+
 static struct attribute *net_class_attrs[] = {
 	&dev_attr_netdev_group.attr,
 	&dev_attr_type.attr,
@@ -531,6 +617,9 @@ static struct attribute *net_class_attrs[] = {
 	&dev_attr_phys_port_name.attr,
 	&dev_attr_phys_switch_id.attr,
 	&dev_attr_proto_down.attr,
+#ifdef CONFIG_XPS_FLOWS
+	&dev_attr_xps_dev_flow_table_cnt.attr,
+#endif
 	NULL,
 };
 ATTRIBUTE_GROUPS(net_class);
-- 
2.9.3

* [PATCH v2 net-next 4/5] xps_flows: XPS for packets that don't have a socket
  2016-09-29  3:54 [PATCH v2 net-next 0/5] xps_flows: XPS flow steering when there is no socket Tom Herbert
                   ` (2 preceding siblings ...)
  2016-09-29  3:54 ` [PATCH v2 net-next 3/5] net: Add xps_dev_flow_table_cnt Tom Herbert
@ 2016-09-29  3:54 ` Tom Herbert
  2016-09-29  4:54   ` Eric Dumazet
  2016-09-29  3:54 ` [PATCH v2 net-next 5/5] xps: Documentation for transmit socketless flow steering Tom Herbert
  4 siblings, 1 reply; 14+ messages in thread
From: Tom Herbert @ 2016-09-29  3:54 UTC (permalink / raw)
  To: davem, netdev; +Cc: kernel-team, rick.jones2, alexander.duyck

xps_flows maintains a per device flow table that is indexed by the
skbuff hash. The table is only consulted when there is no queue saved in
a transmit socket for an skbuff.

Each entry in the flow table contains a queue index and a queue
pointer. The queue pointer is set when a queue is chosen using a
flow table entry. This pointer is set to the head pointer in the
transmit queue (which is maintained by BQL).

The new function get_xps_flows_index looks up flows in the xps_flows
table. The entry returned gives the last queue a matching flow
used. The returned queue is compared against the normal XPS queue. If
they are different, then we only switch if the tail pointer in the TX
queue has advanced past the pointer saved in the entry. In this
way OOO should be avoided when XPS wants to use a different queue.

Signed-off-by: Tom Herbert <tom@herbertland.com>
---
 net/Kconfig    |  6 ++++
 net/core/dev.c | 87 +++++++++++++++++++++++++++++++++++++++++++++-------------
 2 files changed, 74 insertions(+), 19 deletions(-)

diff --git a/net/Kconfig b/net/Kconfig
index 7b6cd34..f77fad1 100644
--- a/net/Kconfig
+++ b/net/Kconfig
@@ -255,6 +255,12 @@ config XPS
 	depends on SMP
 	default y
 
+config XPS_FLOWS
+	bool
+	depends on XPS
+	depends on BQL
+	default y
+
 config HWBM
        bool
 
diff --git a/net/core/dev.c b/net/core/dev.c
index c0c291f..1ca08b9 100644
--- a/net/core/dev.c
+++ b/net/core/dev.c
@@ -3210,6 +3210,7 @@ sch_handle_egress(struct sk_buff *skb, int *ret, struct net_device *dev)
 }
 #endif /* CONFIG_NET_EGRESS */
 
+/* Must be called with RCU read_lock */
 static inline int get_xps_queue(struct net_device *dev, struct sk_buff *skb)
 {
 #ifdef CONFIG_XPS
@@ -3217,7 +3218,6 @@ static inline int get_xps_queue(struct net_device *dev, struct sk_buff *skb)
 	struct xps_map *map;
 	int queue_index = -1;
 
-	rcu_read_lock();
 	dev_maps = rcu_dereference(dev->xps_maps);
 	if (dev_maps) {
 		map = rcu_dereference(
@@ -3228,15 +3228,62 @@ static inline int get_xps_queue(struct net_device *dev, struct sk_buff *skb)
 			else
 				queue_index = map->queues[reciprocal_scale(skb_get_hash(skb),
 									   map->len)];
-			if (unlikely(queue_index >= dev->real_num_tx_queues))
-				queue_index = -1;
+			if (queue_index >= 0 &&
+			    likely(queue_index < dev->real_num_tx_queues))
+				return queue_index;
 		}
 	}
-	rcu_read_unlock();
+#endif
+	return skb_tx_hash(dev, skb);
+}
+
+/* Must be called with RCU read_lock */
+static int get_xps_flows_index(struct net_device *dev, struct sk_buff *skb)
+{
+#ifdef CONFIG_XPS_FLOWS
+	struct xps_dev_flow_table *flow_table;
+	struct xps_dev_flow ent;
+	int queue_index;
+	struct netdev_queue *txq;
+	u32 hash;
+
+	queue_index = get_xps_queue(dev, skb);
+	if (queue_index < 0)
+		return -1;
+
+	flow_table = rcu_dereference(dev->xps_flow_table);
+	if (!flow_table)
+		return queue_index;
+
+	hash = skb_get_hash(skb);
+	if (!hash)
+		return queue_index;
+
+	ent.v64 = flow_table->flows[hash & flow_table->mask].v64;
+
+	if (queue_index != ent.queue_index &&
+	    ent.queue_index >= 0 &&
+	    ent.queue_index < dev->real_num_tx_queues) {
+		txq = netdev_get_tx_queue(dev, ent.queue_index);
+		if ((int)(txq->dql.num_completed_ops - ent.queue_ptr) < 0)  {
+			/* The current queue's tail has not advanced beyond the
+			 * last packet that was enqueued using the table entry.
+			 * We can't change queues without risking OOO. Stick
+			 * with the queue listed in the flow table.
+			 */
+			queue_index = ent.queue_index;
+		}
+	}
+
+	/* Save the updated entry */
+	txq = netdev_get_tx_queue(dev, queue_index);
+	ent.queue_index = queue_index;
+	ent.queue_ptr = txq->dql.num_enqueue_ops;
+	flow_table->flows[hash & flow_table->mask].v64 = ent.v64;
 
 	return queue_index;
 #else
-	return -1;
+	return get_xps_queue(dev, skb);
 #endif
 }
 
@@ -3244,22 +3291,24 @@ static u16 __netdev_pick_tx(struct net_device *dev, struct sk_buff *skb)
 {
 	struct sock *sk = skb->sk;
 	int queue_index = sk_tx_queue_get(sk);
-
-	if (queue_index < 0 || skb->ooo_okay ||
-	    queue_index >= dev->real_num_tx_queues) {
-		int new_index = get_xps_queue(dev, skb);
-		if (new_index < 0)
-			new_index = skb_tx_hash(dev, skb);
-
-		if (queue_index != new_index && sk &&
-		    sk_fullsock(sk) &&
-		    rcu_access_pointer(sk->sk_dst_cache))
-			sk_tx_queue_set(sk, new_index);
-
-		queue_index = new_index;
+	int new_index;
+
+	if (queue_index < 0) {
+		/* Socket did not provide a queue index, try xps_flows */
+		new_index = get_xps_flows_index(dev, skb);
+	} else if (skb->ooo_okay || queue_index >= dev->real_num_tx_queues) {
+		/* Queue index in socket, see if we can find a better one */
+		new_index = get_xps_queue(dev, skb);
+	} else {
+		/* Valid queue in socket and can't send OOO. Just return it */
+		return queue_index;
 	}
 
-	return queue_index;
+	if (queue_index != new_index && sk && sk_fullsock(sk) &&
+	    rcu_access_pointer(sk->sk_dst_cache))
+		sk_tx_queue_set(sk, new_index);
+
+	return new_index;
 }
 
 struct netdev_queue *netdev_pick_tx(struct net_device *dev,
-- 
2.9.3

* [PATCH v2 net-next 5/5] xps: Documentation for transmit socketless flow steering
  2016-09-29  3:54 [PATCH v2 net-next 0/5] xps_flows: XPS flow steering when there is no socket Tom Herbert
                   ` (3 preceding siblings ...)
  2016-09-29  3:54 ` [PATCH v2 net-next 4/5] xps_flows: XPS for packets that don't have a socket Tom Herbert
@ 2016-09-29  3:54 ` Tom Herbert
  4 siblings, 0 replies; 14+ messages in thread
From: Tom Herbert @ 2016-09-29  3:54 UTC (permalink / raw)
  To: davem, netdev; +Cc: kernel-team, rick.jones2, alexander.duyck

Signed-off-by: Tom Herbert <tom@herbertland.com>
---
 Documentation/ABI/testing/sysfs-class-net |  8 ++++++++
 Documentation/networking/scaling.txt      | 26 ++++++++++++++++++++++++++
 2 files changed, 34 insertions(+)

diff --git a/Documentation/ABI/testing/sysfs-class-net b/Documentation/ABI/testing/sysfs-class-net
index 668604f..0d2a6a9 100644
--- a/Documentation/ABI/testing/sysfs-class-net
+++ b/Documentation/ABI/testing/sysfs-class-net
@@ -251,3 +251,11 @@ Contact:	netdev@vger.kernel.org
 Description:
 		Indicates the unique physical switch identifier of a switch this
 		port belongs to, as a string.
+
+What:		/sys/class/net/<iface>/xps_dev_flow_table_cnt
+Date:		October 2016
+KernelVersion:	4.8
+Contact:	netdev@vger.kernel.org
+Description:
+		Indicates the number of entries in the XPS socketless flow
+		table.
diff --git a/Documentation/networking/scaling.txt b/Documentation/networking/scaling.txt
index 59f4db2..7f9590c 100644
--- a/Documentation/networking/scaling.txt
+++ b/Documentation/networking/scaling.txt
@@ -400,6 +400,26 @@ transport layer is responsible for setting ooo_okay appropriately. TCP,
 for instance, sets the flag when all data for a connection has been
 acknowledged.
 
+XPS: Transmit Socketless Flow Steering
+======================================
+
+Not all flows have an associated socket in which the transmit queue and
+ooo information can be saved. For instance, packets being transmitted over
+OVS for VMs do not have a socket that is visible to the kernel. To allow
+XPS to be effectively used a flow hash table is employed to save queue and
+ooo information for such socketless flows.
+
+The idea is that transmit queues are saved for socketless flows in a
+flow table that is indexed by skbuff hash. The flow table entries have
+two values: the queue_index and the head cnt of packets from the TX
+queue. When a packet is being sent without an associated socket the
+hash table is consulted which returns the queue_index to use. The
+returned queue is compared to the queue that would be selected by XPS.
+If they are different, the XPS queue is selected only if the tail cnt
+in the TX queue advances beyond the recorded head cnt. This checks if
+sending the packet on the new queue could cause ooo packets for the
+flow.
+
 ==== XPS Configuration
 
 XPS is only available if the kconfig symbol CONFIG_XPS is enabled (on by
@@ -409,6 +429,12 @@ queue is configured using the sysfs file entry:
 
 /sys/class/net/<dev>/queues/tx-<n>/xps_cpus
 
+Transmit socketless flow steering is enabled by setting the number of
+entries in the XPS flow table. This is configured in the sysfs file
+entry:
+
+/sys/class/net/<dev>/xps_dev_flow_table_cnt
+
 == Suggested Configuration
 
 For a network device with a single transmission queue, XPS configuration
-- 
2.9.3

* Re: [PATCH v2 net-next 4/5] xps_flows: XPS for packets that don't have a socket
  2016-09-29  3:54 ` [PATCH v2 net-next 4/5] xps_flows: XPS for packets that don't have a socket Tom Herbert
@ 2016-09-29  4:54   ` Eric Dumazet
  2016-09-29 12:53     ` Tom Herbert
  0 siblings, 1 reply; 14+ messages in thread
From: Eric Dumazet @ 2016-09-29  4:54 UTC (permalink / raw)
  To: Tom Herbert; +Cc: davem, netdev, kernel-team, rick.jones2, alexander.duyck

On Wed, 2016-09-28 at 20:54 -0700, Tom Herbert wrote:
> xps_flows maintains a per device flow table that is indexed by the
> skbuff hash. The table is only consulted when there is no queue saved in
> a transmit socket for an skbuff.
> 
> Each entry in the flow table contains a queue index and a queue
> pointer. The queue pointer is set when a queue is chosen using a
> flow table entry. This pointer is set to the head pointer in the
> transmit queue (which is maintained by BQL).
> 
> The new function get_xps_flows_index that looks up flows in the
> xps_flows table. The entry returned gives the last queue a matching flow
> used. The returned queue is compared against the normal XPS queue. If
> they are different, then we only switch if the tail pointer in the TX
> queue has advanced past the pointer saved in the entry. In this
> way OOO should be avoided when XPS wants to use a different queue.

There is something I do not understand with this.

If this OOO avoidance is tied to BQL, it means that packets sitting in a
qdisc won't be part of the detection.

So packets of flow X could possibly be queued on multiple qdiscs.

* Re: [PATCH v2 net-next 4/5] xps_flows: XPS for packets that don't have a socket
  2016-09-29  4:54   ` Eric Dumazet
@ 2016-09-29 12:53     ` Tom Herbert
  2016-09-29 13:18       ` Eric Dumazet
  0 siblings, 1 reply; 14+ messages in thread
From: Tom Herbert @ 2016-09-29 12:53 UTC (permalink / raw)
  To: Eric Dumazet
  Cc: David S. Miller, Linux Kernel Network Developers, Kernel Team,
	Rick Jones, Alexander Duyck

On Thu, Sep 29, 2016 at 12:54 AM, Eric Dumazet <eric.dumazet@gmail.com> wrote:
> On Wed, 2016-09-28 at 20:54 -0700, Tom Herbert wrote:
>> xps_flows maintains a per device flow table that is indexed by the
>> skbuff hash. The table is only consulted when there is no queue saved in
>> a transmit socket for an skbuff.
>>
>> Each entry in the flow table contains a queue index and a queue
>> pointer. The queue pointer is set when a queue is chosen using a
>> flow table entry. This pointer is set to the head pointer in the
>> transmit queue (which is maintained by BQL).
>>
>> The new function get_xps_flows_index that looks up flows in the
>> xps_flows table. The entry returned gives the last queue a matching flow
>> used. The returned queue is compared against the normal XPS queue. If
>> they are different, then we only switch if the tail pointer in the TX
>> queue has advanced past the pointer saved in the entry. In this
>> way OOO should be avoided when XPS wants to use a different queue.
>
> There is something I do not understand with this.
>
> If this OOO avoidance is tied to BQL, it means that packets sitting in a
> qdisc wont be part of the detection.
>
> So packets of flow X could possibly be queued on multiple qdiscs.
>
Hi Eric,

I'm not sure I understand your concern. If packets for flow X can be
queued on multiple qdiscs wouldn't that mean that flow will be subject
to ooo transmission regardless of this patch? In other words here
we're trying to keep packets for the flow in order as they are emitted
from the qdiscs which we assume maintains ordering of packets in a
flow.

Tom

>
>

* Re: [PATCH v2 net-next 4/5] xps_flows: XPS for packets that don't have a socket
  2016-09-29 12:53     ` Tom Herbert
@ 2016-09-29 13:18       ` Eric Dumazet
  2016-09-29 14:08         ` Tom Herbert
  2016-09-29 16:35         ` Rick Jones
  0 siblings, 2 replies; 14+ messages in thread
From: Eric Dumazet @ 2016-09-29 13:18 UTC (permalink / raw)
  To: Tom Herbert
  Cc: David S. Miller, Linux Kernel Network Developers, Kernel Team,
	Rick Jones, Alexander Duyck

On Thu, 2016-09-29 at 08:53 -0400, Tom Herbert wrote:
> On Thu, Sep 29, 2016 at 12:54 AM, Eric Dumazet <eric.dumazet@gmail.com> wrote:
> > On Wed, 2016-09-28 at 20:54 -0700, Tom Herbert wrote:
> >> xps_flows maintains a per device flow table that is indexed by the
> >> skbuff hash. The table is only consulted when there is no queue saved in
> >> a transmit socket for an skbuff.
> >>
> >> Each entry in the flow table contains a queue index and a queue
> >> pointer. The queue pointer is set when a queue is chosen using a
> >> flow table entry. This pointer is set to the head pointer in the
> >> transmit queue (which is maintained by BQL).
> >>
> >> The new function get_xps_flows_index that looks up flows in the
> >> xps_flows table. The entry returned gives the last queue a matching flow
> >> used. The returned queue is compared against the normal XPS queue. If
> >> they are different, then we only switch if the tail pointer in the TX
> >> queue has advanced past the pointer saved in the entry. In this
> >> way OOO should be avoided when XPS wants to use a different queue.
> >
> > There is something I do not understand with this.
> >
> > If this OOO avoidance is tied to BQL, it means that packets sitting in a
> > qdisc wont be part of the detection.
> >
> > So packets of flow X could possibly be queued on multiple qdiscs.
> >
> Hi Eric,
> 
> I'm not sure I understand your concern. If packets for flow X can be
> queued on multiple qdiscs wouldn't that mean that flow will be subject
> to ooo transmission regardless of this patch? In other words here
> we're trying to keep packets for the flow in order as they are emitted
> from the qdiscs which we assume maintains ordering of packets in a
> flow.

Well, then what is this patch series solving?

You have a producer of packets running on 8 vcpus in a VM.

Packets are exiting the VM and need to be queued on a mq NIC in the
hypervisor.

Flow X can be scheduled on any of these 8 vcpus, so XPS is currently
selecting different TXQ.

Your patch aims to detect this and steer packets to one TXQ, but not
using a static hash() % num_real_queues (as RSS would have done on a NIC
at RX), but trying to please XPS (as : selecting a queue close to the
CPU producing the packet. VM with one vcpu, and cpu bounded, would
really be happy to not spread packets all over the queues)

Your mechanism relies on counters updated at the time packets are given
to the NIC, not at the time packets are given to the txq (and eventually
sit for a while in the qdisc right before BQL layer)

dev_queue_xmit() is not talking to the device directly...


All I am saying is that the producer/consumer counters you added are not
at the right place.

You tried hard to not change the drivers, by adding something to
existing BQL. But packets can sit in a qdisc for a while...

So you added 2 potential cache lines misses per outgoing packet, but
with no guarantee that OOO are really avoided.

* Re: [PATCH v2 net-next 4/5] xps_flows: XPS for packets that don't have a socket
  2016-09-29 13:18       ` Eric Dumazet
@ 2016-09-29 14:08         ` Tom Herbert
  2016-09-29 14:51           ` Eric Dumazet
  2016-09-29 16:35         ` Rick Jones
  1 sibling, 1 reply; 14+ messages in thread
From: Tom Herbert @ 2016-09-29 14:08 UTC (permalink / raw)
  To: Eric Dumazet
  Cc: David S. Miller, Linux Kernel Network Developers, Kernel Team,
	Rick Jones, Alexander Duyck

On Thu, Sep 29, 2016 at 9:18 AM, Eric Dumazet <eric.dumazet@gmail.com> wrote:
> On Thu, 2016-09-29 at 08:53 -0400, Tom Herbert wrote:
>> On Thu, Sep 29, 2016 at 12:54 AM, Eric Dumazet <eric.dumazet@gmail.com> wrote:
>> > On Wed, 2016-09-28 at 20:54 -0700, Tom Herbert wrote:
>> >> xps_flows maintains a per device flow table that is indexed by the
>> >> skbuff hash. The table is only consulted when there is no queue saved in
>> >> a transmit socket for an skbuff.
>> >>
>> >> Each entry in the flow table contains a queue index and a queue
>> >> pointer. The queue pointer is set when a queue is chosen using a
>> >> flow table entry. This pointer is set to the head pointer in the
>> >> transmit queue (which is maintained by BQL).
>> >>
>> >> The new function get_xps_flows_index that looks up flows in the
>> >> xps_flows table. The entry returned gives the last queue a matching flow
>> >> used. The returned queue is compared against the normal XPS queue. If
>> >> they are different, then we only switch if the tail pointer in the TX
>> >> queue has advanced past the pointer saved in the entry. In this
>> >> way OOO should be avoided when XPS wants to use a different queue.
>> >
>> > There is something I do not understand with this.
>> >
>> > If this OOO avoidance is tied to BQL, it means that packets sitting in a
>> > qdisc wont be part of the detection.
>> >
>> > So packets of flow X could possibly be queued on multiple qdiscs.
>> >
>> Hi Eric,
>>
>> I'm not sure I understand your concern. If packets for flow X can be
>> queued on multiple qdiscs wouldn't that mean that flow will be subject
>> to ooo transmission regardless of this patch? In other words here
>> we're trying to keep packets for the flow in order as they are emitted
>> from the qdiscs which we assume maintains ordering of packets in a
>> flow.
>
> Well, then what this patch series is solving ?
>
It addresses  the issue that Rick Jones pointed out was happening with
XPS. When packets are sent for a flow that has no socket and XPS is
enabled then each packet uses the XPS queue based on the running CPU.
Since the thread sending on a flow can be rescheduled on different
CPUs this is creating ooo packets. In this case the ooo is being
caused by interaction with XPS.

> You have a producer of packets running on 8 vcpus in a VM.
>
> Packets are exiting the VM and need to be queued on a mq NIC in the
> hypervisor.
>
> Flow X can be scheduled on any of these 8 vcpus, so XPS is currently
> selecting different TXQ.
>
> Your patch aims to detect this and steer packets to one TXQ, but not
> using a static hash() % num_real_queues (as RSS would have done on a NIC
> at RX), but trying to please XPS (as : selecting a queue close to the
> CPU producing the packet. VM with one vcpu, and cpu bounded, would
> really be happy to not spread packets all over the queues)
>
You're not describing this patch, you're describing how XPS works in
the first place. This patch is done to make socketless packets work
well with XPS. If all you want is a static hash() % num_real_queues
then just disable XPS to get that.

> Your mechanism relies on counters updated at the time packets are given
> to the NIC, not at the time packets are given to the txq (and eventually
> sit for a while in the qdisc right before BQL layer)
>
> dev_queue_xmit() is not talking to the device directly...
>
>
> All I am saying is that the producer/consumer counters you added are not
> at the right place.
>
> You tried hard to not change the drivers, by adding something to
> existing BQL. But packets can sit in a qdisc for a while...
>
But again, this patch is to prevent ooo being caused by an interaction
with XPS. It does not address ooo being caused by qdiscs or VMs, but
then I don't see that as being the issue reported by Rick.

> So you added 2 potential cache lines misses per outgoing packet, but
> with no guarantee that OOO are really avoided.
>
>
>

* Re: [PATCH v2 net-next 4/5] xps_flows: XPS for packets that don't have a socket
  2016-09-29 14:08         ` Tom Herbert
@ 2016-09-29 14:51           ` Eric Dumazet
  2016-09-29 15:15             ` Eric Dumazet
  0 siblings, 1 reply; 14+ messages in thread
From: Eric Dumazet @ 2016-09-29 14:51 UTC (permalink / raw)
  To: Tom Herbert
  Cc: David S. Miller, Linux Kernel Network Developers, Kernel Team,
	Rick Jones, Alexander Duyck

On Thu, 2016-09-29 at 10:08 -0400, Tom Herbert wrote:

> It addresses  the issue that Rick Jones pointed out was happening with
> XPS. When packets are sent for a flow that has no socket and XPS is
> enabled then each packet uses the XPS queue based on the running CPU.
> Since the thread sending on a flow can be rescheduled on different
> CPUs this is creating ooo packets. In this case the ooo is being
> caused by interaction with XPS.
> 

Nope, your patch does not address the problem properly.

I am not sure I want to spend more time explaining the issue.

Lets talk about this in Tokyo next week.

* Re: [PATCH v2 net-next 4/5] xps_flows: XPS for packets that don't have a socket
  2016-09-29 14:51           ` Eric Dumazet
@ 2016-09-29 15:15             ` Eric Dumazet
  2016-09-29 20:26               ` Tom Herbert
  0 siblings, 1 reply; 14+ messages in thread
From: Eric Dumazet @ 2016-09-29 15:15 UTC (permalink / raw)
  To: Tom Herbert
  Cc: David S. Miller, Linux Kernel Network Developers, Kernel Team,
	Rick Jones, Alexander Duyck

On Thu, 2016-09-29 at 07:51 -0700, Eric Dumazet wrote:
> On Thu, 2016-09-29 at 10:08 -0400, Tom Herbert wrote:
> 
> > It addresses  the issue that Rick Jones pointed out was happening with
> > XPS. When packets are sent for a flow that has no socket and XPS is
> > enabled then each packet uses the XPS queue based on the running CPU.
> > Since the thread sending on a flow can be rescheduled on different
> > CPUs this is creating ooo packets. In this case the ooo is being
> > caused by interaction with XPS.
> > 
> 
> Nope, your patch does not address the problem properly.
> 
> I am not sure I want to spend more time explaining the issue.
> 
> Lets talk about this in Tokyo next week.
> 

Just as a reminder, sorry to bother you, stating some obvious facts for
both of us. We have public exchanges, so we also need to re-explain how
things work.

Queue selection on xmit happens before we hit the qdisc and its delays.

So when you access txq->dql.num_completed_ops and
txq->dql.num_enqueue_ops you can observe values that do not change for a
while.

Say a thread runs on a VM, and sends 2 packets P1, P2 on the same flow
(skb_get_hash() returns the same value for these 2 packets)

P1 is sent on behalf of CPU 1, we pick up queue txq1, and queue the
packet on its qdisc. Transmit does not happen because of some
constraint such as rate limiting or scheduling.

P2 is sent on behalf of CPU 2, we pick up queue txq2 and notice that the
prior packet chose txq1. We check txq1->dql and decide it is fine to use
txq2, since the dql params of txq1 were not changed yet.

( txq->dql.num_completed_ops == ent.queue_ptr )

Note that in RFS case, we have the guarantee that we observe 'live
queues' since they are the per cpu backlog.

So input_queue_head_incr() and input_queue_tail_incr_save() are
correctly doing the OOO prevention, because a queued packet immediately
changes the state.

So really your patch works if you have no qdisc, or a non-congested
qdisc. (Think if P1 is dropped by a full pfifo or pfifo_fast: We really
want to avoid steering P2, P3, ..., PN on this full pfifo while maybe
other txq are idle). Strange attractors are back (check commit
9b462d02d6dd6 )

You could avoid (ab)using BQL with a different method, grabbing
skb->destructor for the packets that are socketless.

The hash table would simply track the sum of skb->truesize to allow flow
migration. This would be self-contained and not intrusive.
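
Roughly something like the sketch below (all names are purely
illustrative, not existing kernel API; just to show the idea):

  /* Charge the flow entry when a socketless skb picks a queue, and
   * credit it back from the skb destructor once the skb is freed.
   * The flow may migrate to another txq only while pending_truesize
   * is zero.
   */
  static void xps_flow_skb_destructor(struct sk_buff *skb)
  {
          struct xps_dev_flow *ent = xps_flow_lookup(skb->dev, skb->hash);

          if (ent)
                  atomic_sub(skb->truesize, &ent->pending_truesize);
  }

  static void xps_flow_charge(struct xps_dev_flow *ent, struct sk_buff *skb)
  {
          atomic_add(skb->truesize, &ent->pending_truesize);
          skb->destructor = xps_flow_skb_destructor;
  }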

* Re: [PATCH v2 net-next 4/5] xps_flows: XPS for packets that don't have a socket
  2016-09-29 13:18       ` Eric Dumazet
  2016-09-29 14:08         ` Tom Herbert
@ 2016-09-29 16:35         ` Rick Jones
  1 sibling, 0 replies; 14+ messages in thread
From: Rick Jones @ 2016-09-29 16:35 UTC (permalink / raw)
  To: Eric Dumazet, Tom Herbert
  Cc: David S. Miller, Linux Kernel Network Developers, Kernel Team,
	Alexander Duyck

On 09/29/2016 06:18 AM, Eric Dumazet wrote:
> Well, then what this patch series is solving ?
>
> You have a producer of packets running on 8 vcpus in a VM.
>
> Packets are exiting the VM and need to be queued on a mq NIC in the
> hypervisor.
>
> Flow X can be scheduled on any of these 8 vcpus, so XPS is currently
> selecting different TXQ.

Just for completeness, in my testing, the VMs were single-vCPU.

rick jones

* Re: [PATCH v2 net-next 4/5] xps_flows: XPS for packets that don't have a socket
  2016-09-29 15:15             ` Eric Dumazet
@ 2016-09-29 20:26               ` Tom Herbert
  0 siblings, 0 replies; 14+ messages in thread
From: Tom Herbert @ 2016-09-29 20:26 UTC (permalink / raw)
  To: Eric Dumazet
  Cc: David S. Miller, Linux Kernel Network Developers, Kernel Team,
	Rick Jones, Alexander Duyck

On Thu, Sep 29, 2016 at 11:15 AM, Eric Dumazet <eric.dumazet@gmail.com> wrote:
> On Thu, 2016-09-29 at 07:51 -0700, Eric Dumazet wrote:
>> On Thu, 2016-09-29 at 10:08 -0400, Tom Herbert wrote:
>>
>> > It addresses  the issue that Rick Jones pointed out was happening with
>> > XPS. When packets are sent for a flow that has no socket and XPS is
>> > enabled then each packet uses the XPS queue based on the running CPU.
>> > Since the thread sending on a flow can be rescheduled on different
>> > CPUs this is creating ooo packets. In this case the ooo is being
>> > caused by interaction with XPS.
>> >
>>
>> Nope, your patch does not address the problem properly.
>>
>> I am not sure I want to spend more time explaining the issue.
>>
>> Lets talk about this in Tokyo next week.
>>
>
> Just as a reminder, sorry to bother you, stating some obvious facts for
> both of us. We have public exchanges, so we also need to re-explain how
> things work.
>
> Queue selection on xmit happens before we hit the qdisc and its delays.
>
> So when you access txq->dql.num_completed_ops and
> txq->dql.num_enqueue_ops you can observe values that do not change for a
> while.
>
> Say a thread runs on a VM, and sends 2 packets P1, P2 on the same flow
> (skb_get_hash() returns the same value for these 2 packets)
>
> P1 is sent on behalf of CPU 1, we pickup queue txq1, and queue the
> packet on its qdisc . Transmit does not happen because of some
> constraints like rate limiting or scheduling constraints.
>
> P2 is sent on behalf of CPU 2, we pickup queue txq2, notice that prior
> packet chose txq1. We check txq1->dql and decide it is fine to use txq2,
> since the dql params of txq1 were not changed yet.
>
> ( txq->dql.num_completed_ops == ent.queue_ptr )
>
> Note that in RFS case, we have the guarantee that we observe 'live
> queues' since they are the per cpu backlog.
>
> So input_queue_head_incr() and input_queue_tail_incr_save() are
> correctly doing the OOO prevention, because a queued packet immediately
> changes the state.
>
> So really your patch works if you have no qdisc, or a non congested
> qdisc. (Think if P1 is dropped by a full pfifo or pfifo_fast : We really
> want to avoid steering P2, P3, ..., PN on this full pfifo while maybe
> other txq are idle). Strange attractors are back (check commit
> 9b462d02d6dd6 )
>
Understood.

> You could avoid (ab)using BQL with a different method, grabbing
> skb->destructor for the packets that are socketless
>
> The hash table would simply track the sum of skb->truesize to allow flow
> migration. This would be self contained and not intrusive.
>
Okay, will look at that.

>
>
>
