From: Jiayu Hu <jiayu.hu@intel.com>
Subject: [RFC] Add GRO support in DPDK
Date: Mon, 23 Jan 2017 21:03:12 +0800
Message-ID: <1485176592-111525-1-git-send-email-jiayu.hu@intel.com>
To: dev@dpdk.org
Cc: keith.wiles@intel.com, ray.kinsella@intel.com, konstantin.ananyev@intel.com, walter.e.gilmore@intel.com, venky.venkatesan@intel.com, yuanhan.liu@linux.intel.com
List-Id: DPDK patches and discussions

With hardware segmentation support in DPDK, the send-side networking stack overhead of applications that directly leverage DPDK has been greatly reduced. On the receive side, however, large numbers of segmented packets still burden the application's networking stack. Generic Receive Offload (GRO) is a widely used method to solve this receive-side issue: it gains performance by reducing the number of packets the networking stack has to process. Currently, DPDK does not support GRO. Therefore, we propose to add GRO support to DPDK, and this RFC explains the basic DPDK GRO design.

DPDK GRO is a software-based packet assembly library that provides GRO capabilities for a number of protocols. In DPDK GRO, packets are merged after being received from drivers and before being returned to applications.

In DPDK, GRO is a capability of NIC drivers. Whether GRO is supported, and which GRO types are supported, is up to each NIC driver; different drivers may support different GRO types. By default, drivers enable all supported GRO types. Applications can query the GRO types supported by each driver and control which GRO types are applied. For example, suppose ixgbe supports TCP and UDP GRO, but the application only needs TCP GRO.
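As a rough sketch, such per-type control could look like the following. All names here are illustrative assumptions, not existing DPDK APIs; the supported types are simply modeled as a bitmask that the application masks.

```c
#include <stdint.h>

/* Hypothetical GRO type flags -- illustrative only, not part of DPDK. */
#define GRO_TCP_IPV4 (1u << 0)
#define GRO_UDP_IPV4 (1u << 1)

/* A driver reports the types it supports; the application clears the
 * bits of the types it does not want applied. */
static uint32_t
gro_disable_types(uint32_t supported, uint32_t to_disable)
{
    return supported & ~to_disable;
}

/* Re-enabling is the reverse: set the bits again, but never beyond
 * what the driver actually supports. */
static uint32_t
gro_enable_types(uint32_t enabled, uint32_t to_enable, uint32_t supported)
{
    return enabled | (to_enable & supported);
}
```

A bitmask keeps the query and control operations cheap and lets one value describe both "supported" and "currently enabled" sets.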
The application can then disable ixgbe UDP GRO.

To support GRO, a driver should provide a way to tell applications which GRO types it supports, and provide a GRO function that is in charge of assembling packets. Since different drivers may support different GRO types, their GRO functions may differ. Applications need no extra operations to enable GRO; but if some GRO types are not needed, applications can use an API, like rte_eth_gro_disable_protocols, to disable them, and they can later re-enable the disabled ones.

The GRO function processes a number of packets at a time. In each invocation, which GRO types are applied is determined by the application, and the number of packets to merge depends on both the application and the networking status. Specifically, the application determines the maximum number of packets for the GRO function to process, but how many are actually processed depends on whether packets are available to receive. For example, if the receive-side application asks the GRO function to process 64 packets but the sender only sends 40, the GRO function returns after processing 40 packets.

To reassemble the given packets, the GRO function performs an "assembly procedure" on each packet. We use an example to demonstrate this procedure. Supposing the GRO function is going to process packetX, it does the following two things:

a. Find an L4 assembly function according to the packet type of packetX. An L4 assembly function is in charge of merging packets of a specific type; for example, the TCPv4 assembly function merges packets whose L3 is IPv4 and whose L4 is TCP. Each L4 assembly function has a packet array, which keeps the packets that could not be assembled. Initially, the packet array is empty;

b. The L4 assembly function traverses its own packet array to find a mergeable packet (comparing Ethernet, IP and L4 header fields).
If one is found, it merges it with packetX by chaining them together; if not, it allocates a new array element to store packetX and updates the element count of the array.

After performing the assembly procedure on all packets, the GRO function combines the results of all packet arrays and returns these packets to the application.

There are many ways to implement the above design in DPDK. One of them is:

a. Drivers tell applications which GRO types are supported via dev->dev_ops->dev_infos_get;

b. At initialization, drivers register their own GRO function as an RX callback, which is invoked inside rte_eth_rx_burst. The name of the GRO function should be like xxx_gro_receive (e.g. ixgbe_gro_receive). Currently, the RX callback can only process the packets returned by one call to dev->rx_pkt_burst, and the maximum number of packets dev->rx_pkt_burst returns is determined by each driver and cannot be influenced by applications. Therefore, to implement the above GRO design, we have to modify the current RX implementation so that the driver returns as many packets as possible, until the packet number meets the application's demand or there are no more packets available to receive. This modification is also proposed in patch: http://dpdk.org/ml/archives/dev/2017-January/055887.html;

c. The GRO types to apply and the maximum number of packets to merge are passed by resetting the RX callback parameters, which can be achieved by invoking rte_eth_rx_callback;

d. Simply, we could just store packet addresses in the packet array; to check one element, we would fetch the packet via its address. However, this simple design is not efficient enough, since checking a packet then costs a pointer dereference, and a pointer dereference is likely to cause a cache line miss. A better way is to store some rules in each array element. The rules are the prerequisites for merging two packets, like the sequence number of TCP packets. We first compare the rules, and retrieve the packet only if the rules match.
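The rule-first lookup described in (d) might be sketched like this; the element layout and field names are assumptions for illustration, not actual DPDK structures:

```c
#include <stdbool.h>
#include <stdint.h>

/* Illustrative packet array element: the merge prerequisites ("rules")
 * are copied inline, so most comparisons never touch the stored packet. */
struct gro_item {
    uint32_t src_ip, dst_ip;      /* flow rules taken from the packet  */
    uint16_t src_port, dst_port;
    uint32_t next_seq;            /* expected next TCP sequence number */
    void    *pkt;                 /* dereferenced only on a rule match */
};

/* Return true when the packet described by the arguments satisfies the
 * merge prerequisites stored in this element. */
static bool
gro_rules_match(const struct gro_item *item,
                uint32_t src_ip, uint32_t dst_ip,
                uint16_t src_port, uint16_t dst_port, uint32_t seq)
{
    return item->src_ip == src_ip && item->dst_ip == dst_ip &&
           item->src_port == src_port && item->dst_port == dst_port &&
           item->next_seq == seq;   /* in-order segment only */
}
```

Only when gro_rules_match() succeeds would the implementation fetch item->pkt and chain the new packet onto it, avoiding a pointer dereference per array element scanned.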
If storing the rules makes the packet array structure cache-unfriendly, we can store a fixed-length signature of the rules instead. For example, the signature can be calculated by performing an XOR operation on the IP addresses. Both designs avoid unnecessary pointer dereferences.
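A minimal sketch of the signature variant, assuming IPv4 addresses held as 32-bit integers (the exact rule set folded into the signature is up to the implementation):

```c
#include <stdint.h>

/* Fixed-length signature over the merge rules: here simply the XOR of
 * the two IPv4 addresses, as in the example above. Equal signatures are
 * only a hint -- on a signature hit, the full rules (and ultimately the
 * stored packet) must still be compared before merging. */
static uint32_t
gro_signature(uint32_t src_ip, uint32_t dst_ip)
{
    return src_ip ^ dst_ip;
}
```

The signature keeps each array element small and fixed-length at the cost of occasional false positives, which the subsequent full-rule comparison filters out.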