[RFC] Connection Tracking Offload netdev RFC v1.0, Part 2/2 - TC with Connection Tracking - hardware offloading and Netfilter changes
From: Guy Shattah @ 2019-01-23 11:29 UTC
  To: Guy Shattah, Marcelo Leitner, Aaron Conole, John Hurley,
	Simon Horman, Justin Pettit, Gregory Rose, Eelco Chaudron,
	Flavio Leitner, Florian Westphal, Jiri Pirko, Rashid Khan,
	Sushil Kulkarni, Andy Gospodarek, Roi Dayan, Yossi Kuperman,
	Or Gerlitz, Rony Efraim, davem, Pablo Neira Ayuso
  Cc: netdev


-------------------------------------------------------------------------
Connection Tracking Offload netdev RFC v1.0 Part 2/2  - TC with Connection Tracking - hardware offloading and Netfilter changes
--------------------------------------------------------------------------
This part continues the discussion of hardware offload.
All hardware-specific implementation and driver details were deliberately omitted from this document, since it is meant to apply to all vendors.


TC command-line interface
--------------------------------
When connection tracking hardware offload is in use, TC should reject all skip_software rules. This is done to make
sure software and hardware are always in sync.
It has to be enforced by tc in order to prevent cases where a chain with
connection tracking (a rule which is offloaded) is followed by a
chain (one without a hardware-offloadable action or match) that was not offloaded.

A suggestion (a sketch follows at the end of this section):
* The first time a ct action or ct_state match is used, tc scans
  all existing chains and refuses to insert the rule if at least one skip_software rule is found.
  If everything is OK, it raises a flag marking that all rules from now on must
  exist in both software and hardware.
* Any further rule insertion checks this flag.
* The flag is cleared when all rules in TC are cleared.

I'm certain that we could find more solutions if this one is not sufficient.
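Below is a minimal sketch of what such a guard could look like in the classifier insertion path.
TCA_CLS_FLAGS_SKIP_SW is the existing flag that expresses skip_software; the strict-mode flag and
the tcf_block_has_skip_sw_rules() scan are hypothetical additions, not existing kernel API.

#include <linux/errno.h>
#include <net/pkt_cls.h>

/* Sketch only: once a ct rule exists, reject skip_sw rules, and refuse
 * ct rules while any skip_sw rule is still installed.  The flag and the
 * scanning helper are hypothetical.
 */
static bool tcf_ct_offload_strict;	/* raised by the first ct rule */

static int tcf_ct_offload_check(struct tcf_block *block, u32 flags,
				bool rule_uses_ct)
{
	if (tcf_ct_offload_strict && (flags & TCA_CLS_FLAGS_SKIP_SW))
		return -EOPNOTSUPP;	/* sw and hw must stay in sync */

	if (rule_uses_ct) {
		if (tcf_block_has_skip_sw_rules(block))	/* hypothetical scan */
			return -EOPNOTSUPP;
		tcf_ct_offload_strict = true;	/* cleared when all rules are flushed */
	}
	return 0;
}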


Offloading ct_state matches:
--------------------------------
The current offload interface used by TC is ndo_setup_tc().
This interface is used to send a tc chain to the hardware.

The interface is kept the same for commands containing ct_state.

The only difference is in the underlying driver/hardware implementation.

1. It is the duty of the driver to convert 'matches' such as:
    ct_zone=XX,ct_mark=XX,ct_label=XX,ct_state=±trk,±new,±est,±dnat,±snat,±inv,±rel,±rpl
    properly into the hardware interface (see the sketch after this list).


2. It is the duty of the driver to make sure that in case of ambiguity, where
   the hardware is not capable of making a decision on a packet's path, the
   packet is sent to software for further processing [see section on fallback to software].

3. Please note that hardware may or may not have the following capabilities:
(a) packet check-sum (layer 3/4)
(b) TCP window-validation
(c) flow counters/stats
(d) other?
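
As an illustration of item 1, here is a rough sketch of a driver-side translation routine.
The CT dissector key layout and the hw_set_*() / drv_supported_ct_state_bits() helpers are
hypothetical; they only show where the conversion (item 1) and the capability check (item 3) would live.

#include <linux/types.h>
#include <linux/errno.h>

/* Sketch: hypothetical dissector key carrying CT metadata from flower
 * to the driver, plus the translation the driver is responsible for.
 */
struct flow_dissector_key_ct {
	u16	ct_state;	/* bitmask for ±trk/±new/±est/±rel/±rpl/±inv/±snat/±dnat */
	u16	ct_zone;
	u32	ct_mark;
	u32	ct_labels[4];
};

static int drv_parse_ct_match(struct drv_flow *flow,
			      const struct flow_dissector_key_ct *key,
			      const struct flow_dissector_key_ct *mask)
{
	/* Item 3: bail out (fall back to software) on unsupported bits. */
	if (mask->ct_state & ~drv_supported_ct_state_bits(flow))
		return -EOPNOTSUPP;

	hw_set_ct_state(flow, key->ct_state, mask->ct_state);
	hw_set_ct_zone(flow, key->ct_zone, mask->ct_zone);
	hw_set_ct_mark(flow, key->ct_mark, mask->ct_mark);
	hw_set_ct_labels(flow, key->ct_labels, mask->ct_labels);
	return 0;
}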

   
Offloading ct_action / changes in the tc data-path:
--------------------------------------------------

In a pure-software tc connection tracking implementation, the packet is sent to connection tracking,
the result is stored in skb->_nfct, and further matches on this struct are used to make forwarding decisions.
With offload support there is an additional test in tc:
if the packet is part of an established connection, a decision is made to offload the connection to hardware.
 
To offload the connection, the 'flow offload' infrastructure developed by Pablo is used.
(This infrastructure is enhanced to support connection tracking as described below [see section on Netfilter changes].)

The packet is sent to nft_flow_offload_eval() along with additional data for offload evaluation.

Such additional data includes:
* The conntrack entry (skb->_nfct, i.e. struct nf_conn):
  * current flags such as ±rpl/±rel, which the driver can use to set ct_state on the
    established connection.
  * TCP window information.
  * Any other additional information.
* The originating tc chain id, from which the offload took place.
* The flow (connection) id.
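
A possible grouping of that additional data, purely as a sketch; the struct name and exact fields
are hypothetical and only mirror the list above.

#include <linux/skbuff.h>
#include <linux/netfilter/nf_conntrack_tcp.h>
#include <net/netfilter/nf_conntrack.h>

/* Sketch: hypothetical container for the extra data tc hands to
 * netfilter when asking it to evaluate a connection for offload.
 */
struct tc_ct_offload_req {
	struct sk_buff		*skb;
	struct nf_conn		*ct;		/* from skb->_nfct: flags such as ±rpl/±rel */
	const struct ip_ct_tcp	*tcp_state;	/* TCP window information, NULL for UDP/ICMP */
	u32			chain_id;	/* originating tc chain the offload came from */
	u32			flow_id;	/* flow (connection) id */
};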


Possible synchronization issues:
---------------------------------

When a connection is offloaded there is a short latency until the rule is active in hardware.
This means that packets may still arrive at TC after the connection has been offloaded.
While this doesn't introduce any issue for UDP and ICMP, it might be an issue
for TCP because of TCP window tracking. In this case I'd suggest offloading the connection with a
large TCP window and letting the hardware (where window validation is supported) rescale the window back.
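
One way to express the 'offload with a relaxed window' suggestion in software, loosely modeled on the
existing flow_offload_fixup_tcp() helper in nf_flow_table_core.c; this is a sketch of the idea, not the
final mechanism.

#include <linux/netfilter/nf_conntrack_tcp.h>
#include <net/netfilter/nf_conntrack.h>

/* Sketch: before handing an established TCP connection to hardware,
 * make software window tracking liberal so packets that race the
 * hardware rule installation are not dropped as invalid.
 */
static void ct_offload_relax_tcp_window(struct nf_conn *ct)
{
	struct ip_ct_tcp *tcp = &ct->proto.tcp;

	spin_lock_bh(&ct->lock);
	tcp->seen[0].flags |= IP_CT_TCP_FLAG_BE_LIBERAL;
	tcp->seen[1].flags |= IP_CT_TCP_FLAG_BE_LIBERAL;
	spin_unlock_bh(&ct->lock);
}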

Netfilter changes
---------------------
Once a packet has passed through a chain which includes a CT action and the result is an 'established' state,
TC sends the packet to Netfilter for offload evaluation [nft_flow_offload_eval()], and the
flow should then be offloaded.

Changes:

1. In his patches, Pablo added a new ndo for hardware offload:
   https://github.com/orgcandman/linux-next-work/commit/351bd1303c2ae5de0c4512f0cdf7b61508df0777#diff-cf45716f2ff8e67a1727be77c3b894feR1392

   However, during discussions at the Netfilter workshop (May 2018), Pablo and Jiri Pirko agreed that both NDOs basically
   do the same thing - hardware offload. Hence it would be wiser to call ndo_setup_tc from Netfilter to do the
   offload, rather than adding a whole new NDO (a sketch of this reuse follows below).
   Another suggestion was to rename ndo_setup_tc in the future, to reflect that it is not only used from TC.

   Please read the following paper from Netdev, which discusses this solution in depth:
   https://www.files.netdevconf.org/d/c5b1c9709ca743d79a63/
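
   A sketch of what reusing ndo_setup_tc from Netfilter could look like. TC_SETUP_FLOW_TABLE stands
   for a new value that would be added to enum tc_setup_type, and the command struct is a hypothetical
   placeholder for the real driver interface.

#include <linux/netdevice.h>
#include <net/netfilter/nf_flow_table.h>

/* Sketch: netfilter reusing the existing ndo_setup_tc() entry point
 * instead of adding a new NDO.  TC_SETUP_FLOW_TABLE would be a new
 * enum tc_setup_type value; the command struct is hypothetical.
 */
struct flow_offload_hw_cmd {
	struct flow_offload	*flow;
	bool			add;	/* add or remove the flow */
};

static int nf_flow_offload_hw(struct net_device *dev,
			      struct flow_offload *flow, bool add)
{
	struct flow_offload_hw_cmd cmd = { .flow = flow, .add = add };

	if (!dev->netdev_ops->ndo_setup_tc)
		return -EOPNOTSUPP;

	return dev->netdev_ops->ndo_setup_tc(dev, TC_SETUP_FLOW_TABLE, &cmd);
}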
   
2. In case of tc NAT, the NAT header rewrite has to be part of the
   connection tracking offload. Hence TC will supply the Netfilter offload
   code not only with the chain ID from which the offload took place, but
   also with information on how to do the NAT translation.
   
3. Netfilter is to maintain two separate lists of offloaded connections.
   Each list of offloaded flows is kept inside struct nf_flowtable *flowtable
   (currently there is only one list). A sketch follows item 3.2.
   
3.1 The first list is a list of 'soft' offloaded connections.
    For this list Netfilter is responsible for flow aging and flow termination.
    For flow aging it iterates over all the flows.
    For flow termination it relies on FIN packets arriving from TC.
   
3.2 The second list is a list of 'hardware' offloaded connections.
    This is the software representation of the physical list of offloaded connections kept in hardware.
    Flows in this list are aged and terminated by the hardware/driver via events/callbacks.
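
A sketch of the split described in 3, 3.1 and 3.2. The added fields are hypothetical extensions;
today struct nf_flowtable keeps a single set of flows.

#include <linux/list.h>
#include <linux/workqueue.h>
#include <net/netfilter/nf_flow_table.h>

/* Sketch: hypothetical split of the flowtable into a software-managed
 * set and a hardware-managed set, plus the pending list from item 4.
 */
struct nf_flowtable_ct_offload {
	struct nf_flowtable	base;		/* existing flowtable state */
	struct list_head	sw_flows;	/* 3.1: aged/terminated by netfilter */
	struct list_head	hw_flows;	/* 3.2: aged/terminated via driver events */
	struct list_head	pending;	/* item 4: waiting for free hw capacity */
	struct delayed_work	gc_work;	/* periodic sampled scan, see item 5 */
	unsigned int		scan_batch;	/* flows checked per scan pass */
};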

4. On reaching full hardware capacity:
   In case the hardware capacity is reached, Netfilter should mark the flow
   (a new addition inside struct nf_conn or the flow offload entry)
   and put it on a 'pending list' (a sketch follows this item).
   Once the 'offloaded flows list' has new empty slots (released by events),
   it should try to offload again.

   Another possibility (maybe in phase two?) is to have some kind of smarter LRU algorithm,
   where some of the connections are offloaded and some are not, and an algorithm
   decides and switches between hardware and software based on traffic (stats/counters).
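
   A sketch of the retry-from-pending-list behaviour. nf_flow_offload_to_hw() and the per-flow
   ct_offload_node list member are hypothetical additions on top of the flowtable sketch above.

/* Sketch: when the driver reports freed hardware slots, retry flows
 * that were parked on the pending list.
 */
static void nf_flow_retry_pending(struct nf_flowtable_ct_offload *ft,
				  unsigned int free_slots)
{
	struct flow_offload *flow, *next;

	list_for_each_entry_safe(flow, next, &ft->pending, ct_offload_node) {
		if (!free_slots)
			break;
		if (nf_flow_offload_to_hw(ft, flow) == 0) {	/* hypothetical */
			list_move_tail(&flow->ct_offload_node, &ft->hw_flows);
			free_slots--;
		}
	}
}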

5. Expiration mechanism [in software]:
   Current expiration in connection tracking:
   Lazy expiration: old connections are removed when new connections arrive.
   Relatively long timeouts: over 5 days for an established TCP connection, about 180 secs for UDP.

   Expiration in Netfilter flow offload:
   Every once in a while, scan all existing offloaded flows and query the driver to see whether a flow has expired
   [suggested queries to the driver, from code that was not submitted upstream]:

                if (nf_flow_has_expired(flow) ||
                    nf_flow_is_dying(flow)) {

        Suggested expiration algorithm [to be implemented in the Netfilter flow-offload code or in the driver]:
        Instead of scanning millions of items every second, scan a small number of items per second,
        covering all of them twice within the timeout period:
        with N offloaded connections, every K seconds sample N/2K of the flows in the system.
        K = 180 secs for UDP/ICMP
        K = 3600 secs for TCP
        The final expiration value is either configurable or constant.
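
        A sketch of the sampled scan, as a periodic work item on the flowtable sketched earlier.
        The ct_offload_node member, the scan_batch sizing and the nf_flow_hw_expired() driver query
        are assumptions; flow_offload_teardown() is the existing helper that pushes a flow back to
        the slow path.

#include <linux/workqueue.h>

/* Sketch: visit at most scan_batch flows per pass (sized so every flow
 * is seen roughly twice per timeout period) instead of walking the
 * whole table every second.
 */
static void nf_flow_hw_gc_work(struct work_struct *work)
{
	struct nf_flowtable_ct_offload *ft =
		container_of(work, struct nf_flowtable_ct_offload, gc_work.work);
	struct flow_offload *flow, *next;
	unsigned int budget = ft->scan_batch;

	list_for_each_entry_safe(flow, next, &ft->hw_flows, ct_offload_node) {
		if (!budget--)
			break;
		if (nf_flow_hw_expired(ft, flow))	/* hypothetical driver query */
			flow_offload_teardown(flow);
		else	/* checked: move to the back so the next pass sees fresh flows */
			list_move_tail(&flow->ct_offload_node, &ft->hw_flows);
	}

	queue_delayed_work(system_power_efficient_wq, &ft->gc_work, HZ);
}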
		
6. Counters - Netfilter has to be enhanced to pull counters (per flow) from hardware
   and update the stats per connection/flow.
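
   A sketch of folding per-flow hardware counters back into the per-connection accounting. The stat
   deltas are assumed to come from a driver query; nf_conn_acct_find() and the counter layout are the
   existing conntrack accounting API.

#include <net/netfilter/nf_conntrack.h>
#include <net/netfilter/nf_conntrack_acct.h>

/* Sketch: add the packet/byte deltas reported by hardware for one flow
 * to the existing conntrack accounting counters.
 */
static void nf_flow_hw_update_stats(struct nf_conn *ct,
				    u64 pkts_orig, u64 bytes_orig,
				    u64 pkts_reply, u64 bytes_reply)
{
	struct nf_conn_acct *acct = nf_conn_acct_find(ct);

	if (!acct)
		return;

	atomic64_add(pkts_orig, &acct->counter[IP_CT_DIR_ORIGINAL].packets);
	atomic64_add(bytes_orig, &acct->counter[IP_CT_DIR_ORIGINAL].bytes);
	atomic64_add(pkts_reply, &acct->counter[IP_CT_DIR_REPLY].packets);
	atomic64_add(bytes_reply, &acct->counter[IP_CT_DIR_REPLY].bytes);
}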
   
7. Megaflows - Netfilter has to be enhanced with an API to support deletion of
   multiple flows according to the OVS recirculation id / tc chain id
   (a sketch follows item 8). As discussed, this will be either a blocking call or asynchronous.

8. Fallback - Netfilter has to be enhanced with an API to support the case where
   a flow has to be moved from hardware back to software and then either:
   (1) never be moved to hardware again, OR (2) attempt to move back to hardware where possible.
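
A sketch of the bulk deletion API from item 7, keyed on the tc chain id / OVS recirculation id.
The chain id lookup and the ct_offload_node list member are hypothetical; flow_offload_teardown()
is existing.

/* Sketch: tear down every hardware-offloaded flow that originated from
 * a given tc chain / recirculation id.
 */
static void nf_flow_table_flush_chain(struct nf_flowtable_ct_offload *ft,
				      u32 chain_id)
{
	struct flow_offload *flow, *next;

	list_for_each_entry_safe(flow, next, &ft->hw_flows, ct_offload_node) {
		if (nf_flow_chain_id(flow) == chain_id)	/* hypothetical lookup */
			flow_offload_teardown(flow);
	}
}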



Handling OVS rule eviction:
------------------------------
The way OVS offload is done today is that OVS sends a copy of the OVS kernel data-path netlink
messages to TC. OVS then samples counters/stats to see whether the flow/connection is still valid
and has not aged out. OVS may decide to evict a rule due to aging or because the user manually removed
the OpenFlow rule.
When a rule containing a ct action is removed, a netlink command is sent to TC.
TC should then send a command to the Netfilter offload code (similar to the event mentioned before) to
empty the corresponding offloaded connection list.

Regarding counters/stats: for a single chain with a ct action we need to
pull counters/stats from all of its offloaded connections and report an accumulated counter.


Fallback to software
-------------------------
While hardware is processing a packet, it may reach a state/flow table which does not contain
sufficient information on how to continue processing.
Such cases may happen when the originating flow table sends a packet to a non-existing
target flow table, when hardware capabilities are not sufficient to make a decision,
when there is ambiguity, or when a packet has failed all tests and is sent to software
for further analysis, and possibly other cases.

We should note that this may happen at any stage of the processing:
it may happen after the packet has been decapsulated or after encapsulation,
after a header rewrite (such as NAT), or after some metadata was placed on the packet
or the connection (ct_label data, for example).

We define two acceptable behaviors:

1. Fallback from the middle of processing to a tc chain which is not the entry one (i.e. not chain zero).
   The mechanism should be able to deliver the packet with all its metadata
   to the upper layer, together with the exact point at which processing stopped, so processing can continue from there.

2. Fallback from the middle of processing to the start/entry point of processing.
   This is a special case of #1, where the last flow table is zero.
   In this case a replica of the original, unmodified packet has to be sent to the start/entry point of the upper level.
   

Any other behavior is not acceptable, since software cannot handle cases
where the packet header has been modified without knowing where the processing stopped.

In addition to the two described behaviors, we also allow a fallback into the middle of
a tc chain.

Why is it required?

In the case of a tc rule that does decapsulation/header rewrite and then connection tracking,
it is possible that a packet will go through the decapsulation/header rewrite and then fail
connection tracking in hardware (because the connection has not been offloaded yet).
In this case the software (i.e. tc) would receive a partially processed packet which no longer
matches the tc chain accurately. Hence there is a need to enter the packet
into the middle of a chain, after the matches and all the actions have been applied, except for connection tracking;
i.e. the packet is sent to the equivalent point in the tc chain to do connection tracking there.


Hence, there is support for fallback into (1) the beginning of a tc chain and (2) the middle of it, right before CT.

----
Now let's talk about the fallback paths:

1. Fallback from hardware/driver to TC
	The driver should be able to restore the information (ct_zone/mark/label/last flow table)
	onto the packet.
	We suggest that ct_zone/ct_mark/ct_label/ct_state are restored into struct nf_conn (skb->_nfct)
	and the last flow table is restored into skb->cb (we have space there we could use).

	On receiving a packet with skb->cb set to a certain value, tc will start processing
	the packet from the chain specified in skb->cb (a sketch follows at the end of this section).
	
	
2. Fallback from TC to OVS.
	There is a need to add new logic to OVS which supports handling a packet
	from a specific recirculation id/chain/flow table.
	Currently the OVS kernel module clones the skb, and if processing fails (in the kernel)
	the original packet is sent to OVS userspace.
	This cannot work for our case, since TC doesn't keep the original copy of the
	packet (before header rewrite/decapsulation/etc.) and, understandably, the packet
	cannot be restored to its original form.

	We suggest doing the TC->OVS path in the same manner as the driver->TC path:
	the packet is reassembled and restored by the driver and injected into TC
	(along with metadata in skb->cb). If TC lacks the information on how to
	proceed, the packet is left unchanged and its next stop is OVS, which has to be
	extended with new logic on how to proceed.
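
A sketch of the driver-to-TC restore path from item 1 above. The control-block layout and
tc_resume_from_chain() are hypothetical; nf_ct_set() is the existing helper that attaches conntrack
state (and with it ct_zone/mark/label) to an skb.

#include <linux/skbuff.h>
#include <net/netfilter/nf_conntrack.h>

/* Sketch: metadata the driver restores before reinjecting a packet
 * that fell back from hardware processing.
 */
struct tc_fallback_cb {
	u32	chain_id;	/* last flow table / tc chain reached in hardware */
};

static void drv_fallback_to_tc(struct sk_buff *skb, struct nf_conn *ct,
			       enum ip_conntrack_info ctinfo, u32 chain_id)
{
	struct tc_fallback_cb *cb = (struct tc_fallback_cb *)skb->cb;

	nf_ct_set(skb, ct, ctinfo);	/* restore ct state/zone/mark/label */
	cb->chain_id = chain_id;	/* tc resumes from this chain */

	tc_resume_from_chain(skb, chain_id);	/* hypothetical reinjection point */
}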

