* [Qemu-devel] [ RFC Patch v6 0/2] Support Receive-Segment-Offload(RSC) for WHQL
@ 2016-05-28 16:37 wexu
  2016-05-28 16:37 ` [Qemu-devel] [ RFC Patch v6 1/3] virtio-net rsc: support coalescing ipv4 tcp traffic wexu
                   ` (3 more replies)
  0 siblings, 4 replies; 12+ messages in thread
From: wexu @ 2016-05-28 16:37 UTC (permalink / raw)
  To: qemu-devel; +Cc: jasowang, marcel, mst, victork, dfleytma, yvugenfi, Wei Xu

From: Wei Xu <wexu@redhat.com>

Changes in V6:
- Sync upstream code
- Split the new fields in 'virtio_net_hdr' into a separate patch
- Remove the feature bit code, replacing it with a command line parameter 'guest_rsc'
which is turned off by default.

Changes in V5:
- Passed all IPv4/6 test cases
- Add new fields in 'virtio_net_hdr'
- Set 'gso_type' & 'coalesced packets' in the new fields.
- Bypass all 'tcp option' packets
- Bypass all 'pure ack' packets
- Bypass all 'duplicate ack' packets
- Change the 'guest_rsc' feature bit to 'false' by default
- Feedback from v4: typos, etc.

Note:
There are still a few pending issues around the feature bit that need to be
discussed with the Windows driver maintainers, so Linux guests do not work
with this patch at the moment. I haven't figured it out yet, but my guess is
that it is caused by setting 'gso_type' to 'VIRTIO_NET_HDR_GSO_TCPV4/6';
I will fix it once the final solution is agreed on. The test steps and
performance data below are based on v4.

Another suggestion from Jason is to adjust part of the code to make it more
readable. Since there may still be a few changes to the flowchart in the
future, such as timestamp and duplicate ack handling, I'd like to postpone
that for now.

Changes in V4:
- Add new host feature bit
- Replace the fixed header length with the dynamic header length in VirtIONet
- Change ip/ip6 header union in NetRscUnit to void* pointer
- Add macro prefix, adjust code indent, etc.

Changes in V3:
- Removed the big parameter list, replacing it with 'NetRscUnit'
- Different virtio header size
- Changed the callback function to a direct call
- No need to check for g_malloc() failure (it aborts on failure)
- Other code format adjustments, macro naming, etc.

Changes in V2:
- Add detailed commit log

This patch series adds support for the WHQL test for Windows guests; the
feature also benefits other guests, since it works like the kernel's 'gro',
but with a userspace implementation.
Feature information:
  http://msdn.microsoft.com/en-us/library/windows/hardware/jj853324

Both IPv4 and IPv6 are supported. Although userspace virtio is slower than
vhost-net, this feature gives userspace virtio roughly a 1.5x to 2x
throughput improvement; this is achieved by turning the feature on and
disabling 'tso/gso/gro' on the corresponding tap interface and on the guest
interface, while the improvement is smaller with all of those features on.

Linux guest performance data(Netperf):
MIGRATED TCP STREAM TEST from 0.0.0.0 (0.0.0.0) port 0 AF_INET to 192.168.2.101 () port 0 AF_INET : nodelay
Size   Size    Size     Time     Throughput  
bytes  bytes   bytes    secs.    10^6bits/sec  

 87380  16384     64    6.00     1221.20   
 87380  16384     64    6.00     1260.30   

 87380  16384    128    6.00     1978.51   
 87380  16384    128    6.00     2286.05   

 87380  16384    256    6.00     2677.94   
 87380  16384    256    6.00     4615.42   

 87380  16384    512    6.00     2956.54   
 87380  16384    512    6.00     5356.39   

 87380  16384   1024    6.00     2798.17   
 87380  16384   1024    6.00     4943.30   

 87380  16384   2048    6.00     2681.09   
 87380  16384   2048    6.00     4835.81   

 87380  16384   4096    6.00     3390.14   
 87380  16384   4096    6.00     5391.54   

 87380  16384   8092    6.00     3008.27   
 87380  16384   8092    6.00     5381.68   

 87380  16384  10240    6.00     2999.89   
 87380  16384  10240    6.00     5393.11 

Test steps:
Although this feature is mainly intended for Windows guests, I used Linux
guests to help test it. To keep things simple, I tested the patch in 3 steps
as the work progressed.

1. A TCP socket client/server pair running on 2 Linux guests, so that I can
control the traffic and debug the code as needed (a sketch of such a sender
follows this list).
2. Netperf between Linux guests to measure the throughput.
3. WHQL test with 2 Windows guests.
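
For reference, a minimal sketch of the kind of pacing sender used in step 1
(purely illustrative and not part of this series; any simple client/server
pair will do):

/* Hypothetical step-1 traffic generator: connect to a receiver in the other
 * guest and write fixed-size segments at a fixed pace, so the coalescing
 * path can be exercised and debugged deterministically.
 * Build: gcc -o tcpgen tcpgen.c
 * Run:   ./tcpgen <dst-ip> <port> <seg-size> <usec-gap>
 */
#include <arpa/inet.h>
#include <netinet/in.h>
#include <stdio.h>
#include <stdlib.h>
#include <sys/socket.h>
#include <unistd.h>

int main(int argc, char **argv)
{
    if (argc != 5) {
        fprintf(stderr, "usage: %s <dst-ip> <port> <seg-size> <usec-gap>\n",
                argv[0]);
        return 1;
    }

    int seg = atoi(argv[3]);
    int gap = atoi(argv[4]);
    char *buf = calloc(1, seg);

    int fd = socket(AF_INET, SOCK_STREAM, 0);
    struct sockaddr_in dst = {
        .sin_family = AF_INET,
        .sin_port = htons(atoi(argv[2])),
    };
    inet_pton(AF_INET, argv[1], &dst.sin_addr);
    if (fd < 0 || connect(fd, (struct sockaddr *)&dst, sizeof(dst)) < 0) {
        perror("connect");
        return 1;
    }

    for (;;) {
        if (write(fd, buf, seg) != seg) {
            perror("write");
            break;
        }
        usleep(gap);    /* pacing: segments leave roughly 'gap' us apart */
    }
    close(fd);
    free(buf);
    return 0;
}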

Wei Xu (3):
  virtio-net rsc: support coalescing ipv4 tcp traffic
  virtio-net rsc: support coalescing ipv6 tcp traffic
  virtio-net rsc: add 2 new rsc information fields to 'virtio_net_hdr'

 hw/net/virtio-net.c                         | 642 +++++++++++++++++++++++++++-
 include/hw/virtio/virtio-net.h              |   2 +
 include/hw/virtio/virtio.h                  |  75 ++++
 include/standard-headers/linux/virtio_net.h |   3 +
 4 files changed, 721 insertions(+), 1 deletion(-)

-- 
2.7.1

^ permalink raw reply	[flat|nested] 12+ messages in thread

* [Qemu-devel] [ RFC Patch v6 1/3] virtio-net rsc: support coalescing ipv4 tcp traffic
  2016-05-28 16:37 [Qemu-devel] [ RFC Patch v6 0/2] Support Receive-Segment-Offload(RSC) for WHQL wexu
@ 2016-05-28 16:37 ` wexu
  2016-05-30  4:20   ` Jason Wang
  2016-05-28 16:37 ` [Qemu-devel] [ RFC Patch v6 2/3] virtio-net rsc: support coalescing ipv6 " wexu
                   ` (2 subsequent siblings)
  3 siblings, 1 reply; 12+ messages in thread
From: wexu @ 2016-05-28 16:37 UTC (permalink / raw)
  To: qemu-devel; +Cc: jasowang, marcel, mst, victork, dfleytma, yvugenfi, Wei Xu

From: Wei Xu <wexu@redhat.com>

All the data packets in a TCP connection are cached in a big buffer during
each receive interval and are sent out via a timer. 'virtio_net_rsc_timeout'
controls the interval, and its value strongly influences the performance and
responsiveness of the TCP connection: 50000 (50us) is an empirical value
that gains a performance improvement, while '300000' (300us) can pass the
WHQL test case, since the WHQL test sends packets every 100us; this is also
the default value. It is tunable via the command line parameter
'rsc_interval' of the 'virtio-net-pci' device; for example, the parameter
below launches a guest with the interval set to '500000':

'virtio-net-pci,netdev=hostnet1,bus=pci.0,id=net1,mac=00,rsc_interval=500000'

The timer is only armed when the packet pool is not empty, and when it
fires it drains all the cached packets.

'NetRscChain' is used to save the segments of different protocols in a
VirtIONet device.

The main TCP handler covers TCP window updates, duplicated ACK checks and
the actual data coalescing, which happens once the new segment has passed
the sanity check and is identified as a 'wanted' one.

A 'wanted' segment means (see the sketch after this list):
1. The segment is within the current window and its sequence is the expected one.
2. The 'ACK' of the segment is in the valid window.
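
A compact illustration of the two criteria above (illustrative only; the
real checks are split across virtio_net_rsc_coalesce_data() and
virtio_net_rsc_handle_ack() in this patch):

#include <stdbool.h>
#include <stdint.h>

/* Unsigned subtraction handles sequence number wrap-around; 65535 is the
 * coalescing window limit used by this patch. */
bool rsc_segment_is_wanted(uint32_t oseq, uint32_t nseq,
                           uint32_t oack, uint32_t nack,
                           uint16_t cached_payload)
{
    /* 1. the new sequence starts exactly at the end of the cached data */
    if (nseq - oseq != cached_payload) {
        return false;
    }
    /* 2. the new ack does not jump outside the coalescing window */
    if (nack - oack >= 65535) {
        return false;
    }
    return true;
}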

The sanity check covers:
1. An incorrect version in the IP header
2. IP options & IP fragments
3. Packets that are not TCP
4. A size sanity check to prevent buffer overflow attacks.

More cases may need to be considered, such as the IP identification and
other flags, but that broke the test because Windows sets the identification
to the same value even when the packet is not a fragment.

There are 2 typical ways to handle a TCP control flag: 'bypass' and
'finalize'. 'Bypass' means the packet should be sent out directly;
'finalize' means the packet should also be sent out, but only after
searching the pool for packets of the same connection and sending all of
them out first, to avoid out-of-order data.

All 'SYN' packets are bypassed since a 'SYN' always begins a new connection.
Other flags such as 'FIN/RST' trigger a finalization, because they normally
appear when a connection is about to be closed; an 'URG' packet also
finalizes the current coalescing unit (see the sketch below).
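
A sketch of that policy (illustrative; the actual dispatch is
virtio_net_rsc_tcp_ctrl_check() in this patch):

#include <stdint.h>

/* Local copies of the relevant TCP flag bits, just for this sketch. */
#define EX_TH_FIN  0x01
#define EX_TH_SYN  0x02
#define EX_TH_RST  0x04
#define EX_TH_URG  0x20

enum { EX_BYPASS, EX_FINAL, EX_WANT };

int rsc_flag_policy(uint16_t tcp_flags)
{
    if (tcp_flags & EX_TH_SYN) {
        return EX_BYPASS;       /* new connection: send out directly */
    }
    if (tcp_flags & (EX_TH_FIN | EX_TH_RST | EX_TH_URG)) {
        return EX_FINAL;        /* drain the flow first, then send */
    }
    return EX_WANT;             /* candidate for coalescing */
}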

Statistics can be used to monitor the basic coalescing status; the 'out of
order' and 'out of window' counters show how many packets were retransmitted,
and thus describe the performance intuitively.

Signed-off-by: Wei Xu <wexu@redhat.com>
---
 hw/net/virtio-net.c                         | 498 +++++++++++++++++++++++++++-
 include/hw/virtio/virtio-net.h              |   2 +
 include/hw/virtio/virtio.h                  |  75 +++++
 include/standard-headers/linux/virtio_net.h |   1 +
 4 files changed, 575 insertions(+), 1 deletion(-)

diff --git a/hw/net/virtio-net.c b/hw/net/virtio-net.c
index 5798f87..b3bb63b 100644
--- a/hw/net/virtio-net.c
+++ b/hw/net/virtio-net.c
@@ -15,10 +15,12 @@
 #include "qemu/iov.h"
 #include "hw/virtio/virtio.h"
 #include "net/net.h"
+#include "net/eth.h"
 #include "net/checksum.h"
 #include "net/tap.h"
 #include "qemu/error-report.h"
 #include "qemu/timer.h"
+#include "qemu/sockets.h"
 #include "hw/virtio/virtio-net.h"
 #include "net/vhost_net.h"
 #include "hw/virtio/virtio-bus.h"
@@ -38,6 +40,25 @@
 #define endof(container, field) \
     (offsetof(container, field) + sizeof(((container *)0)->field))
 
+#define VIRTIO_NET_IP4_ADDR_SIZE   8        /* ipv4 saddr + daddr */
+#define VIRTIO_NET_TCP_PORT_SIZE   4        /* sport + dport */
+
+#define VIRTIO_NET_TCP_FLAG         0x3F
+#define VIRTIO_NET_TCP_HDR_LENGTH   0xF000
+
+/* IPv4 max payload, 16 bits in the header */
+#define VIRTIO_NET_MAX_IP4_PAYLOAD (65535 - sizeof(struct ip_header))
+#define VIRTIO_NET_MAX_TCP_PAYLOAD 65535
+
+/* header length value in ip header without option */
+#define VIRTIO_NET_IP4_HEADER_LENGTH 5
+
+/* Purge coalesced packets timer interval, This value affects the performance
+   a lot, and should be tuned carefully, '300000'(300us) is the recommended
+   value to pass the WHQL test, '50000' can gain 2x netperf throughput with
+   tso/gso/gro 'off'. */
+#define VIRTIO_NET_RSC_INTERVAL  300000
+
 typedef struct VirtIOFeature {
     uint32_t flags;
     size_t end;
@@ -1089,7 +1110,8 @@ static int receive_filter(VirtIONet *n, const uint8_t *buf, int size)
     return 0;
 }
 
-static ssize_t virtio_net_receive(NetClientState *nc, const uint8_t *buf, size_t size)
+static ssize_t virtio_net_do_receive(NetClientState *nc,
+                                     const uint8_t *buf, size_t size)
 {
     VirtIONet *n = qemu_get_nic_opaque(nc);
     VirtIONetQueue *q = virtio_net_get_subqueue(nc);
@@ -1685,6 +1707,474 @@ static int virtio_net_load_device(VirtIODevice *vdev, QEMUFile *f,
     return 0;
 }
 
+static void virtio_net_rsc_extract_unit4(NetRscChain *chain,
+                                         const uint8_t *buf, NetRscUnit* unit)
+{
+    uint16_t hdr_len;
+    uint16_t ip_hdrlen;
+    struct ip_header *ip;
+
+    hdr_len = chain->n->guest_hdr_len;
+    ip = (struct ip_header *)(buf + hdr_len + sizeof(struct eth_header));
+    unit->ip = (void *)ip;
+    ip_hdrlen = (ip->ip_ver_len & 0xF) << 2;
+    unit->ip_plen = &ip->ip_len;
+    unit->tcp = (struct tcp_header *)(((uint8_t *)unit->ip) + ip_hdrlen);
+    unit->tcp_hdrlen = (htons(unit->tcp->th_offset_flags) & 0xF000) >> 10;
+    unit->payload = htons(*unit->ip_plen) - ip_hdrlen - unit->tcp_hdrlen;
+}
+
+static void virtio_net_rsc_ipv4_checksum(struct virtio_net_hdr *vhdr,
+                                         struct ip_header *ip)
+{
+    uint32_t sum;
+
+    ip->ip_sum = 0;
+    sum = net_checksum_add_cont(sizeof(struct ip_header), (uint8_t *)ip, 0);
+    ip->ip_sum = cpu_to_be16(net_checksum_finish(sum));
+    vhdr->flags &= ~VIRTIO_NET_HDR_F_NEEDS_CSUM;
+    vhdr->flags |= VIRTIO_NET_HDR_F_DATA_VALID;
+}
+
+static size_t virtio_net_rsc_drain_seg(NetRscChain *chain, NetRscSeg *seg)
+{
+    int ret;
+    struct virtio_net_hdr *h;
+
+    h = (struct virtio_net_hdr *)seg->buf;
+    virtio_net_rsc_ipv4_checksum(h, seg->unit.ip);
+    ret = virtio_net_do_receive(seg->nc, seg->buf, seg->size);
+    QTAILQ_REMOVE(&chain->buffers, seg, next);
+    g_free(seg->buf);
+    g_free(seg);
+
+    return ret;
+}
+
+static void virtio_net_rsc_purge(void *opq)
+{
+    NetRscSeg *seg, *rn;
+    NetRscChain *chain = (NetRscChain *)opq;
+
+    QTAILQ_FOREACH_SAFE(seg, &chain->buffers, next, rn) {
+        if (virtio_net_rsc_drain_seg(chain, seg) == 0) {
+            chain->stat.purge_failed++;
+            continue;
+        }
+    }
+
+    chain->stat.timer++;
+    if (!QTAILQ_EMPTY(&chain->buffers)) {
+        timer_mod(chain->drain_timer,
+              qemu_clock_get_ns(QEMU_CLOCK_VIRTUAL) + chain->n->rsc_timeout);
+    }
+}
+
+static void virtio_net_rsc_cleanup(VirtIONet *n)
+{
+    NetRscChain *chain, *rn_chain;
+    NetRscSeg *seg, *rn_seg;
+
+    QTAILQ_FOREACH_SAFE(chain, &n->rsc_chains, next, rn_chain) {
+        QTAILQ_FOREACH_SAFE(seg, &chain->buffers, next, rn_seg) {
+            QTAILQ_REMOVE(&chain->buffers, seg, next);
+            g_free(seg->buf);
+            g_free(seg);
+        }
+
+        timer_del(chain->drain_timer);
+        timer_free(chain->drain_timer);
+        QTAILQ_REMOVE(&n->rsc_chains, chain, next);
+        g_free(chain);
+    }
+}
+
+static void virtio_net_rsc_cache_buf(NetRscChain *chain, NetClientState *nc,
+                                     const uint8_t *buf, size_t size)
+{
+    uint16_t hdr_len;
+    NetRscSeg *seg;
+
+    hdr_len = chain->n->guest_hdr_len;
+    seg = g_malloc(sizeof(NetRscSeg));
+    seg->buf = g_malloc(hdr_len + sizeof(struct eth_header)\
+                   + VIRTIO_NET_MAX_TCP_PAYLOAD);
+    memcpy(seg->buf, buf, size);
+    seg->size = size;
+    seg->packets = 1;
+    seg->dup_ack = 0;
+    seg->is_coalesced = 0;
+    seg->nc = nc;
+
+    QTAILQ_INSERT_TAIL(&chain->buffers, seg, next);
+    chain->stat.cache++;
+
+    virtio_net_rsc_extract_unit4(chain, seg->buf, &seg->unit);
+}
+
+static int32_t virtio_net_rsc_handle_ack(NetRscChain *chain,
+                                         NetRscSeg *seg, const uint8_t *buf,
+                                         struct tcp_header *n_tcp,
+                                         struct tcp_header *o_tcp)
+{
+    uint32_t nack, oack;
+    uint16_t nwin, owin;
+
+    nack = htonl(n_tcp->th_ack);
+    nwin = htons(n_tcp->th_win);
+    oack = htonl(o_tcp->th_ack);
+    owin = htons(o_tcp->th_win);
+
+    if ((nack - oack) >= VIRTIO_NET_MAX_TCP_PAYLOAD) {
+        chain->stat.ack_out_of_win++;
+        return RSC_FINAL;
+    } else if (nack == oack) {
+        /* duplicated ack or window probe */
+        if (nwin == owin) {
+            /* duplicated ack, add dup ack count due to whql test up to 1 */
+            chain->stat.dup_ack++;
+            return RSC_FINAL;
+        } else {
+            /* Coalesce window update */
+            o_tcp->th_win = n_tcp->th_win;
+            chain->stat.win_update++;
+            return RSC_COALESCE;
+        }
+    } else {
+        /* pure ack, go to 'C', finalize*/
+        chain->stat.pure_ack++;
+        return RSC_FINAL;
+    }
+}
+
+static int32_t virtio_net_rsc_coalesce_data(NetRscChain *chain,
+                                            NetRscSeg *seg, const uint8_t *buf,
+                                            NetRscUnit *n_unit)
+{
+    void *data;
+    uint16_t o_ip_len;
+    uint32_t nseq, oseq;
+    NetRscUnit *o_unit;
+
+    o_unit = &seg->unit;
+    o_ip_len = htons(*o_unit->ip_plen);
+    nseq = htonl(n_unit->tcp->th_seq);
+    oseq = htonl(o_unit->tcp->th_seq);
+
+    /* out of order or retransmitted. */
+    if ((nseq - oseq) > VIRTIO_NET_MAX_TCP_PAYLOAD) {
+        chain->stat.data_out_of_win++;
+        return RSC_FINAL;
+    }
+
+    data = ((uint8_t *)n_unit->tcp) + n_unit->tcp_hdrlen;
+    if (nseq == oseq) {
+        if ((o_unit->payload == 0) && n_unit->payload) {
+            /* From no payload to payload, normal case, not a dup ack or etc */
+            chain->stat.data_after_pure_ack++;
+            goto coalesce;
+        } else {
+            return virtio_net_rsc_handle_ack(chain, seg, buf,
+                                             n_unit->tcp, o_unit->tcp);
+        }
+    } else if ((nseq - oseq) != o_unit->payload) {
+        /* Not a consistent packet, out of order */
+        chain->stat.data_out_of_order++;
+        return RSC_FINAL;
+    } else {
+coalesce:
+        if ((o_ip_len + n_unit->payload) > chain->max_payload) {
+            chain->stat.over_size++;
+            return RSC_FINAL;
+        }
+
+        /* Here comes the right data, the payload length in v4/v6 is different,
+           so use the field value to update and record the new data len */
+        o_unit->payload += n_unit->payload; /* update new data len */
+
+        /* update field in ip header */
+        *o_unit->ip_plen = htons(o_ip_len + n_unit->payload);
+
+        /* Bring the 'PUSH' flag over; the whql test guide says 'PUSH' can
+           be coalesced for windows guests, while this may change the
+           behavior for linux guests. */
+        o_unit->tcp->th_offset_flags = n_unit->tcp->th_offset_flags;
+
+        o_unit->tcp->th_ack = n_unit->tcp->th_ack;
+        o_unit->tcp->th_win = n_unit->tcp->th_win;
+
+        memmove(seg->buf + seg->size, data, n_unit->payload);
+        seg->size += n_unit->payload;
+        seg->packets++;
+        chain->stat.coalesced++;
+        return RSC_COALESCE;
+    }
+}
+
+static int32_t virtio_net_rsc_coalesce4(NetRscChain *chain, NetRscSeg *seg,
+                                        const uint8_t *buf, size_t size,
+                                        NetRscUnit *unit)
+{
+    struct ip_header *ip1, *ip2;
+
+    ip1 = (struct ip_header *)(unit->ip);
+    ip2 = (struct ip_header *)(seg->unit.ip);
+    if ((ip1->ip_src ^ ip2->ip_src) || (ip1->ip_dst ^ ip2->ip_dst)
+        || (unit->tcp->th_sport ^ seg->unit.tcp->th_sport)
+        || (unit->tcp->th_dport ^ seg->unit.tcp->th_dport)) {
+        chain->stat.no_match++;
+        return RSC_NO_MATCH;
+    }
+
+    return virtio_net_rsc_coalesce_data(chain, seg, buf, unit);
+}
+
+/* Packets with 'SYN' should be bypassed; other flags should be sent only
+ * after draining, to prevent out-of-order delivery */
+static int virtio_net_rsc_tcp_ctrl_check(NetRscChain *chain,
+                                         struct tcp_header *tcp)
+{
+    uint16_t tcp_hdr;
+    uint16_t tcp_flag;
+
+    tcp_flag = htons(tcp->th_offset_flags);
+    tcp_hdr = (tcp_flag & VIRTIO_NET_TCP_HDR_LENGTH) >> 10;
+    tcp_flag &= VIRTIO_NET_TCP_FLAG;
+    tcp_flag = htons(tcp->th_offset_flags) & 0x3F;
+    if (tcp_flag & TH_SYN) {
+        chain->stat.tcp_syn++;
+        return RSC_BYPASS;
+    }
+
+    if (tcp_flag & (TH_FIN | TH_URG | TH_RST)) {
+        chain->stat.tcp_ctrl_drain++;
+        return RSC_FINAL;
+    }
+
+    if (tcp_hdr > sizeof(struct tcp_header)) {
+        chain->stat.tcp_all_opt++;
+        return RSC_FINAL;
+    }
+
+    return RSC_WANT;
+}
+
+static bool virtio_net_rsc_empty_cache(NetRscChain *chain, NetClientState *nc,
+                                       const uint8_t *buf, size_t size)
+{
+    if (QTAILQ_EMPTY(&chain->buffers)) {
+        chain->stat.empty_cache++;
+        virtio_net_rsc_cache_buf(chain, nc, buf, size);
+        timer_mod(chain->drain_timer,
+              qemu_clock_get_ns(QEMU_CLOCK_VIRTUAL) + chain->n->rsc_timeout);
+        return 1;
+    }
+
+    return 0;
+}
+
+static size_t virtio_net_rsc_do_coalesce(NetRscChain *chain, NetClientState *nc,
+                                         const uint8_t *buf, size_t size,
+                                         NetRscUnit *unit)
+{
+    int ret;
+    NetRscSeg *seg, *nseg;
+
+    QTAILQ_FOREACH_SAFE(seg, &chain->buffers, next, nseg) {
+        ret = virtio_net_rsc_coalesce4(chain, seg, buf, size, unit);
+
+        if (ret == RSC_FINAL) {
+            if (virtio_net_rsc_drain_seg(chain, seg) == 0) {
+                /* Send failed */
+                chain->stat.final_failed++;
+                return 0;
+            }
+
+            /* Send current packet */
+            return virtio_net_do_receive(nc, buf, size);
+        } else if (ret == RSC_NO_MATCH) {
+            continue;
+        } else {
+            /* Coalesced, mark coalesced flag to tell calc cksum for ipv4 */
+            seg->is_coalesced = 1;
+            return size;
+        }
+    }
+
+    chain->stat.no_match_cache++;
+    virtio_net_rsc_cache_buf(chain, nc, buf, size);
+    return size;
+}
+
+/* Drain a connection's data; this is to avoid out-of-order segments */
+static size_t virtio_net_rsc_drain_flow(NetRscChain *chain, NetClientState *nc,
+                                        const uint8_t *buf, size_t size,
+                                        uint16_t ip_start, uint16_t ip_size,
+                                        uint16_t tcp_port, uint16_t port_size)
+{
+    NetRscSeg *seg, *nseg;
+
+    QTAILQ_FOREACH_SAFE(seg, &chain->buffers, next, nseg) {
+        if (memcmp(buf + ip_start, seg->buf + ip_start, ip_size)
+            || memcmp(buf + tcp_port, seg->buf + tcp_port, port_size)) {
+            continue;
+        }
+        if (virtio_net_rsc_drain_seg(chain, seg) == 0) {
+            chain->stat.drain_failed++;
+        }
+
+        break;
+    }
+
+    return virtio_net_do_receive(nc, buf, size);
+}
+
+static int32_t virtio_net_rsc_sanity_check4(NetRscChain *chain,
+                                            struct ip_header *ip,
+                                            const uint8_t *buf, size_t size)
+{
+    uint16_t ip_len;
+    uint16_t hdr_len;
+
+    hdr_len = chain->n->guest_hdr_len;
+    if (size < (hdr_len + sizeof(struct eth_header) + sizeof(struct ip_header)
+        + sizeof(struct tcp_header))) {
+        return RSC_BYPASS;
+    }
+
+    /* Not an ipv4 one */
+    if (((ip->ip_ver_len & 0xF0) >> 4) != IP_HEADER_VERSION_4) {
+        chain->stat.ip_option++;
+        return RSC_BYPASS;
+    }
+
+    /* Don't handle packets with ip option */
+    if (VIRTIO_NET_IP4_HEADER_LENGTH != (ip->ip_ver_len & 0xF)) {
+        chain->stat.ip_option++;
+        return RSC_BYPASS;
+    }
+
+    if (ip->ip_p != IPPROTO_TCP) {
+        chain->stat.bypass_not_tcp++;
+        return RSC_BYPASS;
+    }
+
+    /* Don't handle packets with ip fragment */
+    if (!(htons(ip->ip_off) & IP_DF)) {
+        chain->stat.ip_frag++;
+        return RSC_BYPASS;
+    }
+
+    ip_len = htons(ip->ip_len);
+    if (ip_len < (sizeof(struct ip_header) + sizeof(struct tcp_header))
+        || ip_len > (size - hdr_len - sizeof(struct eth_header))) {
+        chain->stat.ip_hacked++;
+        return RSC_BYPASS;
+    }
+
+    return RSC_WANT;
+}
+
+static size_t virtio_net_rsc_receive4(NetRscChain *chain, NetClientState* nc,
+                                      const uint8_t *buf, size_t size)
+{
+    int32_t ret;
+    uint16_t hdr_len;
+    NetRscUnit unit;
+
+    hdr_len = ((VirtIONet *)(chain->n))->guest_hdr_len;
+    virtio_net_rsc_extract_unit4(chain, buf, &unit);
+    if (RSC_WANT != virtio_net_rsc_sanity_check4(chain, unit.ip, buf, size)) {
+        return virtio_net_do_receive(nc, buf, size);
+    }
+
+    ret = virtio_net_rsc_tcp_ctrl_check(chain, unit.tcp);
+    if (ret == RSC_BYPASS) {
+        return virtio_net_do_receive(nc, buf, size);
+    } else if (ret == RSC_FINAL) {
+        return virtio_net_rsc_drain_flow(chain, nc, buf, size,
+                ((hdr_len + sizeof(struct eth_header)) + 12),
+                VIRTIO_NET_IP4_ADDR_SIZE,
+                hdr_len + sizeof(struct eth_header) + sizeof(struct ip_header),
+                VIRTIO_NET_TCP_PORT_SIZE);
+    }
+
+    if (virtio_net_rsc_empty_cache(chain, nc, buf, size)) {
+        return size;
+    }
+
+    return virtio_net_rsc_do_coalesce(chain, nc, buf, size, &unit);
+}
+
+static NetRscChain *virtio_net_rsc_lookup_chain(VirtIONet * n,
+                                                NetClientState *nc,
+                                                uint16_t proto)
+{
+    NetRscChain *chain;
+
+    if (proto != (uint16_t)ETH_P_IP) {
+        return NULL;
+    }
+
+    QTAILQ_FOREACH(chain, &n->rsc_chains, next) {
+        if (chain->proto == proto) {
+            return chain;
+        }
+    }
+
+    chain = g_malloc(sizeof(*chain));
+    chain->n = n;
+    chain->proto = proto;
+    chain->max_payload = VIRTIO_NET_MAX_IP4_PAYLOAD;
+    chain->gso_type = VIRTIO_NET_HDR_GSO_TCPV4;
+    chain->drain_timer = timer_new_ns(QEMU_CLOCK_VIRTUAL,
+                                      virtio_net_rsc_purge, chain);
+    memset(&chain->stat, 0, sizeof(chain->stat));
+
+    QTAILQ_INIT(&chain->buffers);
+    QTAILQ_INSERT_TAIL(&n->rsc_chains, chain, next);
+
+    return chain;
+}
+
+static ssize_t virtio_net_rsc_receive(NetClientState *nc,
+                                      const uint8_t *buf, size_t size)
+{
+    uint16_t proto;
+    NetRscChain *chain;
+    struct eth_header *eth;
+    VirtIONet *n;
+
+    n = qemu_get_nic_opaque(nc);
+    if (size < (n->guest_hdr_len + sizeof(struct eth_header))) {
+        return virtio_net_do_receive(nc, buf, size);
+    }
+
+    eth = (struct eth_header *)(buf + n->guest_hdr_len);
+    proto = htons(eth->h_proto);
+
+    chain = virtio_net_rsc_lookup_chain(n, nc, proto);
+    if (!chain) {
+        return virtio_net_do_receive(nc, buf, size);
+    } else {
+        chain->stat.received++;
+        return virtio_net_rsc_receive4(chain, nc, buf, size);
+    }
+}
+
+static ssize_t virtio_net_receive(NetClientState *nc,
+                                  const uint8_t *buf, size_t size)
+{
+    VirtIONet *n;
+
+    n = qemu_get_nic_opaque(nc);
+    if (n->host_features & (1ULL << VIRTIO_NET_F_GUEST_RSC)) {
+        return virtio_net_rsc_receive(nc, buf, size);
+    } else {
+        return virtio_net_do_receive(nc, buf, size);
+    }
+}
+
 static NetClientInfo net_virtio_info = {
     .type = NET_CLIENT_OPTIONS_KIND_NIC,
     .size = sizeof(NICState),
@@ -1814,6 +2304,7 @@ static void virtio_net_device_realize(DeviceState *dev, Error **errp)
     nc = qemu_get_queue(n->nic);
     nc->rxfilter_notify_enabled = 1;
 
+    QTAILQ_INIT(&n->rsc_chains);
     n->qdev = dev;
     register_savevm(dev, "virtio-net", -1, VIRTIO_NET_VM_VERSION,
                     virtio_net_save, virtio_net_load, n);
@@ -1848,6 +2339,7 @@ static void virtio_net_device_unrealize(DeviceState *dev, Error **errp)
     g_free(n->vqs);
     qemu_del_nic(n->nic);
     virtio_cleanup(vdev);
+    virtio_net_rsc_cleanup(n);
 }
 
 static void virtio_net_instance_init(Object *obj)
@@ -1909,6 +2401,10 @@ static Property virtio_net_properties[] = {
                        TX_TIMER_INTERVAL),
     DEFINE_PROP_INT32("x-txburst", VirtIONet, net_conf.txburst, TX_BURST),
     DEFINE_PROP_STRING("tx", VirtIONet, net_conf.tx),
+    DEFINE_PROP_BIT("guest_rsc", VirtIONet, host_features,
+                    VIRTIO_NET_F_GUEST_RSC, false),
+    DEFINE_PROP_UINT32("rsc_interval", VirtIONet, rsc_timeout,
+                      VIRTIO_NET_RSC_INTERVAL),
     DEFINE_PROP_END_OF_LIST(),
 };
 
diff --git a/include/hw/virtio/virtio-net.h b/include/hw/virtio/virtio-net.h
index 0cabdb6..e995190 100644
--- a/include/hw/virtio/virtio-net.h
+++ b/include/hw/virtio/virtio-net.h
@@ -59,6 +59,8 @@ typedef struct VirtIONet {
     VirtIONetQueue *vqs;
     VirtQueue *ctrl_vq;
     NICState *nic;
+    QTAILQ_HEAD(, NetRscChain) rsc_chains;
+    uint32_t rsc_timeout;
     uint32_t tx_timeout;
     int32_t tx_burst;
     uint32_t has_vnet_hdr;
diff --git a/include/hw/virtio/virtio.h b/include/hw/virtio/virtio.h
index 6a37065..a7cb84a 100644
--- a/include/hw/virtio/virtio.h
+++ b/include/hw/virtio/virtio.h
@@ -30,6 +30,8 @@
                                 (0x1ULL << VIRTIO_F_ANY_LAYOUT))
 
 struct VirtQueue;
+struct VirtIONet;
+typedef struct VirtIONet VirtIONet;
 
 static inline hwaddr vring_align(hwaddr addr,
                                              unsigned long align)
@@ -128,6 +130,79 @@ typedef struct VirtioDeviceClass {
     int (*load)(VirtIODevice *vdev, QEMUFile *f, int version_id);
 } VirtioDeviceClass;
 
+/* Coalesced packets type & status */
+typedef enum {
+    RSC_COALESCE,           /* Data has been coalesced */
+    RSC_FINAL,              /* Will terminate current connection */
+    RSC_NO_MATCH,           /* No match in the buffer pool */
+    RSC_BYPASS,             /* Packet to be bypassed: not tcp, tcp ctrl, etc */
+    RSC_WANT                /* Data we want to coalesce */
+} COALESCE_STATUS;
+
+typedef struct NetRscStat {
+    uint32_t received;
+    uint32_t coalesced;
+    uint32_t over_size;
+    uint32_t cache;
+    uint32_t empty_cache;
+    uint32_t no_match_cache;
+    uint32_t win_update;
+    uint32_t no_match;
+    uint32_t tcp_syn;
+    uint32_t tcp_ctrl_drain;
+    uint32_t dup_ack;
+    uint32_t dup_ack1;
+    uint32_t dup_ack2;
+    uint32_t pure_ack;
+    uint32_t ack_out_of_win;
+    uint32_t data_out_of_win;
+    uint32_t data_out_of_order;
+    uint32_t data_after_pure_ack;
+    uint32_t bypass_not_tcp;
+    uint32_t tcp_option;
+    uint32_t tcp_all_opt;
+    uint32_t ip_frag;
+    uint32_t ip_hacked;
+    uint32_t ip_option;
+    uint32_t purge_failed;
+    uint32_t drain_failed;
+    uint32_t final_failed;
+    int64_t  timer;
+} NetRscStat;
+
+/* General info about an RSC unit, used to check whether it can be coalesced */
+typedef struct NetRscUnit {
+    void *ip;   /* ip header */
+    uint16_t *ip_plen;      /* data len pointer in ip header field */
+    struct tcp_header *tcp; /* tcp header */
+    uint16_t tcp_hdrlen;    /* tcp header len */
+    uint16_t payload;       /* pure payload without virtio/eth/ip/tcp */
+} NetRscUnit;
+
+/* Coalesced segment */
+typedef struct NetRscSeg {
+    QTAILQ_ENTRY(NetRscSeg) next;
+    void *buf;
+    size_t size;
+    uint16_t packets;
+    uint16_t dup_ack;
+    bool is_coalesced;      /* ipv4 header checksum needs recalculating */
+    NetRscUnit unit;
+    NetClientState *nc;
+} NetRscSeg;
+
+/* Chains are separated by protocol (ipv4/v6) and NetClientInfo */
+typedef struct NetRscChain {
+    QTAILQ_ENTRY(NetRscChain) next;
+    VirtIONet *n;                            /* VirtIONet */
+    uint16_t proto;
+    uint8_t  gso_type;
+    uint16_t max_payload;
+    QEMUTimer *drain_timer;
+    QTAILQ_HEAD(, NetRscSeg) buffers;
+    NetRscStat stat;
+} NetRscChain;
+
 void virtio_instance_init_common(Object *proxy_obj, void *data,
                                  size_t vdev_size, const char *vdev_name);
 
diff --git a/include/standard-headers/linux/virtio_net.h b/include/standard-headers/linux/virtio_net.h
index a78f33e..5b95762 100644
--- a/include/standard-headers/linux/virtio_net.h
+++ b/include/standard-headers/linux/virtio_net.h
@@ -55,6 +55,7 @@
 #define VIRTIO_NET_F_MQ	22	/* Device supports Receive Flow
 					 * Steering */
 #define VIRTIO_NET_F_CTRL_MAC_ADDR 23	/* Set MAC address */
+#define VIRTIO_NET_F_GUEST_RSC  24      /* Guest can coalesce tcp packets */
 
 #ifndef VIRTIO_NET_NO_LEGACY
 #define VIRTIO_NET_F_GSO	6	/* Host handles pkts w/ any GSO type */
-- 
2.7.1

^ permalink raw reply related	[flat|nested] 12+ messages in thread

* [Qemu-devel] [ RFC Patch v6 2/3] virtio-net rsc: support coalescing ipv6 tcp traffic
  2016-05-28 16:37 [Qemu-devel] [ RFC Patch v6 0/2] Support Receive-Segment-Offload(RSC) for WHQL wexu
  2016-05-28 16:37 ` [Qemu-devel] [ RFC Patch v6 1/3] virtio-net rsc: support coalescing ipv4 tcp traffic wexu
@ 2016-05-28 16:37 ` wexu
  2016-05-30  4:25   ` Jason Wang
  2016-05-28 16:37 ` [Qemu-devel] [ RFC Patch v6 3/3] virtio-net rsc: add 2 new rsc information fields to 'virtio_net_hdr' wexu
  2016-05-30  4:22 ` [Qemu-devel] [ RFC Patch v6 0/2] Support Receive-Segment-Offload(RSC) for WHQL Jason Wang
  3 siblings, 1 reply; 12+ messages in thread
From: wexu @ 2016-05-28 16:37 UTC (permalink / raw)
  To: qemu-devel; +Cc: jasowang, marcel, mst, victork, dfleytma, yvugenfi, Wei Xu

From: Wei Xu <wexu@redhat.com>

Most of the logic is the same as for IPv4; there are 2 differences between
IPv4 and IPv6 (see the sketch after this list):

1. The total length field in the IPv4 header includes the header itself,
while the IPv6 payload length does not, which means IPv6 can carry a full
'65535'-byte payload.

2. The IPv6 header has no header checksum to calculate.
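
Expressed as code (illustrative only; the real extraction is done by
virtio_net_rsc_extract_unit4()/virtio_net_rsc_extract_unit6()):

#include <stdint.h>

/* IPv4: the total length field counts the IP header itself, so both header
 * lengths must be subtracted to get the pure TCP payload. */
uint16_t ipv4_tcp_payload(uint16_t ip_total_len, uint16_t ip_hdrlen,
                          uint16_t tcp_hdrlen)
{
    return ip_total_len - ip_hdrlen - tcp_hdrlen;
}

/* IPv6: the payload length field already excludes the fixed IPv6 header, so
 * only the TCP header is subtracted -- which is also why a full 65535-byte
 * payload fits. */
uint16_t ipv6_tcp_payload(uint16_t payload_len, uint16_t tcp_hdrlen)
{
    return payload_len - tcp_hdrlen;
}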

Signed-off-by: Wei Xu <wexu@redhat.com>
---
 hw/net/virtio-net.c | 152 +++++++++++++++++++++++++++++++++++++++++++++++++---
 1 file changed, 144 insertions(+), 8 deletions(-)

diff --git a/hw/net/virtio-net.c b/hw/net/virtio-net.c
index b3bb63b..cc8cbe4 100644
--- a/hw/net/virtio-net.c
+++ b/hw/net/virtio-net.c
@@ -53,6 +53,10 @@
 /* header length value in ip header without option */
 #define VIRTIO_NET_IP4_HEADER_LENGTH 5
 
+#define ETH_IP6_HDR_SZ (ETH_HDR_SZ + IP6_HDR_SZ)
+#define VIRTIO_NET_IP6_ADDR_SIZE   32      /* ipv6 saddr + daddr */
+#define VIRTIO_NET_MAX_IP6_PAYLOAD VIRTIO_NET_MAX_TCP_PAYLOAD
+
 /* Purge coalesced packets timer interval, This value affects the performance
    a lot, and should be tuned carefully, '300000'(300us) is the recommended
    value to pass the WHQL test, '50000' can gain 2x netperf throughput with
@@ -1724,6 +1728,25 @@ static void virtio_net_rsc_extract_unit4(NetRscChain *chain,
     unit->payload = htons(*unit->ip_plen) - ip_hdrlen - unit->tcp_hdrlen;
 }
 
+static void virtio_net_rsc_extract_unit6(NetRscChain *chain,
+                                         const uint8_t *buf, NetRscUnit* unit)
+{
+    uint16_t hdr_len;
+    struct ip6_header *ip6;
+
+    hdr_len = ((VirtIONet *)(chain->n))->guest_hdr_len;
+    ip6 = (struct ip6_header *)(buf + hdr_len + sizeof(struct eth_header));
+    unit->ip = ip6;
+    unit->ip_plen = &(ip6->ip6_ctlun.ip6_un1.ip6_un1_plen);
+    unit->tcp = (struct tcp_header *)(((uint8_t *)unit->ip)\
+                                        + sizeof(struct ip6_header));
+    unit->tcp_hdrlen = (htons(unit->tcp->th_offset_flags) & 0xF000) >> 10;
+
+    /* There is a difference between the payload length in ipv4 and v6:
+       the ip header is excluded in ipv6 */
+    unit->payload = htons(*unit->ip_plen) - unit->tcp_hdrlen;
+}
+
 static void virtio_net_rsc_ipv4_checksum(struct virtio_net_hdr *vhdr,
                                          struct ip_header *ip)
 {
@@ -1742,7 +1765,9 @@ static size_t virtio_net_rsc_drain_seg(NetRscChain *chain, NetRscSeg *seg)
     struct virtio_net_hdr *h;
 
     h = (struct virtio_net_hdr *)seg->buf;
-    virtio_net_rsc_ipv4_checksum(h, seg->unit.ip);
+    if ((chain->proto == ETH_P_IP) && seg->is_coalesced) {
+        virtio_net_rsc_ipv4_checksum(h, seg->unit.ip);
+    }
     ret = virtio_net_do_receive(seg->nc, seg->buf, seg->size);
     QTAILQ_REMOVE(&chain->buffers, seg, next);
     g_free(seg->buf);
@@ -1798,7 +1823,7 @@ static void virtio_net_rsc_cache_buf(NetRscChain *chain, NetClientState *nc,
     hdr_len = chain->n->guest_hdr_len;
     seg = g_malloc(sizeof(NetRscSeg));
     seg->buf = g_malloc(hdr_len + sizeof(struct eth_header)\
-                   + VIRTIO_NET_MAX_TCP_PAYLOAD);
+                   + sizeof(struct ip6_header) + VIRTIO_NET_MAX_TCP_PAYLOAD);
     memcpy(seg->buf, buf, size);
     seg->size = size;
     seg->packets = 1;
@@ -1809,7 +1834,18 @@ static void virtio_net_rsc_cache_buf(NetRscChain *chain, NetClientState *nc,
     QTAILQ_INSERT_TAIL(&chain->buffers, seg, next);
     chain->stat.cache++;
 
-    virtio_net_rsc_extract_unit4(chain, seg->buf, &seg->unit);
+    switch (chain->proto) {
+    case ETH_P_IP:
+        virtio_net_rsc_extract_unit4(chain, seg->buf, &seg->unit);
+        break;
+
+    case ETH_P_IPV6:
+        virtio_net_rsc_extract_unit6(chain, seg->buf, &seg->unit);
+        break;
+
+    default:
+        g_assert_not_reached();
+    }
 }
 
 static int32_t virtio_net_rsc_handle_ack(NetRscChain *chain,
@@ -1929,6 +1965,24 @@ static int32_t virtio_net_rsc_coalesce4(NetRscChain *chain, NetRscSeg *seg,
     return virtio_net_rsc_coalesce_data(chain, seg, buf, unit);
 }
 
+static int32_t virtio_net_rsc_coalesce6(NetRscChain *chain, NetRscSeg *seg,
+                        const uint8_t *buf, size_t size, NetRscUnit *unit)
+{
+    struct ip6_header *ip1, *ip2;
+
+    ip1 = (struct ip6_header *)(unit->ip);
+    ip2 = (struct ip6_header *)(seg->unit.ip);
+    if (memcmp(&ip1->ip6_src, &ip2->ip6_src, sizeof(struct in6_address))
+        || memcmp(&ip1->ip6_dst, &ip2->ip6_dst, sizeof(struct in6_address))
+        || (unit->tcp->th_sport ^ seg->unit.tcp->th_sport)
+        || (unit->tcp->th_dport ^ seg->unit.tcp->th_dport)) {
+            chain->stat.no_match++;
+            return RSC_NO_MATCH;
+    }
+
+    return virtio_net_rsc_coalesce_data(chain, seg, buf, unit);
+}
+
 /* Packets with 'SYN' should be bypassed; other flags should be sent only
  * after draining, to prevent out-of-order delivery */
 static int virtio_net_rsc_tcp_ctrl_check(NetRscChain *chain,
@@ -1981,7 +2035,11 @@ static size_t virtio_net_rsc_do_coalesce(NetRscChain *chain, NetClientState *nc,
     NetRscSeg *seg, *nseg;
 
     QTAILQ_FOREACH_SAFE(seg, &chain->buffers, next, nseg) {
-        ret = virtio_net_rsc_coalesce4(chain, seg, buf, size, unit);
+        if (chain->proto == ETH_P_IP) {
+            ret = virtio_net_rsc_coalesce4(chain, seg, buf, size, unit);
+        } else {
+            ret = virtio_net_rsc_coalesce6(chain, seg, buf, size, unit);
+        }
 
         if (ret == RSC_FINAL) {
             if (virtio_net_rsc_drain_seg(chain, seg) == 0) {
@@ -2106,13 +2164,82 @@ static size_t virtio_net_rsc_receive4(NetRscChain *chain, NetClientState* nc,
     return virtio_net_rsc_do_coalesce(chain, nc, buf, size, &unit);
 }
 
+static int32_t virtio_net_rsc_sanity_check6(NetRscChain *chain,
+                                            struct ip6_header *ip6,
+                                            const uint8_t *buf, size_t size)
+{
+    uint16_t ip_len;
+    uint16_t hdr_len;
+
+    hdr_len = ((VirtIONet *)(chain->n))->guest_hdr_len;
+    if (size < (hdr_len + sizeof(struct eth_header) + sizeof(struct ip6_header)
+        + sizeof(tcp_header))) {
+        return RSC_BYPASS;
+    }
+
+    if (((ip6->ip6_ctlun.ip6_un1.ip6_un1_flow & 0xF0) >> 4)
+        != IP_HEADER_VERSION_6) {
+        return RSC_BYPASS;
+    }
+
+    /* Both ip options and the protocol are covered by this check */
+    if (ip6->ip6_ctlun.ip6_un1.ip6_un1_nxt != IPPROTO_TCP) {
+        chain->stat.bypass_not_tcp++;
+        return RSC_BYPASS;
+    }
+
+    ip_len = htons(ip6->ip6_ctlun.ip6_un1.ip6_un1_plen);
+    if (ip_len < sizeof(struct tcp_header)
+        || ip_len > (size - hdr_len - sizeof(struct eth_header)
+                     - sizeof(struct ip6_header))) {
+        chain->stat.ip_hacked++;
+        return RSC_BYPASS;
+    }
+
+    return RSC_WANT;
+}
+
+static size_t virtio_net_rsc_receive6(void *opq, NetClientState* nc,
+                                      const uint8_t *buf, size_t size)
+{
+    int32_t ret;
+    uint16_t hdr_len;
+    NetRscChain *chain;
+    NetRscUnit unit;
+
+    chain = (NetRscChain *)opq;
+    hdr_len = ((VirtIONet *)(chain->n))->guest_hdr_len;
+    virtio_net_rsc_extract_unit6(chain, buf, &unit);
+    if (RSC_WANT != virtio_net_rsc_sanity_check6(chain,
+                                                 unit.ip, buf, size)) {
+        return virtio_net_do_receive(nc, buf, size);
+    }
+
+    ret = virtio_net_rsc_tcp_ctrl_check(chain, unit.tcp);
+    if (ret == RSC_BYPASS) {
+        return virtio_net_do_receive(nc, buf, size);
+    } else if (ret == RSC_FINAL) {
+        return virtio_net_rsc_drain_flow(chain, nc, buf, size,
+                ((hdr_len + sizeof(struct eth_header)) + 8),
+                VIRTIO_NET_IP6_ADDR_SIZE,
+                hdr_len + sizeof(struct eth_header) + sizeof(struct ip6_header),
+                VIRTIO_NET_TCP_PORT_SIZE);
+    }
+
+    if (virtio_net_rsc_empty_cache(chain, nc, buf, size)) {
+        return size;
+    }
+
+    return virtio_net_rsc_do_coalesce(chain, nc, buf, size, &unit);
+}
+
 static NetRscChain *virtio_net_rsc_lookup_chain(VirtIONet * n,
                                                 NetClientState *nc,
                                                 uint16_t proto)
 {
     NetRscChain *chain;
 
-    if (proto != (uint16_t)ETH_P_IP) {
+    if ((proto != (uint16_t)ETH_P_IP) && (proto != (uint16_t)ETH_P_IPV6)) {
         return NULL;
     }
 
@@ -2125,8 +2252,13 @@ static NetRscChain *virtio_net_rsc_lookup_chain(VirtIONet * n,
     chain = g_malloc(sizeof(*chain));
     chain->n = n;
     chain->proto = proto;
-    chain->max_payload = VIRTIO_NET_MAX_IP4_PAYLOAD;
-    chain->gso_type = VIRTIO_NET_HDR_GSO_TCPV4;
+    if (proto == (uint16_t)ETH_P_IP) {
+        chain->max_payload = VIRTIO_NET_MAX_IP4_PAYLOAD;
+        chain->gso_type = VIRTIO_NET_HDR_GSO_TCPV4;
+    } else {
+        chain->max_payload = VIRTIO_NET_MAX_IP6_PAYLOAD;
+        chain->gso_type = VIRTIO_NET_HDR_GSO_TCPV6;
+    }
     chain->drain_timer = timer_new_ns(QEMU_CLOCK_VIRTUAL,
                                       virtio_net_rsc_purge, chain);
     memset(&chain->stat, 0, sizeof(chain->stat));
@@ -2158,7 +2290,11 @@ static ssize_t virtio_net_rsc_receive(NetClientState *nc,
         return virtio_net_do_receive(nc, buf, size);
     } else {
         chain->stat.received++;
-        return virtio_net_rsc_receive4(chain, nc, buf, size);
+        if (proto == (uint16_t)ETH_P_IP) {
+            return virtio_net_rsc_receive4(chain, nc, buf, size);
+        } else  {
+            return virtio_net_rsc_receive6(chain, nc, buf, size);
+        }
     }
 }
 
-- 
2.7.1

^ permalink raw reply related	[flat|nested] 12+ messages in thread

* [Qemu-devel] [ RFC Patch v6 3/3] virtio-net rsc: add 2 new rsc information fields to 'virtio_net_hdr'
  2016-05-28 16:37 [Qemu-devel] [ RFC Patch v6 0/2] Support Receive-Segment-Offload(RSC) for WHQL wexu
  2016-05-28 16:37 ` [Qemu-devel] [ RFC Patch v6 1/3] virtio-net rsc: support coalescing ipv4 tcp traffic wexu
  2016-05-28 16:37 ` [Qemu-devel] [ RFC Patch v6 2/3] virtio-net rsc: support coalescing ipv6 " wexu
@ 2016-05-28 16:37 ` wexu
  2016-05-30  5:57   ` Jason Wang
  2016-05-30  4:22 ` [Qemu-devel] [ RFC Patch v6 0/2] Support Receive-Segment-Offload(RSC) for WHQL Jason Wang
  3 siblings, 1 reply; 12+ messages in thread
From: wexu @ 2016-05-28 16:37 UTC (permalink / raw)
  To: qemu-devel; +Cc: jasowang, marcel, mst, victork, dfleytma, yvugenfi, Wei Xu

From: Wei Xu <wexu@redhat.com>

Field 'coalesced' indicates how many packets were coalesced, and field
'dup_ack' indicates how many duplicate ACKs were merged; a guest driver can
use this information to reconstruct what the original traffic on the wire
looked like (a hypothetical sketch follows below).
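
A purely hypothetical sketch of how a guest driver could consume the two
fields (not part of this series; names are placeholders and the __virtio16
endianness handling is omitted):

#include <stdint.h>

/* Layout mirroring the extended 'virtio_net_hdr' below. */
struct example_net_hdr {
    uint8_t  flags;
    uint8_t  gso_type;
    uint16_t hdr_len;
    uint16_t gso_size;
    uint16_t csum_start;
    uint16_t csum_offset;
    uint16_t coalesced;         /* packets coalesced by the host */
    uint16_t dup_ack;           /* duplicate acks merged by the host */
};

/* Accumulate per-queue statistics from one received buffer. */
void example_account_rsc(const struct example_net_hdr *h,
                         uint64_t *rx_coalesced, uint64_t *rx_dup_acks)
{
    if (h->coalesced > 1) {
        *rx_coalesced += h->coalesced;  /* wire packets merged into this buffer */
    }
    *rx_dup_acks += h->dup_ack;
}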

Signed-off-by: Wei Xu <wexu@redhat.com>
---
 hw/net/virtio-net.c                         | 8 ++++++++
 include/standard-headers/linux/virtio_net.h | 2 ++
 2 files changed, 10 insertions(+)

diff --git a/hw/net/virtio-net.c b/hw/net/virtio-net.c
index cc8cbe4..20f552a 100644
--- a/hw/net/virtio-net.c
+++ b/hw/net/virtio-net.c
@@ -1768,6 +1768,10 @@ static size_t virtio_net_rsc_drain_seg(NetRscChain *chain, NetRscSeg *seg)
     if ((chain->proto == ETH_P_IP) && seg->is_coalesced) {
         virtio_net_rsc_ipv4_checksum(h, seg->unit.ip);
     }
+    h->coalesced = seg->packets;
+    h->dup_ack = seg->dup_ack;
+    h->gso_type = chain->gso_type;
+    h->gso_size = chain->max_payload;
     ret = virtio_net_do_receive(seg->nc, seg->buf, seg->size);
     QTAILQ_REMOVE(&chain->buffers, seg, next);
     g_free(seg->buf);
@@ -2302,9 +2306,13 @@ static ssize_t virtio_net_receive(NetClientState *nc,
                                   const uint8_t *buf, size_t size)
 {
     VirtIONet *n;
+    struct virtio_net_hdr *h;
 
     n = qemu_get_nic_opaque(nc);
     if (n->host_features & (1ULL << VIRTIO_NET_F_GUEST_RSC)) {
+        h = (struct virtio_net_hdr *)buf;
+        h->coalesced = 0;
+        h->dup_ack = 0;
         return virtio_net_rsc_receive(nc, buf, size);
     } else {
         return virtio_net_do_receive(nc, buf, size);
diff --git a/include/standard-headers/linux/virtio_net.h b/include/standard-headers/linux/virtio_net.h
index 5b95762..c837417 100644
--- a/include/standard-headers/linux/virtio_net.h
+++ b/include/standard-headers/linux/virtio_net.h
@@ -114,6 +114,8 @@ struct virtio_net_hdr {
 	__virtio16 gso_size;		/* Bytes to append to hdr_len per frame */
 	__virtio16 csum_start;	/* Position to start checksumming from */
 	__virtio16 csum_offset;	/* Offset after that to place checksum */
+	__virtio16 coalesced;	/* packets coalesced by host */
+	__virtio16 dup_ack;	/* duplicate ack count */
 };
 
 /* This is the version of the header to use when the MRG_RXBUF
-- 
2.7.1

^ permalink raw reply related	[flat|nested] 12+ messages in thread

* Re: [Qemu-devel] [ RFC Patch v6 1/3] virtio-net rsc: support coalescing ipv4 tcp traffic
  2016-05-28 16:37 ` [Qemu-devel] [ RFC Patch v6 1/3] virtio-net rsc: support coalescing ipv4 tcp traffic wexu
@ 2016-05-30  4:20   ` Jason Wang
  2016-06-02 16:16     ` Wei Xu
  0 siblings, 1 reply; 12+ messages in thread
From: Jason Wang @ 2016-05-30  4:20 UTC (permalink / raw)
  To: wexu, qemu-devel; +Cc: marcel, mst, victork, dfleytma, yvugenfi



On 2016-05-29 00:37, wexu@redhat.com wrote:
> From: Wei Xu <wexu@redhat.com>
>
> All the data packets in a tcp connection will be cached to a big buffer
> in every receive interval, and will be sent out via a timer, the
> 'virtio_net_rsc_timeout' controls the interval, the value will influent the
> performance and response of tcp connection extremely, 50000(50us) is a
> experience value to gain a performance improvement, since the whql test
> sends packets every 100us, so '300000(300us)' can pass the test case,
> this is also the default value, it's tunable via the command line
> parameter 'rsc_interval' with 'virtio-net-pci' device, for example, below
> parameter is to launch a guest with interval set as '500000'.

Does the value make sense if it is smaller than 1us? If not, why not just 
make the unit 1us?

>
> 'virtio-net-pci,netdev=hostnet1,bus=pci.0,id=net1,mac=00,rsc_interval=500000'
> will
>
> The timer will only be triggered if the packets pool is not empty,
> and it'll drain off all the cached packets.
>
> 'NetRscChain' is used to save the segments of different protocols in a
> VirtIONet device.
>
> The main handler of TCP includes TCP window update, duplicated ACK check
> and the real data coalescing if the new segment passed sanity check
> and is identified as an 'wanted' one.
>
> An 'wanted' segment means:
> 1. Segment is within current window and the sequence is the expected one.
> 2. 'ACK' of the segment is in the valid window.
>
> Sanity check includes:
> 1. Incorrect version in IP header
> 2. IP options & IP fragment
> 3. Not a TCP packets
> 4. Sanity size check to prevent buffer overflow attack.
>
> There maybe more cases should be considered such as ip identification other
> flags, while it broke the test because windows set it to the same even it's
> not a fragment.
>
> Normally it includes 2 typical ways to handle a TCP control flag, 'bypass'
> and 'finalize', 'bypass' means should be sent out directly, and 'finalize'
> means the packets should also be bypassed, and this should be done
> after searching for the same connection packets in the pool and sending
> all of them out, this is to avoid out of data.
>
> All the 'SYN' packets will be bypassed since this always begin a new'
> connection, other flags such 'FIN/RST' will trigger a finalization, because
> this normally happens upon a connection is going to be closed, an 'URG' packet
> also finalize current coalescing unit.
>
> Statistics can be used to monitor the basic coalescing status, the 'out of order'
> and 'out of window' means how many retransmitting packets, thus describe the
> performance intuitively.

We usually don't add device-specific monitor commands. Maybe a new control 
vq command plus 'ethtool -S' in the guest in the future. I was thinking of 
removing those counters since they are never used in this series.

>
> Signed-off-by: Wei Xu <wexu@redhat.com>
> ---
>   hw/net/virtio-net.c                         | 498 +++++++++++++++++++++++++++-
>   include/hw/virtio/virtio-net.h              |   2 +
>   include/hw/virtio/virtio.h                  |  75 +++++
>   include/standard-headers/linux/virtio_net.h |   1 +

For an RFC it's ok, but for a formal patch this is not the correct way to 
modify the Linux headers. There's a script, scripts/update-linux-headers.sh, 
which is used to sync them from the Linux source. This means the change must 
be merged in Linux, or at least in the maintainer's tree, first.

>   4 files changed, 575 insertions(+), 1 deletion(-)
>
> diff --git a/hw/net/virtio-net.c b/hw/net/virtio-net.c
> index 5798f87..b3bb63b 100644
> --- a/hw/net/virtio-net.c
> +++ b/hw/net/virtio-net.c
> @@ -15,10 +15,12 @@
>   #include "qemu/iov.h"
>   #include "hw/virtio/virtio.h"
>   #include "net/net.h"
> +#include "net/eth.h"
>   #include "net/checksum.h"
>   #include "net/tap.h"
>   #include "qemu/error-report.h"
>   #include "qemu/timer.h"
> +#include "qemu/sockets.h"
>   #include "hw/virtio/virtio-net.h"
>   #include "net/vhost_net.h"
>   #include "hw/virtio/virtio-bus.h"
> @@ -38,6 +40,25 @@
>   #define endof(container, field) \
>       (offsetof(container, field) + sizeof(((container *)0)->field))
>   
> +#define VIRTIO_NET_IP4_ADDR_SIZE   8        /* ipv4 saddr + daddr */
> +#define VIRTIO_NET_TCP_PORT_SIZE   4        /* sport + dport */
> +
> +#define VIRTIO_NET_TCP_FLAG         0x3F
> +#define VIRTIO_NET_TCP_HDR_LENGTH   0xF000
> +
> +/* IPv4 max payload, 16 bits in the header */
> +#define VIRTIO_NET_MAX_IP4_PAYLOAD (65535 - sizeof(struct ip_header))
> +#define VIRTIO_NET_MAX_TCP_PAYLOAD 65535

Is this still true if window scaling is enabled? You also need to make sure 
your ack/seq processing works with window scaling.

> +
> +/* header length value in ip header without option */
> +#define VIRTIO_NET_IP4_HEADER_LENGTH 5
> +
> +/* Purge coalesced packets timer interval, This value affects the performance
> +   a lot, and should be tuned carefully, '300000'(300us) is the recommended
> +   value to pass the WHQL test, '50000' can gain 2x netperf throughput with
> +   tso/gso/gro 'off'. */
> +#define VIRTIO_NET_RSC_INTERVAL  300000

Do we need a new property for this value?

> +
>   typedef struct VirtIOFeature {
>       uint32_t flags;
>       size_t end;
> @@ -1089,7 +1110,8 @@ static int receive_filter(VirtIONet *n, const uint8_t *buf, int size)
>       return 0;
>   }
>   
> -static ssize_t virtio_net_receive(NetClientState *nc, const uint8_t *buf, size_t size)
> +static ssize_t virtio_net_do_receive(NetClientState *nc,
> +                                     const uint8_t *buf, size_t size)
>   {
>       VirtIONet *n = qemu_get_nic_opaque(nc);
>       VirtIONetQueue *q = virtio_net_get_subqueue(nc);
> @@ -1685,6 +1707,474 @@ static int virtio_net_load_device(VirtIODevice *vdev, QEMUFile *f,
>       return 0;
>   }
>   
> +static void virtio_net_rsc_extract_unit4(NetRscChain *chain,
> +                                         const uint8_t *buf, NetRscUnit* unit)
> +{
> +    uint16_t hdr_len;
> +    uint16_t ip_hdrlen;
> +    struct ip_header *ip;
> +
> +    hdr_len = chain->n->guest_hdr_len;
> +    ip = (struct ip_header *)(buf + hdr_len + sizeof(struct eth_header));
> +    unit->ip = (void *)ip;
> +    ip_hdrlen = (ip->ip_ver_len & 0xF) << 2;
> +    unit->ip_plen = &ip->ip_len;
> +    unit->tcp = (struct tcp_header *)(((uint8_t *)unit->ip) + ip_hdrlen);
> +    unit->tcp_hdrlen = (htons(unit->tcp->th_offset_flags) & 0xF000) >> 10;
> +    unit->payload = htons(*unit->ip_plen) - ip_hdrlen - unit->tcp_hdrlen;
> +}
> +
> +static void virtio_net_rsc_ipv4_checksum(struct virtio_net_hdr *vhdr,
> +                                         struct ip_header *ip)
> +{
> +    uint32_t sum;
> +
> +    ip->ip_sum = 0;
> +    sum = net_checksum_add_cont(sizeof(struct ip_header), (uint8_t *)ip, 0);
> +    ip->ip_sum = cpu_to_be16(net_checksum_finish(sum));
> +    vhdr->flags &= ~VIRTIO_NET_HDR_F_NEEDS_CSUM;
> +    vhdr->flags |= VIRTIO_NET_HDR_F_DATA_VALID;
> +}
> +
> +static size_t virtio_net_rsc_drain_seg(NetRscChain *chain, NetRscSeg *seg)
> +{
> +    int ret;
> +    struct virtio_net_hdr *h;
> +
> +    h = (struct virtio_net_hdr *)seg->buf;
> +    virtio_net_rsc_ipv4_checksum(h, seg->unit.ip);
> +    ret = virtio_net_do_receive(seg->nc, seg->buf, seg->size);
> +    QTAILQ_REMOVE(&chain->buffers, seg, next);
> +    g_free(seg->buf);
> +    g_free(seg);
> +
> +    return ret;
> +}
> +
> +static void virtio_net_rsc_purge(void *opq)
> +{
> +    NetRscSeg *seg, *rn;
> +    NetRscChain *chain = (NetRscChain *)opq;
> +
> +    QTAILQ_FOREACH_SAFE(seg, &chain->buffers, next, rn) {
> +        if (virtio_net_rsc_drain_seg(chain, seg) == 0) {
> +            chain->stat.purge_failed++;
> +            continue;

Why is this needed?

> +        }
> +    }
> +
> +    chain->stat.timer++;
> +    if (!QTAILQ_EMPTY(&chain->buffers)) {
> +        timer_mod(chain->drain_timer,
> +              qemu_clock_get_ns(QEMU_CLOCK_VIRTUAL) + chain->n->rsc_timeout);

Considering the timer is invisible to the guest, why use QEMU_CLOCK_VIRTUAL?

> +    }
> +}
> +
> +static void virtio_net_rsc_cleanup(VirtIONet *n)
> +{
> +    NetRscChain *chain, *rn_chain;
> +    NetRscSeg *seg, *rn_seg;
> +
> +    QTAILQ_FOREACH_SAFE(chain, &n->rsc_chains, next, rn_chain) {
> +        QTAILQ_FOREACH_SAFE(seg, &chain->buffers, next, rn_seg) {
> +            QTAILQ_REMOVE(&chain->buffers, seg, next);
> +            g_free(seg->buf);
> +            g_free(seg);
> +        }
> +
> +        timer_del(chain->drain_timer);
> +        timer_free(chain->drain_timer);
> +        QTAILQ_REMOVE(&n->rsc_chains, chain, next);
> +        g_free(chain);
> +    }
> +}
> +
> +static void virtio_net_rsc_cache_buf(NetRscChain *chain, NetClientState *nc,
> +                                     const uint8_t *buf, size_t size)
> +{
> +    uint16_t hdr_len;
> +    NetRscSeg *seg;
> +
> +    hdr_len = chain->n->guest_hdr_len;
> +    seg = g_malloc(sizeof(NetRscSeg));
> +    seg->buf = g_malloc(hdr_len + sizeof(struct eth_header)\
> +                   + VIRTIO_NET_MAX_TCP_PAYLOAD);
> +    memcpy(seg->buf, buf, size);
> +    seg->size = size;
> +    seg->packets = 1;
> +    seg->dup_ack = 0;
> +    seg->is_coalesced = 0;
> +    seg->nc = nc;
> +
> +    QTAILQ_INSERT_TAIL(&chain->buffers, seg, next);
> +    chain->stat.cache++;
> +
> +    virtio_net_rsc_extract_unit4(chain, seg->buf, &seg->unit);
> +}
> +
> +static int32_t virtio_net_rsc_handle_ack(NetRscChain *chain,
> +                                         NetRscSeg *seg, const uint8_t *buf,
> +                                         struct tcp_header *n_tcp,
> +                                         struct tcp_header *o_tcp)
> +{
> +    uint32_t nack, oack;
> +    uint16_t nwin, owin;
> +
> +    nack = htonl(n_tcp->th_ack);
> +    nwin = htons(n_tcp->th_win);
> +    oack = htonl(o_tcp->th_ack);
> +    owin = htons(o_tcp->th_win);
> +
> +    if ((nack - oack) >= VIRTIO_NET_MAX_TCP_PAYLOAD) {
> +        chain->stat.ack_out_of_win++;
> +        return RSC_FINAL;
> +    } else if (nack == oack) {
> +        /* duplicated ack or window probe */
> +        if (nwin == owin) {
> +            /* duplicated ack, add dup ack count due to whql test up to 1 */
> +            chain->stat.dup_ack++;
> +            return RSC_FINAL;
> +        } else {
> +            /* Coalesce window update */
> +            o_tcp->th_win = n_tcp->th_win;
> +            chain->stat.win_update++;
> +            return RSC_COALESCE;
> +        }
> +    } else {
> +        /* pure ack, go to 'C', finalize*/
> +        chain->stat.pure_ack++;
> +        return RSC_FINAL;
> +    }
> +}
> +
> +static int32_t virtio_net_rsc_coalesce_data(NetRscChain *chain,
> +                                            NetRscSeg *seg, const uint8_t *buf,
> +                                            NetRscUnit *n_unit)
> +{
> +    void *data;
> +    uint16_t o_ip_len;
> +    uint32_t nseq, oseq;
> +    NetRscUnit *o_unit;
> +
> +    o_unit = &seg->unit;
> +    o_ip_len = htons(*o_unit->ip_plen);
> +    nseq = htonl(n_unit->tcp->th_seq);
> +    oseq = htonl(o_unit->tcp->th_seq);
> +
> +    /* out of order or retransmitted. */
> +    if ((nseq - oseq) > VIRTIO_NET_MAX_TCP_PAYLOAD) {
> +        chain->stat.data_out_of_win++;
> +        return RSC_FINAL;
> +    }
> +
> +    data = ((uint8_t *)n_unit->tcp) + n_unit->tcp_hdrlen;
> +    if (nseq == oseq) {
> +        if ((o_unit->payload == 0) && n_unit->payload) {
> +            /* From no payload to payload, normal case, not a dup ack or etc */
> +            chain->stat.data_after_pure_ack++;
> +            goto coalesce;
> +        } else {
> +            return virtio_net_rsc_handle_ack(chain, seg, buf,
> +                                             n_unit->tcp, o_unit->tcp);
> +        }
> +    } else if ((nseq - oseq) != o_unit->payload) {
> +        /* Not a consistent packet, out of order */
> +        chain->stat.data_out_of_order++;
> +        return RSC_FINAL;
> +    } else {
> +coalesce:
> +        if ((o_ip_len + n_unit->payload) > chain->max_payload) {
> +            chain->stat.over_size++;
> +            return RSC_FINAL;
> +        }
> +
> +        /* Here comes the right data, the payload length in v4/v6 is different,
> +           so use the field value to update and record the new data len */
> +        o_unit->payload += n_unit->payload; /* update new data len */
> +
> +        /* update field in ip header */
> +        *o_unit->ip_plen = htons(o_ip_len + n_unit->payload);
> +
> +        /* Bring 'PUSH' big, the whql test guide says 'PUSH' can be coalesced
> +           for windows guest, while this may change the behavior for linux
> +           guest. */
> +        o_unit->tcp->th_offset_flags = n_unit->tcp->th_offset_flags;
> +
> +        o_unit->tcp->th_ack = n_unit->tcp->th_ack;
> +        o_unit->tcp->th_win = n_unit->tcp->th_win;
> +
> +        memmove(seg->buf + seg->size, data, n_unit->payload);
> +        seg->size += n_unit->payload;
> +        seg->packets++;
> +        chain->stat.coalesced++;
> +        return RSC_COALESCE;
> +    }
> +}
> +
> +static int32_t virtio_net_rsc_coalesce4(NetRscChain *chain, NetRscSeg *seg,
> +                                        const uint8_t *buf, size_t size,
> +                                        NetRscUnit *unit)
> +{
> +    struct ip_header *ip1, *ip2;
> +
> +    ip1 = (struct ip_header *)(unit->ip);
> +    ip2 = (struct ip_header *)(seg->unit.ip);
> +    if ((ip1->ip_src ^ ip2->ip_src) || (ip1->ip_dst ^ ip2->ip_dst)
> +        || (unit->tcp->th_sport ^ seg->unit.tcp->th_sport)
> +        || (unit->tcp->th_dport ^ seg->unit.tcp->th_dport)) {
> +        chain->stat.no_match++;
> +        return RSC_NO_MATCH;
> +    }
> +
> +    return virtio_net_rsc_coalesce_data(chain, seg, buf, unit);
> +}
> +
> +/* Pakcets with 'SYN' should bypass, other flag should be sent after drain
> + * to prevent out of order */
> +static int virtio_net_rsc_tcp_ctrl_check(NetRscChain *chain,
> +                                         struct tcp_header *tcp)
> +{
> +    uint16_t tcp_hdr;
> +    uint16_t tcp_flag;
> +
> +    tcp_flag = htons(tcp->th_offset_flags);
> +    tcp_hdr = (tcp_flag & VIRTIO_NET_TCP_HDR_LENGTH) >> 10;
> +    tcp_flag &= VIRTIO_NET_TCP_FLAG;
> +    tcp_flag = htons(tcp->th_offset_flags) & 0x3F;
> +    if (tcp_flag & TH_SYN) {
> +        chain->stat.tcp_syn++;
> +        return RSC_BYPASS;
> +    }
> +
> +    if (tcp_flag & (TH_FIN | TH_URG | TH_RST)) {
> +        chain->stat.tcp_ctrl_drain++;
> +        return RSC_FINAL;
> +    }
> +
> +    if (tcp_hdr > sizeof(struct tcp_header)) {
> +        chain->stat.tcp_all_opt++;
> +        return RSC_FINAL;
> +    }
> +
> +    return RSC_WANT;
> +}
> +
> +static bool virtio_net_rsc_empty_cache(NetRscChain *chain, NetClientState *nc,
> +                                       const uint8_t *buf, size_t size)
> +{

The name is really confusing: it in fact caches the buf if the cache is
empty, so I guess what you mean is "virtio_net_rsc_cache_if_empty()"?
But the question is why we need a specific function like this at all; can
we simply add the logic to virtio_net_rsc_do_coalesce()? (It looks like
most of the logic is already there, except for the timer and the counter.)
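As a rough sketch of that suggestion (illustrative only, reusing the names from this patch), the empty-cache handling could sit at the top of virtio_net_rsc_do_coalesce() itself:

    static size_t virtio_net_rsc_do_coalesce(NetRscChain *chain, NetClientState *nc,
                                             const uint8_t *buf, size_t size,
                                             NetRscUnit *unit)
    {
        NetRscSeg *seg, *nseg;

        if (QTAILQ_EMPTY(&chain->buffers)) {
            /* First packet on an empty chain: cache it and arm the drain
             * timer right here, so no separate helper is needed. */
            chain->stat.empty_cache++;
            virtio_net_rsc_cache_buf(chain, nc, buf, size);
            timer_mod(chain->drain_timer,
                      qemu_clock_get_ns(QEMU_CLOCK_VIRTUAL) + chain->n->rsc_timeout);
            return size;
        }

        /* ... existing QTAILQ_FOREACH_SAFE() coalescing loop unchanged ... */
    }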

> +    if (QTAILQ_EMPTY(&chain->buffers)) {
> +        chain->stat.empty_cache++;
> +        virtio_net_rsc_cache_buf(chain, nc, buf, size);
> +        timer_mod(chain->drain_timer,
> +              qemu_clock_get_ns(QEMU_CLOCK_VIRTUAL) + chain->n->rsc_timeout);
> +        return 1;
> +    }
> +
> +    return 0;
> +}
> +
> +static size_t virtio_net_rsc_do_coalesce(NetRscChain *chain, NetClientState *nc,
> +                                         const uint8_t *buf, size_t size,
> +                                         NetRscUnit *unit)
> +{
> +    int ret;
> +    NetRscSeg *seg, *nseg;
> +
> +    QTAILQ_FOREACH_SAFE(seg, &chain->buffers, next, nseg) {
> +        ret = virtio_net_rsc_coalesce4(chain, seg, buf, size, unit);
> +
> +        if (ret == RSC_FINAL) {
> +            if (virtio_net_rsc_drain_seg(chain, seg) == 0) {
> +                /* Send failed */
> +                chain->stat.final_failed++;
> +                return 0;
> +            }
> +
> +            /* Send current packet */
> +            return virtio_net_do_receive(nc, buf, size);
> +        } else if (ret == RSC_NO_MATCH) {
> +            continue;
> +        } else {
> +            /* Coalesced, mark coalesced flag to tell calc cksum for ipv4 */
> +            seg->is_coalesced = 1;
> +            return size;
> +        }
> +    }
> +
> +    chain->stat.no_match_cache++;
> +    virtio_net_rsc_cache_buf(chain, nc, buf, size);
> +    return size;
> +}
> +
> +/* Drain a connection data, this is to avoid out of order segments */
> +static size_t virtio_net_rsc_drain_flow(NetRscChain *chain, NetClientState *nc,
> +                                        const uint8_t *buf, size_t size,
> +                                        uint16_t ip_start, uint16_t ip_size,
> +                                        uint16_t tcp_port, uint16_t port_size)
> +{
> +    NetRscSeg *seg, *nseg;
> +
> +    QTAILQ_FOREACH_SAFE(seg, &chain->buffers, next, nseg) {
> +        if (memcmp(buf + ip_start, seg->buf + ip_start, ip_size)
> +            || memcmp(buf + tcp_port, seg->buf + tcp_port, port_size)) {

The header comparison is rather similar to what you've done in 
virtio_net_rsc_coalesce4/6(). Any chance to reduce the duplication?
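One possible shape for such a helper (purely illustrative; the name and signature are invented here, reusing the offsets the callers already know):

    /* True when both frames belong to the same v4/v6 flow, comparing the
     * address and port bytes at the offsets supplied by the caller. */
    static bool virtio_net_rsc_same_flow(const uint8_t *a, const uint8_t *b,
                                         uint16_t ip_start, uint16_t ip_size,
                                         uint16_t tcp_port, uint16_t port_size)
    {
        return memcmp(a + ip_start, b + ip_start, ip_size) == 0 &&
               memcmp(a + tcp_port, b + tcp_port, port_size) == 0;
    }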

> +            continue;
> +        }
> +        if (virtio_net_rsc_drain_seg(chain, seg) == 0) {
> +            chain->stat.drain_failed++;
> +        }
> +
> +        break;
> +    }
> +
> +    return virtio_net_do_receive(nc, buf, size);
> +}
> +
> +static int32_t virtio_net_rsc_sanity_check4(NetRscChain *chain,
> +                                            struct ip_header *ip,
> +                                            const uint8_t *buf, size_t size)
> +{
> +    uint16_t ip_len;
> +    uint16_t hdr_len;
> +
> +    hdr_len = chain->n->guest_hdr_len;
> +    if (size < (hdr_len + sizeof(struct eth_header) + sizeof(struct ip_header)
> +        + sizeof(struct tcp_header))) {

It's interesting that you don't have a stat counter for this case.
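If a counter is wanted for it, one option (the field name is hypothetical, following the existing NetRscStat naming) would be:

    if (size < (hdr_len + sizeof(struct eth_header) + sizeof(struct ip_header)
        + sizeof(struct tcp_header))) {
        chain->stat.bypass_too_short++;   /* hypothetical new NetRscStat field */
        return RSC_BYPASS;
    }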

> +        return RSC_BYPASS;
> +    }
> +
> +    /* Not an ipv4 one */
> +    if (((ip->ip_ver_len & 0xF0) >> 4) != IP_HEADER_VERSION_4) {
> +        chain->stat.ip_option++;
> +        return RSC_BYPASS;
> +    }
> +
> +    /* Don't handle packets with ip option */
> +    if (VIRTIO_NET_IP4_HEADER_LENGTH != (ip->ip_ver_len & 0xF)) {

(ip-> ...) != VIRTIO_NET_IP4_HEADER_LENGTH

> +        chain->stat.ip_option++;
> +        return RSC_BYPASS;
> +    }
> +
> +    if (ip->ip_p != IPPROTO_TCP) {
> +        chain->stat.bypass_not_tcp++;
> +        return RSC_BYPASS;
> +    }
> +
> +    /* Don't handle packets with ip fragment */
> +    if (!(htons(ip->ip_off) & IP_DF)) {
> +        chain->stat.ip_frag++;
> +        return RSC_BYPASS;
> +    }
> +
> +    ip_len = htons(ip->ip_len);
> +    if (ip_len < (sizeof(struct ip_header) + sizeof(struct tcp_header))
> +        || ip_len > (size - hdr_len - sizeof(struct eth_header))) {
> +        chain->stat.ip_hacked++;
> +        return RSC_BYPASS;
> +    }
> +
> +    return RSC_WANT;
> +}
> +
> +static size_t virtio_net_rsc_receive4(NetRscChain *chain, NetClientState* nc,
> +                                      const uint8_t *buf, size_t size)
> +{
> +    int32_t ret;
> +    uint16_t hdr_len;
> +    NetRscUnit unit;
> +
> +    hdr_len = ((VirtIONet *)(chain->n))->guest_hdr_len;
> +    virtio_net_rsc_extract_unit4(chain, buf, &unit);

Is it safe to do the header analysis before doing the sanity check?

> +    if (RSC_WANT != virtio_net_rsc_sanity_check4(chain, unit.ip, buf, size)) {

Please write this as virtio_net_rsc_sanity_check4(...) != RSC_WANT, with the
function call on the left. I've pointed out this kind of style issue several times.
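Putting both remarks together, the entry of virtio_net_rsc_receive4() could look roughly like this (a sketch only, not a replacement for the patch): compute the ip pointer, run the sanity check, and only then parse the TCP fields:

    uint16_t hdr_len;
    NetRscUnit unit;
    struct ip_header *ip;

    hdr_len = ((VirtIONet *)(chain->n))->guest_hdr_len;
    ip = (struct ip_header *)(buf + hdr_len + sizeof(struct eth_header));
    if (virtio_net_rsc_sanity_check4(chain, ip, buf, size) != RSC_WANT) {
        return virtio_net_do_receive(nc, buf, size);
    }
    /* Only now is it safe to look at the TCP header. */
    virtio_net_rsc_extract_unit4(chain, buf, &unit);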

> +        return virtio_net_do_receive(nc, buf, size);
> +    }
> +
> +    ret = virtio_net_rsc_tcp_ctrl_check(chain, unit.tcp);
> +    if (ret == RSC_BYPASS) {
> +        return virtio_net_do_receive(nc, buf, size);
> +    } else if (ret == RSC_FINAL) {
> +        return virtio_net_rsc_drain_flow(chain, nc, buf, size,
> +                ((hdr_len + sizeof(struct eth_header)) + 12),
> +                VIRTIO_NET_IP4_ADDR_SIZE,
> +                hdr_len + sizeof(struct eth_header) + sizeof(struct ip_header),
> +                VIRTIO_NET_TCP_PORT_SIZE);
> +    }
> +
> +    if (virtio_net_rsc_empty_cache(chain, nc, buf, size)) {
> +        return size;
> +    }
> +
> +    return virtio_net_rsc_do_coalesce(chain, nc, buf, size, &unit);
> +}
> +
> +static NetRscChain *virtio_net_rsc_lookup_chain(VirtIONet * n,
> +                                                NetClientState *nc,
> +                                                uint16_t proto)
> +{
> +    NetRscChain *chain;
> +
> +    if (proto != (uint16_t)ETH_P_IP) {
> +        return NULL;
> +    }
> +
> +    QTAILQ_FOREACH(chain, &n->rsc_chains, next) {
> +        if (chain->proto == proto) {
> +            return chain;
> +        }
> +    }
> +
> +    chain = g_malloc(sizeof(*chain));
> +    chain->n = n;
> +    chain->proto = proto;
> +    chain->max_payload = VIRTIO_NET_MAX_IP4_PAYLOAD;
> +    chain->gso_type = VIRTIO_NET_HDR_GSO_TCPV4;
> +    chain->drain_timer = timer_new_ns(QEMU_CLOCK_VIRTUAL,
> +                                      virtio_net_rsc_purge, chain);
> +    memset(&chain->stat, 0, sizeof(chain->stat));
> +
> +    QTAILQ_INIT(&chain->buffers);
> +    QTAILQ_INSERT_TAIL(&n->rsc_chains, chain, next);
> +
> +    return chain;
> +}
> +
> +static ssize_t virtio_net_rsc_receive(NetClientState *nc,
> +                                      const uint8_t *buf, size_t size)
> +{
> +    uint16_t proto;
> +    NetRscChain *chain;
> +    struct eth_header *eth;
> +    VirtIONet *n;
> +
> +    n = qemu_get_nic_opaque(nc);
> +    if (size < (n->guest_hdr_len + sizeof(struct eth_header))) {

Shouldn't this use host_hdr_len?

> +        return virtio_net_do_receive(nc, buf, size);
> +    }
> +
> +    eth = (struct eth_header *)(buf + n->guest_hdr_len);
> +    proto = htons(eth->h_proto);
> +
> +    chain = virtio_net_rsc_lookup_chain(n, nc, proto);
> +    if (!chain) {
> +        return virtio_net_do_receive(nc, buf, size);
> +    } else {
> +        chain->stat.received++;
> +        return virtio_net_rsc_receive4(chain, nc, buf, size);
> +    }
> +}
> +
> +static ssize_t virtio_net_receive(NetClientState *nc,
> +                                  const uint8_t *buf, size_t size)
> +{
> +    VirtIONet *n;
> +
> +    n = qemu_get_nic_opaque(nc);
> +    if (n->host_features & (1ULL << VIRTIO_NET_F_GUEST_RSC)) {
> +        return virtio_net_rsc_receive(nc, buf, size);
> +    } else {
> +        return virtio_net_do_receive(nc, buf, size);
> +    }
> +}
> +
>   static NetClientInfo net_virtio_info = {
>       .type = NET_CLIENT_OPTIONS_KIND_NIC,
>       .size = sizeof(NICState),
> @@ -1814,6 +2304,7 @@ static void virtio_net_device_realize(DeviceState *dev, Error **errp)
>       nc = qemu_get_queue(n->nic);
>       nc->rxfilter_notify_enabled = 1;
>   
> +    QTAILQ_INIT(&n->rsc_chains);
>       n->qdev = dev;
>       register_savevm(dev, "virtio-net", -1, VIRTIO_NET_VM_VERSION,
>                       virtio_net_save, virtio_net_load, n);
> @@ -1848,6 +2339,7 @@ static void virtio_net_device_unrealize(DeviceState *dev, Error **errp)
>       g_free(n->vqs);
>       qemu_del_nic(n->nic);
>       virtio_cleanup(vdev);
> +    virtio_net_rsc_cleanup(n);
>   }
>   
>   static void virtio_net_instance_init(Object *obj)
> @@ -1909,6 +2401,10 @@ static Property virtio_net_properties[] = {
>                          TX_TIMER_INTERVAL),
>       DEFINE_PROP_INT32("x-txburst", VirtIONet, net_conf.txburst, TX_BURST),
>       DEFINE_PROP_STRING("tx", VirtIONet, net_conf.tx),
> +    DEFINE_PROP_BIT("guest_rsc", VirtIONet, host_features,
> +                    VIRTIO_NET_F_GUEST_RSC, false),
> +    DEFINE_PROP_UINT32("rsc_interval", VirtIONet, rsc_timeout,
> +                      VIRTIO_NET_RSC_INTERVAL),
>       DEFINE_PROP_END_OF_LIST(),
>   };
>   
> diff --git a/include/hw/virtio/virtio-net.h b/include/hw/virtio/virtio-net.h
> index 0cabdb6..e995190 100644
> --- a/include/hw/virtio/virtio-net.h
> +++ b/include/hw/virtio/virtio-net.h
> @@ -59,6 +59,8 @@ typedef struct VirtIONet {
>       VirtIONetQueue *vqs;
>       VirtQueue *ctrl_vq;
>       NICState *nic;
> +    QTAILQ_HEAD(, NetRscChain) rsc_chains;
> +    uint32_t rsc_timeout;
>       uint32_t tx_timeout;
>       int32_t tx_burst;
>       uint32_t has_vnet_hdr;
> diff --git a/include/hw/virtio/virtio.h b/include/hw/virtio/virtio.h
> index 6a37065..a7cb84a 100644
> --- a/include/hw/virtio/virtio.h
> +++ b/include/hw/virtio/virtio.h
> @@ -30,6 +30,8 @@
>                                   (0x1ULL << VIRTIO_F_ANY_LAYOUT))
>   
>   struct VirtQueue;
> +struct VirtIONet;
> +typedef struct VirtIONet VirtIONet;
>   
>   static inline hwaddr vring_align(hwaddr addr,
>                                                unsigned long align)
> @@ -128,6 +130,79 @@ typedef struct VirtioDeviceClass {
>       int (*load)(VirtIODevice *vdev, QEMUFile *f, int version_id);
>   } VirtioDeviceClass;
>   
> +/* Coalesced packets type & status */
> +typedef enum {
> +    RSC_COALESCE,           /* Data been coalesced */
> +    RSC_FINAL,              /* Will terminate current connection */
> +    RSC_NO_MATCH,           /* No matched in the buffer pool */
> +    RSC_BYPASS,             /* Packet to be bypass, not tcp, tcp ctrl, etc */
> +    RSC_WANT                /* Data want to be coalesced */
> +} COALESCE_STATUS;
> +
> +typedef struct NetRscStat {
> +    uint32_t received;
> +    uint32_t coalesced;
> +    uint32_t over_size;
> +    uint32_t cache;
> +    uint32_t empty_cache;
> +    uint32_t no_match_cache;
> +    uint32_t win_update;
> +    uint32_t no_match;
> +    uint32_t tcp_syn;
> +    uint32_t tcp_ctrl_drain;
> +    uint32_t dup_ack;
> +    uint32_t dup_ack1;
> +    uint32_t dup_ack2;
> +    uint32_t pure_ack;
> +    uint32_t ack_out_of_win;
> +    uint32_t data_out_of_win;
> +    uint32_t data_out_of_order;
> +    uint32_t data_after_pure_ack;
> +    uint32_t bypass_not_tcp;
> +    uint32_t tcp_option;
> +    uint32_t tcp_all_opt;
> +    uint32_t ip_frag;
> +    uint32_t ip_hacked;
> +    uint32_t ip_option;
> +    uint32_t purge_failed;
> +    uint32_t drain_failed;
> +    uint32_t final_failed;
> +    int64_t  timer;
> +} NetRscStat;
> +
> +/* Rsc unit general info used to check whether a packet can be coalesced */
> +typedef struct NetRscUnit {
> +    void *ip;   /* ip header */
> +    uint16_t *ip_plen;      /* data len pointer in ip header field */
> +    struct tcp_header *tcp; /* tcp header */
> +    uint16_t tcp_hdrlen;    /* tcp header len */
> +    uint16_t payload;       /* pure payload without virtio/eth/ip/tcp */
> +} NetRscUnit;
> +
> +/* Coalesced segment */
> +typedef struct NetRscSeg {
> +    QTAILQ_ENTRY(NetRscSeg) next;
> +    void *buf;
> +    size_t size;
> +    uint16_t packets;
> +    uint16_t dup_ack;
> +    bool is_coalesced;      /* need recal ipv4 header checksum, mark here */
> +    NetRscUnit unit;
> +    NetClientState *nc;
> +} NetRscSeg;
> +
> +/* Chain is divided by protocol(ipv4/v6) and NetClientInfo */
> +typedef struct NetRscChain {
> +    QTAILQ_ENTRY(NetRscChain) next;
> +    VirtIONet *n;                            /* VirtIONet */
> +    uint16_t proto;
> +    uint8_t  gso_type;
> +    uint16_t max_payload;
> +    QEMUTimer *drain_timer;
> +    QTAILQ_HEAD(, NetRscSeg) buffers;
> +    NetRscStat stat;
> +} NetRscChain;
> +
>   void virtio_instance_init_common(Object *proxy_obj, void *data,
>                                    size_t vdev_size, const char *vdev_name);
>   
> diff --git a/include/standard-headers/linux/virtio_net.h b/include/standard-headers/linux/virtio_net.h
> index a78f33e..5b95762 100644
> --- a/include/standard-headers/linux/virtio_net.h
> +++ b/include/standard-headers/linux/virtio_net.h
> @@ -55,6 +55,7 @@
>   #define VIRTIO_NET_F_MQ	22	/* Device supports Receive Flow
>   					 * Steering */
>   #define VIRTIO_NET_F_CTRL_MAC_ADDR 23	/* Set MAC address */
> +#define VIRTIO_NET_F_GUEST_RSC  24      /* Guest can coalesce tcp packets */

The comment should be "Guest can receive coalesced tcp packets" ?

>   
>   #ifndef VIRTIO_NET_NO_LEGACY
>   #define VIRTIO_NET_F_GSO	6	/* Host handles pkts w/ any GSO type */

^ permalink raw reply	[flat|nested] 12+ messages in thread

* Re: [Qemu-devel] [ RFC Patch v6 0/2] Support Receive-Segment-Offload(RSC) for WHQL
  2016-05-28 16:37 [Qemu-devel] [ RFC Patch v6 0/2] Support Receive-Segment-Offload(RSC) for WHQL wexu
                   ` (2 preceding siblings ...)
  2016-05-28 16:37 ` [Qemu-devel] [ RFC Patch v6 3/3] virtio-net rsc: add 2 new rsc information fields to 'virtio_net_hdr' wexu
@ 2016-05-30  4:22 ` Jason Wang
  2016-05-30  4:50   ` Wei Xu
  3 siblings, 1 reply; 12+ messages in thread
From: Jason Wang @ 2016-05-30  4:22 UTC (permalink / raw)
  To: wexu, qemu-devel; +Cc: victork, mst, yvugenfi, marcel, dfleytma



On 2016年05月29日 00:37, wexu@redhat.com wrote:
> From: Wei Xu <wexu@redhat.com>
>
> Changes in V6:
> - Sync upstream code
> - Split new fields in 'virtio_net_hdr' to a seperate patch
> - Remove feature bit code, replace it with a command line parameter 'guest_rsc'
> which is turned off by default.
>
> Changes in V5:
> - Passed all IPv4/6 test cases
> - Add new fields in 'virtio_net_hdr'
> - Set 'gso_type' & 'coalesced packets' in new field.
> - Bypass all 'tcp option' packet
> - Bypass all 'pure ack' packet
> - Bypass all 'duplicate ack' packet
> - Change 'guest_rsc' feature bit to 'false' by default
> - Feedbacks from v4, typo, etc.

A change-log is very important to ease and speed up review, and more
details are more than welcome. But I see some changes were not
documented here; please give a more complete one in the next iteration.

>
> Note:
> There is still a few pending issues about the feature bit, and need to be
> discussed with windows driver maintainer, so linux guests with this patch
> won't work at current, haven't figure it out yet, but i'm guessing it's
> caused by the 'gso_type' is set to 'VIRTIO_NET_HDR_GSO_TCPV4/6',
> will fix it after get the final solution, the below test steps and
> performance data is based on v4.

This is probably because you've increased the vnet header length.
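For reference, a standalone size check (not part of the patch; the struct names are invented here and assume the packed legacy layout) illustrates the mismatch — the legacy header is 10 bytes and the two new __virtio16 fields grow it to 14, so a guest that negotiated the old layout reads 4 bytes of packet data as header:

    #include <stdint.h>
    #include <stdio.h>

    struct vnet_hdr_legacy {            /* same layout as struct virtio_net_hdr */
        uint8_t  flags, gso_type;
        uint16_t hdr_len, gso_size, csum_start, csum_offset;
    };

    struct vnet_hdr_rsc {               /* with the two fields added by this series */
        uint8_t  flags, gso_type;
        uint16_t hdr_len, gso_size, csum_start, csum_offset;
        uint16_t coalesced, dup_ack;
    };

    int main(void)
    {
        /* Prints "legacy: 10 bytes, rsc: 14 bytes" on common ABIs. */
        printf("legacy: %zu bytes, rsc: %zu bytes\n",
               sizeof(struct vnet_hdr_legacy), sizeof(struct vnet_hdr_rsc));
        return 0;
    }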

>
> Another suggestion from Jason is to adjust part of the code to make it
> more readable, since there maybe still few change about the flowchart
> in the future, such as timestamp, duplicate ack, so i'd like to delay it
> temporarily.
>
> Changes in V4:
> - Add new host feature bit
> - Replace using fixed header lenght with dynamic header lenght in VirtIONet
> - Change ip/ip6 header union in NetRscUnit to void* pointer
> - Add macro prefix, adjust code indent, etc.
>
> Changes in V3:
> - Removed big param list, replace it with 'NetRscUnit'
> - Different virtio header size
> - Modify callback function to direct call.
> - Needn't check the failure of g_malloc()
> - Other code format adjustment, macro naming, etc
>
> Changes in V2:
> - Add detailed commit log
>
> This patch is to support WHQL test for Windows guest, while this feature also
> benifits other guest works as a kernel 'gro' like feature with userspace
> implementation.
> Feature information:
>    http://msdn.microsoft.com/en-us/library/windows/hardware/jj853324
>
> Both IPv4 and IPv6 are supported, though performance with userspace virtio
> is slow than vhost-net, there is about 1.5x to 2x performance improvement to
> userspace virtio, this is done by turning this feature on and disable
> 'tso/gso/gro' on corresponding tap interface and guest interface, while get
> less improment with all these feature on.
>
> Linux guest performance data(Netperf):
> MIGRATED TCP STREAM TEST from 0.0.0.0 (0.0.0.0) port 0 AF_INET to 192.168.2.101 () port 0 AF_INET : nodelay
> Size   Size    Size     Time     Throughput
> bytes  bytes   bytes    secs.    10^6bits/sec
>
>   87380  16384     64    6.00     1221.20
>   87380  16384     64    6.00     1260.30
>
>   87380  16384    128    6.00     1978.51
>   87380  16384    128    6.00     2286.05
>
>   87380  16384    256    6.00     2677.94
>   87380  16384    256    6.00     4615.42
>
>   87380  16384    512    6.00     2956.54
>   87380  16384    512    6.00     5356.39
>
>   87380  16384   1024    6.00     2798.17
>   87380  16384   1024    6.00     4943.30
>
>   87380  16384   2048    6.00     2681.09
>   87380  16384   2048    6.00     4835.81
>
>   87380  16384   4096    6.00     3390.14
>   87380  16384   4096    6.00     5391.54
>
>   87380  16384   8092    6.00     3008.27
>   87380  16384   8092    6.00     5381.68
>
>   87380  16384  10240    6.00     2999.89
>   87380  16384  10240    6.00     5393.11
>
> Test steps:
> Although this feature is mainly used for window guest, i used linux guest to
> help test the feature, to make things simple, i used 3 steps to test the patch
> as i moved on.
>
> 1. With a tcp socket client/server pair running on 2 linux guest, thus i can
> control
> the traffic and debugging the code as i want.
> 2. Netperf on linux guest test the throughput.
> 3. WHQL test with 2 Windows guests.
>
> Wei Xu (3):
>    virtio-net rsc: support coalescing ipv4 tcp traffic
>    virtio-net rsc: support coalescing ipv6 tcp traffic
>    virtio-net rsc: add 2 new rsc information fields to 'virtio_net_hdr'
>
>   hw/net/virtio-net.c                         | 642 +++++++++++++++++++++++++++-
>   include/hw/virtio/virtio-net.h              |   2 +
>   include/hw/virtio/virtio.h                  |  75 ++++
>   include/standard-headers/linux/virtio_net.h |   3 +
>   4 files changed, 721 insertions(+), 1 deletion(-)
>

^ permalink raw reply	[flat|nested] 12+ messages in thread

* Re: [Qemu-devel] [ RFC Patch v6 2/3] virtio-net rsc: support coalescing ipv6 tcp traffic
  2016-05-28 16:37 ` [Qemu-devel] [ RFC Patch v6 2/3] virtio-net rsc: support coalescing ipv6 " wexu
@ 2016-05-30  4:25   ` Jason Wang
  2016-06-02 16:17     ` Wei Xu
  0 siblings, 1 reply; 12+ messages in thread
From: Jason Wang @ 2016-05-30  4:25 UTC (permalink / raw)
  To: wexu, qemu-devel; +Cc: victork, mst, yvugenfi, marcel, dfleytma



On 2016年05月29日 00:37, wexu@redhat.com wrote:
> From: Wei Xu <wexu@redhat.com>
>
> Most of this works like ipv4; there are 2 differences between ipv4 and ipv6.
>
> 1. The fragment length in the ipv4 header includes the header itself, while
> it is not included for ipv6, which means ipv6 can carry a real '65535' payload.
>
> 2. The IPv6 header does not need a header checksum calculation.
>
> Signed-off-by: Wei Xu <wexu@redhat.com>
> ---
>   hw/net/virtio-net.c | 152 +++++++++++++++++++++++++++++++++++++++++++++++++---
>   1 file changed, 144 insertions(+), 8 deletions(-)
>
> diff --git a/hw/net/virtio-net.c b/hw/net/virtio-net.c
> index b3bb63b..cc8cbe4 100644
> --- a/hw/net/virtio-net.c
> +++ b/hw/net/virtio-net.c
> @@ -53,6 +53,10 @@
>   /* header length value in ip header without option */
>   #define VIRTIO_NET_IP4_HEADER_LENGTH 5
>   
> +#define ETH_IP6_HDR_SZ (ETH_HDR_SZ + IP6_HDR_SZ)
> +#define VIRTIO_NET_IP6_ADDR_SIZE   32      /* ipv6 saddr + daddr */
> +#define VIRTIO_NET_MAX_IP6_PAYLOAD VIRTIO_NET_MAX_TCP_PAYLOAD
> +
>   /* Purge coalesced packets timer interval, This value affects the performance
>      a lot, and should be tuned carefully, '300000'(300us) is the recommended
>      value to pass the WHQL test, '50000' can gain 2x netperf throughput with
> @@ -1724,6 +1728,25 @@ static void virtio_net_rsc_extract_unit4(NetRscChain *chain,
>       unit->payload = htons(*unit->ip_plen) - ip_hdrlen - unit->tcp_hdrlen;
>   }
>   
> +static void virtio_net_rsc_extract_unit6(NetRscChain *chain,
> +                                         const uint8_t *buf, NetRscUnit* unit)
> +{
> +    uint16_t hdr_len;
> +    struct ip6_header *ip6;
> +
> +    hdr_len = ((VirtIONet *)(chain->n))->guest_hdr_len;
> +    ip6 = (struct ip6_header *)(buf + hdr_len + sizeof(struct eth_header));
> +    unit->ip = ip6;
> +    unit->ip_plen = &(ip6->ip6_ctlun.ip6_un1.ip6_un1_plen);
> +    unit->tcp = (struct tcp_header *)(((uint8_t *)unit->ip)\
> +                                        + sizeof(struct ip6_header));
> +    unit->tcp_hdrlen = (htons(unit->tcp->th_offset_flags) & 0xF000) >> 10;
> +
> +    /* There is a difference between payload length in ipv4 and v6,
> +       ip header is excluded in ipv6 */
> +    unit->payload = htons(*unit->ip_plen) - unit->tcp_hdrlen;
> +}
> +
>   static void virtio_net_rsc_ipv4_checksum(struct virtio_net_hdr *vhdr,
>                                            struct ip_header *ip)
>   {
> @@ -1742,7 +1765,9 @@ static size_t virtio_net_rsc_drain_seg(NetRscChain *chain, NetRscSeg *seg)
>       struct virtio_net_hdr *h;
>   
>       h = (struct virtio_net_hdr *)seg->buf;
> -    virtio_net_rsc_ipv4_checksum(h, seg->unit.ip);
> +    if ((chain->proto == ETH_P_IP) && seg->is_coalesced) {
> +        virtio_net_rsc_ipv4_checksum(h, seg->unit.ip);
> +    }
>       ret = virtio_net_do_receive(seg->nc, seg->buf, seg->size);
>       QTAILQ_REMOVE(&chain->buffers, seg, next);
>       g_free(seg->buf);
> @@ -1798,7 +1823,7 @@ static void virtio_net_rsc_cache_buf(NetRscChain *chain, NetClientState *nc,
>       hdr_len = chain->n->guest_hdr_len;
>       seg = g_malloc(sizeof(NetRscSeg));
>       seg->buf = g_malloc(hdr_len + sizeof(struct eth_header)\
> -                   + VIRTIO_NET_MAX_TCP_PAYLOAD);
> +                   + sizeof(struct ip6_header) + VIRTIO_NET_MAX_TCP_PAYLOAD);
>       memcpy(seg->buf, buf, size);
>       seg->size = size;
>       seg->packets = 1;
> @@ -1809,7 +1834,18 @@ static void virtio_net_rsc_cache_buf(NetRscChain *chain, NetClientState *nc,
>       QTAILQ_INSERT_TAIL(&chain->buffers, seg, next);
>       chain->stat.cache++;
>   
> -    virtio_net_rsc_extract_unit4(chain, seg->buf, &seg->unit);
> +    switch (chain->proto) {
> +    case ETH_P_IP:
> +        virtio_net_rsc_extract_unit4(chain, seg->buf, &seg->unit);
> +        break;
> +
> +    case ETH_P_IPV6:
> +        virtio_net_rsc_extract_unit6(chain, seg->buf, &seg->unit);
> +        break;
> +
> +    default:
> +        g_assert_not_reached();
> +    }
>   }
>   
>   static int32_t virtio_net_rsc_handle_ack(NetRscChain *chain,
> @@ -1929,6 +1965,24 @@ static int32_t virtio_net_rsc_coalesce4(NetRscChain *chain, NetRscSeg *seg,
>       return virtio_net_rsc_coalesce_data(chain, seg, buf, unit);
>   }
>   
> +static int32_t virtio_net_rsc_coalesce6(NetRscChain *chain, NetRscSeg *seg,
> +                        const uint8_t *buf, size_t size, NetRscUnit *unit)
> +{
> +    struct ip6_header *ip1, *ip2;
> +
> +    ip1 = (struct ip6_header *)(unit->ip);
> +    ip2 = (struct ip6_header *)(seg->unit.ip);
> +    if (memcmp(&ip1->ip6_src, &ip2->ip6_src, sizeof(struct in6_address))
> +        || memcmp(&ip1->ip6_dst, &ip2->ip6_dst, sizeof(struct in6_address))
> +        || (unit->tcp->th_sport ^ seg->unit.tcp->th_sport)
> +        || (unit->tcp->th_dport ^ seg->unit.tcp->th_dport)) {
> +            chain->stat.no_match++;
> +            return RSC_NO_MATCH;
> +    }
> +
> +    return virtio_net_rsc_coalesce_data(chain, seg, buf, unit);
> +}
> +
>   /* Pakcets with 'SYN' should bypass, other flag should be sent after drain
>    * to prevent out of order */
>   static int virtio_net_rsc_tcp_ctrl_check(NetRscChain *chain,
> @@ -1981,7 +2035,11 @@ static size_t virtio_net_rsc_do_coalesce(NetRscChain *chain, NetClientState *nc,
>       NetRscSeg *seg, *nseg;
>   
>       QTAILQ_FOREACH_SAFE(seg, &chain->buffers, next, nseg) {
> -        ret = virtio_net_rsc_coalesce4(chain, seg, buf, size, unit);
> +        if (chain->proto == ETH_P_IP) {
> +            ret = virtio_net_rsc_coalesce4(chain, seg, buf, size, unit);
> +        } else {
> +            ret = virtio_net_rsc_coalesce6(chain, seg, buf, size, unit);
> +        }
>   
>           if (ret == RSC_FINAL) {
>               if (virtio_net_rsc_drain_seg(chain, seg) == 0) {
> @@ -2106,13 +2164,82 @@ static size_t virtio_net_rsc_receive4(NetRscChain *chain, NetClientState* nc,
>       return virtio_net_rsc_do_coalesce(chain, nc, buf, size, &unit);
>   }
>   
> +static int32_t virtio_net_rsc_sanity_check6(NetRscChain *chain,
> +                                            struct ip6_header *ip6,
> +                                            const uint8_t *buf, size_t size)
> +{
> +    uint16_t ip_len;
> +    uint16_t hdr_len;
> +
> +    hdr_len = ((VirtIONet *)(chain->n))->guest_hdr_len;
> +    if (size < (hdr_len + sizeof(struct eth_header) + sizeof(struct ip6_header)
> +        + sizeof(tcp_header))) {
> +        return RSC_BYPASS;
> +    }
> +
> +    if (((ip6->ip6_ctlun.ip6_un1.ip6_un1_flow & 0xF0) >> 4)
> +        != IP_HEADER_VERSION_6) {
> +        return RSC_BYPASS;
> +    }
> +
> +    /* Both option and protocol is checked in this */
> +    if (ip6->ip6_ctlun.ip6_un1.ip6_un1_nxt != IPPROTO_TCP) {
> +        chain->stat.bypass_not_tcp++;
> +        return RSC_BYPASS;
> +    }
> +
> +    ip_len = htons(ip6->ip6_ctlun.ip6_un1.ip6_un1_plen);
> +    if (ip_len < sizeof(struct tcp_header)
> +        || ip_len > (size - hdr_len - sizeof(struct eth_header)
> +                     - sizeof(struct ip6_header))) {
> +        chain->stat.ip_hacked++;
> +        return RSC_BYPASS;
> +    }
> +
> +    return RSC_WANT;
> +}
> +
> +static size_t virtio_net_rsc_receive6(void *opq, NetClientState* nc,
> +                                      const uint8_t *buf, size_t size)
> +{
> +    int32_t ret;
> +    uint16_t hdr_len;
> +    NetRscChain *chain;
> +    NetRscUnit unit;
> +
> +    chain = (NetRscChain *)opq;
> +    hdr_len = ((VirtIONet *)(chain->n))->guest_hdr_len;
> +    virtio_net_rsc_extract_unit6(chain, buf, &unit);

Same issue as ipv4: it looks unsafe to analyze the header before doing the
sanity check.
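The same reordering suggested for the ipv4 path would apply here (sketch only): compute the ip6 pointer first, run virtio_net_rsc_sanity_check6() on it, and only then extract the unit:

    struct ip6_header *ip6;

    hdr_len = ((VirtIONet *)(chain->n))->guest_hdr_len;
    ip6 = (struct ip6_header *)(buf + hdr_len + sizeof(struct eth_header));
    if (virtio_net_rsc_sanity_check6(chain, ip6, buf, size) != RSC_WANT) {
        return virtio_net_do_receive(nc, buf, size);
    }
    virtio_net_rsc_extract_unit6(chain, buf, &unit);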

> +    if (RSC_WANT != virtio_net_rsc_sanity_check6(chain,
> +                                                 unit.ip, buf, size)) {
> +        return virtio_net_do_receive(nc, buf, size);
> +    }
> +
> +    ret = virtio_net_rsc_tcp_ctrl_check(chain, unit.tcp);
> +    if (ret == RSC_BYPASS) {
> +        return virtio_net_do_receive(nc, buf, size);
> +    } else if (ret == RSC_FINAL) {
> +        return virtio_net_rsc_drain_flow(chain, nc, buf, size,
> +                ((hdr_len + sizeof(struct eth_header)) + 8),
> +                VIRTIO_NET_IP6_ADDR_SIZE,
> +                hdr_len + sizeof(struct eth_header) + sizeof(struct ip6_header),
> +                VIRTIO_NET_TCP_PORT_SIZE);
> +    }
> +
> +    if (virtio_net_rsc_empty_cache(chain, nc, buf, size)) {
> +        return size;
> +    }
> +
> +    return virtio_net_rsc_do_coalesce(chain, nc, buf, size, &unit);
> +}
> +
>   static NetRscChain *virtio_net_rsc_lookup_chain(VirtIONet * n,
>                                                   NetClientState *nc,
>                                                   uint16_t proto)
>   {
>       NetRscChain *chain;
>   
> -    if (proto != (uint16_t)ETH_P_IP) {
> +    if ((proto != (uint16_t)ETH_P_IP) && (proto != (uint16_t)ETH_P_IPV6)) {
>           return NULL;
>       }
>   
> @@ -2125,8 +2252,13 @@ static NetRscChain *virtio_net_rsc_lookup_chain(VirtIONet * n,
>       chain = g_malloc(sizeof(*chain));
>       chain->n = n;
>       chain->proto = proto;
> -    chain->max_payload = VIRTIO_NET_MAX_IP4_PAYLOAD;
> -    chain->gso_type = VIRTIO_NET_HDR_GSO_TCPV4;
> +    if (proto == (uint16_t)ETH_P_IP) {
> +        chain->max_payload = VIRTIO_NET_MAX_IP4_PAYLOAD;
> +        chain->gso_type = VIRTIO_NET_HDR_GSO_TCPV4;
> +    } else {
> +        chain->max_payload = VIRTIO_NET_MAX_IP6_PAYLOAD;
> +        chain->gso_type = VIRTIO_NET_HDR_GSO_TCPV6;
> +    }
>       chain->drain_timer = timer_new_ns(QEMU_CLOCK_VIRTUAL,
>                                         virtio_net_rsc_purge, chain);
>       memset(&chain->stat, 0, sizeof(chain->stat));
> @@ -2158,7 +2290,11 @@ static ssize_t virtio_net_rsc_receive(NetClientState *nc,
>           return virtio_net_do_receive(nc, buf, size);
>       } else {
>           chain->stat.received++;
> -        return virtio_net_rsc_receive4(chain, nc, buf, size);
> +        if (proto == (uint16_t)ETH_P_IP) {
> +            return virtio_net_rsc_receive4(chain, nc, buf, size);
> +        } else  {
> +            return virtio_net_rsc_receive6(chain, nc, buf, size);
> +        }
>       }
>   }
>   

^ permalink raw reply	[flat|nested] 12+ messages in thread

* Re: [Qemu-devel] [ RFC Patch v6 0/2] Support Receive-Segment-Offload(RSC) for WHQL
  2016-05-30  4:22 ` [Qemu-devel] [ RFC Patch v6 0/2] Support Receive-Segment-Offload(RSC) for WHQL Jason Wang
@ 2016-05-30  4:50   ` Wei Xu
  0 siblings, 0 replies; 12+ messages in thread
From: Wei Xu @ 2016-05-30  4:50 UTC (permalink / raw)
  To: Jason Wang, qemu-devel; +Cc: victork, mst, yvugenfi, marcel, dfleytma


On 2016年05月30日 12:22, Jason Wang wrote:
>
>
> On 2016年05月29日 00:37, wexu@redhat.com wrote:
>> From: Wei Xu <wexu@redhat.com>
>>
>> Changes in V6:
>> - Sync upstream code
>> - Split new fields in 'virtio_net_hdr' to a seperate patch
>> - Remove feature bit code, replace it with a command line parameter
>> 'guest_rsc'
>> which is turned off by default.
>>
>> Changes in V5:
>> - Passed all IPv4/6 test cases
>> - Add new fields in 'virtio_net_hdr'
>> - Set 'gso_type' & 'coalesced packets' in new field.
>> - Bypass all 'tcp option' packet
>> - Bypass all 'pure ack' packet
>> - Bypass all 'duplicate ack' packet
>> - Change 'guest_rsc' feature bit to 'false' by default
>> - Feedbacks from v4, typo, etc.
>
> Change-log is very important for the ease and speed up reviewers. More
> details are more than welcomed. But I see some changes were not
> documented here. Please give a more complete one in next iteration.
OK.
>
>>
>> Note:
>> There is still a few pending issues about the feature bit, and need to be
>> discussed with windows driver maintainer, so linux guests with this patch
>> won't work at current, haven't figure it out yet, but i'm guessing it's
>> caused by the 'gso_type' is set to 'VIRTIO_NET_HDR_GSO_TCPV4/6',
>> will fix it after get the final solution, the below test steps and
>> performance data is based on v4.
>
> This is probably because you've increased the vnet header length.
>
>>
>> Another suggestion from Jason is to adjust part of the code to make it
>> more readable, since there maybe still few change about the flowchart
>> in the future, such as timestamp, duplicate ack, so i'd like to delay it
>> temporarily.
>>
>> Changes in V4:
>> - Add new host feature bit
>> - Replace using fixed header lenght with dynamic header lenght in
>> VirtIONet
>> - Change ip/ip6 header union in NetRscUnit to void* pointer
>> - Add macro prefix, adjust code indent, etc.
>>
>> Changes in V3:
>> - Removed big param list, replace it with 'NetRscUnit'
>> - Different virtio header size
>> - Modify callback function to direct call.
>> - Needn't check the failure of g_malloc()
>> - Other code format adjustment, macro naming, etc
>>
>> Changes in V2:
>> - Add detailed commit log
>>
>> This patch is to support WHQL test for Windows guest, while this
>> feature also
>> benifits other guest works as a kernel 'gro' like feature with userspace
>> implementation.
>> Feature information:
>>    http://msdn.microsoft.com/en-us/library/windows/hardware/jj853324
>>
>> Both IPv4 and IPv6 are supported, though performance with userspace
>> virtio
>> is slow than vhost-net, there is about 1.5x to 2x performance
>> improvement to
>> userspace virtio, this is done by turning this feature on and disable
>> 'tso/gso/gro' on corresponding tap interface and guest interface,
>> while get
>> less improment with all these feature on.
>>
>> Linux guest performance data(Netperf):
>> MIGRATED TCP STREAM TEST from 0.0.0.0 (0.0.0.0) port 0 AF_INET to
>> 192.168.2.101 () port 0 AF_INET : nodelay
>> Size   Size    Size     Time     Throughput
>> bytes  bytes   bytes    secs.    10^6bits/sec
>>
>>   87380  16384     64    6.00     1221.20
>>   87380  16384     64    6.00     1260.30
>>
>>   87380  16384    128    6.00     1978.51
>>   87380  16384    128    6.00     2286.05
>>
>>   87380  16384    256    6.00     2677.94
>>   87380  16384    256    6.00     4615.42
>>
>>   87380  16384    512    6.00     2956.54
>>   87380  16384    512    6.00     5356.39
>>
>>   87380  16384   1024    6.00     2798.17
>>   87380  16384   1024    6.00     4943.30
>>
>>   87380  16384   2048    6.00     2681.09
>>   87380  16384   2048    6.00     4835.81
>>
>>   87380  16384   4096    6.00     3390.14
>>   87380  16384   4096    6.00     5391.54
>>
>>   87380  16384   8092    6.00     3008.27
>>   87380  16384   8092    6.00     5381.68
>>
>>   87380  16384  10240    6.00     2999.89
>>   87380  16384  10240    6.00     5393.11
>>
>> Test steps:
>> Although this feature is mainly used for window guest, i used linux
>> guest to
>> help test the feature, to make things simple, i used 3 steps to test
>> the patch
>> as i moved on.
>>
>> 1. With a tcp socket client/server pair running on 2 linux guest, thus
>> i can
>> control
>> the traffic and debugging the code as i want.
>> 2. Netperf on linux guest test the throughput.
>> 3. WHQL test with 2 Windows guests.
>>
>> Wei Xu (3):
>>    virtio-net rsc: support coalescing ipv4 tcp traffic
>>    virtio-net rsc: support coalescing ipv6 tcp traffic
>>    virtio-net rsc: add 2 new rsc information fields to 'virtio_net_hdr'
>>
>>   hw/net/virtio-net.c                         | 642
>> +++++++++++++++++++++++++++-
>>   include/hw/virtio/virtio-net.h              |   2 +
>>   include/hw/virtio/virtio.h                  |  75 ++++
>>   include/standard-headers/linux/virtio_net.h |   3 +
>>   4 files changed, 721 insertions(+), 1 deletion(-)
>>
>

^ permalink raw reply	[flat|nested] 12+ messages in thread

* Re: [Qemu-devel] [ RFC Patch v6 3/3] virtio-net rsc: add 2 new rsc information fields to 'virtio_net_hdr'
  2016-05-28 16:37 ` [Qemu-devel] [ RFC Patch v6 3/3] virtio-net rsc: add 2 new rsc information fields to 'virtio_net_hdr' wexu
@ 2016-05-30  5:57   ` Jason Wang
  2016-06-02 16:23     ` Wei Xu
  0 siblings, 1 reply; 12+ messages in thread
From: Jason Wang @ 2016-05-30  5:57 UTC (permalink / raw)
  To: wexu, qemu-devel; +Cc: victork, mst, yvugenfi, marcel, dfleytma



On 2016年05月29日 00:37, wexu@redhat.com wrote:
> From: Wei Xu <wexu@redhat.com>
>
> Field 'coalesced' indicates how many packets were coalesced and field
> 'dup_ack' indicates how many duplicate acks were merged; the guest driver
> can use this information to reconstruct the exact shape of the original
> traffic on the network.
>
> Signed-off-by: Wei Xu <wexu@redhat.com>
> ---
>   hw/net/virtio-net.c                         | 8 ++++++++
>   include/standard-headers/linux/virtio_net.h | 2 ++
>   2 files changed, 10 insertions(+)
>
> diff --git a/hw/net/virtio-net.c b/hw/net/virtio-net.c
> index cc8cbe4..20f552a 100644
> --- a/hw/net/virtio-net.c
> +++ b/hw/net/virtio-net.c
> @@ -1768,6 +1768,10 @@ static size_t virtio_net_rsc_drain_seg(NetRscChain *chain, NetRscSeg *seg)
>       if ((chain->proto == ETH_P_IP) && seg->is_coalesced) {
>           virtio_net_rsc_ipv4_checksum(h, seg->unit.ip);
>       }
> +    h->coalesced = seg->packets;
> +    h->dup_ack = seg->dup_ack;
> +    h->gso_type = chain->gso_type;
> +    h->gso_size = chain->max_payload;
>       ret = virtio_net_do_receive(seg->nc, seg->buf, seg->size);
>       QTAILQ_REMOVE(&chain->buffers, seg, next);
>       g_free(seg->buf);
> @@ -2302,9 +2306,13 @@ static ssize_t virtio_net_receive(NetClientState *nc,
>                                     const uint8_t *buf, size_t size)
>   {
>       VirtIONet *n;
> +    struct virtio_net_hdr *h;
>   
>       n = qemu_get_nic_opaque(nc);
>       if (n->host_features & (1ULL << VIRTIO_NET_F_GUEST_RSC)) {
> +        h = (struct virtio_net_hdr *)buf;
> +        h->coalesced = 0;
> +        h->dup_ack = 0;
>           return virtio_net_rsc_receive(nc, buf, size);
>       } else {
>           return virtio_net_do_receive(nc, buf, size);
> diff --git a/include/standard-headers/linux/virtio_net.h b/include/standard-headers/linux/virtio_net.h
> index 5b95762..c837417 100644
> --- a/include/standard-headers/linux/virtio_net.h
> +++ b/include/standard-headers/linux/virtio_net.h
> @@ -114,6 +114,8 @@ struct virtio_net_hdr {
>   	__virtio16 gso_size;		/* Bytes to append to hdr_len per frame */
>   	__virtio16 csum_start;	/* Position to start checksumming from */
>   	__virtio16 csum_offset;	/* Offset after that to place checksum */
> +    __virtio16 coalesced;   /* packets coalesced by host */

Can we just reuse gso_segs for this?

> +    __virtio16 dup_ack;     /* duplicate ack count */
>   };
>   
>   /* This is the version of the header to use when the MRG_RXBUF

^ permalink raw reply	[flat|nested] 12+ messages in thread

* Re: [Qemu-devel] [ RFC Patch v6 1/3] virtio-net rsc: support coalescing ipv4 tcp traffic
  2016-05-30  4:20   ` Jason Wang
@ 2016-06-02 16:16     ` Wei Xu
  0 siblings, 0 replies; 12+ messages in thread
From: Wei Xu @ 2016-06-02 16:16 UTC (permalink / raw)
  To: Jason Wang, qemu-devel; +Cc: marcel, mst, victork, dfleytma, yvugenfi

	On 2016年05月30日 12:20, Jason Wang wrote:
>
>
> On 2016年05月29日 00:37, wexu@redhat.com wrote:
>> From: Wei Xu <wexu@redhat.com>
>>
>> All the data packets in a tcp connection will be cached to a big buffer
>> in every receive interval, and will be sent out via a timer, the
>> 'virtio_net_rsc_timeout' controls the interval; the value strongly
>> influences the performance and responsiveness of a tcp connection.
>> 50000(50us) is an empirical value that gains a performance
>> improvement. The whql test
>> sends packets every 100us, so '300000(300us)' can pass the test case,
>> this is also the default value, it's tunable via the command line
>> parameter 'rsc_interval' with 'virtio-net-pci' device, for example, below
>> parameter is to launch a guest with interval set as '500000'.
>
> Does the value make sense if it is smaller than 1us? If not, why not
> just make the unit 1us?
It's an empirical value; 500us is a good candidate for the netperf
throughput test. An interval shorter than 50us brings an obvious penalty
AFAIK, because only a few packets can be coalesced while cycles are still
spent on rsc itself.
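For the record, with the property names added by this series, a throughput-oriented run could be launched like this (illustrative command line only, not from the cover letter):

    -device virtio-net-pci,netdev=hostnet1,id=net1,guest_rsc=on,rsc_interval=500000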
>>
>> 'virtio-net-pci,netdev=hostnet1,bus=pci.0,id=net1,mac=00,rsc_interval=500000'
>>
>> will
>>
>> The timer will only be triggered if the packets pool is not empty,
>> and it'll drain off all the cached packets.
>>
>> 'NetRscChain' is used to save the segments of different protocols in a
>> VirtIONet device.
>>
>> The main handler of TCP includes TCP window update, duplicated ACK check
>> and the real data coalescing if the new segment passes the sanity check
>> and is identified as a 'wanted' one.
>>
>> A 'wanted' segment means:
>> 1. Segment is within current window and the sequence is the expected one.
>> 2. 'ACK' of the segment is in the valid window.
>>
>> Sanity check includes:
>> 1. Incorrect version in IP header
>> 2. IP options & IP fragment
>> 3. Not a TCP packets
>> 4. Sanity size check to prevent buffer overflow attack.
>>
>> There may be more cases that should be considered, such as the ip
>> identification and other flags, but that broke the test because windows
>> sets it to the same value even when the packet is not a fragment.
>>
>> Normally there are 2 typical ways to handle a TCP control flag, 'bypass'
>> and 'finalize'. 'bypass' means the packet should be sent out directly;
>> 'finalize' means the packet should also be bypassed, but only after
>> searching the pool for packets of the same connection and sending all of
>> them out first, to avoid out-of-order data.
>>
>> All 'SYN' packets are bypassed since they always begin a new connection;
>> other flags such as 'FIN/RST' trigger a finalization, because they
>> normally happen when a connection is about to be closed, and an 'URG'
>> packet also finalizes the current coalescing unit.
>>
>> Statistics can be used to monitor the basic coalescing status; the
>> 'out of order' and 'out of window' counters show how many packets were
>> retransmitted, and thus describe the performance intuitively.
>
> We usually don't add device-specific monitor commands. Maybe a new
> control vq command could expose them through ethtool -S in the guest in
> the future. I was thinking of removing those counters since they are
> never used in this series.
Yes, that's a good idea. Actually I'm unsure whether this should be a
guest feature or a host feature: the spec says it should be more like a
guest feature, but here it's provided as a host built-in feature, so how
and where to examine it is a problem. I'm currently debugging it with gdb
and normally check these counters directly via a debug command; the
statistics are quite useful for troubleshooting. They can become optional
once the feature gets more robust, so can we keep them for now and decide
whether (and where) to expose them based on QA's testing?
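Until a proper interface exists, a throwaway debug helper is one way to read the counters without gdb (purely a sketch, not proposed for merging):

    /* Debug-only: dump a chain's RSC counters, e.g. called from a
     * temporary hook or from the gdb prompt. */
    static void virtio_net_rsc_dump_stat(NetRscChain *chain)
    {
        NetRscStat *s = &chain->stat;

        fprintf(stderr, "rsc: received %u coalesced %u cache %u dup_ack %u "
                "purge_failed %u drain_failed %u\n",
                s->received, s->coalesced, s->cache, s->dup_ack,
                s->purge_failed, s->drain_failed);
    }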
>
>>
>> Signed-off-by: Wei Xu <wexu@redhat.com>
>> ---
>>   hw/net/virtio-net.c                         | 498
>> +++++++++++++++++++++++++++-
>>   include/hw/virtio/virtio-net.h              |   2 +
>>   include/hw/virtio/virtio.h                  |  75 +++++
>>   include/standard-headers/linux/virtio_net.h |   1 +
>
> For an RFC it's ok, but for a formal patch this is not the correct way to
> modify the Linux headers. There's a script,
> scripts/update-linux-headers.sh, which is used to sync them from the Linux
> source. This means the change must be merged in Linux, or at least in the
> maintainer's tree, first.
>
>>   4 files changed, 575 insertions(+), 1 deletion(-)
>>
>> diff --git a/hw/net/virtio-net.c b/hw/net/virtio-net.c
>> index 5798f87..b3bb63b 100644
>> --- a/hw/net/virtio-net.c
>> +++ b/hw/net/virtio-net.c
>> @@ -15,10 +15,12 @@
>>   #include "qemu/iov.h"
>>   #include "hw/virtio/virtio.h"
>>   #include "net/net.h"
>> +#include "net/eth.h"
>>   #include "net/checksum.h"
>>   #include "net/tap.h"
>>   #include "qemu/error-report.h"
>>   #include "qemu/timer.h"
>> +#include "qemu/sockets.h"
>>   #include "hw/virtio/virtio-net.h"
>>   #include "net/vhost_net.h"
>>   #include "hw/virtio/virtio-bus.h"
>> @@ -38,6 +40,25 @@
>>   #define endof(container, field) \
>>       (offsetof(container, field) + sizeof(((container *)0)->field))
>> +#define VIRTIO_NET_IP4_ADDR_SIZE   8        /* ipv4 saddr + daddr */
>> +#define VIRTIO_NET_TCP_PORT_SIZE   4        /* sport + dport */
>> +
>> +#define VIRTIO_NET_TCP_FLAG         0x3F
>> +#define VIRTIO_NET_TCP_HDR_LENGTH   0xF000
>> +
>> +/* IPv4 max payload, 16 bits in the header */
>> +#define VIRTIO_NET_MAX_IP4_PAYLOAD (65535 - sizeof(struct ip_header))
>> +#define VIRTIO_NET_MAX_TCP_PAYLOAD 65535
>
> Is this still true if window scaling is enabled? You also need to make
> sure your ack/seq processing works with window scaling.
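To make the concern concrete (numbers invented for illustration): with RFC 7323 window scaling a receiver may advertise th_win = 1024 together with a scale shift of 7, i.e. a real window of 131072 bytes, so a single valid ACK can advance by more than 65535 and the (nack - oack) >= VIRTIO_NET_MAX_TCP_PAYLOAD test would finalize the chain even though nothing is out of window:

    /* Illustration only, values invented: */
    uint16_t th_win = 1024;                 /* advertised window field */
    int      wscale = 7;                    /* negotiated window scale */
    uint32_t real_win = (uint32_t)th_win << wscale;   /* 131072 > 65535 */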
>
>> +
>> +/* header length value in ip header without option */
>> +#define VIRTIO_NET_IP4_HEADER_LENGTH 5
>> +
>> +/* Purge coalesced packets timer interval, This value affects the
>> performance
>> +   a lot, and should be tuned carefully, '300000'(300us) is the
>> recommended
>> +   value to pass the WHQL test, '50000' can gain 2x netperf
>> throughput with
>> +   tso/gso/gro 'off'. */
>> +#define VIRTIO_NET_RSC_INTERVAL  300000
>
> Do we need a new property for this value?
Yes, I added this already.
+    DEFINE_PROP_UINT32("rsc_interval", VirtIONet, rsc_timeout,
+                      VIRTIO_NET_RSC_INTERVAL),
>
>> +
>>   typedef struct VirtIOFeature {
>>       uint32_t flags;
>>       size_t end;
>> @@ -1089,7 +1110,8 @@ static int receive_filter(VirtIONet *n, const
>> uint8_t *buf, int size)
>>       return 0;
>>   }
>> -static ssize_t virtio_net_receive(NetClientState *nc, const uint8_t
>> *buf, size_t size)
>> +static ssize_t virtio_net_do_receive(NetClientState *nc,
>> +                                     const uint8_t *buf, size_t size)
>>   {
>>       VirtIONet *n = qemu_get_nic_opaque(nc);
>>       VirtIONetQueue *q = virtio_net_get_subqueue(nc);
>> @@ -1685,6 +1707,474 @@ static int virtio_net_load_device(VirtIODevice
>> *vdev, QEMUFile *f,
>>       return 0;
>>   }
>> +static void virtio_net_rsc_extract_unit4(NetRscChain *chain,
>> +                                         const uint8_t *buf,
>> NetRscUnit* unit)
>> +{
>> +    uint16_t hdr_len;
>> +    uint16_t ip_hdrlen;
>> +    struct ip_header *ip;
>> +
>> +    hdr_len = chain->n->guest_hdr_len;
>> +    ip = (struct ip_header *)(buf + hdr_len + sizeof(struct
>> eth_header));
>> +    unit->ip = (void *)ip;
>> +    ip_hdrlen = (ip->ip_ver_len & 0xF) << 2;
>> +    unit->ip_plen = &ip->ip_len;
>> +    unit->tcp = (struct tcp_header *)(((uint8_t *)unit->ip) +
>> ip_hdrlen);
>> +    unit->tcp_hdrlen = (htons(unit->tcp->th_offset_flags) & 0xF000)
>> >> 10;
>> +    unit->payload = htons(*unit->ip_plen) - ip_hdrlen -
>> unit->tcp_hdrlen;
>> +}
>> +
>> +static void virtio_net_rsc_ipv4_checksum(struct virtio_net_hdr *vhdr,
>> +                                         struct ip_header *ip)
>> +{
>> +    uint32_t sum;
>> +
>> +    ip->ip_sum = 0;
>> +    sum = net_checksum_add_cont(sizeof(struct ip_header), (uint8_t
>> *)ip, 0);
>> +    ip->ip_sum = cpu_to_be16(net_checksum_finish(sum));
>> +    vhdr->flags &= ~VIRTIO_NET_HDR_F_NEEDS_CSUM;
>> +    vhdr->flags |= VIRTIO_NET_HDR_F_DATA_VALID;
>> +}
>> +
>> +static size_t virtio_net_rsc_drain_seg(NetRscChain *chain, NetRscSeg
>> *seg)
>> +{
>> +    int ret;
>> +    struct virtio_net_hdr *h;
>> +
>> +    h = (struct virtio_net_hdr *)seg->buf;
>> +    virtio_net_rsc_ipv4_checksum(h, seg->unit.ip);
>> +    ret = virtio_net_do_receive(seg->nc, seg->buf, seg->size);
>> +    QTAILQ_REMOVE(&chain->buffers, seg, next);
>> +    g_free(seg->buf);
>> +    g_free(seg);
>> +
>> +    return ret;
>> +}
>> +
>> +static void virtio_net_rsc_purge(void *opq)
>> +{
>> +    NetRscSeg *seg, *rn;
>> +    NetRscChain *chain = (NetRscChain *)opq;
>> +
>> +    QTAILQ_FOREACH_SAFE(seg, &chain->buffers, next, rn) {
>> +        if (virtio_net_rsc_drain_seg(chain, seg) == 0) {
>> +            chain->stat.purge_failed++;
>> +            continue;
>
> Why is this needed?
Ah, we have discussed this before: there may be momentary congestion that
the guest can recover from quickly, so it tries to continue receiving
rather than drop the cached packets.
>
>> +        }
>> +    }
>> +
>> +    chain->stat.timer++;
>> +    if (!QTAILQ_EMPTY(&chain->buffers)) {
>> +        timer_mod(chain->drain_timer,
>> +              qemu_clock_get_ns(QEMU_CLOCK_VIRTUAL) +
>> chain->n->rsc_timeout);
>
> Considering the timer is invisible to the guest, why use QEMU_CLOCK_VIRTUAL?
Yes, it's invisible to the guest, so should I choose a host clock for the
timer here?
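If the answer is that the drain interval should follow host time, the change would be local to the chain setup and the timer_mod() calls; a sketch under that assumption (not what the patch currently does):

    chain->drain_timer = timer_new_ns(QEMU_CLOCK_HOST,
                                      virtio_net_rsc_purge, chain);
    /* ...and correspondingly when re-arming: */
    timer_mod(chain->drain_timer,
              qemu_clock_get_ns(QEMU_CLOCK_HOST) + chain->n->rsc_timeout);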
>
>> +    }
>> +}
>> +
>> +static void virtio_net_rsc_cleanup(VirtIONet *n)
>> +{
>> +    NetRscChain *chain, *rn_chain;
>> +    NetRscSeg *seg, *rn_seg;
>> +
>> +    QTAILQ_FOREACH_SAFE(chain, &n->rsc_chains, next, rn_chain) {
>> +        QTAILQ_FOREACH_SAFE(seg, &chain->buffers, next, rn_seg) {
>> +            QTAILQ_REMOVE(&chain->buffers, seg, next);
>> +            g_free(seg->buf);
>> +            g_free(seg);
>> +        }
>> +
>> +        timer_del(chain->drain_timer);
>> +        timer_free(chain->drain_timer);
>> +        QTAILQ_REMOVE(&n->rsc_chains, chain, next);
>> +        g_free(chain);
>> +    }
>> +}
>> +
>> +static void virtio_net_rsc_cache_buf(NetRscChain *chain,
>> NetClientState *nc,
>> +                                     const uint8_t *buf, size_t size)
>> +{
>> +    uint16_t hdr_len;
>> +    NetRscSeg *seg;
>> +
>> +    hdr_len = chain->n->guest_hdr_len;
>> +    seg = g_malloc(sizeof(NetRscSeg));
>> +    seg->buf = g_malloc(hdr_len + sizeof(struct eth_header)\
>> +                   + VIRTIO_NET_MAX_TCP_PAYLOAD);
>> +    memcpy(seg->buf, buf, size);
>> +    seg->size = size;
>> +    seg->packets = 1;
>> +    seg->dup_ack = 0;
>> +    seg->is_coalesced = 0;
>> +    seg->nc = nc;
>> +
>> +    QTAILQ_INSERT_TAIL(&chain->buffers, seg, next);
>> +    chain->stat.cache++;
>> +
>> +    virtio_net_rsc_extract_unit4(chain, seg->buf, &seg->unit);
>> +}
>> +
>> +static int32_t virtio_net_rsc_handle_ack(NetRscChain *chain,
>> +                                         NetRscSeg *seg, const
>> uint8_t *buf,
>> +                                         struct tcp_header *n_tcp,
>> +                                         struct tcp_header *o_tcp)
>> +{
>> +    uint32_t nack, oack;
>> +    uint16_t nwin, owin;
>> +
>> +    nack = htonl(n_tcp->th_ack);
>> +    nwin = htons(n_tcp->th_win);
>> +    oack = htonl(o_tcp->th_ack);
>> +    owin = htons(o_tcp->th_win);
>> +
>> +    if ((nack - oack) >= VIRTIO_NET_MAX_TCP_PAYLOAD) {
>> +        chain->stat.ack_out_of_win++;
>> +        return RSC_FINAL;
>> +    } else if (nack == oack) {
>> +        /* duplicated ack or window probe */
>> +        if (nwin == owin) {
>> +            /* duplicated ack, add dup ack count due to whql test up
>> to 1 */
>> +            chain->stat.dup_ack++;
>> +            return RSC_FINAL;
>> +        } else {
>> +            /* Coalesce window update */
>> +            o_tcp->th_win = n_tcp->th_win;
>> +            chain->stat.win_update++;
>> +            return RSC_COALESCE;
>> +        }
>> +    } else {
>> +        /* pure ack, go to 'C', finalize*/
>> +        chain->stat.pure_ack++;
>> +        return RSC_FINAL;
>> +    }
>> +}
>> +
>> +static int32_t virtio_net_rsc_coalesce_data(NetRscChain *chain,
>> +                                            NetRscSeg *seg, const
>> uint8_t *buf,
>> +                                            NetRscUnit *n_unit)
>> +{
>> +    void *data;
>> +    uint16_t o_ip_len;
>> +    uint32_t nseq, oseq;
>> +    NetRscUnit *o_unit;
>> +
>> +    o_unit = &seg->unit;
>> +    o_ip_len = htons(*o_unit->ip_plen);
>> +    nseq = htonl(n_unit->tcp->th_seq);
>> +    oseq = htonl(o_unit->tcp->th_seq);
>> +
>> +    /* out of order or retransmitted. */
>> +    if ((nseq - oseq) > VIRTIO_NET_MAX_TCP_PAYLOAD) {
>> +        chain->stat.data_out_of_win++;
>> +        return RSC_FINAL;
>> +    }
>> +
>> +    data = ((uint8_t *)n_unit->tcp) + n_unit->tcp_hdrlen;
>> +    if (nseq == oseq) {
>> +        if ((o_unit->payload == 0) && n_unit->payload) {
>> +            /* From no payload to payload, normal case, not a dup ack
>> or etc */
>> +            chain->stat.data_after_pure_ack++;
>> +            goto coalesce;
>> +        } else {
>> +            return virtio_net_rsc_handle_ack(chain, seg, buf,
>> +                                             n_unit->tcp, o_unit->tcp);
>> +        }
>> +    } else if ((nseq - oseq) != o_unit->payload) {
>> +        /* Not a consistent packet, out of order */
>> +        chain->stat.data_out_of_order++;
>> +        return RSC_FINAL;
>> +    } else {
>> +coalesce:
>> +        if ((o_ip_len + n_unit->payload) > chain->max_payload) {
>> +            chain->stat.over_size++;
>> +            return RSC_FINAL;
>> +        }
>> +
>> +        /* Here comes the right data, the payload length in v4/v6 is
>> different,
>> +           so use the field value to update and record the new data
>> len */
>> +        o_unit->payload += n_unit->payload; /* update new data len */
>> +
>> +        /* update field in ip header */
>> +        *o_unit->ip_plen = htons(o_ip_len + n_unit->payload);
>> +
>> +        /* Bring 'PUSH' big, the whql test guide says 'PUSH' can be
>> coalesced
>> +           for windows guest, while this may change the behavior for
>> linux
>> +           guest. */
>> +        o_unit->tcp->th_offset_flags = n_unit->tcp->th_offset_flags;
>> +
>> +        o_unit->tcp->th_ack = n_unit->tcp->th_ack;
>> +        o_unit->tcp->th_win = n_unit->tcp->th_win;
>> +
>> +        memmove(seg->buf + seg->size, data, n_unit->payload);
>> +        seg->size += n_unit->payload;
>> +        seg->packets++;
>> +        chain->stat.coalesced++;
>> +        return RSC_COALESCE;
>> +    }
>> +}
>> +
>> +static int32_t virtio_net_rsc_coalesce4(NetRscChain *chain, NetRscSeg
>> *seg,
>> +                                        const uint8_t *buf, size_t size,
>> +                                        NetRscUnit *unit)
>> +{
>> +    struct ip_header *ip1, *ip2;
>> +
>> +    ip1 = (struct ip_header *)(unit->ip);
>> +    ip2 = (struct ip_header *)(seg->unit.ip);
>> +    if ((ip1->ip_src ^ ip2->ip_src) || (ip1->ip_dst ^ ip2->ip_dst)
>> +        || (unit->tcp->th_sport ^ seg->unit.tcp->th_sport)
>> +        || (unit->tcp->th_dport ^ seg->unit.tcp->th_dport)) {
>> +        chain->stat.no_match++;
>> +        return RSC_NO_MATCH;
>> +    }
>> +
>> +    return virtio_net_rsc_coalesce_data(chain, seg, buf, unit);
>> +}
>> +
>> +/* Pakcets with 'SYN' should bypass, other flag should be sent after
>> drain
>> + * to prevent out of order */
>> +static int virtio_net_rsc_tcp_ctrl_check(NetRscChain *chain,
>> +                                         struct tcp_header *tcp)
>> +{
>> +    uint16_t tcp_hdr;
>> +    uint16_t tcp_flag;
>> +
>> +    tcp_flag = htons(tcp->th_offset_flags);
>> +    tcp_hdr = (tcp_flag & VIRTIO_NET_TCP_HDR_LENGTH) >> 10;
>> +    tcp_flag &= VIRTIO_NET_TCP_FLAG;
>> +    tcp_flag = htons(tcp->th_offset_flags) & 0x3F;
>> +    if (tcp_flag & TH_SYN) {
>> +        chain->stat.tcp_syn++;
>> +        return RSC_BYPASS;
>> +    }
>> +
>> +    if (tcp_flag & (TH_FIN | TH_URG | TH_RST)) {
>> +        chain->stat.tcp_ctrl_drain++;
>> +        return RSC_FINAL;
>> +    }
>> +
>> +    if (tcp_hdr > sizeof(struct tcp_header)) {
>> +        chain->stat.tcp_all_opt++;
>> +        return RSC_FINAL;
>> +    }
>> +
>> +    return RSC_WANT;
>> +}
>> +
>> +static bool virtio_net_rsc_empty_cache(NetRscChain *chain,
>> NetClientState *nc,
>> +                                       const uint8_t *buf, size_t size)
>> +{
>
> The name is really confusing; it in fact caches the buf if the cache is
> empty. So I guess what you mean is "virtio_net_rsc_cache_if_empty()"?
> But the question is why we need a specific function like this; can we
> simply add the logic to virtio_net_rsc_do_coalesce()? (It looks like
> most of the logic is already there, except for the timer and the counter.)
OK.
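For illustration, a rough sketch of what the merged version could look like
(untested, written only to show the idea; the loop body stays as in the patch):

static size_t virtio_net_rsc_do_coalesce(NetRscChain *chain, NetClientState *nc,
                                         const uint8_t *buf, size_t size,
                                         NetRscUnit *unit)
{
    int ret;
    NetRscSeg *seg, *nseg;

    /* Cache is empty: store this packet and arm the drain timer, which is
       what virtio_net_rsc_empty_cache() does today */
    if (QTAILQ_EMPTY(&chain->buffers)) {
        chain->stat.empty_cache++;
        virtio_net_rsc_cache_buf(chain, nc, buf, size);
        timer_mod(chain->drain_timer,
                  qemu_clock_get_ns(QEMU_CLOCK_VIRTUAL) + chain->n->rsc_timeout);
        return size;
    }

    QTAILQ_FOREACH_SAFE(seg, &chain->buffers, next, nseg) {
        ret = virtio_net_rsc_coalesce4(chain, seg, buf, size, unit);
        if (ret == RSC_FINAL) {
            if (virtio_net_rsc_drain_seg(chain, seg) == 0) {
                /* Send failed */
                chain->stat.final_failed++;
                return 0;
            }
            /* Send current packet */
            return virtio_net_do_receive(nc, buf, size);
        } else if (ret == RSC_NO_MATCH) {
            continue;
        } else {
            /* Coalesced, mark the flag so the ipv4 checksum is recalculated */
            seg->is_coalesced = 1;
            return size;
        }
    }

    chain->stat.no_match_cache++;
    virtio_net_rsc_cache_buf(chain, nc, buf, size);
    return size;
}

The callers in virtio_net_rsc_receive4/6() could then drop the separate
virtio_net_rsc_empty_cache() call.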
>
>> +    if (QTAILQ_EMPTY(&chain->buffers)) {
>> +        chain->stat.empty_cache++;
>> +        virtio_net_rsc_cache_buf(chain, nc, buf, size);
>> +        timer_mod(chain->drain_timer,
>> +              qemu_clock_get_ns(QEMU_CLOCK_VIRTUAL) +
>> chain->n->rsc_timeout);
>> +        return 1;
>> +    }
>> +
>> +    return 0;
>> +}
>> +
>> +static size_t virtio_net_rsc_do_coalesce(NetRscChain *chain,
>> NetClientState *nc,
>> +                                         const uint8_t *buf, size_t
>> size,
>> +                                         NetRscUnit *unit)
>> +{
>> +    int ret;
>> +    NetRscSeg *seg, *nseg;
>> +
>> +    QTAILQ_FOREACH_SAFE(seg, &chain->buffers, next, nseg) {
>> +        ret = virtio_net_rsc_coalesce4(chain, seg, buf, size, unit);
>> +
>> +        if (ret == RSC_FINAL) {
>> +            if (virtio_net_rsc_drain_seg(chain, seg) == 0) {
>> +                /* Send failed */
>> +                chain->stat.final_failed++;
>> +                return 0;
>> +            }
>> +
>> +            /* Send current packet */
>> +            return virtio_net_do_receive(nc, buf, size);
>> +        } else if (ret == RSC_NO_MATCH) {
>> +            continue;
>> +        } else {
>> +            /* Coalesced, set the coalesced flag so the ipv4 checksum
>> +               gets recalculated when the segment is drained */
>> +            seg->is_coalesced = 1;
>> +            return size;
>> +        }
>> +    }
>> +
>> +    chain->stat.no_match_cache++;
>> +    virtio_net_rsc_cache_buf(chain, nc, buf, size);
>> +    return size;
>> +}
>> +
>> +/* Drain a connection's data, this is to avoid out-of-order segments */
>> +static size_t virtio_net_rsc_drain_flow(NetRscChain *chain,
>> NetClientState *nc,
>> +                                        const uint8_t *buf, size_t size,
>> +                                        uint16_t ip_start, uint16_t
>> ip_size,
>> +                                        uint16_t tcp_port, uint16_t
>> port_size)
>> +{
>> +    NetRscSeg *seg, *nseg;
>> +
>> +    QTAILQ_FOREACH_SAFE(seg, &chain->buffers, next, nseg) {
>> +        if (memcmp(buf + ip_start, seg->buf + ip_start, ip_size)
>> +            || memcmp(buf + tcp_port, seg->buf + tcp_port, port_size)) {
>
> The header comparison is rather similar to what you've done in
> virtio_net_rsc_coalesce4/6(). Any chance to reduce the duplication?
OK.
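Something along these lines might work to share it (just a sketch; the helper
name is invented, not from the patch):

/* Compare the ip addresses and tcp ports of two packets at the given
   offsets; usable by both the v4 and v6 chains. */
static bool virtio_net_rsc_same_flow(const uint8_t *a, const uint8_t *b,
                                     uint16_t ip_start, uint16_t ip_size,
                                     uint16_t port_start, uint16_t port_size)
{
    return (memcmp(a + ip_start, b + ip_start, ip_size) == 0)
           && (memcmp(a + port_start, b + port_start, port_size) == 0);
}

virtio_net_rsc_drain_flow() could call it directly, and
virtio_net_rsc_coalesce4/6() could call it with the per-protocol offsets
instead of open-coding the field comparisons.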
>
>> +            continue;
>> +        }
>> +        if (virtio_net_rsc_drain_seg(chain, seg) == 0) {
>> +            chain->stat.drain_failed++;
>> +        }
>> +
>> +        break;
>> +    }
>> +
>> +    return virtio_net_do_receive(nc, buf, size);
>> +}
>> +
>> +static int32_t virtio_net_rsc_sanity_check4(NetRscChain *chain,
>> +                                            struct ip_header *ip,
>> +                                            const uint8_t *buf,
>> size_t size)
>> +{
>> +    uint16_t ip_len;
>> +    uint16_t hdr_len;
>> +
>> +    hdr_len = chain->n->guest_hdr_len;
>> +    if (size < (hdr_len + sizeof(struct eth_header) + sizeof(struct
>> ip_header)
>> +        + sizeof(struct tcp_header))) {
>
> It's interesting that you don't have a stat counter for this case.
Thanks.
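If a counter is wanted there, something like the following would do (the
field name is made up, it is not part of the patch):

    if (size < (hdr_len + sizeof(struct eth_header) + sizeof(struct ip_header)
        + sizeof(struct tcp_header))) {
        chain->stat.bypass_too_short++; /* hypothetical new NetRscStat field */
        return RSC_BYPASS;
    }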
>
>> +        return RSC_BYPASS;
>> +    }
>> +
>> +    /* Not an ipv4 one */
>> +    if (((ip->ip_ver_len & 0xF0) >> 4) != IP_HEADER_VERSION_4) {
>> +        chain->stat.ip_option++;
>> +        return RSC_BYPASS;
>> +    }
>> +
>> +    /* Don't handle packets with ip option */
>> +    if (VIRTIO_NET_IP4_HEADER_LENGTH != (ip->ip_ver_len & 0xF)) {
>
> (ip-> ...) != VIRTIO_NET_IP4_HEADER_LENGTH
OK.
>
>> +        chain->stat.ip_option++;
>> +        return RSC_BYPASS;
>> +    }
>> +
>> +    if (ip->ip_p != IPPROTO_TCP) {
>> +        chain->stat.bypass_not_tcp++;
>> +        return RSC_BYPASS;
>> +    }
>> +
>> +    /* Don't handle packets with ip fragment */
>> +    if (!(htons(ip->ip_off) & IP_DF)) {
>> +        chain->stat.ip_frag++;
>> +        return RSC_BYPASS;
>> +    }
>> +
>> +    ip_len = htons(ip->ip_len);
>> +    if (ip_len < (sizeof(struct ip_header) + sizeof(struct tcp_header))
>> +        || ip_len > (size - hdr_len - sizeof(struct eth_header))) {
>> +        chain->stat.ip_hacked++;
>> +        return RSC_BYPASS;
>> +    }
>> +
>> +    return RSC_WANT;
>> +}
>> +
>> +static size_t virtio_net_rsc_receive4(NetRscChain *chain,
>> NetClientState* nc,
>> +                                      const uint8_t *buf, size_t size)
>> +{
>> +    int32_t ret;
>> +    uint16_t hdr_len;
>> +    NetRscUnit unit;
>> +
>> +    hdr_len = ((VirtIONet *)(chain->n))->guest_hdr_len;
>> +    virtio_net_rsc_extract_unit4(chain, buf, &unit);
>
> Is it safe to do the header analysis before doing the sanity check?
Looks like there could be an unsafe case here, will double check it.
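One way to make it safe (a rough, untested sketch) is to check the overall
length before any header field is dereferenced:

    hdr_len = ((VirtIONet *)(chain->n))->guest_hdr_len;
    /* make sure eth + ip + tcp headers fit into 'buf' before
       virtio_net_rsc_extract_unit4() reads them */
    if (size < hdr_len + sizeof(struct eth_header) + sizeof(struct ip_header)
             + sizeof(struct tcp_header)) {
        return virtio_net_do_receive(nc, buf, size);
    }

    virtio_net_rsc_extract_unit4(chain, buf, &unit);
    if (virtio_net_rsc_sanity_check4(chain, unit.ip, buf, size) != RSC_WANT) {
        return virtio_net_do_receive(nc, buf, size);
    }

The remaining checks in virtio_net_rsc_sanity_check4() could then keep using
the already extracted unit, and the size check inside it becomes redundant.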
>
>> +    if (RSC_WANT != virtio_net_rsc_sanity_check4(chain, unit.ip, buf,
>> size)) {
>
> virtio_net_rsc_sanity_check4() != RSC_WANT, please. I've pointed out this
> kind of style issue several times.
>
>> +        return virtio_net_do_receive(nc, buf, size);
>> +    }
>> +
>> +    ret = virtio_net_rsc_tcp_ctrl_check(chain, unit.tcp);
>> +    if (ret == RSC_BYPASS) {
>> +        return virtio_net_do_receive(nc, buf, size);
>> +    } else if (ret == RSC_FINAL) {
>> +        return virtio_net_rsc_drain_flow(chain, nc, buf, size,
>> +                ((hdr_len + sizeof(struct eth_header)) + 12),
>> +                VIRTIO_NET_IP4_ADDR_SIZE,
>> +                hdr_len + sizeof(struct eth_header) + sizeof(struct
>> ip_header),
>> +                VIRTIO_NET_TCP_PORT_SIZE);
>> +    }
>> +
>> +    if (virtio_net_rsc_empty_cache(chain, nc, buf, size)) {
>> +        return size;
>> +    }
>> +
>> +    return virtio_net_rsc_do_coalesce(chain, nc, buf, size, &unit);
>> +}
>> +
>> +static NetRscChain *virtio_net_rsc_lookup_chain(VirtIONet * n,
>> +                                                NetClientState *nc,
>> +                                                uint16_t proto)
>> +{
>> +    NetRscChain *chain;
>> +
>> +    if (proto != (uint16_t)ETH_P_IP) {
>> +        return NULL;
>> +    }
>> +
>> +    QTAILQ_FOREACH(chain, &n->rsc_chains, next) {
>> +        if (chain->proto == proto) {
>> +            return chain;
>> +        }
>> +    }
>> +
>> +    chain = g_malloc(sizeof(*chain));
>> +    chain->n = n;
>> +    chain->proto = proto;
>> +    chain->max_payload = VIRTIO_NET_MAX_IP4_PAYLOAD;
>> +    chain->gso_type = VIRTIO_NET_HDR_GSO_TCPV4;
>> +    chain->drain_timer = timer_new_ns(QEMU_CLOCK_VIRTUAL,
>> +                                      virtio_net_rsc_purge, chain);
>> +    memset(&chain->stat, 0, sizeof(chain->stat));
>> +
>> +    QTAILQ_INIT(&chain->buffers);
>> +    QTAILQ_INSERT_TAIL(&n->rsc_chains, chain, next);
>> +
>> +    return chain;
>> +}
>> +
>> +static ssize_t virtio_net_rsc_receive(NetClientState *nc,
>> +                                      const uint8_t *buf, size_t size)
>> +{
>> +    uint16_t proto;
>> +    NetRscChain *chain;
>> +    struct eth_header *eth;
>> +    VirtIONet *n;
>> +
>> +    n = qemu_get_nic_opaque(nc);
>> +    if (size < (n->guest_hdr_len + sizeof(struct eth_header))) {
>
> Here should use host_hdr_len, no?
I don't understand this part of the code well; could you please give a
typical case in which host_hdr_len would differ from guest_hdr_len? I was
thinking they are equal in most cases; is the difference caused by
different feature capabilities, such as mergeable buffers, etc.?
>
>> +        return virtio_net_do_receive(nc, buf, size);
>> +    }
>> +
>> +    eth = (struct eth_header *)(buf + n->guest_hdr_len);
>> +    proto = htons(eth->h_proto);
>> +
>> +    chain = virtio_net_rsc_lookup_chain(n, nc, proto);
>> +    if (!chain) {
>> +        return virtio_net_do_receive(nc, buf, size);
>> +    } else {
>> +        chain->stat.received++;
>> +        return virtio_net_rsc_receive4(chain, nc, buf, size);
>> +    }
>> +}
>> +
>> +static ssize_t virtio_net_receive(NetClientState *nc,
>> +                                  const uint8_t *buf, size_t size)
>> +{
>> +    VirtIONet *n;
>> +
>> +    n = qemu_get_nic_opaque(nc);
>> +    if (n->host_features & (1ULL << VIRTIO_NET_F_GUEST_RSC)) {
>> +        return virtio_net_rsc_receive(nc, buf, size);
>> +    } else {
>> +        return virtio_net_do_receive(nc, buf, size);
>> +    }
>> +}
>> +
>>   static NetClientInfo net_virtio_info = {
>>       .type = NET_CLIENT_OPTIONS_KIND_NIC,
>>       .size = sizeof(NICState),
>> @@ -1814,6 +2304,7 @@ static void
>> virtio_net_device_realize(DeviceState *dev, Error **errp)
>>       nc = qemu_get_queue(n->nic);
>>       nc->rxfilter_notify_enabled = 1;
>> +    QTAILQ_INIT(&n->rsc_chains);
>>       n->qdev = dev;
>>       register_savevm(dev, "virtio-net", -1, VIRTIO_NET_VM_VERSION,
>>                       virtio_net_save, virtio_net_load, n);
>> @@ -1848,6 +2339,7 @@ static void
>> virtio_net_device_unrealize(DeviceState *dev, Error **errp)
>>       g_free(n->vqs);
>>       qemu_del_nic(n->nic);
>>       virtio_cleanup(vdev);
>> +    virtio_net_rsc_cleanup(n);
>>   }
>>   static void virtio_net_instance_init(Object *obj)
>> @@ -1909,6 +2401,10 @@ static Property virtio_net_properties[] = {
>>                          TX_TIMER_INTERVAL),
>>       DEFINE_PROP_INT32("x-txburst", VirtIONet, net_conf.txburst,
>> TX_BURST),
>>       DEFINE_PROP_STRING("tx", VirtIONet, net_conf.tx),
>> +    DEFINE_PROP_BIT("guest_rsc", VirtIONet, host_features,
>> +                    VIRTIO_NET_F_GUEST_RSC, false),
>> +    DEFINE_PROP_UINT32("rsc_interval", VirtIONet, rsc_timeout,
>> +                      VIRTIO_NET_RSC_INTERVAL),
>>       DEFINE_PROP_END_OF_LIST(),
>>   };
>> diff --git a/include/hw/virtio/virtio-net.h
>> b/include/hw/virtio/virtio-net.h
>> index 0cabdb6..e995190 100644
>> --- a/include/hw/virtio/virtio-net.h
>> +++ b/include/hw/virtio/virtio-net.h
>> @@ -59,6 +59,8 @@ typedef struct VirtIONet {
>>       VirtIONetQueue *vqs;
>>       VirtQueue *ctrl_vq;
>>       NICState *nic;
>> +    QTAILQ_HEAD(, NetRscChain) rsc_chains;
>> +    uint32_t rsc_timeout;
>>       uint32_t tx_timeout;
>>       int32_t tx_burst;
>>       uint32_t has_vnet_hdr;
>> diff --git a/include/hw/virtio/virtio.h b/include/hw/virtio/virtio.h
>> index 6a37065..a7cb84a 100644
>> --- a/include/hw/virtio/virtio.h
>> +++ b/include/hw/virtio/virtio.h
>> @@ -30,6 +30,8 @@
>>                                   (0x1ULL << VIRTIO_F_ANY_LAYOUT))
>>   struct VirtQueue;
>> +struct VirtIONet;
>> +typedef struct VirtIONet VirtIONet;
>>   static inline hwaddr vring_align(hwaddr addr,
>>                                                unsigned long align)
>> @@ -128,6 +130,79 @@ typedef struct VirtioDeviceClass {
>>       int (*load)(VirtIODevice *vdev, QEMUFile *f, int version_id);
>>   } VirtioDeviceClass;
>> +/* Coalesced packets type & status */
>> +typedef enum {
>> +    RSC_COALESCE,           /* Data has been coalesced */
>> +    RSC_FINAL,              /* Will terminate current connection */
>> +    RSC_NO_MATCH,           /* No match in the buffer pool */
>> +    RSC_BYPASS,             /* Packet to be bypassed: not tcp, tcp
>> ctrl, etc */
>> +    RSC_WANT                /* Data to be coalesced */
>> +} COALESCE_STATUS;
>> +
>> +typedef struct NetRscStat {
>> +    uint32_t received;
>> +    uint32_t coalesced;
>> +    uint32_t over_size;
>> +    uint32_t cache;
>> +    uint32_t empty_cache;
>> +    uint32_t no_match_cache;
>> +    uint32_t win_update;
>> +    uint32_t no_match;
>> +    uint32_t tcp_syn;
>> +    uint32_t tcp_ctrl_drain;
>> +    uint32_t dup_ack;
>> +    uint32_t dup_ack1;
>> +    uint32_t dup_ack2;
>> +    uint32_t pure_ack;
>> +    uint32_t ack_out_of_win;
>> +    uint32_t data_out_of_win;
>> +    uint32_t data_out_of_order;
>> +    uint32_t data_after_pure_ack;
>> +    uint32_t bypass_not_tcp;
>> +    uint32_t tcp_option;
>> +    uint32_t tcp_all_opt;
>> +    uint32_t ip_frag;
>> +    uint32_t ip_hacked;
>> +    uint32_t ip_option;
>> +    uint32_t purge_failed;
>> +    uint32_t drain_failed;
>> +    uint32_t final_failed;
>> +    int64_t  timer;
>> +} NetRscStat;
>> +
>> +/* RSC unit general info used to check whether coalescing is possible */
>> +typedef struct NetRscUnit {
>> +    void *ip;   /* ip header */
>> +    uint16_t *ip_plen;      /* data len pointer in ip header field */
>> +    struct tcp_header *tcp; /* tcp header */
>> +    uint16_t tcp_hdrlen;    /* tcp header len */
>> +    uint16_t payload;       /* pure payload without virtio/eth/ip/tcp */
>> +} NetRscUnit;
>> +
>> +/* Coalesced segment */
>> +typedef struct NetRscSeg {
>> +    QTAILQ_ENTRY(NetRscSeg) next;
>> +    void *buf;
>> +    size_t size;
>> +    uint16_t packets;
>> +    uint16_t dup_ack;
>> +    bool is_coalesced;      /* need to recalculate the ipv4 header
>> checksum, mark here */
>> +    NetRscUnit unit;
>> +    NetClientState *nc;
>> +} NetRscSeg;
>> +
>> +/* Chain is divided by protocol(ipv4/v6) and NetClientInfo */
>> +typedef struct NetRscChain {
>> +    QTAILQ_ENTRY(NetRscChain) next;
>> +    VirtIONet *n;                            /* VirtIONet */
>> +    uint16_t proto;
>> +    uint8_t  gso_type;
>> +    uint16_t max_payload;
>> +    QEMUTimer *drain_timer;
>> +    QTAILQ_HEAD(, NetRscSeg) buffers;
>> +    NetRscStat stat;
>> +} NetRscChain;
>> +
>>   void virtio_instance_init_common(Object *proxy_obj, void *data,
>>                                    size_t vdev_size, const char
>> *vdev_name);
>> diff --git a/include/standard-headers/linux/virtio_net.h
>> b/include/standard-headers/linux/virtio_net.h
>> index a78f33e..5b95762 100644
>> --- a/include/standard-headers/linux/virtio_net.h
>> +++ b/include/standard-headers/linux/virtio_net.h
>> @@ -55,6 +55,7 @@
>>   #define VIRTIO_NET_F_MQ    22    /* Device supports Receive Flow
>>                        * Steering */
>>   #define VIRTIO_NET_F_CTRL_MAC_ADDR 23    /* Set MAC address */
>> +#define VIRTIO_NET_F_GUEST_RSC  24      /* Guest can coalesce tcp
>> packets */
>
> The comment should be "Guest can receive coalesced tcp packets" ?
Yes. Actually, the Windows driver is using gso_type 'GSO_TCPV4/6' for
coalesced packets; I'm wondering whether they overlap and can be
multiplexed.
>
>>   #ifndef VIRTIO_NET_NO_LEGACY
>>   #define VIRTIO_NET_F_GSO    6    /* Host handles pkts w/ any GSO
>> type */
>


* Re: [Qemu-devel] [ RFC Patch v6 2/3] virtio-net rsc: support coalescing ipv6 tcp traffic
  2016-05-30  4:25   ` Jason Wang
@ 2016-06-02 16:17     ` Wei Xu
  0 siblings, 0 replies; 12+ messages in thread
From: Wei Xu @ 2016-06-02 16:17 UTC (permalink / raw)
  To: qemu-devel



On 05/30/2016 12:25, Jason Wang wrote:
>
>
> On 05/29/2016 00:37, wexu@redhat.com wrote:
>> From: Wei Xu <wexu@redhat.com>
>>
>> Most of this is the same as the ipv4 case; there are 2 differences
>> between ipv4 and ipv6.
>>
>> 1. The total length field in the ipv4 header includes the header itself,
>> while it's not included for ipv6, which means ipv6 can carry a full
>> '65535' payload.
>>
>> 2. The IPv6 header has no checksum field, so no header checksum needs to
>> be calculated.
>>
>> Signed-off-by: Wei Xu <wexu@redhat.com>
>> ---
>>   hw/net/virtio-net.c | 152
>> +++++++++++++++++++++++++++++++++++++++++++++++++---
>>   1 file changed, 144 insertions(+), 8 deletions(-)
>>
>> diff --git a/hw/net/virtio-net.c b/hw/net/virtio-net.c
>> index b3bb63b..cc8cbe4 100644
>> --- a/hw/net/virtio-net.c
>> +++ b/hw/net/virtio-net.c
>> @@ -53,6 +53,10 @@
>>   /* header length value in ip header without option */
>>   #define VIRTIO_NET_IP4_HEADER_LENGTH 5
>> +#define ETH_IP6_HDR_SZ (ETH_HDR_SZ + IP6_HDR_SZ)
>> +#define VIRTIO_NET_IP6_ADDR_SIZE   32      /* ipv6 saddr + daddr */
>> +#define VIRTIO_NET_MAX_IP6_PAYLOAD VIRTIO_NET_MAX_TCP_PAYLOAD
>> +
>>   /* Purge coalesced packets timer interval, This value affects the
>> performance
>>      a lot, and should be tuned carefully, '300000'(300us) is the
>> recommended
>>      value to pass the WHQL test, '50000' can gain 2x netperf
>> throughput with
>> @@ -1724,6 +1728,25 @@ static void
>> virtio_net_rsc_extract_unit4(NetRscChain *chain,
>>       unit->payload = htons(*unit->ip_plen) - ip_hdrlen -
>> unit->tcp_hdrlen;
>>   }
>> +static void virtio_net_rsc_extract_unit6(NetRscChain *chain,
>> +                                         const uint8_t *buf,
>> NetRscUnit* unit)
>> +{
>> +    uint16_t hdr_len;
>> +    struct ip6_header *ip6;
>> +
>> +    hdr_len = ((VirtIONet *)(chain->n))->guest_hdr_len;
>> +    ip6 = (struct ip6_header *)(buf + hdr_len + sizeof(struct
>> eth_header));
>> +    unit->ip = ip6;
>> +    unit->ip_plen = &(ip6->ip6_ctlun.ip6_un1.ip6_un1_plen);
>> +    unit->tcp = (struct tcp_header *)(((uint8_t *)unit->ip)\
>> +                                        + sizeof(struct ip6_header));
>> +    unit->tcp_hdrlen = (htons(unit->tcp->th_offset_flags) & 0xF000)
>> >> 10;
>> +
>> +    /* There is a difference between the payload length in ipv4 and v6,
>> +       the ip header is excluded in ipv6 */
>> +    unit->payload = htons(*unit->ip_plen) - unit->tcp_hdrlen;
>> +}
>> +
>>   static void virtio_net_rsc_ipv4_checksum(struct virtio_net_hdr *vhdr,
>>                                            struct ip_header *ip)
>>   {
>> @@ -1742,7 +1765,9 @@ static size_t
>> virtio_net_rsc_drain_seg(NetRscChain *chain, NetRscSeg *seg)
>>       struct virtio_net_hdr *h;
>>       h = (struct virtio_net_hdr *)seg->buf;
>> -    virtio_net_rsc_ipv4_checksum(h, seg->unit.ip);
>> +    if ((chain->proto == ETH_P_IP) && seg->is_coalesced) {
>> +        virtio_net_rsc_ipv4_checksum(h, seg->unit.ip);
>> +    }
>>       ret = virtio_net_do_receive(seg->nc, seg->buf, seg->size);
>>       QTAILQ_REMOVE(&chain->buffers, seg, next);
>>       g_free(seg->buf);
>> @@ -1798,7 +1823,7 @@ static void virtio_net_rsc_cache_buf(NetRscChain
>> *chain, NetClientState *nc,
>>       hdr_len = chain->n->guest_hdr_len;
>>       seg = g_malloc(sizeof(NetRscSeg));
>>       seg->buf = g_malloc(hdr_len + sizeof(struct eth_header)\
>> -                   + VIRTIO_NET_MAX_TCP_PAYLOAD);
>> +                   + sizeof(struct ip6_header) +
>> VIRTIO_NET_MAX_TCP_PAYLOAD);
>>       memcpy(seg->buf, buf, size);
>>       seg->size = size;
>>       seg->packets = 1;
>> @@ -1809,7 +1834,18 @@ static void
>> virtio_net_rsc_cache_buf(NetRscChain *chain, NetClientState *nc,
>>       QTAILQ_INSERT_TAIL(&chain->buffers, seg, next);
>>       chain->stat.cache++;
>> -    virtio_net_rsc_extract_unit4(chain, seg->buf, &seg->unit);
>> +    switch (chain->proto) {
>> +    case ETH_P_IP:
>> +        virtio_net_rsc_extract_unit4(chain, seg->buf, &seg->unit);
>> +        break;
>> +
>> +    case ETH_P_IPV6:
>> +        virtio_net_rsc_extract_unit6(chain, seg->buf, &seg->unit);
>> +        break;
>> +
>> +    default:
>> +        g_assert_not_reached();
>> +    }
>>   }
>>   static int32_t virtio_net_rsc_handle_ack(NetRscChain *chain,
>> @@ -1929,6 +1965,24 @@ static int32_t
>> virtio_net_rsc_coalesce4(NetRscChain *chain, NetRscSeg *seg,
>>       return virtio_net_rsc_coalesce_data(chain, seg, buf, unit);
>>   }
>> +static int32_t virtio_net_rsc_coalesce6(NetRscChain *chain, NetRscSeg
>> *seg,
>> +                        const uint8_t *buf, size_t size, NetRscUnit
>> *unit)
>> +{
>> +    struct ip6_header *ip1, *ip2;
>> +
>> +    ip1 = (struct ip6_header *)(unit->ip);
>> +    ip2 = (struct ip6_header *)(seg->unit.ip);
>> +    if (memcmp(&ip1->ip6_src, &ip2->ip6_src, sizeof(struct in6_address))
>> +        || memcmp(&ip1->ip6_dst, &ip2->ip6_dst, sizeof(struct
>> in6_address))
>> +        || (unit->tcp->th_sport ^ seg->unit.tcp->th_sport)
>> +        || (unit->tcp->th_dport ^ seg->unit.tcp->th_dport)) {
>> +            chain->stat.no_match++;
>> +            return RSC_NO_MATCH;
>> +    }
>> +
>> +    return virtio_net_rsc_coalesce_data(chain, seg, buf, unit);
>> +}
>> +
>>   /* Packets with 'SYN' should be bypassed; other flags should be sent
>>    * after a drain to prevent out-of-order delivery */
>>   static int virtio_net_rsc_tcp_ctrl_check(NetRscChain *chain,
>> @@ -1981,7 +2035,11 @@ static size_t
>> virtio_net_rsc_do_coalesce(NetRscChain *chain, NetClientState *nc,
>>       NetRscSeg *seg, *nseg;
>>       QTAILQ_FOREACH_SAFE(seg, &chain->buffers, next, nseg) {
>> -        ret = virtio_net_rsc_coalesce4(chain, seg, buf, size, unit);
>> +        if (chain->proto == ETH_P_IP) {
>> +            ret = virtio_net_rsc_coalesce4(chain, seg, buf, size, unit);
>> +        } else {
>> +            ret = virtio_net_rsc_coalesce6(chain, seg, buf, size, unit);
>> +        }
>>           if (ret == RSC_FINAL) {
>>               if (virtio_net_rsc_drain_seg(chain, seg) == 0) {
>> @@ -2106,13 +2164,82 @@ static size_t
>> virtio_net_rsc_receive4(NetRscChain *chain, NetClientState* nc,
>>       return virtio_net_rsc_do_coalesce(chain, nc, buf, size, &unit);
>>   }
>> +static int32_t virtio_net_rsc_sanity_check6(NetRscChain *chain,
>> +                                            struct ip6_header *ip6,
>> +                                            const uint8_t *buf,
>> size_t size)
>> +{
>> +    uint16_t ip_len;
>> +    uint16_t hdr_len;
>> +
>> +    hdr_len = ((VirtIONet *)(chain->n))->guest_hdr_len;
>> +    if (size < (hdr_len + sizeof(struct eth_header) + sizeof(struct
>> ip6_header)
>> +        + sizeof(struct tcp_header))) {
>> +        return RSC_BYPASS;
>> +    }
>> +
>> +    if (((ip6->ip6_ctlun.ip6_un1.ip6_un1_flow & 0xF0) >> 4)
>> +        != IP_HEADER_VERSION_6) {
>> +        return RSC_BYPASS;
>> +    }
>> +
>> +    /* Both options and protocol are checked by this */
>> +    if (ip6->ip6_ctlun.ip6_un1.ip6_un1_nxt != IPPROTO_TCP) {
>> +        chain->stat.bypass_not_tcp++;
>> +        return RSC_BYPASS;
>> +    }
>> +
>> +    ip_len = htons(ip6->ip6_ctlun.ip6_un1.ip6_un1_plen);
>> +    if (ip_len < sizeof(struct tcp_header)
>> +        || ip_len > (size - hdr_len - sizeof(struct eth_header)
>> +                     - sizeof(struct ip6_header))) {
>> +        chain->stat.ip_hacked++;
>> +        return RSC_BYPASS;
>> +    }
>> +
>> +    return RSC_WANT;
>> +}
>> +
>> +static size_t virtio_net_rsc_receive6(void *opq, NetClientState* nc,
>> +                                      const uint8_t *buf, size_t size)
>> +{
>> +    int32_t ret;
>> +    uint16_t hdr_len;
>> +    NetRscChain *chain;
>> +    NetRscUnit unit;
>> +
>> +    chain = (NetRscChain *)opq;
>> +    hdr_len = ((VirtIONet *)(chain->n))->guest_hdr_len;
>> +    virtio_net_rsc_extract_unit6(chain, buf, &unit);
>
> Same issue as ipv4: it does not look safe to analyze the header before
> doing the sanity check.
OK.
>
>> +    if (RSC_WANT != virtio_net_rsc_sanity_check6(chain,
>> +                                                 unit.ip, buf, size)) {
>> +        return virtio_net_do_receive(nc, buf, size);
>> +    }
>> +
>> +    ret = virtio_net_rsc_tcp_ctrl_check(chain, unit.tcp);
>> +    if (ret == RSC_BYPASS) {
>> +        return virtio_net_do_receive(nc, buf, size);
>> +    } else if (ret == RSC_FINAL) {
>> +        return virtio_net_rsc_drain_flow(chain, nc, buf, size,
>> +                ((hdr_len + sizeof(struct eth_header)) + 8),
>> +                VIRTIO_NET_IP6_ADDR_SIZE,
>> +                hdr_len + sizeof(struct eth_header) + sizeof(struct
>> ip6_header),
>> +                VIRTIO_NET_TCP_PORT_SIZE);
>> +    }
>> +
>> +    if (virtio_net_rsc_empty_cache(chain, nc, buf, size)) {
>> +        return size;
>> +    }
>> +
>> +    return virtio_net_rsc_do_coalesce(chain, nc, buf, size, &unit);
>> +}
>> +
>>   static NetRscChain *virtio_net_rsc_lookup_chain(VirtIONet * n,
>>                                                   NetClientState *nc,
>>                                                   uint16_t proto)
>>   {
>>       NetRscChain *chain;
>> -    if (proto != (uint16_t)ETH_P_IP) {
>> +    if ((proto != (uint16_t)ETH_P_IP) && (proto !=
>> (uint16_t)ETH_P_IPV6)) {
>>           return NULL;
>>       }
>> @@ -2125,8 +2252,13 @@ static NetRscChain
>> *virtio_net_rsc_lookup_chain(VirtIONet * n,
>>       chain = g_malloc(sizeof(*chain));
>>       chain->n = n;
>>       chain->proto = proto;
>> -    chain->max_payload = VIRTIO_NET_MAX_IP4_PAYLOAD;
>> -    chain->gso_type = VIRTIO_NET_HDR_GSO_TCPV4;
>> +    if (proto == (uint16_t)ETH_P_IP) {
>> +        chain->max_payload = VIRTIO_NET_MAX_IP4_PAYLOAD;
>> +        chain->gso_type = VIRTIO_NET_HDR_GSO_TCPV4;
>> +    } else {
>> +        chain->max_payload = VIRTIO_NET_MAX_IP6_PAYLOAD;
>> +        chain->gso_type = VIRTIO_NET_HDR_GSO_TCPV6;
>> +    }
>>       chain->drain_timer = timer_new_ns(QEMU_CLOCK_VIRTUAL,
>>                                         virtio_net_rsc_purge, chain);
>>       memset(&chain->stat, 0, sizeof(chain->stat));
>> @@ -2158,7 +2290,11 @@ static ssize_t
>> virtio_net_rsc_receive(NetClientState *nc,
>>           return virtio_net_do_receive(nc, buf, size);
>>       } else {
>>           chain->stat.received++;
>> -        return virtio_net_rsc_receive4(chain, nc, buf, size);
>> +        if (proto == (uint16_t)ETH_P_IP) {
>> +            return virtio_net_rsc_receive4(chain, nc, buf, size);
>> +        } else  {
>> +            return virtio_net_rsc_receive6(chain, nc, buf, size);
>> +        }
>>       }
>>   }
>
>


* Re: [Qemu-devel] [ RFC Patch v6 3/3] virtio-net rsc: add 2 new rsc information fields to 'virtio_net_hdr'
  2016-05-30  5:57   ` Jason Wang
@ 2016-06-02 16:23     ` Wei Xu
  0 siblings, 0 replies; 12+ messages in thread
From: Wei Xu @ 2016-06-02 16:23 UTC (permalink / raw)
  To: Jason Wang, qemu-devel; +Cc: victork, mst, yvugenfi, marcel, dfleytma



On 05/30/2016 13:57, Jason Wang wrote:
>
>
> On 05/29/2016 00:37, wexu@redhat.com wrote:
>> From: Wei Xu <wexu@redhat.com>
>>
>> Field 'coalesced' indicates how many packets were coalesced and field
>> 'dup_ack' indicates how many duplicate acks were merged; the guest driver
>> can use this information to learn the exact pattern of the original
>> traffic on the network.
>>
>> Signed-off-by: Wei Xu <wexu@redhat.com>
>> ---
>>   hw/net/virtio-net.c                         | 8 ++++++++
>>   include/standard-headers/linux/virtio_net.h | 2 ++
>>   2 files changed, 10 insertions(+)
>>
>> diff --git a/hw/net/virtio-net.c b/hw/net/virtio-net.c
>> index cc8cbe4..20f552a 100644
>> --- a/hw/net/virtio-net.c
>> +++ b/hw/net/virtio-net.c
>> @@ -1768,6 +1768,10 @@ static size_t
>> virtio_net_rsc_drain_seg(NetRscChain *chain, NetRscSeg *seg)
>>       if ((chain->proto == ETH_P_IP) && seg->is_coalesced) {
>>           virtio_net_rsc_ipv4_checksum(h, seg->unit.ip);
>>       }
>> +    h->coalesced = seg->packets;
>> +    h->dup_ack = seg->dup_ack;
>> +    h->gso_type = chain->gso_type;
>> +    h->gso_size = chain->max_payload;
>>       ret = virtio_net_do_receive(seg->nc, seg->buf, seg->size);
>>       QTAILQ_REMOVE(&chain->buffers, seg, next);
>>       g_free(seg->buf);
>> @@ -2302,9 +2306,13 @@ static ssize_t
>> virtio_net_receive(NetClientState *nc,
>>                                     const uint8_t *buf, size_t size)
>>   {
>>       VirtIONet *n;
>> +    struct virtio_net_hdr *h;
>>       n = qemu_get_nic_opaque(nc);
>>       if (n->host_features & (1ULL << VIRTIO_NET_F_GUEST_RSC)) {
>> +        h = (struct virtio_net_hdr *)buf;
>> +        h->coalesced = 0;
>> +        h->dup_ack = 0;
>>           return virtio_net_rsc_receive(nc, buf, size);
>>       } else {
>>           return virtio_net_do_receive(nc, buf, size);
>> diff --git a/include/standard-headers/linux/virtio_net.h
>> b/include/standard-headers/linux/virtio_net.h
>> index 5b95762..c837417 100644
>> --- a/include/standard-headers/linux/virtio_net.h
>> +++ b/include/standard-headers/linux/virtio_net.h
>> @@ -114,6 +114,8 @@ struct virtio_net_hdr {
>>       __virtio16 gso_size;        /* Bytes to append to hdr_len per
>> frame */
>>       __virtio16 csum_start;    /* Position to start checksumming from */
>>       __virtio16 csum_offset;    /* Offset after that to place
>> checksum */
>> +    __virtio16 coalesced;   /* packets coalesced by host */
>
> Can we just reuse gso_segs for this?
That's really a good idea. In particular, if we can multiplex the
capability field and the header fields, then I suppose we don't need to
change the spec; according to my latest test, this feature may work if we
don't coalesce any 'dup_ack' packets.
>
>> +    __virtio16 dup_ack;     /* duplicate ack count */
>>   };
>>   /* This is the version of the header to use when the MRG_RXBUF
>
