* [Qemu-devel] [ Patch 0/2] Support Receive-Segment-Offload(RSC) for WHQL test of Window guest
@ 2016-03-15  9:17 wexu
  2016-03-15  9:17 ` [Qemu-devel] [ Patch 1/2] virtio-net rsc: support coalescing ipv4 tcp traffic wexu
                   ` (3 more replies)
  0 siblings, 4 replies; 32+ messages in thread
From: wexu @ 2016-03-15  9:17 UTC (permalink / raw)
  To: qemu-devel; +Cc: victork, mst, jasowang, yvugenfi, Wei Xu, marcel, dfleytma

From: Wei Xu <wexu@redhat.com>

Issues fixed since RFC patch v2:
1. Replaced the long parameter list with 'NetRscUnit'.
2. Handle different virtio header sizes.
3. Changed the callback function to a direct call.
4. Dropped the g_malloc() failure checks, which are not needed.
5. Other code format adjustments, macro naming, etc.

This series adds the support needed to pass the WHQL test for Windows guests.
The feature also benefits other guests, since it works like the kernel 'gro'
feature but is implemented in userspace.
Feature information:
  http://msdn.microsoft.com/en-us/library/windows/hardware/jj853324

Both IPv4 and IPv6 are supported. Although userspace virtio is still slower
than vhost-net, this feature gives userspace virtio about a 1x to 3x
performance improvement when it is turned on and 'tso/gso/gro' are disabled
on the corresponding tap interface and guest interface. The improvement is
smaller with all of these features on.

Test steps:
Although this feature is mainly used for Windows guests, I used Linux guests
to help test it. To keep things simple, I tested the patch in 3 steps as I
moved on:
1. A tcp socket client/server pair running on 2 Linux guests, so that I can
control the traffic and debug the code as needed.
2. Netperf on Linux guests to test the throughput.
3. WHQL test with 2 Windows guests.

Current status:
IPv4 passes all of the above tests.
IPv6 has only passed test steps 1 and 2 described above. In the WHQL test the
virtio nic cannot receive any packet; it looks like the test traffic is not
sent from the support machine, even though the test device can access both the
host and another Linux guest. I tried many ways to work it out but failed;
debugging from the Windows guest driver side may help figure it out.

Note:
Choosing a 'Realtek' nic as the 'MessageDevice' sometimes panics the system
during setup; this can be worked around by replacing it with an 'e1000' nic.

Todo:
More sanity checks, and tests of tcp 'ecn' and 'window' scale.

Wei Xu (2):
  virtio-net rsc: support coalescing ipv4 tcp traffic
  virtio-net rsc: support coalescing ipv6 tcp traffic

 hw/net/virtio-net.c            | 602 ++++++++++++++++++++++++++++++++++++++++-
 include/hw/virtio/virtio-net.h |   1 +
 include/hw/virtio/virtio.h     |  75 +++++
 3 files changed, 677 insertions(+), 1 deletion(-)

-- 
2.5.0


* [Qemu-devel] [ Patch 1/2] virtio-net rsc: support coalescing ipv4 tcp traffic
  2016-03-15  9:17 [Qemu-devel] [ Patch 0/2] Support Receive-Segment-Offload(RSC) for WHQL test of Window guest wexu
@ 2016-03-15  9:17 ` wexu
  2016-03-15 10:00   ` Michael S. Tsirkin
  2016-03-17  8:42   ` Jason Wang
  2016-03-15  9:17 ` [Qemu-devel] [ Patch 2/2] virtio-net rsc: support coalescing ipv6 " wexu
                   ` (2 subsequent siblings)
  3 siblings, 2 replies; 32+ messages in thread
From: wexu @ 2016-03-15  9:17 UTC (permalink / raw)
  To: qemu-devel; +Cc: victork, mst, jasowang, yvugenfi, Wei Xu, marcel, dfleytma

From: Wei Xu <wexu@redhat.com>

All the data packets of a tcp connection are cached in a big buffer during
each receive interval and are sent out by a timer. 'virtio_net_rsc_timeout'
controls the interval, and its value strongly influences the performance and
responsiveness of the tcp connection. 50000 (50us) is an empirical value that
gains a performance improvement. Since the whql test sends packets every
100us, '300000' (300us) is enough to pass the test case; this is also the
default value, and it is going to be made tunable.

The timer is only triggered if the packet pool is not empty, and it drains
all the cached packets.
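
As an illustration of the timer behaviour described above, here is a minimal
standalone sketch. It is not the QEMU code: 'RscPoolModel' and both helper
names are hypothetical, and it assumes that every drain succeeds, so the pool
is always empty after the timer fires.

```c
#include <stdbool.h>
#include <stdint.h>

/* Hypothetical model of the drain timer; the names are illustrative only. */
typedef struct RscPoolModel {
    unsigned cached;      /* segments currently held in the pool */
    bool timer_armed;     /* whether a drain is scheduled */
    int64_t deadline_ns;  /* absolute expiry time, valid while armed */
} RscPoolModel;

/* Cache one segment; arm the timer only when the pool goes from empty to
   non-empty, so an idle pool never schedules a timer. */
static void rsc_model_cache(RscPoolModel *p, int64_t now_ns,
                            int64_t timeout_ns)
{
    if (p->cached == 0) {
        p->timer_armed = true;
        p->deadline_ns = now_ns + timeout_ns;
    }
    p->cached++;
}

/* Timer callback: flush everything and report how many segments were sent.
   Assumes every flush succeeds, so there is nothing left to re-arm for. */
static unsigned rsc_model_expire(RscPoolModel *p)
{
    unsigned flushed = p->cached;

    p->cached = 0;
    p->timer_armed = false;
    return flushed;
}
```

With a 300000 ns timeout, a segment cached at t=1000 ns arms the timer for
t=301000 ns; later segments join the pool without moving the deadline.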

'NetRscChain' is used to save the segments of different protocols in a
VirtIONet device.

The main TCP handler performs the TCP window update, the duplicated ACK check
and the actual data coalescing if the new segment passes the sanity check
and is identified as a 'wanted' one.

A 'wanted' segment means:
1. The segment is within the current window and its sequence number is the
   expected one.
2. The ACK of the segment is in the valid window.
3. If the ACK in the segment is a duplicated one, then the duplicate count
   must be less than 2; this is to notify the upper layer TCP to start
   retransmission as the spec requires.
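
The three criteria above can be sketched with plain unsigned arithmetic. This
is a simplified standalone model, not the code in this patch: the names
'SegVerdict', 'check_seq' and 'check_ack' are hypothetical, all values are
assumed to be in host byte order already, and the window-update case (same
ACK, different window) is left out.

```c
#include <stdint.h>

typedef enum { SEG_COALESCE, SEG_FINAL } SegVerdict;  /* hypothetical */

#define MODEL_MAX_WINDOW 65535u

/* Criterion 1: the new segment must start exactly where the cached data
   ends. Unsigned subtraction keeps the comparison valid across 32-bit
   sequence number wraparound. */
static SegVerdict check_seq(uint32_t cached_seq, uint32_t cached_len,
                            uint32_t new_seq)
{
    uint32_t delta = new_seq - cached_seq;

    if (delta > MODEL_MAX_WINDOW) {
        return SEG_FINAL;       /* out of window, e.g. a retransmission */
    }
    if (delta != cached_len) {
        return SEG_FINAL;       /* inside the window but out of order */
    }
    return SEG_COALESCE;
}

/* Criteria 2 and 3: the ACK must move forward by less than one window, and
   only the first duplicated ACK may be absorbed. */
static SegVerdict check_ack(uint32_t cached_ack, uint32_t new_ack,
                            unsigned *dup_acks)
{
    uint32_t delta = new_ack - cached_ack;

    if (delta >= MODEL_MAX_WINDOW) {
        return SEG_FINAL;       /* ACK outside the valid window */
    }
    if (delta == 0) {
        if (*dup_acks == 0) {
            (*dup_acks)++;
            return SEG_COALESCE;
        }
        return SEG_FINAL;       /* second duplicate goes straight up */
    }
    return SEG_COALESCE;        /* ACK advanced normally */
}
```

A verdict of SEG_FINAL here stands for "stop coalescing and deliver what is
cached"; how that delivery happens is outside the scope of the sketch.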

Sanity check includes:
1. Incorrect version in IP header
2. IP options & IP fragment
3. Not a TCP packet
4. Sanity size check to prevent buffer overflow attack.
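
The checks listed above could look roughly like the following when written
against raw header bytes. This is a standalone sketch under stated
assumptions, not the function in this patch: 'ipv4_tcp_is_candidate' is a
hypothetical name, the buffer is assumed to start at the IPv4 header, and
fragments are detected through the MF bit and the fragment offset, whereas
the patch itself tests the DF bit.

```c
#include <stddef.h>
#include <stdint.h>

enum { IPV4_MIN_HDR = 20, TCP_MIN_HDR = 20, PROTO_TCP = 6 };

/* Return 1 if the buffer looks like an unfragmented, option-free IPv4/TCP
   packet whose length field is consistent with the buffer size, else 0. */
static int ipv4_tcp_is_candidate(const uint8_t *ip, size_t len)
{
    uint16_t frag, total;

    if (len < IPV4_MIN_HDR + TCP_MIN_HDR) {
        return 0;                       /* too short to hold both headers */
    }
    if ((ip[0] >> 4) != 4) {
        return 0;                       /* wrong IP version */
    }
    if ((ip[0] & 0x0F) != 5) {
        return 0;                       /* IHL != 5 means IP options */
    }
    frag = (uint16_t)((ip[6] << 8) | ip[7]);
    if (frag & 0x3FFF) {
        return 0;                       /* MF set or non-zero offset */
    }
    if (ip[9] != PROTO_TCP) {
        return 0;                       /* not TCP */
    }
    total = (uint16_t)((ip[2] << 8) | ip[3]);
    if (total < IPV4_MIN_HDR + TCP_MIN_HDR || total > len) {
        return 0;                       /* length field disagrees with data */
    }
    return 1;
}
```

The last check is the one that guards against overflow: a total length larger
than the received buffer must never be trusted when copying payload.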

There may be more cases that should be considered, such as the ip
identification and other flags, but checking the identification broke the
test because Windows sets it to the same value even when the packet is not a
fragment.

There are normally 2 typical ways to handle a TCP control flag, 'bypass'
and 'finalize'. 'bypass' means the packet should be sent out directly.
'finalize' means the packet should also be bypassed, but only after
searching the pool for packets of the same connection and sending all of
them out; this is to avoid out of order data.

All 'SYN' packets are bypassed since they always begin a new connection.
Other flags such as 'FIN/RST' trigger a finalization, because they normally
appear when a connection is going to be closed. An 'URG' packet also
finalizes the current coalescing unit, though there may be protocol
differences between OSes.
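
The flag handling described above reduces to a small classifier. The sketch
below is standalone and illustrative: 'FlagAction' and 'classify_tcp_flags'
are hypothetical names, and the flag byte is assumed to be the low 8 bits of
the TCP offset/flags field, already in host byte order.

```c
#include <stdint.h>

/* Standard TCP flag bit positions. */
#define F_FIN 0x01
#define F_SYN 0x02
#define F_RST 0x04
#define F_URG 0x20

typedef enum { ACT_BYPASS, ACT_FINALIZE, ACT_COALESCE } FlagAction;

/* SYN is tested first, so a SYN always bypasses the pool even if other
   control bits are set in the same segment. */
static FlagAction classify_tcp_flags(uint8_t flags)
{
    if (flags & F_SYN) {
        return ACT_BYPASS;      /* new connection, nothing cached yet */
    }
    if (flags & (F_FIN | F_RST | F_URG)) {
        return ACT_FINALIZE;    /* drain the same flow, then send this one */
    }
    return ACT_COALESCE;        /* plain data or ACK, candidate for merging */
}
```

For example, a SYN|ACK (0x12) is bypassed, a FIN|ACK (0x11) finalizes the
flow, and a PSH|ACK (0x18) stays a coalescing candidate.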

Statistics can be used to monitor the basic coalescing status. The 'out of
order' and 'out of window' counters show how many packets are retransmitted,
and thus describe the performance intuitively.

Signed-off-by: Wei Xu <wexu@redhat.com>
---
 hw/net/virtio-net.c            | 486 ++++++++++++++++++++++++++++++++++++++++-
 include/hw/virtio/virtio-net.h |   1 +
 include/hw/virtio/virtio.h     |  72 ++++++
 3 files changed, 558 insertions(+), 1 deletion(-)

diff --git a/hw/net/virtio-net.c b/hw/net/virtio-net.c
index 5798f87..c23b45f 100644
--- a/hw/net/virtio-net.c
+++ b/hw/net/virtio-net.c
@@ -15,10 +15,12 @@
 #include "qemu/iov.h"
 #include "hw/virtio/virtio.h"
 #include "net/net.h"
+#include "net/eth.h"
 #include "net/checksum.h"
 #include "net/tap.h"
 #include "qemu/error-report.h"
 #include "qemu/timer.h"
+#include "qemu/sockets.h"
 #include "hw/virtio/virtio-net.h"
 #include "net/vhost_net.h"
 #include "hw/virtio/virtio-bus.h"
@@ -38,6 +40,35 @@
 #define endof(container, field) \
     (offsetof(container, field) + sizeof(((container *)0)->field))
 
+#define ETH_HDR_SZ (sizeof(struct eth_header))
+#define IP4_HDR_SZ (sizeof(struct ip_header))
+#define TCP_HDR_SZ (sizeof(struct tcp_header))
+#define ETH_IP4_HDR_SZ (ETH_HDR_SZ + IP4_HDR_SZ)
+
+#define IP4_ADDR_SIZE   8                   /* ipv4 saddr + daddr */
+#define TCP_PORT_SIZE   4                   /* sport + dport */
+
+/* IPv4 max payload, 16 bits in the header */
+#define MAX_IP4_PAYLOAD (65535 - IP4_HDR_SZ)
+#define MAX_TCP_PAYLOAD 65535
+
+/* max payload with virtio header */
+#define MAX_VIRTIO_PAYLOAD  (sizeof(struct virtio_net_hdr_mrg_rxbuf) \
+                                + ETH_HDR_SZ + MAX_TCP_PAYLOAD)
+
+#define IP4_HEADER_LEN 5 /* header lenght value in ip header without option */
+
+/* Purge coalesced packets timer interval */
+#define RSC_TIMER_INTERVAL  300000
+
+/* Switcher to enable/disable rsc */
+static bool virtio_net_rsc_bypass = 1;
+
+/* This value affects the performance a lot, and should be tuned carefully,
+   '300000'(300us) is the recommended value to pass the WHQL test, '50000' can
+   gain 2x netperf throughput with tso/gso/gro 'off'. */
+static uint32_t virtio_net_rsc_timeout = RSC_TIMER_INTERVAL;
+
 typedef struct VirtIOFeature {
     uint32_t flags;
     size_t end;
@@ -1089,7 +1120,8 @@ static int receive_filter(VirtIONet *n, const uint8_t *buf, int size)
     return 0;
 }
 
-static ssize_t virtio_net_receive(NetClientState *nc, const uint8_t *buf, size_t size)
+static ssize_t virtio_net_do_receive(NetClientState *nc,
+                                  const uint8_t *buf, size_t size)
 {
     VirtIONet *n = qemu_get_nic_opaque(nc);
     VirtIONetQueue *q = virtio_net_get_subqueue(nc);
@@ -1685,6 +1717,456 @@ static int virtio_net_load_device(VirtIODevice *vdev, QEMUFile *f,
     return 0;
 }
 
+static void virtio_net_rsc_extract_unit4(NetRscChain *chain,
+                                         const uint8_t *buf, NetRscUnit* unit)
+{
+    uint16_t ip_hdrlen;
+
+    unit->ip = (struct ip_header *)(buf + chain->hdr_size + ETH_HDR_SZ);
+    ip_hdrlen = ((0xF & unit->ip->ip_ver_len) << 2);
+    unit->ip_plen = &unit->ip->ip_len;
+    unit->tcp = (struct tcp_header *)(((uint8_t *)unit->ip) + ip_hdrlen);
+    unit->tcp_hdrlen = (htons(unit->tcp->th_offset_flags) & 0xF000) >> 10;
+    unit->payload = htons(*unit->ip_plen) - ip_hdrlen - unit->tcp_hdrlen;
+}
+
+static void virtio_net_rsc_ipv4_checksum(struct ip_header *ip)
+{
+    uint32_t sum;
+
+    ip->ip_sum = 0;
+    sum = net_checksum_add_cont(IP4_HDR_SZ, (uint8_t *)ip, 0);
+    ip->ip_sum = cpu_to_be16(net_checksum_finish(sum));
+}
+
+static size_t virtio_net_rsc_drain_seg(NetRscChain *chain, NetRscSeg *seg)
+{
+    int ret;
+
+    virtio_net_rsc_ipv4_checksum(seg->unit.ip);
+    ret = virtio_net_do_receive(seg->nc, seg->buf, seg->size);
+    QTAILQ_REMOVE(&chain->buffers, seg, next);
+    g_free(seg->buf);
+    g_free(seg);
+
+    return ret;
+}
+
+static void virtio_net_rsc_purge(void *opq)
+{
+    NetRscChain *chain = (NetRscChain *)opq;
+    NetRscSeg *seg, *rn;
+
+    QTAILQ_FOREACH_SAFE(seg, &chain->buffers, next, rn) {
+        if (virtio_net_rsc_drain_seg(chain, seg) == 0) {
+            chain->stat.purge_failed++;
+            continue;
+        }
+    }
+
+    chain->stat.timer++;
+    if (!QTAILQ_EMPTY(&chain->buffers)) {
+        timer_mod(chain->drain_timer,
+              qemu_clock_get_ns(QEMU_CLOCK_VIRTUAL) + virtio_net_rsc_timeout);
+    }
+}
+
+static void virtio_net_rsc_cleanup(VirtIONet *n)
+{
+    NetRscChain *chain, *rn_chain;
+    NetRscSeg *seg, *rn_seg;
+
+    QTAILQ_FOREACH_SAFE(chain, &n->rsc_chains, next, rn_chain) {
+        QTAILQ_FOREACH_SAFE(seg, &chain->buffers, next, rn_seg) {
+            QTAILQ_REMOVE(&chain->buffers, seg, next);
+            g_free(seg->buf);
+            g_free(seg);
+        }
+
+        timer_del(chain->drain_timer);
+        timer_free(chain->drain_timer);
+        QTAILQ_REMOVE(&n->rsc_chains, chain, next);
+        g_free(chain);
+    }
+}
+
+static void virtio_net_rsc_cache_buf(NetRscChain *chain, NetClientState *nc,
+                                     const uint8_t *buf, size_t size)
+{
+    NetRscSeg *seg;
+
+    seg = g_malloc(sizeof(NetRscSeg));
+    seg->buf = g_malloc(MAX_VIRTIO_PAYLOAD);
+
+    memmove(seg->buf, buf, size);
+    seg->size = size;
+    seg->dup_ack_count = 0;
+    seg->is_coalesced = 0;
+    seg->nc = nc;
+
+    QTAILQ_INSERT_TAIL(&chain->buffers, seg, next);
+    chain->stat.cache++;
+
+    virtio_net_rsc_extract_unit4(chain, seg->buf, &seg->unit);
+}
+
+static int32_t virtio_net_rsc_handle_ack(NetRscChain *chain, NetRscSeg *seg,
+                                 const uint8_t *buf, struct tcp_header *n_tcp,
+                                 struct tcp_header *o_tcp)
+{
+    uint32_t nack, oack;
+    uint16_t nwin, owin;
+
+    nack = htonl(n_tcp->th_ack);
+    nwin = htons(n_tcp->th_win);
+    oack = htonl(o_tcp->th_ack);
+    owin = htons(o_tcp->th_win);
+
+    if ((nack - oack) >= MAX_TCP_PAYLOAD) {
+        chain->stat.ack_out_of_win++;
+        return RSC_FINAL;
+    } else if (nack == oack) {
+        /* duplicated ack or window probe */
+        if (nwin == owin) {
+            /* duplicated ack, add dup ack count due to whql test up to 1 */
+            chain->stat.dup_ack++;
+
+            if (seg->dup_ack_count == 0) {
+                seg->dup_ack_count++;
+                chain->stat.dup_ack1++;
+                return RSC_COALESCE;
+            } else {
+                /* Spec says should send it directly */
+                chain->stat.dup_ack2++;
+                return RSC_FINAL;
+            }
+        } else {
+            /* Coalesce window update */
+            o_tcp->th_win = n_tcp->th_win;
+            chain->stat.win_update++;
+            return RSC_COALESCE;
+        }
+    } else {
+        /* pure ack, update ack */
+        o_tcp->th_ack = n_tcp->th_ack;
+        chain->stat.pure_ack++;
+        return RSC_COALESCE;
+    }
+}
+
+static int32_t virtio_net_rsc_coalesce_data(NetRscChain *chain, NetRscSeg *seg,
+                                    const uint8_t *buf, NetRscUnit *n_unit)
+{
+    void *data;
+    uint16_t o_ip_len;
+    uint32_t nseq, oseq;
+    NetRscUnit *o_unit;
+
+    o_unit = &seg->unit;
+    o_ip_len = htons(*o_unit->ip_plen);
+    nseq = htonl(n_unit->tcp->th_seq);
+    oseq = htonl(o_unit->tcp->th_seq);
+
+    if (n_unit->tcp_hdrlen > TCP_HDR_SZ) {
+        /* Log this only for debugging observation */
+        chain->stat.tcp_option++;
+    }
+
+    /* Ignore packet with more/larger tcp options */
+    if (n_unit->tcp_hdrlen > o_unit->tcp_hdrlen) {
+        chain->stat.tcp_larger_option++;
+        return RSC_FINAL;
+    }
+
+    /* out of order or retransmitted. */
+    if ((nseq - oseq) > MAX_TCP_PAYLOAD) {
+        chain->stat.data_out_of_win++;
+        return RSC_FINAL;
+    }
+
+    data = ((uint8_t *)n_unit->tcp) + n_unit->tcp_hdrlen;
+    if (nseq == oseq) {
+        if ((0 == o_unit->payload) && n_unit->payload) {
+            /* From no payload to payload, normal case, not a dup ack or etc */
+            chain->stat.data_after_pure_ack++;
+            goto coalesce;
+        } else {
+            return virtio_net_rsc_handle_ack(chain, seg, buf,
+                                             n_unit->tcp, o_unit->tcp);
+        }
+    } else if ((nseq - oseq) != o_unit->payload) {
+        /* Not a consistent packet, out of order */
+        chain->stat.data_out_of_order++;
+        return RSC_FINAL;
+    } else {
+coalesce:
+        if ((o_ip_len + n_unit->payload) > chain->max_payload) {
+            chain->stat.over_size++;
+            return RSC_FINAL;
+        }
+
+        /* Here comes the right data, the payload length in v4/v6 is different,
+           so use the field value to update and record the new data len */
+        o_unit->payload += n_unit->payload; /* update new data len */
+
+        /* update field in ip header */
+        *o_unit->ip_plen = htons(o_ip_len + n_unit->payload);
+
+        /* Bring 'PUSH' big, the whql test guide says 'PUSH' can be coalesced
+           for windows guest, while this may change the behavior for linux
+           guest. */
+        o_unit->tcp->th_offset_flags = n_unit->tcp->th_offset_flags;
+
+        o_unit->tcp->th_ack = n_unit->tcp->th_ack;
+        o_unit->tcp->th_win = n_unit->tcp->th_win;
+
+        memmove(seg->buf + seg->size, data, n_unit->payload);
+        seg->size += n_unit->payload;
+        chain->stat.coalesced++;
+        return RSC_COALESCE;
+    }
+}
+
+static int32_t virtio_net_rsc_coalesce4(NetRscChain *chain, NetRscSeg *seg,
+                        const uint8_t *buf, size_t size, NetRscUnit *unit)
+{
+    if ((unit->ip->ip_src ^ seg->unit.ip->ip_src)
+        || (unit->ip->ip_dst ^ seg->unit.ip->ip_dst)
+        || (unit->tcp->th_sport ^ seg->unit.tcp->th_sport)
+        || (unit->tcp->th_dport ^ seg->unit.tcp->th_dport)) {
+        chain->stat.no_match++;
+        return RSC_NO_MATCH;
+    }
+
+    return virtio_net_rsc_coalesce_data(chain, seg, buf, unit);
+}
+
+/* Pakcets with 'SYN' should bypass, other flag should be sent after drain
+ * to prevent out of order */
+static int virtio_net_rsc_tcp_ctrl_check(NetRscChain *chain,
+                                         struct tcp_header *tcp)
+{
+    uint16_t tcp_flag;
+
+    tcp_flag = htons(tcp->th_offset_flags) & 0x3F;
+    if (tcp_flag & TH_SYN) {
+        chain->stat.tcp_syn++;
+        return RSC_BYPASS;
+    }
+
+    if (tcp_flag & (TH_FIN | TH_URG | TH_RST)) {
+        chain->stat.tcp_ctrl_drain++;
+        return RSC_FINAL;
+    }
+
+    return RSC_WANT;
+}
+
+static bool virtio_net_rsc_empty_cache(NetRscChain *chain, NetClientState *nc,
+                          const uint8_t *buf, size_t size)
+{
+    if (QTAILQ_EMPTY(&chain->buffers)) {
+        chain->stat.empty_cache++;
+        virtio_net_rsc_cache_buf(chain, nc, buf, size);
+        timer_mod(chain->drain_timer,
+              qemu_clock_get_ns(QEMU_CLOCK_VIRTUAL) + virtio_net_rsc_timeout);
+        return 1;
+    }
+
+    return 0;
+}
+
+static size_t virtio_net_rsc_do_coalesce(NetRscChain *chain, NetClientState *nc,
+                          const uint8_t *buf, size_t size, NetRscUnit *unit)
+{
+    int ret;
+    NetRscSeg *seg, *nseg;
+
+    QTAILQ_FOREACH_SAFE(seg, &chain->buffers, next, nseg) {
+        ret = virtio_net_rsc_coalesce4(chain, seg, buf, size, unit);
+
+        if (ret == RSC_FINAL) {
+            if (virtio_net_rsc_drain_seg(chain, seg) == 0) {
+                /* Send failed */
+                chain->stat.final_failed++;
+                return 0;
+            }
+
+            /* Send current packet */
+            return virtio_net_do_receive(nc, buf, size);
+        } else if (ret == RSC_NO_MATCH) {
+            continue;
+        } else {
+            /* Coalesced, mark coalesced flag to tell calc cksum for ipv4 */
+            seg->is_coalesced = 1;
+            return size;
+        }
+    }
+
+    chain->stat.no_match_cache++;
+    virtio_net_rsc_cache_buf(chain, nc, buf, size);
+    return size;
+}
+
+/* Drain a connection data, this is to avoid out of order segments */
+static size_t virtio_net_rsc_drain_flow(NetRscChain *chain, NetClientState *nc,
+                        const uint8_t *buf, size_t size, uint16_t ip_start,
+                        uint16_t ip_size, uint16_t tcp_port, uint16_t port_size)
+{
+    NetRscSeg *seg, *nseg;
+
+    QTAILQ_FOREACH_SAFE(seg, &chain->buffers, next, nseg) {
+        if (memcmp(buf + ip_start, seg->buf + ip_start, ip_size)
+            || memcmp(buf + tcp_port, seg->buf + tcp_port, port_size)) {
+            continue;
+        }
+        if (virtio_net_rsc_drain_seg(chain, seg) == 0) {
+            chain->stat.drain_failed++;
+        }
+
+        break;
+    }
+
+    return virtio_net_do_receive(nc, buf, size);
+}
+
+static int32_t virtio_net_rsc_sanity_check4(NetRscChain *chain,
+                        struct ip_header *ip, const uint8_t *buf, size_t size)
+{
+    uint16_t ip_len;
+
+    if (size < (chain->hdr_size + ETH_IP4_HDR_SZ + TCP_HDR_SZ)) {
+        return RSC_BYPASS;
+    }
+
+    /* Not an ipv4 one */
+    if (((0xF0 & ip->ip_ver_len) >> 4) != IP_HEADER_VERSION_4) {
+        chain->stat.ip_option++;
+        return RSC_BYPASS;
+    }
+
+    /* Don't handle packets with ip option */
+    if (IP4_HEADER_LEN != (0xF & ip->ip_ver_len)) {
+        chain->stat.ip_option++;
+        return RSC_BYPASS;
+    }
+
+    /* Don't handle packets with ip fragment */
+    if (!(htons(ip->ip_off) & IP_DF)) {
+        chain->stat.ip_frag++;
+        return RSC_BYPASS;
+    }
+
+    if (ip->ip_p != IPPROTO_TCP) {
+        chain->stat.bypass_not_tcp++;
+        return RSC_BYPASS;
+    }
+
+    /* Sanity check */
+    ip_len = htons(ip->ip_len);
+    if (ip_len < (IP4_HDR_SZ + TCP_HDR_SZ)
+        || ip_len > (size - chain->hdr_size - ETH_HDR_SZ)) {
+        chain->stat.ip_hacked++;
+        return RSC_BYPASS;
+    }
+
+    return RSC_WANT;
+}
+
+static size_t virtio_net_rsc_receive4(void *opq, NetClientState* nc,
+                                      const uint8_t *buf, size_t size)
+{
+    int32_t ret;
+    NetRscChain *chain;
+    NetRscUnit unit;
+
+    chain = (NetRscChain *)opq;
+    virtio_net_rsc_extract_unit4(chain, buf, &unit);
+    if (RSC_WANT != virtio_net_rsc_sanity_check4(chain, unit.ip, buf, size)) {
+        return virtio_net_do_receive(nc, buf, size);
+    }
+
+    ret = virtio_net_rsc_tcp_ctrl_check(chain, unit.tcp);
+    if (ret == RSC_BYPASS) {
+        return virtio_net_do_receive(nc, buf, size);
+    } else if (ret == RSC_FINAL) {
+        return virtio_net_rsc_drain_flow(chain, nc, buf, size,
+                        ((chain->hdr_size + ETH_HDR_SZ) + 12), IP4_ADDR_SIZE,
+                        (chain->hdr_size + ETH_IP4_HDR_SZ), TCP_PORT_SIZE);
+    }
+
+    if (virtio_net_rsc_empty_cache(chain, nc, buf, size)) {
+        return size;
+    }
+
+    return virtio_net_rsc_do_coalesce(chain, nc, buf, size, &unit);
+}
+
+static NetRscChain *virtio_net_rsc_lookup_chain(VirtIONet * n,
+                                            NetClientState *nc, uint16_t proto)
+{
+    NetRscChain *chain;
+
+    /* Only handle IPv4/6 */
+    if (proto != (uint16_t)ETH_P_IP) {
+        return NULL;
+    }
+
+    QTAILQ_FOREACH(chain, &n->rsc_chains, next) {
+        if (chain->proto == proto) {
+            return chain;
+        }
+    }
+
+    chain = g_malloc(sizeof(*chain));
+    chain->hdr_size = n->guest_hdr_len;
+    chain->proto = proto;
+    chain->max_payload = MAX_IP4_PAYLOAD;
+    chain->drain_timer = timer_new_ns(QEMU_CLOCK_VIRTUAL,
+                                      virtio_net_rsc_purge, chain);
+    memset(&chain->stat, 0, sizeof(chain->stat));
+
+    QTAILQ_INIT(&chain->buffers);
+    QTAILQ_INSERT_TAIL(&n->rsc_chains, chain, next);
+
+    return chain;
+}
+
+static ssize_t virtio_net_rsc_receive(NetClientState *nc,
+                                      const uint8_t *buf, size_t size)
+{
+    uint16_t proto;
+    NetRscChain *chain;
+    struct eth_header *eth;
+    VirtIONet *n;
+
+    n = qemu_get_nic_opaque(nc);
+    if (size < (n->guest_hdr_len + ETH_HDR_SZ)) {
+        return virtio_net_do_receive(nc, buf, size);
+    }
+
+    eth = (struct eth_header *)(buf + n->guest_hdr_len);
+    proto = htons(eth->h_proto);
+
+    chain = virtio_net_rsc_lookup_chain(n, nc, proto);
+    if (!chain) {
+        return virtio_net_do_receive(nc, buf, size);
+    } else {
+        chain->stat.received++;
+        return virtio_net_rsc_receive4(chain, nc, buf, size);
+    }
+}
+
+static ssize_t virtio_net_receive(NetClientState *nc,
+                                  const uint8_t *buf, size_t size)
+{
+    if (virtio_net_rsc_bypass) {
+        return virtio_net_do_receive(nc, buf, size);
+    } else {
+        return virtio_net_rsc_receive(nc, buf, size);
+    }
+}
+
 static NetClientInfo net_virtio_info = {
     .type = NET_CLIENT_OPTIONS_KIND_NIC,
     .size = sizeof(NICState),
@@ -1814,6 +2296,7 @@ static void virtio_net_device_realize(DeviceState *dev, Error **errp)
     nc = qemu_get_queue(n->nic);
     nc->rxfilter_notify_enabled = 1;
 
+    QTAILQ_INIT(&n->rsc_chains);
     n->qdev = dev;
     register_savevm(dev, "virtio-net", -1, VIRTIO_NET_VM_VERSION,
                     virtio_net_save, virtio_net_load, n);
@@ -1848,6 +2331,7 @@ static void virtio_net_device_unrealize(DeviceState *dev, Error **errp)
     g_free(n->vqs);
     qemu_del_nic(n->nic);
     virtio_cleanup(vdev);
+    virtio_net_rsc_cleanup(n);
 }
 
 static void virtio_net_instance_init(Object *obj)
diff --git a/include/hw/virtio/virtio-net.h b/include/hw/virtio/virtio-net.h
index 0cabdb6..6939e92 100644
--- a/include/hw/virtio/virtio-net.h
+++ b/include/hw/virtio/virtio-net.h
@@ -59,6 +59,7 @@ typedef struct VirtIONet {
     VirtIONetQueue *vqs;
     VirtQueue *ctrl_vq;
     NICState *nic;
+    QTAILQ_HEAD(, NetRscChain) rsc_chains;
     uint32_t tx_timeout;
     int32_t tx_burst;
     uint32_t has_vnet_hdr;
diff --git a/include/hw/virtio/virtio.h b/include/hw/virtio/virtio.h
index 2b5b248..3b1dfa8 100644
--- a/include/hw/virtio/virtio.h
+++ b/include/hw/virtio/virtio.h
@@ -128,6 +128,78 @@ typedef struct VirtioDeviceClass {
     int (*load)(VirtIODevice *vdev, QEMUFile *f, int version_id);
 } VirtioDeviceClass;
 
+/* Coalesced packets type & status */
+typedef enum {
+    RSC_COALESCE,           /* Data been coalesced */
+    RSC_FINAL,              /* Will terminate current connection */
+    RSC_NO_MATCH,           /* No matched in the buffer pool */
+    RSC_BYPASS,             /* Packet to be bypass, not tcp, tcp ctrl, etc */
+    RSC_WANT                /* Data want to be coalesced */
+} COALESCE_STATUS;
+
+typedef struct NetRscStat {
+    uint32_t received;
+    uint32_t coalesced;
+    uint32_t over_size;
+    uint32_t cache;
+    uint32_t empty_cache;
+    uint32_t no_match_cache;
+    uint32_t win_update;
+    uint32_t no_match;
+    uint32_t tcp_syn;
+    uint32_t tcp_ctrl_drain;
+    uint32_t dup_ack;
+    uint32_t dup_ack1;
+    uint32_t dup_ack2;
+    uint32_t pure_ack;
+    uint32_t ack_out_of_win;
+    uint32_t data_out_of_win;
+    uint32_t data_out_of_order;
+    uint32_t data_after_pure_ack;
+    uint32_t bypass_not_tcp;
+    uint32_t tcp_option;
+    uint32_t tcp_larger_option;
+    uint32_t ip_frag;
+    uint32_t ip_hacked;
+    uint32_t ip_option;
+    uint32_t purge_failed;
+    uint32_t drain_failed;
+    uint32_t final_failed;
+    int64_t  timer;
+} NetRscStat;
+
+/* Rsc unit general info used to checking if can coalescing */
+typedef struct NetRscUnit {
+   struct ip_header *ip;   /* ip header */
+   uint16_t *ip_plen;      /* data len pointer in ip header field */
+   struct tcp_header *tcp; /* tcp header */
+   uint16_t tcp_hdrlen;    /* tcp header len */
+   uint16_t payload;       /* pure payload without virtio/eth/ip/tcp */
+} NetRscUnit;
+
+/* Coalesced segmant */
+typedef struct NetRscSeg {
+    QTAILQ_ENTRY(NetRscSeg) next;
+    void *buf;
+    size_t size;
+    uint32_t dup_ack_count;
+    bool is_coalesced;      /* need recal ipv4 header checksum, mark here */
+    NetRscUnit unit;
+    NetClientState *nc;
+} NetRscSeg;
+
+
+/* Chain is divided by protocol(ipv4/v6) and NetClientInfo */
+typedef struct NetRscChain {
+   QTAILQ_ENTRY(NetRscChain) next;
+   uint16_t hdr_size;
+   uint16_t proto;
+   uint16_t max_payload;
+   QEMUTimer *drain_timer;
+   QTAILQ_HEAD(, NetRscSeg) buffers;
+   NetRscStat stat;
+} NetRscChain;
+
 void virtio_instance_init_common(Object *proxy_obj, void *data,
                                  size_t vdev_size, const char *vdev_name);
 
-- 
2.5.0


* [Qemu-devel] [ Patch 2/2] virtio-net rsc: support coalescing ipv6 tcp traffic
  2016-03-15  9:17 [Qemu-devel] [ Patch 0/2] Support Receive-Segment-Offload(RSC) for WHQL test of Window guest wexu
  2016-03-15  9:17 ` [Qemu-devel] [ Patch 1/2] virtio-net rsc: support coalescing ipv4 tcp traffic wexu
@ 2016-03-15  9:17 ` wexu
  2016-03-17  8:50   ` Jason Wang
  2016-03-15 10:01 ` [Qemu-devel] [ Patch 0/2] Support Receive-Segment-Offload(RSC) for WHQL test of Window guest Michael S. Tsirkin
  2016-03-17  6:47 ` Jason Wang
  3 siblings, 1 reply; 32+ messages in thread
From: wexu @ 2016-03-15  9:17 UTC (permalink / raw)
  To: qemu-devel; +Cc: victork, mst, jasowang, yvugenfi, Wei Xu, marcel, dfleytma

From: Wei Xu <wexu@redhat.com>

Most things are like ipv4, except for one significant difference between ipv4
and ipv6: the length field in the ipv4 header includes the header itself,
while in ipv6 it does not, which means ipv6 can carry a real '65535' unit.
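
The difference can be shown with two small helpers. This is an illustrative
standalone sketch: both function names are hypothetical, all lengths are
assumed to be in host byte order, and the ipv6 case assumes there are no
extension headers between the ipv6 header and the tcp header.

```c
#include <stdint.h>

/* ipv4: the total length field covers the ip header, the tcp header and the
   data, so both header lengths are subtracted. */
static uint16_t tcp_payload_v4(uint16_t total_len, uint16_t ip_hdr_len,
                               uint16_t tcp_hdr_len)
{
    return (uint16_t)(total_len - ip_hdr_len - tcp_hdr_len);
}

/* ipv6: the payload length field already excludes the fixed 40-byte ipv6
   header, so only the tcp header length is subtracted. */
static uint16_t tcp_payload_v6(uint16_t payload_len, uint16_t tcp_hdr_len)
{
    return (uint16_t)(payload_len - tcp_hdr_len);
}
```

With the maximum field value of 65535 and 20-byte headers, ipv4 leaves 65495
bytes of tcp data while ipv6 leaves 65515.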

Signed-off-by: Wei Xu <wexu@redhat.com>
---
 hw/net/virtio-net.c        | 146 ++++++++++++++++++++++++++++++++++++++++-----
 include/hw/virtio/virtio.h |   5 +-
 2 files changed, 135 insertions(+), 16 deletions(-)

diff --git a/hw/net/virtio-net.c b/hw/net/virtio-net.c
index c23b45f..ef61b74 100644
--- a/hw/net/virtio-net.c
+++ b/hw/net/virtio-net.c
@@ -52,9 +52,14 @@
 #define MAX_IP4_PAYLOAD (65535 - IP4_HDR_SZ)
 #define MAX_TCP_PAYLOAD 65535
 
-/* max payload with virtio header */
+#define IP6_HDR_SZ (sizeof(struct ip6_header))
+#define ETH_IP6_HDR_SZ (ETH_HDR_SZ + IP6_HDR_SZ)
+#define IP6_ADDR_SIZE   32      /* ipv6 saddr + daddr */
+#define MAX_IP6_PAYLOAD MAX_TCP_PAYLOAD
+
+/* ip6 max payload, payload in ipv6 don't include the  header */
 #define MAX_VIRTIO_PAYLOAD  (sizeof(struct virtio_net_hdr_mrg_rxbuf) \
-                                + ETH_HDR_SZ + MAX_TCP_PAYLOAD)
+                                + ETH_IP6_HDR_SZ + MAX_IP6_PAYLOAD)
 
 #define IP4_HEADER_LEN 5 /* header lenght value in ip header without option */
 
@@ -1722,14 +1727,27 @@ static void virtio_net_rsc_extract_unit4(NetRscChain *chain,
 {
     uint16_t ip_hdrlen;
 
-    unit->ip = (struct ip_header *)(buf + chain->hdr_size + ETH_HDR_SZ);
-    ip_hdrlen = ((0xF & unit->ip->ip_ver_len) << 2);
-    unit->ip_plen = &unit->ip->ip_len;
-    unit->tcp = (struct tcp_header *)(((uint8_t *)unit->ip) + ip_hdrlen);
+    unit->u_ip.ip = (struct ip_header *)(buf + chain->hdr_size + ETH_HDR_SZ);
+    ip_hdrlen = ((0xF & unit->u_ip.ip->ip_ver_len) << 2);
+    unit->ip_plen = &unit->u_ip.ip->ip_len;
+    unit->tcp = (struct tcp_header *)(((uint8_t *)unit->u_ip.ip) + ip_hdrlen);
     unit->tcp_hdrlen = (htons(unit->tcp->th_offset_flags) & 0xF000) >> 10;
     unit->payload = htons(*unit->ip_plen) - ip_hdrlen - unit->tcp_hdrlen;
 }
 
+static void virtio_net_rsc_extract_unit6(NetRscChain *chain,
+                                         const uint8_t *buf, NetRscUnit* unit)
+{
+    unit->u_ip.ip6 = (struct ip6_header *)(buf + chain->hdr_size + ETH_HDR_SZ);
+    unit->ip_plen = &(unit->u_ip.ip6->ip6_ctlun.ip6_un1.ip6_un1_plen);
+    unit->tcp = (struct tcp_header *)(((uint8_t *)unit->u_ip.ip6)\
+                                    + IP6_HDR_SZ);
+    unit->tcp_hdrlen = (htons(unit->tcp->th_offset_flags) & 0xF000) >> 10;
+    /* There is a difference between payload length in ipv4 and v6,
+       ip header is excluded in ipv6 */
+    unit->payload = htons(*unit->ip_plen) - unit->tcp_hdrlen;
+}
+
 static void virtio_net_rsc_ipv4_checksum(struct ip_header *ip)
 {
     uint32_t sum;
@@ -1743,7 +1761,10 @@ static size_t virtio_net_rsc_drain_seg(NetRscChain *chain, NetRscSeg *seg)
 {
     int ret;
 
-    virtio_net_rsc_ipv4_checksum(seg->unit.ip);
+    if ((chain->proto == ETH_P_IP) && seg->is_coalesced) {
+        virtio_net_rsc_ipv4_checksum(seg->unit.u_ip.ip);
+    }
+
     ret = virtio_net_do_receive(seg->nc, seg->buf, seg->size);
     QTAILQ_REMOVE(&chain->buffers, seg, next);
     g_free(seg->buf);
@@ -1807,7 +1828,11 @@ static void virtio_net_rsc_cache_buf(NetRscChain *chain, NetClientState *nc,
     QTAILQ_INSERT_TAIL(&chain->buffers, seg, next);
     chain->stat.cache++;
 
-    virtio_net_rsc_extract_unit4(chain, seg->buf, &seg->unit);
+    if (chain->proto == ETH_P_IP) {
+        virtio_net_rsc_extract_unit4(chain, seg->buf, &seg->unit);
+    } else {
+        virtio_net_rsc_extract_unit6(chain, seg->buf, &seg->unit);
+    }
 }
 
 static int32_t virtio_net_rsc_handle_ack(NetRscChain *chain, NetRscSeg *seg,
@@ -1930,8 +1955,8 @@ coalesce:
 static int32_t virtio_net_rsc_coalesce4(NetRscChain *chain, NetRscSeg *seg,
                         const uint8_t *buf, size_t size, NetRscUnit *unit)
 {
-    if ((unit->ip->ip_src ^ seg->unit.ip->ip_src)
-        || (unit->ip->ip_dst ^ seg->unit.ip->ip_dst)
+    if ((unit->u_ip.ip->ip_src ^ seg->unit.u_ip.ip->ip_src)
+        || (unit->u_ip.ip->ip_dst ^ seg->unit.u_ip.ip->ip_dst)
         || (unit->tcp->th_sport ^ seg->unit.tcp->th_sport)
         || (unit->tcp->th_dport ^ seg->unit.tcp->th_dport)) {
         chain->stat.no_match++;
@@ -1941,6 +1966,22 @@ static int32_t virtio_net_rsc_coalesce4(NetRscChain *chain, NetRscSeg *seg,
     return virtio_net_rsc_coalesce_data(chain, seg, buf, unit);
 }
 
+static int32_t virtio_net_rsc_coalesce6(NetRscChain *chain, NetRscSeg *seg,
+                        const uint8_t *buf, size_t size, NetRscUnit *unit)
+{
+    if (memcmp(&unit->u_ip.ip6->ip6_src, &seg->unit.u_ip.ip6->ip6_src,
+               sizeof(struct in6_address))
+        || memcmp(&unit->u_ip.ip6->ip6_dst, &seg->unit.u_ip.ip6->ip6_dst,
+                  sizeof(struct in6_address))
+        || (unit->tcp->th_sport ^ seg->unit.tcp->th_sport)
+        || (unit->tcp->th_dport ^ seg->unit.tcp->th_dport)) {
+        chain->stat.no_match++;
+        return RSC_NO_MATCH;
+    }
+
+    return virtio_net_rsc_coalesce_data(chain, seg, buf, unit);
+}
+
 /* Pakcets with 'SYN' should bypass, other flag should be sent after drain
  * to prevent out of order */
 static int virtio_net_rsc_tcp_ctrl_check(NetRscChain *chain,
@@ -1983,7 +2024,11 @@ static size_t virtio_net_rsc_do_coalesce(NetRscChain *chain, NetClientState *nc,
     NetRscSeg *seg, *nseg;
 
     QTAILQ_FOREACH_SAFE(seg, &chain->buffers, next, nseg) {
-        ret = virtio_net_rsc_coalesce4(chain, seg, buf, size, unit);
+        if (chain->proto == ETH_P_IP) {
+            ret = virtio_net_rsc_coalesce4(chain, seg, buf, size, unit);
+        } else {
+            ret = virtio_net_rsc_coalesce6(chain, seg, buf, size, unit);
+        }
 
         if (ret == RSC_FINAL) {
             if (virtio_net_rsc_drain_seg(chain, seg) == 0) {
@@ -2082,7 +2127,8 @@ static size_t virtio_net_rsc_receive4(void *opq, NetClientState* nc,
 
     chain = (NetRscChain *)opq;
     virtio_net_rsc_extract_unit4(chain, buf, &unit);
-    if (RSC_WANT != virtio_net_rsc_sanity_check4(chain, unit.ip, buf, size)) {
+    if (RSC_WANT != virtio_net_rsc_sanity_check4(chain,
+                                                 unit.u_ip.ip, buf, size)) {
         return virtio_net_do_receive(nc, buf, size);
     }
 
@@ -2102,13 +2148,74 @@ static size_t virtio_net_rsc_receive4(void *opq, NetClientState* nc,
     return virtio_net_rsc_do_coalesce(chain, nc, buf, size, &unit);
 }
 
+static int32_t virtio_net_rsc_sanity_check6(NetRscChain *chain,
+                        struct ip6_header *ip, const uint8_t *buf, size_t size)
+{
+    uint16_t ip_len;
+
+    if (size < (chain->hdr_size + ETH_IP6_HDR_SZ + TCP_HDR_SZ)) {
+        return RSC_BYPASS;
+    }
+
+    if (((0xF0 & ip->ip6_ctlun.ip6_un1.ip6_un1_flow) >> 4)
+        != IP_HEADER_VERSION_6) {
+        return RSC_BYPASS;
+    }
+
+    /* Both option and protocol is checked in this */
+    if (ip->ip6_ctlun.ip6_un1.ip6_un1_nxt != IPPROTO_TCP) {
+        chain->stat.bypass_not_tcp++;
+        return RSC_BYPASS;
+    }
+
+    /* Sanity check */
+    ip_len = htons(ip->ip6_ctlun.ip6_un1.ip6_un1_plen);
+    if (ip_len < TCP_HDR_SZ
+        || ip_len > (size - chain->hdr_size - ETH_IP6_HDR_SZ)) {
+        chain->stat.ip_hacked++;
+        return RSC_BYPASS;
+    }
+
+    return RSC_WANT;
+}
+
+static size_t virtio_net_rsc_receive6(void *opq, NetClientState* nc,
+                                      const uint8_t *buf, size_t size)
+{
+    int32_t ret;
+    NetRscChain *chain;
+    NetRscUnit unit;
+
+    chain = (NetRscChain *)opq;
+    virtio_net_rsc_extract_unit6(chain, buf, &unit);
+    if (RSC_WANT != virtio_net_rsc_sanity_check6(chain,
+                                                 unit.u_ip.ip6, buf, size)) {
+        return virtio_net_do_receive(nc, buf, size);
+    }
+
+    ret = virtio_net_rsc_tcp_ctrl_check(chain, unit.tcp);
+    if (ret == RSC_BYPASS) {
+        return virtio_net_do_receive(nc, buf, size);
+    } else if (ret == RSC_FINAL) {
+        return virtio_net_rsc_drain_flow(chain, nc, buf, size,
+                        ((chain->hdr_size + ETH_HDR_SZ) + 8), IP6_ADDR_SIZE,
+                        (chain->hdr_size + ETH_IP6_HDR_SZ), TCP_PORT_SIZE);
+    }
+
+    if (virtio_net_rsc_empty_cache(chain, nc, buf, size)) {
+        return size;
+    }
+
+    return virtio_net_rsc_do_coalesce(chain, nc, buf, size, &unit);
+}
+
 static NetRscChain *virtio_net_rsc_lookup_chain(VirtIONet * n,
                                             NetClientState *nc, uint16_t proto)
 {
     NetRscChain *chain;
 
     /* Only handle IPv4/6 */
-    if (proto != (uint16_t)ETH_P_IP) {
+    if ((proto != (uint16_t)ETH_P_IP) && (proto != (uint16_t)ETH_P_IPV6)) {
         return NULL;
     }
 
@@ -2121,7 +2228,11 @@ static NetRscChain *virtio_net_rsc_lookup_chain(VirtIONet * n,
     chain = g_malloc(sizeof(*chain));
     chain->hdr_size = n->guest_hdr_len;
     chain->proto = proto;
-    chain->max_payload = MAX_IP4_PAYLOAD;
+    if (proto == (uint16_t)ETH_P_IP) {
+        chain->max_payload = MAX_IP4_PAYLOAD;
+    } else {
+        chain->max_payload = MAX_IP6_PAYLOAD;
+    }
     chain->drain_timer = timer_new_ns(QEMU_CLOCK_VIRTUAL,
                                       virtio_net_rsc_purge, chain);
     memset(&chain->stat, 0, sizeof(chain->stat));
@@ -2153,7 +2264,12 @@ static ssize_t virtio_net_rsc_receive(NetClientState *nc,
         return virtio_net_do_receive(nc, buf, size);
     } else {
         chain->stat.received++;
-        return virtio_net_rsc_receive4(chain, nc, buf, size);
+
+        if (proto == (uint16_t)ETH_P_IP) {
+            return virtio_net_rsc_receive4(chain, nc, buf, size);
+        } else {
+            return virtio_net_rsc_receive6(chain, nc, buf, size);
+        }
     }
 }
 
diff --git a/include/hw/virtio/virtio.h b/include/hw/virtio/virtio.h
index 3b1dfa8..13d20a4 100644
--- a/include/hw/virtio/virtio.h
+++ b/include/hw/virtio/virtio.h
@@ -170,7 +170,10 @@ typedef struct NetRscStat {
 
 /* Rsc unit general info used to checking if can coalescing */
 typedef struct NetRscUnit {
-   struct ip_header *ip;   /* ip header */
+    union {
+        struct ip_header *ip;   /* ip header */
+        struct ip6_header *ip6; /* ip6 header */
+    } u_ip;
    uint16_t *ip_plen;      /* data len pointer in ip header field */
    struct tcp_header *tcp; /* tcp header */
    uint16_t tcp_hdrlen;    /* tcp header len */
-- 
2.5.0
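
A note on the version test in virtio_net_rsc_sanity_check6(): masking the 32-bit ip6_un1_flow word with 0xF0 only lands on the version nibble when the host is little-endian, because the field is kept in network byte order. Below is a minimal standalone sketch of a byte-wise read that does not depend on host endianness. It is an illustration only, not code from this series, and the helper name ip6_version() is made up for the example.

```c
#include <stdint.h>

/* Hypothetical helper: the IP version is the high nibble of the first
 * header byte on the wire, for both IPv4 and IPv6, so reading that byte
 * directly avoids any dependence on host byte order. */
static unsigned ip6_version(const uint8_t *ip_hdr)
{
    return ip_hdr[0] >> 4;
}
```

With this, the check would compare ip6_version((const uint8_t *)ip) against IP_HEADER_VERSION_6, assuming 'ip' points at the start of the IPv6 header as it does in the patch.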

^ permalink raw reply related	[flat|nested] 32+ messages in thread

* Re: [Qemu-devel] [ Patch 1/2] virtio-net rsc: support coalescing ipv4 tcp traffic
  2016-03-15  9:17 ` [Qemu-devel] [ Patch 1/2] virtio-net rsc: support coalescing ipv4 tcp traffic wexu
@ 2016-03-15 10:00   ` Michael S. Tsirkin
  2016-03-16  3:23     ` Wei Xu
  2016-03-17  8:42   ` Jason Wang
  1 sibling, 1 reply; 32+ messages in thread
From: Michael S. Tsirkin @ 2016-03-15 10:00 UTC (permalink / raw)
  To: wexu; +Cc: victork, jasowang, yvugenfi, qemu-devel, marcel, dfleytma

On Tue, Mar 15, 2016 at 05:17:03PM +0800, wexu@redhat.com wrote:
> From: Wei Xu <wexu@redhat.com>
> 
> All the data packets of a tcp connection are cached in one big buffer
> during each receive interval and are sent out by a timer.
> 'virtio_net_rsc_timeout' controls the interval, and its value strongly
> affects the performance and responsiveness of the tcp connection. 50000 (50us)
> is an empirical value that gives a performance improvement; since the whql
> test sends packets every 100us, '300000' (300us) passes the test case and
> is also the default value. It is going to be made tunable.
> The timer is only triggered if the packet pool is not empty,
> and it drains all the cached packets.
> 
> 'NetRscChain' is used to save the segments of different protocols in a
> VirtIONet device.
> 
> The main TCP handler covers TCP window updates, duplicate ACK checks
> and the actual data coalescing, provided the new segment passes the sanity
> check and is identified as a 'wanted' one.
> 
> A 'wanted' segment means:
> 1. The segment is within the current window and its sequence is the expected one.
> 2. The ACK of the segment is in the valid window.
> 3. If the ACK in the segment is a duplicate, the duplicate count must be less
>    than 2; this lets the upper-layer TCP start retransmission as the spec requires.
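
To make the "expected sequence" rule concrete, here is a minimal standalone sketch of the comparison that virtio_net_rsc_coalesce_data() performs further down. It is not the patch's code: the enum, the function name classify_seq() and the local MAX_TCP_PAYLOAD definition are assumptions made for illustration. The point it shows is that the unsigned subtraction wraps modulo 2^32, so one comparison covers both sequence-number wraparound and stale (retransmitted) segments.

```c
#include <stdint.h>

#define MAX_TCP_PAYLOAD 65535u

/* Hypothetical verdicts, loosely mirroring the statistics in the patch */
enum seq_verdict {
    SEQ_SAME,           /* same sequence: ACK / window update handling */
    SEQ_NEXT,           /* exactly follows the cached data: coalesce */
    SEQ_OUT_OF_ORDER,   /* ahead of the cached data, but with a gap */
    SEQ_OUT_OF_WINDOW   /* far away or behind: retransmission etc. */
};

/* oseq and nseq are host-order sequence numbers of the cached and the
 * new segment; o_payload is the number of payload bytes cached so far. */
static enum seq_verdict classify_seq(uint32_t oseq, uint32_t o_payload,
                                     uint32_t nseq)
{
    uint32_t delta = nseq - oseq;   /* wraps modulo 2^32 */

    if (delta > MAX_TCP_PAYLOAD) {
        return SEQ_OUT_OF_WINDOW;
    }
    if (delta == 0) {
        return SEQ_SAME;
    }
    if (delta != o_payload) {
        return SEQ_OUT_OF_ORDER;
    }
    return SEQ_NEXT;
}
```

A segment that starts before the cached one produces a huge delta after the wrap and so ends up in the out-of-window case, which the patch finalizes.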
> 
> The sanity check rejects:
> 1. An incorrect version in the IP header
> 2. IP options & IP fragments
> 3. Packets that are not TCP
> 4. Inconsistent sizes, to prevent buffer overflow attacks.
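
For reference, the four checks above can be sketched over raw header bytes as follows. This is a hedged illustration, not the code in the patch: the function name ipv4_tcp_coalescable() and the constants are invented here, and the fragment test deliberately differs from the patch. The patch bypasses any packet that lacks the DF bit, whereas this sketch assumes a fragment is a packet with the MF bit set or a non-zero fragment offset.

```c
#include <stdbool.h>
#include <stddef.h>
#include <stdint.h>

#define IPV4_MIN_HDR 20u    /* IPv4 header without options */
#define TCP_MIN_HDR  20u    /* TCP header without options */
#define PROTO_TCP    6u

/* 'ip' points at the first byte of the IPv4 header, 'avail' is the number
 * of bytes available from there to the end of the frame. */
static bool ipv4_tcp_coalescable(const uint8_t *ip, size_t avail)
{
    uint16_t frag, total;

    if (avail < IPV4_MIN_HDR + TCP_MIN_HDR) {
        return false;                   /* too short to parse safely */
    }
    if ((ip[0] >> 4) != 4) {
        return false;                   /* 1. wrong IP version */
    }
    if ((ip[0] & 0x0F) != 5) {
        return false;                   /* 2a. IP options present */
    }
    frag = (uint16_t)((ip[6] << 8) | ip[7]);
    if (frag & 0x3FFF) {
        return false;                   /* 2b. MF set or offset != 0 */
    }
    if (ip[9] != PROTO_TCP) {
        return false;                   /* 3. not TCP */
    }
    total = (uint16_t)((ip[2] << 8) | ip[3]);
    if (total < IPV4_MIN_HDR + TCP_MIN_HDR || total > avail) {
        return false;                   /* 4. length field inconsistent */
    }
    return true;
}
```

Reading the fields byte by byte keeps the sketch independent of host byte order and of struct packing.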
> 
> There may be more cases that should be considered, such as the ip identification
> and other flags, but checking it broke the test because windows sets it to the
> same value even when the packet is not a fragment.
> 
> Normally there are 2 typical ways to handle a TCP control flag, 'bypass'
> and 'finalize'. 'bypass' means the packet should be sent out directly; 'finalize'
> means the packet should also be bypassed, but only
> after searching the pool for packets of the same connection and sending
> all of them out, to avoid out-of-order data.
> 
> All 'SYN' packets are bypassed since they always begin a new
> connection. Other flags such as 'FIN/RST' trigger a finalization, because
> they normally appear when a connection is about to be closed. A 'URG' packet
> also finalizes the current coalescing unit, though the protocol handling may
> differ between OSes.
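
The bypass/finalize split described above boils down to a small flag classification. A minimal standalone sketch follows; the enum and function names are made up for illustration, and the flag values are the standard TCP header bits rather than the TH_* macros the patch takes from the QEMU headers.

```c
#include <stdint.h>

/* Standard TCP flag bits (low 6 bits of the flags byte) */
#define TCP_FIN 0x01u
#define TCP_SYN 0x02u
#define TCP_RST 0x04u
#define TCP_URG 0x20u

/* Hypothetical action names for the three outcomes in the description */
enum ctrl_action {
    CTRL_BYPASS,    /* deliver right away, do not touch the pool */
    CTRL_FINALIZE,  /* drain the matching flow first, then deliver */
    CTRL_COALESCE   /* ordinary segment, candidate for coalescing */
};

static enum ctrl_action classify_tcp_flags(uint8_t flags)
{
    if (flags & TCP_SYN) {
        return CTRL_BYPASS;     /* a SYN always starts a new connection */
    }
    if (flags & (TCP_FIN | TCP_RST | TCP_URG)) {
        return CTRL_FINALIZE;   /* keep ordering around teardown/urgent */
    }
    return CTRL_COALESCE;
}
```

SYN is tested first, so a SYN|ACK is bypassed rather than finalized, which matches the order used in virtio_net_rsc_tcp_ctrl_check().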
> 
> Statistics can be used to monitor the basic coalescing status; the 'out of order'
> and 'out of window' counters show how many packets were retransmitted, and thus
> describe the performance intuitively.
> 
> Signed-off-by: Wei Xu <wexu@redhat.com>
> ---
>  hw/net/virtio-net.c            | 486 ++++++++++++++++++++++++++++++++++++++++-
>  include/hw/virtio/virtio-net.h |   1 +
>  include/hw/virtio/virtio.h     |  72 ++++++
>  3 files changed, 558 insertions(+), 1 deletion(-)
> 
> diff --git a/hw/net/virtio-net.c b/hw/net/virtio-net.c
> index 5798f87..c23b45f 100644
> --- a/hw/net/virtio-net.c
> +++ b/hw/net/virtio-net.c
> @@ -15,10 +15,12 @@
>  #include "qemu/iov.h"
>  #include "hw/virtio/virtio.h"
>  #include "net/net.h"
> +#include "net/eth.h"
>  #include "net/checksum.h"
>  #include "net/tap.h"
>  #include "qemu/error-report.h"
>  #include "qemu/timer.h"
> +#include "qemu/sockets.h"
>  #include "hw/virtio/virtio-net.h"
>  #include "net/vhost_net.h"
>  #include "hw/virtio/virtio-bus.h"
> @@ -38,6 +40,35 @@
>  #define endof(container, field) \
>      (offsetof(container, field) + sizeof(((container *)0)->field))
>  
> +#define ETH_HDR_SZ (sizeof(struct eth_header))
> +#define IP4_HDR_SZ (sizeof(struct ip_header))
> +#define TCP_HDR_SZ (sizeof(struct tcp_header))
> +#define ETH_IP4_HDR_SZ (ETH_HDR_SZ + IP4_HDR_SZ)

It's better to open-code these imho.

> +
> +#define IP4_ADDR_SIZE   8                   /* ipv4 saddr + daddr */
> +#define TCP_PORT_SIZE   4                   /* sport + dport */
> +
> +/* IPv4 max payload, 16 bits in the header */
> +#define MAX_IP4_PAYLOAD (65535 - IP4_HDR_SZ)
> +#define MAX_TCP_PAYLOAD 65535
> +
> +/* max payload with virtio header */
> +#define MAX_VIRTIO_PAYLOAD  (sizeof(struct virtio_net_hdr_mrg_rxbuf) \
> +                                + ETH_HDR_SZ + MAX_TCP_PAYLOAD)
> +
> +#define IP4_HEADER_LEN 5 /* header lenght value in ip header without option */
> +
> +/* Purge coalesced packets timer interval */
> +#define RSC_TIMER_INTERVAL  300000

Pls prefix local macros with VIRTIO_NET_


> +
> +/* Switcher to enable/disable rsc */
> +static bool virtio_net_rsc_bypass = 1;
> +
> +/* This value affects the performance a lot, and should be tuned carefully,
> +   '300000'(300us) is the recommended value to pass the WHQL test, '50000' can
> +   gain 2x netperf throughput with tso/gso/gro 'off'. */

So either tests pass or we get good performance, but not both?
Hmm.

> +static uint32_t virtio_net_rsc_timeout = RSC_TIMER_INTERVAL;


This would need to be tunable.

> +
>  typedef struct VirtIOFeature {
>      uint32_t flags;
>      size_t end;
> @@ -1089,7 +1120,8 @@ static int receive_filter(VirtIONet *n, const uint8_t *buf, int size)
>      return 0;
>  }
>  
> -static ssize_t virtio_net_receive(NetClientState *nc, const uint8_t *buf, size_t size)
> +static ssize_t virtio_net_do_receive(NetClientState *nc,
> +                                  const uint8_t *buf, size_t size)
>  {
>      VirtIONet *n = qemu_get_nic_opaque(nc);
>      VirtIONetQueue *q = virtio_net_get_subqueue(nc);
> @@ -1685,6 +1717,456 @@ static int virtio_net_load_device(VirtIODevice *vdev, QEMUFile *f,
>      return 0;
>  }
>  
> +static void virtio_net_rsc_extract_unit4(NetRscChain *chain,
> +                                         const uint8_t *buf, NetRscUnit* unit)
> +{
> +    uint16_t ip_hdrlen;
> +
> +    unit->ip = (struct ip_header *)(buf + chain->hdr_size + ETH_HDR_SZ);
> +    ip_hdrlen = ((0xF & unit->ip->ip_ver_len) << 2);
> +    unit->ip_plen = &unit->ip->ip_len;
> +    unit->tcp = (struct tcp_header *)(((uint8_t *)unit->ip) + ip_hdrlen);
> +    unit->tcp_hdrlen = (htons(unit->tcp->th_offset_flags) & 0xF000) >> 10;
> +    unit->payload = htons(*unit->ip_plen) - ip_hdrlen - unit->tcp_hdrlen;
> +}
> +
> +static void virtio_net_rsc_ipv4_checksum(struct ip_header *ip)
> +{
> +    uint32_t sum;
> +
> +    ip->ip_sum = 0;
> +    sum = net_checksum_add_cont(IP4_HDR_SZ, (uint8_t *)ip, 0);
> +    ip->ip_sum = cpu_to_be16(net_checksum_finish(sum));
> +}
> +
> +static size_t virtio_net_rsc_drain_seg(NetRscChain *chain, NetRscSeg *seg)
> +{
> +    int ret;
> +
> +    virtio_net_rsc_ipv4_checksum(seg->unit.ip);
> +    ret = virtio_net_do_receive(seg->nc, seg->buf, seg->size);
> +    QTAILQ_REMOVE(&chain->buffers, seg, next);
> +    g_free(seg->buf);
> +    g_free(seg);
> +
> +    return ret;
> +}
> +
> +static void virtio_net_rsc_purge(void *opq)
> +{
> +    NetRscChain *chain = (NetRscChain *)opq;
> +    NetRscSeg *seg, *rn;
> +
> +    QTAILQ_FOREACH_SAFE(seg, &chain->buffers, next, rn) {
> +        if (virtio_net_rsc_drain_seg(chain, seg) == 0) {
> +            chain->stat.purge_failed++;
> +            continue;
> +        }
> +    }
> +
> +    chain->stat.timer++;
> +    if (!QTAILQ_EMPTY(&chain->buffers)) {
> +        timer_mod(chain->drain_timer,
> +              qemu_clock_get_ns(QEMU_CLOCK_VIRTUAL) + virtio_net_rsc_timeout);
> +    }
> +}
> +
> +static void virtio_net_rsc_cleanup(VirtIONet *n)
> +{
> +    NetRscChain *chain, *rn_chain;
> +    NetRscSeg *seg, *rn_seg;
> +
> +    QTAILQ_FOREACH_SAFE(chain, &n->rsc_chains, next, rn_chain) {
> +        QTAILQ_FOREACH_SAFE(seg, &chain->buffers, next, rn_seg) {
> +            QTAILQ_REMOVE(&chain->buffers, seg, next);
> +            g_free(seg->buf);
> +            g_free(seg);
> +        }
> +
> +        timer_del(chain->drain_timer);
> +        timer_free(chain->drain_timer);
> +        QTAILQ_REMOVE(&n->rsc_chains, chain, next);
> +        g_free(chain);
> +    }
> +}
> +
> +static void virtio_net_rsc_cache_buf(NetRscChain *chain, NetClientState *nc,
> +                                     const uint8_t *buf, size_t size)
> +{
> +    NetRscSeg *seg;
> +
> +    seg = g_malloc(sizeof(NetRscSeg));
> +    seg->buf = g_malloc(MAX_VIRTIO_PAYLOAD);
> +
> +    memmove(seg->buf, buf, size);
> +    seg->size = size;
> +    seg->dup_ack_count = 0;
> +    seg->is_coalesced = 0;
> +    seg->nc = nc;
> +
> +    QTAILQ_INSERT_TAIL(&chain->buffers, seg, next);
> +    chain->stat.cache++;
> +
> +    virtio_net_rsc_extract_unit4(chain, seg->buf, &seg->unit);
> +}
> +
> +static int32_t virtio_net_rsc_handle_ack(NetRscChain *chain, NetRscSeg *seg,
> +                                 const uint8_t *buf, struct tcp_header *n_tcp,
> +                                 struct tcp_header *o_tcp)
> +{
> +    uint32_t nack, oack;
> +    uint16_t nwin, owin;
> +
> +    nack = htonl(n_tcp->th_ack);
> +    nwin = htons(n_tcp->th_win);
> +    oack = htonl(o_tcp->th_ack);
> +    owin = htons(o_tcp->th_win);
> +
> +    if ((nack - oack) >= MAX_TCP_PAYLOAD) {
> +        chain->stat.ack_out_of_win++;
> +        return RSC_FINAL;
> +    } else if (nack == oack) {
> +        /* duplicated ack or window probe */
> +        if (nwin == owin) {
> +            /* duplicated ack, add dup ack count due to whql test up to 1 */
> +            chain->stat.dup_ack++;
> +
> +            if (seg->dup_ack_count == 0) {
> +                seg->dup_ack_count++;
> +                chain->stat.dup_ack1++;
> +                return RSC_COALESCE;
> +            } else {
> +                /* Spec says should send it directly */
> +                chain->stat.dup_ack2++;
> +                return RSC_FINAL;
> +            }
> +        } else {
> +            /* Coalesce window update */
> +            o_tcp->th_win = n_tcp->th_win;
> +            chain->stat.win_update++;
> +            return RSC_COALESCE;
> +        }
> +    } else {
> +        /* pure ack, update ack */
> +        o_tcp->th_ack = n_tcp->th_ack;
> +        chain->stat.pure_ack++;
> +        return RSC_COALESCE;
> +    }
> +}
> +
> +static int32_t virtio_net_rsc_coalesce_data(NetRscChain *chain, NetRscSeg *seg,
> +                                    const uint8_t *buf, NetRscUnit *n_unit)
> +{
> +    void *data;
> +    uint16_t o_ip_len;
> +    uint32_t nseq, oseq;
> +    NetRscUnit *o_unit;
> +
> +    o_unit = &seg->unit;
> +    o_ip_len = htons(*o_unit->ip_plen);
> +    nseq = htonl(n_unit->tcp->th_seq);
> +    oseq = htonl(o_unit->tcp->th_seq);
> +
> +    if (n_unit->tcp_hdrlen > TCP_HDR_SZ) {
> +        /* Log this only for debugging observation */
> +        chain->stat.tcp_option++;
> +    }
> +
> +    /* Ignore packet with more/larger tcp options */
> +    if (n_unit->tcp_hdrlen > o_unit->tcp_hdrlen) {
> +        chain->stat.tcp_larger_option++;
> +        return RSC_FINAL;
> +    }
> +
> +    /* out of order or retransmitted. */
> +    if ((nseq - oseq) > MAX_TCP_PAYLOAD) {
> +        chain->stat.data_out_of_win++;
> +        return RSC_FINAL;
> +    }
> +
> +    data = ((uint8_t *)n_unit->tcp) + n_unit->tcp_hdrlen;
> +    if (nseq == oseq) {
> +        if ((0 == o_unit->payload) && n_unit->payload) {
> +            /* From no payload to payload, normal case, not a dup ack or etc */
> +            chain->stat.data_after_pure_ack++;
> +            goto coalesce;
> +        } else {
> +            return virtio_net_rsc_handle_ack(chain, seg, buf,
> +                                             n_unit->tcp, o_unit->tcp);
> +        }
> +    } else if ((nseq - oseq) != o_unit->payload) {
> +        /* Not a consistent packet, out of order */
> +        chain->stat.data_out_of_order++;
> +        return RSC_FINAL;
> +    } else {
> +coalesce:
> +        if ((o_ip_len + n_unit->payload) > chain->max_payload) {
> +            chain->stat.over_size++;
> +            return RSC_FINAL;
> +        }
> +
> +        /* Here comes the right data, the payload length in v4/v6 is different,
> +           so use the field value to update and record the new data len */
> +        o_unit->payload += n_unit->payload; /* update new data len */
> +
> +        /* update field in ip header */
> +        *o_unit->ip_plen = htons(o_ip_len + n_unit->payload);
> +
> +        /* Bring 'PUSH' big, the whql test guide says 'PUSH' can be coalesced
> +           for windows guest, while this may change the behavior for linux
> +           guest. */
> +        o_unit->tcp->th_offset_flags = n_unit->tcp->th_offset_flags;
> +
> +        o_unit->tcp->th_ack = n_unit->tcp->th_ack;
> +        o_unit->tcp->th_win = n_unit->tcp->th_win;
> +
> +        memmove(seg->buf + seg->size, data, n_unit->payload);
> +        seg->size += n_unit->payload;
> +        chain->stat.coalesced++;
> +        return RSC_COALESCE;
> +    }
> +}
> +
> +static int32_t virtio_net_rsc_coalesce4(NetRscChain *chain, NetRscSeg *seg,
> +                        const uint8_t *buf, size_t size, NetRscUnit *unit)
> +{
> +    if ((unit->ip->ip_src ^ seg->unit.ip->ip_src)
> +        || (unit->ip->ip_dst ^ seg->unit.ip->ip_dst)
> +        || (unit->tcp->th_sport ^ seg->unit.tcp->th_sport)
> +        || (unit->tcp->th_dport ^ seg->unit.tcp->th_dport)) {
> +        chain->stat.no_match++;
> +        return RSC_NO_MATCH;
> +    }
> +
> +    return virtio_net_rsc_coalesce_data(chain, seg, buf, unit);
> +}
> +
> +/* Pakcets with 'SYN' should bypass, other flag should be sent after drain
> + * to prevent out of order */
> +static int virtio_net_rsc_tcp_ctrl_check(NetRscChain *chain,
> +                                         struct tcp_header *tcp)
> +{
> +    uint16_t tcp_flag;
> +
> +    tcp_flag = htons(tcp->th_offset_flags) & 0x3F;
> +    if (tcp_flag & TH_SYN) {
> +        chain->stat.tcp_syn++;
> +        return RSC_BYPASS;
> +    }
> +
> +    if (tcp_flag & (TH_FIN | TH_URG | TH_RST)) {
> +        chain->stat.tcp_ctrl_drain++;
> +        return RSC_FINAL;
> +    }
> +
> +    return RSC_WANT;
> +}
> +
> +static bool virtio_net_rsc_empty_cache(NetRscChain *chain, NetClientState *nc,
> +                          const uint8_t *buf, size_t size)
> +{
> +    if (QTAILQ_EMPTY(&chain->buffers)) {
> +        chain->stat.empty_cache++;
> +        virtio_net_rsc_cache_buf(chain, nc, buf, size);
> +        timer_mod(chain->drain_timer,
> +              qemu_clock_get_ns(QEMU_CLOCK_VIRTUAL) + virtio_net_rsc_timeout);
> +        return 1;
> +    }
> +
> +    return 0;
> +}
> +
> +static size_t virtio_net_rsc_do_coalesce(NetRscChain *chain, NetClientState *nc,
> +                          const uint8_t *buf, size_t size, NetRscUnit *unit)
> +{
> +    int ret;
> +    NetRscSeg *seg, *nseg;
> +
> +    QTAILQ_FOREACH_SAFE(seg, &chain->buffers, next, nseg) {
> +        ret = virtio_net_rsc_coalesce4(chain, seg, buf, size, unit);
> +
> +        if (ret == RSC_FINAL) {
> +            if (virtio_net_rsc_drain_seg(chain, seg) == 0) {
> +                /* Send failed */
> +                chain->stat.final_failed++;
> +                return 0;
> +            }
> +
> +            /* Send current packet */
> +            return virtio_net_do_receive(nc, buf, size);
> +        } else if (ret == RSC_NO_MATCH) {
> +            continue;
> +        } else {
> +            /* Coalesced, mark coalesced flag to tell calc cksum for ipv4 */
> +            seg->is_coalesced = 1;
> +            return size;
> +        }
> +    }
> +
> +    chain->stat.no_match_cache++;
> +    virtio_net_rsc_cache_buf(chain, nc, buf, size);
> +    return size;
> +}
> +
> +/* Drain a connection data, this is to avoid out of order segments */
> +static size_t virtio_net_rsc_drain_flow(NetRscChain *chain, NetClientState *nc,
> +                        const uint8_t *buf, size_t size, uint16_t ip_start,
> +                        uint16_t ip_size, uint16_t tcp_port, uint16_t port_size)
> +{
> +    NetRscSeg *seg, *nseg;
> +
> +    QTAILQ_FOREACH_SAFE(seg, &chain->buffers, next, nseg) {
> +        if (memcmp(buf + ip_start, seg->buf + ip_start, ip_size)
> +            || memcmp(buf + tcp_port, seg->buf + tcp_port, port_size)) {
> +            continue;
> +        }
> +        if (virtio_net_rsc_drain_seg(chain, seg) == 0) {
> +            chain->stat.drain_failed++;
> +        }
> +
> +        break;
> +    }
> +
> +    return virtio_net_do_receive(nc, buf, size);
> +}
> +
> +static int32_t virtio_net_rsc_sanity_check4(NetRscChain *chain,
> +                        struct ip_header *ip, const uint8_t *buf, size_t size)
> +{
> +    uint16_t ip_len;
> +
> +    if (size < (chain->hdr_size + ETH_IP4_HDR_SZ + TCP_HDR_SZ)) {
> +        return RSC_BYPASS;
> +    }
> +
> +    /* Not an ipv4 one */
> +    if (((0xF0 & ip->ip_ver_len) >> 4) != IP_HEADER_VERSION_4) {
> +        chain->stat.ip_option++;
> +        return RSC_BYPASS;
> +    }
> +
> +    /* Don't handle packets with ip option */
> +    if (IP4_HEADER_LEN != (0xF & ip->ip_ver_len)) {
> +        chain->stat.ip_option++;
> +        return RSC_BYPASS;
> +    }
> +
> +    /* Don't handle packets with ip fragment */
> +    if (!(htons(ip->ip_off) & IP_DF)) {
> +        chain->stat.ip_frag++;
> +        return RSC_BYPASS;
> +    }
> +
> +    if (ip->ip_p != IPPROTO_TCP) {
> +        chain->stat.bypass_not_tcp++;
> +        return RSC_BYPASS;
> +    }
> +
> +    /* Sanity check */
> +    ip_len = htons(ip->ip_len);
> +    if (ip_len < (IP4_HDR_SZ + TCP_HDR_SZ)
> +        || ip_len > (size - chain->hdr_size - ETH_HDR_SZ)) {
> +        chain->stat.ip_hacked++;
> +        return RSC_BYPASS;
> +    }
> +
> +    return RSC_WANT;
> +}
> +
> +static size_t virtio_net_rsc_receive4(void *opq, NetClientState* nc,
> +                                      const uint8_t *buf, size_t size)
> +{
> +    int32_t ret;
> +    NetRscChain *chain;
> +    NetRscUnit unit;
> +
> +    chain = (NetRscChain *)opq;
> +    virtio_net_rsc_extract_unit4(chain, buf, &unit);
> +    if (RSC_WANT != virtio_net_rsc_sanity_check4(chain, unit.ip, buf, size)) {
> +        return virtio_net_do_receive(nc, buf, size);
> +    }
> +
> +    ret = virtio_net_rsc_tcp_ctrl_check(chain, unit.tcp);
> +    if (ret == RSC_BYPASS) {
> +        return virtio_net_do_receive(nc, buf, size);
> +    } else if (ret == RSC_FINAL) {
> +        return virtio_net_rsc_drain_flow(chain, nc, buf, size,
> +                        ((chain->hdr_size + ETH_HDR_SZ) + 12), IP4_ADDR_SIZE,
> +                        (chain->hdr_size + ETH_IP4_HDR_SZ), TCP_PORT_SIZE);
> +    }
> +
> +    if (virtio_net_rsc_empty_cache(chain, nc, buf, size)) {
> +        return size;
> +    }
> +
> +    return virtio_net_rsc_do_coalesce(chain, nc, buf, size, &unit);
> +}
> +
> +static NetRscChain *virtio_net_rsc_lookup_chain(VirtIONet * n,
> +                                            NetClientState *nc, uint16_t proto)
> +{
> +    NetRscChain *chain;
> +
> +    /* Only handle IPv4/6 */
> +    if (proto != (uint16_t)ETH_P_IP) {
> +        return NULL;
> +    }
> +
> +    QTAILQ_FOREACH(chain, &n->rsc_chains, next) {
> +        if (chain->proto == proto) {
> +            return chain;
> +        }
> +    }
> +
> +    chain = g_malloc(sizeof(*chain));
> +    chain->hdr_size = n->guest_hdr_len;
> +    chain->proto = proto;
> +    chain->max_payload = MAX_IP4_PAYLOAD;
> +    chain->drain_timer = timer_new_ns(QEMU_CLOCK_VIRTUAL,
> +                                      virtio_net_rsc_purge, chain);
> +    memset(&chain->stat, 0, sizeof(chain->stat));
> +
> +    QTAILQ_INIT(&chain->buffers);
> +    QTAILQ_INSERT_TAIL(&n->rsc_chains, chain, next);
> +
> +    return chain;
> +}
> +
> +static ssize_t virtio_net_rsc_receive(NetClientState *nc,
> +                                      const uint8_t *buf, size_t size)
> +{
> +    uint16_t proto;
> +    NetRscChain *chain;
> +    struct eth_header *eth;
> +    VirtIONet *n;
> +
> +    n = qemu_get_nic_opaque(nc);
> +    if (size < (n->guest_hdr_len + ETH_HDR_SZ)) {
> +        return virtio_net_do_receive(nc, buf, size);
> +    }
> +
> +    eth = (struct eth_header *)(buf + n->guest_hdr_len);
> +    proto = htons(eth->h_proto);
> +
> +    chain = virtio_net_rsc_lookup_chain(n, nc, proto);
> +    if (!chain) {
> +        return virtio_net_do_receive(nc, buf, size);
> +    } else {
> +        chain->stat.received++;
> +        return virtio_net_rsc_receive4(chain, nc, buf, size);
> +    }
> +}
> +
> +static ssize_t virtio_net_receive(NetClientState *nc,
> +                                  const uint8_t *buf, size_t size)
> +{
> +    if (virtio_net_rsc_bypass) {
> +        return virtio_net_do_receive(nc, buf, size);
> +    } else {
> +        return virtio_net_rsc_receive(nc, buf, size);
> +    }
> +}
> +
>  static NetClientInfo net_virtio_info = {
>      .type = NET_CLIENT_OPTIONS_KIND_NIC,
>      .size = sizeof(NICState),
> @@ -1814,6 +2296,7 @@ static void virtio_net_device_realize(DeviceState *dev, Error **errp)
>      nc = qemu_get_queue(n->nic);
>      nc->rxfilter_notify_enabled = 1;
>  
> +    QTAILQ_INIT(&n->rsc_chains);
>      n->qdev = dev;
>      register_savevm(dev, "virtio-net", -1, VIRTIO_NET_VM_VERSION,
>                      virtio_net_save, virtio_net_load, n);
> @@ -1848,6 +2331,7 @@ static void virtio_net_device_unrealize(DeviceState *dev, Error **errp)
>      g_free(n->vqs);
>      qemu_del_nic(n->nic);
>      virtio_cleanup(vdev);
> +    virtio_net_rsc_cleanup(n);
>  }
>  
>  static void virtio_net_instance_init(Object *obj)
> diff --git a/include/hw/virtio/virtio-net.h b/include/hw/virtio/virtio-net.h
> index 0cabdb6..6939e92 100644
> --- a/include/hw/virtio/virtio-net.h
> +++ b/include/hw/virtio/virtio-net.h
> @@ -59,6 +59,7 @@ typedef struct VirtIONet {
>      VirtIONetQueue *vqs;
>      VirtQueue *ctrl_vq;
>      NICState *nic;
> +    QTAILQ_HEAD(, NetRscChain) rsc_chains;
>      uint32_t tx_timeout;
>      int32_t tx_burst;
>      uint32_t has_vnet_hdr;
> diff --git a/include/hw/virtio/virtio.h b/include/hw/virtio/virtio.h
> index 2b5b248..3b1dfa8 100644
> --- a/include/hw/virtio/virtio.h
> +++ b/include/hw/virtio/virtio.h
> @@ -128,6 +128,78 @@ typedef struct VirtioDeviceClass {
>      int (*load)(VirtIODevice *vdev, QEMUFile *f, int version_id);
>  } VirtioDeviceClass;
>  
> +/* Coalesced packets type & status */
> +typedef enum {
> +    RSC_COALESCE,           /* Data been coalesced */
> +    RSC_FINAL,              /* Will terminate current connection */
> +    RSC_NO_MATCH,           /* No matched in the buffer pool */
> +    RSC_BYPASS,             /* Packet to be bypass, not tcp, tcp ctrl, etc */
> +    RSC_WANT                /* Data want to be coalesced */
> +} COALESCE_STATUS;
> +
> +typedef struct NetRscStat {
> +    uint32_t received;
> +    uint32_t coalesced;
> +    uint32_t over_size;
> +    uint32_t cache;
> +    uint32_t empty_cache;
> +    uint32_t no_match_cache;
> +    uint32_t win_update;
> +    uint32_t no_match;
> +    uint32_t tcp_syn;
> +    uint32_t tcp_ctrl_drain;
> +    uint32_t dup_ack;
> +    uint32_t dup_ack1;
> +    uint32_t dup_ack2;
> +    uint32_t pure_ack;
> +    uint32_t ack_out_of_win;
> +    uint32_t data_out_of_win;
> +    uint32_t data_out_of_order;
> +    uint32_t data_after_pure_ack;
> +    uint32_t bypass_not_tcp;
> +    uint32_t tcp_option;
> +    uint32_t tcp_larger_option;
> +    uint32_t ip_frag;
> +    uint32_t ip_hacked;
> +    uint32_t ip_option;
> +    uint32_t purge_failed;
> +    uint32_t drain_failed;
> +    uint32_t final_failed;
> +    int64_t  timer;
> +} NetRscStat;
> +
> +/* Rsc unit general info used to checking if can coalescing */
> +typedef struct NetRscUnit {
> +   struct ip_header *ip;   /* ip header */
> +   uint16_t *ip_plen;      /* data len pointer in ip header field */
> +   struct tcp_header *tcp; /* tcp header */
> +   uint16_t tcp_hdrlen;    /* tcp header len */
> +   uint16_t payload;       /* pure payload without virtio/eth/ip/tcp */
> +} NetRscUnit;
> +
> +/* Coalesced segment */
> +typedef struct NetRscSeg {
> +    QTAILQ_ENTRY(NetRscSeg) next;
> +    void *buf;
> +    size_t size;
> +    uint32_t dup_ack_count;
> +    bool is_coalesced;      /* need recal ipv4 header checksum, mark here */
> +    NetRscUnit unit;
> +    NetClientState *nc;
> +} NetRscSeg;
> +
> +
> +/* Chain is divided by protocol(ipv4/v6) and NetClientInfo */
> +typedef struct NetRscChain {
> +   QTAILQ_ENTRY(NetRscChain) next;
> +   uint16_t hdr_size;
> +   uint16_t proto;
> +   uint16_t max_payload;
> +   QEMUTimer *drain_timer;
> +   QTAILQ_HEAD(, NetRscSeg) buffers;
> +   NetRscStat stat;
> +} NetRscChain;
> +
>  void virtio_instance_init_common(Object *proxy_obj, void *data,
>                                   size_t vdev_size, const char *vdev_name);
>  
> -- 
> 2.5.0

^ permalink raw reply	[flat|nested] 32+ messages in thread

* Re: [Qemu-devel] [ Patch 0/2] Support Receive-Segment-Offload(RSC) for WHQL test of Window guest
  2016-03-15  9:17 [Qemu-devel] [ Patch 0/2] Support Receive-Segment-Offload(RSC) for WHQL test of Window guest wexu
  2016-03-15  9:17 ` [Qemu-devel] [ Patch 1/2] virtio-net rsc: support coalescing ipv4 tcp traffic wexu
  2016-03-15  9:17 ` [Qemu-devel] [ Patch 2/2] virtio-net rsc: support coalescing ipv6 " wexu
@ 2016-03-15 10:01 ` Michael S. Tsirkin
  2016-03-16  3:08   ` Wei Xu
  2016-03-17  6:47 ` Jason Wang
  3 siblings, 1 reply; 32+ messages in thread
From: Michael S. Tsirkin @ 2016-03-15 10:01 UTC (permalink / raw)
  To: wexu; +Cc: victork, jasowang, yvugenfi, qemu-devel, marcel, dfleytma

On Tue, Mar 15, 2016 at 05:17:02PM +0800, wexu@redhat.com wrote:
> From: Wei Xu <wexu@redhat.com>
> 
> Fixed issues based on rfc patch v2:
> 1. Removed big param list, replace it with 'NetRscUnit' 
> 2. Different virtio header size
> 3. Modify callback function to direct call.
> 4. Needn't check the failure of g_malloc()
> 5. Other code format adjustment, macro naming, etc 
> 
> This patch adds support for the WHQL test for Windows guests. The feature also
> benefits other guests, acting as a kernel 'gro'-like feature implemented in userspace.
> Feature information:
>   http://msdn.microsoft.com/en-us/library/windows/hardware/jj853324
> 
> Both IPv4 and IPv6 are supported. Although userspace virtio is still slower
> than vhost-net, userspace virtio sees about a 1x to 3x performance
> improvement when this feature is turned on and 'tso/gso/gro' is disabled on
> the corresponding tap interface and guest interface; the improvement is
> smaller with all of these features on.
> 
> Test steps:
> Although this feature is mainly used for Windows guests, I used Linux guests to help
> test it. To keep things simple, I tested the patch in 3 steps as I moved on:
> 1. A tcp socket client/server pair running on 2 Linux guests, so that I can control
> the traffic and debug the code as I want.
> 2. Netperf on Linux guests to test the throughput.
> 3. WHQL test with 2 Windows guests.
> 
> Current status:
> IPv4 passes all of the above tests.
> IPv6 has only passed test steps 1 and 2 described above. The virtio nic cannot
> receive any packets in the WHQL test; it looks like the test traffic is not sent
> from the support machine. The test device can access both the host and another
> Linux guest. I tried a lot of ways to work it out but failed; maybe debugging
> from the Windows guest driver side can help figure it out.
> 
> Note:
> A 'MessageDevice' nic chosen as 'Realtek' will sometimes panic the system during setup;
> this can be worked around by replacing it with an 'e1000' nic.
> 
> Todo:
> More sanity checks, and tcp 'ecn' and 'window' scale tests.

So at this point this is still an RFC, pls label as such
in the subject.
Also, commit log of each patch should also include info on
how to activate a feature.

thanks!

> Wei Xu (2):
>   virtio-net rsc: support coalescing ipv4 tcp traffic
>   virtio-net rsc: support coalescing ipv6 tcp traffic
> 
>  hw/net/virtio-net.c            | 602 ++++++++++++++++++++++++++++++++++++++++-
>  include/hw/virtio/virtio-net.h |   1 +
>  include/hw/virtio/virtio.h     |  75 +++++
>  3 files changed, 677 insertions(+), 1 deletion(-)
> 
> -- 
> 2.5.0

* Re: [Qemu-devel] [ Patch 0/2] Support Receive-Segment-Offload(RSC) for WHQL test of Window guest
  2016-03-15 10:01 ` [Qemu-devel] [ Patch 0/2] Support Receive-Segment-Offload(RSC) for WHQL test of Window guest Michael S. Tsirkin
@ 2016-03-16  3:08   ` Wei Xu
  0 siblings, 0 replies; 32+ messages in thread
From: Wei Xu @ 2016-03-16  3:08 UTC (permalink / raw)
  To: Michael S. Tsirkin
  Cc: victork, jasowang, yvugenfi, qemu-devel, marcel, dfleytma



----- Original Message -----
From: "Michael S. Tsirkin" <mst@redhat.com>
To: wexu@redhat.com
Cc: victork@redhat.com, jasowang@redhat.com, yvugenfi@redhat.com, qemu-devel@nongnu.org, marcel@redhat.com, dfleytma@redhat.com
Sent: Tuesday, March 15, 2016 6:01:12 PM
Subject: Re: [Qemu-devel] [ Patch 0/2] Support Receive-Segment-Offload(RSC) for WHQL test of Window guest

On Tue, Mar 15, 2016 at 05:17:02PM +0800, wexu@redhat.com wrote:
> From: Wei Xu <wexu@redhat.com>
> 
> Fixed issues based on rfc patch v2:
> 1. Removed big param list, replace it with 'NetRscUnit' 
> 2. Different virtio header size
> 3. Modify callback function to direct call.
> 4. Needn't check the failure of g_malloc()
> 5. Other code format adjustment, macro naming, etc 
> 
> This patch adds support for the WHQL test for Windows guests. The feature also
> benefits other guests, acting as a kernel 'gro'-like feature implemented in userspace.
> Feature information:
>   http://msdn.microsoft.com/en-us/library/windows/hardware/jj853324
> 
> Both IPv4 and IPv6 are supported. Although userspace virtio is still slower
> than vhost-net, userspace virtio sees about a 1x to 3x performance
> improvement when this feature is turned on and 'tso/gso/gro' is disabled on
> the corresponding tap interface and guest interface; the improvement is
> smaller with all of these features on.
> 
> Test steps:
> Although this feature is mainly used for Windows guests, I used Linux guests to help
> test it. To keep things simple, I tested the patch in 3 steps as I moved on:
> 1. A tcp socket client/server pair running on 2 Linux guests, so that I can control
> the traffic and debug the code as I want.
> 2. Netperf on Linux guests to test the throughput.
> 3. WHQL test with 2 Windows guests.
> 
> Current status:
> IPv4 passes all of the above tests.
> IPv6 has only passed test steps 1 and 2 described above. The virtio nic cannot
> receive any packets in the WHQL test; it looks like the test traffic is not sent
> from the support machine. The test device can access both the host and another
> Linux guest. I tried a lot of ways to work it out but failed; maybe debugging
> from the Windows guest driver side can help figure it out.
> 
> Note:
> A 'MessageDevice' nic chosen as 'Realtek' will sometimes panic the system during setup;
> this can be worked around by replacing it with an 'e1000' nic.
> 
> Todo:
> More sanity checks, and tcp 'ecn' and 'window' scale tests.

So at this point this is still an RFC, pls label as such
in the subject.
Also, commit log of each patch should also include info on
how to activate a feature.

OK, thanks mst.

thanks!

> Wei Xu (2):
>   virtio-net rsc: support coalescing ipv4 tcp traffic
>   virtio-net rsc: support coalescing ipv6 tcp traffic
> 
>  hw/net/virtio-net.c            | 602 ++++++++++++++++++++++++++++++++++++++++-
>  include/hw/virtio/virtio-net.h |   1 +
>  include/hw/virtio/virtio.h     |  75 +++++
>  3 files changed, 677 insertions(+), 1 deletion(-)
> 
> -- 
> 2.5.0

* Re: [Qemu-devel] [ Patch 1/2] virtio-net rsc: support coalescing ipv4 tcp traffic
  2016-03-15 10:00   ` Michael S. Tsirkin
@ 2016-03-16  3:23     ` Wei Xu
  0 siblings, 0 replies; 32+ messages in thread
From: Wei Xu @ 2016-03-16  3:23 UTC (permalink / raw)
  To: Michael S. Tsirkin
  Cc: victork, jasowang, yvugenfi, qemu-devel, marcel, dfleytma



----- Original Message -----
From: "Michael S. Tsirkin" <mst@redhat.com>
To: wexu@redhat.com
Cc: victork@redhat.com, jasowang@redhat.com, yvugenfi@redhat.com, qemu-devel@nongnu.org, marcel@redhat.com, dfleytma@redhat.com
Sent: Tuesday, March 15, 2016 6:00:03 PM
Subject: Re: [Qemu-devel] [ Patch 1/2] virtio-net rsc: support coalescing ipv4 tcp traffic

On Tue, Mar 15, 2016 at 05:17:03PM +0800, wexu@redhat.com wrote:
> From: Wei Xu <wexu@redhat.com>
> 
> All the data packets of a tcp connection are cached in a big buffer
> during each receive interval and are sent out by a timer.
> 'virtio_net_rsc_timeout' controls the interval, and its value strongly
> influences the performance and responsiveness of a tcp connection. 50000 (50us)
> is an empirical value that gives a performance improvement; since the whql
> test sends packets every 100us, '300000' (300us) passes the test case and
> is also the default value. It is going to be tunable.
> The timer is only triggered if the packet pool is not empty,
> and it drains all the cached packets.
> 
> 'NetRscChain' is used to save the segments of different protocols in a
> VirtIONet device.
> 
> The main TCP handler includes the TCP window update, the duplicated ACK check,
> and the actual data coalescing if the new segment passes the sanity check
> and is identified as a 'wanted' one.
> 
> A 'wanted' segment means:
> 1. The segment is within the current window and the sequence is the expected one.
> 2. The ACK of the segment is in the valid window.
> 3. If the ACK in the segment is a duplicated one, its count must be less than 2;
>    this is to notify the upper layer TCP to start retransmission, per the spec.
> 
> The sanity check covers:
> 1. Incorrect version in the IP header
> 2. IP options & IP fragments
> 3. Not a TCP packet
> 4. Size sanity check to prevent buffer overflow attacks.
> 
> There may be more cases that should be considered, such as the ip
> identification and other flags, but checking the identification broke the test
> because Windows sets it to the same value even when the packet is not a fragment.
> 
> There are normally 2 typical ways to handle a TCP control flag: 'bypass'
> and 'finalize'. 'bypass' means the packet should be sent out directly;
> 'finalize' means the packet should also be bypassed, but only
> after searching the pool for packets of the same connection and sending
> all of them out. This is to avoid out-of-order data.
> 
> All 'SYN' packets are bypassed since they always begin a new
> connection. Other flags such as 'FIN/RST' trigger a finalization, because
> they normally appear when a connection is about to be closed. A 'URG' packet
> also finalizes the current coalescing unit, although the protocol behavior may
> differ between OSes.
> 
> Statistics can be used to monitor the basic coalescing status; the 'out of order'
> and 'out of window' counters show how many packets are being retransmitted, and
> thus describe the performance intuitively.
> 
> Signed-off-by: Wei Xu <wexu@redhat.com>
> ---
>  hw/net/virtio-net.c            | 486 ++++++++++++++++++++++++++++++++++++++++-
>  include/hw/virtio/virtio-net.h |   1 +
>  include/hw/virtio/virtio.h     |  72 ++++++
>  3 files changed, 558 insertions(+), 1 deletion(-)
> 
> diff --git a/hw/net/virtio-net.c b/hw/net/virtio-net.c
> index 5798f87..c23b45f 100644
> --- a/hw/net/virtio-net.c
> +++ b/hw/net/virtio-net.c
> @@ -15,10 +15,12 @@
>  #include "qemu/iov.h"
>  #include "hw/virtio/virtio.h"
>  #include "net/net.h"
> +#include "net/eth.h"
>  #include "net/checksum.h"
>  #include "net/tap.h"
>  #include "qemu/error-report.h"
>  #include "qemu/timer.h"
> +#include "qemu/sockets.h"
>  #include "hw/virtio/virtio-net.h"
>  #include "net/vhost_net.h"
>  #include "hw/virtio/virtio-bus.h"
> @@ -38,6 +40,35 @@
>  #define endof(container, field) \
>      (offsetof(container, field) + sizeof(((container *)0)->field))
>  
> +#define ETH_HDR_SZ (sizeof(struct eth_header))
> +#define IP4_HDR_SZ (sizeof(struct ip_header))
> +#define TCP_HDR_SZ (sizeof(struct tcp_header))
> +#define ETH_IP4_HDR_SZ (ETH_HDR_SZ + IP4_HDR_SZ)

It's better to open-code these imho.

okay.

> +
> +#define IP4_ADDR_SIZE   8                   /* ipv4 saddr + daddr */
> +#define TCP_PORT_SIZE   4                   /* sport + dport */
> +
> +/* IPv4 max payload, 16 bits in the header */
> +#define MAX_IP4_PAYLOAD (65535 - IP4_HDR_SZ)
> +#define MAX_TCP_PAYLOAD 65535
> +
> +/* max payload with virtio header */
> +#define MAX_VIRTIO_PAYLOAD  (sizeof(struct virtio_net_hdr_mrg_rxbuf) \
> +                                + ETH_HDR_SZ + MAX_TCP_PAYLOAD)
> +
> +#define IP4_HEADER_LEN 5 /* header length value in ip header without option */
> +
> +/* Purge coalesced packets timer interval */
> +#define RSC_TIMER_INTERVAL  300000

Pls prefix local macros with VIRTIO_NET_

sure.


> +
> +/* Switcher to enable/disable rsc */
> +static bool virtio_net_rsc_bypass = 1;
> +
> +/* This value affects the performance a lot, and should be tuned carefully,
> +   '300000'(300us) is the recommended value to pass the WHQL test, '50000' can
> +   gain 2x netperf throughput with tso/gso/gro 'off'. */

So either tests pass or we get good performance, but not both?
Hmm.

Yes. The test case sends 6 data packets every 100us, then captures the traffic and
checks whether some of the packets were coalesced, so any interval shorter than 100us
is certain to fail the case. Passing really does depend on the interval: I tried
200/400/500+ (us), but a few cases still failed. I don't know how the test case works
internally; maybe it is designed to bypass 'tso/gso'-like features in the Windows
stack so that packets are not coalesced before being sent out.
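
To make the interval dependence concrete, here is a minimal standalone sketch
(not QEMU code). It assumes the sender emits one packet of a single flow every
100us, and that the drain timer is armed only when a packet lands in an empty
cache and flushes everything when it fires, which is how
virtio_net_rsc_empty_cache() and virtio_net_rsc_purge() behave in this patch.
The helper name rsc_sim_delivered() is hypothetical, and the rule that the
timer runs first when it expires at the same instant a packet arrives is an
assumption made only for this illustration.

```c
#include <assert.h>
#include <stddef.h>
#include <stdint.h>

/*
 * Hypothetical helper, not part of the patch: count how many segments the
 * guest would see when 'npkts' packets of one flow arrive every
 * 'period_ns' nanoseconds and the drain timer fires 'timeout_ns' after a
 * packet has been cached into an empty pool.
 */
static size_t rsc_sim_delivered(size_t npkts, uint64_t period_ns,
                                uint64_t timeout_ns)
{
    size_t delivered = 0;
    int cached = 0;           /* is a segment waiting in the pool? */
    uint64_t deadline = 0;    /* when the drain timer fires */
    size_t i;

    assert(period_ns > 0 && timeout_ns > 0);

    for (i = 0; i < npkts; i++) {
        uint64_t now = i * period_ns;

        /* Assumption: a timer expiring at 'now' runs before the packet. */
        if (cached && deadline <= now) {
            delivered++;      /* purge: flush the coalesced segment */
            cached = 0;
        }

        if (!cached) {
            /* Empty cache: cache the packet and arm the timer. */
            cached = 1;
            deadline = now + timeout_ns;
        }
        /* Otherwise the packet is merged into the cached segment. */
    }

    if (cached) {
        delivered++;          /* the last timer expiry drains the rest */
    }
    return delivered;
}
```

Under these assumptions, rsc_sim_delivered(6, 100000, 50000) returns 6, i.e.
every packet is flushed before the next one arrives and nothing is coalesced,
while rsc_sim_delivered(6, 100000, 300000) returns 2. The model is deliberately
crude: it ignores ACK handling and control flags, and it does not explain why
some cases still failed at 200/400/500us.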

> +static uint32_t virtio_net_rsc_timeout = RSC_TIMER_INTERVAL;


This would need to be tunable.

Yes. I have a question: can it be controlled by 'ethtool' or 'qmp/hmp'?

> +
>  typedef struct VirtIOFeature {
>      uint32_t flags;
>      size_t end;
> @@ -1089,7 +1120,8 @@ static int receive_filter(VirtIONet *n, const uint8_t *buf, int size)
>      return 0;
>  }
>  
> -static ssize_t virtio_net_receive(NetClientState *nc, const uint8_t *buf, size_t size)
> +static ssize_t virtio_net_do_receive(NetClientState *nc,
> +                                  const uint8_t *buf, size_t size)
>  {
>      VirtIONet *n = qemu_get_nic_opaque(nc);
>      VirtIONetQueue *q = virtio_net_get_subqueue(nc);
> @@ -1685,6 +1717,456 @@ static int virtio_net_load_device(VirtIODevice *vdev, QEMUFile *f,
>      return 0;
>  }
>  
> +static void virtio_net_rsc_extract_unit4(NetRscChain *chain,
> +                                         const uint8_t *buf, NetRscUnit* unit)
> +{
> +    uint16_t ip_hdrlen;
> +
> +    unit->ip = (struct ip_header *)(buf + chain->hdr_size + ETH_HDR_SZ);
> +    ip_hdrlen = ((0xF & unit->ip->ip_ver_len) << 2);
> +    unit->ip_plen = &unit->ip->ip_len;
> +    unit->tcp = (struct tcp_header *)(((uint8_t *)unit->ip) + ip_hdrlen);
> +    unit->tcp_hdrlen = (htons(unit->tcp->th_offset_flags) & 0xF000) >> 10;
> +    unit->payload = htons(*unit->ip_plen) - ip_hdrlen - unit->tcp_hdrlen;
> +}
> +
> +static void virtio_net_rsc_ipv4_checksum(struct ip_header *ip)
> +{
> +    uint32_t sum;
> +
> +    ip->ip_sum = 0;
> +    sum = net_checksum_add_cont(IP4_HDR_SZ, (uint8_t *)ip, 0);
> +    ip->ip_sum = cpu_to_be16(net_checksum_finish(sum));
> +}
> +
> +static size_t virtio_net_rsc_drain_seg(NetRscChain *chain, NetRscSeg *seg)
> +{
> +    int ret;
> +
> +    virtio_net_rsc_ipv4_checksum(seg->unit.ip);
> +    ret = virtio_net_do_receive(seg->nc, seg->buf, seg->size);
> +    QTAILQ_REMOVE(&chain->buffers, seg, next);
> +    g_free(seg->buf);
> +    g_free(seg);
> +
> +    return ret;
> +}
> +
> +static void virtio_net_rsc_purge(void *opq)
> +{
> +    NetRscChain *chain = (NetRscChain *)opq;
> +    NetRscSeg *seg, *rn;
> +
> +    QTAILQ_FOREACH_SAFE(seg, &chain->buffers, next, rn) {
> +        if (virtio_net_rsc_drain_seg(chain, seg) == 0) {
> +            chain->stat.purge_failed++;
> +            continue;
> +        }
> +    }
> +
> +    chain->stat.timer++;
> +    if (!QTAILQ_EMPTY(&chain->buffers)) {
> +        timer_mod(chain->drain_timer,
> +              qemu_clock_get_ns(QEMU_CLOCK_VIRTUAL) + virtio_net_rsc_timeout);
> +    }
> +}
> +
> +static void virtio_net_rsc_cleanup(VirtIONet *n)
> +{
> +    NetRscChain *chain, *rn_chain;
> +    NetRscSeg *seg, *rn_seg;
> +
> +    QTAILQ_FOREACH_SAFE(chain, &n->rsc_chains, next, rn_chain) {
> +        QTAILQ_FOREACH_SAFE(seg, &chain->buffers, next, rn_seg) {
> +            QTAILQ_REMOVE(&chain->buffers, seg, next);
> +            g_free(seg->buf);
> +            g_free(seg);
> +        }
> +
> +        timer_del(chain->drain_timer);
> +        timer_free(chain->drain_timer);
> +        QTAILQ_REMOVE(&n->rsc_chains, chain, next);
> +        g_free(chain);
> +    }
> +}
> +
> +static void virtio_net_rsc_cache_buf(NetRscChain *chain, NetClientState *nc,
> +                                     const uint8_t *buf, size_t size)
> +{
> +    NetRscSeg *seg;
> +
> +    seg = g_malloc(sizeof(NetRscSeg));
> +    seg->buf = g_malloc(MAX_VIRTIO_PAYLOAD);
> +
> +    memmove(seg->buf, buf, size);
> +    seg->size = size;
> +    seg->dup_ack_count = 0;
> +    seg->is_coalesced = 0;
> +    seg->nc = nc;
> +
> +    QTAILQ_INSERT_TAIL(&chain->buffers, seg, next);
> +    chain->stat.cache++;
> +
> +    virtio_net_rsc_extract_unit4(chain, seg->buf, &seg->unit);
> +}
> +
> +static int32_t virtio_net_rsc_handle_ack(NetRscChain *chain, NetRscSeg *seg,
> +                                 const uint8_t *buf, struct tcp_header *n_tcp,
> +                                 struct tcp_header *o_tcp)
> +{
> +    uint32_t nack, oack;
> +    uint16_t nwin, owin;
> +
> +    nack = htonl(n_tcp->th_ack);
> +    nwin = htons(n_tcp->th_win);
> +    oack = htonl(o_tcp->th_ack);
> +    owin = htons(o_tcp->th_win);
> +
> +    if ((nack - oack) >= MAX_TCP_PAYLOAD) {
> +        chain->stat.ack_out_of_win++;
> +        return RSC_FINAL;
> +    } else if (nack == oack) {
> +        /* duplicated ack or window probe */
> +        if (nwin == owin) {
> +            /* duplicated ack, add dup ack count due to whql test up to 1 */
> +            chain->stat.dup_ack++;
> +
> +            if (seg->dup_ack_count == 0) {
> +                seg->dup_ack_count++;
> +                chain->stat.dup_ack1++;
> +                return RSC_COALESCE;
> +            } else {
> +                /* Spec says should send it directly */
> +                chain->stat.dup_ack2++;
> +                return RSC_FINAL;
> +            }
> +        } else {
> +            /* Coalesce window update */
> +            o_tcp->th_win = n_tcp->th_win;
> +            chain->stat.win_update++;
> +            return RSC_COALESCE;
> +        }
> +    } else {
> +        /* pure ack, update ack */
> +        o_tcp->th_ack = n_tcp->th_ack;
> +        chain->stat.pure_ack++;
> +        return RSC_COALESCE;
> +    }
> +}
> +
> +static int32_t virtio_net_rsc_coalesce_data(NetRscChain *chain, NetRscSeg *seg,
> +                                    const uint8_t *buf, NetRscUnit *n_unit)
> +{
> +    void *data;
> +    uint16_t o_ip_len;
> +    uint32_t nseq, oseq;
> +    NetRscUnit *o_unit;
> +
> +    o_unit = &seg->unit;
> +    o_ip_len = htons(*o_unit->ip_plen);
> +    nseq = htonl(n_unit->tcp->th_seq);
> +    oseq = htonl(o_unit->tcp->th_seq);
> +
> +    if (n_unit->tcp_hdrlen > TCP_HDR_SZ) {
> +        /* Log this only for debugging observation */
> +        chain->stat.tcp_option++;
> +    }
> +
> +    /* Ignore packet with more/larger tcp options */
> +    if (n_unit->tcp_hdrlen > o_unit->tcp_hdrlen) {
> +        chain->stat.tcp_larger_option++;
> +        return RSC_FINAL;
> +    }
> +
> +    /* out of order or retransmitted. */
> +    if ((nseq - oseq) > MAX_TCP_PAYLOAD) {
> +        chain->stat.data_out_of_win++;
> +        return RSC_FINAL;
> +    }
> +
> +    data = ((uint8_t *)n_unit->tcp) + n_unit->tcp_hdrlen;
> +    if (nseq == oseq) {
> +        if ((0 == o_unit->payload) && n_unit->payload) {
> +            /* From no payload to payload, normal case, not a dup ack or etc */
> +            chain->stat.data_after_pure_ack++;
> +            goto coalesce;
> +        } else {
> +            return virtio_net_rsc_handle_ack(chain, seg, buf,
> +                                             n_unit->tcp, o_unit->tcp);
> +        }
> +    } else if ((nseq - oseq) != o_unit->payload) {
> +        /* Not a consistent packet, out of order */
> +        chain->stat.data_out_of_order++;
> +        return RSC_FINAL;
> +    } else {
> +coalesce:
> +        if ((o_ip_len + n_unit->payload) > chain->max_payload) {
> +            chain->stat.over_size++;
> +            return RSC_FINAL;
> +        }
> +
> +        /* Here comes the right data, the payload length in v4/v6 is different,
> +           so use the field value to update and record the new data len */
> +        o_unit->payload += n_unit->payload; /* update new data len */
> +
> +        /* update field in ip header */
> +        *o_unit->ip_plen = htons(o_ip_len + n_unit->payload);
> +
> +        /* Bring 'PUSH' big, the whql test guide says 'PUSH' can be coalesced
> +           for windows guest, while this may change the behavior for linux
> +           guest. */
> +        o_unit->tcp->th_offset_flags = n_unit->tcp->th_offset_flags;
> +
> +        o_unit->tcp->th_ack = n_unit->tcp->th_ack;
> +        o_unit->tcp->th_win = n_unit->tcp->th_win;
> +
> +        memmove(seg->buf + seg->size, data, n_unit->payload);
> +        seg->size += n_unit->payload;
> +        chain->stat.coalesced++;
> +        return RSC_COALESCE;
> +    }
> +}
> +
> +static int32_t virtio_net_rsc_coalesce4(NetRscChain *chain, NetRscSeg *seg,
> +                        const uint8_t *buf, size_t size, NetRscUnit *unit)
> +{
> +    if ((unit->ip->ip_src ^ seg->unit.ip->ip_src)
> +        || (unit->ip->ip_dst ^ seg->unit.ip->ip_dst)
> +        || (unit->tcp->th_sport ^ seg->unit.tcp->th_sport)
> +        || (unit->tcp->th_dport ^ seg->unit.tcp->th_dport)) {
> +        chain->stat.no_match++;
> +        return RSC_NO_MATCH;
> +    }
> +
> +    return virtio_net_rsc_coalesce_data(chain, seg, buf, unit);
> +}
> +
> +/* Packets with 'SYN' should bypass, other flags should be sent after drain
> + * to prevent out of order */
> +static int virtio_net_rsc_tcp_ctrl_check(NetRscChain *chain,
> +                                         struct tcp_header *tcp)
> +{
> +    uint16_t tcp_flag;
> +
> +    tcp_flag = htons(tcp->th_offset_flags) & 0x3F;
> +    if (tcp_flag & TH_SYN) {
> +        chain->stat.tcp_syn++;
> +        return RSC_BYPASS;
> +    }
> +
> +    if (tcp_flag & (TH_FIN | TH_URG | TH_RST)) {
> +        chain->stat.tcp_ctrl_drain++;
> +        return RSC_FINAL;
> +    }
> +
> +    return RSC_WANT;
> +}
> +
> +static bool virtio_net_rsc_empty_cache(NetRscChain *chain, NetClientState *nc,
> +                          const uint8_t *buf, size_t size)
> +{
> +    if (QTAILQ_EMPTY(&chain->buffers)) {
> +        chain->stat.empty_cache++;
> +        virtio_net_rsc_cache_buf(chain, nc, buf, size);
> +        timer_mod(chain->drain_timer,
> +              qemu_clock_get_ns(QEMU_CLOCK_VIRTUAL) + virtio_net_rsc_timeout);
> +        return 1;
> +    }
> +
> +    return 0;
> +}
> +
> +static size_t virtio_net_rsc_do_coalesce(NetRscChain *chain, NetClientState *nc,
> +                          const uint8_t *buf, size_t size, NetRscUnit *unit)
> +{
> +    int ret;
> +    NetRscSeg *seg, *nseg;
> +
> +    QTAILQ_FOREACH_SAFE(seg, &chain->buffers, next, nseg) {
> +        ret = virtio_net_rsc_coalesce4(chain, seg, buf, size, unit);
> +
> +        if (ret == RSC_FINAL) {
> +            if (virtio_net_rsc_drain_seg(chain, seg) == 0) {
> +                /* Send failed */
> +                chain->stat.final_failed++;
> +                return 0;
> +            }
> +
> +            /* Send current packet */
> +            return virtio_net_do_receive(nc, buf, size);
> +        } else if (ret == RSC_NO_MATCH) {
> +            continue;
> +        } else {
> +            /* Coalesced, mark coalesced flag to tell calc cksum for ipv4 */
> +            seg->is_coalesced = 1;
> +            return size;
> +        }
> +    }
> +
> +    chain->stat.no_match_cache++;
> +    virtio_net_rsc_cache_buf(chain, nc, buf, size);
> +    return size;
> +}
> +
> +/* Drain a connection data, this is to avoid out of order segments */
> +static size_t virtio_net_rsc_drain_flow(NetRscChain *chain, NetClientState *nc,
> +                        const uint8_t *buf, size_t size, uint16_t ip_start,
> +                        uint16_t ip_size, uint16_t tcp_port, uint16_t port_size)
> +{
> +    NetRscSeg *seg, *nseg;
> +
> +    QTAILQ_FOREACH_SAFE(seg, &chain->buffers, next, nseg) {
> +        if (memcmp(buf + ip_start, seg->buf + ip_start, ip_size)
> +            || memcmp(buf + tcp_port, seg->buf + tcp_port, port_size)) {
> +            continue;
> +        }
> +        if (virtio_net_rsc_drain_seg(chain, seg) == 0) {
> +            chain->stat.drain_failed++;
> +        }
> +
> +        break;
> +    }
> +
> +    return virtio_net_do_receive(nc, buf, size);
> +}
> +
> +static int32_t virtio_net_rsc_sanity_check4(NetRscChain *chain,
> +                        struct ip_header *ip, const uint8_t *buf, size_t size)
> +{
> +    uint16_t ip_len;
> +
> +    if (size < (chain->hdr_size + ETH_IP4_HDR_SZ + TCP_HDR_SZ)) {
> +        return RSC_BYPASS;
> +    }
> +
> +    /* Not an ipv4 one */
> +    if (((0xF0 & ip->ip_ver_len) >> 4) != IP_HEADER_VERSION_4) {
> +        chain->stat.ip_option++;
> +        return RSC_BYPASS;
> +    }
> +
> +    /* Don't handle packets with ip option */
> +    if (IP4_HEADER_LEN != (0xF & ip->ip_ver_len)) {
> +        chain->stat.ip_option++;
> +        return RSC_BYPASS;
> +    }
> +
> +    /* Don't handle packets with ip fragment */
> +    if (!(htons(ip->ip_off) & IP_DF)) {
> +        chain->stat.ip_frag++;
> +        return RSC_BYPASS;
> +    }
> +
> +    if (ip->ip_p != IPPROTO_TCP) {
> +        chain->stat.bypass_not_tcp++;
> +        return RSC_BYPASS;
> +    }
> +
> +    /* Sanity check */
> +    ip_len = htons(ip->ip_len);
> +    if (ip_len < (IP4_HDR_SZ + TCP_HDR_SZ)
> +        || ip_len > (size - chain->hdr_size - ETH_HDR_SZ)) {
> +        chain->stat.ip_hacked++;
> +        return RSC_BYPASS;
> +    }
> +
> +    return RSC_WANT;
> +}
> +
> +static size_t virtio_net_rsc_receive4(void *opq, NetClientState* nc,
> +                                      const uint8_t *buf, size_t size)
> +{
> +    int32_t ret;
> +    NetRscChain *chain;
> +    NetRscUnit unit;
> +
> +    chain = (NetRscChain *)opq;
> +    virtio_net_rsc_extract_unit4(chain, buf, &unit);
> +    if (RSC_WANT != virtio_net_rsc_sanity_check4(chain, unit.ip, buf, size)) {
> +        return virtio_net_do_receive(nc, buf, size);
> +    }
> +
> +    ret = virtio_net_rsc_tcp_ctrl_check(chain, unit.tcp);
> +    if (ret == RSC_BYPASS) {
> +        return virtio_net_do_receive(nc, buf, size);
> +    } else if (ret == RSC_FINAL) {
> +        return virtio_net_rsc_drain_flow(chain, nc, buf, size,
> +                        ((chain->hdr_size + ETH_HDR_SZ) + 12), IP4_ADDR_SIZE,
> +                        (chain->hdr_size + ETH_IP4_HDR_SZ), TCP_PORT_SIZE);
> +    }
> +
> +    if (virtio_net_rsc_empty_cache(chain, nc, buf, size)) {
> +        return size;
> +    }
> +
> +    return virtio_net_rsc_do_coalesce(chain, nc, buf, size, &unit);
> +}
> +
> +static NetRscChain *virtio_net_rsc_lookup_chain(VirtIONet * n,
> +                                            NetClientState *nc, uint16_t proto)
> +{
> +    NetRscChain *chain;
> +
> +    /* Only handle IPv4/6 */
> +    if (proto != (uint16_t)ETH_P_IP) {
> +        return NULL;
> +    }
> +
> +    QTAILQ_FOREACH(chain, &n->rsc_chains, next) {
> +        if (chain->proto == proto) {
> +            return chain;
> +        }
> +    }
> +
> +    chain = g_malloc(sizeof(*chain));
> +    chain->hdr_size = n->guest_hdr_len;
> +    chain->proto = proto;
> +    chain->max_payload = MAX_IP4_PAYLOAD;
> +    chain->drain_timer = timer_new_ns(QEMU_CLOCK_VIRTUAL,
> +                                      virtio_net_rsc_purge, chain);
> +    memset(&chain->stat, 0, sizeof(chain->stat));
> +
> +    QTAILQ_INIT(&chain->buffers);
> +    QTAILQ_INSERT_TAIL(&n->rsc_chains, chain, next);
> +
> +    return chain;
> +}
> +
> +static ssize_t virtio_net_rsc_receive(NetClientState *nc,
> +                                      const uint8_t *buf, size_t size)
> +{
> +    uint16_t proto;
> +    NetRscChain *chain;
> +    struct eth_header *eth;
> +    VirtIONet *n;
> +
> +    n = qemu_get_nic_opaque(nc);
> +    if (size < (n->guest_hdr_len + ETH_HDR_SZ)) {
> +        return virtio_net_do_receive(nc, buf, size);
> +    }
> +
> +    eth = (struct eth_header *)(buf + n->guest_hdr_len);
> +    proto = htons(eth->h_proto);
> +
> +    chain = virtio_net_rsc_lookup_chain(n, nc, proto);
> +    if (!chain) {
> +        return virtio_net_do_receive(nc, buf, size);
> +    } else {
> +        chain->stat.received++;
> +        return virtio_net_rsc_receive4(chain, nc, buf, size);
> +    }
> +}
> +
> +static ssize_t virtio_net_receive(NetClientState *nc,
> +                                  const uint8_t *buf, size_t size)
> +{
> +    if (virtio_net_rsc_bypass) {
> +        return virtio_net_do_receive(nc, buf, size);
> +    } else {
> +        return virtio_net_rsc_receive(nc, buf, size);
> +    }
> +}
> +
>  static NetClientInfo net_virtio_info = {
>      .type = NET_CLIENT_OPTIONS_KIND_NIC,
>      .size = sizeof(NICState),
> @@ -1814,6 +2296,7 @@ static void virtio_net_device_realize(DeviceState *dev, Error **errp)
>      nc = qemu_get_queue(n->nic);
>      nc->rxfilter_notify_enabled = 1;
>  
> +    QTAILQ_INIT(&n->rsc_chains);
>      n->qdev = dev;
>      register_savevm(dev, "virtio-net", -1, VIRTIO_NET_VM_VERSION,
>                      virtio_net_save, virtio_net_load, n);
> @@ -1848,6 +2331,7 @@ static void virtio_net_device_unrealize(DeviceState *dev, Error **errp)
>      g_free(n->vqs);
>      qemu_del_nic(n->nic);
>      virtio_cleanup(vdev);
> +    virtio_net_rsc_cleanup(n);
>  }
>  
>  static void virtio_net_instance_init(Object *obj)
> diff --git a/include/hw/virtio/virtio-net.h b/include/hw/virtio/virtio-net.h
> index 0cabdb6..6939e92 100644
> --- a/include/hw/virtio/virtio-net.h
> +++ b/include/hw/virtio/virtio-net.h
> @@ -59,6 +59,7 @@ typedef struct VirtIONet {
>      VirtIONetQueue *vqs;
>      VirtQueue *ctrl_vq;
>      NICState *nic;
> +    QTAILQ_HEAD(, NetRscChain) rsc_chains;
>      uint32_t tx_timeout;
>      int32_t tx_burst;
>      uint32_t has_vnet_hdr;
> diff --git a/include/hw/virtio/virtio.h b/include/hw/virtio/virtio.h
> index 2b5b248..3b1dfa8 100644
> --- a/include/hw/virtio/virtio.h
> +++ b/include/hw/virtio/virtio.h
> @@ -128,6 +128,78 @@ typedef struct VirtioDeviceClass {
>      int (*load)(VirtIODevice *vdev, QEMUFile *f, int version_id);
>  } VirtioDeviceClass;
>  
> +/* Coalesced packets type & status */
> +typedef enum {
> +    RSC_COALESCE,           /* Data been coalesced */
> +    RSC_FINAL,              /* Will terminate current connection */
> +    RSC_NO_MATCH,           /* No matched in the buffer pool */
> +    RSC_BYPASS,             /* Packet to be bypass, not tcp, tcp ctrl, etc */
> +    RSC_WANT                /* Data want to be coalesced */
> +} COALESCE_STATUS;
> +
> +typedef struct NetRscStat {
> +    uint32_t received;
> +    uint32_t coalesced;
> +    uint32_t over_size;
> +    uint32_t cache;
> +    uint32_t empty_cache;
> +    uint32_t no_match_cache;
> +    uint32_t win_update;
> +    uint32_t no_match;
> +    uint32_t tcp_syn;
> +    uint32_t tcp_ctrl_drain;
> +    uint32_t dup_ack;
> +    uint32_t dup_ack1;
> +    uint32_t dup_ack2;
> +    uint32_t pure_ack;
> +    uint32_t ack_out_of_win;
> +    uint32_t data_out_of_win;
> +    uint32_t data_out_of_order;
> +    uint32_t data_after_pure_ack;
> +    uint32_t bypass_not_tcp;
> +    uint32_t tcp_option;
> +    uint32_t tcp_larger_option;
> +    uint32_t ip_frag;
> +    uint32_t ip_hacked;
> +    uint32_t ip_option;
> +    uint32_t purge_failed;
> +    uint32_t drain_failed;
> +    uint32_t final_failed;
> +    int64_t  timer;
> +} NetRscStat;
> +
> +/* Rsc unit general info used to check whether segments can be coalesced */
> +typedef struct NetRscUnit {
> +   struct ip_header *ip;   /* ip header */
> +   uint16_t *ip_plen;      /* data len pointer in ip header field */
> +   struct tcp_header *tcp; /* tcp header */
> +   uint16_t tcp_hdrlen;    /* tcp header len */
> +   uint16_t payload;       /* pure payload without virtio/eth/ip/tcp */
> +} NetRscUnit;
> +
> +/* Coalesced segment */
> +typedef struct NetRscSeg {
> +    QTAILQ_ENTRY(NetRscSeg) next;
> +    void *buf;
> +    size_t size;
> +    uint32_t dup_ack_count;
> +    bool is_coalesced;      /* set when the ipv4 header checksum must be recalculated */
> +    NetRscUnit unit;
> +    NetClientState *nc;
> +} NetRscSeg;
> +
> +
> +/* Chain is divided by protocol(ipv4/v6) and NetClientInfo */
> +typedef struct NetRscChain {
> +   QTAILQ_ENTRY(NetRscChain) next;
> +   uint16_t hdr_size;
> +   uint16_t proto;
> +   uint16_t max_payload;
> +   QEMUTimer *drain_timer;
> +   QTAILQ_HEAD(, NetRscSeg) buffers;
> +   NetRscStat stat;
> +} NetRscChain;
> +
>  void virtio_instance_init_common(Object *proxy_obj, void *data,
>                                   size_t vdev_size, const char *vdev_name);
>  
> -- 
> 2.5.0

* Re: [Qemu-devel] [ Patch 0/2] Support Receive-Segment-Offload(RSC) for WHQL test of Window guest
  2016-03-15  9:17 [Qemu-devel] [ Patch 0/2] Support Receive-Segment-Offload(RSC) for WHQL test of Window guest wexu
                   ` (2 preceding siblings ...)
  2016-03-15 10:01 ` [Qemu-devel] [ Patch 0/2] Support Receive-Segment-Offload(RSC) for WHQL test of Window guest Michael S. Tsirkin
@ 2016-03-17  6:47 ` Jason Wang
  2016-03-17 15:21   ` Wei Xu
  3 siblings, 1 reply; 32+ messages in thread
From: Jason Wang @ 2016-03-17  6:47 UTC (permalink / raw)
  To: wexu, qemu-devel; +Cc: marcel, victork, dfleytma, yvugenfi, mst



On 03/15/2016 05:17 PM, wexu@redhat.com wrote:
> From: Wei Xu <wexu@redhat.com>
>
> Fixed issues based on rfc patch v2:
> 1. Removed big param list, replace it with 'NetRscUnit' 
> 2. Different virtio header size
> 3. Modify callback function to direct call.
> 4. Needn't check the failure of g_malloc()
> 5. Other code format adjustment, macro naming, etc 
>
> This patch is to support WHQL test for Windows guest, while this feature also
> benifits other guest works as a kernel 'gro' like feature with userspace implementation.
> Feature information:
>   http://msdn.microsoft.com/en-us/library/windows/hardware/jj853324
>
> Both IPv4 and IPv6 are supported, though performance with userspace virtio
> is slow than vhost-net, there is about 1x to 3x performance improvement to
> userspace virtio, this is done by turning this feature on and disable
> 'tso/gso/gro' on corresponding tap interface and guest interface, while get
> less improment with all these feature on.
>
> Test steps:
> Although this feature is mainly used for window guest, i used linux guest to help test
> the feature, to make things simple, i used 3 steps to test the patch as i moved on.
> 1. With a tcp socket client/server pair running on 2 linux guest, thus i can control
> the traffic and debugging the code as i want.
> 2. Netperf on linux guest test the throughput.
> 3. WHQL test with 2 Windows guests.
>
> Current status:
> IPv4 pass all the above tests.
> IPv6 just passed test step 1 and 2 as described ahead, the virtio nic cannot
> receive any packet in WHQL test, looks like the test traffic is not sent from
> on the support machine, test device can access both host and another linux
> guest, tried a lot of ways to work it out but failed, maybe debug from windows
> guest driver side can help figuring it out.

I think you need to figure out where the packet was dropped first. If the
packet was not dropped by the Windows guest, you may want to try dropmonitor.

>
> Note:
> A 'MessageDevice' nic chose as 'Realtek' will panic the system sometimes during setup,
> this can be figured out by replacing it with an 'e1000' nic.
>
> Todo:
> More sanity check and tcp 'ecn' and 'window' scale test.
>
> Wei Xu (2):
>   virtio-net rsc: support coalescing ipv4 tcp traffic
>   virtio-net rsc: support coalescing ipv6 tcp traffic
>
>  hw/net/virtio-net.c            | 602 ++++++++++++++++++++++++++++++++++++++++-
>  include/hw/virtio/virtio-net.h |   1 +
>  include/hw/virtio/virtio.h     |  75 +++++
>  3 files changed, 677 insertions(+), 1 deletion(-)
>

^ permalink raw reply	[flat|nested] 32+ messages in thread

* Re: [Qemu-devel] [ Patch 1/2] virtio-net rsc: support coalescing ipv4 tcp traffic
  2016-03-15  9:17 ` [Qemu-devel] [ Patch 1/2] virtio-net rsc: support coalescing ipv4 tcp traffic wexu
  2016-03-15 10:00   ` Michael S. Tsirkin
@ 2016-03-17  8:42   ` Jason Wang
  2016-03-17 16:45     ` Wei Xu
  1 sibling, 1 reply; 32+ messages in thread
From: Jason Wang @ 2016-03-17  8:42 UTC (permalink / raw)
  To: wexu, qemu-devel; +Cc: marcel, victork, dfleytma, yvugenfi, mst



On 03/15/2016 05:17 PM, wexu@redhat.com wrote:
> From: Wei Xu <wexu@redhat.com>
>
> All the data packets in a tcp connection will be cached to a big buffer
> in every receive interval, and will be sent out via a timer, the
> 'virtio_net_rsc_timeout' controls the interval, the value will influent the
> performance and response of tcp connection extremely, 50000(50us) is a
> experience value to gain a performance improvement, since the whql test
> sends packets every 100us, so '300000(300us)' can pass the test case,
> this is also the default value, it's gonna to be tunable.
>
> The timer will only be triggered if the packets pool is not empty,
> and it'll drain off all the cached packets
>
> 'NetRscChain' is used to save the segments of different protocols in a
> VirtIONet device.
>
> The main handler of TCP includes TCP window update, duplicated ACK check
> and the real data coalescing if the new segment passed sanity check
> and is identified as an 'wanted' one.
>
> An 'wanted' segment means:
> 1. Segment is within current window and the sequence is the expected one.
> 2. ACK of the segment is in the valid window.
> 3. If the ACK in the segment is a duplicated one, then it must less than 2,
>    this is to notify upper layer TCP starting retransmission due to the spec.
>
> Sanity check includes:
> 1. Incorrect version in IP header
> 2. IP options & IP fragment
> 3. Not a TCP packets
> 4. Sanity size check to prevent buffer overflow attack.
>
> There maybe more cases should be considered such as ip identification other
> flags, while it broke the test because windows set it to the same even it's
> not a fragment.
>
> Normally it includes 2 typical ways to handle a TCP control flag, 'bypass'
> and 'finalize', 'bypass' means should be sent out directly, and 'finalize'
> means the packets should also be bypassed, and this should be done
> after searching for the same connection packets in the pool and sending
> all of them out, this is to avoid out of data.
>
> All the 'SYN' packets will be bypassed since this always begin a new'
> connection, other flags such 'FIN/RST' will trigger a finalization, because
> this normally happens upon a connection is going to be closed, an 'URG' packet
> also finalize current coalescing unit while there maybe protocol difference to
> different OS.

But an URG packet should be sent as quickly as possible regardless of
ordering, no?

>
> Statistics can be used to monitor the basic coalescing status, the 'out of order'
> and 'out of window' means how many retransmitting packets, thus describe the
> performance intuitively.
>
> Signed-off-by: Wei Xu <wexu@redhat.com>
> ---
>  hw/net/virtio-net.c            | 486 ++++++++++++++++++++++++++++++++++++++++-
>  include/hw/virtio/virtio-net.h |   1 +
>  include/hw/virtio/virtio.h     |  72 ++++++
>  3 files changed, 558 insertions(+), 1 deletion(-)
>
> diff --git a/hw/net/virtio-net.c b/hw/net/virtio-net.c
> index 5798f87..c23b45f 100644
> --- a/hw/net/virtio-net.c
> +++ b/hw/net/virtio-net.c
> @@ -15,10 +15,12 @@
>  #include "qemu/iov.h"
>  #include "hw/virtio/virtio.h"
>  #include "net/net.h"
> +#include "net/eth.h"
>  #include "net/checksum.h"
>  #include "net/tap.h"
>  #include "qemu/error-report.h"
>  #include "qemu/timer.h"
> +#include "qemu/sockets.h"
>  #include "hw/virtio/virtio-net.h"
>  #include "net/vhost_net.h"
>  #include "hw/virtio/virtio-bus.h"
> @@ -38,6 +40,35 @@
>  #define endof(container, field) \
>      (offsetof(container, field) + sizeof(((container *)0)->field))
>  
> +#define ETH_HDR_SZ (sizeof(struct eth_header))
> +#define IP4_HDR_SZ (sizeof(struct ip_header))
> +#define TCP_HDR_SZ (sizeof(struct tcp_header))
> +#define ETH_IP4_HDR_SZ (ETH_HDR_SZ + IP4_HDR_SZ)
> +
> +#define IP4_ADDR_SIZE   8                   /* ipv4 saddr + daddr */
> +#define TCP_PORT_SIZE   4                   /* sport + dport */
> +
> +/* IPv4 max payload, 16 bits in the header */
> +#define MAX_IP4_PAYLOAD (65535 - IP4_HDR_SZ)
> +#define MAX_TCP_PAYLOAD 65535
> +
> +/* max payload with virtio header */
> +#define MAX_VIRTIO_PAYLOAD  (sizeof(struct virtio_net_hdr_mrg_rxbuf) \
> +                                + ETH_HDR_SZ + MAX_TCP_PAYLOAD)

Should we use guest_hdr_len instead of sizeof() here? Consider the case
where the vnet_hdr is extended in the future.
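
A minimal sketch of the idea, assuming the buffer bound is computed at run
time from the header length negotiated with the guest. The helper name
rsc_max_buf_size() and the RSC_* constants are hypothetical and only mirror
the terms of the MAX_VIRTIO_PAYLOAD formula in the patch:

```c
#include <stddef.h>

#define RSC_ETH_HDR_LEN     14u     /* Ethernet header, no VLAN tag */
#define RSC_MAX_TCP_PAYLOAD 65535u  /* same bound as MAX_TCP_PAYLOAD above */

/* Hypothetical helper: size the coalescing buffer from the guest header
 * length known at run time instead of a compile-time sizeof(). */
static size_t rsc_max_buf_size(size_t guest_hdr_len)
{
    return guest_hdr_len + RSC_ETH_HDR_LEN + RSC_MAX_TCP_PAYLOAD;
}
```

With something like this, the allocation in virtio_net_rsc_cache_buf() would
follow n->guest_hdr_len automatically if the header grows.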

> +
> +#define IP4_HEADER_LEN 5 /* header lenght value in ip header without option */

typo, should be 'length'

> +
> +/* Purge coalesced packets timer interval */
> +#define RSC_TIMER_INTERVAL  300000
> +
> +/* Switcher to enable/disable rsc */
> +static bool virtio_net_rsc_bypass = 1;
> +
> +/* This value affects the performance a lot, and should be tuned carefully,
> +   '300000'(300us) is the recommended value to pass the WHQL test, '50000' can
> +   gain 2x netperf throughput with tso/gso/gro 'off'. */
> +static uint32_t virtio_net_rsc_timeout = RSC_TIMER_INTERVAL;
> +
>  typedef struct VirtIOFeature {
>      uint32_t flags;
>      size_t end;
> @@ -1089,7 +1120,8 @@ static int receive_filter(VirtIONet *n, const uint8_t *buf, int size)
>      return 0;
>  }
>  
> -static ssize_t virtio_net_receive(NetClientState *nc, const uint8_t *buf, size_t size)
> +static ssize_t virtio_net_do_receive(NetClientState *nc,
> +                                  const uint8_t *buf, size_t size)

Indentation should also be changed here.

>  {
>      VirtIONet *n = qemu_get_nic_opaque(nc);
>      VirtIONetQueue *q = virtio_net_get_subqueue(nc);
> @@ -1685,6 +1717,456 @@ static int virtio_net_load_device(VirtIODevice *vdev, QEMUFile *f,
>      return 0;
>  }
>  
> +static void virtio_net_rsc_extract_unit4(NetRscChain *chain,
> +                                         const uint8_t *buf, NetRscUnit* unit)
> +{
> +    uint16_t ip_hdrlen;
> +
> +    unit->ip = (struct ip_header *)(buf + chain->hdr_size + ETH_HDR_SZ);
> +    ip_hdrlen = ((0xF & unit->ip->ip_ver_len) << 2);
> +    unit->ip_plen = &unit->ip->ip_len;
> +    unit->tcp = (struct tcp_header *)(((uint8_t *)unit->ip) + ip_hdrlen);
> +    unit->tcp_hdrlen = (htons(unit->tcp->th_offset_flags) & 0xF000) >> 10;
> +    unit->payload = htons(*unit->ip_plen) - ip_hdrlen - unit->tcp_hdrlen;
> +}
> +
> +static void virtio_net_rsc_ipv4_checksum(struct ip_header *ip)
> +{
> +    uint32_t sum;
> +
> +    ip->ip_sum = 0;
> +    sum = net_checksum_add_cont(IP4_HDR_SZ, (uint8_t *)ip, 0);
> +    ip->ip_sum = cpu_to_be16(net_checksum_finish(sum));
> +}
> +
> +static size_t virtio_net_rsc_drain_seg(NetRscChain *chain, NetRscSeg *seg)
> +{
> +    int ret;
> +
> +    virtio_net_rsc_ipv4_checksum(seg->unit.ip);
> +    ret = virtio_net_do_receive(seg->nc, seg->buf, seg->size);
> +    QTAILQ_REMOVE(&chain->buffers, seg, next);
> +    g_free(seg->buf);
> +    g_free(seg);
> +
> +    return ret;
> +}
> +
> +static void virtio_net_rsc_purge(void *opq)
> +{
> +    NetRscChain *chain = (NetRscChain *)opq;
> +    NetRscSeg *seg, *rn;
> +
> +    QTAILQ_FOREACH_SAFE(seg, &chain->buffers, next, rn) {
> +        if (virtio_net_rsc_drain_seg(chain, seg) == 0) {
> +            chain->stat.purge_failed++;
> +            continue;
> +        }
> +    }
> +
> +    chain->stat.timer++;
> +    if (!QTAILQ_EMPTY(&chain->buffers)) {
> +        timer_mod(chain->drain_timer,
> +              qemu_clock_get_ns(QEMU_CLOCK_VIRTUAL) + virtio_net_rsc_timeout);
> +    }
> +}
> +
> +static void virtio_net_rsc_cleanup(VirtIONet *n)
> +{
> +    NetRscChain *chain, *rn_chain;
> +    NetRscSeg *seg, *rn_seg;
> +
> +    QTAILQ_FOREACH_SAFE(chain, &n->rsc_chains, next, rn_chain) {
> +        QTAILQ_FOREACH_SAFE(seg, &chain->buffers, next, rn_seg) {
> +            QTAILQ_REMOVE(&chain->buffers, seg, next);
> +            g_free(seg->buf);
> +            g_free(seg);
> +        }
> +
> +        timer_del(chain->drain_timer);
> +        timer_free(chain->drain_timer);
> +        QTAILQ_REMOVE(&n->rsc_chains, chain, next);
> +        g_free(chain);
> +    }
> +}
> +
> +static void virtio_net_rsc_cache_buf(NetRscChain *chain, NetClientState *nc,
> +                                     const uint8_t *buf, size_t size)
> +{
> +    NetRscSeg *seg;
> +
> +    seg = g_malloc(sizeof(NetRscSeg));
> +    seg->buf = g_malloc(MAX_VIRTIO_PAYLOAD);
> +
> +    memmove(seg->buf, buf, size);

Can seg->buf overlap with buf? If not, why use memmove() instead of
memcpy()?
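
For illustration, a standalone sketch of the caching copy, assuming the
destination is always a fresh allocation. rsc_copy_packet() is a
hypothetical stand-in that uses plain malloc() instead of g_malloc(), and
the size check is an addition that is not in the patch:

```c
#include <stdint.h>
#include <stdlib.h>
#include <string.h>

/* Hypothetical stand-in for the caching step: copy an incoming packet
 * into a newly allocated buffer of 'cap' bytes. */
static uint8_t *rsc_copy_packet(const uint8_t *buf, size_t size, size_t cap)
{
    uint8_t *copy;

    if (size > cap) {
        return NULL;            /* packet would not fit in the buffer */
    }

    copy = malloc(cap);
    if (!copy) {
        return NULL;
    }

    /* A fresh allocation cannot overlap 'buf', so memcpy() is enough. */
    memcpy(copy, buf, size);
    return copy;
}
```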

> +    seg->size = size;
> +    seg->dup_ack_count = 0;
> +    seg->is_coalesced = 0;
> +    seg->nc = nc;
> +
> +    QTAILQ_INSERT_TAIL(&chain->buffers, seg, next);
> +    chain->stat.cache++;
> +
> +    virtio_net_rsc_extract_unit4(chain, seg->buf, &seg->unit);
> +}
> +
> +static int32_t virtio_net_rsc_handle_ack(NetRscChain *chain, NetRscSeg *seg,
> +                                 const uint8_t *buf, struct tcp_header *n_tcp,
> +                                 struct tcp_header *o_tcp)
> +{
> +    uint32_t nack, oack;
> +    uint16_t nwin, owin;
> +
> +    nack = htonl(n_tcp->th_ack);
> +    nwin = htons(n_tcp->th_win);
> +    oack = htonl(o_tcp->th_ack);
> +    owin = htons(o_tcp->th_win);
> +
> +    if ((nack - oack) >= MAX_TCP_PAYLOAD) {
> +        chain->stat.ack_out_of_win++;
> +        return RSC_FINAL;
> +    } else if (nack == oack) {
> +        /* duplicated ack or window probe */
> +        if (nwin == owin) {
> +            /* duplicated ack, add dup ack count due to whql test up to 1 */
> +            chain->stat.dup_ack++;
> +
> +            if (seg->dup_ack_count == 0) {
> +                seg->dup_ack_count++;
> +                chain->stat.dup_ack1++;
> +                return RSC_COALESCE;
> +            } else {
> +                /* Spec says should send it directly */
> +                chain->stat.dup_ack2++;
> +                return RSC_FINAL;
> +            }
> +        } else {
> +            /* Coalesce window update */
> +            o_tcp->th_win = n_tcp->th_win;
> +            chain->stat.win_update++;
> +            return RSC_COALESCE;
> +        }
> +    } else {
> +        /* pure ack, update ack */
> +        o_tcp->th_ack = n_tcp->th_ack;
> +        chain->stat.pure_ack++;
> +        return RSC_COALESCE;

Looks like there's something I missed. The spec says:

"In other words, any pure ACK that is not a duplicate ACK or a window
update triggers an exception and must not be coalesced. All such pure
ACKs must be indicated as individual segments."

Does it mean we *should not* coalesce window updates and pure ACKs
(since they can wake up transmission)?
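
To make the question concrete, here is a sketch of one literal reading of
the quoted sentence: a pure ACK whose ack number advanced is neither a
duplicate nor a window update, so it is finalized instead of coalesced.
This is only an assumption about how the text could be interpreted, not a
statement of what the spec requires; the enum and function names are
hypothetical, and all values are in host byte order:

```c
#include <stdint.h>

enum rsc_ack_action { RSC_ACK_COALESCE, RSC_ACK_FINAL };

/* oack/owin come from the cached segment, nack/nwin from the new one. */
static enum rsc_ack_action rsc_classify_pure_ack(uint32_t oack, uint16_t owin,
                                                 uint32_t nack, uint16_t nwin,
                                                 unsigned dup_ack_count)
{
    if (nack == oack) {
        if (nwin == owin) {
            /* Duplicate ACK: keep the patch behaviour, coalesce only the
             * first one and finalize on any later one. */
            return dup_ack_count == 0 ? RSC_ACK_COALESCE : RSC_ACK_FINAL;
        }
        /* Same ack number, different window: window update. */
        return RSC_ACK_COALESCE;
    }

    /* Ack number changed: not a duplicate and not a window update, so
     * under this reading it is indicated as an individual segment. */
    return RSC_ACK_FINAL;
}
```

The only difference from virtio_net_rsc_handle_ack() above is the last
branch, which returns a final status where the patch returns RSC_COALESCE.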

> +    }
> +}
> +
> +static int32_t virtio_net_rsc_coalesce_data(NetRscChain *chain, NetRscSeg *seg,
> +                                    const uint8_t *buf, NetRscUnit *n_unit)
> +{
> +    void *data;
> +    uint16_t o_ip_len;
> +    uint32_t nseq, oseq;
> +    NetRscUnit *o_unit;
> +
> +    o_unit = &seg->unit;
> +    o_ip_len = htons(*o_unit->ip_plen);
> +    nseq = htonl(n_unit->tcp->th_seq);
> +    oseq = htonl(o_unit->tcp->th_seq);
> +
> +    if (n_unit->tcp_hdrlen > TCP_HDR_SZ) {
> +        /* Log this only for debugging observation */
> +        chain->stat.tcp_option++;
> +    }
> +
> +    /* Ignore packet with more/larger tcp options */
> +    if (n_unit->tcp_hdrlen > o_unit->tcp_hdrlen) {

What if n_unit->tcp_hdrlen < o_unit->tcp_hdrlen?

> +        chain->stat.tcp_larger_option++;
> +        return RSC_FINAL;
> +    }
> +
> +    /* out of order or retransmitted. */
> +    if ((nseq - oseq) > MAX_TCP_PAYLOAD) {
> +        chain->stat.data_out_of_win++;
> +        return RSC_FINAL;
> +    }
> +
> +    data = ((uint8_t *)n_unit->tcp) + n_unit->tcp_hdrlen;
> +    if (nseq == oseq) {
> +        if ((0 == o_unit->payload) && n_unit->payload) {
> +            /* From no payload to payload, normal case, not a dup ack or etc */
> +            chain->stat.data_after_pure_ack++;
> +            goto coalesce;
> +        } else {
> +            return virtio_net_rsc_handle_ack(chain, seg, buf,
> +                                             n_unit->tcp, o_unit->tcp);
> +        }
> +    } else if ((nseq - oseq) != o_unit->payload) {
> +        /* Not a consistent packet, out of order */
> +        chain->stat.data_out_of_order++;
> +        return RSC_FINAL;
> +    } else {
> +coalesce:
> +        if ((o_ip_len + n_unit->payload) > chain->max_payload) {
> +            chain->stat.over_size++;
> +            return RSC_FINAL;
> +        }
> +
> +        /* Here comes the right data, the payload lengh in v4/v6 is different,
> +           so use the field value to update and record the new data len */
> +        o_unit->payload += n_unit->payload; /* update new data len */
> +
> +        /* update field in ip header */
> +        *o_unit->ip_plen = htons(o_ip_len + n_unit->payload);
> +
> +        /* Bring 'PUSH' big, the whql test guide says 'PUSH' can be coalesced
> +           for windows guest, while this may change the behavior for linux
> +           guest. */

This needs more thought, 'can' probably means don't. (Linux GRO won't
merge PUSH packets.)
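
A conservative option, sketched under the assumption that a segment
carrying PSH should end the current coalescing unit rather than be merged
into it. The names are hypothetical; the argument is the 16-bit
data-offset/flags word already converted to host byte order:

```c
#include <stdbool.h>
#include <stdint.h>

#define RSC_TH_PUSH 0x08u   /* PSH bit in the TCP flags field */

enum rsc_push_action { RSC_PUSH_COALESCE, RSC_PUSH_FINAL };

/* Decide whether a new segment may be merged based on its PSH flag. */
static enum rsc_push_action rsc_check_push(uint16_t offset_flags_host)
{
    bool psh = (offset_flags_host & RSC_TH_PUSH) != 0;

    return psh ? RSC_PUSH_FINAL : RSC_PUSH_COALESCE;
}
```

Whether Windows guests need the merged behaviour for WHQL is a separate
question; this only shows where such a check could sit.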

> +        o_unit->tcp->th_offset_flags = n_unit->tcp->th_offset_flags;
> +
> +        o_unit->tcp->th_ack = n_unit->tcp->th_ack;
> +        o_unit->tcp->th_win = n_unit->tcp->th_win;
> +
> +        memmove(seg->buf + seg->size, data, n_unit->payload);
> +        seg->size += n_unit->payload;
> +        chain->stat.coalesced++;
> +        return RSC_COALESCE;
> +    }
> +}
> +
> +static int32_t virtio_net_rsc_coalesce4(NetRscChain *chain, NetRscSeg *seg,
> +                        const uint8_t *buf, size_t size, NetRscUnit *unit)
> +{
> +    if ((unit->ip->ip_src ^ seg->unit.ip->ip_src)
> +        || (unit->ip->ip_dst ^ seg->unit.ip->ip_dst)
> +        || (unit->tcp->th_sport ^ seg->unit.tcp->th_sport)
> +        || (unit->tcp->th_dport ^ seg->unit.tcp->th_dport)) {
> +        chain->stat.no_match++;
> +        return RSC_NO_MATCH;
> +    }
> +
> +    return virtio_net_rsc_coalesce_data(chain, seg, buf, unit);
> +}
> +
> +/* Pakcets with 'SYN' should bypass, other flag should be sent after drain
> + * to prevent out of order */
> +static int virtio_net_rsc_tcp_ctrl_check(NetRscChain *chain,
> +                                         struct tcp_header *tcp)
> +{
> +    uint16_t tcp_flag;
> +
> +    tcp_flag = htons(tcp->th_offset_flags) & 0x3F;
> +    if (tcp_flag & TH_SYN) {
> +        chain->stat.tcp_syn++;
> +        return RSC_BYPASS;
> +    }
> +
> +    if (tcp_flag & (TH_FIN | TH_URG | TH_RST)) {
> +        chain->stat.tcp_ctrl_drain++;
> +        return RSC_FINAL;
> +    }
> +
> +    return RSC_WANT;
> +}
> +
> +static bool virtio_net_rsc_empty_cache(NetRscChain *chain, NetClientState *nc,
> +                          const uint8_t *buf, size_t size)

indentation looks wrong.

> +{
> +    if (QTAILQ_EMPTY(&chain->buffers)) {
> +        chain->stat.empty_cache++;
> +        virtio_net_rsc_cache_buf(chain, nc, buf, size);
> +        timer_mod(chain->drain_timer,
> +              qemu_clock_get_ns(QEMU_CLOCK_VIRTUAL) + virtio_net_rsc_timeout);
> +        return 1;
> +    }
> +
> +    return 0;
> +}
> +
> +static size_t virtio_net_rsc_do_coalesce(NetRscChain *chain, NetClientState *nc,
> +                          const uint8_t *buf, size_t size, NetRscUnit *unit)
> +{

and here.

> +    int ret;
> +    NetRscSeg *seg, *nseg;
> +
> +    QTAILQ_FOREACH_SAFE(seg, &chain->buffers, next, nseg) {
> +        ret = virtio_net_rsc_coalesce4(chain, seg, buf, size, unit);
> +
> +        if (ret == RSC_FINAL) {
> +            if (virtio_net_rsc_drain_seg(chain, seg) == 0) {
> +                /* Send failed */
> +                chain->stat.final_failed++;
> +                return 0;
> +            }
> +
> +            /* Send current packet */
> +            return virtio_net_do_receive(nc, buf, size);
> +        } else if (ret == RSC_NO_MATCH) {
> +            continue;
> +        } else {
> +            /* Coalesced, mark coalesced flag to tell calc cksum for ipv4 */
> +            seg->is_coalesced = 1;
> +            return size;
> +        }
> +    }
> +
> +    chain->stat.no_match_cache++;
> +    virtio_net_rsc_cache_buf(chain, nc, buf, size);
> +    return size;
> +}
> +
> +/* Drain a connection data, this is to avoid out of order segments */
> +static size_t virtio_net_rsc_drain_flow(NetRscChain *chain, NetClientState *nc,
> +                        const uint8_t *buf, size_t size, uint16_t ip_start,
> +                        uint16_t ip_size, uint16_t tcp_port, uint16_t port_size)
> +{
> +    NetRscSeg *seg, *nseg;
> +
> +    QTAILQ_FOREACH_SAFE(seg, &chain->buffers, next, nseg) {
> +        if (memcmp(buf + ip_start, seg->buf + ip_start, ip_size)
> +            || memcmp(buf + tcp_port, seg->buf + tcp_port, port_size)) {
> +            continue;
> +        }
> +        if (virtio_net_rsc_drain_seg(chain, seg) == 0) {
> +            chain->stat.drain_failed++;
> +        }
> +
> +        break;
> +    }
> +
> +    return virtio_net_do_receive(nc, buf, size);
> +}
> +
> +static int32_t virtio_net_rsc_sanity_check4(NetRscChain *chain,
> +                        struct ip_header *ip, const uint8_t *buf, size_t size)
> +{
> +    uint16_t ip_len;
> +
> +    if (size < (chain->hdr_size + ETH_IP4_HDR_SZ + TCP_HDR_SZ)) {
> +        return RSC_BYPASS;
> +    }
> +
> +    /* Not an ipv4 one */
> +    if (((0xF0 & ip->ip_ver_len) >> 4) != IP_HEADER_VERSION_4) {

I've replied to this several times; please use a consistent style, e.g.
"ip->ip_ver_len & 0xF0".

> +        chain->stat.ip_option++;
> +        return RSC_BYPASS;
> +    }
> +
> +    /* Don't handle packets with ip option */
> +    if (IP4_HEADER_LEN != (0xF & ip->ip_ver_len)) {
> +        chain->stat.ip_option++;
> +        return RSC_BYPASS;
> +    }
> +
> +    /* Don't handle packets with ip fragment */
> +    if (!(htons(ip->ip_off) & IP_DF)) {
> +        chain->stat.ip_frag++;
> +        return RSC_BYPASS;
> +    }
> +
> +    if (ip->ip_p != IPPROTO_TCP) {
> +        chain->stat.bypass_not_tcp++;
> +        return RSC_BYPASS;
> +    }
> +
> +    /* Sanity check */
> +    ip_len = htons(ip->ip_len);
> +    if (ip_len < (IP4_HDR_SZ + TCP_HDR_SZ)
> +        || ip_len > (size - chain->hdr_size - ETH_HDR_SZ)) {
> +        chain->stat.ip_hacked++;
> +        return RSC_BYPASS;
> +    }
> +
> +    return RSC_WANT;
> +}
> +
> +static size_t virtio_net_rsc_receive4(void *opq, NetClientState* nc,
> +                                      const uint8_t *buf, size_t size)
> +{
> +    int32_t ret;
> +    NetRscChain *chain;
> +    NetRscUnit unit;
> +
> +    chain = (NetRscChain *)opq;
> +    virtio_net_rsc_extract_unit4(chain, buf, &unit);
> +    if (RSC_WANT != virtio_net_rsc_sanity_check4(chain, unit.ip, buf, size)) {
> +        return virtio_net_do_receive(nc, buf, size);
> +    }
> +
> +    ret = virtio_net_rsc_tcp_ctrl_check(chain, unit.tcp);
> +    if (ret == RSC_BYPASS) {
> +        return virtio_net_do_receive(nc, buf, size);
> +    } else if (ret == RSC_FINAL) {
> +        return virtio_net_rsc_drain_flow(chain, nc, buf, size,
> +                        ((chain->hdr_size + ETH_HDR_SZ) + 12), IP4_ADDR_SIZE,
> +                        (chain->hdr_size + ETH_IP4_HDR_SZ), TCP_PORT_SIZE);
> +    }
> +
> +    if (virtio_net_rsc_empty_cache(chain, nc, buf, size)) {
> +        return size;
> +    }
> +
> +    return virtio_net_rsc_do_coalesce(chain, nc, buf, size, &unit);
> +}
> +
> +static NetRscChain *virtio_net_rsc_lookup_chain(VirtIONet * n,
> +                                            NetClientState *nc, uint16_t proto)
> +{
> +    NetRscChain *chain;
> +
> +    /* Only handle IPv4/6 */
> +    if (proto != (uint16_t)ETH_P_IP) {

The code conflicts with the comment above.

> +        return NULL;
> +    }
> +
> +    QTAILQ_FOREACH(chain, &n->rsc_chains, next) {
> +        if (chain->proto == proto) {
> +            return chain;
> +        }
> +    }
> +
> +    chain = g_malloc(sizeof(*chain));
> +    chain->hdr_size = n->guest_hdr_len;

Why introduce a dedicated field for this instead of just using n->guest_hdr_len?

> +    chain->proto = proto;
> +    chain->max_payload = MAX_IP4_PAYLOAD;
> +    chain->drain_timer = timer_new_ns(QEMU_CLOCK_VIRTUAL,
> +                                      virtio_net_rsc_purge, chain);
> +    memset(&chain->stat, 0, sizeof(chain->stat));
> +
> +    QTAILQ_INIT(&chain->buffers);
> +    QTAILQ_INSERT_TAIL(&n->rsc_chains, chain, next);
> +
> +    return chain;
> +}
> +
> +static ssize_t virtio_net_rsc_receive(NetClientState *nc,
> +                                      const uint8_t *buf, size_t size)
> +{
> +    uint16_t proto;
> +    NetRscChain *chain;
> +    struct eth_header *eth;
> +    VirtIONet *n;
> +
> +    n = qemu_get_nic_opaque(nc);
> +    if (size < (n->guest_hdr_len + ETH_HDR_SZ)) {
> +        return virtio_net_do_receive(nc, buf, size);
> +    }
> +
> +    eth = (struct eth_header *)(buf + n->guest_hdr_len);
> +    proto = htons(eth->h_proto);
> +
> +    chain = virtio_net_rsc_lookup_chain(n, nc, proto);
> +    if (!chain) {
> +        return virtio_net_do_receive(nc, buf, size);
> +    } else {
> +        chain->stat.received++;
> +        return virtio_net_rsc_receive4(chain, nc, buf, size);
> +    }
> +}
> +
> +static ssize_t virtio_net_receive(NetClientState *nc,
> +                                  const uint8_t *buf, size_t size)
> +{
> +    if (virtio_net_rsc_bypass) {
> +        return virtio_net_do_receive(nc, buf, size);

You need a feature bit for this, with compat handling for older machine
types. This also needs some work on the virtio spec, I think.

> +    } else {
> +        return virtio_net_rsc_receive(nc, buf, size);
> +    }
> +}
> +
>  static NetClientInfo net_virtio_info = {
>      .type = NET_CLIENT_OPTIONS_KIND_NIC,
>      .size = sizeof(NICState),
> @@ -1814,6 +2296,7 @@ static void virtio_net_device_realize(DeviceState *dev, Error **errp)
>      nc = qemu_get_queue(n->nic);
>      nc->rxfilter_notify_enabled = 1;
>  
> +    QTAILQ_INIT(&n->rsc_chains);
>      n->qdev = dev;
>      register_savevm(dev, "virtio-net", -1, VIRTIO_NET_VM_VERSION,
>                      virtio_net_save, virtio_net_load, n);
> @@ -1848,6 +2331,7 @@ static void virtio_net_device_unrealize(DeviceState *dev, Error **errp)
>      g_free(n->vqs);
>      qemu_del_nic(n->nic);
>      virtio_cleanup(vdev);
> +    virtio_net_rsc_cleanup(n);
>  }
>  
>  static void virtio_net_instance_init(Object *obj)
> diff --git a/include/hw/virtio/virtio-net.h b/include/hw/virtio/virtio-net.h
> index 0cabdb6..6939e92 100644
> --- a/include/hw/virtio/virtio-net.h
> +++ b/include/hw/virtio/virtio-net.h
> @@ -59,6 +59,7 @@ typedef struct VirtIONet {
>      VirtIONetQueue *vqs;
>      VirtQueue *ctrl_vq;
>      NICState *nic;
> +    QTAILQ_HEAD(, NetRscChain) rsc_chains;
>      uint32_t tx_timeout;
>      int32_t tx_burst;
>      uint32_t has_vnet_hdr;
> diff --git a/include/hw/virtio/virtio.h b/include/hw/virtio/virtio.h
> index 2b5b248..3b1dfa8 100644
> --- a/include/hw/virtio/virtio.h
> +++ b/include/hw/virtio/virtio.h
> @@ -128,6 +128,78 @@ typedef struct VirtioDeviceClass {
>      int (*load)(VirtIODevice *vdev, QEMUFile *f, int version_id);
>  } VirtioDeviceClass;
>  
> +/* Coalesced packets type & status */
> +typedef enum {
> +    RSC_COALESCE,           /* Data been coalesced */
> +    RSC_FINAL,              /* Will terminate current connection */
> +    RSC_NO_MATCH,           /* No matched in the buffer pool */
> +    RSC_BYPASS,             /* Packet to be bypass, not tcp, tcp ctrl, etc */
> +    RSC_WANT                /* Data want to be coalesced */
> +} COALESCE_STATUS;
> +
> +typedef struct NetRscStat {
> +    uint32_t received;
> +    uint32_t coalesced;
> +    uint32_t over_size;
> +    uint32_t cache;
> +    uint32_t empty_cache;
> +    uint32_t no_match_cache;
> +    uint32_t win_update;
> +    uint32_t no_match;
> +    uint32_t tcp_syn;
> +    uint32_t tcp_ctrl_drain;
> +    uint32_t dup_ack;
> +    uint32_t dup_ack1;
> +    uint32_t dup_ack2;
> +    uint32_t pure_ack;
> +    uint32_t ack_out_of_win;
> +    uint32_t data_out_of_win;
> +    uint32_t data_out_of_order;
> +    uint32_t data_after_pure_ack;
> +    uint32_t bypass_not_tcp;
> +    uint32_t tcp_option;
> +    uint32_t tcp_larger_option;
> +    uint32_t ip_frag;
> +    uint32_t ip_hacked;
> +    uint32_t ip_option;
> +    uint32_t purge_failed;
> +    uint32_t drain_failed;
> +    uint32_t final_failed;
> +    int64_t  timer;
> +} NetRscStat;
> +
> +/* Rsc unit general info used to checking if can coalescing */
> +typedef struct NetRscUnit {
> +   struct ip_header *ip;   /* ip header */
> +   uint16_t *ip_plen;      /* data len pointer in ip header field */
> +   struct tcp_header *tcp; /* tcp header */
> +   uint16_t tcp_hdrlen;    /* tcp header len */
> +   uint16_t payload;       /* pure payload without virtio/eth/ip/tcp */
> +} NetRscUnit;
> +
> +/* Coalesced segmant */
> +typedef struct NetRscSeg {
> +    QTAILQ_ENTRY(NetRscSeg) next;
> +    void *buf;
> +    size_t size;
> +    uint32_t dup_ack_count;
> +    bool is_coalesced;      /* need recal ipv4 header checksum, mark here */
> +    NetRscUnit unit;
> +    NetClientState *nc;
> +} NetRscSeg;
> +
> +
> +/* Chain is divided by protocol(ipv4/v6) and NetClientInfo */
> +typedef struct NetRscChain {
> +   QTAILQ_ENTRY(NetRscChain) next;
> +   uint16_t hdr_size;
> +   uint16_t proto;
> +   uint16_t max_payload;
> +   QEMUTimer *drain_timer;
> +   QTAILQ_HEAD(, NetRscSeg) buffers;
> +   NetRscStat stat;
> +} NetRscChain;
> +
>  void virtio_instance_init_common(Object *proxy_obj, void *data,
>                                   size_t vdev_size, const char *vdev_name);
>  

^ permalink raw reply	[flat|nested] 32+ messages in thread

* Re: [Qemu-devel] [ Patch 2/2] virtio-net rsc: support coalescing ipv6 tcp traffic
  2016-03-15  9:17 ` [Qemu-devel] [ Patch 2/2] virtio-net rsc: support coalescing ipv6 " wexu
@ 2016-03-17  8:50   ` Jason Wang
  2016-03-17 16:50     ` Wei Xu
  0 siblings, 1 reply; 32+ messages in thread
From: Jason Wang @ 2016-03-17  8:50 UTC (permalink / raw)
  To: wexu, qemu-devel; +Cc: marcel, victork, dfleytma, yvugenfi, mst



On 03/15/2016 05:17 PM, wexu@redhat.com wrote:
> From: Wei Xu <wexu@redhat.com>
>
> Most things like ipv4 except there is a significant difference between ipv4
> and ipv6, the fragment lenght in ipv4 header includes itself, while it's not
> included for ipv6, thus means ipv6 can carry a real '65535' unit.
>
> Signed-off-by: Wei Xu <wexu@redhat.com>
> ---
>  hw/net/virtio-net.c        | 146 ++++++++++++++++++++++++++++++++++++++++-----
>  include/hw/virtio/virtio.h |   5 +-
>  2 files changed, 135 insertions(+), 16 deletions(-)
>
> diff --git a/hw/net/virtio-net.c b/hw/net/virtio-net.c
> index c23b45f..ef61b74 100644
> --- a/hw/net/virtio-net.c
> +++ b/hw/net/virtio-net.c
> @@ -52,9 +52,14 @@
>  #define MAX_IP4_PAYLOAD (65535 - IP4_HDR_SZ)
>  #define MAX_TCP_PAYLOAD 65535
>  
> -/* max payload with virtio header */
> +#define IP6_HDR_SZ (sizeof(struct ip6_header))
> +#define ETH_IP6_HDR_SZ (ETH_HDR_SZ + IP6_HDR_SZ)
> +#define IP6_ADDR_SIZE   32      /* ipv6 saddr + daddr */
> +#define MAX_IP6_PAYLOAD MAX_TCP_PAYLOAD
> +
> +/* ip6 max payload, payload in ipv6 don't include the  header */
>  #define MAX_VIRTIO_PAYLOAD  (sizeof(struct virtio_net_hdr_mrg_rxbuf) \
> -                                + ETH_HDR_SZ + MAX_TCP_PAYLOAD)
> +                                + ETH_IP6_HDR_SZ + MAX_IP6_PAYLOAD)
>  
>  #define IP4_HEADER_LEN 5 /* header lenght value in ip header without option */
>  
> @@ -1722,14 +1727,27 @@ static void virtio_net_rsc_extract_unit4(NetRscChain *chain,
>  {
>      uint16_t ip_hdrlen;
>  
> -    unit->ip = (struct ip_header *)(buf + chain->hdr_size + ETH_HDR_SZ);
> -    ip_hdrlen = ((0xF & unit->ip->ip_ver_len) << 2);
> -    unit->ip_plen = &unit->ip->ip_len;
> -    unit->tcp = (struct tcp_header *)(((uint8_t *)unit->ip) + ip_hdrlen);
> +    unit->u_ip.ip = (struct ip_header *)(buf + chain->hdr_size + ETH_HDR_SZ);
> +    ip_hdrlen = ((0xF & unit->u_ip.ip->ip_ver_len) << 2);
> +    unit->ip_plen = &unit->u_ip.ip->ip_len;
> +    unit->tcp = (struct tcp_header *)(((uint8_t *)unit->u_ip.ip) + ip_hdrlen);
>      unit->tcp_hdrlen = (htons(unit->tcp->th_offset_flags) & 0xF000) >> 10;
>      unit->payload = htons(*unit->ip_plen) - ip_hdrlen - unit->tcp_hdrlen;
>  }
>  
> +static void virtio_net_rsc_extract_unit6(NetRscChain *chain,
> +                                         const uint8_t *buf, NetRscUnit* unit)
> +{
> +    unit->u_ip.ip6 = (struct ip6_header *)(buf + chain->hdr_size + ETH_HDR_SZ);

The u_ip seems a little bit redundant. How about using a simple void * and
casting it to ipv4/ipv6 in the proto-specific callbacks?

Introducing u_ip leads to unnecessary changes in the ipv4 code for the
ipv6 coalescing implementation.
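
Roughly what I have in mind, as a standalone sketch. The struct and
function names are hypothetical, and to keep it self-contained the
accessors read the first header byte directly; in the patch they would
cast to struct ip_header * or struct ip6_header * instead:

```c
#include <stdint.h>

/* Hypothetical reduced unit: one untyped pointer to the L3 header. */
struct rsc_unit_sketch {
    void *ip;   /* start of the ipv4 or ipv6 header */
};

/* ipv4-specific callback: header length in bytes from the IHL nibble. */
static unsigned rsc_ip4_hdr_bytes(const struct rsc_unit_sketch *unit)
{
    const uint8_t *ip4 = unit->ip;

    return (ip4[0] & 0x0F) << 2;
}

/* ipv6-specific callback: version from the top nibble of the first byte. */
static unsigned rsc_ip6_version(const struct rsc_unit_sketch *unit)
{
    const uint8_t *ip6 = unit->ip;

    return (ip6[0] & 0xF0) >> 4;
}
```

This way the ipv4 functions from patch 1 keep using unit->ip unchanged and
only the new ipv6 code adds its own cast.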

> +    unit->ip_plen = &(unit->u_ip.ip6->ip6_ctlun.ip6_un1.ip6_un1_plen);
> +    unit->tcp = (struct tcp_header *)(((uint8_t *)unit->u_ip.ip6)\
> +                                    + IP6_HDR_SZ);
> +    unit->tcp_hdrlen = (htons(unit->tcp->th_offset_flags) & 0xF000) >> 10;
> +    /* There is a difference between payload lenght in ipv4 and v6,
> +       ip header is excluded in ipv6 */
> +    unit->payload = htons(*unit->ip_plen) - unit->tcp_hdrlen;
> +}
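
For reference, the length arithmetic that differs between the two
extract helpers can be sketched as below. The function names are
hypothetical and all lengths are assumed to be already converted to host
byte order; the ipv6 variant assumes there are no extension headers,
which matches the patch bypassing anything whose next header is not TCP:

```c
#include <stdint.h>

/* ipv4: the total-length field covers the ip header, tcp header and data */
static uint16_t sketch_tcp_payload4(uint16_t ip_total_len,
                                    uint16_t ip_hdrlen, uint16_t tcp_hdrlen)
{
    return (uint16_t)(ip_total_len - ip_hdrlen - tcp_hdrlen);
}

/* ipv6: the payload-length field already excludes the fixed ip6 header */
static uint16_t sketch_tcp_payload6(uint16_t ip6_payload_len,
                                    uint16_t tcp_hdrlen)
{
    return (uint16_t)(ip6_payload_len - tcp_hdrlen);
}
```

A full-sized 1500-byte ipv4 packet and an ipv6 packet with a payload
length of 1480 both carry 1460 bytes of TCP data when the TCP header has
no options.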
> +
>  static void virtio_net_rsc_ipv4_checksum(struct ip_header *ip)
>  {
>      uint32_t sum;
> @@ -1743,7 +1761,10 @@ static size_t virtio_net_rsc_drain_seg(NetRscChain *chain, NetRscSeg *seg)
>  {
>      int ret;
>  
> -    virtio_net_rsc_ipv4_checksum(seg->unit.ip);
> +    if ((chain->proto == ETH_P_IP) && seg->is_coalesced) {
> +        virtio_net_rsc_ipv4_checksum(seg->unit.u_ip.ip);
> +    }
> +
>      ret = virtio_net_do_receive(seg->nc, seg->buf, seg->size);
>      QTAILQ_REMOVE(&chain->buffers, seg, next);
>      g_free(seg->buf);
> @@ -1807,7 +1828,11 @@ static void virtio_net_rsc_cache_buf(NetRscChain *chain, NetClientState *nc,
>      QTAILQ_INSERT_TAIL(&chain->buffers, seg, next);
>      chain->stat.cache++;
>  
> -    virtio_net_rsc_extract_unit4(chain, seg->buf, &seg->unit);
> +    if (chain->proto == ETH_P_IP) {
> +        virtio_net_rsc_extract_unit4(chain, seg->buf, &seg->unit);
> +    } else {

A switch with a g_assert_not_reached() default would be better than this.
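
Roughly like the sketch below. It is self-contained, so abort() stands
in for g_assert_not_reached(), and the enum and function names are
hypothetical:

```c
#include <stdint.h>
#include <stdlib.h>

#define SKETCH_ETH_P_IP   0x0800
#define SKETCH_ETH_P_IPV6 0x86DD

enum sketch_extractor { SKETCH_EXTRACT4, SKETCH_EXTRACT6 };

/* Pick the extract helper for a chain; any other protocol is a bug. */
static enum sketch_extractor sketch_pick_extractor(uint16_t proto)
{
    switch (proto) {
    case SKETCH_ETH_P_IP:
        return SKETCH_EXTRACT4;
    case SKETCH_ETH_P_IPV6:
        return SKETCH_EXTRACT6;
    default:
        abort(); /* stand-in for g_assert_not_reached() */
    }
}
```

The point is that an unexpected chain->proto fails loudly instead of
silently being treated as ipv6.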

> +        virtio_net_rsc_extract_unit6(chain, seg->buf, &seg->unit);
> +    }
>  }
>  
>  static int32_t virtio_net_rsc_handle_ack(NetRscChain *chain, NetRscSeg *seg,
> @@ -1930,8 +1955,8 @@ coalesce:
>  static int32_t virtio_net_rsc_coalesce4(NetRscChain *chain, NetRscSeg *seg,
>                          const uint8_t *buf, size_t size, NetRscUnit *unit)
>  {
> -    if ((unit->ip->ip_src ^ seg->unit.ip->ip_src)
> -        || (unit->ip->ip_dst ^ seg->unit.ip->ip_dst)
> +    if ((unit->u_ip.ip->ip_src ^ seg->unit.u_ip.ip->ip_src)
> +        || (unit->u_ip.ip->ip_dst ^ seg->unit.u_ip.ip->ip_dst)
>          || (unit->tcp->th_sport ^ seg->unit.tcp->th_sport)
>          || (unit->tcp->th_dport ^ seg->unit.tcp->th_dport)) {
>          chain->stat.no_match++;
> @@ -1941,6 +1966,22 @@ static int32_t virtio_net_rsc_coalesce4(NetRscChain *chain, NetRscSeg *seg,
>      return virtio_net_rsc_coalesce_data(chain, seg, buf, unit);
>  }
>  
> +static int32_t virtio_net_rsc_coalesce6(NetRscChain *chain, NetRscSeg *seg,
> +                        const uint8_t *buf, size_t size, NetRscUnit *unit)
> +{
> +    if (memcmp(&unit->u_ip.ip6->ip6_src, &seg->unit.u_ip.ip6->ip6_src,
> +               sizeof(struct in6_address))
> +        || memcmp(&unit->u_ip.ip6->ip6_dst, &seg->unit.u_ip.ip6->ip6_dst,
> +                  sizeof(struct in6_address))
> +        || (unit->tcp->th_sport ^ seg->unit.tcp->th_sport)
> +        || (unit->tcp->th_dport ^ seg->unit.tcp->th_dport)) {
> +            chain->stat.no_match++;
> +            return RSC_NO_MATCH;
> +    }
> +
> +    return virtio_net_rsc_coalesce_data(chain, seg, buf, unit);
> +}
> +
>  /* Pakcets with 'SYN' should bypass, other flag should be sent after drain
>   * to prevent out of order */
>  static int virtio_net_rsc_tcp_ctrl_check(NetRscChain *chain,
> @@ -1983,7 +2024,11 @@ static size_t virtio_net_rsc_do_coalesce(NetRscChain *chain, NetClientState *nc,
>      NetRscSeg *seg, *nseg;
>  
>      QTAILQ_FOREACH_SAFE(seg, &chain->buffers, next, nseg) {
> -        ret = virtio_net_rsc_coalesce4(chain, seg, buf, size, unit);
> +        if (chain->proto == ETH_P_IP) {
> +            ret = virtio_net_rsc_coalesce4(chain, seg, buf, size, unit);
> +        } else {
> +            ret = virtio_net_rsc_coalesce6(chain, seg, buf, size, unit);
> +        }
>  
>          if (ret == RSC_FINAL) {
>              if (virtio_net_rsc_drain_seg(chain, seg) == 0) {
> @@ -2082,7 +2127,8 @@ static size_t virtio_net_rsc_receive4(void *opq, NetClientState* nc,
>  
>      chain = (NetRscChain *)opq;
>      virtio_net_rsc_extract_unit4(chain, buf, &unit);
> -    if (RSC_WANT != virtio_net_rsc_sanity_check4(chain, unit.ip, buf, size)) {
> +    if (RSC_WANT != virtio_net_rsc_sanity_check4(chain,
> +                                                 unit.u_ip.ip, buf, size)) {
>          return virtio_net_do_receive(nc, buf, size);
>      }
>  
> @@ -2102,13 +2148,74 @@ static size_t virtio_net_rsc_receive4(void *opq, NetClientState* nc,
>      return virtio_net_rsc_do_coalesce(chain, nc, buf, size, &unit);
>  }
>  
> +static int32_t virtio_net_rsc_sanity_check6(NetRscChain *chain, 
> +                        struct ip6_header *ip, const uint8_t *buf, size_t size)

Indentation is wrong here.

> +{
> +    uint16_t ip_len;
> +
> +    if (size < (chain->hdr_size + ETH_IP6_HDR_SZ + TCP_HDR_SZ)) {
> +        return RSC_BYPASS;
> +    }
> +
> +    if (((0xF0 & ip->ip6_ctlun.ip6_un1.ip6_un1_flow) >> 4)
> +        != IP_HEADER_VERSION_6) {
> +        return RSC_BYPASS;
> +    }
> +
> +    /* Both option and protocol is checked in this */
> +    if (ip->ip6_ctlun.ip6_un1.ip6_un1_nxt != IPPROTO_TCP) {
> +        chain->stat.bypass_not_tcp++;
> +        return RSC_BYPASS;
> +    }
> +
> +    /* Sanity check */

The comment is useless.

> +    ip_len = htons(ip->ip6_ctlun.ip6_un1.ip6_un1_plen);
> +    if (ip_len < TCP_HDR_SZ
> +        || ip_len > (size - chain->hdr_size - ETH_IP6_HDR_SZ)) {
> +        chain->stat.ip_hacked++;
> +        return RSC_BYPASS;
> +    }
> +
> +    return RSC_WANT;
> +}
> +
> +static size_t virtio_net_rsc_receive6(void *opq, NetClientState* nc,
> +                                      const uint8_t *buf, size_t size)
> +{

This is very similar to the ipv4 version; the code needs to be unified.
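
One possible direction, as a sketch only: keep a single receive path and
look up the per-protocol offsets that are currently passed to
virtio_net_rsc_drain_flow(). The names below are hypothetical. The ipv6
values match the call in this patch; the ipv4 values assume a 20-byte
header without options (packets with options are bypassed) and the
standard source-address offset of 12:

```c
#include <stddef.h>
#include <stdint.h>

#define SKETCH_ETH_HDR_SZ  14
#define SKETCH_IP4_HDR_SZ  20
#define SKETCH_IP6_HDR_SZ  40
#define SKETCH_ETH_P_IP    0x0800
#define SKETCH_ETH_P_IPV6  0x86DD

typedef struct SketchFlowLayout {
    size_t addr_off;  /* offset of saddr from the start of the buffer */
    size_t addr_size; /* saddr + daddr */
    size_t port_off;  /* offset of the tcp source port */
} SketchFlowLayout;

/* hdr_size is the virtio header length; returns 0 on success, -1 if
 * the protocol is not handled. */
static int sketch_flow_layout(uint16_t proto, size_t hdr_size,
                              SketchFlowLayout *out)
{
    size_t l3 = hdr_size + SKETCH_ETH_HDR_SZ;

    switch (proto) {
    case SKETCH_ETH_P_IP:
        out->addr_off = l3 + 12;
        out->addr_size = 8;
        out->port_off = l3 + SKETCH_IP4_HDR_SZ;
        return 0;
    case SKETCH_ETH_P_IPV6:
        out->addr_off = l3 + 8;
        out->addr_size = 32;
        out->port_off = l3 + SKETCH_IP6_HDR_SZ;
        return 0;
    default:
        return -1;
    }
}
```

With such a table the receive4/receive6 bodies would differ only in the
extract and sanity-check helpers they call.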

> +    int32_t ret;
> +    NetRscChain *chain;
> +    NetRscUnit unit;
> +
> +    chain = (NetRscChain *)opq;
> +    virtio_net_rsc_extract_unit6(chain, buf, &unit);
> +    if (RSC_WANT != virtio_net_rsc_sanity_check6(chain,
> +                                                 unit.u_ip.ip6, buf, size)) {
> +        return virtio_net_do_receive(nc, buf, size);
> +    }
> +
> +    ret = virtio_net_rsc_tcp_ctrl_check(chain, unit.tcp);
> +    if (ret == RSC_BYPASS) {
> +        return virtio_net_do_receive(nc, buf, size);
> +    } else if (ret == RSC_FINAL) {
> +        return virtio_net_rsc_drain_flow(chain, nc, buf, size,
> +                        ((chain->hdr_size + ETH_HDR_SZ) + 8), IP6_ADDR_SIZE,
> +                        (chain->hdr_size + ETH_IP6_HDR_SZ), TCP_PORT_SIZE);
> +    }
> +
> +    if (virtio_net_rsc_empty_cache(chain, nc, buf, size)) {
> +        return size;
> +    }
> +
> +    return virtio_net_rsc_do_coalesce(chain, nc, buf, size, &unit);
> +}
> +
>  static NetRscChain *virtio_net_rsc_lookup_chain(VirtIONet * n,
>                                              NetClientState *nc, uint16_t proto)
>  {
>      NetRscChain *chain;
>  
>      /* Only handle IPv4/6 */
> -    if (proto != (uint16_t)ETH_P_IP) {
> +    if ((proto != (uint16_t)ETH_P_IP) && (proto != (uint16_t)ETH_P_IPV6)) {
>          return NULL;
>      }
>  
> @@ -2121,7 +2228,11 @@ static NetRscChain *virtio_net_rsc_lookup_chain(VirtIONet * n,
>      chain = g_malloc(sizeof(*chain));
>      chain->hdr_size = n->guest_hdr_len;
>      chain->proto = proto;
> -    chain->max_payload = MAX_IP4_PAYLOAD;
> +    if (proto == (uint16_t)ETH_P_IP) {
> +        chain->max_payload = MAX_IP4_PAYLOAD;
> +    } else {
> +        chain->max_payload = MAX_IP6_PAYLOAD;
> +    }
>      chain->drain_timer = timer_new_ns(QEMU_CLOCK_VIRTUAL,
>                                        virtio_net_rsc_purge, chain);
>      memset(&chain->stat, 0, sizeof(chain->stat));
> @@ -2153,7 +2264,12 @@ static ssize_t virtio_net_rsc_receive(NetClientState *nc,
>          return virtio_net_do_receive(nc, buf, size);
>      } else {
>          chain->stat.received++;
> -        return virtio_net_rsc_receive4(chain, nc, buf, size);
> +
> +        if (proto == (uint16_t)ETH_P_IP) {
> +            return virtio_net_rsc_receive4(chain, nc, buf, size);
> +        } else  {
> +            return virtio_net_rsc_receive6(chain, nc, buf, size);
> +        }
>      }
>  }
>  
> diff --git a/include/hw/virtio/virtio.h b/include/hw/virtio/virtio.h
> index 3b1dfa8..13d20a4 100644
> --- a/include/hw/virtio/virtio.h
> +++ b/include/hw/virtio/virtio.h
> @@ -170,7 +170,10 @@ typedef struct NetRscStat {
>  
>  /* Rsc unit general info used to checking if can coalescing */
>  typedef struct NetRscUnit {
> -   struct ip_header *ip;   /* ip header */
> +    union {
> +        struct ip_header *ip;   /* ip header */
> +        struct ip6_header *ip6; /* ip6 header */
> +    } u_ip;
>     uint16_t *ip_plen;      /* data len pointer in ip header field */
>     struct tcp_header *tcp; /* tcp header */
>     uint16_t tcp_hdrlen;    /* tcp header len */


* Re: [Qemu-devel] [ Patch 0/2] Support Receive-Segment-Offload(RSC) for WHQL test of Window guest
  2016-03-17  6:47 ` Jason Wang
@ 2016-03-17 15:21   ` Wei Xu
  2016-03-17 15:44     ` Michael S. Tsirkin
  0 siblings, 1 reply; 32+ messages in thread
From: Wei Xu @ 2016-03-17 15:21 UTC (permalink / raw)
  To: Jason Wang, qemu-devel; +Cc: marcel, victork, dfleytma, yvugenfi, mst



On 2016-03-17 14:47, Jason Wang wrote:
> On 03/15/2016 05:17 PM,wexu@redhat.com  wrote:
>> From: Wei Xu<wexu@redhat.com>
>>
>> Fixed issues based on rfc patch v2:
>> 1. Removed big param list, replace it with 'NetRscUnit'
>> 2. Different virtio header size
>> 3. Modify callback function to direct call.
>> 4. Needn't check the failure of g_malloc()
>> 5. Other code format adjustment, macro naming, etc
>>
>> This patch is to support WHQL test for Windows guest, while this feature also
>> benifits other guest works as a kernel 'gro' like feature with userspace implementation.
>> Feature information:
>>    http://msdn.microsoft.com/en-us/library/windows/hardware/jj853324
>>
>> Both IPv4 and IPv6 are supported, though performance with userspace virtio
>> is slow than vhost-net, there is about 1x to 3x performance improvement to
>> userspace virtio, this is done by turning this feature on and disable
>> 'tso/gso/gro' on corresponding tap interface and guest interface, while get
>> less improment with all these feature on.
>>
>> Test steps:
>> Although this feature is mainly used for window guest, i used linux guest to help test
>> the feature, to make things simple, i used 3 steps to test the patch as i moved on.
>> 1. With a tcp socket client/server pair running on 2 linux guest, thus i can control
>> the traffic and debugging the code as i want.
>> 2. Netperf on linux guest test the throughput.
>> 3. WHQL test with 2 Windows guests.
>>
>> Current status:
>> IPv4 pass all the above tests.
>> IPv6 just passed test step 1 and 2 as described ahead, the virtio nic cannot
>> receive any packet in WHQL test, looks like the test traffic is not sent from
>> on the support machine, test device can access both host and another linux
>> guest, tried a lot of ways to work it out but failed, maybe debug from windows
>> guest driver side can help figuring it out.
> I think you need figure out where was the packet dropped first. If the
> packet was not dropped by windows guest, you may want to try dropmonitor.
Yes, my previous description was wrong. I added some debug code and ran
a new test: the packets are received by virtio_net_receive(), are put on
the vring without error, and are already sent to the Windows guest, but
wireshark on the Windows guest doesn't see them. The test case tampers
with the filter (it installs another lightweight filter), so I'm not
sure how these packets travel inside the guest; maybe they are received
but then dropped by the driver or the stack.

I tried 'dropmonitor'. It's very interesting, but it is of limited help
for a Windows guest, since I can only use it with qemu on the host.
>> Note:
>> A 'MessageDevice' nic chose as 'Realtek' will panic the system sometimes during setup,
>> this can be figured out by replacing it with an 'e1000' nic.
>>
>> Todo:
>> More sanity check and tcp 'ecn' and 'window' scale test.
>>
>> Wei Xu (2):
>>    virtio-net rsc: support coalescing ipv4 tcp traffic
>>    virtio-net rsc: support coalescing ipv6 tcp traffic
>>
>>   hw/net/virtio-net.c            | 602 ++++++++++++++++++++++++++++++++++++++++-
>>   include/hw/virtio/virtio-net.h |   1 +
>>   include/hw/virtio/virtio.h     |  75 +++++
>>   3 files changed, 677 insertions(+), 1 deletion(-)
>>


* Re: [Qemu-devel] [ Patch 0/2] Support Receive-Segment-Offload(RSC) for WHQL test of Window guest
  2016-03-17 15:21   ` Wei Xu
@ 2016-03-17 15:44     ` Michael S. Tsirkin
  2016-03-17 16:57       ` Wei Xu
  0 siblings, 1 reply; 32+ messages in thread
From: Michael S. Tsirkin @ 2016-03-17 15:44 UTC (permalink / raw)
  To: Wei Xu; +Cc: victork, Jason Wang, yvugenfi, qemu-devel, marcel, dfleytma

On Thu, Mar 17, 2016 at 11:21:28PM +0800, Wei Xu wrote:
> 
> 
> On 2016-03-17 14:47, Jason Wang wrote:
> >On 03/15/2016 05:17 PM,wexu@redhat.com  wrote:
> >>From: Wei Xu<wexu@redhat.com>
> >>
> >>Fixed issues based on rfc patch v2:
> >>1. Removed big param list, replace it with 'NetRscUnit'
> >>2. Different virtio header size
> >>3. Modify callback function to direct call.
> >>4. Needn't check the failure of g_malloc()
> >>5. Other code format adjustment, macro naming, etc
> >>
> >>This patch is to support WHQL test for Windows guest, while this feature also
> >>benifits other guest works as a kernel 'gro' like feature with userspace implementation.
> >>Feature information:
> >>   http://msdn.microsoft.com/en-us/library/windows/hardware/jj853324
> >>
> >>Both IPv4 and IPv6 are supported, though performance with userspace virtio
> >>is slow than vhost-net, there is about 1x to 3x performance improvement to
> >>userspace virtio, this is done by turning this feature on and disable
> >>'tso/gso/gro' on corresponding tap interface and guest interface, while get
> >>less improment with all these feature on.
> >>
> >>Test steps:
> >>Although this feature is mainly used for window guest, i used linux guest to help test
> >>the feature, to make things simple, i used 3 steps to test the patch as i moved on.
> >>1. With a tcp socket client/server pair running on 2 linux guest, thus i can control
> >>the traffic and debugging the code as i want.
> >>2. Netperf on linux guest test the throughput.
> >>3. WHQL test with 2 Windows guests.
> >>
> >>Current status:
> >>IPv4 pass all the above tests.
> >>IPv6 just passed test step 1 and 2 as described ahead, the virtio nic cannot
> >>receive any packet in WHQL test, looks like the test traffic is not sent from
> >>on the support machine, test device can access both host and another linux
> >>guest, tried a lot of ways to work it out but failed, maybe debug from windows
> >>guest driver side can help figuring it out.
> >I think you need figure out where was the packet dropped first. If the
> >packet was not dropped by windows guest, you may want to try dropmonitor.
> Yes, there is something wrong with my previous description, i add some debug
> code and did new test, the packets are received by virtio_net_receive() and
> are finished putting to the vring with no error and sent to win guest
> already, but wireshark on win guest doesn't get it, because the test case
> did some hacking on the filter, it installed another lightweight filter, i'm
> not sure how these packets go in the guest, maybe they are received but
> dropped by driver or stack, etc.

Add some debug output in the driver, rebuild it and see packets
as they are received and passed up the stack.

> I tried 'dropmonitor', it's very interesting but it helps very limitedly for
> windows guest, i can only use it with qemu on the host.
> >>Note:
> >>A 'MessageDevice' nic chose as 'Realtek' will panic the system sometimes during setup,
> >>this can be figured out by replacing it with an 'e1000' nic.
> >>
> >>Todo:
> >>More sanity check and tcp 'ecn' and 'window' scale test.
> >>
> >>Wei Xu (2):
> >>   virtio-net rsc: support coalescing ipv4 tcp traffic
> >>   virtio-net rsc: support coalescing ipv6 tcp traffic
> >>
> >>  hw/net/virtio-net.c            | 602 ++++++++++++++++++++++++++++++++++++++++-
> >>  include/hw/virtio/virtio-net.h |   1 +
> >>  include/hw/virtio/virtio.h     |  75 +++++
> >>  3 files changed, 677 insertions(+), 1 deletion(-)
> >>


* Re: [Qemu-devel] [ Patch 1/2] virtio-net rsc: support coalescing ipv4 tcp traffic
  2016-03-17  8:42   ` Jason Wang
@ 2016-03-17 16:45     ` Wei Xu
  2016-03-18  2:03       ` Jason Wang
  0 siblings, 1 reply; 32+ messages in thread
From: Wei Xu @ 2016-03-17 16:45 UTC (permalink / raw)
  To: qemu-devel

On 2016-03-17 16:42, Jason Wang wrote:

>
> On 03/15/2016 05:17 PM, wexu@redhat.com wrote:
>> From: Wei Xu <wexu@redhat.com>
>>
>> All the data packets in a tcp connection will be cached to a big buffer
>> in every receive interval, and will be sent out via a timer, the
>> 'virtio_net_rsc_timeout' controls the interval, the value will influent the
>> performance and response of tcp connection extremely, 50000(50us) is a
>> experience value to gain a performance improvement, since the whql test
>> sends packets every 100us, so '300000(300us)' can pass the test case,
>> this is also the default value, it's gonna to be tunable.
>>
>> The timer will only be triggered if the packets pool is not empty,
>> and it'll drain off all the cached packets
>>
>> 'NetRscChain' is used to save the segments of different protocols in a
>> VirtIONet device.
>>
>> The main handler of TCP includes TCP window update, duplicated ACK check
>> and the real data coalescing if the new segment passed sanity check
>> and is identified as an 'wanted' one.
>>
>> An 'wanted' segment means:
>> 1. Segment is within current window and the sequence is the expected one.
>> 2. ACK of the segment is in the valid window.
>> 3. If the ACK in the segment is a duplicated one, then it must less than 2,
>>     this is to notify upper layer TCP starting retransmission due to the spec.
>>
>> Sanity check includes:
>> 1. Incorrect version in IP header
>> 2. IP options & IP fragment
>> 3. Not a TCP packets
>> 4. Sanity size check to prevent buffer overflow attack.
>>
>> There maybe more cases should be considered such as ip identification other
>> flags, while it broke the test because windows set it to the same even it's
>> not a fragment.
>>
>> Normally it includes 2 typical ways to handle a TCP control flag, 'bypass'
>> and 'finalize', 'bypass' means should be sent out directly, and 'finalize'
>> means the packets should also be bypassed, and this should be done
>> after searching for the same connection packets in the pool and sending
>> all of them out, this is to avoid out of data.
>>
>> All the 'SYN' packets will be bypassed since this always begin a new'
>> connection, other flags such 'FIN/RST' will trigger a finalization, because
>> this normally happens upon a connection is going to be closed, an 'URG' packet
>> also finalize current coalescing unit while there maybe protocol difference to
>> different OS.
> But URG packet should be sent as quickly as possible regardless of
> ordering, no?

Yes, you're right, URG will terminate the current 'SCU'; I'll amend the commit log.
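
For clarity, the flag handling described in the commit log can be
sketched as below. The names are hypothetical and the input is assumed
to be the TCP offset/flags word already converted to host byte order;
the flag bit values are the standard TCP ones:

```c
#include <stdint.h>

#define SKETCH_TH_FIN 0x01
#define SKETCH_TH_SYN 0x02
#define SKETCH_TH_RST 0x04
#define SKETCH_TH_URG 0x20

enum sketch_rsc_verdict {
    SKETCH_RSC_BYPASS, /* deliver this packet directly */
    SKETCH_RSC_FINAL,  /* drain the cached flow first, then deliver */
    SKETCH_RSC_WANT    /* candidate for coalescing */
};

static enum sketch_rsc_verdict sketch_tcp_ctrl_check(uint16_t offset_flags)
{
    uint16_t flags = offset_flags & 0x3F; /* low six bits are the flags */

    if (flags & SKETCH_TH_SYN) {
        return SKETCH_RSC_BYPASS;
    }
    if (flags & (SKETCH_TH_FIN | SKETCH_TH_URG | SKETCH_TH_RST)) {
        return SKETCH_RSC_FINAL;
    }
    return SKETCH_RSC_WANT;
}
```

So an URG segment is never held back by the timer: it flushes whatever
is cached for the flow and is then delivered immediately.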

>
>> Statistics can be used to monitor the basic coalescing status, the 'out of order'
>> and 'out of window' means how many retransmitting packets, thus describe the
>> performance intuitively.
>>
>> Signed-off-by: Wei Xu <wexu@redhat.com>
>> ---
>>   hw/net/virtio-net.c            | 486 ++++++++++++++++++++++++++++++++++++++++-
>>   include/hw/virtio/virtio-net.h |   1 +
>>   include/hw/virtio/virtio.h     |  72 ++++++
>>   3 files changed, 558 insertions(+), 1 deletion(-)
>>
>> diff --git a/hw/net/virtio-net.c b/hw/net/virtio-net.c
>> index 5798f87..c23b45f 100644
>> --- a/hw/net/virtio-net.c
>> +++ b/hw/net/virtio-net.c
>> @@ -15,10 +15,12 @@
>>   #include "qemu/iov.h"
>>   #include "hw/virtio/virtio.h"
>>   #include "net/net.h"
>> +#include "net/eth.h"
>>   #include "net/checksum.h"
>>   #include "net/tap.h"
>>   #include "qemu/error-report.h"
>>   #include "qemu/timer.h"
>> +#include "qemu/sockets.h"
>>   #include "hw/virtio/virtio-net.h"
>>   #include "net/vhost_net.h"
>>   #include "hw/virtio/virtio-bus.h"
>> @@ -38,6 +40,35 @@
>>   #define endof(container, field) \
>>       (offsetof(container, field) + sizeof(((container *)0)->field))
>>   
>> +#define ETH_HDR_SZ (sizeof(struct eth_header))
>> +#define IP4_HDR_SZ (sizeof(struct ip_header))
>> +#define TCP_HDR_SZ (sizeof(struct tcp_header))
>> +#define ETH_IP4_HDR_SZ (ETH_HDR_SZ + IP4_HDR_SZ)
>> +
>> +#define IP4_ADDR_SIZE   8                   /* ipv4 saddr + daddr */
>> +#define TCP_PORT_SIZE   4                   /* sport + dport */
>> +
>> +/* IPv4 max payload, 16 bits in the header */
>> +#define MAX_IP4_PAYLOAD (65535 - IP4_HDR_SZ)
>> +#define MAX_TCP_PAYLOAD 65535
>> +
>> +/* max payload with virtio header */
>> +#define MAX_VIRTIO_PAYLOAD  (sizeof(struct virtio_net_hdr_mrg_rxbuf) \
>> +                                + ETH_HDR_SZ + MAX_TCP_PAYLOAD)
> Should we use guest_hdr_len instead of sizeof() here? Consider the
> vnet_hdr will be extended in the future.

Sure.
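
Something along these lines, as a sketch with hypothetical names: the
buffer size is computed from the header length negotiated with the guest
instead of from a compile-time sizeof().

```c
#include <stddef.h>

#define SKETCH_ETH_HDR_SZ      14
#define SKETCH_MAX_TCP_PAYLOAD 65535

/* guest_hdr_len would come from n->guest_hdr_len (chain->hdr_size). */
static size_t sketch_max_virtio_payload(size_t guest_hdr_len)
{
    return guest_hdr_len + SKETCH_ETH_HDR_SZ + SKETCH_MAX_TCP_PAYLOAD;
}
```

The allocation in virtio_net_rsc_cache_buf() could then use this value,
so a larger vnet header in the future would not overflow the buffer.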

>
>> +
>> +#define IP4_HEADER_LEN 5 /* header lenght value in ip header without option */
> typo, should be 'length'

ok.

>
>> +
>> +/* Purge coalesced packets timer interval */
>> +#define RSC_TIMER_INTERVAL  300000
>> +
>> +/* Switcher to enable/disable rsc */
>> +static bool virtio_net_rsc_bypass = 1;
>> +
>> +/* This value affects the performance a lot, and should be tuned carefully,
>> +   '300000'(300us) is the recommended value to pass the WHQL test, '50000' can
>> +   gain 2x netperf throughput with tso/gso/gro 'off'. */
>> +static uint32_t virtio_net_rsc_timeout = RSC_TIMER_INTERVAL;
>> +
>>   typedef struct VirtIOFeature {
>>       uint32_t flags;
>>       size_t end;
>> @@ -1089,7 +1120,8 @@ static int receive_filter(VirtIONet *n, const uint8_t *buf, int size)
>>       return 0;
>>   }
>>   
>> -static ssize_t virtio_net_receive(NetClientState *nc, const uint8_t *buf, size_t size)
>> +static ssize_t virtio_net_do_receive(NetClientState *nc,
>> +                                  const uint8_t *buf, size_t size)
> indentation should also changed here.

ok.

>
>>   {
>>       VirtIONet *n = qemu_get_nic_opaque(nc);
>>       VirtIONetQueue *q = virtio_net_get_subqueue(nc);
>> @@ -1685,6 +1717,456 @@ static int virtio_net_load_device(VirtIODevice *vdev, QEMUFile *f,
>>       return 0;
>>   }
>>   
>> +static void virtio_net_rsc_extract_unit4(NetRscChain *chain,
>> +                                         const uint8_t *buf, NetRscUnit* unit)
>> +{
>> +    uint16_t ip_hdrlen;
>> +
>> +    unit->ip = (struct ip_header *)(buf + chain->hdr_size + ETH_HDR_SZ);
>> +    ip_hdrlen = ((0xF & unit->ip->ip_ver_len) << 2);
>> +    unit->ip_plen = &unit->ip->ip_len;
>> +    unit->tcp = (struct tcp_header *)(((uint8_t *)unit->ip) + ip_hdrlen);
>> +    unit->tcp_hdrlen = (htons(unit->tcp->th_offset_flags) & 0xF000) >> 10;
>> +    unit->payload = htons(*unit->ip_plen) - ip_hdrlen - unit->tcp_hdrlen;
>> +}
>> +
>> +static void virtio_net_rsc_ipv4_checksum(struct ip_header *ip)
>> +{
>> +    uint32_t sum;
>> +
>> +    ip->ip_sum = 0;
>> +    sum = net_checksum_add_cont(IP4_HDR_SZ, (uint8_t *)ip, 0);
>> +    ip->ip_sum = cpu_to_be16(net_checksum_finish(sum));
>> +}
>> +
>> +static size_t virtio_net_rsc_drain_seg(NetRscChain *chain, NetRscSeg *seg)
>> +{
>> +    int ret;
>> +
>> +    virtio_net_rsc_ipv4_checksum(seg->unit.ip);
>> +    ret = virtio_net_do_receive(seg->nc, seg->buf, seg->size);
>> +    QTAILQ_REMOVE(&chain->buffers, seg, next);
>> +    g_free(seg->buf);
>> +    g_free(seg);
>> +
>> +    return ret;
>> +}
>> +
>> +static void virtio_net_rsc_purge(void *opq)
>> +{
>> +    NetRscChain *chain = (NetRscChain *)opq;
>> +    NetRscSeg *seg, *rn;
>> +
>> +    QTAILQ_FOREACH_SAFE(seg, &chain->buffers, next, rn) {
>> +        if (virtio_net_rsc_drain_seg(chain, seg) == 0) {
>> +            chain->stat.purge_failed++;
>> +            continue;
>> +        }
>> +    }
>> +
>> +    chain->stat.timer++;
>> +    if (!QTAILQ_EMPTY(&chain->buffers)) {
>> +        timer_mod(chain->drain_timer,
>> +              qemu_clock_get_ns(QEMU_CLOCK_VIRTUAL) + virtio_net_rsc_timeout);
>> +    }
>> +}
>> +
>> +static void virtio_net_rsc_cleanup(VirtIONet *n)
>> +{
>> +    NetRscChain *chain, *rn_chain;
>> +    NetRscSeg *seg, *rn_seg;
>> +
>> +    QTAILQ_FOREACH_SAFE(chain, &n->rsc_chains, next, rn_chain) {
>> +        QTAILQ_FOREACH_SAFE(seg, &chain->buffers, next, rn_seg) {
>> +            QTAILQ_REMOVE(&chain->buffers, seg, next);
>> +            g_free(seg->buf);
>> +            g_free(seg);
>> +        }
>> +
>> +        timer_del(chain->drain_timer);
>> +        timer_free(chain->drain_timer);
>> +        QTAILQ_REMOVE(&n->rsc_chains, chain, next);
>> +        g_free(chain);
>> +    }
>> +}
>> +
>> +static void virtio_net_rsc_cache_buf(NetRscChain *chain, NetClientState *nc,
>> +                                     const uint8_t *buf, size_t size)
>> +{
>> +    NetRscSeg *seg;
>> +
>> +    seg = g_malloc(sizeof(NetRscSeg));
>> +    seg->buf = g_malloc(MAX_VIRTIO_PAYLOAD);
>> +
>> +    memmove(seg->buf, buf, size);
> Can seg->buf overlap with buf? If not, why use memmove() instead of
> memcpy()?

Right, they never overlap, so memcpy() is fine here.

>
>> +    seg->size = size;
>> +    seg->dup_ack_count = 0;
>> +    seg->is_coalesced = 0;
>> +    seg->nc = nc;
>> +
>> +    QTAILQ_INSERT_TAIL(&chain->buffers, seg, next);
>> +    chain->stat.cache++;
>> +
>> +    virtio_net_rsc_extract_unit4(chain, seg->buf, &seg->unit);
>> +}
>> +
>> +static int32_t virtio_net_rsc_handle_ack(NetRscChain *chain, NetRscSeg *seg,
>> +                                 const uint8_t *buf, struct tcp_header *n_tcp,
>> +                                 struct tcp_header *o_tcp)
>> +{
>> +    uint32_t nack, oack;
>> +    uint16_t nwin, owin;
>> +
>> +    nack = htonl(n_tcp->th_ack);
>> +    nwin = htons(n_tcp->th_win);
>> +    oack = htonl(o_tcp->th_ack);
>> +    owin = htons(o_tcp->th_win);
>> +
>> +    if ((nack - oack) >= MAX_TCP_PAYLOAD) {
>> +        chain->stat.ack_out_of_win++;
>> +        return RSC_FINAL;
>> +    } else if (nack == oack) {
>> +        /* duplicated ack or window probe */
>> +        if (nwin == owin) {
>> +            /* duplicated ack, add dup ack count due to whql test up to 1 */
>> +            chain->stat.dup_ack++;
>> +
>> +            if (seg->dup_ack_count == 0) {
>> +                seg->dup_ack_count++;
>> +                chain->stat.dup_ack1++;
>> +                return RSC_COALESCE;
>> +            } else {
>> +                /* Spec says should send it directly */
>> +                chain->stat.dup_ack2++;
>> +                return RSC_FINAL;
>> +            }
>> +        } else {
>> +            /* Coalesce window update */
>> +            o_tcp->th_win = n_tcp->th_win;
>> +            chain->stat.win_update++;
>> +            return RSC_COALESCE;
>> +        }
>> +    } else {
>> +        /* pure ack, update ack */
>> +        o_tcp->th_ack = n_tcp->th_ack;
>> +        chain->stat.pure_ack++;
>> +        return RSC_COALESCE;
> Looks like there're something I missed. The spec said:
>
> "In other words, any pure ACK that is not a duplicate ACK or a window
> update triggers an exception and must not be coalesced. All such pure
> ACKs must be indicated as individual segments."
>
> Does it mean we *should not* coalesce windows update and pure ack?
> (Since it can wakeup transmission)?

The spec is a little vague and flexible on this point; please see flowchart I on the same page.

The spec's comments about the flowcharts:
------------------------------------------------------------------------
The first of the following two flowcharts describes the rules for coalescing segments and updating the TCP headers.
This flowchart refers to mechanisms for distinguishing valid duplicate ACKs and window updates. The second flowchart describes these mechanisms.
------------------------------------------------------------------------
As shown in the flowchart, only status 'C' breaks the current SCU and gets finalized; both 'A' and 'B' can be coalesced, AFAIK.
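
To make the branches easier to discuss, the classification done by
virtio_net_rsc_handle_ack() can be sketched as below. The names are
hypothetical, the values are assumed to be in host byte order, and the
kinds mirror the branches of the patch rather than the labels used in
the spec's flowchart:

```c
#include <stdint.h>

#define SKETCH_MAX_TCP_PAYLOAD 65535u

enum sketch_ack_kind {
    SKETCH_ACK_OUT_OF_WINDOW, /* finalize */
    SKETCH_ACK_DUPLICATE,     /* coalesce the first one, finalize after */
    SKETCH_ACK_WINDOW_UPDATE, /* coalesce, take the new window */
    SKETCH_ACK_ADVANCE        /* coalesce, take the new ack number */
};

static enum sketch_ack_kind sketch_classify_ack(uint32_t old_ack,
                                                uint16_t old_win,
                                                uint32_t new_ack,
                                                uint16_t new_win)
{
    /* Unsigned subtraction wraps, so an ack that moved backwards shows
     * up as a huge difference and is treated as out of window. */
    if ((uint32_t)(new_ack - old_ack) >= SKETCH_MAX_TCP_PAYLOAD) {
        return SKETCH_ACK_OUT_OF_WINDOW;
    }
    if (new_ack == old_ack) {
        return (new_win == old_win) ? SKETCH_ACK_DUPLICATE
                                    : SKETCH_ACK_WINDOW_UPDATE;
    }
    return SKETCH_ACK_ADVANCE;
}
```

Only the out-of-window case and the second duplicate ACK end the current
coalescing unit in the patch as posted.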

>
>> +    }
>> +}
>> +
>> +static int32_t virtio_net_rsc_coalesce_data(NetRscChain *chain, NetRscSeg *seg,
>> +                                    const uint8_t *buf, NetRscUnit *n_unit)
>> +{
>> +    void *data;
>> +    uint16_t o_ip_len;
>> +    uint32_t nseq, oseq;
>> +    NetRscUnit *o_unit;
>> +
>> +    o_unit = &seg->unit;
>> +    o_ip_len = htons(*o_unit->ip_plen);
>> +    nseq = htonl(n_unit->tcp->th_seq);
>> +    oseq = htonl(o_unit->tcp->th_seq);
>> +
>> +    if (n_unit->tcp_hdrlen > TCP_HDR_SZ) {
>> +        /* Log this only for debugging observation */
>> +        chain->stat.tcp_option++;
>> +    }
>> +
>> +    /* Ignore packet with more/larger tcp options */
>> +    if (n_unit->tcp_hdrlen > o_unit->tcp_hdrlen) {
> What if n_unit->tcp_hdrlen > o_uint->tcp_hdr_len ?
Do you mean '<'? That also means some option changed, so maybe I should
include it.
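
A sketch of what including it could look like, with hypothetical names.
The first helper shows the header-length decoding used by the extract
functions (the data offset is the top nibble, counted in 32-bit words,
so shifting by 10 instead of 12 yields bytes); the second treats any
change in length, larger or smaller, as a reason to finalize:

```c
#include <stdbool.h>
#include <stdint.h>

/* offset_flags is the TCP offset/flags word in host byte order. */
static uint16_t sketch_tcp_hdrlen(uint16_t offset_flags)
{
    return (uint16_t)((offset_flags & 0xF000) >> 10);
}

/* True when the new segment's TCP options differ in size from the
 * cached segment's, in either direction. */
static bool sketch_tcp_options_changed(uint16_t new_hdrlen,
                                       uint16_t old_hdrlen)
{
    return new_hdrlen != old_hdrlen;
}
```

For example, an offset/flags word of 0x5010 decodes to a 20-byte header
and 0x8018 to a 32-byte header.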
>
>> +        chain->stat.tcp_larger_option++;
>> +        return RSC_FINAL;
>> +    }
>> +
>> +    /* out of order or retransmitted. */
>> +    if ((nseq - oseq) > MAX_TCP_PAYLOAD) {
>> +        chain->stat.data_out_of_win++;
>> +        return RSC_FINAL;
>> +    }
>> +
>> +    data = ((uint8_t *)n_unit->tcp) + n_unit->tcp_hdrlen;
>> +    if (nseq == oseq) {
>> +        if ((0 == o_unit->payload) && n_unit->payload) {
>> +            /* From no payload to payload, normal case, not a dup ack or etc */
>> +            chain->stat.data_after_pure_ack++;
>> +            goto coalesce;
>> +        } else {
>> +            return virtio_net_rsc_handle_ack(chain, seg, buf,
>> +                                             n_unit->tcp, o_unit->tcp);
>> +        }
>> +    } else if ((nseq - oseq) != o_unit->payload) {
>> +        /* Not a consistent packet, out of order */
>> +        chain->stat.data_out_of_order++;
>> +        return RSC_FINAL;
>> +    } else {
>> +coalesce:
>> +        if ((o_ip_len + n_unit->payload) > chain->max_payload) {
>> +            chain->stat.over_size++;
>> +            return RSC_FINAL;
>> +        }
>> +
>> +        /* Here comes the right data, the payload lengh in v4/v6 is different,
>> +           so use the field value to update and record the new data len */
>> +        o_unit->payload += n_unit->payload; /* update new data len */
>> +
>> +        /* update field in ip header */
>> +        *o_unit->ip_plen = htons(o_ip_len + n_unit->payload);
>> +
>> +        /* Bring 'PUSH' big, the whql test guide says 'PUSH' can be coalesced
>> +           for windows guest, while this may change the behavior for linux
>> +           guest. */
> This needs more thought, 'can' probably means don't. (Linux GRO won't
> merge PUSH packet).
Yes. Since this is mainly for Windows guests, how about keeping it as is
for now, doing more testing, and then deciding how to handle it?
>
>> +        o_unit->tcp->th_offset_flags = n_unit->tcp->th_offset_flags;
>> +
>> +        o_unit->tcp->th_ack = n_unit->tcp->th_ack;
>> +        o_unit->tcp->th_win = n_unit->tcp->th_win;
>> +
>> +        memmove(seg->buf + seg->size, data, n_unit->payload);
>> +        seg->size += n_unit->payload;
>> +        chain->stat.coalesced++;
>> +        return RSC_COALESCE;
>> +    }
>> +}
>> +
>> +static int32_t virtio_net_rsc_coalesce4(NetRscChain *chain, NetRscSeg *seg,
>> +                        const uint8_t *buf, size_t size, NetRscUnit *unit)
>> +{
>> +    if ((unit->ip->ip_src ^ seg->unit.ip->ip_src)
>> +        || (unit->ip->ip_dst ^ seg->unit.ip->ip_dst)
>> +        || (unit->tcp->th_sport ^ seg->unit.tcp->th_sport)
>> +        || (unit->tcp->th_dport ^ seg->unit.tcp->th_dport)) {
>> +        chain->stat.no_match++;
>> +        return RSC_NO_MATCH;
>> +    }
>> +
>> +    return virtio_net_rsc_coalesce_data(chain, seg, buf, unit);
>> +}
>> +
>> +/* Packets with 'SYN' should bypass, other flag should be sent after drain
>> + * to prevent out of order */
>> +static int virtio_net_rsc_tcp_ctrl_check(NetRscChain *chain,
>> +                                         struct tcp_header *tcp)
>> +{
>> +    uint16_t tcp_flag;
>> +
>> +    tcp_flag = htons(tcp->th_offset_flags) & 0x3F;
>> +    if (tcp_flag & TH_SYN) {
>> +        chain->stat.tcp_syn++;
>> +        return RSC_BYPASS;
>> +    }
>> +
>> +    if (tcp_flag & (TH_FIN | TH_URG | TH_RST)) {
>> +        chain->stat.tcp_ctrl_drain++;
>> +        return RSC_FINAL;
>> +    }
>> +
>> +    return RSC_WANT;
>> +}
>> +
>> +static bool virtio_net_rsc_empty_cache(NetRscChain *chain, NetClientState *nc,
>> +                          const uint8_t *buf, size_t size)
> indentation looks wrong.
ok.
>
>> +{
>> +    if (QTAILQ_EMPTY(&chain->buffers)) {
>> +        chain->stat.empty_cache++;
>> +        virtio_net_rsc_cache_buf(chain, nc, buf, size);
>> +        timer_mod(chain->drain_timer,
>> +              qemu_clock_get_ns(QEMU_CLOCK_VIRTUAL) + virtio_net_rsc_timeout);
>> +        return 1;
>> +    }
>> +
>> +    return 0;
>> +}
>> +
>> +static size_t virtio_net_rsc_do_coalesce(NetRscChain *chain, NetClientState *nc,
>> +                          const uint8_t *buf, size_t size, NetRscUnit *unit)
>> +{
> and here.
ok.
>
>> +    int ret;
>> +    NetRscSeg *seg, *nseg;
>> +
>> +    QTAILQ_FOREACH_SAFE(seg, &chain->buffers, next, nseg) {
>> +        ret = virtio_net_rsc_coalesce4(chain, seg, buf, size, unit);
>> +
>> +        if (ret == RSC_FINAL) {
>> +            if (virtio_net_rsc_drain_seg(chain, seg) == 0) {
>> +                /* Send failed */
>> +                chain->stat.final_failed++;
>> +                return 0;
>> +            }
>> +
>> +            /* Send current packet */
>> +            return virtio_net_do_receive(nc, buf, size);
>> +        } else if (ret == RSC_NO_MATCH) {
>> +            continue;
>> +        } else {
>> +            /* Coalesced, mark coalesced flag to tell calc cksum for ipv4 */
>> +            seg->is_coalesced = 1;
>> +            return size;
>> +        }
>> +    }
>> +
>> +    chain->stat.no_match_cache++;
>> +    virtio_net_rsc_cache_buf(chain, nc, buf, size);
>> +    return size;
>> +}
>> +
>> +/* Drain a connection data, this is to avoid out of order segments */
>> +static size_t virtio_net_rsc_drain_flow(NetRscChain *chain, NetClientState *nc,
>> +                        const uint8_t *buf, size_t size, uint16_t ip_start,
>> +                        uint16_t ip_size, uint16_t tcp_port, uint16_t port_size)
>> +{
>> +    NetRscSeg *seg, *nseg;
>> +
>> +    QTAILQ_FOREACH_SAFE(seg, &chain->buffers, next, nseg) {
>> +        if (memcmp(buf + ip_start, seg->buf + ip_start, ip_size)
>> +            || memcmp(buf + tcp_port, seg->buf + tcp_port, port_size)) {
>> +            continue;
>> +        }
>> +        if (virtio_net_rsc_drain_seg(chain, seg) == 0) {
>> +            chain->stat.drain_failed++;
>> +        }
>> +
>> +        break;
>> +    }
>> +
>> +    return virtio_net_do_receive(nc, buf, size);
>> +}
>> +
>> +static int32_t virtio_net_rsc_sanity_check4(NetRscChain *chain,
>> +                        struct ip_header *ip, const uint8_t *buf, size_t size)
>> +{
>> +    uint16_t ip_len;
>> +
>> +    if (size < (chain->hdr_size + ETH_IP4_HDR_SZ + TCP_HDR_SZ)) {
>> +        return RSC_BYPASS;
>> +    }
>> +
>> +    /* Not an ipv4 one */
>> +    if (((0xF0 & ip->ip_ver_len) >> 4) != IP_HEADER_VERSION_4) {
> I've replied to this several times; please use a consistent style, e.g.
> "ip->ip_ver_len & 0xF0".
Sorry, I thought you meant the 'IP_HEADER_VERSION_4' part.
>
>> +        chain->stat.ip_option++;
>> +        return RSC_BYPASS;
>> +    }
>> +
>> +    /* Don't handle packets with ip option */
>> +    if (IP4_HEADER_LEN != (0xF & ip->ip_ver_len)) {
>> +        chain->stat.ip_option++;
>> +        return RSC_BYPASS;
>> +    }
>> +
>> +    /* Don't handle packets with ip fragment */
>> +    if (!(htons(ip->ip_off) & IP_DF)) {
>> +        chain->stat.ip_frag++;
>> +        return RSC_BYPASS;
>> +    }
>> +
>> +    if (ip->ip_p != IPPROTO_TCP) {
>> +        chain->stat.bypass_not_tcp++;
>> +        return RSC_BYPASS;
>> +    }
>> +
>> +    /* Sanity check */
>> +    ip_len = htons(ip->ip_len);
>> +    if (ip_len < (IP4_HDR_SZ + TCP_HDR_SZ)
>> +        || ip_len > (size - chain->hdr_size - ETH_HDR_SZ)) {
>> +        chain->stat.ip_hacked++;
>> +        return RSC_BYPASS;
>> +    }
>> +
>> +    return RSC_WANT;
>> +}
>> +
>> +static size_t virtio_net_rsc_receive4(void *opq, NetClientState* nc,
>> +                                      const uint8_t *buf, size_t size)
>> +{
>> +    int32_t ret;
>> +    NetRscChain *chain;
>> +    NetRscUnit unit;
>> +
>> +    chain = (NetRscChain *)opq;
>> +    virtio_net_rsc_extract_unit4(chain, buf, &unit);
>> +    if (RSC_WANT != virtio_net_rsc_sanity_check4(chain, unit.ip, buf, size)) {
>> +        return virtio_net_do_receive(nc, buf, size);
>> +    }
>> +
>> +    ret = virtio_net_rsc_tcp_ctrl_check(chain, unit.tcp);
>> +    if (ret == RSC_BYPASS) {
>> +        return virtio_net_do_receive(nc, buf, size);
>> +    } else if (ret == RSC_FINAL) {
>> +        return virtio_net_rsc_drain_flow(chain, nc, buf, size,
>> +                        ((chain->hdr_size + ETH_HDR_SZ) + 12), IP4_ADDR_SIZE,
>> +                        (chain->hdr_size + ETH_IP4_HDR_SZ), TCP_PORT_SIZE);
>> +    }
>> +
>> +    if (virtio_net_rsc_empty_cache(chain, nc, buf, size)) {
>> +        return size;
>> +    }
>> +
>> +    return virtio_net_rsc_do_coalesce(chain, nc, buf, size, &unit);
>> +}
>> +
>> +static NetRscChain *virtio_net_rsc_lookup_chain(VirtIONet * n,
>> +                                            NetClientState *nc, uint16_t proto)
>> +{
>> +    NetRscChain *chain;
>> +
>> +    /* Only handle IPv4/6 */
>> +    if (proto != (uint16_t)ETH_P_IP) {
> The code conflicts with the comment above.
ok.
>
>> +        return NULL;
>> +    }
>> +
>> +    QTAILQ_FOREACH(chain, &n->rsc_chains, next) {
>> +        if (chain->proto == proto) {
>> +            return chain;
>> +        }
>> +    }
>> +
>> +    chain = g_malloc(sizeof(*chain));
>> +    chain->hdr_size = n->guest_hdr_len;
> Why introduce a dedicated field instead of just using n->guest_hdr_len?
This is to reduce the number of times VirtIONet has to be looked up via
'nc'; a few places use the header size.
>
>> +    chain->proto = proto;
>> +    chain->max_payload = MAX_IP4_PAYLOAD;
>> +    chain->drain_timer = timer_new_ns(QEMU_CLOCK_VIRTUAL,
>> +                                      virtio_net_rsc_purge, chain);
>> +    memset(&chain->stat, 0, sizeof(chain->stat));
>> +
>> +    QTAILQ_INIT(&chain->buffers);
>> +    QTAILQ_INSERT_TAIL(&n->rsc_chains, chain, next);
>> +
>> +    return chain;
>> +}
>> +
>> +static ssize_t virtio_net_rsc_receive(NetClientState *nc,
>> +                                      const uint8_t *buf, size_t size)
>> +{
>> +    uint16_t proto;
>> +    NetRscChain *chain;
>> +    struct eth_header *eth;
>> +    VirtIONet *n;
>> +
>> +    n = qemu_get_nic_opaque(nc);
>> +    if (size < (n->guest_hdr_len + ETH_HDR_SZ)) {
>> +        return virtio_net_do_receive(nc, buf, size);
>> +    }
>> +
>> +    eth = (struct eth_header *)(buf + n->guest_hdr_len);
>> +    proto = htons(eth->h_proto);
>> +
>> +    chain = virtio_net_rsc_lookup_chain(n, nc, proto);
>> +    if (!chain) {
>> +        return virtio_net_do_receive(nc, buf, size);
>> +    } else {
>> +        chain->stat.received++;
>> +        return virtio_net_rsc_receive4(chain, nc, buf, size);
>> +    }
>> +}
>> +
>> +static ssize_t virtio_net_receive(NetClientState *nc,
>> +                                  const uint8_t *buf, size_t size)
>> +{
>> +    if (virtio_net_rsc_bypass) {
>> +        return virtio_net_do_receive(nc, buf, size);
> You need a feature bit for this, with compat handling for older machine types.
> Some work on the virtio spec is also needed, I think.
Yes. I'm not sure which interface is best for this: hmp, qmp, or ethtool.
Since this is meant to support Windows guests, it needs a widely
compatible interface. Any comments?
>
>> +    } else {
>> +        return virtio_net_rsc_receive(nc, buf, size);
>> +    }
>> +}
>> +
>>   static NetClientInfo net_virtio_info = {
>>       .type = NET_CLIENT_OPTIONS_KIND_NIC,
>>       .size = sizeof(NICState),
>> @@ -1814,6 +2296,7 @@ static void virtio_net_device_realize(DeviceState *dev, Error **errp)
>>       nc = qemu_get_queue(n->nic);
>>       nc->rxfilter_notify_enabled = 1;
>>   
>> +    QTAILQ_INIT(&n->rsc_chains);
>>       n->qdev = dev;
>>       register_savevm(dev, "virtio-net", -1, VIRTIO_NET_VM_VERSION,
>>                       virtio_net_save, virtio_net_load, n);
>> @@ -1848,6 +2331,7 @@ static void virtio_net_device_unrealize(DeviceState *dev, Error **errp)
>>       g_free(n->vqs);
>>       qemu_del_nic(n->nic);
>>       virtio_cleanup(vdev);
>> +    virtio_net_rsc_cleanup(n);
>>   }
>>   
>>   static void virtio_net_instance_init(Object *obj)
>> diff --git a/include/hw/virtio/virtio-net.h b/include/hw/virtio/virtio-net.h
>> index 0cabdb6..6939e92 100644
>> --- a/include/hw/virtio/virtio-net.h
>> +++ b/include/hw/virtio/virtio-net.h
>> @@ -59,6 +59,7 @@ typedef struct VirtIONet {
>>       VirtIONetQueue *vqs;
>>       VirtQueue *ctrl_vq;
>>       NICState *nic;
>> +    QTAILQ_HEAD(, NetRscChain) rsc_chains;
>>       uint32_t tx_timeout;
>>       int32_t tx_burst;
>>       uint32_t has_vnet_hdr;
>> diff --git a/include/hw/virtio/virtio.h b/include/hw/virtio/virtio.h
>> index 2b5b248..3b1dfa8 100644
>> --- a/include/hw/virtio/virtio.h
>> +++ b/include/hw/virtio/virtio.h
>> @@ -128,6 +128,78 @@ typedef struct VirtioDeviceClass {
>>       int (*load)(VirtIODevice *vdev, QEMUFile *f, int version_id);
>>   } VirtioDeviceClass;
>>   
>> +/* Coalesced packets type & status */
>> +typedef enum {
>> +    RSC_COALESCE,           /* Data been coalesced */
>> +    RSC_FINAL,              /* Will terminate current connection */
>> +    RSC_NO_MATCH,           /* No matched in the buffer pool */
>> +    RSC_BYPASS,             /* Packet to be bypass, not tcp, tcp ctrl, etc */
>> +    RSC_WANT                /* Data want to be coalesced */
>> +} COALESCE_STATUS;
>> +
>> +typedef struct NetRscStat {
>> +    uint32_t received;
>> +    uint32_t coalesced;
>> +    uint32_t over_size;
>> +    uint32_t cache;
>> +    uint32_t empty_cache;
>> +    uint32_t no_match_cache;
>> +    uint32_t win_update;
>> +    uint32_t no_match;
>> +    uint32_t tcp_syn;
>> +    uint32_t tcp_ctrl_drain;
>> +    uint32_t dup_ack;
>> +    uint32_t dup_ack1;
>> +    uint32_t dup_ack2;
>> +    uint32_t pure_ack;
>> +    uint32_t ack_out_of_win;
>> +    uint32_t data_out_of_win;
>> +    uint32_t data_out_of_order;
>> +    uint32_t data_after_pure_ack;
>> +    uint32_t bypass_not_tcp;
>> +    uint32_t tcp_option;
>> +    uint32_t tcp_larger_option;
>> +    uint32_t ip_frag;
>> +    uint32_t ip_hacked;
>> +    uint32_t ip_option;
>> +    uint32_t purge_failed;
>> +    uint32_t drain_failed;
>> +    uint32_t final_failed;
>> +    int64_t  timer;
>> +} NetRscStat;
>> +
>> +/* Rsc unit general info used to checking if can coalescing */
>> +typedef struct NetRscUnit {
>> +   struct ip_header *ip;   /* ip header */
>> +   uint16_t *ip_plen;      /* data len pointer in ip header field */
>> +   struct tcp_header *tcp; /* tcp header */
>> +   uint16_t tcp_hdrlen;    /* tcp header len */
>> +   uint16_t payload;       /* pure payload without virtio/eth/ip/tcp */
>> +} NetRscUnit;
>> +
>> +/* Coalesced segment */
>> +typedef struct NetRscSeg {
>> +    QTAILQ_ENTRY(NetRscSeg) next;
>> +    void *buf;
>> +    size_t size;
>> +    uint32_t dup_ack_count;
>> +    bool is_coalesced;      /* need recal ipv4 header checksum, mark here */
>> +    NetRscUnit unit;
>> +    NetClientState *nc;
>> +} NetRscSeg;
>> +
>> +
>> +/* Chain is divided by protocol(ipv4/v6) and NetClientInfo */
>> +typedef struct NetRscChain {
>> +   QTAILQ_ENTRY(NetRscChain) next;
>> +   uint16_t hdr_size;
>> +   uint16_t proto;
>> +   uint16_t max_payload;
>> +   QEMUTimer *drain_timer;
>> +   QTAILQ_HEAD(, NetRscSeg) buffers;
>> +   NetRscStat stat;
>> +} NetRscChain;
>> +
>>   void virtio_instance_init_common(Object *proxy_obj, void *data,
>>                                    size_t vdev_size, const char *vdev_name);
>>   
>
>
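
For reference, the IPv4 header checks discussed in this review (version
nibble, header length without options, fragment detection) can be
sketched as below, using the "ip->ip_ver_len & 0xF0" operand order the
reviewer asked for. This is only a minimal sketch: all SK_* and sk_*
names are hypothetical and not from the patch, and the fragment test
(MF flag set or non-zero offset) is a different test from the IP_DF
check in the quoted code.

```c
#include <assert.h>
#include <stdbool.h>
#include <stdint.h>

/* Hypothetical constants for this sketch; not names from the patch. */
#define SK_IP_VERSION_4  4
#define SK_IP4_MIN_IHL   5       /* header length in 32-bit words, no options */
#define SK_IP_MF         0x2000  /* "more fragments" flag, host byte order */
#define SK_IP_OFFMASK    0x1FFF  /* fragment offset mask, host byte order */

/* ver_len is the first byte of the IPv4 header: version | header length. */
static bool sk_is_ipv4(uint8_t ver_len)
{
    return ((ver_len & 0xF0) >> 4) == SK_IP_VERSION_4;
}

/* A header longer than 5 words carries IP options. */
static bool sk_has_ip_options(uint8_t ver_len)
{
    return (ver_len & 0x0F) != SK_IP4_MIN_IHL;
}

/* off_host is the flags + fragment offset field, already in host order. */
static bool sk_is_fragment(uint16_t off_host)
{
    return (off_host & (SK_IP_MF | SK_IP_OFFMASK)) != 0;
}
```

With this test a packet that only has DF set is not treated as a
fragment, while any packet with MF set or a non-zero offset is.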

^ permalink raw reply	[flat|nested] 32+ messages in thread

* Re: [Qemu-devel] [ Patch 2/2] virtio-net rsc: support coalescing ipv6 tcp traffic
  2016-03-17  8:50   ` Jason Wang
@ 2016-03-17 16:50     ` Wei Xu
  0 siblings, 0 replies; 32+ messages in thread
From: Wei Xu @ 2016-03-17 16:50 UTC (permalink / raw)
  To: Jason Wang, qemu-devel; +Cc: marcel, victork, dfleytma, yvugenfi, mst

On 2016-03-17 16:50, Jason Wang wrote:
>
> On 03/15/2016 05:17 PM, wexu@redhat.com wrote:
>> From: Wei Xu <wexu@redhat.com>
>>
>> Most of this works like ipv4, with one significant difference: the
>> length field in the ipv4 header includes the header itself, while the ipv6
>> payload length does not, which means ipv6 can carry a real '65535' unit.
>>
>> Signed-off-by: Wei Xu <wexu@redhat.com>
>> ---
>>   hw/net/virtio-net.c        | 146 ++++++++++++++++++++++++++++++++++++++++-----
>>   include/hw/virtio/virtio.h |   5 +-
>>   2 files changed, 135 insertions(+), 16 deletions(-)
>>
>> diff --git a/hw/net/virtio-net.c b/hw/net/virtio-net.c
>> index c23b45f..ef61b74 100644
>> --- a/hw/net/virtio-net.c
>> +++ b/hw/net/virtio-net.c
>> @@ -52,9 +52,14 @@
>>   #define MAX_IP4_PAYLOAD (65535 - IP4_HDR_SZ)
>>   #define MAX_TCP_PAYLOAD 65535
>>   
>> -/* max payload with virtio header */
>> +#define IP6_HDR_SZ (sizeof(struct ip6_header))
>> +#define ETH_IP6_HDR_SZ (ETH_HDR_SZ + IP6_HDR_SZ)
>> +#define IP6_ADDR_SIZE   32      /* ipv6 saddr + daddr */
>> +#define MAX_IP6_PAYLOAD MAX_TCP_PAYLOAD
>> +
>> +/* ip6 max payload, payload in ipv6 don't include the  header */
>>   #define MAX_VIRTIO_PAYLOAD  (sizeof(struct virtio_net_hdr_mrg_rxbuf) \
>> -                                + ETH_HDR_SZ + MAX_TCP_PAYLOAD)
>> +                                + ETH_IP6_HDR_SZ + MAX_IP6_PAYLOAD)
>>   
>>   #define IP4_HEADER_LEN 5 /* header length value in ip header without option */
>>   
>> @@ -1722,14 +1727,27 @@ static void virtio_net_rsc_extract_unit4(NetRscChain *chain,
>>   {
>>       uint16_t ip_hdrlen;
>>   
>> -    unit->ip = (struct ip_header *)(buf + chain->hdr_size + ETH_HDR_SZ);
>> -    ip_hdrlen = ((0xF & unit->ip->ip_ver_len) << 2);
>> -    unit->ip_plen = &unit->ip->ip_len;
>> -    unit->tcp = (struct tcp_header *)(((uint8_t *)unit->ip) + ip_hdrlen);
>> +    unit->u_ip.ip = (struct ip_header *)(buf + chain->hdr_size + ETH_HDR_SZ);
>> +    ip_hdrlen = ((0xF & unit->u_ip.ip->ip_ver_len) << 2);
>> +    unit->ip_plen = &unit->u_ip.ip->ip_len;
>> +    unit->tcp = (struct tcp_header *)(((uint8_t *)unit->u_ip.ip) + ip_hdrlen);
>>       unit->tcp_hdrlen = (htons(unit->tcp->th_offset_flags) & 0xF000) >> 10;
>>       unit->payload = htons(*unit->ip_plen) - ip_hdrlen - unit->tcp_hdrlen;
>>   }
>>   
>> +static void virtio_net_rsc_extract_unit6(NetRscChain *chain,
>> +                                         const uint8_t *buf, NetRscUnit* unit)
>> +{
>> +    unit->u_ip.ip6 = (struct ip6_header *)(buf + chain->hdr_size + ETH_HDR_SZ);
> The u_ip seems a little bit redundant. How about using a simple void * and
> casting it to ipv4/ipv6 in the proto specific callbacks?
>
> Introducing u_ip leads to unnecessary ipv4 code changes in the ipv6
> coalescing implementation.
Sure.
>> +    unit->ip_plen = &(unit->u_ip.ip6->ip6_ctlun.ip6_un1.ip6_un1_plen);
>> +    unit->tcp = (struct tcp_header *)(((uint8_t *)unit->u_ip.ip6)\
>> +                                    + IP6_HDR_SZ);
>> +    unit->tcp_hdrlen = (htons(unit->tcp->th_offset_flags) & 0xF000) >> 10;
>> +    /* There is a difference between payload length in ipv4 and v6,
>> +       ip header is excluded in ipv6 */
>> +    unit->payload = htons(*unit->ip_plen) - unit->tcp_hdrlen;
>> +}
>> +
>>   static void virtio_net_rsc_ipv4_checksum(struct ip_header *ip)
>>   {
>>       uint32_t sum;
>> @@ -1743,7 +1761,10 @@ static size_t virtio_net_rsc_drain_seg(NetRscChain *chain, NetRscSeg *seg)
>>   {
>>       int ret;
>>   
>> -    virtio_net_rsc_ipv4_checksum(seg->unit.ip);
>> +    if ((chain->proto == ETH_P_IP) && seg->is_coalesced) {
>> +        virtio_net_rsc_ipv4_checksum(seg->unit.u_ip.ip);
>> +    }
>> +
>>       ret = virtio_net_do_receive(seg->nc, seg->buf, seg->size);
>>       QTAILQ_REMOVE(&chain->buffers, seg, next);
>>       g_free(seg->buf);
>> @@ -1807,7 +1828,11 @@ static void virtio_net_rsc_cache_buf(NetRscChain *chain, NetClientState *nc,
>>       QTAILQ_INSERT_TAIL(&chain->buffers, seg, next);
>>       chain->stat.cache++;
>>   
>> -    virtio_net_rsc_extract_unit4(chain, seg->buf, &seg->unit);
>> +    if (chain->proto == ETH_P_IP) {
>> +        virtio_net_rsc_extract_unit4(chain, seg->buf, &seg->unit);
>> +    } else {
> A switch and a g_assert_not_reached() is better than this.
sure.
>
>> +        virtio_net_rsc_extract_unit6(chain, seg->buf, &seg->unit);
>> +    }
>>   }
>>   
>>   static int32_t virtio_net_rsc_handle_ack(NetRscChain *chain, NetRscSeg *seg,
>> @@ -1930,8 +1955,8 @@ coalesce:
>>   static int32_t virtio_net_rsc_coalesce4(NetRscChain *chain, NetRscSeg *seg,
>>                           const uint8_t *buf, size_t size, NetRscUnit *unit)
>>   {
>> -    if ((unit->ip->ip_src ^ seg->unit.ip->ip_src)
>> -        || (unit->ip->ip_dst ^ seg->unit.ip->ip_dst)
>> +    if ((unit->u_ip.ip->ip_src ^ seg->unit.u_ip.ip->ip_src)
>> +        || (unit->u_ip.ip->ip_dst ^ seg->unit.u_ip.ip->ip_dst)
>>           || (unit->tcp->th_sport ^ seg->unit.tcp->th_sport)
>>           || (unit->tcp->th_dport ^ seg->unit.tcp->th_dport)) {
>>           chain->stat.no_match++;
>> @@ -1941,6 +1966,22 @@ static int32_t virtio_net_rsc_coalesce4(NetRscChain *chain, NetRscSeg *seg,
>>       return virtio_net_rsc_coalesce_data(chain, seg, buf, unit);
>>   }
>>   
>> +static int32_t virtio_net_rsc_coalesce6(NetRscChain *chain, NetRscSeg *seg,
>> +                        const uint8_t *buf, size_t size, NetRscUnit *unit)
>> +{
>> +    if (memcmp(&unit->u_ip.ip6->ip6_src, &seg->unit.u_ip.ip6->ip6_src,
>> +               sizeof(struct in6_address))
>> +        || memcmp(&unit->u_ip.ip6->ip6_dst, &seg->unit.u_ip.ip6->ip6_dst,
>> +                  sizeof(struct in6_address))
>> +        || (unit->tcp->th_sport ^ seg->unit.tcp->th_sport)
>> +        || (unit->tcp->th_dport ^ seg->unit.tcp->th_dport)) {
>> +            chain->stat.no_match++;
>> +            return RSC_NO_MATCH;
>> +    }
>> +
>> +    return virtio_net_rsc_coalesce_data(chain, seg, buf, unit);
>> +}
>> +
>>   /* Packets with 'SYN' should bypass, other flag should be sent after drain
>>    * to prevent out of order */
>>   static int virtio_net_rsc_tcp_ctrl_check(NetRscChain *chain,
>> @@ -1983,7 +2024,11 @@ static size_t virtio_net_rsc_do_coalesce(NetRscChain *chain, NetClientState *nc,
>>       NetRscSeg *seg, *nseg;
>>   
>>       QTAILQ_FOREACH_SAFE(seg, &chain->buffers, next, nseg) {
>> -        ret = virtio_net_rsc_coalesce4(chain, seg, buf, size, unit);
>> +        if (chain->proto == ETH_P_IP) {
>> +            ret = virtio_net_rsc_coalesce4(chain, seg, buf, size, unit);
>> +        } else {
>> +            ret = virtio_net_rsc_coalesce6(chain, seg, buf, size, unit);
>> +        }
>>   
>>           if (ret == RSC_FINAL) {
>>               if (virtio_net_rsc_drain_seg(chain, seg) == 0) {
>> @@ -2082,7 +2127,8 @@ static size_t virtio_net_rsc_receive4(void *opq, NetClientState* nc,
>>   
>>       chain = (NetRscChain *)opq;
>>       virtio_net_rsc_extract_unit4(chain, buf, &unit);
>> -    if (RSC_WANT != virtio_net_rsc_sanity_check4(chain, unit.ip, buf, size)) {
>> +    if (RSC_WANT != virtio_net_rsc_sanity_check4(chain,
>> +                                                 unit.u_ip.ip, buf, size)) {
>>           return virtio_net_do_receive(nc, buf, size);
>>       }
>>   
>> @@ -2102,13 +2148,74 @@ static size_t virtio_net_rsc_receive4(void *opq, NetClientState* nc,
>>       return virtio_net_rsc_do_coalesce(chain, nc, buf, size, &unit);
>>   }
>>   
>> +static int32_t virtio_net_rsc_sanity_check6(NetRscChain *chain,
>> +                        struct ip6_header *ip, const uint8_t *buf, size_t size)
> Indentation is wrong here.
ok.
>
>> +{
>> +    uint16_t ip_len;
>> +
>> +    if (size < (chain->hdr_size + ETH_IP6_HDR_SZ + TCP_HDR_SZ)) {
>> +        return RSC_BYPASS;
>> +    }
>> +
>> +    if (((0xF0 & ip->ip6_ctlun.ip6_un1.ip6_un1_flow) >> 4)
>> +        != IP_HEADER_VERSION_6) {
>> +        return RSC_BYPASS;
>> +    }
>> +
>> +    /* Both option and protocol is checked in this */
>> +    if (ip->ip6_ctlun.ip6_un1.ip6_un1_nxt != IPPROTO_TCP) {
>> +        chain->stat.bypass_not_tcp++;
>> +        return RSC_BYPASS;
>> +    }
>> +
>> +    /* Sanity check */
> The comment is useless.
ok.
>
>> +    ip_len = htons(ip->ip6_ctlun.ip6_un1.ip6_un1_plen);
>> +    if (ip_len < TCP_HDR_SZ
>> +        || ip_len > (size - chain->hdr_size - ETH_IP6_HDR_SZ)) {
>> +        chain->stat.ip_hacked++;
>> +        return RSC_BYPASS;
>> +    }
>> +
>> +    return RSC_WANT;
>> +}
>> +
>> +static size_t virtio_net_rsc_receive6(void *opq, NetClientState* nc,
>> +                                      const uint8_t *buf, size_t size)
>> +{
> Rather similar to ipv4 version, need to unify the code.
ok.
>
>> +    int32_t ret;
>> +    NetRscChain *chain;
>> +    NetRscUnit unit;
>> +
>> +    chain = (NetRscChain *)opq;
>> +    virtio_net_rsc_extract_unit6(chain, buf, &unit);
>> +    if (RSC_WANT != virtio_net_rsc_sanity_check6(chain,
>> +                                                 unit.u_ip.ip6, buf, size)) {
>> +        return virtio_net_do_receive(nc, buf, size);
>> +    }
>> +
>> +    ret = virtio_net_rsc_tcp_ctrl_check(chain, unit.tcp);
>> +    if (ret == RSC_BYPASS) {
>> +        return virtio_net_do_receive(nc, buf, size);
>> +    } else if (ret == RSC_FINAL) {
>> +        return virtio_net_rsc_drain_flow(chain, nc, buf, size,
>> +                        ((chain->hdr_size + ETH_HDR_SZ) + 8), IP6_ADDR_SIZE,
>> +                        (chain->hdr_size + ETH_IP6_HDR_SZ), TCP_PORT_SIZE);
>> +    }
>> +
>> +    if (virtio_net_rsc_empty_cache(chain, nc, buf, size)) {
>> +        return size;
>> +    }
>> +
>> +    return virtio_net_rsc_do_coalesce(chain, nc, buf, size, &unit);
>> +}
>> +
>>   static NetRscChain *virtio_net_rsc_lookup_chain(VirtIONet * n,
>>                                               NetClientState *nc, uint16_t proto)
>>   {
>>       NetRscChain *chain;
>>   
>>       /* Only handle IPv4/6 */
>> -    if (proto != (uint16_t)ETH_P_IP) {
>> +    if ((proto != (uint16_t)ETH_P_IP) && (proto != (uint16_t)ETH_P_IPV6)) {
>>           return NULL;
>>       }
>>   
>> @@ -2121,7 +2228,11 @@ static NetRscChain *virtio_net_rsc_lookup_chain(VirtIONet * n,
>>       chain = g_malloc(sizeof(*chain));
>>       chain->hdr_size = n->guest_hdr_len;
>>       chain->proto = proto;
>> -    chain->max_payload = MAX_IP4_PAYLOAD;
>> +    if (proto == (uint16_t)ETH_P_IP) {
>> +        chain->max_payload = MAX_IP4_PAYLOAD;
>> +    } else {
>> +        chain->max_payload = MAX_IP6_PAYLOAD;
>> +    }
>>       chain->drain_timer = timer_new_ns(QEMU_CLOCK_VIRTUAL,
>>                                         virtio_net_rsc_purge, chain);
>>       memset(&chain->stat, 0, sizeof(chain->stat));
>> @@ -2153,7 +2264,12 @@ static ssize_t virtio_net_rsc_receive(NetClientState *nc,
>>           return virtio_net_do_receive(nc, buf, size);
>>       } else {
>>           chain->stat.received++;
>> -        return virtio_net_rsc_receive4(chain, nc, buf, size);
>> +
>> +        if (proto == (uint16_t)ETH_P_IP) {
>> +            return virtio_net_rsc_receive4(chain, nc, buf, size);
>> +        } else  {
>> +            return virtio_net_rsc_receive6(chain, nc, buf, size);
>> +        }
>>       }
>>   }
>>   
>> diff --git a/include/hw/virtio/virtio.h b/include/hw/virtio/virtio.h
>> index 3b1dfa8..13d20a4 100644
>> --- a/include/hw/virtio/virtio.h
>> +++ b/include/hw/virtio/virtio.h
>> @@ -170,7 +170,10 @@ typedef struct NetRscStat {
>>   
>>   /* Rsc unit general info used to checking if can coalescing */
>>   typedef struct NetRscUnit {
>> -   struct ip_header *ip;   /* ip header */
>> +    union {
>> +        struct ip_header *ip;   /* ip header */
>> +        struct ip6_header *ip6; /* ip6 header */
>> +    } u_ip;
>>      uint16_t *ip_plen;      /* data len pointer in ip header field */
>>      struct tcp_header *tcp; /* tcp header */
>>      uint16_t tcp_hdrlen;    /* tcp header len */
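
The ipv4/ipv6 length difference described in the commit message, combined
with the suggested switch plus unreachable default, can be sketched
roughly as follows. This is an illustration only: sk_tcp_payload and the
SK_* constants are hypothetical names, and plain assert()/abort() stand in
for g_assert_not_reached() so the sketch builds without glib.

```c
#include <assert.h>
#include <stdint.h>
#include <stdlib.h>

/* EtherType values; the SK_ names are hypothetical for this sketch. */
#define SK_ETH_P_IP    0x0800
#define SK_ETH_P_IPV6  0x86DD

/*
 * len_field:  ipv4 total length or ipv6 payload length, host byte order
 * ip_hdrlen:  ipv4 header length in bytes (unused for ipv6)
 * tcp_hdrlen: tcp header length in bytes
 */
static uint16_t sk_tcp_payload(uint16_t proto, uint16_t len_field,
                               uint16_t ip_hdrlen, uint16_t tcp_hdrlen)
{
    switch (proto) {
    case SK_ETH_P_IP:
        /* ipv4 total length counts the ip header itself */
        return len_field - ip_hdrlen - tcp_hdrlen;
    case SK_ETH_P_IPV6:
        /* ipv6 payload length excludes the fixed ip6 header */
        return len_field - tcp_hdrlen;
    default:
        /* callers are expected to filter other protocols first */
        assert(0 && "unsupported protocol");
        abort();
    }
}
```

The sketch assumes the length fields were already validated, as the
sanity check functions in the patch do, so the subtractions cannot
underflow.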

^ permalink raw reply	[flat|nested] 32+ messages in thread

* Re: [Qemu-devel] [ Patch 0/2] Support Receive-Segment-Offload(RSC) for WHQL test of Window guest
  2016-03-17 15:44     ` Michael S. Tsirkin
@ 2016-03-17 16:57       ` Wei Xu
  2016-03-18  2:22         ` Jason Wang
  0 siblings, 1 reply; 32+ messages in thread
From: Wei Xu @ 2016-03-17 16:57 UTC (permalink / raw)
  To: qemu-devel

On 2016-03-17 23:44, Michael S. Tsirkin wrote:
> On Thu, Mar 17, 2016 at 11:21:28PM +0800, Wei Xu wrote:
>>
>> On 2016-03-17 14:47, Jason Wang wrote:
>>> On 03/15/2016 05:17 PM, wexu@redhat.com wrote:
>>>> From: Wei Xu <wexu@redhat.com>
>>>>
>>>> Fixed issues based on rfc patch v2:
>>>> 1. Removed big param list, replace it with 'NetRscUnit'
>>>> 2. Different virtio header size
>>>> 3. Modify callback function to direct call.
>>>> 4. Needn't check the failure of g_malloc()
>>>> 5. Other code format adjustment, macro naming, etc
>>>>
>>>> This patch is to support the WHQL test for Windows guests; the feature also
>>>> benefits other guests, working like the kernel 'gro' feature but implemented in userspace.
>>>> Feature information:
>>>>    http://msdn.microsoft.com/en-us/library/windows/hardware/jj853324
>>>>
>>>> Both IPv4 and IPv6 are supported. Although userspace virtio is slower
>>>> than vhost-net, this feature gives about a 1x to 3x performance
>>>> improvement for userspace virtio. That figure was obtained with this
>>>> feature on and 'tso/gso/gro' disabled on the corresponding tap interface
>>>> and guest interface; the improvement is smaller with all of those features on.
>>>>
>>>> Test steps:
>>>> Although this feature is mainly for Windows guests, I used Linux guests to help test
>>>> it. To keep things simple, I tested the patch in 3 steps as I moved on:
>>>> 1. A tcp socket client/server pair running on 2 Linux guests, so that I can control
>>>> the traffic and debug the code as I want.
>>>> 2. Netperf on Linux guests to test the throughput.
>>>> 3. WHQL test with 2 Windows guests.
>>>>
>>>> Current status:
>>>> IPv4 passes all the above tests.
>>>> IPv6 has only passed test steps 1 and 2 described above. The virtio nic cannot
>>>> receive any packet in the WHQL test; it looks like the test traffic is not sent
>>>> from the support machine. The test device can access both the host and another
>>>> Linux guest. I tried a lot of ways to work it out but failed; maybe debugging
>>>> from the Windows guest driver side can help figure it out.
>>> I think you need to figure out where the packet was dropped first. If the
>>> packet was not dropped by the Windows guest, you may want to try dropmonitor.
>> Yes, my previous description was wrong. I added some debug code and ran
>> a new test: the packets are received by virtio_net_receive(), are put
>> on the vring with no error, and are sent to the win guest, but
>> wireshark on the win guest doesn't see them, because the test case does
>> some hacking on the filter (it installs another lightweight filter). I'm
>> not sure how these packets travel inside the guest; maybe they are
>> received but dropped by the driver or the stack, etc.
> Add some debug output in the driver, rebuild it and see packets
> as they are received and passed up the stack.
Yes, but this is a Windows guest, and I tried to build a Windows debug
binary but failed. Is there any path in virtio where a packet could get
lost between being pushed to the vring and the guest being notified
successfully? I'm sure both of those steps complete, from debugging with gdb.
>
>> I tried 'dropmonitor'. It's very interesting, but it is of limited help for a
>> Windows guest, since I can only use it with qemu on the host.
>>>> Note:
>>>> Choosing a 'Realtek' nic as the 'MessageDevice' sometimes panics the system during
>>>> setup; this can be worked around by replacing it with an 'e1000' nic.
>>>>
>>>> Todo:
>>>> More sanity checks, plus tcp 'ecn' and 'window scale' tests.
>>>>
>>>> Wei Xu (2):
>>>>    virtio-net rsc: support coalescing ipv4 tcp traffic
>>>>    virtio-net rsc: support coalescing ipv6 tcp traffic
>>>>
>>>>   hw/net/virtio-net.c            | 602 ++++++++++++++++++++++++++++++++++++++++-
>>>>   include/hw/virtio/virtio-net.h |   1 +
>>>>   include/hw/virtio/virtio.h     |  75 +++++
>>>>   3 files changed, 677 insertions(+), 1 deletion(-)
>>>>

^ permalink raw reply	[flat|nested] 32+ messages in thread

* Re: [Qemu-devel] [ Patch 1/2] virtio-net rsc: support coalescing ipv4 tcp traffic
  2016-03-17 16:45     ` Wei Xu
@ 2016-03-18  2:03       ` Jason Wang
  2016-03-18  4:17         ` Wei Xu
  0 siblings, 1 reply; 32+ messages in thread
From: Jason Wang @ 2016-03-18  2:03 UTC (permalink / raw)
  To: Wei Xu, qemu-devel



On 03/18/2016 12:45 AM, Wei Xu wrote:
> On 2016-03-17 16:42, Jason Wang wrote:
>
>>
>> On 03/15/2016 05:17 PM, wexu@redhat.com wrote:
>>> From: Wei Xu <wexu@redhat.com>
>>>
>>> All the data packets in a tcp connection are cached in a big buffer
>>> during every receive interval and sent out by a timer.
>>> 'virtio_net_rsc_timeout' controls the interval, and its value strongly
>>> influences the performance and responsiveness of the tcp connection.
>>> 50000 (50us) is an empirical value that gains a performance improvement;
>>> since the whql test sends packets every 100us, '300000 (300us)' passes
>>> the test case and is also the default value. It is going to be tunable.
>>>
>>> The timer is only triggered if the packet pool is not empty,
>>> and it drains all the cached packets.
>>>
>>> 'NetRscChain' is used to save the segments of different protocols in a
>>> VirtIONet device.
>>>
>>> The main TCP handler covers TCP window updates, duplicate ACK checks,
>>> and the actual data coalescing when the new segment passes the sanity
>>> check and is identified as a 'wanted' one.
>>>
>>> A 'wanted' segment means:
>>> 1. The segment is within the current window and its sequence number is
>>>     the expected one.
>>> 2. The ACK of the segment is in the valid window.
>>> 3. If the ACK in the segment is a duplicate, the duplicate count must be
>>>     less than 2; this is to notify the upper layer TCP to start
>>>     retransmission, per the spec.
>>>
>>> Sanity check includes:
>>> 1. Incorrect version in IP header
>>> 2. IP options & IP fragment
>>> 3. Not a TCP packet
>>> 4. Size sanity check to prevent buffer overflow attacks.
>>>
>>> More cases may need to be considered, such as the ip identification
>>> field and other flags, but checking identification broke the test
>>> because Windows sets it to the same value even when the packet is not
>>> a fragment.
>>>
>>> There are normally 2 typical ways to handle a TCP control flag:
>>> 'bypass' and 'finalize'. 'bypass' means the packet should be sent out
>>> directly. 'finalize' means the packet should also be bypassed, but only
>>> after searching the pool for packets of the same connection and sending
>>> all of them out; this is to avoid out-of-order data.
>>>
>>> All 'SYN' packets are bypassed, since a 'SYN' always begins a new
>>> connection. Other flags such as 'FIN/RST' trigger a finalization,
>>> because they normally appear when a connection is about to be closed.
>>> An 'URG' packet also finalizes the current coalescing unit, though
>>> there may be protocol differences between different OSes.
>> But URG packet should be sent as quickly as possible regardless of
>> ordering, no?
>
> Yes, you're right. URG will terminate the current 'SCU'; I'll amend the
> commit log.
>
>>
>>> Statistics can be used to monitor the basic coalescing status; the
>>> 'out of order' and 'out of window' counters show how many packets were
>>> retransmitted, and thus describe the performance intuitively.
>>>
>>> Signed-off-by: Wei Xu <wexu@redhat.com>
>>> ---
>>>   hw/net/virtio-net.c            | 486 ++++++++++++++++++++++++++++++++++++++++-
>>>   include/hw/virtio/virtio-net.h |   1 +
>>>   include/hw/virtio/virtio.h     |  72 ++++++
>>>   3 files changed, 558 insertions(+), 1 deletion(-)

[...]

>>> +        } else {
>>> +            /* Coalesce window update */
>>> +            o_tcp->th_win = n_tcp->th_win;
>>> +            chain->stat.win_update++;
>>> +            return RSC_COALESCE;
>>> +        }
>>> +    } else {
>>> +        /* pure ack, update ack */
>>> +        o_tcp->th_ack = n_tcp->th_ack;
>>> +        chain->stat.pure_ack++;
>>> +        return RSC_COALESCE;
>> Looks like there's something I missed. The spec said:
>>
>> "In other words, any pure ACK that is not a duplicate ACK or a window
>> update triggers an exception and must not be coalesced. All such pure
>> ACKs must be indicated as individual segments."
>>
>> Does it mean we *should not* coalesce window updates and pure ACKs
>> (since they can wake up transmission)?
>
> The spec is also a little bit vague and flexible about this, please
> see flowchart I on the same page.
>
> Comments about the  flowchart:
> ------------------------------------------------------------------------
> The first of the following two flowcharts describes the rules for
> coalescing segments and updating the TCP headers.
> This flowchart refers to mechanisms for distinguishing valid duplicate
> ACKs and window updates. The second flowchart describes these mechanisms.
> ------------------------------------------------------------------------
> As shown in the flowchart, only status 'C' will break the current SCU and
> get finalized; both 'A' and 'B' can be coalesced afaik.
>

Interesting, looks like you're right.

>>
>>> +    }
>>> +}
>>> +
>>> +static int32_t virtio_net_rsc_coalesce_data(NetRscChain *chain, NetRscSeg *seg,
>>> +                                    const uint8_t *buf, NetRscUnit *n_unit)
>>> +{
>>> +    void *data;
>>> +    uint16_t o_ip_len;
>>> +    uint32_t nseq, oseq;
>>> +    NetRscUnit *o_unit;
>>> +
>>> +    o_unit = &seg->unit;
>>> +    o_ip_len = htons(*o_unit->ip_plen);
>>> +    nseq = htonl(n_unit->tcp->th_seq);
>>> +    oseq = htonl(o_unit->tcp->th_seq);
>>> +
>>> +    if (n_unit->tcp_hdrlen > TCP_HDR_SZ) {
>>> +        /* Log this only for debugging observation */
>>> +        chain->stat.tcp_option++;
>>> +    }
>>> +
>>> +    /* Ignore packet with more/larger tcp options */
>>> +    if (n_unit->tcp_hdrlen > o_unit->tcp_hdrlen) {
>> What if n_unit->tcp_hdrlen > o_uint->tcp_hdr_len ?
> Do you mean '<'? That also means some option changed, maybe I should
> include it.

Yes.

>>
>>> +        chain->stat.tcp_larger_option++;
>>> +        return RSC_FINAL;
>>> +    }
>>> +
>>> +    /* out of order or retransmitted. */
>>> +    if ((nseq - oseq) > MAX_TCP_PAYLOAD) {
>>> +        chain->stat.data_out_of_win++;
>>> +        return RSC_FINAL;
>>> +    }
>>> +
>>> +    data = ((uint8_t *)n_unit->tcp) + n_unit->tcp_hdrlen;
>>> +    if (nseq == oseq) {
>>> +        if ((0 == o_unit->payload) && n_unit->payload) {
>>> +            /* From no payload to payload, normal case, not a dup ack or etc */
>>> +            chain->stat.data_after_pure_ack++;
>>> +            goto coalesce;
>>> +        } else {
>>> +            return virtio_net_rsc_handle_ack(chain, seg, buf,
>>> +                                             n_unit->tcp, o_unit->tcp);
>>> +        }
>>> +    } else if ((nseq - oseq) != o_unit->payload) {
>>> +        /* Not a consistent packet, out of order */
>>> +        chain->stat.data_out_of_order++;
>>> +        return RSC_FINAL;
>>> +    } else {
>>> +coalesce:
>>> +        if ((o_ip_len + n_unit->payload) > chain->max_payload) {
>>> +            chain->stat.over_size++;
>>> +            return RSC_FINAL;
>>> +        }
>>> +
>>> +        /* Here comes the right data, the payload length in v4/v6 is different,
>>> +           so use the field value to update and record the new data len */
>>> +        o_unit->payload += n_unit->payload; /* update new data len */
>>> +
>>> +        /* update field in ip header */
>>> +        *o_unit->ip_plen = htons(o_ip_len + n_unit->payload);
>>> +
>>> +        /* Bring 'PUSH' big, the whql test guide says 'PUSH' can be coalesced
>>> +           for windows guest, while this may change the behavior for linux
>>> +           guest. */
>> This needs more thought, 'can' probably means don't. (Linux GRO won't
>> merge PUSH packets).
> Yes, since it's mainly for Windows guests, how about taking it as is
> for now, then doing more tests to see how to handle it?

Right, this is probably not an issue, just an optimization for latency.
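
To make the sequence checks in the hunk above easier to follow, here is a
minimal standalone sketch of the same ordering of tests. The enum, the
function name and the limit are made up for this illustration (the patch
itself returns RSC_COALESCE/RSC_FINAL and uses MAX_TCP_PAYLOAD), and the
'data after pure ack' special case is left out.

```c
#include <stdint.h>

/* Hypothetical verdicts for this sketch only. */
enum seq_verdict {
    SEQ_SAME,          /* nseq == oseq: ACK / window-update handling */
    SEQ_IN_ORDER,      /* new data starts right after the cached data */
    SEQ_OUT_OF_ORDER,  /* gap or overlap relative to the cached data */
    SEQ_OUT_OF_WINDOW  /* too far ahead, or behind (retransmission) */
};

/* Assumed limit for illustration, not necessarily the patch's value. */
#define SKETCH_MAX_TCP_PAYLOAD 65535u

static enum seq_verdict classify_seq(uint32_t oseq, uint32_t o_payload,
                                     uint32_t nseq)
{
    /* Unsigned subtraction wraps modulo 2^32, so a segment that lies
     * behind oseq yields a huge delta and is caught by the first test. */
    uint32_t delta = nseq - oseq;

    if (delta > SKETCH_MAX_TCP_PAYLOAD) {
        return SEQ_OUT_OF_WINDOW;
    }
    if (delta == 0) {
        return SEQ_SAME;
    }
    if (delta != o_payload) {
        return SEQ_OUT_OF_ORDER;
    }
    return SEQ_IN_ORDER;
}
```

Because the comparison is done on the wrapped difference, a sequence
number that rolls over past 2^32 is still classified as in order.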

[...]

>>
>>> +        return NULL;
>>> +    }
>>> +
>>> +    QTAILQ_FOREACH(chain, &n->rsc_chains, next) {
>>> +        if (chain->proto == proto) {
>>> +            return chain;
>>> +        }
>>> +    }
>>> +
>>> +    chain = g_malloc(sizeof(*chain));
>>> +    chain->hdr_size = n->guest_hdr_len;
>> Why introduce a dedicated field instead of just using
>> n->guest_hdr_len?
> This is to reduce the number of lookups of VirtIONet via 'nc'; a few
> places will use it.

Okay, then I think you'd better keep a pointer to the VirtIONet structure
instead. (Consider that you may want to refer to more data from it; we
don't want to duplicate fields in two places.)
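
Something along these lines, as a rough sketch only. The struct and
function names below are stand-ins invented for this mail, not the real
VirtIONet or NetRscChain definitions; only guest_hdr_len comes from the
patch.

```c
#include <stddef.h>

/* Stand-in for VirtIONet; only the field used here is modelled. */
typedef struct SketchVirtIONet {
    size_t guest_hdr_len;
} SketchVirtIONet;

/* Stand-in for NetRscChain. */
typedef struct SketchRscChain {
    SketchVirtIONet *n;    /* back pointer instead of a copied hdr_size */
    unsigned short proto;
} SketchRscChain;

static void sketch_chain_init(SketchRscChain *chain, SketchVirtIONet *n,
                              unsigned short proto)
{
    chain->n = n;
    chain->proto = proto;
}

/* Reads through the pointer, so a later change of the header length in
 * the device is seen without updating the chain. */
static size_t sketch_chain_hdr_size(const SketchRscChain *chain)
{
    return chain->n->guest_hdr_len;
}
```

With a copied field the chain would keep a stale value if the header
length changes after the chain is created; the back pointer avoids that.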

>>
>>> +    chain->proto = proto;
>>> +    chain->max_payload = MAX_IP4_PAYLOAD;
>>> +    chain->drain_timer = timer_new_ns(QEMU_CLOCK_VIRTUAL,
>>> +                                      virtio_net_rsc_purge, chain);
>>> +    memset(&chain->stat, 0, sizeof(chain->stat));
>>> +
>>> +    QTAILQ_INIT(&chain->buffers);
>>> +    QTAILQ_INSERT_TAIL(&n->rsc_chains, chain, next);
>>> +
>>> +    return chain;
>>> +}
>>> +
>>> +static ssize_t virtio_net_rsc_receive(NetClientState *nc,
>>> +                                      const uint8_t *buf, size_t size)
>>> +{
>>> +    uint16_t proto;
>>> +    NetRscChain *chain;
>>> +    struct eth_header *eth;
>>> +    VirtIONet *n;
>>> +
>>> +    n = qemu_get_nic_opaque(nc);
>>> +    if (size < (n->guest_hdr_len + ETH_HDR_SZ)) {
>>> +        return virtio_net_do_receive(nc, buf, size);
>>> +    }
>>> +
>>> +    eth = (struct eth_header *)(buf + n->guest_hdr_len);
>>> +    proto = htons(eth->h_proto);
>>> +
>>> +    chain = virtio_net_rsc_lookup_chain(n, nc, proto);
>>> +    if (!chain) {
>>> +        return virtio_net_do_receive(nc, buf, size);
>>> +    } else {
>>> +        chain->stat.received++;
>>> +        return virtio_net_rsc_receive4(chain, nc, buf, size);
>>> +    }
>>> +}
>>> +
>>> +static ssize_t virtio_net_receive(NetClientState *nc,
>>> +                                  const uint8_t *buf, size_t size)
>>> +{
>>> +    if (virtio_net_rsc_bypass) {
>>> +        return virtio_net_do_receive(nc, buf, size);
>> You need a feature bit for this, with compat handling for older machine
>> types. It also needs some work on the virtio spec, I think.
> Yes, I'm not sure which way is best to support this (hmp/qmp/ethtool).
> This is going to support Windows guests, so it needs a well-compatible
> interface. Any comments?

I think this should be implemented through feature bits/negotiation
instead of something like ethtool.

[...]

^ permalink raw reply	[flat|nested] 32+ messages in thread

* Re: [Qemu-devel] [ Patch 0/2] Support Receive-Segment-Offload(RSC) for WHQL test of Window guest
  2016-03-17 16:57       ` Wei Xu
@ 2016-03-18  2:22         ` Jason Wang
  2016-03-18  4:24           ` Wei Xu
  0 siblings, 1 reply; 32+ messages in thread
From: Jason Wang @ 2016-03-18  2:22 UTC (permalink / raw)
  To: Wei Xu, qemu-devel



On 03/18/2016 12:57 AM, Wei Xu wrote:
> On 2016-03-17 23:44, Michael S. Tsirkin wrote:
>> On Thu, Mar 17, 2016 at 11:21:28PM +0800, Wei Xu wrote:
>>>
>>> On 2016-03-17 14:47, Jason Wang wrote:
>>>> On 03/15/2016 05:17 PM, wexu@redhat.com wrote:
>>>>> From: Wei Xu <wexu@redhat.com>
>>>>>
>>>>> Fixed issues based on rfc patch v2:
>>>>> 1. Removed the big param list, replaced it with 'NetRscUnit'
>>>>> 2. Different virtio header size
>>>>> 3. Changed the callback function to a direct call.
>>>>> 4. No need to check for failure of g_malloc()
>>>>> 5. Other code format adjustments, macro naming, etc
>>>>>
>>>>> This patch is to support the WHQL test for Windows guests; the feature
>>>>> also benefits other guests, working as a kernel 'gro'-like feature with
>>>>> a userspace implementation.
>>>>> Feature information:
>>>>>    http://msdn.microsoft.com/en-us/library/windows/hardware/jj853324
>>>>>
>>>>> Both IPv4 and IPv6 are supported. Though userspace virtio is slower
>>>>> than vhost-net, there is about a 1x to 3x performance improvement for
>>>>> userspace virtio. This is achieved by turning this feature on and
>>>>> disabling 'tso/gso/gro' on the corresponding tap interface and guest
>>>>> interface; the improvement is smaller with all these features on.
>>>>>
>>>>> Test steps:
>>>>> Although this feature is mainly used for Windows guests, I used Linux
>>>>> guests to help test the feature. To keep things simple, I used 3 steps
>>>>> to test the patch as I moved on.
>>>>> 1. A tcp socket client/server pair running on 2 Linux guests, so that
>>>>>    I can control the traffic and debug the code as I want.
>>>>> 2. Netperf on Linux guests to test the throughput.
>>>>> 3. WHQL test with 2 Windows guests.
>>>>>
>>>>> Current status:
>>>>> IPv4 passes all the above tests.
>>>>> IPv6 only passed test steps 1 and 2 as described above. The virtio nic
>>>>> cannot receive any packet in the WHQL test; it looks like the test
>>>>> traffic is not sent from the support machine. The test device can
>>>>> access both the host and another Linux guest. I tried a lot of ways to
>>>>> work it out but failed; maybe debugging from the Windows guest driver
>>>>> side can help figure it out.
>>>> I think you need to figure out where the packet was dropped first. If
>>>> the packet was not dropped by the Windows guest, you may want to try
>>>> dropmonitor.
>>> Yes, there was something wrong with my previous description. I added
>>> some debug code and ran a new test: the packets are received by
>>> virtio_net_receive(), put on the vring with no error, and sent to the
>>> Windows guest, but wireshark in the Windows guest doesn't see them,
>>> because the test case did some hacking on the filter (it installed
>>> another lightweight filter). I'm not sure where these packets go in the
>>> guest; maybe they are received but dropped by the driver or the stack,
>>> etc.
>> Add some debug output in the driver, rebuild it and see packets
>> as they are received and passed up the stack.
> Yes, but this is a Windows guest. I tried to build a Windows debug
> binary but failed. Is there any path in virtio where a packet could go
> missing between being pushed to the vring and the guest being notified
> successfully? I'm sure about this part from debugging it with gdb.

Is the packet always dropped, or does it help if you turn off some
configuration (e.g. checksum offloads)?

>>
>>> I tried 'dropmonitor'. It's very interesting, but it is of limited
>>> help for a Windows guest; I can only use it with qemu on the host.
>>>>> Note:
>>>>> A 'MessageDevice' nic chosen as 'Realtek' will sometimes panic the
>>>>> system during setup;
>>>>> this can be worked around by replacing it with an 'e1000' nic.
>>>>>
>>>>> Todo:
>>>>> More sanity checks and tcp 'ecn' and 'window' scale tests.
>>>>>
>>>>> Wei Xu (2):
>>>>>    virtio-net rsc: support coalescing ipv4 tcp traffic
>>>>>    virtio-net rsc: support coalescing ipv6 tcp traffic
>>>>>
>>>>>   hw/net/virtio-net.c            | 602 ++++++++++++++++++++++++++++++++++++++++-
>>>>>   include/hw/virtio/virtio-net.h |   1 +
>>>>>   include/hw/virtio/virtio.h     |  75 +++++
>>>>>   3 files changed, 677 insertions(+), 1 deletion(-)
>>>>>
>
>

^ permalink raw reply	[flat|nested] 32+ messages in thread

* Re: [Qemu-devel] [ Patch 1/2] virtio-net rsc: support coalescing ipv4 tcp traffic
  2016-03-18  2:03       ` Jason Wang
@ 2016-03-18  4:17         ` Wei Xu
  2016-03-18  5:20           ` Jason Wang
  0 siblings, 1 reply; 32+ messages in thread
From: Wei Xu @ 2016-03-18  4:17 UTC (permalink / raw)
  To: Jason Wang, qemu-devel



On 2016-03-18 10:03, Jason Wang wrote:
>
> On 03/18/2016 12:45 AM, Wei Xu wrote:
>> On 2016-03-17 16:42, Jason Wang wrote:
>>
>>> On 03/15/2016 05:17 PM, wexu@redhat.com wrote:
>>>> From: Wei Xu <wexu@redhat.com>
>>>>
>>>> All the data packets in a tcp connection will be cached in a big buffer
>>>> in every receive interval, and will be sent out via a timer. The
>>>> 'virtio_net_rsc_timeout' controls the interval; the value strongly
>>>> influences the performance and responsiveness of the tcp connection.
>>>> 50000 (50us) is an empirical value that gains a performance improvement.
>>>> Since the whql test sends packets every 100us, '300000 (300us)' can pass
>>>> the test case; this is also the default value, and it's going to be
>>>> tunable.
>>>>
>>>> The timer will only be triggered if the packet pool is not empty,
>>>> and it'll drain off all the cached packets.
>>>>
>>>> 'NetRscChain' is used to save the segments of different protocols in a
>>>> VirtIONet device.
>>>>
>>>> The main TCP handler covers TCP window updates, the duplicated ACK check,
>>>> and the real data coalescing if the new segment passed the sanity check
>>>> and is identified as a 'wanted' one.
>>>>
>>>> A 'wanted' segment means:
>>>> 1. The segment is within the current window and the sequence is the
>>>>    expected one.
>>>> 2. The ACK of the segment is in the valid window.
>>>> 3. If the ACK in the segment is a duplicated one, then it must be less
>>>>    than 2; this is to notify the upper layer TCP to start retransmission,
>>>>    according to the spec.
>>>>
>>>> The sanity check covers:
>>>> 1. Incorrect version in the IP header
>>>> 2. IP options & IP fragments
>>>> 3. Not a TCP packet
>>>> 4. Size sanity check to prevent buffer overflow attacks.
>>>>
>>>> There may be more cases that should be considered, such as the IP
>>>> identification and other flags, but checking them broke the test because
>>>> Windows sets the identification to the same value even if the packet is
>>>> not a fragment.
>>>>
>>>> Normally there are 2 typical ways to handle a TCP control flag: 'bypass'
>>>> and 'finalize'. 'bypass' means the packet should be sent out directly.
>>>> 'finalize' means the packet should also be bypassed, but only after
>>>> searching the pool for packets of the same connection and sending all of
>>>> them out; this is to avoid out-of-order data.
>>>>
>>>> All 'SYN' packets are bypassed since a SYN always begins a new
>>>> connection. Other flags such as 'FIN/RST' trigger a finalization, because
>>>> this normally happens when a connection is about to be closed. An 'URG'
>>>> packet also finalizes the current coalescing unit, though there may be
>>>> protocol differences between OSes.
>>> But URG packet should be sent as quickly as possible regardless of
>>> ordering, no?
>> Yes, you're right, URG will terminate the current 'SCU'; I'll amend the
>> commit log.
>>
>>>> Statistics can be used to monitor the basic coalescing status; the
>>>> 'out of order' and 'out of window' counters show how many packets were
>>>> retransmitted, and thus describe the performance intuitively.
>>>>
>>>> Signed-off-by: Wei Xu <wexu@redhat.com>
>>>> ---
>>>>    hw/net/virtio-net.c            | 486 ++++++++++++++++++++++++++++++++++++++++-
>>>>    include/hw/virtio/virtio-net.h |   1 +
>>>>    include/hw/virtio/virtio.h     |  72 ++++++
>>>>    3 files changed, 558 insertions(+), 1 deletion(-)
> [...]
>
>>>> +        } else {
>>>> +            /* Coalesce window update */
>>>> +            o_tcp->th_win = n_tcp->th_win;
>>>> +            chain->stat.win_update++;
>>>> +            return RSC_COALESCE;
>>>> +        }
>>>> +    } else {
>>>> +        /* pure ack, update ack */
>>>> +        o_tcp->th_ack = n_tcp->th_ack;
>>>> +        chain->stat.pure_ack++;
>>>> +        return RSC_COALESCE;
>>> Looks like there's something I missed. The spec said:
>>>
>>> "In other words, any pure ACK that is not a duplicate ACK or a window
>>> update triggers an exception and must not be coalesced. All such pure
>>> ACKs must be indicated as individual segments."
>>>
>>> Does it mean we *should not* coalesce window updates and pure ACKs
>>> (since they can wake up transmission)?
>> The spec is also a little bit vague and flexible about this, please
>> see flowchart I on the same page.
>>
>> Comments about the  flowchart:
>> ------------------------------------------------------------------------
>> The first of the following two flowcharts describes the rules for
>> coalescing segments and updating the TCP headers.
>> This flowchart refers to mechanisms for distinguishing valid duplicate
>> ACKs and window updates. The second flowchart describes these mechanisms.
>> ------------------------------------------------------------------------
>> As shown in the flowchart, only status 'C' will break the current SCU and
>> get finalized; both 'A' and 'B' can be coalesced afaik.
>>
> Interesting, looks like you're right.
>
>>>> +    }
>>>> +}
>>>> +
>>>> +static int32_t virtio_net_rsc_coalesce_data(NetRscChain *chain, NetRscSeg *seg,
>>>> +                                    const uint8_t *buf, NetRscUnit *n_unit)
>>>> +{
>>>> +    void *data;
>>>> +    uint16_t o_ip_len;
>>>> +    uint32_t nseq, oseq;
>>>> +    NetRscUnit *o_unit;
>>>> +
>>>> +    o_unit = &seg->unit;
>>>> +    o_ip_len = htons(*o_unit->ip_plen);
>>>> +    nseq = htonl(n_unit->tcp->th_seq);
>>>> +    oseq = htonl(o_unit->tcp->th_seq);
>>>> +
>>>> +    if (n_unit->tcp_hdrlen > TCP_HDR_SZ) {
>>>> +        /* Log this only for debugging observation */
>>>> +        chain->stat.tcp_option++;
>>>> +    }
>>>> +
>>>> +    /* Ignore packet with more/larger tcp options */
>>>> +    if (n_unit->tcp_hdrlen > o_unit->tcp_hdrlen) {
>>> What if n_unit->tcp_hdrlen > o_uint->tcp_hdr_len ?
>> Do you mean '<'? That also means some option changed, maybe I should
>> include it.
> Yes.
>
>>>> +        chain->stat.tcp_larger_option++;
>>>> +        return RSC_FINAL;
>>>> +    }
>>>> +
>>>> +    /* out of order or retransmitted. */
>>>> +    if ((nseq - oseq) > MAX_TCP_PAYLOAD) {
>>>> +        chain->stat.data_out_of_win++;
>>>> +        return RSC_FINAL;
>>>> +    }
>>>> +
>>>> +    data = ((uint8_t *)n_unit->tcp) + n_unit->tcp_hdrlen;
>>>> +    if (nseq == oseq) {
>>>> +        if ((0 == o_unit->payload) && n_unit->payload) {
>>>> +            /* From no payload to payload, normal case, not a dup ack or etc */
>>>> +            chain->stat.data_after_pure_ack++;
>>>> +            goto coalesce;
>>>> +        } else {
>>>> +            return virtio_net_rsc_handle_ack(chain, seg, buf,
>>>> +                                             n_unit->tcp, o_unit->tcp);
>>>> +        }
>>>> +    } else if ((nseq - oseq) != o_unit->payload) {
>>>> +        /* Not a consistent packet, out of order */
>>>> +        chain->stat.data_out_of_order++;
>>>> +        return RSC_FINAL;
>>>> +    } else {
>>>> +coalesce:
>>>> +        if ((o_ip_len + n_unit->payload) > chain->max_payload) {
>>>> +            chain->stat.over_size++;
>>>> +            return RSC_FINAL;
>>>> +        }
>>>> +
>>>> +        /* Here comes the right data, the payload length in v4/v6 is different,
>>>> +           so use the field value to update and record the new data len */
>>>> +        o_unit->payload += n_unit->payload; /* update new data len */
>>>> +
>>>> +        /* update field in ip header */
>>>> +        *o_unit->ip_plen = htons(o_ip_len + n_unit->payload);
>>>> +
>>>> +        /* Bring 'PUSH' big, the whql test guide says 'PUSH' can be coalesced
>>>> +           for windows guest, while this may change the behavior for linux
>>>> +           guest. */
>>> This needs more thought, 'can' probably means don't. (Linux GRO won't
>>> merge PUSH packets).
>> Yes, since it's mainly for Windows guests, how about taking it as is
>> for now, then doing more tests to see how to handle it?
> Right, this is probably not an issue, just an optimization for latency.
>
> [...]
>
>>>> +        return NULL;
>>>> +    }
>>>> +
>>>> +    QTAILQ_FOREACH(chain, &n->rsc_chains, next) {
>>>> +        if (chain->proto == proto) {
>>>> +            return chain;
>>>> +        }
>>>> +    }
>>>> +
>>>> +    chain = g_malloc(sizeof(*chain));
>>>> +    chain->hdr_size = n->guest_hdr_len;
>>> Why introduce a dedicated field instead of just using
>>> n->guest_hdr_len?
>> This is to reduce the number of lookups of VirtIONet via 'nc'; a few
>> places will use it.
> Okay, then I think you'd better keep a pointer to the VirtIONet structure
> instead. (Consider that you may want to refer to more data from it; we
> don't want to duplicate fields in two places.)
Sure, that's a good idea.
>
>>>> +    chain->proto = proto;
>>>> +    chain->max_payload = MAX_IP4_PAYLOAD;
>>>> +    chain->drain_timer = timer_new_ns(QEMU_CLOCK_VIRTUAL,
>>>> +                                      virtio_net_rsc_purge, chain);
>>>> +    memset(&chain->stat, 0, sizeof(chain->stat));
>>>> +
>>>> +    QTAILQ_INIT(&chain->buffers);
>>>> +    QTAILQ_INSERT_TAIL(&n->rsc_chains, chain, next);
>>>> +
>>>> +    return chain;
>>>> +}
>>>> +
>>>> +static ssize_t virtio_net_rsc_receive(NetClientState *nc,
>>>> +                                      const uint8_t *buf, size_t size)
>>>> +{
>>>> +    uint16_t proto;
>>>> +    NetRscChain *chain;
>>>> +    struct eth_header *eth;
>>>> +    VirtIONet *n;
>>>> +
>>>> +    n = qemu_get_nic_opaque(nc);
>>>> +    if (size < (n->guest_hdr_len + ETH_HDR_SZ)) {
>>>> +        return virtio_net_do_receive(nc, buf, size);
>>>> +    }
>>>> +
>>>> +    eth = (struct eth_header *)(buf + n->guest_hdr_len);
>>>> +    proto = htons(eth->h_proto);
>>>> +
>>>> +    chain = virtio_net_rsc_lookup_chain(n, nc, proto);
>>>> +    if (!chain) {
>>>> +        return virtio_net_do_receive(nc, buf, size);
>>>> +    } else {
>>>> +        chain->stat.received++;
>>>> +        return virtio_net_rsc_receive4(chain, nc, buf, size);
>>>> +    }
>>>> +}
>>>> +
>>>> +static ssize_t virtio_net_receive(NetClientState *nc,
>>>> +                                  const uint8_t *buf, size_t size)
>>>> +{
>>>> +    if (virtio_net_rsc_bypass) {
>>>> +        return virtio_net_do_receive(nc, buf, size);
>>> You need a feature bit for this, with compat handling for older machine
>>> types. It also needs some work on the virtio spec, I think.
>> Yes, I'm not sure which way is best to support this (hmp/qmp/ethtool).
>> This is going to support Windows guests, so it needs a well-compatible
>> interface. Any comments?
> I think this should be implemented through feature bits/negotiation
> instead of something like ethtool.
It looks like this feature should be turned on/off dynamically according
to the spec, so maybe it should be managed from the guest. Is there any
reference code for this?
>
> [...]

^ permalink raw reply	[flat|nested] 32+ messages in thread

* Re: [Qemu-devel] [ Patch 0/2] Support Receive-Segment-Offload(RSC) for WHQL test of Window guest
  2016-03-18  2:22         ` Jason Wang
@ 2016-03-18  4:24           ` Wei Xu
  2016-03-18  5:21             ` Jason Wang
  0 siblings, 1 reply; 32+ messages in thread
From: Wei Xu @ 2016-03-18  4:24 UTC (permalink / raw)
  To: Jason Wang, qemu-devel



On 2016-03-18 10:22, Jason Wang wrote:
>
> On 03/18/2016 12:57 AM, Wei Xu wrote:
>> On 2016-03-17 23:44, Michael S. Tsirkin wrote:
>>> On Thu, Mar 17, 2016 at 11:21:28PM +0800, Wei Xu wrote:
>>>> On 2016-03-17 14:47, Jason Wang wrote:
>>>>> On 03/15/2016 05:17 PM, wexu@redhat.com wrote:
>>>>>> From: Wei Xu <wexu@redhat.com>
>>>>>>
>>>>>> Fixed issues based on rfc patch v2:
>>>>>> 1. Removed the big param list, replaced it with 'NetRscUnit'
>>>>>> 2. Different virtio header size
>>>>>> 3. Changed the callback function to a direct call.
>>>>>> 4. No need to check for failure of g_malloc()
>>>>>> 5. Other code format adjustments, macro naming, etc
>>>>>>
>>>>>> This patch is to support the WHQL test for Windows guests; the feature
>>>>>> also benefits other guests, working as a kernel 'gro'-like feature with
>>>>>> a userspace implementation.
>>>>>> Feature information:
>>>>>>     http://msdn.microsoft.com/en-us/library/windows/hardware/jj853324
>>>>>>
>>>>>> Both IPv4 and IPv6 are supported. Though userspace virtio is slower
>>>>>> than vhost-net, there is about a 1x to 3x performance improvement for
>>>>>> userspace virtio. This is achieved by turning this feature on and
>>>>>> disabling 'tso/gso/gro' on the corresponding tap interface and guest
>>>>>> interface; the improvement is smaller with all these features on.
>>>>>>
>>>>>> Test steps:
>>>>>> Although this feature is mainly used for Windows guests, I used Linux
>>>>>> guests to help test the feature. To keep things simple, I used 3 steps
>>>>>> to test the patch as I moved on.
>>>>>> 1. A tcp socket client/server pair running on 2 Linux guests, so that
>>>>>>    I can control the traffic and debug the code as I want.
>>>>>> 2. Netperf on Linux guests to test the throughput.
>>>>>> 3. WHQL test with 2 Windows guests.
>>>>>>
>>>>>> Current status:
>>>>>> IPv4 passes all the above tests.
>>>>>> IPv6 only passed test steps 1 and 2 as described above. The virtio nic
>>>>>> cannot receive any packet in the WHQL test; it looks like the test
>>>>>> traffic is not sent from the support machine. The test device can
>>>>>> access both the host and another Linux guest. I tried a lot of ways to
>>>>>> work it out but failed; maybe debugging from the Windows guest driver
>>>>>> side can help figure it out.
>>>>> I think you need to figure out where the packet was dropped first. If
>>>>> the packet was not dropped by the Windows guest, you may want to try
>>>>> dropmonitor.
>>>> Yes, there was something wrong with my previous description. I added
>>>> some debug code and ran a new test: the packets are received by
>>>> virtio_net_receive(), put on the vring with no error, and sent to the
>>>> Windows guest, but wireshark in the Windows guest doesn't see them,
>>>> because the test case did some hacking on the filter (it installed
>>>> another lightweight filter). I'm not sure where these packets go in the
>>>> guest; maybe they are received but dropped by the driver or the stack,
>>>> etc.
>>> Add some debug output in the driver, rebuild it and see packets
>>> as they are received and passed up the stack.
>> Yes, but this is a Windows guest. I tried to build a Windows debug
>> binary but failed. Is there any path in virtio where a packet could go
>> missing between being pushed to the vring and the guest being notified
>> successfully? I'm sure about this part from debugging it with gdb.
> Is the packet always dropped, or does it help if you turn off some
> configuration (e.g. checksum offloads)?
Yes, only the test packets are dropped. There is no checksum in the IPv6
header; I remember I disabled checksum offloads and changed other features
(RSS) in the guest, but it didn't help. Are there any other tunable values
in qemu?
>
>>>> I tried 'dropmonitor'. It's very interesting, but it is of limited
>>>> help for a Windows guest; I can only use it with qemu on the host.
>>>>>> Note:
>>>>>> A 'MessageDevice' nic chosen as 'Realtek' will sometimes panic the
>>>>>> system during setup;
>>>>>> this can be worked around by replacing it with an 'e1000' nic.
>>>>>>
>>>>>> Todo:
>>>>>> More sanity checks and tcp 'ecn' and 'window' scale tests.
>>>>>>
>>>>>> Wei Xu (2):
>>>>>>     virtio-net rsc: support coalescing ipv4 tcp traffic
>>>>>>     virtio-net rsc: support coalescing ipv6 tcp traffic
>>>>>>
>>>>>>    hw/net/virtio-net.c            | 602 ++++++++++++++++++++++++++++++++++++++++-
>>>>>>    include/hw/virtio/virtio-net.h |   1 +
>>>>>>    include/hw/virtio/virtio.h     |  75 +++++
>>>>>>    3 files changed, 677 insertions(+), 1 deletion(-)
>>>>>>
>>

^ permalink raw reply	[flat|nested] 32+ messages in thread

* Re: [Qemu-devel] [ Patch 1/2] virtio-net rsc: support coalescing ipv4 tcp traffic
  2016-03-18  4:17         ` Wei Xu
@ 2016-03-18  5:20           ` Jason Wang
  2016-03-18  6:38             ` Wei Xu
  0 siblings, 1 reply; 32+ messages in thread
From: Jason Wang @ 2016-03-18  5:20 UTC (permalink / raw)
  To: Wei Xu, qemu-devel



On 03/18/2016 12:17 PM, Wei Xu wrote:
>>>>>
>>>>> +static ssize_t virtio_net_receive(NetClientState *nc,
>>>>> +                                  const uint8_t *buf, size_t size)
>>>>> +{
>>>>> +    if (virtio_net_rsc_bypass) {
>>>>> +        return virtio_net_do_receive(nc, buf, size);
>>>> You need a feature bit for this, with compat handling for older machine
>>>> types. It also needs some work on the virtio spec, I think.
>>> Yes, I'm not sure which way is best to support this (hmp/qmp/ethtool).
>>> This is going to support Windows guests, so it needs a well-compatible
>>> interface. Any comments?
>> I think this should be implemented through feature bits/negotiation
>> instead of something like ethtool.
> It looks like this feature should be turned on/off dynamically according
> to the spec, so maybe it should be managed from the guest. Is there any
> reference code for this?

Then you may want to look at implementation of
VIRTIO_NET_F_CTRL_GUEST_OFFLOADS.
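
Roughly, the guest sends a 64-bit bitmask of offloads over the control
virtqueue and the device accepts it only if every requested bit was
negotiated. A minimal sketch of that idea follows; it is not the QEMU
code, all names are made up for this mail, and the RSC bit number is a
placeholder rather than an assigned feature bit.

```c
#include <stdint.h>

#define SKETCH_OK  0
#define SKETCH_ERR 1

/* Bit positions of existing guest offload features in the virtio spec. */
#define SKETCH_F_GUEST_CSUM 1
#define SKETCH_F_GUEST_TSO4 7
#define SKETCH_F_GUEST_TSO6 8
/* Hypothetical bit for RSC, chosen only for this sketch. */
#define SKETCH_F_GUEST_RSC4 41

typedef struct SketchNet {
    uint64_t supported_offloads; /* offloads the device negotiated */
    uint64_t curr_offloads;      /* offloads the guest has enabled now */
} SketchNet;

/* Handle a "set offloads" request from the guest: reject any bit that
 * was not negotiated, otherwise replace the current set. */
static int sketch_set_offloads(SketchNet *n, uint64_t requested)
{
    if (requested & ~n->supported_offloads) {
        return SKETCH_ERR;
    }
    n->curr_offloads = requested;
    return SKETCH_OK;
}

/* The receive path would consult this instead of a global bypass flag. */
static int sketch_rsc4_enabled(const SketchNet *n)
{
    return (int)((n->curr_offloads >> SKETCH_F_GUEST_RSC4) & 1);
}
```

This keeps the on/off switch under guest control at runtime, while the
feature bit still gates whether the guest may ask for it at all.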

^ permalink raw reply	[flat|nested] 32+ messages in thread

* Re: [Qemu-devel] [ Patch 0/2] Support Receive-Segment-Offload(RSC) for WHQL test of Window guest
  2016-03-18  4:24           ` Wei Xu
@ 2016-03-18  5:21             ` Jason Wang
  2016-03-18  6:30               ` Wei Xu
  0 siblings, 1 reply; 32+ messages in thread
From: Jason Wang @ 2016-03-18  5:21 UTC (permalink / raw)
  To: Wei Xu, qemu-devel



On 03/18/2016 12:24 PM, Wei Xu wrote:
>
>
> On 2016-03-18 10:22, Jason Wang wrote:
>>
>> On 03/18/2016 12:57 AM, Wei Xu wrote:
>>> On 2016-03-17 23:44, Michael S. Tsirkin wrote:
>>>> On Thu, Mar 17, 2016 at 11:21:28PM +0800, Wei Xu wrote:
>>>>> On 2016-03-17 14:47, Jason Wang wrote:
>>>>>> On 03/15/2016 05:17 PM, wexu@redhat.com wrote:
>>>>>>> From: Wei Xu <wexu@redhat.com>
>>>>>>>
>>>>>>> Fixed issues based on rfc patch v2:
>>>>>>> 1. Removed the big param list, replaced it with 'NetRscUnit'
>>>>>>> 2. Different virtio header size
>>>>>>> 3. Changed the callback function to a direct call.
>>>>>>> 4. No need to check for failure of g_malloc()
>>>>>>> 5. Other code format adjustments, macro naming, etc
>>>>>>>
>>>>>>> This patch is to support the WHQL test for Windows guests; the feature
>>>>>>> also benefits other guests, working as a kernel 'gro'-like feature with
>>>>>>> a userspace implementation.
>>>>>>> Feature information:
>>>>>>>    http://msdn.microsoft.com/en-us/library/windows/hardware/jj853324
>>>>>>>
>>>>>>> Both IPv4 and IPv6 are supported. Though userspace virtio is slower
>>>>>>> than vhost-net, there is about a 1x to 3x performance improvement for
>>>>>>> userspace virtio. This is achieved by turning this feature on and
>>>>>>> disabling 'tso/gso/gro' on the corresponding tap interface and guest
>>>>>>> interface; the improvement is smaller with all these features on.
>>>>>>>
>>>>>>> Test steps:
>>>>>>> Although this feature is mainly used for Windows guests, I used Linux
>>>>>>> guests to help test the feature. To keep things simple, I used 3 steps
>>>>>>> to test the patch as I moved on.
>>>>>>> 1. A tcp socket client/server pair running on 2 Linux guests, so that
>>>>>>>    I can control the traffic and debug the code as I want.
>>>>>>> 2. Netperf on Linux guests to test the throughput.
>>>>>>> 3. WHQL test with 2 Windows guests.
>>>>>>>
>>>>>>> Current status:
>>>>>>> IPv4 passes all the above tests.
>>>>>>> IPv6 only passed test steps 1 and 2 as described above. The virtio nic
>>>>>>> cannot receive any packet in the WHQL test; it looks like the test
>>>>>>> traffic is not sent from the support machine. The test device can
>>>>>>> access both the host and another Linux guest. I tried a lot of ways to
>>>>>>> work it out but failed; maybe debugging from the Windows guest driver
>>>>>>> side can help figure it out.
>>>>>> I think you need figure out where was the packet dropped first.
>>>>>> If the
>>>>>> packet was not dropped by windows guest, you may want to try
>>>>>> dropmonitor.
>>>>> Yes, there is something wrong with my previous description, i add
>>>>> some debug
>>>>> code and did new test, the packets are received by
>>>>> virtio_net_receive() and
>>>>> are finished putting to the vring with no error and sent to win guest
>>>>> already, but wireshark on win guest doesn't get it, because the test
>>>>> case
>>>>> did some hacking on the filter, it installed another lightweight
>>>>> filter, i'm
>>>>> not sure how these packets go in the guest, maybe they are
>>>>> received but
>>>>> dropped by driver or stack, etc.
>>>> Add some debug output in the driver, rebuild it and see packets
>>>> as they are received and passed up the stack.
>>> Yes, but this is to win guest, i tried to build a windows debug binary
>>> but failed, is there any possible missing path in virtio between
>>> pushed it to vring and notified the guest successfully? i'm sure at
>>> this by debugging it with gdb.
>> Is the packet always dropped or does it help if you turn off some
>> configuration (e.g checksum offloads) works?
> Yes, only the test packets are dropped, there is no checksum for ipv6
> header,
> i remembered i disabled checksum offloads and changed other features(RSS)
> in the guest but it doesn't help, is there any other tunable values
> for qemu? 

-device virtio-net-pci,? can give you all the properties.


* Re: [Qemu-devel] [ Patch 0/2] Support Receive-Segment-Offload(RSC) for WHQL test of Window guest
  2016-03-18  5:21             ` Jason Wang
@ 2016-03-18  6:30               ` Wei Xu
  0 siblings, 0 replies; 32+ messages in thread
From: Wei Xu @ 2016-03-18  6:30 UTC (permalink / raw)
  To: qemu-devel



On 2016-03-18 13:21, Jason Wang wrote:
>
> On 03/18/2016 12:24 PM, Wei Xu wrote:
>>
>> On 2016年03月18日 10:22, Jason Wang wrote:
>>> On 03/18/2016 12:57 AM, Wei Xu wrote:
>>>> On 2016年03月17日 23:44, Michael S. Tsirkin wrote:
>>>>> On Thu, Mar 17, 2016 at 11:21:28PM +0800, Wei Xu wrote:
>>>>>> On 2016年03月17日 14:47, Jason Wang wrote:
>>>>>>> On 03/15/2016 05:17 PM,wexu@redhat.com  wrote:
>>>>>>>> From: Wei Xu<wexu@redhat.com>
>>>>>>>>
>>>>>>>> Fixed issues based on rfc patch v2:
>>>>>>>> 1. Removed big param list, replace it with 'NetRscUnit'
>>>>>>>> 2. Different virtio header size
>>>>>>>> 3. Modify callback function to direct call.
>>>>>>>> 4. Needn't check the failure of g_malloc()
>>>>>>>> 5. Other code format adjustment, macro naming, etc
>>>>>>>>
>>>>>>>> This patch is to support WHQL test for Windows guest, while this
>>>>>>>> feature also
>>>>>>>> benifits other guest works as a kernel 'gro' like feature with
>>>>>>>> userspace implementation.
>>>>>>>> Feature information:
>>>>>>>>     
>>>>>>>> http://msdn.microsoft.com/en-us/library/windows/hardware/jj853324
>>>>>>>>
>>>>>>>> Both IPv4 and IPv6 are supported, though performance with
>>>>>>>> userspace virtio
>>>>>>>> is slow than vhost-net, there is about 1x to 3x performance
>>>>>>>> improvement to
>>>>>>>> userspace virtio, this is done by turning this feature on and
>>>>>>>> disable
>>>>>>>> 'tso/gso/gro' on corresponding tap interface and guest interface,
>>>>>>>> while get
>>>>>>>> less improment with all these feature on.
>>>>>>>>
>>>>>>>> Test steps:
>>>>>>>> Although this feature is mainly used for window guest, i used
>>>>>>>> linux guest to help test
>>>>>>>> the feature, to make things simple, i used 3 steps to test the
>>>>>>>> patch as i moved on.
>>>>>>>> 1. With a tcp socket client/server pair running on 2 linux guest,
>>>>>>>> thus i can control
>>>>>>>> the traffic and debugging the code as i want.
>>>>>>>> 2. Netperf on linux guest test the throughput.
>>>>>>>> 3. WHQL test with 2 Windows guests.
>>>>>>>>
>>>>>>>> Current status:
>>>>>>>> IPv4 pass all the above tests.
>>>>>>>> IPv6 just passed test step 1 and 2 as described ahead, the virtio
>>>>>>>> nic cannot
>>>>>>>> receive any packet in WHQL test, looks like the test traffic is
>>>>>>>> not sent from
>>>>>>>> on the support machine, test device can access both host and
>>>>>>>> another linux
>>>>>>>> guest, tried a lot of ways to work it out but failed, maybe debug
>>>>>>>> from windows
>>>>>>>> guest driver side can help figuring it out.
>>>>>>> I think you need figure out where was the packet dropped first.
>>>>>>> If the
>>>>>>> packet was not dropped by windows guest, you may want to try
>>>>>>> dropmonitor.
>>>>>> Yes, there is something wrong with my previous description, i add
>>>>>> some debug
>>>>>> code and did new test, the packets are received by
>>>>>> virtio_net_receive() and
>>>>>> are finished putting to the vring with no error and sent to win guest
>>>>>> already, but wireshark on win guest doesn't get it, because the test
>>>>>> case
>>>>>> did some hacking on the filter, it installed another lightweight
>>>>>> filter, i'm
>>>>>> not sure how these packets go in the guest, maybe they are
>>>>>> received but
>>>>>> dropped by driver or stack, etc.
>>>>> Add some debug output in the driver, rebuild it and see packets
>>>>> as they are received and passed up the stack.
>>>> Yes, but this is to win guest, i tried to build a windows debug binary
>>>> but failed, is there any possible missing path in virtio between
>>>> pushed it to vring and notified the guest successfully? i'm sure at
>>>> this by debugging it with gdb.
>>> Is the packet always dropped or does it help if you turn off some
>>> configuration (e.g checksum offloads) works?
>> Yes, only the test packets are dropped, there is no checksum for ipv6
>> header,
>> i remembered i disabled checksum offloads and changed other features(RSS)
>> in the guest but it doesn't help, is there any other tunable values
>> for qemu?
> -device virtio-net-pci,? can gives you all the properties.
>
ok, thanks a lot.


* Re: [Qemu-devel] [ Patch 1/2] virtio-net rsc: support coalescing ipv4 tcp traffic
  2016-03-18  5:20           ` Jason Wang
@ 2016-03-18  6:38             ` Wei Xu
  2016-03-18  6:56               ` Jason Wang
  0 siblings, 1 reply; 32+ messages in thread
From: Wei Xu @ 2016-03-18  6:38 UTC (permalink / raw)
  To: qemu-devel



On 2016-03-18 13:20, Jason Wang wrote:
>
> On 03/18/2016 12:17 PM, Wei Xu wrote:
>>>>>> +static ssize_t virtio_net_receive(NetClientState *nc,
>>>>>> +                                  const uint8_t *buf, size_t size)
>>>>>> +{
>>>>>> +    if (virtio_net_rsc_bypass) {
>>>>>> +        return virtio_net_do_receive(nc, buf, size);
>>>>> You need a feature bit for this and compat it for older machine types.
>>>>> And also need some work on virtio spec I think.
>>>> yes, not sure which way is good to support this, hmp/qmp/ethtool, this
>>>> is gonna to support win guest,
>>>> so need a well-compatible interface, any comments?
>>> I think this should be implemented through feature bits/negotiation
>>> instead of something like ethtool.
>> Looks this feature should be turn on/off dynamically due to the spec,
>> so maybe this should be managed from the guest, is there any reference
>> code for this?
> Then you may want to look at implementation of
> VIRTIO_NET_F_CTRL_GUEST_OFFLOADS.
I had a quick look at it. Do you know how to control the feature bit,
both when launching the VM and when changing it at runtime?


* Re: [Qemu-devel] [ Patch 1/2] virtio-net rsc: support coalescing ipv4 tcp traffic
  2016-03-18  6:38             ` Wei Xu
@ 2016-03-18  6:56               ` Jason Wang
  2016-03-18 14:52                 ` Wei Xu
  0 siblings, 1 reply; 32+ messages in thread
From: Jason Wang @ 2016-03-18  6:56 UTC (permalink / raw)
  To: Wei Xu, qemu-devel



On 03/18/2016 02:38 PM, Wei Xu wrote:
>
>
> On 2016年03月18日 13:20, Jason Wang wrote:
>>
>> On 03/18/2016 12:17 PM, Wei Xu wrote:
>>>>>>> +static ssize_t virtio_net_receive(NetClientState *nc,
>>>>>>> +                                  const uint8_t *buf, size_t size)
>>>>>>> +{
>>>>>>> +    if (virtio_net_rsc_bypass) {
>>>>>>> +        return virtio_net_do_receive(nc, buf, size);
>>>>>> You need a feature bit for this and compat it for older machine
>>>>>> types.
>>>>>> And also need some work on virtio spec I think.
>>>>> yes, not sure which way is good to support this, hmp/qmp/ethtool,
>>>>> this
>>>>> is gonna to support win guest,
>>>>> so need a well-compatible interface, any comments?
>>>> I think this should be implemented through feature bits/negotiation
>>>> instead of something like ethtool.
>>> Looks this feature should be turn on/off dynamically due to the spec,
>>> so maybe this should be managed from the guest, is there any reference
>>> code for this?
>> Then you may want to look at implementation of
>> VIRTIO_NET_F_CTRL_GUEST_OFFLOADS.
> Have a short look at it, do you know how to control the feature bit? 
> both when lauching vm and changing it during runtime? 

The virtio spec and maybe the Windows driver source code can give you the answer.
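
For what it's worth, the two control points can be sketched roughly as
below. This is a minimal illustration, not QEMU or Windows driver code:
ToyNet, TOY_BIT, toy_set_features() and toy_ctrl_set_offloads() are
made-up names, and only the feature bit numbers are taken from the
virtio-net spec. The feature bit is negotiated once when the driver
initializes the device; after that the driver may change the enabled
offloads at runtime by sending a 64-bit mask over the control queue
(the spec calls this command VIRTIO_NET_CTRL_GUEST_OFFLOADS_SET).

```c
#include <stdint.h>

/* Feature bit numbers as defined by the virtio-net specification. */
#define VIRTIO_NET_F_GUEST_CSUM           1
#define VIRTIO_NET_F_CTRL_GUEST_OFFLOADS  2
#define VIRTIO_NET_F_GUEST_TSO4           7
#define VIRTIO_NET_F_GUEST_TSO6           8
#define VIRTIO_NET_F_GUEST_ECN            9
#define VIRTIO_NET_F_GUEST_UFO           10

#define TOY_BIT(n) (1ULL << (n))

/* Hypothetical, simplified device state (not QEMU's VirtIONet). */
typedef struct ToyNet {
    uint64_t negotiated;     /* feature bits accepted by the driver */
    uint64_t curr_offloads;  /* guest offloads currently enabled */
} ToyNet;

/* All feature bits that describe a guest (receive-side) offload. */
uint64_t toy_guest_offload_mask(void)
{
    return TOY_BIT(VIRTIO_NET_F_GUEST_CSUM) |
           TOY_BIT(VIRTIO_NET_F_GUEST_TSO4) |
           TOY_BIT(VIRTIO_NET_F_GUEST_TSO6) |
           TOY_BIT(VIRTIO_NET_F_GUEST_ECN)  |
           TOY_BIT(VIRTIO_NET_F_GUEST_UFO);
}

/* Initialization time: record what was negotiated and enable every
 * negotiated guest offload by default. */
void toy_set_features(ToyNet *n, uint64_t features)
{
    n->negotiated = features;
    n->curr_offloads = features & toy_guest_offload_mask();
}

/* Runtime: handle an "offloads set" control command.
 * Returns 0 on success, -1 if the request is rejected. */
int toy_ctrl_set_offloads(ToyNet *n, uint64_t offloads)
{
    uint64_t allowed = n->negotiated & toy_guest_offload_mask();

    if (!(n->negotiated & TOY_BIT(VIRTIO_NET_F_CTRL_GUEST_OFFLOADS))) {
        return -1;  /* runtime control was never negotiated */
    }
    if (offloads & ~allowed) {
        return -1;  /* asks for an offload that was not negotiated */
    }
    n->curr_offloads = offloads;
    return 0;
}
```

The launch-time side is then just a question of which feature bits the
device offers, and the runtime side is the control command.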


* Re: [Qemu-devel] [ Patch 1/2] virtio-net rsc: support coalescing ipv4 tcp traffic
  2016-03-18  6:56               ` Jason Wang
@ 2016-03-18 14:52                 ` Wei Xu
  0 siblings, 0 replies; 32+ messages in thread
From: Wei Xu @ 2016-03-18 14:52 UTC (permalink / raw)
  To: Jason Wang, qemu-devel



On 2016-03-18 14:56, Jason Wang wrote:
>
> On 03/18/2016 02:38 PM, Wei Xu wrote:
>>
>> On 2016年03月18日 13:20, Jason Wang wrote:
>>> On 03/18/2016 12:17 PM, Wei Xu wrote:
>>>>>>>> +static ssize_t virtio_net_receive(NetClientState *nc,
>>>>>>>> +                                  const uint8_t *buf, size_t size)
>>>>>>>> +{
>>>>>>>> +    if (virtio_net_rsc_bypass) {
>>>>>>>> +        return virtio_net_do_receive(nc, buf, size);
>>>>>>> You need a feature bit for this and compat it for older machine
>>>>>>> types.
>>>>>>> And also need some work on virtio spec I think.
>>>>>> yes, not sure which way is good to support this, hmp/qmp/ethtool,
>>>>>> this
>>>>>> is gonna to support win guest,
>>>>>> so need a well-compatible interface, any comments?
>>>>> I think this should be implemented through feature bits/negotiation
>>>>> instead of something like ethtool.
>>>> Looks this feature should be turn on/off dynamically due to the spec,
>>>> so maybe this should be managed from the guest, is there any reference
>>>> code for this?
>>> Then you may want to look at implementation of
>>> VIRTIO_NET_F_CTRL_GUEST_OFFLOADS.
>> Have a short look at it, do you know how to control the feature bit?
>> both when lauching vm and changing it during runtime?
> Virtio spec and maybe windows driver source code can give you the answer.
OK, will check it out.


* Re: [Qemu-devel] [PATCH 1/2] virtio-net rsc: support coalescing ipv4 tcp traffic
  2016-11-30  8:55     ` Wei Xu
@ 2016-11-30 11:12       ` Jason Wang
  0 siblings, 0 replies; 32+ messages in thread
From: Jason Wang @ 2016-11-30 11:12 UTC (permalink / raw)
  To: Wei Xu, qemu-devel; +Cc: yvugenfi, dfleytma, Ladi Prosek, mst



On 2016-11-30 16:55, Wei Xu wrote:
> On 2016年11月24日 12:17, Jason Wang wrote:
>>
>>
>> On 2016年11月01日 01:41, wexu@redhat.com wrote:
>>> From: Wei Xu <wexu@redhat.com>
>>>
>>> All the data packets in a tcp connection are cached
>>> to a single buffer in every receive interval, and will
>>> be sent out via a timer, the 'virtio_net_rsc_timeout'
>>> controls the interval, this value may impact the
>>> performance and response time of tcp connection,
>>> 50000(50us) is an experience value to gain a performance
>>> improvement, since the whql test sends packets every 100us,
>>> so '300000(300us)' passes the test case, it is the default
>>> value as well, tune it via the command line parameter
>>> 'rsc_interval' within 'virtio-net-pci' device, for example,
>>> to launch a guest with interval set as '500000':
>>>
>>> 'virtio-net-pci,netdev=hostnet1,bus=pci.0,id=net1,mac=00,rsc_interval=500000' 
>>>
>>>
>>>
>>> The timer will only be triggered if the packets pool is not empty,
>>> and it'll drain off all the cached packets.
>>>
>>> 'NetRscChain' is used to save the segments of IPv4/6 in a
>>> VirtIONet device.
>>>
>>> A new segment becomes a 'Candidate' as well as it passed sanity check,
>>> the main handler of TCP includes TCP window update, duplicated
>>> ACK check and the real data coalescing.
>>>
>>> An 'Candidate' segment means:
>>> 1. Segment is within current window and the sequence is the expected 
>>> one.
>>> 2. 'ACK' of the segment is in the valid window.
>>>
>>> Sanity check includes:
>>> 1. Incorrect version in IP header
>>> 2. An IP options or IP fragment
>>> 3. Not a TCP packet
>>> 4. Sanity size check to prevent buffer overflow attack.
>>> 5. An ECN packet
>>>
>>> Even though, there might more cases should be considered such as
>>> ip identification other flags, while it breaks the test because
>>> windows set it to the same even it's not a fragment.
>>>
>>> Normally it includes 2 typical ways to handle a TCP control flag,
>>> 'bypass' and 'finalize', 'bypass' means should be sent out directly,
>>> while 'finalize' means the packets should also be bypassed, but this
>>> should be done after search for the same connection packets in the
>>> pool and drain all of them out, this is to avoid out of order fragment.
>>>
>>> All the 'SYN' packets will be bypassed since this always begin a new'
>>> connection, other flags such 'URG/FIN/RST/CWR/ECE' will trigger a
>>> finalization, because this normally happens upon a connection is going
>>> to be closed, an 'URG' packet also finalize current coalescing unit.
>>>
>>> Statistics can be used to monitor the basic coalescing status, the
>>> 'out of order' and 'out of window' means how many retransmitting 
>>> packets,
>>> thus describe the performance intuitively.
>>>
>>> Signed-off-by: Wei Xu <wexu@redhat.com>
>>> ---
>>>   hw/net/virtio-net.c                         | 602
>>> ++++++++++++++++++++++++++--
>>>   include/hw/virtio/virtio-net.h              |   5 +-
>>>   include/hw/virtio/virtio.h                  |  76 ++++
>>>   include/net/eth.h                           |   2 +
>>>   include/standard-headers/linux/virtio_net.h |  14 +
>>>   net/tap.c                                   |   3 +-
>>>   6 files changed, 670 insertions(+), 32 deletions(-)
>>>
>>> diff --git a/hw/net/virtio-net.c b/hw/net/virtio-net.c
>>> index 06bfe4b..d1824d9 100644
>>> --- a/hw/net/virtio-net.c
>>> +++ b/hw/net/virtio-net.c
>>> @@ -15,10 +15,12 @@
>>>   #include "qemu/iov.h"
>>>   #include "hw/virtio/virtio.h"
>>>   #include "net/net.h"
>>> +#include "net/eth.h"
>>>   #include "net/checksum.h"
>>>   #include "net/tap.h"
>>>   #include "qemu/error-report.h"
>>>   #include "qemu/timer.h"
>>> +#include "qemu/sockets.h"
>>>   #include "hw/virtio/virtio-net.h"
>>>   #include "net/vhost_net.h"
>>>   #include "hw/virtio/virtio-bus.h"
>>> @@ -43,6 +45,24 @@
>>>   #define endof(container, field) \
>>>       (offsetof(container, field) + sizeof(((container *)0)->field))
>>> +#define VIRTIO_NET_IP4_ADDR_SIZE   8        /* ipv4 saddr + daddr */
>>
>> Only used once in the code, I don't see much value of this macro.
>
> Just to keep it a bit readable.

Then you may want to replace this with sizeof(struct ...).
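
Something along these lines, as a sketch only. 'struct toy_ip4_addrs'
and toy_ip4_addr_size() are made-up names standing in for whatever
struct actually holds the two addresses; the point is that the size
comes from the type instead of a hand-written 8.

```c
#include <stddef.h>
#include <stdint.h>

/* Hypothetical stand-in for the address part of an IPv4 header. */
struct toy_ip4_addrs {
    uint32_t saddr;   /* source address */
    uint32_t daddr;   /* destination address */
};

/* Let the compiler check the assumption the magic number encoded. */
_Static_assert(sizeof(struct toy_ip4_addrs) == 8,
               "IPv4 saddr + daddr must be 8 bytes");

/* Number of bytes to compare when matching two IPv4 flows by address. */
size_t toy_ip4_addr_size(void)
{
    return sizeof(struct toy_ip4_addrs);
}
```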

>
>>
>>> +
>>> +#define VIRTIO_NET_TCP_FLAG         0x3F
>>> +#define VIRTIO_NET_TCP_HDR_LENGTH   0xF000
>>> +
>>> +/* IPv4 max payload, 16 bits in the header */
>>> +#define VIRTIO_NET_MAX_IP4_PAYLOAD (65535 - sizeof(struct ip_header))
>>> +#define VIRTIO_NET_MAX_TCP_PAYLOAD 65535
>>> +
>>> +/* header length value in ip header without option */
>>> +#define VIRTIO_NET_IP4_HEADER_LENGTH 5
>>> +
>>> +/* Purge coalesced packets timer interval, This value affects the
>>> performance
>>> +   a lot, and should be tuned carefully, '300000'(300us) is the
>>> recommended
>>> +   value to pass the WHQL test, '50000' can gain 2x netperf
>>> throughput with
>>> +   tso/gso/gro 'off'. */
>>> +#define VIRTIO_NET_RSC_INTERVAL  300000
>>
>> This should be a property for virito-net and the above comment can be
>> the description of the property.
>
> This is a value for a property, actually I hadn't found a place to put
> it.

There's a description field in PropertyInfo, but using it for virtio
properties may need more work, so we can leave this as is for now.

>
>>
>>> +
>>>   typedef struct VirtIOFeature {
>>>       uint32_t flags;
>>>       size_t end;
>>> @@ -589,7 +609,12 @@ static uint64_t
>>> virtio_net_guest_offloads_by_features(uint32_t features)
>>>           (1ULL << VIRTIO_NET_F_GUEST_ECN)  |
>>>           (1ULL << VIRTIO_NET_F_GUEST_UFO);
>>> -    return guest_offloads_mask & features;
>>> +    if (features & VIRTIO_NET_F_CTRL_GUEST_OFFLOADS) {
>>> +        return (guest_offloads_mask & features) |
>>> +               (1ULL << VIRTIO_NET_F_GUEST_RSC4);
>>
>> Why need to care this, I believe RSC has nothing to do with peer's
>> offload setting?
>
> There is some misunderstanding about how does the feature work
> followed with a few subsequent comments, so let me clarify it first.
>
> Currently RSC feature is bundled with 'VIRTIO_NET_F_CTRL_GUEST_OFFLOADS'
> which means once guest driver reports supporting this feature during
> driver initializing, then qemu will initialize RSC feature and use the
> new header with RSC fields to communicate with guest.

Does it mean RSC depends on CTRL_GUEST_OFFLOADS? Any advantages?

>
> While RSC won't coalescing packets before guest driver notify host to
> enable it, the motivation of this is to support dynamically turn on/off
> the feature from guest side, and don't need a new feature bit for this
> feature.
>
> So from the guest's point of view, it can see the new header but all
> packets are still unchanged, once it enables the feature via control
> queue, coalesced packets will comes to the queue.

I believe disabling it by default should be the work of the driver, not
QEMU. When RSC is enabled and negotiated, it should start to coalesce
packets like other offload features.
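
In other words, roughly the following. This is a sketch under two
assumptions: that RSC4 gets its own feature bit (41 is the number used
in this patch; the other numbers are from the virtio-net spec) and that
the feature word is 64 bits wide. toy_guest_offloads_by_features() is a
hypothetical name.

```c
#include <stdint.h>

#define VIRTIO_NET_F_GUEST_CSUM   1   /* from the virtio-net spec */
#define VIRTIO_NET_F_GUEST_TSO4   7
#define VIRTIO_NET_F_GUEST_TSO6   8
#define VIRTIO_NET_F_GUEST_ECN    9
#define VIRTIO_NET_F_GUEST_UFO   10
#define VIRTIO_NET_F_GUEST_RSC4  41   /* value proposed in this patch */

/* RSC4 is treated like any other guest offload: it is active exactly
 * when its own feature bit was negotiated, with no special casing on
 * CTRL_GUEST_OFFLOADS. */
uint64_t toy_guest_offloads_by_features(uint64_t features)
{
    const uint64_t guest_offloads_mask =
        (1ULL << VIRTIO_NET_F_GUEST_CSUM) |
        (1ULL << VIRTIO_NET_F_GUEST_TSO4) |
        (1ULL << VIRTIO_NET_F_GUEST_TSO6) |
        (1ULL << VIRTIO_NET_F_GUEST_ECN)  |
        (1ULL << VIRTIO_NET_F_GUEST_UFO)  |
        (1ULL << VIRTIO_NET_F_GUEST_RSC4);

    return guest_offloads_mask & features;
}
```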

>
>>
>>> +    } else {
>>> +        return guest_offloads_mask & features;
>>> +    }
>>>   }
>>>   static inline uint64_t virtio_net_supported_guest_offloads(VirtIONet
>>> *n)
>>> @@ -600,6 +625,7 @@ static inline uint64_t
>>> virtio_net_supported_guest_offloads(VirtIONet *n)
>>>   static void virtio_net_set_features(VirtIODevice *vdev, uint64_t
>>> features)
>>>   {
>>> +    NetClientState *nc;
>>>       VirtIONet *n = VIRTIO_NET(vdev);
>>>       int i;
>>> @@ -612,6 +638,22 @@ static void virtio_net_set_features(VirtIODevice
>>> *vdev, uint64_t features)
>>>                                  virtio_has_feature(features,
>>> VIRTIO_F_VERSION_1));
>>> +    if (virtio_has_feature(features,
>>> VIRTIO_NET_F_CTRL_GUEST_OFFLOADS)) {
>>> +        n->guest_hdr_len = sizeof(struct virtio_net_hdr_rsc);
>>
>> I'm confused, and don't see the connection here. You check
>> CTRL_GUEST_OFFLOADS but set RSC header here, I don't think
>> CTRL_GUEST_OFFLOADS implies RSC.
>>
>>> +        n->host_hdr_len = n->guest_hdr_len;
>>> +        n->has_rsc_hdr = 1;
>>
>> Why need this extra flag, can't we just check RSC feature instead?
>
> OK.
>
>>
>>> +
>>> +        for (i = 0; i < n->max_queues; i++) {
>>> +            nc = qemu_get_subqueue(n->nic, i);
>>> +
>>> +            if (peer_has_vnet_hdr(n) &&
>>> +                qemu_has_vnet_hdr_len(nc->peer, n->guest_hdr_len)) {
>>> +                qemu_set_vnet_hdr_len(nc->peer, n->guest_hdr_len);
>>> +                n->host_hdr_len = n->guest_hdr_len;
>>> +            }
>>> +        }
>>> +    }
>>
>> Need to move hdr len setting to another helper, otherwise it may be set
>> twice. Once for mrg_rxbuf and another is for RSC.
>
> Do you know where should i put it to?

Introduce a new helper and put all the vnet header check and set logic
there instead of doing this twice.
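
One way to read this is a single function that decides the header
length in one place. A rough sketch, assuming the three header layouts
discussed in this patch; the toy_* names are hypothetical and the
structs are simplified copies (plain uint16_t instead of __virtio16),
not the actual QEMU code.

```c
#include <stdbool.h>
#include <stddef.h>
#include <stdint.h>

/* Simplified copies of the virtio-net headers, for size purposes only. */
struct toy_net_hdr {
    uint8_t  flags;
    uint8_t  gso_type;
    uint16_t hdr_len;
    uint16_t gso_size;
    uint16_t csum_start;
    uint16_t csum_offset;
};

/* Same 12-byte layout as virtio_net_hdr_v1. */
struct toy_net_hdr_mrg_rxbuf {
    struct toy_net_hdr hdr;
    uint16_t num_buffers;
};

/* Layout proposed by this patch: v1 header plus two RSC fields. */
struct toy_net_hdr_rsc {
    struct toy_net_hdr_mrg_rxbuf hdr;
    uint16_t rsc_pkts;
    uint16_t rsc_dup_acks;
};

/* Single place that maps the negotiated features to a header length. */
size_t toy_guest_hdr_len(bool version_1, bool mergeable_rx_bufs, bool rsc)
{
    if (rsc) {
        return sizeof(struct toy_net_hdr_rsc);
    }
    if (version_1 || mergeable_rx_bufs) {
        return sizeof(struct toy_net_hdr_mrg_rxbuf);
    }
    return sizeof(struct toy_net_hdr);
}
```

The caller would then set guest_hdr_len once from this result and
propagate it to the peer, rather than having the mrg_rxbuf path and the
RSC path each set it.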

>
>>
>>> +
>>>       if (n->has_vnet_hdr) {
>>>           n->curr_guest_offloads =
>>>               virtio_net_guest_offloads_by_features(features);
>>> @@ -1097,7 +1139,8 @@ static int receive_filter(VirtIONet *n, const
>>> uint8_t *buf, int size)
>>>       return 0;
>>>   }
>>> -static ssize_t virtio_net_receive(NetClientState *nc, const uint8_t
>>> *buf, size_t size)
>>> +static ssize_t virtio_net_do_receive(NetClientState *nc,
>>> +                                     const uint8_t *buf, size_t size)
>>>   {
>>>       VirtIONet *n = qemu_get_nic_opaque(nc);
>>>       VirtIONetQueue *q = virtio_net_get_subqueue(nc);
>>> @@ -1161,6 +1204,12 @@ static ssize_t
>>> virtio_net_receive(NetClientState *nc, const uint8_t *buf, size_t
>>>               }
>>>               receive_header(n, sg, elem->in_num, buf, size);
>>> +
>>> +            if (n->has_rsc_hdr) {
>>> +                offset = sizeof(struct virtio_net_hdr_mrg_rxbuf);
>>> +                iov_from_buf(sg, elem->in_num, offset, \
>>> +                             buf + offset, 4);
>>
>> Don't get why this is needed.
>
> This is to put the RSS fields.

OK, but I can't find the code that stores the RSC fields. You may also
want to unify this logic with the mrg_rxbuf header copy.

>
>>
>>> +            }
>>>               offset = n->host_hdr_len;
>>>               total += n->guest_hdr_len;
>>>               guest_offset = n->guest_hdr_len;
>>> @@ -1239,7 +1288,7 @@ static int32_t
>>> virtio_net_flush_tx(VirtIONetQueue *q)
>>>           ssize_t ret;
>>>           unsigned int out_num;
>>>           struct iovec sg[VIRTQUEUE_MAX_SIZE], sg2[VIRTQUEUE_MAX_SIZE
>>> + 1], *out_sg;
>>> -        struct virtio_net_hdr_mrg_rxbuf mhdr;
>>> +        struct virtio_net_hdr_rsc rsc_hdr;
>>>           elem = virtqueue_pop(q->tx_vq, sizeof(VirtQueueElement));
>>>           if (!elem) {
>>> @@ -1256,26 +1305,27 @@ static int32_t
>>> virtio_net_flush_tx(VirtIONetQueue *q)
>>>           }
>>>           if (n->has_vnet_hdr) {
>>> -            if (iov_to_buf(out_sg, out_num, 0, &mhdr,
>>> n->guest_hdr_len) <
>>> +            if (iov_to_buf(out_sg, out_num, 0, &rsc_hdr,
>>> n->guest_hdr_len) <
>>>                   n->guest_hdr_len) {
>>>                   virtio_error(vdev, "virtio-net header incorrect");
>>>                   virtqueue_detach_element(q->tx_vq, elem, 0);
>>>                   g_free(elem);
>>>                   return -EINVAL;
>>>               }
>>> +
>>
>> Unnecessary newline.
>
> forgive my typo, maybe caused by the indent in my vi profile, thanks
>
>>
>>>               if (n->needs_vnet_hdr_swap) {
>>> -                virtio_net_hdr_swap(vdev, (void *) &mhdr);
>>> -                sg2[0].iov_base = &mhdr;
>>> +                virtio_net_hdr_swap(vdev, (void *) &rsc_hdr);
>>> +                sg2[0].iov_base = &rsc_hdr;
>>>                   sg2[0].iov_len = n->guest_hdr_len;
>>>                   out_num = iov_copy(&sg2[1], ARRAY_SIZE(sg2) - 1,
>>>                                      out_sg, out_num,
>>>                                      n->guest_hdr_len, -1);
>>>                   if (out_num == VIRTQUEUE_MAX_SIZE) {
>>>                       goto drop;
>>> -        }
>>> +                }
>>
>> Unnecessary change.
>
> OK.
>
>>
>>>                   out_num += 1;
>>>                   out_sg = sg2;
>>> -        }
>>> +            }
>>
>> Here too.
>
> OK.
>
>>

[...]

>> VIRTIO_NET_RX_QUEUE_DEFAULT_SIZE),
>> +    DEFINE_PROP_BIT64("guest_rsc4", VirtIONet, host_features,
>> +                    VIRTIO_NET_F_GUEST_RSC4, true),
>>
>> Don't get why need DEFINE_XXX_BIT64, we still have left bits I believe.
>>
>>> +    DEFINE_PROP_UINT32("rsc_interval", VirtIONet, rsc_timeout,
>>> +                      VIRTIO_NET_RSC_INTERVAL),
>>>       DEFINE_PROP_END_OF_LIST(),
>>>   };
>>> diff --git a/include/hw/virtio/virtio-net.h
>>> b/include/hw/virtio/virtio-net.h
>>> index 0ced975..56a8ce2 100644
>>> --- a/include/hw/virtio/virtio-net.h
>>> +++ b/include/hw/virtio/virtio-net.h
>>> @@ -60,12 +60,15 @@ typedef struct VirtIONet {
>>>       VirtIONetQueue *vqs;
>>>       VirtQueue *ctrl_vq;
>>>       NICState *nic;
>>> +    QTAILQ_HEAD(, NetRscChain) rsc_chains;
>>> +    uint32_t rsc_timeout;
>>>       uint32_t tx_timeout;
>>>       int32_t tx_burst;
>>>       uint32_t has_vnet_hdr;
>>> +    uint32_t has_rsc_hdr;
>>>       size_t host_hdr_len;
>>>       size_t guest_hdr_len;
>>> -    uint32_t host_features;
>>> +    uint64_t host_features;
>>
>> Do we run out of host features? If yes, need an independent patch for 
>> this.
>
> OK.
>
>>
>>>       uint8_t has_ufo;
>>>       int mergeable_rx_bufs;
>>>       uint8_t promisc;
>>> diff --git a/include/hw/virtio/virtio.h b/include/hw/virtio/virtio.h
>>> index b913aac..0006ce1 100644
>>> --- a/include/hw/virtio/virtio.h
>>> +++ b/include/hw/virtio/virtio.h
>>> @@ -30,6 +30,8 @@
>>>                                   (0x1ULL << VIRTIO_F_ANY_LAYOUT))
>>>   struct VirtQueue;
>>> +struct VirtIONet;
>>> +typedef struct VirtIONet VirtIONet;
>>>   static inline hwaddr vring_align(hwaddr addr,
>>>                                                unsigned long align)
>>> @@ -129,6 +131,80 @@ typedef struct VirtioDeviceClass {
>>>       int (*load)(VirtIODevice *vdev, QEMUFile *f, int version_id);
>>>   } VirtioDeviceClass;
>>> +/* Coalesced packets type & status */
>>> +typedef enum {
>>> +    RSC_COALESCE,           /* Data been coalesced */
>>
>> "Data has been" ?
>
> Thanks.
>
>>
>>> +    RSC_FINAL,              /* Will terminate current connection */
>>> +    RSC_NO_MATCH,           /* No matched in the buffer pool */
>>> +    RSC_BYPASS,             /* Packet to be bypass, not tcp, tcp
>>> ctrl, etc */
>>
>> "to be bypassed" ?
>
> OK.
>
>>
>>> +    RSC_CANDIDATE                /* Data want to be coalesced */
>>> +} COALESCE_STATUS;
>>> +
>>> +typedef struct NetRscStat {
>>> +    uint32_t received;
>>> +    uint32_t coalesced;
>>> +    uint32_t over_size;
>>> +    uint32_t cache;
>>> +    uint32_t empty_cache;
>>> +    uint32_t no_match_cache;
>>> +    uint32_t win_update;
>>> +    uint32_t no_match;
>>> +    uint32_t tcp_syn;
>>> +    uint32_t tcp_ctrl_drain;
>>> +    uint32_t dup_ack;
>>> +    uint32_t dup_ack1;
>>> +    uint32_t dup_ack2;
>>> +    uint32_t pure_ack;
>>> +    uint32_t ack_out_of_win;
>>> +    uint32_t data_out_of_win;
>>> +    uint32_t data_out_of_order;
>>> +    uint32_t data_after_pure_ack;
>>> +    uint32_t bypass_not_tcp;
>>> +    uint32_t tcp_option;
>>> +    uint32_t tcp_all_opt;
>>> +    uint32_t ip_frag;
>>> +    uint32_t ip_ecn;
>>> +    uint32_t ip_hacked;
>>> +    uint32_t ip_option;
>>> +    uint32_t purge_failed;
>>> +    uint32_t drain_failed;
>>> +    uint32_t final_failed;
>>> +    int64_t  timer;
>>> +} NetRscStat;
>>> +
>>> +/* Rsc unit general info used to checking if can coalescing */
>>> +typedef struct NetRscUnit {
>>> +    void *ip;   /* ip header */
>>> +    uint16_t *ip_plen;      /* data len pointer in ip header field */
>>> +    struct tcp_header *tcp; /* tcp header */
>>> +    uint16_t tcp_hdrlen;    /* tcp header len */
>>> +    uint16_t payload;       /* pure payload without 
>>> virtio/eth/ip/tcp */
>>> +} NetRscUnit;
>>> +
>>> +/* Coalesced segmant */
>>> +typedef struct NetRscSeg {
>>> +    QTAILQ_ENTRY(NetRscSeg) next;
>>> +    void *buf;
>>> +    size_t size;
>>> +    uint16_t packets;
>>> +    uint16_t dup_ack;
>>> +    bool is_coalesced;      /* need recal ipv4 header checksum, mark
>>> here */
>>> +    NetRscUnit unit;
>>> +    NetClientState *nc;
>>> +} NetRscSeg;
>>> +
>>> +/* Chain is divided by protocol(ipv4/v6) and NetClientInfo */
>>> +typedef struct NetRscChain {
>>> +    QTAILQ_ENTRY(NetRscChain) next;
>>> +    VirtIONet *n;                            /* VirtIONet */
>>> +    uint16_t proto;
>>> +    uint8_t  gso_type;
>>> +    uint16_t max_payload;
>>> +    QEMUTimer *drain_timer;
>>> +    QTAILQ_HEAD(, NetRscSeg) buffers;
>>> +    NetRscStat stat;
>>> +} NetRscChain;
>>> +
>>
>> Why put the above in virtio.h? If it will not be used by other files,
>> why need put them in header file?
>
> OK, I will put them to virtio-net.h.

Looks like virtio-net.c is better, since no other file needs those.

>
>>
>>>   void virtio_instance_init_common(Object *proxy_obj, void *data,
>>>                                    size_t vdev_size, const char
>>> *vdev_name);
>>> diff --git a/include/net/eth.h b/include/net/eth.h
>>> index 2013175..5952ef2 100644
>>> --- a/include/net/eth.h
>>> +++ b/include/net/eth.h
>>> @@ -177,6 +177,8 @@ struct tcp_hdr {
>>>   #define TH_PUSH 0x08
>>>   #define TH_ACK  0x10
>>>   #define TH_URG  0x20
>>> +#define TH_ECE  0x40
>>> +#define TH_CWR  0x80
>>
>> Let's put this in another patch.
>
> OK.
>
>>
>>>       u_short th_win;      /* window */
>>>       u_short th_sum;      /* checksum */
>>>       u_short th_urp;      /* urgent pointer */
>>> diff --git a/include/standard-headers/linux/virtio_net.h
>>> b/include/standard-headers/linux/virtio_net.h
>>> index 30ff249..e67b36e 100644
>>> --- a/include/standard-headers/linux/virtio_net.h
>>> +++ b/include/standard-headers/linux/virtio_net.h
>>> @@ -57,6 +57,9 @@
>>>                        * Steering */
>>>   #define VIRTIO_NET_F_CTRL_MAC_ADDR 23    /* Set MAC address */
>>> +/* Guest can handle coalesced ipv4-tcp packets */
>>> +#define VIRTIO_NET_F_GUEST_RSC4    41
>>
>> Why not use 24?
>>
>>> +
>>>   #ifndef VIRTIO_NET_NO_LEGACY
>>>   #define VIRTIO_NET_F_GSO    6    /* Host handles pkts w/ any GSO
>>> type */
>>>   #endif /* VIRTIO_NET_NO_LEGACY */
>>> @@ -94,6 +97,9 @@ struct virtio_net_hdr_v1 {
>>>   #define VIRTIO_NET_HDR_GSO_UDP        3    /* GSO frame, IPv4 UDP
>>> (UFO) */
>>>   #define VIRTIO_NET_HDR_GSO_TCPV6    4    /* GSO frame, IPv6 TCP */
>>>   #define VIRTIO_NET_HDR_GSO_ECN        0x80    /* TCP has ECN set */
>>> +#define VIRTIO_NET_HDR_RSC_NONE     5   /* No packets coalesced */
>>
>> Not sure this is really needed. Can we just use GSO_NONE?
>
> Of course we can, but it is better to keep this feature distinguished.

Are there any advantages to doing this? I believe the guest does not
care about this.

>
>>
>> And I believe we should not try to coalesce GSO packets since we're
>> lacking sufficient information for a correct rsc_pkts or rsc_dup_acks
>> from the backend.
>>
>>> +#define VIRTIO_NET_HDR_RSC_TCPV4    6 /* IPv4 TCP coalesced */
>>> +#define VIRTIO_NET_HDR_RSC_TCPV6    7   /* IPv6 TCP coalesced */
>>>       uint8_t gso_type;
>>>       __virtio16 hdr_len;    /* Ethernet + IP + tcp/udp hdrs */
>>>       __virtio16 gso_size;    /* Bytes to append to hdr_len per 
>>> frame */
>>> @@ -124,6 +130,14 @@ struct virtio_net_hdr_mrg_rxbuf {
>>>       struct virtio_net_hdr hdr;
>>>       __virtio16 num_buffers;    /* Number of merged rx buffers */
>>>   };
>>> +
>>> +/* This is the header to use when either one or both of GUEST_RSC4/6
>>> + * features have been negotiated. */
>>> +struct virtio_net_hdr_rsc {
>>> +    struct virtio_net_hdr_v1 hdr;
>>
>> If RSC depends on VERSION_1, need to forbid creating RSC device without
>> VERSION_1.
>
> How to do it?

Fail early on device_plugged.
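
For example, a check of this shape could run when the device is
plugged. A sketch only: toy_check_rsc() is a hypothetical name, and bit
41 for VIRTIO_NET_F_GUEST_RSC4 is the value proposed in this patch, not
a value from the spec.

```c
#include <stddef.h>
#include <stdint.h>

#define VIRTIO_F_VERSION_1       32  /* from the virtio spec */
#define VIRTIO_NET_F_GUEST_RSC4  41  /* value proposed in this patch */

/* Returns 0 if the feature set is acceptable, -1 otherwise.
 * On failure, *errp (if errp is non-NULL) points to a static message. */
int toy_check_rsc(uint64_t host_features, const char **errp)
{
    int has_rsc = (int)((host_features >> VIRTIO_NET_F_GUEST_RSC4) & 1);
    int has_v1  = (int)((host_features >> VIRTIO_F_VERSION_1) & 1);

    if (has_rsc && !has_v1) {
        if (errp) {
            *errp = "guest_rsc4 requires a VERSION_1 device";
        }
        return -1;  /* refuse to create the device */
    }
    return 0;
}
```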

> also a question here, which header will be used if a device is not 
> virtio 1.0 compliant?

Mergeable header is mandatory for 1.0 but selectable for 0.9x.

>
>>
>>> +    __virtio16 rsc_pkts;        /* Number of coalesced packets */
>>> +    __virtio16 rsc_dup_acks;    /* Duplicated ack packets */
>>> +};
>>>   #endif /* ...VIRTIO_NET_NO_LEGACY */
>>>   /*
>>> diff --git a/net/tap.c b/net/tap.c
>>> index b6896a7..4557aa5 100644
>>> --- a/net/tap.c
>>> +++ b/net/tap.c
>>> @@ -251,7 +251,8 @@ static void tap_set_vnet_hdr_len(NetClientState
>>> *nc, int len)
>>>       TAPState *s = DO_UPCAST(TAPState, nc, nc);
>>>       assert(nc->info->type == NET_CLIENT_DRIVER_TAP);
>>> -    assert(len == sizeof(struct virtio_net_hdr_mrg_rxbuf) ||
>>> +    assert(len == sizeof(struct virtio_net_hdr_rsc) ||
>>> +           len == sizeof(struct virtio_net_hdr_mrg_rxbuf) ||
>>>              len == sizeof(struct virtio_net_hdr));
>>>       tap_fd_set_vnet_hdr_len(s->fd, len);
>>
>


* Re: [Qemu-devel] [PATCH 1/2] virtio-net rsc: support coalescing ipv4 tcp traffic
  2016-11-24  4:17   ` Jason Wang
  2016-11-24  4:26     ` Michael S. Tsirkin
@ 2016-11-30  8:55     ` Wei Xu
  2016-11-30 11:12       ` Jason Wang
  1 sibling, 1 reply; 32+ messages in thread
From: Wei Xu @ 2016-11-30  8:55 UTC (permalink / raw)
  To: Jason Wang, qemu-devel; +Cc: mst, dfleytma, yvugenfi, Ladi Prosek

On 2016-11-24 12:17, Jason Wang wrote:
>
>
> On 2016-11-01 01:41, wexu@redhat.com wrote:
>> From: Wei Xu <wexu@redhat.com>
>>
>> All the data packets of a TCP connection are cached in a single
>> buffer during each receive interval and are sent out by a timer.
>> 'virtio_net_rsc_timeout' controls the interval; this value may affect
>> the performance and response time of the TCP connection. 50000 (50us)
>> is an empirical value that gives a performance improvement, but since
>> the WHQL test sends packets every 100us, '300000' (300us) is what
>> passes the test case, and it is also the default. Tune it via the
>> 'rsc_interval' command-line parameter of the 'virtio-net-pci' device;
>> for example, to launch a guest with the interval set to '500000':
>>
>> 'virtio-net-pci,netdev=hostnet1,bus=pci.0,id=net1,mac=00,rsc_interval=500000'
>>
>>
>> The timer will only be triggered if the packets pool is not empty,
>> and it'll drain off all the cached packets.
>>
>> 'NetRscChain' is used to save the segments of IPv4/6 in a
>> VirtIONet device.
>>
>> A new segment becomes a 'Candidate' once it has passed the sanity check;
>> the main TCP handler covers TCP window updates, duplicated ACK checks
>> and the actual data coalescing.
>>
>> A 'Candidate' segment means:
>> 1. The segment is within the current window and its sequence number is
>>    the expected one.
>> 2. The 'ACK' of the segment is in the valid window.
>>
>> The sanity check rejects:
>> 1. An incorrect version in the IP header
>> 2. IP options or an IP fragment
>> 3. A non-TCP packet
>> 4. A bad size (checked to prevent buffer overflow attacks)
>> 5. An ECN packet
>>
>> More cases could be considered, such as the IP identification field and
>> other flags, but checking them breaks the test because Windows sets the
>> identification to the same value even when the packet is not a fragment.
>>
>> There are 2 typical ways to handle a TCP control flag, 'bypass' and
>> 'finalize'. 'bypass' means the packet should be sent out directly,
>> while 'finalize' means the packet should also be bypassed, but only
>> after searching the pool for packets of the same connection and
>> draining all of them, to avoid out-of-order delivery.
>>
>> All 'SYN' packets are bypassed since they always begin a new
>> connection. Other flags such as 'URG/FIN/RST/CWR/ECE' trigger a
>> finalization, because they normally appear when a connection is about
>> to be closed; a 'URG' packet also finalizes the current coalescing unit.
>>
>> Statistics can be used to monitor the basic coalescing status; the
>> 'out of order' and 'out of window' counters show how many packets were
>> retransmitted, and thus describe the performance intuitively.
>>
>> Signed-off-by: Wei Xu <wexu@redhat.com>
>> ---
>>   hw/net/virtio-net.c                         | 602
>> ++++++++++++++++++++++++++--
>>   include/hw/virtio/virtio-net.h              |   5 +-
>>   include/hw/virtio/virtio.h                  |  76 ++++
>>   include/net/eth.h                           |   2 +
>>   include/standard-headers/linux/virtio_net.h |  14 +
>>   net/tap.c                                   |   3 +-
>>   6 files changed, 670 insertions(+), 32 deletions(-)
>>
>> diff --git a/hw/net/virtio-net.c b/hw/net/virtio-net.c
>> index 06bfe4b..d1824d9 100644
>> --- a/hw/net/virtio-net.c
>> +++ b/hw/net/virtio-net.c
>> @@ -15,10 +15,12 @@
>>   #include "qemu/iov.h"
>>   #include "hw/virtio/virtio.h"
>>   #include "net/net.h"
>> +#include "net/eth.h"
>>   #include "net/checksum.h"
>>   #include "net/tap.h"
>>   #include "qemu/error-report.h"
>>   #include "qemu/timer.h"
>> +#include "qemu/sockets.h"
>>   #include "hw/virtio/virtio-net.h"
>>   #include "net/vhost_net.h"
>>   #include "hw/virtio/virtio-bus.h"
>> @@ -43,6 +45,24 @@
>>   #define endof(container, field) \
>>       (offsetof(container, field) + sizeof(((container *)0)->field))
>> +#define VIRTIO_NET_IP4_ADDR_SIZE   8        /* ipv4 saddr + daddr */
>
> Only used once in the code, I don't see much value of this macro.

Just to keep it a bit readable.

>
>> +
>> +#define VIRTIO_NET_TCP_FLAG         0x3F
>> +#define VIRTIO_NET_TCP_HDR_LENGTH   0xF000
>> +
>> +/* IPv4 max payload, 16 bits in the header */
>> +#define VIRTIO_NET_MAX_IP4_PAYLOAD (65535 - sizeof(struct ip_header))
>> +#define VIRTIO_NET_MAX_TCP_PAYLOAD 65535
>> +
>> +/* header length value in ip header without option */
>> +#define VIRTIO_NET_IP4_HEADER_LENGTH 5
>> +
>> +/* Purge coalesced packets timer interval, This value affects the
>> performance
>> +   a lot, and should be tuned carefully, '300000'(300us) is the
>> recommended
>> +   value to pass the WHQL test, '50000' can gain 2x netperf
>> throughput with
>> +   tso/gso/gro 'off'. */
>> +#define VIRTIO_NET_RSC_INTERVAL  300000
>
> This should be a property for virtio-net and the above comment can be
> the description of the property.

This is the default value for a property; I just hadn't found a better
place to put it.

>
>> +
>>   typedef struct VirtIOFeature {
>>       uint32_t flags;
>>       size_t end;
>> @@ -589,7 +609,12 @@ static uint64_t
>> virtio_net_guest_offloads_by_features(uint32_t features)
>>           (1ULL << VIRTIO_NET_F_GUEST_ECN)  |
>>           (1ULL << VIRTIO_NET_F_GUEST_UFO);
>> -    return guest_offloads_mask & features;
>> +    if (features & VIRTIO_NET_F_CTRL_GUEST_OFFLOADS) {
>> +        return (guest_offloads_mask & features) |
>> +               (1ULL << VIRTIO_NET_F_GUEST_RSC4);
>
> Why do we need to care about this? I believe RSC has nothing to do with
> the peer's offload setting.

There is some misunderstanding about how the feature works, which also
affects a few of the subsequent comments, so let me clarify it first.

Currently the RSC feature is bundled with 'VIRTIO_NET_F_CTRL_GUEST_OFFLOADS',
which means that once the guest driver reports support for this feature
during driver initialization, qemu will initialize the RSC feature and use
the new header with the RSC fields to communicate with the guest.

However, RSC won't coalesce packets until the guest driver notifies the
host to enable it. The motivation is to support turning the feature on and
off dynamically from the guest side without needing a new feature bit.

So from the guest's point of view, it sees the new header but all
packets are still unchanged; once it enables the feature via the control
queue, coalesced packets will arrive on the queue.
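
The two-stage behaviour described above can be sketched as a tiny state
machine. This is a standalone illustration, not QEMU code: the sketch_*
names and the RSC bit number are hypothetical, and starting with all
offloads disabled after negotiation is a simplification made here.

```c
#include <assert.h>
#include <stdbool.h>
#include <stddef.h>
#include <stdint.h>

/* VIRTIO_NET_F_CTRL_GUEST_OFFLOADS is feature bit 2 in the virtio spec. */
#define SKETCH_F_CTRL_GUEST_OFFLOADS 2
/* Hypothetical bit for the RSC offload. */
#define SKETCH_F_GUEST_RSC4 41

#define SKETCH_HDR_LEN_MRG 12  /* mergeable header */
#define SKETCH_HDR_LEN_RSC 16  /* plus rsc_pkts and rsc_dup_acks */

struct sketch_nic {
    size_t   guest_hdr_len;
    bool     rsc_capable;
    uint64_t curr_guest_offloads;
};

/* Feature negotiation: only the header layout changes here. */
static void sketch_set_features(struct sketch_nic *n, uint64_t features)
{
    n->rsc_capable =
        (features & (1ULL << SKETCH_F_CTRL_GUEST_OFFLOADS)) != 0;
    n->guest_hdr_len = n->rsc_capable ? SKETCH_HDR_LEN_RSC
                                      : SKETCH_HDR_LEN_MRG;
    n->curr_guest_offloads = 0;
}

/* Control-queue request: the guest turns offloads on or off at run time. */
static void sketch_ctrl_set_offloads(struct sketch_nic *n, uint64_t offloads)
{
    assert(n->rsc_capable || !(offloads & (1ULL << SKETCH_F_GUEST_RSC4)));
    n->curr_guest_offloads = offloads;
}

/* Receive path: coalesce only while the guest has RSC switched on. */
static bool sketch_should_coalesce(const struct sketch_nic *n)
{
    return (n->curr_guest_offloads & (1ULL << SKETCH_F_GUEST_RSC4)) != 0;
}
```

With this split the guest sees the 16-byte header as soon as features
are negotiated, while coalescing starts only after the control-queue
request.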

>
>> +    } else {
>> +        return guest_offloads_mask & features;
>> +    }
>>   }
>>   static inline uint64_t virtio_net_supported_guest_offloads(VirtIONet
>> *n)
>> @@ -600,6 +625,7 @@ static inline uint64_t
>> virtio_net_supported_guest_offloads(VirtIONet *n)
>>   static void virtio_net_set_features(VirtIODevice *vdev, uint64_t
>> features)
>>   {
>> +    NetClientState *nc;
>>       VirtIONet *n = VIRTIO_NET(vdev);
>>       int i;
>> @@ -612,6 +638,22 @@ static void virtio_net_set_features(VirtIODevice
>> *vdev, uint64_t features)
>>                                  virtio_has_feature(features,
>>                                                     VIRTIO_F_VERSION_1));
>> +    if (virtio_has_feature(features,
>> VIRTIO_NET_F_CTRL_GUEST_OFFLOADS)) {
>> +        n->guest_hdr_len = sizeof(struct virtio_net_hdr_rsc);
>
> I'm confused, and don't see the connection here. You check
> CTRL_GUEST_OFFLOADS but set RSC header here, I don't think
> CTRL_GUEST_OFFLOADS implies RSC.
>
>> +        n->host_hdr_len = n->guest_hdr_len;
>> +        n->has_rsc_hdr = 1;
>
> Why do we need this extra flag? Can't we just check the RSC feature instead?

OK.

>
>> +
>> +        for (i = 0; i < n->max_queues; i++) {
>> +            nc = qemu_get_subqueue(n->nic, i);
>> +
>> +            if (peer_has_vnet_hdr(n) &&
>> +                qemu_has_vnet_hdr_len(nc->peer, n->guest_hdr_len)) {
>> +                qemu_set_vnet_hdr_len(nc->peer, n->guest_hdr_len);
>> +                n->host_hdr_len = n->guest_hdr_len;
>> +            }
>> +        }
>> +    }
>
> The hdr len setting needs to move to another helper, otherwise it may be
> set twice: once for mrg_rxbuf and once for RSC.

Do you know where I should put it?

>
>> +
>>       if (n->has_vnet_hdr) {
>>           n->curr_guest_offloads =
>>               virtio_net_guest_offloads_by_features(features);
>> @@ -1097,7 +1139,8 @@ static int receive_filter(VirtIONet *n, const
>> uint8_t *buf, int size)
>>       return 0;
>>   }
>> -static ssize_t virtio_net_receive(NetClientState *nc, const uint8_t
>> *buf, size_t size)
>> +static ssize_t virtio_net_do_receive(NetClientState *nc,
>> +                                     const uint8_t *buf, size_t size)
>>   {
>>       VirtIONet *n = qemu_get_nic_opaque(nc);
>>       VirtIONetQueue *q = virtio_net_get_subqueue(nc);
>> @@ -1161,6 +1204,12 @@ static ssize_t
>> virtio_net_receive(NetClientState *nc, const uint8_t *buf, size_t
>>               }
>>               receive_header(n, sg, elem->in_num, buf, size);
>> +
>> +            if (n->has_rsc_hdr) {
>> +                offset = sizeof(struct virtio_net_hdr_mrg_rxbuf);
>> +                iov_from_buf(sg, elem->in_num, offset, \
>> +                             buf + offset, 4);
>
> Don't get why this is needed.

This is to copy the RSC fields (rsc_pkts and rsc_dup_acks) into the guest
buffer.

>
>> +            }
>>               offset = n->host_hdr_len;
>>               total += n->guest_hdr_len;
>>               guest_offset = n->guest_hdr_len;
>> @@ -1239,7 +1288,7 @@ static int32_t
>> virtio_net_flush_tx(VirtIONetQueue *q)
>>           ssize_t ret;
>>           unsigned int out_num;
>>           struct iovec sg[VIRTQUEUE_MAX_SIZE], sg2[VIRTQUEUE_MAX_SIZE
>> + 1], *out_sg;
>> -        struct virtio_net_hdr_mrg_rxbuf mhdr;
>> +        struct virtio_net_hdr_rsc rsc_hdr;
>>           elem = virtqueue_pop(q->tx_vq, sizeof(VirtQueueElement));
>>           if (!elem) {
>> @@ -1256,26 +1305,27 @@ static int32_t
>> virtio_net_flush_tx(VirtIONetQueue *q)
>>           }
>>           if (n->has_vnet_hdr) {
>> -            if (iov_to_buf(out_sg, out_num, 0, &mhdr,
>> n->guest_hdr_len) <
>> +            if (iov_to_buf(out_sg, out_num, 0, &rsc_hdr,
>> n->guest_hdr_len) <
>>                   n->guest_hdr_len) {
>>                   virtio_error(vdev, "virtio-net header incorrect");
>>                   virtqueue_detach_element(q->tx_vq, elem, 0);
>>                   g_free(elem);
>>                   return -EINVAL;
>>               }
>> +
>
> Unnecessary newline.

Forgive the stray newline; it was probably caused by the indent settings
in my vi profile. Thanks.

>
>>               if (n->needs_vnet_hdr_swap) {
>> -                virtio_net_hdr_swap(vdev, (void *) &mhdr);
>> -                sg2[0].iov_base = &mhdr;
>> +                virtio_net_hdr_swap(vdev, (void *) &rsc_hdr);
>> +                sg2[0].iov_base = &rsc_hdr;
>>                   sg2[0].iov_len = n->guest_hdr_len;
>>                   out_num = iov_copy(&sg2[1], ARRAY_SIZE(sg2) - 1,
>>                                      out_sg, out_num,
>>                                      n->guest_hdr_len, -1);
>>                   if (out_num == VIRTQUEUE_MAX_SIZE) {
>>                       goto drop;
>> -        }
>> +                }
>
> Unnecessary change.

OK.

>
>>                   out_num += 1;
>>                   out_sg = sg2;
>> -        }
>> +            }
>
> Here too.

OK.

>
>>           }
>>           /*
>>            * If host wants to see the guest header as is, we can
>> @@ -1562,8 +1612,12 @@ static int virtio_net_load_device(VirtIODevice
>> *vdev, QEMUFile *f,
>>                                  virtio_vdev_has_feature(vdev,
>>
>> VIRTIO_F_VERSION_1));
>> -    n->status = qemu_get_be16(f);
>> +    if (virtio_vdev_has_feature(vdev, VIRTIO_NET_F_GUEST_RSC4)) {
>> +        n->guest_hdr_len = sizeof(struct virtio_net_hdr_rsc);
>> +        n->host_hdr_len = n->guest_hdr_len;
>> +    }
>
> Why is this needed? Btw, guest-visible features need to stay unchanged
> through the qemu cli during migration.

Same issue as with the feature bit.

>
>> +    n->status = qemu_get_be16(f);
>>       n->promisc = qemu_get_byte(f);
>>       n->allmulti = qemu_get_byte(f);
>> @@ -1660,6 +1714,487 @@ static int virtio_net_load_device(VirtIODevice
>> *vdev, QEMUFile *f,
>>       return 0;
>>   }
>> +static void virtio_net_rsc_extract_unit4(NetRscChain *chain,
>> +                                         const uint8_t *buf,
>> NetRscUnit* unit)
>> +{
>> +    uint16_t ip_hdrlen;
>> +    struct ip_header *ip;
>> +
>> +    ip = (struct ip_header *)(buf + chain->n->guest_hdr_len
>> +                              + sizeof(struct eth_header));
>> +    unit->ip = (void *)ip;
>> +    ip_hdrlen = (ip->ip_ver_len & 0xF) << 2;
>> +    unit->ip_plen = &ip->ip_len;
>> +    unit->tcp = (struct tcp_header *)(((uint8_t *)unit->ip) +
>> ip_hdrlen);
>> +    unit->tcp_hdrlen = (htons(unit->tcp->th_offset_flags) & 0xF000)
>> >> 10;
>> +    unit->payload = htons(*unit->ip_plen) - ip_hdrlen -
>> unit->tcp_hdrlen;
>> +}
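
For readers following the offset arithmetic in
virtio_net_rsc_extract_unit4(), here is a minimal standalone sketch of
the same header-length decoding. The sketch_* helpers are illustrative
and not part of the patch; inputs are assumed to be already converted
to host byte order.

```c
#include <assert.h>
#include <stdint.h>

/* IPv4: the low nibble of the first byte is the header length in
 * 32-bit words, so shifting left by 2 gives bytes. */
static uint16_t sketch_ip_hdrlen(uint8_t ip_ver_len)
{
    return (uint16_t)((ip_ver_len & 0x0F) << 2);
}

/* TCP: the top nibble of the 16-bit offset/flags word is the data
 * offset in 32-bit words; ">> 12" followed by "* 4" equals ">> 10". */
static uint16_t sketch_tcp_hdrlen(uint16_t offset_flags_host)
{
    return (uint16_t)((offset_flags_host & 0xF000) >> 10);
}

/* TCP payload bytes, given the IPv4 total length in host order. */
static uint16_t sketch_tcp_payload(uint16_t ip_total_len,
                                   uint16_t ip_hdrlen,
                                   uint16_t tcp_hdrlen)
{
    assert(ip_total_len >= ip_hdrlen + tcp_hdrlen);
    return (uint16_t)(ip_total_len - ip_hdrlen - tcp_hdrlen);
}
```

For example, a first byte of 0x45 decodes to a 20-byte IP header, and an
offset/flags word of 0x8018 decodes to a 32-byte TCP header (20 bytes
plus 12 bytes of options).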
>> +
>> +static void virtio_net_rsc_ipv4_checksum(struct virtio_net_hdr_rsc
>> *rhdr,
>> +                                         struct ip_header *ip)
>> +{
>> +    uint32_t sum;
>> +    struct virtio_net_hdr *vhdr = (struct virtio_net_hdr *)rhdr;
>> +
>> +    ip->ip_sum = 0;
>> +    sum = net_checksum_add_cont(sizeof(struct ip_header), (uint8_t
>> *)ip, 0);
>> +    ip->ip_sum = cpu_to_be16(net_checksum_finish(sum));
>> +    vhdr->flags &= ~VIRTIO_NET_HDR_F_NEEDS_CSUM;
>> +    vhdr->flags |= VIRTIO_NET_HDR_F_DATA_VALID;
>> +}
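
The IPv4 header checksum has to be recomputed after coalescing because
the total-length field changes. Below is a minimal sketch of the usual
one's-complement sum (RFC 1071), written independently of QEMU's
net_checksum_* helpers. sketch_ip_checksum is a hypothetical name, and
the checksum field of the input is assumed to be zeroed already.

```c
#include <assert.h>
#include <stddef.h>
#include <stdint.h>

/* One's-complement sum over a header whose checksum field is zeroed.
 * The result is the numeric checksum; it must be stored big-endian. */
static uint16_t sketch_ip_checksum(const uint8_t *hdr, size_t len)
{
    uint32_t sum = 0;
    size_t i;

    assert(len % 2 == 0);
    for (i = 0; i < len; i += 2) {
        sum += (uint32_t)((hdr[i] << 8) | hdr[i + 1]);
    }
    /* Fold the carries back into the low 16 bits. */
    while (sum >> 16) {
        sum = (sum & 0xFFFF) + (sum >> 16);
    }
    return (uint16_t)~sum;
}
```

A 20-byte header without options is the only case the patch coalesces,
so an even length is a safe assumption here.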
>> +
>> +static size_t virtio_net_rsc_drain_seg(NetRscChain *chain, NetRscSeg
>> *seg)
>> +{
>> +    int ret;
>> +    struct virtio_net_hdr_rsc *h;
>> +
>> +    h = (struct virtio_net_hdr_rsc *)seg->buf;
>> +    if (seg->is_coalesced) {
>> +        h->hdr.flags = VIRTIO_NET_HDR_RSC_TCPV4;
>> +        virtio_net_rsc_ipv4_checksum(h, seg->unit.ip);
>> +    }
>> +
>> +    h = (struct virtio_net_hdr_rsc *)seg->buf;
>> +    virtio_net_rsc_ipv4_checksum(h, seg->unit.ip);
>> +    h->rsc_pkts = seg->packets;
>> +    h->rsc_dup_acks = seg->dup_ack;
>> +    ret = virtio_net_do_receive(seg->nc, seg->buf, seg->size);
>> +    QTAILQ_REMOVE(&chain->buffers, seg, next);
>> +    g_free(seg->buf);
>> +    g_free(seg);
>> +
>> +    return ret;
>> +}
>> +
>> +static void virtio_net_rsc_purge(void *opq)
>> +{
>> +    NetRscSeg *seg, *rn;
>> +    NetRscChain *chain = (NetRscChain *)opq;
>> +
>> +    QTAILQ_FOREACH_SAFE(seg, &chain->buffers, next, rn) {
>> +        if (virtio_net_rsc_drain_seg(chain, seg) == 0) {
>> +            chain->stat.purge_failed++;
>> +            continue;
>> +        }
>> +    }
>> +
>> +    chain->stat.timer++;
>> +    if (!QTAILQ_EMPTY(&chain->buffers)) {
>> +        timer_mod(chain->drain_timer,
>> +              qemu_clock_get_ns(QEMU_CLOCK_HOST) +
>> chain->n->rsc_timeout);
>> +    }
>> +}
>> +
>> +static void virtio_net_rsc_cleanup(VirtIONet *n)
>> +{
>> +    NetRscChain *chain, *rn_chain;
>> +    NetRscSeg *seg, *rn_seg;
>> +
>> +    QTAILQ_FOREACH_SAFE(chain, &n->rsc_chains, next, rn_chain) {
>> +        QTAILQ_FOREACH_SAFE(seg, &chain->buffers, next, rn_seg) {
>> +            QTAILQ_REMOVE(&chain->buffers, seg, next);
>> +            g_free(seg->buf);
>> +            g_free(seg);
>> +        }
>> +
>> +        timer_del(chain->drain_timer);
>> +        timer_free(chain->drain_timer);
>> +        QTAILQ_REMOVE(&n->rsc_chains, chain, next);
>> +        g_free(chain);
>> +    }
>> +}
>> +
>> +static void virtio_net_rsc_cache_buf(NetRscChain *chain,
>> NetClientState *nc,
>> +                                     const uint8_t *buf, size_t size)
>> +{
>> +    uint16_t hdr_len;
>> +    NetRscSeg *seg;
>> +
>> +    hdr_len = chain->n->guest_hdr_len;
>> +    seg = g_malloc(sizeof(NetRscSeg));
>> +    seg->buf = g_malloc(hdr_len + sizeof(struct eth_header)\
>> +                   + VIRTIO_NET_MAX_TCP_PAYLOAD);
>> +    memcpy(seg->buf, buf, size);
>> +    seg->size = size;
>> +    seg->packets = 1;
>> +    seg->dup_ack = 0;
>> +    seg->is_coalesced = 0;
>> +    seg->nc = nc;
>> +
>> +    QTAILQ_INSERT_TAIL(&chain->buffers, seg, next);
>> +    chain->stat.cache++;
>> +
>> +    virtio_net_rsc_extract_unit4(chain, seg->buf, &seg->unit);
>> +}
>> +
>> +static int32_t virtio_net_rsc_handle_ack(NetRscChain *chain,
>> +                                         NetRscSeg *seg, const
>> uint8_t *buf,
>> +                                         struct tcp_header *n_tcp,
>> +                                         struct tcp_header *o_tcp)
>> +{
>> +    uint32_t nack, oack;
>> +    uint16_t nwin, owin;
>> +
>> +    nack = htonl(n_tcp->th_ack);
>> +    nwin = htons(n_tcp->th_win);
>> +    oack = htonl(o_tcp->th_ack);
>> +    owin = htons(o_tcp->th_win);
>> +
>> +    if ((nack - oack) >= VIRTIO_NET_MAX_TCP_PAYLOAD) {
>> +        chain->stat.ack_out_of_win++;
>> +        return RSC_FINAL;
>> +    } else if (nack == oack) {
>> +        /* duplicated ack or window probe */
>> +        if (nwin == owin) {
>> +            /* duplicated ack, add dup ack count due to whql test up
>> to 1 */
>> +            chain->stat.dup_ack++;
>> +            return RSC_FINAL;
>> +        } else {
>> +            /* Coalesce window update */
>> +            o_tcp->th_win = n_tcp->th_win;
>> +            chain->stat.win_update++;
>> +            return RSC_COALESCE;
>> +        }
>> +    } else {
>> +        /* pure ack, go to 'C', finalize*/
>> +        chain->stat.pure_ack++;
>> +        return RSC_FINAL;
>> +    }
>> +}
>> +
>> +static int32_t virtio_net_rsc_coalesce_data(NetRscChain *chain,
>> +                                            NetRscSeg *seg, const
>> uint8_t *buf,
>> +                                            NetRscUnit *n_unit)
>> +{
>> +    void *data;
>> +    uint16_t o_ip_len;
>> +    uint32_t nseq, oseq;
>> +    NetRscUnit *o_unit;
>> +
>> +    o_unit = &seg->unit;
>> +    o_ip_len = htons(*o_unit->ip_plen);
>> +    nseq = htonl(n_unit->tcp->th_seq);
>> +    oseq = htonl(o_unit->tcp->th_seq);
>> +
>> +    /* out of order or retransmitted. */
>> +    if ((nseq - oseq) > VIRTIO_NET_MAX_TCP_PAYLOAD) {
>> +        chain->stat.data_out_of_win++;
>> +        return RSC_FINAL;
>> +    }
>> +
>> +    data = ((uint8_t *)n_unit->tcp) + n_unit->tcp_hdrlen;
>> +    if (nseq == oseq) {
>> +        if ((o_unit->payload == 0) && n_unit->payload) {
>> +            /* From no payload to payload, normal case, not a dup ack
>> or etc */
>> +            chain->stat.data_after_pure_ack++;
>> +            goto coalesce;
>> +        } else {
>> +            return virtio_net_rsc_handle_ack(chain, seg, buf,
>> +                                             n_unit->tcp, o_unit->tcp);
>> +        }
>> +    } else if ((nseq - oseq) != o_unit->payload) {
>> +        /* Not a consistent packet, out of order */
>> +        chain->stat.data_out_of_order++;
>> +        return RSC_FINAL;
>> +    } else {
>> +coalesce:
>> +        if ((o_ip_len + n_unit->payload) > chain->max_payload) {
>> +            chain->stat.over_size++;
>> +            return RSC_FINAL;
>> +        }
>> +
>> +        /* Here comes the right data, the payload length in v4/v6 is
>> different,
>> +           so use the field value to update and record the new data
>> len */
>> +        o_unit->payload += n_unit->payload; /* update new data len */
>> +
>> +        /* update field in ip header */
>> +        *o_unit->ip_plen = htons(o_ip_len + n_unit->payload);
>> +
>> +        /* Bring 'PUSH' big, the whql test guide says 'PUSH' can be
>> coalesced
>> +           for windows guest, while this may change the behavior for
>> linux
>> +           guest. */
>> +        o_unit->tcp->th_offset_flags = n_unit->tcp->th_offset_flags;
>> +
>> +        o_unit->tcp->th_ack = n_unit->tcp->th_ack;
>> +        o_unit->tcp->th_win = n_unit->tcp->th_win;
>> +
>> +        memmove(seg->buf + seg->size, data, n_unit->payload);
>> +        seg->size += n_unit->payload;
>> +        seg->packets++;
>> +        chain->stat.coalesced++;
>> +        return RSC_COALESCE;
>> +    }
>> +}
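
The sequence-number tests in virtio_net_rsc_coalesce_data() rely on
unsigned 32-bit subtraction, which keeps working across sequence
wraparound. A minimal standalone sketch of that classification follows;
the enum and function names are hypothetical, and sequence numbers are
assumed to be in host byte order.

```c
#include <assert.h>
#include <stdint.h>

#define SKETCH_MAX_TCP_PAYLOAD 65535u

enum sketch_seq_verdict {
    SKETCH_SEQ_OUT_OF_WINDOW, /* too far ahead, or behind the cached data */
    SKETCH_SEQ_SAME,          /* same sequence: ack / window update path */
    SKETCH_SEQ_OUT_OF_ORDER,  /* inside the window but not contiguous */
    SKETCH_SEQ_IN_ORDER       /* starts right after the cached payload */
};

static enum sketch_seq_verdict
sketch_classify_seq(uint32_t oseq, uint32_t opayload, uint32_t nseq)
{
    /* Modulo-2^32 distance from the cached segment to the new one. */
    uint32_t delta = nseq - oseq;

    assert(opayload <= SKETCH_MAX_TCP_PAYLOAD);
    if (delta > SKETCH_MAX_TCP_PAYLOAD) {
        return SKETCH_SEQ_OUT_OF_WINDOW;
    }
    if (delta == 0) {
        return SKETCH_SEQ_SAME;
    }
    if (delta != opayload) {
        return SKETCH_SEQ_OUT_OF_ORDER;
    }
    return SKETCH_SEQ_IN_ORDER;
}
```

A retransmission that starts before the cached segment produces a very
large delta and is therefore reported as out of window, which matches
the "out of order or retransmitted" comment in the patch.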
>> +
>> +static int32_t virtio_net_rsc_coalesce4(NetRscChain *chain, NetRscSeg
>> *seg,
>> +                                        const uint8_t *buf, size_t size,
>> +                                        NetRscUnit *unit)
>> +{
>> +    struct ip_header *ip1, *ip2;
>> +
>> +    ip1 = (struct ip_header *)(unit->ip);
>> +    ip2 = (struct ip_header *)(seg->unit.ip);
>> +    if ((ip1->ip_src ^ ip2->ip_src) || (ip1->ip_dst ^ ip2->ip_dst)
>> +        || (unit->tcp->th_sport ^ seg->unit.tcp->th_sport)
>> +        || (unit->tcp->th_dport ^ seg->unit.tcp->th_dport)) {
>> +        chain->stat.no_match++;
>> +        return RSC_NO_MATCH;
>> +    }
>> +
>> +    return virtio_net_rsc_coalesce_data(chain, seg, buf, unit);
>> +}
>> +
>> +/* Pakcets with 'SYN' should bypass, other flag should be sent after
>> drain
>> + * to prevent out of order */
>> +static int virtio_net_rsc_tcp_ctrl_check(NetRscChain *chain,
>> +                                         struct tcp_header *tcp)
>> +{
>> +    uint16_t tcp_hdr;
>> +    uint16_t tcp_flag;
>> +
>> +    tcp_flag = htons(tcp->th_offset_flags);
>> +    tcp_hdr = (tcp_flag & VIRTIO_NET_TCP_HDR_LENGTH) >> 10;
>> +    tcp_flag &= VIRTIO_NET_TCP_FLAG;
>> +    tcp_flag = htons(tcp->th_offset_flags) & 0x3F;
>> +    if (tcp_flag & TH_SYN) {
>> +        chain->stat.tcp_syn++;
>> +        return RSC_BYPASS;
>> +    }
>> +
>> +    if (tcp_flag & (TH_FIN | TH_URG | TH_RST | TH_ECE | TH_CWR)) {
>> +        chain->stat.tcp_ctrl_drain++;
>> +        return RSC_FINAL;
>> +    }
>> +
>> +    if (tcp_hdr > sizeof(struct tcp_header)) {
>> +        chain->stat.tcp_all_opt++;
>> +        return RSC_FINAL;
>> +    }
>> +
>> +    return RSC_CANDIDATE;
>> +}
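
A standalone sketch of the bypass/finalize/candidate decision, assuming
the standard TCP flag bit values. Note that under this assumption a
0x3F mask (as VIRTIO_NET_TCP_FLAG is defined above) would strip ECE
(0x40) and CWR (0x80) before they are tested, so the sketch masks with
0xFF instead. All sketch_* names are hypothetical.

```c
#include <assert.h>
#include <stdint.h>

/* Standard TCP flag bits (low byte of the offset/flags word). */
#define SKETCH_TH_FIN 0x01
#define SKETCH_TH_SYN 0x02
#define SKETCH_TH_RST 0x04
#define SKETCH_TH_URG 0x20
#define SKETCH_TH_ECE 0x40
#define SKETCH_TH_CWR 0x80

#define SKETCH_TCP_HDR_MIN 20  /* TCP header without options, in bytes */

static_assert(((SKETCH_TH_ECE | SKETCH_TH_CWR) & ~0x3F) != 0,
              "ECE/CWR lie outside a 0x3F mask");

enum sketch_verdict { SKETCH_BYPASS, SKETCH_FINAL, SKETCH_CANDIDATE };

/* offset_flags_host: the 16-bit data-offset/flags word in host order. */
static enum sketch_verdict sketch_tcp_ctrl_check(uint16_t offset_flags_host)
{
    uint16_t flags = offset_flags_host & 0x00FF;
    uint16_t hdrlen = (uint16_t)((offset_flags_host & 0xF000) >> 10);

    if (flags & SKETCH_TH_SYN) {
        return SKETCH_BYPASS;       /* new connection: send directly */
    }
    if (flags & (SKETCH_TH_FIN | SKETCH_TH_URG | SKETCH_TH_RST |
                 SKETCH_TH_ECE | SKETCH_TH_CWR)) {
        return SKETCH_FINAL;        /* drain the flow first, then send */
    }
    if (hdrlen > SKETCH_TCP_HDR_MIN) {
        return SKETCH_FINAL;        /* TCP options present */
    }
    return SKETCH_CANDIDATE;
}
```

A plain ACK or PSH|ACK segment without options is the only kind that
ends up as a coalescing candidate in this sketch.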
>> +
>> +static size_t virtio_net_rsc_do_coalesce(NetRscChain *chain,
>> NetClientState *nc,
>> +                                         const uint8_t *buf, size_t
>> size,
>> +                                         NetRscUnit *unit)
>> +{
>> +    int ret;
>> +    NetRscSeg *seg, *nseg;
>> +
>> +    if (QTAILQ_EMPTY(&chain->buffers)) {
>> +        chain->stat.empty_cache++;
>> +        virtio_net_rsc_cache_buf(chain, nc, buf, size);
>> +        timer_mod(chain->drain_timer,
>> +              qemu_clock_get_ns(QEMU_CLOCK_HOST) +
>> chain->n->rsc_timeout);
>> +        return size;
>> +    }
>> +
>> +    QTAILQ_FOREACH_SAFE(seg, &chain->buffers, next, nseg) {
>> +        ret = virtio_net_rsc_coalesce4(chain, seg, buf, size, unit);
>> +
>> +        if (ret == RSC_FINAL) {
>> +            if (virtio_net_rsc_drain_seg(chain, seg) == 0) {
>> +                /* Send failed */
>> +                chain->stat.final_failed++;
>> +                return 0;
>> +            }
>> +
>> +            /* Send current packet */
>> +            return virtio_net_do_receive(nc, buf, size);
>> +        } else if (ret == RSC_NO_MATCH) {
>> +            continue;
>> +        } else {
>> +            /* Coalesced, mark coalesced flag to tell calc cksum for
>> ipv4 */
>> +            seg->is_coalesced = 1;
>> +            return size;
>> +        }
>> +    }
>> +
>> +    chain->stat.no_match_cache++;
>> +    virtio_net_rsc_cache_buf(chain, nc, buf, size);
>> +    return size;
>> +}
>> +
>> +/* Drain a connection data, this is to avoid out of order segments */
>> +static size_t virtio_net_rsc_drain_flow(NetRscChain *chain,
>> NetClientState *nc,
>> +                                        const uint8_t *buf, size_t size,
>> +                                        uint16_t ip_start, uint16_t
>> ip_size,
>> +                                        uint16_t tcp_port)
>> +{
>> +    NetRscSeg *seg, *nseg;
>> +    uint32_t ppair1, ppair2;
>> +
>> +    ppair1 = *(uint32_t *)(buf + tcp_port);
>> +    QTAILQ_FOREACH_SAFE(seg, &chain->buffers, next, nseg) {
>> +        ppair2 = *(uint32_t *)(seg->buf + tcp_port);
>> +        if (memcmp(buf + ip_start, seg->buf + ip_start, ip_size)
>> +            || (ppair1 != ppair2)) {
>> +            continue;
>> +        }
>> +        if (virtio_net_rsc_drain_seg(chain, seg) == 0) {
>> +            chain->stat.drain_failed++;
>> +        }
>> +
>> +        break;
>> +    }
>> +
>> +    return virtio_net_do_receive(nc, buf, size);
>> +}
>> +
>> +static int32_t virtio_net_rsc_sanity_check4(NetRscChain *chain,
>> +                                            struct ip_header *ip,
>> +                                            const uint8_t *buf,
>> size_t size)
>> +{
>> +    uint16_t ip_len;
>> +
>> +    /* Not an ipv4 packet */
>> +    if (((ip->ip_ver_len & 0xF0) >> 4) != IP_HEADER_VERSION_4) {
>> +        chain->stat.ip_option++;
>> +        return RSC_BYPASS;
>> +    }
>> +
>> +    /* Don't handle packets with ip option */
>> +    if ((ip->ip_ver_len & 0xF) != VIRTIO_NET_IP4_HEADER_LENGTH) {
>> +        chain->stat.ip_option++;
>> +        return RSC_BYPASS;
>> +    }
>> +
>> +    if (ip->ip_p != IPPROTO_TCP) {
>> +        chain->stat.bypass_not_tcp++;
>> +        return RSC_BYPASS;
>> +    }
>> +
>> +    /* Don't handle packets with ip fragment */
>> +    if (!(htons(ip->ip_off) & IP_DF)) {
>> +        chain->stat.ip_frag++;
>> +        return RSC_BYPASS;
>> +    }
>> +
>> +    /* Don't handle packets with ecn flag */
>> +    if (IPTOS_ECN(ip->ip_tos)) {
>> +        chain->stat.ip_ecn++;
>> +        return RSC_BYPASS;
>> +    }
>> +
>> +    ip_len = htons(ip->ip_len);
>> +    if (ip_len < (sizeof(struct ip_header) + sizeof(struct tcp_header))
>> +        || ip_len > (size - chain->n->guest_hdr_len -
>> +                     sizeof(struct eth_header))) {
>> +        chain->stat.ip_hacked++;
>> +        return RSC_BYPASS;
>> +    }
>> +
>> +    return RSC_CANDIDATE;
>> +}
>> +
>> +static size_t virtio_net_rsc_receive4(NetRscChain *chain,
>> NetClientState* nc,
>> +                                      const uint8_t *buf, size_t size)
>> +{
>> +    int32_t ret;
>> +    uint16_t hdr_len;
>> +    NetRscUnit unit;
>> +
>> +    hdr_len = ((VirtIONet *)(chain->n))->guest_hdr_len;
>> +
>> +    if (size < (hdr_len + sizeof(struct eth_header) + sizeof(struct
>> ip_header)
>> +        + sizeof(struct tcp_header))) {
>> +        chain->stat.bypass_not_tcp++;
>> +        return virtio_net_do_receive(nc, buf, size);
>> +    }
>> +
>> +    virtio_net_rsc_extract_unit4(chain, buf, &unit);
>> +    if (virtio_net_rsc_sanity_check4(chain, unit.ip, buf, size)
>> +        != RSC_CANDIDATE) {
>> +        return virtio_net_do_receive(nc, buf, size);
>> +    }
>> +
>> +    ret = virtio_net_rsc_tcp_ctrl_check(chain, unit.tcp);
>> +    if (ret == RSC_BYPASS) {
>> +        return virtio_net_do_receive(nc, buf, size);
>> +    } else if (ret == RSC_FINAL) {
>> +        return virtio_net_rsc_drain_flow(chain, nc, buf, size,
>> +                ((hdr_len + sizeof(struct eth_header)) + 12),
>> +                VIRTIO_NET_IP4_ADDR_SIZE,
>> +                hdr_len + sizeof(struct eth_header) + sizeof(struct
>> ip_header));
>> +    }
>> +
>> +    return virtio_net_rsc_do_coalesce(chain, nc, buf, size, &unit);
>> +}
>> +
>> +static NetRscChain *virtio_net_rsc_lookup_chain(VirtIONet * n,
>> +                                                NetClientState *nc,
>> +                                                uint16_t proto)
>> +{
>> +    NetRscChain *chain;
>> +
>> +    if (proto != (uint16_t)ETH_P_IP) {
>> +        return NULL;
>> +    }
>> +
>> +    QTAILQ_FOREACH(chain, &n->rsc_chains, next) {
>> +        if (chain->proto == proto) {
>> +            return chain;
>> +        }
>> +    }
>> +
>> +    chain = g_malloc(sizeof(*chain));
>> +    chain->n = n;
>> +    chain->proto = proto;
>> +    chain->max_payload = VIRTIO_NET_MAX_IP4_PAYLOAD;
>> +    chain->gso_type = VIRTIO_NET_HDR_GSO_TCPV4;
>> +    chain->drain_timer = timer_new_ns(QEMU_CLOCK_HOST,
>> +                                      virtio_net_rsc_purge, chain);
>> +    memset(&chain->stat, 0, sizeof(chain->stat));
>> +
>> +    QTAILQ_INIT(&chain->buffers);
>> +    QTAILQ_INSERT_TAIL(&n->rsc_chains, chain, next);
>> +
>> +    return chain;
>> +}
>> +
>> +static ssize_t virtio_net_rsc_receive(NetClientState *nc,
>> +                                      const uint8_t *buf, size_t size)
>> +{
>> +    uint16_t proto;
>> +    NetRscChain *chain;
>> +    struct eth_header *eth;
>> +    VirtIONet *n;
>> +
>> +    n = qemu_get_nic_opaque(nc);
>> +    if (size < (n->host_hdr_len + sizeof(struct eth_header))) {
>> +        return virtio_net_do_receive(nc, buf, size);
>> +    }
>> +
>> +    eth = (struct eth_header *)(buf + n->guest_hdr_len);
>> +    proto = htons(eth->h_proto);
>> +
>> +    chain = virtio_net_rsc_lookup_chain(n, nc, proto);
>> +    if (!chain) {
>> +        return virtio_net_do_receive(nc, buf, size);
>> +    } else {
>> +        chain->stat.received++;
>> +        return virtio_net_rsc_receive4(chain, nc, buf, size);
>> +    }
>> +}
>> +
>> +static ssize_t virtio_net_receive(NetClientState *nc,
>> +                                  const uint8_t *buf, size_t size)
>> +{
>> +    VirtIONet *n;
>> +    struct virtio_net_hdr_rsc *h;
>> +
>> +    n = qemu_get_nic_opaque(nc);
>> +    if (n->curr_guest_offloads & (1ULL << VIRTIO_NET_F_GUEST_RSC4)) {
>> +        h = (struct virtio_net_hdr_rsc *)buf;
>> +        h->hdr.flags = VIRTIO_NET_HDR_RSC_NONE;
>> +        h->rsc_pkts = 0;
>> +        h->rsc_dup_acks = 0;
>> +        return virtio_net_rsc_receive(nc, buf, size);
>> +    } else {
>> +        return virtio_net_do_receive(nc, buf, size);
>> +    }
>> +}
>> +
>>   static NetClientInfo net_virtio_info = {
>>       .type = NET_CLIENT_DRIVER_NIC,
>>       .size = sizeof(NICState),
>> @@ -1805,6 +2340,7 @@ static void
>> virtio_net_device_realize(DeviceState *dev, Error **errp)
>>       nc = qemu_get_queue(n->nic);
>>       nc->rxfilter_notify_enabled = 1;
>> +    QTAILQ_INIT(&n->rsc_chains);
>>       n->qdev = dev;
>>   }
>> @@ -1835,6 +2371,7 @@ static void
>> virtio_net_device_unrealize(DeviceState *dev, Error **errp)
>>       g_free(n->vqs);
>>       qemu_del_nic(n->nic);
>>       virtio_cleanup(vdev);
>> +    virtio_net_rsc_cleanup(n);
>>   }
>>   static void virtio_net_instance_init(Object *obj)
>> @@ -1872,45 +2409,46 @@ static const VMStateDescription
>> vmstate_virtio_net = {
>>   };
>>   static Property virtio_net_properties[] = {
>> -    DEFINE_PROP_BIT("csum", VirtIONet, host_features,
>> VIRTIO_NET_F_CSUM, true),
>> -    DEFINE_PROP_BIT("guest_csum", VirtIONet, host_features,
>> +    DEFINE_PROP_BIT64("csum", VirtIONet, host_features,
>> +                    VIRTIO_NET_F_CSUM, true),
>> +    DEFINE_PROP_BIT64("guest_csum", VirtIONet, host_features,
>>                       VIRTIO_NET_F_GUEST_CSUM, true),
>> -    DEFINE_PROP_BIT("gso", VirtIONet, host_features,
>> VIRTIO_NET_F_GSO, true),
>> -    DEFINE_PROP_BIT("guest_tso4", VirtIONet, host_features,
>> +    DEFINE_PROP_BIT64("gso", VirtIONet, host_features,
>> VIRTIO_NET_F_GSO, true),
>> +    DEFINE_PROP_BIT64("guest_tso4", VirtIONet, host_features,
>>                       VIRTIO_NET_F_GUEST_TSO4, true),
>> -    DEFINE_PROP_BIT("guest_tso6", VirtIONet, host_features,
>> +    DEFINE_PROP_BIT64("guest_tso6", VirtIONet, host_features,
>>                       VIRTIO_NET_F_GUEST_TSO6, true),
>> -    DEFINE_PROP_BIT("guest_ecn", VirtIONet, host_features,
>> +    DEFINE_PROP_BIT64("guest_ecn", VirtIONet, host_features,
>>                       VIRTIO_NET_F_GUEST_ECN, true),
>> -    DEFINE_PROP_BIT("guest_ufo", VirtIONet, host_features,
>> +    DEFINE_PROP_BIT64("guest_ufo", VirtIONet, host_features,
>>                       VIRTIO_NET_F_GUEST_UFO, true),
>> -    DEFINE_PROP_BIT("guest_announce", VirtIONet, host_features,
>> +    DEFINE_PROP_BIT64("guest_announce", VirtIONet, host_features,
>>                       VIRTIO_NET_F_GUEST_ANNOUNCE, true),
>> -    DEFINE_PROP_BIT("host_tso4", VirtIONet, host_features,
>> +    DEFINE_PROP_BIT64("host_tso4", VirtIONet, host_features,
>>                       VIRTIO_NET_F_HOST_TSO4, true),
>> -    DEFINE_PROP_BIT("host_tso6", VirtIONet, host_features,
>> +    DEFINE_PROP_BIT64("host_tso6", VirtIONet, host_features,
>>                       VIRTIO_NET_F_HOST_TSO6, true),
>> -    DEFINE_PROP_BIT("host_ecn", VirtIONet, host_features,
>> +    DEFINE_PROP_BIT64("host_ecn", VirtIONet, host_features,
>>                       VIRTIO_NET_F_HOST_ECN, true),
>> -    DEFINE_PROP_BIT("host_ufo", VirtIONet, host_features,
>> +    DEFINE_PROP_BIT64("host_ufo", VirtIONet, host_features,
>>                       VIRTIO_NET_F_HOST_UFO, true),
>> -    DEFINE_PROP_BIT("mrg_rxbuf", VirtIONet, host_features,
>> +    DEFINE_PROP_BIT64("mrg_rxbuf", VirtIONet, host_features,
>>                       VIRTIO_NET_F_MRG_RXBUF, true),
>> -    DEFINE_PROP_BIT("status", VirtIONet, host_features,
>> +    DEFINE_PROP_BIT64("status", VirtIONet, host_features,
>>                       VIRTIO_NET_F_STATUS, true),
>> -    DEFINE_PROP_BIT("ctrl_vq", VirtIONet, host_features,
>> +    DEFINE_PROP_BIT64("ctrl_vq", VirtIONet, host_features,
>>                       VIRTIO_NET_F_CTRL_VQ, true),
>> -    DEFINE_PROP_BIT("ctrl_rx", VirtIONet, host_features,
>> +    DEFINE_PROP_BIT64("ctrl_rx", VirtIONet, host_features,
>>                       VIRTIO_NET_F_CTRL_RX, true),
>> -    DEFINE_PROP_BIT("ctrl_vlan", VirtIONet, host_features,
>> +    DEFINE_PROP_BIT64("ctrl_vlan", VirtIONet, host_features,
>>                       VIRTIO_NET_F_CTRL_VLAN, true),
>> -    DEFINE_PROP_BIT("ctrl_rx_extra", VirtIONet, host_features,
>> +    DEFINE_PROP_BIT64("ctrl_rx_extra", VirtIONet, host_features,
>>                       VIRTIO_NET_F_CTRL_RX_EXTRA, true),
>> -    DEFINE_PROP_BIT("ctrl_mac_addr", VirtIONet, host_features,
>> +    DEFINE_PROP_BIT64("ctrl_mac_addr", VirtIONet, host_features,
>>                       VIRTIO_NET_F_CTRL_MAC_ADDR, true),
>> -    DEFINE_PROP_BIT("ctrl_guest_offloads", VirtIONet, host_features,
>> +    DEFINE_PROP_BIT64("ctrl_guest_offloads", VirtIONet, host_features,
>>                       VIRTIO_NET_F_CTRL_GUEST_OFFLOADS, true),
>> -    DEFINE_PROP_BIT("mq", VirtIONet, host_features, VIRTIO_NET_F_MQ,
>> false),
>> +    DEFINE_PROP_BIT64("mq", VirtIONet, host_features,
>> VIRTIO_NET_F_MQ, false),
>>       DEFINE_NIC_PROPERTIES(VirtIONet, nic_conf),
>>       DEFINE_PROP_UINT32("x-txtimer", VirtIONet, net_conf.txtimer,
>>                          TX_TIMER_INTERVAL),
>> @@ -1918,6 +2456,10 @@ static Property virtio_net_properties[] = {
>>       DEFINE_PROP_STRING("tx", VirtIONet, net_conf.tx),
>>       DEFINE_PROP_UINT16("rx_queue_size", VirtIONet,
>> net_conf.rx_queue_size,
>>                          VIRTIO_NET_RX_QUEUE_DEFAULT_SIZE),
>> +    DEFINE_PROP_BIT64("guest_rsc4", VirtIONet, host_features,
>> +                    VIRTIO_NET_F_GUEST_RSC4, true),
>
> I don't get why DEFINE_XXX_BIT64 is needed; we still have bits left, I believe.
>
>> +    DEFINE_PROP_UINT32("rsc_interval", VirtIONet, rsc_timeout,
>> +                      VIRTIO_NET_RSC_INTERVAL),
>>       DEFINE_PROP_END_OF_LIST(),
>>   };
>> diff --git a/include/hw/virtio/virtio-net.h
>> b/include/hw/virtio/virtio-net.h
>> index 0ced975..56a8ce2 100644
>> --- a/include/hw/virtio/virtio-net.h
>> +++ b/include/hw/virtio/virtio-net.h
>> @@ -60,12 +60,15 @@ typedef struct VirtIONet {
>>       VirtIONetQueue *vqs;
>>       VirtQueue *ctrl_vq;
>>       NICState *nic;
>> +    QTAILQ_HEAD(, NetRscChain) rsc_chains;
>> +    uint32_t rsc_timeout;
>>       uint32_t tx_timeout;
>>       int32_t tx_burst;
>>       uint32_t has_vnet_hdr;
>> +    uint32_t has_rsc_hdr;
>>       size_t host_hdr_len;
>>       size_t guest_hdr_len;
>> -    uint32_t host_features;
>> +    uint64_t host_features;
>
> Do we run out of host features? If yes, this needs an independent patch.

OK.

>
>>       uint8_t has_ufo;
>>       int mergeable_rx_bufs;
>>       uint8_t promisc;
>> diff --git a/include/hw/virtio/virtio.h b/include/hw/virtio/virtio.h
>> index b913aac..0006ce1 100644
>> --- a/include/hw/virtio/virtio.h
>> +++ b/include/hw/virtio/virtio.h
>> @@ -30,6 +30,8 @@
>>                                   (0x1ULL << VIRTIO_F_ANY_LAYOUT))
>>   struct VirtQueue;
>> +struct VirtIONet;
>> +typedef struct VirtIONet VirtIONet;
>>   static inline hwaddr vring_align(hwaddr addr,
>>                                                unsigned long align)
>> @@ -129,6 +131,80 @@ typedef struct VirtioDeviceClass {
>>       int (*load)(VirtIODevice *vdev, QEMUFile *f, int version_id);
>>   } VirtioDeviceClass;
>> +/* Coalesced packets type & status */
>> +typedef enum {
>> +    RSC_COALESCE,           /* Data been coalesced */
>
> "Data has been" ?

Thanks.

>
>> +    RSC_FINAL,              /* Will terminate current connection */
>> +    RSC_NO_MATCH,           /* No matched in the buffer pool */
>> +    RSC_BYPASS,             /* Packet to be bypass, not tcp, tcp
>> ctrl, etc */
>
> "to be bypassed" ?

OK.

>
>> +    RSC_CANDIDATE                /* Data want to be coalesced */
>> +} COALESCE_STATUS;
>> +
>> +typedef struct NetRscStat {
>> +    uint32_t received;
>> +    uint32_t coalesced;
>> +    uint32_t over_size;
>> +    uint32_t cache;
>> +    uint32_t empty_cache;
>> +    uint32_t no_match_cache;
>> +    uint32_t win_update;
>> +    uint32_t no_match;
>> +    uint32_t tcp_syn;
>> +    uint32_t tcp_ctrl_drain;
>> +    uint32_t dup_ack;
>> +    uint32_t dup_ack1;
>> +    uint32_t dup_ack2;
>> +    uint32_t pure_ack;
>> +    uint32_t ack_out_of_win;
>> +    uint32_t data_out_of_win;
>> +    uint32_t data_out_of_order;
>> +    uint32_t data_after_pure_ack;
>> +    uint32_t bypass_not_tcp;
>> +    uint32_t tcp_option;
>> +    uint32_t tcp_all_opt;
>> +    uint32_t ip_frag;
>> +    uint32_t ip_ecn;
>> +    uint32_t ip_hacked;
>> +    uint32_t ip_option;
>> +    uint32_t purge_failed;
>> +    uint32_t drain_failed;
>> +    uint32_t final_failed;
>> +    int64_t  timer;
>> +} NetRscStat;
>> +
>> +/* Rsc unit general info used to checking if can coalescing */
>> +typedef struct NetRscUnit {
>> +    void *ip;   /* ip header */
>> +    uint16_t *ip_plen;      /* data len pointer in ip header field */
>> +    struct tcp_header *tcp; /* tcp header */
>> +    uint16_t tcp_hdrlen;    /* tcp header len */
>> +    uint16_t payload;       /* pure payload without virtio/eth/ip/tcp */
>> +} NetRscUnit;
>> +
>> +/* Coalesced segmant */
>> +typedef struct NetRscSeg {
>> +    QTAILQ_ENTRY(NetRscSeg) next;
>> +    void *buf;
>> +    size_t size;
>> +    uint16_t packets;
>> +    uint16_t dup_ack;
>> +    bool is_coalesced;      /* need recal ipv4 header checksum, mark
>> here */
>> +    NetRscUnit unit;
>> +    NetClientState *nc;
>> +} NetRscSeg;
>> +
>> +/* Chain is divided by protocol(ipv4/v6) and NetClientInfo */
>> +typedef struct NetRscChain {
>> +    QTAILQ_ENTRY(NetRscChain) next;
>> +    VirtIONet *n;                            /* VirtIONet */
>> +    uint16_t proto;
>> +    uint8_t  gso_type;
>> +    uint16_t max_payload;
>> +    QEMUTimer *drain_timer;
>> +    QTAILQ_HEAD(, NetRscSeg) buffers;
>> +    NetRscStat stat;
>> +} NetRscChain;
>> +
>
> Why put the above in virtio.h? If they will not be used by other files,
> why do they need to be in a header file?

OK, I will move them to virtio-net.h.

>
>>   void virtio_instance_init_common(Object *proxy_obj, void *data,
>>                                    size_t vdev_size, const char
>> *vdev_name);
>> diff --git a/include/net/eth.h b/include/net/eth.h
>> index 2013175..5952ef2 100644
>> --- a/include/net/eth.h
>> +++ b/include/net/eth.h
>> @@ -177,6 +177,8 @@ struct tcp_hdr {
>>   #define TH_PUSH 0x08
>>   #define TH_ACK  0x10
>>   #define TH_URG  0x20
>> +#define TH_ECE  0x40
>> +#define TH_CWR  0x80
>
> Let's put this in another patch.

OK.

>
>>       u_short th_win;      /* window */
>>       u_short th_sum;      /* checksum */
>>       u_short th_urp;      /* urgent pointer */
>> diff --git a/include/standard-headers/linux/virtio_net.h
>> b/include/standard-headers/linux/virtio_net.h
>> index 30ff249..e67b36e 100644
>> --- a/include/standard-headers/linux/virtio_net.h
>> +++ b/include/standard-headers/linux/virtio_net.h
>> @@ -57,6 +57,9 @@
>>                        * Steering */
>>   #define VIRTIO_NET_F_CTRL_MAC_ADDR 23    /* Set MAC address */
>> +/* Guest can handle coalesced ipv4-tcp packets */
>> +#define VIRTIO_NET_F_GUEST_RSC4    41
>
> Why not use 24?
>
>> +
>>   #ifndef VIRTIO_NET_NO_LEGACY
>>   #define VIRTIO_NET_F_GSO    6    /* Host handles pkts w/ any GSO
>> type */
>>   #endif /* VIRTIO_NET_NO_LEGACY */
>> @@ -94,6 +97,9 @@ struct virtio_net_hdr_v1 {
>>   #define VIRTIO_NET_HDR_GSO_UDP        3    /* GSO frame, IPv4 UDP
>> (UFO) */
>>   #define VIRTIO_NET_HDR_GSO_TCPV6    4    /* GSO frame, IPv6 TCP */
>>   #define VIRTIO_NET_HDR_GSO_ECN        0x80    /* TCP has ECN set */
>> +#define VIRTIO_NET_HDR_RSC_NONE     5   /* No packets coalesced */
>
> Not sure this is really needed. Can we just use GSO_NONE?

Of course we can, but I think it is better to keep this feature distinct.

>
> And I believe we should not try to coalesce GSO packets since we're
> lacking sufficient information for a correct rsc_pkts or rsc_dup_acks
> from the backend.
>
>> +#define VIRTIO_NET_HDR_RSC_TCPV4    6   /* IPv4 TCP coalesced */
>> +#define VIRTIO_NET_HDR_RSC_TCPV6    7   /* IPv6 TCP coalesced */
>>       uint8_t gso_type;
>>       __virtio16 hdr_len;    /* Ethernet + IP + tcp/udp hdrs */
>>       __virtio16 gso_size;    /* Bytes to append to hdr_len per frame */
>> @@ -124,6 +130,14 @@ struct virtio_net_hdr_mrg_rxbuf {
>>       struct virtio_net_hdr hdr;
>>       __virtio16 num_buffers;    /* Number of merged rx buffers */
>>   };
>> +
>> +/* This is the header to use when either one or both of GUEST_RSC4/6
>> + * features have been negotiated. */
>> +struct virtio_net_hdr_rsc {
>> +    struct virtio_net_hdr_v1 hdr;
>
> If RSC depends on VERSION_1, we need to forbid creating an RSC device
> without VERSION_1.

How should this be done? Also, a question: which header will be used if a
device is not virtio 1.0 compliant?

>
>> +    __virtio16 rsc_pkts;        /* Number of coalesced packets */
>> +    __virtio16 rsc_dup_acks;    /* Duplicated ack packets */
>> +};
>>   #endif /* ...VIRTIO_NET_NO_LEGACY */
>>   /*
>> diff --git a/net/tap.c b/net/tap.c
>> index b6896a7..4557aa5 100644
>> --- a/net/tap.c
>> +++ b/net/tap.c
>> @@ -251,7 +251,8 @@ static void tap_set_vnet_hdr_len(NetClientState
>> *nc, int len)
>>       TAPState *s = DO_UPCAST(TAPState, nc, nc);
>>       assert(nc->info->type == NET_CLIENT_DRIVER_TAP);
>> -    assert(len == sizeof(struct virtio_net_hdr_mrg_rxbuf) ||
>> +    assert(len == sizeof(struct virtio_net_hdr_rsc) ||
>> +           len == sizeof(struct virtio_net_hdr_mrg_rxbuf) ||
>>              len == sizeof(struct virtio_net_hdr));
>>       tap_fd_set_vnet_hdr_len(s->fd, len);
>


* Re: [Qemu-devel] [PATCH 1/2] virtio-net rsc: support coalescing ipv4 tcp traffic
  2016-11-24  4:31       ` Jason Wang
@ 2016-11-24  5:09         ` Michael S. Tsirkin
  0 siblings, 0 replies; 32+ messages in thread
From: Michael S. Tsirkin @ 2016-11-24  5:09 UTC (permalink / raw)
  To: Jason Wang; +Cc: wexu, qemu-devel, dfleytma, yvugenfi

On Thu, Nov 24, 2016 at 12:31:18PM +0800, Jason Wang wrote:
> 
> 
> On 2016-11-24 12:26, Michael S. Tsirkin wrote:
> > On Thu, Nov 24, 2016 at 12:17:21PM +0800, Jason Wang wrote:
> > > > diff --git a/include/standard-headers/linux/virtio_net.h b/include/standard-headers/linux/virtio_net.h
> > > > index 30ff249..e67b36e 100644
> > > > --- a/include/standard-headers/linux/virtio_net.h
> > > > +++ b/include/standard-headers/linux/virtio_net.h
> > > > @@ -57,6 +57,9 @@
> > > >    					 * Steering */
> > > >    #define VIRTIO_NET_F_CTRL_MAC_ADDR 23	/* Set MAC address */
> > > > +/* Guest can handle coalesced ipv4-tcp packets */
> > > > +#define VIRTIO_NET_F_GUEST_RSC4    41
> > > Why not use 24?
> > I think we should use features >31 (virtio 1 only) for
> > nice-to-have features like RSC. Feature bits <31 are
> > easy to backport, so it makes more sense to use
> > them for fundamental things like the MTU
> > (which for some setups help fix broken networking).
> 
> Ok, I believe we need to clarify this in the spec or somewhere else.

There is a design considerations chapter, it can go there.

-- 
MST


* Re: [Qemu-devel] [PATCH 1/2] virtio-net rsc: support coalescing ipv4 tcp traffic
  2016-11-24  4:26     ` Michael S. Tsirkin
@ 2016-11-24  4:31       ` Jason Wang
  2016-11-24  5:09         ` Michael S. Tsirkin
  0 siblings, 1 reply; 32+ messages in thread
From: Jason Wang @ 2016-11-24  4:31 UTC (permalink / raw)
  To: Michael S. Tsirkin; +Cc: wexu, qemu-devel, dfleytma, yvugenfi



On 2016-11-24 12:26, Michael S. Tsirkin wrote:
> On Thu, Nov 24, 2016 at 12:17:21PM +0800, Jason Wang wrote:
>>> diff --git a/include/standard-headers/linux/virtio_net.h b/include/standard-headers/linux/virtio_net.h
>>> index 30ff249..e67b36e 100644
>>> --- a/include/standard-headers/linux/virtio_net.h
>>> +++ b/include/standard-headers/linux/virtio_net.h
>>> @@ -57,6 +57,9 @@
>>>    					 * Steering */
>>>    #define VIRTIO_NET_F_CTRL_MAC_ADDR 23	/* Set MAC address */
>>> +/* Guest can handle coalesced ipv4-tcp packets */
>>> +#define VIRTIO_NET_F_GUEST_RSC4    41
>> Why not use 24?
> I think we should use features >31 (virtio 1 only) for
> nice-to-have features like RSC. Feature bits <31 are
> easy to backport, so it makes more sense to use
> them for fundamental things like the MTU
> (which for some setups help fix broken networking).

Ok, I believe we need to clarify this in the spec or somewhere else.


* Re: [Qemu-devel] [PATCH 1/2] virtio-net rsc: support coalescing ipv4 tcp traffic
  2016-11-24  4:17   ` Jason Wang
@ 2016-11-24  4:26     ` Michael S. Tsirkin
  2016-11-24  4:31       ` Jason Wang
  2016-11-30  8:55     ` Wei Xu
  1 sibling, 1 reply; 32+ messages in thread
From: Michael S. Tsirkin @ 2016-11-24  4:26 UTC (permalink / raw)
  To: Jason Wang; +Cc: wexu, qemu-devel, dfleytma, yvugenfi

On Thu, Nov 24, 2016 at 12:17:21PM +0800, Jason Wang wrote:
> > diff --git a/include/standard-headers/linux/virtio_net.h b/include/standard-headers/linux/virtio_net.h
> > index 30ff249..e67b36e 100644
> > --- a/include/standard-headers/linux/virtio_net.h
> > +++ b/include/standard-headers/linux/virtio_net.h
> > @@ -57,6 +57,9 @@
> >   					 * Steering */
> >   #define VIRTIO_NET_F_CTRL_MAC_ADDR 23	/* Set MAC address */
> > +/* Guest can handle coalesced ipv4-tcp packets */
> > +#define VIRTIO_NET_F_GUEST_RSC4    41
> 
> Why not use 24?

I think we should use features >31 (virtio 1 only) for
nice-to-have features like RSC. Feature bits <31 are
easy to backport, so it makes more sense to use
them for fundamental things like the MTU
(which for some setups help fix broken networking).
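
To illustrate the point about feature bits above 31, here is a minimal,
self-contained sketch. It is not QEMU code: feature_is_set() and
legacy_feature_word() are hypothetical names, and only the two bit numbers
come from the quoted patch. It shows that bit 41 only survives in a 64-bit
feature word, which is presumably why the patch widens host_features to
uint64_t.

```c
#include <assert.h>
#include <stdbool.h>
#include <stdint.h>

/* Bit numbers as they appear in the quoted patch. */
#define VIRTIO_NET_F_CTRL_MAC_ADDR 23
#define VIRTIO_NET_F_GUEST_RSC4    41

/* Hypothetical helper: test a single bit in a 64-bit feature mask. */
bool feature_is_set(uint64_t features, unsigned int bit)
{
    assert(bit < 64);
    return (features & (1ULL << bit)) != 0;
}

/* Hypothetical helper: what a 32-bit feature word would retain.
 * Bits 32..63 are silently dropped by the narrowing conversion. */
uint32_t legacy_feature_word(uint64_t features)
{
    return (uint32_t)features;
}
```

With features = (1ULL << VIRTIO_NET_F_GUEST_RSC4) |
(1ULL << VIRTIO_NET_F_CTRL_MAC_ADDR), feature_is_set(features, 41) holds,
while the truncated word returned by legacy_feature_word() keeps only bit 23.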


* Re: [Qemu-devel] [PATCH 1/2] virtio-net rsc: support coalescing ipv4 tcp traffic
  2016-10-31 17:41 ` [Qemu-devel] [PATCH 1/2] virtio-net rsc: support coalescing ipv4 tcp traffic wexu
@ 2016-11-24  4:17   ` Jason Wang
  2016-11-24  4:26     ` Michael S. Tsirkin
  2016-11-30  8:55     ` Wei Xu
  0 siblings, 2 replies; 32+ messages in thread
From: Jason Wang @ 2016-11-24  4:17 UTC (permalink / raw)
  To: wexu, qemu-devel; +Cc: mst, dfleytma, yvugenfi



On 2016-11-01 01:41, wexu@redhat.com wrote:
> From: Wei Xu <wexu@redhat.com>
>
> All the data packets of a TCP connection are cached in a
> single buffer during each receive interval and are sent
> out by a timer. 'virtio_net_rsc_timeout' controls the
> interval; this value may affect the performance and
> response time of the TCP connection. 50000 (50us) is an
> empirical value that gives a performance improvement.
> Since the WHQL test sends packets every 100us, '300000'
> (300us) passes the test case, and it is also the default
> value. Tune it via the command line parameter
> 'rsc_interval' of the 'virtio-net-pci' device. For example,
> to launch a guest with the interval set to '500000':
>
> 'virtio-net-pci,netdev=hostnet1,bus=pci.0,id=net1,mac=00,rsc_interval=500000'
>
> The timer is triggered only if the packet pool is not empty,
> and it drains all the cached packets.
>
> 'NetRscChain' is used to save the segments of IPv4/6 in a
> VirtIONet device.
>
> A new segment becomes a 'Candidate' as soon as it passes the sanity check.
> The main TCP handler covers TCP window updates, duplicate ACK checks
> and the actual data coalescing.
>
> A 'Candidate' segment means:
> 1. Segment is within current window and the sequence is the expected one.
> 2. 'ACK' of the segment is in the valid window.
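
The two 'Candidate' conditions above rely on 32-bit sequence arithmetic. As a
rough sketch of the sequence-number half only (all names here are assumed for
illustration, values are in host byte order, and this is not the patch's
actual code), the comparison can be done modulo 2^32 so that wraparound is
handled naturally:

```c
#include <stdint.h>

#define MAX_TCP_PAYLOAD 65535u   /* same bound the patch uses */

enum seq_verdict {
    SEQ_EXPECTED,       /* new data starts right after the cached data */
    SEQ_SAME,           /* same sequence number: ACK handling territory */
    SEQ_OUT_OF_ORDER,   /* inside the window, but not contiguous */
    SEQ_OUT_OF_WINDOW   /* too far ahead, or behind (retransmission) */
};

/* cached_seq:     sequence number of the cached segment
 * cached_payload: payload bytes already held for that segment
 * new_seq:        sequence number of the arriving segment */
enum seq_verdict classify_seq(uint32_t cached_seq, uint16_t cached_payload,
                              uint32_t new_seq)
{
    /* Unsigned subtraction wraps modulo 2^32, so a sequence number that is
     * "behind" the cached one shows up as a very large delta. */
    uint32_t delta = new_seq - cached_seq;

    if (delta > MAX_TCP_PAYLOAD) {
        return SEQ_OUT_OF_WINDOW;
    }
    if (delta == 0) {
        return SEQ_SAME;
    }
    if (delta != cached_payload) {
        return SEQ_OUT_OF_ORDER;
    }
    return SEQ_EXPECTED;
}
```

For example, a cached segment at sequence 0xFFFFFF00 holding 0x200 bytes is
followed by sequence 0x100 after the counter wraps, and the sketch still
reports SEQ_EXPECTED.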
>
> The sanity check covers:
> 1. An incorrect version in the IP header
> 2. IP options or an IP fragment
> 3. Not a TCP packet
> 4. A size check to prevent buffer overflow attacks
> 5. An ECN packet
>
> There might be more cases that should be considered, such as the IP
> identification and other flags, but checking them breaks the test because
> Windows sets the identification to the same value even when the packet is
> not a fragment.
>
> There are two typical ways to handle a TCP control flag: 'bypass' and
> 'finalize'. 'bypass' means the packet should be sent out directly.
> 'finalize' means the packet should also be bypassed, but only after
> searching the pool for packets of the same connection and draining all
> of them; this avoids out-of-order delivery.
>
> All 'SYN' packets are bypassed since they always begin a new
> connection. Other flags such as 'URG/FIN/RST/CWR/ECE' trigger a
> finalization, because they normally appear when a connection is about
> to be closed; a 'URG' packet also finalizes the current coalescing unit.
>
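
A minimal sketch of the bypass/finalize split described above, assuming the
TH_* flag values from include/net/eth.h (including the TH_ECE and TH_CWR
values this patch adds). classify_tcp_flags() and the enum are hypothetical
names, and checking SYN before the other flags is an assumption of this
sketch, not something the commit message states:

```c
#include <stdint.h>

/* TCP flag bits, as in include/net/eth.h plus the two added by the patch. */
#define TH_FIN  0x01
#define TH_SYN  0x02
#define TH_RST  0x04
#define TH_PUSH 0x08
#define TH_ACK  0x10
#define TH_URG  0x20
#define TH_ECE  0x40
#define TH_CWR  0x80

enum ctrl_action {
    CTRL_CANDIDATE,   /* no control flag of interest: may be coalesced */
    CTRL_BYPASS,      /* deliver directly, leave the pool alone */
    CTRL_FINALIZE     /* drain same-connection packets first, then deliver */
};

/* Hypothetical classifier for the low 8 TCP flag bits. */
enum ctrl_action classify_tcp_flags(uint8_t flags)
{
    if (flags & TH_SYN) {
        /* A SYN always begins a new connection, so nothing is cached yet. */
        return CTRL_BYPASS;
    }
    if (flags & (TH_FIN | TH_URG | TH_RST | TH_ECE | TH_CWR)) {
        return CTRL_FINALIZE;
    }
    /* Plain ACK and PUSH|ACK data segments stay eligible for coalescing. */
    return CTRL_CANDIDATE;
}
```

Under these assumptions a FIN|ACK segment is finalized, while a PUSH|ACK data
segment remains a coalescing candidate.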
> Statistics can be used to monitor the basic coalescing status. The
> 'out of order' and 'out of window' counters show how many packets were
> retransmitted, and thus describe the performance intuitively.
>
> Signed-off-by: Wei Xu <wexu@redhat.com>
> ---
>   hw/net/virtio-net.c                         | 602 ++++++++++++++++++++++++++--
>   include/hw/virtio/virtio-net.h              |   5 +-
>   include/hw/virtio/virtio.h                  |  76 ++++
>   include/net/eth.h                           |   2 +
>   include/standard-headers/linux/virtio_net.h |  14 +
>   net/tap.c                                   |   3 +-
>   6 files changed, 670 insertions(+), 32 deletions(-)
>
> diff --git a/hw/net/virtio-net.c b/hw/net/virtio-net.c
> index 06bfe4b..d1824d9 100644
> --- a/hw/net/virtio-net.c
> +++ b/hw/net/virtio-net.c
> @@ -15,10 +15,12 @@
>   #include "qemu/iov.h"
>   #include "hw/virtio/virtio.h"
>   #include "net/net.h"
> +#include "net/eth.h"
>   #include "net/checksum.h"
>   #include "net/tap.h"
>   #include "qemu/error-report.h"
>   #include "qemu/timer.h"
> +#include "qemu/sockets.h"
>   #include "hw/virtio/virtio-net.h"
>   #include "net/vhost_net.h"
>   #include "hw/virtio/virtio-bus.h"
> @@ -43,6 +45,24 @@
>   #define endof(container, field) \
>       (offsetof(container, field) + sizeof(((container *)0)->field))
>   
> +#define VIRTIO_NET_IP4_ADDR_SIZE   8        /* ipv4 saddr + daddr */

This is only used once in the code; I don't see much value in this macro.

> +
> +#define VIRTIO_NET_TCP_FLAG         0x3F
> +#define VIRTIO_NET_TCP_HDR_LENGTH   0xF000
> +
> +/* IPv4 max payload, 16 bits in the header */
> +#define VIRTIO_NET_MAX_IP4_PAYLOAD (65535 - sizeof(struct ip_header))
> +#define VIRTIO_NET_MAX_TCP_PAYLOAD 65535
> +
> +/* header length value in ip header without option */
> +#define VIRTIO_NET_IP4_HEADER_LENGTH 5
> +
> +/* Purge coalesced packets timer interval, This value affects the performance
> +   a lot, and should be tuned carefully, '300000'(300us) is the recommended
> +   value to pass the WHQL test, '50000' can gain 2x netperf throughput with
> +   tso/gso/gro 'off'. */
> +#define VIRTIO_NET_RSC_INTERVAL  300000

This should be a property of virtio-net, and the above comment can be
the description of the property.

> +
>   typedef struct VirtIOFeature {
>       uint32_t flags;
>       size_t end;
> @@ -589,7 +609,12 @@ static uint64_t virtio_net_guest_offloads_by_features(uint32_t features)
>           (1ULL << VIRTIO_NET_F_GUEST_ECN)  |
>           (1ULL << VIRTIO_NET_F_GUEST_UFO);
>   
> -    return guest_offloads_mask & features;
> +    if (features & VIRTIO_NET_F_CTRL_GUEST_OFFLOADS) {
> +        return (guest_offloads_mask & features) |
> +               (1ULL << VIRTIO_NET_F_GUEST_RSC4);

Why do we need to care about this? I believe RSC has nothing to do with
the peer's offload setting.

> +    } else {
> +        return guest_offloads_mask & features;
> +    }
>   }
>   
>   static inline uint64_t virtio_net_supported_guest_offloads(VirtIONet *n)
> @@ -600,6 +625,7 @@ static inline uint64_t virtio_net_supported_guest_offloads(VirtIONet *n)
>   
>   static void virtio_net_set_features(VirtIODevice *vdev, uint64_t features)
>   {
> +    NetClientState *nc;
>       VirtIONet *n = VIRTIO_NET(vdev);
>       int i;
>   
> @@ -612,6 +638,22 @@ static void virtio_net_set_features(VirtIODevice *vdev, uint64_t features)
>                                  virtio_has_feature(features,
>                                                     VIRTIO_F_VERSION_1));
>   
> +    if (virtio_has_feature(features, VIRTIO_NET_F_CTRL_GUEST_OFFLOADS)) {
> +        n->guest_hdr_len = sizeof(struct virtio_net_hdr_rsc);

I'm confused, and don't see the connection here. You check
CTRL_GUEST_OFFLOADS but set the RSC header here; I don't think
CTRL_GUEST_OFFLOADS implies RSC.

> +        n->host_hdr_len = n->guest_hdr_len;
> +        n->has_rsc_hdr = 1;

Why do we need this extra flag? Can't we just check the RSC feature instead?

> +
> +        for (i = 0; i < n->max_queues; i++) {
> +            nc = qemu_get_subqueue(n->nic, i);
> +
> +            if (peer_has_vnet_hdr(n) &&
> +                qemu_has_vnet_hdr_len(nc->peer, n->guest_hdr_len)) {
> +                qemu_set_vnet_hdr_len(nc->peer, n->guest_hdr_len);
> +                n->host_hdr_len = n->guest_hdr_len;
> +            }
> +        }
> +    }

The hdr len setting needs to move to a separate helper, otherwise it may be
set twice: once for mrg_rxbuf and once for RSC.
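
As a rough illustration of what such a helper could look like, here is a
self-contained sketch that picks the header length in one place. Everything
in it is an assumption made for illustration: the struct names are stand-ins
for the virtio_net_hdr variants, guest_hdr_len_for() is a hypothetical name,
and giving RSC precedence over the other two cases is a guess, not something
the patch defines.

```c
#include <stdbool.h>
#include <stddef.h>
#include <stdint.h>

/* Stand-ins for the three header layouts discussed in this thread. */
struct hdr_legacy {                /* like struct virtio_net_hdr */
    uint8_t  flags;
    uint8_t  gso_type;
    uint16_t hdr_len;
    uint16_t gso_size;
    uint16_t csum_start;
    uint16_t csum_offset;
};

struct hdr_mrg_rxbuf {             /* like struct virtio_net_hdr_mrg_rxbuf */
    struct hdr_legacy hdr;
    uint16_t num_buffers;
};

struct hdr_rsc {                   /* like the proposed virtio_net_hdr_rsc */
    struct hdr_mrg_rxbuf hdr;
    uint16_t rsc_pkts;
    uint16_t rsc_dup_acks;
};

/* Hypothetical helper: decide the guest header length exactly once,
 * from the negotiated features, instead of in two separate code paths. */
size_t guest_hdr_len_for(bool version_1, bool mrg_rxbuf, bool rsc)
{
    if (rsc) {
        return sizeof(struct hdr_rsc);
    }
    if (version_1 || mrg_rxbuf) {
        return sizeof(struct hdr_mrg_rxbuf);
    }
    return sizeof(struct hdr_legacy);
}
```

On common ABIs the three stand-in structs are 10, 12 and 16 bytes, so the
4 extra RSC bytes line up with the 4-byte copy in the quoted receive path.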

> +
>       if (n->has_vnet_hdr) {
>           n->curr_guest_offloads =
>               virtio_net_guest_offloads_by_features(features);
> @@ -1097,7 +1139,8 @@ static int receive_filter(VirtIONet *n, const uint8_t *buf, int size)
>       return 0;
>   }
>   
> -static ssize_t virtio_net_receive(NetClientState *nc, const uint8_t *buf, size_t size)
> +static ssize_t virtio_net_do_receive(NetClientState *nc,
> +                                     const uint8_t *buf, size_t size)
>   {
>       VirtIONet *n = qemu_get_nic_opaque(nc);
>       VirtIONetQueue *q = virtio_net_get_subqueue(nc);
> @@ -1161,6 +1204,12 @@ static ssize_t virtio_net_receive(NetClientState *nc, const uint8_t *buf, size_t
>               }
>   
>               receive_header(n, sg, elem->in_num, buf, size);
> +
> +            if (n->has_rsc_hdr) {
> +                offset = sizeof(struct virtio_net_hdr_mrg_rxbuf);
> +                iov_from_buf(sg, elem->in_num, offset, \
> +                             buf + offset, 4);

I don't see why this is needed.

> +            }
>               offset = n->host_hdr_len;
>               total += n->guest_hdr_len;
>               guest_offset = n->guest_hdr_len;
> @@ -1239,7 +1288,7 @@ static int32_t virtio_net_flush_tx(VirtIONetQueue *q)
>           ssize_t ret;
>           unsigned int out_num;
>           struct iovec sg[VIRTQUEUE_MAX_SIZE], sg2[VIRTQUEUE_MAX_SIZE + 1], *out_sg;
> -        struct virtio_net_hdr_mrg_rxbuf mhdr;
> +        struct virtio_net_hdr_rsc rsc_hdr;
>   
>           elem = virtqueue_pop(q->tx_vq, sizeof(VirtQueueElement));
>           if (!elem) {
> @@ -1256,26 +1305,27 @@ static int32_t virtio_net_flush_tx(VirtIONetQueue *q)
>           }
>   
>           if (n->has_vnet_hdr) {
> -            if (iov_to_buf(out_sg, out_num, 0, &mhdr, n->guest_hdr_len) <
> +            if (iov_to_buf(out_sg, out_num, 0, &rsc_hdr, n->guest_hdr_len) <
>                   n->guest_hdr_len) {
>                   virtio_error(vdev, "virtio-net header incorrect");
>                   virtqueue_detach_element(q->tx_vq, elem, 0);
>                   g_free(elem);
>                   return -EINVAL;
>               }
> +

Unnecessary newline.

>               if (n->needs_vnet_hdr_swap) {
> -                virtio_net_hdr_swap(vdev, (void *) &mhdr);
> -                sg2[0].iov_base = &mhdr;
> +                virtio_net_hdr_swap(vdev, (void *) &rsc_hdr);
> +                sg2[0].iov_base = &rsc_hdr;
>                   sg2[0].iov_len = n->guest_hdr_len;
>                   out_num = iov_copy(&sg2[1], ARRAY_SIZE(sg2) - 1,
>                                      out_sg, out_num,
>                                      n->guest_hdr_len, -1);
>                   if (out_num == VIRTQUEUE_MAX_SIZE) {
>                       goto drop;
> -		}
> +                }

Unnecessary change.

>                   out_num += 1;
>                   out_sg = sg2;
> -	    }
> +            }

Here too.

>           }
>           /*
>            * If host wants to see the guest header as is, we can
> @@ -1562,8 +1612,12 @@ static int virtio_net_load_device(VirtIODevice *vdev, QEMUFile *f,
>                                  virtio_vdev_has_feature(vdev,
>                                                          VIRTIO_F_VERSION_1));
>   
> -    n->status = qemu_get_be16(f);
> +    if (virtio_vdev_has_feature(vdev, VIRTIO_NET_F_GUEST_RSC4)) {
> +        n->guest_hdr_len = sizeof(struct virtio_net_hdr_rsc);
> +        n->host_hdr_len = n->guest_hdr_len;
> +    }

Why is this needed? Btw, guest-visible features need to be kept unchanged
through the qemu cli during migration.

>   
> +    n->status = qemu_get_be16(f);
>       n->promisc = qemu_get_byte(f);
>       n->allmulti = qemu_get_byte(f);
>   
> @@ -1660,6 +1714,487 @@ static int virtio_net_load_device(VirtIODevice *vdev, QEMUFile *f,
>       return 0;
>   }
>   
> +static void virtio_net_rsc_extract_unit4(NetRscChain *chain,
> +                                         const uint8_t *buf, NetRscUnit* unit)
> +{
> +    uint16_t ip_hdrlen;
> +    struct ip_header *ip;
> +
> +    ip = (struct ip_header *)(buf + chain->n->guest_hdr_len
> +                              + sizeof(struct eth_header));
> +    unit->ip = (void *)ip;
> +    ip_hdrlen = (ip->ip_ver_len & 0xF) << 2;
> +    unit->ip_plen = &ip->ip_len;
> +    unit->tcp = (struct tcp_header *)(((uint8_t *)unit->ip) + ip_hdrlen);
> +    unit->tcp_hdrlen = (htons(unit->tcp->th_offset_flags) & 0xF000) >> 10;
> +    unit->payload = htons(*unit->ip_plen) - ip_hdrlen - unit->tcp_hdrlen;
> +}
> +
> +static void virtio_net_rsc_ipv4_checksum(struct virtio_net_hdr_rsc *rhdr,
> +                                         struct ip_header *ip)
> +{
> +    uint32_t sum;
> +    struct virtio_net_hdr *vhdr = (struct virtio_net_hdr *)rhdr;
> +
> +    ip->ip_sum = 0;
> +    sum = net_checksum_add_cont(sizeof(struct ip_header), (uint8_t *)ip, 0);
> +    ip->ip_sum = cpu_to_be16(net_checksum_finish(sum));
> +    vhdr->flags &= ~VIRTIO_NET_HDR_F_NEEDS_CSUM;
> +    vhdr->flags |= VIRTIO_NET_HDR_F_DATA_VALID;
> +}
> +
> +static size_t virtio_net_rsc_drain_seg(NetRscChain *chain, NetRscSeg *seg)
> +{
> +    int ret;
> +    struct virtio_net_hdr_rsc *h;
> +
> +    h = (struct virtio_net_hdr_rsc *)seg->buf;
> +    if (seg->is_coalesced) {
> +        h->hdr.flags = VIRTIO_NET_HDR_RSC_TCPV4;
> +        virtio_net_rsc_ipv4_checksum(h, seg->unit.ip);
> +    }
> +
> +    h = (struct virtio_net_hdr_rsc *)seg->buf;
> +    virtio_net_rsc_ipv4_checksum(h, seg->unit.ip);
> +    h->rsc_pkts = seg->packets;
> +    h->rsc_dup_acks = seg->dup_ack;
> +    ret = virtio_net_do_receive(seg->nc, seg->buf, seg->size);
> +    QTAILQ_REMOVE(&chain->buffers, seg, next);
> +    g_free(seg->buf);
> +    g_free(seg);
> +
> +    return ret;
> +}
> +
> +static void virtio_net_rsc_purge(void *opq)
> +{
> +    NetRscSeg *seg, *rn;
> +    NetRscChain *chain = (NetRscChain *)opq;
> +
> +    QTAILQ_FOREACH_SAFE(seg, &chain->buffers, next, rn) {
> +        if (virtio_net_rsc_drain_seg(chain, seg) == 0) {
> +            chain->stat.purge_failed++;
> +            continue;
> +        }
> +    }
> +
> +    chain->stat.timer++;
> +    if (!QTAILQ_EMPTY(&chain->buffers)) {
> +        timer_mod(chain->drain_timer,
> +              qemu_clock_get_ns(QEMU_CLOCK_HOST) + chain->n->rsc_timeout);
> +    }
> +}
> +
> +static void virtio_net_rsc_cleanup(VirtIONet *n)
> +{
> +    NetRscChain *chain, *rn_chain;
> +    NetRscSeg *seg, *rn_seg;
> +
> +    QTAILQ_FOREACH_SAFE(chain, &n->rsc_chains, next, rn_chain) {
> +        QTAILQ_FOREACH_SAFE(seg, &chain->buffers, next, rn_seg) {
> +            QTAILQ_REMOVE(&chain->buffers, seg, next);
> +            g_free(seg->buf);
> +            g_free(seg);
> +        }
> +
> +        timer_del(chain->drain_timer);
> +        timer_free(chain->drain_timer);
> +        QTAILQ_REMOVE(&n->rsc_chains, chain, next);
> +        g_free(chain);
> +    }
> +}
> +
> +static void virtio_net_rsc_cache_buf(NetRscChain *chain, NetClientState *nc,
> +                                     const uint8_t *buf, size_t size)
> +{
> +    uint16_t hdr_len;
> +    NetRscSeg *seg;
> +
> +    hdr_len = chain->n->guest_hdr_len;
> +    seg = g_malloc(sizeof(NetRscSeg));
> +    seg->buf = g_malloc(hdr_len + sizeof(struct eth_header)\
> +                   + VIRTIO_NET_MAX_TCP_PAYLOAD);
> +    memcpy(seg->buf, buf, size);
> +    seg->size = size;
> +    seg->packets = 1;
> +    seg->dup_ack = 0;
> +    seg->is_coalesced = 0;
> +    seg->nc = nc;
> +
> +    QTAILQ_INSERT_TAIL(&chain->buffers, seg, next);
> +    chain->stat.cache++;
> +
> +    virtio_net_rsc_extract_unit4(chain, seg->buf, &seg->unit);
> +}
> +
> +static int32_t virtio_net_rsc_handle_ack(NetRscChain *chain,
> +                                         NetRscSeg *seg, const uint8_t *buf,
> +                                         struct tcp_header *n_tcp,
> +                                         struct tcp_header *o_tcp)
> +{
> +    uint32_t nack, oack;
> +    uint16_t nwin, owin;
> +
> +    nack = htonl(n_tcp->th_ack);
> +    nwin = htons(n_tcp->th_win);
> +    oack = htonl(o_tcp->th_ack);
> +    owin = htons(o_tcp->th_win);
> +
> +    if ((nack - oack) >= VIRTIO_NET_MAX_TCP_PAYLOAD) {
> +        chain->stat.ack_out_of_win++;
> +        return RSC_FINAL;
> +    } else if (nack == oack) {
> +        /* duplicated ack or window probe */
> +        if (nwin == owin) {
> +            /* duplicated ack, add dup ack count due to whql test up to 1 */
> +            chain->stat.dup_ack++;
> +            return RSC_FINAL;
> +        } else {
> +            /* Coalesce window update */
> +            o_tcp->th_win = n_tcp->th_win;
> +            chain->stat.win_update++;
> +            return RSC_COALESCE;
> +        }
> +    } else {
> +        /* pure ack, go to 'C', finalize*/
> +        chain->stat.pure_ack++;
> +        return RSC_FINAL;
> +    }
> +}
> +
> +static int32_t virtio_net_rsc_coalesce_data(NetRscChain *chain,
> +                                            NetRscSeg *seg, const uint8_t *buf,
> +                                            NetRscUnit *n_unit)
> +{
> +    void *data;
> +    uint16_t o_ip_len;
> +    uint32_t nseq, oseq;
> +    NetRscUnit *o_unit;
> +
> +    o_unit = &seg->unit;
> +    o_ip_len = htons(*o_unit->ip_plen);
> +    nseq = htonl(n_unit->tcp->th_seq);
> +    oseq = htonl(o_unit->tcp->th_seq);
> +
> +    /* out of order or retransmitted. */
> +    if ((nseq - oseq) > VIRTIO_NET_MAX_TCP_PAYLOAD) {
> +        chain->stat.data_out_of_win++;
> +        return RSC_FINAL;
> +    }
> +
> +    data = ((uint8_t *)n_unit->tcp) + n_unit->tcp_hdrlen;
> +    if (nseq == oseq) {
> +        if ((o_unit->payload == 0) && n_unit->payload) {
> +            /* From no payload to payload, normal case, not a dup ack or etc */
> +            chain->stat.data_after_pure_ack++;
> +            goto coalesce;
> +        } else {
> +            return virtio_net_rsc_handle_ack(chain, seg, buf,
> +                                             n_unit->tcp, o_unit->tcp);
> +        }
> +    } else if ((nseq - oseq) != o_unit->payload) {
> +        /* Not a consistent packet, out of order */
> +        chain->stat.data_out_of_order++;
> +        return RSC_FINAL;
> +    } else {
> +coalesce:
> +        if ((o_ip_len + n_unit->payload) > chain->max_payload) {
> +            chain->stat.over_size++;
> +            return RSC_FINAL;
> +        }
> +
> +        /* Here comes the right data, the payload length in v4/v6 is different,
> +           so use the field value to update and record the new data len */
> +        o_unit->payload += n_unit->payload; /* update new data len */
> +
> +        /* update field in ip header */
> +        *o_unit->ip_plen = htons(o_ip_len + n_unit->payload);
> +
> +        /* Bring 'PUSH' big, the whql test guide says 'PUSH' can be coalesced
> +           for windows guest, while this may change the behavior for linux
> +           guest. */
> +        o_unit->tcp->th_offset_flags = n_unit->tcp->th_offset_flags;
> +
> +        o_unit->tcp->th_ack = n_unit->tcp->th_ack;
> +        o_unit->tcp->th_win = n_unit->tcp->th_win;
> +
> +        memmove(seg->buf + seg->size, data, n_unit->payload);
> +        seg->size += n_unit->payload;
> +        seg->packets++;
> +        chain->stat.coalesced++;
> +        return RSC_COALESCE;
> +    }
> +}
> +
> +static int32_t virtio_net_rsc_coalesce4(NetRscChain *chain, NetRscSeg *seg,
> +                                        const uint8_t *buf, size_t size,
> +                                        NetRscUnit *unit)
> +{
> +    struct ip_header *ip1, *ip2;
> +
> +    ip1 = (struct ip_header *)(unit->ip);
> +    ip2 = (struct ip_header *)(seg->unit.ip);
> +    if ((ip1->ip_src ^ ip2->ip_src) || (ip1->ip_dst ^ ip2->ip_dst)
> +        || (unit->tcp->th_sport ^ seg->unit.tcp->th_sport)
> +        || (unit->tcp->th_dport ^ seg->unit.tcp->th_dport)) {
> +        chain->stat.no_match++;
> +        return RSC_NO_MATCH;
> +    }
> +
> +    return virtio_net_rsc_coalesce_data(chain, seg, buf, unit);
> +}
> +
> +/* Pakcets with 'SYN' should bypass, other flag should be sent after drain
> + * to prevent out of order */
> +static int virtio_net_rsc_tcp_ctrl_check(NetRscChain *chain,
> +                                         struct tcp_header *tcp)
> +{
> +    uint16_t tcp_hdr;
> +    uint16_t tcp_flag;
> +
> +    tcp_flag = htons(tcp->th_offset_flags);
> +    tcp_hdr = (tcp_flag & VIRTIO_NET_TCP_HDR_LENGTH) >> 10;
> +    tcp_flag &= VIRTIO_NET_TCP_FLAG;
> +    tcp_flag = htons(tcp->th_offset_flags) & 0x3F;
> +    if (tcp_flag & TH_SYN) {
> +        chain->stat.tcp_syn++;
> +        return RSC_BYPASS;
> +    }
> +
> +    if (tcp_flag & (TH_FIN | TH_URG | TH_RST | TH_ECE | TH_CWR)) {
> +        chain->stat.tcp_ctrl_drain++;
> +        return RSC_FINAL;
> +    }
> +
> +    if (tcp_hdr > sizeof(struct tcp_header)) {
> +        chain->stat.tcp_all_opt++;
> +        return RSC_FINAL;
> +    }
> +
> +    return RSC_CANDIDATE;
> +}
> +
> +static size_t virtio_net_rsc_do_coalesce(NetRscChain *chain, NetClientState *nc,
> +                                         const uint8_t *buf, size_t size,
> +                                         NetRscUnit *unit)
> +{
> +    int ret;
> +    NetRscSeg *seg, *nseg;
> +
> +    if (QTAILQ_EMPTY(&chain->buffers)) {
> +        chain->stat.empty_cache++;
> +        virtio_net_rsc_cache_buf(chain, nc, buf, size);
> +        timer_mod(chain->drain_timer,
> +              qemu_clock_get_ns(QEMU_CLOCK_HOST) + chain->n->rsc_timeout);
> +        return size;
> +    }
> +
> +    QTAILQ_FOREACH_SAFE(seg, &chain->buffers, next, nseg) {
> +        ret = virtio_net_rsc_coalesce4(chain, seg, buf, size, unit);
> +
> +        if (ret == RSC_FINAL) {
> +            if (virtio_net_rsc_drain_seg(chain, seg) == 0) {
> +                /* Send failed */
> +                chain->stat.final_failed++;
> +                return 0;
> +            }
> +
> +            /* Send current packet */
> +            return virtio_net_do_receive(nc, buf, size);
> +        } else if (ret == RSC_NO_MATCH) {
> +            continue;
> +        } else {
> +            /* Coalesced, mark coalesced flag to tell calc cksum for ipv4 */
> +            seg->is_coalesced = 1;
> +            return size;
> +        }
> +    }
> +
> +    chain->stat.no_match_cache++;
> +    virtio_net_rsc_cache_buf(chain, nc, buf, size);
> +    return size;
> +}
> +
> +/* Drain a connection data, this is to avoid out of order segments */
> +static size_t virtio_net_rsc_drain_flow(NetRscChain *chain, NetClientState *nc,
> +                                        const uint8_t *buf, size_t size,
> +                                        uint16_t ip_start, uint16_t ip_size,
> +                                        uint16_t tcp_port)
> +{
> +    NetRscSeg *seg, *nseg;
> +    uint32_t ppair1, ppair2;
> +
> +    ppair1 = *(uint32_t *)(buf + tcp_port);
> +    QTAILQ_FOREACH_SAFE(seg, &chain->buffers, next, nseg) {
> +        ppair2 = *(uint32_t *)(seg->buf + tcp_port);
> +        if (memcmp(buf + ip_start, seg->buf + ip_start, ip_size)
> +            || (ppair1 != ppair2)) {
> +            continue;
> +        }
> +        if (virtio_net_rsc_drain_seg(chain, seg) == 0) {
> +            chain->stat.drain_failed++;
> +        }
> +
> +        break;
> +    }
> +
> +    return virtio_net_do_receive(nc, buf, size);
> +}
> +
> +static int32_t virtio_net_rsc_sanity_check4(NetRscChain *chain,
> +                                            struct ip_header *ip,
> +                                            const uint8_t *buf, size_t size)
> +{
> +    uint16_t ip_len;
> +
> +    /* Not an ipv4 packet */
> +    if (((ip->ip_ver_len & 0xF0) >> 4) != IP_HEADER_VERSION_4) {
> +        chain->stat.ip_option++;
> +        return RSC_BYPASS;
> +    }
> +
> +    /* Don't handle packets with ip option */
> +    if ((ip->ip_ver_len & 0xF) != VIRTIO_NET_IP4_HEADER_LENGTH) {
> +        chain->stat.ip_option++;
> +        return RSC_BYPASS;
> +    }
> +
> +    if (ip->ip_p != IPPROTO_TCP) {
> +        chain->stat.bypass_not_tcp++;
> +        return RSC_BYPASS;
> +    }
> +
> +    /* Don't handle packets with ip fragment */
> +    if (!(htons(ip->ip_off) & IP_DF)) {
> +        chain->stat.ip_frag++;
> +        return RSC_BYPASS;
> +    }
> +
> +    /* Don't handle packets with ecn flag */
> +    if (IPTOS_ECN(ip->ip_tos)) {
> +        chain->stat.ip_ecn++;
> +        return RSC_BYPASS;
> +    }
> +
> +    ip_len = htons(ip->ip_len);
> +    if (ip_len < (sizeof(struct ip_header) + sizeof(struct tcp_header))
> +        || ip_len > (size - chain->n->guest_hdr_len -
> +                     sizeof(struct eth_header))) {
> +        chain->stat.ip_hacked++;
> +        return RSC_BYPASS;
> +    }
> +
> +    return RSC_CANDIDATE;
> +}
> +
> +static size_t virtio_net_rsc_receive4(NetRscChain *chain, NetClientState* nc,
> +                                      const uint8_t *buf, size_t size)
> +{
> +    int32_t ret;
> +    uint16_t hdr_len;
> +    NetRscUnit unit;
> +
> +    hdr_len = ((VirtIONet *)(chain->n))->guest_hdr_len;
> +
> +    if (size < (hdr_len + sizeof(struct eth_header) + sizeof(struct ip_header)
> +        + sizeof(struct tcp_header))) {
> +        chain->stat.bypass_not_tcp++;
> +        return virtio_net_do_receive(nc, buf, size);
> +    }
> +
> +    virtio_net_rsc_extract_unit4(chain, buf, &unit);
> +    if (virtio_net_rsc_sanity_check4(chain, unit.ip, buf, size)
> +        != RSC_CANDIDATE) {
> +        return virtio_net_do_receive(nc, buf, size);
> +    }
> +
> +    ret = virtio_net_rsc_tcp_ctrl_check(chain, unit.tcp);
> +    if (ret == RSC_BYPASS) {
> +        return virtio_net_do_receive(nc, buf, size);
> +    } else if (ret == RSC_FINAL) {
> +        return virtio_net_rsc_drain_flow(chain, nc, buf, size,
> +                ((hdr_len + sizeof(struct eth_header)) + 12),
> +                VIRTIO_NET_IP4_ADDR_SIZE,
> +                hdr_len + sizeof(struct eth_header) + sizeof(struct ip_header));
> +    }
> +
> +    return virtio_net_rsc_do_coalesce(chain, nc, buf, size, &unit);
> +}
> +
> +static NetRscChain *virtio_net_rsc_lookup_chain(VirtIONet * n,
> +                                                NetClientState *nc,
> +                                                uint16_t proto)
> +{
> +    NetRscChain *chain;
> +
> +    if (proto != (uint16_t)ETH_P_IP) {
> +        return NULL;
> +    }
> +
> +    QTAILQ_FOREACH(chain, &n->rsc_chains, next) {
> +        if (chain->proto == proto) {
> +            return chain;
> +        }
> +    }
> +
> +    chain = g_malloc(sizeof(*chain));
> +    chain->n = n;
> +    chain->proto = proto;
> +    chain->max_payload = VIRTIO_NET_MAX_IP4_PAYLOAD;
> +    chain->gso_type = VIRTIO_NET_HDR_GSO_TCPV4;
> +    chain->drain_timer = timer_new_ns(QEMU_CLOCK_HOST,
> +                                      virtio_net_rsc_purge, chain);
> +    memset(&chain->stat, 0, sizeof(chain->stat));
> +
> +    QTAILQ_INIT(&chain->buffers);
> +    QTAILQ_INSERT_TAIL(&n->rsc_chains, chain, next);
> +
> +    return chain;
> +}
> +
> +static ssize_t virtio_net_rsc_receive(NetClientState *nc,
> +                                      const uint8_t *buf, size_t size)
> +{
> +    uint16_t proto;
> +    NetRscChain *chain;
> +    struct eth_header *eth;
> +    VirtIONet *n;
> +
> +    n = qemu_get_nic_opaque(nc);
> +    if (size < (n->host_hdr_len + sizeof(struct eth_header))) {
> +        return virtio_net_do_receive(nc, buf, size);
> +    }
> +
> +    eth = (struct eth_header *)(buf + n->guest_hdr_len);
> +    proto = htons(eth->h_proto);
> +
> +    chain = virtio_net_rsc_lookup_chain(n, nc, proto);
> +    if (!chain) {
> +        return virtio_net_do_receive(nc, buf, size);
> +    } else {
> +        chain->stat.received++;
> +        return virtio_net_rsc_receive4(chain, nc, buf, size);
> +    }
> +}
> +
> +static ssize_t virtio_net_receive(NetClientState *nc,
> +                                  const uint8_t *buf, size_t size)
> +{
> +    VirtIONet *n;
> +    struct virtio_net_hdr_rsc *h;
> +
> +    n = qemu_get_nic_opaque(nc);
> +    if (n->curr_guest_offloads & (1ULL << VIRTIO_NET_F_GUEST_RSC4)) {
> +        h = (struct virtio_net_hdr_rsc *)buf;
> +        h->hdr.flags = VIRTIO_NET_HDR_RSC_NONE;
> +        h->rsc_pkts = 0;
> +        h->rsc_dup_acks = 0;
> +        return virtio_net_rsc_receive(nc, buf, size);
> +    } else {
> +        return virtio_net_do_receive(nc, buf, size);
> +    }
> +}
> +
>   static NetClientInfo net_virtio_info = {
>       .type = NET_CLIENT_DRIVER_NIC,
>       .size = sizeof(NICState),
> @@ -1805,6 +2340,7 @@ static void virtio_net_device_realize(DeviceState *dev, Error **errp)
>       nc = qemu_get_queue(n->nic);
>       nc->rxfilter_notify_enabled = 1;
>   
> +    QTAILQ_INIT(&n->rsc_chains);
>       n->qdev = dev;
>   }
>   
> @@ -1835,6 +2371,7 @@ static void virtio_net_device_unrealize(DeviceState *dev, Error **errp)
>       g_free(n->vqs);
>       qemu_del_nic(n->nic);
>       virtio_cleanup(vdev);
> +    virtio_net_rsc_cleanup(n);
>   }
>   
>   static void virtio_net_instance_init(Object *obj)
> @@ -1872,45 +2409,46 @@ static const VMStateDescription vmstate_virtio_net = {
>   };
>   
>   static Property virtio_net_properties[] = {
> -    DEFINE_PROP_BIT("csum", VirtIONet, host_features, VIRTIO_NET_F_CSUM, true),
> -    DEFINE_PROP_BIT("guest_csum", VirtIONet, host_features,
> +    DEFINE_PROP_BIT64("csum", VirtIONet, host_features,
> +                    VIRTIO_NET_F_CSUM, true),
> +    DEFINE_PROP_BIT64("guest_csum", VirtIONet, host_features,
>                       VIRTIO_NET_F_GUEST_CSUM, true),
> -    DEFINE_PROP_BIT("gso", VirtIONet, host_features, VIRTIO_NET_F_GSO, true),
> -    DEFINE_PROP_BIT("guest_tso4", VirtIONet, host_features,
> +    DEFINE_PROP_BIT64("gso", VirtIONet, host_features, VIRTIO_NET_F_GSO, true),
> +    DEFINE_PROP_BIT64("guest_tso4", VirtIONet, host_features,
>                       VIRTIO_NET_F_GUEST_TSO4, true),
> -    DEFINE_PROP_BIT("guest_tso6", VirtIONet, host_features,
> +    DEFINE_PROP_BIT64("guest_tso6", VirtIONet, host_features,
>                       VIRTIO_NET_F_GUEST_TSO6, true),
> -    DEFINE_PROP_BIT("guest_ecn", VirtIONet, host_features,
> +    DEFINE_PROP_BIT64("guest_ecn", VirtIONet, host_features,
>                       VIRTIO_NET_F_GUEST_ECN, true),
> -    DEFINE_PROP_BIT("guest_ufo", VirtIONet, host_features,
> +    DEFINE_PROP_BIT64("guest_ufo", VirtIONet, host_features,
>                       VIRTIO_NET_F_GUEST_UFO, true),
> -    DEFINE_PROP_BIT("guest_announce", VirtIONet, host_features,
> +    DEFINE_PROP_BIT64("guest_announce", VirtIONet, host_features,
>                       VIRTIO_NET_F_GUEST_ANNOUNCE, true),
> -    DEFINE_PROP_BIT("host_tso4", VirtIONet, host_features,
> +    DEFINE_PROP_BIT64("host_tso4", VirtIONet, host_features,
>                       VIRTIO_NET_F_HOST_TSO4, true),
> -    DEFINE_PROP_BIT("host_tso6", VirtIONet, host_features,
> +    DEFINE_PROP_BIT64("host_tso6", VirtIONet, host_features,
>                       VIRTIO_NET_F_HOST_TSO6, true),
> -    DEFINE_PROP_BIT("host_ecn", VirtIONet, host_features,
> +    DEFINE_PROP_BIT64("host_ecn", VirtIONet, host_features,
>                       VIRTIO_NET_F_HOST_ECN, true),
> -    DEFINE_PROP_BIT("host_ufo", VirtIONet, host_features,
> +    DEFINE_PROP_BIT64("host_ufo", VirtIONet, host_features,
>                       VIRTIO_NET_F_HOST_UFO, true),
> -    DEFINE_PROP_BIT("mrg_rxbuf", VirtIONet, host_features,
> +    DEFINE_PROP_BIT64("mrg_rxbuf", VirtIONet, host_features,
>                       VIRTIO_NET_F_MRG_RXBUF, true),
> -    DEFINE_PROP_BIT("status", VirtIONet, host_features,
> +    DEFINE_PROP_BIT64("status", VirtIONet, host_features,
>                       VIRTIO_NET_F_STATUS, true),
> -    DEFINE_PROP_BIT("ctrl_vq", VirtIONet, host_features,
> +    DEFINE_PROP_BIT64("ctrl_vq", VirtIONet, host_features,
>                       VIRTIO_NET_F_CTRL_VQ, true),
> -    DEFINE_PROP_BIT("ctrl_rx", VirtIONet, host_features,
> +    DEFINE_PROP_BIT64("ctrl_rx", VirtIONet, host_features,
>                       VIRTIO_NET_F_CTRL_RX, true),
> -    DEFINE_PROP_BIT("ctrl_vlan", VirtIONet, host_features,
> +    DEFINE_PROP_BIT64("ctrl_vlan", VirtIONet, host_features,
>                       VIRTIO_NET_F_CTRL_VLAN, true),
> -    DEFINE_PROP_BIT("ctrl_rx_extra", VirtIONet, host_features,
> +    DEFINE_PROP_BIT64("ctrl_rx_extra", VirtIONet, host_features,
>                       VIRTIO_NET_F_CTRL_RX_EXTRA, true),
> -    DEFINE_PROP_BIT("ctrl_mac_addr", VirtIONet, host_features,
> +    DEFINE_PROP_BIT64("ctrl_mac_addr", VirtIONet, host_features,
>                       VIRTIO_NET_F_CTRL_MAC_ADDR, true),
> -    DEFINE_PROP_BIT("ctrl_guest_offloads", VirtIONet, host_features,
> +    DEFINE_PROP_BIT64("ctrl_guest_offloads", VirtIONet, host_features,
>                       VIRTIO_NET_F_CTRL_GUEST_OFFLOADS, true),
> -    DEFINE_PROP_BIT("mq", VirtIONet, host_features, VIRTIO_NET_F_MQ, false),
> +    DEFINE_PROP_BIT64("mq", VirtIONet, host_features, VIRTIO_NET_F_MQ, false),
>       DEFINE_NIC_PROPERTIES(VirtIONet, nic_conf),
>       DEFINE_PROP_UINT32("x-txtimer", VirtIONet, net_conf.txtimer,
>                          TX_TIMER_INTERVAL),
> @@ -1918,6 +2456,10 @@ static Property virtio_net_properties[] = {
>       DEFINE_PROP_STRING("tx", VirtIONet, net_conf.tx),
>       DEFINE_PROP_UINT16("rx_queue_size", VirtIONet, net_conf.rx_queue_size,
>                          VIRTIO_NET_RX_QUEUE_DEFAULT_SIZE),
> +    DEFINE_PROP_BIT64("guest_rsc4", VirtIONet, host_features,
> +                    VIRTIO_NET_F_GUEST_RSC4, true),

I don't get why DEFINE_XXX_BIT64 is needed; we still have bits left, I believe.

> +    DEFINE_PROP_UINT32("rsc_interval", VirtIONet, rsc_timeout,
> +                      VIRTIO_NET_RSC_INTERVAL),
>       DEFINE_PROP_END_OF_LIST(),
>   };
>   
> diff --git a/include/hw/virtio/virtio-net.h b/include/hw/virtio/virtio-net.h
> index 0ced975..56a8ce2 100644
> --- a/include/hw/virtio/virtio-net.h
> +++ b/include/hw/virtio/virtio-net.h
> @@ -60,12 +60,15 @@ typedef struct VirtIONet {
>       VirtIONetQueue *vqs;
>       VirtQueue *ctrl_vq;
>       NICState *nic;
> +    QTAILQ_HEAD(, NetRscChain) rsc_chains;
> +    uint32_t rsc_timeout;
>       uint32_t tx_timeout;
>       int32_t tx_burst;
>       uint32_t has_vnet_hdr;
> +    uint32_t has_rsc_hdr;
>       size_t host_hdr_len;
>       size_t guest_hdr_len;
> -    uint32_t host_features;
> +    uint64_t host_features;

Have we run out of host feature bits? If yes, this needs an independent patch.

>       uint8_t has_ufo;
>       int mergeable_rx_bufs;
>       uint8_t promisc;
> diff --git a/include/hw/virtio/virtio.h b/include/hw/virtio/virtio.h
> index b913aac..0006ce1 100644
> --- a/include/hw/virtio/virtio.h
> +++ b/include/hw/virtio/virtio.h
> @@ -30,6 +30,8 @@
>                                   (0x1ULL << VIRTIO_F_ANY_LAYOUT))
>   
>   struct VirtQueue;
> +struct VirtIONet;
> +typedef struct VirtIONet VirtIONet;
>   
>   static inline hwaddr vring_align(hwaddr addr,
>                                                unsigned long align)
> @@ -129,6 +131,80 @@ typedef struct VirtioDeviceClass {
>       int (*load)(VirtIODevice *vdev, QEMUFile *f, int version_id);
>   } VirtioDeviceClass;
>   
> +/* Coalesced packets type & status */
> +typedef enum {
> +    RSC_COALESCE,           /* Data been coalesced */

"Data has been" ?

> +    RSC_FINAL,              /* Will terminate current connection */
> +    RSC_NO_MATCH,           /* No matched in the buffer pool */
> +    RSC_BYPASS,             /* Packet to be bypass, not tcp, tcp ctrl, etc */

"to be bypassed" ?

> +    RSC_CANDIDATE                /* Data want to be coalesced */
> +} COALESCE_STATUS;
> +
> +typedef struct NetRscStat {
> +    uint32_t received;
> +    uint32_t coalesced;
> +    uint32_t over_size;
> +    uint32_t cache;
> +    uint32_t empty_cache;
> +    uint32_t no_match_cache;
> +    uint32_t win_update;
> +    uint32_t no_match;
> +    uint32_t tcp_syn;
> +    uint32_t tcp_ctrl_drain;
> +    uint32_t dup_ack;
> +    uint32_t dup_ack1;
> +    uint32_t dup_ack2;
> +    uint32_t pure_ack;
> +    uint32_t ack_out_of_win;
> +    uint32_t data_out_of_win;
> +    uint32_t data_out_of_order;
> +    uint32_t data_after_pure_ack;
> +    uint32_t bypass_not_tcp;
> +    uint32_t tcp_option;
> +    uint32_t tcp_all_opt;
> +    uint32_t ip_frag;
> +    uint32_t ip_ecn;
> +    uint32_t ip_hacked;
> +    uint32_t ip_option;
> +    uint32_t purge_failed;
> +    uint32_t drain_failed;
> +    uint32_t final_failed;
> +    int64_t  timer;
> +} NetRscStat;
> +
> +/* Rsc unit general info used to checking if can coalescing */
> +typedef struct NetRscUnit {
> +    void *ip;   /* ip header */
> +    uint16_t *ip_plen;      /* data len pointer in ip header field */
> +    struct tcp_header *tcp; /* tcp header */
> +    uint16_t tcp_hdrlen;    /* tcp header len */
> +    uint16_t payload;       /* pure payload without virtio/eth/ip/tcp */
> +} NetRscUnit;
> +
> +/* Coalesced segmant */
> +typedef struct NetRscSeg {
> +    QTAILQ_ENTRY(NetRscSeg) next;
> +    void *buf;
> +    size_t size;
> +    uint16_t packets;
> +    uint16_t dup_ack;
> +    bool is_coalesced;      /* need recal ipv4 header checksum, mark here */
> +    NetRscUnit unit;
> +    NetClientState *nc;
> +} NetRscSeg;
> +
> +/* Chain is divided by protocol(ipv4/v6) and NetClientInfo */
> +typedef struct NetRscChain {
> +    QTAILQ_ENTRY(NetRscChain) next;
> +    VirtIONet *n;                            /* VirtIONet */
> +    uint16_t proto;
> +    uint8_t  gso_type;
> +    uint16_t max_payload;
> +    QEMUTimer *drain_timer;
> +    QTAILQ_HEAD(, NetRscSeg) buffers;
> +    NetRscStat stat;
> +} NetRscChain;
> +

Why put the above in virtio.h? If it will not be used by other files, 
why does it need to be in a header file?

>   void virtio_instance_init_common(Object *proxy_obj, void *data,
>                                    size_t vdev_size, const char *vdev_name);
>   
> diff --git a/include/net/eth.h b/include/net/eth.h
> index 2013175..5952ef2 100644
> --- a/include/net/eth.h
> +++ b/include/net/eth.h
> @@ -177,6 +177,8 @@ struct tcp_hdr {
>   #define TH_PUSH 0x08
>   #define TH_ACK  0x10
>   #define TH_URG  0x20
> +#define TH_ECE  0x40
> +#define TH_CWR  0x80

Let's put this in another patch.

>       u_short th_win;      /* window */
>       u_short th_sum;      /* checksum */
>       u_short th_urp;      /* urgent pointer */
> diff --git a/include/standard-headers/linux/virtio_net.h b/include/standard-headers/linux/virtio_net.h
> index 30ff249..e67b36e 100644
> --- a/include/standard-headers/linux/virtio_net.h
> +++ b/include/standard-headers/linux/virtio_net.h
> @@ -57,6 +57,9 @@
>   					 * Steering */
>   #define VIRTIO_NET_F_CTRL_MAC_ADDR 23	/* Set MAC address */
>   
> +/* Guest can handle coalesced ipv4-tcp packets */
> +#define VIRTIO_NET_F_GUEST_RSC4    41

Why not use 24?

> +
>   #ifndef VIRTIO_NET_NO_LEGACY
>   #define VIRTIO_NET_F_GSO	6	/* Host handles pkts w/ any GSO type */
>   #endif /* VIRTIO_NET_NO_LEGACY */
> @@ -94,6 +97,9 @@ struct virtio_net_hdr_v1 {
>   #define VIRTIO_NET_HDR_GSO_UDP		3	/* GSO frame, IPv4 UDP (UFO) */
>   #define VIRTIO_NET_HDR_GSO_TCPV6	4	/* GSO frame, IPv6 TCP */
>   #define VIRTIO_NET_HDR_GSO_ECN		0x80	/* TCP has ECN set */
> +#define VIRTIO_NET_HDR_RSC_NONE     5   /* No packets coalesced */

Not sure this is really needed. Can we just use GSO_NONE?

And I believe we should not try to coalesce GSO packets since we're 
lacking sufficient information for a correct rsc_pkts or rsc_dup_acks 
from the backend.

> +#define VIRTIO_NET_HDR_RSC_TCPV4    6   /* IPv4 TCP coalesced */
> +#define VIRTIO_NET_HDR_RSC_TCPV6    7   /* IPv6 TCP coalesced */
>   	uint8_t gso_type;
>   	__virtio16 hdr_len;	/* Ethernet + IP + tcp/udp hdrs */
>   	__virtio16 gso_size;	/* Bytes to append to hdr_len per frame */
> @@ -124,6 +130,14 @@ struct virtio_net_hdr_mrg_rxbuf {
>   	struct virtio_net_hdr hdr;
>   	__virtio16 num_buffers;	/* Number of merged rx buffers */
>   };
> +
> +/* This is the header to use when either one or both of GUEST_RSC4/6
> + * features have been negotiated. */
> +struct virtio_net_hdr_rsc {
> +    struct virtio_net_hdr_v1 hdr;

If RSC depends on VERSION_1, we need to forbid creating an RSC device 
without VERSION_1.

> +    __virtio16 rsc_pkts;        /* Number of coalesced packets */
> +    __virtio16 rsc_dup_acks;    /* Duplicated ack packets */
> +};
>   #endif /* ...VIRTIO_NET_NO_LEGACY */
>   
>   /*
> diff --git a/net/tap.c b/net/tap.c
> index b6896a7..4557aa5 100644
> --- a/net/tap.c
> +++ b/net/tap.c
> @@ -251,7 +251,8 @@ static void tap_set_vnet_hdr_len(NetClientState *nc, int len)
>       TAPState *s = DO_UPCAST(TAPState, nc, nc);
>   
>       assert(nc->info->type == NET_CLIENT_DRIVER_TAP);
> -    assert(len == sizeof(struct virtio_net_hdr_mrg_rxbuf) ||
> +    assert(len == sizeof(struct virtio_net_hdr_rsc) ||
> +           len == sizeof(struct virtio_net_hdr_mrg_rxbuf) ||
>              len == sizeof(struct virtio_net_hdr));
>   
>       tap_fd_set_vnet_hdr_len(s->fd, len);


* [Qemu-devel] [PATCH 1/2] virtio-net rsc: support coalescing ipv4 tcp traffic
  2016-10-31 17:41 [Qemu-devel] [ RFC Patch v7 0/2] Support Receive-Segment-Offload(RSC) for WHQL wexu
@ 2016-10-31 17:41 ` wexu
  2016-11-24  4:17   ` Jason Wang
  0 siblings, 1 reply; 32+ messages in thread
From: wexu @ 2016-10-31 17:41 UTC (permalink / raw)
  To: qemu-devel; +Cc: jasowang, mst, dfleytma, yvugenfi, Wei Xu

From: Wei Xu <wexu@redhat.com>

All the data packets of a TCP connection are cached in a single
buffer during each receive interval and are sent out when a timer
fires. 'virtio_net_rsc_timeout' controls the interval, given in
nanoseconds; this value may affect the performance and response
time of the TCP connection. 50000 (50us) is an empirical value that
gives a performance improvement. Since the WHQL test sends packets
every 100us, 300000 (300us) passes the test case, and it is the
default value as well. Tune it via the command line parameter
'rsc_interval' of the 'virtio-net-pci' device, for example,
to launch a guest with the interval set to '500000':

'virtio-net-pci,netdev=hostnet1,bus=pci.0,id=net1,mac=00,rsc_interval=500000'

The timer is triggered only if the packet pool is not empty,
and it drains all the cached packets.

'NetRscChain' is used to save the segments of IPv4/6 in a
VirtIONet device.

A new segment becomes a 'Candidate' as soon as it passes the sanity
check. The main TCP handler covers TCP window updates, duplicated
ACK checks and the actual data coalescing.

A 'Candidate' segment means:
1. Segment is within current window and the sequence is the expected one.
2. 'ACK' of the segment is in the valid window.
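
As an illustration of point 1, the following is a minimal standalone
sketch of how a new segment's sequence number can be compared against
a cached one. The names (classify_seq, SKETCH_MAX_TCP_PAYLOAD, the
SEQ_* values) are hypothetical and are not the functions of this
patch; the sketch only assumes that sequence numbers are compared with
unsigned 32-bit arithmetic so that wraparound is handled.

```c
#include <stdint.h>

/* Hypothetical limit for this sketch: largest distance still "in window". */
#define SKETCH_MAX_TCP_PAYLOAD 65535u

enum seq_verdict {
    SEQ_IN_ORDER,       /* new data directly follows the cached data */
    SEQ_SAME,           /* same sequence number: ack/window handling */
    SEQ_OUT_OF_ORDER,   /* inside the window but not contiguous */
    SEQ_OUT_OF_WINDOW   /* too far away, or behind the cached data */
};

/* All values are in host byte order. */
static enum seq_verdict classify_seq(uint32_t cached_seq,
                                     uint16_t cached_payload,
                                     uint32_t new_seq)
{
    /* Unsigned subtraction is modulo 2^32, so a wrapped sequence
     * number still yields the correct forward distance. */
    uint32_t delta = new_seq - cached_seq;

    if (delta > SKETCH_MAX_TCP_PAYLOAD) {
        return SEQ_OUT_OF_WINDOW;
    }
    if (delta == 0) {
        return SEQ_SAME;
    }
    if (delta != cached_payload) {
        return SEQ_OUT_OF_ORDER;
    }
    return SEQ_IN_ORDER;
}
```

For example, a cached segment at sequence 1000 with 100 bytes of
payload accepts a new segment at 1100 as in order, while a segment at
900 lies behind the cached data and is treated as out of window.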

The sanity check covers:
1. Incorrect version in the IP header
2. IP options or an IP fragment
3. Not a TCP packet
4. A size check to prevent buffer overflow attacks
5. An ECN packet
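
The checks listed above can be sketched as a standalone function that
works on the raw bytes of an IPv4 header. This is only an
illustration under stated assumptions: the names (ipv4_sanity_check,
SKETCH_BYPASS, SKETCH_CANDIDATE) are hypothetical, and the sketch
detects fragments via the MF bit and the fragment offset, which may
differ from the test the patch itself uses.

```c
#include <stddef.h>
#include <stdint.h>

enum { SKETCH_BYPASS, SKETCH_CANDIDATE };

/* 'ip' points at the first byte of the IPv4 header, 'avail' is the
 * number of bytes available from that point on. */
static int ipv4_sanity_check(const uint8_t *ip, size_t avail)
{
    uint16_t frag, total_len;

    if (avail < 20 + 20) {              /* minimal IPv4 + TCP headers */
        return SKETCH_BYPASS;
    }
    if ((ip[0] >> 4) != 4) {            /* not IP version 4 */
        return SKETCH_BYPASS;
    }
    if ((ip[0] & 0x0F) != 5) {          /* header longer than 20 bytes: options */
        return SKETCH_BYPASS;
    }
    if (ip[9] != 6) {                   /* protocol field is not TCP */
        return SKETCH_BYPASS;
    }
    frag = (uint16_t)((ip[6] << 8) | ip[7]);
    if (frag & 0x3FFF) {                /* MF set or non-zero fragment offset */
        return SKETCH_BYPASS;
    }
    if (ip[1] & 0x03) {                 /* ECN bits set in the TOS byte */
        return SKETCH_BYPASS;
    }
    total_len = (uint16_t)((ip[2] << 8) | ip[3]);
    if (total_len < 40 || total_len > avail) {
        return SKETCH_BYPASS;           /* length field does not fit the buffer */
    }
    return SKETCH_CANDIDATE;
}
```

Reading the fields byte by byte avoids any dependency on host byte
order or struct layout, which keeps the sketch self-contained.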

There might be more cases that should be considered, such as the
IP identification field and other flags, but checking them breaks the
test because Windows sets the identification to the same value even
when the packet is not a fragment.

Normally there are 2 typical ways to handle a TCP control flag,
'bypass' and 'finalize'. 'Bypass' means the packet should be sent out
directly, while 'finalize' means the packet should also be bypassed,
but only after searching the pool for packets of the same connection
and draining all of them; this avoids out-of-order segments.

All 'SYN' packets are bypassed since they always begin a new
connection. Other flags such as 'URG/FIN/RST/CWR/ECE' trigger a
finalization, because they normally appear when a connection is about
to be closed; a 'URG' packet also finalizes the current coalescing unit.
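
The flag handling described above can be sketched as a small
classifier over the TCP flags byte. The names (classify_tcp_flags,
the F_* and ACT_* constants) are hypothetical and only serve this
illustration; the flag values are the standard TCP ones, and the
sketch assumes the full 8-bit flags byte is passed in so that ECE and
CWR are visible.

```c
#include <stdint.h>

/* Standard TCP flag bits (byte 13 of the TCP header). */
#define F_FIN 0x01
#define F_SYN 0x02
#define F_RST 0x04
#define F_PSH 0x08
#define F_ACK 0x10
#define F_URG 0x20
#define F_ECE 0x40
#define F_CWR 0x80

enum flag_action {
    ACT_BYPASS,     /* deliver directly, do not touch the pool */
    ACT_FINALIZE,   /* drain the same connection first, then deliver */
    ACT_COALESCE    /* plain data/ack segment, try to coalesce */
};

static enum flag_action classify_tcp_flags(uint8_t flags)
{
    if (flags & F_SYN) {
        return ACT_BYPASS;      /* a SYN always begins a new connection */
    }
    if (flags & (F_FIN | F_RST | F_URG | F_ECE | F_CWR)) {
        return ACT_FINALIZE;    /* control flags end the coalescing unit */
    }
    return ACT_COALESCE;        /* only ACK and/or PSH set */
}
```

Checking SYN before the other flags means that a SYN combined with
another control flag is still bypassed rather than finalized.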

Statistics can be used to monitor the basic coalescing status; the
'out of order' and 'out of window' counters show how many packets
were retransmitted, and thus describe the performance intuitively.

Signed-off-by: Wei Xu <wexu@redhat.com>
---
 hw/net/virtio-net.c                         | 602 ++++++++++++++++++++++++++--
 include/hw/virtio/virtio-net.h              |   5 +-
 include/hw/virtio/virtio.h                  |  76 ++++
 include/net/eth.h                           |   2 +
 include/standard-headers/linux/virtio_net.h |  14 +
 net/tap.c                                   |   3 +-
 6 files changed, 670 insertions(+), 32 deletions(-)

diff --git a/hw/net/virtio-net.c b/hw/net/virtio-net.c
index 06bfe4b..d1824d9 100644
--- a/hw/net/virtio-net.c
+++ b/hw/net/virtio-net.c
@@ -15,10 +15,12 @@
 #include "qemu/iov.h"
 #include "hw/virtio/virtio.h"
 #include "net/net.h"
+#include "net/eth.h"
 #include "net/checksum.h"
 #include "net/tap.h"
 #include "qemu/error-report.h"
 #include "qemu/timer.h"
+#include "qemu/sockets.h"
 #include "hw/virtio/virtio-net.h"
 #include "net/vhost_net.h"
 #include "hw/virtio/virtio-bus.h"
@@ -43,6 +45,24 @@
 #define endof(container, field) \
     (offsetof(container, field) + sizeof(((container *)0)->field))
 
+#define VIRTIO_NET_IP4_ADDR_SIZE   8        /* ipv4 saddr + daddr */
+
+#define VIRTIO_NET_TCP_FLAG         0x3F
+#define VIRTIO_NET_TCP_HDR_LENGTH   0xF000
+
+/* IPv4 max payload, 16 bits in the header */
+#define VIRTIO_NET_MAX_IP4_PAYLOAD (65535 - sizeof(struct ip_header))
+#define VIRTIO_NET_MAX_TCP_PAYLOAD 65535
+
+/* header length value in ip header without option */
+#define VIRTIO_NET_IP4_HEADER_LENGTH 5
+
+/* Purge coalesced packets timer interval, This value affects the performance
+   a lot, and should be tuned carefully, '300000'(300us) is the recommended
+   value to pass the WHQL test, '50000' can gain 2x netperf throughput with
+   tso/gso/gro 'off'. */
+#define VIRTIO_NET_RSC_INTERVAL  300000
+
 typedef struct VirtIOFeature {
     uint32_t flags;
     size_t end;
@@ -589,7 +609,12 @@ static uint64_t virtio_net_guest_offloads_by_features(uint32_t features)
         (1ULL << VIRTIO_NET_F_GUEST_ECN)  |
         (1ULL << VIRTIO_NET_F_GUEST_UFO);
 
-    return guest_offloads_mask & features;
+    if (features & VIRTIO_NET_F_CTRL_GUEST_OFFLOADS) {
+        return (guest_offloads_mask & features) |
+               (1ULL << VIRTIO_NET_F_GUEST_RSC4);
+    } else {
+        return guest_offloads_mask & features;
+    }
 }
 
 static inline uint64_t virtio_net_supported_guest_offloads(VirtIONet *n)
@@ -600,6 +625,7 @@ static inline uint64_t virtio_net_supported_guest_offloads(VirtIONet *n)
 
 static void virtio_net_set_features(VirtIODevice *vdev, uint64_t features)
 {
+    NetClientState *nc;
     VirtIONet *n = VIRTIO_NET(vdev);
     int i;
 
@@ -612,6 +638,22 @@ static void virtio_net_set_features(VirtIODevice *vdev, uint64_t features)
                                virtio_has_feature(features,
                                                   VIRTIO_F_VERSION_1));
 
+    if (virtio_has_feature(features, VIRTIO_NET_F_CTRL_GUEST_OFFLOADS)) {
+        n->guest_hdr_len = sizeof(struct virtio_net_hdr_rsc);
+        n->host_hdr_len = n->guest_hdr_len;
+        n->has_rsc_hdr = 1;
+
+        for (i = 0; i < n->max_queues; i++) {
+            nc = qemu_get_subqueue(n->nic, i);
+
+            if (peer_has_vnet_hdr(n) &&
+                qemu_has_vnet_hdr_len(nc->peer, n->guest_hdr_len)) {
+                qemu_set_vnet_hdr_len(nc->peer, n->guest_hdr_len);
+                n->host_hdr_len = n->guest_hdr_len;
+            }
+        }
+    }
+
     if (n->has_vnet_hdr) {
         n->curr_guest_offloads =
             virtio_net_guest_offloads_by_features(features);
@@ -1097,7 +1139,8 @@ static int receive_filter(VirtIONet *n, const uint8_t *buf, int size)
     return 0;
 }
 
-static ssize_t virtio_net_receive(NetClientState *nc, const uint8_t *buf, size_t size)
+static ssize_t virtio_net_do_receive(NetClientState *nc,
+                                     const uint8_t *buf, size_t size)
 {
     VirtIONet *n = qemu_get_nic_opaque(nc);
     VirtIONetQueue *q = virtio_net_get_subqueue(nc);
@@ -1161,6 +1204,12 @@ static ssize_t virtio_net_receive(NetClientState *nc, const uint8_t *buf, size_t
             }
 
             receive_header(n, sg, elem->in_num, buf, size);
+
+            if (n->has_rsc_hdr) {
+                offset = sizeof(struct virtio_net_hdr_mrg_rxbuf);
+                iov_from_buf(sg, elem->in_num, offset, \
+                             buf + offset, 4);
+            }
             offset = n->host_hdr_len;
             total += n->guest_hdr_len;
             guest_offset = n->guest_hdr_len;
@@ -1239,7 +1288,7 @@ static int32_t virtio_net_flush_tx(VirtIONetQueue *q)
         ssize_t ret;
         unsigned int out_num;
         struct iovec sg[VIRTQUEUE_MAX_SIZE], sg2[VIRTQUEUE_MAX_SIZE + 1], *out_sg;
-        struct virtio_net_hdr_mrg_rxbuf mhdr;
+        struct virtio_net_hdr_rsc rsc_hdr;
 
         elem = virtqueue_pop(q->tx_vq, sizeof(VirtQueueElement));
         if (!elem) {
@@ -1256,26 +1305,27 @@ static int32_t virtio_net_flush_tx(VirtIONetQueue *q)
         }
 
         if (n->has_vnet_hdr) {
-            if (iov_to_buf(out_sg, out_num, 0, &mhdr, n->guest_hdr_len) <
+            if (iov_to_buf(out_sg, out_num, 0, &rsc_hdr, n->guest_hdr_len) <
                 n->guest_hdr_len) {
                 virtio_error(vdev, "virtio-net header incorrect");
                 virtqueue_detach_element(q->tx_vq, elem, 0);
                 g_free(elem);
                 return -EINVAL;
             }
+
             if (n->needs_vnet_hdr_swap) {
-                virtio_net_hdr_swap(vdev, (void *) &mhdr);
-                sg2[0].iov_base = &mhdr;
+                virtio_net_hdr_swap(vdev, (void *) &rsc_hdr);
+                sg2[0].iov_base = &rsc_hdr;
                 sg2[0].iov_len = n->guest_hdr_len;
                 out_num = iov_copy(&sg2[1], ARRAY_SIZE(sg2) - 1,
                                    out_sg, out_num,
                                    n->guest_hdr_len, -1);
                 if (out_num == VIRTQUEUE_MAX_SIZE) {
                     goto drop;
-		}
+                }
                 out_num += 1;
                 out_sg = sg2;
-	    }
+            }
         }
         /*
          * If host wants to see the guest header as is, we can
@@ -1562,8 +1612,12 @@ static int virtio_net_load_device(VirtIODevice *vdev, QEMUFile *f,
                                virtio_vdev_has_feature(vdev,
                                                        VIRTIO_F_VERSION_1));
 
-    n->status = qemu_get_be16(f);
+    if (virtio_vdev_has_feature(vdev, VIRTIO_NET_F_GUEST_RSC4)) {
+        n->guest_hdr_len = sizeof(struct virtio_net_hdr_rsc);
+        n->host_hdr_len = n->guest_hdr_len;
+    }
 
+    n->status = qemu_get_be16(f);
     n->promisc = qemu_get_byte(f);
     n->allmulti = qemu_get_byte(f);
 
@@ -1660,6 +1714,487 @@ static int virtio_net_load_device(VirtIODevice *vdev, QEMUFile *f,
     return 0;
 }
 
+static void virtio_net_rsc_extract_unit4(NetRscChain *chain,
+                                         const uint8_t *buf, NetRscUnit *unit)
+{
+    uint16_t ip_hdrlen;
+    struct ip_header *ip;
+
+    ip = (struct ip_header *)(buf + chain->n->guest_hdr_len
+                              + sizeof(struct eth_header));
+    unit->ip = (void *)ip;
+    ip_hdrlen = (ip->ip_ver_len & 0xF) << 2;
+    unit->ip_plen = &ip->ip_len;
+    unit->tcp = (struct tcp_header *)(((uint8_t *)unit->ip) + ip_hdrlen);
+    unit->tcp_hdrlen = (htons(unit->tcp->th_offset_flags) & 0xF000) >> 10;
+    unit->payload = htons(*unit->ip_plen) - ip_hdrlen - unit->tcp_hdrlen;
+}
+
+static void virtio_net_rsc_ipv4_checksum(struct virtio_net_hdr_rsc *rhdr,
+                                         struct ip_header *ip)
+{
+    uint32_t sum;
+    struct virtio_net_hdr *vhdr = (struct virtio_net_hdr *)rhdr;
+
+    ip->ip_sum = 0;
+    sum = net_checksum_add_cont(sizeof(struct ip_header), (uint8_t *)ip, 0);
+    ip->ip_sum = cpu_to_be16(net_checksum_finish(sum));
+    vhdr->flags &= ~VIRTIO_NET_HDR_F_NEEDS_CSUM;
+    vhdr->flags |= VIRTIO_NET_HDR_F_DATA_VALID;
+}
+
+static size_t virtio_net_rsc_drain_seg(NetRscChain *chain, NetRscSeg *seg)
+{
+    int ret;
+    struct virtio_net_hdr_rsc *h;
+
+    h = (struct virtio_net_hdr_rsc *)seg->buf;
+    if (seg->is_coalesced) {
+        h->hdr.flags = VIRTIO_NET_HDR_RSC_TCPV4;
+        virtio_net_rsc_ipv4_checksum(h, seg->unit.ip);
+    }
+
+    h = (struct virtio_net_hdr_rsc *)seg->buf;
+    virtio_net_rsc_ipv4_checksum(h, seg->unit.ip);
+    h->rsc_pkts = seg->packets;
+    h->rsc_dup_acks = seg->dup_ack;
+    ret = virtio_net_do_receive(seg->nc, seg->buf, seg->size);
+    QTAILQ_REMOVE(&chain->buffers, seg, next);
+    g_free(seg->buf);
+    g_free(seg);
+
+    return ret;
+}
+
+static void virtio_net_rsc_purge(void *opq)
+{
+    NetRscSeg *seg, *rn;
+    NetRscChain *chain = (NetRscChain *)opq;
+
+    QTAILQ_FOREACH_SAFE(seg, &chain->buffers, next, rn) {
+        if (virtio_net_rsc_drain_seg(chain, seg) == 0) {
+            chain->stat.purge_failed++;
+            continue;
+        }
+    }
+
+    chain->stat.timer++;
+    if (!QTAILQ_EMPTY(&chain->buffers)) {
+        timer_mod(chain->drain_timer,
+              qemu_clock_get_ns(QEMU_CLOCK_HOST) + chain->n->rsc_timeout);
+    }
+}
+
+static void virtio_net_rsc_cleanup(VirtIONet *n)
+{
+    NetRscChain *chain, *rn_chain;
+    NetRscSeg *seg, *rn_seg;
+
+    QTAILQ_FOREACH_SAFE(chain, &n->rsc_chains, next, rn_chain) {
+        QTAILQ_FOREACH_SAFE(seg, &chain->buffers, next, rn_seg) {
+            QTAILQ_REMOVE(&chain->buffers, seg, next);
+            g_free(seg->buf);
+            g_free(seg);
+        }
+
+        timer_del(chain->drain_timer);
+        timer_free(chain->drain_timer);
+        QTAILQ_REMOVE(&n->rsc_chains, chain, next);
+        g_free(chain);
+    }
+}
+
+static void virtio_net_rsc_cache_buf(NetRscChain *chain, NetClientState *nc,
+                                     const uint8_t *buf, size_t size)
+{
+    uint16_t hdr_len;
+    NetRscSeg *seg;
+
+    hdr_len = chain->n->guest_hdr_len;
+    seg = g_malloc(sizeof(NetRscSeg));
+    seg->buf = g_malloc(hdr_len + sizeof(struct eth_header)
+                   + VIRTIO_NET_MAX_TCP_PAYLOAD);
+    memcpy(seg->buf, buf, size);
+    seg->size = size;
+    seg->packets = 1;
+    seg->dup_ack = 0;
+    seg->is_coalesced = 0;
+    seg->nc = nc;
+
+    QTAILQ_INSERT_TAIL(&chain->buffers, seg, next);
+    chain->stat.cache++;
+
+    virtio_net_rsc_extract_unit4(chain, seg->buf, &seg->unit);
+}
+
+static int32_t virtio_net_rsc_handle_ack(NetRscChain *chain,
+                                         NetRscSeg *seg, const uint8_t *buf,
+                                         struct tcp_header *n_tcp,
+                                         struct tcp_header *o_tcp)
+{
+    uint32_t nack, oack;
+    uint16_t nwin, owin;
+
+    nack = htonl(n_tcp->th_ack);
+    nwin = htons(n_tcp->th_win);
+    oack = htonl(o_tcp->th_ack);
+    owin = htons(o_tcp->th_win);
+
+    if ((nack - oack) >= VIRTIO_NET_MAX_TCP_PAYLOAD) {
+        chain->stat.ack_out_of_win++;
+        return RSC_FINAL;
+    } else if (nack == oack) {
+        /* duplicated ack or window probe */
+        if (nwin == owin) {
+            /* duplicated ack, add dup ack count due to whql test up to 1 */
+            chain->stat.dup_ack++;
+            return RSC_FINAL;
+        } else {
+            /* Coalesce window update */
+            o_tcp->th_win = n_tcp->th_win;
+            chain->stat.win_update++;
+            return RSC_COALESCE;
+        }
+    } else {
+        /* pure ack, go to 'C', finalize */
+        chain->stat.pure_ack++;
+        return RSC_FINAL;
+    }
+}
+
+static int32_t virtio_net_rsc_coalesce_data(NetRscChain *chain,
+                                            NetRscSeg *seg, const uint8_t *buf,
+                                            NetRscUnit *n_unit)
+{
+    void *data;
+    uint16_t o_ip_len;
+    uint32_t nseq, oseq;
+    NetRscUnit *o_unit;
+
+    o_unit = &seg->unit;
+    o_ip_len = htons(*o_unit->ip_plen);
+    nseq = htonl(n_unit->tcp->th_seq);
+    oseq = htonl(o_unit->tcp->th_seq);
+
+    /* out of order or retransmitted. */
+    if ((nseq - oseq) > VIRTIO_NET_MAX_TCP_PAYLOAD) {
+        chain->stat.data_out_of_win++;
+        return RSC_FINAL;
+    }
+
+    data = ((uint8_t *)n_unit->tcp) + n_unit->tcp_hdrlen;
+    if (nseq == oseq) {
+        if ((o_unit->payload == 0) && n_unit->payload) {
+            /* From no payload to payload: normal case, not a dup ack etc. */
+            chain->stat.data_after_pure_ack++;
+            goto coalesce;
+        } else {
+            return virtio_net_rsc_handle_ack(chain, seg, buf,
+                                             n_unit->tcp, o_unit->tcp);
+        }
+    } else if ((nseq - oseq) != o_unit->payload) {
+        /* Not a consistent packet, out of order */
+        chain->stat.data_out_of_order++;
+        return RSC_FINAL;
+    } else {
+coalesce:
+        if ((o_ip_len + n_unit->payload) > chain->max_payload) {
+            chain->stat.over_size++;
+            return RSC_FINAL;
+        }
+
+        /* Here comes the right data. The payload length differs between
+           v4 and v6, so use the field value to update and record the new len */
+        o_unit->payload += n_unit->payload; /* update new data len */
+
+        /* update field in ip header */
+        *o_unit->ip_plen = htons(o_ip_len + n_unit->payload);
+
+        /* Carry the 'PUSH' bit over: the WHQL test guide says 'PUSH' can be
+           coalesced for Windows guests, though this may change the behavior
+           for Linux guests. */
+        o_unit->tcp->th_offset_flags = n_unit->tcp->th_offset_flags;
+
+        o_unit->tcp->th_ack = n_unit->tcp->th_ack;
+        o_unit->tcp->th_win = n_unit->tcp->th_win;
+
+        memmove(seg->buf + seg->size, data, n_unit->payload);
+        seg->size += n_unit->payload;
+        seg->packets++;
+        chain->stat.coalesced++;
+        return RSC_COALESCE;
+    }
+}
+
+static int32_t virtio_net_rsc_coalesce4(NetRscChain *chain, NetRscSeg *seg,
+                                        const uint8_t *buf, size_t size,
+                                        NetRscUnit *unit)
+{
+    struct ip_header *ip1, *ip2;
+
+    ip1 = (struct ip_header *)(unit->ip);
+    ip2 = (struct ip_header *)(seg->unit.ip);
+    if ((ip1->ip_src ^ ip2->ip_src) || (ip1->ip_dst ^ ip2->ip_dst)
+        || (unit->tcp->th_sport ^ seg->unit.tcp->th_sport)
+        || (unit->tcp->th_dport ^ seg->unit.tcp->th_dport)) {
+        chain->stat.no_match++;
+        return RSC_NO_MATCH;
+    }
+
+    return virtio_net_rsc_coalesce_data(chain, seg, buf, unit);
+}
+
+/* Packets with 'SYN' should bypass; other flags should be sent after a drain
+ * to prevent out-of-order delivery */
+static int virtio_net_rsc_tcp_ctrl_check(NetRscChain *chain,
+                                         struct tcp_header *tcp)
+{
+    uint16_t tcp_hdr;
+    uint16_t tcp_flag;
+
+    tcp_flag = htons(tcp->th_offset_flags);
+    tcp_hdr = (tcp_flag & VIRTIO_NET_TCP_HDR_LENGTH) >> 10;
+    tcp_flag &= VIRTIO_NET_TCP_FLAG;
+    tcp_flag = htons(tcp->th_offset_flags) & 0x3F;
+    if (tcp_flag & TH_SYN) {
+        chain->stat.tcp_syn++;
+        return RSC_BYPASS;
+    }
+
+    if (tcp_flag & (TH_FIN | TH_URG | TH_RST | TH_ECE | TH_CWR)) {
+        chain->stat.tcp_ctrl_drain++;
+        return RSC_FINAL;
+    }
+
+    if (tcp_hdr > sizeof(struct tcp_header)) {
+        chain->stat.tcp_all_opt++;
+        return RSC_FINAL;
+    }
+
+    return RSC_CANDIDATE;
+}
+
+static size_t virtio_net_rsc_do_coalesce(NetRscChain *chain, NetClientState *nc,
+                                         const uint8_t *buf, size_t size,
+                                         NetRscUnit *unit)
+{
+    int ret;
+    NetRscSeg *seg, *nseg;
+
+    if (QTAILQ_EMPTY(&chain->buffers)) {
+        chain->stat.empty_cache++;
+        virtio_net_rsc_cache_buf(chain, nc, buf, size);
+        timer_mod(chain->drain_timer,
+              qemu_clock_get_ns(QEMU_CLOCK_HOST) + chain->n->rsc_timeout);
+        return size;
+    }
+
+    QTAILQ_FOREACH_SAFE(seg, &chain->buffers, next, nseg) {
+        ret = virtio_net_rsc_coalesce4(chain, seg, buf, size, unit);
+
+        if (ret == RSC_FINAL) {
+            if (virtio_net_rsc_drain_seg(chain, seg) == 0) {
+                /* Send failed */
+                chain->stat.final_failed++;
+                return 0;
+            }
+
+            /* Send current packet */
+            return virtio_net_do_receive(nc, buf, size);
+        } else if (ret == RSC_NO_MATCH) {
+            continue;
+        } else {
+            /* Coalesced: flag it so the ipv4 checksum is recalculated */
+            seg->is_coalesced = 1;
+            return size;
+        }
+    }
+
+    chain->stat.no_match_cache++;
+    virtio_net_rsc_cache_buf(chain, nc, buf, size);
+    return size;
+}
+
+/* Drain a connection's cached data to avoid out-of-order segments */
+static size_t virtio_net_rsc_drain_flow(NetRscChain *chain, NetClientState *nc,
+                                        const uint8_t *buf, size_t size,
+                                        uint16_t ip_start, uint16_t ip_size,
+                                        uint16_t tcp_port)
+{
+    NetRscSeg *seg, *nseg;
+    uint32_t ppair1, ppair2;
+
+    ppair1 = *(uint32_t *)(buf + tcp_port);
+    QTAILQ_FOREACH_SAFE(seg, &chain->buffers, next, nseg) {
+        ppair2 = *(uint32_t *)(seg->buf + tcp_port);
+        if (memcmp(buf + ip_start, seg->buf + ip_start, ip_size)
+            || (ppair1 != ppair2)) {
+            continue;
+        }
+        if (virtio_net_rsc_drain_seg(chain, seg) == 0) {
+            chain->stat.drain_failed++;
+        }
+
+        break;
+    }
+
+    return virtio_net_do_receive(nc, buf, size);
+}
+
+static int32_t virtio_net_rsc_sanity_check4(NetRscChain *chain,
+                                            struct ip_header *ip,
+                                            const uint8_t *buf, size_t size)
+{
+    uint16_t ip_len;
+
+    /* Not an ipv4 packet */
+    if (((ip->ip_ver_len & 0xF0) >> 4) != IP_HEADER_VERSION_4) {
+        chain->stat.ip_option++;
+        return RSC_BYPASS;
+    }
+
+    /* Don't handle packets with ip option */
+    if ((ip->ip_ver_len & 0xF) != VIRTIO_NET_IP4_HEADER_LENGTH) {
+        chain->stat.ip_option++;
+        return RSC_BYPASS;
+    }
+
+    if (ip->ip_p != IPPROTO_TCP) {
+        chain->stat.bypass_not_tcp++;
+        return RSC_BYPASS;
+    }
+
+    /* Don't handle packets with ip fragment */
+    if (!(htons(ip->ip_off) & IP_DF)) {
+        chain->stat.ip_frag++;
+        return RSC_BYPASS;
+    }
+
+    /* Don't handle packets with ecn flag */
+    if (IPTOS_ECN(ip->ip_tos)) {
+        chain->stat.ip_ecn++;
+        return RSC_BYPASS;
+    }
+
+    ip_len = htons(ip->ip_len);
+    if (ip_len < (sizeof(struct ip_header) + sizeof(struct tcp_header))
+        || ip_len > (size - chain->n->guest_hdr_len -
+                     sizeof(struct eth_header))) {
+        chain->stat.ip_hacked++;
+        return RSC_BYPASS;
+    }
+
+    return RSC_CANDIDATE;
+}
+
+static size_t virtio_net_rsc_receive4(NetRscChain *chain, NetClientState *nc,
+                                      const uint8_t *buf, size_t size)
+{
+    int32_t ret;
+    uint16_t hdr_len;
+    NetRscUnit unit;
+
+    hdr_len = ((VirtIONet *)(chain->n))->guest_hdr_len;
+
+    if (size < (hdr_len + sizeof(struct eth_header) + sizeof(struct ip_header)
+        + sizeof(struct tcp_header))) {
+        chain->stat.bypass_not_tcp++;
+        return virtio_net_do_receive(nc, buf, size);
+    }
+
+    virtio_net_rsc_extract_unit4(chain, buf, &unit);
+    if (virtio_net_rsc_sanity_check4(chain, unit.ip, buf, size)
+        != RSC_CANDIDATE) {
+        return virtio_net_do_receive(nc, buf, size);
+    }
+
+    ret = virtio_net_rsc_tcp_ctrl_check(chain, unit.tcp);
+    if (ret == RSC_BYPASS) {
+        return virtio_net_do_receive(nc, buf, size);
+    } else if (ret == RSC_FINAL) {
+        return virtio_net_rsc_drain_flow(chain, nc, buf, size,
+                ((hdr_len + sizeof(struct eth_header)) + 12),
+                VIRTIO_NET_IP4_ADDR_SIZE,
+                hdr_len + sizeof(struct eth_header) + sizeof(struct ip_header));
+    }
+
+    return virtio_net_rsc_do_coalesce(chain, nc, buf, size, &unit);
+}
+
+static NetRscChain *virtio_net_rsc_lookup_chain(VirtIONet *n,
+                                                NetClientState *nc,
+                                                uint16_t proto)
+{
+    NetRscChain *chain;
+
+    if (proto != (uint16_t)ETH_P_IP) {
+        return NULL;
+    }
+
+    QTAILQ_FOREACH(chain, &n->rsc_chains, next) {
+        if (chain->proto == proto) {
+            return chain;
+        }
+    }
+
+    chain = g_malloc(sizeof(*chain));
+    chain->n = n;
+    chain->proto = proto;
+    chain->max_payload = VIRTIO_NET_MAX_IP4_PAYLOAD;
+    chain->gso_type = VIRTIO_NET_HDR_GSO_TCPV4;
+    chain->drain_timer = timer_new_ns(QEMU_CLOCK_HOST,
+                                      virtio_net_rsc_purge, chain);
+    memset(&chain->stat, 0, sizeof(chain->stat));
+
+    QTAILQ_INIT(&chain->buffers);
+    QTAILQ_INSERT_TAIL(&n->rsc_chains, chain, next);
+
+    return chain;
+}
+
+static ssize_t virtio_net_rsc_receive(NetClientState *nc,
+                                      const uint8_t *buf, size_t size)
+{
+    uint16_t proto;
+    NetRscChain *chain;
+    struct eth_header *eth;
+    VirtIONet *n;
+
+    n = qemu_get_nic_opaque(nc);
+    if (size < (n->host_hdr_len + sizeof(struct eth_header))) {
+        return virtio_net_do_receive(nc, buf, size);
+    }
+
+    eth = (struct eth_header *)(buf + n->guest_hdr_len);
+    proto = htons(eth->h_proto);
+
+    chain = virtio_net_rsc_lookup_chain(n, nc, proto);
+    if (!chain) {
+        return virtio_net_do_receive(nc, buf, size);
+    } else {
+        chain->stat.received++;
+        return virtio_net_rsc_receive4(chain, nc, buf, size);
+    }
+}
+
+static ssize_t virtio_net_receive(NetClientState *nc,
+                                  const uint8_t *buf, size_t size)
+{
+    VirtIONet *n;
+    struct virtio_net_hdr_rsc *h;
+
+    n = qemu_get_nic_opaque(nc);
+    if (n->curr_guest_offloads & (1ULL << VIRTIO_NET_F_GUEST_RSC4)) {
+        h = (struct virtio_net_hdr_rsc *)buf;
+        h->hdr.flags = VIRTIO_NET_HDR_RSC_NONE;
+        h->rsc_pkts = 0;
+        h->rsc_dup_acks = 0;
+        return virtio_net_rsc_receive(nc, buf, size);
+    } else {
+        return virtio_net_do_receive(nc, buf, size);
+    }
+}
+
 static NetClientInfo net_virtio_info = {
     .type = NET_CLIENT_DRIVER_NIC,
     .size = sizeof(NICState),
@@ -1805,6 +2340,7 @@ static void virtio_net_device_realize(DeviceState *dev, Error **errp)
     nc = qemu_get_queue(n->nic);
     nc->rxfilter_notify_enabled = 1;
 
+    QTAILQ_INIT(&n->rsc_chains);
     n->qdev = dev;
 }
 
@@ -1835,6 +2371,7 @@ static void virtio_net_device_unrealize(DeviceState *dev, Error **errp)
     g_free(n->vqs);
     qemu_del_nic(n->nic);
     virtio_cleanup(vdev);
+    virtio_net_rsc_cleanup(n);
 }
 
 static void virtio_net_instance_init(Object *obj)
@@ -1872,45 +2409,46 @@ static const VMStateDescription vmstate_virtio_net = {
 };
 
 static Property virtio_net_properties[] = {
-    DEFINE_PROP_BIT("csum", VirtIONet, host_features, VIRTIO_NET_F_CSUM, true),
-    DEFINE_PROP_BIT("guest_csum", VirtIONet, host_features,
+    DEFINE_PROP_BIT64("csum", VirtIONet, host_features,
+                    VIRTIO_NET_F_CSUM, true),
+    DEFINE_PROP_BIT64("guest_csum", VirtIONet, host_features,
                     VIRTIO_NET_F_GUEST_CSUM, true),
-    DEFINE_PROP_BIT("gso", VirtIONet, host_features, VIRTIO_NET_F_GSO, true),
-    DEFINE_PROP_BIT("guest_tso4", VirtIONet, host_features,
+    DEFINE_PROP_BIT64("gso", VirtIONet, host_features, VIRTIO_NET_F_GSO, true),
+    DEFINE_PROP_BIT64("guest_tso4", VirtIONet, host_features,
                     VIRTIO_NET_F_GUEST_TSO4, true),
-    DEFINE_PROP_BIT("guest_tso6", VirtIONet, host_features,
+    DEFINE_PROP_BIT64("guest_tso6", VirtIONet, host_features,
                     VIRTIO_NET_F_GUEST_TSO6, true),
-    DEFINE_PROP_BIT("guest_ecn", VirtIONet, host_features,
+    DEFINE_PROP_BIT64("guest_ecn", VirtIONet, host_features,
                     VIRTIO_NET_F_GUEST_ECN, true),
-    DEFINE_PROP_BIT("guest_ufo", VirtIONet, host_features,
+    DEFINE_PROP_BIT64("guest_ufo", VirtIONet, host_features,
                     VIRTIO_NET_F_GUEST_UFO, true),
-    DEFINE_PROP_BIT("guest_announce", VirtIONet, host_features,
+    DEFINE_PROP_BIT64("guest_announce", VirtIONet, host_features,
                     VIRTIO_NET_F_GUEST_ANNOUNCE, true),
-    DEFINE_PROP_BIT("host_tso4", VirtIONet, host_features,
+    DEFINE_PROP_BIT64("host_tso4", VirtIONet, host_features,
                     VIRTIO_NET_F_HOST_TSO4, true),
-    DEFINE_PROP_BIT("host_tso6", VirtIONet, host_features,
+    DEFINE_PROP_BIT64("host_tso6", VirtIONet, host_features,
                     VIRTIO_NET_F_HOST_TSO6, true),
-    DEFINE_PROP_BIT("host_ecn", VirtIONet, host_features,
+    DEFINE_PROP_BIT64("host_ecn", VirtIONet, host_features,
                     VIRTIO_NET_F_HOST_ECN, true),
-    DEFINE_PROP_BIT("host_ufo", VirtIONet, host_features,
+    DEFINE_PROP_BIT64("host_ufo", VirtIONet, host_features,
                     VIRTIO_NET_F_HOST_UFO, true),
-    DEFINE_PROP_BIT("mrg_rxbuf", VirtIONet, host_features,
+    DEFINE_PROP_BIT64("mrg_rxbuf", VirtIONet, host_features,
                     VIRTIO_NET_F_MRG_RXBUF, true),
-    DEFINE_PROP_BIT("status", VirtIONet, host_features,
+    DEFINE_PROP_BIT64("status", VirtIONet, host_features,
                     VIRTIO_NET_F_STATUS, true),
-    DEFINE_PROP_BIT("ctrl_vq", VirtIONet, host_features,
+    DEFINE_PROP_BIT64("ctrl_vq", VirtIONet, host_features,
                     VIRTIO_NET_F_CTRL_VQ, true),
-    DEFINE_PROP_BIT("ctrl_rx", VirtIONet, host_features,
+    DEFINE_PROP_BIT64("ctrl_rx", VirtIONet, host_features,
                     VIRTIO_NET_F_CTRL_RX, true),
-    DEFINE_PROP_BIT("ctrl_vlan", VirtIONet, host_features,
+    DEFINE_PROP_BIT64("ctrl_vlan", VirtIONet, host_features,
                     VIRTIO_NET_F_CTRL_VLAN, true),
-    DEFINE_PROP_BIT("ctrl_rx_extra", VirtIONet, host_features,
+    DEFINE_PROP_BIT64("ctrl_rx_extra", VirtIONet, host_features,
                     VIRTIO_NET_F_CTRL_RX_EXTRA, true),
-    DEFINE_PROP_BIT("ctrl_mac_addr", VirtIONet, host_features,
+    DEFINE_PROP_BIT64("ctrl_mac_addr", VirtIONet, host_features,
                     VIRTIO_NET_F_CTRL_MAC_ADDR, true),
-    DEFINE_PROP_BIT("ctrl_guest_offloads", VirtIONet, host_features,
+    DEFINE_PROP_BIT64("ctrl_guest_offloads", VirtIONet, host_features,
                     VIRTIO_NET_F_CTRL_GUEST_OFFLOADS, true),
-    DEFINE_PROP_BIT("mq", VirtIONet, host_features, VIRTIO_NET_F_MQ, false),
+    DEFINE_PROP_BIT64("mq", VirtIONet, host_features, VIRTIO_NET_F_MQ, false),
     DEFINE_NIC_PROPERTIES(VirtIONet, nic_conf),
     DEFINE_PROP_UINT32("x-txtimer", VirtIONet, net_conf.txtimer,
                        TX_TIMER_INTERVAL),
@@ -1918,6 +2456,10 @@ static Property virtio_net_properties[] = {
     DEFINE_PROP_STRING("tx", VirtIONet, net_conf.tx),
     DEFINE_PROP_UINT16("rx_queue_size", VirtIONet, net_conf.rx_queue_size,
                        VIRTIO_NET_RX_QUEUE_DEFAULT_SIZE),
+    DEFINE_PROP_BIT64("guest_rsc4", VirtIONet, host_features,
+                    VIRTIO_NET_F_GUEST_RSC4, true),
+    DEFINE_PROP_UINT32("rsc_interval", VirtIONet, rsc_timeout,
+                      VIRTIO_NET_RSC_INTERVAL),
     DEFINE_PROP_END_OF_LIST(),
 };
 
diff --git a/include/hw/virtio/virtio-net.h b/include/hw/virtio/virtio-net.h
index 0ced975..56a8ce2 100644
--- a/include/hw/virtio/virtio-net.h
+++ b/include/hw/virtio/virtio-net.h
@@ -60,12 +60,15 @@ typedef struct VirtIONet {
     VirtIONetQueue *vqs;
     VirtQueue *ctrl_vq;
     NICState *nic;
+    QTAILQ_HEAD(, NetRscChain) rsc_chains;
+    uint32_t rsc_timeout;
     uint32_t tx_timeout;
     int32_t tx_burst;
     uint32_t has_vnet_hdr;
+    uint32_t has_rsc_hdr;
     size_t host_hdr_len;
     size_t guest_hdr_len;
-    uint32_t host_features;
+    uint64_t host_features;
     uint8_t has_ufo;
     int mergeable_rx_bufs;
     uint8_t promisc;
diff --git a/include/hw/virtio/virtio.h b/include/hw/virtio/virtio.h
index b913aac..0006ce1 100644
--- a/include/hw/virtio/virtio.h
+++ b/include/hw/virtio/virtio.h
@@ -30,6 +30,8 @@
                                 (0x1ULL << VIRTIO_F_ANY_LAYOUT))
 
 struct VirtQueue;
+struct VirtIONet;
+typedef struct VirtIONet VirtIONet;
 
 static inline hwaddr vring_align(hwaddr addr,
                                              unsigned long align)
@@ -129,6 +131,80 @@ typedef struct VirtioDeviceClass {
     int (*load)(VirtIODevice *vdev, QEMUFile *f, int version_id);
 } VirtioDeviceClass;
 
+/* Coalesced packets type & status */
+typedef enum {
+    RSC_COALESCE,           /* Data has been coalesced */
+    RSC_FINAL,              /* Will terminate current connection */
+    RSC_NO_MATCH,           /* No match in the buffer pool */
+    RSC_BYPASS,             /* Packet to be bypassed: not tcp, tcp ctrl, etc */
+    RSC_CANDIDATE           /* Data that wants to be coalesced */
+} COALESCE_STATUS;
+
+typedef struct NetRscStat {
+    uint32_t received;
+    uint32_t coalesced;
+    uint32_t over_size;
+    uint32_t cache;
+    uint32_t empty_cache;
+    uint32_t no_match_cache;
+    uint32_t win_update;
+    uint32_t no_match;
+    uint32_t tcp_syn;
+    uint32_t tcp_ctrl_drain;
+    uint32_t dup_ack;
+    uint32_t dup_ack1;
+    uint32_t dup_ack2;
+    uint32_t pure_ack;
+    uint32_t ack_out_of_win;
+    uint32_t data_out_of_win;
+    uint32_t data_out_of_order;
+    uint32_t data_after_pure_ack;
+    uint32_t bypass_not_tcp;
+    uint32_t tcp_option;
+    uint32_t tcp_all_opt;
+    uint32_t ip_frag;
+    uint32_t ip_ecn;
+    uint32_t ip_hacked;
+    uint32_t ip_option;
+    uint32_t purge_failed;
+    uint32_t drain_failed;
+    uint32_t final_failed;
+    int64_t  timer;
+} NetRscStat;
+
+/* General info of an rsc unit, used to check whether it can be coalesced */
+typedef struct NetRscUnit {
+    void *ip;   /* ip header */
+    uint16_t *ip_plen;      /* data len pointer in ip header field */
+    struct tcp_header *tcp; /* tcp header */
+    uint16_t tcp_hdrlen;    /* tcp header len */
+    uint16_t payload;       /* pure payload without virtio/eth/ip/tcp */
+} NetRscUnit;
+
+/* Coalesced segment */
+typedef struct NetRscSeg {
+    QTAILQ_ENTRY(NetRscSeg) next;
+    void *buf;
+    size_t size;
+    uint16_t packets;
+    uint16_t dup_ack;
+    bool is_coalesced;      /* ipv4 header checksum needs recalculation */
+    NetRscUnit unit;
+    NetClientState *nc;
+} NetRscSeg;
+
+/* Chains are divided by protocol (ipv4/v6) and NetClientInfo */
+typedef struct NetRscChain {
+    QTAILQ_ENTRY(NetRscChain) next;
+    VirtIONet *n;                            /* VirtIONet */
+    uint16_t proto;
+    uint8_t  gso_type;
+    uint16_t max_payload;
+    QEMUTimer *drain_timer;
+    QTAILQ_HEAD(, NetRscSeg) buffers;
+    NetRscStat stat;
+} NetRscChain;
+
 void virtio_instance_init_common(Object *proxy_obj, void *data,
                                  size_t vdev_size, const char *vdev_name);
 
diff --git a/include/net/eth.h b/include/net/eth.h
index 2013175..5952ef2 100644
--- a/include/net/eth.h
+++ b/include/net/eth.h
@@ -177,6 +177,8 @@ struct tcp_hdr {
 #define TH_PUSH 0x08
 #define TH_ACK  0x10
 #define TH_URG  0x20
+#define TH_ECE  0x40
+#define TH_CWR  0x80
     u_short th_win;      /* window */
     u_short th_sum;      /* checksum */
     u_short th_urp;      /* urgent pointer */
diff --git a/include/standard-headers/linux/virtio_net.h b/include/standard-headers/linux/virtio_net.h
index 30ff249..e67b36e 100644
--- a/include/standard-headers/linux/virtio_net.h
+++ b/include/standard-headers/linux/virtio_net.h
@@ -57,6 +57,9 @@
 					 * Steering */
 #define VIRTIO_NET_F_CTRL_MAC_ADDR 23	/* Set MAC address */
 
+/* Guest can handle coalesced ipv4-tcp packets */
+#define VIRTIO_NET_F_GUEST_RSC4    41
+
 #ifndef VIRTIO_NET_NO_LEGACY
 #define VIRTIO_NET_F_GSO	6	/* Host handles pkts w/ any GSO type */
 #endif /* VIRTIO_NET_NO_LEGACY */
@@ -94,6 +97,9 @@ struct virtio_net_hdr_v1 {
 #define VIRTIO_NET_HDR_GSO_UDP		3	/* GSO frame, IPv4 UDP (UFO) */
 #define VIRTIO_NET_HDR_GSO_TCPV6	4	/* GSO frame, IPv6 TCP */
 #define VIRTIO_NET_HDR_GSO_ECN		0x80	/* TCP has ECN set */
+#define VIRTIO_NET_HDR_RSC_NONE     5   /* No packets coalesced */
+#define VIRTIO_NET_HDR_RSC_TCPV4    6   /* IPv4 TCP coalesced */
+#define VIRTIO_NET_HDR_RSC_TCPV6    7   /* IPv6 TCP coalesced */
 	uint8_t gso_type;
 	__virtio16 hdr_len;	/* Ethernet + IP + tcp/udp hdrs */
 	__virtio16 gso_size;	/* Bytes to append to hdr_len per frame */
@@ -124,6 +130,14 @@ struct virtio_net_hdr_mrg_rxbuf {
 	struct virtio_net_hdr hdr;
 	__virtio16 num_buffers;	/* Number of merged rx buffers */
 };
+
+/* This is the header to use when either one or both of GUEST_RSC4/6
+ * features have been negotiated. */
+struct virtio_net_hdr_rsc {
+    struct virtio_net_hdr_v1 hdr;
+    __virtio16 rsc_pkts;        /* Number of coalesced packets */
+    __virtio16 rsc_dup_acks;    /* Duplicated ack packets */
+};
 #endif /* ...VIRTIO_NET_NO_LEGACY */
 
 /*
diff --git a/net/tap.c b/net/tap.c
index b6896a7..4557aa5 100644
--- a/net/tap.c
+++ b/net/tap.c
@@ -251,7 +251,8 @@ static void tap_set_vnet_hdr_len(NetClientState *nc, int len)
     TAPState *s = DO_UPCAST(TAPState, nc, nc);
 
     assert(nc->info->type == NET_CLIENT_DRIVER_TAP);
-    assert(len == sizeof(struct virtio_net_hdr_mrg_rxbuf) ||
+    assert(len == sizeof(struct virtio_net_hdr_rsc) ||
+           len == sizeof(struct virtio_net_hdr_mrg_rxbuf) ||
            len == sizeof(struct virtio_net_hdr));
 
     tap_fd_set_vnet_hdr_len(s->fd, len);
-- 
2.7.1
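
Note for readers of the patch above: the header-length math in
virtio_net_rsc_extract_unit4() and the in-order test in
virtio_net_rsc_coalesce_data() can be tried in isolation. What follows is a
minimal standalone sketch, not code from the patch. All sketch_* names are
hypothetical, and SKETCH_MAX_TCP_PAYLOAD is an assumed value, since the
definition of VIRTIO_NET_MAX_TCP_PAYLOAD is not visible in this diff. The
equal-sequence case is simplified: the patch hands it to
virtio_net_rsc_handle_ack(), which may still coalesce a window update, while
the sketch just finalizes. The max_payload size cap is left out as well.

```c
#include <stdint.h>

/* Hypothetical verdicts; the patch uses RSC_COALESCE and RSC_FINAL. */
enum sketch_verdict { SKETCH_COALESCE, SKETCH_FINAL };

/* Assumed bound for this sketch only, not taken from the patch. */
#define SKETCH_MAX_TCP_PAYLOAD 65535u

/*
 * The TCP data offset is the top 4 bits of the 16-bit offset/flags word
 * (host byte order here) and counts 32-bit words. Shifting right by 12
 * yields words and multiplying by 4 yields bytes, hence a net shift of 10.
 */
static inline uint16_t sketch_tcp_hdrlen(uint16_t offset_flags_host)
{
    return (uint16_t)((offset_flags_host & 0xF000u) >> 10);
}

/*
 * Decide whether a new segment directly follows the cached one.
 * oseq/opayload describe the cached segment, nseq/npayload the new one.
 */
static inline enum sketch_verdict sketch_seq_check(uint32_t oseq,
                                                   uint16_t opayload,
                                                   uint32_t nseq,
                                                   uint16_t npayload)
{
    /* Unsigned subtraction wraps modulo 2^32, like TCP sequence space. */
    uint32_t delta = nseq - oseq;

    if (delta > SKETCH_MAX_TCP_PAYLOAD) {
        return SKETCH_FINAL;        /* retransmission or far out of order */
    }
    if (delta == 0) {
        /* Data following a pure ack is fine; other cases are simplified. */
        return (opayload == 0 && npayload > 0) ? SKETCH_COALESCE
                                               : SKETCH_FINAL;
    }
    if (delta != opayload) {
        return SKETCH_FINAL;        /* gap or overlap: not contiguous */
    }
    return SKETCH_COALESCE;         /* new data starts where the old ends */
}
```

With these definitions, a cached segment at sequence 1000 carrying 100 bytes
accepts a new segment at 1100 and rejects one at 1200. The same holds across
a wrap of the 32-bit sequence counter, because the difference is computed in
unsigned arithmetic.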

end of thread, other threads:[~2016-11-30 11:13 UTC | newest]

Thread overview: 32+ messages (download: mbox.gz / follow: Atom feed)
-- links below jump to the message on this page --
2016-03-15  9:17 [Qemu-devel] [ Patch 0/2] Support Receive-Segment-Offload(RSC) for WHQL test of Window guest wexu
2016-03-15  9:17 ` [Qemu-devel] [ Patch 1/2] virtio-net rsc: support coalescing ipv4 tcp traffic wexu
2016-03-15 10:00   ` Michael S. Tsirkin
2016-03-16  3:23     ` Wei Xu
2016-03-17  8:42   ` Jason Wang
2016-03-17 16:45     ` Wei Xu
2016-03-18  2:03       ` Jason Wang
2016-03-18  4:17         ` Wei Xu
2016-03-18  5:20           ` Jason Wang
2016-03-18  6:38             ` Wei Xu
2016-03-18  6:56               ` Jason Wang
2016-03-18 14:52                 ` Wei Xu
2016-03-15  9:17 ` [Qemu-devel] [ Patch 2/2] virtio-net rsc: support coalescing ipv6 " wexu
2016-03-17  8:50   ` Jason Wang
2016-03-17 16:50     ` Wei Xu
2016-03-15 10:01 ` [Qemu-devel] [ Patch 0/2] Support Receive-Segment-Offload(RSC) for WHQL test of Window guest Michael S. Tsirkin
2016-03-16  3:08   ` Wei Xu
2016-03-17  6:47 ` Jason Wang
2016-03-17 15:21   ` Wei Xu
2016-03-17 15:44     ` Michael S. Tsirkin
2016-03-17 16:57       ` Wei Xu
2016-03-18  2:22         ` Jason Wang
2016-03-18  4:24           ` Wei Xu
2016-03-18  5:21             ` Jason Wang
2016-03-18  6:30               ` Wei Xu
2016-10-31 17:41 [Qemu-devel] [ RFC Patch v7 0/2] Support Receive-Segment-Offload(RSC) for WHQL wexu
2016-10-31 17:41 ` [Qemu-devel] [PATCH 1/2] virtio-net rsc: support coalescing ipv4 tcp traffic wexu
2016-11-24  4:17   ` Jason Wang
2016-11-24  4:26     ` Michael S. Tsirkin
2016-11-24  4:31       ` Jason Wang
2016-11-24  5:09         ` Michael S. Tsirkin
2016-11-30  8:55     ` Wei Xu
2016-11-30 11:12       ` Jason Wang
