* [Qemu-devel] [RFC PATCH 0/5] Introduce Intel 82574 GbE Controller Emulation (e1000e)
@ 2015-10-25 17:00 Leonid Bloch
  2015-10-25 17:00 ` [Qemu-devel] [RFC PATCH 1/5] net: Add macros for ETH address tracing Leonid Bloch
                   ` (6 more replies)
  0 siblings, 7 replies; 15+ messages in thread
From: Leonid Bloch @ 2015-10-25 17:00 UTC (permalink / raw)
  To: qemu-devel; +Cc: Dmitry Fleytman, Jason Wang, Leonid Bloch, Shmulik Ladkani

Hello qemu-devel,

This patch series is an RFC for the new networking device emulation
we're developing for QEMU.

This new device emulates the Intel 82574 GbE Controller and works
with unmodified Intel e1000e drivers from the Linux/Windows kernels.

The status of the current series is "Functional Device Ready, work
on Extended Features in Progress".

More precisely, these patches implement a functional device that is
recognized by the standard Intel drivers and can transfer TX/RX
packets with CSO/TSO offloads, according to the spec.

Extended features not supported yet (work in progress):
  1. TX/RX Interrupt moderation mechanisms
  2. RSS
  3. Full-featured multi-queue (use of multiqueued network backend)

Also, there will be some code refactoring and performance
optimization efforts.

This series was tested on Linux (Fedora 22) and Windows (2012R2)
guests, using Iperf, with TX/RX and TCP/UDP streams, and various
packet sizes.

More thorough testing, including data streams with different MTU
sizes and Microsoft Certification (HLK) tests, is pending the
development of the missing features.

See commit messages (esp. "net: Introduce e1000e device emulation")
for more information about the development approaches and the
architecture options chosen for this device.

This series is based on the v2.3.0 tag of the upstream QEMU repository,
and it will be rebased to the latest before the final submission.

Please share your thoughts - any feedback is highly welcome :)

Best Regards,
Dmitry Fleytman.

Dmitry Fleytman (5):
  net: Add macros for ETH address tracing
  net_pkt: Name vmxnet3 packet abstractions more generic
  net_pkt: Extend packet abstraction as required by e1000e functionality
  e1000_regs: Add definitions for Intel 82574-specific bits
  net: Introduce e1000e device emulation

 MAINTAINERS             |    2 +
 default-configs/pci.mak |    1 +
 hw/net/Makefile.objs    |    5 +-
 hw/net/e1000_regs.h     |  201 ++++-
 hw/net/e1000e.c         |  531 ++++++++++++
 hw/net/e1000e_core.c    | 2081 +++++++++++++++++++++++++++++++++++++++++++++++
 hw/net/e1000e_core.h    |  181 +++++
 hw/net/net_rx_pkt.c     |  273 +++++++
 hw/net/net_rx_pkt.h     |  241 ++++++
 hw/net/net_tx_pkt.c     |  606 ++++++++++++++
 hw/net/net_tx_pkt.h     |  191 +++++
 hw/net/vmxnet3.c        |   80 +-
 hw/net/vmxnet_rx_pkt.c  |  187 -----
 hw/net/vmxnet_rx_pkt.h  |  174 ----
 hw/net/vmxnet_tx_pkt.c  |  567 -------------
 hw/net/vmxnet_tx_pkt.h  |  148 ----
 include/net/eth.h       |   90 +-
 include/net/net.h       |    5 +
 net/eth.c               |  152 +++-
 tests/Makefile          |    4 +-
 trace-events            |   68 ++
 21 files changed, 4597 insertions(+), 1191 deletions(-)
 create mode 100644 hw/net/e1000e.c
 create mode 100644 hw/net/e1000e_core.c
 create mode 100644 hw/net/e1000e_core.h
 create mode 100644 hw/net/net_rx_pkt.c
 create mode 100644 hw/net/net_rx_pkt.h
 create mode 100644 hw/net/net_tx_pkt.c
 create mode 100644 hw/net/net_tx_pkt.h
 delete mode 100644 hw/net/vmxnet_rx_pkt.c
 delete mode 100644 hw/net/vmxnet_rx_pkt.h
 delete mode 100644 hw/net/vmxnet_tx_pkt.c
 delete mode 100644 hw/net/vmxnet_tx_pkt.h

-- 
2.4.3

^ permalink raw reply	[flat|nested] 15+ messages in thread

* [Qemu-devel] [RFC PATCH 1/5] net: Add macros for ETH address tracing
  2015-10-25 17:00 [Qemu-devel] [RFC PATCH 0/5] Introduce Intel 82574 GbE Controller Emulation (e1000e) Leonid Bloch
@ 2015-10-25 17:00 ` Leonid Bloch
  2015-10-25 17:00 ` [Qemu-devel] [RFC PATCH 2/5] net_pkt: Name vmxnet3 packet abstractions more generic Leonid Bloch
                   ` (5 subsequent siblings)
  6 siblings, 0 replies; 15+ messages in thread
From: Leonid Bloch @ 2015-10-25 17:00 UTC (permalink / raw)
  To: qemu-devel; +Cc: Dmitry Fleytman, Jason Wang, Leonid Bloch, Shmulik Ladkani

From: Dmitry Fleytman <dmitry.fleytman@ravellosystems.com>

This patch introduces a handful of macros
for tracing Ethernet addresses.

There are two reasons to add these macros:

  1. They will be used by future commits
     introducing e1000e device emulation;
  2. They fix the vmxnet3 build with debug tracing enabled:

     When vmxnet3 configuration tracing is enabled by uncommenting
     the VMXNET_DEBUG_CONFIG definition in vmxnet_debug.h, the
     following compilation error is observed:

     hw/net/vmxnet3.c: In function ‘vmxnet3_net_init’:
     hw/net/vmxnet3.c:1974:52: error: expected ‘)’ before ‘MAC_FMT’
          VMW_CFPRN("Permanent MAC: " MAC_FMT, MAC_ARG(s->perm_mac.a));
                                                    ^
     hw/net/vmxnet3.c:1974:17: error: format ‘%s’ expects a matching ‘char *’ argument [-Werror=format=]
          VMW_CFPRN("Permanent MAC: " MAC_FMT, MAC_ARG(s->perm_mac.a));
                      ^
     hw/net/vmxnet3.c:1974:17: error: format ‘%s’ expects a matching ‘char *’ argument [-Werror=format=]
     cc1: all warnings being treated as errors

Signed-off-by: Dmitry Fleytman <dmitry.fleytman@ravellosystems.com>
Signed-off-by: Leonid Bloch <leonid.bloch@ravellosystems.com>
---
 include/net/net.h | 5 +++++
 1 file changed, 5 insertions(+)

diff --git a/include/net/net.h b/include/net/net.h
index 50ffcb9..fa561ea 100644
--- a/include/net/net.h
+++ b/include/net/net.h
@@ -9,6 +9,11 @@
 #include "migration/vmstate.h"
 #include "qapi-types.h"
 
+#define MAC_FMT "%02x:%02x:%02x:%02x:%02x:%02x"
+#define MAC_ARG(x) ((uint8_t *)(x))[0], ((uint8_t *)(x))[1], \
+                   ((uint8_t *)(x))[2], ((uint8_t *)(x))[3], \
+                   ((uint8_t *)(x))[4], ((uint8_t *)(x))[5]
+
 #define MAX_QUEUE_NUM 1024
 
 /* Maximum GSO packet size (64k) plus plenty of room for
-- 
2.4.3

^ permalink raw reply related	[flat|nested] 15+ messages in thread

* [Qemu-devel] [RFC PATCH 2/5] net_pkt: Name vmxnet3 packet abstractions more generic
  2015-10-25 17:00 [Qemu-devel] [RFC PATCH 0/5] Introduce Intel 82574 GbE Controller Emulation (e1000e) Leonid Bloch
  2015-10-25 17:00 ` [Qemu-devel] [RFC PATCH 1/5] net: Add macros for ETH address tracing Leonid Bloch
@ 2015-10-25 17:00 ` Leonid Bloch
  2015-10-25 17:00 ` [Qemu-devel] [RFC PATCH 3/5] net_pkt: Extend packet abstraction as required by e1000e functionality Leonid Bloch
                   ` (4 subsequent siblings)
  6 siblings, 0 replies; 15+ messages in thread
From: Leonid Bloch @ 2015-10-25 17:00 UTC (permalink / raw)
  To: qemu-devel; +Cc: Dmitry Fleytman, Jason Wang, Leonid Bloch, Shmulik Ladkani

From: Dmitry Fleytman <dmitry.fleytman@ravellosystems.com>

This patch drops the "vmx" prefix from the packet abstraction names
to emphasize that they are generic and not tied to any
specific network device.

These abstractions will be reused by the e1000e emulation
introduced in the following patches, so their names need generalization.

This patch (except for the renamed files, adjusted comments, and changes in
MAINTAINERS) was produced by:

git grep -lz 'vmxnet_tx_pkt' | xargs -0 perl -i'' -pE "s/vmxnet_tx_pkt/net_tx_pkt/g"
git grep -lz 'vmxnet_rx_pkt' | xargs -0 perl -i'' -pE "s/vmxnet_rx_pkt/net_rx_pkt/g"
git grep -lz 'VmxnetTxPkt' | xargs -0 perl -i'' -pE "s/VmxnetTxPkt/NetTxPkt/g"
git grep -lz 'VMXNET_TX_PKT' | xargs -0 perl -i'' -pE "s/VMXNET_TX_PKT/NET_TX_PKT/g"
git grep -lz 'VmxnetRxPkt' | xargs -0 perl -i'' -pE "s/VmxnetRxPkt/NetRxPkt/g"
git grep -lz 'VMXNET_RX_PKT' | xargs -0 perl -i'' -pE "s/VMXNET_RX_PKT/NET_RX_PKT/g"
sed -ie 's/VMXNET_/NET_/g' hw/net/vmxnet_rx_pkt.c
sed -ie 's/VMXNET_/NET_/g' hw/net/vmxnet_tx_pkt.c

Signed-off-by: Dmitry Fleytman <dmitry.fleytman@ravellosystems.com>
Signed-off-by: Leonid Bloch <leonid.bloch@ravellosystems.com>
---
 MAINTAINERS            |   2 +
 hw/net/Makefile.objs   |   2 +-
 hw/net/net_rx_pkt.c    | 187 ++++++++++++++++
 hw/net/net_rx_pkt.h    | 174 +++++++++++++++
 hw/net/net_tx_pkt.c    | 567 +++++++++++++++++++++++++++++++++++++++++++++++++
 hw/net/net_tx_pkt.h    | 148 +++++++++++++
 hw/net/vmxnet3.c       |  80 +++----
 hw/net/vmxnet_rx_pkt.c | 187 ----------------
 hw/net/vmxnet_rx_pkt.h | 174 ---------------
 hw/net/vmxnet_tx_pkt.c | 567 -------------------------------------------------
 hw/net/vmxnet_tx_pkt.h | 148 -------------
 tests/Makefile         |   4 +-
 12 files changed, 1121 insertions(+), 1119 deletions(-)
 create mode 100644 hw/net/net_rx_pkt.c
 create mode 100644 hw/net/net_rx_pkt.h
 create mode 100644 hw/net/net_tx_pkt.c
 create mode 100644 hw/net/net_tx_pkt.h
 delete mode 100644 hw/net/vmxnet_rx_pkt.c
 delete mode 100644 hw/net/vmxnet_rx_pkt.h
 delete mode 100644 hw/net/vmxnet_tx_pkt.c
 delete mode 100644 hw/net/vmxnet_tx_pkt.h

diff --git a/MAINTAINERS b/MAINTAINERS
index d7e9ba2..8a3f742 100644
--- a/MAINTAINERS
+++ b/MAINTAINERS
@@ -752,6 +752,8 @@ Vmware
 M: Dmitry Fleytman <dmitry@daynix.com>
 S: Maintained
 F: hw/net/vmxnet*
+F: hw/net/net_rx_pkt*
+F: hw/net/net_tx_pkt*
 F: hw/scsi/vmw_pvscsi*
 
 Subsystems
diff --git a/hw/net/Makefile.objs b/hw/net/Makefile.objs
index ea93293..34039fc 100644
--- a/hw/net/Makefile.objs
+++ b/hw/net/Makefile.objs
@@ -8,7 +8,7 @@ common-obj-$(CONFIG_PCNET_PCI) += pcnet-pci.o
 common-obj-$(CONFIG_PCNET_COMMON) += pcnet.o
 common-obj-$(CONFIG_E1000_PCI) += e1000.o
 common-obj-$(CONFIG_RTL8139_PCI) += rtl8139.o
-common-obj-$(CONFIG_VMXNET3_PCI) += vmxnet_tx_pkt.o vmxnet_rx_pkt.o
+common-obj-$(CONFIG_VMXNET3_PCI) += net_tx_pkt.o net_rx_pkt.o
 common-obj-$(CONFIG_VMXNET3_PCI) += vmxnet3.o
 
 common-obj-$(CONFIG_SMC91C111) += smc91c111.o
diff --git a/hw/net/net_rx_pkt.c b/hw/net/net_rx_pkt.c
new file mode 100644
index 0000000..f4c929f
--- /dev/null
+++ b/hw/net/net_rx_pkt.c
@@ -0,0 +1,187 @@
+/*
+ * QEMU RX packets abstractions
+ *
+ * Copyright (c) 2012 Ravello Systems LTD (http://ravellosystems.com)
+ *
+ * Developed by Daynix Computing LTD (http://www.daynix.com)
+ *
+ * Authors:
+ * Dmitry Fleytman <dmitry@daynix.com>
+ * Tamir Shomer <tamirs@daynix.com>
+ * Yan Vugenfirer <yan@daynix.com>
+ *
+ * This work is licensed under the terms of the GNU GPL, version 2 or later.
+ * See the COPYING file in the top-level directory.
+ *
+ */
+
+#include "net_rx_pkt.h"
+#include "net/eth.h"
+#include "qemu-common.h"
+#include "qemu/iov.h"
+#include "net/checksum.h"
+#include "net/tap.h"
+
+/*
+ * RX packet may contain up to 2 fragments - rebuilt eth header
+ * in case of VLAN tag stripping
+ * and payload received from QEMU - in any case
+ */
+#define NET_MAX_RX_PACKET_FRAGMENTS (2)
+
+struct NetRxPkt {
+    struct virtio_net_hdr virt_hdr;
+    uint8_t ehdr_buf[ETH_MAX_L2_HDR_LEN];
+    struct iovec vec[NET_MAX_RX_PACKET_FRAGMENTS];
+    uint16_t vec_len;
+    uint32_t tot_len;
+    uint16_t tci;
+    bool vlan_stripped;
+    bool has_virt_hdr;
+    eth_pkt_types_e packet_type;
+
+    /* Analysis results */
+    bool isip4;
+    bool isip6;
+    bool isudp;
+    bool istcp;
+};
+
+void net_rx_pkt_init(struct NetRxPkt **pkt, bool has_virt_hdr)
+{
+    struct NetRxPkt *p = g_malloc0(sizeof *p);
+    p->has_virt_hdr = has_virt_hdr;
+    *pkt = p;
+}
+
+void net_rx_pkt_uninit(struct NetRxPkt *pkt)
+{
+    g_free(pkt);
+}
+
+struct virtio_net_hdr *net_rx_pkt_get_vhdr(struct NetRxPkt *pkt)
+{
+    assert(pkt);
+    return &pkt->virt_hdr;
+}
+
+void net_rx_pkt_attach_data(struct NetRxPkt *pkt, const void *data,
+                               size_t len, bool strip_vlan)
+{
+    uint16_t tci = 0;
+    uint16_t ploff;
+    assert(pkt);
+    pkt->vlan_stripped = false;
+
+    if (strip_vlan) {
+        pkt->vlan_stripped = eth_strip_vlan(data, pkt->ehdr_buf, &ploff, &tci);
+    }
+
+    if (pkt->vlan_stripped) {
+        pkt->vec[0].iov_base = pkt->ehdr_buf;
+        pkt->vec[0].iov_len = ploff - sizeof(struct vlan_header);
+        pkt->vec[1].iov_base = (uint8_t *) data + ploff;
+        pkt->vec[1].iov_len = len - ploff;
+        pkt->vec_len = 2;
+        pkt->tot_len = len - ploff + sizeof(struct eth_header);
+    } else {
+        pkt->vec[0].iov_base = (void *)data;
+        pkt->vec[0].iov_len = len;
+        pkt->vec_len = 1;
+        pkt->tot_len = len;
+    }
+
+    pkt->tci = tci;
+
+    eth_get_protocols(data, len, &pkt->isip4, &pkt->isip6,
+        &pkt->isudp, &pkt->istcp);
+}
+
+void net_rx_pkt_dump(struct NetRxPkt *pkt)
+{
+#ifdef NET_RX_PKT_DEBUG
+    assert(pkt);
+
+    printf("RX PKT: tot_len: %d, vlan_stripped: %d, vlan_tag: %d\n",
+              pkt->tot_len, pkt->vlan_stripped, pkt->tci);
+#endif
+}
+
+void net_rx_pkt_set_packet_type(struct NetRxPkt *pkt,
+    eth_pkt_types_e packet_type)
+{
+    assert(pkt);
+
+    pkt->packet_type = packet_type;
+
+}
+
+eth_pkt_types_e net_rx_pkt_get_packet_type(struct NetRxPkt *pkt)
+{
+    assert(pkt);
+
+    return pkt->packet_type;
+}
+
+size_t net_rx_pkt_get_total_len(struct NetRxPkt *pkt)
+{
+    assert(pkt);
+
+    return pkt->tot_len;
+}
+
+void net_rx_pkt_get_protocols(struct NetRxPkt *pkt,
+                                 bool *isip4, bool *isip6,
+                                 bool *isudp, bool *istcp)
+{
+    assert(pkt);
+
+    *isip4 = pkt->isip4;
+    *isip6 = pkt->isip6;
+    *isudp = pkt->isudp;
+    *istcp = pkt->istcp;
+}
+
+struct iovec *net_rx_pkt_get_iovec(struct NetRxPkt *pkt)
+{
+    assert(pkt);
+
+    return pkt->vec;
+}
+
+void net_rx_pkt_set_vhdr(struct NetRxPkt *pkt,
+                            struct virtio_net_hdr *vhdr)
+{
+    assert(pkt);
+
+    memcpy(&pkt->virt_hdr, vhdr, sizeof pkt->virt_hdr);
+}
+
+bool net_rx_pkt_is_vlan_stripped(struct NetRxPkt *pkt)
+{
+    assert(pkt);
+
+    return pkt->vlan_stripped;
+}
+
+bool net_rx_pkt_has_virt_hdr(struct NetRxPkt *pkt)
+{
+    assert(pkt);
+
+    return pkt->has_virt_hdr;
+}
+
+uint16_t net_rx_pkt_get_num_frags(struct NetRxPkt *pkt)
+{
+    assert(pkt);
+
+    return pkt->vec_len;
+}
+
+uint16_t net_rx_pkt_get_vlan_tag(struct NetRxPkt *pkt)
+{
+    assert(pkt);
+
+    return pkt->tci;
+}
diff --git a/hw/net/net_rx_pkt.h b/hw/net/net_rx_pkt.h
new file mode 100644
index 0000000..1e4accf
--- /dev/null
+++ b/hw/net/net_rx_pkt.h
@@ -0,0 +1,174 @@
+/*
+ * QEMU RX packets abstraction
+ *
+ * Copyright (c) 2012 Ravello Systems LTD (http://ravellosystems.com)
+ *
+ * Developed by Daynix Computing LTD (http://www.daynix.com)
+ *
+ * Authors:
+ * Dmitry Fleytman <dmitry@daynix.com>
+ * Tamir Shomer <tamirs@daynix.com>
+ * Yan Vugenfirer <yan@daynix.com>
+ *
+ * This work is licensed under the terms of the GNU GPL, version 2 or later.
+ * See the COPYING file in the top-level directory.
+ *
+ */
+
+#ifndef NET_RX_PKT_H
+#define NET_RX_PKT_H
+
+#include "stdint.h"
+#include "stdbool.h"
+#include "net/eth.h"
+
+/* defines to enable packet dump functions */
+/*#define NET_RX_PKT_DEBUG*/
+
+struct NetRxPkt;
+
+/**
+ * Clean all rx packet resources
+ *
+ * @pkt:            packet
+ *
+ */
+void net_rx_pkt_uninit(struct NetRxPkt *pkt);
+
+/**
+ * Init function for rx packet functionality
+ *
+ * @pkt:            packet pointer
+ * @has_virt_hdr:   device uses virtio header
+ *
+ */
+void net_rx_pkt_init(struct NetRxPkt **pkt, bool has_virt_hdr);
+
+/**
+ * returns total length of data attached to rx context
+ *
+ * @pkt:            packet
+ *
+ * Return:  total length of the attached data
+ *
+ */
+size_t net_rx_pkt_get_total_len(struct NetRxPkt *pkt);
+
+/**
+ * fetches packet analysis results
+ *
+ * @pkt:            packet
+ * @isip4:          whether the packet given is IPv4
+ * @isip6:          whether the packet given is IPv6
+ * @isudp:          whether the packet given is UDP
+ * @istcp:          whether the packet given is TCP
+ *
+ */
+void net_rx_pkt_get_protocols(struct NetRxPkt *pkt,
+                                 bool *isip4, bool *isip6,
+                                 bool *isudp, bool *istcp);
+
+/**
+ * returns virtio header stored in rx context
+ *
+ * @pkt:            packet
+ * @ret:            virtio header
+ *
+ */
+struct virtio_net_hdr *net_rx_pkt_get_vhdr(struct NetRxPkt *pkt);
+
+/**
+ * returns packet type
+ *
+ * @pkt:            packet
+ * @ret:            packet type
+ *
+ */
+eth_pkt_types_e net_rx_pkt_get_packet_type(struct NetRxPkt *pkt);
+
+/**
+ * returns vlan tag
+ *
+ * @pkt:            packet
+ * @ret:            VLAN tag
+ *
+ */
+uint16_t net_rx_pkt_get_vlan_tag(struct NetRxPkt *pkt);
+
+/**
+ * tells whether vlan was stripped from the packet
+ *
+ * @pkt:            packet
+ * @ret:            true if a VLAN tag was stripped
+ *
+ */
+bool net_rx_pkt_is_vlan_stripped(struct NetRxPkt *pkt);
+
+/**
+ * notifies caller if the packet has virtio header
+ *
+ * @pkt:            packet
+ * @ret:            true if packet has virtio header, false otherwise
+ *
+ */
+bool net_rx_pkt_has_virt_hdr(struct NetRxPkt *pkt);
+
+/**
+ * returns number of frags attached to the packet
+ *
+ * @pkt:            packet
+ * @ret:            number of frags
+ *
+ */
+uint16_t net_rx_pkt_get_num_frags(struct NetRxPkt *pkt);
+
+/**
+ * attach data to rx packet
+ *
+ * @pkt:            packet
+ * @data:           pointer to the data buffer
+ * @len:            data length
+ * @strip_vlan:     should the module strip vlan from data
+ *
+ */
+void net_rx_pkt_attach_data(struct NetRxPkt *pkt, const void *data,
+    size_t len, bool strip_vlan);
+
+/**
+ * returns io vector that holds the attached data
+ *
+ * @pkt:            packet
+ * @ret:            pointer to IOVec
+ *
+ */
+struct iovec *net_rx_pkt_get_iovec(struct NetRxPkt *pkt);
+
+/**
+ * prints rx packet data if debug is enabled
+ *
+ * @pkt:            packet
+ *
+ */
+void net_rx_pkt_dump(struct NetRxPkt *pkt);
+
+/**
+ * copy passed vhdr data to packet context
+ *
+ * @pkt:            packet
+ * @vhdr:           VHDR buffer
+ *
+ */
+void net_rx_pkt_set_vhdr(struct NetRxPkt *pkt,
+    struct virtio_net_hdr *vhdr);
+
+/**
+ * save packet type in packet context
+ *
+ * @pkt:            packet
+ * @packet_type:    the packet type
+ *
+ */
+void net_rx_pkt_set_packet_type(struct NetRxPkt *pkt,
+    eth_pkt_types_e packet_type);
+
+#endif
diff --git a/hw/net/net_tx_pkt.c b/hw/net/net_tx_pkt.c
new file mode 100644
index 0000000..a2c1a76
--- /dev/null
+++ b/hw/net/net_tx_pkt.c
@@ -0,0 +1,567 @@
+/*
+ * QEMU TX packets abstractions
+ *
+ * Copyright (c) 2012 Ravello Systems LTD (http://ravellosystems.com)
+ *
+ * Developed by Daynix Computing LTD (http://www.daynix.com)
+ *
+ * Authors:
+ * Dmitry Fleytman <dmitry@daynix.com>
+ * Tamir Shomer <tamirs@daynix.com>
+ * Yan Vugenfirer <yan@daynix.com>
+ *
+ * This work is licensed under the terms of the GNU GPL, version 2 or later.
+ * See the COPYING file in the top-level directory.
+ *
+ */
+
+#include "hw/hw.h"
+#include "net_tx_pkt.h"
+#include "net/eth.h"
+#include "qemu-common.h"
+#include "qemu/iov.h"
+#include "net/checksum.h"
+#include "net/tap.h"
+#include "net/net.h"
+
+enum {
+    NET_TX_PKT_VHDR_FRAG = 0,
+    NET_TX_PKT_L2HDR_FRAG,
+    NET_TX_PKT_L3HDR_FRAG,
+    NET_TX_PKT_PL_START_FRAG
+};
+
+/* TX packet private context */
+struct NetTxPkt {
+    struct virtio_net_hdr virt_hdr;
+    bool has_virt_hdr;
+
+    struct iovec *raw;
+    uint32_t raw_frags;
+    uint32_t max_raw_frags;
+
+    struct iovec *vec;
+
+    uint8_t l2_hdr[ETH_MAX_L2_HDR_LEN];
+
+    uint32_t payload_len;
+
+    uint32_t payload_frags;
+    uint32_t max_payload_frags;
+
+    uint16_t hdr_len;
+    eth_pkt_types_e packet_type;
+    uint8_t l4proto;
+};
+
+void net_tx_pkt_init(struct NetTxPkt **pkt, uint32_t max_frags,
+    bool has_virt_hdr)
+{
+    struct NetTxPkt *p = g_malloc0(sizeof *p);
+
+    p->vec = g_malloc((sizeof *p->vec) *
+        (max_frags + NET_TX_PKT_PL_START_FRAG));
+
+    p->raw = g_malloc((sizeof *p->raw) * max_frags);
+
+    p->max_payload_frags = max_frags;
+    p->max_raw_frags = max_frags;
+    p->has_virt_hdr = has_virt_hdr;
+    p->vec[NET_TX_PKT_VHDR_FRAG].iov_base = &p->virt_hdr;
+    p->vec[NET_TX_PKT_VHDR_FRAG].iov_len =
+        p->has_virt_hdr ? sizeof p->virt_hdr : 0;
+    p->vec[NET_TX_PKT_L2HDR_FRAG].iov_base = &p->l2_hdr;
+    p->vec[NET_TX_PKT_L3HDR_FRAG].iov_base = NULL;
+    p->vec[NET_TX_PKT_L3HDR_FRAG].iov_len = 0;
+
+    *pkt = p;
+}
+
+void net_tx_pkt_uninit(struct NetTxPkt *pkt)
+{
+    if (pkt) {
+        g_free(pkt->vec);
+        g_free(pkt->raw);
+        g_free(pkt);
+    }
+}
+
+void net_tx_pkt_update_ip_checksums(struct NetTxPkt *pkt)
+{
+    uint16_t csum;
+    uint32_t ph_raw_csum;
+    assert(pkt);
+    uint8_t gso_type = pkt->virt_hdr.gso_type & ~VIRTIO_NET_HDR_GSO_ECN;
+    struct ip_header *ip_hdr;
+
+    if (VIRTIO_NET_HDR_GSO_TCPV4 != gso_type &&
+        VIRTIO_NET_HDR_GSO_UDP != gso_type) {
+        return;
+    }
+
+    ip_hdr = pkt->vec[NET_TX_PKT_L3HDR_FRAG].iov_base;
+
+    if (pkt->payload_len + pkt->vec[NET_TX_PKT_L3HDR_FRAG].iov_len >
+        ETH_MAX_IP_DGRAM_LEN) {
+        return;
+    }
+
+    ip_hdr->ip_len = cpu_to_be16(pkt->payload_len +
+        pkt->vec[NET_TX_PKT_L3HDR_FRAG].iov_len);
+
+    /* Calculate IP header checksum                    */
+    ip_hdr->ip_sum = 0;
+    csum = net_raw_checksum((uint8_t *)ip_hdr,
+        pkt->vec[NET_TX_PKT_L3HDR_FRAG].iov_len);
+    ip_hdr->ip_sum = cpu_to_be16(csum);
+
+    /* Calculate IP pseudo header checksum             */
+    ph_raw_csum = eth_calc_pseudo_hdr_csum(ip_hdr, pkt->payload_len);
+    csum = cpu_to_be16(~net_checksum_finish(ph_raw_csum));
+    iov_from_buf(&pkt->vec[NET_TX_PKT_PL_START_FRAG], pkt->payload_frags,
+                 pkt->virt_hdr.csum_offset, &csum, sizeof(csum));
+}
+
+static void net_tx_pkt_calculate_hdr_len(struct NetTxPkt *pkt)
+{
+    pkt->hdr_len = pkt->vec[NET_TX_PKT_L2HDR_FRAG].iov_len +
+        pkt->vec[NET_TX_PKT_L3HDR_FRAG].iov_len;
+}
+
+static bool net_tx_pkt_parse_headers(struct NetTxPkt *pkt)
+{
+    struct iovec *l2_hdr, *l3_hdr;
+    size_t bytes_read;
+    size_t full_ip6hdr_len;
+    uint16_t l3_proto;
+
+    assert(pkt);
+
+    l2_hdr = &pkt->vec[NET_TX_PKT_L2HDR_FRAG];
+    l3_hdr = &pkt->vec[NET_TX_PKT_L3HDR_FRAG];
+
+    bytes_read = iov_to_buf(pkt->raw, pkt->raw_frags, 0, l2_hdr->iov_base,
+                            ETH_MAX_L2_HDR_LEN);
+    if (bytes_read < ETH_MAX_L2_HDR_LEN) {
+        l2_hdr->iov_len = 0;
+        return false;
+    } else {
+        l2_hdr->iov_len = eth_get_l2_hdr_length(l2_hdr->iov_base);
+    }
+
+    l3_proto = eth_get_l3_proto(l2_hdr->iov_base, l2_hdr->iov_len);
+
+    switch (l3_proto) {
+    case ETH_P_IP:
+        l3_hdr->iov_base = g_malloc(ETH_MAX_IP4_HDR_LEN);
+
+        bytes_read = iov_to_buf(pkt->raw, pkt->raw_frags, l2_hdr->iov_len,
+                                l3_hdr->iov_base, sizeof(struct ip_header));
+
+        if (bytes_read < sizeof(struct ip_header)) {
+            l3_hdr->iov_len = 0;
+            return false;
+        }
+
+        l3_hdr->iov_len = IP_HDR_GET_LEN(l3_hdr->iov_base);
+        pkt->l4proto = ((struct ip_header *) l3_hdr->iov_base)->ip_p;
+
+        /* copy optional IPv4 header data */
+        bytes_read = iov_to_buf(pkt->raw, pkt->raw_frags,
+                                l2_hdr->iov_len + sizeof(struct ip_header),
+                                l3_hdr->iov_base + sizeof(struct ip_header),
+                                l3_hdr->iov_len - sizeof(struct ip_header));
+        if (bytes_read < l3_hdr->iov_len - sizeof(struct ip_header)) {
+            l3_hdr->iov_len = 0;
+            return false;
+        }
+        break;
+
+    case ETH_P_IPV6:
+        if (!eth_parse_ipv6_hdr(pkt->raw, pkt->raw_frags, l2_hdr->iov_len,
+                               &pkt->l4proto, &full_ip6hdr_len)) {
+            l3_hdr->iov_len = 0;
+            return false;
+        }
+
+        l3_hdr->iov_base = g_malloc(full_ip6hdr_len);
+
+        bytes_read = iov_to_buf(pkt->raw, pkt->raw_frags, l2_hdr->iov_len,
+                                l3_hdr->iov_base, full_ip6hdr_len);
+
+        if (bytes_read < full_ip6hdr_len) {
+            l3_hdr->iov_len = 0;
+            return false;
+        } else {
+            l3_hdr->iov_len = full_ip6hdr_len;
+        }
+        break;
+
+    default:
+        l3_hdr->iov_len = 0;
+        break;
+    }
+
+    net_tx_pkt_calculate_hdr_len(pkt);
+    pkt->packet_type = get_eth_packet_type(l2_hdr->iov_base);
+    return true;
+}
+
+static bool net_tx_pkt_rebuild_payload(struct NetTxPkt *pkt)
+{
+    size_t payload_len = iov_size(pkt->raw, pkt->raw_frags) - pkt->hdr_len;
+
+    pkt->payload_frags = iov_copy(&pkt->vec[NET_TX_PKT_PL_START_FRAG],
+                                pkt->max_payload_frags,
+                                pkt->raw, pkt->raw_frags,
+                                pkt->hdr_len, payload_len);
+
+    if (pkt->payload_frags != (uint32_t) -1) {
+        pkt->payload_len = payload_len;
+        return true;
+    } else {
+        return false;
+    }
+}
+
+bool net_tx_pkt_parse(struct NetTxPkt *pkt)
+{
+    return net_tx_pkt_parse_headers(pkt) &&
+           net_tx_pkt_rebuild_payload(pkt);
+}
+
+struct virtio_net_hdr *net_tx_pkt_get_vhdr(struct NetTxPkt *pkt)
+{
+    assert(pkt);
+    return &pkt->virt_hdr;
+}
+
+static uint8_t net_tx_pkt_get_gso_type(struct NetTxPkt *pkt,
+                                          bool tso_enable)
+{
+    uint8_t rc = VIRTIO_NET_HDR_GSO_NONE;
+    uint16_t l3_proto;
+
+    l3_proto = eth_get_l3_proto(pkt->vec[NET_TX_PKT_L2HDR_FRAG].iov_base,
+        pkt->vec[NET_TX_PKT_L2HDR_FRAG].iov_len);
+
+    if (!tso_enable) {
+        goto func_exit;
+    }
+
+    rc = eth_get_gso_type(l3_proto, pkt->vec[NET_TX_PKT_L3HDR_FRAG].iov_base,
+                          pkt->l4proto);
+
+func_exit:
+    return rc;
+}
+
+void net_tx_pkt_build_vheader(struct NetTxPkt *pkt, bool tso_enable,
+    bool csum_enable, uint32_t gso_size)
+{
+    struct tcp_hdr l4hdr;
+    assert(pkt);
+
+    /* csum has to be enabled if tso is. */
+    assert(csum_enable || !tso_enable);
+
+    pkt->virt_hdr.gso_type = net_tx_pkt_get_gso_type(pkt, tso_enable);
+
+    switch (pkt->virt_hdr.gso_type & ~VIRTIO_NET_HDR_GSO_ECN) {
+    case VIRTIO_NET_HDR_GSO_NONE:
+        pkt->virt_hdr.hdr_len = 0;
+        pkt->virt_hdr.gso_size = 0;
+        break;
+
+    case VIRTIO_NET_HDR_GSO_UDP:
+        pkt->virt_hdr.gso_size = IP_FRAG_ALIGN_SIZE(gso_size);
+        pkt->virt_hdr.hdr_len = pkt->hdr_len + sizeof(struct udp_header);
+        break;
+
+    case VIRTIO_NET_HDR_GSO_TCPV4:
+    case VIRTIO_NET_HDR_GSO_TCPV6:
+        iov_to_buf(&pkt->vec[NET_TX_PKT_PL_START_FRAG], pkt->payload_frags,
+                   0, &l4hdr, sizeof(l4hdr));
+        pkt->virt_hdr.hdr_len = pkt->hdr_len + l4hdr.th_off * sizeof(uint32_t);
+        pkt->virt_hdr.gso_size = IP_FRAG_ALIGN_SIZE(gso_size);
+        break;
+
+    default:
+        g_assert_not_reached();
+    }
+
+    if (csum_enable) {
+        switch (pkt->l4proto) {
+        case IP_PROTO_TCP:
+            pkt->virt_hdr.flags = VIRTIO_NET_HDR_F_NEEDS_CSUM;
+            pkt->virt_hdr.csum_start = pkt->hdr_len;
+            pkt->virt_hdr.csum_offset = offsetof(struct tcp_hdr, th_sum);
+            break;
+        case IP_PROTO_UDP:
+            pkt->virt_hdr.flags = VIRTIO_NET_HDR_F_NEEDS_CSUM;
+            pkt->virt_hdr.csum_start = pkt->hdr_len;
+            pkt->virt_hdr.csum_offset = offsetof(struct udp_hdr, uh_sum);
+            break;
+        default:
+            break;
+        }
+    }
+}
+
+void net_tx_pkt_setup_vlan_header(struct NetTxPkt *pkt, uint16_t vlan)
+{
+    bool is_new;
+    assert(pkt);
+
+    eth_setup_vlan_headers(pkt->vec[NET_TX_PKT_L2HDR_FRAG].iov_base,
+        vlan, &is_new);
+
+    /* update l2hdrlen */
+    if (is_new) {
+        pkt->hdr_len += sizeof(struct vlan_header);
+        pkt->vec[NET_TX_PKT_L2HDR_FRAG].iov_len +=
+            sizeof(struct vlan_header);
+    }
+}
+
+bool net_tx_pkt_add_raw_fragment(struct NetTxPkt *pkt, hwaddr pa,
+    size_t len)
+{
+    hwaddr mapped_len = 0;
+    struct iovec *ventry;
+    assert(pkt);
+    assert(pkt->max_raw_frags > pkt->raw_frags);
+
+    if (!len) {
+        return true;
+    }
+
+    ventry = &pkt->raw[pkt->raw_frags];
+    mapped_len = len;
+
+    ventry->iov_base = cpu_physical_memory_map(pa, &mapped_len, false);
+    ventry->iov_len = mapped_len;
+    pkt->raw_frags += !!ventry->iov_base;
+
+    if ((ventry->iov_base == NULL) || (len != mapped_len)) {
+        return false;
+    }
+
+    return true;
+}
+
+eth_pkt_types_e net_tx_pkt_get_packet_type(struct NetTxPkt *pkt)
+{
+    assert(pkt);
+
+    return pkt->packet_type;
+}
+
+size_t net_tx_pkt_get_total_len(struct NetTxPkt *pkt)
+{
+    assert(pkt);
+
+    return pkt->hdr_len + pkt->payload_len;
+}
+
+void net_tx_pkt_dump(struct NetTxPkt *pkt)
+{
+#ifdef NET_TX_PKT_DEBUG
+    assert(pkt);
+
+    printf("TX PKT: hdr_len: %d, pkt_type: 0x%X, l2hdr_len: %lu, "
+        "l3hdr_len: %lu, payload_len: %u\n", pkt->hdr_len, pkt->packet_type,
+        pkt->vec[NET_TX_PKT_L2HDR_FRAG].iov_len,
+        pkt->vec[NET_TX_PKT_L3HDR_FRAG].iov_len, pkt->payload_len);
+#endif
+}
+
+void net_tx_pkt_reset(struct NetTxPkt *pkt)
+{
+    int i;
+
+    /* no assert, as reset can be called before tx_pkt_init */
+    if (!pkt) {
+        return;
+    }
+
+    memset(&pkt->virt_hdr, 0, sizeof(pkt->virt_hdr));
+
+    g_free(pkt->vec[NET_TX_PKT_L3HDR_FRAG].iov_base);
+    pkt->vec[NET_TX_PKT_L3HDR_FRAG].iov_base = NULL;
+
+    assert(pkt->vec);
+    for (i = NET_TX_PKT_L2HDR_FRAG;
+         i < pkt->payload_frags + NET_TX_PKT_PL_START_FRAG; i++) {
+        pkt->vec[i].iov_len = 0;
+    }
+    pkt->payload_len = 0;
+    pkt->payload_frags = 0;
+
+    assert(pkt->raw);
+    for (i = 0; i < pkt->raw_frags; i++) {
+        assert(pkt->raw[i].iov_base);
+        cpu_physical_memory_unmap(pkt->raw[i].iov_base, pkt->raw[i].iov_len,
+                                  false, pkt->raw[i].iov_len);
+        pkt->raw[i].iov_len = 0;
+    }
+    pkt->raw_frags = 0;
+
+    pkt->hdr_len = 0;
+    pkt->packet_type = 0;
+    pkt->l4proto = 0;
+}
+
+static void net_tx_pkt_do_sw_csum(struct NetTxPkt *pkt)
+{
+    struct iovec *iov = &pkt->vec[NET_TX_PKT_L2HDR_FRAG];
+    uint32_t csum_cntr;
+    uint16_t csum = 0;
+    /* num of iovec without vhdr */
+    uint32_t iov_len = pkt->payload_frags + NET_TX_PKT_PL_START_FRAG - 1;
+    uint16_t csl;
+    struct ip_header *iphdr;
+    size_t csum_offset = pkt->virt_hdr.csum_start + pkt->virt_hdr.csum_offset;
+
+    /* Put zero to checksum field */
+    iov_from_buf(iov, iov_len, csum_offset, &csum, sizeof csum);
+
+    /* Calculate L4 TCP/UDP checksum */
+    csl = pkt->payload_len;
+
+    /* data checksum */
+    csum_cntr =
+        net_checksum_add_iov(iov, iov_len, pkt->virt_hdr.csum_start, csl);
+    /* add pseudo header to csum */
+    iphdr = pkt->vec[NET_TX_PKT_L3HDR_FRAG].iov_base;
+    csum_cntr += eth_calc_pseudo_hdr_csum(iphdr, csl);
+
+    /* Put the checksum obtained into the packet */
+    csum = cpu_to_be16(net_checksum_finish(csum_cntr));
+    iov_from_buf(iov, iov_len, csum_offset, &csum, sizeof csum);
+}
+
+enum {
+    NET_TX_PKT_FRAGMENT_L2_HDR_POS = 0,
+    NET_TX_PKT_FRAGMENT_L3_HDR_POS,
+    NET_TX_PKT_FRAGMENT_HEADER_NUM
+};
+
+#define NET_MAX_FRAG_SG_LIST (64)
+
+static size_t net_tx_pkt_fetch_fragment(struct NetTxPkt *pkt,
+    int *src_idx, size_t *src_offset, struct iovec *dst, int *dst_idx)
+{
+    size_t fetched = 0;
+    struct iovec *src = pkt->vec;
+
+    *dst_idx = NET_TX_PKT_FRAGMENT_HEADER_NUM;
+
+    while (fetched < pkt->virt_hdr.gso_size) {
+
+        /* no more place in fragment iov */
+        if (*dst_idx == NET_MAX_FRAG_SG_LIST) {
+            break;
+        }
+
+        /* no more data in iovec */
+        if (*src_idx == (pkt->payload_frags + NET_TX_PKT_PL_START_FRAG)) {
+            break;
+        }
+
+        dst[*dst_idx].iov_base = src[*src_idx].iov_base + *src_offset;
+        dst[*dst_idx].iov_len = MIN(src[*src_idx].iov_len - *src_offset,
+            pkt->virt_hdr.gso_size - fetched);
+
+        *src_offset += dst[*dst_idx].iov_len;
+        fetched += dst[*dst_idx].iov_len;
+
+        if (*src_offset == src[*src_idx].iov_len) {
+            *src_offset = 0;
+            (*src_idx)++;
+        }
+
+        (*dst_idx)++;
+    }
+
+    return fetched;
+}
+
+static bool net_tx_pkt_do_sw_fragmentation(struct NetTxPkt *pkt,
+    NetClientState *nc)
+{
+    struct iovec fragment[NET_MAX_FRAG_SG_LIST];
+    size_t fragment_len = 0;
+    bool more_frags = false;
+
+    /* some pointers for shorter code */
+    void *l2_iov_base, *l3_iov_base;
+    size_t l2_iov_len, l3_iov_len;
+    int src_idx = NET_TX_PKT_PL_START_FRAG, dst_idx;
+    size_t src_offset = 0;
+    size_t fragment_offset = 0;
+
+    l2_iov_base = pkt->vec[NET_TX_PKT_L2HDR_FRAG].iov_base;
+    l2_iov_len = pkt->vec[NET_TX_PKT_L2HDR_FRAG].iov_len;
+    l3_iov_base = pkt->vec[NET_TX_PKT_L3HDR_FRAG].iov_base;
+    l3_iov_len = pkt->vec[NET_TX_PKT_L3HDR_FRAG].iov_len;
+
+    /* Copy headers */
+    fragment[NET_TX_PKT_FRAGMENT_L2_HDR_POS].iov_base = l2_iov_base;
+    fragment[NET_TX_PKT_FRAGMENT_L2_HDR_POS].iov_len = l2_iov_len;
+    fragment[NET_TX_PKT_FRAGMENT_L3_HDR_POS].iov_base = l3_iov_base;
+    fragment[NET_TX_PKT_FRAGMENT_L3_HDR_POS].iov_len = l3_iov_len;
+
+    /* Put as much data as possible and send */
+    do {
+        fragment_len = net_tx_pkt_fetch_fragment(pkt, &src_idx, &src_offset,
+            fragment, &dst_idx);
+
+        more_frags = (fragment_offset + fragment_len < pkt->payload_len);
+
+        eth_setup_ip4_fragmentation(l2_iov_base, l2_iov_len, l3_iov_base,
+            l3_iov_len, fragment_len, fragment_offset, more_frags);
+
+        eth_fix_ip4_checksum(l3_iov_base, l3_iov_len);
+
+        qemu_sendv_packet(nc, fragment, dst_idx);
+
+        fragment_offset += fragment_len;
+
+    } while (more_frags);
+
+    return true;
+}
+
+bool net_tx_pkt_send(struct NetTxPkt *pkt, NetClientState *nc)
+{
+    assert(pkt);
+
+    if (!pkt->has_virt_hdr &&
+        pkt->virt_hdr.flags & VIRTIO_NET_HDR_F_NEEDS_CSUM) {
+        net_tx_pkt_do_sw_csum(pkt);
+    }
+
+    /*
+     * Since the underlying infrastructure does not support IP datagrams
+     * longer than 64K, drop such packets without even trying to send them.
+     */
+    if (pkt->virt_hdr.gso_type != VIRTIO_NET_HDR_GSO_NONE) {
+        if (pkt->payload_len >
+            ETH_MAX_IP_DGRAM_LEN -
+            pkt->vec[NET_TX_PKT_L3HDR_FRAG].iov_len) {
+            return false;
+        }
+    }
+
+    if (pkt->has_virt_hdr ||
+        pkt->virt_hdr.gso_type == VIRTIO_NET_HDR_GSO_NONE) {
+        qemu_sendv_packet(nc, pkt->vec,
+            pkt->payload_frags + NET_TX_PKT_PL_START_FRAG);
+        return true;
+    }
+
+    return net_tx_pkt_do_sw_fragmentation(pkt, nc);
+}
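As an aside for reviewers: the ones'-complement arithmetic that net_tx_pkt_do_sw_csum() delegates to net_checksum_add_iov() and net_checksum_finish() can be sketched standalone. This is an illustrative sketch with hypothetical helper names (csum_add/csum_finish), not the QEMU implementation:

```c
#include <assert.h>
#include <stddef.h>
#include <stdint.h>

/* Sum 16-bit big-endian words into a 32-bit accumulator; an odd
 * trailing byte is padded with zero, as in the Internet checksum. */
static uint32_t csum_add(uint32_t sum, const uint8_t *buf, size_t len)
{
    size_t i;

    for (i = 0; i + 1 < len; i += 2) {
        sum += ((uint32_t)buf[i] << 8) | buf[i + 1];
    }
    if (len & 1) {
        sum += (uint32_t)buf[len - 1] << 8;
    }
    return sum;
}

/* Fold the carries back into the low 16 bits and invert. */
static uint16_t csum_finish(uint32_t sum)
{
    while (sum >> 16) {
        sum = (sum & 0xffff) + (sum >> 16);
    }
    return (uint16_t)~sum;
}
```

The fold-and-invert step can be checked against the worked example in RFC 1071; the real code additionally mixes in the pseudo-header sum before finishing, exactly as the function above does with a second csum_add()-style accumulation.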
diff --git a/hw/net/net_tx_pkt.h b/hw/net/net_tx_pkt.h
new file mode 100644
index 0000000..73a67f8
--- /dev/null
+++ b/hw/net/net_tx_pkt.h
@@ -0,0 +1,148 @@
+/*
+ * QEMU TX packets abstraction
+ *
+ * Copyright (c) 2012 Ravello Systems LTD (http://ravellosystems.com)
+ *
+ * Developed by Daynix Computing LTD (http://www.daynix.com)
+ *
+ * Authors:
+ * Dmitry Fleytman <dmitry@daynix.com>
+ * Tamir Shomer <tamirs@daynix.com>
+ * Yan Vugenfirer <yan@daynix.com>
+ *
+ * This work is licensed under the terms of the GNU GPL, version 2 or later.
+ * See the COPYING file in the top-level directory.
+ *
+ */
+
+#ifndef NET_TX_PKT_H
+#define NET_TX_PKT_H
+
+#include "stdint.h"
+#include "stdbool.h"
+#include "net/eth.h"
+#include "exec/hwaddr.h"
+
+/* define to enable packet dump functions */
+/*#define NET_TX_PKT_DEBUG*/
+
+struct NetTxPkt;
+
+/**
+ * Init function for tx packet functionality
+ *
+ * @pkt:            packet pointer
+ * @max_frags:      max tx ip fragments
+ * @has_virt_hdr:   device uses virtio header.
+ */
+void net_tx_pkt_init(struct NetTxPkt **pkt, uint32_t max_frags,
+    bool has_virt_hdr);
+
+/**
+ * Clean all tx packet resources.
+ *
+ * @pkt:            packet.
+ */
+void net_tx_pkt_uninit(struct NetTxPkt *pkt);
+
+/**
+ * get virtio header
+ *
+ * @pkt:            packet
+ * @ret:            virtio header
+ */
+struct virtio_net_hdr *net_tx_pkt_get_vhdr(struct NetTxPkt *pkt);
+
+/**
+ * build virtio header (will be stored in module context)
+ *
+ * @pkt:            packet
+ * @tso_enable:     TSO enabled
+ * @csum_enable:    CSO enabled
+ * @gso_size:       MSS size for TSO
+ *
+ */
+void net_tx_pkt_build_vheader(struct NetTxPkt *pkt, bool tso_enable,
+    bool csum_enable, uint32_t gso_size);
+
+/**
+ * updates the VLAN tag and adds a VLAN header if it is missing
+ *
+ * @pkt:            packet
+ * @vlan:           VLAN tag
+ *
+ */
+void net_tx_pkt_setup_vlan_header(struct NetTxPkt *pkt, uint16_t vlan);
+
+/**
+ * add a data fragment to the packet context.
+ *
+ * @pkt:            packet
+ * @pa:             physical address of fragment
+ * @len:            length of fragment
+ *
+ */
+bool net_tx_pkt_add_raw_fragment(struct NetTxPkt *pkt, hwaddr pa,
+    size_t len);
+
+/**
+ * fix IP header fields and calculate the needed checksums.
+ *
+ * @pkt:            packet
+ *
+ */
+void net_tx_pkt_update_ip_checksums(struct NetTxPkt *pkt);
+
+/**
+ * get length of all populated data.
+ *
+ * @pkt:            packet
+ * @ret:            total data length
+ *
+ */
+size_t net_tx_pkt_get_total_len(struct NetTxPkt *pkt);
+
+/**
+ * get packet type
+ *
+ * @pkt:            packet
+ * @ret:            packet type
+ *
+ */
+eth_pkt_types_e net_tx_pkt_get_packet_type(struct NetTxPkt *pkt);
+
+/**
+ * prints packet data if debug is enabled
+ *
+ * @pkt:            packet
+ *
+ */
+void net_tx_pkt_dump(struct NetTxPkt *pkt);
+
+/**
+ * reset the tx packet private context (must be called between packets)
+ *
+ * @pkt:            packet
+ *
+ */
+void net_tx_pkt_reset(struct NetTxPkt *pkt);
+
+/**
+ * Send the packet to QEMU. Performs software offloads if vhdr is not supported.
+ *
+ * @pkt:            packet
+ * @nc:             NetClientState
+ * @ret:            operation result
+ *
+ */
+bool net_tx_pkt_send(struct NetTxPkt *pkt, NetClientState *nc);
+
+/**
+ * parse raw packet data and analyze offload requirements.
+ *
+ * @pkt:            packet
+ *
+ */
+bool net_tx_pkt_parse(struct NetTxPkt *pkt);
+
+#endif
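To illustrate the send path declared above: when the peer has no virtio-header support and GSO is requested, net_tx_pkt_send() falls back to software fragmentation, emitting the payload in chunks of at most gso_size bytes with the more-fragments flag set on all but the last fragment. A standalone sketch of that offset/length bookkeeping (hypothetical names, no QEMU dependencies):

```c
#include <assert.h>
#include <stdbool.h>
#include <stddef.h>

/* One planned fragment: where it starts in the payload, how long it
 * is, and whether further fragments follow (the IPv4 MF flag). */
typedef struct {
    size_t offset;
    size_t len;
    bool more_frags;
} Frag;

/* Split payload_len bytes into chunks of at most gso_size, mirroring
 * the do/while loop in net_tx_pkt_do_sw_fragmentation(). Returns the
 * number of fragments written, capped at max_frags. */
static size_t split_payload(size_t payload_len, size_t gso_size,
                            Frag *out, size_t max_frags)
{
    size_t off = 0, n = 0;

    while (off < payload_len && n < max_frags) {
        size_t len = payload_len - off;

        if (len > gso_size) {
            len = gso_size;
        }
        out[n].offset = off;
        out[n].len = len;
        off += len;
        out[n].more_frags = (off < payload_len);
        n++;
    }
    return n;
}
```

The real code additionally prepends the shared L2/L3 headers to each fragment and refreshes the IPv4 length, fragment-offset field, and header checksum per fragment via eth_setup_ip4_fragmentation() and eth_fix_ip4_checksum().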
diff --git a/hw/net/vmxnet3.c b/hw/net/vmxnet3.c
index dfb328d..388ec10 100644
--- a/hw/net/vmxnet3.c
+++ b/hw/net/vmxnet3.c
@@ -29,8 +29,8 @@
 #include "vmxnet3.h"
 #include "vmxnet_debug.h"
 #include "vmware_utils.h"
-#include "vmxnet_tx_pkt.h"
-#include "vmxnet_rx_pkt.h"
+#include "net_tx_pkt.h"
+#include "net_rx_pkt.h"
 
 #define PCI_DEVICE_ID_VMWARE_VMXNET3_REVISION 0x1
 #define VMXNET3_MSIX_BAR_SIZE 0x2000
@@ -287,13 +287,13 @@ typedef struct {
         bool peer_has_vhdr;
 
         /* TX packets to QEMU interface */
-        struct VmxnetTxPkt *tx_pkt;
+        struct NetTxPkt *tx_pkt;
         uint32_t offload_mode;
         uint32_t cso_or_gso_size;
         uint16_t tci;
         bool needs_vlan;
 
-        struct VmxnetRxPkt *rx_pkt;
+        struct NetRxPkt *rx_pkt;
 
         bool tx_sop;
         bool skip_current_tx_pkt;
@@ -516,18 +516,18 @@ vmxnet3_setup_tx_offloads(VMXNET3State *s)
 {
     switch (s->offload_mode) {
     case VMXNET3_OM_NONE:
-        vmxnet_tx_pkt_build_vheader(s->tx_pkt, false, false, 0);
+        net_tx_pkt_build_vheader(s->tx_pkt, false, false, 0);
         break;
 
     case VMXNET3_OM_CSUM:
-        vmxnet_tx_pkt_build_vheader(s->tx_pkt, false, true, 0);
+        net_tx_pkt_build_vheader(s->tx_pkt, false, true, 0);
         VMW_PKPRN("L4 CSO requested\n");
         break;
 
     case VMXNET3_OM_TSO:
-        vmxnet_tx_pkt_build_vheader(s->tx_pkt, true, true,
+        net_tx_pkt_build_vheader(s->tx_pkt, true, true,
             s->cso_or_gso_size);
-        vmxnet_tx_pkt_update_ip_checksums(s->tx_pkt);
+        net_tx_pkt_update_ip_checksums(s->tx_pkt);
         VMW_PKPRN("GSO offload requested.");
         break;
 
@@ -560,12 +560,12 @@ static void
 vmxnet3_on_tx_done_update_stats(VMXNET3State *s, int qidx,
     Vmxnet3PktStatus status)
 {
-    size_t tot_len = vmxnet_tx_pkt_get_total_len(s->tx_pkt);
+    size_t tot_len = net_tx_pkt_get_total_len(s->tx_pkt);
     struct UPT1_TxStats *stats = &s->txq_descr[qidx].txq_stats;
 
     switch (status) {
     case VMXNET3_PKT_STATUS_OK:
-        switch (vmxnet_tx_pkt_get_packet_type(s->tx_pkt)) {
+        switch (net_tx_pkt_get_packet_type(s->tx_pkt)) {
         case ETH_PKT_BCAST:
             stats->bcastPktsTxOK++;
             stats->bcastBytesTxOK += tot_len;
@@ -613,7 +613,7 @@ vmxnet3_on_rx_done_update_stats(VMXNET3State *s,
                                 Vmxnet3PktStatus status)
 {
     struct UPT1_RxStats *stats = &s->rxq_descr[qidx].rxq_stats;
-    size_t tot_len = vmxnet_rx_pkt_get_total_len(s->rx_pkt);
+    size_t tot_len = net_rx_pkt_get_total_len(s->rx_pkt);
 
     switch (status) {
     case VMXNET3_PKT_STATUS_OUT_OF_BUF:
@@ -624,7 +624,7 @@ vmxnet3_on_rx_done_update_stats(VMXNET3State *s,
         stats->pktsRxError++;
         break;
     case VMXNET3_PKT_STATUS_OK:
-        switch (vmxnet_rx_pkt_get_packet_type(s->rx_pkt)) {
+        switch (net_rx_pkt_get_packet_type(s->rx_pkt)) {
         case ETH_PKT_BCAST:
             stats->bcastPktsRxOK++;
             stats->bcastBytesRxOK += tot_len;
@@ -685,10 +685,10 @@ vmxnet3_send_packet(VMXNET3State *s, uint32_t qidx)
     }
 
     /* debug prints */
-    vmxnet3_dump_virt_hdr(vmxnet_tx_pkt_get_vhdr(s->tx_pkt));
-    vmxnet_tx_pkt_dump(s->tx_pkt);
+    vmxnet3_dump_virt_hdr(net_tx_pkt_get_vhdr(s->tx_pkt));
+    net_tx_pkt_dump(s->tx_pkt);
 
-    if (!vmxnet_tx_pkt_send(s->tx_pkt, qemu_get_queue(s->nic))) {
+    if (!net_tx_pkt_send(s->tx_pkt, qemu_get_queue(s->nic))) {
         status = VMXNET3_PKT_STATUS_DISCARD;
         goto func_exit;
     }
@@ -716,7 +716,7 @@ static void vmxnet3_process_tx_queue(VMXNET3State *s, int qidx)
             data_len = (txd.len > 0) ? txd.len : VMXNET3_MAX_TX_BUF_SIZE;
             data_pa = le64_to_cpu(txd.addr);
 
-            if (!vmxnet_tx_pkt_add_raw_fragment(s->tx_pkt,
+            if (!net_tx_pkt_add_raw_fragment(s->tx_pkt,
                                                 data_pa,
                                                 data_len)) {
                 s->skip_current_tx_pkt = true;
@@ -730,10 +730,10 @@ static void vmxnet3_process_tx_queue(VMXNET3State *s, int qidx)
 
         if (txd.eop) {
             if (!s->skip_current_tx_pkt) {
-                vmxnet_tx_pkt_parse(s->tx_pkt);
+                net_tx_pkt_parse(s->tx_pkt);
 
                 if (s->needs_vlan) {
-                    vmxnet_tx_pkt_setup_vlan_header(s->tx_pkt, s->tci);
+                    net_tx_pkt_setup_vlan_header(s->tx_pkt, s->tci);
                 }
 
                 vmxnet3_send_packet(s, qidx);
@@ -745,7 +745,7 @@ static void vmxnet3_process_tx_queue(VMXNET3State *s, int qidx)
             vmxnet3_complete_packet(s, qidx, txd_idx);
             s->tx_sop = true;
             s->skip_current_tx_pkt = false;
-            vmxnet_tx_pkt_reset(s->tx_pkt);
+            net_tx_pkt_reset(s->tx_pkt);
         }
     }
 }
@@ -885,7 +885,7 @@ vmxnet3_get_next_rx_descr(VMXNET3State *s, bool is_head,
     }
 }
 
-static void vmxnet3_rx_update_descr(struct VmxnetRxPkt *pkt,
+static void vmxnet3_rx_update_descr(struct NetRxPkt *pkt,
     struct Vmxnet3_RxCompDesc *rxcd)
 {
     int csum_ok, is_gso;
@@ -893,16 +893,16 @@ static void vmxnet3_rx_update_descr(struct VmxnetRxPkt *pkt,
     struct virtio_net_hdr *vhdr;
     uint8_t offload_type;
 
-    if (vmxnet_rx_pkt_is_vlan_stripped(pkt)) {
+    if (net_rx_pkt_is_vlan_stripped(pkt)) {
         rxcd->ts = 1;
-        rxcd->tci = vmxnet_rx_pkt_get_vlan_tag(pkt);
+        rxcd->tci = net_rx_pkt_get_vlan_tag(pkt);
     }
 
-    if (!vmxnet_rx_pkt_has_virt_hdr(pkt)) {
+    if (!net_rx_pkt_has_virt_hdr(pkt)) {
         goto nocsum;
     }
 
-    vhdr = vmxnet_rx_pkt_get_vhdr(pkt);
+    vhdr = net_rx_pkt_get_vhdr(pkt);
     /*
      * Checksum is valid when lower level tell so or when lower level
      * requires checksum offload telling that packet produced/bridged
@@ -919,7 +919,7 @@ static void vmxnet3_rx_update_descr(struct VmxnetRxPkt *pkt,
         goto nocsum;
     }
 
-    vmxnet_rx_pkt_get_protocols(pkt, &isip4, &isip6, &isudp, &istcp);
+    net_rx_pkt_get_protocols(pkt, &isip4, &isip6, &isudp, &istcp);
     if ((!istcp && !isudp) || (!isip4 && !isip6)) {
         goto nocsum;
     }
@@ -978,13 +978,13 @@ vmxnet3_indicate_packet(VMXNET3State *s)
     uint32_t new_rxcd_gen = VMXNET3_INIT_GEN;
     hwaddr new_rxcd_pa = 0;
     hwaddr ready_rxcd_pa = 0;
-    struct iovec *data = vmxnet_rx_pkt_get_iovec(s->rx_pkt);
+    struct iovec *data = net_rx_pkt_get_iovec(s->rx_pkt);
     size_t bytes_copied = 0;
-    size_t bytes_left = vmxnet_rx_pkt_get_total_len(s->rx_pkt);
+    size_t bytes_left = net_rx_pkt_get_total_len(s->rx_pkt);
     uint16_t num_frags = 0;
     size_t chunk_size;
 
-    vmxnet_rx_pkt_dump(s->rx_pkt);
+    net_rx_pkt_dump(s->rx_pkt);
 
     while (bytes_left > 0) {
 
@@ -1145,7 +1145,7 @@ static void vmxnet3_reset(VMXNET3State *s)
 
     vmxnet3_deactivate_device(s);
     vmxnet3_reset_interrupt_states(s);
-    vmxnet_tx_pkt_reset(s->tx_pkt);
+    net_tx_pkt_reset(s->tx_pkt);
     s->drv_shmem = 0;
     s->tx_sop = true;
     s->skip_current_tx_pkt = false;
@@ -1455,8 +1455,8 @@ static void vmxnet3_activate_device(VMXNET3State *s)
 
     /* Preallocate TX packet wrapper */
     VMW_CFPRN("Max TX fragments is %u", s->max_tx_frags);
-    vmxnet_tx_pkt_init(&s->tx_pkt, s->max_tx_frags, s->peer_has_vhdr);
-    vmxnet_rx_pkt_init(&s->rx_pkt, s->peer_has_vhdr);
+    net_tx_pkt_init(&s->tx_pkt, s->max_tx_frags, s->peer_has_vhdr);
+    net_rx_pkt_init(&s->rx_pkt, s->peer_has_vhdr);
 
     /* Read rings memory locations for RX queues */
     for (i = 0; i < s->rxq_num; i++) {
@@ -1832,7 +1832,7 @@ vmxnet3_rx_filter_may_indicate(VMXNET3State *s, const void *data,
         return false;
     }
 
-    switch (vmxnet_rx_pkt_get_packet_type(s->rx_pkt)) {
+    switch (net_rx_pkt_get_packet_type(s->rx_pkt)) {
     case ETH_PKT_UCAST:
         if (!VMXNET_FLAG_IS_SET(s->rx_mode, VMXNET3_RXM_UCAST)) {
             return false;
@@ -1888,16 +1888,16 @@ vmxnet3_receive(NetClientState *nc, const uint8_t *buf, size_t size)
     }
 
     if (s->peer_has_vhdr) {
-        vmxnet_rx_pkt_set_vhdr(s->rx_pkt, (struct virtio_net_hdr *)buf);
+        net_rx_pkt_set_vhdr(s->rx_pkt, (struct virtio_net_hdr *)buf);
         buf += sizeof(struct virtio_net_hdr);
         size -= sizeof(struct virtio_net_hdr);
     }
 
-    vmxnet_rx_pkt_set_packet_type(s->rx_pkt,
+    net_rx_pkt_set_packet_type(s->rx_pkt,
         get_eth_packet_type(PKT_GET_ETH_HDR(buf)));
 
     if (vmxnet3_rx_filter_may_indicate(s, buf, size)) {
-        vmxnet_rx_pkt_attach_data(s->rx_pkt, buf, size, s->rx_vlan_stripping);
+        net_rx_pkt_attach_data(s->rx_pkt, buf, size, s->rx_vlan_stripping);
         bytes_indicated = vmxnet3_indicate_packet(s) ? size : -1;
         if (bytes_indicated < size) {
             VMW_PKPRN("RX: %lu of %lu bytes indicated", bytes_indicated, size);
@@ -1949,9 +1949,9 @@ static bool vmxnet3_peer_has_vnet_hdr(VMXNET3State *s)
 static void vmxnet3_net_uninit(VMXNET3State *s)
 {
     g_free(s->mcast_list);
-    vmxnet_tx_pkt_reset(s->tx_pkt);
-    vmxnet_tx_pkt_uninit(s->tx_pkt);
-    vmxnet_rx_pkt_uninit(s->rx_pkt);
+    net_tx_pkt_reset(s->tx_pkt);
+    net_tx_pkt_uninit(s->tx_pkt);
+    net_rx_pkt_uninit(s->rx_pkt);
     qemu_del_nic(s->nic);
 }
 
@@ -2380,8 +2380,8 @@ static int vmxnet3_post_load(void *opaque, int version_id)
     VMXNET3State *s = opaque;
     PCIDevice *d = PCI_DEVICE(s);
 
-    vmxnet_tx_pkt_init(&s->tx_pkt, s->max_tx_frags, s->peer_has_vhdr);
-    vmxnet_rx_pkt_init(&s->rx_pkt, s->peer_has_vhdr);
+    net_tx_pkt_init(&s->tx_pkt, s->max_tx_frags, s->peer_has_vhdr);
+    net_rx_pkt_init(&s->rx_pkt, s->peer_has_vhdr);
 
     if (s->msix_used) {
         if  (!vmxnet3_use_msix_vectors(s, VMXNET3_MAX_INTRS)) {
diff --git a/hw/net/vmxnet_rx_pkt.c b/hw/net/vmxnet_rx_pkt.c
deleted file mode 100644
index a40e346..0000000
--- a/hw/net/vmxnet_rx_pkt.c
+++ /dev/null
@@ -1,187 +0,0 @@
-/*
- * QEMU VMWARE VMXNET* paravirtual NICs - RX packets abstractions
- *
- * Copyright (c) 2012 Ravello Systems LTD (http://ravellosystems.com)
- *
- * Developed by Daynix Computing LTD (http://www.daynix.com)
- *
- * Authors:
- * Dmitry Fleytman <dmitry@daynix.com>
- * Tamir Shomer <tamirs@daynix.com>
- * Yan Vugenfirer <yan@daynix.com>
- *
- * This work is licensed under the terms of the GNU GPL, version 2 or later.
- * See the COPYING file in the top-level directory.
- *
- */
-
-#include "vmxnet_rx_pkt.h"
-#include "net/eth.h"
-#include "qemu-common.h"
-#include "qemu/iov.h"
-#include "net/checksum.h"
-#include "net/tap.h"
-
-/*
- * RX packet may contain up to 2 fragments - rebuilt eth header
- * in case of VLAN tag stripping
- * and payload received from QEMU - in any case
- */
-#define VMXNET_MAX_RX_PACKET_FRAGMENTS (2)
-
-struct VmxnetRxPkt {
-    struct virtio_net_hdr virt_hdr;
-    uint8_t ehdr_buf[ETH_MAX_L2_HDR_LEN];
-    struct iovec vec[VMXNET_MAX_RX_PACKET_FRAGMENTS];
-    uint16_t vec_len;
-    uint32_t tot_len;
-    uint16_t tci;
-    bool vlan_stripped;
-    bool has_virt_hdr;
-    eth_pkt_types_e packet_type;
-
-    /* Analysis results */
-    bool isip4;
-    bool isip6;
-    bool isudp;
-    bool istcp;
-};
-
-void vmxnet_rx_pkt_init(struct VmxnetRxPkt **pkt, bool has_virt_hdr)
-{
-    struct VmxnetRxPkt *p = g_malloc0(sizeof *p);
-    p->has_virt_hdr = has_virt_hdr;
-    *pkt = p;
-}
-
-void vmxnet_rx_pkt_uninit(struct VmxnetRxPkt *pkt)
-{
-    g_free(pkt);
-}
-
-struct virtio_net_hdr *vmxnet_rx_pkt_get_vhdr(struct VmxnetRxPkt *pkt)
-{
-    assert(pkt);
-    return &pkt->virt_hdr;
-}
-
-void vmxnet_rx_pkt_attach_data(struct VmxnetRxPkt *pkt, const void *data,
-                               size_t len, bool strip_vlan)
-{
-    uint16_t tci = 0;
-    uint16_t ploff;
-    assert(pkt);
-    pkt->vlan_stripped = false;
-
-    if (strip_vlan) {
-        pkt->vlan_stripped = eth_strip_vlan(data, pkt->ehdr_buf, &ploff, &tci);
-    }
-
-    if (pkt->vlan_stripped) {
-        pkt->vec[0].iov_base = pkt->ehdr_buf;
-        pkt->vec[0].iov_len = ploff - sizeof(struct vlan_header);
-        pkt->vec[1].iov_base = (uint8_t *) data + ploff;
-        pkt->vec[1].iov_len = len - ploff;
-        pkt->vec_len = 2;
-        pkt->tot_len = len - ploff + sizeof(struct eth_header);
-    } else {
-        pkt->vec[0].iov_base = (void *)data;
-        pkt->vec[0].iov_len = len;
-        pkt->vec_len = 1;
-        pkt->tot_len = len;
-    }
-
-    pkt->tci = tci;
-
-    eth_get_protocols(data, len, &pkt->isip4, &pkt->isip6,
-        &pkt->isudp, &pkt->istcp);
-}
-
-void vmxnet_rx_pkt_dump(struct VmxnetRxPkt *pkt)
-{
-#ifdef VMXNET_RX_PKT_DEBUG
-    VmxnetRxPkt *pkt = (VmxnetRxPkt *)pkt;
-    assert(pkt);
-
-    printf("RX PKT: tot_len: %d, vlan_stripped: %d, vlan_tag: %d\n",
-              pkt->tot_len, pkt->vlan_stripped, pkt->tci);
-#endif
-}
-
-void vmxnet_rx_pkt_set_packet_type(struct VmxnetRxPkt *pkt,
-    eth_pkt_types_e packet_type)
-{
-    assert(pkt);
-
-    pkt->packet_type = packet_type;
-
-}
-
-eth_pkt_types_e vmxnet_rx_pkt_get_packet_type(struct VmxnetRxPkt *pkt)
-{
-    assert(pkt);
-
-    return pkt->packet_type;
-}
-
-size_t vmxnet_rx_pkt_get_total_len(struct VmxnetRxPkt *pkt)
-{
-    assert(pkt);
-
-    return pkt->tot_len;
-}
-
-void vmxnet_rx_pkt_get_protocols(struct VmxnetRxPkt *pkt,
-                                 bool *isip4, bool *isip6,
-                                 bool *isudp, bool *istcp)
-{
-    assert(pkt);
-
-    *isip4 = pkt->isip4;
-    *isip6 = pkt->isip6;
-    *isudp = pkt->isudp;
-    *istcp = pkt->istcp;
-}
-
-struct iovec *vmxnet_rx_pkt_get_iovec(struct VmxnetRxPkt *pkt)
-{
-    assert(pkt);
-
-    return pkt->vec;
-}
-
-void vmxnet_rx_pkt_set_vhdr(struct VmxnetRxPkt *pkt,
-                            struct virtio_net_hdr *vhdr)
-{
-    assert(pkt);
-
-    memcpy(&pkt->virt_hdr, vhdr, sizeof pkt->virt_hdr);
-}
-
-bool vmxnet_rx_pkt_is_vlan_stripped(struct VmxnetRxPkt *pkt)
-{
-    assert(pkt);
-
-    return pkt->vlan_stripped;
-}
-
-bool vmxnet_rx_pkt_has_virt_hdr(struct VmxnetRxPkt *pkt)
-{
-    assert(pkt);
-
-    return pkt->has_virt_hdr;
-}
-
-uint16_t vmxnet_rx_pkt_get_num_frags(struct VmxnetRxPkt *pkt)
-{
-    assert(pkt);
-
-    return pkt->vec_len;
-}
-
-uint16_t vmxnet_rx_pkt_get_vlan_tag(struct VmxnetRxPkt *pkt)
-{
-    assert(pkt);
-
-    return pkt->tci;
-}
diff --git a/hw/net/vmxnet_rx_pkt.h b/hw/net/vmxnet_rx_pkt.h
deleted file mode 100644
index 6b2c60e..0000000
--- a/hw/net/vmxnet_rx_pkt.h
+++ /dev/null
@@ -1,174 +0,0 @@
-/*
- * QEMU VMWARE VMXNET* paravirtual NICs - RX packets abstraction
- *
- * Copyright (c) 2012 Ravello Systems LTD (http://ravellosystems.com)
- *
- * Developed by Daynix Computing LTD (http://www.daynix.com)
- *
- * Authors:
- * Dmitry Fleytman <dmitry@daynix.com>
- * Tamir Shomer <tamirs@daynix.com>
- * Yan Vugenfirer <yan@daynix.com>
- *
- * This work is licensed under the terms of the GNU GPL, version 2 or later.
- * See the COPYING file in the top-level directory.
- *
- */
-
-#ifndef VMXNET_RX_PKT_H
-#define VMXNET_RX_PKT_H
-
-#include "stdint.h"
-#include "stdbool.h"
-#include "net/eth.h"
-
-/* defines to enable packet dump functions */
-/*#define VMXNET_RX_PKT_DEBUG*/
-
-struct VmxnetRxPkt;
-
-/**
- * Clean all rx packet resources
- *
- * @pkt:            packet
- *
- */
-void vmxnet_rx_pkt_uninit(struct VmxnetRxPkt *pkt);
-
-/**
- * Init function for rx packet functionality
- *
- * @pkt:            packet pointer
- * @has_virt_hdr:   device uses virtio header
- *
- */
-void vmxnet_rx_pkt_init(struct VmxnetRxPkt **pkt, bool has_virt_hdr);
-
-/**
- * returns total length of data attached to rx context
- *
- * @pkt:            packet
- *
- * Return:  nothing
- *
- */
-size_t vmxnet_rx_pkt_get_total_len(struct VmxnetRxPkt *pkt);
-
-/**
- * fetches packet analysis results
- *
- * @pkt:            packet
- * @isip4:          whether the packet given is IPv4
- * @isip6:          whether the packet given is IPv6
- * @isudp:          whether the packet given is UDP
- * @istcp:          whether the packet given is TCP
- *
- */
-void vmxnet_rx_pkt_get_protocols(struct VmxnetRxPkt *pkt,
-                                 bool *isip4, bool *isip6,
-                                 bool *isudp, bool *istcp);
-
-/**
- * returns virtio header stored in rx context
- *
- * @pkt:            packet
- * @ret:            virtio header
- *
- */
-struct virtio_net_hdr *vmxnet_rx_pkt_get_vhdr(struct VmxnetRxPkt *pkt);
-
-/**
- * returns packet type
- *
- * @pkt:            packet
- * @ret:            packet type
- *
- */
-eth_pkt_types_e vmxnet_rx_pkt_get_packet_type(struct VmxnetRxPkt *pkt);
-
-/**
- * returns vlan tag
- *
- * @pkt:            packet
- * @ret:            VLAN tag
- *
- */
-uint16_t vmxnet_rx_pkt_get_vlan_tag(struct VmxnetRxPkt *pkt);
-
-/**
- * tells whether vlan was stripped from the packet
- *
- * @pkt:            packet
- * @ret:            VLAN stripped sign
- *
- */
-bool vmxnet_rx_pkt_is_vlan_stripped(struct VmxnetRxPkt *pkt);
-
-/**
- * notifies caller if the packet has virtio header
- *
- * @pkt:            packet
- * @ret:            true if packet has virtio header, false otherwize
- *
- */
-bool vmxnet_rx_pkt_has_virt_hdr(struct VmxnetRxPkt *pkt);
-
-/**
- * returns number of frags attached to the packet
- *
- * @pkt:            packet
- * @ret:            number of frags
- *
- */
-uint16_t vmxnet_rx_pkt_get_num_frags(struct VmxnetRxPkt *pkt);
-
-/**
- * attach data to rx packet
- *
- * @pkt:            packet
- * @data:           pointer to the data buffer
- * @len:            data length
- * @strip_vlan:     should the module strip vlan from data
- *
- */
-void vmxnet_rx_pkt_attach_data(struct VmxnetRxPkt *pkt, const void *data,
-    size_t len, bool strip_vlan);
-
-/**
- * returns io vector that holds the attached data
- *
- * @pkt:            packet
- * @ret:            pointer to IOVec
- *
- */
-struct iovec *vmxnet_rx_pkt_get_iovec(struct VmxnetRxPkt *pkt);
-
-/**
- * prints rx packet data if debug is enabled
- *
- * @pkt:            packet
- *
- */
-void vmxnet_rx_pkt_dump(struct VmxnetRxPkt *pkt);
-
-/**
- * copy passed vhdr data to packet context
- *
- * @pkt:            packet
- * @vhdr:           VHDR buffer
- *
- */
-void vmxnet_rx_pkt_set_vhdr(struct VmxnetRxPkt *pkt,
-    struct virtio_net_hdr *vhdr);
-
-/**
- * save packet type in packet context
- *
- * @pkt:            packet
- * @packet_type:    the packet type
- *
- */
-void vmxnet_rx_pkt_set_packet_type(struct VmxnetRxPkt *pkt,
-    eth_pkt_types_e packet_type);
-
-#endif
diff --git a/hw/net/vmxnet_tx_pkt.c b/hw/net/vmxnet_tx_pkt.c
deleted file mode 100644
index f7344c4..0000000
--- a/hw/net/vmxnet_tx_pkt.c
+++ /dev/null
@@ -1,567 +0,0 @@
-/*
- * QEMU VMWARE VMXNET* paravirtual NICs - TX packets abstractions
- *
- * Copyright (c) 2012 Ravello Systems LTD (http://ravellosystems.com)
- *
- * Developed by Daynix Computing LTD (http://www.daynix.com)
- *
- * Authors:
- * Dmitry Fleytman <dmitry@daynix.com>
- * Tamir Shomer <tamirs@daynix.com>
- * Yan Vugenfirer <yan@daynix.com>
- *
- * This work is licensed under the terms of the GNU GPL, version 2 or later.
- * See the COPYING file in the top-level directory.
- *
- */
-
-#include "hw/hw.h"
-#include "vmxnet_tx_pkt.h"
-#include "net/eth.h"
-#include "qemu-common.h"
-#include "qemu/iov.h"
-#include "net/checksum.h"
-#include "net/tap.h"
-#include "net/net.h"
-
-enum {
-    VMXNET_TX_PKT_VHDR_FRAG = 0,
-    VMXNET_TX_PKT_L2HDR_FRAG,
-    VMXNET_TX_PKT_L3HDR_FRAG,
-    VMXNET_TX_PKT_PL_START_FRAG
-};
-
-/* TX packet private context */
-struct VmxnetTxPkt {
-    struct virtio_net_hdr virt_hdr;
-    bool has_virt_hdr;
-
-    struct iovec *raw;
-    uint32_t raw_frags;
-    uint32_t max_raw_frags;
-
-    struct iovec *vec;
-
-    uint8_t l2_hdr[ETH_MAX_L2_HDR_LEN];
-
-    uint32_t payload_len;
-
-    uint32_t payload_frags;
-    uint32_t max_payload_frags;
-
-    uint16_t hdr_len;
-    eth_pkt_types_e packet_type;
-    uint8_t l4proto;
-};
-
-void vmxnet_tx_pkt_init(struct VmxnetTxPkt **pkt, uint32_t max_frags,
-    bool has_virt_hdr)
-{
-    struct VmxnetTxPkt *p = g_malloc0(sizeof *p);
-
-    p->vec = g_malloc((sizeof *p->vec) *
-        (max_frags + VMXNET_TX_PKT_PL_START_FRAG));
-
-    p->raw = g_malloc((sizeof *p->raw) * max_frags);
-
-    p->max_payload_frags = max_frags;
-    p->max_raw_frags = max_frags;
-    p->has_virt_hdr = has_virt_hdr;
-    p->vec[VMXNET_TX_PKT_VHDR_FRAG].iov_base = &p->virt_hdr;
-    p->vec[VMXNET_TX_PKT_VHDR_FRAG].iov_len =
-        p->has_virt_hdr ? sizeof p->virt_hdr : 0;
-    p->vec[VMXNET_TX_PKT_L2HDR_FRAG].iov_base = &p->l2_hdr;
-    p->vec[VMXNET_TX_PKT_L3HDR_FRAG].iov_base = NULL;
-    p->vec[VMXNET_TX_PKT_L3HDR_FRAG].iov_len = 0;
-
-    *pkt = p;
-}
-
-void vmxnet_tx_pkt_uninit(struct VmxnetTxPkt *pkt)
-{
-    if (pkt) {
-        g_free(pkt->vec);
-        g_free(pkt->raw);
-        g_free(pkt);
-    }
-}
-
-void vmxnet_tx_pkt_update_ip_checksums(struct VmxnetTxPkt *pkt)
-{
-    uint16_t csum;
-    uint32_t ph_raw_csum;
-    assert(pkt);
-    uint8_t gso_type = pkt->virt_hdr.gso_type & ~VIRTIO_NET_HDR_GSO_ECN;
-    struct ip_header *ip_hdr;
-
-    if (VIRTIO_NET_HDR_GSO_TCPV4 != gso_type &&
-        VIRTIO_NET_HDR_GSO_UDP != gso_type) {
-        return;
-    }
-
-    ip_hdr = pkt->vec[VMXNET_TX_PKT_L3HDR_FRAG].iov_base;
-
-    if (pkt->payload_len + pkt->vec[VMXNET_TX_PKT_L3HDR_FRAG].iov_len >
-        ETH_MAX_IP_DGRAM_LEN) {
-        return;
-    }
-
-    ip_hdr->ip_len = cpu_to_be16(pkt->payload_len +
-        pkt->vec[VMXNET_TX_PKT_L3HDR_FRAG].iov_len);
-
-    /* Calculate IP header checksum                    */
-    ip_hdr->ip_sum = 0;
-    csum = net_raw_checksum((uint8_t *)ip_hdr,
-        pkt->vec[VMXNET_TX_PKT_L3HDR_FRAG].iov_len);
-    ip_hdr->ip_sum = cpu_to_be16(csum);
-
-    /* Calculate IP pseudo header checksum             */
-    ph_raw_csum = eth_calc_pseudo_hdr_csum(ip_hdr, pkt->payload_len);
-    csum = cpu_to_be16(~net_checksum_finish(ph_raw_csum));
-    iov_from_buf(&pkt->vec[VMXNET_TX_PKT_PL_START_FRAG], pkt->payload_frags,
-                 pkt->virt_hdr.csum_offset, &csum, sizeof(csum));
-}
-
-static void vmxnet_tx_pkt_calculate_hdr_len(struct VmxnetTxPkt *pkt)
-{
-    pkt->hdr_len = pkt->vec[VMXNET_TX_PKT_L2HDR_FRAG].iov_len +
-        pkt->vec[VMXNET_TX_PKT_L3HDR_FRAG].iov_len;
-}
-
-static bool vmxnet_tx_pkt_parse_headers(struct VmxnetTxPkt *pkt)
-{
-    struct iovec *l2_hdr, *l3_hdr;
-    size_t bytes_read;
-    size_t full_ip6hdr_len;
-    uint16_t l3_proto;
-
-    assert(pkt);
-
-    l2_hdr = &pkt->vec[VMXNET_TX_PKT_L2HDR_FRAG];
-    l3_hdr = &pkt->vec[VMXNET_TX_PKT_L3HDR_FRAG];
-
-    bytes_read = iov_to_buf(pkt->raw, pkt->raw_frags, 0, l2_hdr->iov_base,
-                            ETH_MAX_L2_HDR_LEN);
-    if (bytes_read < ETH_MAX_L2_HDR_LEN) {
-        l2_hdr->iov_len = 0;
-        return false;
-    } else {
-        l2_hdr->iov_len = eth_get_l2_hdr_length(l2_hdr->iov_base);
-    }
-
-    l3_proto = eth_get_l3_proto(l2_hdr->iov_base, l2_hdr->iov_len);
-
-    switch (l3_proto) {
-    case ETH_P_IP:
-        l3_hdr->iov_base = g_malloc(ETH_MAX_IP4_HDR_LEN);
-
-        bytes_read = iov_to_buf(pkt->raw, pkt->raw_frags, l2_hdr->iov_len,
-                                l3_hdr->iov_base, sizeof(struct ip_header));
-
-        if (bytes_read < sizeof(struct ip_header)) {
-            l3_hdr->iov_len = 0;
-            return false;
-        }
-
-        l3_hdr->iov_len = IP_HDR_GET_LEN(l3_hdr->iov_base);
-        pkt->l4proto = ((struct ip_header *) l3_hdr->iov_base)->ip_p;
-
-        /* copy optional IPv4 header data */
-        bytes_read = iov_to_buf(pkt->raw, pkt->raw_frags,
-                                l2_hdr->iov_len + sizeof(struct ip_header),
-                                l3_hdr->iov_base + sizeof(struct ip_header),
-                                l3_hdr->iov_len - sizeof(struct ip_header));
-        if (bytes_read < l3_hdr->iov_len - sizeof(struct ip_header)) {
-            l3_hdr->iov_len = 0;
-            return false;
-        }
-        break;
-
-    case ETH_P_IPV6:
-        if (!eth_parse_ipv6_hdr(pkt->raw, pkt->raw_frags, l2_hdr->iov_len,
-                               &pkt->l4proto, &full_ip6hdr_len)) {
-            l3_hdr->iov_len = 0;
-            return false;
-        }
-
-        l3_hdr->iov_base = g_malloc(full_ip6hdr_len);
-
-        bytes_read = iov_to_buf(pkt->raw, pkt->raw_frags, l2_hdr->iov_len,
-                                l3_hdr->iov_base, full_ip6hdr_len);
-
-        if (bytes_read < full_ip6hdr_len) {
-            l3_hdr->iov_len = 0;
-            return false;
-        } else {
-            l3_hdr->iov_len = full_ip6hdr_len;
-        }
-        break;
-
-    default:
-        l3_hdr->iov_len = 0;
-        break;
-    }
-
-    vmxnet_tx_pkt_calculate_hdr_len(pkt);
-    pkt->packet_type = get_eth_packet_type(l2_hdr->iov_base);
-    return true;
-}
-
-static bool vmxnet_tx_pkt_rebuild_payload(struct VmxnetTxPkt *pkt)
-{
-    size_t payload_len = iov_size(pkt->raw, pkt->raw_frags) - pkt->hdr_len;
-
-    pkt->payload_frags = iov_copy(&pkt->vec[VMXNET_TX_PKT_PL_START_FRAG],
-                                pkt->max_payload_frags,
-                                pkt->raw, pkt->raw_frags,
-                                pkt->hdr_len, payload_len);
-
-    if (pkt->payload_frags != (uint32_t) -1) {
-        pkt->payload_len = payload_len;
-        return true;
-    } else {
-        return false;
-    }
-}
-
-bool vmxnet_tx_pkt_parse(struct VmxnetTxPkt *pkt)
-{
-    return vmxnet_tx_pkt_parse_headers(pkt) &&
-           vmxnet_tx_pkt_rebuild_payload(pkt);
-}
-
-struct virtio_net_hdr *vmxnet_tx_pkt_get_vhdr(struct VmxnetTxPkt *pkt)
-{
-    assert(pkt);
-    return &pkt->virt_hdr;
-}
-
-static uint8_t vmxnet_tx_pkt_get_gso_type(struct VmxnetTxPkt *pkt,
-                                          bool tso_enable)
-{
-    uint8_t rc = VIRTIO_NET_HDR_GSO_NONE;
-    uint16_t l3_proto;
-
-    l3_proto = eth_get_l3_proto(pkt->vec[VMXNET_TX_PKT_L2HDR_FRAG].iov_base,
-        pkt->vec[VMXNET_TX_PKT_L2HDR_FRAG].iov_len);
-
-    if (!tso_enable) {
-        goto func_exit;
-    }
-
-    rc = eth_get_gso_type(l3_proto, pkt->vec[VMXNET_TX_PKT_L3HDR_FRAG].iov_base,
-                          pkt->l4proto);
-
-func_exit:
-    return rc;
-}
-
-void vmxnet_tx_pkt_build_vheader(struct VmxnetTxPkt *pkt, bool tso_enable,
-    bool csum_enable, uint32_t gso_size)
-{
-    struct tcp_hdr l4hdr;
-    assert(pkt);
-
-    /* csum has to be enabled if tso is. */
-    assert(csum_enable || !tso_enable);
-
-    pkt->virt_hdr.gso_type = vmxnet_tx_pkt_get_gso_type(pkt, tso_enable);
-
-    switch (pkt->virt_hdr.gso_type & ~VIRTIO_NET_HDR_GSO_ECN) {
-    case VIRTIO_NET_HDR_GSO_NONE:
-        pkt->virt_hdr.hdr_len = 0;
-        pkt->virt_hdr.gso_size = 0;
-        break;
-
-    case VIRTIO_NET_HDR_GSO_UDP:
-        pkt->virt_hdr.gso_size = IP_FRAG_ALIGN_SIZE(gso_size);
-        pkt->virt_hdr.hdr_len = pkt->hdr_len + sizeof(struct udp_header);
-        break;
-
-    case VIRTIO_NET_HDR_GSO_TCPV4:
-    case VIRTIO_NET_HDR_GSO_TCPV6:
-        iov_to_buf(&pkt->vec[VMXNET_TX_PKT_PL_START_FRAG], pkt->payload_frags,
-                   0, &l4hdr, sizeof(l4hdr));
-        pkt->virt_hdr.hdr_len = pkt->hdr_len + l4hdr.th_off * sizeof(uint32_t);
-        pkt->virt_hdr.gso_size = IP_FRAG_ALIGN_SIZE(gso_size);
-        break;
-
-    default:
-        g_assert_not_reached();
-    }
-
-    if (csum_enable) {
-        switch (pkt->l4proto) {
-        case IP_PROTO_TCP:
-            pkt->virt_hdr.flags = VIRTIO_NET_HDR_F_NEEDS_CSUM;
-            pkt->virt_hdr.csum_start = pkt->hdr_len;
-            pkt->virt_hdr.csum_offset = offsetof(struct tcp_hdr, th_sum);
-            break;
-        case IP_PROTO_UDP:
-            pkt->virt_hdr.flags = VIRTIO_NET_HDR_F_NEEDS_CSUM;
-            pkt->virt_hdr.csum_start = pkt->hdr_len;
-            pkt->virt_hdr.csum_offset = offsetof(struct udp_hdr, uh_sum);
-            break;
-        default:
-            break;
-        }
-    }
-}
-
-void vmxnet_tx_pkt_setup_vlan_header(struct VmxnetTxPkt *pkt, uint16_t vlan)
-{
-    bool is_new;
-    assert(pkt);
-
-    eth_setup_vlan_headers(pkt->vec[VMXNET_TX_PKT_L2HDR_FRAG].iov_base,
-        vlan, &is_new);
-
-    /* update l2hdrlen */
-    if (is_new) {
-        pkt->hdr_len += sizeof(struct vlan_header);
-        pkt->vec[VMXNET_TX_PKT_L2HDR_FRAG].iov_len +=
-            sizeof(struct vlan_header);
-    }
-}
-
-bool vmxnet_tx_pkt_add_raw_fragment(struct VmxnetTxPkt *pkt, hwaddr pa,
-    size_t len)
-{
-    hwaddr mapped_len = 0;
-    struct iovec *ventry;
-    assert(pkt);
-    assert(pkt->max_raw_frags > pkt->raw_frags);
-
-    if (!len) {
-        return true;
-     }
-
-    ventry = &pkt->raw[pkt->raw_frags];
-    mapped_len = len;
-
-    ventry->iov_base = cpu_physical_memory_map(pa, &mapped_len, false);
-    ventry->iov_len = mapped_len;
-    pkt->raw_frags += !!ventry->iov_base;
-
-    if ((ventry->iov_base == NULL) || (len != mapped_len)) {
-        return false;
-    }
-
-    return true;
-}
-
-eth_pkt_types_e vmxnet_tx_pkt_get_packet_type(struct VmxnetTxPkt *pkt)
-{
-    assert(pkt);
-
-    return pkt->packet_type;
-}
-
-size_t vmxnet_tx_pkt_get_total_len(struct VmxnetTxPkt *pkt)
-{
-    assert(pkt);
-
-    return pkt->hdr_len + pkt->payload_len;
-}
-
-void vmxnet_tx_pkt_dump(struct VmxnetTxPkt *pkt)
-{
-#ifdef VMXNET_TX_PKT_DEBUG
-    assert(pkt);
-
-    printf("TX PKT: hdr_len: %d, pkt_type: 0x%X, l2hdr_len: %lu, "
-        "l3hdr_len: %lu, payload_len: %u\n", pkt->hdr_len, pkt->packet_type,
-        pkt->vec[VMXNET_TX_PKT_L2HDR_FRAG].iov_len,
-        pkt->vec[VMXNET_TX_PKT_L3HDR_FRAG].iov_len, pkt->payload_len);
-#endif
-}
-
-void vmxnet_tx_pkt_reset(struct VmxnetTxPkt *pkt)
-{
-    int i;
-
-    /* no assert, as reset can be called before tx_pkt_init */
-    if (!pkt) {
-        return;
-    }
-
-    memset(&pkt->virt_hdr, 0, sizeof(pkt->virt_hdr));
-
-    g_free(pkt->vec[VMXNET_TX_PKT_L3HDR_FRAG].iov_base);
-    pkt->vec[VMXNET_TX_PKT_L3HDR_FRAG].iov_base = NULL;
-
-    assert(pkt->vec);
-    for (i = VMXNET_TX_PKT_L2HDR_FRAG;
-         i < pkt->payload_frags + VMXNET_TX_PKT_PL_START_FRAG; i++) {
-        pkt->vec[i].iov_len = 0;
-    }
-    pkt->payload_len = 0;
-    pkt->payload_frags = 0;
-
-    assert(pkt->raw);
-    for (i = 0; i < pkt->raw_frags; i++) {
-        assert(pkt->raw[i].iov_base);
-        cpu_physical_memory_unmap(pkt->raw[i].iov_base, pkt->raw[i].iov_len,
-                                  false, pkt->raw[i].iov_len);
-        pkt->raw[i].iov_len = 0;
-    }
-    pkt->raw_frags = 0;
-
-    pkt->hdr_len = 0;
-    pkt->packet_type = 0;
-    pkt->l4proto = 0;
-}
-
-static void vmxnet_tx_pkt_do_sw_csum(struct VmxnetTxPkt *pkt)
-{
-    struct iovec *iov = &pkt->vec[VMXNET_TX_PKT_L2HDR_FRAG];
-    uint32_t csum_cntr;
-    uint16_t csum = 0;
-    /* num of iovec without vhdr */
-    uint32_t iov_len = pkt->payload_frags + VMXNET_TX_PKT_PL_START_FRAG - 1;
-    uint16_t csl;
-    struct ip_header *iphdr;
-    size_t csum_offset = pkt->virt_hdr.csum_start + pkt->virt_hdr.csum_offset;
-
-    /* Put zero to checksum field */
-    iov_from_buf(iov, iov_len, csum_offset, &csum, sizeof csum);
-
-    /* Calculate L4 TCP/UDP checksum */
-    csl = pkt->payload_len;
-
-    /* data checksum */
-    csum_cntr =
-        net_checksum_add_iov(iov, iov_len, pkt->virt_hdr.csum_start, csl);
-    /* add pseudo header to csum */
-    iphdr = pkt->vec[VMXNET_TX_PKT_L3HDR_FRAG].iov_base;
-    csum_cntr += eth_calc_pseudo_hdr_csum(iphdr, csl);
-
-    /* Put the checksum obtained into the packet */
-    csum = cpu_to_be16(net_checksum_finish(csum_cntr));
-    iov_from_buf(iov, iov_len, csum_offset, &csum, sizeof csum);
-}
-
-enum {
-    VMXNET_TX_PKT_FRAGMENT_L2_HDR_POS = 0,
-    VMXNET_TX_PKT_FRAGMENT_L3_HDR_POS,
-    VMXNET_TX_PKT_FRAGMENT_HEADER_NUM
-};
-
-#define VMXNET_MAX_FRAG_SG_LIST (64)
-
-static size_t vmxnet_tx_pkt_fetch_fragment(struct VmxnetTxPkt *pkt,
-    int *src_idx, size_t *src_offset, struct iovec *dst, int *dst_idx)
-{
-    size_t fetched = 0;
-    struct iovec *src = pkt->vec;
-
-    *dst_idx = VMXNET_TX_PKT_FRAGMENT_HEADER_NUM;
-
-    while (fetched < pkt->virt_hdr.gso_size) {
-
-        /* no more place in fragment iov */
-        if (*dst_idx == VMXNET_MAX_FRAG_SG_LIST) {
-            break;
-        }
-
-        /* no more data in iovec */
-        if (*src_idx == (pkt->payload_frags + VMXNET_TX_PKT_PL_START_FRAG)) {
-            break;
-        }
-
-
-        dst[*dst_idx].iov_base = src[*src_idx].iov_base + *src_offset;
-        dst[*dst_idx].iov_len = MIN(src[*src_idx].iov_len - *src_offset,
-            pkt->virt_hdr.gso_size - fetched);
-
-        *src_offset += dst[*dst_idx].iov_len;
-        fetched += dst[*dst_idx].iov_len;
-
-        if (*src_offset == src[*src_idx].iov_len) {
-            *src_offset = 0;
-            (*src_idx)++;
-        }
-
-        (*dst_idx)++;
-    }
-
-    return fetched;
-}
-
-static bool vmxnet_tx_pkt_do_sw_fragmentation(struct VmxnetTxPkt *pkt,
-    NetClientState *nc)
-{
-    struct iovec fragment[VMXNET_MAX_FRAG_SG_LIST];
-    size_t fragment_len = 0;
-    bool more_frags = false;
-
-    /* some pointers for shorter code */
-    void *l2_iov_base, *l3_iov_base;
-    size_t l2_iov_len, l3_iov_len;
-    int src_idx =  VMXNET_TX_PKT_PL_START_FRAG, dst_idx;
-    size_t src_offset = 0;
-    size_t fragment_offset = 0;
-
-    l2_iov_base = pkt->vec[VMXNET_TX_PKT_L2HDR_FRAG].iov_base;
-    l2_iov_len = pkt->vec[VMXNET_TX_PKT_L2HDR_FRAG].iov_len;
-    l3_iov_base = pkt->vec[VMXNET_TX_PKT_L3HDR_FRAG].iov_base;
-    l3_iov_len = pkt->vec[VMXNET_TX_PKT_L3HDR_FRAG].iov_len;
-
-    /* Copy headers */
-    fragment[VMXNET_TX_PKT_FRAGMENT_L2_HDR_POS].iov_base = l2_iov_base;
-    fragment[VMXNET_TX_PKT_FRAGMENT_L2_HDR_POS].iov_len = l2_iov_len;
-    fragment[VMXNET_TX_PKT_FRAGMENT_L3_HDR_POS].iov_base = l3_iov_base;
-    fragment[VMXNET_TX_PKT_FRAGMENT_L3_HDR_POS].iov_len = l3_iov_len;
-
-
-    /* Put as much data as possible and send */
-    do {
-        fragment_len = vmxnet_tx_pkt_fetch_fragment(pkt, &src_idx, &src_offset,
-            fragment, &dst_idx);
-
-        more_frags = (fragment_offset + fragment_len < pkt->payload_len);
-
-        eth_setup_ip4_fragmentation(l2_iov_base, l2_iov_len, l3_iov_base,
-            l3_iov_len, fragment_len, fragment_offset, more_frags);
-
-        eth_fix_ip4_checksum(l3_iov_base, l3_iov_len);
-
-        qemu_sendv_packet(nc, fragment, dst_idx);
-
-        fragment_offset += fragment_len;
-
-    } while (more_frags);
-
-    return true;
-}
-
-bool vmxnet_tx_pkt_send(struct VmxnetTxPkt *pkt, NetClientState *nc)
-{
-    assert(pkt);
-
-    if (!pkt->has_virt_hdr &&
-        pkt->virt_hdr.flags & VIRTIO_NET_HDR_F_NEEDS_CSUM) {
-        vmxnet_tx_pkt_do_sw_csum(pkt);
-    }
-
-    /*
-     * Since underlying infrastructure does not support IP datagrams longer
-     * than 64K we should drop such packets and don't even try to send
-     */
-    if (VIRTIO_NET_HDR_GSO_NONE != pkt->virt_hdr.gso_type) {
-        if (pkt->payload_len >
-            ETH_MAX_IP_DGRAM_LEN -
-            pkt->vec[VMXNET_TX_PKT_L3HDR_FRAG].iov_len) {
-            return false;
-        }
-    }
-
-    if (pkt->has_virt_hdr ||
-        pkt->virt_hdr.gso_type == VIRTIO_NET_HDR_GSO_NONE) {
-        qemu_sendv_packet(nc, pkt->vec,
-            pkt->payload_frags + VMXNET_TX_PKT_PL_START_FRAG);
-        return true;
-    }
-
-    return vmxnet_tx_pkt_do_sw_fragmentation(pkt, nc);
-}
diff --git a/hw/net/vmxnet_tx_pkt.h b/hw/net/vmxnet_tx_pkt.h
deleted file mode 100644
index 57121a6..0000000
--- a/hw/net/vmxnet_tx_pkt.h
+++ /dev/null
@@ -1,148 +0,0 @@
-/*
- * QEMU VMWARE VMXNET* paravirtual NICs - TX packets abstraction
- *
- * Copyright (c) 2012 Ravello Systems LTD (http://ravellosystems.com)
- *
- * Developed by Daynix Computing LTD (http://www.daynix.com)
- *
- * Authors:
- * Dmitry Fleytman <dmitry@daynix.com>
- * Tamir Shomer <tamirs@daynix.com>
- * Yan Vugenfirer <yan@daynix.com>
- *
- * This work is licensed under the terms of the GNU GPL, version 2 or later.
- * See the COPYING file in the top-level directory.
- *
- */
-
-#ifndef VMXNET_TX_PKT_H
-#define VMXNET_TX_PKT_H
-
-#include "stdint.h"
-#include "stdbool.h"
-#include "net/eth.h"
-#include "exec/hwaddr.h"
-
-/* define to enable packet dump functions */
-/*#define VMXNET_TX_PKT_DEBUG*/
-
-struct VmxnetTxPkt;
-
-/**
- * Init function for tx packet functionality
- *
- * @pkt:            packet pointer
- * @max_frags:      max tx ip fragments
- * @has_virt_hdr:   device uses virtio header.
- */
-void vmxnet_tx_pkt_init(struct VmxnetTxPkt **pkt, uint32_t max_frags,
-    bool has_virt_hdr);
-
-/**
- * Clean all tx packet resources.
- *
- * @pkt:            packet.
- */
-void vmxnet_tx_pkt_uninit(struct VmxnetTxPkt *pkt);
-
-/**
- * get virtio header
- *
- * @pkt:            packet
- * @ret:            virtio header
- */
-struct virtio_net_hdr *vmxnet_tx_pkt_get_vhdr(struct VmxnetTxPkt *pkt);
-
-/**
- * build virtio header (will be stored in module context)
- *
- * @pkt:            packet
- * @tso_enable:     TSO enabled
- * @csum_enable:    CSO enabled
- * @gso_size:       MSS size for TSO
- *
- */
-void vmxnet_tx_pkt_build_vheader(struct VmxnetTxPkt *pkt, bool tso_enable,
-    bool csum_enable, uint32_t gso_size);
-
-/**
- * updates vlan tag, and adds vlan header in case it is missing
- *
- * @pkt:            packet
- * @vlan:           VLAN tag
- *
- */
-void vmxnet_tx_pkt_setup_vlan_header(struct VmxnetTxPkt *pkt, uint16_t vlan);
-
-/**
- * populate data fragment into pkt context.
- *
- * @pkt:            packet
- * @pa:             physical address of fragment
- * @len:            length of fragment
- *
- */
-bool vmxnet_tx_pkt_add_raw_fragment(struct VmxnetTxPkt *pkt, hwaddr pa,
-    size_t len);
-
-/**
- * fix ip header fields and calculate checksums needed.
- *
- * @pkt:            packet
- *
- */
-void vmxnet_tx_pkt_update_ip_checksums(struct VmxnetTxPkt *pkt);
-
-/**
- * get length of all populated data.
- *
- * @pkt:            packet
- * @ret:            total data length
- *
- */
-size_t vmxnet_tx_pkt_get_total_len(struct VmxnetTxPkt *pkt);
-
-/**
- * get packet type
- *
- * @pkt:            packet
- * @ret:            packet type
- *
- */
-eth_pkt_types_e vmxnet_tx_pkt_get_packet_type(struct VmxnetTxPkt *pkt);
-
-/**
- * prints packet data if debug is enabled
- *
- * @pkt:            packet
- *
- */
-void vmxnet_tx_pkt_dump(struct VmxnetTxPkt *pkt);
-
-/**
- * reset tx packet private context (needed to be called between packets)
- *
- * @pkt:            packet
- *
- */
-void vmxnet_tx_pkt_reset(struct VmxnetTxPkt *pkt);
-
-/**
- * Send packet to qemu. handles sw offloads if vhdr is not supported.
- *
- * @pkt:            packet
- * @nc:             NetClientState
- * @ret:            operation result
- *
- */
-bool vmxnet_tx_pkt_send(struct VmxnetTxPkt *pkt, NetClientState *nc);
-
-/**
- * parse raw packet data and analyze offload requirements.
- *
- * @pkt:            packet
- *
- */
-bool vmxnet_tx_pkt_parse(struct VmxnetTxPkt *pkt);
-
-#endif
diff --git a/tests/Makefile b/tests/Makefile
index 55aa745..32d7c12 100644
--- a/tests/Makefile
+++ b/tests/Makefile
@@ -155,8 +155,8 @@ check-qtest-i386-y += $(check-qtest-pci-y)
 gcov-files-i386-y += $(gcov-files-pci-y)
 check-qtest-i386-y += tests/vmxnet3-test$(EXESUF)
 gcov-files-i386-y += hw/net/vmxnet3.c
-gcov-files-i386-y += hw/net/vmxnet_rx_pkt.c
-gcov-files-i386-y += hw/net/vmxnet_tx_pkt.c
+gcov-files-i386-y += hw/net/net_rx_pkt.c
+gcov-files-i386-y += hw/net/net_tx_pkt.c
 check-qtest-i386-y += tests/pvpanic-test$(EXESUF)
 gcov-files-i386-y += i386-softmmu/hw/misc/pvpanic.c
 check-qtest-i386-y += tests/i82801b11-test$(EXESUF)
-- 
2.4.3

* [Qemu-devel] [RFC PATCH 3/5] net_pkt: Extend packet abstraction as required by e1000e functionality
From: Leonid Bloch @ 2015-10-25 17:00 UTC (permalink / raw)
  To: qemu-devel; +Cc: Dmitry Fleytman, Jason Wang, Leonid Bloch, Shmulik Ladkani

From: Dmitry Fleytman <dmitry.fleytman@ravellosystems.com>

This patch extends the TX/RX packet abstractions with features that will
be used by the e1000e device implementation.

Changes are:

  1. Support iovec lists for RX buffers
  2. Deeper RX packets parsing
  3. Loopback option for TX packets
  4. Extended VLAN headers handling

Signed-off-by: Dmitry Fleytman <dmitry.fleytman@ravellosystems.com>
Signed-off-by: Leonid Bloch <leonid.bloch@ravellosystems.com>
---
 hw/net/net_rx_pkt.c | 138 ++++++++++++++++++++++++++++++++++++++---------
 hw/net/net_rx_pkt.h |  71 +++++++++++++++++++++++-
 hw/net/net_tx_pkt.c |  77 +++++++++++++++++++-------
 hw/net/net_tx_pkt.h |  59 +++++++++++++++++---
 include/net/eth.h   |  90 ++++++++++++++++---------------
 net/eth.c           | 152 +++++++++++++++++++++++++++++++++++++++++++---------
 6 files changed, 463 insertions(+), 124 deletions(-)

diff --git a/hw/net/net_rx_pkt.c b/hw/net/net_rx_pkt.c
index f4c929f..b349c19 100644
--- a/hw/net/net_rx_pkt.c
+++ b/hw/net/net_rx_pkt.c
@@ -22,17 +22,11 @@
 #include "net/checksum.h"
 #include "net/tap.h"
 
-/*
- * RX packet may contain up to 2 fragments - rebuilt eth header
- * in case of VLAN tag stripping
- * and payload received from QEMU - in any case
- */
-#define NET_MAX_RX_PACKET_FRAGMENTS (2)
-
 struct NetRxPkt {
     struct virtio_net_hdr virt_hdr;
     uint8_t ehdr_buf[ETH_MAX_L2_HDR_LEN];
-    struct iovec vec[NET_MAX_RX_PACKET_FRAGMENTS];
+    struct iovec *vec;
+    uint16_t vec_len_total;
     uint16_t vec_len;
     uint32_t tot_len;
     uint16_t tci;
@@ -45,17 +39,26 @@ struct NetRxPkt {
     bool isip6;
     bool isudp;
     bool istcp;
+
+    size_t l3hdr_off;
+    size_t l4hdr_off;
 };
 
 void net_rx_pkt_init(struct NetRxPkt **pkt, bool has_virt_hdr)
 {
     struct NetRxPkt *p = g_malloc0(sizeof *p);
     p->has_virt_hdr = has_virt_hdr;
+    p->vec = NULL;
+    p->vec_len_total = 0;
     *pkt = p;
 }
 
 void net_rx_pkt_uninit(struct NetRxPkt *pkt)
 {
+    if (pkt->vec_len_total != 0) {
+        g_free(pkt->vec);
+    }
+
     g_free(pkt);
 }
 
@@ -65,36 +68,83 @@ struct virtio_net_hdr *net_rx_pkt_get_vhdr(struct NetRxPkt *pkt)
     return &pkt->virt_hdr;
 }
 
-void net_rx_pkt_attach_data(struct NetRxPkt *pkt, const void *data,
-                               size_t len, bool strip_vlan)
+static void
+net_rx_pkt_iovec_realloc(struct NetRxPkt *pkt,
+                            int new_iov_len)
+{
+    if (pkt->vec_len_total < new_iov_len) {
+        g_free(pkt->vec);
+        pkt->vec = g_malloc(sizeof(*pkt->vec) * new_iov_len);
+        pkt->vec_len_total = new_iov_len;
+    }
+}
+
+static void
+net_rx_pkt_pull_data(struct NetRxPkt *pkt,
+                        const struct iovec *iov, int iovcnt,
+                        size_t ploff)
+{
+    if (pkt->vlan_stripped) {
+        net_rx_pkt_iovec_realloc(pkt, iovcnt + 1);
+
+        pkt->vec[0].iov_base = pkt->ehdr_buf;
+        pkt->vec[0].iov_len = ploff - sizeof(struct vlan_header);
+
+        pkt->tot_len =
+            iov_size(iov, iovcnt) - ploff + sizeof(struct eth_header);
+
+        pkt->vec_len = iov_copy(pkt->vec + 1, pkt->vec_len_total - 1,
+                                iov, iovcnt, ploff, pkt->tot_len);
+    } else {
+        net_rx_pkt_iovec_realloc(pkt, iovcnt);
+
+        pkt->tot_len = iov_size(iov, iovcnt) - ploff;
+        pkt->vec_len = iov_copy(pkt->vec, pkt->vec_len_total,
+                                iov, iovcnt, ploff, pkt->tot_len);
+    }
+
+    eth_get_protocols(pkt->vec, pkt->vec_len, &pkt->isip4, &pkt->isip6,
+        &pkt->isudp, &pkt->istcp, &pkt->l3hdr_off, &pkt->l4hdr_off);
+}
+
+void net_rx_pkt_attach_iovec(struct NetRxPkt *pkt,
+                                const struct iovec *iov, int iovcnt,
+                                size_t iovoff, bool strip_vlan)
 {
     uint16_t tci = 0;
-    uint16_t ploff;
+    uint16_t ploff = iovoff;
     assert(pkt);
     pkt->vlan_stripped = false;
 
     if (strip_vlan) {
-        pkt->vlan_stripped = eth_strip_vlan(data, pkt->ehdr_buf, &ploff, &tci);
+        pkt->vlan_stripped = eth_strip_vlan(iov, iovcnt, iovoff, pkt->ehdr_buf,
+                                            &ploff, &tci);
     }
 
-    if (pkt->vlan_stripped) {
-        pkt->vec[0].iov_base = pkt->ehdr_buf;
-        pkt->vec[0].iov_len = ploff - sizeof(struct vlan_header);
-        pkt->vec[1].iov_base = (uint8_t *) data + ploff;
-        pkt->vec[1].iov_len = len - ploff;
-        pkt->vec_len = 2;
-        pkt->tot_len = len - ploff + sizeof(struct eth_header);
-    } else {
-        pkt->vec[0].iov_base = (void *)data;
-        pkt->vec[0].iov_len = len;
-        pkt->vec_len = 1;
-        pkt->tot_len = len;
+    pkt->tci = tci;
+
+    net_rx_pkt_pull_data(pkt, iov, iovcnt, ploff);
+}
+
+void net_rx_pkt_attach_iovec_ex(struct NetRxPkt *pkt,
+                                   const struct iovec *iov, int iovcnt,
+                                   size_t iovoff, bool strip_vlan,
+                                   uint16_t vet)
+{
+    uint16_t tci = 0;
+    uint16_t ploff = iovoff;
+    assert(pkt);
+    pkt->vlan_stripped = false;
+
+    if (strip_vlan) {
+        pkt->vlan_stripped = eth_strip_vlan_ex(iov, iovcnt, iovoff, vet,
+                                               pkt->ehdr_buf,
+                                               &ploff, &tci);
     }
 
     pkt->tci = tci;
 
-    eth_get_protocols(data, len, &pkt->isip4, &pkt->isip6,
-        &pkt->isudp, &pkt->istcp);
+    net_rx_pkt_pull_data(pkt, iov, iovcnt, ploff);
 }
 
 void net_rx_pkt_dump(struct NetRxPkt *pkt)
@@ -143,6 +193,34 @@ void net_rx_pkt_get_protocols(struct NetRxPkt *pkt,
     *istcp = pkt->istcp;
 }
 
+uint16_t net_rx_pkt_get_ip_id(struct NetRxPkt *pkt)
+{
+    assert(pkt);
+
+    if (pkt->isip4) {
+        struct ip_header iphdr = { 0 };
+        iov_to_buf(pkt->vec, pkt->vec_len, pkt->l3hdr_off,
+                   &iphdr, sizeof(iphdr));
+        return be16_to_cpu(iphdr.ip_id);
+    }
+
+    return 0;
+}
+
+bool net_rx_pkt_is_tcp_ack(struct NetRxPkt *pkt)
+{
+    assert(pkt);
+
+    if (pkt->istcp) {
+        struct tcp_header tcphdr = { 0 };
+        iov_to_buf(pkt->vec, pkt->vec_len, pkt->l4hdr_off,
+                   &tcphdr, sizeof(tcphdr));
+        return TCP_HEADER_FLAGS(&tcphdr) & TCP_FLAG_ACK;
+    }
+
+    return false;
+}
+
 struct iovec *net_rx_pkt_get_iovec(struct NetRxPkt *pkt)
 {
     assert(pkt);
@@ -158,6 +236,14 @@ void net_rx_pkt_set_vhdr(struct NetRxPkt *pkt,
     memcpy(&pkt->virt_hdr, vhdr, sizeof pkt->virt_hdr);
 }
 
+void net_rx_pkt_set_vhdr_iovec(struct NetRxPkt *pkt,
+    const struct iovec *iov, int iovcnt)
+{
+    assert(pkt);
+
+    iov_to_buf(iov, iovcnt, 0, &pkt->virt_hdr, sizeof pkt->virt_hdr);
+}
+
 bool net_rx_pkt_is_vlan_stripped(struct NetRxPkt *pkt)
 {
     assert(pkt);
diff --git a/hw/net/net_rx_pkt.h b/hw/net/net_rx_pkt.h
index 1e4accf..27c045e 100644
--- a/hw/net/net_rx_pkt.h
+++ b/hw/net/net_rx_pkt.h
@@ -69,6 +69,22 @@ void net_rx_pkt_get_protocols(struct NetRxPkt *pkt,
                                  bool *isudp, bool *istcp);
 
 /**
+* fetches IP identification for the packet
+*
+* @pkt:            packet
+*
+*/
+uint16_t net_rx_pkt_get_ip_id(struct NetRxPkt *pkt);
+
+/**
+* check if given packet is a TCP ACK packet
+*
+* @pkt:            packet
+*
+*/
+bool net_rx_pkt_is_tcp_ack(struct NetRxPkt *pkt);
+
+/**
  * returns virtio header stored in rx context
  *
  * @pkt:            packet
@@ -123,6 +139,37 @@ bool net_rx_pkt_has_virt_hdr(struct NetRxPkt *pkt);
 uint16_t net_rx_pkt_get_num_frags(struct NetRxPkt *pkt);
 
 /**
+* attach scatter-gather data to rx packet
+*
+* @pkt:            packet
+* @iov:            received data scatter-gather list
+* @iovcnt          number of elements in iov
+* @iovoff          data start offset in the iov
+* @strip_vlan:     should the module strip vlan from data
+*
+*/
+void net_rx_pkt_attach_iovec(struct NetRxPkt *pkt,
+                                const struct iovec *iov,
+                                int iovcnt, size_t iovoff,
+                                bool strip_vlan);
+
+/**
+* attach scatter-gather data to rx packet
+*
+* @pkt:            packet
+* @iov:            received data scatter-gather list
+* @iovcnt          number of elements in iov
+* @iovoff          data start offset in the iov
+* @strip_vlan:     should the module strip vlan from data
+* @vet:            VLAN tag Ethernet type
+*
+*/
+void net_rx_pkt_attach_iovec_ex(struct NetRxPkt *pkt,
+                                   const struct iovec *iov, int iovcnt,
+                                   size_t iovoff, bool strip_vlan,
+                                   uint16_t vet);
+
+/**
  * attach data to rx packet
  *
  * @pkt:            packet
@@ -131,8 +178,17 @@ uint16_t net_rx_pkt_get_num_frags(struct NetRxPkt *pkt);
  * @strip_vlan:     should the module strip vlan from data
  *
  */
-void net_rx_pkt_attach_data(struct NetRxPkt *pkt, const void *data,
-    size_t len, bool strip_vlan);
+static inline void
+net_rx_pkt_attach_data(struct NetRxPkt *pkt, const void *data,
+                          size_t len, bool strip_vlan)
+{
+    const struct iovec iov = {
+        .iov_base = (void *) data,
+        .iov_len = len
+    };
+
+    net_rx_pkt_attach_iovec(pkt, &iov, 1, 0, strip_vlan);
+}
 
 /**
  * returns io vector that holds the attached data
@@ -162,6 +218,17 @@ void net_rx_pkt_set_vhdr(struct NetRxPkt *pkt,
     struct virtio_net_hdr *vhdr);
 
 /**
+* copy passed vhdr data to packet context
+*
+* @pkt:            packet
+* @iov:            VHDR iov
+* @iovcnt:         VHDR iov array size
+*
+*/
+void net_rx_pkt_set_vhdr_iovec(struct NetRxPkt *pkt,
+    const struct iovec *iov, int iovcnt);
+
+/**
  * save packet type in packet context
  *
  * @pkt:            packet
diff --git a/hw/net/net_tx_pkt.c b/hw/net/net_tx_pkt.c
index a2c1a76..aa1e2ef 100644
--- a/hw/net/net_tx_pkt.c
+++ b/hw/net/net_tx_pkt.c
@@ -52,6 +52,8 @@ struct NetTxPkt {
     uint16_t hdr_len;
     eth_pkt_types_e packet_type;
     uint8_t l4proto;
+
+    bool is_loopback;
 };
 
 void net_tx_pkt_init(struct NetTxPkt **pkt, uint32_t max_frags,
@@ -86,6 +88,22 @@ void net_tx_pkt_uninit(struct NetTxPkt *pkt)
     }
 }
 
+void net_tx_pkt_update_ip_hdr_checksum(struct NetTxPkt *pkt)
+{
+    uint16_t csum;
+    assert(pkt);
+    struct ip_header *ip_hdr;
+    ip_hdr = pkt->vec[NET_TX_PKT_L3HDR_FRAG].iov_base;
+
+    ip_hdr->ip_len = cpu_to_be16(pkt->payload_len +
+        pkt->vec[NET_TX_PKT_L3HDR_FRAG].iov_len);
+
+    ip_hdr->ip_sum = 0;
+    csum = net_raw_checksum((uint8_t *)ip_hdr,
+        pkt->vec[NET_TX_PKT_L3HDR_FRAG].iov_len);
+    ip_hdr->ip_sum = cpu_to_be16(csum);
+}
+
 void net_tx_pkt_update_ip_checksums(struct NetTxPkt *pkt)
 {
     uint16_t csum;
@@ -93,29 +111,22 @@ void net_tx_pkt_update_ip_checksums(struct NetTxPkt *pkt)
     assert(pkt);
     uint8_t gso_type = pkt->virt_hdr.gso_type & ~VIRTIO_NET_HDR_GSO_ECN;
     struct ip_header *ip_hdr;
+    ip_hdr = pkt->vec[NET_TX_PKT_L3HDR_FRAG].iov_base;
 
     if (VIRTIO_NET_HDR_GSO_TCPV4 != gso_type &&
         VIRTIO_NET_HDR_GSO_UDP != gso_type) {
         return;
     }
 
-    ip_hdr = pkt->vec[NET_TX_PKT_L3HDR_FRAG].iov_base;
-
     if (pkt->payload_len + pkt->vec[NET_TX_PKT_L3HDR_FRAG].iov_len >
         ETH_MAX_IP_DGRAM_LEN) {
         return;
     }
 
-    ip_hdr->ip_len = cpu_to_be16(pkt->payload_len +
-        pkt->vec[NET_TX_PKT_L3HDR_FRAG].iov_len);
-
-    /* Calculate IP header checksum                    */
-    ip_hdr->ip_sum = 0;
-    csum = net_raw_checksum((uint8_t *)ip_hdr,
-        pkt->vec[NET_TX_PKT_L3HDR_FRAG].iov_len);
-    ip_hdr->ip_sum = cpu_to_be16(csum);
+    /* Calculate IP header checksum */
+    net_tx_pkt_update_ip_hdr_checksum(pkt);
 
-    /* Calculate IP pseudo header checksum             */
+    /* Calculate IP pseudo header checksum */
     ph_raw_csum = eth_calc_pseudo_hdr_csum(ip_hdr, pkt->payload_len);
     csum = cpu_to_be16(~net_checksum_finish(ph_raw_csum));
     iov_from_buf(&pkt->vec[NET_TX_PKT_PL_START_FRAG], pkt->payload_frags,
@@ -146,10 +157,11 @@ static bool net_tx_pkt_parse_headers(struct NetTxPkt *pkt)
         l2_hdr->iov_len = 0;
         return false;
     } else {
-        l2_hdr->iov_len = eth_get_l2_hdr_length(l2_hdr->iov_base);
+        l2_hdr->iov_len = ETH_MAX_L2_HDR_LEN;
+        l2_hdr->iov_len = eth_get_l2_hdr_length(l2_hdr, 1);
     }
 
-    l3_proto = eth_get_l3_proto(l2_hdr->iov_base, l2_hdr->iov_len);
+    l3_proto = eth_get_l3_proto(l2_hdr, 1, l2_hdr->iov_len);
 
     switch (l3_proto) {
     case ETH_P_IP:
@@ -242,7 +254,7 @@ static uint8_t net_tx_pkt_get_gso_type(struct NetTxPkt *pkt,
     uint8_t rc = VIRTIO_NET_HDR_GSO_NONE;
     uint16_t l3_proto;
 
-    l3_proto = eth_get_l3_proto(pkt->vec[NET_TX_PKT_L2HDR_FRAG].iov_base,
+    l3_proto = eth_get_l3_proto(&pkt->vec[NET_TX_PKT_L2HDR_FRAG], 1,
         pkt->vec[NET_TX_PKT_L2HDR_FRAG].iov_len);
 
     if (!tso_enable) {
@@ -308,13 +320,14 @@ void net_tx_pkt_build_vheader(struct NetTxPkt *pkt, bool tso_enable,
     }
 }
 
-void net_tx_pkt_setup_vlan_header(struct NetTxPkt *pkt, uint16_t vlan)
+void net_tx_pkt_setup_vlan_header_ex(struct NetTxPkt *pkt,
+    uint16_t vlan, uint16_t vlan_ethtype)
 {
     bool is_new;
     assert(pkt);
 
-    eth_setup_vlan_headers(pkt->vec[NET_TX_PKT_L2HDR_FRAG].iov_base,
-        vlan, &is_new);
+    eth_setup_vlan_headers_ex(pkt->vec[NET_TX_PKT_L2HDR_FRAG].iov_base,
+        vlan, vlan_ethtype, &is_new);
 
     /* update l2hdrlen */
     if (is_new) {
@@ -350,6 +363,11 @@ bool net_tx_pkt_add_raw_fragment(struct NetTxPkt *pkt, hwaddr pa,
     return true;
 }
 
+bool net_tx_pkt_has_fragments(struct NetTxPkt *pkt)
+{
+    return pkt->raw_frags > 0;
+}
+
 eth_pkt_types_e net_tx_pkt_get_packet_type(struct NetTxPkt *pkt)
 {
     assert(pkt);
@@ -488,6 +506,16 @@ static size_t net_tx_pkt_fetch_fragment(struct NetTxPkt *pkt,
     return fetched;
 }
 
+static inline void net_tx_pkt_sendv(struct NetTxPkt *pkt,
+    NetClientState *nc, const struct iovec *iov, int iov_cnt)
+{
+    if (pkt->is_loopback) {
+        nc->info->receive_iov(nc, iov, iov_cnt);
+    } else {
+        qemu_sendv_packet(nc, iov, iov_cnt);
+    }
+}
+
 static bool net_tx_pkt_do_sw_fragmentation(struct NetTxPkt *pkt,
     NetClientState *nc)
 {
@@ -526,7 +554,7 @@ static bool net_tx_pkt_do_sw_fragmentation(struct NetTxPkt *pkt,
 
         eth_fix_ip4_checksum(l3_iov_base, l3_iov_len);
 
-        qemu_sendv_packet(nc, fragment, dst_idx);
+        net_tx_pkt_sendv(pkt, nc, fragment, dst_idx);
 
         fragment_offset += fragment_len;
 
@@ -558,10 +586,21 @@ bool net_tx_pkt_send(struct NetTxPkt *pkt, NetClientState *nc)
 
     if (pkt->has_virt_hdr ||
         pkt->virt_hdr.gso_type == VIRTIO_NET_HDR_GSO_NONE) {
-        qemu_sendv_packet(nc, pkt->vec,
+        net_tx_pkt_sendv(pkt, nc, pkt->vec,
             pkt->payload_frags + NET_TX_PKT_PL_START_FRAG);
         return true;
     }
 
     return net_tx_pkt_do_sw_fragmentation(pkt, nc);
 }
+
+bool net_tx_pkt_send_loopback(struct NetTxPkt *pkt, NetClientState *nc)
+{
+    bool res;
+
+    pkt->is_loopback = true;
+    res = net_tx_pkt_send(pkt, nc);
+    pkt->is_loopback = false;
+
+    return res;
+}
diff --git a/hw/net/net_tx_pkt.h b/hw/net/net_tx_pkt.h
index 73a67f8..cb5983e 100644
--- a/hw/net/net_tx_pkt.h
+++ b/hw/net/net_tx_pkt.h
@@ -66,13 +66,29 @@ void net_tx_pkt_build_vheader(struct NetTxPkt *pkt, bool tso_enable,
     bool csum_enable, uint32_t gso_size);
 
 /**
- * updates vlan tag, and adds vlan header in case it is missing
- *
- * @pkt:            packet
- * @vlan:           VLAN tag
- *
- */
-void net_tx_pkt_setup_vlan_header(struct NetTxPkt *pkt, uint16_t vlan);
+* updates vlan tag, and adds vlan header with custom ethernet type
+* in case it is missing.
+*
+* @pkt:            packet
+* @vlan:           VLAN tag
+* @vlan_ethtype:   VLAN header Ethernet type
+*
+*/
+void net_tx_pkt_setup_vlan_header_ex(struct NetTxPkt *pkt,
+    uint16_t vlan, uint16_t vlan_ethtype);
+
+/**
+* updates vlan tag, and adds vlan header in case it is missing
+*
+* @pkt:            packet
+* @vlan:           VLAN tag
+*
+*/
+static inline void
+net_tx_pkt_setup_vlan_header(struct NetTxPkt *pkt, uint16_t vlan)
+{
+    net_tx_pkt_setup_vlan_header_ex(pkt, vlan, ETH_P_VLAN);
+}
 
 /**
  * populate data fragment into pkt context.
@@ -86,7 +102,7 @@ bool net_tx_pkt_add_raw_fragment(struct NetTxPkt *pkt, hwaddr pa,
     size_t len);
 
 /**
- * fix ip header fields and calculate checksums needed.
+ * Fix ip header fields and calculate IP header and pseudo header checksums.
  *
  * @pkt:            packet
  *
@@ -94,6 +110,14 @@ bool net_tx_pkt_add_raw_fragment(struct NetTxPkt *pkt, hwaddr pa,
 void net_tx_pkt_update_ip_checksums(struct NetTxPkt *pkt);
 
 /**
+ * Calculate the IP header checksum.
+ *
+ * @pkt:            packet
+ *
+ */
+void net_tx_pkt_update_ip_hdr_checksum(struct NetTxPkt *pkt);
+
+/**
  * get length of all populated data.
  *
  * @pkt:            packet
@@ -138,6 +162,17 @@ void net_tx_pkt_reset(struct NetTxPkt *pkt);
 bool net_tx_pkt_send(struct NetTxPkt *pkt, NetClientState *nc);
 
 /**
+ * Redirect packet directly to receive path (emulate loopback phy).
+ * Handles sw offloads if vhdr is not supported.
+ *
+ * @pkt:            packet
+ * @nc:             NetClientState
+ * @ret:            operation result
+ *
+ */
+bool net_tx_pkt_send_loopback(struct NetTxPkt *pkt, NetClientState *nc);
+
+/**
  * parse raw packet data and analyze offload requirements.
  *
  * @pkt:            packet
@@ -145,4 +180,12 @@ bool net_tx_pkt_send(struct NetTxPkt *pkt, NetClientState *nc);
  */
 bool net_tx_pkt_parse(struct NetTxPkt *pkt);
 
+/**
+ * indicates if there are data fragments held by this packet object.
+ *
+ * @pkt:            packet
+ *
+ */
+bool net_tx_pkt_has_fragments(struct NetTxPkt *pkt);
+
 #endif
diff --git a/include/net/eth.h b/include/net/eth.h
index b3273b8..85d7b1e 100644
--- a/include/net/eth.h
+++ b/include/net/eth.h
@@ -68,6 +68,13 @@ typedef struct tcp_header {
     uint16_t th_urp;            /* urgent pointer */
 } tcp_header;
 
+#define TCP_FLAGS_ONLY(flags) ((flags) & 0x3f)
+
+#define TCP_HEADER_FLAGS(tcp) \
+    TCP_FLAGS_ONLY(be16_to_cpu((tcp)->th_offset_flags))
+
+#define TCP_FLAG_ACK  0x10
+
 typedef struct udp_header {
     uint16_t uh_sport; /* source port */
     uint16_t uh_dport; /* destination port */
@@ -162,18 +169,19 @@ struct tcp_hdr {
 #define PKT_GET_IP_HDR(p)         \
     ((struct ip_header *)(((uint8_t *)(p)) + eth_get_l2_hdr_length(p)))
 #define IP_HDR_GET_LEN(p)         \
-    ((((struct ip_header *)p)->ip_ver_len & 0x0F) << 2)
+    ((((struct ip_header *)(p))->ip_ver_len & 0x0F) << 2)
 #define PKT_GET_IP_HDR_LEN(p)     \
     (IP_HDR_GET_LEN(PKT_GET_IP_HDR(p)))
 #define PKT_GET_IP6_HDR(p)        \
     ((struct ip6_header *) (((uint8_t *)(p)) + eth_get_l2_hdr_length(p)))
 #define IP_HEADER_VERSION(ip)     \
-    ((ip->ip_ver_len >> 4)&0xf)
+    (((ip)->ip_ver_len >> 4) & 0xf)
 
 #define ETH_P_IP                  (0x0800)
 #define ETH_P_IPV6                (0x86dd)
 #define ETH_P_VLAN                (0x8100)
 #define ETH_P_DVLAN               (0x88a8)
+#define ETH_P_UNKNOWN             (0xffff)
 #define VLAN_VID_MASK             0x0fff
 #define IP_HEADER_VERSION_4       (4)
 #define IP_HEADER_VERSION_6       (6)
@@ -250,15 +258,25 @@ get_eth_packet_type(const struct eth_header *ehdr)
 }
 
 static inline uint32_t
-eth_get_l2_hdr_length(const void *p)
+eth_get_l2_hdr_length(const struct iovec *iov, int iovcnt)
 {
-    uint16_t proto = be16_to_cpu(PKT_GET_ETH_HDR(p)->h_proto);
-    struct vlan_header *hvlan = PKT_GET_VLAN_HDR(p);
+    uint8_t p[sizeof(struct eth_header) + sizeof(struct vlan_header)];
+    size_t copied = iov_to_buf(iov, iovcnt, 0, p, ARRAY_SIZE(p));
+    uint16_t proto;
+    struct vlan_header *hvlan;
+
+    if (copied < ARRAY_SIZE(p)) {
+        return copied;
+    }
+
+    proto = be16_to_cpu(PKT_GET_ETH_HDR(p)->h_proto);
+    hvlan = PKT_GET_VLAN_HDR(p);
+
     switch (proto) {
     case ETH_P_VLAN:
         return sizeof(struct eth_header) + sizeof(struct vlan_header);
     case ETH_P_DVLAN:
-        if (hvlan->h_proto == ETH_P_VLAN) {
+        if (be16_to_cpu(hvlan->h_proto) == ETH_P_VLAN) {
             return sizeof(struct eth_header) + 2 * sizeof(struct vlan_header);
         } else {
             return sizeof(struct eth_header) + sizeof(struct vlan_header);
@@ -282,51 +300,37 @@ eth_get_pkt_tci(const void *p)
     }
 }
 
-static inline bool
-eth_strip_vlan(const void *p, uint8_t *new_ehdr_buf,
-               uint16_t *payload_offset, uint16_t *tci)
-{
-    uint16_t proto = be16_to_cpu(PKT_GET_ETH_HDR(p)->h_proto);
-    struct vlan_header *hvlan = PKT_GET_VLAN_HDR(p);
-    struct eth_header *new_ehdr = (struct eth_header *) new_ehdr_buf;
+bool
+eth_strip_vlan(const struct iovec *iov, int iovcnt, size_t iovoff,
+               uint8_t *new_ehdr_buf,
+               uint16_t *payload_offset, uint16_t *tci);
 
-    switch (proto) {
-    case ETH_P_VLAN:
-    case ETH_P_DVLAN:
-        memcpy(new_ehdr->h_source, PKT_GET_ETH_HDR(p)->h_source, ETH_ALEN);
-        memcpy(new_ehdr->h_dest, PKT_GET_ETH_HDR(p)->h_dest, ETH_ALEN);
-        new_ehdr->h_proto = hvlan->h_proto;
-        *tci = be16_to_cpu(hvlan->h_tci);
-        *payload_offset =
-            sizeof(struct eth_header) + sizeof(struct vlan_header);
-        if (be16_to_cpu(new_ehdr->h_proto) == ETH_P_VLAN) {
-            memcpy(PKT_GET_VLAN_HDR(new_ehdr),
-                   PKT_GET_DVLAN_HDR(p),
-                   sizeof(struct vlan_header));
-            *payload_offset += sizeof(struct vlan_header);
-        }
-        return true;
-    default:
-        return false;
-    }
-}
+bool
+eth_strip_vlan_ex(const struct iovec *iov, int iovcnt, size_t iovoff,
+                  uint16_t vet, uint8_t *new_ehdr_buf,
+                  uint16_t *payload_offset, uint16_t *tci);
 
-static inline uint16_t
-eth_get_l3_proto(const void *l2hdr, size_t l2hdr_len)
+uint16_t
+eth_get_l3_proto(const struct iovec *l2hdr_iov, int iovcnt, size_t l2hdr_len);
+
+void eth_setup_vlan_headers_ex(struct eth_header *ehdr, uint16_t vlan_tag,
+    uint16_t vlan_ethtype, bool *is_new);
+
+static inline void
+eth_setup_vlan_headers(struct eth_header *ehdr, uint16_t vlan_tag,
+    bool *is_new)
 {
-    uint8_t *proto_ptr = (uint8_t *) l2hdr + l2hdr_len - sizeof(uint16_t);
-    return be16_to_cpup((uint16_t *)proto_ptr);
+    eth_setup_vlan_headers_ex(ehdr, vlan_tag, ETH_P_VLAN, is_new);
 }
 
-void eth_setup_vlan_headers(struct eth_header *ehdr, uint16_t vlan_tag,
-    bool *is_new);
 
 uint8_t eth_get_gso_type(uint16_t l3_proto, uint8_t *l3_hdr, uint8_t l4proto);
 
-void eth_get_protocols(const uint8_t *headers,
-                       uint32_t hdr_length,
+void eth_get_protocols(const struct iovec *iov, int iovcnt,
                        bool *isip4, bool *isip6,
-                       bool *isudp, bool *istcp);
+                       bool *isudp, bool *istcp,
+                       size_t *l3hdr_off,
+                       size_t *l4hdr_off);
 
 void eth_setup_ip4_fragmentation(const void *l2hdr, size_t l2hdr_len,
                                  void *l3hdr, size_t l3hdr_len,
@@ -340,7 +344,7 @@ uint32_t
 eth_calc_pseudo_hdr_csum(struct ip_header *iphdr, uint16_t csl);
 
 bool
-eth_parse_ipv6_hdr(struct iovec *pkt, int pkt_frags,
+eth_parse_ipv6_hdr(const struct iovec *pkt, int pkt_frags,
                    size_t ip6hdr_off, uint8_t *l4proto,
                    size_t *full_hdr_len);
 
diff --git a/net/eth.c b/net/eth.c
index 7c61132..87f30f6 100644
--- a/net/eth.c
+++ b/net/eth.c
@@ -20,8 +20,8 @@
 #include "qemu-common.h"
 #include "net/tap.h"
 
-void eth_setup_vlan_headers(struct eth_header *ehdr, uint16_t vlan_tag,
-    bool *is_new)
+void eth_setup_vlan_headers_ex(struct eth_header *ehdr, uint16_t vlan_tag,
+    uint16_t vlan_ethtype, bool *is_new)
 {
     struct vlan_header *vhdr = PKT_GET_VLAN_HDR(ehdr);
 
@@ -35,7 +35,7 @@ void eth_setup_vlan_headers(struct eth_header *ehdr, uint16_t vlan_tag,
     default:
         /* No VLAN header, put a new one */
         vhdr->h_proto = ehdr->h_proto;
-        ehdr->h_proto = cpu_to_be16(ETH_P_VLAN);
+        ehdr->h_proto = cpu_to_be16(vlan_ethtype);
         *is_new = true;
         break;
     }
@@ -78,61 +78,161 @@ eth_get_gso_type(uint16_t l3_proto, uint8_t *l3_hdr, uint8_t l4proto)
     return VIRTIO_NET_HDR_GSO_NONE | ecn_state;
 }
 
-void eth_get_protocols(const uint8_t *headers,
-                       uint32_t hdr_length,
+uint16_t
+eth_get_l3_proto(const struct iovec *l2hdr_iov, int iovcnt, size_t l2hdr_len)
+{
+    uint16_t proto;
+    size_t copied = iov_to_buf(l2hdr_iov, iovcnt, l2hdr_len - sizeof(proto),
+                               &proto, sizeof(proto));
+
+    return (copied == sizeof(proto)) ? be16_to_cpu(proto) : ETH_P_UNKNOWN;
+}
+
+void eth_get_protocols(const struct iovec *iov, int iovcnt,
                        bool *isip4, bool *isip6,
-                       bool *isudp, bool *istcp)
+                       bool *isudp, bool *istcp,
+                       size_t *l3hdr_off,
+                       size_t *l4hdr_off)
 {
     int proto;
-    size_t l2hdr_len = eth_get_l2_hdr_length(headers);
-    assert(hdr_length >= eth_get_l2_hdr_length(headers));
+    size_t l2hdr_len = eth_get_l2_hdr_length(iov, iovcnt);
+
     *isip4 = *isip6 = *isudp = *istcp = false;
 
-    proto = eth_get_l3_proto(headers, l2hdr_len);
-    if (proto == ETH_P_IP) {
-        *isip4 = true;
+    proto = eth_get_l3_proto(iov, iovcnt, l2hdr_len);
 
-        struct ip_header *iphdr;
+    *l3hdr_off = l2hdr_len;
 
-        assert(hdr_length >=
-            eth_get_l2_hdr_length(headers) + sizeof(struct ip_header));
+    if (proto == ETH_P_IP) {
+        struct ip_header iphdr;
+        size_t copied = iov_to_buf(iov, iovcnt, l2hdr_len,
+                                   &iphdr, sizeof(iphdr));
 
-        iphdr = PKT_GET_IP_HDR(headers);
+        *isip4 = true;
 
-        if (IP_HEADER_VERSION(iphdr) == IP_HEADER_VERSION_4) {
-            if (iphdr->ip_p == IP_PROTO_TCP) {
+        if (copied < sizeof(iphdr)) {
+            return;
+        }
+
+        if (IP_HEADER_VERSION(&iphdr) == IP_HEADER_VERSION_4) {
+            if (iphdr.ip_p == IP_PROTO_TCP) {
                 *istcp = true;
-            } else if (iphdr->ip_p == IP_PROTO_UDP) {
+            } else if (iphdr.ip_p == IP_PROTO_UDP) {
                 *isudp = true;
             }
         }
+
+        *l4hdr_off = l2hdr_len + IP_HDR_GET_LEN(&iphdr);
     } else if (proto == ETH_P_IPV6) {
         uint8_t l4proto;
         size_t full_ip6hdr_len;
 
-        struct iovec hdr_vec;
-        hdr_vec.iov_base = (void *) headers;
-        hdr_vec.iov_len = hdr_length;
-
         *isip6 = true;
-        if (eth_parse_ipv6_hdr(&hdr_vec, 1, l2hdr_len,
-                              &l4proto, &full_ip6hdr_len)) {
+        if (eth_parse_ipv6_hdr(iov, iovcnt, l2hdr_len,
+                               &l4proto, &full_ip6hdr_len)) {
             if (l4proto == IP_PROTO_TCP) {
                 *istcp = true;
             } else if (l4proto == IP_PROTO_UDP) {
                 *isudp = true;
             }
         }
+
+        *l4hdr_off = l2hdr_len + full_ip6hdr_len;
     }
 }
 
+bool
+eth_strip_vlan(const struct iovec *iov, int iovcnt, size_t iovoff,
+               uint8_t *new_ehdr_buf,
+               uint16_t *payload_offset, uint16_t *tci)
+{
+    struct vlan_header vlan_hdr;
+    struct eth_header *new_ehdr = (struct eth_header *) new_ehdr_buf;
+
+    size_t copied = iov_to_buf(iov, iovcnt, iovoff,
+                               new_ehdr, sizeof(*new_ehdr));
+
+    if (copied < sizeof(*new_ehdr)) {
+        return false;
+    }
+
+    switch (be16_to_cpu(new_ehdr->h_proto)) {
+    case ETH_P_VLAN:
+    case ETH_P_DVLAN:
+        copied = iov_to_buf(iov, iovcnt, iovoff + sizeof(*new_ehdr),
+                            &vlan_hdr, sizeof(vlan_hdr));
+
+        if (copied < sizeof(vlan_hdr)) {
+            return false;
+        }
+
+        new_ehdr->h_proto = vlan_hdr.h_proto;
+
+        *tci = be16_to_cpu(vlan_hdr.h_tci);
+        *payload_offset = iovoff + sizeof(*new_ehdr) + sizeof(vlan_hdr);
+
+        if (be16_to_cpu(new_ehdr->h_proto) == ETH_P_VLAN) {
+
+            copied = iov_to_buf(iov, iovcnt, *payload_offset,
+                                PKT_GET_VLAN_HDR(new_ehdr), sizeof(vlan_hdr));
+
+            if (copied < sizeof(vlan_hdr)) {
+                return false;
+            }
+
+            *payload_offset += sizeof(vlan_hdr);
+        }
+        return true;
+    default:
+        return false;
+    }
+}
+
+bool
+eth_strip_vlan_ex(const struct iovec *iov, int iovcnt, size_t iovoff,
+                  uint16_t vet, uint8_t *new_ehdr_buf,
+                  uint16_t *payload_offset, uint16_t *tci)
+{
+    struct vlan_header vlan_hdr;
+    struct eth_header *new_ehdr = (struct eth_header *) new_ehdr_buf;
+
+    size_t copied = iov_to_buf(iov, iovcnt, iovoff,
+                               new_ehdr, sizeof(*new_ehdr));
+
+    if (copied < sizeof(*new_ehdr)) {
+        return false;
+    }
+
+    if (be16_to_cpu(new_ehdr->h_proto) == vet) {
+        copied = iov_to_buf(iov, iovcnt, iovoff + sizeof(*new_ehdr),
+                            &vlan_hdr, sizeof(vlan_hdr));
+
+        if (copied < sizeof(vlan_hdr)) {
+            return false;
+        }
+
+        new_ehdr->h_proto = vlan_hdr.h_proto;
+
+        *tci = be16_to_cpu(vlan_hdr.h_tci);
+        *payload_offset = iovoff + sizeof(*new_ehdr) + sizeof(vlan_hdr);
+        return true;
+    }
+
+    return false;
+}
+
 void
 eth_setup_ip4_fragmentation(const void *l2hdr, size_t l2hdr_len,
                             void *l3hdr, size_t l3hdr_len,
                             size_t l3payload_len,
                             size_t frag_offset, bool more_frags)
 {
-    if (eth_get_l3_proto(l2hdr, l2hdr_len) == ETH_P_IP) {
+    const struct iovec l2vec = {
+        .iov_base = (void *) l2hdr,
+        .iov_len = l2hdr_len
+    };
+
+    if (eth_get_l3_proto(&l2vec, 1, l2hdr_len) == ETH_P_IP) {
         uint16_t orig_flags;
         struct ip_header *iphdr = (struct ip_header *) l3hdr;
         uint16_t frag_off_units = frag_offset / IP_FRAG_UNIT_SIZE;
@@ -185,7 +285,7 @@ eth_is_ip6_extension_header_type(uint8_t hdr_type)
     }
 }
 
-bool eth_parse_ipv6_hdr(struct iovec *pkt, int pkt_frags,
+bool eth_parse_ipv6_hdr(const struct iovec *pkt, int pkt_frags,
                         size_t ip6hdr_off, uint8_t *l4proto,
                         size_t *full_hdr_len)
 {
-- 
2.4.3

^ permalink raw reply related	[flat|nested] 15+ messages in thread

* [Qemu-devel] [RFC PATCH 4/5] e1000_regs: Add definitions for Intel 82574-specific bits
  2015-10-25 17:00 [Qemu-devel] [RFC PATCH 0/5] Introduce Intel 82574 GbE Controller Emulation (e1000e) Leonid Bloch
                   ` (2 preceding siblings ...)
  2015-10-25 17:00 ` [Qemu-devel] [RFC PATCH 3/5] net_pkt: Extend packet abstraction as required by e1000e functionality Leonid Bloch
@ 2015-10-25 17:00 ` Leonid Bloch
  2015-10-25 17:00 ` [Qemu-devel] [RFC PATCH 5/5] net: Introduce e1000e device emulation Leonid Bloch
                   ` (2 subsequent siblings)
  6 siblings, 0 replies; 15+ messages in thread
From: Leonid Bloch @ 2015-10-25 17:00 UTC (permalink / raw)
  To: qemu-devel; +Cc: Dmitry Fleytman, Jason Wang, Leonid Bloch, Shmulik Ladkani

From: Dmitry Fleytman <dmitry.fleytman@ravellosystems.com>

Signed-off-by: Dmitry Fleytman <dmitry.fleytman@ravellosystems.com>
Signed-off-by: Leonid Bloch <leonid.bloch@ravellosystems.com>
---
 hw/net/e1000_regs.h | 201 +++++++++++++++++++++++++++++++++++++++++++++++++++-
 1 file changed, 199 insertions(+), 2 deletions(-)

diff --git a/hw/net/e1000_regs.h b/hw/net/e1000_regs.h
index 60b96aa..af154ad 100644
--- a/hw/net/e1000_regs.h
+++ b/hw/net/e1000_regs.h
@@ -85,6 +85,7 @@
 #define E1000_DEV_ID_82573E              0x108B
 #define E1000_DEV_ID_82573E_IAMT         0x108C
 #define E1000_DEV_ID_82573L              0x109A
+#define E1000_DEV_ID_82574L              0x10D3
 #define E1000_DEV_ID_82546GB_QUAD_COPPER_KSP3 0x10B5
 #define E1000_DEV_ID_80003ES2LAN_COPPER_DPT     0x1096
 #define E1000_DEV_ID_80003ES2LAN_SERDES_DPT     0x1098
@@ -104,6 +105,7 @@
 #define E1000_PHY_ID2_82544x 0xC30
 #define E1000_PHY_ID2_8254xx_DEFAULT 0xC20 /* 82540x, 82545x, and 82546x */
 #define E1000_PHY_ID2_82573x 0xCC0
+#define E1000_PHY_ID2_82574x 0xCB1
 
 /* Register Set. (82543, 82544)
  *
@@ -135,8 +137,10 @@
 #define E1000_ITR      0x000C4  /* Interrupt Throttling Rate - RW */
 #define E1000_ICS      0x000C8  /* Interrupt Cause Set - WO */
 #define E1000_IMS      0x000D0  /* Interrupt Mask Set - RW */
+#define E1000_EIAC     0x000DC  /* Ext. Interrupt Auto Clear - RW */
 #define E1000_IMC      0x000D8  /* Interrupt Mask Clear - WO */
 #define E1000_IAM      0x000E0  /* Interrupt Acknowledge Auto Mask */
+#define E1000_IVAR     0x000E4  /* Interrupt Vector Allocation Register - RW */
 #define E1000_RCTL     0x00100  /* RX Control - RW */
 #define E1000_RDTR1    0x02820  /* RX Delay Timer (1) - RW */
 #define E1000_RDBAL1   0x02900  /* RX Descriptor Base Address Low (1) - RW */
@@ -158,7 +162,8 @@
 #define E1000_PHY_CTRL     0x00F10  /* PHY Control Register in CSR */
 #define FEXTNVM_SW_CONFIG  0x0001
 #define E1000_PBA      0x01000  /* Packet Buffer Allocation - RW */
-#define E1000_PBS      0x01008  /* Packet Buffer Size */
+#define E1000_PBM      0x10000  /* Packet Buffer Memory - RW */
+#define E1000_PBS      0x01008  /* Packet Buffer Size - RW */
 #define E1000_EEMNGCTL 0x01010  /* MNG EEprom Control */
 #define E1000_FLASH_UPDATES 1000
 #define E1000_EEARBC   0x01024  /* EEPROM Auto Read Bus Control */
@@ -191,6 +196,12 @@
 #define E1000_RAID     0x02C08  /* Receive Ack Interrupt Delay - RW */
 #define E1000_TXDMAC   0x03000  /* TX DMA Control - RW */
 #define E1000_KABGTXD  0x03004  /* AFE Band Gap Transmit Ref Data */
+#define E1000_POEMB    0x00F10  /* PHY OEM Bits Register - RW */
+#define E1000_RDFH     0x02410  /* Receive Data FIFO Head Register - RW */
+#define E1000_RDFT     0x02418  /* Receive Data FIFO Tail Register - RW */
+#define E1000_RDFHS    0x02420  /* Receive Data FIFO Head Saved Register - RW */
+#define E1000_RDFTS    0x02428  /* Receive Data FIFO Tail Saved Register - RW */
+#define E1000_RDFPC    0x02430  /* Receive Data FIFO Packet Count - RW */
 #define E1000_TDFH     0x03410  /* TX Data FIFO Head - RW */
 #define E1000_TDFT     0x03418  /* TX Data FIFO Tail - RW */
 #define E1000_TDFHS    0x03420  /* TX Data FIFO Head Saved - RW */
@@ -294,9 +305,14 @@
 #define E1000_IP6AT    0x05880  /* IPv6 Address Table - RW Array */
 #define E1000_WUPL     0x05900  /* Wakeup Packet Length - RW */
 #define E1000_WUPM     0x05A00  /* Wakeup Packet Memory - RO A */
+#define E1000_MFUTP01  0x05828   /* Management Flex UDP/TCP Ports 0/1 - RW */
+#define E1000_MFUTP23  0x05830   /* Management Flex UDP/TCP Ports 2/3 - RW */
+#define E1000_MFVAL    0x05824   /* Manageability Filters Valid - RW */
+#define E1000_MDEF     0x05890   /* Manageability Decision Filters - RW Array */
 #define E1000_FFLT     0x05F00  /* Flexible Filter Length Table - RW Array */
 #define E1000_HOST_IF  0x08800  /* Host Interface */
 #define E1000_FFMT     0x09000  /* Flexible Filter Mask Table - RW Array */
+#define E1000_FTFT     0x09400  /* Flexible TCO Filter Table - RW Array */
 #define E1000_FFVT     0x09800  /* Flexible Filter Value Table - RW Array */
 
 #define E1000_KUMCTRLSTA 0x00034 /* MAC-PHY interface - RW */
@@ -305,12 +321,18 @@
 #define E1000_SW_FW_SYNC 0x05B5C /* Software-Firmware Synchronization - RW */
 
 #define E1000_GCR       0x05B00 /* PCI-Ex Control */
+#define E1000_FUNCTAG   0x05B08 /* Function-Tag Register */
 #define E1000_GSCL_1    0x05B10 /* PCI-Ex Statistic Control #1 */
 #define E1000_GSCL_2    0x05B14 /* PCI-Ex Statistic Control #2 */
 #define E1000_GSCL_3    0x05B18 /* PCI-Ex Statistic Control #3 */
 #define E1000_GSCL_4    0x05B1C /* PCI-Ex Statistic Control #4 */
+#define E1000_GSCN_0    0x05B20 /* 3GIO Statistic Counter Register #0 */
+#define E1000_GSCN_1    0x05B24 /* 3GIO Statistic Counter Register #1 */
+#define E1000_GSCN_2    0x05B28 /* 3GIO Statistic Counter Register #2 */
+#define E1000_GSCN_3    0x05B2C /* 3GIO Statistic Counter Register #3 */
 #define E1000_FACTPS    0x05B30 /* Function Active and Power State to MNG */
 #define E1000_SWSM      0x05B50 /* SW Semaphore */
+#define E1000_GCR2      0x05B64 /* 3GIO Control Register 2 */
 #define E1000_FWSM      0x05B54 /* FW Semaphore */
 #define E1000_FFLT_DBG  0x05F04 /* Debug Register */
 #define E1000_HICR      0x08F00 /* Host Inteface Control */
@@ -323,6 +345,59 @@
 #define E1000_RSSIM     0x05864 /* RSS Interrupt Mask */
 #define E1000_RSSIR     0x05868 /* RSS Interrupt Request */
 
+#define E1000_TIMINCA   0x0B608 /* Increment attributes register - RW */
+
+#define E1000_ICR_ASSERTED BIT(31)
+#define E1000_EIAC_MASK    0x01F00000
+
+/* IVAR register parsing helpers */
+#define E1000_IVAR_INT_ALLOC_VALID  (0x8)
+
+#define E1000_IVAR_RXQ0_SHIFT       (0)
+#define E1000_IVAR_RXQ1_SHIFT       (4)
+#define E1000_IVAR_TXQ0_SHIFT       (8)
+#define E1000_IVAR_TXQ1_SHIFT       (12)
+#define E1000_IVAR_OTHER_SHIFT      (16)
+
+#define E1000_IVAR_ENTRY_MASK       (0xF)
+#define E1000_IVAR_ENTRY_VALID_MASK E1000_IVAR_INT_ALLOC_VALID
+#define E1000_IVAR_ENTRY_VEC_MASK   (0x7)
+
+#define E1000_IVAR_RXQ0(x)          ((x) >> E1000_IVAR_RXQ0_SHIFT)
+#define E1000_IVAR_RXQ1(x)          ((x) >> E1000_IVAR_RXQ1_SHIFT)
+#define E1000_IVAR_TXQ0(x)          ((x) >> E1000_IVAR_TXQ0_SHIFT)
+#define E1000_IVAR_TXQ1(x)          ((x) >> E1000_IVAR_TXQ1_SHIFT)
+#define E1000_IVAR_OTHER(x)         ((x) >> E1000_IVAR_OTHER_SHIFT)
+
+#define E1000_IVAR_ENTRY_VALID(x)   ((x) & E1000_IVAR_ENTRY_VALID_MASK)
+#define E1000_IVAR_ENTRY_VEC(x)     ((x) & E1000_IVAR_ENTRY_VEC_MASK)
+
+#define E1000_IVAR_TX_INT_EVERY_WB  BIT(31)
+
+/* RFCTL register bits */
+#define E1000_RFCTL_NFSW_DIS            0x00000040
+#define E1000_RFCTL_NFSR_DIS            0x00000080
+#define E1000_RFCTL_ACK_DIS             0x00001000
+#define E1000_RFCTL_EXTEN               0x00008000
+#define E1000_RFCTL_IPV6_EX_DIS         0x00010000
+#define E1000_RFCTL_NEW_IPV6_EXT_DIS    0x00020000
+
+/* PSRCTL parsing */
+#define E1000_PSRCTL_BSIZE0_MASK   0x0000007F
+#define E1000_PSRCTL_BSIZE1_MASK   0x00003F00
+#define E1000_PSRCTL_BSIZE2_MASK   0x003F0000
+#define E1000_PSRCTL_BSIZE3_MASK   0x3F000000
+
+#define E1000_PSRCTL_BSIZE0_SHIFT  0
+#define E1000_PSRCTL_BSIZE1_SHIFT  8
+#define E1000_PSRCTL_BSIZE2_SHIFT  16
+#define E1000_PSRCTL_BSIZE3_SHIFT  24
+
+#define E1000_PSRCTL_BUFFS_PER_DESC 4
+
+/* TARC* parsing */
+#define E1000_TARC_ENABLE BIT(10)
+
 /* PHY 1000 MII Register/Bit Definitions */
 /* PHY Registers defined by IEEE */
 #define PHY_CTRL         0x00 /* Control Register */
@@ -338,6 +413,14 @@
 #define PHY_1000T_STATUS 0x0A /* 1000Base-T Status Reg */
 #define PHY_EXT_STATUS   0x0F /* Extended Status Reg */
 
+/* 82574-specific registers */
+#define PHY_PAGE         0x16 /* Page Address (Any page) */
+#define PHY_OEM_BITS     0x19 /* OEM Bits (Page 0) */
+#define PHY_BIAS_1       0x1d /* Bias Setting Register */
+#define PHY_BIAS_2       0x1e /* Bias Setting Register */
+
+#define PHY_PAGE_RW_MASK 0x7F /* R/W part of page address register */
+
 #define MAX_PHY_REG_ADDRESS        0x1F  /* 5 bit address bus (0-0x1F) */
 #define MAX_PHY_MULTI_PAGE_REG     0xF   /* Registers equal on all pages */
 
@@ -417,6 +500,11 @@
 #define E1000_ICR_DSW           0x00000020 /* FW changed the status of DISSW bit in the FWSM */
 #define E1000_ICR_PHYINT        0x00001000 /* LAN connected device generates an interrupt */
 #define E1000_ICR_EPRST         0x00100000 /* ME handware reset occurs */
+#define E1000_ICR_RXQ0          0x00100000 /* Rx Queue 0 Interrupt */
+#define E1000_ICR_RXQ1          0x00200000 /* Rx Queue 1 Interrupt */
+#define E1000_ICR_TXQ0          0x00400000 /* Tx Queue 0 Interrupt */
+#define E1000_ICR_TXQ1          0x00800000 /* Tx Queue 1 Interrupt */
+#define E1000_ICR_OTHER         0x01000000 /* Other Interrupts */
 
 /* Interrupt Cause Set */
 #define E1000_ICS_TXDW      E1000_ICR_TXDW      /* Transmit desc written back */
@@ -556,6 +644,15 @@
 #define E1000_EEPROM_RW_ADDR_SHIFT 8    /* Shift to the address bits */
 #define E1000_EEPROM_POLL_WRITE    1    /* Flag for polling for write complete */
 #define E1000_EEPROM_POLL_READ     0    /* Flag for polling for read complete */
+
+/* 82574 EERD register layout */
+#define E1000_NVM_RW_REG_DATA      16   /* Offset to data in NVM r/w regs */
+#define E1000_NVM_RW_REG_DONE      2    /* Offset to READ/WRITE done bit */
+#define E1000_NVM_RW_REG_START     1    /* Start operation */
+#define E1000_NVM_RW_ADDR_SHIFT    2    /* Shift to the address bits */
+#define E1000_NVM_ADDR_MASK        ((1L << 13) - 1) /* Mask for address */
+#define E1000_NVM_DATA_MASK        ((1L << 16) - 1) /* Mask for data */
+
 /* Register Bit Masks */
 /* Device Control */
 #define E1000_CTRL_FD       0x00000001  /* Full duplex.0=half; 1=full */
@@ -579,6 +676,8 @@
 #define E1000_CTRL_D_UD_POLARITY 0x00004000 /* Defined polarity of Dock/Undock indication in SDP[0] */
 #define E1000_CTRL_FORCE_PHY_RESET 0x00008000 /* Reset both PHY ports, through PHYRST_N pin */
 #define E1000_CTRL_EXT_LINK_EN 0x00010000 /* enable link status from external LINK_0 and LINK_1 pins */
+#define E1000_CTRL_EXT_EIAME   0x01000000
+#define E1000_CTRL_EXT_IAME    0x08000000 /* Int ACK Auto-mask */
 #define E1000_CTRL_SWDPIN0  0x00040000  /* SWDPIN 0 value */
 #define E1000_CTRL_SWDPIN1  0x00080000  /* SWDPIN 1 value */
 #define E1000_CTRL_SWDPIN2  0x00100000  /* SWDPIN 2 value */
@@ -658,6 +757,7 @@
 #define E1000_EECD_AUPDEN    0x00100000 /* Enable Autonomous FLASH update */
 #define E1000_EECD_SHADV     0x00200000 /* Shadow RAM Data Valid */
 #define E1000_EECD_SEC1VAL   0x00400000 /* Sector One Valid */
+
 #define E1000_EECD_SECVAL_SHIFT      22
 #define E1000_STM_OPCODE     0xDB00
 #define E1000_HICR_FW_RESET  0xC0
@@ -705,6 +806,14 @@
 #define E1000_EEPROM_CFG_DONE         0x00040000   /* MNG config cycle done */
 #define E1000_EEPROM_CFG_DONE_PORT_1  0x00080000   /* ...for second port */
 
+/* PCI Express Control */
+/* 3GIO Control Register - GCR (0x05B00; RW) */
+#define E1000_L0S_ADJUST              (1 << 9)
+#define E1000_L1_ENTRY_LATENCY_MSB    (1 << 23)
+#define E1000_L1_ENTRY_LATENCY_LSB    (1 << 25 | 1 << 26)
+
+#define E1000_GCR_RO_BITS             (1 << 23 | 1 << 25 | 1 << 26)
+
 /* Transmit Descriptor */
 struct e1000_tx_desc {
     uint64_t buffer_addr;       /* Address of the descriptor's data buffer */
@@ -746,7 +859,9 @@ struct e1000_tx_desc {
 #define E1000_TXD_CMD_TCP    0x01000000 /* TCP packet */
 #define E1000_TXD_CMD_IP     0x02000000 /* IP packet */
 #define E1000_TXD_CMD_TSE    0x04000000 /* TCP Seg enable */
+#define E1000_TXD_CMD_SNAP   0x40000000 /* Update SNAP header */
 #define E1000_TXD_STAT_TC    0x00000004 /* Tx Underrun */
+#define E1000_TXD_EXTCMD_TSTAMP 0x00000010 /* IEEE1588 Timestamp packet */
 
 /* Transmit Control */
 #define E1000_TCTL_RST    0x00000001    /* software reset */
@@ -761,7 +876,7 @@ struct e1000_tx_desc {
 #define E1000_TCTL_NRTU   0x02000000    /* No Re-transmit on underrun */
 #define E1000_TCTL_MULR   0x10000000    /* Multiple request support */
 
-/* Receive Descriptor */
+/* Legacy Receive Descriptor */
 struct e1000_rx_desc {
     uint64_t buffer_addr; /* Address of the descriptor's data buffer */
     uint16_t length;     /* Length of data DMAed into data buffer */
@@ -771,6 +886,74 @@ struct e1000_rx_desc {
     uint16_t special;
 };
 
+/* Extended Receive Descriptor */
+union e1000_rx_desc_extended {
+    struct {
+        uint64_t buffer_addr;
+        uint64_t reserved;
+    } read;
+    struct {
+        struct {
+            uint32_t mrq;           /* Multiple Rx Queues */
+            union {
+                uint32_t rss;       /* RSS Hash */
+                struct {
+                    uint16_t ip_id; /* IP id */
+                    uint16_t csum;  /* Packet Checksum */
+                } csum_ip;
+            } hi_dword;
+        } lower;
+        struct {
+            uint32_t status_error;  /* ext status/error */
+            uint16_t length;
+            uint16_t vlan;          /* VLAN tag */
+        } upper;
+    } wb;                           /* writeback */
+};
+
+#define MAX_PS_BUFFERS 4
+
+/* Number of packet split data buffers (not including the header buffer) */
+#define PS_PAGE_BUFFERS    (MAX_PS_BUFFERS - 1)
+
+/* Receive Descriptor - Packet Split */
+union e1000_rx_desc_packet_split {
+    struct {
+        /* one buffer for protocol header(s), three data buffers */
+        uint64_t buffer_addr[MAX_PS_BUFFERS];
+    } read;
+    struct {
+        struct {
+            uint32_t mrq;          /* Multiple Rx Queues */
+            union {
+                uint32_t rss;          /* RSS Hash */
+                struct {
+                    uint16_t ip_id;    /* IP id */
+                    uint16_t csum;     /* Packet Checksum */
+                } csum_ip;
+            } hi_dword;
+        } lower;
+        struct {
+            uint32_t status_error;     /* ext status/error */
+            uint16_t length0;      /* length of buffer 0 */
+            uint16_t vlan;         /* VLAN tag */
+        } middle;
+        struct {
+            uint16_t header_status;
+            /* length of buffers 1-3 */
+            uint16_t length[PS_PAGE_BUFFERS];
+        } upper;
+        uint64_t reserved;
+    } wb; /* writeback */
+};
+
+/* Receive Checksum Control bits */
+#define E1000_RXCSUM_IPOFLD     0x100   /* IP Checksum Offload Enable */
+#define E1000_RXCSUM_TUOFLD     0x200   /* TCP/UDP Checksum Offload Enable */
+#define E1000_RXCSUM_PCSD       0x2000  /* Packet Checksum Disable */
+
+#define E1000_MAX_RX_DESC_LEN (sizeof(union e1000_rx_desc_packet_split))
+
 /* Receive Descriptor bit definitions */
 #define E1000_RXD_STAT_DD       0x01    /* Descriptor Done */
 #define E1000_RXD_STAT_EOP      0x02    /* End of Packet */
@@ -796,6 +979,15 @@ struct e1000_rx_desc {
 #define E1000_RXD_SPC_CFI_MASK  0x1000  /* CFI is bit 12 */
 #define E1000_RXD_SPC_CFI_SHIFT 12
 
+/* RX packet types */
+#define E1000_RXD_PKT_MAC       (0)
+#define E1000_RXD_PKT_IP4       (1)
+#define E1000_RXD_PKT_IP4_XDP   (2)
+#define E1000_RXD_PKT_IP6       (5)
+#define E1000_RXD_PKT_IP6_XDP   (6)
+
+#define E1000_RXD_PKT_TYPE(t) ((t) << 16)
+
 #define E1000_RXDEXT_STATERR_CE    0x01000000
 #define E1000_RXDEXT_STATERR_SE    0x02000000
 #define E1000_RXDEXT_STATERR_SEQ   0x04000000
@@ -873,6 +1065,8 @@ struct e1000_data_desc {
 #define E1000_MANC_NEIGHBOR_EN   0x00004000 /* Enable Neighbor Discovery
                                              * Filtering */
 #define E1000_MANC_ARP_RES_EN    0x00008000 /* Enable ARP response Filtering */
+#define E1000_MANC_DIS_IP_CHK_ARP  0x10000000 /* Disable IP address checking */
+                                              /* for ARP packets - in 82574 */
 #define E1000_MANC_TCO_RESET     0x00010000 /* TCO Reset Occurred */
 #define E1000_MANC_RCV_TCO_EN    0x00020000 /* Receive TCO Packets Enabled */
 #define E1000_MANC_REPORT_STATUS 0x00040000 /* Status Reporting Enabled */
@@ -896,6 +1090,9 @@ struct e1000_data_desc {
 #define E1000_MANC_SMB_DATA_OUT_SHIFT  28 /* SMBus Data Out Shift */
 #define E1000_MANC_SMB_CLK_OUT_SHIFT   29 /* SMBus Clock Out Shift */
 
+/* FACTPS Control */
+#define E1000_FACTPS_LAN0_ON     0x00000004 /* Lan 0 enable */
+
 /* For checksumming, the sum of all words in the EEPROM should equal 0xBABA. */
 #define EEPROM_SUM 0xBABA
 
-- 
2.4.3

^ permalink raw reply related	[flat|nested] 15+ messages in thread

* [Qemu-devel] [RFC PATCH 5/5] net: Introduce e1000e device emulation
  2015-10-25 17:00 [Qemu-devel] [RFC PATCH 0/5] Introduce Intel 82574 GbE Controller Emulation (e1000e) Leonid Bloch
                   ` (3 preceding siblings ...)
  2015-10-25 17:00 ` [Qemu-devel] [RFC PATCH 4/5] e1000_regs: Add definitions for Intel 82574-specific bits Leonid Bloch
@ 2015-10-25 17:00 ` Leonid Bloch
  2015-10-28  5:44 ` [Qemu-devel] [RFC PATCH 0/5] Introduce Intel 82574 GbE Controller Emulation (e1000e) Jason Wang
  2016-01-13  4:43 ` Prem Mallappa
  6 siblings, 0 replies; 15+ messages in thread
From: Leonid Bloch @ 2015-10-25 17:00 UTC (permalink / raw)
  To: qemu-devel; +Cc: Dmitry Fleytman, Jason Wang, Leonid Bloch, Shmulik Ladkani

From: Dmitry Fleytman <dmitry.fleytman@ravellosystems.com>

This patch introduces emulation for the Intel 82574 adapter, AKA e1000e.

This implementation is based on the e1000 emulation code, and
utilizes the TX/RX packet abstractions initially developed for the
vmxnet3 device. Although parts of the introduced code are common
with e1000, the differences are substantial enough that the only
resource shared between the two devices is the set of register
definitions in hw/net/e1000_regs.h.

Similarly to vmxnet3, the new device uses virtio headers for task
offloads (for backends that support virtio extensions). Usage of
virtio headers may be forcibly disabled via the boolean device
property "vnet" (enabled by default). In that case, task offloads
are performed in software, just as for backends that do not support
virtio headers.
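For illustration only (not part of the patch), disabling the property
would look like this on a QEMU command line; the netdev options here
are an assumed example setup:

```shell
# Assumed example: force software task offloads by disabling
# virtio header usage ("vnet" is on by default)
qemu-system-x86_64 \
    -netdev tap,id=net0 \
    -device e1000e,netdev=net0,vnet=off
```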

The device code is split into two parts:

  1. hw/net/e1000e.c: QEMU-specific code for a network device;
  2. hw/net/e1000e_core.[hc]: Device emulation according to the spec.

The new device name is e1000e.

The Intel specification for the 82574 controller is available at
http://www.intel.com/content/dam/doc/datasheet/82574l-gbe-controller-datasheet.pdf

Signed-off-by: Dmitry Fleytman <dmitry.fleytman@ravellosystems.com>
Signed-off-by: Leonid Bloch <leonid.bloch@ravellosystems.com>
---
 default-configs/pci.mak |    1 +
 hw/net/Makefile.objs    |    3 +-
 hw/net/e1000e.c         |  531 ++++++++++++
 hw/net/e1000e_core.c    | 2081 +++++++++++++++++++++++++++++++++++++++++++++++
 hw/net/e1000e_core.h    |  181 +++++
 trace-events            |   68 ++
 6 files changed, 2864 insertions(+), 1 deletion(-)
 create mode 100644 hw/net/e1000e.c
 create mode 100644 hw/net/e1000e_core.c
 create mode 100644 hw/net/e1000e_core.h

diff --git a/default-configs/pci.mak b/default-configs/pci.mak
index 58a2c0a..5fd4fcd 100644
--- a/default-configs/pci.mak
+++ b/default-configs/pci.mak
@@ -17,6 +17,7 @@ CONFIG_VMW_PVSCSI_SCSI_PCI=y
 CONFIG_MEGASAS_SCSI_PCI=y
 CONFIG_RTL8139_PCI=y
 CONFIG_E1000_PCI=y
+CONFIG_E1000E_PCI=y
 CONFIG_VMXNET3_PCI=y
 CONFIG_IDE_CORE=y
 CONFIG_IDE_QDEV=y
diff --git a/hw/net/Makefile.objs b/hw/net/Makefile.objs
index 34039fc..67d8efe 100644
--- a/hw/net/Makefile.objs
+++ b/hw/net/Makefile.objs
@@ -6,7 +6,8 @@ common-obj-$(CONFIG_NE2000_PCI) += ne2000.o
 common-obj-$(CONFIG_EEPRO100_PCI) += eepro100.o
 common-obj-$(CONFIG_PCNET_PCI) += pcnet-pci.o
 common-obj-$(CONFIG_PCNET_COMMON) += pcnet.o
 common-obj-$(CONFIG_E1000_PCI) += e1000.o
+common-obj-$(CONFIG_E1000E_PCI) += e1000e.o e1000e_core.o
 common-obj-$(CONFIG_RTL8139_PCI) += rtl8139.o
 common-obj-$(CONFIG_VMXNET3_PCI) += net_tx_pkt.o net_rx_pkt.o
 common-obj-$(CONFIG_VMXNET3_PCI) += vmxnet3.o
diff --git a/hw/net/e1000e.c b/hw/net/e1000e.c
new file mode 100644
index 0000000..d16bd60
--- /dev/null
+++ b/hw/net/e1000e.c
@@ -0,0 +1,531 @@
+/*
+* QEMU INTEL 82574 GbE NIC emulation
+*
+* Copyright (c) 2015 Ravello Systems LTD (http://ravellosystems.com)
+*
+* Developed by Daynix Computing LTD (http://www.daynix.com)
+*
+* Authors:
+* Dmitry Fleytman <dmitry@daynix.com>
+* Leonid Bloch <leonid@daynix.com>
+* Yan Vugenfirer <yan@daynix.com>
+*
+* This work is licensed under the terms of the GNU GPL, version 2.
+* See the COPYING file in the top-level directory.
+*
+*/
+
+#include "hw/hw.h"
+#include "hw/pci/pci.h"
+#include "net/net.h"
+#include "net/tap.h"
+#include "sysemu/sysemu.h"
+#include "hw/pci/msi.h"
+#include "hw/pci/msix.h"
+
+#include "hw/net/e1000_regs.h"
+
+#include "e1000e_core.h"
+
+#include "trace.h"
+
+#define TYPE_E1000E "e1000e"
+#define E1000E(obj) OBJECT_CHECK(E1000EState, (obj), TYPE_E1000E)
+
+typedef struct {
+    PCIDevice parent_obj;
+    NICState *nic;
+    NICConf conf;
+
+    MemoryRegion mmio;
+    MemoryRegion io;
+    MemoryRegion msix;
+
+    uint32_t intr_state;
+    bool use_vnet;
+
+    E1000ECore core;
+
+} E1000EState;
+
+#define E1000E_MMIO_IDX     0
+#define E1000E_IO_IDX       1
+#define E1000E_MSIX_IDX     2
+
+#define E1000E_MMIO_SIZE    (128*1024)
+#define E1000E_IO_SIZE      (32)
+#define E1000E_MSIX_SIZE    (16*1024)
+
+#define E1000E_MSIX_TABLE   (0x0000)
+#define E1000E_MSIX_PBA     (0x2000)
+
+#define E1000E_USE_MSI     BIT(0)
+#define E1000E_USE_MSIX    BIT(1)
+
+static uint64_t
+e1000e_mmio_read(void *opaque, hwaddr addr, unsigned size)
+{
+    E1000EState *s = opaque;
+    return e1000e_core_read(&s->core, addr, size);
+}
+
+static void
+e1000e_mmio_write(void *opaque, hwaddr addr,
+                  uint64_t val, unsigned size)
+{
+    E1000EState *s = opaque;
+    e1000e_core_write(&s->core, addr, val, size);
+}
+
+static uint64_t
+e1000e_io_read(void *opaque, hwaddr addr, unsigned size)
+{
+    /* TODO: Implement me */
+    trace_e1000e_wrn_io_read(addr, size);
+    return 0;
+}
+
+static void
+e1000e_io_write(void *opaque, hwaddr addr,
+                uint64_t val, unsigned size)
+{
+    /* TODO: Implement me */
+    trace_e1000e_wrn_io_write(addr, size, val);
+}
+
+static const MemoryRegionOps mmio_ops = {
+    .read = e1000e_mmio_read,
+    .write = e1000e_mmio_write,
+    .endianness = DEVICE_LITTLE_ENDIAN,
+    .impl = {
+        .min_access_size = 4,
+        .max_access_size = 4,
+    },
+};
+
+static const MemoryRegionOps io_ops = {
+    .read = e1000e_io_read,
+    .write = e1000e_io_write,
+    .endianness = DEVICE_LITTLE_ENDIAN,
+    .impl = {
+        .min_access_size = 4,
+        .max_access_size = 4,
+    },
+};
+
+static int
+_e1000e_can_receive(NetClientState *nc)
+{
+    E1000EState *s = qemu_get_nic_opaque(nc);
+    return e1000e_can_receive(&s->core);
+}
+
+static ssize_t
+_e1000e_receive_iov(NetClientState *nc, const struct iovec *iov, int iovcnt)
+{
+    E1000EState *s = qemu_get_nic_opaque(nc);
+    return e1000e_receive_iov(&s->core, iov, iovcnt);
+}
+
+static ssize_t
+_e1000e_receive(NetClientState *nc, const uint8_t *buf, size_t size)
+{
+    E1000EState *s = qemu_get_nic_opaque(nc);
+    return e1000e_receive(&s->core, buf, size);
+}
+
+static void
+e1000e_set_link_status(NetClientState *nc)
+{
+    E1000EState *s = qemu_get_nic_opaque(nc);
+    e1000e_core_set_link_status(&s->core);
+}
+
+static NetClientInfo net_e1000e_info = {
+    .type = NET_CLIENT_OPTIONS_KIND_NIC,
+    .size = sizeof(NICState),
+    .can_receive = _e1000e_can_receive,
+    .receive = _e1000e_receive,
+    .receive_iov = _e1000e_receive_iov,
+    .link_status_changed = e1000e_set_link_status,
+};
+
+/*
+* EEPROM (NVM) contents documented in Table 36, section 6.1.
+*/
+static const uint16_t e1000e_eeprom_template[64] = {
+  /*        Address        |    Compat.    | ImVer |   Compat.     */
+    0x0000, 0x0000, 0x0000, 0x0420, 0xf746, 0x2010, 0xffff, 0xffff,
+  /*      PBA      |ICtrl1 | SSID  | SVID  | DevID |-------|ICtrl2 */
+    0x0000, 0x0000, 0x026b, 0x0000, 0x8086, 0x0000, 0x0000, 0x8058,
+  /*    NVM words 1,2,3    |-------------------------------|PCI-EID*/
+    0x0000, 0x2001, 0x7e7c, 0xffff, 0x1000, 0x00c8, 0x0000, 0x2704,
+  /* PCIe Init. Conf 1,2,3 |PCICtrl|PHY|LD1|-------| RevID | LD0,2 */
+    0x6cc9, 0x3150, 0x070e, 0x460b, 0x2d84, 0x0100, 0xf000, 0x0706,
+  /* FLPAR |FLANADD|LAN-PWR|FlVndr |ICtrl3 |APTSMBA|APTRxEP|APTSMBC*/
+    0x6000, 0x0080, 0x0f04, 0x7fff, 0x4f01, 0xc600, 0x0000, 0x20ff,
+  /* APTIF | APTMC |APTuCP |LSWFWID|MSWFWID|NC-SIMC|NC-SIC | VPDP  */
+    0x0028, 0x0003, 0x0000, 0x0000, 0x0000, 0x0003, 0x0000, 0xffff,
+  /*                            SW Section                         */
+    0x0100, 0xc000, 0x121c, 0xc007, 0xffff, 0xffff, 0xffff, 0xffff,
+  /*                      SW Section                       |CHKSUM */
+    0xffff, 0xffff, 0xffff, 0xffff, 0x0000, 0x0120, 0xffff, 0x0000,
+};
+
+static void _e1000e_core_reinitialize(E1000EState *s)
+{
+    s->core.owner = &s->parent_obj;
+    s->core.owner_nic = s->nic;
+}
+
+static void
+_e1000e_init_msi(E1000EState *s)
+{
+    int res;
+
+    res = msi_init(PCI_DEVICE(s),
+                   0xD0,   /* MSI capability offset              */
+                   1,      /* MAC MSI interrupts                 */
+                   true,   /* 64-bit message addresses supported */
+                   false); /* Per vector mask supported          */
+
+    if (res > 0) {
+        s->intr_state |= E1000E_USE_MSI;
+    } else {
+        trace_e1000e_msi_init_fail(res);
+    }
+}
+
+static void
+_e1000e_cleanup_msi(E1000EState *s)
+{
+    if (s->intr_state & E1000E_USE_MSI) {
+        msi_uninit(PCI_DEVICE(s));
+    }
+}
+
+static void
+_e1000e_unuse_msix_vectors(E1000EState *s, int num_vectors)
+{
+    int i;
+    for (i = 0; i < num_vectors; i++) {
+        msix_vector_unuse(PCI_DEVICE(s), i);
+    }
+}
+
+static bool
+_e1000e_use_msix_vectors(E1000EState *s, int num_vectors)
+{
+    int i;
+    for (i = 0; i < num_vectors; i++) {
+        int res = msix_vector_use(PCI_DEVICE(s), i);
+        if (res < 0) {
+            trace_e1000e_msix_use_vector_fail(i, res);
+            _e1000e_unuse_msix_vectors(s, i);
+            return false;
+        }
+    }
+    return true;
+}
+
+static void
+_e1000e_init_msix(E1000EState *s)
+{
+    PCIDevice *d = PCI_DEVICE(s);
+    int res = msix_init(PCI_DEVICE(s), E1000E_MSIX_VEC_NUM,
+                        &s->msix,
+                        E1000E_MSIX_IDX, E1000E_MSIX_TABLE,
+                        &s->msix,
+                        E1000E_MSIX_IDX, E1000E_MSIX_PBA,
+                        0xA0);
+
+    if (res < 0) {
+        trace_e1000e_msix_init_fail(res);
+    } else {
+        if (!_e1000e_use_msix_vectors(s, E1000E_MSIX_VEC_NUM)) {
+            msix_uninit(d, &s->msix, &s->msix);
+        } else {
+            s->intr_state |= E1000E_USE_MSIX;
+        }
+    }
+}
+
+static void
+_e1000e_cleanup_msix(E1000EState *s)
+{
+    if (s->intr_state & E1000E_USE_MSIX) {
+        _e1000e_unuse_msix_vectors(s, E1000E_MSIX_VEC_NUM);
+        msix_uninit(PCI_DEVICE(s), &s->msix, &s->msix);
+    }
+}
+
+static void
+_e1000e_init_net_peer(E1000EState *s, PCIDevice *pci_dev, uint8_t *macaddr)
+{
+    DeviceState *dev = DEVICE(pci_dev);
+    NetClientState *nc;
+
+    s->nic = qemu_new_nic(&net_e1000e_info, &s->conf,
+        object_get_typename(OBJECT(s)), dev->id, s);
+
+    qemu_format_nic_info_str(qemu_get_queue(s->nic), macaddr);
+
+    nc = qemu_get_queue(s->nic);
+
+    s->core.has_vnet = (s->use_vnet && nc->peer) ?
+        qemu_has_vnet_hdr(nc->peer) : false;
+
+    trace_e1000e_cfg_support_virtio(s->core.has_vnet);
+
+    if (s->core.has_vnet) {
+        qemu_set_vnet_hdr_len(nc->peer, sizeof(struct virtio_net_hdr));
+        qemu_using_vnet_hdr(nc->peer, true);
+        qemu_set_offload(nc->peer, 1, 0, 0, 0, 0);
+    }
+}
+
+static void e1000e_pci_realize(PCIDevice *pci_dev, Error **errp)
+{
+    E1000EState *s = E1000E(pci_dev);
+    uint8_t *macaddr;
+
+    trace_e1000e_cb_pci_realize();
+
+    pci_dev->config[PCI_CACHE_LINE_SIZE] = 0x10;
+    pci_dev->config[PCI_INTERRUPT_PIN] = 1;
+
+    /* Define IO/MMIO regions */
+    memory_region_init_io(&s->mmio, OBJECT(s), &mmio_ops, s,
+                          "e1000e-mmio", E1000E_MMIO_SIZE);
+    pci_register_bar(pci_dev, E1000E_MMIO_IDX,
+                     PCI_BASE_ADDRESS_SPACE_MEMORY, &s->mmio);
+
+    memory_region_init_io(&s->io, OBJECT(s), &io_ops, s,
+                          "e1000e-io", E1000E_IO_SIZE);
+    pci_register_bar(pci_dev, E1000E_IO_IDX,
+                     PCI_BASE_ADDRESS_SPACE_IO, &s->io);
+
+    memory_region_init(&s->msix, OBJECT(s), "e1000e-msix",
+                       E1000E_MSIX_SIZE);
+    pci_register_bar(pci_dev, E1000E_MSIX_IDX,
+                     PCI_BASE_ADDRESS_SPACE_MEMORY, &s->msix);
+
+    /* Create networking backend */
+    qemu_macaddr_default_if_unset(&s->conf.macaddr);
+    macaddr = s->conf.macaddr.a;
+
+    _e1000e_init_net_peer(s, pci_dev, macaddr);
+
+    _e1000e_init_msi(s);
+    _e1000e_init_msix(s);
+
+    /* Initialize registers interface */
+    _e1000e_core_reinitialize(s);
+
+    e1000e_core_pci_realize(&s->core,
+                           e1000e_eeprom_template,
+                           sizeof(e1000e_eeprom_template),
+                           macaddr);
+}
+
+static void e1000e_pci_uninit(PCIDevice *pci_dev)
+{
+    E1000EState *s = E1000E(pci_dev);
+
+    trace_e1000e_cb_pci_uninit();
+
+    e1000e_core_pci_uninit(&s->core);
+    qemu_del_nic(s->nic);
+
+    _e1000e_cleanup_msix(s);
+    _e1000e_cleanup_msi(s);
+}
+
+static void
+e1000e_write_config(PCIDevice *pci_dev, uint32_t addr, uint32_t val, int len)
+{
+    trace_e1000e_cb_write_config();
+
+    pci_default_write_config(pci_dev, addr, val, len);
+    msi_write_config(pci_dev, addr, val, len);
+    msix_write_config(pci_dev, addr, val, len);
+}
+
+static void e1000e_qdev_reset(DeviceState *dev)
+{
+    E1000EState *s = E1000E(dev);
+    uint8_t *macaddr = s->conf.macaddr.a;
+
+    trace_e1000e_cb_qdev_reset();
+
+    e1000e_core_reset(&s->core, macaddr, E1000_PHY_ID2_82574x);
+    qemu_format_nic_info_str(qemu_get_queue(s->nic), macaddr);
+}
+
+static void e1000e_pre_save(void *opaque)
+{
+    E1000EState *s = opaque;
+
+    trace_e1000e_cb_pre_save();
+
+    e1000e_core_pre_save(&s->core);
+}
+
+static int e1000e_post_load(void *opaque, int version_id)
+{
+    E1000EState *s = opaque;
+
+    _e1000e_core_reinitialize(s);
+
+    trace_e1000e_cb_post_load();
+
+    return e1000e_core_post_load(&s->core);
+}
+
+static bool e1000e_mit_state_needed(void *opaque)
+{
+    E1000EState *s = opaque;
+
+    return s->core.compat_flags & E1000_FLAG_MIT;
+}
+
+static const VMStateDescription vmstate_e1000e_mit_state = {
+    .name = "e1000e/mit_state",
+    .version_id = 1,
+    .minimum_version_id = 1,
+    .fields = (VMStateField[]) {
+        VMSTATE_UINT32(core.mac[RDTR], E1000EState),
+        VMSTATE_UINT32(core.mac[RADV], E1000EState),
+        VMSTATE_UINT32(core.mac[TADV], E1000EState),
+        VMSTATE_UINT32(core.mac[ITR], E1000EState),
+        VMSTATE_BOOL(core.mit_irq_level, E1000EState),
+        VMSTATE_END_OF_LIST()
+    }
+};
+
+static const VMStateDescription vmstate_e1000e = {
+    .name = "e1000e",
+    .version_id = 1,
+    .minimum_version_id = 1,
+    .pre_save = e1000e_pre_save,
+    .post_load = e1000e_post_load,
+    .fields = (VMStateField[]) {
+        VMSTATE_PCI_DEVICE(parent_obj, E1000EState),
+        VMSTATE_MSIX(parent_obj, E1000EState),
+
+        VMSTATE_UINT32(intr_state, E1000EState),
+        VMSTATE_UINT32(core.rxbuf_min_shift, E1000EState),
+        VMSTATE_UINT8(core.rx_desc_len, E1000EState),
+        VMSTATE_UINT32_ARRAY(core.rxbuf_sizes, E1000EState,
+                             E1000_PSRCTL_BUFFS_PER_DESC),
+        VMSTATE_UINT32(core.rx_desc_buf_size, E1000EState),
+        VMSTATE_UINT32(core.eecd_state.val_in, E1000EState),
+        VMSTATE_UINT16(core.eecd_state.bitnum_in, E1000EState),
+        VMSTATE_UINT16(core.eecd_state.bitnum_out, E1000EState),
+        VMSTATE_UINT16(core.eecd_state.reading, E1000EState),
+        VMSTATE_UINT32(core.eecd_state.old_eecd, E1000EState),
+        VMSTATE_UINT16_ARRAY(core.eeprom, E1000EState, E1000E_EEPROM_SIZE),
+        VMSTATE_UINT16_ARRAY(core.phy, E1000EState, E1000E_PHY_SIZE),
+        VMSTATE_UINT32_ARRAY(core.mac, E1000EState, E1000E_MAC_SIZE),
+
+        VMSTATE_UINT8(core.tx[0].sum_needed, E1000EState),
+        VMSTATE_UINT8(core.tx[0].ipcss, E1000EState),
+        VMSTATE_UINT8(core.tx[0].ipcso, E1000EState),
+        VMSTATE_UINT16(core.tx[0].ipcse, E1000EState),
+        VMSTATE_UINT8(core.tx[0].tucss, E1000EState),
+        VMSTATE_UINT8(core.tx[0].tucso, E1000EState),
+        VMSTATE_UINT16(core.tx[0].tucse, E1000EState),
+        VMSTATE_UINT8(core.tx[0].hdr_len, E1000EState),
+        VMSTATE_UINT16(core.tx[0].mss, E1000EState),
+        VMSTATE_UINT32(core.tx[0].paylen, E1000EState),
+        VMSTATE_INT8(core.tx[0].ip, E1000EState),
+        VMSTATE_INT8(core.tx[0].tcp, E1000EState),
+        VMSTATE_BOOL(core.tx[0].tse, E1000EState),
+        VMSTATE_BOOL(core.tx[0].cptse, E1000EState),
+        VMSTATE_BOOL(core.tx[0].skip_cp, E1000EState),
+
+        VMSTATE_UINT8(core.tx[1].sum_needed, E1000EState),
+        VMSTATE_UINT8(core.tx[1].ipcss, E1000EState),
+        VMSTATE_UINT8(core.tx[1].ipcso, E1000EState),
+        VMSTATE_UINT16(core.tx[1].ipcse, E1000EState),
+        VMSTATE_UINT8(core.tx[1].tucss, E1000EState),
+        VMSTATE_UINT8(core.tx[1].tucso, E1000EState),
+        VMSTATE_UINT16(core.tx[1].tucse, E1000EState),
+        VMSTATE_UINT8(core.tx[1].hdr_len, E1000EState),
+        VMSTATE_UINT16(core.tx[1].mss, E1000EState),
+        VMSTATE_UINT32(core.tx[1].paylen, E1000EState),
+        VMSTATE_INT8(core.tx[1].ip, E1000EState),
+        VMSTATE_INT8(core.tx[1].tcp, E1000EState),
+        VMSTATE_BOOL(core.tx[1].tse, E1000EState),
+        VMSTATE_BOOL(core.tx[1].cptse, E1000EState),
+        VMSTATE_BOOL(core.tx[1].skip_cp, E1000EState),
+
+        VMSTATE_BOOL(core.has_vnet, E1000EState),
+
+        VMSTATE_END_OF_LIST()
+    },
+    .subsections = (VMStateSubsection[]) {
+        {
+            .vmsd = &vmstate_e1000e_mit_state,
+            .needed = e1000e_mit_state_needed,
+        }, {
+            /* empty */
+        }
+    }
+};
+
+static Property e1000e_properties[] = {
+    DEFINE_NIC_PROPERTIES(E1000EState, conf),
+    DEFINE_PROP_BIT("autonegotiation", E1000EState,
+                    core.compat_flags, E1000_FLAG_AUTONEG_BIT, true),
+    DEFINE_PROP_BIT("mitigation", E1000EState,
+                    core.compat_flags, E1000_FLAG_MIT_BIT, true),
+    DEFINE_PROP_BOOL("vnet", E1000EState, use_vnet, true),
+    DEFINE_PROP_END_OF_LIST(),
+};
+
+static void e1000e_class_init(ObjectClass *class, void *data)
+{
+    DeviceClass *dc = DEVICE_CLASS(class);
+    PCIDeviceClass *c = PCI_DEVICE_CLASS(class);
+
+    c->realize = e1000e_pci_realize;
+    c->exit = e1000e_pci_uninit;
+    c->vendor_id = PCI_VENDOR_ID_INTEL;
+    c->device_id = E1000_DEV_ID_82574L;
+    c->revision = 0;
+    c->class_id = PCI_CLASS_NETWORK_ETHERNET;
+    c->subsystem_vendor_id = PCI_VENDOR_ID_INTEL;
+    c->subsystem_id = 0;
+    c->config_write = e1000e_write_config;
+
+    dc->desc = "Intel 82574L GbE Controller";
+    dc->reset = e1000e_qdev_reset;
+    dc->vmsd = &vmstate_e1000e;
+    dc->props = e1000e_properties;
+
+    set_bit(DEVICE_CATEGORY_NETWORK, dc->categories);
+}
+
+static void e1000e_instance_init(Object *obj)
+{
+    E1000EState *s = E1000E(obj);
+    device_add_bootindex_property(obj, &s->conf.bootindex,
+                                  "bootindex", "/ethernet-phy@0",
+                                  DEVICE(obj), NULL);
+}
+
+static const TypeInfo e1000e_info = {
+    .name = TYPE_E1000E,
+    .parent = TYPE_PCI_DEVICE,
+    .instance_size = sizeof(E1000EState),
+    .class_init = e1000e_class_init,
+    .instance_init = e1000e_instance_init,
+};
+
+static void e1000e_register_types(void)
+{
+    type_register_static(&e1000e_info);
+}
+
+type_init(e1000e_register_types)
diff --git a/hw/net/e1000e_core.c b/hw/net/e1000e_core.c
new file mode 100644
index 0000000..f099784
--- /dev/null
+++ b/hw/net/e1000e_core.c
@@ -0,0 +1,2081 @@
+/*
+* Core code for QEMU e1000e emulation
+*
+* Software developer's manuals:
+* http://www.intel.com/content/dam/doc/datasheet/82574l-gbe-controller-datasheet.pdf
+*
+* Copyright (c) 2015 Ravello Systems LTD (http://ravellosystems.com)
+* Developed by Daynix Computing LTD (http://www.daynix.com)
+*
+* Authors:
+* Dmitry Fleytman <dmitry@daynix.com>
+* Leonid Bloch <leonid@daynix.com>
+* Yan Vugenfirer <yan@daynix.com>
+*
+* Based on work done by:
+* Nir Peleg, Tutis Systems Ltd. for Qumranet Inc.
+* Copyright (c) 2008 Qumranet
+* Based on work done by:
+* Copyright (c) 2007 Dan Aloni
+* Copyright (c) 2004 Antony T Curtis
+*
+* This library is free software; you can redistribute it and/or
+* modify it under the terms of the GNU Lesser General Public
+* License as published by the Free Software Foundation; either
+* version 2 of the License, or (at your option) any later version.
+*
+* This library is distributed in the hope that it will be useful,
+* but WITHOUT ANY WARRANTY; without even the implied warranty of
+* MERCHANTABILITY or FITNESS FOR A PARTICULAR PURPOSE.  See the GNU
+* Lesser General Public License for more details.
+*
+* You should have received a copy of the GNU Lesser General Public
+* License along with this library; if not, see <http://www.gnu.org/licenses/>.
+*/
+
+#include "hw/hw.h"
+#include "hw/pci/pci.h"
+#include "net/net.h"
+#include "net/tap.h"
+#include "net/checksum.h"
+#include "sysemu/sysemu.h"
+#include "qemu/iov.h"
+#include "hw/pci/msi.h"
+#include "hw/pci/msix.h"
+
+#include "net_tx_pkt.h"
+#include "net_rx_pkt.h"
+
+#include "e1000_regs.h"
+#include "e1000e_core.h"
+
+#include "trace.h"
+
+static const uint8_t E1000E_MAX_TX_FRAGS = 64;
+
+static void
+set_interrupt_cause(E1000ECore *core, uint32_t val);
+
+static inline int
+vlan_enabled(E1000ECore *core)
+{
+    return ((core->mac[CTRL] & E1000_CTRL_VME) != 0);
+}
+
+static inline int
+is_vlan_txd(uint32_t txd_lower)
+{
+    return ((txd_lower & E1000_TXD_CMD_VLE) != 0);
+}
+
+static inline void
+inc_reg_if_not_full(E1000ECore *core, int index)
+{
+    if (core->mac[index] != 0xffffffff) {
+        core->mac[index]++;
+    }
+}
+
+static void
+grow_8reg_if_not_full(E1000ECore *core, int index, int size)
+{
+    uint64_t sum = core->mac[index] | (uint64_t)core->mac[index+1] << 32;
+
+    if (sum + size < sum) {
+        sum = ~0ULL;
+    } else {
+        sum += size;
+    }
+    core->mac[index] = sum;
+    core->mac[index+1] = sum >> 32;
+}
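The helper above keeps a 64-bit octet counter in two consecutive 32-bit
statistics registers and saturates instead of wrapping on overflow. A
standalone sketch of the same arithmetic (hypothetical mac[] array, not
the patch code):

```c
#include <assert.h>
#include <stdint.h>

/* Standalone sketch: a 64-bit statistics counter stored as two
 * consecutive 32-bit registers (low word first), which saturates
 * at UINT64_MAX instead of wrapping around. */
static void grow_8reg_sketch(uint32_t *mac, int index, uint32_t size)
{
    uint64_t sum = mac[index] | ((uint64_t)mac[index + 1] << 32);

    if (sum + size < sum) {     /* would overflow */
        sum = UINT64_MAX;       /* saturate */
    } else {
        sum += size;
    }
    mac[index] = (uint32_t)sum;
    mac[index + 1] = (uint32_t)(sum >> 32);
}
```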
+
+static void
+increase_size_stats(E1000ECore *core, const int *size_regs, int size)
+{
+    if (size > 1023) {
+        inc_reg_if_not_full(core, size_regs[5]);
+    } else if (size > 511) {
+        inc_reg_if_not_full(core, size_regs[4]);
+    } else if (size > 255) {
+        inc_reg_if_not_full(core, size_regs[3]);
+    } else if (size > 127) {
+        inc_reg_if_not_full(core, size_regs[2]);
+    } else if (size > 64) {
+        inc_reg_if_not_full(core, size_regs[1]);
+    } else if (size == 64) {
+        inc_reg_if_not_full(core, size_regs[0]);
+    }
+}
+
+static inline void
+process_ts_option(E1000ECore *core, struct e1000_tx_desc *dp)
+{
+    if (le32_to_cpu(dp->upper.data) & E1000_TXD_EXTCMD_TSTAMP) {
+        trace_e1000e_wrn_no_ts_support();
+    }
+}
+
+static inline void
+process_snap_option(E1000ECore *core, uint32_t cmd_and_length)
+{
+    if (cmd_and_length & E1000_TXD_CMD_SNAP) {
+        trace_e1000e_wrn_no_snap_support();
+    }
+}
+
+static void
+_e1000e_setup_tx_offloads(E1000ECore *core, struct e1000_tx *tx)
+{
+    if (tx->tse && tx->cptse) {
+        net_tx_pkt_build_vheader(tx->tx_pkt, true, true, tx->mss);
+        net_tx_pkt_update_ip_checksums(tx->tx_pkt);
+        inc_reg_if_not_full(core, TSCTC);
+        return;
+    }
+
+    if (tx->sum_needed & E1000_TXD_POPTS_TXSM) {
+        net_tx_pkt_build_vheader(tx->tx_pkt, false, true, 0);
+    }
+
+    if (tx->sum_needed & E1000_TXD_POPTS_IXSM) {
+        net_tx_pkt_update_ip_hdr_checksum(tx->tx_pkt);
+    }
+}
+
+static bool
+_e1000e_tx_pkt_send(E1000ECore *core, struct e1000_tx *tx)
+{
+    NetClientState *queue = qemu_get_queue(core->owner_nic);
+
+    _e1000e_setup_tx_offloads(core, tx);
+
+    net_tx_pkt_dump(tx->tx_pkt);
+
+    if ((core->phy[PHY_CTRL] & MII_CR_LOOPBACK) ||
+        ((core->mac[RCTL] & E1000_RCTL_LBM_MAC) == E1000_RCTL_LBM_MAC)) {
+        return net_tx_pkt_send_loopback(tx->tx_pkt, queue);
+    } else {
+        return net_tx_pkt_send(tx->tx_pkt, queue);
+    }
+}
+
+static void
+_e1000e_on_tx_done_update_stats(E1000ECore *core, struct NetTxPkt *tx_pkt)
+{
+    static const int PTCregs[6] = { PTC64, PTC127, PTC255, PTC511,
+                                    PTC1023, PTC1522 };
+
+    size_t tot_len = net_tx_pkt_get_total_len(tx_pkt);
+
+    increase_size_stats(core, PTCregs, tot_len);
+    inc_reg_if_not_full(core, TPT);
+    grow_8reg_if_not_full(core, TOTL, tot_len);
+
+    switch (net_tx_pkt_get_packet_type(tx_pkt)) {
+    case ETH_PKT_BCAST:
+        inc_reg_if_not_full(core, BPTC);
+        break;
+    case ETH_PKT_MCAST:
+        inc_reg_if_not_full(core, MPTC);
+        break;
+    case ETH_PKT_UCAST:
+        break;
+    default:
+        g_assert_not_reached();
+    }
+
+    core->mac[GPTC] = core->mac[TPT];
+    core->mac[GOTCL] = core->mac[TOTL];
+    core->mac[GOTCH] = core->mac[TOTH];
+}
+
+static void
+process_tx_desc(E1000ECore *core,
+                struct e1000_tx *tx,
+                struct e1000_tx_desc *dp)
+{
+    uint32_t txd_lower = le32_to_cpu(dp->lower.data);
+    uint32_t dtype = txd_lower & (E1000_TXD_CMD_DEXT | E1000_TXD_DTYP_D);
+    unsigned int split_size = txd_lower & 0xffff, op;
+    uint64_t addr;
+    struct e1000_context_desc *xp = (struct e1000_context_desc *)dp;
+    bool eop = txd_lower & E1000_TXD_CMD_EOP;
+
+    core->mit_ide |= (txd_lower & E1000_TXD_CMD_IDE);
+    if (dtype == E1000_TXD_CMD_DEXT) {    /* context descriptor */
+        op = le32_to_cpu(xp->cmd_and_length);
+        tx->ipcss = xp->lower_setup.ip_fields.ipcss;
+        tx->ipcso = xp->lower_setup.ip_fields.ipcso;
+        tx->ipcse = le16_to_cpu(xp->lower_setup.ip_fields.ipcse);
+        tx->tucss = xp->upper_setup.tcp_fields.tucss;
+        tx->tucso = xp->upper_setup.tcp_fields.tucso;
+        tx->tucse = le16_to_cpu(xp->upper_setup.tcp_fields.tucse);
+        tx->paylen = op & 0xfffff;
+        tx->hdr_len = xp->tcp_seg_setup.fields.hdr_len;
+        tx->mss = le16_to_cpu(xp->tcp_seg_setup.fields.mss);
+        tx->ip = (op & E1000_TXD_CMD_IP) ? 1 : 0;
+        tx->tcp = (op & E1000_TXD_CMD_TCP) ? 1 : 0;
+        tx->tse = (op & E1000_TXD_CMD_TSE) ? 1 : 0;
+        if (tx->tucso == 0) { /* this is probably wrong */
+            trace_e1000e_tx_cso_zero();
+            tx->tucso = tx->tucss + (tx->tcp ? 16 : 6);
+        }
+        process_snap_option(core, op);
+        return;
+    } else if (dtype == (E1000_TXD_CMD_DEXT | E1000_TXD_DTYP_D)) {
+        /* data descriptor */
+        tx->sum_needed = le32_to_cpu(dp->upper.data) >> 8;
+        tx->cptse = (txd_lower & E1000_TXD_CMD_TSE) ? 1 : 0;
+        process_ts_option(core, dp);
+    } else {
+        /* legacy descriptor */
+        process_ts_option(core, dp);
+        tx->cptse = 0;
+    }
+
+    addr = le64_to_cpu(dp->buffer_addr);
+
+    if (!tx->skip_cp) {
+        if (!net_tx_pkt_add_raw_fragment(tx->tx_pkt, addr, split_size)) {
+            tx->skip_cp = true;
+        }
+    }
+
+    if (eop) {
+        if (!tx->skip_cp) {
+            net_tx_pkt_parse(tx->tx_pkt);
+            if (vlan_enabled(core) && is_vlan_txd(txd_lower)) {
+                net_tx_pkt_setup_vlan_header_ex(tx->tx_pkt,
+                    le16_to_cpu(core->mac[VET]),
+                    le16_to_cpu(dp->upper.fields.special));
+            }
+            if (_e1000e_tx_pkt_send(core, tx)) {
+                _e1000e_on_tx_done_update_stats(core, tx->tx_pkt);
+            }
+        }
+
+        tx->skip_cp = false;
+        net_tx_pkt_reset(tx->tx_pkt);
+
+        tx->sum_needed = 0;
+        tx->cptse = 0;
+    }
+}
+
+static inline uint32_t
+_e1000e_tx_wb_interrupt_cause(E1000ECore *core)
+{
+    return msix_enabled(core->owner) ? E1000_ICR_TXQ0 : E1000_ICR_TXDW;
+}
+
+static inline uint32_t
+_e1000e_rx_wb_interrupt_cause(E1000ECore *core)
+{
+    return msix_enabled(core->owner) ? E1000_ICR_RXQ0 : E1000_ICS_RXT0;
+}
+
+static uint32_t
+txdesc_writeback(E1000ECore *core, dma_addr_t base, struct e1000_tx_desc *dp)
+{
+    uint32_t txd_upper, txd_lower = le32_to_cpu(dp->lower.data);
+
+    if (!(txd_lower & E1000_TXD_CMD_RS)) {
+        return 0;
+    }
+
+    txd_upper = le32_to_cpu(dp->upper.data) | E1000_TXD_STAT_DD;
+
+    dp->upper.data = cpu_to_le32(txd_upper);
+    pci_dma_write(core->owner, base + ((char *)&dp->upper - (char *)dp),
+                  &dp->upper, sizeof(dp->upper));
+    return _e1000e_tx_wb_interrupt_cause(core);
+}
+
+typedef struct tx_ring_st {
+    int tdbah;
+    int tdbal;
+    int tdlen;
+    int tdh;
+    int tdt;
+
+    struct e1000_tx *tx;
+} tx_ring;
+
+static inline int
+_e1000e_mq_reg(int reg_idx, int queue_idx)
+{
+    return reg_idx + (0x100 >> 2) * queue_idx;
+}
+
+static inline int
+_e1000e_mq_queue_idx(int base_reg_idx, int reg_idx)
+{
+    return (reg_idx - base_reg_idx) / (0x100 >> 2);
+}
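The two helpers above rely on the 82574 layout in which per-queue copies
of a register sit 0x100 bytes apart; since the register array is indexed
in 32-bit words, the stride is 0x100 >> 2 = 64 word indices. A standalone
round-trip sketch (illustrative names, not the patch code):

```c
#include <assert.h>

/* Standalone sketch: per-queue register copies are spaced 0x100 bytes
 * apart; register arrays are indexed in 32-bit words, so the stride
 * is 0x100 >> 2 = 64 word indices. */
static int mq_reg_sketch(int reg_idx, int queue_idx)
{
    return reg_idx + (0x100 >> 2) * queue_idx;
}

/* Inverse mapping: recover the queue index from a register index. */
static int mq_queue_idx_sketch(int base_reg_idx, int reg_idx)
{
    return (reg_idx - base_reg_idx) / (0x100 >> 2);
}
```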
+
+static inline void
+_e1000e_tx_ring_init(E1000ECore *core, tx_ring *txr, int idx)
+{
+    txr->tdbah = _e1000e_mq_reg(TDBAH, idx);
+    txr->tdbal = _e1000e_mq_reg(TDBAL, idx);
+    txr->tdlen = _e1000e_mq_reg(TDLEN, idx);
+    txr->tdh   = _e1000e_mq_reg(TDH,   idx);
+    txr->tdt   = _e1000e_mq_reg(TDT,   idx);
+
+    txr->tx    = &core->tx[idx];
+}
+
+static uint64_t tx_desc_base(E1000ECore *core, const tx_ring *txr)
+{
+    uint64_t bah = core->mac[txr->tdbah];
+    uint64_t bal = core->mac[txr->tdbal] & ~0xf;
+
+    return (bah << 32) + bal;
+}
+
+static void
+start_xmit(E1000ECore *core, const tx_ring *txr)
+{
+    dma_addr_t base;
+    struct e1000_tx_desc desc;
+    uint32_t tdh_start = core->mac[txr->tdh], cause = E1000_ICS_TXQE;
+
+    if (!(core->mac[TCTL] & E1000_TCTL_EN)) {
+        trace_e1000e_tx_disabled();
+        return;
+    }
+
+    while (core->mac[txr->tdh] != core->mac[txr->tdt]) {
+        base = tx_desc_base(core, txr) +
+               sizeof(struct e1000_tx_desc) * core->mac[txr->tdh];
+        pci_dma_read(core->owner, base, &desc, sizeof(desc));
+
+        trace_e1000e_tx_descr(core->mac[txr->tdh],
+               (void *)(intptr_t)desc.buffer_addr, desc.lower.data,
+               desc.upper.data);
+
+        process_tx_desc(core, txr->tx, &desc);
+        cause |= txdesc_writeback(core, base, &desc);
+
+        if (++core->mac[txr->tdh] * sizeof(desc) >= core->mac[txr->tdlen]) {
+            core->mac[txr->tdh] = 0;
+        }
+        /*
+         * the following could happen only if guest sw assigns
+         * bogus values to TDT/TDLEN.
+         * there's nothing too intelligent we could do about this.
+         */
+        if (core->mac[txr->tdh] == tdh_start) {
+            trace_e1000e_tdh_wraparound(tdh_start,
+                                        core->mac[txr->tdt],
+                                        core->mac[txr->tdlen]);
+            break;
+        }
+    }
+
+    set_interrupt_cause(core, cause);
+}
+
+static bool
+_e1000e_has_rxbufs(E1000ECore *core, size_t total_size)
+{
+    int bufs;
+    /* Fast-path short packets */
+    if (total_size <= core->rx_desc_buf_size) {
+        return core->mac[RDH] != core->mac[RDT];
+    }
+    if (core->mac[RDH] < core->mac[RDT]) {
+        bufs = core->mac[RDT] - core->mac[RDH];
+    } else if (core->mac[RDH] > core->mac[RDT]) {
+        bufs = core->mac[RDLEN] / core->rx_desc_len +
+            core->mac[RDT] - core->mac[RDH];
+    } else {
+        return false;
+    }
+    return total_size <= bufs * core->rx_desc_buf_size;
+}
+
+static void
+start_recv(E1000ECore *core)
+{
+    if (_e1000e_has_rxbufs(core, 1)) {
+        qemu_flush_queued_packets(qemu_get_queue(core->owner_nic));
+    }
+}
+
+int
+e1000e_can_receive(E1000ECore *core)
+{
+    return (core->mac[STATUS] & E1000_STATUS_LU) &&
+        (core->mac[RCTL] & E1000_RCTL_EN) &&
+        (core->owner->config[PCI_COMMAND] & PCI_COMMAND_MASTER) &&
+        _e1000e_has_rxbufs(core, 1);
+}
+
+ssize_t
+e1000e_receive(E1000ECore *core, const uint8_t *buf, size_t size)
+{
+    const struct iovec iov = {
+        .iov_base = (uint8_t *)buf,
+        .iov_len = size
+    };
+
+    return e1000e_receive_iov(core, &iov, 1);
+}
+
+static inline int
+vlan_rx_filter_enabled(E1000ECore *core)
+{
+    return ((core->mac[RCTL] & E1000_RCTL_VFE) != 0);
+}
+
+static inline int
+is_vlan_packet(E1000ECore *core, const uint8_t *buf)
+{
+    return (be16_to_cpup((uint16_t *)(buf + 12)) ==
+        le16_to_cpu(core->mac[VET]));
+}
+
+static bool
+receive_filter(E1000ECore *core, const uint8_t *buf, int size)
+{
+    static const int mta_shift[] = {4, 3, 2, 0};
+    uint32_t f, rctl = core->mac[RCTL], ra[2], *rp;
+
+    if (is_vlan_packet(core, buf) && vlan_rx_filter_enabled(core)) {
+        uint16_t vid = be16_to_cpup((uint16_t *)(buf + 14));
+        uint32_t vfta = le32_to_cpup((uint32_t *)(core->mac + VFTA) +
+                                     ((vid >> 5) & 0x7f));
+        if ((vfta & (1 << (vid & 0x1f))) == 0) {
+            return false;
+        }
+    }
+
+    switch (net_rx_pkt_get_packet_type(core->tx[0].rx_pkt)) {
+    case ETH_PKT_UCAST:
+        if (rctl & E1000_RCTL_UPE) {
+            return true; /* promiscuous ucast */
+        }
+        break;
+
+    case ETH_PKT_BCAST:
+        if (rctl & E1000_RCTL_BAM) {
+            return true; /* broadcast enabled */
+        }
+        break;
+
+    case ETH_PKT_MCAST:
+        if (rctl & E1000_RCTL_MPE) {
+            return true; /* promiscuous mcast */
+        }
+        break;
+
+    default:
+        g_assert_not_reached();
+    }
+
+    for (rp = core->mac + RA; rp < core->mac + RA + 32; rp += 2) {
+        if (!(rp[1] & E1000_RAH_AV)) {
+            continue;
+        }
+        ra[0] = cpu_to_le32(rp[0]);
+        ra[1] = cpu_to_le32(rp[1]);
+        if (!memcmp(buf, (uint8_t *)ra, 6)) {
+            trace_e1000e_rx_flt_ucast_match((int)(rp - core->mac - RA) / 2,
+                                           MAC_ARG(buf));
+            return true;
+        }
+    }
+    trace_e1000e_rx_flt_ucast_mismatch(MAC_ARG(buf));
+
+    f = mta_shift[(rctl >> E1000_RCTL_MO_SHIFT) & 3];
+    f = (((buf[5] << 8) | buf[4]) >> f) & 0xfff;
+    if (core->mac[MTA + (f >> 5)] & (1 << (f & 0x1f))) {
+        inc_reg_if_not_full(core, MPRC);
+        return true;
+    }
+
+    trace_e1000e_rx_flt_inexact_mismatch(MAC_ARG(buf),
+                                         (rctl >> E1000_RCTL_MO_SHIFT) & 3,
+                                         f >> 5,
+                                         core->mac[MTA + (f >> 5)]);
+
+    return false;
+}
+
+static uint64_t rx_desc_base(E1000ECore *core)
+{
+    uint64_t bah = core->mac[RDBAH];
+    uint64_t bal = core->mac[RDBAL] & ~0xf;
+
+    return (bah << 32) + bal;
+}
+
+/* FCS aka Ethernet CRC-32. We don't get it from backends and can't
+ * fill it in, so we just pad the descriptor length by 4 bytes unless
+ * the guest told us to strip it off the packet. */
+static inline int
+fcs_len(E1000ECore *core)
+{
+    return (core->mac[RCTL] & E1000_RCTL_SECRC) ? 0 : 4;
+}
+
+static void
+read_legacy_rx_descriptor(E1000ECore *core, uint8_t *desc, hwaddr *buff_addr)
+{
+    struct e1000_rx_desc *d = (struct e1000_rx_desc *) desc;
+    *buff_addr = le64_to_cpu(d->buffer_addr);
+}
+
+static void
+read_extended_rx_descriptor(E1000ECore *core, uint8_t *desc, hwaddr *buff_addr)
+{
+    union e1000_rx_desc_extended *d = (union e1000_rx_desc_extended *) desc;
+    *buff_addr = le64_to_cpu(d->read.buffer_addr);
+}
+
+static void
+read_ps_rx_descriptor(E1000ECore *core, uint8_t *desc,
+                      hwaddr (*buff_addr)[MAX_PS_BUFFERS])
+{
+    int i;
+    union e1000_rx_desc_packet_split *d =
+        (union e1000_rx_desc_packet_split *) desc;
+
+    for (i = 0; i < MAX_PS_BUFFERS; i++) {
+        (*buff_addr)[i] = le64_to_cpu(d->read.buffer_addr[i]);
+    }
+
+    trace_e1000e_rx_desc_ps_read((*buff_addr)[0], (*buff_addr)[1],
+                                 (*buff_addr)[2], (*buff_addr)[3]);
+}
+
+static void
+read_rx_descriptor(E1000ECore *core, uint8_t *desc,
+                   hwaddr (*buff_addr)[MAX_PS_BUFFERS])
+{
+    if (core->mac[RFCTL] & E1000_RFCTL_EXTEN) {
+        if (core->mac[RCTL] & E1000_RCTL_DTYP_PS) {
+            read_ps_rx_descriptor(core, desc, buff_addr);
+        } else {
+            read_extended_rx_descriptor(core, desc, &(*buff_addr)[0]);
+            (*buff_addr)[1] = (*buff_addr)[2] = (*buff_addr)[3] = 0;
+        }
+    } else {
+        read_legacy_rx_descriptor(core, desc, &(*buff_addr)[0]);
+        (*buff_addr)[1] = (*buff_addr)[2] = (*buff_addr)[3] = 0;
+    }
+}
+
+static void
+_e1000e_build_rx_metadata(struct NetRxPkt *pkt,
+                          bool is_eop,
+                          uint32_t *status_flags,
+                          uint16_t *ip_id,
+                          uint16_t *vlan_tag)
+{
+    struct virtio_net_hdr *vhdr;
+    bool isip4, isip6, istcp, isudp;
+    uint32_t pkt_type;
+
+    net_rx_pkt_get_protocols(pkt, &isip4, &isip6, &isudp, &istcp);
+
+    *status_flags = E1000_RXD_STAT_DD;
+
+    /* No additional metadata needed for non-EOP descriptors */
+    if (!is_eop) {
+        return;
+    }
+
+    *status_flags |= E1000_RXD_STAT_EOP;
+
+    /* VLAN state */
+    if (net_rx_pkt_is_vlan_stripped(pkt)) {
+        *status_flags |= E1000_RXD_STAT_VP;
+        *vlan_tag = cpu_to_le16(net_rx_pkt_get_vlan_tag(pkt));
+    }
+
+    /* Packet parsing results */
+    if (isip4) {
+        *status_flags |= E1000_RXD_STAT_IPIDV;
+        *ip_id = net_rx_pkt_get_ip_id(pkt);
+    }
+
+    if (istcp && net_rx_pkt_is_tcp_ack(pkt)) {
+        *status_flags |= E1000_RXD_STAT_ACK;
+    }
+
+    if (istcp || isudp) {
+        pkt_type = isip4 ? E1000_RXD_PKT_IP4_XDP : E1000_RXD_PKT_IP6_XDP;
+    } else if (isip4 || isip6) {
+        pkt_type = isip4 ? E1000_RXD_PKT_IP4 : E1000_RXD_PKT_IP6;
+    } else {
+        pkt_type = E1000_RXD_PKT_MAC;
+    }
+
+    *status_flags |= E1000_RXD_PKT_TYPE(pkt_type);
+
+    /* RX CSO information */
+    if (!net_rx_pkt_has_virt_hdr(pkt)) {
+        return;
+    }
+
+    vhdr = net_rx_pkt_get_vhdr(pkt);
+
+    if (!(vhdr->flags & VIRTIO_NET_HDR_F_DATA_VALID) &&
+        !(vhdr->flags & VIRTIO_NET_HDR_F_NEEDS_CSUM)) {
+        return;
+    }
+
+    *status_flags |= (isip4 ? E1000_RXD_STAT_IPIDV : 0) |
+                     (istcp ? E1000_RXD_STAT_TCPCS : 0) |
+                     (isudp ? E1000_RXD_STAT_UDPCS : 0);
+}
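The packet-type decision ladder above (L4 protocol first, then bare L3, else plain MAC frame) can be isolated as a pure function; the enum values below are placeholders for the sketch, not taken from the e1000 headers:

```c
#include <stdbool.h>

/* Placeholder encoding for the sketch; the real values come from the
 * E1000_RXD_PKT_* definitions. */
enum {
    PKT_MAC = 0,
    PKT_IP4,
    PKT_IP6,
    PKT_IP4_XDP,
    PKT_IP6_XDP
};

/* Same decision order as the metadata builder: transport protocol wins,
 * then network protocol, otherwise a plain MAC frame. */
static int classify_pkt(bool isip4, bool isip6, bool istcp, bool isudp)
{
    if (istcp || isudp) {
        return isip4 ? PKT_IP4_XDP : PKT_IP6_XDP;
    } else if (isip4 || isip6) {
        return isip4 ? PKT_IP4 : PKT_IP6;
    }
    return PKT_MAC;
}
```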
+
+static void
+write_legacy_rx_descriptor(E1000ECore *core, uint8_t *desc,
+                           struct NetRxPkt *pkt, uint16_t length)
+{
+    uint32_t status_flags;
+    uint16_t ip_id;
+
+    struct e1000_rx_desc *d = (struct e1000_rx_desc *) desc;
+
+    d->length = cpu_to_le16(length);
+
+    _e1000e_build_rx_metadata(pkt, pkt != NULL,
+                              &status_flags, &ip_id, &d->special);
+
+    d->status = (uint8_t) status_flags;
+}
+
+static void
+write_extended_rx_descriptor(E1000ECore *core, uint8_t *desc,
+                             struct NetRxPkt *pkt, uint16_t length)
+{
+    union e1000_rx_desc_extended *d = (union e1000_rx_desc_extended *) desc;
+
+    d->wb.upper.length = cpu_to_le16(length);
+
+    _e1000e_build_rx_metadata(pkt, pkt != NULL,
+                              &d->wb.upper.status_error,
+                              &d->wb.lower.hi_dword.csum_ip.ip_id,
+                              &d->wb.upper.vlan);
+}
+
+static void
+write_ps_rx_descriptor(E1000ECore *core, uint8_t *desc,
+                       struct NetRxPkt *pkt,
+                       uint16_t (*written)[MAX_PS_BUFFERS])
+{
+    int i;
+    union e1000_rx_desc_packet_split *d =
+        (union e1000_rx_desc_packet_split *) desc;
+
+    d->wb.middle.length0 = cpu_to_le16((*written)[0]);
+
+    for (i = 0; i < PS_PAGE_BUFFERS; i++) {
+        d->wb.upper.length[i] = cpu_to_le16((*written)[i + 1]);
+    }
+
+    _e1000e_build_rx_metadata(pkt, pkt != NULL,
+                              &d->wb.middle.status_error,
+                              &d->wb.lower.hi_dword.csum_ip.ip_id,
+                              &d->wb.middle.vlan);
+
+    trace_e1000e_rx_desc_ps_write((*written)[0], (*written)[1],
+                                  (*written)[2], (*written)[3]);
+}
+
+static void
+write_rx_descriptor(E1000ECore *core, uint8_t *desc,
+                    struct NetRxPkt *pkt,
+                    uint16_t (*written)[MAX_PS_BUFFERS])
+{
+    if (core->mac[RFCTL] & E1000_RFCTL_EXTEN) {
+        if (core->mac[RCTL] & E1000_RCTL_DTYP_PS) {
+            write_ps_rx_descriptor(core, desc, pkt, written);
+        } else {
+            write_extended_rx_descriptor(core, desc,
+                                         pkt, (*written)[0]);
+        }
+    } else {
+        write_legacy_rx_descriptor(core, desc, pkt, (*written)[0]);
+    }
+}
+
+typedef struct ba_state_st {
+    uint16_t written[MAX_PS_BUFFERS];
+    uint8_t cur_idx;
+} ba_state;
+
+static void
+write_to_rx_buffers(E1000ECore *core,
+                    hwaddr (*ba)[MAX_PS_BUFFERS],
+                    ba_state *bastate,
+                    const char *data,
+                    dma_addr_t data_len)
+{
+    while (data_len > 0) {
+        uint32_t cur_buf_len = core->rxbuf_sizes[bastate->cur_idx];
+        uint32_t cur_buf_bytes_left = cur_buf_len -
+                                      bastate->written[bastate->cur_idx];
+        uint32_t bytes_to_write = MIN(data_len, cur_buf_bytes_left);
+
+        trace_e1000e_rx_desc_buff_write(bastate->cur_idx,
+                                        (*ba)[bastate->cur_idx],
+                                        bastate->written[bastate->cur_idx],
+                                        data,
+                                        bytes_to_write);
+
+        pci_dma_write(core->owner,
+            (*ba)[bastate->cur_idx] + bastate->written[bastate->cur_idx],
+            data, bytes_to_write);
+
+        bastate->written[bastate->cur_idx] += bytes_to_write;
+        data += bytes_to_write;
+        data_len -= bytes_to_write;
+
+        if (bastate->written[bastate->cur_idx] == cur_buf_len) {
+            bastate->cur_idx++;
+        }
+
+        assert(bastate->cur_idx < MAX_PS_BUFFERS);
+    }
+}
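The scatter loop above can be exercised outside the device model; in this sketch `memcpy()` stands in for `pci_dma_write()` and the helper names are invented for illustration:

```c
#include <assert.h>
#include <stdint.h>
#include <string.h>

#define NBUFS 4  /* stands in for MAX_PS_BUFFERS */

/* Illustrative version of the scatter loop: payload bytes are copied
 * into per-descriptor buffers, advancing to the next buffer when the
 * current one fills.  written[] tracks fill levels across calls. */
static void scatter_write(uint8_t *dst[NBUFS], const uint32_t buf_len[NBUFS],
                          uint16_t written[NBUFS], uint8_t *cur_idx,
                          const uint8_t *data, size_t data_len)
{
    while (data_len > 0) {
        uint32_t left = buf_len[*cur_idx] - written[*cur_idx];
        size_t n = data_len < left ? data_len : left;

        memcpy(dst[*cur_idx] + written[*cur_idx], data, n);
        written[*cur_idx] += n;
        data += n;
        data_len -= n;

        if (written[*cur_idx] == buf_len[*cur_idx]) {
            (*cur_idx)++;
        }
        assert(*cur_idx < NBUFS);
    }
}

/* Split 12 bytes across 8-byte buffers: buffer 0 fills, buffer 1 gets 4. */
static int scatter_demo(void)
{
    uint8_t b0[8], b1[8], b2[8], b3[8];
    uint8_t *dst[NBUFS] = { b0, b1, b2, b3 };
    const uint32_t len[NBUFS] = { 8, 8, 8, 8 };
    uint16_t written[NBUFS] = { 0 };
    uint8_t idx = 0;
    const uint8_t payload[12] = "hello, rings";

    scatter_write(dst, len, written, &idx, payload, sizeof(payload));
    return written[0] == 8 && written[1] == 4 && idx == 1 &&
           memcmp(b0, payload, 8) == 0 && memcmp(b1, payload + 8, 4) == 0;
}
```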
+
+static void
+_e1000e_update_rx_stats(E1000ECore *core,
+                        size_t data_size,
+                        size_t data_fcs_size)
+{
+    static const int PRCregs[6] = { PRC64, PRC127, PRC255, PRC511,
+                                    PRC1023, PRC1522 };
+
+    increase_size_stats(core, PRCregs, data_fcs_size);
+    inc_reg_if_not_full(core, TPR);
+    core->mac[GPRC] = core->mac[TPR];
+    /* TOR - Total Octets Received:
+     * This register includes bytes received in a packet from the <Destination
+     * Address> field through the <CRC> field, inclusively.
+     * Always include FCS length (4) in size.
+     */
+    grow_8reg_if_not_full(core, TORL, data_size + 4);
+    core->mac[GORCL] = core->mac[TORL];
+    core->mac[GORCH] = core->mac[TORH];
+
+    switch (net_rx_pkt_get_packet_type(core->tx[0].rx_pkt)) {
+    case ETH_PKT_BCAST:
+        inc_reg_if_not_full(core, BPRC);
+        break;
+
+    case ETH_PKT_MCAST:
+        inc_reg_if_not_full(core, MPRC);
+        break;
+
+    default:
+        break;
+    }
+}
+
+static bool
+_e1000e_write_paket_to_guest(E1000ECore *core, struct NetRxPkt *pkt)
+{
+    PCIDevice *d = core->owner;
+    dma_addr_t base;
+    uint8_t desc[E1000_MAX_RX_DESC_LEN];
+    size_t desc_size;
+    size_t desc_offset = 0;
+    size_t iov_ofs = 0;
+    uint32_t rdh_start = core->mac[RDH];
+
+    struct iovec *iov = net_rx_pkt_get_iovec(pkt);
+    size_t size = net_rx_pkt_get_total_len(pkt);
+    size_t total_size = size + fcs_len(core);
+
+    if (!_e1000e_has_rxbufs(core, total_size)) {
+        set_interrupt_cause(core, E1000_ICS_RXO);
+        return false;
+    }
+
+    do {
+        hwaddr ba[MAX_PS_BUFFERS];
+        ba_state bastate = { 0 };
+        bool is_last = false;
+
+        desc_size = total_size - desc_offset;
+
+        if (desc_size > core->rx_desc_buf_size) {
+            desc_size = core->rx_desc_buf_size;
+        }
+        base = rx_desc_base(core) + core->rx_desc_len * core->mac[RDH];
+        pci_dma_read(d, base, &desc, core->rx_desc_len);
+
+        read_rx_descriptor(core, desc, &ba);
+
+        if (ba[0]) {
+            if (desc_offset < size) {
+                static const uint32_t fcs_pad;
+                size_t iov_copy;
+                size_t copy_size = size - desc_offset;
+                if (copy_size > core->rx_desc_buf_size) {
+                    copy_size = core->rx_desc_buf_size;
+                }
+                do {
+                    iov_copy = MIN(copy_size, iov->iov_len - iov_ofs);
+
+                    write_to_rx_buffers(core, &ba, &bastate,
+                                        iov->iov_base + iov_ofs, iov_copy);
+
+                    copy_size -= iov_copy;
+                    iov_ofs += iov_copy;
+                    if (iov_ofs == iov->iov_len) {
+                        iov++;
+                        iov_ofs = 0;
+                    }
+                } while (copy_size);
+
+                /* Simulate FCS checksum presence */
+                write_to_rx_buffers(core, &ba, &bastate,
+                                    (const char *) &fcs_pad, fcs_len(core));
+            }
+            desc_offset += desc_size;
+            if (desc_offset >= total_size) {
+                is_last = true;
+            }
+        } else { /* as per Intel docs; skip descriptors with null buf addr */
+            trace_e1000e_rx_null_descriptor();
+        }
+
+        write_rx_descriptor(core, desc, is_last ? core->tx[0].rx_pkt : NULL,
+                            &bastate.written);
+        pci_dma_write(d, base, &desc, core->rx_desc_len);
+
+        if (++core->mac[RDH] * core->rx_desc_len >= core->mac[RDLEN]) {
+            core->mac[RDH] = 0;
+        }
+        /* see comment in start_xmit; same here */
+        if (core->mac[RDH] == rdh_start) {
+            trace_e1000e_rx_err_wraparound(rdh_start,
+                                           core->mac[RDT],
+                                           core->mac[RDLEN]);
+            set_interrupt_cause(core, E1000_ICS_RXO);
+            return false;
+        }
+    } while (desc_offset < total_size);
+
+    _e1000e_update_rx_stats(core, size, total_size);
+
+    return true;
+}
+
+ssize_t
+e1000e_receive_iov(E1000ECore *core, const struct iovec *iov, int iovcnt)
+{
+    /* This is the size past which hardware will
+     * drop packets when setting LPE=0 */
+    static const int MAXIMUM_ETHERNET_VLAN_SIZE = 1522;
+    /* This is the size past which hardware will
+     * drop packets when setting LPE=1 */
+    static const int MAXIMUM_ETHERNET_LPE_SIZE = 16384;
+
+    static const int MAXIMUM_ETHERNET_HDR_LEN = (14 + 4);
+
+    /* Minimum octets in an Ethernet frame sans FCS */
+    static const int MIN_BUF_SIZE = 60;
+
+    unsigned int n, rdt;
+    uint8_t min_buf[MIN_BUF_SIZE];
+    struct iovec min_iov;
+    uint8_t *filter_buf;
+    size_t size, orig_size;
+    size_t iov_ofs = 0;
+
+    if (!(core->mac[STATUS] & E1000_STATUS_LU)) {
+        return -1;
+    }
+
+    if (!(core->mac[RCTL] & E1000_RCTL_EN)) {
+        return -1;
+    }
+
+    /* Pull virtio header in */
+    if (core->has_vnet) {
+        net_rx_pkt_set_vhdr_iovec(core->tx[0].rx_pkt, iov, iovcnt);
+        iov_ofs = sizeof(struct virtio_net_hdr);
+    }
+
+    filter_buf = iov->iov_base + iov_ofs;
+    orig_size = iov_size(iov, iovcnt);
+    size = orig_size - iov_ofs;
+
+    /* Pad to minimum Ethernet frame length */
+    if (size < sizeof(min_buf)) {
+        iov_to_buf(iov, iovcnt, iov_ofs, min_buf, size);
+        memset(&min_buf[size], 0, sizeof(min_buf) - size);
+        inc_reg_if_not_full(core, RUC);
+        min_iov.iov_base = filter_buf = min_buf;
+        min_iov.iov_len = size = sizeof(min_buf);
+        iovcnt = 1;
+        iov = &min_iov;
+        iov_ofs = 0;
+    } else if (iov->iov_len < MAXIMUM_ETHERNET_HDR_LEN) {
+        /* This is very unlikely, but may happen. */
+        iov_to_buf(iov, iovcnt, iov_ofs, min_buf, MAXIMUM_ETHERNET_HDR_LEN);
+        filter_buf = min_buf;
+    }
+
+    /* Discard oversized packets if !LPE and !SBP. */
+    if ((size > MAXIMUM_ETHERNET_LPE_SIZE ||
+        (size > MAXIMUM_ETHERNET_VLAN_SIZE
+        && !(core->mac[RCTL] & E1000_RCTL_LPE)))
+        && !(core->mac[RCTL] & E1000_RCTL_SBP)) {
+        inc_reg_if_not_full(core, ROC);
+        return orig_size;
+    }
+
+    net_rx_pkt_set_packet_type(core->tx[0].rx_pkt,
+        get_eth_packet_type(PKT_GET_ETH_HDR(filter_buf)));
+
+    if (!receive_filter(core, filter_buf, size)) {
+        return orig_size;
+    }
+
+    net_rx_pkt_attach_iovec_ex(core->tx[0].rx_pkt, iov, iovcnt, iov_ofs,
+                                  vlan_enabled(core),
+                                  le16_to_cpu(core->mac[VET]));
+
+    if (!_e1000e_write_paket_to_guest(core, core->tx[0].rx_pkt)) {
+        return -1;
+    }
+
+    n = _e1000e_rx_wb_interrupt_cause(core);
+    rdt = core->mac[RDT];
+    if (rdt < core->mac[RDH]) {
+        rdt += core->mac[RDLEN] / core->rx_desc_len;
+    }
+    if (((rdt - core->mac[RDH]) * core->rx_desc_len) <=
+        (core->mac[RDLEN] >> core->rxbuf_min_shift)) {
+        n |= E1000_ICS_RXDMT0;
+    }
+
+    set_interrupt_cause(core, n);
+
+    return orig_size;
+}
+
+static void
+e1000_link_down(E1000ECore *core)
+{
+    core->mac[STATUS] &= ~E1000_STATUS_LU;
+    core->phy[PHY_STATUS] &= ~MII_SR_LINK_STATUS;
+    core->phy[PHY_STATUS] &= ~MII_SR_AUTONEG_COMPLETE;
+    core->phy[PHY_LP_ABILITY] &= ~MII_LPAR_LPACK;
+}
+
+static void
+e1000_link_up(E1000ECore *core)
+{
+    core->mac[STATUS] |= E1000_STATUS_LU;
+    core->phy[PHY_STATUS] |= MII_SR_LINK_STATUS;
+}
+
+static bool
+have_autoneg(E1000ECore *core)
+{
+    return (core->compat_flags & E1000_FLAG_AUTONEG) &&
+        (core->phy[PHY_CTRL] & MII_CR_AUTO_NEG_EN);
+}
+
+static void
+set_phy_ctrl(E1000ECore *core, int index, uint16_t val)
+{
+    /* bits 0-5 reserved; MII_CR_[RESTART_AUTO_NEG,RESET] are self clearing */
+    core->phy[PHY_CTRL] = val & ~(0x3f |
+                                  MII_CR_RESET |
+                                  MII_CR_RESTART_AUTO_NEG);
+
+    /*
+     * QEMU 1.3 does not support link auto-negotiation emulation, so if we
+     * migrate during auto negotiation, after migration the link will be
+     * down.
+     */
+    if (have_autoneg(core) && (val & MII_CR_RESTART_AUTO_NEG)) {
+        e1000_link_down(core);
+        trace_e1000e_core_start_link_negotiation();
+        timer_mod(core->autoneg_timer,
+                  qemu_clock_get_ms(QEMU_CLOCK_VIRTUAL) + 500);
+    }
+}
+
+static void
+set_phy_page(E1000ECore *core, int index, uint16_t val)
+{
+    core->phy[PHY_PAGE] = val & PHY_PAGE_RW_MASK;
+}
+
+void
+e1000e_core_set_link_status(E1000ECore *core)
+{
+    NetClientState *nc = qemu_get_queue(core->owner_nic);
+    uint32_t old_status = core->mac[STATUS];
+
+    if (nc->link_down) {
+        e1000_link_down(core);
+    } else {
+        if (have_autoneg(core) &&
+            !(core->phy[PHY_STATUS] & MII_SR_AUTONEG_COMPLETE)) {
+            /* emulate auto-negotiation if supported */
+            timer_mod(core->autoneg_timer,
+                      qemu_clock_get_ms(QEMU_CLOCK_VIRTUAL) + 500);
+        } else {
+            e1000_link_up(core);
+        }
+    }
+
+    if (core->mac[STATUS] != old_status) {
+        set_interrupt_cause(core, E1000_ICR_LSC);
+    }
+}
+
+static void
+set_ctrl(E1000ECore *core, int index, uint32_t val)
+{
+    /* RST is self clearing */
+    core->mac[CTRL] = val & ~E1000_CTRL_RST;
+}
+
+static uint32_t
+parse_rxbufsize_e1000(uint32_t rctl)
+{
+    rctl &= E1000_RCTL_BSEX | E1000_RCTL_SZ_16384 | E1000_RCTL_SZ_8192 |
+            E1000_RCTL_SZ_4096 | E1000_RCTL_SZ_2048 | E1000_RCTL_SZ_1024 |
+            E1000_RCTL_SZ_512 | E1000_RCTL_SZ_256;
+    switch (rctl) {
+    case E1000_RCTL_BSEX | E1000_RCTL_SZ_16384:
+        return 16384;
+    case E1000_RCTL_BSEX | E1000_RCTL_SZ_8192:
+        return 8192;
+    case E1000_RCTL_BSEX | E1000_RCTL_SZ_4096:
+        return 4096;
+    case E1000_RCTL_SZ_1024:
+        return 1024;
+    case E1000_RCTL_SZ_512:
+        return 512;
+    case E1000_RCTL_SZ_256:
+        return 256;
+    }
+    return 2048;
+}
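The RCTL decode above can be restated as a table-free sketch. The bit values below follow the documented 8254x RCTL layout (BSIZE in bits 17:16, BSEX in bit 25) but are restated here as assumptions for illustration; real code should use the e1000 header definitions:

```c
#include <stdint.h>

/* Assumed RCTL bit values for the sketch (BSIZE bits 17:16, BSEX bit 25). */
#define RCTL_SZ_2048  0x00000000u
#define RCTL_SZ_1024  0x00010000u
#define RCTL_SZ_512   0x00020000u
#define RCTL_SZ_256   0x00030000u
#define RCTL_BSEX     0x02000000u

/* Mirror of the legacy-mode buffer-size decode: the same two BSIZE bits
 * select 1024/512/256 normally, or 16384/8192/4096 when BSEX is set;
 * BSIZE == 00 always means the default 2048 bytes. */
static uint32_t rctl_bufsize(uint32_t rctl)
{
    uint32_t sz = rctl & (RCTL_SZ_1024 | RCTL_SZ_512 | RCTL_SZ_256);

    if (rctl & RCTL_BSEX) {
        switch (sz) {
        case RCTL_SZ_1024: return 16384;
        case RCTL_SZ_512:  return 8192;
        case RCTL_SZ_256:  return 4096;
        }
    } else {
        switch (sz) {
        case RCTL_SZ_1024: return 1024;
        case RCTL_SZ_512:  return 512;
        case RCTL_SZ_256:  return 256;
        }
    }
    return 2048; /* default BSIZE == 00 */
}
```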
+
+static void
+calc_per_desc_buf_size(E1000ECore *core)
+{
+    int i;
+    core->rx_desc_buf_size = 0;
+
+    for (i = 0; i < ARRAY_SIZE(core->rxbuf_sizes); i++) {
+        core->rx_desc_buf_size += core->rxbuf_sizes[i];
+    }
+}
+
+static void
+parse_rxbufsize(E1000ECore *core)
+{
+    uint32_t rctl = core->mac[RCTL];
+
+    memset(core->rxbuf_sizes, 0, sizeof(core->rxbuf_sizes));
+
+    if (rctl & E1000_RCTL_DTYP_MASK) {
+        uint32_t bsize;
+
+        bsize = core->mac[PSRCTL] & E1000_PSRCTL_BSIZE0_MASK;
+        core->rxbuf_sizes[0] = (bsize >> E1000_PSRCTL_BSIZE0_SHIFT) * 128;
+
+        bsize = core->mac[PSRCTL] & E1000_PSRCTL_BSIZE1_MASK;
+        core->rxbuf_sizes[1] = (bsize >> E1000_PSRCTL_BSIZE1_SHIFT) * 1024;
+
+        bsize = core->mac[PSRCTL] & E1000_PSRCTL_BSIZE2_MASK;
+        core->rxbuf_sizes[2] = (bsize >> E1000_PSRCTL_BSIZE2_SHIFT) * 1024;
+
+        bsize = core->mac[PSRCTL] & E1000_PSRCTL_BSIZE3_MASK;
+        core->rxbuf_sizes[3] = (bsize >> E1000_PSRCTL_BSIZE3_SHIFT) * 1024;
+    } else if (rctl & E1000_RCTL_FLXBUF_MASK) {
+        int flxbuf = rctl & E1000_RCTL_FLXBUF_MASK;
+        core->rxbuf_sizes[0] = (flxbuf >> E1000_RCTL_FLXBUF_SHIFT) * 1024;
+    } else {
+        core->rxbuf_sizes[0] = parse_rxbufsize_e1000(rctl);
+    }
+
+    trace_e1000e_rx_desc_buff_sizes(core->rxbuf_sizes[0], core->rxbuf_sizes[1],
+                                    core->rxbuf_sizes[2], core->rxbuf_sizes[3]);
+
+    calc_per_desc_buf_size(core);
+}
+
+static void
+calc_rxdesclen(E1000ECore *core)
+{
+    if (core->mac[RFCTL] & E1000_RFCTL_EXTEN) {
+        if (core->mac[RCTL] & E1000_RCTL_DTYP_PS) {
+            core->rx_desc_len = sizeof(union e1000_rx_desc_packet_split);
+        } else {
+            core->rx_desc_len = sizeof(union e1000_rx_desc_extended);
+        }
+    } else {
+        core->rx_desc_len = sizeof(struct e1000_rx_desc);
+    }
+}
+
+static void
+set_rx_control(E1000ECore *core, int index, uint32_t val)
+{
+    core->mac[RCTL] = val;
+    trace_e1000e_core_set_rxctl(core->mac[RDT], core->mac[RCTL]);
+
+    if (val & E1000_RCTL_EN) {
+        parse_rxbufsize(core);
+        calc_rxdesclen(core);
+        core->rxbuf_min_shift = ((val / E1000_RCTL_RDMTS_QUAT) & 3) + 1;
+        qemu_flush_queued_packets(qemu_get_queue(core->owner_nic));
+    }
+}
+
+static void (*phyreg_writeops[])(E1000ECore *, int, uint16_t) = {
+    [PHY_CTRL] = set_phy_ctrl, [PHY_PAGE] = set_phy_page
+};
+
+enum { NPHYWRITEOPS = ARRAY_SIZE(phyreg_writeops) };
+
+/* Helper function, *curr == 0 means the value is not set */
+static inline void
+mit_update_delay(uint32_t *curr, uint32_t value)
+{
+    if (value && (*curr == 0 || value < *curr)) {
+        *curr = value;
+    }
+}
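The helper above keeps the smallest nonzero delay seen so far (zero meaning "unset"). Combined the way the interrupt-state update does below, with TADV/RADV in 1024 ns units (hence the x4 into 256 ns units) and ITR already in 256 ns units, it reduces to a min over the active sources. The combining function name is invented for the sketch:

```c
#include <stdint.h>

/* Keep the smallest nonzero delay; zero means "unset" on both sides. */
static void min_nonzero(uint32_t *curr, uint32_t value)
{
    if (value && (*curr == 0 || value < *curr)) {
        *curr = value;
    }
}

/* Combine the mitigation sources as the interrupt-state update does:
 * TADV/RADV are in 1024 ns units (x4 converts to 256 ns units), ITR is
 * already in 256 ns units.  Hypothetical helper for illustration. */
static uint32_t mit_delay_256ns(uint32_t tadv, uint32_t radv, uint32_t itr)
{
    uint32_t d = 0;

    min_nonzero(&d, tadv * 4);
    min_nonzero(&d, radv * 4);
    min_nonzero(&d, itr);
    return d;
}
```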
+
+static void
+clear_ims_bits(E1000ECore *core, uint32_t bits)
+{
+    core->mac[IMS] &= ~bits;
+}
+
+static void
+_e1000e_msix_notify_one(E1000ECore *core, uint32_t cause, uint32_t int_cfg)
+{
+    if (E1000_IVAR_ENTRY_VALID(int_cfg)) {
+        uint32_t vec = E1000_IVAR_ENTRY_VEC(int_cfg);
+        if (vec < E1000E_MSIX_VEC_NUM) {
+            msix_notify(core->owner, vec);
+            trace_e1000e_irq_msix_notify_vec(vec);
+        } else {
+            trace_e1000e_wrn_msix_vec_wrong(cause, int_cfg);
+        }
+    } else {
+        trace_e1000e_wrn_msix_invalid(cause, int_cfg);
+    }
+
+    if (core->mac[CTRL_EXT] & E1000_CTRL_EXT_EIAME) {
+        clear_ims_bits(core, core->mac[IAM] & cause);
+    }
+
+    core->mac[ICR] &= ~(core->mac[EIAC] & E1000_EIAC_MASK);
+}
+
+static void
+_e1000e_msix_notify(E1000ECore *core, uint32_t causes)
+{
+    if (causes & E1000_ICR_RXQ0) {
+        _e1000e_msix_notify_one(core, E1000_ICR_RXQ0,
+                                E1000_IVAR_RXQ0(core->mac[IVAR]));
+    }
+
+    if (causes & E1000_ICR_RXQ1) {
+        _e1000e_msix_notify_one(core, E1000_ICR_RXQ1,
+                                E1000_IVAR_RXQ1(core->mac[IVAR]));
+    }
+
+    if (causes & E1000_ICR_TXQ0) {
+        _e1000e_msix_notify_one(core, E1000_ICR_TXQ0,
+                                E1000_IVAR_TXQ0(core->mac[IVAR]));
+    }
+
+    if (causes & E1000_ICR_TXQ1) {
+        _e1000e_msix_notify_one(core, E1000_ICR_TXQ1,
+                                E1000_IVAR_TXQ1(core->mac[IVAR]));
+    }
+
+    if (causes & ~(E1000_ICR_RXQ0 |
+                   E1000_ICR_RXQ1 |
+                   E1000_ICR_TXQ0 |
+                   E1000_ICR_TXQ1)) {
+        _e1000e_msix_notify_one(core, E1000_ICR_OTHER,
+                                E1000_IVAR_OTHER(core->mac[IVAR]));
+    }
+}
+
+static void
+_e1000e_fix_icr_asserted(E1000ECore *core)
+{
+    core->mac[ICR] &= ~E1000_ICR_ASSERTED;
+    if (core->mac[ICR]) {
+        core->mac[ICR] |= E1000_ICR_ASSERTED;
+    }
+}
+
+static void
+_e1000e_update_interrupt_state(E1000ECore *core)
+{
+    uint32_t pending_ints;
+    uint32_t mit_delay;
+
+    _e1000e_fix_icr_asserted(core);
+
+    /*
+     * Make sure ICR and ICS registers have the same value.
+     * The spec says that the ICS register is write-only.  However in practice,
+     * on real hardware ICS is readable, and for reads it has the same value as
+     * ICR (except that ICS does not have the clear on read behaviour of ICR).
+     *
+     * The VxWorks PRO/1000 driver uses this behaviour.
+     */
+    core->mac[ICS] = core->mac[ICR];
+
+    pending_ints = (core->mac[IMS] & core->mac[ICR]);
+    if (!core->mit_irq_level && pending_ints) {
+        /*
+         * Here we detect a potential raising edge. We postpone raising the
+         * interrupt line if we are inside the mitigation delay window
+         * (s->mit_timer_on == 1).
+         * We provide a partial implementation of interrupt mitigation,
+         * emulating only RADV, TADV and ITR (lower 16 bits, 1024ns units for
+         * RADV and TADV, 256ns units for ITR). RDTR is only used to enable
+         * RADV; relative timers based on TIDV and RDTR are not implemented.
+         */
+        if (core->mit_timer_on) {
+            return;
+        }
+        if (core->compat_flags & E1000_FLAG_MIT) {
+            /* Compute the next mitigation delay according to pending
+             * interrupts and the current values of RADV (provided
+             * RDTR!=0), TADV and ITR.
+             * Then rearm the timer.
+             */
+            mit_delay = 0;
+            if (core->mit_ide &&
+                    (pending_ints & (E1000_ICR_TXQE | E1000_ICR_TXDW))) {
+                mit_update_delay(&mit_delay, core->mac[TADV] * 4);
+            }
+            if (core->mac[RDTR] && (pending_ints & E1000_ICS_RXT0)) {
+                mit_update_delay(&mit_delay, core->mac[RADV] * 4);
+            }
+            mit_update_delay(&mit_delay, core->mac[ITR]);
+
+            if (mit_delay) {
+                core->mit_timer_on = 1;
+                timer_mod(core->mit_timer,
+                          qemu_clock_get_ns(QEMU_CLOCK_VIRTUAL) +
+                          mit_delay * 256);
+            }
+            core->mit_ide = 0;
+        }
+    }
+
+    core->mit_irq_level = (pending_ints != 0);
+
+    trace_e1000e_irq_legacy_notify(core->mit_irq_level);
+    if (core->mit_irq_level) {
+        inc_reg_if_not_full(core, IAC);
+    }
+
+    pci_set_irq(core->owner, core->mit_irq_level);
+}
+
+static void
+send_msi(E1000ECore *core, uint32_t causes)
+{
+    causes &= core->mac[IMS] & ~E1000_ICR_ASSERTED;
+
+    if (causes == 0) {
+        return;
+    }
+
+    if (msix_enabled(core->owner)) {
+        trace_e1000e_irq_msix_notify(causes);
+        _e1000e_msix_notify(core, causes);
+        return;
+    }
+
+    if (msi_enabled(core->owner)) {
+        trace_e1000e_irq_msi_notify(causes);
+        msi_notify(core->owner, 0);
+        return;
+    }
+}
+
+static void
+set_interrupt_cause(E1000ECore *core, uint32_t val)
+{
+    trace_e1000e_irq_set_cause(val);
+
+    core->mac[ICR] |= val;
+    _e1000e_update_interrupt_state(core);
+    send_msi(core, val);
+}
+
+static void
+_e1000e_mit_timer(void *opaque)
+{
+    E1000ECore *core = opaque;
+
+    core->mit_timer_on = 0;
+    /* Call set_interrupt_cause to update the irq level (if necessary). */
+    set_interrupt_cause(core, core->mac[ICR]);
+}
+
+static void
+_e1000e_autoneg_timer(void *opaque)
+{
+    E1000ECore *core = opaque;
+    if (!qemu_get_queue(core->owner_nic)->link_down) {
+        e1000_link_up(core);
+        core->phy[PHY_LP_ABILITY] |= MII_LPAR_LPACK;
+        core->phy[PHY_STATUS] |= MII_SR_AUTONEG_COMPLETE;
+        trace_e1000e_core_link_negotiation_done();
+        /* Signal link status change to guest */
+        set_interrupt_cause(core, E1000_ICR_LSC);
+    }
+}
+
+static const char phy_regcap[0x20] = {
+    [PHY_STATUS]        = PHY_R,  [M88E1000_EXT_PHY_SPEC_CTRL] = PHY_RW,
+    [PHY_ID1]           = PHY_R,  [M88E1000_PHY_SPEC_CTRL] =     PHY_RW,
+    [PHY_CTRL]          = PHY_RW, [PHY_1000T_CTRL] =             PHY_RW,
+    [PHY_LP_ABILITY]    = PHY_R,  [PHY_1000T_STATUS] =           PHY_R,
+    [PHY_AUTONEG_ADV]   = PHY_RW, [M88E1000_RX_ERR_CNTR] =       PHY_R,
+    [PHY_ID2]           = PHY_R,  [M88E1000_PHY_SPEC_STATUS] =   PHY_R,
+    [PHY_AUTONEG_EXP]   = PHY_R,  [PHY_PAGE] =                   PHY_RW,
+    [PHY_OEM_BITS]      = PHY_RW, [PHY_BIAS_1] =                 PHY_RW,
+    [PHY_BIAS_2]        = PHY_RW
+};
+
+static bool
+phy_reg_check_cap(E1000ECore *core, uint32_t addr, char cap)
+{
+    return phy_regcap[addr] & cap;
+}
+
+static void
+phy_reg_write(E1000ECore *core, uint32_t addr, uint16_t data)
+{
+    if (addr < NPHYWRITEOPS && phyreg_writeops[addr]) {
+        phyreg_writeops[addr](core, addr, data);
+    } else {
+        core->phy[addr] = data;
+    }
+}
+
+static void
+set_mdic(E1000ECore *core, int index, uint32_t val)
+{
+    uint32_t data = val & E1000_MDIC_DATA_MASK;
+    uint32_t addr = ((val & E1000_MDIC_REG_MASK) >> E1000_MDIC_REG_SHIFT);
+
+    if ((val & E1000_MDIC_PHY_MASK) >> E1000_MDIC_PHY_SHIFT != 1) { /* phy # */
+        val = core->mac[MDIC] | E1000_MDIC_ERROR;
+    } else if (val & E1000_MDIC_OP_READ) {
+        trace_e1000e_core_mdic_read(addr, data);
+        if (!phy_reg_check_cap(core, addr, PHY_R)) {
+            trace_e1000e_core_mdic_read_unhandled(addr);
+            val |= E1000_MDIC_ERROR;
+        } else {
+            val = (val ^ data) | core->phy[addr];
+        }
+    } else if (val & E1000_MDIC_OP_WRITE) {
+        trace_e1000e_core_mdic_write(addr, data);
+        if (!phy_reg_check_cap(core, addr, PHY_W)) {
+            trace_e1000e_core_mdic_write_unhandled(addr);
+            val |= E1000_MDIC_ERROR;
+        } else {
+            phy_reg_write(core, addr, data);
+        }
+    }
+    core->mac[MDIC] = val | E1000_MDIC_READY;
+
+    if (val & E1000_MDIC_INT_EN) {
+        set_interrupt_cause(core, E1000_ICR_MDAC);
+    }
+}
+
+static uint32_t
+get_eecd(E1000ECore *core, int index)
+{
+    uint32_t ret = E1000_EECD_PRES | E1000_EECD_GNT |
+                   core->eecd_state.old_eecd;
+
+    trace_e1000e_core_eeeprom_read(core->eecd_state.bitnum_out,
+                                   core->eecd_state.reading);
+    if (!core->eecd_state.reading ||
+        ((core->eeprom[(core->eecd_state.bitnum_out >> 4) & 0x3f] >>
+          ((core->eecd_state.bitnum_out & 0xf) ^ 0xf))) & 1) {
+        ret |= E1000_EECD_DO;
+    }
+    return ret;
+}
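The bit extraction in the EECD read above is dense: the EEPROM is an array of 16-bit words shifted out MSB-first, with `bitnum >> 4` selecting the word and `(bitnum & 0xf) ^ 0xf` (i.e. `15 - (bitnum & 0xf)`) selecting the shift. A standalone sketch of the same expression, with invented helper names:

```c
#include <stdint.h>

/* Same extraction as the EECD read path: word index from the high bits
 * of bitnum, MSB-first bit position from the low nibble. */
static int eeprom_bit(const uint16_t *eeprom, unsigned int bitnum)
{
    return (eeprom[(bitnum >> 4) & 0x3f] >> ((bitnum & 0xf) ^ 0xf)) & 1;
}

/* Word 0 = 0x8001: bit 0 (MSB-first) and bit 15 are set;
 * word 1 = 0x4000: bit 17 is set. */
static int eeprom_demo(void)
{
    static const uint16_t img[64] = { [0] = 0x8001, [1] = 0x4000 };

    return eeprom_bit(img, 0) == 1 && eeprom_bit(img, 1) == 0 &&
           eeprom_bit(img, 15) == 1 && eeprom_bit(img, 16) == 0 &&
           eeprom_bit(img, 17) == 1;
}
```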
+
+static uint32_t
+flash_eerd_read(E1000ECore *core, int x)
+{
+    unsigned int index, r = core->mac[EERD] & ~E1000_EEPROM_RW_REG_START;
+
+    if ((core->mac[EERD] & E1000_EEPROM_RW_REG_START) == 0) {
+        return core->mac[EERD];
+    }
+
+    index = r >> E1000_EEPROM_RW_ADDR_SHIFT;
+    if (index > EEPROM_CHECKSUM_REG) {
+        return E1000_EEPROM_RW_REG_DONE | r;
+    }
+
+    return (core->eeprom[index] << E1000_EEPROM_RW_REG_DATA) |
+           E1000_EEPROM_RW_REG_DONE | r;
+}
+
+static void
+set_rdt(E1000ECore *core, int index, uint32_t val)
+{
+    core->mac[index] = val & 0xffff;
+    start_recv(core);
+}
+
+static void
+set_16bit(E1000ECore *core, int index, uint32_t val)
+{
+    core->mac[index] = val & 0xffff;
+}
+
+static void
+set_dlen(E1000ECore *core, int index, uint32_t val)
+{
+    core->mac[index] = val & 0xfff80;
+}
+
+static void
+set_tctl(E1000ECore *core, int index, uint32_t val)
+{
+    tx_ring txr;
+    core->mac[index] = val;
+
+    _e1000e_tx_ring_init(core, &txr, 0);
+    start_xmit(core, &txr);
+
+    if (core->mac[TARC1] & E1000_TARC_ENABLE) {
+        _e1000e_tx_ring_init(core, &txr, 1);
+        start_xmit(core, &txr);
+    }
+}
+
+static void
+set_tdt(E1000ECore *core, int index, uint32_t val)
+{
+    tx_ring txr;
+
+    core->mac[index] = val & 0xffff;
+
+    _e1000e_tx_ring_init(core, &txr, _e1000e_mq_queue_idx(TDT, index));
+    start_xmit(core, &txr);
+}
+
+static void
+set_ics(E1000ECore *core, int index, uint32_t val)
+{
+    trace_e1000e_core_set_ics(val);
+    set_interrupt_cause(core, val);
+}
+
+static void
+set_icr(E1000ECore *core, int index, uint32_t val)
+{
+    trace_e1000e_core_icr_write(val);
+
+    if ((core->mac[ICR] & E1000_ICR_ASSERTED) &&
+        (core->mac[CTRL_EXT] & E1000_CTRL_EXT_IAME)) {
+        clear_ims_bits(core, core->mac[IAM]);
+    }
+
+    core->mac[ICR] &= ~val;
+    _e1000e_update_interrupt_state(core);
+}
+
+static void
+set_imc(E1000ECore *core, int index, uint32_t val)
+{
+    clear_ims_bits(core, val);
+    _e1000e_update_interrupt_state(core);
+}
+
+static void
+set_ims(E1000ECore *core, int index, uint32_t val)
+{
+    core->mac[IMS] |= val;
+    _e1000e_update_interrupt_state(core);
+}
+
+static uint32_t
+mac_readreg(E1000ECore *core, int index)
+{
+    return core->mac[index];
+}
+
+static uint32_t
+mac_ics_read(E1000ECore *core, int index)
+{
+    uint32_t val = core->mac[ICS];
+    trace_e1000e_core_read_ics(val);
+    return val;
+}
+
+static uint32_t
+mac_ims_read(E1000ECore *core, int index)
+{
+    return core->mac[IMS];
+}
+
+static uint32_t
+mac_low11_read(E1000ECore *core, int index)
+{
+    return core->mac[index] & 0x7ff;
+}
+
+static uint32_t
+mac_low13_read(E1000ECore *core, int index)
+{
+    return core->mac[index] & 0x1fff;
+}
+
+static uint32_t
+mac_swsm_read(E1000ECore *core, int index)
+{
+    uint32_t val = core->mac[SWSM];
+    core->mac[SWSM] = val | 1;
+    return val;
+}
+
+static uint32_t
+mac_icr_read(E1000ECore *core, int index)
+{
+    uint32_t ret = core->mac[ICR];
+
+    if (core->mac[IMS] == 0) {
+        core->mac[ICR] = 0;
+    }
+
+    if ((core->mac[ICR] & E1000_ICR_ASSERTED) &&
+        (core->mac[CTRL_EXT] & E1000_CTRL_EXT_IAME)) {
+        core->mac[ICR] = 0;
+        clear_ims_bits(core, core->mac[IAM]);
+    }
+
+    _e1000e_update_interrupt_state(core);
+    trace_e1000e_core_icr_read(ret);
+    return ret;
+}
+
+static uint32_t
+mac_read_clr4(E1000ECore *core, int index)
+{
+    uint32_t ret = core->mac[index];
+
+    core->mac[index] = 0;
+    return ret;
+}
+
+static uint32_t
+mac_read_clr8(E1000ECore *core, int index)
+{
+    uint32_t ret = core->mac[index];
+
+    core->mac[index] = 0;
+    core->mac[index - 1] = 0;
+    return ret;
+}
+
+static uint32_t
+get_status(E1000ECore *core, int index)
+{
+    bool gio_disable = core->mac[CTRL] & E1000_CTRL_GIO_MASTER_DISABLE;
+    uint32_t mask = gio_disable ? ~E1000_STATUS_GIO_MASTER_ENABLE : ~0L;
+
+    return core->mac[STATUS] & mask;
+}
+
+static uint32_t
+get_tarc(E1000ECore *core, int index)
+{
+    return core->mac[index] & ((BIT(11) - 1) |
+                                BIT(27)      |
+                                BIT(28)      |
+                                BIT(29)      |
+                                BIT(30));
+}
+
+static uint32_t
+get_pbs(E1000ECore *core, int index)
+{
+    return core->mac[index] & 0x3f;
+}
+
+static void
+mac_writereg(E1000ECore *core, int index, uint32_t val)
+{
+    uint32_t macaddr[2];
+
+    core->mac[index] = val;
+
+    if (index == RA + 1) {
+        macaddr[0] = cpu_to_le32(core->mac[RA]);
+        macaddr[1] = cpu_to_le32(core->mac[RA + 1]);
+        qemu_format_nic_info_str(qemu_get_queue(core->owner_nic),
+                                 (uint8_t *)macaddr);
+    }
+}
+
+static void
+set_eecd(E1000ECore *core, int index, uint32_t val)
+{
+    g_warning("e1000e EECD write not implemented");
+}
+
+static void
+set_eerd(E1000ECore *core, int index, uint32_t val)
+{
+    uint32_t addr = (val >> E1000_NVM_RW_ADDR_SHIFT) & E1000_NVM_ADDR_MASK;
+    uint32_t data;
+
+    if (!(val & E1000_NVM_RW_REG_START)) {
+        return;
+    }
+
+    if (addr > EEPROM_CHECKSUM_REG) {
+        return;
+    }
+
+    data = core->eeprom[addr];
+
+    core->mac[EERD] = E1000_NVM_RW_REG_DONE             |
+                      (addr << E1000_NVM_RW_ADDR_SHIFT) |
+                      (data << E1000_NVM_RW_REG_DATA);
+}
+
+static void
+set_psrctl(E1000ECore *core, int index, uint32_t val)
+{
+    if ((val & E1000_PSRCTL_BSIZE0_MASK) == 0) {
+        hw_error("e1000e: PSRCTL.BSIZE0 cannot be zero");
+    }
+
+    if ((val & E1000_PSRCTL_BSIZE1_MASK) == 0) {
+        hw_error("e1000e: PSRCTL.BSIZE1 cannot be zero");
+    }
+
+    core->mac[PSRCTL] = val;
+}
+
+static void
+set_rxcsum(E1000ECore *core, int index, uint32_t val)
+{
+    qemu_set_offload(qemu_get_queue(core->owner_nic),
+                     !!(val & E1000_RXCSUM_TUOFLD),
+                     0, 0, 0, 0);
+
+    core->mac[RXCSUM] = val;
+}
+
+static void
+set_gcr(E1000ECore *core, int index, uint32_t val)
+{
+    uint32_t ro_bits = core->mac[GCR] & E1000_GCR_RO_BITS;
+    core->mac[GCR] = (val & ~E1000_GCR_RO_BITS) | ro_bits;
+}
+
+
+#define getreg(x)    [x] = mac_readreg
+
+static uint32_t (*macreg_readops[])(E1000ECore *, int) = {
+    getreg(PBA),      getreg(RCTL),     getreg(TDH),      getreg(TXDCTL),
+    getreg(WUFC),     getreg(TDT),      getreg(CTRL),     getreg(LEDCTL),
+    getreg(MANC),     getreg(MDIC),     getreg(TORL),
+    getreg(TOTL),     getreg(FCRUC),    getreg(TCTL),     getreg(RDH),
+    getreg(RDT),      getreg(VET),      getreg(AIT),      getreg(TDBAL),
+    getreg(TDBAH),    getreg(RDBAH),    getreg(RDBAL),    getreg(TDLEN),
+    getreg(TDLEN1),   getreg(TDBAL1),   getreg(TDBAH1),   getreg(TDH1),
+    getreg(TDT1),     getreg(RDLEN),    getreg(RDTR),     getreg(RADV),
+    getreg(TADV),     getreg(ITR),      getreg(SCC),      getreg(ECOL),
+    getreg(MCC),      getreg(LATECOL),  getreg(COLC),     getreg(DC),
+    getreg(TNCRS),    getreg(SEC),      getreg(CEXTERR),  getreg(RLEC),
+    getreg(XONRXC),   getreg(XONTXC),   getreg(XOFFRXC),  getreg(XOFFTXC),
+    getreg(WUC),      getreg(WUS),      getreg(IPAV),     getreg(RFC),
+    getreg(RJC),      getreg(GORCL),    getreg(GOTCL),    getreg(RNBC),
+    getreg(TSCTFC),   getreg(MGTPRC),   getreg(MGTPDC),   getreg(MGTPTC),
+    getreg(GCR),      getreg(TIMINCA),
+    getreg(IAM),      getreg(EIAC),     getreg(IVAR),     getreg(CTRL_EXT),
+    getreg(RFCTL),    getreg(PSRCTL),   getreg(POEMB),    getreg(MFUTP01),
+    getreg(MFUTP23),  getreg(MANC2H),   getreg(MFVAL),    getreg(FACTPS),
+    getreg(RXCSUM),   getreg(FUNCTAG),  getreg(GSCL_1),   getreg(EXTCNF_CTRL),
+    getreg(GSCL_2),   getreg(GSCL_3),   getreg(GSCL_4),   getreg(GSCN_0),
+    getreg(GSCN_1),   getreg(GSCN_2),   getreg(GSCN_3),   getreg(GCR2),
+
+    [TOTH] = mac_read_clr8,   [TORH] = mac_read_clr8,   [GOTCH] = mac_read_clr8,
+    [GORCH] = mac_read_clr8,
+    [PRC64] = mac_read_clr4,  [PRC127] = mac_read_clr4, [PRC255] = mac_read_clr4,
+    [PRC511] = mac_read_clr4, [PRC1023] = mac_read_clr4, [PRC1522] = mac_read_clr4,
+    [PTC64] = mac_read_clr4,  [PTC127] = mac_read_clr4, [PTC255] = mac_read_clr4,
+    [PTC511] = mac_read_clr4, [PTC1023] = mac_read_clr4, [PTC1522] = mac_read_clr4,
+    [GPRC] = mac_read_clr4,   [GPTC] = mac_read_clr4,   [TPR] = mac_read_clr4,
+    [TPT] = mac_read_clr4,    [RUC] = mac_read_clr4,    [ROC] = mac_read_clr4,
+    [BPRC] = mac_read_clr4,   [MPRC] = mac_read_clr4,   [MPTC] = mac_read_clr4,
+    [BPTC] = mac_read_clr4,   [IAC] = mac_read_clr4,    [TSCTC] = mac_read_clr4,
+    [ICR] = mac_icr_read,     [EECD] = get_eecd,        [EERD] = flash_eerd_read,
+    [ICS] = mac_ics_read,
+    [IMS] = mac_ims_read,
+    [RDFH] = mac_low13_read,  [RDFT] = mac_low13_read,
+    [RDFHS] = mac_low13_read, [RDFTS] = mac_low13_read, [RDFPC] = mac_low13_read,
+    [TDFH] = mac_low13_read,  [TDFT] = mac_low13_read,
+    [TDFHS] = mac_low13_read, [TDFTS] = mac_low13_read, [TDFPC] = mac_low13_read,
+    [STATUS] = get_status,    [TARC0] = get_tarc,       [TARC1] = get_tarc,
+    [PBS] = get_pbs,          [SWSM] = mac_swsm_read,
+
+    [CRCERRS ... MPC] = &mac_readreg,
+    [IP6AT ... IP6AT+3] = &mac_readreg, [IP4AT ... IP4AT+6] = &mac_readreg,
+    [RA ... RA+31] = &mac_readreg,
+    [WUPM ... WUPM+31] = &mac_readreg,
+    [MTA ... MTA+127] = &mac_readreg,
+    [VFTA ... VFTA+127] = &mac_readreg,
+    [FFMT ... FFMT+254] = &mac_readreg, [FFVT ... FFVT+254] = &mac_readreg,
+    [PBM ... PBM+16383] = &mac_readreg,
+    [MDEF ... MDEF + 7] = &mac_readreg,
+    [FFLT ... FFLT + 10] = &mac_readreg,
+    [FTFT ... FTFT + 254] = &mac_readreg,
+};
+enum { NREADOPS = ARRAY_SIZE(macreg_readops) };
+
+#define putreg(x)    [x] = mac_writereg
+static void (*macreg_writeops[])(E1000ECore *, int, uint32_t) = {
+    putreg(PBA),      putreg(EERD),     putreg(SWSM),     putreg(WUFC),
+    putreg(TDBAL),    putreg(TDBAH),    putreg(TXDCTL),   putreg(RDBAH),
+    putreg(RDBAL),    putreg(LEDCTL),   putreg(VET),      putreg(FCRUC),
+    putreg(AIT),      putreg(TDFH),     putreg(TDFT),     putreg(TDFHS),
+    putreg(TDFTS),    putreg(TDFPC),    putreg(WUC),      putreg(WUS),
+    putreg(RDFH),     putreg(RDFT),     putreg(RDFHS),    putreg(RDFTS),
+    putreg(RDFPC),    putreg(IPAV),     putreg(TDBAL1),   putreg(TDBAH1),
+    putreg(TIMINCA),  putreg(IAM),      putreg(EIAC),     putreg(IVAR),
+    putreg(CTRL_EXT), putreg(RFCTL),    putreg(TARC0),    putreg(TARC1),
+    putreg(POEMB),    putreg(PBS),      putreg(MFUTP01),  putreg(MFUTP23),
+    putreg(MANC),     putreg(MANC2H),   putreg(MFVAL),    putreg(EXTCNF_CTRL),
+    putreg(FACTPS),   putreg(FUNCTAG),  putreg(GSCL_1),   putreg(GSCL_2),
+    putreg(GSCL_3),   putreg(GSCL_4),   putreg(GSCN_0),   putreg(GSCN_1),
+    putreg(GSCN_2),   putreg(GSCN_3),   putreg(GCR2),
+
+    [TDLEN1] = set_dlen, [TDH1] = set_16bit,     [TDT1] = set_tdt,
+    [TDLEN] = set_dlen, [RDLEN] = set_dlen,      [TCTL] = set_tctl,
+    [TDT] = set_tdt,    [MDIC] = set_mdic,       [ICS] = set_ics,
+    [TDH] = set_16bit,  [RDH] = set_16bit,       [RDT] = set_rdt,
+    [IMC] = set_imc,    [IMS] = set_ims,         [ICR] = set_icr,
+    [EECD] = set_eecd,  [RCTL] = set_rx_control, [CTRL] = set_ctrl,
+    [RDTR] = set_16bit, [RADV] = set_16bit,      [TADV] = set_16bit,
+    [ITR] = set_16bit,  [EERD] = set_eerd,       [GCR] = set_gcr,
+    [PSRCTL] = set_psrctl, [RXCSUM] = set_rxcsum,
+
+    [IP6AT ... IP6AT+3] = &mac_writereg, [IP4AT ... IP4AT+6] = &mac_writereg,
+    [RA ... RA+31] = &mac_writereg,
+    [WUPM ... WUPM+31] = &mac_writereg,
+    [MTA ... MTA+127] = &mac_writereg,
+    [VFTA ... VFTA+127] = &mac_writereg,
+    [FFMT ... FFMT+254] = &mac_writereg, [FFVT ... FFVT+254] = &mac_writereg,
+    [PBM ... PBM+16383] = &mac_writereg,
+    [MDEF ... MDEF + 7] = &mac_writereg,
+    [FFLT ... FFLT + 10] = &mac_writereg,
+    [FTFT ... FTFT + 254] = &mac_writereg,
+};
+
+enum { NWRITEOPS = ARRAY_SIZE(macreg_writeops) };
+
+void
+e1000e_core_write(E1000ECore *core, hwaddr addr, uint64_t val, unsigned size)
+{
+    unsigned int index = (addr & 0x1ffff) >> 2;
+
+    if (index < NWRITEOPS && macreg_writeops[index]) {
+        trace_e1000e_core_write(index << 2, size, val);
+        macreg_writeops[index](core, index, val);
+    } else if (index < NREADOPS && macreg_readops[index]) {
+        trace_e1000e_wrn_regs_write_ro(index << 2, size, val);
+    } else {
+        trace_e1000e_wrn_regs_write_unknown(index << 2, size, val);
+    }
+}
+
+uint64_t
+e1000e_core_read(E1000ECore *core, hwaddr addr, unsigned size)
+{
+    uint64_t val;
+    unsigned int index = (addr & 0x1ffff) >> 2;
+
+    if (index < NREADOPS && macreg_readops[index]) {
+        val = macreg_readops[index](core, index);
+        trace_e1000e_core_read(index << 2, size, val);
+        return val;
+    }
+    trace_e1000e_wrn_regs_read_unknown(index << 2, size);
+    return 0;
+}
+
+static void
+e1000e_core_prepare_eeprom(E1000ECore     *core,
+                           const uint16_t *templ,
+                           uint32_t        templ_size,
+                           const uint8_t  *macaddr)
+{
+    PCIDeviceClass *pdc = PCI_DEVICE_GET_CLASS(core->owner);
+    uint16_t checksum = 0;
+    int i;
+
+    memmove(core->eeprom, templ, templ_size);
+
+    for (i = 0; i < 3; i++) {
+        core->eeprom[i] = (macaddr[2 * i + 1] << 8) | macaddr[2 * i];
+    }
+
+    core->eeprom[11] = core->eeprom[13] = pdc->device_id;
+
+    for (i = 0; i < EEPROM_CHECKSUM_REG; i++) {
+        checksum += core->eeprom[i];
+    }
+
+    checksum = (uint16_t)(EEPROM_SUM - checksum);
+
+    core->eeprom[EEPROM_CHECKSUM_REG] = checksum;
+}
+
+void
+e1000e_core_pci_realize(E1000ECore     *core,
+                        const uint16_t *eeprom_templ,
+                        uint32_t        eeprom_size,
+                        const uint8_t  *macaddr)
+{
+    int i;
+
+    core->autoneg_timer = timer_new_ms(QEMU_CLOCK_VIRTUAL,
+                                       _e1000e_autoneg_timer, core);
+    core->mit_timer = timer_new_ns(QEMU_CLOCK_VIRTUAL,
+                                   _e1000e_mit_timer, core);
+
+    for (i = 0; i < ARRAY_SIZE(core->tx); i++) {
+        net_tx_pkt_init(&core->tx[i].tx_pkt,
+            E1000E_MAX_TX_FRAGS, core->has_vnet);
+        net_rx_pkt_init(&core->tx[i].rx_pkt,
+            core->has_vnet);
+    }
+
+    e1000e_core_prepare_eeprom(core, eeprom_templ, eeprom_size, macaddr);
+}
+
+void
+e1000e_core_pci_uninit(E1000ECore *core)
+{
+    int i;
+
+    timer_del(core->autoneg_timer);
+    timer_free(core->autoneg_timer);
+    timer_del(core->mit_timer);
+    timer_free(core->mit_timer);
+
+    for (i = 0; i < ARRAY_SIZE(core->tx); i++) {
+        net_tx_pkt_reset(core->tx[i].tx_pkt);
+        net_tx_pkt_uninit(core->tx[i].tx_pkt);
+        net_rx_pkt_uninit(core->tx[i].rx_pkt);
+    }
+}
+
+/* PHY_ID2 documented in 8254x_GBe_SDM.pdf, p. 250 */
+static const uint16_t phy_reg_init[] = {
+    [PHY_CTRL] =   MII_CR_SPEED_SELECT_MSB |
+                   MII_CR_FULL_DUPLEX |
+                   MII_CR_AUTO_NEG_EN,
+
+    [PHY_STATUS] = MII_SR_EXTENDED_CAPS |
+                   MII_SR_LINK_STATUS |   /* link initially up */
+                   MII_SR_AUTONEG_CAPS |
+                   /* MII_SR_AUTONEG_COMPLETE: initially NOT completed */
+                   MII_SR_PREAMBLE_SUPPRESS |
+                   MII_SR_EXTENDED_STATUS |
+                   MII_SR_10T_HD_CAPS |
+                   MII_SR_10T_FD_CAPS |
+                   MII_SR_100X_HD_CAPS |
+                   MII_SR_100X_FD_CAPS,
+
+    [PHY_ID1] = 0x141,
+    /* [PHY_ID2] configured per DevId, from e1000_reset() */
+    [PHY_AUTONEG_ADV] = 0xde1,
+    [PHY_LP_ABILITY] = 0x1e0,
+    [PHY_1000T_CTRL] = 0x0e00,
+    [PHY_1000T_STATUS] = 0x3c00,
+    [M88E1000_PHY_SPEC_CTRL] = 0x360,
+    [M88E1000_PHY_SPEC_STATUS] = 0xac00,
+    [M88E1000_EXT_PHY_SPEC_CTRL] = 0x0d60,
+};
+
+static const uint32_t mac_reg_init[] = {
+    [PBA] =     0x00100030,
+    [LEDCTL] =  0x602,
+    [CTRL] =    E1000_CTRL_SWDPIN2 | E1000_CTRL_SWDPIN0 |
+                E1000_CTRL_SPD_1000 | E1000_CTRL_SLU,
+    [STATUS] =  0x80000000 | E1000_STATUS_GIO_MASTER_ENABLE |
+                E1000_STATUS_ASDV | E1000_STATUS_MTXCKOK |
+                E1000_STATUS_SPEED_1000 | E1000_STATUS_FD |
+                E1000_STATUS_LU,
+    [PSRCTL]  = (2 << E1000_PSRCTL_BSIZE0_SHIFT) |
+                (4 << E1000_PSRCTL_BSIZE1_SHIFT) |
+                (4 << E1000_PSRCTL_BSIZE2_SHIFT),
+    [TARC0]   = 0x3 | E1000_TARC_ENABLE,
+    [TARC1]   = 0x3 | E1000_TARC_ENABLE,
+    [EECD]    = E1000_EECD_AUTO_RD,
+    [EERD]    = E1000_NVM_RW_REG_DONE,
+    [GCR]     = E1000_L0S_ADJUST |
+                E1000_L1_ENTRY_LATENCY_MSB |
+                E1000_L1_ENTRY_LATENCY_LSB,
+    [TDFH]    = 0x600,
+    [TDFT]    = 0x600,
+    [TDFHS]   = 0x600,
+    [TDFTS]   = 0x600,
+    [POEMB]   = 0x30D,
+    [PBS]     = 0x028,
+    [MANC]    = E1000_MANC_DIS_IP_CHK_ARP,
+    [FACTPS]  = E1000_FACTPS_LAN0_ON | 0x20000000,
+    [SWSM]    = 1,
+    [RXCSUM]  = E1000_RXCSUM_IPOFLD | E1000_RXCSUM_TUOFLD
+};
+
+void
+e1000e_core_reset(E1000ECore *core, uint8_t *macaddr, uint16_t phy_id2)
+{
+    int i;
+
+    timer_del(core->autoneg_timer);
+    timer_del(core->mit_timer);
+    core->mit_timer_on = 0;
+    core->mit_irq_level = 0;
+    core->mit_ide = 0;
+
+    memset(core->phy, 0, sizeof core->phy);
+    memmove(core->phy, phy_reg_init, sizeof phy_reg_init);
+    core->phy[PHY_ID2] = phy_id2;
+    memset(core->mac, 0, sizeof core->mac);
+    memmove(core->mac, mac_reg_init, sizeof mac_reg_init);
+
+    core->rxbuf_min_shift = 1;
+
+    if (qemu_get_queue(core->owner_nic)->link_down) {
+        e1000_link_down(core);
+    }
+
+    /* Some guests expect pre-initialized RAH/RAL (AddrValid flag + MACaddr) */
+    core->mac[RA] = 0;
+    core->mac[RA + 1] = E1000_RAH_AV;
+    for (i = 0; i < 4; i++) {
+        core->mac[RA] |= macaddr[i] << (8 * i);
+        core->mac[RA + 1] |= (i < 2) ? macaddr[i + 4] << (8 * i) : 0;
+    }
+
+    for (i = 0; i < ARRAY_SIZE(core->tx); i++) {
+        net_tx_pkt_reset(core->tx[i].tx_pkt);
+        core->tx[i].sum_needed = 0;
+        core->tx[i].ipcss = 0;
+        core->tx[i].ipcso = 0;
+        core->tx[i].ipcse = 0;
+        core->tx[i].tucss = 0;
+        core->tx[i].tucso = 0;
+        core->tx[i].tucse = 0;
+        core->tx[i].hdr_len = 0;
+        core->tx[i].mss = 0;
+        core->tx[i].paylen = 0;
+        core->tx[i].ip = 0;
+        core->tx[i].tcp = 0;
+        core->tx[i].tse = 0;
+        core->tx[i].cptse = 0;
+        core->tx[i].skip_cp = 0;
+    }
+}
+
+void e1000e_core_pre_save(E1000ECore *core)
+{
+    int i;
+    NetClientState *nc = qemu_get_queue(core->owner_nic);
+
+    /* If the mitigation timer is active, emulate a timeout now. */
+    if (core->mit_timer_on) {
+        _e1000e_mit_timer(core);
+    }
+
+    /*
+     * If link is down and auto-negotiation is supported and ongoing,
+     * complete auto-negotiation immediately. This allows us to look
+     * at MII_SR_AUTONEG_COMPLETE to infer link status on load.
+     */
+    if (nc->link_down && have_autoneg(core)) {
+        core->phy[PHY_STATUS] |= MII_SR_AUTONEG_COMPLETE;
+    }
+
+    for (i = 0; i < ARRAY_SIZE(core->tx); i++) {
+        if (net_tx_pkt_has_fragments(core->tx[i].tx_pkt)) {
+            core->tx[i].skip_cp = true;
+        }
+    }
+}
+
+int
+e1000e_core_post_load(E1000ECore *core)
+{
+    int i;
+
+    NetClientState *nc = qemu_get_queue(core->owner_nic);
+
+    for (i = 0; i < ARRAY_SIZE(core->tx); i++) {
+        net_tx_pkt_init(&core->tx[i].tx_pkt,
+            E1000E_MAX_TX_FRAGS, core->has_vnet);
+        net_rx_pkt_init(&core->tx[i].rx_pkt,
+            core->has_vnet);
+    }
+
+    if (!(core->compat_flags & E1000_FLAG_MIT)) {
+        core->mac[ITR] = core->mac[RDTR] = core->mac[RADV] =
+            core->mac[TADV] = 0;
+        core->mit_irq_level = false;
+    }
+    core->mit_ide = 0;
+    core->mit_timer_on = false;
+
+    /* nc.link_down can't be migrated, so infer link_down according
+     * to link status bit in core.mac[STATUS].
+     * Alternatively, restart link negotiation if it was in progress. */
+    nc->link_down = (core->mac[STATUS] & E1000_STATUS_LU) == 0;
+
+    if (have_autoneg(core) &&
+        !(core->phy[PHY_STATUS] & MII_SR_AUTONEG_COMPLETE)) {
+        nc->link_down = false;
+        timer_mod(core->autoneg_timer,
+                  qemu_clock_get_ms(QEMU_CLOCK_VIRTUAL) + 500);
+    }
+
+    return 0;
+}
diff --git a/hw/net/e1000e_core.h b/hw/net/e1000e_core.h
new file mode 100644
index 0000000..78e4834
--- /dev/null
+++ b/hw/net/e1000e_core.h
@@ -0,0 +1,181 @@
+/*
+ * Core code for QEMU e1000e emulation
+ *
+ * Software developer's manual:
+ * http://www.intel.com/content/dam/doc/datasheet/82574l-gbe-controller-datasheet.pdf
+ *
+ * Copyright (c) 2015 Ravello Systems LTD (http://ravellosystems.com)
+ * Developed by Daynix Computing LTD (http://www.daynix.com)
+ *
+ * Authors:
+ * Dmitry Fleytman <dmitry@daynix.com>
+ * Leonid Bloch <leonid@daynix.com>
+ * Yan Vugenfirer <yan@daynix.com>
+ *
+ * Based on work done by:
+ * Nir Peleg, Tutis Systems Ltd. for Qumranet Inc.
+ * Copyright (c) 2008 Qumranet
+ * Based on work done by:
+ * Copyright (c) 2007 Dan Aloni
+ * Copyright (c) 2004 Antony T Curtis
+ *
+ * This library is free software; you can redistribute it and/or
+ * modify it under the terms of the GNU Lesser General Public
+ * License as published by the Free Software Foundation; either
+ * version 2 of the License, or (at your option) any later version.
+ *
+ * This library is distributed in the hope that it will be useful,
+ * but WITHOUT ANY WARRANTY; without even the implied warranty of
+ * MERCHANTABILITY or FITNESS FOR A PARTICULAR PURPOSE.  See the GNU
+ * Lesser General Public License for more details.
+ *
+ * You should have received a copy of the GNU Lesser General Public
+ * License along with this library; if not, see <http://www.gnu.org/licenses/>.
+ */
+
+#define E1000E_PHY_SIZE     (0x20)
+#define E1000E_MAC_SIZE     (0x8000)
+#define E1000E_EEPROM_SIZE  (64)
+#define E1000E_MSIX_VEC_NUM (5)
+#define E1000E_NUM_TX_RINGS (2)
+
+
+enum { PHY_R = 1, PHY_W = 2, PHY_RW = PHY_R | PHY_W };
+
+typedef struct E1000Regs_st {
+    uint32_t mac[E1000E_MAC_SIZE];
+    uint16_t phy[E1000E_PHY_SIZE];
+    uint16_t eeprom[E1000E_EEPROM_SIZE];
+
+    uint32_t rxbuf_sizes[E1000_PSRCTL_BUFFS_PER_DESC];
+    uint32_t rx_desc_buf_size;
+    uint32_t rxbuf_min_shift;
+    uint8_t rx_desc_len;
+
+    struct {
+        uint32_t val_in;    /* shifted in from guest driver */
+        uint16_t bitnum_in;
+        uint16_t bitnum_out;
+        uint16_t reading;
+        uint32_t old_eecd;
+    } eecd_state;
+
+    QEMUTimer *autoneg_timer;
+    QEMUTimer *mit_timer;      /* Mitigation timer. */
+    bool mit_timer_on;         /* Mitigation timer is running. */
+    bool mit_irq_level;        /* Tracks interrupt pin level. */
+    uint32_t mit_ide;          /* Tracks E1000_TXD_CMD_IDE bit. */
+
+    /* Compatibility flags for migration to/from qemu 1.3.0 and older */
+#define E1000_FLAG_AUTONEG_BIT 0
+#define E1000_FLAG_MIT_BIT 1
+#define E1000_FLAG_AUTONEG (1 << E1000_FLAG_AUTONEG_BIT)
+#define E1000_FLAG_MIT (1 << E1000_FLAG_MIT_BIT)
+    uint32_t compat_flags;
+
+    struct e1000_tx {
+        unsigned char sum_needed;
+        uint8_t ipcss;
+        uint8_t ipcso;
+        uint16_t ipcse;
+        uint8_t tucss;
+        uint8_t tucso;
+        uint16_t tucse;
+        uint8_t hdr_len;
+        uint16_t mss;
+        uint32_t paylen;
+        int8_t ip;
+        int8_t tcp;
+        bool tse;
+        bool cptse;
+
+        struct NetTxPkt *tx_pkt;
+        bool skip_cp;
+
+        struct NetRxPkt *rx_pkt;
+    } tx[E1000E_NUM_TX_RINGS];
+
+    bool has_vnet;
+
+    NICState *owner_nic;
+    PCIDevice *owner;
+    void (*owner_start_recv)(PCIDevice *d);
+} E1000ECore;
+
+#define defreg(x)   x = (E1000_##x >> 2)
+enum {
+    defreg(CTRL),    defreg(EECD),    defreg(EERD),    defreg(GPRC),
+    defreg(GPTC),    defreg(ICR),     defreg(ICS),     defreg(IMC),
+    defreg(IMS),     defreg(LEDCTL),  defreg(MANC),    defreg(MDIC),
+    defreg(MPC),     defreg(PBA),     defreg(RCTL),    defreg(RDBAH),
+    defreg(RDBAL),   defreg(RDH),     defreg(RDLEN),   defreg(RDT),
+    defreg(STATUS),  defreg(SWSM),    defreg(TCTL),    defreg(TDBAH),
+    defreg(TDBAL),   defreg(TDH),     defreg(TDLEN),   defreg(TDT),
+    defreg(TDLEN1),  defreg(TDBAL1),  defreg(TDBAH1),  defreg(TDH1),
+    defreg(TDT1),    defreg(TORH),    defreg(TORL),    defreg(TOTH),
+    defreg(TOTL),    defreg(TPR),     defreg(TPT),     defreg(TXDCTL),
+    defreg(WUFC),    defreg(RA),      defreg(MTA),     defreg(CRCERRS),
+    defreg(VFTA),    defreg(VET),     defreg(RDTR),    defreg(RADV),
+    defreg(TADV),    defreg(ITR),     defreg(SCC),     defreg(ECOL),
+    defreg(MCC),     defreg(LATECOL), defreg(COLC),    defreg(DC),
+    defreg(TNCRS),   defreg(SEC),     defreg(CEXTERR), defreg(RLEC),
+    defreg(XONRXC),  defreg(XONTXC),  defreg(XOFFRXC), defreg(XOFFTXC),
+    defreg(FCRUC),   defreg(AIT),     defreg(TDFH),    defreg(TDFT),
+    defreg(TDFHS),   defreg(TDFTS),   defreg(TDFPC),   defreg(WUC),
+    defreg(WUS),     defreg(POEMB),   defreg(PBS),     defreg(RDFH),
+    defreg(RDFT),    defreg(RDFHS),   defreg(RDFTS),   defreg(RDFPC),
+    defreg(PBM),     defreg(IPAV),    defreg(IP4AT),   defreg(IP6AT),
+    defreg(WUPM),    defreg(FFLT),    defreg(FFMT),    defreg(FFVT),
+    defreg(TARC0),   defreg(TARC1),   defreg(IAM),     defreg(EXTCNF_CTRL),
+    defreg(GCR),     defreg(TIMINCA), defreg(EIAC),    defreg(CTRL_EXT),
+    defreg(IVAR),    defreg(MFUTP01), defreg(MFUTP23), defreg(MANC2H),
+    defreg(MFVAL),   defreg(MDEF),    defreg(FACTPS),  defreg(FTFT),
+    defreg(RUC),     defreg(ROC),     defreg(RFC),     defreg(RJC),
+    defreg(PRC64),   defreg(PRC127),  defreg(PRC255),  defreg(PRC511),
+    defreg(PRC1023), defreg(PRC1522), defreg(PTC64),   defreg(PTC127),
+    defreg(PTC255),  defreg(PTC511),  defreg(PTC1023), defreg(PTC1522),
+    defreg(GORCL),   defreg(GORCH),   defreg(GOTCL),   defreg(GOTCH),
+    defreg(RNBC),    defreg(BPRC),    defreg(MPRC),    defreg(RFCTL),
+    defreg(PSRCTL),  defreg(MPTC),    defreg(BPTC),    defreg(TSCTFC),
+    defreg(IAC),     defreg(MGTPRC),  defreg(MGTPDC),  defreg(MGTPTC),
+    defreg(TSCTC),   defreg(RXCSUM),  defreg(FUNCTAG), defreg(GSCL_1),
+    defreg(GSCL_2),  defreg(GSCL_3),  defreg(GSCL_4),  defreg(GSCN_0),
+    defreg(GSCN_1),  defreg(GSCN_2),  defreg(GSCN_3),  defreg(GCR2)
+};
+
+void
+e1000e_core_write(E1000ECore *core, hwaddr addr, uint64_t val, unsigned size);
+
+uint64_t
+e1000e_core_read(E1000ECore *core, hwaddr addr, unsigned size);
+
+void
+e1000e_core_pci_realize(E1000ECore     *core,
+                        const uint16_t *eeprom_templ,
+                        uint32_t        eeprom_size,
+                        const uint8_t  *macaddr);
+
+void
+e1000e_core_reset(E1000ECore *core, uint8_t *macaddr, uint16_t phy_id2);
+
+void
+e1000e_core_pre_save(E1000ECore *core);
+
+int
+e1000e_core_post_load(E1000ECore *core);
+
+void
+e1000e_core_set_link_status(E1000ECore *core);
+
+void
+e1000e_core_pci_uninit(E1000ECore *core);
+
+int
+e1000e_can_receive(E1000ECore *core);
+
+ssize_t
+e1000e_receive(E1000ECore *core, const uint8_t *buf, size_t size);
+
+ssize_t
+e1000e_receive_iov(E1000ECore *core, const struct iovec *iov, int iovcnt);
diff --git a/trace-events b/trace-events
index 30eba92..39ccb21 100644
--- a/trace-events
+++ b/trace-events
@@ -1590,3 +1590,71 @@ i8257_unregistered_dma(int nchan, int dma_pos, int dma_len) "unregistered DMA ch
 cpu_set_state(int cpu_index, uint8_t state) "setting cpu %d state to %" PRIu8
 cpu_halt(int cpu_index) "halting cpu %d"
 cpu_unhalt(int cpu_index) "unhalting cpu %d"
+
+# hw/net/e1000e_core.c
+e1000e_core_write(uint64_t index, uint32_t size, uint64_t val) "Write to register 0x%"PRIx64", %d byte(s), value: 0x%"PRIx64
+e1000e_core_read(uint64_t index, uint32_t size, uint64_t val) "Read from register 0x%"PRIx64", %d byte(s), value: 0x%"PRIx64
+e1000e_core_set_rxctl(uint32_t rdt, uint32_t rctl) "RCTL: %d, mac[RCTL] = 0x%x"
+e1000e_core_set_ics(uint32_t val) "set_ics 0x%x"
+e1000e_core_read_ics(uint32_t val) "read_ics 0x%x"
+e1000e_core_mdic_read(uint32_t addr, uint32_t data) "MDIC read reg 0x%x, value 0x%x"
+e1000e_core_mdic_read_unhandled(uint32_t addr) "MDIC read reg 0x%x unhandled"
+e1000e_core_mdic_write(uint32_t addr, uint32_t data) "MDIC write reg 0x%x, value 0x%x"
+e1000e_core_mdic_write_unhandled(uint32_t addr) "MDIC write reg 0x%x unhandled"
+e1000e_core_eeeprom_read(uint16_t bit, uint16_t reading) "reading eeprom bit %d (reading %d)"
+e1000e_core_eeeprom_write(uint16_t bit_in, uint16_t bit_out, uint16_t reading) "eeprom bitnum in %d out %d, reading %d"
+e1000e_core_icr_write(uint32_t val) "ICR write value 0x%x"
+e1000e_core_icr_read(uint32_t val) "ICR read value 0x%x"
+e1000e_core_start_link_negotiation(void) "Start link auto negotiation"
+e1000e_core_link_negotiation_done(void) "Auto negotiation is completed"
+
+e1000e_wrn_regs_write_ro(uint64_t index, uint32_t size, uint64_t val) "WARNING: Write to RO register 0x%"PRIx64", %d byte(s), value: 0x%"PRIx64
+e1000e_wrn_regs_write_unknown(uint64_t index, uint32_t size, uint64_t val) "WARNING: Write to unknown register 0x%"PRIx64", %d byte(s), value: 0x%"PRIx64
+e1000e_wrn_regs_read_unknown(uint64_t index, uint32_t size) "WARNING: Read from unknown register 0x%"PRIx64", %d byte(s)"
+e1000e_wrn_no_ts_support(void) "WARNING: Guest requested TX timestamping which is not supported"
+e1000e_wrn_no_snap_support(void) "WARNING: Guest requested TX SNAP header update which is not supported"
+
+e1000e_tx_disabled(void) "TX Disabled"
+e1000e_tx_descr(uint32_t head, void *addr, uint32_t lower, uint32_t upper) "index %d: %p : %x %x"
+e1000e_tdh_wraparound(uint32_t start, uint32_t tail, uint32_t len) "TDH wraparound @%x, TDT %x, TDLEN %x"
+e1000e_tx_cso_zero(void) "TCP/UDP: cso 0!"
+
+e1000e_rx_null_descriptor(void) "Null RX descriptor!"
+
+e1000e_rx_err_wraparound(uint32_t start, uint32_t rdt, uint32_t rdlen) "RDH wraparound @%x, RDT %x, RDLEN %x"
+
+e1000e_rx_flt_ucast_match(uint32_t idx, uint8_t b0, uint8_t b1, uint8_t b2, uint8_t b3, uint8_t b4, uint8_t b5) "unicast match[%d]: %02x:%02x:%02x:%02x:%02x:%02x"
+e1000e_rx_flt_ucast_mismatch(uint8_t b0, uint8_t b1, uint8_t b2, uint8_t b3, uint8_t b4, uint8_t b5) "unicast mismatch: %02x:%02x:%02x:%02x:%02x:%02x"
+e1000e_rx_flt_inexact_mismatch(uint8_t b0, uint8_t b1, uint8_t b2, uint8_t b3, uint8_t b4, uint8_t b5, uint32_t mo, uint32_t mta, uint32_t mta_val) "inexact mismatch: %02x:%02x:%02x:%02x:%02x:%02x MO %d MTA[%d] %x"
+
+e1000e_rx_desc_ps_read(uint64_t a0, uint64_t a1, uint64_t a2, uint64_t a3) "buffers: [0x%"PRIx64", 0x%"PRIx64", 0x%"PRIx64", 0x%"PRIx64"]"
+e1000e_rx_desc_ps_write(uint16_t a0, uint16_t a1, uint16_t a2, uint16_t a3) "bytes written: [%u, %u, %u, %u]"
+e1000e_rx_desc_buff_sizes(uint32_t b0, uint32_t b1, uint32_t b2, uint32_t b3) "buffer sizes: [%u, %u, %u, %u]"
+
+e1000e_rx_desc_buff_write(uint8_t idx, uint64_t addr, uint16_t offset, const void* source, uint32_t len) "buffer #%u, addr: 0x%"PRIx64", offset: %u, from: %p, length: %u"
+
+e1000e_irq_set_cause(uint32_t cause) "IRQ cause set 0x%x"
+e1000e_irq_msi_notify(uint32_t cause) "MSI notify 0x%x"
+e1000e_irq_msix_notify(uint32_t cause) "MSI-X notify 0x%x"
+e1000e_irq_legacy_notify(bool level) "IRQ line state: %d"
+e1000e_irq_msix_notify_vec(uint32_t vector) "MSI-X notify vector 0x%x"
+
+e1000e_wrn_msix_vec_wrong(uint32_t cause, uint32_t cfg) "Invalid configuration for cause 0x%x: 0x%x"
+e1000e_wrn_msix_invalid(uint32_t cause, uint32_t cfg) "Invalid entry for cause 0x%x: 0x%x"
+
+# hw/net/e1000e.c
+e1000e_cb_pci_realize(void) "E1000E PCI realize entry"
+e1000e_cb_pci_uninit(void) "E1000E PCI uninit entry"
+e1000e_cb_write_config(void) "E1000E write config entry"
+e1000e_cb_qdev_reset(void) "E1000E qdev reset entry"
+e1000e_cb_pre_save(void) "E1000E pre save entry"
+e1000e_cb_post_load(void) "E1000E post load entry"
+
+e1000e_wrn_io_read(uint64_t addr, uint32_t size) "IO unknown read from 0x%"PRIx64", %d byte(s)"
+e1000e_wrn_io_write(uint64_t addr, uint32_t size, uint64_t val) "IO unknown write to 0x%"PRIx64", %d byte(s), value: 0x%"PRIx64
+
+e1000e_msi_init_fail(int32_t res) "Failed to initialize MSI, error %d"
+e1000e_msix_init_fail(int32_t res) "Failed to initialize MSI-X, error %d"
+e1000e_msix_use_vector_fail(uint32_t vec, int32_t res) "Failed to use MSI-X vector %d, error %d"
+
+e1000e_cfg_support_virtio(bool support) "Virtio header supported: %d"
-- 
2.4.3

^ permalink raw reply related	[flat|nested] 15+ messages in thread

* Re: [Qemu-devel] [RFC PATCH 0/5] Introduce Intel 82574 GbE Controller Emulation (e1000e)
  2015-10-25 17:00 [Qemu-devel] [RFC PATCH 0/5] Introduce Intel 82574 GbE Controller Emulation (e1000e) Leonid Bloch
                   ` (4 preceding siblings ...)
  2015-10-25 17:00 ` [Qemu-devel] [RFC PATCH 5/5] net: Introduce e1000e device emulation Leonid Bloch
@ 2015-10-28  5:44 ` Jason Wang
  2015-10-28  6:11   ` Dmitry Fleytman
  2015-10-30  5:28   ` Jason Wang
  2016-01-13  4:43 ` Prem Mallappa
  6 siblings, 2 replies; 15+ messages in thread
From: Jason Wang @ 2015-10-28  5:44 UTC (permalink / raw)
  To: Leonid Bloch, qemu-devel; +Cc: Dmitry Fleytman, Leonid Bloch, Shmulik Ladkani



On 10/26/2015 01:00 AM, Leonid Bloch wrote:
> Hello qemu-devel,
>
> This patch series is an RFC for the new networking device emulation
> we're developing for QEMU.
>
> This new device emulates the Intel 82574 GbE Controller and works
> with unmodified Intel e1000e drivers from the Linux/Windows kernels.
>
> The status of the current series is "Functional Device Ready, work
> on Extended Features in Progress".
>
> More precisely, these patches represent a functional device, which
> is recognized by the standard Intel drivers, and is able to transfer
> TX/RX packets with CSO/TSO offloads, according to the spec.
>
> Extended features not supported yet (work in progress):
>   1. TX/RX Interrupt moderation mechanisms
>   2. RSS
>   3. Full-featured multi-queue (use of multiqueued network backend)
>
> Also, there will be some code refactoring and performance
> optimization efforts.
>
> This series was tested on Linux (Fedora 22) and Windows (2012R2)
> guests, using Iperf, with TX/RX and TCP/UDP streams, and various
> packet sizes.
>
> More thorough testing, including data streams with different MTU
> sizes, and Microsoft Certification (HLK) tests, are pending missing
> features' development.
>
> See commit messages (esp. "net: Introduce e1000e device emulation")
> for more information about the development approaches and the
> architecture options chosen for this device.
>
> This series is based upon v2.3.0 tag of the upstream QEMU repository,
> and it will be rebased to latest before the final submission.
>
> Please share your thoughts - any feedback is highly welcomed :)
>
> Best Regards,
> Dmitry Fleytman.

Thanks for the series. Will go through it in the next few days.

Since 2.5 is in soft freeze, this looks like 2.6 material.

>
> Dmitry Fleytman (5):
>   net: Add macros for ETH address tracing
>   net_pkt: Name vmxnet3 packet abstractions more generic
>   net_pkt: Extend packet abstraction as required by e1000e functionality
>   e1000_regs: Add definitions for Intel 82574-specific bits
>   net: Introduce e1000e device emulation
>
>  MAINTAINERS             |    2 +
>  default-configs/pci.mak |    1 +
>  hw/net/Makefile.objs    |    5 +-
>  hw/net/e1000_regs.h     |  201 ++++-
>  hw/net/e1000e.c         |  531 ++++++++++++
>  hw/net/e1000e_core.c    | 2081 +++++++++++++++++++++++++++++++++++++++++++++++
>  hw/net/e1000e_core.h    |  181 +++++
>  hw/net/net_rx_pkt.c     |  273 +++++++
>  hw/net/net_rx_pkt.h     |  241 ++++++
>  hw/net/net_tx_pkt.c     |  606 ++++++++++++++
>  hw/net/net_tx_pkt.h     |  191 +++++
>  hw/net/vmxnet3.c        |   80 +-
>  hw/net/vmxnet_rx_pkt.c  |  187 -----
>  hw/net/vmxnet_rx_pkt.h  |  174 ----
>  hw/net/vmxnet_tx_pkt.c  |  567 -------------
>  hw/net/vmxnet_tx_pkt.h  |  148 ----
>  include/net/eth.h       |   90 +-
>  include/net/net.h       |    5 +
>  net/eth.c               |  152 +++-
>  tests/Makefile          |    4 +-
>  trace-events            |   68 ++
>  21 files changed, 4597 insertions(+), 1191 deletions(-)
>  create mode 100644 hw/net/e1000e.c
>  create mode 100644 hw/net/e1000e_core.c
>  create mode 100644 hw/net/e1000e_core.h
>  create mode 100644 hw/net/net_rx_pkt.c
>  create mode 100644 hw/net/net_rx_pkt.h
>  create mode 100644 hw/net/net_tx_pkt.c
>  create mode 100644 hw/net/net_tx_pkt.h
>  delete mode 100644 hw/net/vmxnet_rx_pkt.c
>  delete mode 100644 hw/net/vmxnet_rx_pkt.h
>  delete mode 100644 hw/net/vmxnet_tx_pkt.c
>  delete mode 100644 hw/net/vmxnet_tx_pkt.h
>


* Re: [Qemu-devel] [RFC PATCH 0/5] Introduce Intel 82574 GbE Controller Emulation (e1000e)
  2015-10-28  5:44 ` [Qemu-devel] [RFC PATCH 0/5] Introduce Intel 82574 GbE Controller Emulation (e1000e) Jason Wang
@ 2015-10-28  6:11   ` Dmitry Fleytman
  2015-10-30  5:28   ` Jason Wang
  1 sibling, 0 replies; 15+ messages in thread
From: Dmitry Fleytman @ 2015-10-28  6:11 UTC (permalink / raw)
  To: Jason Wang; +Cc: Leonid Bloch, Leonid Bloch, qemu-devel, Shmulik Ladkani



> On 28 Oct 2015, at 07:44 AM, Jason Wang <jasowang@redhat.com> wrote:
> 
> 
> 
> On 10/26/2015 01:00 AM, Leonid Bloch wrote:
>> Hello qemu-devel,
>> 
>> This patch series is an RFC for the new networking device emulation
>> we're developing for QEMU.
>> 
>> This new device emulates the Intel 82574 GbE Controller and works
>> with unmodified Intel e1000e drivers from the Linux/Windows kernels.
>> 
>> The status of the current series is "Functional Device Ready, work
>> on Extended Features in Progress".
>> 
>> More precisely, these patches represent a functional device, which
>> is recognized by the standard Intel drivers, and is able to transfer
>> TX/RX packets with CSO/TSO offloads, according to the spec.
>> 
>> Extended features not supported yet (work in progress):
>>  1. TX/RX Interrupt moderation mechanisms
>>  2. RSS
>>  3. Full-featured multi-queue (use of multiqueued network backend)
>> 
>> Also, there will be some code refactoring and performance
>> optimization efforts.
>> 
>> This series was tested on Linux (Fedora 22) and Windows (2012R2)
>> guests, using Iperf, with TX/RX and TCP/UDP streams, and various
>> packet sizes.
>> 
>> More thorough testing, including data streams with different MTU
>> sizes, and Microsoft Certification (HLK) tests, are pending missing
>> features' development.
>> 
>> See commit messages (esp. "net: Introduce e1000e device emulation")
>> for more information about the development approaches and the
>> architecture options chosen for this device.
>> 
>> This series is based upon v2.3.0 tag of the upstream QEMU repository,
>> and it will be rebased to latest before the final submission.
>> 
>> Please share your thoughts - any feedback is highly welcomed :)
>> 
>> Best Regards,
>> Dmitry Fleytman.
> 
> Thanks for the series. Will go through this in next few days.
> 
> Since 2.5 is in soft freeze, this looks a 2.6 material.

Thanks, Jason.

> 
>> 
>> Dmitry Fleytman (5):
>>  net: Add macros for ETH address tracing
>>  net_pkt: Name vmxnet3 packet abstractions more generic
>>  net_pkt: Extend packet abstraction as required by e1000e functionality
>>  e1000_regs: Add definitions for Intel 82574-specific bits
>>  net: Introduce e1000e device emulation
>> 
>> MAINTAINERS             |    2 +
>> default-configs/pci.mak |    1 +
>> hw/net/Makefile.objs    |    5 +-
>> hw/net/e1000_regs.h     |  201 ++++-
>> hw/net/e1000e.c         |  531 ++++++++++++
>> hw/net/e1000e_core.c    | 2081 +++++++++++++++++++++++++++++++++++++++++++++++
>> hw/net/e1000e_core.h    |  181 +++++
>> hw/net/net_rx_pkt.c     |  273 +++++++
>> hw/net/net_rx_pkt.h     |  241 ++++++
>> hw/net/net_tx_pkt.c     |  606 ++++++++++++++
>> hw/net/net_tx_pkt.h     |  191 +++++
>> hw/net/vmxnet3.c        |   80 +-
>> hw/net/vmxnet_rx_pkt.c  |  187 -----
>> hw/net/vmxnet_rx_pkt.h  |  174 ----
>> hw/net/vmxnet_tx_pkt.c  |  567 -------------
>> hw/net/vmxnet_tx_pkt.h  |  148 ----
>> include/net/eth.h       |   90 +-
>> include/net/net.h       |    5 +
>> net/eth.c               |  152 +++-
>> tests/Makefile          |    4 +-
>> trace-events            |   68 ++
>> 21 files changed, 4597 insertions(+), 1191 deletions(-)
>> create mode 100644 hw/net/e1000e.c
>> create mode 100644 hw/net/e1000e_core.c
>> create mode 100644 hw/net/e1000e_core.h
>> create mode 100644 hw/net/net_rx_pkt.c
>> create mode 100644 hw/net/net_rx_pkt.h
>> create mode 100644 hw/net/net_tx_pkt.c
>> create mode 100644 hw/net/net_tx_pkt.h
>> delete mode 100644 hw/net/vmxnet_rx_pkt.c
>> delete mode 100644 hw/net/vmxnet_rx_pkt.h
>> delete mode 100644 hw/net/vmxnet_tx_pkt.c
>> delete mode 100644 hw/net/vmxnet_tx_pkt.h
>> 
> 


* Re: [Qemu-devel] [RFC PATCH 0/5] Introduce Intel 82574 GbE Controller Emulation (e1000e)
  2015-10-28  5:44 ` [Qemu-devel] [RFC PATCH 0/5] Introduce Intel 82574 GbE Controller Emulation (e1000e) Jason Wang
  2015-10-28  6:11   ` Dmitry Fleytman
@ 2015-10-30  5:28   ` Jason Wang
  2015-10-31  5:52     ` Dmitry Fleytman
  1 sibling, 1 reply; 15+ messages in thread
From: Jason Wang @ 2015-10-30  5:28 UTC (permalink / raw)
  To: Leonid Bloch, qemu-devel; +Cc: Dmitry Fleytman, Leonid Bloch, Shmulik Ladkani



On 10/28/2015 01:44 PM, Jason Wang wrote:
>
> On 10/26/2015 01:00 AM, Leonid Bloch wrote:
>> Hello qemu-devel,
>>
>> This patch series is an RFC for the new networking device emulation
>> we're developing for QEMU.
>>
>> This new device emulates the Intel 82574 GbE Controller and works
>> with unmodified Intel e1000e drivers from the Linux/Windows kernels.
>>
>> The status of the current series is "Functional Device Ready, work
>> on Extended Features in Progress".
>>
>> More precisely, these patches represent a functional device, which
>> is recognized by the standard Intel drivers, and is able to transfer
>> TX/RX packets with CSO/TSO offloads, according to the spec.
>>
>> Extended features not supported yet (work in progress):
>>   1. TX/RX Interrupt moderation mechanisms
>>   2. RSS
>>   3. Full-featured multi-queue (use of multiqueued network backend)
>>
>> Also, there will be some code refactoring and performance
>> optimization efforts.
>>
>> This series was tested on Linux (Fedora 22) and Windows (2012R2)
>> guests, using Iperf, with TX/RX and TCP/UDP streams, and various
>> packet sizes.
>>
>> More thorough testing, including data streams with different MTU
>> sizes, and Microsoft Certification (HLK) tests, are pending missing
>> features' development.
>>
>> See commit messages (esp. "net: Introduce e1000e device emulation")
>> for more information about the development approaches and the
>> architecture options chosen for this device.
>>
>> This series is based upon v2.3.0 tag of the upstream QEMU repository,
>> and it will be rebased to latest before the final submission.
>>
>> Please share your thoughts - any feedback is highly welcomed :)
>>
>> Best Regards,
>> Dmitry Fleytman.
> Thanks for the series. Will go through this in next few days.

Had a quick glance at the series, and got the following questions:

- Though e1000e differs from e1000 in many places, I still see lots of
code duplication. We need to consider reusing e1000.c (or at least part
of it). We don't want to fix the same bug twice in two places in the
future, and I expect hundreds of lines could be saved this way.
- For e1000e itself, since it is a new device, there is no need to care
about compatibility stuff (e.g. auto-negotiation and interrupt
mitigation). We can just enable them unconditionally.
- As for the generic packet abstraction layer, what are its advantages?
If there are many, maybe we can use it in other NIC models (e.g.
virtio-net)?

Thanks

>
> Since 2.5 is in soft freeze, this looks a 2.6 material.
>


* Re: [Qemu-devel] [RFC PATCH 0/5] Introduce Intel 82574 GbE Controller Emulation (e1000e)
  2015-10-30  5:28   ` Jason Wang
@ 2015-10-31  5:52     ` Dmitry Fleytman
  2015-11-02  3:35       ` Jason Wang
  0 siblings, 1 reply; 15+ messages in thread
From: Dmitry Fleytman @ 2015-10-31  5:52 UTC (permalink / raw)
  To: Jason Wang; +Cc: Leonid Bloch, Leonid Bloch, qemu-devel, Shmulik Ladkani

[-- Attachment #1: Type: text/plain, Size: 4672 bytes --]

Hello Jason,

Thanks for reviewing. See my answers inline.


> On 30 Oct 2015, at 07:28 AM, Jason Wang <jasowang@redhat.com> wrote:
> 
> 
> 
> On 10/28/2015 01:44 PM, Jason Wang wrote:
>> 
>> On 10/26/2015 01:00 AM, Leonid Bloch wrote:
>>> Hello qemu-devel,
>>> 
>>> This patch series is an RFC for the new networking device emulation
>>> we're developing for QEMU.
>>> 
>>> This new device emulates the Intel 82574 GbE Controller and works
>>> with unmodified Intel e1000e drivers from the Linux/Windows kernels.
>>> 
>>> The status of the current series is "Functional Device Ready, work
>>> on Extended Features in Progress".
>>> 
>>> More precisely, these patches represent a functional device, which
>>> is recognized by the standard Intel drivers, and is able to transfer
>>> TX/RX packets with CSO/TSO offloads, according to the spec.
>>> 
>>> Extended features not supported yet (work in progress):
>>>  1. TX/RX Interrupt moderation mechanisms
>>>  2. RSS
>>>  3. Full-featured multi-queue (use of multiqueued network backend)
>>> 
>>> Also, there will be some code refactoring and performance
>>> optimization efforts.
>>> 
>>> This series was tested on Linux (Fedora 22) and Windows (2012R2)
>>> guests, using Iperf, with TX/RX and TCP/UDP streams, and various
>>> packet sizes.
>>> 
>>> More thorough testing, including data streams with different MTU
>>> sizes, and Microsoft Certification (HLK) tests, are pending missing
>>> features' development.
>>> 
>>> See commit messages (esp. "net: Introduce e1000e device emulation")
>>> for more information about the development approaches and the
>>> architecture options chosen for this device.
>>> 
>>> This series is based upon v2.3.0 tag of the upstream QEMU repository,
>>> and it will be rebased to latest before the final submission.
>>> 
>>> Please share your thoughts - any feedback is highly welcomed :)
>>> 
>>> Best Regards,
>>> Dmitry Fleytman.
>> Thanks for the series. Will go through this in next few days.
> 
> Have a quick glance at the series, got the following questions:
> 
> - Though e1000e differs from e1000 in many places, I still see lots of
> code duplications. We need consider to reuse e1000.c (or at least part
> of). I believe we don't want to fix a bug twice in two places in the
> future and I expect hundreds of lines could be saved through this way.

That’s a good question :)

This is how we started: we had a common “core” code base meant to implement all the common logic (this split is still present in the patches: there are e1000e_core.c and e1000e.c files).
Unfortunately, at some point it turned out that there are more differences than commonalities. We noticed that the code was becoming filled with handling for many minor differences.
This also made the code base more complicated and harder to follow.

So at some point it was decided to split the code base and revert all the changes made to the e1000 device (except for a few fixes/improvements Leonid submitted a few days ago).

Although there had been common code between the devices, the total SLOC of the e1000 and e1000e devices became smaller after the split.

The amount of code that may be shared between the devices will be even smaller after we complete the implementation, which still lacks a few features (see the cover letter) that will change many things.

Still, after the device implementation is done, we plan to review the code similarities again to see if there are opportunities for code sharing.

> - For e1000e it self, since it was a new device, so no need to care
> about compatibility stuffs (e.g auto negotiation and mit). We can just
> enable them forever.

Yes, we have this in plans.

> - And for the generic packet abstraction layer, what's the advantages of
> this? If it has lot, maybe we can use it in other nic model (e.g
> virtio-net)?

These abstractions were initially developed by me as part of the vmxnet3 device, with the intent of being generic and reusable. Their main advantages are support for virtio headers with virtio-enabled backends, and software emulation of network offloads for backends that do not support virtio.

Of course they may be re-used by virtio, however I’m not sure it would be really useful, because virtio has feature negotiation facilities and does not require SW emulation of network task offloads.

For other devices they are useful, because every device that requires a SW offloads implementation needs to do exactly the same things, and it doesn’t make sense to have several implementations of this.

Best Regards,
Dmitry

> 
> Thanks
> 
>> 
>> Since 2.5 is in soft freeze, this looks a 2.6 material.




* Re: [Qemu-devel] [RFC PATCH 0/5] Introduce Intel 82574 GbE Controller Emulation (e1000e)
  2015-10-31  5:52     ` Dmitry Fleytman
@ 2015-11-02  3:35       ` Jason Wang
  2015-11-02  7:49         ` Dmitry Fleytman
  0 siblings, 1 reply; 15+ messages in thread
From: Jason Wang @ 2015-11-02  3:35 UTC (permalink / raw)
  To: Dmitry Fleytman; +Cc: Leonid Bloch, Leonid Bloch, qemu-devel, Shmulik Ladkani



On 10/31/2015 01:52 PM, Dmitry Fleytman wrote:
> Hello Jason,
>
> Thanks for reviewing. See my answers inline.
>
>
>> On 30 Oct 2015, at 07:28 AM, Jason Wang <jasowang@redhat.com
>> <mailto:jasowang@redhat.com>> wrote:
>>
>>
>>
>> On 10/28/2015 01:44 PM, Jason Wang wrote:
>>>
>>> On 10/26/2015 01:00 AM, Leonid Bloch wrote:
>>>> Hello qemu-devel,
>>>>
>>>> This patch series is an RFC for the new networking device emulation
>>>> we're developing for QEMU.
>>>>
>>>> This new device emulates the Intel 82574 GbE Controller and works
>>>> with unmodified Intel e1000e drivers from the Linux/Windows kernels.
>>>>
>>>> The status of the current series is "Functional Device Ready, work
>>>> on Extended Features in Progress".
>>>>
>>>> More precisely, these patches represent a functional device, which
>>>> is recognized by the standard Intel drivers, and is able to transfer
>>>> TX/RX packets with CSO/TSO offloads, according to the spec.
>>>>
>>>> Extended features not supported yet (work in progress):
>>>>  1. TX/RX Interrupt moderation mechanisms
>>>>  2. RSS
>>>>  3. Full-featured multi-queue (use of multiqueued network backend)
>>>>
>>>> Also, there will be some code refactoring and performance
>>>> optimization efforts.
>>>>
>>>> This series was tested on Linux (Fedora 22) and Windows (2012R2)
>>>> guests, using Iperf, with TX/RX and TCP/UDP streams, and various
>>>> packet sizes.
>>>>
>>>> More thorough testing, including data streams with different MTU
>>>> sizes, and Microsoft Certification (HLK) tests, are pending missing
>>>> features' development.
>>>>
>>>> See commit messages (esp. "net: Introduce e1000e device emulation")
>>>> for more information about the development approaches and the
>>>> architecture options chosen for this device.
>>>>
>>>> This series is based upon v2.3.0 tag of the upstream QEMU repository,
>>>> and it will be rebased to latest before the final submission.
>>>>
>>>> Please share your thoughts - any feedback is highly welcomed :)
>>>>
>>>> Best Regards,
>>>> Dmitry Fleytman.
>>> Thanks for the series. Will go through this in next few days.
>>
>> Have a quick glance at the series, got the following questions:
>>
>> - Though e1000e differs from e1000 in many places, I still see lots of
>> code duplications. We need consider to reuse e1000.c (or at least part
>> of). I believe we don't want to fix a bug twice in two places in the
>> future and I expect hundreds of lines could be saved through this way.
>
> That’s a good question :)
>
> This is how we started, we had a common “core” code base meant to
> implement all common logic (this split is still present in the patches
> - there are e1000e_core.c and e1000e.c files).
> Unfortunately at some point it turned out that there are more
> differences that commons. We noticed that the code becomes filled with
> many minor differences handling.
> This also made the code base more complicated and harder to follow.
>
> So at some point of time it was decided to split the code base and
> revert all changes done to the e1000 device (except a few
> fixes/improvements Leonid submitted a few days ago).
>
> Although there was common code between devices, total SLOC of e1000
> and e1000e devices became smaller after the split.
>
> Amount of code that may be shared between devices will be even smaller
> after we complete the implementation which still misses a few features
> (see cover letter) that will change many things.
>
> Still after the device implementation is done, we plan to review code
> similarities again to see if there are possibilities for code sharing.

I see, but if we try to re-use or unify the code from the beginning, it
would be a little bit easier. It looks like the differences are mainly:

1) MSI-X support
2) offloading support through virtio-net header
3) trace points
4) other new functions through e1000e specific registers

So we could first unify the code by implementing support for 2) and 3)
in e1000. MSI-X and the other e1000e-specific new functions could be
handled through:

1) model-specific callbacks, e.g. realize, transmission, and reception
2) a new register flag, e.g. PHY_RW_E1000E, which means the register is
for e1000e only; or even model-specific writeops and readops
3) for other subtle differences, checking the model explicitly in the
code.

What's your opinion? (A good example of code sharing is FreeBSD's e1000
driver, which covers both the 8254x and 8257x.)

>
>> - For e1000e it self, since it was a new device, so no need to care
>> about compatibility stuffs (e.g auto negotiation and mit). We can just
>> enable them forever.
>
> Yes, we have this in plans.
>
>> - And for the generic packet abstraction layer, what's the advantages of
>> this? If it has lot, maybe we can use it in other nic model (e.g
>> virtio-net)?
>
> These abstractions were initially developed by me as a part of vmxnet3
> device to be generic and re-usable. Their main advantage is support
> for virtio headers for virtio-enabled backends and emulation of
> network offloads in software for backends that do not support virtio.
>
> Of course they may be re-used by virtio, however I’m not sure if it
> will be really useful because virtio has feature negotiation
> facilities and do not require SW emulation for network task offloads.
>
> For other devices they are useful because each and every device that
> requires SW offloads implementation need to do exactly the same things
> and it doesn’t make sense to have a few implementations for this.
>
> Best Regards,
> Dmitry

Ok, thanks for the explanation.

>
>>
>> Thanks
>>
>>>
>>> Since 2.5 is in soft freeze, this looks a 2.6 material.
>


* Re: [Qemu-devel] [RFC PATCH 0/5] Introduce Intel 82574 GbE Controller Emulation (e1000e)
  2015-11-02  3:35       ` Jason Wang
@ 2015-11-02  7:49         ` Dmitry Fleytman
  2015-11-03  5:44           ` Jason Wang
  0 siblings, 1 reply; 15+ messages in thread
From: Dmitry Fleytman @ 2015-11-02  7:49 UTC (permalink / raw)
  To: Jason Wang; +Cc: Leonid Bloch, Leonid Bloch, qemu-devel, Shmulik Ladkani

[-- Attachment #1: Type: text/plain, Size: 8151 bytes --]


> On 2 Nov 2015, at 05:35 AM, Jason Wang <jasowang@redhat.com> wrote:
> 
> 
> 
> On 10/31/2015 01:52 PM, Dmitry Fleytman wrote:
>> Hello Jason,
>> 
>> Thanks for reviewing. See my answers inline.
>> 
>> 
>>> On 30 Oct 2015, at 07:28 AM, Jason Wang <jasowang@redhat.com <mailto:jasowang@redhat.com>
>>> <mailto:jasowang@redhat.com <mailto:jasowang@redhat.com>>> wrote:
>>> 
>>> 
>>> 
>>> On 10/28/2015 01:44 PM, Jason Wang wrote:
>>>> 
>>>> On 10/26/2015 01:00 AM, Leonid Bloch wrote:
>>>>> Hello qemu-devel,
>>>>> 
>>>>> This patch series is an RFC for the new networking device emulation
>>>>> we're developing for QEMU.
>>>>> 
>>>>> This new device emulates the Intel 82574 GbE Controller and works
>>>>> with unmodified Intel e1000e drivers from the Linux/Windows kernels.
>>>>> 
>>>>> The status of the current series is "Functional Device Ready, work
>>>>> on Extended Features in Progress".
>>>>> 
>>>>> More precisely, these patches represent a functional device, which
>>>>> is recognized by the standard Intel drivers, and is able to transfer
>>>>> TX/RX packets with CSO/TSO offloads, according to the spec.
>>>>> 
>>>>> Extended features not supported yet (work in progress):
>>>>> 1. TX/RX Interrupt moderation mechanisms
>>>>> 2. RSS
>>>>> 3. Full-featured multi-queue (use of multiqueued network backend)
>>>>> 
>>>>> Also, there will be some code refactoring and performance
>>>>> optimization efforts.
>>>>> 
>>>>> This series was tested on Linux (Fedora 22) and Windows (2012R2)
>>>>> guests, using Iperf, with TX/RX and TCP/UDP streams, and various
>>>>> packet sizes.
>>>>> 
>>>>> More thorough testing, including data streams with different MTU
>>>>> sizes, and Microsoft Certification (HLK) tests, are pending missing
>>>>> features' development.
>>>>> 
>>>>> See commit messages (esp. "net: Introduce e1000e device emulation")
>>>>> for more information about the development approaches and the
>>>>> architecture options chosen for this device.
>>>>> 
>>>>> This series is based upon v2.3.0 tag of the upstream QEMU repository,
>>>>> and it will be rebased to latest before the final submission.
>>>>> 
>>>>> Please share your thoughts - any feedback is highly welcomed :)
>>>>> 
>>>>> Best Regards,
>>>>> Dmitry Fleytman.
>>>> Thanks for the series. Will go through this in next few days.
>>> 
>>> Have a quick glance at the series, got the following questions:
>>> 
>>> - Though e1000e differs from e1000 in many places, I still see lots of
>>> code duplications. We need consider to reuse e1000.c (or at least part
>>> of). I believe we don't want to fix a bug twice in two places in the
>>> future and I expect hundreds of lines could be saved through this way.
>> 
>> That’s a good question :)
>> 
>> This is how we started, we had a common “core” code base meant to
>> implement all common logic (this split is still present in the patches
>> - there are e1000e_core.c and e1000e.c files).
>> Unfortunately at some point it turned out that there are more
>> differences that commons. We noticed that the code becomes filled with
>> many minor differences handling.
>> This also made the code base more complicated and harder to follow.
>> 
>> So at some point of time it was decided to split the code base and
>> revert all changes done to the e1000 device (except a few
>> fixes/improvements Leonid submitted a few days ago).
>> 
>> Although there was common code between devices, total SLOC of e1000
>> and e1000e devices became smaller after the split.
>> 
>> Amount of code that may be shared between devices will be even smaller
>> after we complete the implementation which still misses a few features
>> (see cover letter) that will change many things.
>> 
>> Still after the device implementation is done, we plan to review code
>> similarities again to see if there are possibilities for code sharing.
> 
> I see, but if we can try to re-use or unify the codes from beginning, it
> would be a little bit easier. Looks like the differences were mainly:
> 
> 1) MSI-X support
> 2) offloading support through virtio-net header
> 3) trace points
> 4) other new functions through e1000e specific registers
> 
> So we could first unify the code through implementing the support of 2
> and 3 for e1000. For MSI-X and other e1000e specific new functions, it
> could be done through:
> 
> 1) model specific callbacks, e.g realize, transmission and reception
> 2) A new register flags e.g PHY_RW_E1000E which means the register is
> for e1000e only. Or even model specific writeops and readops
> 3) For other subtle differences, it could be done in the code by
> checking the model explicitly.
> 
> What's your opinion? (A good example of code sharing is freebsd's e1000
> driver which covers both 8254x and 8257x).


Hi Jason,

This is exactly how we started.

The issues that made us split the code base were the following:

1. The majority of registers are different. Even the same registers in many cases have different bits and require different processing logic. This introduced too many ifs into the code;
2. The data path is totally different, not just because of virtio headers, but also because these devices use different descriptor layouts and require different logic in order to parse and fill them. There are legacy descriptors that look almost the same, but of course we must support all descriptor types described by the spec;
3. Interrupt handling logic is different because of MSI support;
4. Multi-queue and RSS make the code even less similar to the old device;
5. Isolating a shared code base required changes in the migration tables and fishy tricks to preserve compatibility with the current implementation;
6. The e1000 code underwent massive changes which are very hard to verify, because the spec is big and there are no drivers that exercise all device features.

Overall, the code for handling the differences in device behaviour was bigger and more complicated than the device logic itself. The differences are not subtle when it comes to a full-featured device implementation.
As for FreeBSD’s driver, I’m not deeply familiar with its code, but I suspect it works with the device in legacy mode, which is indeed pretty similar to the old device. Since we must support all modes, our situation is different.

Again, I’m totally in favor of shared code and would like to have some common code base anyway. Currently it looks like the best way to achieve this is to finish all the device features and then see which parts of the logic may be shared between the old and the new devices. It’s better to have slight code duplication and a smaller shared code base than bloated and tricky shared code that introduces its own problems to both devices.

Best Regards,
Dmitry

> 
>> 
>>> - For e1000e it self, since it was a new device, so no need to care
>>> about compatibility stuffs (e.g auto negotiation and mit). We can just
>>> enable them forever.
>> 
>> Yes, we have this in plans.
>> 
>>> - And for the generic packet abstraction layer, what's the advantages of
>>> this? If it has lot, maybe we can use it in other nic model (e.g
>>> virtio-net)?
>> 
>> These abstractions were initially developed by me as a part of vmxnet3
>> device to be generic and re-usable. Their main advantage is support
>> for virtio headers for virtio-enabled backends and emulation of
>> network offloads in software for backends that do not support virtio.
>> 
>> Of course they may be re-used by virtio, however I’m not sure if it
>> will be really useful because virtio has feature negotiation
>> facilities and do not require SW emulation for network task offloads.
>> 
>> For other devices they are useful because each and every device that
>> requires SW offloads implementation need to do exactly the same things
>> and it doesn’t make sense to have a few implementations for this.
>> 
>> Best Regards,
>> Dmitry
> 
> Ok, thanks for the explanation.
> 
>> 
>>> 
>>> Thanks
>>> 
>>>> 
>>>> Since 2.5 is in soft freeze, this looks a 2.6 material.




* Re: [Qemu-devel] [RFC PATCH 0/5] Introduce Intel 82574 GbE Controller Emulation (e1000e)
  2015-11-02  7:49         ` Dmitry Fleytman
@ 2015-11-03  5:44           ` Jason Wang
  2015-11-03  9:17             ` Dmitry Fleytman
  0 siblings, 1 reply; 15+ messages in thread
From: Jason Wang @ 2015-11-03  5:44 UTC (permalink / raw)
  To: Dmitry Fleytman; +Cc: Leonid Bloch, Leonid Bloch, qemu-devel, Shmulik Ladkani



On 11/02/2015 03:49 PM, Dmitry Fleytman wrote:
>
>> On 2 Nov 2015, at 05:35 AM, Jason Wang <jasowang@redhat.com
>> <mailto:jasowang@redhat.com>> wrote:
>>
>>
>>
>> On 10/31/2015 01:52 PM, Dmitry Fleytman wrote:
>>> Hello Jason,
>>>
>>> Thanks for reviewing. See my answers inline.
>>>
>>>
>>>> On 30 Oct 2015, at 07:28 AM, Jason Wang <jasowang@redhat.com
>>>> <mailto:jasowang@redhat.com>
>>>> <mailto:jasowang@redhat.com>> wrote:
>>>>
>>>>
>>>>
>>>> On 10/28/2015 01:44 PM, Jason Wang wrote:
>>>>>
>>>>> On 10/26/2015 01:00 AM, Leonid Bloch wrote:
>>>>>> Hello qemu-devel,
>>>>>>
>>>>>> This patch series is an RFC for the new networking device emulation
>>>>>> we're developing for QEMU.
>>>>>>
>>>>>> This new device emulates the Intel 82574 GbE Controller and works
>>>>>> with unmodified Intel e1000e drivers from the Linux/Windows kernels.
>>>>>>
>>>>>> The status of the current series is "Functional Device Ready, work
>>>>>> on Extended Features in Progress".
>>>>>>
>>>>>> More precisely, these patches represent a functional device, which
>>>>>> is recognized by the standard Intel drivers, and is able to transfer
>>>>>> TX/RX packets with CSO/TSO offloads, according to the spec.
>>>>>>
>>>>>> Extended features not supported yet (work in progress):
>>>>>> 1. TX/RX Interrupt moderation mechanisms
>>>>>> 2. RSS
>>>>>> 3. Full-featured multi-queue (use of multiqueued network backend)
>>>>>>
>>>>>> Also, there will be some code refactoring and performance
>>>>>> optimization efforts.
>>>>>>
>>>>>> This series was tested on Linux (Fedora 22) and Windows (2012R2)
>>>>>> guests, using Iperf, with TX/RX and TCP/UDP streams, and various
>>>>>> packet sizes.
>>>>>>
>>>>>> More thorough testing, including data streams with different MTU
>>>>>> sizes, and Microsoft Certification (HLK) tests, are pending missing
>>>>>> features' development.
>>>>>>
>>>>>> See commit messages (esp. "net: Introduce e1000e device emulation")
>>>>>> for more information about the development approaches and the
>>>>>> architecture options chosen for this device.
>>>>>>
>>>>>> This series is based upon v2.3.0 tag of the upstream QEMU repository,
>>>>>> and it will be rebased to latest before the final submission.
>>>>>>
>>>>>> Please share your thoughts - any feedback is highly welcomed :)
>>>>>>
>>>>>> Best Regards,
>>>>>> Dmitry Fleytman.
>>>>> Thanks for the series. Will go through this in next few days.
>>>>
>>>> Have a quick glance at the series, got the following questions:
>>>>
>>>> - Though e1000e differs from e1000 in many places, I still see lots of
>>>> code duplications. We need consider to reuse e1000.c (or at least part
>>>> of). I believe we don't want to fix a bug twice in two places in the
>>>> future and I expect hundreds of lines could be saved through this way.
>>>
>>> That’s a good question :)
>>>
>>> This is how we started, we had a common “core” code base meant to
>>> implement all common logic (this split is still present in the patches
>>> - there are e1000e_core.c and e1000e.c files).
>>> Unfortunately at some point it turned out that there are more
>>> differences that commons. We noticed that the code becomes filled with
>>> many minor differences handling.
>>> This also made the code base more complicated and harder to follow.
>>>
>>> So at some point of time it was decided to split the code base and
>>> revert all changes done to the e1000 device (except a few
>>> fixes/improvements Leonid submitted a few days ago).
>>>
>>> Although there was common code between devices, total SLOC of e1000
>>> and e1000e devices became smaller after the split.
>>>
>>> Amount of code that may be shared between devices will be even smaller
>>> after we complete the implementation which still misses a few features
>>> (see cover letter) that will change many things.
>>>
>>> Still after the device implementation is done, we plan to review code
>>> similarities again to see if there are possibilities for code sharing.
>>
>> I see, but if we can try to re-use or unify the codes from beginning, it
>> would be a little bit easier. Looks like the differences were mainly:
>>
>> 1) MSI-X support
>> 2) offloading support through virtio-net header
>> 3) trace points
>> 4) other new functions through e1000e specific registers
>>
>> So we could first unify the code through implementing the support of 2
>> and 3 for e1000. For MSI-X and other e1000e specific new functions, it
>> could be done through:
>>
>> 1) model specific callbacks, e.g realize, transmission and reception
>> 2) A new register flag, e.g. PHY_RW_E1000E, which means the register is
>> for e1000e only. Or even model-specific writeops and readops
>> 3) For other subtle differences, it could be done in the code by
>> checking the model explicitly.
>>
>> What's your opinion? (A good example of code sharing is freebsd's e1000
>> driver which covers both 8254x and 8257x).
>
>
> Hi Jason,
>
> This is exactly how we started.
>
> Issues that made us split the code base were following:
>
> 1. The majority of registers are different. Even same registers in
> many cases have different bits and require different processing logic.
> This introduced too much if’s into the code;

Then we can probably have different writeops and readops. This way, we
can at least share the code for the common registers.
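For illustration, the per-model readop/writeop idea might look roughly like the sketch below. All names here (the `Model` struct, `REG_*`, the flag macros) are invented for the sketch and are not identifiers from QEMU; the point is only that common registers go through one shared path while a per-register flag gates model-specific ones.

```c
#include <stdint.h>

/* Illustrative register indices; REG_GPIE stands in for an
 * e1000e-only register. */
enum { REG_CTRL, REG_STATUS, REG_GPIE, NREGS };

#define F_COMMON  (1u << 0)   /* present on both models          */
#define F_E1000E  (1u << 1)   /* present on the new device only  */

static const uint8_t reg_flags[NREGS] = {
    [REG_CTRL]   = F_COMMON,
    [REG_STATUS] = F_COMMON,
    [REG_GPIE]   = F_E1000E,
};

typedef struct Model {
    uint32_t regs[NREGS];
    uint8_t  flag;            /* 0 for e1000, F_E1000E for e1000e */
} Model;

/* Shared readop: a model-specific register is simply absent
 * (reads as zero) on the other model. */
static uint32_t reg_read(Model *m, int reg)
{
    return (reg_flags[reg] & (F_COMMON | m->flag)) ? m->regs[reg] : 0;
}

/* Shared writeop: writes to registers the model lacks are dropped. */
static void reg_write(Model *m, int reg, uint32_t val)
{
    if (reg_flags[reg] & (F_COMMON | m->flag)) {
        m->regs[reg] = val;
    }
}
```

With this shape, the common-register code exists once, and adding an e1000e-only register is a one-line flag-table change rather than a fork of the whole register handler.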

> 2. The data path is totally different not just because of virtio
> headers but also because these devices use different descriptor
> layouts and require different logic in order to parse and fill those.
> There are legacy descriptors that look almost the same but of course
> we must support all descriptor types described by spec;

Yes, but it looks like only the extended RX/TX descriptors are
8257x-specific, and the 8257x fully supports both the legacy and context
descriptors of the 8254x. This gives us another chance to reuse the code.
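The reuse opportunity here can be sketched as follows. Both device families use 16-byte TX descriptors and, as I read the specs, the DEXT bit (bit 29 of the second dword) selects the extended format, so a shared parser can handle legacy descriptors and hand extended ones off to 8257x-specific code. The field layout below is simplified and illustrative, not lifted from the datasheet.

```c
#include <stdint.h>
#include <stdbool.h>

#define TXD_CMD_DEXT  (1u << 29)   /* extended descriptor when set */

/* Simplified 16-byte TX descriptor shared by both formats. */
typedef struct TxDesc {
    uint64_t addr;    /* buffer address         */
    uint32_t lower;   /* length / CSO / CMD     */
    uint32_t upper;   /* status / CSS / special */
} TxDesc;

/* Dispatch point: legacy descriptors take the shared path,
 * extended ones go to 8257x-only handling. */
static bool txd_is_extended(const TxDesc *d)
{
    return (d->lower & TXD_CMD_DEXT) != 0;
}

/* Legacy-format helper usable by both device models. */
static uint16_t txd_legacy_len(const TxDesc *d)
{
    return (uint16_t)(d->lower & 0xffff);
}
```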

> 3. Interrupt handling logic is different because of MSI support;

Right, but this is not hard to address, probably a new helper.

> 4. Multi-queue and RSS make the code even less similar to the old device;

Yes, this could be in an 8257x-specific file.

> 5. Changes required to isolate shared code base required changes in
> migration tables and fishy tricks to preserve compatibility with
> current implementation;

Since the 8257x is a totally new device, it can have its own vmstate if
that is simpler to implement, and we don't even need to care about
migration compatibility.

> 6. e1000 code suffered from massive changes which are very hard to
> verify because spec is big and there are no drivers that use all
> device features.

Then we can try to change e1000 as little as possible, e.g. just factor
out the common logic into helpers and reuse them in e1000e.
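As one concrete example of logic worth factoring into a shared helper: both devices need the Internet checksum when emulating checksum offload in software for backends without virtio support. This sketch is my own illustration (not code from the series), following the RFC 1071 algorithm.

```c
#include <stdint.h>
#include <stddef.h>

/* Internet checksum (RFC 1071 style): sum 16-bit big-endian words,
 * fold carries, return the one's complement. Both e1000 and e1000e
 * SW-offload paths could call one helper like this. */
static uint16_t net_checksum(const uint8_t *data, size_t len)
{
    uint32_t sum = 0;
    size_t i;

    for (i = 0; i + 1 < len; i += 2) {
        sum += (uint32_t)data[i] << 8 | data[i + 1];
    }
    if (len & 1) {              /* odd trailing byte, padded with zero */
        sum += (uint32_t)data[len - 1] << 8;
    }
    while (sum >> 16) {         /* fold carries back into 16 bits */
        sum = (sum & 0xffff) + (sum >> 16);
    }
    return (uint16_t)~sum;
}
```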

>
> Overall, code for handling differences in device behaviours was bigger
> and more complicated then the device logic itself. The difference is
> not subtle when it comes to the full featured device implementation.
> As for FreeBSD’s driver, I’m not deeply familiar with its code but I
> suspect it works with device in legacy mode which is pretty similar to
> an old device indeed. Since we must support all modes our situation is
> different.

Yes, it does not use extended descriptor format.

>
> Again, I’m totally into shared code and would like to have some common
> code base anyway. Currently it looks like the best way to achieve this
> is to finish with all device features and then see what parts of logic
> may be shared between the old and the new devices. It’s better to have
> slight code duplication and smaller shared code base than to have
> bloated and tricky shared code that will introduce its own problems to
> both devices.

The code duplication is not slight in this RFC :). So the code could
potentially be unified. But I'm OK with evaluating this after all
features are developed.

Thanks

>
> Best Regards,
> Dmitry
>
>>
>>>
>>>> - For e1000e it self, since it was a new device, so no need to care
>>>> about compatibility stuffs (e.g auto negotiation and mit). We can just
>>>> enable them forever.
>>>
>>> Yes, we have this in plans.
>>>
>>>> - And for the generic packet abstraction layer, what's the
>>>> advantages of
>>>> this? If it has lot, maybe we can use it in other nic model (e.g
>>>> virtio-net)?
>>>
>>> These abstractions were initially developed by me as a part of vmxnet3
>>> device to be generic and re-usable. Their main advantage is support
>>> for virtio headers for virtio-enabled backends and emulation of
>>> network offloads in software for backends that do not support virtio.
>>>
>>> Of course they may be re-used by virtio, however I’m not sure if it
>>> will be really useful because virtio has feature negotiation
>>> facilities and do not require SW emulation for network task offloads.
>>>
>>> For other devices they are useful because each and every device that
>>> requires SW offloads implementation need to do exactly the same things
>>> and it doesn’t make sense to have a few implementations for this.
>>>
>>> Best Regards,
>>> Dmitry
>>
>> Ok, thanks for the explanation.
>>
>>>
>>>>
>>>> Thanks
>>>>
>>>>>
>>>>> Since 2.5 is in soft freeze, this looks a 2.6 material.
>

^ permalink raw reply	[flat|nested] 15+ messages in thread

* Re: [Qemu-devel] [RFC PATCH 0/5] Introduce Intel 82574 GbE Controller Emulation (e1000e)
  2015-11-03  5:44           ` Jason Wang
@ 2015-11-03  9:17             ` Dmitry Fleytman
  0 siblings, 0 replies; 15+ messages in thread
From: Dmitry Fleytman @ 2015-11-03  9:17 UTC (permalink / raw)
  To: Jason Wang; +Cc: Leonid Bloch, Leonid Bloch, qemu-devel, Shmulik Ladkani



> On 3 Nov 2015, at 07:44 AM, Jason Wang <jasowang@redhat.com> wrote:
> 
> 
> 
> On 11/02/2015 03:49 PM, Dmitry Fleytman wrote:
>> 
>>> On 2 Nov 2015, at 05:35 AM, Jason Wang <jasowang@redhat.com <mailto:jasowang@redhat.com>
>>> <mailto:jasowang@redhat.com <mailto:jasowang@redhat.com>>> wrote:
>>> 
>>> 
>>> 
>>> On 10/31/2015 01:52 PM, Dmitry Fleytman wrote:
>>>> Hello Jason,
>>>> 
>>>> Thanks for reviewing. See my answers inline.
>>>> 
>>>> 
>>>>> On 30 Oct 2015, at 07:28 AM, Jason Wang <jasowang@redhat.com <mailto:jasowang@redhat.com>
>>>>> <mailto:jasowang@redhat.com <mailto:jasowang@redhat.com>>
>>>>> <mailto:jasowang@redhat.com>> wrote:
>>>>> 
>>>>> 
>>>>> 
>>>>> On 10/28/2015 01:44 PM, Jason Wang wrote:
>>>>>> 
>>>>>> On 10/26/2015 01:00 AM, Leonid Bloch wrote:
>>>>>>> Hello qemu-devel,
>>>>>>> 
>>>>>>> This patch series is an RFC for the new networking device emulation
>>>>>>> we're developing for QEMU.
>>>>>>> 
>>>>>>> This new device emulates the Intel 82574 GbE Controller and works
>>>>>>> with unmodified Intel e1000e drivers from the Linux/Windows kernels.
>>>>>>> 
>>>>>>> The status of the current series is "Functional Device Ready, work
>>>>>>> on Extended Features in Progress".
>>>>>>> 
>>>>>>> More precisely, these patches represent a functional device, which
>>>>>>> is recognized by the standard Intel drivers, and is able to transfer
>>>>>>> TX/RX packets with CSO/TSO offloads, according to the spec.
>>>>>>> 
>>>>>>> Extended features not supported yet (work in progress):
>>>>>>> 1. TX/RX Interrupt moderation mechanisms
>>>>>>> 2. RSS
>>>>>>> 3. Full-featured multi-queue (use of multiqueued network backend)
>>>>>>> 
>>>>>>> Also, there will be some code refactoring and performance
>>>>>>> optimization efforts.
>>>>>>> 
>>>>>>> This series was tested on Linux (Fedora 22) and Windows (2012R2)
>>>>>>> guests, using Iperf, with TX/RX and TCP/UDP streams, and various
>>>>>>> packet sizes.
>>>>>>> 
>>>>>>> More thorough testing, including data streams with different MTU
>>>>>>> sizes, and Microsoft Certification (HLK) tests, are pending missing
>>>>>>> features' development.
>>>>>>> 
>>>>>>> See commit messages (esp. "net: Introduce e1000e device emulation")
>>>>>>> for more information about the development approaches and the
>>>>>>> architecture options chosen for this device.
>>>>>>> 
>>>>>>> This series is based upon v2.3.0 tag of the upstream QEMU repository,
>>>>>>> and it will be rebased to latest before the final submission.
>>>>>>> 
>>>>>>> Please share your thoughts - any feedback is highly welcomed :)
>>>>>>> 
>>>>>>> Best Regards,
>>>>>>> Dmitry Fleytman.
>>>>>> Thanks for the series. Will go through this in next few days.
>>>>> 
>>>>> Have a quick glance at the series, got the following questions:
>>>>> 
>>>>> - Though e1000e differs from e1000 in many places, I still see lots of
>>>>> code duplications. We need consider to reuse e1000.c (or at least part
>>>>> of). I believe we don't want to fix a bug twice in two places in the
>>>>> future and I expect hundreds of lines could be saved through this way.
>>>> 
>>>> That’s a good question :)
>>>> 
>>>> This is how we started, we had a common “core” code base meant to
>>>> implement all common logic (this split is still present in the patches
>>>> - there are e1000e_core.c and e1000e.c files).
>>>> Unfortunately at some point it turned out that there are more
>>>> differences that commons. We noticed that the code becomes filled with
>>>> many minor differences handling.
>>>> This also made the code base more complicated and harder to follow.
>>>> 
>>>> So at some point of time it was decided to split the code base and
>>>> revert all changes done to the e1000 device (except a few
>>>> fixes/improvements Leonid submitted a few days ago).
>>>> 
>>>> Although there was common code between devices, total SLOC of e1000
>>>> and e1000e devices became smaller after the split.
>>>> 
>>>> Amount of code that may be shared between devices will be even smaller
>>>> after we complete the implementation which still misses a few features
>>>> (see cover letter) that will change many things.
>>>> 
>>>> Still after the device implementation is done, we plan to review code
>>>> similarities again to see if there are possibilities for code sharing.
>>> 
>>> I see, but if we can try to re-use or unify the codes from beginning, it
>>> would be a little bit easier. Looks like the differences were mainly:
>>> 
>>> 1) MSI-X support
>>> 2) offloading support through virtio-net header
>>> 3) trace points
>>> 4) other new functions through e1000e specific registers
>>> 
>>> So we could first unify the code through implementing the support of 2
>>> and 3 for e1000. For MSI-X and other e1000e specific new functions, it
>>> could be done through:
>>> 
>>> 1) model specific callbacks, e.g realize, transmission and reception
>>> 2) A new register flag, e.g. PHY_RW_E1000E, which means the register is
>>> for e1000e only. Or even model-specific writeops and readops
>>> 3) For other subtle differences, it could be done in the code by
>>> checking the model explicitly.
>>> 
>>> What's your opinion? (A good example of code sharing is freebsd's e1000
>>> driver which covers both 8254x and 8257x).
>> 
>> 
>> Hi Jason,
>> 
>> This is exactly how we started.
>> 
>> Issues that made us split the code base were following:
>> 
>> 1. The majority of registers are different. Even same registers in
>> many cases have different bits and require different processing logic.
>> This introduced too much if’s into the code;
> 
> Then we can probably have different writeops and readops. This way, we
> can at least share the code for the common registers.
> 
>> 2. The data path is totally different not just because of virtio
>> headers but also because these devices use different descriptor
>> layouts and require different logic in order to parse and fill those.
>> There are legacy descriptors that look almost the same but of course
>> we must support all descriptor types described by spec;
> 
> Yes, but it looks like only the extended RX/TX descriptors are
> 8257x-specific, and the 8257x fully supports both the legacy and context
> descriptors of the 8254x. This gives us another chance to reuse the code.
> 
>> 3. Interrupt handling logic is different because of MSI support;
> 
> Right, but this is not hard to address, probably a new helper.
> 
>> 4. Multi-queue and RSS make the code even less similar to the old device;
> 
> Yes, this could be in an 8257x-specific file.
> 
>> 5. Changes required to isolate shared code base required changes in
>> migration tables and fishy tricks to preserve compatibility with
>> current implementation;
> 
> Since the 8257x is a totally new device, it can have its own vmstate if
> that is simpler to implement, and we don't even need to care about
> migration compatibility.
> 
>> 6. e1000 code suffered from massive changes which are very hard to
>> verify because spec is big and there are no drivers that use all
>> device features.
> 
> Then we can try to change e1000 as little as possible, e.g. just factor
> out the common logic into helpers and reuse them in e1000e.
> 
>> 
>> Overall, code for handling differences in device behaviours was bigger
>> and more complicated then the device logic itself. The difference is
>> not subtle when it comes to the full featured device implementation.
>> As for FreeBSD’s driver, I’m not deeply familiar with its code but I
>> suspect it works with device in legacy mode which is pretty similar to
>> an old device indeed. Since we must support all modes our situation is
>> different.
> 
> Yes, it does not use extended descriptor format.
> 
>> 
>> Again, I’m totally into shared code and would like to have some common
>> code base anyway. Currently it looks like the best way to achieve this
>> is to finish with all device features and then see what parts of logic
>> may be shared between the old and the new devices. It’s better to have
>> slight code duplication and smaller shared code base than to have
>> bloated and tricky shared code that will introduce its own problems to
>> both devices.
> 
> The code duplication is not slight in this RFC :). So the code could
> potentially be unified. But I'm OK with evaluating this after all
> features are developed.

Hello Jason,


I’m strongly into making code as much shared as possible :)
Let’s see how it goes with the final code…

Thanks for your comments,
Dmitry.

> 
> Thanks
> 
>> 
>> Best Regards,
>> Dmitry
>> 
>>> 
>>>> 
>>>>> - For e1000e it self, since it was a new device, so no need to care
>>>>> about compatibility stuffs (e.g auto negotiation and mit). We can just
>>>>> enable them forever.
>>>> 
>>>> Yes, we have this in plans.
>>>> 
>>>>> - And for the generic packet abstraction layer, what's the
>>>>> advantages of
>>>>> this? If it has lot, maybe we can use it in other nic model (e.g
>>>>> virtio-net)?
>>>> 
>>>> These abstractions were initially developed by me as a part of vmxnet3
>>>> device to be generic and re-usable. Their main advantage is support
>>>> for virtio headers for virtio-enabled backends and emulation of
>>>> network offloads in software for backends that do not support virtio.
>>>> 
>>>> Of course they may be re-used by virtio, however I’m not sure if it
>>>> will be really useful because virtio has feature negotiation
>>>> facilities and do not require SW emulation for network task offloads.
>>>> 
>>>> For other devices they are useful because each and every device that
>>>> requires SW offloads implementation need to do exactly the same things
>>>> and it doesn’t make sense to have a few implementations for this.
>>>> 
>>>> Best Regards,
>>>> Dmitry
>>> 
>>> Ok, thanks for the explanation.
>>> 
>>>> 
>>>>> 
>>>>> Thanks
>>>>> 
>>>>>> 
>>>>>> Since 2.5 is in soft freeze, this looks a 2.6 material.



^ permalink raw reply	[flat|nested] 15+ messages in thread

* Re: [Qemu-devel] [RFC PATCH 0/5] Introduce Intel 82574 GbE Controller Emulation (e1000e)
  2015-10-25 17:00 [Qemu-devel] [RFC PATCH 0/5] Introduce Intel 82574 GbE Controller Emulation (e1000e) Leonid Bloch
                   ` (5 preceding siblings ...)
  2015-10-28  5:44 ` [Qemu-devel] [RFC PATCH 0/5] Introduce Intel 82574 GbE Controller Emulation (e1000e) Jason Wang
@ 2016-01-13  4:43 ` Prem Mallappa
  6 siblings, 0 replies; 15+ messages in thread
From: Prem Mallappa @ 2016-01-13  4:43 UTC (permalink / raw)
  To: Leonid Bloch, qemu-devel

On 10/25/2015 10:30 PM, Leonid Bloch wrote:
> Hello qemu-devel,
>
> This patch series is an RFC for the new networking device emulation
> we're developing for QEMU.
>
Thanks for the patch

/Prem

Tested By: Prem Mallappa <pmallapp@broadcom.com>

^ permalink raw reply	[flat|nested] 15+ messages in thread

end of thread, other threads:[~2016-01-13  4:43 UTC | newest]

Thread overview: 15+ messages (download: mbox.gz / follow: Atom feed)
-- links below jump to the message on this page --
2015-10-25 17:00 [Qemu-devel] [RFC PATCH 0/5] Introduce Intel 82574 GbE Controller Emulation (e1000e) Leonid Bloch
2015-10-25 17:00 ` [Qemu-devel] [RFC PATCH 1/5] net: Add macros for ETH address tracing Leonid Bloch
2015-10-25 17:00 ` [Qemu-devel] [RFC PATCH 2/5] net_pkt: Name vmxnet3 packet abstractions more generic Leonid Bloch
2015-10-25 17:00 ` [Qemu-devel] [RFC PATCH 3/5] net_pkt: Extend packet abstraction as requied by e1000e functionality Leonid Bloch
2015-10-25 17:00 ` [Qemu-devel] [RFC PATCH 4/5] e1000_regs: Add definitions for Intel 82574-specific bits Leonid Bloch
2015-10-25 17:00 ` [Qemu-devel] [RFC PATCH 5/5] net: Introduce e1000e device emulation Leonid Bloch
2015-10-28  5:44 ` [Qemu-devel] [RFC PATCH 0/5] Introduce Intel 82574 GbE Controller Emulation (e1000e) Jason Wang
2015-10-28  6:11   ` Dmitry Fleytman
2015-10-30  5:28   ` Jason Wang
2015-10-31  5:52     ` Dmitry Fleytman
2015-11-02  3:35       ` Jason Wang
2015-11-02  7:49         ` Dmitry Fleytman
2015-11-03  5:44           ` Jason Wang
2015-11-03  9:17             ` Dmitry Fleytman
2016-01-13  4:43 ` Prem Mallappa
