From: "Qiu, Michael"
Subject: Re: [PATCH] vhost: broadcast RARP pkt by injecting it to receiving mbuf array
Date: Wed, 24 Feb 2016 08:15:36 +0000
Message-ID: <533710CFB86FA344BFBF2D6802E6028622F54AB4@SHSMSX101.ccr.corp.intel.com>
References: <20160219070326.GR21426@yliu-dev.sh.intel.com> <1456151771-15382-1-git-send-email-yuanhan.liu@linux.intel.com>
Cc: Victor Kaplansky, "Michael S. Tsirkin"
To: Yuanhan Liu, "dev@dpdk.org"
List-Id: patches and discussions about DPDK

On 2/22/2016 10:35 PM, Yuanhan Liu wrote:
> Broadcast RARP packet by injecting it to receiving mbuf array at
> rte_vhost_dequeue_burst().
>
> Commit 33226236a35e ("vhost: handle request to send RARP") iterates
> all host interfaces and then broadcast it by all of them. It did
> notify the switches about the new location of the migrated VM, however,
> the mac learning table in the target host is wrong (at least in my
> test with OVS):
>
>     $ ovs-appctl fdb/show ovsbr0
>      port  VLAN  MAC                Age
>         1     0  b6:3c:72:71:cd:4d   10
>     LOCAL     0  b6:3c:72:71:cd:4e   10
>     LOCAL     0  52:54:00:12:34:68    9
>         1     0  56:f6:64:2c:bc:c0    1
>
> Where 52:54:00:12:34:68 is the mac of the VM. As you can see from the
> above, the port learned is "LOCAL", which is the "ovsbr0" port. That
> is reasonable, since we indeed send the pkt by the "ovsbr0" interface.
>
> The wrong mac table lead all the packets to the VM go to the "ovsbr0"
> in the end, which ends up with all packets being lost, until the guest
> send a ARP quest (or reply) to refresh the mac learning table.
>
> Jianfeng then came up with a solution I have thought of firstly but NAKed

Is it suitable to mention someone in the commit log?

Thanks,
Michael
> by myself, concerning it has potential issues [0]. The solution is as title
> stated: broadcast the RARP packet by injecting it to the receiving mbuf
> arrays at rte_vhost_dequeue_burst(). The re-bring of that idea made me
> think it twice; it looked like a false concern to me then. And I had done
> a rough verification: it worked as expected.
>
> [0]: http://dpdk.org/ml/archives/dev/2016-February/033527.html
>
> Another note is that while preparing this version, I found that DPDK has
> some ARP related structures and macros defined. So, use them instead of
> the one from standard header files here.
>
> Cc: Thibaut Collet
> Suggested-by: Jianfeng Tan
> Signed-off-by: Yuanhan Liu
> ---
>  lib/librte_vhost/rte_virtio_net.h             |   5 +-
>  lib/librte_vhost/vhost_rxtx.c                 |  80 +++++++++++++++-
>  lib/librte_vhost/vhost_user/vhost-net-user.c  |   2 +-
>  lib/librte_vhost/vhost_user/virtio-net-user.c | 128 ++++----------------------
>  lib/librte_vhost/vhost_user/virtio-net-user.h |   2 +-
>  5 files changed, 104 insertions(+), 113 deletions(-)
>
> diff --git a/lib/librte_vhost/rte_virtio_net.h b/lib/librte_vhost/rte_virtio_net.h
> index 4a2303a..7d1fde2 100644
> --- a/lib/librte_vhost/rte_virtio_net.h
> +++ b/lib/librte_vhost/rte_virtio_net.h
> @@ -49,6 +49,7 @@
>
>  #include
>  #include
> +#include
>
>  struct rte_mbuf;
>
> @@ -133,7 +134,9 @@ struct virtio_net {
>  	void *priv; /**< private context */
>  	uint64_t log_size; /**< Size of log area */
>  	uint64_t log_base; /**< Where dirty pages are logged */
> -	uint64_t reserved[62]; /**< Reserve some spaces for future extension. */
> +	struct ether_addr mac; /**< MAC address */
> +	rte_atomic16_t broadcast_rarp; /**< A flag to tell if we need broadcast rarp packet */
> +	uint64_t reserved[61]; /**< Reserve some spaces for future extension. */
>  	struct vhost_virtqueue *virtqueue[VHOST_MAX_QUEUE_PAIRS * 2]; /**< Contains all virtqueue information. */
>  } __rte_cache_aligned;
>
> diff --git a/lib/librte_vhost/vhost_rxtx.c b/lib/librte_vhost/vhost_rxtx.c
> index 12ce0cc..9d23eb1 100644
> --- a/lib/librte_vhost/vhost_rxtx.c
> +++ b/lib/librte_vhost/vhost_rxtx.c
> @@ -43,6 +43,7 @@
>  #include
>  #include
>  #include
> +#include
>
>  #include "vhost-net.h"
>
> @@ -761,11 +762,50 @@ vhost_dequeue_offload(struct virtio_net_hdr *hdr, struct rte_mbuf *m)
>  	}
>  }
>
> +#define RARP_PKT_SIZE 64
> +
> +static int
> +make_rarp_packet(struct rte_mbuf *rarp_mbuf, const struct ether_addr *mac)
> +{
> +	struct ether_hdr *eth_hdr;
> +	struct arp_hdr *rarp;
> +
> +	if (rarp_mbuf->buf_len < 64) {
> +		RTE_LOG(WARNING, VHOST_DATA,
> +			"failed to make RARP; mbuf size too small %u (< %d)\n",
> +			rarp_mbuf->buf_len, RARP_PKT_SIZE);
> +		return -1;
> +	}
> +
> +	/* Ethernet header. */
> +	eth_hdr = rte_pktmbuf_mtod_offset(rarp_mbuf, struct ether_hdr *, 0);
> +	memset(eth_hdr->d_addr.addr_bytes, 0xff, ETHER_ADDR_LEN);
> +	ether_addr_copy(mac, &eth_hdr->s_addr);
> +	eth_hdr->ether_type = htons(ETHER_TYPE_RARP);
> +
> +	/* RARP header. */
> +	rarp = (struct arp_hdr *)(eth_hdr + 1);
> +	rarp->arp_hrd = htons(ARP_HRD_ETHER);
> +	rarp->arp_pro = htons(ETHER_TYPE_IPv4);
> +	rarp->arp_hln = ETHER_ADDR_LEN;
> +	rarp->arp_pln = 4;
> +	rarp->arp_op = htons(ARP_OP_REVREQUEST);
> +
> +	ether_addr_copy(mac, &rarp->arp_data.arp_sha);
> +	ether_addr_copy(mac, &rarp->arp_data.arp_tha);
> +	memset(&rarp->arp_data.arp_sip, 0x00, 4);
> +	memset(&rarp->arp_data.arp_tip, 0x00, 4);
> +
> +	rarp_mbuf->pkt_len = rarp_mbuf->data_len = RARP_PKT_SIZE;
> +
> +	return 0;
> +}
> +
>  uint16_t
>  rte_vhost_dequeue_burst(struct virtio_net *dev, uint16_t queue_id,
>  	struct rte_mempool *mbuf_pool, struct rte_mbuf **pkts, uint16_t count)
>  {
> -	struct rte_mbuf *m, *prev;
> +	struct rte_mbuf *m, *prev, *rarp_mbuf = NULL;
>  	struct vhost_virtqueue *vq;
>  	struct vring_desc *desc;
>  	uint64_t vb_addr = 0;
> @@ -788,11 +828,34 @@ rte_vhost_dequeue_burst(struct virtio_net *dev, uint16_t queue_id,
>  	if (unlikely(vq->enabled == 0))
>  		return 0;
>
> +	/*
> +	 * Construct a RARP broadcast packet, and inject it to the "pkts"
> +	 * array, to looks like that guest actually send such packet.
> +	 *
> +	 * Check user_send_rarp() for more information.
> +	 */
> +	if (unlikely(rte_atomic16_cmpset((volatile uint16_t *)
> +			&dev->broadcast_rarp.cnt, 1, 0))) {
> +		rarp_mbuf = rte_pktmbuf_alloc(mbuf_pool);
> +		if (rarp_mbuf == NULL) {
> +			RTE_LOG(ERR, VHOST_DATA,
> +				"Failed to allocate memory for mbuf.\n");
> +			return 0;
> +		}
> +
> +		if (make_rarp_packet(rarp_mbuf, &dev->mac)) {
> +			rte_pktmbuf_free(rarp_mbuf);
> +			rarp_mbuf = NULL;
> +		} else {
> +			count -= 1;
> +		}
> +	}
> +
>  	avail_idx = *((volatile uint16_t *)&vq->avail->idx);
>
>  	/* If there are no available buffers then return. */
>  	if (vq->last_used_idx == avail_idx)
> -		return 0;
> +		goto out;
>
>  	LOG_DEBUG(VHOST_DATA, "%s (%"PRIu64")\n", __func__,
>  		dev->device_fh);
> @@ -983,8 +1046,21 @@ rte_vhost_dequeue_burst(struct virtio_net *dev, uint16_t queue_id,
>  	vq->used->idx += entry_success;
>  	vhost_log_used_vring(dev, vq, offsetof(struct vring_used, idx),
>  		sizeof(vq->used->idx));
> +
>  	/* Kick guest if required. */
>  	if (!(vq->avail->flags & VRING_AVAIL_F_NO_INTERRUPT))
>  		eventfd_write(vq->callfd, (eventfd_t)1);
> +
> +out:
> +	if (unlikely(rarp_mbuf != NULL)) {
> +		/*
> +		 * Inject it to the head of "pkts" array, so that switch's mac
> +		 * learning table will get updated first.
> +		 */
> +		memmove(&pkts[1], pkts, entry_success * sizeof(m));
> +		pkts[0] = rarp_mbuf;
> +		entry_success += 1;
> +	}
> +
>  	return entry_success;
>  }
> diff --git a/lib/librte_vhost/vhost_user/vhost-net-user.c b/lib/librte_vhost/vhost_user/vhost-net-user.c
> index de7eecb..df2bd64 100644
> --- a/lib/librte_vhost/vhost_user/vhost-net-user.c
> +++ b/lib/librte_vhost/vhost_user/vhost-net-user.c
> @@ -437,7 +437,7 @@ vserver_message_handler(int connfd, void *dat, int *remove)
>  		user_set_vring_enable(ctx, &msg.payload.state);
>  		break;
>  	case VHOST_USER_SEND_RARP:
> -		user_send_rarp(&msg);
> +		user_send_rarp(ctx, &msg);
>  		break;
>
>  	default:
> diff --git a/lib/librte_vhost/vhost_user/virtio-net-user.c b/lib/librte_vhost/vhost_user/virtio-net-user.c
> index 68b24f4..65b5652 100644
> --- a/lib/librte_vhost/vhost_user/virtio-net-user.c
> +++ b/lib/librte_vhost/vhost_user/virtio-net-user.c
> @@ -39,12 +39,6 @@
>  #include
>  #include
>  #include
> -#include
> -#include
> -#include
> -#include
> -#include
> -#include
>
>  #include
>  #include
> @@ -415,120 +409,38 @@ user_set_log_base(struct vhost_device_ctx ctx,
>  	return 0;
>  }
>
> -#define RARP_BUF_SIZE 64
> -
> -static void
> -make_rarp_packet(uint8_t *buf, uint8_t *mac)
> -{
> -	struct ether_header *eth_hdr;
> -	struct ether_arp *rarp;
> -
> -	/* Ethernet header. */
> -	eth_hdr = (struct ether_header *)buf;
> -	memset(&eth_hdr->ether_dhost, 0xff, ETH_ALEN);
> -	memcpy(&eth_hdr->ether_shost, mac, ETH_ALEN);
> -	eth_hdr->ether_type = htons(ETH_P_RARP);
> -
> -	/* RARP header. */
> -	rarp = (struct ether_arp *)(eth_hdr + 1);
> -	rarp->ea_hdr.ar_hrd = htons(ARPHRD_ETHER);
> -	rarp->ea_hdr.ar_pro = htons(ETHERTYPE_IP);
> -	rarp->ea_hdr.ar_hln = ETH_ALEN;
> -	rarp->ea_hdr.ar_pln = 4;
> -	rarp->ea_hdr.ar_op = htons(ARPOP_RREQUEST);
> -
> -	memcpy(&rarp->arp_sha, mac, ETH_ALEN);
> -	memset(&rarp->arp_spa, 0x00, 4);
> -	memcpy(&rarp->arp_tha, mac, 6);
> -	memset(&rarp->arp_tpa, 0x00, 4);
> -}
> -
> -
> -static void
> -send_rarp(const char *ifname, uint8_t *rarp)
> -{
> -	int fd;
> -	struct ifreq ifr;
> -	struct sockaddr_ll addr;
> -
> -	fd = socket(AF_PACKET, SOCK_RAW, 0);
> -	if (fd < 0) {
> -		perror("socket failed");
> -		return;
> -	}
> -
> -	memset(&ifr, 0, sizeof(struct ifreq));
> -	strncpy(ifr.ifr_name, ifname, IFNAMSIZ);
> -	if (ioctl(fd, SIOCGIFINDEX, &ifr) < 0) {
> -		perror("failed to get interface index");
> -		close(fd);
> -		return;
> -	}
> -
> -	addr.sll_ifindex = ifr.ifr_ifindex;
> -	addr.sll_halen = ETH_ALEN;
> -
> -	if (sendto(fd, rarp, RARP_BUF_SIZE, 0,
> -		(const struct sockaddr*)&addr, sizeof(addr)) < 0) {
> -		perror("send rarp packet failed");
> -	}
> -}
> -
> -
>  /*
> - * Broadcast a RARP message to all interfaces, to update
> - * switch's mac table
> + * An rarp packet is constructed and broadcasted to notify switches about
> + * the new location of the migrated VM, so that packets from outside will
> + * not be lost after migration.
> + *
> + * However, we don't actually "send" a rarp packet here, instead, we set
> + * a flag 'broadcast_rarp' to let rte_vhost_dequeue_burst() inject it.
>   */
>  int
> -user_send_rarp(struct VhostUserMsg *msg)
> +user_send_rarp(struct vhost_device_ctx ctx, struct VhostUserMsg *msg)
>  {
> +	struct virtio_net *dev;
>  	uint8_t *mac = (uint8_t *)&msg->payload.u64;
> -	uint8_t rarp[RARP_BUF_SIZE];
> -	struct ifconf ifc = {0, };
> -	struct ifreq *ifr;
> -	int nr = 16;
> -	int fd;
> -	uint32_t i;
> +
> +	dev = get_device(ctx);
> +	if (!dev)
> +		return -1;
>
>  	RTE_LOG(DEBUG, VHOST_CONFIG,
>  		":: mac: %02x:%02x:%02x:%02x:%02x:%02x\n",
>  		mac[0], mac[1], mac[2], mac[3], mac[4], mac[5]);
> -
> -	make_rarp_packet(rarp, mac);
> +	memcpy(dev->mac.addr_bytes, mac, 6);
>
>  	/*
> -	 * Get all interfaces
> +	 * Set the flag to inject a RARP broadcast packet at
> +	 * rte_vhost_dequeue_burst().
> +	 *
> +	 * rte_smp_wmb() is for making sure the mac is copied
> +	 * before the flag is set.
>  	 */
> -	fd = socket(AF_INET, SOCK_DGRAM, 0);
> -	if (fd < 0) {
> -		perror("failed to create AF_INET socket");
> -		return -1;
> -	}
> -
> -again:
> -	ifc.ifc_len = sizeof(*ifr) * nr;
> -	ifc.ifc_buf = realloc(ifc.ifc_buf, ifc.ifc_len);
> -
> -	if (ioctl(fd, SIOCGIFCONF, &ifc) < 0) {
> -		perror("failed at SIOCGIFCONF");
> -		close(fd);
> -		return -1;
> -	}
> -
> -	if (ifc.ifc_len == (int)sizeof(struct ifreq) * nr) {
> -		/*
> -		 * current ifc_buf is not big enough to hold
> -		 * all interfaces; double it and try again.
> -		 */
> -		nr *= 2;
> -		goto again;
> -	}
> -
> -	ifr = (struct ifreq *)ifc.ifc_buf;
> -	for (i = 0; i < ifc.ifc_len / sizeof(struct ifreq); i++)
> -		send_rarp(ifr[i].ifr_name, rarp);
> -
> -	close(fd);
> +	rte_smp_wmb();
> +	rte_atomic16_set(&dev->broadcast_rarp, 1);
>
>  	return 0;
>  }
> diff --git a/lib/librte_vhost/vhost_user/virtio-net-user.h b/lib/librte_vhost/vhost_user/virtio-net-user.h
> index 559bb46..cefec16 100644
> --- a/lib/librte_vhost/vhost_user/virtio-net-user.h
> +++ b/lib/librte_vhost/vhost_user/virtio-net-user.h
> @@ -54,7 +54,7 @@ void user_set_vring_kick(struct vhost_device_ctx, struct VhostUserMsg *);
>  void user_set_protocol_features(struct vhost_device_ctx ctx,
>  	uint64_t protocol_features);
>  int user_set_log_base(struct vhost_device_ctx ctx, struct VhostUserMsg *);
> -int user_send_rarp(struct VhostUserMsg *);
> +int user_send_rarp(struct vhost_device_ctx ctx, struct VhostUserMsg *);
>
>  int user_get_vring_base(struct vhost_device_ctx, struct vhost_vring_state *);
>
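
For anyone following the thread, the handoff the patch describes (store the MAC, raise a flag, let the dequeue path consume the flag and put a RARP frame at the head of the returned array) can be sketched outside DPDK roughly as below. This is only an illustration under stated assumptions, not the patch's code: it uses C11 atomics in place of rte_smp_wmb()/rte_atomic16_*, plain byte buffers in place of mbufs, and every demo_* name is hypothetical.

```c
#include <assert.h>
#include <stdatomic.h>
#include <stdint.h>
#include <string.h>

#define MAC_LEN        6
#define RARP_FRAME_LEN 64	/* same size the patch uses for RARP_PKT_SIZE */

/* Hypothetical stand-in for the two fields added to struct virtio_net. */
struct demo_dev {
	uint8_t    mac[MAC_LEN];
	atomic_int broadcast_rarp;
};

/* Control path: publish the MAC first, then raise the flag. The release
 * store stands in for rte_smp_wmb() followed by rte_atomic16_set(). */
static void demo_request_rarp(struct demo_dev *dev, const uint8_t *mac)
{
	memcpy(dev->mac, mac, MAC_LEN);
	atomic_store_explicit(&dev->broadcast_rarp, 1, memory_order_release);
}

/* Build a broadcast RARP request whose sender and target are both `mac`. */
static void demo_make_rarp(uint8_t *frame, const uint8_t *mac)
{
	static const uint8_t fixed[] = {
		0x80, 0x35,	/* EtherType: RARP */
		0x00, 0x01,	/* hardware type: Ethernet */
		0x08, 0x00,	/* protocol type: IPv4 */
		MAC_LEN, 4,	/* hardware / protocol address lengths */
		0x00, 0x03,	/* opcode: reverse request */
	};

	memset(frame, 0, RARP_FRAME_LEN);	/* IP fields and padding stay zero */
	memset(frame, 0xff, MAC_LEN);		/* destination: broadcast */
	memcpy(frame + 6, mac, MAC_LEN);	/* source MAC */
	memcpy(frame + 12, fixed, sizeof(fixed));
	memcpy(frame + 22, mac, MAC_LEN);	/* sender hardware address */
	memcpy(frame + 32, mac, MAC_LEN);	/* target hardware address */
}

/* Data path: copy up to `count` packets into `pkts`; if the flag was set,
 * reserve one slot and move the RARP frame to index 0. */
static unsigned demo_dequeue(struct demo_dev *dev, const uint8_t **pkts,
			     unsigned count, const uint8_t **guest,
			     unsigned nb_guest, uint8_t *rarp)
{
	int expected = 1;
	unsigned n;
	int inject;

	assert(dev != NULL && pkts != NULL);
	if (count == 0)
		return 0;

	/* Consume the flag exactly once (1 -> 0), like rte_atomic16_cmpset(). */
	inject = atomic_compare_exchange_strong_explicit(&dev->broadcast_rarp,
			&expected, 0, memory_order_acquire, memory_order_relaxed);
	if (inject) {
		demo_make_rarp(rarp, dev->mac);
		count -= 1;
	}

	for (n = 0; n < count && n < nb_guest; n++)
		pkts[n] = guest[n];

	if (inject) {
		/* Head of the array, so the switch learns the MAC first. */
		memmove(&pkts[1], pkts, n * sizeof(pkts[0]));
		pkts[0] = rarp;
		n += 1;
	}
	return n;
}
```

A caller would invoke demo_request_rarp() from the control thread when the migration notification arrives; the next demo_dequeue() call on the polling thread then returns the RARP frame ahead of any guest packets, and later calls behave as before.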