From: Rusty Russell <rusty@rustcorp.com.au> To: "Michael S. Tsirkin" <mst@redhat.com> Cc: netdev@vger.kernel.org, virtualization@lists.linux-foundation.org, kvm@vger.kernel.org, linux-kernel@vger.kernel.org, mingo@elte.hu, linux-mm@kvack.org, akpm@linux-foundation.org, hpa@zytor.com, gregory.haskins@gmail.com, s.hetze@linux-ag.com, Daniel Walker <dwalker@fifo99.com>, Eric Dumazet <eric.dumazet@gmail.com>, "Paul E. McKenney" <paulmck@linux.vnet.ibm.com> Subject: Re: [PATCHv8 3/3] vhost_net: a kernel-level virtio server Date: Fri, 6 Nov 2009 15:29:17 +1030 [thread overview] Message-ID: <200911061529.17500.rusty@rustcorp.com.au> (raw) In-Reply-To: <20091104155724.GD32673@redhat.com> On Thu, 5 Nov 2009 02:27:24 am Michael S. Tsirkin wrote: > What it is: vhost net is a character device that can be used to reduce > the number of system calls involved in virtio networking. Hi Michael, Now everyone else has finally kicked all the tires and it seems to pass, I've done a fairly complete review. Generally, it's really nice; just one bug and a few minor suggestions for polishing. > +/* Caller must have TX VQ lock */ > +static void tx_poll_stop(struct vhost_net *net) > +{ > + if (likely(net->tx_poll_state != VHOST_NET_POLL_STARTED)) > + return; likely? Really? > + for (;;) { > + head = vhost_get_vq_desc(&net->dev, vq, vq->iov, &out, &in, > + NULL, NULL); Danger! You need an arg to vhost_get_vq_desc to tell it the max desc size you can handle. Otherwise, it's only limited by ring size, and a malicious guest can overflow you here, and below: > + /* Skip header. TODO: support TSO. */ > + s = move_iovec_hdr(vq->iov, vq->hdr, hdr_size, out); ... > + > + use_mm(net->dev.mm); > + mutex_lock(&vq->mutex); > + vhost_no_notify(vq); I prefer a name like "vhost_disable_notify()". > + /* OK, now we need to know about added descriptors. */ > + if (head == vq->num && vhost_notify(vq)) > + /* They could have slipped one in as we were doing that: > + * check again. */ > + continue; > + /* Nothing new? Wait for eventfd to tell us they refilled. */ > + if (head == vq->num) > + break; > + /* We don't need to be notified again. */ > + vhost_no_notify(vq); Similarly, vhost_enable_notify. This one is particularly misleading since it doesn't actually notify anything! In particular, this code would be neater as: if (head == vq->num) { if (vhost_enable_notify(vq)) { /* Try again, they could have slipped one in. */ continue; } /* Nothing more to do. */ break; } vhost_disable_notify(vq); Now, AFAICT vhost_notify()/enable_notify() would be better rewritten to return true only when there's more pending. Saves a loop around here most of the time. Also, the vhost_no_notify/vhost_disable_notify() can be moved out of the loop entirely. (It could be under an if (unlikely(enabled)), not sure if it's worth it). > + len = err; > + err = memcpy_toiovec(vq->hdr, (unsigned char *)&hdr, hdr_size); That unsigned char * arg to memcpy_toiovec is annoying. A patch might be nice, separate from this effort. > +static int vhost_net_open(struct inode *inode, struct file *f) > +{ > + struct vhost_net *n = kzalloc(sizeof *n, GFP_KERNEL); > + int r; > + if (!n) > + return -ENOMEM; > + f->private_data = n; > + n->vqs[VHOST_NET_VQ_TX].handle_kick = handle_tx_kick; > + n->vqs[VHOST_NET_VQ_RX].handle_kick = handle_rx_kick; I have a personal dislike of calloc for structures. In userspace, it's because valgrind can't spot uninitialized fields. These days a similar argument applies in the kernel, because we have KMEMCHECK now. If someone adds a field to the struct and forgets to initialize it, we can spot it. > +static void vhost_net_enable_vq(struct vhost_net *n, int index) > +{ > + struct socket *sock = n->vqs[index].private_data; OK, I can't help but this that presenting the vqs as an array doesn't buy us very much. Esp. if you change vhost_dev_init to take a NULL-terminated varargs. I think readability would improve. It means passing a vq around rather than an index. Not completely sure it'll be a win tho. > +static long vhost_net_set_backend(struct vhost_net *n, unsigned index, int fd) > +{ > + struct socket *sock, *oldsock = NULL; ... > + sock = get_socket(fd); > + if (IS_ERR(sock)) { > + r = PTR_ERR(sock); > + goto done; > + } > + > + /* start polling new socket */ > + oldsock = vq->private_data; ... > +done: > + mutex_unlock(&n->dev.mutex); > + if (oldsock) { > + vhost_net_flush_vq(n, index); > + fput(oldsock->file); I dislike this style; I prefer multiple different goto points, one for when oldsock is set, and one for when it's not. That way, gcc warns us about uninitialized variables if we get it wrong. > +static long vhost_net_reset_owner(struct vhost_net *n) > +{ > + struct socket *tx_sock = NULL; > + struct socket *rx_sock = NULL; > + long r; This should be called "err", since that's what it is. > +static void vhost_net_set_features(struct vhost_net *n, u64 features) > +{ > + size_t hdr_size = features & (1 << VHOST_NET_F_VIRTIO_NET_HDR) ? > + sizeof(struct virtio_net_hdr) : 0; > + int i; > + mutex_lock(&n->dev.mutex); > + n->dev.acked_features = features; Why is this called "acked_features"? Not just "features"? I expected to see code which exposed these back to userspace, and didn't. > + case VHOST_GET_FEATURES: > + features = VHOST_FEATURES; > + return put_user(features, featurep); > + case VHOST_ACK_FEATURES: > + r = get_user(features, featurep); > + /* No features for now */ > + if (r < 0) > + return r; > + if (features & ~VHOST_FEATURES) > + return -EOPNOTSUPP; > + vhost_net_set_features(n, features); OK, from the userspace POV it's "get features" then "ack features". But I think "VHOST_SET_FEATURES" is more consistent, despite this usage. > + switch (ioctl) { > + case VHOST_SET_VRING_NUM: I haven't looked at your userspace implementation, but does a generic VHOST_SET_VRING_STATE & VHOST_GET_VRING_STATE with a struct make more sense? It'd be simpler here, but not sure if it'd be simpler to use? (Not the fd-setting ioctls of course) > + case VHOST_SET_VRING_LOG: > + r = copy_from_user(&a, argp, sizeof a); > + if (r < 0) > + break; > + if (a.padding) { > + r = -EOPNOTSUPP; > + break; > + } > + if (a.user_addr == VHOST_VRING_LOG_DISABLE) { > + vq->log_used = false; > + break; > + } > + if (a.user_addr & (sizeof *vq->used->ring - 1)) { > + r = -EINVAL; > + break; > + } > + vq->log_used = true; > + vq->log_addr = a.user_addr; > + break; For future reference, this is *exactly* the kind of thing which would have been nice as a followup patch. Easy to separate, easy to review, not critical to the core. > +/* TODO: This is really inefficient. We need something like get_user() > + * (instruction directly accesses the data, with an exception table entry > + * returning -EFAULT). See Documentation/x86/exception-tables.txt. > + */ > +static int set_bit_to_user(int nr, void __user *addr) > +{ I guess we won't be dealing with many contiguous pages, otherwise we could get a cheap speedup making this set_bits_to_user(int nr, int num_bits...). > +/* Each buffer in the virtqueues is actually a chain of descriptors. This > + * function returns the next descriptor in the chain, > + * or -1 if we're at the end. */ > +static unsigned next_desc(struct vring_desc *desc) > +{ > + unsigned int next; > + > + /* If this descriptor says it doesn't chain, we're done. */ > + if (!(desc->flags & VRING_DESC_F_NEXT)) > + return -1; Hmm, prefer s/-1/-1U/ in comment, here, and below. Clarifies a bit. > +/* After we've used one of their buffers, we tell them about it. We'll then > + * want to send them an interrupt, using vq->call. */ This comment has too much cut & paste: ... want to notify the guest, using the eventfd */ > +/* This actually sends the interrupt for this virtqueue */ > +void vhost_trigger_irq(struct vhost_dev *dev, struct vhost_virtqueue *vq) > +{ Rename vhost_notify_eventfd() or something, and fix comments? > +enum { > + VHOST_NET_MAX_SG = MAX_SKB_FRAGS + 2, +2? Believable, but is it correct? > +/* Poll a file (eventfd or socket) */ > +/* Note: there's nothing vhost specific about this structure. */ > +struct vhost_poll { This comment really helped while reading the code. Kudos! Thanks! Rusty.
WARNING: multiple messages have this Message-ID (diff)
From: Rusty Russell <rusty@rustcorp.com.au> To: "Michael S. Tsirkin" <mst@redhat.com> Cc: netdev@vger.kernel.org, virtualization@lists.linux-foundation.org, kvm@vger.kernel.org, linux-kernel@vger.kernel.org, mingo@elte.hu, linux-mm@kvack.org, akpm@linux-foundation.org, hpa@zytor.com, gregory.haskins@gmail.com, s.hetze@linux-ag.com, Daniel Walker <dwalker@fifo99.com>, Eric Dumazet <eric.dumazet@gmail.com>, "Paul E. McKenney" <paulmck@linux.vnet.ibm.com> Subject: Re: [PATCHv8 3/3] vhost_net: a kernel-level virtio server Date: Fri, 6 Nov 2009 15:29:17 +1030 [thread overview] Message-ID: <200911061529.17500.rusty@rustcorp.com.au> (raw) In-Reply-To: <20091104155724.GD32673@redhat.com> On Thu, 5 Nov 2009 02:27:24 am Michael S. Tsirkin wrote: > What it is: vhost net is a character device that can be used to reduce > the number of system calls involved in virtio networking. Hi Michael, Now everyone else has finally kicked all the tires and it seems to pass, I've done a fairly complete review. Generally, it's really nice; just one bug and a few minor suggestions for polishing. > +/* Caller must have TX VQ lock */ > +static void tx_poll_stop(struct vhost_net *net) > +{ > + if (likely(net->tx_poll_state != VHOST_NET_POLL_STARTED)) > + return; likely? Really? > + for (;;) { > + head = vhost_get_vq_desc(&net->dev, vq, vq->iov, &out, &in, > + NULL, NULL); Danger! You need an arg to vhost_get_vq_desc to tell it the max desc size you can handle. Otherwise, it's only limited by ring size, and a malicious guest can overflow you here, and below: > + /* Skip header. TODO: support TSO. */ > + s = move_iovec_hdr(vq->iov, vq->hdr, hdr_size, out); ... > + > + use_mm(net->dev.mm); > + mutex_lock(&vq->mutex); > + vhost_no_notify(vq); I prefer a name like "vhost_disable_notify()". > + /* OK, now we need to know about added descriptors. */ > + if (head == vq->num && vhost_notify(vq)) > + /* They could have slipped one in as we were doing that: > + * check again. */ > + continue; > + /* Nothing new? Wait for eventfd to tell us they refilled. */ > + if (head == vq->num) > + break; > + /* We don't need to be notified again. */ > + vhost_no_notify(vq); Similarly, vhost_enable_notify. This one is particularly misleading since it doesn't actually notify anything! In particular, this code would be neater as: if (head == vq->num) { if (vhost_enable_notify(vq)) { /* Try again, they could have slipped one in. */ continue; } /* Nothing more to do. */ break; } vhost_disable_notify(vq); Now, AFAICT vhost_notify()/enable_notify() would be better rewritten to return true only when there's more pending. Saves a loop around here most of the time. Also, the vhost_no_notify/vhost_disable_notify() can be moved out of the loop entirely. (It could be under an if (unlikely(enabled)), not sure if it's worth it). > + len = err; > + err = memcpy_toiovec(vq->hdr, (unsigned char *)&hdr, hdr_size); That unsigned char * arg to memcpy_toiovec is annoying. A patch might be nice, separate from this effort. > +static int vhost_net_open(struct inode *inode, struct file *f) > +{ > + struct vhost_net *n = kzalloc(sizeof *n, GFP_KERNEL); > + int r; > + if (!n) > + return -ENOMEM; > + f->private_data = n; > + n->vqs[VHOST_NET_VQ_TX].handle_kick = handle_tx_kick; > + n->vqs[VHOST_NET_VQ_RX].handle_kick = handle_rx_kick; I have a personal dislike of calloc for structures. In userspace, it's because valgrind can't spot uninitialized fields. These days a similar argument applies in the kernel, because we have KMEMCHECK now. If someone adds a field to the struct and forgets to initialize it, we can spot it. > +static void vhost_net_enable_vq(struct vhost_net *n, int index) > +{ > + struct socket *sock = n->vqs[index].private_data; OK, I can't help but this that presenting the vqs as an array doesn't buy us very much. Esp. if you change vhost_dev_init to take a NULL-terminated varargs. I think readability would improve. It means passing a vq around rather than an index. Not completely sure it'll be a win tho. > +static long vhost_net_set_backend(struct vhost_net *n, unsigned index, int fd) > +{ > + struct socket *sock, *oldsock = NULL; ... > + sock = get_socket(fd); > + if (IS_ERR(sock)) { > + r = PTR_ERR(sock); > + goto done; > + } > + > + /* start polling new socket */ > + oldsock = vq->private_data; ... > +done: > + mutex_unlock(&n->dev.mutex); > + if (oldsock) { > + vhost_net_flush_vq(n, index); > + fput(oldsock->file); I dislike this style; I prefer multiple different goto points, one for when oldsock is set, and one for when it's not. That way, gcc warns us about uninitialized variables if we get it wrong. > +static long vhost_net_reset_owner(struct vhost_net *n) > +{ > + struct socket *tx_sock = NULL; > + struct socket *rx_sock = NULL; > + long r; This should be called "err", since that's what it is. > +static void vhost_net_set_features(struct vhost_net *n, u64 features) > +{ > + size_t hdr_size = features & (1 << VHOST_NET_F_VIRTIO_NET_HDR) ? > + sizeof(struct virtio_net_hdr) : 0; > + int i; > + mutex_lock(&n->dev.mutex); > + n->dev.acked_features = features; Why is this called "acked_features"? Not just "features"? I expected to see code which exposed these back to userspace, and didn't. > + case VHOST_GET_FEATURES: > + features = VHOST_FEATURES; > + return put_user(features, featurep); > + case VHOST_ACK_FEATURES: > + r = get_user(features, featurep); > + /* No features for now */ > + if (r < 0) > + return r; > + if (features & ~VHOST_FEATURES) > + return -EOPNOTSUPP; > + vhost_net_set_features(n, features); OK, from the userspace POV it's "get features" then "ack features". But I think "VHOST_SET_FEATURES" is more consistent, despite this usage. > + switch (ioctl) { > + case VHOST_SET_VRING_NUM: I haven't looked at your userspace implementation, but does a generic VHOST_SET_VRING_STATE & VHOST_GET_VRING_STATE with a struct make more sense? It'd be simpler here, but not sure if it'd be simpler to use? (Not the fd-setting ioctls of course) > + case VHOST_SET_VRING_LOG: > + r = copy_from_user(&a, argp, sizeof a); > + if (r < 0) > + break; > + if (a.padding) { > + r = -EOPNOTSUPP; > + break; > + } > + if (a.user_addr == VHOST_VRING_LOG_DISABLE) { > + vq->log_used = false; > + break; > + } > + if (a.user_addr & (sizeof *vq->used->ring - 1)) { > + r = -EINVAL; > + break; > + } > + vq->log_used = true; > + vq->log_addr = a.user_addr; > + break; For future reference, this is *exactly* the kind of thing which would have been nice as a followup patch. Easy to separate, easy to review, not critical to the core. > +/* TODO: This is really inefficient. We need something like get_user() > + * (instruction directly accesses the data, with an exception table entry > + * returning -EFAULT). See Documentation/x86/exception-tables.txt. > + */ > +static int set_bit_to_user(int nr, void __user *addr) > +{ I guess we won't be dealing with many contiguous pages, otherwise we could get a cheap speedup making this set_bits_to_user(int nr, int num_bits...). > +/* Each buffer in the virtqueues is actually a chain of descriptors. This > + * function returns the next descriptor in the chain, > + * or -1 if we're at the end. */ > +static unsigned next_desc(struct vring_desc *desc) > +{ > + unsigned int next; > + > + /* If this descriptor says it doesn't chain, we're done. */ > + if (!(desc->flags & VRING_DESC_F_NEXT)) > + return -1; Hmm, prefer s/-1/-1U/ in comment, here, and below. Clarifies a bit. > +/* After we've used one of their buffers, we tell them about it. We'll then > + * want to send them an interrupt, using vq->call. */ This comment has too much cut & paste: ... want to notify the guest, using the eventfd */ > +/* This actually sends the interrupt for this virtqueue */ > +void vhost_trigger_irq(struct vhost_dev *dev, struct vhost_virtqueue *vq) > +{ Rename vhost_notify_eventfd() or something, and fix comments? > +enum { > + VHOST_NET_MAX_SG = MAX_SKB_FRAGS + 2, +2? Believable, but is it correct? > +/* Poll a file (eventfd or socket) */ > +/* Note: there's nothing vhost specific about this structure. */ > +struct vhost_poll { This comment really helped while reading the code. Kudos! Thanks! Rusty. -- To unsubscribe, send a message with 'unsubscribe linux-mm' in the body to majordomo@kvack.org. For more info on Linux MM, see: http://www.linux-mm.org/ . Don't email: <a href=mailto:"dont@kvack.org"> email@kvack.org </a>
next prev parent reply other threads:[~2009-11-06 4:59 UTC|newest] Thread overview: 51+ messages / expand[flat|nested] mbox.gz Atom feed top [not found] <cover.1257349249.git.mst@redhat.com> 2009-11-04 15:55 ` [PATCHv8 1/3] tun: export underlying socket Michael S. Tsirkin 2009-11-04 15:55 ` Michael S. Tsirkin 2009-11-04 15:55 ` Michael S. Tsirkin 2009-11-04 15:55 ` Michael S. Tsirkin 2009-11-04 15:55 ` [PATCHv8 2/3] mm: export use_mm/unuse_mm to modules Michael S. Tsirkin 2009-11-04 15:55 ` Michael S. Tsirkin 2009-11-04 15:55 ` Michael S. Tsirkin 2009-11-04 15:55 ` Michael S. Tsirkin 2009-11-04 15:57 ` [PATCHv8 3/3] vhost_net: a kernel-level virtio server Michael S. Tsirkin 2009-11-04 15:57 ` Michael S. Tsirkin 2009-11-04 15:57 ` Michael S. Tsirkin 2009-11-04 15:57 ` Michael S. Tsirkin 2009-11-06 4:59 ` Rusty Russell 2009-11-06 4:59 ` Rusty Russell [this message] 2009-11-06 4:59 ` Rusty Russell 2009-11-08 11:35 ` Michael S. Tsirkin 2009-11-08 11:35 ` Michael S. Tsirkin 2009-11-09 6:17 ` Rusty Russell 2009-11-09 6:17 ` Rusty Russell 2009-11-09 6:17 ` Rusty Russell 2009-11-09 7:10 ` Michael S. Tsirkin 2009-11-09 7:10 ` Michael S. Tsirkin 2009-11-10 1:08 ` Rusty Russell 2009-11-10 1:08 ` Rusty Russell 2009-11-10 1:08 ` Rusty Russell 2009-11-09 7:10 ` Michael S. Tsirkin 2009-11-09 7:20 ` Michael S. Tsirkin 2009-11-09 7:20 ` Michael S. Tsirkin 2009-11-09 7:20 ` Michael S. Tsirkin 2009-11-09 11:55 ` Michael S. Tsirkin 2009-11-09 11:55 ` Michael S. Tsirkin 2009-11-09 11:55 ` Michael S. Tsirkin 2010-05-04 18:22 ` virtio: put last_used and last_avail index into ring itself Michael S. Tsirkin 2010-05-04 18:22 ` Michael S. Tsirkin 2010-05-06 0:52 ` Rusty Russell 2010-05-06 0:52 ` Rusty Russell 2010-05-06 0:52 ` Rusty Russell 2010-05-06 6:27 ` Michael S. Tsirkin 2010-05-06 6:27 ` Michael S. Tsirkin 2010-05-06 6:27 ` Michael S. Tsirkin 2010-05-07 3:05 ` Rusty Russell 2010-05-07 3:05 ` Rusty Russell 2010-05-07 3:05 ` Rusty Russell 2010-05-09 8:57 ` Michael S. Tsirkin 2010-05-09 8:57 ` Michael S. Tsirkin 2010-05-09 8:57 ` Michael S. Tsirkin 2010-05-10 3:11 ` Rusty Russell 2010-05-10 3:11 ` Rusty Russell 2010-05-10 3:11 ` Rusty Russell 2010-05-04 18:22 ` Michael S. Tsirkin 2009-11-08 11:35 ` [PATCHv8 3/3] vhost_net: a kernel-level virtio server Michael S. Tsirkin
Reply instructions: You may reply publicly to this message via plain-text email using any one of the following methods: * Save the following mbox file, import it into your mail client, and reply-to-all from there: mbox Avoid top-posting and favor interleaved quoting: https://en.wikipedia.org/wiki/Posting_style#Interleaved_style * Reply using the --to, --cc, and --in-reply-to switches of git-send-email(1): git send-email \ --in-reply-to=200911061529.17500.rusty@rustcorp.com.au \ --to=rusty@rustcorp.com.au \ --cc=akpm@linux-foundation.org \ --cc=dwalker@fifo99.com \ --cc=eric.dumazet@gmail.com \ --cc=gregory.haskins@gmail.com \ --cc=hpa@zytor.com \ --cc=kvm@vger.kernel.org \ --cc=linux-kernel@vger.kernel.org \ --cc=linux-mm@kvack.org \ --cc=mingo@elte.hu \ --cc=mst@redhat.com \ --cc=netdev@vger.kernel.org \ --cc=paulmck@linux.vnet.ibm.com \ --cc=s.hetze@linux-ag.com \ --cc=virtualization@lists.linux-foundation.org \ /path/to/YOUR_REPLY https://kernel.org/pub/software/scm/git/docs/git-send-email.html * If your mail client supports setting the In-Reply-To header via mailto: links, try the mailto: linkBe sure your reply has a Subject: header at the top and a blank line before the message body.
This is an external index of several public inboxes, see mirroring instructions on how to clone and mirror all data and code used by this external index.