All of lore.kernel.org
 help / color / mirror / Atom feed
From: Rusty Russell <rusty@rustcorp.com.au>
To: "Michael S. Tsirkin" <mst@redhat.com>
Cc: netdev@vger.kernel.org,
	virtualization@lists.linux-foundation.org, kvm@vger.kernel.org,
	linux-kernel@vger.kernel.org, mingo@elte.hu, linux-mm@kvack.org,
	akpm@linux-foundation.org, hpa@zytor.com,
	gregory.haskins@gmail.com, s.hetze@linux-ag.com,
	Daniel Walker <dwalker@fifo99.com>,
	Eric Dumazet <eric.dumazet@gmail.com>,
	"Paul E. McKenney" <paulmck@linux.vnet.ibm.com>
Subject: Re: [PATCHv8 3/3] vhost_net: a kernel-level virtio server
Date: Fri, 6 Nov 2009 15:29:17 +1030	[thread overview]
Message-ID: <200911061529.17500.rusty@rustcorp.com.au> (raw)
In-Reply-To: <20091104155724.GD32673@redhat.com>

On Thu, 5 Nov 2009 02:27:24 am Michael S. Tsirkin wrote:
> What it is: vhost net is a character device that can be used to reduce
> the number of system calls involved in virtio networking.

Hi Michael,

   Now everyone else has finally kicked all the tires and it seems to pass,
I've done a fairly complete review.  Generally, it's really nice; just one
bug and a few minor suggestions for polishing.

> +/* Caller must have TX VQ lock */
> +static void tx_poll_stop(struct vhost_net *net)
> +{
> +	if (likely(net->tx_poll_state != VHOST_NET_POLL_STARTED))
> +		return;

likely?  Really?

> +	for (;;) {
> +		head = vhost_get_vq_desc(&net->dev, vq, vq->iov, &out, &in,
> +					 NULL, NULL);

Danger!  You need an arg to vhost_get_vq_desc to tell it the max desc size
you can handle.  Otherwise, it's only limited by ring size, and a malicious
guest can overflow you here, and below:

> +		/* Skip header. TODO: support TSO. */
> +		s = move_iovec_hdr(vq->iov, vq->hdr, hdr_size, out);
...
> +
> +	use_mm(net->dev.mm);
> +	mutex_lock(&vq->mutex);
> +	vhost_no_notify(vq);

I prefer a name like "vhost_disable_notify()".

> +		/* OK, now we need to know about added descriptors. */
> +		if (head == vq->num && vhost_notify(vq))
> +			/* They could have slipped one in as we were doing that:
> +			 * check again. */
> +			continue;
> +		/* Nothing new?  Wait for eventfd to tell us they refilled. */
> +		if (head == vq->num)
> +			break;
> +		/* We don't need to be notified again. */
> +		vhost_no_notify(vq);

Similarly, vhost_enable_notify.  This one is particularly misleading since
it doesn't actually notify anything!

In particular, this code would be neater as:

	if (head == vq->num) {
		if (vhost_enable_notify(vq)) {
			/* Try again, they could have slipped one in. */
			continue;
		}
		/* Nothing more to do. */
		break;
	}
	vhost_disable_notify(vq);

Now, AFAICT vhost_notify()/enable_notify() would be better rewritten to
return true only when there's more pending.  Saves a loop around here most
of the time.  Also, the vhost_no_notify/vhost_disable_notify() can be moved
out of the loop entirely.  (It could be under an if (unlikely(enabled)), not
sure if it's worth it).

> +		len = err;
> +		err = memcpy_toiovec(vq->hdr, (unsigned char *)&hdr, hdr_size);

That unsigned char * arg to memcpy_toiovec is annoying.  A patch might be
nice, separate from this effort.

> +static int vhost_net_open(struct inode *inode, struct file *f)
> +{
> +	struct vhost_net *n = kzalloc(sizeof *n, GFP_KERNEL);
> +	int r;
> +	if (!n)
> +		return -ENOMEM;
> +	f->private_data = n;
> +	n->vqs[VHOST_NET_VQ_TX].handle_kick = handle_tx_kick;
> +	n->vqs[VHOST_NET_VQ_RX].handle_kick = handle_rx_kick;

I have a personal dislike of calloc for structures.  In userspace, it's
because valgrind can't spot uninitialized fields.  These days a similar
argument applies in the kernel, because we have KMEMCHECK now.  If someone
adds a field to the struct and forgets to initialize it, we can spot it.

> +static void vhost_net_enable_vq(struct vhost_net *n, int index)
> +{
> +	struct socket *sock = n->vqs[index].private_data;

OK, I can't help but this that presenting the vqs as an array doesn't buy
us very much.  Esp. if you change vhost_dev_init to take a NULL-terminated
varargs.  I think readability would improve.  It means passing a vq around
rather than an index.

Not completely sure it'll be a win tho.

> +static long vhost_net_set_backend(struct vhost_net *n, unsigned index, int fd)
> +{
> +	struct socket *sock, *oldsock = NULL;
...
> +	sock = get_socket(fd);
> +	if (IS_ERR(sock)) {
> +		r = PTR_ERR(sock);
> +		goto done;
> +	}
> +
> +	/* start polling new socket */
> +	oldsock = vq->private_data;
...
> +done:
> +	mutex_unlock(&n->dev.mutex);
> +	if (oldsock) {
> +		vhost_net_flush_vq(n, index);
> +		fput(oldsock->file);

I dislike this style; I prefer multiple different goto points, one for when
oldsock is set, and one for when it's not.

That way, gcc warns us about uninitialized variables if we get it wrong.

> +static long vhost_net_reset_owner(struct vhost_net *n)
> +{
> +	struct socket *tx_sock = NULL;
> +	struct socket *rx_sock = NULL;
> +	long r;

This should be called "err", since that's what it is.

> +static void vhost_net_set_features(struct vhost_net *n, u64 features)
> +{
> +	size_t hdr_size = features & (1 << VHOST_NET_F_VIRTIO_NET_HDR) ?
> +		sizeof(struct virtio_net_hdr) : 0;
> +	int i;
> +	mutex_lock(&n->dev.mutex);
> +	n->dev.acked_features = features;

Why is this called "acked_features"?  Not just "features"?  I expected
to see code which exposed these back to userspace, and didn't.

> +	case VHOST_GET_FEATURES:
> +		features = VHOST_FEATURES;
> +		return put_user(features, featurep);
> +	case VHOST_ACK_FEATURES:
> +		r = get_user(features, featurep);
> +		/* No features for now */
> +		if (r < 0)
> +			return r;
> +		if (features & ~VHOST_FEATURES)
> +			return -EOPNOTSUPP;
> +		vhost_net_set_features(n, features);

OK, from the userspace POV it's "get features" then "ack features".  But
I think "VHOST_SET_FEATURES" is more consistent, despite this usage.

> +	switch (ioctl) {
> +	case VHOST_SET_VRING_NUM:

I haven't looked at your userspace implementation, but does a generic
VHOST_SET_VRING_STATE & VHOST_GET_VRING_STATE with a struct make more
sense?  It'd be simpler here, but not sure if it'd be simpler to use?

(Not the fd-setting ioctls of course)

> +	case VHOST_SET_VRING_LOG:
> +		r = copy_from_user(&a, argp, sizeof a);
> +		if (r < 0)
> +			break;
> +		if (a.padding) {
> +			r = -EOPNOTSUPP;
> +			break;
> +		}
> +		if (a.user_addr == VHOST_VRING_LOG_DISABLE) {
> +			vq->log_used = false;
> +			break;
> +		}
> +		if (a.user_addr & (sizeof *vq->used->ring - 1)) {
> +			r = -EINVAL;
> +			break;
> +		}
> +		vq->log_used = true;
> +		vq->log_addr = a.user_addr;
> +		break;

For future reference, this is *exactly* the kind of thing which would have
been nice as a followup patch.  Easy to separate, easy to review, not critical
to the core.

> +/* TODO: This is really inefficient.  We need something like get_user()
> + * (instruction directly accesses the data, with an exception table entry
> + * returning -EFAULT). See Documentation/x86/exception-tables.txt.
> + */
> +static int set_bit_to_user(int nr, void __user *addr)
> +{

I guess we won't be dealing with many contiguous pages, otherwise we could
get a cheap speedup making this set_bits_to_user(int nr, int num_bits...).

> +/* Each buffer in the virtqueues is actually a chain of descriptors.  This
> + * function returns the next descriptor in the chain,
> + * or -1 if we're at the end. */
> +static unsigned next_desc(struct vring_desc *desc)
> +{
> +	unsigned int next;
> +
> +	/* If this descriptor says it doesn't chain, we're done. */
> +	if (!(desc->flags & VRING_DESC_F_NEXT))
> +		return -1;

Hmm, prefer s/-1/-1U/ in comment, here, and below.  Clarifies a bit.

> +/* After we've used one of their buffers, we tell them about it.  We'll then
> + * want to send them an interrupt, using vq->call. */

This comment has too much cut & paste:
	... want to notify the guest, using the eventfd */

> +/* This actually sends the interrupt for this virtqueue */
> +void vhost_trigger_irq(struct vhost_dev *dev, struct vhost_virtqueue *vq)
> +{

Rename vhost_notify_eventfd() or something, and fix comments?

> +enum {
> +	VHOST_NET_MAX_SG = MAX_SKB_FRAGS + 2,

+2?  Believable, but is it correct?

> +/* Poll a file (eventfd or socket) */
> +/* Note: there's nothing vhost specific about this structure. */
> +struct vhost_poll {

This comment really helped while reading the code.  Kudos!

Thanks!
Rusty.

WARNING: multiple messages have this Message-ID (diff)
From: Rusty Russell <rusty@rustcorp.com.au>
To: "Michael S. Tsirkin" <mst@redhat.com>
Cc: netdev@vger.kernel.org,
	virtualization@lists.linux-foundation.org, kvm@vger.kernel.org,
	linux-kernel@vger.kernel.org, mingo@elte.hu, linux-mm@kvack.org,
	akpm@linux-foundation.org, hpa@zytor.com,
	gregory.haskins@gmail.com, s.hetze@linux-ag.com,
	Daniel Walker <dwalker@fifo99.com>,
	Eric Dumazet <eric.dumazet@gmail.com>,
	"Paul E. McKenney" <paulmck@linux.vnet.ibm.com>
Subject: Re: [PATCHv8 3/3] vhost_net: a kernel-level virtio server
Date: Fri, 6 Nov 2009 15:29:17 +1030	[thread overview]
Message-ID: <200911061529.17500.rusty@rustcorp.com.au> (raw)
In-Reply-To: <20091104155724.GD32673@redhat.com>

On Thu, 5 Nov 2009 02:27:24 am Michael S. Tsirkin wrote:
> What it is: vhost net is a character device that can be used to reduce
> the number of system calls involved in virtio networking.

Hi Michael,

   Now everyone else has finally kicked all the tires and it seems to pass,
I've done a fairly complete review.  Generally, it's really nice; just one
bug and a few minor suggestions for polishing.

> +/* Caller must have TX VQ lock */
> +static void tx_poll_stop(struct vhost_net *net)
> +{
> +	if (likely(net->tx_poll_state != VHOST_NET_POLL_STARTED))
> +		return;

likely?  Really?

> +	for (;;) {
> +		head = vhost_get_vq_desc(&net->dev, vq, vq->iov, &out, &in,
> +					 NULL, NULL);

Danger!  You need an arg to vhost_get_vq_desc to tell it the max desc size
you can handle.  Otherwise, it's only limited by ring size, and a malicious
guest can overflow you here, and below:

> +		/* Skip header. TODO: support TSO. */
> +		s = move_iovec_hdr(vq->iov, vq->hdr, hdr_size, out);
...
> +
> +	use_mm(net->dev.mm);
> +	mutex_lock(&vq->mutex);
> +	vhost_no_notify(vq);

I prefer a name like "vhost_disable_notify()".

> +		/* OK, now we need to know about added descriptors. */
> +		if (head == vq->num && vhost_notify(vq))
> +			/* They could have slipped one in as we were doing that:
> +			 * check again. */
> +			continue;
> +		/* Nothing new?  Wait for eventfd to tell us they refilled. */
> +		if (head == vq->num)
> +			break;
> +		/* We don't need to be notified again. */
> +		vhost_no_notify(vq);

Similarly, vhost_enable_notify.  This one is particularly misleading since
it doesn't actually notify anything!

In particular, this code would be neater as:

	if (head == vq->num) {
		if (vhost_enable_notify(vq)) {
			/* Try again, they could have slipped one in. */
			continue;
		}
		/* Nothing more to do. */
		break;
	}
	vhost_disable_notify(vq);

Now, AFAICT vhost_notify()/enable_notify() would be better rewritten to
return true only when there's more pending.  Saves a loop around here most
of the time.  Also, the vhost_no_notify/vhost_disable_notify() can be moved
out of the loop entirely.  (It could be under an if (unlikely(enabled)), not
sure if it's worth it).

> +		len = err;
> +		err = memcpy_toiovec(vq->hdr, (unsigned char *)&hdr, hdr_size);

That unsigned char * arg to memcpy_toiovec is annoying.  A patch might be
nice, separate from this effort.

> +static int vhost_net_open(struct inode *inode, struct file *f)
> +{
> +	struct vhost_net *n = kzalloc(sizeof *n, GFP_KERNEL);
> +	int r;
> +	if (!n)
> +		return -ENOMEM;
> +	f->private_data = n;
> +	n->vqs[VHOST_NET_VQ_TX].handle_kick = handle_tx_kick;
> +	n->vqs[VHOST_NET_VQ_RX].handle_kick = handle_rx_kick;

I have a personal dislike of calloc for structures.  In userspace, it's
because valgrind can't spot uninitialized fields.  These days a similar
argument applies in the kernel, because we have KMEMCHECK now.  If someone
adds a field to the struct and forgets to initialize it, we can spot it.

> +static void vhost_net_enable_vq(struct vhost_net *n, int index)
> +{
> +	struct socket *sock = n->vqs[index].private_data;

OK, I can't help but this that presenting the vqs as an array doesn't buy
us very much.  Esp. if you change vhost_dev_init to take a NULL-terminated
varargs.  I think readability would improve.  It means passing a vq around
rather than an index.

Not completely sure it'll be a win tho.

> +static long vhost_net_set_backend(struct vhost_net *n, unsigned index, int fd)
> +{
> +	struct socket *sock, *oldsock = NULL;
...
> +	sock = get_socket(fd);
> +	if (IS_ERR(sock)) {
> +		r = PTR_ERR(sock);
> +		goto done;
> +	}
> +
> +	/* start polling new socket */
> +	oldsock = vq->private_data;
...
> +done:
> +	mutex_unlock(&n->dev.mutex);
> +	if (oldsock) {
> +		vhost_net_flush_vq(n, index);
> +		fput(oldsock->file);

I dislike this style; I prefer multiple different goto points, one for when
oldsock is set, and one for when it's not.

That way, gcc warns us about uninitialized variables if we get it wrong.

> +static long vhost_net_reset_owner(struct vhost_net *n)
> +{
> +	struct socket *tx_sock = NULL;
> +	struct socket *rx_sock = NULL;
> +	long r;

This should be called "err", since that's what it is.

> +static void vhost_net_set_features(struct vhost_net *n, u64 features)
> +{
> +	size_t hdr_size = features & (1 << VHOST_NET_F_VIRTIO_NET_HDR) ?
> +		sizeof(struct virtio_net_hdr) : 0;
> +	int i;
> +	mutex_lock(&n->dev.mutex);
> +	n->dev.acked_features = features;

Why is this called "acked_features"?  Not just "features"?  I expected
to see code which exposed these back to userspace, and didn't.

> +	case VHOST_GET_FEATURES:
> +		features = VHOST_FEATURES;
> +		return put_user(features, featurep);
> +	case VHOST_ACK_FEATURES:
> +		r = get_user(features, featurep);
> +		/* No features for now */
> +		if (r < 0)
> +			return r;
> +		if (features & ~VHOST_FEATURES)
> +			return -EOPNOTSUPP;
> +		vhost_net_set_features(n, features);

OK, from the userspace POV it's "get features" then "ack features".  But
I think "VHOST_SET_FEATURES" is more consistent, despite this usage.

> +	switch (ioctl) {
> +	case VHOST_SET_VRING_NUM:

I haven't looked at your userspace implementation, but does a generic
VHOST_SET_VRING_STATE & VHOST_GET_VRING_STATE with a struct make more
sense?  It'd be simpler here, but not sure if it'd be simpler to use?

(Not the fd-setting ioctls of course)

> +	case VHOST_SET_VRING_LOG:
> +		r = copy_from_user(&a, argp, sizeof a);
> +		if (r < 0)
> +			break;
> +		if (a.padding) {
> +			r = -EOPNOTSUPP;
> +			break;
> +		}
> +		if (a.user_addr == VHOST_VRING_LOG_DISABLE) {
> +			vq->log_used = false;
> +			break;
> +		}
> +		if (a.user_addr & (sizeof *vq->used->ring - 1)) {
> +			r = -EINVAL;
> +			break;
> +		}
> +		vq->log_used = true;
> +		vq->log_addr = a.user_addr;
> +		break;

For future reference, this is *exactly* the kind of thing which would have
been nice as a followup patch.  Easy to separate, easy to review, not critical
to the core.

> +/* TODO: This is really inefficient.  We need something like get_user()
> + * (instruction directly accesses the data, with an exception table entry
> + * returning -EFAULT). See Documentation/x86/exception-tables.txt.
> + */
> +static int set_bit_to_user(int nr, void __user *addr)
> +{

I guess we won't be dealing with many contiguous pages, otherwise we could
get a cheap speedup making this set_bits_to_user(int nr, int num_bits...).

> +/* Each buffer in the virtqueues is actually a chain of descriptors.  This
> + * function returns the next descriptor in the chain,
> + * or -1 if we're at the end. */
> +static unsigned next_desc(struct vring_desc *desc)
> +{
> +	unsigned int next;
> +
> +	/* If this descriptor says it doesn't chain, we're done. */
> +	if (!(desc->flags & VRING_DESC_F_NEXT))
> +		return -1;

Hmm, prefer s/-1/-1U/ in comment, here, and below.  Clarifies a bit.

> +/* After we've used one of their buffers, we tell them about it.  We'll then
> + * want to send them an interrupt, using vq->call. */

This comment has too much cut & paste:
	... want to notify the guest, using the eventfd */

> +/* This actually sends the interrupt for this virtqueue */
> +void vhost_trigger_irq(struct vhost_dev *dev, struct vhost_virtqueue *vq)
> +{

Rename vhost_notify_eventfd() or something, and fix comments?

> +enum {
> +	VHOST_NET_MAX_SG = MAX_SKB_FRAGS + 2,

+2?  Believable, but is it correct?

> +/* Poll a file (eventfd or socket) */
> +/* Note: there's nothing vhost specific about this structure. */
> +struct vhost_poll {

This comment really helped while reading the code.  Kudos!

Thanks!
Rusty.

--
To unsubscribe, send a message with 'unsubscribe linux-mm' in
the body to majordomo@kvack.org.  For more info on Linux MM,
see: http://www.linux-mm.org/ .
Don't email: <a href=mailto:"dont@kvack.org"> email@kvack.org </a>

  parent reply	other threads:[~2009-11-06  4:59 UTC|newest]

Thread overview: 51+ messages / expand[flat|nested]  mbox.gz  Atom feed  top
     [not found] <cover.1257349249.git.mst@redhat.com>
2009-11-04 15:55 ` [PATCHv8 1/3] tun: export underlying socket Michael S. Tsirkin
2009-11-04 15:55   ` Michael S. Tsirkin
2009-11-04 15:55   ` Michael S. Tsirkin
2009-11-04 15:55 ` Michael S. Tsirkin
2009-11-04 15:55 ` [PATCHv8 2/3] mm: export use_mm/unuse_mm to modules Michael S. Tsirkin
2009-11-04 15:55 ` Michael S. Tsirkin
2009-11-04 15:55   ` Michael S. Tsirkin
2009-11-04 15:55   ` Michael S. Tsirkin
2009-11-04 15:57 ` [PATCHv8 3/3] vhost_net: a kernel-level virtio server Michael S. Tsirkin
2009-11-04 15:57 ` Michael S. Tsirkin
2009-11-04 15:57   ` Michael S. Tsirkin
2009-11-04 15:57   ` Michael S. Tsirkin
2009-11-06  4:59   ` Rusty Russell
2009-11-06  4:59   ` Rusty Russell [this message]
2009-11-06  4:59     ` Rusty Russell
2009-11-08 11:35     ` Michael S. Tsirkin
2009-11-08 11:35       ` Michael S. Tsirkin
2009-11-09  6:17       ` Rusty Russell
2009-11-09  6:17       ` Rusty Russell
2009-11-09  6:17         ` Rusty Russell
2009-11-09  7:10         ` Michael S. Tsirkin
2009-11-09  7:10           ` Michael S. Tsirkin
2009-11-10  1:08           ` Rusty Russell
2009-11-10  1:08             ` Rusty Russell
2009-11-10  1:08           ` Rusty Russell
2009-11-09  7:10         ` Michael S. Tsirkin
2009-11-09  7:20         ` Michael S. Tsirkin
2009-11-09  7:20           ` Michael S. Tsirkin
2009-11-09  7:20         ` Michael S. Tsirkin
2009-11-09 11:55         ` Michael S. Tsirkin
2009-11-09 11:55           ` Michael S. Tsirkin
2009-11-09 11:55         ` Michael S. Tsirkin
2010-05-04 18:22         ` virtio: put last_used and last_avail index into ring itself Michael S. Tsirkin
2010-05-04 18:22           ` Michael S. Tsirkin
2010-05-06  0:52           ` Rusty Russell
2010-05-06  0:52           ` Rusty Russell
2010-05-06  0:52             ` Rusty Russell
2010-05-06  6:27             ` Michael S. Tsirkin
2010-05-06  6:27             ` Michael S. Tsirkin
2010-05-06  6:27               ` Michael S. Tsirkin
2010-05-07  3:05               ` Rusty Russell
2010-05-07  3:05               ` Rusty Russell
2010-05-07  3:05                 ` Rusty Russell
2010-05-09  8:57                 ` Michael S. Tsirkin
2010-05-09  8:57                 ` Michael S. Tsirkin
2010-05-09  8:57                   ` Michael S. Tsirkin
2010-05-10  3:11                   ` Rusty Russell
2010-05-10  3:11                     ` Rusty Russell
2010-05-10  3:11                   ` Rusty Russell
2010-05-04 18:22         ` Michael S. Tsirkin
2009-11-08 11:35     ` [PATCHv8 3/3] vhost_net: a kernel-level virtio server Michael S. Tsirkin

Reply instructions:

You may reply publicly to this message via plain-text email
using any one of the following methods:

* Save the following mbox file, import it into your mail client,
  and reply-to-all from there: mbox

  Avoid top-posting and favor interleaved quoting:
  https://en.wikipedia.org/wiki/Posting_style#Interleaved_style

* Reply using the --to, --cc, and --in-reply-to
  switches of git-send-email(1):

  git send-email \
    --in-reply-to=200911061529.17500.rusty@rustcorp.com.au \
    --to=rusty@rustcorp.com.au \
    --cc=akpm@linux-foundation.org \
    --cc=dwalker@fifo99.com \
    --cc=eric.dumazet@gmail.com \
    --cc=gregory.haskins@gmail.com \
    --cc=hpa@zytor.com \
    --cc=kvm@vger.kernel.org \
    --cc=linux-kernel@vger.kernel.org \
    --cc=linux-mm@kvack.org \
    --cc=mingo@elte.hu \
    --cc=mst@redhat.com \
    --cc=netdev@vger.kernel.org \
    --cc=paulmck@linux.vnet.ibm.com \
    --cc=s.hetze@linux-ag.com \
    --cc=virtualization@lists.linux-foundation.org \
    /path/to/YOUR_REPLY

  https://kernel.org/pub/software/scm/git/docs/git-send-email.html

* If your mail client supports setting the In-Reply-To header
  via mailto: links, try the mailto: link
Be sure your reply has a Subject: header at the top and a blank line before the message body.
This is an external index of several public inboxes,
see mirroring instructions on how to clone and mirror
all data and code used by this external index.