From: Jason Wang <jasowang@redhat.com>
To: "Michael S. Tsirkin" <mst@redhat.com>
Cc: davem@davemloft.net, netdev@vger.kernel.org,
	linux-kernel@vger.kernel.org
Subject: Re: [PATCH net-next] tuntap: introduce tx skb ring
Date: Tue, 17 May 2016 09:38:37 +0800	[thread overview]
Message-ID: <573A761D.8080909@redhat.com> (raw)
In-Reply-To: <20160516105434-mutt-send-email-mst@redhat.com>



On 2016-05-16 16:08, Michael S. Tsirkin wrote:
> On Mon, May 16, 2016 at 03:52:11PM +0800, Jason Wang wrote:
>>
>> On 2016-05-16 12:23, Michael S. Tsirkin wrote:
>>> On Mon, May 16, 2016 at 09:17:01AM +0800, Jason Wang wrote:
>>>> We used to queue tx packets in sk_receive_queue; this is less
>>>> efficient since it requires spinlocks to synchronize between producer
>>>> and consumer.
>>>>
>>>> This patch tries to address this by using a circular buffer, which
>>>> allows lockless synchronization. This is done by switching from
>>>> sk_receive_queue to a tx skb ring with a new flag, IFF_TX_RING; when
>>>> this is set:
>>> Why do we need a new flag? Is there a userspace-visible
>>> behaviour change?
>> Probably yes, since tx_queue_len does not work.
> So the flag name should reflect the behaviour somehow, not
> the implementation.
>
>>>> - store pointer to skb in circular buffer in tun_net_xmit(), and read
>>>>    it from the circular buffer in tun_do_read().
>>>> - introduce a new proto_ops peek() which could be implemented by a
>>>>    specific socket which does not use sk_receive_queue.
>>>> - store skb length in circular buffer too, and implement a lockless
>>>>    peek for tuntap.
>>>> - change vhost_net to use proto_ops->peek() instead
>>>> - new spinlocks were introduced to synchronize among producers (and
>>>>    likewise among consumers).
>>>>
>>>> Pktgen test shows about 9% improvement on guest receiving pps:
>>>>
>>>> Before: ~1480000pps
>>>> After : ~1610000pps
>>>>
>>>> (I'm not sure non-blocking read is still needed, so it was not included
>>>>   in this patch)
>>> How do you mean? Of course we must support blocking and non-blocking
>>> read - userspace uses it.
>> Ok, will add this.
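
Roughly, the consumer side could honor O_NONBLOCK like this (sketch
only; tun_ring_empty() is a hypothetical helper checking whether
head == tail, and the wait-queue details are illustrative):

	if (!CIRC_CNT(smp_load_acquire(&tfile->head), tfile->tail,
		      TUN_RING_SIZE)) {
		if (noblock)
			return -EAGAIN;
		/* sleep on the socket's wait queue until a producer
		 * publishes a new descriptor
		 */
		err = wait_event_interruptible(*sk_sleep(&tfile->sk),
					       !tun_ring_empty(tfile));
		if (err)
			return err;
	}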
>>
>>>> Signed-off-by: Jason Wang <jasowang@redhat.com>
>>>> ---
>>>>   drivers/net/tun.c           | 157 +++++++++++++++++++++++++++++++++++++++++---
>>>>   drivers/vhost/net.c         |  16 ++++-
>>>>   include/linux/net.h         |   1 +
>>>>   include/uapi/linux/if_tun.h |   1 +
>>>>   4 files changed, 165 insertions(+), 10 deletions(-)
>>>>
>>>> diff --git a/drivers/net/tun.c b/drivers/net/tun.c
>>>> index 425e983..6001ece 100644
>>>> --- a/drivers/net/tun.c
>>>> +++ b/drivers/net/tun.c
>>>> @@ -71,6 +71,7 @@
>>>>   #include <net/sock.h>
>>>>   #include <linux/seq_file.h>
>>>>   #include <linux/uio.h>
>>>> +#include <linux/circ_buf.h>
>>>>   #include <asm/uaccess.h>
>>>> @@ -130,6 +131,8 @@ struct tap_filter {
>>>>   #define MAX_TAP_FLOWS  4096
>>>>   #define TUN_FLOW_EXPIRE (3 * HZ)
>>>> +#define TUN_RING_SIZE 256
>>> Can we resize this according to tx_queue_len set by user?
>> We can, but it needs lots of other changes, e.g. being notified when
>> tx_queue_len is changed by the user.
> Some kind of notifier?

Yes, maybe.

> Probably better than a new user interface.

Ok.

>
>> And if tx_queue_len is not a power of 2,
>> we probably need a modulus to calculate the capacity.
> Is that really that important for speed?

Not sure, I can test.

> If yes, round it up to next power of two.

Right, this sounds like a good solution.
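
Something like this, perhaps (sketch only; roundup_pow_of_two() is from
<linux/log2.h>, the sizing hook itself is made up):

	/* Round the user-visible queue length up to a power of two so
	 * CIRC_CNT()/CIRC_SPACE() can mask instead of taking a modulus.
	 * Guard against tx_queue_len == 0, for which
	 * roundup_pow_of_two() is undefined.
	 */
	ring_size = roundup_pow_of_two(max_t(unsigned long,
					     dev->tx_queue_len, 1));
	ring_mask = ring_size - 1;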

> You can also probably wrap it with a conditional instead.
>
>>>> +#define TUN_RING_MASK (TUN_RING_SIZE - 1)
>>>>   struct tun_pcpu_stats {
>>>>   	u64 rx_packets;
>>>> @@ -142,6 +145,11 @@ struct tun_pcpu_stats {
>>>>   	u32 rx_frame_errors;
>>>>   };
>>>> +struct tun_desc {
>>>> +	struct sk_buff *skb;
>>>> +	int len; /* Cached skb len for peeking */
>>>> +};
>>>> +
>>>>   /* A tun_file connects an open character device to a tuntap netdevice. It
>>>>    * also contains all socket related structures (except sock_fprog and tap_filter)
>>>>    * to serve as one transmit queue for tuntap device. The sock_fprog and
>>>> @@ -167,6 +175,13 @@ struct tun_file {
>>>>   	};
>>>>   	struct list_head next;
>>>>   	struct tun_struct *detached;
>>>> +	/* reader lock */
>>>> +	spinlock_t rlock;
>>>> +	unsigned long tail;
>>>> +	struct tun_desc tx_descs[TUN_RING_SIZE];
>>>> +	/* writer lock */
>>>> +	spinlock_t wlock;
>>>> +	unsigned long head;
>>>>   };
>>>>   struct tun_flow_entry {
>>>> @@ -515,7 +530,27 @@ static struct tun_struct *tun_enable_queue(struct tun_file *tfile)
>>>>   static void tun_queue_purge(struct tun_file *tfile)
>>>>   {
>>>> +	unsigned long head, tail;
>>>> +	struct tun_desc *desc;
>>>> +	struct sk_buff *skb;
>>>>   	skb_queue_purge(&tfile->sk.sk_receive_queue);
>>>> +	spin_lock(&tfile->rlock);
>>>> +
>>>> +	head = ACCESS_ONCE(tfile->head);
>>>> +	tail = tfile->tail;
>>>> +
>>>> +	/* read tail before reading descriptor at tail */
>>>> +	smp_rmb();
>>> I think you mean read *head* here
>> Right.
>>
>>>
>>>> +
>>>> +	while (CIRC_CNT(head, tail, TUN_RING_SIZE) >= 1) {
>>>> +		desc = &tfile->tx_descs[tail];
>>>> +		skb = desc->skb;
>>>> +		kfree_skb(skb);
>>>> +		tail = (tail + 1) & TUN_RING_MASK;
>>>> +		/* read descriptor before incrementing tail. */
>>>> +		smp_store_release(&tfile->tail, tail & TUN_RING_MASK);
>>>> +	}
>>>> +	spin_unlock(&tfile->rlock);
>>>>   	skb_queue_purge(&tfile->sk.sk_error_queue);
>>>>   }
>>>>
>>> Barrier pairing seems messed up. Could you tag
>>> each barrier with its pair pls?
>>> E.g. add /* Barrier A for pairing */ before the barrier and
>>> its pair.
>> Ok.
>>
>> for both tun_queue_purge() and tun_do_read():
>>
>> smp_rmb() is paired with smp_store_release() in tun_net_xmit().
> this seems like overkill, at least. rmb would normally be paired with wmb,
> not a full mb within release.

wmb is not enough here. We need to guarantee the descriptor is read
before we increase tail. And a full mb is more heavyweight than
smp_store_release(). For pairing, I can change to use
smp_load_acquire(&tfile->head) in tun_do_read().
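
Something like this on the consumer side (sketch only):

	/* pairs with the producer's release of head, so the descriptor
	 * write is visible before the slot appears non-empty
	 */
	head = smp_load_acquire(&tfile->head);
	tail = tfile->tail;
	if (CIRC_CNT(head, tail, TUN_RING_SIZE) >= 1) {
		struct tun_desc *desc = &tfile->tx_descs[tail];

		skb = desc->skb;
		/* descriptor is fully read before the slot is handed
		 * back to the producer
		 */
		smp_store_release(&tfile->tail,
				  (tail + 1) & TUN_RING_MASK);
	}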

>
>> smp_store_release() is paired with spin_unlock()/spin_lock() in
>> tun_net_xmit().
> release can't be paired with unlock since that's also a release.
> lock is an acquire, but from what I have seen you keep it around the
> operations, not in the middle.

Right, so I will switch to using smp_load_acquire() in this case.

>
>>>> @@ -824,6 +859,7 @@ static netdev_tx_t tun_net_xmit(struct sk_buff *skb, struct net_device *dev)
>>>>   	int txq = skb->queue_mapping;
>>>>   	struct tun_file *tfile;
>>>>   	u32 numqueues = 0;
>>>> +	unsigned long flags;
>>>>   	rcu_read_lock();
>>>>   	tfile = rcu_dereference(tun->tfiles[txq]);
>>>> @@ -888,8 +924,35 @@ static netdev_tx_t tun_net_xmit(struct sk_buff *skb, struct net_device *dev)
>>>>   	nf_reset(skb);
>>>> -	/* Enqueue packet */
>>>> -	skb_queue_tail(&tfile->socket.sk->sk_receive_queue, skb);
>>>> +	if (tun->flags & IFF_TX_RING) {
>>>> +		unsigned long head, tail;
>>>> +
>>>> +		spin_lock_irqsave(&tfile->wlock, flags);
>>>> +
>>>> +		head = tfile->head;
>>>> +		tail = ACCESS_ONCE(tfile->tail);
>>> this should be acquire
>> Considering there's always one slot left empty, we need to produce two
>> skbs here before we could corrupt the consumer. So the
>> spin_unlock()/spin_lock() provides ordering here?
> It's better to just follow memory barrier rules.

Ok.
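
For reference, the helpers in include/linux/circ_buf.h, which are where
the one-slot-left-empty invariant comes from:

/* Return count in buffer.  */
#define CIRC_CNT(head,tail,size) (((head) - (tail)) & ((size)-1))

/* Return space available, 0..size-1.  We always leave one free char
   as a completely full buffer has head == tail, which is the same as
   empty.  */
#define CIRC_SPACE(head,tail,size) CIRC_CNT((tail),((head)+1),(size))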

>
>
>>>> +
>>>> +		if (CIRC_SPACE(head, tail, TUN_RING_SIZE) >= 1) {
>>>> +			struct tun_desc *desc = &tfile->tx_descs[head];
>>>> +
>>>> +			desc->skb = skb;
>>>> +			desc->len = skb->len;
>>>> +			if (skb_vlan_tag_present(skb))
>>>> +				desc->len += VLAN_HLEN;
>>>> +
>>>> +			/* read descriptor before incrementing head. */
>>> write descriptor.
>> Right.
>>
>>>   so smp_wmb is enough.
>> I thought smp_store_release() was more lightweight than smp_wmb()?
> Why do you think this?  On which architecture?

Sorry, I was wrong here.
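
So the producer side would then look something like (sketch only):

	desc->skb = skb;
	desc->len = skb->len;
	if (skb_vlan_tag_present(skb))
		desc->len += VLAN_HLEN;
	/* write the descriptor before publishing the new head; pairs
	 * with the consumer's smp_rmb() after it reads head
	 */
	smp_wmb();
	ACCESS_ONCE(tfile->head) = (head + 1) & TUN_RING_MASK;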


Thread overview: 21+ messages
2016-05-16  1:17 [PATCH net-next] tuntap: introduce tx skb ring Jason Wang
2016-05-16  3:56 ` Eric Dumazet
2016-05-16  7:51   ` Jason Wang
2016-05-18  8:13     ` Jesper Dangaard Brouer
2016-05-18  8:23       ` Michael S. Tsirkin
2016-05-18 10:23       ` Jason Wang
2016-05-18 11:52         ` Steven Rostedt
2016-05-18 16:26       ` Michael S. Tsirkin
2016-05-18 16:41         ` Eric Dumazet
2016-05-18 16:46           ` Michael S. Tsirkin
2016-05-19 11:59         ` Jesper Dangaard Brouer
2016-05-16  4:23 ` Michael S. Tsirkin
2016-05-16  7:52   ` Jason Wang
2016-05-16  8:08     ` Michael S. Tsirkin
2016-05-17  1:38       ` Jason Wang [this message]
2016-05-18  8:16         ` Jesper Dangaard Brouer
2016-05-18  8:21           ` Michael S. Tsirkin
2016-05-18  9:21             ` Jesper Dangaard Brouer
2016-05-18  9:55               ` Michael S. Tsirkin
2016-05-18 10:42                 ` Jason Wang
2016-05-18 10:58                   ` Michael S. Tsirkin