* [PATCH 0/2 REVIEW] Multiple transmit/receive queue kernel @ 2007-02-09 0:09 Kok, Auke 2007-02-09 0:09 ` [PATCH 1/2] NET: Multiple queue network device support Kok, Auke 2007-02-09 0:09 ` [PATCH 2/2] e1000: Implement the new kernel API for multiqueue TX support Kok, Auke 0 siblings, 2 replies; 17+ messages in thread From: Kok, Auke @ 2007-02-09 0:09 UTC (permalink / raw) To: David Miller, Garzik, Jeff, netdev, linux-kernel Cc: Kok, Auke, Peter Waskiewicz Jr, Brandeburg, Jesse, Kok, Auke, Ronciak, John Hi, The following two patches are submitted for review. Please provide feedback and comments as usual. This set of patches implements a generic API that gives multiqueue-capable devices network stack support for assigning multiple flows to individual queues on the NIC. It provides an interface that lets non-multiqueue devices coexist in a machine with both types of devices, as well as a modified queueing discipline that assigns flows in the stack to individual queues on the NIC. This is necessary to alleviate head-of-line blocking between flows of different priority: today QoS prevents HOLB within the stack, but not in the device. Some Intel PRO/1000 adapters support this hardware feature, so this series includes a patch to the e1000 driver to enable it on those devices. You will need to set up the queues using tc. Limited queue statistics are available through ethtool for devices that support multiple queues. Regards, Auke Kok Peter Waskiewicz Jr. --- These patches apply against commit b892afd1e60132a981b963929e352eabf3306ba2 of branch 'master' from torvalds/linux-2.6. You can pull these patches through git: git-pull git://lost.foo-projects.org/~ahkok/git/netdev-2.6 multiqueue --- Summary: Peter Waskiewicz Jr <peter.p.waskiewicz.jr@intel.com>: NET: Multiple queue network device support e1000: Implement the new kernel API for multiqueue TX support. --- drivers/net/Kconfig | 13 + drivers/net/e1000/e1000.h | 23 +++ drivers/net/e1000/e1000_ethtool.c | 43 ++++++ drivers/net/e1000/e1000_main.c | 269 +++++++++++++++++++++++++++++++------- include/linux/netdevice.h | 73 ++++++++++ include/net/sch_generic.h | 3 net/Kconfig | 20 ++ net/core/dev.c | 51 +++++++ net/sched/sch_generic.c | 117 ++++++++++++++++ net/sched/sch_prio.c | 106 ++++++++++++++ 10 files changed, 668 insertions(+), 50 deletions(-) ^ permalink raw reply [flat|nested] 17+ messages in thread
* [PATCH 1/2] NET: Multiple queue network device support 2007-02-09 0:09 [PATCH 0/2 REVIEW] Multiple transmit/receive queue kernel Kok, Auke @ 2007-02-09 0:09 ` Kok, Auke 2007-02-27 1:03 ` David Miller 2007-03-09 13:40 ` Thomas Graf 2007-02-09 0:09 ` [PATCH 2/2] e1000: Implement the new kernel API for multiqueue TX support Kok, Auke 1 sibling, 2 replies; 17+ messages in thread From: Kok, Auke @ 2007-02-09 0:09 UTC (permalink / raw) To: David Miller, Garzik, Jeff, netdev, linux-kernel Cc: Kok, Auke, Peter Waskiewicz Jr, Brandeburg, Jesse, Kok, Auke, Ronciak, John From: Peter Waskiewicz Jr <peter.p.waskiewicz.jr@intel.com> Added an API and associated supporting routines for multiqueue network devices. This allows network devices supporting multiple TX queues to configure each queue within the netdevice and manage each queue independantly. Changes to the PRIO Qdisc also allow a user to map multiple flows to individual TX queues, taking advantage of each queue on the device. Signed-off-by: Peter Waskiewicz Jr <peter.p.waskiewicz.jr@intel.com> Signed-off-by: Auke Kok <auke-jan.h.kok@intel.com> --- include/linux/netdevice.h | 73 ++++++++++++++++++++++++++++ include/net/sch_generic.h | 3 + net/Kconfig | 20 ++++++++ net/core/dev.c | 51 ++++++++++++++++++++ net/sched/sch_generic.c | 117 +++++++++++++++++++++++++++++++++++++++++++++ net/sched/sch_prio.c | 106 ++++++++++++++++++++++++++++++++++++++--- 6 files changed, 364 insertions(+), 6 deletions(-) diff --git a/include/linux/netdevice.h b/include/linux/netdevice.h index 2e37f50..c7f94a8 100644 --- a/include/linux/netdevice.h +++ b/include/linux/netdevice.h @@ -106,6 +106,16 @@ struct netpoll_info; #define MAX_HEADER (LL_MAX_HEADER + 48) #endif +#ifdef CONFIG_NET_MULTI_QUEUE_DEVICE + +struct net_device_subqueue +{ + /* Give a lock and a flow control state for each queue */ + unsigned long state; + spinlock_t queue_lock ____cacheline_aligned_in_smp; +}; +#endif + /* * Network device statistics. Akin to the 2.0 ether stats but * with byte counters. @@ -339,6 +349,10 @@ struct net_device #define NETIF_F_GEN_CSUM (NETIF_F_NO_CSUM | NETIF_F_HW_CSUM) #define NETIF_F_ALL_CSUM (NETIF_F_IP_CSUM | NETIF_F_GEN_CSUM) +#ifdef CONFIG_NET_MULTI_QUEUE_DEVICE +#define NETIF_F_MULTI_QUEUE 8192 /* Has multiple TX/RX queues */ +#endif + struct net_device *next_sched; /* Interface index. Unique device identifier */ @@ -532,6 +546,19 @@ struct net_device struct device dev; /* space for optional statistics and wireless sysfs groups */ struct attribute_group *sysfs_groups[3]; + +#ifdef CONFIG_NET_MULTI_QUEUE_DEVICE + /* To retrieve statistics per subqueue - FOR FUTURE USE */ + struct net_device_stats* (*get_subqueue_stats)(struct net_device *dev, + int queue_index); + + /* The TX queue control structures */ + struct net_device_subqueue *egress_subqueue; + int egress_subqueue_count; + int (*hard_start_subqueue_xmit)(struct sk_buff *skb, + struct net_device *dev, + int queue_index); +#endif /* CONFIG_NET_MULTI_QUEUE_DEVICE */ }; #define to_net_dev(d) container_of(d, struct net_device, dev) @@ -673,6 +700,52 @@ static inline int netif_running(const struct net_device *dev) return test_bit(__LINK_STATE_START, &dev->state); } +#ifdef CONFIG_NET_MULTI_QUEUE_DEVICE + +/* + * Routines to manage the subqueues on a device. We only need start + * stop, and a check if it's stopped. All other device management is + * done at the overall netdevice level. + * Also test the netif state if we're multi-queue at all. 
+ */ +static inline void netif_start_subqueue(struct net_device *dev, int queue_index) +{ + clear_bit(__LINK_STATE_XOFF, &dev->egress_subqueue[queue_index].state); +} + +static inline void netif_stop_subqueue(struct net_device *dev, int queue_index) +{ +#ifdef CONFIG_NETPOLL_TRAP + if (netpoll_trap()) + return; +#endif + set_bit(__LINK_STATE_XOFF, &dev->egress_subqueue[queue_index].state); +} + +static inline int netif_subqueue_stopped(const struct net_device *dev, + int queue_index) +{ + return test_bit(__LINK_STATE_XOFF, + &dev->egress_subqueue[queue_index].state); +} + +static inline void netif_wake_subqueue(struct net_device *dev, int queue_index) +{ +#ifdef CONFIG_NETPOLL_TRAP + if (netpoll_trap()) + return; +#endif + if (test_and_clear_bit(__LINK_STATE_XOFF, + &dev->egress_subqueue[queue_index].state)) + __netif_schedule(dev); +} + +static inline int netif_is_multiqueue(const struct net_device *dev) +{ + return (!!(NETIF_F_MULTI_QUEUE & dev->features)); +} +#endif /* CONFIG_NET_MULTI_QUEUE_DEVICE */ + /* Use this variant when it is known for sure that it * is executing from interrupt context. diff --git a/include/net/sch_generic.h b/include/net/sch_generic.h index 8208639..8751ba6 100644 --- a/include/net/sch_generic.h +++ b/include/net/sch_generic.h @@ -102,6 +102,9 @@ struct Qdisc_ops int (*dump)(struct Qdisc *, struct sk_buff *); int (*dump_stats)(struct Qdisc *, struct gnet_dump *); +#ifdef CONFIG_NET_MULTI_QUEUE_DEVICE + int (*map_queue)(struct sk_buff *, struct Qdisc *); +#endif struct module *owner; }; diff --git a/net/Kconfig b/net/Kconfig index 7dfc949..796edb3 100644 --- a/net/Kconfig +++ b/net/Kconfig @@ -38,6 +38,26 @@ source "net/packet/Kconfig" source "net/unix/Kconfig" source "net/xfrm/Kconfig" +config NET_MULTI_QUEUE_DEVICE + bool "Multiple queue network device support (EXPERIMENTAL)" + depends on NET_SCHED && EXPERIMENTAL + help + Saying Y here will add support for network devices that have more than + one physical TX queue and/or RX queue. + + Multiple queue devices will require qdiscs that understand how to + queue to multiple targets. The default packet scheduler will default + to the first queue in a device. In other words, if you need the + ability to spread traffic across queues, your queueing discipline + needs to know how to do that. + + Note that saying Y here will give preferential treatment to multiple + queue devices in the network stack. A slight drop in single-queue + device performance may be seen. + + Say N here unless you know your network device supports multiple + TX and/or RX queues. + config INET bool "TCP/IP networking" ---help--- diff --git a/net/core/dev.c b/net/core/dev.c index 455d589..42b635c 100644 --- a/net/core/dev.c +++ b/net/core/dev.c @@ -1477,6 +1477,49 @@ gso: skb->tc_verd = SET_TC_AT(skb->tc_verd,AT_EGRESS); #endif if (q->enqueue) { +#ifdef CONFIG_NET_MULTI_QUEUE_DEVICE + int queue_index; + /* If we're a multi-queue device, get a queue index to lock */ + if (netif_is_multiqueue(dev)) + { + /* Get the queue index and lock it. */ + if (likely(q->ops->map_queue)) { + queue_index = q->ops->map_queue(skb, q); + spin_lock(&dev->egress_subqueue[queue_index].queue_lock); + rc = q->enqueue(skb, q); + /* + * Unlock because the underlying qdisc + * may queue and send a packet from a + * different queue. + */ + spin_unlock(&dev->egress_subqueue[queue_index].queue_lock); + qdisc_run(dev); + rc = rc == NET_XMIT_BYPASS + ? 
NET_XMIT_SUCCESS : rc; + goto out; + } else { + printk(KERN_CRIT "Device %s is " + "multiqueue, but map_queue is " + "undefined in the qdisc!!\n", + dev->name); + kfree_skb(skb); + } + } else { + /* We're not a multi-queue device. */ + spin_lock(&dev->queue_lock); + q = dev->qdisc; + if (q->enqueue) { + rc = q->enqueue(skb, q); + qdisc_run(dev); + spin_unlock(&dev->queue_lock); + rc = rc == NET_XMIT_BYPASS + ? NET_XMIT_SUCCESS : rc; + goto out; + } + spin_unlock(&dev->queue_lock); + } +#else /* No multi-queue support compiled in */ + /* Grab device queue */ spin_lock(&dev->queue_lock); q = dev->qdisc; @@ -1489,6 +1532,7 @@ gso: goto out; } spin_unlock(&dev->queue_lock); +#endif /* CONFIG_NET_MULTI_QUEUE_DEVICE */ } /* The device has no queue. Common case for software devices: @@ -2879,6 +2923,9 @@ int register_netdevice(struct net_device *dev) struct hlist_head *head; struct hlist_node *p; int ret; +#ifdef CONFIG_NET_MULTI_QUEUE_DEVICE + int i; +#endif BUG_ON(dev_boot_phase); ASSERT_RTNL(); @@ -2894,6 +2941,10 @@ int register_netdevice(struct net_device *dev) #ifdef CONFIG_NET_CLS_ACT spin_lock_init(&dev->ingress_lock); #endif +#ifdef CONFIG_NET_MULTI_QUEUE_DEVICE + for (i = 0; i < dev->egress_subqueue_count; i++) + spin_lock_init(&dev->egress_subqueue[i].queue_lock); +#endif dev->iflink = -1; diff --git a/net/sched/sch_generic.c b/net/sched/sch_generic.c index bc116bd..62ca23e 100644 --- a/net/sched/sch_generic.c +++ b/net/sched/sch_generic.c @@ -42,6 +42,7 @@ to data, participating in scheduling must be additionally protected with dev->queue_lock spinlock. + For single-queue devices: The idea is the following: - enqueue, dequeue are serialized via top level device spinlock dev->queue_lock. @@ -51,6 +52,12 @@ hence this lock may be made without local bh disabling. qdisc_tree_lock must be grabbed BEFORE dev->queue_lock! + + For multi-queue devices: + - enqueue, dequeue are serialized with individual queue locks, + dev->egress_subqueue[queue_index].queue_lock. + - Everything else with tree protection with single queue devices apply + to multiqueue devices. 
*/ DEFINE_RWLOCK(qdisc_tree_lock); @@ -92,6 +99,9 @@ static inline int qdisc_restart(struct net_device *dev) { struct Qdisc *q = dev->qdisc; struct sk_buff *skb; +#ifdef CONFIG_NET_MULTI_QUEUE_DEVICE + int queue_index = 0; +#endif /* Dequeue packet */ if (((skb = dev->gso_skb)) || ((skb = q->dequeue(q)))) { @@ -130,6 +140,72 @@ static inline int qdisc_restart(struct net_device *dev) } { +#ifdef CONFIG_NET_MULTI_QUEUE_DEVICE + if (netif_is_multiqueue(dev)) { + if (likely(q->ops->map_queue)) { + queue_index = q->ops->map_queue(skb, q); + } else { + printk(KERN_CRIT "Device %s is " + "multiqueue, but map_queue is " + "undefined in the qdisc!!\n", + dev->name); + goto requeue; + } + spin_unlock(&dev->egress_subqueue[queue_index].queue_lock); + /* Check top level device, and any sub-device */ + if ((!netif_queue_stopped(dev)) && + (!netif_subqueue_stopped(dev, queue_index))) { + int ret; + ret = dev->hard_start_subqueue_xmit(skb, dev, queue_index); + if (ret == NETDEV_TX_OK) { + if (!nolock) { + netif_tx_unlock(dev); + } + return -1; + } + if (ret == NETDEV_TX_LOCKED && nolock) { + spin_lock(&dev->egress_subqueue[queue_index].queue_lock); + goto collision; + } + } + /* NETDEV_TX_BUSY - we need to requeue */ + /* Release the driver */ + if (!nolock) { + netif_tx_unlock(dev); + } + spin_lock(&dev->egress_subqueue[queue_index].queue_lock); + q = dev->qdisc; + } + else + { + /* We're a single-queue device */ + /* And release queue */ + spin_unlock(&dev->queue_lock); + if (!netif_queue_stopped(dev)) { + int ret; + + ret = dev->hard_start_xmit(skb, dev); + if (ret == NETDEV_TX_OK) { + if (!nolock) { + netif_tx_unlock(dev); + } + spin_lock(&dev->queue_lock); + return -1; + } + if (ret == NETDEV_TX_LOCKED && nolock) { + spin_lock(&dev->queue_lock); + goto collision; + } + } + /* NETDEV_TX_BUSY - we need to requeue */ + /* Release the driver */ + if (!nolock) { + netif_tx_unlock(dev); + } + spin_lock(&dev->queue_lock); + q = dev->qdisc; + } +#else /* CONFIG_NET_MULTI_QUEUE_DEVICE */ /* And release queue */ spin_unlock(&dev->queue_lock); @@ -157,6 +233,7 @@ static inline int qdisc_restart(struct net_device *dev) } spin_lock(&dev->queue_lock); q = dev->qdisc; +#endif /* CONFIG_NET_MULTI_QUEUE_DEVICE */ } /* Device kicked us out :( @@ -175,9 +252,17 @@ requeue: else q->ops->requeue(skb, q); netif_schedule(dev); +#ifdef CONFIG_NET_MULTI_QUEUE_DEVICE + if (netif_is_multiqueue(dev)) + spin_unlock(&dev->egress_subqueue[queue_index].queue_lock); +#endif return 1; } BUG_ON((int) q->q.qlen < 0); +#ifdef CONFIG_NET_MULTI_QUEUE_DEVICE + if (netif_is_multiqueue(dev)) + spin_unlock(&dev->egress_subqueue[queue_index].queue_lock); +#endif return q->q.qlen; } @@ -287,12 +372,22 @@ static int noop_requeue(struct sk_buff *skb, struct Qdisc* qdisc) return NET_XMIT_CN; } +#ifdef CONFIG_NET_MULTI_QUEUE_DEVICE +static int noop_map_queue(struct sk_buff *skb, struct Qdisc *qdisc) +{ + return 0; +} +#endif + struct Qdisc_ops noop_qdisc_ops = { .id = "noop", .priv_size = 0, .enqueue = noop_enqueue, .dequeue = noop_dequeue, .requeue = noop_requeue, +#ifdef CONFIG_NET_MULTI_QUEUE_DEVICE + .map_queue = noop_map_queue, +#endif .owner = THIS_MODULE, }; @@ -310,6 +405,9 @@ static struct Qdisc_ops noqueue_qdisc_ops = { .enqueue = noop_enqueue, .dequeue = noop_dequeue, .requeue = noop_requeue, +#ifdef CONFIG_NET_MULTI_QUEUE_DEVICE + .map_queue = noop_map_queue, +#endif .owner = THIS_MODULE, }; @@ -356,10 +454,18 @@ static struct sk_buff *pfifo_fast_dequeue(struct Qdisc* qdisc) struct sk_buff_head *list = qdisc_priv(qdisc); for (prio = 0; 
prio < PFIFO_FAST_BANDS; prio++) { +#ifdef CONFIG_NET_MULTI_QUEUE_DEVICE + if (netif_is_multiqueue(qdisc->dev)) + spin_lock(&qdisc->dev->egress_subqueue[0].queue_lock); +#endif if (!skb_queue_empty(list + prio)) { qdisc->q.qlen--; return __qdisc_dequeue_head(qdisc, list + prio); } +#ifdef CONFIG_NET_MULTI_QUEUE_DEVICE + if (netif_is_multiqueue(qdisc->dev)) + spin_unlock(&qdisc->dev->egress_subqueue[0].queue_lock); +#endif } return NULL; @@ -406,6 +512,14 @@ static int pfifo_fast_init(struct Qdisc *qdisc, struct rtattr *opt) return 0; } +#ifdef CONFIG_NET_MULTI_QUEUE_DEVICE +static int pfifo_fast_map_queue(struct sk_buff *skb, struct Qdisc *qdisc) +{ + /* Generic qdisc, so always return queue index 0. */ + return 0; +} +#endif + static struct Qdisc_ops pfifo_fast_ops = { .id = "pfifo_fast", .priv_size = PFIFO_FAST_BANDS * sizeof(struct sk_buff_head), @@ -415,6 +529,9 @@ static struct Qdisc_ops pfifo_fast_ops = { .init = pfifo_fast_init, .reset = pfifo_fast_reset, .dump = pfifo_fast_dump, +#ifdef CONFIG_NET_MULTI_QUEUE_DEVICE + .map_queue = pfifo_fast_map_queue, +#endif .owner = THIS_MODULE, }; diff --git a/net/sched/sch_prio.c b/net/sched/sch_prio.c index 2567b4c..0e24071 100644 --- a/net/sched/sch_prio.c +++ b/net/sched/sch_prio.c @@ -43,16 +43,19 @@ struct prio_sched_data struct tcf_proto *filter_list; u8 prio2band[TC_PRIO_MAX+1]; struct Qdisc *queues[TCQ_PRIO_BANDS]; +#ifdef CONFIG_NET_MULTI_QUEUE_DEVICE + int band2queue[TC_PRIO_MAX + 1]; +#endif }; - static struct Qdisc * -prio_classify(struct sk_buff *skb, struct Qdisc *sch, int *qerr) +prio_classify(struct sk_buff *skb, struct Qdisc *sch, int *qerr, int *retband) { struct prio_sched_data *q = qdisc_priv(sch); u32 band = skb->priority; struct tcf_result res; + *retband = band; *qerr = NET_XMIT_BYPASS; if (TC_H_MAJ(skb->priority) != sch->handle) { #ifdef CONFIG_NET_CLS_ACT @@ -68,13 +71,16 @@ prio_classify(struct sk_buff *skb, struct Qdisc *sch, int *qerr) #else if (!q->filter_list || tc_classify(skb, q->filter_list, &res)) { #endif - if (TC_H_MAJ(band)) + if (TC_H_MAJ(band)) { band = 0; + *retband = 0; + } return q->queues[q->prio2band[band&TC_PRIO_MAX]]; } band = res.classid; } band = TC_H_MIN(band) - 1; + *retband = band; if (band > q->bands) return q->queues[q->prio2band[0]]; @@ -86,8 +92,15 @@ prio_enqueue(struct sk_buff *skb, struct Qdisc *sch) { struct Qdisc *qdisc; int ret; + int band; /* Unused in this function, just here for multi-queue */ + + /* + * NOTE: if a device is multiqueue, then it is imperative the + * underlying qdisc NOT be another PRIO qdisc. Otherwise we're going + * to deadlock. Fixme please. + */ + qdisc = prio_classify(skb, sch, &ret, &band); - qdisc = prio_classify(skb, sch, &ret); #ifdef CONFIG_NET_CLS_ACT if (qdisc == NULL) { @@ -108,14 +121,34 @@ prio_enqueue(struct sk_buff *skb, struct Qdisc *sch) return ret; } +#ifdef CONFIG_NET_MULTI_QUEUE_DEVICE +static int +prio_map_queue(struct sk_buff *skb, struct Qdisc *sch) +{ + /* + * We'll classify the skb here, so return the Qdisc back through + * the function call. The return value is the queue we need to + * lock, which will be the value returned through band. This will help + * speed up enqueue, since it won't have to do the classify again. 
+ */ + int ret; + int band; + struct prio_sched_data *q = qdisc_priv(sch); + + sch = prio_classify(skb, sch, &ret, &band); + return q->band2queue[band]; +} +#endif static int prio_requeue(struct sk_buff *skb, struct Qdisc* sch) { struct Qdisc *qdisc; int ret; + int band; + + qdisc = prio_classify(skb, sch, &ret, &band); - qdisc = prio_classify(skb, sch, &ret); #ifdef CONFIG_NET_CLS_ACT if (qdisc == NULL) { if (ret == NET_XMIT_BYPASS) @@ -141,18 +174,53 @@ prio_dequeue(struct Qdisc* sch) struct sk_buff *skb; struct prio_sched_data *q = qdisc_priv(sch); int prio; +#ifdef CONFIG_NET_MULTI_QUEUE_DEVICE + int queue; +#endif struct Qdisc *qdisc; + /* + * If we're multiqueue, the basic approach is try the lock on each + * queue. If it's locked, either we're already dequeuing, or enqueue + * is doing something. Go to the next band if we're locked. Once we + * have a packet, unlock the queue. NOTE: the underlying qdisc CANNOT + * be a PRIO qdisc, otherwise we will deadlock. FIXME + */ for (prio = 0; prio < q->bands; prio++) { +#ifdef CONFIG_NET_MULTI_QUEUE_DEVICE + if (netif_is_multiqueue(sch->dev)) { + queue = q->band2queue[prio]; + if (spin_trylock(&sch->dev->egress_subqueue[queue].queue_lock)) { + qdisc = q->queues[prio]; + skb = qdisc->dequeue(qdisc); + if (skb) { + sch->q.qlen--; + skb->priority = prio; + spin_unlock(&sch->dev->egress_subqueue[queue].queue_lock); + return skb; + } + spin_unlock(&sch->dev->egress_subqueue[queue].queue_lock); + } + } else { + qdisc = q->queues[prio]; + skb = qdisc->dequeue(qdisc); + if (skb) { + sch->q.qlen--; + skb->priority = prio; + return skb; + } + } +#else qdisc = q->queues[prio]; skb = qdisc->dequeue(qdisc); if (skb) { sch->q.qlen--; + skb->priority = prio; return skb; } +#endif } return NULL; - } static unsigned int prio_drop(struct Qdisc* sch) @@ -205,6 +273,10 @@ static int prio_tune(struct Qdisc *sch, struct rtattr *opt) struct prio_sched_data *q = qdisc_priv(sch); struct tc_prio_qopt *qopt = RTA_DATA(opt); int i; +#ifdef CONFIG_NET_MULTI_QUEUE_DEVICE + int queue; + int qmapoffset; +#endif if (opt->rta_len < RTA_LENGTH(sizeof(*qopt))) return -EINVAL; @@ -247,6 +319,25 @@ static int prio_tune(struct Qdisc *sch, struct rtattr *opt) } } } +#ifdef CONFIG_NET_MULTI_QUEUE_DEVICE + /* setup queue to band mapping */ + if (netif_is_multiqueue(sch->dev)) { + if (q->bands == sch->dev->egress_subqueue_count) + qmapoffset = 0; + else + qmapoffset = q->bands / sch->dev->egress_subqueue_count; + + queue = 0; + for (i = 0; i < q->bands; i++) { + q->band2queue[i] = queue; + if ( (((i + 1) - queue) == qmapoffset) && + (queue < (sch->dev->egress_subqueue_count - 1))) { + queue++; + } + } + } + +#endif return 0; } @@ -430,6 +521,9 @@ static struct Qdisc_ops prio_qdisc_ops = { .destroy = prio_destroy, .change = prio_tune, .dump = prio_dump, +#ifdef CONFIG_NET_MULTI_QUEUE_DEVICE + .map_queue = prio_map_queue, +#endif .owner = THIS_MODULE, }; --- Auke Kok <auke-jan.h.kok@intel.com> ^ permalink raw reply related [flat|nested] 17+ messages in thread
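The driver-facing half of this API is only exercised by patch 2/2, which is not quoted in this thread. As a rough sketch of how a driver would be expected to use the subqueue flow-control hooks defined above, the fragment below stops an individual queue when its TX ring fills and wakes it from the completion path. The my_* ring types and helpers are hypothetical placeholders invented for illustration; the e1000 changes in patch 2/2 are the real reference implementation.

static int my_hard_start_subqueue_xmit(struct sk_buff *skb,
				       struct net_device *dev,
				       int queue_index)
{
	/* my_tx_ring_for() and the other my_* helpers are invented
	 * for illustration; a real driver maps queue_index to one of
	 * its hardware TX rings here. */
	struct my_tx_ring *ring = my_tx_ring_for(dev, queue_index);

	if (my_ring_free_slots(ring) < MY_MAX_SLOTS_PER_SKB) {
		/* Stop only this queue; the device's other queues
		 * keep transmitting. */
		netif_stop_subqueue(dev, queue_index);
		return NETDEV_TX_BUSY;
	}

	my_post_descriptors(ring, skb);
	return NETDEV_TX_OK;
}

/* TX completion path: reclaim descriptors, then wake the one queue. */
static void my_tx_clean(struct net_device *dev, int queue_index)
{
	struct my_tx_ring *ring = my_tx_ring_for(dev, queue_index);

	my_reclaim_descriptors(ring);
	if (netif_subqueue_stopped(dev, queue_index) &&
	    my_ring_free_slots(ring) >= MY_MAX_SLOTS_PER_SKB)
		netif_wake_subqueue(dev, queue_index);
}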
* Re: [PATCH 1/2] NET: Multiple queue network device support 2007-02-09 0:09 ` [PATCH 1/2] NET: Multiple queue network device support Kok, Auke @ 2007-02-27 1:03 ` David Miller 2007-02-27 19:38 ` Waskiewicz Jr, Peter P 2007-03-07 22:18 ` Waskiewicz Jr, Peter P 1 sibling, 2 replies; 17+ messages in thread From: David Miller @ 2007-02-27 1:03 UTC (permalink / raw) To: auke-jan.h.kok Cc: jgarzik, netdev, linux-kernel, peter.p.waskiewicz.jr, jesse.brandeburg, auke, john.ronciak From: "Kok, Auke" <auke-jan.h.kok@intel.com> Date: Thu, 08 Feb 2007 16:09:50 -0800 > From: Peter Waskiewicz Jr <peter.p.waskiewicz.jr@intel.com> > > Added an API and associated supporting routines for multiqueue network > devices. This allows network devices supporting multiple TX queues to > configure each queue within the netdevice and manage each queue > independantly. Changes to the PRIO Qdisc also allow a user to map > multiple flows to individual TX queues, taking advantage of each queue > on the device. > > Signed-off-by: Peter Waskiewicz Jr <peter.p.waskiewicz.jr@intel.com> > Signed-off-by: Auke Kok <auke-jan.h.kok@intel.com> Thanks for posting this. I was wondering if it would be better to have the ->enqueue() perform the mapping operation, store the queue index selected inside of the skb, and just use the normal ->hard_start_xmit() to send the packet out? The only thing to take care of is the per-queue locking, but that should be easily doable. We might be able to do this even without making sk_buff any larger. For example, I suppose skb->priority might be reducible down to a u16. After a quick audit I really cannot see any legitimate use of anything larger than 16-bit values, but a more thorough audit would be necessary to validate this. ^ permalink raw reply [flat|nested] 17+ messages in thread
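To make the "without making sk_buff any larger" point concrete, the toy program below compares the two layouts Dave describes: the existing 32-bit priority versus a 16-bit priority plus a 16-bit queue index. The field name queue_mapping is a hypothetical choice for illustration; nothing in the thread fixes the name.

#include <stdio.h>
#include <stdint.h>

/* Simplified stand-ins for the relevant corner of struct sk_buff. */
struct skb_fields_today {
	uint32_t priority;		/* current 32-bit field */
};

struct skb_fields_proposed {
	uint16_t priority;		/* shrunk to 16 bits ... */
	uint16_t queue_mapping;		/* ... freeing room for a queue index */
};

int main(void)
{
	/* Both layouts occupy 4 bytes, so sk_buff would not grow. */
	printf("today: %zu bytes, proposed: %zu bytes\n",
	       sizeof(struct skb_fields_today),
	       sizeof(struct skb_fields_proposed));
	return 0;
}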
* RE: [PATCH 1/2] NET: Multiple queue network device support 2007-02-27 1:03 ` David Miller @ 2007-02-27 19:38 ` Waskiewicz Jr, Peter P 2007-03-07 22:18 ` Waskiewicz Jr, Peter P 1 sibling, 0 replies; 17+ messages in thread From: Waskiewicz Jr, Peter P @ 2007-02-27 19:38 UTC (permalink / raw) To: David Miller Cc: jgarzik, netdev, linux-kernel, Brandeburg, Jesse, auke, Ronciak, John, Kok, Auke-jan H Dave, Thanks for the feedback. Please see below for my comments / questions. PJ Waskiewicz Intel Corporation > > From: Peter Waskiewicz Jr <peter.p.waskiewicz.jr@intel.com> > > > > Added an API and associated supporting routines for > multiqueue network > > devices. This allows network devices supporting multiple > TX queues to > > configure each queue within the netdevice and manage each queue > > independantly. Changes to the PRIO Qdisc also allow a user to map > > multiple flows to individual TX queues, taking advantage of > each queue > > on the device. > > > > Signed-off-by: Peter Waskiewicz Jr <peter.p.waskiewicz.jr@intel.com> > > Signed-off-by: Auke Kok <auke-jan.h.kok@intel.com> > > Thanks for posting this. > > I was wonding if it would be better to have the ->enqueue() > perform the mapping operation, store the queue index selected > inside of the skb, and just use the normal > ->hard_start_xmit() to send the packet out? One of the reasons I chose to add more functions is for the queue management itself. Specifically, I needed the ability to lock queues, stop queues (netif_stop_queue), and wake them up, independent of the global queue_lock in the netdevice. At first I did think about storing the queue index in the skb, but that won't help with the queue locking, since only netdevice is passed to them outside of the context of the skb. As for adding the additional multiqueue version of hard_start_subqueue_xmit(), I did that so non-MQ devices would have a clean, untouched path in dev_queue_xmit(), and only MQ devices would touch a new codepath. This also removes the need for the device driver to inspect the queue index in the skb for which Tx ring to place the packet in, which is good for performance. For having ->enqueue() take care of the mapping, I thought for awhile about this. Originally, I had ->enqueue() doing just that, until I realized I was not locking the individual subqueues correctly. The thought process was once I receive an skb for Tx, I need to lock a queue, enqueue the skb, call qdisc_run(), then unlock. That grew into lock a queue, enqueue, unlock the queue, have dequeue decide which queue to pull an skb from (since the queue an skb comes from can be different than the one just enqueued to), lock it, dequeue, unlock, and return the skb. The whole issue came down to needing to know what queue to lock before enqueuing. If I called enqueue, I would be going in unlocked. I suppose this would be ok if the lock is done inside the ->enqueue(), and unlock is done before exiting; let me look at that possibility. > > The only thing to take care of is the per-queue locking, but > that should be easily doable. > > We might be able to do this even without making sk_buff any larger. > For example, I suppose skb->priority might be deducable down > to a u16. After a quick audit I really cannot see any > legitimate use of anything larger than 16-bit values, but a > more thorough audit would be necessary to validate this. The skb->priority field right now is used to determine the PRIO / pfifo_fast band to place the skb in on enqueue. 
I still need to think about if storing the queue index in the skb would help, due to the queue management functions outside of the skb context. Again, thanks for the feedback. I'll respond once I've looked at moving locks into ->enqueue() and removing ->map_queue(). Please note that the PRIO band to queue mapping in this patch has a problem when the number of bands is <= the number of Tx queues on the NIC. I have a fix, and will include that in a repost. ^ permalink raw reply [flat|nested] 17+ messages in thread
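The band-to-queue mapping problem Peter mentions can be reproduced in userspace by lifting the mapping loop out of prio_tune() in patch 1/2. Whenever bands <= queues, the computed offset is zero, the increment condition never fires, and every band collapses onto queue 0. This is an illustration of the bug, not code from the series:

#include <stdio.h>

/* The band2queue setup loop from prio_tune(), lifted out verbatim. */
static void map_bands(int bands, int queues)
{
	int band2queue[16];
	int queue = 0, qmapoffset, i;

	if (bands == queues)
		qmapoffset = 0;
	else
		qmapoffset = bands / queues;

	for (i = 0; i < bands; i++) {
		band2queue[i] = queue;
		if ((((i + 1) - queue) == qmapoffset) &&
		    (queue < queues - 1))
			queue++;
	}

	printf("bands=%d queues=%d:", bands, queues);
	for (i = 0; i < bands; i++)
		printf(" %d", band2queue[i]);
	printf("\n");
}

int main(void)
{
	map_bands(8, 2);	/* fine:   0 0 0 0 1 1 1 1 */
	map_bands(4, 4);	/* broken: 0 0 0 0 (all on queue 0) */
	map_bands(2, 4);	/* broken: 0 0 */
	return 0;
}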
* RE: [PATCH 1/2] NET: Multiple queue network device support 2007-02-27 1:03 ` David Miller 2007-02-27 19:38 ` Waskiewicz Jr, Peter P @ 2007-03-07 22:18 ` Waskiewicz Jr, Peter P 2007-03-07 22:42 ` David Miller 1 sibling, 1 reply; 17+ messages in thread From: Waskiewicz Jr, Peter P @ 2007-03-07 22:18 UTC (permalink / raw) To: David Miller Cc: jgarzik, netdev, linux-kernel, Brandeburg, Jesse, auke, Ronciak, John Dave, I did some research based on your feedback. Specifically, I looked at removing ->map_queue() and allowing ->enqueue() to take care of mapping and locking of the queues (and ->dequeue()). I found that it can be done either way, but I'm not sure I like taking the locking out of dev_queue_xmit(), and relying on the qdisc ->enqueue() to make sure locking is completed correctly. Having a map routine also allows various parts of the stack to query the skb where it belongs. I also looked into storing the mapped queue value in the skb, namely in skb->priority. If I were to use this value, I'd lose filtering capabilities in sch_prio.c, where if no tc filter exists for the skb in question, it relies on the value of skb->priority to enqueue to a band. The value of skb->priority could also be used in a base driver for further filtering (some of the MAC-based QoS on wireless devices comes to mind), so I'm not sure I feel comfortable using that field to store the queue. I also envision some qdiscs in the future that could change the mapping of bands to queues without reloading the qdisc, so storing the queue information in the skb could lead to out-of-order packets into queues (stale information in the skb). What do you think? Thanks, PJ Waskiewicz Software Engineer LAD - LAN Access Division, New Technologies Intel Corporation peter.p.waskiewicz.jr@intel.com > -----Original Message----- > From: David Miller [mailto:davem@davemloft.net] > Sent: Monday, February 26, 2007 5:03 PM > To: Kok, Auke-jan H > Cc: jgarzik@pobox.com; netdev@vger.kernel.org; > linux-kernel@vger.kernel.org; Waskiewicz Jr, Peter P; > Brandeburg, Jesse; auke@foo-projects.org; Ronciak, John > Subject: Re: [PATCH 1/2] NET: Multiple queue network device support > > From: "Kok, Auke" <auke-jan.h.kok@intel.com> > Date: Thu, 08 Feb 2007 16:09:50 -0800 > > > From: Peter Waskiewicz Jr <peter.p.waskiewicz.jr@intel.com> > > > > Added an API and associated supporting routines for > multiqueue network > > devices. This allows network devices supporting multiple > TX queues to > > configure each queue within the netdevice and manage each queue > > independantly. Changes to the PRIO Qdisc also allow a user to map > > multiple flows to individual TX queues, taking advantage of > each queue > > on the device. > > > > Signed-off-by: Peter Waskiewicz Jr <peter.p.waskiewicz.jr@intel.com> > > Signed-off-by: Auke Kok <auke-jan.h.kok@intel.com> > > Thanks for posting this. > > I was wonding if it would be better to have the ->enqueue() > perform the mapping operation, store the queue index selected > inside of the skb, and just use the normal > ->hard_start_xmit() to send the packet out? > > The only thing to take care of is the per-queue locking, but > that should be easily doable. > > We might be able to do this even without making sk_buff any larger. > For example, I suppose skb->priority might be deducable down > to a u16. After a quick audit I really cannot see any > legitimate use of anything larger than 16-bit values, but a > more thorough audit would be necessary to validate this. 
> ^ permalink raw reply [flat|nested] 17+ messages in thread
* Re: [PATCH 1/2] NET: Multiple queue network device support 2007-03-07 22:18 ` Waskiewicz Jr, Peter P @ 2007-03-07 22:42 ` David Miller 2007-03-09 7:26 ` Jarek Poplawski 0 siblings, 1 reply; 17+ messages in thread From: David Miller @ 2007-03-07 22:42 UTC (permalink / raw) To: peter.p.waskiewicz.jr Cc: jgarzik, netdev, linux-kernel, jesse.brandeburg, auke, john.ronciak I didn't say to use skb->priority, I said to shrink skb->priority down to a u16 and then make another u16 which will store your queue mapping value. ^ permalink raw reply [flat|nested] 17+ messages in thread
* Re: [PATCH 1/2] NET: Multiple queue network device support 2007-03-07 22:42 ` David Miller @ 2007-03-09 7:26 ` Jarek Poplawski 0 siblings, 0 replies; 17+ messages in thread From: Jarek Poplawski @ 2007-03-09 7:26 UTC (permalink / raw) To: David Miller Cc: peter.p.waskiewicz.jr, jgarzik, netdev, linux-kernel, jesse.brandeburg, auke, john.ronciak On 07-03-2007 23:42, David Miller wrote: > I didn't say to use skb->priority, I said to shrink skb->priority down > to a u16 and then make another u16 which will store your queue mapping > value. Peter is right: skb->priority is fully used by schedulers (prio, CBQ, HTB, HFSC...) and shrinking it would break users' scripts, so I wouldn't recommend it either. Regards, Jarek P. ^ permalink raw reply [flat|nested] 17+ messages in thread
* Re: [PATCH 1/2] NET: Multiple queue network device support 2007-02-09 0:09 ` [PATCH 1/2] NET: Multiple queue network device support Kok, Auke 2007-02-27 1:03 ` David Miller @ 2007-03-09 13:40 ` Thomas Graf 2007-03-09 19:25 ` Waskiewicz Jr, Peter P 2007-03-12 8:58 ` Jarek Poplawski 1 sibling, 2 replies; 17+ messages in thread From: Thomas Graf @ 2007-03-09 13:40 UTC (permalink / raw) To: Kok, Auke Cc: David Miller, Garzik, Jeff, netdev, linux-kernel, Peter Waskiewicz Jr, Brandeburg, Jesse, Kok, Auke, Ronciak, John * Kok, Auke <auke-jan.h.kok@intel.com> 2007-02-08 16:09 > diff --git a/net/core/dev.c b/net/core/dev.c > index 455d589..42b635c 100644 > --- a/net/core/dev.c > +++ b/net/core/dev.c > @@ -1477,6 +1477,49 @@ gso: > skb->tc_verd = SET_TC_AT(skb->tc_verd,AT_EGRESS); > #endif > if (q->enqueue) { > +#ifdef CONFIG_NET_MULTI_QUEUE_DEVICE > + int queue_index; > + /* If we're a multi-queue device, get a queue index to lock */ > + if (netif_is_multiqueue(dev)) > + { > + /* Get the queue index and lock it. */ > + if (likely(q->ops->map_queue)) { > + queue_index = q->ops->map_queue(skb, q); > + spin_lock(&dev->egress_subqueue[queue_index].queue_lock); > + rc = q->enqueue(skb, q); > + /* > + * Unlock because the underlying qdisc > + * may queue and send a packet from a > + * different queue. > + */ > + spin_unlock(&dev->egress_subqueue[queue_index].queue_lock); > + qdisc_run(dev); I really dislike this asymmetric locking logic, while enqueue() is called with queue_lock held dequeue() is repsonsible to acquire the lock for qdisc_restart(). > + rc = rc == NET_XMIT_BYPASS > + ? NET_XMIT_SUCCESS : rc; > + goto out; > + } else { > + printk(KERN_CRIT "Device %s is " > + "multiqueue, but map_queue is " > + "undefined in the qdisc!!\n", > + dev->name); > + kfree_skb(skb); Move this check to tc_modify_qdisc(), it's useless to check this for every packet being delivered. > + } > + } else { > + /* We're not a multi-queue device. */ > + spin_lock(&dev->queue_lock); > + q = dev->qdisc; > + if (q->enqueue) { > + rc = q->enqueue(skb, q); > + qdisc_run(dev); > + spin_unlock(&dev->queue_lock); > + rc = rc == NET_XMIT_BYPASS > + ? NET_XMIT_SUCCESS : rc; > + goto out; > + } > + spin_unlock(&dev->queue_lock); Please don't duplicate already existing code. > @@ -130,6 +140,72 @@ static inline int qdisc_restart(struct net_device *dev) > } > > { > +#ifdef CONFIG_NET_MULTI_QUEUE_DEVICE > + if (netif_is_multiqueue(dev)) { > + if (likely(q->ops->map_queue)) { > + queue_index = q->ops->map_queue(skb, q); > + } else { > + printk(KERN_CRIT "Device %s is " > + "multiqueue, but map_queue is " > + "undefined in the qdisc!!\n", > + dev->name); > + goto requeue; > + } Yet another condition completely useless for every transmission. 
> + spin_unlock(&dev->egress_subqueue[queue_index].queue_lock); > + /* Check top level device, and any sub-device */ > + if ((!netif_queue_stopped(dev)) && > + (!netif_subqueue_stopped(dev, queue_index))) { > + int ret; > + ret = dev->hard_start_subqueue_xmit(skb, dev, queue_index); > + if (ret == NETDEV_TX_OK) { > + if (!nolock) { > + netif_tx_unlock(dev); > + } > + return -1; > + } > + if (ret == NETDEV_TX_LOCKED && nolock) { > + spin_lock(&dev->egress_subqueue[queue_index].queue_lock); > + goto collision; > + } > + } > + /* NETDEV_TX_BUSY - we need to requeue */ > + /* Release the driver */ > + if (!nolock) { > + netif_tx_unlock(dev); > + } > + spin_lock(&dev->egress_subqueue[queue_index].queue_lock); > + q = dev->qdisc; This is identical to the existing logic except for the different lock, the additional to check subqueue state and a different hard_start_xmit call. Share the logic. > + } > + else > + { > + /* We're a single-queue device */ > + /* And release queue */ > + spin_unlock(&dev->queue_lock); > + if (!netif_queue_stopped(dev)) { > + int ret; > + > + ret = dev->hard_start_xmit(skb, dev); > + if (ret == NETDEV_TX_OK) { > + if (!nolock) { > + netif_tx_unlock(dev); > + } > + spin_lock(&dev->queue_lock); > + return -1; > + } > + if (ret == NETDEV_TX_LOCKED && nolock) { > + spin_lock(&dev->queue_lock); > + goto collision; > + } > + } > + /* NETDEV_TX_BUSY - we need to requeue */ > + /* Release the driver */ > + if (!nolock) { > + netif_tx_unlock(dev); > + } > + spin_lock(&dev->queue_lock); > + q = dev->qdisc; Again, you just copied the existing code into a separate branch, fix the branching logic! > @@ -356,10 +454,18 @@ static struct sk_buff *pfifo_fast_dequeue(struct Qdisc* qdisc) > struct sk_buff_head *list = qdisc_priv(qdisc); > > for (prio = 0; prio < PFIFO_FAST_BANDS; prio++) { > +#ifdef CONFIG_NET_MULTI_QUEUE_DEVICE > + if (netif_is_multiqueue(qdisc->dev)) > + spin_lock(&qdisc->dev->egress_subqueue[0].queue_lock); > +#endif > if (!skb_queue_empty(list + prio)) { > qdisc->q.qlen--; > return __qdisc_dequeue_head(qdisc, list + prio); > } > +#ifdef CONFIG_NET_MULTI_QUEUE_DEVICE > + if (netif_is_multiqueue(qdisc->dev)) > + spin_unlock(&qdisc->dev->egress_subqueue[0].queue_lock); > +#endif This lock/unlock for every band definitely screws performance. > } > > return NULL; > @@ -141,18 +174,53 @@ prio_dequeue(struct Qdisc* sch) > struct sk_buff *skb; > struct prio_sched_data *q = qdisc_priv(sch); > int prio; > +#ifdef CONFIG_NET_MULTI_QUEUE_DEVICE > + int queue; > +#endif > struct Qdisc *qdisc; > > + /* > + * If we're multiqueue, the basic approach is try the lock on each > + * queue. If it's locked, either we're already dequeuing, or enqueue > + * is doing something. Go to the next band if we're locked. Once we > + * have a packet, unlock the queue. NOTE: the underlying qdisc CANNOT > + * be a PRIO qdisc, otherwise we will deadlock. FIXME > + */ > for (prio = 0; prio < q->bands; prio++) { > +#ifdef CONFIG_NET_MULTI_QUEUE_DEVICE > + if (netif_is_multiqueue(sch->dev)) { > + queue = q->band2queue[prio]; > + if (spin_trylock(&sch->dev->egress_subqueue[queue].queue_lock)) { > + qdisc = q->queues[prio]; > + skb = qdisc->dequeue(qdisc); > + if (skb) { > + sch->q.qlen--; > + skb->priority = prio; > + spin_unlock(&sch->dev->egress_subqueue[queue].queue_lock); > + return skb; > + } > + spin_unlock(&sch->dev->egress_subqueue[queue].queue_lock); > + } Your modified qdisc_restart() expects the queue_lock to be locked, how can this work? 
> + } else { > + qdisc = q->queues[prio]; > + skb = qdisc->dequeue(qdisc); > + if (skb) { > + sch->q.qlen--; > + skb->priority = prio; > + return skb; > + } > + } > +#else > qdisc = q->queues[prio]; > skb = qdisc->dequeue(qdisc); > if (skb) { > sch->q.qlen--; > + skb->priority = prio; > return skb; > } > +#endif > } > return NULL; > - > } > > static unsigned int prio_drop(struct Qdisc* sch) ^ permalink raw reply [flat|nested] 17+ messages in thread
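One way to address the "share the logic" complaints above is to confine the lock choice to a single helper, so that dev_queue_xmit() and qdisc_restart() keep one code path and the #ifdef lives in one place. The helper below is a sketch of that review suggestion built on the structures this patch introduces; it is not code from the series.

/*
 * Return the lock protecting the queue a packet maps to.  Callers
 * take whatever lock comes back and then run a single shared path,
 * instead of duplicating the whole enqueue/transmit logic in a
 * multiqueue branch and a single-queue branch.
 */
static inline spinlock_t *dev_tx_queue_lock(struct net_device *dev,
					    int queue_index)
{
#ifdef CONFIG_NET_MULTI_QUEUE_DEVICE
	if (netif_is_multiqueue(dev))
		return &dev->egress_subqueue[queue_index].queue_lock;
#endif
	return &dev->queue_lock;
}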
* RE: [PATCH 1/2] NET: Multiple queue network device support 2007-03-09 13:40 ` Thomas Graf @ 2007-03-09 19:25 ` Waskiewicz Jr, Peter P 2007-03-09 23:01 ` Thomas Graf 2007-03-12 8:58 ` Jarek Poplawski 1 sibling, 1 reply; 17+ messages in thread From: Waskiewicz Jr, Peter P @ 2007-03-09 19:25 UTC (permalink / raw) To: Thomas Graf, Kok, Auke-jan H Cc: David Miller, Garzik, Jeff, netdev, linux-kernel, Brandeburg, Jesse, Kok, Auke, Ronciak, John > -----Original Message----- > From: Thomas Graf [mailto:tgraf@suug.ch] > Sent: Friday, March 09, 2007 5:41 AM > To: Kok, Auke-jan H > Cc: David Miller; Garzik, Jeff; netdev@vger.kernel.org; > linux-kernel@vger.kernel.org; Waskiewicz Jr, Peter P; > Brandeburg, Jesse; Kok, Auke; Ronciak, John > Subject: Re: [PATCH 1/2] NET: Multiple queue network device support > > * Kok, Auke <auke-jan.h.kok@intel.com> 2007-02-08 16:09 > > diff --git a/net/core/dev.c b/net/core/dev.c index 455d589..42b635c > > 100644 > > --- a/net/core/dev.c > > +++ b/net/core/dev.c > > @@ -1477,6 +1477,49 @@ gso: > > skb->tc_verd = SET_TC_AT(skb->tc_verd,AT_EGRESS); > > #endif > > if (q->enqueue) { > > +#ifdef CONFIG_NET_MULTI_QUEUE_DEVICE > > + int queue_index; > > + /* If we're a multi-queue device, get a queue > index to lock */ > > + if (netif_is_multiqueue(dev)) > > + { > > + /* Get the queue index and lock it. */ > > + if (likely(q->ops->map_queue)) { > > + queue_index = q->ops->map_queue(skb, q); > > + > spin_lock(&dev->egress_subqueue[queue_index].queue_lock); > > + rc = q->enqueue(skb, q); > > + /* > > + * Unlock because the underlying qdisc > > + * may queue and send a packet from a > > + * different queue. > > + */ > > + > spin_unlock(&dev->egress_subqueue[queue_index].queue_lock); > > + qdisc_run(dev); > > I really dislike this asymmetric locking logic, while > enqueue() is called with queue_lock held dequeue() is > repsonsible to acquire the lock for qdisc_restart(). I don't see how I can make this symmetrical when dealing with more than one queue. If an skb is classified to band 2 which maps to queue 1, we lock queue 1, and enqueue the packet. That queue will not be the first queue checked in dequeue for an skb, so I cannot rely on that lock protecting queue 0 in dequeue. Therefore I need to map the skb, lock, enqueue, unlock, then go into dequeue and lock the correct queue there. If I were to make this symmetrical, then the process of enqueuing and dequeuing would be serialized completely, and we'd be back to where we started without multiqueue. Also, in the multiqueue code path, dev->queue_lock isn't held anywhere, it's the subqueue lock. > > > + rc = rc == NET_XMIT_BYPASS > > + ? NET_XMIT_SUCCESS : rc; > > + goto out; > > + } else { > > + printk(KERN_CRIT "Device %s is " > > + "multiqueue, but map_queue is " > > + "undefined in the qdisc!!\n", > > + dev->name); > > + kfree_skb(skb); > > Move this check to tc_modify_qdisc(), it's useless to check > this for every packet being delivered. The patches I sent yesterday modified the non-multiqueue qdiscs to not load if the device is multiqueue, so this check can be removed altogether. > > > + } > > + } else { > > + /* We're not a multi-queue device. */ > > + spin_lock(&dev->queue_lock); > > + q = dev->qdisc; > > + if (q->enqueue) { > > + rc = q->enqueue(skb, q); > > + qdisc_run(dev); > > + spin_unlock(&dev->queue_lock); > > + rc = rc == NET_XMIT_BYPASS > > + ? NET_XMIT_SUCCESS : rc; > > + goto out; > > + } > > + spin_unlock(&dev->queue_lock); > > Please don't duplicate already existing code. 
I don't want to mix multiqueue and non-multiqueue code in the transmit path. This was an attempt to allow MQ and non-MQ devices to coexist in a machine having separate code paths. Are you suggesting to combine them? That would get very messy trying to determine what type of lock to grab (subqueue lock or dev->queue_lock), not to mention grabbing the dev->queue_lock would block multiqueue devices in that same codepath. > > > @@ -130,6 +140,72 @@ static inline int qdisc_restart(struct > net_device *dev) > > } > > > > { > > +#ifdef CONFIG_NET_MULTI_QUEUE_DEVICE > > + if (netif_is_multiqueue(dev)) { > > + if (likely(q->ops->map_queue)) { > > + queue_index = > q->ops->map_queue(skb, q); > > + } else { > > + printk(KERN_CRIT "Device %s is " > > + "multiqueue, > but map_queue is " > > + "undefined in > the qdisc!!\n", > > + dev->name); > > + goto requeue; > > + } > > Yet another condition completely useless for every transmission. Yep, this will be removed. Thanks. > > > + > spin_unlock(&dev->egress_subqueue[queue_index].queue_lock); > > + /* Check top level device, and > any sub-device */ > > + if ((!netif_queue_stopped(dev)) && > > + (!netif_subqueue_stopped(dev, > queue_index))) { > > + int ret; > > + ret = > dev->hard_start_subqueue_xmit(skb, dev, queue_index); > > + if (ret == NETDEV_TX_OK) { > > + if (!nolock) { > > + > netif_tx_unlock(dev); > > + } > > + return -1; > > + } > > + if (ret == > NETDEV_TX_LOCKED && nolock) { > > + > spin_lock(&dev->egress_subqueue[queue_index].queue_lock); > > + goto collision; > > + } > > + } > > + /* NETDEV_TX_BUSY - we need to > requeue */ > > + /* Release the driver */ > > + if (!nolock) { > > + netif_tx_unlock(dev); > > + } > > + > spin_lock(&dev->egress_subqueue[queue_index].queue_lock); > > + q = dev->qdisc; > > This is identical to the existing logic except for the > different lock, the additional to check subqueue state and a > different hard_start_xmit call. Share the logic. Not necessarily. How we got here and how the locks are working are logically different due to the different queues. I'm working on storing the queue index in the skb now from Dave's suggestion, so there will be one hard_start_xmit, but the logic is different since the queue lock we're grabbing will be different than non-multiqueue. Perhaps I'm missing something here. > > > + } > > + else > > + { > > + /* We're a single-queue device */ > > + /* And release queue */ > > + spin_unlock(&dev->queue_lock); > > + if (!netif_queue_stopped(dev)) { > > + int ret; > > + > > + ret = > dev->hard_start_xmit(skb, dev); > > + if (ret == NETDEV_TX_OK) { > > + if (!nolock) { > > + > netif_tx_unlock(dev); > > + } > > + > spin_lock(&dev->queue_lock); > > + return -1; > > + } > > + if (ret == > NETDEV_TX_LOCKED && nolock) { > > + > spin_lock(&dev->queue_lock); > > + goto collision; > > + } > > + } > > + /* NETDEV_TX_BUSY - we need to > requeue */ > > + /* Release the driver */ > > + if (!nolock) { > > + netif_tx_unlock(dev); > > + } > > + spin_lock(&dev->queue_lock); > > + q = dev->qdisc; > > Again, you just copied the existing code into a separate > branch, fix the branching logic! This is another attempt to keep the two codepaths separate. The only way I see of combining them is to check netif_is_multiqueue() everytime I need to grab a lock, which I think would be excessive. 
> > > @@ -356,10 +454,18 @@ static struct sk_buff > *pfifo_fast_dequeue(struct Qdisc* qdisc) > > struct sk_buff_head *list = qdisc_priv(qdisc); > > > > for (prio = 0; prio < PFIFO_FAST_BANDS; prio++) { > > +#ifdef CONFIG_NET_MULTI_QUEUE_DEVICE > > + if (netif_is_multiqueue(qdisc->dev)) > > + > spin_lock(&qdisc->dev->egress_subqueue[0].queue_lock); > > +#endif > > if (!skb_queue_empty(list + prio)) { > > qdisc->q.qlen--; > > return __qdisc_dequeue_head(qdisc, list + prio); > > } > > +#ifdef CONFIG_NET_MULTI_QUEUE_DEVICE > > + if (netif_is_multiqueue(qdisc->dev)) > > + > spin_unlock(&qdisc->dev->egress_subqueue[0].queue_lock); > > +#endif > > This lock/unlock for every band definitely screws performance. > I will move the lock out of the for() loop since every band in pfifo_fast maps to queue 0. Thanks. > > } > > > > return NULL; > > @@ -141,18 +174,53 @@ prio_dequeue(struct Qdisc* sch) > > struct sk_buff *skb; > > struct prio_sched_data *q = qdisc_priv(sch); > > int prio; > > +#ifdef CONFIG_NET_MULTI_QUEUE_DEVICE > > + int queue; > > +#endif > > struct Qdisc *qdisc; > > > > + /* > > + * If we're multiqueue, the basic approach is try the > lock on each > > + * queue. If it's locked, either we're already > dequeuing, or enqueue > > + * is doing something. Go to the next band if we're > locked. Once we > > + * have a packet, unlock the queue. NOTE: the > underlying qdisc CANNOT > > + * be a PRIO qdisc, otherwise we will deadlock. FIXME > > + */ > > for (prio = 0; prio < q->bands; prio++) { > > +#ifdef CONFIG_NET_MULTI_QUEUE_DEVICE > > + if (netif_is_multiqueue(sch->dev)) { > > + queue = q->band2queue[prio]; > > + if > (spin_trylock(&sch->dev->egress_subqueue[queue].queue_lock)) { > > + qdisc = q->queues[prio]; > > + skb = qdisc->dequeue(qdisc); > > + if (skb) { > > + sch->q.qlen--; > > + skb->priority = prio; > > + > spin_unlock(&sch->dev->egress_subqueue[queue].queue_lock); > > + return skb; > > + } > > + > spin_unlock(&sch->dev->egress_subqueue[queue].queue_lock); > > + } > > Your modified qdisc_restart() expects the queue_lock to be > locked, how can this work? No, it doesn't expect the lock to be held. Because of the multiple queues, enqueueing and dequeueing are now asynchronous, since I can enqueue to queue 0 while dequeuing from queue 1. dev->queue_lock isn't held, so this can happen. Therefore the spin_trylock() is used in this dequeue because I don't want to wait for someone to finish with that queue in question (e.g. enqueue working), since that will block all other bands/queues after the band in question. So if the lock isn't available to grab, we move to the next band. If I were to wait for the lock, I'd serialize the enqueue/dequeue completely, and block other traffic flows in other queues waiting for the lock. > > > + } else { > > + qdisc = q->queues[prio]; > > + skb = qdisc->dequeue(qdisc); > > + if (skb) { > > + sch->q.qlen--; > > + skb->priority = prio; > > + return skb; > > + } > > + } > > +#else > > qdisc = q->queues[prio]; > > skb = qdisc->dequeue(qdisc); > > if (skb) { > > sch->q.qlen--; > > + skb->priority = prio; > > return skb; > > } > > +#endif > > } > > return NULL; > > - > > } > > > > static unsigned int prio_drop(struct Qdisc* sch) > ^ permalink raw reply [flat|nested] 17+ messages in thread
* Re: [PATCH 1/2] NET: Multiple queue network device support 2007-03-09 19:25 ` Waskiewicz Jr, Peter P @ 2007-03-09 23:01 ` Thomas Graf 2007-03-09 23:27 ` Waskiewicz Jr, Peter P 0 siblings, 1 reply; 17+ messages in thread From: Thomas Graf @ 2007-03-09 23:01 UTC (permalink / raw) To: Waskiewicz Jr, Peter P Cc: Kok, Auke-jan H, David Miller, Garzik, Jeff, netdev, linux-kernel, Brandeburg, Jesse, Kok, Auke, Ronciak, John * Waskiewicz Jr, Peter P <peter.p.waskiewicz.jr@intel.com> 2007-03-09 11:25 > > > + } > > > + } else { > > > + /* We're not a multi-queue device. */ > > > + spin_lock(&dev->queue_lock); > > > + q = dev->qdisc; > > > + if (q->enqueue) { > > > + rc = q->enqueue(skb, q); > > > + qdisc_run(dev); > > > + spin_unlock(&dev->queue_lock); > > > + rc = rc == NET_XMIT_BYPASS > > > + ? NET_XMIT_SUCCESS : rc; > > > + goto out; > > > + } > > > + spin_unlock(&dev->queue_lock); > > > > Please don't duplicate already existing code. > > I don't want to mix multiqueue and non-multiqueue code in the transmit > path. This was an attempt to allow MQ and non-MQ devices to coexist in > a machine having separate code paths. Are you suggesting to combine > them? That would get very messy trying to determine what type of lock > to grab (subqueue lock or dev->queue_lock), not to mention grabbing the > dev->queue_lock would block multiqueue devices in that same codepath. The piece of code I quoted above is the branch executed if multi queue is not enabled. The code you added is 100% identical to the already existing enqueue logic. Just execute the existing branch if multi queue is not enabled. > This is another attempt to keep the two codepaths separate. The only > way I see of combining them is to check netif_is_multiqueue() everytime > I need to grab a lock, which I think would be excessive. The code added is 100% identical to the existing code, just be a little smarter on how you do the ifdefs. > > > } > > > > > > return NULL; > > > @@ -141,18 +174,53 @@ prio_dequeue(struct Qdisc* sch) > > > struct sk_buff *skb; > > > struct prio_sched_data *q = qdisc_priv(sch); > > > int prio; > > > +#ifdef CONFIG_NET_MULTI_QUEUE_DEVICE > > > + int queue; > > > +#endif > > > struct Qdisc *qdisc; > > > > > > + /* > > > + * If we're multiqueue, the basic approach is try the > > lock on each > > > + * queue. If it's locked, either we're already > > dequeuing, or enqueue > > > + * is doing something. Go to the next band if we're > > locked. Once we > > > + * have a packet, unlock the queue. NOTE: the > > underlying qdisc CANNOT > > > + * be a PRIO qdisc, otherwise we will deadlock. FIXME > > > + */ > > > for (prio = 0; prio < q->bands; prio++) { > > > +#ifdef CONFIG_NET_MULTI_QUEUE_DEVICE > > > + if (netif_is_multiqueue(sch->dev)) { > > > + queue = q->band2queue[prio]; > > > + if > > (spin_trylock(&sch->dev->egress_subqueue[queue].queue_lock)) { > > > + qdisc = q->queues[prio]; > > > + skb = qdisc->dequeue(qdisc); > > > + if (skb) { > > > + sch->q.qlen--; > > > + skb->priority = prio; > > > + > > spin_unlock(&sch->dev->egress_subqueue[queue].queue_lock); > > > + return skb; > > > + } > > > + > > spin_unlock(&sch->dev->egress_subqueue[queue].queue_lock); > > > + } > > > > Your modified qdisc_restart() expects the queue_lock to be > > locked, how can this work? > > No, it doesn't expect the lock to be held. Because of the multiple > queues, enqueueing and dequeueing are now asynchronous, since I can > enqueue to queue 0 while dequeuing from queue 1. dev->queue_lock isn't > held, so this can happen. 
Therefore the spin_trylock() is used > in this > dequeue because I don't want to wait for someone to finish with that > queue in question (e.g. enqueue working), since that will block all > other bands/queues after the band in question. So if the lock isn't > available to grab, we move to the next band. If I were to wait for the > lock, I'd serialize the enqueue/dequeue completely, and block other > traffic flows in other queues waiting for the lock. The first thing you do in qdisc_restart() after dequeue()ing is unlock the subqueue lock. You explicitly unlock it before calling qdisc_run() so I assume dequeue() is expected to keep it locked. Something doesn't add up. BTW, which lock serializes your write access to qdisc->q.qlen? It used to be dev->queue_lock but that is apparently not true for multiqueue. ^ permalink raw reply [flat|nested] 17+ messages in thread
* RE: [PATCH 1/2] NET: Multiple queue network device support 2007-03-09 23:01 ` Thomas Graf @ 2007-03-09 23:27 ` Waskiewicz Jr, Peter P 2007-03-10 2:34 ` Thomas Graf 0 siblings, 1 reply; 17+ messages in thread From: Waskiewicz Jr, Peter P @ 2007-03-09 23:27 UTC (permalink / raw) To: Thomas Graf Cc: Kok, Auke-jan H, David Miller, Garzik, Jeff, netdev, linux-kernel, Brandeburg, Jesse, Kok, Auke, Ronciak, John > * Waskiewicz Jr, Peter P <peter.p.waskiewicz.jr@intel.com> > 2007-03-09 11:25 > > > > + } > > > > + } else { > > > > + /* We're not a multi-queue device. */ > > > > + spin_lock(&dev->queue_lock); > > > > + q = dev->qdisc; > > > > + if (q->enqueue) { > > > > + rc = q->enqueue(skb, q); > > > > + qdisc_run(dev); > > > > + spin_unlock(&dev->queue_lock); > > > > + rc = rc == NET_XMIT_BYPASS > > > > + ? > NET_XMIT_SUCCESS : rc; > > > > + goto out; > > > > + } > > > > + spin_unlock(&dev->queue_lock); > > > > > > Please don't duplicate already existing code. > > > > I don't want to mix multiqueue and non-multiqueue code in > the transmit > > path. This was an attempt to allow MQ and non-MQ devices > to coexist > > in a machine having separate code paths. Are you suggesting to > > combine them? That would get very messy trying to > determine what type > > of lock to grab (subqueue lock or dev->queue_lock), not to mention > > grabbing the > > dev->queue_lock would block multiqueue devices in that same > codepath. > > The piece of code I quoted above is the branch executed if > multi queue is not enabled. The code you added is 100% > identical to the already existing enqueue logic. Just execute > the existing branch if multi queue is not enabled. > I totally agree this is identical code if multiqueue isn't enabled. However, when multiqueue is enabled, I don't want to make all network drivers implement the subqueue API just to be able to lock the queues. So the first thing I check is netif_is_multiqueue(dev), and if it *isn't* multiqueue, it will run the existing code. This way both non-multiqueue devices don't have to have any knowledge of the MQ API. > > This is another attempt to keep the two codepaths separate. > The only > > way I see of combining them is to check netif_is_multiqueue() > > everytime I need to grab a lock, which I think would be excessive. > > The code added is 100% identical to the existing code, just > be a little smarter on how you do the ifdefs. An example could be an on-board adapter isn't multiqueue, and an expansion card you have in your system is. If I handle multiqueue being on with just ifdef's, then I could use the expansion card, but not the on-board adapter as-is. The on-board driver would need to be updated to have 1 queue in the multiqueue context, which is what I tried to avoid. > > > Your modified qdisc_restart() expects the queue_lock to > be locked, > > > how can this work? > > > > No, it doesn't expect the lock to be held. Because of the multiple > > queues, enqueueing and dequeueing are now asynchronous, since I can > > enqueue to queue 0 while dequeuing from queue 1. dev->queue_lock > > isn't held, so this can happen. Therefore the > spin_trylock() is used > > in this dequeue because I don't want to wait for someone to finish > > with that queue in question (e.g. enqueue working), since that will > > block all other bands/queues after the band in question. So if the > > lock isn't available to grab, we move to the next band. 
If > I were to > > wait for the lock, I'd serialize the enqueue/dequeue > completely, and > > block other traffic flows in other queues waiting for the lock. > > The first thing you do in qdisc_restart() after dequeue()ing > is unlock the sub queue lock. You explicitely unlock it > before calling qdisc_run() so I assume dequeue() is expected > to keep it locked. Something doesn't add up. That's the entire point of this extra locking. enqueue() is going to put an skb into a band somewhere that maps to some queue, and there is no way to guarantee the skb I retrieve from dequeue() is headed for the same queue. Therefore, I need to unlock the queue after I finish enqueuing, since having that lock makes little sense to dequeue(). dequeue() will then grab *a* lock on a queue; it may be the same one we had during enqueue(), but it may not be. And the placement of the unlock of that queue is exactly where it happens in non-multiqueue, which is right before the hard_start_xmit(). I concede that the locking model is complex and seems really nasty, but to truly separate queue functionality from one another, I see no other feasible option than to run locking like this. I am very open to suggestions. > > BTW, which lock serializes your write access to > qdisc->q.qlen? It used to be dev->queue_lock but that is > apparently not true for multi queue. > This is a very good catch, because it isn't being protected on the entire qdisc right now for PRIO. However, Chris Leech pointed out the LINK_STATE_QDISC_RUNNING bit is serializing things at the qdisc_run() level, so that's probably protecting this counter right now. I'm looking at pushing that down into the qdisc, but at the same time I can think of something to serialize these stats containers for the qdisc. ^ permalink raw reply [flat|nested] 17+ messages in thread
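For reference, the serialization Peter refers to is the qdisc_run() wrapper, which in this kernel generation looks roughly as follows. The __LINK_STATE_QDISC_RUNNING bit admits at most one CPU into the dequeue loop, which serializes the dequeue-side q.qlen decrements; the enqueue-side increments still depend on whatever lock the enqueue path holds, which is the gap being discussed.

static inline void qdisc_run(struct net_device *dev)
{
	/* test_and_set_bit() lets exactly one caller through; that
	 * CPU drains the qdisc in __qdisc_run() while later callers
	 * return immediately. */
	if (!netif_queue_stopped(dev) &&
	    !test_and_set_bit(__LINK_STATE_QDISC_RUNNING, &dev->state))
		__qdisc_run(dev);
}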
* Re: [PATCH 1/2] NET: Multiple queue network device support 2007-03-09 23:27 ` Waskiewicz Jr, Peter P @ 2007-03-10 2:34 ` Thomas Graf 2007-03-10 20:37 ` Waskiewicz Jr, Peter P 0 siblings, 1 reply; 17+ messages in thread From: Thomas Graf @ 2007-03-10 2:34 UTC (permalink / raw) To: Waskiewicz Jr, Peter P Cc: Kok, Auke-jan H, David Miller, Garzik, Jeff, netdev, linux-kernel, Brandeburg, Jesse, Kok, Auke, Ronciak, John * Waskiewicz Jr, Peter P <peter.p.waskiewicz.jr@intel.com> 2007-03-09 15:27 > That's the entire point of this extra locking. enqueue() is going to > put an skb into a band somewhere that maps to some queue, and there is > no way to guarantee the skb I retrieve from dequeue() is headed for the > same queue. Therefore, I need to unlock the queue after I finish > enqueuing, since having that lock makes little sense to dequeue(). > dequeue() will then grab *a* lock on a queue; it may be the same one we > had during enqueue(), but it may not be. And the placement of the > unlock of that queue is exactly where it happens in non-multiqueue, > which is right before the hard_start_xmit(). The lock is already unlocked after dequeue, from your prio_dequeue(): if (netif_is_multiqueue(sch->dev)) { queue = q->band2queue[prio]; if (spin_trylock(&sch->dev->egress_subqueue[queue].queue_lock)) { qdisc = q->queues[prio]; skb = qdisc->dequeue(qdisc); if (skb) { sch->q.qlen--; skb->priority = prio; spin_unlock(&sch->dev->egress_subqueue[queue].queue_lock); return skb; } spin_unlock(&sch->dev->egress_subqueue[queue].queue_lock); } ^ permalink raw reply [flat|nested] 17+ messages in thread
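Restated as a sketch, assuming the modified qdisc_restart() looked roughly like the thread describes (dequeue first, then an explicit unlock before the transmit): since ->dequeue() drops the subqueue lock itself before returning, the caller's unlock would be unbalanced. This is a reconstruction for illustration, not the patch code:

	skb = q->dequeue(q);    /* prio_dequeue() has already taken AND
	                         * released egress_subqueue[queue].queue_lock */
	if (skb) {
		queue = q->ops->map_queue(skb, q);
		/* a second unlock here acts on a lock this path no
		 * longer holds -- the imbalance being pointed out */
		spin_unlock(&dev->egress_subqueue[queue].queue_lock);
		dev->hard_start_subqueue_xmit(skb, dev, queue);
	}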
* RE: [PATCH 1/2] NET: Multiple queue network device support 2007-03-10 2:34 ` Thomas Graf @ 2007-03-10 20:37 ` Waskiewicz Jr, Peter P 0 siblings, 0 replies; 17+ messages in thread From: Waskiewicz Jr, Peter P @ 2007-03-10 20:37 UTC (permalink / raw) To: Thomas Graf Cc: Kok, Auke-jan H, David Miller, Garzik, Jeff, netdev, linux-kernel, Brandeburg, Jesse, Kok, Auke, Ronciak, John > -----Original Message----- > From: Thomas Graf [mailto:tgraf@suug.ch] > Sent: Friday, March 09, 2007 6:35 PM > To: Waskiewicz Jr, Peter P > Cc: Kok, Auke-jan H; David Miller; Garzik, Jeff; > netdev@vger.kernel.org; linux-kernel@vger.kernel.org; > Brandeburg, Jesse; Kok, Auke; Ronciak, John > Subject: Re: [PATCH 1/2] NET: Multiple queue network device support > > * Waskiewicz Jr, Peter P <peter.p.waskiewicz.jr@intel.com> > 2007-03-09 15:27 > > That's the entire point of this extra locking. enqueue() > is going to > > put an skb into a band somewhere that maps to some queue, > and there is > > no way to guarantee the skb I retrieve from dequeue() is headed for > > the same queue. Therefore, I need to unlock the queue > after I finish > > enqueuing, since having that lock makes little sense to dequeue(). > > dequeue() will then grab *a* lock on a queue; it may be the > same one > > we had during enqueue(), but it may not be. And the > placement of the > > unlock of that queue is exactly where it happens in non-multiqueue, > > which is right before the hard_start_xmit(). > > The lock is already unlocked after dequeue, from your prio_dequeue(): > > if (netif_is_multiqueue(sch->dev)) { > queue = q->band2queue[prio]; > if > (spin_trylock(&sch->dev->egress_subqueue[queue].queue_lock)) { > qdisc = q->queues[prio]; > skb = qdisc->dequeue(qdisc); > if (skb) { > sch->q.qlen--; > skb->priority = prio; > > spin_unlock(&sch->dev->egress_subqueue[queue].queue_lock); > return skb; > } > > spin_unlock(&sch->dev->egress_subqueue[queue].queue_lock); > } Ok, now I see what's wrong. Taking Dave M.'s recommendation to store the queue mapping in the skb will let me unlock the queue when dequeue() returns. I'll fix this locking issue; thanks for the feedback and persistent drilling into my thick head. -PJ Waskiewicz peter.p.waskiewicz.jr@intel.com ^ permalink raw reply [flat|nested] 17+ messages in thread
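For readers following the thread, the fix agreed on here can be sketched as follows. This is illustrative only -- the field and helper names (queue_mapping, band_to_queue(), band_enqueue(), prio_classify()) are placeholders for whatever the reworked patch ends up using; the point is that the classification result is recorded once, at enqueue time:

/* Sketch of the agreed direction; names are hypothetical. */
static int prio_enqueue_sketch(struct sk_buff *skb, struct Qdisc *sch)
{
	int band = prio_classify(skb, sch);             /* run tc filters once */

	skb->queue_mapping = band_to_queue(sch, band);  /* remember the result */
	return band_enqueue(skb, sch, band);
}

/* Later stages reuse the recorded value instead of re-running
 * ->map_queue(), so locking and ring selection stay consistent: */
static void xmit_stage_sketch(struct net_device *dev, struct sk_buff *skb)
{
	int queue = skb->queue_mapping;                 /* no re-classification */

	spin_lock(&dev->egress_subqueue[queue].queue_lock);
	dev->hard_start_subqueue_xmit(skb, dev, queue);
	spin_unlock(&dev->egress_subqueue[queue].queue_lock);
}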
* Re: [PATCH 1/2] NET: Multiple queue network device support 2007-03-09 13:40 ` Thomas Graf 2007-03-09 19:25 ` Waskiewicz Jr, Peter P @ 2007-03-12 8:58 ` Jarek Poplawski 2007-03-12 20:21 ` Waskiewicz Jr, Peter P 1 sibling, 1 reply; 17+ messages in thread From: Jarek Poplawski @ 2007-03-12 8:58 UTC (permalink / raw) To: Thomas Graf Cc: Kok, Auke, David Miller, Garzik, Jeff, netdev, linux-kernel, Peter Waskiewicz Jr, Brandeburg, Jesse, Kok, Auke, Ronciak, John On 09-03-2007 14:40, Thomas Graf wrote: > * Kok, Auke <auke-jan.h.kok@intel.com> 2007-02-08 16:09 >> diff --git a/net/core/dev.c b/net/core/dev.c >> index 455d589..42b635c >> --- a/net/core/dev.c >> +++ b/net/core/dev.c >> @@ -1477,6 +1477,49 @@ gso: >> skb->tc_verd = SET_TC_AT(skb->tc_verd,AT_EGRESS); >> #endif >> if (q->enqueue) { >> +#ifdef CONFIG_NET_MULTI_QUEUE_DEVICE >> + int queue_index; >> + /* If we're a multi-queue device, get a queue index to lock */ >> + if (netif_is_multiqueue(dev)) >> + { >> + /* Get the queue index and lock it. */ >> + if (likely(q->ops->map_queue)) { >> + queue_index = q->ops->map_queue(skb, q); >> + spin_lock(&dev->egress_subqueue[queue_index].queue_lock); >> + rc = q->enqueue(skb, q); I'm not sure Dave Miller thought about this place when he proposed saving the mapping, but I think this may not be enough. This place is racy: ->map_queue() is called 2 times, and with some filters (and policies/actions) the results could differ. And of course the subqueue lock doesn't prevent a filter config change in the meantime. After a second reading of this patch I have doubts it's the proper way to solve the problem: there are many subqueues, but we need a top queue (prio here) to manage them anyway. So, why not build this functionality directly into the queue? It makes no difference to a device whether skbs are coming from a subqueue or a class; it is only interested in the mapping result and in the ability to stop and start a subqueue and to query its status. All this could be done by adding the callbacks directly to any classful scheduler or, if that is not enough, by writing a specialized qdisc based on prio. The possibility of locking only a subqueue instead of the whole queue could be analyzed independently - the current proposal doesn't solve this anyway. Regards, Jarek P. ^ permalink raw reply [flat|nested] 17+ messages in thread
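The race can be made explicit with a short sketch, simplified from the quoted dev.c hunk; the second call site and the intervening filter change are hypothetical but match the scenario described:

	/* enqueue side, as in the quoted hunk: */
	queue_index = q->ops->map_queue(skb, q);   /* filters map skb -> queue 0 */
	spin_lock(&dev->egress_subqueue[queue_index].queue_lock);
	rc = q->enqueue(skb, q);
	spin_unlock(&dev->egress_subqueue[queue_index].queue_lock);

	/* ... tc filter/action configuration changes here ... */

	/* later, another stage re-runs the mapping for the same skb: */
	queue_index = q->ops->map_queue(skb, q);   /* filters now map skb -> queue 1 */
	/* The skb was enqueued under subqueue 0's lock but is handled as if
	 * it belonged to subqueue 1: locking and accounting no longer agree. */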
* RE: [PATCH 1/2] NET: Multiple queue network device support 2007-03-12 8:58 ` Jarek Poplawski @ 2007-03-12 20:21 ` Waskiewicz Jr, Peter P 0 siblings, 0 replies; 17+ messages in thread From: Waskiewicz Jr, Peter P @ 2007-03-12 20:21 UTC (permalink / raw) To: Jarek Poplawski, Thomas Graf Cc: Kok, Auke-jan H, David Miller, Garzik, Jeff, netdev, linux-kernel, Brandeburg, Jesse, Kok, Auke, Ronciak, John > -----Original Message----- > From: Jarek Poplawski [mailto:jarkao2@o2.pl] > Sent: Monday, March 12, 2007 1:58 AM > To: Thomas Graf > Cc: Kok, Auke-jan H; David Miller; Garzik, Jeff; > netdev@vger.kernel.org; linux-kernel@vger.kernel.org; > Waskiewicz Jr, Peter P; Brandeburg, Jesse; Kok, Auke; Ronciak, John > Subject: Re: [PATCH 1/2] NET: Multiple queue network device support > > On 09-03-2007 14:40, Thomas Graf wrote: > > * Kok, Auke <auke-jan.h.kok@intel.com> 2007-02-08 16:09 > >> diff --git a/net/core/dev.c b/net/core/dev.c index > 455d589..42b635c > >> 100644 > >> --- a/net/core/dev.c > >> +++ b/net/core/dev.c > >> @@ -1477,6 +1477,49 @@ gso: > >> skb->tc_verd = SET_TC_AT(skb->tc_verd,AT_EGRESS); > >> #endif > >> if (q->enqueue) { > >> +#ifdef CONFIG_NET_MULTI_QUEUE_DEVICE > >> + int queue_index; > >> + /* If we're a multi-queue device, get a queue > index to lock */ > >> + if (netif_is_multiqueue(dev)) > >> + { > >> + /* Get the queue index and lock it. */ > >> + if (likely(q->ops->map_queue)) { > >> + queue_index = q->ops->map_queue(skb, q); > >> + > spin_lock(&dev->egress_subqueue[queue_index].queue_lock); > >> + rc = q->enqueue(skb, q); > > I'm not sure Dave Miller thought about this place when he > proposed saving the mapping, but I think this may not be > enough. This place is racy: ->map_queue() is called 2 times, > and with some filters (and > policies/actions) the results could differ. And of course the > subqueue lock doesn't prevent a filter config change > in the meantime. > > After a second reading of this patch I have doubts it's the > proper way to solve the problem: there are many subqueues, but > we need a top queue (prio here) to manage them anyway. So, > why not build this functionality directly into the queue? > It makes no difference to a device whether skbs are coming from > a subqueue or a class; it is only interested in the mapping > result and in the ability to stop and start a subqueue and to > query its status. All this could be done by adding the > callbacks directly to any classful scheduler or, if that is not > enough, by writing a specialized qdisc based on prio. The > possibility of locking only a subqueue instead of the whole queue could > be analyzed independently - the current proposal doesn't solve > this anyway. > > Regards, > Jarek P. > Thanks again for the feedback. Given some discussions I had last week in the office and the general feedback here, I'm going to remove the new per-queue locking, leave the start/stop functions for each queue, and combine the entry points for hard_start_xmit(). I'll get this out ASAP for review once it's been tested here. If we see issues in the future with lock contention on the queues, we can revisit the per-queue locking. Cheers, -PJ Waskiewicz ^ permalink raw reply [flat|nested] 17+ messages in thread
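A rough sketch of where this plan leads on the driver side -- a single hard_start_xmit() entry point plus per-queue flow control. The netif_stop_subqueue()/netif_wake_subqueue() helpers are the subqueue-management routines from patch 1/2; the skb->queue_mapping field and the mydrv_*/ring_* helpers are placeholders for illustration:

static int mydrv_hard_start_xmit(struct sk_buff *skb, struct net_device *dev)
{
	/* single entry point: the qdisc already recorded the mapping */
	int queue = skb->queue_mapping;
	struct mydrv_ring *ring = &mydrv_priv(dev)->tx_ring[queue];

	if (ring_full(ring)) {
		/* flow-control only this queue; the others keep running */
		netif_stop_subqueue(dev, queue);
		return NETDEV_TX_BUSY;
	}

	post_tx_descriptors(ring, skb);
	return NETDEV_TX_OK;
}

static void mydrv_clean_tx(struct net_device *dev, int queue,
			   struct mydrv_ring *ring)
{
	/* reopen just the queue that drained */
	if (ring_has_room(ring))
		netif_wake_subqueue(dev, queue);
}

The design point is that a full ring only quiesces its own queue, so head-of-line blocking stays confined to one flow instead of stalling the whole device.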
* [PATCH 2/2] e1000: Implement the new kernel API for multiqueue TX support. 2007-02-09 0:09 [PATCH 0/2 REVIEW] Multiple transmit/receive queue kernel Kok, Auke 2007-02-09 0:09 ` [PATCH 1/2] NET: Multiple queue network device support Kok, Auke @ 2007-02-09 0:09 ` Kok, Auke 2007-03-09 13:11 ` Thomas Graf 1 sibling, 1 reply; 17+ messages in thread From: Kok, Auke @ 2007-02-09 0:09 UTC (permalink / raw) To: David Miller, Garzik, Jeff, netdev, linux-kernel Cc: Kok, Auke, Peter Waskiewicz Jr, Brandeburg, Jesse, Kok, Auke, Ronciak, John From: Peter Waskiewicz Jr. <peter.p.waskiewicz.jr@intel.com> Several newer e1000 chipsets support multiple RX and TX queues. Most commonly, 82571s and ESB2LAN support 2 rx and 2 tx queues. Signed-off-by: Peter Waskiewicz Jr. <peter.p.waskiewicz.jr@intel.com> Signed-off-by: Auke Kok <auke-jan.h.kok@intel.com> --- drivers/net/Kconfig | 13 ++ drivers/net/e1000/e1000.h | 23 +++ drivers/net/e1000/e1000_ethtool.c | 43 ++++++ drivers/net/e1000/e1000_main.c | 269 +++++++++++++++++++++++++++++++------ 4 files changed, 304 insertions(+), 44 deletions(-) diff --git a/drivers/net/Kconfig b/drivers/net/Kconfig index ad92b6a..2d758ab 100644 --- a/drivers/net/Kconfig +++ b/drivers/net/Kconfig @@ -1988,6 +1988,19 @@ config E1000_DISABLE_PACKET_SPLIT If in doubt, say N. +config E1000_MQ + bool "Enable Tx/Rx Multiqueue Support (EXPERIMENTAL)" + depends on E1000 && NET_MULTI_QUEUE_DEVICE && EXPERIMENTAL + help + Say Y here if you want to enable multiqueue support for supported + e1000 devices. This will enable both transmit and receive queues + on devices that support them. + + In order to fully utilize the Tx queue support, use the SCH_PRIO + queueing discipline. + + If in doubt, say N. + source "drivers/net/ixp2000/Kconfig" config MYRI_SBUS diff --git a/drivers/net/e1000/e1000.h b/drivers/net/e1000/e1000.h index 689f158..cfcdc9d 100644 --- a/drivers/net/e1000/e1000.h +++ b/drivers/net/e1000/e1000.h @@ -108,6 +108,10 @@ struct e1000_adapter; #define E1000_MIN_RXD 80 #define E1000_MAX_82544_RXD 4096 +#ifdef CONFIG_E1000_MQ +#define E1000_MAX_TX_QUEUES 4 +#endif + /* this is the size past which hardware will drop packets when setting LPE=0 */ #define MAXIMUM_ETHERNET_VLAN_SIZE 1522 @@ -168,6 +172,12 @@ struct e1000_buffer { uint16_t next_to_watch; }; +#ifdef CONFIG_E1000_MQ +struct e1000_queue_stats { + uint64_t packets; + uint64_t bytes; +}; +#endif struct e1000_ps_page { struct page *ps_page[PS_PAGE_BUFFERS]; }; struct e1000_ps_page_dma { uint64_t ps_page_dma[PS_PAGE_BUFFERS]; }; @@ -188,9 +198,16 @@ struct e1000_tx_ring { /* array of buffer information structs */ struct e1000_buffer *buffer_info; +#ifdef CONFIG_E1000_MQ + /* for tx ring cleanup - needed for multiqueue */ + spinlock_t tx_queue_lock; +#endif spinlock_t tx_lock; uint16_t tdh; uint16_t tdt; +#ifdef CONFIG_E1000_MQ + struct e1000_queue_stats tx_stats; +#endif boolean_t last_tx_tso; }; @@ -218,6 +235,9 @@ struct e1000_rx_ring { uint16_t rdh; uint16_t rdt; +#ifdef CONFIG_E1000_MQ + struct e1000_queue_stats rx_stats; +#endif }; #define E1000_DESC_UNUSED(R) \ @@ -271,6 +291,9 @@ struct e1000_adapter { /* TX */ struct e1000_tx_ring *tx_ring; /* One per active queue */ +#ifdef CONFIG_E1000_MQ + struct e1000_tx_ring **cpu_tx_ring; /* per-cpu */ +#endif unsigned int restart_queue; unsigned long tx_queue_len; uint32_t txd_cmd; diff --git a/drivers/net/e1000/e1000_ethtool.c b/drivers/net/e1000/e1000_ethtool.c index 44ebc72..c8c1500 100644 --- a/drivers/net/e1000/e1000_ethtool.c +++ b/drivers/net/e1000/e1000_ethtool.c @@ -105,7
+105,14 @@ static const struct e1000_stats e1000_gstrings_stats[] = { { "dropped_smbus", E1000_STAT(stats.mgpdc) }, }; +#ifdef CONFIG_E1000_MQ +#define E1000_QUEUE_STATS_LEN \ + (((struct e1000_adapter *)netdev->priv)->num_tx_queues + \ + ((struct e1000_adapter *)netdev->priv)->num_rx_queues) \ + * (sizeof(struct e1000_queue_stats) / sizeof(uint64_t)) +#else #define E1000_QUEUE_STATS_LEN 0 +#endif #define E1000_GLOBAL_STATS_LEN \ sizeof(e1000_gstrings_stats) / sizeof(struct e1000_stats) #define E1000_STATS_LEN (E1000_GLOBAL_STATS_LEN + E1000_QUEUE_STATS_LEN) @@ -1909,6 +1916,11 @@ e1000_get_ethtool_stats(struct net_device *netdev, struct ethtool_stats *stats, uint64_t *data) { struct e1000_adapter *adapter = netdev_priv(netdev); +#ifdef CONFIG_E1000_MQ + uint64_t *queue_stat; + int stat_count = sizeof(struct e1000_queue_stats) / sizeof(uint64_t); + int j, k; +#endif int i; e1000_update_stats(adapter); @@ -1917,12 +1929,29 @@ e1000_get_ethtool_stats(struct net_device *netdev, data[i] = (e1000_gstrings_stats[i].sizeof_stat == sizeof(uint64_t)) ? *(uint64_t *)p : *(uint32_t *)p; } +#ifdef CONFIG_E1000_MQ + for (j = 0; j < adapter->num_tx_queues; j++) { + queue_stat = (uint64_t *)&adapter->tx_ring[j].tx_stats; + for (k = 0; k < stat_count; k++) + data[i + k] = queue_stat[k]; + i += k; + } + for (j = 0; j < adapter->num_rx_queues; j++) { + queue_stat = (uint64_t *)&adapter->rx_ring[j].rx_stats; + for (k = 0; k < stat_count; k++) + data[i + k] = queue_stat[k]; + i += k; + } +#endif /* BUG_ON(i != E1000_STATS_LEN); */ } static void e1000_get_strings(struct net_device *netdev, uint32_t stringset, uint8_t *data) { +#ifdef CONFIG_E1000_MQ + struct e1000_adapter *adapter = netdev_priv(netdev); +#endif uint8_t *p = data; int i; @@ -1937,6 +1966,20 @@ e1000_get_strings(struct net_device *netdev, uint32_t stringset, uint8_t *data) ETH_GSTRING_LEN); p += ETH_GSTRING_LEN; } +#ifdef CONFIG_E1000_MQ + for (i = 0; i < adapter->num_tx_queues; i++) { + sprintf(p, "tx_queue_%u_packets", i); + p += ETH_GSTRING_LEN; + sprintf(p, "tx_queue_%u_bytes", i); + p += ETH_GSTRING_LEN; + } + for (i = 0; i < adapter->num_rx_queues; i++) { + sprintf(p, "rx_queue_%u_packets", i); + p+= ETH_GSTRING_LEN; + sprintf(p, "rx_queue_%u_bytes", i); + p += ETH_GSTRING_LEN; + } +#endif /* BUG_ON(p - data != E1000_STATS_LEN * ETH_GSTRING_LEN); */ break; } diff --git a/drivers/net/e1000/e1000_main.c b/drivers/net/e1000/e1000_main.c index 619c892..a53f065 100644 --- a/drivers/net/e1000/e1000_main.c +++ b/drivers/net/e1000/e1000_main.c @@ -28,6 +28,10 @@ #include "e1000.h" #include <net/ip6_checksum.h> +#ifdef CONFIG_E1000_MQ +#include <linux/cpu.h> +#include <linux/smp.h> +#endif char e1000_driver_name[] = "e1000"; static char e1000_driver_string[] = "Intel(R) PRO/1000 Network Driver"; @@ -137,6 +141,9 @@ static void e1000_exit_module(void); static int e1000_probe(struct pci_dev *pdev, const struct pci_device_id *ent); static void __devexit e1000_remove(struct pci_dev *pdev); static int e1000_alloc_queues(struct e1000_adapter *adapter); +#ifdef CONFIG_E1000_MQ +static void e1000_setup_queue_mapping(struct e1000_adapter *adapter); +#endif static int e1000_sw_init(struct e1000_adapter *adapter); static int e1000_open(struct net_device *netdev); static int e1000_close(struct net_device *netdev); @@ -153,7 +160,13 @@ static void e1000_set_multi(struct net_device *netdev); static void e1000_update_phy_info(unsigned long data); static void e1000_watchdog(unsigned long data); static void e1000_82547_tx_fifo_stall(unsigned long data); +static int 
e1000_xmit_frame_ring(struct sk_buff *skb, struct net_device *netdev, + struct e1000_tx_ring *tx_ring); static int e1000_xmit_frame(struct sk_buff *skb, struct net_device *netdev); +#ifdef CONFIG_E1000_MQ +static int e1000_subqueue_xmit_frame(struct sk_buff *skb, + struct net_device *netdev, int queue); +#endif static struct net_device_stats * e1000_get_stats(struct net_device *netdev); static int e1000_change_mtu(struct net_device *netdev, int new_mtu); static int e1000_set_mac(struct net_device *netdev, void *p); @@ -547,6 +560,10 @@ e1000_up(struct e1000_adapter *adapter) E1000_DESC_UNUSED(ring)); } +#ifdef CONFIG_E1000_MQ + e1000_setup_queue_mapping(adapter); +#endif + adapter->tx_queue_len = netdev->tx_queue_len; #ifdef CONFIG_E1000_NAPI @@ -900,7 +917,12 @@ e1000_probe(struct pci_dev *pdev, pci_set_master(pdev); err = -ENOMEM; +#ifdef CONFIG_E1000_MQ + netdev = alloc_etherdev(sizeof(struct e1000_adapter) + + (sizeof(struct net_device_subqueue) * E1000_MAX_TX_QUEUES)); +#else netdev = alloc_etherdev(sizeof(struct e1000_adapter)); +#endif if (!netdev) goto err_alloc_etherdev; @@ -934,6 +956,9 @@ e1000_probe(struct pci_dev *pdev, netdev->open = &e1000_open; netdev->stop = &e1000_close; netdev->hard_start_xmit = &e1000_xmit_frame; +#ifdef CONFIG_E1000_MQ + netdev->hard_start_subqueue_xmit = &e1000_subqueue_xmit_frame; +#endif netdev->get_stats = &e1000_get_stats; netdev->set_multicast_list = &e1000_set_multi; netdev->set_mac_address = &e1000_set_mac; @@ -1317,8 +1342,44 @@ e1000_sw_init(struct e1000_adapter *adapter) hw->master_slave = E1000_MASTER_SLAVE; } +#ifdef CONFIG_E1000_MQ + /* Number of supported queues. + * TODO: It's assumed num_rx_queues >= num_tx_queues, since multi-rx + * queues are much more interesting. Is it worth coding for the + * possibility (however improbable) of num_tx_queues > num_rx_queues? + */ + switch (hw->mac_type) { + case e1000_82571: + case e1000_82572: + case e1000_80003es2lan: + adapter->num_tx_queues = 2; + adapter->num_rx_queues = 2; + break; + + default: + /* All hardware before 82571 only have 1 queue each for Rx/Tx. + * However, the 82571 family does not have MSI-X, so multi- + * queue isn't enabled. + * It'd be wise not to mess with this default case. 
:) */ + adapter->num_tx_queues = 1; + adapter->num_rx_queues = 1; + netdev->egress_subqueue_count = 0; + break; + } + adapter->num_rx_queues = min(adapter->num_rx_queues, num_online_cpus()); + adapter->num_tx_queues = min(adapter->num_tx_queues, num_online_cpus()); + + if ((adapter->num_tx_queues > 1) || (adapter->num_rx_queues > 1)) { + netdev->egress_subqueue = (struct net_device_subqueue *)((void *)adapter + sizeof(struct e1000_adapter)); + netdev->egress_subqueue_count = adapter->num_tx_queues; + DPRINTK(DRV, INFO, "Multiqueue Enabled: RX queues = %u, " + "TX queues = %u\n", adapter->num_rx_queues, + adapter->num_tx_queues); + } +#else adapter->num_tx_queues = 1; adapter->num_rx_queues = 1; +#endif if (e1000_alloc_queues(adapter)) { DPRINTK(PROBE, ERR, "Unable to allocate memory for queues\n"); @@ -1334,13 +1395,16 @@ e1000_sw_init(struct e1000_adapter *adapter) set_bit(__LINK_STATE_START, &adapter->polling_netdev[i].state); } spin_lock_init(&adapter->tx_queue_lock); +#ifdef CONFIG_E1000_MQ + for (i = 0; i < adapter->num_tx_queues; i++) + spin_lock_init(&adapter->tx_ring[i].tx_queue_lock); +#endif #endif atomic_set(&adapter->irq_sem, 1); spin_lock_init(&adapter->stats_lock); set_bit(__E1000_DOWN, &adapter->flags); - return 0; } @@ -1382,9 +1446,28 @@ e1000_alloc_queues(struct e1000_adapter *adapter) } memset(adapter->polling_netdev, 0, size); #endif +#ifdef CONFIG_E1000_MQ + adapter->cpu_tx_ring = alloc_percpu(struct e1000_tx_ring *); +#endif return E1000_SUCCESS; } +#ifdef CONFIG_E1000_MQ +static void +e1000_setup_queue_mapping(struct e1000_adapter *adapter) +{ + int i, cpu; + + lock_cpu_hotplug(); + i = 0; + for_each_online_cpu(cpu) { + *per_cpu_ptr(adapter->cpu_tx_ring, cpu) = + &adapter->tx_ring[i % adapter->num_tx_queues]; + i++; + } + unlock_cpu_hotplug(); +} +#endif /** * e1000_open - Called when a network interface is made active @@ -1636,23 +1719,20 @@ e1000_configure_tx(struct e1000_adapter *adapter) struct e1000_hw *hw = &adapter->hw; uint32_t tdlen, tctl, tipg, tarc; uint32_t ipgr1, ipgr2; + int i; /* Setup the HW Tx Head and Tail descriptor pointers */ - - switch (adapter->num_tx_queues) { - case 1: - default: - tdba = adapter->tx_ring[0].dma; - tdlen = adapter->tx_ring[0].count * + for (i = 0; i < adapter->num_tx_queues; i++) { + tdba = adapter->tx_ring[i].dma; + tdlen = adapter->tx_ring[i].count * sizeof(struct e1000_tx_desc); - E1000_WRITE_REG(hw, TDLEN, tdlen); - E1000_WRITE_REG(hw, TDBAH, (tdba >> 32)); - E1000_WRITE_REG(hw, TDBAL, (tdba & 0x00000000ffffffffULL)); - E1000_WRITE_REG(hw, TDT, 0); - E1000_WRITE_REG(hw, TDH, 0); - adapter->tx_ring[0].tdh = ((hw->mac_type >= e1000_82543) ? E1000_TDH : E1000_82542_TDH); - adapter->tx_ring[0].tdt = ((hw->mac_type >= e1000_82543) ? E1000_TDT : E1000_82542_TDT); - break; + E1000_WRITE_REG(hw, TDLEN + (i << 8), tdlen); + E1000_WRITE_REG(hw, TDBAH + (i << 8), (tdba >> 32)); + E1000_WRITE_REG(hw, TDBAL + (i << 8), (tdba & 0x00000000ffffffffULL)); + E1000_WRITE_REG(hw, TDT + (i << 8), 0); + E1000_WRITE_REG(hw, TDH + (i << 8), 0); + adapter->tx_ring[i].tdh = ((hw->mac_type >= e1000_82543) ? E1000_TDH + (i << 8) : E1000_82542_TDH); + adapter->tx_ring[i].tdt = ((hw->mac_type >= e1000_82543) ? 
E1000_TDT + (i << 8) : E1000_82542_TDT); } /* Set the default values for the Tx Inter Packet Gap timer */ @@ -1999,6 +2079,7 @@ e1000_configure_rx(struct e1000_adapter *adapter) uint64_t rdba; struct e1000_hw *hw = &adapter->hw; uint32_t rdlen, rctl, rxcsum, ctrl_ext; + int i; if (adapter->rx_ps_pages) { /* this is a 32 byte descriptor */ @@ -2042,20 +2123,67 @@ e1000_configure_rx(struct e1000_adapter *adapter) /* Setup the HW Rx Head and Tail Descriptor Pointers and * the Base and Length of the Rx Descriptor Ring */ - switch (adapter->num_rx_queues) { - case 1: - default: - rdba = adapter->rx_ring[0].dma; - E1000_WRITE_REG(hw, RDLEN, rdlen); - E1000_WRITE_REG(hw, RDBAH, (rdba >> 32)); - E1000_WRITE_REG(hw, RDBAL, (rdba & 0x00000000ffffffffULL)); - E1000_WRITE_REG(hw, RDT, 0); - E1000_WRITE_REG(hw, RDH, 0); - adapter->rx_ring[0].rdh = ((hw->mac_type >= e1000_82543) ? E1000_RDH : E1000_82542_RDH); - adapter->rx_ring[0].rdt = ((hw->mac_type >= e1000_82543) ? E1000_RDT : E1000_82542_RDT); - break; + for (i = 0; i < adapter->num_rx_queues; i++) { + rdba = adapter->rx_ring[i].dma; + E1000_WRITE_REG(hw, RDLEN + (i << 8), rdlen); + E1000_WRITE_REG(hw, RDBAH + (i << 8), (rdba >> 32)); + E1000_WRITE_REG(hw, RDBAL + (i << 8), (rdba & 0x00000000ffffffffULL)); + E1000_WRITE_REG(hw, RDT + (i << 8), 0); + E1000_WRITE_REG(hw, RDH + (i << 8), 0); + adapter->rx_ring[i].rdh = ((hw->mac_type >= e1000_82543) ? E1000_RDH + (i << 8) : E1000_82542_RDH); + adapter->rx_ring[i].rdt = ((hw->mac_type >= e1000_82543) ? E1000_RDT + (i << 8) : E1000_82542_RDT); } +#ifdef CONFIG_E1000_MQ + if (adapter->num_rx_queues > 1) { + u32 random[10]; + u32 reta, mrqc; + int i; + + get_random_bytes(&random[0], 40); + + switch (adapter->num_rx_queues) { + default: + reta = 0x00800080; + mrqc = E1000_MRQC_ENABLE_RSS_2Q; + break; + } + + /* Fill out redirection table */ + for (i = 0; i < 32; i++) + E1000_WRITE_REG_ARRAY(hw, RETA, i, reta); + /* Fill out hash function seeds */ + for (i = 0; i < 10; i++) + E1000_WRITE_REG_ARRAY(hw, RSSRK, i, random[i]); + + mrqc |= (E1000_MRQC_RSS_FIELD_IPV4 | + E1000_MRQC_RSS_FIELD_IPV4_TCP); + + E1000_WRITE_REG(hw, MRQC, mrqc); + + /* Multiqueue and packet checksumming are mutually exclusive. */ + rxcsum = E1000_READ_REG(hw, RXCSUM); + rxcsum |= E1000_RXCSUM_PCSD; + E1000_WRITE_REG(hw, RXCSUM, rxcsum); + } else if (hw->mac_type >= e1000_82543) { + /* Enable 82543 Receive Checksum Offload for TCP and UDP */ + rxcsum = E1000_READ_REG(hw, RXCSUM); + if (adapter->rx_csum == TRUE) { + rxcsum |= E1000_RXCSUM_TUOFL; + + /* Enable 82571 IPv4 payload checksum for UDP fragments + * Must be used in conjunction with packet-split. */ + if ((hw->mac_type >= e1000_82571) && + (adapter->rx_ps_pages)) { + rxcsum |= E1000_RXCSUM_IPPCSE; + } + } else { + rxcsum &= ~E1000_RXCSUM_TUOFL; + /* don't need to clear IPPCSE as it defaults to 0 */ + } + E1000_WRITE_REG(hw, RXCSUM, rxcsum); + } +#else /* Enable 82543 Receive Checksum Offload for TCP and UDP */ if (hw->mac_type >= e1000_82543) { rxcsum = E1000_READ_REG(hw, RXCSUM); @@ -2074,6 +2202,7 @@ e1000_configure_rx(struct e1000_adapter *adapter) } E1000_WRITE_REG(hw, RXCSUM, rxcsum); } +#endif /* enable early receives on 82573, only takes effect if using > 2048 * byte total frame size. 
for example only for jumbo frames */ @@ -2555,6 +2684,9 @@ e1000_watchdog(unsigned long data) struct e1000_tx_ring *txdr = adapter->tx_ring; uint32_t link, tctl; int32_t ret_val; +#ifdef CONFIG_E1000_MQ + int i; +#endif ret_val = e1000_check_for_link(&adapter->hw); if ((ret_val == E1000_ERR_PHY) && @@ -2652,6 +2784,12 @@ e1000_watchdog(unsigned long data) netif_carrier_on(netdev); netif_wake_queue(netdev); +#ifdef CONFIG_E1000_MQ + if (netif_is_multiqueue(netdev)) + for (i = 0; i < adapter->num_tx_queues; i++) + netif_wake_subqueue(netdev, i); +#endif + mod_timer(&adapter->phy_info_timer, jiffies + 2 * HZ); adapter->smartspeed = 0; } else { @@ -3249,10 +3387,10 @@ static int e1000_maybe_stop_tx(struct net_device *netdev, #define TXD_USE_COUNT(S, X) (((S) >> (X)) + 1 ) static int -e1000_xmit_frame(struct sk_buff *skb, struct net_device *netdev) +e1000_xmit_frame_ring(struct sk_buff *skb, struct net_device *netdev, + struct e1000_tx_ring *tx_ring) { struct e1000_adapter *adapter = netdev_priv(netdev); - struct e1000_tx_ring *tx_ring; unsigned int first, max_per_txd = E1000_MAX_DATA_PER_TXD; unsigned int max_txd_pwr = E1000_MAX_TXD_PWR; unsigned int tx_flags = 0; @@ -3265,12 +3403,6 @@ e1000_xmit_frame(struct sk_buff *skb, struct net_device *netdev) unsigned int f; len -= skb->data_len; - /* This goes back to the question of how to logically map a tx queue - * to a flow. Right now, performance is impacted slightly negatively - * if using multiple tx queues. If the stack breaks away from a - * single qdisc implementation, we can look at this again. */ - tx_ring = adapter->tx_ring; - if (unlikely(skb->len <= 0)) { dev_kfree_skb_any(skb); return NETDEV_TX_OK; @@ -3425,6 +3557,31 @@ e1000_xmit_frame(struct sk_buff *skb, struct net_device *netdev) return NETDEV_TX_OK; } +static int +e1000_xmit_frame(struct sk_buff *skb, struct net_device *netdev) +{ + struct e1000_adapter *adapter = netdev_priv(netdev); + struct e1000_tx_ring *tx_ring = adapter->tx_ring; + + /* This goes back to the question of how to logically map a tx queue + * to a flow. Right now, performance is impacted slightly negatively + * if using multiple tx queues. If the stack breaks away from a + * single qdisc implementation, we can look at this again. */ + return (e1000_xmit_frame_ring(skb, netdev, tx_ring)); +} + +#ifdef CONFIG_E1000_MQ +static int +e1000_subqueue_xmit_frame(struct sk_buff *skb, struct net_device *netdev, + int queue) +{ + struct e1000_adapter *adapter = netdev_priv(netdev); + struct e1000_tx_ring *tx_ring = &adapter->tx_ring[queue]; + + return (e1000_xmit_frame_ring(skb, netdev, tx_ring)); +} +#endif + /** * e1000_tx_timeout - Respond to a Tx Hang * @netdev: network interface device structure @@ -3924,6 +4081,7 @@ e1000_clean(struct net_device *poll_dev, int *budget) struct e1000_adapter *adapter; int work_to_do = min(*budget, poll_dev->quota); int tx_cleaned = 0, work_done = 0; + int i; /* Must NOT use netdev_priv macro here. */ adapter = poll_dev->priv; @@ -3933,18 +4091,27 @@ e1000_clean(struct net_device *poll_dev, int *budget) goto quit_polling; /* e1000_clean is called per-cpu. This lock protects - * tx_ring[0] from being cleaned by multiple cpus + * tx_ring[i] from being cleaned by multiple cpus * simultaneously. A failure obtaining the lock means - * tx_ring[0] is currently being cleaned anyway. */ - if (spin_trylock(&adapter->tx_queue_lock)) { - tx_cleaned = e1000_clean_tx_irq(adapter, - &adapter->tx_ring[0]); - spin_unlock(&adapter->tx_queue_lock); + * tx_ring[i] is currently being cleaned anyway. 
*/ + for (i = 0; i < adapter->num_tx_queues; i++) { +#ifdef CONFIG_E1000_MQ + if (spin_trylock(&adapter->tx_ring[i].tx_queue_lock)) { + tx_cleaned = e1000_clean_tx_irq(adapter, + &adapter->tx_ring[i]); + spin_unlock(&adapter->tx_ring[i].tx_queue_lock); + } +#else + if (spin_trylock(&adapter->tx_queue_lock)) { + tx_cleaned = e1000_clean_tx_irq(adapter, + &adapter->tx_ring[i]); + spin_unlock(&adapter->tx_queue_lock); + } +#endif + adapter->clean_rx(adapter, &adapter->rx_ring[i], + &work_done, work_to_do); } - adapter->clean_rx(adapter, &adapter->rx_ring[0], - &work_done, work_to_do); - *budget -= work_done; poll_dev->quota -= work_done; @@ -3992,6 +4159,9 @@ e1000_clean_tx_irq(struct e1000_adapter *adapter, buffer_info = &tx_ring->buffer_info[i]; cleaned = (i == eop); +#ifdef CONFIG_E1000_MQ + tx_ring->tx_stats.bytes += buffer_info->length; +#endif if (cleaned) { struct sk_buff *skb = buffer_info->skb; unsigned int segs, bytecount; @@ -4008,6 +4178,9 @@ e1000_clean_tx_irq(struct e1000_adapter *adapter, if (unlikely(++i == tx_ring->count)) i = 0; } +#ifdef CONFIG_E1000_MQ + tx_ring->tx_stats.packets++; +#endif eop = tx_ring->buffer_info[i].next_to_watch; eop_desc = E1000_TX_DESC(*tx_ring, eop); #ifdef CONFIG_E1000_NAPI @@ -4269,6 +4442,10 @@ e1000_clean_rx_irq(struct e1000_adapter *adapter, } #endif /* CONFIG_E1000_NAPI */ netdev->last_rx = jiffies; +#ifdef CONFIG_E1000_MQ + rx_ring->rx_stats.packets++; + rx_ring->rx_stats.bytes += length; +#endif next_desc: rx_desc->status = 0; @@ -4375,6 +4552,10 @@ e1000_clean_rx_irq_ps(struct e1000_adapter *adapter, /* Good Receive */ skb_put(skb, length); +#ifdef CONFIG_E1000_MQ + rx_ring->rx_stats.packets++; + rx_ring->rx_stats.bytes += skb->len; +#endif { /* this looks ugly, but it seems compiler issues make it --- Auke Kok <auke-jan.h.kok@intel.com> ^ permalink raw reply related [flat|nested] 17+ messages in thread
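One non-obvious piece of the e1000 changes above is the per-CPU ring mapping: e1000_setup_queue_mapping() assigns TX rings to online CPUs round-robin (i % num_tx_queues). A sketch of how a hot path could consume that mapping follows; this consumer function is illustrative, not part of the patch:

#include <linux/percpu.h>
#include <linux/smp.h>

/* Sketch: pick the TX ring assigned to the current CPU. With
 * num_tx_queues == 2 and four online CPUs, CPUs 0/2 share ring 0
 * and CPUs 1/3 share ring 1. */
static struct e1000_tx_ring *e1000_pick_tx_ring(struct e1000_adapter *adapter)
{
	struct e1000_tx_ring *ring;
	int cpu = get_cpu();            /* disable preemption for the lookup */

	ring = *per_cpu_ptr(adapter->cpu_tx_ring, cpu);
	put_cpu();
	return ring;
}

Keeping each CPU on its own ring avoids cache-line bouncing of ring state between processors, which is presumably why the mapping is set up even though the posted xmit path still selects the ring by subqueue index.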
* Re: [PATCH 2/2] e1000: Implement the new kernel API for multiqueue TX support. 2007-02-09 0:09 ` [PATCH 2/2] e1000: Implement the new kernel API for multiqueue TX support Kok, Auke @ 2007-03-09 13:11 ` Thomas Graf 0 siblings, 0 replies; 17+ messages in thread From: Thomas Graf @ 2007-03-09 13:11 UTC (permalink / raw) To: Kok, Auke Cc: David Miller, Garzik, Jeff, netdev, linux-kernel, Peter Waskiewicz Jr, Brandeburg, Jesse, Kok, Auke, Ronciak, John * Kok, Auke <auke-jan.h.kok@intel.com> 2007-02-08 16:09 > > From: Peter Waskiewicz Jr. <peter.p.waskiewicz.jr@intel.com> > > Several newer e1000 chipsets support multiple RX and TX queues. Most > commonly, 82571s and ESB2LAN support 2 rx and 2 tx queues. > > Signed-off-by: Peter Waskiewicz Jr. <peter.p.waskiewicz.jr@intel.com> > Signed-off-by: Auke Kok <auke-jan.h.kok@intel.com> > --- > > drivers/net/Kconfig | 13 ++ > drivers/net/e1000/e1000.h | 23 +++ > drivers/net/e1000/e1000_ethtool.c | 43 ++++++ > drivers/net/e1000/e1000_main.c | 269 +++++++++++++++++++++++++++++++------ > 4 files changed, 304 insertions(+), 44 deletions(-) > > diff --git a/drivers/net/Kconfig b/drivers/net/Kconfig > index ad92b6a..2d758ab 100644 > --- a/drivers/net/Kconfig > +++ b/drivers/net/Kconfig > @@ -1988,6 +1988,19 @@ config E1000_DISABLE_PACKET_SPLIT > > If in doubt, say N. > > +config E1000_MQ > + bool "Enable Tx/Rx Multiqueue Support (EXPERIMENTAL)" > + depends on E1000 && NET_MULTI_QUEUE_DEVICE && EXPERIMENTAL It would be better to just select NET_MULTI_QUEUE_DEVICE instead of depending on it. ^ permalink raw reply [flat|nested] 17+ messages in thread
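Concretely, the suggested change would look something like this in drivers/net/Kconfig (a sketch, untested; the help text from the patch stays as posted):

config E1000_MQ
	bool "Enable Tx/Rx Multiqueue Support (EXPERIMENTAL)"
	depends on E1000 && EXPERIMENTAL
	select NET_MULTI_QUEUE_DEVICE

With select, enabling E1000_MQ pulls in NET_MULTI_QUEUE_DEVICE automatically instead of requiring the user to find and enable the stack-level option first.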