* [RFC PATCH 0/2] Coalesce MMIO writes for transmits @ 2012-07-12 0:25 Alexander Duyck 2012-07-12 0:26 ` [RFC PATCH 1/2] net: Add new network device function to allow for MMIO batching Alexander Duyck ` (2 more replies) 0 siblings, 3 replies; 14+ messages in thread From: Alexander Duyck @ 2012-07-12 0:25 UTC (permalink / raw) To: netdev Cc: davem, jeffrey.t.kirsher, edumazet, bhutchings, therbert, alexander.duyck This patch set is meant to address recent issues I found with ixgbe performance being bound by Tx tail writes. With these changes in place and the dispatch_limit set to 1 or more I see a significant increase in performance. In the case of one of my systems I saw the routing rate for 7 queues jump from 10.5 to 11.7Mpps. The overall increase I have seen on most systems is something on the order of about 15%. In the case of pktgen I have also seen a noticeable increase as the previous limit for transmits was ~12.5Mpps, but with this patch set in place and the dispatch_limit enabled the value increases to ~14.2Mpps. I expected there to be an increase in latency, however so far I have not ran into that. I have tried running NPtcp tests for latency and seen no difference in the coalesced and non-coalesced transaction times. I welcome any suggestions for tests I might run that might expose any latency issues as a result of this patch. --- Alexander Duyck (2): ixgbe: Add functionality for delaying the MMIO write for Tx net: Add new network device function to allow for MMIO batching drivers/net/ethernet/intel/ixgbe/ixgbe_main.c | 22 +++++++- include/linux/netdevice.h | 57 +++++++++++++++++++++ net/core/dev.c | 67 +++++++++++++++++++++++++ net/core/net-sysfs.c | 36 +++++++++++++ 4 files changed, 180 insertions(+), 2 deletions(-) -- Thanks, Alex ^ permalink raw reply [flat|nested] 14+ messages in thread
* [RFC PATCH 1/2] net: Add new network device function to allow for MMIO batching 2012-07-12 0:25 [RFC PATCH 0/2] Coalesce MMIO writes for transmits Alexander Duyck @ 2012-07-12 0:26 ` Alexander Duyck 2012-07-12 7:14 ` Eric Dumazet 2012-07-13 7:19 ` Eric Dumazet 2012-07-12 0:26 ` [RFC PATCH 2/2] ixgbe: Add functionality for delaying the MMIO write for Tx Alexander Duyck 2012-07-12 17:23 ` [RFC PATCH 0/2] Coalesce MMIO writes for transmits Stephen Hemminger 2 siblings, 2 replies; 14+ messages in thread From: Alexander Duyck @ 2012-07-12 0:26 UTC (permalink / raw) To: netdev Cc: davem, jeffrey.t.kirsher, edumazet, bhutchings, therbert, alexander.duyck This change adds capabilities to the driver for batching the MMIO write involved with transmits. Most of the logic is based off of the code for the qdisc scheduling. What I did is break the transmit path into two parts. We already had the ndo_start_xmit function which has been there all along. The part I added was ndo_complete_xmit which is meant to handle notifying the hardware that frames are ready for delivery. To control all of this I added a net sysfs value for the Tx queues called dispatch_limit. When 0 it indicates that all frames will notify hardware immediately. When 1 or more the netdev_complete_xmit call will queue up to that number of packets, and when the value is exceeded it will notify the hardware and reset the pending frame dispatch count. Signed-off-by: Alexander Duyck <alexander.h.duyck@intel.com> --- include/linux/netdevice.h | 57 ++++++++++++++++++++++++++++++++++++++ net/core/dev.c | 67 +++++++++++++++++++++++++++++++++++++++++++++ net/core/net-sysfs.c | 36 ++++++++++++++++++++++++ 3 files changed, 160 insertions(+), 0 deletions(-) diff --git a/include/linux/netdevice.h b/include/linux/netdevice.h index 5a1a657..8d50fc4 100644 --- a/include/linux/netdevice.h +++ b/include/linux/netdevice.h @@ -522,6 +522,8 @@ enum netdev_queue_state_t { __QUEUE_STATE_DRV_XOFF, __QUEUE_STATE_STACK_XOFF, __QUEUE_STATE_FROZEN, + __QUEUE_STATE_DELAYED, + __QUEUE_STATE_DISPATCH, #define QUEUE_STATE_ANY_XOFF ((1 << __QUEUE_STATE_DRV_XOFF) | \ (1 << __QUEUE_STATE_STACK_XOFF)) #define QUEUE_STATE_ANY_XOFF_OR_FROZEN (QUEUE_STATE_ANY_XOFF | \ @@ -550,6 +552,7 @@ struct netdev_queue { #if defined(CONFIG_XPS) && defined(CONFIG_NUMA) int numa_node; #endif + unsigned int dispatch_limit; /* * write mostly part */ @@ -561,6 +564,11 @@ struct netdev_queue { unsigned long trans_start; /* + * pointer to next Tx queue in dispatch_queue + */ + struct netdev_queue *next_dispatch; + + /* * Number of TX timeouts for this queue * (/sys/class/net/DEV/Q/trans_timeout) */ @@ -568,6 +576,8 @@ struct netdev_queue { unsigned long state; + unsigned int dispatch_pending; + #ifdef CONFIG_BQL struct dql dql; #endif @@ -924,6 +934,8 @@ struct net_device_ops { int (*ndo_stop)(struct net_device *dev); netdev_tx_t (*ndo_start_xmit) (struct sk_buff *skb, struct net_device *dev); + void (*ndo_complete_xmit) (struct net_device *dev, + unsigned int queue); u16 (*ndo_select_queue)(struct net_device *dev, struct sk_buff *skb); void (*ndo_change_rx_flags)(struct net_device *dev, @@ -1760,6 +1772,9 @@ struct softnet_data { unsigned int dropped; struct sk_buff_head input_pkt_queue; struct napi_struct backlog; + + struct netdev_queue *dispatch_queue; + struct netdev_queue **dispatch_queue_tailp; }; static inline void input_queue_head_incr(struct softnet_data *sd) @@ -1779,6 +1794,44 @@ static inline void input_queue_tail_incr_save(struct softnet_data *sd, DECLARE_PER_CPU_ALIGNED(struct softnet_data, softnet_data); +static inline void netif_tx_delay_queue(struct netdev_queue *txq) +{ + set_bit(__QUEUE_STATE_DELAYED, &txq->state); +} + +extern void __netif_tx_dispatch_queue(struct netdev_queue *txq); + +static inline void netif_tx_dispatch_queue(struct netdev_queue *txq) +{ + if (test_and_clear_bit(__QUEUE_STATE_DELAYED, &txq->state)) + __netif_tx_dispatch_queue(txq); +} + +static inline bool netif_tx_queue_delayed(const struct netdev_queue *txq) +{ + return test_bit(__QUEUE_STATE_DELAYED, &txq->state); +} + +static inline void netdev_complete_xmit(struct netdev_queue *txq) +{ + struct net_device *dev = txq->dev; + const struct net_device_ops *ops = dev->netdev_ops; + + if (txq->dispatch_pending < txq->dispatch_limit) { + if (netif_tx_queue_delayed(txq)) { + txq->dispatch_pending++; + return; + } + + /* start of delayed write sequence */ + netif_tx_delay_queue(txq); + } + + txq->dispatch_pending = 0; + + ops->ndo_complete_xmit(dev, txq - &dev->_tx[0]); +} + extern void __netif_schedule(struct Qdisc *q); static inline void netif_schedule_queue(struct netdev_queue *txq) @@ -1973,6 +2026,7 @@ static inline void netdev_completed_queue(struct net_device *dev, static inline void netdev_tx_reset_queue(struct netdev_queue *q) { + clear_bit(__QUEUE_STATE_DELAYED, &q->state); #ifdef CONFIG_BQL clear_bit(__QUEUE_STATE_STACK_XOFF, &q->state); dql_reset(&q->dql); @@ -2482,6 +2536,9 @@ static inline void netif_tx_unlock_bh(struct net_device *dev) } \ } +#define HARD_TX_TRYLOCK(dev, txq) \ + ((dev->features & NETIF_F_LLTX) || __netif_tx_trylock(txq)) + static inline void netif_tx_disable(struct net_device *dev) { unsigned int i; diff --git a/net/core/dev.c b/net/core/dev.c index 93af533..a72669a 100644 --- a/net/core/dev.c +++ b/net/core/dev.c @@ -2032,6 +2032,27 @@ int netif_get_num_default_rss_queues(void) } EXPORT_SYMBOL(netif_get_num_default_rss_queues); +static inline void __netif_tx_redispatch_queue(struct netdev_queue *txq) +{ + struct softnet_data *sd; + unsigned long flags; + + local_irq_save(flags); + sd = &__get_cpu_var(softnet_data); + txq->next_dispatch = NULL; + sd->dispatch_queue = txq; + sd->dispatch_queue_tailp = &txq->next_dispatch; + raise_softirq_irqoff(NET_TX_SOFTIRQ); + local_irq_restore(flags); +} + +void __netif_tx_dispatch_queue(struct netdev_queue *txq) +{ + if (!test_and_set_bit(__QUEUE_STATE_DISPATCH, &txq->state)) + __netif_tx_redispatch_queue(txq); +} +EXPORT_SYMBOL(__netif_tx_dispatch_queue); + static inline void __netif_reschedule(struct Qdisc *q) { struct softnet_data *sd; @@ -3268,6 +3289,41 @@ static void net_tx_action(struct softirq_action *h) } } } + + if (sd->dispatch_queue) { + struct netdev_queue *head; + + local_irq_disable(); + head = sd->dispatch_queue; + sd->dispatch_queue = NULL; + sd->dispatch_queue_tailp = &sd->dispatch_queue; + local_irq_enable(); + + while (head) { + struct netdev_queue *txq = head; + struct net_device *dev = txq->dev; + const struct net_device_ops *ops = dev->netdev_ops; + + head = head->next_dispatch; + + if (!HARD_TX_TRYLOCK(dev, txq)) { + __netif_tx_redispatch_queue(txq); + continue; + } + + smp_mb__before_clear_bit(); + clear_bit(__QUEUE_STATE_DISPATCH, &txq->state); + + if (txq->dispatch_pending && + !netif_tx_queue_delayed(txq)) { + int index = txq - &dev->_tx[0]; + + ops->ndo_complete_xmit(dev, index); + } + + HARD_TX_UNLOCK(dev, txq); + } + } } #if (defined(CONFIG_BRIDGE) || defined(CONFIG_BRIDGE_MODULE)) && \ @@ -6485,6 +6541,15 @@ static int dev_cpu_callback(struct notifier_block *nfb, oldsd->output_queue = NULL; oldsd->output_queue_tailp = &oldsd->output_queue; } + + /* Append delayed xmit queue from offline CPU */ + if (oldsd->dispatch_queue) { + *sd->dispatch_queue_tailp = oldsd->dispatch_queue; + sd->dispatch_queue_tailp = oldsd->dispatch_queue_tailp; + oldsd->dispatch_queue = NULL; + oldsd->dispatch_queue_tailp = &oldsd->dispatch_queue; + } + /* Append NAPI poll list from offline CPU. */ if (!list_empty(&oldsd->poll_list)) { list_splice_init(&oldsd->poll_list, &sd->poll_list); @@ -6772,6 +6837,8 @@ static int __init net_dev_init(void) INIT_LIST_HEAD(&sd->poll_list); sd->output_queue = NULL; sd->output_queue_tailp = &sd->output_queue; + sd->dispatch_queue = NULL; + sd->dispatch_queue_tailp = &sd->dispatch_queue; #ifdef CONFIG_RPS sd->csd.func = rps_trigger_softirq; sd->csd.info = sd; diff --git a/net/core/net-sysfs.c b/net/core/net-sysfs.c index 42bb496..4f7eb58 100644 --- a/net/core/net-sysfs.c +++ b/net/core/net-sysfs.c @@ -997,11 +997,47 @@ static struct netdev_queue_attribute xps_cpus_attribute = __ATTR(xps_cpus, S_IRUGO | S_IWUSR, show_xps_map, store_xps_map); #endif /* CONFIG_XPS */ +static ssize_t show_dispatch_limit(struct netdev_queue *queue, + struct netdev_queue_attribute *attribute, + char *buf) +{ + unsigned int dispatch_limit; + + spin_lock_irq(&queue->_xmit_lock); + dispatch_limit = queue->dispatch_limit; + spin_unlock_irq(&queue->_xmit_lock); + + return sprintf(buf, "%u\n", dispatch_limit); +} + +static ssize_t store_dispatch_limit(struct netdev_queue *queue, + struct netdev_queue_attribute *attribute, + const char *buf, size_t len) +{ + unsigned int dispatch_limit; + int err; + + err = kstrtouint(buf, 10, &dispatch_limit); + if (err < 0) + return err; + + spin_lock_irq(&queue->_xmit_lock); + queue->dispatch_limit = dispatch_limit; + spin_unlock_irq(&queue->_xmit_lock); + + return len; +} + +static struct netdev_queue_attribute dispatch_limit_attribute = + __ATTR(dispatch_limit, S_IRUGO | S_IWUSR, + show_dispatch_limit, store_dispatch_limit); + static struct attribute *netdev_queue_default_attrs[] = { &queue_trans_timeout.attr, #ifdef CONFIG_XPS &xps_cpus_attribute.attr, #endif + &dispatch_limit_attribute.attr, NULL }; ^ permalink raw reply related [flat|nested] 14+ messages in thread
* Re: [RFC PATCH 1/2] net: Add new network device function to allow for MMIO batching 2012-07-12 0:26 ` [RFC PATCH 1/2] net: Add new network device function to allow for MMIO batching Alexander Duyck @ 2012-07-12 7:14 ` Eric Dumazet 2012-07-12 15:39 ` Alexander Duyck 2012-07-13 7:19 ` Eric Dumazet 1 sibling, 1 reply; 14+ messages in thread From: Eric Dumazet @ 2012-07-12 7:14 UTC (permalink / raw) To: Alexander Duyck Cc: netdev, davem, jeffrey.t.kirsher, edumazet, bhutchings, therbert, alexander.duyck On Wed, 2012-07-11 at 17:26 -0700, Alexander Duyck wrote: > This change adds capabilities to the driver for batching the MMIO write > involved with transmits. Most of the logic is based off of the code for > the qdisc scheduling. > > What I did is break the transmit path into two parts. We already had the > ndo_start_xmit function which has been there all along. The part I added > was ndo_complete_xmit which is meant to handle notifying the hardware that > frames are ready for delivery. > > To control all of this I added a net sysfs value for the Tx queues called > dispatch_limit. When 0 it indicates that all frames will notify hardware > immediately. When 1 or more the netdev_complete_xmit call will queue up to > that number of packets, and when the value is exceeded it will notify the > hardware and reset the pending frame dispatch count. > > Signed-off-by: Alexander Duyck <alexander.h.duyck@intel.com> > --- The idea is good, but do we really need so complex schem ? Most of the transmits are done from __qdisc_run() We could add logic in __qdisc_run()/qdisc_restart() qdisc_run_end() would then have to call ndo_complete_xmit() to make sure the MMIO is done. ^ permalink raw reply [flat|nested] 14+ messages in thread
* Re: [RFC PATCH 1/2] net: Add new network device function to allow for MMIO batching 2012-07-12 7:14 ` Eric Dumazet @ 2012-07-12 15:39 ` Alexander Duyck 2012-07-13 7:38 ` Eric Dumazet 0 siblings, 1 reply; 14+ messages in thread From: Alexander Duyck @ 2012-07-12 15:39 UTC (permalink / raw) To: Eric Dumazet Cc: netdev, davem, jeffrey.t.kirsher, edumazet, bhutchings, therbert, alexander.duyck On 07/12/2012 12:14 AM, Eric Dumazet wrote: > On Wed, 2012-07-11 at 17:26 -0700, Alexander Duyck wrote: >> This change adds capabilities to the driver for batching the MMIO write >> involved with transmits. Most of the logic is based off of the code for >> the qdisc scheduling. >> >> What I did is break the transmit path into two parts. We already had the >> ndo_start_xmit function which has been there all along. The part I added >> was ndo_complete_xmit which is meant to handle notifying the hardware that >> frames are ready for delivery. >> >> To control all of this I added a net sysfs value for the Tx queues called >> dispatch_limit. When 0 it indicates that all frames will notify hardware >> immediately. When 1 or more the netdev_complete_xmit call will queue up to >> that number of packets, and when the value is exceeded it will notify the >> hardware and reset the pending frame dispatch count. >> >> Signed-off-by: Alexander Duyck <alexander.h.duyck@intel.com> >> --- > The idea is good, but do we really need so complex schem ? > > Most of the transmits are done from __qdisc_run() > > We could add logic in __qdisc_run()/qdisc_restart() > > qdisc_run_end() would then have to call ndo_complete_xmit() to make > sure the MMIO is done. The problem is in both of the cases where I have seen the issue the qdisc is actually empty. In the case of pktgen it does not use the qdisc layer at all. It just directly calls ndo_start_xmit. In the standard networking case we never fill the qdisc because the MMIO write stalls the entire CPU so the application never gets a chance to get ahead of the hardware. From what I can tell the only case in which the qdisc_run solution would work is if the ndo_start_xmit was called on a different CPU from the application that is doing the transmitting. Thanks, Alex ^ permalink raw reply [flat|nested] 14+ messages in thread
* Re: [RFC PATCH 1/2] net: Add new network device function to allow for MMIO batching 2012-07-12 15:39 ` Alexander Duyck @ 2012-07-13 7:38 ` Eric Dumazet 2012-07-13 15:37 ` Alexander Duyck 2012-07-13 15:50 ` Stephen Hemminger 0 siblings, 2 replies; 14+ messages in thread From: Eric Dumazet @ 2012-07-13 7:38 UTC (permalink / raw) To: Alexander Duyck Cc: netdev, davem, jeffrey.t.kirsher, edumazet, bhutchings, therbert, alexander.duyck On Thu, 2012-07-12 at 08:39 -0700, Alexander Duyck wrote: > The problem is in both of the cases where I have seen the issue the > qdisc is actually empty. > You mean a router workload, with links of same bandwidth. (BQL doesnt trigger) Frankly what percentage of linux powered machines act as high perf routers ? > In the case of pktgen it does not use the qdisc layer at all. It just > directly calls ndo_start_xmit. pktgen is in kernel, adding a complete() call in it is certainly ok, if we can avoid kernel bloat. I mean, pktgen represents less than 0.000001 % of real workloads. > > In the standard networking case we never fill the qdisc because the MMIO > write stalls the entire CPU so the application never gets a chance to > get ahead of the hardware. From what I can tell the only case in which > the qdisc_run solution would work is if the ndo_start_xmit was called on > a different CPU from the application that is doing the transmitting. Hey, I can tell that qdisc is not empty on many workloads. But BQL and TSO mean we only send one or two packets per qdisc run. I understand this MMIO batching helps routers workloads, or workloads using many small packets. But on other workloads, this adds a significant latency source (NET_TX_SOFTIRQ) It would be good to instrument the extra delay on a single UDP send. (entering do_softirq() path is not a few instructions...) ^ permalink raw reply [flat|nested] 14+ messages in thread
* Re: [RFC PATCH 1/2] net: Add new network device function to allow for MMIO batching 2012-07-13 7:38 ` Eric Dumazet @ 2012-07-13 15:37 ` Alexander Duyck 2012-07-13 15:50 ` Stephen Hemminger 1 sibling, 0 replies; 14+ messages in thread From: Alexander Duyck @ 2012-07-13 15:37 UTC (permalink / raw) To: Eric Dumazet Cc: netdev, davem, jeffrey.t.kirsher, edumazet, bhutchings, therbert, alexander.duyck On 07/13/2012 12:38 AM, Eric Dumazet wrote: > On Thu, 2012-07-12 at 08:39 -0700, Alexander Duyck wrote: > >> The problem is in both of the cases where I have seen the issue the >> qdisc is actually empty. >> > You mean a router workload, with links of same bandwidth. > (BQL doesnt trigger) > > Frankly what percentage of linux powered machines act as high perf > routers ? Actually I was seeing this issue with the sending application on the same CPU as the Tx cleanup. The problem was the CPU would stall and consume cycles instead of putting work into placing more packets on the queue. >> In the case of pktgen it does not use the qdisc layer at all. It just >> directly calls ndo_start_xmit. > pktgen is in kernel, adding a complete() call in it is certainly ok, > if we can avoid kernel bloat. > > I mean, pktgen represents less than 0.000001 % of real workloads. I realize that, but it does provide a valid means of stress testing an interface and demonstrating that the MMIO writes are causing significant stalls and bus utilization. >> In the standard networking case we never fill the qdisc because the MMIO >> write stalls the entire CPU so the application never gets a chance to >> get ahead of the hardware. From what I can tell the only case in which >> the qdisc_run solution would work is if the ndo_start_xmit was called on >> a different CPU from the application that is doing the transmitting. > Hey, I can tell that qdisc is not empty on many workloads. > But BQL and TSO mean we only send one or two packets per qdisc run. > > I understand this MMIO batching helps routers workloads, or workloads > using many small packets. > > But on other workloads, this adds a significant latency source > (NET_TX_SOFTIRQ) > > It would be good to instrument the extra delay on a single UDP send. > > (entering do_softirq() path is not a few instructions...) These kind of issues are one of the reasons why this feature is disabled by default. You have to explicitly enable it by setting the dispatch_limit to something other than 0. I suppose I could just make it a part of the Tx cleanup itself since I am only doing a trylock instead of waiting and taking the full lock. I am open to any other suggestions for alternatives other than NET_TX_SOFTIRQ. Thanks, Alex ^ permalink raw reply [flat|nested] 14+ messages in thread
* Re: [RFC PATCH 1/2] net: Add new network device function to allow for MMIO batching 2012-07-13 7:38 ` Eric Dumazet 2012-07-13 15:37 ` Alexander Duyck @ 2012-07-13 15:50 ` Stephen Hemminger 2012-07-13 16:23 ` Eric Dumazet 1 sibling, 1 reply; 14+ messages in thread From: Stephen Hemminger @ 2012-07-13 15:50 UTC (permalink / raw) To: Eric Dumazet Cc: Alexander Duyck, netdev, davem, jeffrey.t.kirsher, edumazet, bhutchings, therbert, alexander.duyck On Fri, 13 Jul 2012 09:38:49 +0200 Eric Dumazet <eric.dumazet@gmail.com> wrote: > Frankly what percentage of linux powered machines act as high perf > routers ? More than you think, every Linux machine acting as hypervisor (Xen and KVM) is also doing this. ^ permalink raw reply [flat|nested] 14+ messages in thread
* Re: [RFC PATCH 1/2] net: Add new network device function to allow for MMIO batching 2012-07-13 15:50 ` Stephen Hemminger @ 2012-07-13 16:23 ` Eric Dumazet 0 siblings, 0 replies; 14+ messages in thread From: Eric Dumazet @ 2012-07-13 16:23 UTC (permalink / raw) To: Stephen Hemminger Cc: Alexander Duyck, netdev, davem, jeffrey.t.kirsher, edumazet, bhutchings, therbert, alexander.duyck On Fri, 2012-07-13 at 08:50 -0700, Stephen Hemminger wrote: > On Fri, 13 Jul 2012 09:38:49 +0200 > Eric Dumazet <eric.dumazet@gmail.com> wrote: > > > Frankly what percentage of linux powered machines act as high perf > > routers ? > > More than you think, every Linux machine acting as hypervisor (Xen and KVM) > is also doing this. High perf router meant : ability to route 10 Mpps, without help of TSO. If an hypervisor is enable to use TSO, I expect very bad performances anyway. ^ permalink raw reply [flat|nested] 14+ messages in thread
* Re: [RFC PATCH 1/2] net: Add new network device function to allow for MMIO batching 2012-07-12 0:26 ` [RFC PATCH 1/2] net: Add new network device function to allow for MMIO batching Alexander Duyck 2012-07-12 7:14 ` Eric Dumazet @ 2012-07-13 7:19 ` Eric Dumazet 2012-07-13 15:49 ` Alexander Duyck 1 sibling, 1 reply; 14+ messages in thread From: Eric Dumazet @ 2012-07-13 7:19 UTC (permalink / raw) To: Alexander Duyck Cc: netdev, davem, jeffrey.t.kirsher, edumazet, bhutchings, therbert, alexander.duyck On Wed, 2012-07-11 at 17:26 -0700, Alexander Duyck wrote: > +static inline void netdev_complete_xmit(struct netdev_queue *txq) > +{ > + struct net_device *dev = txq->dev; > + const struct net_device_ops *ops = dev->netdev_ops; > + > + if (txq->dispatch_pending < txq->dispatch_limit) { > + if (netif_tx_queue_delayed(txq)) { > + txq->dispatch_pending++; > + return; > + } > + > + /* start of delayed write sequence */ > + netif_tx_delay_queue(txq); I dont understand this part. Isnt a return missing here ? > + } > + > + txq->dispatch_pending = 0; > + > + ops->ndo_complete_xmit(dev, txq - &dev->_tx[0]); > +} > + ^ permalink raw reply [flat|nested] 14+ messages in thread
* Re: [RFC PATCH 1/2] net: Add new network device function to allow for MMIO batching 2012-07-13 7:19 ` Eric Dumazet @ 2012-07-13 15:49 ` Alexander Duyck 2012-07-13 16:18 ` Eric Dumazet 0 siblings, 1 reply; 14+ messages in thread From: Alexander Duyck @ 2012-07-13 15:49 UTC (permalink / raw) To: Eric Dumazet Cc: netdev, davem, jeffrey.t.kirsher, edumazet, bhutchings, therbert, alexander.duyck On 07/13/2012 12:19 AM, Eric Dumazet wrote: > On Wed, 2012-07-11 at 17:26 -0700, Alexander Duyck wrote: > >> +static inline void netdev_complete_xmit(struct netdev_queue *txq) >> +{ >> + struct net_device *dev = txq->dev; >> + const struct net_device_ops *ops = dev->netdev_ops; >> + >> + if (txq->dispatch_pending < txq->dispatch_limit) { >> + if (netif_tx_queue_delayed(txq)) { >> + txq->dispatch_pending++; >> + return; >> + } >> + >> + /* start of delayed write sequence */ >> + netif_tx_delay_queue(txq); > I dont understand this part. Isnt a return missing here ? > >> + } >> + >> + txq->dispatch_pending = 0; >> + >> + ops->ndo_complete_xmit(dev, txq - &dev->_tx[0]); >> +} >> + > There is intentionally no return there. The idea is that the first packet always gets through. It is what is going to later force the interrupt that will force the final flush if it is needed. That is one of the ways I am helping to reduce the latency of things such as TSO which will only be using one or two frames per interrupt anyway. Thanks, Alex ^ permalink raw reply [flat|nested] 14+ messages in thread
* Re: [RFC PATCH 1/2] net: Add new network device function to allow for MMIO batching 2012-07-13 15:49 ` Alexander Duyck @ 2012-07-13 16:18 ` Eric Dumazet 0 siblings, 0 replies; 14+ messages in thread From: Eric Dumazet @ 2012-07-13 16:18 UTC (permalink / raw) To: Alexander Duyck Cc: netdev, davem, jeffrey.t.kirsher, edumazet, bhutchings, therbert, alexander.duyck On Fri, 2012-07-13 at 08:49 -0700, Alexander Duyck wrote: > On 07/13/2012 12:19 AM, Eric Dumazet wrote: > > On Wed, 2012-07-11 at 17:26 -0700, Alexander Duyck wrote: > > > >> +static inline void netdev_complete_xmit(struct netdev_queue *txq) > >> +{ > >> + struct net_device *dev = txq->dev; > >> + const struct net_device_ops *ops = dev->netdev_ops; > >> + > >> + if (txq->dispatch_pending < txq->dispatch_limit) { > >> + if (netif_tx_queue_delayed(txq)) { > >> + txq->dispatch_pending++; > >> + return; > >> + } > >> + > >> + /* start of delayed write sequence */ > >> + netif_tx_delay_queue(txq); > > I dont understand this part. Isnt a return missing here ? > > > >> + } > >> + > >> + txq->dispatch_pending = 0; > >> + > >> + ops->ndo_complete_xmit(dev, txq - &dev->_tx[0]); > >> +} > >> + > > > There is intentionally no return there. The idea is that the first > packet always gets through. It is what is going to later force the > interrupt that will force the final flush if it is needed. That is one > of the ways I am helping to reduce the latency of things such as TSO > which will only be using one or two frames per interrupt anyway. So for a single packet, we only trigger TX softirq do do nothing at all, or worse the ndo_complete_xmit() is done twice ? It looks like you need to add comments, because if I dont understand this code, who will ? ^ permalink raw reply [flat|nested] 14+ messages in thread
* [RFC PATCH 2/2] ixgbe: Add functionality for delaying the MMIO write for Tx 2012-07-12 0:25 [RFC PATCH 0/2] Coalesce MMIO writes for transmits Alexander Duyck 2012-07-12 0:26 ` [RFC PATCH 1/2] net: Add new network device function to allow for MMIO batching Alexander Duyck @ 2012-07-12 0:26 ` Alexander Duyck 2012-07-12 17:23 ` [RFC PATCH 0/2] Coalesce MMIO writes for transmits Stephen Hemminger 2 siblings, 0 replies; 14+ messages in thread From: Alexander Duyck @ 2012-07-12 0:26 UTC (permalink / raw) To: netdev Cc: davem, jeffrey.t.kirsher, edumazet, bhutchings, therbert, alexander.duyck This change makes it so that ixgbe can use the new framework for delaying the MMIO writes in the transmit path. With this change in place we see a significant reduction in CPU utilization and increase in overall packets per second throughput for bulk traffic tests. In addition I have not seen any increase in latency as a result of this patch. Signed-off-by: Alexander Duyck <alexander.h.duyck@intel.com> --- drivers/net/ethernet/intel/ixgbe/ixgbe_main.c | 22 ++++++++++++++++++++-- 1 files changed, 20 insertions(+), 2 deletions(-) diff --git a/drivers/net/ethernet/intel/ixgbe/ixgbe_main.c b/drivers/net/ethernet/intel/ixgbe/ixgbe_main.c index 9ec65ee..e9b71b8 100644 --- a/drivers/net/ethernet/intel/ixgbe/ixgbe_main.c +++ b/drivers/net/ethernet/intel/ixgbe/ixgbe_main.c @@ -884,6 +884,8 @@ static bool ixgbe_clean_tx_irq(struct ixgbe_q_vector *q_vector, return true; } + netif_tx_dispatch_queue(txring_txq(tx_ring)); + netdev_tx_completed_queue(txring_txq(tx_ring), total_packets, total_bytes); @@ -5825,6 +5827,22 @@ static void ixgbe_service_task(struct work_struct *work) ixgbe_service_event_complete(adapter); } +static void ixgbe_complete_xmit_frame(struct net_device *dev, + unsigned int index) +{ + struct ixgbe_adapter *adapter = netdev_priv(dev); + struct ixgbe_ring *tx_ring = adapter->tx_ring[index]; + + /* notify HW of packet */ + writel(tx_ring->next_to_use, tx_ring->tail); + + /* + * we need this if more than one processor can write to our tail + * at a time, it synchronizes IO on IA64/Altix systems + */ + mmiowb(); +} + static int ixgbe_tso(struct ixgbe_ring *tx_ring, struct ixgbe_tx_buffer *first, u8 *hdr_len) @@ -6150,8 +6168,7 @@ static void ixgbe_tx_map(struct ixgbe_ring *tx_ring, tx_ring->next_to_use = i; - /* notify HW of packet */ - writel(i, tx_ring->tail); + netdev_complete_xmit(txring_txq(tx_ring)); return; dma_error: @@ -6961,6 +6978,7 @@ static const struct net_device_ops ixgbe_netdev_ops = { .ndo_open = ixgbe_open, .ndo_stop = ixgbe_close, .ndo_start_xmit = ixgbe_xmit_frame, + .ndo_complete_xmit = ixgbe_complete_xmit_frame, #ifdef IXGBE_FCOE .ndo_select_queue = ixgbe_select_queue, #endif ^ permalink raw reply related [flat|nested] 14+ messages in thread
* Re: [RFC PATCH 0/2] Coalesce MMIO writes for transmits 2012-07-12 0:25 [RFC PATCH 0/2] Coalesce MMIO writes for transmits Alexander Duyck 2012-07-12 0:26 ` [RFC PATCH 1/2] net: Add new network device function to allow for MMIO batching Alexander Duyck 2012-07-12 0:26 ` [RFC PATCH 2/2] ixgbe: Add functionality for delaying the MMIO write for Tx Alexander Duyck @ 2012-07-12 17:23 ` Stephen Hemminger 2012-07-12 19:01 ` Alexander Duyck 2 siblings, 1 reply; 14+ messages in thread From: Stephen Hemminger @ 2012-07-12 17:23 UTC (permalink / raw) To: Alexander Duyck Cc: netdev, davem, jeffrey.t.kirsher, edumazet, bhutchings, therbert, alexander.duyck On Wed, 11 Jul 2012 17:25:58 -0700 Alexander Duyck <alexander.h.duyck@intel.com> wrote: > This patch set is meant to address recent issues I found with ixgbe > performance being bound by Tx tail writes. With these changes in place > and the dispatch_limit set to 1 or more I see a significant increase in > performance. > > In the case of one of my systems I saw the routing rate for 7 queues jump > from 10.5 to 11.7Mpps. The overall increase I have seen on most systems is > something on the order of about 15%. In the case of pktgen I have also > seen a noticeable increase as the previous limit for transmits was > ~12.5Mpps, but with this patch set in place and the dispatch_limit enabled > the value increases to ~14.2Mpps. > > I expected there to be an increase in latency, however so far I have not > ran into that. I have tried running NPtcp tests for latency and seen no > difference in the coalesced and non-coalesced transaction times. I welcome > any suggestions for tests I might run that might expose any latency issues > as a result of this patch. > > --- > > Alexander Duyck (2): > ixgbe: Add functionality for delaying the MMIO write for Tx > net: Add new network device function to allow for MMIO batching > > > drivers/net/ethernet/intel/ixgbe/ixgbe_main.c | 22 +++++++- > include/linux/netdevice.h | 57 +++++++++++++++++++++ > net/core/dev.c | 67 +++++++++++++++++++++++++ > net/core/net-sysfs.c | 36 +++++++++++++ > 4 files changed, 180 insertions(+), 2 deletions(-) > This is a good idea. I was thinking of adding a multi-skb operation to netdevice_ops to allow this. Something like ndo_start_xmit_pkts but the problem is how to deal with the boundary case where there is only a limited number of slots in the ring. Using a "that's all folks" operation seems better. ^ permalink raw reply [flat|nested] 14+ messages in thread
* Re: [RFC PATCH 0/2] Coalesce MMIO writes for transmits 2012-07-12 17:23 ` [RFC PATCH 0/2] Coalesce MMIO writes for transmits Stephen Hemminger @ 2012-07-12 19:01 ` Alexander Duyck 0 siblings, 0 replies; 14+ messages in thread From: Alexander Duyck @ 2012-07-12 19:01 UTC (permalink / raw) To: Stephen Hemminger Cc: netdev, davem, jeffrey.t.kirsher, edumazet, bhutchings, therbert, alexander.duyck On 07/12/2012 10:23 AM, Stephen Hemminger wrote: > On Wed, 11 Jul 2012 17:25:58 -0700 > Alexander Duyck <alexander.h.duyck@intel.com> wrote: > >> This patch set is meant to address recent issues I found with ixgbe >> performance being bound by Tx tail writes. With these changes in place >> and the dispatch_limit set to 1 or more I see a significant increase in >> performance. >> >> In the case of one of my systems I saw the routing rate for 7 queues jump >> from 10.5 to 11.7Mpps. The overall increase I have seen on most systems is >> something on the order of about 15%. In the case of pktgen I have also >> seen a noticeable increase as the previous limit for transmits was >> ~12.5Mpps, but with this patch set in place and the dispatch_limit enabled >> the value increases to ~14.2Mpps. >> >> I expected there to be an increase in latency, however so far I have not >> ran into that. I have tried running NPtcp tests for latency and seen no >> difference in the coalesced and non-coalesced transaction times. I welcome >> any suggestions for tests I might run that might expose any latency issues >> as a result of this patch. >> >> --- >> >> Alexander Duyck (2): >> ixgbe: Add functionality for delaying the MMIO write for Tx >> net: Add new network device function to allow for MMIO batching >> >> >> drivers/net/ethernet/intel/ixgbe/ixgbe_main.c | 22 +++++++- >> include/linux/netdevice.h | 57 +++++++++++++++++++++ >> net/core/dev.c | 67 +++++++++++++++++++++++++ >> net/core/net-sysfs.c | 36 +++++++++++++ >> 4 files changed, 180 insertions(+), 2 deletions(-) >> > This is a good idea. I was thinking of adding a multi-skb operation > to netdevice_ops to allow this. Something like ndo_start_xmit_pkts but > the problem is how to deal with the boundary case where there is only > a limited number of slots in the ring. Using a "that's all folks" > operation seems better. I had considered a multi-skb operation originally, but the problem was in my case I would have had to come up with a more complex buffering mechanism to generate a stream of skbs before handing them off to the device. By letting the transmit path proceed normally I shouldn't have any effect on things like the byte queue limits for the transmit queues and such. The wierd bit is how this issue was showing up. I don't know if you recall my presentation from plumbers last year, but one of the things I had brought up was the qdisc spinlock being an issue. However it was actually this MMIO write that was causing the problem because it was posting a write to non-coherent memory and then the spinlock was getting stalled behind the write and couldn't complete until the write was completed. With this change in place and the dispatch_limit set to something like 31 I see the CPU utilization for spinlocks drop from 15% (90% sch_direct_xmit / 10% dev_queue_xmit) to 5% (66% sch_direct_xmit / 33% dev_queue_xmit). Makes me wonder what other hotspots we have in the drivers that can be resolved by avoiding MMIO followed by locked operations. Thanks, Alex ^ permalink raw reply [flat|nested] 14+ messages in thread
end of thread, other threads:[~2012-07-13 16:23 UTC | newest] Thread overview: 14+ messages (download: mbox.gz / follow: Atom feed) -- links below jump to the message on this page -- 2012-07-12 0:25 [RFC PATCH 0/2] Coalesce MMIO writes for transmits Alexander Duyck 2012-07-12 0:26 ` [RFC PATCH 1/2] net: Add new network device function to allow for MMIO batching Alexander Duyck 2012-07-12 7:14 ` Eric Dumazet 2012-07-12 15:39 ` Alexander Duyck 2012-07-13 7:38 ` Eric Dumazet 2012-07-13 15:37 ` Alexander Duyck 2012-07-13 15:50 ` Stephen Hemminger 2012-07-13 16:23 ` Eric Dumazet 2012-07-13 7:19 ` Eric Dumazet 2012-07-13 15:49 ` Alexander Duyck 2012-07-13 16:18 ` Eric Dumazet 2012-07-12 0:26 ` [RFC PATCH 2/2] ixgbe: Add functionality for delaying the MMIO write for Tx Alexander Duyck 2012-07-12 17:23 ` [RFC PATCH 0/2] Coalesce MMIO writes for transmits Stephen Hemminger 2012-07-12 19:01 ` Alexander Duyck
This is an external index of several public inboxes, see mirroring instructions on how to clone and mirror all data and code used by this external index.