From: Gregory Haskins <ghaskins@novell.com>
To: David Miller <davem@davemloft.net>
Cc: vernux@us.ibm.com, andi@firstfloor.org, netdev@vger.kernel.org,
	linux-kernel@vger.kernel.org, linux-rt-users@vger.kernel.org,
	pmullaney@novell.com
Subject: Re: High contention on the sk_buff_head.lock
Date: Wed, 18 Mar 2009 23:48:46 -0400
Message-ID: <49C1C09E.8050405@novell.com>
In-Reply-To: <20090318.180355.228447835.davem@davemloft.net>

David Miller wrote:
> From: Gregory Haskins <ghaskins@novell.com>
> Date: Wed, 18 Mar 2009 17:54:04 -0400
>
>   
>> Note that -rt doesn't typically context-switch under contention anymore
>> since we introduced adaptive-locks.  Also note that the contention
>> against the lock is still contention, regardless of whether you have -rt
>> or not.  It's just that the slow-path to handle the contended case for
>> -rt is more expensive than mainline.  However, once you have the
>> contention as stated, you have already lost.
>>     
>
> First, contention is not implicitly a bad thing.
>   
However, when the contention in question is your top bottleneck, even
small improvements have the potential to yield large performance gains. :)

> Second, if the -rt kernel is doing adaptive spinning I see no
> reason why that adaptive spinning is not kicking in here

It does.  Things would be *much* worse if it wasn't.

>  to
> make this problem just go away.
>
Well, "go away" is a relative term.  Before adaptive-locks, the box was
heavily context-switching under a network workload like we are
discussing, and that is where much of the performance was lost. 
Adaptive-locks mitigate that particular problem so that spinlock_t
clients now more closely model mainline behavior (i.e. spin under
contention, at least when they can) and this brought -rt much closer to
what you expect for performance from mainline.
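
For the archives, the idea behind the adaptive slow path is roughly the
following.  This is an illustrative sketch, *not* the real rtmutex code
from the -rt patch; lock_try_acquire(), lock_owner(), owner_is_on_cpu(),
and lock_sleep_with_pi() are hypothetical stand-ins for the internals:

/*
 * Illustrative sketch only -- not the actual -rt slow path.  A
 * contended waiter keeps spinning as long as the lock owner is
 * running on another CPU, and only falls back to the sleeping (PI)
 * path when the owner has been preempted.
 */
static void adaptive_lock_slowpath(struct adaptive_lock *lock)
{
	for (;;) {
		struct task_struct *owner;

		if (lock_try_acquire(lock))
			return;			/* got the lock */

		owner = lock_owner(lock);
		if (owner && owner_is_on_cpu(owner)) {
			cpu_relax();		/* owner is live: keep spinning */
			continue;
		}

		/* Owner is off-CPU: spinning is futile, block with PI. */
		lock_sleep_with_pi(lock);
	}
}

The waiter only gives up the CPU when the owner itself is not running,
which is what restored most of the mainline-like behavior under
contention.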

However, in -rt the contention path is, by necessity, still moderately
more heavyweight than a simple mainline spin (PI, etc.), so any
contention will still hurt relatively more.  And obviously
adaptive-locks do nothing to address the contention itself.  Therefore,
as long as the contention remains it will have a higher impact in -rt.
But make no mistake: the contention can (and I believe does) affect
both trees.

To see this in action, try taking a moderately large SMP system (8-way+)
and scaling the number of flows.  At one flow the stack is generally
quite capable of maintaining line-rate (at least up to GigE).  With two
flows, each should achieve roughly 50% of line-rate, for a sum total of
line-rate.  But as you add flows to the equation, the aggregate
bandwidth typically starts to drop off to something less than line-rate
(often it is actually much less).  And when this happens, the contention
in question is usually at the top of the charts (thus all the interest
in this thread).
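
If anyone wants to reproduce the trend without a full benchmark rig,
something along these lines is enough (a quick userspace sketch, not the
harness we actually used; it assumes a simple sink on the far end that
just reads and discards, and the host/port/flow-count handling is ad
hoc):

/*
 * Open N parallel TCP streams to a discard-style sink and print the
 * aggregate throughput, to watch how the sum behaves as flows scale.
 */
#include <arpa/inet.h>
#include <netinet/in.h>
#include <pthread.h>
#include <stdio.h>
#include <stdlib.h>
#include <string.h>
#include <sys/socket.h>
#include <time.h>
#include <unistd.h>

#define DURATION_SEC	10
#define BUF_SIZE	(64 * 1024)

static const char *g_host;
static int g_port;

static void *flow_thread(void *arg)
{
	long long *sent = arg;
	char buf[BUF_SIZE];
	struct sockaddr_in sa;
	int fd = socket(AF_INET, SOCK_STREAM, 0);
	time_t end = time(NULL) + DURATION_SEC;

	memset(buf, 0xa5, sizeof(buf));
	memset(&sa, 0, sizeof(sa));
	sa.sin_family = AF_INET;
	sa.sin_port = htons(g_port);
	inet_pton(AF_INET, g_host, &sa.sin_addr);

	if (fd < 0 || connect(fd, (struct sockaddr *)&sa, sizeof(sa)) < 0) {
		perror("connect");
		return NULL;
	}

	while (time(NULL) < end) {	/* one flow: blast for the interval */
		ssize_t n = send(fd, buf, sizeof(buf), 0);
		if (n <= 0)
			break;
		*sent += n;
	}
	close(fd);
	return NULL;
}

int main(int argc, char **argv)
{
	int flows = (argc > 3) ? atoi(argv[3]) : 2;
	long long total = 0;

	if (argc < 3) {
		fprintf(stderr, "usage: %s <host> <port> [flows]\n", argv[0]);
		return 1;
	}
	if (flows < 1)
		flows = 1;
	g_host = argv[1];
	g_port = atoi(argv[2]);

	pthread_t tid[flows];
	long long bytes[flows];

	for (int i = 0; i < flows; i++) {
		bytes[i] = 0;
		pthread_create(&tid[i], NULL, flow_thread, &bytes[i]);
	}
	for (int i = 0; i < flows; i++) {
		pthread_join(tid[i], NULL);
		total += bytes[i];
	}
	printf("%d flows: aggregate %.1f Mbit/s\n",
	       flows, total * 8.0 / DURATION_SEC / 1e6);
	return 0;
}

Build with something like gcc -O2 -pthread, run it against the sink
while varying the flow count, and watch what happens to the aggregate
number.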

> This lock is held for mere cycles, just to unlink an SKB from
> the networking qdisc, and then it is immediately released.
>   
To clarify: I haven't been looking at the stack code since last summer,
so some things may have changed since then.  However, the issue back
then, from my perspective, was the general backpressure in the qdisc
subsystem (e.g. dev->queue_lock), not just the sk_buff_head.lock per
se.  This dev->queue_lock can be held for quite a long time, and the
mechanism in general scales poorly as fabric speeds and core counts
increase.  (It is arguable whether you would even want another
buffering layer on any reasonably fast interconnect anyway.  It
ultimately just destabilizes the flows and is probably best left to the
underlying hardware, which will typically have the proper facilities
for shaping/limiting, etc.  However, that is a topic for another
discussion ;).)
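
For anyone who hasn't stared at that path recently, the serialization I
am referring to looked roughly like this back then (hand-written from
memory and heavily simplified, so treat it as a sketch rather than the
actual dev_queue_xmit()):

/*
 * Grossly simplified sketch of the TX path of that era.  The point is
 * that every sender on every CPU funnels through the single per-device
 * dev->queue_lock for both the enqueue and, via qdisc_run(), the
 * dequeue/transmit, so that cache line ping-pongs as flows scale.
 */
int dev_queue_xmit_sketch(struct sk_buff *skb, struct net_device *dev)
{
	struct Qdisc *q;
	int rc;

	spin_lock(&dev->queue_lock);	/* one lock per device, all CPUs */
	q = dev->qdisc;
	rc = q->enqueue(skb, q);	/* qdisc enqueue under the lock */
	qdisc_run(dev);			/* dequeue + hand off to the driver,
					 * bouncing between queue_lock and
					 * the driver's xmit lock */
	spin_unlock(&dev->queue_lock);

	return rc;
}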

One approach that we found yielded some significant improvements here
was to bypass the qdisc outright and do per-cpu lockless queuing.  One
thread/core could then simply aggregate the per-cpu buffers, with only
minor contention between the egress (consuming) core and each producer.
I realize that this solution as-is may not be feasible for production
use.  However, it did serve to prove the theory that backing off the
cache-line pressure that scales with the core count *did* help the
stack make more forward progress per unit time (i.e. higher throughput)
as the number of flows increased.  The idea was more or less to
refactor the egress/TX path to be more per-cpu, similar to the
methodology of the ingress/RX side.
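
In case it helps to picture it, the shape of the experiment was roughly
the following.  This is a hypothetical standalone sketch using C11
atomics rather than the actual patch; pcpu_tx_enqueue(), pcpu_tx_drain(),
the ring size, and so on are illustrative only:

/*
 * Each CPU owns one single-producer ring; a single egress thread is
 * the only consumer, so no lock is needed -- just acquire/release
 * ordering on the free-running head/tail indices.
 */
#include <stdatomic.h>
#include <stddef.h>

#define RING_SIZE	256		/* power of two */
#define MAX_CPUS	16

struct skb;				/* stand-in for struct sk_buff */

struct pcpu_ring {
	struct skb *slot[RING_SIZE];
	_Atomic unsigned int head;	/* written only by the producer CPU */
	_Atomic unsigned int tail;	/* written only by the egress thread */
};

static struct pcpu_ring tx_ring[MAX_CPUS];

/* Producer side: called on the sending CPU, no shared lock taken. */
static int pcpu_tx_enqueue(int cpu, struct skb *skb)
{
	struct pcpu_ring *r = &tx_ring[cpu];
	unsigned int head = atomic_load_explicit(&r->head, memory_order_relaxed);
	unsigned int tail = atomic_load_explicit(&r->tail, memory_order_acquire);

	if (head - tail == RING_SIZE)
		return -1;		/* ring full: drop or back off */

	r->slot[head & (RING_SIZE - 1)] = skb;
	atomic_store_explicit(&r->head, head + 1, memory_order_release);
	return 0;
}

/* Consumer side: one egress thread/core sweeps every per-cpu ring and
 * hands the skbs to the device, aggregating without a shared qdisc lock. */
static void pcpu_tx_drain(int ncpus, void (*xmit)(struct skb *))
{
	for (int cpu = 0; cpu < ncpus; cpu++) {
		struct pcpu_ring *r = &tx_ring[cpu];
		unsigned int tail = atomic_load_explicit(&r->tail, memory_order_relaxed);
		unsigned int head = atomic_load_explicit(&r->head, memory_order_acquire);

		while (tail != head) {
			xmit(r->slot[tail & (RING_SIZE - 1)]);
			tail++;
		}
		atomic_store_explicit(&r->tail, tail, memory_order_release);
	}
}

The important property is that each ring has exactly one producer (its
CPU) and one consumer (the egress thread), so the only cross-core
traffic is on the head/tail indices and the skb pointers themselves,
instead of a single hot lock shared by every sender.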

You could probably adapt this same concept to work harmoniously with the
qdisc infrastructure (though as previously indicated, I am not sure if
we would *want* to ;)).  Something to consider, anyway.

Regards,
-Greg




