From: L A Walsh <lkml@tlinx.org>
To: lartc@vger.kernel.org
Subject: Re: Traffic shaping at 10~300mbps at a 10Gbps link
Date: Wed, 07 Jul 2021 02:10:21 +0000	[thread overview]
Message-ID: <60E50D0D.8000309@tlinx.org> (raw)
In-Reply-To: <20210607133853.045a96d5@babalu>

On 2021/06/09 08:04, Ethy H. Brito wrote:
> Hi Guys!
>
> Can't anybody help me with this?
> Any help will be appreciated.
>
> Cheers
>
> Ethy
>   
You may need specialized hardware, or a powerful multi-CPU
machine, to process that much traffic in real time.

I tried just getting full throughput on a 10Gb interconnect
(no switches) between a server & a Windows client.  Actually, I
first tried pairing two of those interfaces, since the Intel cards
came with two ports each.  That was way crazy -- I couldn't get
more than about 300-400MB/s, with lots of dropped packets amid
saturated CPUs.

(Note: MB = 1024**2 bytes, a byte already being 2**3 bits; anywhere
I say Mb, that is 1000*1000 bits -- so base-2 prefixes for base-2
units, and base-10 prefixes for counting single bits.)

With one port, I can hit 600MB/s reads + 400MB/s writes, depending
on tuning of packet sizes -- but smaller write sizes put more load
(100% CPU) on the receiver, while larger ones put more load on the
sender.  Those numbers were Samba speeds to/from memory (not even
touching disk), and that was with NO shaping attempts.

I'm sure I'm doing things incorrectly, since trying to shape my
outside connection cost me efficiency even on a 25Mb down/10Mb up
link, which is fairly slow by most standards.

Trying to traffic-shape any significant traffic @10Gb, you are
going to need multiple fast CPUs, but the problem is that splitting
the traffic between CPUs loses cache efficiency, which becomes
significant at those speeds.  So you would need dedicated HW (not
someone's workstation), and should try to use the card's on-board
flow steering into multiple on-card queues, delivering separate
queues to separate CPUs.  It isn't likely to be pretty, but you
might get some benefit by leveraging the card's on-chip flow
separation before the traffic hits your system's interrupt
queues.
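
If you want to experiment with that, something like this is where
I'd poke first (a sketch only, untested here -- "eth5", the queue
count, and the subnet are placeholders, and mask support in -N
rules varies by driver):

  # show how many RX/TX queues the NIC exposes
  ethtool -l eth5
  # use as many combined queues as you have CPUs to dedicate
  ethtool -L eth5 combined 12
  # enable n-tuple flow steering, then steer a flow class to a
  # queue, e.g. TCP traffic for one /24 to RX queue 3
  ethtool -K eth5 ntuple on
  ethtool -N eth5 flow-type tcp4 dst-ip 192.168.1.0 m 0.0.0.255 action 3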

I know the Intel cards have a fair set of flow-differentiation
features, but I have never used them -- I would also try to bind
interrupt affinity so that separate queues go to separate CPUs
(if possible); I'm not sure.

Looks like my ethernet card,
  07:00.0 Ethernet controller: Intel Corporation Ethernet Controller 10-Gigabit X540-AT2 (rev 01) (1 of 2)
has eth5-TxRx[0-11] mapped to interrupts [79-90].
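
You can check your own card's mapping the same way I did -- the
per-queue names show up in /proc/interrupts (substitute your own
interface name; the exact format varies a bit by kernel):

  grep eth5-TxRx /proc/interrupts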

If you can get the interrupt-servicing affinity set to the same
CPU that processes the packets, you'll get a large optimization.
I really don't know about trying to run the IP routing rules on a
per-queue or affinity-bound-CPU basis, though.  That doesn't mean
it can't be done, but _ignorant_ me doesn't know of anyone who
has done it.
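
Pinning the IRQs themselves is straightforward, though -- a sketch,
assuming my eth5-TxRx[0-11] -> IRQ [79-90] mapping above (needs
root, and irqbalance will undo it if left running):

  systemctl stop irqbalance
  cpu=0
  for irq in $(seq 79 90); do
      # bind each queue's interrupt to its own CPU
      echo $cpu > /proc/irq/$irq/smp_affinity_list
      cpu=$((cpu + 1))
  done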

I know this is pretty general, but it's about the most depth I
have in this area -- hopefully it is of some help!

-linda Walsh


>
> On Mon, 7 Jun 2021 13:38:53 -0300
> "Ethy H. Brito" <ethy.brito@inexo.com.br> wrote:
>
>   
>> Hi
>>
>> For a few days now I have been having a hard time trying to shape 3000 users at ceil speeds from 10 to 300mbps on a
>> 7/7Gbps link using HTB+SFQ+TC (filtering by IP hashkey mask), tweaking HTB and SFQ parameters with no luck so far.
>>
>> Everything seems right up to 4Gbps overall download speed with shaping on.
>> There is no significant packet delay, no dropped packets, and no high average CPU load (not more than 20%, per htop).
>>
>> But when the speed reaches about 4.5Gbps download (upload is about 500mbps), chaos kicks in.
>> CPU load goes sky high (all 24x2.4GHz physical cores above 90%; 48x2.4GHz if you count the virtual cores, with
>> virtualization on) and, as a consequence, packets are dropped (as reported by tc -s class show ...), RTT goes above
>> 200ms, and there are a lot of angry users. This happens from about 7PM to 11PM every day.
>>
>> If I turn shaping off, everything returns to normal immediately: peaks of not more than 5Gbps (1-second average) are
>> observed, with a CPU load of about 5%. So I infer the uplink is not saturated.
>>
>> I use one root HTB qdisc and one root (1:) HTB class.
>> Then there are about 20~30 same-level (1:xx) inner classes to (sort of) separate the users by region.
>> And under these inner classes go the almost 3000 leaves (1:xxxx). [A sketch of this layout follows at the end of
>> this message.]
>> One inner class has about 900 users, and the count decreases across the other inner classes, some of them having
>> just one user.
>>
>> Is the way I'm using HTB+SFQ+TC suitable for this job?
>>
>> Since the script that creates the shaping environment is quite long, I am not posting it here.
>>
>> What information can I give you to help me solve this?
>> Fragments of code, stats, some measurements? What?
>>
>> Thanks.
>>
>> Regards
>>
>> Ethy
>>     
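
For reference, the hierarchy described above looks roughly like
this -- a sketch only, with a made-up device, rates, and class
ids, since the actual script wasn't posted:

  # root qdisc and root class (1:)
  tc qdisc add dev eth0 root handle 1: htb
  tc class add dev eth0 parent 1: classid 1:1 htb rate 7gbit

  # one inner class per region (1:xx)
  tc class add dev eth0 parent 1:1 classid 1:10 htb rate 1gbit ceil 7gbit

  # one leaf per user (1:xxxx), each with an SFQ attached
  tc class add dev eth0 parent 1:10 classid 1:1001 htb rate 10mbit ceil 300mbit
  tc qdisc add dev eth0 parent 1:1001 handle 1001: sfq perturb 10

  # hashed u32 filters, so ~3000 IP matches don't scan linearly:
  # a 256-bucket hash table keyed on the last octet of the dst IP
  tc filter add dev eth0 parent 1: prio 1 handle 2: protocol ip u32 divisor 256
  tc filter add dev eth0 parent 1: prio 1 protocol ip u32 ht 800:: \
      match ip dst 10.0.0.0/8 hashkey mask 0x000000ff at 16 link 2:
  tc filter add dev eth0 parent 1: prio 1 protocol ip u32 ht 2:1: \
      match ip dst 10.0.0.1 flowid 1:1001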
