linux-kernel.vger.kernel.org archive mirror
From: Kirill Tkhai <ktkhai@virtuozzo.com>
To: Eric Dumazet <edumazet@google.com>
Cc: Peter Zijlstra <peterz@infradead.org>,
	David Miller <davem@davemloft.net>,
	Daniel Borkmann <daniel@iogearbox.net>,
	tom@quantonium.net, netdev <netdev@vger.kernel.org>,
	LKML <linux-kernel@vger.kernel.org>
Subject: Re: [RFC] net;sched: Try to find idle cpu for RPS to handle packets
Date: Wed, 19 Sep 2018 18:58:44 +0300
Message-ID: <84d38f11-2133-9d34-e468-d2ef16715f49@virtuozzo.com>
In-Reply-To: <CANn89iK8X5cW3=YnNRrKo=BVCFJkJ0D22YY_eJFLyGCX+5SxsQ@mail.gmail.com>

On 19.09.2018 18:49, Eric Dumazet wrote:
> On Wed, Sep 19, 2018 at 8:41 AM Kirill Tkhai <ktkhai@virtuozzo.com> wrote:
>>
>> On 19.09.2018 17:55, Eric Dumazet wrote:
>>> On Wed, Sep 19, 2018 at 5:29 AM Kirill Tkhai <ktkhai@virtuozzo.com> wrote:
>>>>
>>>> Many workloads operate in polling mode. The application
>>>> checks for incoming packets from time to time, but it also
>>>> has work to do when there are no packets. This RFC
>>>> develops the idea of queueing RPS packets on an idle
>>>> CPU in the L3 domain of the consumer, so backlog
>>>> processing of the packets and the application can execute
>>>> in parallel.
>>>>
>>>> We need this when a network card does not
>>>> have enough RX queues to cover all online CPUs (this seems
>>>> to be the case for most cards), and get_rps_cpu() actually chooses
>>>> a remote CPU, so an SMP interrupt is sent. Here we can do
>>>> our best and try to find an idle CPU near the consumer's CPU.
>>>> Note that if the consumer works in poll mode and
>>>> does not wait for incoming packets, its CPU will not be
>>>> idle, while the CPU of a sleeping consumer may be idle. So
>>>> non-polling consumers will still be able to have their skbs
>>>> handled on their own CPU.
>>>>
>>>> If the network card has many queues, the device
>>>> interrupts will arrive on the consumer's CPU, and this patch
>>>> won't try to find an idle CPU for them.
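
The selection policy described above can be sketched as a toy model. This is only an illustration of the three cases in the text (interrupt already on the consumer's CPU, sleeping consumer, polling consumer); it is not the code from the RFC patch, and the names pick_backlog_cpu, llc_siblings and idle are hypothetical.

```python
def pick_backlog_cpu(irq_cpu, consumer_cpu, llc_siblings, idle):
    """Toy model of the policy described in the RFC text (not kernel code).

    irq_cpu      -- CPU that received the device interrupt
    consumer_cpu -- CPU that RPS chose for the consumer
    llc_siblings -- dict: cpu -> set of CPUs sharing its L3 cache (assumed input)
    idle         -- set of CPUs currently considered idle (assumed input)
    """
    # Many-queue NIC case: the interrupt already arrived on the
    # consumer's CPU, so no remote CPU is involved and nothing is searched.
    if consumer_cpu == irq_cpu:
        return consumer_cpu
    # Sleeping consumer: its CPU is idle, so the skb is handled there.
    if consumer_cpu in idle:
        return consumer_cpu
    # Polling consumer: its CPU is busy, so look for an idle CPU
    # in the same L3 domain.
    for cpu in sorted(llc_siblings.get(consumer_cpu, ())):
        if cpu != consumer_cpu and cpu in idle:
            return cpu
    # Nothing idle nearby: fall back to the CPU RPS chose.
    return consumer_cpu


# Example: four CPUs sharing one L3, consumer polling on CPU 1.
llc = {cpu: {0, 1, 2, 3} for cpu in range(4)}
print(pick_backlog_cpu(0, 1, llc, idle={2, 3}))
```

In this model the fallback always returns the consumer's CPU, so behaviour is unchanged whenever no idle CPU is found in the L3 domain.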
>>>>
>>>> I've tried a simple netperf test for this:
>>>> netserver -p 1234
>>>> netperf -L 127.0.0.1 -p 1234 -l 100
>>>>
>>>> Before:
>>>>  87380  16384  16384    100.00   60323.56
>>>>  87380  16384  16384    100.00   60388.46
>>>>  87380  16384  16384    100.00   60217.68
>>>>  87380  16384  16384    100.00   57995.41
>>>>  87380  16384  16384    100.00   60659.00
>>>>
>>>> After:
>>>>  87380  16384  16384    100.00   64569.09
>>>>  87380  16384  16384    100.00   64569.25
>>>>  87380  16384  16384    100.00   64691.63
>>>>  87380  16384  16384    100.00   64930.14
>>>>  87380  16384  16384    100.00   62670.15
>>>>
>>>> The best runs differ by +7%,
>>>> and the worst runs differ by +8%.
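
The two percentages can be re-derived from the last column of the runs listed above (the throughput column of the netperf output). A minimal check, where gain_percent is just a helper name chosen for this sketch:

```python
# Throughput values copied from the "Before" and "After" runs above.
before = [60323.56, 60388.46, 60217.68, 57995.41, 60659.00]
after = [64569.09, 64569.25, 64691.63, 64930.14, 62670.15]


def gain_percent(old, new):
    """Relative change from old to new, in percent."""
    return (new / old - 1.0) * 100.0


best_gain = gain_percent(max(before), max(after))
worst_gain = gain_percent(min(before), min(after))
print(f"best: {best_gain:+.1f}%  worst: {worst_gain:+.1f}%")
# prints: best: +7.0%  worst: +8.1%
```

This compares best against best and worst against worst, as the text does; it says nothing about run-to-run variance beyond these five samples.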
>>>>
>>>> What do you think about proceeding in this direction?
>>>
>>> Hi Kirill
>>>
>>> In my experience, the scheduler has a poor view of softirq processing
>>> happening on various CPUs.
>>> A CPU spending 90% of its cycles processing IRQs might be considered 'idle'
>>
>> Yes, when softirq runs on top of irq_exit(), the CPU is not
>> considered busy. But after MAX_SOFTIRQ_TIME (=2ms), ksoftirqd is
>> woken up to execute the work in process context, and the processor is
>> considered !idle. 2ms is 2 timer ticks with HZ=1000. So we
>> don't restart softirq if it has been executing for more than 2ms.
>>
> 
> That's the theory, but reality is very different unfortunately.
> 
> If RFS/RPS is set up properly, we really do not hit the MAX_SOFTIRQ_TIME
> condition, except maybe in some synthetic benchmarks.
> 
>> Similarly, a single net_rx_action() can't execute for longer
>> than 2ms.
>>
>> Having 90% load in softirq (called on top of irq_exit()) should be
>> a very unlikely situation, where there are too many interrupts and the
>> related softirq calls do only a small amount of work for each of them.
>> I think it would be a problem even in the plain NAPI case, since it would
>> not work as expected.
>>
>> But anyway. You worry that while handling the next portion of skbs,
>> we find that the previous portion of skbs has already woken ksoftirqd, and
>> we don't see this CPU as idle? Yeah, then we'll try to change CPU,
>> and this is not what we want. We want to continue using the CPU where
>> the previous portion was handled. Hm, I can't answer right away, but certainly
>> this may be handled somehow in a more creative way.
>>
>>> So please run a real workload (it is _very_ uncommon for anyone to set up RPS
>>> on the lo interface!)
>>>
>>> Like 400 or more concurrent netperf -t TCP_RR on a 10Gbit NIC.
>>
>> Yeah, it's just a simulation of a single-IRQ NIC. I'll try on
>> more realistic hardware.
> 
> Also my concern is that you might have results that are tied to a particular
> version of process scheduling, platform, workload...
> 
> One month later, a small change in the process scheduler,
> and very different results.

Maybe, but the logic of that particular function has not changed for a long
time, 10 years at least. The only change is that Peter recently added idle
core searching functionality.

> This is why I believe this new feature must be controllable, via a new
> tunable (like RPS/RFS are controllable per rx queue)
> 
>>
>> How do you run such tests? I don't see an appropriate netperf
>> parameter. Does this mean just starting 400 copies of netperf? How do
>> you aggregate their results in that case?
> 
> Yeah, there are various 'super_netperf' scripts available on the net
> (almost trivial to write anyway)
> 
> ( I am attaching one of them)

Thanks.
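
For reference, the aggregation such a wrapper performs can be sketched roughly as follows. This is not the attached script: it is a minimal sketch that assumes each netperf copy is started with -P 0 (banner suppressed) and prints one all-numeric result line whose last field is the throughput or transaction rate, like the lines quoted earlier in this thread. The names parse_result, aggregate and run_parallel are hypothetical.

```python
import subprocess


def parse_result(output):
    """Return the last field of the first all-numeric line with >= 5 fields.

    Assumption: this matches netperf result lines such as
    " 87380  16384  16384    100.00   60323.56".
    """
    for line in output.splitlines():
        fields = line.split()
        if len(fields) < 5:
            continue
        try:
            values = [float(f) for f in fields]
        except ValueError:
            continue  # header or other non-numeric line
        return values[-1]
    raise ValueError("no result line found")


def aggregate(outputs):
    """Sum the per-copy results into one number."""
    return sum(parse_result(out) for out in outputs)


def run_parallel(copies, netperf_args):
    """Start `copies` netperf processes at once and sum their results."""
    procs = [
        subprocess.Popen(["netperf", "-P", "0", *netperf_args],
                         stdout=subprocess.PIPE, text=True)
        for _ in range(copies)
    ]
    return aggregate(p.communicate()[0] for p in procs)
```

A call such as run_parallel(400, ["-H", "192.0.2.1", "-t", "TCP_RR", "-l", "30"]) would then give the summed transaction rate; the address here is a placeholder, not a host from this thread.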

> Thanks.
>>
>>> Thanks.
>>>
>>> PS: Idea of playing with L3 domains is interesting, I have personally
>>> tried various strategies in the past but none of them
>>> demonstrated a clear win.
>>
>> Thanks,
>> Kirill


Thread overview: 6+ messages
2018-09-19 12:28 [RFC] net;sched: Try to find idle cpu for RPS to handle packets Kirill Tkhai
2018-09-19 14:55 ` Eric Dumazet
2018-09-19 15:41   ` Kirill Tkhai
2018-09-19 15:49     ` Eric Dumazet
2018-09-19 15:58       ` Kirill Tkhai [this message]
2018-09-27 16:17         ` Willem de Bruijn
