From: Eric Dumazet
Subject: Re: [PATCH RFC net-next] netif_receive_skb performance
Date: Tue, 28 Apr 2015 22:23:00 -0700
Message-ID: <1430284980.3711.38.camel@edumazet-glaptop2.roam.corp.google.com>
To: Alexei Starovoitov
Cc: "David S. Miller", Eric Dumazet, Daniel Borkmann, Thomas Graf,
 Jamal Hadi Salim, John Fastabend, netdev@vger.kernel.org
In-Reply-To: <1430273488-8403-1-git-send-email-ast@plumgrid.com>

On Tue, 2015-04-28 at 19:11 -0700, Alexei Starovoitov wrote:
> Hi,
>
> there were many requests for performance numbers in the past, but not
> everyone has access to 10/40G NICs, and we need a common way to talk
> about RX path performance without the overhead of driver RX. That's
> especially important when making changes to netif_receive_skb.

Well, in real life, fetching the RX descriptor and the packet headers is
the main cost, and skb->users == 1.

So it's nice to try to optimize netif_receive_skb(), but make sure you
have something that really exercises the same code flows/stalls,
otherwise you'll be tempted by the wrong optimizations.

I would for example use a ring buffer, so that each skb you provide to
netif_receive_skb() has cold cache lines (at least skb->head, if you want
to mimic build_skb() or napi_get_frags()/napi_reuse_skb() behavior).

Also, this model of flooding one cpu (no irqs, no context switches) masks
latencies caused by code size, since the icache is fully populated with a
very specialized working set.

If we want to pursue this model (like user-space frameworks such as DPDK),
we might have to design a very different model than the IRQ-driven one,
by dedicating one or more cpu threads to run networking code with no
state transitions.
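
To make the ring-buffer suggestion concrete, here is a minimal sketch of
what such a benchmark pass could look like. This is not the patch under
discussion; bench_one_pass(), bench_dev, bench_ring, RING_SIZE and PKT_LEN
are made-up names. The idea is simply to pre-build a large batch of skbs
first, so that by the time each one reaches netif_receive_skb() its data
and skb->head cache lines have gone cold again, like with a real NIC ring:

#include <linux/netdevice.h>
#include <linux/skbuff.h>
#include <linux/etherdevice.h>

#define RING_SIZE	4096	/* large enough that caches evict the skbs again */
#define PKT_LEN		60	/* minimal Ethernet frame for the test */

static struct sk_buff *bench_ring[RING_SIZE];

static int bench_one_pass(struct net_device *bench_dev)
{
	int i;

	/* Build phase: allocate and touch every skb up front, so that by
	 * the time the delivery loop below reaches a given skb, its data
	 * and skb->head lines have long been evicted from L1/L2. */
	for (i = 0; i < RING_SIZE; i++) {
		struct sk_buff *skb = netdev_alloc_skb(bench_dev, PKT_LEN);

		if (!skb)
			goto nomem;
		skb_put(skb, PKT_LEN);
		/* ... fill a template Ethernet/IP/UDP frame here ... */
		skb->protocol = eth_type_trans(skb, bench_dev);
		bench_ring[i] = skb;
	}

	/* Timed phase: hand the cold skbs to the stack.  The stack
	 * consumes (and frees) each skb, so nothing is reused here. */
	local_bh_disable();
	for (i = 0; i < RING_SIZE; i++)
		netif_receive_skb(bench_ring[i]);
	local_bh_enable();
	return 0;

nomem:
	while (--i >= 0)
		kfree_skb(bench_ring[i]);
	return -ENOMEM;
}

A loop over several such passes, timed with ktime_get(), would give numbers
that include the cache misses a real driver RX path has to pay for.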