From: Alexei Starovoitov
Subject: Re: [RFC PATCH 1/5] bpf: add PHYS_DEV prog type for early driver filter
Date: Tue, 5 Apr 2016 15:06:49 -0700
Message-ID: <20160405220647.GA95458@ast-mbp.thefacebook.com>
In-Reply-To: <20160405112905.66b84e13@redhat.com>
References: <57022A85.6040002@iogearbox.net> <20160404150700.1456ae80@redhat.com> <57026DFA.3090201@iogearbox.net> <20160404171227.1f862cb1@redhat.com> <20160404152948.GA495@gmail.com> <57029127.3040303@gmail.com> <20160404161720.GB495@gmail.com> <20160404200032.GA69842@ast-mbp.thefacebook.com> <20160405112905.66b84e13@redhat.com>
To: Jesper Dangaard Brouer
Cc: Brenden Blanco, John Fastabend, Tom Herbert, Daniel Borkmann, "David S. Miller", Linux Kernel Network Developers, ogerlitz@mellanox.com

On Tue, Apr 05, 2016 at 11:29:05AM +0200, Jesper Dangaard Brouer wrote:
> >
> > Of course, there are other pieces to accelerate:
> >  12.71%  ksoftirqd/1  [mlx4_en]         [k] mlx4_en_alloc_frags
> >   6.87%  ksoftirqd/1  [mlx4_en]         [k] mlx4_en_free_frag
> >   4.20%  ksoftirqd/1  [kernel.vmlinux]  [k] get_page_from_freelist
> >   4.09%  swapper      [mlx4_en]         [k] mlx4_en_process_rx_cq
> > and I think Jesper's work on batch allocation is going to help that a lot.
>
> Actually, it looks like all of this "overhead" comes from the page
> alloc/free (+ dma unmap/map).  We would need a page-pool recycle
> mechanism to solve/remove this overhead.  For the early drop case we
> might be able to hack recycle the page directly in the driver (and also
> avoid dma_unmap/map cycle).

Exactly. A cache of allocated and mapped pages will help both the drop
and the redirect use cases a lot. After tx completion we can recycle the
still-mapped page into the cache (we need to make sure to map the pages
PCI_DMA_BIDIRECTIONAL) and rx can refill the ring from it.
In load balancer steady state we won't have any calls to the page
allocator or to dma map/unmap.
Being able to do a cheap percpu pool like this is a huge advantage that
no kernel bypass can have.
I'm pretty sure it will be possible to avoid local_cmpxchg as well.
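
To make the recycle idea concrete, here is a rough user-space model of
such a cache. It is only a sketch, not driver code: every name in it
(struct page_cache, cache_get, cache_put, simulate, RING_SIZE,
CACHE_SIZE) is made up for illustration, malloc() stands in for the page
allocator, and the "dma" field stands in for a bidirectional DMA
mapping. It assumes the cache is touched only by its owning CPU, which
is why it uses plain loads and stores with no atomics.

```c
#include <assert.h>
#include <stddef.h>
#include <stdint.h>
#include <stdlib.h>

#define PAGE_SZ    4096
#define RING_SIZE  64    /* hypothetical rx ring size */
#define CACHE_SIZE 256   /* hypothetical per-CPU cache depth */

struct mapped_page {
    void *addr;          /* stands in for struct page */
    uintptr_t dma;       /* stands in for a bidirectional DMA mapping */
};

/* One instance per CPU; only the owning CPU touches it, so no atomics. */
struct page_cache {
    struct mapped_page slot[CACHE_SIZE];
    size_t count;
    unsigned long slow_allocs;  /* fallbacks to allocator + dma map */
    unsigned long slow_frees;   /* fallbacks to dma unmap + free */
};

/* Slow path: models page allocation followed by a DMA map. */
static int slow_alloc(struct page_cache *pc, struct mapped_page *out)
{
    out->addr = malloc(PAGE_SZ);
    if (!out->addr)
        return -1;
    out->dma = (uintptr_t)out->addr;
    pc->slow_allocs++;
    return 0;
}

/* Slow path: models a DMA unmap followed by freeing the page. */
static void slow_free(struct page_cache *pc, struct mapped_page *p)
{
    free(p->addr);
    pc->slow_frees++;
}

/* rx refill: prefer a cached, still-mapped page over the allocator. */
static int cache_get(struct page_cache *pc, struct mapped_page *out)
{
    if (pc->count) {
        *out = pc->slot[--pc->count];
        return 0;
    }
    return slow_alloc(pc, out);
}

/* Early drop or tx completion: keep the page mapped and cache it. */
static void cache_put(struct page_cache *pc, struct mapped_page *p)
{
    if (pc->count < CACHE_SIZE) {
        pc->slot[pc->count++] = *p;
        return;
    }
    slow_free(pc, p);
}

/*
 * Fill the ring, push `packets` packets through drop/tx-completion plus
 * refill, then tear everything down. Returns the number of slow-path
 * allocations, or -1 if the initial ring fill fails.
 */
static long simulate(struct page_cache *pc, unsigned long packets)
{
    struct mapped_page ring[RING_SIZE];
    unsigned long n;
    size_t i;
    int rc;

    for (i = 0; i < RING_SIZE; i++) {
        if (cache_get(pc, &ring[i]) != 0) {
            while (i > 0)
                slow_free(pc, &ring[--i]);
            return -1;
        }
    }
    for (n = 0; n < packets; n++) {
        struct mapped_page *p = &ring[n % RING_SIZE];

        cache_put(pc, p);       /* packet dropped or transmitted */
        rc = cache_get(pc, p);  /* refill the ring slot from the cache */
        assert(rc == 0);
        (void)rc;
    }
    for (i = 0; i < RING_SIZE; i++)
        slow_free(pc, &ring[i]);
    while (pc->count)
        slow_free(pc, &pc->slot[--pc->count]);
    return (long)pc->slow_allocs;
}
```

In this model the only slow-path allocations are the RING_SIZE ones made
while filling the ring, so simulate() returns the same value no matter
how many packets are pushed through it. A real driver would also need a
policy for a full cache, for pages still referenced elsewhere, and for
syncing the buffer for the device before reuse; none of that is modelled
here.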