From: Magnus Karlsson
Date: Mon, 12 Oct 2020 11:37:28 +0200
Subject: Re: [PATCH bpf-next] xsk: introduce padding between ring pointers
To: Daniel Borkmann
Cc: "Karlsson, Magnus", Björn Töpel, Alexei Starovoitov, Network Development, Jonathan Lemon, bpf, John Fastabend
References: <1602166338-21378-1-git-send-email-magnus.karlsson@gmail.com> <43b0605d-f0c9-b81c-4d16-344a7832e083@iogearbox.net>
MIME-Version: 1.0
Content-Type: text/plain; charset="UTF-8"
X-Mailing-List: netdev@vger.kernel.org

On Mon, Oct 12, 2020 at 10:37 AM Magnus Karlsson wrote:
>
> On Fri, Oct 9, 2020 at 5:03 PM Daniel Borkmann wrote:
> >
> > On 10/8/20 4:12 PM, Magnus Karlsson wrote:
> > > From: Magnus Karlsson
> > >
> > > Introduce one cache line worth of padding between the producer and
> > > consumer pointers in all the lockless rings. This so that the HW
> > > adjacency prefetcher will not prefetch the consumer pointer when the
> > > producer pointer is used and vice versa. This improves throughput
> > > performance for the l2fwd sample app with 2% on my machine with HW
> > > prefetching turned on.
> > >
> > > Signed-off-by: Magnus Karlsson
> >
> > Applied, thanks!
> >
> > >  net/xdp/xsk_queue.h | 4 ++++
> > >  1 file changed, 4 insertions(+)
> > >
> > > diff --git a/net/xdp/xsk_queue.h b/net/xdp/xsk_queue.h
> > > index dc1dd5e..3c235d2 100644
> > > --- a/net/xdp/xsk_queue.h
> > > +++ b/net/xdp/xsk_queue.h
> > > @@ -15,6 +15,10 @@
> > >
> > >  struct xdp_ring {
> > >  	u32 producer ____cacheline_aligned_in_smp;
> > > +	/* Hinder the adjacent cache prefetcher to prefetch the consumer pointer if the producer
> > > +	 * pointer is touched and vice versa.
> > > +	 */
> > > +	u32 pad ____cacheline_aligned_in_smp;
> > >  	u32 consumer ____cacheline_aligned_in_smp;
> > >  	u32 flags;
> > >  };
> > >
> >
> > I was wondering whether we should even generalize this further for reuse
> > elsewhere e.g. ...
> >
> > diff --git a/include/linux/cache.h b/include/linux/cache.h
> > index 1aa8009f6d06..5521dab01649 100644
> > --- a/include/linux/cache.h
> > +++ b/include/linux/cache.h
> > @@ -85,4 +85,17 @@
> >  #define cache_line_size()	L1_CACHE_BYTES
> >  #endif
> >
> > +/*
> > + * Dummy element for use in structs in order to pad a cacheline
> > + * aligned element with an extra cacheline to hinder the adjacent
> > + * cache prefetcher to prefetch the subsequent struct element.
> > + */
> > +#ifndef ____cacheline_padding_in_smp
> > +# ifdef CONFIG_SMP
> > +# define ____cacheline_padding_in_smp	u8 :8 ____cacheline_aligned_in_smp
> > +# else
> > +# define ____cacheline_padding_in_smp
> > +# endif /* CONFIG_SMP */
> > +#endif
> > +
> >  #endif /* __LINUX_CACHE_H */
> > diff --git a/net/xdp/xsk_queue.h b/net/xdp/xsk_queue.h
> > index cdb9cf3cd136..1da36423e779 100644
> > --- a/net/xdp/xsk_queue.h
> > +++ b/net/xdp/xsk_queue.h
> > @@ -15,11 +15,9 @@
> >
> >  struct xdp_ring {
> >  	u32 producer ____cacheline_aligned_in_smp;
> > -	/* Hinder the adjacent cache prefetcher to prefetch the consumer
> > -	 * pointer if the producer pointer is touched and vice versa.
> > -	 */
> > -	u32 pad ____cacheline_aligned_in_smp;
> > +	____cacheline_padding_in_smp;
> >  	u32 consumer ____cacheline_aligned_in_smp;
> > +	____cacheline_padding_in_smp;
> >  	u32 flags;
> >  };
>
> This should be beneficial in theory, though I could not measure any
> statistically significant improvement. Though, the flags variable is
> touched much less frequently than the producer and consumer pointers,
> so that might explain it. We also need to make the compiler allocate
> flags to a cache line 128 bytes (2 cache lines) from the consumer
> pointer like this:
>
> u32 consumer ____cacheline_aligned_in_smp;
> ____cacheline_padding_in_smp;
> u32 flags ____cacheline_aligned_in_smp;
>
> > ... was there any improvement to also pad after the consumer given the struct
> > xdp_ring is also embedded into other structs?
>
> Good idea. Yes, I do believe I see another ~0.4% increase and more
> stable high numbers when trying this out. The xdp_ring is followed by
> the ring descriptors themselves in both the rx/tx rings and the umem
> rings. And these rings are quite large, 2K in the sample app, so they
> are accessed less frequently (1/8th of the time with a batch size of
> 256 and a ring size of 2K), which might explain the lower increase. In
> the end, I ended up with the following struct:
>
> u32 producer ____cacheline_aligned_in_smp;
> ____cacheline_padding_in_smp;
> u32 consumer ____cacheline_aligned_in_smp;
> ____cacheline_padding_in_smp;
> u32 flags ____cacheline_aligned_in_smp;
> ____cacheline_padding_in_smp;

Actually, this might make more sense and save some memory:

u32 producer ____cacheline_aligned_in_smp;
____cacheline_padding_in_smp;
u32 consumer ____cacheline_aligned_in_smp;
u32 flags;
____cacheline_padding_in_smp;

So keep flags colocated with the consumer on the same cache line. The
reason I put it there to start with was that it is usually set in
conjunction with the consumer pointer being updated. That might also
explain why I did not see any performance improvement from putting it
on its own cache line, 128 bytes away. In summary, we make sure that
producer and consumer are separated by 128 bytes, and likewise the
consumer and the start of the descriptor ring.

> Do you want to submit a patch, or shall I do it? I like your
> ____cacheline_padding_in_smp better than my explicit "padN" member.
>
> Thanks: Magnus
>
> > Thanks,
> > Daniel
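
[Editorial note: as a sanity check of the layout discussed above, here is a
minimal standalone userspace sketch (not the kernel code itself) that
verifies the producer/consumer separation with offsetof. It assumes 64-byte
L1 cache lines; the struct name, the explicit pad members and the
cacheline_aligned macro are illustrative stand-ins for whatever padding
construct is finally merged.]

/* Standalone sketch of the "flags on the consumer's cache line" layout.
 * Build with e.g.: gcc -std=c11 -Wall xdp_ring_layout.c
 */
#include <stddef.h>
#include <stdint.h>
#include <stdio.h>

#define CACHE_LINE 64	/* assumed L1_CACHE_BYTES / SMP_CACHE_BYTES */
#define cacheline_aligned __attribute__((aligned(CACHE_LINE)))

struct xdp_ring_sketch {
	uint32_t producer cacheline_aligned;
	uint32_t pad1 cacheline_aligned;	/* spacer line after producer */
	uint32_t consumer cacheline_aligned;
	uint32_t flags;				/* stays on the consumer's line */
	uint32_t pad2 cacheline_aligned;	/* spacer line before the descriptors */
};

/* Producer and consumer should be two cache lines (128 bytes) apart so the
 * adjacent-line prefetcher touching one line never pulls in the other.
 */
_Static_assert(offsetof(struct xdp_ring_sketch, consumer) -
	       offsetof(struct xdp_ring_sketch, producer) == 2 * CACHE_LINE,
	       "producer/consumer not two cache lines apart");

int main(void)
{
	/* With 64-byte lines this should report producer @ 0, consumer @ 128,
	 * flags @ 132 and sizeof = 256, i.e. the descriptors that follow the
	 * struct in memory also start 128 bytes past the consumer's line.
	 */
	printf("producer @ %zu, consumer @ %zu, flags @ %zu, sizeof = %zu\n",
	       offsetof(struct xdp_ring_sketch, producer),
	       offsetof(struct xdp_ring_sketch, consumer),
	       offsetof(struct xdp_ring_sketch, flags),
	       sizeof(struct xdp_ring_sketch));
	return 0;
}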