From mboxrd@z Thu Jan 1 00:00:00 1970 Return-Path: X-Spam-Checker-Version: SpamAssassin 3.4.0 (2014-02-07) on aws-us-west-2-korg-lkml-1.web.codeaurora.org X-Spam-Level: X-Spam-Status: No, score=-14.4 required=3.0 tests=DKIMWL_WL_MED,DKIM_SIGNED, DKIM_VALID,DKIM_VALID_AU,HEADER_FROM_DIFFERENT_DOMAINS,INCLUDES_PATCH, MAILING_LIST_MULTI,SIGNED_OFF_BY,SPF_HELO_NONE,SPF_PASS,USER_IN_DEF_DKIM_WL autolearn=ham autolearn_force=no version=3.4.0 Received: from mail.kernel.org (mail.kernel.org [198.145.29.99]) by smtp.lore.kernel.org (Postfix) with ESMTP id 8EF56C432C3 for ; Sat, 16 Nov 2019 02:08:28 +0000 (UTC) Received: from vger.kernel.org (vger.kernel.org [209.132.180.67]) by mail.kernel.org (Postfix) with ESMTP id 4D27720730 for ; Sat, 16 Nov 2019 02:08:28 +0000 (UTC) Authentication-Results: mail.kernel.org; dkim=pass (2048-bit key) header.d=google.com header.i=@google.com header.b="hvC8NBS0" Received: (majordomo@vger.kernel.org) by vger.kernel.org via listexpand id S1727376AbfKPCI1 (ORCPT ); Fri, 15 Nov 2019 21:08:27 -0500 Received: from mail-wr1-f65.google.com ([209.85.221.65]:41861 "EHLO mail-wr1-f65.google.com" rhost-flags-OK-OK-OK-OK) by vger.kernel.org with ESMTP id S1727159AbfKPCI1 (ORCPT ); Fri, 15 Nov 2019 21:08:27 -0500 Received: by mail-wr1-f65.google.com with SMTP id b18so11465125wrj.8 for ; Fri, 15 Nov 2019 18:08:24 -0800 (PST) DKIM-Signature: v=1; a=rsa-sha256; c=relaxed/relaxed; d=google.com; s=20161025; h=mime-version:references:in-reply-to:from:date:message-id:subject:to :cc; bh=8NYO4EYapc8xznltqHTYw3LF72laS+6k/AgXBQsNjsk=; b=hvC8NBS0JuFx8qplWQ84Pnb1K6AOyoOM7BdRkevgd/TwiRYVuiwxLeX2O7fXrpA2vU WdRwusPNaRssClKtM8tV1dMKRx4thN1wwrYrHXJ1pBSKLadaHvciUb7/OgsioZ7oXtFD pRf3+Mcbxpjyn4N8DjjDRwe/9TQq65yKhR9MTxfnUPqgmsUNdgc/dXiyZXJSPyOokrdu bj+u94WDUY0AV6xqr0zQWJj9TuzZtdn93ILFZMT1KOBSXHO2nv7aX+KriFgJK2Fjts0b dTNRER1smYv9yth4wNSwpEItghs9igTQrxLQKh2FX+jRJfemrbJlaDiCekS4jDm+7vRW qCVw== X-Google-DKIM-Signature: v=1; a=rsa-sha256; c=relaxed/relaxed; d=1e100.net; s=20161025; h=x-gm-message-state:mime-version:references:in-reply-to:from:date :message-id:subject:to:cc; bh=8NYO4EYapc8xznltqHTYw3LF72laS+6k/AgXBQsNjsk=; b=RlTn4NuFOQSrIHJdifeuNDfu9qOooxIcTXXRUc3m7z1VjOAxa3DSW51hZ6jWOOIHpd HfG/y58MtC6dSf4Quof5V2VxH90Qcglnqaoc8C20AaeKTRmkEtCK4WQ4U2gQ685DnCCI MOvT3IEYGeuHEXujf5ZUTAQV7Yxx7i7xyZewIFUM5Mniy5k72T+mwgA/V6MFpKqG9qW9 ymDnPVqpkLXVyX6zS+7zFZ/HbwLCEkj1LdiA2qIME0JaPLdu0doLc6khNRBFRGYrd0On wFbnNgwCGuCi9NMm9X3CwuMmYsiKelyltuz4dFXVpHzDY01DDqCKb6e4B2t5GK6NXovQ tMaw== X-Gm-Message-State: APjAAAW4MMdhA+YSuA12X5qNY0KDQ4Jmnp6vTXr5GNwuiCH9ZmRtGZTJ CNmUoEMKJLUM2wIU4+WnvLGiFSqobUeln1VhaTPEbg== X-Google-Smtp-Source: APXvYqzNKfHSd/T34jH9e6Z37cMBjh3wGt32AItc++rBpPY4g0UwVRJdbUc5sIutw8u2z117TyNvRn0cLS+dYrca03s= X-Received: by 2002:a5d:44d2:: with SMTP id z18mr5451285wrr.209.1573870102492; Fri, 15 Nov 2019 18:08:22 -0800 (PST) MIME-Version: 1.0 References: <20191116015554.51077-1-edumazet@google.com> In-Reply-To: <20191116015554.51077-1-edumazet@google.com> From: Soheil Hassas Yeganeh Date: Fri, 15 Nov 2019 21:07:46 -0500 Message-ID: Subject: Re: [PATCH net-next] selftests: net: avoid ptl lock contention in tcp_mmap To: Eric Dumazet Cc: "David S . Miller" , netdev , Eric Dumazet , Arjun Roy Content-Type: text/plain; charset="UTF-8" Sender: netdev-owner@vger.kernel.org Precedence: bulk List-ID: X-Mailing-List: netdev@vger.kernel.org On Fri, Nov 15, 2019 at 8:56 PM Eric Dumazet wrote: > > tcp_mmap is used as a reference program for TCP rx zerocopy, > so it is important to point out some potential issues. > > If multiple threads are concurrently using getsockopt(... > TCP_ZEROCOPY_RECEIVE), there is a chance the low-level mm > functions compete on shared ptl lock, if vma are arbitrary placed. > > Instead of letting the mm layer place the chunks back to back, > this patch enforces an alignment so that each thread uses > a different ptl lock. > > Performance measured on a 100 Gbit NIC, with 8 tcp_mmap clients > launched at the same time : > > $ for f in {1..8}; do ./tcp_mmap -H 2002:a05:6608:290:: & done > > In the following run, we reproduce the old behavior by requesting no alignment : > > $ tcp_mmap -sz -C $((128*1024)) -a 4096 > received 32768 MB (100 % mmap'ed) in 9.69532 s, 28.3516 Gbit > cpu usage user:0.08634 sys:3.86258, 120.511 usec per MB, 171839 c-switches > received 32768 MB (100 % mmap'ed) in 25.4719 s, 10.7914 Gbit > cpu usage user:0.055268 sys:21.5633, 659.745 usec per MB, 9065 c-switches > received 32768 MB (100 % mmap'ed) in 28.5419 s, 9.63069 Gbit > cpu usage user:0.057401 sys:23.8761, 730.392 usec per MB, 14987 c-switches > received 32768 MB (100 % mmap'ed) in 28.655 s, 9.59268 Gbit > cpu usage user:0.059689 sys:23.8087, 728.406 usec per MB, 18509 c-switches > received 32768 MB (100 % mmap'ed) in 28.7808 s, 9.55074 Gbit > cpu usage user:0.066042 sys:23.4632, 718.056 usec per MB, 24702 c-switches > received 32768 MB (100 % mmap'ed) in 28.8259 s, 9.5358 Gbit > cpu usage user:0.056547 sys:23.6628, 723.858 usec per MB, 23518 c-switches > received 32768 MB (100 % mmap'ed) in 28.8808 s, 9.51767 Gbit > cpu usage user:0.059357 sys:23.8515, 729.703 usec per MB, 14691 c-switches > received 32768 MB (100 % mmap'ed) in 28.8879 s, 9.51534 Gbit > cpu usage user:0.047115 sys:23.7349, 725.769 usec per MB, 21773 c-switches > > New behavior (automatic alignment based on Hugepagesize), > we can see the system overhead being dramatically reduced. > > $ tcp_mmap -sz -C $((128*1024)) > received 32768 MB (100 % mmap'ed) in 13.5339 s, 20.3103 Gbit > cpu usage user:0.122644 sys:3.4125, 107.884 usec per MB, 168567 c-switches > received 32768 MB (100 % mmap'ed) in 16.0335 s, 17.1439 Gbit > cpu usage user:0.132428 sys:3.55752, 112.608 usec per MB, 188557 c-switches > received 32768 MB (100 % mmap'ed) in 17.5506 s, 15.6621 Gbit > cpu usage user:0.155405 sys:3.24889, 103.891 usec per MB, 226652 c-switches > received 32768 MB (100 % mmap'ed) in 19.1924 s, 14.3222 Gbit > cpu usage user:0.135352 sys:3.35583, 106.542 usec per MB, 207404 c-switches > received 32768 MB (100 % mmap'ed) in 22.3649 s, 12.2906 Gbit > cpu usage user:0.142429 sys:3.53187, 112.131 usec per MB, 250225 c-switches > received 32768 MB (100 % mmap'ed) in 22.5336 s, 12.1986 Gbit > cpu usage user:0.140654 sys:3.61971, 114.757 usec per MB, 253754 c-switches > received 32768 MB (100 % mmap'ed) in 22.5483 s, 12.1906 Gbit > cpu usage user:0.134035 sys:3.55952, 112.718 usec per MB, 252997 c-switches > received 32768 MB (100 % mmap'ed) in 22.6442 s, 12.139 Gbit > cpu usage user:0.126173 sys:3.71251, 117.147 usec per MB, 253728 c-switches > > Signed-off-by: Eric Dumazet > Cc: Soheil Hassas Yeganeh Acked-by: Soheil Hassas Yeganeh Neat idea! Thank you for the patch! > Cc: Arjun Roy > --- > tools/testing/selftests/net/tcp_mmap.c | 58 +++++++++++++++++++++++--- > 1 file changed, 53 insertions(+), 5 deletions(-) > > diff --git a/tools/testing/selftests/net/tcp_mmap.c b/tools/testing/selftests/net/tcp_mmap.c > index 0e73a30f0c2262e62a5ed1e2db6c7c8977bf44fa..5bb370a0857ec8a24916f583be5374183a9aefc8 100644 > --- a/tools/testing/selftests/net/tcp_mmap.c > +++ b/tools/testing/selftests/net/tcp_mmap.c > @@ -82,7 +82,9 @@ static int zflg; /* zero copy option. (MSG_ZEROCOPY for sender, mmap() for recei > static int xflg; /* hash received data (simple xor) (-h option) */ > static int keepflag; /* -k option: receiver shall keep all received file in memory (no munmap() calls) */ > > -static int chunk_size = 512*1024; > +static size_t chunk_size = 512*1024; > + > +static size_t map_align; > > unsigned long htotal; > > @@ -118,6 +120,9 @@ void hash_zone(void *zone, unsigned int length) > htotal = temp; > } > > +#define ALIGN_UP(x, align_to) (((x) + ((align_to)-1)) & ~((align_to)-1)) > +#define ALIGN_PTR_UP(p, ptr_align_to) ((typeof(p))ALIGN_UP((unsigned long)(p), ptr_align_to)) > + > void *child_thread(void *arg) > { > unsigned long total_mmap = 0, total = 0; > @@ -126,6 +131,7 @@ void *child_thread(void *arg) > int flags = MAP_SHARED; > struct timeval t0, t1; > char *buffer = NULL; > + void *raddr = NULL; > void *addr = NULL; > double throughput; > struct rusage ru; > @@ -142,9 +148,13 @@ void *child_thread(void *arg) > goto error; > } > if (zflg) { > - addr = mmap(NULL, chunk_size, PROT_READ, flags, fd, 0); > - if (addr == (void *)-1) > + raddr = mmap(NULL, chunk_size + map_align, PROT_READ, flags, fd, 0); > + if (raddr == (void *)-1) { > + perror("mmap"); > zflg = 0; > + } else { > + addr = ALIGN_PTR_UP(raddr, map_align); > + } > } > while (1) { > struct pollfd pfd = { .fd = fd, .events = POLLIN, }; > @@ -222,7 +232,7 @@ void *child_thread(void *arg) > free(buffer); > close(fd); > if (zflg) > - munmap(addr, chunk_size); > + munmap(raddr, chunk_size + map_align); > pthread_exit(0); > } > > @@ -303,6 +313,30 @@ static void do_accept(int fdlisten) > } > } > > +/* Each thread should reserve a big enough vma to avoid > + * spinlock collisions in ptl locks. > + * This size is 2MB on x86_64, and is exported in /proc/meminfo. > + */ > +static unsigned long default_huge_page_size(void) > +{ > + FILE *f = fopen("/proc/meminfo", "r"); > + unsigned long hps = 0; > + size_t linelen = 0; > + char *line = NULL; > + > + if (!f) > + return 0; > + while (getline(&line, &linelen, f) > 0) { > + if (sscanf(line, "Hugepagesize: %lu kB", &hps) == 1) { > + hps <<= 10; > + break; > + } > + } > + free(line); > + fclose(f); > + return hps; > +} > + > int main(int argc, char *argv[]) > { > struct sockaddr_storage listenaddr, addr; > @@ -314,7 +348,7 @@ int main(int argc, char *argv[]) > int sflg = 0; > int mss = 0; > > - while ((c = getopt(argc, argv, "46p:svr:w:H:zxkP:M:")) != -1) { > + while ((c = getopt(argc, argv, "46p:svr:w:H:zxkP:M:C:a:")) != -1) { > switch (c) { > case '4': > cfg_family = PF_INET; > @@ -354,10 +388,24 @@ int main(int argc, char *argv[]) > case 'P': > max_pacing_rate = atoi(optarg) ; > break; > + case 'C': > + chunk_size = atol(optarg); > + break; > + case 'a': > + map_align = atol(optarg); > + break; > default: > exit(1); > } > } > + if (!map_align) { > + map_align = default_huge_page_size(); > + /* if really /proc/meminfo is not helping, > + * we use the default x86_64 hugepagesize. > + */ > + if (!map_align) > + map_align = 2*1024*1024; > + } > if (sflg) { > int fdlisten = socket(cfg_family, SOCK_STREAM, 0); > > -- > 2.24.0.432.g9d3f5f5b63-goog >