From: Alexander Duyck
Date: Sat, 11 Apr 2020 13:47:47 -0700
Subject: Re: [RFC] mm/memory.c: Optimizing THP zeroing routine for !HIGHMEM cases
To: Chintan Pandya
Cc: "Huang, Ying", Michal Hocko, Prathu Baronia, akpm@linux-foundation.org,
 linux-mm@kvack.org, gregkh@linuxfoundation.org, gthelen@google.com,
 jack@suse.cz, Ken Lin, Gasine Xu

On Sat, Apr 11, 2020 at 8:40 AM Chintan Pandya wrote:
>
> > > Generally, many architectures are optimized for serial loads, be it
> > > initialization or access, as it is the simplest form of prediction. Any
> > > random access pattern would kill that pre-fetching. And for now, I
> > > suspect that to be the case here. Probably, we can run more tests to
> > > confirm this part.
> >
> > Please prove your theory with a test. Better to test x86 too.
>
> Wrote down the userspace test code below.
>
> Code:
>
> #include <stdio.h>
> #include <stdlib.h>
> #include <string.h>
> #include <sys/time.h>
>
> #define SZ_1M 0x100000
> #define SZ_4K 0x1000
> #define NUM 100
>
> int main ()
> {
>     void *p;
>     void *q;
>     void *r;
>
>     unsigned long total_pages, total_size;
>     int i, j;
>     struct timeval t0, t1, t2, t3;
>     int elapsed;
>
>     printf ("Hello World\n");
>
>     total_size = NUM * SZ_1M;
>     total_pages = NUM * (SZ_1M / SZ_4K);
>
>     p = malloc (total_size);
>     q = malloc (total_size);
>     r = malloc (total_size);
>
>     /* So that all pages get allocated */
>     memset (r, 0xa, total_size);
>     memset (q, 0xa, total_size);
>     memset (p, 0xa, total_size);
>
>     gettimeofday (&t0, NULL);
>
>     /* One shot memset */
>     memset (r, 0xd, total_size);
>
>     gettimeofday (&t1, NULL);
>
>     /* traverse in forward order */
>     for (j = 0; j < total_pages; j++)
>     {
>         memset (q + (j * SZ_4K), 0xc, SZ_4K);
>     }
>
>     gettimeofday (&t2, NULL);
>
>     /* traverse in reverse order */
>     for (i = 0; i < total_pages; i++)
>     {
>         memset (p + total_size - (i + 1) * SZ_4K, 0xb, SZ_4K);
>     }
>
>     gettimeofday (&t3, NULL);
>
>     free (p);
>     free (q);
>     free (r);
>
>     /* Results time */
>     elapsed = ((t1.tv_sec - t0.tv_sec) * 1000000) + (t1.tv_usec - t0.tv_usec);
>     printf ("One shot: %d micro seconds\n", elapsed);
>
>     elapsed = ((t2.tv_sec - t1.tv_sec) * 1000000) + (t2.tv_usec - t1.tv_usec);
>     printf ("Forward order: %d micro seconds\n", elapsed);
>
>     elapsed = ((t3.tv_sec - t2.tv_sec) * 1000000) + (t3.tv_usec - t2.tv_usec);
>     printf ("Reverse order: %d micro seconds\n", elapsed);
>     return 0;
> }
>
> ------------------------------------------------------------------------------
>
> Results for ARM64 target (SM8150, CPU0 & 6 are online, running at max frequency)
> All numbers are the mean of 100 iterations. Variation is negligible.
> - Oneshot : 3389.26 us
> - Forward : 8876.16 us
> - Reverse : 18157.6 us
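(Side note: the absolute numbers will depend on how the test was built; I am
assuming something along the lines of "gcc -O2 memset_test.c -o memset_test",
with the file name being hypothetical. The optimization level matters for the
loop comparison, as I get into further down.)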
This is an interesting data point. So running things in reverse seems
much more expensive than running them forward. As such I would imagine
process_huge_page is going to be significantly more expensive on ARM64,
since it will wind through the pages in reverse order from the end of
the page all the way down to wherever the page was accessed.

I wonder if we couldn't simply modify process_huge_page to process
pages in two passes. The first would run from addr_hint + some offset
to the end of the page; the second would loop back around to the start
of the page and process up to where the first pass began. The idea is
that the offset would be large enough that the 4K that was accessed,
plus some range before and after that address, is hopefully still in
the L1 cache after we are done. (There is a rough sketch of what I mean
at the end of this mail.)

> Results for x86-64 (Intel(R) Core(TM) i7-8700 CPU @ 3.20GHz, only CPU 0 at max frequency)
> All numbers are the mean of 100 iterations. Variation is negligible.
> - Oneshot : 3203.49 us
> - Forward : 5766.46 us
> - Reverse : 5187.86 us
>
> To conclude, I observed optimized serial writes in the case of the ARM
> processor. But strangely, memset in reverse order performs better than
> forward order quite consistently across multiple x86 machines. I don't
> have much insight into x86, so to clarify, I would like to restrict my
> previous suspicion to ARM only.

What compiler options did you build the test code with? One possibility
is that the compiler optimized total_pages/total_size/i all down into
one variable and simply tracked it until it dropped below 0. I know I
regularly write loops to run in reverse order for that reason, as it
tends to perform well on x86: all you have to do is a sub or dec and
then test the sign flag to decide whether to exit the loop. (The second
snippet at the end of this mail shows the kind of loop I mean.)

An additional thing I was wondering is whether this also impacts the
copy operations. Looking through the code, the two big users of
process_huge_page are clear_huge_page and copy_user_huge_page. One
thing that might make more sense than just splitting the code at a high
level would be to look at refactoring process_huge_page and its users.
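To make the two-pass idea concrete, here is a rough userspace-style
sketch. It only illustrates the intended ordering: the function name,
the fixed 2-page slack, and memset standing in for the per-subpage
clear are all placeholders, not the actual mm/memory.c interface.

    #include <string.h>

    #define PAGE_SZ 4096UL

    /* Sketch only, not the real process_huge_page. Clears an
     * npages-long huge page given the index of the subpage that
     * faulted (hint_idx). */
    static void clear_huge_page_two_pass(char *base, unsigned long npages,
                                         unsigned long hint_idx)
    {
            unsigned long slack = 2;    /* placeholder for "some offset" */
            unsigned long split = hint_idx + slack + 1;
            unsigned long i;

            if (split > npages)
                    split = npages;

            /* Pass 1: ascending from just past the hot range to the end. */
            for (i = split; i < npages; i++)
                    memset(base + i * PAGE_SZ, 0, PAGE_SZ);

            /* Pass 2: ascending from the start up through the hot range,
             * so the pages around hint_idx are the last ones touched and
             * should still be in L1 when the faulting access runs. */
            for (i = 0; i < split; i++)
                    memset(base + i * PAGE_SZ, 0, PAGE_SZ);
    }

Both passes run in the ascending direction, which per the ARM64 numbers
above is the cheap one, while still leaving the accessed page hot.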
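And for reference, this is the kind of reverse loop I was describing,
written as a hypothetical rewrite of the reverse-order loop in the test
above. With a signed counter the compiler only needs a dec/sub followed
by a sign-flag test to decide whether to exit:

    /* Hypothetical rewrite of the test's reverse loop: same stores,
     * but the exit test is just "did the counter go negative". */
    long k;
    for (k = (long)total_pages - 1; k >= 0; k--)
            memset(p + k * SZ_4K, 0xb, SZ_4K);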