From: Alexander Duyck
Date: Sat, 11 Apr 2020 13:47:47 -0700
Subject: Re: [RFC] mm/memory.c: Optimizing THP zeroing routine for !HIGHMEM cases
To: Chintan Pandya
Cc: "Huang, Ying", Michal Hocko, Prathu Baronia, akpm@linux-foundation.org,
 linux-mm@kvack.org, gregkh@linuxfoundation.org, gthelen@google.com,
 jack@suse.cz, Ken Lin, Gasine Xu

On Sat, Apr 11, 2020 at 8:40 AM Chintan Pandya wrote:
>
> > > Generally, many architectures are optimized for serial loads, be it
> > > initialization or access, as it is the simplest form of prediction. Any
> > > random access pattern would kill that pre-fetching. And for now, I
> > > suspect that to be the case here. Probably, we can run more tests to
> > > confirm this part.
> >
> > Please prove your theory with a test. Better to test x86 too.
>
> Wrote down the userspace test code below.
>
> Code:
>
> #include <stdio.h>
> #include <stdlib.h>
> #include <string.h>
> #include <sys/time.h>
>
> #define SZ_1M 0x100000
> #define SZ_4K 0x1000
> #define NUM 100
>
> int main ()
> {
>     void *p;
>     void *q;
>     void *r;
>
>     unsigned long total_pages, total_size;
>     int i, j;
>     struct timeval t0, t1, t2, t3;
>     int elapsed;
>
>     printf ("Hello World\n");
>
>     total_size = NUM * SZ_1M;
>     total_pages = NUM * (SZ_1M / SZ_4K);
>
>     p = malloc (total_size);
>     q = malloc (total_size);
>     r = malloc (total_size);
>
>     /* So that all pages get allocated */
>     memset (r, 0xa, total_size);
>     memset (q, 0xa, total_size);
>     memset (p, 0xa, total_size);
>
>     gettimeofday (&t0, NULL);
>
>     /* One shot memset */
>     memset (r, 0xd, total_size);
>
>     gettimeofday (&t1, NULL);
>
>     /* traverse in forward order */
>     for (j = 0; j < total_pages; j++)
>     {
>         memset (q + (j * SZ_4K), 0xc, SZ_4K);
>     }
>
>     gettimeofday (&t2, NULL);
>
>     /* traverse in reverse order */
>     for (i = 0; i < total_pages; i++)
>     {
>         memset (p + total_size - (i + 1) * SZ_4K, 0xb, SZ_4K);
>     }
>
>     gettimeofday (&t3, NULL);
>
>     free (p);
>     free (q);
>     free (r);
>
>     /* Results time */
>     elapsed = ((t1.tv_sec - t0.tv_sec) * 1000000) + (t1.tv_usec - t0.tv_usec);
>     printf ("One shot: %d micro seconds\n", elapsed);
>
>     elapsed = ((t2.tv_sec - t1.tv_sec) * 1000000) + (t2.tv_usec - t1.tv_usec);
>     printf ("Forward order: %d micro seconds\n", elapsed);
>
>     elapsed = ((t3.tv_sec - t2.tv_sec) * 1000000) + (t3.tv_usec - t2.tv_usec);
>     printf ("Reverse order: %d micro seconds\n", elapsed);
>     return 0;
> }
>
> ------------------------------------------------------------------------------
>
> Results for ARM64 target (SM8150, CPU0 & 6 are online, running at max frequency)
> All numbers are the mean of 100 iterations. Variation is negligible.
> - Oneshot : 3389.26 us
> - Forward : 8876.16 us
> - Reverse : 18157.6 us
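(Side note: the absolute numbers will depend on how the test was built; I am
assuming something along the lines of "gcc -O2 memset_test.c -o memset_test",
with the file name being hypothetical. The optimization level matters for the
loop comparison, as I get into further down.)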
This is an interesting data point. So running things in reverse seems
much more expensive than running them forward. As such I would imagine
process_huge_page is going to be significantly more expensive on ARM64,
since it will wind through the pages in reverse order from the end of
the page all the way down to wherever the page was accessed.

I wonder if we couldn't simply modify process_huge_page to process
pages in two passes. The first would run from addr_hint + some offset
to the end of the page; the second would loop back around to the start
of the page and process up to where the first pass began. The idea is
that the offset would be large enough that the 4K that was accessed,
plus some range before and after that address, is hopefully still in
the L1 cache after we are done. (There is a rough sketch of what I mean
at the end of this mail.)

> Results for x86-64 (Intel(R) Core(TM) i7-8700 CPU @ 3.20GHz, only CPU 0 at max frequency)
> All numbers are the mean of 100 iterations. Variation is negligible.
> - Oneshot : 3203.49 us
> - Forward : 5766.46 us
> - Reverse : 5187.86 us
>
> To conclude, I observed optimized serial writes in the case of the ARM
> processor. But strangely, memset in reverse order performs better than
> forward order quite consistently across multiple x86 machines. I don't
> have much insight into x86, so to clarify, I would like to restrict my
> previous suspicion to ARM only.

What compiler options did you build the test code with? One possibility
is that the compiler optimized total_pages/total_size/i all down into
one variable and simply tracked it until it dropped below 0. I know I
regularly write loops to run in reverse order for that reason, as it
tends to perform well on x86: all you have to do is a sub or dec and
then test the sign flag to decide whether to exit the loop. (The second
snippet at the end of this mail shows the kind of loop I mean.)

An additional thing I was wondering is whether this also impacts the
copy operations. Looking through the code, the two big users of
process_huge_page are clear_huge_page and copy_user_huge_page. One
thing that might make more sense than just splitting the code at a high
level would be to look at refactoring process_huge_page and its users.
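To make the two-pass idea concrete, here is a rough userspace-style
sketch. It only illustrates the intended ordering: the function name,
the fixed 2-page slack, and memset standing in for the per-subpage
clear are all placeholders, not the actual mm/memory.c interface.

    #include <string.h>

    #define PAGE_SZ 4096UL

    /* Sketch only, not the real process_huge_page. Clears an
     * npages-long huge page given the index of the subpage that
     * faulted (hint_idx). */
    static void clear_huge_page_two_pass(char *base, unsigned long npages,
                                         unsigned long hint_idx)
    {
            unsigned long slack = 2;    /* placeholder for "some offset" */
            unsigned long split = hint_idx + slack + 1;
            unsigned long i;

            if (split > npages)
                    split = npages;

            /* Pass 1: ascending from just past the hot range to the end. */
            for (i = split; i < npages; i++)
                    memset(base + i * PAGE_SZ, 0, PAGE_SZ);

            /* Pass 2: ascending from the start up through the hot range,
             * so the pages around hint_idx are the last ones touched and
             * should still be in L1 when the faulting access runs. */
            for (i = 0; i < split; i++)
                    memset(base + i * PAGE_SZ, 0, PAGE_SZ);
    }

Both passes run in the ascending direction, which per the ARM64 numbers
above is the cheap one, while still leaving the accessed page hot.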
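And for reference, this is the kind of reverse loop I was describing,
written as a hypothetical rewrite of the reverse-order loop in the test
above. With a signed counter the compiler only needs a dec/sub followed
by a sign-flag test to decide whether to exit:

    /* Hypothetical rewrite of the test's reverse loop: same stores,
     * but the exit test is just "did the counter go negative". */
    long k;
    for (k = (long)total_pages - 1; k >= 0; k--)
            memset(p + k * SZ_4K, 0xb, SZ_4K);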