linux-mm.kvack.org archive mirror
From: Alexander Duyck <alexander.duyck@gmail.com>
To: Prathu Baronia <prathu.baronia@oneplus.com>
Cc: Michal Hocko <mhocko@suse.com>,
	Chintan Pandya <chintan.pandya@oneplus.com>,
	 "Huang, Ying" <ying.huang@intel.com>,
	akpm@linux-foundation.com,  linux-mm <linux-mm@kvack.org>,
	gregkh@linuxfoundation.com,  Greg Thelen <gthelen@google.com>,
	jack@suse.cz, Ken Lin <ken.lin@oneplus.com>,
	 Gasine Xu <gasine.xu@oneplus.com>
Subject: Re: [PATCH v2] mm: Optimized hugepage zeroing & copying from user
Date: Tue, 14 Apr 2020 12:32:57 -0700	[thread overview]
Message-ID: <CAKgT0Ud2zeZO7-akPCLySUAbh5ePF=Kp0V+kaBpV63woQXk_xg@mail.gmail.com> (raw)
In-Reply-To: <20200414184743.GB2097@oneplus.com>

On Tue, Apr 14, 2020 at 11:47 AM Prathu Baronia
<prathu.baronia@oneplus.com> wrote:
>
> The 04/14/2020 19:03, Michal Hocko wrote:
> > I still have a hard time seeing why the kmap machinery should introduce any
> > slowdown here. Previous data posted while discussing v1 didn't really
> > show anything outside of the noise.
> >
> You are right, the multiple barriers are not responsible for the slowdown; rather,
> removing kmap_atomic() is what allows us to call memset and memcpy for larger sizes.
> I will reframe this part of the commit text in v3 to present it more cleanly.
> >
> > It would be really nice to provide std
> >
> Here is the data with std:-
> ----------------------------------------------------------------------
> Results:
> ----------------------------------------------------------------------
> Results for ARM64 target (SM8150, CPUs 0 & 6 online, running at max
> frequency)
> All numbers are means over 100 iterations; variation is negligible.
> ----------------------------------------------------------------------
> - Oneshot : 3389.26 us  std: 79.1377 us
> - Forward : 8876.16 us  std: 172.699 us
> - Reverse : 18157.6 us  std: 111.713 us
> ----------------------------------------------------------------------
>
> ----------------------------------------------------------------------
> Results for x86-64 (Intel(R) Core(TM) i7-8700 CPU @ 3.20GHz, only CPU 0 in
> max frequency, DDR also running at max frequency.) All numbers are means
> over 100 iterations; variation is negligible.
> ----------------------------------------------------------------------
> - Oneshot : 3203.49 us  std: 115.4086 us
> - Forward : 5766.46 us  std: 328.6299 us
> - Reverse : 5187.86 us  std: 341.1918 us
> ----------------------------------------------------------------------
>
> >
> > No. There is absolutely zero reason to add a config option for this. The
> > kernel should have all the information to make an educated guess.
> >
> I will try to incorporate this in v3, but currently I don't have any idea how
> to go about implementing the guessing logic. I would really appreciate it if
> you could suggest a way to approach it.
>
> > Also before going any further. The patch which has introduced the
> > optimization was c79b57e462b5 ("mm: hugetlb: clear target sub-page last
> > when clearing huge page"). It is based on an artificial benchmark which
> > to my knowledge doesn't represent any real workload. Your measurements
> > are based on a different benchmark. Your numbers clearly show that some
> > assumptions used for the optimization are not architecture neutral.
> >
> But the oneshot numbers are significantly better on both architectures. I think
> the oneshot approach should theoretically provide better results than the serial
> approach on all architectures. Isn't it fair to go ahead with the oneshot
> approach?

I think the point that Michal is getting at is that there are other
tests that need to be run. You are running the test on just one core.
What happens as we start fanning this out and having multiple
instances running per socket? We would be flooding the LLC in addition
to overwriting all the other caches.

If you take a look at commit c6ddfb6c58903 ("mm, clear_huge_page: move
order algorithm into a separate function"), they were running the tests
on multiple threads simultaneously because their concern was flooding the
LLC. I wonder if we couldn't look at bypassing the cache entirely, using
something like __copy_user_nocache for most of the copy, and then copying
only the last pieces that we expect to be immediately accessed through
the cache.



Thread overview: 27+ messages
2020-04-14 15:38 [PATCH v2] mm: Optimized hugepage zeroing & copying from user Prathu Baronia
2020-04-14 17:03 ` Michal Hocko
2020-04-14 17:41   ` Daniel Jordan
     [not found]   ` <20200414184743.GB2097@oneplus.com>
2020-04-14 19:32     ` Alexander Duyck [this message]
2020-04-15  3:40       ` Huang, Ying
2020-04-15 11:09         ` Michal Hocko
2020-04-19 12:05       ` Prathu Baronia
2020-04-14 19:40     ` Michal Hocko
2020-04-15  3:27 ` Huang, Ying
2020-04-16  1:21   ` Huang, Ying
2020-04-19 15:58   ` Prathu Baronia
2020-04-20  0:18     ` Huang, Ying
2020-04-21  9:36       ` Prathu Baronia
2020-04-21 10:09         ` Will Deacon
2020-04-21 12:47           ` Vlastimil Babka
2020-04-21 12:48             ` Vlastimil Babka
2020-04-21 13:39               ` Will Deacon
2020-04-21 13:48                 ` Vlastimil Babka
2020-04-21 13:56                   ` Chintan Pandya
2020-04-22  8:18                   ` Will Deacon
2020-04-22 11:19                     ` Will Deacon
2020-04-22 14:38                       ` Prathu Baronia
2020-05-01  8:58                         ` Prathu Baronia
2020-05-05  8:59                           ` Will Deacon
2020-04-21 13:00             ` Michal Hocko
2020-04-21 13:10               ` Will Deacon
2020-04-17  7:48 ` [mm] 134c8b410f: vm-scalability.median -7.9% regression kernel test robot
