linux-mm.kvack.org archive mirror
From: Michal Hocko <mhocko@suse.com>
To: Prathu Baronia <prathu.baronia@oneplus.com>
Cc: alexander.duyck@gmail.com, chintan.pandya@oneplus.com,
	ying.huang@intel.com, akpm@linux-foundation.org,
	linux-mm@kvack.org, gregkh@linuxfoundation.com,
	gthelen@google.com, jack@suse.cz, ken.lin@oneplus.com,
	gasine.xu@oneplus.com
Subject: Re: [PATCH v2] mm: Optimized hugepage zeroing & copying from user
Date: Tue, 14 Apr 2020 21:40:33 +0200	[thread overview]
Message-ID: <20200414194033.GU4629@dhcp22.suse.cz> (raw)
In-Reply-To: <20200414184743.GB2097@oneplus.com>

On Wed 15-04-20 00:17:44, Prathu Baronia wrote:
> On 04/14/2020 19:03, Michal Hocko wrote:
> > I still have a hard time seeing why the kmap machinery should introduce
> > any slowdown here. Previous data posted while discussing v1 didn't
> > really show anything outside of the noise.
> > 
> You are right, the multiple barriers are not responsible for the slowdown,
> but the removal of kmap_atomic() allows us to call memset and memcpy for
> larger sizes. I will reframe this part of the commit text when we move to
> v3 to present it more cleanly.

While this might be OK for 2MB huge pages, does the same apply to other,
larger sizes, e.g. 512MB, 1GB or even larger huge pages? You should also
consider !PREEMPT kernels.
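
For reference, a rough sketch of the two approaches under discussion
(simplified pseudocode, not the actual mm/memory.c code; the loop is
roughly what the current per-sub-page clearing does):

	int i;

	/*
	 * Serial approach: clear one PAGE_SIZE sub-page at a time, with a
	 * reschedule point between sub-pages so that a !PREEMPT kernel
	 * does not stall for the duration of the whole huge page.
	 */
	for (i = 0; i < pages_per_huge_page; i++) {
		cond_resched();
		clear_user_highpage(page + i, addr + i * PAGE_SIZE);
	}

	/*
	 * One-shot approach: a single large clear over the whole huge page.
	 * This is what dropping kmap_atomic() makes possible on !HIGHMEM,
	 * but there is no rescheduling point inside the memset, which is
	 * why 512MB or 1GB pages on !PREEMPT kernels are a concern.
	 */
	memset(page_address(page), 0, pages_per_huge_page * PAGE_SIZE);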

[...]

> > No. There is absolutely zero reason to add a config option for this. The
> > kernel should have all the information to make an educated guess.
> > 
> I will try to incorporate this in v3, but currently I don't have any idea
> how to implement the guessing logic. I would really appreciate it if you
> could suggest a way to go about it.

If you cannot guess the proper sizing, then how is a poor user who tries
to configure the kernel supposed to do it?

> > Also before going any further. The patch which has introduced the
> > optimization was c79b57e462b5 ("mm: hugetlb: clear target sub-page last
> > when clearing huge page"). It is based on an artificial benchmark which
> > to my knowledge doesn't represent any real workload. Your measurements
> > are based on a different benchmark. Your numbers clearly show that some
> > assumptions used for the optimization are not architecture neutral.
> > 
> But the one-shot numbers are significantly better on both archs. I think
> that, theoretically, the one-shot approach should provide better results
> on all architectures when compared with the serial approach. Isn't it a
> fair assumption to go ahead with the one-shot approach?

What is this assumption based on? Also, please consider that all of these
numbers are based on artificial microbenchmarks. Can you see any
difference for real-world huge page users? The same applies to the
regression you can see with the existing code.
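
For context, the ordering that c79b57e462b5 introduced, as rough
pseudocode (simplified; the real clear_huge_page() additionally clears
the remaining sub-pages converging towards the target one for better
locality):

	int i;

	/*
	 * Clear every sub-page except the one the fault targeted, then
	 * clear the target sub-page last, so the cachelines the faulting
	 * thread touches first are still hot when the fault returns.
	 */
	for (i = 0; i < pages_per_huge_page; i++) {
		if (i == target)
			continue;
		cond_resched();
		clear_user_highpage(page + i, addr + i * PAGE_SIZE);
	}
	cond_resched();
	clear_user_highpage(page + target, addr + target * PAGE_SIZE);
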
-- 
Michal Hocko
SUSE Labs



Thread overview: 27+ messages
2020-04-14 15:38 [PATCH v2] mm: Optimized hugepage zeroing & copying from user Prathu Baronia
2020-04-14 17:03 ` Michal Hocko
2020-04-14 17:41   ` Daniel Jordan
     [not found]   ` <20200414184743.GB2097@oneplus.com>
2020-04-14 19:32     ` Alexander Duyck
2020-04-15  3:40       ` Huang, Ying
2020-04-15 11:09         ` Michal Hocko
2020-04-19 12:05       ` Prathu Baronia
2020-04-14 19:40     ` Michal Hocko [this message]
2020-04-15  3:27 ` Huang, Ying
2020-04-16  1:21   ` Huang, Ying
2020-04-19 15:58   ` Prathu Baronia
2020-04-20  0:18     ` Huang, Ying
2020-04-21  9:36       ` Prathu Baronia
2020-04-21 10:09         ` Will Deacon
2020-04-21 12:47           ` Vlastimil Babka
2020-04-21 12:48             ` Vlastimil Babka
2020-04-21 13:39               ` Will Deacon
2020-04-21 13:48                 ` Vlastimil Babka
2020-04-21 13:56                   ` Chintan Pandya
2020-04-22  8:18                   ` Will Deacon
2020-04-22 11:19                     ` Will Deacon
2020-04-22 14:38                       ` Prathu Baronia
2020-05-01  8:58                         ` Prathu Baronia
2020-05-05  8:59                           ` Will Deacon
2020-04-21 13:00             ` Michal Hocko
2020-04-21 13:10               ` Will Deacon
2020-04-17  7:48 ` [mm] 134c8b410f: vm-scalability.median -7.9% regression kernel test robot
