linux-mm.kvack.org archive mirror
From: Alexander Duyck <alexander.duyck@gmail.com>
To: Prathu Baronia <prathu.baronia@oneplus.com>
Cc: Andrew Morton <akpm@linux-foundation.org>,
	linux-mm <linux-mm@kvack.org>,
	 Greg KH <gregkh@linuxfoundation.org>,
	gthelen@google.com, jack@suse.cz,  Michal Hocko <mhocko@suse.com>,
	ken.lin@oneplus.com, gasine.xu@oneplus.com,
	 chintan.pandya@oneplus.com
Subject: Re: [RFC] mm/memory.c: Optimizing THP zeroing routine for !HIGHMEM cases
Date: Fri, 10 Apr 2020 11:54:53 -0700	[thread overview]
Message-ID: <CAKgT0Ud68=vkZPKU3UGSD01Fqn8M4RW7YCSJdvO76fS2QrhBzQ@mail.gmail.com> (raw)
In-Reply-To: <20200403081812.GA14090@oneplus.com>

On Fri, Apr 3, 2020 at 1:18 AM Prathu Baronia
<prathu.baronia@oneplus.com> wrote:
>
> THP allocation for anon memory requires zeroing of the huge page. To do so,
> we iterate over 2MB memory in 4KB chunks. Each iteration calls for kmap_atomic()
> and kunmap_atomic(). This routine makes sense where we need temporary mapping of
> the user page. In !HIGHMEM cases, especially on 64-bit architectures, we don't
> need a temp mapping. Hence, kmap_atomic() acts as nothing more than multiple
> barrier() calls.
>
> This calls for optimization. Simply getting the VADDR from the page does the
> job for us. So, implement another (optimized) routine for clear_huge_page()
> which doesn't need a temporary mapping of the user space page.
>
> While testing this patch on a Qualcomm SM8150 SoC (kernel v4.14.117), we see a
> 64% improvement in clear_huge_page().
>
> Ftrace results:
>
> Default profile:
>  ------------------------------------------
>  6) ! 473.802 us  |  clear_huge_page();
>  ------------------------------------------
>
> With this patch applied:
>  ------------------------------------------
>  5) ! 170.156 us  |  clear_huge_page();
>  ------------------------------------------

I suspect that if anything this is really pointing out how much
overhead is being added through process_huge_page. I know that on x86,
most modern processors initialize memory at somewhere between 16B/cycle
and 32B/cycle, with some fixed amount of overhead for making the
rep movsb/stosb call. One thing that might make sense to look at would
be whether we could reduce the number of calls we have to make through
process_huge_page by taking the caches into account. For example, on
x86 the L1 cache is 32K for most processors, so we could look at
bumping things up so that we are processing 8 pages at a time and then
making a call to cond_resched(), instead of doing it per 4K page.

> Signed-off-by: Prathu Baronia <prathu.baronia@oneplus.com>
> Reported-by: Chintan Pandya <chintan.pandya@oneplus.com>
> ---
>  mm/memory.c | 11 +++++++++++
>  1 file changed, 11 insertions(+)
>
> diff --git a/mm/memory.c b/mm/memory.c
> index 3ee073d..3e120e8 100644
> --- a/mm/memory.c
> +++ b/mm/memory.c
> @@ -5119,6 +5119,7 @@ EXPORT_SYMBOL(__might_fault);
>  #endif
>
>  #if defined(CONFIG_TRANSPARENT_HUGEPAGE) || defined(CONFIG_HUGETLBFS)
> +#ifdef CONFIG_HIGHMEM
>  static void clear_gigantic_page(struct page *page,
>                                 unsigned long addr,
>                                 unsigned int pages_per_huge_page)
> @@ -5183,6 +5184,16 @@ void clear_huge_page(struct page *page,
>                                     addr + right_idx * PAGE_SIZE);
>         }
>  }
> +#else
> +void clear_huge_page(struct page *page,
> +                    unsigned long addr_hint, unsigned int pages_per_huge_page)
> +{
> +       void *addr;
> +
> +       addr = page_address(page);
> +       memset(addr, 0, pages_per_huge_page*PAGE_SIZE);
> +}
> +#endif

This seems like a very simplistic solution to the problem, and I am
worried something like this would introduce latency issues when
pages_per_huge_page gets to be large. It might make more sense to just
wrap the process_huge_page call in the original clear_huge_page and
then add this code block as an #else case. That way you avoid
potentially stalling a system for extended periods of time if you
start trying to clear 1G pages with the function.

One interesting data point would be to see what the cost is for
breaking this up into a loop where you only process some fixed number
of pages and running it with cond_resched() so you can avoid
introducing latency spikes.


Thread overview: 14+ messages
2020-04-03  8:18 [RFC] mm/memory.c: Optimizing THP zeroing routine for !HIGHMEM cases Prathu Baronia
2020-04-03  8:52 ` Michal Hocko
2020-04-09 15:29   ` Prathu Baronia
2020-04-09 15:45     ` Michal Hocko
     [not found]       ` <SG2PR04MB2921D2AAA8726318EF53D83691DE0@SG2PR04MB2921.apcprd04.prod.outlook.com>
2020-04-10  9:05         ` Huang, Ying
2020-04-11 15:40           ` Chintan Pandya
2020-04-11 20:47             ` Alexander Duyck
2020-04-13 15:33               ` Prathu Baronia
2020-04-13 16:24                 ` Alexander Duyck
2020-04-14  1:10                 ` Huang, Ying
2020-04-10 18:54 ` Alexander Duyck [this message]
2020-04-11  8:45   ` Chintan Pandya
2020-04-14 15:55     ` Daniel Jordan
2020-04-14 17:33       ` Chintan Pandya
