Re: [PATCH v2] mm: Optimized hugepage zeroing & copying from user

From: Will Deacon <will@kernel.org>
To: Vlastimil Babka <vbabka@suse.cz>
Cc: Prathu Baronia <prathu.baronia@oneplus.com>,
	catalin.marinas@arm.com, alexander.duyck@gmail.com,
	chintan.pandya@oneplus.com, mhocko@suse.com,
	akpm@linux-foundation.org, linux-mm@kvack.org,
	gregkh@linuxfoundation.com, gthelen@google.com, jack@suse.cz,
	ken.lin@oneplus.com, gasine.xu@oneplus.com, ying.huang@intel.com,
	mark.rutland@arm.com
Subject: Re: [PATCH v2] mm: Optimized hugepage zeroing & copying from user
Date: Wed, 22 Apr 2020 12:19:29 +0100	[thread overview]
Message-ID: <20200422111928.GA32051@willie-the-truck> (raw)
In-Reply-To: <20200422081852.GB29541@willie-the-truck>

On Wed, Apr 22, 2020 at 09:18:52AM +0100, Will Deacon wrote:
> On Tue, Apr 21, 2020 at 03:48:04PM +0200, Vlastimil Babka wrote:
> > On 4/21/20 3:39 PM, Will Deacon wrote:
> > > On Tue, Apr 21, 2020 at 02:48:04PM +0200, Vlastimil Babka wrote:
> > >> On 4/21/20 2:47 PM, Vlastimil Babka wrote:
> > >> > 
> > >> > It was suspected that current Intel can prefetch forward and backwards, and the
> > >> > tested ARM64 microarchitecture only backwards, can it be true? The current code
> > >> 
> > >> Oops, tested ARM64 microarchitecture I meant "only forwards".
> > > 
> > > I'd be surprised if that's the case, but it could be that there's an erratum
> > > workaround in play which hampers the prefetch behaviour. We generally try
> > > not to assume too much about the prefetcher on arm64 because they're not
> > > well documented and vary wildly between different micro-architectures.
> > 
> > Yeah it's probably not as simple as I thought, as the test code [1] shows the
> > page iteration goes backwards, but per-page memsets are not special. So maybe
> > it's not hardware specifics, but x86 memtest implementation is also done
> > backwards, so it fits the backwards outer loop, but arm64 memset is forward, so
> > the resulting pattern is non-linear?
> 
> A straightforward linear prefetcher would probably be defeated by that sort
> of thing, yes, but I'd have thought that the recent CPUs (e.g. A76 which I
> think is the "big" CPU in the SoC mentioned at the start of the thread)
> would still have a fighting chance at prefetching based on non-linear
> histories.
> 
> However, to my earlier point, we're making this more difficult than it needs
> to be for the hardware and we shouldn't assume that all prefetchers will
> handle it gracefully, so keeping the core code relatively straightforward
> does seem to be the best bet. Alarm bells just rang initially when it
> appeared that we were optimising code under arch/arm64 rather than improving
> the core code, but I now have a better picture of what's going on (thanks).
> 
> Alternatively, we could switch our memset() around, but I'm worried that
> we could end up hurting something else by doing that. I guess we could add a
> memset_backwards() version if we *had* to...
> 
> > In that case it's also a question if the measurement was done in kernel or
> > userspace, and if userspace memset have any implications for kernel memset...
> 
> Sounds like it was done in userspace. If I get a chance later on, I'll try
> to give it a spin here on some of the boards I have kicking around.

I wrote the silly harness below for the snippets given in [1] but I can't
see any difference between the forwards and backwards versions on any arm64
systems I have access to.

Will

[1] https://lore.kernel.org/linux-mm/20200414153829.GA15230@oneplus.com/

--->8

#if 0
 /* One shot memset */
  memset (r, 0xd, total_size);

  /* traverse in forward order */
  for (j = 0; j < total_pages; j++)
    {
      memset (q + (j * SZ_4K), 0xc, SZ_4K);
    }

  /* traverse in reverse order */
  for (i = 0; i < total_pages; i++)
    {
      memset (p + total_size - (i + 1) * SZ_4K, 0xb, SZ_4K);
    }
#endif

#include <errno.h>
#include <stdio.h>
#include <stdlib.h>
#include <string.h>
#include <time.h>
#include <unistd.h>

#define BUF_SZ		(1UL*1024*1024*1024)
#define PAGE_SIZE	0x1000
#define BUF_SZ_PAGES	(BUF_SZ / PAGE_SIZE)
#define NSECS_PER_SEC	1000000000ULL

static void do_the_thing_forwards(void *buf)
{
	unsigned long i;

	for (i = 0; i < BUF_SZ_PAGES; i++)
		memset(buf + (i * PAGE_SIZE), 0xc, PAGE_SIZE);
}

static void do_the_thing_backwards(void *buf)
{
	unsigned long i;

	for (i = 0; i < BUF_SZ_PAGES; i++)
		memset(buf + BUF_SZ - (i + 1) * PAGE_SIZE, 0xc, PAGE_SIZE);
}

int main(void)
{
	void *buf;
	unsigned long long delta;
	struct timespec ts_start, ts_end;

	if (posix_memalign(&buf, PAGE_SIZE, BUF_SZ)) {
		perror("posix_memalign()");
		return -1;
	}

	memset(buf, 0xd, BUF_SZ);

	if (clock_gettime(CLOCK_MONOTONIC_RAW, &ts_start)) {
		perror("clock_gettime()");
		return -1;
	}

	do_the_thing_forwards(buf);
	do_the_thing_forwards(buf);
	do_the_thing_forwards(buf);
	do_the_thing_forwards(buf);
	do_the_thing_forwards(buf);

	if (clock_gettime(CLOCK_MONOTONIC_RAW, &ts_end)) {
		perror("clock_gettime()");
		return -1;
	}

	delta = NSECS_PER_SEC * (ts_end.tv_sec - ts_start.tv_sec);
	delta += (ts_end.tv_nsec - ts_start.tv_nsec);
	printf("Forwards: took %f seconds\n", (double)(delta / (double)NSECS_PER_SEC));

	if (clock_gettime(CLOCK_MONOTONIC_RAW, &ts_start)) {
		perror("clock_gettime()");
		return -1;
	}

	do_the_thing_backwards(buf);
	do_the_thing_backwards(buf);
	do_the_thing_backwards(buf);
	do_the_thing_backwards(buf);
	do_the_thing_backwards(buf);

	if (clock_gettime(CLOCK_MONOTONIC_RAW, &ts_end)) {
		perror("clock_gettime()");
		return -1;
	}

	delta = NSECS_PER_SEC * (ts_end.tv_sec - ts_start.tv_sec);
	delta += (ts_end.tv_nsec - ts_start.tv_nsec);
	printf("Backwards: took %f seconds\n", (double)(delta / (double)NSECS_PER_SEC));

	return 0;
}