linux-mm.kvack.org archive mirror
From: Chintan Pandya <chintan.pandya@oneplus.com>
To: "Huang, Ying" <ying.huang@intel.com>
Cc: Michal Hocko <mhocko@suse.com>,
	Prathu Baronia <prathu.baronia@oneplus.com>,
	"akpm@linux-foundation.org" <akpm@linux-foundation.org>,
	"linux-mm@kvack.org" <linux-mm@kvack.org>,
	"gregkh@linuxfoundation.org" <gregkh@linuxfoundation.org>,
	"gthelen@google.com" <gthelen@google.com>,
	"jack@suse.cz" <jack@suse.cz>, Ken Lin <ken.lin@oneplus.com>,
	Gasine Xu <Gasine.Xu@Oneplus.com>
Subject: RE: [RFC] mm/memory.c: Optimizing THP zeroing routine for !HIGHMEM cases
Date: Sat, 11 Apr 2020 15:40:01 +0000	[thread overview]
Message-ID: <SG2PR04MB2921E6D51681B935C0F85EEA91DF0@SG2PR04MB2921.apcprd04.prod.outlook.com> (raw)
In-Reply-To: <87lfn390db.fsf@yhuang-dev.intel.com>

> > Generally, many architectures are optimized for serial loads, be it
> > initialization or access, as it is simplest form of prediction. Any
> > random access pattern would kill that pre-fetching. And for now, I
> > suspect that to be the case here. Probably, we can run more tests to confirm
> this part.
> 
> Please prove your theory with test.  Better to test x86 too.

I wrote the userspace test program below.

Code:

#include <stdio.h>
#include <stdlib.h>
#include <string.h>
#include <sys/time.h>


#define SZ_1M 0x100000
#define SZ_4K 0x1000
#define NUM 100

int main (void)
{
  char *p;
  char *q;
  char *r;

  unsigned long total_pages, total_size;
  unsigned long i, j;
  struct timeval t0, t1, t2, t3;
  int elapsed;

  printf ("Hello World\n");

  total_size = NUM * SZ_1M;
  total_pages = NUM * (SZ_1M / SZ_4K);

  p = malloc (total_size);
  q = malloc (total_size);
  r = malloc (total_size);

  /* So that all pages get allocated */
  memset (r, 0xa, total_size);
  memset (q, 0xa, total_size);
  memset (p, 0xa, total_size);

  gettimeofday (&t0, NULL);

  /* One shot memset */
  memset (r, 0xd, total_size);

  gettimeofday (&t1, NULL);

  /* traverse in forward order */
  for (j = 0; j < total_pages; j++)
    {
      memset (q + (j * SZ_4K), 0xc, SZ_4K);
    }

  gettimeofday (&t2, NULL);

  /* traverse in reverse order */
  for (i = 0; i < total_pages; i++)
    {
      memset (p + total_size - (i + 1) * SZ_4K, 0xb, SZ_4K);
    }

  gettimeofday (&t3, NULL);

  free (p);
  free (q);
  free (r);

  /* Results time */
  elapsed = ((t1.tv_sec - t0.tv_sec) * 1000000) + (t1.tv_usec - t0.tv_usec);
  printf ("One shot: %d micro seconds\n", elapsed);


  elapsed = ((t2.tv_sec - t1.tv_sec) * 1000000) + (t2.tv_usec - t1.tv_usec);
  printf ("Forward order: %d micro seconds\n", elapsed);


  elapsed = ((t3.tv_sec - t2.tv_sec) * 1000000) + (t3.tv_usec - t2.tv_usec);
  printf ("Reverse order: %d micro seconds\n", elapsed);
  return 0;
}
 
------------------------------------------------------------------------------------------------

Results for the ARM64 target (SM8150; CPU0 & CPU6 online, running at max frequency).
All numbers are the mean of 100 iterations; variation is negligible.
- Oneshot : 3389.26 us
- Forward : 8876.16 us
- Reverse : 18157.6 us

Results for x86-64 (Intel(R) Core(TM) i7-8700 CPU @ 3.20GHz; only CPU 0 at max frequency).
All numbers are the mean of 100 iterations; variation is negligible.
- Oneshot : 3203.49 us
- Forward : 5766.46 us
- Reverse : 5187.86 us

To conclude, serial (forward) writes are clearly optimized on the ARM processor. But
strangely, memset in reverse order performs better than forward order quite consistently
across multiple x86 machines. I don't have much insight into x86, so to clarify, I would
restrict my previous suspicion about serial-access prefetching to ARM only.

> 
> Best Regards,
> Huang, Ying



Thread overview: 14+ messages
2020-04-03  8:18 [RFC] mm/memory.c: Optimizing THP zeroing routine for !HIGHMEM cases Prathu Baronia
2020-04-03  8:52 ` Michal Hocko
2020-04-09 15:29   ` Prathu Baronia
2020-04-09 15:45     ` Michal Hocko
     [not found]       ` <SG2PR04MB2921D2AAA8726318EF53D83691DE0@SG2PR04MB2921.apcprd04.prod.outlook.com>
2020-04-10  9:05         ` Huang, Ying
2020-04-11 15:40           ` Chintan Pandya [this message]
2020-04-11 20:47             ` Alexander Duyck
2020-04-13 15:33               ` Prathu Baronia
2020-04-13 16:24                 ` Alexander Duyck
2020-04-14  1:10                 ` Huang, Ying
2020-04-10 18:54 ` Alexander Duyck
2020-04-11  8:45   ` Chintan Pandya
2020-04-14 15:55     ` Daniel Jordan
2020-04-14 17:33       ` Chintan Pandya
