linux-kernel.vger.kernel.org archive mirror
 help / color / mirror / Atom feed
From: Toshi Kani <toshi.kani@hpe.com>
To: Dan Williams <dan.j.williams@intel.com>
Cc: Andrew Morton <akpm@linux-foundation.org>,
	"Kirill A. Shutemov" <kirill.shutemov@linux.intel.com>,
	Matthew Wilcox <willy@linux.intel.com>,
	Ross Zwisler <ross.zwisler@linux.intel.com>,
	mauricio.porto@hpe.com, Linux MM <linux-mm@kvack.org>,
	linux-fsdevel <linux-fsdevel@vger.kernel.org>,
	"linux-nvdimm@lists.01.org" <linux-nvdimm@ml01.01.org>,
	"linux-kernel@vger.kernel.org" <linux-kernel@vger.kernel.org>
Subject: Re: [PATCH] mm: Fix mmap MAP_POPULATE for DAX pmd mapping
Date: Mon, 23 Nov 2015 15:15:03 -0700	[thread overview]
Message-ID: <1448316903.19320.46.camel@hpe.com> (raw)
In-Reply-To: <CAPcyv4gOrc_heKtBRZiiKeywo6Dn2JSTtfKgvse_1siyvd7kTg@mail.gmail.com>

[-- Attachment #1: Type: text/plain, Size: 3929 bytes --]

On Mon, 2015-11-23 at 12:53 -0800, Dan Williams wrote:
> On Mon, Nov 23, 2015 at 12:04 PM, Toshi Kani <toshi.kani@hpe.com> wrote:
> > The following oops was observed when mmap() with MAP_POPULATE
> > pre-faulted pmd mappings of a DAX file.  follow_trans_huge_pmd()
> > expects that a target address has a struct page.
> > 
> >   BUG: unable to handle kernel paging request at ffffea0012220000
> >   follow_trans_huge_pmd+0xba/0x390
> >   follow_page_mask+0x33d/0x420
> >   __get_user_pages+0xdc/0x800
> >   populate_vma_page_range+0xb5/0xe0
> >   __mm_populate+0xc5/0x150
> >   vm_mmap_pgoff+0xd5/0xe0
> >   SyS_mmap_pgoff+0x1c1/0x290
> >   SyS_mmap+0x1b/0x30
> > 
> > Fix it by making the PMD pre-fault handling consistent with PTE.
> > After pre-faulted in faultin_page(), follow_page_mask() calls
> > follow_trans_huge_pmd(), which is changed to call follow_pfn_pmd()
> > for VM_PFNMAP or VM_MIXEDMAP.  follow_pfn_pmd() handles FOLL_TOUCH
> > and returns with -EEXIST.
> 
> As of 4.4.-rc2 DAX pmd mappings are disabled.  So we have time to do
> something more comprehensive in 4.5.

Yes, I noticed during my testing that I could not use pmd...

> > Reported-by: Mauricio Porto <mauricio.porto@hpe.com>
> > Signed-off-by: Toshi Kani <toshi.kani@hpe.com>
> > Cc: Andrew Morton <akpm@linux-foundation.org>
> > Cc: Kirill A. Shutemov <kirill.shutemov@linux.intel.com>
> > Cc: Matthew Wilcox <willy@linux.intel.com>
> > Cc: Dan Williams <dan.j.williams@intel.com>
> > Cc: Ross Zwisler <ross.zwisler@linux.intel.com>
> > ---
> >  mm/huge_memory.c |   34 ++++++++++++++++++++++++++++++++++
> >  1 file changed, 34 insertions(+)
> > 
> > diff --git a/mm/huge_memory.c b/mm/huge_memory.c
> > index d5b8920..f56e034 100644
> > --- a/mm/huge_memory.c
> > +++ b/mm/huge_memory.c
> [..]
> > @@ -1288,6 +1315,13 @@ struct page *follow_trans_huge_pmd(struct
> > vm_area_struct *vma,
> >         if ((flags & FOLL_NUMA) && pmd_protnone(*pmd))
> >                 goto out;
> > 
> > +       /* pfn map does not have a struct page */
> > +       if (vma->vm_flags & (VM_PFNMAP | VM_MIXEDMAP)) {
> > +               ret = follow_pfn_pmd(vma, addr, pmd, flags);
> > +               page = ERR_PTR(ret);
> > +               goto out;
> > +       }
> > +
> >         page = pmd_page(*pmd);
> >         VM_BUG_ON_PAGE(!PageHead(page), page);
> >         if (flags & FOLL_TOUCH) {
> 
> I think it is already problematic that dax pmd mappings are getting
> confused with transparent huge pages.  

We had the same issue with dax pte mapping [1], and this change extends the pfn
map handling to pmd.  So, this problem is not specific to pmd.

[1] https://lkml.org/lkml/2015/6/23/181

> They're more closely related to
> a hugetlbfs pmd mappings in that they are mapping an explicit
> allocation.  I have some pending patches to address this dax-pmd vs
> hugetlb-pmd vs thp-pmd classification that I will post shortly.

Not sure which way is better, but I am certainly interested in your changes.

> By the way, I'm collecting DAX pmd regression tests [1], is this just
> a simple crash upon using MAP_POPULATE?
> 
> [1]: https://github.com/pmem/ndctl/blob/master/lib/test-dax-pmd.c

Yes, this issue is easy to reproduce with MAP_POPULATE.  In case it helps,
attached are the test I used for testing the patches.  Sorry, the code is messy
since it was only intended for my internal use...

 - The test was originally written for the pte change [1] and comments in
test.sh (ex. mlock fail, ok) reflect the results without the pte change.
 - For the pmd test, I modified test-mmap.c to call posix_memalign() before
mmap().  By calling free(), the 2MB-aligned address from posix_memalign() can be
used for mmap().  This keeps the mmap'd address aligned on 2MB.
 - I created test file(s) with dd (i.e. all blocks written) in my test.
 - The other infinite loop issue (fixed by my other patch) was found by the test
case with option "-LMSr".

Thanks,
-Toshi

[-- Attachment #2: test.sh --]
[-- Type: application/x-shellscript, Size: 2586 bytes --]

[-- Attachment #3: test-mmap.c --]
[-- Type: text/x-csrc, Size: 4334 bytes --]

#include <sys/types.h>
#include <sys/stat.h>
#include <sys/mman.h>
#include <sys/time.h>
#include <string.h>
#include <fcntl.h>
#include <stdio.h>
#include <stdlib.h>
#include <unistd.h>

#define MB(a)		((a) * 1024UL * 1024UL)

static struct timeval start_tv, stop_tv;

// Calculate the difference between two time values.
void tvsub(struct timeval *tdiff, struct timeval *t1, struct timeval *t0)
{
	tdiff->tv_sec = t1->tv_sec - t0->tv_sec;
	tdiff->tv_usec = t1->tv_usec - t0->tv_usec;
	if (tdiff->tv_usec < 0)
		tdiff->tv_sec--, tdiff->tv_usec += 1000000;
}

// Start timing now.
void start()
{
	(void) gettimeofday(&start_tv, (struct timezone *) 0);
}

// Stop timing and return real time in microseconds.
unsigned long long stop()
{
	struct timeval tdiff;

	(void) gettimeofday(&stop_tv, (struct timezone *) 0);
	tvsub(&tdiff, &stop_tv, &start_tv);
	return (tdiff.tv_sec * 1000000 + tdiff.tv_usec);
}

void test_write(unsigned long *p, size_t size)
{
	int i;
	unsigned long *wp, tmp;
	unsigned long long timeval;

	start();
	for (i=0, wp=p; i<(size/sizeof(wp)); i++)
		*wp++ = 1;
	timeval = stop();
	printf("Write: %10llu usec\n", timeval);
}

void test_read(unsigned long *p, size_t size)
{
	int i;
	unsigned long *wp, tmp;
	unsigned long long timeval;

	start();
	for (i=0, wp=p; i<(size/sizeof(wp)); i++)
		tmp = *wp++;
	timeval = stop();
	printf("Read : %10llu usec\n", timeval);
}

int main(int argc, char **argv)
{
	int fd, i, opt, ret;
	int oflags, mprot, mflags = 0;
	int is_read_only = 0, is_mlock = 0, is_mlockall = 0;
	int mlock_skip = 0, read_test = 0, write_test = 0;
	void *mptr = NULL;
	unsigned long *p;
	struct stat stat;
	size_t size, cpy_size;
	const char *file_name = NULL;

	while ((opt = getopt(argc, argv, "LRMSApsrw")) != -1) {
		switch (opt) {
		case 'L':
			file_name = "/mnt/pmem0/4Gfile";
			break;
		case 'R':
			printf("> mmap: read-only\n");
			is_read_only = 1;
			break;
		case 'M':
			printf("> mlock\n");
			is_mlock = 1;
			break;
		case 'S':
			printf("> mlock - skip first ite\n");
			mlock_skip = 1;
			break;
		case 'A':
			printf("> mlockall\n");
			is_mlockall = 1;
			break;
		case 'p':
			printf("> MAP_POPULATE\n");
			mflags |= MAP_POPULATE;
			break;
		case 's':
			printf("> MAP_SHARED\n");
			mflags |= MAP_SHARED;
			break;
		case 'r':
			printf("> read-test\n");
			read_test = 1;
			break;
		case 'w':
			printf("> write-test\n");
			write_test = 1;
			break;
		}
	}

	if (!file_name) {
		file_name = "/mnt/pmem1/32Kfile";
	}

	if (!(mflags & MAP_SHARED)) {
		printf("> MAP_PRIVATE\n");
		mflags |= MAP_PRIVATE;
	}

	if (is_read_only) {
		oflags = O_RDONLY;
		mprot = PROT_READ;
	} else {
		oflags = O_RDWR;
		mprot = PROT_READ|PROT_WRITE;
	}

	fd = open(file_name, oflags);
	if (fd == -1) {
		perror("open failed");
		exit(1);
	}

	ret = fstat(fd, &stat);
	if (ret < 0) {
		perror("fstat failed");
		exit(1);
	}
	size = stat.st_size;

	printf("> open %s size 0x%x flags 0x%x\n", file_name, size, oflags);

	ret = posix_memalign(&mptr, MB(2), size);
	if (ret ==0)
		free(mptr);

	printf("> mmap mprot 0x%x flags 0x%x\n", mprot, mflags);
	p = mmap(mptr, size, mprot, mflags, fd, 0x0);
	if (!p) {
		perror("mmap failed");
		exit(1);
	}
	if ((long unsigned)p & (MB(2)-1))
		printf("> mmap: NOT 2MB aligned: 0x%p\n", p);
	else
		printf("> mmap: 2MB aligned: 0x%p\n", p);

#if 0	/* SIZE LIMIT */
	if (size >= MB(2))
		cpy_size = MB(32);
	else
#endif
		cpy_size = size;

	for (i=0; i<3; i++) {

		if (is_mlock && !mlock_skip) {
			printf("> mlock 0x%p\n", p);
			ret = mlock(p, size);
			if (ret < 0) {
				perror("mlock failed");
				exit(1);
			}
		} else if (is_mlockall) {
			printf("> mlockall\n");
			ret = mlockall(MCL_CURRENT|MCL_FUTURE);
			if (ret < 0) {
				perror("mlockall failed");
				exit(1);
			}
		}

		printf("===== %d =====\n", i+1);
		if (write_test)
			test_write(p, cpy_size);
		if (read_test)
			test_read(p, cpy_size);

		if (is_mlock && !mlock_skip) {
			printf("> munlock 0x%p\n", p);
			ret = munlock(p, size);
			if (ret < 0) {
				perror("munlock failed");
				exit(1);
			}
		} else if (is_mlockall) {
			printf("> munlockall\n");
			ret = munlockall();
			if (ret < 0) {
				perror("munlockall failed");
				exit(1);
			}
		}

		/* skip, if requested, only the first iteration */
		mlock_skip = 0;
	}

	printf("> munmap 0x%p\n", p);
	munmap(p, size);
}

  reply	other threads:[~2015-11-23 22:19 UTC|newest]

Thread overview: 24+ messages / expand[flat|nested]  mbox.gz  Atom feed  top
2015-11-23 20:04 [PATCH] mm: Fix mmap MAP_POPULATE for DAX pmd mapping Toshi Kani
2015-11-23 20:53 ` Dan Williams
2015-11-23 22:15   ` Toshi Kani [this message]
2015-11-30 22:08 ` Dan Williams
2015-12-02  2:19   ` Toshi Kani
2015-12-02  3:45     ` Dan Williams
2015-12-02 17:43       ` Toshi Kani
2015-12-02 17:01         ` Dan Williams
2015-12-02 18:06           ` Dan Williams
2015-12-02 19:26             ` Toshi Kani
2015-12-02 19:00               ` Dan Williams
2015-12-02 20:02                 ` Toshi Kani
2015-12-02 20:12                   ` Toshi Kani
2015-12-02 19:57                     ` Dan Williams
2015-12-02 21:37                       ` Toshi Kani
2015-12-02 20:54                         ` Dan Williams
2015-12-02 21:55                           ` Toshi Kani
2015-12-03 23:43                             ` Dan Williams
2015-12-04 16:55                               ` Toshi Kani
2015-12-02 22:00                           ` Dave Hansen
2015-12-02 22:03                             ` Dan Williams
2015-12-02 22:09                               ` Dave Hansen
2015-12-03  0:21         ` Toshi Kani
2015-12-02 23:33           ` Dan Williams

Reply instructions:

You may reply publicly to this message via plain-text email
using any one of the following methods:

* Save the following mbox file, import it into your mail client,
  and reply-to-all from there: mbox

  Avoid top-posting and favor interleaved quoting:
  https://en.wikipedia.org/wiki/Posting_style#Interleaved_style

* Reply using the --to, --cc, and --in-reply-to
  switches of git-send-email(1):

  git send-email \
    --in-reply-to=1448316903.19320.46.camel@hpe.com \
    --to=toshi.kani@hpe.com \
    --cc=akpm@linux-foundation.org \
    --cc=dan.j.williams@intel.com \
    --cc=kirill.shutemov@linux.intel.com \
    --cc=linux-fsdevel@vger.kernel.org \
    --cc=linux-kernel@vger.kernel.org \
    --cc=linux-mm@kvack.org \
    --cc=linux-nvdimm@ml01.01.org \
    --cc=mauricio.porto@hpe.com \
    --cc=ross.zwisler@linux.intel.com \
    --cc=willy@linux.intel.com \
    /path/to/YOUR_REPLY

  https://kernel.org/pub/software/scm/git/docs/git-send-email.html

* If your mail client supports setting the In-Reply-To header
  via mailto: links, try the mailto: link
Be sure your reply has a Subject: header at the top and a blank line before the message body.
This is a public inbox, see mirroring instructions
for how to clone and mirror all data and code used for this inbox;
as well as URLs for NNTP newsgroup(s).