linux-kernel.vger.kernel.org archive mirror
* [Question] Missing data after DMA read transfer - mm issue with transparent huge page?
@ 2016-04-29  8:01 Nicolas Morey Chaisemartin
  2016-05-03  4:04 ` Hugh Dickins
  0 siblings, 1 reply; 16+ messages in thread
From: Nicolas Morey Chaisemartin @ 2016-04-29  8:01 UTC (permalink / raw)
  To: linux-kernel

Hi everyone,

This is a repost from a different address, as it seems the previous one ended up in Gmail's junk folder due to a domain error.
I added more info found while blindly debugging the issue.

Short version:
I'm having an issue with direct DMA transfer from a device to host memory.
It seems some of the data is not transferring to the appropriate page.

Some more details:
I'm debugging a home-made PCI driver for our board (Kalray), attached to an x86_64 host running CentOS 7 (3.10.0-327.el7.x86_64).

In the current case, a userland application transfers data back and forth through read/write operations on a file.
On the kernel side, this triggers DMA transfers over PCI to/from our board's memory.

We followed what pretty much all docs said about direct I/O to user buffers:

1) get_user_pages() (in the current case, it's at most 16 pages at once)
2) convert to a scatterlist
3) pci_map_sg
4) possibly coalesce sg entries (Intel IOMMU is enabled, so it's usually possible)
5) A lot of DMA engine handling code, using the dmaengine layer and virt-dma
6) wait for the transfer to complete; in the meantime, go back to (1) to schedule more work, if any
7) pci_unmap_sg
8) for read (card2host) transfers, set_page_dirty_lock
9) page_cache_release

In 99.9999% of cases this works perfectly.
However, I have one userland application where a few pages are not written by a read (card2host) transfer.
The buffer is memset to a different value beforehand, so I can check whether those pages have been overwritten.

I know (PCI protocol analyser) that the data left our board for the "right" address (the one set in the sg by pci_map_sg).
I tried reading the data between the pci_unmap_sg and the set_page_dirty, using
        uint32_t *addr = page_address(trans->pages[0]);
        dev_warn(&pdata->pdev->dev, "val = %x\n", *addr);
and it has the expected value.
But if I try to copy_from_user (using the address coming from userland, the one passed to get_user_pages), the data has not been written and I see the memset value.

New information:

The issue happens with IOMMU on or off.
I compiled a kernel with DMA_API_DEBUG enabled and got no warnings or errors.

I dug a little deeper with my very limited understanding of Linux mm and discovered that:
 * we are using transparent huge pages
 * the pages 'not transferred' are the last few of a huge page
More precisely:
- We have several transfers in flight from the same user buffer
- Each transfer is 16 pages long
- At some point, we start transferring from another huge page (transfers are still in flight from the previous one)
- When a transfer from the previous huge page completes, I dumped the mapcount of the pages from the previous transfers:
  they are all 0. The pages are still mapped for DMA at this point.
- A get_user_pages() call on the address of the completed transfer returns a different struct page * than the one I had.
But this is before I have unmapped the pages and put them back. From my understanding this should not have happened.
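
To make the "last few pages of a huge page" observation concrete, here is a minimal userland sketch of the arithmetic involved. It assumes the usual x86_64 geometry (4 KiB base pages, 2 MiB transparent huge pages, so 512 base pages per huge page) and the 16-page transfer size mentioned above; the helper names are hypothetical and do not come from the driver.

```c
#include <assert.h>
#include <stdint.h>

/* Assumed x86_64 geometry: 4 KiB base pages, 2 MiB transparent huge pages. */
#define BASE_PAGE_SIZE  4096UL
#define HUGE_PAGE_SIZE  (2UL * 1024 * 1024)
#define PAGES_PER_HUGE  (HUGE_PAGE_SIZE / BASE_PAGE_SIZE)  /* 512 */
#define PAGES_PER_XFER  16UL                               /* from the report */

/* Index (0..511) of the base page holding vaddr within its huge page. */
static unsigned long subpage_index(uintptr_t vaddr)
{
    return (vaddr & (HUGE_PAGE_SIZE - 1)) / BASE_PAGE_SIZE;
}

/* True if a 16-page transfer starting at vaddr ends exactly at the end
 * of its huge page, i.e. it covers subpages 496..511. */
static int is_last_transfer_in_huge_page(uintptr_t vaddr)
{
    return subpage_index(vaddr) + PAGES_PER_XFER == PAGES_PER_HUGE;
}

/* True if a 16-page transfer starting at vaddr straddles two huge pages. */
static int crosses_huge_page(uintptr_t vaddr)
{
    return subpage_index(vaddr) + PAGES_PER_XFER > PAGES_PER_HUGE;
}
```

With these assumptions a huge page holds 32 transfers of 16 pages each, and logging subpage_index() for each failing transfer would show whether the failures are confined to the final transfer of a huge page.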

I tried the same code with a 4.5 kernel and encountered the same issue.

Disabling transparent huge pages makes the issue disappear.
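
For narrowing the problem down without disabling THP system-wide, a userland application can opt a single buffer out of THP with madvise(MADV_NOHUGEPAGE). The sketch below is only an illustration of that per-buffer opt-out: the function name is hypothetical, the buffer is allocated with mmap() rather than however the real application allocates it, and this is a diagnostic workaround, not a fix for whatever the root cause turns out to be.

```c
#define _GNU_SOURCE
#include <assert.h>
#include <stddef.h>
#include <sys/mman.h>

/*
 * Hypothetical helper: map an anonymous buffer and ask the kernel not to
 * back it with transparent huge pages. The madvise() call is made before
 * the buffer is first touched, so pages are faulted in as base pages.
 *
 * Returns the buffer, or NULL if the mapping failed. *thp_opted_out is set
 * to 1 if madvise() succeeded and 0 otherwise (for example on a kernel
 * built without THP support, where the opt-out is unnecessary anyway).
 */
static void *alloc_buffer_no_thp(size_t len, int *thp_opted_out)
{
    assert(len > 0);

    void *buf = mmap(NULL, len, PROT_READ | PROT_WRITE,
                     MAP_PRIVATE | MAP_ANONYMOUS, -1, 0);
    if (buf == MAP_FAILED)
        return NULL;

    *thp_opted_out = (madvise(buf, len, MADV_NOHUGEPAGE) == 0);
    return buf;
}
```

If the missing pages disappear when only the DMA target buffer is opted out, that points at THP handling of that specific mapping rather than at a system-wide side effect such as compaction.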

Thanks in advance

Nicolas


* Re: [Question] Missing data after DMA read transfer - mm issue with transparent huge page?
  2016-04-29  8:01 [Question] Missing data after DMA read transfer - mm issue with transparent huge page? Nicolas Morey Chaisemartin
@ 2016-05-03  4:04 ` Hugh Dickins
  2016-05-03 10:11   ` Jerome Glisse
  0 siblings, 1 reply; 16+ messages in thread
From: Hugh Dickins @ 2016-05-03  4:04 UTC (permalink / raw)
  To: Nicolas Morey Chaisemartin
  Cc: Mel Gorman, Andrea Arcangeli, Kirill A. Shutemov,
	Kirill A. Shutemov, Jerome Glisse, Alex Williamson,
	One Thousand Gnomes, linux-kernel, linux-mm

On Fri, 29 Apr 2016, Nicolas Morey Chaisemartin wrote:

> Hi everyone,
> 
> This is a repost from a different address as it seems the previous one ended in Gmail junk due to a domain error..

linux-kernel is a very high volume list which few are reading:
that also will account for your lack of response so far
(apart from the indefatigable Alan).

I've added linux-mm, and some people from another thread regarding
THP and get_user_pages() pins which has been discussed in recent days.

Make no mistake, the issue you're raising here is definitely not the
same as that one (which is specifically about the new THP refcounting
in v4.5+, whereas you're reporting a problem you've seen in both a
v3.10-based kernel and in v4.5).  But I think their heads are in
gear, much more so than mine, and likely to spot something.

It does look to me as if pages are being migrated, despite being pinned
by get_user_pages(): and that would be wrong.  Originally I intended
to suggest that THP is probably merely the cause of compaction, with
compaction causing the page migration.  But you posted very interesting
details in an earlier mail on 27th April from <nmorey@kalray.eu>:

> I ran some more tests:
> 
> * Test is OK if transparent huge pages are disabled
> 
> * For all the pages where data is not transferred, and only those pages, a call to get_user_pages(user vaddr) just before dma_unmap_sg returns a different page from the original one.
> [436477.927279] mppa 0000:03:00.0: org_page= ffffea0009f60080 cur page = ffffea00074e0080
> [436477.927298] page:ffffea0009f60080 count:0 mapcount:1 mapping:          (null) index:0x2
> [436477.927314] page flags: 0x2fffff00008000(tail)
> [436477.927354] page dumped because: org_page
> [436477.927369] page:ffffea00074e0080 count:0 mapcount:1 mapping:          (null) index:0x2
> [436477.927382] page flags: 0x2fffff00008000(tail)
> [436477.927421] page dumped because: cur_page
> 
> I'm not sure what to make of this...

That (on the older kernel I think) seems clearly to show that a THP
itself has been migrated: which makes me suspect NUMA migration of
misplaced THPs - migrate_misplaced_transhuge_page().  I'd hoped to
find something obviously wrong there, but haven't quite managed
to bring my brain fully to bear on it, and hope the others Cc'ed
will do so more quickly (or spot the error of your ways instead).
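
The two struct page pointers in the dump above can be turned into page frame numbers with a little arithmetic. The sketch below assumes the classic x86_64 layout where the vmemmap array starts at 0xffffea0000000000 and sizeof(struct page) is 64 bytes; both are typical for this kind of kernel but depend on the configuration, so treat the results as an estimate. The helper names are hypothetical.

```c
#include <assert.h>
#include <stdint.h>

/* Assumptions, not taken from the thread: vmemmap base for x86_64 with
 * 4-level paging, and a 64-byte struct page. */
#define VMEMMAP_BASE      0xffffea0000000000ULL
#define STRUCT_PAGE_SIZE  64ULL
#define BASE_PAGE_SHIFT   12      /* 4 KiB base pages */
#define PAGES_PER_THP     512ULL  /* 2 MiB huge page / 4 KiB */

/* Estimate the page frame number from a struct page address. */
static uint64_t struct_page_to_pfn(uint64_t struct_page_addr)
{
    assert(struct_page_addr >= VMEMMAP_BASE);
    return (struct_page_addr - VMEMMAP_BASE) / STRUCT_PAGE_SIZE;
}

/* Physical address of the first byte of the page frame. */
static uint64_t pfn_to_phys(uint64_t pfn)
{
    return pfn << BASE_PAGE_SHIFT;
}

/* Offset of the frame inside a 2 MiB-aligned block of 512 frames. */
static uint64_t thp_subpage(uint64_t pfn)
{
    return pfn % PAGES_PER_THP;
}
```

Under those assumptions, org_page (ffffea0009f60080) is pfn 0x27d802 at physical address 0x27d802000, and cur_page (ffffea00074e0080) is pfn 0x1d3802 at physical address 0x1d3802000. Both sit at offset 2 inside a 2 MiB-aligned block, which is consistent with the same tail page of a huge page now living at a different physical location. Whether the two locations belong to different NUMA nodes cannot be told from the dump alone.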

I do find it suspect, how the migrate_page_copy() is done rather
early, while the old page is still mapped in the pagetable.  And
odd how it inserts the new pmd for a moment, before checking old
page_count and backing out.  But I don't see how either of those
would cause the trouble you see, where the migration goes ahead.

But I may be mistaken to suspect migration at all: perhaps this is
about Copy-On-Write: there's no concurrent fork()ing, is there?

And I think your driver is using get_user_pages() (under mmap_sem),
not short-cutting with the trickier get_user_pages_fast().

Over to more clued-in Cc's.

Hugh


* Re: [Question] Missing data after DMA read transfer - mm issue with transparent huge page?
  2016-05-03  4:04 ` Hugh Dickins
@ 2016-05-03 10:11   ` Jerome Glisse
  2016-05-03 11:03     ` Kirill A. Shutemov
       [not found]     ` <07619be9-e812-5459-26dd-ceb8c6490520@morey-chaisemartin.com>
  0 siblings, 2 replies; 16+ messages in thread
From: Jerome Glisse @ 2016-05-03 10:11 UTC (permalink / raw)
  To: Hugh Dickins
  Cc: Nicolas Morey Chaisemartin, Mel Gorman, Andrea Arcangeli,
	Kirill A. Shutemov, Kirill A. Shutemov, Alex Williamson,
	One Thousand Gnomes, linux-kernel, linux-mm

On Mon, May 02, 2016 at 09:04:02PM -0700, Hugh Dickins wrote:
> I do find it suspect, how the migrate_page_copy() is done rather
> early, while the old page is still mapped in the pagetable.  And
> odd how it inserts the new pmd for a moment, before checking old
> page_count and backing out.  But I don't see how either of those
> would cause the trouble you see, where the migration goes ahead.

So I do not think there is a bug in migrate_misplaced_transhuge_page(),
but I think something is wrong in it; see the attached patch. I still
want to convince myself I am not missing anything before posting
that one.


Now about this bug: dumb question, but do you call get_user_pages() with
write = 1? If your device is writing to the pages, then you
must set write to 1.

get_user_pages(vaddr, nrpages, 1, 0|1, pages, NULL|vmas);


Cheers,
Jérôme

[-- Attachment #2: 0001-mm-numa-thp-fix-assumptions-of-migrate_misplaced_tra.patch --]
[-- Type: text/plain, Size: 3627 bytes --]

>From 9ded2a5da75a5e736fb36a2c4e2511d9516ecc37 Mon Sep 17 00:00:00 2001
From: =?UTF-8?q?J=C3=A9r=C3=B4me=20Glisse?= <jglisse@redhat.com>
Date: Tue, 3 May 2016 11:53:24 +0200
Subject: [PATCH] mm/numa/thp: fix assumptions of
 migrate_misplaced_transhuge_page()
MIME-Version: 1.0
Content-Type: text/plain; charset=UTF-8
Content-Transfer-Encoding: 8bit

Fix assumptions in migrate_misplaced_transhuge_page(), which is only
called by do_huge_pmd_numa_page(), itself only called by __handle_mm_fault()
for a pmd with PROT_NONE. This means that if the pmd stays the same
then there can be no concurrent get_user_pages / get_user_pages_fast
(GUP/GUP_fast). Moreover, because migrate_misplaced_transhuge_page()
only does something if the page is mapped once, there can be no GUP from
a different process. Finally, holding the pmd lock assures us that no
other part of the kernel will take an extra reference on the page.

In the end this means that the failure code path should never be
taken unless something is horribly wrong, so convert it to BUG_ON().

Signed-off-by: Jérôme Glisse <jglisse@redhat.com>
Cc: Kirill A. Shutemov <kirill.shutemov@linux.intel.com>
Cc: Mel Gorman <mgorman@suse.de>
Cc: Hugh Dickins <hughd@google.com>
Cc: Andrea Arcangeli <aarcange@redhat.com>
---
 mm/migrate.c | 31 +++++++++++++++++++++----------
 1 file changed, 21 insertions(+), 10 deletions(-)

diff --git a/mm/migrate.c b/mm/migrate.c
index 6c822a7..6315aac 100644
--- a/mm/migrate.c
+++ b/mm/migrate.c
@@ -1757,6 +1757,14 @@ int migrate_misplaced_transhuge_page(struct mm_struct *mm,
 	pmd_t orig_entry;
 
 	/*
+	 * What we do here is only valid if pmd_protnone(entry) is true and it
+	 * is map in only one vma numamigrate_isolate_page() takes care of that
+	 * check.
+	 */
+	if (!pmd_protnone(entry))
+		goto out_unlock;
+
+	/*
 	 * Rate-limit the amount of data that is being migrated to a node.
 	 * Optimal placement is no good if the memory bus is saturated and
 	 * all the time is being spent migrating!
@@ -1797,7 +1805,6 @@ int migrate_misplaced_transhuge_page(struct mm_struct *mm,
 	mmu_notifier_invalidate_range_start(mm, mmun_start, mmun_end);
 	ptl = pmd_lock(mm, pmd);
 	if (unlikely(!pmd_same(*pmd, entry) || page_count(page) != 2)) {
-fail_putback:
 		spin_unlock(ptl);
 		mmu_notifier_invalidate_range_end(mm, mmun_start, mmun_end);
 
@@ -1819,7 +1826,12 @@ fail_putback:
 		goto out_unlock;
 	}
 
-	orig_entry = *pmd;
+	/*
+	 * We are holding the lock so no one can set a new pmd and original pmd
+	 * is PROT_NONE thus no one can get_user_pages or get_user_pages_fast
+	 * (GUP or GUP_fast) from this point on we can not fail.
+	 */
+	orig_entry = entry;
 	entry = mk_pmd(new_page, vma->vm_page_prot);
 	entry = pmd_mkhuge(entry);
 	entry = maybe_pmd_mkwrite(pmd_mkdirty(entry), vma);
@@ -1837,14 +1849,13 @@ fail_putback:
 	set_pmd_at(mm, mmun_start, pmd, entry);
 	update_mmu_cache_pmd(vma, address, &entry);
 
-	if (page_count(page) != 2) {
-		set_pmd_at(mm, mmun_start, pmd, orig_entry);
-		flush_pmd_tlb_range(vma, mmun_start, mmun_end);
-		mmu_notifier_invalidate_range(mm, mmun_start, mmun_end);
-		update_mmu_cache_pmd(vma, address, &entry);
-		page_remove_rmap(new_page, true);
-		goto fail_putback;
-	}
+	/* As said above no one can get reference on the old page nor through
+	 * get_user_pages or get_user_pages_fast (GUP/GUP_fast) or through
+	 * any other means. To get reference on huge page you need to hold
+	 * pmd_lock and we are already holding that lock here and the page
+	 * is only mapped once.
+	 */
+	BUG_ON(page_count(page) != 2);
 
 	mlock_migrate_page(new_page, page);
 	page_remove_rmap(page, true);
-- 
2.1.0



* Re: [Question] Missing data after DMA read transfer - mm issue with transparent huge page?
  2016-05-03 10:11   ` Jerome Glisse
@ 2016-05-03 11:03     ` Kirill A. Shutemov
       [not found]     ` <07619be9-e812-5459-26dd-ceb8c6490520@morey-chaisemartin.com>
  1 sibling, 0 replies; 16+ messages in thread
From: Kirill A. Shutemov @ 2016-05-03 11:03 UTC (permalink / raw)
  To: Jerome Glisse
  Cc: Hugh Dickins, Nicolas Morey Chaisemartin, Mel Gorman,
	Andrea Arcangeli, Kirill A. Shutemov, Alex Williamson,
	One Thousand Gnomes, linux-kernel, linux-mm

On Tue, May 03, 2016 at 12:11:54PM +0200, Jerome Glisse wrote:
> So I do not think there is a bug in migrate_misplaced_transhuge_page(),
> but I think something is wrong in it; see the attached patch. I still
> want to convince myself I am not missing anything before posting
> that one.
> 
> 
> Now about this bug: dumb question, but do you call get_user_pages() with
> write = 1? If your device is writing to the pages, then you
> must set write to 1.
> 
> get_user_pages(vaddr, nrpages, 1, 0|1, pages, NULL|vmas);
> 
> 
> Cheers,
> Jérôme

> From 9ded2a5da75a5e736fb36a2c4e2511d9516ecc37 Mon Sep 17 00:00:00 2001
> From: =?UTF-8?q?J=C3=A9r=C3=B4me=20Glisse?= <jglisse@redhat.com>
> Date: Tue, 3 May 2016 11:53:24 +0200
> Subject: [PATCH] mm/numa/thp: fix assumptions of
>  migrate_misplaced_transhuge_page()
> MIME-Version: 1.0
> Content-Type: text/plain; charset=UTF-8
> Content-Transfer-Encoding: 8bit
> 
> Fix assumptions in migrate_misplaced_transhuge_page(), which is only
> called by do_huge_pmd_numa_page(), itself only called by __handle_mm_fault()
> for a pmd with PROT_NONE. This means that if the pmd stays the same
> then there can be no concurrent get_user_pages / get_user_pages_fast
> (GUP/GUP_fast). Moreover, because migrate_misplaced_transhuge_page()
> only does something if the page is mapped once, there can be no GUP from
> a different process. Finally, holding the pmd lock assures us that no
> other part of the kernel will take an extra reference on the page.
> 
> In the end this means that the failure code path should never be
> taken unless something is horribly wrong, so convert it to BUG_ON().
> 
> Signed-off-by: Jérôme Glisse <jglisse@redhat.com>
> Cc: Kirill A. Shutemov <kirill.shutemov@linux.intel.com>
> Cc: Mel Gorman <mgorman@suse.de>
> Cc: Hugh Dickins <hughd@google.com>
> Cc: Andrea Arcangeli <aarcange@redhat.com>

The logic looks valid to me:

Acked-by: Kirill A. Shutemov <kirill.shutemov@linux.intel.com>

> ---
>  mm/migrate.c | 31 +++++++++++++++++++++----------
>  1 file changed, 21 insertions(+), 10 deletions(-)
> 
> diff --git a/mm/migrate.c b/mm/migrate.c
> index 6c822a7..6315aac 100644
> --- a/mm/migrate.c
> +++ b/mm/migrate.c
> @@ -1757,6 +1757,14 @@ int migrate_misplaced_transhuge_page(struct mm_struct *mm,
>  	pmd_t orig_entry;
>  
>  	/*
> +	 * What we do here is only valid if pmd_protnone(entry) is true and it
> +	 * is map in only one vma numamigrate_isolate_page() takes care of that
> +	 * check.
> +	 */
> +	if (!pmd_protnone(entry))
> +		goto out_unlock;
> +
> +	/*
>  	 * Rate-limit the amount of data that is being migrated to a node.
>  	 * Optimal placement is no good if the memory bus is saturated and
>  	 * all the time is being spent migrating!
> @@ -1797,7 +1805,6 @@ int migrate_misplaced_transhuge_page(struct mm_struct *mm,
>  	mmu_notifier_invalidate_range_start(mm, mmun_start, mmun_end);
>  	ptl = pmd_lock(mm, pmd);
>  	if (unlikely(!pmd_same(*pmd, entry) || page_count(page) != 2)) {
> -fail_putback:
>  		spin_unlock(ptl);
>  		mmu_notifier_invalidate_range_end(mm, mmun_start, mmun_end);
>  
> @@ -1819,7 +1826,12 @@ fail_putback:
>  		goto out_unlock;
>  	}
>  
> -	orig_entry = *pmd;
> +	/*
> +	 * We are holding the lock so no one can set a new pmd and original pmd
> +	 * is PROT_NONE thus no one can get_user_pages or get_user_pages_fast
> +	 * (GUP or GUP_fast) from this point on we can not fail.
> +	 */
> +	orig_entry = entry;
>  	entry = mk_pmd(new_page, vma->vm_page_prot);
>  	entry = pmd_mkhuge(entry);
>  	entry = maybe_pmd_mkwrite(pmd_mkdirty(entry), vma);
> @@ -1837,14 +1849,13 @@ fail_putback:
>  	set_pmd_at(mm, mmun_start, pmd, entry);
>  	update_mmu_cache_pmd(vma, address, &entry);
>  
> -	if (page_count(page) != 2) {
> -		set_pmd_at(mm, mmun_start, pmd, orig_entry);
> -		flush_pmd_tlb_range(vma, mmun_start, mmun_end);
> -		mmu_notifier_invalidate_range(mm, mmun_start, mmun_end);
> -		update_mmu_cache_pmd(vma, address, &entry);
> -		page_remove_rmap(new_page, true);
> -		goto fail_putback;
> -	}
> +	/* As said above no one can get reference on the old page nor through
> +	 * get_user_pages or get_user_pages_fast (GUP/GUP_fast) or through
> +	 * any other means. To get reference on huge page you need to hold
> +	 * pmd_lock and we are already holding that lock here and the page
> +	 * is only mapped once.
> +	 */
> +	BUG_ON(page_count(page) != 2);
>  
>  	mlock_migrate_page(new_page, page);
>  	page_remove_rmap(page, true);
> -- 
> 2.1.0
> 


-- 
 Kirill A. Shutemov

^ permalink raw reply	[flat|nested] 16+ messages in thread

* Re: [Question] Missing data after DMA read transfer - mm issue with transparent huge page?
       [not found]     ` <07619be9-e812-5459-26dd-ceb8c6490520@morey-chaisemartin.com>
@ 2016-05-10 10:01       ` Jerome Glisse
  2016-05-10 11:15         ` Nicolas Morey Chaisemartin
  2016-05-11 11:15         ` Nicolas Morey Chaisemartin
  0 siblings, 2 replies; 16+ messages in thread
From: Jerome Glisse @ 2016-05-10 10:01 UTC (permalink / raw)
  To: Nicolas Morey Chaisemartin
  Cc: Hugh Dickins, Mel Gorman, Andrea Arcangeli, Kirill A. Shutemov,
	Kirill A. Shutemov, Alex Williamson, One Thousand Gnomes,
	linux-kernel, linux-mm

On Tue, May 10, 2016 at 09:04:36AM +0200, Nicolas Morey Chaisemartin wrote:
> On 05/03/2016 at 12:11 PM, Jerome Glisse wrote:
> > On Mon, May 02, 2016 at 09:04:02PM -0700, Hugh Dickins wrote:
> >> On Fri, 29 Apr 2016, Nicolas Morey Chaisemartin wrote:
[...]
> Hi,
> 
> I backported the patch to 3.10 (had to copy-paste the pmd_protnone definition from 4.5) and it's working!
> I'll open a ticket in the Red Hat tracker to try and get this fixed in RHEL7.
> 
> I have a dumb question though: how can we end up in numa/misplaced memory code on a single socket system?
> 

This patch is not a fix. Do you see a BUG message in the kernel log? If you
do, it means we have a bigger issue.

You did not answer one of my previous questions: do you call get_user_pages
with write = 1 as a parameter?

Also, it would be a lot easier if you were testing with the latest 4.6 or 4.5,
not a RHEL kernel, as they are far apart, and what might look like the same issue
on both might be totally different bugs.

If you only really care about the RHEL kernel, then open a bug with Red Hat;
you can add me in bug-cc <jglisse@redhat.com>

Cheers,
Jérôme

^ permalink raw reply	[flat|nested] 16+ messages in thread

* Re: [Question] Missing data after DMA read transfer - mm issue with transparent huge page?
  2016-05-10 10:01       ` Jerome Glisse
@ 2016-05-10 11:15         ` Nicolas Morey Chaisemartin
  2016-05-10 13:34           ` Jerome Glisse
  2016-05-11 11:15         ` Nicolas Morey Chaisemartin
  1 sibling, 1 reply; 16+ messages in thread
From: Nicolas Morey Chaisemartin @ 2016-05-10 11:15 UTC (permalink / raw)
  To: Jerome Glisse
  Cc: Hugh Dickins, Mel Gorman, Andrea Arcangeli, Kirill A. Shutemov,
	Kirill A. Shutemov, Alex Williamson, One Thousand Gnomes,
	linux-kernel, linux-mm



On 05/10/2016 at 12:01 PM, Jerome Glisse wrote:
> On Tue, May 10, 2016 at 09:04:36AM +0200, Nicolas Morey Chaisemartin wrote:
>> On 05/03/2016 at 12:11 PM, Jerome Glisse wrote:
>>> On Mon, May 02, 2016 at 09:04:02PM -0700, Hugh Dickins wrote:
>>>> On Fri, 29 Apr 2016, Nicolas Morey Chaisemartin wrote:
[...]
>> Hi,
>>
>> I backported the patch to 3.10 (had to copy-paste the pmd_protnone definition from 4.5) and it's working!
>> I'll open a ticket in the Red Hat tracker to try and get this fixed in RHEL7.
>>
>> I have a dumb question though: how can we end up in numa/misplaced memory code on a single socket system?
>>
> This patch is not a fix. Do you see a BUG message in the kernel log? If you
> do, it means we have a bigger issue.
I don't see any on my 3.10. I have DMA_API_DEBUG enabled but I don't think it has an impact.
> You did not answer one of my previous questions: do you call get_user_pages
> with write = 1 as a parameter?
For the read from the device, yes:
        down_read(&current->mm->mmap_sem);
        res = get_user_pages(
                current,
                current->mm,
                (unsigned long) iov->host_addr,
                page_count,
                (write_mode == 0) ? 1 : 0,      /* write */
                0,      /* force */
                &trans->pages[sg_o],
                NULL);
        up_read(&current->mm->mmap_sem);

> Also, it would be a lot easier if you were testing with the latest 4.6 or 4.5,
> not a RHEL kernel, as they are far apart, and what might look like the same issue
> on both might be totally different bugs.
Is a RPM from elrepo ok?
http://elrepo.org/linux/kernel/el7/SRPMS/

I don't know why but a simple make install from the kernel source tree won't boot. I'm working remotely so it's hard to debug.
Rebuilding a kernel RPM seems to work though.
> If you only really care about the RHEL kernel, then open a bug with Red Hat;
> you can add me in bug-cc <jglisse@redhat.com>
I opened one. I'll CC you

Thanks

Nicolas

^ permalink raw reply	[flat|nested] 16+ messages in thread

* Re: [Question] Missing data after DMA read transfer - mm issue with transparent huge page?
  2016-05-10 11:15         ` Nicolas Morey Chaisemartin
@ 2016-05-10 13:34           ` Jerome Glisse
  2016-05-11  9:14             ` Nicolas Morey Chaisemartin
  0 siblings, 1 reply; 16+ messages in thread
From: Jerome Glisse @ 2016-05-10 13:34 UTC (permalink / raw)
  To: Nicolas Morey Chaisemartin
  Cc: Hugh Dickins, Mel Gorman, Andrea Arcangeli, Kirill A. Shutemov,
	Kirill A. Shutemov, Alex Williamson, One Thousand Gnomes,
	linux-kernel, linux-mm

On Tue, May 10, 2016 at 01:15:02PM +0200, Nicolas Morey Chaisemartin wrote:
> On 05/10/2016 at 12:01 PM, Jerome Glisse wrote:
> > On Tue, May 10, 2016 at 09:04:36AM +0200, Nicolas Morey Chaisemartin wrote:
> >> On 05/03/2016 at 12:11 PM, Jerome Glisse wrote:
> >>> On Mon, May 02, 2016 at 09:04:02PM -0700, Hugh Dickins wrote:
> >>>> On Fri, 29 Apr 2016, Nicolas Morey Chaisemartin wrote:

[...]

> >> Hi,
> >>
> >> I backported the patch to 3.10 (had to copy-paste the pmd_protnone definition from 4.5) and it's working!
> >> I'll open a ticket in the Red Hat tracker to try and get this fixed in RHEL7.
> >>
> >> I have a dumb question though: how can we end up in numa/misplaced memory code on a single socket system?
> >>
> > This patch is not a fix. Do you see a BUG message in the kernel log? If you
> > do, it means we have a bigger issue.
> I don't see any on my 3.10. I have DMA_API_DEBUG enabled but I don't think it has an impact.

My patch can't be backported to 3.10 as is; you most likely need to replace
pmd_protnone() with pmd_numa().

> > You did not answer one of my previous questions: do you call get_user_pages
> > with write = 1 as a parameter?
> For the read from the device, yes:
>         down_read(&current->mm->mmap_sem);
>         res = get_user_pages(
>                 current,
>                 current->mm,
>                 (unsigned long) iov->host_addr,
>                 page_count,
>                 (write_mode == 0) ? 1 : 0,      /* write */
>                 0,      /* force */
>                 &trans->pages[sg_o],
>                 NULL);
>         up_read(&current->mm->mmap_sem);

As I don't have the context to infer how write_mode is set above, do you mind
retesting your driver while always asking for write, no matter what?

> > Also, it would be a lot easier if you were testing with the latest 4.6 or 4.5,
> > not a RHEL kernel, as they are far apart, and what might look like the same issue
> > on both might be totally different bugs.
> Is a RPM from elrepo ok?
> http://elrepo.org/linux/kernel/el7/SRPMS/

Yes, that should be ok for testing.

Cheers,
Jérôme

^ permalink raw reply	[flat|nested] 16+ messages in thread

* Re: [Question] Missing data after DMA read transfer - mm issue with transparent huge page?
  2016-05-10 13:34           ` Jerome Glisse
@ 2016-05-11  9:14             ` Nicolas Morey Chaisemartin
  0 siblings, 0 replies; 16+ messages in thread
From: Nicolas Morey Chaisemartin @ 2016-05-11  9:14 UTC (permalink / raw)
  To: Jerome Glisse
  Cc: Hugh Dickins, Mel Gorman, Andrea Arcangeli, Kirill A. Shutemov,
	Kirill A. Shutemov, Alex Williamson, One Thousand Gnomes,
	linux-kernel, linux-mm



On 05/10/2016 at 03:34 PM, Jerome Glisse wrote:
> On Tue, May 10, 2016 at 01:15:02PM +0200, Nicolas Morey Chaisemartin wrote:
>> On 05/10/2016 at 12:01 PM, Jerome Glisse wrote:
>>> On Tue, May 10, 2016 at 09:04:36AM +0200, Nicolas Morey Chaisemartin wrote:
>>>> On 05/03/2016 at 12:11 PM, Jerome Glisse wrote:
>>>>> On Mon, May 02, 2016 at 09:04:02PM -0700, Hugh Dickins wrote:
>>>>>> On Fri, 29 Apr 2016, Nicolas Morey Chaisemartin wrote:
> [...]
>
>>>> Hi,
>>>>
>>>> I backported the patch to 3.10 (had to copy-paste the pmd_protnone definition from 4.5) and it's working!
>>>> I'll open a ticket in the Red Hat tracker to try and get this fixed in RHEL7.
>>>>
>>>> I have a dumb question though: how can we end up in numa/misplaced memory code on a single socket system?
>>>>
>>> This patch is not a fix. Do you see a BUG message in the kernel log? If you
>>> do, it means we have a bigger issue.
>> I don't see any on my 3.10. I have DMA_API_DEBUG enabled but I don't think it has an impact.
> My patch can't be backported to 3.10 as is; you most likely need to replace
> pmd_protnone() with pmd_numa().
>
>>> You did not answer one of my previous questions: do you call get_user_pages
>>> with write = 1 as a parameter?
>> For the read from the device, yes:
>>         down_read(&current->mm->mmap_sem);
>>         res = get_user_pages(
>>                 current,
>>                 current->mm,
>>                 (unsigned long) iov->host_addr,
>>                 page_count,
>>                 (write_mode == 0) ? 1 : 0,      /* write */
>>                 0,      /* force */
>>                 &trans->pages[sg_o],
>>                 NULL);
>>         up_read(&current->mm->mmap_sem);
> As I don't have the context to infer how write_mode is set above, do you mind
> retesting your driver while always asking for write, no matter what?
write_mode is 0 for card2host transfers, so yes, the write argument is 1.
While debugging I also tried with write=1 and force=1 in all cases, and it failed too.
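
To make the direction convention explicit for readers of the thread, here is a minimal sketch of the mapping from the driver's write_mode flag to the write argument of get_user_pages(). The enum and helper names are hypothetical (the thread only shows write_mode), and treating any non-zero write_mode as host2card is an assumption.

```c
#include <assert.h>

/* Hypothetical names: the thread only shows the driver's write_mode flag. */
enum xfer_dir {
	XFER_HOST2CARD,	/* device reads host memory  */
	XFER_CARD2HOST,	/* device writes host memory */
};

/*
 * write_mode == 0 is a card2host transfer, as stated in the thread.
 * Assumption: any other value means host2card.
 */
static enum xfer_dir xfer_dir_from_write_mode(int write_mode)
{
	return write_mode == 0 ? XFER_CARD2HOST : XFER_HOST2CARD;
}

/*
 * Value for the write argument of get_user_pages(): the pages must be
 * requested writable whenever the device is going to write into them.
 */
static int gup_write_flag(enum xfer_dir dir)
{
	assert(dir == XFER_HOST2CARD || dir == XFER_CARD2HOST);
	return dir == XFER_CARD2HOST;
}
```

With such helpers the call site would pass gup_write_flag(xfer_dir_from_write_mode(write_mode)) instead of the bare ternary, which keeps the "device writes, so write = 1" rule in one place.
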
>>> Also, it would be a lot easier if you were testing with the latest 4.6 or 4.5,
>>> not a RHEL kernel, as they are far apart, and what might look like the same issue
>>> on both might be totally different bugs.
>> Is a RPM from elrepo ok?
>> http://elrepo.org/linux/kernel/el7/SRPMS/
> Yes should be ok for testing.
>
I tried the elrepo 4.5.2 package without your patch and my test fails. Sadly, the src rpm from elrepo does not contain the kernel sources and I haven't looked into how to get the proper tarball.
I tried rebuilding a src rpm for Fedora 24 (kernel 4.5.3) and it works without your patch. I'm not sure what differs in their config. I'll keep digging.

Nicolas


* Re: [Question] Missing data after DMA read transfer - mm issue with transparent huge page?
  2016-05-10 10:01       ` Jerome Glisse
  2016-05-10 11:15         ` Nicolas Morey Chaisemartin
@ 2016-05-11 11:15         ` Nicolas Morey Chaisemartin
  2016-05-11 14:51           ` Jerome Glisse
  1 sibling, 1 reply; 16+ messages in thread
From: Nicolas Morey Chaisemartin @ 2016-05-11 11:15 UTC (permalink / raw)
  To: Jerome Glisse
  Cc: Hugh Dickins, Mel Gorman, Andrea Arcangeli, Kirill A. Shutemov,
	Kirill A. Shutemov, Alex Williamson, One Thousand Gnomes,
	linux-kernel, linux-mm



Le 05/10/2016 à 12:01 PM, Jerome Glisse a écrit :
> On Tue, May 10, 2016 at 09:04:36AM +0200, Nicolas Morey Chaisemartin wrote:
>> Le 05/03/2016 à 12:11 PM, Jerome Glisse a écrit :
>>> On Mon, May 02, 2016 at 09:04:02PM -0700, Hugh Dickins wrote:
>>>> On Fri, 29 Apr 2016, Nicolas Morey Chaisemartin wrote:
[...]
>> Hi,
>>
>> I backported the patch to 3.10 (had to copy paste pmd_protnone defitinition from 4.5) and it's working !
>> I'll open a ticket in Redhat tracker to try and get this fixed in RHEL7.
>>
>> I have a dumb question though: how can we end up in numa/misplaced memory code on a single socket system?
>>
> This patch is not a fix, do you see bug message in kernel log ? Because if
> you do that it means we have a bigger issue.
>
> You did not answer one of my previous question, do you set get_user_pages
> with write = 1 as a paremeter ?
>
> Also it would be a lot easier if you were testing with lastest 4.6 or 4.5
> not RHEL kernel as they are far appart and what might looks like same issue
> on both might be totaly different bugs.
>
> If you only really care about RHEL kernel then open a bug with Red Hat and
> you can add me in bug-cc <jglisse@redhat.com>
>
> Cheers,
> Jérôme

I finally managed to get a proper setup.
I built a vanilla 4.5 kernel from the git tree using the CentOS 7 config; my test fails as usual.
I applied your patch and rebuilt => still fails, and no new messages in dmesg.

Now that I don't have to go through the RPM repackaging, I can try out things much quicker if you have any ideas.

Nicolas


* Re: [Question] Missing data after DMA read transfer - mm issue with transparent huge page?
  2016-05-11 11:15         ` Nicolas Morey Chaisemartin
@ 2016-05-11 14:51           ` Jerome Glisse
  2016-05-12  6:07             ` Nicolas Morey-Chaisemartin
  0 siblings, 1 reply; 16+ messages in thread
From: Jerome Glisse @ 2016-05-11 14:51 UTC (permalink / raw)
  To: Nicolas Morey Chaisemartin
  Cc: Hugh Dickins, Mel Gorman, Andrea Arcangeli, Kirill A. Shutemov,
	Kirill A. Shutemov, Alex Williamson, One Thousand Gnomes,
	linux-kernel, linux-mm

On Wed, May 11, 2016 at 01:15:54PM +0200, Nicolas Morey Chaisemartin wrote:
> 
> 
> Le 05/10/2016 à 12:01 PM, Jerome Glisse a écrit :
> > On Tue, May 10, 2016 at 09:04:36AM +0200, Nicolas Morey Chaisemartin wrote:
> >> Le 05/03/2016 à 12:11 PM, Jerome Glisse a écrit :
> >>> On Mon, May 02, 2016 at 09:04:02PM -0700, Hugh Dickins wrote:
> >>>> On Fri, 29 Apr 2016, Nicolas Morey Chaisemartin wrote:
> [...]
> >> Hi,
> >>
> >> I backported the patch to 3.10 (had to copy paste pmd_protnone defitinition from 4.5) and it's working !
> >> I'll open a ticket in Redhat tracker to try and get this fixed in RHEL7.
> >>
> >> I have a dumb question though: how can we end up in numa/misplaced memory code on a single socket system?
> >>
> > This patch is not a fix, do you see bug message in kernel log ? Because if
> > you do that it means we have a bigger issue.
> >
> > You did not answer one of my previous question, do you set get_user_pages
> > with write = 1 as a paremeter ?
> >
> > Also it would be a lot easier if you were testing with lastest 4.6 or 4.5
> > not RHEL kernel as they are far appart and what might looks like same issue
> > on both might be totaly different bugs.
> >
> > If you only really care about RHEL kernel then open a bug with Red Hat and
> > you can add me in bug-cc <jglisse@redhat.com>
> >
> > Cheers,
> > Jérôme
> 
> I finally managed to get a proper setup.
> I build a vanilla 4.5 kernel from git tree using the Centos7 config, my test fails as usual.
> I applied your patch, rebuild => still fails and no new messages in dmesg.
> 
> Now that I don't have to go through the RPM repackaging, I can try out things much quicker if you have any ideas.
> 

Is it still an issue if you boot with transparent_hugepage=never?

Also, to simplify the investigation, force write to 1 all the time, no matter what.

Cheers,
Jérôme


* Re: [Question] Missing data after DMA read transfer - mm issue with transparent huge page?
  2016-05-11 14:51           ` Jerome Glisse
@ 2016-05-12  6:07             ` Nicolas Morey-Chaisemartin
  2016-05-12  9:36               ` Jerome Glisse
  0 siblings, 1 reply; 16+ messages in thread
From: Nicolas Morey-Chaisemartin @ 2016-05-12  6:07 UTC (permalink / raw)
  To: Jerome Glisse
  Cc: Hugh Dickins, Mel Gorman, Andrea Arcangeli, Kirill A. Shutemov,
	Kirill A. Shutemov, Alex Williamson, One Thousand Gnomes,
	linux-kernel, linux-mm



Le 05/11/2016 à 04:51 PM, Jerome Glisse a écrit :
> On Wed, May 11, 2016 at 01:15:54PM +0200, Nicolas Morey Chaisemartin wrote:
>>
>> Le 05/10/2016 à 12:01 PM, Jerome Glisse a écrit :
>>> On Tue, May 10, 2016 at 09:04:36AM +0200, Nicolas Morey Chaisemartin wrote:
>>>> Le 05/03/2016 à 12:11 PM, Jerome Glisse a écrit :
>>>>> On Mon, May 02, 2016 at 09:04:02PM -0700, Hugh Dickins wrote:
>>>>>> On Fri, 29 Apr 2016, Nicolas Morey Chaisemartin wrote:
>> [...]
>>>> Hi,
>>>>
>>>> I backported the patch to 3.10 (had to copy paste pmd_protnone defitinition from 4.5) and it's working !
>>>> I'll open a ticket in Redhat tracker to try and get this fixed in RHEL7.
>>>>
>>>> I have a dumb question though: how can we end up in numa/misplaced memory code on a single socket system?
>>>>
>>> This patch is not a fix, do you see bug message in kernel log ? Because if
>>> you do that it means we have a bigger issue.
>>>
>>> You did not answer one of my previous question, do you set get_user_pages
>>> with write = 1 as a paremeter ?
>>>
>>> Also it would be a lot easier if you were testing with lastest 4.6 or 4.5
>>> not RHEL kernel as they are far appart and what might looks like same issue
>>> on both might be totaly different bugs.
>>>
>>> If you only really care about RHEL kernel then open a bug with Red Hat and
>>> you can add me in bug-cc <jglisse@redhat.com>
>>>
>>> Cheers,
>>> Jérôme
>> I finally managed to get a proper setup.
>> I build a vanilla 4.5 kernel from git tree using the Centos7 config, my test fails as usual.
>> I applied your patch, rebuild => still fails and no new messages in dmesg.
>>
>> Now that I don't have to go through the RPM repackaging, I can try out things much quicker if you have any ideas.
>>
> Still an issue if you boot with transparent_hugepage=never ?
>
> Also to simplify investigation force write to 1 all the time no matter what.
>
> Cheers,
> Jérôme

With transparent_hugepage=never I can't see the bug anymore.

Nicolas


* Re: [Question] Missing data after DMA read transfer - mm issue with transparent huge page?
  2016-05-12  6:07             ` Nicolas Morey-Chaisemartin
@ 2016-05-12  9:36               ` Jerome Glisse
  2016-05-12 13:30                 ` Nicolas Morey-Chaisemartin
  0 siblings, 1 reply; 16+ messages in thread
From: Jerome Glisse @ 2016-05-12  9:36 UTC (permalink / raw)
  To: Nicolas Morey-Chaisemartin
  Cc: Hugh Dickins, Mel Gorman, Andrea Arcangeli, Kirill A. Shutemov,
	Kirill A. Shutemov, Alex Williamson, One Thousand Gnomes,
	linux-kernel, linux-mm

On Thu, May 12, 2016 at 08:07:59AM +0200, Nicolas Morey-Chaisemartin wrote:
> 
> 
> Le 05/11/2016 à 04:51 PM, Jerome Glisse a écrit :
> > On Wed, May 11, 2016 at 01:15:54PM +0200, Nicolas Morey Chaisemartin wrote:
> >>
> >> Le 05/10/2016 à 12:01 PM, Jerome Glisse a écrit :
> >>> On Tue, May 10, 2016 at 09:04:36AM +0200, Nicolas Morey Chaisemartin wrote:
> >>>> Le 05/03/2016 à 12:11 PM, Jerome Glisse a écrit :
> >>>>> On Mon, May 02, 2016 at 09:04:02PM -0700, Hugh Dickins wrote:
> >>>>>> On Fri, 29 Apr 2016, Nicolas Morey Chaisemartin wrote:
> >> [...]
> >>>> Hi,
> >>>>
> >>>> I backported the patch to 3.10 (had to copy paste pmd_protnone defitinition from 4.5) and it's working !
> >>>> I'll open a ticket in Redhat tracker to try and get this fixed in RHEL7.
> >>>>
> >>>> I have a dumb question though: how can we end up in numa/misplaced memory code on a single socket system?
> >>>>
> >>> This patch is not a fix, do you see bug message in kernel log ? Because if
> >>> you do that it means we have a bigger issue.
> >>>
> >>> You did not answer one of my previous question, do you set get_user_pages
> >>> with write = 1 as a paremeter ?
> >>>
> >>> Also it would be a lot easier if you were testing with lastest 4.6 or 4.5
> >>> not RHEL kernel as they are far appart and what might looks like same issue
> >>> on both might be totaly different bugs.
> >>>
> >>> If you only really care about RHEL kernel then open a bug with Red Hat and
> >>> you can add me in bug-cc <jglisse@redhat.com>
> >>>
> >>> Cheers,
> >>> Jérôme
> >> I finally managed to get a proper setup.
> >> I build a vanilla 4.5 kernel from git tree using the Centos7 config, my test fails as usual.
> >> I applied your patch, rebuild => still fails and no new messages in dmesg.
> >>
> >> Now that I don't have to go through the RPM repackaging, I can try out things much quicker if you have any ideas.
> >>
> > Still an issue if you boot with transparent_hugepage=never ?
> >
> > Also to simplify investigation force write to 1 all the time no matter what.
> >
> > Cheers,
> > Jérôme
> 
> With transparent_hugepage=never I can't see the bug anymore.
> 

Can you test https://patchwork.kernel.org/patch/9061351/ with 4.5
(it does not apply to 3.10) and without transparent_hugepage=never?

Jérôme


* Re: [Question] Missing data after DMA read transfer - mm issue with transparent huge page?
  2016-05-12  9:36               ` Jerome Glisse
@ 2016-05-12 13:30                 ` Nicolas Morey-Chaisemartin
  2016-05-12 13:52                   ` Jerome Glisse
  0 siblings, 1 reply; 16+ messages in thread
From: Nicolas Morey-Chaisemartin @ 2016-05-12 13:30 UTC (permalink / raw)
  To: Jerome Glisse
  Cc: Hugh Dickins, Mel Gorman, Andrea Arcangeli, Kirill A. Shutemov,
	Kirill A. Shutemov, Alex Williamson, One Thousand Gnomes,
	linux-kernel, linux-mm



Le 05/12/2016 à 11:36 AM, Jerome Glisse a écrit :
> On Thu, May 12, 2016 at 08:07:59AM +0200, Nicolas Morey-Chaisemartin wrote:
>>
>> Le 05/11/2016 à 04:51 PM, Jerome Glisse a écrit :
>>> On Wed, May 11, 2016 at 01:15:54PM +0200, Nicolas Morey Chaisemartin wrote:
>>>> Le 05/10/2016 à 12:01 PM, Jerome Glisse a écrit :
>>>>> On Tue, May 10, 2016 at 09:04:36AM +0200, Nicolas Morey Chaisemartin wrote:
>>>>>> Le 05/03/2016 à 12:11 PM, Jerome Glisse a écrit :
>>>>>>> On Mon, May 02, 2016 at 09:04:02PM -0700, Hugh Dickins wrote:
>>>>>>>> On Fri, 29 Apr 2016, Nicolas Morey Chaisemartin wrote:
>>>> [...]
>>>>>> Hi,
>>>>>>
>>>>>> I backported the patch to 3.10 (had to copy paste pmd_protnone defitinition from 4.5) and it's working !
>>>>>> I'll open a ticket in Redhat tracker to try and get this fixed in RHEL7.
>>>>>>
>>>>>> I have a dumb question though: how can we end up in numa/misplaced memory code on a single socket system?
>>>>>>
>>>>> This patch is not a fix, do you see bug message in kernel log ? Because if
>>>>> you do that it means we have a bigger issue.
>>>>>
>>>>> You did not answer one of my previous question, do you set get_user_pages
>>>>> with write = 1 as a paremeter ?
>>>>>
>>>>> Also it would be a lot easier if you were testing with lastest 4.6 or 4.5
>>>>> not RHEL kernel as they are far appart and what might looks like same issue
>>>>> on both might be totaly different bugs.
>>>>>
>>>>> If you only really care about RHEL kernel then open a bug with Red Hat and
>>>>> you can add me in bug-cc <jglisse@redhat.com>
>>>>>
>>>>> Cheers,
>>>>> Jérôme
>>>> I finally managed to get a proper setup.
>>>> I build a vanilla 4.5 kernel from git tree using the Centos7 config, my test fails as usual.
>>>> I applied your patch, rebuild => still fails and no new messages in dmesg.
>>>>
>>>> Now that I don't have to go through the RPM repackaging, I can try out things much quicker if you have any ideas.
>>>>
>>> Still an issue if you boot with transparent_hugepage=never ?
>>>
>>> Also to simplify investigation force write to 1 all the time no matter what.
>>>
>>> Cheers,
>>> Jérôme
>> With transparent_hugepage=never I can't see the bug anymore.
>>
> Can you test https://patchwork.kernel.org/patch/9061351/ with 4.5
> (does not apply to 3.10) and without transparent_hugepage=never
>
> Jérôme

It fails with 4.5 + this patch, and also with 4.5 + this patch + yours.

Nicolas


* Re: [Question] Missing data after DMA read transfer - mm issue with transparent huge page?
  2016-05-12 13:30                 ` Nicolas Morey-Chaisemartin
@ 2016-05-12 13:52                   ` Jerome Glisse
  2016-05-12 15:31                     ` Nicolas Morey-Chaisemartin
  0 siblings, 1 reply; 16+ messages in thread
From: Jerome Glisse @ 2016-05-12 13:52 UTC (permalink / raw)
  To: Nicolas Morey-Chaisemartin
  Cc: Hugh Dickins, Mel Gorman, Andrea Arcangeli, Kirill A. Shutemov,
	Kirill A. Shutemov, Alex Williamson, One Thousand Gnomes,
	linux-kernel, linux-mm

On Thu, May 12, 2016 at 03:30:24PM +0200, Nicolas Morey-Chaisemartin wrote:
> Le 05/12/2016 à 11:36 AM, Jerome Glisse a écrit :
> > On Thu, May 12, 2016 at 08:07:59AM +0200, Nicolas Morey-Chaisemartin wrote:
> >>
> >> Le 05/11/2016 à 04:51 PM, Jerome Glisse a écrit :
> >>> On Wed, May 11, 2016 at 01:15:54PM +0200, Nicolas Morey Chaisemartin wrote:
> >>>> Le 05/10/2016 à 12:01 PM, Jerome Glisse a écrit :
> >>>>> On Tue, May 10, 2016 at 09:04:36AM +0200, Nicolas Morey Chaisemartin wrote:
> >>>>>> Le 05/03/2016 à 12:11 PM, Jerome Glisse a écrit :
> >>>>>>> On Mon, May 02, 2016 at 09:04:02PM -0700, Hugh Dickins wrote:
> >>>>>>>> On Fri, 29 Apr 2016, Nicolas Morey Chaisemartin wrote:
> >>>> [...]
> >>>>>> Hi,
> >>>>>>
> >>>>>> I backported the patch to 3.10 (had to copy paste pmd_protnone defitinition from 4.5) and it's working !
> >>>>>> I'll open a ticket in Redhat tracker to try and get this fixed in RHEL7.
> >>>>>>
> >>>>>> I have a dumb question though: how can we end up in numa/misplaced memory code on a single socket system?
> >>>>>>
> >>>>> This patch is not a fix, do you see bug message in kernel log ? Because if
> >>>>> you do that it means we have a bigger issue.
> >>>>>
> >>>>> You did not answer one of my previous question, do you set get_user_pages
> >>>>> with write = 1 as a paremeter ?
> >>>>>
> >>>>> Also it would be a lot easier if you were testing with lastest 4.6 or 4.5
> >>>>> not RHEL kernel as they are far appart and what might looks like same issue
> >>>>> on both might be totaly different bugs.
> >>>>>
> >>>>> If you only really care about RHEL kernel then open a bug with Red Hat and
> >>>>> you can add me in bug-cc <jglisse@redhat.com>
> >>>>>
> >>>>> Cheers,
> >>>>> Jérôme
> >>>> I finally managed to get a proper setup.
> >>>> I build a vanilla 4.5 kernel from git tree using the Centos7 config, my test fails as usual.
> >>>> I applied your patch, rebuild => still fails and no new messages in dmesg.
> >>>>
> >>>> Now that I don't have to go through the RPM repackaging, I can try out things much quicker if you have any ideas.
> >>>>
> >>> Still an issue if you boot with transparent_hugepage=never ?
> >>>
> >>> Also to simplify investigation force write to 1 all the time no matter what.
> >>>
> >>> Cheers,
> >>> Jérôme
> >> With transparent_hugepage=never I can't see the bug anymore.
> >>
> > Can you test https://patchwork.kernel.org/patch/9061351/ with 4.5
> > (does not apply to 3.10) and without transparent_hugepage=never
> >
> > Jérôme
> 
> Fails with 4.5 + this patch and with 4.5 + this patch + yours
> 

There must be some bug in your code; we have an upstream user that works
fine with the above combination (see drivers/vfio/vfio_iommu_type1.c).
I suspect you might be releasing the page pin too early (put_page()).

If you really believe it is a bug upstream, we would need a dumb kernel
module that does gup like you do and that shows the issue. Right now,
looking at the code (assuming the above patches are applied), I can't see
anything that can go wrong with THP.

Cheers,
Jérôme


* Re: [Question] Missing data after DMA read transfer - mm issue with transparent huge page?
  2016-05-12 13:52                   ` Jerome Glisse
@ 2016-05-12 15:31                     ` Nicolas Morey-Chaisemartin
  2016-05-12 15:57                       ` Andrea Arcangeli
  0 siblings, 1 reply; 16+ messages in thread
From: Nicolas Morey-Chaisemartin @ 2016-05-12 15:31 UTC (permalink / raw)
  To: Jerome Glisse
  Cc: Hugh Dickins, Mel Gorman, Andrea Arcangeli, Kirill A. Shutemov,
	Kirill A. Shutemov, Alex Williamson, One Thousand Gnomes,
	linux-kernel, linux-mm



Le 05/12/2016 à 03:52 PM, Jerome Glisse a écrit :
> On Thu, May 12, 2016 at 03:30:24PM +0200, Nicolas Morey-Chaisemartin wrote:
>> Le 05/12/2016 à 11:36 AM, Jerome Glisse a écrit :
>>> On Thu, May 12, 2016 at 08:07:59AM +0200, Nicolas Morey-Chaisemartin wrote:
[...]
>>>> With transparent_hugepage=never I can't see the bug anymore.
>>>>
>>> Can you test https://patchwork.kernel.org/patch/9061351/ with 4.5
>>> (does not apply to 3.10) and without transparent_hugepage=never
>>>
>>> Jérôme
>> Fails with 4.5 + this patch and with 4.5 + this patch + yours
>>
> There must be some bug in your code, we have upstream user that works
> fine with the above combination (see drivers/vfio/vfio_iommu_type1.c)
> i suspect you might be releasing the page pin too early (put_page()).
In my previous tests, I checked the page before calling put_page and it had already changed.
I also checked that there are never multiple transfers in a single page at once.
So I doubt it's that.
>
> If you really believe it is bug upstream we would need a dumb kernel
> module that does gup like you do and that shows the issue. Right now
> looking at code (assuming above patches applied) i can't see anything
> that can go wrong with THP.

The issue is that I doubt I'll be able to do that. We have had code running in production for at least a year without the issue showing up, and now a single test shows this.
And some tweaks to the test (changing the memory footprint in user space) can make the problem disappear.

Is there a way to track what is happening to the THP? From the looks of it, the refcounts are being changed behind my back. Would kgdb with a watchpoint work for this?
Is there a less painful way?

Thanks

Nicolas


* Re: [Question] Missing data after DMA read transfer - mm issue with transparent huge page?
  2016-05-12 15:31                     ` Nicolas Morey-Chaisemartin
@ 2016-05-12 15:57                       ` Andrea Arcangeli
  0 siblings, 0 replies; 16+ messages in thread
From: Andrea Arcangeli @ 2016-05-12 15:57 UTC (permalink / raw)
  To: Nicolas Morey-Chaisemartin
  Cc: Jerome Glisse, Hugh Dickins, Mel Gorman, Kirill A. Shutemov,
	Kirill A. Shutemov, Alex Williamson, One Thousand Gnomes,
	linux-kernel, linux-mm

Hello Nicolas,

On Thu, May 12, 2016 at 05:31:52PM +0200, Nicolas Morey-Chaisemartin wrote:
> 
> 
> Le 05/12/2016 à 03:52 PM, Jerome Glisse a écrit :
> > On Thu, May 12, 2016 at 03:30:24PM +0200, Nicolas Morey-Chaisemartin wrote:
> >> Le 05/12/2016 à 11:36 AM, Jerome Glisse a écrit :
> >>> On Thu, May 12, 2016 at 08:07:59AM +0200, Nicolas Morey-Chaisemartin wrote:
> [...]
> >>>> With transparent_hugepage=never I can't see the bug anymore.
> >>>>
> >>> Can you test https://patchwork.kernel.org/patch/9061351/ with 4.5
> >>> (does not apply to 3.10) and without transparent_hugepage=never
> >>>
> >>> Jérôme
> >> Fails with 4.5 + this patch and with 4.5 + this patch + yours
> >>
> > There must be some bug in your code, we have upstream user that works
> > fine with the above combination (see drivers/vfio/vfio_iommu_type1.c)
> > i suspect you might be releasing the page pin too early (put_page()).
> In my previous tests, I checked the page before calling put_page and it has already changed.
> And I also checked that there is not multiple transfers in a single page at once.
> So I doubt it's that.
> >
> > If you really believe it is bug upstream we would need a dumb kernel
> > module that does gup like you do and that shows the issue. Right now
> > looking at code (assuming above patches applied) i can't see anything
> > that can go wrong with THP.
> 
> The issue is that I doubt I'll be able to do that. We have had code running in production for at least a year without the issue showing up and now a single test shows this.
> And some tweak to the test (meaning memory footprint in the user space) can make the problem disappear.
> 
> Is there a way to track what is happening to the THP? From the looks of it, the refcount are changed behind my back? Would kgdb with watch point work on this?
> Is there a less painful way?

Do you use fork()?

If you have threads and your DMA I/O granularity is smaller than
PAGE_SIZE, and a thread of the application in parent or child is
writing to another part of the page, the I/O can get lost (worse, it
doesn't really get lost but goes to the child by mistake, instead
of sticking to the "mm" where you executed get_user_pages). This is
practically a bug in fork() but it's known. It can affect any app that
uses get_user_pages/O_DIRECT, fork() and threads, with an I/O
granularity smaller than PAGE_SIZE.

The same bug cannot happen with KSM or other things that can wrprotect
a page out of app control, because everything out of app control
checks that there are no page pins before wrprotecting the page. So it's
up to the app to control "fork()".

To fix it, you should do one of: 1) use MADV_DONTFORK on the pinned
region, 2) prevent fork() from running while you have pins taken with
get_user_pages, or anyway while get_user_pages may be running
concurrently, 3) use a PAGE_SIZE I/O granularity and/or prevent the
threads from writing to the other part of the page while DMA is running.

I'm not aware of other issues that could screw with page pins with THP
on kernels <=4.4; if there were, everything would fall apart,
including O_DIRECT and qemu cache=none. The only issue I'm aware of
that can cause DMA to get lost with page pins is the aforementioned
one.

To debug it further, I would suggest starting by searching for "fork"
calls, and adding MADV_DONTFORK to the pinned region if there's any
fork() in your testcase.

Without being allowed to see the source there's not much else we can
do considering there's no sign of unknown bugs in this area in kernels
<=4.4.

All there is, is the known bug above, but apps that could be affected
by it actively avoid it by using MADV_DONTFORK, as qemu does with
cache=none.

Thanks,
Andrea

