* [RFC PATCH kernel vfio] mm: vfio: Move pages out of CMA before pinning
From: Alexey Kardashevskiy @ 2015-08-05  8:08 UTC
  To: linux-mm
  Cc: Alexey Kardashevskiy, Alexander Duyck, Andrew Morton,
	Benjamin Herrenschmidt, David Gibson, Johannes Weiner,
	Joonsoo Kim, Mel Gorman, Michal Hocko, Paul Mackerras,
	Sasha Levin, Vlastimil Babka, linux-kernel, Alex Williamson,
	Alexander Graf, Paolo Bonzini, Aneesh Kumar K . V

This is about VFIO aka PCI passthrough used from QEMU.
KVM is irrelevant here.

QEMU is a machine emulator. It allocates guest RAM from anonymous memory
and these pages are movable, which is ok. They may happen to be allocated
from the contiguous memory allocation zone (CMA), which is also ok as
long as they are movable.

However, if the guest starts using VFIO (which can be hotplugged into
the guest), in most cases it involves DMA, which requires guest RAM pages
to be pinned and not to move once their addresses are programmed into
the hardware for DMA.

So we end up in a situation where quite a few pages in CMA are not movable
anymore, and we get a bunch of these:

[77306.513966] alloc_contig_range: [1f3800, 1f78c4) PFNs busy
[77306.514448] alloc_contig_range: [1f3800, 1f78c8) PFNs busy
[77306.514927] alloc_contig_range: [1f3800, 1f78cc) PFNs busy

This is a very rough patch to start the conversation about how to move
pages properly. mm/page_alloc.c does this and
arch/powerpc/mm/mmu_context_iommu.c exploits it.

Please do not comment on the style and code placement,
this is just to give some context :)

Obviously, this does not work well - it manages to migrate only a few pages
and crashes, as it is missing locks/disabled interrupts, and I probably
should not just remove pages from the LRU list (normally, I guess, only those
can migrate), and a million other things.

The questions are:

- what is the correct way of telling if the page is in CMA?
is (get_pageblock_migratetype(page) == MIGRATE_CMA) good enough?
(see the sketch after this list)

- how to tell MM to move a page away? I am calling migrate_pages() with
a get_new_page callback which allocates a page with GFP_USER but without
GFP_MOVABLE, which should allocate the new page outside of CMA. That seems
ok, but there is a little concern that we might want to add MOVABLE back
when the VFIO device is unplugged from the guest.

- do I need to isolate pages by using isolate_migratepages_range,
reclaim_clean_pages_from_list like __alloc_contig_migrate_range does?
I dropped them for now and the patch uses only @migratepages from
the compact_control struct.

- are there any flags in madvise() to address this (I could not
locate any relevant ones)?

- what else is missing? disabled interrupts? locks?
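
For the first two questions, here is roughly what I have in mind (untested
sketch; is_migrate_cma() is the existing macro from include/linux/mmzone.h
and the callback signature follows new_page_t from include/linux/migrate.h):

static bool page_is_cma(struct page *page)
{
	/* is_migrate_cma() evaluates to false when CONFIG_CMA is not set */
	return is_migrate_cma(get_pageblock_migratetype(page));
}

/*
 * GFP_USER lacks __GFP_MOVABLE, so the allocator should not take
 * the destination page from MIGRATE_CMA pageblocks.
 */
static struct page *new_page_outside_cma(struct page *page,
				unsigned long private, int **reason)
{
	return alloc_page(GFP_USER);
}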

Thanks!


Signed-off-by: Alexey Kardashevskiy <aik@ozlabs.ru>
---
 arch/powerpc/mm/mmu_context_iommu.c | 40 +++++++++++++++++++++++++++++++------
 mm/page_alloc.c                     | 36 +++++++++++++++++++++++++++++++++
 2 files changed, 70 insertions(+), 6 deletions(-)

diff --git a/arch/powerpc/mm/mmu_context_iommu.c b/arch/powerpc/mm/mmu_context_iommu.c
index da6a216..bf6850e 100644
--- a/arch/powerpc/mm/mmu_context_iommu.c
+++ b/arch/powerpc/mm/mmu_context_iommu.c
@@ -72,12 +72,15 @@ bool mm_iommu_preregistered(void)
 }
 EXPORT_SYMBOL_GPL(mm_iommu_preregistered);
 
+extern int mm_iommu_move_page(unsigned long pfn);
+
 long mm_iommu_get(unsigned long ua, unsigned long entries,
 		struct mm_iommu_table_group_mem_t **pmem)
 {
 	struct mm_iommu_table_group_mem_t *mem;
 	long i, j, ret = 0, locked_entries = 0;
 	struct page *page = NULL;
+	unsigned long moved = 0, tried = 0;
 
 	if (!current || !current->mm)
 		return -ESRCH; /* process exited */
@@ -122,15 +125,29 @@ long mm_iommu_get(unsigned long ua, unsigned long entries,
 	}
 
 	for (i = 0; i < entries; ++i) {
+		unsigned long pfn;
+
 		if (1 != get_user_pages_fast(ua + (i << PAGE_SHIFT),
 					1/* pages */, 1/* iswrite */, &page)) {
-			for (j = 0; j < i; ++j)
-				put_page(pfn_to_page(
-						mem->hpas[j] >> PAGE_SHIFT));
-			vfree(mem->hpas);
-			kfree(mem);
 			ret = -EFAULT;
-			goto unlock_exit;
+			goto put_exit;
+		}
+
+		pfn = page_to_pfn(page);
+		if (get_pageblock_migratetype(page) == MIGRATE_CMA)
+		{
+			unsigned long pfnold = pfn;
+			put_page(page);
+			page = NULL;
+			mm_iommu_move_page(pfn);
+			if (1 != get_user_pages_fast(ua + (i << PAGE_SHIFT),
+						1/* pages */, 1/* iswrite */, &page)) {
+				ret = -EFAULT;
+				goto put_exit;
+			}
+			pfn = page_to_pfn(page);
+			if (pfn != pfnold)
+				++moved;
 		}
 
 		mem->hpas[i] = page_to_pfn(page) << PAGE_SHIFT;
@@ -144,6 +161,17 @@ long mm_iommu_get(unsigned long ua, unsigned long entries,
 
 	list_add_rcu(&mem->next, &current->mm->context.iommu_group_mem_list);
 
+	printk("***K*** %s %u: tried %lu, moved %lu of %lu\n", __func__, __LINE__,
+			tried, moved, entries);
+
+put_exit:
+	if (ret) {
+		for (j = 0; j < i; ++j)
+			put_page(pfn_to_page(mem->hpas[j] >> PAGE_SHIFT));
+		vfree(mem->hpas);
+		kfree(mem);
+	}
+
 unlock_exit:
 	if (locked_entries && ret)
 		mm_iommu_adjust_locked_vm(current->mm, locked_entries, false);
diff --git a/mm/page_alloc.c b/mm/page_alloc.c
index ef19f22..0639cce 100644
--- a/mm/page_alloc.c
+++ b/mm/page_alloc.c
@@ -7008,3 +7008,39 @@ bool is_free_buddy_page(struct page *page)
 	return order < MAX_ORDER;
 }
 #endif
+
+static struct page *mm_iommu_new_page(struct page *page, unsigned long private,
+				int **reason)
+{
+	 /*
+	  * Anything but not GFP_MOVABLE!
+	  */
+	return alloc_page(GFP_USER);
+}
+
+static void mm_iommu_free_page(struct page *page, unsigned long private)
+{
+	free_page(page_to_pfn(page) << PAGE_SHIFT);
+}
+
+int mm_iommu_move_page(unsigned long pfn)
+{
+	unsigned long ret, nr_reclaimed;
+	struct compact_control cc = {
+		.nr_migratepages = 0,
+		.order = -1,
+		.zone = page_zone(pfn_to_page(pfn)),
+		.mode = MIGRATE_SYNC,
+		.ignore_skip_hint = true,
+	};
+	struct page *page = pfn_to_page(pfn);
+	INIT_LIST_HEAD(&cc.migratepages);
+
+	if (PageLRU(page)) {
+		list_del(&page->lru);
+	}
+	ret = migrate_pages(&cc.migratepages, mm_iommu_new_page,
+			mm_iommu_free_page, 0, cc.mode, MR_CMA);
+
+	return ret;
+}
-- 
2.4.0.rc3.8.gfb3e7d5



* Re: [RFC PATCH kernel vfio] mm: vfio: Move pages out of CMA before pinning
From: Alexey Kardashevskiy @ 2015-08-15 10:50 UTC
  To: linux-mm
  Cc: Alexander Duyck, Andrew Morton, Benjamin Herrenschmidt,
	David Gibson, Johannes Weiner, Joonsoo Kim, Mel Gorman,
	Michal Hocko, Paul Mackerras, Sasha Levin, Vlastimil Babka,
	linux-kernel, Alex Williamson, Alexander Graf, Paolo Bonzini,
	Aneesh Kumar K . V, Pavel Emelyanov

On 08/05/2015 06:08 PM, Alexey Kardashevskiy wrote:
> This is about VFIO aka PCI passthrough used from QEMU.
> KVM is irrelevant here.


Anyone, any idea? Or is the question way too stupid? :)


> [... rest of the original mail, including the full patch, snipped ...]


-- 
Alexey

* Re: [RFC PATCH kernel vfio] mm: vfio: Move pages out of CMA before pinning
From: Vlastimil Babka @ 2015-08-17  7:45 UTC
  To: Alexey Kardashevskiy, linux-mm
  Cc: Alexander Duyck, Andrew Morton, Benjamin Herrenschmidt,
	David Gibson, Johannes Weiner, Joonsoo Kim, Mel Gorman,
	Michal Hocko, Paul Mackerras, Sasha Levin, linux-kernel,
	Alex Williamson, Alexander Graf, Paolo Bonzini,
	Aneesh Kumar K . V, Peter Zijlstra

On 08/05/2015 10:08 AM, Alexey Kardashevskiy wrote:
> This is about VFIO aka PCI passthrough used from QEMU.
> KVM is irrelevant here.
>
> QEMU is a machine emulator. It allocates guest RAM from anonymous memory
> and these pages are movable, which is ok. They may happen to be allocated
> from the contiguous memory allocation zone (CMA), which is also ok as
> long as they are movable.
>
> However, if the guest starts using VFIO (which can be hotplugged into
> the guest), in most cases it involves DMA, which requires guest RAM pages
> to be pinned and not to move once their addresses are programmed into
> the hardware for DMA.
>
> So we end up in a situation where quite a few pages in CMA are not movable
> anymore, and we get a bunch of these:
>
> [77306.513966] alloc_contig_range: [1f3800, 1f78c4) PFNs busy
> [77306.514448] alloc_contig_range: [1f3800, 1f78c8) PFNs busy
> [77306.514927] alloc_contig_range: [1f3800, 1f78cc) PFNs busy

IIRC CMA was for mobile devices and their camera/codec drivers and you 
don't use QEMU on those? What do you need CMA for in your case?

> This is a very rough patch to start the conversation about how to move
> pages properly. mm/page_alloc.c does this and
> arch/powerpc/mm/mmu_context_iommu.c exploits it.

OK, such a conversation should probably start by mentioning the VM_PINNED
effort by Peter Zijlstra: https://lkml.org/lkml/2014/5/26/345

It's a more general approach to dealing with pinned pages, and moving them
out of the CMA area (and compacting them in general) prior to pinning is one
of the things that should be done within that framework.

Then there's the effort to enable migrating pages other than LRU during 
compaction (and thus CMA allocation): https://lwn.net/Articles/650864/
I don't know if that would be applicable in your use case, i.e. are the 
pins for DMA short-lived and can the isolation/migration code wait a bit 
for the transfer to finish so it can grab them, or something?

>
> Please do not comment on the style and code placement,
> this is just to give some context :)
>
> Obviously, this does not work well - it manages to migrate only a few pages
> and crashes, as it is missing locks/disabled interrupts, and I probably
> should not just remove pages from the LRU list (normally, I guess, only those
> can migrate), and a million other things.
>
> The questions are:
>
> - what is the correct way of telling if the page is in CMA?
> is (get_pageblock_migratetype(page) == MIGRATE_CMA) good enough?

Should be.

> - how to tell MM to move a page away? I am calling migrate_pages() with
> a get_new_page callback which allocates a page with GFP_USER but without
> GFP_MOVABLE, which should allocate the new page outside of CMA. That seems
> ok, but there is a little concern that we might want to add MOVABLE back
> when the VFIO device is unplugged from the guest.

Hmm, once the page is allocated, the migratetype is not tracked
anywhere (except in page_owner debug data). But the unmovable
allocations might exhaust the available unmovable pageblocks and lead to
fragmentation. So "add MOVABLE back" would be too late. Instead we would
need to tell the allocator somehow to give us a movable page but outside
of CMA. CMA's own __alloc_contig_migrate_range() avoids this problem by
allocating movable pages, but the range has already been page-isolated
and thus the allocator won't see the pages there. You obviously can't
take this approach and isolate all CMA pageblocks like that. That smells
like a new __GFP_FLAG, meh.
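
To illustrate (entirely made up - neither the flag nor the bit value
exists), such a flag would mean something like:

/* Hypothetical: keep an otherwise movable allocation away from
 * MIGRATE_CMA pageblocks. Illustration only, not a real kernel flag. */
#define __GFP_NOCMA	((__force gfp_t)0x8000000u)	/* made-up bit */

static inline bool can_use_cma_pageblocks(gfp_t gfp_mask)
{
	return (gfp_mask & __GFP_MOVABLE) && !(gfp_mask & __GFP_NOCMA);
}

The allocator's CMA fallback (IIRC __rmqueue_cma_fallback() these days)
would then check this instead of checking __GFP_MOVABLE alone.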

> - do I need to isolate pages by using isolate_migratepages_range,
> reclaim_clean_pages_from_list like __alloc_contig_migrate_range does?
> I dropped them for now and the patch uses only @migratepages from
> the compact_control struct.

You don't have to do reclaim_clean_pages_from_list(), but the isolation 
has to be careful, yeah.

> - are there any flags in madvise() to address this (I could not
> locate any relevant ones)?

AFAIK there's no madvise(I_WILL_BE_PINNING_THIS_RANGE).

> - what else is missing? disabled interrupts? locks?

See what isolate_migratepages_block() does.
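
Roughly, for a single page you would want something like this instead of
the bare list_del() (untested sketch; isolate_lru_page() is the existing
helper declared in mm/internal.h):

	/*
	 * isolate_lru_page() takes the lru_lock, re-checks PageLRU and
	 * takes a reference, so the page cannot be freed or moved on/off
	 * the LRU underneath us.
	 */
	if (!isolate_lru_page(page)) {
		list_add_tail(&page->lru, &cc.migratepages);
		inc_zone_page_state(page, NR_ISOLATED_ANON +
					page_is_file_cache(page));
	}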

> [... remainder of the quoted mail and patch snipped ...]


* Re: [RFC PATCH kernel vfio] mm: vfio: Move pages out of CMA before pinning
From: Alexey Kardashevskiy @ 2015-08-17  9:11 UTC
  To: Vlastimil Babka, linux-mm
  Cc: Alexander Duyck, Andrew Morton, Benjamin Herrenschmidt,
	David Gibson, Johannes Weiner, Joonsoo Kim, Mel Gorman,
	Michal Hocko, Paul Mackerras, Sasha Levin, linux-kernel,
	Alex Williamson, Alexander Graf, Paolo Bonzini,
	Aneesh Kumar K . V, Peter Zijlstra

On 08/17/2015 05:45 PM, Vlastimil Babka wrote:
> On 08/05/2015 10:08 AM, Alexey Kardashevskiy wrote:
>> This is about VFIO aka PCI passthrough used from QEMU.
>> KVM is irrelevant here.
>>
>> QEMU is a machine emulator. It allocates guest RAM from anonymous memory
>> and these pages are movable, which is ok. They may happen to be allocated
>> from the contiguous memory allocation zone (CMA), which is also ok as
>> long as they are movable.
>>
>> However, if the guest starts using VFIO (which can be hotplugged into
>> the guest), in most cases it involves DMA, which requires guest RAM pages
>> to be pinned and not to move once their addresses are programmed into
>> the hardware for DMA.
>>
>> So we end up in a situation where quite a few pages in CMA are not movable
>> anymore, and we get a bunch of these:
>>
>> [77306.513966] alloc_contig_range: [1f3800, 1f78c4) PFNs busy
>> [77306.514448] alloc_contig_range: [1f3800, 1f78c8) PFNs busy
>> [77306.514927] alloc_contig_range: [1f3800, 1f78cc) PFNs busy
>
> IIRC CMA was for mobile devices and their camera/codec drivers and you
> don't use QEMU on those? What do you need CMA for in your case?


I do not want QEMU to get memory from CMA, this is my point. It just 
happens sometimes that the kernel allocates movable pages from there.


>
>> This is a very rough patch to start the conversation about how to move
>> pages properly. mm/page_alloc.c does this and
>> arch/powerpc/mm/mmu_context_iommu.c exploits it.
>
> OK, such a conversation should probably start by mentioning the VM_PINNED
> effort by Peter Zijlstra: https://lkml.org/lkml/2014/5/26/345
>
> It's a more general approach to dealing with pinned pages, and moving them
> out of the CMA area (and compacting them in general) prior to pinning is
> one of the things that should be done within that framework.


And I assume these patches did not go anywhere, right?...



> Then there's the effort to enable migrating pages other than LRU during
> compaction (and thus CMA allocation): https://lwn.net/Articles/650864/
> I don't know if that would be applicable in your use case, i.e. are the
> pins for DMA short-lived and can the isolation/migration code wait a bit
> for the transfer to finish so it can grab them, or something?


Pins for DMA are long-lived, pretty much as long as the guest is running. 
So this "compaction" is too late.


>>
>> Please do not comment on the style and code placement,
>> this is just to give some context :)
>>
>> Obviously, this does not work well - it manages to migrate only a few pages
>> and crashes, as it is missing locks/disabled interrupts, and I probably
>> should not just remove pages from the LRU list (normally, I guess, only
>> those can migrate), and a million other things.
>>
>> The questions are:
>>
>> - what is the correct way of telling if the page is in CMA?
>> is (get_pageblock_migratetype(page) == MIGRATE_CMA) good enough?
>
> Should be.
>
>> - how to tell MM to move a page away? I am calling migrate_pages() with
>> a get_new_page callback which allocates a page with GFP_USER but without
>> GFP_MOVABLE, which should allocate the new page outside of CMA. That seems
>> ok, but there is a little concern that we might want to add MOVABLE back
>> when the VFIO device is unplugged from the guest.
>
> Hmm, once the page is allocated, the migratetype is not tracked
> anywhere (except in page_owner debug data). But the unmovable allocations
> might exhaust the available unmovable pageblocks and lead to fragmentation.
> So "add MOVABLE back" would be too late. Instead we would need to tell the
> allocator somehow to give us a movable page but outside of CMA.

If it is movable, why do we care if it is in CMA or not?

> CMA's own
> __alloc_contig_migrate_range() avoids this problem by allocating movable
> pages, but the range has already been page-isolated and thus the allocator
> won't see the pages there. You obviously can't take this approach and
> isolate all CMA pageblocks like that. That smells like a new __GFP_FLAG, meh.


I understood (more or less) all of it except the __GFP_FLAG - when/what 
would use it?



>> - do I need to isolate pages by using isolate_migratepages_range,
>> reclaim_clean_pages_from_list like __alloc_contig_migrate_range does?
>> I dropped them for now and the patch uses only @migratepages from
>> the compact_control struct.
>
> You don't have to do reclaim_clean_pages_from_list(), but the isolation has
> to be careful, yeah.


By isolation, do you mean isolating the whole CMA zone, the approach I 
"obviously can't take"? :)


>> - are there any flags in madvise() to address this (I could not
>> locate any relevant ones)?
>
> AFAIK there's no madvise(I_WILL_BE_PINNING_THIS_RANGE)
>
>> - what else is missing? disabled interrupts? locks?
>
> See what isolate_migratepages_block() does.


Thanks for the pointers! I'll have a closer look at Peter's patchset.


-- 
Alexey

* Re: [RFC PATCH kernel vfio] mm: vfio: Move pages out of CMA before pinning
From: Vlastimil Babka @ 2015-08-17  9:53 UTC
  To: Alexey Kardashevskiy, linux-mm
  Cc: Alexander Duyck, Andrew Morton, Benjamin Herrenschmidt,
	David Gibson, Johannes Weiner, Joonsoo Kim, Mel Gorman,
	Michal Hocko, Paul Mackerras, Sasha Levin, linux-kernel,
	Alex Williamson, Alexander Graf, Paolo Bonzini,
	Aneesh Kumar K . V, Peter Zijlstra

On 08/17/2015 11:11 AM, Alexey Kardashevskiy wrote:
> On 08/17/2015 05:45 PM, Vlastimil Babka wrote:
>> On 08/05/2015 10:08 AM, Alexey Kardashevskiy wrote:
>>> This is about VFIO aka PCI passthrough used from QEMU.
>>> KVM is irrelevant here.
>>>
>>> QEMU is a machine emulator. It allocates guest RAM from anonymous memory
>>> and these pages are movable, which is ok. They may happen to be allocated
>>> from the contiguous memory allocation zone (CMA), which is also ok as
>>> long as they are movable.
>>>
>>> However, if the guest starts using VFIO (which can be hotplugged into
>>> the guest), in most cases it involves DMA, which requires guest RAM pages
>>> to be pinned and not to move once their addresses are programmed into
>>> the hardware for DMA.
>>>
>>> So we end up in a situation where quite a few pages in CMA are not movable
>>> anymore, and we get a bunch of these:
>>>
>>> [77306.513966] alloc_contig_range: [1f3800, 1f78c4) PFNs busy
>>> [77306.514448] alloc_contig_range: [1f3800, 1f78c8) PFNs busy
>>> [77306.514927] alloc_contig_range: [1f3800, 1f78cc) PFNs busy
>>
>> IIRC CMA was for mobile devices and their camera/codec drivers and you
>> don't use QEMU on those? What do you need CMA for in your case?
>
>
> I do not want QEMU to get memory from CMA, this is my point. It just
> happens sometimes that the kernel allocates movable pages from there.

I meant: why does the kernel used for QEMU also have CMA enabled and in use 
(for something else)? CMA is mostly used on mobile devices, and those don't 
run QEMU?

>
>>
>>> This is a very rough patch to start the conversation about how to move
>>> pages properly. mm/page_alloc.c does this and
>>> arch/powerpc/mm/mmu_context_iommu.c exploits it.
>>
>> OK, such a conversation should probably start by mentioning the VM_PINNED
>> effort by Peter Zijlstra: https://lkml.org/lkml/2014/5/26/345
>>
>> It's a more general approach to dealing with pinned pages, and moving them
>> out of the CMA area (and compacting them in general) prior to pinning is
>> one of the things that should be done within that framework.
>
>
> And I assume these patches did not go anywhere, right?...

Not yet :)

>> Then there's the effort to enable migrating pages other than LRU during
>> compaction (and thus CMA allocation): https://lwn.net/Articles/650864/
>> I don't know if that would be applicable in your use case, i.e. are the
>> pins for DMA short-lived and can the isolation/migration code wait a bit
>> for the transfer to finish so it can grab them, or something?
>
>
> Pins for DMA are long-lived, pretty much as long as the guest is running.
> So this "compaction" is too late.

Oh, OK.

>>>
>>> Please do not comment on the style and code placement,
>>> this is just to give some context :)
>>>
>>> Obviously, this does not work well - it manages to migrate only few pages
>>> and crashes as it is missing locks/disabling interrupts and I probably
>>> should not just remove pages from LRU list (normally, I guess, only these
>>> can migrate) and a million of other things.
>>>
>>> The questions are:
>>>
>>> - what is the correct way of telling if the page is in CMA?
>>> is (get_pageblock_migratetype(page) == MIGRATE_CMA) good enough?
>>
>> Should be.
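
Spelled out as a helper, that test would look like this (an untested
sketch; the name page_in_cma is made up, and with CONFIG_CMA the
is_migrate_cma() macro is an equivalent spelling of the same check):

	/* True if the page sits in a CMA pageblock. */
	static bool page_in_cma(struct page *page)
	{
		return get_pageblock_migratetype(page) == MIGRATE_CMA;
	}
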
>>
>>> - how to tell MM to move page away? I am calling migrate_pages() with
>>> a get_new_page callback which allocates a page with GFP_USER but without
>>> GFP_MOVABLE which should allocate new page out of CMA which seems ok but
>>> there is a little concern that we might want to add MOVABLE back when
>>> VFIO device is unplugged from the guest.
>>
>> Hmm, once the page is allocated, then the migratetype is not tracked
>> anywhere (except in page_owner debug data). But the unmovable allocations
>> might exhaust available unmovable pageblocks and lead to fragmentation. So
>> "add MOVABLE back" would be too late. Instead we would need to tell the
>> allocator somehow to give us movable page but outside of CMA.
>
> If it is movable, why do we care if it is in CMA or not?

I did assume your pages are mostly movable, but with some temporary pins 
they might not be reliably movable at an arbitrary time. But if you say 
the pins are long-lived then it's probably best to allocate without 
MOVABLE. If the device is later unplugged, sync compaction will 
eventually move the pages out of unmovable pageblocks.
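
For the allocation side of that, the get_new_page callback can be as
simple as this (untested sketch against the 4.2-era new_page_t
signature; the name mm_iommu_new_page is made up):

	/*
	 * GFP_USER does not include __GFP_MOVABLE, so the replacement
	 * page is allocated outside CMA (and MOVABLE) pageblocks.
	 */
	static struct page *mm_iommu_new_page(struct page *page,
			unsigned long private, int **result)
	{
		return alloc_page(GFP_USER);
	}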

>> CMA's own
>> __alloc_contig_migrate_range() avoids this problem by allocating movable
>> pages, but the range has been already page-isolated and thus the allocator
>> won't see the pages there. You obviously can't take this approach and
>> isolate all CMA pageblocks like that.  That smells like a new __GFP_FLAG, meh.
>
>
> I understood (more or less) all of it except the __GFP_FLAG part - when
> and by what would it be used?

Never mind then, wrt the above.

>>> - do I need to isolate pages by using isolate_migratepages_range,
>>> reclaim_clean_pages_from_list like __alloc_contig_migrate_range does?
>>> I dropped them for now and the patch uses only @migratepages from
>>> the compact_control struct.
>>
>> You don't have to do reclaim_clean_pages_from_list(), but the isolation has
>> to be careful, yeah.
>
>
> Does "isolation" here mean isolating the whole CMA zone, the approach I
> "obviously can't take"? :)

Ah no, in this context it means isolation from the LRU lists. 
Unfortunately the same word is used for both.
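
FWIW, a careful version of the migration loop could look roughly like
this (very much untested, error handling omitted; mm_iommu_new_page is
the made-up callback sketched above, and isolate_lru_page() takes the
LRU lock for you):

	LIST_HEAD(migratepages);
	int rc;

	/* for each candidate page found in a MIGRATE_CMA pageblock: */
	if (!isolate_lru_page(page))
		list_add_tail(&page->lru, &migratepages);

	rc = migrate_pages(&migratepages, mm_iommu_new_page, NULL, 0,
			   MIGRATE_SYNC, MR_CMA);
	if (rc)
		putback_movable_pages(&migratepages);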


^ permalink raw reply	[flat|nested] 16+ messages in thread

* Re: [RFC PATCH kernel vfio] mm: vfio: Move pages out of CMA before pinning
  2015-08-17  9:11     ` Alexey Kardashevskiy
@ 2015-08-17  9:57       ` Benjamin Herrenschmidt
  -1 siblings, 0 replies; 16+ messages in thread
From: Benjamin Herrenschmidt @ 2015-08-17  9:57 UTC (permalink / raw)
  To: Alexey Kardashevskiy, Vlastimil Babka, linux-mm
  Cc: Alexander Duyck, Andrew Morton, David Gibson, Johannes Weiner,
	Joonsoo Kim, Mel Gorman, Michal Hocko, Paul Mackerras,
	Sasha Levin, linux-kernel, Alex Williamson, Alexander Graf,
	Paolo Bonzini, Aneesh Kumar K . V, Peter Zijlstra

On Mon, 2015-08-17 at 19:11 +1000, Alexey Kardashevskiy wrote:
> On 08/17/2015 05:45 PM, Vlastimil Babka wrote:
> > On 08/05/2015 10:08 AM, Alexey Kardashevskiy wrote:
> > > This is about VFIO aka PCI passthrough used from QEMU.
> > > KVM is irrelevant here.
> > > 
> > > QEMU is a machine emulator. It allocates guest RAM from anonymous 
> > > memory
> > > and these pages are movable which is ok. They may happen to be 
> > > allocated
> > > from the contiguous memory allocation zone (CMA). Which is also 
> > > ok as
> > > long they are movable.
> > > 
> > > However if the guest starts using VFIO (which can be hotplugged 
> > > into
> > > the guest), in most cases it involves DMA which requires guest 
> > > RAM pages
> > > to be pinned and not move once their addresses are programmed to
> > > the hardware for DMA.
> > > 
> > > So we end up in a situation when quite many pages in CMA are not 
> > > movable
> > > anymore. And we get bunch of these:
> > > 
> > > [77306.513966] alloc_contig_range: [1f3800, 1f78c4) PFNs busy
> > > [77306.514448] alloc_contig_range: [1f3800, 1f78c8) PFNs busy
> > > [77306.514927] alloc_contig_range: [1f3800, 1f78cc) PFNs busy
> > 
> > IIRC CMA was for mobile devices and their camera/codec drivers and 
> > you
> > don't use QEMU on those? What do you need CMA for in your case?
> 
> I do not want QEMU to get memory from CMA, this is my point. It just 
> happens sometimes that the kernel allocates movable pages from there.

You may want to explain why we have a CMA in the first place... our
KVM implementation needs to allocate large chunks of physically
contiguous memory for each guest, to hold that guest's MMU hash table.

We allocate these from a CMA whose size can be specified at boot and is
generally a percentage of total system memory.

However, we don't want normal allocations that we *know* are going to be
pinned to land in that CMA, otherwise they would defeat its purpose, so
this patch is about moving the stuff we are about to pin out of the CMA
first.
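
(On powerpc the size of that reservation is controlled by a boot
parameter, IIRC kvm_cma_resv_ratio=, e.g. kvm_cma_resv_ratio=5 to
reserve 5% of RAM for it.)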

Cheers,
Ben.


^ permalink raw reply	[flat|nested] 16+ messages in thread

* Re: [RFC PATCH kernel vfio] mm: vfio: Move pages out of CMA before pinning
  2015-08-17  9:53       ` Vlastimil Babka
@ 2015-08-17  9:58         ` Benjamin Herrenschmidt
  -1 siblings, 0 replies; 16+ messages in thread
From: Benjamin Herrenschmidt @ 2015-08-17  9:58 UTC (permalink / raw)
  To: Vlastimil Babka, Alexey Kardashevskiy, linux-mm
  Cc: Alexander Duyck, Andrew Morton, David Gibson, Johannes Weiner,
	Joonsoo Kim, Mel Gorman, Michal Hocko, Paul Mackerras,
	Sasha Levin, linux-kernel, Alex Williamson, Alexander Graf,
	Paolo Bonzini, Aneesh Kumar K . V, Peter Zijlstra

On Mon, 2015-08-17 at 11:53 +0200, Vlastimil Babka wrote:
> I meant why the kernel used for QEMU has also CMA enabled and used 
> (for 
> something else)? CMA is mostly used on mobile devices and they don't 
> run 
> QEMU?

I explained in a separate reply, but yes, we do use a CMA for KVM, for
our MMU hash tables.

Cheers,
Ben.


^ permalink raw reply	[flat|nested] 16+ messages in thread

* Re: [RFC PATCH kernel vfio] mm: vfio: Move pages out of CMA before pinning
  2015-08-17  9:11     ` Alexey Kardashevskiy
@ 2015-08-31 13:48       ` Peter Zijlstra
  -1 siblings, 0 replies; 16+ messages in thread
From: Peter Zijlstra @ 2015-08-31 13:48 UTC (permalink / raw)
  To: Alexey Kardashevskiy
  Cc: Vlastimil Babka, linux-mm, Alexander Duyck, Andrew Morton,
	Benjamin Herrenschmidt, David Gibson, Johannes Weiner,
	Joonsoo Kim, Mel Gorman, Michal Hocko, Paul Mackerras,
	Sasha Levin, linux-kernel, Alex Williamson, Alexander Graf,
	Paolo Bonzini, Aneesh Kumar K . V

On Mon, Aug 17, 2015 at 07:11:01PM +1000, Alexey Kardashevskiy wrote:
> >OK such conversation should probably start by mentioning the VM_PINNED
> >effort by Peter Zijlstra: https://lkml.org/lkml/2014/5/26/345
> >
> >It's more general approach to dealing with pinned pages, and moving them
> >out of CMA area (and compacting them in general) prior pinning is one of
> >the things that should be done within that framework.
> 
> 
> And I assume these patches did not go anywhere, right?...

I got lost in the IB code :/

It's on the TODO pile somewhere.

^ permalink raw reply	[flat|nested] 16+ messages in thread

end of thread, other threads:[~2015-08-31 13:49 UTC | newest]

Thread overview: 8 messages (download: mbox.gz / follow: Atom feed)
-- links below jump to the message on this page --
2015-08-05  8:08 [RFC PATCH kernel vfio] mm: vfio: Move pages out of CMA before pinning Alexey Kardashevskiy
2015-08-15 10:50 ` Alexey Kardashevskiy
2015-08-17  7:45 ` Vlastimil Babka
2015-08-17  9:11   ` Alexey Kardashevskiy
2015-08-17  9:53     ` Vlastimil Babka
2015-08-17  9:58       ` Benjamin Herrenschmidt
2015-08-17  9:57     ` Benjamin Herrenschmidt
2015-08-31 13:48     ` Peter Zijlstra
