* [PATCH] [0/13] General DMA zone rework
@ 2008-03-07  9:07 Andi Kleen
  2008-03-07  9:07 ` [PATCH] [2/13] Make get_order(0) return 0 Andi Kleen
                   ` (14 more replies)
  0 siblings, 15 replies; 56+ messages in thread
From: Andi Kleen @ 2008-03-07  9:07 UTC (permalink / raw)
  To: linux-kernel, linux-mm


Background:

The 16MB Linux ZONE_DMA has some long-standing problems on
x86. Traditionally it was designed only for ISA DMA, which is limited to
24-bit addressing (16MB). This means it has a fixed 16MB size.

On 32-bit i386, with its limited virtual memory space, the next zone up is
lowmem with ~900MB (on the default split), which works for a lot of devices,
but not all.  But on x86-64 the next zone up is DMA32, which covers everything
below 4GB and is therefore too big for quite a lot of devices (typically ones
with 28-, 30- or 31-bit limitations).

While the DMA zone is a true VM zone and could in theory be used for
any user allocation, in practice the VM has a concept called lower zone
protection to avoid low-memory deadlocks, which keeps ZONE_DMA nearly
always free unless the caller specifies GFP_DMA directly. This means
in practice it does not participate in the rest of the automatic VM
balancing and its memory is essentially reserved.

Then there is another pool used on x86-64: the swiotlb pool. It is 64MB
by default and is used for bounce buffering in the PCI DMA API in the
low-level drivers for any devices that have 32-bit limitations (very common).

Swiotlb and the DMA zone already interact. For consistent mappings swiotlb
will already allocate from both the DMA zone and from the swiotlb pool as
needed. On the other hand swiotlb is a truly separate pool not directly
visible to the normal zone balancing (although it happens to be in the
DMA32 zone). In practice ZONE_DMA behaves very similarly in that respect.

Driver interfaces:

When drivers need DMA-able memory they typically use the pci_*/dma_*
interfaces, which allow specifying device masks.   There are two interfaces
here:

dma_alloc_coherent/pci_alloc_consistent to get a block of coherent
memory honouring a device DMA mask and mapped into an IOMMU as needed.
And pci_map_*/dma_map_* to remap an arbitrary block to the DMA mask of
the device and into the IOMMU.
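
For illustration, here is a minimal sketch (not from the patches) of how a
driver uses the two interfaces; "dev", "existing_buf" and "len" are
placeholders:

	void *buf;
	dma_addr_t dma_handle;

	/* 1. Coherent allocation: memory honours dev's DMA mask. */
	buf = dma_alloc_coherent(dev, 4096, &dma_handle, GFP_KERNEL);
	if (buf) {
		/* ... program dma_handle into the hardware ... */
		dma_free_coherent(dev, 4096, buf, dma_handle);
	}

	/* 2. Streaming mapping: remap (or bounce) an existing buffer. */
	dma_handle = dma_map_single(dev, existing_buf, len, DMA_TO_DEVICE);
	/* ... DMA runs ... */
	dma_unmap_single(dev, dma_handle, len, DMA_TO_DEVICE);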

Both ways have their own disadvantages: coherent mappings can have some
bad performance penalties and high setup costs on platforms which
are not fully IO coherent, so they are not encouraged for high-volume
driver data.  And pci/dma_map_* will always bounce buffer in the common
x86 swiotlb case, so it might be quite expensive. Also, a lot of IOMMU
implementations (in particular x86 swiotlb/pci-gart) do not support
remapping in pci/dma_map_* to any DMA mask smaller than 32 bits, so it
cannot actually be used for ISA or any other device with a <32-bit DMA mask.

Then there is the old-style way of directly telling the allocators
what memory you need by using GFP_DMA or GFP_DMA32 and then later using the
bounce-less pci/dma_map_*. That also has its own set of problems: first,
GFP_DMA varies between architectures. On x86 it is always 16MB, on IA64
4GB, and on some other architectures it doesn't exist at all or has other
sizes. GFP_DMA32 often doesn't exist at all (although it can often be
replaced with GFP_KERNEL on 32-bit platforms).  This means any caller
needs knowledge about its platform, which is often not portable.
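
For reference, the old-style pattern looks roughly like this (a sketch only;
the 8KB size is an arbitrary example):

	/* Old style: ask the page allocator for ISA-reachable memory. */
	unsigned long addr = __get_free_pages(GFP_KERNEL | GFP_DMA,
					      get_order(8192));
	if (addr) {
		/* The caller has to trust that GFP_DMA means <16MB here;
		 * on other architectures it may mean something else. */
		free_pages(addr, get_order(8192));
	}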

Then the other problem is that these are only single flag bits selecting
small fixed zones. So for example if a user has a 30-bit DMA limit on 64-bit
they have no other choice than to use GFP_DMA, and when they need more than
16MB of memory they lose.  On the other hand, on a lot of other boxes which
don't have any devices with <4GB DMA masks, ZONE_DMA is just wasted memory.

Then GFP_DMA is also not a very clean interface. It is usually
not documented what device mask is really needed. Also some driver
authors misunderstand it as meaning "required for any DMA", which is
not correct. And often the device actually supports DMA masks larger than
24 bits (16MB), so the fixed 24-bit zone on x86 is needlessly limiting.

The pci_alloc_consistent implementation on x86 also has more problems:
usually it cannot use an IOMMU to remap the DMA memory, so it actually
has to allocate memory with physical addresses that satisfy the DMA mask
of the passed device. All it can do for this is map the request to GFP_DMA
(only 16MB), GFP_DMA32 (big, but sometimes too big) or get the memory
from the swiotlb pool. It also attempts to get fitting memory from
the main allocator. That mostly works, but has bad corner cases and is
quite inelegant.

In practice this leads to various unfortunate situations: either the 64-bit
system has up to ~100MB of reserved memory wasted (ZONE_DMA + swiotlb)
while not having any devices that require bouncing to 16MB or 4GB.

Or the system has devices that need bouncing to <4GB, but the pools in
their default sizes are too small and can overflow. There are various
hac^wworkarounds in drivers for this problem, but it still causes
problems for users.  ZONE_DMA also cannot be enlarged, because a
lot of drivers "know" that it is only 16MB and expect memory below 16MB
from it.

On 32bit x86 the problem is a little less severe because of the 900MB
ZONE_NORMAL which fits most devices, but there are still some problems
with more obscure devices with sufficiently small DMA masks. And ISA
devices still fit in badly.

Requirements for a solution:

There is clearly a need for a new, better low-memory bounce pool on x86.
It must be larger than 16MB and actually be variable sized. Also the
driver interfaces are inadequate: all DMA memory allocations should
specify what mask they actually need. That allows the pool to be extended
and a single pool to be used for multiple masks.

The new pool must be isolated from the rest of the VM. Otherwise it
cannot be safely used in device driver paths which cannot necessarily
allocate memory safely (e.g. the block writeout path is not allowed to
do this, to avoid deadlocks while swapping). The current effective pools
(ZONE_DMA, swiotlb) are already isolated in practice, so this won't make
much difference.

Proposed solution:

I chose to implement a new "maskable memory" allocator to solve these
problems. The existing page buddy allocator is not really suited for
this because the data structures don't allow cheap allocation by physical 
address boundary. 

The allocator has a separate pool of memory that it grabs at boot
using the bootmem allocator. The memory is part of the lowest zone,
but practically invisible to the normal page allocator or VM.

The allocator is very simple: it works with pages and uses a bitmap to
find memory. It also uses a simple rotating cursor through the bitmap.
It is very similar to the allocators used by the various IOMMU
implementations.  While this is not a true O(1) allocator, in practice
it tends to find free pages very quickly and it is quite flexible.
Also it makes it very simple to allocate below arbitrary address boundaries.
It has one advantage over buddy in that it doesn't require all
blocks to be power-of-two sized; it only rounds up to whole pages. So
especially larger blocks tend to have less overhead.
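
To make the overhead point concrete (an illustration only, assuming 4K pages):

	/* A 5-page (20KB) request:
	 *   buddy:  get_order(5 * PAGE_SIZE) == 3  ->  8 pages allocated
	 *   mask:   (5 * PAGE_SIZE + PAGE_SIZE - 1) >> PAGE_SHIFT == 5 pages
	 */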

The allocator knows how to fall back to the other zones if
the mask is sufficiently large, so it can be used for
arbitrary masks.

I chose to implement only a page mask allocator, not "kmalloc_mask",
because the various drivers I looked at actually tended to allocate
quite large objects, approaching a page in size. Also, if a sub-page
allocator is really needed there are several existing ones that could be
relatively easily adapted (mm/dmapool.c or the lib/bitmap.c allocator)
on top of a page allocator.
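
For reference, a minimal usage sketch of the new interface added later in
this series (the 8KB size and 30-bit mask are example values):

	/* Allocate 8KB that must lie below a 30-bit physical address. */
	void *buf = get_pages_mask(GFP_KERNEL, 8192, (1ULL << 30) - 1);

	if (buf) {
		/* ... hand virt_to_phys(buf) to the device ... */
		free_pages_mask(buf, 8192);
	}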

The maskable allocator's pool is variable sized and the user can set
it to any size needed (up to 2GB currently). The default sizing
heuristics are for now the same as in the old code: by default
all free memory below 16MB is put into the pool (in practice only
~8MB or so of that is usable because the kernel is loaded there too),
and if swiotlb is needed another 64MB of low memory is reserved too.
The user can override this using the command line.
Any other subsystem can also increase the memory reservation
(but this currently has to happen early, while bootmem is still active).
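
As an illustration of the command-line override added later in the series
(the sizes are arbitrary examples):

	maskzone=256M		256MB maskable pool
	maskzone=64M,force	64MB pool, and force mask allocations to
				come from the mask zone (for testing)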

In the future I hope to make this more flexible. In particular
the low memory would not need to be fully reserved, but only put into
the "movable" zone; then later, as devices are discovered
and e.g. block devices are set up, this pre-reservation could
be turned into an actual reservation. This would then allow using
a lot of the ~100MB that currently goes to waste on x86-64.  But
so far that is not implemented.

swiotlb doesn't maintain its own pool anymore, but just allocates
using the mask allocator.  This is safe because the maskable
pool is isolated from the rest of the VM and not prone
to OOM deadlocks. This is admittedly more of a heuristic than
a strict 100% guarantee, but it is not worse than the old swiotlb.
Also, I believe all current users of the maskable allocator are benign
and won't overflow it in dynamic situations.

Internal implementation:

ZONE_DMA is disabled for x86 with the maskable allocator enabled.

The maskable allocator requires some simple modifications in
the architecture start up code. These are currently only 
done for x86. All other architectures keep the same GFP_DMA
semantics as they had before. Adapting other architectures
wouldn't be very difficult.

The longer term goal is to convert all GFP_DMA allocations
to always specify the correct mask and then eventually remove
GFP_DMA.

Especially I hope kmalloc/kmem_cache_alloc GFP_DMA can be
removed soon. I have some patches to eliminate those users.
Then slab wouldn't need to maintain DMA caches anymore.

This patch kit only contains the core changes for the actual
allocator, the swiotlb conversion and pci_alloc_consistent/dma_alloc_coherent().
Between swiotlb and the latter changes this already means that
a lot of drivers use it.

The existing GFP_DMA users transparently fall back to
a maskable allocation with a 16MB (24-bit) mask.

I have various other driver conversions in the pipeline, but
I will post these separately to not distract too much from
the review of the main code.

-Andi


* [PATCH] [2/13] Make get_order(0) return 0
  2008-03-07  9:07 [PATCH] [0/13] General DMA zone rework Andi Kleen
@ 2008-03-07  9:07 ` Andi Kleen
  2008-03-07  9:07 ` [PATCH] [3/13] Make kvm bad_page symbol static Andi Kleen
                   ` (13 subsequent siblings)
  14 siblings, 0 replies; 56+ messages in thread
From: Andi Kleen @ 2008-03-07  9:07 UTC (permalink / raw)
  To: linux-kernel, linux-mm


This is needed for some follow-up patches. Some old drivers
(like xd.c) pass 0 to get_order and the compat wrapper for
the mask allocator doesn't like the resulting underflow.
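
To spell out the underflow (not part of the patch; numbers assume 64-bit
with 4K pages):

	/*
	 * Without the check, get_order(0) computes
	 *	size = (0 - 1) >> (PAGE_SHIFT - 1);	// ~0UL >> 11
	 * which is 2^53 - 1, so the loop runs 53 times and returns
	 * order 52, which is nonsense. With the patch get_order(0) returns 0.
	 */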

Signed-off-by: Andi Kleen <ak@suse.de>

---
 include/asm-generic/page.h |    3 +++
 1 file changed, 3 insertions(+)

Index: linux/include/asm-generic/page.h
===================================================================
--- linux.orig/include/asm-generic/page.h
+++ linux/include/asm-generic/page.h
@@ -11,6 +11,9 @@ static __inline__ __attribute_const__ in
 {
 	int order;
 
+	if (size == 0)
+		return 0;
+
 	size = (size - 1) >> (PAGE_SHIFT - 1);
 	order = -1;
 	do {


* [PATCH] [3/13] Make kvm bad_page symbol static
  2008-03-07  9:07 [PATCH] [0/13] General DMA zone rework Andi Kleen
  2008-03-07  9:07 ` [PATCH] [2/13] Make get_order(0) return 0 Andi Kleen
@ 2008-03-07  9:07 ` Andi Kleen
  2008-03-07  9:07 ` [PATCH] [4/13] Prepare page_alloc for the maskable allocator Andi Kleen
                   ` (12 subsequent siblings)
  14 siblings, 0 replies; 56+ messages in thread
From: Andi Kleen @ 2008-03-07  9:07 UTC (permalink / raw)
  To: avi, linux-kernel, linux-mm


Avoids global namespace clash with later patch that exports page_alloc's 
bad_page. bad_page is not used outside kvm_main.c, so making it static is fine.

Cc: avi@qumranet.com

Signed-off-by: Andi Kleen <ak@suse.de>

---
 include/linux/kvm_host.h |    2 --
 virt/kvm/kvm_main.c      |    4 ++--
 2 files changed, 2 insertions(+), 4 deletions(-)

Index: linux/include/linux/kvm_host.h
===================================================================
--- linux.orig/include/linux/kvm_host.h
+++ linux/include/linux/kvm_host.h
@@ -150,8 +150,6 @@ void kvm_exit(void);
 static inline int is_error_hpa(hpa_t hpa) { return hpa >> HPA_MSB; }
 struct page *gva_to_page(struct kvm_vcpu *vcpu, gva_t gva);
 
-extern struct page *bad_page;
-
 int is_error_page(struct page *page);
 int kvm_is_error_hva(unsigned long addr);
 int kvm_set_memory_region(struct kvm *kvm,
Index: linux/virt/kvm/kvm_main.c
===================================================================
--- linux.orig/virt/kvm/kvm_main.c
+++ linux/virt/kvm/kvm_main.c
@@ -52,6 +52,8 @@ MODULE_LICENSE("GPL");
 DEFINE_SPINLOCK(kvm_lock);
 LIST_HEAD(vm_list);
 
+static struct page *bad_page;
+
 static cpumask_t cpus_hardware_enabled;
 
 struct kmem_cache *kvm_vcpu_cache;
@@ -1271,8 +1273,6 @@ static struct sys_device kvm_sysdev = {
 	.cls = &kvm_sysdev_class,
 };
 
-struct page *bad_page;
-
 static inline
 struct kvm_vcpu *preempt_notifier_to_vcpu(struct preempt_notifier *pn)
 {


* [PATCH] [4/13] Prepare page_alloc for the maskable allocator
  2008-03-07  9:07 [PATCH] [0/13] General DMA zone rework Andi Kleen
  2008-03-07  9:07 ` [PATCH] [2/13] Make get_order(0) return 0 Andi Kleen
  2008-03-07  9:07 ` [PATCH] [3/13] Make kvm bad_page symbol static Andi Kleen
@ 2008-03-07  9:07 ` Andi Kleen
  2008-03-07 18:19   ` Sam Ravnborg
  2008-03-07  9:07 ` [PATCH] [5/13] Add mask allocator statistics to vmstat.[ch] Andi Kleen
                   ` (11 subsequent siblings)
  14 siblings, 1 reply; 56+ messages in thread
From: Andi Kleen @ 2008-03-07  9:07 UTC (permalink / raw)
  To: linux-kernel, linux-mm


The maskable allocator needs some functions from page_alloc.c exported;
in particular free_pages_check and prep_new_page. Do that via mm/internal.h.

Also extend free_pages_check to support custom flags and allow prep_new_page
to ignore the Reserved bit.

No behaviour change itself; just some code movement.
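
For context, the new second argument of free_pages_check() is used later in
the series roughly like this (a sketch based on the later hunks):

	/* page_alloc treats mask-allocator pages as "bad" in its free path: */
	reserved += free_pages_check(page + i, 1 << PG_mask_alloc);

	/* while the mask allocator itself checks with no extra flags: */
	if (free_pages_check(p, 0))
		bad_page(p);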

Signed-off-by: Andi Kleen <ak@suse.de>

---
 mm/internal.h   |   67 ++++++++++++++++++++++++++++++++++++++++++++++++++++++++
 mm/page_alloc.c |   63 +++-------------------------------------------------
 2 files changed, 71 insertions(+), 59 deletions(-)

Index: linux/mm/page_alloc.c
===================================================================
--- linux.orig/mm/page_alloc.c
+++ linux/mm/page_alloc.c
@@ -44,7 +44,6 @@
 #include <linux/backing-dev.h>
 #include <linux/fault-inject.h>
 #include <linux/page-isolation.h>
-#include <linux/memcontrol.h>
 
 #include <asm/tlbflush.h>
 #include <asm/div64.h>
@@ -220,7 +219,7 @@ static inline int bad_range(struct zone 
 }
 #endif
 
-static void bad_page(struct page *page)
+void bad_page(struct page *page)
 {
 	void *pc = page_get_page_cgroup(page);
 
@@ -456,33 +455,6 @@ static inline void __free_one_page(struc
 	zone->free_area[order].nr_free++;
 }
 
-static inline int free_pages_check(struct page *page)
-{
-	if (unlikely(page_mapcount(page) |
-		(page->mapping != NULL)  |
-		(page_get_page_cgroup(page) != NULL) |
-		(page_count(page) != 0)  |
-		(page->flags & (
-			1 << PG_lru	|
-			1 << PG_private |
-			1 << PG_locked	|
-			1 << PG_active	|
-			1 << PG_slab	|
-			1 << PG_swapcache |
-			1 << PG_writeback |
-			1 << PG_reserved |
-			1 << PG_buddy ))))
-		bad_page(page);
-	if (PageDirty(page))
-		__ClearPageDirty(page);
-	/*
-	 * For now, we report if PG_reserved was found set, but do not
-	 * clear it, and do not free the page.  But we shall soon need
-	 * to do more, for when the ZERO_PAGE count wraps negative.
-	 */
-	return PageReserved(page);
-}
-
 /*
  * Frees a list of pages. 
  * Assumes all pages on list are in same zone, and of same order.
@@ -528,7 +500,7 @@ static void __free_pages_ok(struct page 
 	int reserved = 0;
 
 	for (i = 0 ; i < (1 << order) ; ++i)
-		reserved += free_pages_check(page + i);
+		reserved += free_pages_check(page + i, 0);
 	if (reserved)
 		return;
 
@@ -608,36 +580,9 @@ static inline void expand(struct zone *z
  */
 static int prep_new_page(struct page *page, int order, gfp_t gfp_flags)
 {
-	if (unlikely(page_mapcount(page) |
-		(page->mapping != NULL)  |
-		(page_get_page_cgroup(page) != NULL) |
-		(page_count(page) != 0)  |
-		(page->flags & (
-			1 << PG_lru	|
-			1 << PG_private	|
-			1 << PG_locked	|
-			1 << PG_active	|
-			1 << PG_dirty	|
-			1 << PG_slab    |
-			1 << PG_swapcache |
-			1 << PG_writeback |
-			1 << PG_reserved |
-			1 << PG_buddy ))))
-		bad_page(page);
-
-	/*
-	 * For now, we report if PG_reserved was found set, but do not
-	 * clear it, and do not allocate the page: as a safety net.
-	 */
-	if (PageReserved(page))
+	if (page_prep_struct(page))
 		return 1;
 
-	page->flags &= ~(1 << PG_uptodate | 1 << PG_error | 1 << PG_readahead |
-			1 << PG_referenced | 1 << PG_arch_1 |
-			1 << PG_owner_priv_1 | 1 << PG_mappedtodisk);
-	set_page_private(page, 0);
-	set_page_refcounted(page);
-
 	arch_alloc_page(page, order);
 	kernel_map_pages(page, 1 << order, 1);
 
@@ -992,7 +937,7 @@ static void free_hot_cold_page(struct pa
 
 	if (PageAnon(page))
 		page->mapping = NULL;
-	if (free_pages_check(page))
+	if (free_pages_check(page, 0))
 		return;
 
 	if (!PageHighMem(page))
Index: linux/mm/internal.h
===================================================================
--- linux.orig/mm/internal.h
+++ linux/mm/internal.h
@@ -12,6 +12,7 @@
 #define __MM_INTERNAL_H
 
 #include <linux/mm.h>
+#include <linux/memcontrol.h>
 
 static inline void set_page_count(struct page *page, int v)
 {
@@ -48,6 +49,72 @@ static inline unsigned long page_order(s
 	return page_private(page);
 }
 
+extern void bad_page(struct page *page);
+
+static inline int free_pages_check(struct page *page, unsigned long addflags)
+{
+	if (unlikely(page_mapcount(page) |
+		(page->mapping != NULL)  |
+		(page_get_page_cgroup(page) != NULL) |
+		(page_count(page) != 0)  |
+		(page->flags & (
+			addflags |
+			1 << PG_lru	|
+			1 << PG_private |
+			1 << PG_locked	|
+			1 << PG_active	|
+			1 << PG_slab	|
+			1 << PG_swapcache |
+			1 << PG_writeback |
+			1 << PG_reserved |
+			1 << PG_buddy))))
+		bad_page(page);
+	if (PageDirty(page))
+		__ClearPageDirty(page);
+	/*
+	 * For now, we report if PG_reserved was found set, but do not
+	 * clear it, and do not free the page.  But we shall soon need
+	 * to do more, for when the ZERO_PAGE count wraps negative.
+	 */
+	return PageReserved(page);
+}
+
+/* Set up a struct page for business during allocation */
+static inline int page_prep_struct(struct page *page)
+{
+	if (unlikely(page_mapcount(page) |
+		(page->mapping != NULL)  |
+		(page_get_page_cgroup(page) != NULL) |
+		(page_count(page) != 0)  |
+		(page->flags & (
+			1 << PG_lru	|
+			1 << PG_private	|
+			1 << PG_locked	|
+			1 << PG_active	|
+			1 << PG_dirty	|
+			1 << PG_slab    |
+			1 << PG_swapcache |
+			1 << PG_writeback |
+			1 << PG_reserved |
+			1 << PG_buddy))))
+		bad_page(page);
+
+	/*
+	 * For now, we report if PG_reserved was found set, but do not
+	 * clear it, and do not allocate the page: as a safety net.
+	 */
+	if (PageReserved(page))
+		return 1;
+
+	page->flags &= ~(1 << PG_uptodate | 1 << PG_error | 1 << PG_readahead |
+			1 << PG_referenced | 1 << PG_arch_1 |
+			1 << PG_owner_priv_1 | 1 << PG_mappedtodisk);
+	set_page_private(page, 0);
+	set_page_refcounted(page);
+
+	return 0;
+}
+
 /*
  * FLATMEM and DISCONTIGMEM configurations use alloc_bootmem_node,
  * so all functions starting at paging_init should be marked __init


* [PATCH] [5/13] Add mask allocator statistics to vmstat.[ch]
  2008-03-07  9:07 [PATCH] [0/13] General DMA zone rework Andi Kleen
                   ` (2 preceding siblings ...)
  2008-03-07  9:07 ` [PATCH] [4/13] Prepare page_alloc for the maskable allocator Andi Kleen
@ 2008-03-07  9:07 ` Andi Kleen
  2008-03-08  2:24   ` Christoph Lameter
  2008-03-07  9:07 ` [PATCH] [6/13] Core maskable allocator Andi Kleen
                   ` (10 subsequent siblings)
  14 siblings, 1 reply; 56+ messages in thread
From: Andi Kleen @ 2008-03-07  9:07 UTC (permalink / raw)
  To: linux-kernel, linux-mm


They are a bit on the extensive side for now, but I figured more data
is better at this stage.

The counters themselves are incremented in the files added by later patches.

Signed-off-by: Andi Kleen <ak@suse.de>

---
 include/linux/vmstat.h |    4 ++++
 mm/vmstat.c            |   10 ++++++++++
 2 files changed, 14 insertions(+)

Index: linux/include/linux/vmstat.h
===================================================================
--- linux.orig/include/linux/vmstat.h
+++ linux/include/linux/vmstat.h
@@ -37,6 +37,10 @@ enum vm_event_item { PGPGIN, PGPGOUT, PS
 		FOR_ALL_ZONES(PGSCAN_DIRECT),
 		PGINODESTEAL, SLABS_SCANNED, KSWAPD_STEAL, KSWAPD_INODESTEAL,
 		PAGEOUTRUN, ALLOCSTALL, PGROTATED,
+#ifdef CONFIG_MASK_ALLOC
+		MASK_ALLOC, MASK_FREE, MASK_BITMAP_SKIP, MASK_WAIT,
+		MASK_HIGHER, MASK_LOW_WASTE, MASK_HIGH_WASTE,
+#endif
 		NR_VM_EVENT_ITEMS
 };
 
Index: linux/mm/vmstat.c
===================================================================
--- linux.orig/mm/vmstat.c
+++ linux/mm/vmstat.c
@@ -644,6 +644,16 @@ static const char * const vmstat_text[] 
 	"allocstall",
 
 	"pgrotated",
+
+#ifdef CONFIG_MASK_ALLOC
+	"mask_alloc",
+	"mask_free",
+	"mask_bitmap_skip",
+	"mask_wait",
+	"mask_higher",
+	"mask_low_waste",
+	"mask_high_waste",
+#endif
 #endif
 };
 


* [PATCH] [6/13] Core maskable allocator
  2008-03-07  9:07 [PATCH] [0/13] General DMA zone rework Andi Kleen
                   ` (3 preceding siblings ...)
  2008-03-07  9:07 ` [PATCH] [5/13] Add mask allocator statistics to vmstat.[ch] Andi Kleen
@ 2008-03-07  9:07 ` Andi Kleen
  2008-03-07 10:53   ` Johannes Weiner
                     ` (4 more replies)
  2008-03-07  9:07 ` [PATCH] [7/13] Implement compat hooks for GFP_DMA Andi Kleen
                   ` (9 subsequent siblings)
  14 siblings, 5 replies; 56+ messages in thread
From: Andi Kleen @ 2008-03-07  9:07 UTC (permalink / raw)
  To: linux-kernel, linux-mm


This is the core code of the maskable allocator. Introduction
appended.

Background:

The 16MB Linux ZONE_DMA has some long-standing problems on
x86. Traditionally it was designed only for ISA DMA, which is limited to
24-bit addressing (16MB). This means it has a fixed 16MB size.

On 32-bit i386, with its limited virtual memory space, the next zone up is
lowmem with ~900MB (on the default split), which works for a lot of devices,
but not all.  But on x86-64 the next zone up is DMA32, which covers everything
below 4GB and is therefore too big for quite a lot of devices (typically ones
with 28-, 30- or 31-bit limitations).

While the DMA zone is a true VM zone and could in theory be used for
any user allocation, in practice the VM has a concept called lower zone
protection to avoid low-memory deadlocks, which keeps ZONE_DMA nearly
always free unless the caller specifies GFP_DMA directly. This means
in practice it does not participate in the rest of the automatic VM
balancing and its memory is essentially reserved.

Then there is another pool used on x86-64: the swiotlb pool. It is 64MB
by default and is used for bounce buffering in the PCI DMA API in the
low-level drivers for any devices that have 32-bit limitations (very common).

Swiotlb and the DMA zone already interact. For consistent mappings swiotlb
will already allocate from both the DMA zone and from the swiotlb pool as
needed. On the other hand swiotlb is a truly separate pool not directly
visible to the normal zone balancing (although it happens to be in the
DMA32 zone). In practice ZONE_DMA behaves very similarly in that respect.

Driver interfaces:

When drivers need DMA-able memory they typically use the pci_*/dma_*
interfaces, which allow specifying device masks.   There are two interfaces
here:

dma_alloc_coherent/pci_alloc_consistent to get a block of coherent
memory honouring a device DMA mask and mapped into an IOMMU as needed.
And pci_map_*/dma_map_* to remap an arbitrary block to the DMA mask of
the device and into the IOMMU.

Both ways have their own disadvantages: coherent mappings can have some
bad performance penalties and high setup costs on platforms which
are not fully IO coherent, so they are not encouraged for high-volume
driver data.  And pci/dma_map_* will always bounce buffer in the common
x86 swiotlb case, so it might be quite expensive. Also, a lot of IOMMU
implementations (in particular x86 swiotlb/pci-gart) do not support
remapping in pci/dma_map_* to any DMA mask smaller than 32 bits, so it
cannot actually be used for ISA or any other device with a <32-bit DMA mask.

Then there is the old-style way of directly telling the allocators
what memory you need by using GFP_DMA or GFP_DMA32 and then later using the
bounce-less pci/dma_map_*. That also has its own set of problems: first,
GFP_DMA varies between architectures. On x86 it is always 16MB, on IA64
4GB, and on some other architectures it doesn't exist at all or has other
sizes. GFP_DMA32 often doesn't exist at all (although it can often be
replaced with GFP_KERNEL on 32-bit platforms).  This means any caller
needs knowledge about its platform, which is often not portable.

Then the other problem is that these are only single flag bits selecting
small fixed zones. So for example if a user has a 30-bit DMA limit on 64-bit
they have no other choice than to use GFP_DMA, and when they need more than
16MB of memory they lose.  On the other hand, on a lot of other boxes which
don't have any devices with <4GB DMA masks, ZONE_DMA is just wasted memory.

Then GFP_DMA is also not a very clean interface. It is usually
not documented what device mask is really needed. Also some driver
authors misunderstand it as meaning "required for any DMA", which is
not correct. And often the device actually supports DMA masks larger than
24 bits (16MB), so the fixed 24-bit zone on x86 is needlessly limiting.

The pci_alloc_consistent implementation on x86 also has more problems:
usually it cannot use an IOMMU to remap the DMA memory, so it actually
has to allocate memory with physical addresses that satisfy the DMA mask
of the passed device. All it can do for this is map the request to GFP_DMA
(only 16MB), GFP_DMA32 (big, but sometimes too big) or get the memory
from the swiotlb pool. It also attempts to get fitting memory from
the main allocator. That mostly works, but has bad corner cases and is
quite inelegant.

In practice this leads to various unfortunate situations: either the 64-bit
system has up to ~100MB of reserved memory wasted (ZONE_DMA + swiotlb)
while not having any devices that require bouncing to 16MB or 4GB.

Or the system has devices that need bouncing to <4GB, but the pools in
their default sizes are too small and can overflow. There are various
hac^wworkarounds in drivers for this problem, but it still causes
problems for users.  ZONE_DMA also cannot be enlarged, because a
lot of drivers "know" that it is only 16MB and expect memory below 16MB
from it.

On 32bit x86 the problem is a little less severe because of the 900MB
ZONE_NORMAL which fits most devices, but there are still some problems
with more obscure devices with sufficiently small DMA masks. And ISA
devices still fit in badly.

Requirements for a solution:

There is clearly a need for a new, better low-memory bounce pool on x86.
It must be larger than 16MB and actually be variable sized. Also the
driver interfaces are inadequate: all DMA memory allocations should
specify what mask they actually need. That allows the pool to be extended
and a single pool to be used for multiple masks.

The new pool must be isolated from the rest of the VM. Otherwise it
cannot be safely used in device driver paths which cannot necessarily
allocate memory safely (e.g. the block writeout path is not allowed to
do this, to avoid deadlocks while swapping). The current effective pools
(ZONE_DMA, swiotlb) are already isolated in practice, so this won't make
much difference.

Proposed solution:

I chose to implement a new "maskable memory" allocator to solve these
problems. The existing page buddy allocator is not really suited for
this because the data structures don't allow cheap allocation by physical 
address boundary. 

The allocator has a separate pool of memory that it grabs at boot
using the bootmem allocator. The memory is part of the lowest zone,
but practically invisible to the normal page allocator or VM.

The allocator is very simple: it works with pages and uses a bitmap to
find memory. It also uses a simple rotating cursor through the bitmap.
It is very similar to the allocators used by the various IOMMU
implementations.  While this is not a true O(1) allocator, in practice
it tends to find free pages very quickly and it is quite flexible.
Also it makes it very simple to allocate below arbitrary address boundaries.
It has one advantage over buddy in that it doesn't require all
blocks to be power-of-two sized; it only rounds up to whole pages. So
especially larger blocks tend to have less overhead.

The allocator knows how to fall back to the other zones if
the mask is sufficiently large, so it can be used for
arbitrary masks.

I chose to implement only a page mask allocator, not "kmalloc_mask",
because the various drivers I looked at actually tended to allocate
quite large objects, approaching a page in size. Also, if a sub-page
allocator is really needed there are several existing ones that could be
relatively easily adapted (mm/dmapool.c or the lib/bitmap.c allocator)
on top of a page allocator.

The maskable allocator's pool is variable sized and the user can set
it to any size needed (up to 2GB currently). The default sizing
heuristics are for now the same as in the old code: by default
all free memory below 16MB is put into the pool (in practice only
~8MB or so of that is usable because the kernel is loaded there too),
and if swiotlb is needed another 64MB of low memory is reserved too.
The user can override this using the command line.
Any other subsystem can also increase the memory reservation
(but this currently has to happen early, while bootmem is still active).

In the future I hope to make this more flexible. In particular
the low memory would not need to be fully reserved, but only put into
the "movable" zone; then later, as devices are discovered
and e.g. block devices are set up, this pre-reservation could
be turned into an actual reservation. This would then allow using
a lot of the ~100MB that currently goes to waste on x86-64.  But
so far that is not implemented.

Quirks:

The maskable allocator is able to sleep (waiting for other maskable
allocations to be freed). It currently does this using a simple timeout
before it fails. How well this really works under low-memory conditions
needs more evaluation.

I added quite a lot of statistics counters for now to better evaluate
it. Some of them might be later removed.

There is currently no higher-priority pool for GFP_ATOMIC.
In general memory pressure on the mask allocator should be lower
because normal user space doesn't directly allocate from it.
That is why I didn't bother implementing it right now.
If its absence is a problem it could be added later.
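
As a usage note (a sketch, not part of the patch): a sleeping caller blocks
for up to mask_timeout before the allocator gives up, while atomic callers
fail immediately; the sizes and the 30-bit mask below are example values:

	/* May sleep: waits up to mask_timeout for maskable pages to be freed. */
	struct page *p = alloc_pages_mask(GFP_KERNEL, 16384, 0x3fffffffULL);

	/* Atomic context: does not wait, returns NULL right away on failure. */
	struct page *q = alloc_pages_mask(GFP_ATOMIC | __GFP_NOWARN, 4096,
					  0x3fffffffULL);

	if (p)
		__free_pages_mask(p, 16384);
	if (q)
		__free_pages_mask(q, 4096);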

Signed-off-by: Andi Kleen <ak@suse.de>

---
 Documentation/DocBook/kernel-api.tmpl |    1 
 Documentation/kernel-parameters.txt   |    3 
 include/linux/gfp.h                   |   52 +++
 include/linux/page-flags.h            |    5 
 mm/Makefile                           |    1 
 mm/mask-alloc.c                       |  504 ++++++++++++++++++++++++++++++++++
 mm/page_alloc.c                       |    4 
 7 files changed, 565 insertions(+), 5 deletions(-)

Index: linux/mm/Makefile
===================================================================
--- linux.orig/mm/Makefile
+++ linux/mm/Makefile
@@ -33,4 +33,5 @@ obj-$(CONFIG_MIGRATION) += migrate.o
 obj-$(CONFIG_SMP) += allocpercpu.o
 obj-$(CONFIG_QUICKLIST) += quicklist.o
 obj-$(CONFIG_CGROUP_MEM_RES_CTLR) += memcontrol.o
+obj-$(CONFIG_MASK_ALLOC) += mask-alloc.o
 
Index: linux/mm/mask-alloc.c
===================================================================
--- /dev/null
+++ linux/mm/mask-alloc.c
@@ -0,0 +1,504 @@
+/*
+ * Generic management of a low memory zone to allocate memory with an address mask.
+ *
+ * The maskable pool is reserved inside another zone, but managed by a
+ * specialized bitmap allocator. The allocator is not O(1) (searches
+ * the bitmap with a last use hint) but should be fast enough for
+ * normal purposes.  The advantage of the allocator is that it can
+ * allocate based on a mask.
+ *
+ * The allocator could be improved, but it's better to keep
+ * things simple for now and there are relatively few users
+ * which are usually not that speed critical. Also for simple
+ * repetitive allocation patterns it should usually be approximately
+ * O(1) anyway due to the rotating cursor in the bitmap.
+ *
+ * This allocator should be only used by architectures with reasonably
+ * continuous physical memory at least for the low normal zone.
+ *
+ * Note book:
+ * Right now there are no high priority reservations (__GFP_HIGH). If
+ * they are needed it would be possible to reserve some very low memory
+ * for those.
+ *
+ * Copyright 2007, 2008 Andi Kleen, SUSE Labs.
+ * Subject to the GNU Public License v.2 only.
+ */
+
+#include <linux/mm.h>
+#include <linux/gfp.h>
+#include <linux/kernel.h>
+#include <linux/sched.h>
+#include <linux/bitops.h>
+#include <linux/string.h>
+#include <linux/wait.h>
+#include <linux/bootmem.h>
+#include <linux/module.h>
+#include <linux/fault-inject.h>
+#include <linux/ctype.h>
+#include <linux/kallsyms.h>
+#include "internal.h"
+
+#define BITS_PER_PAGE (PAGE_SIZE * 8)
+
+#define MASK_ZONE_LIMIT (2U<<30) /* 2GB max for now */
+
+#define Mprintk(x...)
+#define Mprint_symbol(x...)
+
+static int force_mask __read_mostly;
+static DECLARE_WAIT_QUEUE_HEAD(mask_zone_wait);
+unsigned long mask_timeout __read_mostly = 5*HZ;
+
+/*
+ * The mask_bitmap maintains all the pages in the mask pool.
+ * It is reversed (lowest pfn has the highest index)
+ * to make reverse search easier.
+ * All accesses are protected by the mask_bitmap_lock
+ */
+static DEFINE_SPINLOCK(mask_bitmap_lock);
+static unsigned long *mask_bitmap;
+static unsigned long mask_max_pfn;
+
+static inline unsigned pfn_to_maskbm_index(unsigned long pfn)
+{
+	return mask_max_pfn - pfn;
+}
+
+static inline unsigned maskbm_index_to_pfn(unsigned index)
+{
+	return mask_max_pfn - index;
+}
+
+static unsigned wait_for_mask_free(unsigned left)
+{
+	DEFINE_WAIT(wait);
+	prepare_to_wait(&mask_zone_wait, &wait, TASK_UNINTERRUPTIBLE);
+	left = schedule_timeout(left);
+	finish_wait(&mask_zone_wait, &wait);
+	return left;
+}
+
+/* First try normal zones if possible. */
+static struct page *
+alloc_higher_pages(gfp_t gfp_mask, unsigned order, unsigned long pfn)
+{
+	struct page *p = NULL;
+	if (pfn > mask_max_pfn) {
+#ifdef CONFIG_ZONE_DMA32
+		if (pfn <= (0xffffffff >> PAGE_SHIFT)) {
+			p = alloc_pages(gfp_mask|GFP_DMA32|__GFP_NOWARN,
+						order);
+			if (p && page_to_pfn(p) >= pfn) {
+				__free_pages(p, order);
+				p = NULL;
+			}
+		}
+#endif
+		if (!p)	/* don't clobber a fitting DMA32 page */
+			p = alloc_pages(gfp_mask|__GFP_NOWARN, order);
+		if (p && page_to_pfn(p) >= pfn) {
+			__free_pages(p, order);
+			p = NULL;
+		}
+	}
+	return p;
+}
+
+static unsigned long alloc_mask(int pages, unsigned long max)
+{
+	static unsigned long next_bit;
+	unsigned long offset, flags, start, pfn;
+	int k;
+
+	if (max >= mask_max_pfn)
+		max = mask_max_pfn;
+
+	start = mask_max_pfn - max;
+
+	spin_lock_irqsave(&mask_bitmap_lock, flags);
+	offset = -1L;
+
+	if (next_bit >= start && next_bit + pages < (mask_max_pfn - (max>>1))) {
+		offset = find_next_zero_string(mask_bitmap, next_bit,
+					       mask_max_pfn, pages);
+		if (offset != -1L)
+			count_vm_events(MASK_BITMAP_SKIP, offset - next_bit);
+	}
+	if (offset == -1L) {
+		offset = find_next_zero_string(mask_bitmap, start,
+					mask_max_pfn, pages);
+		if (offset != -1L)
+			count_vm_events(MASK_BITMAP_SKIP, offset - start);
+	}
+	if (offset != -1L) {
+		for (k = 0; k < pages; k++) {
+			BUG_ON(test_bit(offset + k, mask_bitmap));
+			set_bit(offset + k, mask_bitmap);
+		}
+		next_bit = offset + pages;
+		if (next_bit >= mask_max_pfn)
+			next_bit = start;
+	}
+	spin_unlock_irqrestore(&mask_bitmap_lock, flags);
+	if (offset == -1L)
+		return -1L;
+
+	offset += pages - 1;
+	pfn = maskbm_index_to_pfn(offset);
+
+	BUG_ON(maskbm_index_to_pfn(offset) != pfn);
+	return pfn;
+}
+
+/**
+ * alloc_pages_mask - Alloc page(s) in a specific address range.
+ * @gfp:      Standard GFP mask. See get_free_pages for a list of valid flags.
+ * @size:     Allocate size worth of pages. Rounded up to PAGE_SIZE.
+ * @mask:     Memory must fit into mask physical address.
+ *
+ * Returns a struct page *
+ *
+ * Manage the dedicated maskable low memory zone. This zone is isolated
+ * from the normal zones and is a single contiguous zone.
+ * The main difference from the standard allocator is that it tries
+ * to allocate memory with a physical address fitting in the passed mask.
+ *
+ * Warning: the size is in bytes, not in order like get_free_pages.
+ */
+struct page *
+alloc_pages_mask(gfp_t gfp, unsigned size, u64 mask)
+{
+	unsigned long max_pfn = mask >> PAGE_SHIFT;
+	unsigned pages = (size + PAGE_SIZE - 1) >> PAGE_SHIFT;
+	struct page *p;
+	unsigned left = (gfp & __GFP_REPEAT) ? ~0 : mask_timeout, oleft;
+	unsigned order = get_order(size);
+
+	BUG_ON(size < MASK_MIN_SIZE);	/* You likely passed order by mistake */
+	BUG_ON(gfp & (__GFP_DMA|__GFP_DMA32|__GFP_COMP));
+
+	/* Add fault injection here */
+
+again:
+	count_vm_event(MASK_ALLOC);
+	if (!force_mask) {
+		/* First try normal allocation in suitable zones
+		 * RED-PEN if size fits very badly in PS<<order don't do this?
+		 */
+		p = alloc_higher_pages(gfp, order, max_pfn);
+
+		/*
+		 * If the mask covers everything, don't bother with the low zone.
+		 * This way we avoid running out of low memory on a higher
+		 * zone's OOM too.
+		 */
+		if (p != NULL || max_pfn >= max_low_pfn) {
+			count_vm_event(MASK_HIGHER);
+			count_vm_events(MASK_HIGH_WASTE,
+					(PAGE_SIZE << order) - size);
+			return p;
+		}
+	}
+
+	might_sleep_if(gfp & __GFP_WAIT);
+	do {
+		int i;
+		long pfn;
+
+		/* Implement waiter fairness queueing here? */
+
+		pfn = alloc_mask(pages, max_pfn);
+		if (pfn != -1L) {
+			p = pfn_to_page(pfn);
+
+			Mprintk("mask page %lx size %d mask %Lx\n",
+			       pfn, size, mask);
+
+			BUG_ON(pfn + pages > mask_max_pfn);
+
+			if (page_prep_struct(p))
+				goto again;
+
+			kernel_map_pages(p, pages, 1);
+
+			for (i = 0; i < pages; i++) {
+				struct page *n = p + i;
+				BUG_ON(!test_bit(pfn_to_maskbm_index(pfn+i),
+						mask_bitmap));
+				BUG_ON(!PageMaskAlloc(n));
+				arch_alloc_page(n, 0);
+				if (gfp & __GFP_ZERO)
+					clear_page(page_address(n));
+			}
+
+			count_vm_events(MASK_LOW_WASTE, pages*PAGE_SIZE-size);
+			return p;
+		}
+
+		if (!(gfp & __GFP_WAIT))
+			break;
+
+		oleft = left;
+		left = wait_for_mask_free(left);
+		count_vm_events(MASK_WAIT, left - oleft);
+	} while (left > 0);
+
+	if (!(gfp & __GFP_NOWARN)) {
+		printk(KERN_ERR
+		"%s: Cannot allocate maskable memory size %u gfp %x mask %Lx\n",
+				current->comm, size, gfp, mask);
+		dump_stack();
+	}
+	return NULL;
+}
+EXPORT_SYMBOL(alloc_pages_mask);
+
+/**
+ * get_pages_mask - Allocate page(s) in specified address range.
+ * @gfp:      GFP mask, see get_free_pages for a list.
+ * @size:     Bytes to allocate (will be rounded up to pages)
+ * @mask:     Page must be located in mask physical address.
+ * Returns the virtual address of the page.
+ *
+ * Manage the dedicated maskable low memory zone. This zone is isolated
+ * from the normal zones and is a single contiguous zone.
+ * The main difference from the standard allocator is that it tries
+ * to allocate memory with a physical address fitting in the passed mask.
+ *
+ * Warning: the size is in bytes, not in order like get_free_pages.
+ */
+void *get_pages_mask(gfp_t gfp, unsigned size, u64 mask)
+{
+	struct page *p = alloc_pages_mask(gfp, size, mask);
+	if (!p)
+		return NULL;
+	return page_address(p);
+}
+EXPORT_SYMBOL(get_pages_mask);
+
+/**
+ * __free_pages_mask - Free pages allocated with get_pages_mask
+ * @page:  First struct page * to free
+ * @size:  Size in bytes.
+ * All pages allocated with alloc/get_pages_mask must be freed
+ * by the respective free.*mask functions.
+ */
+void __free_pages_mask(struct page *page, unsigned size)
+{
+	unsigned long pfn;
+	int i;
+	unsigned pages;
+
+	BUG_ON(size < MASK_MIN_SIZE); /* You likely passed order by mistake */
+	if (!PageMaskAlloc(page)) {
+		__free_pages(page, get_order(size));
+		return;
+	}
+
+	if (!put_page_testzero(page))
+		return;
+
+	count_vm_event(MASK_FREE);
+
+	pfn = page_to_pfn(page);
+	pages = (size + PAGE_SIZE - 1) >> PAGE_SHIFT;
+	kernel_map_pages(page, pages, 0);
+	for (i = 0; i < pages; i++) {
+		struct page *p = page + i;
+		if (free_pages_check(p, 0))
+			bad_page(p);
+		arch_free_page(p, 0);
+		BUG_ON(!PageMaskAlloc(p));
+		if (!test_and_clear_bit(pfn_to_maskbm_index(pfn + i),
+						mask_bitmap))
+			BUG();
+	}
+
+	Mprintk("mask free %lx size %u from ", pfn, size);
+	Mprint_symbol("%s\n", __builtin_return_address(0));
+	wake_up(&mask_zone_wait);
+}
+EXPORT_SYMBOL(__free_pages_mask);
+
+void free_pages_mask(void *mem, unsigned size)
+{
+	__free_pages_mask(virt_to_page(mem), size);
+}
+EXPORT_SYMBOL(free_pages_mask);
+
+static long mask_zone_size __initdata = -1L;
+
+static int __init setup_maskzone(char *s)
+{
+	do {
+		if (isdigit(*s)) {
+			mask_zone_size = memparse(s, &s);
+		} else if (!strncmp(s, "force", 5)) {
+			force_mask = 1;
+			s += 5;
+		} else
+			return -EINVAL;
+		if (*s == ',')
+			++s;
+	} while (*s);
+	return 0;
+}
+early_param("maskzone", setup_maskzone);
+
+/* Two level bitmap to keep track where we got memory from bootmem */
+#define NUM_BM_BITMAPS ((MASK_ZONE_LIMIT >> PAGE_SHIFT) / BITS_PER_PAGE)
+static unsigned long *alloc_bm[NUM_BM_BITMAPS] __initdata;
+
+static __init void mark_page(void *adr)
+{
+	unsigned long pfn = virt_to_phys(adr) >> PAGE_SHIFT;
+	unsigned i = pfn / BITS_PER_PAGE;
+	if (!alloc_bm[i])
+		alloc_bm[i] = alloc_bootmem_low(PAGE_SIZE);
+	__set_bit(pfn % BITS_PER_PAGE, alloc_bm[i]);
+}
+
+static __init int page_is_marked(unsigned long pfn)
+{
+	unsigned i = pfn / BITS_PER_PAGE;
+	return test_bit(pfn % BITS_PER_PAGE, alloc_bm[i]);
+}
+
+static __init void *boot_mask_page(void *adr)
+{
+	adr = __alloc_bootmem_nopanic(PAGE_SIZE, PAGE_SIZE,
+				      (unsigned long)adr);
+	if (!adr) {
+		printk(KERN_ERR "FAILED to allocate page for maskable zone\n");
+		return NULL;
+	}
+
+	mark_page(adr);
+	return adr;
+}
+
+static void __init init_mask_bitmap(void *adr)
+{
+	long mask_bitmap_bytes;
+
+	if (mask_bitmap) {
+		unsigned old_size = (mask_max_pfn + BITS_PER_LONG) / 8;
+		free_bootmem(virt_to_phys(mask_bitmap), old_size);
+	}
+
+	if (adr)
+		mask_max_pfn = virt_to_phys(adr) >> PAGE_SHIFT;
+
+	printk(KERN_INFO "Setting maskable low memory zone to %lu MB\n",
+			virt_to_phys(adr) >> 20);
+
+	mask_bitmap_bytes = (mask_max_pfn + BITS_PER_LONG) / 8;
+	mask_bitmap = alloc_bootmem(mask_bitmap_bytes);
+	memset(mask_bitmap, 0xff, mask_bitmap_bytes);
+}
+
+static void __init __increase_mask_zone(unsigned long size)
+{
+	void *adr = NULL;
+	long got = 0;
+
+	while (got < size) {
+		adr = boot_mask_page(adr);
+		if (!adr) {
+			printk(KERN_ERR
+		"increase_mask_zone failed at %lx\n", got);
+			break;
+		}
+		got += PAGE_SIZE;
+		if (virt_to_phys(adr) >= MASK_ZONE_LIMIT-PAGE_SIZE)
+			break;
+	}
+	init_mask_bitmap(adr);
+}
+
+void __init increase_mask_zone(unsigned long size)
+{
+	/*
+	 * When the user set an explicit zone size ignore any implicit
+	 * increases.
+	 */
+	if (mask_zone_size > 0 || size == 0)
+		return;
+	__increase_mask_zone(size);
+}
+
+/* Get memory for the low mask zone from bootmem */
+void __init init_mask_zone(unsigned long boundary)
+{
+	void *adr = NULL;
+	long got = 0;
+
+	/*
+	 * Grab memory up to the boundary first unless the user set an
+	 * explicit size. This emulates the traditional DMA zone; the actual
+	 * size varies depending on whatever is already there.
+	 */
+	if (mask_zone_size == -1L) {
+		for (adr = 0;; got += PAGE_SIZE) {
+			adr = boot_mask_page(adr);
+			if (!adr)
+				break;
+			if (virt_to_phys(adr) >= boundary - PAGE_SIZE)
+				break;
+		}
+		init_mask_bitmap(adr);
+	}
+
+	if (mask_zone_size > MASK_ZONE_LIMIT || mask_zone_size < -1L)
+		mask_zone_size = MASK_ZONE_LIMIT;
+
+	if (mask_zone_size > 0)
+		__increase_mask_zone(mask_zone_size);
+}
+
+static __init void prep_free_pg(unsigned pfn)
+{
+	struct page *p = pfn_to_page(pfn);
+	p->flags = 0;
+	__SetPageMaskAlloc(p);
+	__clear_bit(pfn_to_maskbm_index(pfn), mask_bitmap);
+	reset_page_mapcount(p);
+	set_page_count(p, 0);
+}
+
+void __init prepare_mask_zone(void)
+{
+	int i;
+	long marked = 0;
+
+	/* Free the pages previously gotten from bootmem */
+	for (i = 0; i < mask_max_pfn; i++) {
+		if (page_is_marked(i)) {
+			prep_free_pg(i);
+			marked++;
+		}
+	}
+
+	/* Now free the bitmap to the maskable zone too */
+	for (i = 0; i < ARRAY_SIZE(alloc_bm); i++) {
+		if (alloc_bm[i]) {
+			prep_free_pg(page_to_pfn(virt_to_page(alloc_bm[i])));
+			marked++;
+		}
+	}
+
+#if 1
+	for (i = 0; i <= mask_max_pfn; i++) {
+		struct page *p = pfn_to_page(i);
+		if (!PageMaskAlloc(p))
+			continue;
+		if (page_count(p))
+			panic("pfn %x count %d\n", i, page_count(p));
+		if (test_bit(pfn_to_maskbm_index(i), mask_bitmap))
+			panic("pfn %x already allocated\n", i);
+	}
+#endif
+
+	printk(KERN_INFO "Maskable zone upto %luMB with %luMB free\n",
+	       mask_max_pfn >> (20 - PAGE_SHIFT), marked >> (20 - PAGE_SHIFT));
+}
Index: linux/include/linux/gfp.h
===================================================================
--- linux.orig/include/linux/gfp.h
+++ linux/include/linux/gfp.h
@@ -214,14 +214,56 @@ extern unsigned long get_zeroed_page(gfp
 #define __get_free_page(gfp_mask) \
 		__get_free_pages((gfp_mask),0)
 
-#define __get_dma_pages(gfp_mask, order) \
-		__get_free_pages((gfp_mask) | GFP_DMA,(order))
-
 extern void __free_pages(struct page *page, unsigned int order);
 extern void free_pages(unsigned long addr, unsigned int order);
 extern void free_hot_page(struct page *page);
 extern void free_cold_page(struct page *page);
 
+#include <asm/dma.h>  /* For TRAD_DMA_MASK/MAX_DMA_ADDRESS */
+
+#ifdef CONFIG_MASK_ALLOC
+extern struct page *alloc_pages_mask(gfp_t gfp_mask, unsigned size,
+					u64 mask);
+extern void *get_pages_mask(gfp_t gfp_mask, unsigned size,
+					u64 mask);
+extern void __free_pages_mask(struct page *page, unsigned size);
+extern void free_pages_mask(void *addr, unsigned size);
+
+/* Legacy interface. To be removed */
+static inline unsigned long __get_dma_pages(gfp_t gfp, unsigned order)
+{
+	unsigned size = PAGE_SIZE << order;
+	return (unsigned long)get_pages_mask(gfp, size, TRAD_DMA_MASK);
+}
+
+#define MASK_MIN_SIZE 16
+
+#else
+
+/*
+ * Architectures without mask alloc support continue using their dma zone
+ * (if they have one).
+ */
+#define gfp_mask(m) ((__pa(MAX_DMA_ADDRESS - 1) & (m)) ? 0 : __GFP_DMA)
+
+#define alloc_pages_mask(gfp, size, mask) \
+	alloc_pages((gfp) | gfp_mask(mask), get_order(size))
+#define get_pages_mask(gfp, size, mask) \
+	__get_free_pages(gfp | gfp_mask(mask), get_order(size))
+#define __free_pages_mask(p, s) __free_pages(p, get_order(s))
+#define free_pages_mask(a, s) free_pages((unsigned long)(a), get_order(s))
+
+#define __get_dma_pages(gfp_mask, order) \
+		__get_free_pages((gfp_mask) | GFP_DMA, order)
+
+#endif
+
+#define get_page_mask(gfp, mask) get_pages_mask(gfp, PAGE_SIZE, mask)
+#define alloc_page_mask(gfp, mask) alloc_pages_mask(gfp, PAGE_SIZE, mask)
+
+#define __free_page_mask(page) __free_pages_mask(page, PAGE_SIZE)
+#define free_page_mask(addr) free_pages_mask(addr, PAGE_SIZE)
+
 #define __free_page(page) __free_pages((page), 0)
 #define free_page(addr) free_pages((addr),0)
 
@@ -230,4 +272,8 @@ void drain_zone_pages(struct zone *zone,
 void drain_all_pages(void);
 void drain_local_pages(void *dummy);
 
+void init_mask_zone(unsigned long trad);
+void increase_mask_zone(unsigned long size);
+void prepare_mask_zone(void);
+
 #endif /* __LINUX_GFP_H */
Index: linux/include/linux/page-flags.h
===================================================================
--- linux.orig/include/linux/page-flags.h
+++ linux/include/linux/page-flags.h
@@ -89,6 +89,7 @@
 #define PG_mappedtodisk		16	/* Has blocks allocated on-disk */
 #define PG_reclaim		17	/* To be reclaimed asap */
 #define PG_buddy		19	/* Page is free, on buddy lists */
+#define PG_mask_alloc		20	/* Page managed by the Mask allocator */
 
 /* PG_readahead is only used for file reads; PG_reclaim is only for writes */
 #define PG_readahead		PG_reclaim /* Reminder to do async read-ahead */
@@ -256,6 +257,10 @@ static inline void SetPageUptodate(struc
 #define __SetPageCompound(page)	__set_bit(PG_compound, &(page)->flags)
 #define __ClearPageCompound(page) __clear_bit(PG_compound, &(page)->flags)
 
+#define PageMaskAlloc(page)	test_bit(PG_mask_alloc, &(page)->flags)
+#define __SetPageMaskAlloc(page)	__set_bit(PG_mask_alloc, &(page)->flags)
+#define __ClearPageMaskAlloc(page) __clear_bit(PG_mask_alloc, &(page)->flags)
+
 /*
  * PG_reclaim is used in combination with PG_compound to mark the
  * head and tail of a compound page
Index: linux/Documentation/kernel-parameters.txt
===================================================================
--- linux.orig/Documentation/kernel-parameters.txt
+++ linux/Documentation/kernel-parameters.txt
@@ -2116,6 +2116,9 @@ and is between 256 and 4096 characters. 
 	norandmaps	Don't use address space randomization
 			Equivalent to echo 0 > /proc/sys/kernel/randomize_va_space
 
+	maskzone=size[MG] Set size of maskable DMA zone to size.
+		 force	Always allocate from the mask zone (for testing)
+
 ______________________________________________________________________
 
 TODO:
Index: linux/mm/page_alloc.c
===================================================================
--- linux.orig/mm/page_alloc.c
+++ linux/mm/page_alloc.c
@@ -500,7 +500,7 @@ static void __free_pages_ok(struct page 
 	int reserved = 0;
 
 	for (i = 0 ; i < (1 << order) ; ++i)
-		reserved += free_pages_check(page + i, 0);
+		reserved += free_pages_check(page + i, 1 << PG_mask_alloc);
 	if (reserved)
 		return;
 
@@ -937,7 +937,7 @@ static void free_hot_cold_page(struct pa
 
 	if (PageAnon(page))
 		page->mapping = NULL;
-	if (free_pages_check(page, 0))
+	if (free_pages_check(page, 1 << PG_mask_alloc))
 		return;
 
 	if (!PageHighMem(page))
Index: linux/Documentation/DocBook/kernel-api.tmpl
===================================================================
--- linux.orig/Documentation/DocBook/kernel-api.tmpl
+++ linux/Documentation/DocBook/kernel-api.tmpl
@@ -164,6 +164,7 @@ X!Ilib/string.c
 !Emm/memory.c
 !Emm/vmalloc.c
 !Imm/page_alloc.c
+!Emm/mask-alloc.c
 !Emm/mempool.c
 !Emm/dmapool.c
 !Emm/page-writeback.c


* [PATCH] [7/13] Implement compat hooks for GFP_DMA
  2008-03-07  9:07 [PATCH] [0/13] General DMA zone rework Andi Kleen
                   ` (4 preceding siblings ...)
  2008-03-07  9:07 ` [PATCH] [6/13] Core maskable allocator Andi Kleen
@ 2008-03-07  9:07 ` Andi Kleen
  2008-03-07  9:07 ` [PATCH] [8/13] Enable the mask allocator for x86 Andi Kleen
                   ` (8 subsequent siblings)
  14 siblings, 0 replies; 56+ messages in thread
From: Andi Kleen @ 2008-03-07  9:07 UTC (permalink / raw)
  To: linux-kernel, linux-mm


Add code to the normal allocation path to call the mask allocator
transparently for GFP_DMA allocations.

This is only a temporary measure and will go away as soon as all
the drivers are converted to use the mask allocator or one
of its callers (like PCI DMA API) directly.

But currently it is the only sane way to do a bisectable step-by-step
conversion of the in-tree GFP_DMA users.

Right now the special pagevec free function won't handle mask-allocated
pages.
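
The effect, sketched with the helpers from this series (the order-2
allocation is an arbitrary example):

	/* Unconverted driver code, unchanged: */
	unsigned long addr = __get_free_pages(GFP_KERNEL | GFP_DMA, 2);

	/*
	 * With these hooks __alloc_pages() internally turns this into roughly
	 *	alloc_pages_mask(GFP_KERNEL, PAGE_SIZE << 2, TRAD_DMA_MASK);
	 * and __free_pages() notices PG_mask_alloc and hands the pages back
	 * to __free_pages_mask() instead of the buddy allocator.
	 */
	free_pages(addr, 2);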

Signed-off-by: Andi Kleen <ak@suse.de>

---
 include/linux/page-flags.h |    4 ++++
 mm/page_alloc.c            |   18 ++++++++++++++++++
 2 files changed, 22 insertions(+)

Index: linux/mm/page_alloc.c
===================================================================
--- linux.orig/mm/page_alloc.c
+++ linux/mm/page_alloc.c
@@ -499,6 +499,12 @@ static void __free_pages_ok(struct page 
 	int i;
 	int reserved = 0;
 
+	/* Compat code for now. Will go away */
+	if (unlikely(PageMaskAlloc(page))) {
+		__free_pages_mask(page, PAGE_SIZE << order);
+		return;
+	}
+
 	for (i = 0 ; i < (1 << order) ; ++i)
 		reserved += free_pages_check(page + i, 1 << PG_mask_alloc);
 	if (reserved)
@@ -1422,6 +1428,13 @@ __alloc_pages(gfp_t gfp_mask, unsigned i
 	if (should_fail_alloc_page(gfp_mask, order))
 		return NULL;
 
+#ifdef CONFIG_MASK_ALLOC
+	/* Compat code for now. Will go away. */
+	if (unlikely(gfp_mask & GFP_DMA))
+		return alloc_pages_mask(gfp_mask & ~__GFP_DMA,
+					PAGE_SIZE << order, TRAD_DMA_MASK);
+#endif
+
 restart:
 	z = zonelist->zones;  /* the list of zones suitable for gfp_mask */
 
@@ -1635,6 +1648,11 @@ void __pagevec_free(struct pagevec *pvec
 
 void __free_pages(struct page *page, unsigned int order)
 {
+	/* Compat code for now. Will go away */
+	if (unlikely(PageMaskAlloc(page))) {
+		__free_pages_mask(page, PAGE_SIZE << order);
+		return;
+	}
 	if (put_page_testzero(page)) {
 		if (order == 0)
 			free_hot_page(page);
Index: linux/include/linux/page-flags.h
===================================================================
--- linux.orig/include/linux/page-flags.h
+++ linux/include/linux/page-flags.h
@@ -257,9 +257,13 @@ static inline void SetPageUptodate(struc
 #define __SetPageCompound(page)	__set_bit(PG_compound, &(page)->flags)
 #define __ClearPageCompound(page) __clear_bit(PG_compound, &(page)->flags)
 
+#ifdef CONFIG_MASK_ALLOC
 #define PageMaskAlloc(page)	test_bit(PG_mask_alloc, &(page)->flags)
 #define __SetPageMaskAlloc(page)	__set_bit(PG_mask_alloc, &(page)->flags)
 #define __ClearPageMaskAlloc(page) __clear_bit(PG_mask_alloc, &(page)->flags)
+#else
+#define PageMaskAlloc(page)	0
+#endif
 
 /*
  * PG_reclaim is used in combination with PG_compound to mark the


* [PATCH] [8/13] Enable the mask allocator for x86
  2008-03-07  9:07 [PATCH] [0/13] General DMA zone rework Andi Kleen
                   ` (5 preceding siblings ...)
  2008-03-07  9:07 ` [PATCH] [7/13] Implement compat hooks for GFP_DMA Andi Kleen
@ 2008-03-07  9:07 ` Andi Kleen
  2008-03-07 18:32   ` Sam Ravnborg
  2008-03-08  2:37   ` Christoph Lameter
  2008-03-07  9:07 ` [PATCH] [9/13] Remove set_dma_reserve Andi Kleen
                   ` (7 subsequent siblings)
  14 siblings, 2 replies; 56+ messages in thread
From: Andi Kleen @ 2008-03-07  9:07 UTC (permalink / raw)
  To: linux-kernel, linux-mm


- Disable the old ZONE_DMA. There is no fixed-size ZONE_DMA anymore; all
  existing users of __GFP_DMA rely on the compat call to the maskable
  allocator in alloc/free_pages.
- Call the maskable allocator initialization functions at boot
- Add TRAD_DMA_MASK for the compat functions
- Remove the dma_reserve call

Signed-off-by: Andi Kleen <ak@suse.de>

---
 arch/x86/Kconfig           |    4 ++--
 arch/x86/kernel/setup_32.c |    3 +--
 arch/x86/kernel/setup_64.c |    1 +
 arch/x86/mm/discontig_32.c |    2 --
 arch/x86/mm/init_32.c      |    2 ++
 arch/x86/mm/init_64.c      |    9 ++-------
 arch/x86/mm/numa_64.c      |    1 -
 include/asm-x86/dma.h      |    2 ++
 8 files changed, 10 insertions(+), 14 deletions(-)

Index: linux/arch/x86/kernel/setup_64.c
===================================================================
--- linux.orig/arch/x86/kernel/setup_64.c
+++ linux/arch/x86/kernel/setup_64.c
@@ -425,6 +425,7 @@ void __init setup_arch(char **cmdline_p)
 	}
 #endif
 	reserve_crashkernel();
+	init_mask_zone(__pa(MAX_DMA_ADDRESS));
 	paging_init();
 	map_vsyscall();
 
Index: linux/arch/x86/kernel/setup_32.c
===================================================================
--- linux.orig/arch/x86/kernel/setup_32.c
+++ linux/arch/x86/kernel/setup_32.c
@@ -438,8 +438,6 @@ void __init zone_sizes_init(void)
 {
 	unsigned long max_zone_pfns[MAX_NR_ZONES];
 	memset(max_zone_pfns, 0, sizeof(max_zone_pfns));
-	max_zone_pfns[ZONE_DMA] =
-		virt_to_phys((char *)MAX_DMA_ADDRESS) >> PAGE_SHIFT;
 	max_zone_pfns[ZONE_NORMAL] = max_low_pfn;
 #ifdef CONFIG_HIGHMEM
 	max_zone_pfns[ZONE_HIGHMEM] = highend_pfn;
@@ -811,6 +809,7 @@ void __init setup_arch(char **cmdline_p)
 	 * NOTE: at this point the bootmem allocator is fully available.
 	 */
 
+	init_mask_zone(__pa(MAX_DMA_ADDRESS));
 #ifdef CONFIG_BLK_DEV_INITRD
 	relocate_initrd();
 #endif
Index: linux/arch/x86/mm/init_64.c
===================================================================
--- linux.orig/arch/x86/mm/init_64.c
+++ linux/arch/x86/mm/init_64.c
@@ -50,8 +50,6 @@
 const struct dma_mapping_ops *dma_ops;
 EXPORT_SYMBOL(dma_ops);
 
-static unsigned long dma_reserve __initdata;
-
 DEFINE_PER_CPU(struct mmu_gather, mmu_gathers);
 
 /*
@@ -451,7 +449,6 @@ void __init paging_init(void)
 	unsigned long max_zone_pfns[MAX_NR_ZONES];
 
 	memset(max_zone_pfns, 0, sizeof(max_zone_pfns));
-	max_zone_pfns[ZONE_DMA] = MAX_DMA_PFN;
 	max_zone_pfns[ZONE_DMA32] = MAX_DMA32_PFN;
 	max_zone_pfns[ZONE_NORMAL] = end_pfn;
 
@@ -514,6 +511,8 @@ void __init mem_init(void)
 
 	pci_iommu_alloc();
 
+	prepare_mask_zone();
+
 	/* clear_bss() already clear the empty_zero_page */
 
 	reservedpages = 0;
@@ -671,10 +670,6 @@ void __init reserve_bootmem_generic(unsi
 #else
 	reserve_bootmem(phys, len, BOOTMEM_DEFAULT);
 #endif
-	if (phys+len <= MAX_DMA_PFN*PAGE_SIZE) {
-		dma_reserve += len / PAGE_SIZE;
-		set_dma_reserve(dma_reserve);
-	}
 }
 
 int kern_addr_valid(unsigned long addr)
Index: linux/arch/x86/mm/numa_64.c
===================================================================
--- linux.orig/arch/x86/mm/numa_64.c
+++ linux/arch/x86/mm/numa_64.c
@@ -575,7 +575,6 @@ void __init paging_init(void)
 	unsigned long max_zone_pfns[MAX_NR_ZONES];
 
 	memset(max_zone_pfns, 0, sizeof(max_zone_pfns));
-	max_zone_pfns[ZONE_DMA] = MAX_DMA_PFN;
 	max_zone_pfns[ZONE_DMA32] = MAX_DMA32_PFN;
 	max_zone_pfns[ZONE_NORMAL] = end_pfn;
 
Index: linux/arch/x86/Kconfig
===================================================================
--- linux.orig/arch/x86/Kconfig
+++ linux/arch/x86/Kconfig
@@ -63,8 +63,8 @@ config FAST_CMPXCHG_LOCAL
 config MMU
 	def_bool y
 
-config ZONE_DMA
-	def_bool y
+config MASK_ALLOC
+       def_bool y
 
 config QUICKLIST
 	def_bool X86_32
Index: linux/arch/x86/mm/discontig_32.c
===================================================================
--- linux.orig/arch/x86/mm/discontig_32.c
+++ linux/arch/x86/mm/discontig_32.c
@@ -400,8 +400,6 @@ void __init zone_sizes_init(void)
 	int nid;
 	unsigned long max_zone_pfns[MAX_NR_ZONES];
 	memset(max_zone_pfns, 0, sizeof(max_zone_pfns));
-	max_zone_pfns[ZONE_DMA] =
-		virt_to_phys((char *)MAX_DMA_ADDRESS) >> PAGE_SHIFT;
 	max_zone_pfns[ZONE_NORMAL] = max_low_pfn;
 #ifdef CONFIG_HIGHMEM
 	max_zone_pfns[ZONE_HIGHMEM] = highend_pfn;
Index: linux/include/asm-x86/dma.h
===================================================================
--- linux.orig/include/asm-x86/dma.h
+++ linux/include/asm-x86/dma.h
@@ -316,4 +316,6 @@ extern int isa_dma_bridge_buggy;
 #define isa_dma_bridge_buggy	(0)
 #endif
 
+#define TRAD_DMA_MASK 0xffffff /* 16MB, 24bit */
+
 #endif /* _ASM_X86_DMA_H */
Index: linux/arch/x86/mm/init_32.c
===================================================================
--- linux.orig/arch/x86/mm/init_32.c
+++ linux/arch/x86/mm/init_32.c
@@ -571,6 +571,8 @@ void __init mem_init(void)
 	int codesize, reservedpages, datasize, initsize;
 	int tmp, bad_ppro;
 
+	prepare_mask_zone();
+
 #ifdef CONFIG_FLATMEM
 	BUG_ON(!mem_map);
 #endif


* [PATCH] [9/13] Remove set_dma_reserve
  2008-03-07  9:07 [PATCH] [0/13] General DMA zone rework Andi Kleen
                   ` (6 preceding siblings ...)
  2008-03-07  9:07 ` [PATCH] [8/13] Enable the mask allocator for x86 Andi Kleen
@ 2008-03-07  9:07 ` Andi Kleen
  2008-03-07  9:07 ` [PATCH] [10/13] Switch the 32bit dma_alloc_coherent functions over to use the maskable allocator Andi Kleen
                   ` (6 subsequent siblings)
  14 siblings, 0 replies; 56+ messages in thread
From: Andi Kleen @ 2008-03-07  9:07 UTC (permalink / raw)
  To: linux-kernel, linux-mm


x86 was the only user, and it no longer uses this since it was converted
to the mask allocator. So remove it.

Signed-off-by: Andi Kleen <ak@suse.de>

---
 include/linux/mm.h |    1 -
 mm/page_alloc.c    |   24 ------------------------
 2 files changed, 25 deletions(-)

Index: linux/mm/page_alloc.c
===================================================================
--- linux.orig/mm/page_alloc.c
+++ linux/mm/page_alloc.c
@@ -120,7 +120,6 @@ int min_free_kbytes = 1024;
 
 unsigned long __meminitdata nr_kernel_pages;
 unsigned long __meminitdata nr_all_pages;
-static unsigned long __meminitdata dma_reserve;
 
 #ifdef CONFIG_ARCH_POPULATES_NODE_MAP
   /*
@@ -3321,13 +3320,6 @@ static void __paginginit free_area_init_
 				"  %s zone: %lu pages exceeds realsize %lu\n",
 				zone_names[j], memmap_pages, realsize);
 
-		/* Account for reserved pages */
-		if (j == 0 && realsize > dma_reserve) {
-			realsize -= dma_reserve;
-			printk(KERN_DEBUG "  %s zone: %lu pages reserved\n",
-					zone_names[0], dma_reserve);
-		}
-
 		if (!is_highmem_idx(j))
 			nr_kernel_pages += realsize;
 		nr_all_pages += realsize;
@@ -3906,22 +3898,6 @@ early_param("movablecore", cmdline_parse
 
 #endif /* CONFIG_ARCH_POPULATES_NODE_MAP */
 
-/**
- * set_dma_reserve - set the specified number of pages reserved in the first zone
- * @new_dma_reserve: The number of pages to mark reserved
- *
- * The per-cpu batchsize and zone watermarks are determined by present_pages.
- * In the DMA zone, a significant percentage may be consumed by kernel image
- * and other unfreeable allocations which can skew the watermarks badly. This
- * function may optionally be used to account for unfreeable pages in the
- * first zone (e.g., ZONE_DMA). The effect will be lower watermarks and
- * smaller per-cpu batchsize.
- */
-void __init set_dma_reserve(unsigned long new_dma_reserve)
-{
-	dma_reserve = new_dma_reserve;
-}
-
 #ifndef CONFIG_NEED_MULTIPLE_NODES
 static bootmem_data_t contig_bootmem_data;
 struct pglist_data contig_page_data = { .bdata = &contig_bootmem_data };
Index: linux/include/linux/mm.h
===================================================================
--- linux.orig/include/linux/mm.h
+++ linux/include/linux/mm.h
@@ -987,7 +987,6 @@ extern void sparse_memory_present_with_a
 extern int early_pfn_to_nid(unsigned long pfn);
 #endif /* CONFIG_HAVE_ARCH_EARLY_PFN_TO_NID */
 #endif /* CONFIG_ARCH_POPULATES_NODE_MAP */
-extern void set_dma_reserve(unsigned long new_dma_reserve);
 extern void memmap_init_zone(unsigned long, int, unsigned long,
 				unsigned long, enum memmap_context);
 extern void setup_per_zone_pages_min(void);


* [PATCH] [10/13] Switch the 32bit dma_alloc_coherent functions over to use the maskable allocator
  2008-03-07  9:07 [PATCH] [0/13] General DMA zone rework Andi Kleen
                   ` (7 preceding siblings ...)
  2008-03-07  9:07 ` [PATCH] [9/13] Remove set_dma_reserve Andi Kleen
@ 2008-03-07  9:07 ` Andi Kleen
  2008-03-07  9:07 ` [PATCH] [11/13] Switch x86-64 dma_alloc_coherent over to " Andi Kleen
                   ` (5 subsequent siblings)
  14 siblings, 0 replies; 56+ messages in thread
From: Andi Kleen @ 2008-03-07  9:07 UTC (permalink / raw)
  To: linux-kernel, linux-mm


Signed-off-by: Andi Kleen <ak@suse.de>

---
 arch/x86/kernel/pci-dma_32.c |   17 ++++++++++-------
 1 file changed, 10 insertions(+), 7 deletions(-)

Index: linux/arch/x86/kernel/pci-dma_32.c
===================================================================
--- linux.orig/arch/x86/kernel/pci-dma_32.c
+++ linux/arch/x86/kernel/pci-dma_32.c
@@ -27,11 +27,12 @@ void *dma_alloc_coherent(struct device *
 {
 	void *ret;
 	struct dma_coherent_mem *mem = dev ? dev->dma_mem : NULL;
-	int order = get_order(size);
+	u64 mask;
 	/* ignore region specifiers */
-	gfp &= ~(__GFP_DMA | __GFP_HIGHMEM);
+	gfp &= ~__GFP_HIGHMEM;
 
 	if (mem) {
+		int order = get_order(size);
 		int page = bitmap_find_free_region(mem->bitmap, mem->size,
 						     order);
 		if (page >= 0) {
@@ -44,10 +45,12 @@ void *dma_alloc_coherent(struct device *
 			return NULL;
 	}
 
-	if (dev == NULL || (dev->coherent_dma_mask < 0xffffffff))
-		gfp |= GFP_DMA;
+	if (dev == NULL)
+		mask = TRAD_DMA_MASK;
+	else
+		mask = dev->coherent_dma_mask;
 
-	ret = (void *)__get_free_pages(gfp, order);
+	ret = get_pages_mask(gfp, size, mask);
 
 	if (ret != NULL) {
 		memset(ret, 0, size);
@@ -61,15 +64,15 @@ void dma_free_coherent(struct device *de
 			 void *vaddr, dma_addr_t dma_handle)
 {
 	struct dma_coherent_mem *mem = dev ? dev->dma_mem : NULL;
-	int order = get_order(size);
 
 	WARN_ON(irqs_disabled());	/* for portability */
 	if (mem && vaddr >= mem->virt_base && vaddr < (mem->virt_base + (mem->size << PAGE_SHIFT))) {
+		int order = get_order(size);
 		int page = (vaddr - mem->virt_base) >> PAGE_SHIFT;
 
 		bitmap_release_region(mem->bitmap, page, order);
 	} else
-		free_pages((unsigned long)vaddr, order);
+		free_pages_mask(vaddr, size);
 }
 EXPORT_SYMBOL(dma_free_coherent);
 


* [PATCH] [11/13] Switch x86-64 dma_alloc_coherent over to the maskable allocator
  2008-03-07  9:07 [PATCH] [0/13] General DMA zone rework Andi Kleen
                   ` (8 preceding siblings ...)
  2008-03-07  9:07 ` [PATCH] [10/13] Switch the 32bit dma_alloc_coherent functions over to use the maskable allocator Andi Kleen
@ 2008-03-07  9:07 ` Andi Kleen
  2008-03-07  9:07 ` [PATCH] [12/13] Add vmstat statistics for new swiotlb code Andi Kleen
                   ` (4 subsequent siblings)
  14 siblings, 0 replies; 56+ messages in thread
From: Andi Kleen @ 2008-03-07  9:07 UTC (permalink / raw)
  To: linux-kernel, linux-mm


Signed-off-by: Andi Kleen <ak@suse.de>

---
 arch/x86/kernel/pci-dma_64.c |   49 +++++++++++++------------------------------
 1 file changed, 15 insertions(+), 34 deletions(-)

Index: linux/arch/x86/kernel/pci-dma_64.c
===================================================================
--- linux.orig/arch/x86/kernel/pci-dma_64.c
+++ linux/arch/x86/kernel/pci-dma_64.c
@@ -47,11 +47,16 @@ struct device fallback_dev = {
 
 /* Allocate DMA memory on node near device */
 noinline static void *
-dma_alloc_pages(struct device *dev, gfp_t gfp, unsigned order)
+dma_alloc_pages(struct device *dev, gfp_t gfp, unsigned size,
+		unsigned long dma_mask)
 {
 	struct page *page;
 	int node;
 
+	/* For small masks use DMA allocator without node affinity */
+	if (dma_mask < DMA_32BIT_MASK)
+		return get_pages_mask(gfp, size, dma_mask);
+
 	node = dev_to_node(dev);
 	if (node == -1)
 		node = numa_node_id();
@@ -59,7 +64,8 @@ dma_alloc_pages(struct device *dev, gfp_
 	if (node < first_node(node_online_map))
 		node = first_node(node_online_map);
 
-	page = alloc_pages_node(node, gfp, order);
+	page = alloc_pages_node(node, gfp, get_order(size));
+
 	return page ? page_address(page) : NULL;
 }
 
@@ -91,15 +97,10 @@ dma_alloc_coherent(struct device *dev, s
 	   uses the normal dma_mask for alloc_coherent. */
 	dma_mask &= *dev->dma_mask;
 
-	/* Why <=? Even when the mask is smaller than 4GB it is often
-	   larger than 16MB and in this case we have a chance of
-	   finding fitting memory in the next higher zone first. If
-	   not retry with true GFP_DMA. -AK */
 	if (dma_mask <= DMA_32BIT_MASK)
 		gfp |= GFP_DMA32;
 
- again:
-	memory = dma_alloc_pages(dev, gfp, get_order(size));
+	memory = dma_alloc_pages(dev, gfp, size, dma_mask);
 	if (memory == NULL)
 		return NULL;
 
@@ -108,25 +109,10 @@ dma_alloc_coherent(struct device *dev, s
 		bus = virt_to_bus(memory);
 	        high = (bus + size) >= dma_mask;
 		mmu = high;
-		if (force_iommu && !(gfp & GFP_DMA))
+		if (force_iommu)
 			mmu = 1;
 		else if (high) {
-			free_pages((unsigned long)memory,
-				   get_order(size));
-
-			/* Don't use the 16MB ZONE_DMA unless absolutely
-			   needed. It's better to use remapping first. */
-			if (dma_mask < DMA_32BIT_MASK && !(gfp & GFP_DMA)) {
-				gfp = (gfp & ~GFP_DMA32) | GFP_DMA;
-				goto again;
-			}
-
-			/* Let low level make its own zone decisions */
-			gfp &= ~(GFP_DMA32|GFP_DMA);
-
-			if (dma_ops->alloc_coherent)
-				return dma_ops->alloc_coherent(dev, size,
-							   dma_handle, gfp);
+			free_pages_mask(memory, size);
 			return NULL;
 		}
 
@@ -137,12 +123,6 @@ dma_alloc_coherent(struct device *dev, s
 		}
 	}
 
-	if (dma_ops->alloc_coherent) {
-		free_pages((unsigned long)memory, get_order(size));
-		gfp &= ~(GFP_DMA|GFP_DMA32);
-		return dma_ops->alloc_coherent(dev, size, dma_handle, gfp);
-	}
-
 	if (dma_ops->map_simple) {
 		*dma_handle = dma_ops->map_simple(dev, memory,
 					      size,
@@ -153,7 +133,7 @@ dma_alloc_coherent(struct device *dev, s
 
 	if (panic_on_overflow)
 		panic("dma_alloc_coherent: IOMMU overflow by %lu bytes\n",size);
-	free_pages((unsigned long)memory, get_order(size));
+	free_pages_mask(memory, size);
 	return NULL;
 }
 EXPORT_SYMBOL(dma_alloc_coherent);
@@ -166,9 +146,10 @@ void dma_free_coherent(struct device *de
 			 void *vaddr, dma_addr_t bus)
 {
 	WARN_ON(irqs_disabled());	/* for portability */
+	/* RED-PEN swiotlb does unnecessary copy here */
 	if (dma_ops->unmap_single)
 		dma_ops->unmap_single(dev, bus, size, 0);
-	free_pages((unsigned long)vaddr, get_order(size));
+	free_pages_mask(vaddr, size);
 }
 EXPORT_SYMBOL(dma_free_coherent);
 
@@ -191,7 +172,7 @@ int dma_supported(struct device *dev, u6
 
 	/* Copied from i386. Doesn't make much sense, because it will
 	   only work for pci_alloc_coherent.
-	   The caller just has to use GFP_DMA in this case. */
+	   The caller just has to use *_mask allocations in this case. */
         if (mask < DMA_24BIT_MASK)
                 return 0;
 


* [PATCH] [12/13] Add vmstat statistics for new swiotlb code
  2008-03-07  9:07 [PATCH] [0/13] General DMA zone rework Andi Kleen
                   ` (9 preceding siblings ...)
  2008-03-07  9:07 ` [PATCH] [11/13] Switch x86-64 dma_alloc_coherent over to " Andi Kleen
@ 2008-03-07  9:07 ` Andi Kleen
  2008-03-08  2:38   ` Christoph Lameter
  2008-03-07  9:07 ` [PATCH] [13/13] Convert x86-64 swiotlb to use the mask allocator directly Andi Kleen
                   ` (3 subsequent siblings)
  14 siblings, 1 reply; 56+ messages in thread
From: Andi Kleen @ 2008-03-07  9:07 UTC (permalink / raw)
  To: linux-kernel, linux-mm


Signed-off-by: Andi Kleen <ak@suse.de>

---
 include/linux/vmstat.h |    4 ++++
 mm/vmstat.c            |    6 ++++++
 2 files changed, 10 insertions(+)

Index: linux/include/linux/vmstat.h
===================================================================
--- linux.orig/include/linux/vmstat.h
+++ linux/include/linux/vmstat.h
@@ -41,6 +41,10 @@ enum vm_event_item { PGPGIN, PGPGOUT, PS
 		MASK_ALLOC, MASK_FREE, MASK_BITMAP_SKIP, MASK_WAIT,
 		MASK_HIGHER, MASK_LOW_WASTE, MASK_HIGH_WASTE,
 #endif
+#ifdef CONFIG_SWIOTLB_MASK_ALLOC
+		SWIOTLB_USED_PAGES, SWIOTLB_BYTES_WASTED, SWIOTLB_NUM_ALLOCS,
+		SWIOTLB_NUM_FREES,
+#endif
 		NR_VM_EVENT_ITEMS
 };
 
Index: linux/mm/vmstat.c
===================================================================
--- linux.orig/mm/vmstat.c
+++ linux/mm/vmstat.c
@@ -654,6 +654,12 @@ static const char * const vmstat_text[] 
 	"mask_low_waste",
 	"mask_high_waste",
 #endif
+#ifdef CONFIG_SWIOTLB_MASK_ALLOC
+	"swiotlb_used_pages",
+	"swiotlb_bytes_wasted",
+	"swiotlb_num_allocs",
+	"swiotlb_num_frees",
+#endif
 #endif
 };
 


* [PATCH] [13/13] Convert x86-64 swiotlb to use the mask allocator directly
  2008-03-07  9:07 [PATCH] [0/13] General DMA zone rework Andi Kleen
                   ` (10 preceding siblings ...)
  2008-03-07  9:07 ` [PATCH] [12/13] Add vmstat statistics for new swiotlb code Andi Kleen
@ 2008-03-07  9:07 ` Andi Kleen
  2008-03-07 15:18 ` [PATCH] [0/13] General DMA zone rework Rene Herman
                   ` (2 subsequent siblings)
  14 siblings, 0 replies; 56+ messages in thread
From: Andi Kleen @ 2008-03-07  9:07 UTC (permalink / raw)
  To: linux-kernel, linux-mm


The swiotlb and DMA pools already interact. Instead of using two separate
pools, always use the mask allocator pool. When swiotlb is active
the mask allocator pool is extended by the respective swiotlb size
(64MB by default, as with the old code). All swiotlb bouncing then
happens directly through the maskable zone.

This is ok because there are no "normal" memory consumers in
the maskable zone, and (nearly) everyone who allocates from it would
previously have allocated from swiotlb anyway.

The code is based on the original swiotlb code, but heavily modified.

One difference is that the swiotlb is now managed only in page-size
chunks. The previous implementation used 2K "slabs". I'm still
running statistics to see if it's worth doing something about this.

lib/swiotlb.c is no longer used on x86.
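
For orientation, a stripped down sketch of the new bounce path
(simplified from the map_single()/dmatlb_map_single() code below;
page flagging, accounting and most of the error handling left out):

static dma_addr_t bounce_map(struct device *hwdev, void *buf, size_t size)
{
	struct page *p;

	/* Device can reach the buffer directly -> no bounce needed */
	if (!needs_mapping(hwdev, virt_to_bus(buf)) && !swiotlb_force)
		return virt_to_bus(buf);

	/* Bounce pages come straight out of the maskable zone */
	p = alloc_pages_mask(GFP_ATOMIC, size, *hwdev->dma_mask);
	if (!p)
		return virt_to_bus(io_tlb_overflow_buffer);

	/* DMA_TO_DEVICE case: copy the data into the bounce buffer */
	memcpy(page_address(p), buf, size);
	return virt_to_bus(page_address(p));
}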

Signed-off-by: Andi Kleen <ak@suse.de>

---
 arch/x86/Kconfig                 |    6 
 arch/x86/kernel/Makefile         |    2 
 arch/x86/kernel/pci-dma_64.c     |    4 
 arch/x86/kernel/pci-swiotlb_64.c |  431 +++++++++++++++++++++++++++++++++++++--
 include/asm-x86/swiotlb.h        |    2 
 include/linux/page-flags.h       |    1 
 6 files changed, 420 insertions(+), 26 deletions(-)

Index: linux/arch/x86/kernel/pci-dma_64.c
===================================================================
--- linux.orig/arch/x86/kernel/pci-dma_64.c
+++ linux/arch/x86/kernel/pci-dma_64.c
@@ -250,7 +250,7 @@ static __init int iommu_setup(char *p)
 		if (!strncmp(p, "nodac", 5))
 			forbid_dac = -1;
 
-#ifdef CONFIG_SWIOTLB
+#ifdef CONFIG_SWIOTLB_MASK_ALLOC
 		if (!strncmp(p, "soft",4))
 			swiotlb = 1;
 #endif
@@ -288,7 +288,7 @@ void __init pci_iommu_alloc(void)
 
 	detect_intel_iommu();
 
-#ifdef CONFIG_SWIOTLB
+#ifdef CONFIG_SWIOTLB_MASK_ALLOC
 	pci_swiotlb_init();
 #endif
 }
Index: linux/arch/x86/kernel/pci-swiotlb_64.c
===================================================================
--- linux.orig/arch/x86/kernel/pci-swiotlb_64.c
+++ linux/arch/x86/kernel/pci-swiotlb_64.c
@@ -1,30 +1,389 @@
-/* Glue code to lib/swiotlb.c */
-
+/*
+ * DMATLB implementation that bounces through isolated DMA zones.
+ * Loosely based on the original swiotlb.c from Asit Mallick et al.
+ *
+ * Still tries to be plug compatible with the lib/swiotlb.c,
+ * but that can be lifted at some point. That is the reason for the
+ * mixed naming convention.
+ *
+ * TBD: Should really support an interface for sleepy maps
+ */
 #include <linux/pci.h>
 #include <linux/cache.h>
 #include <linux/module.h>
 #include <linux/dma-mapping.h>
+#include <linux/ctype.h>
+#include <linux/bootmem.h>
+#include <linux/hardirq.h>
 
 #include <asm/gart.h>
 #include <asm/swiotlb.h>
 #include <asm/dma.h>
+#include <asm/dma-mapping.h>
+
+#define DEFAULT_SWIOTLB_SIZE (64*1024*1024)
 
 int swiotlb __read_mostly;
 
-const struct dma_mapping_ops swiotlb_dma_ops = {
-	.mapping_error = swiotlb_dma_mapping_error,
-	.alloc_coherent = swiotlb_alloc_coherent,
-	.free_coherent = swiotlb_free_coherent,
-	.map_single = swiotlb_map_single,
-	.unmap_single = swiotlb_unmap_single,
-	.sync_single_for_cpu = swiotlb_sync_single_for_cpu,
-	.sync_single_for_device = swiotlb_sync_single_for_device,
-	.sync_single_range_for_cpu = swiotlb_sync_single_range_for_cpu,
-	.sync_single_range_for_device = swiotlb_sync_single_range_for_device,
-	.sync_sg_for_cpu = swiotlb_sync_sg_for_cpu,
-	.sync_sg_for_device = swiotlb_sync_sg_for_device,
-	.map_sg = swiotlb_map_sg,
-	.unmap_sg = swiotlb_unmap_sg,
+#define PageSwiotlb(p) test_bit(PG_swiotlb, &(p)->flags)
+#define SetPageSwiotlb(p)	set_bit(PG_swiotlb, &(p)->flags);
+#define ClearPageSwiotlb(p)	clear_bit(PG_swiotlb, &(p)->flags);
+
+int swiotlb_force __read_mostly;
+
+unsigned swiotlb_size;
+
+static unsigned long io_tlb_overflow = 32*1024;
+void *io_tlb_overflow_buffer;
+
+enum dma_sync_target {
+	SYNC_FOR_CPU = 0,
+	SYNC_FOR_DEVICE = 1,
+};
+
+static void
+dmatlb_full(struct device *dev, size_t size, int dir, int do_panic)
+{
+	/*
+	 * Ran out of IOMMU space for this operation. This is very bad.
+	 * Unfortunately the drivers cannot handle this operation properly.
+	 * unless they check for dma_mapping_error (most don't)
+	 * When the mapping is small enough return a static buffer to limit
+	 * the damage, or panic when the transfer is too big.
+	 */
+	printk(KERN_ERR "DMA: Out of DMA-TLB bounce space for %zu bytes at "
+	       "device %s\n", size, dev ? dev->bus_id : "?");
+
+	if (size > io_tlb_overflow && do_panic) {
+		if (dir == DMA_FROM_DEVICE || dir == DMA_BIDIRECTIONAL)
+			panic("DMA: Memory would be corrupted\n");
+		if (dir == DMA_TO_DEVICE || dir == DMA_BIDIRECTIONAL)
+			panic("DMA: Random memory would be DMAed\n");
+	}
+}
+
+static int
+needs_mapping(struct device *hwdev, dma_addr_t addr)
+{
+	dma_addr_t mask = 0xffffffff;
+	/* If the device has a mask, use it, otherwise default to 32 bits */
+	if (hwdev && hwdev->dma_mask)
+		mask = *hwdev->dma_mask;
+	return (addr & ~mask) != 0;
+}
+
+static void account(int pages, int size)
+{
+	unsigned waste;
+	get_cpu();
+	__count_vm_events(SWIOTLB_USED_PAGES, pages);
+	waste = (PAGE_SIZE * pages) - size;
+	__count_vm_events(SWIOTLB_BYTES_WASTED, waste);
+	__count_vm_event(SWIOTLB_NUM_ALLOCS);
+	put_cpu();
+}
+
+/*
+ * Allocates bounce buffer and returns its kernel virtual address.
+ */
+static void *
+map_single(struct device *hwdev, char *buffer, size_t size, int dir)
+{
+	void *dma_addr;
+	int pages, i;
+	struct page *p;
+	gfp_t gfp = GFP_ATOMIC;
+	p = alloc_pages_mask(gfp, max(size, MASK_MIN_SIZE), *hwdev->dma_mask);
+	if (!p)
+		return NULL;
+	SetPageSwiotlb(p);
+	pages = (size + PAGE_SIZE - 1) >> PAGE_SHIFT;
+	account(pages, size);
+	p->lru.next = (void *)buffer;
+	p->lru.prev = (void *)size;
+	for (i = 1; i < pages; i++) {
+		struct page *n = p + i;
+		SetPageSwiotlb(n);
+		n->lru.next = (void *)buffer + i*PAGE_SIZE;
+		n->lru.prev = (void *)PAGE_SIZE;
+	}
+
+	dma_addr = page_address(p);
+	if (dir == DMA_TO_DEVICE || dir == DMA_BIDIRECTIONAL)
+		memcpy(dma_addr, buffer, size);
+	BUG_ON(virt_to_phys(dma_addr + size - 1) & ~*hwdev->dma_mask);
+	return dma_addr;
+}
+
+static void dmatlb_unmap_single(struct device *hwdev, dma_addr_t dev_addr,
+				size_t size, int dir)
+{
+	char *dma_addr = bus_to_virt(dev_addr);
+	struct page *p = virt_to_page(dma_addr);
+	void *buffer;
+	int pages, i;
+
+	BUG_ON(dir == DMA_NONE);
+
+	if (!PageSwiotlb(p))
+		return;
+	BUG_ON((long)p->lru.prev < 0);
+	BUG_ON((unsigned long)p->lru.prev < size);
+
+	buffer = p->lru.next;
+
+	/*
+	 * First, sync the memory before unmapping the entry
+	 */
+	if (dir == DMA_FROM_DEVICE || dir == DMA_BIDIRECTIONAL)
+		memcpy(buffer, dma_addr, size);
+
+	ClearPageSwiotlb(p);
+	p->lru.next = NULL;
+	p->lru.prev = NULL;
+	pages = (size + PAGE_SIZE - 1) >> PAGE_SHIFT;
+	for (i = 1; i < pages; i++) {
+		struct page *n = p + i;
+		ClearPageSwiotlb(n);
+		BUG_ON((long)n->lru.prev < 0);
+		BUG_ON((void *)n->lru.next != buffer + i*PAGE_SIZE);
+		n->lru.next = NULL;
+		n->lru.prev = NULL;
+	}
+	__free_pages_mask(p, max(size, MASK_MIN_SIZE));
+	count_vm_event(SWIOTLB_NUM_FREES);
+}
+
+static void
+sync_single(struct device *hwdev, char *dma_addr, size_t size,
+	    int dir, int target, void *buffer)
+{
+	buffer += ((unsigned long)dma_addr & ~PAGE_MASK);
+	switch (target) {
+	case SYNC_FOR_CPU:
+		if (dir == DMA_FROM_DEVICE || dir == DMA_BIDIRECTIONAL)
+			memcpy(buffer, dma_addr, size);
+		else
+			BUG_ON(dir != DMA_TO_DEVICE);
+		break;
+	case SYNC_FOR_DEVICE:
+		if (dir == DMA_TO_DEVICE || dir == DMA_BIDIRECTIONAL)
+			memcpy(dma_addr, buffer, size);
+		else
+			BUG_ON(dir != DMA_FROM_DEVICE);
+		break;
+	default:
+		BUG();
+	}
+}
+
+/*
+ * Map a single buffer of the indicated size for DMA in streaming mode.  The
+ * physical address to use is returned.
+ *
+ * Once the device is given the dma address, the device owns this
+ * memory until either swiotlb_unmap_single or swiotlb_dma_sync_single
+ * is performed.
+ */
+static dma_addr_t
+dmatlb_map_single(struct device *hwdev, void *ptr, size_t size, int dir)
+{
+	dma_addr_t dev_addr = virt_to_bus(ptr);
+	void *map;
+
+	BUG_ON(dir == DMA_NONE);
+	/*
+	 * If the pointer passed in happens to be in the device's DMA
+	 * window, we can safely return the device addr and not worry
+	 * about bounce buffering it.
+	 */
+	if (!needs_mapping(hwdev, dev_addr) && !swiotlb_force)
+		return dev_addr;
+
+	/*
+	 * Oh well, have to allocate and map a bounce buffer.
+	 */
+	map = map_single(hwdev, ptr, size, dir);
+	if (!map) {
+		dmatlb_full(hwdev, size, dir, 1);
+		map = io_tlb_overflow_buffer;
+	}
+
+	dev_addr = virt_to_bus(map);
+	return dev_addr;
+}
+
+/*
+ * Make physical memory consistent for a single streaming mode DMA
+ * translation after a transfer.
+ *
+ * If you perform a dmatlb_map_single() but wish to interrogate the buffer
+ * using the cpu, yet do not wish to teardown the dma mapping, you must
+ * call this function before doing so.	At the next point you give the dma
+ * address back to the card, you must first perform a
+ * dmatlb_dma_sync_for_device, and then the device again owns the buffer
+ */
+static void
+dmatlb_sync_single(struct device *hwdev, dma_addr_t dev_addr,
+		    size_t size, int dir, int tgt)
+{
+	char *dma_addr = bus_to_virt(dev_addr);
+	struct page *p;
+
+	BUG_ON(dir == DMA_NONE);
+	p = virt_to_page(dma_addr);
+	if (PageSwiotlb(p))
+		sync_single(hwdev, dma_addr, size, dir, tgt, p->lru.next);
+}
+
+static void
+dmatlb_sync_single_for_cpu(struct device *hwdev, dma_addr_t dev_addr,
+			    size_t size, int dir)
+{
+	dmatlb_sync_single(hwdev, dev_addr, size, dir, SYNC_FOR_CPU);
+}
+
+static void
+dmatlb_sync_single_for_device(struct device *hwdev, dma_addr_t dev_addr,
+			       size_t size, int dir)
+{
+	dmatlb_sync_single(hwdev, dev_addr, size, dir, SYNC_FOR_DEVICE);
+}
+
+/*
+ * Same as above, but for a sub-range of the mapping.
+ */
+static void
+dmatlb_sync_single_range(struct device *hwdev, dma_addr_t dev_addr,
+			  unsigned long offset, size_t size,
+			  int dir, int tgt)
+{
+	char *dma_addr = bus_to_virt(dev_addr) + offset;
+	struct page *p = virt_to_page(dma_addr);
+
+	BUG_ON(dir == DMA_NONE);
+	if (PageSwiotlb(p))
+		sync_single(hwdev, dma_addr, size, dir, tgt, p->lru.next);
+}
+
+static void
+dmatlb_sync_single_range_for_cpu(struct device *hwdev, dma_addr_t dev_addr,
+				 unsigned long offset, size_t size, int dir)
+{
+	dmatlb_sync_single_range(hwdev, dev_addr, offset, size, dir,
+				  SYNC_FOR_CPU);
+}
+
+static void
+dmatlb_sync_single_range_for_device(struct device *hwdev,
+				    dma_addr_t dev_addr,
+				    unsigned long offset, size_t size,
+				    int dir)
+{
+	dmatlb_sync_single_range(hwdev, dev_addr, offset, size, dir,
+				  SYNC_FOR_DEVICE);
+}
+
+/*
+ * Unmap a set of streaming mode DMA translations.  Again, cpu read rules
+ * concerning calls here are the same as for swiotlb_unmap_single() above.
+ */
+static void
+dmatlb_unmap_sg(struct device *hwdev, struct scatterlist *sg, int nelems,
+		 int dir)
+{
+	int i;
+
+	BUG_ON(dir == DMA_NONE);
+
+	for (i = 0; i < nelems; i++, sg++) {
+		if (sg->dma_address != virt_to_bus(sg_virt(sg)))
+			dmatlb_unmap_single(hwdev, sg->dma_address,
+					    sg->dma_length, dir);
+	}
+}
+
+static int
+dmatlb_map_sg(struct device *hwdev, struct scatterlist *sg, int nelems,
+	       int dir)
+{
+	void *addr;
+	dma_addr_t dev_addr;
+	int i;
+
+	BUG_ON(dir == DMA_NONE);
+
+	for (i = 0; i < nelems; i++, sg++) {
+		addr = sg_virt(sg);
+		dev_addr = virt_to_bus(addr);
+		if (swiotlb_force || needs_mapping(hwdev, dev_addr)) {
+			void *map;
+			map = map_single(hwdev, addr, sg->length, dir);
+			if (!map) {
+				/* Don't panic here, we expect map_sg users
+				   to do proper error handling. */
+				dmatlb_full(hwdev, sg->length, dir, 0);
+				dmatlb_unmap_sg(hwdev, sg - i, i, dir);
+				sg[0].dma_length = 0;
+				return 0;
+			}
+			sg->dma_address = virt_to_bus(map);
+		} else
+			sg->dma_address = dev_addr;
+		sg->dma_length = sg->length;
+	}
+	return nelems;
+}
+
+static void
+dmatlb_sync_sg(struct device *hwdev, struct scatterlist *sg,
+		int nelems, int dir, int target)
+{
+	int i;
+
+	BUG_ON(dir == DMA_NONE);
+
+	for (i = 0; i < nelems; i++, sg++) {
+		void *p = bus_to_virt(sg->dma_address);
+		struct page *pg = virt_to_page(p);
+		if (PageSwiotlb(pg))
+			sync_single(hwdev, p, sg->dma_length, dir, target,
+				    pg->lru.next);
+	}
+}
+
+static void
+dmatlb_sync_sg_for_cpu(struct device *hwdev, struct scatterlist *sg,
+			int nelems, int dir)
+{
+	dmatlb_sync_sg(hwdev, sg, nelems, dir, SYNC_FOR_CPU);
+}
+
+static void
+dmatlb_sync_sg_for_device(struct device *hwdev, struct scatterlist *sg,
+			   int nelems, int dir)
+{
+	dmatlb_sync_sg(hwdev, sg, nelems, dir, SYNC_FOR_DEVICE);
+}
+
+static int
+dmatlb_dma_mapping_error(dma_addr_t dma_addr)
+{
+	return (dma_addr == virt_to_bus(io_tlb_overflow_buffer));
+}
+
+const struct dma_mapping_ops dmatlb_dma_ops = {
+	.mapping_error = dmatlb_dma_mapping_error,
+	.map_single = dmatlb_map_single,
+	.unmap_single = dmatlb_unmap_single,
+	.sync_single_for_cpu = dmatlb_sync_single_for_cpu,
+	.sync_single_for_device = dmatlb_sync_single_for_device,
+	.sync_single_range_for_cpu = dmatlb_sync_single_range_for_cpu,
+	.sync_single_range_for_device = dmatlb_sync_single_range_for_device,
+	.sync_sg_for_cpu = dmatlb_sync_sg_for_cpu,
+	.sync_sg_for_device = dmatlb_sync_sg_for_device,
+	.map_sg = dmatlb_map_sg,
+	.unmap_sg = dmatlb_unmap_sg,
 	.dma_supported = NULL,
 };
 
@@ -35,9 +394,43 @@ void __init pci_swiotlb_init(void)
 	       swiotlb = 1;
 	if (swiotlb_force)
 		swiotlb = 1;
+	if (!swiotlb_size)
+		swiotlb_size = DEFAULT_SWIOTLB_SIZE;
 	if (swiotlb) {
-		printk(KERN_INFO "PCI-DMA: Using software bounce buffering for IO (SWIOTLB)\n");
-		swiotlb_init();
-		dma_ops = &swiotlb_dma_ops;
+		printk(KERN_INFO
+       "PCI-DMA: Using software bounce buffering for IO (SWIOTLB)\n");
+
+		increase_mask_zone(swiotlb_size);
+
+		/*
+		 * Get the overflow emergency buffer
+		 */
+		io_tlb_overflow_buffer = alloc_bootmem_low(io_tlb_overflow);
+		if (!io_tlb_overflow_buffer)
+			panic("Cannot allocate SWIOTLB overflow buffer!\n");
+
+		dma_ops = &dmatlb_dma_ops;
+	}
+}
+
+#define COMPAT_IO_TLB_SHIFT 11
+
+static int __init
+setup_io_tlb_npages(char *str)
+{
+	if (isdigit(*str)) {
+		unsigned long slabs;
+		char *e;
+		slabs = simple_strtoul(str, &e, 0);
+		if (!isalpha(*e))
+			swiotlb = memparse(str, &e);
+		else
+			swiotlb_size = slabs  << COMPAT_IO_TLB_SHIFT;
 	}
+	if (*str == ',')
+		++str;
+	if (!strcmp(str, "force"))
+		swiotlb_force = 1;
+	return 1;
 }
+__setup("swiotlb=", setup_io_tlb_npages);
Index: linux/arch/x86/kernel/Makefile
===================================================================
--- linux.orig/arch/x86/kernel/Makefile
+++ linux/arch/x86/kernel/Makefile
@@ -95,5 +95,5 @@ ifeq ($(CONFIG_X86_64),y)
 
         obj-$(CONFIG_GART_IOMMU)	+= pci-gart_64.o aperture_64.o
         obj-$(CONFIG_CALGARY_IOMMU)	+= pci-calgary_64.o tce_64.o
-        obj-$(CONFIG_SWIOTLB)		+= pci-swiotlb_64.o
+	obj-$(CONFIG_SWIOTLB_MASK_ALLOC) += pci-swiotlb_64.o
 endif
Index: linux/arch/x86/Kconfig
===================================================================
--- linux.orig/arch/x86/Kconfig
+++ linux/arch/x86/Kconfig
@@ -437,7 +437,7 @@ config HPET_EMULATE_RTC
 config GART_IOMMU
 	bool "GART IOMMU support" if EMBEDDED
 	default y
-	select SWIOTLB
+	select SWIOTLB_MASK_ALLOC
 	select AGP
 	depends on X86_64 && PCI
 	help
@@ -453,7 +453,7 @@ config GART_IOMMU
 
 config CALGARY_IOMMU
 	bool "IBM Calgary IOMMU support"
-	select SWIOTLB
+	select SWIOTLB_MASK_ALLOC
 	depends on X86_64 && PCI && EXPERIMENTAL
 	help
 	  Support for hardware IOMMUs in IBM's xSeries x366 and x460
@@ -484,7 +484,7 @@ config IOMMU_HELPER
 	def_bool (CALGARY_IOMMU || GART_IOMMU)
 
 # need this always selected by IOMMU for the VIA workaround
-config SWIOTLB
+config SWIOTLB_MASK_ALLOC
 	bool
 	help
 	  Support for software bounce buffers used on x86-64 systems
Index: linux/include/asm-x86/swiotlb.h
===================================================================
--- linux.orig/include/asm-x86/swiotlb.h
+++ linux/include/asm-x86/swiotlb.h
@@ -43,7 +43,7 @@ extern void swiotlb_init(void);
 
 extern int swiotlb_force;
 
-#ifdef CONFIG_SWIOTLB
+#ifdef CONFIG_SWIOTLB_MASK_ALLOC
 extern int swiotlb;
 #else
 #define swiotlb 0
Index: linux/include/linux/page-flags.h
===================================================================
--- linux.orig/include/linux/page-flags.h
+++ linux/include/linux/page-flags.h
@@ -107,6 +107,7 @@
  *         63                            32                              0
  */
 #define PG_uncached		31	/* Page has been mapped as uncached */
+#define PG_swiotlb		30	/* Page in swiotlb */
 #endif
 
 /*


* Re: [PATCH] [6/13] Core maskable allocator
  2008-03-07  9:07 ` [PATCH] [6/13] Core maskable allocator Andi Kleen
@ 2008-03-07 10:53   ` Johannes Weiner
  2008-03-07 11:14     ` Andi Kleen
  2008-03-07 17:05   ` Randy Dunlap
                     ` (3 subsequent siblings)
  4 siblings, 1 reply; 56+ messages in thread
From: Johannes Weiner @ 2008-03-07 10:53 UTC (permalink / raw)
  To: Andi Kleen; +Cc: linux-kernel, linux-mm

Hi Andi,

Andi Kleen <andi@firstfloor.org> writes:

> Index: linux/mm/mask-alloc.c
> ===================================================================
> --- /dev/null
> +++ linux/mm/mask-alloc.c
> @@ -0,0 +1,504 @@
> +/*
> + * Generic management of a low memory zone to allocate memory with an address mask.
> + *
> + * The maskable pool is reserved inside another zone, but managed by a
> + * specialized bitmap allocator. The allocator is not O(1) (searches
> + * the bitmap with a last use hint) but should be fast enough for
> + * normal purposes.  The advantage of the allocator is that it can
> + * allocate based on a mask.
> + *
> + * The allocator could be improved, but it's better to keep
> + * things simple for now and there are relatively few users
> + * which are usually not that speed critical. Also for simple
> + * repetitive allocation patterns it should usually be approximately
> + * O(1) anyway, due to the rotating cursor in the bitmap.
> + *
> + * This allocator should be only used by architectures with reasonably
> + * continuous physical memory at least for the low normal zone.
> + *
> + * Note book:
> + * Right now there are no high priority reservations (__GFP_HIGH). Iff
> + * they are needed it would be possible to reserve some very low memory
> + * for those.
> + *
> + * Copyright 2007, 2008 Andi Kleen, SUSE Labs.
> + * Subject to the GNU Public License v.2 only.
> + */
> +
> +#include <linux/mm.h>
> +#include <linux/gfp.h>
> +#include <linux/kernel.h>
> +#include <linux/sched.h>
> +#include <linux/bitops.h>
> +#include <linux/string.h>
> +#include <linux/wait.h>
> +#include <linux/bootmem.h>
> +#include <linux/module.h>
> +#include <linux/fault-inject.h>
> +#include <linux/ctype.h>
> +#include <linux/kallsyms.h>
> +#include "internal.h"
> +
> +#define BITS_PER_PAGE (PAGE_SIZE * 8)
> +
> +#define MASK_ZONE_LIMIT (2U<<30) /* 2GB max for now */
> +
> +#define Mprintk(x...)
> +#define Mprint_symbol(x...)
> +
> +static int force_mask __read_mostly;
> +static DECLARE_WAIT_QUEUE_HEAD(mask_zone_wait);
> +unsigned long mask_timeout __read_mostly = 5*HZ;
> +
> +/*
> + * The mask_bitmap maintains all the pages in the mask pool.
> + * It is reversed (lowest pfn has the highest index)
> + * to make reverse search easier.
> + * All accesses are protected by the mask_bitmap_lock
> + */
> +static DEFINE_SPINLOCK(mask_bitmap_lock);
> +static unsigned long *mask_bitmap;
> +static unsigned long mask_max_pfn;
> +
> +static inline unsigned pfn_to_maskbm_index(unsigned long pfn)
> +{
> +	return mask_max_pfn - pfn;
> +}
> +
> +static inline unsigned maskbm_index_to_pfn(unsigned index)
> +{
> +	return mask_max_pfn - index;
> +}
> +
> +static unsigned wait_for_mask_free(unsigned left)
> +{
> +	DEFINE_WAIT(wait);
> +	prepare_to_wait(&mask_zone_wait, &wait, TASK_UNINTERRUPTIBLE);
> +	left = schedule_timeout(left);
> +	finish_wait(&mask_zone_wait, &wait);
> +	return left;
> +}
> +

If ...

> +/* First try normal zones if possible. */
> +static struct page *
> +alloc_higher_pages(gfp_t gfp_mask, unsigned order, unsigned long pfn)
> +{
> +	struct page *p = NULL;
> +	if (pfn > mask_max_pfn) {
> +#ifdef CONFIG_ZONE_DMA32
> +		if (pfn <= (0xffffffff >> PAGE_SHIFT)) {
> +			p = alloc_pages(gfp_mask|GFP_DMA32|__GFP_NOWARN,
> +						order);

... this succeeds and allocates pages, and ...

> +			if (p && page_to_pfn(p) >= pfn) {
> +				__free_pages(p, order);
> +				p = NULL;
> +			}

... p is non-NULL and its pfn is lower than pfn ...

> +		}
> +#endif
> +		p = alloc_pages(gfp_mask|__GFP_NOWARN, order);

... isn't this a leak here?

> +		if (p && page_to_pfn(p) >= pfn) {
> +			__free_pages(p, order);
> +			p = NULL;
> +		}
> +	}
> +	return p;
> +}
> +
> +static unsigned long alloc_mask(int pages, unsigned long max)
> +{
> +	static unsigned long next_bit;
> +	unsigned long offset, flags, start, pfn;
> +	int k;
> +
> +	if (max >= mask_max_pfn)
> +		max = mask_max_pfn;

Can omit the assignment when max == mask_max_pfn.

> +	start = mask_max_pfn - max;
> +
> +	spin_lock_irqsave(&mask_bitmap_lock, flags);
> +	offset = -1L;
> +
> +	if (next_bit >= start && next_bit + pages < (mask_max_pfn - (max>>1))) {
> +		offset = find_next_zero_string(mask_bitmap, next_bit,
> +					       mask_max_pfn, pages);
> +		if (offset != -1L)
> +			count_vm_events(MASK_BITMAP_SKIP, offset - next_bit);
> +	}
> +	if (offset == -1L) {
> +		offset = find_next_zero_string(mask_bitmap, start,
> +					mask_max_pfn, pages);
> +		if (offset != -1L)
> +			count_vm_events(MASK_BITMAP_SKIP, offset - start);
> +	}
> +	if (offset != -1L) {
> +		for (k = 0; k < pages; k++) {
> +			BUG_ON(test_bit(offset + k, mask_bitmap));
> +			set_bit(offset + k, mask_bitmap);
> +		}
> +		next_bit = offset + pages;
> +		if (next_bit >= mask_max_pfn)
> +			next_bit = start;
> +	}
> +	spin_unlock_irqrestore(&mask_bitmap_lock, flags);
> +	if (offset == -1L)
> +		return -1L;
> +
> +	offset += pages - 1;
> +	pfn = maskbm_index_to_pfn(offset);
> +
> +	BUG_ON(maskbm_index_to_pfn(offset) != pfn);
> +	return pfn;
> +}

	Hannes


* Re: [PATCH] [6/13] Core maskable allocator
  2008-03-07 10:53   ` Johannes Weiner
@ 2008-03-07 11:14     ` Andi Kleen
  0 siblings, 0 replies; 56+ messages in thread
From: Andi Kleen @ 2008-03-07 11:14 UTC (permalink / raw)
  To: Johannes Weiner; +Cc: Andi Kleen, linux-kernel, linux-mm

> > +		}
> > +#endif
> > +		p = alloc_pages(gfp_mask|__GFP_NOWARN, order);
> 
> ... isn't this a leak here?

You're right -- this needs a check. Weird, I wonder why the tests didn't
catch that.
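
Something like this is probably what's needed there (untested sketch
against the alloc_higher_pages() you quoted -- keep and return the
DMA32 page when it already fits instead of falling through and
overwriting it):

#ifdef CONFIG_ZONE_DMA32
	if (pfn <= (0xffffffff >> PAGE_SHIFT)) {
		p = alloc_pages(gfp_mask|GFP_DMA32|__GFP_NOWARN, order);
		if (p && page_to_pfn(p) >= pfn) {
			__free_pages(p, order);
			p = NULL;
		}
		/* fits below the limit -> use it, don't leak it below */
		if (p)
			return p;
	}
#endif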

Thanks,
-Andi


* Re: [PATCH] [0/13] General DMA zone rework
  2008-03-07  9:07 [PATCH] [0/13] General DMA zone rework Andi Kleen
                   ` (11 preceding siblings ...)
  2008-03-07  9:07 ` [PATCH] [13/13] Convert x86-64 swiotlb to use the mask allocator directly Andi Kleen
@ 2008-03-07 15:18 ` Rene Herman
  2008-03-07 15:22   ` Rene Herman
  2008-03-07 15:34   ` Andi Kleen
  2008-03-07 20:51 ` Luiz Fernando N. Capitulino
  2008-03-08  2:42 ` Christoph Lameter
  14 siblings, 2 replies; 56+ messages in thread
From: Rene Herman @ 2008-03-07 15:18 UTC (permalink / raw)
  To: Andi Kleen; +Cc: linux-kernel, linux-mm

On 07-03-08 10:07, Andi Kleen wrote:

> it to any size needed (upto 2GB currently). The default sizing 
> heuristics are for now the same as in the old code: by default
> all free memory below 16MB is put into the pool (in practice that
> is only ~8MB or so usable because the kernel is loaded there too)

Just a side-comment -- not necessarily, given CONFIG_PHYSICAL_START.

Rene.


* Re: [PATCH] [0/13] General DMA zone rework
  2008-03-07 15:18 ` [PATCH] [0/13] General DMA zone rework Rene Herman
@ 2008-03-07 15:22   ` Rene Herman
  2008-03-07 15:31     ` Andi Kleen
  2008-03-07 15:34   ` Andi Kleen
  1 sibling, 1 reply; 56+ messages in thread
From: Rene Herman @ 2008-03-07 15:22 UTC (permalink / raw)
  To: Andi Kleen; +Cc: linux-kernel, linux-mm

On 07-03-08 16:18, Rene Herman wrote:

> On 07-03-08 10:07, Andi Kleen wrote:
> 
>> it to any size needed (upto 2GB currently). The default sizing 
>> heuristics are for now the same as in the old code: by default
>> all free memory below 16MB is put into the pool (in practice that
>> is only ~8MB or so usable because the kernel is loaded there too)
> 
> Just a side-comment -- not necessarily, given CONFIG_PHYSICAL_START.

By the way, 1/13 didn't make it to LKML.

Rene.


* Re: [PATCH] [0/13] General DMA zone rework
  2008-03-07 15:22   ` Rene Herman
@ 2008-03-07 15:31     ` Andi Kleen
  0 siblings, 0 replies; 56+ messages in thread
From: Andi Kleen @ 2008-03-07 15:31 UTC (permalink / raw)
  To: Rene Herman; +Cc: Andi Kleen, linux-kernel, linux-mm

On Fri, Mar 07, 2008 at 04:22:12PM +0100, Rene Herman wrote:
> On 07-03-08 16:18, Rene Herman wrote:
> 
> >On 07-03-08 10:07, Andi Kleen wrote:
> >
> >>it to any size needed (upto 2GB currently). The default sizing 
> >>heuristics are for now the same as in the old code: by default
> >>all free memory below 16MB is put into the pool (in practice that
> >>is only ~8MB or so usable because the kernel is loaded there too)
> >
> >Just a side-comment -- not necessarily, given CONFIG_PHYSICAL_START.
> 
> By the way, 1/13 didn't make it to LKML.

Yes, I was told. Not sure what happened. I'm pretty sure my script
posted it. If anybody wants it, it's at
ftp://firstfloor.org/pub/ak/mask/patches/bitops

-Andi


* Re: [PATCH] [0/13] General DMA zone rework
  2008-03-07 15:18 ` [PATCH] [0/13] General DMA zone rework Rene Herman
  2008-03-07 15:22   ` Rene Herman
@ 2008-03-07 15:34   ` Andi Kleen
  1 sibling, 0 replies; 56+ messages in thread
From: Andi Kleen @ 2008-03-07 15:34 UTC (permalink / raw)
  To: Rene Herman; +Cc: Andi Kleen, linux-kernel, linux-mm

On Fri, Mar 07, 2008 at 04:18:55PM +0100, Rene Herman wrote:
> On 07-03-08 10:07, Andi Kleen wrote:
> 
> >it to any size needed (upto 2GB currently). The default sizing 
> >heuristics are for now the same as in the old code: by default
> >all free memory below 16MB is put into the pool (in practice that
> >is only ~8MB or so usable because the kernel is loaded there too)
> 
> Just a side-comment -- not necessarily, given CONFIG_PHYSICAL_START.

"In practice". People usually do not change that. Anyways the dma zone
will be fully compatible. If you have moved your kernel outside 
the 16MB area and there is nothing else there (normally there
is some other stuff there too) you'll get the full 16MB for the default 
compat pool.

-Andi


* Re: [PATCH] [6/13] Core maskable allocator
  2008-03-07  9:07 ` [PATCH] [6/13] Core maskable allocator Andi Kleen
  2008-03-07 10:53   ` Johannes Weiner
@ 2008-03-07 17:05   ` Randy Dunlap
  2008-03-07 17:31     ` Andi Kleen
  2008-03-07 21:13   ` Cyrill Gorcunov
                     ` (2 subsequent siblings)
  4 siblings, 1 reply; 56+ messages in thread
From: Randy Dunlap @ 2008-03-07 17:05 UTC (permalink / raw)
  To: Andi Kleen; +Cc: linux-kernel, linux-mm

On Fri,  7 Mar 2008 10:07:16 +0100 (CET) Andi Kleen wrote:

> 
> This is the core code of the maskable allocator. Introduction
> appended.

> Index: linux/Documentation/kernel-parameters.txt
> ===================================================================
> --- linux.orig/Documentation/kernel-parameters.txt
> +++ linux/Documentation/kernel-parameters.txt
> @@ -2116,6 +2116,9 @@ and is between 256 and 4096 characters. 
>  	norandmaps	Don't use address space randomization
>  			Equivalent to echo 0 > /proc/sys/kernel/randomize_va_space
>  
> +	maskzone=size[MG] Set size of maskable DMA zone to size.
> +		 force	Always allocate from the mask zone (for testing)

                 ^^^^^^^^^^^^^ ??

> +
>  ______________________________________________________________________
>  
>  TODO:

> Index: linux/Documentation/DocBook/kernel-api.tmpl
> ===================================================================
> --- linux.orig/Documentation/DocBook/kernel-api.tmpl
> +++ linux/Documentation/DocBook/kernel-api.tmpl
> @@ -164,6 +164,7 @@ X!Ilib/string.c
>  !Emm/memory.c
>  !Emm/vmalloc.c
>  !Imm/page_alloc.c
> +!Emm/mask-alloc.c
>  !Emm/mempool.c
>  !Emm/dmapool.c
>  !Emm/page-writeback.c

Thanks for the kernel-doc annotations.

---
~Randy


* Re: [PATCH] [6/13] Core maskable allocator
  2008-03-07 17:05   ` Randy Dunlap
@ 2008-03-07 17:31     ` Andi Kleen
  2008-03-07 17:33       ` Randy Dunlap
  0 siblings, 1 reply; 56+ messages in thread
From: Andi Kleen @ 2008-03-07 17:31 UTC (permalink / raw)
  To: Randy Dunlap; +Cc: Andi Kleen, linux-kernel, linux-mm

> > +	maskzone=size[MG] Set size of maskable DMA zone to size.
> > +		 force	Always allocate from the mask zone (for testing)
> 
>                  ^^^^^^^^^^^^^ ??

What is your question?

-Andi


* Re: [PATCH] [6/13] Core maskable allocator
  2008-03-07 17:31     ` Andi Kleen
@ 2008-03-07 17:33       ` Randy Dunlap
  2008-03-07 17:43         ` Andi Kleen
  0 siblings, 1 reply; 56+ messages in thread
From: Randy Dunlap @ 2008-03-07 17:33 UTC (permalink / raw)
  To: Andi Kleen; +Cc: linux-kernel, linux-mm

Andi Kleen wrote:
>>> +	maskzone=size[MG] Set size of maskable DMA zone to size.
>>> +		 force	Always allocate from the mask zone (for testing)
>>                  ^^^^^^^^^^^^^ ??
> 
> What is your question?

That line seems odd.  Is it correct?
Why 2 spaces between force and Always?  Why is Always capitalized?
Could one of those words be dropped?  They seem a bit redundant.

-- 
~Randy


* Re: [PATCH] [6/13] Core maskable allocator
  2008-03-07 17:33       ` Randy Dunlap
@ 2008-03-07 17:43         ` Andi Kleen
  2008-03-07 17:51           ` Randy Dunlap
  0 siblings, 1 reply; 56+ messages in thread
From: Andi Kleen @ 2008-03-07 17:43 UTC (permalink / raw)
  To: Randy Dunlap; +Cc: Andi Kleen, linux-kernel, linux-mm

On Fri, Mar 07, 2008 at 09:33:02AM -0800, Randy Dunlap wrote:
> Andi Kleen wrote:
> >>>+	maskzone=size[MG] Set size of maskable DMA zone to size.
> >>>+		 force	Always allocate from the mask zone (for testing)
> >>                 ^^^^^^^^^^^^^ ??
> >
> >What is your question?
> 
> That line seems odd.  Is it correct?
> Why 2 spaces between force and Always?  Why is Always capitalized?
> Could one of those words be dropped?  They seem a bit redundant.

The option is either maskzone=size[MG] or maskzone=force,
each followed by a sentence describing it.

I tried to make this clear by lining them up, but apparently failed.
Do you have a preferred way of formatting such multiple-choice options?

-Andi


* Re: [PATCH] [6/13] Core maskable allocator
  2008-03-07 17:43         ` Andi Kleen
@ 2008-03-07 17:51           ` Randy Dunlap
  0 siblings, 0 replies; 56+ messages in thread
From: Randy Dunlap @ 2008-03-07 17:51 UTC (permalink / raw)
  To: Andi Kleen; +Cc: linux-kernel, linux-mm

On Fri, 7 Mar 2008 18:43:16 +0100 Andi Kleen wrote:

> On Fri, Mar 07, 2008 at 09:33:02AM -0800, Randy Dunlap wrote:
> > Andi Kleen wrote:
> > >>>+	maskzone=size[MG] Set size of maskable DMA zone to size.
> > >>>+		 force	Always allocate from the mask zone (for testing)
> > >>                 ^^^^^^^^^^^^^ ??
> > >
> > >What is your question?
> > 
> > That line seems odd.  Is it correct?
> > Why 2 spaces between force and Always?  Why is Always capitalized?
> > Could one of those words be dropped?  They seem a bit redundant.
> 
> The option is either maskzone=size[MG] or maskzone=force,
> each followed by a sentence describing it.

Oh.  I see.

> I tried to make this clear by lining them up, but apparently failed.
> Do you have a preferred way of formatting such multiple-choice options?

I don't know that we have a certain/fixed way for that.

How about:

	maskzone=
		Format: {size[MG] | force}
		size -- Set size of maskable DMA zone to size
		force -- Always allocate from the mask zone (for testing)


---
~Randy


* Re: [PATCH] [4/13] Prepare page_alloc for the maskable allocator
  2008-03-07  9:07 ` [PATCH] [4/13] Prepare page_alloc for the maskable allocator Andi Kleen
@ 2008-03-07 18:19   ` Sam Ravnborg
  2008-03-07 18:36     ` Cyrill Gorcunov
  2008-03-07 19:02     ` Andi Kleen
  0 siblings, 2 replies; 56+ messages in thread
From: Sam Ravnborg @ 2008-03-07 18:19 UTC (permalink / raw)
  To: Andi Kleen; +Cc: linux-kernel, linux-mm

Hi Andi.

> Index: linux/mm/internal.h
> ===================================================================
> --- linux.orig/mm/internal.h
> +++ linux/mm/internal.h
> @@ -12,6 +12,7 @@
>  #define __MM_INTERNAL_H
>  
>  #include <linux/mm.h>
> +#include <linux/memcontrol.h>
>  
>  static inline void set_page_count(struct page *page, int v)
>  {
> @@ -48,6 +49,72 @@ static inline unsigned long page_order(s
>  	return page_private(page);
>  }
>  
> +extern void bad_page(struct page *page);
> +
> +static inline int free_pages_check(struct page *page, unsigned long addflags)
> +{
> +	if (unlikely(page_mapcount(page) |
> +		(page->mapping != NULL)  |
> +		(page_get_page_cgroup(page) != NULL) |
> +		(page_count(page) != 0)  |
> +		(page->flags & (
> +			addflags |
> +			1 << PG_lru	|
> +			1 << PG_private |
> +			1 << PG_locked	|
> +			1 << PG_active	|
> +			1 << PG_slab	|
> +			1 << PG_swapcache |
> +			1 << PG_writeback |
> +			1 << PG_reserved |
> +			1 << PG_buddy))))
> +		bad_page(page);
> +	if (PageDirty(page))
> +		__ClearPageDirty(page);
> +	/*
> +	 * For now, we report if PG_reserved was found set, but do not
> +	 * clear it, and do not free the page.  But we shall soon need
> +	 * to do more, for when the ZERO_PAGE count wraps negative.
> +	 */
> +	return PageReserved(page);
> +}
Looks a bit too big for an inline in a header (~9 lines of code)?

> +
> +/* Set up a struc page for business during allocation */
> +static inline int page_prep_struct(struct page *page)
> +{
> +	if (unlikely(page_mapcount(page) |
> +		(page->mapping != NULL)  |
> +		(page_get_page_cgroup(page) != NULL) |
> +		(page_count(page) != 0)  |
> +		(page->flags & (
> +			1 << PG_lru	|
> +			1 << PG_private	|
> +			1 << PG_locked	|
> +			1 << PG_active	|
> +			1 << PG_dirty	|
> +			1 << PG_slab    |
> +			1 << PG_swapcache |
> +			1 << PG_writeback |
> +			1 << PG_reserved |
> +			1 << PG_buddy))))
> +		bad_page(page);
> +
> +	/*
> +	 * For now, we report if PG_reserved was found set, but do not
> +	 * clear it, and do not allocate the page: as a safety net.
> +	 */
> +	if (PageReserved(page))
> +		return 1;
> +
> +	page->flags &= ~(1 << PG_uptodate | 1 << PG_error | 1 << PG_readahead |
> +			1 << PG_referenced | 1 << PG_arch_1 |
> +			1 << PG_owner_priv_1 | 1 << PG_mappedtodisk);
> +	set_page_private(page, 0);
> +	set_page_refcounted(page);
> +
> +	return 0;
> +}
Again - looks too big to inline..

But for both I do not know where they are used and how often.

	Sam


* Re: [PATCH] [8/13] Enable the mask allocator for x86
  2008-03-07  9:07 ` [PATCH] [8/13] Enable the mask allocator for x86 Andi Kleen
@ 2008-03-07 18:32   ` Sam Ravnborg
  2008-03-07 19:03     ` Andi Kleen
  2008-03-08  2:37   ` Christoph Lameter
  1 sibling, 1 reply; 56+ messages in thread
From: Sam Ravnborg @ 2008-03-07 18:32 UTC (permalink / raw)
  To: Andi Kleen; +Cc: linux-kernel, linux-mm

On Fri, Mar 07, 2008 at 10:07:18AM +0100, Andi Kleen wrote:
> 
> - Disable old ZONE_DMA
> No fixed size ZONE_DMA now anymore. All existing users of __GFP_DMA rely 
> on the compat call to the maskable allocator in alloc/free_pages
> - Call maskable allocator initialization functions at boot
> - Add TRAD_DMA_MASK for the compat functions 

I see TRAD_DMA_MASK used by patch 6 and patch 7, but only here
in this later patch it is defined.
Looks like we have build breakage - but I have not checked it.

	Sam


* Re: [PATCH] [4/13] Prepare page_alloc for the maskable allocator
  2008-03-07 18:19   ` Sam Ravnborg
@ 2008-03-07 18:36     ` Cyrill Gorcunov
  2008-03-07 19:02     ` Andi Kleen
  1 sibling, 0 replies; 56+ messages in thread
From: Cyrill Gorcunov @ 2008-03-07 18:36 UTC (permalink / raw)
  To: Sam Ravnborg; +Cc: Andi Kleen, linux-kernel, linux-mm

[Sam Ravnborg - Fri, Mar 07, 2008 at 07:19:43PM +0100]
| Hi Andi.
| 
| > Index: linux/mm/internal.h
| > ===================================================================
| > --- linux.orig/mm/internal.h
| > +++ linux/mm/internal.h
| > @@ -12,6 +12,7 @@
| >  #define __MM_INTERNAL_H
| >  
| >  #include <linux/mm.h>
| > +#include <linux/memcontrol.h>
| >  
| >  static inline void set_page_count(struct page *page, int v)
| >  {
| > @@ -48,6 +49,72 @@ static inline unsigned long page_order(s
| >  	return page_private(page);
| >  }
| >  
| > +extern void bad_page(struct page *page);
| > +
| > +static inline int free_pages_check(struct page *page, unsigned long addflags)
| > +{
| > +	if (unlikely(page_mapcount(page) |
| > +		(page->mapping != NULL)  |
| > +		(page_get_page_cgroup(page) != NULL) |
| > +		(page_count(page) != 0)  |
| > +		(page->flags & (
| > +			addflags |
| > +			1 << PG_lru	|
| > +			1 << PG_private |
| > +			1 << PG_locked	|
| > +			1 << PG_active	|
| > +			1 << PG_slab	|
| > +			1 << PG_swapcache |
| > +			1 << PG_writeback |
| > +			1 << PG_reserved |
| > +			1 << PG_buddy))))
| > +		bad_page(page);
| > +	if (PageDirty(page))
| > +		__ClearPageDirty(page);
| > +	/*
| > +	 * For now, we report if PG_reserved was found set, but do not
| > +	 * clear it, and do not free the page.  But we shall soon need
| > +	 * to do more, for when the ZERO_PAGE count wraps negative.
| > +	 */
| > +	return PageReserved(page);
| > +}
| Looks a bit too big for an inline in a header (~9 lines of code)?
| 

Well, it will not be that big in compiled form, because in reality it is

	page->flags & (addflags | (precompiled constant))

| > +
| > +/* Set up a struc page for business during allocation */
| > +static inline int page_prep_struct(struct page *page)
| > +{
| > +	if (unlikely(page_mapcount(page) |
| > +		(page->mapping != NULL)  |
| > +		(page_get_page_cgroup(page) != NULL) |
| > +		(page_count(page) != 0)  |
| > +		(page->flags & (
| > +			1 << PG_lru	|
| > +			1 << PG_private	|
| > +			1 << PG_locked	|
| > +			1 << PG_active	|
| > +			1 << PG_dirty	|
| > +			1 << PG_slab    |
| > +			1 << PG_swapcache |
| > +			1 << PG_writeback |
| > +			1 << PG_reserved |
| > +			1 << PG_buddy))))
| > +		bad_page(page);
| > +
| > +	/*
| > +	 * For now, we report if PG_reserved was found set, but do not
| > +	 * clear it, and do not allocate the page: as a safety net.
| > +	 */
| > +	if (PageReserved(page))
| > +		return 1;
| > +
| > +	page->flags &= ~(1 << PG_uptodate | 1 << PG_error | 1 << PG_readahead |
| > +			1 << PG_referenced | 1 << PG_arch_1 |
| > +			1 << PG_owner_priv_1 | 1 << PG_mappedtodisk);
| > +	set_page_private(page, 0);
| > +	set_page_refcounted(page);
| > +
| > +	return 0;
| > +}
| Again - looks too big to inline..
| 
| But for both I do not know where they are used and how often.
| 
| 	Sam
		- Cyrill -


* Re: [PATCH] [4/13] Prepare page_alloc for the maskable allocator
  2008-03-07 18:19   ` Sam Ravnborg
  2008-03-07 18:36     ` Cyrill Gorcunov
@ 2008-03-07 19:02     ` Andi Kleen
  1 sibling, 0 replies; 56+ messages in thread
From: Andi Kleen @ 2008-03-07 19:02 UTC (permalink / raw)
  To: Sam Ravnborg; +Cc: Andi Kleen, linux-kernel, linux-mm

> Again - looks too big to inline..
> 
> But for both I do not know where they are used and how often.

For the mask allocator it is not that critical, but for the main
allocator it is on the very very hot page allocation/free path. That is why
I didn't want to out-of-line it.

Note I didn't change it, just moved it.

-Andi


^ permalink raw reply	[flat|nested] 56+ messages in thread

* Re: [PATCH] [8/13] Enable the mask allocator for x86
  2008-03-07 18:32   ` Sam Ravnborg
@ 2008-03-07 19:03     ` Andi Kleen
  2008-03-07 19:09       ` Sam Ravnborg
  0 siblings, 1 reply; 56+ messages in thread
From: Andi Kleen @ 2008-03-07 19:03 UTC (permalink / raw)
  To: Sam Ravnborg; +Cc: Andi Kleen, linux-kernel, linux-mm

On Fri, Mar 07, 2008 at 07:32:55PM +0100, Sam Ravnborg wrote:
> On Fri, Mar 07, 2008 at 10:07:18AM +0100, Andi Kleen wrote:
> > 
> > - Disable old ZONE_DMA
> > No fixed size ZONE_DMA now anymore. All existing users of __GFP_DMA rely 
> > on the compat call to the maskable allocator in alloc/free_pages
> > - Call maskable allocator initialization functions at boot
> > - Add TRAD_DMA_MASK for the compat functions 
> 
> I see TRAD_DMA_MASK used by patch 6 and patch 7, but it is only
> defined here in this later patch.
> Looks like we have build breakage - but I have not checked it.

It is only used when the architecture defines CONFIG_MASK_ALLOC, and
no architecture does that until the later patches.
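
In rough shape it looks like this (a sketch only, not the actual patch
code; the value shown for TRAD_DMA_MASK is just assumed to be the
traditional 16MB limit):

	#ifdef CONFIG_MASK_ALLOC
	#define TRAD_DMA_MASK	0x00fffffful	/* assumed: 16MB / 24bit */

	/* compat hook in the page allocator: route __GFP_DMA callers
	 * to the maskable allocator instead of a fixed ZONE_DMA */
	if (gfp & __GFP_DMA)
		return alloc_pages_mask(gfp & ~__GFP_DMA,
					PAGE_SIZE << order, TRAD_DMA_MASK);
	#endif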

-Andi

^ permalink raw reply	[flat|nested] 56+ messages in thread

* Re: [PATCH] [8/13] Enable the mask allocator for x86
  2008-03-07 19:03     ` Andi Kleen
@ 2008-03-07 19:09       ` Sam Ravnborg
  0 siblings, 0 replies; 56+ messages in thread
From: Sam Ravnborg @ 2008-03-07 19:09 UTC (permalink / raw)
  To: Andi Kleen; +Cc: linux-kernel, linux-mm

On Fri, Mar 07, 2008 at 08:03:13PM +0100, Andi Kleen wrote:
> On Fri, Mar 07, 2008 at 07:32:55PM +0100, Sam Ravnborg wrote:
> > On Fri, Mar 07, 2008 at 10:07:18AM +0100, Andi Kleen wrote:
> > > 
> > > - Disable old ZONE_DMA
> > > No fixed size ZONE_DMA now anymore. All existing users of __GFP_DMA rely 
> > > on the compat call to the maskable allocator in alloc/free_pages
> > > - Call maskable allocator initialization functions at boot
> > > - Add TRAD_DMA_MASK for the compat functions 
> > 
> > I see TRAD_DMA_MASK used by patch 6 and patch 7, but it is only
> > defined here in this later patch.
> > Looks like we have build breakage - but I have not checked it.
> 
> It is only used when the architecture defines CONFIG_MASK_ALLOC, and
> no architecture does that until the later patches.
Understood - I just took a quick look and wanted to let you know.

	Sam

^ permalink raw reply	[flat|nested] 56+ messages in thread

* Re: [PATCH] [0/13] General DMA zone rework
  2008-03-07  9:07 [PATCH] [0/13] General DMA zone rework Andi Kleen
                   ` (12 preceding siblings ...)
  2008-03-07 15:18 ` [PATCH] [0/13] General DMA zone rework Rene Herman
@ 2008-03-07 20:51 ` Luiz Fernando N. Capitulino
  2008-03-08  0:46   ` Andi Kleen
  2008-03-08  2:42 ` Christoph Lameter
  14 siblings, 1 reply; 56+ messages in thread
From: Luiz Fernando N. Capitulino @ 2008-03-07 20:51 UTC (permalink / raw)
  To: Andi Kleen; +Cc: linux-kernel, linux-mm

Em Fri,  7 Mar 2008 10:07:10 +0100 (CET)
Andi Kleen <andi@firstfloor.org> escreveu:

| I chose to implement a new "maskable memory" allocator to solve these
| problems. The existing page buddy allocator is not really suited for
| this because the data structures don't allow cheap allocation by physical 
| address boundary. 

 These patches are supposed to work, I think?

 I've given them a try but ran into some problems. First, the
simple test case seems to fail miserably:

"""
testing mask alloc upto 24 bits
gpm1 3 mask 3fffff size 20440 total 62KB failed
gpm1 4 mask 3fffff size 24369 total 62KB failed
gpm1 6 mask 3fffff size 15255 total 64KB failed
gpm1 7 mask 3fffff size 12676 total 64KB failed
gpm1 8 mask 3fffff size 23917 total 64KB failed
gpm1 9 mask 3fffff size 11682 total 64KB failed
gpm1 10 mask 3fffff size 23091 total 64KB failed
gpm1 11 mask 3fffff size 16880 total 64KB failed
gpm1 12 mask 3fffff size 17257 total 64KB failed
gpm1 13 mask 3fffff size 8686 total 64KB failed
gpm1 14 mask 3fffff size 9871 total 64KB failed
gpm1 15 mask 3fffff size 19740 total 64KB failed
gpm1 16 mask 3fffff size 11557 total 64KB failed
gpm1 18 mask 3fffff size 23723 total 67KB failed
gpm1 19 mask 3fffff size 16136 total 67KB failed
gpm2 6 mask 3fffff size 4471 failed
gpm2 7 mask 3fffff size 16868 failed
gpm2 8 mask 3fffff size 22093 failed
gpm2 9 mask 3fffff size 17666 failed
gpm2 11 mask 3fffff size 14416 failed
gpm2 12 mask 3fffff size 10825 failed
gpm2 13 mask 3fffff size 3918 failed
gpm2 14 mask 3fffff size 6255 failed
gpm2 15 mask 3fffff size 2428 failed
gpm2 16 mask 3fffff size 517 failed
gpm2 18 mask 3fffff size 12890 failed
gpm2 19 mask 3fffff size 3211 failed
verify & free
mask fffff
mask 1fffff
mask 3fffff
mask 7fffff
mask ffffff
done
"""

 Then boot up goes on and while init is running I get this:

"""
Starting udev: ------------[ cut here ]------------
kernel BUG at mm/mask-alloc.c:178!
invalid opcode: 0000 [#1] 
Modules linked in: parport_pc(+) parport snd_via82xx(+) gameport snd_ac97_codec rtc_cmos ac97_bus rtc_core rtc_lib snd_pcm snd_timer snd_page_alloc snd_mpu401_uart snd_rawmidi snd_seq_device pcspkr snd soundcore evdev i2c_viapro i2c_core thermal button processor ohci1394 ide_cd_mod ieee1394 cdrom firewire_ohci shpchp via_agp firewire_core crc_itu_t agpgart pci_hotplug 8139too mii via82cxxx ide_disk ide_core ext3 jbd uhci_hcd ohci_hcd ehci_hcd usbcore

Pid: 1088, comm: modprobe Not tainted (2.6.25-0.rc4.1mdv #1)
EIP: 0060:[<c016d493>] EFLAGS: 00010206 CPU: 0
EIP is at alloc_pages_mask+0x74/0x3de
EAX: 00000000 EBX: 00001000 ECX: df8fb854 EDX: 000004e2
ESI: 00000000 EDI: 00010000 EBP: dfbf9c44 ESP: dfbf9bec
 DS: 007b ES: 007b FS: 0000 GS: 0033 SS: 0068
Process modprobe (pid: 1088, ti=dfbf8000 task=dfb9c1a0 task.ti=dfbf8000)
Stack: c0486858 004f1000 004f1000 dfbf9c04 00001000 000052d0 dfbf9c28 000fffff 
       00000001 000004e2 dfbf8000 00000000 c0496ea0 dfb9c1a0 00000000 dfbf9c40 
       c010ab33 00000000 c02c9250 00001000 00000000 00010000 dfbf9c54 c016d80b 
Call Trace:
 [<c010ab33>] ? save_stack_trace+0x1c/0x3a
 [<c016d80b>] ? get_pages_mask+0xe/0x23
 [<c0107b83>] ? dma_alloc_coherent+0xab/0xe5
 [<c010aab6>] ? save_stack_address+0x0/0x2c
 [<e095634f>] ? snd_dma_alloc_pages+0xef/0x1ef [snd_page_alloc]
 [<e0956a17>] ? snd_malloc_sgbuf_pages+0xdd/0x170 [snd_page_alloc]
 [<c0137e1f>] ? trace_hardirqs_on+0xe6/0x11b
 [<c02bf3ff>] ? __mutex_unlock_slowpath+0xf2/0x104
 [<e0956409>] ? snd_dma_alloc_pages+0x1a9/0x1ef [snd_page_alloc]
 [<c02bf419>] ? mutex_unlock+0x8/0xa
 [<e095689f>] ? snd_dma_get_reserved_buf+0xb4/0xbd [snd_page_alloc]
 [<e09a1d77>] ? snd_pcm_lib_preallocate_pages+0x5b/0x115 [snd_pcm]
 [<e09a1e62>] ? snd_pcm_lib_preallocate_pages_for_all+0x31/0x56 [snd_pcm]
 [<e09ae2ab>] ? snd_via82xx_probe+0xb22/0xe6d [snd_via82xx]
 [<c02bf043>] ? mutex_trylock+0xf3/0x10f
 [<e09ac083>] ? snd_via82xx_mixer_free_ac97+0x0/0x12 [snd_via82xx]
 [<c02c0785>] ? _spin_unlock+0x1d/0x20
 [<c01e02d4>] ? pci_match_device+0x9a/0xa5
 [<c01e0393>] ? pci_device_probe+0x39/0x59
 [<c0238115>] ? driver_probe_device+0x9f/0x119
 [<c0238274>] ? __driver_attach+0x4a/0x7f
 [<c02376ed>] ? bus_for_each_dev+0x39/0x5b
 [<c0237fad>] ? driver_attach+0x14/0x16
 [<c023822a>] ? __driver_attach+0x0/0x7f
 [<c0237dc7>] ? bus_add_driver+0x9d/0x1ae
 [<c0238433>] ? driver_register+0x47/0xa3
 [<c01e055f>] ? __pci_register_driver+0x40/0x6d
 [<e093f017>] ? alsa_card_via82xx_init+0x17/0x19 [snd_via82xx]
 [<c013ed7a>] ? sys_init_module+0x13b0/0x14a8
 [<c0122ae5>] ? __request_region+0x0/0x90
 [<c010496d>] ? sysenter_past_esp+0xb6/0xc9
 [<c0137e1f>] ? trace_hardirqs_on+0xe6/0x11b
 [<c0104924>] ? sysenter_past_esp+0x6d/0xc9
 =======================
Code: 00 74 1b 8b 45 b8 c7 45 d4 ff ff ff ff 48 c1 e8 0b ff 45 d4 d1 e8 75 f9 83 7d b8 0f 77 04 0f 0b eb fe f7 45 bc 05 40 00 00 74 04 <0f> 0b eb fe 8b 4d bc 80 cd 02 89 4d b0 d1 e9 89 4d ac ff 05 84 
EIP: [<c016d493>] alloc_pages_mask+0x74/0x3de SS:ESP 0068:dfbf9bec
---[ end trace c02107712b611dcb ]---
""""

 How can I help to debug this?

 Btw, I've created a package to test this for my convenience
(that's why you see 'mdv' in the kernel name), but it's a vanilla
kernel with your patches on top.

-- 
Luiz Fernando N. Capitulino

^ permalink raw reply	[flat|nested] 56+ messages in thread

* Re: [PATCH] [6/13] Core maskable allocator
  2008-03-07  9:07 ` [PATCH] [6/13] Core maskable allocator Andi Kleen
  2008-03-07 10:53   ` Johannes Weiner
  2008-03-07 17:05   ` Randy Dunlap
@ 2008-03-07 21:13   ` Cyrill Gorcunov
  2008-03-07 23:28     ` Andi Kleen
  2008-03-08  5:03   ` KAMEZAWA Hiroyuki
  2008-03-11 15:34   ` Jonathan Corbet
  4 siblings, 1 reply; 56+ messages in thread
From: Cyrill Gorcunov @ 2008-03-07 21:13 UTC (permalink / raw)
  To: Andi Kleen; +Cc: linux-kernel, linux-mm

[Andi Kleen - Fri, Mar 07, 2008 at 10:07:16AM +0100]
[...]
| +
| +/**
| + * alloc_pages_mask - Alloc page(s) in a specific address range.
| + * @gfp:      Standard GFP mask. See get_free_pages for a list of valid flags.
| + * @size:     Allocate size worth of pages. Rounded up to PAGE_SIZE.
| + * @mask:     Memory must fit into mask physical address.
| + *
| + * Returns a struct page *
| + *
| + * Manage the dedicated maskable low memory zone. This zone is isolated
| + * from the normal zones. It is only a single continuous zone.
| + * The main difference from the standard allocator is that it tries
| + * to allocate memory with a physical address fitting in the passed mask.
| + *
| + * Warning: the size is in bytes, not an order as with get_free_pages.
| + */
| +struct page *
| +alloc_pages_mask(gfp_t gfp, unsigned size, u64 mask)
| +{
| +	unsigned long max_pfn = mask >> PAGE_SHIFT;
| +	unsigned pages = (size + PAGE_SIZE - 1) >> PAGE_SHIFT;
| +	struct page *p;
| +	unsigned left = (gfp & __GFP_REPEAT) ? ~0 : mask_timeout, oleft;
| +	unsigned order = get_order(size);
| +
| +	BUG_ON(size < MASK_MIN_SIZE);	/* You likely passed order by mistake */
| +	BUG_ON(gfp & (__GFP_DMA|__GFP_DMA32|__GFP_COMP));
| +
| +	/* Add fault injection here */
| +
| +again:
| +	count_vm_event(MASK_ALLOC);
| +	if (!force_mask) {
| +		/* First try normal allocation in suitable zones
| +		 * RED-PEN if size fits very badly in PS<<order don't do this?
| +		 */
| +		p = alloc_higher_pages(gfp, order, max_pfn);
| +
| +		/*
| +		 * If the mask covers everything don't bother with the low zone
| +		 * This way we avoid running out of low memory on a higher zones
| +		 * OOM too.
| +		 */
| +		if (p != NULL || max_pfn >= max_low_pfn) {

Andi, I'm a little confused by _these_ statistics. We could get p == NULL
there and still bump MASK_HIGH_WASTE even though nothing was actually
allocated. Am I wrong or missing something? Or maybe there should be
'&&' instead of '||'?

| +			count_vm_event(MASK_HIGHER);
| +			count_vm_events(MASK_HIGH_WASTE,
| +					(PAGE_SIZE << order) - size);
| +			return p;
| +		}
| +	}
| +
| +	might_sleep_if(gfp & __GFP_WAIT);
| +	do {
| +		int i;
| +		long pfn;
| +
| +		/* Implement waiter fairness queueing here? */
| +
| +		pfn = alloc_mask(pages, max_pfn);
| +		if (pfn != -1L) {
| +			p = pfn_to_page(pfn);
| +
| +			Mprintk("mask page %lx size %d mask %Lx\n",
| +			       po, size, mask);
| +
| +			BUG_ON(pfn + pages > mask_max_pfn);
| +
| +			if (page_prep_struct(p))
| +				goto again;
| +
| +			kernel_map_pages(p, pages, 1);
| +
| +			for (i = 0; i < pages; i++) {
| +				struct page *n = p + i;
| +				BUG_ON(!test_bit(pfn_to_maskbm_index(pfn+i),
| +						mask_bitmap));
| +				BUG_ON(!PageMaskAlloc(n));
| +				arch_alloc_page(n, 0);
| +				if (gfp & __GFP_ZERO)
| +					clear_page(page_address(n));
| +			}
| +
| +			count_vm_events(MASK_LOW_WASTE, pages*PAGE_SIZE-size);
| +			return p;
| +		}
| +
| +		if (!(gfp & __GFP_WAIT))
| +			break;
| +
| +		oleft = left;
| +		left = wait_for_mask_free(left);
| +		count_vm_events(MASK_WAIT, left - oleft);
| +	} while (left > 0);
| +
| +	if (!(gfp & __GFP_NOWARN)) {
| +		printk(KERN_ERR
| +		"%s: Cannot allocate maskable memory size %u gfp %x mask %Lx\n",
| +				current->comm, size, gfp, mask);
| +		dump_stack();
| +	}
| +	return NULL;
| +}
| +EXPORT_SYMBOL(alloc_pages_mask);
| +
[...]
		- Cyrill -

^ permalink raw reply	[flat|nested] 56+ messages in thread

* Re: [PATCH] [6/13] Core maskable allocator
  2008-03-07 21:13   ` Cyrill Gorcunov
@ 2008-03-07 23:28     ` Andi Kleen
  0 siblings, 0 replies; 56+ messages in thread
From: Andi Kleen @ 2008-03-07 23:28 UTC (permalink / raw)
  To: Cyrill Gorcunov; +Cc: Andi Kleen, linux-kernel, linux-mm

> Andi, I'm a little confused by _these_ statistics. We could get p == NULL
> there and still bump MASK_HIGH_WASTE even though nothing was actually
> allocated. Am I wrong or missing something? Or maybe there should be
> '&&' instead of '||'?

You're right, the statistics counter is increased incorrectly in the
p == NULL case. I'll fix that, thanks. || is correct, see the comment
above.
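
The fix will be along these lines (sketched against the quoted hunk, not
necessarily the exact final form):

	p = alloc_higher_pages(gfp, order, max_pfn);
	if (p != NULL) {
		/* only account the waste when we really got memory */
		count_vm_event(MASK_HIGHER);
		count_vm_events(MASK_HIGH_WASTE, (PAGE_SIZE << order) - size);
		return p;
	}
	/* mask covers all of low memory: don't fall back to the mask
	 * zone, just fail like the normal allocator would */
	if (max_pfn >= max_low_pfn)
		return NULL;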

-Andi

^ permalink raw reply	[flat|nested] 56+ messages in thread

* Re: [PATCH] [0/13] General DMA zone rework
  2008-03-07 20:51 ` Luiz Fernando N. Capitulino
@ 2008-03-08  0:46   ` Andi Kleen
  2008-03-10 18:03     ` Luiz Fernando N. Capitulino
  0 siblings, 1 reply; 56+ messages in thread
From: Andi Kleen @ 2008-03-08  0:46 UTC (permalink / raw)
  To: Luiz Fernando N. Capitulino; +Cc: Andi Kleen, linux-kernel, linux-mm

On Fri, Mar 07, 2008 at 05:51:48PM -0300, Luiz Fernando N. Capitulino wrote:
> Em Fri,  7 Mar 2008 10:07:10 +0100 (CET)
> Andi Kleen <andi@firstfloor.org> escreveu:
> 
> | I chose to implement a new "maskable memory" allocator to solve these
> | problems. The existing page buddy allocator is not really suited for
> | this because the data structures don't allow cheap allocation by physical 
> | address boundary. 
> 
>  These patches are supposed to work, I think?

Yes, they work fine here and survived quite a bit of stress testing.
But of course I only have a limited set of configurations.

> 
>  I've given them a try but ran into some problems. First, the
> simple test case seems to fail miserably:

Hmm I guess you got a pretty filled up 16MB area already.
Do you have a full log? It just couldn't allocate some memory
for the 24bit mask, but that is likely because it just ran out of 
memory.

I suppose it will work if you cut down the allocations in the 
test case a bit, e.g. decrease NUMALLOC to 10 and perhaps MAX_LEN to
5*PAGE_SIZE. Did that in my copy.
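
That is, roughly:

	#define NUMALLOC	10			/* reduced */
	#define MAX_LEN		(5 * PAGE_SIZE)		/* reduced */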

> 
> """
> testing mask alloc upto 24 bits
> gpm1 3 mask 3fffff size 20440 total 62KB failed
> gpm1 4 mask 3fffff size 24369 total 62KB failed
> gpm1 6 mask 3fffff size 15255 total 64KB failed
> gpm1 7 mask 3fffff size 12676 total 64KB failed
> gpm1 8 mask 3fffff size 23917 total 64KB failed
> gpm1 9 mask 3fffff size 11682 total 64KB failed
> gpm1 10 mask 3fffff size 23091 total 64KB failed
> gpm1 11 mask 3fffff size 16880 total 64KB failed
> gpm1 12 mask 3fffff size 17257 total 64KB failed
> gpm1 13 mask 3fffff size 8686 total 64KB failed
> gpm1 14 mask 3fffff size 9871 total 64KB failed
> gpm1 15 mask 3fffff size 19740 total 64KB failed
> gpm1 16 mask 3fffff size 11557 total 64KB failed
> gpm1 18 mask 3fffff size 23723 total 67KB failed
> gpm1 19 mask 3fffff size 16136 total 67KB failed
> gpm2 6 mask 3fffff size 4471 failed
> gpm2 7 mask 3fffff size 16868 failed
> gpm2 8 mask 3fffff size 22093 failed
> gpm2 9 mask 3fffff size 17666 failed
> gpm2 11 mask 3fffff size 14416 failed
> gpm2 12 mask 3fffff size 10825 failed
> gpm2 13 mask 3fffff size 3918 failed
> gpm2 14 mask 3fffff size 6255 failed
> gpm2 15 mask 3fffff size 2428 failed
> gpm2 16 mask 3fffff size 517 failed
> gpm2 18 mask 3fffff size 12890 failed
> gpm2 19 mask 3fffff size 3211 failed
> verify & free
> mask fffff
> mask 1fffff
> mask 3fffff
> mask 7fffff
> mask ffffff
> done
> """
> 
>  Then boot up goes on and while init is running I get this:

Ah, I see the problem. Your sound driver allocates a DMA area < 16 bytes.
I had added a BUG_ON for that to catch mistakes (passing an order instead
of a size), but it triggers incorrectly here.

I didn't see that in my testing.

I put up an updated patchkit on ftp://firstfloor.org/pub/ak/mask/patches/
It also has some other fixes.

Can you retest with that please?

Thanks for testing.

-Andi

^ permalink raw reply	[flat|nested] 56+ messages in thread

* Re: [PATCH] [5/13] Add mask allocator statistics to vmstat.[ch]
  2008-03-07  9:07 ` [PATCH] [5/13] Add mask allocator statistics to vmstat.[ch] Andi Kleen
@ 2008-03-08  2:24   ` Christoph Lameter
  0 siblings, 0 replies; 56+ messages in thread
From: Christoph Lameter @ 2008-03-08  2:24 UTC (permalink / raw)
  To: Andi Kleen; +Cc: linux-kernel, linux-mm

Reviewed-by: Christoph Lameter <clameter@sgi.com>



^ permalink raw reply	[flat|nested] 56+ messages in thread

* Re: [PATCH] [8/13] Enable the mask allocator for x86
  2008-03-07  9:07 ` [PATCH] [8/13] Enable the mask allocator for x86 Andi Kleen
  2008-03-07 18:32   ` Sam Ravnborg
@ 2008-03-08  2:37   ` Christoph Lameter
  2008-03-08  6:35     ` Yinghai Lu
  2008-03-08 11:54     ` Andi Kleen
  1 sibling, 2 replies; 56+ messages in thread
From: Christoph Lameter @ 2008-03-08  2:37 UTC (permalink / raw)
  To: Andi Kleen; +Cc: linux-kernel, linux-mm

On Fri, 7 Mar 2008, Andi Kleen wrote:

> - Disable old ZONE_DMA
> No fixed size ZONE_DMA now anymore. All existing users of __GFP_DMA rely 
> on the compat call to the maskable allocator in alloc/free_pages
> - Call maskable allocator initialization functions at boot
> - Add TRAD_DMA_MASK for the compat functions 
> - Remove dma_reserve call

This looks okay for the disabling part. But note that there are various 
uses of MAX_DMA_ADDRESS (sparsemem, bootmem allocator) that are currently 
suboptimal because they set a boundary at 16MB for allocation of 
potentially large operating system structures. That boundary continues to 
exist despite the removal of ZONE_DMA?

It would be better to remove ZONE_DMA32 instead and enlarge ZONE_DMA so 
that it can take over the role of ZONE_DMA32. Set the boundary for 
MAX_DMA_ADDRESS to the boundary for ZONE_DMA32. Then the 
allocations for sparse and bootmem will be allocated above 4GB which 
leaves lots of the lower space available for 32 bit DMA capable devices.

Removal of ZONE_DMA32 would allow us to remove support for that zone from 
the kernel in general since x86_64 is the only user of that zone.

Trouble is likely that there are existing users that have the expectation 
of a 4GB boundary on ZONE_DMA32 and 16MB on ZONE_DMA (particularly old 
drivers). Could we not clean that up?



^ permalink raw reply	[flat|nested] 56+ messages in thread

* Re: [PATCH] [12/13] Add vmstat statistics for new swiotlb code
  2008-03-07  9:07 ` [PATCH] [12/13] Add vmstat statistics for new swiotlb code Andi Kleen
@ 2008-03-08  2:38   ` Christoph Lameter
  0 siblings, 0 replies; 56+ messages in thread
From: Christoph Lameter @ 2008-03-08  2:38 UTC (permalink / raw)
  To: Andi Kleen; +Cc: linux-kernel, linux-mm

Reviewed-by: Christoph Lameter <clameter@sgi.com>


^ permalink raw reply	[flat|nested] 56+ messages in thread

* Re: [PATCH] [0/13] General DMA zone rework
  2008-03-07  9:07 [PATCH] [0/13] General DMA zone rework Andi Kleen
                   ` (13 preceding siblings ...)
  2008-03-07 20:51 ` Luiz Fernando N. Capitulino
@ 2008-03-08  2:42 ` Christoph Lameter
  2008-03-08 11:57   ` Andi Kleen
  14 siblings, 1 reply; 56+ messages in thread
From: Christoph Lameter @ 2008-03-08  2:42 UTC (permalink / raw)
  To: Andi Kleen; +Cc: linux-kernel, linux-mm

On Fri, 7 Mar 2008, Andi Kleen wrote:

> The longer term goal is to convert all GFP_DMA allocations
> to always specify the correct mask and then eventually remove
> GFP_DMA.
> 
> Especially I hope kmalloc/kmem_cache_alloc GFP_DMA can be
> removed soon. I have some patches to eliminate those users.
> Then slab wouldn't need to maintain DMA caches anymore.

That would be greatly appreciated. Thanks for doing this.



^ permalink raw reply	[flat|nested] 56+ messages in thread

* Re: [PATCH] [6/13] Core maskable allocator
  2008-03-07  9:07 ` [PATCH] [6/13] Core maskable allocator Andi Kleen
                     ` (2 preceding siblings ...)
  2008-03-07 21:13   ` Cyrill Gorcunov
@ 2008-03-08  5:03   ` KAMEZAWA Hiroyuki
  2008-03-08  5:41     ` KAMEZAWA Hiroyuki
  2008-03-08 11:41     ` Andi Kleen
  2008-03-11 15:34   ` Jonathan Corbet
  4 siblings, 2 replies; 56+ messages in thread
From: KAMEZAWA Hiroyuki @ 2008-03-08  5:03 UTC (permalink / raw)
  To: Andi Kleen; +Cc: linux-kernel, linux-mm

On Fri,  7 Mar 2008 10:07:16 +0100 (CET)
Andi Kleen <andi@firstfloor.org> wrote:

> +static int __init setup_maskzone(char *s)
> +{
> +	do {
> +		if (isdigit(*s)) {
> +			mask_zone_size = memparse(s, &s);
> +		} else if (!strncmp(s, "force", 5)) {
> +			force_mask = 1;
> +			s += 5;
> +		} else
> +			return -EINVAL;
> +		if (*s == ',')
> +			++s;
> +	} while (*s);
> +	return 0;
> +}
> +early_param("maskzone", setup_maskzone);

please confirm mask_zone_size is aligned to MAX_ORDER.

Thanks,
-Kame


^ permalink raw reply	[flat|nested] 56+ messages in thread

* Re: [PATCH] [6/13] Core maskable allocator
  2008-03-08  5:03   ` KAMEZAWA Hiroyuki
@ 2008-03-08  5:41     ` KAMEZAWA Hiroyuki
  2008-03-08 11:41     ` Andi Kleen
  1 sibling, 0 replies; 56+ messages in thread
From: KAMEZAWA Hiroyuki @ 2008-03-08  5:41 UTC (permalink / raw)
  To: KAMEZAWA Hiroyuki; +Cc: Andi Kleen, linux-kernel, linux-mm

On Sat, 8 Mar 2008 14:03:34 +0900
KAMEZAWA Hiroyuki <kamezawa.hiroyu@jp.fujitsu.com> wrote:

> On Fri,  7 Mar 2008 10:07:16 +0100 (CET)
> Andi Kleen <andi@firstfloor.org> wrote:
> 
> > +static int __init setup_maskzone(char *s)
> > +{
> > +	do {
> > +		if (isdigit(*s)) {
> > +			mask_zone_size = memparse(s, &s);
> > +		} else if (!strncmp(s, "force", 5)) {
> > +			force_mask = 1;
> > +			s += 5;
> > +		} else
> > +			return -EINVAL;
> > +		if (*s == ',')
> > +			++s;
> > +	} while (*s);
> > +	return 0;
> > +}
> > +early_param("maskzone", setup_maskzone);
> 
> please confirm mask_zone_size is aligned to MAX_ORDER.
> 
Sorry, please ignore above comment :(

Thanks,
-Kame


^ permalink raw reply	[flat|nested] 56+ messages in thread

* Re: [PATCH] [8/13] Enable the mask allocator for x86
  2008-03-08  2:37   ` Christoph Lameter
@ 2008-03-08  6:35     ` Yinghai Lu
  2008-03-08  7:31       ` Christoph Lameter
  2008-03-08 11:54     ` Andi Kleen
  1 sibling, 1 reply; 56+ messages in thread
From: Yinghai Lu @ 2008-03-08  6:35 UTC (permalink / raw)
  To: Christoph Lameter; +Cc: Andi Kleen, linux-kernel, linux-mm

On Fri, Mar 7, 2008 at 6:37 PM, Christoph Lameter <clameter@sgi.com> wrote:
> On Fri, 7 Mar 2008, Andi Kleen wrote:
>
>  > - Disable old ZONE_DMA
>  > No fixed size ZONE_DMA now anymore. All existing users of __GFP_DMA rely
>  > on the compat call to the maskable allocator in alloc/free_pages
>  > - Call maskable allocator initialization functions at boot
>  > - Add TRAD_DMA_MASK for the compat functions
>  > - Remove dma_reserve call
>
>  This looks okay for the disabling part. But note that there are various
>  uses of MAX_DMA_ADDRESS (sparsemem, bootmem allocator) that are currently
>  suboptimal because they set a boundary at 16MB for allocation of
>  potentially large operating system structures. That boundary continues to
>  exist despite the removal of ZONE_DMA?
>
>  It would be better to remove ZONE_DMA32 instead and enlarge ZONE_DMA so
>  that it can take over the role of ZONE_DMA32. Set the boundary for
>  MAX_DMA_ADDRESS to the boundary for ZONE_DMA32. Then the
>  allocations for sparse and bootmem will be allocated above 4GB which
>  leaves lots of the lower space available for 32 bit DMA capable devices.

good. i like the idea...

How about system with only 4G or less?

YH

^ permalink raw reply	[flat|nested] 56+ messages in thread

* Re: [PATCH] [8/13] Enable the mask allocator for x86
  2008-03-08  6:35     ` Yinghai Lu
@ 2008-03-08  7:31       ` Christoph Lameter
  0 siblings, 0 replies; 56+ messages in thread
From: Christoph Lameter @ 2008-03-08  7:31 UTC (permalink / raw)
  To: Yinghai Lu; +Cc: Andi Kleen, linux-kernel, linux-mm

On Fri, 7 Mar 2008, Yinghai Lu wrote:

> How about system with only 4G or less?

Then it does not matter unless the devices can only access 1G or 2G.


^ permalink raw reply	[flat|nested] 56+ messages in thread

* Re: [PATCH] [6/13] Core maskable allocator
  2008-03-08  5:03   ` KAMEZAWA Hiroyuki
  2008-03-08  5:41     ` KAMEZAWA Hiroyuki
@ 2008-03-08 11:41     ` Andi Kleen
  1 sibling, 0 replies; 56+ messages in thread
From: Andi Kleen @ 2008-03-08 11:41 UTC (permalink / raw)
  To: KAMEZAWA Hiroyuki; +Cc: Andi Kleen, linux-kernel, linux-mm

> > +early_param("maskzone", setup_maskzone);
> 
> please confirm mask_zone_size is aligned to MAX_ORDER.

It's not. Since the buddy allocator is not used, MAX_ORDER doesn't really
exist for the mask allocator. It can handle arbitrary-size allocations.

Also, strictly speaking, the mask zone is part of the lowmem zone.

-Andi


^ permalink raw reply	[flat|nested] 56+ messages in thread

* Re: [PATCH] [8/13] Enable the mask allocator for x86
  2008-03-08  2:37   ` Christoph Lameter
  2008-03-08  6:35     ` Yinghai Lu
@ 2008-03-08 11:54     ` Andi Kleen
  2008-03-10 17:13       ` Christoph Lameter
  1 sibling, 1 reply; 56+ messages in thread
From: Andi Kleen @ 2008-03-08 11:54 UTC (permalink / raw)
  To: Christoph Lameter; +Cc: Andi Kleen, linux-kernel, linux-mm

On Fri, Mar 07, 2008 at 06:37:28PM -0800, Christoph Lameter wrote:
> On Fri, 7 Mar 2008, Andi Kleen wrote:
> 
> > - Disable old ZONE_DMA
> > No fixed size ZONE_DMA now anymore. All existing users of __GFP_DMA rely 
> > on the compat call to the maskable allocator in alloc/free_pages
> > - Call maskable allocator initialization functions at boot
> > - Add TRAD_DMA_MASK for the compat functions 
> > - Remove dma_reserve call
> 
> This looks okay for the disabling part. But note that there are various 
> uses of MAX_DMA_ADDRESS (sparsemem, bootmem allocator) that are currently 

I didn't bother to change bootmem because the mask allocation runs
sufficiently early that it just allocates the memory it needs, and bootmem
then skips that because it sees it as used. Moving it up would just save
a few cycles spent skipping bits in the bitmap. OK, it wouldn't hurt,
I guess, but it also wouldn't really help.

What sparsemem reference do you mean?

> suboptimal because they set a boundary at 16MB for allocation of 
> potentially large operating system structures. That boundary continues to 
> exist despite the removal of ZONE_DMA?

It still exists, but is variable now.
 
> 
> It would be better to remove ZONE_DMA32 instead and enlarge ZONE_DMA so 
> that it can take over the role of ZONE_DMA32. Set the boundary for 

I didn't want to do this because the DMA zone is ill defined (see my
long intro in 0/0), so really nobody should be using it directly.

ZONE_DMA32, on the other hand, actually makes sense, but changing the
semantics under such a large driver base isn't a good idea.

My goal is still to eliminate all the GFP_DMAs for x86
(there is not that much work left ...) and then just
remove the compat hooks. The few GFP_DMA32 users can stay
though.

> MAX_DMA_ADDRESS to the boundary for ZONE_DMA32. Then the 
> allocations for sparse and bootmem will be allocated above 4GB which 

It depends on the requirements of the bootmem user. Some do need
memory <4GB, some do not. The mask allocator itself is in fact a client,
and it of course requires low memory.

-Andi

^ permalink raw reply	[flat|nested] 56+ messages in thread

* Re: [PATCH] [0/13] General DMA zone rework
  2008-03-08  2:42 ` Christoph Lameter
@ 2008-03-08 11:57   ` Andi Kleen
  2008-03-10 17:14     ` Christoph Lameter
  0 siblings, 1 reply; 56+ messages in thread
From: Andi Kleen @ 2008-03-08 11:57 UTC (permalink / raw)
  To: Christoph Lameter; +Cc: Andi Kleen, linux-kernel, linux-mm

On Fri, Mar 07, 2008 at 06:42:21PM -0800, Christoph Lameter wrote:
> On Fri, 7 Mar 2008, Andi Kleen wrote:
> 
> > The longer term goal is to convert all GFP_DMA allocations
> > to always specify the correct mask and then eventually remove
> > GFP_DMA.
> > 
> > Especially I hope kmalloc/kmem_cache_alloc GFP_DMA can be
> > removed soon. I have some patches to eliminate those users.
> > Then slab wouldn't need to maintain DMA caches anymore.
> 
> That would be greatly appreciated. Thanks for doing this.

I'm afraid it would not help you directly because you would still need
to maintain that code for s390 (which seems to be a heavy GFP_DMA user)
and probably some other architectures (unless you can get those
maintainers to get rid of GFP_DMA too). With my plan it can just be
ifdefed and the ifdef left disabled on x86.

-Andi

^ permalink raw reply	[flat|nested] 56+ messages in thread

* Re: [PATCH] [8/13] Enable the mask allocator for x86
  2008-03-08 11:54     ` Andi Kleen
@ 2008-03-10 17:13       ` Christoph Lameter
  0 siblings, 0 replies; 56+ messages in thread
From: Christoph Lameter @ 2008-03-10 17:13 UTC (permalink / raw)
  To: Andi Kleen; +Cc: linux-kernel, linux-mm

On Sat, 8 Mar 2008, Andi Kleen wrote:

> What sparsemem reference do you mean?

Its just the use of MAX_DMA_ADDRESS in mm/sparse-vmemmap.c

> ZONE_DMA32, on the other hand, actually makes sense, but changing the
> semantics under such a large driver base isn't a good idea.

The driver base can only be the x86_64-only device drivers, since the zone
is not used by any other architecture. That is fairly small AFAICT.

> It depends on the requirements of the bootmem user. Some do need
> memory <4GB, some do not. The mask allocator itself is in fact a client,
> and it of course requires low memory.

The point is that it would be good to relocate as much memory allocated at 
boot as possible beyond 4GB.


^ permalink raw reply	[flat|nested] 56+ messages in thread

* Re: [PATCH] [0/13] General DMA zone rework
  2008-03-08 11:57   ` Andi Kleen
@ 2008-03-10 17:14     ` Christoph Lameter
  0 siblings, 0 replies; 56+ messages in thread
From: Christoph Lameter @ 2008-03-10 17:14 UTC (permalink / raw)
  To: Andi Kleen; +Cc: linux-kernel, linux-mm

On Sat, 8 Mar 2008, Andi Kleen wrote:

> I'm afraid it would not help you directly because you would still need
> to maintain that code for s390 (which seems to be a heavy GFP_DMA user)
> and probably some other architectures (unless you can get those
> maintainers to get rid of GFP_DMA too). With my plan it can just be
> ifdefed and the ifdef left disabled on x86.

Undefining ZONE_DMA will remove support for GFP_DMA from the slab 
allocators. Your patch is already doing that.


^ permalink raw reply	[flat|nested] 56+ messages in thread

* Re: [PATCH] [0/13] General DMA zone rework
  2008-03-08  0:46   ` Andi Kleen
@ 2008-03-10 18:03     ` Luiz Fernando N. Capitulino
  2008-03-10 18:08       ` Andi Kleen
  0 siblings, 1 reply; 56+ messages in thread
From: Luiz Fernando N. Capitulino @ 2008-03-10 18:03 UTC (permalink / raw)
  To: Andi Kleen; +Cc: linux-kernel, linux-mm

Em Sat, 8 Mar 2008 01:46:54 +0100
Andi Kleen <andi@firstfloor.org> escreveu:

| >  I've given them a try but ran into some problems. First, the
| > simple test case seems to fail miserably:
| 
| Hmm I guess you got a pretty filled up 16MB area already.
| Do you have a full log? It just couldn't allocate some memory
| for the 24bit mask, but that is likely because it just ran out of 
| memory.

 You'll find the full log here:

http://users.mandriva.com.br/~lcapitulino/tmp/minicom.cap

| I suppose it will work if you cut down the allocations in the 
| test case a bit, e.g. decrease NUMALLOC to 10 and perhaps MAX_LEN to
| 5*PAGE_SIZE. Did that in my copy.

[...]

| I put up an updated patchkit on ftp://firstfloor.org/pub/ak/mask/patches/
| It also has some other fixes.
| 
| Can you retest with that please?

 Yes, will do. But I have something to suggest.

 I saw that the patches you've posted here are only a subset of
the patches you have in your FTP. Should I test all the patches
you have or only the subset you posted?

 If you want to get only the subset tested, it would be nice to
create a new directory on the FTP for them; that would make
my life a bit easier.

-- 
Luiz Fernando N. Capitulino

^ permalink raw reply	[flat|nested] 56+ messages in thread

* Re: [PATCH] [0/13] General DMA zone rework
  2008-03-10 18:03     ` Luiz Fernando N. Capitulino
@ 2008-03-10 18:08       ` Andi Kleen
  2008-03-11 17:26         ` Luiz Fernando N. Capitulino
  0 siblings, 1 reply; 56+ messages in thread
From: Andi Kleen @ 2008-03-10 18:08 UTC (permalink / raw)
  To: Luiz Fernando N. Capitulino; +Cc: Andi Kleen, linux-kernel, linux-mm

>  I saw that the patches you've posted here are only a subset of
> the patches you have in your FTP. Should I test all the patches

The ftp directory contains the SCSI patchkit (which I have been
posting to linux-scsi for quite some time, but it hasn't been applied
yet) and the block patchkit (which hit linux-kernel earlier, but has
unfortunately not been graced with a reply from the block maintainer so
far).

> you have or only the subset you posted?

Best to test them all together. The subsets are independent
(as in, the kernel should work with any subset of them), but
they all clean up DMA-memory-related issues.

-Andi

^ permalink raw reply	[flat|nested] 56+ messages in thread

* Re: [PATCH] [6/13] Core maskable allocator
  2008-03-07  9:07 ` [PATCH] [6/13] Core maskable allocator Andi Kleen
                     ` (3 preceding siblings ...)
  2008-03-08  5:03   ` KAMEZAWA Hiroyuki
@ 2008-03-11 15:34   ` Jonathan Corbet
  2008-03-11 15:54     ` Andi Kleen
  4 siblings, 1 reply; 56+ messages in thread
From: Jonathan Corbet @ 2008-03-11 15:34 UTC (permalink / raw)
  To: Andi Kleen; +Cc: linux-kernel, linux-mm

Hi, Andi,

As I dig through this patch, I find it mostly makes sense; seems like it
could be a good idea.  I did have one little API question...

> +struct page *
> +alloc_pages_mask(gfp_t gfp, unsigned size, u64 mask)
> +{
> +	unsigned long max_pfn = mask >> PAGE_SHIFT;

The "mask" parameter isn't really a mask - it's an upper bound on the
address of the allocated memory.  Might it be better to call it
"max_addr" or "limit" or "ceiling" or some such so callers understand
for sure how it's interpreted?  The use of the term "mask" throughout
the interface could maybe create a certain amount of confusion.

Thanks,

jon

^ permalink raw reply	[flat|nested] 56+ messages in thread

* Re: [PATCH] [6/13] Core maskable allocator
  2008-03-11 15:34   ` Jonathan Corbet
@ 2008-03-11 15:54     ` Andi Kleen
  0 siblings, 0 replies; 56+ messages in thread
From: Andi Kleen @ 2008-03-11 15:54 UTC (permalink / raw)
  To: Jonathan Corbet; +Cc: Andi Kleen, linux-kernel, linux-mm

On Tue, Mar 11, 2008 at 09:34:53AM -0600, Jonathan Corbet wrote:
> Hi, Andi,
> 
> As I dig through this patch, I find it mostly makes sense; seems like it
> could be a good idea. 

Thanks.
> 
> > +struct page *
> > +alloc_pages_mask(gfp_t gfp, unsigned size, u64 mask)
> > +{
> > +	unsigned long max_pfn = mask >> PAGE_SHIFT;
> 
> The "mask" parameter isn't really a mask - it's an upper bound on the

Actually it's both.

> address of the allocated memory.  Might it be better to call it
> "max_addr" or "limit" or "ceiling" or some such so callers understand

Mask is the standard term used by the PCI-DMA API for the same
thing, and since one of the main purposes of the mask allocator is to
implement the underlying support for that interface, it seemed fitting
to use the same convention.
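
So a caller that needs, say, a buffer reachable by a 24bit-limited device
looks roughly like this (sketch only; note the size is in bytes and
__GFP_DMA must not be passed):

	struct page *p;

	/* 16KB that must lie below the 16MB boundary */
	p = alloc_pages_mask(GFP_KERNEL, 16384, 0xffffffULL);
	if (!p)
		return -ENOMEM;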

-Andi

^ permalink raw reply	[flat|nested] 56+ messages in thread

* Re: [PATCH] [0/13] General DMA zone rework
  2008-03-10 18:08       ` Andi Kleen
@ 2008-03-11 17:26         ` Luiz Fernando N. Capitulino
  2008-03-11 17:35           ` Andi Kleen
  0 siblings, 1 reply; 56+ messages in thread
From: Luiz Fernando N. Capitulino @ 2008-03-11 17:26 UTC (permalink / raw)
  To: Andi Kleen; +Cc: linux-kernel, linux-mm

Em Mon, 10 Mar 2008 19:08:43 +0100
Andi Kleen <andi@firstfloor.org> escreveu:

| > you have or only the subset you posted?
| 
| Best to test them all together. The subsets are independent
| (as in, the kernel should work with any subset of them), but
| they all clean up DMA-memory-related issues.

 Ok, now it works with the test case fixed:

"""
testing mask alloc upto 24 bits
verify & free
mask fffff
mask 1fffff
mask 3fffff
mask 7fffff
mask ffffff
done
"""

 Do you remember the BUG_ON() I was getting because of the
sound card driver [1] ?

 It seems to me that the problem is that it's setting
__GFP_COMP. The following patch fixes it for me:

"""
Index: linux-2.6.24/sound/core/memalloc.c
===================================================================
--- linux-2.6.24.orig/sound/core/memalloc.c
+++ linux-2.6.24/sound/core/memalloc.c
@@ -218,7 +218,6 @@ static void *snd_malloc_dev_pages(struct
 	snd_assert(dma != NULL, return NULL);
 	pg = get_order(size);
 	gfp_flags = GFP_KERNEL
-		| __GFP_COMP	/* compound page lets parts be mapped */
 		| __GFP_NORETRY /* don't trigger OOM-killer */
 		| __GFP_NOWARN; /* no stack trace print - this call is non-critical */
 	res = dma_alloc_coherent(dev, PAGE_SIZE << pg, dma, gfp_flags);
"""

 Also, I think you forgot to protect the __free_pages_mask()
call with "#ifdef CONFIG_MASK_ALLOC" in patch mask-compat.

 Question: let's say that the devices I have consume only
5MB of the whole reserved pool (16MB), of course it's
ok to reduce the pool to 5MB, right?

 I have more machines to test this stuff...

[1] http://users.mandriva.com.br/~lcapitulino/tmp/minicom.cap

-- 
Luiz Fernando N. Capitulino

^ permalink raw reply	[flat|nested] 56+ messages in thread

* Re: [PATCH] [0/13] General DMA zone rework
  2008-03-11 17:26         ` Luiz Fernando N. Capitulino
@ 2008-03-11 17:35           ` Andi Kleen
  2008-03-11 18:00             ` Luiz Fernando N. Capitulino
  0 siblings, 1 reply; 56+ messages in thread
From: Andi Kleen @ 2008-03-11 17:35 UTC (permalink / raw)
  To: Luiz Fernando N. Capitulino; +Cc: Andi Kleen, linux-kernel, linux-mm

On Tue, Mar 11, 2008 at 02:26:24PM -0300, Luiz Fernando N. Capitulino wrote:
> ===================================================================
> --- linux-2.6.24.orig/sound/core/memalloc.c
> +++ linux-2.6.24/sound/core/memalloc.c
> @@ -218,7 +218,6 @@ static void *snd_malloc_dev_pages(struct
>  	snd_assert(dma != NULL, return NULL);
>  	pg = get_order(size);
>  	gfp_flags = GFP_KERNEL
> -		| __GFP_COMP	/* compound page lets parts be mapped */

Oops. Thanks. I'll double check that. mask allocator indeed doesn't
handle __GFP_COMP and nobody should be passing that into dma_alloc_coherent
anyways. But the bug you got was for the small size wasn't it?

>  		| __GFP_NORETRY /* don't trigger OOM-killer */
>  		| __GFP_NOWARN; /* no stack trace print - this call is non-critical */
>  	res = dma_alloc_coherent(dev, PAGE_SIZE << pg, dma, gfp_flags);
> """
> 
>  Also, I think you forgot to protect the __free_pages_mask()
> call with "#ifdef CONFIG_MASK_ALLOC" in patch mask-compat.

PageMaskAlloc() returns constant 0 in this case so the compiler should
eliminate it.
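
That is the usual pattern (sketched here; the flag name is only
illustrative, not necessarily what the patchkit uses):

	#ifdef CONFIG_MASK_ALLOC
	#define PageMaskAlloc(page)	test_bit(PG_mask_alloc, &(page)->flags)
	#else
	#define PageMaskAlloc(page)	0
	#endif

	if (PageMaskAlloc(page))		/* constant 0 without CONFIG_MASK_ALLOC */
		__free_pages_mask(page);	/* dead code, eliminated */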

> 
>  Question: let's say that the devices I have consume only
> 5MB of the whole reserved pool (16MB), of course it's

It's likely less than 16MB because there are already some users in there,
like the kernel text. The default sizing without swiotlb just grabs anything
left over up to the 16MB boundary.

> ok to reduce the pool to 5MB, right?

Yes, but not leaving anything free is a little risky; someone might need
more later. But going down to e.g. 8-9MB would likely be possible.
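
With the current patchkit that would be a boot parameter along the lines
of the following (memparse accepts the usual K/M/G suffixes):

	maskzone=9M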

The long-term plan (but that is some time off and still a little vague)
would be to let the various subsystems size the mask zone dynamically for
their needs.

Then, especially on 64bit swiotlb machines, which allocate 64MB by default
for swiotlb, that would save significant memory.

> 
>  I have more machines to test this stuff...

Thanks. Keep me posted.

-Andi

^ permalink raw reply	[flat|nested] 56+ messages in thread

* Re: [PATCH] [0/13] General DMA zone rework
  2008-03-11 17:35           ` Andi Kleen
@ 2008-03-11 18:00             ` Luiz Fernando N. Capitulino
  2008-03-11 18:49               ` Andi Kleen
  0 siblings, 1 reply; 56+ messages in thread
From: Luiz Fernando N. Capitulino @ 2008-03-11 18:00 UTC (permalink / raw)
  To: Andi Kleen; +Cc: Andi Kleen, linux-kernel, linux-mm

Em Tue, 11 Mar 2008 18:35:40 +0100
Andi Kleen <andi@firstfloor.org> escreveu:

| On Tue, Mar 11, 2008 at 02:26:24PM -0300, Luiz Fernando N. Capitulino wrote:
| > ===================================================================
| > --- linux-2.6.24.orig/sound/core/memalloc.c
| > +++ linux-2.6.24/sound/core/memalloc.c
| > @@ -218,7 +218,6 @@ static void *snd_malloc_dev_pages(struct
| >  	snd_assert(dma != NULL, return NULL);
| >  	pg = get_order(size);
| >  	gfp_flags = GFP_KERNEL
| > -		| __GFP_COMP	/* compound page lets parts be mapped */
| 
| Oops. Thanks. I'll double check that. mask allocator indeed doesn't
| handle __GFP_COMP and nobody should be passing that into dma_alloc_coherent
| anyways. But the bug you got was for the small size wasn't it?

 No, it triggers the BUG_ON() which checks the gfp, not the one
which checks MASK_MIN_SIZE.

 On the other hand I'm not sure whether it does the right thing
(i.e., pass the size in bytes instead of the order); it does:

"""
pg = get_order(size);
[...]
res = dma_alloc_coherent(dev, PAGE_SIZE << pg, dma, gfp_flags);
if (res != NULL)
	inc_snd_pages(pg);
"""

 Maybe it could be changed to:

"""
res = dma_alloc_coherent(dev, size, dma, gfp_flags);
if (res != NULL)
	inc_snd_pages(get_order(size));
"""

 But sound works (simple tests).

| > ok to reduce the pool to 5MB, right?
| 
| Yes, but not leaving anything free is a little risky; someone might need
| more later. But going down to e.g. 8-9MB would likely be possible.

 Ok.

| The long-term plan (but that is some time off and still a little vague)
| would be to let the various subsystems size the mask zone dynamically for
| their needs.

 That sounds cool.

-- 
Luiz Fernando N. Capitulino

^ permalink raw reply	[flat|nested] 56+ messages in thread

* Re: [PATCH] [0/13] General DMA zone rework
  2008-03-11 18:00             ` Luiz Fernando N. Capitulino
@ 2008-03-11 18:49               ` Andi Kleen
  2008-03-11 19:36                 ` Luiz Fernando N. Capitulino
  0 siblings, 1 reply; 56+ messages in thread
From: Andi Kleen @ 2008-03-11 18:49 UTC (permalink / raw)
  To: Luiz Fernando N. Capitulino; +Cc: Andi Kleen, linux-kernel, linux-mm

> | Oops. Thanks. I'll double check that. mask allocator indeed doesn't
> | handle __GFP_COMP and nobody should be passing that into dma_alloc_coherent
> | anyways. But the bug you got was for the small size wasn't it?
> 
>  No, it triggers the BUG_ON() which checks the gfp, not the one
> which checks MASK_MIN_SIZE.

I see. I misdiagnosed your original problem then. But fixing the 
size < 16 bytes case was a good idea anyways, someone else would
have triggered that.
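
One way to handle it (a sketch only, not necessarily what the updated
patchkit does) is to round tiny requests up instead of hitting the BUG_ON:

	/* small coherent allocations are legitimate; round them up to
	 * the allocator's minimum instead of BUG()ing on them */
	if (size < MASK_MIN_SIZE)
		size = MASK_MIN_SIZE;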

> 
>  On the other hand I'm not sure whether it does the right thing
> (i.e., pass the size in bytes instead of the order); it does:

> 
> """
> pg = get_order(size);
> [...]
> res = dma_alloc_coherent(dev, PAGE_SIZE << pg, dma, gfp_flags);

With the mask allocator it can be changed to pass size directly
and save some memory. Before that it didn't make any difference.

Can you perhaps send me a complete patch fixing that for sound and the
__GFP_COMP issue, with a description and Signed-off-by etc.? I can then add
it to my patchkit and you would be correctly attributed. Otherwise I can do
it myself if you prefer. I'll also do a grep over the tree for other
such bogus __GFP_COMP users; that was an issue I hadn't considered before.

> if (res != NULL)
> 	inc_snd_pages(pg);
> """
> 
>  Maybe it could be changed to:

Agreed.

-Andi


^ permalink raw reply	[flat|nested] 56+ messages in thread

* Re: [PATCH] [0/13] General DMA zone rework
  2008-03-11 18:49               ` Andi Kleen
@ 2008-03-11 19:36                 ` Luiz Fernando N. Capitulino
  0 siblings, 0 replies; 56+ messages in thread
From: Luiz Fernando N. Capitulino @ 2008-03-11 19:36 UTC (permalink / raw)
  To: Andi Kleen; +Cc: Andi Kleen, linux-kernel, linux-mm

Em Tue, 11 Mar 2008 19:49:26 +0100
Andi Kleen <andi@firstfloor.org> escreveu:

| > | Oops. Thanks. I'll double check that. mask allocator indeed doesn't
| > | handle __GFP_COMP and nobody should be passing that into dma_alloc_coherent
| > | anyways. But the bug you got was for the small size wasn't it?
| > 
| >  No, it triggers the BUG_ON() which checks the gfp, not the one
| > which checks MASK_MIN_SIZE.
| 
| I see. I misdiagnosed your original problem then. But fixing the 
| size < 16 bytes case was a good idea anyways, someone else would
| have triggered that.

 I see.

| Can you perhaps send me a complete patch fixing that for sound and the
| __GFP_COMP issue, with a description and Signed-off-by etc.? I can then add
| it to my patchkit and you would be correctly attributed. Otherwise I can do
| it myself if you prefer. I'll also do a grep over the tree for other
| such bogus __GFP_COMP users; that was an issue I hadn't considered before.

 Here you are (it passed minimal tests).

------
ALSA: Convert snd_malloc_dev_pages() to the mask allocator

The mask allocator does not handle the __GFP_COMP flag and
will BUG_ON() if that flag is passed to it.

Also, we should pass the allocation size in bytes to
dma_alloc_coherent().

Signed-off-by: Luiz Fernando N. Capitulino <lcapitulino@mandriva.com.br>

Index: linux-2.6.24/sound/core/memalloc.c
===================================================================
--- linux-2.6.24.orig/sound/core/memalloc.c
+++ linux-2.6.24/sound/core/memalloc.c
@@ -210,20 +210,17 @@ void snd_free_pages(void *ptr, size_t si
 /* allocate the coherent DMA pages */
 static void *snd_malloc_dev_pages(struct device *dev, size_t size, dma_addr_t *dma)
 {
-	int pg;
 	void *res;
 	gfp_t gfp_flags;
 
 	snd_assert(size > 0, return NULL);
 	snd_assert(dma != NULL, return NULL);
-	pg = get_order(size);
 	gfp_flags = GFP_KERNEL
-		| __GFP_COMP	/* compound page lets parts be mapped */
 		| __GFP_NORETRY /* don't trigger OOM-killer */
 		| __GFP_NOWARN; /* no stack trace print - this call is non-critical */
-	res = dma_alloc_coherent(dev, PAGE_SIZE << pg, dma, gfp_flags);
+	res = dma_alloc_coherent(dev, size, dma, gfp_flags);
 	if (res != NULL)
-		inc_snd_pages(pg);
+		inc_snd_pages(get_order(size));
 
 	return res;
 }



-- 
Luiz Fernando N. Capitulino

^ permalink raw reply	[flat|nested] 56+ messages in thread

end of thread, other threads:[~2008-03-11 19:36 UTC | newest]

Thread overview: 56+ messages (download: mbox.gz / follow: Atom feed)
-- links below jump to the message on this page --
2008-03-07  9:07 [PATCH] [0/13] General DMA zone rework Andi Kleen
2008-03-07  9:07 ` [PATCH] [2/13] Make get_order(0) return 0 Andi Kleen
2008-03-07  9:07 ` [PATCH] [3/13] Make kvm bad_page symbol static Andi Kleen
2008-03-07  9:07 ` [PATCH] [4/13] Prepare page_alloc for the maskable allocator Andi Kleen
2008-03-07 18:19   ` Sam Ravnborg
2008-03-07 18:36     ` Cyrill Gorcunov
2008-03-07 19:02     ` Andi Kleen
2008-03-07  9:07 ` [PATCH] [5/13] Add mask allocator statistics to vmstat.[ch] Andi Kleen
2008-03-08  2:24   ` Christoph Lameter
2008-03-07  9:07 ` [PATCH] [6/13] Core maskable allocator Andi Kleen
2008-03-07 10:53   ` Johannes Weiner
2008-03-07 11:14     ` Andi Kleen
2008-03-07 17:05   ` Randy Dunlap
2008-03-07 17:31     ` Andi Kleen
2008-03-07 17:33       ` Randy Dunlap
2008-03-07 17:43         ` Andi Kleen
2008-03-07 17:51           ` Randy Dunlap
2008-03-07 21:13   ` Cyrill Gorcunov
2008-03-07 23:28     ` Andi Kleen
2008-03-08  5:03   ` KAMEZAWA Hiroyuki
2008-03-08  5:41     ` KAMEZAWA Hiroyuki
2008-03-08 11:41     ` Andi Kleen
2008-03-11 15:34   ` Jonathan Corbet
2008-03-11 15:54     ` Andi Kleen
2008-03-07  9:07 ` [PATCH] [7/13] Implement compat hooks for GFP_DMA Andi Kleen
2008-03-07  9:07 ` [PATCH] [8/13] Enable the mask allocator for x86 Andi Kleen
2008-03-07 18:32   ` Sam Ravnborg
2008-03-07 19:03     ` Andi Kleen
2008-03-07 19:09       ` Sam Ravnborg
2008-03-08  2:37   ` Christoph Lameter
2008-03-08  6:35     ` Yinghai Lu
2008-03-08  7:31       ` Christoph Lameter
2008-03-08 11:54     ` Andi Kleen
2008-03-10 17:13       ` Christoph Lameter
2008-03-07  9:07 ` [PATCH] [9/13] Remove set_dma_reserve Andi Kleen
2008-03-07  9:07 ` [PATCH] [10/13] Switch the 32bit dma_alloc_coherent functions over to use the maskable allocator Andi Kleen
2008-03-07  9:07 ` [PATCH] [11/13] Switch x86-64 dma_alloc_coherent over to " Andi Kleen
2008-03-07  9:07 ` [PATCH] [12/13] Add vmstat statistics for new swiotlb code Andi Kleen
2008-03-08  2:38   ` Christoph Lameter
2008-03-07  9:07 ` [PATCH] [13/13] Convert x86-64 swiotlb to use the mask allocator directly Andi Kleen
2008-03-07 15:18 ` [PATCH] [0/13] General DMA zone rework Rene Herman
2008-03-07 15:22   ` Rene Herman
2008-03-07 15:31     ` Andi Kleen
2008-03-07 15:34   ` Andi Kleen
2008-03-07 20:51 ` Luiz Fernando N. Capitulino
2008-03-08  0:46   ` Andi Kleen
2008-03-10 18:03     ` Luiz Fernando N. Capitulino
2008-03-10 18:08       ` Andi Kleen
2008-03-11 17:26         ` Luiz Fernando N. Capitulino
2008-03-11 17:35           ` Andi Kleen
2008-03-11 18:00             ` Luiz Fernando N. Capitulino
2008-03-11 18:49               ` Andi Kleen
2008-03-11 19:36                 ` Luiz Fernando N. Capitulino
2008-03-08  2:42 ` Christoph Lameter
2008-03-08 11:57   ` Andi Kleen
2008-03-10 17:14     ` Christoph Lameter

This is a public inbox, see mirroring instructions
for how to clone and mirror all data and code used for this inbox;
as well as URLs for NNTP newsgroup(s).