linux-kernel.vger.kernel.org archive mirror
 help / color / mirror / Atom feed
* [PATCHv18 0/11] Contiguous Memory Allocator
@ 2011-12-29 12:39 Marek Szyprowski
  2011-12-29 12:39 ` [PATCH 01/11] mm: page_alloc: set_migratetype_isolate: drain PCP prior to isolating Marek Szyprowski
                   ` (11 more replies)
  0 siblings, 12 replies; 33+ messages in thread
From: Marek Szyprowski @ 2011-12-29 12:39 UTC (permalink / raw)
  To: linux-kernel, linux-arm-kernel, linux-media, linux-mm, linaro-mm-sig
  Cc: Michal Nazarewicz, Marek Szyprowski, Kyungmin Park, Russell King,
	Andrew Morton, KAMEZAWA Hiroyuki, Daniel Walker, Mel Gorman,
	Arnd Bergmann, Jesse Barker, Jonathan Corbet, Shariq Hasnain,
	Chunsang Jeong, Dave Hansen, Benjamin Gaignard

Welcome everyone last time in 2011!

We finally managed to finish (yet) another release of the Contiguous
Memory Allocator patches. This version resolves a lot of issues reported
in the previous version and improves the reliability of the memory
allocation.

The most important changes are code cleanup after a review from Mel
Gorman, fixing of the anoying bugs (like HIGHMEM crash on ARM) and
adding allocation retry procedure in case of temporary migration fail.
This version should finally solve all the issues that are a result of
changing the migration code base from memory hotplug to memory
compaction.

ARM integration code has not been changed since last two versions, it
provides implementation of all the ideas that has been discussed during
Linaro Sprint meeting. Here are the details:

  This version provides a solution for complete integration of CMA to
  DMA mapping subsystem on ARM architecture. The issue caused by double
  dma pages mapping and possible aliasing in coherent memory mapping has
  been finally resolved, both for GFP_ATOMIC case (allocations comes from
  coherent memory pool) and non-GFP_ATOMIC case (allocations comes from
  CMA managed areas).

  For coherent, nommu, ARMv4 and ARMv5 systems the current DMA-mapping
  implementation has been kept.

  For ARMv6+ systems, CMA has been enabled and a special pool of coherent
  memory for atomic allocations has been created. The size of this pool
  defaults to DEFAULT_CONSISTEN_DMA_SIZE/8, but can be changed with
  coherent_pool kernel parameter (if really required).

  All atomic allocations are served from this pool. I've did a little
  simplification here, because there is no separate pool for writecombine
  memory - such requests are also served from coherent pool. I don't
  think that such simplification is a problem here - I found no driver
  that use dma_alloc_writecombine with GFP_ATOMIC flags.

  All non-atomic allocation are served from CMA area. Kernel mapping is
  updated to reflect required memory attributes changes. This is possible
  because during early boot, all CMA area are remapped with 4KiB pages in
  kernel low-memory.

  This version have been tested on Samsung S5PC110 based Goni machine and
  Exynos4 UniversalC210 board with various V4L2 multimedia drivers.

  Coherent atomic allocations has been tested by manually enabling the dma
  bounce for the s3c-sdhci device.

All patches are prepared for Linux Kernel v3.2-rc7.

A few words for these who see CMA for the first time:

   The Contiguous Memory Allocator (CMA) makes it possible for device
   drivers to allocate big contiguous chunks of memory after the system
   has booted. 

   The main difference from the similar frameworks is the fact that CMA
   allows to transparently reuse memory region reserved for the big
   chunk allocation as a system memory, so no memory is wasted when no
   big chunk is allocated. Once the alloc request is issued, the
   framework will migrate system pages to create a required big chunk of
   physically contiguous memory.

   For more information you can refer to nice LWN articles: 
   http://lwn.net/Articles/447405/ and http://lwn.net/Articles/450286/
   as well as links to previous versions of the CMA framework.

   The CMA framework has been initially developed by Michal Nazarewicz
   at Samsung Poland R&D Center. Since version 9, I've taken over the
   development, because Michal has left the company.

TODO (optional):
- implement support for contiguous memory areas placed in HIGHMEM zone
- resolve issue with movable pages with pending io operations

I would also like to with everyone a Happy New Year! See You again in
2012!

Best regards
Marek Szyprowski
Samsung Poland R&D Center


Links to previous versions of the patchset:
v17: <http://www.spinics.net/lists/arm-kernel/msg148499.html>
v16: <http://www.spinics.net/lists/linux-mm/msg25066.html>
v15: <http://www.spinics.net/lists/linux-mm/msg23365.html>
v14: <http://www.spinics.net/lists/linux-media/msg36536.html>
v13: (internal, intentionally not released)
v12: <http://www.spinics.net/lists/linux-media/msg35674.html>
v11: <http://www.spinics.net/lists/linux-mm/msg21868.html>
v10: <http://www.spinics.net/lists/linux-mm/msg20761.html>
 v9: <http://article.gmane.org/gmane.linux.kernel.mm/60787>
 v8: <http://article.gmane.org/gmane.linux.kernel.mm/56855>
 v7: <http://article.gmane.org/gmane.linux.kernel.mm/55626>
 v6: <http://article.gmane.org/gmane.linux.kernel.mm/55626>
 v5: (intentionally left out as CMA v5 was identical to CMA v4)
 v4: <http://article.gmane.org/gmane.linux.kernel.mm/52010>
 v3: <http://article.gmane.org/gmane.linux.kernel.mm/51573>
 v2: <http://article.gmane.org/gmane.linux.kernel.mm/50986>
 v1: <http://article.gmane.org/gmane.linux.kernel.mm/50669>


Changelog:

v18:
    1. Addressed comments and suggestions from Mel Godman related to changes
       in memory compaction code, most important points:
	- removed "mm: page_alloc: handle MIGRATE_ISOLATE in free_pcppages_bulk()"
	  and moved all the logic to set_migratetype_isolate - see
	  "mm: page_alloc: set_migratetype_isolate: drain PCP prior to isolating"
	  patch
	- code in "mm: compaction: introduce isolate_{free,migrate}pages_range()"
	  patch have been simplified and improved
	- removed "mm: mmzone: introduce zone_pfn_same_memmap()" patch

    2. Fixed crash on initialization if HIGHMEM is available on ARM platforms

    3. Fixed problems with allocation of contiguous memory if all free pages
       are occupied by page cache and reclaim is required.

    4. Added a workaround for temporary migration failures (now CMA tries
       to allocate different memory block in such case), what heavily increased
       reliability of the CMA.

    5. Minor cleanup here and there.

    6. Rebased onto v3.2-rc7 kernel tree.

v17:
    1. Replaced whole CMA core memory migration code to the new one kindly
       provided by Michal Nazarewicz. The new code is based on memory
       compaction framework not the memory hotplug, like it was before. This
       change has been suggested by Mel Godman.

    2. Addressed most of the comments from Andrew Morton and Mel Gorman in
       the rest of the CMA code.

    3. Fixed broken initialization on ARM systems with DMA zone enabled.

    4. Rebased onto v3.2-rc2 kernel.

v16:
    1. merged a fixup from Michal Nazarewicz to address comments from Dave
       Hansen about checking if pfns belong to the same memory zone

    2. merged a fix from Michal Nazarewicz for incorrect handling of pages
       which belong to page block that is in MIGRATE_ISOLATE state, in very
       rare cases the migrate type of page block might have been changed
       from MIGRATE_CMA to MIGRATE_MOVABLE because of this bug

    3. moved some common code to include/asm-generic

    4. added support for x86 DMA-mapping framework for pci-dma hardware,
       CMA can be now even more widely tested on KVM/QEMU and a lot of common
       x86 boxes

    5. rebased onto next-20111005 kernel tree, which includes changes in ARM
       DMA-mapping subsystem (CONSISTENT_DMA_SIZE removal)

    6. removed patch for CMA s5p-fimc device private regions (served only as
       example) and provided the one that matches real life case - s5p-mfc
       device

v15:
    1. fixed calculation of the total memory after activating CMA area (was
       broken from v12)

    2. more code cleanup in drivers/base/dma-contiguous.c

    3. added address limit for default CMA area

    4. rewrote ARM DMA integration:
	- removed "ARM: DMA: steal memory for DMA coherent mappings" patch
	- kept current DMA mapping implementation for coherent, nommu and
	  ARMv4/ARMv5 systems
	- enabled CMA for all ARMv6+ systems
	- added separate, small pool for coherent atomic allocations, defaults
	  to CONSISTENT_DMA_SIZE/8, but can be changed with kernel parameter
	  coherent_pool=[size]

v14:
    1. Merged with "ARM: DMA: steal memory for DMA coherent mappings" 
       patch, added support for GFP_ATOMIC allocations.

    2. Added checks for NULL device pointer

v13: (internal, intentionally not released)

v12:
    1. Fixed 2 nasty bugs in dma-contiguous allocator:
       - alignment argument was not passed correctly
       - range for dma_release_from_contiguous was not checked correctly

    2. Added support for architecture specfic dma_contiguous_early_fixup()
       function

    3. CMA and DMA-mapping integration for ARM architechture has been
       rewritten to take care of the memory aliasing issue that might
       happen for newer ARM CPUs (mapping of the same pages with different
       cache attributes is forbidden). TODO: add support for GFP_ATOMIC
       allocations basing on the "ARM: DMA: steal memory for DMA coherent
       mappings" patch and implement support for contiguous memory areas
       that are placed in HIGHMEM zone

v11:
    1. Removed genalloc usage and replaced it with direct calls to
       bitmap_* functions, dropped patches that are not needed
       anymore (genalloc extensions)

    2. Moved all contiguous area management code from mm/cma.c
       to drivers/base/dma-contiguous.c

    3. Renamed cm_alloc/free to dma_alloc/release_from_contiguous

    4. Introduced global, system wide (default) contiguous area
       configured with kernel config and kernel cmdline parameters

    5. Simplified initialization to just one function:
       dma_declare_contiguous()

    6. Added example of device private memory contiguous area

v10:
    1. Rebased onto 3.0-rc2 and resolved all conflicts

    2. Simplified CMA to be just a pure memory allocator, for use
       with platfrom/bus specific subsystems, like dma-mapping.
       Removed all device specific functions are calls.

    3. Integrated with ARM DMA-mapping subsystem.

    4. Code cleanup here and there.

    5. Removed private context support.

v9: 1. Rebased onto 2.6.39-rc1 and resolved all conflicts

    2. Fixed a bunch of nasty bugs that happened when the allocation
       failed (mainly kernel oops due to NULL ptr dereference).

    3. Introduced testing code: cma-regions compatibility layer and
       videobuf2-cma memory allocator module.

v8: 1. The alloc_contig_range() function has now been separated from
       CMA and put in page_allocator.c.  This function tries to
       migrate all LRU pages in specified range and then allocate the
       range using alloc_contig_freed_pages().

    2. Support for MIGRATE_CMA has been separated from the CMA code.
       I have not tested if CMA works with ZONE_MOVABLE but I see no
       reasons why it shouldn't.

    3. I have added a @private argument when creating CMA contexts so
       that one can reserve memory and not share it with the rest of
       the system.  This way, CMA acts only as allocation algorithm.

v7: 1. A lot of functionality that handled driver->allocator_context
       mapping has been removed from the patchset.  This is not to say
       that this code is not needed, it's just not worth posting
       everything in one patchset.

       Currently, CMA is "just" an allocator.  It uses it's own
       migratetype (MIGRATE_CMA) for defining ranges of pageblokcs
       which behave just like ZONE_MOVABLE but dispite the latter can
       be put in arbitrary places.

    2. The migration code that was introduced in the previous version
       actually started working.


v6: 1. Most importantly, v6 introduces support for memory migration.
       The implementation is not yet complete though.

       Migration support means that when CMA is not using memory
       reserved for it, page allocator can allocate pages from it.
       When CMA wants to use the memory, the pages have to be moved
       and/or evicted as to make room for CMA.

       To make it possible it must be guaranteed that only movable and
       reclaimable pages are allocated in CMA controlled regions.
       This is done by introducing a MIGRATE_CMA migrate type that
       guarantees exactly that.

       Some of the migration code is "borrowed" from Kamezawa
       Hiroyuki's alloc_contig_pages() implementation.  The main
       difference is that thanks to MIGRATE_CMA migrate type CMA
       assumes that memory controlled by CMA are is always movable or
       reclaimable so that it makes allocation decisions regardless of
       the whether some pages are actually allocated and migrates them
       if needed.

       The most interesting patches from the patchset that implement
       the functionality are:

         09/13: mm: alloc_contig_free_pages() added
         10/13: mm: MIGRATE_CMA migration type added
         11/13: mm: MIGRATE_CMA isolation functions added
         12/13: mm: cma: Migration support added [wip]

       Currently, kernel panics in some situations which I am trying
       to investigate.

    2. cma_pin() and cma_unpin() functions has been added (after
       a conversation with Johan Mossberg).  The idea is that whenever
       hardware does not use the memory (no transaction is on) the
       chunk can be moved around.  This would allow defragmentation to
       be implemented if desired.  No defragmentation algorithm is
       provided at this time.

    3. Sysfs support has been replaced with debugfs.  I always felt
       unsure about the sysfs interface and when Greg KH pointed it
       out I finally got to rewrite it to debugfs.


v5: (intentionally left out as CMA v5 was identical to CMA v4)


v4: 1. The "asterisk" flag has been removed in favour of requiring
       that platform will provide a "*=<regions>" rule in the map
       attribute.

    2. The terminology has been changed slightly renaming "kind" to
       "type" of memory.  In the previous revisions, the documentation
       indicated that device drivers define memory kinds and now,

v3: 1. The command line parameters have been removed (and moved to
       a separate patch, the fourth one).  As a consequence, the
       cma_set_defaults() function has been changed -- it no longer
       accepts a string with list of regions but an array of regions.

    2. The "asterisk" attribute has been removed.  Now, each region
       has an "asterisk" flag which lets one specify whether this
       region should by considered "asterisk" region.

    3. SysFS support has been moved to a separate patch (the third one
       in the series) and now also includes list of regions.

v2: 1. The "cma_map" command line have been removed.  In exchange,
       a SysFS entry has been created under kernel/mm/contiguous.

       The intended way of specifying the attributes is
       a cma_set_defaults() function called by platform initialisation
       code.  "regions" attribute (the string specified by "cma"
       command line parameter) can be overwritten with command line
       parameter; the other attributes can be changed during run-time
       using the SysFS entries.

    2. The behaviour of the "map" attribute has been modified
       slightly.  Currently, if no rule matches given device it is
       assigned regions specified by the "asterisk" attribute.  It is
       by default built from the region names given in "regions"
       attribute.

    3. Devices can register private regions as well as regions that
       can be shared but are not reserved using standard CMA
       mechanisms.  A private region has no name and can be accessed
       only by devices that have the pointer to it.

    4. The way allocators are registered has changed.  Currently,
       a cma_allocator_register() function is used for that purpose.
       Moreover, allocators are attached to regions the first time
       memory is registered from the region or when allocator is
       registered which means that allocators can be dynamic modules
       that are loaded after the kernel booted (of course, it won't be
       possible to allocate a chunk of memory from a region if
       allocator is not loaded).

    5. Index of new functions:

    +static inline dma_addr_t __must_check
    +cma_alloc_from(const char *regions, size_t size,
    +               dma_addr_t alignment)

    +static inline int
    +cma_info_about(struct cma_info *info, const const char *regions)

    +int __must_check cma_region_register(struct cma_region *reg);

    +dma_addr_t __must_check
    +cma_alloc_from_region(struct cma_region *reg,
    +                      size_t size, dma_addr_t alignment);

    +static inline dma_addr_t __must_check
    +cma_alloc_from(const char *regions,
    +               size_t size, dma_addr_t alignment);

    +int cma_allocator_register(struct cma_allocator *alloc);


Patches in this patchset:

Marek Szyprowski (5):
  mm: add optional memory reclaim in split_free_page()
  drivers: add Contiguous Memory Allocator
  X86: integrate CMA with DMA-mapping subsystem
  ARM: integrate CMA with DMA-mapping subsystem
  ARM: Samsung: use CMA for 2 memory banks for s5p-mfc device

Michal Nazarewicz (6):
  mm: page_alloc: set_migratetype_isolate: drain PCP prior to isolating
  mm: compaction: introduce isolate_{free,migrate}pages_range().
  mm: compaction: export some of the functions
  mm: page_alloc: introduce alloc_contig_range()
  mm: mmzone: MIGRATE_CMA migration type added
  mm: page_isolation: MIGRATE_CMA isolation functions added

 Documentation/kernel-parameters.txt   |    9 +
 arch/Kconfig                          |    3 +
 arch/arm/Kconfig                      |    2 +
 arch/arm/include/asm/dma-contiguous.h |   16 ++
 arch/arm/include/asm/mach/map.h       |    1 +
 arch/arm/kernel/setup.c               |    9 +-
 arch/arm/mm/dma-mapping.c             |  368 +++++++++++++++++++++-----
 arch/arm/mm/init.c                    |   22 ++-
 arch/arm/mm/mm.h                      |    3 +
 arch/arm/mm/mmu.c                     |   29 ++-
 arch/arm/plat-s5p/dev-mfc.c           |   51 +---
 arch/x86/Kconfig                      |    1 +
 arch/x86/include/asm/dma-contiguous.h |   13 +
 arch/x86/include/asm/dma-mapping.h    |    4 +
 arch/x86/kernel/pci-dma.c             |   18 ++-
 arch/x86/kernel/pci-nommu.c           |    8 +-
 arch/x86/kernel/setup.c               |    2 +
 drivers/base/Kconfig                  |   89 +++++++
 drivers/base/Makefile                 |    1 +
 drivers/base/dma-contiguous.c         |  404 ++++++++++++++++++++++++++++
 include/asm-generic/dma-contiguous.h  |   27 ++
 include/linux/device.h                |    4 +
 include/linux/dma-contiguous.h        |  110 ++++++++
 include/linux/mm.h                    |    2 +-
 include/linux/mmzone.h                |   41 +++-
 include/linux/page-isolation.h        |   27 ++-
 mm/Kconfig                            |    2 +-
 mm/Makefile                           |    3 +-
 mm/compaction.c                       |  467 +++++++++++++++++++--------------
 mm/internal.h                         |   35 +++
 mm/memory-failure.c                   |    2 +-
 mm/memory_hotplug.c                   |    6 +-
 mm/page_alloc.c                       |  403 +++++++++++++++++++++++++----
 mm/page_isolation.c                   |   15 +-
 mm/vmstat.c                           |    1 +
 35 files changed, 1773 insertions(+), 425 deletions(-)
 create mode 100644 arch/arm/include/asm/dma-contiguous.h
 create mode 100644 arch/x86/include/asm/dma-contiguous.h
 create mode 100644 drivers/base/dma-contiguous.c
 create mode 100644 include/asm-generic/dma-contiguous.h
 create mode 100644 include/linux/dma-contiguous.h

-- 
1.7.1.569.g6f426

^ permalink raw reply	[flat|nested] 33+ messages in thread

* [PATCH 01/11] mm: page_alloc: set_migratetype_isolate: drain PCP prior to isolating
  2011-12-29 12:39 [PATCHv18 0/11] Contiguous Memory Allocator Marek Szyprowski
@ 2011-12-29 12:39 ` Marek Szyprowski
  2012-01-01  7:49   ` Gilad Ben-Yossef
  2012-01-05 15:39   ` Michal Nazarewicz
  2011-12-29 12:39 ` [PATCH 02/11] mm: compaction: introduce isolate_{free,migrate}pages_range() Marek Szyprowski
                   ` (10 subsequent siblings)
  11 siblings, 2 replies; 33+ messages in thread
From: Marek Szyprowski @ 2011-12-29 12:39 UTC (permalink / raw)
  To: linux-kernel, linux-arm-kernel, linux-media, linux-mm, linaro-mm-sig
  Cc: Michal Nazarewicz, Marek Szyprowski, Kyungmin Park, Russell King,
	Andrew Morton, KAMEZAWA Hiroyuki, Daniel Walker, Mel Gorman,
	Arnd Bergmann, Jesse Barker, Jonathan Corbet, Shariq Hasnain,
	Chunsang Jeong, Dave Hansen, Benjamin Gaignard

From: Michal Nazarewicz <mina86@mina86.com>

When set_migratetype_isolate() sets pageblock's migrate type, it does
not change each page_private data.  This makes sense, as the function
has no way of knowing what kind of information page_private stores.

Unfortunately, if a page is on PCP list, it's page_private indicates
its migrate type.  This means, that if a page on PCP list gets
isolated, a call to free_pcppages_bulk() will assume it has the old
migrate type rather than MIGRATE_ISOLATE.  This means, that a page
which should be isolated, will end up on a free list of it's old
migrate type.

Coincidentally, at the very end, set_migratetype_isolate() calls
drain_all_pages() which leads to calling free_pcppages_bulk(), which
does the wrong thing.

To avoid this situation, this commit moves the draining prior to
setting pageblock's migratetype and moving pages from old free list to
MIGRATETYPE_ISOLATE's free list.

Because of spin locks this is a non-trivial change however as both
set_migratetype_isolate() and free_pcppages_bulk() grab zone->lock.
To solve this problem, this commit renames free_pcppages_bulk() to
__free_pcppages_bulk() and changes it so that it no longer grabs
zone->lock instead requiring caller to hold it.  This commit later
adds a __zone_drain_all_pages() function which works just like
drain_all_pages() expects that it drains only pages from a single zone
and assumes that caller holds zone->lock.

A side effect is that instead of draining pages from all zones,
set_migratetype_isolate() now drain only pages from zone pageblock it
operates on is in.

Signed-off-by: Michal Nazarewicz <mina86@mina86.com>
Signed-off-by: Marek Szyprowski <m.szyprowski@samsung.com>
---
 mm/page_alloc.c |   56 ++++++++++++++++++++++++++++++++++++++++++------------
 1 files changed, 43 insertions(+), 13 deletions(-)

diff --git a/mm/page_alloc.c b/mm/page_alloc.c
index 2b8ba3a..f88b320 100644
--- a/mm/page_alloc.c
+++ b/mm/page_alloc.c
@@ -590,15 +590,16 @@ static inline int free_pages_check(struct page *page)
  *
  * And clear the zone's pages_scanned counter, to hold off the "all pages are
  * pinned" detection logic.
+ *
+ * Caller must hold zone->lock.
  */
-static void free_pcppages_bulk(struct zone *zone, int count,
+static void __free_pcppages_bulk(struct zone *zone, int count,
 					struct per_cpu_pages *pcp)
 {
 	int migratetype = 0;
 	int batch_free = 0;
 	int to_free = count;
 
-	spin_lock(&zone->lock);
 	zone->all_unreclaimable = 0;
 	zone->pages_scanned = 0;
 
@@ -628,13 +629,13 @@ static void free_pcppages_bulk(struct zone *zone, int count,
 			page = list_entry(list->prev, struct page, lru);
 			/* must delete as __free_one_page list manipulates */
 			list_del(&page->lru);
+
 			/* MIGRATE_MOVABLE list may include MIGRATE_RESERVEs */
 			__free_one_page(page, zone, 0, page_private(page));
 			trace_mm_page_pcpu_drain(page, 0, page_private(page));
 		} while (--to_free && --batch_free && !list_empty(list));
 	}
 	__mod_zone_page_state(zone, NR_FREE_PAGES, count);
-	spin_unlock(&zone->lock);
 }
 
 static void free_one_page(struct zone *zone, struct page *page, int order,
@@ -1067,14 +1068,14 @@ void drain_zone_pages(struct zone *zone, struct per_cpu_pages *pcp)
 	unsigned long flags;
 	int to_drain;
 
-	local_irq_save(flags);
+	spin_lock_irqsave(&zone->lock, flags);
 	if (pcp->count >= pcp->batch)
 		to_drain = pcp->batch;
 	else
 		to_drain = pcp->count;
-	free_pcppages_bulk(zone, to_drain, pcp);
+	__free_pcppages_bulk(zone, to_drain, pcp);
 	pcp->count -= to_drain;
-	local_irq_restore(flags);
+	spin_unlock_irqrestore(&zone->lock, flags);
 }
 #endif
 
@@ -1099,7 +1100,9 @@ static void drain_pages(unsigned int cpu)
 
 		pcp = &pset->pcp;
 		if (pcp->count) {
-			free_pcppages_bulk(zone, pcp->count, pcp);
+			spin_lock(&zone->lock);
+			__free_pcppages_bulk(zone, pcp->count, pcp);
+			spin_unlock(&zone->lock);
 			pcp->count = 0;
 		}
 		local_irq_restore(flags);
@@ -1122,6 +1125,32 @@ void drain_all_pages(void)
 	on_each_cpu(drain_local_pages, NULL, 1);
 }
 
+/* Caller must hold zone->lock. */
+static void __zone_drain_local_pages(void *arg)
+{
+	struct per_cpu_pages *pcp;
+	struct zone *zone = arg;
+	unsigned long flags;
+
+	local_irq_save(flags);
+	pcp = &per_cpu_ptr(zone->pageset, smp_processor_id())->pcp;
+	if (pcp->count) {
+		/* Caller holds zone->lock, no need to grab it. */
+		__free_pcppages_bulk(zone, pcp->count, pcp);
+		pcp->count = 0;
+	}
+	local_irq_restore(flags);
+}
+
+/*
+ * Like drain_all_pages() but operates on a single zone.  Caller must
+ * hold zone->lock.
+ */
+static void __zone_drain_all_pages(struct zone *zone)
+{
+	on_each_cpu(__zone_drain_local_pages, zone, 1);
+}
+
 #ifdef CONFIG_HIBERNATION
 
 void mark_free_pages(struct zone *zone)
@@ -1202,7 +1231,9 @@ void free_hot_cold_page(struct page *page, int cold)
 		list_add(&page->lru, &pcp->lists[migratetype]);
 	pcp->count++;
 	if (pcp->count >= pcp->high) {
-		free_pcppages_bulk(zone, pcp->batch, pcp);
+		spin_lock(&zone->lock);
+		__free_pcppages_bulk(zone, pcp->batch, pcp);
+		spin_unlock(&zone->lock);
 		pcp->count -= pcp->batch;
 	}
 
@@ -3684,10 +3715,10 @@ static int __zone_pcp_update(void *data)
 		pset = per_cpu_ptr(zone->pageset, cpu);
 		pcp = &pset->pcp;
 
-		local_irq_save(flags);
-		free_pcppages_bulk(zone, pcp->count, pcp);
+		spin_lock_irqsave(&zone->lock, flags);
+		__free_pcppages_bulk(zone, pcp->count, pcp);
 		setup_pageset(pset, batch);
-		local_irq_restore(flags);
+		spin_unlock_irqrestore(&zone->lock, flags);
 	}
 	return 0;
 }
@@ -5657,13 +5688,12 @@ int set_migratetype_isolate(struct page *page)
 
 out:
 	if (!ret) {
+		__zone_drain_all_pages(zone);
 		set_pageblock_migratetype(page, MIGRATE_ISOLATE);
 		move_freepages_block(zone, page, MIGRATE_ISOLATE);
 	}
 
 	spin_unlock_irqrestore(&zone->lock, flags);
-	if (!ret)
-		drain_all_pages();
 	return ret;
 }
 
-- 
1.7.1.569.g6f426


^ permalink raw reply related	[flat|nested] 33+ messages in thread

* [PATCH 02/11] mm: compaction: introduce isolate_{free,migrate}pages_range().
  2011-12-29 12:39 [PATCHv18 0/11] Contiguous Memory Allocator Marek Szyprowski
  2011-12-29 12:39 ` [PATCH 01/11] mm: page_alloc: set_migratetype_isolate: drain PCP prior to isolating Marek Szyprowski
@ 2011-12-29 12:39 ` Marek Szyprowski
  2012-01-10 13:43   ` Mel Gorman
  2011-12-29 12:39 ` [PATCH 03/11] mm: compaction: export some of the functions Marek Szyprowski
                   ` (9 subsequent siblings)
  11 siblings, 1 reply; 33+ messages in thread
From: Marek Szyprowski @ 2011-12-29 12:39 UTC (permalink / raw)
  To: linux-kernel, linux-arm-kernel, linux-media, linux-mm, linaro-mm-sig
  Cc: Michal Nazarewicz, Marek Szyprowski, Kyungmin Park, Russell King,
	Andrew Morton, KAMEZAWA Hiroyuki, Daniel Walker, Mel Gorman,
	Arnd Bergmann, Jesse Barker, Jonathan Corbet, Shariq Hasnain,
	Chunsang Jeong, Dave Hansen, Benjamin Gaignard

From: Michal Nazarewicz <mina86@mina86.com>

This commit introduces isolate_freepages_range() and
isolate_migratepages_range() functions.  The first one replaces
isolate_freepages_block() and the second one extracts functionality
from isolate_migratepages().

They are more generic and instead of operating on pageblocks operate
on PFN ranges.

Signed-off-by: Michal Nazarewicz <mina86@mina86.com>
Signed-off-by: Marek Szyprowski <m.szyprowski@samsung.com>
---
 mm/compaction.c |  184 ++++++++++++++++++++++++++++++++++++++-----------------
 1 files changed, 127 insertions(+), 57 deletions(-)

diff --git a/mm/compaction.c b/mm/compaction.c
index 899d956..ae73b6f 100644
--- a/mm/compaction.c
+++ b/mm/compaction.c
@@ -54,55 +54,86 @@ static unsigned long release_freepages(struct list_head *freelist)
 	return count;
 }
 
-/* Isolate free pages onto a private freelist. Must hold zone->lock */
-static unsigned long isolate_freepages_block(struct zone *zone,
-				unsigned long blockpfn,
-				struct list_head *freelist)
+/**
+ * isolate_freepages_range() - isolate free pages, must hold zone->lock.
+ * @zone:	Zone pages are in.
+ * @start_pfn:	The first PFN to start isolating.
+ * @end_pfn:	The one-past-last PFN.
+ * @freelist:	A list to save isolated pages to.
+ *
+ * If @freelist is not provided, holes in range (either non-free pages
+ * or invalid PFNs) are considered an error and function undos its
+ * actions and returns zero.
+ *
+ * If @freelist is provided, function will simply skip non-free and
+ * missing pages and put only the ones isolated on the list.
+ *
+ * Returns number of isolated pages.  This may be more then end_pfn-start_pfn
+ * if end fell in a middle of a free page.
+ */
+static unsigned long
+isolate_freepages_range(struct zone *zone,
+			unsigned long start_pfn, unsigned long end_pfn,
+			struct list_head *freelist)
 {
-	unsigned long zone_end_pfn, end_pfn;
-	int nr_scanned = 0, total_isolated = 0;
-	struct page *cursor;
-
-	/* Get the last PFN we should scan for free pages at */
-	zone_end_pfn = zone->zone_start_pfn + zone->spanned_pages;
-	end_pfn = min(blockpfn + pageblock_nr_pages, zone_end_pfn);
+	unsigned long nr_scanned = 0, total_isolated = 0;
+	unsigned long pfn = start_pfn;
+	struct page *page;
 
-	/* Find the first usable PFN in the block to initialse page cursor */
-	for (; blockpfn < end_pfn; blockpfn++) {
-		if (pfn_valid_within(blockpfn))
-			break;
-	}
-	cursor = pfn_to_page(blockpfn);
+	VM_BUG_ON(!pfn_valid(pfn));
+	page = pfn_to_page(pfn);
 
 	/* Isolate free pages. This assumes the block is valid */
-	for (; blockpfn < end_pfn; blockpfn++, cursor++) {
-		int isolated, i;
-		struct page *page = cursor;
+	while (pfn < end_pfn) {
+		int n = 0, i;
 
-		if (!pfn_valid_within(blockpfn))
-			continue;
-		nr_scanned++;
+		if (!pfn_valid_within(pfn))
+			goto next;
+		++nr_scanned;
 
 		if (!PageBuddy(page))
-			continue;
+			goto next;
 
 		/* Found a free page, break it into order-0 pages */
-		isolated = split_free_page(page);
-		total_isolated += isolated;
-		for (i = 0; i < isolated; i++) {
-			list_add(&page->lru, freelist);
-			page++;
+		n = split_free_page(page);
+		total_isolated += n;
+		if (freelist) {
+			struct page *p = page;
+			for (i = n; i; --i, ++p)
+				list_add(&p->lru, freelist);
 		}
 
-		/* If a page was split, advance to the end of it */
-		if (isolated) {
-			blockpfn += isolated - 1;
-			cursor += isolated - 1;
+next:
+		if (!n) {
+			/* If n == 0, we have isolated no pages. */
+			if (!freelist)
+				goto cleanup;
+			n = 1;
 		}
+
+		/*
+		 * If we pass max order page, we might end up in a different
+		 * vmemmap, so account for that.
+		 */
+		pfn += n;
+		if (pfn & (MAX_ORDER_NR_PAGES - 1))
+			page += n;
+		else
+			page = pfn_to_page(pfn);
 	}
 
 	trace_mm_compaction_isolate_freepages(nr_scanned, total_isolated);
 	return total_isolated;
+
+cleanup:
+	/*
+	 * Undo what we have done so far, and return.  We know all pages from
+	 * [start_pfn, pfn) are free because we have just freed them.  If one of
+	 * the page in the range was not freed, we would end up here earlier.
+	 */
+	for (; start_pfn < pfn; ++start_pfn)
+		__free_page(pfn_to_page(start_pfn));
+	return 0;
 }
 
 /* Returns true if the page is within a block suitable for migration to */
@@ -135,7 +166,7 @@ static void isolate_freepages(struct zone *zone,
 				struct compact_control *cc)
 {
 	struct page *page;
-	unsigned long high_pfn, low_pfn, pfn;
+	unsigned long high_pfn, low_pfn, pfn, zone_end_pfn, end_pfn;
 	unsigned long flags;
 	int nr_freepages = cc->nr_freepages;
 	struct list_head *freelist = &cc->freepages;
@@ -155,6 +186,8 @@ static void isolate_freepages(struct zone *zone,
 	 */
 	high_pfn = min(low_pfn, pfn);
 
+	zone_end_pfn = zone->zone_start_pfn + zone->spanned_pages;
+
 	/*
 	 * Isolate free pages until enough are available to migrate the
 	 * pages on cc->migratepages. We stop searching if the migrate
@@ -191,7 +224,9 @@ static void isolate_freepages(struct zone *zone,
 		isolated = 0;
 		spin_lock_irqsave(&zone->lock, flags);
 		if (suitable_migration_target(page)) {
-			isolated = isolate_freepages_block(zone, pfn, freelist);
+			end_pfn = min(pfn + pageblock_nr_pages, zone_end_pfn);
+			isolated = isolate_freepages_range(zone, pfn,
+					end_pfn, freelist);
 			nr_freepages += isolated;
 		}
 		spin_unlock_irqrestore(&zone->lock, flags);
@@ -250,31 +285,34 @@ typedef enum {
 	ISOLATE_SUCCESS,	/* Pages isolated, migrate */
 } isolate_migrate_t;
 
-/*
- * Isolate all pages that can be migrated from the block pointed to by
- * the migrate scanner within compact_control.
+/**
+ * isolate_migratepages_range() - isolate all migrate-able pages in range.
+ * @zone:	Zone pages are in.
+ * @cc:		Compaction control structure.
+ * @low_pfn:	The first PFN of the range.
+ * @end_pfn:	The one-past-the-last PFN of the range.
+ *
+ * Isolate all pages that can be migrated from the range specified by
+ * [low_pfn, end_pfn).  Returns zero if there is a fatal signal
+ * pending), otherwise PFN of the first page that was not scanned
+ * (which may be both less, equal to or more then end_pfn).
+ *
+ * Assumes that cc->migratepages is empty and cc->nr_migratepages is
+ * zero.
+ *
+ * Other then cc->migratepages and cc->nr_migratetypes this function
+ * does not modify any cc's fields, ie. it does not modify (or read
+ * for that matter) cc->migrate_pfn.
  */
-static isolate_migrate_t isolate_migratepages(struct zone *zone,
-					struct compact_control *cc)
+static unsigned long
+isolate_migratepages_range(struct zone *zone, struct compact_control *cc,
+			   unsigned long low_pfn, unsigned long end_pfn)
 {
-	unsigned long low_pfn, end_pfn;
 	unsigned long last_pageblock_nr = 0, pageblock_nr;
 	unsigned long nr_scanned = 0, nr_isolated = 0;
 	struct list_head *migratelist = &cc->migratepages;
 	isolate_mode_t mode = ISOLATE_ACTIVE|ISOLATE_INACTIVE;
 
-	/* Do not scan outside zone boundaries */
-	low_pfn = max(cc->migrate_pfn, zone->zone_start_pfn);
-
-	/* Only scan within a pageblock boundary */
-	end_pfn = ALIGN(low_pfn + pageblock_nr_pages, pageblock_nr_pages);
-
-	/* Do not cross the free scanner or scan within a memory hole */
-	if (end_pfn > cc->free_pfn || !pfn_valid(low_pfn)) {
-		cc->migrate_pfn = end_pfn;
-		return ISOLATE_NONE;
-	}
-
 	/*
 	 * Ensure that there are not too many pages isolated from the LRU
 	 * list by either parallel reclaimers or compaction. If there are,
@@ -283,12 +321,12 @@ static isolate_migrate_t isolate_migratepages(struct zone *zone,
 	while (unlikely(too_many_isolated(zone))) {
 		/* async migration should just abort */
 		if (!cc->sync)
-			return ISOLATE_ABORT;
+			return 0;
 
 		congestion_wait(BLK_RW_ASYNC, HZ/10);
 
 		if (fatal_signal_pending(current))
-			return ISOLATE_ABORT;
+			return 0;
 	}
 
 	/* Time to isolate some pages for migration */
@@ -365,17 +403,49 @@ static isolate_migrate_t isolate_migratepages(struct zone *zone,
 		nr_isolated++;
 
 		/* Avoid isolating too much */
-		if (cc->nr_migratepages == COMPACT_CLUSTER_MAX)
+		if (cc->nr_migratepages == COMPACT_CLUSTER_MAX) {
+			++low_pfn;
 			break;
+		}
 	}
 
 	acct_isolated(zone, cc);
 
 	spin_unlock_irq(&zone->lru_lock);
-	cc->migrate_pfn = low_pfn;
 
 	trace_mm_compaction_isolate_migratepages(nr_scanned, nr_isolated);
 
+	return low_pfn;
+}
+
+/*
+ * Isolate all pages that can be migrated from the block pointed to by
+ * the migrate scanner within compact_control.
+ */
+static isolate_migrate_t isolate_migratepages(struct zone *zone,
+					struct compact_control *cc)
+{
+	unsigned long low_pfn, end_pfn;
+
+	/* Do not scan outside zone boundaries */
+	low_pfn = max(cc->migrate_pfn, zone->zone_start_pfn);
+
+	/* Only scan within a pageblock boundary */
+	end_pfn = ALIGN(low_pfn + pageblock_nr_pages, pageblock_nr_pages);
+
+	/* Do not cross the free scanner or scan within a memory hole */
+	if (end_pfn > cc->free_pfn || !pfn_valid(low_pfn)) {
+		cc->migrate_pfn = end_pfn;
+		return ISOLATE_NONE;
+	}
+
+	/* Perform the isolation */
+	low_pfn = isolate_migratepages_range(zone, cc, low_pfn, end_pfn);
+	if (!low_pfn)
+		return ISOLATE_ABORT;
+
+	cc->migrate_pfn = low_pfn;
+
 	return ISOLATE_SUCCESS;
 }
 
-- 
1.7.1.569.g6f426


^ permalink raw reply related	[flat|nested] 33+ messages in thread

* [PATCH 03/11] mm: compaction: export some of the functions
  2011-12-29 12:39 [PATCHv18 0/11] Contiguous Memory Allocator Marek Szyprowski
  2011-12-29 12:39 ` [PATCH 01/11] mm: page_alloc: set_migratetype_isolate: drain PCP prior to isolating Marek Szyprowski
  2011-12-29 12:39 ` [PATCH 02/11] mm: compaction: introduce isolate_{free,migrate}pages_range() Marek Szyprowski
@ 2011-12-29 12:39 ` Marek Szyprowski
  2011-12-29 12:39 ` [PATCH 04/11] mm: page_alloc: introduce alloc_contig_range() Marek Szyprowski
                   ` (8 subsequent siblings)
  11 siblings, 0 replies; 33+ messages in thread
From: Marek Szyprowski @ 2011-12-29 12:39 UTC (permalink / raw)
  To: linux-kernel, linux-arm-kernel, linux-media, linux-mm, linaro-mm-sig
  Cc: Michal Nazarewicz, Marek Szyprowski, Kyungmin Park, Russell King,
	Andrew Morton, KAMEZAWA Hiroyuki, Daniel Walker, Mel Gorman,
	Arnd Bergmann, Jesse Barker, Jonathan Corbet, Shariq Hasnain,
	Chunsang Jeong, Dave Hansen, Benjamin Gaignard

From: Michal Nazarewicz <mina86@mina86.com>

This commit exports some of the functions from compaction.c file
outside of it adding their declaration into internal.h header
file so that other mm related code can use them.

This forced compaction.c to always be compiled (as opposed to being
compiled only if CONFIG_COMPACTION is defined) but as to avoid
introducing code that user did not ask for, part of the compaction.c
is now wrapped in on #ifdef.

Signed-off-by: Michal Nazarewicz <mina86@mina86.com>
Signed-off-by: Marek Szyprowski <m.szyprowski@samsung.com>
---
 mm/Makefile     |    3 +-
 mm/compaction.c |  342 ++++++++++++++++++++++++++-----------------------------
 mm/internal.h   |   35 ++++++
 3 files changed, 200 insertions(+), 180 deletions(-)

diff --git a/mm/Makefile b/mm/Makefile
index 50ec00e..8aada89 100644
--- a/mm/Makefile
+++ b/mm/Makefile
@@ -13,7 +13,7 @@ obj-y			:= filemap.o mempool.o oom_kill.o fadvise.o \
 			   readahead.o swap.o truncate.o vmscan.o shmem.o \
 			   prio_tree.o util.o mmzone.o vmstat.o backing-dev.o \
 			   page_isolation.o mm_init.o mmu_context.o percpu.o \
-			   $(mmu-y)
+			   compaction.o $(mmu-y)
 obj-y += init-mm.o
 
 ifdef CONFIG_NO_BOOTMEM
@@ -32,7 +32,6 @@ obj-$(CONFIG_NUMA) 	+= mempolicy.o
 obj-$(CONFIG_SPARSEMEM)	+= sparse.o
 obj-$(CONFIG_SPARSEMEM_VMEMMAP) += sparse-vmemmap.o
 obj-$(CONFIG_SLOB) += slob.o
-obj-$(CONFIG_COMPACTION) += compaction.o
 obj-$(CONFIG_MMU_NOTIFIER) += mmu_notifier.o
 obj-$(CONFIG_KSM) += ksm.o
 obj-$(CONFIG_PAGE_POISONING) += debug-pagealloc.o
diff --git a/mm/compaction.c b/mm/compaction.c
index ae73b6f..8733441 100644
--- a/mm/compaction.c
+++ b/mm/compaction.c
@@ -16,44 +16,11 @@
 #include <linux/sysfs.h>
 #include "internal.h"
 
+#if defined CONFIG_COMPACTION || defined CONFIG_CMA
+
 #define CREATE_TRACE_POINTS
 #include <trace/events/compaction.h>
 
-/*
- * compact_control is used to track pages being migrated and the free pages
- * they are being migrated to during memory compaction. The free_pfn starts
- * at the end of a zone and migrate_pfn begins at the start. Movable pages
- * are moved to the end of a zone during a compaction run and the run
- * completes when free_pfn <= migrate_pfn
- */
-struct compact_control {
-	struct list_head freepages;	/* List of free pages to migrate to */
-	struct list_head migratepages;	/* List of pages being migrated */
-	unsigned long nr_freepages;	/* Number of isolated free pages */
-	unsigned long nr_migratepages;	/* Number of pages to migrate */
-	unsigned long free_pfn;		/* isolate_freepages search base */
-	unsigned long migrate_pfn;	/* isolate_migratepages search base */
-	bool sync;			/* Synchronous migration */
-
-	unsigned int order;		/* order a direct compactor needs */
-	int migratetype;		/* MOVABLE, RECLAIMABLE etc */
-	struct zone *zone;
-};
-
-static unsigned long release_freepages(struct list_head *freelist)
-{
-	struct page *page, *next;
-	unsigned long count = 0;
-
-	list_for_each_entry_safe(page, next, freelist, lru) {
-		list_del(&page->lru);
-		__free_page(page);
-		count++;
-	}
-
-	return count;
-}
-
 /**
  * isolate_freepages_range() - isolate free pages, must hold zone->lock.
  * @zone:	Zone pages are in.
@@ -71,7 +38,7 @@ static unsigned long release_freepages(struct list_head *freelist)
  * Returns number of isolated pages.  This may be more then end_pfn-start_pfn
  * if end fell in a middle of a free page.
  */
-static unsigned long
+unsigned long
 isolate_freepages_range(struct zone *zone,
 			unsigned long start_pfn, unsigned long end_pfn,
 			struct list_head *freelist)
@@ -136,120 +103,6 @@ cleanup:
 	return 0;
 }
 
-/* Returns true if the page is within a block suitable for migration to */
-static bool suitable_migration_target(struct page *page)
-{
-
-	int migratetype = get_pageblock_migratetype(page);
-
-	/* Don't interfere with memory hot-remove or the min_free_kbytes blocks */
-	if (migratetype == MIGRATE_ISOLATE || migratetype == MIGRATE_RESERVE)
-		return false;
-
-	/* If the page is a large free page, then allow migration */
-	if (PageBuddy(page) && page_order(page) >= pageblock_order)
-		return true;
-
-	/* If the block is MIGRATE_MOVABLE, allow migration */
-	if (migratetype == MIGRATE_MOVABLE)
-		return true;
-
-	/* Otherwise skip the block */
-	return false;
-}
-
-/*
- * Based on information in the current compact_control, find blocks
- * suitable for isolating free pages from and then isolate them.
- */
-static void isolate_freepages(struct zone *zone,
-				struct compact_control *cc)
-{
-	struct page *page;
-	unsigned long high_pfn, low_pfn, pfn, zone_end_pfn, end_pfn;
-	unsigned long flags;
-	int nr_freepages = cc->nr_freepages;
-	struct list_head *freelist = &cc->freepages;
-
-	/*
-	 * Initialise the free scanner. The starting point is where we last
-	 * scanned from (or the end of the zone if starting). The low point
-	 * is the end of the pageblock the migration scanner is using.
-	 */
-	pfn = cc->free_pfn;
-	low_pfn = cc->migrate_pfn + pageblock_nr_pages;
-
-	/*
-	 * Take care that if the migration scanner is at the end of the zone
-	 * that the free scanner does not accidentally move to the next zone
-	 * in the next isolation cycle.
-	 */
-	high_pfn = min(low_pfn, pfn);
-
-	zone_end_pfn = zone->zone_start_pfn + zone->spanned_pages;
-
-	/*
-	 * Isolate free pages until enough are available to migrate the
-	 * pages on cc->migratepages. We stop searching if the migrate
-	 * and free page scanners meet or enough free pages are isolated.
-	 */
-	for (; pfn > low_pfn && cc->nr_migratepages > nr_freepages;
-					pfn -= pageblock_nr_pages) {
-		unsigned long isolated;
-
-		if (!pfn_valid(pfn))
-			continue;
-
-		/*
-		 * Check for overlapping nodes/zones. It's possible on some
-		 * configurations to have a setup like
-		 * node0 node1 node0
-		 * i.e. it's possible that all pages within a zones range of
-		 * pages do not belong to a single zone.
-		 */
-		page = pfn_to_page(pfn);
-		if (page_zone(page) != zone)
-			continue;
-
-		/* Check the block is suitable for migration */
-		if (!suitable_migration_target(page))
-			continue;
-
-		/*
-		 * Found a block suitable for isolating free pages from. Now
-		 * we disabled interrupts, double check things are ok and
-		 * isolate the pages. This is to minimise the time IRQs
-		 * are disabled
-		 */
-		isolated = 0;
-		spin_lock_irqsave(&zone->lock, flags);
-		if (suitable_migration_target(page)) {
-			end_pfn = min(pfn + pageblock_nr_pages, zone_end_pfn);
-			isolated = isolate_freepages_range(zone, pfn,
-					end_pfn, freelist);
-			nr_freepages += isolated;
-		}
-		spin_unlock_irqrestore(&zone->lock, flags);
-
-		/*
-		 * Record the highest PFN we isolated pages from. When next
-		 * looking for free pages, the search will restart here as
-		 * page migration may have returned some pages to the allocator
-		 */
-		if (isolated)
-			high_pfn = max(high_pfn, pfn);
-	}
-
-	/* split_free_page does not map the pages */
-	list_for_each_entry(page, freelist, lru) {
-		arch_alloc_page(page, 0);
-		kernel_map_pages(page, 1, 1);
-	}
-
-	cc->free_pfn = high_pfn;
-	cc->nr_freepages = nr_freepages;
-}
-
 /* Update the number of anon and file isolated pages in the zone */
 static void acct_isolated(struct zone *zone, struct compact_control *cc)
 {
@@ -278,13 +131,6 @@ static bool too_many_isolated(struct zone *zone)
 	return isolated > (inactive + active) / 2;
 }
 
-/* possible outcome of isolate_migratepages */
-typedef enum {
-	ISOLATE_ABORT,		/* Abort compaction now */
-	ISOLATE_NONE,		/* No pages isolated, continue scanning */
-	ISOLATE_SUCCESS,	/* Pages isolated, migrate */
-} isolate_migrate_t;
-
 /**
  * isolate_migratepages_range() - isolate all migrate-able pages in range.
  * @zone:	Zone pages are in.
@@ -304,7 +150,7 @@ typedef enum {
  * does not modify any cc's fields, ie. it does not modify (or read
  * for that matter) cc->migrate_pfn.
  */
-static unsigned long
+unsigned long
 isolate_migratepages_range(struct zone *zone, struct compact_control *cc,
 			   unsigned long low_pfn, unsigned long end_pfn)
 {
@@ -418,35 +264,135 @@ isolate_migratepages_range(struct zone *zone, struct compact_control *cc,
 	return low_pfn;
 }
 
+#endif /* CONFIG_COMPACTION || CONFIG_CMA */
+#ifdef CONFIG_COMPACTION
+
+static unsigned long release_freepages(struct list_head *freelist)
+{
+	struct page *page, *next;
+	unsigned long count = 0;
+
+	list_for_each_entry_safe(page, next, freelist, lru) {
+		list_del(&page->lru);
+		__free_page(page);
+		count++;
+	}
+
+	return count;
+}
+
+/* Returns true if the page is within a block suitable for migration to */
+static bool suitable_migration_target(struct page *page)
+{
+
+	int migratetype = get_pageblock_migratetype(page);
+
+	/* Don't interfere with memory hot-remove or the min_free_kbytes blocks */
+	if (migratetype == MIGRATE_ISOLATE || migratetype == MIGRATE_RESERVE)
+		return false;
+
+	/* If the page is a large free page, then allow migration */
+	if (PageBuddy(page) && page_order(page) >= pageblock_order)
+		return true;
+
+	/* If the block is MIGRATE_MOVABLE, allow migration */
+	if (migratetype == MIGRATE_MOVABLE)
+		return true;
+
+	/* Otherwise skip the block */
+	return false;
+}
+
 /*
- * Isolate all pages that can be migrated from the block pointed to by
- * the migrate scanner within compact_control.
+ * Based on information in the current compact_control, find blocks
+ * suitable for isolating free pages from and then isolate them.
  */
-static isolate_migrate_t isolate_migratepages(struct zone *zone,
-					struct compact_control *cc)
+static void isolate_freepages(struct zone *zone,
+				struct compact_control *cc)
 {
-	unsigned long low_pfn, end_pfn;
+	struct page *page;
+	unsigned long high_pfn, low_pfn, pfn, zone_end_pfn, end_pfn;
+	unsigned long flags;
+	int nr_freepages = cc->nr_freepages;
+	struct list_head *freelist = &cc->freepages;
 
-	/* Do not scan outside zone boundaries */
-	low_pfn = max(cc->migrate_pfn, zone->zone_start_pfn);
+	/*
+	 * Initialise the free scanner. The starting point is where we last
+	 * scanned from (or the end of the zone if starting). The low point
+	 * is the end of the pageblock the migration scanner is using.
+	 */
+	pfn = cc->free_pfn;
+	low_pfn = cc->migrate_pfn + pageblock_nr_pages;
 
-	/* Only scan within a pageblock boundary */
-	end_pfn = ALIGN(low_pfn + pageblock_nr_pages, pageblock_nr_pages);
+	/*
+	 * Take care that if the migration scanner is at the end of the zone
+	 * that the free scanner does not accidentally move to the next zone
+	 * in the next isolation cycle.
+	 */
+	high_pfn = min(low_pfn, pfn);
 
-	/* Do not cross the free scanner or scan within a memory hole */
-	if (end_pfn > cc->free_pfn || !pfn_valid(low_pfn)) {
-		cc->migrate_pfn = end_pfn;
-		return ISOLATE_NONE;
-	}
+	zone_end_pfn = zone->zone_start_pfn + zone->spanned_pages;
 
-	/* Perform the isolation */
-	low_pfn = isolate_migratepages_range(zone, cc, low_pfn, end_pfn);
-	if (!low_pfn)
-		return ISOLATE_ABORT;
+	/*
+	 * Isolate free pages until enough are available to migrate the
+	 * pages on cc->migratepages. We stop searching if the migrate
+	 * and free page scanners meet or enough free pages are isolated.
+	 */
+	for (; pfn > low_pfn && cc->nr_migratepages > nr_freepages;
+					pfn -= pageblock_nr_pages) {
+		unsigned long isolated;
 
-	cc->migrate_pfn = low_pfn;
+		if (!pfn_valid(pfn))
+			continue;
 
-	return ISOLATE_SUCCESS;
+		/*
+		 * Check for overlapping nodes/zones. It's possible on some
+		 * configurations to have a setup like
+		 * node0 node1 node0
+		 * i.e. it's possible that all pages within a zones range of
+		 * pages do not belong to a single zone.
+		 */
+		page = pfn_to_page(pfn);
+		if (page_zone(page) != zone)
+			continue;
+
+		/* Check the block is suitable for migration */
+		if (!suitable_migration_target(page))
+			continue;
+
+		/*
+		 * Found a block suitable for isolating free pages from. Now
+		 * we disabled interrupts, double check things are ok and
+		 * isolate the pages. This is to minimise the time IRQs
+		 * are disabled
+		 */
+		isolated = 0;
+		spin_lock_irqsave(&zone->lock, flags);
+		if (suitable_migration_target(page)) {
+			end_pfn = min(pfn + pageblock_nr_pages, zone_end_pfn);
+			isolated = isolate_freepages_range(zone, pfn,
+					end_pfn, freelist);
+			nr_freepages += isolated;
+		}
+		spin_unlock_irqrestore(&zone->lock, flags);
+
+		/*
+		 * Record the highest PFN we isolated pages from. When next
+		 * looking for free pages, the search will restart here as
+		 * page migration may have returned some pages to the allocator
+		 */
+		if (isolated)
+			high_pfn = max(high_pfn, pfn);
+	}
+
+	/* split_free_page does not map the pages */
+	list_for_each_entry(page, freelist, lru) {
+		arch_alloc_page(page, 0);
+		kernel_map_pages(page, 1, 1);
+	}
+
+	cc->free_pfn = high_pfn;
+	cc->nr_freepages = nr_freepages;
 }
 
 /*
@@ -495,6 +441,44 @@ static void update_nr_listpages(struct compact_control *cc)
 	cc->nr_freepages = nr_freepages;
 }
 
+/* possible outcome of isolate_migratepages */
+typedef enum {
+	ISOLATE_ABORT,		/* Abort compaction now */
+	ISOLATE_NONE,		/* No pages isolated, continue scanning */
+	ISOLATE_SUCCESS,	/* Pages isolated, migrate */
+} isolate_migrate_t;
+
+/*
+ * Isolate all pages that can be migrated from the block pointed to by
+ * the migrate scanner within compact_control.
+ */
+static isolate_migrate_t isolate_migratepages(struct zone *zone,
+					struct compact_control *cc)
+{
+	unsigned long low_pfn, end_pfn;
+
+	/* Do not scan outside zone boundaries */
+	low_pfn = max(cc->migrate_pfn, zone->zone_start_pfn);
+
+	/* Only scan within a pageblock boundary */
+	end_pfn = ALIGN(low_pfn + pageblock_nr_pages, pageblock_nr_pages);
+
+	/* Do not cross the free scanner or scan within a memory hole */
+	if (end_pfn > cc->free_pfn || !pfn_valid(low_pfn)) {
+		cc->migrate_pfn = end_pfn;
+		return ISOLATE_NONE;
+	}
+
+	/* Perform the isolation */
+	low_pfn = isolate_migratepages_range(zone, cc, low_pfn, end_pfn);
+	if (!low_pfn)
+		return ISOLATE_ABORT;
+
+	cc->migrate_pfn = low_pfn;
+
+	return ISOLATE_SUCCESS;
+}
+
 static int compact_finished(struct zone *zone,
 			    struct compact_control *cc)
 {
@@ -811,3 +795,5 @@ void compaction_unregister_node(struct node *node)
 	return sysdev_remove_file(&node->sysdev, &attr_compact);
 }
 #endif /* CONFIG_SYSFS && CONFIG_NUMA */
+
+#endif /* CONFIG_COMPACTION */
diff --git a/mm/internal.h b/mm/internal.h
index 2189af4..4a226d8 100644
--- a/mm/internal.h
+++ b/mm/internal.h
@@ -100,6 +100,41 @@ extern void prep_compound_page(struct page *page, unsigned long order);
 extern bool is_free_buddy_page(struct page *page);
 #endif
 
+#if defined CONFIG_COMPACTION || defined CONFIG_CMA
+
+/*
+ * in mm/compaction.c
+ */
+/*
+ * compact_control is used to track pages being migrated and the free pages
+ * they are being migrated to during memory compaction. The free_pfn starts
+ * at the end of a zone and migrate_pfn begins at the start. Movable pages
+ * are moved to the end of a zone during a compaction run and the run
+ * completes when free_pfn <= migrate_pfn
+ */
+struct compact_control {
+	struct list_head freepages;	/* List of free pages to migrate to */
+	struct list_head migratepages;	/* List of pages being migrated */
+	unsigned long nr_freepages;	/* Number of isolated free pages */
+	unsigned long nr_migratepages;	/* Number of pages to migrate */
+	unsigned long free_pfn;		/* isolate_freepages search base */
+	unsigned long migrate_pfn;	/* isolate_migratepages search base */
+	bool sync;			/* Synchronous migration */
+
+	unsigned int order;		/* order a direct compactor needs */
+	int migratetype;		/* MOVABLE, RECLAIMABLE etc */
+	struct zone *zone;
+};
+
+unsigned long
+isolate_freepages_range(struct zone *zone,
+			unsigned long start, unsigned long end,
+			struct list_head *freelist);
+unsigned long
+isolate_migratepages_range(struct zone *zone, struct compact_control *cc,
+			   unsigned long low_pfn, unsigned long end_pfn);
+
+#endif
 
 /*
  * function for dealing with page's order in buddy system.
-- 
1.7.1.569.g6f426


^ permalink raw reply related	[flat|nested] 33+ messages in thread

* [PATCH 04/11] mm: page_alloc: introduce alloc_contig_range()
  2011-12-29 12:39 [PATCHv18 0/11] Contiguous Memory Allocator Marek Szyprowski
                   ` (2 preceding siblings ...)
  2011-12-29 12:39 ` [PATCH 03/11] mm: compaction: export some of the functions Marek Szyprowski
@ 2011-12-29 12:39 ` Marek Szyprowski
  2012-01-10 14:16   ` Mel Gorman
  2012-01-17 21:54   ` [Linaro-mm-sig] " sandeep patil
  2011-12-29 12:39 ` [PATCH 05/11] mm: mmzone: MIGRATE_CMA migration type added Marek Szyprowski
                   ` (7 subsequent siblings)
  11 siblings, 2 replies; 33+ messages in thread
From: Marek Szyprowski @ 2011-12-29 12:39 UTC (permalink / raw)
  To: linux-kernel, linux-arm-kernel, linux-media, linux-mm, linaro-mm-sig
  Cc: Michal Nazarewicz, Marek Szyprowski, Kyungmin Park, Russell King,
	Andrew Morton, KAMEZAWA Hiroyuki, Daniel Walker, Mel Gorman,
	Arnd Bergmann, Jesse Barker, Jonathan Corbet, Shariq Hasnain,
	Chunsang Jeong, Dave Hansen, Benjamin Gaignard

From: Michal Nazarewicz <mina86@mina86.com>

This commit adds the alloc_contig_range() function which tries
to allocate given range of pages.  It tries to migrate all
already allocated pages that fall in the range thus freeing them.
Once all pages in the range are freed they are removed from the
buddy system thus allocated for the caller to use.

__alloc_contig_migrate_range() borrows some code from KAMEZAWA
Hiroyuki's __alloc_contig_pages().

Signed-off-by: Michal Nazarewicz <mina86@mina86.com>
Signed-off-by: Marek Szyprowski <m.szyprowski@samsung.com>
---
 include/linux/page-isolation.h |    3 +
 mm/page_alloc.c                |  190 ++++++++++++++++++++++++++++++++++++++++
 2 files changed, 193 insertions(+), 0 deletions(-)

diff --git a/include/linux/page-isolation.h b/include/linux/page-isolation.h
index 051c1b1..d305080 100644
--- a/include/linux/page-isolation.h
+++ b/include/linux/page-isolation.h
@@ -33,5 +33,8 @@ test_pages_isolated(unsigned long start_pfn, unsigned long end_pfn);
 extern int set_migratetype_isolate(struct page *page);
 extern void unset_migratetype_isolate(struct page *page);
 
+/* The below functions must be run on a range from a single zone. */
+int alloc_contig_range(unsigned long start, unsigned long end);
+void free_contig_range(unsigned long pfn, unsigned nr_pages);
 
 #endif
diff --git a/mm/page_alloc.c b/mm/page_alloc.c
index f88b320..47b0a85 100644
--- a/mm/page_alloc.c
+++ b/mm/page_alloc.c
@@ -57,6 +57,7 @@
 #include <linux/ftrace_event.h>
 #include <linux/memcontrol.h>
 #include <linux/prefetch.h>
+#include <linux/migrate.h>
 
 #include <asm/tlbflush.h>
 #include <asm/div64.h>
@@ -5711,6 +5712,195 @@ out:
 	spin_unlock_irqrestore(&zone->lock, flags);
 }
 
+static unsigned long pfn_align_to_maxpage_down(unsigned long pfn)
+{
+	return pfn & ~(MAX_ORDER_NR_PAGES - 1);
+}
+
+static unsigned long pfn_align_to_maxpage_up(unsigned long pfn)
+{
+	return ALIGN(pfn, MAX_ORDER_NR_PAGES);
+}
+
+static struct page *
+__cma_migrate_alloc(struct page *page, unsigned long private, int **resultp)
+{
+	return alloc_page(GFP_HIGHUSER_MOVABLE);
+}
+
+static int __alloc_contig_migrate_range(unsigned long start, unsigned long end)
+{
+	/* This function is based on compact_zone() from compaction.c. */
+
+	unsigned long pfn = start;
+	int ret = -EBUSY;
+	unsigned tries = 0;
+
+	struct compact_control cc = {
+		.nr_migratepages = 0,
+		.order = -1,
+		.zone = page_zone(pfn_to_page(start)),
+		.sync = true,
+	};
+	INIT_LIST_HEAD(&cc.migratepages);
+
+	migrate_prep_local();
+
+	while (pfn < end || cc.nr_migratepages) {
+		/* Abort on signal */
+		if (fatal_signal_pending(current)) {
+			ret = -EINTR;
+			goto done;
+		}
+
+		/* Get some pages to migrate. */
+		if (list_empty(&cc.migratepages)) {
+			cc.nr_migratepages = 0;
+			pfn = isolate_migratepages_range(cc.zone, &cc,
+							 pfn, end);
+			if (!pfn) {
+				ret = -EINTR;
+				goto done;
+			}
+			tries = 0;
+		}
+
+		/* Try to migrate. */
+		ret = migrate_pages(&cc.migratepages, __cma_migrate_alloc,
+				    0, false, true);
+
+		/* Migrated all of them? Great! */
+		if (list_empty(&cc.migratepages))
+			continue;
+
+		/* Try five times. */
+		if (++tries == 5) {
+			ret = ret < 0 ? ret : -EBUSY;
+			goto done;
+		}
+
+		/* Before each time drain everything and reschedule. */
+		lru_add_drain_all();
+		drain_all_pages();
+		cond_resched();
+	}
+	ret = 0;
+
+done:
+	/* Make sure all pages are isolated. */
+	if (!ret) {
+		lru_add_drain_all();
+		drain_all_pages();
+		if (WARN_ON(test_pages_isolated(start, end)))
+			ret = -EBUSY;
+	}
+
+	/* Release pages */
+	putback_lru_pages(&cc.migratepages);
+
+	return ret;
+}
+
+/**
+ * alloc_contig_range() -- tries to allocate given range of pages
+ * @start:	start PFN to allocate
+ * @end:	one-past-the-last PFN to allocate
+ *
+ * The PFN range does not have to be pageblock or MAX_ORDER_NR_PAGES
+ * aligned, hovewer it's callers responsibility to guarantee that we
+ * are the only thread that changes migrate type of pageblocks the
+ * pages fall in.
+ *
+ * Returns zero on success or negative error code.  On success all
+ * pages which PFN is in (start, end) are allocated for the caller and
+ * need to be freed with free_contig_range().
+ */
+int alloc_contig_range(unsigned long start, unsigned long end)
+{
+	unsigned long outer_start, outer_end;
+	int ret;
+
+	/*
+	 * What we do here is we mark all pageblocks in range as
+	 * MIGRATE_ISOLATE.  Because of the way page allocator work, we
+	 * align the range to MAX_ORDER pages so that page allocator
+	 * won't try to merge buddies from different pageblocks and
+	 * change MIGRATE_ISOLATE to some other migration type.
+	 *
+	 * Once the pageblocks are marked as MIGRATE_ISOLATE, we
+	 * migrate the pages from an unaligned range (ie. pages that
+	 * we are interested in).  This will put all the pages in
+	 * range back to page allocator as MIGRATE_ISOLATE.
+	 *
+	 * When this is done, we take the pages in range from page
+	 * allocator removing them from the buddy system.  This way
+	 * page allocator will never consider using them.
+	 *
+	 * This lets us mark the pageblocks back as
+	 * MIGRATE_CMA/MIGRATE_MOVABLE so that free pages in the
+	 * MAX_ORDER aligned range but not in the unaligned, original
+	 * range are put back to page allocator so that buddy can use
+	 * them.
+	 */
+
+	ret = start_isolate_page_range(pfn_align_to_maxpage_down(start),
+				       pfn_align_to_maxpage_up(end));
+	if (ret)
+		goto done;
+
+	ret = __alloc_contig_migrate_range(start, end);
+	if (ret)
+		goto done;
+
+	/*
+	 * Pages from [start, end) are within a MAX_ORDER_NR_PAGES
+	 * aligned blocks that are marked as MIGRATE_ISOLATE.  What's
+	 * more, all pages in [start, end) are free in page allocator.
+	 * What we are going to do is to allocate all pages from
+	 * [start, end) (that is remove them from page allocater).
+	 *
+	 * The only problem is that pages at the beginning and at the
+	 * end of interesting range may be not aligned with pages that
+	 * page allocator holds, ie. they can be part of higher order
+	 * pages.  Because of this, we reserve the bigger range and
+	 * once this is done free the pages we are not interested in.
+	 */
+
+	ret = 0;
+	while (!PageBuddy(pfn_to_page(start & (~0UL << ret))))
+		if (WARN_ON(++ret >= MAX_ORDER)) {
+			ret = -EINVAL;
+			goto done;
+		}
+
+	outer_start = start & (~0UL << ret);
+	outer_end = isolate_freepages_range(page_zone(pfn_to_page(outer_start)),
+					    outer_start, end, NULL);
+	if (!outer_end) {
+		ret = -EBUSY;
+		goto done;
+	}
+	outer_end += outer_start;
+
+	/* Free head and tail (if any) */
+	if (start != outer_start)
+		free_contig_range(outer_start, start - outer_start);
+	if (end != outer_end)
+		free_contig_range(end, outer_end - end);
+
+	ret = 0;
+done:
+	undo_isolate_page_range(pfn_align_to_maxpage_down(start),
+				pfn_align_to_maxpage_up(end));
+	return ret;
+}
+
+void free_contig_range(unsigned long pfn, unsigned nr_pages)
+{
+	for (; nr_pages--; ++pfn)
+		__free_page(pfn_to_page(pfn));
+}
+
 #ifdef CONFIG_MEMORY_HOTREMOVE
 /*
  * All pages in the range must be isolated before calling this.
-- 
1.7.1.569.g6f426


^ permalink raw reply related	[flat|nested] 33+ messages in thread

* [PATCH 05/11] mm: mmzone: MIGRATE_CMA migration type added
  2011-12-29 12:39 [PATCHv18 0/11] Contiguous Memory Allocator Marek Szyprowski
                   ` (3 preceding siblings ...)
  2011-12-29 12:39 ` [PATCH 04/11] mm: page_alloc: introduce alloc_contig_range() Marek Szyprowski
@ 2011-12-29 12:39 ` Marek Szyprowski
  2012-01-10 14:38   ` Mel Gorman
  2011-12-29 12:39 ` [PATCH 06/11] mm: page_isolation: MIGRATE_CMA isolation functions added Marek Szyprowski
                   ` (6 subsequent siblings)
  11 siblings, 1 reply; 33+ messages in thread
From: Marek Szyprowski @ 2011-12-29 12:39 UTC (permalink / raw)
  To: linux-kernel, linux-arm-kernel, linux-media, linux-mm, linaro-mm-sig
  Cc: Michal Nazarewicz, Marek Szyprowski, Kyungmin Park, Russell King,
	Andrew Morton, KAMEZAWA Hiroyuki, Daniel Walker, Mel Gorman,
	Arnd Bergmann, Jesse Barker, Jonathan Corbet, Shariq Hasnain,
	Chunsang Jeong, Dave Hansen, Benjamin Gaignard

From: Michal Nazarewicz <mina86@mina86.com>

The MIGRATE_CMA migration type has two main characteristics:
(i) only movable pages can be allocated from MIGRATE_CMA
pageblocks and (ii) page allocator will never change migration
type of MIGRATE_CMA pageblocks.

This guarantees (to some degree) that page in a MIGRATE_CMA page
block can always be migrated somewhere else (unless there's no
memory left in the system).

It is designed to be used for allocating big chunks (eg. 10MiB)
of physically contiguous memory.  Once driver requests
contiguous memory, pages from MIGRATE_CMA pageblocks may be
migrated away to create a contiguous block.

To minimise number of migrations, MIGRATE_CMA migration type
is the last type tried when page allocator falls back to other
migration types then requested.

Signed-off-by: Michal Nazarewicz <mina86@mina86.com>
[m.szyprowski: removed CONFIG_CMA_MIGRATE_TYPE]
Signed-off-by: Marek Szyprowski <m.szyprowski@samsung.com>
Signed-off-by: Kyungmin Park <kyungmin.park@samsung.com>
---
 include/linux/mmzone.h         |   41 ++++++++++++++++++++----
 include/linux/page-isolation.h |    3 ++
 mm/Kconfig                     |    2 +-
 mm/compaction.c                |   11 +++++--
 mm/page_alloc.c                |   68 ++++++++++++++++++++++++++++++---------
 mm/vmstat.c                    |    1 +
 6 files changed, 99 insertions(+), 27 deletions(-)

diff --git a/include/linux/mmzone.h b/include/linux/mmzone.h
index 188cb2f..e38b85d 100644
--- a/include/linux/mmzone.h
+++ b/include/linux/mmzone.h
@@ -35,13 +35,35 @@
  */
 #define PAGE_ALLOC_COSTLY_ORDER 3
 
-#define MIGRATE_UNMOVABLE     0
-#define MIGRATE_RECLAIMABLE   1
-#define MIGRATE_MOVABLE       2
-#define MIGRATE_PCPTYPES      3 /* the number of types on the pcp lists */
-#define MIGRATE_RESERVE       3
-#define MIGRATE_ISOLATE       4 /* can't allocate from here */
-#define MIGRATE_TYPES         5
+enum {
+	MIGRATE_UNMOVABLE,
+	MIGRATE_RECLAIMABLE,
+	MIGRATE_MOVABLE,
+	MIGRATE_PCPTYPES,	/* the number of types on the pcp lists */
+	MIGRATE_RESERVE = MIGRATE_PCPTYPES,
+	/*
+	 * MIGRATE_CMA migration type is designed to mimic the way
+	 * ZONE_MOVABLE works.  Only movable pages can be allocated
+	 * from MIGRATE_CMA pageblocks and page allocator never
+	 * implicitly change migration type of MIGRATE_CMA pageblock.
+	 *
+	 * The way to use it is to change migratetype of a range of
+	 * pageblocks to MIGRATE_CMA which can be done by
+	 * __free_pageblock_cma() function.  What is important though
+	 * is that a range of pageblocks must be aligned to
+	 * MAX_ORDER_NR_PAGES should biggest page be bigger then
+	 * a single pageblock.
+	 */
+	MIGRATE_CMA,
+	MIGRATE_ISOLATE,	/* can't allocate from here */
+	MIGRATE_TYPES
+};
+
+#ifdef CONFIG_CMA
+#  define is_migrate_cma(migratetype) unlikely((migratetype) == MIGRATE_CMA)
+#else
+#  define is_migrate_cma(migratetype) false
+#endif
 
 #define for_each_migratetype_order(order, type) \
 	for (order = 0; order < MAX_ORDER; order++) \
@@ -54,6 +76,11 @@ static inline int get_pageblock_migratetype(struct page *page)
 	return get_pageblock_flags_group(page, PB_migrate, PB_migrate_end);
 }
 
+static inline bool is_pageblock_cma(struct page *page)
+{
+	return is_migrate_cma(get_pageblock_migratetype(page));
+}
+
 struct free_area {
 	struct list_head	free_list[MIGRATE_TYPES];
 	unsigned long		nr_free;
diff --git a/include/linux/page-isolation.h b/include/linux/page-isolation.h
index d305080..af650db 100644
--- a/include/linux/page-isolation.h
+++ b/include/linux/page-isolation.h
@@ -37,4 +37,7 @@ extern void unset_migratetype_isolate(struct page *page);
 int alloc_contig_range(unsigned long start, unsigned long end);
 void free_contig_range(unsigned long pfn, unsigned nr_pages);
 
+/* CMA stuff */
+extern void init_cma_reserved_pageblock(struct page *page);
+
 #endif
diff --git a/mm/Kconfig b/mm/Kconfig
index 011b110..e080cac 100644
--- a/mm/Kconfig
+++ b/mm/Kconfig
@@ -192,7 +192,7 @@ config COMPACTION
 config MIGRATION
 	bool "Page migration"
 	def_bool y
-	depends on NUMA || ARCH_ENABLE_MEMORY_HOTREMOVE || COMPACTION
+	depends on NUMA || ARCH_ENABLE_MEMORY_HOTREMOVE || COMPACTION || CMA
 	help
 	  Allows the migration of the physical location of pages of processes
 	  while the virtual addresses are not changed. This is useful in
diff --git a/mm/compaction.c b/mm/compaction.c
index 8733441..46783b4 100644
--- a/mm/compaction.c
+++ b/mm/compaction.c
@@ -21,6 +21,11 @@
 #define CREATE_TRACE_POINTS
 #include <trace/events/compaction.h>
 
+static inline bool is_migrate_cma_or_movable(int migratetype)
+{
+	return is_migrate_cma(migratetype) || migratetype == MIGRATE_MOVABLE;
+}
+
 /**
  * isolate_freepages_range() - isolate free pages, must hold zone->lock.
  * @zone:	Zone pages are in.
@@ -213,7 +218,7 @@ isolate_migratepages_range(struct zone *zone, struct compact_control *cc,
 		 */
 		pageblock_nr = low_pfn >> pageblock_order;
 		if (!cc->sync && last_pageblock_nr != pageblock_nr &&
-				get_pageblock_migratetype(page) != MIGRATE_MOVABLE) {
+		    is_migrate_cma_or_movable(get_pageblock_migratetype(page))) {
 			low_pfn += pageblock_nr_pages;
 			low_pfn = ALIGN(low_pfn, pageblock_nr_pages) - 1;
 			last_pageblock_nr = pageblock_nr;
@@ -295,8 +300,8 @@ static bool suitable_migration_target(struct page *page)
 	if (PageBuddy(page) && page_order(page) >= pageblock_order)
 		return true;
 
-	/* If the block is MIGRATE_MOVABLE, allow migration */
-	if (migratetype == MIGRATE_MOVABLE)
+	/* If the block is MIGRATE_MOVABLE or MIGRATE_CMA, allow migration */
+	if (is_migrate_cma_or_movable(migratetype))
 		return true;
 
 	/* Otherwise skip the block */
diff --git a/mm/page_alloc.c b/mm/page_alloc.c
index 47b0a85..06a7861 100644
--- a/mm/page_alloc.c
+++ b/mm/page_alloc.c
@@ -722,6 +722,26 @@ void __meminit __free_pages_bootmem(struct page *page, unsigned int order)
 	}
 }
 
+#ifdef CONFIG_CMA
+/*
+ * Free whole pageblock and set it's migration type to MIGRATE_CMA.
+ */
+void __init init_cma_reserved_pageblock(struct page *page)
+{
+	unsigned i = pageblock_nr_pages;
+	struct page *p = page;
+
+	do {
+		__ClearPageReserved(p);
+		set_page_count(p, 0);
+	} while (++p, --i);
+
+	set_page_refcounted(page);
+	set_pageblock_migratetype(page, MIGRATE_CMA);
+	__free_pages(page, pageblock_order);
+	totalram_pages += pageblock_nr_pages;
+}
+#endif
 
 /*
  * The order of subdivision here is critical for the IO subsystem.
@@ -830,11 +850,10 @@ struct page *__rmqueue_smallest(struct zone *zone, unsigned int order,
  * This array describes the order lists are fallen back to when
  * the free lists for the desirable migrate type are depleted
  */
-static int fallbacks[MIGRATE_TYPES][MIGRATE_TYPES-1] = {
+static int fallbacks[MIGRATE_PCPTYPES][4] = {
 	[MIGRATE_UNMOVABLE]   = { MIGRATE_RECLAIMABLE, MIGRATE_MOVABLE,   MIGRATE_RESERVE },
 	[MIGRATE_RECLAIMABLE] = { MIGRATE_UNMOVABLE,   MIGRATE_MOVABLE,   MIGRATE_RESERVE },
-	[MIGRATE_MOVABLE]     = { MIGRATE_RECLAIMABLE, MIGRATE_UNMOVABLE, MIGRATE_RESERVE },
-	[MIGRATE_RESERVE]     = { MIGRATE_RESERVE,     MIGRATE_RESERVE,   MIGRATE_RESERVE }, /* Never used */
+	[MIGRATE_MOVABLE]     = { MIGRATE_RECLAIMABLE, MIGRATE_UNMOVABLE, MIGRATE_CMA    , MIGRATE_RESERVE },
 };
 
 /*
@@ -929,12 +948,12 @@ __rmqueue_fallback(struct zone *zone, int order, int start_migratetype)
 	/* Find the largest possible block of pages in the other list */
 	for (current_order = MAX_ORDER-1; current_order >= order;
 						--current_order) {
-		for (i = 0; i < MIGRATE_TYPES - 1; i++) {
+		for (i = 0; i < ARRAY_SIZE(fallbacks[0]); i++) {
 			migratetype = fallbacks[start_migratetype][i];
 
 			/* MIGRATE_RESERVE handled later if necessary */
 			if (migratetype == MIGRATE_RESERVE)
-				continue;
+				break;
 
 			area = &(zone->free_area[current_order]);
 			if (list_empty(&area->free_list[migratetype]))
@@ -949,11 +968,18 @@ __rmqueue_fallback(struct zone *zone, int order, int start_migratetype)
 			 * pages to the preferred allocation list. If falling
 			 * back for a reclaimable kernel allocation, be more
 			 * aggressive about taking ownership of free pages
+			 *
+			 * On the other hand, never change migration
+			 * type of MIGRATE_CMA pageblocks nor move CMA
+			 * pages on different free lists. We don't
+			 * want unmovable pages to be allocated from
+			 * MIGRATE_CMA areas.
 			 */
-			if (unlikely(current_order >= (pageblock_order >> 1)) ||
-					start_migratetype == MIGRATE_RECLAIMABLE ||
-					page_group_by_mobility_disabled) {
-				unsigned long pages;
+			if (!is_pageblock_cma(page) &&
+			    (unlikely(current_order >= pageblock_order / 2) ||
+			     start_migratetype == MIGRATE_RECLAIMABLE ||
+			     page_group_by_mobility_disabled)) {
+				int pages;
 				pages = move_freepages_block(zone, page,
 								start_migratetype);
 
@@ -971,11 +997,14 @@ __rmqueue_fallback(struct zone *zone, int order, int start_migratetype)
 			rmv_page_order(page);
 
 			/* Take ownership for orders >= pageblock_order */
-			if (current_order >= pageblock_order)
+			if (current_order >= pageblock_order &&
+			    !is_pageblock_cma(page))
 				change_pageblock_range(page, current_order,
 							start_migratetype);
 
-			expand(zone, page, order, current_order, area, migratetype);
+			expand(zone, page, order, current_order, area,
+			       is_migrate_cma(start_migratetype)
+			     ? start_migratetype : migratetype);
 
 			trace_mm_page_alloc_extfrag(page, order, current_order,
 				start_migratetype, migratetype);
@@ -1047,7 +1076,10 @@ static int rmqueue_bulk(struct zone *zone, unsigned int order,
 			list_add(&page->lru, list);
 		else
 			list_add_tail(&page->lru, list);
-		set_page_private(page, migratetype);
+		if (is_pageblock_cma(page))
+			set_page_private(page, MIGRATE_CMA);
+		else
+			set_page_private(page, migratetype);
 		list = &page->lru;
 	}
 	__mod_zone_page_state(zone, NR_FREE_PAGES, -(i << order));
@@ -1308,8 +1340,12 @@ int split_free_page(struct page *page)
 
 	if (order >= pageblock_order - 1) {
 		struct page *endpage = page + (1 << order) - 1;
-		for (; page < endpage; page += pageblock_nr_pages)
-			set_pageblock_migratetype(page, MIGRATE_MOVABLE);
+		for (; page < endpage; page += pageblock_nr_pages) {
+			int mt = get_pageblock_migratetype(page);
+			if (mt != MIGRATE_ISOLATE && !is_migrate_cma(mt))
+				set_pageblock_migratetype(page,
+							  MIGRATE_MOVABLE);
+		}
 	}
 
 	return 1 << order;
@@ -5599,8 +5635,8 @@ __count_immobile_pages(struct zone *zone, struct page *page, int count)
 	 */
 	if (zone_idx(zone) == ZONE_MOVABLE)
 		return true;
-
-	if (get_pageblock_migratetype(page) == MIGRATE_MOVABLE)
+	if (get_pageblock_migratetype(page) == MIGRATE_MOVABLE ||
+	    is_pageblock_cma(page))
 		return true;
 
 	pfn = page_to_pfn(page);
diff --git a/mm/vmstat.c b/mm/vmstat.c
index 8fd603b..9fb1789 100644
--- a/mm/vmstat.c
+++ b/mm/vmstat.c
@@ -613,6 +613,7 @@ static char * const migratetype_names[MIGRATE_TYPES] = {
 	"Reclaimable",
 	"Movable",
 	"Reserve",
+	"Cma",
 	"Isolate",
 };
 
-- 
1.7.1.569.g6f426


^ permalink raw reply related	[flat|nested] 33+ messages in thread

* [PATCH 06/11] mm: page_isolation: MIGRATE_CMA isolation functions added
  2011-12-29 12:39 [PATCHv18 0/11] Contiguous Memory Allocator Marek Szyprowski
                   ` (4 preceding siblings ...)
  2011-12-29 12:39 ` [PATCH 05/11] mm: mmzone: MIGRATE_CMA migration type added Marek Szyprowski
@ 2011-12-29 12:39 ` Marek Szyprowski
  2011-12-29 12:39 ` [PATCH 07/11] mm: add optional memory reclaim in split_free_page() Marek Szyprowski
                   ` (5 subsequent siblings)
  11 siblings, 0 replies; 33+ messages in thread
From: Marek Szyprowski @ 2011-12-29 12:39 UTC (permalink / raw)
  To: linux-kernel, linux-arm-kernel, linux-media, linux-mm, linaro-mm-sig
  Cc: Michal Nazarewicz, Marek Szyprowski, Kyungmin Park, Russell King,
	Andrew Morton, KAMEZAWA Hiroyuki, Daniel Walker, Mel Gorman,
	Arnd Bergmann, Jesse Barker, Jonathan Corbet, Shariq Hasnain,
	Chunsang Jeong, Dave Hansen, Benjamin Gaignard

From: Michal Nazarewicz <mina86@mina86.com>

This commit changes various functions that change pages and
pageblocks migrate type between MIGRATE_ISOLATE and
MIGRATE_MOVABLE in such a way as to allow to work with
MIGRATE_CMA migrate type.

Signed-off-by: Michal Nazarewicz <mina86@mina86.com>
Signed-off-by: Marek Szyprowski <m.szyprowski@samsung.com>
---
 include/linux/page-isolation.h |   23 ++++++++++++-----------
 mm/memory-failure.c            |    2 +-
 mm/memory_hotplug.c            |    6 +++---
 mm/page_alloc.c                |   18 ++++++++++++------
 mm/page_isolation.c            |   15 ++++++++-------
 5 files changed, 36 insertions(+), 28 deletions(-)

diff --git a/include/linux/page-isolation.h b/include/linux/page-isolation.h
index af650db..c8004dc 100644
--- a/include/linux/page-isolation.h
+++ b/include/linux/page-isolation.h
@@ -3,7 +3,7 @@
 
 /*
  * Changes migrate type in [start_pfn, end_pfn) to be MIGRATE_ISOLATE.
- * If specified range includes migrate types other than MOVABLE,
+ * If specified range includes migrate types other than MOVABLE or CMA,
  * this will fail with -EBUSY.
  *
  * For isolating all pages in the range finally, the caller have to
@@ -11,30 +11,31 @@
  * test it.
  */
 extern int
-start_isolate_page_range(unsigned long start_pfn, unsigned long end_pfn);
+start_isolate_page_range(unsigned long start_pfn, unsigned long end_pfn,
+			 unsigned migratetype);
 
 /*
  * Changes MIGRATE_ISOLATE to MIGRATE_MOVABLE.
  * target range is [start_pfn, end_pfn)
  */
 extern int
-undo_isolate_page_range(unsigned long start_pfn, unsigned long end_pfn);
+undo_isolate_page_range(unsigned long start_pfn, unsigned long end_pfn,
+			unsigned migratetype);
 
 /*
- * test all pages in [start_pfn, end_pfn)are isolated or not.
+ * Test all pages in [start_pfn, end_pfn) are isolated or not.
  */
-extern int
-test_pages_isolated(unsigned long start_pfn, unsigned long end_pfn);
+int test_pages_isolated(unsigned long start_pfn, unsigned long end_pfn);
 
 /*
- * Internal funcs.Changes pageblock's migrate type.
- * Please use make_pagetype_isolated()/make_pagetype_movable().
+ * Internal functions. Changes pageblock's migrate type.
  */
-extern int set_migratetype_isolate(struct page *page);
-extern void unset_migratetype_isolate(struct page *page);
+int set_migratetype_isolate(struct page *page);
+void unset_migratetype_isolate(struct page *page, unsigned migratetype);
 
 /* The below functions must be run on a range from a single zone. */
-int alloc_contig_range(unsigned long start, unsigned long end);
+int alloc_contig_range(unsigned long start, unsigned long end,
+		       unsigned migratetype);
 void free_contig_range(unsigned long pfn, unsigned nr_pages);
 
 /* CMA stuff */
diff --git a/mm/memory-failure.c b/mm/memory-failure.c
index 06d3479..3cf809d 100644
--- a/mm/memory-failure.c
+++ b/mm/memory-failure.c
@@ -1400,7 +1400,7 @@ static int get_any_page(struct page *p, unsigned long pfn, int flags)
 		/* Not a free page */
 		ret = 1;
 	}
-	unset_migratetype_isolate(p);
+	unset_migratetype_isolate(p, MIGRATE_MOVABLE);
 	unlock_memory_hotplug();
 	return ret;
 }
diff --git a/mm/memory_hotplug.c b/mm/memory_hotplug.c
index 2168489..2a44d55 100644
--- a/mm/memory_hotplug.c
+++ b/mm/memory_hotplug.c
@@ -891,7 +891,7 @@ static int __ref offline_pages(unsigned long start_pfn,
 	nr_pages = end_pfn - start_pfn;
 
 	/* set above range as isolated */
-	ret = start_isolate_page_range(start_pfn, end_pfn);
+	ret = start_isolate_page_range(start_pfn, end_pfn, MIGRATE_MOVABLE);
 	if (ret)
 		goto out;
 
@@ -956,7 +956,7 @@ repeat:
 	   We cannot do rollback at this point. */
 	offline_isolated_pages(start_pfn, end_pfn);
 	/* reset pagetype flags and makes migrate type to be MOVABLE */
-	undo_isolate_page_range(start_pfn, end_pfn);
+	undo_isolate_page_range(start_pfn, end_pfn, MIGRATE_MOVABLE);
 	/* removal success */
 	zone->present_pages -= offlined_pages;
 	zone->zone_pgdat->node_present_pages -= offlined_pages;
@@ -981,7 +981,7 @@ failed_removal:
 		start_pfn, end_pfn);
 	memory_notify(MEM_CANCEL_OFFLINE, &arg);
 	/* pushback to free area */
-	undo_isolate_page_range(start_pfn, end_pfn);
+	undo_isolate_page_range(start_pfn, end_pfn, MIGRATE_MOVABLE);
 
 out:
 	unlock_memory_hotplug();
diff --git a/mm/page_alloc.c b/mm/page_alloc.c
index 06a7861..8b47c85 100644
--- a/mm/page_alloc.c
+++ b/mm/page_alloc.c
@@ -5734,7 +5734,7 @@ out:
 	return ret;
 }
 
-void unset_migratetype_isolate(struct page *page)
+void unset_migratetype_isolate(struct page *page, unsigned migratetype)
 {
 	struct zone *zone;
 	unsigned long flags;
@@ -5742,8 +5742,8 @@ void unset_migratetype_isolate(struct page *page)
 	spin_lock_irqsave(&zone->lock, flags);
 	if (get_pageblock_migratetype(page) != MIGRATE_ISOLATE)
 		goto out;
-	set_pageblock_migratetype(page, MIGRATE_MOVABLE);
-	move_freepages_block(zone, page, MIGRATE_MOVABLE);
+	set_pageblock_migratetype(page, migratetype);
+	move_freepages_block(zone, page, migratetype);
 out:
 	spin_unlock_irqrestore(&zone->lock, flags);
 }
@@ -5841,6 +5841,10 @@ done:
  * alloc_contig_range() -- tries to allocate given range of pages
  * @start:	start PFN to allocate
  * @end:	one-past-the-last PFN to allocate
+ * @migratetype:	migratetype of the underlaying pageblocks (either
+ *			#MIGRATE_MOVABLE or #MIGRATE_CMA).  All pageblocks
+ *			in range must have the same migratetype and it must
+ *			be either of the two.
  *
  * The PFN range does not have to be pageblock or MAX_ORDER_NR_PAGES
  * aligned, hovewer it's callers responsibility to guarantee that we
@@ -5851,7 +5855,8 @@ done:
  * pages which PFN is in (start, end) are allocated for the caller and
  * need to be freed with free_contig_range().
  */
-int alloc_contig_range(unsigned long start, unsigned long end)
+int alloc_contig_range(unsigned long start, unsigned long end,
+		       unsigned migratetype)
 {
 	unsigned long outer_start, outer_end;
 	int ret;
@@ -5880,7 +5885,8 @@ int alloc_contig_range(unsigned long start, unsigned long end)
 	 */
 
 	ret = start_isolate_page_range(pfn_align_to_maxpage_down(start),
-				       pfn_align_to_maxpage_up(end));
+				       pfn_align_to_maxpage_up(end),
+				       migratetype);
 	if (ret)
 		goto done;
 
@@ -5927,7 +5933,7 @@ int alloc_contig_range(unsigned long start, unsigned long end)
 	ret = 0;
 done:
 	undo_isolate_page_range(pfn_align_to_maxpage_down(start),
-				pfn_align_to_maxpage_up(end));
+				pfn_align_to_maxpage_up(end), migratetype);
 	return ret;
 }
 
diff --git a/mm/page_isolation.c b/mm/page_isolation.c
index 4ae42bb..c9f0477 100644
--- a/mm/page_isolation.c
+++ b/mm/page_isolation.c
@@ -24,6 +24,7 @@ __first_valid_page(unsigned long pfn, unsigned long nr_pages)
  * to be MIGRATE_ISOLATE.
  * @start_pfn: The lower PFN of the range to be isolated.
  * @end_pfn: The upper PFN of the range to be isolated.
+ * @migratetype: migrate type to set in error recovery.
  *
  * Making page-allocation-type to be MIGRATE_ISOLATE means free pages in
  * the range will never be allocated. Any free pages and pages freed in the
@@ -32,8 +33,8 @@ __first_valid_page(unsigned long pfn, unsigned long nr_pages)
  * start_pfn/end_pfn must be aligned to pageblock_order.
  * Returns 0 on success and -EBUSY if any part of range cannot be isolated.
  */
-int
-start_isolate_page_range(unsigned long start_pfn, unsigned long end_pfn)
+int start_isolate_page_range(unsigned long start_pfn, unsigned long end_pfn,
+			     unsigned migratetype)
 {
 	unsigned long pfn;
 	unsigned long undo_pfn;
@@ -56,7 +57,7 @@ undo:
 	for (pfn = start_pfn;
 	     pfn < undo_pfn;
 	     pfn += pageblock_nr_pages)
-		unset_migratetype_isolate(pfn_to_page(pfn));
+		unset_migratetype_isolate(pfn_to_page(pfn), migratetype);
 
 	return -EBUSY;
 }
@@ -64,8 +65,8 @@ undo:
 /*
  * Make isolated pages available again.
  */
-int
-undo_isolate_page_range(unsigned long start_pfn, unsigned long end_pfn)
+int undo_isolate_page_range(unsigned long start_pfn, unsigned long end_pfn,
+			    unsigned migratetype)
 {
 	unsigned long pfn;
 	struct page *page;
@@ -77,7 +78,7 @@ undo_isolate_page_range(unsigned long start_pfn, unsigned long end_pfn)
 		page = __first_valid_page(pfn, pageblock_nr_pages);
 		if (!page || get_pageblock_migratetype(page) != MIGRATE_ISOLATE)
 			continue;
-		unset_migratetype_isolate(page);
+		unset_migratetype_isolate(page, migratetype);
 	}
 	return 0;
 }
@@ -86,7 +87,7 @@ undo_isolate_page_range(unsigned long start_pfn, unsigned long end_pfn)
  * all pages in [start_pfn...end_pfn) must be in the same zone.
  * zone->lock must be held before call this.
  *
- * Returns 1 if all pages in the range is isolated.
+ * Returns 1 if all pages in the range are isolated.
  */
 static int
 __test_page_isolated_in_pageblock(unsigned long pfn, unsigned long end_pfn)
-- 
1.7.1.569.g6f426


^ permalink raw reply related	[flat|nested] 33+ messages in thread

* [PATCH 07/11] mm: add optional memory reclaim in split_free_page()
  2011-12-29 12:39 [PATCHv18 0/11] Contiguous Memory Allocator Marek Szyprowski
                   ` (5 preceding siblings ...)
  2011-12-29 12:39 ` [PATCH 06/11] mm: page_isolation: MIGRATE_CMA isolation functions added Marek Szyprowski
@ 2011-12-29 12:39 ` Marek Szyprowski
  2012-01-10 14:45   ` Mel Gorman
  2011-12-29 12:39 ` [PATCH 08/11] drivers: add Contiguous Memory Allocator Marek Szyprowski
                   ` (4 subsequent siblings)
  11 siblings, 1 reply; 33+ messages in thread
From: Marek Szyprowski @ 2011-12-29 12:39 UTC (permalink / raw)
  To: linux-kernel, linux-arm-kernel, linux-media, linux-mm, linaro-mm-sig
  Cc: Michal Nazarewicz, Marek Szyprowski, Kyungmin Park, Russell King,
	Andrew Morton, KAMEZAWA Hiroyuki, Daniel Walker, Mel Gorman,
	Arnd Bergmann, Jesse Barker, Jonathan Corbet, Shariq Hasnain,
	Chunsang Jeong, Dave Hansen, Benjamin Gaignard

split_free_page() function is used by migration code to grab a free page
once the migration has been finished. This function must obey the same
rules as memory allocation functions to keep the correct level of
memory watermarks. Memory compaction code calls it under locks, so it
cannot perform any *_slowpath style reclaim. However when this function
is called by migration code for Contiguous Memory Allocator, the sleeping
context is permitted and the function can make a real reclaim to ensure
that the call succeeds.

The reclaim code is based on the __alloc_page_slowpath() function.

Signed-off-by: Marek Szyprowski <m.szyprowski@samsung.com>
Signed-off-by: Kyungmin Park <kyungmin.park@samsung.com>
CC: Michal Nazarewicz <mina86@mina86.com>
---
 include/linux/mm.h |    2 +-
 mm/compaction.c    |    6 ++--
 mm/internal.h      |    2 +-
 mm/page_alloc.c    |   79 +++++++++++++++++++++++++++++++++++++++------------
 4 files changed, 65 insertions(+), 24 deletions(-)

diff --git a/include/linux/mm.h b/include/linux/mm.h
index 4baadd1..8b15b21 100644
--- a/include/linux/mm.h
+++ b/include/linux/mm.h
@@ -450,7 +450,7 @@ void put_page(struct page *page);
 void put_pages_list(struct list_head *pages);
 
 void split_page(struct page *page, unsigned int order);
-int split_free_page(struct page *page);
+int split_free_page(struct page *page, int force_reclaim);
 
 /*
  * Compound pages have a destructor function.  Provide a
diff --git a/mm/compaction.c b/mm/compaction.c
index 46783b4..d6c5cb7 100644
--- a/mm/compaction.c
+++ b/mm/compaction.c
@@ -46,7 +46,7 @@ static inline bool is_migrate_cma_or_movable(int migratetype)
 unsigned long
 isolate_freepages_range(struct zone *zone,
 			unsigned long start_pfn, unsigned long end_pfn,
-			struct list_head *freelist)
+			struct list_head *freelist, int force_reclaim)
 {
 	unsigned long nr_scanned = 0, total_isolated = 0;
 	unsigned long pfn = start_pfn;
@@ -67,7 +67,7 @@ isolate_freepages_range(struct zone *zone,
 			goto next;
 
 		/* Found a free page, break it into order-0 pages */
-		n = split_free_page(page);
+		n = split_free_page(page, force_reclaim);
 		total_isolated += n;
 		if (freelist) {
 			struct page *p = page;
@@ -376,7 +376,7 @@ static void isolate_freepages(struct zone *zone,
 		if (suitable_migration_target(page)) {
 			end_pfn = min(pfn + pageblock_nr_pages, zone_end_pfn);
 			isolated = isolate_freepages_range(zone, pfn,
-					end_pfn, freelist);
+					end_pfn, freelist, false);
 			nr_freepages += isolated;
 		}
 		spin_unlock_irqrestore(&zone->lock, flags);
diff --git a/mm/internal.h b/mm/internal.h
index 4a226d8..8b80027 100644
--- a/mm/internal.h
+++ b/mm/internal.h
@@ -129,7 +129,7 @@ struct compact_control {
 unsigned long
 isolate_freepages_range(struct zone *zone,
 			unsigned long start, unsigned long end,
-			struct list_head *freelist);
+			struct list_head *freelist, int force_reclaim);
 unsigned long
 isolate_migratepages_range(struct zone *zone, struct compact_control *cc,
 			   unsigned long low_pfn, unsigned long end_pfn);
diff --git a/mm/page_alloc.c b/mm/page_alloc.c
index 8b47c85..a3d5892 100644
--- a/mm/page_alloc.c
+++ b/mm/page_alloc.c
@@ -1302,17 +1302,65 @@ void split_page(struct page *page, unsigned int order)
 		set_page_refcounted(page + i);
 }
 
+static inline
+void wake_all_kswapd(unsigned int order, struct zonelist *zonelist,
+						enum zone_type high_zoneidx,
+						enum zone_type classzone_idx)
+{
+	struct zoneref *z;
+	struct zone *zone;
+
+	for_each_zone_zonelist(zone, z, zonelist, high_zoneidx)
+		wakeup_kswapd(zone, order, classzone_idx);
+}
+
+/*
+ * Trigger memory pressure bump to reclaim at least 2^order of free pages.
+ * Does similar work as it is done in __alloc_pages_slowpath(). Used only
+ * by migration code to allocate contiguous memory range.
+ */
+static int __force_memory_reclaim(int order, struct zone *zone)
+{
+	gfp_t gfp_mask = GFP_HIGHUSER_MOVABLE;
+	enum zone_type high_zoneidx = gfp_zone(gfp_mask);
+	struct zonelist *zonelist = node_zonelist(0, gfp_mask);
+	struct reclaim_state reclaim_state;
+	int did_some_progress = 0;
+
+	wake_all_kswapd(order, zonelist, high_zoneidx, zone_idx(zone));
+
+	/* We now go into synchronous reclaim */
+	cpuset_memory_pressure_bump();
+	current->flags |= PF_MEMALLOC;
+	lockdep_set_current_reclaim_state(gfp_mask);
+	reclaim_state.reclaimed_slab = 0;
+	current->reclaim_state = &reclaim_state;
+
+	did_some_progress = try_to_free_pages(zonelist, order, gfp_mask, NULL);
+
+	current->reclaim_state = NULL;
+	lockdep_clear_current_reclaim_state();
+	current->flags &= ~PF_MEMALLOC;
+
+	cond_resched();
+
+	if (!did_some_progress) {
+		/* Exhausted what can be done so it's blamo time */
+		out_of_memory(zonelist, gfp_mask, order, NULL);
+	}
+	return order;
+}
+
 /*
  * Similar to split_page except the page is already free. As this is only
  * being used for migration, the migratetype of the block also changes.
- * As this is called with interrupts disabled, the caller is responsible
- * for calling arch_alloc_page() and kernel_map_page() after interrupts
- * are enabled.
+ * The caller is responsible for calling arch_alloc_page() and kernel_map_page()
+ * after interrupts are enabled.
  *
  * Note: this is probably too low level an operation for use in drivers.
  * Please consult with lkml before using this in your driver.
  */
-int split_free_page(struct page *page)
+int split_free_page(struct page *page, int force_reclaim)
 {
 	unsigned int order;
 	unsigned long watermark;
@@ -1323,10 +1371,15 @@ int split_free_page(struct page *page)
 	zone = page_zone(page);
 	order = page_order(page);
 
+try_again:
 	/* Obey watermarks as if the page was being allocated */
 	watermark = low_wmark_pages(zone) + (1 << order);
-	if (!zone_watermark_ok(zone, 0, watermark, 0, 0))
-		return 0;
+	if (!zone_watermark_ok(zone, 0, watermark, 0, 0)) {
+		if (force_reclaim && __force_memory_reclaim(order, zone))
+			goto try_again;
+		else
+			return 0;
+	}
 
 	/* Remove page from free list */
 	list_del(&page->lru);
@@ -2086,18 +2139,6 @@ __alloc_pages_high_priority(gfp_t gfp_mask, unsigned int order,
 	return page;
 }
 
-static inline
-void wake_all_kswapd(unsigned int order, struct zonelist *zonelist,
-						enum zone_type high_zoneidx,
-						enum zone_type classzone_idx)
-{
-	struct zoneref *z;
-	struct zone *zone;
-
-	for_each_zone_zonelist(zone, z, zonelist, high_zoneidx)
-		wakeup_kswapd(zone, order, classzone_idx);
-}
-
 static inline int
 gfp_to_alloc_flags(gfp_t gfp_mask)
 {
@@ -5917,7 +5958,7 @@ int alloc_contig_range(unsigned long start, unsigned long end,
 
 	outer_start = start & (~0UL << ret);
 	outer_end = isolate_freepages_range(page_zone(pfn_to_page(outer_start)),
-					    outer_start, end, NULL);
+					    outer_start, end, NULL, true);
 	if (!outer_end) {
 		ret = -EBUSY;
 		goto done;
-- 
1.7.1.569.g6f426


^ permalink raw reply related	[flat|nested] 33+ messages in thread

* [PATCH 08/11] drivers: add Contiguous Memory Allocator
  2011-12-29 12:39 [PATCHv18 0/11] Contiguous Memory Allocator Marek Szyprowski
                   ` (6 preceding siblings ...)
  2011-12-29 12:39 ` [PATCH 07/11] mm: add optional memory reclaim in split_free_page() Marek Szyprowski
@ 2011-12-29 12:39 ` Marek Szyprowski
  2011-12-29 12:39 ` [PATCH 09/11] X86: integrate CMA with DMA-mapping subsystem Marek Szyprowski
                   ` (3 subsequent siblings)
  11 siblings, 0 replies; 33+ messages in thread
From: Marek Szyprowski @ 2011-12-29 12:39 UTC (permalink / raw)
  To: linux-kernel, linux-arm-kernel, linux-media, linux-mm, linaro-mm-sig
  Cc: Michal Nazarewicz, Marek Szyprowski, Kyungmin Park, Russell King,
	Andrew Morton, KAMEZAWA Hiroyuki, Daniel Walker, Mel Gorman,
	Arnd Bergmann, Jesse Barker, Jonathan Corbet, Shariq Hasnain,
	Chunsang Jeong, Dave Hansen, Benjamin Gaignard

The Contiguous Memory Allocator is a set of helper functions for DMA
mapping framework that improves allocations of contiguous memory chunks.

CMA grabs memory on system boot, marks it with CMA_MIGRATE_TYPE and
gives back to the system. Kernel is allowed to allocate movable pages
within CMA's managed memory so that it can be used for example for page
cache when DMA mapping do not use it. On dma_alloc_from_contiguous()
request such pages are migrated out of CMA area to free required
contiguous block and fulfill the request. This allows to allocate large
contiguous chunks of memory at any time assuming that there is enough
free memory available in the system.

This code is heavily based on earlier works by Michal Nazarewicz.

Signed-off-by: Marek Szyprowski <m.szyprowski@samsung.com>
Signed-off-by: Kyungmin Park <kyungmin.park@samsung.com>
CC: Michal Nazarewicz <mina86@mina86.com>
---
 Documentation/kernel-parameters.txt  |    5 +
 arch/Kconfig                         |    3 +
 drivers/base/Kconfig                 |   89 ++++++++
 drivers/base/Makefile                |    1 +
 drivers/base/dma-contiguous.c        |  404 ++++++++++++++++++++++++++++++++++
 include/asm-generic/dma-contiguous.h |   27 +++
 include/linux/device.h               |    4 +
 include/linux/dma-contiguous.h       |  110 +++++++++
 8 files changed, 643 insertions(+), 0 deletions(-)
 create mode 100644 drivers/base/dma-contiguous.c
 create mode 100644 include/asm-generic/dma-contiguous.h
 create mode 100644 include/linux/dma-contiguous.h

diff --git a/Documentation/kernel-parameters.txt b/Documentation/kernel-parameters.txt
index 81c287f..0a56fd7 100644
--- a/Documentation/kernel-parameters.txt
+++ b/Documentation/kernel-parameters.txt
@@ -503,6 +503,11 @@ bytes respectively. Such letter suffixes can also be entirely omitted.
 			Also note the kernel might malfunction if you disable
 			some critical bits.
 
+	cma=nn[MG]	[ARM,KNL]
+			Sets the size of kernel global memory area for contiguous
+			memory allocations. For more information, see
+			include/linux/dma-contiguous.h
+
 	cmo_free_hint=	[PPC] Format: { yes | no }
 			Specify whether pages are marked as being inactive
 			when they are freed.  This is used in CMO environments
diff --git a/arch/Kconfig b/arch/Kconfig
index 4b0669c..a3b39a2 100644
--- a/arch/Kconfig
+++ b/arch/Kconfig
@@ -124,6 +124,9 @@ config HAVE_ARCH_TRACEHOOK
 config HAVE_DMA_ATTRS
 	bool
 
+config HAVE_DMA_CONTIGUOUS
+	bool
+
 config USE_GENERIC_SMP_HELPERS
 	bool
 
diff --git a/drivers/base/Kconfig b/drivers/base/Kconfig
index 21cf46f..99f5fad 100644
--- a/drivers/base/Kconfig
+++ b/drivers/base/Kconfig
@@ -174,4 +174,93 @@ config SYS_HYPERVISOR
 
 source "drivers/base/regmap/Kconfig"
 
+config CMA
+	bool "Contiguous Memory Allocator (EXPERIMENTAL)"
+	depends on HAVE_DMA_CONTIGUOUS && HAVE_MEMBLOCK && EXPERIMENTAL
+	select MIGRATION
+	help
+	  This enables the Contiguous Memory Allocator which allows drivers
+	  to allocate big physically-contiguous blocks of memory for use with
+	  hardware components that do not support I/O map nor scatter-gather.
+
+	  For more information see <include/linux/dma-contiguous.h>.
+	  If unsure, say "n".
+
+if CMA
+
+config CMA_DEBUG
+	bool "CMA debug messages (DEVELOPMENT)"
+	depends on DEBUG_KERNEL
+	help
+	  Turns on debug messages in CMA.  This produces KERN_DEBUG
+	  messages for every CMA call as well as various messages while
+	  processing calls such as dma_alloc_from_contiguous().
+	  This option does not affect warning and error messages.
+
+comment "Default contiguous memory area size:"
+
+config CMA_SIZE_MBYTES
+	int "Size in Mega Bytes"
+	depends on !CMA_SIZE_SEL_PERCENTAGE
+	default 16
+	help
+	  Defines the size (in MiB) of the default memory area for Contiguous
+	  Memory Allocator.
+
+config CMA_SIZE_PERCENTAGE
+	int "Percentage of total memory"
+	depends on !CMA_SIZE_SEL_MBYTES
+	default 10
+	help
+	  Defines the size of the default memory area for Contiguous Memory
+	  Allocator as a percentage of the total memory in the system.
+
+choice
+	prompt "Selected region size"
+	default CMA_SIZE_SEL_ABSOLUTE
+
+config CMA_SIZE_SEL_MBYTES
+	bool "Use mega bytes value only"
+
+config CMA_SIZE_SEL_PERCENTAGE
+	bool "Use percentage value only"
+
+config CMA_SIZE_SEL_MIN
+	bool "Use lower value (minimum)"
+
+config CMA_SIZE_SEL_MAX
+	bool "Use higher value (maximum)"
+
+endchoice
+
+config CMA_ALIGNMENT
+	int "Maximum PAGE_SIZE order of alignment for contiguous buffers"
+	range 4 9
+	default 8
+	help
+	  DMA mapping framework by default aligns all buffers to the smallest
+	  PAGE_SIZE order which is greater than or equal to the requested buffer
+	  size. This works well for buffers up to a few hundreds kilobytes, but
+	  for larger buffers it just a memory waste. With this parameter you can
+	  specify the maximum PAGE_SIZE order for contiguous buffers. Larger
+	  buffers will be aligned only to this specified order. The order is
+	  expressed as a power of two multiplied by the PAGE_SIZE.
+
+	  For example, if your system defaults to 4KiB pages, the order value
+	  of 8 means that the buffers will be aligned up to 1MiB only.
+
+	  If unsure, leave the default value "8".
+
+config CMA_AREAS
+	int "Maximum count of the CMA device-private areas"
+	default 7
+	help
+	  CMA allows to create CMA areas for particular devices. This parameter
+	  sets the maximum number of such device private CMA areas in the
+	  system.
+
+	  If unsure, leave the default value "7".
+
+endif
+
 endmenu
diff --git a/drivers/base/Makefile b/drivers/base/Makefile
index 99a375a..794546f 100644
--- a/drivers/base/Makefile
+++ b/drivers/base/Makefile
@@ -5,6 +5,7 @@ obj-y			:= core.o sys.o bus.o dd.o syscore.o \
 			   cpu.o firmware.o init.o map.o devres.o \
 			   attribute_container.o transport_class.o
 obj-$(CONFIG_DEVTMPFS)	+= devtmpfs.o
+obj-$(CONFIG_CMA) += dma-contiguous.o
 obj-y			+= power/
 obj-$(CONFIG_HAS_DMA)	+= dma-mapping.o
 obj-$(CONFIG_HAVE_GENERIC_DMA_COHERENT) += dma-coherent.o
diff --git a/drivers/base/dma-contiguous.c b/drivers/base/dma-contiguous.c
new file mode 100644
index 0000000..40c6341
--- /dev/null
+++ b/drivers/base/dma-contiguous.c
@@ -0,0 +1,404 @@
+/*
+ * Contiguous Memory Allocator for DMA mapping framework
+ * Copyright (c) 2010-2011 by Samsung Electronics.
+ * Written by:
+ *	Marek Szyprowski <m.szyprowski@samsung.com>
+ *	Michal Nazarewicz <mina86@mina86.com>
+ *
+ * This program is free software; you can redistribute it and/or
+ * modify it under the terms of the GNU General Public License as
+ * published by the Free Software Foundation; either version 2 of the
+ * License or (at your optional) any later version of the license.
+ */
+
+#define pr_fmt(fmt) "cma: " fmt
+
+#ifdef CONFIG_CMA_DEBUG
+#ifndef DEBUG
+#  define DEBUG
+#endif
+#endif
+
+#include <asm/page.h>
+#include <asm/dma-contiguous.h>
+
+#include <linux/memblock.h>
+#include <linux/err.h>
+#include <linux/mm.h>
+#include <linux/mutex.h>
+#include <linux/page-isolation.h>
+#include <linux/slab.h>
+#include <linux/swap.h>
+#include <linux/mm_types.h>
+#include <linux/dma-contiguous.h>
+
+#ifndef SZ_1M
+#define SZ_1M (1 << 20)
+#endif
+
+struct cma {
+	unsigned long	base_pfn;
+	unsigned long	count;
+	unsigned long	*bitmap;
+};
+
+struct cma *dma_contiguous_default_area;
+
+#ifdef CONFIG_CMA_SIZE_MBYTES
+#define CMA_SIZE_MBYTES CONFIG_CMA_SIZE_MBYTES
+#else
+#define CMA_SIZE_MBYTES 0
+#endif
+
+#ifdef CONFIG_CMA_SIZE_PERCENTAGE
+#define CMA_SIZE_PERCENTAGE CONFIG_CMA_SIZE_PERCENTAGE
+#else
+#define CMA_SIZE_PERCENTAGE 0
+#endif
+
+/*
+ * Default global CMA area size can be defined in kernel's .config.
+ * This is usefull mainly for distro maintainers to create a kernel
+ * that works correctly for most supported systems.
+ * The size can be set in bytes or as a percentage of the total memory
+ * in the system.
+ *
+ * Users, who want to set the size of global CMA area for their system
+ * should use cma= kernel parameter.
+ */
+static unsigned long size_bytes = CMA_SIZE_MBYTES * SZ_1M;
+static unsigned long size_percent = CMA_SIZE_PERCENTAGE;
+static long size_cmdline = -1;
+
+static int __init early_cma(char *p)
+{
+	pr_debug("%s(%s)\n", __func__, p);
+	size_cmdline = memparse(p, &p);
+	return 0;
+}
+early_param("cma", early_cma);
+
+static unsigned long __init cma_early_get_total_pages(void)
+{
+	struct memblock_region *reg;
+	unsigned long total_pages = 0;
+
+	/*
+	 * We cannot use memblock_phys_mem_size() here, because
+	 * memblock_analyze() has not been called yet.
+	 */
+	for_each_memblock(memory, reg)
+		total_pages += memblock_region_memory_end_pfn(reg) -
+			       memblock_region_memory_base_pfn(reg);
+	return total_pages;
+}
+
+/**
+ * dma_contiguous_reserve() - reserve area for contiguous memory handling
+ * @limit: End address of the reserved memory (optional, 0 for any).
+ *
+ * This funtion reserves memory from early allocator. It should be
+ * called by arch specific code once the early allocator (memblock or bootmem)
+ * has been activated and all other subsystems have already allocated/reserved
+ * memory.
+ */
+void __init dma_contiguous_reserve(phys_addr_t limit)
+{
+	unsigned long selected_size = 0;
+	unsigned long total_pages;
+
+	pr_debug("%s(limit %08lx)\n", __func__, (unsigned long)limit);
+
+	total_pages = cma_early_get_total_pages();
+	size_percent *= (total_pages << PAGE_SHIFT) / 100;
+
+	pr_debug("%s: total available: %ld MiB, size absolute: %ld MiB, size percentage: %ld MiB\n",
+		 __func__, (total_pages << PAGE_SHIFT) / SZ_1M,
+		size_bytes / SZ_1M, size_percent / SZ_1M);
+
+#ifdef CONFIG_CMA_SIZE_SEL_MBYTES
+	selected_size = size_bytes;
+#elif defined(CONFIG_CMA_SIZE_SEL_PERCENTAGE)
+	selected_size = size_percent;
+#elif defined(CONFIG_CMA_SIZE_SEL_MIN)
+	selected_size = min(size_bytes, size_percent);
+#elif defined(CONFIG_CMA_SIZE_SEL_MAX)
+	selected_size = max(size_bytes, size_percent);
+#endif
+
+	if (size_cmdline != -1)
+		selected_size = size_cmdline;
+
+	if (!selected_size)
+		return;
+
+	pr_debug("%s: reserving %ld MiB for global area\n", __func__,
+		 selected_size / SZ_1M);
+
+	dma_declare_contiguous(NULL, selected_size, 0, limit);
+};
+
+static DEFINE_MUTEX(cma_mutex);
+
+static int cma_activate_area(unsigned long base_pfn, unsigned long count)
+{
+	unsigned long pfn = base_pfn;
+	unsigned i = count >> pageblock_order;
+	struct zone *zone;
+
+	WARN_ON_ONCE(!pfn_valid(pfn));
+	zone = page_zone(pfn_to_page(pfn));
+
+	do {
+		unsigned j;
+		base_pfn = pfn;
+		for (j = pageblock_nr_pages; j; --j, pfn++) {
+			WARN_ON_ONCE(!pfn_valid(pfn));
+			if (page_zone(pfn_to_page(pfn)) != zone)
+				return -EINVAL;
+		}
+		init_cma_reserved_pageblock(pfn_to_page(base_pfn));
+	} while (--i);
+	return 0;
+}
+
+static struct cma *cma_create_area(unsigned long base_pfn,
+				     unsigned long count)
+{
+	int bitmap_size = BITS_TO_LONGS(count) * sizeof(long);
+	struct cma *cma;
+	int ret = -ENOMEM;
+
+	pr_debug("%s(base %08lx, count %lx)\n", __func__, base_pfn, count);
+
+	cma = kmalloc(sizeof *cma, GFP_KERNEL);
+	if (!cma)
+		return ERR_PTR(-ENOMEM);
+
+	cma->base_pfn = base_pfn;
+	cma->count = count;
+	cma->bitmap = kzalloc(bitmap_size, GFP_KERNEL);
+
+	if (!cma->bitmap)
+		goto no_mem;
+
+	ret = cma_activate_area(base_pfn, count);
+	if (ret)
+		goto error;
+
+	pr_debug("%s: returned %p\n", __func__, (void *)cma);
+	return cma;
+
+error:
+	kfree(cma->bitmap);
+no_mem:
+	kfree(cma);
+	return ERR_PTR(ret);
+}
+
+static struct cma_reserved {
+	phys_addr_t start;
+	unsigned long size;
+	struct device *dev;
+} cma_reserved[MAX_CMA_AREAS] __initdata;
+static unsigned cma_reserved_count __initdata;
+
+static int __init cma_init_reserved_areas(void)
+{
+	struct cma_reserved *r = cma_reserved;
+	unsigned i = cma_reserved_count;
+
+	pr_debug("%s()\n", __func__);
+
+	for (; i; --i, ++r) {
+		struct cma *cma;
+		cma = cma_create_area(PFN_DOWN(r->start),
+				      r->size >> PAGE_SHIFT);
+		if (!IS_ERR(cma))
+			dev_set_cma_area(r->dev, cma);
+	}
+	return 0;
+}
+core_initcall(cma_init_reserved_areas);
+
+/**
+ * dma_declare_contiguous() - reserve area for contiguous memory handling
+ *			      for particular device
+ * @dev:   Pointer to device structure.
+ * @size:  Size of the reserved memory.
+ * @start: Start address of the reserved memory (optional, 0 for any).
+ * @limit: End address of the reserved memory (optional, 0 for any).
+ *
+ * This funtion reserves memory for specified device. It should be
+ * called by board specific code when early allocator (memblock or bootmem)
+ * is still activate.
+ */
+int __init dma_declare_contiguous(struct device *dev, unsigned long size,
+				  phys_addr_t base, phys_addr_t limit)
+{
+	struct cma_reserved *r = &cma_reserved[cma_reserved_count];
+	unsigned long alignment;
+
+	pr_debug("%s(size %lx, base %08lx, limit %08lx)\n", __func__,
+		 (unsigned long)size, (unsigned long)base,
+		 (unsigned long)limit);
+
+	/* Sanity checks */
+	if (cma_reserved_count == ARRAY_SIZE(cma_reserved)) {
+		pr_err("Not enough slots for CMA reserved regions!\n");
+		return -ENOSPC;
+	}
+
+	if (!size)
+		return -EINVAL;
+
+	/* Sanitise input arguments */
+	alignment = PAGE_SIZE << max(MAX_ORDER, pageblock_order);
+	base = ALIGN(base, alignment);
+	size = ALIGN(size, alignment);
+	limit = ALIGN(limit, alignment);
+
+	/* Reserve memory */
+	if (base) {
+		if (memblock_is_region_reserved(base, size) ||
+		    memblock_reserve(base, size) < 0) {
+			base = -EBUSY;
+			goto err;
+		}
+	} else {
+		/*
+		 * Use __memblock_alloc_base() since
+		 * memblock_alloc_base() panic()s.
+		 */
+		phys_addr_t addr = __memblock_alloc_base(size, alignment, limit);
+		if (!addr) {
+			base = -ENOMEM;
+			goto err;
+		} else if (addr + size > ~(unsigned long)0) {
+			memblock_free(addr, size);
+			base = -EINVAL;
+			goto err;
+		} else {
+			base = addr;
+		}
+	}
+
+	/*
+	 * Each reserved area must be initialised later, when more kernel
+	 * subsystems (like slab allocator) are available.
+	 */
+	r->start = base;
+	r->size = size;
+	r->dev = dev;
+	cma_reserved_count++;
+	pr_info("CMA: reserved %ld MiB at %08lx\n", size / SZ_1M,
+		(unsigned long)base);
+
+	/*
+	 * Architecture specific contiguous memory fixup.
+	 */
+	dma_contiguous_early_fixup(base, size);
+	return 0;
+err:
+	pr_err("CMA: failed to reserve %ld MiB\n", size / SZ_1M);
+	return base;
+}
+
+/**
+ * dma_alloc_from_contiguous() - allocate pages from contiguous area
+ * @dev:   Pointer to device for which the allocation is performed.
+ * @count: Requested number of pages.
+ * @align: Requested alignment of pages (in PAGE_SIZE order).
+ *
+ * This funtion allocates memory buffer for specified device. It uses
+ * device specific contiguous memory area if available or the default
+ * global one. Requires architecture specific get_dev_cma_area() helper
+ * function.
+ */
+struct page *dma_alloc_from_contiguous(struct device *dev, int count,
+				       unsigned int align)
+{
+	struct cma *cma = dev_get_cma_area(dev);
+	unsigned long pfn, pageno, start = 0;
+	unsigned long mask = (1 << align) - 1;
+	int ret;
+
+	if (!cma || !cma->count)
+		return NULL;
+
+	if (align > CONFIG_CMA_ALIGNMENT)
+		align = CONFIG_CMA_ALIGNMENT;
+
+	pr_debug("%s(cma %p, count %d, align %d)\n", __func__, (void *)cma,
+		 count, align);
+
+	if (!count)
+		return NULL;
+
+	mutex_lock(&cma_mutex);
+
+	for (;;) {
+		pageno = bitmap_find_next_zero_area(cma->bitmap, cma->count,
+						    start, count, mask);
+		if (pageno >= cma->count) {
+			ret = -ENOMEM;
+			goto error;
+		}
+
+		pfn = cma->base_pfn + pageno;
+		ret = alloc_contig_range(pfn, pfn + count, MIGRATE_CMA);
+		if (ret == 0) {
+			bitmap_set(cma->bitmap, pageno, count);
+			break;
+		} else if (ret != -EBUSY) {
+			goto error;
+		}
+		pr_debug("%s(): memory range at %p is busy, retrying\n",
+			 __func__, pfn_to_page(pfn));
+		/* try again with a bit different memory target */
+		start = pageno + mask + 1;
+	}
+
+	mutex_unlock(&cma_mutex);
+
+	pr_debug("%s(): returned %p\n", __func__, pfn_to_page(pfn));
+	return pfn_to_page(pfn);
+error:
+	mutex_unlock(&cma_mutex);
+	return NULL;
+}
+
+/**
+ * dma_release_from_contiguous() - release allocated pages
+ * @dev:   Pointer to device for which the pages were allocated.
+ * @pages: Allocated pages.
+ * @count: Number of allocated pages.
+ *
+ * This funtion releases memory allocated by dma_alloc_from_contiguous().
+ * It return 0 when provided pages doen't belongs to contiguous area and
+ * 1 on success.
+ */
+int dma_release_from_contiguous(struct device *dev, struct page *pages,
+				int count)
+{
+	struct cma *cma = dev_get_cma_area(dev);
+	unsigned long pfn;
+
+	if (!cma || !pages)
+		return 0;
+
+	pr_debug("%s(page %p)\n", __func__, (void *)pages);
+
+	pfn = page_to_pfn(pages);
+
+	if (pfn < cma->base_pfn || pfn >= cma->base_pfn + cma->count)
+		return 0;
+
+	mutex_lock(&cma_mutex);
+
+	bitmap_clear(cma->bitmap, pfn - cma->base_pfn, count);
+	free_contig_range(pfn, count);
+
+	mutex_unlock(&cma_mutex);
+	return 1;
+}
diff --git a/include/asm-generic/dma-contiguous.h b/include/asm-generic/dma-contiguous.h
new file mode 100644
index 0000000..bf2bccc
--- /dev/null
+++ b/include/asm-generic/dma-contiguous.h
@@ -0,0 +1,27 @@
+#ifndef ASM_DMA_CONTIGUOUS_H
+#define ASM_DMA_CONTIGUOUS_H
+
+#ifdef __KERNEL__
+
+#include <linux/device.h>
+#include <linux/dma-contiguous.h>
+
+#ifdef CONFIG_CMA
+
+static inline struct cma *dev_get_cma_area(struct device *dev)
+{
+	if (dev && dev->cma_area)
+		return dev->cma_area;
+	return dma_contiguous_default_area;
+}
+
+static inline void dev_set_cma_area(struct device *dev, struct cma *cma)
+{
+	if (dev)
+		dev->cma_area = cma;
+	dma_contiguous_default_area = cma;
+}
+
+#endif
+#endif
+#endif
diff --git a/include/linux/device.h b/include/linux/device.h
index 3136ede..b8794c1 100644
--- a/include/linux/device.h
+++ b/include/linux/device.h
@@ -594,6 +594,10 @@ struct device {
 
 	struct dma_coherent_mem	*dma_mem; /* internal for coherent mem
 					     override */
+#ifdef CONFIG_CMA
+	struct cma *cma_area;		/* contiguous memory area for dma
+					   allocations */
+#endif
 	/* arch specific additions */
 	struct dev_archdata	archdata;
 
diff --git a/include/linux/dma-contiguous.h b/include/linux/dma-contiguous.h
new file mode 100644
index 0000000..ffb4b40
--- /dev/null
+++ b/include/linux/dma-contiguous.h
@@ -0,0 +1,110 @@
+#ifndef __LINUX_CMA_H
+#define __LINUX_CMA_H
+
+/*
+ * Contiguous Memory Allocator for DMA mapping framework
+ * Copyright (c) 2010-2011 by Samsung Electronics.
+ * Written by:
+ *	Marek Szyprowski <m.szyprowski@samsung.com>
+ *	Michal Nazarewicz <mina86@mina86.com>
+ *
+ * This program is free software; you can redistribute it and/or
+ * modify it under the terms of the GNU General Public License as
+ * published by the Free Software Foundation; either version 2 of the
+ * License or (at your optional) any later version of the license.
+ */
+
+/*
+ * Contiguous Memory Allocator
+ *
+ *   The Contiguous Memory Allocator (CMA) makes it possible to
+ *   allocate big contiguous chunks of memory after the system has
+ *   booted.
+ *
+ * Why is it needed?
+ *
+ *   Various devices on embedded systems have no scatter-getter and/or
+ *   IO map support and require contiguous blocks of memory to
+ *   operate.  They include devices such as cameras, hardware video
+ *   coders, etc.
+ *
+ *   Such devices often require big memory buffers (a full HD frame
+ *   is, for instance, more then 2 mega pixels large, i.e. more than 6
+ *   MB of memory), which makes mechanisms such as kmalloc() or
+ *   alloc_page() ineffective.
+ *
+ *   At the same time, a solution where a big memory region is
+ *   reserved for a device is suboptimal since often more memory is
+ *   reserved then strictly required and, moreover, the memory is
+ *   inaccessible to page system even if device drivers don't use it.
+ *
+ *   CMA tries to solve this issue by operating on memory regions
+ *   where only movable pages can be allocated from.  This way, kernel
+ *   can use the memory for pagecache and when device driver requests
+ *   it, allocated pages can be migrated.
+ *
+ * Driver usage
+ *
+ *   CMA should not be used by the device drivers directly. It is
+ *   only a helper framework for dma-mapping subsystem.
+ *
+ *   For more information, see kernel-docs in drivers/base/dma-contiguous.c
+ */
+
+#ifdef __KERNEL__
+
+struct cma;
+struct page;
+struct device;
+
+#ifdef CONFIG_CMA
+
+/*
+ * There is always at least global CMA area and a few optional device
+ * private areas configured in kernel .config.
+ */
+#define MAX_CMA_AREAS	(1 + CONFIG_CMA_AREAS)
+
+extern struct cma *dma_contiguous_default_area;
+
+void dma_contiguous_reserve(phys_addr_t addr_limit);
+int dma_declare_contiguous(struct device *dev, unsigned long size,
+			   phys_addr_t base, phys_addr_t limit);
+
+struct page *dma_alloc_from_contiguous(struct device *dev, int count,
+				       unsigned int order);
+int dma_release_from_contiguous(struct device *dev, struct page *pages,
+				int count);
+
+#else
+
+#define MAX_CMA_AREAS	(0)
+
+static inline void dma_contiguous_reserve(phys_addr_t limit) { }
+
+static inline
+int dma_declare_contiguous(struct device *dev, unsigned long size,
+			   phys_addr_t base, phys_addr_t limit)
+{
+	return -ENOSYS;
+}
+
+static inline
+struct page *dma_alloc_from_contiguous(struct device *dev, int count,
+				       unsigned int order)
+{
+	return NULL;
+}
+
+static inline
+int dma_release_from_contiguous(struct device *dev, struct page *pages,
+				int count)
+{
+	return 0;
+}
+
+#endif
+
+#endif
+
+#endif
-- 
1.7.1.569.g6f426


^ permalink raw reply related	[flat|nested] 33+ messages in thread

* [PATCH 09/11] X86: integrate CMA with DMA-mapping subsystem
  2011-12-29 12:39 [PATCHv18 0/11] Contiguous Memory Allocator Marek Szyprowski
                   ` (7 preceding siblings ...)
  2011-12-29 12:39 ` [PATCH 08/11] drivers: add Contiguous Memory Allocator Marek Szyprowski
@ 2011-12-29 12:39 ` Marek Szyprowski
  2011-12-29 12:39 ` [PATCH 10/11] ARM: " Marek Szyprowski
                   ` (2 subsequent siblings)
  11 siblings, 0 replies; 33+ messages in thread
From: Marek Szyprowski @ 2011-12-29 12:39 UTC (permalink / raw)
  To: linux-kernel, linux-arm-kernel, linux-media, linux-mm, linaro-mm-sig
  Cc: Michal Nazarewicz, Marek Szyprowski, Kyungmin Park, Russell King,
	Andrew Morton, KAMEZAWA Hiroyuki, Daniel Walker, Mel Gorman,
	Arnd Bergmann, Jesse Barker, Jonathan Corbet, Shariq Hasnain,
	Chunsang Jeong, Dave Hansen, Benjamin Gaignard

This patch adds support for CMA to dma-mapping subsystem for x86
architecture that uses common pci-dma/pci-nommu implementation. This
allows to test CMA on KVM/QEMU and a lot of common x86 boxes.

Signed-off-by: Marek Szyprowski <m.szyprowski@samsung.com>
Signed-off-by: Kyungmin Park <kyungmin.park@samsung.com>
CC: Michal Nazarewicz <mina86@mina86.com>
---
 arch/x86/Kconfig                      |    1 +
 arch/x86/include/asm/dma-contiguous.h |   13 +++++++++++++
 arch/x86/include/asm/dma-mapping.h    |    4 ++++
 arch/x86/kernel/pci-dma.c             |   18 ++++++++++++++++--
 arch/x86/kernel/pci-nommu.c           |    8 +-------
 arch/x86/kernel/setup.c               |    2 ++
 6 files changed, 37 insertions(+), 9 deletions(-)
 create mode 100644 arch/x86/include/asm/dma-contiguous.h

diff --git a/arch/x86/Kconfig b/arch/x86/Kconfig
index efb4294..ac101e0 100644
--- a/arch/x86/Kconfig
+++ b/arch/x86/Kconfig
@@ -29,6 +29,7 @@ config X86
 	select ARCH_WANT_OPTIONAL_GPIOLIB
 	select ARCH_WANT_FRAME_POINTERS
 	select HAVE_DMA_ATTRS
+	select HAVE_DMA_CONTIGUOUS if !SWIOTLB
 	select HAVE_KRETPROBES
 	select HAVE_OPTPROBES
 	select HAVE_FTRACE_MCOUNT_RECORD
diff --git a/arch/x86/include/asm/dma-contiguous.h b/arch/x86/include/asm/dma-contiguous.h
new file mode 100644
index 0000000..8fb117d
--- /dev/null
+++ b/arch/x86/include/asm/dma-contiguous.h
@@ -0,0 +1,13 @@
+#ifndef ASMX86_DMA_CONTIGUOUS_H
+#define ASMX86_DMA_CONTIGUOUS_H
+
+#ifdef __KERNEL__
+
+#include <linux/device.h>
+#include <linux/dma-contiguous.h>
+#include <asm-generic/dma-contiguous.h>
+
+static inline void dma_contiguous_early_fixup(phys_addr_t base, unsigned long size) { }
+
+#endif
+#endif
diff --git a/arch/x86/include/asm/dma-mapping.h b/arch/x86/include/asm/dma-mapping.h
index ed3065f..90ac6f0 100644
--- a/arch/x86/include/asm/dma-mapping.h
+++ b/arch/x86/include/asm/dma-mapping.h
@@ -13,6 +13,7 @@
 #include <asm/io.h>
 #include <asm/swiotlb.h>
 #include <asm-generic/dma-coherent.h>
+#include <linux/dma-contiguous.h>
 
 #ifdef CONFIG_ISA
 # define ISA_DMA_BIT_MASK DMA_BIT_MASK(24)
@@ -61,6 +62,9 @@ extern int dma_set_mask(struct device *dev, u64 mask);
 extern void *dma_generic_alloc_coherent(struct device *dev, size_t size,
 					dma_addr_t *dma_addr, gfp_t flag);
 
+extern void dma_generic_free_coherent(struct device *dev, size_t size,
+				      void *vaddr, dma_addr_t dma_addr);
+
 static inline bool dma_capable(struct device *dev, dma_addr_t addr, size_t size)
 {
 	if (!dev->dma_mask)
diff --git a/arch/x86/kernel/pci-dma.c b/arch/x86/kernel/pci-dma.c
index 80dc793..f4abafc 100644
--- a/arch/x86/kernel/pci-dma.c
+++ b/arch/x86/kernel/pci-dma.c
@@ -90,14 +90,18 @@ void *dma_generic_alloc_coherent(struct device *dev, size_t size,
 				 dma_addr_t *dma_addr, gfp_t flag)
 {
 	unsigned long dma_mask;
-	struct page *page;
+	struct page *page = NULL;
+	unsigned int count = PAGE_ALIGN(size) >> PAGE_SHIFT;
 	dma_addr_t addr;
 
 	dma_mask = dma_alloc_coherent_mask(dev, flag);
 
 	flag |= __GFP_ZERO;
 again:
-	page = alloc_pages_node(dev_to_node(dev), flag, get_order(size));
+	if (!(flag & GFP_ATOMIC))
+		page = dma_alloc_from_contiguous(dev, count, get_order(size));
+	if (!page)
+		page = alloc_pages_node(dev_to_node(dev), flag, get_order(size));
 	if (!page)
 		return NULL;
 
@@ -117,6 +121,16 @@ again:
 	return page_address(page);
 }
 
+void dma_generic_free_coherent(struct device *dev, size_t size, void *vaddr,
+			       dma_addr_t dma_addr)
+{
+	unsigned int count = PAGE_ALIGN(size) >> PAGE_SHIFT;
+	struct page *page = virt_to_page(vaddr);
+
+	if (!dma_release_from_contiguous(dev, page, count))
+		free_pages((unsigned long)vaddr, get_order(size));
+}
+
 /*
  * See <Documentation/x86/x86_64/boot-options.txt> for the iommu kernel
  * parameter documentation.
diff --git a/arch/x86/kernel/pci-nommu.c b/arch/x86/kernel/pci-nommu.c
index 3af4af8..656566f 100644
--- a/arch/x86/kernel/pci-nommu.c
+++ b/arch/x86/kernel/pci-nommu.c
@@ -74,12 +74,6 @@ static int nommu_map_sg(struct device *hwdev, struct scatterlist *sg,
 	return nents;
 }
 
-static void nommu_free_coherent(struct device *dev, size_t size, void *vaddr,
-				dma_addr_t dma_addr)
-{
-	free_pages((unsigned long)vaddr, get_order(size));
-}
-
 static void nommu_sync_single_for_device(struct device *dev,
 			dma_addr_t addr, size_t size,
 			enum dma_data_direction dir)
@@ -97,7 +91,7 @@ static void nommu_sync_sg_for_device(struct device *dev,
 
 struct dma_map_ops nommu_dma_ops = {
 	.alloc_coherent		= dma_generic_alloc_coherent,
-	.free_coherent		= nommu_free_coherent,
+	.free_coherent		= dma_generic_free_coherent,
 	.map_sg			= nommu_map_sg,
 	.map_page		= nommu_map_page,
 	.sync_single_for_device = nommu_sync_single_for_device,
diff --git a/arch/x86/kernel/setup.c b/arch/x86/kernel/setup.c
index cf0ef98..1dfe8ba 100644
--- a/arch/x86/kernel/setup.c
+++ b/arch/x86/kernel/setup.c
@@ -50,6 +50,7 @@
 #include <asm/pci-direct.h>
 #include <linux/init_ohci1394_dma.h>
 #include <linux/kvm_para.h>
+#include <linux/dma-contiguous.h>
 
 #include <linux/errno.h>
 #include <linux/kernel.h>
@@ -944,6 +945,7 @@ void __init setup_arch(char **cmdline_p)
 	}
 #endif
 	memblock.current_limit = get_max_mapped();
+	dma_contiguous_reserve(0);
 
 	/*
 	 * NOTE: On x86-32, only from this point on, fixmaps are ready for use.
-- 
1.7.1.569.g6f426


^ permalink raw reply related	[flat|nested] 33+ messages in thread

* [PATCH 10/11] ARM: integrate CMA with DMA-mapping subsystem
  2011-12-29 12:39 [PATCHv18 0/11] Contiguous Memory Allocator Marek Szyprowski
                   ` (8 preceding siblings ...)
  2011-12-29 12:39 ` [PATCH 09/11] X86: integrate CMA with DMA-mapping subsystem Marek Szyprowski
@ 2011-12-29 12:39 ` Marek Szyprowski
  2011-12-29 12:39 ` [PATCH 11/11] ARM: Samsung: use CMA for 2 memory banks for s5p-mfc device Marek Szyprowski
  2012-01-10  8:42 ` [PATCHv18 0/11] Contiguous Memory Allocator Marek Szyprowski
  11 siblings, 0 replies; 33+ messages in thread
From: Marek Szyprowski @ 2011-12-29 12:39 UTC (permalink / raw)
  To: linux-kernel, linux-arm-kernel, linux-media, linux-mm, linaro-mm-sig
  Cc: Michal Nazarewicz, Marek Szyprowski, Kyungmin Park, Russell King,
	Andrew Morton, KAMEZAWA Hiroyuki, Daniel Walker, Mel Gorman,
	Arnd Bergmann, Jesse Barker, Jonathan Corbet, Shariq Hasnain,
	Chunsang Jeong, Dave Hansen, Benjamin Gaignard

This patch adds support for CMA to dma-mapping subsystem for ARM
architecture. By default a global CMA area is used, but specific devices
are allowed to have their private memory areas if required (they can be
created with dma_declare_contiguous() function during board
initialization).

Contiguous memory areas reserved for DMA are remapped with 2-level page
tables on boot. Once a buffer is requested, a low memory kernel mapping
is updated to to match requested memory access type.

GFP_ATOMIC allocations are performed from special pool which is created
early during boot. This way remapping page attributes is not needed on
allocation time.

CMA has been enabled unconditionally for ARMv6+ systems.

Signed-off-by: Marek Szyprowski <m.szyprowski@samsung.com>
Signed-off-by: Kyungmin Park <kyungmin.park@samsung.com>
CC: Michal Nazarewicz <mina86@mina86.com>
---
 Documentation/kernel-parameters.txt   |    4 +
 arch/arm/Kconfig                      |    2 +
 arch/arm/include/asm/dma-contiguous.h |   16 ++
 arch/arm/include/asm/mach/map.h       |    1 +
 arch/arm/kernel/setup.c               |    9 +-
 arch/arm/mm/dma-mapping.c             |  368 +++++++++++++++++++++++++++------
 arch/arm/mm/init.c                    |   22 ++-
 arch/arm/mm/mm.h                      |    3 +
 arch/arm/mm/mmu.c                     |   29 ++-
 9 files changed, 367 insertions(+), 87 deletions(-)
 create mode 100644 arch/arm/include/asm/dma-contiguous.h

diff --git a/Documentation/kernel-parameters.txt b/Documentation/kernel-parameters.txt
index 0a56fd7..c8993cd 100644
--- a/Documentation/kernel-parameters.txt
+++ b/Documentation/kernel-parameters.txt
@@ -515,6 +515,10 @@ bytes respectively. Such letter suffixes can also be entirely omitted.
 			a hypervisor.
 			Default: yes
 
+	coherent_pool=nn[KMG]	[ARM,KNL]
+			Sets the size of memory pool for coherent, atomic dma
+			allocations if Contiguous Memory Allocator (CMA) is used.
+
 	code_bytes	[X86] How many bytes of object code to print
 			in an oops report.
 			Range: 0 - 8192
diff --git a/arch/arm/Kconfig b/arch/arm/Kconfig
index 776d76b..2cc4bba 100644
--- a/arch/arm/Kconfig
+++ b/arch/arm/Kconfig
@@ -4,6 +4,8 @@ config ARM
 	select HAVE_AOUT
 	select HAVE_DMA_API_DEBUG
 	select HAVE_IDE if PCI || ISA || PCMCIA
+	select HAVE_DMA_CONTIGUOUS if (CPU_V6 || CPU_V6K || CPU_V7)
+	select CMA if (CPU_V6 || CPU_V6K || CPU_V7)
 	select HAVE_MEMBLOCK
 	select RTC_LIB
 	select SYS_SUPPORTS_APM_EMULATION
diff --git a/arch/arm/include/asm/dma-contiguous.h b/arch/arm/include/asm/dma-contiguous.h
new file mode 100644
index 0000000..c7ba05e
--- /dev/null
+++ b/arch/arm/include/asm/dma-contiguous.h
@@ -0,0 +1,16 @@
+#ifndef ASMARM_DMA_CONTIGUOUS_H
+#define ASMARM_DMA_CONTIGUOUS_H
+
+#ifdef __KERNEL__
+
+#include <linux/device.h>
+#include <linux/dma-contiguous.h>
+#include <asm-generic/dma-contiguous.h>
+
+#ifdef CONFIG_CMA
+
+void dma_contiguous_early_fixup(phys_addr_t base, unsigned long size);
+
+#endif
+#endif
+#endif
diff --git a/arch/arm/include/asm/mach/map.h b/arch/arm/include/asm/mach/map.h
index b36f365..a6efcdd 100644
--- a/arch/arm/include/asm/mach/map.h
+++ b/arch/arm/include/asm/mach/map.h
@@ -30,6 +30,7 @@ struct map_desc {
 #define MT_MEMORY_DTCM		12
 #define MT_MEMORY_ITCM		13
 #define MT_MEMORY_SO		14
+#define MT_MEMORY_DMA_READY	15
 
 #ifdef CONFIG_MMU
 extern void iotable_init(struct map_desc *, int);
diff --git a/arch/arm/kernel/setup.c b/arch/arm/kernel/setup.c
index 8fc2c8f..9d5bc82f 100644
--- a/arch/arm/kernel/setup.c
+++ b/arch/arm/kernel/setup.c
@@ -78,6 +78,7 @@ __setup("fpe=", fpe_setup);
 extern void paging_init(struct machine_desc *desc);
 extern void sanity_check_meminfo(void);
 extern void reboot_setup(char *str);
+extern void setup_dma_zone(struct machine_desc *desc);
 
 unsigned int processor_id;
 EXPORT_SYMBOL(processor_id);
@@ -902,12 +903,8 @@ void __init setup_arch(char **cmdline_p)
 	machine_desc = mdesc;
 	machine_name = mdesc->name;
 
-#ifdef CONFIG_ZONE_DMA
-	if (mdesc->dma_zone_size) {
-		extern unsigned long arm_dma_zone_size;
-		arm_dma_zone_size = mdesc->dma_zone_size;
-	}
-#endif
+	setup_dma_zone(mdesc);
+
 	if (mdesc->soft_reboot)
 		reboot_setup("s");
 
diff --git a/arch/arm/mm/dma-mapping.c b/arch/arm/mm/dma-mapping.c
index 1aa664a..77e7755 100644
--- a/arch/arm/mm/dma-mapping.c
+++ b/arch/arm/mm/dma-mapping.c
@@ -17,7 +17,9 @@
 #include <linux/init.h>
 #include <linux/device.h>
 #include <linux/dma-mapping.h>
+#include <linux/dma-contiguous.h>
 #include <linux/highmem.h>
+#include <linux/memblock.h>
 #include <linux/slab.h>
 
 #include <asm/memory.h>
@@ -26,6 +28,8 @@
 #include <asm/tlbflush.h>
 #include <asm/sizes.h>
 #include <asm/mach/arch.h>
+#include <asm/mach/map.h>
+#include <asm/dma-contiguous.h>
 
 #include "mm.h"
 
@@ -56,6 +60,19 @@ static u64 get_coherent_dma_mask(struct device *dev)
 	return mask;
 }
 
+static void __dma_clear_buffer(struct page *page, size_t size)
+{
+	void *ptr;
+	/*
+	 * Ensure that the allocated pages are zeroed, and that any data
+	 * lurking in the kernel direct-mapped region is invalidated.
+	 */
+	ptr = page_address(page);
+	memset(ptr, 0, size);
+	dmac_flush_range(ptr, ptr + size);
+	outer_flush_range(__pa(ptr), __pa(ptr) + size);
+}
+
 /*
  * Allocate a DMA buffer for 'dev' of size 'size' using the
  * specified gfp mask.  Note that 'size' must be page aligned.
@@ -64,23 +81,6 @@ static struct page *__dma_alloc_buffer(struct device *dev, size_t size, gfp_t gf
 {
 	unsigned long order = get_order(size);
 	struct page *page, *p, *e;
-	void *ptr;
-	u64 mask = get_coherent_dma_mask(dev);
-
-#ifdef CONFIG_DMA_API_DEBUG
-	u64 limit = (mask + 1) & ~mask;
-	if (limit && size >= limit) {
-		dev_warn(dev, "coherent allocation too big (requested %#x mask %#llx)\n",
-			size, mask);
-		return NULL;
-	}
-#endif
-
-	if (!mask)
-		return NULL;
-
-	if (mask < 0xffffffffULL)
-		gfp |= GFP_DMA;
 
 	page = alloc_pages(gfp, order);
 	if (!page)
@@ -93,14 +93,7 @@ static struct page *__dma_alloc_buffer(struct device *dev, size_t size, gfp_t gf
 	for (p = page + (size >> PAGE_SHIFT), e = page + (1 << order); p < e; p++)
 		__free_page(p);
 
-	/*
-	 * Ensure that the allocated pages are zeroed, and that any data
-	 * lurking in the kernel direct-mapped region is invalidated.
-	 */
-	ptr = page_address(page);
-	memset(ptr, 0, size);
-	dmac_flush_range(ptr, ptr + size);
-	outer_flush_range(__pa(ptr), __pa(ptr) + size);
+	__dma_clear_buffer(page, size);
 
 	return page;
 }
@@ -170,6 +163,9 @@ static int __init consistent_init(void)
 	unsigned long base = consistent_base;
 	unsigned long num_ptes = (CONSISTENT_END - base) >> PMD_SHIFT;
 
+	if (cpu_architecture() >= CPU_ARCH_ARMv6)
+		return 0;
+
 	consistent_pte = kmalloc(num_ptes * sizeof(pte_t), GFP_KERNEL);
 	if (!consistent_pte) {
 		pr_err("%s: no memory\n", __func__);
@@ -210,9 +206,101 @@ static int __init consistent_init(void)
 
 	return ret;
 }
-
 core_initcall(consistent_init);
 
+static void *__alloc_from_contiguous(struct device *dev, size_t size,
+				     pgprot_t prot, struct page **ret_page);
+
+static struct arm_vmregion_head coherent_head = {
+	.vm_lock	= __SPIN_LOCK_UNLOCKED(&coherent_head.vm_lock),
+	.vm_list	= LIST_HEAD_INIT(coherent_head.vm_list),
+};
+
+size_t coherent_pool_size = DEFAULT_CONSISTENT_DMA_SIZE / 8;
+
+static int __init early_coherent_pool(char *p)
+{
+	coherent_pool_size = memparse(p, &p);
+	return 0;
+}
+early_param("coherent_pool", early_coherent_pool);
+
+/*
+ * Initialise the coherent pool for atomic allocations.
+ */
+static int __init coherent_init(void)
+{
+	pgprot_t prot = pgprot_dmacoherent(pgprot_kernel);
+	size_t size = coherent_pool_size;
+	struct page *page;
+	void *ptr;
+
+	if (cpu_architecture() < CPU_ARCH_ARMv6)
+		return 0;
+
+	ptr = __alloc_from_contiguous(NULL, size, prot, &page);
+	if (ptr) {
+		coherent_head.vm_start = (unsigned long) ptr;
+		coherent_head.vm_end = (unsigned long) ptr + size;
+		printk(KERN_INFO "DMA: preallocated %u KiB pool for atomic coherent allocations\n",
+		       (unsigned)size / 1024);
+		return 0;
+	}
+	printk(KERN_ERR "DMA: failed to allocate %u KiB pool for atomic coherent allocation\n",
+	       (unsigned)size / 1024);
+	return -ENOMEM;
+}
+/*
+ * CMA is activated by core_initcall, so we must be called after it
+ */
+postcore_initcall(coherent_init);
+
+struct dma_contig_early_reserve {
+	phys_addr_t base;
+	unsigned long size;
+};
+
+static struct dma_contig_early_reserve dma_mmu_remap[MAX_CMA_AREAS] __initdata;
+
+static int dma_mmu_remap_num __initdata;
+
+void __init dma_contiguous_early_fixup(phys_addr_t base, unsigned long size)
+{
+	dma_mmu_remap[dma_mmu_remap_num].base = base;
+	dma_mmu_remap[dma_mmu_remap_num].size = size;
+	dma_mmu_remap_num++;
+}
+
+void __init dma_contiguous_remap(void)
+{
+	int i;
+	for (i = 0; i < dma_mmu_remap_num; i++) {
+		phys_addr_t start = dma_mmu_remap[i].base;
+		phys_addr_t end = start + dma_mmu_remap[i].size;
+		struct map_desc map;
+		unsigned long addr;
+
+		if (end > arm_lowmem_limit)
+			end = arm_lowmem_limit;
+		if (start >= end)
+			return;
+
+		map.pfn = __phys_to_pfn(start);
+		map.virtual = __phys_to_virt(start);
+		map.length = end - start;
+		map.type = MT_MEMORY_DMA_READY;
+
+		/*
+		 * Clear previous low-memory mapping
+		 */
+		for (addr = __phys_to_virt(start); addr < __phys_to_virt(end);
+		     addr += PGDIR_SIZE)
+			pmd_clear(pmd_off_k(addr));
+
+		iotable_init(&map, 1);
+	}
+}
+
 static void *
 __dma_alloc_remap(struct page *page, size_t size, gfp_t gfp, pgprot_t prot)
 {
@@ -318,20 +406,172 @@ static void __dma_free_remap(void *cpu_addr, size_t size)
 	arm_vmregion_free(&consistent_head, c);
 }
 
+static int __dma_update_pte(pte_t *pte, pgtable_t token, unsigned long addr,
+			    void *data)
+{
+	struct page *page = virt_to_page(addr);
+	pgprot_t prot = *(pgprot_t *)data;
+
+	set_pte_ext(pte, mk_pte(page, prot), 0);
+	return 0;
+}
+
+static void __dma_remap(struct page *page, size_t size, pgprot_t prot)
+{
+	unsigned long start = (unsigned long) page_address(page);
+	unsigned end = start + size;
+
+	apply_to_page_range(&init_mm, start, size, __dma_update_pte, &prot);
+	dsb();
+	flush_tlb_kernel_range(start, end);
+}
+
+static void *__alloc_remap_buffer(struct device *dev, size_t size, gfp_t gfp,
+				 pgprot_t prot, struct page **ret_page)
+{
+	struct page *page;
+	void *ptr;
+	page = __dma_alloc_buffer(dev, size, gfp);
+	if (!page)
+		return NULL;
+
+	ptr = __dma_alloc_remap(page, size, gfp, prot);
+	if (!ptr) {
+		__dma_free_buffer(page, size);
+		return NULL;
+	}
+
+	*ret_page = page;
+	return ptr;
+}
+
+static void *__alloc_from_pool(struct device *dev, size_t size,
+			       struct page **ret_page)
+{
+	struct arm_vmregion *c;
+	size_t align;
+
+	if (!coherent_head.vm_start) {
+		printk(KERN_ERR "%s: coherent pool not initialised!\n",
+		       __func__);
+		dump_stack();
+		return NULL;
+	}
+
+	/*
+	 * Align the region allocation - allocations from pool are rather
+	 * small, so align them to their order in pages, minimum is a page
+	 * size. This helps reduce fragmentation of the DMA space.
+	 */
+	align = PAGE_SIZE << get_order(size);
+	c = arm_vmregion_alloc(&coherent_head, align, size, 0);
+	if (c) {
+		void *ptr = (void *)c->vm_start;
+		struct page *page = virt_to_page(ptr);
+		*ret_page = page;
+		return ptr;
+	}
+	return NULL;
+}
+
+static int __free_from_pool(void *cpu_addr, size_t size)
+{
+	unsigned long start = (unsigned long)cpu_addr;
+	unsigned long end = start + size;
+	struct arm_vmregion *c;
+
+	if (start < coherent_head.vm_start || end > coherent_head.vm_end)
+		return 0;
+
+	c = arm_vmregion_find_remove(&coherent_head, (unsigned long)start);
+
+	if ((c->vm_end - c->vm_start) != size) {
+		printk(KERN_ERR "%s: freeing wrong coherent size (%ld != %d)\n",
+		       __func__, c->vm_end - c->vm_start, size);
+		dump_stack();
+		size = c->vm_end - c->vm_start;
+	}
+
+	arm_vmregion_free(&coherent_head, c);
+	return 1;
+}
+
+static void *__alloc_from_contiguous(struct device *dev, size_t size,
+				     pgprot_t prot, struct page **ret_page)
+{
+	unsigned long order = get_order(size);
+	size_t count = size >> PAGE_SHIFT;
+	struct page *page;
+
+	page = dma_alloc_from_contiguous(dev, count, order);
+	if (!page)
+		return NULL;
+
+	__dma_clear_buffer(page, size);
+	__dma_remap(page, size, prot);
+
+	*ret_page = page;
+	return page_address(page);
+}
+
+static void __free_from_contiguous(struct device *dev, struct page *page,
+				   size_t size)
+{
+	__dma_remap(page, size, pgprot_kernel);
+	dma_release_from_contiguous(dev, page, size >> PAGE_SHIFT);
+}
+
+#define nommu() 0
+
 #else	/* !CONFIG_MMU */
 
-#define __dma_alloc_remap(page, size, gfp, prot)	page_address(page)
-#define __dma_free_remap(addr, size)			do { } while (0)
+#define nommu() 1
+
+#define __alloc_remap_buffer(dev, size, gfp, prot, ret)	NULL
+#define __alloc_from_pool(dev, size, ret_page)		NULL
+#define __alloc_from_contiguous(dev, size, prot, ret)	NULL
+#define __free_from_pool(cpu_addr, size)		0
+#define __free_from_contiguous(dev, page, size)		do { } while (0)
+#define __dma_free_remap(cpu_addr, size)		do { } while (0)
 
 #endif	/* CONFIG_MMU */
 
-static void *
-__dma_alloc(struct device *dev, size_t size, dma_addr_t *handle, gfp_t gfp,
-	    pgprot_t prot)
+static void *__alloc_simple_buffer(struct device *dev, size_t size, gfp_t gfp,
+				   struct page **ret_page)
 {
 	struct page *page;
+	page = __dma_alloc_buffer(dev, size, gfp);
+	if (!page)
+		return NULL;
+
+	*ret_page = page;
+	return page_address(page);
+}
+
+
+
+static void *__dma_alloc(struct device *dev, size_t size, dma_addr_t *handle,
+			 gfp_t gfp, pgprot_t prot)
+{
+	u64 mask = get_coherent_dma_mask(dev);
+	struct page *page;
 	void *addr;
 
+#ifdef CONFIG_DMA_API_DEBUG
+	u64 limit = (mask + 1) & ~mask;
+	if (limit && size >= limit) {
+		dev_warn(dev, "coherent allocation too big (requested %#x mask %#llx)\n",
+			size, mask);
+		return NULL;
+	}
+#endif
+
+	if (!mask)
+		return NULL;
+
+	if (mask < 0xffffffffULL)
+		gfp |= GFP_DMA;
+
 	/*
 	 * Following is a work-around (a.k.a. hack) to prevent pages
 	 * with __GFP_COMP being passed to split_page() which cannot
@@ -344,19 +584,17 @@ __dma_alloc(struct device *dev, size_t size, dma_addr_t *handle, gfp_t gfp,
 	*handle = ~0;
 	size = PAGE_ALIGN(size);
 
-	page = __dma_alloc_buffer(dev, size, gfp);
-	if (!page)
-		return NULL;
-
-	if (!arch_is_coherent())
-		addr = __dma_alloc_remap(page, size, gfp, prot);
+	if (arch_is_coherent() || nommu())
+		addr = __alloc_simple_buffer(dev, size, gfp, &page);
+	else if (cpu_architecture() < CPU_ARCH_ARMv6)
+		addr = __alloc_remap_buffer(dev, size, gfp, prot, &page);
+	else if (gfp & GFP_ATOMIC)
+		addr = __alloc_from_pool(dev, size, &page);
 	else
-		addr = page_address(page);
+		addr = __alloc_from_contiguous(dev, size, prot, &page);
 
 	if (addr)
 		*handle = pfn_to_dma(dev, page_to_pfn(page));
-	else
-		__dma_free_buffer(page, size);
 
 	return addr;
 }
@@ -365,8 +603,8 @@ __dma_alloc(struct device *dev, size_t size, dma_addr_t *handle, gfp_t gfp,
  * Allocate DMA-coherent memory space and return both the kernel remapped
  * virtual and bus address for that space.
  */
-void *
-dma_alloc_coherent(struct device *dev, size_t size, dma_addr_t *handle, gfp_t gfp)
+void *dma_alloc_coherent(struct device *dev, size_t size, dma_addr_t *handle,
+			 gfp_t gfp)
 {
 	void *memory;
 
@@ -395,25 +633,11 @@ static int dma_mmap(struct device *dev, struct vm_area_struct *vma,
 {
 	int ret = -ENXIO;
 #ifdef CONFIG_MMU
-	unsigned long user_size, kern_size;
-	struct arm_vmregion *c;
-
-	user_size = (vma->vm_end - vma->vm_start) >> PAGE_SHIFT;
-
-	c = arm_vmregion_find(&consistent_head, (unsigned long)cpu_addr);
-	if (c) {
-		unsigned long off = vma->vm_pgoff;
-
-		kern_size = (c->vm_end - c->vm_start) >> PAGE_SHIFT;
-
-		if (off < kern_size &&
-		    user_size <= (kern_size - off)) {
-			ret = remap_pfn_range(vma, vma->vm_start,
-					      page_to_pfn(c->vm_pages) + off,
-					      user_size << PAGE_SHIFT,
-					      vma->vm_page_prot);
-		}
-	}
+	unsigned long pfn = dma_to_pfn(dev, dma_addr);
+	ret = remap_pfn_range(vma, vma->vm_start,
+			      pfn + vma->vm_pgoff,
+			      vma->vm_end - vma->vm_start,
+			      vma->vm_page_prot);
 #endif	/* CONFIG_MMU */
 
 	return ret;
@@ -435,23 +659,33 @@ int dma_mmap_writecombine(struct device *dev, struct vm_area_struct *vma,
 }
 EXPORT_SYMBOL(dma_mmap_writecombine);
 
+
 /*
- * free a page as defined by the above mapping.
- * Must not be called with IRQs disabled.
+ * Free a buffer as defined by the above mapping.
  */
 void dma_free_coherent(struct device *dev, size_t size, void *cpu_addr, dma_addr_t handle)
 {
-	WARN_ON(irqs_disabled());
+	struct page *page = pfn_to_page(dma_to_pfn(dev, handle));
 
 	if (dma_release_from_coherent(dev, get_order(size), cpu_addr))
 		return;
 
 	size = PAGE_ALIGN(size);
 
-	if (!arch_is_coherent())
+	if (arch_is_coherent() || nommu()) {
+		__dma_free_buffer(page, size);
+	} else if (cpu_architecture() < CPU_ARCH_ARMv6) {
 		__dma_free_remap(cpu_addr, size);
-
-	__dma_free_buffer(pfn_to_page(dma_to_pfn(dev, handle)), size);
+		__dma_free_buffer(page, size);
+	} else {
+		if (__free_from_pool(cpu_addr, size))
+			return;
+		/*
+		 * Non-atomic allocations cannot be freed with IRQs disabled
+		 */
+		WARN_ON(irqs_disabled());
+		__free_from_contiguous(dev, page, size);
+	}
 }
 EXPORT_SYMBOL(dma_free_coherent);
 
diff --git a/arch/arm/mm/init.c b/arch/arm/mm/init.c
index fbdd12e..dd0a1e4 100644
--- a/arch/arm/mm/init.c
+++ b/arch/arm/mm/init.c
@@ -21,6 +21,7 @@
 #include <linux/gfp.h>
 #include <linux/memblock.h>
 #include <linux/sort.h>
+#include <linux/dma-contiguous.h>
 
 #include <asm/mach-types.h>
 #include <asm/prom.h>
@@ -238,6 +239,17 @@ static void __init arm_adjust_dma_zone(unsigned long *size, unsigned long *hole,
 }
 #endif
 
+void __init setup_dma_zone(struct machine_desc *mdesc)
+{
+#ifdef CONFIG_ZONE_DMA
+	if (mdesc->dma_zone_size) {
+		arm_dma_zone_size = mdesc->dma_zone_size;
+		arm_dma_limit = PHYS_OFFSET + arm_dma_zone_size - 1;
+	} else
+		arm_dma_limit = 0xffffffff;
+#endif
+}
+
 static void __init arm_bootmem_free(unsigned long min, unsigned long max_low,
 	unsigned long max_high)
 {
@@ -285,12 +297,9 @@ static void __init arm_bootmem_free(unsigned long min, unsigned long max_low,
 	 * Adjust the sizes according to any special requirements for
 	 * this machine type.
 	 */
-	if (arm_dma_zone_size) {
+	if (arm_dma_zone_size)
 		arm_adjust_dma_zone(zone_size, zhole_size,
 			arm_dma_zone_size >> PAGE_SHIFT);
-		arm_dma_limit = PHYS_OFFSET + arm_dma_zone_size - 1;
-	} else
-		arm_dma_limit = 0xffffffff;
 #endif
 
 	free_area_init_node(0, zone_size, min, zhole_size);
@@ -371,6 +380,11 @@ void __init arm_memblock_init(struct meminfo *mi, struct machine_desc *mdesc)
 	if (mdesc->reserve)
 		mdesc->reserve();
 
+	/* reserve memory for DMA contigouos allocations,
+	   must come from DMA area inside low memory */
+	dma_contiguous_reserve(arm_dma_limit < arm_lowmem_limit ?
+			       arm_dma_limit : arm_lowmem_limit);
+
 	memblock_analyze();
 	memblock_dump_all();
 }
diff --git a/arch/arm/mm/mm.h b/arch/arm/mm/mm.h
index ad7cce3..fa95d9b 100644
--- a/arch/arm/mm/mm.h
+++ b/arch/arm/mm/mm.h
@@ -29,5 +29,8 @@ extern u32 arm_dma_limit;
 #define arm_dma_limit ((u32)~0)
 #endif
 
+extern phys_addr_t arm_lowmem_limit;
+
 void __init bootmem_init(void);
 void arm_mm_memblock_reserve(void);
+void dma_contiguous_remap(void);
diff --git a/arch/arm/mm/mmu.c b/arch/arm/mm/mmu.c
index dc8c550..9796cf1 100644
--- a/arch/arm/mm/mmu.c
+++ b/arch/arm/mm/mmu.c
@@ -281,6 +281,11 @@ static struct mem_type mem_types[] = {
 				PMD_SECT_UNCACHED | PMD_SECT_XN,
 		.domain    = DOMAIN_KERNEL,
 	},
+	[MT_MEMORY_DMA_READY] = {
+		.prot_pte  = L_PTE_PRESENT | L_PTE_YOUNG | L_PTE_DIRTY,
+		.prot_l1   = PMD_TYPE_TABLE,
+		.domain    = DOMAIN_KERNEL,
+	},
 };
 
 const struct mem_type *get_mem_type(unsigned int type)
@@ -422,6 +427,7 @@ static void __init build_mem_type_table(void)
 	if (arch_is_coherent() && cpu_is_xsc3()) {
 		mem_types[MT_MEMORY].prot_sect |= PMD_SECT_S;
 		mem_types[MT_MEMORY].prot_pte |= L_PTE_SHARED;
+		mem_types[MT_MEMORY_DMA_READY].prot_pte |= L_PTE_SHARED;
 		mem_types[MT_MEMORY_NONCACHED].prot_sect |= PMD_SECT_S;
 		mem_types[MT_MEMORY_NONCACHED].prot_pte |= L_PTE_SHARED;
 	}
@@ -451,6 +457,7 @@ static void __init build_mem_type_table(void)
 			mem_types[MT_DEVICE_CACHED].prot_pte |= L_PTE_SHARED;
 			mem_types[MT_MEMORY].prot_sect |= PMD_SECT_S;
 			mem_types[MT_MEMORY].prot_pte |= L_PTE_SHARED;
+			mem_types[MT_MEMORY_DMA_READY].prot_pte |= L_PTE_SHARED;
 			mem_types[MT_MEMORY_NONCACHED].prot_sect |= PMD_SECT_S;
 			mem_types[MT_MEMORY_NONCACHED].prot_pte |= L_PTE_SHARED;
 		}
@@ -490,6 +497,7 @@ static void __init build_mem_type_table(void)
 	mem_types[MT_HIGH_VECTORS].prot_l1 |= ecc_mask;
 	mem_types[MT_MEMORY].prot_sect |= ecc_mask | cp->pmd;
 	mem_types[MT_MEMORY].prot_pte |= kern_pgprot;
+	mem_types[MT_MEMORY_DMA_READY].prot_pte |= kern_pgprot;
 	mem_types[MT_MEMORY_NONCACHED].prot_sect |= ecc_mask;
 	mem_types[MT_ROM].prot_sect |= cp->pmd;
 
@@ -569,7 +577,7 @@ static void __init alloc_init_section(pud_t *pud, unsigned long addr,
 	 * L1 entries, whereas PGDs refer to a group of L1 entries making
 	 * up one logical pointer to an L2 table.
 	 */
-	if (((addr | end | phys) & ~SECTION_MASK) == 0) {
+	if (type->prot_sect && ((addr | end | phys) & ~SECTION_MASK) == 0) {
 		pmd_t *p = pmd;
 
 		if (addr & SECTION_SIZE)
@@ -765,7 +773,7 @@ static int __init early_vmalloc(char *arg)
 }
 early_param("vmalloc", early_vmalloc);
 
-static phys_addr_t lowmem_limit __initdata = 0;
+phys_addr_t arm_lowmem_limit __initdata = 0;
 
 void __init sanity_check_meminfo(void)
 {
@@ -834,8 +842,8 @@ void __init sanity_check_meminfo(void)
 			bank->size = newsize;
 		}
 #endif
-		if (!bank->highmem && bank->start + bank->size > lowmem_limit)
-			lowmem_limit = bank->start + bank->size;
+		if (!bank->highmem && bank->start + bank->size > arm_lowmem_limit)
+			arm_lowmem_limit = bank->start + bank->size;
 
 		j++;
 	}
@@ -860,7 +868,7 @@ void __init sanity_check_meminfo(void)
 	}
 #endif
 	meminfo.nr_banks = j;
-	memblock_set_current_limit(lowmem_limit);
+	memblock_set_current_limit(arm_lowmem_limit);
 }
 
 static inline void prepare_page_table(void)
@@ -885,8 +893,8 @@ static inline void prepare_page_table(void)
 	 * Find the end of the first block of lowmem.
 	 */
 	end = memblock.memory.regions[0].base + memblock.memory.regions[0].size;
-	if (end >= lowmem_limit)
-		end = lowmem_limit;
+	if (end >= arm_lowmem_limit)
+		end = arm_lowmem_limit;
 
 	/*
 	 * Clear out all the kernel space mappings, except for the first
@@ -1020,8 +1028,8 @@ static void __init map_lowmem(void)
 		phys_addr_t end = start + reg->size;
 		struct map_desc map;
 
-		if (end > lowmem_limit)
-			end = lowmem_limit;
+		if (end > arm_lowmem_limit)
+			end = arm_lowmem_limit;
 		if (start >= end)
 			break;
 
@@ -1042,11 +1050,12 @@ void __init paging_init(struct machine_desc *mdesc)
 {
 	void *zero_page;
 
-	memblock_set_current_limit(lowmem_limit);
+	memblock_set_current_limit(arm_lowmem_limit);
 
 	build_mem_type_table();
 	prepare_page_table();
 	map_lowmem();
+	dma_contiguous_remap();
 	devicemaps_init(mdesc);
 	kmap_init();
 
-- 
1.7.1.569.g6f426


^ permalink raw reply related	[flat|nested] 33+ messages in thread

* [PATCH 11/11] ARM: Samsung: use CMA for 2 memory banks for s5p-mfc device
  2011-12-29 12:39 [PATCHv18 0/11] Contiguous Memory Allocator Marek Szyprowski
                   ` (9 preceding siblings ...)
  2011-12-29 12:39 ` [PATCH 10/11] ARM: " Marek Szyprowski
@ 2011-12-29 12:39 ` Marek Szyprowski
  2012-01-10  8:42 ` [PATCHv18 0/11] Contiguous Memory Allocator Marek Szyprowski
  11 siblings, 0 replies; 33+ messages in thread
From: Marek Szyprowski @ 2011-12-29 12:39 UTC (permalink / raw)
  To: linux-kernel, linux-arm-kernel, linux-media, linux-mm, linaro-mm-sig
  Cc: Michal Nazarewicz, Marek Szyprowski, Kyungmin Park, Russell King,
	Andrew Morton, KAMEZAWA Hiroyuki, Daniel Walker, Mel Gorman,
	Arnd Bergmann, Jesse Barker, Jonathan Corbet, Shariq Hasnain,
	Chunsang Jeong, Dave Hansen, Benjamin Gaignard

Replace custom memory bank initialization using memblock_reserve and
dma_declare_coherent with a single call to CMA's dma_declare_contiguous.

Signed-off-by: Marek Szyprowski <m.szyprowski@samsung.com>
Signed-off-by: Kyungmin Park <kyungmin.park@samsung.com>
---
 arch/arm/plat-s5p/dev-mfc.c |   51 ++++++-------------------------------------
 1 files changed, 7 insertions(+), 44 deletions(-)

diff --git a/arch/arm/plat-s5p/dev-mfc.c b/arch/arm/plat-s5p/dev-mfc.c
index a30d36b..fcb8400 100644
--- a/arch/arm/plat-s5p/dev-mfc.c
+++ b/arch/arm/plat-s5p/dev-mfc.c
@@ -14,6 +14,7 @@
 #include <linux/interrupt.h>
 #include <linux/platform_device.h>
 #include <linux/dma-mapping.h>
+#include <linux/dma-contiguous.h>
 #include <linux/memblock.h>
 #include <linux/ioport.h>
 
@@ -22,52 +23,14 @@
 #include <plat/irqs.h>
 #include <plat/mfc.h>
 
-struct s5p_mfc_reserved_mem {
-	phys_addr_t	base;
-	unsigned long	size;
-	struct device	*dev;
-};
-
-static struct s5p_mfc_reserved_mem s5p_mfc_mem[2] __initdata;
-
 void __init s5p_mfc_reserve_mem(phys_addr_t rbase, unsigned int rsize,
 				phys_addr_t lbase, unsigned int lsize)
 {
-	int i;
-
-	s5p_mfc_mem[0].dev = &s5p_device_mfc_r.dev;
-	s5p_mfc_mem[0].base = rbase;
-	s5p_mfc_mem[0].size = rsize;
-
-	s5p_mfc_mem[1].dev = &s5p_device_mfc_l.dev;
-	s5p_mfc_mem[1].base = lbase;
-	s5p_mfc_mem[1].size = lsize;
-
-	for (i = 0; i < ARRAY_SIZE(s5p_mfc_mem); i++) {
-		struct s5p_mfc_reserved_mem *area = &s5p_mfc_mem[i];
-		if (memblock_remove(area->base, area->size)) {
-			printk(KERN_ERR "Failed to reserve memory for MFC device (%ld bytes at 0x%08lx)\n",
-			       area->size, (unsigned long) area->base);
-			area->base = 0;
-		}
-	}
-}
-
-static int __init s5p_mfc_memory_init(void)
-{
-	int i;
-
-	for (i = 0; i < ARRAY_SIZE(s5p_mfc_mem); i++) {
-		struct s5p_mfc_reserved_mem *area = &s5p_mfc_mem[i];
-		if (!area->base)
-			continue;
+	if (dma_declare_contiguous(&s5p_device_mfc_r.dev, rsize, rbase, 0))
+		printk(KERN_ERR "Failed to reserve memory for MFC device (%u bytes at 0x%08lx)\n",
+		       rsize, (unsigned long) rbase);
 
-		if (dma_declare_coherent_memory(area->dev, area->base,
-				area->base, area->size,
-				DMA_MEMORY_MAP | DMA_MEMORY_EXCLUSIVE) == 0)
-			printk(KERN_ERR "Failed to declare coherent memory for MFC device (%ld bytes at 0x%08lx)\n",
-			       area->size, (unsigned long) area->base);
-	}
-	return 0;
+	if (dma_declare_contiguous(&s5p_device_mfc_l.dev, lsize, lbase, 0))
+		printk(KERN_ERR "Failed to reserve memory for MFC device (%u bytes at 0x%08lx)\n",
+		       rsize, (unsigned long) rbase);
 }
-device_initcall(s5p_mfc_memory_init);
-- 
1.7.1.569.g6f426


^ permalink raw reply related	[flat|nested] 33+ messages in thread

* Re: [PATCH 01/11] mm: page_alloc: set_migratetype_isolate: drain PCP prior to isolating
  2011-12-29 12:39 ` [PATCH 01/11] mm: page_alloc: set_migratetype_isolate: drain PCP prior to isolating Marek Szyprowski
@ 2012-01-01  7:49   ` Gilad Ben-Yossef
  2012-01-01 15:54     ` Michal Nazarewicz
  2012-01-05 15:39   ` Michal Nazarewicz
  1 sibling, 1 reply; 33+ messages in thread
From: Gilad Ben-Yossef @ 2012-01-01  7:49 UTC (permalink / raw)
  To: Marek Szyprowski
  Cc: linux-kernel, linux-arm-kernel, linux-media, linux-mm,
	linaro-mm-sig, Michal Nazarewicz, Kyungmin Park, Russell King,
	Andrew Morton, KAMEZAWA Hiroyuki, Daniel Walker, Mel Gorman,
	Arnd Bergmann, Jesse Barker, Jonathan Corbet, Shariq Hasnain,
	Chunsang Jeong, Dave Hansen, Benjamin Gaignard

On Thu, Dec 29, 2011 at 2:39 PM, Marek Szyprowski
<m.szyprowski@samsung.com> wrote:
> From: Michal Nazarewicz <mina86@mina86.com>
>
> When set_migratetype_isolate() sets pageblock's migrate type, it does
> not change each page_private data.  This makes sense, as the function
> has no way of knowing what kind of information page_private stores.
...
>
>
> A side effect is that instead of draining pages from all zones,
> set_migratetype_isolate() now drain only pages from zone pageblock it
> operates on is in.
>
...


>
> +/* Caller must hold zone->lock. */
> +static void __zone_drain_local_pages(void *arg)
> +{
> +       struct per_cpu_pages *pcp;
> +       struct zone *zone = arg;
> +       unsigned long flags;
> +
> +       local_irq_save(flags);
> +       pcp = &per_cpu_ptr(zone->pageset, smp_processor_id())->pcp;
> +       if (pcp->count) {
> +               /* Caller holds zone->lock, no need to grab it. */
> +               __free_pcppages_bulk(zone, pcp->count, pcp);
> +               pcp->count = 0;
> +       }
> +       local_irq_restore(flags);
> +}
> +
> +/*
> + * Like drain_all_pages() but operates on a single zone.  Caller must
> + * hold zone->lock.
> + */
> +static void __zone_drain_all_pages(struct zone *zone)
> +{
> +       on_each_cpu(__zone_drain_local_pages, zone, 1);
> +}
> +

Please consider whether sending an IPI to all processors in the system
and interrupting them is appropriate here.

You seem to assume that it is probable that each CPU of the possibly
4,096 (MAXSMP on x86) has a per-cpu page
for the specified zone, otherwise you're just interrupting them out of
doing something useful, or save power idle
for nothing.

While that may or may not be a reasonable assumption for the general
drain_all_pages that drains pcps from
all zones, I feel it is less likely to be the right thing once you
limit the drain to a single zone.

Some background on my attempt to reduce "IPI noise" in the system in
this context is probably useful here as
well: https://lkml.org/lkml/2011/11/22/133

Thanks :-)
Gilad



-- 
Gilad Ben-Yossef
Chief Coffee Drinker
gilad@benyossef.com
Israel Cell: +972-52-8260388
US Cell: +1-973-8260388
http://benyossef.com

"Unfortunately, cache misses are an equal opportunity pain provider."
-- Mike Galbraith, LKML

^ permalink raw reply	[flat|nested] 33+ messages in thread

* Re: [PATCH 01/11] mm: page_alloc: set_migratetype_isolate: drain PCP prior to isolating
  2012-01-01  7:49   ` Gilad Ben-Yossef
@ 2012-01-01 15:54     ` Michal Nazarewicz
  2012-01-01 16:06       ` Gilad Ben-Yossef
  0 siblings, 1 reply; 33+ messages in thread
From: Michal Nazarewicz @ 2012-01-01 15:54 UTC (permalink / raw)
  To: Marek Szyprowski, Gilad Ben-Yossef
  Cc: linux-kernel, linux-arm-kernel, linux-media, linux-mm,
	linaro-mm-sig, Kyungmin Park, Russell King, Andrew Morton,
	KAMEZAWA Hiroyuki, Daniel Walker, Mel Gorman, Arnd Bergmann,
	Jesse Barker, Jonathan Corbet, Shariq Hasnain, Chunsang Jeong,
	Dave Hansen, Benjamin Gaignard

> On Thu, Dec 29, 2011 at 2:39 PM, Marek Szyprowski
> <m.szyprowski@samsung.com> wrote:
>> From: Michal Nazarewicz <mina86@mina86.com>
>>
>> When set_migratetype_isolate() sets pageblock's migrate type, it does
>> not change each page_private data.  This makes sense, as the function
>> has no way of knowing what kind of information page_private stores.

>> A side effect is that instead of draining pages from all zones,
>> set_migratetype_isolate() now drain only pages from zone pageblock it
>> operates on is in.

>> +/* Caller must hold zone->lock. */
>> +static void __zone_drain_local_pages(void *arg)
>> +{
>> +       struct per_cpu_pages *pcp;
>> +       struct zone *zone = arg;
>> +       unsigned long flags;
>> +
>> +       local_irq_save(flags);
>> +       pcp = &per_cpu_ptr(zone->pageset, smp_processor_id())->pcp;
>> +       if (pcp->count) {
>> +               /* Caller holds zone->lock, no need to grab it. */
>> +               __free_pcppages_bulk(zone, pcp->count, pcp);
>> +               pcp->count = 0;
>> +       }
>> +       local_irq_restore(flags);
>> +}
>> +
>> +/*
>> + * Like drain_all_pages() but operates on a single zone.  Caller must
>> + * hold zone->lock.
>> + */
>> +static void __zone_drain_all_pages(struct zone *zone)
>> +{
>> +       on_each_cpu(__zone_drain_local_pages, zone, 1);
>> +}
>> +

On Sun, 01 Jan 2012 08:49:13 +0100, Gilad Ben-Yossef <gilad@benyossef.com> wrote:
> Please consider whether sending an IPI to all processors in the system
> and interrupting them is appropriate here.
>
> You seem to assume that it is probable that each CPU of the possibly
> 4,096 (MAXSMP on x86) has a per-cpu page for the specified zone,

I'm not really assuming that (in fact I expect what you fear, ie. that
most CPUs won't have pages from specified zone an PCP list), however,
I really need to make sure to get them off all PCP lists.

> otherwise you're just interrupting them out of doing something useful,
> or save power idle for nothing.

Exactly what's happening now anyway.

> While that may or may not be a reasonable assumption for the general
> drain_all_pages that drains pcps from all zones, I feel it is less
> likely to be the right thing once you limit the drain to a single
> zone.

Currently, set_migratetype_isolate() seem to do more then it needs to,
ie. it drains all the pages even though all it cares about is a single
zone.

> Some background on my attempt to reduce "IPI noise" in the system in
> this context is probably useful here as
> well: https://lkml.org/lkml/2011/11/22/133

Looks interesting, I'm not entirely sure why it does not end up a race
condition, but in case of __zone_drain_all_pages() we already hold
zone->lock, so my fears are somehow gone..  I'll give it a try, and prepare
a patch for __zone_drain_all_pages().

-- 
Best regards,                                         _     _
.o. | Liege of Serenely Enlightened Majesty of      o' \,=./ `o
..o | Computer Science,  Michał “mina86” Nazarewicz    (o o)
ooo +----<email/xmpp: mpn@google.com>--------------ooO--(_)--Ooo--

^ permalink raw reply	[flat|nested] 33+ messages in thread

* Re: [PATCH 01/11] mm: page_alloc: set_migratetype_isolate: drain PCP prior to isolating
  2012-01-01 15:54     ` Michal Nazarewicz
@ 2012-01-01 16:06       ` Gilad Ben-Yossef
  2012-01-01 18:52         ` Michal Nazarewicz
  0 siblings, 1 reply; 33+ messages in thread
From: Gilad Ben-Yossef @ 2012-01-01 16:06 UTC (permalink / raw)
  To: Michal Nazarewicz
  Cc: Marek Szyprowski, linux-kernel, linux-arm-kernel, linux-media,
	linux-mm, linaro-mm-sig, Kyungmin Park, Russell King,
	Andrew Morton, KAMEZAWA Hiroyuki, Daniel Walker, Mel Gorman,
	Arnd Bergmann, Jesse Barker, Jonathan Corbet, Shariq Hasnain,
	Chunsang Jeong, Dave Hansen, Benjamin Gaignard

2012/1/1 Michal Nazarewicz <mina86@mina86.com>:
>> On Thu, Dec 29, 2011 at 2:39 PM, Marek Szyprowski
>> <m.szyprowski@samsung.com> wrote:

...
> On Sun, 01 Jan 2012 08:49:13 +0100, Gilad Ben-Yossef <gilad@benyossef.com>
> wrote:
>>
>> Please consider whether sending an IPI to all processors in the system
>> and interrupting them is appropriate here.
>>
>> You seem to assume that it is probable that each CPU of the possibly
>> 4,096 (MAXSMP on x86) has a per-cpu page for the specified zone,
>
>
> I'm not really assuming that (in fact I expect what you fear, ie. that
> most CPUs won't have pages from specified zone an PCP list), however,
> I really need to make sure to get them off all PCP lists.
>

True, the question is whether or not you have to send a global IPI to do that.

>
>> otherwise you're just interrupting them out of doing something useful,
>> or save power idle for nothing.
>
>
> Exactly what's happening now anyway.
>
>

Indeed.

>> While that may or may not be a reasonable assumption for the general
>> drain_all_pages that drains pcps from all zones, I feel it is less
>> likely to be the right thing once you limit the drain to a single
>> zone.
>
>
> Currently, set_migratetype_isolate() seem to do more then it needs to,
> ie. it drains all the pages even though all it cares about is a single
>
> zone.

I agree your patch is better then current state. I just did want to add
yet another global IPI I'll have to chase afterwards.. :-)
>
>> Some background on my attempt to reduce "IPI noise" in the system in
>> this context is probably useful here as
>> well: https://lkml.org/lkml/2011/11/22/133
>
>
> Looks interesting, I'm not entirely sure why it does not end up a race
> condition, but in case of __zone_drain_all_pages() we already hold

If a page is in the PCP list when we check, you'll send the IPI and all is well.

If it isn't when we check and gets added later you could just the same have
situation where we send the IPI, try to do try an empty PCP list and then
the page gets added. So we are not adding a race condition that is not there
already :-)

> zone->lock, so my fears are somehow gone..  I'll give it a try, and prepare
> a patch for __zone_drain_all_pages().
>

I plan to send V5 of the IPI noise patch after some testing. It has a new
version of the drain_all_pages, with no allocation in the reclaim path
and no locking. You might want to wait till that one is out to base on it.


Thank you for considering my feedback :-)

Gilad


-- 
Gilad Ben-Yossef
Chief Coffee Drinker
gilad@benyossef.com
Israel Cell: +972-52-8260388
US Cell: +1-973-8260388
http://benyossef.com

"Unfortunately, cache misses are an equal opportunity pain provider."
-- Mike Galbraith, LKML

^ permalink raw reply	[flat|nested] 33+ messages in thread

* Re: [PATCH 01/11] mm: page_alloc: set_migratetype_isolate: drain PCP prior to isolating
  2012-01-01 16:06       ` Gilad Ben-Yossef
@ 2012-01-01 18:52         ` Michal Nazarewicz
  0 siblings, 0 replies; 33+ messages in thread
From: Michal Nazarewicz @ 2012-01-01 18:52 UTC (permalink / raw)
  To: Gilad Ben-Yossef
  Cc: Marek Szyprowski, linux-kernel, linux-arm-kernel, linux-media,
	linux-mm, linaro-mm-sig, Kyungmin Park, Russell King,
	Andrew Morton, KAMEZAWA Hiroyuki, Daniel Walker, Mel Gorman,
	Arnd Bergmann, Jesse Barker, Jonathan Corbet, Shariq Hasnain,
	Chunsang Jeong, Dave Hansen, Benjamin Gaignard

On Sun, 01 Jan 2012 17:06:53 +0100, Gilad Ben-Yossef <gilad@benyossef.com> wrote:

> 2012/1/1 Michal Nazarewicz <mina86@mina86.com>:
>> Looks interesting, I'm not entirely sure why it does not end up a race
>> condition, but in case of __zone_drain_all_pages() we already hold

> If a page is in the PCP list when we check, you'll send the IPI and all is well.
>
> If it isn't when we check and gets added later you could just the same have
> situation where we send the IPI, try to do try an empty PCP list and then
> the page gets added. So we are not adding a race condition that is not there
> already :-)

Right, makes sense.

>> zone->lock, so my fears are somehow gone..  I'll give it a try, and prepare
>> a patch for __zone_drain_all_pages().

> I plan to send V5 of the IPI noise patch after some testing. It has a new
> version of the drain_all_pages, with no allocation in the reclaim path
> and no locking. You might want to wait till that one is out to base on it.

This shouldn't be a problem for my case as set_migratetype_isolate() is hardly
ever called in reclaim path. :)

The change so far seems rather obvious:

  mm/page_alloc.c |   14 +++++++++++++-
  1 files changed, 13 insertions(+), 1 deletions(-)

diff --git a/mm/page_alloc.c b/mm/page_alloc.c
index 424d36a..eaa686b 100644
--- a/mm/page_alloc.c
+++ b/mm/page_alloc.c
@@ -1181,7 +1181,19 @@ static void __zone_drain_local_pages(void *arg)
   */
  static void __zone_drain_all_pages(struct zone *zone)
  {
-	on_each_cpu(__zone_drain_local_pages, zone, 1);
+	struct per_cpu_pageset *pcp;
+	cpumask_var_t cpus;
+	int cpu;
+
+	if (likely(zalloc_cpumask_var(&cpus, GFP_ATOMIC | __GFP_NOWARN))) {
+		for_each_online_cpu(cpu)
+			if (per_cpu_ptr(zone->pageset, cpu)->pcp.count)
+				cpumask_set_cpu(cpu, cpus);
+		on_each_cpu_mask(cpus, __zone_drain_local_pages, zone, 1);
+		free_cpumask_var(cpus);
+	} else {
+		on_each_cpu(__zone_drain_local_pages, zone, 1);
+	}
  }

  #ifdef CONFIG_HIBERNATION

-- 
Best regards,                                         _     _
.o. | Liege of Serenely Enlightened Majesty of      o' \,=./ `o
..o | Computer Science,  Michał “mina86” Nazarewicz    (o o)
ooo +----<email/xmpp: mpn@google.com>--------------ooO--(_)--Ooo--

^ permalink raw reply related	[flat|nested] 33+ messages in thread

* Re: [PATCH 01/11] mm: page_alloc: set_migratetype_isolate: drain PCP prior to isolating
  2011-12-29 12:39 ` [PATCH 01/11] mm: page_alloc: set_migratetype_isolate: drain PCP prior to isolating Marek Szyprowski
  2012-01-01  7:49   ` Gilad Ben-Yossef
@ 2012-01-05 15:39   ` Michal Nazarewicz
  2012-01-05 19:20     ` Marek Szyprowski
  1 sibling, 1 reply; 33+ messages in thread
From: Michal Nazarewicz @ 2012-01-05 15:39 UTC (permalink / raw)
  To: linux-kernel, linux-arm-kernel, linux-media, linux-mm,
	linaro-mm-sig, Marek Szyprowski
  Cc: Kyungmin Park, Russell King, Andrew Morton, KAMEZAWA Hiroyuki,
	Daniel Walker, Mel Gorman, Arnd Bergmann, Jesse Barker,
	Jonathan Corbet, Shariq Hasnain, Chunsang Jeong, Dave Hansen,
	Benjamin Gaignard

On Thu, 29 Dec 2011 13:39:02 +0100, Marek Szyprowski <m.szyprowski@samsung.com> wrote:
> From: Michal Nazarewicz <mina86@mina86.com>
>
> When set_migratetype_isolate() sets pageblock's migrate type, it does
> not change each page_private data.  This makes sense, as the function
> has no way of knowing what kind of information page_private stores.
>
> Unfortunately, if a page is on PCP list, it's page_private indicates
> its migrate type.  This means, that if a page on PCP list gets
> isolated, a call to free_pcppages_bulk() will assume it has the old
> migrate type rather than MIGRATE_ISOLATE.  This means, that a page
> which should be isolated, will end up on a free list of it's old
> migrate type.
>
> Coincidentally, at the very end, set_migratetype_isolate() calls
> drain_all_pages() which leads to calling free_pcppages_bulk(), which
> does the wrong thing.
>
> To avoid this situation, this commit moves the draining prior to
> setting pageblock's migratetype and moving pages from old free list to
> MIGRATETYPE_ISOLATE's free list.
>
> Because of spin locks this is a non-trivial change however as both
> set_migratetype_isolate() and free_pcppages_bulk() grab zone->lock.
> To solve this problem, this commit renames free_pcppages_bulk() to
> __free_pcppages_bulk() and changes it so that it no longer grabs
> zone->lock instead requiring caller to hold it.  This commit later
> adds a __zone_drain_all_pages() function which works just like
> drain_all_pages() expects that it drains only pages from a single zone
> and assumes that caller holds zone->lock.

As it turns out, with some more testing on SMP systems, this whole patch
turned out to be incorrect.

We have been thinking about other approach and, if we were to use something
else then the first patch from CMAv17[1], the best thing we could came up
with was to unconditionally call drain_all_pages() at the beginning of
set_migratetype_isolate() before the call to spin_lock_irqsave().  It has
a possible race condition but a nightly stress test did have not shown any
problems.

Nonetheless, the cleanest, in my opinion, solution is to use the first patch
 from CMAv17 which can be found at [1].

So, to sum up: if you intend to test CMAv18, instead of applying this first
patch either use first patch from CMAv17[1] or put an unconditional call to
drain_all_pages() at the beginning of set_migrate_isolate() function.

Sorry for the troubles.

[1] http://www.spinics.net/lists/arm-kernel/msg148494.html

-- 
Best regards,                                         _     _
.o. | Liege of Serenely Enlightened Majesty of      o' \,=./ `o
..o | Computer Science,  Michał “mina86” Nazarewicz    (o o)
ooo +----<email/xmpp: mpn@google.com>--------------ooO--(_)--Ooo--

^ permalink raw reply	[flat|nested] 33+ messages in thread

* RE: [PATCH 01/11] mm: page_alloc: set_migratetype_isolate: drain PCP prior to isolating
  2012-01-05 15:39   ` Michal Nazarewicz
@ 2012-01-05 19:20     ` Marek Szyprowski
  0 siblings, 0 replies; 33+ messages in thread
From: Marek Szyprowski @ 2012-01-05 19:20 UTC (permalink / raw)
  To: 'Michal Nazarewicz',
	linux-kernel, linux-arm-kernel, linux-media, linux-mm,
	linaro-mm-sig
  Cc: 'Kyungmin Park', 'Russell King',
	'Andrew Morton', 'KAMEZAWA Hiroyuki',
	'Daniel Walker', 'Mel Gorman',
	'Arnd Bergmann', 'Jesse Barker',
	'Jonathan Corbet', 'Shariq Hasnain',
	'Chunsang Jeong', 'Dave Hansen',
	'Benjamin Gaignard'

Hello,

On Thursday, January 05, 2012 4:40 PM Michał Nazarewicz wrote:

> On Thu, 29 Dec 2011 13:39:02 +0100, Marek Szyprowski <m.szyprowski@samsung.com> wrote:
> > From: Michal Nazarewicz <mina86@mina86.com>
> >
> > When set_migratetype_isolate() sets pageblock's migrate type, it does
> > not change each page_private data.  This makes sense, as the function
> > has no way of knowing what kind of information page_private stores.
> >
> > Unfortunately, if a page is on PCP list, it's page_private indicates
> > its migrate type.  This means, that if a page on PCP list gets
> > isolated, a call to free_pcppages_bulk() will assume it has the old
> > migrate type rather than MIGRATE_ISOLATE.  This means, that a page
> > which should be isolated, will end up on a free list of it's old
> > migrate type.
> >
> > Coincidentally, at the very end, set_migratetype_isolate() calls
> > drain_all_pages() which leads to calling free_pcppages_bulk(), which
> > does the wrong thing.
> >
> > To avoid this situation, this commit moves the draining prior to
> > setting pageblock's migratetype and moving pages from old free list to
> > MIGRATETYPE_ISOLATE's free list.
> >
> > Because of spin locks this is a non-trivial change however as both
> > set_migratetype_isolate() and free_pcppages_bulk() grab zone->lock.
> > To solve this problem, this commit renames free_pcppages_bulk() to
> > __free_pcppages_bulk() and changes it so that it no longer grabs
> > zone->lock instead requiring caller to hold it.  This commit later
> > adds a __zone_drain_all_pages() function which works just like
> > drain_all_pages() expects that it drains only pages from a single zone
> > and assumes that caller holds zone->lock.
> 
> As it turns out, with some more testing on SMP systems, this whole patch
> turned out to be incorrect.
> 
> We have been thinking about other approach and, if we were to use something
> else then the first patch from CMAv17[1], the best thing we could came up
> with was to unconditionally call drain_all_pages() at the beginning of
> set_migratetype_isolate() before the call to spin_lock_irqsave().  It has
> a possible race condition but a nightly stress test did have not shown any
> problems.
> 
> Nonetheless, the cleanest, in my opinion, solution is to use the first patch
>  from CMAv17 which can be found at [1].
> 
> So, to sum up: if you intend to test CMAv18, instead of applying this first
> patch either use first patch from CMAv17[1] or put an unconditional call to
> drain_all_pages() at the beginning of set_migrate_isolate() function.
> 
> Sorry for the troubles.
> 
> [1] http://www.spinics.net/lists/arm-kernel/msg148494.html

I've updated our public git repository to include this workaround. You can
pull the patches from the following addresses:

git://git.infradead.org/users/kmpark/linux-samsung 3.2-rc7-cma-v18

http://git.infradead.org/users/kmpark/linux-samsung/shortlog/refs/heads/3.2-rc7-cma-v18

Best regards
-- 
Marek Szyprowski
Samsung Poland R&D Center



^ permalink raw reply	[flat|nested] 33+ messages in thread

* RE: [PATCHv18 0/11] Contiguous Memory Allocator
  2011-12-29 12:39 [PATCHv18 0/11] Contiguous Memory Allocator Marek Szyprowski
                   ` (10 preceding siblings ...)
  2011-12-29 12:39 ` [PATCH 11/11] ARM: Samsung: use CMA for 2 memory banks for s5p-mfc device Marek Szyprowski
@ 2012-01-10  8:42 ` Marek Szyprowski
  11 siblings, 0 replies; 33+ messages in thread
From: Marek Szyprowski @ 2012-01-10  8:42 UTC (permalink / raw)
  To: Marek Szyprowski, linux-kernel, linux-arm-kernel, linux-media,
	linux-mm, linaro-mm-sig
  Cc: 'Michal Nazarewicz', 'Kyungmin Park',
	'Russell King', 'Andrew Morton',
	'KAMEZAWA Hiroyuki', 'Daniel Walker',
	'Mel Gorman', 'Arnd Bergmann',
	'Jesse Barker', 'Jonathan Corbet',
	'Shariq Hasnain', 'Chunsang Jeong',
	'Dave Hansen', 'Benjamin Gaignard',
	'Kukjin Kim', 'KyongHo Cho'

Hello,

To help everyone in testing and adapting our patches for his hardware 
platform I've rebased our patches onto the latest v3.2 Linux kernel and
prepared a few GIT branches in our public repository. These branches
contain our memory management related patches posted in the following
threads:

"[PATCHv18 0/11] Contiguous Memory Allocator":
http://www.spinics.net/lists/linux-mm/msg28125.html
later called CMAv18,

"[PATCH 00/14] DMA-mapping framework redesign preparation":
http://www.spinics.net/lists/linux-sh/msg09777.html
and
"[PATCH 0/8 v4] ARM: DMA-mapping framework redesign":
http://www.spinics.net/lists/arm-kernel/msg151147.html
with the following update:
http://www.spinics.net/lists/arm-kernel/msg154889.html
later called DMAv5.

These branches are available in our public GIT repository:

git://git.infradead.org/users/kmpark/linux-samsung
http://git.infradead.org/users/kmpark/linux-samsung/

The following branches are available:

1) 3.2-cma-v18
Vanilla Linux v3.2 with fixed CMA v18 patches (first patch replaced
with the one from v17 to fix SMP issues, see the respective thread).

2) 3.2-dma-v5
Vanilla Linux v3.2 + iommu/next (IOMMU maintainer's patches) branch
with DMA-preparation and DMA-mapping framework redesign patches.

3) 3.2-cma-v18-dma-v5
Previous two branches merged together (DMA-mapping on top of CMA)

4) 3.2-cma-v18-dma-v5-exynos
Previous branch rebased on top of iommu/next + kgene/for-next (Samsung
SoC platform maintainer's patches) with new Exynos4 IOMMU driver by 
KyongHo Cho and relevant glue code.

5) 3.2-dma-v5-exynos
Branch from point 2 rebased on top of iommu/next + kgene/for-next 
(Samsung SoC maintainer's patches) with new Exynos4 IOMMU driver by 
KyongHo Cho and relevant glue code.

I hope everyone will find a branch that suits his needs. :)

Best regards
-- 
Marek Szyprowski
Samsung Poland R&D Center




^ permalink raw reply	[flat|nested] 33+ messages in thread

* Re: [PATCH 02/11] mm: compaction: introduce isolate_{free,migrate}pages_range().
  2011-12-29 12:39 ` [PATCH 02/11] mm: compaction: introduce isolate_{free,migrate}pages_range() Marek Szyprowski
@ 2012-01-10 13:43   ` Mel Gorman
  2012-01-10 15:04     ` Michal Nazarewicz
  0 siblings, 1 reply; 33+ messages in thread
From: Mel Gorman @ 2012-01-10 13:43 UTC (permalink / raw)
  To: Marek Szyprowski
  Cc: linux-kernel, linux-arm-kernel, linux-media, linux-mm,
	linaro-mm-sig, Michal Nazarewicz, Kyungmin Park, Russell King,
	Andrew Morton, KAMEZAWA Hiroyuki, Daniel Walker, Arnd Bergmann,
	Jesse Barker, Jonathan Corbet, Shariq Hasnain, Chunsang Jeong,
	Dave Hansen, Benjamin Gaignard

On Thu, Dec 29, 2011 at 01:39:03PM +0100, Marek Szyprowski wrote:
> From: Michal Nazarewicz <mina86@mina86.com>
> 
> This commit introduces isolate_freepages_range() and
> isolate_migratepages_range() functions.  The first one replaces
> isolate_freepages_block() and the second one extracts functionality
> from isolate_migratepages().
> 
> They are more generic and instead of operating on pageblocks operate
> on PFN ranges.
> 
> Signed-off-by: Michal Nazarewicz <mina86@mina86.com>
> Signed-off-by: Marek Szyprowski <m.szyprowski@samsung.com>
> ---
>  mm/compaction.c |  184 ++++++++++++++++++++++++++++++++++++++-----------------
>  1 files changed, 127 insertions(+), 57 deletions(-)
> 
> diff --git a/mm/compaction.c b/mm/compaction.c
> index 899d956..ae73b6f 100644
> --- a/mm/compaction.c
> +++ b/mm/compaction.c
> @@ -54,55 +54,86 @@ static unsigned long release_freepages(struct list_head *freelist)
>  	return count;
>  }
>  
> -/* Isolate free pages onto a private freelist. Must hold zone->lock */
> -static unsigned long isolate_freepages_block(struct zone *zone,
> -				unsigned long blockpfn,
> -				struct list_head *freelist)
> +/**
> + * isolate_freepages_range() - isolate free pages, must hold zone->lock.
> + * @zone:	Zone pages are in.
> + * @start_pfn:	The first PFN to start isolating.
> + * @end_pfn:	The one-past-last PFN.
> + * @freelist:	A list to save isolated pages to.
> + *
> + * If @freelist is not provided, holes in range (either non-free pages
> + * or invalid PFNs) are considered an error and function undos its
> + * actions and returns zero.
> + *
> + * If @freelist is provided, function will simply skip non-free and
> + * missing pages and put only the ones isolated on the list.
> + *
> + * Returns number of isolated pages.  This may be more then end_pfn-start_pfn
> + * if end fell in a middle of a free page.
> + */
> +static unsigned long
> +isolate_freepages_range(struct zone *zone,
> +			unsigned long start_pfn, unsigned long end_pfn,
> +			struct list_head *freelist)
>  {
> -	unsigned long zone_end_pfn, end_pfn;
> -	int nr_scanned = 0, total_isolated = 0;
> -	struct page *cursor;
> -
> -	/* Get the last PFN we should scan for free pages at */
> -	zone_end_pfn = zone->zone_start_pfn + zone->spanned_pages;
> -	end_pfn = min(blockpfn + pageblock_nr_pages, zone_end_pfn);
> +	unsigned long nr_scanned = 0, total_isolated = 0;
> +	unsigned long pfn = start_pfn;
> +	struct page *page;
>  
> -	/* Find the first usable PFN in the block to initialse page cursor */
> -	for (; blockpfn < end_pfn; blockpfn++) {
> -		if (pfn_valid_within(blockpfn))
> -			break;
> -	}
> -	cursor = pfn_to_page(blockpfn);
> +	VM_BUG_ON(!pfn_valid(pfn));
> +	page = pfn_to_page(pfn);

The existing code is able to find the first usable PFN in a pageblock
with pfn_valid_within(). It's allowed to do that because it knows
the pageblock is valid so calling pfn_valid() is unnecessary.

It is curious to change this to something that can sometimes BUG_ON
!pfn_valid(pfn) instead of having a PFN walker that knows how to
handle pfn_valid().

>  
>  	/* Isolate free pages. This assumes the block is valid */

You leave this comment intact but you are no longer scanning a page
block. It is in fact assuming that the entire range is valid.

> -	for (; blockpfn < end_pfn; blockpfn++, cursor++) {
> -		int isolated, i;
> -		struct page *page = cursor;
> +	while (pfn < end_pfn) {
> +		int n = 0, i;
>  
> -		if (!pfn_valid_within(blockpfn))
> -			continue;
> -		nr_scanned++;
> +		if (!pfn_valid_within(pfn))
> +			goto next;
> +		++nr_scanned;
>  
>  		if (!PageBuddy(page))
> -			continue;
> +			goto next;
>  
>  		/* Found a free page, break it into order-0 pages */
> -		isolated = split_free_page(page);
> -		total_isolated += isolated;
> -		for (i = 0; i < isolated; i++) {
> -			list_add(&page->lru, freelist);
> -			page++;
> +		n = split_free_page(page);
> +		total_isolated += n;
> +		if (freelist) {
> +			struct page *p = page;
> +			for (i = n; i; --i, ++p)
> +				list_add(&p->lru, freelist);

Maybe it's just me, but the original code of

		for (i = 0; i < isolated; i++) {
			list_add(&page->lru, freelist);
			page++;
		}

Was a lot easier to read than this increment/decrement with zero
check for loop. Even 

for (i = 0; i < n; i++)
	list_add(&p[i].lru, freelist)

would have been a bit easier to parse. I also don't even get why you
renamed "isolated" to the less obvious name "n".

Neither are wrong, it just looks obscure for the sake of it.

>  		}
>  
> -		/* If a page was split, advance to the end of it */
> -		if (isolated) {
> -			blockpfn += isolated - 1;
> -			cursor += isolated - 1;
> +next:
> +		if (!n) {
> +			/* If n == 0, we have isolated no pages. */
> +			if (!freelist)
> +				goto cleanup;
> +			n = 1;
>  		}

That needs a better comment explaining why we undo what has happened up
until now if freelist == NULL.

> +
> +		/*
> +		 * If we pass max order page, we might end up in a different
> +		 * vmemmap, so account for that.
> +		 */
> +		pfn += n;
> +		if (pfn & (MAX_ORDER_NR_PAGES - 1))
> +			page += n;
> +		else
> +			page = pfn_to_page(pfn);
>  	}

You never check that the pfn is actually valid. There could be a
memory hole between start_pfn and end_pfn for example. This is probably
not true of the CMA caller or the existing compaction caller but it is
still unnecessarily fragile.

>  
>  	trace_mm_compaction_isolate_freepages(nr_scanned, total_isolated);
>  	return total_isolated;
> +
> +cleanup:
> +	/*
> +	 * Undo what we have done so far, and return.  We know all pages from
> +	 * [start_pfn, pfn) are free because we have just freed them.  If one of
> +	 * the page in the range was not freed, we would end up here earlier.
> +	 */
> +	for (; start_pfn < pfn; ++start_pfn)
> +		__free_page(pfn_to_page(start_pfn));
> +	return 0;

This returns 0 even though you could have isolated some pages.

Overall, there would have been less contortion if you
implemented isolate_freepages_range() in terms of the existing
isolate_freepages_block. Something that looked kinda like this not
compiled untested illustration.

static unsigned long isolate_freepages_range(struct zone *zone,
                        unsigned long start_pfn, unsigned long end_pfn,
                        struct list_head *list)
{
        unsigned long pfn = start_pfn;
        unsigned long isolated;

        for (pfn = start_pfn; pfn < end_pfn; pfn += pageblock_nr_pages) {
                if (!pfn_valid(pfn))
                        continue;

                isolated += isolate_freepages_block(zone, pfn, list);
        }

        return isolated;
}

I neglected to update isolate_freepages_block() to take the end_pfn
meaning it will in fact isolate too much but that would be easy to
address. It would be up to yourself whether to shuffle the tracepoint
around because with this example it will be triggered once per
pageblock. You'd still need the cleanup code and freelist check in
isolate_freepages_block() of course.

The point is that it would minimise the disruption to the existing
code and easier to get details like pfn_valid() right without odd
contortions, more gotos than should be necessary and trying to keep
pfn and page straight.

>  }
>  
>  /* Returns true if the page is within a block suitable for migration to */
> @@ -135,7 +166,7 @@ static void isolate_freepages(struct zone *zone,
>  				struct compact_control *cc)
>  {
>  	struct page *page;
> -	unsigned long high_pfn, low_pfn, pfn;
> +	unsigned long high_pfn, low_pfn, pfn, zone_end_pfn, end_pfn;
>  	unsigned long flags;
>  	int nr_freepages = cc->nr_freepages;
>  	struct list_head *freelist = &cc->freepages;
> @@ -155,6 +186,8 @@ static void isolate_freepages(struct zone *zone,
>  	 */
>  	high_pfn = min(low_pfn, pfn);
>  
> +	zone_end_pfn = zone->zone_start_pfn + zone->spanned_pages;
> +
>  	/*
>  	 * Isolate free pages until enough are available to migrate the
>  	 * pages on cc->migratepages. We stop searching if the migrate
> @@ -191,7 +224,9 @@ static void isolate_freepages(struct zone *zone,
>  		isolated = 0;
>  		spin_lock_irqsave(&zone->lock, flags);
>  		if (suitable_migration_target(page)) {
> -			isolated = isolate_freepages_block(zone, pfn, freelist);
> +			end_pfn = min(pfn + pageblock_nr_pages, zone_end_pfn);
> +			isolated = isolate_freepages_range(zone, pfn,
> +					end_pfn, freelist);
>  			nr_freepages += isolated;
>  		}
>  		spin_unlock_irqrestore(&zone->lock, flags);
> @@ -250,31 +285,34 @@ typedef enum {
>  	ISOLATE_SUCCESS,	/* Pages isolated, migrate */
>  } isolate_migrate_t;
>  
> -/*
> - * Isolate all pages that can be migrated from the block pointed to by
> - * the migrate scanner within compact_control.
> +/**
> + * isolate_migratepages_range() - isolate all migrate-able pages in range.
> + * @zone:	Zone pages are in.
> + * @cc:		Compaction control structure.
> + * @low_pfn:	The first PFN of the range.
> + * @end_pfn:	The one-past-the-last PFN of the range.
> + *
> + * Isolate all pages that can be migrated from the range specified by
> + * [low_pfn, end_pfn).  Returns zero if there is a fatal signal
> + * pending), otherwise PFN of the first page that was not scanned
> + * (which may be both less, equal to or more then end_pfn).
> + *
> + * Assumes that cc->migratepages is empty and cc->nr_migratepages is
> + * zero.
> + *
> + * Other then cc->migratepages and cc->nr_migratetypes this function
> + * does not modify any cc's fields, ie. it does not modify (or read
> + * for that matter) cc->migrate_pfn.
>   */
> -static isolate_migrate_t isolate_migratepages(struct zone *zone,
> -					struct compact_control *cc)
> +static unsigned long
> +isolate_migratepages_range(struct zone *zone, struct compact_control *cc,
> +			   unsigned long low_pfn, unsigned long end_pfn)
>  {
> -	unsigned long low_pfn, end_pfn;
>  	unsigned long last_pageblock_nr = 0, pageblock_nr;
>  	unsigned long nr_scanned = 0, nr_isolated = 0;
>  	struct list_head *migratelist = &cc->migratepages;
>  	isolate_mode_t mode = ISOLATE_ACTIVE|ISOLATE_INACTIVE;
>  
> -	/* Do not scan outside zone boundaries */
> -	low_pfn = max(cc->migrate_pfn, zone->zone_start_pfn);
> -
> -	/* Only scan within a pageblock boundary */
> -	end_pfn = ALIGN(low_pfn + pageblock_nr_pages, pageblock_nr_pages);
> -
> -	/* Do not cross the free scanner or scan within a memory hole */
> -	if (end_pfn > cc->free_pfn || !pfn_valid(low_pfn)) {
> -		cc->migrate_pfn = end_pfn;
> -		return ISOLATE_NONE;
> -	}
> -
>  	/*
>  	 * Ensure that there are not too many pages isolated from the LRU
>  	 * list by either parallel reclaimers or compaction. If there are,
> @@ -283,12 +321,12 @@ static isolate_migrate_t isolate_migratepages(struct zone *zone,
>  	while (unlikely(too_many_isolated(zone))) {
>  		/* async migration should just abort */
>  		if (!cc->sync)
> -			return ISOLATE_ABORT;
> +			return 0;
>  
>  		congestion_wait(BLK_RW_ASYNC, HZ/10);
>  
>  		if (fatal_signal_pending(current))
> -			return ISOLATE_ABORT;
> +			return 0;
>  	}
>  
>  	/* Time to isolate some pages for migration */
> @@ -365,17 +403,49 @@ static isolate_migrate_t isolate_migratepages(struct zone *zone,
>  		nr_isolated++;
>  
>  		/* Avoid isolating too much */
> -		if (cc->nr_migratepages == COMPACT_CLUSTER_MAX)
> +		if (cc->nr_migratepages == COMPACT_CLUSTER_MAX) {
> +			++low_pfn;
>  			break;
> +		}
>  	}

This minor optimisation is unrelated to the implementation of
isolate_migratepages_range() and should be in its own patch.

Otherwise nothing jumped out.

>  
>  	acct_isolated(zone, cc);
>  
>  	spin_unlock_irq(&zone->lru_lock);
> -	cc->migrate_pfn = low_pfn;
>  
>  	trace_mm_compaction_isolate_migratepages(nr_scanned, nr_isolated);
>  
> +	return low_pfn;
> +}
> +
> +/*
> + * Isolate all pages that can be migrated from the block pointed to by
> + * the migrate scanner within compact_control.
> + */
> +static isolate_migrate_t isolate_migratepages(struct zone *zone,
> +					struct compact_control *cc)
> +{
> +	unsigned long low_pfn, end_pfn;
> +
> +	/* Do not scan outside zone boundaries */
> +	low_pfn = max(cc->migrate_pfn, zone->zone_start_pfn);
> +
> +	/* Only scan within a pageblock boundary */
> +	end_pfn = ALIGN(low_pfn + pageblock_nr_pages, pageblock_nr_pages);
> +
> +	/* Do not cross the free scanner or scan within a memory hole */
> +	if (end_pfn > cc->free_pfn || !pfn_valid(low_pfn)) {
> +		cc->migrate_pfn = end_pfn;
> +		return ISOLATE_NONE;
> +	}
> +
> +	/* Perform the isolation */
> +	low_pfn = isolate_migratepages_range(zone, cc, low_pfn, end_pfn);
> +	if (!low_pfn)
> +		return ISOLATE_ABORT;
> +
> +	cc->migrate_pfn = low_pfn;
> +
>  	return ISOLATE_SUCCESS;
>  }
>  

-- 
Mel Gorman
SUSE Labs

^ permalink raw reply	[flat|nested] 33+ messages in thread

* Re: [PATCH 04/11] mm: page_alloc: introduce alloc_contig_range()
  2011-12-29 12:39 ` [PATCH 04/11] mm: page_alloc: introduce alloc_contig_range() Marek Szyprowski
@ 2012-01-10 14:16   ` Mel Gorman
  2012-01-13 20:04     ` Michal Nazarewicz
  2012-01-17 21:54   ` [Linaro-mm-sig] " sandeep patil
  1 sibling, 1 reply; 33+ messages in thread
From: Mel Gorman @ 2012-01-10 14:16 UTC (permalink / raw)
  To: Marek Szyprowski
  Cc: linux-kernel, linux-arm-kernel, linux-media, linux-mm,
	linaro-mm-sig, Michal Nazarewicz, Kyungmin Park, Russell King,
	Andrew Morton, KAMEZAWA Hiroyuki, Daniel Walker, Arnd Bergmann,
	Jesse Barker, Jonathan Corbet, Shariq Hasnain, Chunsang Jeong,
	Dave Hansen, Benjamin Gaignard

On Thu, Dec 29, 2011 at 01:39:05PM +0100, Marek Szyprowski wrote:
> From: Michal Nazarewicz <mina86@mina86.com>
> 
> This commit adds the alloc_contig_range() function which tries
> to allocate given range of pages.  It tries to migrate all
> already allocated pages that fall in the range thus freeing them.
> Once all pages in the range are freed they are removed from the
> buddy system thus allocated for the caller to use.
> 
> __alloc_contig_migrate_range() borrows some code from KAMEZAWA
> Hiroyuki's __alloc_contig_pages().
> 
> Signed-off-by: Michal Nazarewicz <mina86@mina86.com>
> Signed-off-by: Marek Szyprowski <m.szyprowski@samsung.com>
> ---
>  include/linux/page-isolation.h |    3 +
>  mm/page_alloc.c                |  190 ++++++++++++++++++++++++++++++++++++++++
>  2 files changed, 193 insertions(+), 0 deletions(-)
> 
> diff --git a/include/linux/page-isolation.h b/include/linux/page-isolation.h
> index 051c1b1..d305080 100644
> --- a/include/linux/page-isolation.h
> +++ b/include/linux/page-isolation.h
> @@ -33,5 +33,8 @@ test_pages_isolated(unsigned long start_pfn, unsigned long end_pfn);
>  extern int set_migratetype_isolate(struct page *page);
>  extern void unset_migratetype_isolate(struct page *page);
>  
> +/* The below functions must be run on a range from a single zone. */
> +int alloc_contig_range(unsigned long start, unsigned long end);
> +void free_contig_range(unsigned long pfn, unsigned nr_pages);
>  
>  #endif
> diff --git a/mm/page_alloc.c b/mm/page_alloc.c
> index f88b320..47b0a85 100644
> --- a/mm/page_alloc.c
> +++ b/mm/page_alloc.c
> @@ -57,6 +57,7 @@
>  #include <linux/ftrace_event.h>
>  #include <linux/memcontrol.h>
>  #include <linux/prefetch.h>
> +#include <linux/migrate.h>
>  
>  #include <asm/tlbflush.h>
>  #include <asm/div64.h>
> @@ -5711,6 +5712,195 @@ out:
>  	spin_unlock_irqrestore(&zone->lock, flags);
>  }
>  
> +static unsigned long pfn_align_to_maxpage_down(unsigned long pfn)
> +{
> +	return pfn & ~(MAX_ORDER_NR_PAGES - 1);
> +}
> +
> +static unsigned long pfn_align_to_maxpage_up(unsigned long pfn)
> +{
> +	return ALIGN(pfn, MAX_ORDER_NR_PAGES);
> +}
> +
> +static struct page *
> +__cma_migrate_alloc(struct page *page, unsigned long private, int **resultp)
> +{
> +	return alloc_page(GFP_HIGHUSER_MOVABLE);
> +}
> +
> +static int __alloc_contig_migrate_range(unsigned long start, unsigned long end)
> +{

This is compiled in even if !CONFIG_CMA

> +	/* This function is based on compact_zone() from compaction.c. */
> +
> +	unsigned long pfn = start;
> +	int ret = -EBUSY;
> +	unsigned tries = 0;
> +
> +	struct compact_control cc = {
> +		.nr_migratepages = 0,
> +		.order = -1,
> +		.zone = page_zone(pfn_to_page(start)),
> +		.sync = true,
> +	};

Handle the case where start and end PFNs are in different zones. It
should never happen but it should be caught, warned about and an
error returned because someone will eventually get it wrong.

> +	INIT_LIST_HEAD(&cc.migratepages);
> +
> +	migrate_prep_local();
> +
> +	while (pfn < end || cc.nr_migratepages) {
> +		/* Abort on signal */
> +		if (fatal_signal_pending(current)) {
> +			ret = -EINTR;
> +			goto done;
> +		}
> +
> +		/* Get some pages to migrate. */
> +		if (list_empty(&cc.migratepages)) {
> +			cc.nr_migratepages = 0;
> +			pfn = isolate_migratepages_range(cc.zone, &cc,
> +							 pfn, end);
> +			if (!pfn) {
> +				ret = -EINTR;
> +				goto done;
> +			}
> +			tries = 0;
> +		}
> +
> +		/* Try to migrate. */
> +		ret = migrate_pages(&cc.migratepages, __cma_migrate_alloc,
> +				    0, false, true);
> +
> +		/* Migrated all of them? Great! */
> +		if (list_empty(&cc.migratepages))
> +			continue;
> +
> +		/* Try five times. */
> +		if (++tries == 5) {
> +			ret = ret < 0 ? ret : -EBUSY;
> +			goto done;
> +		}
> +
> +		/* Before each time drain everything and reschedule. */
> +		lru_add_drain_all();
> +		drain_all_pages();

Why drain everything on each migration failure? I do not see how it
would help.

> +		cond_resched();

The cond_resched() should be outside the failure path if it exists at
all.

> +	}
> +	ret = 0;
> +
> +done:
> +	/* Make sure all pages are isolated. */
> +	if (!ret) {
> +		lru_add_drain_all();
> +		drain_all_pages();
> +		if (WARN_ON(test_pages_isolated(start, end)))
> +			ret = -EBUSY;
> +	}

Another global IPI seems overkill. Drain pages only from the local CPU
(drain_pages(get_cpu()); put_cpu()) and test if the pages are isolated.
Then and only then do a global drain before trying again, warning and
retyrning -EBUSY. I expect the common case is that a global drain is
unneccessary.

> +
> +	/* Release pages */
> +	putback_lru_pages(&cc.migratepages);
> +
> +	return ret;
> +}
> +
> +/**
> + * alloc_contig_range() -- tries to allocate given range of pages
> + * @start:	start PFN to allocate
> + * @end:	one-past-the-last PFN to allocate
> + *
> + * The PFN range does not have to be pageblock or MAX_ORDER_NR_PAGES
> + * aligned, hovewer it's callers responsibility to guarantee that we

s/hovewer/however/

s/it's/it's the'

> + * are the only thread that changes migrate type of pageblocks the
> + * pages fall in.
> + *
> + * Returns zero on success or negative error code.  On success all
> + * pages which PFN is in (start, end) are allocated for the caller and
> + * need to be freed with free_contig_range().
> + */
> +int alloc_contig_range(unsigned long start, unsigned long end)
> +{
> +	unsigned long outer_start, outer_end;
> +	int ret;
> +
> +	/*
> +	 * What we do here is we mark all pageblocks in range as
> +	 * MIGRATE_ISOLATE.  Because of the way page allocator work, we
> +	 * align the range to MAX_ORDER pages so that page allocator
> +	 * won't try to merge buddies from different pageblocks and
> +	 * change MIGRATE_ISOLATE to some other migration type.
> +	 *
> +	 * Once the pageblocks are marked as MIGRATE_ISOLATE, we
> +	 * migrate the pages from an unaligned range (ie. pages that
> +	 * we are interested in).  This will put all the pages in
> +	 * range back to page allocator as MIGRATE_ISOLATE.
> +	 *
> +	 * When this is done, we take the pages in range from page
> +	 * allocator removing them from the buddy system.  This way
> +	 * page allocator will never consider using them.
> +	 *
> +	 * This lets us mark the pageblocks back as
> +	 * MIGRATE_CMA/MIGRATE_MOVABLE so that free pages in the
> +	 * MAX_ORDER aligned range but not in the unaligned, original
> +	 * range are put back to page allocator so that buddy can use
> +	 * them.
> +	 */
> +
> +	ret = start_isolate_page_range(pfn_align_to_maxpage_down(start),
> +				       pfn_align_to_maxpage_up(end));
> +	if (ret)
> +		goto done;
> +
> +	ret = __alloc_contig_migrate_range(start, end);
> +	if (ret)
> +		goto done;
> +
> +	/*
> +	 * Pages from [start, end) are within a MAX_ORDER_NR_PAGES
> +	 * aligned blocks that are marked as MIGRATE_ISOLATE.  What's
> +	 * more, all pages in [start, end) are free in page allocator.
> +	 * What we are going to do is to allocate all pages from
> +	 * [start, end) (that is remove them from page allocater).
> +	 *
> +	 * The only problem is that pages at the beginning and at the
> +	 * end of interesting range may be not aligned with pages that
> +	 * page allocator holds, ie. they can be part of higher order
> +	 * pages.  Because of this, we reserve the bigger range and
> +	 * once this is done free the pages we are not interested in.
> +	 */
> +
> +	ret = 0;
> +	while (!PageBuddy(pfn_to_page(start & (~0UL << ret))))
> +		if (WARN_ON(++ret >= MAX_ORDER)) {
> +			ret = -EINVAL;
> +			goto done;
> +		}
> +
> +	outer_start = start & (~0UL << ret);
> +	outer_end = isolate_freepages_range(page_zone(pfn_to_page(outer_start)),
> +					    outer_start, end, NULL);
> +	if (!outer_end) {
> +		ret = -EBUSY;
> +		goto done;
> +	}
> +	outer_end += outer_start;
> +
> +	/* Free head and tail (if any) */
> +	if (start != outer_start)
> +		free_contig_range(outer_start, start - outer_start);
> +	if (end != outer_end)
> +		free_contig_range(end, outer_end - end);
> +
> +	ret = 0;
> +done:
> +	undo_isolate_page_range(pfn_align_to_maxpage_down(start),
> +				pfn_align_to_maxpage_up(end));
> +	return ret;
> +}
> +
> +void free_contig_range(unsigned long pfn, unsigned nr_pages)
> +{
> +	for (; nr_pages--; ++pfn)
> +		__free_page(pfn_to_page(pfn));
> +}
> +
>  #ifdef CONFIG_MEMORY_HOTREMOVE
>  /*
>   * All pages in the range must be isolated before calling this.
> -- 
> 1.7.1.569.g6f426
> 

-- 
Mel Gorman
SUSE Labs

^ permalink raw reply	[flat|nested] 33+ messages in thread

* Re: [PATCH 05/11] mm: mmzone: MIGRATE_CMA migration type added
  2011-12-29 12:39 ` [PATCH 05/11] mm: mmzone: MIGRATE_CMA migration type added Marek Szyprowski
@ 2012-01-10 14:38   ` Mel Gorman
  2012-01-10 15:04     ` Michal Nazarewicz
  0 siblings, 1 reply; 33+ messages in thread
From: Mel Gorman @ 2012-01-10 14:38 UTC (permalink / raw)
  To: Marek Szyprowski
  Cc: linux-kernel, linux-arm-kernel, linux-media, linux-mm,
	linaro-mm-sig, Michal Nazarewicz, Kyungmin Park, Russell King,
	Andrew Morton, KAMEZAWA Hiroyuki, Daniel Walker, Arnd Bergmann,
	Jesse Barker, Jonathan Corbet, Shariq Hasnain, Chunsang Jeong,
	Dave Hansen, Benjamin Gaignard

On Thu, Dec 29, 2011 at 01:39:06PM +0100, Marek Szyprowski wrote:
> From: Michal Nazarewicz <mina86@mina86.com>
> 
> The MIGRATE_CMA migration type has two main characteristics:
> (i) only movable pages can be allocated from MIGRATE_CMA
> pageblocks and (ii) page allocator will never change migration
> type of MIGRATE_CMA pageblocks.
> 
> This guarantees (to some degree) that page in a MIGRATE_CMA page
> block can always be migrated somewhere else (unless there's no
> memory left in the system).
> 
> It is designed to be used for allocating big chunks (eg. 10MiB)
> of physically contiguous memory.  Once driver requests
> contiguous memory, pages from MIGRATE_CMA pageblocks may be
> migrated away to create a contiguous block.
> 
> To minimise number of migrations, MIGRATE_CMA migration type
> is the last type tried when page allocator falls back to other
> migration types then requested.
> 
> Signed-off-by: Michal Nazarewicz <mina86@mina86.com>
> [m.szyprowski: removed CONFIG_CMA_MIGRATE_TYPE]
> Signed-off-by: Marek Szyprowski <m.szyprowski@samsung.com>
> Signed-off-by: Kyungmin Park <kyungmin.park@samsung.com>
> ---
>  include/linux/mmzone.h         |   41 ++++++++++++++++++++----
>  include/linux/page-isolation.h |    3 ++
>  mm/Kconfig                     |    2 +-
>  mm/compaction.c                |   11 +++++--
>  mm/page_alloc.c                |   68 ++++++++++++++++++++++++++++++---------
>  mm/vmstat.c                    |    1 +
>  6 files changed, 99 insertions(+), 27 deletions(-)
> 
> diff --git a/include/linux/mmzone.h b/include/linux/mmzone.h
> index 188cb2f..e38b85d 100644
> --- a/include/linux/mmzone.h
> +++ b/include/linux/mmzone.h
> @@ -35,13 +35,35 @@
>   */
>  #define PAGE_ALLOC_COSTLY_ORDER 3
>  
> -#define MIGRATE_UNMOVABLE     0
> -#define MIGRATE_RECLAIMABLE   1
> -#define MIGRATE_MOVABLE       2
> -#define MIGRATE_PCPTYPES      3 /* the number of types on the pcp lists */
> -#define MIGRATE_RESERVE       3
> -#define MIGRATE_ISOLATE       4 /* can't allocate from here */
> -#define MIGRATE_TYPES         5
> +enum {
> +	MIGRATE_UNMOVABLE,
> +	MIGRATE_RECLAIMABLE,
> +	MIGRATE_MOVABLE,
> +	MIGRATE_PCPTYPES,	/* the number of types on the pcp lists */
> +	MIGRATE_RESERVE = MIGRATE_PCPTYPES,
> +	/*
> +	 * MIGRATE_CMA migration type is designed to mimic the way
> +	 * ZONE_MOVABLE works.  Only movable pages can be allocated
> +	 * from MIGRATE_CMA pageblocks and page allocator never
> +	 * implicitly change migration type of MIGRATE_CMA pageblock.
> +	 *
> +	 * The way to use it is to change migratetype of a range of
> +	 * pageblocks to MIGRATE_CMA which can be done by
> +	 * __free_pageblock_cma() function.  What is important though
> +	 * is that a range of pageblocks must be aligned to
> +	 * MAX_ORDER_NR_PAGES should biggest page be bigger then
> +	 * a single pageblock.
> +	 */
> +	MIGRATE_CMA,
> +	MIGRATE_ISOLATE,	/* can't allocate from here */
> +	MIGRATE_TYPES
> +};

MIGRATE_CMA is being added whether or not CONFIG_CMA is set. This
increases the size of the pageblock bitmap and where that is just 1 bit
per pageblock in the system, it may be noticable on large machines.

> +
> +#ifdef CONFIG_CMA
> +#  define is_migrate_cma(migratetype) unlikely((migratetype) == MIGRATE_CMA)
> +#else
> +#  define is_migrate_cma(migratetype) false
> +#endif
>  

Use static inlines.

>  #define for_each_migratetype_order(order, type) \
>  	for (order = 0; order < MAX_ORDER; order++) \
> @@ -54,6 +76,11 @@ static inline int get_pageblock_migratetype(struct page *page)
>  	return get_pageblock_flags_group(page, PB_migrate, PB_migrate_end);
>  }
>  
> +static inline bool is_pageblock_cma(struct page *page)
> +{
> +	return is_migrate_cma(get_pageblock_migratetype(page));
> +}
> +

This allows additional calls to get_pageblock_migratetype() even if
CONFIG_CMA is not set.

>  struct free_area {
>  	struct list_head	free_list[MIGRATE_TYPES];
>  	unsigned long		nr_free;
> diff --git a/include/linux/page-isolation.h b/include/linux/page-isolation.h
> index d305080..af650db 100644
> --- a/include/linux/page-isolation.h
> +++ b/include/linux/page-isolation.h
> @@ -37,4 +37,7 @@ extern void unset_migratetype_isolate(struct page *page);
>  int alloc_contig_range(unsigned long start, unsigned long end);
>  void free_contig_range(unsigned long pfn, unsigned nr_pages);
>  
> +/* CMA stuff */
> +extern void init_cma_reserved_pageblock(struct page *page);
> +
>  #endif
> diff --git a/mm/Kconfig b/mm/Kconfig
> index 011b110..e080cac 100644
> --- a/mm/Kconfig
> +++ b/mm/Kconfig
> @@ -192,7 +192,7 @@ config COMPACTION
>  config MIGRATION
>  	bool "Page migration"
>  	def_bool y
> -	depends on NUMA || ARCH_ENABLE_MEMORY_HOTREMOVE || COMPACTION
> +	depends on NUMA || ARCH_ENABLE_MEMORY_HOTREMOVE || COMPACTION || CMA
>  	help
>  	  Allows the migration of the physical location of pages of processes
>  	  while the virtual addresses are not changed. This is useful in
> diff --git a/mm/compaction.c b/mm/compaction.c
> index 8733441..46783b4 100644
> --- a/mm/compaction.c
> +++ b/mm/compaction.c
> @@ -21,6 +21,11 @@
>  #define CREATE_TRACE_POINTS
>  #include <trace/events/compaction.h>
>  
> +static inline bool is_migrate_cma_or_movable(int migratetype)
> +{
> +	return is_migrate_cma(migratetype) || migratetype == MIGRATE_MOVABLE;
> +}
> +

That is not a name that helps any. migrate_async_suitable() would be
marginally better.

>  /**
>   * isolate_freepages_range() - isolate free pages, must hold zone->lock.
>   * @zone:	Zone pages are in.
> @@ -213,7 +218,7 @@ isolate_migratepages_range(struct zone *zone, struct compact_control *cc,
>  		 */
>  		pageblock_nr = low_pfn >> pageblock_order;
>  		if (!cc->sync && last_pageblock_nr != pageblock_nr &&
> -				get_pageblock_migratetype(page) != MIGRATE_MOVABLE) {
> +		    is_migrate_cma_or_movable(get_pageblock_migratetype(page))) {
>  			low_pfn += pageblock_nr_pages;
>  			low_pfn = ALIGN(low_pfn, pageblock_nr_pages) - 1;
>  			last_pageblock_nr = pageblock_nr;

I know I suggested migrate_async_suitable() here but the check may
not even happen if CMA uses sync migration.

> @@ -295,8 +300,8 @@ static bool suitable_migration_target(struct page *page)
>  	if (PageBuddy(page) && page_order(page) >= pageblock_order)
>  		return true;
>  
> -	/* If the block is MIGRATE_MOVABLE, allow migration */
> -	if (migratetype == MIGRATE_MOVABLE)
> +	/* If the block is MIGRATE_MOVABLE or MIGRATE_CMA, allow migration */
> +	if (is_migrate_cma_or_movable(migratetype))
>  		return true;
>  
>  	/* Otherwise skip the block */
> diff --git a/mm/page_alloc.c b/mm/page_alloc.c
> index 47b0a85..06a7861 100644
> --- a/mm/page_alloc.c
> +++ b/mm/page_alloc.c
> @@ -722,6 +722,26 @@ void __meminit __free_pages_bootmem(struct page *page, unsigned int order)
>  	}
>  }
>  
> +#ifdef CONFIG_CMA
> +/*
> + * Free whole pageblock and set it's migration type to MIGRATE_CMA.
> + */
> +void __init init_cma_reserved_pageblock(struct page *page)
> +{
> +	unsigned i = pageblock_nr_pages;
> +	struct page *p = page;
> +
> +	do {
> +		__ClearPageReserved(p);
> +		set_page_count(p, 0);
> +	} while (++p, --i);
> +
> +	set_page_refcounted(page);
> +	set_pageblock_migratetype(page, MIGRATE_CMA);
> +	__free_pages(page, pageblock_order);
> +	totalram_pages += pageblock_nr_pages;
> +}
> +#endif
>  
>  /*
>   * The order of subdivision here is critical for the IO subsystem.
> @@ -830,11 +850,10 @@ struct page *__rmqueue_smallest(struct zone *zone, unsigned int order,
>   * This array describes the order lists are fallen back to when
>   * the free lists for the desirable migrate type are depleted
>   */
> -static int fallbacks[MIGRATE_TYPES][MIGRATE_TYPES-1] = {
> +static int fallbacks[MIGRATE_PCPTYPES][4] = {
>  	[MIGRATE_UNMOVABLE]   = { MIGRATE_RECLAIMABLE, MIGRATE_MOVABLE,   MIGRATE_RESERVE },
>  	[MIGRATE_RECLAIMABLE] = { MIGRATE_UNMOVABLE,   MIGRATE_MOVABLE,   MIGRATE_RESERVE },
> -	[MIGRATE_MOVABLE]     = { MIGRATE_RECLAIMABLE, MIGRATE_UNMOVABLE, MIGRATE_RESERVE },
> -	[MIGRATE_RESERVE]     = { MIGRATE_RESERVE,     MIGRATE_RESERVE,   MIGRATE_RESERVE }, /* Never used */
> +	[MIGRATE_MOVABLE]     = { MIGRATE_RECLAIMABLE, MIGRATE_UNMOVABLE, MIGRATE_CMA    , MIGRATE_RESERVE },

Why did you delete [MIGRATE_RESERVE] here? It changes the array
from being expressed in terms of MIGRATE_TYPES to being a hard-coded
value. I do not see the advantage and it's not clear how it is related
to the patch.

>  };
>  
>  /*
> @@ -929,12 +948,12 @@ __rmqueue_fallback(struct zone *zone, int order, int start_migratetype)
>  	/* Find the largest possible block of pages in the other list */
>  	for (current_order = MAX_ORDER-1; current_order >= order;
>  						--current_order) {
> -		for (i = 0; i < MIGRATE_TYPES - 1; i++) {
> +		for (i = 0; i < ARRAY_SIZE(fallbacks[0]); i++) {
>  			migratetype = fallbacks[start_migratetype][i];
>  
>  			/* MIGRATE_RESERVE handled later if necessary */
>  			if (migratetype == MIGRATE_RESERVE)
> -				continue;
> +				break;
>  
>  			area = &(zone->free_area[current_order]);
>  			if (list_empty(&area->free_list[migratetype]))
> @@ -949,11 +968,18 @@ __rmqueue_fallback(struct zone *zone, int order, int start_migratetype)
>  			 * pages to the preferred allocation list. If falling
>  			 * back for a reclaimable kernel allocation, be more
>  			 * aggressive about taking ownership of free pages
> +			 *
> +			 * On the other hand, never change migration
> +			 * type of MIGRATE_CMA pageblocks nor move CMA
> +			 * pages on different free lists. We don't
> +			 * want unmovable pages to be allocated from
> +			 * MIGRATE_CMA areas.
>  			 */
> -			if (unlikely(current_order >= (pageblock_order >> 1)) ||
> -					start_migratetype == MIGRATE_RECLAIMABLE ||
> -					page_group_by_mobility_disabled) {
> -				unsigned long pages;
> +			if (!is_pageblock_cma(page) &&
> +			    (unlikely(current_order >= pageblock_order / 2) ||
> +			     start_migratetype == MIGRATE_RECLAIMABLE ||
> +			     page_group_by_mobility_disabled)) {
> +				int pages;
>  				pages = move_freepages_block(zone, page,
>  								start_migratetype);
>  
> @@ -971,11 +997,14 @@ __rmqueue_fallback(struct zone *zone, int order, int start_migratetype)
>  			rmv_page_order(page);
>  
>  			/* Take ownership for orders >= pageblock_order */
> -			if (current_order >= pageblock_order)
> +			if (current_order >= pageblock_order &&
> +			    !is_pageblock_cma(page))
>  				change_pageblock_range(page, current_order,
>  							start_migratetype);
>  
> -			expand(zone, page, order, current_order, area, migratetype);
> +			expand(zone, page, order, current_order, area,
> +			       is_migrate_cma(start_migratetype)
> +			     ? start_migratetype : migratetype);
>  
>  			trace_mm_page_alloc_extfrag(page, order, current_order,
>  				start_migratetype, migratetype);
> @@ -1047,7 +1076,10 @@ static int rmqueue_bulk(struct zone *zone, unsigned int order,
>  			list_add(&page->lru, list);
>  		else
>  			list_add_tail(&page->lru, list);
> -		set_page_private(page, migratetype);
> +		if (is_pageblock_cma(page))
> +			set_page_private(page, MIGRATE_CMA);
> +		else
> +			set_page_private(page, migratetype);
>  		list = &page->lru;
>  	}
>  	__mod_zone_page_state(zone, NR_FREE_PAGES, -(i << order));
> @@ -1308,8 +1340,12 @@ int split_free_page(struct page *page)
>  
>  	if (order >= pageblock_order - 1) {
>  		struct page *endpage = page + (1 << order) - 1;
> -		for (; page < endpage; page += pageblock_nr_pages)
> -			set_pageblock_migratetype(page, MIGRATE_MOVABLE);
> +		for (; page < endpage; page += pageblock_nr_pages) {
> +			int mt = get_pageblock_migratetype(page);
> +			if (mt != MIGRATE_ISOLATE && !is_migrate_cma(mt))
> +				set_pageblock_migratetype(page,
> +							  MIGRATE_MOVABLE);
> +		}
>  	}
>  
>  	return 1 << order;
> @@ -5599,8 +5635,8 @@ __count_immobile_pages(struct zone *zone, struct page *page, int count)
>  	 */
>  	if (zone_idx(zone) == ZONE_MOVABLE)
>  		return true;
> -
> -	if (get_pageblock_migratetype(page) == MIGRATE_MOVABLE)
> +	if (get_pageblock_migratetype(page) == MIGRATE_MOVABLE ||
> +	    is_pageblock_cma(page))
>  		return true;
>  
>  	pfn = page_to_pfn(page);
> diff --git a/mm/vmstat.c b/mm/vmstat.c
> index 8fd603b..9fb1789 100644
> --- a/mm/vmstat.c
> +++ b/mm/vmstat.c
> @@ -613,6 +613,7 @@ static char * const migratetype_names[MIGRATE_TYPES] = {
>  	"Reclaimable",
>  	"Movable",
>  	"Reserve",
> +	"Cma",
>  	"Isolate",
>  };
>  
> -- 
> 1.7.1.569.g6f426
> 

-- 
Mel Gorman
SUSE Labs

^ permalink raw reply	[flat|nested] 33+ messages in thread

* Re: [PATCH 07/11] mm: add optional memory reclaim in split_free_page()
  2011-12-29 12:39 ` [PATCH 07/11] mm: add optional memory reclaim in split_free_page() Marek Szyprowski
@ 2012-01-10 14:45   ` Mel Gorman
  0 siblings, 0 replies; 33+ messages in thread
From: Mel Gorman @ 2012-01-10 14:45 UTC (permalink / raw)
  To: Marek Szyprowski
  Cc: linux-kernel, linux-arm-kernel, linux-media, linux-mm,
	linaro-mm-sig, Michal Nazarewicz, Kyungmin Park, Russell King,
	Andrew Morton, KAMEZAWA Hiroyuki, Daniel Walker, Arnd Bergmann,
	Jesse Barker, Jonathan Corbet, Shariq Hasnain, Chunsang Jeong,
	Dave Hansen, Benjamin Gaignard

On Thu, Dec 29, 2011 at 01:39:08PM +0100, Marek Szyprowski wrote:
> split_free_page() function is used by migration code to grab a free page
> once the migration has been finished. This function must obey the same
> rules as memory allocation functions to keep the correct level of
> memory watermarks. Memory compaction code calls it under locks, so it
> cannot perform any *_slowpath style reclaim. However when this function
> is called by migration code for Contiguous Memory Allocator, the sleeping
> context is permitted and the function can make a real reclaim to ensure
> that the call succeeds.
> 
> The reclaim code is based on the __alloc_page_slowpath() function.
> 

This feels like it's at the wrong level entirely.

> Signed-off-by: Marek Szyprowski <m.szyprowski@samsung.com>
> Signed-off-by: Kyungmin Park <kyungmin.park@samsung.com>
> CC: Michal Nazarewicz <mina86@mina86.com>
> ---
>  include/linux/mm.h |    2 +-
>  mm/compaction.c    |    6 ++--
>  mm/internal.h      |    2 +-
>  mm/page_alloc.c    |   79 +++++++++++++++++++++++++++++++++++++++------------
>  4 files changed, 65 insertions(+), 24 deletions(-)
> 
> diff --git a/include/linux/mm.h b/include/linux/mm.h
> index 4baadd1..8b15b21 100644
> --- a/include/linux/mm.h
> +++ b/include/linux/mm.h
> @@ -450,7 +450,7 @@ void put_page(struct page *page);
>  void put_pages_list(struct list_head *pages);
>  
>  void split_page(struct page *page, unsigned int order);
> -int split_free_page(struct page *page);
> +int split_free_page(struct page *page, int force_reclaim);
>  
>  /*
>   * Compound pages have a destructor function.  Provide a
> diff --git a/mm/compaction.c b/mm/compaction.c
> index 46783b4..d6c5cb7 100644
> --- a/mm/compaction.c
> +++ b/mm/compaction.c
> @@ -46,7 +46,7 @@ static inline bool is_migrate_cma_or_movable(int migratetype)
>  unsigned long
>  isolate_freepages_range(struct zone *zone,
>  			unsigned long start_pfn, unsigned long end_pfn,
> -			struct list_head *freelist)
> +			struct list_head *freelist, int force_reclaim)

bool

but this is the first hint that something is wrong with the
layering. The function is about "isolating free pages" but this patch
is making it also about reclaim. If CMA cares about the watermarks,
it should be the one calling try_to_free_pages(), not wedging it into
the side of compaction.

>  {
>  	unsigned long nr_scanned = 0, total_isolated = 0;
>  	unsigned long pfn = start_pfn;
> @@ -67,7 +67,7 @@ isolate_freepages_range(struct zone *zone,
>  			goto next;
>  
>  		/* Found a free page, break it into order-0 pages */
> -		n = split_free_page(page);
> +		n = split_free_page(page, force_reclaim);
>  		total_isolated += n;
>  		if (freelist) {
>  			struct page *p = page;

Second hint - split_free_page() is not a name that indicates it should
have anything to do with reclaim.

> @@ -376,7 +376,7 @@ static void isolate_freepages(struct zone *zone,
>  		if (suitable_migration_target(page)) {
>  			end_pfn = min(pfn + pageblock_nr_pages, zone_end_pfn);
>  			isolated = isolate_freepages_range(zone, pfn,
> -					end_pfn, freelist);
> +					end_pfn, freelist, false);
>  			nr_freepages += isolated;
>  		}
>  		spin_unlock_irqrestore(&zone->lock, flags);
> diff --git a/mm/internal.h b/mm/internal.h
> index 4a226d8..8b80027 100644
> --- a/mm/internal.h
> +++ b/mm/internal.h
> @@ -129,7 +129,7 @@ struct compact_control {
>  unsigned long
>  isolate_freepages_range(struct zone *zone,
>  			unsigned long start, unsigned long end,
> -			struct list_head *freelist);
> +			struct list_head *freelist, int force_reclaim);
>  unsigned long
>  isolate_migratepages_range(struct zone *zone, struct compact_control *cc,
>  			   unsigned long low_pfn, unsigned long end_pfn);
> diff --git a/mm/page_alloc.c b/mm/page_alloc.c
> index 8b47c85..a3d5892 100644
> --- a/mm/page_alloc.c
> +++ b/mm/page_alloc.c
> @@ -1302,17 +1302,65 @@ void split_page(struct page *page, unsigned int order)
>  		set_page_refcounted(page + i);
>  }
>  
> +static inline
> +void wake_all_kswapd(unsigned int order, struct zonelist *zonelist,
> +						enum zone_type high_zoneidx,
> +						enum zone_type classzone_idx)
> +{
> +	struct zoneref *z;
> +	struct zone *zone;
> +
> +	for_each_zone_zonelist(zone, z, zonelist, high_zoneidx)
> +		wakeup_kswapd(zone, order, classzone_idx);
> +}
> +
> +/*
> + * Trigger memory pressure bump to reclaim at least 2^order of free pages.
> + * Does similar work as it is done in __alloc_pages_slowpath(). Used only
> + * by migration code to allocate contiguous memory range.
> + */
> +static int __force_memory_reclaim(int order, struct zone *zone)
> +{
> +	gfp_t gfp_mask = GFP_HIGHUSER_MOVABLE;
> +	enum zone_type high_zoneidx = gfp_zone(gfp_mask);
> +	struct zonelist *zonelist = node_zonelist(0, gfp_mask);
> +	struct reclaim_state reclaim_state;
> +	int did_some_progress = 0;
> +
> +	wake_all_kswapd(order, zonelist, high_zoneidx, zone_idx(zone));
> +
> +	/* We now go into synchronous reclaim */
> +	cpuset_memory_pressure_bump();
> +	current->flags |= PF_MEMALLOC;
> +	lockdep_set_current_reclaim_state(gfp_mask);
> +	reclaim_state.reclaimed_slab = 0;
> +	current->reclaim_state = &reclaim_state;
> +
> +	did_some_progress = try_to_free_pages(zonelist, order, gfp_mask, NULL);
> +
> +	current->reclaim_state = NULL;
> +	lockdep_clear_current_reclaim_state();
> +	current->flags &= ~PF_MEMALLOC;
> +
> +	cond_resched();
> +
> +	if (!did_some_progress) {
> +		/* Exhausted what can be done so it's blamo time */
> +		out_of_memory(zonelist, gfp_mask, order, NULL);
> +	}
> +	return order;
> +}

At the very least, __alloc_pages_direct_reclaim() should be sharing this
code.

> +
>  /*
>   * Similar to split_page except the page is already free. As this is only
>   * being used for migration, the migratetype of the block also changes.
> - * As this is called with interrupts disabled, the caller is responsible
> - * for calling arch_alloc_page() and kernel_map_page() after interrupts
> - * are enabled.
> + * The caller is responsible for calling arch_alloc_page() and kernel_map_page()
> + * after interrupts are enabled.
>   *
>   * Note: this is probably too low level an operation for use in drivers.
>   * Please consult with lkml before using this in your driver.
>   */
> -int split_free_page(struct page *page)
> +int split_free_page(struct page *page, int force_reclaim)
>  {
>  	unsigned int order;
>  	unsigned long watermark;
> @@ -1323,10 +1371,15 @@ int split_free_page(struct page *page)
>  	zone = page_zone(page);
>  	order = page_order(page);
>  
> +try_again:
>  	/* Obey watermarks as if the page was being allocated */
>  	watermark = low_wmark_pages(zone) + (1 << order);
> -	if (!zone_watermark_ok(zone, 0, watermark, 0, 0))
> -		return 0;
> +	if (!zone_watermark_ok(zone, 0, watermark, 0, 0)) {
> +		if (force_reclaim && __force_memory_reclaim(order, zone))
> +			goto try_again;
> +		else
> +			return 0;
> +	}
>  
>  	/* Remove page from free list */
>  	list_del(&page->lru);
> @@ -2086,18 +2139,6 @@ __alloc_pages_high_priority(gfp_t gfp_mask, unsigned int order,
>  	return page;
>  }
>  
> -static inline
> -void wake_all_kswapd(unsigned int order, struct zonelist *zonelist,
> -						enum zone_type high_zoneidx,
> -						enum zone_type classzone_idx)
> -{
> -	struct zoneref *z;
> -	struct zone *zone;
> -
> -	for_each_zone_zonelist(zone, z, zonelist, high_zoneidx)
> -		wakeup_kswapd(zone, order, classzone_idx);
> -}
> -
>  static inline int
>  gfp_to_alloc_flags(gfp_t gfp_mask)
>  {
> @@ -5917,7 +5958,7 @@ int alloc_contig_range(unsigned long start, unsigned long end,
>  
>  	outer_start = start & (~0UL << ret);
>  	outer_end = isolate_freepages_range(page_zone(pfn_to_page(outer_start)),
> -					    outer_start, end, NULL);
> +					    outer_start, end, NULL, true);
>  	if (!outer_end) {
>  		ret = -EBUSY;
>  		goto done;

I think it makes far more sense to have the logic related to calling
try_to_free_pages here in alloc_contig_range rather than trying to pass
it into compaction helper functions.

-- 
Mel Gorman
SUSE Labs

^ permalink raw reply	[flat|nested] 33+ messages in thread

* Re: [PATCH 02/11] mm: compaction: introduce isolate_{free,migrate}pages_range().
  2012-01-10 13:43   ` Mel Gorman
@ 2012-01-10 15:04     ` Michal Nazarewicz
  0 siblings, 0 replies; 33+ messages in thread
From: Michal Nazarewicz @ 2012-01-10 15:04 UTC (permalink / raw)
  To: Marek Szyprowski, Mel Gorman
  Cc: linux-kernel, linux-arm-kernel, linux-media, linux-mm,
	linaro-mm-sig, Kyungmin Park, Russell King, Andrew Morton,
	KAMEZAWA Hiroyuki, Daniel Walker, Arnd Bergmann, Jesse Barker,
	Jonathan Corbet, Shariq Hasnain, Chunsang Jeong, Dave Hansen,
	Benjamin Gaignard

On Tue, 10 Jan 2012 14:43:51 +0100, Mel Gorman <mel@csn.ul.ie> wrote:

> On Thu, Dec 29, 2011 at 01:39:03PM +0100, Marek Szyprowski wrote:
>> From: Michal Nazarewicz <mina86@mina86.com>

>> +static unsigned long
>> +isolate_freepages_range(struct zone *zone,
>> +			unsigned long start_pfn, unsigned long end_pfn,
>> +			struct list_head *freelist)
>>  {
>> -	unsigned long zone_end_pfn, end_pfn;
>> -	int nr_scanned = 0, total_isolated = 0;
>> -	struct page *cursor;
>> -
>> -	/* Get the last PFN we should scan for free pages at */
>> -	zone_end_pfn = zone->zone_start_pfn + zone->spanned_pages;
>> -	end_pfn = min(blockpfn + pageblock_nr_pages, zone_end_pfn);
>> +	unsigned long nr_scanned = 0, total_isolated = 0;
>> +	unsigned long pfn = start_pfn;
>> +	struct page *page;
>>
>> -	/* Find the first usable PFN in the block to initialse page cursor */
>> -	for (; blockpfn < end_pfn; blockpfn++) {
>> -		if (pfn_valid_within(blockpfn))
>> -			break;
>> -	}
>> -	cursor = pfn_to_page(blockpfn);
>> +	VM_BUG_ON(!pfn_valid(pfn));
>> +	page = pfn_to_page(pfn);
>
> The existing code is able to find the first usable PFN in a pageblock
> with pfn_valid_within(). It's allowed to do that because it knows
> the pageblock is valid so calling pfn_valid() is unnecessary.
>
> It is curious to change this to something that can sometimes BUG_ON
> !pfn_valid(pfn) instead of having a PFN walker that knows how to
> handle pfn_valid().

Actually, this walker seem redundant since one is already present in
isolate_freepages(), ie. if !pfn_valid(pfn) then the loop in
isolate_freepages() will “continue;” and the function will never get
called.

>> +cleanup:
>> +	/*
>> +	 * Undo what we have done so far, and return.  We know all pages from
>> +	 * [start_pfn, pfn) are free because we have just freed them.  If one of
>> +	 * the page in the range was not freed, we would end up here earlier.
>> +	 */
>> +	for (; start_pfn < pfn; ++start_pfn)
>> +		__free_page(pfn_to_page(start_pfn));
>> +	return 0;
>
> This returns 0 even though you could have isolated some pages.

The loop's intention is to “unisolate” (ie. free) anything that got
isolated, so at the end number of isolated pages in in fact zero.

> Overall, there would have been less contortion if you
> implemented isolate_freepages_range() in terms of the existing
> isolate_freepages_block. Something that looked kinda like this not
> compiled untested illustration.
>
> static unsigned long isolate_freepages_range(struct zone *zone,
>                         unsigned long start_pfn, unsigned long end_pfn,
>                         struct list_head *list)
> {
>         unsigned long pfn = start_pfn;
>         unsigned long isolated;
>
>         for (pfn = start_pfn; pfn < end_pfn; pfn += pageblock_nr_pages) {
>                 if (!pfn_valid(pfn))
>                         continue;
>                 isolated += isolate_freepages_block(zone, pfn, list);
>         }
>
>         return isolated;
> }
>
> I neglected to update isolate_freepages_block() to take the end_pfn
> meaning it will in fact isolate too much but that would be easy to
> address.

This is not a problem since my isolate_freepages_range() implementation
can go beyond end_pfn, anyway.

Of course, the function taking end_pfn is an optimisation since it does
not have to compute zone_end each time.

> It would be up to yourself whether to shuffle the tracepoint
> around because with this example it will be triggered once per
> pageblock. You'd still need the cleanup code and freelist check in
> isolate_freepages_block() of course.
>
> The point is that it would minimise the disruption to the existing
> code and easier to get details like pfn_valid() right without odd
> contortions, more gotos than should be necessary and trying to keep
> pfn and page straight.

Even though I'd personally go with my approach, I see merit in your point,
so will change.

>>  }
>>
>>  /* Returns true if the page is within a block suitable for migration to */

>> @@ -365,17 +403,49 @@ static isolate_migrate_t isolate_migratepages(struct zone *zone,
>>  		nr_isolated++;
>>
>>  		/* Avoid isolating too much */
>> -		if (cc->nr_migratepages == COMPACT_CLUSTER_MAX)
>> +		if (cc->nr_migratepages == COMPACT_CLUSTER_MAX) {
>> +			++low_pfn;
>>  			break;
>> +		}
>>  	}
>
> This minor optimisation is unrelated to the implementation of
> isolate_migratepages_range() and should be in its own patch.

It's actually not a mere minor optimisation, but rather making the function work
according to the documentation added, ie. that it returns “PFN of the first page
that was not scanned”.

-- 
Best regards,                                         _     _
.o. | Liege of Serenely Enlightened Majesty of      o' \,=./ `o
..o | Computer Science,  Michał “mina86” Nazarewicz    (o o)
ooo +----<email/xmpp: mpn@google.com>--------------ooO--(_)--Ooo--

^ permalink raw reply	[flat|nested] 33+ messages in thread

* Re: [PATCH 05/11] mm: mmzone: MIGRATE_CMA migration type added
  2012-01-10 14:38   ` Mel Gorman
@ 2012-01-10 15:04     ` Michal Nazarewicz
  0 siblings, 0 replies; 33+ messages in thread
From: Michal Nazarewicz @ 2012-01-10 15:04 UTC (permalink / raw)
  To: Marek Szyprowski, Mel Gorman
  Cc: linux-kernel, linux-arm-kernel, linux-media, linux-mm,
	linaro-mm-sig, Kyungmin Park, Russell King, Andrew Morton,
	KAMEZAWA Hiroyuki, Daniel Walker, Arnd Bergmann, Jesse Barker,
	Jonathan Corbet, Shariq Hasnain, Chunsang Jeong, Dave Hansen,
	Benjamin Gaignard

On Tue, 10 Jan 2012 15:38:36 +0100, Mel Gorman <mel@csn.ul.ie> wrote:

> On Thu, Dec 29, 2011 at 01:39:06PM +0100, Marek Szyprowski wrote:
>> From: Michal Nazarewicz <mina86@mina86.com>

>> @@ -35,13 +35,35 @@
>>   */
>>  #define PAGE_ALLOC_COSTLY_ORDER 3
>>
>> -#define MIGRATE_UNMOVABLE     0
>> -#define MIGRATE_RECLAIMABLE   1
>> -#define MIGRATE_MOVABLE       2
>> -#define MIGRATE_PCPTYPES      3 /* the number of types on the pcp lists */
>> -#define MIGRATE_RESERVE       3
>> -#define MIGRATE_ISOLATE       4 /* can't allocate from here */
>> -#define MIGRATE_TYPES         5
>> +enum {
>> +	MIGRATE_UNMOVABLE,
>> +	MIGRATE_RECLAIMABLE,
>> +	MIGRATE_MOVABLE,
>> +	MIGRATE_PCPTYPES,	/* the number of types on the pcp lists */
>> +	MIGRATE_RESERVE = MIGRATE_PCPTYPES,
>> +	/*
>> +	 * MIGRATE_CMA migration type is designed to mimic the way
>> +	 * ZONE_MOVABLE works.  Only movable pages can be allocated
>> +	 * from MIGRATE_CMA pageblocks and page allocator never
>> +	 * implicitly change migration type of MIGRATE_CMA pageblock.
>> +	 *
>> +	 * The way to use it is to change migratetype of a range of
>> +	 * pageblocks to MIGRATE_CMA which can be done by
>> +	 * __free_pageblock_cma() function.  What is important though
>> +	 * is that a range of pageblocks must be aligned to
>> +	 * MAX_ORDER_NR_PAGES should biggest page be bigger then
>> +	 * a single pageblock.
>> +	 */
>> +	MIGRATE_CMA,
>> +	MIGRATE_ISOLATE,	/* can't allocate from here */
>> +	MIGRATE_TYPES
>> +};
>
> MIGRATE_CMA is being added whether or not CONFIG_CMA is set. This
> increases the size of the pageblock bitmap and where that is just 1 bit
> per pageblock in the system, it may be noticable on large machines.

Wasn't aware of that, will do.  In fact, in earlier versions in was done this way,
but resulted in more #ifdefs.

>
>> +
>> +#ifdef CONFIG_CMA
>> +#  define is_migrate_cma(migratetype) unlikely((migratetype) == MIGRATE_CMA)
>> +#else
>> +#  define is_migrate_cma(migratetype) false
>> +#endif

> Use static inlines.

I decide to use #define for the sake of situations like
is_migrate_cma(get_pageblock_migratetype(page)).  With a static inline it will have
to read pageblock's migrate type even if !CONFIG_CMA.  A macro gets rid of this
operation all together.

>>  #define for_each_migratetype_order(order, type) \
>>  	for (order = 0; order < MAX_ORDER; order++) \
>> @@ -54,6 +76,11 @@ static inline int get_pageblock_migratetype(struct page *page)
>>  	return get_pageblock_flags_group(page, PB_migrate, PB_migrate_end);
>>  }
>>
>> +static inline bool is_pageblock_cma(struct page *page)
>> +{
>> +	return is_migrate_cma(get_pageblock_migratetype(page));
>> +}
>> +
>
> This allows additional calls to get_pageblock_migratetype() even if
> CONFIG_CMA is not set.
>
>>  struct free_area {
>>  	struct list_head	free_list[MIGRATE_TYPES];
>>  	unsigned long		nr_free;
>> diff --git a/include/linux/page-isolation.h b/include/linux/page-isolation.h
>> index d305080..af650db 100644
>> --- a/include/linux/page-isolation.h
>> +++ b/include/linux/page-isolation.h
>> @@ -37,4 +37,7 @@ extern void unset_migratetype_isolate(struct page *page);
>>  int alloc_contig_range(unsigned long start, unsigned long end);
>>  void free_contig_range(unsigned long pfn, unsigned nr_pages);
>>
>> +/* CMA stuff */
>> +extern void init_cma_reserved_pageblock(struct page *page);
>> +
>>  #endif
>> diff --git a/mm/Kconfig b/mm/Kconfig
>> index 011b110..e080cac 100644
>> --- a/mm/Kconfig
>> +++ b/mm/Kconfig
>> @@ -192,7 +192,7 @@ config COMPACTION
>>  config MIGRATION
>>  	bool "Page migration"
>>  	def_bool y
>> -	depends on NUMA || ARCH_ENABLE_MEMORY_HOTREMOVE || COMPACTION
>> +	depends on NUMA || ARCH_ENABLE_MEMORY_HOTREMOVE || COMPACTION || CMA
>>  	help
>>  	  Allows the migration of the physical location of pages of processes
>>  	  while the virtual addresses are not changed. This is useful in
>> diff --git a/mm/compaction.c b/mm/compaction.c
>> index 8733441..46783b4 100644
>> --- a/mm/compaction.c
>> +++ b/mm/compaction.c
>> @@ -21,6 +21,11 @@
>>  #define CREATE_TRACE_POINTS
>>  #include <trace/events/compaction.h>
>>
>> +static inline bool is_migrate_cma_or_movable(int migratetype)
>> +{
>> +	return is_migrate_cma(migratetype) || migratetype == MIGRATE_MOVABLE;
>> +}
>> +
>
> That is not a name that helps any. migrate_async_suitable() would be
> marginally better.
>
>>  /**
>>   * isolate_freepages_range() - isolate free pages, must hold zone->lock.
>>   * @zone:	Zone pages are in.
>> @@ -213,7 +218,7 @@ isolate_migratepages_range(struct zone *zone, struct compact_control *cc,
>>  		 */
>>  		pageblock_nr = low_pfn >> pageblock_order;
>>  		if (!cc->sync && last_pageblock_nr != pageblock_nr &&
>> -				get_pageblock_migratetype(page) != MIGRATE_MOVABLE) {
>> +		    is_migrate_cma_or_movable(get_pageblock_migratetype(page))) {
>>  			low_pfn += pageblock_nr_pages;
>>  			low_pfn = ALIGN(low_pfn, pageblock_nr_pages) - 1;
>>  			last_pageblock_nr = pageblock_nr;
>
> I know I suggested migrate_async_suitable() here but the check may
> not even happen if CMA uses sync migration.

For CMA, we know that pageblock's migrate type is MIGRATE_CMA.

>
>> @@ -295,8 +300,8 @@ static bool suitable_migration_target(struct page *page)
>>  	if (PageBuddy(page) && page_order(page) >= pageblock_order)
>>  		return true;
>>
>> -	/* If the block is MIGRATE_MOVABLE, allow migration */
>> -	if (migratetype == MIGRATE_MOVABLE)
>> +	/* If the block is MIGRATE_MOVABLE or MIGRATE_CMA, allow migration */
>> +	if (is_migrate_cma_or_movable(migratetype))
>>  		return true;
>>
>>  	/* Otherwise skip the block */
>> diff --git a/mm/page_alloc.c b/mm/page_alloc.c
>> index 47b0a85..06a7861 100644
>> --- a/mm/page_alloc.c
>> +++ b/mm/page_alloc.c
>> @@ -722,6 +722,26 @@ void __meminit __free_pages_bootmem(struct page *page, unsigned int order)
>>  	}
>>  }
>>
>> +#ifdef CONFIG_CMA
>> +/*
>> + * Free whole pageblock and set it's migration type to MIGRATE_CMA.
>> + */
>> +void __init init_cma_reserved_pageblock(struct page *page)
>> +{
>> +	unsigned i = pageblock_nr_pages;
>> +	struct page *p = page;
>> +
>> +	do {
>> +		__ClearPageReserved(p);
>> +		set_page_count(p, 0);
>> +	} while (++p, --i);
>> +
>> +	set_page_refcounted(page);
>> +	set_pageblock_migratetype(page, MIGRATE_CMA);
>> +	__free_pages(page, pageblock_order);
>> +	totalram_pages += pageblock_nr_pages;
>> +}
>> +#endif
>>
>>  /*
>>   * The order of subdivision here is critical for the IO subsystem.
>> @@ -830,11 +850,10 @@ struct page *__rmqueue_smallest(struct zone *zone, unsigned int order,
>>   * This array describes the order lists are fallen back to when
>>   * the free lists for the desirable migrate type are depleted
>>   */
>> -static int fallbacks[MIGRATE_TYPES][MIGRATE_TYPES-1] = {
>> +static int fallbacks[MIGRATE_PCPTYPES][4] = {
>>  	[MIGRATE_UNMOVABLE]   = { MIGRATE_RECLAIMABLE, MIGRATE_MOVABLE,   MIGRATE_RESERVE },
>>  	[MIGRATE_RECLAIMABLE] = { MIGRATE_UNMOVABLE,   MIGRATE_MOVABLE,   MIGRATE_RESERVE },
>> -	[MIGRATE_MOVABLE]     = { MIGRATE_RECLAIMABLE, MIGRATE_UNMOVABLE, MIGRATE_RESERVE },
>> -	[MIGRATE_RESERVE]     = { MIGRATE_RESERVE,     MIGRATE_RESERVE,   MIGRATE_RESERVE }, /* Never used */
>> +	[MIGRATE_MOVABLE]     = { MIGRATE_RECLAIMABLE, MIGRATE_UNMOVABLE, MIGRATE_CMA    , MIGRATE_RESERVE },
>
> Why did you delete [MIGRATE_RESERVE] here?

It is never used anyway.

> It changes the array from being expressed in terms of MIGRATE_TYPES to
> being a hard-coded value.

I don't follow.  This is only used in code path that fills in pcp lists, so only
the first MIGRATE_PCPTYPES rows are used.

> I do not see the advantage and it's not clear how it is related to the patch.

No real advantages, I can revert that change.

-- 
Best regards,                                         _     _
.o. | Liege of Serenely Enlightened Majesty of      o' \,=./ `o
..o | Computer Science,  Michał “mina86” Nazarewicz    (o o)
ooo +----<email/xmpp: mpn@google.com>--------------ooO--(_)--Ooo--

^ permalink raw reply	[flat|nested] 33+ messages in thread

* Re: [PATCH 04/11] mm: page_alloc: introduce alloc_contig_range()
  2012-01-10 14:16   ` Mel Gorman
@ 2012-01-13 20:04     ` Michal Nazarewicz
  2012-01-16  9:01       ` Mel Gorman
  0 siblings, 1 reply; 33+ messages in thread
From: Michal Nazarewicz @ 2012-01-13 20:04 UTC (permalink / raw)
  To: Marek Szyprowski, Mel Gorman
  Cc: linux-kernel, linux-arm-kernel, linux-media, linux-mm,
	linaro-mm-sig, Kyungmin Park, Russell King, Andrew Morton,
	KAMEZAWA Hiroyuki, Daniel Walker, Arnd Bergmann, Jesse Barker,
	Jonathan Corbet, Shariq Hasnain, Chunsang Jeong, Dave Hansen,
	Benjamin Gaignard

> On Thu, Dec 29, 2011 at 01:39:05PM +0100, Marek Szyprowski wrote:
>> From: Michal Nazarewicz <mina86@mina86.com>
>> +	/* Make sure all pages are isolated. */
>> +	if (!ret) {
>> +		lru_add_drain_all();
>> +		drain_all_pages();
>> +		if (WARN_ON(test_pages_isolated(start, end)))
>> +			ret = -EBUSY;
>> +	}

On Tue, 10 Jan 2012 15:16:13 +0100, Mel Gorman <mel@csn.ul.ie> wrote:
> Another global IPI seems overkill. Drain pages only from the local CPU
> (drain_pages(get_cpu()); put_cpu()) and test if the pages are isolated.

Is get_cpu() + put_cpu() required? Won't drain_local_pages() work?

-- 
Best regards,                                         _     _
.o. | Liege of Serenely Enlightened Majesty of      o' \,=./ `o
..o | Computer Science,  Michał “mina86” Nazarewicz    (o o)
ooo +----<email/xmpp: mpn@google.com>--------------ooO--(_)--Ooo--

^ permalink raw reply	[flat|nested] 33+ messages in thread

* Re: [PATCH 04/11] mm: page_alloc: introduce alloc_contig_range()
  2012-01-13 20:04     ` Michal Nazarewicz
@ 2012-01-16  9:01       ` Mel Gorman
  2012-01-16 12:48         ` Michal Nazarewicz
  0 siblings, 1 reply; 33+ messages in thread
From: Mel Gorman @ 2012-01-16  9:01 UTC (permalink / raw)
  To: Michal Nazarewicz
  Cc: Marek Szyprowski, linux-kernel, linux-arm-kernel, linux-media,
	linux-mm, linaro-mm-sig, Kyungmin Park, Russell King,
	Andrew Morton, KAMEZAWA Hiroyuki, Daniel Walker, Arnd Bergmann,
	Jesse Barker, Jonathan Corbet, Shariq Hasnain, Chunsang Jeong,
	Dave Hansen, Benjamin Gaignard

On Fri, Jan 13, 2012 at 09:04:31PM +0100, Michal Nazarewicz wrote:
> >On Thu, Dec 29, 2011 at 01:39:05PM +0100, Marek Szyprowski wrote:
> >>From: Michal Nazarewicz <mina86@mina86.com>
> >>+	/* Make sure all pages are isolated. */
> >>+	if (!ret) {
> >>+		lru_add_drain_all();
> >>+		drain_all_pages();
> >>+		if (WARN_ON(test_pages_isolated(start, end)))
> >>+			ret = -EBUSY;
> >>+	}
> 
> On Tue, 10 Jan 2012 15:16:13 +0100, Mel Gorman <mel@csn.ul.ie> wrote:
> >Another global IPI seems overkill. Drain pages only from the local CPU
> >(drain_pages(get_cpu()); put_cpu()) and test if the pages are isolated.
> 
> Is get_cpu() + put_cpu() required? Won't drain_local_pages() work?
> 

drain_local_pages() calls smp_processor_id() without preemption
disabled. 

-- 
Mel Gorman
SUSE Labs

^ permalink raw reply	[flat|nested] 33+ messages in thread

* Re: [PATCH 04/11] mm: page_alloc: introduce alloc_contig_range()
  2012-01-16  9:01       ` Mel Gorman
@ 2012-01-16 12:48         ` Michal Nazarewicz
  0 siblings, 0 replies; 33+ messages in thread
From: Michal Nazarewicz @ 2012-01-16 12:48 UTC (permalink / raw)
  To: Mel Gorman
  Cc: Marek Szyprowski, linux-kernel, linux-arm-kernel, linux-media,
	linux-mm, linaro-mm-sig, Kyungmin Park, Russell King,
	Andrew Morton, KAMEZAWA Hiroyuki, Daniel Walker, Arnd Bergmann,
	Jesse Barker, Jonathan Corbet, Shariq Hasnain, Chunsang Jeong,
	Dave Hansen, Benjamin Gaignard

On Mon, 16 Jan 2012 10:01:10 +0100, Mel Gorman <mel@csn.ul.ie> wrote:

> On Fri, Jan 13, 2012 at 09:04:31PM +0100, Michal Nazarewicz wrote:
>> >On Thu, Dec 29, 2011 at 01:39:05PM +0100, Marek Szyprowski wrote:
>> >>From: Michal Nazarewicz <mina86@mina86.com>
>> >>+	/* Make sure all pages are isolated. */
>> >>+	if (!ret) {
>> >>+		lru_add_drain_all();
>> >>+		drain_all_pages();
>> >>+		if (WARN_ON(test_pages_isolated(start, end)))
>> >>+			ret = -EBUSY;
>> >>+	}
>>
>> On Tue, 10 Jan 2012 15:16:13 +0100, Mel Gorman <mel@csn.ul.ie> wrote:
>> >Another global IPI seems overkill. Drain pages only from the local CPU
>> >(drain_pages(get_cpu()); put_cpu()) and test if the pages are isolated.
>>
>> Is get_cpu() + put_cpu() required? Won't drain_local_pages() work?
>>
>
> drain_local_pages() calls smp_processor_id() without preemption
> disabled.

Thanks, I wasn't sure if preemption is an issue.

-- 
Best regards,                                         _     _
.o. | Liege of Serenely Enlightened Majesty of      o' \,=./ `o
..o | Computer Science,  Michał “mina86” Nazarewicz    (o o)
ooo +----<email/xmpp: mpn@google.com>--------------ooO--(_)--Ooo--

^ permalink raw reply	[flat|nested] 33+ messages in thread

* Re: [Linaro-mm-sig] [PATCH 04/11] mm: page_alloc: introduce alloc_contig_range()
  2011-12-29 12:39 ` [PATCH 04/11] mm: page_alloc: introduce alloc_contig_range() Marek Szyprowski
  2012-01-10 14:16   ` Mel Gorman
@ 2012-01-17 21:54   ` sandeep patil
  2012-01-17 22:19     ` Michal Nazarewicz
  2012-01-19  7:36     ` Marek Szyprowski
  1 sibling, 2 replies; 33+ messages in thread
From: sandeep patil @ 2012-01-17 21:54 UTC (permalink / raw)
  To: Marek Szyprowski
  Cc: linux-kernel, linux-arm-kernel, linux-media, linux-mm,
	linaro-mm-sig, Daniel Walker, Russell King, Arnd Bergmann,
	Jonathan Corbet, Mel Gorman, Michal Nazarewicz, Dave Hansen,
	Jesse Barker, Kyungmin Park, Andrew Morton, KAMEZAWA Hiroyuki

Marek,

I am running a CMA test where I keep allocating from a CMA region as long
as the allocation fails due to lack of space.

However, I am seeing failures much before I expect them to happen.
When the allocation fails, I see a warning coming from __alloc_contig_range(),
because test_pages_isolated() returned "true".

The new retry code does try a new range and eventually succeeds.


> +
> +static int __alloc_contig_migrate_range(unsigned long start, unsigned long end)
> +{
> +
> +done:
> +       /* Make sure all pages are isolated. */
> +       if (!ret) {
> +               lru_add_drain_all();
> +               drain_all_pages();
> +               if (WARN_ON(test_pages_isolated(start, end)))
> +                       ret = -EBUSY;
> +       }

I tried to find out why this happened and added in a debug print inside
__test_page_isolated_in_pageblock(). Here's the resulting log ..

---
[  133.563140] !!! Found unexpected page(pfn=9aaab), (count=0),
(isBuddy=no), (private=0x00000004), (flags=0x00000000), (_mapcount=0)
!!!
[  133.576690] ------------[ cut here ]------------
[  133.582489] WARNING: at mm/page_alloc.c:5804 alloc_contig_range+0x1a4/0x2c4()
[  133.594757] [<c003e814>] (unwind_backtrace+0x0/0xf0) from
[<c0079c7c>] (warn_slowpath_common+0x4c/0x64)
[  133.605468] [<c0079c7c>] (warn_slowpath_common+0x4c/0x64) from
[<c0079cac>] (warn_slowpath_null+0x18/0x1c)
[  133.616424] [<c0079cac>] (warn_slowpath_null+0x18/0x1c) from
[<c00e0e84>] (alloc_contig_range+0x1a4/0x2c4)
[  133.627471] EXT4-fs (mmcblk0p25): re-mounted. Opts: (null)
[  133.633728] [<c00e0e84>] (alloc_contig_range+0x1a4/0x2c4) from
[<c0266690>] (dma_alloc_from_contiguous+0x114/0x1c8)
[  133.697113] !!! Found unexpected page(pfn=9aaac), (count=0),
(isBuddy=no), (private=0x00000004), (flags=0x00000000), (_mapcount=0)
!!!
[  133.710510] EXT4-fs (mmcblk0p26): re-mounted. Opts: (null)
[  133.716766] ------------[ cut here ]------------
[  133.721954] WARNING: at mm/page_alloc.c:5804 alloc_contig_range+0x1a4/0x2c4()
[  133.734100] Emergency Remount complete
[  133.742584] [<c003e814>] (unwind_backtrace+0x0/0xf0) from
[<c0079c7c>] (warn_slowpath_common+0x4c/0x64)
[  133.753448] [<c0079c7c>] (warn_slowpath_common+0x4c/0x64) from
[<c0079cac>] (warn_slowpath_null+0x18/0x1c)
[  133.764373] [<c0079cac>] (warn_slowpath_null+0x18/0x1c) from
[<c00e0e84>] (alloc_contig_range+0x1a4/0x2c4)
[  133.775299] [<c00e0e84>] (alloc_contig_range+0x1a4/0x2c4) from
[<c0266690>] (dma_alloc_from_contiguous+0x114/0x1c8)
---

>From the log it looks like the warning showed up because page->private
is set to MIGRATE_CMA instead of MIGRATE_ISOLATED.
I've also had a test case where it failed because (page_count() != 0)

Have you or anyone else seen this during the CMA testing?

Also, could this be because we are finding a page within (start, end)
that actually belongs
to a higher order Buddy block ?


Thanks,
Sandeep

^ permalink raw reply	[flat|nested] 33+ messages in thread

* Re: [Linaro-mm-sig] [PATCH 04/11] mm: page_alloc: introduce alloc_contig_range()
  2012-01-17 21:54   ` [Linaro-mm-sig] " sandeep patil
@ 2012-01-17 22:19     ` Michal Nazarewicz
  2012-01-18  0:46       ` sandeep patil
  2012-01-19  7:36     ` Marek Szyprowski
  1 sibling, 1 reply; 33+ messages in thread
From: Michal Nazarewicz @ 2012-01-17 22:19 UTC (permalink / raw)
  To: Marek Szyprowski, sandeep patil
  Cc: linux-kernel, linux-arm-kernel, linux-media, linux-mm,
	linaro-mm-sig, Daniel Walker, Russell King, Arnd Bergmann,
	Jonathan Corbet, Mel Gorman, Dave Hansen, Jesse Barker,
	Kyungmin Park, Andrew Morton, KAMEZAWA Hiroyuki

On Tue, 17 Jan 2012 22:54:28 +0100, sandeep patil <psandeep.s@gmail.com> wrote:

> Marek,
>
> I am running a CMA test where I keep allocating from a CMA region as long
> as the allocation fails due to lack of space.
>
> However, I am seeing failures much before I expect them to happen.
> When the allocation fails, I see a warning coming from __alloc_contig_range(),
> because test_pages_isolated() returned "true".

Yeah, we are wondering ourselves about that.  Could you try cherry-picking
commit ad10eb079c97e27b4d27bc755c605226ce1625de (update migrate type on pcp
when isolating) from git://github.com/mina86/linux-2.6.git?  It probably won't
apply cleanly but resolving the conflicts should not be hard (alternatively
you can try branch cma from the same repo but it is a work in progress at the
moment).

> I tried to find out why this happened and added in a debug print inside
> __test_page_isolated_in_pageblock(). Here's the resulting log ..

[...]

> From the log it looks like the warning showed up because page->private
> is set to MIGRATE_CMA instead of MIGRATE_ISOLATED.

My understanding of that situation is that the page is on pcp list in which
cases it's page_private is not updated.  Draining and the first patch in
the series (and also the commit I've pointed to above) are designed to fix
that but I'm unsure why they don't work all the time.

> I've also had a test case where it failed because (page_count() != 0)



> Have you or anyone else seen this during the CMA testing?
>
> Also, could this be because we are finding a page within (start, end)
> that actually belongs to a higher order Buddy block ?

Higher order free buddy blocks are skipped in the “if (PageBuddy(page))”
path of __test_page_isolated_in_pageblock().  Then again, now that I think
of it, something fishy may be happening on the edges.  Moving the check
outside of __alloc_contig_migrate_range() after outer_start is calculated
in alloc_contig_range() could help.  I'll take a look at it.

-- 
Best regards,                                         _     _
.o. | Liege of Serenely Enlightened Majesty of      o' \,=./ `o
..o | Computer Science,  Michał “mina86” Nazarewicz    (o o)
ooo +----<email/xmpp: mpn@google.com>--------------ooO--(_)--Ooo--

^ permalink raw reply	[flat|nested] 33+ messages in thread

* Re: [Linaro-mm-sig] [PATCH 04/11] mm: page_alloc: introduce alloc_contig_range()
  2012-01-17 22:19     ` Michal Nazarewicz
@ 2012-01-18  0:46       ` sandeep patil
  2012-01-18  1:44         ` Michal Nazarewicz
  0 siblings, 1 reply; 33+ messages in thread
From: sandeep patil @ 2012-01-18  0:46 UTC (permalink / raw)
  To: Michal Nazarewicz
  Cc: Marek Szyprowski, linux-kernel, linux-arm-kernel, linux-media,
	linux-mm, linaro-mm-sig, Daniel Walker, Russell King,
	Arnd Bergmann, Jonathan Corbet, Mel Gorman, Dave Hansen,
	Jesse Barker, Kyungmin Park, Andrew Morton, KAMEZAWA Hiroyuki

> Yeah, we are wondering ourselves about that.  Could you try cherry-picking
> commit ad10eb079c97e27b4d27bc755c605226ce1625de (update migrate type on pcp
> when isolating) from git://github.com/mina86/linux-2.6.git?  It probably
> won't
> apply cleanly but resolving the conflicts should not be hard (alternatively
> you can try branch cma from the same repo but it is a work in progress at
> the
> moment).
>

I'll try this patch and report back ,,


>> is set to MIGRATE_CMA instead of MIGRATE_ISOLATED.
>
>
> My understanding of that situation is that the page is on pcp list in which
> cases it's page_private is not updated.  Draining and the first patch in
> the series (and also the commit I've pointed to above) are designed to fix
> that but I'm unsure why they don't work all the time.
>
>

Will verify this if the page is found on the pcp list as well .

>> I've also had a test case where it failed because (page_count() != 0)

With this, when it failed the page_count()
returned a value of 2. I am not sure why, but I will try and see If I can
reproduce this.

>
>
>> Have you or anyone else seen this during the CMA testing?
>>
>> Also, could this be because we are finding a page within (start, end)
>> that actually belongs to a higher order Buddy block ?
>
>
> Higher order free buddy blocks are skipped in the “if (PageBuddy(page))”
> path of __test_page_isolated_in_pageblock().  Then again, now that I think
> of it, something fishy may be happening on the edges.  Moving the check
> outside of __alloc_contig_migrate_range() after outer_start is calculated
> in alloc_contig_range() could help.  I'll take a look at it.

I was going to suggest that, moving the check until after outer_start
is calculated
will definitely help IMO. I am sure I've seen a case where

  page_count(page) = page->private = 0 and PageBuddy(page) was false.

I will try and reproduce this as well.

Thanks,
Sandeep

^ permalink raw reply	[flat|nested] 33+ messages in thread

* Re: [Linaro-mm-sig] [PATCH 04/11] mm: page_alloc: introduce alloc_contig_range()
  2012-01-18  0:46       ` sandeep patil
@ 2012-01-18  1:44         ` Michal Nazarewicz
  0 siblings, 0 replies; 33+ messages in thread
From: Michal Nazarewicz @ 2012-01-18  1:44 UTC (permalink / raw)
  To: sandeep patil
  Cc: Marek Szyprowski, linux-kernel, linux-arm-kernel, linux-media,
	linux-mm, linaro-mm-sig, Daniel Walker, Russell King,
	Arnd Bergmann, Jonathan Corbet, Mel Gorman, Dave Hansen,
	Jesse Barker, Kyungmin Park, Andrew Morton, KAMEZAWA Hiroyuki

>> My understanding of that situation is that the page is on pcp list in which
>> cases it's page_private is not updated.  Draining and the first patch in
>> the series (and also the commit I've pointed to above) are designed to fix
>> that but I'm unsure why they don't work all the time.

On Wed, 18 Jan 2012 01:46:37 +0100, sandeep patil <psandeep.s@gmail.com> wrote:
> Will verify this if the page is found on the pcp list as well .

I was wondering in general if “!PageBuddy(page) && !page_count(page)” means
page is on PCP.  From what I've seen in page_isolate.c it seems to be the case.

>>> I've also had a test case where it failed because (page_count() != 0)

> With this, when it failed the page_count() returned a value of 2.  I am not
> sure why, but I will try and see If I can reproduce this.

If I'm not mistaken, page_count() != 0 means the page is allocated.  I can see
the following scenarios which can lead to page being allocated in when
test_pages_isolated() is called:

1. The page failed to migrate.  In this case however, the code would abort earlier.

2. The page was migrated but then allocated.  This is not possible since
    migrated pages are freed which puts the page on MIGRATE_ISOLATE freelist which
    guarantees that the page will not be migrated.

3. The page was removed from PCP list but with migratetype == MIGRATE_CMA.  This
    is something the first patch in the series as well as the commit I've mentioned tries
    to address so hopefully it won't be an issue any more.

4. The page was allocated from PCP list.  This may happen because draining of PCP
    list happens after IRQs are enabled in set_migratetype_isolate().  I don't have
    a solution for that just yet.  One is to alter update_pcp_isolate_block() to
    remove page from the PCP list.  I haven't looked at specifics of how to implement
    this just yet.

>> Moving the check outside of __alloc_contig_migrate_range() after outer_start
>> is calculated in alloc_contig_range() could help.
>
> I was going to suggest that, moving the check until after outer_start
> is calculated will definitely help IMO. I am sure I've seen a case where
>
>   page_count(page) = page->private = 0 and PageBuddy(page) was false.

Yep, I've pushed new content to my branch (git://github.com/mina86/linux-2.6.git cma)
and will try to get Marek to test it some time soon (I'm currently swamped with
non-Linux related work myself).

-- 
Best regards,                                         _     _
.o. | Liege of Serenely Enlightened Majesty of      o' \,=./ `o
..o | Computer Science,  Michał “mina86” Nazarewicz    (o o)
ooo +----<email/xmpp: mpn@google.com>--------------ooO--(_)--Ooo--

^ permalink raw reply	[flat|nested] 33+ messages in thread

* RE: [Linaro-mm-sig] [PATCH 04/11] mm: page_alloc: introduce alloc_contig_range()
  2012-01-17 21:54   ` [Linaro-mm-sig] " sandeep patil
  2012-01-17 22:19     ` Michal Nazarewicz
@ 2012-01-19  7:36     ` Marek Szyprowski
  1 sibling, 0 replies; 33+ messages in thread
From: Marek Szyprowski @ 2012-01-19  7:36 UTC (permalink / raw)
  To: 'sandeep patil'
  Cc: linux-kernel, linux-arm-kernel, linux-media, linux-mm,
	linaro-mm-sig, 'Daniel Walker', 'Russell King',
	'Arnd Bergmann', 'Jonathan Corbet',
	'Mel Gorman', 'Michal Nazarewicz',
	'Dave Hansen', 'Jesse Barker',
	'Kyungmin Park', 'Andrew Morton',
	'KAMEZAWA Hiroyuki'

Hello,

On Tuesday, January 17, 2012 10:54 PM sandeep patil wrote:

> I am running a CMA test where I keep allocating from a CMA region as long
> as the allocation fails due to lack of space.
> 
> However, I am seeing failures much before I expect them to happen.
> When the allocation fails, I see a warning coming from __alloc_contig_range(),
> because test_pages_isolated() returned "true".
> 
> The new retry code does try a new range and eventually succeeds.

(snipped)

> From the log it looks like the warning showed up because page->private
> is set to MIGRATE_CMA instead of MIGRATE_ISOLATED.

> I've also had a test case where it failed because (page_count() != 0)

This means that the page is temporarily used by someone else (like for example
io subsystem or a driver).

> Have you or anyone else seen this during the CMA testing?

Yes, we observed such issues and we are also working on fixing them. However 
we gave higher priority to get the basic CMA patches merged to mainline. Once
this happen the above issues can be fixed incrementally.

> Also, could this be because we are finding a page within (start, end)
> that actually belongs
> to a higher order Buddy block ?

No, such pages should be correctly handled.

Best regards
-- 
Marek Szyprowski
Samsung Poland R&D Center



^ permalink raw reply	[flat|nested] 33+ messages in thread

end of thread, other threads:[~2012-01-19  7:36 UTC | newest]

Thread overview: 33+ messages (download: mbox.gz / follow: Atom feed)
-- links below jump to the message on this page --
2011-12-29 12:39 [PATCHv18 0/11] Contiguous Memory Allocator Marek Szyprowski
2011-12-29 12:39 ` [PATCH 01/11] mm: page_alloc: set_migratetype_isolate: drain PCP prior to isolating Marek Szyprowski
2012-01-01  7:49   ` Gilad Ben-Yossef
2012-01-01 15:54     ` Michal Nazarewicz
2012-01-01 16:06       ` Gilad Ben-Yossef
2012-01-01 18:52         ` Michal Nazarewicz
2012-01-05 15:39   ` Michal Nazarewicz
2012-01-05 19:20     ` Marek Szyprowski
2011-12-29 12:39 ` [PATCH 02/11] mm: compaction: introduce isolate_{free,migrate}pages_range() Marek Szyprowski
2012-01-10 13:43   ` Mel Gorman
2012-01-10 15:04     ` Michal Nazarewicz
2011-12-29 12:39 ` [PATCH 03/11] mm: compaction: export some of the functions Marek Szyprowski
2011-12-29 12:39 ` [PATCH 04/11] mm: page_alloc: introduce alloc_contig_range() Marek Szyprowski
2012-01-10 14:16   ` Mel Gorman
2012-01-13 20:04     ` Michal Nazarewicz
2012-01-16  9:01       ` Mel Gorman
2012-01-16 12:48         ` Michal Nazarewicz
2012-01-17 21:54   ` [Linaro-mm-sig] " sandeep patil
2012-01-17 22:19     ` Michal Nazarewicz
2012-01-18  0:46       ` sandeep patil
2012-01-18  1:44         ` Michal Nazarewicz
2012-01-19  7:36     ` Marek Szyprowski
2011-12-29 12:39 ` [PATCH 05/11] mm: mmzone: MIGRATE_CMA migration type added Marek Szyprowski
2012-01-10 14:38   ` Mel Gorman
2012-01-10 15:04     ` Michal Nazarewicz
2011-12-29 12:39 ` [PATCH 06/11] mm: page_isolation: MIGRATE_CMA isolation functions added Marek Szyprowski
2011-12-29 12:39 ` [PATCH 07/11] mm: add optional memory reclaim in split_free_page() Marek Szyprowski
2012-01-10 14:45   ` Mel Gorman
2011-12-29 12:39 ` [PATCH 08/11] drivers: add Contiguous Memory Allocator Marek Szyprowski
2011-12-29 12:39 ` [PATCH 09/11] X86: integrate CMA with DMA-mapping subsystem Marek Szyprowski
2011-12-29 12:39 ` [PATCH 10/11] ARM: " Marek Szyprowski
2011-12-29 12:39 ` [PATCH 11/11] ARM: Samsung: use CMA for 2 memory banks for s5p-mfc device Marek Szyprowski
2012-01-10  8:42 ` [PATCHv18 0/11] Contiguous Memory Allocator Marek Szyprowski

This is a public inbox, see mirroring instructions
for how to clone and mirror all data and code used for this inbox;
as well as URLs for NNTP newsgroup(s).