* [RESEND RFC PATCH v3 00/35] mm: Memory Power Management
@ 2013-08-30 13:13 ` Srivatsa S. Bhat
From: Srivatsa S. Bhat @ 2013-08-30 13:13 UTC (permalink / raw)
  To: akpm, mgorman, hannes, tony.luck, matthew.garrett, dave, riel,
	arjan, srinivas.pandruvada, willy, kamezawa.hiroyu, lenb, rjw
  Cc: gargankita, paulmck, svaidy, andi, isimatu.yasuaki,
	santosh.shilimkar, kosaki.motohiro, srivatsa.bhat, linux-pm,
	linux-mm, linux-kernel

[ Resending, since some of the patches didn't go through successfully
the last time around. ]

Overview of Memory Power Management and its implications to the Linux MM
========================================================================

Computer systems are increasingly being equipped with larger and larger amounts
of RAM in order to meet workload demands. However, memory consumes a significant
amount of power, potentially more than a third of total system power on server
systems[4]. So naturally, memory becomes the next big target for power
management - on embedded systems and smartphones, and all the way up to large
server systems.

Power-management capabilities in modern memory hardware:
-------------------------------------------------------

Modern memory hardware such as DDR3 supports a number of power-management
capabilities - for instance, the memory controller can automatically put
memory DIMMs/banks into content-preserving low-power states, if it detects
that the *entire* memory DIMM/bank has not been referenced for a threshold
amount of time, thus reducing the energy consumption of the memory hardware.
We term these power-manageable chunks of memory "memory regions".

Exporting memory region info from the platform to the OS:
--------------------------------------------------------

The OS needs to know about the granularity at which the hardware can perform
automatic power-management of the memory banks (i.e., the address boundaries
of the memory regions). On ARM platforms, the bootloader can be modified to
pass on this info to the kernel via the device-tree. On x86 platforms, the
new ACPI 5.0 spec has added support for exporting the power-management
capabilities of the memory hardware to the OS in a standard way[5][6].
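
As a purely illustrative aside, the kernel side of this hand-off could be as
simple as recording an array of power-manageable ranges reported by the
firmware/bootloader. The sketch below is hypothetical (this patchset does not
include any firmware parsing code, and none of these names exist in it):

/*
 * Illustrative sketch only -- not part of this patchset. Assume the
 * firmware/bootloader (device-tree on ARM, ACPI 5.0 MPST on x86) reports
 * each power-manageable range as a (base, size) pair; the kernel merely
 * records the pfn boundaries for later use by the MM.
 */
#include <linux/pfn.h>
#include <linux/types.h>

struct hw_mem_region {
        unsigned long start_pfn;
        unsigned long end_pfn;
};

#define MAX_HW_MEM_REGIONS      256

static struct hw_mem_region hw_mem_regions[MAX_HW_MEM_REGIONS];
static int nr_hw_mem_regions;

static void record_hw_mem_region(u64 base, u64 size)
{
        struct hw_mem_region *r;

        if (nr_hw_mem_regions >= MAX_HW_MEM_REGIONS)
                return;         /* silently ignore overflow in this sketch */

        r = &hw_mem_regions[nr_hw_mem_regions++];
        r->start_pfn = PFN_DOWN(base);
        r->end_pfn   = PFN_DOWN(base + size);
}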

Estimate of power-savings from power-aware Linux MM:
---------------------------------------------------

Once the firmware/bootloader exports the required info to the OS, it is up to
the kernel's MM subsystem to make the best use of these capabilities and manage
memory power-efficiently. It has been demonstrated on a Samsung Exynos board
(with 2 GB RAM) that up to 6 percent of total system power can be saved by
making the Linux kernel MM subsystem power-aware[3]. (More savings can be
expected on systems with larger amounts of memory, and perhaps improved further
with better MM designs.)


Role of the Linux MM in enhancing memory power savings:
------------------------------------------------------

This role largely translates to having the Linux MM understand the granularity
at which RAM modules can be power-managed, and keeping memory allocations and
references consolidated to a minimum number of these power-manageable
"memory regions". The memory hardware has the intelligence to automatically
transition memory banks that haven't been referenced for a threshold amount
of time to low-power, content-preserving states, and it can also perform
OS-cooperative power-off of unused (unallocated) memory regions. So the onus
is on the Linux VM to become power-aware, and to shape the allocations and
influence the references in such a way that it helps conserve memory power.
This involves consolidating the allocations/references at the right address
boundaries, keeping the memory-region granularity in mind.


So we can summarize the goals for the Linux MM as follows:

o Consolidate memory allocations and/or references such that they are not
spread across the entire memory address space, so that the areas of memory
that are not being referenced can reside in low-power states.

o Support light-weight targeted memory compaction/reclaim, to evacuate
lightly-filled memory regions. This helps avoid memory references to
those regions, thereby allowing them to reside in low-power states.


Assumptions and goals of this patchset:
--------------------------------------

This patchset does not handle obtaining the region boundary info from the
firmware/bootloader and populating it in the kernel data-structures. Its aim
is to propose and brainstorm on a power-aware design of the Linux MM which can
*use* the region boundary info to influence the MM at various places such as
page allocation and reclamation/compaction, thereby contributing to memory
power savings.

So, in this patchset, we assume a simple model in which each 512MB chunk of
memory can be independently power-managed, and hard-code this in the kernel.
As mentioned, the focus of this patchset is not so much on how we get this info
from the firmware or how exactly we handle a variety of configurations, but
rather on discussing the power-savings/performance impact of the MM algorithms
that *act* upon this info in order to save memory power.
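
To make the hard-coded 512MB model concrete, here is a minimal sketch of how a
page's region index could be derived under this assumption (the macro and
helper names below are made up for illustration and are not the ones used in
the patches):

/*
 * Illustrative sketch for the simple model assumed in this patchset:
 * every 512MB chunk of physical memory is one independently
 * power-manageable region.
 */
#include <linux/mm.h>

#define MEM_REGION_SHIFT        (29 - PAGE_SHIFT)      /* 512MB, in pages */

static inline int pfn_to_region_id(unsigned long pfn)
{
        return pfn >> MEM_REGION_SHIFT;
}

static inline int page_to_region_id(struct page *page)
{
        return pfn_to_region_id(page_to_pfn(page));
}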

That said, it's not very far-fetched to try this out with actual region
boundary info to get real power-savings numbers. For example, on ARM platforms,
we can make the bootloader export this info to the OS via the device-tree
and then run this patchset. (This was the method used to get the power numbers
in [3].) But even without doing that, we can evaluate the effectiveness of this
patchset in contributing to power savings by analyzing the per-memory-region
free-page statistics, and we can observe the performance impact by running
benchmarks - this is the approach currently used to evaluate this patchset.


Brief overview of the design/approach used in this patchset:
-----------------------------------------------------------

The strategy used in this patchset is to do page allocation in increasing order
of memory regions (within a zone) and perform region-compaction in the reverse
order, as illustrated below.

---------------------------- Increasing region number---------------------->

Direction of allocation--->               <---Direction of region-compaction


We achieve this by making 3 major design changes to the Linux kernel memory
manager, as outlined below.

1. Sorted-buddy design of buddy freelists:

   To allocate pages in increasing order of memory regions, we first capture
   the memory region boundaries in suitable zone-level data-structures, and
   modify the buddy allocator to maintain the buddy freelists in
   region-sorted order. Page allocation thus automatically occurs in order
   of increasing memory regions. (A rough sketch of this idea follows this
   list.)

2. Split-allocator design: Page-Allocator as front-end; Region-Allocator as
   back-end:

   Mixing up movable and unmovable pages can disrupt opportunities for
   consolidating allocations. In order to separate such pages at a memory-region
   granularity, a "Region-Allocator" is introduced which allocates entire memory
   regions. The Page-Allocator is then modified to get its memory from the
   Region-Allocator and hand out pages to requesting applications in
   page-sized chunks. This design significantly improves the effectiveness of
   this patchset at consolidating allocations to a minimum number of memory
   regions.

3. Targeted region compaction/evacuation:

   Over time, due to multiple alloc()s and free()s in random order, memory gets
   fragmented, which means the memory allocations will no longer be consolidated
   to a minimum number of memory regions. In such cases we need a light-weight
   mechanism to opportunistically compact memory to evacuate lightly-filled
   memory regions, thereby enhancing the power savings.

   Noting that CMA (the Contiguous Memory Allocator) performs targeted
   compaction to achieve its goals, v2 of this patchset generalized that
   targeted compaction code and reused it to evacuate memory regions.

   [ I have temporarily dropped this feature in this version (v3) of the
     patchset, since it can benefit from some considerable changes. I'll revive
     it in the next version and integrate it with the split-allocator design. ]
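
To illustrate points 1 and 2 above, here is a rough, self-contained sketch of
how per-order buddy freelists could be maintained per memory region and scanned
in increasing region order. The data-structure and function names are invented
for illustration and do not match the actual patches:

/*
 * Rough sketch only -- not the patchset's actual implementation.
 * Each (zone, order) free area keeps a separate list of free pages per
 * memory region. Allocation scans the regions in increasing index order,
 * which biases allocations towards lower regions; the higher regions then
 * tend to stay free and unreferenced, letting the hardware keep them in
 * low-power states.
 */
#include <linux/list.h>
#include <linux/mm.h>

#define NR_MEM_REGIONS  256     /* hypothetical upper bound */

struct region_free_list {
        struct list_head list;          /* free pages of this region/order */
        unsigned long    nr_free;
};

struct region_sorted_free_area {
        struct region_free_list region[NR_MEM_REGIONS];
};

/* Hand out a free page from the lowest-numbered region that has one. */
static struct page *alloc_from_lowest_region(struct region_sorted_free_area *area)
{
        int i;

        for (i = 0; i < NR_MEM_REGIONS; i++) {
                struct region_free_list *rfl = &area->region[i];
                struct page *page;

                if (list_empty(&rfl->list))
                        continue;

                page = list_first_entry(&rfl->list, struct page, lru);
                list_del(&page->lru);
                rfl->nr_free--;
                return page;
        }

        /*
         * All regions currently held by the page allocator are empty: in the
         * split-allocator design, this is the point where the Page-Allocator
         * would request a fresh region from the Region-Allocator.
         */
        return NULL;
}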


Experimental Results:
====================

I'll include the detailed results as a reply to this cover-letter, since they
can benefit from a dedicated discussion rather than being squeezed in here.


This patchset is hosted in the git tree below. It applies cleanly on
v3.11-rc7.

git://github.com/srivatsabhat/linux.git mem-power-mgmt-v3


Changes in v3:
=============

* The major change is the splitting of the memory allocator into a
  Page-Allocator front-end and a Region-Allocator back-end. This helps in
  keeping movable and unmovable allocations separated across region
  boundaries, thus improving the opportunities for consolidating memory
  allocations to a minimum number of regions.

* A bunch of fixes all over, especially in the handling of freepage
  migratetypes and the buddy merging code.


Changes in v2:
=============

* Fixed a bug in the NUMA case.
* Added a new optimized O(log n) sorting algorithm to speed up region-sorting
  of the buddy freelists (patch 9). The efficiency of this new algorithm and
  its design allow us to support large amounts of RAM quite easily.
* Added light-weight targeted compaction/reclaim support for memory power
  management (patches 10-14).
* Revamped the cover-letter to better explain the idea behind memory power
  management and this patchset.


Some important TODOs:
====================

1. Revive the targeted region-compaction/evacuation code and make it
   work well with the new Page-Allocator - Region-Allocator split design.

2. Add optimizations to improve the performance and reduce the overhead in
   the MM hot paths.

3. Add support for making this patchset work with sparsemem, THP, memcg etc.


References:
----------

[1]. LWN article that explains the goals and the design of my Memory Power
     Management patchset:
     http://lwn.net/Articles/547439/

[2]. v2 of the "Sorted-buddy" patchset with support for targeted memory
     region compaction:
     http://lwn.net/Articles/546696/

     LWN article describing this design: http://lwn.net/Articles/547439/

     v1 of the patchset:
     http://thread.gmane.org/gmane.linux.power-management.general/28498

[3]. Estimate of potential power savings on a Samsung Exynos board
     http://article.gmane.org/gmane.linux.kernel.mm/65935

[4]. C. Lefurgy, K. Rajamani, F. Rawson, W. Felter, M. Kistler, and Tom Keller.
     Energy management for commercial servers. In IEEE Computer, pages 39–48,
     Dec 2003.
     Link: researcher.ibm.com/files/us-lefurgy/computer2003.pdf

[5]. ACPI 5.0 and MPST support
     http://www.acpi.info/spec.htm
     Section 5.2.21 Memory Power State Table (MPST)

[6]. Prototype implementation of parsing of ACPI 5.0 MPST tables, by Srinivas
     Pandruvada.
     https://lkml.org/lkml/2013/4/18/349

[7]. Review comments suggesting modifying the buddy allocator to be aware of
     memory regions:
     http://article.gmane.org/gmane.linux.power-management.general/24862
     http://article.gmane.org/gmane.linux.power-management.general/25061
     http://article.gmane.org/gmane.linux.kernel.mm/64689

[8]. Patch series that implemented the node-region-zone hierarchy design:
     http://lwn.net/Articles/445045/
     http://thread.gmane.org/gmane.linux.kernel.mm/63840

     Summary of the discussion on that patchset:
     http://article.gmane.org/gmane.linux.power-management.general/25061

     Forward-port of that patchset to 3.7-rc3 (minimal x86 config)
     http://thread.gmane.org/gmane.linux.kernel.mm/89202

[9]. Disadvantages of having memory regions in the hierarchy between nodes and
     zones:
     http://article.gmane.org/gmane.linux.kernel.mm/63849


 Srivatsa S. Bhat (35):
      mm: Restructure free-page stealing code and fix a bug
      mm: Fix the value of fallback_migratetype in alloc_extfrag tracepoint
      mm: Introduce memory regions data-structure to capture region boundaries within nodes
      mm: Initialize node memory regions during boot
      mm: Introduce and initialize zone memory regions
      mm: Add helpers to retrieve node region and zone region for a given page
      mm: Add data-structures to describe memory regions within the zones' freelists
      mm: Demarcate and maintain pageblocks in region-order in the zones' freelists
      mm: Track the freepage migratetype of pages accurately
      mm: Use the correct migratetype during buddy merging
      mm: Add an optimized version of del_from_freelist to keep page allocation fast
      bitops: Document the difference in indexing between fls() and __fls()
      mm: A new optimized O(log n) sorting algo to speed up buddy-sorting
      mm: Add support to accurately track per-memory-region allocation
      mm: Print memory region statistics to understand the buddy allocator behavior
      mm: Enable per-memory-region fragmentation stats in pagetypeinfo
      mm: Add aggressive bias to prefer lower regions during page allocation
      mm: Introduce a "Region Allocator" to manage entire memory regions
      mm: Add a mechanism to add pages to buddy freelists in bulk
      mm: Provide a mechanism to delete pages from buddy freelists in bulk
      mm: Provide a mechanism to release free memory to the region allocator
      mm: Provide a mechanism to request free memory from the region allocator
      mm: Maintain the counter for freepages in the region allocator
      mm: Propagate the sorted-buddy bias for picking free regions, to region allocator
      mm: Fix vmstat to also account for freepages in the region allocator
      mm: Drop some very expensive sorted-buddy related checks under DEBUG_PAGEALLOC
      mm: Connect Page Allocator(PA) to Region Allocator(RA); add PA => RA flow
      mm: Connect Page Allocator(PA) to Region Allocator(RA); add PA <= RA flow
      mm: Update the freepage migratetype of pages during region allocation
      mm: Provide a mechanism to check if a given page is in the region allocator
      mm: Add a way to request pages of a particular region from the region allocator
      mm: Modify move_freepages() to handle pages in the region allocator properly
      mm: Never change migratetypes of pageblocks during freepage stealing
      mm: Set pageblock migratetype when allocating regions from region allocator
      mm: Use a cache between page-allocator and region-allocator


 arch/x86/include/asm/bitops.h      |    4 
 include/asm-generic/bitops/__fls.h |    5 
 include/linux/mm.h                 |   42 ++
 include/linux/mmzone.h             |   75 +++
 include/trace/events/kmem.h        |   10 
 mm/compaction.c                    |    2 
 mm/page_alloc.c                    |  935 +++++++++++++++++++++++++++++++++---
 mm/vmstat.c                        |  130 +++++
 8 files changed, 1124 insertions(+), 79 deletions(-)


Regards,
Srivatsa S. Bhat
IBM Linux Technology Center


* [RFC PATCH v3 01/35] mm: Restructure free-page stealing code and fix a bug
  2013-08-30 13:13 ` Srivatsa S. Bhat
@ 2013-08-30 13:14   ` Srivatsa S. Bhat
From: Srivatsa S. Bhat @ 2013-08-30 13:14 UTC (permalink / raw)
  To: akpm, mgorman, hannes, tony.luck, matthew.garrett, dave, riel,
	arjan, srinivas.pandruvada, willy, kamezawa.hiroyu, lenb, rjw
  Cc: gargankita, paulmck, svaidy, andi, isimatu.yasuaki,
	santosh.shilimkar, kosaki.motohiro, srivatsa.bhat, linux-pm,
	linux-mm, linux-kernel

The free-page stealing code in __rmqueue_fallback() is somewhat hard to
follow, and has an incredible amount of subtlety hidden inside!

First off, there is a minor bug in the reporting of change-of-ownership of
pageblocks. Under some conditions, we try to move up to 'pageblock_nr_pages'
pages to the preferred allocation list. But we change the ownership of that
pageblock to the preferred type only if we manage to successfully move at
least half of that pageblock (or if page_group_by_mobility_disabled is set).

However, the current code ignores the latter condition and sets the
'migratetype' variable to the preferred type, irrespective of whether we
actually changed the pageblock migratetype of that block or not. So, the
page_alloc_extfrag tracepoint can end up printing incorrect info (i.e.,
'change_ownership' might be shown as 1 when it should have been 0).

Fixing this involves moving the update of the 'migratetype' variable to the
right place. But looking closer, we observe that the 'migratetype' variable
is subsequently used for checks such as "is_migrate_cma()". Obviously the
intent there is to check whether the *fallback* type is MIGRATE_CMA, but since
we have already set the 'migratetype' variable to start_migratetype, we end up
checking whether the *preferred* type is MIGRATE_CMA!

To make things more interesting, this actually doesn't cause a bug in practice,
because we never change *anything* if the fallback type is CMA.

So, restructure the code in such a way that it is trivial to understand what
is going on, and also fix the above-mentioned bug. While at it, also add a
comment explaining the subtlety behind the migratetype used in the call to
expand().

Signed-off-by: Srivatsa S. Bhat <srivatsa.bhat@linux.vnet.ibm.com>
---

 mm/page_alloc.c |   95 ++++++++++++++++++++++++++++++++++---------------------
 1 file changed, 59 insertions(+), 36 deletions(-)

diff --git a/mm/page_alloc.c b/mm/page_alloc.c
index b100255..d4b8198 100644
--- a/mm/page_alloc.c
+++ b/mm/page_alloc.c
@@ -1007,6 +1007,52 @@ static void change_pageblock_range(struct page *pageblock_page,
 	}
 }
 
+/*
+ * If breaking a large block of pages, move all free pages to the preferred
+ * allocation list. If falling back for a reclaimable kernel allocation, be
+ * more aggressive about taking ownership of free pages.
+ *
+ * On the other hand, never change migration type of MIGRATE_CMA pageblocks
+ * nor move CMA pages to different free lists. We don't want unmovable pages
+ * to be allocated from MIGRATE_CMA areas.
+ *
+ * Returns the new migratetype of the pageblock (or the same old migratetype
+ * if it was unchanged).
+ */
+static int try_to_steal_freepages(struct zone *zone, struct page *page,
+				  int start_type, int fallback_type)
+{
+	int current_order = page_order(page);
+
+	if (is_migrate_cma(fallback_type))
+		return fallback_type;
+
+	/* Take ownership for orders >= pageblock_order */
+	if (current_order >= pageblock_order) {
+		change_pageblock_range(page, current_order, start_type);
+		return start_type;
+	}
+
+	if (current_order >= pageblock_order / 2 ||
+	    start_type == MIGRATE_RECLAIMABLE ||
+	    page_group_by_mobility_disabled) {
+		int pages;
+
+		pages = move_freepages_block(zone, page, start_type);
+
+		/* Claim the whole block if over half of it is free */
+		if (pages >= (1 << (pageblock_order-1)) ||
+				page_group_by_mobility_disabled) {
+
+			set_pageblock_migratetype(page, start_type);
+			return start_type;
+		}
+
+	}
+
+	return fallback_type;
+}
+
 /* Remove an element from the buddy allocator from the fallback list */
 static inline struct page *
 __rmqueue_fallback(struct zone *zone, int order, int start_migratetype)
@@ -1014,7 +1060,7 @@ __rmqueue_fallback(struct zone *zone, int order, int start_migratetype)
 	struct free_area * area;
 	int current_order;
 	struct page *page;
-	int migratetype, i;
+	int migratetype, new_type, i;
 
 	/* Find the largest possible block of pages in the other list */
 	for (current_order = MAX_ORDER-1; current_order >= order;
@@ -1034,51 +1080,28 @@ __rmqueue_fallback(struct zone *zone, int order, int start_migratetype)
 					struct page, lru);
 			area->nr_free--;
 
-			/*
-			 * If breaking a large block of pages, move all free
-			 * pages to the preferred allocation list. If falling
-			 * back for a reclaimable kernel allocation, be more
-			 * aggressive about taking ownership of free pages
-			 *
-			 * On the other hand, never change migration
-			 * type of MIGRATE_CMA pageblocks nor move CMA
-			 * pages on different free lists. We don't
-			 * want unmovable pages to be allocated from
-			 * MIGRATE_CMA areas.
-			 */
-			if (!is_migrate_cma(migratetype) &&
-			    (current_order >= pageblock_order / 2 ||
-			     start_migratetype == MIGRATE_RECLAIMABLE ||
-			     page_group_by_mobility_disabled)) {
-				int pages;
-				pages = move_freepages_block(zone, page,
-								start_migratetype);
-
-				/* Claim the whole block if over half of it is free */
-				if (pages >= (1 << (pageblock_order-1)) ||
-						page_group_by_mobility_disabled)
-					set_pageblock_migratetype(page,
-								start_migratetype);
-
-				migratetype = start_migratetype;
-			}
+			new_type = try_to_steal_freepages(zone, page,
+							  start_migratetype,
+							  migratetype);
 
 			/* Remove the page from the freelists */
 			list_del(&page->lru);
 			rmv_page_order(page);
 
-			/* Take ownership for orders >= pageblock_order */
-			if (current_order >= pageblock_order &&
-			    !is_migrate_cma(migratetype))
-				change_pageblock_range(page, current_order,
-							start_migratetype);
-
+			/*
+			 * Borrow the excess buddy pages as well, irrespective
+			 * of whether we stole freepages, or took ownership of
+			 * the pageblock or not.
+			 *
+			 * Exception: When borrowing from MIGRATE_CMA, release
+			 * the excess buddy pages to CMA itself.
+			 */
 			expand(zone, page, order, current_order, area,
 			       is_migrate_cma(migratetype)
 			     ? migratetype : start_migratetype);
 
 			trace_mm_page_alloc_extfrag(page, order, current_order,
-				start_migratetype, migratetype);
+				start_migratetype, new_type);
 
 			return page;
 		}


* [RFC PATCH v3 02/35] mm: Fix the value of fallback_migratetype in alloc_extfrag tracepoint
  2013-08-30 13:13 ` Srivatsa S. Bhat
@ 2013-08-30 13:14   ` Srivatsa S. Bhat
From: Srivatsa S. Bhat @ 2013-08-30 13:14 UTC (permalink / raw)
  To: akpm, mgorman, hannes, tony.luck, matthew.garrett, dave, riel,
	arjan, srinivas.pandruvada, willy, kamezawa.hiroyu, lenb, rjw
  Cc: gargankita, paulmck, svaidy, andi, isimatu.yasuaki,
	santosh.shilimkar, kosaki.motohiro, srivatsa.bhat, linux-pm,
	linux-mm, linux-kernel

In the current code, the value of fallback_migratetype that is printed via
the mm_page_alloc_extfrag tracepoint is the value of the migratetype *after*
it has been set to the preferred migratetype (if the ownership was changed).
Obviously that wasn't the original intent. (We already have a separate
'change_ownership' field to tell whether the ownership of the pageblock was
changed from the fallback_migratetype to the preferred type.)

The intent of the fallback_migratetype field is to show the migratetype
from which we borrowed pages in order to satisfy the allocation request.
So fix the code to print that value correctly.

Signed-off-by: Srivatsa S. Bhat <srivatsa.bhat@linux.vnet.ibm.com>
---

 include/trace/events/kmem.h |   10 +++++++---
 mm/page_alloc.c             |    5 +++--
 2 files changed, 10 insertions(+), 5 deletions(-)

diff --git a/include/trace/events/kmem.h b/include/trace/events/kmem.h
index 6bc943e..d0c6134 100644
--- a/include/trace/events/kmem.h
+++ b/include/trace/events/kmem.h
@@ -268,11 +268,13 @@ TRACE_EVENT(mm_page_alloc_extfrag,
 
 	TP_PROTO(struct page *page,
 			int alloc_order, int fallback_order,
-			int alloc_migratetype, int fallback_migratetype),
+			int alloc_migratetype, int fallback_migratetype,
+			int change_ownership),
 
 	TP_ARGS(page,
 		alloc_order, fallback_order,
-		alloc_migratetype, fallback_migratetype),
+		alloc_migratetype, fallback_migratetype,
+		change_ownership),
 
 	TP_STRUCT__entry(
 		__field(	struct page *,	page			)
@@ -280,6 +282,7 @@ TRACE_EVENT(mm_page_alloc_extfrag,
 		__field(	int,		fallback_order		)
 		__field(	int,		alloc_migratetype	)
 		__field(	int,		fallback_migratetype	)
+		__field(	int,		change_ownership	)
 	),
 
 	TP_fast_assign(
@@ -288,6 +291,7 @@ TRACE_EVENT(mm_page_alloc_extfrag,
 		__entry->fallback_order		= fallback_order;
 		__entry->alloc_migratetype	= alloc_migratetype;
 		__entry->fallback_migratetype	= fallback_migratetype;
+		__entry->change_ownership	= change_ownership;
 	),
 
 	TP_printk("page=%p pfn=%lu alloc_order=%d fallback_order=%d pageblock_order=%d alloc_migratetype=%d fallback_migratetype=%d fragmenting=%d change_ownership=%d",
@@ -299,7 +303,7 @@ TRACE_EVENT(mm_page_alloc_extfrag,
 		__entry->alloc_migratetype,
 		__entry->fallback_migratetype,
 		__entry->fallback_order < pageblock_order,
-		__entry->alloc_migratetype == __entry->fallback_migratetype)
+		__entry->change_ownership)
 );
 
 #endif /* _TRACE_KMEM_H */
diff --git a/mm/page_alloc.c b/mm/page_alloc.c
index d4b8198..b86d7e3 100644
--- a/mm/page_alloc.c
+++ b/mm/page_alloc.c
@@ -1100,8 +1100,9 @@ __rmqueue_fallback(struct zone *zone, int order, int start_migratetype)
 			       is_migrate_cma(migratetype)
 			     ? migratetype : start_migratetype);
 
-			trace_mm_page_alloc_extfrag(page, order, current_order,
-				start_migratetype, new_type);
+			trace_mm_page_alloc_extfrag(page, order,
+				current_order, start_migratetype, migratetype,
+				new_type == start_migratetype);
 
 			return page;
 		}


* [RFC PATCH v3 03/35] mm: Introduce memory regions data-structure to capture region boundaries within nodes
  2013-08-30 13:13 ` Srivatsa S. Bhat
@ 2013-08-30 13:14   ` Srivatsa S. Bhat
From: Srivatsa S. Bhat @ 2013-08-30 13:14 UTC (permalink / raw)
  To: akpm, mgorman, hannes, tony.luck, matthew.garrett, dave, riel,
	arjan, srinivas.pandruvada, willy, kamezawa.hiroyu, lenb, rjw
  Cc: gargankita, paulmck, svaidy, andi, isimatu.yasuaki,
	santosh.shilimkar, kosaki.motohiro, srivatsa.bhat, linux-pm,
	linux-mm, linux-kernel

The memory within a node can be divided into regions of memory that can be
independently power-managed. That is, chunks of memory can be transitioned
(manually or automatically) to low-power states based on the frequency of
references to that region. For example, if a memory chunk is not referenced
for a given threshold amount of time, the hardware (memory controller) can
decide to put that piece of memory into a content-preserving low-power state.
And of course, on the next reference to that chunk of memory, it will be
transitioned back to full-power for read/write operations.

So, the Linux MM can take advantage of this feature by managing the available
memory with an eye towards power savings - i.e., by keeping the memory
allocations/references consolidated to a minimum number of such power-manageable
memory regions. In order to do so, the first step is to teach the MM about
the boundaries of these regions - and to capture that info, we introduce a new
data-structure called "memory regions".

[Also, the concept of memory regions could potentially be extended to work
with different classes of memory like PCM (Phase Change Memory) etc., and
hence is not limited to power management alone.]

We already sub-divide a node's memory into zones, based on some well-known
constraints. So the question is where memory regions fit into this hierarchy.
Instead of artificially trying to fit them into the hierarchy one way or the
other, we choose to simply capture the region boundaries in a parallel
data-structure, since the region boundaries most likely won't align naturally
with the zone boundaries, or vice-versa.

But of course, memory regions are sub-divisions *within* a node, so it makes
sense to keep the data-structures in the node's struct pglist_data. (Thus
this placement makes memory regions parallel to zones in that node).

Once we capture the region boundaries in the memory regions data-structure,
we can influence MM decisions at various places, such as page allocation,
reclamation etc, in order to perform power-aware memory management.

Signed-off-by: Srivatsa S. Bhat <srivatsa.bhat@linux.vnet.ibm.com>
---

 include/linux/mmzone.h |   12 ++++++++++++
 1 file changed, 12 insertions(+)

diff --git a/include/linux/mmzone.h b/include/linux/mmzone.h
index af4a3b7..4246620 100644
--- a/include/linux/mmzone.h
+++ b/include/linux/mmzone.h
@@ -35,6 +35,8 @@
  */
 #define PAGE_ALLOC_COSTLY_ORDER 3
 
+#define MAX_NR_NODE_REGIONS	256
+
 enum {
 	MIGRATE_UNMOVABLE,
 	MIGRATE_RECLAIMABLE,
@@ -708,6 +710,14 @@ struct node_active_region {
 extern struct page *mem_map;
 #endif
 
+struct node_mem_region {
+	unsigned long start_pfn;
+	unsigned long end_pfn;
+	unsigned long present_pages;
+	unsigned long spanned_pages;
+	struct pglist_data *pgdat;
+};
+
 /*
  * The pg_data_t structure is used in machines with CONFIG_DISCONTIGMEM
  * (mostly NUMA machines?) to denote a higher-level memory zone than the
@@ -724,6 +734,8 @@ typedef struct pglist_data {
 	struct zone node_zones[MAX_NR_ZONES];
 	struct zonelist node_zonelists[MAX_ZONELISTS];
 	int nr_zones;
+	struct node_mem_region node_regions[MAX_NR_NODE_REGIONS];
+	int nr_node_regions;
 #ifdef CONFIG_FLAT_NODE_MEM_MAP	/* means !SPARSEMEM */
 	struct page *node_mem_map;
 #ifdef CONFIG_MEMCG


* [RFC PATCH v3 04/35] mm: Initialize node memory regions during boot
  2013-08-30 13:13 ` Srivatsa S. Bhat
@ 2013-08-30 13:15   ` Srivatsa S. Bhat
  -1 siblings, 0 replies; 100+ messages in thread
From: Srivatsa S. Bhat @ 2013-08-30 13:15 UTC (permalink / raw)
  To: akpm, mgorman, hannes, tony.luck, matthew.garrett, dave, riel,
	arjan, srinivas.pandruvada, willy, kamezawa.hiroyu, lenb, rjw
  Cc: gargankita, paulmck, svaidy, andi, isimatu.yasuaki,
	santosh.shilimkar, kosaki.motohiro, srivatsa.bhat, linux-pm,
	linux-mm, linux-kernel

At boot time, initialize each node's memory-region structures with the
information about the region boundaries.
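
For reference, with the 512 MB region size hard-coded below and 4 KB pages,
each region spans 2^(29 - 12) = 131072 pages. A small userspace sketch of the
same arithmetic (illustrative only; the 4 KB page size and the 2 GB node are
assumptions, not something this patch depends on):

    #include <stdio.h>

    #define PAGE_SHIFT          12                              /* 4 KB pages */
    #define MEM_REGION_SHIFT    (29 - PAGE_SHIFT)               /* 512 MB */
    #define MEM_REGION_SIZE     (1UL << MEM_REGION_SHIFT)       /* in pages */

    int main(void)
    {
            unsigned long node_pages = 1UL << (31 - PAGE_SHIFT);    /* 2 GB */

            printf("pages per region      : %lu\n", MEM_REGION_SIZE);
            printf("regions in a 2 GB node: %lu\n",
                   (node_pages + MEM_REGION_SIZE - 1) / MEM_REGION_SIZE);
            return 0;
    }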

Based-on-patch-by: Ankita Garg <gargankita@gmail.com>
Signed-off-by: Srivatsa S. Bhat <srivatsa.bhat@linux.vnet.ibm.com>
---

 include/linux/mm.h |    4 ++++
 mm/page_alloc.c    |   28 ++++++++++++++++++++++++++++
 2 files changed, 32 insertions(+)

diff --git a/include/linux/mm.h b/include/linux/mm.h
index f022460..18fdec4 100644
--- a/include/linux/mm.h
+++ b/include/linux/mm.h
@@ -627,6 +627,10 @@ static inline pte_t maybe_mkwrite(pte_t pte, struct vm_area_struct *vma)
 #define LAST_NID_MASK		((1UL << LAST_NID_WIDTH) - 1)
 #define ZONEID_MASK		((1UL << ZONEID_SHIFT) - 1)
 
+/* Hard-code memory region size to be 512 MB for now. */
+#define MEM_REGION_SHIFT	(29 - PAGE_SHIFT)
+#define MEM_REGION_SIZE		(1UL << MEM_REGION_SHIFT)
+
 static inline enum zone_type page_zonenum(const struct page *page)
 {
 	return (page->flags >> ZONES_PGSHIFT) & ZONES_MASK;
diff --git a/mm/page_alloc.c b/mm/page_alloc.c
index b86d7e3..bb2d5d4 100644
--- a/mm/page_alloc.c
+++ b/mm/page_alloc.c
@@ -4809,6 +4809,33 @@ static void __init_refok alloc_node_mem_map(struct pglist_data *pgdat)
 #endif /* CONFIG_FLAT_NODE_MEM_MAP */
 }
 
+static void __meminit init_node_memory_regions(struct pglist_data *pgdat)
+{
+	int nid = pgdat->node_id;
+	unsigned long start_pfn = pgdat->node_start_pfn;
+	unsigned long end_pfn = start_pfn + pgdat->node_spanned_pages;
+	struct node_mem_region *region;
+	unsigned long i, absent;
+	int idx;
+
+	for (i = start_pfn, idx = 0; i < end_pfn;
+				i += region->spanned_pages, idx++) {
+
+		region = &pgdat->node_regions[idx];
+		region->pgdat = pgdat;
+		region->start_pfn = i;
+		region->spanned_pages = min(MEM_REGION_SIZE, end_pfn - i);
+		region->end_pfn = region->start_pfn + region->spanned_pages;
+
+		absent = __absent_pages_in_range(nid, region->start_pfn,
+						 region->end_pfn);
+
+		region->present_pages = region->spanned_pages - absent;
+	}
+
+	pgdat->nr_node_regions = idx;
+}
+
 void __paginginit free_area_init_node(int nid, unsigned long *zones_size,
 		unsigned long node_start_pfn, unsigned long *zholes_size)
 {
@@ -4837,6 +4864,7 @@ void __paginginit free_area_init_node(int nid, unsigned long *zones_size,
 
 	free_area_init_core(pgdat, start_pfn, end_pfn,
 			    zones_size, zholes_size);
+	init_node_memory_regions(pgdat);
 }
 
 #ifdef CONFIG_HAVE_MEMBLOCK_NODE_MAP


^ permalink raw reply related	[flat|nested] 100+ messages in thread

* [RFC PATCH v3 05/35] mm: Introduce and initialize zone memory regions
  2013-08-30 13:13 ` Srivatsa S. Bhat
@ 2013-08-30 13:15   ` Srivatsa S. Bhat
  -1 siblings, 0 replies; 100+ messages in thread
From: Srivatsa S. Bhat @ 2013-08-30 13:15 UTC (permalink / raw)
  To: akpm, mgorman, hannes, tony.luck, matthew.garrett, dave, riel,
	arjan, srinivas.pandruvada, willy, kamezawa.hiroyu, lenb, rjw
  Cc: gargankita, paulmck, svaidy, andi, isimatu.yasuaki,
	santosh.shilimkar, kosaki.motohiro, srivatsa.bhat, linux-pm,
	linux-mm, linux-kernel

Memory region boundaries don't necessarily align with zone boundaries. So we
need to maintain a zone-level mapping of the absolute memory region
boundaries.

"Node Memory Regions" will be used to capture the absolute region boundaries.
Add "Zone Memory Regions" to track the subsets of the absolute memory regions
that fall within the zone boundaries.

Eg:

	|<----------------------Node---------------------->|
	 __________________________________________________
	|      Node mem reg 0 	 |      Node mem reg 1     |  (Absolute region
	|________________________|_________________________|   boundaries)

	 __________________________________________________
	|    ZONE_DMA   |	    ZONE_NORMAL		   |
	|               |                                  |
	|<--- ZMR 0 --->|<-ZMR0->|<-------- ZMR 1 -------->|
	|_______________|________|_________________________|


In the above figure,

ZONE_DMA will have only 1 zone memory region (say, Zone mem reg 0) which is a
subset of Node mem reg 0 (ie., the portion of Node mem reg 0 that intersects
with ZONE_DMA).

ZONE_NORMAL will have 2 zone memory regions (say, Zone mem reg 0 and
Zone mem reg 1) which are subsets of Node mem reg 0 and Node mem reg 1
respectively, that intersect with ZONE_NORMAL's range.

Most of the MM algorithms (like page allocation etc.) work within a zone,
hence such a zone-level mapping of the absolute region boundaries will be
very useful for influencing the MM decisions at those places.
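
To make the intersection idea concrete, here is a small standalone sketch
(illustrative only; the pfn values are made up to roughly match the figure)
that derives a zone memory region as the overlap of a zone's pfn range with
a node memory region's pfn range:

    #include <stdio.h>

    struct range { unsigned long start_pfn, end_pfn; };

    /* Overlap of @zone with @node_reg; returns 0 if they don't intersect. */
    static int zone_mem_region(struct range zone, struct range node_reg,
                               struct range *zmr)
    {
            zmr->start_pfn = zone.start_pfn > node_reg.start_pfn ?
                                    zone.start_pfn : node_reg.start_pfn;
            zmr->end_pfn = zone.end_pfn < node_reg.end_pfn ?
                                    zone.end_pfn : node_reg.end_pfn;
            return zmr->start_pfn < zmr->end_pfn;
    }

    int main(void)
    {
            struct range zone_normal = {   4096, 524288 }; /* made-up pfns */
            struct range node_reg0   = {      0, 131072 };
            struct range node_reg1   = { 131072, 524288 };
            struct range zmr;

            if (zone_mem_region(zone_normal, node_reg0, &zmr))
                    printf("ZONE_NORMAL ZMR0: pfns %lu - %lu\n",
                           zmr.start_pfn, zmr.end_pfn);
            if (zone_mem_region(zone_normal, node_reg1, &zmr))
                    printf("ZONE_NORMAL ZMR1: pfns %lu - %lu\n",
                           zmr.start_pfn, zmr.end_pfn);
            return 0;
    }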

Signed-off-by: Srivatsa S. Bhat <srivatsa.bhat@linux.vnet.ibm.com>
---

 include/linux/mmzone.h |   11 +++++++++
 mm/page_alloc.c        |   62 +++++++++++++++++++++++++++++++++++++++++++++++-
 2 files changed, 72 insertions(+), 1 deletion(-)

diff --git a/include/linux/mmzone.h b/include/linux/mmzone.h
index 4246620..010ab5b 100644
--- a/include/linux/mmzone.h
+++ b/include/linux/mmzone.h
@@ -36,6 +36,7 @@
 #define PAGE_ALLOC_COSTLY_ORDER 3
 
 #define MAX_NR_NODE_REGIONS	256
+#define MAX_NR_ZONE_REGIONS	MAX_NR_NODE_REGIONS
 
 enum {
 	MIGRATE_UNMOVABLE,
@@ -312,6 +313,13 @@ enum zone_type {
 
 #ifndef __GENERATING_BOUNDS_H
 
+struct zone_mem_region {
+	unsigned long start_pfn;
+	unsigned long end_pfn;
+	unsigned long present_pages;
+	unsigned long spanned_pages;
+};
+
 struct zone {
 	/* Fields commonly accessed by the page allocator */
 
@@ -369,6 +377,9 @@ struct zone {
 #endif
 	struct free_area	free_area[MAX_ORDER];
 
+	struct zone_mem_region	zone_regions[MAX_NR_ZONE_REGIONS];
+	int 			nr_zone_regions;
+
 #ifndef CONFIG_SPARSEMEM
 	/*
 	 * Flags for a pageblock_nr_pages block. See pageblock-flags.h.
diff --git a/mm/page_alloc.c b/mm/page_alloc.c
index bb2d5d4..05cedbb 100644
--- a/mm/page_alloc.c
+++ b/mm/page_alloc.c
@@ -4836,6 +4836,66 @@ static void __meminit init_node_memory_regions(struct pglist_data *pgdat)
 	pgdat->nr_node_regions = idx;
 }
 
+static void __meminit init_zone_memory_regions(struct pglist_data *pgdat)
+{
+	unsigned long start_pfn, end_pfn, absent;
+	unsigned long z_start_pfn, z_end_pfn;
+	int i, j, idx, nid = pgdat->node_id;
+	struct node_mem_region *node_region;
+	struct zone_mem_region *zone_region;
+	struct zone *z;
+
+	for (i = 0, j = 0; i < pgdat->nr_zones; i++) {
+		z = &pgdat->node_zones[i];
+		z_start_pfn = z->zone_start_pfn;
+		z_end_pfn = z->zone_start_pfn + z->spanned_pages;
+		idx = 0;
+
+		for ( ; j < pgdat->nr_node_regions; j++) {
+			node_region = &pgdat->node_regions[j];
+
+			/*
+			 * Skip node memory regions that don't intersect with
+			 * this zone.
+			 */
+			if (node_region->end_pfn <= z_start_pfn)
+				continue; /* Move to next higher node region */
+
+			if (node_region->start_pfn >= z_end_pfn)
+				break; /* Move to next higher zone */
+
+			start_pfn = max(z_start_pfn, node_region->start_pfn);
+			end_pfn = min(z_end_pfn, node_region->end_pfn);
+
+			zone_region = &z->zone_regions[idx];
+			zone_region->start_pfn = start_pfn;
+			zone_region->end_pfn = end_pfn;
+			zone_region->spanned_pages = end_pfn - start_pfn;
+
+			absent = __absent_pages_in_range(nid, start_pfn,
+						         end_pfn);
+			zone_region->present_pages =
+					zone_region->spanned_pages - absent;
+
+			idx++;
+		}
+
+		z->nr_zone_regions = idx;
+
+		/*
+		 * Revisit the last visited node memory region, in case it
+		 * spans multiple zones.
+		 */
+		j--;
+	}
+}
+
+static void __meminit init_memory_regions(struct pglist_data *pgdat)
+{
+	init_node_memory_regions(pgdat);
+	init_zone_memory_regions(pgdat);
+}
+
 void __paginginit free_area_init_node(int nid, unsigned long *zones_size,
 		unsigned long node_start_pfn, unsigned long *zholes_size)
 {
@@ -4864,7 +4924,7 @@ void __paginginit free_area_init_node(int nid, unsigned long *zones_size,
 
 	free_area_init_core(pgdat, start_pfn, end_pfn,
 			    zones_size, zholes_size);
-	init_node_memory_regions(pgdat);
+	init_memory_regions(pgdat);
 }
 
 #ifdef CONFIG_HAVE_MEMBLOCK_NODE_MAP


^ permalink raw reply related	[flat|nested] 100+ messages in thread

* [RFC PATCH v3 06/35] mm: Add helpers to retrieve node region and zone region for a given page
  2013-08-30 13:13 ` Srivatsa S. Bhat
@ 2013-08-30 13:15   ` Srivatsa S. Bhat
  -1 siblings, 0 replies; 100+ messages in thread
From: Srivatsa S. Bhat @ 2013-08-30 13:15 UTC (permalink / raw)
  To: akpm, mgorman, hannes, tony.luck, matthew.garrett, dave, riel,
	arjan, srinivas.pandruvada, willy, kamezawa.hiroyu, lenb, rjw
  Cc: gargankita, paulmck, svaidy, andi, isimatu.yasuaki,
	santosh.shilimkar, kosaki.motohiro, srivatsa.bhat, linux-pm,
	linux-mm, linux-kernel

Given a page, we would like to have an efficient mechanism to find out
the node memory region and the zone memory region to which it belongs.

Since the node is assumed to be divided into equal-sized node memory
regions, the node memory region can be obtained by simply right-shifting the
page's pfn offset within the node by 'MEM_REGION_SHIFT'.

But finding the corresponding zone memory region's index within the zone is
not that straightforward. To get an O(1) lookup, define a zone_region_idx[]
array that stores the zone memory region indices for every node memory
region.

To illustrate, consider the following example:

	|<----------------------Node---------------------->|
	 __________________________________________________
	|      Node mem reg 0 	 |      Node mem reg 1     |  (Absolute region
	|________________________|_________________________|   boundaries)

	 __________________________________________________
	|    ZONE_DMA   |	    ZONE_NORMAL		   |
	|               |                                  |
	|<--- ZMR 0 --->|<-ZMR0->|<-------- ZMR 1 -------->|
	|_______________|________|_________________________|


In the above figure,

Node mem region 0:
------------------
This region corresponds to the first zone mem region in ZONE_DMA and also
the first zone mem region in ZONE_NORMAL. Hence its index array would look
like this:
    node_regions[0].zone_region_idx[ZONE_DMA]     == 0
    node_regions[0].zone_region_idx[ZONE_NORMAL]  == 0


Node mem region 1:
------------------
This region corresponds to the second zone mem region in ZONE_NORMAL. Hence
its index array would look like this:
    node_regions[1].zone_region_idx[ZONE_NORMAL]  == 1


Using this index array, we can quickly obtain the zone memory region to
which a given page belongs.
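
As a rough usage sketch (illustrative only, not part of this patch), a caller
that needs the zone memory region backing a given page would combine the
helpers below with the zone's region array like this:

    /* Illustrative only: return the zone memory region backing @page. */
    static inline struct zone_mem_region *page_zone_region(struct page *page)
    {
            struct zone *zone = page_zone(page);

            return &zone->zone_regions[page_zone_region_id(page)];
    }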

Signed-off-by: Srivatsa S. Bhat <srivatsa.bhat@linux.vnet.ibm.com>
---

 include/linux/mm.h     |   24 ++++++++++++++++++++++++
 include/linux/mmzone.h |    7 +++++++
 mm/page_alloc.c        |    1 +
 3 files changed, 32 insertions(+)

diff --git a/include/linux/mm.h b/include/linux/mm.h
index 18fdec4..52329d1 100644
--- a/include/linux/mm.h
+++ b/include/linux/mm.h
@@ -723,6 +723,30 @@ static inline struct zone *page_zone(const struct page *page)
 	return &NODE_DATA(page_to_nid(page))->node_zones[page_zonenum(page)];
 }
 
+static inline int page_node_region_id(const struct page *page,
+				      const pg_data_t *pgdat)
+{
+	return (page_to_pfn(page) - pgdat->node_start_pfn) >> MEM_REGION_SHIFT;
+}
+
+/**
+ * Return the index of the zone memory region to which the page belongs.
+ *
+ * Given a page, find the absolute (node) memory region as well as the zone to
+ * which it belongs. Then find the region within the zone that corresponds to
+ * that node memory region, and return its index.
+ */
+static inline int page_zone_region_id(const struct page *page)
+{
+	pg_data_t *pgdat = NODE_DATA(page_to_nid(page));
+	enum zone_type z_num = page_zonenum(page);
+	unsigned long node_region_idx;
+
+	node_region_idx = page_node_region_id(page, pgdat);
+
+	return pgdat->node_regions[node_region_idx].zone_region_idx[z_num];
+}
+
 #ifdef SECTION_IN_PAGE_FLAGS
 static inline void set_page_section(struct page *page, unsigned long section)
 {
diff --git a/include/linux/mmzone.h b/include/linux/mmzone.h
index 010ab5b..76d9ed2 100644
--- a/include/linux/mmzone.h
+++ b/include/linux/mmzone.h
@@ -726,6 +726,13 @@ struct node_mem_region {
 	unsigned long end_pfn;
 	unsigned long present_pages;
 	unsigned long spanned_pages;
+
+	/*
+	 * A physical (node) region could be split across multiple zones.
+	 * Store the indices of the corresponding regions of each such
+	 * zone for this physical (node) region.
+	 */
+	int zone_region_idx[MAX_NR_ZONES];
 	struct pglist_data *pgdat;
 };
 
diff --git a/mm/page_alloc.c b/mm/page_alloc.c
index 05cedbb..8ffd47b 100644
--- a/mm/page_alloc.c
+++ b/mm/page_alloc.c
@@ -4877,6 +4877,7 @@ static void __meminit init_zone_memory_regions(struct pglist_data *pgdat)
 			zone_region->present_pages =
 					zone_region->spanned_pages - absent;
 
+			node_region->zone_region_idx[zone_idx(z)] = idx;
 			idx++;
 		}
 


^ permalink raw reply related	[flat|nested] 100+ messages in thread

* [RFC PATCH v3 07/35] mm: Add data-structures to describe memory regions within the zones' freelists
  2013-08-30 13:13 ` Srivatsa S. Bhat
@ 2013-08-30 13:16   ` Srivatsa S. Bhat
  -1 siblings, 0 replies; 100+ messages in thread
From: Srivatsa S. Bhat @ 2013-08-30 13:16 UTC (permalink / raw)
  To: akpm, mgorman, hannes, tony.luck, matthew.garrett, dave, riel,
	arjan, srinivas.pandruvada, willy, kamezawa.hiroyu, lenb, rjw
  Cc: gargankita, paulmck, svaidy, andi, isimatu.yasuaki,
	santosh.shilimkar, kosaki.motohiro, srivatsa.bhat, linux-pm,
	linux-mm, linux-kernel

In order to influence page allocation decisions (i.e., to make page-allocation
region-aware), we need to be able to distinguish pageblocks belonging to
different zone memory regions within the zones' (buddy) freelists.

So, within every freelist in a zone, provide pointers to describe the
boundaries of zone memory regions and counters to track the number of free
pageblocks within each region.

Also, fix up the existing references to the freelists inside struct
free_area, since each freelist is now wrapped in a struct free_list.
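
For illustration only (not part of this patch), the per-region counters added
below could be consumed like this, e.g. to total up the free pageblocks of
one region at a given order across all migratetypes:

    /* Illustrative only: free pageblocks of @region_id at @order in @zone. */
    static unsigned long region_nr_free(struct zone *zone, int order,
                                        int region_id)
    {
            struct free_area *area = &zone->free_area[order];
            unsigned long nr_free = 0;
            int t;

            for (t = 0; t < MIGRATE_TYPES; t++)
                    nr_free += area->free_list[t].mr_list[region_id].nr_free;

            return nr_free;
    }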

Signed-off-by: Srivatsa S. Bhat <srivatsa.bhat@linux.vnet.ibm.com>
---

 include/linux/mmzone.h |   17 ++++++++++++++++-
 mm/compaction.c        |    2 +-
 mm/page_alloc.c        |   23 ++++++++++++-----------
 mm/vmstat.c            |    2 +-
 4 files changed, 30 insertions(+), 14 deletions(-)

diff --git a/include/linux/mmzone.h b/include/linux/mmzone.h
index 76d9ed2..201ab42 100644
--- a/include/linux/mmzone.h
+++ b/include/linux/mmzone.h
@@ -83,8 +83,23 @@ static inline int get_pageblock_migratetype(struct page *page)
 	return get_pageblock_flags_group(page, PB_migrate, PB_migrate_end);
 }
 
+struct mem_region_list {
+	struct list_head	*page_block;
+	unsigned long		nr_free;
+};
+
+struct free_list {
+	struct list_head	list;
+
+	/*
+	 * Demarcates pageblocks belonging to different regions within
+	 * this freelist.
+	 */
+	struct mem_region_list	mr_list[MAX_NR_ZONE_REGIONS];
+};
+
 struct free_area {
-	struct list_head	free_list[MIGRATE_TYPES];
+	struct free_list	free_list[MIGRATE_TYPES];
 	unsigned long		nr_free;
 };
 
diff --git a/mm/compaction.c b/mm/compaction.c
index 05ccb4c..13912f5 100644
--- a/mm/compaction.c
+++ b/mm/compaction.c
@@ -858,7 +858,7 @@ static int compact_finished(struct zone *zone,
 		struct free_area *area = &zone->free_area[order];
 
 		/* Job done if page is free of the right migratetype */
-		if (!list_empty(&area->free_list[cc->migratetype]))
+		if (!list_empty(&area->free_list[cc->migratetype].list))
 			return COMPACT_PARTIAL;
 
 		/* Job done if allocation would set block type */
diff --git a/mm/page_alloc.c b/mm/page_alloc.c
index 8ffd47b..fd6436d0 100644
--- a/mm/page_alloc.c
+++ b/mm/page_alloc.c
@@ -602,12 +602,13 @@ static inline void __free_one_page(struct page *page,
 		higher_buddy = higher_page + (buddy_idx - combined_idx);
 		if (page_is_buddy(higher_page, higher_buddy, order + 1)) {
 			list_add_tail(&page->lru,
-				&zone->free_area[order].free_list[migratetype]);
+				&zone->free_area[order].free_list[migratetype].list);
 			goto out;
 		}
 	}
 
-	list_add(&page->lru, &zone->free_area[order].free_list[migratetype]);
+	list_add(&page->lru,
+		&zone->free_area[order].free_list[migratetype].list);
 out:
 	zone->free_area[order].nr_free++;
 }
@@ -829,7 +830,7 @@ static inline void expand(struct zone *zone, struct page *page,
 			continue;
 		}
 #endif
-		list_add(&page[size].lru, &area->free_list[migratetype]);
+		list_add(&page[size].lru, &area->free_list[migratetype].list);
 		area->nr_free++;
 		set_page_order(&page[size], high);
 	}
@@ -891,10 +892,10 @@ struct page *__rmqueue_smallest(struct zone *zone, unsigned int order,
 	/* Find a page of the appropriate size in the preferred list */
 	for (current_order = order; current_order < MAX_ORDER; ++current_order) {
 		area = &(zone->free_area[current_order]);
-		if (list_empty(&area->free_list[migratetype]))
+		if (list_empty(&area->free_list[migratetype].list))
 			continue;
 
-		page = list_entry(area->free_list[migratetype].next,
+		page = list_entry(area->free_list[migratetype].list.next,
 							struct page, lru);
 		list_del(&page->lru);
 		rmv_page_order(page);
@@ -966,7 +967,7 @@ int move_freepages(struct zone *zone,
 
 		order = page_order(page);
 		list_move(&page->lru,
-			  &zone->free_area[order].free_list[migratetype]);
+			  &zone->free_area[order].free_list[migratetype].list);
 		set_freepage_migratetype(page, migratetype);
 		page += 1 << order;
 		pages_moved += 1 << order;
@@ -1073,10 +1074,10 @@ __rmqueue_fallback(struct zone *zone, int order, int start_migratetype)
 				break;
 
 			area = &(zone->free_area[current_order]);
-			if (list_empty(&area->free_list[migratetype]))
+			if (list_empty(&area->free_list[migratetype].list))
 				continue;
 
-			page = list_entry(area->free_list[migratetype].next,
+			page = list_entry(area->free_list[migratetype].list.next,
 					struct page, lru);
 			area->nr_free--;
 
@@ -1320,7 +1321,7 @@ void mark_free_pages(struct zone *zone)
 		}
 
 	for_each_migratetype_order(order, t) {
-		list_for_each(curr, &zone->free_area[order].free_list[t]) {
+		list_for_each(curr, &zone->free_area[order].free_list[t].list) {
 			unsigned long i;
 
 			pfn = page_to_pfn(list_entry(curr, struct page, lru));
@@ -3146,7 +3147,7 @@ void show_free_areas(unsigned int filter)
 
 			types[order] = 0;
 			for (type = 0; type < MIGRATE_TYPES; type++) {
-				if (!list_empty(&area->free_list[type]))
+				if (!list_empty(&area->free_list[type].list))
 					types[order] |= 1 << type;
 			}
 		}
@@ -4002,7 +4003,7 @@ static void __meminit zone_init_free_lists(struct zone *zone)
 {
 	int order, t;
 	for_each_migratetype_order(order, t) {
-		INIT_LIST_HEAD(&zone->free_area[order].free_list[t]);
+		INIT_LIST_HEAD(&zone->free_area[order].free_list[t].list);
 		zone->free_area[order].nr_free = 0;
 	}
 }
diff --git a/mm/vmstat.c b/mm/vmstat.c
index 20c2ef4..0451957 100644
--- a/mm/vmstat.c
+++ b/mm/vmstat.c
@@ -862,7 +862,7 @@ static void pagetypeinfo_showfree_print(struct seq_file *m,
 
 			area = &(zone->free_area[order]);
 
-			list_for_each(curr, &area->free_list[mtype])
+			list_for_each(curr, &area->free_list[mtype].list)
 				freecount++;
 			seq_printf(m, "%6lu ", freecount);
 		}


^ permalink raw reply related	[flat|nested] 100+ messages in thread

* [RFC PATCH v3 08/35] mm: Demarcate and maintain pageblocks in region-order in the zones' freelists
  2013-08-30 13:13 ` Srivatsa S. Bhat
@ 2013-08-30 13:16   ` Srivatsa S. Bhat
  -1 siblings, 0 replies; 100+ messages in thread
From: Srivatsa S. Bhat @ 2013-08-30 13:16 UTC (permalink / raw)
  To: akpm, mgorman, hannes, tony.luck, matthew.garrett, dave, riel,
	arjan, srinivas.pandruvada, willy, kamezawa.hiroyu, lenb, rjw
  Cc: gargankita, paulmck, svaidy, andi, isimatu.yasuaki,
	santosh.shilimkar, kosaki.motohiro, srivatsa.bhat, linux-pm,
	linux-mm, linux-kernel

The zones' freelists need to be made region-aware, in order to influence
page allocation and freeing algorithms. So in every free list in the zone, we
would like to demarcate the pageblocks belonging to different memory regions
(we can do this using a set of pointers, and thus avoid splitting up the
freelists).

Also, we would like to keep the pageblocks in the freelists sorted in
region-order. That is, pageblocks belonging to region-0 would come first,
followed by pageblocks belonging to region-1 and so on, within a given
freelist. Of course, a set of pageblocks belonging to the same region need
not be sorted; it is sufficient if we maintain the pageblocks in
region-sorted-order, rather than a full address-sorted-order.

For each freelist within the zone, we maintain a set of pointers to
pageblocks belonging to the various memory regions in that zone.

Eg:

    |<---Region0--->|   |<---Region1--->|   |<-------Region2--------->|
     ____      ____      ____      ____      ____      ____      ____
--> |____|--> |____|--> |____|--> |____|--> |____|--> |____|--> |____|-->

                 ^                  ^                              ^
                 |                  |                              |
                Reg0               Reg1                          Reg2


Page allocation will proceed as usual - pick the first item on the free list.
But we don't want to keep updating these region pointers every time we allocate
a pageblock from the freelist. So, instead of pointing to the *first* pageblock
of that region, we maintain the region pointers such that they point to the
*last* pageblock in that region, as shown in the figure above. That way, as
long as there are > 1 pageblocks in that region in that freelist, that region
pointer doesn't need to be updated.


Page allocation algorithm:
-------------------------

The heart of the page allocation algorithm remains as it is - pick the first
item on the appropriate freelist and return it.


Arrangement of pageblocks in the zone freelists:
-----------------------------------------------

This is the main change - we keep the pageblocks in region-sorted order,
where pageblocks belonging to region-0 come first, followed by those belonging
to region-1 and so on. But the pageblocks within a given region need *not* be
sorted, since we need them to be only region-sorted and not fully
address-sorted.

This sorting is performed when adding pages back to the freelists, thus
avoiding any region-related overhead in the critical page allocation
paths.

Strategy to consolidate allocations to a minimum no. of regions:
---------------------------------------------------------------

Page allocation happens in the order of increasing region number. We would
like to do light-weight page reclaim or compaction (for the purpose of memory
power management) in the reverse order, to keep the allocated pages within
a minimum number of regions (approximately). The latter part is implemented
in subsequent patches.

---------------------------- Increasing region number---------------------->

Direction of allocation--->                <---Direction of reclaim/compaction
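
The region-sorted invariant described above can be stated compactly: walking
a freelist from head to tail must never see the region id decrease. A
debug-style sketch of that check (illustrative only, not part of this patch):

    /* Illustrative only: verify that @free_list is kept region-sorted. */
    static bool free_list_is_region_sorted(struct free_list *free_list)
    {
            struct page *page;
            int prev_region = -1;

            list_for_each_entry(page, &free_list->list, lru) {
                    int region = page_zone_region_id(page);

                    if (region < prev_region)
                            return false;
                    prev_region = region;
            }

            return true;
    }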

Signed-off-by: Srivatsa S. Bhat <srivatsa.bhat@linux.vnet.ibm.com>
---

 mm/page_alloc.c |  154 +++++++++++++++++++++++++++++++++++++++++++++++++------
 1 file changed, 138 insertions(+), 16 deletions(-)

diff --git a/mm/page_alloc.c b/mm/page_alloc.c
index fd6436d0..398b62c 100644
--- a/mm/page_alloc.c
+++ b/mm/page_alloc.c
@@ -514,6 +514,111 @@ static inline int page_is_buddy(struct page *page, struct page *buddy,
 	return 0;
 }
 
+static void add_to_freelist(struct page *page, struct free_list *free_list)
+{
+	struct list_head *prev_region_list, *lru;
+	struct mem_region_list *region;
+	int region_id, i;
+
+	lru = &page->lru;
+	region_id = page_zone_region_id(page);
+
+	region = &free_list->mr_list[region_id];
+	region->nr_free++;
+
+	if (region->page_block) {
+		list_add_tail(lru, region->page_block);
+		return;
+	}
+
+#ifdef CONFIG_DEBUG_PAGEALLOC
+	WARN(region->nr_free != 1, "%s: nr_free is not unity\n", __func__);
+#endif
+
+	if (!list_empty(&free_list->list)) {
+		for (i = region_id - 1; i >= 0; i--) {
+			if (free_list->mr_list[i].page_block) {
+				prev_region_list =
+					free_list->mr_list[i].page_block;
+				goto out;
+			}
+		}
+	}
+
+	/* This is the first region, so add to the head of the list */
+	prev_region_list = &free_list->list;
+
+out:
+	list_add(lru, prev_region_list);
+
+	/* Save pointer to page block of this region */
+	region->page_block = lru;
+}
+
+static void del_from_freelist(struct page *page, struct free_list *free_list)
+{
+	struct list_head *prev_page_lru, *lru, *p;
+	struct mem_region_list *region;
+	int region_id;
+
+	lru = &page->lru;
+	region_id = page_zone_region_id(page);
+	region = &free_list->mr_list[region_id];
+	region->nr_free--;
+
+#ifdef CONFIG_DEBUG_PAGEALLOC
+	WARN(region->nr_free < 0, "%s: nr_free is negative\n", __func__);
+
+	/* Verify whether this page indeed belongs to this free list! */
+
+	list_for_each(p, &free_list->list) {
+		if (p == lru)
+			goto page_found;
+	}
+
+	WARN(1, "%s: page doesn't belong to the given freelist!\n", __func__);
+
+page_found:
+#endif
+
+	/*
+	 * If we are not deleting the last pageblock in this region (i.e.,
+	 * farthest from list head, but not necessarily the last numerically),
+	 * then we need not update the region->page_block pointer.
+	 */
+	if (lru != region->page_block) {
+		list_del(lru);
+#ifdef CONFIG_DEBUG_PAGEALLOC
+		WARN(region->nr_free == 0, "%s: nr_free messed up\n", __func__);
+#endif
+		return;
+	}
+
+	prev_page_lru = lru->prev;
+	list_del(lru);
+
+	if (region->nr_free == 0) {
+		region->page_block = NULL;
+	} else {
+		region->page_block = prev_page_lru;
+#ifdef CONFIG_DEBUG_PAGEALLOC
+		WARN(prev_page_lru == &free_list->list,
+			"%s: region->page_block points to list head\n",
+								__func__);
+#endif
+	}
+}
+
+/**
+ * Move a given page from one freelist to another.
+ */
+static void move_page_freelist(struct page *page, struct free_list *old_list,
+			       struct free_list *new_list)
+{
+	del_from_freelist(page, old_list);
+	add_to_freelist(page, new_list);
+}
+
 /*
  * Freeing function for a buddy system allocator.
  *
@@ -546,6 +651,7 @@ static inline void __free_one_page(struct page *page,
 	unsigned long combined_idx;
 	unsigned long uninitialized_var(buddy_idx);
 	struct page *buddy;
+	struct free_area *area;
 
 	VM_BUG_ON(!zone_is_initialized(zone));
 
@@ -575,8 +681,9 @@ static inline void __free_one_page(struct page *page,
 			__mod_zone_freepage_state(zone, 1 << order,
 						  migratetype);
 		} else {
-			list_del(&buddy->lru);
-			zone->free_area[order].nr_free--;
+			area = &zone->free_area[order];
+			del_from_freelist(buddy, &area->free_list[migratetype]);
+			area->nr_free--;
 			rmv_page_order(buddy);
 		}
 		combined_idx = buddy_idx & page_idx;
@@ -585,6 +692,7 @@ static inline void __free_one_page(struct page *page,
 		order++;
 	}
 	set_page_order(page, order);
+	area = &zone->free_area[order];
 
 	/*
 	 * If this is not the largest possible page, check if the buddy
@@ -601,16 +709,22 @@ static inline void __free_one_page(struct page *page,
 		buddy_idx = __find_buddy_index(combined_idx, order + 1);
 		higher_buddy = higher_page + (buddy_idx - combined_idx);
 		if (page_is_buddy(higher_page, higher_buddy, order + 1)) {
-			list_add_tail(&page->lru,
-				&zone->free_area[order].free_list[migratetype].list);
+
+			/*
+			 * Implementing an add_to_freelist_tail() won't be
+			 * very useful because both of them (almost) add to
+			 * the tail within the region. So we could potentially
+			 * switch off this entire "is next-higher buddy free?"
+			 * logic when memory regions are used.
+			 */
+			add_to_freelist(page, &area->free_list[migratetype]);
 			goto out;
 		}
 	}
 
-	list_add(&page->lru,
-		&zone->free_area[order].free_list[migratetype].list);
+	add_to_freelist(page, &area->free_list[migratetype]);
 out:
-	zone->free_area[order].nr_free++;
+	area->nr_free++;
 }
 
 static inline int free_pages_check(struct page *page)
@@ -830,7 +944,7 @@ static inline void expand(struct zone *zone, struct page *page,
 			continue;
 		}
 #endif
-		list_add(&page[size].lru, &area->free_list[migratetype].list);
+		add_to_freelist(&page[size], &area->free_list[migratetype]);
 		area->nr_free++;
 		set_page_order(&page[size], high);
 	}
@@ -897,7 +1011,7 @@ struct page *__rmqueue_smallest(struct zone *zone, unsigned int order,
 
 		page = list_entry(area->free_list[migratetype].list.next,
 							struct page, lru);
-		list_del(&page->lru);
+		del_from_freelist(page, &area->free_list[migratetype]);
 		rmv_page_order(page);
 		area->nr_free--;
 		expand(zone, page, order, current_order, area, migratetype);
@@ -938,7 +1052,8 @@ int move_freepages(struct zone *zone,
 {
 	struct page *page;
 	unsigned long order;
-	int pages_moved = 0;
+	struct free_area *area;
+	int pages_moved = 0, old_mt;
 
 #ifndef CONFIG_HOLES_IN_ZONE
 	/*
@@ -966,8 +1081,10 @@ int move_freepages(struct zone *zone,
 		}
 
 		order = page_order(page);
-		list_move(&page->lru,
-			  &zone->free_area[order].free_list[migratetype].list);
+		old_mt = get_freepage_migratetype(page);
+		area = &zone->free_area[order];
+		move_page_freelist(page, &area->free_list[old_mt],
+				    &area->free_list[migratetype]);
 		set_freepage_migratetype(page, migratetype);
 		page += 1 << order;
 		pages_moved += 1 << order;
@@ -1061,7 +1178,7 @@ __rmqueue_fallback(struct zone *zone, int order, int start_migratetype)
 	struct free_area * area;
 	int current_order;
 	struct page *page;
-	int migratetype, new_type, i;
+	int migratetype, new_type, i, mt;
 
 	/* Find the largest possible block of pages in the other list */
 	for (current_order = MAX_ORDER-1; current_order >= order;
@@ -1086,7 +1203,8 @@ __rmqueue_fallback(struct zone *zone, int order, int start_migratetype)
 							  migratetype);
 
 			/* Remove the page from the freelists */
-			list_del(&page->lru);
+			mt = get_freepage_migratetype(page);
+			del_from_freelist(page, &area->free_list[mt]);
 			rmv_page_order(page);
 
 			/*
@@ -1446,7 +1564,8 @@ static int __isolate_free_page(struct page *page, unsigned int order)
 	}
 
 	/* Remove page from free list */
-	list_del(&page->lru);
+	mt = get_freepage_migratetype(page);
+	del_from_freelist(page, &zone->free_area[order].free_list[mt]);
 	zone->free_area[order].nr_free--;
 	rmv_page_order(page);
 
@@ -6353,6 +6472,8 @@ __offline_isolated_pages(unsigned long start_pfn, unsigned long end_pfn)
 	int order, i;
 	unsigned long pfn;
 	unsigned long flags;
+	int mt;
+
 	/* find the first valid pfn */
 	for (pfn = start_pfn; pfn < end_pfn; pfn++)
 		if (pfn_valid(pfn))
@@ -6385,7 +6506,8 @@ __offline_isolated_pages(unsigned long start_pfn, unsigned long end_pfn)
 		printk(KERN_INFO "remove from free list %lx %d %lx\n",
 		       pfn, 1 << order, end_pfn);
 #endif
-		list_del(&page->lru);
+		mt = get_freepage_migratetype(page);
+		del_from_freelist(page, &zone->free_area[order].free_list[mt]);
 		rmv_page_order(page);
 		zone->free_area[order].nr_free--;
 #ifdef CONFIG_HIGHMEM


^ permalink raw reply related	[flat|nested] 100+ messages in thread

* [RFC PATCH v3 08/35] mm: Demarcate and maintain pageblocks in region-order in the zones' freelists
@ 2013-08-30 13:16   ` Srivatsa S. Bhat
  0 siblings, 0 replies; 100+ messages in thread
From: Srivatsa S. Bhat @ 2013-08-30 13:16 UTC (permalink / raw)
  To: akpm, mgorman, hannes, tony.luck, matthew.garrett, dave, riel,
	arjan, srinivas.pandruvada, willy, kamezawa.hiroyu, lenb, rjw
  Cc: gargankita, paulmck, svaidy, andi, isimatu.yasuaki,
	santosh.shilimkar, kosaki.motohiro, srivatsa.bhat, linux-pm,
	linux-mm, linux-kernel

The zones' freelists need to be made region-aware, in order to influence
page allocation and freeing algorithms. So in every free list in the zone, we
would like to demarcate the pageblocks belonging to different memory regions
(we can do this using a set of pointers, and thus avoid splitting up the
freelists).

Also, we would like to keep the pageblocks in the freelists sorted in
region-order. That is, pageblocks belonging to region-0 would come first,
followed by pageblocks belonging to region-1 and so on, within a given
freelist. Of course, a set of pageblocks belonging to the same region need
not be sorted; it is sufficient if we maintain the pageblocks in
region-sorted-order, rather than a full address-sorted-order.

For each freelist within the zone, we maintain a set of pointers to
pageblocks belonging to the various memory regions in that zone.

Eg:

    |<---Region0--->|   |<---Region1--->|   |<-------Region2--------->|
     ____      ____      ____      ____      ____      ____      ____
--> |____|--> |____|--> |____|--> |____|--> |____|--> |____|--> |____|-->

                 ^                  ^                              ^
                 |                  |                              |
                Reg0               Reg1                          Reg2


Page allocation will proceed as usual - pick the first item on the free list.
But we don't want to keep updating these region pointers every time we allocate
a pageblock from the freelist. So, instead of pointing to the *first* pageblock
of that region, we maintain the region pointers such that they point to the
*last* pageblock in that region, as shown in the figure above. That way, as
long as a region has more than one pageblock in that freelist, its region
pointer doesn't need to be updated.


Page allocation algorithm:
-------------------------

The heart of the page allocation algorithm remains as it is - pick the first
item on the appropriate freelist and return it.


Arrangement of pageblocks in the zone freelists:
-----------------------------------------------

This is the main change - we keep the pageblocks in region-sorted order,
where pageblocks belonging to region-0 come first, followed by those belonging
to region-1 and so on. But the pageblocks within a given region need *not* be
sorted, since we need them to be only region-sorted and not fully
address-sorted.

This sorting is performed when adding pages back to the freelists, thus
avoiding any region-related overhead in the critical page allocation
paths.
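
To make the arrangement concrete, here is a stand-alone user-space sketch of
the scheme described above. The types and names below are made up for the
illustration (they are not the kernel's struct free_list/mem_region_list),
and the exact position of a new block *within* its region differs from the
patch, which doesn't matter since blocks within a region are unordered. The
point is that region-sorted insertion only needs the per-region "last
pageblock" pointers, and that popping from the head never touches them
except when a region's final block goes away:

/*
 * User-space model: region_tail[r] plays the role of the region pointer
 * in the figure above. It always points to the last block of region r
 * on this list, or is NULL if the region has no blocks here.
 */
#include <stdio.h>
#include <stdlib.h>

#define NR_REGIONS      3

struct block {
        int region;
        struct block *next;
};

struct model_free_list {
        struct block *head;
        struct block *region_tail[NR_REGIONS];
};

/* Free path: link the new block right after the nearest lower region's
 * last block (or at the head of the list). The region_tail pointer is
 * written only when a region receives its very first block. */
static void model_add(struct model_free_list *fl, struct block *b)
{
        struct block **link = &fl->head;
        int r;

        for (r = b->region - 1; r >= 0; r--) {
                if (fl->region_tail[r]) {
                        link = &fl->region_tail[r]->next;
                        break;
                }
        }
        b->next = *link;
        *link = b;

        if (!fl->region_tail[b->region])
                fl->region_tail[b->region] = b;
}

/* Allocation path: always pop the head; region_tail needs attention
 * only when the region's last remaining block is being removed. */
static struct block *model_alloc(struct model_free_list *fl)
{
        struct block *b = fl->head;

        if (!b)
                return NULL;
        fl->head = b->next;
        if (fl->region_tail[b->region] == b)
                fl->region_tail[b->region] = NULL;
        return b;
}

int main(void)
{
        struct model_free_list fl = { 0 };
        int regions[] = { 2, 0, 1, 0, 2 };      /* scrambled free order */
        struct block *b;
        unsigned int i;

        for (i = 0; i < sizeof(regions) / sizeof(regions[0]); i++) {
                b = calloc(1, sizeof(*b));
                b->region = regions[i];
                model_add(&fl, b);
        }

        while ((b = model_alloc(&fl)) != NULL)
                printf("%d ", b->region);       /* prints: 0 0 1 2 2 */
        printf("\n");
        return 0;
}

The same split shows up in the patch below: add_to_freelist() performs the
(cheap) region-sorted insertion on the free path, while the allocation path
keeps picking the first item off the list.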

Strategy to consolidate allocations to a minimum no. of regions:
---------------------------------------------------------------

Page allocation happens in the order of increasing region number. We would
like to do light-weight page reclaim or compaction (for the purpose of memory
power management) in the reverse order, to keep the allocated pages within
a minimum number of regions (approximately). The latter part is implemented
in subsequent patches.

---------------------------- Increasing region number---------------------->

Direction of allocation--->                <---Direction of reclaim/compaction

Signed-off-by: Srivatsa S. Bhat <srivatsa.bhat@linux.vnet.ibm.com>
---

 mm/page_alloc.c |  154 +++++++++++++++++++++++++++++++++++++++++++++++++------
 1 file changed, 138 insertions(+), 16 deletions(-)

diff --git a/mm/page_alloc.c b/mm/page_alloc.c
index fd6436d0..398b62c 100644
--- a/mm/page_alloc.c
+++ b/mm/page_alloc.c
@@ -514,6 +514,111 @@ static inline int page_is_buddy(struct page *page, struct page *buddy,
 	return 0;
 }
 
+static void add_to_freelist(struct page *page, struct free_list *free_list)
+{
+	struct list_head *prev_region_list, *lru;
+	struct mem_region_list *region;
+	int region_id, i;
+
+	lru = &page->lru;
+	region_id = page_zone_region_id(page);
+
+	region = &free_list->mr_list[region_id];
+	region->nr_free++;
+
+	if (region->page_block) {
+		list_add_tail(lru, region->page_block);
+		return;
+	}
+
+#ifdef CONFIG_DEBUG_PAGEALLOC
+	WARN(region->nr_free != 1, "%s: nr_free is not unity\n", __func__);
+#endif
+
+	if (!list_empty(&free_list->list)) {
+		for (i = region_id - 1; i >= 0; i--) {
+			if (free_list->mr_list[i].page_block) {
+				prev_region_list =
+					free_list->mr_list[i].page_block;
+				goto out;
+			}
+		}
+	}
+
+	/* This is the first region, so add to the head of the list */
+	prev_region_list = &free_list->list;
+
+out:
+	list_add(lru, prev_region_list);
+
+	/* Save pointer to page block of this region */
+	region->page_block = lru;
+}
+
+static void del_from_freelist(struct page *page, struct free_list *free_list)
+{
+	struct list_head *prev_page_lru, *lru, *p;
+	struct mem_region_list *region;
+	int region_id;
+
+	lru = &page->lru;
+	region_id = page_zone_region_id(page);
+	region = &free_list->mr_list[region_id];
+	region->nr_free--;
+
+#ifdef CONFIG_DEBUG_PAGEALLOC
+	WARN(region->nr_free < 0, "%s: nr_free is negative\n", __func__);
+
+	/* Verify whether this page indeed belongs to this free list! */
+
+	list_for_each(p, &free_list->list) {
+		if (p == lru)
+			goto page_found;
+	}
+
+	WARN(1, "%s: page doesn't belong to the given freelist!\n", __func__);
+
+page_found:
+#endif
+
+	/*
+	 * If we are not deleting the last pageblock in this region (i.e.,
+	 * farthest from list head, but not necessarily the last numerically),
+	 * then we need not update the region->page_block pointer.
+	 */
+	if (lru != region->page_block) {
+		list_del(lru);
+#ifdef CONFIG_DEBUG_PAGEALLOC
+		WARN(region->nr_free == 0, "%s: nr_free messed up\n", __func__);
+#endif
+		return;
+	}
+
+	prev_page_lru = lru->prev;
+	list_del(lru);
+
+	if (region->nr_free == 0) {
+		region->page_block = NULL;
+	} else {
+		region->page_block = prev_page_lru;
+#ifdef CONFIG_DEBUG_PAGEALLOC
+		WARN(prev_page_lru == &free_list->list,
+			"%s: region->page_block points to list head\n",
+								__func__);
+#endif
+	}
+}
+
+/**
+ * Move a given page from one freelist to another.
+ */
+static void move_page_freelist(struct page *page, struct free_list *old_list,
+			       struct free_list *new_list)
+{
+	del_from_freelist(page, old_list);
+	add_to_freelist(page, new_list);
+}
+
 /*
  * Freeing function for a buddy system allocator.
  *
@@ -546,6 +651,7 @@ static inline void __free_one_page(struct page *page,
 	unsigned long combined_idx;
 	unsigned long uninitialized_var(buddy_idx);
 	struct page *buddy;
+	struct free_area *area;
 
 	VM_BUG_ON(!zone_is_initialized(zone));
 
@@ -575,8 +681,9 @@ static inline void __free_one_page(struct page *page,
 			__mod_zone_freepage_state(zone, 1 << order,
 						  migratetype);
 		} else {
-			list_del(&buddy->lru);
-			zone->free_area[order].nr_free--;
+			area = &zone->free_area[order];
+			del_from_freelist(buddy, &area->free_list[migratetype]);
+			area->nr_free--;
 			rmv_page_order(buddy);
 		}
 		combined_idx = buddy_idx & page_idx;
@@ -585,6 +692,7 @@ static inline void __free_one_page(struct page *page,
 		order++;
 	}
 	set_page_order(page, order);
+	area = &zone->free_area[order];
 
 	/*
 	 * If this is not the largest possible page, check if the buddy
@@ -601,16 +709,22 @@ static inline void __free_one_page(struct page *page,
 		buddy_idx = __find_buddy_index(combined_idx, order + 1);
 		higher_buddy = higher_page + (buddy_idx - combined_idx);
 		if (page_is_buddy(higher_page, higher_buddy, order + 1)) {
-			list_add_tail(&page->lru,
-				&zone->free_area[order].free_list[migratetype].list);
+
+			/*
+			 * Implementing an add_to_freelist_tail() won't be
+			 * very useful because both of them (almost) add to
+			 * the tail within the region. So we could potentially
+			 * switch off this entire "is next-higher buddy free?"
+			 * logic when memory regions are used.
+			 */
+			add_to_freelist(page, &area->free_list[migratetype]);
 			goto out;
 		}
 	}
 
-	list_add(&page->lru,
-		&zone->free_area[order].free_list[migratetype].list);
+	add_to_freelist(page, &area->free_list[migratetype]);
 out:
-	zone->free_area[order].nr_free++;
+	area->nr_free++;
 }
 
 static inline int free_pages_check(struct page *page)
@@ -830,7 +944,7 @@ static inline void expand(struct zone *zone, struct page *page,
 			continue;
 		}
 #endif
-		list_add(&page[size].lru, &area->free_list[migratetype].list);
+		add_to_freelist(&page[size], &area->free_list[migratetype]);
 		area->nr_free++;
 		set_page_order(&page[size], high);
 	}
@@ -897,7 +1011,7 @@ struct page *__rmqueue_smallest(struct zone *zone, unsigned int order,
 
 		page = list_entry(area->free_list[migratetype].list.next,
 							struct page, lru);
-		list_del(&page->lru);
+		del_from_freelist(page, &area->free_list[migratetype]);
 		rmv_page_order(page);
 		area->nr_free--;
 		expand(zone, page, order, current_order, area, migratetype);
@@ -938,7 +1052,8 @@ int move_freepages(struct zone *zone,
 {
 	struct page *page;
 	unsigned long order;
-	int pages_moved = 0;
+	struct free_area *area;
+	int pages_moved = 0, old_mt;
 
 #ifndef CONFIG_HOLES_IN_ZONE
 	/*
@@ -966,8 +1081,10 @@ int move_freepages(struct zone *zone,
 		}
 
 		order = page_order(page);
-		list_move(&page->lru,
-			  &zone->free_area[order].free_list[migratetype].list);
+		old_mt = get_freepage_migratetype(page);
+		area = &zone->free_area[order];
+		move_page_freelist(page, &area->free_list[old_mt],
+				    &area->free_list[migratetype]);
 		set_freepage_migratetype(page, migratetype);
 		page += 1 << order;
 		pages_moved += 1 << order;
@@ -1061,7 +1178,7 @@ __rmqueue_fallback(struct zone *zone, int order, int start_migratetype)
 	struct free_area * area;
 	int current_order;
 	struct page *page;
-	int migratetype, new_type, i;
+	int migratetype, new_type, i, mt;
 
 	/* Find the largest possible block of pages in the other list */
 	for (current_order = MAX_ORDER-1; current_order >= order;
@@ -1086,7 +1203,8 @@ __rmqueue_fallback(struct zone *zone, int order, int start_migratetype)
 							  migratetype);
 
 			/* Remove the page from the freelists */
-			list_del(&page->lru);
+			mt = get_freepage_migratetype(page);
+			del_from_freelist(page, &area->free_list[mt]);
 			rmv_page_order(page);
 
 			/*
@@ -1446,7 +1564,8 @@ static int __isolate_free_page(struct page *page, unsigned int order)
 	}
 
 	/* Remove page from free list */
-	list_del(&page->lru);
+	mt = get_freepage_migratetype(page);
+	del_from_freelist(page, &zone->free_area[order].free_list[mt]);
 	zone->free_area[order].nr_free--;
 	rmv_page_order(page);
 
@@ -6353,6 +6472,8 @@ __offline_isolated_pages(unsigned long start_pfn, unsigned long end_pfn)
 	int order, i;
 	unsigned long pfn;
 	unsigned long flags;
+	int mt;
+
 	/* find the first valid pfn */
 	for (pfn = start_pfn; pfn < end_pfn; pfn++)
 		if (pfn_valid(pfn))
@@ -6385,7 +6506,8 @@ __offline_isolated_pages(unsigned long start_pfn, unsigned long end_pfn)
 		printk(KERN_INFO "remove from free list %lx %d %lx\n",
 		       pfn, 1 << order, end_pfn);
 #endif
-		list_del(&page->lru);
+		mt = get_freepage_migratetype(page);
+		del_from_freelist(page, &zone->free_area[order].free_list[mt]);
 		rmv_page_order(page);
 		zone->free_area[order].nr_free--;
 #ifdef CONFIG_HIGHMEM


^ permalink raw reply related	[flat|nested] 100+ messages in thread

* [RFC PATCH v3 09/35] mm: Track the freepage migratetype of pages accurately
  2013-08-30 13:13 ` Srivatsa S. Bhat
@ 2013-08-30 13:16   ` Srivatsa S. Bhat
  -1 siblings, 0 replies; 100+ messages in thread
From: Srivatsa S. Bhat @ 2013-08-30 13:16 UTC (permalink / raw)
  To: akpm, mgorman, hannes, tony.luck, matthew.garrett, dave, riel,
	arjan, srinivas.pandruvada, willy, kamezawa.hiroyu, lenb, rjw
  Cc: gargankita, paulmck, svaidy, andi, isimatu.yasuaki,
	santosh.shilimkar, kosaki.motohiro, srivatsa.bhat, linux-pm,
	linux-mm, linux-kernel

Due to the region-wise ordering of the pages in the buddy allocator's
free lists, whenever we want to delete a free pageblock from a free list
(e.g., when moving blocks of pages from one list to another), we need
to be able to tell the buddy allocator exactly which migratetype it belongs
to. For that purpose, we can use the page's freepage migratetype (which is
maintained in the page's ->index field).

So, while splitting up higher order pages into smaller ones as part of buddy
operations, keep the new head pages updated with the correct freepage
migratetype information (because we depend on tracking this info accurately,
as outlined above).
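
For reference, the pre-existing freepage migratetype helpers that this patch
relies on look roughly like the sketch below in kernels of this vintage
(they simply stash the migratetype in page->index); treat the exact header
location and signatures as an assumption to be verified against the tree
this series applies to:

static inline void set_freepage_migratetype(struct page *page,
                                            int migratetype)
{
        page->index = migratetype;
}

static inline int get_freepage_migratetype(struct page *page)
{
        return page->index;
}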

Signed-off-by: Srivatsa S. Bhat <srivatsa.bhat@linux.vnet.ibm.com>
---

 mm/page_alloc.c |    7 +++++++
 1 file changed, 7 insertions(+)

diff --git a/mm/page_alloc.c b/mm/page_alloc.c
index 398b62c..b4b1275 100644
--- a/mm/page_alloc.c
+++ b/mm/page_alloc.c
@@ -947,6 +947,13 @@ static inline void expand(struct zone *zone, struct page *page,
 		add_to_freelist(&page[size], &area->free_list[migratetype]);
 		area->nr_free++;
 		set_page_order(&page[size], high);
+
+		/*
+		 * Freepage migratetype is tracked using the index field of the
+		 * first page of the block. So we need to update the new first
+		 * page, when changing the page order.
+		 */
+		set_freepage_migratetype(&page[size], migratetype);
 	}
 }
 


^ permalink raw reply related	[flat|nested] 100+ messages in thread

* [RFC PATCH v3 10/35] mm: Use the correct migratetype during buddy merging
  2013-08-30 13:13 ` Srivatsa S. Bhat
@ 2013-08-30 13:16   ` Srivatsa S. Bhat
  -1 siblings, 0 replies; 100+ messages in thread
From: Srivatsa S. Bhat @ 2013-08-30 13:16 UTC (permalink / raw)
  To: akpm, mgorman, hannes, tony.luck, matthew.garrett, dave, riel,
	arjan, srinivas.pandruvada, willy, kamezawa.hiroyu, lenb, rjw
  Cc: gargankita, paulmck, svaidy, andi, isimatu.yasuaki,
	santosh.shilimkar, kosaki.motohiro, srivatsa.bhat, linux-pm,
	linux-mm, linux-kernel

While merging buddy free pages of a given order to make a higher order page,
the buddy allocator might coalesce pages belonging to *two* *different*
migratetypes of that order!

So, don't assume that both the buddies come from the same freelist;
instead, explicitly find out the migratetype info of the buddy page and use
it while merging the buddies.

Also, set the freepage migratetype of the buddy to the new one.

Signed-off-by: Srivatsa S. Bhat <srivatsa.bhat@linux.vnet.ibm.com>
---

 mm/page_alloc.c |    6 +++++-
 1 file changed, 5 insertions(+), 1 deletion(-)

diff --git a/mm/page_alloc.c b/mm/page_alloc.c
index b4b1275..07ac019 100644
--- a/mm/page_alloc.c
+++ b/mm/page_alloc.c
@@ -681,10 +681,14 @@ static inline void __free_one_page(struct page *page,
 			__mod_zone_freepage_state(zone, 1 << order,
 						  migratetype);
 		} else {
+			int mt;
+
 			area = &zone->free_area[order];
-			del_from_freelist(buddy, &area->free_list[migratetype]);
+			mt = get_freepage_migratetype(buddy);
+			del_from_freelist(buddy, &area->free_list[mt]);
 			area->nr_free--;
 			rmv_page_order(buddy);
+			set_freepage_migratetype(buddy, migratetype);
 		}
 		combined_idx = buddy_idx & page_idx;
 		page = page + (combined_idx - page_idx);


^ permalink raw reply related	[flat|nested] 100+ messages in thread

* [RFC PATCH v3 11/35] mm: Add an optimized version of del_from_freelist to keep page allocation fast
  2013-08-30 13:13 ` Srivatsa S. Bhat
@ 2013-08-30 13:17   ` Srivatsa S. Bhat
  -1 siblings, 0 replies; 100+ messages in thread
From: Srivatsa S. Bhat @ 2013-08-30 13:17 UTC (permalink / raw)
  To: akpm, mgorman, hannes, tony.luck, matthew.garrett, dave, riel,
	arjan, srinivas.pandruvada, willy, kamezawa.hiroyu, lenb, rjw
  Cc: gargankita, paulmck, svaidy, andi, isimatu.yasuaki,
	santosh.shilimkar, kosaki.motohiro, srivatsa.bhat, linux-pm,
	linux-mm, linux-kernel

One of the main advantages of this design of memory regions is that page
allocations can potentially be extremely fast - with almost no extra
overhead from memory regions.

To exploit that, introduce an optimized version of del_from_freelist(), which
utilizes the fact that we always delete items from the head of the list
during page allocation.

Basically, we want to keep a note of the region from which we are allocating
in a given freelist, to avoid having to recompute the page-to-zone-region
mapping on every page allocation. So introduce a 'next_region' pointer in
every freelist
to achieve that, and use it to keep the fastpath of page allocation almost as
fast as it would have been without memory regions.

Signed-off-by: Srivatsa S. Bhat <srivatsa.bhat@linux.vnet.ibm.com>
---

 include/linux/mm.h     |   14 +++++++++++
 include/linux/mmzone.h |    6 +++++
 mm/page_alloc.c        |   62 +++++++++++++++++++++++++++++++++++++++++++++++-
 3 files changed, 81 insertions(+), 1 deletion(-)

diff --git a/include/linux/mm.h b/include/linux/mm.h
index 52329d1..156d7db 100644
--- a/include/linux/mm.h
+++ b/include/linux/mm.h
@@ -747,6 +747,20 @@ static inline int page_zone_region_id(const struct page *page)
 	return pgdat->node_regions[node_region_idx].zone_region_idx[z_num];
 }
 
+static inline void set_next_region_in_freelist(struct free_list *free_list)
+{
+	struct page *page;
+	int region_id;
+
+	if (unlikely(list_empty(&free_list->list))) {
+		free_list->next_region = NULL;
+	} else {
+		page = list_entry(free_list->list.next, struct page, lru);
+		region_id = page_zone_region_id(page);
+		free_list->next_region = &free_list->mr_list[region_id];
+	}
+}
+
 #ifdef SECTION_IN_PAGE_FLAGS
 static inline void set_page_section(struct page *page, unsigned long section)
 {
diff --git a/include/linux/mmzone.h b/include/linux/mmzone.h
index 201ab42..932e71f 100644
--- a/include/linux/mmzone.h
+++ b/include/linux/mmzone.h
@@ -92,6 +92,12 @@ struct free_list {
 	struct list_head	list;
 
 	/*
+	 * Pointer to the region from which the next allocation will be
+	 * satisfied. (Same as the freelist's first pageblock's region.)
+	 */
+	struct mem_region_list	*next_region; /* for fast page allocation */
+
+	/*
 	 * Demarcates pageblocks belonging to different regions within
 	 * this freelist.
 	 */
diff --git a/mm/page_alloc.c b/mm/page_alloc.c
index 07ac019..52b6655 100644
--- a/mm/page_alloc.c
+++ b/mm/page_alloc.c
@@ -548,6 +548,15 @@ static void add_to_freelist(struct page *page, struct free_list *free_list)
 	/* This is the first region, so add to the head of the list */
 	prev_region_list = &free_list->list;
 
+#ifdef CONFIG_DEBUG_PAGEALLOC
+	WARN((list_empty(&free_list->list) && free_list->next_region != NULL),
+					"%s: next_region not NULL\n", __func__);
+#endif
+	/*
+	 * Set 'next_region' to this region, since this is the first region now
+	 */
+	free_list->next_region = region;
+
 out:
 	list_add(lru, prev_region_list);
 
@@ -555,6 +564,47 @@ out:
 	region->page_block = lru;
 }
 
+/**
+ * __rmqueue_smallest() *always* deletes elements from the head of the
+ * list. Use this knowledge to keep page allocation fast, despite being
+ * region-aware.
+ *
+ * Do *NOT* call this function if you are deleting from somewhere deep
+ * inside the freelist.
+ */
+static void rmqueue_del_from_freelist(struct page *page,
+				      struct free_list *free_list)
+{
+	struct list_head *lru = &page->lru;
+
+#ifdef CONFIG_DEBUG_PAGEALLOC
+	WARN((free_list->list.next != lru),
+				"%s: page not at head of list", __func__);
+#endif
+
+	list_del(lru);
+
+	/* Fastpath */
+	if (--(free_list->next_region->nr_free)) {
+
+#ifdef CONFIG_DEBUG_PAGEALLOC
+		WARN(free_list->next_region->nr_free < 0,
+				"%s: nr_free is negative\n", __func__);
+#endif
+		return;
+	}
+
+	/*
+	 * Slowpath, when this is the last pageblock of this region
+	 * in this freelist.
+	 */
+	free_list->next_region->page_block = NULL;
+
+	/* Set 'next_region' to the new first region in the freelist. */
+	set_next_region_in_freelist(free_list);
+}
+
+/* Generic delete function for region-aware buddy allocator. */
 static void del_from_freelist(struct page *page, struct free_list *free_list)
 {
 	struct list_head *prev_page_lru, *lru, *p;
@@ -562,6 +612,11 @@ static void del_from_freelist(struct page *page, struct free_list *free_list)
 	int region_id;
 
 	lru = &page->lru;
+
+	/* Try to fastpath, if deleting from the head of the list */
+	if (lru == free_list->list.next)
+		return rmqueue_del_from_freelist(page, free_list);
+
 	region_id = page_zone_region_id(page);
 	region = &free_list->mr_list[region_id];
 	region->nr_free--;
@@ -597,6 +652,11 @@ page_found:
 	prev_page_lru = lru->prev;
 	list_del(lru);
 
+	/*
+	 * Since we are not deleting from the head of the freelist, the
+	 * 'next_region' pointer doesn't have to change.
+	 */
+
 	if (region->nr_free == 0) {
 		region->page_block = NULL;
 	} else {
@@ -1022,7 +1082,7 @@ struct page *__rmqueue_smallest(struct zone *zone, unsigned int order,
 
 		page = list_entry(area->free_list[migratetype].list.next,
 							struct page, lru);
-		del_from_freelist(page, &area->free_list[migratetype]);
+		rmqueue_del_from_freelist(page, &area->free_list[migratetype]);
 		rmv_page_order(page);
 		area->nr_free--;
 		expand(zone, page, order, current_order, area, migratetype);


^ permalink raw reply related	[flat|nested] 100+ messages in thread

* [RFC PATCH v3 12/35] bitops: Document the difference in indexing between fls() and __fls()
  2013-08-30 13:13 ` Srivatsa S. Bhat
@ 2013-08-30 13:17   ` Srivatsa S. Bhat
  -1 siblings, 0 replies; 100+ messages in thread
From: Srivatsa S. Bhat @ 2013-08-30 13:17 UTC (permalink / raw)
  To: akpm, mgorman, hannes, tony.luck, matthew.garrett, dave, riel,
	arjan, srinivas.pandruvada, willy, kamezawa.hiroyu, lenb, rjw
  Cc: gargankita, paulmck, svaidy, andi, isimatu.yasuaki,
	santosh.shilimkar, kosaki.motohiro, srivatsa.bhat, linux-pm,
	linux-mm, linux-kernel

fls() indexes the bits starting with 1, i.e., from 1 to BITS_PER_LONG,
whereas __fls() uses a zero-based indexing scheme (0 to BITS_PER_LONG - 1).
Add comments to document this important difference.
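
A quick user-space illustration of the two conventions (the GCC builtins
below are merely stand-ins for the kernel helpers, not how the kernel
implements them on every architecture):

#include <stdio.h>

int main(void)
{
        unsigned long word = 0x90;      /* bits 4 and 7 set */

        /* fls()-style, one-based: the highest set bit is reported as 8 */
        int one_based = 32 - __builtin_clz((unsigned int)word);

        /* __fls()-style, zero-based: the same bit is reported as 7 */
        int zero_based = (int)(sizeof(long) * 8) - 1 - __builtin_clzl(word);

        printf("fls-style = %d, __fls-style = %d\n", one_based, zero_based);
        return 0;
}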

Signed-off-by: Srivatsa S. Bhat <srivatsa.bhat@linux.vnet.ibm.com>
---

 arch/x86/include/asm/bitops.h      |    4 ++++
 include/asm-generic/bitops/__fls.h |    5 +++++
 2 files changed, 9 insertions(+)

diff --git a/arch/x86/include/asm/bitops.h b/arch/x86/include/asm/bitops.h
index 6dfd019..25e6fdc 100644
--- a/arch/x86/include/asm/bitops.h
+++ b/arch/x86/include/asm/bitops.h
@@ -380,6 +380,10 @@ static inline unsigned long ffz(unsigned long word)
  * @word: The word to search
  *
  * Undefined if no set bit exists, so code should check against 0 first.
+ *
+ * Note: __fls(x) is equivalent to fls(x) - 1. That is, __fls() uses
+ * a zero-based indexing scheme (0 to BITS_PER_LONG - 1), where
+ * __fls(1) = 0, __fls(2) = 1, and so on.
  */
 static inline unsigned long __fls(unsigned long word)
 {
diff --git a/include/asm-generic/bitops/__fls.h b/include/asm-generic/bitops/__fls.h
index a60a7cc..ae908a5 100644
--- a/include/asm-generic/bitops/__fls.h
+++ b/include/asm-generic/bitops/__fls.h
@@ -8,6 +8,11 @@
  * @word: the word to search
  *
  * Undefined if no set bit exists, so code should check against 0 first.
+ *
+ * Note: __fls(x) is equivalent to fls(x) - 1. That is, __fls() uses
+ * a zero-based indexing scheme (0 to BITS_PER_LONG - 1), where
+ * __fls(1) = 0, __fls(2) = 1, and so on.
+ *
  */
 static __always_inline unsigned long __fls(unsigned long word)
 {


^ permalink raw reply related	[flat|nested] 100+ messages in thread

* [RFC PATCH v3 13/35] mm: A new optimized O(log n) sorting algo to speed up buddy-sorting
  2013-08-30 13:13 ` Srivatsa S. Bhat
@ 2013-08-30 13:17   ` Srivatsa S. Bhat
  -1 siblings, 0 replies; 100+ messages in thread
From: Srivatsa S. Bhat @ 2013-08-30 13:17 UTC (permalink / raw)
  To: akpm, mgorman, hannes, tony.luck, matthew.garrett, dave, riel,
	arjan, srinivas.pandruvada, willy, kamezawa.hiroyu, lenb, rjw
  Cc: gargankita, paulmck, svaidy, andi, isimatu.yasuaki,
	santosh.shilimkar, kosaki.motohiro, srivatsa.bhat, linux-pm,
	linux-mm, linux-kernel

The sorted-buddy design for memory power management depends on
keeping the buddy freelists region-sorted. And this sorting operation
has been pushed to the free() logic, keeping the alloc() path fast.

However, we would like to keep the free() path as fast as possible too,
since it holds the zone->lock, which indirectly affects alloc() as well.

So replace the existing O(n) sorting logic used in the free-path, with
a new special-case sorting algorithm of time complexity O(log n), in order
to optimize the free() path further. This algorithm uses a bitmap-based
radix tree to help speed up the sorting.

One of the other main advantages of this O(log n) design is that it can
support large amounts of RAM (up to 2 TB and beyond) quite effortlessly.
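
As a rough, stand-alone illustration of the idea (this is not the patch's
code; the sizes are made up and __fls() is approximated with a GCC builtin),
the two-level "find the previous populated region" lookup can be modelled
in user space like this:

#include <stdio.h>
#include <stdint.h>

#define WORD_BITS       64
#define LEAF_WORDS      4               /* models 4 * 64 = 256 regions */

static uint64_t root_mask;              /* one bit per leaf word */
static uint64_t leaf_mask[LEAF_WORDS];  /* one bit per region */

static int last_set(uint64_t w)         /* stand-in for __fls() */
{
        return WORD_BITS - 1 - __builtin_clzll(w);
}

static void set_region(int id)
{
        leaf_mask[id / WORD_BITS] |= (uint64_t)1 << (id % WORD_BITS);
        root_mask |= (uint64_t)1 << (id / WORD_BITS);
}

/* Highest populated region below 'id', found with at most two word-wide
 * searches: one in a leaf word, and possibly one in the root word. */
static int prev_region(int id)
{
        int word = id / WORD_BITS;
        uint64_t m;

        m = leaf_mask[word] & (((uint64_t)1 << (id % WORD_BITS)) - 1);
        if (m)                          /* previous region is in this word */
                return word * WORD_BITS + last_set(m);

        m = root_mask & (((uint64_t)1 << word) - 1);
        if (!m)
                return -1;              /* no lower region is populated */
        word = last_set(m);
        return word * WORD_BITS + last_set(leaf_mask[word]);
}

int main(void)
{
        set_region(3);
        set_region(70);
        set_region(200);

        /* prints: 70 3 -1 */
        printf("%d %d %d\n", prev_region(200), prev_region(70), prev_region(3));
        return 0;
}

Scaled up, this matches the numbers quoted in the comment added by the patch:
with 4096 regions the leaf mask is 64 words and the root mask one word, so
the worst case stays at a couple of word-wide searches regardless of how
much RAM the zone spans.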

Signed-off-by: Srivatsa S. Bhat <srivatsa.bhat@linux.vnet.ibm.com>
---

 include/linux/mmzone.h |    2 +
 mm/page_alloc.c        |  144 ++++++++++++++++++++++++++++++++++++++++++++++--
 2 files changed, 139 insertions(+), 7 deletions(-)

diff --git a/include/linux/mmzone.h b/include/linux/mmzone.h
index 932e71f..b35020f 100644
--- a/include/linux/mmzone.h
+++ b/include/linux/mmzone.h
@@ -102,6 +102,8 @@ struct free_list {
 	 * this freelist.
 	 */
 	struct mem_region_list	mr_list[MAX_NR_ZONE_REGIONS];
+	DECLARE_BITMAP(region_root_mask, BITS_PER_LONG);
+	DECLARE_BITMAP(region_leaf_mask, MAX_NR_ZONE_REGIONS);
 };
 
 struct free_area {
diff --git a/mm/page_alloc.c b/mm/page_alloc.c
index 52b6655..4da02fc 100644
--- a/mm/page_alloc.c
+++ b/mm/page_alloc.c
@@ -514,11 +514,131 @@ static inline int page_is_buddy(struct page *page, struct page *buddy,
 	return 0;
 }
 
+/**
+ *
+ * An example should help illustrate the bitmap representation of memory
+ * regions easily. So consider the following scenario:
+ *
+ * MAX_NR_ZONE_REGIONS = 256
+ * DECLARE_BITMAP(region_leaf_mask, MAX_NR_ZONE_REGIONS);
+ * DECLARE_BITMAP(region_root_mask, BITS_PER_LONG);
+ *
+ * Here region_leaf_mask is an array of unsigned longs. And region_root_mask
+ * is a single unsigned long. The tree notion is constructed like this:
+ * Each bit in the region_root_mask will correspond to an array element of
+ * region_leaf_mask, as shown below. (The elements of the region_leaf_mask
+ * array are shown as being discontiguous, only to help illustrate the
+ * concept easily).
+ *
+ *                    Region Root Mask
+ *                   ___________________
+ *                  |____|____|____|____|
+ *                    /    |     \     \
+ *                   /     |      \     \
+ *             ________    |   ________  \
+ *            |________|   |  |________|  \
+ *                         |               \
+ *                      ________        ________
+ *                     |________|      |________|   <--- Region Leaf Mask
+ *                                                         array elements
+ *
+ * If an array element in the leaf mask is non-zero, the corresponding bit
+ * for that array element will be set in the root mask. Every bit in the
+ * region_leaf_mask will correspond to a memory region; it is set if that
+ * region is present in that free list, cleared otherwise.
+ *
+ * This arrangement helps us find the previous set bit in region_leaf_mask
+ * using at most 2 bitmask-searches (each bitmask of size BITS_PER_LONG),
+ * one at the root-level, and one at the leaf level. Thus, this design of
+ * an optimized access structure reduces the search-complexity when dealing
+ * with large amounts of memory. The worst-case time-complexity of buddy
+ * sorting comes to O(log n) using this algorithm, where 'n' is the no. of
+ * memory regions in the zone.
+ *
+ * For example, with MEM_REGION_SIZE = 512 MB, on 64-bit machines, we can
+ * deal with upto 2TB of RAM (MAX_NR_ZONE_REGIONS = 4096) efficiently (just
+ * 12 ops in the worst case, as opposed to 4096 in an O(n) algo) with such
+ * an arrangement, without even needing to extend this 2-level hierarchy
+ * any further.
+ */
+
+static void set_region_bit(int region_id, struct free_list *free_list)
+{
+	set_bit(region_id, free_list->region_leaf_mask);
+	set_bit(BIT_WORD(region_id), free_list->region_root_mask);
+}
+
+static void clear_region_bit(int region_id, struct free_list *free_list)
+{
+	clear_bit(region_id, free_list->region_leaf_mask);
+
+	if (!(free_list->region_leaf_mask[BIT_WORD(region_id)]))
+		clear_bit(BIT_WORD(region_id), free_list->region_root_mask);
+
+}
+
+/* Note that Region 0 corresponds to bit position 1 (0x1) and so on */
+static int find_prev_region(int region_id, struct free_list *free_list)
+{
+	int leaf_word, prev_region_id;
+	unsigned long *region_root_mask, *region_leaf_mask;
+	unsigned long tmp_root_mask, tmp_leaf_mask;
+
+	if (!region_id)
+		return -1; /* No previous region */
+
+	leaf_word = BIT_WORD(region_id);
+
+	region_root_mask = free_list->region_root_mask;
+	region_leaf_mask = free_list->region_leaf_mask;
+
+
+	/*
+	 * Try to get the prev region id without going to the root mask.
+	 * Note that region_id itself might not be set yet.
+	 */
+	if (region_leaf_mask[leaf_word]) {
+		tmp_leaf_mask = region_leaf_mask[leaf_word] &
+							(BIT_MASK(region_id) - 1);
+
+		if (tmp_leaf_mask) {
+			/* Prev region is in this leaf mask itself. Find it. */
+			prev_region_id = leaf_word * BITS_PER_LONG +
+							__fls(tmp_leaf_mask);
+			goto out;
+		}
+	}
+
+	/* Search the root mask for the leaf mask having prev region */
+	tmp_root_mask = *region_root_mask & (BIT(leaf_word) - 1);
+	if (tmp_root_mask) {
+		leaf_word = __fls(tmp_root_mask);
+
+		/* Get the prev region id from the leaf mask */
+		prev_region_id = leaf_word * BITS_PER_LONG +
+					__fls(region_leaf_mask[leaf_word]);
+	} else {
+		/*
+		 * This itself is the first populated region in this
+		 * freelist, so previous region doesn't exist.
+		 */
+		prev_region_id = -1;
+	}
+
+out:
+
+#ifdef CONFIG_DEBUG_PAGEALLOC
+	WARN(prev_region_id >= region_id, "%s: bitmap logic messed up\n",
+								__func__);
+#endif
+	return prev_region_id;
+}
+
 static void add_to_freelist(struct page *page, struct free_list *free_list)
 {
 	struct list_head *prev_region_list, *lru;
 	struct mem_region_list *region;
-	int region_id, i;
+	int region_id, prev_region_id;
 
 	lru = &page->lru;
 	region_id = page_zone_region_id(page);
@@ -536,12 +656,17 @@ static void add_to_freelist(struct page *page, struct free_list *free_list)
 #endif
 
 	if (!list_empty(&free_list->list)) {
-		for (i = region_id - 1; i >= 0; i--) {
-			if (free_list->mr_list[i].page_block) {
-				prev_region_list =
-					free_list->mr_list[i].page_block;
-				goto out;
-			}
+		prev_region_id = find_prev_region(region_id, free_list);
+		if (prev_region_id >= 0) {
+			prev_region_list =
+				free_list->mr_list[prev_region_id].page_block;
+#ifdef CONFIG_DEBUG_PAGEALLOC
+			WARN(prev_region_list == NULL,
+				"%s: prev_region_list is NULL\n"
+				"region_id=%d, prev_region_id=%d\n", __func__,
+				 region_id, prev_region_id);
+#endif
+			goto out;
 		}
 	}
 
@@ -562,6 +687,7 @@ out:
 
 	/* Save pointer to page block of this region */
 	region->page_block = lru;
+	set_region_bit(region_id, free_list);
 }
 
 /**
@@ -576,6 +702,7 @@ static void rmqueue_del_from_freelist(struct page *page,
 				      struct free_list *free_list)
 {
 	struct list_head *lru = &page->lru;
+	int region_id;
 
 #ifdef CONFIG_DEBUG_PAGEALLOC
 	WARN((free_list->list.next != lru),
@@ -599,6 +726,8 @@ static void rmqueue_del_from_freelist(struct page *page,
 	 * in this freelist.
 	 */
 	free_list->next_region->page_block = NULL;
+	region_id = free_list->next_region - free_list->mr_list;
+	clear_region_bit(region_id, free_list);
 
 	/* Set 'next_region' to the new first region in the freelist. */
 	set_next_region_in_freelist(free_list);
@@ -659,6 +788,7 @@ page_found:
 
 	if (region->nr_free == 0) {
 		region->page_block = NULL;
+		clear_region_bit(region_id, free_list);
 	} else {
 		region->page_block = prev_page_lru;
 #ifdef CONFIG_DEBUG_PAGEALLOC


^ permalink raw reply related	[flat|nested] 100+ messages in thread

* [RFC PATCH v3 14/35] mm: Add support to accurately track per-memory-region allocation
  2013-08-30 13:13 ` Srivatsa S. Bhat
@ 2013-08-30 13:18   ` Srivatsa S. Bhat
  -1 siblings, 0 replies; 100+ messages in thread
From: Srivatsa S. Bhat @ 2013-08-30 13:18 UTC (permalink / raw)
  To: akpm, mgorman, hannes, tony.luck, matthew.garrett, dave, riel,
	arjan, srinivas.pandruvada, willy, kamezawa.hiroyu, lenb, rjw
  Cc: gargankita, paulmck, svaidy, andi, isimatu.yasuaki,
	santosh.shilimkar, kosaki.motohiro, srivatsa.bhat, linux-pm,
	linux-mm, linux-kernel

The page allocator can make smarter decisions to influence memory power
management if we track the per-region memory allocations closely.
So add the necessary support to accurately track allocations on a per-region
basis.

Signed-off-by: Srivatsa S. Bhat <srivatsa.bhat@linux.vnet.ibm.com>
---

 include/linux/mmzone.h |    2 +
 mm/page_alloc.c        |   65 +++++++++++++++++++++++++++++++++++-------------
 2 files changed, 50 insertions(+), 17 deletions(-)

diff --git a/include/linux/mmzone.h b/include/linux/mmzone.h
index b35020f..ef602a8 100644
--- a/include/linux/mmzone.h
+++ b/include/linux/mmzone.h
@@ -86,6 +86,7 @@ static inline int get_pageblock_migratetype(struct page *page)
 struct mem_region_list {
 	struct list_head	*page_block;
 	unsigned long		nr_free;
+	struct zone_mem_region	*zone_region;
 };
 
 struct free_list {
@@ -341,6 +342,7 @@ struct zone_mem_region {
 	unsigned long end_pfn;
 	unsigned long present_pages;
 	unsigned long spanned_pages;
+	unsigned long nr_free;
 };
 
 struct zone {
diff --git a/mm/page_alloc.c b/mm/page_alloc.c
index 4da02fc..6e711b9 100644
--- a/mm/page_alloc.c
+++ b/mm/page_alloc.c
@@ -634,7 +634,8 @@ out:
 	return prev_region_id;
 }
 
-static void add_to_freelist(struct page *page, struct free_list *free_list)
+static void add_to_freelist(struct page *page, struct free_list *free_list,
+			    int order)
 {
 	struct list_head *prev_region_list, *lru;
 	struct mem_region_list *region;
@@ -645,6 +646,7 @@ static void add_to_freelist(struct page *page, struct free_list *free_list)
 
 	region = &free_list->mr_list[region_id];
 	region->nr_free++;
+	region->zone_region->nr_free += 1 << order;
 
 	if (region->page_block) {
 		list_add_tail(lru, region->page_block);
@@ -699,9 +701,10 @@ out:
  * inside the freelist.
  */
 static void rmqueue_del_from_freelist(struct page *page,
-				      struct free_list *free_list)
+				      struct free_list *free_list, int order)
 {
 	struct list_head *lru = &page->lru;
+	struct mem_region_list *mr_list;
 	int region_id;
 
 #ifdef CONFIG_DEBUG_PAGEALLOC
@@ -712,7 +715,10 @@ static void rmqueue_del_from_freelist(struct page *page,
 	list_del(lru);
 
 	/* Fastpath */
-	if (--(free_list->next_region->nr_free)) {
+	mr_list = free_list->next_region;
+	mr_list->zone_region->nr_free -= 1 << order;
+
+	if (--(mr_list->nr_free)) {
 
 #ifdef CONFIG_DEBUG_PAGEALLOC
 		WARN(free_list->next_region->nr_free < 0,
@@ -734,7 +740,8 @@ static void rmqueue_del_from_freelist(struct page *page,
 }
 
 /* Generic delete function for region-aware buddy allocator. */
-static void del_from_freelist(struct page *page, struct free_list *free_list)
+static void del_from_freelist(struct page *page, struct free_list *free_list,
+			      int order)
 {
 	struct list_head *prev_page_lru, *lru, *p;
 	struct mem_region_list *region;
@@ -744,11 +751,12 @@ static void del_from_freelist(struct page *page, struct free_list *free_list)
 
 	/* Try to fastpath, if deleting from the head of the list */
 	if (lru == free_list->list.next)
-		return rmqueue_del_from_freelist(page, free_list);
+		return rmqueue_del_from_freelist(page, free_list, order);
 
 	region_id = page_zone_region_id(page);
 	region = &free_list->mr_list[region_id];
 	region->nr_free--;
+	region->zone_region->nr_free -= 1 << order;
 
 #ifdef CONFIG_DEBUG_PAGEALLOC
 	WARN(region->nr_free < 0, "%s: nr_free is negative\n", __func__);
@@ -803,10 +811,10 @@ page_found:
  * Move a given page from one freelist to another.
  */
 static void move_page_freelist(struct page *page, struct free_list *old_list,
-			       struct free_list *new_list)
+			       struct free_list *new_list, int order)
 {
-	del_from_freelist(page, old_list);
-	add_to_freelist(page, new_list);
+	del_from_freelist(page, old_list, order);
+	add_to_freelist(page, new_list, order);
 }
 
 /*
@@ -875,7 +883,7 @@ static inline void __free_one_page(struct page *page,
 
 			area = &zone->free_area[order];
 			mt = get_freepage_migratetype(buddy);
-			del_from_freelist(buddy, &area->free_list[mt]);
+			del_from_freelist(buddy, &area->free_list[mt], order);
 			area->nr_free--;
 			rmv_page_order(buddy);
 			set_freepage_migratetype(buddy, migratetype);
@@ -911,12 +919,13 @@ static inline void __free_one_page(struct page *page,
 			 * switch off this entire "is next-higher buddy free?"
 			 * logic when memory regions are used.
 			 */
-			add_to_freelist(page, &area->free_list[migratetype]);
+			add_to_freelist(page, &area->free_list[migratetype],
+					order);
 			goto out;
 		}
 	}
 
-	add_to_freelist(page, &area->free_list[migratetype]);
+	add_to_freelist(page, &area->free_list[migratetype], order);
 out:
 	area->nr_free++;
 }
@@ -1138,7 +1147,8 @@ static inline void expand(struct zone *zone, struct page *page,
 			continue;
 		}
 #endif
-		add_to_freelist(&page[size], &area->free_list[migratetype]);
+		add_to_freelist(&page[size], &area->free_list[migratetype],
+				high);
 		area->nr_free++;
 		set_page_order(&page[size], high);
 
@@ -1212,7 +1222,8 @@ struct page *__rmqueue_smallest(struct zone *zone, unsigned int order,
 
 		page = list_entry(area->free_list[migratetype].list.next,
 							struct page, lru);
-		rmqueue_del_from_freelist(page, &area->free_list[migratetype]);
+		rmqueue_del_from_freelist(page, &area->free_list[migratetype],
+					  current_order);
 		rmv_page_order(page);
 		area->nr_free--;
 		expand(zone, page, order, current_order, area, migratetype);
@@ -1285,7 +1296,7 @@ int move_freepages(struct zone *zone,
 		old_mt = get_freepage_migratetype(page);
 		area = &zone->free_area[order];
 		move_page_freelist(page, &area->free_list[old_mt],
-				    &area->free_list[migratetype]);
+				    &area->free_list[migratetype], order);
 		set_freepage_migratetype(page, migratetype);
 		page += 1 << order;
 		pages_moved += 1 << order;
@@ -1405,7 +1416,8 @@ __rmqueue_fallback(struct zone *zone, int order, int start_migratetype)
 
 			/* Remove the page from the freelists */
 			mt = get_freepage_migratetype(page);
-			del_from_freelist(page, &area->free_list[mt]);
+			del_from_freelist(page, &area->free_list[mt],
+					  current_order);
 			rmv_page_order(page);
 
 			/*
@@ -1766,7 +1778,7 @@ static int __isolate_free_page(struct page *page, unsigned int order)
 
 	/* Remove page from free list */
 	mt = get_freepage_migratetype(page);
-	del_from_freelist(page, &zone->free_area[order].free_list[mt]);
+	del_from_freelist(page, &zone->free_area[order].free_list[mt], order);
 	zone->free_area[order].nr_free--;
 	rmv_page_order(page);
 
@@ -5157,6 +5169,22 @@ static void __meminit init_node_memory_regions(struct pglist_data *pgdat)
 	pgdat->nr_node_regions = idx;
 }
 
+static void __meminit zone_init_free_lists_late(struct zone *zone)
+{
+	struct mem_region_list *mr_list;
+	int order, t, i;
+
+	for_each_migratetype_order(order, t) {
+		for (i = 0; i < zone->nr_zone_regions; i++) {
+			mr_list =
+				&zone->free_area[order].free_list[t].mr_list[i];
+
+			mr_list->nr_free = 0;
+			mr_list->zone_region = &zone->zone_regions[i];
+		}
+	}
+}
+
 static void __meminit init_zone_memory_regions(struct pglist_data *pgdat)
 {
 	unsigned long start_pfn, end_pfn, absent;
@@ -5204,6 +5232,8 @@ static void __meminit init_zone_memory_regions(struct pglist_data *pgdat)
 
 		z->nr_zone_regions = idx;
 
+		zone_init_free_lists_late(z);
+
 		/*
 		 * Revisit the last visited node memory region, in case it
 		 * spans multiple zones.
@@ -6708,7 +6738,8 @@ __offline_isolated_pages(unsigned long start_pfn, unsigned long end_pfn)
 		       pfn, 1 << order, end_pfn);
 #endif
 		mt = get_freepage_migratetype(page);
-		del_from_freelist(page, &zone->free_area[order].free_list[mt]);
+		del_from_freelist(page, &zone->free_area[order].free_list[mt],
+				  order);
 		rmv_page_order(page);
 		zone->free_area[order].nr_free--;
 #ifdef CONFIG_HIGHMEM
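
The patch above maintains two related counters: mem_region_list.nr_free
counts buddy *blocks* (one per free page-block, whatever its order) on a
given freelist, while zone_mem_region.nr_free counts free *base pages*
(adjusted by 1 << order) for the whole zone region. The following
standalone sketch - a simplified userspace model, not kernel code, with
purely illustrative structure layouts and helper names - shows that
double accounting:

#include <assert.h>
#include <stdio.h>

struct zone_mem_region {
	unsigned long nr_free;		/* free base pages in the region */
};

struct mem_region_list {
	unsigned long nr_free;		/* free blocks on this freelist */
	struct zone_mem_region *zone_region;
};

static void track_add(struct mem_region_list *region, int order)
{
	region->nr_free++;				/* one more block */
	region->zone_region->nr_free += 1UL << order;	/* 2^order pages */
}

static void track_del(struct mem_region_list *region, int order)
{
	region->nr_free--;
	region->zone_region->nr_free -= 1UL << order;
}

int main(void)
{
	struct zone_mem_region zr = { .nr_free = 0 };
	struct mem_region_list mr = { .nr_free = 0, .zone_region = &zr };

	track_add(&mr, 3);	/* free one order-3 block: 8 pages */
	track_add(&mr, 0);	/* free one order-0 block: 1 page  */
	assert(mr.nr_free == 2 && zr.nr_free == 9);

	track_del(&mr, 3);	/* allocate the order-3 block again */
	printf("blocks on freelist: %lu, free pages in region: %lu\n",
	       mr.nr_free, zr.nr_free);			/* 1 and 1 */
	return 0;
}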


^ permalink raw reply related	[flat|nested] 100+ messages in thread

* [RFC PATCH v3 15/35] mm: Print memory region statistics to understand the buddy allocator behavior
  2013-08-30 13:13 ` Srivatsa S. Bhat
@ 2013-08-30 13:18   ` Srivatsa S. Bhat
  -1 siblings, 0 replies; 100+ messages in thread
From: Srivatsa S. Bhat @ 2013-08-30 13:18 UTC (permalink / raw)
  To: akpm, mgorman, hannes, tony.luck, matthew.garrett, dave, riel,
	arjan, srinivas.pandruvada, willy, kamezawa.hiroyu, lenb, rjw
  Cc: gargankita, paulmck, svaidy, andi, isimatu.yasuaki,
	santosh.shilimkar, kosaki.motohiro, srivatsa.bhat, linux-pm,
	linux-mm, linux-kernel

In order to observe the behavior of the region-aware buddy allocator, modify
vmstat.c to also print memory-region-related statistics. In particular, enable
memory-region-related info in /proc/zoneinfo and /proc/buddyinfo, since they
help us to at least (roughly) observe how the new buddy allocator is
behaving.

For now, the region statistics correspond to the zone memory regions and not
the (absolute) node memory regions, and some of the statistics (especially the
number of present pages) might not be very accurate. But since we account for
and print the free page statistics for every zone memory region accurately, we
should be able to observe the new page allocator's behavior to a reasonable
degree of accuracy.

Signed-off-by: Srivatsa S. Bhat <srivatsa.bhat@linux.vnet.ibm.com>
---

 mm/vmstat.c |   34 ++++++++++++++++++++++++++++++----
 1 file changed, 30 insertions(+), 4 deletions(-)

diff --git a/mm/vmstat.c b/mm/vmstat.c
index 0451957..4cba0da 100644
--- a/mm/vmstat.c
+++ b/mm/vmstat.c
@@ -827,11 +827,28 @@ const char * const vmstat_text[] = {
 static void frag_show_print(struct seq_file *m, pg_data_t *pgdat,
 						struct zone *zone)
 {
-	int order;
+	int i, order, t;
+	struct free_area *area;
 
-	seq_printf(m, "Node %d, zone %8s ", pgdat->node_id, zone->name);
-	for (order = 0; order < MAX_ORDER; ++order)
-		seq_printf(m, "%6lu ", zone->free_area[order].nr_free);
+	seq_printf(m, "Node %d, zone %8s \n", pgdat->node_id, zone->name);
+
+	for (i = 0; i < zone->nr_zone_regions; i++) {
+
+		seq_printf(m, "\t\t Region %6d ", i);
+
+		for (order = 0; order < MAX_ORDER; ++order) {
+			unsigned long nr_free = 0;
+
+			area = &zone->free_area[order];
+
+			for (t = 0; t < MIGRATE_TYPES; t++) {
+				nr_free +=
+					area->free_list[t].mr_list[i].nr_free;
+			}
+			seq_printf(m, "%6lu ", nr_free);
+		}
+		seq_putc(m, '\n');
+	}
 	seq_putc(m, '\n');
 }
 
@@ -1018,6 +1035,15 @@ static void zoneinfo_show_print(struct seq_file *m, pg_data_t *pgdat,
 		   zone->present_pages,
 		   zone->managed_pages);
 
+	seq_printf(m, "\n\nPer-region page stats\t present\t free\n\n");
+	for (i = 0; i < zone->nr_zone_regions; i++) {
+		struct zone_mem_region *region;
+
+		region = &zone->zone_regions[i];
+		seq_printf(m, "\tRegion %6d \t %6lu \t %6lu\n", i,
+				region->present_pages, region->nr_free);
+	}
+
 	for (i = 0; i < NR_VM_ZONE_STAT_ITEMS; i++)
 		seq_printf(m, "\n    %-12s %lu", vmstat_text[i],
 				zone_page_state(zone, i));
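
For illustration, with this patch /proc/buddyinfo goes from one line per
zone to one line per zone memory region, and /proc/zoneinfo gains a
per-region present/free summary. The snippets below only approximate the
layout implied by the seq_printf format strings above; the node/zone
names and every count are made-up placeholder values.

Node 0, zone   Normal 
		 Region      0    176     90     41     17      9      4      2      1      0      0      1 
		 Region      1    354    120     60     25     12      6      3      1      1      0      0 

Per-region page stats	 present	 free

	Region      0 	 262144 	  12345
	Region      1 	 262144 	 204800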


^ permalink raw reply related	[flat|nested] 100+ messages in thread

* [RFC PATCH v3 16/35] mm: Enable per-memory-region fragmentation stats in pagetypeinfo
  2013-08-30 13:13 ` Srivatsa S. Bhat
@ 2013-08-30 13:18   ` Srivatsa S. Bhat
  -1 siblings, 0 replies; 100+ messages in thread
From: Srivatsa S. Bhat @ 2013-08-30 13:18 UTC (permalink / raw)
  To: akpm, mgorman, hannes, tony.luck, matthew.garrett, dave, riel,
	arjan, srinivas.pandruvada, willy, kamezawa.hiroyu, lenb, rjw
  Cc: gargankita, paulmck, svaidy, andi, isimatu.yasuaki,
	santosh.shilimkar, kosaki.motohiro, srivatsa.bhat, linux-pm,
	linux-mm, linux-kernel

Pagetypeinfo is invaluable for observing how memory fragments into the
different migratetypes. Modify this code to also print the fragmentation
statistics at a per-zone-memory-region granularity (along with the
existing per-zone reporting).

This helps us observe the effects of influencing memory allocation
decisions at the page-allocator level, and understand the extent to
which they help in consolidation.

Signed-off-by: Srivatsa S. Bhat <srivatsa.bhat@linux.vnet.ibm.com>
---

 mm/vmstat.c |   86 ++++++++++++++++++++++++++++++++++++++++++++++++++++++++++-
 1 file changed, 84 insertions(+), 2 deletions(-)

diff --git a/mm/vmstat.c b/mm/vmstat.c
index 4cba0da..924babc 100644
--- a/mm/vmstat.c
+++ b/mm/vmstat.c
@@ -887,6 +887,35 @@ static void pagetypeinfo_showfree_print(struct seq_file *m,
 	}
 }
 
+static void pagetypeinfo_showfree_region_print(struct seq_file *m,
+					       pg_data_t *pgdat,
+					       struct zone *zone)
+{
+	int order, mtype, i;
+
+	for (mtype = 0; mtype < MIGRATE_TYPES; mtype++) {
+
+		for (i = 0; i < zone->nr_zone_regions; i++) {
+			seq_printf(m, "Node %4d, zone %8s, R%3d %12s ",
+						pgdat->node_id,
+						zone->name,
+						i,
+						migratetype_names[mtype]);
+
+			for (order = 0; order < MAX_ORDER; ++order) {
+				struct free_area *area;
+
+				area = &(zone->free_area[order]);
+
+				seq_printf(m, "%6lu ",
+				   area->free_list[mtype].mr_list[i].nr_free);
+			}
+			seq_putc(m, '\n');
+		}
+
+	}
+}
+
 /* Print out the free pages at each order for each migatetype */
 static int pagetypeinfo_showfree(struct seq_file *m, void *arg)
 {
@@ -901,6 +930,11 @@ static int pagetypeinfo_showfree(struct seq_file *m, void *arg)
 
 	walk_zones_in_node(m, pgdat, pagetypeinfo_showfree_print);
 
+	seq_putc(m, '\n');
+
+	/* Print the free pages at each migratetype, per memory region */
+	walk_zones_in_node(m, pgdat, pagetypeinfo_showfree_region_print);
+
 	return 0;
 }
 
@@ -932,24 +966,72 @@ static void pagetypeinfo_showblockcount_print(struct seq_file *m,
 	}
 
 	/* Print counts */
-	seq_printf(m, "Node %d, zone %8s ", pgdat->node_id, zone->name);
+	seq_printf(m, "Node %d, zone %8s      ", pgdat->node_id, zone->name);
 	for (mtype = 0; mtype < MIGRATE_TYPES; mtype++)
 		seq_printf(m, "%12lu ", count[mtype]);
 	seq_putc(m, '\n');
 }
 
+static void pagetypeinfo_showblockcount_region_print(struct seq_file *m,
+					pg_data_t *pgdat, struct zone *zone)
+{
+	int mtype, i;
+	unsigned long pfn;
+	unsigned long start_pfn, end_pfn;
+	unsigned long count[MIGRATE_TYPES] = { 0, };
+
+	for (i = 0; i < zone->nr_zone_regions; i++) {
+		start_pfn = zone->zone_regions[i].start_pfn;
+		end_pfn = zone->zone_regions[i].end_pfn;
+
+		for (pfn = start_pfn; pfn < end_pfn;
+						pfn += pageblock_nr_pages) {
+			struct page *page;
+
+			if (!pfn_valid(pfn))
+				continue;
+
+			page = pfn_to_page(pfn);
+
+			/* Watch for unexpected holes punched in the memmap */
+			if (!memmap_valid_within(pfn, page, zone))
+				continue;
+
+			mtype = get_pageblock_migratetype(page);
+
+			if (mtype < MIGRATE_TYPES)
+				count[mtype]++;
+		}
+
+		/* Print counts */
+		seq_printf(m, "Node %d, zone %8s R%3d ", pgdat->node_id,
+			   zone->name, i);
+		for (mtype = 0; mtype < MIGRATE_TYPES; mtype++)
+			seq_printf(m, "%12lu ", count[mtype]);
+		seq_putc(m, '\n');
+
+		/* Reset the counters */
+		for (mtype = 0; mtype < MIGRATE_TYPES; mtype++)
+			count[mtype] = 0;
+	}
+}
+
 /* Print out the free pages at each order for each migratetype */
 static int pagetypeinfo_showblockcount(struct seq_file *m, void *arg)
 {
 	int mtype;
 	pg_data_t *pgdat = (pg_data_t *)arg;
 
-	seq_printf(m, "\n%-23s", "Number of blocks type ");
+	seq_printf(m, "\n%-23s", "Number of blocks type      ");
 	for (mtype = 0; mtype < MIGRATE_TYPES; mtype++)
 		seq_printf(m, "%12s ", migratetype_names[mtype]);
 	seq_putc(m, '\n');
 	walk_zones_in_node(m, pgdat, pagetypeinfo_showblockcount_print);
 
+	/* Print out the pageblock info for per memory region */
+	seq_putc(m, '\n');
+	walk_zones_in_node(m, pgdat, pagetypeinfo_showblockcount_region_print);
+
 	return 0;
 }
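
For illustration, the per-region output added to /proc/pagetypeinfo would
look roughly like the snippet below (all values are made up, and the
exact set of migratetype columns depends on the kernel configuration):

Node    0, zone   Normal, R  0      Movable    154     68     31     12      5      2      1      0      0      0      0 
Node    0, zone   Normal, R  1      Movable    320    110     52     20      9      4      2      1      0      0      0 

Number of blocks type            Unmovable  Reclaimable      Movable      Reserve      Isolate 
Node 0, zone   Normal R  0              12            3          241            2            0 
Node 0, zone   Normal R  1               4            1          250            1            0 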
 


^ permalink raw reply related	[flat|nested] 100+ messages in thread

* [RFC PATCH v3 17/35] mm: Add aggressive bias to prefer lower regions during page allocation
  2013-08-30 13:13 ` Srivatsa S. Bhat
@ 2013-08-30 13:19   ` Srivatsa S. Bhat
  -1 siblings, 0 replies; 100+ messages in thread
From: Srivatsa S. Bhat @ 2013-08-30 13:19 UTC (permalink / raw)
  To: akpm, mgorman, hannes, tony.luck, matthew.garrett, dave, riel,
	arjan, srinivas.pandruvada, willy, kamezawa.hiroyu, lenb, rjw
  Cc: gargankita, paulmck, svaidy, andi, isimatu.yasuaki,
	santosh.shilimkar, kosaki.motohiro, srivatsa.bhat, linux-pm,
	linux-mm, linux-kernel

While allocating pages from the buddy freelists, there could be situations
in which the freepage of the required order that is ready for allocation
(at the head of its freelist) belongs to a *higher* numbered memory region,
while a freepage of a higher page order exists in a *lower* numbered memory
region.

To make the consolidation logic more aggressive, split up the higher-order
buddy page from the lower-numbered region and allocate from it, rather than
allocating pages from the higher-numbered region.

This ensures that we spill over to a new region only when we truly don't
have enough contiguous memory in any lower-numbered region to satisfy the
allocation request.

Signed-off-by: Srivatsa S. Bhat <srivatsa.bhat@linux.vnet.ibm.com>
---

 mm/page_alloc.c |   44 ++++++++++++++++++++++++++++++++++----------
 1 file changed, 34 insertions(+), 10 deletions(-)

diff --git a/mm/page_alloc.c b/mm/page_alloc.c
index 6e711b9..0cc2a3e 100644
--- a/mm/page_alloc.c
+++ b/mm/page_alloc.c
@@ -1210,8 +1210,9 @@ static inline
 struct page *__rmqueue_smallest(struct zone *zone, unsigned int order,
 						int migratetype)
 {
-	unsigned int current_order;
-	struct free_area * area;
+	unsigned int current_order, alloc_order;
+	struct free_area *area, *other_area;
+	int alloc_region, other_region;
 	struct page *page;
 
 	/* Find a page of the appropriate size in the preferred list */
@@ -1220,17 +1221,40 @@ struct page *__rmqueue_smallest(struct zone *zone, unsigned int order,
 		if (list_empty(&area->free_list[migratetype].list))
 			continue;
 
-		page = list_entry(area->free_list[migratetype].list.next,
-							struct page, lru);
-		rmqueue_del_from_freelist(page, &area->free_list[migratetype],
-					  current_order);
-		rmv_page_order(page);
-		area->nr_free--;
-		expand(zone, page, order, current_order, area, migratetype);
-		return page;
+		alloc_order = current_order;
+		alloc_region = area->free_list[migratetype].next_region -
+				area->free_list[migratetype].mr_list;
+		current_order++;
+		goto try_others;
 	}
 
 	return NULL;
+
+try_others:
+	/* Try to aggressively prefer lower numbered regions for allocations */
+	for ( ; current_order < MAX_ORDER; ++current_order) {
+		other_area = &(zone->free_area[current_order]);
+		if (list_empty(&other_area->free_list[migratetype].list))
+			continue;
+
+		other_region = other_area->free_list[migratetype].next_region -
+				other_area->free_list[migratetype].mr_list;
+
+		if (other_region < alloc_region) {
+			alloc_region = other_region;
+			alloc_order = current_order;
+		}
+	}
+
+	area = &(zone->free_area[alloc_order]);
+	page = list_entry(area->free_list[migratetype].list.next, struct page,
+			  lru);
+	rmqueue_del_from_freelist(page, &area->free_list[migratetype],
+				  alloc_order);
+	rmv_page_order(page);
+	area->nr_free--;
+	expand(zone, page, order, alloc_order, area, migratetype);
+	return page;
 }
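
The order-selection logic added above boils down to: find the smallest
order (at or above the requested one) with a non-empty freelist, then
keep scanning higher orders and prefer any order whose next allocation
would come from a lower-numbered region. The standalone sketch below is
a simplified model of that decision, not the kernel code; names such as
head_region[] and pick_alloc_order() are illustrative only.

#include <stdio.h>

#define MAX_ORDER 11

/*
 * head_region[o] is the region id of the page at the head of the
 * order-o freelist, or -1 if that freelist is empty.
 * Returns the order to allocate from, or -1 if nothing is free.
 */
static int pick_alloc_order(const int head_region[MAX_ORDER], int order)
{
	int alloc_order = -1, alloc_region = -1, o;

	for (o = order; o < MAX_ORDER; o++) {
		if (head_region[o] < 0)
			continue;	/* freelist empty at this order */

		if (alloc_order < 0 || head_region[o] < alloc_region) {
			/* First candidate, or a lower-numbered region */
			alloc_order = o;
			alloc_region = head_region[o];
		}
	}
	return alloc_order;
}

int main(void)
{
	/* Order-2 page only in region 3, but an order-5 page in region 0 */
	int head_region[MAX_ORDER] = {
		-1, -1, 3, -1, -1, 0, -1, -1, -1, -1, -1
	};

	/*
	 * For an order-2 request, prefer splitting the order-5 page from
	 * region 0 over consuming the order-2 page from region 3.
	 */
	printf("allocate from order %d\n",
	       pick_alloc_order(head_region, 2));	/* prints 5 */
	return 0;
}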
 
 


^ permalink raw reply related	[flat|nested] 100+ messages in thread

* [RFC PATCH v3 18/35] mm: Introduce a "Region Allocator" to manage entire memory regions
  2013-08-30 13:13 ` Srivatsa S. Bhat
@ 2013-08-30 13:19   ` Srivatsa S. Bhat
  -1 siblings, 0 replies; 100+ messages in thread
From: Srivatsa S. Bhat @ 2013-08-30 13:19 UTC (permalink / raw)
  To: akpm, mgorman, hannes, tony.luck, matthew.garrett, dave, riel,
	arjan, srinivas.pandruvada, willy, kamezawa.hiroyu, lenb, rjw
  Cc: gargankita, paulmck, svaidy, andi, isimatu.yasuaki,
	santosh.shilimkar, kosaki.motohiro, srivatsa.bhat, linux-pm,
	linux-mm, linux-kernel

Today, the MM subsystem uses the buddy 'Page Allocator' to manage memory
at a 'page' granularity. But this allocator has no notion of the physical
topology of the underlying memory hardware, which makes it hard to
influence memory allocation decisions while keeping the platform's
constraints in mind.

So we need to augment the page-allocator with a new entity that manages
memory at a much larger granularity, keeping the underlying platform
characteristics and the memory hardware topology in mind.

To that end, introduce a "Memory Region Allocator" as a backend to the
existing "Page Allocator".


Splitting the memory allocator into a Page-Allocator front-end and a
Region-Allocator backend:


                 Page Allocator          |      Memory Region Allocator
                                         -
           __    __    __                |    ________    ________
          |__|--|__|--|__|-- ...         -   |        |  |        |
           ____    ____    ____          |   |        |  |        |
          |____|--|____|--|____|-- ...   -   |        |--|        |-- ...
                                         |   |        |  |        |
                                         -   |________|  |________|
                                         |
                                         -
             Manages pages using         |     Manages memory regions
              buddy freelists            -  (allocates and frees entire
                                         |   memory regions, i.e., at a
                                         -   memory-region granularity)


The flow of memory allocations/frees between entities requesting memory
(applications/kernel) and the MM subsystem:

                  pages               regions
  Applications <========>   Page    <========>  Memory Region
   and Kernel             Allocator               Allocator



Since the region allocator is supposed to function as a backend to the
page allocator, and the page allocator itself is per-zone, we implement
the region allocator on a per-zone basis as well.

Signed-off-by: Srivatsa S. Bhat <srivatsa.bhat@linux.vnet.ibm.com>
---

 include/linux/mmzone.h |   17 +++++++++++++++++
 mm/page_alloc.c        |   19 +++++++++++++++++++
 2 files changed, 36 insertions(+)

diff --git a/include/linux/mmzone.h b/include/linux/mmzone.h
index ef602a8..c2956dd 100644
--- a/include/linux/mmzone.h
+++ b/include/linux/mmzone.h
@@ -112,6 +112,21 @@ struct free_area {
 	unsigned long		nr_free;
 };
 
+/* A simplified free_area for managing entire memory regions */
+struct free_area_region {
+	struct list_head	list;
+	unsigned long		nr_free;
+};
+
+struct mem_region {
+	struct free_area_region	region_area[MAX_ORDER];
+};
+
+struct region_allocator {
+	struct mem_region	region[MAX_NR_ZONE_REGIONS];
+	int			next_region;
+};
+
 struct pglist_data;
 
 /*
@@ -405,6 +420,8 @@ struct zone {
 	struct zone_mem_region	zone_regions[MAX_NR_ZONE_REGIONS];
 	int 			nr_zone_regions;
 
+	struct region_allocator	region_allocator;
+
 #ifndef CONFIG_SPARSEMEM
 	/*
 	 * Flags for a pageblock_nr_pages block. See pageblock-flags.h.
diff --git a/mm/page_alloc.c b/mm/page_alloc.c
index 0cc2a3e..905360c 100644
--- a/mm/page_alloc.c
+++ b/mm/page_alloc.c
@@ -5209,6 +5209,23 @@ static void __meminit zone_init_free_lists_late(struct zone *zone)
 	}
 }
 
+static void __meminit init_zone_region_allocator(struct zone *zone)
+{
+	struct free_area_region *area;
+	int i, j;
+
+	for (i = 0; i < zone->nr_zone_regions; i++) {
+		area = zone->region_allocator.region[i].region_area;
+
+		for (j = 0; j < MAX_ORDER; j++) {
+			INIT_LIST_HEAD(&area[j].list);
+			area[j].nr_free = 0;
+		}
+	}
+
+	zone->region_allocator.next_region = -1;
+}
+
 static void __meminit init_zone_memory_regions(struct pglist_data *pgdat)
 {
 	unsigned long start_pfn, end_pfn, absent;
@@ -5258,6 +5275,8 @@ static void __meminit init_zone_memory_regions(struct pglist_data *pgdat)
 
 		zone_init_free_lists_late(z);
 
+		init_zone_region_allocator(z);
+
 		/*
 		 * Revisit the last visited node memory region, in case it
 		 * spans multiple zones.
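
A conceptual model of the split described above, as a standalone sketch
(this is not the kernel implementation; region_backend, page_frontend,
take_region_from_backend() and the sizes used are all illustrative): the
page-allocator front-end serves individual pages, and pulls in an entire
region's worth of memory from the region-allocator back-end only when
its own freelists run dry.

#include <stdio.h>

#define NR_REGIONS		4
#define PAGES_PER_REGION	1024

struct region_backend {
	int region_is_free[NR_REGIONS];	/* 1 if still held by back-end */
};

struct page_frontend {
	unsigned long free_pages;	/* pages currently with the buddy */
	struct region_backend *backend;
};

/* Ask the back-end for one whole region; returns its id or -1. */
static int take_region_from_backend(struct page_frontend *fe)
{
	int i;

	for (i = 0; i < NR_REGIONS; i++) {
		if (fe->backend->region_is_free[i]) {
			fe->backend->region_is_free[i] = 0;
			fe->free_pages += PAGES_PER_REGION;
			return i;
		}
	}
	return -1;
}

/* Allocate one page via the front-end, pulling in a region if needed. */
static int alloc_one_page(struct page_frontend *fe)
{
	if (fe->free_pages == 0 && take_region_from_backend(fe) < 0)
		return -1;		/* truly out of memory */
	fe->free_pages--;
	return 0;
}

int main(void)
{
	struct region_backend be = { .region_is_free = { 1, 1, 1, 1 } };
	struct page_frontend fe = { .free_pages = 0, .backend = &be };

	alloc_one_page(&fe);	/* pulls region 0 in, then uses one page */
	printf("front-end free pages: %lu\n", fe.free_pages);	/* 1023 */
	return 0;
}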


^ permalink raw reply related	[flat|nested] 100+ messages in thread

* [RFC PATCH v3 19/35] mm: Add a mechanism to add pages to buddy freelists in bulk
  2013-08-30 13:13 ` Srivatsa S. Bhat
@ 2013-08-30 13:19   ` Srivatsa S. Bhat
  -1 siblings, 0 replies; 100+ messages in thread
From: Srivatsa S. Bhat @ 2013-08-30 13:19 UTC (permalink / raw)
  To: akpm, mgorman, hannes, tony.luck, matthew.garrett, dave, riel,
	arjan, srinivas.pandruvada, willy, kamezawa.hiroyu, lenb, rjw
  Cc: gargankita, paulmck, svaidy, andi, isimatu.yasuaki,
	santosh.shilimkar, kosaki.motohiro, srivatsa.bhat, linux-pm,
	linux-mm, linux-kernel

When the buddy page allocator requests memory from the region allocator,
it gets all the freepages belonging to an entire region at once. So, to
make this efficient, we need a way to add all those pages to the buddy
freelists in one shot. Add this support, and also take care to update the
nr_free statistics properly.

Signed-off-by: Srivatsa S. Bhat <srivatsa.bhat@linux.vnet.ibm.com>
---

 mm/page_alloc.c |   46 ++++++++++++++++++++++++++++++++++++++++++++++
 1 file changed, 46 insertions(+)

diff --git a/mm/page_alloc.c b/mm/page_alloc.c
index 905360c..b66ddff 100644
--- a/mm/page_alloc.c
+++ b/mm/page_alloc.c
@@ -692,6 +692,52 @@ out:
 	set_region_bit(region_id, free_list);
 }
 
+/*
+ * Add all the freepages contained in 'list' to the buddy freelist
+ * 'free_list'. Using suitable list-manipulation tricks, we move the
+ * pages between the lists in one shot.
+ */
+static void add_to_freelist_bulk(struct list_head *list,
+				 struct free_list *free_list, int order,
+				 int region_id)
+{
+	struct list_head *cur, *position;
+	struct mem_region_list *region;
+	unsigned long nr_pages = 0;
+	struct free_area *area;
+	struct page *page;
+
+	if (list_empty(list))
+		return;
+
+	page = list_first_entry(list, struct page, lru);
+	list_del(&page->lru);
+
+	/*
+	 * Add one page using add_to_freelist() so that it sets up the
+	 * region related data-structures of the freelist properly.
+	 */
+	add_to_freelist(page, free_list, order);
+
+	/* Now add the rest of the pages in bulk */
+	list_for_each(cur, list)
+		nr_pages++;
+
+	position = free_list->mr_list[region_id].page_block;
+	list_splice_tail(list, position);
+
+
+	/* Update the statistics */
+	region = &free_list->mr_list[region_id];
+	region->nr_free += nr_pages;
+
+	area = &(page_zone(page)->free_area[order]);
+	area->nr_free += nr_pages + 1;
+
+	/* Fix up the zone region stats, since add_to_freelist() altered it */
+	region->zone_region->nr_free -= 1 << order;
+}
+
 /**
  * __rmqueue_smallest() *always* deletes elements from the head of the
  * list. Use this knowledge to keep page allocation fast, despite being
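
The bulk-add relies on splicing a whole chain of pages onto a buddy
freelist with constant-time pointer surgery (list_splice_tail() in the
patch above), walking the chain only to fix up the nr_free statistics.
The sketch below is a minimal userspace model of that idea using a
hand-rolled circular doubly-linked list; it is not the kernel's list
implementation, and all names are illustrative.

#include <stdio.h>

struct node {
	struct node *prev, *next;
};

static void list_init(struct node *head)
{
	head->prev = head->next = head;
}

static int is_empty(const struct node *head)
{
	return head->next == head;
}

static void add_tail(struct node *entry, struct node *head)
{
	entry->prev = head->prev;
	entry->next = head;
	head->prev->next = entry;
	head->prev = entry;
}

/* Move every node on 'src' to the tail of 'dst'; return how many moved. */
static unsigned long splice_tail_count(struct node *src, struct node *dst)
{
	unsigned long nr = 0;
	struct node *cur;

	if (is_empty(src))
		return 0;

	for (cur = src->next; cur != src; cur = cur->next)
		nr++;				/* count for the stats */

	/* Constant-time splice: link src's whole chain before dst's head */
	src->next->prev = dst->prev;
	dst->prev->next = src->next;
	src->prev->next = dst;
	dst->prev = src->prev;
	list_init(src);				/* src is now empty */

	return nr;
}

int main(void)
{
	struct node pages[4], src, dst;
	unsigned long moved;
	int i;

	list_init(&src);
	list_init(&dst);
	for (i = 0; i < 4; i++)
		add_tail(&pages[i], &src);

	moved = splice_tail_count(&src, &dst);
	printf("moved %lu nodes in one splice\n", moved);	/* 4 */
	return 0;
}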


^ permalink raw reply related	[flat|nested] 100+ messages in thread

* [RFC PATCH v3 20/35] mm: Provide a mechanism to delete pages from buddy freelists in bulk
  2013-08-30 13:13 ` Srivatsa S. Bhat
@ 2013-08-30 13:20   ` Srivatsa S. Bhat
  -1 siblings, 0 replies; 100+ messages in thread
From: Srivatsa S. Bhat @ 2013-08-30 13:20 UTC (permalink / raw)
  To: akpm, mgorman, hannes, tony.luck, matthew.garrett, dave, riel,
	arjan, srinivas.pandruvada, willy, kamezawa.hiroyu, lenb, rjw
  Cc: gargankita, paulmck, svaidy, andi, isimatu.yasuaki,
	santosh.shilimkar, kosaki.motohiro, srivatsa.bhat, linux-pm,
	linux-mm, linux-kernel

When the buddy allocator releases excess free memory to the region
allocator, it does so at region granularity - that is, it releases all
the freepages belonging to that region at once. So, to make this
efficient, we need a way to delete all those pages from the buddy
freelists in one shot. Add this support, and also take care to update
the nr_free statistics properly.

Signed-off-by: Srivatsa S. Bhat <srivatsa.bhat@linux.vnet.ibm.com>
---

 mm/page_alloc.c |   55 +++++++++++++++++++++++++++++++++++++++++++++++++++++++
 1 file changed, 55 insertions(+)

diff --git a/mm/page_alloc.c b/mm/page_alloc.c
index b66ddff..5227ac3 100644
--- a/mm/page_alloc.c
+++ b/mm/page_alloc.c
@@ -853,6 +853,61 @@ page_found:
 	}
 }
 
+/*
+ * Delete all freepages belonging to the region 'region_id' from 'free_list'
+ * and move them to 'list'. Using suitable list-manipulation tricks, we move
+ * the pages between the lists in one shot.
+ */
+static void del_from_freelist_bulk(struct list_head *list,
+				   struct free_list *free_list, int order,
+				   int region_id)
+{
+	struct mem_region_list *region, *prev_region;
+	unsigned long nr_pages = 0;
+	struct free_area *area;
+	struct list_head *cur;
+	struct page *page;
+	int prev_region_id;
+
+	region = &free_list->mr_list[region_id];
+
+	/*
+	 * Perform bulk movement of all pages of the region to the new list,
+	 * except the page pointed to by region->pageblock.
+	 */
+	prev_region_id = find_prev_region(region_id, free_list);
+	if (prev_region_id < 0) {
+		/* This is the first region on the list */
+		list_cut_position(list, &free_list->list,
+				  region->page_block->prev);
+	} else {
+		prev_region = &free_list->mr_list[prev_region_id];
+		list_cut_position(list, prev_region->page_block,
+				  region->page_block->prev);
+	}
+
+	list_for_each(cur, list)
+		nr_pages++;
+
+	region->nr_free -= nr_pages;
+
+	/*
+	 * Now delete the page pointed to by region->page_block using
+	 * del_from_freelist(), so that it sets up the region related
+	 * data-structures of the freelist properly.
+	 */
+	page = list_entry(region->page_block, struct page, lru);
+	del_from_freelist(page, free_list, order);
+
+	list_add_tail(&page->lru, list);
+
+	area = &(page_zone(page)->free_area[order]);
+	area->nr_free -= nr_pages + 1;
+
+	/* Fix up the zone region stats, since del_from_freelist() altered it */
+	region->zone_region->nr_free += 1 << order;
+}
+
 /**
  * Move a given page from one freelist to another.
  */


^ permalink raw reply related	[flat|nested] 100+ messages in thread

* [RFC PATCH v3 21/35] mm: Provide a mechanism to release free memory to the region allocator
  2013-08-30 13:13 ` Srivatsa S. Bhat
@ 2013-08-30 13:20   ` Srivatsa S. Bhat
  -1 siblings, 0 replies; 100+ messages in thread
From: Srivatsa S. Bhat @ 2013-08-30 13:20 UTC (permalink / raw)
  To: akpm, mgorman, hannes, tony.luck, matthew.garrett, dave, riel,
	arjan, srinivas.pandruvada, willy, kamezawa.hiroyu, lenb, rjw
  Cc: gargankita, paulmck, svaidy, andi, isimatu.yasuaki,
	santosh.shilimkar, kosaki.motohiro, srivatsa.bhat, linux-pm,
	linux-mm, linux-kernel

Implement helper functions to release freepages from the buddy freelists to
the region allocator.

For simplicity, all operations related to the region allocator are performed
at the granularity of entire memory regions. That is, when we release
freepages to the region allocator, we release all the freepages belonging to
that region at once.
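
To illustrate, the intended trigger looks roughly like this (a minimal
sketch; the actual hook into the buddy free path is wired up later in this
series, and 'region_is_fully_free' is only a placeholder condition):

    /* All freepages of region 'rid' now sit on 'free_list': hand the
     * whole region over to the region allocator in one shot.
     */
    if (region_is_fully_free)
        add_to_region_allocator(zone, free_list, rid);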

Signed-off-by: Srivatsa S. Bhat <srivatsa.bhat@linux.vnet.ibm.com>
---

 mm/page_alloc.c |   20 ++++++++++++++++++++
 1 file changed, 20 insertions(+)

diff --git a/mm/page_alloc.c b/mm/page_alloc.c
index 5227ac3..d407caf 100644
--- a/mm/page_alloc.c
+++ b/mm/page_alloc.c
@@ -918,6 +918,26 @@ static void move_page_freelist(struct page *page, struct free_list *old_list,
 	add_to_freelist(page, new_list, order);
 }
 
+/* Add pages from the given buddy freelist to the region allocator */
+static void add_to_region_allocator(struct zone *z, struct free_list *free_list,
+				    int region_id)
+{
+	struct region_allocator *reg_alloc;
+	struct list_head *ralloc_list;
+	int order;
+
+	if (WARN_ON(list_empty(&free_list->list)))
+		return;
+
+	order = page_order(list_first_entry(&free_list->list,
+					    struct page, lru));
+
+	reg_alloc = &z->region_allocator;
+	ralloc_list = &reg_alloc->region[region_id].region_area[order].list;
+
+	del_from_freelist_bulk(ralloc_list, free_list, order, region_id);
+}
+
 /*
  * Freeing function for a buddy system allocator.
  *


^ permalink raw reply related	[flat|nested] 100+ messages in thread

* [RFC PATCH v3 22/35] mm: Provide a mechanism to request free memory from the region allocator
  2013-08-30 13:13 ` Srivatsa S. Bhat
@ 2013-08-30 13:20   ` Srivatsa S. Bhat
  -1 siblings, 0 replies; 100+ messages in thread
From: Srivatsa S. Bhat @ 2013-08-30 13:20 UTC (permalink / raw)
  To: akpm, mgorman, hannes, tony.luck, matthew.garrett, dave, riel,
	arjan, srinivas.pandruvada, willy, kamezawa.hiroyu, lenb, rjw
  Cc: gargankita, paulmck, svaidy, andi, isimatu.yasuaki,
	santosh.shilimkar, kosaki.motohiro, srivatsa.bhat, linux-pm,
	linux-mm, linux-kernel

Implement helper functions to request freepages from the region allocator
in order to add them to the buddy freelists.

For simplicity, all operations related to the region allocator are performed
at the granularity of entire memory regions. That is, when the buddy
allocator requests freepages from the region allocator, the latter picks a
free region and hands over all the freepages belonging to that region.
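
A sketch of the intended caller (the actual hook into __rmqueue() is added
later in this series; shown here only to clarify the interface):

    /* Try to refill the buddy freelists of 'migratetype' with one
     * whole region's worth of MAX_ORDER-1 pages.
     */
    if (del_from_region_allocator(zone, MAX_ORDER-1, migratetype) == 0) {
        /* The pages are now on
         * zone->free_area[MAX_ORDER-1].free_list[migratetype].
         */
    } else {
        /* -ENOMEM: the region allocator has no free regions. */
    }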

Signed-off-by: Srivatsa S. Bhat <srivatsa.bhat@linux.vnet.ibm.com>
---

 mm/page_alloc.c |   23 +++++++++++++++++++++++
 1 file changed, 23 insertions(+)

diff --git a/mm/page_alloc.c b/mm/page_alloc.c
index d407caf..5b58e7d 100644
--- a/mm/page_alloc.c
+++ b/mm/page_alloc.c
@@ -938,6 +938,29 @@ static void add_to_region_allocator(struct zone *z, struct free_list *free_list,
 	del_from_freelist_bulk(ralloc_list, free_list, order, region_id);
 }
 
+/* Delete freepages from the region allocator and add them to buddy freelists */
+static int del_from_region_allocator(struct zone *zone, unsigned int order,
+				     int migratetype)
+{
+	struct region_allocator *reg_alloc;
+	struct list_head *ralloc_list;
+	struct free_list *free_list;
+	int next_region;
+
+	reg_alloc = &zone->region_allocator;
+
+	next_region = reg_alloc->next_region;
+	if (next_region < 0)
+		return -ENOMEM;
+
+	ralloc_list = &reg_alloc->region[next_region].region_area[order].list;
+	free_list = &zone->free_area[order].free_list[migratetype];
+
+	add_to_freelist_bulk(ralloc_list, free_list, order, next_region);
+
+	return 0;
+}
+
 /*
  * Freeing function for a buddy system allocator.
  *


^ permalink raw reply related	[flat|nested] 100+ messages in thread

* [RFC PATCH v3 23/35] mm: Maintain the counter for freepages in the region allocator
  2013-08-30 13:13 ` Srivatsa S. Bhat
@ 2013-08-30 13:21   ` Srivatsa S. Bhat
  -1 siblings, 0 replies; 100+ messages in thread
From: Srivatsa S. Bhat @ 2013-08-30 13:21 UTC (permalink / raw)
  To: akpm, mgorman, hannes, tony.luck, matthew.garrett, dave, riel,
	arjan, srinivas.pandruvada, willy, kamezawa.hiroyu, lenb, rjw
  Cc: gargankita, paulmck, svaidy, andi, isimatu.yasuaki,
	santosh.shilimkar, kosaki.motohiro, srivatsa.bhat, linux-pm,
	linux-mm, linux-kernel

We have a field named 'nr_free' for every memory region in the region
allocator. Keep it updated with the count of freepages in that region.

We already run a loop while moving freepages in bulk between the buddy
allocator and the region allocator. Reuse that to update the freepages
count as well.
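
As a worked example (assuming 4 KB pages, MAX_ORDER-1 == 10 and a 128 MB
memory region, i.e. 32 blocks of 4 MB each), the counter behaves as follows
on the release path:

    nr_pages = del_from_freelist_bulk(ralloc_list, free_list, order,
                                      region_id);
    WARN_ON(reg_area->nr_free != 0);   /* area was empty before the move */
    reg_area->nr_free += nr_pages;     /* 32 in the example above        */

and it drops back to 0 when the region is later handed back to the buddy
freelists via add_to_freelist_bulk().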

Signed-off-by: Srivatsa S. Bhat <srivatsa.bhat@linux.vnet.ibm.com>
---

 mm/page_alloc.c |   45 ++++++++++++++++++++++++++++++++++-----------
 1 file changed, 34 insertions(+), 11 deletions(-)

diff --git a/mm/page_alloc.c b/mm/page_alloc.c
index 5b58e7d..78ae8f6 100644
--- a/mm/page_alloc.c
+++ b/mm/page_alloc.c
@@ -696,10 +696,12 @@ out:
  * Add all the freepages contained in 'list' to the buddy freelist
  * 'free_list'. Using suitable list-manipulation tricks, we move the
  * pages between the lists in one shot.
+ *
+ * Returns the number of pages moved.
  */
-static void add_to_freelist_bulk(struct list_head *list,
-				 struct free_list *free_list, int order,
-				 int region_id)
+static unsigned long
+add_to_freelist_bulk(struct list_head *list, struct free_list *free_list,
+		     int order, int region_id)
 {
 	struct list_head *cur, *position;
 	struct mem_region_list *region;
@@ -708,7 +710,7 @@ static void add_to_freelist_bulk(struct list_head *list,
 	struct page *page;
 
 	if (list_empty(list))
-		return;
+		return 0;
 
 	page = list_first_entry(list, struct page, lru);
 	list_del(&page->lru);
@@ -736,6 +738,8 @@ static void add_to_freelist_bulk(struct list_head *list,
 
 	/* Fix up the zone region stats, since add_to_freelist() altered it */
 	region->zone_region->nr_free -= 1 << order;
+
+	return nr_pages + 1;
 }
 
 /**
@@ -857,10 +861,12 @@ page_found:
  * Delete all freepages belonging to the region 'region_id' from 'free_list'
  * and move them to 'list'. Using suitable list-manipulation tricks, we move
  * the pages between the lists in one shot.
+ *
+ * Returns the number of pages moved.
  */
-static void del_from_freelist_bulk(struct list_head *list,
-				   struct free_list *free_list, int order,
-				   int region_id)
+static unsigned long
+del_from_freelist_bulk(struct list_head *list, struct free_list *free_list,
+		       int order, int region_id)
 {
 	struct mem_region_list *region, *prev_region;
 	unsigned long nr_pages = 0;
@@ -906,6 +912,8 @@ static void del_from_freelist_bulk(struct list_head *list,
 
 	/* Fix up the zone region stats, since del_from_freelist() altered it */
 	region->zone_region->nr_free += 1 << order;
+
+	return nr_pages + 1;
 }
 
 /**
@@ -923,7 +931,9 @@ static void add_to_region_allocator(struct zone *z, struct free_list *free_list,
 				    int region_id)
 {
 	struct region_allocator *reg_alloc;
+	struct free_area_region *reg_area;
 	struct list_head *ralloc_list;
+	unsigned long nr_pages;
 	int order;
 
 	if (WARN_ON(list_empty(&free_list->list)))
@@ -933,9 +943,14 @@ static void add_to_region_allocator(struct zone *z, struct free_list *free_list,
 					    struct page, lru));
 
 	reg_alloc = &z->region_allocator;
-	ralloc_list = &reg_alloc->region[region_id].region_area[order].list;
+	reg_area = &reg_alloc->region[region_id].region_area[order];
+	ralloc_list = &reg_area->list;
+
+	nr_pages = del_from_freelist_bulk(ralloc_list, free_list, order,
+					  region_id);
 
-	del_from_freelist_bulk(ralloc_list, free_list, order, region_id);
+	WARN_ON(reg_area->nr_free != 0);
+	reg_area->nr_free += nr_pages;
 }
 
 /* Delete freepages from the region allocator and add them to buddy freelists */
@@ -943,8 +958,10 @@ static int del_from_region_allocator(struct zone *zone, unsigned int order,
 				     int migratetype)
 {
 	struct region_allocator *reg_alloc;
+	struct free_area_region *reg_area;
 	struct list_head *ralloc_list;
 	struct free_list *free_list;
+	unsigned long nr_pages;
 	int next_region;
 
 	reg_alloc = &zone->region_allocator;
@@ -953,10 +970,16 @@ static int del_from_region_allocator(struct zone *zone, unsigned int order,
 	if (next_region < 0)
 		return -ENOMEM;
 
-	ralloc_list = &reg_alloc->region[next_region].region_area[order].list;
+	reg_area = &reg_alloc->region[next_region].region_area[order];
+	ralloc_list = &reg_area->list;
+
 	free_list = &zone->free_area[order].free_list[migratetype];
 
-	add_to_freelist_bulk(ralloc_list, free_list, order, next_region);
+	nr_pages = add_to_freelist_bulk(ralloc_list, free_list, order,
+					next_region);
+
+	reg_area->nr_free -= nr_pages;
+	WARN_ON(reg_area->nr_free != 0);
 
 	return 0;
 }


^ permalink raw reply related	[flat|nested] 100+ messages in thread

* [RFC PATCH v3 24/35] mm: Propagate the sorted-buddy bias for picking free regions, to region allocator
  2013-08-30 13:13 ` Srivatsa S. Bhat
@ 2013-08-30 13:21   ` Srivatsa S. Bhat
  -1 siblings, 0 replies; 100+ messages in thread
From: Srivatsa S. Bhat @ 2013-08-30 13:21 UTC (permalink / raw)
  To: akpm, mgorman, hannes, tony.luck, matthew.garrett, dave, riel,
	arjan, srinivas.pandruvada, willy, kamezawa.hiroyu, lenb, rjw
  Cc: gargankita, paulmck, svaidy, andi, isimatu.yasuaki,
	santosh.shilimkar, kosaki.motohiro, srivatsa.bhat, linux-pm,
	linux-mm, linux-kernel

The sorted-buddy page allocator keeps the buddy freelists sorted region-wise,
and tries to pick lower-numbered regions while allocating pages. The idea is
to fill up memory in the increasing order of region number.

Propagate the same bias to the region allocator as well. That is, make it
favor lower-numbered regions while handing out regions to the page allocator.
To do this efficiently, add a bitmap to represent the regions in the region
allocator, and use bitmap operations to manage these regions and to pick the
lowest-numbered free region efficiently.
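
In essence, the new bookkeeping boils down to the following (condensed from
the diff below; not a literal excerpt):

    /* A region just became fully free and was handed to the region
     * allocator: remember it, and keep next_region pointing at the
     * lowest-numbered free region.
     */
    set_bit(region_id, reg_alloc->ralloc_mask);
    if (reg_alloc->next_region < 0 || region_id < reg_alloc->next_region)
        reg_alloc->next_region = region_id;

    /* A region was just handed back to the buddy freelists: pick the
     * next lowest-numbered free region, or -1 if none are left.
     */
    clear_bit(region_id, reg_alloc->ralloc_mask);
    next = find_first_bit(reg_alloc->ralloc_mask, MAX_NR_ZONE_REGIONS);
    reg_alloc->next_region = (next >= MAX_NR_ZONE_REGIONS) ? -1 : next;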

Signed-off-by: Srivatsa S. Bhat <srivatsa.bhat@linux.vnet.ibm.com>
---

 include/linux/mmzone.h |    1 +
 mm/page_alloc.c        |   19 ++++++++++++++++++-
 2 files changed, 19 insertions(+), 1 deletion(-)

diff --git a/include/linux/mmzone.h b/include/linux/mmzone.h
index c2956dd..8c6e9f1 100644
--- a/include/linux/mmzone.h
+++ b/include/linux/mmzone.h
@@ -125,6 +125,7 @@ struct mem_region {
 struct region_allocator {
 	struct mem_region	region[MAX_NR_ZONE_REGIONS];
 	int			next_region;
+	DECLARE_BITMAP(ralloc_mask, MAX_NR_ZONE_REGIONS);
 };
 
 struct pglist_data;
diff --git a/mm/page_alloc.c b/mm/page_alloc.c
index 78ae8f6..7e82872a 100644
--- a/mm/page_alloc.c
+++ b/mm/page_alloc.c
@@ -934,7 +934,7 @@ static void add_to_region_allocator(struct zone *z, struct free_list *free_list,
 	struct free_area_region *reg_area;
 	struct list_head *ralloc_list;
 	unsigned long nr_pages;
-	int order;
+	int order, *next_region;
 
 	if (WARN_ON(list_empty(&free_list->list)))
 		return;
@@ -951,6 +951,13 @@ static void add_to_region_allocator(struct zone *z, struct free_list *free_list,
 
 	WARN_ON(reg_area->nr_free != 0);
 	reg_area->nr_free += nr_pages;
+
+	set_bit(region_id, reg_alloc->ralloc_mask);
+	next_region = &reg_alloc->next_region;
+
+	if ((*next_region < 0) ||
+			(*next_region > 0 && region_id < *next_region))
+		*next_region = region_id;
 }
 
 /* Delete freepages from the region allocator and add them to buddy freelists */
@@ -981,6 +988,16 @@ static int del_from_region_allocator(struct zone *zone, unsigned int order,
 	reg_area->nr_free -= nr_pages;
 	WARN_ON(reg_area->nr_free != 0);
 
+	/* Pick a new next_region */
+	clear_bit(next_region, reg_alloc->ralloc_mask);
+	next_region = find_first_bit(reg_alloc->ralloc_mask,
+				     MAX_NR_ZONE_REGIONS);
+
+	if (next_region >= MAX_NR_ZONE_REGIONS)
+		next_region = -1; /* No free regions available */
+
+	reg_alloc->next_region = next_region;
+
 	return 0;
 }
 


^ permalink raw reply related	[flat|nested] 100+ messages in thread

* [RFC PATCH v3 25/35] mm: Fix vmstat to also account for freepages in the region allocator
  2013-08-30 13:13 ` Srivatsa S. Bhat
@ 2013-08-30 13:21   ` Srivatsa S. Bhat
  -1 siblings, 0 replies; 100+ messages in thread
From: Srivatsa S. Bhat @ 2013-08-30 13:21 UTC (permalink / raw)
  To: akpm, mgorman, hannes, tony.luck, matthew.garrett, dave, riel,
	arjan, srinivas.pandruvada, willy, kamezawa.hiroyu, lenb, rjw
  Cc: gargankita, paulmck, svaidy, andi, isimatu.yasuaki,
	santosh.shilimkar, kosaki.motohiro, srivatsa.bhat, linux-pm,
	linux-mm, linux-kernel

Currently vmstat considers only the freepages present in the buddy freelists
of the page allocator. But with the newly introduced region allocator in
place, freepages could be present in the region allocator as well. So teach
vmstat to take them into consideration when reporting free memory.
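
In other words, the count reported for each (region, order) cell now sums
both sources of freepages (a condensed restatement of the diff below; 'i' is
the region index and 't' the migratetype):

    /* freepages still on the buddy freelists */
    nr_free += area->free_list[t].mr_list[i].nr_free;

    /* plus freepages currently parked in the region allocator */
    nr_free += zone->region_allocator.region[i].region_area[order].nr_free;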

Signed-off-by: Srivatsa S. Bhat <srivatsa.bhat@linux.vnet.ibm.com>
---

 mm/vmstat.c |    8 ++++++++
 1 file changed, 8 insertions(+)

diff --git a/mm/vmstat.c b/mm/vmstat.c
index 924babc..8cb7a10 100644
--- a/mm/vmstat.c
+++ b/mm/vmstat.c
@@ -829,6 +829,8 @@ static void frag_show_print(struct seq_file *m, pg_data_t *pgdat,
 {
 	int i, order, t;
 	struct free_area *area;
+	struct free_area_region *reg_area;
+	struct region_allocator *reg_alloc;
 
 	seq_printf(m, "Node %d, zone %8s \n", pgdat->node_id, zone->name);
 
@@ -845,6 +847,12 @@ static void frag_show_print(struct seq_file *m, pg_data_t *pgdat,
 				nr_free +=
 					area->free_list[t].mr_list[i].nr_free;
 			}
+
+			/* Add up freepages in the region allocator as well */
+			reg_alloc = &zone->region_allocator;
+			reg_area = &reg_alloc->region[i].region_area[order];
+			nr_free += reg_area->nr_free;
+
 			seq_printf(m, "%6lu ", nr_free);
 		}
 		seq_putc(m, '\n');


^ permalink raw reply related	[flat|nested] 100+ messages in thread

* [RFC PATCH v3 26/35] mm: Drop some very expensive sorted-buddy related checks under DEBUG_PAGEALLOC
  2013-08-30 13:13 ` Srivatsa S. Bhat
@ 2013-08-30 13:22   ` Srivatsa S. Bhat
  -1 siblings, 0 replies; 100+ messages in thread
From: Srivatsa S. Bhat @ 2013-08-30 13:22 UTC (permalink / raw)
  To: akpm, mgorman, hannes, tony.luck, matthew.garrett, dave, riel,
	arjan, srinivas.pandruvada, willy, kamezawa.hiroyu, lenb, rjw
  Cc: gargankita, paulmck, svaidy, andi, isimatu.yasuaki,
	santosh.shilimkar, kosaki.motohiro, srivatsa.bhat, linux-pm,
	linux-mm, linux-kernel

Under CONFIG_DEBUG_PAGEALLOC, we have numerous checks and balances to verify
the correctness of various sorted-buddy operations. But some of them are very
expensive and hence can't be enabled while benchmarking the code.
(They should be used only to verify that the code is working correctly, as a
precursor to benchmarking the performance).

The check that verifies whether a page given as input to del_from_freelist()
indeed belongs to that freelist is one such very expensive check, since it
walks the entire freelist. Disable it under '#if 0' for now.

Signed-off-by: Srivatsa S. Bhat <srivatsa.bhat@linux.vnet.ibm.com>
---

 mm/page_alloc.c |    2 ++
 1 file changed, 2 insertions(+)

diff --git a/mm/page_alloc.c b/mm/page_alloc.c
index 7e82872a..9be946e 100644
--- a/mm/page_alloc.c
+++ b/mm/page_alloc.c
@@ -811,6 +811,7 @@ static void del_from_freelist(struct page *page, struct free_list *free_list,
 #ifdef CONFIG_DEBUG_PAGEALLOC
 	WARN(region->nr_free < 0, "%s: nr_free is negative\n", __func__);
 
+#if 0
 	/* Verify whether this page indeed belongs to this free list! */
 
 	list_for_each(p, &free_list->list) {
@@ -819,6 +820,7 @@ static void del_from_freelist(struct page *page, struct free_list *free_list,
 	}
 
 	WARN(1, "%s: page doesn't belong to the given freelist!\n", __func__);
+#endif
 
 page_found:
 #endif


^ permalink raw reply related	[flat|nested] 100+ messages in thread

* [RFC PATCH v3 27/35] mm: Connect Page Allocator(PA) to Region Allocator(RA); add PA => RA flow
  2013-08-30 13:13 ` Srivatsa S. Bhat
@ 2013-08-30 13:22   ` Srivatsa S. Bhat
  -1 siblings, 0 replies; 100+ messages in thread
From: Srivatsa S. Bhat @ 2013-08-30 13:22 UTC (permalink / raw)
  To: akpm, mgorman, hannes, tony.luck, matthew.garrett, dave, riel,
	arjan, srinivas.pandruvada, willy, kamezawa.hiroyu, lenb, rjw
  Cc: gargankita, paulmck, svaidy, andi, isimatu.yasuaki,
	santosh.shilimkar, kosaki.motohiro, srivatsa.bhat, linux-pm,
	linux-mm, linux-kernel

Now that we have built up an infrastructure that forms a "Memory Region
Allocator", connect it with the page allocator. To entities requesting
memory, the page allocator will function as a front-end, whereas the
region allocator will act as a back-end to the page allocator.
(Analogy: page allocator is like free cash, whereas region allocator
is like a bank).

Implement the flow of freepages from the page allocator to the region
allocator. When the buddy freelists notice that they have all the freepages
forming a memory region, they give it back to the region allocator.

Simplification: We assume that the freepages of a memory region can be
completely represented by a set of pages of order MAX_ORDER-1. That is, we
only need to consider the buddy freelists corresponding to MAX_ORDER-1, while
interacting with the region allocator. Furthermore, we assume that
pageblock_order == MAX_ORDER-1.

(These assumptions are used to ease the implementation, so that one can
quickly evaluate the benefits of the overall design without getting
bogged down by too many corner cases and constraints. Of course future
implementations will handle more scenarios and will have reduced dependence
on such simplifying assumptions.)
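
As a worked example of the "fully free region" test (assuming 4 KB pages,
MAX_ORDER-1 == 10, i.e. 4 MB blocks, and a 128 MB memory region):

    zone_region->present_pages == 32768    /* 128 MB / 4 KB            */
    zone_region->nr_free       == 32768    /* region completely free   */
    region->nr_free            == 32       /* order-10 blocks listed   */

    32 * (1 << 10) == 32768, so can_return_region() says yes and
    add_to_freelist() hands the region over via add_to_region_allocator().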

Signed-off-by: Srivatsa S. Bhat <srivatsa.bhat@linux.vnet.ibm.com>
---

 mm/page_alloc.c |   42 +++++++++++++++++++++++++++++++++++++++++-
 1 file changed, 41 insertions(+), 1 deletion(-)

diff --git a/mm/page_alloc.c b/mm/page_alloc.c
index 9be946e..b8af5a2 100644
--- a/mm/page_alloc.c
+++ b/mm/page_alloc.c
@@ -634,6 +634,37 @@ out:
 	return prev_region_id;
 }
 
+
+static void add_to_region_allocator(struct zone *z, struct free_list *free_list,
+				    int region_id);
+
+
+static inline int can_return_region(struct mem_region_list *region, int order)
+{
+	struct zone_mem_region *zone_region;
+
+	zone_region = region->zone_region;
+
+	if (likely(zone_region->nr_free != zone_region->present_pages))
+		return 0;
+
+	/*
+	 * Don't release freepages to the region allocator if some other
+	 * buddy pages can potentially merge with our freepages to form
+	 * higher order pages.
+	 *
+	 * Hack: Don't return the region unless all the freepages are of
+	 * order MAX_ORDER-1.
+	 */
+	if (likely(order != MAX_ORDER-1))
+		return 0;
+
+	if (region->nr_free * (1 << order) == zone_region->nr_free)
+		return 1;
+
+	return 0;
+}
+
 static void add_to_freelist(struct page *page, struct free_list *free_list,
 			    int order)
 {
@@ -650,7 +681,7 @@ static void add_to_freelist(struct page *page, struct free_list *free_list,
 
 	if (region->page_block) {
 		list_add_tail(lru, region->page_block);
-		return;
+		goto try_return_region;
 	}
 
 #ifdef CONFIG_DEBUG_PAGEALLOC
@@ -690,6 +721,15 @@ out:
 	/* Save pointer to page block of this region */
 	region->page_block = lru;
 	set_region_bit(region_id, free_list);
+
+try_return_region:
+
+	/*
+	 * Try to return the freepages of a memory region to the region
+	 * allocator, if possible.
+	 */
+	if (can_return_region(region, order))
+		add_to_region_allocator(page_zone(page), free_list, region_id);
 }
 
 /*


^ permalink raw reply related	[flat|nested] 100+ messages in thread

* [RFC PATCH v3 28/35] mm: Connect Page Allocator(PA) to Region Allocator(RA); add PA <= RA flow
  2013-08-30 13:13 ` Srivatsa S. Bhat
@ 2013-08-30 13:22   ` Srivatsa S. Bhat
  -1 siblings, 0 replies; 100+ messages in thread
From: Srivatsa S. Bhat @ 2013-08-30 13:22 UTC (permalink / raw)
  To: akpm, mgorman, hannes, tony.luck, matthew.garrett, dave, riel,
	arjan, srinivas.pandruvada, willy, kamezawa.hiroyu, lenb, rjw
  Cc: gargankita, paulmck, svaidy, andi, isimatu.yasuaki,
	santosh.shilimkar, kosaki.motohiro, srivatsa.bhat, linux-pm,
	linux-mm, linux-kernel

Now that we have built up an infrastructure that forms a "Memory Region
Allocator", connect it with the page allocator. To entities requesting
memory, the page allocator will function as a front-end, whereas the
region allocator will act as a back-end to the page allocator.
(Analogy: page allocator is like free cash, whereas region allocator
is like a bank).

Implement the flow of freepages from the region allocator to the page
allocator. When __rmqueue_smallest() comes out empty-handed, try to get
freepages from the region allocator. Only if that also fails do we fall
back to an allocation from a different migratetype. This significantly
reduces the mixing of allocations of different migratetypes within a
single region, and thus helps keep entire memory regions homogeneous
with respect to the type of allocations.

Simplification: We assume that the freepages of a memory region can be
completely represented by a set of pages of order MAX_ORDER-1. That is, we
only need to consider the buddy freelists corresponding to MAX_ORDER-1, while
interacting with the region allocator. Furthermore, we assume that
pageblock_order == MAX_ORDER-1.

(These assumptions are used to ease the implementation, so that one can
quickly evaluate the benefits of the overall design without getting
bogged down by too many corner cases and constraints. Of course future
implementations will handle more scenarios and will have reduced dependence
on such simplifying assumptions.)
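
The resulting allocation order inside __rmqueue() is, roughly (a condensed
sketch; the MIGRATE_RESERVE handling is unchanged and omitted for brevity):

    page = __rmqueue_smallest(zone, order, migratetype);     /* 1. preferred type */
    if (!page && !del_from_region_allocator(zone, MAX_ORDER-1, migratetype))
        page = __rmqueue_smallest(zone, order, migratetype); /* 2. fresh region   */
    if (!page)
        page = __rmqueue_fallback(zone, order, migratetype); /* 3. other types    */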

Signed-off-by: Srivatsa S. Bhat <srivatsa.bhat@linux.vnet.ibm.com>
---

 mm/page_alloc.c |   12 ++++++++++--
 1 file changed, 10 insertions(+), 2 deletions(-)

diff --git a/mm/page_alloc.c b/mm/page_alloc.c
index b8af5a2..3749e2a 100644
--- a/mm/page_alloc.c
+++ b/mm/page_alloc.c
@@ -1702,10 +1702,18 @@ static struct page *__rmqueue(struct zone *zone, unsigned int order,
 {
 	struct page *page;
 
-retry_reserve:
+retry:
 	page = __rmqueue_smallest(zone, order, migratetype);
 
 	if (unlikely(!page) && migratetype != MIGRATE_RESERVE) {
+
+		/*
+		 * Try to get a region from the region allocator before falling
+		 * back to an allocation from a different migratetype.
+		 */
+		if (!del_from_region_allocator(zone, MAX_ORDER-1, migratetype))
+			goto retry;
+
 		page = __rmqueue_fallback(zone, order, migratetype);
 
 		/*
@@ -1715,7 +1723,7 @@ retry_reserve:
 		 */
 		if (!page) {
 			migratetype = MIGRATE_RESERVE;
-			goto retry_reserve;
+			goto retry;
 		}
 	}
 


^ permalink raw reply related	[flat|nested] 100+ messages in thread

* [RFC PATCH v3 29/35] mm: Update the freepage migratetype of pages during region allocation
  2013-08-30 13:13 ` Srivatsa S. Bhat
@ 2013-08-30 13:23   ` Srivatsa S. Bhat
  -1 siblings, 0 replies; 100+ messages in thread
From: Srivatsa S. Bhat @ 2013-08-30 13:23 UTC (permalink / raw)
  To: akpm, mgorman, hannes, tony.luck, matthew.garrett, dave, riel,
	arjan, srinivas.pandruvada, willy, kamezawa.hiroyu, lenb, rjw
  Cc: gargankita, paulmck, svaidy, andi, isimatu.yasuaki,
	santosh.shilimkar, kosaki.motohiro, srivatsa.bhat, linux-pm,
	linux-mm, linux-kernel

The freepage migratetype is used to determine which freelist a given
page should be added to upon being freed. To ensure that the page
goes to the right freelist, set the freepage migratetype of all
the pages of a region when allocating freepages from the region allocator.

This helps ensure that upon freeing the pages or during buddy expansion,
the pages are added back to the freelists of the migratetype for which
the pages were originally requested from the region allocator.
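
A condensed view of why this matters (sketch; set_freepage_migratetype()
and its counterpart get_freepage_migratetype() are the existing helpers
that stash the migratetype in the struct page):

    /* allocation side: tag every page handed out from the region */
    list_for_each_entry(page, ralloc_list, lru)
        set_freepage_migratetype(page, migratetype);

    /* free side (elsewhere in the allocator): the stored value decides
     * which freelist the page goes back to.
     */
    int mt = get_freepage_migratetype(page);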

Signed-off-by: Srivatsa S. Bhat <srivatsa.bhat@linux.vnet.ibm.com>
---

 mm/page_alloc.c |    3 +++
 1 file changed, 3 insertions(+)

diff --git a/mm/page_alloc.c b/mm/page_alloc.c
index 3749e2a..a62730b 100644
--- a/mm/page_alloc.c
+++ b/mm/page_alloc.c
@@ -1022,6 +1022,9 @@ static int del_from_region_allocator(struct zone *zone, unsigned int order,
 	reg_area = &reg_alloc->region[next_region].region_area[order];
 	ralloc_list = &reg_area->list;
 
+	list_for_each_entry(page, ralloc_list, lru)
+		set_freepage_migratetype(page, migratetype);
+
 	free_list = &zone->free_area[order].free_list[migratetype];
 
 	nr_pages = add_to_freelist_bulk(ralloc_list, free_list, order,


^ permalink raw reply related	[flat|nested] 100+ messages in thread

* [RFC PATCH v3 30/35] mm: Provide a mechanism to check if a given page is in the region allocator
  2013-08-30 13:13 ` Srivatsa S. Bhat
@ 2013-08-30 13:23   ` Srivatsa S. Bhat
  -1 siblings, 0 replies; 100+ messages in thread
From: Srivatsa S. Bhat @ 2013-08-30 13:23 UTC (permalink / raw)
  To: akpm, mgorman, hannes, tony.luck, matthew.garrett, dave, riel,
	arjan, srinivas.pandruvada, willy, kamezawa.hiroyu, lenb, rjw
  Cc: gargankita, paulmck, svaidy, andi, isimatu.yasuaki,
	santosh.shilimkar, kosaki.motohiro, srivatsa.bhat, linux-pm,
	linux-mm, linux-kernel

With the introduction of the region allocator, a freepage can be either
in one of the buddy freelists or in the region allocator. In cases where we
want to move freepages to a given migratetype's freelists, we will need to
know where they were originally located. So provide a helper to distinguish
whether the freepage resides in the region allocator or the buddy freelists.
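
A hypothetical caller (illustrative only; the real user is wired up later in
this series, and take_region_from_region_allocator() is merely a placeholder
name for that future step):

    if (page_in_region_allocator(page))
        /* not on any buddy freelist: fetch its whole region first */
        take_region_from_region_allocator(page);    /* hypothetical helper */
    else
        /* ordinary case: just move it between buddy freelists */
        move_page_freelist(page, old_list, new_list, order);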

Signed-off-by: Srivatsa S. Bhat <srivatsa.bhat@linux.vnet.ibm.com>
---

 mm/page_alloc.c |   31 +++++++++++++++++++++++++++++++
 1 file changed, 31 insertions(+)

diff --git a/mm/page_alloc.c b/mm/page_alloc.c
index a62730b..3f49ca8 100644
--- a/mm/page_alloc.c
+++ b/mm/page_alloc.c
@@ -1047,6 +1047,37 @@ static int del_from_region_allocator(struct zone *zone, unsigned int order,
 }
 
 /*
+ * Return 1 if the page is in the region allocator, else return 0
+ * (which usually means that the page is in the buddy freelists).
+ */
+static int page_in_region_allocator(struct page *page)
+{
+	struct region_allocator *reg_alloc;
+	struct free_area_region *reg_area;
+	int order, region_id;
+
+	/* We keep only MAX_ORDER-1 pages in the region allocator */
+	order = page_order(page);
+	if (order != MAX_ORDER-1)
+		return 0;
+
+	/*
+	 * It is sufficient to check if (any of) the pages belonging to
+	 * that region are in the region allocator, because a page resides
+	 * in the region allocator if and only if all the pages of that
+	 * region are also in the region allocator.
+	 */
+	region_id = page_zone_region_id(page);
+	reg_alloc = &page_zone(page)->region_allocator;
+	reg_area = &reg_alloc->region[region_id].region_area[order];
+
+	if (reg_area->nr_free)
+		return 1;
+
+	return 0;
+}
+
+/*
  * Freeing function for a buddy system allocator.
  *
  * The concept of a buddy system is to maintain direct-mapped table


^ permalink raw reply related	[flat|nested] 100+ messages in thread

* [RFC PATCH v3 31/35] mm: Add a way to request pages of a particular region from the region allocator
  2013-08-30 13:13 ` Srivatsa S. Bhat
@ 2013-08-30 13:23   ` Srivatsa S. Bhat
  -1 siblings, 0 replies; 100+ messages in thread
From: Srivatsa S. Bhat @ 2013-08-30 13:23 UTC (permalink / raw)
  To: akpm, mgorman, hannes, tony.luck, matthew.garrett, dave, riel,
	arjan, srinivas.pandruvada, willy, kamezawa.hiroyu, lenb, rjw
  Cc: gargankita, paulmck, svaidy, andi, isimatu.yasuaki,
	santosh.shilimkar, kosaki.motohiro, srivatsa.bhat, linux-pm,
	linux-mm, linux-kernel

When moving freepages from one migratetype to another (using move_freepages()
or equivalent), we might encounter situations in which we would like to move
pages that are in the region allocator. In such cases, we need a way to
request pages of a particular region from the region allocator.

We already have the code that performs the heavy lifting of actually moving
the pages of a region from the region allocator to a requested freelist or
migratetype. So reorganize that code so that callers can also pinpoint a
region and ask the region allocator to hand out pages from that particular
region.

Signed-off-by: Srivatsa S. Bhat <srivatsa.bhat@linux.vnet.ibm.com>
---

 mm/page_alloc.c |   40 ++++++++++++++++++++++++----------------
 1 file changed, 24 insertions(+), 16 deletions(-)

diff --git a/mm/page_alloc.c b/mm/page_alloc.c
index 3f49ca8..fc530ff 100644
--- a/mm/page_alloc.c
+++ b/mm/page_alloc.c
@@ -1002,24 +1002,18 @@ static void add_to_region_allocator(struct zone *z, struct free_list *free_list,
 		*next_region = region_id;
 }
 
-/* Delete freepages from the region allocator and add them to buddy freelists */
-static int del_from_region_allocator(struct zone *zone, unsigned int order,
-				     int migratetype)
+static void __del_from_region_allocator(struct zone *zone, unsigned int order,
+					int migratetype, int region_id)
 {
 	struct region_allocator *reg_alloc;
 	struct free_area_region *reg_area;
 	struct list_head *ralloc_list;
 	struct free_list *free_list;
 	unsigned long nr_pages;
-	int next_region;
+	struct page *page;
 
 	reg_alloc = &zone->region_allocator;
-
-	next_region = reg_alloc->next_region;
-	if (next_region < 0)
-		return -ENOMEM;
-
-	reg_area = &reg_alloc->region[next_region].region_area[order];
+	reg_area = &reg_alloc->region[region_id].region_area[order];
 	ralloc_list = &reg_area->list;
 
 	list_for_each_entry(page, ralloc_list, lru)
@@ -1028,20 +1022,34 @@ static int del_from_region_allocator(struct zone *zone, unsigned int order,
 	free_list = &zone->free_area[order].free_list[migratetype];
 
 	nr_pages = add_to_freelist_bulk(ralloc_list, free_list, order,
-					next_region);
+					region_id);
 
 	reg_area->nr_free -= nr_pages;
 	WARN_ON(reg_area->nr_free != 0);
 
 	/* Pick a new next_region */
-	clear_bit(next_region, reg_alloc->ralloc_mask);
-	next_region = find_first_bit(reg_alloc->ralloc_mask,
+	clear_bit(region_id, reg_alloc->ralloc_mask);
+	region_id = find_first_bit(reg_alloc->ralloc_mask,
 				     MAX_NR_ZONE_REGIONS);
 
-	if (next_region >= MAX_NR_ZONE_REGIONS)
-		next_region = -1; /* No free regions available */
+	if (region_id >= MAX_NR_ZONE_REGIONS)
+		region_id = -1; /* No free regions available */
+
+	reg_alloc->next_region = region_id;
+}
+
+/* Delete freepages from the region allocator and add them to buddy freelists */
+static int del_from_region_allocator(struct zone *zone, unsigned int order,
+				     int migratetype)
+{
+	int next_region;
+
+	next_region = zone->region_allocator.next_region;
+
+	if (next_region < 0)
+		return -ENOMEM;
 
-	reg_alloc->next_region = next_region;
+	__del_from_region_allocator(zone, order, migratetype, next_region);
 
 	return 0;
 }



* [RFC PATCH v3 32/35] mm: Modify move_freepages() to handle pages in the region allocator properly
  2013-08-30 13:13 ` Srivatsa S. Bhat
@ 2013-08-30 13:24   ` Srivatsa S. Bhat
  -1 siblings, 0 replies; 100+ messages in thread
From: Srivatsa S. Bhat @ 2013-08-30 13:24 UTC (permalink / raw)
  To: akpm, mgorman, hannes, tony.luck, matthew.garrett, dave, riel,
	arjan, srinivas.pandruvada, willy, kamezawa.hiroyu, lenb, rjw
  Cc: gargankita, paulmck, svaidy, andi, isimatu.yasuaki,
	santosh.shilimkar, kosaki.motohiro, srivatsa.bhat, linux-pm,
	linux-mm, linux-kernel

There are situations in which the memory management subsystem needs to move
pages from one migratetype to another, such as when setting up the per-zone
migrate reserves (where freepages are moved from MIGRATE_MOVABLE to
MIGRATE_RESERVE freelists).

But the existing code that does freepage movement is unaware of the region
allocator. In other words, it assumes that the freepages it is moving are
always in the buddy page allocator's freelists. With the introduction of
the region allocator, however, the freepages could also reside in the
region allocator. So teach move_freepages() to check whether the pages are
in the buddy page allocator's freelists or in the region allocator, and
handle the two cases appropriately.

The region allocator is designed in such a way that it always allocates
or receives entire memory regions as a single unit. To retain these
semantics during freepage movement, we first move all the pages of that
region from the region allocator to the MIGRATE_MOVABLE buddy freelist
and then move the requested page(s) from MIGRATE_MOVABLE to the required
migratetype.

Signed-off-by: Srivatsa S. Bhat <srivatsa.bhat@linux.vnet.ibm.com>
---

 mm/page_alloc.c |   18 +++++++++++++++++-
 1 file changed, 17 insertions(+), 1 deletion(-)

diff --git a/mm/page_alloc.c b/mm/page_alloc.c
index fc530ff..3ce0c61 100644
--- a/mm/page_alloc.c
+++ b/mm/page_alloc.c
@@ -1557,7 +1557,7 @@ int move_freepages(struct zone *zone,
 	struct page *page;
 	unsigned long order;
 	struct free_area *area;
-	int pages_moved = 0, old_mt;
+	int pages_moved = 0, old_mt, region_id;
 
 #ifndef CONFIG_HOLES_IN_ZONE
 	/*
@@ -1584,7 +1584,23 @@ int move_freepages(struct zone *zone,
 			continue;
 		}
 
+		/*
+		 * If the page is in the region allocator, we first move the
+		 * region to the MIGRATE_MOVABLE buddy freelists and then move
+		 * that page to the freelist of the requested migratetype.
+		 * This is because the region allocator operates on whole region-
+		 * sized chunks, whereas here we want to move pages in much
+		 * smaller chunks.
+		 */
 		order = page_order(page);
+		if (page_in_region_allocator(page)) {
+			region_id = page_zone_region_id(page);
+			__del_from_region_allocator(zone, order, MIGRATE_MOVABLE,
+						    region_id);
+
+			continue; /* Try this page again from the buddy-list */
+		}
+
 		old_mt = get_freepage_migratetype(page);
 		area = &zone->free_area[order];
 		move_page_freelist(page, &area->free_list[old_mt],



* [RFC PATCH v3 33/35] mm: Never change migratetypes of pageblocks during freepage stealing
  2013-08-30 13:13 ` Srivatsa S. Bhat
@ 2013-08-30 13:24   ` Srivatsa S. Bhat
  -1 siblings, 0 replies; 100+ messages in thread
From: Srivatsa S. Bhat @ 2013-08-30 13:24 UTC (permalink / raw)
  To: akpm, mgorman, hannes, tony.luck, matthew.garrett, dave, riel,
	arjan, srinivas.pandruvada, willy, kamezawa.hiroyu, lenb, rjw
  Cc: gargankita, paulmck, svaidy, andi, isimatu.yasuaki,
	santosh.shilimkar, kosaki.motohiro, srivatsa.bhat, linux-pm,
	linux-mm, linux-kernel

We would like to keep large chunks of memory (of the size of memory regions)
populated by allocations of a single migratetype. This helps influence
allocation/reclaim decisions on a per-migratetype basis, which would also
automatically respect memory region boundaries.

For example, if a region is known to contain only MIGRATE_UNMOVABLE pages,
we can skip trying targeted compaction on that region. Similarly, if a region
has only MIGRATE_MOVABLE pages, then the likelihood of successful targeted
evacuation of that region is higher, as opposed to having a few unmovable
pages embedded in a region otherwise containing mostly movable allocations.
Thus, it is beneficial to try to keep memory allocations homogeneous (in
terms of migratetype) within region-sized chunks of memory.

Changing the migratetypes of pageblocks during freepage stealing gets in the
way of this effort, since it fragments the ownership of memory segments.
So never change the ownership of pageblocks during freepage stealing.

Signed-off-by: Srivatsa S. Bhat <srivatsa.bhat@linux.vnet.ibm.com>
---

 mm/page_alloc.c |   36 ++++++++++--------------------------
 1 file changed, 10 insertions(+), 26 deletions(-)

diff --git a/mm/page_alloc.c b/mm/page_alloc.c
index 3ce0c61..e303351 100644
--- a/mm/page_alloc.c
+++ b/mm/page_alloc.c
@@ -1648,14 +1648,16 @@ static void change_pageblock_range(struct page *pageblock_page,
 /*
  * If breaking a large block of pages, move all free pages to the preferred
  * allocation list. If falling back for a reclaimable kernel allocation, be
- * more aggressive about taking ownership of free pages.
+ * more aggressive about borrowing the free pages.
  *
- * On the other hand, never change migration type of MIGRATE_CMA pageblocks
- * nor move CMA pages to different free lists. We don't want unmovable pages
- * to be allocated from MIGRATE_CMA areas.
+ * On the other hand, never move CMA pages to different free lists. We don't
+ * want unmovable pages to be allocated from MIGRATE_CMA areas.
  *
- * Returns the new migratetype of the pageblock (or the same old migratetype
- * if it was unchanged).
+ * Also, we *NEVER* change the pageblock migratetype of any block of memory.
+ * (IOW, we only try to _loan_ the freepages from a fallback list, but never
+ * try to _own_ them.)
+ *
+ * Returns the migratetype of the fallback list.
  */
 static int try_to_steal_freepages(struct zone *zone, struct page *page,
 				  int start_type, int fallback_type)
@@ -1665,28 +1667,10 @@ static int try_to_steal_freepages(struct zone *zone, struct page *page,
 	if (is_migrate_cma(fallback_type))
 		return fallback_type;
 
-	/* Take ownership for orders >= pageblock_order */
-	if (current_order >= pageblock_order) {
-		change_pageblock_range(page, current_order, start_type);
-		return start_type;
-	}
-
 	if (current_order >= pageblock_order / 2 ||
 	    start_type == MIGRATE_RECLAIMABLE ||
-	    page_group_by_mobility_disabled) {
-		int pages;
-
-		pages = move_freepages_block(zone, page, start_type);
-
-		/* Claim the whole block if over half of it is free */
-		if (pages >= (1 << (pageblock_order-1)) ||
-				page_group_by_mobility_disabled) {
-
-			set_pageblock_migratetype(page, start_type);
-			return start_type;
-		}
-
-	}
+	    page_group_by_mobility_disabled)
+		move_freepages_block(zone, page, start_type);
 
 	return fallback_type;
 }



* [RFC PATCH v3 34/35] mm: Set pageblock migratetype when allocating regions from region allocator
  2013-08-30 13:13 ` Srivatsa S. Bhat
@ 2013-08-30 13:24   ` Srivatsa S. Bhat
  -1 siblings, 0 replies; 100+ messages in thread
From: Srivatsa S. Bhat @ 2013-08-30 13:24 UTC (permalink / raw)
  To: akpm, mgorman, hannes, tony.luck, matthew.garrett, dave, riel,
	arjan, srinivas.pandruvada, willy, kamezawa.hiroyu, lenb, rjw
  Cc: gargankita, paulmck, svaidy, andi, isimatu.yasuaki,
	santosh.shilimkar, kosaki.motohiro, srivatsa.bhat, linux-pm,
	linux-mm, linux-kernel

We would like to maintain memory regions such that all memory pertaining to
a given memory region serves allocations of a single migratetype. IOW, we
don't want to permanently mix allocations of different migratetypes within
the same region.

So, when allocating a region from the region allocator to the page allocator,
set the pageblock migratetype of all that memory to the migratetype for which
the page allocator requested memory.

Note that this still allows temporary sharing of pages between different
migratetypes; it just ensures that there is no *permanent* mixing of
migratetypes within a given memory region.

An important advantage to be noted here is that the region allocator never
has to manage memory at a granularity finer than a memory region, in *any*
situation. This is because the freepage migratetype and the fallback mechanism
allow temporary sharing of free memory between different migratetypes when
the system is short on memory, but eventually all the memory gets freed back
to the original migratetype (because we set the pageblock migratetype of all
the freepages appropriately when allocating regions).

This greatly simplifies the design of the region allocator, since it doesn't
have to keep track of memory in smaller chunks than a memory region.

Signed-off-by: Srivatsa S. Bhat <srivatsa.bhat@linux.vnet.ibm.com>
---

 mm/page_alloc.c |    4 +++-
 1 file changed, 3 insertions(+), 1 deletion(-)

diff --git a/mm/page_alloc.c b/mm/page_alloc.c
index e303351..1312546 100644
--- a/mm/page_alloc.c
+++ b/mm/page_alloc.c
@@ -1016,8 +1016,10 @@ static void __del_from_region_allocator(struct zone *zone, unsigned int order,
 	reg_area = &reg_alloc->region[region_id].region_area[order];
 	ralloc_list = &reg_area->list;
 
-	list_for_each_entry(page, ralloc_list, lru)
+	list_for_each_entry(page, ralloc_list, lru) {
 		set_freepage_migratetype(page, migratetype);
+		set_pageblock_migratetype(page, migratetype);
+	}
 
 	free_list = &zone->free_area[order].free_list[migratetype];
 



* [RFC PATCH v3 35/35] mm: Use a cache between page-allocator and region-allocator
  2013-08-30 13:13 ` Srivatsa S. Bhat
@ 2013-08-30 13:24   ` Srivatsa S. Bhat
  -1 siblings, 0 replies; 100+ messages in thread
From: Srivatsa S. Bhat @ 2013-08-30 13:24 UTC (permalink / raw)
  To: akpm, mgorman, hannes, tony.luck, matthew.garrett, dave, riel,
	arjan, srinivas.pandruvada, willy, kamezawa.hiroyu, lenb, rjw
  Cc: gargankita, paulmck, svaidy, andi, isimatu.yasuaki,
	santosh.shilimkar, kosaki.motohiro, srivatsa.bhat, linux-pm,
	linux-mm, linux-kernel

Currently, whenever the page allocator notices that it has all the freepages
of a given memory region, it attempts to return that region to the region
allocator. This strategy is needlessly aggressive and can cause a lot of back
and forth between the page-allocator and the region-allocator.

More importantly, it can completely wreck the benefits of having a region
allocator in the first place - if the buddy allocator immediately returns
the freepages of a memory region to the region allocator, that memory goes
back to the generic pool of pages. So, depending on when the next allocation
request for this particular migratetype arrives, the region allocator might
not have any free regions to hand out, and hence we might end up falling
back to freepages of other migratetypes. Instead, if the page allocator
retains a few regions as a cache for every migratetype, we have a better
chance of avoiding such fallbacks.

So, don't return all the free memory regions held by the page allocator to
the region allocator. Keep at least one region as a cache, for future use.

Signed-off-by: Srivatsa S. Bhat <srivatsa.bhat@linux.vnet.ibm.com>
---

 mm/page_alloc.c |   16 ++++++++++++++--
 1 file changed, 14 insertions(+), 2 deletions(-)

diff --git a/mm/page_alloc.c b/mm/page_alloc.c
index 1312546..55e8e65 100644
--- a/mm/page_alloc.c
+++ b/mm/page_alloc.c
@@ -639,9 +639,11 @@ static void add_to_region_allocator(struct zone *z, struct free_list *free_list,
 				    int region_id);
 
 
-static inline int can_return_region(struct mem_region_list *region, int order)
+static inline int can_return_region(struct mem_region_list *region, int order,
+				    struct free_list *free_list)
 {
 	struct zone_mem_region *zone_region;
+	struct page *prev_page, *next_page;
 
 	zone_region = region->zone_region;
 
@@ -659,6 +661,16 @@ static inline int can_return_region(struct mem_region_list *region, int order)
 	if (likely(order != MAX_ORDER-1))
 		return 0;
 
+	/*
+	 * Don't return all the regions; retain atleast one region as a
+	 * cache for future use.
+	 */
+	prev_page = container_of(free_list->list.prev , struct page, lru);
+	next_page = container_of(free_list->list.next , struct page, lru);
+
+	if (page_zone_region_id(prev_page) == page_zone_region_id(next_page))
+		return 0; /* There is only one region in this freelist */
+
 	if (region->nr_free * (1 << order) == zone_region->nr_free)
 		return 1;
 
@@ -728,7 +740,7 @@ try_return_region:
 	 * Try to return the freepages of a memory region to the region
 	 * allocator, if possible.
 	 */
-	if (can_return_region(region, order))
+	if (can_return_region(region, order, free_list))
 		add_to_region_allocator(page_zone(page), free_list, region_id);
 }
 


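(As an aside on the single-region check introduced in can_return_region()
above: the "only one region in this freelist" comparison relies on the free
pages of a freelist being kept grouped by memory region, so looking at the
region of the first and the last entry is enough. Below is a minimal
userspace sketch of that idea, under that assumption; the names are made up
for illustration and this is not kernel code.)

#include <stdio.h>

/* Stand-in for a free page; region_id mimics page_zone_region_id(page). */
struct fake_page {
	int region_id;
};

/*
 * Return 1 if every entry on a region-grouped list belongs to a single
 * region -- mirroring the free_list->list.next / .prev comparison above.
 */
static int list_has_single_region(const struct fake_page *list, int n)
{
	if (n == 0)
		return 1;
	return list[0].region_id == list[n - 1].region_id;
}

int main(void)
{
	struct fake_page one_region[]  = { {3}, {3}, {3} };
	struct fake_page two_regions[] = { {3}, {3}, {7} };

	printf("one region : %d\n", list_has_single_region(one_region, 3));
	printf("two regions: %d\n", list_has_single_region(two_regions, 3));
	return 0;
}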


* Re: [RESEND RFC PATCH v3 00/35] mm: Memory Power Management
  2013-08-30 13:13 ` Srivatsa S. Bhat
@ 2013-08-30 13:26   ` Srivatsa S. Bhat
  -1 siblings, 0 replies; 100+ messages in thread
From: Srivatsa S. Bhat @ 2013-08-30 13:26 UTC (permalink / raw)
  To: akpm, mgorman, hannes, tony.luck, matthew.garrett, dave, riel,
	arjan, srinivas.pandruvada, willy, kamezawa.hiroyu, lenb, rjw
  Cc: Srivatsa S. Bhat, gargankita, paulmck, svaidy, andi,
	isimatu.yasuaki, santosh.shilimkar, kosaki.motohiro, linux-pm,
	linux-mm, linux-kernel


Experimental Results:
====================

Test setup:
----------

x86 Sandybridge dual-socket quad core HT-enabled machine, with 128GB RAM.
Memory Region size = 512MB.

Testcase:
--------

Strategy:

Try to allocate and free large chunks of memory (comparable to the memory
region size) in multiple threads, and examine the number of completely free
memory regions at the end of the run (when all the memory is freed). (Note
that we don't create any pagecache usage here).

Implementation:

Run 20 instances of multi-threaded ebizzy in parallel, with chunksize=256MB
and no. of threads=32. This means that potentially 20 * 32 = 640 threads can
allocate/free memory in parallel, and each alloc/free size will be 256MB,
which is half of the memory region size.

Cmd-line of each ebizzy instance: ./ebizzy -s 268435456 -n 2 -t 32 -S 60


Effectiveness in consolidating allocations:
------------------------------------------

With the above test case, the higher the number of completely free memory
regions at the end of the run, the better the memory management algorithm is
at consolidating allocations.

Here are the results, with vanilla 3.11-rc7 and with this patchset applied:

                  Free regions at test-start   Free regions after test-run
Without patchset               242                         18
With patchset                  238                        121

This shows that this patchset performs tremendously better than the vanilla
kernel in terms of keeping the memory allocations consolidated to a minimum
no. of memory regions. Note that the amount of memory still in use at the end
of the run is 0, so this shows the drastic extent to which the mainline kernel
can fragment memory by spreading a handful of pages across many memory regions.
And since this patchset teaches the kernel to understand the memory region
granularity/boundaries and influences the MM decisions effectively, it shows
a significant improvement over mainline. Also, this improvement is with the
allocator changes alone; targeted compaction (which was dropped in this
version) is expected to show even more benefits.
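
To put these numbers in perspective (a rough back-of-the-envelope estimate,
based on the 512MB region size and 128GB RAM of the test setup above): the
patched kernel ends the run with 121 - 18 = 103 more completely free regions
than mainline, i.e. roughly 103 * 512MB ~= 51.5GB of additional RAM sitting
in fully idle regions that the memory hardware could keep in low-power
(or powered-off) states.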

Below is the log of the variation of the no. of completely free regions
from the beginning to the end of the test, at 1 second intervals (total
test-run takes 1 minute).

         Vanilla 3.11-rc7         With this patchset
                242                     238
                242                     238
                242                     238
                242                     238
                242                     238
                239                     236
                221                     215
                196                     181
                171                     139
                144                     112
                117                     78
                69                      48
                49                      24
                27                      21
                15                      21
                15                      21
                15                      21
                15                      21
                15                      21
                15                      21
                15                      22
                15                      22
                15                      23
                15                      23
                15                      27
                15                      29
                15                      29
                15                      30
                15                      30
                15                      30
                15                      30
                15                      30
                15                      30
                15                      30
                15                      32
                15                      33
                15                      33
                15                      33
                15                      33
                15                      36
                15                      42
                15                      42
                15                      44
                15                      48
                16                      111
                17                      114
                17                      114
                17                      114
                17                      115
                17                      115
                17                      115
                17                      115
                17                      115
                17                      115
                17                      115
                17                      115
                17                      115
                17                      115
                17                      115
                17                      115
                17                      115
                17                      115
                17                      115
                17                      116
                17                      116
                17                      116
                17                      116
                17                      116
                17                      116
                17                      116
                17                      116
                17                      116
                17                      116
                17                      116
                17                      116
                18                      121


It is interesting to also examine the fragmentation of memory by
looking at the per-region statistics added by this patchset.

Statistics for vanilla 3.11-rc7 kernel:
======================================

We can see from the statistics that there is a lot of fragmentation
within the MOVABLE migratetype.

Node 0, zone   Normal
  pages free     15751188
        min      5575
        low      6968
        high     8362
        scanned  0
        spanned  16252928
        present  16252928
        managed  15989951

Per-region page stats	 present	 free

	Region      0 	      1 	   1024
	Region      1 	 131072 	 131072
	Region      2 	 131072 	 131072
	Region      3 	 131072 	 131072
	Region      4 	 131072 	 131072
	Region      5 	 131072 	 130045
	Region      6 	 131072 	 131032
	Region      7 	 131072 	 131023
	Region      8 	 131072 	 131022
	Region      9 	 131072 	 131062
	Region     10 	 131072 	 131055
	Region     11 	 131072 	 131064
	Region     12 	 131072 	 131047
	Region     13 	 131072 	 131051
	Region     14 	 131072 	 131056
	Region     15 	 131072 	 131046
	Region     16 	 131072 	 131051
	Region     17 	 131072 	 131061
	Region     18 	 131072 	 131030
	Region     19 	 131072 	 130168
	Region     20 	 131072 	 131937
	Region     21 	 131072 	 131067
	Region     22 	 131072 	 131028
	Region     23 	 131072 	 131051
	Region     24 	 131072 	 131041
	Region     25 	 131072 	 131047
	Region     26 	 131072 	 131051
	Region     27 	 131072 	 131054
	Region     28 	 131072 	 131049
	Region     29 	 131072 	 130994
	Region     30 	 131072 	 131059
	Region     31 	 131072 	 131060
	Region     32 	 131072 	 131051
	Region     33 	 131072 	 131047
	Region     34 	 131072 	 131050
	Region     35 	 131072 	 131050
	Region     36 	 131072 	 131039
	Region     37 	 131072 	 131053
	Region     38 	 131072 	 131045
	Region     39 	 131072 	 130275
	Region     40 	 131072 	 131807
	Region     41 	 131072 	 131050
	Region     42 	 131072 	 131051
	Region     43 	 131072 	 131037
	Region     44 	 131072 	 131052
	Region     45 	 131072 	 131011
	Region     46 	 131072 	 131026
	Region     47 	 131072 	 130285
	Region     48 	 131072 	 131810
	Region     49 	 131072 	 131046
	Region     50 	 131072 	 131049
	Region     51 	 131072 	 131054
	Region     52 	 131072 	 131064
	Region     53 	 131072 	 131053
	Region     54 	 131072 	 131019
	Region     55 	 131072 	 130997
	Region     56 	 131072 	 131039
	Region     57 	 131072 	 131058
	Region     58 	 131072 	 130182
	Region     59 	 131072 	 131057
	Region     60 	 131072 	 131063
	Region     61 	 131072 	 131046
	Region     62 	 131072 	 131055
	Region     63 	 131072 	 131060
	Region     64 	 131072 	 131049
	Region     65 	 131072 	 131042
	Region     66 	 131072 	 131048
	Region     67 	 131072 	 131052
	Region     68 	 131072 	 130997
	Region     69 	 131072 	 131046
	Region     70 	 131072 	 131045
	Region     71 	 131072 	 131028
	Region     72 	 131072 	 131054
	Region     73 	 131072 	 131048
	Region     74 	 131072 	 131052
	Region     75 	 131072 	 131043
	Region     76 	 131072 	 131052
	Region     77 	 131072 	 130542
	Region     78 	 131072 	 131556
	Region     79 	 131072 	 131048
	Region     80 	 131072 	 131043
	Region     81 	 131072 	 130548
	Region     82 	 131072 	 131551
	Region     83 	 131072 	 131019
	Region     84 	 131072 	 131033
	Region     85 	 131072 	 131047
	Region     86 	 131072 	 131059
	Region     87 	 131072 	 131054
	Region     88 	 131072 	 131043
	Region     89 	 131072 	 131035
	Region     90 	 131072 	 131044
	Region     91 	 131072 	 130538
	Region     92 	 131072 	 131560
	Region     93 	 131072 	 131063
	Region     94 	 131072 	 131033
	Region     95 	 131072 	 131046
	Region     96 	 131072 	 131048
	Region     97 	 131072 	 131049
	Region     98 	 131072 	 131058
	Region     99 	 131072 	 131048
	Region    100 	 131072 	 130484
	Region    101 	 131072 	 131557
	Region    102 	 131072 	 131038
	Region    103 	 131072 	 131044
	Region    104 	 131072 	 131040
	Region    105 	 131072 	 130988
	Region    106 	 131072 	 131039
	Region    107 	 131072 	 131009
	Region    108 	 131072 	 131059
	Region    109 	 131072 	 131049
	Region    110 	 131072 	 131050
	Region    111 	 131072 	 131042
	Region    112 	 131072 	 131052
	Region    113 	 131072 	 131053
	Region    114 	 131072 	 131067
	Region    115 	 131072 	 131062
	Region    116 	 131072 	 131072
	Region    117 	 131072 	 131072
	Region    118 	 131072 	 129860
	Region    119 	 131072 	 125402
	Region    120 	 131072 	  63109
	Region    121 	 131072 	  84301
	Region    122 	 131072 	  17009
	Region    123 	 131072 	      0
	Region    124 	 131071 	      0



Page block order: 10
Pages per block:  1024

Free pages count per migrate type at order       0      1      2      3      4      5      6      7      8      9     10 
Node    0, zone      DMA, type    Unmovable      1      2      2      1      3      2      0      0      1      1      0 
Node    0, zone      DMA, type  Reclaimable      0      0      0      0      0      0      0      0      0      0      0 
Node    0, zone      DMA, type      Movable      0      0      0      0      0      0      0      0      0      0      2 
Node    0, zone      DMA, type      Reserve      0      0      0      0      0      0      0      0      0      0      1 
Node    0, zone      DMA, type      Isolate      0      0      0      0      0      0      0      0      0      0      0 
Node    0, zone    DMA32, type    Unmovable      0      1      0      0      0      0      1      1      1      1      0 
Node    0, zone    DMA32, type  Reclaimable      0      0      0      0      0      0      0      0      0      0      0 
Node    0, zone    DMA32, type      Movable      8     10     12      8     10      8      8      6      5      7    436 
Node    0, zone    DMA32, type      Reserve      0      0      0      0      0      0      0      0      0      0      1 
Node    0, zone    DMA32, type      Isolate      0      0      0      0      0      0      0      0      0      0      0 
Node    0, zone   Normal, type    Unmovable   8982   9711   5941   2108    611    189      9      0      1      1      0 
Node    0, zone   Normal, type  Reclaimable      0      0      0      0      1      0      0      1      0      0      0 
Node    0, zone   Normal, type      Movable   2349   4937   5264   3716   2323   1859   1689   1602   1412   1310  13826 
Node    0, zone   Normal, type      Reserve      0      0      0      0      0      0      0      0      0      0      2 
Node    0, zone   Normal, type      Isolate      0      0      0      0      0      0      0      0      0      0      0 

Node    0, zone   Normal, R  0      Movable      0      0      0      0      0      0      0      0      0      0      0 
Node    0, zone   Normal, R  1      Movable      0      0      0      0      0      0      0      0      0      0    127 
Node    0, zone   Normal, R  2      Movable      0      0      0      0      0      0      0      0      0      0    128 
Node    0, zone   Normal, R  3      Movable      0      0      0      0      0      0      0      0      0      0    128 
Node    0, zone   Normal, R  4      Movable      0      0      0      0      0      0      0      0      0      0    128 
Node    0, zone   Normal, R  5      Movable      3      3      3      3      3      3      3      3      3      3    124 
Node    0, zone   Normal, R  6      Movable     18     25     25     24     23     22     23     19     19     14    111 
Node    0, zone   Normal, R  7      Movable      7     16     18     16     12     13     14      8      7      9    119 
Node    0, zone   Normal, R  8      Movable     12     17     18     17     11     11     13     11     10     11    117 
Node    0, zone   Normal, R  9      Movable      6      6      7      7      5      6      6      6      6      6    122 
Node    0, zone   Normal, R 10      Movable      7     10     11     11     11     11     11     11     11      9    118 
Node    0, zone   Normal, R 11      Movable      8      8      8      8      8      8      8      8      6      7    121 
Node    0, zone   Normal, R 12      Movable      5      7     11     11     11     11     11     11     11     11    117 
Node    0, zone   Normal, R 13      Movable     15     18     18     18     18     18     18     18     18     12    113 
Node    0, zone   Normal, R 14      Movable      6      9     10      8      9      9      9      9      9      9    119 
Node    0, zone   Normal, R 15      Movable     10     12     15     15     13     14     12     13     13     13    115 
Node    0, zone   Normal, R 16      Movable      3      4      6      7      7      7      7      7      5      6    122 
Node    0, zone   Normal, R 17      Movable      1      4      5      5      5      5      5      5      5      5    123 
Node    0, zone   Normal, R 18      Movable     14     22     25     23     22     21     22     22     22     20    107 
Node    0, zone   Normal, R 19      Movable      6      7      7      7      7      7      7      8      7      7    120 
Node    0, zone   Normal, R 20      Movable      9     10     11     13     13     13     11     11     12     10    118 
Node    0, zone   Normal, R 21      Movable      3      4      4      4      4      4      2      3      3      3    125 
Node    0, zone   Normal, R 22      Movable      6     11     16     11     14     12     13     13     11     12    116 
Node    0, zone   Normal, R 23      Movable     11     14     15     15     15     15     15     15     13     14    114 
Node    0, zone   Normal, R 24      Movable      7     11     13     14     14     14     12     13     13     13    115 
Node    0, zone   Normal, R 25      Movable      7     12     12     13     11     12     12     12     12     12    116 
Node    0, zone   Normal, R 26      Movable      5      9     11     11     11     11     11     11      9     10    118 
Node    0, zone   Normal, R 27      Movable      8     13     13     13     11     10     11      9      8      9    119 
Node    0, zone   Normal, R 28      Movable     11     13     13     12     11     12     12     10      9      8    119 
Node    0, zone   Normal, R 29      Movable     20     27     28     27     26     24     22     22     19     17    109 
Node    0, zone   Normal, R 30      Movable      5      9      9      9      9      9      9      9      9      7    120 
Node    0, zone   Normal, R 31      Movable      6      7      6      7      7      7      7      7      7      5    122 
Node    0, zone   Normal, R 32      Movable      1      5      8      8      8      8      8      8      8      8    120 
Node    0, zone   Normal, R 33      Movable      5      9     10     11     11     11     11     11     11      9    118 
Node    0, zone   Normal, R 34      Movable      6      8      9     10      8      7      8      8      8      8    120 
Node    0, zone   Normal, R 35      Movable     14     18     16     17     17     15     16     16     14     13    114 
Node    0, zone   Normal, R 36      Movable     11     16     19     19     17     18     16     17     15     16    112 
Node    0, zone   Normal, R 37      Movable     15     17     17     17     17     17     17     17     15     14    113 
Node    0, zone   Normal, R 38      Movable      7     13     15     15     15     15     15     15     13     12    115 
Node    0, zone   Normal, R 39      Movable     11     18     19     19     17     16     17     15     15     11    114 
Node    0, zone   Normal, R 40      Movable     13     21     18     18     19     15     15     16     13     13    115 
Node    0, zone   Normal, R 41      Movable      4      7     10     10     10     10     10     10     10     10    118 
Node    0, zone   Normal, R 42      Movable     13     15     16     14     11     13     13     13     13     11    116 
Node    0, zone   Normal, R 43      Movable     13     16     16     18     18     18     18     18     14     16    112 
Node    0, zone   Normal, R 44      Movable     10     11     11     12     12     12     12     12     12     12    116 
Node    0, zone   Normal, R 45      Movable     13     19     20     22     21     22     20     21     17     15    111 
Node    0, zone   Normal, R 46      Movable     10     16     16     19     19     19     19     19     15     15    112 
Node    0, zone   Normal, R 47      Movable     11     15     15     13     14     14     14     14     13     11    115 
Node    0, zone   Normal, R 48      Movable      8     15     17     17     15     16     14     15     12     14    115 
Node    0, zone   Normal, R 49      Movable      8      9     11     12     12     12     12     10     11      9    118 
Node    0, zone   Normal, R 50      Movable      9     12     14     14     14     12     13     13     13     13    115 
Node    0, zone   Normal, R 51      Movable      8     11     12     12     12     12     12     12     10      7    119 
Node    0, zone   Normal, R 52      Movable      8      8      8      8      8      8      8      8      8      8    120 
Node    0, zone   Normal, R 53      Movable      9     12     13     13     13     13     13     13     13     11    116 
Node    0, zone   Normal, R 54      Movable     11     14     19     19     18     19     15     17     15     10    115 
Node    0, zone   Normal, R 55      Movable     13     14     15     16     17     16     17     17     15     16    112 
Node    0, zone   Normal, R 56      Movable      3      8     11     12     12     12     12     10      9     10    118 
Node    0, zone   Normal, R 57      Movable      4      9      9      7      8      8      8      8      8      8    120 
Node    0, zone   Normal, R 58      Movable      6      8      8      8      8      8      8      8      6      7    120 
Node    0, zone   Normal, R 59      Movable      7     11     11     11     11     11     11     11      9     10    118 
Node    0, zone   Normal, R 60      Movable      3      4      5      5      5      5      5      5      5      5    123 
Node    0, zone   Normal, R 61      Movable      8     15     16     14     13     14     14     12     13     11    116 
Node    0, zone   Normal, R 62      Movable      7     10     11     11     11     11     11     11     11     11    117 
Node    0, zone   Normal, R 63      Movable      4      4      6      6      6      6      6      6      6      4    123 
Node    0, zone   Normal, R 64      Movable      9     12     14     14     14     14     14     14     12     13    115 
Node    0, zone   Normal, R 65      Movable      6      8     11     12     12     12     12     12     10      9    118 
Node    0, zone   Normal, R 66      Movable     20     22     22     16     19     19     15     17     13     11    115 
Node    0, zone   Normal, R 67      Movable      4      8     10      8      7      8      8      8      8      6    121 
Node    0, zone   Normal, R 68      Movable     13     20     22     23     23     24     18     19     20     18    109 
Node    0, zone   Normal, R 69      Movable      4      9     10     11     11     11     11      9      8      9    119 
Node    0, zone   Normal, R 70      Movable      9     14     16     16     16     16     14     15     13     12    115 
Node    0, zone   Normal, R 71      Movable      8     18     22     22     22     22     20     21     17     17    110 
Node    0, zone   Normal, R 72      Movable      4      5      8      8      8      8      8      8      8      8    120 
Node    0, zone   Normal, R 73      Movable     18     21     21     21     17     17     18     18     18     16    111 
Node    0, zone   Normal, R 74      Movable      4      8     10     10     10     10     10     10     10      8    119 
Node    0, zone   Normal, R 75      Movable      9      9     12     13     13     13     13     13      9     11    117 
Node    0, zone   Normal, R 76      Movable     12     16     16     14     13     14     14     14      8      9    118 
Node    0, zone   Normal, R 77      Movable     14     14     15     15     15     15     15     15     13     15    113 
Node    0, zone   Normal, R 78      Movable      8     10     14     14     14     14     14     14     12     12    116 
Node    0, zone   Normal, R 79      Movable      8     10     13     13      9     11      9     10     10      6    120 
Node    0, zone   Normal, R 80      Movable     11     14     15     16     16     16     16     16     12     14    114 
Node    0, zone   Normal, R 81      Movable      4      8      8      8      8      8      8      8      8      9    119 
Node    0, zone   Normal, R 82      Movable     17     23     24     24     22     23     19     21     21     12    112 
Node    0, zone   Normal, R 83      Movable      5      7      8      9      9     10      6      8      8      6    121 
Node    0, zone   Normal, R 84      Movable     17     22     25     25     23     24     24     20     16     19    109 
Node    0, zone   Normal, R 85      Movable      9     15     16     16     16     16     16     16     10     11    116 
Node    0, zone   Normal, R 86      Movable      3      8      8      8      8      8      6      7      5      6    122 
Node    0, zone   Normal, R 87      Movable     10     12     13     13     11     10     11     11      7      9    119 
Node    0, zone   Normal, R 88      Movable     15     20     17     15     15     16     16     14      9     12    116 
Node    0, zone   Normal, R 89      Movable     15     18     20     21     21     19     20     20     16     18    110 
Node    0, zone   Normal, R 90      Movable      8     16     15     14     15     15     13     14     12     13    115 
Node    0, zone   Normal, R 91      Movable      6     10     10     11     11     11     11     11      9      7    119 
Node    0, zone   Normal, R 92      Movable     14     15     17     17     17     17     17     17     15     13    114 
Node    0, zone   Normal, R 93      Movable      5      5      6      6      6      6      6      6      6      6    122 
Node    0, zone   Normal, R 94      Movable     15     27     25     26     26     26     26     26     20     19    107 
Node    0, zone   Normal, R 95      Movable     12     15     17     17     15     16     14     13     14     10    116 
Node    0, zone   Normal, R 96      Movable     10     15     16     16     16     16     16     14     15     15    113 
Node    0, zone   Normal, R 97      Movable      5      8      9     10      8      9      9      9      9      7    120 
Node    0, zone   Normal, R 98      Movable      8     11     11     11     11     11     11      9     10     10    118 
Node    0, zone   Normal, R 99      Movable     10     13     15     15     15     15     15     15     15     11    115 
Node    0, zone   Normal, R100      Movable     32     42     48     44     44     45     39     40     23     25     99 
Node    0, zone   Normal, R101      Movable      9     14     16     16     14     15     15     15     15     14    114 
Node    0, zone   Normal, R102      Movable     18     24     25     23     22     23     21     22     18     18    109 
Node    0, zone   Normal, R103      Movable     12     16     18     16     15     14     15     15     11     13    115 
Node    0, zone   Normal, R104      Movable     14     17     20     20     20     20     20     20     20     18    109 
Node    0, zone   Normal, R105      Movable     16     24     35     32     32     33     31     30     25     26    101 
Node    0, zone   Normal, R106      Movable     11     18     20     20     20     20     18     19     19     15    111 
Node    0, zone   Normal, R107      Movable     11     29     33     33     33     33     33     33     25     25    101 
Node    0, zone   Normal, R108      Movable     13     13     13     13     13     13     13     13     13     13    115 
Node    0, zone   Normal, R109      Movable      3      5      9      9      9      9      9      9      9      9    119 
Node    0, zone   Normal, R110      Movable     12     15     16     16     16     16     16     16     10     13    115 
Node    0, zone   Normal, R111      Movable      8     13     16     16     16     16     16     14     15     11    115 
Node    0, zone   Normal, R112      Movable      6      9      5      4      4      5      5      5      3      2    125 
Node    0, zone   Normal, R113      Movable      1      2      2      2      3      1      2      2      2      2    126 
Node    0, zone   Normal, R114      Movable      1      3      3      3      3      3      3      3      1      2    126 
Node    0, zone   Normal, R115      Movable      4      5      4      5      5      5      5      5      5      5    123 
Node    0, zone   Normal, R116      Movable      0      0      0      0      0      0      0      0      0      0    128 
Node    0, zone   Normal, R117      Movable      0      0      0      0      0      0      0      0      0      0    128 
Node    0, zone   Normal, R118      Movable     10     33     34     22     18     17     14     14     14     11    114 
Node    0, zone   Normal, R119      Movable     36    117    163    146    143    138    126    102     85     66     39 
Node    0, zone   Normal, R120      Movable    366    963    961    572    191     57     35     19     19      5      5 
Node    0, zone   Normal, R121      Movable    802   2065   2211   1260    425    128     45     14      7      0      0 
Node    0, zone   Normal, R122      Movable    123    328    322    160     37      6      1      0      0      0      0 
Node    0, zone   Normal, R123      Movable      0      0      0      0      0      0      0      0      0      0      0 
Node    0, zone   Normal, R124      Movable      0      0      0      0      0      0      0      0      0      0      0 

Number of blocks type         Unmovable  Reclaimable      Movable      Reserve      Isolate 
Node 0, zone      DMA                 1            0            2            1            0 
Node 0, zone    DMA32                 1            0          506            1            0 
Node 0, zone   Normal               227           38        15605            2            0 

Node 0, zone      DMA R  0            1            0            2            1            0 
Node 0, zone    DMA32 R  0            0            0          124            1            0 
Node 0, zone    DMA32 R  1            0            0          128            0            0 
Node 0, zone    DMA32 R  2            0            0          128            0            0 
Node 0, zone    DMA32 R  3            1            0          127            0            0 
Node 0, zone    DMA32 R  4            0            0            0            0            0 
Node 0, zone    DMA32 R  5            0            0            0            0            0 
Node 0, zone    DMA32 R  6            0            0            0            0            0 
Node 0, zone    DMA32 R  7            0            0            0            0            0 
Node 0, zone   Normal R  0            0            0            0            1            0 
Node 0, zone   Normal R  1            0            0          126            2            0 
Node 0, zone   Normal R  2            0            0          128            0            0 
Node 0, zone   Normal R  3            0            0          128            0            0 
Node 0, zone   Normal R  4            0            0          128            0            0 
Node 0, zone   Normal R  5            0            1          127            0            0 
Node 0, zone   Normal R  6            0            0          128            0            0 
Node 0, zone   Normal R  7            0            0          128            0            0 
Node 0, zone   Normal R  8            0            0          128            0            0 
Node 0, zone   Normal R  9            0            0          128            0            0 
Node 0, zone   Normal R 10            0            0          128            0            0 
Node 0, zone   Normal R 11            0            0          128            0            0 
Node 0, zone   Normal R 12            0            0          128            0            0 
Node 0, zone   Normal R 13            0            0          128            0            0 
Node 0, zone   Normal R 14            0            0          128            0            0 
Node 0, zone   Normal R 15            0            0          128            0            0 
Node 0, zone   Normal R 16            0            0          128            0            0 
Node 0, zone   Normal R 17            0            0          128            0            0 
Node 0, zone   Normal R 18            0            0          128            0            0 
Node 0, zone   Normal R 19            0            0          128            0            0 
Node 0, zone   Normal R 20            0            0          128            0            0 
Node 0, zone   Normal R 21            0            0          128            0            0 
Node 0, zone   Normal R 22            0            0          128            0            0 
Node 0, zone   Normal R 23            0            0          128            0            0 
Node 0, zone   Normal R 24            0            0          128            0            0 
Node 0, zone   Normal R 25            0            0          128            0            0 
Node 0, zone   Normal R 26            0            0          128            0            0 
Node 0, zone   Normal R 27            0            0          128            0            0 
Node 0, zone   Normal R 28            0            0          128            0            0 
Node 0, zone   Normal R 29            0            0          128            0            0 
Node 0, zone   Normal R 30            0            0          128            0            0 
Node 0, zone   Normal R 31            0            0          128            0            0 
Node 0, zone   Normal R 32            0            0          128            0            0 
Node 0, zone   Normal R 33            0            0          128            0            0 
Node 0, zone   Normal R 34            0            0          128            0            0 
Node 0, zone   Normal R 35            0            0          128            0            0 
Node 0, zone   Normal R 36            0            0          128            0            0 
Node 0, zone   Normal R 37            0            0          128            0            0 
Node 0, zone   Normal R 38            0            0          128            0            0 
Node 0, zone   Normal R 39            0            0          128            0            0 
Node 0, zone   Normal R 40            0            0          128            0            0 
Node 0, zone   Normal R 41            0            0          128            0            0 
Node 0, zone   Normal R 42            0            0          128            0            0 
Node 0, zone   Normal R 43            0            0          128            0            0 
Node 0, zone   Normal R 44            0            0          128            0            0 
Node 0, zone   Normal R 45            0            0          128            0            0 
Node 0, zone   Normal R 46            0            0          128            0            0 
Node 0, zone   Normal R 47            0            0          128            0            0 
Node 0, zone   Normal R 48            0            0          128            0            0 
Node 0, zone   Normal R 49            0            0          128            0            0 
Node 0, zone   Normal R 50            0            0          128            0            0 
Node 0, zone   Normal R 51            0            0          128            0            0 
Node 0, zone   Normal R 52            0            0          128            0            0 
Node 0, zone   Normal R 53            0            0          128            0            0 
Node 0, zone   Normal R 54            0            0          128            0            0 
Node 0, zone   Normal R 55            0            0          128            0            0 
Node 0, zone   Normal R 56            0            0          128            0            0 
Node 0, zone   Normal R 57            0            0          128            0            0 
Node 0, zone   Normal R 58            0            1          127            0            0 
Node 0, zone   Normal R 59            0            0          128            0            0 
Node 0, zone   Normal R 60            0            0          128            0            0 
Node 0, zone   Normal R 61            0            0          128            0            0 
Node 0, zone   Normal R 62            0            0          128            0            0 
Node 0, zone   Normal R 63            0            0          128            0            0 
Node 0, zone   Normal R 64            0            0          128            0            0 
Node 0, zone   Normal R 65            0            0          128            0            0 
Node 0, zone   Normal R 66            0            0          128            0            0 
Node 0, zone   Normal R 67            0            0          128            0            0 
Node 0, zone   Normal R 68            0            0          128            0            0 
Node 0, zone   Normal R 69            0            0          128            0            0 
Node 0, zone   Normal R 70            0            0          128            0            0 
Node 0, zone   Normal R 71            0            0          128            0            0 
Node 0, zone   Normal R 72            0            0          128            0            0 
Node 0, zone   Normal R 73            0            0          128            0            0 
Node 0, zone   Normal R 74            0            0          128            0            0 
Node 0, zone   Normal R 75            0            0          128            0            0 
Node 0, zone   Normal R 76            0            0          128            0            0 
Node 0, zone   Normal R 77            0            0          128            0            0 
Node 0, zone   Normal R 78            0            0          128            0            0 
Node 0, zone   Normal R 79            0            0          128            0            0 
Node 0, zone   Normal R 80            0            0          128            0            0 
Node 0, zone   Normal R 81            0            0          128            0            0 
Node 0, zone   Normal R 82            0            0          128            0            0 
Node 0, zone   Normal R 83            0            0          128            0            0 
Node 0, zone   Normal R 84            0            0          128            0            0 
Node 0, zone   Normal R 85            0            0          128            0            0 
Node 0, zone   Normal R 86            0            0          128            0            0 
Node 0, zone   Normal R 87            0            0          128            0            0 
Node 0, zone   Normal R 88            0            0          128            0            0 
Node 0, zone   Normal R 89            0            0          128            0            0 
Node 0, zone   Normal R 90            0            0          128            0            0 
Node 0, zone   Normal R 91            0            0          128            0            0 
Node 0, zone   Normal R 92            0            0          128            0            0 
Node 0, zone   Normal R 93            0            0          128            0            0 
Node 0, zone   Normal R 94            0            0          128            0            0 
Node 0, zone   Normal R 95            0            0          128            0            0 
Node 0, zone   Normal R 96            0            0          128            0            0 
Node 0, zone   Normal R 97            0            0          128            0            0 
Node 0, zone   Normal R 98            0            0          128            0            0 
Node 0, zone   Normal R 99            0            0          128            0            0 
Node 0, zone   Normal R100            0            0          128            0            0 
Node 0, zone   Normal R101            0            0          128            0            0 
Node 0, zone   Normal R102            0            0          128            0            0 
Node 0, zone   Normal R103            0            0          128            0            0 
Node 0, zone   Normal R104            0            0          128            0            0 
Node 0, zone   Normal R105            0            0          128            0            0 
Node 0, zone   Normal R106            0            0          128            0            0 
Node 0, zone   Normal R107            0            0          128            0            0 
Node 0, zone   Normal R108            0            0          128            0            0 
Node 0, zone   Normal R109            0            0          128            0            0 
Node 0, zone   Normal R110            0            0          128            0            0 
Node 0, zone   Normal R111            0            0          128            0            0 
Node 0, zone   Normal R112            0            0          128            0            0 
Node 0, zone   Normal R113            0            0          128            0            0 
Node 0, zone   Normal R114            0            0          128            0            0 
Node 0, zone   Normal R115            0            0          128            0            0 
Node 0, zone   Normal R116            0            0          128            0            0 
Node 0, zone   Normal R117            0            0          128            0            0 
Node 0, zone   Normal R118            0            1          127            0            0 
Node 0, zone   Normal R119            0            4          124            0            0 
Node 0, zone   Normal R120           62           18           48            0            0 
Node 0, zone   Normal R121           63            1           64            0            0 
Node 0, zone   Normal R122          102           12           14            0            0 
Node 0, zone   Normal R123            0            0          128            0            0 
Node 0, zone   Normal R124            0            0          128            0            0 


Statistics with this patchset applied:
=====================================

Comparing these statistics with those of the vanilla kernel, we see that the
fragmentation is significantly lower, as seen in the MOVABLE migratetype.

Node 0, zone   Normal
  pages free     15731928
        min      5575
        low      6968
        high     8362
        scanned  0
        spanned  16252928
        present  16252928
        managed  15989885

Per-region page stats	 present	 free

	Region      0 	      1 	   1024
	Region      1 	 131072 	  11137
	Region      2 	 131072 	  83876
	Region      3 	 131072 	  72134
	Region      4 	 131072 	 116194
	Region      5 	 131072 	 116393
	Region      6 	 131072 	 130746
	Region      7 	 131072 	 131040
	Region      8 	 131072 	 131072
	Region      9 	 131072 	 131072
	Region     10 	 131072 	 131072
	Region     11 	 131072 	 131035
	Region     12 	 131072 	 131072
	Region     13 	 131072 	 130112
	Region     14 	 131072 	 131976
	Region     15 	 131072 	 131061
	Region     16 	 131072 	 131038
	Region     17 	 131072 	 131045
	Region     18 	 131072 	 131039
	Region     19 	 131072 	 131029
	Region     20 	 131072 	 131072
	Region     21 	 131072 	 131051
	Region     22 	 131072 	 131066
	Region     23 	 131072 	 131070
	Region     24 	 131072 	 131069
	Region     25 	 131072 	 131032
	Region     26 	 131072 	 131040
	Region     27 	 131072 	 131072
	Region     28 	 131072 	 131069
	Region     29 	 131072 	 131056
	Region     30 	 131072 	 131045
	Region     31 	 131072 	 131070
	Region     32 	 131072 	 131055
	Region     33 	 131072 	 131053
	Region     34 	 131072 	 131042
	Region     35 	 131072 	 131065
	Region     36 	 131072 	 130987
	Region     37 	 131072 	 131072
	Region     38 	 131072 	 131068
	Region     39 	 131072 	 131014
	Region     40 	 131072 	 131044
	Region     41 	 131072 	 131067
	Region     42 	 131072 	 131071
	Region     43 	 131072 	 131045
	Region     44 	 131072 	 131072
	Region     45 	 131072 	 131068
	Region     46 	 131072 	 131038
	Region     47 	 131072 	 131069
	Region     48 	 131072 	 131072
	Region     49 	 131072 	 131070
	Region     50 	 131072 	 131054
	Region     51 	 131072 	 131064
	Region     52 	 131072 	 131072
	Region     53 	 131072 	 131042
	Region     54 	 131072 	 131041
	Region     55 	 131072 	 131072
	Region     56 	 131072 	 131066
	Region     57 	 131072 	 131072
	Region     58 	 131072 	 131072
	Region     59 	 131072 	 131068
	Region     60 	 131072 	 131057
	Region     61 	 131072 	 131072
	Region     62 	 131072 	 131041
	Region     63 	 131072 	 131046
	Region     64 	 131072 	 131053
	Region     65 	 131072 	 131072
	Region     66 	 131072 	 131072
	Region     67 	 131072 	 131072
	Region     68 	 131072 	 131067
	Region     69 	 131072 	 131041
	Region     70 	 131072 	 131071
	Region     71 	 131072 	 131052
	Region     72 	 131072 	 131071
	Region     73 	 131072 	 131072
	Region     74 	 131072 	 131066
	Region     75 	 131072 	 131072
	Region     76 	 131072 	 131072
	Region     77 	 131072 	 131065
	Region     78 	 131072 	 131067
	Region     79 	 131072 	 131072
	Region     80 	 131072 	 131071
	Region     81 	 131072 	 131056
	Region     82 	 131072 	 131072
	Region     83 	 131072 	 131072
	Region     84 	 131072 	 131072
	Region     85 	 131072 	 131072
	Region     86 	 131072 	 131062
	Region     87 	 131072 	 131072
	Region     88 	 131072 	 131067
	Region     89 	 131072 	 131057
	Region     90 	 131072 	 131072
	Region     91 	 131072 	 131026
	Region     92 	 131072 	 131072
	Region     93 	 131072 	 131067
	Region     94 	 131072 	 131057
	Region     95 	 131072 	 131072
	Region     96 	 131072 	 131072
	Region     97 	 131072 	 131072
	Region     98 	 131072 	 131072
	Region     99 	 131072 	 131037
	Region    100 	 131072 	 131072
	Region    101 	 131072 	 131072
	Region    102 	 131072 	 131071
	Region    103 	 131072 	 131072
	Region    104 	 131072 	 131072
	Region    105 	 131072 	 131072
	Region    106 	 131072 	 131072
	Region    107 	 131072 	 131072
	Region    108 	 131072 	 131072
	Region    109 	 131072 	 131072
	Region    110 	 131072 	 131072
	Region    111 	 131072 	 131056
	Region    112 	 131072 	 131072
	Region    113 	 131072 	 131072
	Region    114 	 131072 	 131072
	Region    115 	 131072 	 131072
	Region    116 	 131072 	 131072
	Region    117 	 131072 	 131072
	Region    118 	 131072 	 131072
	Region    119 	 131072 	 131072
	Region    120 	 131072 	 131072
	Region    121 	 131072 	 131072
	Region    122 	 131072 	 128263
	Region    123 	 131072 	      0
	Region    124 	 131071 	     53

Page block order: 10
Pages per block:  1024

Free pages count per migrate type at order       0      1      2      3      4      5      6      7      8      9     10 
Node    0, zone      DMA, type    Unmovable      1      2      2      1      3      2      0      0      1      1      0 
Node    0, zone      DMA, type  Reclaimable      0      0      0      0      0      0      0      0      0      0      0 
Node    0, zone      DMA, type      Movable      0      0      0      0      0      0      0      0      0      0      2 
Node    0, zone      DMA, type      Reserve      0      0      0      0      0      0      0      0      0      0      1 
Node    0, zone      DMA, type      Isolate      0      0      0      0      0      0      0      0      0      0      0 
Node    0, zone    DMA32, type    Unmovable      0      1      0      0      0      0      1      1      1      1    127 
Node    0, zone    DMA32, type  Reclaimable      0      0      0      0      0      0      0      0      0      0      0 
Node    0, zone    DMA32, type      Movable      9     10     12      8     10      8      8      6      5      7    309 
Node    0, zone    DMA32, type      Reserve      0      0      0      0      0      0      0      0      0      0      1 
Node    0, zone    DMA32, type      Isolate      0      0      0      0      0      0      0      0      0      0      0 
Node    0, zone   Normal, type    Unmovable  10879   9467   6585   2559   1859    630     81      7      1      2    108 
Node    0, zone   Normal, type  Reclaimable      1      1      1      1      0      1      0      1      0      0     81 
Node    0, zone   Normal, type      Movable    690   3282   4967   2628   1209    810    677    554    468    375   8006 
Node    0, zone   Normal, type      Reserve      0      0      0      0      0      0      0      0      0      0      2 
Node    0, zone   Normal, type      Isolate      0      0      0      0      0      0      0      0      0      0      0 

Node    0, zone   Normal, R  0      Movable      0      0      0      0      0      0      0      0      0      0      0 
Node    0, zone   Normal, R  1      Movable      0      0      0      0      0      0      0      0      0      0      0 
Node    0, zone   Normal, R  2      Movable      0      0      0      0      0      0      0      0      0      0      0 
Node    0, zone   Normal, R  3      Movable     50   2600   4277   1986    588    193     90     42     18      1      1 
Node    0, zone   Normal, R  4      Movable      0      0      0      0      0      0      0      0      0      0      0 
Node    0, zone   Normal, R  5      Movable      0      0      0      0      0      0      0      0      0      0      0 
Node    0, zone   Normal, R  6      Movable     52     65     71     57     60     57     51     45     39     29     91 
Node    0, zone   Normal, R  7      Movable      2      1      3      2      2      1      2      2      2      2    126 
Node    0, zone   Normal, R  8      Movable      0      0      0      0      0      0      0      0      0      0      0 
Node    0, zone   Normal, R  9      Movable      0      0      0      0      0      0      0      0      0      0      0 
Node    0, zone   Normal, R 10      Movable      0      0      0      0      0      0      0      0      0      0      0 
Node    0, zone   Normal, R 11      Movable      5      7      8      5      6      7      7      3      3      4    124 
Node    0, zone   Normal, R 12      Movable      0      0      0      0      0      0      0      0      0      0      0 
Node    0, zone   Normal, R 13      Movable      0      0      0      0      0      0      1      0      0      0    127 
Node    0, zone   Normal, R 14      Movable     24     26     29     29     26     28     27     28     24     18    107 
Node    0, zone   Normal, R 15      Movable      9     10     10      8      7      6      7      7      7      7    121 
Node    0, zone   Normal, R 16      Movable      8     13     15     16     14     13     14     14     14     10    116 
Node    0, zone   Normal, R 17      Movable     11     17     14     14     15     15     15     13     12     13    115 
Node    0, zone   Normal, R 18      Movable      9      7      8      5      4      6      6      4      3      4    124 
Node    0, zone   Normal, R 19      Movable      9      8      9      9     11     11      9     10      8      7    120 
Node    0, zone   Normal, R 20      Movable      0      0      0      0      0      0      0      0      0      0      0 
Node    0, zone   Normal, R 21      Movable     13      9     11     10     11      9     10     10     10      8    119 
Node    0, zone   Normal, R 22      Movable      6      6      6      6      6      6      6      6      6      6    122 
Node    0, zone   Normal, R 23      Movable      2      2      2      2      2      2      2      2      2      2    126 
Node    0, zone   Normal, R 24      Movable      3      3      3      3      3      3      3      3      3      3    125 
Node    0, zone   Normal, R 25      Movable     10     11     14     12     12     13     11     12      8      6    120 
Node    0, zone   Normal, R 26      Movable     12     10     16     16     16     16     16     16     16     14    113 
Node    0, zone   Normal, R 27      Movable      0      0      0      0      0      0      0      0      0      0      0 
Node    0, zone   Normal, R 28      Movable      1      0      1      1      1      1      1      1      1      1    127 
Node    0, zone   Normal, R 29      Movable      0      0      0      2      2      2      2      2      2      2    126 
Node    0, zone   Normal, R 30      Movable      1      2      2      3      2      1      2      2      2      0    127 
Node    0, zone   Normal, R 31      Movable      2      2      2      2      2      2      2      2      2      2    126 
Node    0, zone   Normal, R 32      Movable      1      3      2      2      1      2      2      2      2      2    126 
Node    0, zone   Normal, R 33      Movable      3      3      3      3      4      4      4      4      4      4    124 
Node    0, zone   Normal, R 34      Movable      2      2      1      3      2      3      1      2      2      2    126 
Node    0, zone   Normal, R 35      Movable      3      3      4      4      4      4      4      4      4      4    124 
Node    0, zone   Normal, R 36      Movable      3     32     32     35     35     35     35     33     32     23    100 
Node    0, zone   Normal, R 37      Movable      0      0      0      0      0      0      0      0      0      0      0 
Node    0, zone   Normal, R 38      Movable      2      1      2      0      1      1      1      1      1      1    127 
Node    0, zone   Normal, R 39      Movable      4      3      3      2      4      3      2      3      3      1    126 
Node    0, zone   Normal, R 40      Movable      0      4      7      8      8      8      8      8      4      6    122 
Node    0, zone   Normal, R 41      Movable      1      1      2      2      2      2      2      2      2      2    126 
Node    0, zone   Normal, R 42      Movable      1      1      1      1      1      1      1      1      1      1    127 
Node    0, zone   Normal, R 43      Movable      3     13     12     11     12     12     12     12     10      7    119 
Node    0, zone   Normal, R 44      Movable      0      0      0      0      0      0      0      0      0      0      0 
Node    0, zone   Normal, R 45      Movable      4      4      4      4      4      4      4      4      2      3    125 
Node    0, zone   Normal, R 46      Movable      8     13     15     16     16     16     16     16     14     15    113 
Node    0, zone   Normal, R 47      Movable      1      2      2      2      2      2      2      2      2      2    126 
Node    0, zone   Normal, R 48      Movable      0      0      0      0      0      0      0      0      0      0      0 
Node    0, zone   Normal, R 49      Movable      2      2      2      2      2      2      2      2      2      2    126 
Node    0, zone   Normal, R 50      Movable      2      2      6      6      6      6      6      6      6      6    122 
Node    0, zone   Normal, R 51      Movable      2      1      1      0      1      1      1      1      1      1    127 
Node    0, zone   Normal, R 52      Movable      0      0      0      0      0      0      0      0      0      0      0 
Node    0, zone   Normal, R 53      Movable      4      3      4      7      5      6      6      6      4      5    123 
Node    0, zone   Normal, R 54      Movable      3      3      4      1      2      3      3      3      3      3    125 
Node    0, zone   Normal, R 55      Movable      0      0      0      0      0      0      0      0      0      0      0 
Node    0, zone   Normal, R 56      Movable      6      6      4      5      5      5      5      3      4      4    124 
Node    0, zone   Normal, R 57      Movable      0      0      0      0      0      0      0      0      0      0      0 
Node    0, zone   Normal, R 58      Movable      0      0      0      0      0      0      0      0      0      0      0 
Node    0, zone   Normal, R 59      Movable      0      0      1      1      1      1      1      1      1      1    127 
Node    0, zone   Normal, R 60      Movable      1      8      8      8      8      8      8      8      8      8    120 
Node    0, zone   Normal, R 61      Movable      0      0      0      0      0      0      0      0      0      0      0 
Node    0, zone   Normal, R 62      Movable      1      0      0      0      2      2      2      2      0      1    127 
Node    0, zone   Normal, R 63      Movable      2     14     14     14     14     14     14     14     14     14    114 
Node    0, zone   Normal, R 64      Movable      5     12     12     10     11     11     11      9     10      8    119 
Node    0, zone   Normal, R 65      Movable      0      0      0      0      0      0      0      0      0      0      0 
Node    0, zone   Normal, R 66      Movable      0      0      0      0      0      0      0      0      0      0      0 
Node    0, zone   Normal, R 67      Movable      0      0      0      0      0      0      0      0      0      0      0 
Node    0, zone   Normal, R 68      Movable      1      1      0      1      1      1      1      1      1      1    127 
Node    0, zone   Normal, R 69      Movable      3      3      4      5      2      4      4      4      4      2    125 
Node    0, zone   Normal, R 70      Movable      1      1      1      1      1      1      1      1      1      1    127 
Node    0, zone   Normal, R 71      Movable      0      2      6      6      6      6      6      6      6      6    122 
Node    0, zone   Normal, R 72      Movable      1      1      1      1      1      1      1      1      1      1    127 
Node    0, zone   Normal, R 73      Movable      0      0      0      0      0      0      0      0      0      0      0 
Node    0, zone   Normal, R 74      Movable      2      2      1      2      2      2      2      2      2      2    126 
Node    0, zone   Normal, R 75      Movable      0      0      0      0      0      0      0      0      0      0      0 
Node    0, zone   Normal, R 76      Movable      0      0      0      0      0      0      0      0      0      0      0 
Node    0, zone   Normal, R 77      Movable      3      3      4      4      4      4      4      4      4      4    124 
Node    0, zone   Normal, R 78      Movable      1      1      0      1      1      1      1      1      1      1    127 
Node    0, zone   Normal, R 79      Movable      0      0      0      0      0      0      0      0      0      0      0 
Node    0, zone   Normal, R 80      Movable      1      1      1      1      1      1      1      1      1      1    127 
Node    0, zone   Normal, R 81      Movable     16     16     16     16     16     16     16     16     16     16    112 
Node    0, zone   Normal, R 82      Movable      0      0      0      0      0      0      0      0      0      0      0 
Node    0, zone   Normal, R 83      Movable      0      0      0      0      0      0      0      0      0      0      0 
Node    0, zone   Normal, R 84      Movable      0      0      0      0      0      0      0      0      0      0      0 
Node    0, zone   Normal, R 85      Movable      0      0      0      0      0      0      0      0      0      0      0 
Node    0, zone   Normal, R 86      Movable      2      2      0      2      2      2      2      2      2      0    127 
Node    0, zone   Normal, R 87      Movable      0      0      0      0      0      0      0      0      0      0      0 
Node    0, zone   Normal, R 88      Movable      1      1      2      0      1      1      1      1      1      1    127 
Node    0, zone   Normal, R 89      Movable      1      0      0      2      0      1      1      1      1      1    127 
Node    0, zone   Normal, R 90      Movable      0      0      0      0      0      0      0      0      0      0      0 
Node    0, zone   Normal, R 91      Movable      2      6      5      4      5      6      6      4      5      5    123 
Node    0, zone   Normal, R 92      Movable      0      0      0      0      0      0      0      0      0      0      0 
Node    0, zone   Normal, R 93      Movable      1      1      0      1      1      1      1      1      1      1    127 
Node    0, zone   Normal, R 94      Movable      1      2      1      1      2      2      2      0      1      1    127 
Node    0, zone   Normal, R 95      Movable      0      0      0      0      0      0      0      0      0      0      0 
Node    0, zone   Normal, R 96      Movable      0      0      0      0      0      0      0      0      0      0      0 
Node    0, zone   Normal, R 97      Movable      0      0      0      0      0      0      0      0      0      0      0 
Node    0, zone   Normal, R 98      Movable      0      0      0      0      0      0      0      0      0      0      0 
Node    0, zone   Normal, R 99      Movable      5      4      4      4      4      5      5      5      5      5    123 
Node    0, zone   Normal, R100      Movable      0      0      0      0      0      0      0      0      0      0      0 
Node    0, zone   Normal, R101      Movable      0      0      0      0      0      0      0      0      0      0      0 
Node    0, zone   Normal, R102      Movable      1      1      1      1      1      1      1      1      1      1    127 
Node    0, zone   Normal, R103      Movable      0      0      0      0      0      0      0      0      0      0      0 
Node    0, zone   Normal, R104      Movable      0      0      0      0      0      0      0      0      0      0      0 
Node    0, zone   Normal, R105      Movable      0      0      0      0      0      0      0      0      0      0      0 
Node    0, zone   Normal, R106      Movable      0      0      0      0      0      0      0      0      0      0      0 
Node    0, zone   Normal, R107      Movable      0      0      0      0      0      0      0      0      0      0      0 
Node    0, zone   Normal, R108      Movable      0      0      0      0      0      0      0      0      0      0      0 
Node    0, zone   Normal, R109      Movable      0      0      0      0      0      0      0      0      0      0      0 
Node    0, zone   Normal, R110      Movable      0      0      0      0      0      0      0      0      0      0      0 
Node    0, zone   Normal, R111      Movable      2      1      1      1      2      2      2      2      2      0    127 
Node    0, zone   Normal, R112      Movable      0      0      0      0      0      0      0      0      0      0      0 
Node    0, zone   Normal, R113      Movable      0      0      0      0      0      0      0      0      0      0    128 
Node    0, zone   Normal, R114      Movable      0      0      0      0      0      0      0      0      0      0      0 
Node    0, zone   Normal, R115      Movable      0      0      0      0      0      0      0      0      0      0      0 
Node    0, zone   Normal, R116      Movable      0      0      0      0      0      0      0      0      0      0      0 
Node    0, zone   Normal, R117      Movable      0      0      0      0      0      0      0      0      0      0      0 
Node    0, zone   Normal, R118      Movable      0      0      0      0      0      0      0      0      0      0      0 
Node    0, zone   Normal, R119      Movable      0      0      0      0      0      0      0      0      0      0      0 
Node    0, zone   Normal, R120      Movable      0      0      0      0      0      0      0      0      0      0      0 
Node    0, zone   Normal, R121      Movable      0      0      0      0      0      0      0      0      0      0      0 
Node    0, zone   Normal, R122      Movable    351    298    271    239    212    203    182    127     94     60     31 
Node    0, zone   Normal, R123      Movable      0      0      0      0      0      0      0      0      0      0      0 
Node    0, zone   Normal, R124      Movable      1      0      1      0      1      1      0      0      0      0      0 

Number of blocks type         Unmovable  Reclaimable      Movable      Reserve      Isolate 
Node 0, zone      DMA                 1            0            2            1            0 
Node 0, zone    DMA32               128            0          379            1            0 
Node 0, zone   Normal               384          128        15359            1            0 

Node 0, zone   Normal R  0            0            0            0            1            0 
Node 0, zone   Normal R  1          127            0            0            1            0 
Node 0, zone   Normal R  2            1          127            0            0            0 
Node 0, zone   Normal R  3            0            1          127            0            0 
Node 0, zone   Normal R  4          127            0            1            0            0 
Node 0, zone   Normal R  5          128            0            0            0            0 
Node 0, zone   Normal R  6            1            0          127            0            0 
Node 0, zone   Normal R  7            0            0          128            0            0 
Node 0, zone   Normal R  8            0            0          128            0            0 
Node 0, zone   Normal R  9            0            0          128            0            0 
Node 0, zone   Normal R 10            0            0          128            0            0 
Node 0, zone   Normal R 11            0            0          128            0            0 
Node 0, zone   Normal R 12            0            0          128            0            0 
Node 0, zone   Normal R 13            0            0          128            0            0 
Node 0, zone   Normal R 14            0            0          128            0            0 
Node 0, zone   Normal R 15            0            0          128            0            0 
Node 0, zone   Normal R 16            0            0          128            0            0 
Node 0, zone   Normal R 17            0            0          128            0            0 
Node 0, zone   Normal R 18            0            0          128            0            0 
Node 0, zone   Normal R 19            0            0          128            0            0 
Node 0, zone   Normal R 20            0            0          128            0            0 
Node 0, zone   Normal R 21            0            0          128            0            0 
Node 0, zone   Normal R 22            0            0          128            0            0 
Node 0, zone   Normal R 23            0            0          128            0            0 
Node 0, zone   Normal R 24            0            0          128            0            0 
Node 0, zone   Normal R 25            0            0          128            0            0 
Node 0, zone   Normal R 26            0            0          128            0            0 
Node 0, zone   Normal R 27            0            0          128            0            0 
Node 0, zone   Normal R 28            0            0          128            0            0 
Node 0, zone   Normal R 29            0            0          128            0            0 
Node 0, zone   Normal R 30            0            0          128            0            0 
Node 0, zone   Normal R 31            0            0          128            0            0 
Node 0, zone   Normal R 32            0            0          128            0            0 
Node 0, zone   Normal R 33            0            0          128            0            0 
Node 0, zone   Normal R 34            0            0          128            0            0 
Node 0, zone   Normal R 35            0            0          128            0            0 
Node 0, zone   Normal R 36            0            0          128            0            0 
Node 0, zone   Normal R 37            0            0          128            0            0 
Node 0, zone   Normal R 38            0            0          128            0            0 
Node 0, zone   Normal R 39            0            0          128            0            0 
Node 0, zone   Normal R 40            0            0          128            0            0 
Node 0, zone   Normal R 41            0            0          128            0            0 
Node 0, zone   Normal R 42            0            0          128            0            0 
Node 0, zone   Normal R 43            0            0          128            0            0 
Node 0, zone   Normal R 44            0            0          128            0            0 
Node 0, zone   Normal R 45            0            0          128            0            0 
Node 0, zone   Normal R 46            0            0          128            0            0 
Node 0, zone   Normal R 47            0            0          128            0            0 
Node 0, zone   Normal R 48            0            0          128            0            0 
Node 0, zone   Normal R 49            0            0          128            0            0 
Node 0, zone   Normal R 50            0            0          128            0            0 
Node 0, zone   Normal R 51            0            0          128            0            0 
Node 0, zone   Normal R 52            0            0          128            0            0 
Node 0, zone   Normal R 53            0            0          128            0            0 
Node 0, zone   Normal R 54            0            0          128            0            0 
Node 0, zone   Normal R 55            0            0          128            0            0 
Node 0, zone   Normal R 56            0            0          128            0            0 
Node 0, zone   Normal R 57            0            0          128            0            0 
Node 0, zone   Normal R 58            0            0          128            0            0 
Node 0, zone   Normal R 59            0            0          128            0            0 
Node 0, zone   Normal R 60            0            0          128            0            0 
Node 0, zone   Normal R 61            0            0          128            0            0 
Node 0, zone   Normal R 62            0            0          128            0            0 
Node 0, zone   Normal R 63            0            0          128            0            0 
Node 0, zone   Normal R 64            0            0          128            0            0 
Node 0, zone   Normal R 65            0            0          128            0            0 
Node 0, zone   Normal R 66            0            0          128            0            0 
Node 0, zone   Normal R 67            0            0          128            0            0 
Node 0, zone   Normal R 68            0            0          128            0            0 
Node 0, zone   Normal R 69            0            0          128            0            0 
Node 0, zone   Normal R 70            0            0          128            0            0 
Node 0, zone   Normal R 71            0            0          128            0            0 
Node 0, zone   Normal R 72            0            0          128            0            0 
Node 0, zone   Normal R 73            0            0          128            0            0 
Node 0, zone   Normal R 74            0            0          128            0            0 
Node 0, zone   Normal R 75            0            0          128            0            0 
Node 0, zone   Normal R 76            0            0          128            0            0 
Node 0, zone   Normal R 77            0            0          128            0            0 
Node 0, zone   Normal R 78            0            0          128            0            0 
Node 0, zone   Normal R 79            0            0          128            0            0 
Node 0, zone   Normal R 80            0            0          128            0            0 
Node 0, zone   Normal R 81            0            0          128            0            0 
Node 0, zone   Normal R 82            0            0          128            0            0 
Node 0, zone   Normal R 83            0            0          128            0            0 
Node 0, zone   Normal R 84            0            0          128            0            0 
Node 0, zone   Normal R 85            0            0          128            0            0 
Node 0, zone   Normal R 86            0            0          128            0            0 
Node 0, zone   Normal R 87            0            0          128            0            0 
Node 0, zone   Normal R 88            0            0          128            0            0 
Node 0, zone   Normal R 89            0            0          128            0            0 
Node 0, zone   Normal R 90            0            0          128            0            0 
Node 0, zone   Normal R 91            0            0          128            0            0 
Node 0, zone   Normal R 92            0            0          128            0            0 
Node 0, zone   Normal R 93            0            0          128            0            0 
Node 0, zone   Normal R 94            0            0          128            0            0 
Node 0, zone   Normal R 95            0            0          128            0            0 
Node 0, zone   Normal R 96            0            0          128            0            0 
Node 0, zone   Normal R 97            0            0          128            0            0 
Node 0, zone   Normal R 98            0            0          128            0            0 
Node 0, zone   Normal R 99            0            0          128            0            0 
Node 0, zone   Normal R100            0            0          128            0            0 
Node 0, zone   Normal R101            0            0          128            0            0 
Node 0, zone   Normal R102            0            0          128            0            0 
Node 0, zone   Normal R103            0            0          128            0            0 
Node 0, zone   Normal R104            0            0          128            0            0 
Node 0, zone   Normal R105            0            0          128            0            0 
Node 0, zone   Normal R106            0            0          128            0            0 
Node 0, zone   Normal R107            0            0          128            0            0 
Node 0, zone   Normal R108            0            0          128            0            0 
Node 0, zone   Normal R109            0            0          128            0            0 
Node 0, zone   Normal R110            0            0          128            0            0 
Node 0, zone   Normal R111            0            0          128            0            0 
Node 0, zone   Normal R112            0            0          128            0            0 
Node 0, zone   Normal R113            0            0          128            0            0 
Node 0, zone   Normal R114            0            0          128            0            0 
Node 0, zone   Normal R115            0            0          128            0            0 
Node 0, zone   Normal R116            0            0          128            0            0 
Node 0, zone   Normal R117            0            0          128            0            0 
Node 0, zone   Normal R118            0            0          128            0            0 
Node 0, zone   Normal R119            0            0          128            0            0 
Node 0, zone   Normal R120            0            0          128            0            0 
Node 0, zone   Normal R121            0            0          128            0            0 
Node 0, zone   Normal R122            0            0          128            0            0 
Node 0, zone   Normal R123            0            0          128            0            0 
Node 0, zone   Normal R124            0            0          128            0            0 


Performance impact:
------------------

Kernbench was run with and without the patchset. It shows an elapsed-time
overhead of around 7.7% with the patchset applied.

Vanilla kernel:

Average Optimal load -j 32 Run (std deviation):
Elapsed Time 706.760000
User Time 4536.670000
System Time 1526.610000
Percent CPU 857.000000
Context Switches 2229643.000000
Sleeps 2211767.000000

With patchset:

Average Optimal load -j 32 Run (std deviation):
Elapsed Time 761.010000
User Time 4605.450000
System Time 1535.870000
Percent CPU 806.000000
Context Switches 2247690.000000
Sleeps 2213503.000000
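
The overhead figure quoted above is computed from the elapsed times:
(761.01 - 706.76) / 706.76 ≈ 0.077, i.e. roughly a 7.7% increase in elapsed time.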

This version (v3) of the patchset focussed more on improving the consolidation
ratio and less on the performance impact. There is plenty of room for
performance optimization, and I'll work on that in future versions.

Regards,
Srivatsa S. Bhat



* Re: [RESEND RFC PATCH v3 00/35] mm: Memory Power Management
@ 2013-08-30 13:26   ` Srivatsa S. Bhat
  0 siblings, 0 replies; 100+ messages in thread
From: Srivatsa S. Bhat @ 2013-08-30 13:26 UTC (permalink / raw)
  To: akpm, mgorman, hannes, tony.luck, matthew.garrett, dave, riel,
	arjan, srinivas.pandruvada, willy, kamezawa.hiroyu, lenb, rjw
  Cc: Srivatsa S. Bhat, gargankita, paulmck, svaidy, andi,
	isimatu.yasuaki, santosh.shilimkar, kosaki.motohiro, linux-pm,
	linux-mm, linux-kernel


Experimental Results:
====================

Test setup:
----------

x86 Sandy Bridge dual-socket, quad-core, HT-enabled machine with 128GB RAM.
Memory Region size = 512MB.

Testcase:
--------

Strategy:

Allocate and free large chunks of memory (comparable in size to a memory
region) from multiple threads, and examine the number of completely free
memory regions at the end of the run, when all the memory has been freed.
(Note that the test doesn't create any pagecache usage.) A simplified sketch
of this allocation pattern follows the command line below.

Implementation:

Run 20 instances of multi-threaded ebizzy in parallel, with chunksize=256MB
and 32 threads each. This means that potentially 20 * 32 = 640 threads can
allocate/free memory in parallel, and each allocation/free will be 256MB in
size, which is half the memory region size.

Cmd-line of each ebizzy instance: ./ebizzy -s 268435456 -n 2 -t 32 -S 60
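
As a rough illustration of the allocation pattern this drives (a minimal
sketch, not the actual ebizzy code; the function names and iteration count
are made up for illustration), each worker thread simply allocates a large
chunk, touches every page of it, and frees it, in a loop:

#include <pthread.h>
#include <stdlib.h>
#include <string.h>

#define CHUNK_SIZE   (256UL * 1024 * 1024)   /* 256MB: half a memory region */
#define NR_THREADS   32                      /* threads per instance */
#define ITERATIONS   16                      /* arbitrary, for illustration */

/* Each thread repeatedly allocates a large chunk, touches it, and frees it. */
static void *alloc_free_worker(void *arg)
{
        int i;

        for (i = 0; i < ITERATIONS; i++) {
                char *buf = malloc(CHUNK_SIZE);

                if (!buf)
                        continue;
                /* Touch every page so the kernel actually backs the chunk. */
                memset(buf, 0xa5, CHUNK_SIZE);
                free(buf);
        }
        return NULL;
}

int main(void)
{
        pthread_t tid[NR_THREADS];
        int i;

        for (i = 0; i < NR_THREADS; i++)
                pthread_create(&tid[i], NULL, alloc_free_worker, NULL);
        for (i = 0; i < NR_THREADS; i++)
                pthread_join(tid[i], NULL);
        return 0;
}

Running 20 such processes in parallel, as with the ebizzy command above, gives
the 20 * 32 concurrent allocators described in the implementation.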


Effectiveness in consolidating allocations:
------------------------------------------

With the above test case, the higher the number of completely free memory
regions at the end of the run, the better the memory management algorithm is
at consolidating allocations.

Here are the results, with vanilla 3.11-rc7 and with this patchset applied:

                  Free regions at test-start   Free regions after test-run
Without patchset               242                         18
With patchset                  238                        121

This shows that the patchset performs far better than the vanilla kernel at
keeping memory allocations consolidated within a minimum number of memory
regions. Note that the amount of memory still in use at the end of the run is
zero, which shows how drastically the mainline kernel can fragment memory by
spreading a handful of lingering pages across many memory regions. Since this
patchset teaches the kernel about memory region granularity/boundaries and
uses that knowledge to influence MM decisions, it shows a significant
improvement over mainline. Moreover, this improvement comes from the allocator
changes alone; targeted compaction (which was dropped in this version) is
expected to yield even more benefits.
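
For clarity, a region counts as "completely free" when every present page in
it is free. A tiny illustrative helper for this metric (the sample values are
hypothetical, in the spirit of the per-region stats elsewhere in this mail)
could look like:

#include <stdio.h>

/* A region is completely free when its free page count equals its
 * present page count. */
static int count_fully_free_regions(const unsigned long *present,
                                    const unsigned long *free_pages,
                                    int nr_regions)
{
        int i, count = 0;

        for (i = 0; i < nr_regions; i++)
                if (present[i] && free_pages[i] == present[i])
                        count++;
        return count;
}

int main(void)
{
        unsigned long present[]    = { 131072, 131072, 131072 };
        unsigned long free_pages[] = { 131072,  83876, 131072 };

        printf("%d completely free regions\n",
               count_fully_free_regions(present, free_pages, 3));
        return 0;
}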

Below is a log of how the number of completely free regions varies from the
beginning to the end of the test, sampled at 1-second intervals (the test-run
takes about a minute).

         Vanilla 3.11-rc7         With this patchset
                242                     238
                242                     238
                242                     238
                242                     238
                242                     238
                239                     236
                221                     215
                196                     181
                171                     139
                144                     112
                117                     78
                69                      48
                49                      24
                27                      21
                15                      21
                15                      21
                15                      21
                15                      21
                15                      21
                15                      21
                15                      22
                15                      22
                15                      23
                15                      23
                15                      27
                15                      29
                15                      29
                15                      30
                15                      30
                15                      30
                15                      30
                15                      30
                15                      30
                15                      30
                15                      32
                15                      33
                15                      33
                15                      33
                15                      33
                15                      36
                15                      42
                15                      42
                15                      44
                15                      48
                16                      111
                17                      114
                17                      114
                17                      114
                17                      115
                17                      115
                17                      115
                17                      115
                17                      115
                17                      115
                17                      115
                17                      115
                17                      115
                17                      115
                17                      115
                17                      115
                17                      115
                17                      115
                17                      115
                17                      116
                17                      116
                17                      116
                17                      116
                17                      116
                17                      116
                17                      116
                17                      116
                17                      116
                17                      116
                17                      116
                17                      116
                18                      121
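
A rough sketch of how such a log can be collected (my own scaffolding, not
part of the patchset): it assumes that the per-region "present/free"
counters shown further below are exported via /proc/zoneinfo, so adjust the
path if the patchset exports them elsewhere, and it counts a region as
completely free when its free page count covers all of its present pages:

	while sleep 1; do
		awk '$1 == "Region" && $4 >= $3 { n++ } END { print n+0 }' /proc/zoneinfo
	done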


It is also interesting to examine the fragmentation of memory by looking
at the per-region statistics added by this patchset.
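
As a quick way of eyeballing this from the raw dump, something like the
following can be used (again only a sketch of my own, with the same
assumption as above about where the per-region stats live; the 64-page
threshold is arbitrary):

	# List regions that are kept busy by only a handful of allocated
	# pages, i.e. regions that could potentially have been evacuated.
	awk '$1 == "Region" && $3 > $4 && ($3 - $4) < 64 {
		printf "Region %s holds only %d allocated pages\n", $2, $3 - $4
	}' /proc/zoneinfo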

Statistics for vanilla 3.11-rc7 kernel:
======================================

We can see from these statistics that there is a lot of fragmentation
within the MOVABLE migratetype.

Node 0, zone   Normal
  pages free     15751188
        min      5575
        low      6968
        high     8362
        scanned  0
        spanned  16252928
        present  16252928
        managed  15989951

Per-region page stats	 present	 free

	Region      0 	      1 	   1024
	Region      1 	 131072 	 131072
	Region      2 	 131072 	 131072
	Region      3 	 131072 	 131072
	Region      4 	 131072 	 131072
	Region      5 	 131072 	 130045
	Region      6 	 131072 	 131032
	Region      7 	 131072 	 131023
	Region      8 	 131072 	 131022
	Region      9 	 131072 	 131062
	Region     10 	 131072 	 131055
	Region     11 	 131072 	 131064
	Region     12 	 131072 	 131047
	Region     13 	 131072 	 131051
	Region     14 	 131072 	 131056
	Region     15 	 131072 	 131046
	Region     16 	 131072 	 131051
	Region     17 	 131072 	 131061
	Region     18 	 131072 	 131030
	Region     19 	 131072 	 130168
	Region     20 	 131072 	 131937
	Region     21 	 131072 	 131067
	Region     22 	 131072 	 131028
	Region     23 	 131072 	 131051
	Region     24 	 131072 	 131041
	Region     25 	 131072 	 131047
	Region     26 	 131072 	 131051
	Region     27 	 131072 	 131054
	Region     28 	 131072 	 131049
	Region     29 	 131072 	 130994
	Region     30 	 131072 	 131059
	Region     31 	 131072 	 131060
	Region     32 	 131072 	 131051
	Region     33 	 131072 	 131047
	Region     34 	 131072 	 131050
	Region     35 	 131072 	 131050
	Region     36 	 131072 	 131039
	Region     37 	 131072 	 131053
	Region     38 	 131072 	 131045
	Region     39 	 131072 	 130275
	Region     40 	 131072 	 131807
	Region     41 	 131072 	 131050
	Region     42 	 131072 	 131051
	Region     43 	 131072 	 131037
	Region     44 	 131072 	 131052
	Region     45 	 131072 	 131011
	Region     46 	 131072 	 131026
	Region     47 	 131072 	 130285
	Region     48 	 131072 	 131810
	Region     49 	 131072 	 131046
	Region     50 	 131072 	 131049
	Region     51 	 131072 	 131054
	Region     52 	 131072 	 131064
	Region     53 	 131072 	 131053
	Region     54 	 131072 	 131019
	Region     55 	 131072 	 130997
	Region     56 	 131072 	 131039
	Region     57 	 131072 	 131058
	Region     58 	 131072 	 130182
	Region     59 	 131072 	 131057
	Region     60 	 131072 	 131063
	Region     61 	 131072 	 131046
	Region     62 	 131072 	 131055
	Region     63 	 131072 	 131060
	Region     64 	 131072 	 131049
	Region     65 	 131072 	 131042
	Region     66 	 131072 	 131048
	Region     67 	 131072 	 131052
	Region     68 	 131072 	 130997
	Region     69 	 131072 	 131046
	Region     70 	 131072 	 131045
	Region     71 	 131072 	 131028
	Region     72 	 131072 	 131054
	Region     73 	 131072 	 131048
	Region     74 	 131072 	 131052
	Region     75 	 131072 	 131043
	Region     76 	 131072 	 131052
	Region     77 	 131072 	 130542
	Region     78 	 131072 	 131556
	Region     79 	 131072 	 131048
	Region     80 	 131072 	 131043
	Region     81 	 131072 	 130548
	Region     82 	 131072 	 131551
	Region     83 	 131072 	 131019
	Region     84 	 131072 	 131033
	Region     85 	 131072 	 131047
	Region     86 	 131072 	 131059
	Region     87 	 131072 	 131054
	Region     88 	 131072 	 131043
	Region     89 	 131072 	 131035
	Region     90 	 131072 	 131044
	Region     91 	 131072 	 130538
	Region     92 	 131072 	 131560
	Region     93 	 131072 	 131063
	Region     94 	 131072 	 131033
	Region     95 	 131072 	 131046
	Region     96 	 131072 	 131048
	Region     97 	 131072 	 131049
	Region     98 	 131072 	 131058
	Region     99 	 131072 	 131048
	Region    100 	 131072 	 130484
	Region    101 	 131072 	 131557
	Region    102 	 131072 	 131038
	Region    103 	 131072 	 131044
	Region    104 	 131072 	 131040
	Region    105 	 131072 	 130988
	Region    106 	 131072 	 131039
	Region    107 	 131072 	 131009
	Region    108 	 131072 	 131059
	Region    109 	 131072 	 131049
	Region    110 	 131072 	 131050
	Region    111 	 131072 	 131042
	Region    112 	 131072 	 131052
	Region    113 	 131072 	 131053
	Region    114 	 131072 	 131067
	Region    115 	 131072 	 131062
	Region    116 	 131072 	 131072
	Region    117 	 131072 	 131072
	Region    118 	 131072 	 129860
	Region    119 	 131072 	 125402
	Region    120 	 131072 	  63109
	Region    121 	 131072 	  84301
	Region    122 	 131072 	  17009
	Region    123 	 131072 	      0
	Region    124 	 131071 	      0



Page block order: 10
Pages per block:  1024

Free pages count per migrate type at order       0      1      2      3      4      5      6      7      8      9     10 
Node    0, zone      DMA, type    Unmovable      1      2      2      1      3      2      0      0      1      1      0 
Node    0, zone      DMA, type  Reclaimable      0      0      0      0      0      0      0      0      0      0      0 
Node    0, zone      DMA, type      Movable      0      0      0      0      0      0      0      0      0      0      2 
Node    0, zone      DMA, type      Reserve      0      0      0      0      0      0      0      0      0      0      1 
Node    0, zone      DMA, type      Isolate      0      0      0      0      0      0      0      0      0      0      0 
Node    0, zone    DMA32, type    Unmovable      0      1      0      0      0      0      1      1      1      1      0 
Node    0, zone    DMA32, type  Reclaimable      0      0      0      0      0      0      0      0      0      0      0 
Node    0, zone    DMA32, type      Movable      8     10     12      8     10      8      8      6      5      7    436 
Node    0, zone    DMA32, type      Reserve      0      0      0      0      0      0      0      0      0      0      1 
Node    0, zone    DMA32, type      Isolate      0      0      0      0      0      0      0      0      0      0      0 
Node    0, zone   Normal, type    Unmovable   8982   9711   5941   2108    611    189      9      0      1      1      0 
Node    0, zone   Normal, type  Reclaimable      0      0      0      0      1      0      0      1      0      0      0 
Node    0, zone   Normal, type      Movable   2349   4937   5264   3716   2323   1859   1689   1602   1412   1310  13826 
Node    0, zone   Normal, type      Reserve      0      0      0      0      0      0      0      0      0      0      2 
Node    0, zone   Normal, type      Isolate      0      0      0      0      0      0      0      0      0      0      0 

Node    0, zone   Normal, R  0      Movable      0      0      0      0      0      0      0      0      0      0      0 
Node    0, zone   Normal, R  1      Movable      0      0      0      0      0      0      0      0      0      0    127 
Node    0, zone   Normal, R  2      Movable      0      0      0      0      0      0      0      0      0      0    128 
Node    0, zone   Normal, R  3      Movable      0      0      0      0      0      0      0      0      0      0    128 
Node    0, zone   Normal, R  4      Movable      0      0      0      0      0      0      0      0      0      0    128 
Node    0, zone   Normal, R  5      Movable      3      3      3      3      3      3      3      3      3      3    124 
Node    0, zone   Normal, R  6      Movable     18     25     25     24     23     22     23     19     19     14    111 
Node    0, zone   Normal, R  7      Movable      7     16     18     16     12     13     14      8      7      9    119 
Node    0, zone   Normal, R  8      Movable     12     17     18     17     11     11     13     11     10     11    117 
Node    0, zone   Normal, R  9      Movable      6      6      7      7      5      6      6      6      6      6    122 
Node    0, zone   Normal, R 10      Movable      7     10     11     11     11     11     11     11     11      9    118 
Node    0, zone   Normal, R 11      Movable      8      8      8      8      8      8      8      8      6      7    121 
Node    0, zone   Normal, R 12      Movable      5      7     11     11     11     11     11     11     11     11    117 
Node    0, zone   Normal, R 13      Movable     15     18     18     18     18     18     18     18     18     12    113 
Node    0, zone   Normal, R 14      Movable      6      9     10      8      9      9      9      9      9      9    119 
Node    0, zone   Normal, R 15      Movable     10     12     15     15     13     14     12     13     13     13    115 
Node    0, zone   Normal, R 16      Movable      3      4      6      7      7      7      7      7      5      6    122 
Node    0, zone   Normal, R 17      Movable      1      4      5      5      5      5      5      5      5      5    123 
Node    0, zone   Normal, R 18      Movable     14     22     25     23     22     21     22     22     22     20    107 
Node    0, zone   Normal, R 19      Movable      6      7      7      7      7      7      7      8      7      7    120 
Node    0, zone   Normal, R 20      Movable      9     10     11     13     13     13     11     11     12     10    118 
Node    0, zone   Normal, R 21      Movable      3      4      4      4      4      4      2      3      3      3    125 
Node    0, zone   Normal, R 22      Movable      6     11     16     11     14     12     13     13     11     12    116 
Node    0, zone   Normal, R 23      Movable     11     14     15     15     15     15     15     15     13     14    114 
Node    0, zone   Normal, R 24      Movable      7     11     13     14     14     14     12     13     13     13    115 
Node    0, zone   Normal, R 25      Movable      7     12     12     13     11     12     12     12     12     12    116 
Node    0, zone   Normal, R 26      Movable      5      9     11     11     11     11     11     11      9     10    118 
Node    0, zone   Normal, R 27      Movable      8     13     13     13     11     10     11      9      8      9    119 
Node    0, zone   Normal, R 28      Movable     11     13     13     12     11     12     12     10      9      8    119 
Node    0, zone   Normal, R 29      Movable     20     27     28     27     26     24     22     22     19     17    109 
Node    0, zone   Normal, R 30      Movable      5      9      9      9      9      9      9      9      9      7    120 
Node    0, zone   Normal, R 31      Movable      6      7      6      7      7      7      7      7      7      5    122 
Node    0, zone   Normal, R 32      Movable      1      5      8      8      8      8      8      8      8      8    120 
Node    0, zone   Normal, R 33      Movable      5      9     10     11     11     11     11     11     11      9    118 
Node    0, zone   Normal, R 34      Movable      6      8      9     10      8      7      8      8      8      8    120 
Node    0, zone   Normal, R 35      Movable     14     18     16     17     17     15     16     16     14     13    114 
Node    0, zone   Normal, R 36      Movable     11     16     19     19     17     18     16     17     15     16    112 
Node    0, zone   Normal, R 37      Movable     15     17     17     17     17     17     17     17     15     14    113 
Node    0, zone   Normal, R 38      Movable      7     13     15     15     15     15     15     15     13     12    115 
Node    0, zone   Normal, R 39      Movable     11     18     19     19     17     16     17     15     15     11    114 
Node    0, zone   Normal, R 40      Movable     13     21     18     18     19     15     15     16     13     13    115 
Node    0, zone   Normal, R 41      Movable      4      7     10     10     10     10     10     10     10     10    118 
Node    0, zone   Normal, R 42      Movable     13     15     16     14     11     13     13     13     13     11    116 
Node    0, zone   Normal, R 43      Movable     13     16     16     18     18     18     18     18     14     16    112 
Node    0, zone   Normal, R 44      Movable     10     11     11     12     12     12     12     12     12     12    116 
Node    0, zone   Normal, R 45      Movable     13     19     20     22     21     22     20     21     17     15    111 
Node    0, zone   Normal, R 46      Movable     10     16     16     19     19     19     19     19     15     15    112 
Node    0, zone   Normal, R 47      Movable     11     15     15     13     14     14     14     14     13     11    115 
Node    0, zone   Normal, R 48      Movable      8     15     17     17     15     16     14     15     12     14    115 
Node    0, zone   Normal, R 49      Movable      8      9     11     12     12     12     12     10     11      9    118 
Node    0, zone   Normal, R 50      Movable      9     12     14     14     14     12     13     13     13     13    115 
Node    0, zone   Normal, R 51      Movable      8     11     12     12     12     12     12     12     10      7    119 
Node    0, zone   Normal, R 52      Movable      8      8      8      8      8      8      8      8      8      8    120 
Node    0, zone   Normal, R 53      Movable      9     12     13     13     13     13     13     13     13     11    116 
Node    0, zone   Normal, R 54      Movable     11     14     19     19     18     19     15     17     15     10    115 
Node    0, zone   Normal, R 55      Movable     13     14     15     16     17     16     17     17     15     16    112 
Node    0, zone   Normal, R 56      Movable      3      8     11     12     12     12     12     10      9     10    118 
Node    0, zone   Normal, R 57      Movable      4      9      9      7      8      8      8      8      8      8    120 
Node    0, zone   Normal, R 58      Movable      6      8      8      8      8      8      8      8      6      7    120 
Node    0, zone   Normal, R 59      Movable      7     11     11     11     11     11     11     11      9     10    118 
Node    0, zone   Normal, R 60      Movable      3      4      5      5      5      5      5      5      5      5    123 
Node    0, zone   Normal, R 61      Movable      8     15     16     14     13     14     14     12     13     11    116 
Node    0, zone   Normal, R 62      Movable      7     10     11     11     11     11     11     11     11     11    117 
Node    0, zone   Normal, R 63      Movable      4      4      6      6      6      6      6      6      6      4    123 
Node    0, zone   Normal, R 64      Movable      9     12     14     14     14     14     14     14     12     13    115 
Node    0, zone   Normal, R 65      Movable      6      8     11     12     12     12     12     12     10      9    118 
Node    0, zone   Normal, R 66      Movable     20     22     22     16     19     19     15     17     13     11    115 
Node    0, zone   Normal, R 67      Movable      4      8     10      8      7      8      8      8      8      6    121 
Node    0, zone   Normal, R 68      Movable     13     20     22     23     23     24     18     19     20     18    109 
Node    0, zone   Normal, R 69      Movable      4      9     10     11     11     11     11      9      8      9    119 
Node    0, zone   Normal, R 70      Movable      9     14     16     16     16     16     14     15     13     12    115 
Node    0, zone   Normal, R 71      Movable      8     18     22     22     22     22     20     21     17     17    110 
Node    0, zone   Normal, R 72      Movable      4      5      8      8      8      8      8      8      8      8    120 
Node    0, zone   Normal, R 73      Movable     18     21     21     21     17     17     18     18     18     16    111 
Node    0, zone   Normal, R 74      Movable      4      8     10     10     10     10     10     10     10      8    119 
Node    0, zone   Normal, R 75      Movable      9      9     12     13     13     13     13     13      9     11    117 
Node    0, zone   Normal, R 76      Movable     12     16     16     14     13     14     14     14      8      9    118 
Node    0, zone   Normal, R 77      Movable     14     14     15     15     15     15     15     15     13     15    113 
Node    0, zone   Normal, R 78      Movable      8     10     14     14     14     14     14     14     12     12    116 
Node    0, zone   Normal, R 79      Movable      8     10     13     13      9     11      9     10     10      6    120 
Node    0, zone   Normal, R 80      Movable     11     14     15     16     16     16     16     16     12     14    114 
Node    0, zone   Normal, R 81      Movable      4      8      8      8      8      8      8      8      8      9    119 
Node    0, zone   Normal, R 82      Movable     17     23     24     24     22     23     19     21     21     12    112 
Node    0, zone   Normal, R 83      Movable      5      7      8      9      9     10      6      8      8      6    121 
Node    0, zone   Normal, R 84      Movable     17     22     25     25     23     24     24     20     16     19    109 
Node    0, zone   Normal, R 85      Movable      9     15     16     16     16     16     16     16     10     11    116 
Node    0, zone   Normal, R 86      Movable      3      8      8      8      8      8      6      7      5      6    122 
Node    0, zone   Normal, R 87      Movable     10     12     13     13     11     10     11     11      7      9    119 
Node    0, zone   Normal, R 88      Movable     15     20     17     15     15     16     16     14      9     12    116 
Node    0, zone   Normal, R 89      Movable     15     18     20     21     21     19     20     20     16     18    110 
Node    0, zone   Normal, R 90      Movable      8     16     15     14     15     15     13     14     12     13    115 
Node    0, zone   Normal, R 91      Movable      6     10     10     11     11     11     11     11      9      7    119 
Node    0, zone   Normal, R 92      Movable     14     15     17     17     17     17     17     17     15     13    114 
Node    0, zone   Normal, R 93      Movable      5      5      6      6      6      6      6      6      6      6    122 
Node    0, zone   Normal, R 94      Movable     15     27     25     26     26     26     26     26     20     19    107 
Node    0, zone   Normal, R 95      Movable     12     15     17     17     15     16     14     13     14     10    116 
Node    0, zone   Normal, R 96      Movable     10     15     16     16     16     16     16     14     15     15    113 
Node    0, zone   Normal, R 97      Movable      5      8      9     10      8      9      9      9      9      7    120 
Node    0, zone   Normal, R 98      Movable      8     11     11     11     11     11     11      9     10     10    118 
Node    0, zone   Normal, R 99      Movable     10     13     15     15     15     15     15     15     15     11    115 
Node    0, zone   Normal, R100      Movable     32     42     48     44     44     45     39     40     23     25     99 
Node    0, zone   Normal, R101      Movable      9     14     16     16     14     15     15     15     15     14    114 
Node    0, zone   Normal, R102      Movable     18     24     25     23     22     23     21     22     18     18    109 
Node    0, zone   Normal, R103      Movable     12     16     18     16     15     14     15     15     11     13    115 
Node    0, zone   Normal, R104      Movable     14     17     20     20     20     20     20     20     20     18    109 
Node    0, zone   Normal, R105      Movable     16     24     35     32     32     33     31     30     25     26    101 
Node    0, zone   Normal, R106      Movable     11     18     20     20     20     20     18     19     19     15    111 
Node    0, zone   Normal, R107      Movable     11     29     33     33     33     33     33     33     25     25    101 
Node    0, zone   Normal, R108      Movable     13     13     13     13     13     13     13     13     13     13    115 
Node    0, zone   Normal, R109      Movable      3      5      9      9      9      9      9      9      9      9    119 
Node    0, zone   Normal, R110      Movable     12     15     16     16     16     16     16     16     10     13    115 
Node    0, zone   Normal, R111      Movable      8     13     16     16     16     16     16     14     15     11    115 
Node    0, zone   Normal, R112      Movable      6      9      5      4      4      5      5      5      3      2    125 
Node    0, zone   Normal, R113      Movable      1      2      2      2      3      1      2      2      2      2    126 
Node    0, zone   Normal, R114      Movable      1      3      3      3      3      3      3      3      1      2    126 
Node    0, zone   Normal, R115      Movable      4      5      4      5      5      5      5      5      5      5    123 
Node    0, zone   Normal, R116      Movable      0      0      0      0      0      0      0      0      0      0    128 
Node    0, zone   Normal, R117      Movable      0      0      0      0      0      0      0      0      0      0    128 
Node    0, zone   Normal, R118      Movable     10     33     34     22     18     17     14     14     14     11    114 
Node    0, zone   Normal, R119      Movable     36    117    163    146    143    138    126    102     85     66     39 
Node    0, zone   Normal, R120      Movable    366    963    961    572    191     57     35     19     19      5      5 
Node    0, zone   Normal, R121      Movable    802   2065   2211   1260    425    128     45     14      7      0      0 
Node    0, zone   Normal, R122      Movable    123    328    322    160     37      6      1      0      0      0      0 
Node    0, zone   Normal, R123      Movable      0      0      0      0      0      0      0      0      0      0      0 
Node    0, zone   Normal, R124      Movable      0      0      0      0      0      0      0      0      0      0      0 

Number of blocks type         Unmovable  Reclaimable      Movable      Reserve      Isolate 
Node 0, zone      DMA                 1            0            2            1            0 
Node 0, zone    DMA32                 1            0          506            1            0 
Node 0, zone   Normal               227           38        15605            2            0 

Node 0, zone      DMA R  0            1            0            2            1            0 
Node 0, zone    DMA32 R  0            0            0          124            1            0 
Node 0, zone    DMA32 R  1            0            0          128            0            0 
Node 0, zone    DMA32 R  2            0            0          128            0            0 
Node 0, zone    DMA32 R  3            1            0          127            0            0 
Node 0, zone    DMA32 R  4            0            0            0            0            0 
Node 0, zone    DMA32 R  5            0            0            0            0            0 
Node 0, zone    DMA32 R  6            0            0            0            0            0 
Node 0, zone    DMA32 R  7            0            0            0            0            0 
Node 0, zone   Normal R  0            0            0            0            1            0 
Node 0, zone   Normal R  1            0            0          126            2            0 
Node 0, zone   Normal R  2            0            0          128            0            0 
Node 0, zone   Normal R  3            0            0          128            0            0 
Node 0, zone   Normal R  4            0            0          128            0            0 
Node 0, zone   Normal R  5            0            1          127            0            0 
Node 0, zone   Normal R  6            0            0          128            0            0 
Node 0, zone   Normal R  7            0            0          128            0            0 
Node 0, zone   Normal R  8            0            0          128            0            0 
Node 0, zone   Normal R  9            0            0          128            0            0 
Node 0, zone   Normal R 10            0            0          128            0            0 
Node 0, zone   Normal R 11            0            0          128            0            0 
Node 0, zone   Normal R 12            0            0          128            0            0 
Node 0, zone   Normal R 13            0            0          128            0            0 
Node 0, zone   Normal R 14            0            0          128            0            0 
Node 0, zone   Normal R 15            0            0          128            0            0 
Node 0, zone   Normal R 16            0            0          128            0            0 
Node 0, zone   Normal R 17            0            0          128            0            0 
Node 0, zone   Normal R 18            0            0          128            0            0 
Node 0, zone   Normal R 19            0            0          128            0            0 
Node 0, zone   Normal R 20            0            0          128            0            0 
Node 0, zone   Normal R 21            0            0          128            0            0 
Node 0, zone   Normal R 22            0            0          128            0            0 
Node 0, zone   Normal R 23            0            0          128            0            0 
Node 0, zone   Normal R 24            0            0          128            0            0 
Node 0, zone   Normal R 25            0            0          128            0            0 
Node 0, zone   Normal R 26            0            0          128            0            0 
Node 0, zone   Normal R 27            0            0          128            0            0 
Node 0, zone   Normal R 28            0            0          128            0            0 
Node 0, zone   Normal R 29            0            0          128            0            0 
Node 0, zone   Normal R 30            0            0          128            0            0 
Node 0, zone   Normal R 31            0            0          128            0            0 
Node 0, zone   Normal R 32            0            0          128            0            0 
Node 0, zone   Normal R 33            0            0          128            0            0 
Node 0, zone   Normal R 34            0            0          128            0            0 
Node 0, zone   Normal R 35            0            0          128            0            0 
Node 0, zone   Normal R 36            0            0          128            0            0 
Node 0, zone   Normal R 37            0            0          128            0            0 
Node 0, zone   Normal R 38            0            0          128            0            0 
Node 0, zone   Normal R 39            0            0          128            0            0 
Node 0, zone   Normal R 40            0            0          128            0            0 
Node 0, zone   Normal R 41            0            0          128            0            0 
Node 0, zone   Normal R 42            0            0          128            0            0 
Node 0, zone   Normal R 43            0            0          128            0            0 
Node 0, zone   Normal R 44            0            0          128            0            0 
Node 0, zone   Normal R 45            0            0          128            0            0 
Node 0, zone   Normal R 46            0            0          128            0            0 
Node 0, zone   Normal R 47            0            0          128            0            0 
Node 0, zone   Normal R 48            0            0          128            0            0 
Node 0, zone   Normal R 49            0            0          128            0            0 
Node 0, zone   Normal R 50            0            0          128            0            0 
Node 0, zone   Normal R 51            0            0          128            0            0 
Node 0, zone   Normal R 52            0            0          128            0            0 
Node 0, zone   Normal R 53            0            0          128            0            0 
Node 0, zone   Normal R 54            0            0          128            0            0 
Node 0, zone   Normal R 55            0            0          128            0            0 
Node 0, zone   Normal R 56            0            0          128            0            0 
Node 0, zone   Normal R 57            0            0          128            0            0 
Node 0, zone   Normal R 58            0            1          127            0            0 
Node 0, zone   Normal R 59            0            0          128            0            0 
Node 0, zone   Normal R 60            0            0          128            0            0 
Node 0, zone   Normal R 61            0            0          128            0            0 
Node 0, zone   Normal R 62            0            0          128            0            0 
Node 0, zone   Normal R 63            0            0          128            0            0 
Node 0, zone   Normal R 64            0            0          128            0            0 
Node 0, zone   Normal R 65            0            0          128            0            0 
Node 0, zone   Normal R 66            0            0          128            0            0 
Node 0, zone   Normal R 67            0            0          128            0            0 
Node 0, zone   Normal R 68            0            0          128            0            0 
Node 0, zone   Normal R 69            0            0          128            0            0 
Node 0, zone   Normal R 70            0            0          128            0            0 
Node 0, zone   Normal R 71            0            0          128            0            0 
Node 0, zone   Normal R 72            0            0          128            0            0 
Node 0, zone   Normal R 73            0            0          128            0            0 
Node 0, zone   Normal R 74            0            0          128            0            0 
Node 0, zone   Normal R 75            0            0          128            0            0 
Node 0, zone   Normal R 76            0            0          128            0            0 
Node 0, zone   Normal R 77            0            0          128            0            0 
Node 0, zone   Normal R 78            0            0          128            0            0 
Node 0, zone   Normal R 79            0            0          128            0            0 
Node 0, zone   Normal R 80            0            0          128            0            0 
Node 0, zone   Normal R 81            0            0          128            0            0 
Node 0, zone   Normal R 82            0            0          128            0            0 
Node 0, zone   Normal R 83            0            0          128            0            0 
Node 0, zone   Normal R 84            0            0          128            0            0 
Node 0, zone   Normal R 85            0            0          128            0            0 
Node 0, zone   Normal R 86            0            0          128            0            0 
Node 0, zone   Normal R 87            0            0          128            0            0 
Node 0, zone   Normal R 88            0            0          128            0            0 
Node 0, zone   Normal R 89            0            0          128            0            0 
Node 0, zone   Normal R 90            0            0          128            0            0 
Node 0, zone   Normal R 91            0            0          128            0            0 
Node 0, zone   Normal R 92            0            0          128            0            0 
Node 0, zone   Normal R 93            0            0          128            0            0 
Node 0, zone   Normal R 94            0            0          128            0            0 
Node 0, zone   Normal R 95            0            0          128            0            0 
Node 0, zone   Normal R 96            0            0          128            0            0 
Node 0, zone   Normal R 97            0            0          128            0            0 
Node 0, zone   Normal R 98            0            0          128            0            0 
Node 0, zone   Normal R 99            0            0          128            0            0 
Node 0, zone   Normal R100            0            0          128            0            0 
Node 0, zone   Normal R101            0            0          128            0            0 
Node 0, zone   Normal R102            0            0          128            0            0 
Node 0, zone   Normal R103            0            0          128            0            0 
Node 0, zone   Normal R104            0            0          128            0            0 
Node 0, zone   Normal R105            0            0          128            0            0 
Node 0, zone   Normal R106            0            0          128            0            0 
Node 0, zone   Normal R107            0            0          128            0            0 
Node 0, zone   Normal R108            0            0          128            0            0 
Node 0, zone   Normal R109            0            0          128            0            0 
Node 0, zone   Normal R110            0            0          128            0            0 
Node 0, zone   Normal R111            0            0          128            0            0 
Node 0, zone   Normal R112            0            0          128            0            0 
Node 0, zone   Normal R113            0            0          128            0            0 
Node 0, zone   Normal R114            0            0          128            0            0 
Node 0, zone   Normal R115            0            0          128            0            0 
Node 0, zone   Normal R116            0            0          128            0            0 
Node 0, zone   Normal R117            0            0          128            0            0 
Node 0, zone   Normal R118            0            1          127            0            0 
Node 0, zone   Normal R119            0            4          124            0            0 
Node 0, zone   Normal R120           62           18           48            0            0 
Node 0, zone   Normal R121           63            1           64            0            0 
Node 0, zone   Normal R122          102           12           14            0            0 
Node 0, zone   Normal R123            0            0          128            0            0 
Node 0, zone   Normal R124            0            0          128            0            0 


Statistics with this patchset applied:
=====================================

Comparing these statistics with those of the vanilla kernel, we see that
the fragmentation is significantly lower, as seen in the MOVABLE migratetype.

Node 0, zone   Normal
  pages free     15731928
        min      5575
        low      6968
        high     8362
        scanned  0
        spanned  16252928
        present  16252928
        managed  15989885

Per-region page stats	 present	 free

	Region      0 	      1 	   1024
	Region      1 	 131072 	  11137
	Region      2 	 131072 	  83876
	Region      3 	 131072 	  72134
	Region      4 	 131072 	 116194
	Region      5 	 131072 	 116393
	Region      6 	 131072 	 130746
	Region      7 	 131072 	 131040
	Region      8 	 131072 	 131072
	Region      9 	 131072 	 131072
	Region     10 	 131072 	 131072
	Region     11 	 131072 	 131035
	Region     12 	 131072 	 131072
	Region     13 	 131072 	 130112
	Region     14 	 131072 	 131976
	Region     15 	 131072 	 131061
	Region     16 	 131072 	 131038
	Region     17 	 131072 	 131045
	Region     18 	 131072 	 131039
	Region     19 	 131072 	 131029
	Region     20 	 131072 	 131072
	Region     21 	 131072 	 131051
	Region     22 	 131072 	 131066
	Region     23 	 131072 	 131070
	Region     24 	 131072 	 131069
	Region     25 	 131072 	 131032
	Region     26 	 131072 	 131040
	Region     27 	 131072 	 131072
	Region     28 	 131072 	 131069
	Region     29 	 131072 	 131056
	Region     30 	 131072 	 131045
	Region     31 	 131072 	 131070
	Region     32 	 131072 	 131055
	Region     33 	 131072 	 131053
	Region     34 	 131072 	 131042
	Region     35 	 131072 	 131065
	Region     36 	 131072 	 130987
	Region     37 	 131072 	 131072
	Region     38 	 131072 	 131068
	Region     39 	 131072 	 131014
	Region     40 	 131072 	 131044
	Region     41 	 131072 	 131067
	Region     42 	 131072 	 131071
	Region     43 	 131072 	 131045
	Region     44 	 131072 	 131072
	Region     45 	 131072 	 131068
	Region     46 	 131072 	 131038
	Region     47 	 131072 	 131069
	Region     48 	 131072 	 131072
	Region     49 	 131072 	 131070
	Region     50 	 131072 	 131054
	Region     51 	 131072 	 131064
	Region     52 	 131072 	 131072
	Region     53 	 131072 	 131042
	Region     54 	 131072 	 131041
	Region     55 	 131072 	 131072
	Region     56 	 131072 	 131066
	Region     57 	 131072 	 131072
	Region     58 	 131072 	 131072
	Region     59 	 131072 	 131068
	Region     60 	 131072 	 131057
	Region     61 	 131072 	 131072
	Region     62 	 131072 	 131041
	Region     63 	 131072 	 131046
	Region     64 	 131072 	 131053
	Region     65 	 131072 	 131072
	Region     66 	 131072 	 131072
	Region     67 	 131072 	 131072
	Region     68 	 131072 	 131067
	Region     69 	 131072 	 131041
	Region     70 	 131072 	 131071
	Region     71 	 131072 	 131052
	Region     72 	 131072 	 131071
	Region     73 	 131072 	 131072
	Region     74 	 131072 	 131066
	Region     75 	 131072 	 131072
	Region     76 	 131072 	 131072
	Region     77 	 131072 	 131065
	Region     78 	 131072 	 131067
	Region     79 	 131072 	 131072
	Region     80 	 131072 	 131071
	Region     81 	 131072 	 131056
	Region     82 	 131072 	 131072
	Region     83 	 131072 	 131072
	Region     84 	 131072 	 131072
	Region     85 	 131072 	 131072
	Region     86 	 131072 	 131062
	Region     87 	 131072 	 131072
	Region     88 	 131072 	 131067
	Region     89 	 131072 	 131057
	Region     90 	 131072 	 131072
	Region     91 	 131072 	 131026
	Region     92 	 131072 	 131072
	Region     93 	 131072 	 131067
	Region     94 	 131072 	 131057
	Region     95 	 131072 	 131072
	Region     96 	 131072 	 131072
	Region     97 	 131072 	 131072
	Region     98 	 131072 	 131072
	Region     99 	 131072 	 131037
	Region    100 	 131072 	 131072
	Region    101 	 131072 	 131072
	Region    102 	 131072 	 131071
	Region    103 	 131072 	 131072
	Region    104 	 131072 	 131072
	Region    105 	 131072 	 131072
	Region    106 	 131072 	 131072
	Region    107 	 131072 	 131072
	Region    108 	 131072 	 131072
	Region    109 	 131072 	 131072
	Region    110 	 131072 	 131072
	Region    111 	 131072 	 131056
	Region    112 	 131072 	 131072
	Region    113 	 131072 	 131072
	Region    114 	 131072 	 131072
	Region    115 	 131072 	 131072
	Region    116 	 131072 	 131072
	Region    117 	 131072 	 131072
	Region    118 	 131072 	 131072
	Region    119 	 131072 	 131072
	Region    120 	 131072 	 131072
	Region    121 	 131072 	 131072
	Region    122 	 131072 	 128263
	Region    123 	 131072 	      0
	Region    124 	 131071 	     53

Page block order: 10
Pages per block:  1024

Free pages count per migrate type at order       0      1      2      3      4      5      6      7      8      9     10 
Node    0, zone      DMA, type    Unmovable      1      2      2      1      3      2      0      0      1      1      0 
Node    0, zone      DMA, type  Reclaimable      0      0      0      0      0      0      0      0      0      0      0 
Node    0, zone      DMA, type      Movable      0      0      0      0      0      0      0      0      0      0      2 
Node    0, zone      DMA, type      Reserve      0      0      0      0      0      0      0      0      0      0      1 
Node    0, zone      DMA, type      Isolate      0      0      0      0      0      0      0      0      0      0      0 
Node    0, zone    DMA32, type    Unmovable      0      1      0      0      0      0      1      1      1      1    127 
Node    0, zone    DMA32, type  Reclaimable      0      0      0      0      0      0      0      0      0      0      0 
Node    0, zone    DMA32, type      Movable      9     10     12      8     10      8      8      6      5      7    309 
Node    0, zone    DMA32, type      Reserve      0      0      0      0      0      0      0      0      0      0      1 
Node    0, zone    DMA32, type      Isolate      0      0      0      0      0      0      0      0      0      0      0 
Node    0, zone   Normal, type    Unmovable  10879   9467   6585   2559   1859    630     81      7      1      2    108 
Node    0, zone   Normal, type  Reclaimable      1      1      1      1      0      1      0      1      0      0     81 
Node    0, zone   Normal, type      Movable    690   3282   4967   2628   1209    810    677    554    468    375   8006 
Node    0, zone   Normal, type      Reserve      0      0      0      0      0      0      0      0      0      0      2 
Node    0, zone   Normal, type      Isolate      0      0      0      0      0      0      0      0      0      0      0 

Node    0, zone   Normal, R  0      Movable      0      0      0      0      0      0      0      0      0      0      0 
Node    0, zone   Normal, R  1      Movable      0      0      0      0      0      0      0      0      0      0      0 
Node    0, zone   Normal, R  2      Movable      0      0      0      0      0      0      0      0      0      0      0 
Node    0, zone   Normal, R  3      Movable     50   2600   4277   1986    588    193     90     42     18      1      1 
Node    0, zone   Normal, R  4      Movable      0      0      0      0      0      0      0      0      0      0      0 
Node    0, zone   Normal, R  5      Movable      0      0      0      0      0      0      0      0      0      0      0 
Node    0, zone   Normal, R  6      Movable     52     65     71     57     60     57     51     45     39     29     91 
Node    0, zone   Normal, R  7      Movable      2      1      3      2      2      1      2      2      2      2    126 
Node    0, zone   Normal, R  8      Movable      0      0      0      0      0      0      0      0      0      0      0 
Node    0, zone   Normal, R  9      Movable      0      0      0      0      0      0      0      0      0      0      0 
Node    0, zone   Normal, R 10      Movable      0      0      0      0      0      0      0      0      0      0      0 
Node    0, zone   Normal, R 11      Movable      5      7      8      5      6      7      7      3      3      4    124 
Node    0, zone   Normal, R 12      Movable      0      0      0      0      0      0      0      0      0      0      0 
Node    0, zone   Normal, R 13      Movable      0      0      0      0      0      0      1      0      0      0    127 
Node    0, zone   Normal, R 14      Movable     24     26     29     29     26     28     27     28     24     18    107 
Node    0, zone   Normal, R 15      Movable      9     10     10      8      7      6      7      7      7      7    121 
Node    0, zone   Normal, R 16      Movable      8     13     15     16     14     13     14     14     14     10    116 
Node    0, zone   Normal, R 17      Movable     11     17     14     14     15     15     15     13     12     13    115 
Node    0, zone   Normal, R 18      Movable      9      7      8      5      4      6      6      4      3      4    124 
Node    0, zone   Normal, R 19      Movable      9      8      9      9     11     11      9     10      8      7    120 
Node    0, zone   Normal, R 20      Movable      0      0      0      0      0      0      0      0      0      0      0 
Node    0, zone   Normal, R 21      Movable     13      9     11     10     11      9     10     10     10      8    119 
Node    0, zone   Normal, R 22      Movable      6      6      6      6      6      6      6      6      6      6    122 
Node    0, zone   Normal, R 23      Movable      2      2      2      2      2      2      2      2      2      2    126 
Node    0, zone   Normal, R 24      Movable      3      3      3      3      3      3      3      3      3      3    125 
Node    0, zone   Normal, R 25      Movable     10     11     14     12     12     13     11     12      8      6    120 
Node    0, zone   Normal, R 26      Movable     12     10     16     16     16     16     16     16     16     14    113 
Node    0, zone   Normal, R 27      Movable      0      0      0      0      0      0      0      0      0      0      0 
Node    0, zone   Normal, R 28      Movable      1      0      1      1      1      1      1      1      1      1    127 
Node    0, zone   Normal, R 29      Movable      0      0      0      2      2      2      2      2      2      2    126 
Node    0, zone   Normal, R 30      Movable      1      2      2      3      2      1      2      2      2      0    127 
Node    0, zone   Normal, R 31      Movable      2      2      2      2      2      2      2      2      2      2    126 
Node    0, zone   Normal, R 32      Movable      1      3      2      2      1      2      2      2      2      2    126 
Node    0, zone   Normal, R 33      Movable      3      3      3      3      4      4      4      4      4      4    124 
Node    0, zone   Normal, R 34      Movable      2      2      1      3      2      3      1      2      2      2    126 
Node    0, zone   Normal, R 35      Movable      3      3      4      4      4      4      4      4      4      4    124 
Node    0, zone   Normal, R 36      Movable      3     32     32     35     35     35     35     33     32     23    100 
Node    0, zone   Normal, R 37      Movable      0      0      0      0      0      0      0      0      0      0      0 
Node    0, zone   Normal, R 38      Movable      2      1      2      0      1      1      1      1      1      1    127 
Node    0, zone   Normal, R 39      Movable      4      3      3      2      4      3      2      3      3      1    126 
Node    0, zone   Normal, R 40      Movable      0      4      7      8      8      8      8      8      4      6    122 
Node    0, zone   Normal, R 41      Movable      1      1      2      2      2      2      2      2      2      2    126 
Node    0, zone   Normal, R 42      Movable      1      1      1      1      1      1      1      1      1      1    127 
Node    0, zone   Normal, R 43      Movable      3     13     12     11     12     12     12     12     10      7    119 
Node    0, zone   Normal, R 44      Movable      0      0      0      0      0      0      0      0      0      0      0 
Node    0, zone   Normal, R 45      Movable      4      4      4      4      4      4      4      4      2      3    125 
Node    0, zone   Normal, R 46      Movable      8     13     15     16     16     16     16     16     14     15    113 
Node    0, zone   Normal, R 47      Movable      1      2      2      2      2      2      2      2      2      2    126 
Node    0, zone   Normal, R 48      Movable      0      0      0      0      0      0      0      0      0      0      0 
Node    0, zone   Normal, R 49      Movable      2      2      2      2      2      2      2      2      2      2    126 
Node    0, zone   Normal, R 50      Movable      2      2      6      6      6      6      6      6      6      6    122 
Node    0, zone   Normal, R 51      Movable      2      1      1      0      1      1      1      1      1      1    127 
Node    0, zone   Normal, R 52      Movable      0      0      0      0      0      0      0      0      0      0      0 
Node    0, zone   Normal, R 53      Movable      4      3      4      7      5      6      6      6      4      5    123 
Node    0, zone   Normal, R 54      Movable      3      3      4      1      2      3      3      3      3      3    125 
Node    0, zone   Normal, R 55      Movable      0      0      0      0      0      0      0      0      0      0      0 
Node    0, zone   Normal, R 56      Movable      6      6      4      5      5      5      5      3      4      4    124 
Node    0, zone   Normal, R 57      Movable      0      0      0      0      0      0      0      0      0      0      0 
Node    0, zone   Normal, R 58      Movable      0      0      0      0      0      0      0      0      0      0      0 
Node    0, zone   Normal, R 59      Movable      0      0      1      1      1      1      1      1      1      1    127 
Node    0, zone   Normal, R 60      Movable      1      8      8      8      8      8      8      8      8      8    120 
Node    0, zone   Normal, R 61      Movable      0      0      0      0      0      0      0      0      0      0      0 
Node    0, zone   Normal, R 62      Movable      1      0      0      0      2      2      2      2      0      1    127 
Node    0, zone   Normal, R 63      Movable      2     14     14     14     14     14     14     14     14     14    114 
Node    0, zone   Normal, R 64      Movable      5     12     12     10     11     11     11      9     10      8    119 
Node    0, zone   Normal, R 65      Movable      0      0      0      0      0      0      0      0      0      0      0 
Node    0, zone   Normal, R 66      Movable      0      0      0      0      0      0      0      0      0      0      0 
Node    0, zone   Normal, R 67      Movable      0      0      0      0      0      0      0      0      0      0      0 
Node    0, zone   Normal, R 68      Movable      1      1      0      1      1      1      1      1      1      1    127 
Node    0, zone   Normal, R 69      Movable      3      3      4      5      2      4      4      4      4      2    125 
Node    0, zone   Normal, R 70      Movable      1      1      1      1      1      1      1      1      1      1    127 
Node    0, zone   Normal, R 71      Movable      0      2      6      6      6      6      6      6      6      6    122 
Node    0, zone   Normal, R 72      Movable      1      1      1      1      1      1      1      1      1      1    127 
Node    0, zone   Normal, R 73      Movable      0      0      0      0      0      0      0      0      0      0      0 
Node    0, zone   Normal, R 74      Movable      2      2      1      2      2      2      2      2      2      2    126 
Node    0, zone   Normal, R 75      Movable      0      0      0      0      0      0      0      0      0      0      0 
Node    0, zone   Normal, R 76      Movable      0      0      0      0      0      0      0      0      0      0      0 
Node    0, zone   Normal, R 77      Movable      3      3      4      4      4      4      4      4      4      4    124 
Node    0, zone   Normal, R 78      Movable      1      1      0      1      1      1      1      1      1      1    127 
Node    0, zone   Normal, R 79      Movable      0      0      0      0      0      0      0      0      0      0      0 
Node    0, zone   Normal, R 80      Movable      1      1      1      1      1      1      1      1      1      1    127 
Node    0, zone   Normal, R 81      Movable     16     16     16     16     16     16     16     16     16     16    112 
Node    0, zone   Normal, R 82      Movable      0      0      0      0      0      0      0      0      0      0      0 
Node    0, zone   Normal, R 83      Movable      0      0      0      0      0      0      0      0      0      0      0 
Node    0, zone   Normal, R 84      Movable      0      0      0      0      0      0      0      0      0      0      0 
Node    0, zone   Normal, R 85      Movable      0      0      0      0      0      0      0      0      0      0      0 
Node    0, zone   Normal, R 86      Movable      2      2      0      2      2      2      2      2      2      0    127 
Node    0, zone   Normal, R 87      Movable      0      0      0      0      0      0      0      0      0      0      0 
Node    0, zone   Normal, R 88      Movable      1      1      2      0      1      1      1      1      1      1    127 
Node    0, zone   Normal, R 89      Movable      1      0      0      2      0      1      1      1      1      1    127 
Node    0, zone   Normal, R 90      Movable      0      0      0      0      0      0      0      0      0      0      0 
Node    0, zone   Normal, R 91      Movable      2      6      5      4      5      6      6      4      5      5    123 
Node    0, zone   Normal, R 92      Movable      0      0      0      0      0      0      0      0      0      0      0 
Node    0, zone   Normal, R 93      Movable      1      1      0      1      1      1      1      1      1      1    127 
Node    0, zone   Normal, R 94      Movable      1      2      1      1      2      2      2      0      1      1    127 
Node    0, zone   Normal, R 95      Movable      0      0      0      0      0      0      0      0      0      0      0 
Node    0, zone   Normal, R 96      Movable      0      0      0      0      0      0      0      0      0      0      0 
Node    0, zone   Normal, R 97      Movable      0      0      0      0      0      0      0      0      0      0      0 
Node    0, zone   Normal, R 98      Movable      0      0      0      0      0      0      0      0      0      0      0 
Node    0, zone   Normal, R 99      Movable      5      4      4      4      4      5      5      5      5      5    123 
Node    0, zone   Normal, R100      Movable      0      0      0      0      0      0      0      0      0      0      0 
Node    0, zone   Normal, R101      Movable      0      0      0      0      0      0      0      0      0      0      0 
Node    0, zone   Normal, R102      Movable      1      1      1      1      1      1      1      1      1      1    127 
Node    0, zone   Normal, R103      Movable      0      0      0      0      0      0      0      0      0      0      0 
Node    0, zone   Normal, R104      Movable      0      0      0      0      0      0      0      0      0      0      0 
Node    0, zone   Normal, R105      Movable      0      0      0      0      0      0      0      0      0      0      0 
Node    0, zone   Normal, R106      Movable      0      0      0      0      0      0      0      0      0      0      0 
Node    0, zone   Normal, R107      Movable      0      0      0      0      0      0      0      0      0      0      0 
Node    0, zone   Normal, R108      Movable      0      0      0      0      0      0      0      0      0      0      0 
Node    0, zone   Normal, R109      Movable      0      0      0      0      0      0      0      0      0      0      0 
Node    0, zone   Normal, R110      Movable      0      0      0      0      0      0      0      0      0      0      0 
Node    0, zone   Normal, R111      Movable      2      1      1      1      2      2      2      2      2      0    127 
Node    0, zone   Normal, R112      Movable      0      0      0      0      0      0      0      0      0      0      0 
Node    0, zone   Normal, R113      Movable      0      0      0      0      0      0      0      0      0      0    128 
Node    0, zone   Normal, R114      Movable      0      0      0      0      0      0      0      0      0      0      0 
Node    0, zone   Normal, R115      Movable      0      0      0      0      0      0      0      0      0      0      0 
Node    0, zone   Normal, R116      Movable      0      0      0      0      0      0      0      0      0      0      0 
Node    0, zone   Normal, R117      Movable      0      0      0      0      0      0      0      0      0      0      0 
Node    0, zone   Normal, R118      Movable      0      0      0      0      0      0      0      0      0      0      0 
Node    0, zone   Normal, R119      Movable      0      0      0      0      0      0      0      0      0      0      0 
Node    0, zone   Normal, R120      Movable      0      0      0      0      0      0      0      0      0      0      0 
Node    0, zone   Normal, R121      Movable      0      0      0      0      0      0      0      0      0      0      0 
Node    0, zone   Normal, R122      Movable    351    298    271    239    212    203    182    127     94     60     31 
Node    0, zone   Normal, R123      Movable      0      0      0      0      0      0      0      0      0      0      0 
Node    0, zone   Normal, R124      Movable      1      0      1      0      1      1      0      0      0      0      0 

Number of blocks type         Unmovable  Reclaimable      Movable      Reserve      Isolate 
Node 0, zone      DMA                 1            0            2            1            0 
Node 0, zone    DMA32               128            0          379            1            0 
Node 0, zone   Normal               384          128        15359            1            0 

Node 0, zone   Normal R  0            0            0            0            1            0 
Node 0, zone   Normal R  1          127            0            0            1            0 
Node 0, zone   Normal R  2            1          127            0            0            0 
Node 0, zone   Normal R  3            0            1          127            0            0 
Node 0, zone   Normal R  4          127            0            1            0            0 
Node 0, zone   Normal R  5          128            0            0            0            0 
Node 0, zone   Normal R  6            1            0          127            0            0 
Node 0, zone   Normal R  7            0            0          128            0            0 
Node 0, zone   Normal R  8            0            0          128            0            0 
Node 0, zone   Normal R  9            0            0          128            0            0 
Node 0, zone   Normal R 10            0            0          128            0            0 
Node 0, zone   Normal R 11            0            0          128            0            0 
Node 0, zone   Normal R 12            0            0          128            0            0 
Node 0, zone   Normal R 13            0            0          128            0            0 
Node 0, zone   Normal R 14            0            0          128            0            0 
Node 0, zone   Normal R 15            0            0          128            0            0 
Node 0, zone   Normal R 16            0            0          128            0            0 
Node 0, zone   Normal R 17            0            0          128            0            0 
Node 0, zone   Normal R 18            0            0          128            0            0 
Node 0, zone   Normal R 19            0            0          128            0            0 
Node 0, zone   Normal R 20            0            0          128            0            0 
Node 0, zone   Normal R 21            0            0          128            0            0 
Node 0, zone   Normal R 22            0            0          128            0            0 
Node 0, zone   Normal R 23            0            0          128            0            0 
Node 0, zone   Normal R 24            0            0          128            0            0 
Node 0, zone   Normal R 25            0            0          128            0            0 
Node 0, zone   Normal R 26            0            0          128            0            0 
Node 0, zone   Normal R 27            0            0          128            0            0 
Node 0, zone   Normal R 28            0            0          128            0            0 
Node 0, zone   Normal R 29            0            0          128            0            0 
Node 0, zone   Normal R 30            0            0          128            0            0 
Node 0, zone   Normal R 31            0            0          128            0            0 
Node 0, zone   Normal R 32            0            0          128            0            0 
Node 0, zone   Normal R 33            0            0          128            0            0 
Node 0, zone   Normal R 34            0            0          128            0            0 
Node 0, zone   Normal R 35            0            0          128            0            0 
Node 0, zone   Normal R 36            0            0          128            0            0 
Node 0, zone   Normal R 37            0            0          128            0            0 
Node 0, zone   Normal R 38            0            0          128            0            0 
Node 0, zone   Normal R 39            0            0          128            0            0 
Node 0, zone   Normal R 40            0            0          128            0            0 
Node 0, zone   Normal R 41            0            0          128            0            0 
Node 0, zone   Normal R 42            0            0          128            0            0 
Node 0, zone   Normal R 43            0            0          128            0            0 
Node 0, zone   Normal R 44            0            0          128            0            0 
Node 0, zone   Normal R 45            0            0          128            0            0 
Node 0, zone   Normal R 46            0            0          128            0            0 
Node 0, zone   Normal R 47            0            0          128            0            0 
Node 0, zone   Normal R 48            0            0          128            0            0 
Node 0, zone   Normal R 49            0            0          128            0            0 
Node 0, zone   Normal R 50            0            0          128            0            0 
Node 0, zone   Normal R 51            0            0          128            0            0 
Node 0, zone   Normal R 52            0            0          128            0            0 
Node 0, zone   Normal R 53            0            0          128            0            0 
Node 0, zone   Normal R 54            0            0          128            0            0 
Node 0, zone   Normal R 55            0            0          128            0            0 
Node 0, zone   Normal R 56            0            0          128            0            0 
Node 0, zone   Normal R 57            0            0          128            0            0 
Node 0, zone   Normal R 58            0            0          128            0            0 
Node 0, zone   Normal R 59            0            0          128            0            0 
Node 0, zone   Normal R 60            0            0          128            0            0 
Node 0, zone   Normal R 61            0            0          128            0            0 
Node 0, zone   Normal R 62            0            0          128            0            0 
Node 0, zone   Normal R 63            0            0          128            0            0 
Node 0, zone   Normal R 64            0            0          128            0            0 
Node 0, zone   Normal R 65            0            0          128            0            0 
Node 0, zone   Normal R 66            0            0          128            0            0 
Node 0, zone   Normal R 67            0            0          128            0            0 
Node 0, zone   Normal R 68            0            0          128            0            0 
Node 0, zone   Normal R 69            0            0          128            0            0 
Node 0, zone   Normal R 70            0            0          128            0            0 
Node 0, zone   Normal R 71            0            0          128            0            0 
Node 0, zone   Normal R 72            0            0          128            0            0 
Node 0, zone   Normal R 73            0            0          128            0            0 
Node 0, zone   Normal R 74            0            0          128            0            0 
Node 0, zone   Normal R 75            0            0          128            0            0 
Node 0, zone   Normal R 76            0            0          128            0            0 
Node 0, zone   Normal R 77            0            0          128            0            0 
Node 0, zone   Normal R 78            0            0          128            0            0 
Node 0, zone   Normal R 79            0            0          128            0            0 
Node 0, zone   Normal R 80            0            0          128            0            0 
Node 0, zone   Normal R 81            0            0          128            0            0 
Node 0, zone   Normal R 82            0            0          128            0            0 
Node 0, zone   Normal R 83            0            0          128            0            0 
Node 0, zone   Normal R 84            0            0          128            0            0 
Node 0, zone   Normal R 85            0            0          128            0            0 
Node 0, zone   Normal R 86            0            0          128            0            0 
Node 0, zone   Normal R 87            0            0          128            0            0 
Node 0, zone   Normal R 88            0            0          128            0            0 
Node 0, zone   Normal R 89            0            0          128            0            0 
Node 0, zone   Normal R 90            0            0          128            0            0 
Node 0, zone   Normal R 91            0            0          128            0            0 
Node 0, zone   Normal R 92            0            0          128            0            0 
Node 0, zone   Normal R 93            0            0          128            0            0 
Node 0, zone   Normal R 94            0            0          128            0            0 
Node 0, zone   Normal R 95            0            0          128            0            0 
Node 0, zone   Normal R 96            0            0          128            0            0 
Node 0, zone   Normal R 97            0            0          128            0            0 
Node 0, zone   Normal R 98            0            0          128            0            0 
Node 0, zone   Normal R 99            0            0          128            0            0 
Node 0, zone   Normal R100            0            0          128            0            0 
Node 0, zone   Normal R101            0            0          128            0            0 
Node 0, zone   Normal R102            0            0          128            0            0 
Node 0, zone   Normal R103            0            0          128            0            0 
Node 0, zone   Normal R104            0            0          128            0            0 
Node 0, zone   Normal R105            0            0          128            0            0 
Node 0, zone   Normal R106            0            0          128            0            0 
Node 0, zone   Normal R107            0            0          128            0            0 
Node 0, zone   Normal R108            0            0          128            0            0 
Node 0, zone   Normal R109            0            0          128            0            0 
Node 0, zone   Normal R110            0            0          128            0            0 
Node 0, zone   Normal R111            0            0          128            0            0 
Node 0, zone   Normal R112            0            0          128            0            0 
Node 0, zone   Normal R113            0            0          128            0            0 
Node 0, zone   Normal R114            0            0          128            0            0 
Node 0, zone   Normal R115            0            0          128            0            0 
Node 0, zone   Normal R116            0            0          128            0            0 
Node 0, zone   Normal R117            0            0          128            0            0 
Node 0, zone   Normal R118            0            0          128            0            0 
Node 0, zone   Normal R119            0            0          128            0            0 
Node 0, zone   Normal R120            0            0          128            0            0 
Node 0, zone   Normal R121            0            0          128            0            0 
Node 0, zone   Normal R122            0            0          128            0            0 
Node 0, zone   Normal R123            0            0          128            0            0 
Node 0, zone   Normal R124            0            0          128            0            0 


Performance impact:
------------------

Kernbench was run with and without the patchset. The results show an
elapsed-time overhead of around 7.7% with the patchset applied
(761.01s vs 706.76s elapsed).

Vanilla kernel:

Average Optimal load -j 32 Run (std deviation):
Elapsed Time 706.760000
User Time 4536.670000
System Time 1526.610000
Percent CPU 857.000000
Context Switches 2229643.000000
Sleeps 2211767.000000

With patchset:

Average Optimal load -j 32 Run (std deviation):
Elapsed Time 761.010000
User Time 4605.450000
System Time 1535.870000
Percent CPU 806.000000
Context Switches 2247690.000000
Sleeps 2213503.000000

This version (v3) of the patchset focused more on improving the consolidation
ratio and less on the performance impact. There is plenty of room for
performance optimization, and I'll work on that in future versions.

Regards,
Srivatsa S. Bhat


^ permalink raw reply	[flat|nested] 100+ messages in thread

* Re: [RESEND RFC PATCH v3 00/35] mm: Memory Power Management
  2013-08-30 13:13 ` Srivatsa S. Bhat
@ 2013-08-30 15:27   ` Dave Hansen
  -1 siblings, 0 replies; 100+ messages in thread
From: Dave Hansen @ 2013-08-30 15:27 UTC (permalink / raw)
  To: Srivatsa S. Bhat
  Cc: akpm, mgorman, hannes, tony.luck, matthew.garrett, riel, arjan,
	srinivas.pandruvada, willy, kamezawa.hiroyu, lenb, rjw,
	gargankita, paulmck, svaidy, andi, isimatu.yasuaki,
	santosh.shilimkar, kosaki.motohiro, linux-pm, linux-mm,
	linux-kernel

On 08/30/2013 06:13 AM, Srivatsa S. Bhat wrote:
> Overview of Memory Power Management and its implications to the Linux MM
> ========================================================================
> 
> Today, we are increasingly seeing computer systems sporting larger and larger
> amounts of RAM, in order to meet workload demands. However, memory consumes a
> significant amount of power, potentially upto more than a third of total system
> power on server systems[4]. So naturally, memory becomes the next big target
> for power management - on embedded systems and smartphones, and all the way
> upto large server systems.

Srivatsa, you're sending a huge patch set to a very long cc list of
people, but you're leading the description with text that most of us
have already read a bunch of times.  Why?

What changed in this patch from the last round?  Where would you like
reviewers to concentrate their time amongst the thousand lines of code?
 What barriers do _you_ see as remaining before this gets merged?


^ permalink raw reply	[flat|nested] 100+ messages in thread

* Re: [RESEND RFC PATCH v3 00/35] mm: Memory Power Management
  2013-08-30 15:27   ` Dave Hansen
@ 2013-08-30 17:50     ` Srivatsa S. Bhat
  -1 siblings, 0 replies; 100+ messages in thread
From: Srivatsa S. Bhat @ 2013-08-30 17:50 UTC (permalink / raw)
  To: Dave Hansen
  Cc: akpm, mgorman, hannes, tony.luck, matthew.garrett, riel, arjan,
	srinivas.pandruvada, willy, kamezawa.hiroyu, lenb, rjw,
	gargankita, paulmck, svaidy, andi, isimatu.yasuaki,
	santosh.shilimkar, kosaki.motohiro, linux-pm, linux-mm,
	linux-kernel

On 08/30/2013 08:57 PM, Dave Hansen wrote:
> On 08/30/2013 06:13 AM, Srivatsa S. Bhat wrote:
>> Overview of Memory Power Management and its implications to the Linux MM
>> ========================================================================
>>
>> Today, we are increasingly seeing computer systems sporting larger and larger
>> amounts of RAM, in order to meet workload demands. However, memory consumes a
>> significant amount of power, potentially upto more than a third of total system
>> power on server systems[4]. So naturally, memory becomes the next big target
>> for power management - on embedded systems and smartphones, and all the way
>> upto large server systems.
> 
> Srivatsa, you're sending a huge patch set to a very long cc list of
> people, but you're leading the description with text that most of us
> have already read a bunch of times.  Why?
> 

Well, I was under the impression that with each posting, a fresh set of
reviewers was looking at the patchset for the first time, so I retained the
leading description. But since you have been familiar with this patchset
right from the very first posting, I can see that it came across as repetitive.
Thanks for the tip; I'll trim the leading text in future versions and instead
give links to the earlier postings as a reference for new reviewers.

> What changed in this patch from the last round?

The fundamental change in this version is the split of the memory allocator
into a front-end (page allocator) and a back-end (region allocator). The
corresponding code is in patches 18 to 32. Patches 33-35 add policy changes
on top of that infrastructure that further improve the consolidation.
Overall, this design change has led to considerable improvements in the
consolidation ratio achieved by the patchset.
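
To make the intent of the split concrete, here is a toy, userspace-only
sketch of the idea. The names and sizes below are invented for illustration
and are not the interfaces from patches 18-32: the back-end hands out memory
region by region, preferring the lowest-numbered region, while the front-end
only deals in individual pages.

#include <stdio.h>
#include <stdbool.h>

#define NR_REGIONS    8   /* power-manageable memory regions (toy value) */
#define PAGES_PER_REG 4   /* pages per region (toy value)                */

static int free_pages_in[NR_REGIONS];          /* back-end state */

/*
 * Back-end ("region allocator"): always serve from the lowest-numbered
 * region that still has free pages, so the higher regions stay untouched
 * and can remain in low-power states.
 */
static int region_alloc_page(void)
{
	int r;

	for (r = 0; r < NR_REGIONS; r++) {
		if (free_pages_in[r] > 0) {
			free_pages_in[r]--;
			return r;
		}
	}
	return -1;                             /* out of memory */
}

/*
 * Front-end ("page allocator"): services individual page requests and
 * delegates the choice of region entirely to the back-end.
 */
static bool alloc_one_page(void)
{
	int r = region_alloc_page();

	if (r < 0)
		return false;
	printf("allocated a page from region %d\n", r);
	return true;
}

int main(void)
{
	int i, r;

	for (r = 0; r < NR_REGIONS; r++)
		free_pages_in[r] = PAGES_PER_REG;

	for (i = 0; i < 6; i++)
		alloc_one_page();              /* all six land in region 0, then 1 */

	return 0;
}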

Minor changes include augmenting /proc/pagetypeinfo to print the statistics
on a per-region basis, which turns out to be very useful in visualizing the
fragmentation.

Also, in this version the experimental results section (which I posted as a
reply to the cover letter) contains much more substantial numbers. The
previous postings didn't have enough data to demonstrate that the patchset
was actually much better than mainline; this version addresses that, at
least from a functional point of view.

>  Where would you like
> reviewers to concentrate their time amongst the thousand lines of code?

I would be grateful if reviewers could comment on the new split-allocator
design and let me know if they notice any blatant design issues. Some of
the changes are very bold IMHO, so I'd really appreciate it if reviewers
could tell me whether I'm going totally off-track, or whether the numbers
justify the big design changes sufficiently (at least enough to know
whether to continue in that direction or not).

>  What barriers do _you_ see as remaining before this gets merged?
> 

I believe I have showcased all the major design changes that I had in mind,
across this version and the previous versions. (This version includes all of
them except the targeted compaction support introduced in the last version,
which has been dropped temporarily.) What remains is the routine work:
making this code work with the various MM config options, and reducing the
overhead in the hot paths.

So, if the design changes are agreed upon, I can go ahead, address the
remaining rough edges and make it merge-ready. I assume it would be good
to add a config option and keep it under Kernel Hacking or similar, so that
people who know their platform's characteristics can try it out by passing
the region boundaries on the kernel command line, etc. I think that would
be a good way to upstream this feature, since it gives people the
flexibility to try it out with various use cases on different platforms.
(Also, that way, we need not wait for firmware support such as ACPI 5.0 to
become available in order to merge this code.)
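
As an illustration of the command-line idea, a hypothetical boot parameter
could look something like the sketch below. The parameter name
mem_region_size= and the variable are invented for this example and are not
part of the patchset:

/*
 * Hypothetical sketch only -- not part of this patchset.  It shows how a
 * user-supplied region size (e.g. "mem_region_size=512M" on the kernel
 * command line) could reach the MM code on platforms without firmware
 * support such as ACPI 5.0.
 */
static unsigned long long mem_region_size_override;

static int __init setup_mem_region_size(char *str)
{
	mem_region_size_override = memparse(str, &str);
	return 0;
}
early_param("mem_region_size", setup_mem_region_size);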

Regards,
Srivatsa S. Bhat


^ permalink raw reply	[flat|nested] 100+ messages in thread

* Re: [RFC PATCH v3 04/35] mm: Initialize node memory regions during boot
  2013-08-30 13:15   ` Srivatsa S. Bhat
@ 2013-09-02  6:20     ` Yasuaki Ishimatsu
  -1 siblings, 0 replies; 100+ messages in thread
From: Yasuaki Ishimatsu @ 2013-09-02  6:20 UTC (permalink / raw)
  To: Srivatsa S. Bhat
  Cc: akpm, mgorman, hannes, tony.luck, matthew.garrett, dave, riel,
	arjan, srinivas.pandruvada, willy, kamezawa.hiroyu, lenb, rjw,
	gargankita, paulmck, svaidy, andi, santosh.shilimkar,
	kosaki.motohiro, linux-pm, linux-mm, linux-kernel

(2013/08/30 22:15), Srivatsa S. Bhat wrote:
> Initialize the node's memory-regions structures with the information about
> the region-boundaries, at boot time.
>
> Based-on-patch-by: Ankita Garg <gargankita@gmail.com>
> Signed-off-by: Srivatsa S. Bhat <srivatsa.bhat@linux.vnet.ibm.com>
> ---
>
>   include/linux/mm.h |    4 ++++
>   mm/page_alloc.c    |   28 ++++++++++++++++++++++++++++
>   2 files changed, 32 insertions(+)
>
> diff --git a/include/linux/mm.h b/include/linux/mm.h
> index f022460..18fdec4 100644
> --- a/include/linux/mm.h
> +++ b/include/linux/mm.h
> @@ -627,6 +627,10 @@ static inline pte_t maybe_mkwrite(pte_t pte, struct vm_area_struct *vma)
>   #define LAST_NID_MASK		((1UL << LAST_NID_WIDTH) - 1)
>   #define ZONEID_MASK		((1UL << ZONEID_SHIFT) - 1)
>
> +/* Hard-code memory region size to be 512 MB for now. */
> +#define MEM_REGION_SHIFT	(29 - PAGE_SHIFT)
> +#define MEM_REGION_SIZE		(1UL << MEM_REGION_SHIFT)
> +
>   static inline enum zone_type page_zonenum(const struct page *page)
>   {
>   	return (page->flags >> ZONES_PGSHIFT) & ZONES_MASK;
> diff --git a/mm/page_alloc.c b/mm/page_alloc.c
> index b86d7e3..bb2d5d4 100644
> --- a/mm/page_alloc.c
> +++ b/mm/page_alloc.c
> @@ -4809,6 +4809,33 @@ static void __init_refok alloc_node_mem_map(struct pglist_data *pgdat)
>   #endif /* CONFIG_FLAT_NODE_MEM_MAP */
>   }
>
> +static void __meminit init_node_memory_regions(struct pglist_data *pgdat)
> +{
> +	int nid = pgdat->node_id;
> +	unsigned long start_pfn = pgdat->node_start_pfn;
> +	unsigned long end_pfn = start_pfn + pgdat->node_spanned_pages;
> +	struct node_mem_region *region;
> +	unsigned long i, absent;
> +	int idx;
> +
> +	for (i = start_pfn, idx = 0; i < end_pfn;
> +				i += region->spanned_pages, idx++) {
> +

> +		region = &pgdat->node_regions[idx];

It seems that an overflow can easily occur here:
node_regions[] has 256 entries and MEM_REGION_SIZE is 512 MiB, so if
the pgdat spans more than 128 GiB, the array index will overflow. Am I wrong?
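
For reference, the bound behind this concern works out as sketched below
(back-of-the-envelope only, assuming node_regions[] really is a fixed
256-entry array in struct pglist_data, as the count above suggests):

/*
 * Back-of-the-envelope arithmetic behind the overflow concern
 * (illustrative only; the identifiers mirror patch 04):
 *
 *   node_regions[] entries : 256
 *   region size            : 512 MiB  (MEM_REGION_SIZE)
 *   covered span per node  : 256 * 512 MiB = 128 GiB
 *
 * A node spanning more than 128 GiB makes 'idx' walk past the end of
 * node_regions[] in init_node_memory_regions().  A defensive check inside
 * that loop could look like:
 */
	if (WARN_ON(idx >= ARRAY_SIZE(pgdat->node_regions)))
		break;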

Thanks,
Yasuaki Ishimatsu

> +		region->pgdat = pgdat;
> +		region->start_pfn = i;
> +		region->spanned_pages = min(MEM_REGION_SIZE, end_pfn - i);
> +		region->end_pfn = region->start_pfn + region->spanned_pages;
> +
> +		absent = __absent_pages_in_range(nid, region->start_pfn,
> +						 region->end_pfn);
> +
> +		region->present_pages = region->spanned_pages - absent;
> +	}
> +
> +	pgdat->nr_node_regions = idx;
> +}
> +
>   void __paginginit free_area_init_node(int nid, unsigned long *zones_size,
>   		unsigned long node_start_pfn, unsigned long *zholes_size)
>   {
> @@ -4837,6 +4864,7 @@ void __paginginit free_area_init_node(int nid, unsigned long *zones_size,
>
>   	free_area_init_core(pgdat, start_pfn, end_pfn,
>   			    zones_size, zholes_size);
> +	init_node_memory_regions(pgdat);
>   }
>
>   #ifdef CONFIG_HAVE_MEMBLOCK_NODE_MAP
>



^ permalink raw reply	[flat|nested] 100+ messages in thread

* Re: [RFC PATCH v3 04/35] mm: Initialize node memory regions during boot
  2013-09-02  6:20     ` Yasuaki Ishimatsu
@ 2013-09-02 17:43       ` Srivatsa S. Bhat
  -1 siblings, 0 replies; 100+ messages in thread
From: Srivatsa S. Bhat @ 2013-09-02 17:43 UTC (permalink / raw)
  To: Yasuaki Ishimatsu
  Cc: akpm, mgorman, hannes, tony.luck, matthew.garrett, dave, riel,
	arjan, srinivas.pandruvada, willy, kamezawa.hiroyu, lenb, rjw,
	gargankita, paulmck, svaidy, andi, santosh.shilimkar,
	kosaki.motohiro, linux-pm, linux-mm, linux-kernel

On 09/02/2013 11:50 AM, Yasuaki Ishimatsu wrote:
> (2013/08/30 22:15), Srivatsa S. Bhat wrote:
>> Initialize the node's memory-regions structures with the information
>> about
>> the region-boundaries, at boot time.
>>
>> Based-on-patch-by: Ankita Garg <gargankita@gmail.com>
>> Signed-off-by: Srivatsa S. Bhat <srivatsa.bhat@linux.vnet.ibm.com>
>> ---
>>
>>   include/linux/mm.h |    4 ++++
>>   mm/page_alloc.c    |   28 ++++++++++++++++++++++++++++
>>   2 files changed, 32 insertions(+)
>>
>> diff --git a/include/linux/mm.h b/include/linux/mm.h
>> index f022460..18fdec4 100644
>> --- a/include/linux/mm.h
>> +++ b/include/linux/mm.h
>> @@ -627,6 +627,10 @@ static inline pte_t maybe_mkwrite(pte_t pte,
>> struct vm_area_struct *vma)
>>   #define LAST_NID_MASK        ((1UL << LAST_NID_WIDTH) - 1)
>>   #define ZONEID_MASK        ((1UL << ZONEID_SHIFT) - 1)
>>
>> +/* Hard-code memory region size to be 512 MB for now. */
>> +#define MEM_REGION_SHIFT    (29 - PAGE_SHIFT)
>> +#define MEM_REGION_SIZE        (1UL << MEM_REGION_SHIFT)
>> +
>>   static inline enum zone_type page_zonenum(const struct page *page)
>>   {
>>       return (page->flags >> ZONES_PGSHIFT) & ZONES_MASK;
>> diff --git a/mm/page_alloc.c b/mm/page_alloc.c
>> index b86d7e3..bb2d5d4 100644
>> --- a/mm/page_alloc.c
>> +++ b/mm/page_alloc.c
>> @@ -4809,6 +4809,33 @@ static void __init_refok
>> alloc_node_mem_map(struct pglist_data *pgdat)
>>   #endif /* CONFIG_FLAT_NODE_MEM_MAP */
>>   }
>>
>> +static void __meminit init_node_memory_regions(struct pglist_data
>> *pgdat)
>> +{
>> +    int nid = pgdat->node_id;
>> +    unsigned long start_pfn = pgdat->node_start_pfn;
>> +    unsigned long end_pfn = start_pfn + pgdat->node_spanned_pages;
>> +    struct node_mem_region *region;
>> +    unsigned long i, absent;
>> +    int idx;
>> +
>> +    for (i = start_pfn, idx = 0; i < end_pfn;
>> +                i += region->spanned_pages, idx++) {
>> +
> 
>> +        region = &pgdat->node_regions[idx];
> 
> It seems that overflow easily occurs.
> node_regions[] has 256 entries and MEM_REGION_SIZE is 512MiB. So if
> the pgdat has more than 128 GiB, overflow will occur. Am I wrong?
>

No, you are right. It should be made dynamic to accommodate larger
memory. I just used that value as a placeholder, since my focus was to
demonstrate what algorithms and designs could be developed on top of
this infrastructure, to help shape memory allocations. But certainly
this needs to be modified to be flexible enough to work with any memory
size. Thank you for your review!
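
A minimal sketch of what "making it dynamic" could look like, assuming
pgdat->node_regions is turned into a pointer and sized at boot from the
node's span (the helper name is made up; this is not code from the
patchset):

/*
 * Illustrative sketch only (not from the patchset): size the per-node
 * region array from node_spanned_pages instead of using a fixed
 * 256-entry array.  Assumes pgdat->node_regions becomes a pointer and
 * that MEM_REGION_SIZE is in pages, as defined in patch 04.
 */
static void __init alloc_node_mem_regions(struct pglist_data *pgdat)
{
	int nr = DIV_ROUND_UP(pgdat->node_spanned_pages, MEM_REGION_SIZE);

	pgdat->node_regions = alloc_bootmem_node(pgdat,
				nr * sizeof(struct node_mem_region));
	pgdat->nr_node_regions = nr;
}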

Regards,
Srivatsa S. Bhat
 
> 
>> +        region->pgdat = pgdat;
>> +        region->start_pfn = i;
>> +        region->spanned_pages = min(MEM_REGION_SIZE, end_pfn - i);
>> +        region->end_pfn = region->start_pfn + region->spanned_pages;
>> +
>> +        absent = __absent_pages_in_range(nid, region->start_pfn,
>> +                         region->end_pfn);
>> +
>> +        region->present_pages = region->spanned_pages - absent;
>> +    }
>> +
>> +    pgdat->nr_node_regions = idx;
>> +}
>> +
>>   void __paginginit free_area_init_node(int nid, unsigned long
>> *zones_size,
>>           unsigned long node_start_pfn, unsigned long *zholes_size)
>>   {
>> @@ -4837,6 +4864,7 @@ void __paginginit free_area_init_node(int nid,
>> unsigned long *zones_size,
>>
>>       free_area_init_core(pgdat, start_pfn, end_pfn,
>>                   zones_size, zholes_size);
>> +    init_node_memory_regions(pgdat);
>>   }
>>
>>   #ifdef CONFIG_HAVE_MEMBLOCK_NODE_MAP
>>
> 
> 


^ permalink raw reply	[flat|nested] 100+ messages in thread

* Re: [RFC PATCH v3 04/35] mm: Initialize node memory regions during boot
  2013-09-02 17:43       ` Srivatsa S. Bhat
@ 2013-09-03  4:53         ` Yasuaki Ishimatsu
  -1 siblings, 0 replies; 100+ messages in thread
From: Yasuaki Ishimatsu @ 2013-09-03  4:53 UTC (permalink / raw)
  To: Srivatsa S. Bhat
  Cc: akpm, mgorman, hannes, tony.luck, matthew.garrett, dave, riel,
	arjan, srinivas.pandruvada, willy, kamezawa.hiroyu, lenb, rjw,
	gargankita, paulmck, svaidy, andi, santosh.shilimkar,
	kosaki.motohiro, linux-pm, linux-mm, linux-kernel

(2013/09/03 2:43), Srivatsa S. Bhat wrote:
> On 09/02/2013 11:50 AM, Yasuaki Ishimatsu wrote:
>> (2013/08/30 22:15), Srivatsa S. Bhat wrote:
>>> Initialize the node's memory-regions structures with the information
>>> about
>>> the region-boundaries, at boot time.
>>>
>>> Based-on-patch-by: Ankita Garg <gargankita@gmail.com>
>>> Signed-off-by: Srivatsa S. Bhat <srivatsa.bhat@linux.vnet.ibm.com>
>>> ---
>>>
>>>    include/linux/mm.h |    4 ++++
>>>    mm/page_alloc.c    |   28 ++++++++++++++++++++++++++++
>>>    2 files changed, 32 insertions(+)
>>>
>>> diff --git a/include/linux/mm.h b/include/linux/mm.h
>>> index f022460..18fdec4 100644
>>> --- a/include/linux/mm.h
>>> +++ b/include/linux/mm.h
>>> @@ -627,6 +627,10 @@ static inline pte_t maybe_mkwrite(pte_t pte,
>>> struct vm_area_struct *vma)
>>>    #define LAST_NID_MASK        ((1UL << LAST_NID_WIDTH) - 1)
>>>    #define ZONEID_MASK        ((1UL << ZONEID_SHIFT) - 1)
>>>
>>> +/* Hard-code memory region size to be 512 MB for now. */
>>> +#define MEM_REGION_SHIFT    (29 - PAGE_SHIFT)
>>> +#define MEM_REGION_SIZE        (1UL << MEM_REGION_SHIFT)
>>> +
>>>    static inline enum zone_type page_zonenum(const struct page *page)
>>>    {
>>>        return (page->flags >> ZONES_PGSHIFT) & ZONES_MASK;
>>> diff --git a/mm/page_alloc.c b/mm/page_alloc.c
>>> index b86d7e3..bb2d5d4 100644
>>> --- a/mm/page_alloc.c
>>> +++ b/mm/page_alloc.c
>>> @@ -4809,6 +4809,33 @@ static void __init_refok
>>> alloc_node_mem_map(struct pglist_data *pgdat)
>>>    #endif /* CONFIG_FLAT_NODE_MEM_MAP */
>>>    }
>>>
>>> +static void __meminit init_node_memory_regions(struct pglist_data
>>> *pgdat)
>>> +{
>>> +    int nid = pgdat->node_id;
>>> +    unsigned long start_pfn = pgdat->node_start_pfn;
>>> +    unsigned long end_pfn = start_pfn + pgdat->node_spanned_pages;
>>> +    struct node_mem_region *region;
>>> +    unsigned long i, absent;
>>> +    int idx;
>>> +
>>> +    for (i = start_pfn, idx = 0; i < end_pfn;
>>> +                i += region->spanned_pages, idx++) {
>>> +
>>
>>> +        region = &pgdat->node_regions[idx];
>>
>> It seems that overflow easily occurs.
>> node_regions[] has 256 entries and MEM_REGION_SIZE is 512MiB. So if
>> the pgdat has more than 128 GiB, overflow will occur. Am I wrong?
>>
>
> No, you are right. It should be made dynamic to accommodate larger
> memory. I just used that value as a placeholder, since my focus was to
> demonstrate what algorithms and designs could be developed on top of
> this infrastructure, to help shape memory allocations. But certainly
> this needs to be modified to be flexible enough to work with any memory
> size. Thank you for your review!

Thank you for your explanation. I understood it.

Thanks,
Yasuaki Ishimatsu

>
> Regards,
> Srivatsa S. Bhat
>
>>
>>> +        region->pgdat = pgdat;
>>> +        region->start_pfn = i;
>>> +        region->spanned_pages = min(MEM_REGION_SIZE, end_pfn - i);
>>> +        region->end_pfn = region->start_pfn + region->spanned_pages;
>>> +
>>> +        absent = __absent_pages_in_range(nid, region->start_pfn,
>>> +                         region->end_pfn);
>>> +
>>> +        region->present_pages = region->spanned_pages - absent;
>>> +    }
>>> +
>>> +    pgdat->nr_node_regions = idx;
>>> +}
>>> +
>>>    void __paginginit free_area_init_node(int nid, unsigned long
>>> *zones_size,
>>>            unsigned long node_start_pfn, unsigned long *zholes_size)
>>>    {
>>> @@ -4837,6 +4864,7 @@ void __paginginit free_area_init_node(int nid,
>>> unsigned long *zones_size,
>>>
>>>        free_area_init_core(pgdat, start_pfn, end_pfn,
>>>                    zones_size, zholes_size);
>>> +    init_node_memory_regions(pgdat);
>>>    }
>>>
>>>    #ifdef CONFIG_HAVE_MEMBLOCK_NODE_MAP
>>>
>>
>>
>



^ permalink raw reply	[flat|nested] 100+ messages in thread

* Re: [RFC PATCH v3 06/35] mm: Add helpers to retrieve node region and zone region for a given page
  2013-08-30 13:15   ` Srivatsa S. Bhat
  (?)
@ 2013-09-03  5:56     ` Yasuaki Ishimatsu
  -1 siblings, 0 replies; 100+ messages in thread
From: Yasuaki Ishimatsu @ 2013-09-03  5:56 UTC (permalink / raw)
  To: Srivatsa S. Bhat
  Cc: akpm, mgorman, hannes, tony.luck, matthew.garrett, dave, riel,
	arjan, srinivas.pandruvada, willy, kamezawa.hiroyu, lenb, rjw,
	gargankita, paulmck, svaidy, andi, santosh.shilimkar,
	kosaki.motohiro, linux-pm, linux-mm, linux-kernel

(2013/08/30 22:15), Srivatsa S. Bhat wrote:
> Given a page, we would like to have an efficient mechanism to find out
> the node memory region and the zone memory region to which it belongs.
>
> Since the node is assumed to be divided into equal-sized node memory
> regions, the node memory region can be obtained by simply right-shifting
> the page's pfn by 'MEM_REGION_SHIFT'.
>
> But finding the corresponding zone memory region's index in the zone is
> not that straight-forward. To have a O(1) algorithm to find it out, define a
> zone_region_idx[] array to store the zone memory region indices for every
> node memory region.
>
> To illustrate, consider the following example:
>
> 	|<----------------------Node---------------------->|
> 	 __________________________________________________
> 	|      Node mem reg 0 	 |      Node mem reg 1     |  (Absolute region
> 	|________________________|_________________________|   boundaries)
>
> 	 __________________________________________________
> 	|    ZONE_DMA   |	    ZONE_NORMAL		   |
> 	|               |                                  |
> 	|<--- ZMR 0 --->|<-ZMR0->|<-------- ZMR 1 -------->|
> 	|_______________|________|_________________________|
>
>
> In the above figure,
>
> Node mem region 0:
> ------------------
> This region corresponds to the first zone mem region in ZONE_DMA and also
> the first zone mem region in ZONE_NORMAL. Hence its index array would look
> like this:
>      node_regions[0].zone_region_idx[ZONE_DMA]     == 0
>      node_regions[0].zone_region_idx[ZONE_NORMAL]  == 0
>
>
> Node mem region 1:
> ------------------
> This region corresponds to the second zone mem region in ZONE_NORMAL. Hence
> its index array would look like this:
>      node_regions[1].zone_region_idx[ZONE_NORMAL]  == 1
>
>
> Using this index array, we can quickly obtain the zone memory region to
> which a given page belongs.
>
> Signed-off-by: Srivatsa S. Bhat <srivatsa.bhat@linux.vnet.ibm.com>
> ---
>
>   include/linux/mm.h     |   24 ++++++++++++++++++++++++
>   include/linux/mmzone.h |    7 +++++++
>   mm/page_alloc.c        |    1 +
>   3 files changed, 32 insertions(+)
>
> diff --git a/include/linux/mm.h b/include/linux/mm.h
> index 18fdec4..52329d1 100644
> --- a/include/linux/mm.h
> +++ b/include/linux/mm.h
> @@ -723,6 +723,30 @@ static inline struct zone *page_zone(const struct page *page)
>   	return &NODE_DATA(page_to_nid(page))->node_zones[page_zonenum(page)];
>   }
>
> +static inline int page_node_region_id(const struct page *page,
> +				      const pg_data_t *pgdat)
> +{
> +	return (page_to_pfn(page) - pgdat->node_start_pfn) >> MEM_REGION_SHIFT;
> +}
> +
> +/**
> + * Return the index of the zone memory region to which the page belongs.
> + *
> + * Given a page, find the absolute (node) memory region as well as the zone to
> + * which it belongs. Then find the region within the zone that corresponds to
> + * that node memory region, and return its index.
> + */
> +static inline int page_zone_region_id(const struct page *page)
> +{
> +	pg_data_t *pgdat = NODE_DATA(page_to_nid(page));
> +	enum zone_type z_num = page_zonenum(page);
> +	unsigned long node_region_idx;
> +
> +	node_region_idx = page_node_region_id(page, pgdat);
> +
> +	return pgdat->node_regions[node_region_idx].zone_region_idx[z_num];
> +}
> +
>   #ifdef SECTION_IN_PAGE_FLAGS
>   static inline void set_page_section(struct page *page, unsigned long section)
>   {
> diff --git a/include/linux/mmzone.h b/include/linux/mmzone.h
> index 010ab5b..76d9ed2 100644
> --- a/include/linux/mmzone.h
> +++ b/include/linux/mmzone.h
> @@ -726,6 +726,13 @@ struct node_mem_region {
>   	unsigned long end_pfn;
>   	unsigned long present_pages;
>   	unsigned long spanned_pages;

> +
> +	/*
> +	 * A physical (node) region could be split across multiple zones.
> +	 * Store the indices of the corresponding regions of each such
> +	 * zone for this physical (node) region.
> +	 */
> +	int zone_region_idx[MAX_NR_ZONES];

You should initialize zone_region_idx[] to a negative value.
If zone_region_idx[] is left initialized to 0, every node region appears to
map to zone region 0 in every zone.

Thanks,
Yasuaki Ishimatsu
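
For illustration, a minimal sketch of such an initialization (the helper name
and the nr_node_regions field below are assumptions made purely for this
sketch, not something taken from the posted patches):

	static void __meminit init_node_region_indices(struct pglist_data *pgdat)
	{
		int i, z;

		/* -1 means "this node region has no counterpart in that zone (yet)". */
		for (i = 0; i < pgdat->nr_node_regions; i++)
			for (z = 0; z < MAX_NR_ZONES; z++)
				pgdat->node_regions[i].zone_region_idx[z] = -1;
	}

With that in place, init_zone_memory_regions() only overwrites the entries for
zones that actually overlap a given node region, and the remaining entries
stay clearly marked as invalid.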


>   	struct pglist_data *pgdat;
>   };
>
> diff --git a/mm/page_alloc.c b/mm/page_alloc.c
> index 05cedbb..8ffd47b 100644
> --- a/mm/page_alloc.c
> +++ b/mm/page_alloc.c
> @@ -4877,6 +4877,7 @@ static void __meminit init_zone_memory_regions(struct pglist_data *pgdat)
>   			zone_region->present_pages =
>   					zone_region->spanned_pages - absent;
>
> +			node_region->zone_region_idx[zone_idx(z)] = idx;
>   			idx++;
>   		}
>
>



^ permalink raw reply	[flat|nested] 100+ messages in thread

* Re: [RFC PATCH v3 09/35] mm: Track the freepage migratetype of pages accurately
  2013-08-30 13:16   ` Srivatsa S. Bhat
@ 2013-09-03  6:38     ` Yasuaki Ishimatsu
  -1 siblings, 0 replies; 100+ messages in thread
From: Yasuaki Ishimatsu @ 2013-09-03  6:38 UTC (permalink / raw)
  To: Srivatsa S. Bhat
  Cc: akpm, mgorman, hannes, tony.luck, matthew.garrett, dave, riel,
	arjan, srinivas.pandruvada, willy, kamezawa.hiroyu, lenb, rjw,
	gargankita, paulmck, svaidy, andi, santosh.shilimkar,
	kosaki.motohiro, linux-pm, linux-mm, linux-kernel

(2013/08/30 22:16), Srivatsa S. Bhat wrote:
> Due to the region-wise ordering of the pages in the buddy allocator's
> free lists, whenever we want to delete a free pageblock from a free list
> (for ex: when moving blocks of pages from one list to the other), we need
> to be able to tell the buddy allocator exactly which migratetype it belongs
> to. For that purpose, we can use the page's freepage migratetype (which is
> maintained in the page's ->index field).
>
> So, while splitting up higher order pages into smaller ones as part of buddy
> operations, keep the new head pages updated with the correct freepage
> migratetype information (because we depend on tracking this info accurately,
> as outlined above).
>
> Signed-off-by: Srivatsa S. Bhat <srivatsa.bhat@linux.vnet.ibm.com>
> ---
>
>   mm/page_alloc.c |    7 +++++++
>   1 file changed, 7 insertions(+)
>
> diff --git a/mm/page_alloc.c b/mm/page_alloc.c
> index 398b62c..b4b1275 100644
> --- a/mm/page_alloc.c
> +++ b/mm/page_alloc.c
> @@ -947,6 +947,13 @@ static inline void expand(struct zone *zone, struct page *page,
>   		add_to_freelist(&page[size], &area->free_list[migratetype]);
>   		area->nr_free++;
>   		set_page_order(&page[size], high);
> +
> +		/*
> +		 * Freepage migratetype is tracked using the index field of the
> +		 * first page of the block. So we need to update the new first
> +		 * page, when changing the page order.
> +		 */
> +		set_freepage_migratetype(&page[size], migratetype);
>   	}
>   }
>
>

Is this patch a bug-fix patch?
If so, I want you to split it out from the patch-set.

Thanks,
Yasuaki Ishimatsu



^ permalink raw reply	[flat|nested] 100+ messages in thread

* Re: [RFC PATCH v3 06/35] mm: Add helpers to retrieve node region and zone region for a given page
  2013-09-03  5:56     ` Yasuaki Ishimatsu
@ 2013-09-03  8:34       ` Srivatsa S. Bhat
  -1 siblings, 0 replies; 100+ messages in thread
From: Srivatsa S. Bhat @ 2013-09-03  8:34 UTC (permalink / raw)
  To: Yasuaki Ishimatsu
  Cc: akpm, mgorman, hannes, tony.luck, matthew.garrett, dave, riel,
	arjan, srinivas.pandruvada, willy, kamezawa.hiroyu, lenb, rjw,
	gargankita, paulmck, svaidy, andi, santosh.shilimkar,
	kosaki.motohiro, linux-pm, linux-mm, linux-kernel

On 09/03/2013 11:26 AM, Yasuaki Ishimatsu wrote:
> (2013/08/30 22:15), Srivatsa S. Bhat wrote:
>> Given a page, we would like to have an efficient mechanism to find out
>> the node memory region and the zone memory region to which it belongs.
>>
>> Since the node is assumed to be divided into equal-sized node memory
>> regions, the node memory region can be obtained by simply right-shifting
>> the page's pfn by 'MEM_REGION_SHIFT'.
>>
>> But finding the corresponding zone memory region's index in the zone is
>> not that straightforward. To have an O(1) algorithm to find it out,
>> define a
>> zone_region_idx[] array to store the zone memory region indices for every
>> node memory region.
>>
>> To illustrate, consider the following example:
>>
>>     |<----------------------Node---------------------->|
>>      __________________________________________________
>>     |      Node mem reg 0    |      Node mem reg 1     |  (Absolute region
>>     |________________________|_________________________|   boundaries)
>>
>>      __________________________________________________
>>     |    ZONE_DMA   |            ZONE_NORMAL           |
>>     |               |                                  |
>>     |<--- ZMR 0 --->|<-ZMR0->|<-------- ZMR 1 -------->|
>>     |_______________|________|_________________________|
>>
>>
>> In the above figure,
>>
>> Node mem region 0:
>> ------------------
>> This region corresponds to the first zone mem region in ZONE_DMA and also
>> the first zone mem region in ZONE_NORMAL. Hence its index array would
>> look
>> like this:
>>      node_regions[0].zone_region_idx[ZONE_DMA]     == 0
>>      node_regions[0].zone_region_idx[ZONE_NORMAL]  == 0
>>
>>
>> Node mem region 1:
>> ------------------
>> This region corresponds to the second zone mem region in ZONE_NORMAL.
>> Hence
>> its index array would look like this:
>>      node_regions[1].zone_region_idx[ZONE_NORMAL]  == 1
>>
>>
>> Using this index array, we can quickly obtain the zone memory region to
>> which a given page belongs.
>>
>> Signed-off-by: Srivatsa S. Bhat <srivatsa.bhat@linux.vnet.ibm.com>
>> ---
>>
>>   include/linux/mm.h     |   24 ++++++++++++++++++++++++
>>   include/linux/mmzone.h |    7 +++++++
>>   mm/page_alloc.c        |    1 +
>>   3 files changed, 32 insertions(+)
>>
>> diff --git a/include/linux/mm.h b/include/linux/mm.h
>> index 18fdec4..52329d1 100644
>> --- a/include/linux/mm.h
>> +++ b/include/linux/mm.h
>> @@ -723,6 +723,30 @@ static inline struct zone *page_zone(const struct
>> page *page)
>>       return
>> &NODE_DATA(page_to_nid(page))->node_zones[page_zonenum(page)];
>>   }
>>
>> +static inline int page_node_region_id(const struct page *page,
>> +                      const pg_data_t *pgdat)
>> +{
>> +    return (page_to_pfn(page) - pgdat->node_start_pfn) >>
>> MEM_REGION_SHIFT;
>> +}
>> +
>> +/**
>> + * Return the index of the zone memory region to which the page belongs.
>> + *
>> + * Given a page, find the absolute (node) memory region as well as
>> the zone to
>> + * which it belongs. Then find the region within the zone that
>> corresponds to
>> + * that node memory region, and return its index.
>> + */
>> +static inline int page_zone_region_id(const struct page *page)
>> +{
>> +    pg_data_t *pgdat = NODE_DATA(page_to_nid(page));
>> +    enum zone_type z_num = page_zonenum(page);
>> +    unsigned long node_region_idx;
>> +
>> +    node_region_idx = page_node_region_id(page, pgdat);
>> +
>> +    return pgdat->node_regions[node_region_idx].zone_region_idx[z_num];
>> +}
>> +
>>   #ifdef SECTION_IN_PAGE_FLAGS
>>   static inline void set_page_section(struct page *page, unsigned long
>> section)
>>   {
>> diff --git a/include/linux/mmzone.h b/include/linux/mmzone.h
>> index 010ab5b..76d9ed2 100644
>> --- a/include/linux/mmzone.h
>> +++ b/include/linux/mmzone.h
>> @@ -726,6 +726,13 @@ struct node_mem_region {
>>       unsigned long end_pfn;
>>       unsigned long present_pages;
>>       unsigned long spanned_pages;
> 
>> +
>> +    /*
>> +     * A physical (node) region could be split across multiple zones.
>> +     * Store the indices of the corresponding regions of each such
>> +     * zone for this physical (node) region.
>> +     */
>> +    int zone_region_idx[MAX_NR_ZONES];
> 
> You should initialize zone_region_idx[] to a negative value.

Oh, I missed that.

> If zone_region_idx[] is initialized to 0, region 0 appears to belong to every zone.
> 

In fact, if it is initialized to zero, every node region will appear to
map to every zone's first zone-mem-region. But luckily, since we never index
the zone_region_idx[] array with an incorrect zone number, I didn't encounter
any wrong values in practice. Thanks for pointing it out; I'll fix it.

Regards,
Srivatsa S. Bhat
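
Once the entries are initialized to a negative value, a debug-only sanity
check also becomes possible. A sketch (the helper below is hypothetical and
not part of this series):

	static inline int page_zone_region_id_checked(const struct page *page)
	{
		int idx = page_zone_region_id(page);

		/*
		 * A negative index would mean the page's node region has no
		 * counterpart in the page's zone, which should never happen
		 * for a valid page.
		 */
		VM_BUG_ON(idx < 0);
		return idx;
	}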



^ permalink raw reply	[flat|nested] 100+ messages in thread

* Re: [RFC PATCH v3 09/35] mm: Track the freepage migratetype of pages accurately
  2013-09-03  6:38     ` Yasuaki Ishimatsu
@ 2013-09-03  8:45       ` Srivatsa S. Bhat
  -1 siblings, 0 replies; 100+ messages in thread
From: Srivatsa S. Bhat @ 2013-09-03  8:45 UTC (permalink / raw)
  To: Yasuaki Ishimatsu
  Cc: akpm, mgorman, hannes, tony.luck, matthew.garrett, dave, riel,
	arjan, srinivas.pandruvada, willy, kamezawa.hiroyu, lenb, rjw,
	gargankita, paulmck, svaidy, andi, santosh.shilimkar,
	kosaki.motohiro, linux-pm, linux-mm, linux-kernel

On 09/03/2013 12:08 PM, Yasuaki Ishimatsu wrote:
> (2013/08/30 22:16), Srivatsa S. Bhat wrote:
>> Due to the region-wise ordering of the pages in the buddy allocator's
>> free lists, whenever we want to delete a free pageblock from a free list
>> (for ex: when moving blocks of pages from one list to the other), we need
>> to be able to tell the buddy allocator exactly which migratetype it
>> belongs
>> to. For that purpose, we can use the page's freepage migratetype
>> (which is
>> maintained in the page's ->index field).
>>
>> So, while splitting up higher order pages into smaller ones as part of
>> buddy
>> operations, keep the new head pages updated with the correct freepage
>> migratetype information (because we depend on tracking this info
>> accurately,
>> as outlined above).
>>
>> Signed-off-by: Srivatsa S. Bhat <srivatsa.bhat@linux.vnet.ibm.com>
>> ---
>>
>>   mm/page_alloc.c |    7 +++++++
>>   1 file changed, 7 insertions(+)
>>
>> diff --git a/mm/page_alloc.c b/mm/page_alloc.c
>> index 398b62c..b4b1275 100644
>> --- a/mm/page_alloc.c
>> +++ b/mm/page_alloc.c
>> @@ -947,6 +947,13 @@ static inline void expand(struct zone *zone,
>> struct page *page,
>>           add_to_freelist(&page[size], &area->free_list[migratetype]);
>>           area->nr_free++;
>>           set_page_order(&page[size], high);
>> +
>> +        /*
>> +         * Freepage migratetype is tracked using the index field of the
>> +         * first page of the block. So we need to update the new first
>> +         * page, when changing the page order.
>> +         */
>> +        set_freepage_migratetype(&page[size], migratetype);
>>       }
>>   }
>>
>>
> 
> Is this patch a bug-fix patch?
> If so, I want you to split it out from the patch-set.
> 

No, it's not a bug fix. We need to take care of this only when using the
sorted-buddy design to maintain the freelists, which is introduced only in
this patchset. So mainline doesn't need this patch.

In mainline, we can delete a page from a buddy freelist by simply calling
list_del() with a pointer to page->lru. It doesn't matter which freelist
the page belongs to. However, in the sorted-buddy design introduced
in this patchset, we also need to know which particular freelist we are
deleting the page from, because apart from breaking the ->lru link in
the linked list, we also need to update certain other things, such as the
region->page_block pointer, which are part of that particular freelist.
Thus, it becomes essential to know which freelist we are deleting the page
from. And for that, we need this patch to maintain that information accurately,
even during buddy operations such as splitting buddy pages in expand().

Regards,
Srivatsa S. Bhat
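
To make the contrast concrete, a condensed sketch of the deletion path (the
wrapper function is hypothetical; the calls mirror the hunks quoted earlier
in this series):

	static void remove_free_page(struct zone *zone, struct page *page,
				     unsigned int order)
	{
		/* Mainline needs nothing more than: list_del(&page->lru); */
		int mt = get_freepage_migratetype(page);
		struct free_area *area = &zone->free_area[order];

		/*
		 * With region-aware (sorted-buddy) freelists, the exact
		 * freelist must be known, so that the per-region nr_free
		 * count and the region->page_block pointer can be updated
		 * along with the ->lru unlink.
		 */
		del_from_freelist(page, &area->free_list[mt]);
		area->nr_free--;
		rmv_page_order(page);
	}

This is why expand() has to keep the freepage migratetype of the new first
page up to date: get_freepage_migratetype() is what tells us which free_list
to hand to del_from_freelist() later.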


^ permalink raw reply	[flat|nested] 100+ messages in thread

* Re: [RFC PATCH v3 08/35] mm: Demarcate and maintain pageblocks in region-order in the zones' freelists
  2013-08-30 13:16   ` Srivatsa S. Bhat
  (?)
@ 2013-09-04  7:49     ` Yasuaki Ishimatsu
  -1 siblings, 0 replies; 100+ messages in thread
From: Yasuaki Ishimatsu @ 2013-09-04  7:49 UTC (permalink / raw)
  To: Srivatsa S. Bhat
  Cc: akpm, mgorman, hannes, tony.luck, matthew.garrett, dave, riel,
	arjan, srinivas.pandruvada, willy, kamezawa.hiroyu, lenb, rjw,
	gargankita, paulmck, svaidy, andi, santosh.shilimkar,
	kosaki.motohiro, linux-pm, linux-mm, linux-kernel

(2013/08/30 22:16), Srivatsa S. Bhat wrote:
> The zones' freelists need to be made region-aware, in order to influence
> page allocation and freeing algorithms. So in every free list in the zone, we
> would like to demarcate the pageblocks belonging to different memory regions
> (we can do this using a set of pointers, and thus avoid splitting up the
> freelists).
>
> Also, we would like to keep the pageblocks in the freelists sorted in
> region-order. That is, pageblocks belonging to region-0 would come first,
> followed by pageblocks belonging to region-1 and so on, within a given
> freelist. Of course, a set of pageblocks belonging to the same region need
> not be sorted; it is sufficient if we maintain the pageblocks in
> region-sorted-order, rather than a full address-sorted-order.
>
> For each freelist within the zone, we maintain a set of pointers to
> pageblocks belonging to the various memory regions in that zone.
>
> Eg:
>
>      |<---Region0--->|   |<---Region1--->|   |<-------Region2--------->|
>       ____      ____      ____      ____      ____      ____      ____
> --> |____|--> |____|--> |____|--> |____|--> |____|--> |____|--> |____|-->
>
>                   ^                  ^                              ^
>                   |                  |                              |
>                  Reg0               Reg1                          Reg2
>
>
> Page allocation will proceed as usual - pick the first item on the free list.
> But we don't want to keep updating these region pointers every time we allocate
> a pageblock from the freelist. So, instead of pointing to the *first* pageblock
> of that region, we maintain the region pointers such that they point to the
> *last* pageblock in that region, as shown in the figure above. That way, as
> long as there are > 1 pageblocks in that region in that freelist, that region
> pointer doesn't need to be updated.
>
>
> Page allocation algorithm:
> -------------------------
>
> The heart of the page allocation algorithm remains as it is - pick the first
> item on the appropriate freelist and return it.
>
>
> Arrangement of pageblocks in the zone freelists:
> -----------------------------------------------
>
> This is the main change - we keep the pageblocks in region-sorted order,
> where pageblocks belonging to region-0 come first, followed by those belonging
> to region-1 and so on. But the pageblocks within a given region need *not* be
> sorted, since we need them to be only region-sorted and not fully
> address-sorted.
>
> This sorting is performed when adding pages back to the freelists, thus
> avoiding any region-related overhead in the critical page allocation
> paths.
>
> Strategy to consolidate allocations to a minimum no. of regions:
> ---------------------------------------------------------------
>
> Page allocation happens in the order of increasing region number. We would
> like to do light-weight page reclaim or compaction (for the purpose of memory
> power management) in the reverse order, to keep the allocated pages within
> a minimum number of regions (approximately). The latter part is implemented
> in subsequent patches.
>
> ---------------------------- Increasing region number---------------------->
>
> Direction of allocation--->                <---Direction of reclaim/compaction
>
> Signed-off-by: Srivatsa S. Bhat <srivatsa.bhat@linux.vnet.ibm.com>
> ---
>
>   mm/page_alloc.c |  154 +++++++++++++++++++++++++++++++++++++++++++++++++------
>   1 file changed, 138 insertions(+), 16 deletions(-)
>
> diff --git a/mm/page_alloc.c b/mm/page_alloc.c
> index fd6436d0..398b62c 100644
> --- a/mm/page_alloc.c
> +++ b/mm/page_alloc.c
> @@ -514,6 +514,111 @@ static inline int page_is_buddy(struct page *page, struct page *buddy,
>   	return 0;
>   }
>
> +static void add_to_freelist(struct page *page, struct free_list *free_list)
> +{
> +	struct list_head *prev_region_list, *lru;
> +	struct mem_region_list *region;
> +	int region_id, i;
> +
> +	lru = &page->lru;
> +	region_id = page_zone_region_id(page);
> +
> +	region = &free_list->mr_list[region_id];
> +	region->nr_free++;
> +
> +	if (region->page_block) {
> +		list_add_tail(lru, region->page_block);
> +		return;
> +	}
> +
> +#ifdef CONFIG_DEBUG_PAGEALLOC
> +	WARN(region->nr_free != 1, "%s: nr_free is not unity\n", __func__);
> +#endif
> +
> +	if (!list_empty(&free_list->list)) {
> +		for (i = region_id - 1; i >= 0; i--) {
> +			if (free_list->mr_list[i].page_block) {
> +				prev_region_list =
> +					free_list->mr_list[i].page_block;
> +				goto out;
> +			}
> +		}
> +	}
> +
> +	/* This is the first region, so add to the head of the list */
> +	prev_region_list = &free_list->list;
> +
> +out:
> +	list_add(lru, prev_region_list);
> +
> +	/* Save pointer to page block of this region */
> +	region->page_block = lru;
> +}
> +
> +static void del_from_freelist(struct page *page, struct free_list *free_list)
> +{

> +	struct list_head *prev_page_lru, *lru, *p;
nitpick:

*p is used only when the CONFIG_DEBUG_PAGEALLOC option is enabled.
When the kernel is compiled with that option disabled, the following
warning is shown:

   CC      mm/page_alloc.o
mm/page_alloc.c: In function ‘del_from_freelist’:
mm/page_alloc.c:560: warning: unused variable ‘p’

Thanks,
Yasuaki Ishimatsu
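
One possible way to address this (a sketch of one option, not the author's
actual fix): declare 'p' only when the debug-only freelist walk that uses it
is compiled in.

	static void del_from_freelist(struct page *page, struct free_list *free_list)
	{
		struct list_head *prev_page_lru, *lru;
	#ifdef CONFIG_DEBUG_PAGEALLOC
		/* 'p' is only used by the freelist membership check below. */
		struct list_head *p;
	#endif
		struct mem_region_list *region;
		int region_id;

		/* ... rest of the function unchanged ... */
	}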

> +	struct mem_region_list *region;
> +	int region_id;
> +
> +	lru = &page->lru;
> +	region_id = page_zone_region_id(page);
> +	region = &free_list->mr_list[region_id];
> +	region->nr_free--;
> +
> +#ifdef CONFIG_DEBUG_PAGEALLOC
> +	WARN(region->nr_free < 0, "%s: nr_free is negative\n", __func__);
> +
> +	/* Verify whether this page indeed belongs to this free list! */
> +
> +	list_for_each(p, &free_list->list) {
> +		if (p == lru)
> +			goto page_found;
> +	}
> +
> +	WARN(1, "%s: page doesn't belong to the given freelist!\n", __func__);
> +
> +page_found:
> +#endif
> +
> +	/*
> +	 * If we are not deleting the last pageblock in this region (i.e.,
> +	 * farthest from list head, but not necessarily the last numerically),
> +	 * then we need not update the region->page_block pointer.
> +	 */
> +	if (lru != region->page_block) {
> +		list_del(lru);
> +#ifdef CONFIG_DEBUG_PAGEALLOC
> +		WARN(region->nr_free == 0, "%s: nr_free messed up\n", __func__);
> +#endif
> +		return;
> +	}
> +
> +	prev_page_lru = lru->prev;
> +	list_del(lru);
> +
> +	if (region->nr_free == 0) {
> +		region->page_block = NULL;
> +	} else {
> +		region->page_block = prev_page_lru;
> +#ifdef CONFIG_DEBUG_PAGEALLOC
> +		WARN(prev_page_lru == &free_list->list,
> +			"%s: region->page_block points to list head\n",
> +								__func__);
> +#endif
> +	}
> +}
> +
> +/**
> + * Move a given page from one freelist to another.
> + */
> +static void move_page_freelist(struct page *page, struct free_list *old_list,
> +			       struct free_list *new_list)
> +{
> +	del_from_freelist(page, old_list);
> +	add_to_freelist(page, new_list);
> +}
> +
>   /*
>    * Freeing function for a buddy system allocator.
>    *
> @@ -546,6 +651,7 @@ static inline void __free_one_page(struct page *page,
>   	unsigned long combined_idx;
>   	unsigned long uninitialized_var(buddy_idx);
>   	struct page *buddy;
> +	struct free_area *area;
>
>   	VM_BUG_ON(!zone_is_initialized(zone));
>
> @@ -575,8 +681,9 @@ static inline void __free_one_page(struct page *page,
>   			__mod_zone_freepage_state(zone, 1 << order,
>   						  migratetype);
>   		} else {
> -			list_del(&buddy->lru);
> -			zone->free_area[order].nr_free--;
> +			area = &zone->free_area[order];
> +			del_from_freelist(buddy, &area->free_list[migratetype]);
> +			area->nr_free--;
>   			rmv_page_order(buddy);
>   		}
>   		combined_idx = buddy_idx & page_idx;
> @@ -585,6 +692,7 @@ static inline void __free_one_page(struct page *page,
>   		order++;
>   	}
>   	set_page_order(page, order);
> +	area = &zone->free_area[order];
>
>   	/*
>   	 * If this is not the largest possible page, check if the buddy
> @@ -601,16 +709,22 @@ static inline void __free_one_page(struct page *page,
>   		buddy_idx = __find_buddy_index(combined_idx, order + 1);
>   		higher_buddy = higher_page + (buddy_idx - combined_idx);
>   		if (page_is_buddy(higher_page, higher_buddy, order + 1)) {
> -			list_add_tail(&page->lru,
> -				&zone->free_area[order].free_list[migratetype].list);
> +
> +			/*
> +			 * Implementing an add_to_freelist_tail() won't be
> +			 * very useful because both of them (almost) add to
> +			 * the tail within the region. So we could potentially
> +			 * switch off this entire "is next-higher buddy free?"
> +			 * logic when memory regions are used.
> +			 */
> +			add_to_freelist(page, &area->free_list[migratetype]);
>   			goto out;
>   		}
>   	}
>
> -	list_add(&page->lru,
> -		&zone->free_area[order].free_list[migratetype].list);
> +	add_to_freelist(page, &area->free_list[migratetype]);
>   out:
> -	zone->free_area[order].nr_free++;
> +	area->nr_free++;
>   }
>
>   static inline int free_pages_check(struct page *page)
> @@ -830,7 +944,7 @@ static inline void expand(struct zone *zone, struct page *page,
>   			continue;
>   		}
>   #endif
> -		list_add(&page[size].lru, &area->free_list[migratetype].list);
> +		add_to_freelist(&page[size], &area->free_list[migratetype]);
>   		area->nr_free++;
>   		set_page_order(&page[size], high);
>   	}
> @@ -897,7 +1011,7 @@ struct page *__rmqueue_smallest(struct zone *zone, unsigned int order,
>
>   		page = list_entry(area->free_list[migratetype].list.next,
>   							struct page, lru);
> -		list_del(&page->lru);
> +		del_from_freelist(page, &area->free_list[migratetype]);
>   		rmv_page_order(page);
>   		area->nr_free--;
>   		expand(zone, page, order, current_order, area, migratetype);
> @@ -938,7 +1052,8 @@ int move_freepages(struct zone *zone,
>   {
>   	struct page *page;
>   	unsigned long order;
> -	int pages_moved = 0;
> +	struct free_area *area;
> +	int pages_moved = 0, old_mt;
>
>   #ifndef CONFIG_HOLES_IN_ZONE
>   	/*
> @@ -966,8 +1081,10 @@ int move_freepages(struct zone *zone,
>   		}
>
>   		order = page_order(page);
> -		list_move(&page->lru,
> -			  &zone->free_area[order].free_list[migratetype].list);
> +		old_mt = get_freepage_migratetype(page);
> +		area = &zone->free_area[order];
> +		move_page_freelist(page, &area->free_list[old_mt],
> +				    &area->free_list[migratetype]);
>   		set_freepage_migratetype(page, migratetype);
>   		page += 1 << order;
>   		pages_moved += 1 << order;
> @@ -1061,7 +1178,7 @@ __rmqueue_fallback(struct zone *zone, int order, int start_migratetype)
>   	struct free_area * area;
>   	int current_order;
>   	struct page *page;
> -	int migratetype, new_type, i;
> +	int migratetype, new_type, i, mt;
>
>   	/* Find the largest possible block of pages in the other list */
>   	for (current_order = MAX_ORDER-1; current_order >= order;
> @@ -1086,7 +1203,8 @@ __rmqueue_fallback(struct zone *zone, int order, int start_migratetype)
>   							  migratetype);
>
>   			/* Remove the page from the freelists */
> -			list_del(&page->lru);
> +			mt = get_freepage_migratetype(page);
> +			del_from_freelist(page, &area->free_list[mt]);
>   			rmv_page_order(page);
>
>   			/*
> @@ -1446,7 +1564,8 @@ static int __isolate_free_page(struct page *page, unsigned int order)
>   	}
>
>   	/* Remove page from free list */
> -	list_del(&page->lru);
> +	mt = get_freepage_migratetype(page);
> +	del_from_freelist(page, &zone->free_area[order].free_list[mt]);
>   	zone->free_area[order].nr_free--;
>   	rmv_page_order(page);
>
> @@ -6353,6 +6472,8 @@ __offline_isolated_pages(unsigned long start_pfn, unsigned long end_pfn)
>   	int order, i;
>   	unsigned long pfn;
>   	unsigned long flags;
> +	int mt;
> +
>   	/* find the first valid pfn */
>   	for (pfn = start_pfn; pfn < end_pfn; pfn++)
>   		if (pfn_valid(pfn))
> @@ -6385,7 +6506,8 @@ __offline_isolated_pages(unsigned long start_pfn, unsigned long end_pfn)
>   		printk(KERN_INFO "remove from free list %lx %d %lx\n",
>   		       pfn, 1 << order, end_pfn);
>   #endif
> -		list_del(&page->lru);
> +		mt = get_freepage_migratetype(page);
> +		del_from_freelist(page, &zone->free_area[order].free_list[mt]);
>   		rmv_page_order(page);
>   		zone->free_area[order].nr_free--;
>   #ifdef CONFIG_HIGHMEM
>



^ permalink raw reply	[flat|nested] 100+ messages in thread

* Re: [RFC PATCH v3 08/35] mm: Demarcate and maintain pageblocks in region-order in the zones' freelists
@ 2013-09-04  7:49     ` Yasuaki Ishimatsu
  0 siblings, 0 replies; 100+ messages in thread
From: Yasuaki Ishimatsu @ 2013-09-04  7:49 UTC (permalink / raw)
  To: Srivatsa S. Bhat
  Cc: akpm, mgorman, hannes, tony.luck, matthew.garrett, dave, riel,
	arjan, srinivas.pandruvada, willy, kamezawa.hiroyu, lenb, rjw,
	gargankita, paulmck, svaidy, andi, santosh.shilimkar,
	kosaki.motohiro, linux-pm, linux-mm, linux-kernel

(2013/08/30 22:16), Srivatsa S. Bhat wrote:
> The zones' freelists need to be made region-aware, in order to influence
> page allocation and freeing algorithms. So in every free list in the zone, we
> would like to demarcate the pageblocks belonging to different memory regions
> (we can do this using a set of pointers, and thus avoid splitting up the
> freelists).
>
> Also, we would like to keep the pageblocks in the freelists sorted in
> region-order. That is, pageblocks belonging to region-0 would come first,
> followed by pageblocks belonging to region-1 and so on, within a given
> freelist. Of course, a set of pageblocks belonging to the same region need
> not be sorted; it is sufficient if we maintain the pageblocks in
> region-sorted-order, rather than a full address-sorted-order.
>
> For each freelist within the zone, we maintain a set of pointers to
> pageblocks belonging to the various memory regions in that zone.
>
> Eg:
>
>      |<---Region0--->|   |<---Region1--->|   |<-------Region2--------->|
>       ____      ____      ____      ____      ____      ____      ____
> --> |____|--> |____|--> |____|--> |____|--> |____|--> |____|--> |____|-->
>
>                   ^                  ^                              ^
>                   |                  |                              |
>                  Reg0               Reg1                          Reg2
>
>
> Page allocation will proceed as usual - pick the first item on the free list.
> But we don't want to keep updating these region pointers every time we allocate
> a pageblock from the freelist. So, instead of pointing to the *first* pageblock
> of that region, we maintain the region pointers such that they point to the
> *last* pageblock in that region, as shown in the figure above. That way, as
> long as there are > 1 pageblocks in that region in that freelist, that region
> pointer doesn't need to be updated.
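
For reference, the freelist bookkeeping this relies on (introduced by the
previous patch in this series) presumably looks something like the sketch
below; the struct layout and the array-size macro name are inferred from the
fields used in the diff (free_list->list, free_list->mr_list[],
region->page_block, region->nr_free), so treat it as illustrative rather
than exact:

---
struct mem_region_list {
	struct list_head	*page_block;	/* last pageblock of this region
						   on this freelist */
	long			nr_free;	/* free page blocks of this region
						   on this freelist */
};

struct free_list {
	struct list_head	list;		/* the freelist itself */

	/* one entry per memory region overlapping this zone */
	struct mem_region_list	mr_list[MAX_NR_ZONE_REGIONS];
};
---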
>
>
> Page allocation algorithm:
> -------------------------
>
> The heart of the page allocation algorithm remains as it is - pick the first
> item on the appropriate freelist and return it.
>
>
> Arrangement of pageblocks in the zone freelists:
> -----------------------------------------------
>
> This is the main change - we keep the pageblocks in region-sorted order,
> where pageblocks belonging to region-0 come first, followed by those belonging
> to region-1 and so on. But the pageblocks within a given region need *not* be
> sorted, since we need them to be only region-sorted and not fully
> address-sorted.
>
> This sorting is performed when adding pages back to the freelists, thus
> avoiding any region-related overhead in the critical page allocation
> paths.
>
> Strategy to consolidate allocations to a minimum no. of regions:
> ---------------------------------------------------------------
>
> Page allocation happens in the order of increasing region number. We would
> like to do light-weight page reclaim or compaction (for the purpose of memory
> power management) in the reverse order, to keep the allocated pages within
> a minimum number of regions (approximately). The latter part is implemented
> in subsequent patches.
>
> ---------------------------- Increasing region number---------------------->
>
> Direction of allocation--->                <---Direction of reclaim/compaction
>
> Signed-off-by: Srivatsa S. Bhat <srivatsa.bhat@linux.vnet.ibm.com>
> ---
>
>   mm/page_alloc.c |  154 +++++++++++++++++++++++++++++++++++++++++++++++++------
>   1 file changed, 138 insertions(+), 16 deletions(-)
>
> diff --git a/mm/page_alloc.c b/mm/page_alloc.c
> index fd6436d0..398b62c 100644
> --- a/mm/page_alloc.c
> +++ b/mm/page_alloc.c
> @@ -514,6 +514,111 @@ static inline int page_is_buddy(struct page *page, struct page *buddy,
>   	return 0;
>   }
>
> +static void add_to_freelist(struct page *page, struct free_list *free_list)
> +{
> +	struct list_head *prev_region_list, *lru;
> +	struct mem_region_list *region;
> +	int region_id, i;
> +
> +	lru = &page->lru;
> +	region_id = page_zone_region_id(page);
> +
> +	region = &free_list->mr_list[region_id];
> +	region->nr_free++;
> +
> +	if (region->page_block) {
> +		list_add_tail(lru, region->page_block);
> +		return;
> +	}
> +
> +#ifdef CONFIG_DEBUG_PAGEALLOC
> +	WARN(region->nr_free != 1, "%s: nr_free is not unity\n", __func__);
> +#endif
> +
> +	if (!list_empty(&free_list->list)) {
> +		for (i = region_id - 1; i >= 0; i--) {
> +			if (free_list->mr_list[i].page_block) {
> +				prev_region_list =
> +					free_list->mr_list[i].page_block;
> +				goto out;
> +			}
> +		}
> +	}
> +
> +	/* This is the first region, so add to the head of the list */
> +	prev_region_list = &free_list->list;
> +
> +out:
> +	list_add(lru, prev_region_list);
> +
> +	/* Save pointer to page block of this region */
> +	region->page_block = lru;
> +}
> +
> +static void del_from_freelist(struct page *page, struct free_list *free_list)
> +{

> +	struct list_head *prev_page_lru, *lru, *p;
nitpick

*p is used only when the CONFIG_DEBUG_PAGEALLOC option is enabled. When the
kernel is built with that option disabled, the following warning shows up:

   CC      mm/page_alloc.o
mm/page_alloc.c: In function 'del_from_freelist':
mm/page_alloc.c:560: warning: unused variable 'p'
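
One minimal way to silence it (just a sketch -- a different fix, e.g. marking
the variable __maybe_unused, may be preferable) would be to put the
declaration under the same #ifdef that uses it:

---
static void del_from_freelist(struct page *page, struct free_list *free_list)
{
	struct list_head *prev_page_lru, *lru;
#ifdef CONFIG_DEBUG_PAGEALLOC
	struct list_head *p;	/* only used by the sanity check below */
#endif
	struct mem_region_list *region;
	int region_id;
...
---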

Thanks,
Yasuaki Ishimatsu

> +	struct mem_region_list *region;
> +	int region_id;
> +
> +	lru = &page->lru;
> +	region_id = page_zone_region_id(page);
> +	region = &free_list->mr_list[region_id];
> +	region->nr_free--;
> +
> +#ifdef CONFIG_DEBUG_PAGEALLOC
> +	WARN(region->nr_free < 0, "%s: nr_free is negative\n", __func__);
> +
> +	/* Verify whether this page indeed belongs to this free list! */
> +
> +	list_for_each(p, &free_list->list) {
> +		if (p == lru)
> +			goto page_found;
> +	}
> +
> +	WARN(1, "%s: page doesn't belong to the given freelist!\n", __func__);
> +
> +page_found:
> +#endif
> +
> +	/*
> +	 * If we are not deleting the last pageblock in this region (i.e.,
> +	 * farthest from list head, but not necessarily the last numerically),
> +	 * then we need not update the region->page_block pointer.
> +	 */
> +	if (lru != region->page_block) {
> +		list_del(lru);
> +#ifdef CONFIG_DEBUG_PAGEALLOC
> +		WARN(region->nr_free == 0, "%s: nr_free messed up\n", __func__);
> +#endif
> +		return;
> +	}
> +
> +	prev_page_lru = lru->prev;
> +	list_del(lru);
> +
> +	if (region->nr_free == 0) {
> +		region->page_block = NULL;
> +	} else {
> +		region->page_block = prev_page_lru;
> +#ifdef CONFIG_DEBUG_PAGEALLOC
> +		WARN(prev_page_lru == &free_list->list,
> +			"%s: region->page_block points to list head\n",
> +								__func__);
> +#endif
> +	}
> +}
> +
> +/**
> + * Move a given page from one freelist to another.
> + */
> +static void move_page_freelist(struct page *page, struct free_list *old_list,
> +			       struct free_list *new_list)
> +{
> +	del_from_freelist(page, old_list);
> +	add_to_freelist(page, new_list);
> +}
> +
>   /*
>    * Freeing function for a buddy system allocator.
>    *
> @@ -546,6 +651,7 @@ static inline void __free_one_page(struct page *page,
>   	unsigned long combined_idx;
>   	unsigned long uninitialized_var(buddy_idx);
>   	struct page *buddy;
> +	struct free_area *area;
>
>   	VM_BUG_ON(!zone_is_initialized(zone));
>
> @@ -575,8 +681,9 @@ static inline void __free_one_page(struct page *page,
>   			__mod_zone_freepage_state(zone, 1 << order,
>   						  migratetype);
>   		} else {
> -			list_del(&buddy->lru);
> -			zone->free_area[order].nr_free--;
> +			area = &zone->free_area[order];
> +			del_from_freelist(buddy, &area->free_list[migratetype]);
> +			area->nr_free--;
>   			rmv_page_order(buddy);
>   		}
>   		combined_idx = buddy_idx & page_idx;
> @@ -585,6 +692,7 @@ static inline void __free_one_page(struct page *page,
>   		order++;
>   	}
>   	set_page_order(page, order);
> +	area = &zone->free_area[order];
>
>   	/*
>   	 * If this is not the largest possible page, check if the buddy
> @@ -601,16 +709,22 @@ static inline void __free_one_page(struct page *page,
>   		buddy_idx = __find_buddy_index(combined_idx, order + 1);
>   		higher_buddy = higher_page + (buddy_idx - combined_idx);
>   		if (page_is_buddy(higher_page, higher_buddy, order + 1)) {
> -			list_add_tail(&page->lru,
> -				&zone->free_area[order].free_list[migratetype].list);
> +
> +			/*
> +			 * Implementing an add_to_freelist_tail() won't be
> +			 * very useful because both of them (almost) add to
> +			 * the tail within the region. So we could potentially
> +			 * switch off this entire "is next-higher buddy free?"
> +			 * logic when memory regions are used.
> +			 */
> +			add_to_freelist(page, &area->free_list[migratetype]);
>   			goto out;
>   		}
>   	}
>
> -	list_add(&page->lru,
> -		&zone->free_area[order].free_list[migratetype].list);
> +	add_to_freelist(page, &area->free_list[migratetype]);
>   out:
> -	zone->free_area[order].nr_free++;
> +	area->nr_free++;
>   }
>
>   static inline int free_pages_check(struct page *page)
> @@ -830,7 +944,7 @@ static inline void expand(struct zone *zone, struct page *page,
>   			continue;
>   		}
>   #endif
> -		list_add(&page[size].lru, &area->free_list[migratetype].list);
> +		add_to_freelist(&page[size], &area->free_list[migratetype]);
>   		area->nr_free++;
>   		set_page_order(&page[size], high);
>   	}
> @@ -897,7 +1011,7 @@ struct page *__rmqueue_smallest(struct zone *zone, unsigned int order,
>
>   		page = list_entry(area->free_list[migratetype].list.next,
>   							struct page, lru);
> -		list_del(&page->lru);
> +		del_from_freelist(page, &area->free_list[migratetype]);
>   		rmv_page_order(page);
>   		area->nr_free--;
>   		expand(zone, page, order, current_order, area, migratetype);
> @@ -938,7 +1052,8 @@ int move_freepages(struct zone *zone,
>   {
>   	struct page *page;
>   	unsigned long order;
> -	int pages_moved = 0;
> +	struct free_area *area;
> +	int pages_moved = 0, old_mt;
>
>   #ifndef CONFIG_HOLES_IN_ZONE
>   	/*
> @@ -966,8 +1081,10 @@ int move_freepages(struct zone *zone,
>   		}
>
>   		order = page_order(page);
> -		list_move(&page->lru,
> -			  &zone->free_area[order].free_list[migratetype].list);
> +		old_mt = get_freepage_migratetype(page);
> +		area = &zone->free_area[order];
> +		move_page_freelist(page, &area->free_list[old_mt],
> +				    &area->free_list[migratetype]);
>   		set_freepage_migratetype(page, migratetype);
>   		page += 1 << order;
>   		pages_moved += 1 << order;
> @@ -1061,7 +1178,7 @@ __rmqueue_fallback(struct zone *zone, int order, int start_migratetype)
>   	struct free_area * area;
>   	int current_order;
>   	struct page *page;
> -	int migratetype, new_type, i;
> +	int migratetype, new_type, i, mt;
>
>   	/* Find the largest possible block of pages in the other list */
>   	for (current_order = MAX_ORDER-1; current_order >= order;
> @@ -1086,7 +1203,8 @@ __rmqueue_fallback(struct zone *zone, int order, int start_migratetype)
>   							  migratetype);
>
>   			/* Remove the page from the freelists */
> -			list_del(&page->lru);
> +			mt = get_freepage_migratetype(page);
> +			del_from_freelist(page, &area->free_list[mt]);
>   			rmv_page_order(page);
>
>   			/*
> @@ -1446,7 +1564,8 @@ static int __isolate_free_page(struct page *page, unsigned int order)
>   	}
>
>   	/* Remove page from free list */
> -	list_del(&page->lru);
> +	mt = get_freepage_migratetype(page);
> +	del_from_freelist(page, &zone->free_area[order].free_list[mt]);
>   	zone->free_area[order].nr_free--;
>   	rmv_page_order(page);
>
> @@ -6353,6 +6472,8 @@ __offline_isolated_pages(unsigned long start_pfn, unsigned long end_pfn)
>   	int order, i;
>   	unsigned long pfn;
>   	unsigned long flags;
> +	int mt;
> +
>   	/* find the first valid pfn */
>   	for (pfn = start_pfn; pfn < end_pfn; pfn++)
>   		if (pfn_valid(pfn))
> @@ -6385,7 +6506,8 @@ __offline_isolated_pages(unsigned long start_pfn, unsigned long end_pfn)
>   		printk(KERN_INFO "remove from free list %lx %d %lx\n",
>   		       pfn, 1 << order, end_pfn);
>   #endif
> -		list_del(&page->lru);
> +		mt = get_freepage_migratetype(page);
> +		del_from_freelist(page, &zone->free_area[order].free_list[mt]);
>   		rmv_page_order(page);
>   		zone->free_area[order].nr_free--;
>   #ifdef CONFIG_HIGHMEM
>



^ permalink raw reply	[flat|nested] 100+ messages in thread

* Re: [RFC PATCH v3 09/35] mm: Track the freepage migratetype of pages accurately
  2013-09-03  8:45       ` Srivatsa S. Bhat
@ 2013-09-04  8:23         ` Yasuaki Ishimatsu
  -1 siblings, 0 replies; 100+ messages in thread
From: Yasuaki Ishimatsu @ 2013-09-04  8:23 UTC (permalink / raw)
  To: Srivatsa S. Bhat
  Cc: akpm, mgorman, hannes, tony.luck, matthew.garrett, dave, riel,
	arjan, srinivas.pandruvada, willy, kamezawa.hiroyu, lenb, rjw,
	gargankita, paulmck, svaidy, andi, santosh.shilimkar,
	kosaki.motohiro, linux-pm, linux-mm, linux-kernel

(2013/09/03 17:45), Srivatsa S. Bhat wrote:
> On 09/03/2013 12:08 PM, Yasuaki Ishimatsu wrote:
>> (2013/08/30 22:16), Srivatsa S. Bhat wrote:
>>> Due to the region-wise ordering of the pages in the buddy allocator's
>>> free lists, whenever we want to delete a free pageblock from a free list
>>> (for ex: when moving blocks of pages from one list to the other), we need
>>> to be able to tell the buddy allocator exactly which migratetype it
>>> belongs
>>> to. For that purpose, we can use the page's freepage migratetype
>>> (which is
>>> maintained in the page's ->index field).
>>>
>>> So, while splitting up higher order pages into smaller ones as part of
>>> buddy
>>> operations, keep the new head pages updated with the correct freepage
>>> migratetype information (because we depend on tracking this info
>>> accurately,
>>> as outlined above).
>>>
>>> Signed-off-by: Srivatsa S. Bhat <srivatsa.bhat@linux.vnet.ibm.com>
>>> ---
>>>
>>>    mm/page_alloc.c |    7 +++++++
>>>    1 file changed, 7 insertions(+)
>>>
>>> diff --git a/mm/page_alloc.c b/mm/page_alloc.c
>>> index 398b62c..b4b1275 100644
>>> --- a/mm/page_alloc.c
>>> +++ b/mm/page_alloc.c
>>> @@ -947,6 +947,13 @@ static inline void expand(struct zone *zone,
>>> struct page *page,
>>>            add_to_freelist(&page[size], &area->free_list[migratetype]);
>>>            area->nr_free++;
>>>            set_page_order(&page[size], high);
>>> +
>>> +        /*
>>> +         * Freepage migratetype is tracked using the index field of the
>>> +         * first page of the block. So we need to update the new first
>>> +         * page, when changing the page order.
>>> +         */
>>> +        set_freepage_migratetype(&page[size], migratetype);
>>>        }
>>>    }
>>>
>>>
>>
>> Is this patch a bug fix patch?
>> If so, I want you to split the patch from the patch-set.
>>
>
> No, it's not a bug-fix. We need to take care of this only when using the
> sorted-buddy design to maintain the freelists, which is introduced only in
> this patchset. So mainline doesn't need this patch.
>
> In mainline, we can delete a page from a buddy freelist simply by calling
> list_del() with a pointer to page->lru. It doesn't matter which freelist
> the page belonged to. However, in the sorted-buddy design introduced
> in this patchset, we also need to know which particular freelist we are
> deleting that page from, because apart from breaking the ->lru link from
> the linked-list, we also need to update certain other things such as the
> region->page_block pointer etc, which are part of that particular freelist.
> Thus, it becomes essential to know which freelist we are deleting the page
> from. And for that, we need this patch to maintain that information accurately
> even during buddy operations such as splitting buddy pages in expand().
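
To make the contrast concrete, here are the two deletion paths next to each
other (both calls below are taken from the hunks in patch 08/35; the comments
are just annotation):

---
/* mainline: any freelist, nothing else to update */
list_del(&page->lru);

/* sorted-buddy: the exact freelist is needed, so that its per-region
 * page_block pointer and nr_free count can be updated as well */
mt = get_freepage_migratetype(page);
del_from_freelist(page, &zone->free_area[order].free_list[mt]);
---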

I may be wrong because I do not know this part clearly.

Original code is here:

---
static inline void expand(struct zone *zone, struct page *page,
	int low, int high, struct free_area *area,
	int migratetype)
{
...
		list_add(&page[size].lru, &area->free_list[migratetype]);
		area->nr_free++;
		set_page_order(&page[size], high);
---

It seems that the migratetype of the page[size] page is changed here (it is
added to free_list[migratetype]). So even without applying your patch, I think
the freepage migratetype of the page should be updated.
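
For reference, and assuming I am reading the mainline helpers correctly,
set_freepage_migratetype()/get_freepage_migratetype() are just thin wrappers
around page->index, so the update in question is a single store:

---
static inline void set_freepage_migratetype(struct page *page, int migratetype)
{
	page->index = migratetype;
}

static inline int get_freepage_migratetype(struct page *page)
{
	return page->index;
}
---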

thanks,
Yasuaki Ishimatsu

>
> Regards,
> Srivatsa S. Bhat
>



^ permalink raw reply	[flat|nested] 100+ messages in thread

* Re: [RFC PATCH v3 09/35] mm: Track the freepage migratetype of pages accurately
  2013-09-04  8:23         ` Yasuaki Ishimatsu
@ 2013-09-06  5:24           ` Srivatsa S. Bhat
  -1 siblings, 0 replies; 100+ messages in thread
From: Srivatsa S. Bhat @ 2013-09-06  5:24 UTC (permalink / raw)
  To: Yasuaki Ishimatsu
  Cc: akpm, mgorman, hannes, tony.luck, matthew.garrett, dave, riel,
	arjan, srinivas.pandruvada, willy, kamezawa.hiroyu, lenb, rjw,
	gargankita, paulmck, svaidy, andi, santosh.shilimkar,
	kosaki.motohiro, linux-pm, linux-mm, linux-kernel

On 09/04/2013 01:53 PM, Yasuaki Ishimatsu wrote:
> (2013/09/03 17:45), Srivatsa S. Bhat wrote:
>> On 09/03/2013 12:08 PM, Yasuaki Ishimatsu wrote:
>>> (2013/08/30 22:16), Srivatsa S. Bhat wrote:
>>>> Due to the region-wise ordering of the pages in the buddy allocator's
>>>> free lists, whenever we want to delete a free pageblock from a free
>>>> list
>>>> (for ex: when moving blocks of pages from one list to the other), we
>>>> need
>>>> to be able to tell the buddy allocator exactly which migratetype it
>>>> belongs
>>>> to. For that purpose, we can use the page's freepage migratetype
>>>> (which is
>>>> maintained in the page's ->index field).
>>>>
>>>> So, while splitting up higher order pages into smaller ones as part of
>>>> buddy
>>>> operations, keep the new head pages updated with the correct freepage
>>>> migratetype information (because we depend on tracking this info
>>>> accurately,
>>>> as outlined above).
>>>>
>>>> Signed-off-by: Srivatsa S. Bhat <srivatsa.bhat@linux.vnet.ibm.com>
>>>> ---
>>>>
>>>>    mm/page_alloc.c |    7 +++++++
>>>>    1 file changed, 7 insertions(+)
>>>>
>>>> diff --git a/mm/page_alloc.c b/mm/page_alloc.c
>>>> index 398b62c..b4b1275 100644
>>>> --- a/mm/page_alloc.c
>>>> +++ b/mm/page_alloc.c
>>>> @@ -947,6 +947,13 @@ static inline void expand(struct zone *zone,
>>>> struct page *page,
>>>>            add_to_freelist(&page[size], &area->free_list[migratetype]);
>>>>            area->nr_free++;
>>>>            set_page_order(&page[size], high);
>>>> +
>>>> +        /*
>>>> +         * Freepage migratetype is tracked using the index field of
>>>> the
>>>> +         * first page of the block. So we need to update the new first
>>>> +         * page, when changing the page order.
>>>> +         */
>>>> +        set_freepage_migratetype(&page[size], migratetype);
>>>>        }
>>>>    }
>>>>
>>>>
>>>
>>> Is this patch a bug fix patch?
>>> If so, I want you to split the patch from the patch-set.
>>>
>>
>> No, it's not a bug-fix. We need to take care of this only when using the
>> sorted-buddy design to maintain the freelists, which is introduced
>> only in
>> this patchset. So mainline doesn't need this patch.
>>
>> In mainline, we can delete a page from a buddy freelist simply by calling
>> list_del() with a pointer to page->lru. It doesn't matter which freelist
>> the page belonged to. However, in the sorted-buddy design introduced
>> in this patchset, we also need to know which particular freelist we are
>> deleting that page from, because apart from breaking the ->lru link from
>> the linked-list, we also need to update certain other things such as the
>> region->page_block pointer etc, which are part of that particular
>> freelist.
>> Thus, it becomes essential to know which freelist we are deleting the
>> page
>> from. And for that, we need this patch to maintain that information
>> accurately
>> even during buddy operations such as splitting buddy pages in expand().
> 
> I may be wrong because I do not know this part clearly.
> 
> Original code is here:
> 
> ---
> static inline void expand(struct zone *zone, struct page *page,
>     int low, int high, struct free_area *area,
>     int migratetype)
> {
> ...
>         list_add(&page[size].lru, &area->free_list[migratetype]);
>         area->nr_free++;
>         set_page_order(&page[size], high);
> ---
> 
> It seems that the migratetype of the page[size] page is changed here (it is
> added to free_list[migratetype]). So even without applying your patch, I think
> the freepage migratetype of the page should be updated.
> 

Hmm, thinking about this a bit more, I agree with you. Although it's not a
bug-fix for mainline, it is certainly good to have, since it makes things
more consistent by tracking the freepage migratetype properly for pages
split during buddy expansion. I'll separate this patch from the series and
post it as a stand-alone patch. Thank you!

Regards,
Srivatsa S. Bhat


^ permalink raw reply	[flat|nested] 100+ messages in thread

end of thread, other threads:[~2013-09-06  5:28 UTC | newest]

Thread overview: 100+ messages (download: mbox.gz / follow: Atom feed)
-- links below jump to the message on this page --
2013-08-30 13:13 [RESEND RFC PATCH v3 00/35] mm: Memory Power Management Srivatsa S. Bhat
2013-08-30 13:13 ` Srivatsa S. Bhat
2013-08-30 13:14 ` [RFC PATCH v3 01/35] mm: Restructure free-page stealing code and fix a bug Srivatsa S. Bhat
2013-08-30 13:14   ` Srivatsa S. Bhat
2013-08-30 13:14 ` [RFC PATCH v3 02/35] mm: Fix the value of fallback_migratetype in alloc_extfrag tracepoint Srivatsa S. Bhat
2013-08-30 13:14   ` Srivatsa S. Bhat
2013-08-30 13:14 ` [RFC PATCH v3 03/35] mm: Introduce memory regions data-structure to capture region boundaries within nodes Srivatsa S. Bhat
2013-08-30 13:14   ` Srivatsa S. Bhat
2013-08-30 13:15 ` [RFC PATCH v3 04/35] mm: Initialize node memory regions during boot Srivatsa S. Bhat
2013-08-30 13:15   ` Srivatsa S. Bhat
2013-09-02  6:20   ` Yasuaki Ishimatsu
2013-09-02  6:20     ` Yasuaki Ishimatsu
2013-09-02 17:43     ` Srivatsa S. Bhat
2013-09-02 17:43       ` Srivatsa S. Bhat
2013-09-03  4:53       ` Yasuaki Ishimatsu
2013-09-03  4:53         ` Yasuaki Ishimatsu
2013-08-30 13:15 ` [RFC PATCH v3 05/35] mm: Introduce and initialize zone memory regions Srivatsa S. Bhat
2013-08-30 13:15   ` Srivatsa S. Bhat
2013-08-30 13:15 ` [RFC PATCH v3 06/35] mm: Add helpers to retrieve node region and zone region for a given page Srivatsa S. Bhat
2013-08-30 13:15   ` Srivatsa S. Bhat
2013-09-03  5:56   ` Yasuaki Ishimatsu
2013-09-03  5:56     ` Yasuaki Ishimatsu
2013-09-03  5:56     ` Yasuaki Ishimatsu
2013-09-03  8:34     ` Srivatsa S. Bhat
2013-09-03  8:34       ` Srivatsa S. Bhat
2013-08-30 13:16 ` [RFC PATCH v3 07/35] mm: Add data-structures to describe memory regions within the zones' freelists Srivatsa S. Bhat
2013-08-30 13:16   ` Srivatsa S. Bhat
2013-08-30 13:16 ` [RFC PATCH v3 08/35] mm: Demarcate and maintain pageblocks in region-order in " Srivatsa S. Bhat
2013-08-30 13:16   ` Srivatsa S. Bhat
2013-09-04  7:49   ` Yasuaki Ishimatsu
2013-09-04  7:49     ` Yasuaki Ishimatsu
2013-09-04  7:49     ` Yasuaki Ishimatsu
2013-08-30 13:16 ` [RFC PATCH v3 09/35] mm: Track the freepage migratetype of pages accurately Srivatsa S. Bhat
2013-08-30 13:16   ` Srivatsa S. Bhat
2013-09-03  6:38   ` Yasuaki Ishimatsu
2013-09-03  6:38     ` Yasuaki Ishimatsu
2013-09-03  8:45     ` Srivatsa S. Bhat
2013-09-03  8:45       ` Srivatsa S. Bhat
2013-09-04  8:23       ` Yasuaki Ishimatsu
2013-09-04  8:23         ` Yasuaki Ishimatsu
2013-09-06  5:24         ` Srivatsa S. Bhat
2013-09-06  5:24           ` Srivatsa S. Bhat
2013-08-30 13:16 ` [RFC PATCH v3 10/35] mm: Use the correct migratetype during buddy merging Srivatsa S. Bhat
2013-08-30 13:16   ` Srivatsa S. Bhat
2013-08-30 13:17 ` [RFC PATCH v3 11/35] mm: Add an optimized version of del_from_freelist to keep page allocation fast Srivatsa S. Bhat
2013-08-30 13:17   ` Srivatsa S. Bhat
2013-08-30 13:17 ` [RFC PATCH v3 12/35] bitops: Document the difference in indexing between fls() and __fls() Srivatsa S. Bhat
2013-08-30 13:17   ` Srivatsa S. Bhat
2013-08-30 13:17 ` [RFC PATCH v3 13/35] mm: A new optimized O(log n) sorting algo to speed up buddy-sorting Srivatsa S. Bhat
2013-08-30 13:17   ` Srivatsa S. Bhat
2013-08-30 13:18 ` [RFC PATCH v3 14/35] mm: Add support to accurately track per-memory-region allocation Srivatsa S. Bhat
2013-08-30 13:18   ` Srivatsa S. Bhat
2013-08-30 13:18 ` [RFC PATCH v3 15/35] mm: Print memory region statistics to understand the buddy allocator behavior Srivatsa S. Bhat
2013-08-30 13:18   ` Srivatsa S. Bhat
2013-08-30 13:18 ` [RFC PATCH v3 16/35] mm: Enable per-memory-region fragmentation stats in pagetypeinfo Srivatsa S. Bhat
2013-08-30 13:18   ` Srivatsa S. Bhat
2013-08-30 13:19 ` [RFC PATCH v3 17/35] mm: Add aggressive bias to prefer lower regions during page allocation Srivatsa S. Bhat
2013-08-30 13:19   ` Srivatsa S. Bhat
2013-08-30 13:19 ` [RFC PATCH v3 18/35] mm: Introduce a "Region Allocator" to manage entire memory regions Srivatsa S. Bhat
2013-08-30 13:19   ` Srivatsa S. Bhat
2013-08-30 13:19 ` [RFC PATCH v3 19/35] mm: Add a mechanism to add pages to buddy freelists in bulk Srivatsa S. Bhat
2013-08-30 13:19   ` Srivatsa S. Bhat
2013-08-30 13:20 ` [RFC PATCH v3 20/35] mm: Provide a mechanism to delete pages from " Srivatsa S. Bhat
2013-08-30 13:20   ` Srivatsa S. Bhat
2013-08-30 13:20 ` [RFC PATCH v3 21/35] mm: Provide a mechanism to release free memory to the region allocator Srivatsa S. Bhat
2013-08-30 13:20   ` Srivatsa S. Bhat
2013-08-30 13:20 ` [RFC PATCH v3 22/35] mm: Provide a mechanism to request free memory from " Srivatsa S. Bhat
2013-08-30 13:20   ` Srivatsa S. Bhat
2013-08-30 13:21 ` [RFC PATCH v3 23/35] mm: Maintain the counter for freepages in " Srivatsa S. Bhat
2013-08-30 13:21   ` Srivatsa S. Bhat
2013-08-30 13:21 ` [RFC PATCH v3 24/35] mm: Propagate the sorted-buddy bias for picking free regions, to " Srivatsa S. Bhat
2013-08-30 13:21   ` Srivatsa S. Bhat
2013-08-30 13:21 ` [RFC PATCH v3 25/35] mm: Fix vmstat to also account for freepages in the " Srivatsa S. Bhat
2013-08-30 13:21   ` Srivatsa S. Bhat
2013-08-30 13:22 ` [RFC PATCH v3 26/35] mm: Drop some very expensive sorted-buddy related checks under DEBUG_PAGEALLOC Srivatsa S. Bhat
2013-08-30 13:22   ` Srivatsa S. Bhat
2013-08-30 13:22 ` [RFC PATCH v3 27/35] mm: Connect Page Allocator(PA) to Region Allocator(RA); add PA => RA flow Srivatsa S. Bhat
2013-08-30 13:22   ` Srivatsa S. Bhat
2013-08-30 13:22 ` [RFC PATCH v3 28/35] mm: Connect Page Allocator(PA) to Region Allocator(RA); add PA <= " Srivatsa S. Bhat
2013-08-30 13:22   ` Srivatsa S. Bhat
2013-08-30 13:23 ` [RFC PATCH v3 29/35] mm: Update the freepage migratetype of pages during region allocation Srivatsa S. Bhat
2013-08-30 13:23   ` Srivatsa S. Bhat
2013-08-30 13:23 ` [RFC PATCH v3 30/35] mm: Provide a mechanism to check if a given page is in the region allocator Srivatsa S. Bhat
2013-08-30 13:23   ` Srivatsa S. Bhat
2013-08-30 13:23 ` [RFC PATCH v3 31/35] mm: Add a way to request pages of a particular region from " Srivatsa S. Bhat
2013-08-30 13:23   ` Srivatsa S. Bhat
2013-08-30 13:24 ` [RFC PATCH v3 32/35] mm: Modify move_freepages() to handle pages in the region allocator properly Srivatsa S. Bhat
2013-08-30 13:24   ` Srivatsa S. Bhat
2013-08-30 13:24 ` [RFC PATCH v3 33/35] mm: Never change migratetypes of pageblocks during freepage stealing Srivatsa S. Bhat
2013-08-30 13:24   ` Srivatsa S. Bhat
2013-08-30 13:24 ` [RFC PATCH v3 34/35] mm: Set pageblock migratetype when allocating regions from region allocator Srivatsa S. Bhat
2013-08-30 13:24   ` Srivatsa S. Bhat
2013-08-30 13:24 ` [RFC PATCH v3 35/35] mm: Use a cache between page-allocator and region-allocator Srivatsa S. Bhat
2013-08-30 13:24   ` Srivatsa S. Bhat
2013-08-30 13:26 ` [RESEND RFC PATCH v3 00/35] mm: Memory Power Management Srivatsa S. Bhat
2013-08-30 13:26   ` Srivatsa S. Bhat
2013-08-30 15:27 ` Dave Hansen
2013-08-30 15:27   ` Dave Hansen
2013-08-30 17:50   ` Srivatsa S. Bhat
2013-08-30 17:50     ` Srivatsa S. Bhat
