linux-kernel.vger.kernel.org archive mirror
* [PATCH v7 0/7] Introduce ZONE_CMA
@ 2017-04-11  3:17 js1304
  2017-04-11  3:17 ` [PATCH v7 1/7] mm/page_alloc: don't reserve ZONE_HIGHMEM for ZONE_MOVABLE request js1304
                   ` (9 more replies)
  0 siblings, 10 replies; 39+ messages in thread
From: js1304 @ 2017-04-11  3:17 UTC (permalink / raw)
  To: Andrew Morton
  Cc: Rik van Riel, Johannes Weiner, mgorman, Laura Abbott,
	Minchan Kim, Marek Szyprowski, Michal Nazarewicz,
	Aneesh Kumar K . V, Vlastimil Babka, Russell King, Will Deacon,
	linux-mm, linux-kernel, kernel-team, Joonsoo Kim

From: Joonsoo Kim <iamjoonsoo.kim@lge.com>

Changes from v6
o Rebase on next-20170405
o Add a fix for lowmem mapping on ARM (last patch)
o Re-organize the cover letter

Changes from v5
o Rebase on next-20161013
o Cosmetic change on patch 1
o Optimize span of ZONE_CMA on multiple node system

Changes from v4
o Rebase on next-20160825
o Add general fix patch for lowmem reserve
o Fix lowmem reserve ratio
o Fix zone span optimization per Vlastimil
o Fix pageset initialization
o Change invocation timing on cma_init_reserved_areas()

Changes from v3
o Rebase on next-20160805
o Split first patch per Vlastimil
o Remove useless function parameter per Vlastimil
o Add code comment per Vlastimil
o Add following description on cover-letter

Changes from v2
o Rebase on next-20160525
o No other changes except following description

Changes from v1
o Separate out some patches that deserve to be submitted independently
o Modify description to reflect current kernel state
(e.g. high-order watermark problem disappeared by Mel's work)
o Don't increase SECTION_SIZE_BITS to make a room in page flags
(detailed reason is on the patch that adds ZONE_CMA)
o Adjust ZONE_CMA population code


Hello,

This is the 7th version of the ZONE_CMA patchset. One patch is added
to fix a potential problem on ARM; the other changes are just due to
the rebase.

This patchset has a long history and has received several reviews.
This cover letter summarizes those reviews and my opinion on them.
Since the order of the content can be confusing, here is a simple
index. If you want to understand the history properly, please read
the parts in reverse order.

PART 1. Strong points of the zone approach
PART 2. Summary in LSF/MM 2016 discussion
PART 3. Original motivation of this patchset

***** PART 1 *****

CMA has many problems, which I describe at the bottom of this cover
letter. These problems come from the constraint that CMA memory must
always remain migratable for device usage. I think that introducing a
new zone is the best approach to solve them, for the following reasons.

Zones were introduced to solve issues caused by H/W addressing
limitations, and the MM subsystem is built to work efficiently with
them: the allocation and reclaim logic takes these limitations into
account throughout. What this patchset does is introduce a new zone and
slightly extend the zone concept: a zone can carry not only a H/W
addressing limitation but also a S/W limitation that guarantees page
migration. This concept originates from ZONE_MOVABLE, which has worked
well for a long time, so ZONE_CMA is nothing special at this point.

Mel's major concern is that a zone with a S/W limitation, like
ZONE_MOVABLE, causes a highmem/lowmem problem: some memory cannot be
used for kernel allocations because of the zone's limitation. That
breaks LRU ordering and makes it hard to find kernel-usable memory
under memory pressure.

However, the important point is that this problem doesn't come from
the implementation detail (zone vs. migratetype). Even if we implement
CMA with a migratetype instead of ZONE_MOVABLE, that memory still
cannot be used for kernel allocations because it must stay migratable,
so LRU ordering breaks either way. The problem cannot be avoided in any
case, so we should focus on which solution is easier to maintain and
less intrusive for the MM subsystem.

From this viewpoint, I think the zone approach is better. As mentioned
earlier, the MM subsystem already has plenty of infrastructure to deal
with a zone's H/W addressing limitation; adding a S/W limitation to the
zone concept and adding a new zone doesn't change anything, and it
works by itself. My patchset removes many of the CMA-area management
hooks in MM while solving the problems, whereas the migratetype
approach would require even more hooks to solve them.

Although Mel withdrew his review, Vlastimil expressed agreement with
this new zone approach [6].

 "I realize I differ here from much more experienced mm guys, and will
 probably deservingly regret it later on, but I think that the ZONE_CMA
 approach could work indeed better than current MIGRATE_CMA pageblocks."

If anyone has a different opinion, please let me know.

Thanks.

***** PART 2 *****

There was a discussion with Mel [5] after LSF/MM 2016. I could
summarise it to help with the merge decision, but it's better to read
it yourself, since any summary of mine would be biased in my favour.
If anyone would like a summary anyway, I will write one. :)

Anyway, Mel's position on this patchset seems to be neutral. He said:
"I'm not going to outright NAK your series but I won't ACK it either"

We can fix the problems with either approach, but I would prefer the
new zone approach because it is less error-prone: it removes some of
the existing corner-case handling and avoids the additional corner-case
handling that fixing the problems would otherwise require.

Note that our company is already using ZONE_CMA and has seen no problems.

If anyone has a different opinion, please let me know and let's discuss
together.

Andrew, if there is anything I need to do for the merge, please let me know.

***** PART 3 *****

This series tries to solve the problems of the current CMA implementation.

CMA was introduced to provide physically contiguous pages at runtime
without an exclusive reserved memory area. But the current
implementation works much like the old reserved memory approach,
because freepages in the CMA region are used only when there is no
other movable freepage, in other words, only as a fallback. In that
situation, kswapd is woken up easily, since there are no unmovable or
reclaimable freepages left either. Once kswapd starts to reclaim
memory, fallback allocation from MIGRATE_CMA stops, because the movable
freepages have already been refilled by kswapd, and most of the
freepages in the CMA region are left unused. This ends up looking just
like the exclusive reserved memory case.

In my experiment, I found that on a system with 1024 MB of memory and
512 MB reserved for CMA, kswapd is usually woken up when roughly 512 MB
of free memory is left. The reason is that, to keep enough free memory
for unmovable and reclaimable allocations, kswapd uses the equation
below when calculating free memory, and the result easily drops below
the watermark.

Free memory for unmovable and reclaimable = Free total - Free CMA pages

This follows from the property that CMA freepages can't be used for
unmovable and reclaimable allocations.

So kswapd is woken up whenever (FreeTotal - FreeCMA) drops below the
low watermark and keeps reclaiming until (FreeTotal - FreeCMA) rises
above the high watermark. As a result, FreeTotal constantly hovers
around the 512 MB boundary, which means we can't utilize the full
memory capacity.
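
For reference, this behaviour comes from the watermark check in
__zone_watermark_ok(). A simplified sketch of the pre-series logic (not
the exact code; high-order handling is omitted):

    long free_pages = zone_page_state(z, NR_FREE_PAGES);

    /*
     * Unmovable/reclaimable requests don't carry ALLOC_CMA, so the
     * free pages sitting in the CMA region are subtracted before the
     * comparison. With 512 MB of mostly free CMA, the zone looks
     * short of memory and kswapd gets woken up.
     */
    if (!(alloc_flags & ALLOC_CMA))
        free_pages -= zone_page_state(z, NR_FREE_CMA_PAGES);

    if (free_pages <= mark + z->lowmem_reserve[classzone_idx])
        return false;    /* below watermark: reclaim / wake kswapd */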

To fix this problem, I submitted some patches [1] about 10 months ago,
but found more problems that need to be fixed first. That approach
requires many hooks in the allocator hotpath, so some developers didn't
like it. Instead, some of them suggested a different approach [2] to
fix all the CMA-related problems: introducing a new zone to deal with
free CMA pages. I agree that this is the best way to go, so it is
implemented here. Although the properties of ZONE_MOVABLE and ZONE_CMA
are similar, I decided to add a new zone rather than piggyback on
ZONE_MOVABLE, since they differ in a few ways. First, reserved CMA
pages should not be offlined. If CMA freepages were managed by
ZONE_MOVABLE, we would have to keep the MIGRATE_CMA migratetype and
insert many hooks into the memory hotplug code to distinguish
hotpluggable memory from memory reserved for CMA within the same zone,
which would make already complicated memory hotplug code even more
complicated. Second, cma_alloc() can be called much more frequently
than memory hotplug operations, and we may need to control the
allocation rate of ZONE_CMA in the future to optimize latency; a
separate zone is easier to modify for that. Third, I'd like to see
separate statistics for CMA. Sometimes we need to debug why cma_alloc()
fails, and separate statistics are more helpful in that situation.
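
To illustrate the third point: with this series, /proc/zoneinfo gains a
separate "CMA" zone entry per node and the zone-indexed counters gain a
*_cma variant (illustrative excerpt only; the names follow the
zone_names[] and vmstat changes in this series, the values are elided):

    Node 0, zone      CMA
      pages free     ...
      ...
    # /proc/vmstat
    pgalloc_cma ...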

Anyway, this patchset solves four problems related to the CMA implementation.

1) Utilization problem
As mentioned above, we can't utilize the full memory capacity due to
the limitation on CMA freepages and the fallback policy. This patchset
implements a new zone for CMA and uses it to serve GFP_HIGHUSER_MOVABLE
requests. This allocation type covers page cache and anonymous pages,
which account for most memory usage in the normal case, so we can
utilize the full memory capacity. Below is the experiment result for
this problem.

8 CPUs, 1024 MB, VIRTUAL MACHINE
make -j16

<Before this series>
CMA reserve:            0 MB            512 MB
Elapsed-time:           92.4		186.5
pswpin:                 82		18647
pswpout:                160		69839

<After this series>
CMA reserve:            0 MB            512 MB
Elapsed-time:           93.1		93.4
pswpin:                 84		46
pswpout:                183		92

FYI, there is another attempt [3] on lkml trying to solve this problem,
and, as far as I know, Qualcomm also has an out-of-tree solution for it.

2) Reclaim problem
Currently, there is no logic to distinguish CMA pages in the reclaim
path. If reclaim is initiated for an unmovable or reclaimable
allocation, reclaiming CMA pages doesn't help satisfy the request, so
it is just wasted work. By managing CMA pages in the new zone, we can
skip reclaiming ZONE_CMA entirely when it cannot help.
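
No new hook is needed for this: reclaim already limits its zone walk by
the zone index derived from the gfp mask, so with CMA pages in their
own zone the skip falls out naturally. Roughly (a sketch of the
existing zonelist iteration, not code from this series;
reclaim_pages_from() is a hypothetical stand-in for the per-zone
reclaim work):

    struct zonelist *zonelist = node_zonelist(numa_node_id(), GFP_KERNEL);
    struct zoneref *z;
    struct zone *zone;
    /* an unmovable GFP_KERNEL request only reclaims up to ZONE_NORMAL */
    enum zone_type high_zoneidx = gfp_zone(GFP_KERNEL);

    for_each_zone_zonelist(zone, z, zonelist, high_zoneidx)
        /* ZONE_CMA sits above ZONE_MOVABLE, so it is never visited */
        reclaim_pages_from(zone);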

3) Atomic allocation failure problem
Kswapd isn't woken up to reclaim pages when the allocation request is
movable and there are enough free pages in the CMA region. After a
bunch of consecutive movable allocation requests, free pages in the
ordinary region (outside CMA) can be exhausted without kswapd ever
being woken. If an atomic unmovable allocation arrives at that point,
it fails because there aren't enough pages in the ordinary region. This
problem was reported by Aneesh [4] and is solved by this patchset.

4) Inefficient compaction
The usual high-order allocation request is unmovable, so it cannot be
serviced from the CMA area. However, the compaction migration scanner
doesn't distinguish migratable pages in the CMA area and migrates them
anyway. Even if a high-order page is created in that region, it cannot
be used because of the type mismatch. This series solves the problem by
separating CMA pages from the ordinary zones.

I have boot-tested this on x86_64, x86_32, arm and arm64, and ran some
stress tests on x86_64 and x86_32 without any problems. Feel free to
try it out, and please give me feedback. :)


Thanks.

[1] https://lkml.org/lkml/2014/5/28/64
[2] https://lkml.org/lkml/2014/11/4/55
[3] https://lkml.org/lkml/2014/10/15/623
[4] http://www.spinics.net/lists/linux-mm/msg100562.html
[5] https://lkml.kernel.org/r/20160425053653.GA25662@js1304-P5Q-DELUXE
[6] https://lkml.kernel.org/r/1919a85d-6e1e-374f-b8c3-1236c36b0393@suse.cz

Joonsoo Kim (7):
  mm/page_alloc: don't reserve ZONE_HIGHMEM for ZONE_MOVABLE request
  mm/cma: introduce new zone, ZONE_CMA
  mm/cma: populate ZONE_CMA
  mm/cma: remove ALLOC_CMA
  mm/cma: remove MIGRATE_CMA
  mm/cma: remove per zone CMA stat
  ARM: CMA: avoid re-mapping CMA region if CONFIG_HIGHMEM

 arch/arm/mm/dma-mapping.c           |   7 +-
 arch/powerpc/mm/mmu_context_iommu.c |   2 +-
 arch/x86/mm/highmem_32.c            |   8 ++
 fs/proc/meminfo.c                   |   2 +-
 include/linux/cma.h                 |   7 ++
 include/linux/gfp.h                 |  32 +++---
 include/linux/memory_hotplug.h      |   3 -
 include/linux/mempolicy.h           |   2 +-
 include/linux/mm.h                  |   1 +
 include/linux/mmzone.h              |  60 +++++-----
 include/linux/page-isolation.h      |   5 +-
 include/linux/vm_event_item.h       |  10 +-
 include/linux/vmstat.h              |   8 --
 include/trace/events/mmflags.h      |  10 +-
 kernel/power/snapshot.c             |   8 ++
 mm/cma.c                            |  78 +++++++++++--
 mm/compaction.c                     |  12 +-
 mm/hugetlb.c                        |   3 +-
 mm/internal.h                       |   4 +-
 mm/memory_hotplug.c                 |   7 +-
 mm/page_alloc.c                     | 220 ++++++++++++++++++------------------
 mm/page_isolation.c                 |  15 +--
 mm/page_owner.c                     |   6 +-
 mm/usercopy.c                       |   4 +-
 mm/vmstat.c                         |  10 +-
 25 files changed, 310 insertions(+), 214 deletions(-)

-- 
2.7.4

* [PATCH v7 1/7] mm/page_alloc: don't reserve ZONE_HIGHMEM for ZONE_MOVABLE request
  2017-04-11  3:17 [PATCH v7 0/7] Introduce ZONE_CMA js1304
@ 2017-04-11  3:17 ` js1304
  2017-04-17  7:38   ` Minchan Kim
  2017-04-11  3:17 ` [PATCH v7 2/7] mm/cma: introduce new zone, ZONE_CMA js1304
                   ` (8 subsequent siblings)
  9 siblings, 1 reply; 39+ messages in thread
From: js1304 @ 2017-04-11  3:17 UTC (permalink / raw)
  To: Andrew Morton
  Cc: Rik van Riel, Johannes Weiner, mgorman, Laura Abbott,
	Minchan Kim, Marek Szyprowski, Michal Nazarewicz,
	Aneesh Kumar K . V, Vlastimil Babka, Russell King, Will Deacon,
	linux-mm, linux-kernel, kernel-team, Joonsoo Kim

From: Joonsoo Kim <iamjoonsoo.kim@lge.com>

Freepages in ZONE_HIGHMEM can't be used for kernel memory, so reserving
them isn't that important. When ZONE_MOVABLE is used, this reservation
would, at least in theory, reduce the memory usable for
GFP_HIGHUSER_MOVABLE allocation requests, which mainly serve page cache
and anonymous pages. So, fix it.

Also, sizing the sysctl_lowmem_reserve_ratio array as MAX_NR_ZONES - 1
makes the code confusing. For example, on a highmem system the
following reserve ratio entry actually applies to *ZONE_NORMAL*, which
easily misleads people:

 #ifdef CONFIG_HIGHMEM
 32
 #endif

This patch also fixes that by sizing the sysctl_lowmem_reserve_ratio
array as MAX_NR_ZONES and placing each "#ifdef" in the right place.
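
For context, setup_per_zone_lowmem_reserve() computes each lower zone's
protection roughly as (managed pages of the zones above it, up to the
zone targeted by the request) / sysctl_lowmem_reserve_ratio[lower
zone], so a ratio of INT_MAX makes the reserve effectively zero. A
worked example with made-up numbers, assuming ZONE_MOVABLE has ~1 GB
(262144 pages) of managed memory:

 ZONE_HIGHMEM's reserve against ZONE_MOVABLE allocations:
   ratio 32 (before): 262144 / 32      = 8192 pages (~32 MB of highmem withheld)
   ratio INT_MAX:     262144 / INT_MAX = 0 pages    (highmem fully usable)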

Reviewed-by: Aneesh Kumar K.V <aneesh.kumar@linux.vnet.ibm.com>
Acked-by: Vlastimil Babka <vbabka@suse.cz>
Signed-off-by: Joonsoo Kim <iamjoonsoo.kim@lge.com>
---
 include/linux/mmzone.h |  2 +-
 mm/page_alloc.c        | 11 ++++++-----
 2 files changed, 7 insertions(+), 6 deletions(-)

diff --git a/include/linux/mmzone.h b/include/linux/mmzone.h
index ebaccd4..96194bf 100644
--- a/include/linux/mmzone.h
+++ b/include/linux/mmzone.h
@@ -869,7 +869,7 @@ int min_free_kbytes_sysctl_handler(struct ctl_table *, int,
 					void __user *, size_t *, loff_t *);
 int watermark_scale_factor_sysctl_handler(struct ctl_table *, int,
 					void __user *, size_t *, loff_t *);
-extern int sysctl_lowmem_reserve_ratio[MAX_NR_ZONES-1];
+extern int sysctl_lowmem_reserve_ratio[MAX_NR_ZONES];
 int lowmem_reserve_ratio_sysctl_handler(struct ctl_table *, int,
 					void __user *, size_t *, loff_t *);
 int percpu_pagelist_fraction_sysctl_handler(struct ctl_table *, int,
diff --git a/mm/page_alloc.c b/mm/page_alloc.c
index 32b31d6..60ffa4e 100644
--- a/mm/page_alloc.c
+++ b/mm/page_alloc.c
@@ -203,17 +203,18 @@ static void __free_pages_ok(struct page *page, unsigned int order);
  * TBD: should special case ZONE_DMA32 machines here - in those we normally
  * don't need any ZONE_NORMAL reservation
  */
-int sysctl_lowmem_reserve_ratio[MAX_NR_ZONES-1] = {
+int sysctl_lowmem_reserve_ratio[MAX_NR_ZONES] = {
 #ifdef CONFIG_ZONE_DMA
-	 256,
+	[ZONE_DMA] = 256,
 #endif
 #ifdef CONFIG_ZONE_DMA32
-	 256,
+	[ZONE_DMA32] = 256,
 #endif
+	[ZONE_NORMAL] = 32,
 #ifdef CONFIG_HIGHMEM
-	 32,
+	[ZONE_HIGHMEM] = INT_MAX,
 #endif
-	 32,
+	[ZONE_MOVABLE] = INT_MAX,
 };
 
 EXPORT_SYMBOL(totalram_pages);
-- 
2.7.4

* [PATCH v7 2/7] mm/cma: introduce new zone, ZONE_CMA
  2017-04-11  3:17 [PATCH v7 0/7] Introduce ZONE_CMA js1304
  2017-04-11  3:17 ` [PATCH v7 1/7] mm/page_alloc: don't reserve ZONE_HIGHMEM for ZONE_MOVABLE request js1304
@ 2017-04-11  3:17 ` js1304
  2017-04-11  3:17 ` [PATCH v7 3/7] mm/cma: populate ZONE_CMA js1304
                   ` (7 subsequent siblings)
  9 siblings, 0 replies; 39+ messages in thread
From: js1304 @ 2017-04-11  3:17 UTC (permalink / raw)
  To: Andrew Morton
  Cc: Rik van Riel, Johannes Weiner, mgorman, Laura Abbott,
	Minchan Kim, Marek Szyprowski, Michal Nazarewicz,
	Aneesh Kumar K . V, Vlastimil Babka, Russell King, Will Deacon,
	linux-mm, linux-kernel, kernel-team, Joonsoo Kim

From: Joonsoo Kim <iamjoonsoo.kim@lge.com>

Attached cover-letter:

This series tries to solve the problems of the current CMA implementation.

CMA was introduced to provide physically contiguous pages at runtime
without an exclusive reserved memory area. But the current
implementation works much like the old reserved memory approach,
because freepages in the CMA region are used only when there is no
other movable freepage, in other words, only as a fallback. In that
situation, kswapd is woken up easily, since there are no unmovable or
reclaimable freepages left either. Once kswapd starts to reclaim
memory, fallback allocation from MIGRATE_CMA stops, because the movable
freepages have already been refilled by kswapd, and most of the
freepages in the CMA region are left unused. This ends up looking just
like the exclusive reserved memory case.

In my experiment, I found that on a system with 1024 MB of memory and
512 MB reserved for CMA, kswapd is usually woken up when roughly 512 MB
of free memory is left. The reason is that, to keep enough free memory
for unmovable and reclaimable allocations, kswapd uses the equation
below when calculating free memory, and the result easily drops below
the watermark.

Free memory for unmovable and reclaimable = Free total - Free CMA pages

This follows from the property that CMA freepages can't be used for
unmovable and reclaimable allocations.

So kswapd is woken up whenever (FreeTotal - FreeCMA) drops below the
low watermark and keeps reclaiming until (FreeTotal - FreeCMA) rises
above the high watermark. As a result, FreeTotal constantly hovers
around the 512 MB boundary, which means we can't utilize the full
memory capacity.

To fix this problem, I submitted some patches [1] about 10 months ago,
but found more problems that need to be fixed first. That approach
requires many hooks in the allocator hotpath, so some developers didn't
like it. Instead, some of them suggested a different approach [2] to
fix all the CMA-related problems: introducing a new zone to deal with
free CMA pages. I agree that this is the best way to go, so it is
implemented here. Although the properties of ZONE_MOVABLE and ZONE_CMA
are similar, I decided to add a new zone rather than piggyback on
ZONE_MOVABLE, since they differ in a few ways. First, reserved CMA
pages should not be offlined. If CMA freepages were managed by
ZONE_MOVABLE, we would have to keep the MIGRATE_CMA migratetype and
insert many hooks into the memory hotplug code to distinguish
hotpluggable memory from memory reserved for CMA within the same zone,
which would make already complicated memory hotplug code even more
complicated. Second, cma_alloc() can be called much more frequently
than memory hotplug operations, and we may need to control the
allocation rate of ZONE_CMA in the future to optimize latency; a
separate zone is easier to modify for that. Third, I'd like to see
separate statistics for CMA. Sometimes we need to debug why cma_alloc()
fails, and separate statistics are more helpful in that situation.

Anyway, this patchset solves four problems related to the CMA implementation.

1) Utilization problem
As mentioned above, we can't utilize the full memory capacity due to
the limitation on CMA freepages and the fallback policy. This patchset
implements a new zone for CMA and uses it to serve GFP_HIGHUSER_MOVABLE
requests. This allocation type covers page cache and anonymous pages,
which account for most memory usage in the normal case, so we can
utilize the full memory capacity. Below is the experiment result for
this problem.

8 CPUs, 1024 MB, VIRTUAL MACHINE
make -j16

<Before this series>
CMA reserve:            0 MB            512 MB
Elapsed-time:           92.4		186.5
pswpin:                 82		18647
pswpout:                160		69839

<After this series>
CMA reserve:            0 MB            512 MB
Elapsed-time:           93.1		93.4
pswpin:                 84		46
pswpout:                183		92

FYI, there is another attempt [3] on lkml trying to solve this problem,
and, as far as I know, Qualcomm also has an out-of-tree solution for it.

2) Reclaim problem
Currently, there is no logic to distinguish CMA pages in the reclaim
path. If reclaim is initiated for an unmovable or reclaimable
allocation, reclaiming CMA pages doesn't help satisfy the request, so
it is just wasted work. By managing CMA pages in the new zone, we can
skip reclaiming ZONE_CMA entirely when it cannot help.

3) Atomic allocation failure problem
Kswapd isn't woken up to reclaim pages when the allocation request is
movable and there are enough free pages in the CMA region. After a
bunch of consecutive movable allocation requests, free pages in the
ordinary region (outside CMA) can be exhausted without kswapd ever
being woken. If an atomic unmovable allocation arrives at that point,
it fails because there aren't enough pages in the ordinary region. This
problem was reported by Aneesh [4] and is solved by this patchset.

4) Inefficient compaction
The usual high-order allocation request is unmovable, so it cannot be
serviced from the CMA area. However, the compaction migration scanner
doesn't distinguish migratable pages in the CMA area and migrates them
anyway. Even if a high-order page is created in that region, it cannot
be used because of the type mismatch. This series solves the problem by
separating CMA pages from the ordinary zones.

[1] https://lkml.org/lkml/2014/5/28/64
[2] https://lkml.org/lkml/2014/11/4/55
[3] https://lkml.org/lkml/2014/10/15/623
[4] http://www.spinics.net/lists/linux-mm/msg100562.html
[5] https://lkml.org/lkml/2014/5/30/320

For this patch:

Currently, reserved pages for CMA are managed together with normal
pages. To distinguish them, we use a dedicated migratetype,
MIGRATE_CMA, and handle it specially. But it turns out that there are
too many problems with this approach, and fixing all of them requires
many more hooks in the page allocation and reclaim paths, so some
developers have expressed their discomfort and the CMA problems have
gone unfixed for a long time.

To end this situation and fix the CMA problems, this patch implements
ZONE_CMA. Reserved pages for CMA will be managed in this new zone. This
approach removes all the existing hooks for MIGRATE_CMA, and many
problems in the CMA implementation are solved along the way.

This patch only adds the basic infrastructure for ZONE_CMA. In the
following patch, ZONE_CMA is actually populated and used.

Adding a new zone could cause two problems: overflow of the page flags
and the GFP_ZONE_TABLE issue.

The following is the page-flags layout described in page-flags-layout.h.

1. No sparsemem or sparsemem vmemmap: |       NODE     | ZONE |             ... | FLAGS |
2.      " plus space for last_cpupid: |       NODE     | ZONE | LAST_CPUPID ... | FLAGS |
3. classic sparse with space for node:| SECTION | NODE | ZONE |             ... | FLAGS |
4.      " plus space for last_cpupid: | SECTION | NODE | ZONE | LAST_CPUPID ... | FLAGS |
5. classic sparse no space for node:  | SECTION |     ZONE    | ... | FLAGS |

There is no problem with configurations #1 and #2 on 64-bit systems;
there is enough room even for extremely large x86_64 systems, and
32-bit systems don't have many nodes, so they are fine too. Systems
with configurations #3, #4 and #5 could be affected by this zone
addition, but thanks to the recent THP rework, which freed up one page
flag, the problem surface should be small. In some configurations a
problem is still possible, but it depends heavily on the individual
configuration, so the impact cannot be easily estimated. I guess that
the usual system with CONFIG_CMA would not be affected. If there is a
problem, we can adjust the section width or node width for that
architecture.

Currently, GFP_ZONE_TABLE is a 32-bit value so that 32-bit systems can
manipulate it with 32-bit bit operations. If we add one more zone, it
becomes a 48-bit value and 32-bit bit operations are no longer
possible. Although this causes slight overhead, there is no other way,
so this patch relaxes GFP_ZONE_TABLE's 32-bit limitation. 32-bit
systems with CONFIG_CMA are affected by this change, but the impact
should be marginal.
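
The arithmetic behind the 48-bit figure, based on the GFP_ZONE_TABLE
definition in the diff below (the table packs one zone number per
combination of the lowest four zone-modifier GFP bits; a typical 32-bit
highmem configuration is assumed):

 table entries     : 2^4 = 16
 without ZONE_CMA  : GFP_ZONES_SHIFT = 2 -> 16 * 2 = 32 bits (fits in unsigned long)
 with ZONE_CMA     : GFP_ZONES_SHIFT = 3 -> 16 * 3 = 48 bits (needs unsigned long long on 32-bit)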

Note that there are many checkpatch warnings, but I think the current
code is more readable than it would be after fixing them up.

Reviewed-by: Aneesh Kumar K.V <aneesh.kumar@linux.vnet.ibm.com>
Acked-by: Vlastimil Babka <vbabka@suse.cz>
Signed-off-by: Joonsoo Kim <iamjoonsoo.kim@lge.com>
---
 arch/x86/mm/highmem_32.c       |  8 ++++++
 include/linux/gfp.h            | 30 ++++++++++++---------
 include/linux/mempolicy.h      |  2 +-
 include/linux/mmzone.h         | 31 ++++++++++++++++++++-
 include/linux/vm_event_item.h  | 10 ++++++-
 include/trace/events/mmflags.h | 10 ++++++-
 kernel/power/snapshot.c        |  8 ++++++
 mm/page_alloc.c                | 61 ++++++++++++++++++++++++++++++++++--------
 mm/vmstat.c                    |  9 ++++++-
 9 files changed, 141 insertions(+), 28 deletions(-)

diff --git a/arch/x86/mm/highmem_32.c b/arch/x86/mm/highmem_32.c
index 6d18b70..52a14da 100644
--- a/arch/x86/mm/highmem_32.c
+++ b/arch/x86/mm/highmem_32.c
@@ -120,6 +120,14 @@ void __init set_highmem_pages_init(void)
 		if (!is_highmem(zone))
 			continue;
 
+		/*
+		 * ZONE_CMA is a special zone that should not be
+		 * participated in initialization because it's pages
+		 * would be initialized by initialization of other zones.
+		 */
+		if (is_zone_cma(zone))
+			continue;
+
 		zone_start_pfn = zone->zone_start_pfn;
 		zone_end_pfn = zone_start_pfn + zone->spanned_pages;
 
diff --git a/include/linux/gfp.h b/include/linux/gfp.h
index 2b1a44f5..c2ed2eb 100644
--- a/include/linux/gfp.h
+++ b/include/linux/gfp.h
@@ -311,6 +311,12 @@ static inline bool gfpflags_allow_blocking(const gfp_t gfp_flags)
 #define OPT_ZONE_DMA32 ZONE_NORMAL
 #endif
 
+#ifdef CONFIG_CMA
+#define OPT_ZONE_CMA ZONE_CMA
+#else
+#define OPT_ZONE_CMA ZONE_MOVABLE
+#endif
+
 /*
  * GFP_ZONE_TABLE is a word size bitstring that is used for looking up the
  * zone to use given the lowest 4 bits of gfp_t. Entries are GFP_ZONES_SHIFT
@@ -340,8 +346,6 @@ static inline bool gfpflags_allow_blocking(const gfp_t gfp_flags)
  *       0xd    => BAD (MOVABLE+DMA32+DMA)
  *       0xe    => BAD (MOVABLE+DMA32+HIGHMEM)
  *       0xf    => BAD (MOVABLE+DMA32+HIGHMEM+DMA)
- *
- * GFP_ZONES_SHIFT must be <= 2 on 32 bit platforms.
  */
 
 #if defined(CONFIG_ZONE_DEVICE) && (MAX_NR_ZONES-1) <= 4
@@ -351,19 +355,21 @@ static inline bool gfpflags_allow_blocking(const gfp_t gfp_flags)
 #define GFP_ZONES_SHIFT ZONES_SHIFT
 #endif
 
-#if 16 * GFP_ZONES_SHIFT > BITS_PER_LONG
-#error GFP_ZONES_SHIFT too large to create GFP_ZONE_TABLE integer
+#if !defined(CONFIG_64BITS) && GFP_ZONES_SHIFT > 2
+typedef unsigned long long GFP_ZONE_TABLE_TYPE;
+#else
+typedef unsigned long GFP_ZONE_TABLE_TYPE;
 #endif
 
 #define GFP_ZONE_TABLE ( \
-	(ZONE_NORMAL << 0 * GFP_ZONES_SHIFT)				       \
-	| (OPT_ZONE_DMA << ___GFP_DMA * GFP_ZONES_SHIFT)		       \
-	| (OPT_ZONE_HIGHMEM << ___GFP_HIGHMEM * GFP_ZONES_SHIFT)	       \
-	| (OPT_ZONE_DMA32 << ___GFP_DMA32 * GFP_ZONES_SHIFT)		       \
-	| (ZONE_NORMAL << ___GFP_MOVABLE * GFP_ZONES_SHIFT)		       \
-	| (OPT_ZONE_DMA << (___GFP_MOVABLE | ___GFP_DMA) * GFP_ZONES_SHIFT)    \
-	| (ZONE_MOVABLE << (___GFP_MOVABLE | ___GFP_HIGHMEM) * GFP_ZONES_SHIFT)\
-	| (OPT_ZONE_DMA32 << (___GFP_MOVABLE | ___GFP_DMA32) * GFP_ZONES_SHIFT)\
+	((GFP_ZONE_TABLE_TYPE) ZONE_NORMAL << 0 * GFP_ZONES_SHIFT)					\
+	| ((GFP_ZONE_TABLE_TYPE) OPT_ZONE_DMA << ___GFP_DMA * GFP_ZONES_SHIFT)				\
+	| ((GFP_ZONE_TABLE_TYPE) OPT_ZONE_HIGHMEM << ___GFP_HIGHMEM * GFP_ZONES_SHIFT)			\
+	| ((GFP_ZONE_TABLE_TYPE) OPT_ZONE_DMA32 << ___GFP_DMA32 * GFP_ZONES_SHIFT)			\
+	| ((GFP_ZONE_TABLE_TYPE) ZONE_NORMAL << ___GFP_MOVABLE * GFP_ZONES_SHIFT)			\
+	| ((GFP_ZONE_TABLE_TYPE) OPT_ZONE_DMA << (___GFP_MOVABLE | ___GFP_DMA) * GFP_ZONES_SHIFT)	\
+	| ((GFP_ZONE_TABLE_TYPE) OPT_ZONE_CMA << (___GFP_MOVABLE | ___GFP_HIGHMEM) * GFP_ZONES_SHIFT)	\
+	| ((GFP_ZONE_TABLE_TYPE) OPT_ZONE_DMA32 << (___GFP_MOVABLE | ___GFP_DMA32) * GFP_ZONES_SHIFT)	\
 )
 
 /*
diff --git a/include/linux/mempolicy.h b/include/linux/mempolicy.h
index 5f4d828..a43adb5 100644
--- a/include/linux/mempolicy.h
+++ b/include/linux/mempolicy.h
@@ -158,7 +158,7 @@ extern enum zone_type policy_zone;
 
 static inline void check_highest_zone(enum zone_type k)
 {
-	if (k > policy_zone && k != ZONE_MOVABLE)
+	if (k > policy_zone && k != ZONE_MOVABLE && !is_zone_cma_idx(k))
 		policy_zone = k;
 }
 
diff --git a/include/linux/mmzone.h b/include/linux/mmzone.h
index 96194bf..74eda07 100644
--- a/include/linux/mmzone.h
+++ b/include/linux/mmzone.h
@@ -338,6 +338,9 @@ enum zone_type {
 	ZONE_HIGHMEM,
 #endif
 	ZONE_MOVABLE,
+#ifdef CONFIG_CMA
+	ZONE_CMA,
+#endif
 #ifdef CONFIG_ZONE_DEVICE
 	ZONE_DEVICE,
 #endif
@@ -838,11 +841,37 @@ static inline int zone_movable_is_highmem(void)
 }
 #endif
 
+static inline int is_zone_cma_idx(enum zone_type idx)
+{
+#ifdef CONFIG_CMA
+	return idx == ZONE_CMA;
+#else
+	return 0;
+#endif
+}
+
+static inline int is_zone_cma(struct zone *zone)
+{
+	int zone_idx = zone_idx(zone);
+
+	return is_zone_cma_idx(zone_idx);
+}
+
+static inline int zone_cma_is_highmem(void)
+{
+#ifdef CONFIG_HIGHMEM
+	return 1;
+#else
+	return 0;
+#endif
+}
+
 static inline int is_highmem_idx(enum zone_type idx)
 {
 #ifdef CONFIG_HIGHMEM
 	return (idx == ZONE_HIGHMEM ||
-		(idx == ZONE_MOVABLE && zone_movable_is_highmem()));
+		(idx == ZONE_MOVABLE && zone_movable_is_highmem()) ||
+		(is_zone_cma_idx(idx) && zone_cma_is_highmem()));
 #else
 	return 0;
 #endif
diff --git a/include/linux/vm_event_item.h b/include/linux/vm_event_item.h
index d84ae90..228f3df 100644
--- a/include/linux/vm_event_item.h
+++ b/include/linux/vm_event_item.h
@@ -19,7 +19,15 @@
 #define HIGHMEM_ZONE(xx)
 #endif
 
-#define FOR_ALL_ZONES(xx) DMA_ZONE(xx) DMA32_ZONE(xx) xx##_NORMAL, HIGHMEM_ZONE(xx) xx##_MOVABLE
+#ifdef CONFIG_CMA
+#define MOVABLE_ZONE(xx) xx##_MOVABLE,
+#define CMA_ZONE(xx) xx##_CMA
+#else
+#define MOVABLE_ZONE(xx) xx##_MOVABLE
+#define CMA_ZONE(xx)
+#endif
+
+#define FOR_ALL_ZONES(xx) DMA_ZONE(xx) DMA32_ZONE(xx) xx##_NORMAL, HIGHMEM_ZONE(xx) MOVABLE_ZONE(xx) CMA_ZONE(xx)
 
 enum vm_event_item { PGPGIN, PGPGOUT, PSWPIN, PSWPOUT,
 		FOR_ALL_ZONES(PGALLOC),
diff --git a/include/trace/events/mmflags.h b/include/trace/events/mmflags.h
index 304ff94..7c32ba6 100644
--- a/include/trace/events/mmflags.h
+++ b/include/trace/events/mmflags.h
@@ -232,12 +232,20 @@ IF_HAVE_VM_SOFTDIRTY(VM_SOFTDIRTY,	"softdirty"	)		\
 #define IFDEF_ZONE_HIGHMEM(X)
 #endif
 
+#ifdef CONFIG_CMA
+#define IFDEF_ZONE_CMA(X, Y, Z) X Z
+#else
+#define IFDEF_ZONE_CMA(X, Y, Z) Y
+#endif
+
 #define ZONE_TYPE						\
 	IFDEF_ZONE_DMA(		EM (ZONE_DMA,	 "DMA"))	\
 	IFDEF_ZONE_DMA32(	EM (ZONE_DMA32,	 "DMA32"))	\
 				EM (ZONE_NORMAL, "Normal")	\
 	IFDEF_ZONE_HIGHMEM(	EM (ZONE_HIGHMEM,"HighMem"))	\
-				EMe(ZONE_MOVABLE,"Movable")
+	IFDEF_ZONE_CMA(		EM (ZONE_MOVABLE,"Movable"),	\
+				EMe(ZONE_MOVABLE,"Movable"),	\
+				EMe(ZONE_CMA,	 "CMA"))
 
 #define LRU_NAMES		\
 		EM (LRU_INACTIVE_ANON, "inactive_anon") \
diff --git a/kernel/power/snapshot.c b/kernel/power/snapshot.c
index 3b1e0f3..6971c23 100644
--- a/kernel/power/snapshot.c
+++ b/kernel/power/snapshot.c
@@ -1170,6 +1170,14 @@ unsigned int snapshot_additional_pages(struct zone *zone)
 {
 	unsigned int rtree, nodes;
 
+	/*
+	 * Estimation of needed pages for ZONE_CMA is already considered
+	 * when calculating other zones since span of ZONE_CMA is subset
+	 * of other zones.
+	 */
+	if (is_zone_cma(zone))
+		return 0;
+
 	rtree = nodes = DIV_ROUND_UP(zone->spanned_pages, BM_BITS_PER_BLOCK);
 	rtree += DIV_ROUND_UP(rtree * sizeof(struct rtree_node),
 			      LINKED_PAGE_DATA_SIZE);
diff --git a/mm/page_alloc.c b/mm/page_alloc.c
index 60ffa4e..26d86c3b 100644
--- a/mm/page_alloc.c
+++ b/mm/page_alloc.c
@@ -215,6 +215,9 @@ int sysctl_lowmem_reserve_ratio[MAX_NR_ZONES] = {
 	[ZONE_HIGHMEM] = INT_MAX,
 #endif
 	[ZONE_MOVABLE] = INT_MAX,
+#ifdef CONFIG_CMA
+	[ZONE_CMA] = INT_MAX,
+#endif
 };
 
 EXPORT_SYMBOL(totalram_pages);
@@ -231,6 +234,9 @@ static char * const zone_names[MAX_NR_ZONES] = {
 	 "HighMem",
 #endif
 	 "Movable",
+#ifdef CONFIG_CMA
+	 "CMA",
+#endif
 #ifdef CONFIG_ZONE_DEVICE
 	 "Device",
 #endif
@@ -5253,6 +5259,15 @@ void __meminit memmap_init_zone(unsigned long size, int nid, unsigned long zone,
 	struct memblock_region *r = NULL, *tmp;
 #endif
 
+	/*
+	 * Physical pages for ZONE_CMA are belong to other zones now. They
+	 * are initialized when corresponding zone is initialized and they
+	 * will be moved to ZONE_CMA later. Zone information will also be
+	 * adjusted later.
+	 */
+	if (is_zone_cma_idx(zone))
+		return;
+
 	if (highest_memmap_pfn < end_pfn - 1)
 		highest_memmap_pfn = end_pfn - 1;
 
@@ -5649,7 +5664,7 @@ static void __init find_usable_zone_for_movable(void)
 {
 	int zone_index;
 	for (zone_index = MAX_NR_ZONES - 1; zone_index >= 0; zone_index--) {
-		if (zone_index == ZONE_MOVABLE)
+		if (zone_index == ZONE_MOVABLE || is_zone_cma_idx(zone_index))
 			continue;
 
 		if (arch_zone_highest_possible_pfn[zone_index] >
@@ -5864,6 +5879,8 @@ static void __meminit calculate_node_totalpages(struct pglist_data *pgdat,
 						unsigned long *zholes_size)
 {
 	unsigned long realtotalpages = 0, totalpages = 0;
+	unsigned long zone_cma_start_pfn = UINT_MAX;
+	unsigned long zone_cma_end_pfn = 0;
 	enum zone_type i;
 
 	for (i = 0; i < MAX_NR_ZONES; i++) {
@@ -5871,6 +5888,13 @@ static void __meminit calculate_node_totalpages(struct pglist_data *pgdat,
 		unsigned long zone_start_pfn, zone_end_pfn;
 		unsigned long size, real_size;
 
+		if (is_zone_cma_idx(i)) {
+			zone->zone_start_pfn = zone_cma_start_pfn;
+			size = zone_cma_end_pfn - zone_cma_start_pfn;
+			real_size = 0;
+			goto init_zone;
+		}
+
 		size = zone_spanned_pages_in_node(pgdat->node_id, i,
 						  node_start_pfn,
 						  node_end_pfn,
@@ -5880,13 +5904,23 @@ static void __meminit calculate_node_totalpages(struct pglist_data *pgdat,
 		real_size = size - zone_absent_pages_in_node(pgdat->node_id, i,
 						  node_start_pfn, node_end_pfn,
 						  zholes_size);
-		if (size)
+		if (size) {
 			zone->zone_start_pfn = zone_start_pfn;
-		else
+			if (zone_cma_start_pfn > zone_start_pfn)
+				zone_cma_start_pfn = zone_start_pfn;
+			if (zone_cma_end_pfn < zone_start_pfn + size)
+				zone_cma_end_pfn = zone_start_pfn + size;
+		} else
 			zone->zone_start_pfn = 0;
+
+init_zone:
 		zone->spanned_pages = size;
 		zone->present_pages = real_size;
 
+		/* Prevent to over-count node span */
+		if (is_zone_cma_idx(i))
+			size = 0;
+
 		totalpages += size;
 		realtotalpages += real_size;
 	}
@@ -6030,6 +6064,7 @@ static void __paginginit free_area_init_core(struct pglist_data *pgdat)
 		struct zone *zone = pgdat->node_zones + j;
 		unsigned long size, realsize, freesize, memmap_pages;
 		unsigned long zone_start_pfn = zone->zone_start_pfn;
+		bool zone_kernel = !is_highmem_idx(j) && !is_zone_cma_idx(j);
 
 		size = zone->spanned_pages;
 		realsize = freesize = zone->present_pages;
@@ -6040,7 +6075,7 @@ static void __paginginit free_area_init_core(struct pglist_data *pgdat)
 		 * and per-cpu initialisations
 		 */
 		memmap_pages = calc_memmap_size(size, realsize);
-		if (!is_highmem_idx(j)) {
+		if (zone_kernel) {
 			if (freesize >= memmap_pages) {
 				freesize -= memmap_pages;
 				if (memmap_pages)
@@ -6059,7 +6094,7 @@ static void __paginginit free_area_init_core(struct pglist_data *pgdat)
 					zone_names[0], dma_reserve);
 		}
 
-		if (!is_highmem_idx(j))
+		if (zone_kernel)
 			nr_kernel_pages += freesize;
 		/* Charge for highmem memmap if there are enough kernel pages */
 		else if (nr_kernel_pages > memmap_pages * 2)
@@ -6071,7 +6106,7 @@ static void __paginginit free_area_init_core(struct pglist_data *pgdat)
 		 * when the bootmem allocator frees pages into the buddy system.
 		 * And all highmem pages will be managed by the buddy system.
 		 */
-		zone->managed_pages = is_highmem_idx(j) ? realsize : freesize;
+		zone->managed_pages = zone_kernel ? freesize : realsize;
 #ifdef CONFIG_NUMA
 		zone->node = nid;
 #endif
@@ -6081,7 +6116,11 @@ static void __paginginit free_area_init_core(struct pglist_data *pgdat)
 		zone_seqlock_init(zone);
 		zone_pcp_init(zone);
 
-		if (!size)
+		/*
+		 * ZONE_CMA should be initialized even if it has no present
+		 * page now since pages will be moved to the zone later.
+		 */
+		if (!size && !is_zone_cma_idx(j))
 			continue;
 
 		set_pageblock_order();
@@ -6537,7 +6576,7 @@ void __init free_area_init_nodes(unsigned long *max_zone_pfn)
 	start_pfn = find_min_pfn_with_active_regions();
 
 	for (i = 0; i < MAX_NR_ZONES; i++) {
-		if (i == ZONE_MOVABLE)
+		if (i == ZONE_MOVABLE || is_zone_cma_idx(i))
 			continue;
 
 		end_pfn = max(max_zone_pfn[i], start_pfn);
@@ -6554,7 +6593,7 @@ void __init free_area_init_nodes(unsigned long *max_zone_pfn)
 	/* Print out the zone ranges */
 	pr_info("Zone ranges:\n");
 	for (i = 0; i < MAX_NR_ZONES; i++) {
-		if (i == ZONE_MOVABLE)
+		if (i == ZONE_MOVABLE || is_zone_cma_idx(i))
 			continue;
 		pr_info("  %-8s ", zone_names[i]);
 		if (arch_zone_lowest_possible_pfn[i] ==
@@ -7318,9 +7357,9 @@ bool has_unmovable_pages(struct zone *zone, struct page *page, int count,
 
 	/*
 	 * For avoiding noise data, lru_add_drain_all() should be called
-	 * If ZONE_MOVABLE, the zone never contains unmovable pages
+	 * If ZONE_(MOVABLE|CMA), the zone never contains unmovable pages
 	 */
-	if (zone_idx(zone) == ZONE_MOVABLE)
+	if (zone_idx(zone) == ZONE_MOVABLE || is_zone_cma(zone))
 		return false;
 	mt = get_pageblock_migratetype(page);
 	if (mt == MIGRATE_MOVABLE || is_migrate_cma(mt))
diff --git a/mm/vmstat.c b/mm/vmstat.c
index 757be83..3c3aac2 100644
--- a/mm/vmstat.c
+++ b/mm/vmstat.c
@@ -915,8 +915,15 @@ int fragmentation_index(struct zone *zone, unsigned int order)
 #define TEXT_FOR_HIGHMEM(xx)
 #endif
 
+#ifdef CONFIG_CMA
+#define TEXT_FOR_CMA(xx) xx "_cma",
+#else
+#define TEXT_FOR_CMA(xx)
+#endif
+
 #define TEXTS_FOR_ZONES(xx) TEXT_FOR_DMA(xx) TEXT_FOR_DMA32(xx) xx "_normal", \
-					TEXT_FOR_HIGHMEM(xx) xx "_movable",
+					TEXT_FOR_HIGHMEM(xx) xx "_movable", \
+					TEXT_FOR_CMA(xx)
 
 const char * const vmstat_text[] = {
 	/* enum zone_stat_item countes */
-- 
2.7.4

* [PATCH v7 3/7] mm/cma: populate ZONE_CMA
  2017-04-11  3:17 [PATCH v7 0/7] Introduce ZONE_CMA js1304
  2017-04-11  3:17 ` [PATCH v7 1/7] mm/page_alloc: don't reserve ZONE_HIGHMEM for ZONE_MOVABLE request js1304
  2017-04-11  3:17 ` [PATCH v7 2/7] mm/cma: introduce new zone, ZONE_CMA js1304
@ 2017-04-11  3:17 ` js1304
  2017-04-11  3:17 ` [PATCH v7 4/7] mm/cma: remove ALLOC_CMA js1304
                   ` (6 subsequent siblings)
  9 siblings, 0 replies; 39+ messages in thread
From: js1304 @ 2017-04-11  3:17 UTC (permalink / raw)
  To: Andrew Morton
  Cc: Rik van Riel, Johannes Weiner, mgorman, Laura Abbott,
	Minchan Kim, Marek Szyprowski, Michal Nazarewicz,
	Aneesh Kumar K . V, Vlastimil Babka, Russell King, Will Deacon,
	linux-mm, linux-kernel, kernel-team, Joonsoo Kim

From: Joonsoo Kim <iamjoonsoo.kim@lge.com>

Until now, reserved pages for CMA have been managed in the ordinary
zones that their pfns belong to. This approach has numerous problems
and fixing them isn't easy (as mentioned in the previous patch). To fix
this situation, ZONE_CMA was introduced in the previous patch, but it
is not yet populated. This patch implements population of ZONE_CMA by
stealing the reserved pages from the ordinary zones.

Unlike the previous implementation, where any kernel allocation request
with __GFP_MOVABLE could be serviced from the CMA region, in the new
approach only GFP_HIGHUSER_MOVABLE requests can be serviced from the
CMA region. This is an inevitable design decision of the zone
implementation, because ZONE_CMA could contain highmem. As a result,
ZONE_CMA works like ZONE_HIGHMEM or ZONE_MOVABLE.

I don't think this is a problem, because most file cache pages and
anonymous pages are requested with GFP_HIGHUSER_MOVABLE; the fact that
many systems with ZONE_HIGHMEM work fine supports this. The notable
disadvantage is that these pages cannot be used for blockdev file cache
pages, which usually carry __GFP_MOVABLE but not __GFP_HIGHMEM or
__GFP_USER. But there are pros and cons here: in my experience,
blockdev file cache pages are one of the top reasons cma_alloc() fails
temporarily, so by excluding that case we get a stronger guarantee of
cma_alloc() success.
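
To make the mapping concrete, here is how gfp_zone() resolves after the
GFP_ZONE_TABLE change in the previous patch (illustrative snippet;
CONFIG_CMA=y is assumed so that OPT_ZONE_CMA is ZONE_CMA):

    enum zone_type a = gfp_zone(GFP_HIGHUSER_MOVABLE);       /* highmem + movable -> ZONE_CMA      */
    enum zone_type b = gfp_zone(GFP_KERNEL | __GFP_MOVABLE); /* movable, no highmem -> ZONE_NORMAL */
    enum zone_type c = gfp_zone(GFP_KERNEL);                 /* -> ZONE_NORMAL                     */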

The implementation itself is easy to understand: steal the pages when
each CMA area is initialized and recalculate the various per-zone stats
and thresholds.

Reviewed-by: Aneesh Kumar K.V <aneesh.kumar@linux.vnet.ibm.com>
Signed-off-by: Joonsoo Kim <iamjoonsoo.kim@lge.com>
---
 include/linux/memory_hotplug.h |  3 ---
 include/linux/mm.h             |  1 +
 mm/cma.c                       | 60 ++++++++++++++++++++++++++++++++++++++----
 mm/internal.h                  |  3 +++
 mm/page_alloc.c                | 29 +++++++++++++++++---
 5 files changed, 84 insertions(+), 12 deletions(-)

diff --git a/include/linux/memory_hotplug.h b/include/linux/memory_hotplug.h
index e60f203..d730ce9 100644
--- a/include/linux/memory_hotplug.h
+++ b/include/linux/memory_hotplug.h
@@ -229,9 +229,6 @@ void put_online_mems(void);
 void mem_hotplug_begin(void);
 void mem_hotplug_done(void);
 
-extern void set_zone_contiguous(struct zone *zone);
-extern void clear_zone_contiguous(struct zone *zone);
-
 #else /* ! CONFIG_MEMORY_HOTPLUG */
 /*
  * Stub functions for when hotplug is off
diff --git a/include/linux/mm.h b/include/linux/mm.h
index 022423c..1390abe 100644
--- a/include/linux/mm.h
+++ b/include/linux/mm.h
@@ -2031,6 +2031,7 @@ extern void setup_per_cpu_pageset(void);
 
 extern void zone_pcp_update(struct zone *zone);
 extern void zone_pcp_reset(struct zone *zone);
+extern void setup_zone_pageset(struct zone *zone);
 
 /* page_alloc.c */
 extern int min_free_kbytes;
diff --git a/mm/cma.c b/mm/cma.c
index a6033e3..6d8bd300 100644
--- a/mm/cma.c
+++ b/mm/cma.c
@@ -38,6 +38,7 @@
 #include <trace/events/cma.h>
 
 #include "cma.h"
+#include "internal.h"
 
 struct cma cma_areas[MAX_CMA_AREAS];
 unsigned cma_area_count;
@@ -116,10 +117,9 @@ static int __init cma_activate_area(struct cma *cma)
 		for (j = pageblock_nr_pages; j; --j, pfn++) {
 			WARN_ON_ONCE(!pfn_valid(pfn));
 			/*
-			 * alloc_contig_range requires the pfn range
-			 * specified to be in the same zone. Make this
-			 * simple by forcing the entire CMA resv range
-			 * to be in the same zone.
+			 * In init_cma_reserved_pageblock(), present_pages is
+			 * adjusted with assumption that all pages come from
+			 * a single zone. It could be fixed but not yet done.
 			 */
 			if (page_zone(pfn_to_page(pfn)) != zone)
 				goto err;
@@ -145,6 +145,34 @@ static int __init cma_activate_area(struct cma *cma)
 static int __init cma_init_reserved_areas(void)
 {
 	int i;
+	struct zone *zone;
+	pg_data_t *pgdat;
+
+	if (!cma_area_count)
+		return 0;
+
+	for_each_online_pgdat(pgdat) {
+		unsigned long start_pfn = UINT_MAX, end_pfn = 0;
+
+		for (i = 0; i < cma_area_count; i++) {
+			if (pfn_to_nid(cma_areas[i].base_pfn) !=
+				pgdat->node_id)
+				continue;
+
+			start_pfn = min(start_pfn, cma_areas[i].base_pfn);
+			end_pfn = max(end_pfn, cma_areas[i].base_pfn +
+						cma_areas[i].count);
+		}
+
+		if (!end_pfn)
+			continue;
+
+		zone = &pgdat->node_zones[ZONE_CMA];
+
+		/* ZONE_CMA doesn't need to exceed CMA region */
+		zone->zone_start_pfn = start_pfn;
+		zone->spanned_pages = end_pfn - start_pfn;
+	}
 
 	for (i = 0; i < cma_area_count; i++) {
 		int ret = cma_activate_area(&cma_areas[i]);
@@ -153,9 +181,31 @@ static int __init cma_init_reserved_areas(void)
 			return ret;
 	}
 
+	/*
+	 * Reserved pages for ZONE_CMA are now activated and this would change
+	 * ZONE_CMA's managed page counter and other zone's present counter.
+	 * We need to re-calculate various zone information that depends on
+	 * this initialization.
+	 */
+	build_all_zonelists(NULL, NULL);
+	for_each_populated_zone(zone) {
+		if (is_zone_cma(zone))
+			setup_zone_pageset(zone);
+		else
+			zone_pcp_update(zone);
+
+		set_zone_contiguous(zone);
+	}
+
+	/*
+	 * We need to re-init per zone wmark by calling
+	 * init_per_zone_wmark_min() but doesn't call here because it is
+	 * registered on core_initcall and it will be called later than us.
+	 */
+
 	return 0;
 }
-core_initcall(cma_init_reserved_areas);
+pure_initcall(cma_init_reserved_areas);
 
 /**
  * cma_init_reserved_mem() - create custom contiguous area from reserved memory
diff --git a/mm/internal.h b/mm/internal.h
index 0e4f558..ecc69a4 100644
--- a/mm/internal.h
+++ b/mm/internal.h
@@ -168,6 +168,9 @@ extern void post_alloc_hook(struct page *page, unsigned int order,
 					gfp_t gfp_flags);
 extern int user_min_free_kbytes;
 
+extern void set_zone_contiguous(struct zone *zone);
+extern void clear_zone_contiguous(struct zone *zone);
+
 #if defined CONFIG_COMPACTION || defined CONFIG_CMA
 
 /*
diff --git a/mm/page_alloc.c b/mm/page_alloc.c
index 26d86c3b..760f518 100644
--- a/mm/page_alloc.c
+++ b/mm/page_alloc.c
@@ -1583,16 +1583,38 @@ void __init page_alloc_init_late(void)
 }
 
 #ifdef CONFIG_CMA
+static void __init adjust_present_page_count(struct page *page, long count)
+{
+	struct zone *zone = page_zone(page);
+
+	/* We don't need to hold a lock since it is boot-up process */
+	zone->present_pages += count;
+}
+
 /* Free whole pageblock and set its migration type to MIGRATE_CMA. */
 void __init init_cma_reserved_pageblock(struct page *page)
 {
 	unsigned i = pageblock_nr_pages;
+	unsigned long pfn = page_to_pfn(page);
 	struct page *p = page;
+	int nid = page_to_nid(page);
+
+	/*
+	 * ZONE_CMA will steal present pages from other zones by changing
+	 * page links so page_zone() is changed. Before that,
+	 * we need to adjust previous zone's page count first.
+	 */
+	adjust_present_page_count(page, -pageblock_nr_pages);
 
 	do {
 		__ClearPageReserved(p);
 		set_page_count(p, 0);
-	} while (++p, --i);
+
+		/* Steal pages from other zones */
+		set_page_links(p, ZONE_CMA, nid, pfn);
+	} while (++p, ++pfn, --i);
+
+	adjust_present_page_count(page, pageblock_nr_pages);
 
 	set_pageblock_migratetype(page, MIGRATE_CMA);
 
@@ -5124,7 +5146,6 @@ static void build_zonelists(pg_data_t *pgdat)
  */
 static void setup_pageset(struct per_cpu_pageset *p, unsigned long batch);
 static DEFINE_PER_CPU(struct per_cpu_pageset, boot_pageset);
-static void setup_zone_pageset(struct zone *zone);
 
 /*
  * Global mutex to protect against size modification of zonelists
@@ -5497,7 +5518,7 @@ static void __meminit zone_pageset_init(struct zone *zone, int cpu)
 	pageset_set_high_and_batch(zone, pcp);
 }
 
-static void __meminit setup_zone_pageset(struct zone *zone)
+void __meminit setup_zone_pageset(struct zone *zone)
 {
 	int cpu;
 	zone->pageset = alloc_percpu(struct per_cpu_pageset);
@@ -7669,7 +7690,7 @@ void free_contig_range(unsigned long pfn, unsigned nr_pages)
 }
 #endif
 
-#ifdef CONFIG_MEMORY_HOTPLUG
+#if defined CONFIG_MEMORY_HOTPLUG || defined CONFIG_CMA
 /*
  * The zone indicated has a new number of managed_pages; batch sizes and percpu
  * page high values need to be recalulated.
-- 
2.7.4

* [PATCH v7 4/7] mm/cma: remove ALLOC_CMA
  2017-04-11  3:17 [PATCH v7 0/7] Introduce ZONE_CMA js1304
                   ` (2 preceding siblings ...)
  2017-04-11  3:17 ` [PATCH v7 3/7] mm/cma: populate ZONE_CMA js1304
@ 2017-04-11  3:17 ` js1304
  2017-04-11  3:17 ` [PATCH v7 5/7] mm/cma: remove MIGRATE_CMA js1304
                   ` (5 subsequent siblings)
  9 siblings, 0 replies; 39+ messages in thread
From: js1304 @ 2017-04-11  3:17 UTC (permalink / raw)
  To: Andrew Morton
  Cc: Rik van Riel, Johannes Weiner, mgorman, Laura Abbott,
	Minchan Kim, Marek Szyprowski, Michal Nazarewicz,
	Aneesh Kumar K . V, Vlastimil Babka, Russell King, Will Deacon,
	linux-mm, linux-kernel, kernel-team, Joonsoo Kim

From: Joonsoo Kim <iamjoonsoo.kim@lge.com>

Now all reserved pages for the CMA region belong to ZONE_CMA, which
only serves GFP_HIGHUSER_MOVABLE requests. Therefore, we don't need to
consider ALLOC_CMA at all.

Reviewed-by: Aneesh Kumar K.V <aneesh.kumar@linux.vnet.ibm.com>
Acked-by: Vlastimil Babka <vbabka@suse.cz>
Signed-off-by: Joonsoo Kim <iamjoonsoo.kim@lge.com>
---
 mm/compaction.c |  4 +---
 mm/internal.h   |  1 -
 mm/page_alloc.c | 28 +++-------------------------
 3 files changed, 4 insertions(+), 29 deletions(-)

diff --git a/mm/compaction.c b/mm/compaction.c
index 613c59e..80b1424 100644
--- a/mm/compaction.c
+++ b/mm/compaction.c
@@ -1427,14 +1427,12 @@ static enum compact_result __compaction_suitable(struct zone *zone, int order,
 	 * if compaction succeeds.
 	 * For costly orders, we require low watermark instead of min for
 	 * compaction to proceed to increase its chances.
-	 * ALLOC_CMA is used, as pages in CMA pageblocks are considered
-	 * suitable migration targets
 	 */
 	watermark = (order > PAGE_ALLOC_COSTLY_ORDER) ?
 				low_wmark_pages(zone) : min_wmark_pages(zone);
 	watermark += compact_gap(order);
 	if (!__zone_watermark_ok(zone, 0, watermark, classzone_idx,
-						ALLOC_CMA, wmark_target))
+						0, wmark_target))
 		return COMPACT_SKIPPED;
 
 	return COMPACT_CONTINUE;
diff --git a/mm/internal.h b/mm/internal.h
index ecc69a4..08b19b7 100644
--- a/mm/internal.h
+++ b/mm/internal.h
@@ -486,7 +486,6 @@ unsigned long reclaim_clean_pages_from_list(struct zone *zone,
 #define ALLOC_HARDER		0x10 /* try to alloc harder */
 #define ALLOC_HIGH		0x20 /* __GFP_HIGH set */
 #define ALLOC_CPUSET		0x40 /* check for correct cpuset */
-#define ALLOC_CMA		0x80 /* allow allocations from CMA areas */
 
 enum ttu_flags;
 struct tlbflush_unmap_batch;
diff --git a/mm/page_alloc.c b/mm/page_alloc.c
index 760f518..18f16bf 100644
--- a/mm/page_alloc.c
+++ b/mm/page_alloc.c
@@ -2664,7 +2664,7 @@ int __isolate_free_page(struct page *page, unsigned int order)
 		 * exists.
 		 */
 		watermark = min_wmark_pages(zone) + (1UL << order);
-		if (!zone_watermark_ok(zone, 0, watermark, 0, ALLOC_CMA))
+		if (!zone_watermark_ok(zone, 0, watermark, 0, 0))
 			return 0;
 
 		__mod_zone_freepage_state(zone, -(1UL << order), mt);
@@ -2931,12 +2931,6 @@ bool __zone_watermark_ok(struct zone *z, unsigned int order, unsigned long mark,
 	else
 		min -= min / 4;
 
-#ifdef CONFIG_CMA
-	/* If allocation can't use CMA areas don't use free CMA pages */
-	if (!(alloc_flags & ALLOC_CMA))
-		free_pages -= zone_page_state(z, NR_FREE_CMA_PAGES);
-#endif
-
 	/*
 	 * Check watermarks for an order-0 allocation request. If these
 	 * are not met, then a high-order request also cannot go ahead
@@ -2966,10 +2960,8 @@ bool __zone_watermark_ok(struct zone *z, unsigned int order, unsigned long mark,
 		}
 
 #ifdef CONFIG_CMA
-		if ((alloc_flags & ALLOC_CMA) &&
-		    !list_empty(&area->free_list[MIGRATE_CMA])) {
+		if (!list_empty(&area->free_list[MIGRATE_CMA]))
 			return true;
-		}
 #endif
 	}
 	return false;
@@ -2986,13 +2978,6 @@ static inline bool zone_watermark_fast(struct zone *z, unsigned int order,
 		unsigned long mark, int classzone_idx, unsigned int alloc_flags)
 {
 	long free_pages = zone_page_state(z, NR_FREE_PAGES);
-	long cma_pages = 0;
-
-#ifdef CONFIG_CMA
-	/* If allocation can't use CMA areas don't use free CMA pages */
-	if (!(alloc_flags & ALLOC_CMA))
-		cma_pages = zone_page_state(z, NR_FREE_CMA_PAGES);
-#endif
 
 	/*
 	 * Fast check for order-0 only. If this fails then the reserves
@@ -3001,7 +2986,7 @@ static inline bool zone_watermark_fast(struct zone *z, unsigned int order,
 	 * the caller is !atomic then it'll uselessly search the free
 	 * list. That corner case is then slower but it is harmless.
 	 */
-	if (!order && (free_pages - cma_pages) > mark + z->lowmem_reserve[classzone_idx])
+	if (!order && free_pages > mark + z->lowmem_reserve[classzone_idx])
 		return true;
 
 	return __zone_watermark_ok(z, order, mark, classzone_idx, alloc_flags,
@@ -3572,10 +3557,6 @@ gfp_to_alloc_flags(gfp_t gfp_mask)
 	} else if (unlikely(rt_task(current)) && !in_interrupt())
 		alloc_flags |= ALLOC_HARDER;
 
-#ifdef CONFIG_CMA
-	if (gfpflags_to_migratetype(gfp_mask) == MIGRATE_MOVABLE)
-		alloc_flags |= ALLOC_CMA;
-#endif
 	return alloc_flags;
 }
 
@@ -3997,9 +3978,6 @@ static inline bool prepare_alloc_pages(gfp_t gfp_mask, unsigned int order,
 	if (should_fail_alloc_page(gfp_mask, order))
 		return false;
 
-	if (IS_ENABLED(CONFIG_CMA) && ac->migratetype == MIGRATE_MOVABLE)
-		*alloc_flags |= ALLOC_CMA;
-
 	return true;
 }
 
-- 
2.7.4

* [PATCH v7 5/7] mm/cma: remove MIGRATE_CMA
  2017-04-11  3:17 [PATCH v7 0/7] Introduce ZONE_CMA js1304
                   ` (3 preceding siblings ...)
  2017-04-11  3:17 ` [PATCH v7 4/7] mm/cma: remove ALLOC_CMA js1304
@ 2017-04-11  3:17 ` js1304
  2017-04-11  3:17 ` [PATCH v7 6/7] mm/cma: remove per zone CMA stat js1304
                   ` (4 subsequent siblings)
  9 siblings, 0 replies; 39+ messages in thread
From: js1304 @ 2017-04-11  3:17 UTC (permalink / raw)
  To: Andrew Morton
  Cc: Rik van Riel, Johannes Weiner, mgorman, Laura Abbott,
	Minchan Kim, Marek Szyprowski, Michal Nazarewicz,
	Aneesh Kumar K . V, Vlastimil Babka, Russell King, Will Deacon,
	linux-mm, linux-kernel, kernel-team, Joonsoo Kim

From: Joonsoo Kim <iamjoonsoo.kim@lge.com>

Now all reserved pages for the CMA region belong to ZONE_CMA and there
are no other types of pages there. Therefore, we no longer need
MIGRATE_CMA to distinguish CMA pages from ordinary pages and handle
them differently. Remove MIGRATE_CMA.

Unfortunately, this patch makes the free CMA counter incorrect, because
it is accounted while pages sit on MIGRATE_CMA pageblocks. That will be
fixed by the next patch. I could squash the next patch into this one,
but that would make the changes complicated and hard to review, so I
keep them separate.

Reviewed-by: Aneesh Kumar K.V <aneesh.kumar@linux.vnet.ibm.com>
Acked-by: Vlastimil Babka <vbabka@suse.cz>
Signed-off-by: Joonsoo Kim <iamjoonsoo.kim@lge.com>
---
 arch/powerpc/mm/mmu_context_iommu.c |  2 +-
 include/linux/gfp.h                 |  2 +-
 include/linux/mmzone.h              | 26 +----------
 include/linux/page-isolation.h      |  5 +--
 include/linux/vmstat.h              |  8 ----
 mm/cma.c                            |  3 +-
 mm/compaction.c                     |  8 +---
 mm/hugetlb.c                        |  3 +-
 mm/memory_hotplug.c                 |  7 ++-
 mm/page_alloc.c                     | 86 ++++++++++---------------------------
 mm/page_isolation.c                 | 15 +++----
 mm/page_owner.c                     |  6 +--
 mm/usercopy.c                       |  4 +-
 13 files changed, 43 insertions(+), 132 deletions(-)

diff --git a/arch/powerpc/mm/mmu_context_iommu.c b/arch/powerpc/mm/mmu_context_iommu.c
index fc67bd7..330c495 100644
--- a/arch/powerpc/mm/mmu_context_iommu.c
+++ b/arch/powerpc/mm/mmu_context_iommu.c
@@ -184,7 +184,7 @@ long mm_iommu_get(struct mm_struct *mm, unsigned long ua, unsigned long entries,
 		 * of the CMA zone if possible. NOTE: faulting in + migration
 		 * can be expensive. Batching can be considered later
 		 */
-		if (is_migrate_cma_page(page)) {
+		if (is_zone_cma(page_zone(page))) {
 			if (mm_iommu_move_page_from_cma(page))
 				goto populate;
 			if (1 != get_user_pages_fast(ua + (i << PAGE_SHIFT),
diff --git a/include/linux/gfp.h b/include/linux/gfp.h
index c2ed2eb..15987cc 100644
--- a/include/linux/gfp.h
+++ b/include/linux/gfp.h
@@ -563,7 +563,7 @@ static inline bool pm_suspended_storage(void)
 #if (defined(CONFIG_MEMORY_ISOLATION) && defined(CONFIG_COMPACTION)) || defined(CONFIG_CMA)
 /* The below functions must be run on a range from a single zone. */
 extern int alloc_contig_range(unsigned long start, unsigned long end,
-			      unsigned migratetype, gfp_t gfp_mask);
+				gfp_t gfp_mask);
 extern void free_contig_range(unsigned long pfn, unsigned nr_pages);
 #endif
 
diff --git a/include/linux/mmzone.h b/include/linux/mmzone.h
index 74eda07..efb69b1 100644
--- a/include/linux/mmzone.h
+++ b/include/linux/mmzone.h
@@ -41,22 +41,6 @@ enum migratetype {
 	MIGRATE_RECLAIMABLE,
 	MIGRATE_PCPTYPES,	/* the number of types on the pcp lists */
 	MIGRATE_HIGHATOMIC = MIGRATE_PCPTYPES,
-#ifdef CONFIG_CMA
-	/*
-	 * MIGRATE_CMA migration type is designed to mimic the way
-	 * ZONE_MOVABLE works.  Only movable pages can be allocated
-	 * from MIGRATE_CMA pageblocks and page allocator never
-	 * implicitly change migration type of MIGRATE_CMA pageblock.
-	 *
-	 * The way to use it is to change migratetype of a range of
-	 * pageblocks to MIGRATE_CMA which can be done by
-	 * __free_pageblock_cma() function.  What is important though
-	 * is that a range of pageblocks must be aligned to
-	 * MAX_ORDER_NR_PAGES should biggest page be bigger then
-	 * a single pageblock.
-	 */
-	MIGRATE_CMA,
-#endif
 #ifdef CONFIG_MEMORY_ISOLATION
 	MIGRATE_ISOLATE,	/* can't allocate from here */
 #endif
@@ -66,17 +50,9 @@ enum migratetype {
 /* In mm/page_alloc.c; keep in sync also with show_migration_types() there */
 extern char * const migratetype_names[MIGRATE_TYPES];
 
-#ifdef CONFIG_CMA
-#  define is_migrate_cma(migratetype) unlikely((migratetype) == MIGRATE_CMA)
-#  define is_migrate_cma_page(_page) (get_pageblock_migratetype(_page) == MIGRATE_CMA)
-#else
-#  define is_migrate_cma(migratetype) false
-#  define is_migrate_cma_page(_page) false
-#endif
-
 static inline bool is_migrate_movable(int mt)
 {
-	return is_migrate_cma(mt) || mt == MIGRATE_MOVABLE;
+	return mt == MIGRATE_MOVABLE;
 }
 
 #define for_each_migratetype_order(order, type) \
diff --git a/include/linux/page-isolation.h b/include/linux/page-isolation.h
index d4cd201..67735f2 100644
--- a/include/linux/page-isolation.h
+++ b/include/linux/page-isolation.h
@@ -46,15 +46,14 @@ int move_freepages_block(struct zone *zone, struct page *page,
  */
 int
 start_isolate_page_range(unsigned long start_pfn, unsigned long end_pfn,
-			 unsigned migratetype, bool skip_hwpoisoned_pages);
+				bool skip_hwpoisoned_pages);
 
 /*
  * Changes MIGRATE_ISOLATE to MIGRATE_MOVABLE.
  * target range is [start_pfn, end_pfn)
  */
 int
-undo_isolate_page_range(unsigned long start_pfn, unsigned long end_pfn,
-			unsigned migratetype);
+undo_isolate_page_range(unsigned long start_pfn, unsigned long end_pfn);
 
 /*
  * Test all pages in [start_pfn, end_pfn) are isolated or not.
diff --git a/include/linux/vmstat.h b/include/linux/vmstat.h
index 6137719..ac6db88 100644
--- a/include/linux/vmstat.h
+++ b/include/linux/vmstat.h
@@ -341,14 +341,6 @@ static inline void drain_zonestat(struct zone *zone,
 			struct per_cpu_pageset *pset) { }
 #endif		/* CONFIG_SMP */
 
-static inline void __mod_zone_freepage_state(struct zone *zone, int nr_pages,
-					     int migratetype)
-{
-	__mod_zone_page_state(zone, NR_FREE_PAGES, nr_pages);
-	if (is_migrate_cma(migratetype))
-		__mod_zone_page_state(zone, NR_FREE_CMA_PAGES, nr_pages);
-}
-
 extern const char * const vmstat_text[];
 
 #endif /* _LINUX_VMSTAT_H */
diff --git a/mm/cma.c b/mm/cma.c
index 6d8bd300..91dd85a 100644
--- a/mm/cma.c
+++ b/mm/cma.c
@@ -479,8 +479,7 @@ struct page *cma_alloc(struct cma *cma, size_t count, unsigned int align,
 
 		pfn = cma->base_pfn + (bitmap_no << cma->order_per_bit);
 		mutex_lock(&cma_mutex);
-		ret = alloc_contig_range(pfn, pfn + count, MIGRATE_CMA,
-					 gfp_mask);
+		ret = alloc_contig_range(pfn, pfn + count, gfp_mask);
 		mutex_unlock(&cma_mutex);
 		if (ret == 0) {
 			page = pfn_to_page(pfn);
diff --git a/mm/compaction.c b/mm/compaction.c
index 80b1424..f6ae10f 100644
--- a/mm/compaction.c
+++ b/mm/compaction.c
@@ -1017,7 +1017,7 @@ static bool suitable_migration_target(struct compact_control *cc,
 	if (cc->ignore_block_suitable)
 		return true;
 
-	/* If the block is MIGRATE_MOVABLE or MIGRATE_CMA, allow migration */
+	/* If the block is MIGRATE_MOVABLE, allow migration */
 	if (is_migrate_movable(get_pageblock_migratetype(page)))
 		return true;
 
@@ -1338,12 +1338,6 @@ static enum compact_result __compact_finished(struct zone *zone,
 		if (!list_empty(&area->free_list[migratetype]))
 			return COMPACT_SUCCESS;
 
-#ifdef CONFIG_CMA
-		/* MIGRATE_MOVABLE can fallback on MIGRATE_CMA */
-		if (migratetype == MIGRATE_MOVABLE &&
-			!list_empty(&area->free_list[MIGRATE_CMA]))
-			return COMPACT_SUCCESS;
-#endif
 		/*
 		 * Job done if allocation would steal freepages from
 		 * other migratetype buddy lists.
diff --git a/mm/hugetlb.c b/mm/hugetlb.c
index e582887..d26c837 100644
--- a/mm/hugetlb.c
+++ b/mm/hugetlb.c
@@ -1053,8 +1053,7 @@ static int __alloc_gigantic_page(unsigned long start_pfn,
 				unsigned long nr_pages)
 {
 	unsigned long end_pfn = start_pfn + nr_pages;
-	return alloc_contig_range(start_pfn, end_pfn, MIGRATE_MOVABLE,
-				  GFP_KERNEL);
+	return alloc_contig_range(start_pfn, end_pfn, GFP_KERNEL);
 }
 
 static bool pfn_range_valid_gigantic(struct zone *z,
diff --git a/mm/memory_hotplug.c b/mm/memory_hotplug.c
index 76d4745..c48c36f 100644
--- a/mm/memory_hotplug.c
+++ b/mm/memory_hotplug.c
@@ -1897,8 +1897,7 @@ static int __ref __offline_pages(unsigned long start_pfn,
 		return -EINVAL;
 
 	/* set above range as isolated */
-	ret = start_isolate_page_range(start_pfn, end_pfn,
-				       MIGRATE_MOVABLE, true);
+	ret = start_isolate_page_range(start_pfn, end_pfn, true);
 	if (ret)
 		return ret;
 
@@ -1968,7 +1967,7 @@ static int __ref __offline_pages(unsigned long start_pfn,
 	   We cannot do rollback at this point. */
 	offline_isolated_pages(start_pfn, end_pfn);
 	/* reset pagetype flags and makes migrate type to be MOVABLE */
-	undo_isolate_page_range(start_pfn, end_pfn, MIGRATE_MOVABLE);
+	undo_isolate_page_range(start_pfn, end_pfn);
 	/* removal success */
 	adjust_managed_page_count(pfn_to_page(start_pfn), -offlined_pages);
 	zone->present_pages -= offlined_pages;
@@ -2005,7 +2004,7 @@ static int __ref __offline_pages(unsigned long start_pfn,
 		 ((unsigned long long) end_pfn << PAGE_SHIFT) - 1);
 	memory_notify(MEM_CANCEL_OFFLINE, &arg);
 	/* pushback to free area */
-	undo_isolate_page_range(start_pfn, end_pfn, MIGRATE_MOVABLE);
+	undo_isolate_page_range(start_pfn, end_pfn);
 	return ret;
 }
 
diff --git a/mm/page_alloc.c b/mm/page_alloc.c
index 18f16bf..33a1b69 100644
--- a/mm/page_alloc.c
+++ b/mm/page_alloc.c
@@ -136,8 +136,8 @@ gfp_t gfp_allowed_mask __read_mostly = GFP_BOOT_MASK;
  * put on a pcplist. Used to avoid the pageblock migratetype lookup when
  * freeing from pcplists in most cases, at the cost of possibly becoming stale.
  * Also the migratetype set in the page does not necessarily match the pcplist
- * index, e.g. page might have MIGRATE_CMA set but be on a pcplist with any
- * other index - this ensures that it will be put on the correct CMA freelist.
+ * index, e.g. page might have MIGRATE_MOVABLE set but be on a pcplist with any
+ * other index - this ensures that it will be put on the correct freelist.
  */
 static inline int get_pcppage_migratetype(struct page *page)
 {
@@ -247,9 +247,6 @@ char * const migratetype_names[MIGRATE_TYPES] = {
 	"Movable",
 	"Reclaimable",
 	"HighAtomic",
-#ifdef CONFIG_CMA
-	"CMA",
-#endif
 #ifdef CONFIG_MEMORY_ISOLATION
 	"Isolate",
 #endif
@@ -681,7 +678,7 @@ static inline bool set_page_guard(struct zone *zone, struct page *page,
 	INIT_LIST_HEAD(&page->lru);
 	set_page_private(page, order);
 	/* Guard pages are not available for any usage */
-	__mod_zone_freepage_state(zone, -(1 << order), migratetype);
+	__mod_zone_page_state(zone, NR_FREE_PAGES, -(1 << order));
 
 	return true;
 }
@@ -702,7 +699,7 @@ static inline void clear_page_guard(struct zone *zone, struct page *page,
 
 	set_page_private(page, 0);
 	if (!is_migrate_isolate(migratetype))
-		__mod_zone_freepage_state(zone, (1 << order), migratetype);
+		__mod_zone_page_state(zone, NR_FREE_PAGES, (1 << order));
 }
 #else
 struct page_ext_operations debug_guardpage_ops;
@@ -809,7 +806,7 @@ static inline void __free_one_page(struct page *page,
 
 	VM_BUG_ON(migratetype == -1);
 	if (likely(!is_migrate_isolate(migratetype)))
-		__mod_zone_freepage_state(zone, 1 << order, migratetype);
+		__mod_zone_page_state(zone, NR_FREE_PAGES, 1 << order);
 
 	VM_BUG_ON_PAGE(pfn & ((1 << order) - 1), page);
 	VM_BUG_ON_PAGE(bad_range(zone, page), page);
@@ -1591,7 +1588,7 @@ static void __init adjust_present_page_count(struct page *page, long count)
 	zone->present_pages += count;
 }
 
-/* Free whole pageblock and set its migration type to MIGRATE_CMA. */
+/* Free whole pageblock and set its migration type to MIGRATE_MOVABLE. */
 void __init init_cma_reserved_pageblock(struct page *page)
 {
 	unsigned i = pageblock_nr_pages;
@@ -1616,7 +1613,7 @@ void __init init_cma_reserved_pageblock(struct page *page)
 
 	adjust_present_page_count(page, pageblock_nr_pages);
 
-	set_pageblock_migratetype(page, MIGRATE_CMA);
+	set_pageblock_migratetype(page, MIGRATE_MOVABLE);
 
 	if (pageblock_order >= MAX_ORDER) {
 		i = pageblock_nr_pages;
@@ -1836,25 +1833,11 @@ static int fallbacks[MIGRATE_TYPES][4] = {
 	[MIGRATE_UNMOVABLE]   = { MIGRATE_RECLAIMABLE, MIGRATE_MOVABLE,   MIGRATE_TYPES },
 	[MIGRATE_RECLAIMABLE] = { MIGRATE_UNMOVABLE,   MIGRATE_MOVABLE,   MIGRATE_TYPES },
 	[MIGRATE_MOVABLE]     = { MIGRATE_RECLAIMABLE, MIGRATE_UNMOVABLE, MIGRATE_TYPES },
-#ifdef CONFIG_CMA
-	[MIGRATE_CMA]         = { MIGRATE_TYPES }, /* Never used */
-#endif
 #ifdef CONFIG_MEMORY_ISOLATION
 	[MIGRATE_ISOLATE]     = { MIGRATE_TYPES }, /* Never used */
 #endif
 };
 
-#ifdef CONFIG_CMA
-static struct page *__rmqueue_cma_fallback(struct zone *zone,
-					unsigned int order)
-{
-	return __rmqueue_smallest(zone, order, MIGRATE_CMA);
-}
-#else
-static inline struct page *__rmqueue_cma_fallback(struct zone *zone,
-					unsigned int order) { return NULL; }
-#endif
-
 /*
  * Move the free pages in a range to the free lists of the requested type.
  * Note that start_page and end_pages are not aligned on a pageblock
@@ -2122,8 +2105,7 @@ static void reserve_highatomic_pageblock(struct page *page, struct zone *zone,
 
 	/* Yoink! */
 	mt = get_pageblock_migratetype(page);
-	if (!is_migrate_highatomic(mt) && !is_migrate_isolate(mt)
-	    && !is_migrate_cma(mt)) {
+	if (!is_migrate_highatomic(mt) && !is_migrate_isolate(mt)) {
 		zone->nr_reserved_highatomic += pageblock_nr_pages;
 		set_pageblock_migratetype(page, MIGRATE_HIGHATOMIC);
 		move_freepages_block(zone, page, MIGRATE_HIGHATOMIC, NULL);
@@ -2267,13 +2249,8 @@ static struct page *__rmqueue(struct zone *zone, unsigned int order,
 
 retry:
 	page = __rmqueue_smallest(zone, order, migratetype);
-	if (unlikely(!page)) {
-		if (migratetype == MIGRATE_MOVABLE)
-			page = __rmqueue_cma_fallback(zone, order);
-
-		if (!page && __rmqueue_fallback(zone, order, migratetype))
-			goto retry;
-	}
+	if (unlikely(!page) && __rmqueue_fallback(zone, order, migratetype))
+		goto retry;
 
 	trace_mm_page_alloc_zone_locked(page, order, migratetype);
 	return page;
@@ -2315,9 +2292,6 @@ static int rmqueue_bulk(struct zone *zone, unsigned int order,
 			list_add_tail(&page->lru, list);
 		list = &page->lru;
 		alloced++;
-		if (is_migrate_cma(get_pcppage_migratetype(page)))
-			__mod_zone_page_state(zone, NR_FREE_CMA_PAGES,
-					      -(1 << order));
 	}
 
 	/*
@@ -2667,7 +2641,7 @@ int __isolate_free_page(struct page *page, unsigned int order)
 		if (!zone_watermark_ok(zone, 0, watermark, 0, 0))
 			return 0;
 
-		__mod_zone_freepage_state(zone, -(1UL << order), mt);
+		__mod_zone_page_state(zone, NR_FREE_PAGES, -(1UL << order));
 	}
 
 	/* Remove page from free list */
@@ -2683,8 +2657,8 @@ int __isolate_free_page(struct page *page, unsigned int order)
 		struct page *endpage = page + (1 << order) - 1;
 		for (; page < endpage; page += pageblock_nr_pages) {
 			int mt = get_pageblock_migratetype(page);
-			if (!is_migrate_isolate(mt) && !is_migrate_cma(mt)
-			    && !is_migrate_highatomic(mt))
+			if (!is_migrate_isolate(mt) &&
+				!is_migrate_highatomic(mt))
 				set_pageblock_migratetype(page,
 							  MIGRATE_MOVABLE);
 		}
@@ -2807,8 +2781,7 @@ struct page *rmqueue(struct zone *preferred_zone,
 	spin_unlock(&zone->lock);
 	if (!page)
 		goto failed;
-	__mod_zone_freepage_state(zone, -(1 << order),
-				  get_pcppage_migratetype(page));
+	__mod_zone_page_state(zone, NR_FREE_PAGES, -(1 << order));
 
 	__count_zid_vm_events(PGALLOC, page_zonenum(page), 1 << order);
 	zone_statistics(preferred_zone, zone);
@@ -2958,11 +2931,6 @@ bool __zone_watermark_ok(struct zone *z, unsigned int order, unsigned long mark,
 			if (!list_empty(&area->free_list[mt]))
 				return true;
 		}
-
-#ifdef CONFIG_CMA
-		if (!list_empty(&area->free_list[MIGRATE_CMA]))
-			return true;
-#endif
 	}
 	return false;
 }
@@ -4469,9 +4437,6 @@ static void show_migration_types(unsigned char type)
 		[MIGRATE_MOVABLE]	= 'M',
 		[MIGRATE_RECLAIMABLE]	= 'E',
 		[MIGRATE_HIGHATOMIC]	= 'H',
-#ifdef CONFIG_CMA
-		[MIGRATE_CMA]		= 'C',
-#endif
 #ifdef CONFIG_MEMORY_ISOLATION
 		[MIGRATE_ISOLATE]	= 'I',
 #endif
@@ -7361,7 +7326,7 @@ bool has_unmovable_pages(struct zone *zone, struct page *page, int count,
 	if (zone_idx(zone) == ZONE_MOVABLE || is_zone_cma(zone))
 		return false;
 	mt = get_pageblock_migratetype(page);
-	if (mt == MIGRATE_MOVABLE || is_migrate_cma(mt))
+	if (mt == MIGRATE_MOVABLE)
 		return false;
 
 	pfn = page_to_pfn(page);
@@ -7512,16 +7477,12 @@ static int __alloc_contig_migrate_range(struct compact_control *cc,
  * alloc_contig_range() -- tries to allocate given range of pages
  * @start:	start PFN to allocate
  * @end:	one-past-the-last PFN to allocate
- * @migratetype:	migratetype of the underlaying pageblocks (either
- *			#MIGRATE_MOVABLE or #MIGRATE_CMA).  All pageblocks
- *			in range must have the same migratetype and it must
- *			be either of the two.
  * @gfp_mask:	GFP mask to use during compaction
  *
  * The PFN range does not have to be pageblock or MAX_ORDER_NR_PAGES
  * aligned, however it's the caller's responsibility to guarantee that
  * we are the only thread that changes migrate type of pageblocks the
- * pages fall in.
+ * pages fall in and it should be MIGRATE_MOVABLE.
  *
  * The PFN range must belong to a single zone.
  *
@@ -7530,7 +7491,7 @@ static int __alloc_contig_migrate_range(struct compact_control *cc,
  * need to be freed with free_contig_range().
  */
 int alloc_contig_range(unsigned long start, unsigned long end,
-		       unsigned migratetype, gfp_t gfp_mask)
+			gfp_t gfp_mask)
 {
 	unsigned long outer_start, outer_end;
 	unsigned int order;
@@ -7564,15 +7525,14 @@ int alloc_contig_range(unsigned long start, unsigned long end,
 	 * allocator removing them from the buddy system.  This way
 	 * page allocator will never consider using them.
 	 *
-	 * This lets us mark the pageblocks back as
-	 * MIGRATE_CMA/MIGRATE_MOVABLE so that free pages in the
-	 * aligned range but not in the unaligned, original range are
-	 * put back to page allocator so that buddy can use them.
+	 * This lets us mark the pageblocks back as MIGRATE_MOVABLE
+	 * so that free pages in the aligned range but not in the
+	 * unaligned, original range are put back to page allocator
+	 * so that buddy can use them.
 	 */
 
 	ret = start_isolate_page_range(pfn_max_align_down(start),
-				       pfn_max_align_up(end), migratetype,
-				       false);
+				       pfn_max_align_up(end), false);
 	if (ret)
 		return ret;
 
@@ -7650,7 +7610,7 @@ int alloc_contig_range(unsigned long start, unsigned long end,
 
 done:
 	undo_isolate_page_range(pfn_max_align_down(start),
-				pfn_max_align_up(end), migratetype);
+				pfn_max_align_up(end));
 	return ret;
 }
 
diff --git a/mm/page_isolation.c b/mm/page_isolation.c
index 5092e4e..312f2f6 100644
--- a/mm/page_isolation.c
+++ b/mm/page_isolation.c
@@ -62,14 +62,13 @@ static int set_migratetype_isolate(struct page *page,
 out:
 	if (!ret) {
 		unsigned long nr_pages;
-		int migratetype = get_pageblock_migratetype(page);
 
 		set_pageblock_migratetype(page, MIGRATE_ISOLATE);
 		zone->nr_isolate_pageblock++;
 		nr_pages = move_freepages_block(zone, page, MIGRATE_ISOLATE,
 									NULL);
 
-		__mod_zone_freepage_state(zone, -nr_pages, migratetype);
+		__mod_zone_page_state(zone, NR_FREE_PAGES, -nr_pages);
 	}
 
 	spin_unlock_irqrestore(&zone->lock, flags);
@@ -122,7 +121,7 @@ static void unset_migratetype_isolate(struct page *page, unsigned migratetype)
 	 */
 	if (!isolated_page) {
 		nr_pages = move_freepages_block(zone, page, migratetype, NULL);
-		__mod_zone_freepage_state(zone, nr_pages, migratetype);
+		__mod_zone_page_state(zone, NR_FREE_PAGES, nr_pages);
 	}
 	set_pageblock_migratetype(page, migratetype);
 	zone->nr_isolate_pageblock--;
@@ -151,7 +150,6 @@ __first_valid_page(unsigned long pfn, unsigned long nr_pages)
  * to be MIGRATE_ISOLATE.
  * @start_pfn: The lower PFN of the range to be isolated.
  * @end_pfn: The upper PFN of the range to be isolated.
- * @migratetype: migrate type to set in error recovery.
  *
  * Making page-allocation-type to be MIGRATE_ISOLATE means free pages in
  * the range will never be allocated. Any free pages and pages freed in the
@@ -161,7 +159,7 @@ __first_valid_page(unsigned long pfn, unsigned long nr_pages)
  * Returns 0 on success and -EBUSY if any part of range cannot be isolated.
  */
 int start_isolate_page_range(unsigned long start_pfn, unsigned long end_pfn,
-			     unsigned migratetype, bool skip_hwpoisoned_pages)
+				bool skip_hwpoisoned_pages)
 {
 	unsigned long pfn;
 	unsigned long undo_pfn;
@@ -185,7 +183,7 @@ int start_isolate_page_range(unsigned long start_pfn, unsigned long end_pfn,
 	for (pfn = start_pfn;
 	     pfn < undo_pfn;
 	     pfn += pageblock_nr_pages)
-		unset_migratetype_isolate(pfn_to_page(pfn), migratetype);
+		unset_migratetype_isolate(pfn_to_page(pfn), MIGRATE_MOVABLE);
 
 	return -EBUSY;
 }
@@ -193,8 +191,7 @@ int start_isolate_page_range(unsigned long start_pfn, unsigned long end_pfn,
 /*
  * Make isolated pages available again.
  */
-int undo_isolate_page_range(unsigned long start_pfn, unsigned long end_pfn,
-			    unsigned migratetype)
+int undo_isolate_page_range(unsigned long start_pfn, unsigned long end_pfn)
 {
 	unsigned long pfn;
 	struct page *page;
@@ -208,7 +205,7 @@ int undo_isolate_page_range(unsigned long start_pfn, unsigned long end_pfn,
 		page = __first_valid_page(pfn, pageblock_nr_pages);
 		if (!page || !is_migrate_isolate_page(page))
 			continue;
-		unset_migratetype_isolate(page, migratetype);
+		unset_migratetype_isolate(page, MIGRATE_MOVABLE);
 	}
 	return 0;
 }
diff --git a/mm/page_owner.c b/mm/page_owner.c
index c3cee24..4016815 100644
--- a/mm/page_owner.c
+++ b/mm/page_owner.c
@@ -299,11 +299,7 @@ void pagetypeinfo_showmixedcount_print(struct seq_file *m,
 			page_mt = gfpflags_to_migratetype(
 					page_owner->gfp_mask);
 			if (pageblock_mt != page_mt) {
-				if (is_migrate_cma(pageblock_mt))
-					count[MIGRATE_MOVABLE]++;
-				else
-					count[pageblock_mt]++;
-
+				count[pageblock_mt]++;
 				pfn = block_end_pfn;
 				break;
 			}
diff --git a/mm/usercopy.c b/mm/usercopy.c
index a9852b2..f0a2c8f 100644
--- a/mm/usercopy.c
+++ b/mm/usercopy.c
@@ -179,7 +179,7 @@ static inline const char *check_page_span(const void *ptr, unsigned long n,
 	 * several independently allocated pages.
 	 */
 	is_reserved = PageReserved(page);
-	is_cma = is_migrate_cma_page(page);
+	is_cma = is_zone_cma(page_zone(page));
 	if (!is_reserved && !is_cma)
 		return "<spans multiple pages>";
 
@@ -187,7 +187,7 @@ static inline const char *check_page_span(const void *ptr, unsigned long n,
 		page = virt_to_head_page(ptr);
 		if (is_reserved && !PageReserved(page))
 			return "<spans Reserved and non-Reserved pages>";
-		if (is_cma && !is_migrate_cma_page(page))
+		if (is_cma && !is_zone_cma(page_zone(page)))
 			return "<spans CMA and non-CMA pages>";
 	}
 #endif
-- 
2.7.4

^ permalink raw reply related	[flat|nested] 39+ messages in thread

* [PATCH v7 6/7] mm/cma: remove per zone CMA stat
  2017-04-11  3:17 [PATCH v7 0/7] Introduce ZONE_CMA js1304
                   ` (4 preceding siblings ...)
  2017-04-11  3:17 ` [PATCH v7 5/7] mm/cma: remove MIGRATE_CMA js1304
@ 2017-04-11  3:17 ` js1304
  2017-04-11  3:17 ` [PATCH v7 7/7] ARM: CMA: avoid re-mapping CMA region if CONFIG_HIGHMEM js1304
                   ` (3 subsequent siblings)
  9 siblings, 0 replies; 39+ messages in thread
From: js1304 @ 2017-04-11  3:17 UTC (permalink / raw)
  To: Andrew Morton
  Cc: Rik van Riel, Johannes Weiner, mgorman, Laura Abbott,
	Minchan Kim, Marek Szyprowski, Michal Nazarewicz,
	Aneesh Kumar K . V, Vlastimil Babka, Russell King, Will Deacon,
	linux-mm, linux-kernel, kernel-team, Joonsoo Kim

From: Joonsoo Kim <iamjoonsoo.kim@lge.com>

Now that all reserved pages for a CMA region belong to ZONE_CMA, we
don't need to maintain a separate CMA statistic in the other zones.
Remove it.

Reviewed-by: Aneesh Kumar K.V <aneesh.kumar@linux.vnet.ibm.com>
Acked-by: Vlastimil Babka <vbabka@suse.cz>
Signed-off-by: Joonsoo Kim <iamjoonsoo.kim@lge.com>
---
 fs/proc/meminfo.c      |  2 +-
 include/linux/cma.h    |  7 +++++++
 include/linux/mmzone.h |  1 -
 mm/cma.c               | 15 +++++++++++++++
 mm/page_alloc.c        |  7 +++----
 mm/vmstat.c            |  1 -
 6 files changed, 26 insertions(+), 7 deletions(-)

diff --git a/fs/proc/meminfo.c b/fs/proc/meminfo.c
index 8a42849..0ca6f38 100644
--- a/fs/proc/meminfo.c
+++ b/fs/proc/meminfo.c
@@ -151,7 +151,7 @@ static int meminfo_proc_show(struct seq_file *m, void *v)
 #ifdef CONFIG_CMA
 	show_val_kb(m, "CmaTotal:       ", totalcma_pages);
 	show_val_kb(m, "CmaFree:        ",
-		    global_page_state(NR_FREE_CMA_PAGES));
+		    cma_get_free());
 #endif
 
 	hugetlb_report_meminfo(m);
diff --git a/include/linux/cma.h b/include/linux/cma.h
index 03f32d0..2433d5e 100644
--- a/include/linux/cma.h
+++ b/include/linux/cma.h
@@ -32,4 +32,11 @@ extern int cma_init_reserved_mem(phys_addr_t base, phys_addr_t size,
 extern struct page *cma_alloc(struct cma *cma, size_t count, unsigned int align,
 			      gfp_t gfp_mask);
 extern bool cma_release(struct cma *cma, const struct page *pages, unsigned int count);
+
+#ifdef CONFIG_CMA
+extern unsigned long cma_get_free(void);
+#else
+static inline unsigned long cma_get_free(void) { return 0; }
+#endif
+
 #endif
diff --git a/include/linux/mmzone.h b/include/linux/mmzone.h
index efb69b1..5e1d8ba 100644
--- a/include/linux/mmzone.h
+++ b/include/linux/mmzone.h
@@ -118,7 +118,6 @@ enum zone_stat_item {
 	NUMA_LOCAL,		/* allocation from local node */
 	NUMA_OTHER,		/* allocation from other node */
 #endif
-	NR_FREE_CMA_PAGES,
 	NR_VM_ZONE_STAT_ITEMS };
 
 enum node_stat_item {
diff --git a/mm/cma.c b/mm/cma.c
index 91dd85a..adfda1c 100644
--- a/mm/cma.c
+++ b/mm/cma.c
@@ -54,6 +54,21 @@ unsigned long cma_get_size(const struct cma *cma)
 	return cma->count << PAGE_SHIFT;
 }
 
+unsigned long cma_get_free(void)
+{
+	struct zone *zone;
+	unsigned long freecma = 0;
+
+	for_each_populated_zone(zone) {
+		if (!is_zone_cma(zone))
+			continue;
+
+		freecma += zone_page_state(zone, NR_FREE_PAGES);
+	}
+
+	return freecma;
+}
+
 static unsigned long cma_bitmap_aligned_mask(const struct cma *cma,
 					     int align_order)
 {
diff --git a/mm/page_alloc.c b/mm/page_alloc.c
index 33a1b69..7d61469 100644
--- a/mm/page_alloc.c
+++ b/mm/page_alloc.c
@@ -66,6 +66,7 @@
 #include <linux/kthread.h>
 #include <linux/memcontrol.h>
 #include <linux/ftrace.h>
+#include <linux/cma.h>
 
 #include <asm/sections.h>
 #include <asm/tlbflush.h>
@@ -4502,7 +4503,7 @@ void show_free_areas(unsigned int filter, nodemask_t *nodemask)
 		global_page_state(NR_BOUNCE),
 		global_page_state(NR_FREE_PAGES),
 		free_pcp,
-		global_page_state(NR_FREE_CMA_PAGES));
+		cma_get_free());
 
 	for_each_online_pgdat(pgdat) {
 		if (show_mem_node_skip(filter, pgdat->node_id, nodemask))
@@ -4586,7 +4587,6 @@ void show_free_areas(unsigned int filter, nodemask_t *nodemask)
 			" bounce:%lukB"
 			" free_pcp:%lukB"
 			" local_pcp:%ukB"
-			" free_cma:%lukB"
 			"\n",
 			zone->name,
 			K(zone_page_state(zone, NR_FREE_PAGES)),
@@ -4608,8 +4608,7 @@ void show_free_areas(unsigned int filter, nodemask_t *nodemask)
 			K(zone_page_state(zone, NR_PAGETABLE)),
 			K(zone_page_state(zone, NR_BOUNCE)),
 			K(free_pcp),
-			K(this_cpu_read(zone->pageset->pcp.count)),
-			K(zone_page_state(zone, NR_FREE_CMA_PAGES)));
+			K(this_cpu_read(zone->pageset->pcp.count)));
 		printk("lowmem_reserve[]:");
 		for (i = 0; i < MAX_NR_ZONES; i++)
 			printk(KERN_CONT " %ld", zone->lowmem_reserve[i]);
diff --git a/mm/vmstat.c b/mm/vmstat.c
index 3c3aac2..9467b56 100644
--- a/mm/vmstat.c
+++ b/mm/vmstat.c
@@ -951,7 +951,6 @@ const char * const vmstat_text[] = {
 	"numa_local",
 	"numa_other",
 #endif
-	"nr_free_cma",
 
 	/* Node-based counters */
 	"nr_inactive_anon",
-- 
2.7.4

^ permalink raw reply related	[flat|nested] 39+ messages in thread

* [PATCH v7 7/7] ARM: CMA: avoid re-mapping CMA region if CONFIG_HIGHMEM
  2017-04-11  3:17 [PATCH v7 0/7] Introduce ZONE_CMA js1304
                   ` (5 preceding siblings ...)
  2017-04-11  3:17 ` [PATCH v7 6/7] mm/cma: remove per zone CMA stat js1304
@ 2017-04-11  3:17 ` js1304
  2017-04-11 18:15 ` [PATCH v7 0/7] Introduce ZONE_CMA Michal Hocko
                   ` (2 subsequent siblings)
  9 siblings, 0 replies; 39+ messages in thread
From: js1304 @ 2017-04-11  3:17 UTC (permalink / raw)
  To: Andrew Morton
  Cc: Rik van Riel, Johannes Weiner, mgorman, Laura Abbott,
	Minchan Kim, Marek Szyprowski, Michal Nazarewicz,
	Aneesh Kumar K . V, Vlastimil Babka, Russell King, Will Deacon,
	linux-mm, linux-kernel, kernel-team, Joonsoo Kim

From: Joonsoo Kim <iamjoonsoo.kim@lge.com>

CMA regions are now managed by a separate zone, ZONE_CMA, to fix many
MM-related problems. In this implementation, ZONE_CMA may contain CMA
regions in both lowmem and highmem. To handle this case properly,
ZONE_CMA is considered highmem.

In dma_contiguous_remap(), the lowmem mapping of a CMA region is
cleared and the region is remapped for DMA. In the new CMA
implementation this remap isn't needed, since the region is considered
highmem, and it should not be allowed anyway since it would cause
cache problems. So, this patch disables it.

Signed-off-by: Joonsoo Kim <iamjoonsoo.kim@lge.com>
---
 arch/arm/mm/dma-mapping.c | 7 ++++++-
 1 file changed, 6 insertions(+), 1 deletion(-)

diff --git a/arch/arm/mm/dma-mapping.c b/arch/arm/mm/dma-mapping.c
index 475811f..377053a 100644
--- a/arch/arm/mm/dma-mapping.c
+++ b/arch/arm/mm/dma-mapping.c
@@ -506,7 +506,12 @@ void __init dma_contiguous_remap(void)
 		flush_tlb_kernel_range(__phys_to_virt(start),
 				       __phys_to_virt(end));
 
-		iotable_init(&map, 1);
+		/*
+		 * For highmem system, all the memory in CMA region will be
+		 * considered as highmem, therefore, re-mapping isn't required.
+		 */
+		if (!IS_ENABLED(CONFIG_HIGHMEM))
+			iotable_init(&map, 1);
 	}
 }
 
-- 
2.7.4

^ permalink raw reply related	[flat|nested] 39+ messages in thread

* Re: [PATCH v7 0/7] Introduce ZONE_CMA
  2017-04-11  3:17 [PATCH v7 0/7] Introduce ZONE_CMA js1304
                   ` (6 preceding siblings ...)
  2017-04-11  3:17 ` [PATCH v7 7/7] ARM: CMA: avoid re-mapping CMA region if CONFIG_HIGHMEM js1304
@ 2017-04-11 18:15 ` Michal Hocko
  2017-04-12  1:35   ` Joonsoo Kim
  2017-04-12  1:39 ` Joonsoo Kim
  2017-04-24  4:08 ` Bob Liu
  9 siblings, 1 reply; 39+ messages in thread
From: Michal Hocko @ 2017-04-11 18:15 UTC (permalink / raw)
  To: js1304
  Cc: Andrew Morton, Rik van Riel, Johannes Weiner, mgorman,
	Laura Abbott, Minchan Kim, Marek Szyprowski, Michal Nazarewicz,
	Aneesh Kumar K . V, Vlastimil Babka, Russell King, Will Deacon,
	linux-mm, linux-kernel, kernel-team, Joonsoo Kim

Hi,
I didn't get to read through the patches yet, but the cover letter didn't
really help me to understand the basic concepts to have a good starting
point before diving into implementation details. It contains a lot of
history remarks which is not bad but IMHO too excessive here. I would
appreciate the following information (some of that is already provided
in the cover but could benefit from some rewording/text reorganization).

- what is ZONE_CMA and how it is configured (from admin POV)
- how does ZONE_CMA compare to other zones
- who is allowed to allocate from this zone and what are the
  guarantees/requirements for successful allocation
- how does the zone compare to a preallocate allocation pool
- how is ZONE_CMA balanced/reclaimed due to internal memory pressure
  (from CMA users)
- is this zone reclaimable for the global memory reclaim
- why this was/is controversial
-- 
Michal Hocko
SUSE Labs

^ permalink raw reply	[flat|nested] 39+ messages in thread

* Re: [PATCH v7 0/7] Introduce ZONE_CMA
  2017-04-11 18:15 ` [PATCH v7 0/7] Introduce ZONE_CMA Michal Hocko
@ 2017-04-12  1:35   ` Joonsoo Kim
  2017-04-13 11:56     ` Michal Hocko
  0 siblings, 1 reply; 39+ messages in thread
From: Joonsoo Kim @ 2017-04-12  1:35 UTC (permalink / raw)
  To: Michal Hocko
  Cc: Andrew Morton, Rik van Riel, Johannes Weiner, mgorman,
	Laura Abbott, Minchan Kim, Marek Szyprowski, Michal Nazarewicz,
	Aneesh Kumar K . V, Vlastimil Babka, Russell King, Will Deacon,
	linux-mm, linux-kernel, kernel-team

On Tue, Apr 11, 2017 at 08:15:20PM +0200, Michal Hocko wrote:
> Hi,
> I didn't get to read through the patches yet, but the cover letter didn't
> really help me to understand the basic concepts to have a good starting
> point before diving into implementation details. It contains a lot of
> history remarks which is not bad but IMHO too excessive here. I would
> appreciate the following information (some of that is already provided
> in the cover but could benefit from some rewording/text reorganization).
> 
> - what is ZONE_CMA and how it is configured (from admin POV)
> - how does ZONE_CMA compare to other zones
> - who is allowed to allocate from this zone and what are the
>   guarantees/requirements for successful allocation
> - how does the zone compare to a preallocate allocation pool
> - how is ZONE_CMA balanced/reclaimed due to internal memory pressure
>   (from CMA users)
> - is this zone reclaimable for the global memory reclaim
> - why this was/is controversial

Hello,

I hope that following summary helps you to understand this patchset.
I skip some basic things about CMA. I will attach this description to
the cover-letter if re-spin is needed.

1. What is ZONE_CMA

ZONE_CMA is a newly introduced zone that manages the freepages in CMA
areas. Previously, freepages in CMA areas lived in the ordinary zones
and were distinguished and managed by a special migratetype,
MIGRATE_CMA. However, that causes too many subtle problems, and fixing
all of them seems impossible and too intrusive to the MM subsystem.
Therefore, a different solution was requested, and this patchset is the
outcome of that request. Problem details are described in PART 3.

There is no change from the admin POV; it is just an implementation
detail. If the kernel is configured to use CMA, the memory is managed
by MM as before, except that the pages now belong to a separate zone,
ZONE_CMA.
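
For illustration, the core of the change is that the question "is this
page CMA?" becomes a zone check rather than a pageblock migratetype
check. A minimal sketch, assuming ZONE_CMA is defined as an ordinary
zone index (the exact definition lives in an earlier patch of this
series and may differ):

#ifdef CONFIG_CMA
static inline bool is_zone_cma(struct zone *zone)
{
	/* ZONE_CMA is just another zone index when CONFIG_CMA=y */
	return zone_idx(zone) == ZONE_CMA;
}
#else
static inline bool is_zone_cma(struct zone *zone)
{
	return false;
}
#endif

Callers then test is_zone_cma(page_zone(page)) where they previously
tested is_migrate_cma_page(page).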

2. How does ZONE_CMA compare to other zones

ZONE_CMA is conceptually the same with ZONE_MOVABLE. There is a software
constraint to guarantee the success of future allocation request from
the device. If the device requests the specific range of the memory in CMA
area at the runtime, page that allocated by MM will be migrated to
the other page and it will be returned to the device. To guarantee it,
ZONE_CMA only takes the allocation request with GFP_MOVABLE.
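
A minimal sketch of that constraint, using a hypothetical helper (not
the actual wiring from the earlier patches of this series, just an
illustration of the rule):

/*
 * Hypothetical helper: ZONE_CMA may only serve allocations that can
 * later be migrated away, i.e. movable ones.
 */
static inline bool zone_cma_eligible(gfp_t gfp_mask)
{
	return IS_ENABLED(CONFIG_CMA) &&
		gfpflags_to_migratetype(gfp_mask) == MIGRATE_MOVABLE;
}

In other words, only GFP_HIGHUSER_MOVABLE-style requests (page cache,
anonymous pages) can be served from ZONE_CMA; kernel allocations never
are.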

The other important point about ZONE_CMA is that span of ZONE_CMA would be
overlapped with the other zone. This is not new to MM subsystem and
MM subsystem has enough logic to handle such situation
so there would be no problem.

Other things are completely the same with other zones. For MM POV, there is
no difference in allocation process except that it only takes
GFP_MOVABLE request. In reclaim, pages that are allocated by MM will
be reclaimed by the same policy of the MM. So, no difference.

This 'no difference' is a strong point of this approach. ZONE_CMA is
naturally handled by MM subsystem unlike as before (special handling is
required for MIGRATE_CMA).

3. Controversial Point

Major concern from Mel is that zone concept is abused. ZONE is originally
introduced to solve some issues due to H/W addressing limitation.
However, from the age of ZONE_MOVABLE, ZONE is used to solve the issues
due to S/W limitation. This S/W limitation causes highmem/lowmem problem
that is some of memory cannot be usable for kernel memory and LRU ordering
would be broken easily. My major objection to this point is that
this problem isn't related to implementation detail like as ZONE.
Problems just comes from S/W limitation that we cannot use this memory
for kernel memory to guarantee offlining the memory (ZONE_MOVABLE) or
allocation from the device (ZONE_CMA) in the future. See PART 1 for
more information.

Thanks.

^ permalink raw reply	[flat|nested] 39+ messages in thread

* Re: [PATCH v7 0/7] Introduce ZONE_CMA
  2017-04-11  3:17 [PATCH v7 0/7] Introduce ZONE_CMA js1304
                   ` (7 preceding siblings ...)
  2017-04-11 18:15 ` [PATCH v7 0/7] Introduce ZONE_CMA Michal Hocko
@ 2017-04-12  1:39 ` Joonsoo Kim
  2017-04-24  4:08 ` Bob Liu
  9 siblings, 0 replies; 39+ messages in thread
From: Joonsoo Kim @ 2017-04-12  1:39 UTC (permalink / raw)
  To: Andrew Morton
  Cc: Rik van Riel, Johannes Weiner, mgorman, Laura Abbott,
	Minchan Kim, Marek Szyprowski, Michal Nazarewicz,
	Aneesh Kumar K . V, Vlastimil Babka, Russell King, Will Deacon,
	linux-mm, linux-kernel, kernel-team

On Tue, Apr 11, 2017 at 12:17:13PM +0900, js1304@gmail.com wrote:
> From: Joonsoo Kim <iamjoonsoo.kim@lge.com>
> 
> Changed from v6
> o Rebase on next-20170405
> o Add a fix for lowmem mapping on ARM (last patch)

Hello, Russell and Will.

In this 7th patchset, I added a new patch for ARM.
Could you review it?

Thanks.

^ permalink raw reply	[flat|nested] 39+ messages in thread

* Re: [PATCH v7 0/7] Introduce ZONE_CMA
  2017-04-12  1:35   ` Joonsoo Kim
@ 2017-04-13 11:56     ` Michal Hocko
  2017-04-17  2:02       ` Joonsoo Kim
  0 siblings, 1 reply; 39+ messages in thread
From: Michal Hocko @ 2017-04-13 11:56 UTC (permalink / raw)
  To: Joonsoo Kim
  Cc: Andrew Morton, Rik van Riel, Johannes Weiner, mgorman,
	Laura Abbott, Minchan Kim, Marek Szyprowski, Michal Nazarewicz,
	Aneesh Kumar K . V, Vlastimil Babka, Russell King, Will Deacon,
	linux-mm, linux-kernel, kernel-team

On Wed 12-04-17 10:35:06, Joonsoo Kim wrote:
> On Tue, Apr 11, 2017 at 08:15:20PM +0200, Michal Hocko wrote:
> > Hi,
> > I didn't get to read through the patches yet, but the cover letter didn't
> > really help me to understand the basic concepts to have a good starting
> > point before diving into implementation details. It contains a lot of
> > history remarks which is not bad but IMHO too excessive here. I would
> > appreciate the following information (some of that is already provided
> > in the cover but could benefit from some rewording/text reorganization).
> > 
> > - what is ZONE_CMA and how it is configured (from admin POV)
> > - how does ZONE_CMA compare to other zones
> > - who is allowed to allocate from this zone and what are the
> >   guarantees/requirements for successful allocation
> > - how does the zone compare to a preallocate allocation pool
> > - how is ZONE_CMA balanced/reclaimed due to internal memory pressure
> >   (from CMA users)
> > - is this zone reclaimable for the global memory reclaim
> > - why this was/is controversial
> 
> Hello,
> 
> I hope that following summary helps you to understand this patchset.
> I skip some basic things about CMA. I will attach this description to
> the cover-letter if re-spin is needed.

I believe that sorting out these questions is more important than what
you have in the current cover letter. Andrew tends to fold the cover
into the first patch so I think you should update.

> 2. How does ZONE_CMA compare to other zones
> 
> ZONE_CMA is conceptually the same with ZONE_MOVABLE. There is a software
> constraint to guarantee the success of future allocation request from
> the device. If the device requests the specific range of the memory in CMA
> area at the runtime, page that allocated by MM will be migrated to
> the other page and it will be returned to the device. To guarantee it,
> ZONE_CMA only takes the allocation request with GFP_MOVABLE.

The immediate follow up question is. Why cannot we reuse ZONE_MOVABLE
for that purpose?

> The other important point about ZONE_CMA is that span of ZONE_CMA would be
> overlapped with the other zone. This is not new to MM subsystem and
> MM subsystem has enough logic to handle such situation
> so there would be no problem.

I am not really sure this is actually true. Zones are disjoint from the
early beginning. I remember that we had something like numa nodes
interleaving but that is such a rare configuration that I wouldn't be
surprised if it wasn't very well tested and actually broken in some
subtle ways.

There are many page_zone(page) != zone checks sprinkled in the code but
I do not see anything consistent there. Similarly pageblock_pfn_to_page
is only used by compaction but there are other pfn walkers which do
ad-hoc checking. I was staring into that code these days due to my
hotplug patches.

That being said, I think that interleaving zones are an interesting
concept but I would be rather nervous to consider this as working
currently without a deeper review.

> Other things are completely the same with other zones. For MM POV, there is
> no difference in allocation process except that it only takes
> GFP_MOVABLE request. In reclaim, pages that are allocated by MM will
> be reclaimed by the same policy of the MM. So, no difference.

OK, so essentially this is yet another "highmem" zone. We already know
that only GFP_MOVABLE are allowed to fallback to ZONE_CMA but do CMA
allocations fallback to other zones and punch new holes? In which zone
order?

> This 'no difference' is a strong point of this approach. ZONE_CMA is
> naturally handled by MM subsystem unlike as before (special handling is
> required for MIGRATE_CMA).
> 
> 3. Controversial Point
> 
> Major concern from Mel is that zone concept is abused. ZONE is originally
> introduced to solve some issues due to H/W addressing limitation.

Yes, very much agreed on that. You basically want to punch holes into
other zones to guarantee allocation progress. Marking those holes
with special migrate type sounds quite natural but I will have to study
the current code some more to see whether issues you mention are
inherently unfixable. This might very well turn out to be the case.

> However, from the age of ZONE_MOVABLE, ZONE is used to solve the issues
> due to S/W limitation.

copying ZONE_MOVABLE pattern doesn't sound all that great to me to be
honest.

> This S/W limitation causes highmem/lowmem problem
> that is some of memory cannot be usable for kernel memory and LRU ordering
> would be broken easily. My major objection to this point is that
> this problem isn't related to implementation detail like as ZONE.

yes, agreement on that.

> Problems just comes from S/W limitation that we cannot use this memory
> for kernel memory to guarantee offlining the memory (ZONE_MOVABLE) or
> allocation from the device (ZONE_CMA) in the future. See PART 1 for
> more information.

-- 
Michal Hocko
SUSE Labs

^ permalink raw reply	[flat|nested] 39+ messages in thread

* Re: [PATCH v7 0/7] Introduce ZONE_CMA
  2017-04-13 11:56     ` Michal Hocko
@ 2017-04-17  2:02       ` Joonsoo Kim
  2017-04-21  1:35         ` Joonsoo Kim
  2017-04-24 13:09         ` Michal Hocko
  0 siblings, 2 replies; 39+ messages in thread
From: Joonsoo Kim @ 2017-04-17  2:02 UTC (permalink / raw)
  To: Michal Hocko
  Cc: Andrew Morton, Rik van Riel, Johannes Weiner, mgorman,
	Laura Abbott, Minchan Kim, Marek Szyprowski, Michal Nazarewicz,
	Aneesh Kumar K . V, Vlastimil Babka, Russell King, Will Deacon,
	linux-mm, linux-kernel, kernel-team

On Thu, Apr 13, 2017 at 01:56:15PM +0200, Michal Hocko wrote:
> On Wed 12-04-17 10:35:06, Joonsoo Kim wrote:
> > On Tue, Apr 11, 2017 at 08:15:20PM +0200, Michal Hocko wrote:
> > > Hi,
> > > I didn't get to read through the patches yet, but the cover letter didn't
> > > really help me to understand the basic concepts to have a good starting
> > > point before diving into implementation details. It contains a lot of
> > > history remarks which is not bad but IMHO too excessive here. I would
> > > appreciate the following information (some of that is already provided
> > > in the cover but could benefit from some rewording/text reorganization).
> > > 
> > > - what is ZONE_CMA and how it is configured (from admin POV)
> > > - how does ZONE_CMA compare to other zones
> > > - who is allowed to allocate from this zone and what are the
> > >   guarantees/requirements for successful allocation
> > > - how does the zone compare to a preallocate allocation pool
> > > - how is ZONE_CMA balanced/reclaimed due to internal memory pressure
> > >   (from CMA users)
> > > - is this zone reclaimable for the global memory reclaim
> > > - why this was/is controversial
> > 
> > Hello,
> > 
> > I hope that following summary helps you to understand this patchset.
> > I skip some basic things about CMA. I will attach this description to
> > the cover-letter if re-spin is needed.
> 
> I believe that sorting out these questions is more important than what
> you have in the current cover letter. Andrew tends to fold the cover
> into the first patch so I think you should update.

Okay.
 
> > 2. How does ZONE_CMA compare to other zones
> > 
> > ZONE_CMA is conceptually the same with ZONE_MOVABLE. There is a software
> > constraint to guarantee the success of future allocation request from
> > the device. If the device requests the specific range of the memory in CMA
> > area at the runtime, page that allocated by MM will be migrated to
> > the other page and it will be returned to the device. To guarantee it,
> > ZONE_CMA only takes the allocation request with GFP_MOVABLE.
> 
> The immediate follow up question is. Why cannot we reuse ZONE_MOVABLE
> for that purpose?

I could make CMA reuse ZONE_MOVABLE, but I don't want to, for two
reasons:

1. If ZONE_MOVABLE held two different types of memory, hotpluggable and
CMA, it might need special handling for each type. This would lead to a
new migratetype again (to distinguish them) and would be easy to get
wrong. I don't want that.

2. CMA users want to see the usage statistics separately, since CMA
often causes problems and a separate stat helps to debug them.

> > The other important point about ZONE_CMA is that span of ZONE_CMA would be
> > overlapped with the other zone. This is not new to MM subsystem and
> > MM subsystem has enough logic to handle such situation
> > so there would be no problem.
> 
> I am not really sure this is actually true. Zones are disjoint from the
> early beginning. I remember that we had something like numa nodes
> interleaving but that is such a rare configuration that I wouldn't be
> surprised if it wasn't very well tested and actually broken in some
> subtle ways.

I agree with your concern; however, if something is broken for them,
that just shows we need to fix it. MM should handle this situation
since we already know that such architectures exist.

> 
> There are many page_zone(page) != zone checks sprinkled in the code but
> I do not see anything consistent there. Similarly pageblock_pfn_to_page
> is only used by compaction but there are other pfn walkers which do
> ad-hoc checking. I was staring into that code these days due to my
> hotplug patches.
>
> That being said, I think that interleaving zones are an interesting
> concept but I would be rather nervous to consider this as working
> currently without a deeper review.

I have tried to audit all the pfn walkers before and have added the
above-mentioned check. Perhaps I missed something, but I believe not
much. Our production systems already use ZONE_CMA and I haven't
received any report of such a problem.
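
The check in question is the usual defensive pattern for pfn walkers
that may cross zones with overlapping spans; roughly (a simplified
sketch of the existing pattern with a made-up function name, not new
code from this series):

static void walk_zone_pfn_range(struct zone *zone, unsigned long start_pfn,
				unsigned long end_pfn)
{
	unsigned long pfn;

	for (pfn = start_pfn; pfn < end_pfn; pfn++) {
		struct page *page;

		if (!pfn_valid(pfn))
			continue;

		page = pfn_to_page(pfn);
		/* Skip pages that belong to another (overlapping) zone */
		if (page_zone(page) != zone)
			continue;

		/* ... process the page ... */
	}
}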

> 
> > Other things are completely the same with other zones. For MM POV, there is
> > no difference in allocation process except that it only takes
> > GFP_MOVABLE request. In reclaim, pages that are allocated by MM will
> > be reclaimed by the same policy of the MM. So, no difference.
> 
> OK, so essentially this is yet another "highmem" zone. We already know
> that only GFP_MOVABLE are allowed to fallback to ZONE_CMA but do CMA
> allocations fallback to other zones and punch new holes? In which zone
> order?

Hmm... I don't understand your question. Could you elaborate it more?

> > This 'no difference' is a strong point of this approach. ZONE_CMA is
> > naturally handled by MM subsystem unlike as before (special handling is
> > required for MIGRATE_CMA).
> > 
> > 3. Controversial Point
> > 
> > Major concern from Mel is that zone concept is abused. ZONE is originally
> > introduced to solve some issues due to H/W addressing limitation.
> 
> Yes, very much agreed on that. You basically want to punch holes into
> other zones to guarantee allocation progress. Marking those holes
> with special migrate type sounds quite natural but I will have to study
> the current code some more to see whether issues you mention are
> inherently unfixable. This might very well turn out to be the case.

At a glance, a special migratetype sounds natural. I thought so too.
However, it's not natural from an implementation POV. A zone consists
of a single type of memory (by definition?) and the MM subsystem is
implemented with that assumption. If different types of memory share
the same zone, problems arise easily, and the CMA problems are exactly
such a case.

Thanks.

^ permalink raw reply	[flat|nested] 39+ messages in thread

* Re: [PATCH v7 1/7] mm/page_alloc: don't reserve ZONE_HIGHMEM for ZONE_MOVABLE request
  2017-04-11  3:17 ` [PATCH v7 1/7] mm/page_alloc: don't reserve ZONE_HIGHMEM for ZONE_MOVABLE request js1304
@ 2017-04-17  7:38   ` Minchan Kim
  2017-04-21  1:32     ` Joonsoo Kim
  0 siblings, 1 reply; 39+ messages in thread
From: Minchan Kim @ 2017-04-17  7:38 UTC (permalink / raw)
  To: js1304
  Cc: Andrew Morton, Rik van Riel, Johannes Weiner, mgorman,
	Laura Abbott, Marek Szyprowski, Michal Nazarewicz,
	Aneesh Kumar K . V, Vlastimil Babka, Russell King, Will Deacon,
	linux-mm, linux-kernel, kernel-team, Joonsoo Kim

Hi Joonsoo,

On Tue, Apr 11, 2017 at 12:17:14PM +0900, js1304@gmail.com wrote:
> From: Joonsoo Kim <iamjoonsoo.kim@lge.com>
> 
> Freepage on ZONE_HIGHMEM doesn't work for kernel memory so it's not that
> important to reserve. When ZONE_MOVABLE is used, this problem would
> theorectically cause to decrease usable memory for GFP_HIGHUSER_MOVABLE
> allocation request which is mainly used for page cache and anon page
> allocation. So, fix it.
> 
> And, defining sysctl_lowmem_reserve_ratio array by MAX_NR_ZONES - 1 size
> makes code complex. For example, if there is highmem system, following
> reserve ratio is activated for *NORMAL ZONE* which would be easyily
> misleading people.
> 
>  #ifdef CONFIG_HIGHMEM
>  32
>  #endif
> 
> This patch also fix this situation by defining sysctl_lowmem_reserve_ratio
> array by MAX_NR_ZONES and place "#ifdef" to right place.
> 
> Reviewed-by: Aneesh Kumar K.V <aneesh.kumar@linux.vnet.ibm.com>
> Acked-by: Vlastimil Babka <vbabka@suse.cz>
> Signed-off-by: Joonsoo Kim <iamjoonsoo.kim@lge.com>
> ---
>  include/linux/mmzone.h |  2 +-
>  mm/page_alloc.c        | 11 ++++++-----
>  2 files changed, 7 insertions(+), 6 deletions(-)
> 
> diff --git a/include/linux/mmzone.h b/include/linux/mmzone.h
> index ebaccd4..96194bf 100644
> --- a/include/linux/mmzone.h
> +++ b/include/linux/mmzone.h
> @@ -869,7 +869,7 @@ int min_free_kbytes_sysctl_handler(struct ctl_table *, int,
>  					void __user *, size_t *, loff_t *);
>  int watermark_scale_factor_sysctl_handler(struct ctl_table *, int,
>  					void __user *, size_t *, loff_t *);
> -extern int sysctl_lowmem_reserve_ratio[MAX_NR_ZONES-1];
> +extern int sysctl_lowmem_reserve_ratio[MAX_NR_ZONES];
>  int lowmem_reserve_ratio_sysctl_handler(struct ctl_table *, int,
>  					void __user *, size_t *, loff_t *);
>  int percpu_pagelist_fraction_sysctl_handler(struct ctl_table *, int,
> diff --git a/mm/page_alloc.c b/mm/page_alloc.c
> index 32b31d6..60ffa4e 100644
> --- a/mm/page_alloc.c
> +++ b/mm/page_alloc.c
> @@ -203,17 +203,18 @@ static void __free_pages_ok(struct page *page, unsigned int order);
>   * TBD: should special case ZONE_DMA32 machines here - in those we normally
>   * don't need any ZONE_NORMAL reservation
>   */
> -int sysctl_lowmem_reserve_ratio[MAX_NR_ZONES-1] = {
> +int sysctl_lowmem_reserve_ratio[MAX_NR_ZONES] = {
>  #ifdef CONFIG_ZONE_DMA
> -	 256,
> +	[ZONE_DMA] = 256,
>  #endif
>  #ifdef CONFIG_ZONE_DMA32
> -	 256,
> +	[ZONE_DMA32] = 256,
>  #endif
> +	[ZONE_NORMAL] = 32,
>  #ifdef CONFIG_HIGHMEM
> -	 32,
> +	[ZONE_HIGHMEM] = INT_MAX,
>  #endif
> -	 32,
> +	[ZONE_MOVABLE] = INT_MAX,
>  };

We need to update lowmem_reserve_ratio in Documentation/sysctl/vm.txt.
And to me, INT_MAX is rather awkward.

# cat /proc/sys/vm/lowmem_reserve_ratio
        256     256     32      2147483647      2147483647

What do you think about using 0 or -1 as a special value
instead of 2147483647?

Anyway, it could be a separate patch independent of ZONE_CMA,
so I hope Andrew merges this patch regardless of the other patches
in this patchset.

Thanks.

^ permalink raw reply	[flat|nested] 39+ messages in thread

* Re: [PATCH v7 1/7] mm/page_alloc: don't reserve ZONE_HIGHMEM for ZONE_MOVABLE request
  2017-04-17  7:38   ` Minchan Kim
@ 2017-04-21  1:32     ` Joonsoo Kim
  0 siblings, 0 replies; 39+ messages in thread
From: Joonsoo Kim @ 2017-04-21  1:32 UTC (permalink / raw)
  To: Minchan Kim
  Cc: Andrew Morton, Rik van Riel, Johannes Weiner, mgorman,
	Laura Abbott, Marek Szyprowski, Michal Nazarewicz,
	Aneesh Kumar K . V, Vlastimil Babka, Russell King, Will Deacon,
	linux-mm, linux-kernel, kernel-team

On Mon, Apr 17, 2017 at 04:38:08PM +0900, Minchan Kim wrote:
> Hi Joonsoo,
> 
> On Tue, Apr 11, 2017 at 12:17:14PM +0900, js1304@gmail.com wrote:
> > From: Joonsoo Kim <iamjoonsoo.kim@lge.com>
> > 
> > Freepage on ZONE_HIGHMEM doesn't work for kernel memory so it's not that
> > important to reserve. When ZONE_MOVABLE is used, this problem would
> > theorectically cause to decrease usable memory for GFP_HIGHUSER_MOVABLE
> > allocation request which is mainly used for page cache and anon page
> > allocation. So, fix it.
> > 
> > And, defining sysctl_lowmem_reserve_ratio array by MAX_NR_ZONES - 1 size
> > makes code complex. For example, if there is highmem system, following
> > reserve ratio is activated for *NORMAL ZONE* which would be easyily
> > misleading people.
> > 
> >  #ifdef CONFIG_HIGHMEM
> >  32
> >  #endif
> > 
> > This patch also fix this situation by defining sysctl_lowmem_reserve_ratio
> > array by MAX_NR_ZONES and place "#ifdef" to right place.
> > 
> > Reviewed-by: Aneesh Kumar K.V <aneesh.kumar@linux.vnet.ibm.com>
> > Acked-by: Vlastimil Babka <vbabka@suse.cz>
> > Signed-off-by: Joonsoo Kim <iamjoonsoo.kim@lge.com>
> > ---
> >  include/linux/mmzone.h |  2 +-
> >  mm/page_alloc.c        | 11 ++++++-----
> >  2 files changed, 7 insertions(+), 6 deletions(-)
> > 
> > diff --git a/include/linux/mmzone.h b/include/linux/mmzone.h
> > index ebaccd4..96194bf 100644
> > --- a/include/linux/mmzone.h
> > +++ b/include/linux/mmzone.h
> > @@ -869,7 +869,7 @@ int min_free_kbytes_sysctl_handler(struct ctl_table *, int,
> >  					void __user *, size_t *, loff_t *);
> >  int watermark_scale_factor_sysctl_handler(struct ctl_table *, int,
> >  					void __user *, size_t *, loff_t *);
> > -extern int sysctl_lowmem_reserve_ratio[MAX_NR_ZONES-1];
> > +extern int sysctl_lowmem_reserve_ratio[MAX_NR_ZONES];
> >  int lowmem_reserve_ratio_sysctl_handler(struct ctl_table *, int,
> >  					void __user *, size_t *, loff_t *);
> >  int percpu_pagelist_fraction_sysctl_handler(struct ctl_table *, int,
> > diff --git a/mm/page_alloc.c b/mm/page_alloc.c
> > index 32b31d6..60ffa4e 100644
> > --- a/mm/page_alloc.c
> > +++ b/mm/page_alloc.c
> > @@ -203,17 +203,18 @@ static void __free_pages_ok(struct page *page, unsigned int order);
> >   * TBD: should special case ZONE_DMA32 machines here - in those we normally
> >   * don't need any ZONE_NORMAL reservation
> >   */
> > -int sysctl_lowmem_reserve_ratio[MAX_NR_ZONES-1] = {
> > +int sysctl_lowmem_reserve_ratio[MAX_NR_ZONES] = {
> >  #ifdef CONFIG_ZONE_DMA
> > -	 256,
> > +	[ZONE_DMA] = 256,
> >  #endif
> >  #ifdef CONFIG_ZONE_DMA32
> > -	 256,
> > +	[ZONE_DMA32] = 256,
> >  #endif
> > +	[ZONE_NORMAL] = 32,
> >  #ifdef CONFIG_HIGHMEM
> > -	 32,
> > +	[ZONE_HIGHMEM] = INT_MAX,
> >  #endif
> > -	 32,
> > +	[ZONE_MOVABLE] = INT_MAX,
> >  };
> 
> We need to update lowmem_reserve_ratio in Documentation/sysctl/vm.txt.

Okay!

> And to me, INT_MAX is rather awkward.

I also think so.

> # cat /proc/sys/vm/lowmem_reserve_ratio
>         256     256     32      2147483647      2147483647
> 
> What do you think about using 0 or -1 as a special value
> instead of 2147483647?

I thought about that but dropped it. In setup_per_zone_lowmem_reserve(),
there is code that adjusts the value to 1 if it is less than 1. Someone
might (ab)use that adjustment, so it's safer to use INT_MAX.
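
For reference, the adjustment I mean looks roughly like this in
setup_per_zone_lowmem_reserve() (paraphrased, not an exact quote):

	if (sysctl_lowmem_reserve_ratio[idx] < 1)
		sysctl_lowmem_reserve_ratio[idx] = 1;

	lower_zone->lowmem_reserve[j] = managed_pages /
				sysctl_lowmem_reserve_ratio[idx];

With INT_MAX the computed reserve is effectively 0 (no reservation),
which is the intent here, while 0 or -1 would be clamped to 1 and
produce the largest possible reserve instead.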

> Anyway, it could be a separate patch independent of ZONE_CMA,
> so I hope Andrew merges this patch regardless of the other patches
> in this patchset.

Okay. I will send an updated version soon.

Thanks.

^ permalink raw reply	[flat|nested] 39+ messages in thread

* Re: [PATCH v7 0/7] Introduce ZONE_CMA
  2017-04-17  2:02       ` Joonsoo Kim
@ 2017-04-21  1:35         ` Joonsoo Kim
  2017-04-21  6:54           ` Michal Hocko
  2017-04-24 13:09         ` Michal Hocko
  1 sibling, 1 reply; 39+ messages in thread
From: Joonsoo Kim @ 2017-04-21  1:35 UTC (permalink / raw)
  To: Michal Hocko
  Cc: Andrew Morton, Rik van Riel, Johannes Weiner, mgorman,
	Laura Abbott, Minchan Kim, Marek Szyprowski, Michal Nazarewicz,
	Aneesh Kumar K . V, Vlastimil Babka, Russell King, Will Deacon,
	linux-mm, linux-kernel, kernel-team

On Mon, Apr 17, 2017 at 11:02:12AM +0900, Joonsoo Kim wrote:
> On Thu, Apr 13, 2017 at 01:56:15PM +0200, Michal Hocko wrote:
> > On Wed 12-04-17 10:35:06, Joonsoo Kim wrote:
> > > On Tue, Apr 11, 2017 at 08:15:20PM +0200, Michal Hocko wrote:
> > > > Hi,
> > > > I didn't get to read through the patches yet, but the cover letter didn't
> > > > really help me to understand the basic concepts to have a good starting
> > > > point before diving into implementation details. It contains a lot of
> > > > history remarks which is not bad but IMHO too excessive here. I would
> > > > appreciate the following information (some of that is already provided
> > > > in the cover but could benefit from some rewording/text reorganization).
> > > > 
> > > > - what is ZONE_CMA and how it is configured (from admin POV)
> > > > - how does ZONE_CMA compare to other zones
> > > > - who is allowed to allocate from this zone and what are the
> > > >   guarantees/requirements for successful allocation
> > > > - how does the zone compare to a preallocate allocation pool
> > > > - how is ZONE_CMA balanced/reclaimed due to internal memory pressure
> > > >   (from CMA users)
> > > > - is this zone reclaimable for the global memory reclaim
> > > > - why this was/is controversial
> > > 
> > > Hello,
> > > 
> > > I hope that following summary helps you to understand this patchset.
> > > I skip some basic things about CMA. I will attach this description to
> > > the cover-letter if re-spin is needed.
> > 
> > I believe that sorting out these questions is more important than what
> > you have in the current cover letter. Andrew tends to fold the cover
> > into the first patch so I think you should update.
> 
> Okay.
>  
> > > 2. How does ZONE_CMA compare to other zones
> > > 
> > > ZONE_CMA is conceptually the same with ZONE_MOVABLE. There is a software
> > > constraint to guarantee the success of future allocation request from
> > > the device. If the device requests the specific range of the memory in CMA
> > > area at the runtime, page that allocated by MM will be migrated to
> > > the other page and it will be returned to the device. To guarantee it,
> > > ZONE_CMA only takes the allocation request with GFP_MOVABLE.
> > 
> > The immediate follow up question is. Why cannot we reuse ZONE_MOVABLE
> > for that purpose?
> 
> I can make CMA reuses the ZONE_MOVABLE but I don't want it. Reasons
> are that
> 
> 1. If ZONE_MOVABLE has two different types of memory, hotpluggable and
> CMA, it may need special handling for each type. This would lead to a new
> migratetype again (to distinguish them) and easy to be error-prone. I
> don't want that case.
> 
> 2. CMA users want to see usage stat separately since CMA often causes
> the problems and separate stat would helps to debug it.
> 
> > > The other important point about ZONE_CMA is that span of ZONE_CMA would be
> > > overlapped with the other zone. This is not new to MM subsystem and
> > > MM subsystem has enough logic to handle such situation
> > > so there would be no problem.
> > 
> > I am not really sure this is actually true. Zones are disjoint from the
> > early beginning. I remember that we had something like numa nodes
> > interleaving but that is such a rare configuration that I wouldn't be
> > surprised if it wasn't very well tested and actually broken in some
> > subtle ways.
> 
> I agree with your concern however if something is broken for them, it
> just shows that we need to fix it. MM should handle this situation
> since we already know that such architecture exists.
> 
> > 
> > There are many page_zone(page) != zone checks sprinkled in the code but
> > I do not see anything consistent there. Similarly pageblock_pfn_to_page
> > is only used by compaction but there are other pfn walkers which do
> > ad-hoc checking. I was staring into that code these days due to my
> > hotplug patches.
> >
> > That being said, I think that interleaving zones are an interesting
> > concept but I would be rather nervous to consider this as working
> > currently without a deeper review.
> 
> I have tried to audit all the pfn walkers before and have added above
> mentioned check. Perhaps, I missed something however I believe not
> that much. Our production already have used ZONE_CMA and I haven't get
> the report about such problem.
> 
> > 
> > > Other things are completely the same with other zones. For MM POV, there is
> > > no difference in allocation process except that it only takes
> > > GFP_MOVABLE request. In reclaim, pages that are allocated by MM will
> > > be reclaimed by the same policy of the MM. So, no difference.
> > 
> > OK, so essentially this is yet another "highmem" zone. We already know
> > that only GFP_MOVABLE are allowed to fallback to ZONE_CMA but do CMA
> > allocations fallback to other zones and punch new holes? In which zone
> > order?
> 
> Hmm... I don't understand your question. Could you elaborate it more?
> 
> > > This 'no difference' is a strong point of this approach. ZONE_CMA is
> > > naturally handled by MM subsystem unlike as before (special handling is
> > > required for MIGRATE_CMA).
> > > 
> > > 3. Controversial Point
> > > 
> > > Major concern from Mel is that zone concept is abused. ZONE is originally
> > > introduced to solve some issues due to H/W addressing limitation.
> > 
> > Yes, very much agreed on that. You basically want to punch holes into
> > other zones to guarantee an allocation progress. Marking those wholes
> > with special migrate type sounds quite natural but I will have to study
> > the current code some more to see whether issues you mention are
> > inherently unfixable. This might very well turn out to be the case.
> 
> At a glance, special migratetype sound natural. I also did. However,
> it's not natural in implementation POV. Zone consists of the same type
> of memory (by definition ?) and MM subsystem is implemented with that
> assumption. If difference type of memory shares the same zone, it easily
> causes the problem and CMA problems are the such case.

Hello, Michal.

If you don't have any more questions, I will send the next version with
an updated cover-letter.

Thanks.

^ permalink raw reply	[flat|nested] 39+ messages in thread

* Re: [PATCH v7 0/7] Introduce ZONE_CMA
  2017-04-21  1:35         ` Joonsoo Kim
@ 2017-04-21  6:54           ` Michal Hocko
  0 siblings, 0 replies; 39+ messages in thread
From: Michal Hocko @ 2017-04-21  6:54 UTC (permalink / raw)
  To: Joonsoo Kim
  Cc: Andrew Morton, Rik van Riel, Johannes Weiner, mgorman,
	Laura Abbott, Minchan Kim, Marek Szyprowski, Michal Nazarewicz,
	Aneesh Kumar K . V, Vlastimil Babka, Russell King, Will Deacon,
	linux-mm, linux-kernel, kernel-team

On Fri 21-04-17 10:35:03, Joonsoo Kim wrote:
[...]
> Hello, Michal.
> 
> If you don't have any more question, I will send next version with
> updated cover-letter.

I am sorry but I am busy as hell this week and didn't get to your email
yet. I will try as soon as possible.
-- 
Michal Hocko
SUSE Labs

^ permalink raw reply	[flat|nested] 39+ messages in thread

* Re: [PATCH v7 0/7] Introduce ZONE_CMA
  2017-04-11  3:17 [PATCH v7 0/7] Introduce ZONE_CMA js1304
                   ` (8 preceding siblings ...)
  2017-04-12  1:39 ` Joonsoo Kim
@ 2017-04-24  4:08 ` Bob Liu
  9 siblings, 0 replies; 39+ messages in thread
From: Bob Liu @ 2017-04-24  4:08 UTC (permalink / raw)
  To: js1304, Andrew Morton
  Cc: Rik van Riel, Johannes Weiner, mgorman, Laura Abbott,
	Minchan Kim, Marek Szyprowski, Michal Nazarewicz,
	Aneesh Kumar K . V, Vlastimil Babka, Russell King, Will Deacon,
	linux-mm, linux-kernel, kernel-team, Joonsoo Kim

On 2017/4/11 11:17, js1304@gmail.com wrote:
> From: Joonsoo Kim <iamjoonsoo.kim@lge.com>
> 
> Changed from v6
> o Rebase on next-20170405
> o Add a fix for lowmem mapping on ARM (last patch)
> o Re-organize the cover letter
> 
> Changes from v5
> o Rebase on next-20161013
> o Cosmetic change on patch 1
> o Optimize span of ZONE_CMA on multiple node system
> 
> Changes from v4
> o Rebase on next-20160825
> o Add general fix patch for lowmem reserve
> o Fix lowmem reserve ratio
> o Fix zone span optimizaion per Vlastimil
> o Fix pageset initialization
> o Change invocation timing on cma_init_reserved_areas()
> 
> Changes from v3
> o Rebase on next-20160805
> o Split first patch per Vlastimil
> o Remove useless function parameter per Vlastimil
> o Add code comment per Vlastimil
> o Add following description on cover-letter
> 
> Changes from v2
> o Rebase on next-20160525
> o No other changes except following description
> 
> Changes from v1
> o Separate some patches which deserve to submit independently
> o Modify description to reflect current kernel state
> (e.g. high-order watermark problem disappeared by Mel's work)
> o Don't increase SECTION_SIZE_BITS to make a room in page flags
> (detailed reason is on the patch that adds ZONE_CMA)
> o Adjust ZONE_CMA population code
> 
> 
> Hello,
> 
> This is the 7th version of ZONE_CMA patchset. One patch is added
> to fix potential problem on ARM. Other changes are just due to rebase.
> 
> This patchset has long history and got some reviews before. This
> cover-letter has the summary and my opinion on those reviews. Content
> order is so confusing so I make a simple index. If anyone want to
> understand the history properly, please read them by reverse order.
> 
> PART 1. Strong points of the zone approach
> PART 2. Summary in LSF/MM 2016 discussion
> PART 3. Original motivation of this patchset
> 
> ***** PART 1 *****
> 
> CMA has many problems and I mentioned them on the bottom of the
> cover letter. These problems comes from limitation of CMA memory that
> should be always migratable for device usage. I think that introducing
> a new zone is the best approach to solve them. Here are the reasons.
> 
> Zone is introduced to solve some issues due to H/W addressing limitation.
> MM subsystem is implemented to work efficiently with these zones.
> Allocation/reclaim logic in MM consider this limitation very much.
> What I did in this patchset is introducing a new zone and extending zone's
> concept slightly. New concept is that zone can have not only H/W addressing
> limitation but also S/W limitation to guarantee page migration.
> This concept is originated from ZONE_MOVABLE and it works well
> for a long time. So, ZONE_CMA should not be special at this moment.
> 
> There is a major concern from Mel that ZONE_MOVABLE which has
> S/W limitation causes highmem/lowmem problem. Highmem/lowmem problem is
> that some of memory cannot be usable for kernel memory due to limitation
> of the zone. It causes to break LRU ordering and makes hard to find kernel
> usable memory when memory pressure.
> 
> However, important point is that this problem doesn't come from
> implementation detail (ZONE_MOVABLE/MIGRATETYPE). Even if we implement it
> by MIGRATETYPE instead of by ZONE_MOVABLE, we cannot use that type of
> memory for kernel allocation because it isn't migratable. So, it will cause
> to break LRU ordering, too. We cannot avoid the problem in any case.
> Therefore, we should focus on which solution is better for maintenance
> and not intrusive for MM subsystem.
> 
> In this viewpoint, I think that zone approach is better. As mentioned
> earlier, MM subsystem already have many infrastructures to deal with
> zone's H/W addressing limitation. Adding S/W limitation on zone concept
> and adding a new zone doesn't change anything. It will work by itself.
> My patchset can remove many hooks related to CMA area management in MM
> while solving the problems. More hooks are required to solve the problems
> if we choose MIGRATETYPE approach.
> 

Agree, there are already too many hooks and they are a pain to maintain/bugfix.
It looks better to choose this ZONE_CMA approach.

--
Regards,
Bob Liu

^ permalink raw reply	[flat|nested] 39+ messages in thread

* Re: [PATCH v7 0/7] Introduce ZONE_CMA
  2017-04-17  2:02       ` Joonsoo Kim
  2017-04-21  1:35         ` Joonsoo Kim
@ 2017-04-24 13:09         ` Michal Hocko
  2017-04-25  3:42           ` Joonsoo Kim
  1 sibling, 1 reply; 39+ messages in thread
From: Michal Hocko @ 2017-04-24 13:09 UTC (permalink / raw)
  To: Joonsoo Kim
  Cc: Andrew Morton, Rik van Riel, Johannes Weiner, mgorman,
	Laura Abbott, Minchan Kim, Marek Szyprowski, Michal Nazarewicz,
	Aneesh Kumar K . V, Vlastimil Babka, Russell King, Will Deacon,
	linux-mm, linux-kernel, kernel-team

On Mon 17-04-17 11:02:12, Joonsoo Kim wrote:
> On Thu, Apr 13, 2017 at 01:56:15PM +0200, Michal Hocko wrote:
> > On Wed 12-04-17 10:35:06, Joonsoo Kim wrote:
[...]
> > > ZONE_CMA is conceptually the same with ZONE_MOVABLE. There is a software
> > > constraint to guarantee the success of future allocation request from
> > > the device. If the device requests the specific range of the memory in CMA
> > > area at the runtime, page that allocated by MM will be migrated to
> > > the other page and it will be returned to the device. To guarantee it,
> > > ZONE_CMA only takes the allocation request with GFP_MOVABLE.
> > 
> > The immediate follow up question is. Why cannot we reuse ZONE_MOVABLE
> > for that purpose?
> 
> I can make CMA reuses the ZONE_MOVABLE but I don't want it. Reasons
> are that
> 
> 1. If ZONE_MOVABLE has two different types of memory, hotpluggable and
> CMA, it may need special handling for each type. This would lead to a new
> migratetype again (to distinguish them) and easy to be error-prone. I
> don't want that case.

Hmm, I see your motivation. I believe that we could find a way
around this. Anyway, movable zones are quite special and configuring
overlapping CMA and hotplug movable regions could be refused. So I am
not even sure this is a real problem in practice.

> 2. CMA users want to see usage stat separately since CMA often causes
> the problems and separate stat would helps to debug it.

That could be solved by a per-zone/node counter.

Anyway, these reasons should be mentioned as well. Adding a new zone is
not for free. For most common configurations where we have ZONE_DMA,
ZONE_DMA32, ZONE_NORMAL and ZONE_MOVABLE all the 3 bits are already
consumed so a new zone will need a new one AFAICS.

[...]
> > > Other things are completely the same with other zones. For MM POV, there is
> > > no difference in allocation process except that it only takes
> > > GFP_MOVABLE request. In reclaim, pages that are allocated by MM will
> > > be reclaimed by the same policy of the MM. So, no difference.
> > 
> > OK, so essentially this is yet another "highmem" zone. We already know
> > that only GFP_MOVABLE are allowed to fallback to ZONE_CMA but do CMA
> > allocations fallback to other zones and punch new holes? In which zone
> > order?
> 
> Hmm... I don't understand your question. Could you elaborate it more?

Well, my question was about the zone fallback chain. MOVABLE allocation
can fallback to lower zones and also to the ZONE_CMA with your patch. If
there is a CMA allocation it doesn't fall back to any other zone - in
other words no new holes are punched to other zones. Is this correct?

> > > This 'no difference' is a strong point of this approach. ZONE_CMA is
> > > naturally handled by MM subsystem unlike as before (special handling is
> > > required for MIGRATE_CMA).
> > > 
> > > 3. Controversial Point
> > > 
> > > Major concern from Mel is that zone concept is abused. ZONE is originally
> > > introduced to solve some issues due to H/W addressing limitation.
> > 
> > Yes, very much agreed on that. You basically want to punch holes into
> > other zones to guarantee an allocation progress. Marking those wholes
> > with special migrate type sounds quite natural but I will have to study
> > the current code some more to see whether issues you mention are
> > inherently unfixable. This might very well turn out to be the case.
> 
> At a glance, special migratetype sound natural. I also did. However,
> it's not natural in implementation POV. Zone consists of the same type
> of memory (by definition ?) and MM subsystem is implemented with that
> assumption. If difference type of memory shares the same zone, it easily
> causes the problem and CMA problems are the such case.

But this is not any different from the highmem vs. lowmem problems we
already have, no? I have looked at your example in the cover where you
mention utilization and the reclaim problems. With the node reclaim we
will have pages from all zones on the same LRU(s). isolate_lru_pages
will skip those from ZONE_CMA because their zone_idx is higher than
gfp_idx(GFP_KERNEL). The same could be achieved by an explicit check for
the pageblock migrate type. So the zone doesn't really help much. Or is
there some aspect that I am missing?

Another worry I would have with the zone approach is that there is a
risk of reintroducing issues we used to have with small zones in the
past. Just consider that the CMA zone will get depleted by CMA users almost
completely. That zone will then not get balanced with only a few pages.
wakeup_kswapd/pgdat_balanced already has measures to prevent such wakeups,
but I cannot say I would be sure everything will work smoothly.

I have glanced through the cumulative diff and to be honest I am not
really sure the result is a great simplification in the end. There is
still quite a lot of special casing. It is true that the page allocator
path is cleaned up and some CMA specific checks are moved away. This is
definitely good to see but I am not convinced that the new zone is
really justified. Only very little from the zone infrastructure is used
in the end AFAICS. Is there any specific usecase which cannot be solved
with the pageblock approach while it could be with the zone approach? That would
be a strong argument to choose one over the other.

Please do _not_ take this as a NAK from me. At least not at this time. I
am still trying to understand all the consequences but my intuition
tells me that building on top of highmem like approach will turn out to
be problematic in future (as we have already seen with the highmem and
movable zones) so this needs a very prudent consideration.
-- 
Michal Hocko
SUSE Labs

^ permalink raw reply	[flat|nested] 39+ messages in thread

* Re: [PATCH v7 0/7] Introduce ZONE_CMA
  2017-04-24 13:09         ` Michal Hocko
@ 2017-04-25  3:42           ` Joonsoo Kim
  2017-04-27 15:06             ` Michal Hocko
  0 siblings, 1 reply; 39+ messages in thread
From: Joonsoo Kim @ 2017-04-25  3:42 UTC (permalink / raw)
  To: Michal Hocko
  Cc: Andrew Morton, Rik van Riel, Johannes Weiner, mgorman,
	Laura Abbott, Minchan Kim, Marek Szyprowski, Michal Nazarewicz,
	Aneesh Kumar K . V, Vlastimil Babka, Russell King, Will Deacon,
	linux-mm, linux-kernel, kernel-team

On Mon, Apr 24, 2017 at 03:09:36PM +0200, Michal Hocko wrote:
> On Mon 17-04-17 11:02:12, Joonsoo Kim wrote:
> > On Thu, Apr 13, 2017 at 01:56:15PM +0200, Michal Hocko wrote:
> > > On Wed 12-04-17 10:35:06, Joonsoo Kim wrote:
> [...]
> > > > ZONE_CMA is conceptually the same with ZONE_MOVABLE. There is a software
> > > > constraint to guarantee the success of future allocation request from
> > > > the device. If the device requests the specific range of the memory in CMA
> > > > area at the runtime, page that allocated by MM will be migrated to
> > > > the other page and it will be returned to the device. To guarantee it,
> > > > ZONE_CMA only takes the allocation request with GFP_MOVABLE.
> > > 
> > > The immediate follow up question is. Why cannot we reuse ZONE_MOVABLE
> > > for that purpose?
> > 
> > I can make CMA reuses the ZONE_MOVABLE but I don't want it. Reasons
> > are that
> > 
> > 1. If ZONE_MOVABLE has two different types of memory, hotpluggable and
> > CMA, it may need special handling for each type. This would lead to a new
> > migratetype again (to distinguish them) and easy to be error-prone. I
> > don't want that case.
> 
> Hmm, I see your motivation. I believe that we could find a way
> around this. Anyway, movable zones are quite special and configuring
> overlapping CMA and hotplug movable regions could be refused. So I am
> not even sure this is a real problem in practice.
> 
> > 2. CMA users want to see usage stat separately since CMA often causes
> > the problems and separate stat would helps to debug it.
> 
> That could be solved by a per-zone/node counter.
> 
> Anyway, these reasons should be mentioned as well. Adding a new zone is

Okay.

> not for free. For most common configurations where we have ZONE_DMA,
> ZONE_DMA32, ZONE_NORMAL and ZONE_MOVABLE all the 3 bits are already
> consumed so a new zone will need a new one AFAICS.

Yes, it requires one more bit for a new zone and it's handled by the patch.
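
To spell out the arithmetic: the zone id is encoded in a few page->flags
bits, so crossing a power of two in the number of zones costs one more
bit. A toy calculation only (not kernel code):

#include <stdio.h>

/* Smallest number of bits that can encode nr_zones distinct zone ids. */
static int zone_id_bits(int nr_zones)
{
	int bits = 0;

	while ((1 << bits) < nr_zones)
		bits++;
	return bits;
}

int main(void)
{
	printf("4 zones need %d bits\n", zone_id_bits(4));	/* 2 */
	printf("5 zones need %d bits\n", zone_id_bits(5));	/* 3 */
	return 0;
}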

> 
> [...]
> > > > Other things are completely the same with other zones. For MM POV, there is
> > > > no difference in allocation process except that it only takes
> > > > GFP_MOVABLE request. In reclaim, pages that are allocated by MM will
> > > > be reclaimed by the same policy of the MM. So, no difference.
> > > 
> > > OK, so essentially this is yet another "highmem" zone. We already know
> > > that only GFP_MOVABLE are allowed to fallback to ZONE_CMA but do CMA
> > > allocations fallback to other zones and punch new holes? In which zone
> > > order?
> > 
> > Hmm... I don't understand your question. Could you elaborate it more?
> 
> Well, my question was about the zone fallback chain. MOVABLE allocation
> can fallback to lower zones and also to the ZONE_CMA with your patch. If
> there is a CMA allocation it doesn't fall back to any other zone - in
> other words no new holes are punched to other zones. Is this correct?

Hmm... I still don't get the meaning of "no new holes are punched to
other zones". I will try to answer based on my current understanding of your
question.

A MOVABLE allocation will fall back in the following sequence.

ZONE_CMA -> ZONE_MOVABLE -> ZONE_HIGHMEM -> ZONE_NORMAL -> ...

I don't understand what you mean by CMA allocation. In the MM context,
there is no CMA allocation. That is just a MOVABLE allocation.

In the device context, there is CMA allocation. It is a range-specific
allocation, so it must succeed for the requested range. No fallback is
allowed in this case.
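
To make that order concrete, here is a stripped-down sketch of the walk
(purely illustrative; the real allocator iterates a per-node zonelist
built at boot, this is not that code):

#include <stdio.h>

/* Illustrative zone ids; DMA/DMA32 and friends below ZONE_NORMAL elided. */
enum zone_type { ZONE_NORMAL, ZONE_HIGHMEM, ZONE_MOVABLE, ZONE_CMA, MAX_NR_ZONES };

static const char * const zone_names[MAX_NR_ZONES] = {
	"Normal", "HighMem", "Movable", "CMA",
};

/*
 * A movable allocation may use every zone up to and including ZONE_CMA,
 * while a kernel allocation stops at ZONE_NORMAL.  Walking from the
 * highest allowed zone downwards gives exactly the order above.
 */
static void walk_zones(enum zone_type highest)
{
	for (int z = highest; z >= ZONE_NORMAL; z--)
		printf("  try %s\n", zone_names[z]);
}

int main(void)
{
	printf("movable allocation:\n");
	walk_zones(ZONE_CMA);
	printf("kernel allocation:\n");
	walk_zones(ZONE_NORMAL);
	return 0;
}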

> > > > This 'no difference' is a strong point of this approach. ZONE_CMA is
> > > > naturally handled by MM subsystem unlike as before (special handling is
> > > > required for MIGRATE_CMA).
> > > > 
> > > > 3. Controversial Point
> > > > 
> > > > Major concern from Mel is that zone concept is abused. ZONE is originally
> > > > introduced to solve some issues due to H/W addressing limitation.
> > > 
> > > Yes, very much agreed on that. You basically want to punch holes into
> > > other zones to guarantee an allocation progress. Marking those wholes
> > > with special migrate type sounds quite natural but I will have to study
> > > the current code some more to see whether issues you mention are
> > > inherently unfixable. This might very well turn out to be the case.
> > 
> > At a glance, special migratetype sound natural. I also did. However,
> > it's not natural in implementation POV. Zone consists of the same type
> > of memory (by definition ?) and MM subsystem is implemented with that
> > assumption. If difference type of memory shares the same zone, it easily
> > causes the problem and CMA problems are the such case.
> 
> But this is not any different from the highmem vs. lowmem problems we
> already have, no? I have looked at your example in the cover where you
> mention utilization and the reclaim problems. With the node reclaim we
> will have pages from all zones on the same LRU(s). isolate_lru_pages
> will skip those from ZONE_CMA because their zone_idx is higher than
> gfp_idx(GFP_KERNEL). The same could be achieved by an explicit check for
> the pageblock migrate type. So the zone doesn't really help much. Or is
> there some aspect that I am missing?

Your understanding is correct. It can be achieved by an explicit check
for the migratetype. And, this is the main reason that we should avoid
such an approach.

With the ZONE approach, all these things are done naturally. We don't need
any explicit check anywhere. We already have code that skips reclaiming
such pages by checking zone_idx.

However, with the MIGRATETYPE approach, all these things *cannot* be done
naturally. We need extra checks in all the places (allocator fast
path, reclaim path, compaction, etc...). It is really error-prone and
it already causes many problems due to this aspect. Performance-wise,
this approach is also bad since it requires checking the migratetype of
each page.
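
A rough sketch of the difference (illustrative stand-ins only, not the
kernel's structures or the actual reclaim code): the zone check is a
single comparison on an index the scanner already has, while the
migratetype check needs an extra per-page pageblock lookup in every such
path.

/* Illustrative stand-ins; these are not the kernel's structures. */
enum zone_type { ZONE_NORMAL, ZONE_HIGHMEM, ZONE_MOVABLE, ZONE_CMA };
enum migratetype { MIGRATE_MOVABLE, MIGRATE_CMA };

struct page_stub {
	enum zone_type zone;		/* what page_zonenum() would report */
	enum migratetype pageblock_mt;	/* what get_pageblock_migratetype() would report */
};

/* ZONE approach: the existing highmem-style filter covers CMA for free. */
static int skip_for_kernel_use_zone(const struct page_stub *page,
				    enum zone_type reclaim_idx)
{
	return page->zone > reclaim_idx;	/* ZONE_CMA > zone index of GFP_KERNEL */
}

/* MIGRATETYPE approach: an additional per-page check has to be sprinkled in. */
static int skip_for_kernel_use_migratetype(const struct page_stub *page,
					   enum zone_type reclaim_idx)
{
	return page->zone > reclaim_idx ||
	       page->pageblock_mt == MIGRATE_CMA;	/* the extra hook */
}

int main(void)
{
	struct page_stub zone_world = { ZONE_CMA, MIGRATE_MOVABLE };
	struct page_stub mt_world   = { ZONE_NORMAL, MIGRATE_CMA };

	/* Both end up skipped for GFP_KERNEL-style reclaim, but via different checks. */
	return !(skip_for_kernel_use_zone(&zone_world, ZONE_NORMAL) &&
		 skip_for_kernel_use_migratetype(&mt_world, ZONE_NORMAL));
}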

Moreover, even if we add extra checks, things cannot easily be made
perfect. See 3) the atomic allocation failure problem. It's an inherent
problem if we have different types of memory in a single zone.
We could possibly make things perfect even with the MIGRATETYPE approach,
however, it requires more checks in the hotpath than we have now. It's
expensive and undesirable. It will make future maintenance of the MM code
much more difficult.

This is why I prefer the ZONE approach.


> 
> Another worry I would have with the zone approach is that there is a
> risk to reintroduce issues we used to have with small zones in the
> past. Just consider that the CMA will get depleted by CMA users almost
> completely. Now that zone will not get balanced with only few pages.
> wakeup_kswapd/pgdat_balanced already has measures to prevent from wake
> ups but I cannot say I would be sure everything will work smoothly.

If there is a small zone problem, it should be fixed in any case.
There are many workloads that allocate memory almost completely
and don't return it back to the page allocator.

> I have glanced through the cumulative diff and to be honest I am not
> really sure the result is a great simplification in the end. There is
> still quite a lot of special casing. It is true that the page allocator

Special casing is mostly for initialization. We cannot avoid such things
since CMA isn't normal memory. We have just two choices.

1) ZONE: special casing in the initialization phase
2) MIGRATETYPE: special casing in the runtime phase

And, I choose 1).

> path is cleaned up and some CMA specific checks are moved away. This is
> definitely good to see but I am not convinced that the new zone is
> really justified. Only very little from the zone infrastructure is used
> in the end AFAICS. Is there any specific usecase which cannot be solved
> with the pageblock while it could be with the zone approach? That would
> be a strong argument to chose one over another.

As I mentioned above, the atomic allocation failure problem is a somewhat
inherent one. Another example is described in Vlastimil's reply in
another thread.

lkml.kernel.org/r/ae9c5714-ff45-05ea-6a10-976c311b5742@suse.cz
> 
> Please do _not_ take this as a NAK from me. At least not at this time. I
> am still trying to understand all the consequences but my intuition
> tells me that building on top of highmem like approach will turn out to
> be problematic in future (as we have already seen with the highmem and
> movable zones) so this needs a very prudent consideration.

I can understand that you are being prudent about this issue. However, it has taken
more than two years and many people have already expressed that the ZONE approach
is the way to go.

As I said before, some problems are due to the S/W limitation of the CMA
memory, not due to the implementation. Even if we choose the MIGRATETYPE
approach, some of the problems that you have seen with the highmem and
movable zones will still exist. So, it should not be a criterion for the
implementation decision.

Thanks.

^ permalink raw reply	[flat|nested] 39+ messages in thread

* Re: [PATCH v7 0/7] Introduce ZONE_CMA
  2017-04-25  3:42           ` Joonsoo Kim
@ 2017-04-27 15:06             ` Michal Hocko
  2017-04-28  8:04               ` Generic approach to customizable zones - was: " Igor Stoppa
                                 ` (2 more replies)
  0 siblings, 3 replies; 39+ messages in thread
From: Michal Hocko @ 2017-04-27 15:06 UTC (permalink / raw)
  To: Joonsoo Kim
  Cc: Andrew Morton, Rik van Riel, Johannes Weiner, mgorman,
	Laura Abbott, Minchan Kim, Marek Szyprowski, Michal Nazarewicz,
	Aneesh Kumar K . V, Vlastimil Babka, Russell King, Will Deacon,
	linux-mm, linux-kernel, kernel-team

On Tue 25-04-17 12:42:57, Joonsoo Kim wrote:
> On Mon, Apr 24, 2017 at 03:09:36PM +0200, Michal Hocko wrote:
> > On Mon 17-04-17 11:02:12, Joonsoo Kim wrote:
> > > On Thu, Apr 13, 2017 at 01:56:15PM +0200, Michal Hocko wrote:
> > > > On Wed 12-04-17 10:35:06, Joonsoo Kim wrote:
[...]
> > not for free. For most common configurations where we have ZONE_DMA,
> > ZONE_DMA32, ZONE_NORMAL and ZONE_MOVABLE all the 3 bits are already
> > consumed so a new zone will need a new one AFAICS.
> 
> Yes, it requires one more bit for a new zone and it's handled by the patch.

I am pretty sure that you are aware that consuming new page flag bits
is usually a no-go and something we try to avoid as much as possible
because we are in a great shortage there. So there really have to be a
_strong_ reason if we go that way. My current understanding that the
whole zone concept is more about a more convenient implementation rather
than a fundamental change which will solve unsolvable problems with the
current approach. More on that below.

[...]
> MOVABLE allocation will fallback as following sequence.
> 
> ZONE_CMA -> ZONE_MOVABLE -> ZONE_HIGHMEM -> ZONE_NORMAL -> ...
> 
> I don't understand what you mean CMA allocation. In MM's context,
> there is no CMA allocation. That is just MOVABLE allocation.
> 
> For device's context, there is CMA allocation. It is range specific
> allocation so it should be succeed for requested range. No fallback is
> allowed in this case.

OK. that answers my question. I guess... My main confusion comes from
__alloc_gigantic_page which shares alloc_contig_range with the cma
allocation. But from what you wrote above and my quick glance over the
code __alloc_gigantic_page simply changes the migrate type of the pfn
range and it doesn't move it to the zone CMA. Right?

[...]
> > > At a glance, special migratetype sound natural. I also did. However,
> > > it's not natural in implementation POV. Zone consists of the same type
> > > of memory (by definition ?) and MM subsystem is implemented with that
> > > assumption. If difference type of memory shares the same zone, it easily
> > > causes the problem and CMA problems are the such case.
> > 
> > But this is not any different from the highmem vs. lowmem problems we
> > already have, no? I have looked at your example in the cover where you
> > mention utilization and the reclaim problems. With the node reclaim we
> > will have pages from all zones on the same LRU(s). isolate_lru_pages
> > will skip those from ZONE_CMA because their zone_idx is higher than
> > gfp_idx(GFP_KERNEL). The same could be achieved by an explicit check for
> > the pageblock migrate type. So the zone doesn't really help much. Or is
> > there some aspect that I am missing?
> 
> Your understanding is correct. It can archieved by an explict check
> for migratetype. And, this is the main reason that we should avoid
> such approach.
> 
> With ZONE approach, all these things are done naturally. We don't need
> any explicit check to anywhere. We already have a code to skip to
> reclaim such pages by checking zone_idx.

Yes, and as we have to filter pages anyway doing so for cma blocks
doesn't sound overly burdensome from the maintenance point of view.
 
> However, with MIGRATETYPE approach, all these things *cannot* be done
> naturally. We need extra checks to all the places (allocator fast
> path, reclaim path, compaction, etc...). It is really error-prone and
> it already causes many problems due to this aspect. For the
> performance wise, this approach is also bad since it requires to check
> migratetype for each pages.
> 
> Moreover, even if we adds extra checks, things cannot be easily
> perfect.

I see this point and I agree that using a specific zone might be a
_nicer_ solution in the end, but you have to consider other aspects as
well. The main one I am worried about is long term maintainability.
We are really out of page flags and consuming one for a rather specific
usecase is not good. Look at ZONE_DMA. I am pretty sure that almost
no sane HW needs a 16MB zone anymore, yet we have a hard time getting rid
of it, and so we have that memory lying around unused all the time
and blocking one page flag bit. CMA falls into a similar category
AFAIU. I wouldn't be all that surprised if future HW no longer needed CMA
allocations in a few years, yet we will have to fight to get rid of it
like we do with ZONE_DMA. And not only that. We will also have to fight
to find page flags for other more general usecases in the meantime.

> See 3) Atomic allocation failure problem. It's inherent
> problem if we have different types of memory in a single zone.
> We possibly can make things perfect even with MIGRATETYPE approach,
> however, it requires additional checks in hotpath than current. It's
> expensive and undesirable. It will make future maintenance of MM code
> much difficult.

I believe that the overhead in the hot path is not such a big deal. We
have means to make it zero via jump labels when CMA is not used. I assume
that the vast majority of systems will not use CMA. Those systems which
use CMA should be able to cope with some slight overhead IMHO.
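
Something along these lines is the pattern I mean (only a sketch;
cma_enabled_key and page_is_cma_block() are made-up names, the static key
API itself is the real one):

#include <linux/jump_label.h>
#include <linux/mm.h>

/* Hypothetical key; it would be switched on once at boot if any CMA area exists. */
static DEFINE_STATIC_KEY_FALSE(cma_enabled_key);

static inline bool page_is_cma_block(struct page *page)
{
	/* On !CMA systems this compiles down to a patched-out jump, i.e. ~zero cost. */
	if (!static_branch_unlikely(&cma_enabled_key))
		return false;
	return get_pageblock_migratetype(page) == MIGRATE_CMA;
}

Callers in the hot paths would then test page_is_cma_block() instead of
open-coding is_migrate_cma() checks, and systems without CMA would pay
only the NOP-patched branch.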

I agree that the code maintenance cost is not free. And that is a valid
concern. CMA maintenance will not be free in either case, though (if
for nothing else, because of the page flags space mentioned above). Let's see
what this means for mm/page_alloc.c

 mm/page_alloc.c | 220 ++++++++++++++++++++++++++++----------------------------
 1 file changed, 109 insertions(+), 111 deletions(-)

Not very convincing at first glance, but this can be quite misleading, as
you have already mentioned, because you have moved a lot of code to the
init path. So let's just focus on the allocator hot paths

@@ -800,7 +805,7 @@ static inline void __free_one_page(struct page *page,
 
 	VM_BUG_ON(migratetype == -1);
 	if (likely(!is_migrate_isolate(migratetype)))
-		__mod_zone_freepage_state(zone, 1 << order, migratetype);
+		__mod_zone_page_state(zone, NR_FREE_PAGES, 1 << order);
 
 	VM_BUG_ON_PAGE(pfn & ((1 << order) - 1), page);
 	VM_BUG_ON_PAGE(bad_range(zone, page), page);
@@ -1804,25 +1831,11 @@ static int fallbacks[MIGRATE_TYPES][4] = {
 	[MIGRATE_UNMOVABLE]   = { MIGRATE_RECLAIMABLE, MIGRATE_MOVABLE,   MIGRATE_TYPES },
 	[MIGRATE_RECLAIMABLE] = { MIGRATE_UNMOVABLE,   MIGRATE_MOVABLE,   MIGRATE_TYPES },
 	[MIGRATE_MOVABLE]     = { MIGRATE_RECLAIMABLE, MIGRATE_UNMOVABLE, MIGRATE_TYPES },
-#ifdef CONFIG_CMA
-	[MIGRATE_CMA]         = { MIGRATE_TYPES }, /* Never used */
-#endif
 #ifdef CONFIG_MEMORY_ISOLATION
 	[MIGRATE_ISOLATE]     = { MIGRATE_TYPES }, /* Never used */
 #endif
 };
 
-#ifdef CONFIG_CMA
-static struct page *__rmqueue_cma_fallback(struct zone *zone,
-					unsigned int order)
-{
-	return __rmqueue_smallest(zone, order, MIGRATE_CMA);
-}
-#else
-static inline struct page *__rmqueue_cma_fallback(struct zone *zone,
-					unsigned int order) { return NULL; }
-#endif
-
 /*
  * Move the free pages in a range to the free lists of the requested type.
  * Note that start_page and end_pages are not aligned on a pageblock
@@ -2090,8 +2103,7 @@ static void reserve_highatomic_pageblock(struct page *page, struct zone *zone,
 
 	/* Yoink! */
 	mt = get_pageblock_migratetype(page);
-	if (!is_migrate_highatomic(mt) && !is_migrate_isolate(mt)
-	    && !is_migrate_cma(mt)) {
+	if (!is_migrate_highatomic(mt) && !is_migrate_isolate(mt)) {
 		zone->nr_reserved_highatomic += pageblock_nr_pages;
 		set_pageblock_migratetype(page, MIGRATE_HIGHATOMIC);
 		move_freepages_block(zone, page, MIGRATE_HIGHATOMIC, NULL);
@@ -2235,13 +2247,8 @@ static struct page *__rmqueue(struct zone *zone, unsigned int order,
 
 retry:
 	page = __rmqueue_smallest(zone, order, migratetype);
-	if (unlikely(!page)) {
-		if (migratetype == MIGRATE_MOVABLE)
-			page = __rmqueue_cma_fallback(zone, order);
-
-		if (!page && __rmqueue_fallback(zone, order, migratetype))
-			goto retry;
-	}
+	if (unlikely(!page) && __rmqueue_fallback(zone, order, migratetype))
+		goto retry;
 
 	trace_mm_page_alloc_zone_locked(page, order, migratetype);
 	return page;
@@ -2283,9 +2290,6 @@ static int rmqueue_bulk(struct zone *zone, unsigned int order,
 			list_add_tail(&page->lru, list);
 		list = &page->lru;
 		alloced++;
-		if (is_migrate_cma(get_pcppage_migratetype(page)))
-			__mod_zone_page_state(zone, NR_FREE_CMA_PAGES,
-					      -(1 << order));
 	}
 
 	/*
@@ -2636,10 +2640,10 @@ int __isolate_free_page(struct page *page, unsigned int order)
 		 * exists.
 		 */
 		watermark = min_wmark_pages(zone) + (1UL << order);
-		if (!zone_watermark_ok(zone, 0, watermark, 0, ALLOC_CMA))
+		if (!zone_watermark_ok(zone, 0, watermark, 0, 0))
 			return 0;
 
-		__mod_zone_freepage_state(zone, -(1UL << order), mt);
+		__mod_zone_page_state(zone, NR_FREE_PAGES, -(1UL << order));
 	}
 
 	/* Remove page from free list */
@@ -2655,8 +2659,8 @@ int __isolate_free_page(struct page *page, unsigned int order)
 		struct page *endpage = page + (1 << order) - 1;
 		for (; page < endpage; page += pageblock_nr_pages) {
 			int mt = get_pageblock_migratetype(page);
-			if (!is_migrate_isolate(mt) && !is_migrate_cma(mt)
-			    && !is_migrate_highatomic(mt))
+			if (!is_migrate_isolate(mt) &&
+				!is_migrate_highatomic(mt))
 				set_pageblock_migratetype(page,
 							  MIGRATE_MOVABLE);
 		}
@@ -2783,8 +2787,7 @@ struct page *rmqueue(struct zone *preferred_zone,
 	spin_unlock(&zone->lock);
 	if (!page)
 		goto failed;
-	__mod_zone_freepage_state(zone, -(1 << order),
-				  get_pcppage_migratetype(page));
+	__mod_zone_page_state(zone, NR_FREE_PAGES, -(1 << order));
 
 	__count_zid_vm_events(PGALLOC, page_zonenum(page), 1 << order);
 	zone_statistics(preferred_zone, zone);
@@ -2907,12 +2910,6 @@ bool __zone_watermark_ok(struct zone *z, unsigned int order, unsigned long mark,
 	else
 		min -= min / 4;
 
-#ifdef CONFIG_CMA
-	/* If allocation can't use CMA areas don't use free CMA pages */
-	if (!(alloc_flags & ALLOC_CMA))
-		free_pages -= zone_page_state(z, NR_FREE_CMA_PAGES);
-#endif
-
 	/*
 	 * Check watermarks for an order-0 allocation request. If these
 	 * are not met, then a high-order request also cannot go ahead
@@ -2940,13 +2937,6 @@ bool __zone_watermark_ok(struct zone *z, unsigned int order, unsigned long mark,
 			if (!list_empty(&area->free_list[mt]))
 				return true;
 		}
-
-#ifdef CONFIG_CMA
-		if ((alloc_flags & ALLOC_CMA) &&
-		    !list_empty(&area->free_list[MIGRATE_CMA])) {
-			return true;
-		}
-#endif
 	}
 	return false;
 }
@@ -2962,13 +2952,6 @@ static inline bool zone_watermark_fast(struct zone *z, unsigned int order,
 		unsigned long mark, int classzone_idx, unsigned int alloc_flags)
 {
 	long free_pages = zone_page_state(z, NR_FREE_PAGES);
-	long cma_pages = 0;
-
-#ifdef CONFIG_CMA
-	/* If allocation can't use CMA areas don't use free CMA pages */
-	if (!(alloc_flags & ALLOC_CMA))
-		cma_pages = zone_page_state(z, NR_FREE_CMA_PAGES);
-#endif
 
 	/*
 	 * Fast check for order-0 only. If this fails then the reserves
@@ -2977,7 +2960,7 @@ static inline bool zone_watermark_fast(struct zone *z, unsigned int order,
 	 * the caller is !atomic then it'll uselessly search the free
 	 * list. That corner case is then slower but it is harmless.
 	 */
-	if (!order && (free_pages - cma_pages) > mark + z->lowmem_reserve[classzone_idx])
+	if (!order && free_pages > mark + z->lowmem_reserve[classzone_idx])
 		return true;
 
 	return __zone_watermark_ok(z, order, mark, classzone_idx, alloc_flags,
@@ -3547,10 +3530,6 @@ gfp_to_alloc_flags(gfp_t gfp_mask)
 	} else if (unlikely(rt_task(current)) && !in_interrupt())
 		alloc_flags |= ALLOC_HARDER;
 
-#ifdef CONFIG_CMA
-	if (gfpflags_to_migratetype(gfp_mask) == MIGRATE_MOVABLE)
-		alloc_flags |= ALLOC_CMA;
-#endif
 	return alloc_flags;
 }
 
@@ -3972,9 +3951,6 @@ static inline bool prepare_alloc_pages(gfp_t gfp_mask, unsigned int order,
 	if (should_fail_alloc_page(gfp_mask, order))
 		return false;
 
-	if (IS_ENABLED(CONFIG_CMA) && ac->migratetype == MIGRATE_MOVABLE)
-		*alloc_flags |= ALLOC_CMA;
-
 	return true;
 }
 
This looks like a nice cleanup. Those ifdefs are ugly as hell. One
could argue that some of that could be cleaned up by simply adding some
helpers (with a jump label to reduce the overhead), though. But is this
really a strong enough reason to bring the whole zone in? I am not really
convinced, to be honest.
 
[...]

> > Please do _not_ take this as a NAK from me. At least not at this time. I
> > am still trying to understand all the consequences but my intuition
> > tells me that building on top of highmem like approach will turn out to
> > be problematic in future (as we have already seen with the highmem and
> > movable zones) so this needs a very prudent consideration.
> 
> I can understand that you are prudent to this issue. However, it takes more
> than two years and many people already expressed that ZONE approach is the
> way to go.

I can see a single Acked-by and one Reviewed-by. It would be much more
convincing to see much larger support. Do not get me wrong, I am not
trying to undermine the feedback so far, but we should be clear about one
thing. CMA is mostly motivated by the industry, which tries to overcome
HW limitations that can change very easily in the future. I would rather
see a good enough solution for something like that than a nicer solution
which pushes an additional burden onto more general usecases.

That being said, I would like to see a much larger consensus in the MM
community before a new zone is merged. I am staying very skeptical this
is the right direction though.
-- 
Michal Hocko
SUSE Labs

^ permalink raw reply	[flat|nested] 39+ messages in thread

* Generic approach to customizable zones - was: Re: [PATCH v7 0/7] Introduce ZONE_CMA
  2017-04-27 15:06             ` Michal Hocko
@ 2017-04-28  8:04               ` Igor Stoppa
  2017-04-28  8:36                 ` Michal Hocko
  2017-05-02  4:01               ` Joonsoo Kim
  2017-05-02  8:06               ` Vlastimil Babka
  2 siblings, 1 reply; 39+ messages in thread
From: Igor Stoppa @ 2017-04-28  8:04 UTC (permalink / raw)
  To: Michal Hocko, Joonsoo Kim
  Cc: Andrew Morton, Rik van Riel, Johannes Weiner, mgorman,
	Laura Abbott, Minchan Kim, Marek Szyprowski, Michal Nazarewicz,
	Aneesh Kumar K . V, Vlastimil Babka, Russell King, Will Deacon,
	linux-mm, linux-kernel, kernel-team

On 27/04/17 18:06, Michal Hocko wrote:
> On Tue 25-04-17 12:42:57, Joonsoo Kim wrote:

[...]

>> Yes, it requires one more bit for a new zone and it's handled by the patch.
> 
> I am pretty sure that you are aware that consuming new page flag bits
> is usually a no-go and something we try to avoid as much as possible
> because we are in a great shortage there. So there really have to be a
> _strong_ reason if we go that way. My current understanding that the
> whole zone concept is more about a more convenient implementation rather
> than a fundamental change which will solve unsolvable problems with the
> current approach. More on that below.

Since I am in a similar situation, I think it's better if I join this
conversation instead of going through the same in a separate thread.

In this regard, I have a few observations (are they correct?):

* not everyone seems to be interested in having all the current
  zones active simultaneously

* some zones are even not so meaningful on certain architectures or
  platforms

* some architectures/platforms that are 64 bits would have no penalty
  in dealing with a larger data type.

So I wonder, would anybody be against this:

* within the 32bits constraint, define some optional zones

* decouple the specific position of a bit from the zone it represents;
  iow: if the zone is enabled, ensure that it gets a bit in the mask,
  but do not make promises about which one it is, provided that the
  corresponding macros work properly

* ensure that if one selects more optional zones than there are bits
  available (in the case of a 32bits mask), an error is produced at
  compile time

* if one is happy to have a 64bits type, allow for as many zones as
  it's possible to fit, or anyway more than what is possible with
  the 32 bit mask.

I think I can re-factor the code so that there is no runtime performance
degradation, if there is no immediate objection to what I described. Or
maybe I failed to notice some obvious pitfall?

From what I see, there seems to be a lot of interest in using functions
like kmalloc / vmalloc, with the ability to specify pseudo-custom
areas they should tap into.

Why not, as long as those who do not need it are not negatively impacted?

I understand that if the association between bits and zones is fixed,
then suddenly bits become very precious stuff, but if they could be used
in a more efficient way, then maybe they could be used more liberally.

The alternative is to keep getting requests about new zones and turning
them away because they do not pass the bar of being extremely critical,
even if indeed they would simplify people's life.


The change shouldn't be too ugly, if I do something along the lines of
the pseudo code below.
Note: the #ifdefs would be mainly concentrated in the declaration part.

enum gfp_zone_shift {
#if IS_ENABLED(CONFIG_ZONE_DMA)
/*I haven't checked if this is the correct name, but it gives the idea*/
        ZONE_DMA_SHIFT = 0,
#endif
#if IS_ENABLED(CONFIG_ZONE_HIGHMEM)
        ZONE_HIGHMEM_SHIFT,
#endif
#if IS_ENABLED(CONFIG_ZONE_DMA32)
        ZONE_DMA32_SHIFT,
#endif
#if IS_ENABLED(CONFIG_ZONE_xxx)
        ZONE_xxx,
#endif
       NON_OPTIONAL_ZONE_SHIFT,
       ...
       USED_ZONES_NUMBER,
       ZONE_MOVABLE_SHIFT = USED_ZONES_NUMBER,
       ...
};

#if USED_ZONES_NUMBER < MAX_ZONES_32BITS
typedef uint32_t gfp_zones_t;
#elif IS_ENABLED(CONFIG_ZONES_64BITS)
typedef uint64_t gfp_zones_t;
#else
#error
#endif

The type should be adjusted in other places where it is used, but I
didn't find too many occurrences.

#define __ZONE_DMA \
          (((gfp_zones_t)IS_ENABLED(CONFIG_ZONE_DMA)) << \
           (ZONE_DMA_SHIFT - 0))

[rinse and repeat]

Code referring to these optional zones can be sandboxed in

#if IS_ENABLED(CONFIG_ZONE_DMA)

static inline void do_something_dma(void) {
   ....
}

#else
#define do_something_dma()
#endif


Or equivalent, effectively removing many #ifdefs from the main code of
functions like those called by kmalloc.


So, would this approach stand a chance?


thanks, igor

^ permalink raw reply	[flat|nested] 39+ messages in thread

* Re: Generic approach to customizable zones - was: Re: [PATCH v7 0/7] Introduce ZONE_CMA
  2017-04-28  8:04               ` Generic approach to customizable zones - was: " Igor Stoppa
@ 2017-04-28  8:36                 ` Michal Hocko
  2017-04-28  9:04                   ` Igor Stoppa
  0 siblings, 1 reply; 39+ messages in thread
From: Michal Hocko @ 2017-04-28  8:36 UTC (permalink / raw)
  To: Igor Stoppa
  Cc: Joonsoo Kim, Andrew Morton, Rik van Riel, Johannes Weiner,
	mgorman, Laura Abbott, Minchan Kim, Marek Szyprowski,
	Michal Nazarewicz, Aneesh Kumar K . V, Vlastimil Babka,
	Russell King, Will Deacon, linux-mm, linux-kernel, kernel-team

I didn't read this thoroughly yet because I will be travelling shortly, but
this point alone just made me ask, because it seems there is some
misunderstanding

On Fri 28-04-17 11:04:27, Igor Stoppa wrote:
[...]
> * if one is happy to have a 64bits type, allow for as many zones as
>   it's possible to fit, or anyway more than what is possible with
>   the 32 bit mask.

zones are currently placed in struct page::flags. And that is already
64 bits on 64-bit arches, and we do not really have any spare room there.
We encode page flags, zone id, numa_nid/sparse section_nr there. How can
you add more without enlarging struct page itself or using external
means to store the same information (page_ext comes to mind)? Even if
the latter were possible, note that page_zone() is used in many
performance-sensitive paths and making it perform well with special
casing would be far from trivial.
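
To illustrate what "placed in struct page::flags" means, a simplified
user-space sketch (the widths below are made up for the example; the real
layout is derived per config in include/linux/page-flags-layout.h):

#include <stdio.h>

/* Made-up widths for the example; the kernel derives these per config. */
#define ZONES_WIDTH	2
#define NODES_WIDTH	6
#define ZONES_PGSHIFT	(64 - ZONES_WIDTH)
#define NODES_PGSHIFT	(ZONES_PGSHIFT - NODES_WIDTH)
#define ZONES_MASK	((1UL << ZONES_WIDTH) - 1)
#define NODES_MASK	((1UL << NODES_WIDTH) - 1)

struct page_stub {
	unsigned long flags;	/* page flags + zone id + node id packed together */
};

static unsigned long stub_zonenum(const struct page_stub *page)
{
	return (page->flags >> ZONES_PGSHIFT) & ZONES_MASK;
}

int main(void)
{
	struct page_stub page = { 0 };

	/* Pack zone 3 and node 5 into the upper bits of flags. */
	page.flags |= 3UL << ZONES_PGSHIFT;
	page.flags |= 5UL << NODES_PGSHIFT;

	printf("zone id %lu, node id %lu\n", stub_zonenum(&page),
	       (page.flags >> NODES_PGSHIFT) & NODES_MASK);
	return 0;
}

One more zone bit has to come out of that same word, at the expense of
whatever else shares it.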
-- 
Michal Hocko
SUSE Labs

^ permalink raw reply	[flat|nested] 39+ messages in thread

* Re: Generic approach to customizable zones - was: Re: [PATCH v7 0/7] Introduce ZONE_CMA
  2017-04-28  8:36                 ` Michal Hocko
@ 2017-04-28  9:04                   ` Igor Stoppa
  0 siblings, 0 replies; 39+ messages in thread
From: Igor Stoppa @ 2017-04-28  9:04 UTC (permalink / raw)
  To: Michal Hocko
  Cc: Joonsoo Kim, Andrew Morton, Rik van Riel, Johannes Weiner,
	mgorman, Laura Abbott, Minchan Kim, Marek Szyprowski,
	Michal Nazarewicz, Aneesh Kumar K . V, Vlastimil Babka,
	Russell King, Will Deacon, linux-mm, linux-kernel, kernel-team



On 28/04/17 11:36, Michal Hocko wrote:
> I didn't read this thoughly yet because I will be travelling shortly

ok, thanks for bearing with me =)

> but
> this point alone just made ask, because it seems there is some
> misunderstanding

It is possible; so far I have made some changes, but I have not completed the
whole conversion.

> On Fri 28-04-17 11:04:27, Igor Stoppa wrote:
> [...]
>> * if one is happy to have a 64bits type, allow for as many zones as
>>   it's possible to fit, or anyway more than what is possible with
>>   the 32 bit mask.
> 
> zones are currently placed in struct page::flags. And that already is
> 64b size on 64b arches. 

Ok, the issues I had so far were related to the enum for zones being
treated as 32 bits.

> And we do not really have any room spare there.
> We encode page flags, zone id, numa_nid/sparse section_nr there. How can
> you add more without enlarging the struct page itself or using external
> means to store the same information (page_ext comes to mind)?

Then I'll be conservative and assume I can't, unless I can prove otherwise.

There is still the possibility I mentioned of loosely coupling DMA,
DMA32 and HIGHMEM with the bits currently reserved for them, right?

If my system doesn't use those zones as such, because it doesn't
have/need them, those bits are wasted for me. Otoh someone else is
probably not interested in what I'm after but needs one or more of those
zones.

Making the meaning of the bits configurable should still be a viable
option. It's not altering their amount, just their purpose on a specific
build.

> Even if
> the later would be possible then note thatpage_zone() is used in many
> performance sensitive paths and making it perform well with special
> casing would be far from trivial.


If the solution I propose is acceptable, I'm willing to bite the bullet
and go for implementing the conversion.

In my case I really would like to be able to use kmalloc, because it
would provide an easy path to also convert other portions of the kernel,
besides SELinux.

I suspect I would encounter overall far less resistance if the type of
change I propose is limited to:

s/GFP_KERNEL/GFP_LOCKABLE/

And if I can guarantee that GFP_LOCKABLE falls back to GFP_KERNEL when
the "lockable" feature is not enabled.


--
thanks, igor

^ permalink raw reply	[flat|nested] 39+ messages in thread

* Re: [PATCH v7 0/7] Introduce ZONE_CMA
  2017-04-27 15:06             ` Michal Hocko
  2017-04-28  8:04               ` Generic approach to customizable zones - was: " Igor Stoppa
@ 2017-05-02  4:01               ` Joonsoo Kim
  2017-05-02 13:32                 ` Michal Hocko
  2017-05-02  8:06               ` Vlastimil Babka
  2 siblings, 1 reply; 39+ messages in thread
From: Joonsoo Kim @ 2017-05-02  4:01 UTC (permalink / raw)
  To: Michal Hocko
  Cc: Andrew Morton, Rik van Riel, Johannes Weiner, mgorman,
	Laura Abbott, Minchan Kim, Marek Szyprowski, Michal Nazarewicz,
	Aneesh Kumar K . V, Vlastimil Babka, Russell King, Will Deacon,
	linux-mm, linux-kernel, kernel-team

On Thu, Apr 27, 2017 at 05:06:36PM +0200, Michal Hocko wrote:
> On Tue 25-04-17 12:42:57, Joonsoo Kim wrote:
> > On Mon, Apr 24, 2017 at 03:09:36PM +0200, Michal Hocko wrote:
> > > On Mon 17-04-17 11:02:12, Joonsoo Kim wrote:
> > > > On Thu, Apr 13, 2017 at 01:56:15PM +0200, Michal Hocko wrote:
> > > > > On Wed 12-04-17 10:35:06, Joonsoo Kim wrote:
> [...]
> > > not for free. For most common configurations where we have ZONE_DMA,
> > > ZONE_DMA32, ZONE_NORMAL and ZONE_MOVABLE all the 3 bits are already
> > > consumed so a new zone will need a new one AFAICS.
> > 
> > Yes, it requires one more bit for a new zone and it's handled by the patch.
> 
> I am pretty sure that you are aware that consuming new page flag bits
> is usually a no-go and something we try to avoid as much as possible
> because we are in a great shortage there. So there really have to be a
> _strong_ reason if we go that way. My current understanding that the
> whole zone concept is more about a more convenient implementation rather
> than a fundamental change which will solve unsolvable problems with the
> current approach. More on that below.

If there is a consensus that adding a new zone and one more bit in
the page flags is unreasonable, I will try to find a way to use
ZONE_MOVABLE. As mentioned before, that's not a fundamental issue for me.
However, it will have many potential problems as I mentioned, so I
*really* don't prefer that way.

> 
> [...]
> > MOVABLE allocation will fallback as following sequence.
> > 
> > ZONE_CMA -> ZONE_MOVABLE -> ZONE_HIGHMEM -> ZONE_NORMAL -> ...
> > 
> > I don't understand what you mean CMA allocation. In MM's context,
> > there is no CMA allocation. That is just MOVABLE allocation.
> > 
> > For device's context, there is CMA allocation. It is range specific
> > allocation so it should be succeed for requested range. No fallback is
> > allowed in this case.
> 
> OK. that answers my question. I guess... My main confusion comes from
> __alloc_gigantic_page which shares alloc_contig_range with the cma
> allocation. But from what you wrote above and my quick glance over the
> code __alloc_gigantic_page simply changes the migrate type of the pfn
> range and it doesn't move it to the zone CMA. Right?

Yes, it doesn't move it to the zone CMA.

> 
> [...]
> > > > At a glance, special migratetype sound natural. I also did. However,
> > > > it's not natural in implementation POV. Zone consists of the same type
> > > > of memory (by definition ?) and MM subsystem is implemented with that
> > > > assumption. If difference type of memory shares the same zone, it easily
> > > > causes the problem and CMA problems are the such case.
> > > 
> > > But this is not any different from the highmem vs. lowmem problems we
> > > already have, no? I have looked at your example in the cover where you
> > > mention utilization and the reclaim problems. With the node reclaim we
> > > will have pages from all zones on the same LRU(s). isolate_lru_pages
> > > will skip those from ZONE_CMA because their zone_idx is higher than
> > > gfp_idx(GFP_KERNEL). The same could be achieved by an explicit check for
> > > the pageblock migrate type. So the zone doesn't really help much. Or is
> > > there some aspect that I am missing?
> > 
> > Your understanding is correct. It can archieved by an explict check
> > for migratetype. And, this is the main reason that we should avoid
> > such approach.
> > 
> > With ZONE approach, all these things are done naturally. We don't need
> > any explicit check to anywhere. We already have a code to skip to
> > reclaim such pages by checking zone_idx.
> 
> Yes, and as we have to filter pages anyway doing so for cma blocks
> doesn't sound overly burdensome from the maintenance point of view.
>  
> > However, with MIGRATETYPE approach, all these things *cannot* be done
> > naturally. We need extra checks to all the places (allocator fast
> > path, reclaim path, compaction, etc...). It is really error-prone and
> > it already causes many problems due to this aspect. For the
> > performance wise, this approach is also bad since it requires to check
> > migratetype for each pages.
> > 
> > Moreover, even if we adds extra checks, things cannot be easily
> > perfect.
> 
> I see this point and I agree that using a specific zone might be a
> _nicer_ solution in the end but you have to consider another aspects as
> well. The main one I am worried about is a long term maintainability.
> We are really out of page flags and consuming one for a rather specific
> usecase is not good. Look at ZONE_DMA. I am pretty sure that almost
> no sane HW needs 16MB zone anymore, yet we have hard time to get rid
> of it and so we have that memory laying around unused all the time
> and blocking one page flag bit. CMA falls into a similar category
> AFAIU. I wouldn't be all that surprised if a future HW will not need CMA
> allocations in few years, yet we will have to fight to get rid of it
> like we do with ZONE_DMA. And not only that. We will also have to fight
> finding page flags for other more general usecases in the meantime.

This maintenance problem is inherent. It exists even if we
use the MIGRATETYPE approach. We cannot remove the many hooks for CMA if
future HW no longer needs CMA allocation in a few years. The only
difference is that one approach takes a single zone bit just for the CMA
user, and the other approach takes many hooks that we need to take care
of all the time.

> 
> > See 3) Atomic allocation failure problem. It's inherent
> > problem if we have different types of memory in a single zone.
> > We possibly can make things perfect even with MIGRATETYPE approach,
> > however, it requires additional checks in hotpath than current. It's
> > expensive and undesirable. It will make future maintenance of MM code
> > much difficult.
> 
> I believe that the overhead in the hot path is not such a big deal. We
> have means to make it 0 when CMA is not used by jumplabels. I assume
> that the vast majority of systems will not use CMA. Those systems which
> use CMA should be able to cope with some slight overhead IMHO.

Please don't underestimate the number of CMA users. Most Android devices
use CMA. So, there may well be more devices using CMA than servers
not using CMA. They also have a right to experience the best performance.

> 
> I agree that the code maintenance cost is not free. And that is a valid
> concern. CMA maintenance will not be for free in either case, though (if
> for nothing else the page flags space mentioned above). Let's see what
> what this means for mm/page_alloc.c
> 
>  mm/page_alloc.c | 220 ++++++++++++++++++++++++++++----------------------------
>  1 file changed, 109 insertions(+), 111 deletions(-)
> 
> Not very convincing at first glance but this can be quite misleading as
> you have already mentioned because you have moved a lot of code to to
> init path. So let's just focus on the allocator hot paths
> 
> @@ -800,7 +805,7 @@ static inline void __free_one_page(struct page *page,
>  
>  	VM_BUG_ON(migratetype == -1);
>  	if (likely(!is_migrate_isolate(migratetype)))
> -		__mod_zone_freepage_state(zone, 1 << order, migratetype);
> +		__mod_zone_page_state(zone, NR_FREE_PAGES, 1 << order);
>  
>  	VM_BUG_ON_PAGE(pfn & ((1 << order) - 1), page);
>  	VM_BUG_ON_PAGE(bad_range(zone, page), page);
> @@ -1804,25 +1831,11 @@ static int fallbacks[MIGRATE_TYPES][4] = {
>  	[MIGRATE_UNMOVABLE]   = { MIGRATE_RECLAIMABLE, MIGRATE_MOVABLE,   MIGRATE_TYPES },
>  	[MIGRATE_RECLAIMABLE] = { MIGRATE_UNMOVABLE,   MIGRATE_MOVABLE,   MIGRATE_TYPES },
>  	[MIGRATE_MOVABLE]     = { MIGRATE_RECLAIMABLE, MIGRATE_UNMOVABLE, MIGRATE_TYPES },
> -#ifdef CONFIG_CMA
> -	[MIGRATE_CMA]         = { MIGRATE_TYPES }, /* Never used */
> -#endif
>  #ifdef CONFIG_MEMORY_ISOLATION
>  	[MIGRATE_ISOLATE]     = { MIGRATE_TYPES }, /* Never used */
>  #endif
>  };
>  
> -#ifdef CONFIG_CMA
> -static struct page *__rmqueue_cma_fallback(struct zone *zone,
> -					unsigned int order)
> -{
> -	return __rmqueue_smallest(zone, order, MIGRATE_CMA);
> -}
> -#else
> -static inline struct page *__rmqueue_cma_fallback(struct zone *zone,
> -					unsigned int order) { return NULL; }
> -#endif
> -
>  /*
>   * Move the free pages in a range to the free lists of the requested type.
>   * Note that start_page and end_pages are not aligned on a pageblock
> @@ -2090,8 +2103,7 @@ static void reserve_highatomic_pageblock(struct page *page, struct zone *zone,
>  
>  	/* Yoink! */
>  	mt = get_pageblock_migratetype(page);
> -	if (!is_migrate_highatomic(mt) && !is_migrate_isolate(mt)
> -	    && !is_migrate_cma(mt)) {
> +	if (!is_migrate_highatomic(mt) && !is_migrate_isolate(mt)) {
>  		zone->nr_reserved_highatomic += pageblock_nr_pages;
>  		set_pageblock_migratetype(page, MIGRATE_HIGHATOMIC);
>  		move_freepages_block(zone, page, MIGRATE_HIGHATOMIC, NULL);
> @@ -2235,13 +2247,8 @@ static struct page *__rmqueue(struct zone *zone, unsigned int order,
>  
>  retry:
>  	page = __rmqueue_smallest(zone, order, migratetype);
> -	if (unlikely(!page)) {
> -		if (migratetype == MIGRATE_MOVABLE)
> -			page = __rmqueue_cma_fallback(zone, order);
> -
> -		if (!page && __rmqueue_fallback(zone, order, migratetype))
> -			goto retry;
> -	}
> +	if (unlikely(!page) && __rmqueue_fallback(zone, order, migratetype))
> +		goto retry;
>  
>  	trace_mm_page_alloc_zone_locked(page, order, migratetype);
>  	return page;
> @@ -2283,9 +2290,6 @@ static int rmqueue_bulk(struct zone *zone, unsigned int order,
>  			list_add_tail(&page->lru, list);
>  		list = &page->lru;
>  		alloced++;
> -		if (is_migrate_cma(get_pcppage_migratetype(page)))
> -			__mod_zone_page_state(zone, NR_FREE_CMA_PAGES,
> -					      -(1 << order));
>  	}
>  
>  	/*
> @@ -2636,10 +2640,10 @@ int __isolate_free_page(struct page *page, unsigned int order)
>  		 * exists.
>  		 */
>  		watermark = min_wmark_pages(zone) + (1UL << order);
> -		if (!zone_watermark_ok(zone, 0, watermark, 0, ALLOC_CMA))
> +		if (!zone_watermark_ok(zone, 0, watermark, 0, 0))
>  			return 0;
>  
> -		__mod_zone_freepage_state(zone, -(1UL << order), mt);
> +		__mod_zone_page_state(zone, NR_FREE_PAGES, -(1UL << order));
>  	}
>  
>  	/* Remove page from free list */
> @@ -2655,8 +2659,8 @@ int __isolate_free_page(struct page *page, unsigned int order)
>  		struct page *endpage = page + (1 << order) - 1;
>  		for (; page < endpage; page += pageblock_nr_pages) {
>  			int mt = get_pageblock_migratetype(page);
> -			if (!is_migrate_isolate(mt) && !is_migrate_cma(mt)
> -			    && !is_migrate_highatomic(mt))
> +			if (!is_migrate_isolate(mt) &&
> +				!is_migrate_highatomic(mt))
>  				set_pageblock_migratetype(page,
>  							  MIGRATE_MOVABLE);
>  		}
> @@ -2783,8 +2787,7 @@ struct page *rmqueue(struct zone *preferred_zone,
>  	spin_unlock(&zone->lock);
>  	if (!page)
>  		goto failed;
> -	__mod_zone_freepage_state(zone, -(1 << order),
> -				  get_pcppage_migratetype(page));
> +	__mod_zone_page_state(zone, NR_FREE_PAGES, -(1 << order));
>  
>  	__count_zid_vm_events(PGALLOC, page_zonenum(page), 1 << order);
>  	zone_statistics(preferred_zone, zone);
> @@ -2907,12 +2910,6 @@ bool __zone_watermark_ok(struct zone *z, unsigned int order, unsigned long mark,
>  	else
>  		min -= min / 4;
>  
> -#ifdef CONFIG_CMA
> -	/* If allocation can't use CMA areas don't use free CMA pages */
> -	if (!(alloc_flags & ALLOC_CMA))
> -		free_pages -= zone_page_state(z, NR_FREE_CMA_PAGES);
> -#endif
> -
>  	/*
>  	 * Check watermarks for an order-0 allocation request. If these
>  	 * are not met, then a high-order request also cannot go ahead
> @@ -2940,13 +2937,6 @@ bool __zone_watermark_ok(struct zone *z, unsigned int order, unsigned long mark,
>  			if (!list_empty(&area->free_list[mt]))
>  				return true;
>  		}
> -
> -#ifdef CONFIG_CMA
> -		if ((alloc_flags & ALLOC_CMA) &&
> -		    !list_empty(&area->free_list[MIGRATE_CMA])) {
> -			return true;
> -		}
> -#endif
>  	}
>  	return false;
>  }
> @@ -2962,13 +2952,6 @@ static inline bool zone_watermark_fast(struct zone *z, unsigned int order,
>  		unsigned long mark, int classzone_idx, unsigned int alloc_flags)
>  {
>  	long free_pages = zone_page_state(z, NR_FREE_PAGES);
> -	long cma_pages = 0;
> -
> -#ifdef CONFIG_CMA
> -	/* If allocation can't use CMA areas don't use free CMA pages */
> -	if (!(alloc_flags & ALLOC_CMA))
> -		cma_pages = zone_page_state(z, NR_FREE_CMA_PAGES);
> -#endif
>  
>  	/*
>  	 * Fast check for order-0 only. If this fails then the reserves
> @@ -2977,7 +2960,7 @@ static inline bool zone_watermark_fast(struct zone *z, unsigned int order,
>  	 * the caller is !atomic then it'll uselessly search the free
>  	 * list. That corner case is then slower but it is harmless.
>  	 */
> -	if (!order && (free_pages - cma_pages) > mark + z->lowmem_reserve[classzone_idx])
> +	if (!order && free_pages > mark + z->lowmem_reserve[classzone_idx])
>  		return true;
>  
>  	return __zone_watermark_ok(z, order, mark, classzone_idx, alloc_flags,
> @@ -3547,10 +3530,6 @@ gfp_to_alloc_flags(gfp_t gfp_mask)
>  	} else if (unlikely(rt_task(current)) && !in_interrupt())
>  		alloc_flags |= ALLOC_HARDER;
>  
> -#ifdef CONFIG_CMA
> -	if (gfpflags_to_migratetype(gfp_mask) == MIGRATE_MOVABLE)
> -		alloc_flags |= ALLOC_CMA;
> -#endif
>  	return alloc_flags;
>  }
>  
> @@ -3972,9 +3951,6 @@ static inline bool prepare_alloc_pages(gfp_t gfp_mask, unsigned int order,
>  	if (should_fail_alloc_page(gfp_mask, order))
>  		return false;
>  
> -	if (IS_ENABLED(CONFIG_CMA) && ac->migratetype == MIGRATE_MOVABLE)
> -		*alloc_flags |= ALLOC_CMA;
> -
>  	return true;
>  }
>  
> This looks like a nice clean up. Those ifdefs are ugly as hell. One
> could argue that some of that could be cleaned up by simply adding some
> helpers (with a jump label to reduce the overhead), though. But is this
> really strong enough reason to bring the whole zone in? I am not really
> convinced to be honest.

Please don't underestimate the benefit of this patchset.
I have said that we need *more* hooks to fix all the problems.

And, please note that this code removal is not only code removal but
also concept removal. With it, we don't need to consider ALLOC_CMA in
alloc_flags when calling zone_watermark_ok(); there have been many bugs
there and some still remain. We don't need to special-case MIGRATE_CMA
when handling pageblock migratetypes. We don't need to take great care
calculating the number of CMA free pages.
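
To make that concrete, here is a minimal sketch of the difference in
the order-0 watermark fast path. It is only an illustration distilled
from the hunks quoted above (the helper names are mine, not the actual
kernel functions, and it assumes kernel-internal definitions of struct
zone, zone_page_state(), NR_FREE_PAGES/NR_FREE_CMA_PAGES and ALLOC_CMA):

/* MIGRATETYPE approach: CMA free pages must be filtered out by hand. */
static bool watermark_fast_migratetype(struct zone *z, unsigned long mark,
				       int classzone_idx,
				       unsigned int alloc_flags)
{
	long free_pages = zone_page_state(z, NR_FREE_PAGES);
	long cma_pages = 0;

#ifdef CONFIG_CMA
	/* Forgetting ALLOC_CMA or this subtraction silently breaks the check. */
	if (!(alloc_flags & ALLOC_CMA))
		cma_pages = zone_page_state(z, NR_FREE_CMA_PAGES);
#endif
	return free_pages - cma_pages > mark + z->lowmem_reserve[classzone_idx];
}

/* ZONE_CMA approach: the zone only contains pages the caller may use. */
static bool watermark_fast_zone_cma(struct zone *z, unsigned long mark,
				    int classzone_idx)
{
	return zone_page_state(z, NR_FREE_PAGES) >
	       mark + z->lowmem_reserve[classzone_idx];
}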

>  
> [...]
> 
> > > Please do _not_ take this as a NAK from me. At least not at this time. I
> > > am still trying to understand all the consequences but my intuition
> > > tells me that building on top of highmem like approach will turn out to
> > > be problematic in future (as we have already seen with the highmem and
> > > movable zones) so this needs a very prudent consideration.
> > 
> > I can understand that you are prudent to this issue. However, it takes more
> > than two years and many people already expressed that ZONE approach is the
> > way to go.
> 
> I can see a single Acked-by and one Reviewed-by. It would be much more
> convincing to see much larger support. Do not take me wrong I am not
> trying to undermine the feedback so far but we should be clear about one
> thing. CMA is mostly motivated by the industry which tries to overcome
> HW limitations which can change in future very easily. I would rather
> see good enough solution for something like that than a nicer solution
> which is pushing additional burden on more general usecases.

First of all, the current MIGRATETYPE approach isn't good enough to me.
It has caused too many problems, and many problems remain. It will
cause maintenance issues for a long time.

And, although a good enough solution can be better than a nicer solution
in some cases, the current situation doesn't look like such a case.
There is a single reason, saving a page flag bit, for sticking with the
good enough solution.

I'd like to ask the reverse question. Is that enough reason to make CMA
users suffer from bugs? Is it enough reason to maintain such dirty and
hard-to-maintain code, and to hurt other MM developers' productivity?
We need great care when handling pageblock migratetypes, accounting
free pages, and reclaiming allocated pages. It affects many developers.

> 
> That being said, I would like to see a much larger consensus in the MM
> community before a new zone is merged. I am staying very skeptical this
> is the right direction though.

As you know, CMA is actively used in industry, but its users don't
review patches on the public mailing list. Moreover, there is no
interest in CMA from core MM developers. So, I think that one Acked-by
and one Reviewed-by represent substantial agreement. Together, they and
I are most of the developers in the MM community with an interest in CMA.

Anyway, AFAIK, the following people are in favor of this patchset.
If I am wrong, please let me know.

Minchan Kim <minchan@kernel.org>
Laura Abbott <lauraa@codeaurora.org>
Bob Liu <liubo95@huawei.com>

As you may know, Minchan and Laura are also core developers on the CMA
topic.

Thanks.

^ permalink raw reply	[flat|nested] 39+ messages in thread

* Re: [PATCH v7 0/7] Introduce ZONE_CMA
  2017-04-27 15:06             ` Michal Hocko
  2017-04-28  8:04               ` Generic approach to customizable zones - was: " Igor Stoppa
  2017-05-02  4:01               ` Joonsoo Kim
@ 2017-05-02  8:06               ` Vlastimil Babka
  2017-05-02 13:03                 ` Michal Hocko
  2 siblings, 1 reply; 39+ messages in thread
From: Vlastimil Babka @ 2017-05-02  8:06 UTC (permalink / raw)
  To: Michal Hocko, Joonsoo Kim
  Cc: Andrew Morton, Rik van Riel, Johannes Weiner, mgorman,
	Laura Abbott, Minchan Kim, Marek Szyprowski, Michal Nazarewicz,
	Aneesh Kumar K . V, Russell King, Will Deacon, linux-mm,
	linux-kernel, kernel-team

On 04/27/2017 05:06 PM, Michal Hocko wrote:
> On Tue 25-04-17 12:42:57, Joonsoo Kim wrote:
>> On Mon, Apr 24, 2017 at 03:09:36PM +0200, Michal Hocko wrote:
>>> On Mon 17-04-17 11:02:12, Joonsoo Kim wrote:
>>>> On Thu, Apr 13, 2017 at 01:56:15PM +0200, Michal Hocko wrote:
>>>>> On Wed 12-04-17 10:35:06, Joonsoo Kim wrote:
> [...]
>>> not for free. For most common configurations where we have ZONE_DMA,
>>> ZONE_DMA32, ZONE_NORMAL and ZONE_MOVABLE all the 3 bits are already
>>> consumed so a new zone will need a new one AFAICS.
>>
>> Yes, it requires one more bit for a new zone and it's handled by the patch.
> 
> I am pretty sure that you are aware that consuming new page flag bits
> is usually a no-go and something we try to avoid as much as possible
> because we are in a great shortage there. So there really have to be a
> _strong_ reason if we go that way. My current understanding that the
> whole zone concept is more about a more convenient implementation rather
> than a fundamental change which will solve unsolvable problems with the
> current approach. More on that below.

I don't see it as such a big issue. It's behind a CONFIG option (so we
also don't need the jump labels you suggest later) and enabling it
reduces the number of possible NUMA nodes (not page flags). So either
you are building a kernel for android phone that needs CMA but will have
a single NUMA node, or for a large server with many nodes that won't
have CMA. As long as there won't be large servers that need CMA, we
should be fine (yes, I know some HW vendors can be very creative, but
then it's their problem?).
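
To spell out that tradeoff in page->flags terms, here is a rough sketch
(illustrative only; the exact widths and config plumbing are up to the
series, and the real rules live in include/linux/page-flags-layout.h):

/*
 * page->flags is carved up roughly as
 *
 *   | SECTION | NODE | ZONE | ... | flag bits |
 *
 * With DMA, DMA32, NORMAL and MOVABLE the ZONE field fits in 2 bits;
 * a fifth zone pushes it to 3. On a CONFIG_CMA=y build that extra bit
 * can be paid for with a smaller NODE field (fewer possible NUMA
 * nodes) rather than with one of the scarce flag bits.
 */
#if MAX_NR_ZONES <= 4
#define ZONES_SHIFT 2
#else
#define ZONES_SHIFT 3	/* e.g. ZONE_CMA and/or ZONE_DEVICE configured */
#endif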

> [...]
>> MOVABLE allocation will fallback as following sequence.
>>
>> ZONE_CMA -> ZONE_MOVABLE -> ZONE_HIGHMEM -> ZONE_NORMAL -> ...

Hmm, so this in effect resembles some of the aggressive CMA utilization
efforts that were never merged due to issues. Joonsoo, could you
summarize/expand the cover letter part on what were the issues with
aggressive CMA utilization, and why they no longer apply with ZONE_CMA,
especially given the current node-lru reclaim? Thanks.

^ permalink raw reply	[flat|nested] 39+ messages in thread

* Re: [PATCH v7 0/7] Introduce ZONE_CMA
  2017-05-02  8:06               ` Vlastimil Babka
@ 2017-05-02 13:03                 ` Michal Hocko
  2017-05-02 13:41                   ` Igor Stoppa
  2017-05-04 12:33                   ` Vlastimil Babka
  0 siblings, 2 replies; 39+ messages in thread
From: Michal Hocko @ 2017-05-02 13:03 UTC (permalink / raw)
  To: Vlastimil Babka
  Cc: Joonsoo Kim, Andrew Morton, Rik van Riel, Johannes Weiner,
	mgorman, Laura Abbott, Minchan Kim, Marek Szyprowski,
	Michal Nazarewicz, Aneesh Kumar K . V, Russell King, Will Deacon,
	linux-mm, linux-kernel, kernel-team

On Tue 02-05-17 10:06:01, Vlastimil Babka wrote:
> On 04/27/2017 05:06 PM, Michal Hocko wrote:
> > On Tue 25-04-17 12:42:57, Joonsoo Kim wrote:
> >> On Mon, Apr 24, 2017 at 03:09:36PM +0200, Michal Hocko wrote:
> >>> On Mon 17-04-17 11:02:12, Joonsoo Kim wrote:
> >>>> On Thu, Apr 13, 2017 at 01:56:15PM +0200, Michal Hocko wrote:
> >>>>> On Wed 12-04-17 10:35:06, Joonsoo Kim wrote:
> > [...]
> >>> not for free. For most common configurations where we have ZONE_DMA,
> >>> ZONE_DMA32, ZONE_NORMAL and ZONE_MOVABLE all the 3 bits are already
> >>> consumed so a new zone will need a new one AFAICS.
> >>
> >> Yes, it requires one more bit for a new zone and it's handled by the patch.
> > 
> > I am pretty sure that you are aware that consuming new page flag bits
> > is usually a no-go and something we try to avoid as much as possible
> > because we are in a great shortage there. So there really have to be a
> > _strong_ reason if we go that way. My current understanding that the
> > whole zone concept is more about a more convenient implementation rather
> > than a fundamental change which will solve unsolvable problems with the
> > current approach. More on that below.
> 
> I don't see it as such a big issue. It's behind a CONFIG option (so we
> also don't need the jump labels you suggest later) and enabling it
> reduces the number of possible NUMA nodes (not page flags). So either
> you are building a kernel for android phone that needs CMA but will have
> a single NUMA node, or for a large server with many nodes that won't
> have CMA. As long as there won't be large servers that need CMA, we
> should be fine (yes, I know some HW vendors can be very creative, but
> then it's their problem?).

Is this really about Android/UMA systems only? My quick grep seems to disagree
$ git grep CONFIG_CMA=y
arch/arm/configs/exynos_defconfig:CONFIG_CMA=y
arch/arm/configs/imx_v6_v7_defconfig:CONFIG_CMA=y
arch/arm/configs/keystone_defconfig:CONFIG_CMA=y
arch/arm/configs/multi_v7_defconfig:CONFIG_CMA=y
arch/arm/configs/omap2plus_defconfig:CONFIG_CMA=y
arch/arm/configs/tegra_defconfig:CONFIG_CMA=y
arch/arm/configs/vexpress_defconfig:CONFIG_CMA=y
arch/arm64/configs/defconfig:CONFIG_CMA=y
arch/mips/configs/ci20_defconfig:CONFIG_CMA=y
arch/mips/configs/db1xxx_defconfig:CONFIG_CMA=y
arch/s390/configs/default_defconfig:CONFIG_CMA=y
arch/s390/configs/gcov_defconfig:CONFIG_CMA=y
arch/s390/configs/performance_defconfig:CONFIG_CMA=y
arch/s390/defconfig:CONFIG_CMA=y

I am pretty sure s390 and ppc support NUMA and aim at supporting really
large systems. 

I can imagine that we could make ZONE_CMA configurable in a way that
only very well defined use cases would be supported so that we can save
page flags space. But this alone sounds like a maintainability nightmare
to me. Especially when I consider the ZONE_DMA situation. There is simply
not an easy way to find out whether my HW really needs the DMA zone or
not. Most probably it does not, but it is still configured and hidden behind
config ZONE_DMA
        bool "DMA memory allocation support" if EXPERT
        default y
        help
          DMA memory allocation support allows devices with less than 32-bit
          addressing to allocate within the first 16MB of address space.
          Disable if no such devices will be used.

          If unsure, say Y.

Are we really ready to add another thing like that? How are distribution
kernels going to handle that?
-- 
Michal Hocko
SUSE Labs

^ permalink raw reply	[flat|nested] 39+ messages in thread

* Re: [PATCH v7 0/7] Introduce ZONE_CMA
  2017-05-02  4:01               ` Joonsoo Kim
@ 2017-05-02 13:32                 ` Michal Hocko
  2017-05-11  2:12                   ` Joonsoo Kim
  0 siblings, 1 reply; 39+ messages in thread
From: Michal Hocko @ 2017-05-02 13:32 UTC (permalink / raw)
  To: Joonsoo Kim
  Cc: Andrew Morton, Rik van Riel, Johannes Weiner, mgorman,
	Laura Abbott, Minchan Kim, Marek Szyprowski, Michal Nazarewicz,
	Aneesh Kumar K . V, Vlastimil Babka, Russell King, Will Deacon,
	linux-mm, linux-kernel, kernel-team

On Tue 02-05-17 13:01:32, Joonsoo Kim wrote:
> On Thu, Apr 27, 2017 at 05:06:36PM +0200, Michal Hocko wrote:
[...]
> > I see this point and I agree that using a specific zone might be a
> > _nicer_ solution in the end but you have to consider another aspects as
> > well. The main one I am worried about is a long term maintainability.
> > We are really out of page flags and consuming one for a rather specific
> > usecase is not good. Look at ZONE_DMA. I am pretty sure that almost
> > no sane HW needs 16MB zone anymore, yet we have hard time to get rid
> > of it and so we have that memory laying around unused all the time
> > and blocking one page flag bit. CMA falls into a similar category
> > AFAIU. I wouldn't be all that surprised if a future HW will not need CMA
> > allocations in few years, yet we will have to fight to get rid of it
> > like we do with ZONE_DMA. And not only that. We will also have to fight
> > finding page flags for other more general usecases in the meantime.
> 
> This maintenance problem is inherent. This problem exists even if we
> uses MIGRATETYPE approach. We cannot remove many hooks for CMA if a
> future HW will not need CMA allocation in few years. The only
> difference is that one takes single zone bit only for CMA user and the
> other approach takes many hooks that we need to take care about it all
> the time.

And I consider this a big difference. While the hooks are not nice,
they affect CMA users (in the sense of bugs, performance, etc.), whereas
an additional bit consumed affects potential future and more generic
features.

[...]
> > I believe that the overhead in the hot path is not such a big deal. We
> > have means to make it 0 when CMA is not used by jumplabels. I assume
> > that the vast majority of systems will not use CMA. Those systems which
> > use CMA should be able to cope with some slight overhead IMHO.
> 
> Please don't underestimate number of CMA user. Most of android device
> uses CMA. So, there would be more devices using CMA than the server
> not using CMA. They also have a right to experience the best performance.

This is not a fair comparison, though. The Android development model is
much faster and tends not to care about future maintainability at all. I
do not know of any Android device that runs on a clean vanilla kernel,
because vendors simply do not care enough (or do not have the time) to
put the code into a proper shape to upstream it. I understand that this
model might work quite well for the rapidly changing and moving mobile
or IoT segment, but it is not the greatest fit to motivate core kernel
subsystem development. We are not in the drivers space!

[...]
> > This looks like a nice clean up. Those ifdefs are ugly as hell. One
> > could argue that some of that could be cleaned up by simply adding some
> > helpers (with a jump label to reduce the overhead), though. But is this
> > really strong enough reason to bring the whole zone in? I am not really
> > convinced to be honest.
> 
> Please don't underestimate the benefit of this patchset.
> I have said that we need *more* hooks to fix all the problems.
> 
> And, please think that this code removal is not only code removal but
> also concept removal. With this removing, we don't need to consider
> ALLOC_CMA for alloc_flags when calling zone_watermark_ok(). There are
> many bugs on it and it still remains. We don't need to consider
> pageblock migratetype when handling pageblock migratetype. We don't
> need to take a great care about calculating the number of CMA
> freepages.

And all of this can be isolated to CMA-specific hooks with mostly
minimal impact on most users. I hear you saying that the zone approach
is more natural, and I would agree if we didn't have to care about the
number of zones.

> > [...]
> > 
> > > > Please do _not_ take this as a NAK from me. At least not at this time. I
> > > > am still trying to understand all the consequences but my intuition
> > > > tells me that building on top of highmem like approach will turn out to
> > > > be problematic in future (as we have already seen with the highmem and
> > > > movable zones) so this needs a very prudent consideration.
> > > 
> > > I can understand that you are prudent to this issue. However, it takes more
> > > than two years and many people already expressed that ZONE approach is the
> > > way to go.
> > 
> > I can see a single Acked-by and one Reviewed-by. It would be much more
> > convincing to see much larger support. Do not take me wrong I am not
> > trying to undermine the feedback so far but we should be clear about one
> > thing. CMA is mostly motivated by the industry which tries to overcome
> > HW limitations which can change in future very easily. I would rather
> > see good enough solution for something like that than a nicer solution
> > which is pushing additional burden on more general usecases.
> 
> First of all, current MIGRATETYPE approach isn't good enough to me.
> They caused too many problems and there are many remanining problems.
> It will causes maintenance issue for a long time.
> 
> And, although good enough solution can be better than nicer solution
> in some cases, it looks like current situation isn't that case.
> There is a single reason, saving page flag bit, to support good enough
> solution.
> 
> I'd like to ask reversly. Is this a enough reason to make CMA user to
> suffer from bugs?

No, but I haven't heard a single argument that those bugs are
impossible to fix with the current approach. They might be harder to
fix, but if I can choose between harder for CMA and harder for other,
more generic, HW-independent features, I will go for the first one. And
do not take me wrong, I have nothing against CMA as such. It solves a
real-life problem. I just believe it doesn't deserve to consume a new
bit in page flags because that is just too scarce a resource.

Thanks!
-- 
Michal Hocko
SUSE Labs

^ permalink raw reply	[flat|nested] 39+ messages in thread

* Re: [PATCH v7 0/7] Introduce ZONE_CMA
  2017-05-02 13:03                 ` Michal Hocko
@ 2017-05-02 13:41                   ` Igor Stoppa
  2017-05-04 12:33                   ` Vlastimil Babka
  1 sibling, 0 replies; 39+ messages in thread
From: Igor Stoppa @ 2017-05-02 13:41 UTC (permalink / raw)
  To: Michal Hocko, Vlastimil Babka
  Cc: Joonsoo Kim, Andrew Morton, Rik van Riel, Johannes Weiner,
	mgorman, Laura Abbott, Minchan Kim, Marek Szyprowski,
	Michal Nazarewicz, Aneesh Kumar K . V, Russell King, Will Deacon,
	linux-mm, linux-kernel, kernel-team

On 02/05/17 16:03, Michal Hocko wrote:

> I can imagine that we could make ZONE_CMA configurable in a way that
> only very well defined use cases would be supported so that we can save
> page flags space. But this alone sounds like a maintainability nightmare
> to me. Especially when I consider ZONE_DMA situation. There is simply
> not an easy way to find out whether my HW really needs DMA zone or
> not. Most probably not but it still is configured and hidden behind
> config ZONE_DMA
>         bool "DMA memory allocation support" if EXPERT
>         default y
>         help
>           DMA memory allocation support allows devices with less than 32-bit
>           addressing to allocate within the first 16MB of address space.
>           Disable if no such devices will be used.
> 
>           If unsure, say Y.
> 
> Are we really ready to add another thing like that? How are distribution
> kernels going to handle that?

In practice there are 2 quite opposite scenarios:

- distros that try to cater to (almost) everyone and are constrained in
what they can leave out

- ad-hoc builds (like Android, but also IoT) where the HW is *very* well
known upfront, because it's probably even impossible to make any change
that doesn't involve a rework station.

So maybe the answer is to not have only EXPERT, but rather DISTRO/CUSTOM
with the implications these can bring.

A generic build would be assumed to be the DISTRO type, but something
else, of a more embedded persuasion, could do otherwise.

ZONE_DMA / ZONE_DMA32 actually seem to be perfect candidates for being
replaced by something else, when unused, as I proposed on Friday:

http://marc.info/?l=linux-mm&m=149337033630993&w=2


It might still be that only some cases would be upstreamable, even after
these changes.

But at least some of those might be useful also for non-Android/ non-IoT
scenarios.


---
igor

^ permalink raw reply	[flat|nested] 39+ messages in thread

* Re: [PATCH v7 0/7] Introduce ZONE_CMA
  2017-05-02 13:03                 ` Michal Hocko
  2017-05-02 13:41                   ` Igor Stoppa
@ 2017-05-04 12:33                   ` Vlastimil Babka
  2017-05-04 12:46                     ` Michal Hocko
  1 sibling, 1 reply; 39+ messages in thread
From: Vlastimil Babka @ 2017-05-04 12:33 UTC (permalink / raw)
  To: Michal Hocko
  Cc: Joonsoo Kim, Andrew Morton, Rik van Riel, Johannes Weiner,
	mgorman, Laura Abbott, Minchan Kim, Marek Szyprowski,
	Michal Nazarewicz, Aneesh Kumar K . V, Russell King, Will Deacon,
	linux-mm, linux-kernel, kernel-team, Heiko Carstens,
	Martin Schwidefsky

On 05/02/2017 03:03 PM, Michal Hocko wrote:
> On Tue 02-05-17 10:06:01, Vlastimil Babka wrote:
>> On 04/27/2017 05:06 PM, Michal Hocko wrote:
>>> On Tue 25-04-17 12:42:57, Joonsoo Kim wrote:
>>>> On Mon, Apr 24, 2017 at 03:09:36PM +0200, Michal Hocko wrote:
>>>>> On Mon 17-04-17 11:02:12, Joonsoo Kim wrote:
>>>>>> On Thu, Apr 13, 2017 at 01:56:15PM +0200, Michal Hocko wrote:
>>>>>>> On Wed 12-04-17 10:35:06, Joonsoo Kim wrote:
>>> [...]
>>>>> not for free. For most common configurations where we have ZONE_DMA,
>>>>> ZONE_DMA32, ZONE_NORMAL and ZONE_MOVABLE all the 3 bits are already
>>>>> consumed so a new zone will need a new one AFAICS.
>>>>
>>>> Yes, it requires one more bit for a new zone and it's handled by the patch.
>>>
>>> I am pretty sure that you are aware that consuming new page flag bits
>>> is usually a no-go and something we try to avoid as much as possible
>>> because we are in a great shortage there. So there really have to be a
>>> _strong_ reason if we go that way. My current understanding that the
>>> whole zone concept is more about a more convenient implementation rather
>>> than a fundamental change which will solve unsolvable problems with the
>>> current approach. More on that below.
>>
>> I don't see it as such a big issue. It's behind a CONFIG option (so we
>> also don't need the jump labels you suggest later) and enabling it
>> reduces the number of possible NUMA nodes (not page flags). So either
>> you are building a kernel for android phone that needs CMA but will have
>> a single NUMA node, or for a large server with many nodes that won't
>> have CMA. As long as there won't be large servers that need CMA, we
>> should be fine (yes, I know some HW vendors can be very creative, but
>> then it's their problem?).
> 
> Is this really about Android/UMA systems only? My quick grep seems to disagree
> $ git grep CONFIG_CMA=y
> arch/arm/configs/exynos_defconfig:CONFIG_CMA=y
> arch/arm/configs/imx_v6_v7_defconfig:CONFIG_CMA=y
> arch/arm/configs/keystone_defconfig:CONFIG_CMA=y
> arch/arm/configs/multi_v7_defconfig:CONFIG_CMA=y
> arch/arm/configs/omap2plus_defconfig:CONFIG_CMA=y
> arch/arm/configs/tegra_defconfig:CONFIG_CMA=y
> arch/arm/configs/vexpress_defconfig:CONFIG_CMA=y
> arch/arm64/configs/defconfig:CONFIG_CMA=y
> arch/mips/configs/ci20_defconfig:CONFIG_CMA=y
> arch/mips/configs/db1xxx_defconfig:CONFIG_CMA=y
> arch/s390/configs/default_defconfig:CONFIG_CMA=y
> arch/s390/configs/gcov_defconfig:CONFIG_CMA=y
> arch/s390/configs/performance_defconfig:CONFIG_CMA=y
> arch/s390/defconfig:CONFIG_CMA=y
> 
> I am pretty sure s390 and ppc support NUMA and aim at supporting really
> large systems. 

I don't see ppc there, and the s390 commit adding CMA as a default
provides no info. Heiko/Martin, could you share what s390 uses CMA for?
Thanks.

> I can imagine that we could make ZONE_CMA configurable in a way that
> only very well defined use cases would be supported so that we can save
> page flags space. But this alone sounds like a maintainability nightmare
> to me. Especially when I consider ZONE_DMA situation. There is simply
> not an easy way to find out whether my HW really needs DMA zone or
> not. Most probably not but it still is configured and hidden behind
> config ZONE_DMA
>         bool "DMA memory allocation support" if EXPERT
>         default y
>         help
>           DMA memory allocation support allows devices with less than 32-bit
>           addressing to allocate within the first 16MB of address space.
>           Disable if no such devices will be used.
> 
>           If unsure, say Y.
> 
> Are we really ready to add another thing like that? How are distribution
> kernels going to handle that?

I still hope that generic enterprise/desktop distributions can disable
it, and that it's only used for small devices with custom kernels.

The config burden is already there in any case; it just translates to
an extra migratetype and fastpath hooks, rather than an extra zone and
potentially fewer nodes.

^ permalink raw reply	[flat|nested] 39+ messages in thread

* Re: [PATCH v7 0/7] Introduce ZONE_CMA
  2017-05-04 12:33                   ` Vlastimil Babka
@ 2017-05-04 12:46                     ` Michal Hocko
  2017-05-11  8:51                       ` Vlastimil Babka
  0 siblings, 1 reply; 39+ messages in thread
From: Michal Hocko @ 2017-05-04 12:46 UTC (permalink / raw)
  To: Vlastimil Babka
  Cc: Joonsoo Kim, Andrew Morton, Rik van Riel, Johannes Weiner,
	mgorman, Laura Abbott, Minchan Kim, Marek Szyprowski,
	Michal Nazarewicz, Aneesh Kumar K . V, Russell King, Will Deacon,
	linux-mm, linux-kernel, kernel-team, Heiko Carstens,
	Martin Schwidefsky

On Thu 04-05-17 14:33:24, Vlastimil Babka wrote:
> On 05/02/2017 03:03 PM, Michal Hocko wrote:
> > On Tue 02-05-17 10:06:01, Vlastimil Babka wrote:
> >> On 04/27/2017 05:06 PM, Michal Hocko wrote:
> >>> On Tue 25-04-17 12:42:57, Joonsoo Kim wrote:
> >>>> On Mon, Apr 24, 2017 at 03:09:36PM +0200, Michal Hocko wrote:
> >>>>> On Mon 17-04-17 11:02:12, Joonsoo Kim wrote:
> >>>>>> On Thu, Apr 13, 2017 at 01:56:15PM +0200, Michal Hocko wrote:
> >>>>>>> On Wed 12-04-17 10:35:06, Joonsoo Kim wrote:
> >>> [...]
> >>>>> not for free. For most common configurations where we have ZONE_DMA,
> >>>>> ZONE_DMA32, ZONE_NORMAL and ZONE_MOVABLE all the 3 bits are already
> >>>>> consumed so a new zone will need a new one AFAICS.
> >>>>
> >>>> Yes, it requires one more bit for a new zone and it's handled by the patch.
> >>>
> >>> I am pretty sure that you are aware that consuming new page flag bits
> >>> is usually a no-go and something we try to avoid as much as possible
> >>> because we are in a great shortage there. So there really have to be a
> >>> _strong_ reason if we go that way. My current understanding that the
> >>> whole zone concept is more about a more convenient implementation rather
> >>> than a fundamental change which will solve unsolvable problems with the
> >>> current approach. More on that below.
> >>
> >> I don't see it as such a big issue. It's behind a CONFIG option (so we
> >> also don't need the jump labels you suggest later) and enabling it
> >> reduces the number of possible NUMA nodes (not page flags). So either
> >> you are building a kernel for android phone that needs CMA but will have
> >> a single NUMA node, or for a large server with many nodes that won't
> >> have CMA. As long as there won't be large servers that need CMA, we
> >> should be fine (yes, I know some HW vendors can be very creative, but
> >> then it's their problem?).
> > 
> > Is this really about Android/UMA systems only? My quick grep seems to disagree
> > $ git grep CONFIG_CMA=y
> > arch/arm/configs/exynos_defconfig:CONFIG_CMA=y
> > arch/arm/configs/imx_v6_v7_defconfig:CONFIG_CMA=y
> > arch/arm/configs/keystone_defconfig:CONFIG_CMA=y
> > arch/arm/configs/multi_v7_defconfig:CONFIG_CMA=y
> > arch/arm/configs/omap2plus_defconfig:CONFIG_CMA=y
> > arch/arm/configs/tegra_defconfig:CONFIG_CMA=y
> > arch/arm/configs/vexpress_defconfig:CONFIG_CMA=y
> > arch/arm64/configs/defconfig:CONFIG_CMA=y
> > arch/mips/configs/ci20_defconfig:CONFIG_CMA=y
> > arch/mips/configs/db1xxx_defconfig:CONFIG_CMA=y
> > arch/s390/configs/default_defconfig:CONFIG_CMA=y
> > arch/s390/configs/gcov_defconfig:CONFIG_CMA=y
> > arch/s390/configs/performance_defconfig:CONFIG_CMA=y
> > arch/s390/defconfig:CONFIG_CMA=y
> > 
> > I am pretty sure s390 and ppc support NUMA and aim at supporting really
> > large systems. 
> 
> I don't see ppc there,

config KVM_BOOK3S_64_HV
        tristate "KVM for POWER7 and later using hypervisor mode in host"
        depends on KVM_BOOK3S_64 && PPC_POWERNV
        select KVM_BOOK3S_HV_POSSIBLE
        select MMU_NOTIFIER
        select CMA

fa61a4e376d21 tries to explain some more

[...]
> > Are we really ready to add another thing like that? How are distribution
> > kernels going to handle that?
> 
> I still hope that generic enterprise/desktop distributions can disable
> it, and it's only used for small devices with custom kernels.
> 
> The config burden is already there in any case, it just translates to
> extra migratetype and fastpath hooks, not extra zone and potentially
> less nodes.

AFAIU the extra migrate type costs nothing when there are no CMA
reservations. And those hooks can be made a noop behind a static branch
as well. So distribution kernels do not really have to be afraid of
enabling CMA currently.
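
As a minimal sketch of that static branch idea (hypothetical names, not
code from this series, and assuming the usual kernel-internal
zone_page_state()/NR_FREE_CMA_PAGES), a CMA hook can stay compiled in
yet cost only a patched-out jump until the first CMA area is reserved:

#include <linux/jump_label.h>

DEFINE_STATIC_KEY_FALSE(cma_in_use);

/* Flipped once at boot, e.g. when the first CMA area is reserved. */
void cma_mark_in_use(void)
{
	static_branch_enable(&cma_in_use);
}

static inline long free_cma_pages(struct zone *z)
{
	/* Compiles to a no-op jump on kernels with no CMA reservation. */
	if (!static_branch_unlikely(&cma_in_use))
		return 0;

	return zone_page_state(z, NR_FREE_CMA_PAGES);
}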

-- 
Michal Hocko
SUSE Labs

^ permalink raw reply	[flat|nested] 39+ messages in thread

* Re: [PATCH v7 0/7] Introduce ZONE_CMA
  2017-05-02 13:32                 ` Michal Hocko
@ 2017-05-11  2:12                   ` Joonsoo Kim
  2017-05-11  9:13                     ` Michal Hocko
  0 siblings, 1 reply; 39+ messages in thread
From: Joonsoo Kim @ 2017-05-11  2:12 UTC (permalink / raw)
  To: Michal Hocko
  Cc: Andrew Morton, Rik van Riel, Johannes Weiner, mgorman,
	Laura Abbott, Minchan Kim, Marek Szyprowski, Michal Nazarewicz,
	Aneesh Kumar K . V, Vlastimil Babka, Russell King, Will Deacon,
	linux-mm, linux-kernel, kernel-team

Sorry for the late response. I was on a vacation.

On Tue, May 02, 2017 at 03:32:29PM +0200, Michal Hocko wrote:
> On Tue 02-05-17 13:01:32, Joonsoo Kim wrote:
> > On Thu, Apr 27, 2017 at 05:06:36PM +0200, Michal Hocko wrote:
> [...]
> > > I see this point and I agree that using a specific zone might be a
> > > _nicer_ solution in the end but you have to consider another aspects as
> > > well. The main one I am worried about is a long term maintainability.
> > > We are really out of page flags and consuming one for a rather specific
> > > usecase is not good. Look at ZONE_DMA. I am pretty sure that almost
> > > no sane HW needs 16MB zone anymore, yet we have hard time to get rid
> > > of it and so we have that memory laying around unused all the time
> > > and blocking one page flag bit. CMA falls into a similar category
> > > AFAIU. I wouldn't be all that surprised if a future HW will not need CMA
> > > allocations in few years, yet we will have to fight to get rid of it
> > > like we do with ZONE_DMA. And not only that. We will also have to fight
> > > finding page flags for other more general usecases in the meantime.
> > 
> > This maintenance problem is inherent. This problem exists even if we
> > uses MIGRATETYPE approach. We cannot remove many hooks for CMA if a
> > future HW will not need CMA allocation in few years. The only
> > difference is that one takes single zone bit only for CMA user and the
> > other approach takes many hooks that we need to take care about it all
> > the time.
> 
> And I consider this a big difference. Because while hooks are not nice
> they will affect CMA users (in a sense of bugs/performance etc.). While
> an additional bit consumed will affect potential future and more generic
> features.

Because these hooks are so tricky and are spread across many places,
bugs in these hooks can be introduced by *non-CMA* users and then hurt
*CMA* users. These hooks can also slow down non-CMA users' development,
since there are many CMA hooks and understanding how CMA is managed
is rather difficult. I think this is a big maintenance overhead,
not only for CMA users but also for non-CMA users. So, I think it
can justify the additional bit consumed.

> 
> [...]
> > > I believe that the overhead in the hot path is not such a big deal. We
> > > have means to make it 0 when CMA is not used by jumplabels. I assume
> > > that the vast majority of systems will not use CMA. Those systems which
> > > use CMA should be able to cope with some slight overhead IMHO.
> > 
> > Please don't underestimate number of CMA user. Most of android device
> > uses CMA. So, there would be more devices using CMA than the server
> > not using CMA. They also have a right to experience the best performance.
> 
> This is not a fair comparison, though. Android development model is much
> more faster and tend to not care about future maintainability at all. I
> do not know about any android device that would run on a clean vanilla
> kernel because vendors simply do not care enough (or have time) to put
> the code into a proper shape to upstream it. I understand that this
> model might work quite well for rapidly changing and moving mobile or
> IoT segment but it is not the greatest fit to motivate the core kernel
> subsystem development. We are not in the drivers space!
> 
> [...]
> > > This looks like a nice clean up. Those ifdefs are ugly as hell. One
> > > could argue that some of that could be cleaned up by simply adding some
> > > helpers (with a jump label to reduce the overhead), though. But is this
> > > really strong enough reason to bring the whole zone in? I am not really
> > > convinced to be honest.
> > 
> > Please don't underestimate the benefit of this patchset.
> > I have said that we need *more* hooks to fix all the problems.
> > 
> > And, please think that this code removal is not only code removal but
> > also concept removal. With this removing, we don't need to consider
> > ALLOC_CMA for alloc_flags when calling zone_watermark_ok(). There are
> > many bugs on it and it still remains. We don't need to consider
> > pageblock migratetype when handling pageblock migratetype. We don't
> > need to take a great care about calculating the number of CMA
> > freepages.
> 
> And all this can be isolated to CMA specific hooks with mostly minimum
> impact to most users. I hear you saying that zone approach is more natural
> and I would agree if we wouldn't have to care about the number of zones.

I attach a solution to the extra page flag bit issue, although I don't
agree with your opinion that an additional bit is a no-go. Just
note that we already use three bits for zone encoding in some
configurations due to ZONE_DEVICE.

> 
> > > [...]
> > > 
> > > > > Please do _not_ take this as a NAK from me. At least not at this time. I
> > > > > am still trying to understand all the consequences but my intuition
> > > > > tells me that building on top of highmem like approach will turn out to
> > > > > be problematic in future (as we have already seen with the highmem and
> > > > > movable zones) so this needs a very prudent consideration.
> > > > 
> > > > I can understand that you are prudent to this issue. However, it takes more
> > > > than two years and many people already expressed that ZONE approach is the
> > > > way to go.
> > > 
> > > I can see a single Acked-by and one Reviewed-by. It would be much more
> > > convincing to see much larger support. Do not take me wrong I am not
> > > trying to undermine the feedback so far but we should be clear about one
> > > thing. CMA is mostly motivated by the industry which tries to overcome
> > > HW limitations which can change in future very easily. I would rather
> > > see good enough solution for something like that than a nicer solution
> > > which is pushing additional burden on more general usecases.
> > 
> > First of all, current MIGRATETYPE approach isn't good enough to me.
> > They caused too many problems and there are many remanining problems.
> > It will causes maintenance issue for a long time.
> > 
> > And, although good enough solution can be better than nicer solution
> > in some cases, it looks like current situation isn't that case.
> > There is a single reason, saving page flag bit, to support good enough
> > solution.
> > 
> > I'd like to ask reversly. Is this a enough reason to make CMA user to
> > suffer from bugs?
> 
> No, but I haven't heard any single argument that those bugs are
> impossible to fix with the current approach. They might be harder to fix
> but if I can chose between harder for CMA and harder for other more
> generic HW independent features I will go for the first one. And do not
> take me wrong, I have nothing against CMA as such. It solves a real life
> problem. I just believe it doesn't deserve to consume a new bit in page
> flags because that is just too scarce resource.

As I mentioned above, I think that the maintenance overhead due to CMA
justifies consuming a new bit in page flags. It also provides us with
the extensibility to introduce more zones in the future.

Anyway, this value judgement is subjective, so I guess that we
cannot agree with each other. To address your concern,
I made the following solution. Please let me know your opinion about it.
This patch can be applied on top of my ZONE_CMA series.

Thanks.


------------------->8----------------------
>From bd8baee10fd83121b4172375524b97c3296a61e1 Mon Sep 17 00:00:00 2001
From: Joonsoo Kim <iamjoonsoo.kim@lge.com>
Date: Wed, 10 May 2017 15:12:26 +0900
Subject: [RFC PATCH] mm/cma: don't use one more bit on page flags for ZONE_CMA

This is just to show my idea for solving the page flags problem
caused by ZONE_CMA. There is some consensus that ZONE_CMA is a nicer
solution than the MIGRATETYPE approach, but there is also a worry about
using one more bit in page flags. This patch tries to address that
worry. It is a temporary implementation and I will optimize it
further if everyone agrees with this approach.

Adding a new zone (ZONE_CMA) needs one more bit in page flags
in some configurations. This resource is very precious, so
it's better not to consume it if at all possible. Therefore, this
patch implements the following tricky magic to avoid using one more bit.

1. If the number of zones is five due to ZONE_CMA, start the
encoding magic.
2. ZONE_MOVABLE is written into the page flags of ZONE_CMA pages. Since
we don't encode ZONE_CMA (enum value 4), which would need one more bit,
we can save one bit in page flags.
3. Check for ZONE_MOVABLE when retrieving a page's zone index.
If the zone index is ZONE_MOVABLE, we use a special handler function
to get the actual zone index.

I found no regression in a kernel build test with this magic applied.

Signed-off-by: Joonsoo Kim <iamjoonsoo.kim@lge.com>
---
 include/linux/cma.h               | 24 ++++++++++++++++++++++++
 include/linux/gfp.h               | 24 ++++++++++++++++++++++--
 include/linux/mm.h                |  8 +++++++-
 include/linux/page-flags-layout.h | 15 ++++++++++++++-
 mm/cma.c                          | 19 +++++++++++++++++++
 mm/page_alloc.c                   | 10 +++++++++-
 6 files changed, 95 insertions(+), 5 deletions(-)

diff --git a/include/linux/cma.h b/include/linux/cma.h
index 2433d5e..d36695c 100644
--- a/include/linux/cma.h
+++ b/include/linux/cma.h
@@ -3,6 +3,7 @@
 
 #include <linux/init.h>
 #include <linux/types.h>
+#include <linux/mmzone.h>
 
 /*
  * There is always at least global CMA area and a few optional
@@ -34,9 +35,32 @@ extern struct page *cma_alloc(struct cma *cma, size_t count, unsigned int align,
 extern bool cma_release(struct cma *cma, const struct page *pages, unsigned int count);
 
 #ifdef CONFIG_CMA
+extern unsigned cma_area_count;
 extern unsigned long cma_get_free(void);
+extern bool is_zone_cma_page(const struct page *page);
+
+static inline enum zone_type page_zonenum_special(const struct page *page,
+			enum zone_type zone_type)
+{
+	if (!cma_area_count)
+		return zone_type;
+
+	if (zone_type != ZONE_MOVABLE)
+		return zone_type;
+
+	if (is_zone_cma_page(page))
+		return ZONE_CMA;
+
+	return ZONE_MOVABLE;
+}
+
 #else
 static inline unsigned long cma_get_free(void) { return 0; }
+static inline enum zone_type page_zonenum_special(const struct page *page,
+			enum zone_type zone_type)
+{
+	return zone_type;
+}
 #endif
 
 #endif
diff --git a/include/linux/gfp.h b/include/linux/gfp.h
index 15987cc..bc5e443 100644
--- a/include/linux/gfp.h
+++ b/include/linux/gfp.h
@@ -348,13 +348,33 @@ static inline bool gfpflags_allow_blocking(const gfp_t gfp_flags)
  *       0xf    => BAD (MOVABLE+DMA32+HIGHMEM+DMA)
  */
 
-#if defined(CONFIG_ZONE_DEVICE) && (MAX_NR_ZONES-1) <= 4
+#if MAX_NR_ZONES <= 4
+#define GFP_ZONES_SHIFT ZONES_SHIFT
+#elif (MAX_NR_ZONES-1) == 4
+
+/*
+ * ZONE_CMA is encoded as ZONE_MOVABLE and ZONES_SHIFT would be one less
+ * than what GFP_ZONES_SHIFT usually needs.
+ */
+#if defined(CONFIG_ZONE_DEVICE) && defined(CONFIG_CMA)
+#define GFP_ZONES_SHIFT ZONES_SHIFT
+
 /* ZONE_DEVICE is not a valid GFP zone specifier */
-#define GFP_ZONES_SHIFT 2
+#elif defined(CONFIG_ZONE_DEVICE)
+#define GFP_ZONES_SHIFT 2
+
+#elif defined(CONFIG_CMA)
+#define GFP_ZONES_SHIFT (ZONES_SHIFT + 1)
+
+#else
+#define GFP_ZONES_SHIFT ZONES_SHIFT
+#endif
+
 #else
 #define GFP_ZONES_SHIFT ZONES_SHIFT
 #endif
 
+
 #if !defined(CONFIG_64BITS) && GFP_ZONES_SHIFT > 2
 typedef unsigned long long GFP_ZONE_TABLE_TYPE;
 #else
diff --git a/include/linux/mm.h b/include/linux/mm.h
index 693dc6e..50228ef 100644
--- a/include/linux/mm.h
+++ b/include/linux/mm.h
@@ -23,6 +23,7 @@
 #include <linux/page_ext.h>
 #include <linux/err.h>
 #include <linux/page_ref.h>
+#include <linux/cma.h>
 
 struct mempolicy;
 struct anon_vma;
@@ -780,7 +781,12 @@ int finish_mkwrite_fault(struct vm_fault *vmf);
 
 static inline enum zone_type page_zonenum(const struct page *page)
 {
-	return (page->flags >> ZONES_PGSHIFT) & ZONES_MASK;
+	enum zone_type zone_type = (page->flags >> ZONES_PGSHIFT) & ZONES_MASK;
+
+	if (ZONE_CMA_IN_PAGE_FLAGS)
+		return zone_type;
+
+	return page_zonenum_special(page, zone_type);
 }
 
 #ifdef CONFIG_ZONE_DEVICE
diff --git a/include/linux/page-flags-layout.h b/include/linux/page-flags-layout.h
index 77b078c..8a4ae38 100644
--- a/include/linux/page-flags-layout.h
+++ b/include/linux/page-flags-layout.h
@@ -17,12 +17,25 @@
 #define ZONES_SHIFT 1
 #elif MAX_NR_ZONES <= 4
 #define ZONES_SHIFT 2
+
 #elif MAX_NR_ZONES <= 8
-#define ZONES_SHIFT 3
+
+#if defined(CONFIG_CMA) && (MAX_NR_ZONES-1) <= 4
+#define ZONES_SHIFT 2
+#define ZONE_CMA_IN_PAGE_FLAGS 0
 #else
+#define ZONES_SHIFT 3
+#endif
+
+#else /* MAX_NR_ZONES <= 8 */
+
 #error ZONES_SHIFT -- too many zones configured adjust calculation
 #endif
 
+#ifndef ZONE_CMA_IN_PAGE_FLAGS
+#define ZONE_CMA_IN_PAGE_FLAGS 1
+#endif
+
 #ifdef CONFIG_SPARSEMEM
 #include <asm/sparsemem.h>
 
diff --git a/mm/cma.c b/mm/cma.c
index adfda1c..1833973 100644
--- a/mm/cma.c
+++ b/mm/cma.c
@@ -110,6 +110,25 @@ static void cma_clear_bitmap(struct cma *cma, unsigned long pfn,
 	mutex_unlock(&cma->lock);
 }
 
+bool is_zone_cma_page(const struct page *page)
+{
+	struct cma *cma;
+	int i;
+	unsigned long pfn = page_to_pfn(page);
+
+	for (i = 0; i < cma_area_count; i++) {
+		cma = &cma_areas[i];
+
+		if (pfn < cma->base_pfn)
+			continue;
+
+		if (pfn < cma->base_pfn + cma->count)
+			return true;
+	}
+
+	return false;
+}
+
 static int __init cma_activate_area(struct cma *cma)
 {
 	int bitmap_size = BITS_TO_LONGS(cma_bitmap_maxno(cma)) * sizeof(long);
diff --git a/mm/page_alloc.c b/mm/page_alloc.c
index cf29227..9399cc4 100644
--- a/mm/page_alloc.c
+++ b/mm/page_alloc.c
@@ -1596,6 +1596,14 @@ void __init init_cma_reserved_pageblock(struct page *page)
 	unsigned long pfn = page_to_pfn(page);
 	struct page *p = page;
 	int nid = page_to_nid(page);
+	enum zone_type zone_type;
+
+	/*
+	 * In some cases, we don't have enough place to encode ZONE_CMA
+	 * in page flags. Use ZONE_MOVABLE in this case and do
+	 * post-processing when getting the page's zone.
+	 */
+	zone_type = ZONE_CMA_IN_PAGE_FLAGS ? ZONE_CMA : ZONE_MOVABLE;
 
 	/*
 	 * ZONE_CMA will steal present pages from other zones by changing
@@ -1609,7 +1617,7 @@ void __init init_cma_reserved_pageblock(struct page *page)
 		set_page_count(p, 0);
 
 		/* Steal pages from other zones */
-		set_page_links(p, ZONE_CMA, nid, pfn);
+		set_page_links(p, zone_type, nid, pfn);
 	} while (++p, ++pfn, --i);
 
 	adjust_present_page_count(page, pageblock_nr_pages);
-- 
2.7.4

^ permalink raw reply related	[flat|nested] 39+ messages in thread

* Re: [PATCH v7 0/7] Introduce ZONE_CMA
  2017-05-04 12:46                     ` Michal Hocko
@ 2017-05-11  8:51                       ` Vlastimil Babka
  0 siblings, 0 replies; 39+ messages in thread
From: Vlastimil Babka @ 2017-05-11  8:51 UTC (permalink / raw)
  To: Michal Hocko
  Cc: Joonsoo Kim, Andrew Morton, Rik van Riel, Johannes Weiner,
	mgorman, Laura Abbott, Minchan Kim, Marek Szyprowski,
	Michal Nazarewicz, Aneesh Kumar K . V, Russell King, Will Deacon,
	linux-mm, linux-kernel, kernel-team, Heiko Carstens,
	Martin Schwidefsky

On 05/04/2017 02:46 PM, Michal Hocko wrote:
> On Thu 04-05-17 14:33:24, Vlastimil Babka wrote:
>>>
>>> I am pretty sure s390 and ppc support NUMA and aim at supporting really
>>> large systems. 
>>
>> I don't see ppc there,
> 
> config KVM_BOOK3S_64_HV
>         tristate "KVM for POWER7 and later using hypervisor mode in host"
>         depends on KVM_BOOK3S_64 && PPC_POWERNV
>         select KVM_BOOK3S_HV_POSSIBLE
>         select MMU_NOTIFIER
>         select CMA
> 
> fa61a4e376d21 tries to explain some more

Uh, that's unfortunate then.

> [...]
>>> Are we really ready to add another thing like that? How are distribution
>>> kernels going to handle that?
>>
>> I still hope that generic enterprise/desktop distributions can disable
>> it, and it's only used for small devices with custom kernels.
>>
>> The config burden is already there in any case, it just translates to
>> extra migratetype and fastpath hooks, not extra zone and potentially
>> less nodes.
> 
> AFAIU the extra migrate type costs nothing when there are no cma
> reservations. And those hooks can be made noop behind static branch
> as well. So distribution kernels do not really have to be afraid of
> enabling CMA currently.

The tradeoff is unfortunate :/

^ permalink raw reply	[flat|nested] 39+ messages in thread

* Re: [PATCH v7 0/7] Introduce ZONE_CMA
  2017-05-11  2:12                   ` Joonsoo Kim
@ 2017-05-11  9:13                     ` Michal Hocko
  2017-05-12  2:00                       ` Joonsoo Kim
  0 siblings, 1 reply; 39+ messages in thread
From: Michal Hocko @ 2017-05-11  9:13 UTC (permalink / raw)
  To: Joonsoo Kim
  Cc: Andrew Morton, Rik van Riel, Johannes Weiner, mgorman,
	Laura Abbott, Minchan Kim, Marek Szyprowski, Michal Nazarewicz,
	Aneesh Kumar K . V, Vlastimil Babka, Russell King, Will Deacon,
	linux-mm, linux-kernel, kernel-team

On Thu 11-05-17 11:12:43, Joonsoo Kim wrote:
> Sorry for the late response. I was on a vacation.
> 
> On Tue, May 02, 2017 at 03:32:29PM +0200, Michal Hocko wrote:
> > On Tue 02-05-17 13:01:32, Joonsoo Kim wrote:
> > > On Thu, Apr 27, 2017 at 05:06:36PM +0200, Michal Hocko wrote:
> > [...]
> > > > I see this point and I agree that using a specific zone might be a
> > > > _nicer_ solution in the end but you have to consider another aspects as
> > > > well. The main one I am worried about is a long term maintainability.
> > > > We are really out of page flags and consuming one for a rather specific
> > > > usecase is not good. Look at ZONE_DMA. I am pretty sure that almost
> > > > no sane HW needs 16MB zone anymore, yet we have hard time to get rid
> > > > of it and so we have that memory laying around unused all the time
> > > > and blocking one page flag bit. CMA falls into a similar category
> > > > AFAIU. I wouldn't be all that surprised if a future HW will not need CMA
> > > > allocations in few years, yet we will have to fight to get rid of it
> > > > like we do with ZONE_DMA. And not only that. We will also have to fight
> > > > finding page flags for other more general usecases in the meantime.
> > > 
> > > This maintenance problem is inherent. This problem exists even if we
> > > uses MIGRATETYPE approach. We cannot remove many hooks for CMA if a
> > > future HW will not need CMA allocation in few years. The only
> > > difference is that one takes single zone bit only for CMA user and the
> > > other approach takes many hooks that we need to take care about it all
> > > the time.
> > 
> > And I consider this a big difference. Because while hooks are not nice
> > they will affect CMA users (in a sense of bugs/performance etc.). While
> > an additional bit consumed will affect potential future and more generic
> > features.
> 
> Because these hooks are so tricky and are spread on many places,
> bugs about these hooks can be made by *non-CMA* user and they hurt
> *CMA* user. These hooks could also delay non-CMA user's development speed
> since there are many hooks about CMA and understanding how CMA is managed
> is rather difficult.

Then make those hooks easier to maintain. Seriously, this is a
non-argument.

[...]

> > And all this can be isolated to CMA specific hooks with mostly minimum
> > impact to most users. I hear you saying that zone approach is more natural
> > and I would agree if we wouldn't have to care about the number of zones.
> 
> I attach a solution about one more bit in page flags although I don't
> agree with your opinion that additional bit is no-go approach. Just
> note that we have already used three bits for zone encoding in some
> configuration due to ZONE_DEVICE.

I am absolutely not happy about ZONE_DEVICE, but there is _no_ other
viable solution right now. I know that the people behind it are still
considering struct page over direct pfn usage, but they are not in the
same situation as CMA, which _can_ work without an additional zone.

If you _really_ insist on using a zone for CMA then reuse ZONE_MOVABLE.
I absolutely fail to see why you push for a specialized zone so hard.

[...]
> > No, but I haven't heard any single argument that those bugs are
> > impossible to fix with the current approach. They might be harder to fix
> > but if I can chose between harder for CMA and harder for other more
> > generic HW independent features I will go for the first one. And do not
> > take me wrong, I have nothing against CMA as such. It solves a real life
> > problem. I just believe it doesn't deserve to consume a new bit in page
> > flags because that is just too scarce resource.
> 
> As I mentioned above, I think that maintenance overhead due to CMA
> deserves to consume a new bit in page flags. It also provide us
> extendability to introduce more zones in the future.
> 
> Anyway, this value-judgement is subjective so I guess that we
> cannot agree with each other. To solve your concern,
> I make following solution. Please let me know your opinion about this.
> This patch can be applied on top of my ZONE_CMA series.

I do not think this makes the situation any easier or more acceptable
for merging.

But I feel we are looping without much progress. So let me NAK this
until it is _proven_ that the current code is unfixable and that
ZONE_MOVABLE cannot be reused.
-- 
Michal Hocko
SUSE Labs

^ permalink raw reply	[flat|nested] 39+ messages in thread

* Re: [PATCH v7 0/7] Introduce ZONE_CMA
  2017-05-11  9:13                     ` Michal Hocko
@ 2017-05-12  2:00                       ` Joonsoo Kim
  2017-05-12  6:38                         ` Michal Hocko
  0 siblings, 1 reply; 39+ messages in thread
From: Joonsoo Kim @ 2017-05-12  2:00 UTC (permalink / raw)
  To: Michal Hocko
  Cc: Andrew Morton, Rik van Riel, Johannes Weiner, mgorman,
	Laura Abbott, Minchan Kim, Marek Szyprowski, Michal Nazarewicz,
	Aneesh Kumar K . V, Vlastimil Babka, Russell King, Will Deacon,
	linux-mm, linux-kernel, kernel-team

On Thu, May 11, 2017 at 11:13:04AM +0200, Michal Hocko wrote:
> On Thu 11-05-17 11:12:43, Joonsoo Kim wrote:
> > Sorry for the late response. I was on a vacation.
> > 
> > On Tue, May 02, 2017 at 03:32:29PM +0200, Michal Hocko wrote:
> > > On Tue 02-05-17 13:01:32, Joonsoo Kim wrote:
> > > > On Thu, Apr 27, 2017 at 05:06:36PM +0200, Michal Hocko wrote:
> > > [...]
> > > > > I see this point and I agree that using a specific zone might be a
> > > > > _nicer_ solution in the end but you have to consider another aspects as
> > > > > well. The main one I am worried about is a long term maintainability.
> > > > > We are really out of page flags and consuming one for a rather specific
> > > > > usecase is not good. Look at ZONE_DMA. I am pretty sure that almost
> > > > > no sane HW needs 16MB zone anymore, yet we have hard time to get rid
> > > > > of it and so we have that memory laying around unused all the time
> > > > > and blocking one page flag bit. CMA falls into a similar category
> > > > > AFAIU. I wouldn't be all that surprised if a future HW will not need CMA
> > > > > allocations in few years, yet we will have to fight to get rid of it
> > > > > like we do with ZONE_DMA. And not only that. We will also have to fight
> > > > > finding page flags for other more general usecases in the meantime.
> > > > 
> > > > This maintenance problem is inherent. This problem exists even if we
> > > > uses MIGRATETYPE approach. We cannot remove many hooks for CMA if a
> > > > future HW will not need CMA allocation in few years. The only
> > > > difference is that one takes single zone bit only for CMA user and the
> > > > other approach takes many hooks that we need to take care about it all
> > > > the time.
> > > 
> > > And I consider this a big difference. Because while hooks are not nice
> > > they will affect CMA users (in a sense of bugs/performance etc.). While
> > > an additional bit consumed will affect potential future and more generic
> > > features.
> > 
> > Because these hooks are so tricky and are spread on many places,
> > bugs about these hooks can be made by *non-CMA* user and they hurt
> > *CMA* user. These hooks could also delay non-CMA user's development speed
> > since there are many hooks about CMA and understanding how CMA is managed
> > is rather difficult.
> 
> Then make those hooks easier to maintain. Seriously, this is a
> non-argument.

I can't understand what you said here. With the zone approach, someone who
isn't related to CMA doesn't need to understand how CMA is managed.

> 
> [...]
> 
> > > And all this can be isolated to CMA specific hooks with mostly minimum
> > > impact to most users. I hear you saying that zone approach is more natural
> > > and I would agree if we wouldn't have to care about the number of zones.
> > 
> > I attach a solution about one more bit in page flags although I don't
> > agree with your opinion that additional bit is no-go approach. Just
> > note that we have already used three bits for zone encoding in some
> > configuration due to ZONE_DEVICE.
> 
> I am absolutely not happy about ZONE_DEVICE but there is _no_ other
> viable solution right now. I know that people behind this are still
> considering struct page over direct pfn usage but they are not in the
> same situation as CMA which _can_ work without additional zone.

IIUC, ZONE_DEVICE can reuse the other zone and migratetype. What
they need is just the struct page; a separate zone is not strictly needed.
The other thing they want is to distinguish whether a page belongs to
ZONE_DEVICE memory or not, so they could use an approach similar to CMA's.

IMHO, there is almost nothing that _cannot_ work in the S/W world. What we
need to consider is just the trade-off. So, please don't talk about
impossibility again.

> 
> If you _really_ insist on using zone for CMA then reuse ZONE_MOVABLE.
> I absolutely fail to see why you push a specialized zone so hard.

As I said before, there is no fundamental issue with reusing ZONE_MOVABLE.
I just don't want to reuse it because the two have different
characteristics. In the MM subsystem context their characteristics are
the same, but CMA memory must be handed to the device at runtime, so a
stronger allocation guarantee is needed. See offline_pages() in
mm/memory_hotplug.c: memory offlining can afford to retry for up to
120 seconds, but a CMA allocation can't wait that long.
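
To be concrete, what I mean is roughly this part of mm/memory_hotplug.c
(a simplified sketch; details may differ in the current tree):

    /* offlining is allowed to keep retrying for up to 120 seconds */
    int offline_pages(unsigned long start_pfn, unsigned long nr_pages)
    {
            return __offline_pages(start_pfn, start_pfn + nr_pages, 120 * HZ);
    }

A device asking for its CMA region cannot be told to wait that long.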

And, this is a design issue. I don't want to argue about why we should
pursue a good design. Originally, zones exist to manage different types
of memory; migratetype is not for that purpose. Using a separate zone
fits the original purpose. Mixing them would be a bad design and we
would easily run into unexpected problems in the future.

> 
> [...]
> > > No, but I haven't heard any single argument that those bugs are
> > > impossible to fix with the current approach. They might be harder to fix
> > > but if I can chose between harder for CMA and harder for other more
> > > generic HW independent features I will go for the first one. And do not
> > > take me wrong, I have nothing against CMA as such. It solves a real life
> > > problem. I just believe it doesn't deserve to consume a new bit in page
> > > flags because that is just too scarce resource.
> > 
> > As I mentioned above, I think that maintenance overhead due to CMA
> > deserves to consume a new bit in page flags. It also provide us
> > extendability to introduce more zones in the future.
> > 
> > Anyway, this value-judgement is subjective so I guess that we
> > cannot agree with each other. To solve your concern,
> > I make following solution. Please let me know your opinion about this.
> > This patch can be applied on top of my ZONE_CMA series.
> 
> I do not think this makes the situation any easier or more acceptable for
> merging.

Please give the reason. This implementation doesn't use the additional
bit in page flags that you were concerned about. And there is no
performance regression, at least in my simple test.

> But I feel we are looping without much progress. So let me NAK this
> until it is _proven_ that the current code is unfixable and that
> ZONE_MOVABLE cannot be reused.

I want to keep all possibilities open, so could you check that ZONE_MOVABLE
can be overlapped with other zones? IIRC, your rework doesn't allow
it.

Thanks.

^ permalink raw reply	[flat|nested] 39+ messages in thread

* Re: [PATCH v7 0/7] Introduce ZONE_CMA
  2017-05-12  2:00                       ` Joonsoo Kim
@ 2017-05-12  6:38                         ` Michal Hocko
  2017-05-15  3:57                           ` Joonsoo Kim
  0 siblings, 1 reply; 39+ messages in thread
From: Michal Hocko @ 2017-05-12  6:38 UTC (permalink / raw)
  To: Joonsoo Kim
  Cc: Andrew Morton, Rik van Riel, Johannes Weiner, mgorman,
	Laura Abbott, Minchan Kim, Marek Szyprowski, Michal Nazarewicz,
	Aneesh Kumar K . V, Vlastimil Babka, Russell King, Will Deacon,
	linux-mm, linux-kernel, kernel-team

On Fri 12-05-17 11:00:48, Joonsoo Kim wrote:
> On Thu, May 11, 2017 at 11:13:04AM +0200, Michal Hocko wrote:
> > On Thu 11-05-17 11:12:43, Joonsoo Kim wrote:
> > > Sorry for the late response. I was on a vacation.
> > > 
> > > On Tue, May 02, 2017 at 03:32:29PM +0200, Michal Hocko wrote:
> > > > On Tue 02-05-17 13:01:32, Joonsoo Kim wrote:
> > > > > On Thu, Apr 27, 2017 at 05:06:36PM +0200, Michal Hocko wrote:
> > > > [...]
> > > > > > I see this point and I agree that using a specific zone might be a
> > > > > > _nicer_ solution in the end but you have to consider another aspects as
> > > > > > well. The main one I am worried about is a long term maintainability.
> > > > > > We are really out of page flags and consuming one for a rather specific
> > > > > > usecase is not good. Look at ZONE_DMA. I am pretty sure that almost
> > > > > > no sane HW needs 16MB zone anymore, yet we have hard time to get rid
> > > > > > of it and so we have that memory laying around unused all the time
> > > > > > and blocking one page flag bit. CMA falls into a similar category
> > > > > > AFAIU. I wouldn't be all that surprised if a future HW will not need CMA
> > > > > > allocations in few years, yet we will have to fight to get rid of it
> > > > > > like we do with ZONE_DMA. And not only that. We will also have to fight
> > > > > > finding page flags for other more general usecases in the meantime.
> > > > > 
> > > > > This maintenance problem is inherent. This problem exists even if we
> > > > > uses MIGRATETYPE approach. We cannot remove many hooks for CMA if a
> > > > > future HW will not need CMA allocation in few years. The only
> > > > > difference is that one takes single zone bit only for CMA user and the
> > > > > other approach takes many hooks that we need to take care about it all
> > > > > the time.
> > > > 
> > > > And I consider this a big difference. Because while hooks are not nice
> > > > they will affect CMA users (in a sense of bugs/performance etc.). While
> > > > an additional bit consumed will affect potential future and more generic
> > > > features.
> > > 
> > > Because these hooks are so tricky and are spread on many places,
> > > bugs about these hooks can be made by *non-CMA* user and they hurt
> > > *CMA* user. These hooks could also delay non-CMA user's development speed
> > > since there are many hooks about CMA and understanding how CMA is managed
> > > is rather difficult.
> > 
> > Then make those hooks easier to maintain. Seriously, this is a
> > non-argument.
> 
> I can't understand what you said here. 

I wanted to say that you can make those hooks so non-intrusive that
nobody outside of the CMA has to even care that CMA exists.
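
For illustration only, the kind of wrapper I have in mind (a hypothetical
helper, not something that exists in the tree) keeps the CMA detail in
one place so that !CONFIG_CMA builds and readers never see it:

    #ifdef CONFIG_CMA
    /* hypothetical helper: the only place that knows about MIGRATE_CMA */
    static inline bool free_area_has_cma(struct free_area *area)
    {
            return !list_empty(&area->free_list[MIGRATE_CMA]);
    }
    #else
    static inline bool free_area_has_cma(struct free_area *area)
    {
            return false;
    }
    #endif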

> With the zone approach, someone who
> isn't related to CMA doesn't need to understand how CMA is managed.
> 
> > 
> > [...]
> > 
> > > > And all this can be isolated to CMA specific hooks with mostly minimum
> > > > impact to most users. I hear you saying that zone approach is more natural
> > > > and I would agree if we wouldn't have to care about the number of zones.
> > > 
> > > I attach a solution about one more bit in page flags although I don't
> > > agree with your opinion that additional bit is no-go approach. Just
> > > note that we have already used three bits for zone encoding in some
> > > configuration due to ZONE_DEVICE.
> > 
> > I am absolutely not happy about ZONE_DEVICE but there is _no_ other
> > viable solution right now. I know that people behind this are still
> > considering struct page over direct pfn usage but they are not in the
> > same situation as CMA which _can_ work without additional zone.
> 
> IIUC, ZONE_DEVICE can reuse the other zone and migratetype.

They are not going to migrate anything or define any allocation fallback
policy because those pages are outside of the page allocator completely.
And that is why a zone approach is a reasonable approach. There are
probably other ways and I will certainly push going that way.

[...]

> > If you _really_ insist on using zone for CMA then reuse ZONE_MOVABLE.
> > I absolutely fail to see why you push a specialized zone so hard.
> 
> As I said before, there is no fundamental issue with reusing ZONE_MOVABLE.
> I just don't want to reuse it because the two have different
> characteristics. In the MM subsystem context their characteristics are
> the same, but CMA memory must be handed to the device at runtime, so a
> stronger allocation guarantee is needed. See offline_pages() in
> mm/memory_hotplug.c: memory offlining can afford to retry for up to
> 120 seconds, but a CMA allocation can't wait that long.

This is just an implementation detail. Pinned pages in the CMA ranges
should be easily checked. Moreover, memory hotplug cares only about
hotpluggable memory, and placing CMA ranges there could be seen as a
configuration bug.
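
Something along these lines (a hypothetical sketch on top of the existing
MIGRATE_CMA pageblock information, not code from the tree) would be enough
to detect such an overlap:

    /* hypothetical: does this pfn range contain any CMA pageblock? */
    static bool range_has_cma_pages(unsigned long start_pfn,
                                    unsigned long end_pfn)
    {
            unsigned long pfn;

            for (pfn = start_pfn; pfn < end_pfn; pfn += pageblock_nr_pages) {
                    struct page *page = pfn_to_page(pfn);

                    if (is_migrate_cma(get_pageblock_migratetype(page)))
                            return true;
            }
            return false;
    }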

> And, this is a design issue. I don't want to argue about why we should
> pursue a good design. Originally, zones exist to manage different types
> of memory; migratetype is not for that purpose. Using a separate zone
> fits the original purpose. Mixing them would be a bad design and we
> would easily run into unexpected problems in the future.

As I've said earlier, abusing ZONE_MOVABLE is not ideal either. I would
rather keep the status quo, fix the cluttered code and make it easier
to follow. But if you absolutely insist that a specialized zone is
necessary, then ZONE_MOVABLE a) already exists, so we do not need to
consume another bit, and b) most of the CMA zone characteristics overlap
with MOVABLE. So it is the least painful zone to use with the current
restrictions we have.

> > [...]
> > > > No, but I haven't heard any single argument that those bugs are
> > > > impossible to fix with the current approach. They might be harder to fix
> > > > but if I can chose between harder for CMA and harder for other more
> > > > generic HW independent features I will go for the first one. And do not
> > > > take me wrong, I have nothing against CMA as such. It solves a real life
> > > > problem. I just believe it doesn't deserve to consume a new bit in page
> > > > flags because that is just too scarce resource.
> > > 
> > > As I mentioned above, I think that maintenance overhead due to CMA
> > > deserves to consume a new bit in page flags. It also provide us
> > > extendability to introduce more zones in the future.
> > > 
> > > Anyway, this value-judgement is subjective so I guess that we
> > > cannot agree with each other. To solve your concern,
> > > I make following solution. Please let me know your opinion about this.
> > > This patch can be applied on top of my ZONE_CMA series.
> > 
> > I do not think this makes the situation any easier or more acceptable for
> > merging.
> 
> Please give the reason. This implementation doesn't use the additional
> bit in page flags that you were concerned about. And there is no
> performance regression, at least in my simple test.

I really do not want to question your "simple test" but page_zonenum() is
used in many performance-sensitive paths, and proving it doesn't regress
would require testing many different workloads. Are you going to do that?
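
Just to be explicit about what is at stake here, page_zonenum() is
essentially a shift-and-mask on page->flags, roughly (simplified; the
exact widths depend on the configuration):

    static inline enum zone_type page_zonenum(const struct page *page)
    {
            return (page->flags >> ZONES_PGSHIFT) & ZONES_MASK;
    }

Any trickery that changes how the zone number is decoded sits directly
in that path.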

> > But I feel we are looping without much progress. So let me NAK this
> > until it is _proven_ that the current code is unfixable and that
> > ZONE_MOVABLE cannot be reused.
> 
> I want to keep all possibilities open, so could you check that ZONE_MOVABLE
> can be overlapped with other zones? IIRC, your rework doesn't allow
> it.

My rework keeps the status quo, which is based on the assumption that
zones cannot overlap. A longer term plan is that this restriction is
removed. As I've said earlier overlapping zones is an interesting
concept which is definitely worth pursuing.

-- 
Michal Hocko
SUSE Labs

^ permalink raw reply	[flat|nested] 39+ messages in thread

* Re: [PATCH v7 0/7] Introduce ZONE_CMA
  2017-05-12  6:38                         ` Michal Hocko
@ 2017-05-15  3:57                           ` Joonsoo Kim
  2017-05-16  8:47                             ` Michal Hocko
  0 siblings, 1 reply; 39+ messages in thread
From: Joonsoo Kim @ 2017-05-15  3:57 UTC (permalink / raw)
  To: Michal Hocko
  Cc: Andrew Morton, Rik van Riel, Johannes Weiner, mgorman,
	Laura Abbott, Minchan Kim, Marek Szyprowski, Michal Nazarewicz,
	Aneesh Kumar K . V, Vlastimil Babka, Russell King, Will Deacon,
	linux-mm, linux-kernel, kernel-team

On Fri, May 12, 2017 at 08:38:15AM +0200, Michal Hocko wrote:
> On Fri 12-05-17 11:00:48, Joonsoo Kim wrote:
> > On Thu, May 11, 2017 at 11:13:04AM +0200, Michal Hocko wrote:
> > > On Thu 11-05-17 11:12:43, Joonsoo Kim wrote:
> > > > Sorry for the late response. I was on a vacation.
> > > > 
> > > > On Tue, May 02, 2017 at 03:32:29PM +0200, Michal Hocko wrote:
> > > > > On Tue 02-05-17 13:01:32, Joonsoo Kim wrote:
> > > > > > On Thu, Apr 27, 2017 at 05:06:36PM +0200, Michal Hocko wrote:
> > > > > [...]
> > > > > > > I see this point and I agree that using a specific zone might be a
> > > > > > > _nicer_ solution in the end but you have to consider another aspects as
> > > > > > > well. The main one I am worried about is a long term maintainability.
> > > > > > > We are really out of page flags and consuming one for a rather specific
> > > > > > > usecase is not good. Look at ZONE_DMA. I am pretty sure that almost
> > > > > > > no sane HW needs 16MB zone anymore, yet we have hard time to get rid
> > > > > > > of it and so we have that memory laying around unused all the time
> > > > > > > and blocking one page flag bit. CMA falls into a similar category
> > > > > > > AFAIU. I wouldn't be all that surprised if a future HW will not need CMA
> > > > > > > allocations in few years, yet we will have to fight to get rid of it
> > > > > > > like we do with ZONE_DMA. And not only that. We will also have to fight
> > > > > > > finding page flags for other more general usecases in the meantime.
> > > > > > 
> > > > > > This maintenance problem is inherent. This problem exists even if we
> > > > > > uses MIGRATETYPE approach. We cannot remove many hooks for CMA if a
> > > > > > future HW will not need CMA allocation in few years. The only
> > > > > > difference is that one takes single zone bit only for CMA user and the
> > > > > > other approach takes many hooks that we need to take care about it all
> > > > > > the time.
> > > > > 
> > > > > And I consider this a big difference. Because while hooks are not nice
> > > > > they will affect CMA users (in a sense of bugs/performance etc.). While
> > > > > an additional bit consumed will affect potential future and more generic
> > > > > features.
> > > > 
> > > > Because these hooks are so tricky and are spread on many places,
> > > > bugs about these hooks can be made by *non-CMA* user and they hurt
> > > > *CMA* user. These hooks could also delay non-CMA user's development speed
> > > > since there are many hooks about CMA and understanding how CMA is managed
> > > > is rather difficult.
> > > 
> > > Then make those hooks easier to maintain. Seriously, this is a
> > > non-argument.
> > 
> > I can't understand what you said here. 
> 
> I wanted to say that you can make those hooks so non-intrusive that
> nobody outside of the CMA has to even care that CMA exists.

I guess that the current code is already the result of such an effort,
and it is still intrusive.

> 
> > With the zone approach, someone who
> > isn't related to CMA doesn't need to understand how CMA is managed.
> > 
> > > 
> > > [...]
> > > 
> > > > > And all this can be isolated to CMA specific hooks with mostly minimum
> > > > > impact to most users. I hear you saying that zone approach is more natural
> > > > > and I would agree if we wouldn't have to care about the number of zones.
> > > > 
> > > > I attach a solution about one more bit in page flags although I don't
> > > > agree with your opinion that additional bit is no-go approach. Just
> > > > note that we have already used three bits for zone encoding in some
> > > > configuration due to ZONE_DEVICE.
> > > 
> > > I am absolutely not happy about ZONE_DEVICE but there is _no_ other
> > > viable solution right now. I know that people behind this are still
> > > considering struct page over direct pfn usage but they are not in the
> > > same situation as CMA which _can_ work without additional zone.
> > 
> > IIUC, ZONE_DEVICE can reuse the other zone and migratetype.
> 
> They are not going to migrate anything or define any allocation fallback
> policy because those pages are outside of the page allocator completely.
> And that is why a zone approach is a reasonable approach. There are
> probably other ways and I will certainly push going that way.

I have a different opinion but it's not the main issue here, so I won't
argue about it further.

> [...]
> 
> > > If you _really_ insist on using zone for CMA then reuse ZONE_MOVABLE.
> > > I absolutely fail to see why you push a specialized zone so hard.
> > 
> > As I said before, there is no fundamental issue with reusing ZONE_MOVABLE.
> > I just don't want to reuse it because the two have different
> > characteristics. In the MM subsystem context their characteristics are
> > the same, but CMA memory must be handed to the device at runtime, so a
> > stronger allocation guarantee is needed. See offline_pages() in
> > mm/memory_hotplug.c: memory offlining can afford to retry for up to
> > 120 seconds, but a CMA allocation can't wait that long.
> 
> This is just an implementation detail. Pinned pages in the CMA ranges
> should be easily checked. Moreover, memory hotplug cares only about
> hotpluggable memory, and placing CMA ranges there could be seen as a
> configuration bug.

I just wanted to say that there is a difference and I'm worried that it
would cause unexpected problems in the future. We could easily
distinguish them by adding another explicit check, but I don't think
that is the best way.

> > And, this is a design issue. I don't want to argue about why we should
> > pursue a good design. Originally, zones exist to manage different types
> > of memory; migratetype is not for that purpose. Using a separate zone
> > fits the original purpose. Mixing them would be a bad design and we
> > would easily run into unexpected problems in the future.
> 
> As I've said earlier, abusing ZONE_MOVABLE is not ideal either. I would
> rather keep the status quo, fix the cluttered code and make it easier
> to follow. But if you absolutely insist that a specialized zone is
> necessary, then ZONE_MOVABLE a) already exists, so we do not need to
> consume another bit, and b) most of the CMA zone characteristics overlap
> with MOVABLE. So it is the least painful zone to use with the current
> restrictions we have.

I also think that using ZONE_MOVABLE for CMA memory is a good
candidate. I will consider it more deeply.

> > > [...]
> > > > > No, but I haven't heard any single argument that those bugs are
> > > > > impossible to fix with the current approach. They might be harder to fix
> > > > > but if I can chose between harder for CMA and harder for other more
> > > > > generic HW independent features I will go for the first one. And do not
> > > > > take me wrong, I have nothing against CMA as such. It solves a real life
> > > > > problem. I just believe it doesn't deserve to consume a new bit in page
> > > > > flags because that is just too scarce resource.
> > > > 
> > > > As I mentioned above, I think that maintenance overhead due to CMA
> > > > deserves to consume a new bit in page flags. It also provide us
> > > > extendability to introduce more zones in the future.
> > > > 
> > > > Anyway, this value-judgement is subjective so I guess that we
> > > > cannot agree with each other. To solve your concern,
> > > > I make following solution. Please let me know your opinion about this.
> > > > This patch can be applied on top of my ZONE_CMA series.
> > > 
> > > I do not think this makes the situation any easier or more acceptable for
> > > merging.
> > 
> > Please give the reason. This implementation doesn't use the additional
> > bit in page flags that you were concerned about. And there is no
> > performance regression, at least in my simple test.
> 
> I really do not want to question your "simple test" but page_zonenum() is
> used in many performance-sensitive paths, and proving it doesn't regress
> would require testing many different workloads. Are you going to do that?

In fact, I don't think that we need to take this performance concern
too seriously. The reasons are:

1. Currently, there is a usable bit left in the page flags.
2. Even if someone else consumes that bit, there is still a spare bit
on 64-bit kernels. On 32-bit kernels the number of zones can only reach
five if both ZONE_CMA and ZONE_HIGHMEM are used, and using ZONE_HIGHMEM
on 32-bit systems is increasingly rare anyway.
3. Even if we do fall into that case, I can optimize it so that there
is no regression as long as ZONE_MOVABLE and ZONE_CMA aren't used
simultaneously with two zone bits in page flags; using both zones at
once is not the usual case.
4. This performance problem only affects CMA users, and there is also
a benefit from removing many hooks in the MM subsystem, so the net
result would not be worse.

So, I think that performance would be better in most cases. It would
be marginally worse in rare cases, and those users could bear with it.
Do you still think that using ZONE_MOVABLE for CMA memory is
necessary rather than a separate zone, ZONE_CMA?
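
To be concrete about how many bits the zones need: the number of bits
reserved for the zone number in page->flags is derived from MAX_NR_ZONES,
roughly like this (simplified from include/linux/page-flags-layout.h):

    #if MAX_NR_ZONES < 2
    #define ZONES_SHIFT 0
    #elif MAX_NR_ZONES <= 2
    #define ZONES_SHIFT 1
    #elif MAX_NR_ZONES <= 4
    #define ZONES_SHIFT 2
    #elif MAX_NR_ZONES <= 8
    #define ZONES_SHIFT 3
    #endif

So adding ZONE_CMA only costs an extra bit on configurations where it
pushes MAX_NR_ZONES across one of these boundaries.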

> > > But I feel we are looping without much progress. So let me NAK this
> > > until it is _proven_ that the current code is unfixable and that
> > > ZONE_MOVABLE cannot be reused.
> > 
> > I want to keep all possibilities open, so could you check that ZONE_MOVABLE
> > can be overlapped with other zones? IIRC, your rework doesn't allow
> > it.
> 
> My rework keeps the status quo, which is based on the assumption that
> zones cannot overlap. A longer term plan is that this restriction is
> removed. As I've said earlier overlapping zones is an interesting
> concept which is definitely worth pursuing.

Okay. We did a lot of discussion so it's better to summarise it.

1. ZONE_CMA might be a nicer solution than MIGRATETYPE.
2. Additional bit in page flags would cause another kind of
maintenance problem so it's better to avoid it as much as possible.
3. Abusing ZONE_MOVABLE looks better than introducing ZONE_CMA since
it doesn't need additional bit in page flag.
4. (Not-yet-finished) If ZONE_CMA doesn't need extra bit in page
flags with hacky magic and it has no performance regression,
??? (it's okay to use separate zone for CMA?)

If you clarify your opinion on the fourth point, I will try to work out
the best approach considering these arguments.

Thanks.

^ permalink raw reply	[flat|nested] 39+ messages in thread

* Re: [PATCH v7 0/7] Introduce ZONE_CMA
  2017-05-15  3:57                           ` Joonsoo Kim
@ 2017-05-16  8:47                             ` Michal Hocko
  2017-05-17  7:44                               ` Joonsoo Kim
  0 siblings, 1 reply; 39+ messages in thread
From: Michal Hocko @ 2017-05-16  8:47 UTC (permalink / raw)
  To: Joonsoo Kim
  Cc: Andrew Morton, Rik van Riel, Johannes Weiner, mgorman,
	Laura Abbott, Minchan Kim, Marek Szyprowski, Michal Nazarewicz,
	Aneesh Kumar K . V, Vlastimil Babka, Russell King, Will Deacon,
	linux-mm, linux-kernel, kernel-team

On Mon 15-05-17 12:57:15, Joonsoo Kim wrote:
> On Fri, May 12, 2017 at 08:38:15AM +0200, Michal Hocko wrote:
[...]
> > I really do not want to question your "simple test" but page_zonenum() is
> > used in many performance-sensitive paths, and proving it doesn't regress
> > would require testing many different workloads. Are you going to do that?
> 
> In fact, I don't think that we need to take this performance concern
> too seriously. The reasons are:
> 
> 1. Currently, there is a usable bit left in the page flags.
> 2. Even if someone else consumes that bit, there is still a spare bit
> on 64-bit kernels. On 32-bit kernels the number of zones can only reach
> five if both ZONE_CMA and ZONE_HIGHMEM are used, and using ZONE_HIGHMEM
> on 32-bit systems is increasingly rare anyway.
> 3. Even if we do fall into that case, I can optimize it so that there
> is no regression as long as ZONE_MOVABLE and ZONE_CMA aren't used
> simultaneously with two zone bits in page flags; using both zones at
> once is not the usual case.
> 4. This performance problem only affects CMA users, and there is also
> a benefit from removing many hooks in the MM subsystem, so the net
> result would not be worse.

A lot of fiddling for something that we can address in a different way,
really.

> So, I think that performance would be better in most cases. It would
> be marginally worse in rare cases, and those users could bear with it.
> Do you still think that using ZONE_MOVABLE for CMA memory is
> necessary rather than a separate zone, ZONE_CMA?

Yes, because the main point is that a new zone is not really needed
AFAICS. Just try to reuse what we already have (ZONE_MOVABLE). Moreover,
a new zone just pulls in a lot of infrastructure which will never be
used.

> > > > But I feel we are looping without much progress. So let me NAK this
> > > > until it is _proven_ that the current code is unfixable and that
> > > > ZONE_MOVABLE cannot be reused.
> > > 
> > > I want to keep all possibilities open, so could you check that ZONE_MOVABLE
> > > can be overlapped with other zones? IIRC, your rework doesn't allow
> > > it.
> > 
> > My rework keeps the status quo, which is based on the assumption that
> > zones cannot overlap. A longer term plan is that this restriction is
> > removed. As I've said earlier overlapping zones is an interesting
> > concept which is definitely worth pursuing.
> 
> Okay. We did a lot of discussion so it's better to summarise it.
> 
> 1. ZONE_CMA might be a nicer solution than MIGRATETYPE.
> 2. Additional bit in page flags would cause another kind of
> maintenance problem so it's better to avoid it as much as possible.
> 3. Abusing ZONE_MOVABLE looks better than introducing ZONE_CMA since
> it doesn't need additional bit in page flag.
> 4. (Not-yet-finished) If ZONE_CMA doesn't need extra bit in page
> flags with hacky magic and it has no performance regression,
> ??? (it's okay to use separate zone for CMA?)

As mentioned above, I do not see why we should jump through additional
hoops just to have a zone which is not strictly needed. So if there are
no inherent problems with reusing the MOVABLE/HIGHMEM zone then a
separate zone sounds like the wrong direction.

But let me repeat. I am _not_ convinced that the migratetype situation
is all that bad and unfixable. You have mentioned some issues with the
current approach but none of them seem inherently unfixable. So I would
still prefer keeping the current way. But I am not going to insist if
you _really_ believe that the long term maintenance cost will be higher
than a zone approach and you can reuse MOVABLE/HIGHMEM zones without
disruptive changes. I can help you with the hotplug part of the MOVABLE
zone because that is desirable on its own.

-- 
Michal Hocko
SUSE Labs

^ permalink raw reply	[flat|nested] 39+ messages in thread

* Re: [PATCH v7 0/7] Introduce ZONE_CMA
  2017-05-16  8:47                             ` Michal Hocko
@ 2017-05-17  7:44                               ` Joonsoo Kim
  0 siblings, 0 replies; 39+ messages in thread
From: Joonsoo Kim @ 2017-05-17  7:44 UTC (permalink / raw)
  To: Michal Hocko
  Cc: Andrew Morton, Rik van Riel, Johannes Weiner, mgorman,
	Laura Abbott, Minchan Kim, Marek Szyprowski, Michal Nazarewicz,
	Aneesh Kumar K . V, Vlastimil Babka, Russell King, Will Deacon,
	linux-mm, linux-kernel, kernel-team

> > 
> > Okay. We did a lot of discussion so it's better to summarise it.
> > 
> > 1. ZONE_CMA might be a nicer solution than MIGRATETYPE.
> > 2. Additional bit in page flags would cause another kind of
> > maintenance problem so it's better to avoid it as much as possible.
> > 3. Abusing ZONE_MOVABLE looks better than introducing ZONE_CMA since
> > it doesn't need additional bit in page flag.
> > 4. (Not-yet-finished) If ZONE_CMA doesn't need extra bit in page
> > flags with hacky magic and it has no performance regression,
> > ??? (it's okay to use separate zone for CMA?)
> 
> As mentioned above, I do not see why we should jump through additional
> hoops just to have a zone which is not strictly needed. So if there are
> no inherent problems with reusing the MOVABLE/HIGHMEM zone then a
> separate zone sounds like the wrong direction.
> 
> But let me repeat. I am _not_ convinced that the migratetype situation
> is all that bad and unfixable. You have mentioned some issues with the
> current approach but none of them seem inherently unfixable. So I would
> still prefer keeping the current way. But I am not going to insist if
> you _really_ believe that the long term maintenance cost will be higher
> than a zone approach and you can reuse MOVABLE/HIGHMEM zones without
> disruptive changes. I can help you with the hotplug part of the MOVABLE
> zone because that is desirable on its own.

Okay. Thanks for sharing your opinion. I will decide the final
direction after some investigation.

Thanks.

^ permalink raw reply	[flat|nested] 39+ messages in thread

end of thread, other threads:[~2017-05-17  7:44 UTC | newest]

Thread overview: 39+ messages (download: mbox.gz / follow: Atom feed)
-- links below jump to the message on this page --
2017-04-11  3:17 [PATCH v7 0/7] Introduce ZONE_CMA js1304
2017-04-11  3:17 ` [PATCH v7 1/7] mm/page_alloc: don't reserve ZONE_HIGHMEM for ZONE_MOVABLE request js1304
2017-04-17  7:38   ` Minchan Kim
2017-04-21  1:32     ` Joonsoo Kim
2017-04-11  3:17 ` [PATCH v7 2/7] mm/cma: introduce new zone, ZONE_CMA js1304
2017-04-11  3:17 ` [PATCH v7 3/7] mm/cma: populate ZONE_CMA js1304
2017-04-11  3:17 ` [PATCH v7 4/7] mm/cma: remove ALLOC_CMA js1304
2017-04-11  3:17 ` [PATCH v7 5/7] mm/cma: remove MIGRATE_CMA js1304
2017-04-11  3:17 ` [PATCH v7 6/7] mm/cma: remove per zone CMA stat js1304
2017-04-11  3:17 ` [PATCH v7 7/7] ARM: CMA: avoid re-mapping CMA region if CONFIG_HIGHMEM js1304
2017-04-11 18:15 ` [PATCH v7 0/7] Introduce ZONE_CMA Michal Hocko
2017-04-12  1:35   ` Joonsoo Kim
2017-04-13 11:56     ` Michal Hocko
2017-04-17  2:02       ` Joonsoo Kim
2017-04-21  1:35         ` Joonsoo Kim
2017-04-21  6:54           ` Michal Hocko
2017-04-24 13:09         ` Michal Hocko
2017-04-25  3:42           ` Joonsoo Kim
2017-04-27 15:06             ` Michal Hocko
2017-04-28  8:04               ` Generic approach to customizable zones - was: " Igor Stoppa
2017-04-28  8:36                 ` Michal Hocko
2017-04-28  9:04                   ` Igor Stoppa
2017-05-02  4:01               ` Joonsoo Kim
2017-05-02 13:32                 ` Michal Hocko
2017-05-11  2:12                   ` Joonsoo Kim
2017-05-11  9:13                     ` Michal Hocko
2017-05-12  2:00                       ` Joonsoo Kim
2017-05-12  6:38                         ` Michal Hocko
2017-05-15  3:57                           ` Joonsoo Kim
2017-05-16  8:47                             ` Michal Hocko
2017-05-17  7:44                               ` Joonsoo Kim
2017-05-02  8:06               ` Vlastimil Babka
2017-05-02 13:03                 ` Michal Hocko
2017-05-02 13:41                   ` Igor Stoppa
2017-05-04 12:33                   ` Vlastimil Babka
2017-05-04 12:46                     ` Michal Hocko
2017-05-11  8:51                       ` Vlastimil Babka
2017-04-12  1:39 ` Joonsoo Kim
2017-04-24  4:08 ` Bob Liu

This is a public inbox, see mirroring instructions
for how to clone and mirror all data and code used for this inbox;
as well as URLs for NNTP newsgroup(s).