Hi Mel,

Thanks for pointing out the problems in this patchset. It was my intern
project at NVIDIA last summer. I only used micro-benchmarks to demonstrate
the large memory bandwidth utilization gap between base page migration and
THP migration, and between serialized and parallel page migration.

Here are cross-socket serialized page migration results from calling the
move_pages() syscall:

In x86_64, an Intel two-socket E5-2640v3 box:
- a single 4KB base page migration takes 62.47 us, using 0.06 GB/s BW,
- a single 2MB THP migration takes 658.54 us, using 2.97 GB/s BW,
- 512 4KB base page migrations take 1987.38 us, using 0.98 GB/s BW.

In ppc64, a two-socket Power8 box:
- a single 64KB base page migration takes 49.3 us, using 1.24 GB/s BW,
- a single 16MB THP migration takes 2202.17 us, using 7.10 GB/s BW,
- 256 64KB base page migrations take 2543.65 us, using 6.14 GB/s BW.

THP migration is not slow at all when compared to a group of equivalent
base page migrations.

For 1-thread vs 8-thread THP migration:

In x86_64:
- 1-thread 2MB THP migration takes 658.54 us, using 2.97 GB/s BW,
- 8-thread 2MB THP migration takes 227.76 us, using 8.58 GB/s BW.

In ppc64:
- 1-thread 16MB THP migration takes 2202.17 us, using 7.10 GB/s BW,
- 8-thread 16MB THP migration takes 1223.87 us, using 12.77 GB/s BW.

This big increase in BW utilization is the motivation for pushing this
patchset.

> So the key potential issue here in my mind is that THP migration is too slow
> in some cases. What I object to is improving that using a high priority
> workqueue that potentially starves other CPUs and pollutes their cache
> which is generally very expensive.

I do not completely agree with this. Using a high priority workqueue
guarantees that the page migration work is done as soon as possible;
otherwise, if the data copy threads have to wait, we completely lose the
speedup brought by parallel page migration. I do understand your concern
about the CPU utilization impact. Checking CPU utilization and only using
idle CPUs could potentially avoid this problem.

> Lets look at the core of what copy_huge_page does in mm/migrate.c which
> is the function that gets parallelised by the series in question. For
> a !HIGHMEM system, it's woefully inefficient. Historically, it was an
> implementation that would work generically which was fine but maybe not
> for future systems. It was also fine back when hugetlbfs was the only huge
> page implementation and COW operations were incredibly rare on the grounds
> due to the risk that they could terminate the process with prejudice.
>
> The function takes a huge page, splits it into PAGE_SIZE chunks, kmap_atomics
> the source and destination for each PAGE_SIZE chunk and copies it. The
> parallelised version does one kmap and copies it in chunks assuming the
> THP is fully mapped and accessible. Fundamentally, this is broken in the
> generic sense as the kmap is not guaranteed to make the whole page necessary
> but it happens to work on !highmem systems. What is more important to
> note is that it's multiple preempt and pagefault enables and disables
> on a per-page basis that happens 512 times (for THP on x86-64 at least),
> all of which are expensive operations depending on the kernel config and
> I suspect that the parallisation is actually masking that stupid overhead.

You are right about kmap. Making this patchset depend on !HIGHMEM would
avoid the problem; it makes little sense to kmap potentially 512 base
pages just to migrate a single THP on a system with highmem.
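To make that concrete, what I have in mind for a !HIGHMEM-only path is
roughly the sketch below. It is untested and not part of the posted
series; the function name and where it would hook into mm/migrate.c are
placeholders:

#include <linux/mm.h>           /* struct page, page_address(), PAGE_SIZE */
#include <linux/huge_mm.h>      /* hpage_nr_pages() */
#include <linux/string.h>       /* memcpy() */

/*
 * Untested illustration: with !HIGHMEM every page is in the kernel direct
 * map, so the per-4KB kmap_atomic()/copy/kunmap_atomic() loop in
 * copy_huge_page() could collapse into one flat copy of the whole THP.
 */
static void copy_huge_page_flat(struct page *dst, struct page *src)
{
        int nr_pages = hpage_nr_pages(src);     /* 512 for a 2MB THP on x86_64 */
        void *vto = page_address(dst);
        void *vfrom = page_address(src);

        memcpy(vto, vfrom, nr_pages * PAGE_SIZE);
}

That would also get rid of the repeated per-4KB preempt and pagefault
enable/disable you point out.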
> At the very least, I would have expected an initial attempt of one patch that
> optimised for !highmem systems to ignore kmap, simply disable preempt (if
> that is even necessary, I didn't check) and copy a pinned physical->physical
> page as a single copy without looping on a PAGE_SIZE basis and see how
> much that gained. Do it initially for THP only and worry about gigantic
> pages when or if that is a problem.

I can try this out to show how much improvement we can get over the
existing THP migration numbers shown in the data above.

> That would be patch 1 of a series. Maybe that'll be enough, maybe not but
> I feel it's important to optimise the serialised case as much as possible
> before considering parallelisation to highlight and justify why it's
> necessary[1]. If nothing else, what if two CPUs both parallelise a migration
> at the same time and end up preempting each other? Between that and the
> workqueue setup, it's potentially much slower than an optimised serial copy.
>
> It would be tempting to experiment but the test case was not even included
> with the series (maybe it's somewhere else)[2]. While it's obvious how
> such a test case could be constructed, it feels unnecessary to construct
> it when it should be in the changelog.

Do you mean performing multiple parallel page migrations at the same time
and reporting the migration time of each?

--
Best Regards,
Yan Zi