Hi Mel,

Thanks for pointing out the problems in this patchset. It was my intern
project at NVIDIA last summer. I only used micro-benchmarks to demonstrate
the large memory bandwidth utilization gap between base page migration and
THP migration, and between serialized and parallel page migration.

Here are cross-socket serialized page migration results from calling the
move_pages() syscall:

In x86_64, an Intel two-socket E5-2640v3 box:
- a single 4KB base page migration takes 62.47 us, using 0.06 GB/s BW,
- a single 2MB THP migration takes 658.54 us, using 2.97 GB/s BW,
- 512 4KB base page migrations take 1987.38 us, using 0.98 GB/s BW.

In ppc64, a two-socket Power8 box:
- a single 64KB base page migration takes 49.3 us, using 1.24 GB/s BW,
- a single 16MB THP migration takes 2202.17 us, using 7.10 GB/s BW,
- 256 64KB base page migrations take 2543.65 us, using 6.14 GB/s BW.

THP migration is not slow at all when compared to a group of equivalent
base page migrations.

For 1-thread vs 8-thread THP migration:

In x86_64:
- 1-thread 2MB THP migration takes 658.54 us, using 2.97 GB/s BW,
- 8-thread 2MB THP migration takes 227.76 us, using 8.58 GB/s BW.

In ppc64:
- 1-thread 16MB THP migration takes 2202.17 us, using 7.10 GB/s BW,
- 8-thread 16MB THP migration takes 1223.87 us, using 12.77 GB/s BW.

This big increase in BW utilization is the motivation for pushing this
patchset.

> So the key potential issue here in my mind is that THP migration is too slow
> in some cases. What I object to is improving that using a high priority
> workqueue that potentially starves other CPUs and pollutes their cache
> which is generally very expensive.

I do not completely agree with this. Using a high priority workqueue
guarantees that the page migration work is done as soon as possible;
otherwise, if the data copy threads have to wait, we completely lose the
speedup brought by parallel page migration. I do understand your concern
about the CPU utilization impact. Checking CPU utilization and only using
idle CPUs could potentially avoid this problem.

> Lets look at the core of what copy_huge_page does in mm/migrate.c which
> is the function that gets parallelised by the series in question. For
> a !HIGHMEM system, it's woefully inefficient. Historically, it was an
> implementation that would work generically which was fine but maybe not
> for future systems. It was also fine back when hugetlbfs was the only huge
> page implementation and COW operations were incredibly rare on the grounds
> due to the risk that they could terminate the process with prejudice.
>
> The function takes a huge page, splits it into PAGE_SIZE chunks, kmap_atomics
> the source and destination for each PAGE_SIZE chunk and copies it. The
> parallelised version does one kmap and copies it in chunks assuming the
> THP is fully mapped and accessible. Fundamentally, this is broken in the
> generic sense as the kmap is not guaranteed to make the whole page necessary
> but it happens to work on !highmem systems. What is more important to
> note is that it's multiple preempt and pagefault enables and disables
> on a per-page basis that happens 512 times (for THP on x86-64 at least),
> all of which are expensive operations depending on the kernel config and
> I suspect that the parallisation is actually masking that stupid overhead.

You are right about kmap. Making this patchset depend on !HIGHMEM would
avoid the problem; it makes little sense to kmap potentially 512 base
pages just to migrate a single THP on a system with highmem.
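To make that concrete, what I have in mind for a !HIGHMEM-only path is
roughly the sketch below. It is untested and not part of the posted
series; the function name and where it would hook into mm/migrate.c are
placeholders:

#include <linux/mm.h>           /* struct page, page_address(), PAGE_SIZE */
#include <linux/huge_mm.h>      /* hpage_nr_pages() */
#include <linux/string.h>       /* memcpy() */

/*
 * Untested illustration: with !HIGHMEM every page is in the kernel direct
 * map, so the per-4KB kmap_atomic()/copy/kunmap_atomic() loop in
 * copy_huge_page() could collapse into one flat copy of the whole THP.
 */
static void copy_huge_page_flat(struct page *dst, struct page *src)
{
        int nr_pages = hpage_nr_pages(src);     /* 512 for a 2MB THP on x86_64 */
        void *vto = page_address(dst);
        void *vfrom = page_address(src);

        memcpy(vto, vfrom, nr_pages * PAGE_SIZE);
}

That would also get rid of the repeated per-4KB preempt and pagefault
enable/disable you point out.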
> At the very least, I would have expected an initial attempt of one patch that
> optimised for !highmem systems to ignore kmap, simply disable preempt (if
> that is even necessary, I didn't check) and copy a pinned physical->physical
> page as a single copy without looping on a PAGE_SIZE basis and see how
> much that gained. Do it initially for THP only and worry about gigantic
> pages when or if that is a problem.

I can try this out to show how much improvement we can get over the
existing THP migration numbers shown in the data above.

> That would be patch 1 of a series. Maybe that'll be enough, maybe not but
> I feel it's important to optimise the serialised case as much as possible
> before considering parallelisation to highlight and justify why it's
> necessary[1]. If nothing else, what if two CPUs both parallelise a migration
> at the same time and end up preempting each other? Between that and the
> workqueue setup, it's potentially much slower than an optimised serial copy.
>
> It would be tempting to experiment but the test case was not even included
> with the series (maybe it's somewhere else)[2]. While it's obvious how
> such a test case could be constructed, it feels unnecessary to construct
> it when it should be in the changelog.

Do you mean performing multiple parallel page migrations at the same time
and reporting the migration time of each?

--
Best Regards,
Yan Zi