On 2023/2/8 2:21 PM, haoxin wrote:
>
> On my arm64 server with 128 cores and 2 NUMA nodes.
>
> I used memhog as the benchmark:
>
>     numactl -m -C 5 memhog -r100000 1G
>
    Correction: numactl -m 0 -C 5 memhog -r100000 1G

> The test results are as follows:
>
> With this patch:
>
>     # time migratepages 8490 0 1
>     real 0m1.161s
>     user 0m0.000s
>     sys 0m1.161s
>
> Without this patch:
>
>     # time migratepages 8460 0 1
>     real 0m2.068s
>     user 0m0.001s
>     sys 0m2.068s
>
> So the migration is about *78%* faster with the patch.
>
> This is the perf record info.
>
> w/o this patch:
> +   51.07%     0.09%  migratepages  [kernel.kallsyms]  [k] migrate_folio_extra
> +   42.43%     0.04%  migratepages  [kernel.kallsyms]  [k] folio_copy
> +   42.34%    42.34%  migratepages  [kernel.kallsyms]  [k] __pi_copy_page
> +   33.99%     0.09%  migratepages  [kernel.kallsyms]  [k] rmap_walk_anon
> +   32.35%     0.04%  migratepages  [kernel.kallsyms]  [k] try_to_migrate
> *+   27.78%    27.78%  migratepages  [kernel.kallsyms]  [k] ptep_clear_flush*
> +    8.19%     6.64%  migratepages  [kernel.kallsyms]  [k] folio_migrate_flagsmigrati_tlb_flush
>
> w/ this patch:
> +   18.57%     0.13%  migratepages  [kernel.kallsyms]  [k] migrate_pages
> +   18.23%     0.07%  migratepages  [kernel.kallsyms]  [k] migrate_pages_batch
> +   16.29%     0.13%  migratepages  [kernel.kallsyms]  [k] migrate_folio_move
> +   12.73%     0.10%  migratepages  [kernel.kallsyms]  [k] move_to_new_folio
> +   12.52%     0.06%  migratepages  [kernel.kallsyms]  [k] migrate_folio_extra
>
> Therefore, this patch helps improve page migration performance.
>
> So, you can add Tested-by: Xin Hao
>
> On 2023/2/6 2:33 PM, Huang Ying wrote:
>> From: "Huang, Ying"
>>
>> Now, migrate_pages() migrates folios one by one, as in the following
>> pseudocode,
>>
>>   for each folio
>>     unmap
>>     flush TLB
>>     copy
>>     restore map
>>
>> If multiple folios are passed to migrate_pages(),
>> there are opportunities to batch the TLB flushing and copying. That is,
>> we can change the code to something as follows,
>>
>>   for each folio
>>     unmap
>>   for each folio
>>     flush TLB
>>   for each folio
>>     copy
>>   for each folio
>>     restore map
>>
>> The total number of TLB flushing IPIs can be reduced considerably. And
>> we may use a hardware accelerator such as DSA to accelerate the
>> folio copying.
>>
>> So in this patch, we refactor the migrate_pages() implementation and
>> implement the TLB flushing batching. Based on this, hardware
>> accelerated folio copying can be implemented.
>>
>> If too many folios are passed to migrate_pages(), in the naive batched
>> implementation, we may unmap too many folios at the same time. The
>> possibility for a task to wait for the migrated folios to be mapped
>> again increases. So the latency may be hurt. To deal with this
>> issue, the maximum number of folios unmapped in one batch is restricted
>> to no more than HPAGE_PMD_NR in the unit of page. That is, the influence
>> is at the same level as THP migration.
>>
>> We use the following test to measure the performance impact of the
>> patchset,
>>
>> On a 2-socket Intel server,
>>
>> - Run pmbench memory accessing benchmark
>>
>> - Run `migratepages` to migrate pages of pmbench between node 0 and
>>   node 1 back and forth.
>>
>> With the patch, the number of TLB flushing IPIs is reduced by 99.1%
>> during the test and the number of pages migrated successfully per
>> second increases by 291.7%.
>>
>> This patchset is based on v6.2-rc4.
>>
>> Changes:
>>
>> v4:
>>
>> - Fixed another bug about non-LRU folio migration. Thanks Hyeonggon!
>>
>> v3:
>>
>> - Rebased on v6.2-rc4
>>
>> - Fixed a bug about non-LRU folio migration. Thanks Mike!
>>
>> - Fixed some comments. Thanks Baolin!
>>
>> - Collected reviewed-by.
>>
>> v2:
>>
>> - Rebased on v6.2-rc3
>>
>> - Fixed type force cast warning. Thanks Kees!
>>
>> - Added more comments and cleaned up the code. Thanks Andrew, Zi, Alistair, Dan!
>>
>> - Collected reviewed-by.
>>
>> from rfc to v1:
>>
>> - Rebased on v6.2-rc1
>>
>> - Fix the deadlock issue caused by locking multiple pages synchronously
>>   per Alistair's comments. Thanks!
>>
>> - Fix the autonumabench panic per Rao's comments and fix. Thanks!
>>
>> - Other minor fixes per comments. Thanks!
>>
>> Best Regards,
>> Huang, Ying
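
To make the flush-count difference between the two pseudocode loops in the cover letter concrete, here is a small toy model in Python. It is not kernel code and only counts operations. The function names, the `BATCH_LIMIT` value of 512 (HPAGE_PMD_NR assuming 4 KiB base pages and 2 MiB PMD-sized huge pages), and the assumption that every folio is a single page are illustrative assumptions, not taken from the patch.

```python
# Toy model of the two migration flows; it records operations, nothing more.

# Assumption: HPAGE_PMD_NR with 4 KiB base pages and 2 MiB PMD huge pages.
BATCH_LIMIT = 512


def migrate_one_by_one(folios):
    """Old flow: unmap, flush TLB, copy, restore map for each folio in turn."""
    log = []
    for folio in folios:
        log.append(("unmap", folio))
        log.append(("flush", folio))
        log.append(("copy", folio))
        log.append(("remap", folio))
    return log


def migrate_batched(folios, batch_limit=BATCH_LIMIT):
    """Batched flow: unmap a bounded batch, flush once, then copy and remap."""
    log = []
    for start in range(0, len(folios), batch_limit):
        batch = folios[start:start + batch_limit]
        for folio in batch:
            log.append(("unmap", folio))
        # One flush covers every folio unmapped in this batch.
        log.append(("flush", None))
        for folio in batch:
            log.append(("copy", folio))
        for folio in batch:
            log.append(("remap", folio))
    return log


def count(log, op):
    """Number of times operation `op` appears in the log."""
    return sum(1 for name, _ in log if name == op)


if __name__ == "__main__":
    folios = list(range(2048))  # 2048 hypothetical single-page folios
    print("one-by-one flushes:", count(migrate_one_by_one(folios), "flush"))
    print("batched flushes:   ", count(migrate_batched(folios), "flush"))
```

In this model, 2048 single-page folios cost 2048 flushes in the one-by-one flow and 4 in the batched flow (one per batch of 512), while the number of copies stays the same. The 99.1% IPI reduction quoted above is a measured result and is not derived from this model.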