* [PATCH] mm: migration: fix migration of huge PMD shared pages
@ 2018-08-13 3:41 Mike Kravetz
2018-08-13 4:10 ` kbuild test robot
` (2 more replies)
0 siblings, 3 replies; 10+ messages in thread
From: Mike Kravetz @ 2018-08-13 3:41 UTC (permalink / raw)
To: linux-mm, linux-kernel
Cc: Kirill A . Shutemov, Jérôme Glisse, Vlastimil Babka,
Naoya Horiguchi, Davidlohr Bueso, Michal Hocko, Andrew Morton,
Mike Kravetz
The page migration code employs try_to_unmap() to try and unmap the
source page. This is accomplished by using rmap_walk to find all
vmas where the page is mapped. This search stops when page mapcount
is zero. For shared PMD huge pages, the page map count is always 1
no matter the number of mappings. Shared mappings are tracked via
the reference count of the PMD page. Therefore, try_to_unmap stops
prematurely and does not completely unmap all mappings of the source
page.
This problem can result in data corruption as writes to the original
source page can happen after contents of the page are copied to the
target page. Hence, data is lost.
This problem was originally seen as DB corruption of shared global
areas after a huge page was soft offlined. DB developers noticed
they could reproduce the issue by (hotplug) offlining memory used
to back huge pages. A simple testcase can reproduce the problem by
creating a shared PMD mapping (note that this must be at least
PUD_SIZE in size and PUD_SIZE aligned (1GB on x86)), and using
migrate_pages() to migrate process pages between nodes.
To fix, have the try_to_unmap_one routine check for huge PMD sharing
by calling huge_pmd_unshare for hugetlbfs huge pages. If it is a
shared mapping it will be 'unshared' which removes the page table
entry and drops the reference on the PMD page. After this, flush caches
and TLB.
Signed-off-by: Mike Kravetz <mike.kravetz@oracle.com>
---
I am not 100% sure on the required flushing, so suggestions would be
appreciated. This should also go to stable. The bug has been around for
a long time, so I am still looking for an appropriate 'Fixes:' tag.
mm/rmap.c | 21 +++++++++++++++++++++
1 file changed, 21 insertions(+)
diff --git a/mm/rmap.c b/mm/rmap.c
index 09a799c9aebd..45583758bf16 100644
--- a/mm/rmap.c
+++ b/mm/rmap.c
@@ -1409,6 +1409,27 @@ static bool try_to_unmap_one(struct page *page, struct vm_area_struct *vma,
subpage = page - page_to_pfn(page) + pte_pfn(*pvmw.pte);
address = pvmw.address;
+ /*
+ * PMDs for hugetlbfs pages could be shared. In this case,
+ * pages with shared PMDs will have a mapcount of 1 no matter
+ * how many times they are actually mapped. Map counting for
+ * PMD sharing is mostly done via the reference count on the
+ * PMD page itself. If the page we are trying to unmap is a
+ * hugetlbfs page, attempt to 'unshare' at the PMD level.
+ * huge_pmd_unshare takes care of clearing the PUD and
+ * reference counting on the PMD page which effectively unmaps
+ * the page. Take care of flushing cache and TLB for page in
+ * this specific mapping here.
+ */
+ if (PageHuge(page) &&
+ huge_pmd_unshare(mm, &address, pvmw.pte)) {
+ unsigned long end_add = address + vma_mmu_pagesize(vma);
+
+ flush_cache_range(vma, address, end_add);
+ flush_tlb_range(vma, address, end_add);
+ mmu_notifier_invalidate_range(mm, address, end_add);
+ continue;
+ }
if (IS_ENABLED(CONFIG_MIGRATION) &&
(flags & TTU_MIGRATION) &&
--
2.17.1
* Re: [PATCH] mm: migration: fix migration of huge PMD shared pages
2018-08-13 3:41 [PATCH] mm: migration: fix migration of huge PMD shared pages Mike Kravetz
@ 2018-08-13 4:10 ` kbuild test robot
2018-08-14 0:30 ` [PATCH v2] " Mike Kravetz
2018-08-13 4:17 ` [PATCH] " kbuild test robot
2018-08-13 10:58 ` Kirill A. Shutemov
2 siblings, 1 reply; 10+ messages in thread
From: kbuild test robot @ 2018-08-13 4:10 UTC (permalink / raw)
To: Mike Kravetz
Cc: kbuild-all, linux-mm, linux-kernel, Kirill A . Shutemov,
Jérôme Glisse, Vlastimil Babka, Naoya Horiguchi,
Davidlohr Bueso, Michal Hocko, Andrew Morton, Mike Kravetz
[-- Attachment #1: Type: text/plain, Size: 11198 bytes --]
Hi Mike,
I love your patch! Yet something to improve:
[auto build test ERROR on linus/master]
[also build test ERROR on v4.18 next-20180810]
[if your patch is applied to the wrong git tree, please drop us a note to help improve the system]
url: https://github.com/0day-ci/linux/commits/Mike-Kravetz/mm-migration-fix-migration-of-huge-PMD-shared-pages/20180813-114549
config: i386-randconfig-x003-201832 (attached as .config)
compiler: gcc-7 (Debian 7.3.0-16) 7.3.0
reproduce:
# save the attached .config to linux build tree
make ARCH=i386
All errors (new ones prefixed by >>):
mm/rmap.c: In function 'try_to_unmap_one':
>> mm/rmap.c:1425:7: error: implicit declaration of function 'huge_pmd_unshare'; did you mean '__NR_unshare'? [-Werror=implicit-function-declaration]
huge_pmd_unshare(mm, &address, pvmw.pte)) {
^~~~~~~~~~~~~~~~
__NR_unshare
cc1: some warnings being treated as errors
vim +1425 mm/rmap.c
1382
1383 /*
1384 * If the page is mlock()d, we cannot swap it out.
1385 * If it's recently referenced (perhaps page_referenced
1386 * skipped over this mm) then we should reactivate it.
1387 */
1388 if (!(flags & TTU_IGNORE_MLOCK)) {
1389 if (vma->vm_flags & VM_LOCKED) {
1390 /* PTE-mapped THP are never mlocked */
1391 if (!PageTransCompound(page)) {
1392 /*
1393 * Holding pte lock, we do *not* need
1394 * mmap_sem here
1395 */
1396 mlock_vma_page(page);
1397 }
1398 ret = false;
1399 page_vma_mapped_walk_done(&pvmw);
1400 break;
1401 }
1402 if (flags & TTU_MUNLOCK)
1403 continue;
1404 }
1405
1406 /* Unexpected PMD-mapped THP? */
1407 VM_BUG_ON_PAGE(!pvmw.pte, page);
1408
1409 subpage = page - page_to_pfn(page) + pte_pfn(*pvmw.pte);
1410 address = pvmw.address;
1411
1412 /*
1413 * PMDs for hugetlbfs pages could be shared. In this case,
1414 * pages with shared PMDs will have a mapcount of 1 no matter
1415 * how many times it is actually mapped. Map counting for
1416 * PMD sharing is mostly done via the reference count on the
1417 * PMD page itself. If the page we are trying to unmap is a
1418 * hugetlbfs page, attempt to 'unshare' at the PMD level.
1419 * huge_pmd_unshare takes care of clearing the PUD and
1420 * reference counting on the PMD page which effectively unmaps
1421 * the page. Take care of flushing cache and TLB for page in
1422 * this specific mapping here.
1423 */
1424 if (PageHuge(page) &&
> 1425 huge_pmd_unshare(mm, &address, pvmw.pte)) {
1426 unsigned long end_add = address + vma_mmu_pagesize(vma);
1427
1428 flush_cache_range(vma, address, end_add);
1429 flush_tlb_range(vma, address, end_add);
1430 mmu_notifier_invalidate_range(mm, address, end_add);
1431 continue;
1432 }
1433
1434 if (IS_ENABLED(CONFIG_MIGRATION) &&
1435 (flags & TTU_MIGRATION) &&
1436 is_zone_device_page(page)) {
1437 swp_entry_t entry;
1438 pte_t swp_pte;
1439
1440 pteval = ptep_get_and_clear(mm, pvmw.address, pvmw.pte);
1441
1442 /*
1443 * Store the pfn of the page in a special migration
1444 * pte. do_swap_page() will wait until the migration
1445 * pte is removed and then restart fault handling.
1446 */
1447 entry = make_migration_entry(page, 0);
1448 swp_pte = swp_entry_to_pte(entry);
1449 if (pte_soft_dirty(pteval))
1450 swp_pte = pte_swp_mksoft_dirty(swp_pte);
1451 set_pte_at(mm, pvmw.address, pvmw.pte, swp_pte);
1452 /*
1453 * No need to invalidate here it will synchronize on
1454 * against the special swap migration pte.
1455 */
1456 goto discard;
1457 }
1458
1459 if (!(flags & TTU_IGNORE_ACCESS)) {
1460 if (ptep_clear_flush_young_notify(vma, address,
1461 pvmw.pte)) {
1462 ret = false;
1463 page_vma_mapped_walk_done(&pvmw);
1464 break;
1465 }
1466 }
1467
1468 /* Nuke the page table entry. */
1469 flush_cache_page(vma, address, pte_pfn(*pvmw.pte));
1470 if (should_defer_flush(mm, flags)) {
1471 /*
1472 * We clear the PTE but do not flush so potentially
1473 * a remote CPU could still be writing to the page.
1474 * If the entry was previously clean then the
1475 * architecture must guarantee that a clear->dirty
1476 * transition on a cached TLB entry is written through
1477 * and traps if the PTE is unmapped.
1478 */
1479 pteval = ptep_get_and_clear(mm, address, pvmw.pte);
1480
1481 set_tlb_ubc_flush_pending(mm, pte_dirty(pteval));
1482 } else {
1483 pteval = ptep_clear_flush(vma, address, pvmw.pte);
1484 }
1485
1486 /* Move the dirty bit to the page. Now the pte is gone. */
1487 if (pte_dirty(pteval))
1488 set_page_dirty(page);
1489
1490 /* Update high watermark before we lower rss */
1491 update_hiwater_rss(mm);
1492
1493 if (PageHWPoison(page) && !(flags & TTU_IGNORE_HWPOISON)) {
1494 pteval = swp_entry_to_pte(make_hwpoison_entry(subpage));
1495 if (PageHuge(page)) {
1496 int nr = 1 << compound_order(page);
1497 hugetlb_count_sub(nr, mm);
1498 set_huge_swap_pte_at(mm, address,
1499 pvmw.pte, pteval,
1500 vma_mmu_pagesize(vma));
1501 } else {
1502 dec_mm_counter(mm, mm_counter(page));
1503 set_pte_at(mm, address, pvmw.pte, pteval);
1504 }
1505
1506 } else if (pte_unused(pteval) && !userfaultfd_armed(vma)) {
1507 /*
1508 * The guest indicated that the page content is of no
1509 * interest anymore. Simply discard the pte, vmscan
1510 * will take care of the rest.
1511 * A future reference will then fault in a new zero
1512 * page. When userfaultfd is active, we must not drop
1513 * this page though, as its main user (postcopy
1514 * migration) will not expect userfaults on already
1515 * copied pages.
1516 */
1517 dec_mm_counter(mm, mm_counter(page));
1518 /* We have to invalidate as we cleared the pte */
1519 mmu_notifier_invalidate_range(mm, address,
1520 address + PAGE_SIZE);
1521 } else if (IS_ENABLED(CONFIG_MIGRATION) &&
1522 (flags & (TTU_MIGRATION|TTU_SPLIT_FREEZE))) {
1523 swp_entry_t entry;
1524 pte_t swp_pte;
1525
1526 if (arch_unmap_one(mm, vma, address, pteval) < 0) {
1527 set_pte_at(mm, address, pvmw.pte, pteval);
1528 ret = false;
1529 page_vma_mapped_walk_done(&pvmw);
1530 break;
1531 }
1532
1533 /*
1534 * Store the pfn of the page in a special migration
1535 * pte. do_swap_page() will wait until the migration
1536 * pte is removed and then restart fault handling.
1537 */
1538 entry = make_migration_entry(subpage,
1539 pte_write(pteval));
1540 swp_pte = swp_entry_to_pte(entry);
1541 if (pte_soft_dirty(pteval))
1542 swp_pte = pte_swp_mksoft_dirty(swp_pte);
1543 set_pte_at(mm, address, pvmw.pte, swp_pte);
1544 /*
1545 * No need to invalidate here it will synchronize on
1546 * against the special swap migration pte.
1547 */
1548 } else if (PageAnon(page)) {
1549 swp_entry_t entry = { .val = page_private(subpage) };
1550 pte_t swp_pte;
1551 /*
1552 * Store the swap location in the pte.
1553 * See handle_pte_fault() ...
1554 */
1555 if (unlikely(PageSwapBacked(page) != PageSwapCache(page))) {
1556 WARN_ON_ONCE(1);
1557 ret = false;
1558 /* We have to invalidate as we cleared the pte */
1559 mmu_notifier_invalidate_range(mm, address,
1560 address + PAGE_SIZE);
1561 page_vma_mapped_walk_done(&pvmw);
1562 break;
1563 }
1564
1565 /* MADV_FREE page check */
1566 if (!PageSwapBacked(page)) {
1567 if (!PageDirty(page)) {
1568 /* Invalidate as we cleared the pte */
1569 mmu_notifier_invalidate_range(mm,
1570 address, address + PAGE_SIZE);
1571 dec_mm_counter(mm, MM_ANONPAGES);
1572 goto discard;
1573 }
1574
1575 /*
1576 * If the page was redirtied, it cannot be
1577 * discarded. Remap the page to page table.
1578 */
1579 set_pte_at(mm, address, pvmw.pte, pteval);
1580 SetPageSwapBacked(page);
1581 ret = false;
1582 page_vma_mapped_walk_done(&pvmw);
1583 break;
1584 }
1585
1586 if (swap_duplicate(entry) < 0) {
1587 set_pte_at(mm, address, pvmw.pte, pteval);
1588 ret = false;
1589 page_vma_mapped_walk_done(&pvmw);
1590 break;
1591 }
1592 if (arch_unmap_one(mm, vma, address, pteval) < 0) {
1593 set_pte_at(mm, address, pvmw.pte, pteval);
1594 ret = false;
1595 page_vma_mapped_walk_done(&pvmw);
1596 break;
1597 }
1598 if (list_empty(&mm->mmlist)) {
1599 spin_lock(&mmlist_lock);
1600 if (list_empty(&mm->mmlist))
1601 list_add(&mm->mmlist, &init_mm.mmlist);
1602 spin_unlock(&mmlist_lock);
1603 }
1604 dec_mm_counter(mm, MM_ANONPAGES);
1605 inc_mm_counter(mm, MM_SWAPENTS);
1606 swp_pte = swp_entry_to_pte(entry);
1607 if (pte_soft_dirty(pteval))
1608 swp_pte = pte_swp_mksoft_dirty(swp_pte);
1609 set_pte_at(mm, address, pvmw.pte, swp_pte);
1610 /* Invalidate as we cleared the pte */
1611 mmu_notifier_invalidate_range(mm, address,
1612 address + PAGE_SIZE);
1613 } else {
1614 /*
1615 * We should not need to notify here as we reach this
1616 * case only from freeze_page() itself only call from
1617 * split_huge_page_to_list() so everything below must
1618 * be true:
1619 * - page is not anonymous
1620 * - page is locked
1621 *
1622 * So as it is a locked file back page thus it can not
1623 * be remove from the page cache and replace by a new
1624 * page before mmu_notifier_invalidate_range_end so no
1625 * concurrent thread might update its page table to
1626 * point at new page while a device still is using this
1627 * page.
1628 *
1629 * See Documentation/vm/mmu_notifier.rst
1630 */
1631 dec_mm_counter(mm, mm_counter_file(page));
1632 }
1633 discard:
1634 /*
1635 * No need to call mmu_notifier_invalidate_range() it has be
1636 * done above for all cases requiring it to happen under page
1637 * table lock before mmu_notifier_invalidate_range_end()
1638 *
1639 * See Documentation/vm/mmu_notifier.rst
1640 */
1641 page_remove_rmap(subpage, PageHuge(page));
1642 put_page(page);
1643 }
1644
1645 mmu_notifier_invalidate_range_end(vma->vm_mm, start, end);
1646
1647 return ret;
1648 }
1649
---
0-DAY kernel test infrastructure Open Source Technology Center
https://lists.01.org/pipermail/kbuild-all Intel Corporation
[-- Attachment #2: .config.gz --]
[-- Type: application/gzip, Size: 28495 bytes --]
* Re: [PATCH] mm: migration: fix migration of huge PMD shared pages
2018-08-13 3:41 [PATCH] mm: migration: fix migration of huge PMD shared pages Mike Kravetz
2018-08-13 4:10 ` kbuild test robot
@ 2018-08-13 4:17 ` kbuild test robot
2018-08-13 10:58 ` Kirill A. Shutemov
2 siblings, 0 replies; 10+ messages in thread
From: kbuild test robot @ 2018-08-13 4:17 UTC (permalink / raw)
To: Mike Kravetz
Cc: kbuild-all, linux-mm, linux-kernel, Kirill A . Shutemov,
Jérôme Glisse, Vlastimil Babka, Naoya Horiguchi,
Davidlohr Bueso, Michal Hocko, Andrew Morton, Mike Kravetz
[-- Attachment #1: Type: text/plain, Size: 11200 bytes --]
Hi Mike,
I love your patch! Yet something to improve:
[auto build test ERROR on linus/master]
[also build test ERROR on v4.18 next-20180810]
[if your patch is applied to the wrong git tree, please drop us a note to help improve the system]
url: https://github.com/0day-ci/linux/commits/Mike-Kravetz/mm-migration-fix-migration-of-huge-PMD-shared-pages/20180813-114549
config: i386-tinyconfig (attached as .config)
compiler: gcc-7 (Debian 7.3.0-16) 7.3.0
reproduce:
# save the attached .config to linux build tree
make ARCH=i386
All errors (new ones prefixed by >>):
mm/rmap.c: In function 'try_to_unmap_one':
>> mm/rmap.c:1425:7: error: implicit declaration of function 'huge_pmd_unshare'; did you mean 'do_huge_pmd_wp_page'? [-Werror=implicit-function-declaration]
huge_pmd_unshare(mm, &address, pvmw.pte)) {
^~~~~~~~~~~~~~~~
do_huge_pmd_wp_page
cc1: some warnings being treated as errors
---
[-- Attachment #2: .config.gz --]
[-- Type: application/gzip, Size: 6367 bytes --]
* Re: [PATCH] mm: migration: fix migration of huge PMD shared pages
2018-08-13 3:41 [PATCH] mm: migration: fix migration of huge PMD shared pages Mike Kravetz
2018-08-13 4:10 ` kbuild test robot
2018-08-13 4:17 ` [PATCH] " kbuild test robot
@ 2018-08-13 10:58 ` Kirill A. Shutemov
2018-08-13 23:21 ` Mike Kravetz
2 siblings, 1 reply; 10+ messages in thread
From: Kirill A. Shutemov @ 2018-08-13 10:58 UTC (permalink / raw)
To: Mike Kravetz
Cc: linux-mm, linux-kernel, Kirill A . Shutemov,
Jérôme Glisse, Vlastimil Babka, Naoya Horiguchi,
Davidlohr Bueso, Michal Hocko, Andrew Morton
On Sun, Aug 12, 2018 at 08:41:08PM -0700, Mike Kravetz wrote:
> The page migration code employs try_to_unmap() to try and unmap the
> source page. This is accomplished by using rmap_walk to find all
> vmas where the page is mapped. This search stops when page mapcount
> is zero. For shared PMD huge pages, the page map count is always 1
> no matter the number of mappings. Shared mappings are tracked via
> the reference count of the PMD page. Therefore, try_to_unmap stops
> prematurely and does not completely unmap all mappings of the source
> page.
>
> This problem can result in data corruption as writes to the original
> source page can happen after contents of the page are copied to the
> target page. Hence, data is lost.
>
> This problem was originally seen as DB corruption of shared global
> areas after a huge page was soft offlined. DB developers noticed
> they could reproduce the issue by (hotplug) offlining memory used
> to back huge pages. A simple testcase can reproduce the problem by
> creating a shared PMD mapping (note that this must be at least
> PUD_SIZE in size and PUD_SIZE aligned (1GB on x86)), and using
> migrate_pages() to migrate process pages between nodes.
>
> To fix, have the try_to_unmap_one routine check for huge PMD sharing
> by calling huge_pmd_unshare for hugetlbfs huge pages. If it is a
> shared mapping it will be 'unshared' which removes the page table
> entry and drops reference on PMD page. After this, flush caches and
> TLB.
>
> Signed-off-by: Mike Kravetz <mike.kravetz@oracle.com>
> ---
> I am not 100% sure on the required flushing, so suggestions would be
> appreciated. This also should go to stable. It has been around for
> a long time so still looking for an appropriate 'fixes:'.
I believe we need flushing. And huge_pmd_unshare() usage in
__unmap_hugepage_range() looks suspicious: I don't see how we flush TLB in
that case.
--
Kirill A. Shutemov
* Re: [PATCH] mm: migration: fix migration of huge PMD shared pages
2018-08-13 10:58 ` Kirill A. Shutemov
@ 2018-08-13 23:21 ` Mike Kravetz
2018-08-14 8:48 ` Kirill A. Shutemov
0 siblings, 1 reply; 10+ messages in thread
From: Mike Kravetz @ 2018-08-13 23:21 UTC (permalink / raw)
To: Kirill A. Shutemov
Cc: linux-mm, linux-kernel, Kirill A . Shutemov,
Jérôme Glisse, Vlastimil Babka, Naoya Horiguchi,
Davidlohr Bueso, Michal Hocko, Andrew Morton
On 08/13/2018 03:58 AM, Kirill A. Shutemov wrote:
> On Sun, Aug 12, 2018 at 08:41:08PM -0700, Mike Kravetz wrote:
>> The page migration code employs try_to_unmap() to try and unmap the
>> source page. This is accomplished by using rmap_walk to find all
>> vmas where the page is mapped. This search stops when page mapcount
>> is zero. For shared PMD huge pages, the page map count is always 1
>> no matter the number of mappings. Shared mappings are tracked via
>> the reference count of the PMD page. Therefore, try_to_unmap stops
>> prematurely and does not completely unmap all mappings of the source
>> page.
>>
>> This problem can result in data corruption as writes to the original
>> source page can happen after contents of the page are copied to the
>> target page. Hence, data is lost.
>>
>> This problem was originally seen as DB corruption of shared global
>> areas after a huge page was soft offlined. DB developers noticed
>> they could reproduce the issue by (hotplug) offlining memory used
>> to back huge pages. A simple testcase can reproduce the problem by
>> creating a shared PMD mapping (note that this must be at least
>> PUD_SIZE in size and PUD_SIZE aligned (1GB on x86)), and using
>> migrate_pages() to migrate process pages between nodes.
>>
>> To fix, have the try_to_unmap_one routine check for huge PMD sharing
>> by calling huge_pmd_unshare for hugetlbfs huge pages. If it is a
>> shared mapping it will be 'unshared' which removes the page table
>> entry and drops reference on PMD page. After this, flush caches and
>> TLB.
>>
>> Signed-off-by: Mike Kravetz <mike.kravetz@oracle.com>
>> ---
>> I am not 100% sure on the required flushing, so suggestions would be
>> appreciated. This also should go to stable. It has been around for
>> a long time so still looking for an appropriate 'fixes:'.
>
> I believe we need flushing. And huge_pmd_unshare() usage in
> __unmap_hugepage_range() looks suspicious: I don't see how we flush TLB in
> that case.
Thanks Kirill,
__unmap_hugepage_range() has two callers:
1) unmap_hugepage_range, which wraps the call with tlb_gather_mmu and
tlb_finish_mmu on the range. IIUC, this should cause an appropriate
TLB flush.
2) __unmap_hugepage_range_final via unmap_single_vma. unmap_single_vma
has three callers:
- unmap_vmas which assumes the caller will flush the whole range after
return.
- zap_page_range wraps the call with tlb_gather_mmu/tlb_finish_mmu
- zap_page_range_single wraps the call with tlb_gather_mmu/tlb_finish_mmu
So, it appears we are covered. But I could be missing something.
My primary reason for asking the question was with respect to the code
added to try_to_unmap_one. In my testing, the changes I added appeared
to be required. Just wanted to make sure.
I need to fix a build issue and will send another version.
--
Mike Kravetz
* [PATCH v2] mm: migration: fix migration of huge PMD shared pages
2018-08-13 4:10 ` kbuild test robot
@ 2018-08-14 0:30 ` Mike Kravetz
2018-08-14 6:28 ` Greg KH
0 siblings, 1 reply; 10+ messages in thread
From: Mike Kravetz @ 2018-08-14 0:30 UTC (permalink / raw)
To: linux-mm, linux-kernel
Cc: Kirill A . Shutemov, Jérôme Glisse, Vlastimil Babka,
Naoya Horiguchi, Davidlohr Bueso, Michal Hocko, Andrew Morton,
stable, Mike Kravetz
The page migration code employs try_to_unmap() to try and unmap the
source page. This is accomplished by using rmap_walk to find all
vmas where the page is mapped. This search stops when page mapcount
is zero. For shared PMD huge pages, the page map count is always 1
no matter the number of mappings. Shared mappings are tracked via
the reference count of the PMD page. Therefore, try_to_unmap stops
prematurely and does not completely unmap all mappings of the source
page.
This problem can result in data corruption as writes to the original
source page can happen after contents of the page are copied to the
target page. Hence, data is lost.
This problem was originally seen as DB corruption of shared global
areas after a huge page was soft offlined due to ECC memory errors.
DB developers noticed they could reproduce the issue by (hotplug)
offlining memory used to back huge pages. A simple testcase can
reproduce the problem by creating a shared PMD mapping (note that
this must be at least PUD_SIZE in size and PUD_SIZE aligned (1GB on
x86)), and using migrate_pages() to migrate process pages between
nodes while continually writing to the huge pages being migrated.
To fix, have the try_to_unmap_one routine check for huge PMD sharing
by calling huge_pmd_unshare for hugetlbfs huge pages. If it is a
shared mapping it will be 'unshared' which removes the page table
entry and drops the reference on the PMD page. After this, flush
caches and TLB.
Fixes: 39dde65c9940 ("shared page table for hugetlb page")
Signed-off-by: Mike Kravetz <mike.kravetz@oracle.com>
---
v2: Fixed build issue for !CONFIG_HUGETLB_PAGE and typos in comment
include/linux/hugetlb.h | 6 ++++++
mm/rmap.c | 21 +++++++++++++++++++++
2 files changed, 27 insertions(+)
diff --git a/include/linux/hugetlb.h b/include/linux/hugetlb.h
index 36fa6a2a82e3..7524663028ec 100644
--- a/include/linux/hugetlb.h
+++ b/include/linux/hugetlb.h
@@ -170,6 +170,12 @@ static inline unsigned long hugetlb_total_pages(void)
return 0;
}
+static inline int huge_pmd_unshare(struct mm_struct *mm, unsigned long *addr,
+ pte_t *ptep)
+{
+ return 0;
+}
+
#define follow_hugetlb_page(m,v,p,vs,a,b,i,w,n) ({ BUG(); 0; })
#define follow_huge_addr(mm, addr, write) ERR_PTR(-EINVAL)
#define copy_hugetlb_page_range(src, dst, vma) ({ BUG(); 0; })
diff --git a/mm/rmap.c b/mm/rmap.c
index 09a799c9aebd..cf2340adad10 100644
--- a/mm/rmap.c
+++ b/mm/rmap.c
@@ -1409,6 +1409,27 @@ static bool try_to_unmap_one(struct page *page, struct vm_area_struct *vma,
subpage = page - page_to_pfn(page) + pte_pfn(*pvmw.pte);
address = pvmw.address;
+ /*
+ * PMDs for hugetlbfs pages could be shared. In this case,
+ * pages with shared PMDs will have a mapcount of 1 no matter
+ * how many times they are actually mapped. Map counting for
+ * PMD sharing is mostly done via the reference count on the
+ * PMD page itself. If the page we are trying to unmap is a
+ * hugetlbfs page, attempt to 'unshare' at the PMD level.
+ * huge_pmd_unshare clears the PUD and adjusts reference
+ * counting on the PMD page which effectively unmaps the page.
+ * Take care of flushing cache and TLB for page in this
+ * specific mapping here.
+ */
+ if (PageHuge(page) &&
+ huge_pmd_unshare(mm, &address, pvmw.pte)) {
+ unsigned long end_add = address + vma_mmu_pagesize(vma);
+
+ flush_cache_range(vma, address, end_add);
+ flush_tlb_range(vma, address, end_add);
+ mmu_notifier_invalidate_range(mm, address, end_add);
+ continue;
+ }
if (IS_ENABLED(CONFIG_MIGRATION) &&
(flags & TTU_MIGRATION) &&
--
2.17.1
* Re: [PATCH v2] mm: migration: fix migration of huge PMD shared pages
2018-08-14 0:30 ` [PATCH v2] " Mike Kravetz
@ 2018-08-14 6:28 ` Greg KH
0 siblings, 0 replies; 10+ messages in thread
From: Greg KH @ 2018-08-14 6:28 UTC (permalink / raw)
To: Mike Kravetz
Cc: linux-mm, linux-kernel, Kirill A . Shutemov,
Jérôme Glisse, Vlastimil Babka, Naoya Horiguchi,
Davidlohr Bueso, Michal Hocko, Andrew Morton, stable
On Mon, Aug 13, 2018 at 05:30:58PM -0700, Mike Kravetz wrote:
> The page migration code employs try_to_unmap() to try and unmap the
> source page. This is accomplished by using rmap_walk to find all
> vmas where the page is mapped. This search stops when page mapcount
> is zero. For shared PMD huge pages, the page map count is always 1
> no matter the number of mappings. Shared mappings are tracked via
> the reference count of the PMD page. Therefore, try_to_unmap stops
> prematurely and does not completely unmap all mappings of the source
> page.
>
> This problem can result in data corruption as writes to the original
> source page can happen after contents of the page are copied to the
> target page. Hence, data is lost.
>
> This problem was originally seen as DB corruption of shared global
> areas after a huge page was soft offlined due to ECC memory errors.
> DB developers noticed they could reproduce the issue by (hotplug)
> offlining memory used to back huge pages. A simple testcase can
> reproduce the problem by creating a shared PMD mapping (note that
> this must be at least PUD_SIZE in size and PUD_SIZE aligned (1GB on
> x86)), and using migrate_pages() to migrate process pages between
> nodes while continually writing to the huge pages being migrated.
>
> To fix, have the try_to_unmap_one routine check for huge PMD sharing
> by calling huge_pmd_unshare for hugetlbfs huge pages. If it is a
> shared mapping it will be 'unshared' which removes the page table
> entry and drops the reference on the PMD page. After this, flush
> caches and TLB.
>
> Fixes: 39dde65c9940 ("shared page table for hugetlb page")
> Signed-off-by: Mike Kravetz <mike.kravetz@oracle.com>
> ---
> v2: Fixed build issue for !CONFIG_HUGETLB_PAGE and typos in comment
>
<formletter>
This is not the correct way to submit patches for inclusion in the
stable kernel tree. Please read:
https://www.kernel.org/doc/html/latest/process/stable-kernel-rules.html
for how to do this properly.
</formletter>
* Re: [PATCH] mm: migration: fix migration of huge PMD shared pages
2018-08-13 23:21 ` Mike Kravetz
@ 2018-08-14 8:48 ` Kirill A. Shutemov
2018-08-15 0:15 ` Mike Kravetz
0 siblings, 1 reply; 10+ messages in thread
From: Kirill A. Shutemov @ 2018-08-14 8:48 UTC (permalink / raw)
To: Mike Kravetz
Cc: Kirill A. Shutemov, linux-mm, linux-kernel,
Jérôme Glisse, Vlastimil Babka, Naoya Horiguchi,
Davidlohr Bueso, Michal Hocko, Andrew Morton
On Mon, Aug 13, 2018 at 11:21:41PM +0000, Mike Kravetz wrote:
> On 08/13/2018 03:58 AM, Kirill A. Shutemov wrote:
> > On Sun, Aug 12, 2018 at 08:41:08PM -0700, Mike Kravetz wrote:
> >> The page migration code employs try_to_unmap() to try and unmap the
> >> source page. This is accomplished by using rmap_walk to find all
> >> vmas where the page is mapped. This search stops when page mapcount
> >> is zero. For shared PMD huge pages, the page map count is always 1
> >> no matter the number of mappings. Shared mappings are tracked via
> >> the reference count of the PMD page. Therefore, try_to_unmap stops
> >> prematurely and does not completely unmap all mappings of the source
> >> page.
> >>
> >> This problem can result in data corruption as writes to the original
> >> source page can happen after contents of the page are copied to the
> >> target page. Hence, data is lost.
> >>
> >> This problem was originally seen as DB corruption of shared global
> >> areas after a huge page was soft offlined. DB developers noticed
> >> they could reproduce the issue by (hotplug) offlining memory used
> >> to back huge pages. A simple testcase can reproduce the problem by
> >> creating a shared PMD mapping (note that this must be at least
> >> PUD_SIZE in size and PUD_SIZE aligned (1GB on x86)), and using
> >> migrate_pages() to migrate process pages between nodes.
> >>
> >> To fix, have the try_to_unmap_one routine check for huge PMD sharing
> >> by calling huge_pmd_unshare for hugetlbfs huge pages. If it is a
> >> shared mapping it will be 'unshared' which removes the page table
> >> entry and drops reference on PMD page. After this, flush caches and
> >> TLB.
> >>
> >> Signed-off-by: Mike Kravetz <mike.kravetz@oracle.com>
> >> ---
> >> I am not 100% sure on the required flushing, so suggestions would be
> >> appreciated. This also should go to stable. It has been around for
> >> a long time so still looking for an appropriate 'fixes:'.
> >
> > I believe we need flushing. And huge_pmd_unshare() usage in
> > __unmap_hugepage_range() looks suspicious: I don't see how we flush TLB in
> > that case.
>
> Thanks Kirill,
>
> __unmap_hugepage_range() has two callers:
> 1) unmap_hugepage_range, which wraps the call with tlb_gather_mmu and
> tlb_finish_mmu on the range. IIUC, this should cause an appropriate
> TLB flush.
> 2) __unmap_hugepage_range_final via unmap_single_vma. unmap_single_vma
> has three callers:
> - unmap_vmas which assumes the caller will flush the whole range after
> return.
> - zap_page_range wraps the call with tlb_gather_mmu/tlb_finish_mmu
> - zap_page_range_single wraps the call with tlb_gather_mmu/tlb_finish_mmu
>
> So, it appears we are covered. But, I could be missing something.
My problem here is that the mapping removed by huge_pmd_unshare() is
not accounted in the mmu_gather and can be missed by tlb_finish_mmu().
--
Kirill A. Shutemov
* Re: [PATCH] mm: migration: fix migration of huge PMD shared pages
2018-08-14 8:48 ` Kirill A. Shutemov
@ 2018-08-15 0:15 ` Mike Kravetz
2018-08-15 8:47 ` Kirill A. Shutemov
0 siblings, 1 reply; 10+ messages in thread
From: Mike Kravetz @ 2018-08-15 0:15 UTC (permalink / raw)
To: Kirill A. Shutemov
Cc: Kirill A. Shutemov, linux-mm, linux-kernel,
Jérôme Glisse, Vlastimil Babka, Naoya Horiguchi,
Davidlohr Bueso, Michal Hocko, Andrew Morton
On 08/14/2018 01:48 AM, Kirill A. Shutemov wrote:
> On Mon, Aug 13, 2018 at 11:21:41PM +0000, Mike Kravetz wrote:
>> On 08/13/2018 03:58 AM, Kirill A. Shutemov wrote:
>>> On Sun, Aug 12, 2018 at 08:41:08PM -0700, Mike Kravetz wrote:
> >>>> I am not 100% sure on the required flushing, so suggestions would be
>>>> appreciated. This also should go to stable. It has been around for
>>>> a long time so still looking for an appropriate 'fixes:'.
>>>
>>> I believe we need flushing. And huge_pmd_unshare() usage in
>>> __unmap_hugepage_range() looks suspicious: I don't see how we flush TLB in
>>> that case.
>>
>> Thanks Kirill,
>>
>> __unmap_hugepage_range() has two callers:
>> 1) unmap_hugepage_range, which wraps the call with tlb_gather_mmu and
>> tlb_finish_mmu on the range. IIUC, this should cause an appropriate
>> TLB flush.
>> 2) __unmap_hugepage_range_final via unmap_single_vma. unmap_single_vma
>> has three callers:
>> - unmap_vmas which assumes the caller will flush the whole range after
>> return.
>> - zap_page_range wraps the call with tlb_gather_mmu/tlb_finish_mmu
>> - zap_page_range_single wraps the call with tlb_gather_mmu/tlb_finish_mmu
>>
>> So, it appears we are covered. But, I could be missing something.
>
> My problem here is that the mapping removed by huge_pmd_unshare() is
> not accounted in the mmu_gather and can be missed by tlb_finish_mmu().
Ah, I think I now see the issue you are concerned with.
When huge_pmd_unshare succeeds we effectively unmap a PUD_SIZE area.
The routine __unmap_hugepage_range may only have been passed a range
that is a subset of PUD_SIZE. In the case I was trying to address,
try_to_unmap_one(), the 'range' will certainly be less than PUD_SIZE.
Upon further thought, I think that even in the case of try_to_unmap_one
we should flush PUD_SIZE range.
My first thought would be to embed this flushing within huge_pmd_unshare
itself. Perhaps, whenever huge_pmd_unshare succeeds we should do an
explicit:
flush_cache_range(PUD_SIZE)
flush_tlb_range(PUD_SIZE)
mmu_notifier_invalidate_range(PUD_SIZE)
That would take some of the burden off the callers of huge_pmd_unshare.
However, I am not sure if the flushing calls above play nice in all the
calling environments. I'll look into it some more, but would appreciate
additional comments.
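[Editor's note: concretely, the embedding proposed above might look roughly like the following kernel-style sketch. This is hypothetical and untested, and it already shows one obstacle: huge_pmd_unshare() currently takes only (mm, addr, ptep), so the vma that flush_cache_range()/flush_tlb_range() require would have to be passed down.]

```c
/*
 * Hypothetical sketch only: flushing embedded in huge_pmd_unshare().
 * Assumes the caller's vma is passed in, which the current signature
 * does not provide; 'pud' comes from the elided page table walk.
 */
int huge_pmd_unshare(struct mm_struct *mm, struct vm_area_struct *vma,
		     unsigned long *addr, pte_t *ptep)
{
	unsigned long start, end;

	/* ... existing checks: return 0 if the PMD page is not shared ... */

	pud_clear(pud);
	put_page(virt_to_page(ptep));
	mm_dec_nr_pmds(mm);

	/*
	 * The whole PUD_SIZE region was just unmapped, not only the
	 * sub-range the caller was walking, so flush all of it.
	 */
	start = *addr & ~(PUD_SIZE - 1);
	end = start + PUD_SIZE;
	flush_cache_range(vma, start, end);
	flush_tlb_range(vma, start, end);	/* may IPI: problematic if
						 * called under a spinlock */
	mmu_notifier_invalidate_range(mm, start, end);
	return 1;
}
```

Whether these flushes are safe in the context of every huge_pmd_unshare() caller is exactly the open question in this thread.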
--
Mike Kravetz
* Re: [PATCH] mm: migration: fix migration of huge PMD shared pages
2018-08-15 0:15 ` Mike Kravetz
@ 2018-08-15 8:47 ` Kirill A. Shutemov
0 siblings, 0 replies; 10+ messages in thread
From: Kirill A. Shutemov @ 2018-08-15 8:47 UTC (permalink / raw)
To: Mike Kravetz
Cc: Kirill A. Shutemov, linux-mm, linux-kernel,
Jérôme Glisse, Vlastimil Babka, Naoya Horiguchi,
Davidlohr Bueso, Michal Hocko, Andrew Morton
On Tue, Aug 14, 2018 at 05:15:57PM -0700, Mike Kravetz wrote:
> On 08/14/2018 01:48 AM, Kirill A. Shutemov wrote:
> > On Mon, Aug 13, 2018 at 11:21:41PM +0000, Mike Kravetz wrote:
> >> On 08/13/2018 03:58 AM, Kirill A. Shutemov wrote:
> >>> On Sun, Aug 12, 2018 at 08:41:08PM -0700, Mike Kravetz wrote:
> >>>> I am not 100% sure on the required flushing, so suggestions would be
> >>>> appreciated. This also should go to stable. It has been around for
> >>>> a long time so still looking for an appropriate 'fixes:'.
> >>>
> >>> I believe we need flushing. And huge_pmd_unshare() usage in
> >>> __unmap_hugepage_range() looks suspicious: I don't see how we flush TLB in
> >>> that case.
> >>
> >> Thanks Kirill,
> >>
> >> __unmap_hugepage_range() has two callers:
> >> 1) unmap_hugepage_range, which wraps the call with tlb_gather_mmu and
> >> tlb_finish_mmu on the range. IIUC, this should cause an appropriate
> >> TLB flush.
> >> 2) __unmap_hugepage_range_final via unmap_single_vma. unmap_single_vma
> >> has three callers:
> >> - unmap_vmas which assumes the caller will flush the whole range after
> >> return.
> >> - zap_page_range wraps the call with tlb_gather_mmu/tlb_finish_mmu
> >> - zap_page_range_single wraps the call with tlb_gather_mmu/tlb_finish_mmu
> >>
> >> So, it appears we are covered. But, I could be missing something.
> >
> > My problem here is that the mapping removed by huge_pmd_unshare() is
> > not accounted in the mmu_gather and can be missed by tlb_finish_mmu().
>
> Ah, I think I now see the issue you are concerned with.
>
> When huge_pmd_unshare succeeds we effectively unmap a PUD_SIZE area.
> The routine __unmap_hugepage_range may only have been passed a range
> that is a subset of PUD_SIZE. In the case I was trying to address,
> try_to_unmap_one(), the 'range' will certainly be less than PUD_SIZE.
> Upon further thought, I think that even in the case of try_to_unmap_one
> we should flush PUD_SIZE range.
>
> My first thought would be to embed this flushing within huge_pmd_unshare
> itself. Perhaps, whenever huge_pmd_unshare succeeds we should do an
> explicit:
> flush_cache_range(PUD_SIZE)
> flush_tlb_range(PUD_SIZE)
> mmu_notifier_invalidate_range(PUD_SIZE)
> That would take some of the burden off the callers of huge_pmd_unshare.
> However, I am not sure if the flushing calls above play nice in all the
> calling environments. I'll look into it some more, but would appreciate
> additional comments.
I don't think it would work: flush_tlb_range() does IPI and calling it
under spinlock will not go well. I think we need to find a way to account
it properly in the mmu_gather. It's not obvious to me how.
--
Kirill A. Shutemov
2018-08-13 3:41 [PATCH] mm: migration: fix migration of huge PMD shared pages Mike Kravetz
2018-08-13 4:10 ` kbuild test robot
2018-08-14 0:30 ` [PATCH v2] " Mike Kravetz
2018-08-14 6:28 ` Greg KH
2018-08-13 4:17 ` [PATCH] " kbuild test robot
2018-08-13 10:58 ` Kirill A. Shutemov
2018-08-13 23:21 ` Mike Kravetz
2018-08-14 8:48 ` Kirill A. Shutemov
2018-08-15 0:15 ` Mike Kravetz
2018-08-15 8:47 ` Kirill A. Shutemov