* [RFC PATCH 0/2] mm/gup: fix gup_fast with dynamic page table folding @ 2020-08-28 14:03 Gerald Schaefer 2020-08-28 14:03 ` [RFC PATCH 1/2] " Gerald Schaefer ` (3 more replies) 0 siblings, 4 replies; 14+ messages in thread From: Gerald Schaefer @ 2020-08-28 14:03 UTC (permalink / raw) To: Linus Torvalds, Andrew Morton Cc: linux-mm, LKML, Vasily Gorbik, Alexander Gordeev, linux-s390, Heiko Carstens, Claudio Imbrenda, Christian Borntraeger Hi, Commit 1a42010cdc26 ("s390/mm: convert to the generic get_user_pages_fast code") introduced a subtle but severe bug on s390 with gup_fast, due to dynamic page table folding. The question "What would it require for the generic code to work for s390" has already been discussed here https://lkml.kernel.org/r/20190418100218.0a4afd51@mschwideX1 and ended with a promising approach here https://lkml.kernel.org/r/20190419153307.4f2911b5@mschwideX1 which in the end unfortunately didn't quite work completely. We tried to mimic static level folding by changing pgd_offset to always dynamically calculate top level page table offset, and do nothing in folded pXd_offset. What has been overlooked is that PxD_SIZE/MASK and thus pXd_addr_end do not reflect this dynamic behaviour, and still act like static 5-level page tables. Here is an example of what happens with gup_fast on s390, for a task with 3-levels paging, crossing a 2 GB pud boundary: // addr = 0x1007ffff000, end = 0x10080001000 static int gup_pud_range(p4d_t p4d, unsigned long addr, unsigned long end, unsigned int flags, struct page **pages, int *nr) { unsigned long next; pud_t *pudp; // pud_offset returns &p4d itself (a pointer to a value on stack) pudp = pud_offset(&p4d, addr); do { // on second iteratation reading "random" stack value pud_t pud = READ_ONCE(*pudp); // next = 0x10080000000 next = pud_addr_end(addr, end); ... } while (pudp++, addr = next, addr != end); // pudp++ iterating over stack return 1; } pud_addr_end = 0x10080000000 is correct, but the previous pgd/p4d_addr_end should also have returned that limit, instead of the 5-level static pgd/p4d limits with PUD_SIZE/MASK != PGDIR_SIZE/MASK. Then the "end" parameter for gup_pud_range would also have been 0x10080000000, and we would not iterate further in gup_pud_range, but rather go back and (correctly) do it in gup_pgd_range. So, for the second iteration in gup_pud_range, we will increase pudp, which pointed to a stack value and not the real pud table. This new pudp will then point to whatever lies behind the p4d stack value. In general, this happens to be the previously read pgd, but it probably could also be something different, depending on compiler decisions. Most unfortunately, if it happens to be the pgd value, which is the same as the p4d / pud due to folding, it is a valid and present entry. So after the increment, we would still point to the same pud entry. The addr however has been increased in the second iteration, so that we now have different pmd/pte_index values, which will result in very wrong behaviour for the remaining gup_pmd/pte_range calls. We will effectively operate on an address minus 2 GB, due to missing pudp increase. In the "good case", if nothing is mapped there, we will fall back to the slow gup path. But if something is mapped there, and valid for gup_fast, we will end up (silently) getting references on the wrong pages and also add the wrong pages to the **pages result array. This can cause data corruption. We came up with two possible fix-ups, both will introduce some gup-specific helper functions, which will have no effect on other archs than s390. Patch 1 is the solution that has already been discussed in https://lkml.kernel.org/r/20190418100218.0a4afd51@mschwideX1 It will additionally pass along pXd pointers in gup_pXd_range, and also introduce pXd_offset_orig for s390, which takes an additional pXd entry value parameter. This allows correctly returning / incrementing pointers while still preseving the READ_ONCE logic for gup_fast. Patch 2 would instead introduce new gup_pXd_addr_end helpers, which take an additional pXd entry value parameter, that can be used on s390 to determine the correct page table level and return corresponding end / boundary. With that, the pointer iteration will always happen in gup_pgd_range. Comments / preferences are welcome. As a last resort, we might also revert back to the s390-specific gup_fast code, if none of the options are agreeable. Regards, Gerald ^ permalink raw reply [flat|nested] 14+ messages in thread
* [RFC PATCH 1/2] mm/gup: fix gup_fast with dynamic page table folding 2020-08-28 14:03 [RFC PATCH 0/2] mm/gup: fix gup_fast with dynamic page table folding Gerald Schaefer @ 2020-08-28 14:03 ` Gerald Schaefer 2020-08-28 14:03 ` [RFC PATCH 2/2] " Gerald Schaefer ` (2 subsequent siblings) 3 siblings, 0 replies; 14+ messages in thread From: Gerald Schaefer @ 2020-08-28 14:03 UTC (permalink / raw) To: Linus Torvalds, Andrew Morton Cc: linux-mm, LKML, Vasily Gorbik, Alexander Gordeev, linux-s390, Heiko Carstens, Claudio Imbrenda, Christian Borntraeger From: Vasily Gorbik <gor@linux.ibm.com> Commit 1a42010cdc26 ("s390/mm: convert to the generic get_user_pages_fast code") introduced a subtle but severe bug on s390 with gup_fast, due to dynamic page table folding. The question "What would it require for the generic code to work for s390" has already been discussed here https://lkml.kernel.org/r/20190418100218.0a4afd51@mschwideX1 and ended with a promising approach here https://lkml.kernel.org/r/20190419153307.4f2911b5@mschwideX1 which in the end unfortunately didn't quite work completely. We tried to mimic static level folding by changing pgd_offset to always calculate top level page table offset, and do nothing in folded pXd_offset. What has been overlooked is that PxD_SIZE/MASK and thus pXd_addr_end do not reflect this dynamic behaviour, and still act like static 5-level page tables. Here is an example of what happens with gup_fast on s390, for a task with 3-levels paging, crossing a 2 GB pud boundary: // addr = 0x1007ffff000, end = 0x10080001000 static int gup_pud_range(p4d_t p4d, unsigned long addr, unsigned long end, unsigned int flags, struct page **pages, int *nr) { unsigned long next; pud_t *pudp; // pud_offset returns &p4d itself (a pointer to a value on stack) pudp = pud_offset(&p4d, addr); do { // on second iteratation reading "random" stack value pud_t pud = READ_ONCE(*pudp); // next = 0x10080000000, due to PUD_SIZE/MASK != PGDIR_SIZE/MASK on s390 next = pud_addr_end(addr, end); ... } while (pudp++, addr = next, addr != end); // pudp++ iterating over stack return 1; } pud_addr_end = 0x10080000000 is correct, but the previous pgd/p4d_addr_end should also have returned that limit, instead of the 5-level static pgd/p4d limits with PUD_SIZE/MASK != PGDIR_SIZE/MASK. Then the "end" parameter for gup_pud_range would also have been 0x10080000000, and we would not iterate further in gup_pud_range, but rather go back and (correctly) do it in gup_pgd_range. So, for the second iteration in gup_pud_range, we will increase pudp, which pointed to a stack value and not the real pud table. This new pudp will then point to whatever lies behind the p4d stack value. In general, this happens to be the previously read pgd, but it probably could also be something different, depending on compiler decisions. Most unfortunately, if it happens to be the pgd value, which is the same as the p4d / pud due to folding, it is a valid and present entry. So after the increment, we would still point to the same pud entry. The addr however has been increased in the second iteration, so that we now have different pmd/pte_index values, which will result in very wrong behaviour for the remaining gup_pmd/pte_range calls. We will effectively operate on an address minus 2 GB, due to missing pudp increase. In the "good case", if nothing is mapped there, we will fall back to the slow gup path. But if something is mapped there, and valid for gup_fast, we will end up (silently) getting references on the wrong pages and also add the wrong pages to the **pages result array. This can cause data corruption. Fix this with an approach that has already been discussed in https://lkml.kernel.org/r/20190418100218.0a4afd51@mschwideX1 It will additionally pass along pXd pointers in gup_pXd_range, and also introduce pXd_offset_orig for s390, which takes an additional pXd entry value parameter. This allows returning correct pointers while still preseving the READ_ONCE logic for gup_fast. No change for other architectures introduced. Cc: <stable@vger.kernel.org> # 5.2+ Fixes: 1a42010cdc26 ("s390/mm: convert to the generic get_user_pages_fast code") Reviewed-by: Gerald Schaefer <gerald.schaefer@linux.ibm.com> Reviewed-by: Alexander Gordeev <agordeev@linux.ibm.com> Signed-off-by: Vasily Gorbik <gor@linux.ibm.com> --- arch/s390/include/asm/pgtable.h | 42 +++++++++++++++++++++++---------- include/linux/pgtable.h | 10 ++++++++ mm/gup.c | 18 +++++++------- 3 files changed, 49 insertions(+), 21 deletions(-) diff --git a/arch/s390/include/asm/pgtable.h b/arch/s390/include/asm/pgtable.h index 7eb01a5459cd..69a92e39d7b8 100644 --- a/arch/s390/include/asm/pgtable.h +++ b/arch/s390/include/asm/pgtable.h @@ -1260,26 +1260,44 @@ static inline pgd_t *pgd_offset_raw(pgd_t *pgd, unsigned long address) #define pgd_offset(mm, address) pgd_offset_raw(READ_ONCE((mm)->pgd), address) -static inline p4d_t *p4d_offset(pgd_t *pgd, unsigned long address) +static inline p4d_t *p4d_offset_orig(pgd_t *pgdp, pgd_t pgd, unsigned long address) { - if ((pgd_val(*pgd) & _REGION_ENTRY_TYPE_MASK) >= _REGION_ENTRY_TYPE_R1) - return (p4d_t *) pgd_deref(*pgd) + p4d_index(address); - return (p4d_t *) pgd; + if ((pgd_val(pgd) & _REGION_ENTRY_TYPE_MASK) >= _REGION_ENTRY_TYPE_R1) + return (p4d_t *) pgd_deref(pgd) + p4d_index(address); + return (p4d_t *) pgdp; } +#define p4d_offset_orig p4d_offset_orig -static inline pud_t *pud_offset(p4d_t *p4d, unsigned long address) +static inline p4d_t *p4d_offset(pgd_t *pgdp, unsigned long address) { - if ((p4d_val(*p4d) & _REGION_ENTRY_TYPE_MASK) >= _REGION_ENTRY_TYPE_R2) - return (pud_t *) p4d_deref(*p4d) + pud_index(address); - return (pud_t *) p4d; + return p4d_offset_orig(pgdp, *pgdp, address); +} + +static inline pud_t *pud_offset_orig(p4d_t *p4dp, p4d_t p4d, unsigned long address) +{ + if ((p4d_val(p4d) & _REGION_ENTRY_TYPE_MASK) >= _REGION_ENTRY_TYPE_R2) + return (pud_t *) p4d_deref(p4d) + pud_index(address); + return (pud_t *) p4dp; +} +#define pud_offset_orig pud_offset_orig + +static inline pud_t *pud_offset(p4d_t *p4dp, unsigned long address) +{ + return pud_offset_orig(p4dp, *p4dp, address); } #define pud_offset pud_offset -static inline pmd_t *pmd_offset(pud_t *pud, unsigned long address) +static inline pmd_t *pmd_offset_orig(pud_t *pudp, pud_t pud, unsigned long address) +{ + if ((pud_val(pud) & _REGION_ENTRY_TYPE_MASK) >= _REGION_ENTRY_TYPE_R3) + return (pmd_t *) pud_deref(pud) + pmd_index(address); + return (pmd_t *) pudp; +} +#define pmd_offset_orig pmd_offset_orig + +static inline pmd_t *pmd_offset(pud_t *pudp, unsigned long address) { - if ((pud_val(*pud) & _REGION_ENTRY_TYPE_MASK) >= _REGION_ENTRY_TYPE_R3) - return (pmd_t *) pud_deref(*pud) + pmd_index(address); - return (pmd_t *) pud; + return pmd_offset_orig(pudp, *pudp, address); } #define pmd_offset pmd_offset diff --git a/include/linux/pgtable.h b/include/linux/pgtable.h index a124c21e3204..02f93358126e 100644 --- a/include/linux/pgtable.h +++ b/include/linux/pgtable.h @@ -1425,6 +1425,16 @@ typedef unsigned int pgtbl_mod_mask; #define mm_pmd_folded(mm) __is_defined(__PAGETABLE_PMD_FOLDED) #endif +#ifndef p4d_offset_orig +#define p4d_offset_orig(pgdp, pgd, address) p4d_offset(&pgd, address) +#endif +#ifndef pud_offset_orig +#define pud_offset_orig(p4dp, p4d, address) pud_offset(&p4d, address) +#endif +#ifndef pmd_offset_orig +#define pmd_offset_orig(pudp, pud, address) pmd_offset(&pud, address) +#endif + /* * p?d_leaf() - true if this entry is a final mapping to a physical address. * This differs from p?d_huge() by the fact that they are always available (if diff --git a/mm/gup.c b/mm/gup.c index ae096ea7583f..fbdd9f0bf219 100644 --- a/mm/gup.c +++ b/mm/gup.c @@ -2500,13 +2500,13 @@ static int gup_huge_pgd(pgd_t orig, pgd_t *pgdp, unsigned long addr, return 1; } -static int gup_pmd_range(pud_t pud, unsigned long addr, unsigned long end, +static int gup_pmd_range(pud_t *pudp, pud_t pud, unsigned long addr, unsigned long end, unsigned int flags, struct page **pages, int *nr) { unsigned long next; pmd_t *pmdp; - pmdp = pmd_offset(&pud, addr); + pmdp = pmd_offset_orig(pudp, pud, addr); do { pmd_t pmd = READ_ONCE(*pmdp); @@ -2543,13 +2543,13 @@ static int gup_pmd_range(pud_t pud, unsigned long addr, unsigned long end, return 1; } -static int gup_pud_range(p4d_t p4d, unsigned long addr, unsigned long end, +static int gup_pud_range(p4d_t *p4dp, p4d_t p4d, unsigned long addr, unsigned long end, unsigned int flags, struct page **pages, int *nr) { unsigned long next; pud_t *pudp; - pudp = pud_offset(&p4d, addr); + pudp = pud_offset_orig(p4dp, p4d, addr); do { pud_t pud = READ_ONCE(*pudp); @@ -2564,20 +2564,20 @@ static int gup_pud_range(p4d_t p4d, unsigned long addr, unsigned long end, if (!gup_huge_pd(__hugepd(pud_val(pud)), addr, PUD_SHIFT, next, flags, pages, nr)) return 0; - } else if (!gup_pmd_range(pud, addr, next, flags, pages, nr)) + } else if (!gup_pmd_range(pudp, pud, addr, next, flags, pages, nr)) return 0; } while (pudp++, addr = next, addr != end); return 1; } -static int gup_p4d_range(pgd_t pgd, unsigned long addr, unsigned long end, +static int gup_p4d_range(pgd_t *pgdp, pgd_t pgd, unsigned long addr, unsigned long end, unsigned int flags, struct page **pages, int *nr) { unsigned long next; p4d_t *p4dp; - p4dp = p4d_offset(&pgd, addr); + p4dp = p4d_offset_orig(pgdp, pgd, addr); do { p4d_t p4d = READ_ONCE(*p4dp); @@ -2589,7 +2589,7 @@ static int gup_p4d_range(pgd_t pgd, unsigned long addr, unsigned long end, if (!gup_huge_pd(__hugepd(p4d_val(p4d)), addr, P4D_SHIFT, next, flags, pages, nr)) return 0; - } else if (!gup_pud_range(p4d, addr, next, flags, pages, nr)) + } else if (!gup_pud_range(p4dp, p4d, addr, next, flags, pages, nr)) return 0; } while (p4dp++, addr = next, addr != end); @@ -2617,7 +2617,7 @@ static void gup_pgd_range(unsigned long addr, unsigned long end, if (!gup_huge_pd(__hugepd(pgd_val(pgd)), addr, PGDIR_SHIFT, next, flags, pages, nr)) return; - } else if (!gup_p4d_range(pgd, addr, next, flags, pages, nr)) + } else if (!gup_p4d_range(pgdp, pgd, addr, next, flags, pages, nr)) return; } while (pgdp++, addr = next, addr != end); } -- 2.17.1 ^ permalink raw reply related [flat|nested] 14+ messages in thread
* [RFC PATCH 2/2] mm/gup: fix gup_fast with dynamic page table folding 2020-08-28 14:03 [RFC PATCH 0/2] mm/gup: fix gup_fast with dynamic page table folding Gerald Schaefer 2020-08-28 14:03 ` [RFC PATCH 1/2] " Gerald Schaefer @ 2020-08-28 14:03 ` Gerald Schaefer 2020-08-28 14:21 ` [RFC PATCH 0/2] " Jason Gunthorpe 2020-08-31 11:53 ` Christian Borntraeger 3 siblings, 0 replies; 14+ messages in thread From: Gerald Schaefer @ 2020-08-28 14:03 UTC (permalink / raw) To: Linus Torvalds, Andrew Morton Cc: linux-mm, LKML, Vasily Gorbik, Alexander Gordeev, linux-s390, Heiko Carstens, Claudio Imbrenda, Christian Borntraeger From: Alexander Gordeev <agordeev@linux.ibm.com> Commit 1a42010cdc26 ("s390/mm: convert to the generic get_user_pages_fast code") introduced a subtle but severe bug on s390 with gup_fast, due to dynamic page table folding. The question "What would it require for the generic code to work for s390" has already been discussed here https://lkml.kernel.org/r/20190418100218.0a4afd51@mschwideX1 and ended with a promising approach here https://lkml.kernel.org/r/20190419153307.4f2911b5@mschwideX1 which in the end unfortunately didn't quite work completely. We tried to mimic static level folding by changing pgd_offset to always calculate top level page table offset, and do nothing in folded pXd_offset. What has been overlooked is that PxD_SIZE/MASK and thus pXd_addr_end do not reflect this dynamic behaviour, and still act like static 5-level page tables. Here is an example of what happens with gup_fast on s390, for a task with 3-levels paging, crossing a 2 GB pud boundary: // addr = 0x1007ffff000, end = 0x10080001000 static int gup_pud_range(p4d_t p4d, unsigned long addr, unsigned long end, unsigned int flags, struct page **pages, int *nr) { unsigned long next; pud_t *pudp; // pud_offset returns &p4d itself (a pointer to a value on stack) pudp = pud_offset(&p4d, addr); do { // on second iteratation reading "random" stack value pud_t pud = READ_ONCE(*pudp); // next = 0x10080000000, due to PUD_SIZE/MASK != PGDIR_SIZE/MASK on s390 next = pud_addr_end(addr, end); ... } while (pudp++, addr = next, addr != end); // pudp++ iterating over stack return 1; } pud_addr_end = 0x10080000000 is correct, but the previous pgd/p4d_addr_end should also have returned that limit, instead of the 5-level static pgd/p4d limits with PUD_SIZE/MASK != PGDIR_SIZE/MASK. Then the "end" parameter for gup_pud_range would also have been 0x10080000000, and we would not iterate further in gup_pud_range, but rather go back and (correctly) do it in gup_pgd_range. So, for the second iteration in gup_pud_range, we will increase pudp, which pointed to a stack value and not the real pud table. This new pudp will then point to whatever lies behind the p4d stack value. In general, this happens to be the previously read pgd, but it probably could also be something different, depending on compiler decisions. Most unfortunately, if it happens to be the pgd value, which is the same as the p4d / pud due to folding, it is a valid and present entry. So after the increment, we would still point to the same pud entry. The addr however has been increased in the second iteration, so that we now have different pmd/pte_index values, which will result in very wrong behaviour for the remaining gup_pmd/pte_range calls. We will effectively operate on an address minus 2 GB, due to missing pudp increase. In the "good case", if nothing is mapped there, we will fall back to the slow gup path. But if something is mapped there, and valid for gup_fast, we will end up (silently) getting references on the wrong pages and also add the wrong pages to the **pages result array. This can cause data corruption. Fix this by introducing new gup_pXd_addr_end helpers, which take an additional pXd entry value parameter, that can be used on s390 to determine the correct page table level and return corresponding end / boundary. With that, the pointer iteration will always happen in gup_pgd_range for s390. No change for other architectures introduced. Cc: <stable@vger.kernel.org> # 5.2+ Fixes: 1a42010cdc26 ("s390/mm: convert to the generic get_user_pages_fast code") Reviewed-by: Gerald Schaefer <gerald.schaefer@linux.ibm.com> Signed-off-by: Alexander Gordeev <agordeev@linux.ibm.com> --- arch/s390/include/asm/pgtable.h | 49 +++++++++++++++++++++++++++++++++ include/linux/pgtable.h | 16 +++++++++++ mm/gup.c | 8 +++--- 3 files changed, 69 insertions(+), 4 deletions(-) diff --git a/arch/s390/include/asm/pgtable.h b/arch/s390/include/asm/pgtable.h index 7eb01a5459cd..1b8f461f5783 100644 --- a/arch/s390/include/asm/pgtable.h +++ b/arch/s390/include/asm/pgtable.h @@ -512,6 +512,55 @@ static inline bool mm_pmd_folded(struct mm_struct *mm) } #define mm_pmd_folded(mm) mm_pmd_folded(mm) +static inline unsigned long gup_folded_addr_end(unsigned long rste, + unsigned long addr, unsigned long end) +{ + unsigned int type = rste & _REGION_ENTRY_TYPE_MASK; + unsigned long size, mask, boundary; + + switch (type) { + case _REGION_ENTRY_TYPE_R1: + size = PGDIR_SIZE; + mask = PGDIR_MASK; + break; + case _REGION_ENTRY_TYPE_R2: + size = P4D_SIZE; + mask = P4D_MASK; + break; + case _REGION_ENTRY_TYPE_R3: + size = PUD_SIZE; + mask = PUD_MASK; + break; + default: + BUG(); + }; + + boundary = (addr + size) & mask; + + return (boundary - 1) < (end - 1) ? boundary : end; +} + +#define gup_pgd_addr_end gup_pgd_addr_end +static inline unsigned long gup_pgd_addr_end(pgd_t pgd, + unsigned long addr, unsigned long end) +{ + return gup_folded_addr_end(pgd_val(pgd), addr, end); +} + +#define gup_p4d_addr_end gup_p4d_addr_end +static inline unsigned long gup_p4d_addr_end(p4d_t p4d, + unsigned long addr, unsigned long end) +{ + return gup_folded_addr_end(p4d_val(p4d), addr, end); +} + +#define gup_pud_addr_end gup_pud_addr_end +static inline unsigned long gup_pud_addr_end(pud_t pud, + unsigned long addr, unsigned long end) +{ + return gup_folded_addr_end(pud_val(pud), addr, end); +} + static inline int mm_has_pgste(struct mm_struct *mm) { #ifdef CONFIG_PGSTE diff --git a/include/linux/pgtable.h b/include/linux/pgtable.h index e8cbc2e795d5..620a83c774c7 100644 --- a/include/linux/pgtable.h +++ b/include/linux/pgtable.h @@ -681,6 +681,22 @@ static inline int arch_unmap_one(struct mm_struct *mm, }) #endif +#ifndef gup_pgd_addr_end +#define gup_pgd_addr_end(pgd, addr, end) pgd_addr_end(addr, end) +#endif + +#ifndef gup_p4d_addr_end +#define gup_p4d_addr_end(p4d, addr, end) p4d_addr_end(addr, end) +#endif + +#ifndef gup_pud_addr_end +#define gup_pud_addr_end(pud, addr, end) pud_addr_end(addr, end) +#endif + +#ifndef gup_pmd_addr_end +#define gup_pmd_addr_end(pmd, addr, end) pmd_addr_end(addr, end) +#endif + /* * When walking page tables, we usually want to skip any p?d_none entries; * and any p?d_bad entries - reporting the error before resetting to none. diff --git a/mm/gup.c b/mm/gup.c index ae096ea7583f..149ef3d71457 100644 --- a/mm/gup.c +++ b/mm/gup.c @@ -2510,7 +2510,7 @@ static int gup_pmd_range(pud_t pud, unsigned long addr, unsigned long end, do { pmd_t pmd = READ_ONCE(*pmdp); - next = pmd_addr_end(addr, end); + next = gup_pmd_addr_end(pmd, addr, end); if (!pmd_present(pmd)) return 0; @@ -2553,7 +2553,7 @@ static int gup_pud_range(p4d_t p4d, unsigned long addr, unsigned long end, do { pud_t pud = READ_ONCE(*pudp); - next = pud_addr_end(addr, end); + next = gup_pud_addr_end(pud, addr, end); if (unlikely(!pud_present(pud))) return 0; if (unlikely(pud_huge(pud))) { @@ -2581,7 +2581,7 @@ static int gup_p4d_range(pgd_t pgd, unsigned long addr, unsigned long end, do { p4d_t p4d = READ_ONCE(*p4dp); - next = p4d_addr_end(addr, end); + next = gup_p4d_addr_end(p4d, addr, end); if (p4d_none(p4d)) return 0; BUILD_BUG_ON(p4d_huge(p4d)); @@ -2606,7 +2606,7 @@ static void gup_pgd_range(unsigned long addr, unsigned long end, do { pgd_t pgd = READ_ONCE(*pgdp); - next = pgd_addr_end(addr, end); + next = gup_pgd_addr_end(pgd, addr, end); if (pgd_none(pgd)) return; if (unlikely(pgd_huge(pgd))) { -- 2.17.1 ^ permalink raw reply related [flat|nested] 14+ messages in thread
* Re: [RFC PATCH 0/2] mm/gup: fix gup_fast with dynamic page table folding 2020-08-28 14:03 [RFC PATCH 0/2] mm/gup: fix gup_fast with dynamic page table folding Gerald Schaefer 2020-08-28 14:03 ` [RFC PATCH 1/2] " Gerald Schaefer 2020-08-28 14:03 ` [RFC PATCH 2/2] " Gerald Schaefer @ 2020-08-28 14:21 ` Jason Gunthorpe 2020-08-28 15:01 ` Gerald Schaefer 2020-08-31 11:53 ` Christian Borntraeger 3 siblings, 1 reply; 14+ messages in thread From: Jason Gunthorpe @ 2020-08-28 14:21 UTC (permalink / raw) To: Gerald Schaefer Cc: Linus Torvalds, Andrew Morton, linux-mm, LKML, Vasily Gorbik, Alexander Gordeev, linux-s390, Heiko Carstens, Claudio Imbrenda, Christian Borntraeger On Fri, Aug 28, 2020 at 04:03:12PM +0200, Gerald Schaefer wrote: > Commit 1a42010cdc26 ("s390/mm: convert to the generic get_user_pages_fast > code") introduced a subtle but severe bug on s390 with gup_fast, due to > dynamic page table folding. I think the page walk code in mm/pagewalk.c has similar issues to GUP. I've been noodling on some patches to add the missing stack copies to pagewalk.c as they are clearly missing.. It would be good if this could be less GUP specific? Generically this is about walking the page table without holding the page table spinlocks using READ_ONCE. Jason ^ permalink raw reply [flat|nested] 14+ messages in thread
* Re: [RFC PATCH 0/2] mm/gup: fix gup_fast with dynamic page table folding 2020-08-28 14:21 ` [RFC PATCH 0/2] " Jason Gunthorpe @ 2020-08-28 15:01 ` Gerald Schaefer 2020-08-28 15:20 ` Jason Gunthorpe 0 siblings, 1 reply; 14+ messages in thread From: Gerald Schaefer @ 2020-08-28 15:01 UTC (permalink / raw) To: Jason Gunthorpe Cc: Linus Torvalds, Andrew Morton, linux-mm, LKML, Vasily Gorbik, Alexander Gordeev, linux-s390, Heiko Carstens, Claudio Imbrenda, Christian Borntraeger On Fri, 28 Aug 2020 11:21:37 -0300 Jason Gunthorpe <jgg@ziepe.ca> wrote: > On Fri, Aug 28, 2020 at 04:03:12PM +0200, Gerald Schaefer wrote: > > Commit 1a42010cdc26 ("s390/mm: convert to the generic get_user_pages_fast > > code") introduced a subtle but severe bug on s390 with gup_fast, due to > > dynamic page table folding. > > I think the page walk code in mm/pagewalk.c has similar issues to > GUP. I've been noodling on some patches to add the missing stack > copies to pagewalk.c as they are clearly missing.. > > It would be good if this could be less GUP specific? > > Generically this is about walking the page table without holding the > page table spinlocks using READ_ONCE. Indeed, if there were other code paths doing that, they would most likely also be broken (at least) for s390. Alexander was already looking into generalizing the new gup-specific helpers, but so far we assumed that would only be "nice to have" for the future, and not fix any real issues at the moment. So we wanted to focus on first fixing the very real gup_fast issue. Both approaches here probably could be generalized, by either changing pXd_address_end() or pXd_offset(), but I guess it makes sense to already take into account that we might need such generalization sooner than expected. Just to make sure, you are referring to some future / planned changes to mm/pagewalk.c, and not some currently existing pagetable walkers already using the READ_ONCE logic w/o spinlocks? If those would exist already, I guess we would already have issues on s390, independent from our conversion to common code gup_fast. Regards, Gerald ^ permalink raw reply [flat|nested] 14+ messages in thread
* Re: [RFC PATCH 0/2] mm/gup: fix gup_fast with dynamic page table folding 2020-08-28 15:01 ` Gerald Schaefer @ 2020-08-28 15:20 ` Jason Gunthorpe 0 siblings, 0 replies; 14+ messages in thread From: Jason Gunthorpe @ 2020-08-28 15:20 UTC (permalink / raw) To: Gerald Schaefer Cc: Linus Torvalds, Andrew Morton, linux-mm, LKML, Vasily Gorbik, Alexander Gordeev, linux-s390, Heiko Carstens, Claudio Imbrenda, Christian Borntraeger On Fri, Aug 28, 2020 at 05:01:03PM +0200, Gerald Schaefer wrote: > Just to make sure, you are referring to some future / planned > changes to mm/pagewalk.c, and not some currently existing > pagetable walkers already using the READ_ONCE logic w/o > spinlocks? Yes no current code, just something I've been looking at slowly. Jason ^ permalink raw reply [flat|nested] 14+ messages in thread
* Re: [RFC PATCH 0/2] mm/gup: fix gup_fast with dynamic page table folding 2020-08-28 14:03 [RFC PATCH 0/2] mm/gup: fix gup_fast with dynamic page table folding Gerald Schaefer ` (2 preceding siblings ...) 2020-08-28 14:21 ` [RFC PATCH 0/2] " Jason Gunthorpe @ 2020-08-31 11:53 ` Christian Borntraeger 2020-08-31 19:15 ` Andrew Morton 3 siblings, 1 reply; 14+ messages in thread From: Christian Borntraeger @ 2020-08-31 11:53 UTC (permalink / raw) To: Gerald Schaefer, Linus Torvalds, Andrew Morton Cc: linux-mm, LKML, Vasily Gorbik, Alexander Gordeev, linux-s390, Heiko Carstens, Claudio Imbrenda On 28.08.20 16:03, Gerald Schaefer wrote: [...] > We came up with two possible fix-ups, both will introduce some gup-specific > helper functions, which will have no effect on other archs than s390. > > Patch 1 is the solution that has already been discussed in > https://lkml.kernel.org/r/20190418100218.0a4afd51@mschwideX1 > It will additionally pass along pXd pointers in gup_pXd_range, and > also introduce pXd_offset_orig for s390, which takes an additional > pXd entry value parameter. This allows correctly returning / incrementing > pointers while still preseving the READ_ONCE logic for gup_fast. > > Patch 2 would instead introduce new gup_pXd_addr_end helpers, which take > an additional pXd entry value parameter, that can be used on s390 > to determine the correct page table level and return corresponding > end / boundary. With that, the pointer iteration will always > happen in gup_pgd_range. > > Comments / preferences are welcome. As a last resort, we might also > revert back to the s390-specific gup_fast code, if none of the options > are agreeable. given that this is a data integrity issue, I think it would be good to have some feedback soon if option 1 or option 2 would be acceptable from a common code perspective. Andrew, who of the mm people would be the right one to decide? ^ permalink raw reply [flat|nested] 14+ messages in thread
* Re: [RFC PATCH 0/2] mm/gup: fix gup_fast with dynamic page table folding 2020-08-31 11:53 ` Christian Borntraeger @ 2020-08-31 19:15 ` Andrew Morton 2020-09-01 17:40 ` Gerald Schaefer 0 siblings, 1 reply; 14+ messages in thread From: Andrew Morton @ 2020-08-31 19:15 UTC (permalink / raw) To: Christian Borntraeger Cc: Gerald Schaefer, Linus Torvalds, linux-mm, LKML, Vasily Gorbik, Alexander Gordeev, linux-s390, Heiko Carstens, Claudio Imbrenda On Mon, 31 Aug 2020 13:53:36 +0200 Christian Borntraeger <borntraeger@de.ibm.com> wrote: > > > On 28.08.20 16:03, Gerald Schaefer wrote: > have some feedback soon if option 1 or option 2 would be acceptable > from a common code perspective. Andrew, who of the mm people would > be the right one to decide? Jason and John Hubbard are doing most of the work in there at present, Both patches look OK to me from a non-s390 perspective. Unless we plan to implement Jason's more-general approach this time, I'd be inclined to defer to the s390 people as to the preferred implementation. ^ permalink raw reply [flat|nested] 14+ messages in thread
* Re: [RFC PATCH 0/2] mm/gup: fix gup_fast with dynamic page table folding 2020-08-31 19:15 ` Andrew Morton @ 2020-09-01 17:40 ` Gerald Schaefer 2020-09-01 18:14 ` Jason Gunthorpe 2020-09-01 23:22 ` John Hubbard 0 siblings, 2 replies; 14+ messages in thread From: Gerald Schaefer @ 2020-09-01 17:40 UTC (permalink / raw) To: Andrew Morton Cc: Christian Borntraeger, Linus Torvalds, linux-mm, LKML, Vasily Gorbik, Alexander Gordeev, linux-s390, Heiko Carstens, Claudio Imbrenda, John Hubbard, Jason Gunthorpe On Mon, 31 Aug 2020 12:15:53 -0700 Andrew Morton <akpm@linux-foundation.org> wrote: > On Mon, 31 Aug 2020 13:53:36 +0200 Christian Borntraeger <borntraeger@de.ibm.com> wrote: > > > > > > > On 28.08.20 16:03, Gerald Schaefer wrote: > > have some feedback soon if option 1 or option 2 would be acceptable > > from a common code perspective. Andrew, who of the mm people would > > be the right one to decide? > > Jason and John Hubbard are doing most of the work in there at present, > > Both patches look OK to me from a non-s390 perspective. Unless we plan > to implement Jason's more-general approach this time, I'd be inclined > to defer to the s390 people as to the preferred implementation. My gut feeling would say that the gup_pXd_addr_end approach makes us behave more like other archs, at least for gup_fast, which cannot be so bad. Regarding generalization, both approaches would introduce some specific helpers for this "page table walking w/o lock using READ_ONCE", currently only used by gup_fast. What we initially had in mind wrt generalization was replacing e.g. the pXd_addr_end() completely, with the new version that takes the additional pXd value parameter. That would of course be quite intrusive, also affecting other arch code, so it is probably not what we want to do in the first step. To make it a bit more generic and also usable for future paths in mm/pagewalk.c for example, we could however at least name the new helpers differently, e.g. pXd_addr_end_folded instead of gup_pXd_addr_end, and also add some comment on why / where they should be used, like this (will send again as proper patch with description if agreed): --- arch/s390/include/asm/pgtable.h | 49 +++++++++++++++++++++++++++++++++ include/linux/pgtable.h | 32 +++++++++++++++++++++ mm/gup.c | 8 +++--- 3 files changed, 85 insertions(+), 4 deletions(-) diff --git a/arch/s390/include/asm/pgtable.h b/arch/s390/include/asm/pgtable.h index 7eb01a5459cd..d74257f2cb5b 100644 --- a/arch/s390/include/asm/pgtable.h +++ b/arch/s390/include/asm/pgtable.h @@ -512,6 +512,55 @@ static inline bool mm_pmd_folded(struct mm_struct *mm) } #define mm_pmd_folded(mm) mm_pmd_folded(mm) +static inline unsigned long rste_addr_end_folded(unsigned long rste, + unsigned long addr, unsigned long end) +{ + unsigned int type = rste & _REGION_ENTRY_TYPE_MASK; + unsigned long size, mask, boundary; + + switch (type) { + case _REGION_ENTRY_TYPE_R1: + size = PGDIR_SIZE; + mask = PGDIR_MASK; + break; + case _REGION_ENTRY_TYPE_R2: + size = P4D_SIZE; + mask = P4D_MASK; + break; + case _REGION_ENTRY_TYPE_R3: + size = PUD_SIZE; + mask = PUD_MASK; + break; + default: + BUG(); + }; + + boundary = (addr + size) & mask; + + return (boundary - 1) < (end - 1) ? boundary : end; +} + +#define pgd_addr_end_folded pgd_addr_end_folded +static inline unsigned long pgd_addr_end_folded(pgd_t pgd, unsigned long addr, + unsigned long end) +{ + return rste_addr_end_folded(pgd_val(pgd), addr, end); +} + +#define p4d_addr_end_folded p4d_addr_end_folded +static inline unsigned long p4d_addr_end_folded(p4d_t p4d, unsigned long addr, + unsigned long end) +{ + return rste_addr_end_folded(p4d_val(p4d), addr, end); +} + +#define pud_addr_end_folded pud_addr_end_folded +static inline unsigned long pud_addr_end_folded(pud_t pud, unsigned long addr, + unsigned long end) +{ + return rste_addr_end_folded(pud_val(pud), addr, end); +} + static inline int mm_has_pgste(struct mm_struct *mm) { #ifdef CONFIG_PGSTE diff --git a/include/linux/pgtable.h b/include/linux/pgtable.h index e8cbc2e795d5..43dacbce823f 100644 --- a/include/linux/pgtable.h +++ b/include/linux/pgtable.h @@ -681,6 +681,38 @@ static inline int arch_unmap_one(struct mm_struct *mm, }) #endif +/* + * With dynamic page table levels on s390, the static pXd_addr_end() functions + * will not return corresponding dynamic boundaries. This is no problem as long + * as only pXd pointers are passed down during page table walk, because + * pXd_offset() will simply return the given pointer for folded levels, and the + * pointer iteration over a range simply happens at the correct page table + * level. + * It is however a problem with gup_fast, or other places walking the page + * tables w/o locks using READ_ONCE(), and passing down the pXd values instead + * of pointers. In this case, the pointer given to pXd_offset() is a pointer to + * a stack variable, which cannot be used for pointer iteration at the correct + * level. Instead, the iteration then has to happen by going up to pgd level + * again. To allow this, provide pXd_addr_end_folded() functions with an + * additional pXd value parameter, which can be used on s390 to determine the + * folding level and return the corresponding boundary. + */ +#ifndef pgd_addr_end_folded +#define pgd_addr_end_folded(pgd, addr, end) pgd_addr_end(addr, end) +#endif + +#ifndef p4d_addr_end_folded +#define p4d_addr_end_folded(p4d, addr, end) p4d_addr_end(addr, end) +#endif + +#ifndef pud_addr_end_folded +#define pud_addr_end_folded(pud, addr, end) pud_addr_end(addr, end) +#endif + +#ifndef pmd_addr_end_folded +#define pmd_addr_end_folded(pmd, addr, end) pmd_addr_end(addr, end) +#endif + /* * When walking page tables, we usually want to skip any p?d_none entries; * and any p?d_bad entries - reporting the error before resetting to none. diff --git a/mm/gup.c b/mm/gup.c index ae096ea7583f..3157bc5ede24 100644 --- a/mm/gup.c +++ b/mm/gup.c @@ -2510,7 +2510,7 @@ static int gup_pmd_range(pud_t pud, unsigned long addr, unsigned long end, do { pmd_t pmd = READ_ONCE(*pmdp); - next = pmd_addr_end(addr, end); + next = pmd_addr_end_folded(pmd, addr, end); if (!pmd_present(pmd)) return 0; @@ -2553,7 +2553,7 @@ static int gup_pud_range(p4d_t p4d, unsigned long addr, unsigned long end, do { pud_t pud = READ_ONCE(*pudp); - next = pud_addr_end(addr, end); + next = pud_addr_end_folded(pud, addr, end); if (unlikely(!pud_present(pud))) return 0; if (unlikely(pud_huge(pud))) { @@ -2581,7 +2581,7 @@ static int gup_p4d_range(pgd_t pgd, unsigned long addr, unsigned long end, do { p4d_t p4d = READ_ONCE(*p4dp); - next = p4d_addr_end(addr, end); + next = p4d_addr_end_folded(p4d, addr, end); if (p4d_none(p4d)) return 0; BUILD_BUG_ON(p4d_huge(p4d)); @@ -2606,7 +2606,7 @@ static void gup_pgd_range(unsigned long addr, unsigned long end, do { pgd_t pgd = READ_ONCE(*pgdp); - next = pgd_addr_end(addr, end); + next = pgd_addr_end_folded(pgd, addr, end); if (pgd_none(pgd)) return; if (unlikely(pgd_huge(pgd))) { ^ permalink raw reply related [flat|nested] 14+ messages in thread
* Re: [RFC PATCH 0/2] mm/gup: fix gup_fast with dynamic page table folding 2020-09-01 17:40 ` Gerald Schaefer @ 2020-09-01 18:14 ` Jason Gunthorpe 2020-09-01 23:22 ` John Hubbard 1 sibling, 0 replies; 14+ messages in thread From: Jason Gunthorpe @ 2020-09-01 18:14 UTC (permalink / raw) To: Gerald Schaefer Cc: Andrew Morton, Christian Borntraeger, Linus Torvalds, linux-mm, LKML, Vasily Gorbik, Alexander Gordeev, linux-s390, Heiko Carstens, Claudio Imbrenda, John Hubbard On Tue, Sep 01, 2020 at 07:40:20PM +0200, Gerald Schaefer wrote: > +/* > + * With dynamic page table levels on s390, the static pXd_addr_end() functions > + * will not return corresponding dynamic boundaries. This is no problem as long > + * as only pXd pointers are passed down during page table walk, because > + * pXd_offset() will simply return the given pointer for folded levels, and the > + * pointer iteration over a range simply happens at the correct page table > + * level. > + * It is however a problem with gup_fast, or other places walking the page > + * tables w/o locks using READ_ONCE(), and passing down the pXd values instead > + * of pointers. In this case, the pointer given to pXd_offset() is a pointer to > + * a stack variable, which cannot be used for pointer iteration at the correct > + * level. Instead, the iteration then has to happen by going up to pgd level > + * again. To allow this, provide pXd_addr_end_folded() functions with an > + * additional pXd value parameter, which can be used on s390 to determine the > + * folding level and return the corresponding boundary. > + */ > +#ifndef pgd_addr_end_folded > +#define pgd_addr_end_folded(pgd, addr, end) pgd_addr_end(addr, end) > +#endif > + > +#ifndef p4d_addr_end_folded > +#define p4d_addr_end_folded(p4d, addr, end) p4d_addr_end(addr, end) > +#endif > + > +#ifndef pud_addr_end_folded > +#define pud_addr_end_folded(pud, addr, end) pud_addr_end(addr, end) > +#endif > + > +#ifndef pmd_addr_end_folded > +#define pmd_addr_end_folded(pmd, addr, end) pmd_addr_end(addr, end) > +#endif Feels like it would be cleaner to globally change pmd_addr_end() /etc to require the extra pmd input rather that introduce this special rule when *_folded needs to be used? NOP on all arches execpt s390.. There are not so many call sites that it seems too scary, and I wouldn't be surprised if there are going to be more cases beyond GUP that *should* be using the READ_ONCE trick. Jason ^ permalink raw reply [flat|nested] 14+ messages in thread
* Re: [RFC PATCH 0/2] mm/gup: fix gup_fast with dynamic page table folding 2020-09-01 17:40 ` Gerald Schaefer 2020-09-01 18:14 ` Jason Gunthorpe @ 2020-09-01 23:22 ` John Hubbard 2020-09-02 12:24 ` Gerald Schaefer 1 sibling, 1 reply; 14+ messages in thread From: John Hubbard @ 2020-09-01 23:22 UTC (permalink / raw) To: Gerald Schaefer, Andrew Morton Cc: Christian Borntraeger, Linus Torvalds, linux-mm, LKML, Vasily Gorbik, Alexander Gordeev, linux-s390, Heiko Carstens, Claudio Imbrenda, Jason Gunthorpe On 9/1/20 10:40 AM, Gerald Schaefer wrote: > On Mon, 31 Aug 2020 12:15:53 -0700 > Andrew Morton <akpm@linux-foundation.org> wrote: ... > diff --git a/include/linux/pgtable.h b/include/linux/pgtable.h > index e8cbc2e795d5..43dacbce823f 100644 > --- a/include/linux/pgtable.h > +++ b/include/linux/pgtable.h > @@ -681,6 +681,38 @@ static inline int arch_unmap_one(struct mm_struct *mm, > }) > #endif > > +/* > + * With dynamic page table levels on s390, the static pXd_addr_end() functions > + * will not return corresponding dynamic boundaries. This is no problem as long > + * as only pXd pointers are passed down during page table walk, because > + * pXd_offset() will simply return the given pointer for folded levels, and the > + * pointer iteration over a range simply happens at the correct page table > + * level. > + * It is however a problem with gup_fast, or other places walking the page > + * tables w/o locks using READ_ONCE(), and passing down the pXd values instead > + * of pointers. In this case, the pointer given to pXd_offset() is a pointer to > + * a stack variable, which cannot be used for pointer iteration at the correct > + * level. Instead, the iteration then has to happen by going up to pgd level > + * again. To allow this, provide pXd_addr_end_folded() functions with an > + * additional pXd value parameter, which can be used on s390 to determine the > + * folding level and return the corresponding boundary. Ah OK, I finally see what you have in mind. And as Jason noted, if we just pass an additional parameter to pXd_addr_end() that's going to be cleaner. And doing so puts this in line with other page table abstractions that also carry more information than some architectures need. For example, on x86, set_pte_at() ignores the first two parameters: #define set_pte_at(mm, addr, ptep, pte) native_set_pte_at(mm, addr, ptep, pte) static inline void native_set_pte_at(struct mm_struct *mm, unsigned long addr, pte_t *ptep , pte_t pte) { native_set_pte(ptep, pte); } This type of abstraction has worked out very well, IMHO. thanks, -- John Hubbard NVIDIA ^ permalink raw reply [flat|nested] 14+ messages in thread
* Re: [RFC PATCH 0/2] mm/gup: fix gup_fast with dynamic page table folding 2020-09-01 23:22 ` John Hubbard @ 2020-09-02 12:24 ` Gerald Schaefer 2020-09-02 15:09 ` Gerald Schaefer 0 siblings, 1 reply; 14+ messages in thread From: Gerald Schaefer @ 2020-09-02 12:24 UTC (permalink / raw) To: John Hubbard Cc: Andrew Morton, Christian Borntraeger, Linus Torvalds, linux-mm, LKML, Vasily Gorbik, Alexander Gordeev, linux-s390, Heiko Carstens, Claudio Imbrenda, Jason Gunthorpe On Tue, 1 Sep 2020 16:22:22 -0700 John Hubbard <jhubbard@nvidia.com> wrote: > On 9/1/20 10:40 AM, Gerald Schaefer wrote: > > On Mon, 31 Aug 2020 12:15:53 -0700 > > Andrew Morton <akpm@linux-foundation.org> wrote: > ... > > diff --git a/include/linux/pgtable.h b/include/linux/pgtable.h > > index e8cbc2e795d5..43dacbce823f 100644 > > --- a/include/linux/pgtable.h > > +++ b/include/linux/pgtable.h > > @@ -681,6 +681,38 @@ static inline int arch_unmap_one(struct mm_struct *mm, > > }) > > #endif > > > > +/* > > + * With dynamic page table levels on s390, the static pXd_addr_end() functions > > + * will not return corresponding dynamic boundaries. This is no problem as long > > + * as only pXd pointers are passed down during page table walk, because > > + * pXd_offset() will simply return the given pointer for folded levels, and the > > + * pointer iteration over a range simply happens at the correct page table > > + * level. > > + * It is however a problem with gup_fast, or other places walking the page > > + * tables w/o locks using READ_ONCE(), and passing down the pXd values instead > > + * of pointers. In this case, the pointer given to pXd_offset() is a pointer to > > + * a stack variable, which cannot be used for pointer iteration at the correct > > + * level. Instead, the iteration then has to happen by going up to pgd level > > + * again. To allow this, provide pXd_addr_end_folded() functions with an > > + * additional pXd value parameter, which can be used on s390 to determine the > > + * folding level and return the corresponding boundary. > > Ah OK, I finally see what you have in mind. And as Jason noted, if we just > pass an additional parameter to pXd_addr_end() that's going to be > cleaner. And doing so puts this in line with other page table > abstractions that also carry more information than some architectures > need. For example, on x86, set_pte_at() ignores the first two > parameters: > > #define set_pte_at(mm, addr, ptep, pte) native_set_pte_at(mm, addr, ptep, pte) > > static inline void native_set_pte_at(struct mm_struct *mm, unsigned long addr, > pte_t *ptep , pte_t pte) > { > native_set_pte(ptep, pte); > } > > This type of abstraction has worked out very well, IMHO. Yes, it certainly feels like the right way to do it, and it would not affect other archs in a functional way. It would however introduce a subtle change for s390 behavior on _all_ page table walkers, not just the READ_ONCE gup_fast path, i.e. it changes the level at which the pointer iteration is done. Of course, that *should* not have any functional issues, or else it would also be broken in gup_fast, but in this area we often were wrong with should / could assumptions... At least it could have some (minor) performance impact on s390, due to going up to pgd level again instead of staying at the folded level, but only for crossing pud/p4d boundaries, so that *should* be neglectable. So, while totally agreeing that adding the pXd parameter to pXd_addr_end() in general looks like the cleanest way, we will at least need to give this some proper testing on s390. One other question that came up here while doing that generalized approach was "how to submit it", i.e. should it be split up in multiple patches which might be cleaner, or have it all-in-one, which would simplify Fixes/stable tags. After all, we must not forget that this fixes a severe data integrity issue on s390, so we don't want to unnecessarily complicate or defer backports for that. That being said, maybe a compromise could be to split it up in a way that first introduces the pXd_addr_end_folded() helpers for gup only, with proper Fixes/stable tags? Then do the generalization with more thorough testing for s390 after that, possibly also with Fixes/stable tags so that we do not have different code in stable and upstream? ^ permalink raw reply [flat|nested] 14+ messages in thread
* Re: [RFC PATCH 0/2] mm/gup: fix gup_fast with dynamic page table folding 2020-09-02 12:24 ` Gerald Schaefer @ 2020-09-02 15:09 ` Gerald Schaefer 2020-09-02 20:13 ` Jason Gunthorpe 0 siblings, 1 reply; 14+ messages in thread From: Gerald Schaefer @ 2020-09-02 15:09 UTC (permalink / raw) To: John Hubbard Cc: Andrew Morton, Christian Borntraeger, Linus Torvalds, linux-mm, LKML, Vasily Gorbik, Alexander Gordeev, linux-s390, Heiko Carstens, Claudio Imbrenda, Jason Gunthorpe On Wed, 2 Sep 2020 14:24:37 +0200 Gerald Schaefer <gerald.schaefer@linux.ibm.com> wrote: > On Tue, 1 Sep 2020 16:22:22 -0700 > John Hubbard <jhubbard@nvidia.com> wrote: > > > On 9/1/20 10:40 AM, Gerald Schaefer wrote: > > > On Mon, 31 Aug 2020 12:15:53 -0700 > > > Andrew Morton <akpm@linux-foundation.org> wrote: > > ... > > > diff --git a/include/linux/pgtable.h b/include/linux/pgtable.h > > > index e8cbc2e795d5..43dacbce823f 100644 > > > --- a/include/linux/pgtable.h > > > +++ b/include/linux/pgtable.h > > > @@ -681,6 +681,38 @@ static inline int arch_unmap_one(struct mm_struct *mm, > > > }) > > > #endif > > > > > > +/* > > > + * With dynamic page table levels on s390, the static pXd_addr_end() functions > > > + * will not return corresponding dynamic boundaries. This is no problem as long > > > + * as only pXd pointers are passed down during page table walk, because > > > + * pXd_offset() will simply return the given pointer for folded levels, and the > > > + * pointer iteration over a range simply happens at the correct page table > > > + * level. > > > + * It is however a problem with gup_fast, or other places walking the page > > > + * tables w/o locks using READ_ONCE(), and passing down the pXd values instead > > > + * of pointers. In this case, the pointer given to pXd_offset() is a pointer to > > > + * a stack variable, which cannot be used for pointer iteration at the correct > > > + * level. Instead, the iteration then has to happen by going up to pgd level > > > + * again. To allow this, provide pXd_addr_end_folded() functions with an > > > + * additional pXd value parameter, which can be used on s390 to determine the > > > + * folding level and return the corresponding boundary. > > > > Ah OK, I finally see what you have in mind. And as Jason noted, if we just > > pass an additional parameter to pXd_addr_end() that's going to be > > cleaner. And doing so puts this in line with other page table > > abstractions that also carry more information than some architectures > > need. For example, on x86, set_pte_at() ignores the first two > > parameters: > > > > #define set_pte_at(mm, addr, ptep, pte) native_set_pte_at(mm, addr, ptep, pte) > > > > static inline void native_set_pte_at(struct mm_struct *mm, unsigned long addr, > > pte_t *ptep , pte_t pte) > > { > > native_set_pte(ptep, pte); > > } > > > > This type of abstraction has worked out very well, IMHO. > > Yes, it certainly feels like the right way to do it, and it would > not affect other archs in a functional way. It would however introduce > a subtle change for s390 behavior on _all_ page table walkers, not > just the READ_ONCE gup_fast path, i.e. it changes the level at which > the pointer iteration is done. Of course, that *should* not have any > functional issues, or else it would also be broken in gup_fast, but > in this area we often were wrong with should / could assumptions... Hmm, not so sure about that "not affect other archs", that might also be one of those *should*s. Consider this change to mm/mlock.c from our current internal generalization work, for example: diff --git a/mm/mlock.c b/mm/mlock.c index 93ca2bf30b4f..dbde97f317d4 100644 --- a/mm/mlock.c +++ b/mm/mlock.c @@ -374,8 +374,12 @@ static unsigned long __munlock_pagevec_fill(struct pagevec *pvec, struct vm_area_struct *vma, struct zone *zone, unsigned long start, unsigned long end) { - pte_t *pte; spinlock_t *ptl; + pte_t *pte; + pmd_t *pmd; + pud_t *pud; + p4d_t *p4d; + pgd_t *pgd; /* * Initialize pte walk starting at the already pinned page where we @@ -384,10 +388,14 @@ static unsigned long __munlock_pagevec_fill(struct pagevec *pvec, */ pte = get_locked_pte(vma->vm_mm, start, &ptl); /* Make sure we do not cross the page table boundary */ - end = pgd_addr_end(start, end); - end = p4d_addr_end(start, end); - end = pud_addr_end(start, end); - end = pmd_addr_end(start, end); + pgd = pgd_offset(vma->vm_mm, start); + end = pgd_addr_end(*pgd, start, end); + p4d = p4d_offset(pgd, start); + end = p4d_addr_end(*p4d, start, end); + pud = pud_offset(p4d, start); + end = pud_addr_end(*pud, start, end); + pmd = pmd_offset(pud, start); + end = pmd_addr_end(*pmd, start, end); /* The page next to the pinned page is the first we will try to get */ start += PAGE_SIZE; I guess we *could* assume that all the extra pXd_offset() calls and also the de-referencing would be optimized out by the compiler for other archs, but it is one example where my gut tells me that this might not be so trivial and w/o unwanted effects after all. Anyway, stay tuned, we will send a v2 of this RFC with going the "modify pXd_addr_end" approach, including the minimal gup-specific patch plus on top the generalization work. Then we might get a better picture of this. ^ permalink raw reply related [flat|nested] 14+ messages in thread
* Re: [RFC PATCH 0/2] mm/gup: fix gup_fast with dynamic page table folding 2020-09-02 15:09 ` Gerald Schaefer @ 2020-09-02 20:13 ` Jason Gunthorpe 0 siblings, 0 replies; 14+ messages in thread From: Jason Gunthorpe @ 2020-09-02 20:13 UTC (permalink / raw) To: Gerald Schaefer Cc: John Hubbard, Andrew Morton, Christian Borntraeger, Linus Torvalds, linux-mm, LKML, Vasily Gorbik, Alexander Gordeev, linux-s390, Heiko Carstens, Claudio Imbrenda On Wed, Sep 02, 2020 at 05:09:58PM +0200, Gerald Schaefer wrote: > I guess we *could* assume that all the extra pXd_offset() calls and > also the de-referencing would be optimized out by the compiler for other > archs, but it is one example where my gut tells me that this might not > be so trivial and w/o unwanted effects after all. Assigning to a variable that is never read should be eliminated.. If things are very complex then the pXX_offset function might need to be marked with attribute pure, but I think this should be reliable? Jason ^ permalink raw reply [flat|nested] 14+ messages in thread
end of thread, other threads:[~2020-09-02 20:13 UTC | newest] Thread overview: 14+ messages (download: mbox.gz / follow: Atom feed) -- links below jump to the message on this page -- 2020-08-28 14:03 [RFC PATCH 0/2] mm/gup: fix gup_fast with dynamic page table folding Gerald Schaefer 2020-08-28 14:03 ` [RFC PATCH 1/2] " Gerald Schaefer 2020-08-28 14:03 ` [RFC PATCH 2/2] " Gerald Schaefer 2020-08-28 14:21 ` [RFC PATCH 0/2] " Jason Gunthorpe 2020-08-28 15:01 ` Gerald Schaefer 2020-08-28 15:20 ` Jason Gunthorpe 2020-08-31 11:53 ` Christian Borntraeger 2020-08-31 19:15 ` Andrew Morton 2020-09-01 17:40 ` Gerald Schaefer 2020-09-01 18:14 ` Jason Gunthorpe 2020-09-01 23:22 ` John Hubbard 2020-09-02 12:24 ` Gerald Schaefer 2020-09-02 15:09 ` Gerald Schaefer 2020-09-02 20:13 ` Jason Gunthorpe
This is a public inbox, see mirroring instructions for how to clone and mirror all data and code used for this inbox; as well as URLs for NNTP newsgroup(s).