* [PATCH 0/2] mm: madvise: return exact bytes advised with process_madvise under error
From: Charan Teja Kalla @ 2022-03-23 15:24 UTC
To: akpm, mhocko, minchan, surenb, vbabka, rientjes, nadav.amit, edgararriaga
Cc: linux-mm, linux-kernel, Charan Teja Kalla

Under error conditions, process_madvise() does not return the exact
number of bytes processed in an iovec element, so the user may repeat
the advice on vma ranges contained in that iovec element even though
those ranges have already been processed. This problem is partially
solved by commit 08095d6310a7 ("mm: madvise: skip unmapped vma holes
passed to process_madvise") for ENOMEM return types. These patches try
to solve the problem for other error return types.

Starting this as a new discussion, as the background for these changes
comes from the patches below, which are already merged into Linus' tree:
1) https://git.kernel.org/pub/scm/linux/kernel/git/torvalds/linux.git/commit/?id=5bd009c7c9a9e888077c07535dc0c70aeab242c3
2) https://git.kernel.org/pub/scm/linux/kernel/git/torvalds/linux.git/commit/?id=08095d6310a7ce43256b4251577bc66a25c6e1a6

and the lore archives for the above changes:
1) V2: https://lore.kernel.org/linux-mm/cover.1647008754.git.quic_charante@quicinc.com/
2) V1: https://lore.kernel.org/linux-mm/1646803679-11433-1-git-send-email-quic_charante@quicinc.com/

Charan Teja Kalla (1):
  Revert "mm: madvise: skip unmapped vma holes passed to process_madvise"

Charan Teja Reddy (1):
  mm: madvise: return exact bytes advised with process_madvise under error

 mm/madvise.c | 99 +++++++++++++++++++++++++++++++++++++++++++++++++++++------
 1 file changed, 88 insertions(+), 11 deletions(-)

-- 
2.7.4

* [PATCH 1/2] Revert "mm: madvise: skip unmapped vma holes passed to process_madvise"
From: Charan Teja Kalla @ 2022-03-23 15:24 UTC
To: akpm, mhocko, minchan, surenb, vbabka, rientjes, nadav.amit, edgararriaga
Cc: linux-mm, linux-kernel, Charan Teja Kalla

This reverts commit 08095d6310a7 ("mm: madvise: skip unmapped vma holes
passed to process_madvise") as process_madvise() fails to return exact
processed bytes in other cases too. As an example: if the
process_madvise() hits mlocked pages after processing some initial bytes
passed in [start, end), it just returns EINVAL though some bytes are
processed. Thus making an exception only for ENOMEM is partially fixing
the problem of returning the proper advised bytes.

Thus revert this patch and return proper bytes advised, if there are any,
for all the error types in the following patch.

Signed-off-by: Charan Teja Kalla <quic_charante@quicinc.com>
---
 mm/madvise.c | 9 +--------
 1 file changed, 1 insertion(+), 8 deletions(-)

diff --git a/mm/madvise.c b/mm/madvise.c
index 39b712f..0d8fd17 100644
--- a/mm/madvise.c
+++ b/mm/madvise.c
@@ -1433,16 +1433,9 @@ SYSCALL_DEFINE5(process_madvise, int, pidfd, const struct iovec __user *, vec,
 
 	while (iov_iter_count(&iter)) {
 		iovec = iov_iter_iovec(&iter);
-		/*
-		 * do_madvise returns ENOMEM if unmapped holes are present
-		 * in the passed VMA. process_madvise() is expected to skip
-		 * unmapped holes passed to it in the 'struct iovec' list
-		 * and not fail because of them. Thus treat -ENOMEM return
-		 * from do_madvise as valid and continue processing.
-		 */
 		ret = do_madvise(mm, (unsigned long)iovec.iov_base,
 					iovec.iov_len, behavior);
-		if (ret < 0 && ret != -ENOMEM)
+		if (ret < 0)
 			break;
 		iov_iter_advance(&iter, iovec.iov_len);
 	}
-- 
2.7.4

* Re: [PATCH 1/2] Revert "mm: madvise: skip unmapped vma holes passed to process_madvise"
From: Michal Hocko @ 2022-03-24 12:48 UTC
To: Charan Teja Kalla
Cc: akpm, minchan, surenb, vbabka, rientjes, nadav.amit, edgararriaga, linux-mm, linux-kernel

On Wed 23-03-22 20:54:09, Charan Teja Kalla wrote:
> This reverts commit 08095d6310a7 ("mm: madvise: skip unmapped vma holes
> passed to process_madvise") as process_madvise() fails to return exact
> processed bytes in other cases too. As an example: if the
> process_madvise() hits mlocked pages after processing some initial bytes
> passed in [start, end), it just returns EINVAL though some bytes are
> processed. Thus making an exception only for ENOMEM is partially fixing
> the problem of returning the proper advised bytes.
>
> Thus revert this patch and return proper bytes advised, if there are any,
> for all the error types in the following patch.

I do agree with the revert. I am not sure the above really is a proper
justification though. 08095d6310a7 was changing one (arguably) dubious
semantic by another one without a proper justification and wider
consensus which I would expect from a patch which changes an existing
semantic. Not to mention it being marked for stable tree.

But let's not nit pick on that now. Let's send this revert ASAP and use
some more time to discuss the semantic and whether any change is really
required.

> Signed-off-by: Charan Teja Kalla <quic_charante@quicinc.com>

Acked-by: Michal Hocko <mhocko@suse.com>

> ---
>  mm/madvise.c | 9 +--------
>  1 file changed, 1 insertion(+), 8 deletions(-)
>
> diff --git a/mm/madvise.c b/mm/madvise.c
> index 39b712f..0d8fd17 100644
> --- a/mm/madvise.c
> +++ b/mm/madvise.c
> @@ -1433,16 +1433,9 @@ SYSCALL_DEFINE5(process_madvise, int, pidfd, const struct iovec __user *, vec,
>  
>  	while (iov_iter_count(&iter)) {
>  		iovec = iov_iter_iovec(&iter);
> -		/*
> -		 * do_madvise returns ENOMEM if unmapped holes are present
> -		 * in the passed VMA. process_madvise() is expected to skip
> -		 * unmapped holes passed to it in the 'struct iovec' list
> -		 * and not fail because of them. Thus treat -ENOMEM return
> -		 * from do_madvise as valid and continue processing.
> -		 */
>  		ret = do_madvise(mm, (unsigned long)iovec.iov_base,
>  					iovec.iov_len, behavior);
> -		if (ret < 0 && ret != -ENOMEM)
> +		if (ret < 0)
>  			break;
>  		iov_iter_advance(&iter, iovec.iov_len);
>  	}
> -- 
> 2.7.4

-- 
Michal Hocko
SUSE Labs

* Re: [PATCH 1/2] Revert "mm: madvise: skip unmapped vma holes passed to process_madvise"
From: Charan Teja Kalla @ 2022-03-24 14:03 UTC
To: Michal Hocko
Cc: akpm, minchan, surenb, vbabka, rientjes, nadav.amit, edgararriaga, linux-mm, linux-kernel

Thanks Michal.

On 3/24/2022 6:18 PM, Michal Hocko wrote:
> On Wed 23-03-22 20:54:09, Charan Teja Kalla wrote:
>> This reverts commit 08095d6310a7 ("mm: madvise: skip unmapped vma holes
>> passed to process_madvise") as process_madvise() fails to return exact
>> processed bytes in other cases too. As an example: if the
>> process_madvise() hits mlocked pages after processing some initial bytes
>> passed in [start, end), it just returns EINVAL though some bytes are
>> processed. Thus making an exception only for ENOMEM is partially fixing
>> the problem of returning the proper advised bytes.
>>
>> Thus revert this patch and return proper bytes advised, if there are any,
>> for all the error types in the following patch.
>
> I do agree with the revert. I am not sure the above really is a proper
> justification though. 08095d6310a7 was changing one (arguably) dubious
> semantic by another one without a proper justification and wider
> consensus which I would expect from a patch which changes an existing
> semantic. Not to mention it being marked for stable tree.

Thanks for pointing this out. Since 08095d6310a7 is marked for the
stable tree, doing the same for this change.

Cc: <stable@vger.kernel.org> # 5.10+

>
> But let's not nit pick on that now. Let's send this revert ASAP and use
> some more time to discuss the semantic and whether any change is really
> required.
>
>> Signed-off-by: Charan Teja Kalla <quic_charante@quicinc.com>
>
> Acked-by: Michal Hocko <mhocko@suse.com>
>

Thanks for the quick ack.

>> ---
>>  mm/madvise.c | 9 +--------
>>  1 file changed, 1 insertion(+), 8 deletions(-)
>>
>> diff --git a/mm/madvise.c b/mm/madvise.c
>> index 39b712f..0d8fd17 100644
>> --- a/mm/madvise.c
>> +++ b/mm/madvise.c
>> @@ -1433,16 +1433,9 @@ SYSCALL_DEFINE5(process_madvise, int, pidfd, const struct iovec __user *, vec,
>>  
>>  	while (iov_iter_count(&iter)) {
>>  		iovec = iov_iter_iovec(&iter);
>> -		/*
>> -		 * do_madvise returns ENOMEM if unmapped holes are present
>> -		 * in the passed VMA. process_madvise() is expected to skip
>> -		 * unmapped holes passed to it in the 'struct iovec' list
>> -		 * and not fail because of them. Thus treat -ENOMEM return
>> -		 * from do_madvise as valid and continue processing.
>> -		 */
>>  		ret = do_madvise(mm, (unsigned long)iovec.iov_base,
>>  					iovec.iov_len, behavior);
>> -		if (ret < 0 && ret != -ENOMEM)
>> +		if (ret < 0)
>>  			break;
>>  		iov_iter_advance(&iter, iovec.iov_len);
>>  	}
>> -- 
>> 2.7.4
>

* [PATCH 2/2] mm: madvise: return exact bytes advised with process_madvise under error
From: Charan Teja Kalla @ 2022-03-23 15:24 UTC
To: akpm, mhocko, minchan, surenb, vbabka, rientjes, nadav.amit, edgararriaga
Cc: linux-mm, linux-kernel, Charan Teja Reddy

From: Charan Teja Reddy <quic_charante@quicinc.com>

The commit 5bd009c7c9a9 ("mm: madvise: return correct bytes advised with
process_madvise") fixes the issue to return number of bytes that are
successfully advised before hitting error with iovec elements
processing. But, when the user passed unmapped ranges in iovec, the
syscall ignores these holes and continues processing and returns ENOMEM
in the end, which is same as madvise semantic. This is a problem for
vector processing where user may want to know how many bytes were
exactly processed in an iovec element to make better decisions in the
user space. As in ENOMEM case, we processed all bytes in an iovec element
but still returned error which will confuse the user whether it is
failed or succeeded to advise.

As an example, consider below ranges were passed by the user in struct
iovec: iovec1(ranges: vma1), iovec2(ranges: vma2 -- vma3 -- hole) and
iovec3(ranges: vma4). In the current implementation, it fully advise
iovec1 and iovec2 but just returns number of processed bytes as iovec1
range. Then user may repeat the processing of iovec2, which is already
processed, which then returns with ENOMEM. Then user may want to skip
iovec2 and starts processing from iovec3. Here because of wrong return
processed bytes, iovec2 is processed twice.

This problem is solved with commit 08095d6310a7 ("mm: madvise: skip
unmapped vma holes passed to process_madvise"), where the syscall now
reports iovec1 and iovec2 as processed and the user may restart from
iovec3. Some problems with this patch are that:
1) The user may want to be notified, by an ENOMEM return, that unmapped
address ranges were passed[1].
2) It didn't consider the case where there are partially advised bytes
with other error types too, e.g. EINVAL. Thus fixing only for ENOMEM is
partially solving the problem[2].
3) Even if no vma is found in the passed iovec range, it is still
considered as processed instead of returning ENOMEM.

These can be fixed by giving process_madvise() its own semantics[3],
different from madvise(), where it has its own iterator and returns the
exact bytes it addressed. Now process_madvise() stops iterating if it
encounters a hole or an invalid vma and returns the bytes processed so
far in that iovec element. In the above example, it first returns the
processed bytes as the ranges of iovec1(vma1) and iovec2(vma2, vma3) so
that the user knows exactly that a hole/invalid vma exists after vma3 in
the passed iovec elements. The user can then skip the hole/invalid vma
in the next retry and start processing from iovec3.

[1]https://lore.kernel.org/linux-mm/YjmLmBUmROr+hshO@dhcp22.suse.cz/
[2]https://lore.kernel.org/linux-mm/YjFAzuLKWw5eadtf@google.com/
[3]https://lore.kernel.org/linux-mm/YjNgoeg1yOocsjWC@google.com/

Signed-off-by: Charan Teja Reddy <quic_charante@quicinc.com>
---
 mm/madvise.c | 90 ++++++++++++++++++++++++++++++++++++++++++++++++++++++++++--
 1 file changed, 87 insertions(+), 3 deletions(-)

diff --git a/mm/madvise.c b/mm/madvise.c
index 0d8fd17..9169b16 100644
--- a/mm/madvise.c
+++ b/mm/madvise.c
@@ -1381,6 +1381,89 @@ SYSCALL_DEFINE3(madvise, unsigned long, start, size_t, len_in, int, behavior)
 	return do_madvise(current->mm, start, len_in, behavior);
 }
 
+/*
+ * TODO: Add documentation for process_madvise()
+ */
+static int do_process_madvise(struct mm_struct *mm, unsigned long start, size_t len_in,
+			      int behavior, size_t *partial_bytes_advised)
+{
+	unsigned long end, tmp;
+	struct vm_area_struct *vma, *prev;
+	int error = -EINVAL;
+	size_t len;
+	size_t tmp_bytes_advised = 0;
+	struct blk_plug plug;
+
+	*partial_bytes_advised = 0;
+	/*
+	 * TODO: Move these checks to a common function to be used by both
+	 * madvise() and process_madvise().
+	 */
+	start = untagged_addr(start);
+	if (!PAGE_ALIGNED(start))
+		return error;
+	len = PAGE_ALIGN(len_in);
+
+	/* Check to see whether len was rounded up from small -ve to zero */
+	if (len_in && !len)
+		return error;
+
+	end = start + len;
+	if (end < start)
+		return error;
+
+	error = 0;
+	if (end == start)
+		return error;
+
+	mmap_read_lock(mm);
+
+	vma = find_vma_prev(mm, start, &prev);
+	if (vma && start > vma->vm_start)
+		prev = vma;
+
+	blk_start_plug(&plug);
+	for (;;) {
+		/*
+		 * If it hits an unmapped address range in the [start, end),
+		 * stop processing and return ENOMEM.
+		 */
+		if (!vma || start < vma->vm_start) {
+			error = -ENOMEM;
+			goto out;
+		}
+
+		tmp = vma->vm_end;
+		if (end < tmp)
+			tmp = end;
+
+		error = madvise_vma_behavior(vma, &prev, start, tmp, behavior);
+		if (error)
+			goto out;
+		tmp_bytes_advised += tmp - start;
+		start = tmp;
+		if (prev && start < prev->vm_end)
+			start = prev->vm_end;
+		if (start >= end)
+			goto out;
+		if (prev)
+			vma = prev->vm_next;
+		else
+			vma = find_vma(mm, start);
+	}
+out:
+	/*
+	 * partial_bytes_advised may contain non-zero bytes indicating
+	 * the number of bytes advised before failure. Holds zero in case
+	 * of success.
+	 */
+	*partial_bytes_advised = error ? tmp_bytes_advised : 0;
+	blk_finish_plug(&plug);
+	mmap_read_unlock(mm);
+
+	return error;
+}
+
 SYSCALL_DEFINE5(process_madvise, int, pidfd, const struct iovec __user *, vec,
 		size_t, vlen, int, behavior, unsigned int, flags)
 {
@@ -1391,6 +1474,7 @@ SYSCALL_DEFINE5(process_madvise, int, pidfd, const struct iovec __user *, vec,
 	struct task_struct *task;
 	struct mm_struct *mm;
 	size_t total_len;
+	size_t partial_bytes_advised;
 	unsigned int f_flags;
 
 	if (flags != 0) {
@@ -1433,14 +1517,14 @@ SYSCALL_DEFINE5(process_madvise, int, pidfd, const struct iovec __user *, vec,
 
 	while (iov_iter_count(&iter)) {
 		iovec = iov_iter_iovec(&iter);
-		ret = do_madvise(mm, (unsigned long)iovec.iov_base,
-					iovec.iov_len, behavior);
+		ret = do_process_madvise(mm, (unsigned long)iovec.iov_base,
+				iovec.iov_len, behavior, &partial_bytes_advised);
 		if (ret < 0)
 			break;
 		iov_iter_advance(&iter, iovec.iov_len);
 	}
 
-	ret = (total_len - iov_iter_count(&iter)) ? : ret;
+	ret = (total_len - iov_iter_count(&iter) + partial_bytes_advised) ? : ret;
 
 release_mm:
 	mmput(mm);
-- 
2.7.4

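On the user-space side, the byte count proposed in this patch could be
mapped back to a resume position roughly as follows. This is a minimal
sketch, assuming the syscall returns the total number of bytes advised
across the iovec array, including a partially advised element, as the
patch proposes. struct resume_point and find_resume_point() are
hypothetical names used only for illustration; they are not part of the
kernel or of any library.

```c
#include <assert.h>
#include <stddef.h>
#include <sys/uio.h>

/* Hypothetical helper type: where a retry should continue. */
struct resume_point {
	size_t index;	/* first iovec element that was not fully advised */
	size_t offset;	/* bytes of that element that were already advised */
};

/*
 * Walk the iovec array and subtract each fully advised element from
 * the byte count reported by the syscall. Whatever remains fell inside
 * vec[index]; index == vlen means every element was fully advised.
 */
static struct resume_point find_resume_point(const struct iovec *vec,
					     size_t vlen, size_t advised)
{
	struct resume_point rp = { 0, 0 };

	assert(vec != NULL || vlen == 0);

	while (rp.index < vlen && advised >= vec[rp.index].iov_len) {
		advised -= vec[rp.index].iov_len;
		rp.index++;
	}
	rp.offset = (rp.index < vlen) ? advised : 0;
	return rp;
}
```

With the example from the commit message, a return value covering iovec1
plus vma2 and vma3 yields index 1 (iovec2) with an offset equal to the
size of vma2 and vma3, which is where the hole starts. A caller could
then skip the rest of iovec2 and reissue the call starting from iovec3.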
* Re: [PATCH 2/2] mm: madvise: return exact bytes advised with process_madvise under error
From: Michal Hocko @ 2022-03-24 13:14 UTC
To: Charan Teja Kalla
Cc: akpm, minchan, surenb, vbabka, rientjes, nadav.amit, edgararriaga, linux-mm, linux-kernel

On Wed 23-03-22 20:54:10, Charan Teja Kalla wrote:
> From: Charan Teja Reddy <quic_charante@quicinc.com>
>
> The commit 5bd009c7c9a9 ("mm: madvise: return correct bytes advised with
> process_madvise") fixes the issue to return number of bytes that are
> successfully advised before hitting error with iovec elements
> processing. But, when the user passed unmapped ranges in iovec, the
> syscall ignores these holes and continues processing and returns ENOMEM
> in the end, which is same as madvise semantic. This is a problem for
> vector processing where user may want to know how many bytes were
> exactly processed in an iovec element to make better decisions in the
> user space. As in ENOMEM case, we processed all bytes in an iovec element
> but still returned error which will confuse the user whether it is
> failed or succeeded to advise.

Do you have any specific example where the initial semantic is really
problematic or is this mostly a theoretical problem you have found when
reading the code?

> As an example, consider below ranges were passed by the user in struct
> iovec: iovec1(ranges: vma1), iovec2(ranges: vma2 -- vma3 -- hole) and
> iovec3(ranges: vma4). In the current implementation, it fully advise
> iovec1 and iovec2 but just returns number of processed bytes as iovec1
> range. Then user may repeat the processing of iovec2, which is already
> processed, which then returns with ENOMEM. Then user may want to skip
> iovec2 and starts processing from iovec3. Here because of wrong return
> processed bytes, iovec2 is processed twice.

I think you should be much more specific why this is actually a problem.
This would surely be less optimal but is this a correctness issue?

[...]
> +	vma = find_vma_prev(mm, start, &prev);
> +	if (vma && start > vma->vm_start)
> +		prev = vma;
> +
> +	blk_start_plug(&plug);
> +	for (;;) {
> +		/*
> +		 * If it hits an unmapped address range in the [start, end),
> +		 * stop processing and return ENOMEM.
> +		 */
> +		if (!vma || start < vma->vm_start) {
> +			error = -ENOMEM;
> +			goto out;
> +		}
> +
> +		tmp = vma->vm_end;
> +		if (end < tmp)
> +			tmp = end;
> +
> +		error = madvise_vma_behavior(vma, &prev, start, tmp, behavior);
> +		if (error)
> +			goto out;
> +		tmp_bytes_advised += tmp - start;
> +		start = tmp;
> +		if (prev && start < prev->vm_end)
> +			start = prev->vm_end;
> +		if (start >= end)
> +			goto out;
> +		if (prev)
> +			vma = prev->vm_next;
> +		else
> +			vma = find_vma(mm, start);
> +	}
> +out:
> +	/*
> +	 * partial_bytes_advised may contain non-zero bytes indicating
> +	 * the number of bytes advised before failure. Holds zero in case
> +	 * of success.
> +	 */
> +	*partial_bytes_advised = error ? tmp_bytes_advised : 0;

Although this looks like a fix I am not sure it is future proof.
madvise_vma_behavior doesn't report which part of the range has been
really processed. I do not think that currently supported madvise modes
for process_madvise support an early break out with return to the
userspace (madvise_cold_or_pageout_pte_range bails on fatal signals for
example) but this can change in the future and then you are back to
"imprecise" return value problem. Yes, this is a theoretical problem
but so it sounds the problem you are trying to fix IMHO. I think it
would be better to live with imprecise return values reporting rather
than aiming for perfection which would be fragile and add a future
maintenance burden.

On the other hand if there are _real_ workloads which suffer from the
existing semantic then sure the above seems to be an appropriate fix
AFAICS.

-- 
Michal Hocko
SUSE Labs

* Re: [PATCH 2/2] mm: madvise: return exact bytes advised with process_madvise under error
From: Charan Teja Kalla @ 2022-03-24 15:45 UTC
To: Michal Hocko
Cc: akpm, minchan, surenb, vbabka, rientjes, nadav.amit, edgararriaga, linux-mm, linux-kernel, Johannes Weiner

Thanks Michal for the inputs.

On 3/24/2022 6:44 PM, Michal Hocko wrote:
> On Wed 23-03-22 20:54:10, Charan Teja Kalla wrote:
>> From: Charan Teja Reddy <quic_charante@quicinc.com>
>>
>> The commit 5bd009c7c9a9 ("mm: madvise: return correct bytes advised with
>> process_madvise") fixes the issue to return number of bytes that are
>> successfully advised before hitting error with iovec elements
>> processing. But, when the user passed unmapped ranges in iovec, the
>> syscall ignores these holes and continues processing and returns ENOMEM
>> in the end, which is same as madvise semantic. This is a problem for
>> vector processing where user may want to know how many bytes were
>> exactly processed in an iovec element to make better decisions in the
>> user space. As in ENOMEM case, we processed all bytes in an iovec element
>> but still returned error which will confuse the user whether it is
>> failed or succeeded to advise.
> Do you have any specific example where the initial semantic is really
> problematic or is this mostly a theoretical problem you have found when
> reading the code?
>
>
>> As an example, consider below ranges were passed by the user in struct
>> iovec: iovec1(ranges: vma1), iovec2(ranges: vma2 -- vma3 -- hole) and
>> iovec3(ranges: vma4). In the current implementation, it fully advise
>> iovec1 and iovec2 but just returns number of processed bytes as iovec1
>> range. Then user may repeat the processing of iovec2, which is already
>> processed, which then returns with ENOMEM. Then user may want to skip
>> iovec2 and starts processing from iovec3. Here because of wrong return
>> processed bytes, iovec2 is processed twice.
> I think you should be much more specific why this is actually a problem.
> This would surely be less optimal but is this a correctness issue?
>

Yes, this is a problem found when reading the code, but IMO we can
easily expect an invalid vma/hole in the passed range because we are
operating on other process VMA. More than solving the problem of being
less optimal, this can be looked in the direction of helping the user to
take better policy decisions with this syscall. And, not better policy
decisions from user is just being suboptimal (i.e. issuing the syscall
again on the processed range) with this syscall.

Having said that, at present I don't have any reports/unit test showing
the existing semantic is really problematic.

> [...]
>> +	vma = find_vma_prev(mm, start, &prev);
>> +	if (vma && start > vma->vm_start)
>> +		prev = vma;
>> +
>> +	blk_start_plug(&plug);
>> +	for (;;) {
>> +		/*
>> +		 * If it hits an unmapped address range in the [start, end),
>> +		 * stop processing and return ENOMEM.
>> +		 */
>> +		if (!vma || start < vma->vm_start) {
>> +			error = -ENOMEM;
>> +			goto out;
>> +		}
>> +
>> +		tmp = vma->vm_end;
>> +		if (end < tmp)
>> +			tmp = end;
>> +
>> +		error = madvise_vma_behavior(vma, &prev, start, tmp, behavior);
>> +		if (error)
>> +			goto out;
>> +		tmp_bytes_advised += tmp - start;
>> +		start = tmp;
>> +		if (prev && start < prev->vm_end)
>> +			start = prev->vm_end;
>> +		if (start >= end)
>> +			goto out;
>> +		if (prev)
>> +			vma = prev->vm_next;
>> +		else
>> +			vma = find_vma(mm, start);
>> +	}
>> +out:
>> +	/*
>> +	 * partial_bytes_advised may contain non-zero bytes indicating
>> +	 * the number of bytes advised before failure. Holds zero in case
>> +	 * of success.
>> +	 */
>> +	*partial_bytes_advised = error ? tmp_bytes_advised : 0;
> Although this looks like a fix I am not sure it is future proof.
> madvise_vma_behavior doesn't report which part of the range has been
> really processed. I do not think that currently supported madvise modes
> for process_madvise support an early break out with return to the
> userspace (madvise_cold_or_pageout_pte_range bails on fatal signals for
> example) but this can change in the future and then you are back to
> "imprecise" return value problem. Yes, this is a theoretical problem

Agree here with the "imprecise" return value problem with processing a
VMA range. Yes when it is decided to return proper processed value from
madvise_vma_behavior(), this code too may need the maintenance.

> but so it sounds the problem you are trying to fix IMHO. I think it
> would be better to live with imprecise return values reporting rather
> than aiming for perfection which would be fragile and add a future
> maintenance burden.
>

Hmm. Should at least this imprecise return values be documented in man
page or in madvise.c file?

> On the other hand if there are _real_ workloads which suffer from the
> existing semantic then sure the above seems to be an appropriate fix
> AFAICS.
> -- 
> Michal Hocko
> SUSE Labs

* Re: [PATCH 2/2] mm: madvise: return exact bytes advised with process_madvise under error
From: Minchan Kim @ 2022-03-25 0:46 UTC
To: Charan Teja Kalla
Cc: Michal Hocko, akpm, surenb, vbabka, rientjes, nadav.amit, edgararriaga, linux-mm, linux-kernel, Johannes Weiner

On Thu, Mar 24, 2022 at 09:15:57PM +0530, Charan Teja Kalla wrote:
> Thanks Michal for the inputs.
>
> On 3/24/2022 6:44 PM, Michal Hocko wrote:
> > On Wed 23-03-22 20:54:10, Charan Teja Kalla wrote:
> >> From: Charan Teja Reddy <quic_charante@quicinc.com>
> >>
> >> The commit 5bd009c7c9a9 ("mm: madvise: return correct bytes advised with
> >> process_madvise") fixes the issue to return number of bytes that are
> >> successfully advised before hitting error with iovec elements
> >> processing. But, when the user passed unmapped ranges in iovec, the
> >> syscall ignores these holes and continues processing and returns ENOMEM
> >> in the end, which is same as madvise semantic. This is a problem for
> >> vector processing where user may want to know how many bytes were
> >> exactly processed in an iovec element to make better decisions in the
> >> user space. As in ENOMEM case, we processed all bytes in an iovec element
> >> but still returned error which will confuse the user whether it is
> >> failed or succeeded to advise.
> > Do you have any specific example where the initial semantic is really
> > problematic or is this mostly a theoretical problem you have found when
> > reading the code?
> >
> >
> >> As an example, consider below ranges were passed by the user in struct
> >> iovec: iovec1(ranges: vma1), iovec2(ranges: vma2 -- vma3 -- hole) and
> >> iovec3(ranges: vma4). In the current implementation, it fully advise
> >> iovec1 and iovec2 but just returns number of processed bytes as iovec1
> >> range. Then user may repeat the processing of iovec2, which is already
> >> processed, which then returns with ENOMEM. Then user may want to skip
> >> iovec2 and starts processing from iovec3. Here because of wrong return
> >> processed bytes, iovec2 is processed twice.
> > I think you should be much more specific why this is actually a problem.
> > This would surely be less optimal but is this a correctness issue?
> >
>
> Yes, this is a problem found when reading the code, but IMO we can
> easily expect an invalid vma/hole in the passed range because we are
> operating on other process VMA. More than solving the problem of being
> less optimal, this can be looked in the direction of helping the user to
> take better policy decisions with this syscall. And, not better policy
> decisions from user is just being suboptimal (i.e. issuing the syscall
> again on the processed range) with this syscall.
>
> Having said that, at present I don't have any reports/unit test showing
> the existing semantic is really problematic.
>
> > [...]
> >> +	vma = find_vma_prev(mm, start, &prev);
> >> +	if (vma && start > vma->vm_start)
> >> +		prev = vma;
> >> +
> >> +	blk_start_plug(&plug);
> >> +	for (;;) {
> >> +		/*
> >> +		 * If it hits an unmapped address range in the [start, end),
> >> +		 * stop processing and return ENOMEM.
> >> +		 */
> >> +		if (!vma || start < vma->vm_start) {
> >> +			error = -ENOMEM;
> >> +			goto out;
> >> +		}
> >> +
> >> +		tmp = vma->vm_end;
> >> +		if (end < tmp)
> >> +			tmp = end;
> >> +
> >> +		error = madvise_vma_behavior(vma, &prev, start, tmp, behavior);
> >> +		if (error)
> >> +			goto out;
> >> +		tmp_bytes_advised += tmp - start;
> >> +		start = tmp;
> >> +		if (prev && start < prev->vm_end)
> >> +			start = prev->vm_end;
> >> +		if (start >= end)
> >> +			goto out;
> >> +		if (prev)
> >> +			vma = prev->vm_next;
> >> +		else
> >> +			vma = find_vma(mm, start);
> >> +	}
> >> +out:
> >> +	/*
> >> +	 * partial_bytes_advised may contain non-zero bytes indicating
> >> +	 * the number of bytes advised before failure. Holds zero in case
> >> +	 * of success.
> >> +	 */
> >> +	*partial_bytes_advised = error ? tmp_bytes_advised : 0;
> > Although this looks like a fix I am not sure it is future proof.
> > madvise_vma_behavior doesn't report which part of the range has been
> > really processed. I do not think that currently supported madvise modes
> > for process_madvise support an early break out with return to the
> > userspace (madvise_cold_or_pageout_pte_range bails on fatal signals for

EINVAL due to can_madv_lru_vma since it encountered VM_PFNMAP, which is
not rare in Android. A user process could filter them out by looking at
/proc/pid/smaps properly, but that is too expensive. An idea to filter
them out from /proc/<pid>/maps is checking the shared flags such as
rw-s or ---s (even though it's not accurate, it would work effectively).

> > example) but this can change in the future and then you are back to
> > "imprecise" return value problem. Yes, this is a theoretical problem
>
> Agree here with the "imprecise" return value problem with processing a
> VMA range. Yes when it is decided to return proper processed value from
> madvise_vma_behavior(), this code too may need the maintenance.
>
> > but so it sounds the problem you are trying to fix IMHO. I think it
> > would be better to live with imprecise return values reporting rather
> > than aiming for perfection which would be fragile and add a future
> > maintenance burden.

Actually, I don't think the maintenance cost would be that big.
Having said that, I agree the patch should justify with numbers how
painful this would be, since it's more of an optimization.

Thanks.

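The /proc/<pid>/maps filtering idea mentioned in this mail could look
roughly like the following in user space. This is a minimal sketch,
assuming the usual maps line layout (an address range followed by a
four-character permission field whose last character is 's' for shared
and 'p' for private). maps_line_is_shared() is a hypothetical helper
name, and, as the mail itself notes, the shared flag is only a heuristic
for mappings that process_madvise() would reject.

```c
#include <assert.h>
#include <stdbool.h>
#include <stddef.h>
#include <stdio.h>

/*
 * Hypothetical helper: returns true when one line read from
 * /proc/<pid>/maps carries the shared flag (e.g. "rw-s" or "---s").
 * Lines that do not parse are treated as not shared.
 */
static bool maps_line_is_shared(const char *line)
{
	unsigned long start, end;
	char perms[5] = "";	/* four permission characters plus NUL */

	assert(line != NULL);

	if (sscanf(line, "%lx-%lx %4s", &start, &end, perms) != 3)
		return false;

	/* The fourth permission character is 's' (shared) or 'p' (private). */
	return perms[3] == 's';
}
```

A caller could read the maps file line by line, skip every range for
which this returns true, and build the iovec array for
process_madvise() from the remaining ranges.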
* Re: [PATCH 2/2] mm: madvise: return exact bytes advised with process_madvise under error
  2022-03-24 15:45 ` Charan Teja Kalla
  2022-03-25  0:46 ` Minchan Kim
@ 2022-03-25  0:48 ` Minchan Kim
  2022-03-25  8:02 ` Michal Hocko
  2 siblings, 0 replies; 10+ messages in thread
From: Minchan Kim @ 2022-03-25 0:48 UTC
To: Charan Teja Kalla
Cc: Michal Hocko, akpm, surenb, vbabka, rientjes, nadav.amit, edgararriaga, linux-mm, linux-kernel, Johannes Weiner

On Thu, Mar 24, 2022 at 09:15:57PM +0530, Charan Teja Kalla wrote:
> Thanks Michal for the inputs.
>
> On 3/24/2022 6:44 PM, Michal Hocko wrote:
> > On Wed 23-03-22 20:54:10, Charan Teja Kalla wrote:
> >> From: Charan Teja Reddy <quic_charante@quicinc.com>
> >>
> >> The commit 5bd009c7c9a9 ("mm: madvise: return correct bytes advised with
> >> process_madvise") fixes the issue to return number of bytes that are
> >> successfully advised before hitting error with iovec elements
> >> processing. But, when the user passed unmapped ranges in iovec, the
> >> syscall ignores these holes and continues processing and returns ENOMEM
> >> in the end, which is same as madvise semantic. This is a problem for
> >> vector processing where user may want to know how many bytes were
> >> exactly processed in a iovec element to make better decisions in the
> >> user space. As in ENOMEM case, we processed all bytes in a iovec element
> >> but still returned error which will confuse the user whether it is
> >> failed or succeeded to advise.
> > Do you have any specific example where the initial semantic is really
> > problematic or is this mostly a theoretical problem you have found when
> > reading the code?
> >
> >
> >> As an example, consider below ranges were passed by the user in struct
> >> iovec: iovec1(ranges: vma1), iovec2(ranges: vma2 -- vma3 -- hole) and
> >> iovec3(ranges: vma4). In the current implementation, it fully advise
> >> iovec1 and iovec2 but just returns number of processed bytes as iovec1
> >> range. Then user may repeat the processing of iovec2, which is already
> >> processed, which then returns with ENOMEM. Then user may want to skip
> >> iovec2 and starts processing from iovec3. Here because of wrong return
> >> processed bytes, iovec2 is processed twice.
> > I think you should be much more specific why this is actually a problem.
> > This would surely be less optimal but is this a correctness issue?
> >
>
> Yes, this is a problem found when reading the code, but IMO we can
> easily expect an invalid vma/hole in the passed range because we are
> operating on other process VMA. More than solving the problem of being
> less optimal, this can be looked in the direction of helping the user to
> take better policy decisions with this syscall. And, not better policy
> decisions from user is just being sub optimal(i.e. issuing the syscall
> again on the processed range) with this syscall.
>
> Having said that, at present I don't have any reports/unit test showing
> the existing semantic is really a problematic.
>
> > [...]
> >> +	vma = find_vma_prev(mm, start, &prev);
> >> +	if (vma && start > vma->vm_start)
> >> +		prev = vma;
> >> +
> >> +	blk_start_plug(&plug);
> >> +	for (;;) {
> >> +		/*
> >> +		 * If it hits a unmapped address range in the [start, end),
> >> +		 * stop processing and return ENOMEM.
> >> +		 */
> >> +		if (!vma || start < vma->vm_start) {
> >> +			error = -ENOMEM;
> >> +			goto out;
> >> +		}
> >> +
> >> +		tmp = vma->vm_end;
> >> +		if (end < tmp)
> >> +			tmp = end;
> >> +
> >> +		error = madvise_vma_behavior(vma, &prev, start, tmp, behavior);
> >> +		if (error)
> >> +			goto out;
> >> +		tmp_bytes_advised += tmp - start;
> >> +		start = tmp;
> >> +		if (prev && start < prev->vm_end)
> >> +			start = prev->vm_end;
> >> +		if (start >= end)
> >> +			goto out;
> >> +		if (prev)
> >> +			vma = prev->vm_next;
> >> +		else
> >> +			vma = find_vma(mm, start);
> >> +	}
> >> +out:
> >> +	/*
> >> +	 * partial_bytes_advised may contain non-zero bytes indicating
> >> +	 * the number of bytes advised before failure. Holds zero in case
> >> +	 * of success.
> >> +	 */
> >> +	*partial_bytes_advised = error ? tmp_bytes_advised : 0;
> > Although this looks like a fix I am not sure it is future proof.
> > madvise_vma_behavior doesn't report which part of the range has been
> > really processed. I do not think that currently supported madvise modes
> > for process_madvise support an early break out with return to the
> > userspace (madvise_cold_or_pageout_pte_range bails on fatal signals for
> > example) but this can change in the future and then you are back to
> > "imprecise" return value problem. Yes, this is a theoretical problem
>
> Agree here with the "imprecise" return value problem with processing a
> VMA range. Yes when it is decided to return proper processed value from
> madvise_vma_behavior(), this code too may need the maintenance.
>
> > but so it sounds the problem you are trying to fix IMHO. I think it
> > would be better to live with imprecise return values reporting rather
> > than aiming for perfection which would be fragile and add a future
> > maintenance burden.
> >
> Hmm. Should at least this imprecise return values be documented in man
> page or in madvise.c file?

I don't think we need to document it in the man page. madvise.c would be
enough, IMHO.
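The two behaviours debated above can be compared with a small userspace model. This is only a sketch under stated assumptions, not kernel code: `struct region` and `walk_range()` are hypothetical names standing in for VMAs and the range walk, the regions are assumed sorted and non-overlapping, and the advice itself is assumed never to fail. With `stop_at_hole` set, the walk stops at the first unmapped gap as in the quoted patch; with it cleared, the walk skips gaps and reports ENOMEM at the end, which is how the thread describes the madvise-like semantic. Unlike the patch, the model reports the byte count on success as well.

```c
#include <assert.h>
#include <errno.h>
#include <stdbool.h>
#include <stddef.h>

/* Hypothetical stand-in for a VMA: a mapped range [start, end). */
struct region {
	size_t start;
	size_t end;
};

/*
 * Walk [start, end) over regions that are sorted by address and do not
 * overlap. *advised receives the number of mapped bytes covered.
 * Returns 0 if the whole range was mapped, -ENOMEM if a hole was seen.
 */
static int walk_range(const struct region *r, size_t n, size_t start,
		      size_t end, bool stop_at_hole, size_t *advised)
{
	int error = 0;
	size_t i;

	assert(r != NULL && advised != NULL && start <= end);
	*advised = 0;

	for (i = 0; i < n && start < end; i++) {
		size_t tmp;

		if (r[i].end <= start)
			continue;	/* region lies entirely below the cursor */
		if (start < r[i].start) {
			/* Unmapped gap in front of this region. */
			error = -ENOMEM;
			if (stop_at_hole)
				return error;
			start = r[i].start;	/* skip the gap and carry on */
			if (start >= end)
				break;
		}
		tmp = r[i].end < end ? r[i].end : end;
		*advised += tmp - start;
		start = tmp;
	}
	if (start < end)
		error = -ENOMEM;	/* range runs past the last region */
	return error;
}
```

For three regions where a gap separates the second from the third, both modes return -ENOMEM for a range spanning all of them, but only the skip-holes mode counts the bytes of the third region.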
* Re: [PATCH 2/2] mm: madvise: return exact bytes advised with process_madvise under error
  2022-03-24 15:45 ` Charan Teja Kalla
  2022-03-25  0:46 ` Minchan Kim
  2022-03-25  0:48 ` Minchan Kim
@ 2022-03-25  8:02 ` Michal Hocko
  2 siblings, 0 replies; 10+ messages in thread
From: Michal Hocko @ 2022-03-25 8:02 UTC
To: Charan Teja Kalla
Cc: akpm, minchan, surenb, vbabka, rientjes, nadav.amit, edgararriaga, linux-mm, linux-kernel, Johannes Weiner

On Thu 24-03-22 21:15:57, Charan Teja Kalla wrote:
> Thanks Michal for the inputs.
>
> On 3/24/2022 6:44 PM, Michal Hocko wrote:
> > On Wed 23-03-22 20:54:10, Charan Teja Kalla wrote:
> >> From: Charan Teja Reddy <quic_charante@quicinc.com>
> >>
> >> The commit 5bd009c7c9a9 ("mm: madvise: return correct bytes advised with
> >> process_madvise") fixes the issue to return number of bytes that are
> >> successfully advised before hitting error with iovec elements
> >> processing. But, when the user passed unmapped ranges in iovec, the
> >> syscall ignores these holes and continues processing and returns ENOMEM
> >> in the end, which is same as madvise semantic. This is a problem for
> >> vector processing where user may want to know how many bytes were
> >> exactly processed in a iovec element to make better decisions in the
> >> user space. As in ENOMEM case, we processed all bytes in a iovec element
> >> but still returned error which will confuse the user whether it is
> >> failed or succeeded to advise.
> > Do you have any specific example where the initial semantic is really
> > problematic or is this mostly a theoretical problem you have found when
> > reading the code?
> >
> >
> >> As an example, consider below ranges were passed by the user in struct
> >> iovec: iovec1(ranges: vma1), iovec2(ranges: vma2 -- vma3 -- hole) and
> >> iovec3(ranges: vma4). In the current implementation, it fully advise
> >> iovec1 and iovec2 but just returns number of processed bytes as iovec1
> >> range. Then user may repeat the processing of iovec2, which is already
> >> processed, which then returns with ENOMEM. Then user may want to skip
> >> iovec2 and starts processing from iovec3. Here because of wrong return
> >> processed bytes, iovec2 is processed twice.
> > I think you should be much more specific why this is actually a problem.
> > This would surely be less optimal but is this a correctness issue?
> >
>
> Yes, this is a problem found when reading the code, but IMO we can
> easily expect an invalid vma/hole in the passed range because we are
> operating on other process VMA. More than solving the problem of being
> less optimal, this can be looked in the direction of helping the user to
> take better policy decisions with this syscall. And, not better policy
> decisions from user is just being sub optimal(i.e. issuing the syscall
> again on the processed range) with this syscall.
>
> Having said that, at present I don't have any reports/unit test showing
> the existing semantic is really a problematic.

OK, thanks for the clarification. I would tend to not change the
existing semantic. For one, doing so is always a regression risk, so the
reasoning should be really strong.

[...]

> > but so it sounds the problem you are trying to fix IMHO. I think it
> > would be better to live with imprecise return values reporting rather
> > than aiming for perfection which would be fragile and add a future
> > maintenance burden.
> >
> Hmm. Should at least this imprecise return values be documented in man
> page or in madvise.c file?

The man page says:
"
       On success, process_madvise() returns the number of bytes
       advised. This return value may be less than the total number of
       requested bytes, if an error occurred after some iovec elements
       were already processed. The caller should check the return value
       to determine whether a partial advice occurred.
"

which is pretty broad and AFAIU it matches the current behavior. It
doesn't explain what exactly the return value is. It just mentions that
the caller should check for partial advice without any further guidance,
e.g. where a new call should start. I think that such guidance would be
bad in general. On a partial success the caller would need to
re-evaluate ranges anyway.

So I guess we are good on the man page side for now.
--
Michal Hocko
SUSE Labs
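To make the "partial advice" check in the man page text concrete, here is a minimal userspace sketch of how a caller might map a byte count returned by process_madvise() back onto its own iovec array. The helper name `first_unadvised()` is hypothetical and not part of any API. As the discussion above stresses, the count can be imprecise, so the resulting index is only a hint for where to start re-evaluating ranges, not proof that earlier elements need no further attention.

```c
#include <assert.h>
#include <stddef.h>
#include <sys/uio.h>

/*
 * Given the iovec array that was passed to process_madvise() and a
 * non-negative byte count it returned, find the first element that the
 * count does not fully cover.
 *
 * Returns that element's index, or iovcnt if the count covers every
 * element. *offset receives the bytes already counted inside the
 * returned element (0 when iovcnt is returned).
 */
static size_t first_unadvised(const struct iovec *iov, size_t iovcnt,
			      size_t advised, size_t *offset)
{
	size_t i;

	assert(iov != NULL && offset != NULL);

	for (i = 0; i < iovcnt; i++) {
		if (advised < iov[i].iov_len)
			break;	/* the count runs out inside element i */
		advised -= iov[i].iov_len;
	}
	*offset = (i < iovcnt) ? advised : 0;
	return i;
}
```

A caller could compare the returned index with `iovcnt` to detect a partial result and then re-check the ranges from that element onward before issuing a new call.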
end of thread

Thread overview: 10+ messages
2022-03-23 15:24 [PATCH 0/2] mm: madvise: return exact bytes advised with process_madvise under error Charan Teja Kalla
2022-03-23 15:24 ` [PATCH 1/2] Revert "mm: madvise: skip unmapped vma holes passed to process_madvise" Charan Teja Kalla
2022-03-24 12:48   ` Michal Hocko
2022-03-24 14:03     ` Charan Teja Kalla
2022-03-23 15:24 ` [PATCH 2/2] mm: madvise: return exact bytes advised with process_madvise under error Charan Teja Kalla
2022-03-24 13:14   ` Michal Hocko
2022-03-24 15:45     ` Charan Teja Kalla
2022-03-25  0:46       ` Minchan Kim
2022-03-25  0:48       ` Minchan Kim
2022-03-25  8:02       ` Michal Hocko