linux-kernel.vger.kernel.org archive mirror
* [PATCH V2,0/2]mm: madvise: return correct bytes processed with process_madvise
@ 2022-03-11 15:29 Charan Teja Kalla
  2022-03-11 15:29 ` [PATCH V2,1/2] mm: madvise: return correct bytes advised " Charan Teja Kalla
                   ` (2 more replies)
  0 siblings, 3 replies; 23+ messages in thread
From: Charan Teja Kalla @ 2022-03-11 15:29 UTC (permalink / raw)
  To: akpm, surenb, vbabka, rientjes, sfr, edgararriaga, minchan,
	nadav.amit, mhocko
  Cc: linux-mm, linux-kernel, Charan Teja Kalla

With process_madvise(), always prefer returning the nonzero number of
processed bytes over an error. This helps the user identify which VMA,
among those passed in the 'struct iovec' vector list, failed to be
advised, so that they can decide whether to retry or skip that VMA.

Changes in V2:
  -- Separated the fixes returning processed bytes in case of an error
     and ENOMEM handling of process_madvise() due to unmapped hole in
     the VMA, as per the Minchan comments.
  -- Improved the comment for ENOMEM handling case as per Amit comments.

Changes in V1:
  -- Fixed the return value of process_madvise().
  -- Fixed ENOMEM handling of process_madvise() from do_madvise()
  -- https://patchwork.kernel.org/project/linux-mm/patch/1646803679-11433-1-git-send-email-quic_charante@quicinc.com/

Charan Teja Kalla (2):
  mm: madvise: return correct bytes advised with process_madvise
  mm: madvise: skip unmapped vma holes passed to process_madvise

 mm/madvise.c | 12 +++++++++---
 1 file changed, 9 insertions(+), 3 deletions(-)

-- 
2.7.4



* [PATCH V2,1/2] mm: madvise: return correct bytes advised with process_madvise
  2022-03-11 15:29 [PATCH V2,0/2]mm: madvise: return correct bytes processed with process_madvise Charan Teja Kalla
@ 2022-03-11 15:29 ` Charan Teja Kalla
  2022-03-15 22:20   ` Minchan Kim
  2022-03-21 15:18   ` Michal Hocko
  2022-03-11 15:29 ` [PATCH V2,2/2] mm: madvise: skip unmapped vma holes passed to process_madvise Charan Teja Kalla
  2022-03-11 21:42 ` [PATCH V2,0/2]mm: madvise: return correct bytes processed with process_madvise Andrew Morton
  2 siblings, 2 replies; 23+ messages in thread
From: Charan Teja Kalla @ 2022-03-11 15:29 UTC (permalink / raw)
  To: akpm, surenb, vbabka, rientjes, sfr, edgararriaga, minchan,
	nadav.amit, mhocko
  Cc: linux-mm, linux-kernel, Charan Teja Kalla, # 5 . 10+

The process_madvise() system call returns an error even after it has
processed some of the VMAs passed in the 'struct iovec' vector list,
which leaves the user unsure where to restart the advice. This also
contradicts the syscall's man page[1], which states that the "return
value may be less than the total number of requested bytes, if an error
occurred after some iovec elements were already processed.".

Consider a user who passes 10 VMAs in the 'struct iovec' vector list, of
which the first 9 are processed and the last one fails. The call then
returns only the error caused by the failed VMA, even though the first 9
VMAs were processed, so the user cannot tell which VMA failed. Returning
the number of bytes processed instead lets the user identify the VMA
that failed and retry or skip the advice on it.

[1]https://man7.org/linux/man-pages/man2/process_madvise.2.html.

Fixes: ecb8ac8b1f14("mm/madvise: introduce process_madvise() syscall: an external memory hinting API")
Cc: <stable@vger.kernel.org> # 5.10+
Signed-off-by: Charan Teja Kalla <quic_charante@quicinc.com>
---
Changes in V2:
 -- Separated the ENOMEM handling and return bytes processed, as per Minchan comments.
 -- This contains correcting return bytes processed with process_madvise().

Changes in V1:
 -- Fixed the ENOMEM handling and return bytes processed by process_madvise.
 -- https://patchwork.kernel.org/project/linux-mm/patch/1646803679-11433-1-git-send-email-quic_charante@quicinc.com/

 mm/madvise.c | 3 +--
 1 file changed, 1 insertion(+), 2 deletions(-)

diff --git a/mm/madvise.c b/mm/madvise.c
index 38d0f51..e97e6a9 100644
--- a/mm/madvise.c
+++ b/mm/madvise.c
@@ -1433,8 +1433,7 @@ SYSCALL_DEFINE5(process_madvise, int, pidfd, const struct iovec __user *, vec,
 		iov_iter_advance(&iter, iovec.iov_len);
 	}
 
-	if (ret == 0)
-		ret = total_len - iov_iter_count(&iter);
+	ret = (total_len - iov_iter_count(&iter)) ? : ret;
 
 release_mm:
 	mmput(mm);
-- 
2.7.4
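
The commit message above argues that a byte count lets the caller work out which VMA failed. A minimal userspace sketch of that bookkeeping follows, assuming the patched behaviour in which the return value is the total length of the iovec elements advised before the failure. The helper name first_unadvised_iovec() is hypothetical and is not part of any kernel or libc API.

```c
#include <assert.h>
#include <stddef.h>
#include <sys/types.h>
#include <sys/uio.h>

/*
 * Hypothetical helper (illustration only): given the iovec array handed to
 * process_madvise() and the non-negative byte count it returned, return the
 * index of the first element that was not fully advised. Returns count when
 * every element was covered by the byte count.
 */
static size_t first_unadvised_iovec(const struct iovec *vec, size_t count,
				    ssize_t advised)
{
	size_t remaining;
	size_t i;

	assert(vec != NULL || count == 0);
	assert(advised >= 0);

	remaining = (size_t)advised;
	for (i = 0; i < count; i++) {
		/* Not enough bytes left to cover this element: it failed. */
		if (remaining < vec[i].iov_len)
			return i;
		remaining -= vec[i].iov_len;
	}
	return count;
}
```

For example, with three elements of 4096, 8192 and 4096 bytes, a return value of 12288 maps to index 2, so the caller would retry or skip the third element.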



* [PATCH V2,2/2] mm: madvise: skip unmapped vma holes passed to process_madvise
  2022-03-11 15:29 [PATCH V2,0/2]mm: madvise: return correct bytes processed with process_madvise Charan Teja Kalla
  2022-03-11 15:29 ` [PATCH V2,1/2] mm: madvise: return correct bytes advised " Charan Teja Kalla
@ 2022-03-11 15:29 ` Charan Teja Kalla
  2022-03-15 22:58   ` Minchan Kim
  2022-03-21 15:34   ` Michal Hocko
  2022-03-11 21:42 ` [PATCH V2,0/2]mm: madvise: return correct bytes processed with process_madvise Andrew Morton
  2 siblings, 2 replies; 23+ messages in thread
From: Charan Teja Kalla @ 2022-03-11 15:29 UTC (permalink / raw)
  To: akpm, surenb, vbabka, rientjes, sfr, edgararriaga, minchan,
	nadav.amit, mhocko
  Cc: linux-mm, linux-kernel, Charan Teja Kalla, # 5 . 10+

The process_madvise() system call is expected to skip holes in the VMAs
passed through the 'struct iovec' vector list. But do_madvise(), which
process_madvise() calls for each VMA, returns ENOMEM when it encounters
unmapped holes, even though the VMA has been processed.
process_madvise() should therefore treat ENOMEM as expected, consider
the VMA processed, and continue with the other VMAs in the vector list.
If -ENOMEM is returned to the user even though the VMA was processed,
the user cannot figure out where to start the next madvise.

Fixes: ecb8ac8b1f14("mm/madvise: introduce process_madvise() syscall: an external memory hinting API")
Cc: <stable@vger.kernel.org> # 5.10+
Signed-off-by: Charan Teja Kalla <quic_charante@quicinc.com>
---
Changes in V2:
  -- Fixed handling of ENOMEM by process_madvise().
  -- Patch doesn't exist in V1.

 mm/madvise.c | 9 ++++++++-
 1 file changed, 8 insertions(+), 1 deletion(-)

diff --git a/mm/madvise.c b/mm/madvise.c
index e97e6a9..14fb76d 100644
--- a/mm/madvise.c
+++ b/mm/madvise.c
@@ -1426,9 +1426,16 @@ SYSCALL_DEFINE5(process_madvise, int, pidfd, const struct iovec __user *, vec,
 
 	while (iov_iter_count(&iter)) {
 		iovec = iov_iter_iovec(&iter);
+		/*
+		 * do_madvise returns ENOMEM if unmapped holes are present
+		 * in the passed VMA. process_madvise() is expected to skip
+		 * unmapped holes passed to it in the 'struct iovec' list
+		 * and not fail because of them. Thus treat -ENOMEM return
+		 * from do_madvise as valid and continue processing.
+		 */
 		ret = do_madvise(mm, (unsigned long)iovec.iov_base,
 					iovec.iov_len, behavior);
-		if (ret < 0)
+		if (ret < 0 && ret != -ENOMEM)
 			break;
 		iov_iter_advance(&iter, iovec.iov_len);
 	}
-- 
2.7.4
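
The combined effect of the two patches in this series on the return value can be modelled in userspace. The sketch below is an assumption-laden model, not the kernel code: advise_fn and demo_advise() are hypothetical stand-ins for do_madvise(), and demo_advise() simply treats a NULL base as an unmapped hole and a length that is not a multiple of 4096 as an invalid range.

```c
#include <errno.h>
#include <stddef.h>
#include <sys/types.h>
#include <sys/uio.h>

/* Hypothetical stand-in for do_madvise(): 0 on success, -errno on failure. */
typedef int (*advise_fn)(void *start, size_t len);

/* Demo backend (assumption): NULL base models a hole, odd length is invalid. */
static int demo_advise(void *start, size_t len)
{
	if (len % 4096 != 0)
		return -EINVAL;
	if (start == NULL)
		return -ENOMEM;
	return 0;
}

/* Model of the process_madvise() loop with both patches applied. */
static ssize_t model_process_madvise(const struct iovec *vec, size_t count,
				     advise_fn advise)
{
	size_t processed = 0;
	int ret = 0;
	size_t i;

	for (i = 0; i < count; i++) {
		ret = advise(vec[i].iov_base, vec[i].iov_len);
		/* Patch 2/2: -ENOMEM (unmapped hole) does not stop the walk. */
		if (ret < 0 && ret != -ENOMEM)
			break;
		processed += vec[i].iov_len;
	}

	/* Patch 1/2: a nonzero byte count wins over the error code. */
	return processed ? (ssize_t)processed : ret;
}
```

In this model an element that hits -ENOMEM is still counted as processed, so a vector whose middle element is a hole reports the full length of all three elements.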



* Re: [PATCH V2,0/2]mm: madvise: return correct bytes processed with process_madvise
  2022-03-11 15:29 [PATCH V2,0/2]mm: madvise: return correct bytes processed with process_madvise Charan Teja Kalla
  2022-03-11 15:29 ` [PATCH V2,1/2] mm: madvise: return correct bytes advised " Charan Teja Kalla
  2022-03-11 15:29 ` [PATCH V2,2/2] mm: madvise: skip unmapped vma holes passed to process_madvise Charan Teja Kalla
@ 2022-03-11 21:42 ` Andrew Morton
  2022-03-15 14:26   ` Charan Teja Kalla
  2 siblings, 1 reply; 23+ messages in thread
From: Andrew Morton @ 2022-03-11 21:42 UTC (permalink / raw)
  To: Charan Teja Kalla
  Cc: surenb, vbabka, rientjes, sfr, edgararriaga, minchan, nadav.amit,
	mhocko, linux-mm, linux-kernel

On Fri, 11 Mar 2022 20:59:04 +0530 Charan Teja Kalla <quic_charante@quicinc.com> wrote:

> With the process_madvise(), always choose to return non zero processed
> bytes over an error. This can help the user to know on which VMA, passed
> in the 'struct iovec' vector list, is failed to advise thus can take the
> decission of retrying/skipping on that VMA.

Thanks, this is good.

We should have added userspace tests for process_madvise() along with
the syscall itself.  But evidently that was omitted.  If someone
decides to contribute such tests, hopefully they will include checks
for these return values.



* Re: [PATCH V2,0/2]mm: madvise: return correct bytes processed with process_madvise
  2022-03-11 21:42 ` [PATCH V2,0/2]mm: madvise: return correct bytes processed with process_madvise Andrew Morton
@ 2022-03-15 14:26   ` Charan Teja Kalla
  0 siblings, 0 replies; 23+ messages in thread
From: Charan Teja Kalla @ 2022-03-15 14:26 UTC (permalink / raw)
  To: Andrew Morton
  Cc: surenb, vbabka, rientjes, sfr, edgararriaga, minchan, nadav.amit,
	mhocko, linux-mm, linux-kernel

Thanks Andrew!!

On 3/12/2022 3:12 AM, Andrew Morton wrote:
>> With the process_madvise(), always choose to return non zero processed
>> bytes over an error. This can help the user to know on which VMA, passed
>> in the 'struct iovec' vector list, is failed to advise thus can take the
>> decission of retrying/skipping on that VMA.
> Thanks, this is good.
> 
> We should have added userspace tests for process_madvise() along with
> the syscall itself.  But evidently that was omitted.  If someone
> decides to contribute such tests, hopefully they will include checks
> for these return values.

We are happy to contribute here.

> 


* Re: [PATCH V2,1/2] mm: madvise: return correct bytes advised with process_madvise
  2022-03-11 15:29 ` [PATCH V2,1/2] mm: madvise: return correct bytes advised " Charan Teja Kalla
@ 2022-03-15 22:20   ` Minchan Kim
  2022-03-21 15:18   ` Michal Hocko
  1 sibling, 0 replies; 23+ messages in thread
From: Minchan Kim @ 2022-03-15 22:20 UTC (permalink / raw)
  To: Charan Teja Kalla
  Cc: akpm, surenb, vbabka, rientjes, sfr, edgararriaga, nadav.amit,
	mhocko, linux-mm, linux-kernel, # 5 . 10+

On Fri, Mar 11, 2022 at 08:59:05PM +0530, Charan Teja Kalla wrote:
> The process_madvise() system call returns error even after processing
> some VMA's passed in the 'struct iovec' vector list which leaves the
> user confused to know where to restart the advise next. It is also
> against this syscall man page[1] documentation where it mentions that
> "return value may be less than the total number of requested bytes, if
> an error occurred after some iovec elements were already processed.".
> 
> Consider a user passed 10 VMA's in the 'struct iovec' vector list of
> which 9 are processed but one. Then it just returns the error caused on
> that failed VMA despite the first 9 VMA's processed, leaving the user
> confused about on which VMA it is failed. Returning the number of bytes
> processed here can help the user to know which VMA it is failed on and
> thus can retry/skip the advise on that VMA.
> 
> [1]https://man7.org/linux/man-pages/man2/process_madvise.2.html.
> 
> Fixes: ecb8ac8b1f14("mm/madvise: introduce process_madvise() syscall: an external memory hinting API")
> Cc: <stable@vger.kernel.org> # 5.10+
> Signed-off-by: Charan Teja Kalla <quic_charante@quicinc.com>
Acked-by: Minchan Kim <minchan@kernel.org>


* Re: [PATCH V2,2/2] mm: madvise: skip unmapped vma holes passed to process_madvise
  2022-03-11 15:29 ` [PATCH V2,2/2] mm: madvise: skip unmapped vma holes passed to process_madvise Charan Teja Kalla
@ 2022-03-15 22:58   ` Minchan Kim
  2022-03-15 23:48     ` Andrew Morton
  2022-03-21 15:34   ` Michal Hocko
  1 sibling, 1 reply; 23+ messages in thread
From: Minchan Kim @ 2022-03-15 22:58 UTC (permalink / raw)
  To: Charan Teja Kalla
  Cc: akpm, surenb, vbabka, rientjes, sfr, edgararriaga, nadav.amit,
	mhocko, linux-mm, linux-kernel, # 5 . 10+

On Fri, Mar 11, 2022 at 08:59:06PM +0530, Charan Teja Kalla wrote:
> The process_madvise() system call is expected to skip holes in vma
> passed through 'struct iovec' vector list. But do_madvise, which
> process_madvise() calls for each vma, returns ENOMEM in case of unmapped
> holes, despite the VMA is processed.
> Thus process_madvise() should treat ENOMEM as expected and consider the
> VMA passed to as processed and continue processing other vma's in the
> vector list. Returning -ENOMEM to user, despite the VMA is processed,
> will be unable to figure out where to start the next madvise.
> Fixes: ecb8ac8b1f14("mm/madvise: introduce process_madvise() syscall: an external memory hinting API")
> Cc: <stable@vger.kernel.org> # 5.10+

Hmm, I'm not sure this is stable material, since it changes the
semantics of the API. It would be better to change the semantics from
5.19, with a man page update that documents the change.


> Signed-off-by: Charan Teja Kalla <quic_charante@quicinc.com>
> ---
> Changes in V2:
>   -- Fixed handling of ENOMEM by process_madvise().
>   -- Patch doesn't exist in V1.
> 
>  mm/madvise.c | 9 ++++++++-
>  1 file changed, 8 insertions(+), 1 deletion(-)
> 
> diff --git a/mm/madvise.c b/mm/madvise.c
> index e97e6a9..14fb76d 100644
> --- a/mm/madvise.c
> +++ b/mm/madvise.c
> @@ -1426,9 +1426,16 @@ SYSCALL_DEFINE5(process_madvise, int, pidfd, const struct iovec __user *, vec,
>  
>  	while (iov_iter_count(&iter)) {
>  		iovec = iov_iter_iovec(&iter);
> +		/*
> +		 * do_madvise returns ENOMEM if unmapped holes are present
> +		 * in the passed VMA. process_madvise() is expected to skip
> +		 * unmapped holes passed to it in the 'struct iovec' list
> +		 * and not fail because of them. Thus treat -ENOMEM return
> +		 * from do_madvise as valid and continue processing.
> +		 */
>  		ret = do_madvise(mm, (unsigned long)iovec.iov_base,
>  					iovec.iov_len, behavior);
> -		if (ret < 0)
> +		if (ret < 0 && ret != -ENOMEM)
>  			break;
>  		iov_iter_advance(&iter, iovec.iov_len);
>  	}
> -- 
> 2.7.4
> 


* Re: [PATCH V2,2/2] mm: madvise: skip unmapped vma holes passed to process_madvise
  2022-03-15 22:58   ` Minchan Kim
@ 2022-03-15 23:48     ` Andrew Morton
  2022-03-16  1:43       ` Minchan Kim
  0 siblings, 1 reply; 23+ messages in thread
From: Andrew Morton @ 2022-03-15 23:48 UTC (permalink / raw)
  To: Minchan Kim
  Cc: Charan Teja Kalla, surenb, vbabka, rientjes, sfr, edgararriaga,
	nadav.amit, mhocko, linux-mm, linux-kernel, # 5 . 10+

On Tue, 15 Mar 2022 15:58:28 -0700 Minchan Kim <minchan@kernel.org> wrote:

> On Fri, Mar 11, 2022 at 08:59:06PM +0530, Charan Teja Kalla wrote:
> > The process_madvise() system call is expected to skip holes in vma
> > passed through 'struct iovec' vector list. But do_madvise, which
> > process_madvise() calls for each vma, returns ENOMEM in case of unmapped
> > holes, despite the VMA is processed.
> > Thus process_madvise() should treat ENOMEM as expected and consider the
> > VMA passed to as processed and continue processing other vma's in the
> > vector list. Returning -ENOMEM to user, despite the VMA is processed,
> > will be unable to figure out where to start the next madvise.
> > Fixes: ecb8ac8b1f14("mm/madvise: introduce process_madvise() syscall: an external memory hinting API")
> > Cc: <stable@vger.kernel.org> # 5.10+
> 
> Hmm, not sure whether it's stable material since it changes semantic of
> API. It would be better to change the semantic from 5.19 with man page
> update to specify the change.

It's a very desirable change and it makes the code match the manpage
and it's cc:stable.  I think we should just absorb any transitory
damage which this causes people.  I doubt if there will be much - if
anyone was affected by this they would have already told us that it's
broken?




* Re: [PATCH V2,2/2] mm: madvise: skip unmapped vma holes passed to process_madvise
  2022-03-15 23:48     ` Andrew Morton
@ 2022-03-16  1:43       ` Minchan Kim
  2022-03-16 14:19         ` Charan Teja Kalla
  0 siblings, 1 reply; 23+ messages in thread
From: Minchan Kim @ 2022-03-16  1:43 UTC (permalink / raw)
  To: Andrew Morton
  Cc: Charan Teja Kalla, surenb, vbabka, rientjes, sfr, edgararriaga,
	nadav.amit, mhocko, linux-mm, linux-kernel, # 5 . 10+

On Tue, Mar 15, 2022 at 04:48:07PM -0700, Andrew Morton wrote:
> On Tue, 15 Mar 2022 15:58:28 -0700 Minchan Kim <minchan@kernel.org> wrote:
> 
> > On Fri, Mar 11, 2022 at 08:59:06PM +0530, Charan Teja Kalla wrote:
> > > The process_madvise() system call is expected to skip holes in vma
> > > passed through 'struct iovec' vector list. But do_madvise, which
> > > process_madvise() calls for each vma, returns ENOMEM in case of unmapped
> > > holes, despite the VMA is processed.
> > > Thus process_madvise() should treat ENOMEM as expected and consider the
> > > VMA passed to as processed and continue processing other vma's in the
> > > vector list. Returning -ENOMEM to user, despite the VMA is processed,
> > > will be unable to figure out where to start the next madvise.
> > > Fixes: ecb8ac8b1f14("mm/madvise: introduce process_madvise() syscall: an external memory hinting API")
> > > Cc: <stable@vger.kernel.org> # 5.10+
> > 
> > Hmm, not sure whether it's stable material since it changes semantic of
> > API. It would be better to change the semantic from 5.19 with man page
> > update to specify the change.
> 
> It's a very desirable change and it makes the code match the manpage
> and it's cc:stable.  I think we should just absorb any transitory
> damage which this causes people.  I doubt if there will be much - if
> anyone was affected by this they would have already told us that it's
> broken?


process_madvise fails to return the exact number of processed bytes in
several cases when it encounters an error such as -EINVAL, -EINTR or
-ENOMEM in the middle of processing VMAs. And now we are trying to make
an exception only for holes? IMO, it's worth noting in the man page.

In addition, this change returns a positive processed-bytes count even
though nothing was processed if it couldn't find any VMA in the first
iteration of madvise_walk_vmas.


* Re: [PATCH V2,2/2] mm: madvise: skip unmapped vma holes passed to process_madvise
  2022-03-16  1:43       ` Minchan Kim
@ 2022-03-16 14:19         ` Charan Teja Kalla
  2022-03-16 21:29           ` Andrew Morton
                             ` (2 more replies)
  0 siblings, 3 replies; 23+ messages in thread
From: Charan Teja Kalla @ 2022-03-16 14:19 UTC (permalink / raw)
  To: Minchan Kim, Andrew Morton
  Cc: surenb, vbabka, rientjes, sfr, edgararriaga, nadav.amit, mhocko,
	linux-mm, linux-kernel, # 5 . 10+

Thanks Andrew and Minchan.

On 3/16/2022 7:13 AM, Minchan Kim wrote:
> On Tue, Mar 15, 2022 at 04:48:07PM -0700, Andrew Morton wrote:
>> On Tue, 15 Mar 2022 15:58:28 -0700 Minchan Kim <minchan@kernel.org> wrote:
>>
>>> On Fri, Mar 11, 2022 at 08:59:06PM +0530, Charan Teja Kalla wrote:
>>>> The process_madvise() system call is expected to skip holes in vma
>>>> passed through 'struct iovec' vector list. But do_madvise, which
>>>> process_madvise() calls for each vma, returns ENOMEM in case of unmapped
>>>> holes, despite the VMA is processed.
>>>> Thus process_madvise() should treat ENOMEM as expected and consider the
>>>> VMA passed to as processed and continue processing other vma's in the
>>>> vector list. Returning -ENOMEM to user, despite the VMA is processed,
>>>> will be unable to figure out where to start the next madvise.
>>>> Fixes: ecb8ac8b1f14("mm/madvise: introduce process_madvise() syscall: an external memory hinting API")
>>>> Cc: <stable@vger.kernel.org> # 5.10+
>>>
>>> Hmm, not sure whether it's stable material since it changes semantic of
>>> API. It would be better to change the semantic from 5.19 with man page
>>> update to specify the change.
>>
>> It's a very desirable change and it makes the code match the manpage
>> and it's cc:stable.  I think we should just absorb any transitory
>> damage which this causes people.  I doubt if there will be much - if
>> anyone was affected by this they would have already told us that it's
>> broken?
> 
> 
> process_madvise fails to return exact processed bytes at several cases
> if it encounters the error, such as, -EINVAL, -EINTR, -ENOMEM in the
> middle of processing vmas. And now we are trying to make exception for
> change for only hole?
I think EINTR can never be returned in the middle of processing VMAs for
the behaviours supported by process_madvise().

It can return EINTR when:
-------------------------
1) PTRACE_MODE_READ is being checked in mm_access(), which waits on
task->signal->exec_update_lock. An EINTR returned from here guarantees
that process_madvise() didn't even start processing.
https://elixir.bootlin.com/linux/v5.16.14/source/mm/madvise.c#L1264 -->
https://elixir.bootlin.com/linux/v5.16.14/source/kernel/fork.c#L1318

2) process_madvise() has started processing VMAs, but the requested
behavior on a VMA needs mmap_write_lock_killable(), which can return
EINTR. The behaviours currently supported by process_madvise(),
MADV_COLD, MADV_PAGEOUT and MADV_WILLNEED, need only the read lock here.
https://elixir.bootlin.com/linux/v5.16.14/source/mm/madvise.c#L1164
 **Thus I think there is no way for EINTR to be returned by
process_madvise() in the middle of processing.** No?

for EINVAL:
-----------
The only case I can think of where EINVAL can be returned in the middle
of processing is when, for example, the given range contains VMAs with a
hole in between and one of the VMAs contains pages that fail the
can_madv_lru_vma() condition.
So it's a limitation that this returns -EINVAL even though some bytes
are processed.
	OR
Since some of the bytes passed are still invalid, is it valid to return
-EINVAL here, and the user has to check the address range sent?

for ENOMEM:
----------
ENOMEM is still returned even though the complete range is processed.
IMO, this shouldn't be treated as an error, which is what the patch
targets. Then there is the limitation you mentioned below, where it
returns a positive processed-bytes count even though it didn't process
anything because it couldn't find any VMA in the first iteration of
madvise_walk_vmas.

I think the above limitations with EINVAL and ENOMEM arise because we
rely on do_madvise(), which the madvise() call uses to process a single
VMA. When a system call offers a 'struct iovec' vector interface, the
caller expects it to return the correct number of bytes processed, so
that the user can make the correct decisions. Please correct me if I am
wrong here.

So, should we add a new function, say do_process_madvise(), which takes
care of the above limitations? Or are there any alternative suggestions?

> IMO, it's worth to note in man page.
> 

Or is the current patch for just ENOMEM sufficient here, so that we only
have to update the man page?

> In addition, this change returns positive processes bytes even though
> it didn't process anything if it couldn't find any vma for the first
> iteration in madvise_walk_vmas.

Thanks,
Charan



* Re: [PATCH V2,2/2] mm: madvise: skip unmapped vma holes passed to process_madvise
  2022-03-16 14:19         ` Charan Teja Kalla
@ 2022-03-16 21:29           ` Andrew Morton
  2022-03-17 16:28             ` Minchan Kim
  2022-03-17 16:24           ` Minchan Kim
  2022-03-21 15:02           ` Michal Hocko
  2 siblings, 1 reply; 23+ messages in thread
From: Andrew Morton @ 2022-03-16 21:29 UTC (permalink / raw)
  To: Charan Teja Kalla
  Cc: Minchan Kim, surenb, vbabka, rientjes, sfr, edgararriaga,
	nadav.amit, mhocko, linux-mm, linux-kernel, # 5 . 10+

On Wed, 16 Mar 2022 19:49:38 +0530 Charan Teja Kalla <quic_charante@quicinc.com> wrote:

> > IMO, it's worth to note in man page.
> > 
> 
> Or the current patch for just ENOMEM is sufficient here and we just have
> to update the man page?

I think the "On success, process_madvise() returns the number of bytes
advised" behaviour sounds useful.  But madvise() doesn't do that.

RETURN VALUE
       On  success, madvise() returns zero.  On error, it returns -1 and errno
       is set to indicate the error.

So why is it desirable in the case of process_madvise()?



And why was process_madvise() designed this way?   Or was it
always simply an error in the manpage?


* Re: [PATCH V2,2/2] mm: madvise: skip unmapped vma holes passed to process_madvise
  2022-03-16 14:19         ` Charan Teja Kalla
  2022-03-16 21:29           ` Andrew Morton
@ 2022-03-17 16:24           ` Minchan Kim
  2022-03-21 15:02           ` Michal Hocko
  2 siblings, 0 replies; 23+ messages in thread
From: Minchan Kim @ 2022-03-17 16:24 UTC (permalink / raw)
  To: Charan Teja Kalla
  Cc: Andrew Morton, surenb, vbabka, rientjes, sfr, edgararriaga,
	nadav.amit, mhocko, linux-mm, linux-kernel, # 5 . 10+

On Wed, Mar 16, 2022 at 07:49:38PM +0530, Charan Teja Kalla wrote:
> Thanks Andrew and Minchan.
> 
> On 3/16/2022 7:13 AM, Minchan Kim wrote:
> > On Tue, Mar 15, 2022 at 04:48:07PM -0700, Andrew Morton wrote:
> >> On Tue, 15 Mar 2022 15:58:28 -0700 Minchan Kim <minchan@kernel.org> wrote:
> >>
> >>> On Fri, Mar 11, 2022 at 08:59:06PM +0530, Charan Teja Kalla wrote:
> >>>> The process_madvise() system call is expected to skip holes in vma
> >>>> passed through 'struct iovec' vector list. But do_madvise, which
> >>>> process_madvise() calls for each vma, returns ENOMEM in case of unmapped
> >>>> holes, despite the VMA is processed.
> >>>> Thus process_madvise() should treat ENOMEM as expected and consider the
> >>>> VMA passed to as processed and continue processing other vma's in the
> >>>> vector list. Returning -ENOMEM to user, despite the VMA is processed,
> >>>> will be unable to figure out where to start the next madvise.
> >>>> Fixes: ecb8ac8b1f14("mm/madvise: introduce process_madvise() syscall: an external memory hinting API")
> >>>> Cc: <stable@vger.kernel.org> # 5.10+
> >>>
> >>> Hmm, not sure whether it's stable material since it changes semantic of
> >>> API. It would be better to change the semantic from 5.19 with man page
> >>> update to specify the change.
> >>
> >> It's a very desirable change and it makes the code match the manpage
> >> and it's cc:stable.  I think we should just absorb any transitory
> >> damage which this causes people.  I doubt if there will be much - if
> >> anyone was affected by this they would have already told us that it's
> >> broken?
> > 
> > 
> > process_madvise fails to return exact processed bytes at several cases
> > if it encounters the error, such as, -EINVAL, -EINTR, -ENOMEM in the
> > middle of processing vmas. And now we are trying to make exception for
> > change for only hole?
> I think EINTR will never return in the middle of processing VMA's for
> the behaviours supported by process_madvise().
> 
> It can return EINTR when:
> -------------------------
> 1) PTRACE_MODE_READ is being checked in mm_access() where it is waiting
> on task->signal->exec_update_lock. EINTR returned from here guarantees
> that process_madvise() didn't event start processing.
> https://elixir.bootlin.com/linux/v5.16.14/source/mm/madvise.c#L1264 -->
> https://elixir.bootlin.com/linux/v5.16.14/source/kernel/fork.c#L1318
> 
> 2) The process_madvise() started processing VMA's but the required
> behavior on a VMA needs mmap_write_lock_killable(), from where EINTR is
> returned. The current behaviours supported by process_madvise(),
> MADV_COLD, PAGEOUT, WILLNEED, just need read lock here.
> https://elixir.bootlin.com/linux/v5.16.14/source/mm/madvise.c#L1164
>  **Thus I think no way for EINTR can be returned by process_madvise() in
> the middle of processing.** . No?
> 
> for EINVAL:
> -----------
> The only case, I can think of,  where EINVAL can be returned in the
> middle of processing is in examples like, given range contains VMA's
> with a hole in between and one of the VMA contains the pages that fails
> can_madv_lru_vma() condition.
> So, it's a limitation that this returns -EINVAL though some bytes are
> processed.
> 	OR
> Since there exists still some invalid bytes processed it is valid to
> return -EINVAL here and user has to check the address range sent?
> 
> for ENOMEM:
> ----------
> Though complete range is processed still returns ENOMEM. IMO, This
> shouldn't be treated as error which the patch is targeted for. Then
> there is limitation case that you mentioned below where it returns
> positive processes bytes even though it didn't process anything if it
> couldn't find any vma for the first iteration in madvise_walk_vmas
> 
> I think the above limitations with EINVAL and ENOMEM are arising because
> we are relying on do_madvise() functionality which madvise() call uses
> to process a single VMA. When 'struct iovec' vector processing interface
> is given in a system call, it is the expectation by the caller that this
> system call should return the correct bytes processed to help the user
> to take the correct decisions. Please correct me If i am wrong here.
> 
> So, should we add the new function say do_process_madvise(), which take
> cares of above limitations? or any alternative suggestions here please?

What I am thinking now is that process_madvise needs its own iterator
(i.e., do_process_madvise) and that it should report the exact bytes it
addressed for exact ranges, like process_vm_readv/writev. Providing
valid ranges is the user's responsibility.

> 
> > IMO, it's worth to note in man page.
> > 
> 
> Or the current patch for just ENOMEM is sufficient here and we just have
> to update the man page?
> 
> > In addition, this change returns positive processes bytes even though
> > it didn't process anything if it couldn't find any vma for the first
> > iteration in madvise_walk_vmas.
> 
> Thanks,
> Charan
> 


* Re: [PATCH V2,2/2] mm: madvise: skip unmapped vma holes passed to process_madvise
  2022-03-16 21:29           ` Andrew Morton
@ 2022-03-17 16:28             ` Minchan Kim
  2022-03-17 16:53               ` Suren Baghdasaryan
  0 siblings, 1 reply; 23+ messages in thread
From: Minchan Kim @ 2022-03-17 16:28 UTC (permalink / raw)
  To: Andrew Morton
  Cc: Charan Teja Kalla, surenb, vbabka, rientjes, sfr, edgararriaga,
	nadav.amit, mhocko, linux-mm, linux-kernel, # 5 . 10+

On Wed, Mar 16, 2022 at 02:29:06PM -0700, Andrew Morton wrote:
> On Wed, 16 Mar 2022 19:49:38 +0530 Charan Teja Kalla <quic_charante@quicinc.com> wrote:
> 
> > > IMO, it's worth to note in man page.
> > > 
> > 
> > Or the current patch for just ENOMEM is sufficient here and we just have
> > to update the man page?
> 
> I think the "On success, process_madvise() returns the number of bytes
> advised" behaviour sounds useful.  But madvise() doesn't do that.
> 
> RETURN VALUE
>        On  success, madvise() returns zero.  On error, it returns -1 and errno
>        is set to indicate the error.
> 
> So why is it desirable in the case of process_madvise()?

Since process_madvise deals with multiple ranges and could fail at one
of them in the middle of processing, people could determine where the
call failed and then choose a strategy: abort at that point or continue
hinting the next addresses. The problem with this strategy is that the
API doesn't return any error value if it has processed any bytes, so
they are limited in deciding a policy. That's a limitation of every
vectored I/O syscall, unfortunately.
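
The "continue to hint next addresses" strategy described above can be sketched as a caller-side loop. This is only an illustration under stated assumptions: vec_advise_fn is a hypothetical wrapper type that returns the bytes advised or a negative errno, a short result is taken to mean that the next element failed, and demo_vec_advise() is a made-up backend that fails on any element whose base is NULL. In a real program the callback would presumably wrap the process_madvise() syscall.

```c
#include <errno.h>
#include <stddef.h>
#include <sys/types.h>
#include <sys/uio.h>

/* Hypothetical wrapper type: bytes advised on progress, -errno otherwise. */
typedef ssize_t (*vec_advise_fn)(const struct iovec *vec, size_t count);

/* Demo backend (assumption): an element with a NULL base cannot be advised. */
static ssize_t demo_vec_advise(const struct iovec *vec, size_t count)
{
	size_t processed = 0;
	size_t i;

	for (i = 0; i < count; i++) {
		if (vec[i].iov_base == NULL)
			return processed ? (ssize_t)processed : -EINVAL;
		processed += vec[i].iov_len;
	}
	return (ssize_t)processed;
}

/*
 * Advise every range, skipping each element that fails.
 * Returns the number of elements that were skipped.
 */
static size_t advise_skipping_failures(const struct iovec *vec, size_t count,
				       vec_advise_fn advise)
{
	size_t skipped = 0;
	size_t i = 0;

	while (i < count) {
		ssize_t done = advise(&vec[i], count - i);
		size_t remaining = done > 0 ? (size_t)done : 0;

		/* Step over every element the call fully covered. */
		while (i < count && remaining >= vec[i].iov_len) {
			remaining -= vec[i].iov_len;
			i++;
		}
		/* A short or negative result means element i failed: skip it. */
		if (i < count) {
			skipped++;
			i++;
		}
	}
	return skipped;
}
```

The loop cannot tell why an element failed once bytes have been reported, which is the policy limitation the message above points out: the error code is only visible when nothing was processed.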

> 
> 
> 
> And why was process_madvise() designed this way?   Or was it
> always simply an error in the manpage?


* Re: [PATCH V2,2/2] mm: madvise: skip unmapped vma holes passed to process_madvise
  2022-03-17 16:28             ` Minchan Kim
@ 2022-03-17 16:53               ` Suren Baghdasaryan
  2022-03-17 20:38                 ` Nadav Amit
  0 siblings, 1 reply; 23+ messages in thread
From: Suren Baghdasaryan @ 2022-03-17 16:53 UTC (permalink / raw)
  To: Minchan Kim
  Cc: Andrew Morton, Charan Teja Kalla, Vlastimil Babka,
	David Rientjes, Stephen Rothwell, Edgar Arriaga García,
	nadav.amit, Michal Hocko, linux-mm, LKML, # 5 . 10+

On Thu, Mar 17, 2022 at 9:28 AM Minchan Kim <minchan@kernel.org> wrote:
>
> On Wed, Mar 16, 2022 at 02:29:06PM -0700, Andrew Morton wrote:
> > On Wed, 16 Mar 2022 19:49:38 +0530 Charan Teja Kalla <quic_charante@quicinc.com> wrote:
> >
> > > > IMO, it's worth to note in man page.
> > > >
> > >
> > > Or the current patch for just ENOMEM is sufficient here and we just have
> > > to update the man page?
> >
> > I think the "On success, process_madvise() returns the number of bytes
> > advised" behaviour sounds useful.  But madvise() doesn't do that.
> >
> > RETURN VALUE
> >        On  success, madvise() returns zero.  On error, it returns -1 and errno
> >        is set to indicate the error.
> >
> > So why is it desirable in the case of process_madvise()?
>
> Since process_madvise() deals with multiple ranges and could fail at one
> of them in the middle of processing, people could determine where the
> call failed and then decide whether to abort at that point or continue
> hinting the next addresses. The problem with that strategy is that the
> API doesn't return any error value if it has processed any bytes, so
> callers are limited in choosing a policy. That's the limitation of
> every vectored I/O syscall, unfortunately.
>
> >
> >
> >
> > And why was process_madvise() designed this way?   Or was it
> > always simply an error in the manpage?

Taking a closer look, indeed manpage seems to be wrong.
https://elixir.bootlin.com/linux/v5.17-rc8/source/mm/madvise.c#L1154
indicates that in the presence of unmapped holes madvise will skip
them but will return ENOMEM and that's what process_madvise is
ultimately returning in this case. So, the manpage claim of "This
return value may be less than the total number of requested bytes, if
an error occurred after some iovec elements were already processed."
does not reflect the reality in our case because the return value will
be -ENOMEM. After the desired behavior is finalized I'll modify the
manpage accordingly.


* Re: [PATCH V2,2/2] mm: madvise: skip unmapped vma holes passed to process_madvise
  2022-03-17 16:53               ` Suren Baghdasaryan
@ 2022-03-17 20:38                 ` Nadav Amit
  2022-03-18 14:05                   ` Charan Teja Kalla
  0 siblings, 1 reply; 23+ messages in thread
From: Nadav Amit @ 2022-03-17 20:38 UTC (permalink / raw)
  To: Suren Baghdasaryan
  Cc: Minchan Kim, Andrew Morton, Charan Teja Kalla, Vlastimil Babka,
	David Rientjes, Stephen Rothwell, Edgar Arriaga García,
	Michal Hocko, linux-mm, LKML, # 5 . 10+



> On Mar 17, 2022, at 9:53 AM, Suren Baghdasaryan <surenb@google.com> wrote:
> 
> On Thu, Mar 17, 2022 at 9:28 AM Minchan Kim <minchan@kernel.org> wrote:
>> 
>> On Wed, Mar 16, 2022 at 02:29:06PM -0700, Andrew Morton wrote:
>>> On Wed, 16 Mar 2022 19:49:38 +0530 Charan Teja Kalla <quic_charante@quicinc.com> wrote:
>>> 
>>>>> IMO, it's worth to note in man page.
>>>>> 
>>>> 
>>>> Or the current patch for just ENOMEM is sufficient here and we just have
>>>> to update the man page?
>>> 
>>> I think the "On success, process_madvise() returns the number of bytes
>>> advised" behaviour sounds useful.  But madvise() doesn't do that.
>>> 
>>> RETURN VALUE
>>>       On  success, madvise() returns zero.  On error, it returns -1 and errno
>>>       is set to indicate the error.
>>> 
>>> So why is it desirable in the case of process_madvise()?
>> 
>> Since process_madvise() deals with multiple ranges and could fail at one
>> of them in the middle of processing, people could determine where the
>> call failed and then decide whether to abort at that point or continue
>> hinting the next addresses. The problem with that strategy is that the
>> API doesn't return any error value if it has processed any bytes, so
>> callers are limited in choosing a policy. That's the limitation of
>> every vectored I/O syscall, unfortunately.
>> 
>>> 
>>> 
>>> 
>>> And why was process_madvise() designed this way?   Or was it
>>> always simply an error in the manpage?
> 
> Taking a closer look, indeed manpage seems to be wrong.
> https://elixir.bootlin.com/linux/v5.17-rc8/source/mm/madvise.c#L1154
> indicates that in the presence of unmapped holes madvise will skip
> them but will return ENOMEM and that's what process_madvise is
> ultimately returning in this case. So, the manpage claim of "This
> return value may be less than the total number of requested bytes, if
> an error occurred after some iovec elements were already processed."
> does not reflect the reality in our case because the return value will
> be -ENOMEM. After the desired behavior is finalized I'll modify the
> manpage accordingly.

Since process_madvise() might be used in sort of non-cooperative mode,
I think that the caller cannot guarantee that it knows exactly the
memory layout of the process whose memory it madvise’s. I know that
MADV_DONTNEED for instance is not supported (at least today) by
process_madvise(), but if it were, the caller may want to know which
exact memory was madvise'd even if the target process ran some other
memory layout changing syscalls (e.g., munmap()).

IOW, skipping holes and just returning the total number of madvise’d
bytes might not be enough.



* Re: [PATCH V2,2/2] mm: madvise: skip unmapped vma holes passed to process_madvise
  2022-03-17 20:38                 ` Nadav Amit
@ 2022-03-18 14:05                   ` Charan Teja Kalla
  2022-03-18 15:37                     ` Minchan Kim
  0 siblings, 1 reply; 23+ messages in thread
From: Charan Teja Kalla @ 2022-03-18 14:05 UTC (permalink / raw)
  To: Nadav Amit, Suren Baghdasaryan
  Cc: Minchan Kim, Andrew Morton, Vlastimil Babka, David Rientjes,
	Stephen Rothwell, Edgar Arriaga García, Michal Hocko,
	linux-mm, LKML, # 5 . 10+

Thank you for valuable inputs.

On 3/18/2022 2:08 AM, Nadav Amit wrote:
>>>>>> IMO, it's worth to note in man page.
>>>>>>
>>>>> Or the current patch for just ENOMEM is sufficient here and we just have
>>>>> to update the man page?
>>>> I think the "On success, process_madvise() returns the number of bytes
>>>> advised" behaviour sounds useful.  But madvise() doesn't do that.
>>>>
>>>> RETURN VALUE
>>>>       On  success, madvise() returns zero.  On error, it returns -1 and errno
>>>>       is set to indicate the error.
>>>>
>>>> So why is it desirable in the case of process_madvise()?
>>> Since process_madvise() deals with multiple ranges and could fail at one
>>> of them in the middle of processing, people could determine where the
>>> call failed and then decide whether to abort at that point or continue
>>> hinting the next addresses. The problem with that strategy is that the
>>> API doesn't return any error value if it has processed any bytes, so
>>> callers are limited in choosing a policy. That's the limitation of
>>> every vectored I/O syscall, unfortunately.
>>>
>>>>
>>>>
>>>> And why was process_madvise() designed this way?   Or was it
>>>> always simply an error in the manpage?
>> Taking a closer look, indeed manpage seems to be wrong.
>> https://elixir.bootlin.com/linux/v5.17-rc8/source/mm/madvise.c#L1154
>> indicates that in the presence of unmapped holes madvise will skip
>> them but will return ENOMEM and that's what process_madvise is
>> ultimately returning in this case. So, the manpage claim of "This
>> return value may be less than the total number of requested bytes, if
>> an error occurred after some iovec elements were already processed."
>> does not reflect the reality in our case because the return value will
>> be -ENOMEM. After the desired behavior is finalized I'll modify the
>> manpage accordingly.
> Since process_madvise() might be used in sort of non-cooperative mode,
> I think that the caller cannot guarantee that it knows exactly the
> memory layout of the process whose memory it madvise’s. I know that
> MADV_DONTNEED for instance is not supported (at least today) by
> process_madvise(), but if it were, the caller may want to know which
> exact memory was madvise'd even if the target process ran some other
> memory layout changing syscalls (e.g., munmap()).
> 
> IOW, skipping holes and just returning the total number of madvise’d
> bytes might not be enough.

Then is it a correct design for the advised byte range to include holes
by default?
Say the [start, len) range passed in the iovec by the user contains a
layout like vma1 -- hole -- vma2 -- hole -- vma3.

In the ideal case, where all VMAs are eligible for the advice, the total
bytes processed returned should be vma3->end - vma1->start. This is the
success case.

Now, say that vma1 succeeds but vma2 (say VM_LOCKED) fails the advice.
In that case the processed bytes will be vma2->start - vma1->start (the
hole still counts as bytes processed), so that the user may restart at
or skip vma2, then continue. This return value will be the partially
processed bytes.

If the system doesn't find any VMA in the range passed by the user, it
returns ENOMEM, as not a single advisable VMA is found in the range.

> 


* Re: [PATCH V2,2/2] mm: madvise: skip unmapped vma holes passed to process_madvise
  2022-03-18 14:05                   ` Charan Teja Kalla
@ 2022-03-18 15:37                     ` Minchan Kim
  0 siblings, 0 replies; 23+ messages in thread
From: Minchan Kim @ 2022-03-18 15:37 UTC (permalink / raw)
  To: Charan Teja Kalla
  Cc: Nadav Amit, Suren Baghdasaryan, Andrew Morton, Vlastimil Babka,
	David Rientjes, Stephen Rothwell, Edgar Arriaga García,
	Michal Hocko, linux-mm, LKML, # 5 . 10+

On Fri, Mar 18, 2022 at 07:35:41PM +0530, Charan Teja Kalla wrote:
> Thank you for valuable inputs.
> 
> On 3/18/2022 2:08 AM, Nadav Amit wrote:
> >>>>>> IMO, it's worth to note in man page.
> >>>>>>
> >>>>> Or the current patch for just ENOMEM is sufficient here and we just have
> >>>>> to update the man page?
> >>>> I think the "On success, process_madvise() returns the number of bytes
> >>>> advised" behaviour sounds useful.  But madvise() doesn't do that.
> >>>>
> >>>> RETURN VALUE
> >>>>       On  success, madvise() returns zero.  On error, it returns -1 and errno
> >>>>       is set to indicate the error.
> >>>>
> >>>> So why is it desirable in the case of process_madvise()?
> >>> Since process_madvise() deals with multiple ranges and could fail at one
> >>> of them in the middle of processing, people could determine where the
> >>> call failed and then decide whether to abort at that point or continue
> >>> hinting the next addresses. The problem with that strategy is that the
> >>> API doesn't return any error value if it has processed any bytes, so
> >>> callers are limited in choosing a policy. That's the limitation of
> >>> every vectored I/O syscall, unfortunately.
> >>>
> >>>>
> >>>>
> >>>> And why was process_madvise() designed this way?   Or was it
> >>>> always simply an error in the manpage?
> >> Taking a closer look, indeed manpage seems to be wrong.
> >> https://elixir.bootlin.com/linux/v5.17-rc8/source/mm/madvise.c#L1154
> >> indicates that in the presence of unmapped holes madvise will skip
> >> them but will return ENOMEM and that's what process_madvise is
> >> ultimately returning in this case. So, the manpage claim of "This
> >> return value may be less than the total number of requested bytes, if
> >> an error occurred after some iovec elements were already processed."
> >> does not reflect the reality in our case because the return value will
> >> be -ENOMEM. After the desired behavior is finalized I'll modify the
> >> manpage accordingly.
> > Since process_madvise() might be used in sort of non-cooperative mode,
> > I think that the caller cannot guarantee that it knows exactly the
> > memory layout of the process whose memory it madvise’s. I know that
> > MADV_DONTNEED for instance is not supported (at least today) by
> > process_madvise(), but if it were, the caller may want to know which
> > exact memory was madvise'd even if the target process ran some other
> > memory layout changing syscalls (e.g., munmap()).
> > 
> > IOW, skipping holes and just returning the total number of madvise’d
> > bytes might not be enough.
> 
> Then is it a correct design for the advised byte range to include holes
> by default?
> Say the [start, len) range passed in the iovec by the user contains a
> layout like vma1 -- hole -- vma2 -- hole -- vma3.
> 
> In the ideal case, where all VMAs are eligible for the advice, the total
> bytes processed returned should be vma3->end - vma1->start. This is the
> success case.
> 
> Now, say that vma1 succeeds but vma2 (say VM_LOCKED) fails the advice.
> In that case the processed bytes will be vma2->start - vma1->start (the
> hole still counts as bytes processed), so that the user may restart at
> or skip vma2, then continue. This return value will be the partially
> processed bytes.
> 
> If the system doesn't find any VMA in the range passed by the user, it
> returns ENOMEM, as not a single advisable VMA is found in the range.

As I mentioned in the other reply, let's not make any exception (i.e.,
skipping holes) for a vectored memory syscall, but report the exact
bytes processed on the exact ranges.


* Re: [PATCH V2,2/2] mm: madvise: skip unmapped vma holes passed to process_madvise
  2022-03-16 14:19         ` Charan Teja Kalla
  2022-03-16 21:29           ` Andrew Morton
  2022-03-17 16:24           ` Minchan Kim
@ 2022-03-21 15:02           ` Michal Hocko
  2022-03-22  5:19             ` Charan Teja Kalla
  2 siblings, 1 reply; 23+ messages in thread
From: Michal Hocko @ 2022-03-21 15:02 UTC (permalink / raw)
  To: Charan Teja Kalla
  Cc: Minchan Kim, Andrew Morton, surenb, vbabka, rientjes, sfr,
	edgararriaga, nadav.amit, linux-mm, linux-kernel, # 5 . 10+

On Wed 16-03-22 19:49:38, Charan Teja Kalla wrote:
[...]
> It can return EINTR when:
> -------------------------
> 1) PTRACE_MODE_READ is being checked in mm_access() where it is waiting
> on task->signal->exec_update_lock. EINTR returned from here guarantees
> that process_madvise() didn't even start processing.
> https://elixir.bootlin.com/linux/v5.16.14/source/mm/madvise.c#L1264 -->
> https://elixir.bootlin.com/linux/v5.16.14/source/kernel/fork.c#L1318
> 
> 2) The process_madvise() started processing VMA's but the required
> behavior on a VMA needs mmap_write_lock_killable(), from where EINTR is
> returned.

Please note this will happen if the task has been killed. The return
value doesn't really matter because the process won't run in userspace.

> The current behaviours supported by process_madvise(),
> MADV_COLD, PAGEOUT, WILLNEED, just need read lock here.
> https://elixir.bootlin.com/linux/v5.16.14/source/mm/madvise.c#L1164
>  **Thus I think no way for EINTR can be returned by process_madvise() in
> the middle of processing.** . No?

Maybe not with the current implementation but I can easily imagine that
there is a requirement to break out early when there is a signal pending
(e.g. to support terminating madvise on a large memory range). You would
get EINTR then and somehow need to communicate that to the userspace.

-- 
Michal Hocko
SUSE Labs


* Re: [PATCH V2,1/2] mm: madvise: return correct bytes advised with process_madvise
  2022-03-11 15:29 ` [PATCH V2,1/2] mm: madvise: return correct bytes advised " Charan Teja Kalla
  2022-03-15 22:20   ` Minchan Kim
@ 2022-03-21 15:18   ` Michal Hocko
  1 sibling, 0 replies; 23+ messages in thread
From: Michal Hocko @ 2022-03-21 15:18 UTC (permalink / raw)
  To: Charan Teja Kalla
  Cc: akpm, surenb, vbabka, rientjes, sfr, edgararriaga, minchan,
	nadav.amit, linux-mm, linux-kernel, # 5 . 10+

On Fri 11-03-22 20:59:05, Charan Teja Kalla wrote:
> The process_madvise() system call returns error even after processing
> some VMA's passed in the 'struct iovec' vector list which leaves the
> user confused to know where to restart the advise next. It is also
> against this syscall man page[1] documentation where it mentions that
> "return value may be less than the total number of requested bytes, if
> an error occurred after some iovec elements were already processed.".
> 
> Consider a user passed 10 VMA's in the 'struct iovec' vector list of
> which 9 are processed but one. Then it just returns the error caused on
> that failed VMA despite the first 9 VMA's processed, leaving the user
> confused about on which VMA it is failed. Returning the number of bytes
> processed here can help the user to know which VMA it is failed on and
> thus can retry/skip the advise on that VMA.
> 
> [1]https://man7.org/linux/man-pages/man2/process_madvise.2.html.
> 
> Fixes: ecb8ac8b1f14("mm/madvise: introduce process_madvise() syscall: an external memory hinting API")
> Cc: <stable@vger.kernel.org> # 5.10+
> Signed-off-by: Charan Teja Kalla <quic_charante@quicinc.com>

Acked-by: Michal Hocko <mhocko@suse.com>

> ---
> Changes in V2:
>  -- Separated the ENOMEM handling and return bytes processed, as per Minchan comments.
>  -- This contains correcting return bytes processed with process_madvise().
> 
> Changes in V1:
>  -- Fixed the ENOMEM handling and return bytes processed by process_madvise.
>  -- https://patchwork.kernel.org/project/linux-mm/patch/1646803679-11433-1-git-send-email-quic_charante@quicinc.com/
> 
>  mm/madvise.c | 3 +--
>  1 file changed, 1 insertion(+), 2 deletions(-)
> 
> diff --git a/mm/madvise.c b/mm/madvise.c
> index 38d0f51..e97e6a9 100644
> --- a/mm/madvise.c
> +++ b/mm/madvise.c
> @@ -1433,8 +1433,7 @@ SYSCALL_DEFINE5(process_madvise, int, pidfd, const struct iovec __user *, vec,
>  		iov_iter_advance(&iter, iovec.iov_len);
>  	}
>  
> -	if (ret == 0)
> -		ret = total_len - iov_iter_count(&iter);
> +	ret = (total_len - iov_iter_count(&iter)) ? : ret;
>  
>  release_mm:
>  	mmput(mm);
> -- 
> 2.7.4

-- 
Michal Hocko
SUSE Labs


* Re: [PATCH V2,2/2] mm: madvise: skip unmapped vma holes passed to process_madvise
  2022-03-11 15:29 ` [PATCH V2,2/2] mm: madvise: skip unmapped vma holes passed to process_madvise Charan Teja Kalla
  2022-03-15 22:58   ` Minchan Kim
@ 2022-03-21 15:34   ` Michal Hocko
  2022-03-22  7:10     ` Charan Teja Kalla
  1 sibling, 1 reply; 23+ messages in thread
From: Michal Hocko @ 2022-03-21 15:34 UTC (permalink / raw)
  To: Charan Teja Kalla
  Cc: akpm, surenb, vbabka, rientjes, sfr, edgararriaga, minchan,
	nadav.amit, linux-mm, linux-kernel, # 5 . 10+

On Fri 11-03-22 20:59:06, Charan Teja Kalla wrote:
> The process_madvise() system call is expected to skip holes in vma
> passed through 'struct iovec' vector list.

Where is this assumption coming from? From the man page I can see:
: The advice might be applied to only a part of iovec if one of its
: elements points to an invalid memory region in the remote
: process.  No further elements will be processed beyond that
: point.  

> But do_madvise, which
> process_madvise() calls for each vma, returns ENOMEM in case of unmapped
> holes, despite the VMA is processed.
> Thus process_madvise() should treat ENOMEM as expected and consider the
> VMA passed to as processed and continue processing other vma's in the
> vector list. Returning -ENOMEM to user, despite the VMA is processed,
> will be unable to figure out where to start the next madvise.

I am not sure I follow. With your previous patch and -ENOMEM from
do_madvise you get the answer you are looking for, no?
With this applied you are losing the information that some of the iters
are not mapped or have a hole. Which might be useful information
especially when processing on remote tasks which are free to manipulate
their address spaces.

> Fixes: ecb8ac8b1f14("mm/madvise: introduce process_madvise() syscall: an external memory hinting API")
> Cc: <stable@vger.kernel.org> # 5.10+
> Signed-off-by: Charan Teja Kalla <quic_charante@quicinc.com>
> ---
> Changes in V2:
>   -- Fixed handling of ENOMEM by process_madvise().
>   -- Patch doesn't exist in V1.
> 
>  mm/madvise.c | 9 ++++++++-
>  1 file changed, 8 insertions(+), 1 deletion(-)
> 
> diff --git a/mm/madvise.c b/mm/madvise.c
> index e97e6a9..14fb76d 100644
> --- a/mm/madvise.c
> +++ b/mm/madvise.c
> @@ -1426,9 +1426,16 @@ SYSCALL_DEFINE5(process_madvise, int, pidfd, const struct iovec __user *, vec,
>  
>  	while (iov_iter_count(&iter)) {
>  		iovec = iov_iter_iovec(&iter);
> +		/*
> +		 * do_madvise returns ENOMEM if unmapped holes are present
> +		 * in the passed VMA. process_madvise() is expected to skip
> +		 * unmapped holes passed to it in the 'struct iovec' list
> +		 * and not fail because of them. Thus treat -ENOMEM return
> +		 * from do_madvise as valid and continue processing.
> +		 */
>  		ret = do_madvise(mm, (unsigned long)iovec.iov_base,
>  					iovec.iov_len, behavior);
> -		if (ret < 0)
> +		if (ret < 0 && ret != -ENOMEM)
>  			break;
>  		iov_iter_advance(&iter, iovec.iov_len);
>  	}
> -- 
> 2.7.4

-- 
Michal Hocko
SUSE Labs


* Re: [PATCH V2,2/2] mm: madvise: skip unmapped vma holes passed to process_madvise
  2022-03-21 15:02           ` Michal Hocko
@ 2022-03-22  5:19             ` Charan Teja Kalla
  0 siblings, 0 replies; 23+ messages in thread
From: Charan Teja Kalla @ 2022-03-22  5:19 UTC (permalink / raw)
  To: Michal Hocko
  Cc: Minchan Kim, Andrew Morton, surenb, vbabka, rientjes, sfr,
	edgararriaga, nadav.amit, linux-mm, linux-kernel, # 5 . 10+


On 3/21/2022 8:32 PM, Michal Hocko wrote:
>> It can return EINTR when:
>> -------------------------
>> 1) PTRACE_MODE_READ is being checked in mm_access() where it is waiting
>> on task->signal->exec_update_lock. EINTR returned from here guarantees
>> that process_madvise() didn't even start processing.
>> https://elixir.bootlin.com/linux/v5.16.14/source/mm/madvise.c#L1264 -->
>> https://elixir.bootlin.com/linux/v5.16.14/source/kernel/fork.c#L1318
>>
>> 2) The process_madvise() started processing VMA's but the required
>> behavior on a VMA needs mmap_write_lock_killable(), from where EINTR is
>> returned.
> Please note this will happen if the task has been killed. The return
> value doesn't really matter because the process won't run in userspace.

Okay, thanks here.

> 
>> The current behaviours supported by process_madvise(),
>> MADV_COLD, PAGEOUT, WILLNEED, just need read lock here.
>> https://elixir.bootlin.com/linux/v5.16.14/source/mm/madvise.c#L1164
>>  **Thus I think no way for EINTR can be returned by process_madvise() in
>> the middle of processing.** . No?
> Maybe not with the current implementation but I can easily imagine that
> there is a requirement to break out early when there is a signal pending
> (e.g. to support terminating madvise on a large memory range). You would
> get EINTR then and somehow need to communicate that to the userspace.

Agree. Will implement this.



* Re: [PATCH V2,2/2] mm: madvise: skip unmapped vma holes passed to process_madvise
  2022-03-21 15:34   ` Michal Hocko
@ 2022-03-22  7:10     ` Charan Teja Kalla
  2022-03-22  8:40       ` Michal Hocko
  0 siblings, 1 reply; 23+ messages in thread
From: Charan Teja Kalla @ 2022-03-22  7:10 UTC (permalink / raw)
  To: Michal Hocko
  Cc: akpm, surenb, vbabka, rientjes, sfr, edgararriaga, minchan,
	nadav.amit, linux-mm, linux-kernel, # 5 . 10+

Thanks Michal for the inputs.

On 3/21/2022 9:04 PM, Michal Hocko wrote:
> On Fri 11-03-22 20:59:06, Charan Teja Kalla wrote:
>> The process_madvise() system call is expected to skip holes in vma
>> passed through 'struct iovec' vector list.
> Where is this assumption coming from? From the man page I can see:
> : The advice might be applied to only a part of iovec if one of its
> : elements points to an invalid memory region in the remote
> : process.  No further elements will be processed beyond that
> : point.  

I assumed this while processing a single element of an iovec. In a
scenario where a range passed contains multiple VMA's + holes, on
encountering the VMA with VM_LOCKED|VM_HUGETLB|VM_PFNMAP, we are
immediately stopping further processing of that iovec element with
EINVAL return. Whereas on encountering a hole, we simply remember it as
ENOMEM but continue processing that iovec element and in the end return
ENOMEM. This means that the complete range is processed but ENOMEM is
still returned, hence the assumption of skipping holes in a vma.
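
The difference between the two error paths can be modelled in user space
roughly as follows. This is only a sketch of the behaviour described above,
not the kernel code: the seg/seg_kind types and model_do_madvise() are made
up for illustration, with SEG_INELIGIBLE standing in for a
VM_LOCKED|VM_HUGETLB|VM_PFNMAP VMA.

```c
#include <errno.h>
#include <stddef.h>

/* Hypothetical description of one piece of the range in a single iovec. */
enum seg_kind { SEG_OK, SEG_HOLE, SEG_INELIGIBLE };

struct seg {
	enum seg_kind kind;
	size_t len;
};

/*
 * Model of how one iovec element is walked: a hole is remembered as
 * -ENOMEM but the walk continues, while an ineligible VMA stops the walk
 * at once with -EINVAL. *advised receives the bytes actually advised.
 */
static int model_do_madvise(const struct seg *segs, size_t n, size_t *advised)
{
	int unmapped_error = 0;
	size_t i;

	*advised = 0;
	for (i = 0; i < n; i++) {
		switch (segs[i].kind) {
		case SEG_HOLE:
			unmapped_error = -ENOMEM;	/* remember, keep going */
			break;
		case SEG_INELIGIBLE:
			return -EINVAL;			/* stop immediately */
		case SEG_OK:
			*advised += segs[i].len;
			break;
		}
	}
	return unmapped_error;
}
```

With a layout of VMA, hole, VMA the model advises both VMAs and still
returns -ENOMEM, whereas VMA, ineligible VMA, VMA stops after the first
VMA with -EINVAL.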

The other problem is, in an individual iovec element, though some bytes
are processed we may still end up returning EINVAL, which makes it hard
for the user to take decisions, i.e. the user doesn't know at which
address exactly it failed to advise.

Anyway, both these will be addressed in the next version of this patch
with the suggestions from Minchan [1], who mentioned that: "it
should represent exact bytes it addressed with exact ranges like
process_vm_readv/writev. Providing valid ranges is responsibility from the
user."

[1]  https://lore.kernel.org/linux-mm/YjNgoeg1yOocsjWC@google.com/
> 
>> But do_madvise, which
>> process_madvise() calls for each vma, returns ENOMEM in case of unmapped
>> holes, despite the VMA is processed.
>> Thus process_madvise() should treat ENOMEM as expected and consider the
>> VMA passed to as processed and continue processing other vma's in the
>> vector list. Returning -ENOMEM to user, despite the VMA is processed,
>> will be unable to figure out where to start the next madvise.
> I am not sure I follow. With your previous patch and -ENOMEM from
> do_madvise you get the answer you are looking for, no?
> With this applied you are losing the information that some of the iters
> are not mapped or have a hole. Which might be useful information
> especially when processing on remote tasks which are free to manipulate
> their address spaces.

Yes, it should return ENOMEM. The same will be fixed in the next revision.

> 
>> Fixes: ecb8ac8b1f14("mm/madvise: introduce process_madvise() syscall: an external memory hinting API")
>> Cc: <stable@vger.kernel.org> # 5.10+
>> Signed-off-by: Charan Teja Kalla <quic_charante@quicinc.com>
>> ---
>> Changes in V2:
>>   -- Fixed handling of ENOMEM by process_madvise().
>>   -- Patch doesn't exist in V1.
>>
>>  mm/madvise.c | 9 ++++++++-
>>  1 file changed, 8 insertions(+), 1 deletion(-)
>>
>> diff --git a/mm/madvise.c b/mm/madvise.c
>> index e97e6a9..14fb76d 100644
>> --- a/mm/madvise.c
>> +++ b/mm/madvise.c
>> @@ -1426,9 +1426,16 @@ SYSCALL_DEFINE5(process_madvise, int, pidfd, const struct iovec __user *, vec,
>>  
>>  	while (iov_iter_count(&iter)) {
>>  		iovec = iov_iter_iovec(&iter);
>> +		/*
>> +		 * do_madvise returns ENOMEM if unmapped holes are present
>> +		 * in the passed VMA. process_madvise() is expected to skip
>> +		 * unmapped holes passed to it in the 'struct iovec' list
>> +		 * and not fail because of them. Thus treat -ENOMEM return
>> +		 * from do_madvise as valid and continue processing.
>> +		 */
>>  		ret = do_madvise(mm, (unsigned long)iovec.iov_base,
>>  					iovec.iov_len, behavior);
>> -		if (ret < 0)
>> +		if (ret < 0 && ret != -ENOMEM)
>>  			break;
>>  		iov_iter_advance(&iter, iovec.iov_len);
>>  	}
>> -- 
>> 2.7.4


* Re: [PATCH V2,2/2] mm: madvise: skip unmapped vma holes passed to process_madvise
  2022-03-22  7:10     ` Charan Teja Kalla
@ 2022-03-22  8:40       ` Michal Hocko
  0 siblings, 0 replies; 23+ messages in thread
From: Michal Hocko @ 2022-03-22  8:40 UTC (permalink / raw)
  To: Charan Teja Kalla
  Cc: akpm, surenb, vbabka, rientjes, sfr, edgararriaga, minchan,
	nadav.amit, linux-mm, linux-kernel, # 5 . 10+

On Tue 22-03-22 12:40:24, Charan Teja Kalla wrote:
> Thanks Michal for the inputs.
> 
> On 3/21/2022 9:04 PM, Michal Hocko wrote:
> > On Fri 11-03-22 20:59:06, Charan Teja Kalla wrote:
> >> The process_madvise() system call is expected to skip holes in vma
> >> passed through 'struct iovec' vector list.
> > Where is this assumption coming from? From the man page I can see:
> > : The advice might be applied to only a part of iovec if one of its
> > : elements points to an invalid memory region in the remote
> > : process.  No further elements will be processed beyond that
> > : point.  
> 
> I assumed this while processing a single element of an iovec. In a
> scenario where a range passed contains multiple VMA's + holes, on
> encountering the VMA with VM_LOCKED|VM_HUGETLB|VM_PFNMAP, we are
> immediately stopping further processing of that iovec element with
> EINVAL return. Whereas on encountering a hole, we simply remember it as
> ENOMEM but continue processing that iovec element and in the end return
> ENOMEM. This means that the complete range is processed but ENOMEM is
> still returned, hence the assumption of skipping holes in a vma.
> 
> The other problem is, in an individual iovec element, though some bytes
> are processed we may still end up returning EINVAL, which makes it hard
> for the user to take decisions, i.e. the user doesn't know at which
> address exactly it failed to advise.
> 
> Anyway, both these will be addressed in the next version of this patch
> with the suggestions from Minchan [1], who mentioned that: "it
> should represent exact bytes it addressed with exact ranges like
> process_vm_readv/writev. Providing valid ranges is responsibility from the
> user."

I would tend to agree that the userspace should be providing sensible
ranges (either subsets or full existing mappings). Whenever multiple
vmas are defined by a single iovec, things get more complicated. IMO
process_madvise should mimic the madvise semantic applied to each iovec.
That means to bail out on an error. That applies to ENOMEM even when the
last iovec has been processed completely.

This would allow to learn about address space change that the caller is
not aware of. That being said, your first patch should be good enough.
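
For reference, the return-value rule of the first patch can be modelled in
user space as below. This is a sketch only: model_process_madvise() and
fake_advise() are hypothetical stand-ins (fake_advise() pretends that a
NULL base is an unmapped range and returns -ENOMEM), and the portable
conditional replaces the GNU `?:` shorthand used in the patch.

```c
#include <errno.h>
#include <stddef.h>
#include <sys/types.h>
#include <sys/uio.h>

/* Hypothetical stand-in for do_madvise(): returns 0 or a negative errno. */
typedef int (*advise_fn)(void *base, size_t len);

/* Pretend a NULL base is an unmapped range; everything else succeeds. */
static int fake_advise(void *base, size_t len)
{
	(void)len;
	return base ? 0 : -ENOMEM;
}

/*
 * Walk the vector, bail out on the first error, and prefer a non-zero
 * count of processed bytes over the error code.
 */
static ssize_t model_process_madvise(const struct iovec *vec, size_t vlen,
				     advise_fn advise)
{
	size_t done = 0;
	int ret = 0;
	size_t i;

	for (i = 0; i < vlen; i++) {
		ret = advise(vec[i].iov_base, vec[i].iov_len);
		if (ret < 0)
			break;			/* same as madvise: stop here */
		done += vec[i].iov_len;		/* whole element processed */
	}
	return done ? (ssize_t)done : ret;
}
```

The error is only visible to the caller when nothing at all was processed;
otherwise a short count tells it which element to look at next.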
-- 
Michal Hocko
SUSE Labs


end of thread, other threads:[~2022-03-22  8:41 UTC | newest]

Thread overview: 23+ messages
2022-03-11 15:29 [PATCH V2,0/2]mm: madvise: return correct bytes processed with process_madvise Charan Teja Kalla
2022-03-11 15:29 ` [PATCH V2,1/2] mm: madvise: return correct bytes advised " Charan Teja Kalla
2022-03-15 22:20   ` Minchan Kim
2022-03-21 15:18   ` Michal Hocko
2022-03-11 15:29 ` [PATCH V2,2/2] mm: madvise: skip unmapped vma holes passed to process_madvise Charan Teja Kalla
2022-03-15 22:58   ` Minchan Kim
2022-03-15 23:48     ` Andrew Morton
2022-03-16  1:43       ` Minchan Kim
2022-03-16 14:19         ` Charan Teja Kalla
2022-03-16 21:29           ` Andrew Morton
2022-03-17 16:28             ` Minchan Kim
2022-03-17 16:53               ` Suren Baghdasaryan
2022-03-17 20:38                 ` Nadav Amit
2022-03-18 14:05                   ` Charan Teja Kalla
2022-03-18 15:37                     ` Minchan Kim
2022-03-17 16:24           ` Minchan Kim
2022-03-21 15:02           ` Michal Hocko
2022-03-22  5:19             ` Charan Teja Kalla
2022-03-21 15:34   ` Michal Hocko
2022-03-22  7:10     ` Charan Teja Kalla
2022-03-22  8:40       ` Michal Hocko
2022-03-11 21:42 ` [PATCH V2,0/2]mm: madvise: return correct bytes processed with process_madvise Andrew Morton
2022-03-15 14:26   ` Charan Teja Kalla
