All of lore.kernel.org
 help / color / mirror / Atom feed
* [PATCH v2] fs/select: add vmalloc fallback for select(2)
@ 2016-09-22 16:43 ` Vlastimil Babka
  0 siblings, 0 replies; 38+ messages in thread
From: Vlastimil Babka @ 2016-09-22 16:43 UTC (permalink / raw)
  To: Alexander Viro, Andrew Morton
  Cc: linux-fsdevel, linux-kernel, linux-mm, Michal Hocko, netdev,
	Eric Dumazet, Vlastimil Babka

The select(2) syscall performs a kmalloc(size, GFP_KERNEL) where size grows
with the number of fds passed. We had a customer report page allocation
failures of order-4 for this allocation. This is a costly order, so it might
easily fail, as the VM expects such allocation to have a lower-order fallback.

Such trivial fallback is vmalloc(), as the memory doesn't have to be
physically contiguous. Also the allocation is temporary for the duration of the
syscall, so it's unlikely to stress vmalloc too much.

Note that the poll(2) syscall seems to use a linked list of order-0 pages, so
it doesn't need this kind of fallback.

[eric.dumazet@gmail.com: fix failure path logic]
Signed-off-by: Vlastimil Babka <vbabka@suse.cz>
---
 fs/select.c | 10 ++++++++--
 1 file changed, 8 insertions(+), 2 deletions(-)

diff --git a/fs/select.c b/fs/select.c
index 8ed9da50896a..b99e98524fde 100644
--- a/fs/select.c
+++ b/fs/select.c
@@ -29,6 +29,7 @@
 #include <linux/sched/rt.h>
 #include <linux/freezer.h>
 #include <net/busy_poll.h>
+#include <linux/vmalloc.h>
 
 #include <asm/uaccess.h>
 
@@ -558,6 +559,7 @@ int core_sys_select(int n, fd_set __user *inp, fd_set __user *outp,
 	struct fdtable *fdt;
 	/* Allocate small arguments on the stack to save memory and be faster */
 	long stack_fds[SELECT_STACK_ALLOC/sizeof(long)];
+	unsigned long alloc_size;
 
 	ret = -EINVAL;
 	if (n < 0)
@@ -580,8 +582,12 @@ int core_sys_select(int n, fd_set __user *inp, fd_set __user *outp,
 	bits = stack_fds;
 	if (size > sizeof(stack_fds) / 6) {
 		/* Not enough space in on-stack array; must use kmalloc */
+		alloc_size = 6 * size;
 		ret = -ENOMEM;
-		bits = kmalloc(6 * size, GFP_KERNEL);
+		bits = kmalloc(alloc_size, GFP_KERNEL|__GFP_NOWARN);
+		if (!bits && alloc_size > PAGE_SIZE)
+			bits = vmalloc(alloc_size);
+
 		if (!bits)
 			goto out_nofds;
 	}
@@ -618,7 +624,7 @@ int core_sys_select(int n, fd_set __user *inp, fd_set __user *outp,
 
 out:
 	if (bits != stack_fds)
-		kfree(bits);
+		kvfree(bits);
 out_nofds:
 	return ret;
 }
-- 
2.10.0

^ permalink raw reply related	[flat|nested] 38+ messages in thread

* [PATCH v2] fs/select: add vmalloc fallback for select(2)
@ 2016-09-22 16:43 ` Vlastimil Babka
  0 siblings, 0 replies; 38+ messages in thread
From: Vlastimil Babka @ 2016-09-22 16:43 UTC (permalink / raw)
  To: Alexander Viro, Andrew Morton
  Cc: linux-fsdevel, linux-kernel, linux-mm, Michal Hocko, netdev,
	Eric Dumazet, Vlastimil Babka

The select(2) syscall performs a kmalloc(size, GFP_KERNEL) where size grows
with the number of fds passed. We had a customer report page allocation
failures of order-4 for this allocation. This is a costly order, so it might
easily fail, as the VM expects such allocation to have a lower-order fallback.

Such trivial fallback is vmalloc(), as the memory doesn't have to be
physically contiguous. Also the allocation is temporary for the duration of the
syscall, so it's unlikely to stress vmalloc too much.

Note that the poll(2) syscall seems to use a linked list of order-0 pages, so
it doesn't need this kind of fallback.

[eric.dumazet@gmail.com: fix failure path logic]
Signed-off-by: Vlastimil Babka <vbabka@suse.cz>
---
 fs/select.c | 10 ++++++++--
 1 file changed, 8 insertions(+), 2 deletions(-)

diff --git a/fs/select.c b/fs/select.c
index 8ed9da50896a..b99e98524fde 100644
--- a/fs/select.c
+++ b/fs/select.c
@@ -29,6 +29,7 @@
 #include <linux/sched/rt.h>
 #include <linux/freezer.h>
 #include <net/busy_poll.h>
+#include <linux/vmalloc.h>
 
 #include <asm/uaccess.h>
 
@@ -558,6 +559,7 @@ int core_sys_select(int n, fd_set __user *inp, fd_set __user *outp,
 	struct fdtable *fdt;
 	/* Allocate small arguments on the stack to save memory and be faster */
 	long stack_fds[SELECT_STACK_ALLOC/sizeof(long)];
+	unsigned long alloc_size;
 
 	ret = -EINVAL;
 	if (n < 0)
@@ -580,8 +582,12 @@ int core_sys_select(int n, fd_set __user *inp, fd_set __user *outp,
 	bits = stack_fds;
 	if (size > sizeof(stack_fds) / 6) {
 		/* Not enough space in on-stack array; must use kmalloc */
+		alloc_size = 6 * size;
 		ret = -ENOMEM;
-		bits = kmalloc(6 * size, GFP_KERNEL);
+		bits = kmalloc(alloc_size, GFP_KERNEL|__GFP_NOWARN);
+		if (!bits && alloc_size > PAGE_SIZE)
+			bits = vmalloc(alloc_size);
+
 		if (!bits)
 			goto out_nofds;
 	}
@@ -618,7 +624,7 @@ int core_sys_select(int n, fd_set __user *inp, fd_set __user *outp,
 
 out:
 	if (bits != stack_fds)
-		kfree(bits);
+		kvfree(bits);
 out_nofds:
 	return ret;
 }
-- 
2.10.0

--
To unsubscribe, send a message with 'unsubscribe linux-mm' in
the body to majordomo@kvack.org.  For more info on Linux MM,
see: http://www.linux-mm.org/ .
Don't email: <a href=mailto:"dont@kvack.org"> email@kvack.org </a>

^ permalink raw reply related	[flat|nested] 38+ messages in thread

* Re: [PATCH v2] fs/select: add vmalloc fallback for select(2)
  2016-09-22 16:43 ` Vlastimil Babka
  (?)
@ 2016-09-22 16:49   ` Eric Dumazet
  -1 siblings, 0 replies; 38+ messages in thread
From: Eric Dumazet @ 2016-09-22 16:49 UTC (permalink / raw)
  To: Vlastimil Babka
  Cc: Alexander Viro, Andrew Morton, linux-fsdevel, linux-kernel,
	linux-mm, Michal Hocko, netdev

On Thu, 2016-09-22 at 18:43 +0200, Vlastimil Babka wrote:
> The select(2) syscall performs a kmalloc(size, GFP_KERNEL) where size grows
> with the number of fds passed. We had a customer report page allocation
> failures of order-4 for this allocation. This is a costly order, so it might
> easily fail, as the VM expects such allocation to have a lower-order fallback.
> 
> Such trivial fallback is vmalloc(), as the memory doesn't have to be
> physically contiguous. Also the allocation is temporary for the duration of the
> syscall, so it's unlikely to stress vmalloc too much.

vmalloc() uses a vmap_area_lock spinlock, and TLB flushes.

So I guess allowing vmalloc() being called from an innocent application
doing a select() might be dangerous, especially if this select() happens
thousands of time per second.

^ permalink raw reply	[flat|nested] 38+ messages in thread

* Re: [PATCH v2] fs/select: add vmalloc fallback for select(2)
@ 2016-09-22 16:49   ` Eric Dumazet
  0 siblings, 0 replies; 38+ messages in thread
From: Eric Dumazet @ 2016-09-22 16:49 UTC (permalink / raw)
  To: Vlastimil Babka
  Cc: Alexander Viro, Andrew Morton, linux-fsdevel, linux-kernel,
	linux-mm, Michal Hocko, netdev

On Thu, 2016-09-22 at 18:43 +0200, Vlastimil Babka wrote:
> The select(2) syscall performs a kmalloc(size, GFP_KERNEL) where size grows
> with the number of fds passed. We had a customer report page allocation
> failures of order-4 for this allocation. This is a costly order, so it might
> easily fail, as the VM expects such allocation to have a lower-order fallback.
> 
> Such trivial fallback is vmalloc(), as the memory doesn't have to be
> physically contiguous. Also the allocation is temporary for the duration of the
> syscall, so it's unlikely to stress vmalloc too much.

vmalloc() uses a vmap_area_lock spinlock, and TLB flushes.

So I guess allowing vmalloc() being called from an innocent application
doing a select() might be dangerous, especially if this select() happens
thousands of time per second.





--
To unsubscribe, send a message with 'unsubscribe linux-mm' in
the body to majordomo@kvack.org.  For more info on Linux MM,
see: http://www.linux-mm.org/ .
Don't email: <a href=mailto:"dont@kvack.org"> email@kvack.org </a>

^ permalink raw reply	[flat|nested] 38+ messages in thread

* Re: [PATCH v2] fs/select: add vmalloc fallback for select(2)
@ 2016-09-22 16:49   ` Eric Dumazet
  0 siblings, 0 replies; 38+ messages in thread
From: Eric Dumazet @ 2016-09-22 16:49 UTC (permalink / raw)
  To: Vlastimil Babka
  Cc: Alexander Viro, Andrew Morton, linux-fsdevel, linux-kernel,
	linux-mm, Michal Hocko, netdev

On Thu, 2016-09-22 at 18:43 +0200, Vlastimil Babka wrote:
> The select(2) syscall performs a kmalloc(size, GFP_KERNEL) where size grows
> with the number of fds passed. We had a customer report page allocation
> failures of order-4 for this allocation. This is a costly order, so it might
> easily fail, as the VM expects such allocation to have a lower-order fallback.
> 
> Such trivial fallback is vmalloc(), as the memory doesn't have to be
> physically contiguous. Also the allocation is temporary for the duration of the
> syscall, so it's unlikely to stress vmalloc too much.

vmalloc() uses a vmap_area_lock spinlock, and TLB flushes.

So I guess allowing vmalloc() being called from an innocent application
doing a select() might be dangerous, especially if this select() happens
thousands of time per second.





--
To unsubscribe, send a message with 'unsubscribe linux-mm' in
the body to majordomo@kvack.org.  For more info on Linux MM,
see: http://www.linux-mm.org/ .
Don't email: <a href=mailto:"dont@kvack.org"> email@kvack.org </a>

^ permalink raw reply	[flat|nested] 38+ messages in thread

* Re: [PATCH v2] fs/select: add vmalloc fallback for select(2)
  2016-09-22 16:49   ` Eric Dumazet
@ 2016-09-22 16:56     ` Vlastimil Babka
  -1 siblings, 0 replies; 38+ messages in thread
From: Vlastimil Babka @ 2016-09-22 16:56 UTC (permalink / raw)
  To: Eric Dumazet
  Cc: Alexander Viro, Andrew Morton, linux-fsdevel, linux-kernel,
	linux-mm, Michal Hocko, netdev

On 09/22/2016 06:49 PM, Eric Dumazet wrote:
> On Thu, 2016-09-22 at 18:43 +0200, Vlastimil Babka wrote:
>> The select(2) syscall performs a kmalloc(size, GFP_KERNEL) where size grows
>> with the number of fds passed. We had a customer report page allocation
>> failures of order-4 for this allocation. This is a costly order, so it might
>> easily fail, as the VM expects such allocation to have a lower-order fallback.
>>
>> Such trivial fallback is vmalloc(), as the memory doesn't have to be
>> physically contiguous. Also the allocation is temporary for the duration of the
>> syscall, so it's unlikely to stress vmalloc too much.
>
> vmalloc() uses a vmap_area_lock spinlock, and TLB flushes.
>
> So I guess allowing vmalloc() being called from an innocent application
> doing a select() might be dangerous, especially if this select() happens
> thousands of time per second.

Isn't seq_buf_alloc() similarly exposed? And ipc_alloc()?

^ permalink raw reply	[flat|nested] 38+ messages in thread

* Re: [PATCH v2] fs/select: add vmalloc fallback for select(2)
@ 2016-09-22 16:56     ` Vlastimil Babka
  0 siblings, 0 replies; 38+ messages in thread
From: Vlastimil Babka @ 2016-09-22 16:56 UTC (permalink / raw)
  To: Eric Dumazet
  Cc: Alexander Viro, Andrew Morton, linux-fsdevel, linux-kernel,
	linux-mm, Michal Hocko, netdev

On 09/22/2016 06:49 PM, Eric Dumazet wrote:
> On Thu, 2016-09-22 at 18:43 +0200, Vlastimil Babka wrote:
>> The select(2) syscall performs a kmalloc(size, GFP_KERNEL) where size grows
>> with the number of fds passed. We had a customer report page allocation
>> failures of order-4 for this allocation. This is a costly order, so it might
>> easily fail, as the VM expects such allocation to have a lower-order fallback.
>>
>> Such trivial fallback is vmalloc(), as the memory doesn't have to be
>> physically contiguous. Also the allocation is temporary for the duration of the
>> syscall, so it's unlikely to stress vmalloc too much.
>
> vmalloc() uses a vmap_area_lock spinlock, and TLB flushes.
>
> So I guess allowing vmalloc() being called from an innocent application
> doing a select() might be dangerous, especially if this select() happens
> thousands of time per second.

Isn't seq_buf_alloc() similarly exposed? And ipc_alloc()?



--
To unsubscribe, send a message with 'unsubscribe linux-mm' in
the body to majordomo@kvack.org.  For more info on Linux MM,
see: http://www.linux-mm.org/ .
Don't email: <a href=mailto:"dont@kvack.org"> email@kvack.org </a>

^ permalink raw reply	[flat|nested] 38+ messages in thread

* Re: [PATCH v2] fs/select: add vmalloc fallback for select(2)
  2016-09-22 16:56     ` Vlastimil Babka
  (?)
@ 2016-09-22 17:07       ` Eric Dumazet
  -1 siblings, 0 replies; 38+ messages in thread
From: Eric Dumazet @ 2016-09-22 17:07 UTC (permalink / raw)
  To: Vlastimil Babka
  Cc: Alexander Viro, Andrew Morton, linux-fsdevel, linux-kernel,
	linux-mm, Michal Hocko, netdev

On Thu, 2016-09-22 at 18:56 +0200, Vlastimil Babka wrote:
> On 09/22/2016 06:49 PM, Eric Dumazet wrote:
> > On Thu, 2016-09-22 at 18:43 +0200, Vlastimil Babka wrote:
> >> The select(2) syscall performs a kmalloc(size, GFP_KERNEL) where size grows
> >> with the number of fds passed. We had a customer report page allocation
> >> failures of order-4 for this allocation. This is a costly order, so it might
> >> easily fail, as the VM expects such allocation to have a lower-order fallback.
> >>
> >> Such trivial fallback is vmalloc(), as the memory doesn't have to be
> >> physically contiguous. Also the allocation is temporary for the duration of the
> >> syscall, so it's unlikely to stress vmalloc too much.
> >
> > vmalloc() uses a vmap_area_lock spinlock, and TLB flushes.
> >
> > So I guess allowing vmalloc() being called from an innocent application
> > doing a select() might be dangerous, especially if this select() happens
> > thousands of time per second.
> 
> Isn't seq_buf_alloc() similarly exposed? And ipc_alloc()?

Possibly.

We don't have a library function (attempting kmalloc(), fallback to
vmalloc() presumably to avoid abuses, but I guess some patches were
accepted without thinking about this.

^ permalink raw reply	[flat|nested] 38+ messages in thread

* Re: [PATCH v2] fs/select: add vmalloc fallback for select(2)
@ 2016-09-22 17:07       ` Eric Dumazet
  0 siblings, 0 replies; 38+ messages in thread
From: Eric Dumazet @ 2016-09-22 17:07 UTC (permalink / raw)
  To: Vlastimil Babka
  Cc: Alexander Viro, Andrew Morton, linux-fsdevel, linux-kernel,
	linux-mm, Michal Hocko, netdev

On Thu, 2016-09-22 at 18:56 +0200, Vlastimil Babka wrote:
> On 09/22/2016 06:49 PM, Eric Dumazet wrote:
> > On Thu, 2016-09-22 at 18:43 +0200, Vlastimil Babka wrote:
> >> The select(2) syscall performs a kmalloc(size, GFP_KERNEL) where size grows
> >> with the number of fds passed. We had a customer report page allocation
> >> failures of order-4 for this allocation. This is a costly order, so it might
> >> easily fail, as the VM expects such allocation to have a lower-order fallback.
> >>
> >> Such trivial fallback is vmalloc(), as the memory doesn't have to be
> >> physically contiguous. Also the allocation is temporary for the duration of the
> >> syscall, so it's unlikely to stress vmalloc too much.
> >
> > vmalloc() uses a vmap_area_lock spinlock, and TLB flushes.
> >
> > So I guess allowing vmalloc() being called from an innocent application
> > doing a select() might be dangerous, especially if this select() happens
> > thousands of time per second.
> 
> Isn't seq_buf_alloc() similarly exposed? And ipc_alloc()?

Possibly.

We don't have a library function (attempting kmalloc(), fallback to
vmalloc() presumably to avoid abuses, but I guess some patches were
accepted without thinking about this.




--
To unsubscribe, send a message with 'unsubscribe linux-mm' in
the body to majordomo@kvack.org.  For more info on Linux MM,
see: http://www.linux-mm.org/ .
Don't email: <a href=mailto:"dont@kvack.org"> email@kvack.org </a>

^ permalink raw reply	[flat|nested] 38+ messages in thread

* Re: [PATCH v2] fs/select: add vmalloc fallback for select(2)
@ 2016-09-22 17:07       ` Eric Dumazet
  0 siblings, 0 replies; 38+ messages in thread
From: Eric Dumazet @ 2016-09-22 17:07 UTC (permalink / raw)
  To: Vlastimil Babka
  Cc: Alexander Viro, Andrew Morton, linux-fsdevel, linux-kernel,
	linux-mm, Michal Hocko, netdev

On Thu, 2016-09-22 at 18:56 +0200, Vlastimil Babka wrote:
> On 09/22/2016 06:49 PM, Eric Dumazet wrote:
> > On Thu, 2016-09-22 at 18:43 +0200, Vlastimil Babka wrote:
> >> The select(2) syscall performs a kmalloc(size, GFP_KERNEL) where size grows
> >> with the number of fds passed. We had a customer report page allocation
> >> failures of order-4 for this allocation. This is a costly order, so it might
> >> easily fail, as the VM expects such allocation to have a lower-order fallback.
> >>
> >> Such trivial fallback is vmalloc(), as the memory doesn't have to be
> >> physically contiguous. Also the allocation is temporary for the duration of the
> >> syscall, so it's unlikely to stress vmalloc too much.
> >
> > vmalloc() uses a vmap_area_lock spinlock, and TLB flushes.
> >
> > So I guess allowing vmalloc() being called from an innocent application
> > doing a select() might be dangerous, especially if this select() happens
> > thousands of time per second.
> 
> Isn't seq_buf_alloc() similarly exposed? And ipc_alloc()?

Possibly.

We don't have a library function (attempting kmalloc(), fallback to
vmalloc() presumably to avoid abuses, but I guess some patches were
accepted without thinking about this.




--
To unsubscribe, send a message with 'unsubscribe linux-mm' in
the body to majordomo@kvack.org.  For more info on Linux MM,
see: http://www.linux-mm.org/ .
Don't email: <a href=mailto:"dont@kvack.org"> email@kvack.org </a>

^ permalink raw reply	[flat|nested] 38+ messages in thread

* Re: [PATCH v2] fs/select: add vmalloc fallback for select(2)
  2016-09-22 17:07       ` Eric Dumazet
@ 2016-09-22 17:55         ` Vlastimil Babka
  -1 siblings, 0 replies; 38+ messages in thread
From: Vlastimil Babka @ 2016-09-22 17:55 UTC (permalink / raw)
  To: Eric Dumazet
  Cc: Alexander Viro, Andrew Morton, linux-fsdevel, linux-kernel,
	linux-mm, Michal Hocko, netdev, Linux API, linux-man

On 09/22/2016 07:07 PM, Eric Dumazet wrote:
> On Thu, 2016-09-22 at 18:56 +0200, Vlastimil Babka wrote:
>> On 09/22/2016 06:49 PM, Eric Dumazet wrote:
>> > On Thu, 2016-09-22 at 18:43 +0200, Vlastimil Babka wrote:
>> >> The select(2) syscall performs a kmalloc(size, GFP_KERNEL) where size grows
>> >> with the number of fds passed. We had a customer report page allocation
>> >> failures of order-4 for this allocation. This is a costly order, so it might
>> >> easily fail, as the VM expects such allocation to have a lower-order fallback.
>> >>
>> >> Such trivial fallback is vmalloc(), as the memory doesn't have to be
>> >> physically contiguous. Also the allocation is temporary for the duration of the
>> >> syscall, so it's unlikely to stress vmalloc too much.
>> >
>> > vmalloc() uses a vmap_area_lock spinlock, and TLB flushes.
>> >
>> > So I guess allowing vmalloc() being called from an innocent application
>> > doing a select() might be dangerous, especially if this select() happens
>> > thousands of time per second.
>>
>> Isn't seq_buf_alloc() similarly exposed? And ipc_alloc()?
>
> Possibly.
>
> We don't have a library function (attempting kmalloc(), fallback to
> vmalloc() presumably to avoid abuses, but I guess some patches were
> accepted without thinking about this.

So in the case of select() it seems like the memory we need 6 bits per file 
descriptor, multiplied by the highest possible file descriptor (nfds) as passed 
to the syscall. According to the man page of select:

        EINVAL nfds is negative or exceeds the RLIMIT_NOFILE resource limit (see 
getrlimit(2)).

The code actually seems to silently cap the value instead of returning EINVAL 
though? (IIUC):

        /* max_fds can increase, so grab it once to avoid race */
         rcu_read_lock();
         fdt = files_fdtable(current->files);
         max_fds = fdt->max_fds;
         rcu_read_unlock();
         if (n > max_fds)
                 n = max_fds;

The default for this cap seems to be 1024 where I checked (again, IIUC, it's 
what ulimit -n returns?). I wasn't able to change it to more than 2048, which 
makes the bitmaps still below PAGE_SIZE.

So if I get that right, the system admin would have to allow really large 
RLIMIT_NOFILE to even make vmalloc() possible here. So I don't see it as a large 
concern?

Vlastimil

^ permalink raw reply	[flat|nested] 38+ messages in thread

* Re: [PATCH v2] fs/select: add vmalloc fallback for select(2)
@ 2016-09-22 17:55         ` Vlastimil Babka
  0 siblings, 0 replies; 38+ messages in thread
From: Vlastimil Babka @ 2016-09-22 17:55 UTC (permalink / raw)
  To: Eric Dumazet
  Cc: Alexander Viro, Andrew Morton, linux-fsdevel, linux-kernel,
	linux-mm, Michal Hocko, netdev, Linux API, linux-man

On 09/22/2016 07:07 PM, Eric Dumazet wrote:
> On Thu, 2016-09-22 at 18:56 +0200, Vlastimil Babka wrote:
>> On 09/22/2016 06:49 PM, Eric Dumazet wrote:
>> > On Thu, 2016-09-22 at 18:43 +0200, Vlastimil Babka wrote:
>> >> The select(2) syscall performs a kmalloc(size, GFP_KERNEL) where size grows
>> >> with the number of fds passed. We had a customer report page allocation
>> >> failures of order-4 for this allocation. This is a costly order, so it might
>> >> easily fail, as the VM expects such allocation to have a lower-order fallback.
>> >>
>> >> Such trivial fallback is vmalloc(), as the memory doesn't have to be
>> >> physically contiguous. Also the allocation is temporary for the duration of the
>> >> syscall, so it's unlikely to stress vmalloc too much.
>> >
>> > vmalloc() uses a vmap_area_lock spinlock, and TLB flushes.
>> >
>> > So I guess allowing vmalloc() being called from an innocent application
>> > doing a select() might be dangerous, especially if this select() happens
>> > thousands of time per second.
>>
>> Isn't seq_buf_alloc() similarly exposed? And ipc_alloc()?
>
> Possibly.
>
> We don't have a library function (attempting kmalloc(), fallback to
> vmalloc() presumably to avoid abuses, but I guess some patches were
> accepted without thinking about this.

So in the case of select() it seems like the memory we need 6 bits per file 
descriptor, multiplied by the highest possible file descriptor (nfds) as passed 
to the syscall. According to the man page of select:

        EINVAL nfds is negative or exceeds the RLIMIT_NOFILE resource limit (see 
getrlimit(2)).

The code actually seems to silently cap the value instead of returning EINVAL 
though? (IIUC):

        /* max_fds can increase, so grab it once to avoid race */
         rcu_read_lock();
         fdt = files_fdtable(current->files);
         max_fds = fdt->max_fds;
         rcu_read_unlock();
         if (n > max_fds)
                 n = max_fds;

The default for this cap seems to be 1024 where I checked (again, IIUC, it's 
what ulimit -n returns?). I wasn't able to change it to more than 2048, which 
makes the bitmaps still below PAGE_SIZE.

So if I get that right, the system admin would have to allow really large 
RLIMIT_NOFILE to even make vmalloc() possible here. So I don't see it as a large 
concern?

Vlastimil

--
To unsubscribe, send a message with 'unsubscribe linux-mm' in
the body to majordomo@kvack.org.  For more info on Linux MM,
see: http://www.linux-mm.org/ .
Don't email: <a href=mailto:"dont@kvack.org"> email@kvack.org </a>

^ permalink raw reply	[flat|nested] 38+ messages in thread

* RE: [PATCH v2] fs/select: add vmalloc fallback for select(2)
  2016-09-22 17:55         ` Vlastimil Babka
  (?)
@ 2016-09-23  9:42         ` David Laight
  2016-09-23  9:58             ` Vlastimil Babka
  -1 siblings, 1 reply; 38+ messages in thread
From: David Laight @ 2016-09-23  9:42 UTC (permalink / raw)
  To: 'Vlastimil Babka', Eric Dumazet
  Cc: Alexander Viro, Andrew Morton, linux-fsdevel, linux-kernel,
	linux-mm, Michal Hocko, netdev, Linux API, linux-man

From: Vlastimil Babka
> Sent: 22 September 2016 18:55
...
> So in the case of select() it seems like the memory we need 6 bits per file
> descriptor, multiplied by the highest possible file descriptor (nfds) as passed
> to the syscall. According to the man page of select:
> 
>         EINVAL nfds is negative or exceeds the RLIMIT_NOFILE resource limit (see
> getrlimit(2)).

That second clause is relatively recent.

> The code actually seems to silently cap the value instead of returning EINVAL
> though? (IIUC):
> 
>         /* max_fds can increase, so grab it once to avoid race */
>          rcu_read_lock();
>          fdt = files_fdtable(current->files);
>          max_fds = fdt->max_fds;
>          rcu_read_unlock();
>          if (n > max_fds)
>                  n = max_fds;
> 
> The default for this cap seems to be 1024 where I checked (again, IIUC, it's
> what ulimit -n returns?). I wasn't able to change it to more than 2048, which
> makes the bitmaps still below PAGE_SIZE.
> 
> So if I get that right, the system admin would have to allow really large
> RLIMIT_NOFILE to even make vmalloc() possible here. So I don't see it as a large
> concern?

4k open files isn't that many.
Especially for programs that are using pipes to emulate windows events.

I suspect that fdt->max_fds is an upper bound for the highest fd the
process has open - not the RLIMIT_NOFILE value.
select() shouldn't be silently ignoring large values of 'n' unless
the fd_set bits are zero.

Of course, select does scale well for high numbered fds
and neither poll nor select scale well for large numbers of fds.

	David

^ permalink raw reply	[flat|nested] 38+ messages in thread

* Re: [PATCH v2] fs/select: add vmalloc fallback for select(2)
@ 2016-09-23  9:58             ` Vlastimil Babka
  0 siblings, 0 replies; 38+ messages in thread
From: Vlastimil Babka @ 2016-09-23  9:58 UTC (permalink / raw)
  To: David Laight, Eric Dumazet
  Cc: Alexander Viro, Andrew Morton, linux-fsdevel, linux-kernel,
	linux-mm, Michal Hocko, netdev, Linux API, linux-man

On 09/23/2016 11:42 AM, David Laight wrote:
> From: Vlastimil Babka
>> Sent: 22 September 2016 18:55
> ...
>> So in the case of select() it seems like the memory we need 6 bits per file
>> descriptor, multiplied by the highest possible file descriptor (nfds) as passed
>> to the syscall. According to the man page of select:
>>
>>         EINVAL nfds is negative or exceeds the RLIMIT_NOFILE resource limit (see
>> getrlimit(2)).
> 
> That second clause is relatively recent.

Interesting... so it was added without actually being true in the kernel
code?

>> The code actually seems to silently cap the value instead of returning EINVAL
>> though? (IIUC):
>>
>>         /* max_fds can increase, so grab it once to avoid race */
>>          rcu_read_lock();
>>          fdt = files_fdtable(current->files);
>>          max_fds = fdt->max_fds;
>>          rcu_read_unlock();
>>          if (n > max_fds)
>>                  n = max_fds;
>>
>> The default for this cap seems to be 1024 where I checked (again, IIUC, it's
>> what ulimit -n returns?). I wasn't able to change it to more than 2048, which
>> makes the bitmaps still below PAGE_SIZE.
>>
>> So if I get that right, the system admin would have to allow really large
>> RLIMIT_NOFILE to even make vmalloc() possible here. So I don't see it as a large
>> concern?
> 
> 4k open files isn't that many.
> Especially for programs that are using pipes to emulate windows events.

Sure but IIUC we need 6 bits per file. That means up to almost 42k
files, we should fit into order-3 allocation, which effectively cannot
fail right now.

> I suspect that fdt->max_fds is an upper bound for the highest fd the
> process has open - not the RLIMIT_NOFILE value.

I gathered that the highest fd effectively limits the number of files,
so it's the same. I might be wrong.

> select() shouldn't be silently ignoring large values of 'n' unless
> the fd_set bits are zero.

Yeah that doesn't seem to conform to the manpage.

> Of course, select does scale well for high numbered fds
> and neither poll nor select scale well for large numbers of fds.

True.

> 	David
> 

^ permalink raw reply	[flat|nested] 38+ messages in thread

* Re: [PATCH v2] fs/select: add vmalloc fallback for select(2)
@ 2016-09-23  9:58             ` Vlastimil Babka
  0 siblings, 0 replies; 38+ messages in thread
From: Vlastimil Babka @ 2016-09-23  9:58 UTC (permalink / raw)
  To: David Laight, Eric Dumazet
  Cc: Alexander Viro, Andrew Morton,
	linux-fsdevel-u79uwXL29TY76Z2rM5mHXA,
	linux-kernel-u79uwXL29TY76Z2rM5mHXA,
	linux-mm-Bw31MaZKKs3YtjvyW6yDsg, Michal Hocko,
	netdev-u79uwXL29TY76Z2rM5mHXA, Linux API,
	linux-man-u79uwXL29TY76Z2rM5mHXA

On 09/23/2016 11:42 AM, David Laight wrote:
> From: Vlastimil Babka
>> Sent: 22 September 2016 18:55
> ...
>> So in the case of select() it seems like the memory we need 6 bits per file
>> descriptor, multiplied by the highest possible file descriptor (nfds) as passed
>> to the syscall. According to the man page of select:
>>
>>         EINVAL nfds is negative or exceeds the RLIMIT_NOFILE resource limit (see
>> getrlimit(2)).
> 
> That second clause is relatively recent.

Interesting... so it was added without actually being true in the kernel
code?

>> The code actually seems to silently cap the value instead of returning EINVAL
>> though? (IIUC):
>>
>>         /* max_fds can increase, so grab it once to avoid race */
>>          rcu_read_lock();
>>          fdt = files_fdtable(current->files);
>>          max_fds = fdt->max_fds;
>>          rcu_read_unlock();
>>          if (n > max_fds)
>>                  n = max_fds;
>>
>> The default for this cap seems to be 1024 where I checked (again, IIUC, it's
>> what ulimit -n returns?). I wasn't able to change it to more than 2048, which
>> makes the bitmaps still below PAGE_SIZE.
>>
>> So if I get that right, the system admin would have to allow really large
>> RLIMIT_NOFILE to even make vmalloc() possible here. So I don't see it as a large
>> concern?
> 
> 4k open files isn't that many.
> Especially for programs that are using pipes to emulate windows events.

Sure but IIUC we need 6 bits per file. That means up to almost 42k
files, we should fit into order-3 allocation, which effectively cannot
fail right now.

> I suspect that fdt->max_fds is an upper bound for the highest fd the
> process has open - not the RLIMIT_NOFILE value.

I gathered that the highest fd effectively limits the number of files,
so it's the same. I might be wrong.

> select() shouldn't be silently ignoring large values of 'n' unless
> the fd_set bits are zero.

Yeah that doesn't seem to conform to the manpage.

> Of course, select does scale well for high numbered fds
> and neither poll nor select scale well for large numbers of fds.

True.

> 	David
> 

^ permalink raw reply	[flat|nested] 38+ messages in thread

* Re: [PATCH v2] fs/select: add vmalloc fallback for select(2)
@ 2016-09-23  9:58             ` Vlastimil Babka
  0 siblings, 0 replies; 38+ messages in thread
From: Vlastimil Babka @ 2016-09-23  9:58 UTC (permalink / raw)
  To: David Laight, Eric Dumazet
  Cc: Alexander Viro, Andrew Morton, linux-fsdevel, linux-kernel,
	linux-mm, Michal Hocko, netdev, Linux API, linux-man

On 09/23/2016 11:42 AM, David Laight wrote:
> From: Vlastimil Babka
>> Sent: 22 September 2016 18:55
> ...
>> So in the case of select() it seems like the memory we need 6 bits per file
>> descriptor, multiplied by the highest possible file descriptor (nfds) as passed
>> to the syscall. According to the man page of select:
>>
>>         EINVAL nfds is negative or exceeds the RLIMIT_NOFILE resource limit (see
>> getrlimit(2)).
> 
> That second clause is relatively recent.

Interesting... so it was added without actually being true in the kernel
code?

>> The code actually seems to silently cap the value instead of returning EINVAL
>> though? (IIUC):
>>
>>         /* max_fds can increase, so grab it once to avoid race */
>>          rcu_read_lock();
>>          fdt = files_fdtable(current->files);
>>          max_fds = fdt->max_fds;
>>          rcu_read_unlock();
>>          if (n > max_fds)
>>                  n = max_fds;
>>
>> The default for this cap seems to be 1024 where I checked (again, IIUC, it's
>> what ulimit -n returns?). I wasn't able to change it to more than 2048, which
>> makes the bitmaps still below PAGE_SIZE.
>>
>> So if I get that right, the system admin would have to allow really large
>> RLIMIT_NOFILE to even make vmalloc() possible here. So I don't see it as a large
>> concern?
> 
> 4k open files isn't that many.
> Especially for programs that are using pipes to emulate windows events.

Sure but IIUC we need 6 bits per file. That means up to almost 42k
files, we should fit into order-3 allocation, which effectively cannot
fail right now.

> I suspect that fdt->max_fds is an upper bound for the highest fd the
> process has open - not the RLIMIT_NOFILE value.

I gathered that the highest fd effectively limits the number of files,
so it's the same. I might be wrong.

> select() shouldn't be silently ignoring large values of 'n' unless
> the fd_set bits are zero.

Yeah that doesn't seem to conform to the manpage.

> Of course, select does scale well for high numbered fds
> and neither poll nor select scale well for large numbers of fds.

True.

> 	David
> 

--
To unsubscribe, send a message with 'unsubscribe linux-mm' in
the body to majordomo@kvack.org.  For more info on Linux MM,
see: http://www.linux-mm.org/ .
Don't email: <a href=mailto:"dont@kvack.org"> email@kvack.org </a>

^ permalink raw reply	[flat|nested] 38+ messages in thread

* RE: [PATCH v2] fs/select: add vmalloc fallback for select(2)
  2016-09-23  9:58             ` Vlastimil Babka
  (?)
  (?)
@ 2016-09-23 13:35             ` David Laight
  2016-09-26 10:01                 ` Vlastimil Babka
  -1 siblings, 1 reply; 38+ messages in thread
From: David Laight @ 2016-09-23 13:35 UTC (permalink / raw)
  To: 'Vlastimil Babka', Eric Dumazet
  Cc: Alexander Viro, Andrew Morton, linux-fsdevel, linux-kernel,
	linux-mm, Michal Hocko, netdev, Linux API, linux-man

From: Vlastimil Babka
> Sent: 23 September 2016 10:59
...
> > I suspect that fdt->max_fds is an upper bound for the highest fd the
> > process has open - not the RLIMIT_NOFILE value.
> 
> I gathered that the highest fd effectively limits the number of files,
> so it's the same. I might be wrong.

An application can reduce RLIMIT_NOFILE below that of an open file.

	David

^ permalink raw reply	[flat|nested] 38+ messages in thread

* Re: [PATCH v2] fs/select: add vmalloc fallback for select(2)
  2016-09-22 16:49   ` Eric Dumazet
@ 2016-09-25 18:50     ` Andi Kleen
  -1 siblings, 0 replies; 38+ messages in thread
From: Andi Kleen @ 2016-09-25 18:50 UTC (permalink / raw)
  To: linux-mm; +Cc: linux-fsdevel, linux-kernel, netdev

Eric Dumazet <eric.dumazet@gmail.com> writes:

> On Thu, 2016-09-22 at 18:43 +0200, Vlastimil Babka wrote:
>> The select(2) syscall performs a kmalloc(size, GFP_KERNEL) where size grows
>> with the number of fds passed. We had a customer report page allocation
>> failures of order-4 for this allocation. This is a costly order, so it might
>> easily fail, as the VM expects such allocation to have a lower-order fallback.
>> 
>> Such trivial fallback is vmalloc(), as the memory doesn't have to be
>> physically contiguous. Also the allocation is temporary for the duration of the
>> syscall, so it's unlikely to stress vmalloc too much.
>
> vmalloc() uses a vmap_area_lock spinlock, and TLB flushes.
>
> So I guess allowing vmalloc() being called from an innocent application
> doing a select() might be dangerous, especially if this select() happens
> thousands of time per second.

Yes it seems like a bad idea because of all the scaling problems here.

The right solution would be to fix select to use multiple
non virtually contiguous pages.

-Andi

--
To unsubscribe, send a message with 'unsubscribe linux-mm' in
the body to majordomo@kvack.org.  For more info on Linux MM,
see: http://www.linux-mm.org/ .
Don't email: <a href=mailto:"dont@kvack.org"> email@kvack.org </a>

^ permalink raw reply	[flat|nested] 38+ messages in thread

* Re: [PATCH v2] fs/select: add vmalloc fallback for select(2)
@ 2016-09-25 18:50     ` Andi Kleen
  0 siblings, 0 replies; 38+ messages in thread
From: Andi Kleen @ 2016-09-25 18:50 UTC (permalink / raw)
  To: linux-mm; +Cc: linux-fsdevel, linux-kernel, netdev

Eric Dumazet <eric.dumazet@gmail.com> writes:

> On Thu, 2016-09-22 at 18:43 +0200, Vlastimil Babka wrote:
>> The select(2) syscall performs a kmalloc(size, GFP_KERNEL) where size grows
>> with the number of fds passed. We had a customer report page allocation
>> failures of order-4 for this allocation. This is a costly order, so it might
>> easily fail, as the VM expects such allocation to have a lower-order fallback.
>> 
>> Such trivial fallback is vmalloc(), as the memory doesn't have to be
>> physically contiguous. Also the allocation is temporary for the duration of the
>> syscall, so it's unlikely to stress vmalloc too much.
>
> vmalloc() uses a vmap_area_lock spinlock, and TLB flushes.
>
> So I guess allowing vmalloc() being called from an innocent application
> doing a select() might be dangerous, especially if this select() happens
> thousands of time per second.

Yes it seems like a bad idea because of all the scaling problems here.

The right solution would be to fix select to use multiple
non virtually contiguous pages.

-Andi

--
To unsubscribe, send a message with 'unsubscribe linux-mm' in
the body to majordomo@kvack.org.  For more info on Linux MM,
see: http://www.linux-mm.org/ .
Don't email: <a href=mailto:"dont@kvack.org"> email@kvack.org </a>

^ permalink raw reply	[flat|nested] 38+ messages in thread

* Re: [PATCH v2] fs/select: add vmalloc fallback for select(2)
  2016-09-23 13:35             ` David Laight
@ 2016-09-26 10:01                 ` Vlastimil Babka
  0 siblings, 0 replies; 38+ messages in thread
From: Vlastimil Babka @ 2016-09-26 10:01 UTC (permalink / raw)
  To: David Laight, Eric Dumazet
  Cc: Alexander Viro, Andrew Morton, linux-fsdevel, linux-kernel,
	linux-mm, Michal Hocko, netdev, Linux API, linux-man

On 09/23/2016 03:35 PM, David Laight wrote:
> From: Vlastimil Babka
>> Sent: 23 September 2016 10:59
> ...
>> > I suspect that fdt->max_fds is an upper bound for the highest fd the
>> > process has open - not the RLIMIT_NOFILE value.
>>
>> I gathered that the highest fd effectively limits the number of files,
>> so it's the same. I might be wrong.
>
> An application can reduce RLIMIT_NOFILE below that of an open file.

OK, I did some more digging in the code, and my understanding is that:

- fdt->max_fds is the current size of the fdtable, which isn't allocated upfront 
to match the limit, but grows as needed. This means it's OK for 
core_sys_select() to silently cap nfds, as it knows there are no fd's with 
higher number in the fdtable, so it's a performance optimization. However, to 
match what the manpage says, there should be another check against RLIMIT_NOFILE 
to return -EINVAL, which there isn't, AFAICS.

- fdtable is expanded (and fdt->max_fds bumped) by 
expand_files()->expand_fdtable() which checks against fs.nr_open sysctl, which 
seems to be 1048576 where I checked.

- callers of expand_files(), such as dup(), check the rlimit(RLIMIT_NOFILE) to 
limit the expansion.

So yeah, application can reduce RLIMIT_NOFILE, but it has no effect on fdtable 
and fdt->max_fds that is already above the limit. Select syscall would have to 
check the rlimit to conform to the manpage. Or (rather?) we should fix the manpage.

As for the original vmalloc() flood concern, I still think we're safe, as 
ordinary users are limited by RLIMIT_NOFILE way below sizes that would need 
vmalloc(), and root has many other options to DOS the system (or worse).

Vlastimil

> 	David
>

^ permalink raw reply	[flat|nested] 38+ messages in thread

* Re: [PATCH v2] fs/select: add vmalloc fallback for select(2)
@ 2016-09-26 10:01                 ` Vlastimil Babka
  0 siblings, 0 replies; 38+ messages in thread
From: Vlastimil Babka @ 2016-09-26 10:01 UTC (permalink / raw)
  To: David Laight, Eric Dumazet
  Cc: Alexander Viro, Andrew Morton, linux-fsdevel, linux-kernel,
	linux-mm, Michal Hocko, netdev, Linux API, linux-man

On 09/23/2016 03:35 PM, David Laight wrote:
> From: Vlastimil Babka
>> Sent: 23 September 2016 10:59
> ...
>> > I suspect that fdt->max_fds is an upper bound for the highest fd the
>> > process has open - not the RLIMIT_NOFILE value.
>>
>> I gathered that the highest fd effectively limits the number of files,
>> so it's the same. I might be wrong.
>
> An application can reduce RLIMIT_NOFILE below that of an open file.

OK, I did some more digging in the code, and my understanding is that:

- fdt->max_fds is the current size of the fdtable, which isn't allocated upfront 
to match the limit, but grows as needed. This means it's OK for 
core_sys_select() to silently cap nfds, as it knows there are no fd's with 
higher number in the fdtable, so it's a performance optimization. However, to 
match what the manpage says, there should be another check against RLIMIT_NOFILE 
to return -EINVAL, which there isn't, AFAICS.

- fdtable is expanded (and fdt->max_fds bumped) by 
expand_files()->expand_fdtable() which checks against fs.nr_open sysctl, which 
seems to be 1048576 where I checked.

- callers of expand_files(), such as dup(), check the rlimit(RLIMIT_NOFILE) to 
limit the expansion.

So yeah, application can reduce RLIMIT_NOFILE, but it has no effect on fdtable 
and fdt->max_fds that is already above the limit. Select syscall would have to 
check the rlimit to conform to the manpage. Or (rather?) we should fix the manpage.

As for the original vmalloc() flood concern, I still think we're safe, as 
ordinary users are limited by RLIMIT_NOFILE way below sizes that would need 
vmalloc(), and root has many other options to DOS the system (or worse).

Vlastimil

> 	David
>

--
To unsubscribe, send a message with 'unsubscribe linux-mm' in
the body to majordomo@kvack.org.  For more info on Linux MM,
see: http://www.linux-mm.org/ .
Don't email: <a href=mailto:"dont@kvack.org"> email@kvack.org </a>

^ permalink raw reply	[flat|nested] 38+ messages in thread

* RE: [PATCH v2] fs/select: add vmalloc fallback for select(2)
  2016-09-26 10:01                 ` Vlastimil Babka
  (?)
@ 2016-09-26 15:02                 ` David Laight
  -1 siblings, 0 replies; 38+ messages in thread
From: David Laight @ 2016-09-26 15:02 UTC (permalink / raw)
  To: 'Vlastimil Babka', Eric Dumazet
  Cc: Alexander Viro, Andrew Morton, linux-fsdevel, linux-kernel,
	linux-mm, Michal Hocko, netdev, Linux API, linux-man

From: Vlastimil Babka
> Sent: 26 September 2016 11:02
> On 09/23/2016 03:35 PM, David Laight wrote:
> > From: Vlastimil Babka
> >> Sent: 23 September 2016 10:59
> > ...
> >> > I suspect that fdt->max_fds is an upper bound for the highest fd the
> >> > process has open - not the RLIMIT_NOFILE value.
> >>
> >> I gathered that the highest fd effectively limits the number of files,
> >> so it's the same. I might be wrong.
> >
> > An application can reduce RLIMIT_NOFILE below that of an open file.
> 
> OK, I did some more digging in the code, and my understanding is that:
> 
> - fdt->max_fds is the current size of the fdtable, which isn't allocated upfront
> to match the limit, but grows as needed. This means it's OK for
> core_sys_select() to silently cap nfds, as it knows there are no fd's with
> higher number in the fdtable, so it's a performance optimization.

Not entirely, if any bits are set for fd above fdt->max_fds then select()
call should fail - fd not open.

> However, to
> match what the manpage says, there should be another check against RLIMIT_NOFILE
> to return -EINVAL, which there isn't, AFAICS.
> 
> - fdtable is expanded (and fdt->max_fds bumped) by
> expand_files()->expand_fdtable() which checks against fs.nr_open sysctl, which
> seems to be 1048576 where I checked.
> 
> - callers of expand_files(), such as dup(), check the rlimit(RLIMIT_NOFILE) to
> limit the expansion.
> 
> So yeah, application can reduce RLIMIT_NOFILE, but it has no effect on fdtable
> and fdt->max_fds that is already above the limit. Select syscall would have to
> check the rlimit to conform to the manpage. Or (rather?) we should fix the manpage.

I think the manpage should be fixed (delete that clause).
Then add code to the system call to scan the high bit sets (above fdt->max_fds)
for any non-zero bytes. This can be done into a small buffer.

> As for the original vmalloc() flood concern, I still think we're safe, as
> ordinary users are limited by RLIMIT_NOFILE way below sizes that would need
> vmalloc(), and root has many other options to DOS the system (or worse).

Some processes need very high numbers of fd.
Likely they don't use select() on them, but trashing performance if they
do is a bit silly.
Trying to slit the 3 masks first seems sensible.

	David

^ permalink raw reply	[flat|nested] 38+ messages in thread

* Re: [PATCH v2] fs/select: add vmalloc fallback for select(2)
  2016-09-22 16:43 ` Vlastimil Babka
@ 2016-09-27  0:01   ` Andrew Morton
  -1 siblings, 0 replies; 38+ messages in thread
From: Andrew Morton @ 2016-09-27  0:01 UTC (permalink / raw)
  To: Vlastimil Babka
  Cc: Alexander Viro, linux-fsdevel, linux-kernel, linux-mm,
	Michal Hocko, netdev, Eric Dumazet

On Thu, 22 Sep 2016 18:43:59 +0200 Vlastimil Babka <vbabka@suse.cz> wrote:

> The select(2) syscall performs a kmalloc(size, GFP_KERNEL) where size grows
> with the number of fds passed. We had a customer report page allocation
> failures of order-4 for this allocation. This is a costly order, so it might
> easily fail, as the VM expects such allocation to have a lower-order fallback.
> 
> Such trivial fallback is vmalloc(), as the memory doesn't have to be
> physically contiguous. Also the allocation is temporary for the duration of the
> syscall, so it's unlikely to stress vmalloc too much.
> 
> Note that the poll(2) syscall seems to use a linked list of order-0 pages, so
> it doesn't need this kind of fallback.
> 
> ...
>
> --- a/fs/select.c
> +++ b/fs/select.c
> @@ -29,6 +29,7 @@
>  #include <linux/sched/rt.h>
>  #include <linux/freezer.h>
>  #include <net/busy_poll.h>
> +#include <linux/vmalloc.h>
>  
>  #include <asm/uaccess.h>
>  
> @@ -558,6 +559,7 @@ int core_sys_select(int n, fd_set __user *inp, fd_set __user *outp,
>  	struct fdtable *fdt;
>  	/* Allocate small arguments on the stack to save memory and be faster */
>  	long stack_fds[SELECT_STACK_ALLOC/sizeof(long)];
> +	unsigned long alloc_size;
>  
>  	ret = -EINVAL;
>  	if (n < 0)
> @@ -580,8 +582,12 @@ int core_sys_select(int n, fd_set __user *inp, fd_set __user *outp,
>  	bits = stack_fds;
>  	if (size > sizeof(stack_fds) / 6) {
>  		/* Not enough space in on-stack array; must use kmalloc */
> +		alloc_size = 6 * size;

Well.  `size' is `unsigned'.  The multiplication will be done as 32-bit
so there was no point in making `alloc_size' unsigned long.

So can we tighten up the types in this function?  size_t might make
sense, but vmalloc() takes a ulong.

>  		ret = -ENOMEM;
> -		bits = kmalloc(6 * size, GFP_KERNEL);
> +		bits = kmalloc(alloc_size, GFP_KERNEL|__GFP_NOWARN);
> +		if (!bits && alloc_size > PAGE_SIZE)
> +			bits = vmalloc(alloc_size);

I don't share Eric's concerns about performance here.  If the vmalloc()
is called, we're about to write to that quite large amount of memory
which we just allocated, and the vmalloc() overhead will be relatively
low.

>  		if (!bits)
>  			goto out_nofds;
>  	}
> @@ -618,7 +624,7 @@ int core_sys_select(int n, fd_set __user *inp, fd_set __user *outp,
>  
>  out:
>  	if (bits != stack_fds)
> -		kfree(bits);
> +		kvfree(bits);
>  out_nofds:
>  	return ret;

It otherwise looks OK to me.

^ permalink raw reply	[flat|nested] 38+ messages in thread

* Re: [PATCH v2] fs/select: add vmalloc fallback for select(2)
@ 2016-09-27  0:01   ` Andrew Morton
  0 siblings, 0 replies; 38+ messages in thread
From: Andrew Morton @ 2016-09-27  0:01 UTC (permalink / raw)
  To: Vlastimil Babka
  Cc: Alexander Viro, linux-fsdevel, linux-kernel, linux-mm,
	Michal Hocko, netdev, Eric Dumazet

On Thu, 22 Sep 2016 18:43:59 +0200 Vlastimil Babka <vbabka@suse.cz> wrote:

> The select(2) syscall performs a kmalloc(size, GFP_KERNEL) where size grows
> with the number of fds passed. We had a customer report page allocation
> failures of order-4 for this allocation. This is a costly order, so it might
> easily fail, as the VM expects such allocation to have a lower-order fallback.
> 
> Such trivial fallback is vmalloc(), as the memory doesn't have to be
> physically contiguous. Also the allocation is temporary for the duration of the
> syscall, so it's unlikely to stress vmalloc too much.
> 
> Note that the poll(2) syscall seems to use a linked list of order-0 pages, so
> it doesn't need this kind of fallback.
> 
> ...
>
> --- a/fs/select.c
> +++ b/fs/select.c
> @@ -29,6 +29,7 @@
>  #include <linux/sched/rt.h>
>  #include <linux/freezer.h>
>  #include <net/busy_poll.h>
> +#include <linux/vmalloc.h>
>  
>  #include <asm/uaccess.h>
>  
> @@ -558,6 +559,7 @@ int core_sys_select(int n, fd_set __user *inp, fd_set __user *outp,
>  	struct fdtable *fdt;
>  	/* Allocate small arguments on the stack to save memory and be faster */
>  	long stack_fds[SELECT_STACK_ALLOC/sizeof(long)];
> +	unsigned long alloc_size;
>  
>  	ret = -EINVAL;
>  	if (n < 0)
> @@ -580,8 +582,12 @@ int core_sys_select(int n, fd_set __user *inp, fd_set __user *outp,
>  	bits = stack_fds;
>  	if (size > sizeof(stack_fds) / 6) {
>  		/* Not enough space in on-stack array; must use kmalloc */
> +		alloc_size = 6 * size;

Well.  `size' is `unsigned'.  The multiplication will be done as 32-bit
so there was no point in making `alloc_size' unsigned long.

So can we tighten up the types in this function?  size_t might make
sense, but vmalloc() takes a ulong.

>  		ret = -ENOMEM;
> -		bits = kmalloc(6 * size, GFP_KERNEL);
> +		bits = kmalloc(alloc_size, GFP_KERNEL|__GFP_NOWARN);
> +		if (!bits && alloc_size > PAGE_SIZE)
> +			bits = vmalloc(alloc_size);

I don't share Eric's concerns about performance here.  If the vmalloc()
is called, we're about to write to that quite large amount of memory
which we just allocated, and the vmalloc() overhead will be relatively
low.

>  		if (!bits)
>  			goto out_nofds;
>  	}
> @@ -618,7 +624,7 @@ int core_sys_select(int n, fd_set __user *inp, fd_set __user *outp,
>  
>  out:
>  	if (bits != stack_fds)
> -		kfree(bits);
> +		kvfree(bits);
>  out_nofds:
>  	return ret;

It otherwise looks OK to me.

--
To unsubscribe, send a message with 'unsubscribe linux-mm' in
the body to majordomo@kvack.org.  For more info on Linux MM,
see: http://www.linux-mm.org/ .
Don't email: <a href=mailto:"dont@kvack.org"> email@kvack.org </a>

^ permalink raw reply	[flat|nested] 38+ messages in thread

* Re: [PATCH v2] fs/select: add vmalloc fallback for select(2)
  2016-09-27  0:01   ` Andrew Morton
  (?)
@ 2016-09-27  1:38     ` Eric Dumazet
  -1 siblings, 0 replies; 38+ messages in thread
From: Eric Dumazet @ 2016-09-27  1:38 UTC (permalink / raw)
  To: Andrew Morton
  Cc: Vlastimil Babka, Alexander Viro, linux-fsdevel, linux-kernel,
	linux-mm, Michal Hocko, netdev

On Mon, 2016-09-26 at 17:01 -0700, Andrew Morton wrote:

> I don't share Eric's concerns about performance here.  If the vmalloc()
> is called, we're about to write to that quite large amount of memory
> which we just allocated, and the vmalloc() overhead will be relatively
> low.

I did not care of the performance of this particular select() system
call really, but other cpus because of more TLB invalidations.

At least CONFIG_DEBUG_PAGEALLOC=y builds should be impacted, but maybe
we do not care.

^ permalink raw reply	[flat|nested] 38+ messages in thread

* Re: [PATCH v2] fs/select: add vmalloc fallback for select(2)
@ 2016-09-27  1:38     ` Eric Dumazet
  0 siblings, 0 replies; 38+ messages in thread
From: Eric Dumazet @ 2016-09-27  1:38 UTC (permalink / raw)
  To: Andrew Morton
  Cc: Vlastimil Babka, Alexander Viro, linux-fsdevel, linux-kernel,
	linux-mm, Michal Hocko, netdev

On Mon, 2016-09-26 at 17:01 -0700, Andrew Morton wrote:

> I don't share Eric's concerns about performance here.  If the vmalloc()
> is called, we're about to write to that quite large amount of memory
> which we just allocated, and the vmalloc() overhead will be relatively
> low.

I did not care of the performance of this particular select() system
call really, but other cpus because of more TLB invalidations.

At least CONFIG_DEBUG_PAGEALLOC=y builds should be impacted, but maybe
we do not care.


--
To unsubscribe, send a message with 'unsubscribe linux-mm' in
the body to majordomo@kvack.org.  For more info on Linux MM,
see: http://www.linux-mm.org/ .
Don't email: <a href=mailto:"dont@kvack.org"> email@kvack.org </a>

^ permalink raw reply	[flat|nested] 38+ messages in thread

* Re: [PATCH v2] fs/select: add vmalloc fallback for select(2)
@ 2016-09-27  1:38     ` Eric Dumazet
  0 siblings, 0 replies; 38+ messages in thread
From: Eric Dumazet @ 2016-09-27  1:38 UTC (permalink / raw)
  To: Andrew Morton
  Cc: Vlastimil Babka, Alexander Viro, linux-fsdevel, linux-kernel,
	linux-mm, Michal Hocko, netdev

On Mon, 2016-09-26 at 17:01 -0700, Andrew Morton wrote:

> I don't share Eric's concerns about performance here.  If the vmalloc()
> is called, we're about to write to that quite large amount of memory
> which we just allocated, and the vmalloc() overhead will be relatively
> low.

I did not care of the performance of this particular select() system
call really, but other cpus because of more TLB invalidations.

At least CONFIG_DEBUG_PAGEALLOC=y builds should be impacted, but maybe
we do not care.


--
To unsubscribe, send a message with 'unsubscribe linux-mm' in
the body to majordomo@kvack.org.  For more info on Linux MM,
see: http://www.linux-mm.org/ .
Don't email: <a href=mailto:"dont@kvack.org"> email@kvack.org </a>

^ permalink raw reply	[flat|nested] 38+ messages in thread

* Re: [PATCH v2] fs/select: add vmalloc fallback for select(2)
  2016-09-27  0:01   ` Andrew Morton
@ 2016-09-27  8:06     ` Vlastimil Babka
  -1 siblings, 0 replies; 38+ messages in thread
From: Vlastimil Babka @ 2016-09-27  8:06 UTC (permalink / raw)
  To: Andrew Morton
  Cc: Alexander Viro, linux-fsdevel, linux-kernel, linux-mm,
	Michal Hocko, netdev, Eric Dumazet

On 09/27/2016 02:01 AM, Andrew Morton wrote:
> On Thu, 22 Sep 2016 18:43:59 +0200 Vlastimil Babka <vbabka@suse.cz> wrote:
>
>> The select(2) syscall performs a kmalloc(size, GFP_KERNEL) where size grows
>> with the number of fds passed. We had a customer report page allocation
>> failures of order-4 for this allocation. This is a costly order, so it might
>> easily fail, as the VM expects such allocation to have a lower-order fallback.
>>
>> Such trivial fallback is vmalloc(), as the memory doesn't have to be
>> physically contiguous. Also the allocation is temporary for the duration of the
>> syscall, so it's unlikely to stress vmalloc too much.
>>
>> Note that the poll(2) syscall seems to use a linked list of order-0 pages, so
>> it doesn't need this kind of fallback.
>>
>> ...
>>
>> --- a/fs/select.c
>> +++ b/fs/select.c
>> @@ -29,6 +29,7 @@
>>  #include <linux/sched/rt.h>
>>  #include <linux/freezer.h>
>>  #include <net/busy_poll.h>
>> +#include <linux/vmalloc.h>
>>
>>  #include <asm/uaccess.h>
>>
>> @@ -558,6 +559,7 @@ int core_sys_select(int n, fd_set __user *inp, fd_set __user *outp,
>>  	struct fdtable *fdt;
>>  	/* Allocate small arguments on the stack to save memory and be faster */
>>  	long stack_fds[SELECT_STACK_ALLOC/sizeof(long)];
>> +	unsigned long alloc_size;
>>
>>  	ret = -EINVAL;
>>  	if (n < 0)
>> @@ -580,8 +582,12 @@ int core_sys_select(int n, fd_set __user *inp, fd_set __user *outp,
>>  	bits = stack_fds;
>>  	if (size > sizeof(stack_fds) / 6) {
>>  		/* Not enough space in on-stack array; must use kmalloc */
>> +		alloc_size = 6 * size;
>
> Well.  `size' is `unsigned'.  The multiplication will be done as 32-bit
> so there was no point in making `alloc_size' unsigned long.

Uh, right. Thanks.

> So can we tighten up the types in this function?  size_t might make
> sense, but vmalloc() takes a ulong.

Let's do size_t then, as the conversion to ulong is safe.

^ permalink raw reply	[flat|nested] 38+ messages in thread

* Re: [PATCH v2] fs/select: add vmalloc fallback for select(2)
@ 2016-09-27  8:06     ` Vlastimil Babka
  0 siblings, 0 replies; 38+ messages in thread
From: Vlastimil Babka @ 2016-09-27  8:06 UTC (permalink / raw)
  To: Andrew Morton
  Cc: Alexander Viro, linux-fsdevel, linux-kernel, linux-mm,
	Michal Hocko, netdev, Eric Dumazet

On 09/27/2016 02:01 AM, Andrew Morton wrote:
> On Thu, 22 Sep 2016 18:43:59 +0200 Vlastimil Babka <vbabka@suse.cz> wrote:
>
>> The select(2) syscall performs a kmalloc(size, GFP_KERNEL) where size grows
>> with the number of fds passed. We had a customer report page allocation
>> failures of order-4 for this allocation. This is a costly order, so it might
>> easily fail, as the VM expects such allocation to have a lower-order fallback.
>>
>> Such trivial fallback is vmalloc(), as the memory doesn't have to be
>> physically contiguous. Also the allocation is temporary for the duration of the
>> syscall, so it's unlikely to stress vmalloc too much.
>>
>> Note that the poll(2) syscall seems to use a linked list of order-0 pages, so
>> it doesn't need this kind of fallback.
>>
>> ...
>>
>> --- a/fs/select.c
>> +++ b/fs/select.c
>> @@ -29,6 +29,7 @@
>>  #include <linux/sched/rt.h>
>>  #include <linux/freezer.h>
>>  #include <net/busy_poll.h>
>> +#include <linux/vmalloc.h>
>>
>>  #include <asm/uaccess.h>
>>
>> @@ -558,6 +559,7 @@ int core_sys_select(int n, fd_set __user *inp, fd_set __user *outp,
>>  	struct fdtable *fdt;
>>  	/* Allocate small arguments on the stack to save memory and be faster */
>>  	long stack_fds[SELECT_STACK_ALLOC/sizeof(long)];
>> +	unsigned long alloc_size;
>>
>>  	ret = -EINVAL;
>>  	if (n < 0)
>> @@ -580,8 +582,12 @@ int core_sys_select(int n, fd_set __user *inp, fd_set __user *outp,
>>  	bits = stack_fds;
>>  	if (size > sizeof(stack_fds) / 6) {
>>  		/* Not enough space in on-stack array; must use kmalloc */
>> +		alloc_size = 6 * size;
>
> Well.  `size' is `unsigned'.  The multiplication will be done as 32-bit
> so there was no point in making `alloc_size' unsigned long.

Uh, right. Thanks.

> So can we tighten up the types in this function?  size_t might make
> sense, but vmalloc() takes a ulong.

Let's do size_t then, as the conversion to ulong is safe.


--
To unsubscribe, send a message with 'unsubscribe linux-mm' in
the body to majordomo@kvack.org.  For more info on Linux MM,
see: http://www.linux-mm.org/ .
Don't email: <a href=mailto:"dont@kvack.org"> email@kvack.org </a>

^ permalink raw reply	[flat|nested] 38+ messages in thread

* Re: [PATCH v2] fs/select: add vmalloc fallback for select(2)
  2016-09-27  1:38     ` Eric Dumazet
@ 2016-09-27  8:13       ` Vlastimil Babka
  -1 siblings, 0 replies; 38+ messages in thread
From: Vlastimil Babka @ 2016-09-27  8:13 UTC (permalink / raw)
  To: Eric Dumazet, Andrew Morton
  Cc: Alexander Viro, linux-fsdevel, linux-kernel, linux-mm,
	Michal Hocko, netdev

On 09/27/2016 03:38 AM, Eric Dumazet wrote:
> On Mon, 2016-09-26 at 17:01 -0700, Andrew Morton wrote:
>
>> I don't share Eric's concerns about performance here.  If the vmalloc()
>> is called, we're about to write to that quite large amount of memory
>> which we just allocated, and the vmalloc() overhead will be relatively
>> low.
>
> I did not care of the performance of this particular select() system
> call really, but other cpus because of more TLB invalidations.

There are many other ways to cause those, AFAIK. The reclaim/compaction
for order-3 allocation has its own impact on system, including TLB flushes.
Or a flood of mmap(MAP_POPULATE) and madvise(MADV_DONTNEED) calls...
This vmalloc() would however require raising RLIMIT_NOFILE above the defaults.

> At least CONFIG_DEBUG_PAGEALLOC=y builds should be impacted, but maybe
> we do not care.

I doubt anyone runs that in production, especially if performance is of concern.

^ permalink raw reply	[flat|nested] 38+ messages in thread

* Re: [PATCH v2] fs/select: add vmalloc fallback for select(2)
@ 2016-09-27  8:13       ` Vlastimil Babka
  0 siblings, 0 replies; 38+ messages in thread
From: Vlastimil Babka @ 2016-09-27  8:13 UTC (permalink / raw)
  To: Eric Dumazet, Andrew Morton
  Cc: Alexander Viro, linux-fsdevel, linux-kernel, linux-mm,
	Michal Hocko, netdev

On 09/27/2016 03:38 AM, Eric Dumazet wrote:
> On Mon, 2016-09-26 at 17:01 -0700, Andrew Morton wrote:
>
>> I don't share Eric's concerns about performance here.  If the vmalloc()
>> is called, we're about to write to that quite large amount of memory
>> which we just allocated, and the vmalloc() overhead will be relatively
>> low.
>
> I did not care of the performance of this particular select() system
> call really, but other cpus because of more TLB invalidations.

There are many other ways to cause those, AFAIK. The reclaim/compaction
for order-3 allocation has its own impact on system, including TLB flushes.
Or a flood of mmap(MAP_POPULATE) and madvise(MADV_DONTNEED) calls...
This vmalloc() would however require raising RLIMIT_NOFILE above the defaults.

> At least CONFIG_DEBUG_PAGEALLOC=y builds should be impacted, but maybe
> we do not care.

I doubt anyone runs that in production, especially if performance is of concern.

--
To unsubscribe, send a message with 'unsubscribe linux-mm' in
the body to majordomo@kvack.org.  For more info on Linux MM,
see: http://www.linux-mm.org/ .
Don't email: <a href=mailto:"dont@kvack.org"> email@kvack.org </a>

^ permalink raw reply	[flat|nested] 38+ messages in thread

* [PATCH v3] fs/select: add vmalloc fallback for select(2)
  2016-09-22 16:43 ` Vlastimil Babka
@ 2016-09-27  8:45   ` Vlastimil Babka
  -1 siblings, 0 replies; 38+ messages in thread
From: Vlastimil Babka @ 2016-09-27  8:45 UTC (permalink / raw)
  To: Alexander Viro, Andrew Morton
  Cc: linux-fsdevel, linux-kernel, linux-mm, Michal Hocko, netdev,
	Eric Dumazet, David Laight, Hillf Danton, Nicholas Piggin,
	Jason Baron, Vlastimil Babka

The select(2) syscall performs a kmalloc(size, GFP_KERNEL) where size grows
with the number of fds passed. We had a customer report page allocation
failures of order-4 for this allocation. This is a costly order, so it might
easily fail, as the VM expects such allocation to have a lower-order fallback.

Such trivial fallback is vmalloc(), as the memory doesn't have to be physically
contiguous and the allocation is temporary for the duration of the syscall
only. There were some concerns, whether this would have negative impact on the
system by exposing vmalloc() to userspace. Although an excessive use of vmalloc
can cause some system wide performance issues - TLB flushes etc. - a large
order allocation is not for free either and an excessive reclaim/compaction can
have a similar effect. Also note that the size is effectively limited by
RLIMIT_NOFILE which defaults to 1024 on the systems I checked. That means the
bitmaps will fit well within single page and thus the vmalloc() fallback could
be only excercised for processes where root allows a higher limit.

Note that the poll(2) syscall seems to use a linked list of order-0 pages, so
it doesn't need this kind of fallback.

[eric.dumazet@gmail.com: fix failure path logic]
[akpm@linux-foundation.org: use proper type for size]
Signed-off-by: Vlastimil Babka <vbabka@suse.cz>
---
 fs/select.c | 14 +++++++++++---
 1 file changed, 11 insertions(+), 3 deletions(-)

diff --git a/fs/select.c b/fs/select.c
index 8ed9da50896a..3d4f85defeab 100644
--- a/fs/select.c
+++ b/fs/select.c
@@ -29,6 +29,7 @@
 #include <linux/sched/rt.h>
 #include <linux/freezer.h>
 #include <net/busy_poll.h>
+#include <linux/vmalloc.h>
 
 #include <asm/uaccess.h>
 
@@ -554,7 +555,7 @@ int core_sys_select(int n, fd_set __user *inp, fd_set __user *outp,
 	fd_set_bits fds;
 	void *bits;
 	int ret, max_fds;
-	unsigned int size;
+	size_t size, alloc_size;
 	struct fdtable *fdt;
 	/* Allocate small arguments on the stack to save memory and be faster */
 	long stack_fds[SELECT_STACK_ALLOC/sizeof(long)];
@@ -581,7 +582,14 @@ int core_sys_select(int n, fd_set __user *inp, fd_set __user *outp,
 	if (size > sizeof(stack_fds) / 6) {
 		/* Not enough space in on-stack array; must use kmalloc */
 		ret = -ENOMEM;
-		bits = kmalloc(6 * size, GFP_KERNEL);
+		if (size > (SIZE_MAX / 6))
+			goto out_nofds;
+
+		alloc_size = 6 * size;
+		bits = kmalloc(alloc_size, GFP_KERNEL|__GFP_NOWARN);
+		if (!bits && alloc_size > PAGE_SIZE)
+			bits = vmalloc(alloc_size);
+
 		if (!bits)
 			goto out_nofds;
 	}
@@ -618,7 +626,7 @@ int core_sys_select(int n, fd_set __user *inp, fd_set __user *outp,
 
 out:
 	if (bits != stack_fds)
-		kfree(bits);
+		kvfree(bits);
 out_nofds:
 	return ret;
 }
-- 
2.10.0

^ permalink raw reply related	[flat|nested] 38+ messages in thread

* [PATCH v3] fs/select: add vmalloc fallback for select(2)
@ 2016-09-27  8:45   ` Vlastimil Babka
  0 siblings, 0 replies; 38+ messages in thread
From: Vlastimil Babka @ 2016-09-27  8:45 UTC (permalink / raw)
  To: Alexander Viro, Andrew Morton
  Cc: linux-fsdevel, linux-kernel, linux-mm, Michal Hocko, netdev,
	Eric Dumazet, David Laight, Hillf Danton, Nicholas Piggin,
	Jason Baron, Vlastimil Babka

The select(2) syscall performs a kmalloc(size, GFP_KERNEL) where size grows
with the number of fds passed. We had a customer report page allocation
failures of order-4 for this allocation. This is a costly order, so it might
easily fail, as the VM expects such allocation to have a lower-order fallback.

Such trivial fallback is vmalloc(), as the memory doesn't have to be physically
contiguous and the allocation is temporary for the duration of the syscall
only. There were some concerns, whether this would have negative impact on the
system by exposing vmalloc() to userspace. Although an excessive use of vmalloc
can cause some system wide performance issues - TLB flushes etc. - a large
order allocation is not for free either and an excessive reclaim/compaction can
have a similar effect. Also note that the size is effectively limited by
RLIMIT_NOFILE which defaults to 1024 on the systems I checked. That means the
bitmaps will fit well within single page and thus the vmalloc() fallback could
be only excercised for processes where root allows a higher limit.

Note that the poll(2) syscall seems to use a linked list of order-0 pages, so
it doesn't need this kind of fallback.

[eric.dumazet@gmail.com: fix failure path logic]
[akpm@linux-foundation.org: use proper type for size]
Signed-off-by: Vlastimil Babka <vbabka@suse.cz>
---
 fs/select.c | 14 +++++++++++---
 1 file changed, 11 insertions(+), 3 deletions(-)

diff --git a/fs/select.c b/fs/select.c
index 8ed9da50896a..3d4f85defeab 100644
--- a/fs/select.c
+++ b/fs/select.c
@@ -29,6 +29,7 @@
 #include <linux/sched/rt.h>
 #include <linux/freezer.h>
 #include <net/busy_poll.h>
+#include <linux/vmalloc.h>
 
 #include <asm/uaccess.h>
 
@@ -554,7 +555,7 @@ int core_sys_select(int n, fd_set __user *inp, fd_set __user *outp,
 	fd_set_bits fds;
 	void *bits;
 	int ret, max_fds;
-	unsigned int size;
+	size_t size, alloc_size;
 	struct fdtable *fdt;
 	/* Allocate small arguments on the stack to save memory and be faster */
 	long stack_fds[SELECT_STACK_ALLOC/sizeof(long)];
@@ -581,7 +582,14 @@ int core_sys_select(int n, fd_set __user *inp, fd_set __user *outp,
 	if (size > sizeof(stack_fds) / 6) {
 		/* Not enough space in on-stack array; must use kmalloc */
 		ret = -ENOMEM;
-		bits = kmalloc(6 * size, GFP_KERNEL);
+		if (size > (SIZE_MAX / 6))
+			goto out_nofds;
+
+		alloc_size = 6 * size;
+		bits = kmalloc(alloc_size, GFP_KERNEL|__GFP_NOWARN);
+		if (!bits && alloc_size > PAGE_SIZE)
+			bits = vmalloc(alloc_size);
+
 		if (!bits)
 			goto out_nofds;
 	}
@@ -618,7 +626,7 @@ int core_sys_select(int n, fd_set __user *inp, fd_set __user *outp,
 
 out:
 	if (bits != stack_fds)
-		kfree(bits);
+		kvfree(bits);
 out_nofds:
 	return ret;
 }
-- 
2.10.0

--
To unsubscribe, send a message with 'unsubscribe linux-mm' in
the body to majordomo@kvack.org.  For more info on Linux MM,
see: http://www.linux-mm.org/ .
Don't email: <a href=mailto:"dont@kvack.org"> email@kvack.org </a>

^ permalink raw reply related	[flat|nested] 38+ messages in thread

* Re: [PATCH v3] fs/select: add vmalloc fallback for select(2)
  2016-09-27  8:45   ` Vlastimil Babka
@ 2016-09-27 10:22     ` Michal Hocko
  -1 siblings, 0 replies; 38+ messages in thread
From: Michal Hocko @ 2016-09-27 10:22 UTC (permalink / raw)
  To: Vlastimil Babka
  Cc: Alexander Viro, Andrew Morton, linux-fsdevel, linux-kernel,
	linux-mm, netdev, Eric Dumazet, David Laight, Hillf Danton,
	Nicholas Piggin, Jason Baron

On Tue 27-09-16 10:45:36, Vlastimil Babka wrote:
> The select(2) syscall performs a kmalloc(size, GFP_KERNEL) where size grows
> with the number of fds passed. We had a customer report page allocation
> failures of order-4 for this allocation. This is a costly order, so it might
> easily fail, as the VM expects such allocation to have a lower-order fallback.
> 
> Such trivial fallback is vmalloc(), as the memory doesn't have to be physically
> contiguous and the allocation is temporary for the duration of the syscall
> only. There were some concerns, whether this would have negative impact on the
> system by exposing vmalloc() to userspace. Although an excessive use of vmalloc
> can cause some system wide performance issues - TLB flushes etc. - a large
> order allocation is not for free either and an excessive reclaim/compaction can
> have a similar effect. Also note that the size is effectively limited by
> RLIMIT_NOFILE which defaults to 1024 on the systems I checked. That means the
> bitmaps will fit well within single page and thus the vmalloc() fallback could
> be only excercised for processes where root allows a higher limit.
> 
> Note that the poll(2) syscall seems to use a linked list of order-0 pages, so
> it doesn't need this kind of fallback.
> 
> [eric.dumazet@gmail.com: fix failure path logic]
> [akpm@linux-foundation.org: use proper type for size]
> Signed-off-by: Vlastimil Babka <vbabka@suse.cz>

Yes this makes sense to me. It could be argued that this could be
simplified to not rely on high order allocations at all but this is
simple enough (and backportable to stable trees) and should work
reasonably well.

So FWIW
Acked-by: Michal Hocko <mhocko@suse.com>

I would even argue to use __GFP_NORETRY for size > PAGE_SIZE because
giving a userspace an access to high order pages which can invoke OOM
killer is not a great idea. Something for a separate patch though.

> ---
>  fs/select.c | 14 +++++++++++---
>  1 file changed, 11 insertions(+), 3 deletions(-)
> 
> diff --git a/fs/select.c b/fs/select.c
> index 8ed9da50896a..3d4f85defeab 100644
> --- a/fs/select.c
> +++ b/fs/select.c
> @@ -29,6 +29,7 @@
>  #include <linux/sched/rt.h>
>  #include <linux/freezer.h>
>  #include <net/busy_poll.h>
> +#include <linux/vmalloc.h>
>  
>  #include <asm/uaccess.h>
>  
> @@ -554,7 +555,7 @@ int core_sys_select(int n, fd_set __user *inp, fd_set __user *outp,
>  	fd_set_bits fds;
>  	void *bits;
>  	int ret, max_fds;
> -	unsigned int size;
> +	size_t size, alloc_size;
>  	struct fdtable *fdt;
>  	/* Allocate small arguments on the stack to save memory and be faster */
>  	long stack_fds[SELECT_STACK_ALLOC/sizeof(long)];
> @@ -581,7 +582,14 @@ int core_sys_select(int n, fd_set __user *inp, fd_set __user *outp,
>  	if (size > sizeof(stack_fds) / 6) {
>  		/* Not enough space in on-stack array; must use kmalloc */
>  		ret = -ENOMEM;
> -		bits = kmalloc(6 * size, GFP_KERNEL);
> +		if (size > (SIZE_MAX / 6))
> +			goto out_nofds;
> +
> +		alloc_size = 6 * size;
> +		bits = kmalloc(alloc_size, GFP_KERNEL|__GFP_NOWARN);
> +		if (!bits && alloc_size > PAGE_SIZE)
> +			bits = vmalloc(alloc_size);
> +
>  		if (!bits)
>  			goto out_nofds;
>  	}
> @@ -618,7 +626,7 @@ int core_sys_select(int n, fd_set __user *inp, fd_set __user *outp,
>  
>  out:
>  	if (bits != stack_fds)
> -		kfree(bits);
> +		kvfree(bits);
>  out_nofds:
>  	return ret;
>  }
> -- 
> 2.10.0
> 
> --
> To unsubscribe, send a message with 'unsubscribe linux-mm' in
> the body to majordomo@kvack.org.  For more info on Linux MM,
> see: http://www.linux-mm.org/ .
> Don't email: <a href=mailto:"dont@kvack.org"> email@kvack.org </a>

-- 
Michal Hocko
SUSE Labs

^ permalink raw reply	[flat|nested] 38+ messages in thread

* Re: [PATCH v3] fs/select: add vmalloc fallback for select(2)
@ 2016-09-27 10:22     ` Michal Hocko
  0 siblings, 0 replies; 38+ messages in thread
From: Michal Hocko @ 2016-09-27 10:22 UTC (permalink / raw)
  To: Vlastimil Babka
  Cc: Alexander Viro, Andrew Morton, linux-fsdevel, linux-kernel,
	linux-mm, netdev, Eric Dumazet, David Laight, Hillf Danton,
	Nicholas Piggin, Jason Baron

On Tue 27-09-16 10:45:36, Vlastimil Babka wrote:
> The select(2) syscall performs a kmalloc(size, GFP_KERNEL) where size grows
> with the number of fds passed. We had a customer report page allocation
> failures of order-4 for this allocation. This is a costly order, so it might
> easily fail, as the VM expects such allocation to have a lower-order fallback.
> 
> Such trivial fallback is vmalloc(), as the memory doesn't have to be physically
> contiguous and the allocation is temporary for the duration of the syscall
> only. There were some concerns, whether this would have negative impact on the
> system by exposing vmalloc() to userspace. Although an excessive use of vmalloc
> can cause some system wide performance issues - TLB flushes etc. - a large
> order allocation is not for free either and an excessive reclaim/compaction can
> have a similar effect. Also note that the size is effectively limited by
> RLIMIT_NOFILE which defaults to 1024 on the systems I checked. That means the
> bitmaps will fit well within single page and thus the vmalloc() fallback could
> be only excercised for processes where root allows a higher limit.
> 
> Note that the poll(2) syscall seems to use a linked list of order-0 pages, so
> it doesn't need this kind of fallback.
> 
> [eric.dumazet@gmail.com: fix failure path logic]
> [akpm@linux-foundation.org: use proper type for size]
> Signed-off-by: Vlastimil Babka <vbabka@suse.cz>

Yes this makes sense to me. It could be argued that this could be
simplified to not rely on high order allocations at all but this is
simple enough (and backportable to stable trees) and should work
reasonably well.

So FWIW
Acked-by: Michal Hocko <mhocko@suse.com>

I would even argue to use __GFP_NORETRY for size > PAGE_SIZE because
giving a userspace an access to high order pages which can invoke OOM
killer is not a great idea. Something for a separate patch though.

> ---
>  fs/select.c | 14 +++++++++++---
>  1 file changed, 11 insertions(+), 3 deletions(-)
> 
> diff --git a/fs/select.c b/fs/select.c
> index 8ed9da50896a..3d4f85defeab 100644
> --- a/fs/select.c
> +++ b/fs/select.c
> @@ -29,6 +29,7 @@
>  #include <linux/sched/rt.h>
>  #include <linux/freezer.h>
>  #include <net/busy_poll.h>
> +#include <linux/vmalloc.h>
>  
>  #include <asm/uaccess.h>
>  
> @@ -554,7 +555,7 @@ int core_sys_select(int n, fd_set __user *inp, fd_set __user *outp,
>  	fd_set_bits fds;
>  	void *bits;
>  	int ret, max_fds;
> -	unsigned int size;
> +	size_t size, alloc_size;
>  	struct fdtable *fdt;
>  	/* Allocate small arguments on the stack to save memory and be faster */
>  	long stack_fds[SELECT_STACK_ALLOC/sizeof(long)];
> @@ -581,7 +582,14 @@ int core_sys_select(int n, fd_set __user *inp, fd_set __user *outp,
>  	if (size > sizeof(stack_fds) / 6) {
>  		/* Not enough space in on-stack array; must use kmalloc */
>  		ret = -ENOMEM;
> -		bits = kmalloc(6 * size, GFP_KERNEL);
> +		if (size > (SIZE_MAX / 6))
> +			goto out_nofds;
> +
> +		alloc_size = 6 * size;
> +		bits = kmalloc(alloc_size, GFP_KERNEL|__GFP_NOWARN);
> +		if (!bits && alloc_size > PAGE_SIZE)
> +			bits = vmalloc(alloc_size);
> +
>  		if (!bits)
>  			goto out_nofds;
>  	}
> @@ -618,7 +626,7 @@ int core_sys_select(int n, fd_set __user *inp, fd_set __user *outp,
>  
>  out:
>  	if (bits != stack_fds)
> -		kfree(bits);
> +		kvfree(bits);
>  out_nofds:
>  	return ret;
>  }
> -- 
> 2.10.0
> 
> --
> To unsubscribe, send a message with 'unsubscribe linux-mm' in
> the body to majordomo@kvack.org.  For more info on Linux MM,
> see: http://www.linux-mm.org/ .
> Don't email: <a href=mailto:"dont@kvack.org"> email@kvack.org </a>

-- 
Michal Hocko
SUSE Labs

--
To unsubscribe, send a message with 'unsubscribe linux-mm' in
the body to majordomo@kvack.org.  For more info on Linux MM,
see: http://www.linux-mm.org/ .
Don't email: <a href=mailto:"dont@kvack.org"> email@kvack.org </a>

^ permalink raw reply	[flat|nested] 38+ messages in thread

* Re: [PATCH v2] fs/select: add vmalloc fallback for select(2)
  2016-09-27  8:13       ` Vlastimil Babka
  (?)
@ 2016-09-27 13:34         ` Eric Dumazet
  -1 siblings, 0 replies; 38+ messages in thread
From: Eric Dumazet @ 2016-09-27 13:34 UTC (permalink / raw)
  To: Vlastimil Babka
  Cc: Andrew Morton, Alexander Viro, linux-fsdevel, linux-kernel,
	linux-mm, Michal Hocko, netdev

On Tue, 2016-09-27 at 10:13 +0200, Vlastimil Babka wrote:

> I doubt anyone runs that in production, especially if performance is of concern.
> 

I doubt anyone serious runs select() on a large fd set in production.

Last time I used it was in last century.

^ permalink raw reply	[flat|nested] 38+ messages in thread

* Re: [PATCH v2] fs/select: add vmalloc fallback for select(2)
@ 2016-09-27 13:34         ` Eric Dumazet
  0 siblings, 0 replies; 38+ messages in thread
From: Eric Dumazet @ 2016-09-27 13:34 UTC (permalink / raw)
  To: Vlastimil Babka
  Cc: Andrew Morton, Alexander Viro, linux-fsdevel, linux-kernel,
	linux-mm, Michal Hocko, netdev

On Tue, 2016-09-27 at 10:13 +0200, Vlastimil Babka wrote:

> I doubt anyone runs that in production, especially if performance is of concern.
> 

I doubt anyone serious runs select() on a large fd set in production.

Last time I used it was in last century.


--
To unsubscribe, send a message with 'unsubscribe linux-mm' in
the body to majordomo@kvack.org.  For more info on Linux MM,
see: http://www.linux-mm.org/ .
Don't email: <a href=mailto:"dont@kvack.org"> email@kvack.org </a>

^ permalink raw reply	[flat|nested] 38+ messages in thread

* Re: [PATCH v2] fs/select: add vmalloc fallback for select(2)
@ 2016-09-27 13:34         ` Eric Dumazet
  0 siblings, 0 replies; 38+ messages in thread
From: Eric Dumazet @ 2016-09-27 13:34 UTC (permalink / raw)
  To: Vlastimil Babka
  Cc: Andrew Morton, Alexander Viro, linux-fsdevel, linux-kernel,
	linux-mm, Michal Hocko, netdev

On Tue, 2016-09-27 at 10:13 +0200, Vlastimil Babka wrote:

> I doubt anyone runs that in production, especially if performance is of concern.
> 

I doubt anyone serious runs select() on a large fd set in production.

Last time I used it was in last century.


--
To unsubscribe, send a message with 'unsubscribe linux-mm' in
the body to majordomo@kvack.org.  For more info on Linux MM,
see: http://www.linux-mm.org/ .
Don't email: <a href=mailto:"dont@kvack.org"> email@kvack.org </a>

^ permalink raw reply	[flat|nested] 38+ messages in thread

end of thread, other threads:[~2016-09-27 13:34 UTC | newest]

Thread overview: 38+ messages (download: mbox.gz / follow: Atom feed)
-- links below jump to the message on this page --
2016-09-22 16:43 [PATCH v2] fs/select: add vmalloc fallback for select(2) Vlastimil Babka
2016-09-22 16:43 ` Vlastimil Babka
2016-09-22 16:49 ` Eric Dumazet
2016-09-22 16:49   ` Eric Dumazet
2016-09-22 16:49   ` Eric Dumazet
2016-09-22 16:56   ` Vlastimil Babka
2016-09-22 16:56     ` Vlastimil Babka
2016-09-22 17:07     ` Eric Dumazet
2016-09-22 17:07       ` Eric Dumazet
2016-09-22 17:07       ` Eric Dumazet
2016-09-22 17:55       ` Vlastimil Babka
2016-09-22 17:55         ` Vlastimil Babka
2016-09-23  9:42         ` David Laight
2016-09-23  9:58           ` Vlastimil Babka
2016-09-23  9:58             ` Vlastimil Babka
2016-09-23  9:58             ` Vlastimil Babka
2016-09-23 13:35             ` David Laight
2016-09-26 10:01               ` Vlastimil Babka
2016-09-26 10:01                 ` Vlastimil Babka
2016-09-26 15:02                 ` David Laight
2016-09-25 18:50   ` Andi Kleen
2016-09-25 18:50     ` Andi Kleen
2016-09-27  0:01 ` Andrew Morton
2016-09-27  0:01   ` Andrew Morton
2016-09-27  1:38   ` Eric Dumazet
2016-09-27  1:38     ` Eric Dumazet
2016-09-27  1:38     ` Eric Dumazet
2016-09-27  8:13     ` Vlastimil Babka
2016-09-27  8:13       ` Vlastimil Babka
2016-09-27 13:34       ` Eric Dumazet
2016-09-27 13:34         ` Eric Dumazet
2016-09-27 13:34         ` Eric Dumazet
2016-09-27  8:06   ` Vlastimil Babka
2016-09-27  8:06     ` Vlastimil Babka
2016-09-27  8:45 ` [PATCH v3] " Vlastimil Babka
2016-09-27  8:45   ` Vlastimil Babka
2016-09-27 10:22   ` Michal Hocko
2016-09-27 10:22     ` Michal Hocko

This is an external index of several public inboxes,
see mirroring instructions on how to clone and mirror
all data and code used by this external index.