From mboxrd@z Thu Jan 1 00:00:00 1970 Return-Path: Message-ID: <1474983244.28155.48.camel@edumazet-glaptop3.roam.corp.google.com> Subject: Re: [PATCH v2] fs/select: add vmalloc fallback for select(2) From: Eric Dumazet To: Vlastimil Babka Cc: Andrew Morton , Alexander Viro , linux-fsdevel@vger.kernel.org, linux-kernel@vger.kernel.org, linux-mm@kvack.org, Michal Hocko , netdev@vger.kernel.org Date: Tue, 27 Sep 2016 06:34:04 -0700 In-Reply-To: <6e62a278-4ac3-a866-51c6-e32511406aba@suse.cz> References: <20160922164359.9035-1-vbabka@suse.cz> <20160926170105.517f74cd67ecdd5ef73e1865@linux-foundation.org> <1474940324.28155.44.camel@edumazet-glaptop3.roam.corp.google.com> <6e62a278-4ac3-a866-51c6-e32511406aba@suse.cz> Content-Type: text/plain; charset="UTF-8" Mime-Version: 1.0 Content-Transfer-Encoding: 7bit Sender: owner-linux-mm@kvack.org List-ID: On Tue, 2016-09-27 at 10:13 +0200, Vlastimil Babka wrote: > I doubt anyone runs that in production, especially if performance is of concern. > I doubt anyone serious runs select() on a large fd set in production. Last time I used it was in last century. -- To unsubscribe, send a message with 'unsubscribe linux-mm' in the body to majordomo@kvack.org. For more info on Linux MM, see: http://www.linux-mm.org/ . Don't email: email@kvack.org From mboxrd@z Thu Jan 1 00:00:00 1970 Return-Path: Date: Tue, 27 Sep 2016 12:22:00 +0200 From: Michal Hocko To: Vlastimil Babka Cc: Alexander Viro , Andrew Morton , linux-fsdevel@vger.kernel.org, linux-kernel@vger.kernel.org, linux-mm@kvack.org, netdev@vger.kernel.org, Eric Dumazet , David Laight , Hillf Danton , Nicholas Piggin , Jason Baron Subject: Re: [PATCH v3] fs/select: add vmalloc fallback for select(2) Message-ID: <20160927102200.GA2278@dhcp22.suse.cz> References: <20160922164359.9035-1-vbabka@suse.cz> <20160927084536.5923-1-vbabka@suse.cz> MIME-Version: 1.0 Content-Type: text/plain; charset=us-ascii Content-Disposition: inline In-Reply-To: <20160927084536.5923-1-vbabka@suse.cz> Sender: owner-linux-mm@kvack.org List-ID: On Tue 27-09-16 10:45:36, Vlastimil Babka wrote: > The select(2) syscall performs a kmalloc(size, GFP_KERNEL) where size grows > with the number of fds passed. We had a customer report page allocation > failures of order-4 for this allocation. This is a costly order, so it might > easily fail, as the VM expects such allocation to have a lower-order fallback. > > Such trivial fallback is vmalloc(), as the memory doesn't have to be physically > contiguous and the allocation is temporary for the duration of the syscall > only. There were some concerns, whether this would have negative impact on the > system by exposing vmalloc() to userspace. Although an excessive use of vmalloc > can cause some system wide performance issues - TLB flushes etc. - a large > order allocation is not for free either and an excessive reclaim/compaction can > have a similar effect. Also note that the size is effectively limited by > RLIMIT_NOFILE which defaults to 1024 on the systems I checked. That means the > bitmaps will fit well within single page and thus the vmalloc() fallback could > be only excercised for processes where root allows a higher limit. > > Note that the poll(2) syscall seems to use a linked list of order-0 pages, so > it doesn't need this kind of fallback. > > [eric.dumazet@gmail.com: fix failure path logic] > [akpm@linux-foundation.org: use proper type for size] > Signed-off-by: Vlastimil Babka Yes this makes sense to me. It could be argued that this could be simplified to not rely on high order allocations at all but this is simple enough (and backportable to stable trees) and should work reasonably well. So FWIW Acked-by: Michal Hocko I would even argue to use __GFP_NORETRY for size > PAGE_SIZE because giving a userspace an access to high order pages which can invoke OOM killer is not a great idea. Something for a separate patch though. > --- > fs/select.c | 14 +++++++++++--- > 1 file changed, 11 insertions(+), 3 deletions(-) > > diff --git a/fs/select.c b/fs/select.c > index 8ed9da50896a..3d4f85defeab 100644 > --- a/fs/select.c > +++ b/fs/select.c > @@ -29,6 +29,7 @@ > #include > #include > #include > +#include > > #include > > @@ -554,7 +555,7 @@ int core_sys_select(int n, fd_set __user *inp, fd_set __user *outp, > fd_set_bits fds; > void *bits; > int ret, max_fds; > - unsigned int size; > + size_t size, alloc_size; > struct fdtable *fdt; > /* Allocate small arguments on the stack to save memory and be faster */ > long stack_fds[SELECT_STACK_ALLOC/sizeof(long)]; > @@ -581,7 +582,14 @@ int core_sys_select(int n, fd_set __user *inp, fd_set __user *outp, > if (size > sizeof(stack_fds) / 6) { > /* Not enough space in on-stack array; must use kmalloc */ > ret = -ENOMEM; > - bits = kmalloc(6 * size, GFP_KERNEL); > + if (size > (SIZE_MAX / 6)) > + goto out_nofds; > + > + alloc_size = 6 * size; > + bits = kmalloc(alloc_size, GFP_KERNEL|__GFP_NOWARN); > + if (!bits && alloc_size > PAGE_SIZE) > + bits = vmalloc(alloc_size); > + > if (!bits) > goto out_nofds; > } > @@ -618,7 +626,7 @@ int core_sys_select(int n, fd_set __user *inp, fd_set __user *outp, > > out: > if (bits != stack_fds) > - kfree(bits); > + kvfree(bits); > out_nofds: > return ret; > } > -- > 2.10.0 > > -- > To unsubscribe, send a message with 'unsubscribe linux-mm' in > the body to majordomo@kvack.org. For more info on Linux MM, > see: http://www.linux-mm.org/ . > Don't email: email@kvack.org -- Michal Hocko SUSE Labs -- To unsubscribe, send a message with 'unsubscribe linux-mm' in the body to majordomo@kvack.org. For more info on Linux MM, see: http://www.linux-mm.org/ . Don't email: email@kvack.org From mboxrd@z Thu Jan 1 00:00:00 1970 Return-Path: From: Vlastimil Babka To: Alexander Viro , Andrew Morton Cc: linux-fsdevel@vger.kernel.org, linux-kernel@vger.kernel.org, linux-mm@kvack.org, Michal Hocko , netdev@vger.kernel.org, Eric Dumazet , David Laight , Hillf Danton , Nicholas Piggin , Jason Baron , Vlastimil Babka Subject: [PATCH v3] fs/select: add vmalloc fallback for select(2) Date: Tue, 27 Sep 2016 10:45:36 +0200 Message-Id: <20160927084536.5923-1-vbabka@suse.cz> In-Reply-To: <20160922164359.9035-1-vbabka@suse.cz> References: <20160922164359.9035-1-vbabka@suse.cz> Sender: owner-linux-mm@kvack.org List-ID: The select(2) syscall performs a kmalloc(size, GFP_KERNEL) where size grows with the number of fds passed. We had a customer report page allocation failures of order-4 for this allocation. This is a costly order, so it might easily fail, as the VM expects such allocation to have a lower-order fallback. Such trivial fallback is vmalloc(), as the memory doesn't have to be physically contiguous and the allocation is temporary for the duration of the syscall only. There were some concerns, whether this would have negative impact on the system by exposing vmalloc() to userspace. Although an excessive use of vmalloc can cause some system wide performance issues - TLB flushes etc. - a large order allocation is not for free either and an excessive reclaim/compaction can have a similar effect. Also note that the size is effectively limited by RLIMIT_NOFILE which defaults to 1024 on the systems I checked. That means the bitmaps will fit well within single page and thus the vmalloc() fallback could be only excercised for processes where root allows a higher limit. Note that the poll(2) syscall seems to use a linked list of order-0 pages, so it doesn't need this kind of fallback. [eric.dumazet@gmail.com: fix failure path logic] [akpm@linux-foundation.org: use proper type for size] Signed-off-by: Vlastimil Babka --- fs/select.c | 14 +++++++++++--- 1 file changed, 11 insertions(+), 3 deletions(-) diff --git a/fs/select.c b/fs/select.c index 8ed9da50896a..3d4f85defeab 100644 --- a/fs/select.c +++ b/fs/select.c @@ -29,6 +29,7 @@ #include #include #include +#include #include @@ -554,7 +555,7 @@ int core_sys_select(int n, fd_set __user *inp, fd_set __user *outp, fd_set_bits fds; void *bits; int ret, max_fds; - unsigned int size; + size_t size, alloc_size; struct fdtable *fdt; /* Allocate small arguments on the stack to save memory and be faster */ long stack_fds[SELECT_STACK_ALLOC/sizeof(long)]; @@ -581,7 +582,14 @@ int core_sys_select(int n, fd_set __user *inp, fd_set __user *outp, if (size > sizeof(stack_fds) / 6) { /* Not enough space in on-stack array; must use kmalloc */ ret = -ENOMEM; - bits = kmalloc(6 * size, GFP_KERNEL); + if (size > (SIZE_MAX / 6)) + goto out_nofds; + + alloc_size = 6 * size; + bits = kmalloc(alloc_size, GFP_KERNEL|__GFP_NOWARN); + if (!bits && alloc_size > PAGE_SIZE) + bits = vmalloc(alloc_size); + if (!bits) goto out_nofds; } @@ -618,7 +626,7 @@ int core_sys_select(int n, fd_set __user *inp, fd_set __user *outp, out: if (bits != stack_fds) - kfree(bits); + kvfree(bits); out_nofds: return ret; } -- 2.10.0 -- To unsubscribe, send a message with 'unsubscribe linux-mm' in the body to majordomo@kvack.org. For more info on Linux MM, see: http://www.linux-mm.org/ . Don't email: email@kvack.org From mboxrd@z Thu Jan 1 00:00:00 1970 Return-Path: Subject: Re: [PATCH v2] fs/select: add vmalloc fallback for select(2) To: Eric Dumazet , Andrew Morton References: <20160922164359.9035-1-vbabka@suse.cz> <20160926170105.517f74cd67ecdd5ef73e1865@linux-foundation.org> <1474940324.28155.44.camel@edumazet-glaptop3.roam.corp.google.com> Cc: Alexander Viro , linux-fsdevel@vger.kernel.org, linux-kernel@vger.kernel.org, linux-mm@kvack.org, Michal Hocko , netdev@vger.kernel.org From: Vlastimil Babka Message-ID: <6e62a278-4ac3-a866-51c6-e32511406aba@suse.cz> Date: Tue, 27 Sep 2016 10:13:16 +0200 MIME-Version: 1.0 In-Reply-To: <1474940324.28155.44.camel@edumazet-glaptop3.roam.corp.google.com> Content-Type: text/plain; charset=utf-8; format=flowed Content-Transfer-Encoding: 7bit Sender: owner-linux-mm@kvack.org List-ID: On 09/27/2016 03:38 AM, Eric Dumazet wrote: > On Mon, 2016-09-26 at 17:01 -0700, Andrew Morton wrote: > >> I don't share Eric's concerns about performance here. If the vmalloc() >> is called, we're about to write to that quite large amount of memory >> which we just allocated, and the vmalloc() overhead will be relatively >> low. > > I did not care of the performance of this particular select() system > call really, but other cpus because of more TLB invalidations. There are many other ways to cause those, AFAIK. The reclaim/compaction for order-3 allocation has its own impact on system, including TLB flushes. Or a flood of mmap(MAP_POPULATE) and madvise(MADV_DONTNEED) calls... This vmalloc() would however require raising RLIMIT_NOFILE above the defaults. > At least CONFIG_DEBUG_PAGEALLOC=y builds should be impacted, but maybe > we do not care. I doubt anyone runs that in production, especially if performance is of concern. -- To unsubscribe, send a message with 'unsubscribe linux-mm' in the body to majordomo@kvack.org. For more info on Linux MM, see: http://www.linux-mm.org/ . Don't email: email@kvack.org From mboxrd@z Thu Jan 1 00:00:00 1970 Return-Path: Subject: Re: [PATCH v2] fs/select: add vmalloc fallback for select(2) To: Andrew Morton References: <20160922164359.9035-1-vbabka@suse.cz> <20160926170105.517f74cd67ecdd5ef73e1865@linux-foundation.org> Cc: Alexander Viro , linux-fsdevel@vger.kernel.org, linux-kernel@vger.kernel.org, linux-mm@kvack.org, Michal Hocko , netdev@vger.kernel.org, Eric Dumazet From: Vlastimil Babka Message-ID: <91be8fd4-6600-d58d-d77a-d06ebed79f7e@suse.cz> Date: Tue, 27 Sep 2016 10:06:24 +0200 MIME-Version: 1.0 In-Reply-To: <20160926170105.517f74cd67ecdd5ef73e1865@linux-foundation.org> Content-Type: text/plain; charset=windows-1252; format=flowed Content-Transfer-Encoding: 7bit Sender: owner-linux-mm@kvack.org List-ID: On 09/27/2016 02:01 AM, Andrew Morton wrote: > On Thu, 22 Sep 2016 18:43:59 +0200 Vlastimil Babka wrote: > >> The select(2) syscall performs a kmalloc(size, GFP_KERNEL) where size grows >> with the number of fds passed. We had a customer report page allocation >> failures of order-4 for this allocation. This is a costly order, so it might >> easily fail, as the VM expects such allocation to have a lower-order fallback. >> >> Such trivial fallback is vmalloc(), as the memory doesn't have to be >> physically contiguous. Also the allocation is temporary for the duration of the >> syscall, so it's unlikely to stress vmalloc too much. >> >> Note that the poll(2) syscall seems to use a linked list of order-0 pages, so >> it doesn't need this kind of fallback. >> >> ... >> >> --- a/fs/select.c >> +++ b/fs/select.c >> @@ -29,6 +29,7 @@ >> #include >> #include >> #include >> +#include >> >> #include >> >> @@ -558,6 +559,7 @@ int core_sys_select(int n, fd_set __user *inp, fd_set __user *outp, >> struct fdtable *fdt; >> /* Allocate small arguments on the stack to save memory and be faster */ >> long stack_fds[SELECT_STACK_ALLOC/sizeof(long)]; >> + unsigned long alloc_size; >> >> ret = -EINVAL; >> if (n < 0) >> @@ -580,8 +582,12 @@ int core_sys_select(int n, fd_set __user *inp, fd_set __user *outp, >> bits = stack_fds; >> if (size > sizeof(stack_fds) / 6) { >> /* Not enough space in on-stack array; must use kmalloc */ >> + alloc_size = 6 * size; > > Well. `size' is `unsigned'. The multiplication will be done as 32-bit > so there was no point in making `alloc_size' unsigned long. Uh, right. Thanks. > So can we tighten up the types in this function? size_t might make > sense, but vmalloc() takes a ulong. Let's do size_t then, as the conversion to ulong is safe. -- To unsubscribe, send a message with 'unsubscribe linux-mm' in the body to majordomo@kvack.org. For more info on Linux MM, see: http://www.linux-mm.org/ . Don't email: email@kvack.org From mboxrd@z Thu Jan 1 00:00:00 1970 Return-Path: Message-ID: <1474940324.28155.44.camel@edumazet-glaptop3.roam.corp.google.com> Subject: Re: [PATCH v2] fs/select: add vmalloc fallback for select(2) From: Eric Dumazet To: Andrew Morton Cc: Vlastimil Babka , Alexander Viro , linux-fsdevel@vger.kernel.org, linux-kernel@vger.kernel.org, linux-mm@kvack.org, Michal Hocko , netdev@vger.kernel.org Date: Mon, 26 Sep 2016 18:38:44 -0700 In-Reply-To: <20160926170105.517f74cd67ecdd5ef73e1865@linux-foundation.org> References: <20160922164359.9035-1-vbabka@suse.cz> <20160926170105.517f74cd67ecdd5ef73e1865@linux-foundation.org> Content-Type: text/plain; charset="UTF-8" Mime-Version: 1.0 Content-Transfer-Encoding: 7bit Sender: owner-linux-mm@kvack.org List-ID: On Mon, 2016-09-26 at 17:01 -0700, Andrew Morton wrote: > I don't share Eric's concerns about performance here. If the vmalloc() > is called, we're about to write to that quite large amount of memory > which we just allocated, and the vmalloc() overhead will be relatively > low. I did not care of the performance of this particular select() system call really, but other cpus because of more TLB invalidations. At least CONFIG_DEBUG_PAGEALLOC=y builds should be impacted, but maybe we do not care. -- To unsubscribe, send a message with 'unsubscribe linux-mm' in the body to majordomo@kvack.org. For more info on Linux MM, see: http://www.linux-mm.org/ . Don't email: email@kvack.org From mboxrd@z Thu Jan 1 00:00:00 1970 Return-Path: Date: Mon, 26 Sep 2016 17:01:05 -0700 From: Andrew Morton To: Vlastimil Babka Cc: Alexander Viro , linux-fsdevel@vger.kernel.org, linux-kernel@vger.kernel.org, linux-mm@kvack.org, Michal Hocko , netdev@vger.kernel.org, Eric Dumazet Subject: Re: [PATCH v2] fs/select: add vmalloc fallback for select(2) Message-Id: <20160926170105.517f74cd67ecdd5ef73e1865@linux-foundation.org> In-Reply-To: <20160922164359.9035-1-vbabka@suse.cz> References: <20160922164359.9035-1-vbabka@suse.cz> Mime-Version: 1.0 Content-Type: text/plain; charset=US-ASCII Content-Transfer-Encoding: 7bit Sender: owner-linux-mm@kvack.org List-ID: On Thu, 22 Sep 2016 18:43:59 +0200 Vlastimil Babka wrote: > The select(2) syscall performs a kmalloc(size, GFP_KERNEL) where size grows > with the number of fds passed. We had a customer report page allocation > failures of order-4 for this allocation. This is a costly order, so it might > easily fail, as the VM expects such allocation to have a lower-order fallback. > > Such trivial fallback is vmalloc(), as the memory doesn't have to be > physically contiguous. Also the allocation is temporary for the duration of the > syscall, so it's unlikely to stress vmalloc too much. > > Note that the poll(2) syscall seems to use a linked list of order-0 pages, so > it doesn't need this kind of fallback. > > ... > > --- a/fs/select.c > +++ b/fs/select.c > @@ -29,6 +29,7 @@ > #include > #include > #include > +#include > > #include > > @@ -558,6 +559,7 @@ int core_sys_select(int n, fd_set __user *inp, fd_set __user *outp, > struct fdtable *fdt; > /* Allocate small arguments on the stack to save memory and be faster */ > long stack_fds[SELECT_STACK_ALLOC/sizeof(long)]; > + unsigned long alloc_size; > > ret = -EINVAL; > if (n < 0) > @@ -580,8 +582,12 @@ int core_sys_select(int n, fd_set __user *inp, fd_set __user *outp, > bits = stack_fds; > if (size > sizeof(stack_fds) / 6) { > /* Not enough space in on-stack array; must use kmalloc */ > + alloc_size = 6 * size; Well. `size' is `unsigned'. The multiplication will be done as 32-bit so there was no point in making `alloc_size' unsigned long. So can we tighten up the types in this function? size_t might make sense, but vmalloc() takes a ulong. > ret = -ENOMEM; > - bits = kmalloc(6 * size, GFP_KERNEL); > + bits = kmalloc(alloc_size, GFP_KERNEL|__GFP_NOWARN); > + if (!bits && alloc_size > PAGE_SIZE) > + bits = vmalloc(alloc_size); I don't share Eric's concerns about performance here. If the vmalloc() is called, we're about to write to that quite large amount of memory which we just allocated, and the vmalloc() overhead will be relatively low. > if (!bits) > goto out_nofds; > } > @@ -618,7 +624,7 @@ int core_sys_select(int n, fd_set __user *inp, fd_set __user *outp, > > out: > if (bits != stack_fds) > - kfree(bits); > + kvfree(bits); > out_nofds: > return ret; It otherwise looks OK to me. -- To unsubscribe, send a message with 'unsubscribe linux-mm' in the body to majordomo@kvack.org. For more info on Linux MM, see: http://www.linux-mm.org/ . Don't email: email@kvack.org From mboxrd@z Thu Jan 1 00:00:00 1970 Return-Path: From: David Laight To: 'Vlastimil Babka' , Eric Dumazet CC: Alexander Viro , Andrew Morton , "linux-fsdevel@vger.kernel.org" , "linux-kernel@vger.kernel.org" , "linux-mm@kvack.org" , Michal Hocko , "netdev@vger.kernel.org" , Linux API , "linux-man@vger.kernel.org" Subject: RE: [PATCH v2] fs/select: add vmalloc fallback for select(2) Date: Mon, 26 Sep 2016 15:02:50 +0000 Message-ID: <063D6719AE5E284EB5DD2968C1650D6DB0109D2D@AcuExch.aculab.com> References: <20160922164359.9035-1-vbabka@suse.cz> <1474562982.23058.140.camel@edumazet-glaptop3.roam.corp.google.com> <12efc491-a0e7-1012-5a8b-6d3533c720db@suse.cz> <1474564068.23058.144.camel@edumazet-glaptop3.roam.corp.google.com> <063D6719AE5E284EB5DD2968C1650D6DB0107DC8@AcuExch.aculab.com> <3bbcc269-ec8b-12dd-e0ae-190c18bc3f47@suse.cz> <063D6719AE5E284EB5DD2968C1650D6DB0107FEB@AcuExch.aculab.com> <5bb958c9-542e-e86b-779c-e3d93dc01632@suse.cz> In-Reply-To: <5bb958c9-542e-e86b-779c-e3d93dc01632@suse.cz> Content-Language: en-US Content-Type: text/plain; charset="utf-8" Content-Transfer-Encoding: base64 MIME-Version: 1.0 Sender: owner-linux-mm@kvack.org List-ID: RnJvbTogVmxhc3RpbWlsIEJhYmthDQo+IFNlbnQ6IDI2IFNlcHRlbWJlciAyMDE2IDExOjAyDQo+ IE9uIDA5LzIzLzIwMTYgMDM6MzUgUE0sIERhdmlkIExhaWdodCB3cm90ZToNCj4gPiBGcm9tOiBW bGFzdGltaWwgQmFia2ENCj4gPj4gU2VudDogMjMgU2VwdGVtYmVyIDIwMTYgMTA6NTkNCj4gPiAu Li4NCj4gPj4gPiBJIHN1c3BlY3QgdGhhdCBmZHQtPm1heF9mZHMgaXMgYW4gdXBwZXIgYm91bmQg Zm9yIHRoZSBoaWdoZXN0IGZkIHRoZQ0KPiA+PiA+IHByb2Nlc3MgaGFzIG9wZW4gLSBub3QgdGhl IFJMSU1JVF9OT0ZJTEUgdmFsdWUuDQo+ID4+DQo+ID4+IEkgZ2F0aGVyZWQgdGhhdCB0aGUgaGln aGVzdCBmZCBlZmZlY3RpdmVseSBsaW1pdHMgdGhlIG51bWJlciBvZiBmaWxlcywNCj4gPj4gc28g aXQncyB0aGUgc2FtZS4gSSBtaWdodCBiZSB3cm9uZy4NCj4gPg0KPiA+IEFuIGFwcGxpY2F0aW9u IGNhbiByZWR1Y2UgUkxJTUlUX05PRklMRSBiZWxvdyB0aGF0IG9mIGFuIG9wZW4gZmlsZS4NCj4g DQo+IE9LLCBJIGRpZCBzb21lIG1vcmUgZGlnZ2luZyBpbiB0aGUgY29kZSwgYW5kIG15IHVuZGVy c3RhbmRpbmcgaXMgdGhhdDoNCj4gDQo+IC0gZmR0LT5tYXhfZmRzIGlzIHRoZSBjdXJyZW50IHNp emUgb2YgdGhlIGZkdGFibGUsIHdoaWNoIGlzbid0IGFsbG9jYXRlZCB1cGZyb250DQo+IHRvIG1h dGNoIHRoZSBsaW1pdCwgYnV0IGdyb3dzIGFzIG5lZWRlZC4gVGhpcyBtZWFucyBpdCdzIE9LIGZv cg0KPiBjb3JlX3N5c19zZWxlY3QoKSB0byBzaWxlbnRseSBjYXAgbmZkcywgYXMgaXQga25vd3Mg dGhlcmUgYXJlIG5vIGZkJ3Mgd2l0aA0KPiBoaWdoZXIgbnVtYmVyIGluIHRoZSBmZHRhYmxlLCBz byBpdCdzIGEgcGVyZm9ybWFuY2Ugb3B0aW1pemF0aW9uLg0KDQpOb3QgZW50aXJlbHksIGlmIGFu eSBiaXRzIGFyZSBzZXQgZm9yIGZkIGFib3ZlIGZkdC0+bWF4X2ZkcyB0aGVuIHNlbGVjdCgpDQpj YWxsIHNob3VsZCBmYWlsIC0gZmQgbm90IG9wZW4uDQoNCj4gSG93ZXZlciwgdG8NCj4gbWF0Y2gg d2hhdCB0aGUgbWFucGFnZSBzYXlzLCB0aGVyZSBzaG91bGQgYmUgYW5vdGhlciBjaGVjayBhZ2Fp bnN0IFJMSU1JVF9OT0ZJTEUNCj4gdG8gcmV0dXJuIC1FSU5WQUwsIHdoaWNoIHRoZXJlIGlzbid0 LCBBRkFJQ1MuDQo+IA0KPiAtIGZkdGFibGUgaXMgZXhwYW5kZWQgKGFuZCBmZHQtPm1heF9mZHMg YnVtcGVkKSBieQ0KPiBleHBhbmRfZmlsZXMoKS0+ZXhwYW5kX2ZkdGFibGUoKSB3aGljaCBjaGVj a3MgYWdhaW5zdCBmcy5ucl9vcGVuIHN5c2N0bCwgd2hpY2gNCj4gc2VlbXMgdG8gYmUgMTA0ODU3 NiB3aGVyZSBJIGNoZWNrZWQuDQo+IA0KPiAtIGNhbGxlcnMgb2YgZXhwYW5kX2ZpbGVzKCksIHN1 Y2ggYXMgZHVwKCksIGNoZWNrIHRoZSBybGltaXQoUkxJTUlUX05PRklMRSkgdG8NCj4gbGltaXQg dGhlIGV4cGFuc2lvbi4NCj4gDQo+IFNvIHllYWgsIGFwcGxpY2F0aW9uIGNhbiByZWR1Y2UgUkxJ TUlUX05PRklMRSwgYnV0IGl0IGhhcyBubyBlZmZlY3Qgb24gZmR0YWJsZQ0KPiBhbmQgZmR0LT5t YXhfZmRzIHRoYXQgaXMgYWxyZWFkeSBhYm92ZSB0aGUgbGltaXQuIFNlbGVjdCBzeXNjYWxsIHdv dWxkIGhhdmUgdG8NCj4gY2hlY2sgdGhlIHJsaW1pdCB0byBjb25mb3JtIHRvIHRoZSBtYW5wYWdl LiBPciAocmF0aGVyPykgd2Ugc2hvdWxkIGZpeCB0aGUgbWFucGFnZS4NCg0KSSB0aGluayB0aGUg bWFucGFnZSBzaG91bGQgYmUgZml4ZWQgKGRlbGV0ZSB0aGF0IGNsYXVzZSkuDQpUaGVuIGFkZCBj b2RlIHRvIHRoZSBzeXN0ZW0gY2FsbCB0byBzY2FuIHRoZSBoaWdoIGJpdCBzZXRzIChhYm92ZSBm ZHQtPm1heF9mZHMpDQpmb3IgYW55IG5vbi16ZXJvIGJ5dGVzLiBUaGlzIGNhbiBiZSBkb25lIGlu dG8gYSBzbWFsbCBidWZmZXIuDQoNCj4gQXMgZm9yIHRoZSBvcmlnaW5hbCB2bWFsbG9jKCkgZmxv b2QgY29uY2VybiwgSSBzdGlsbCB0aGluayB3ZSdyZSBzYWZlLCBhcw0KPiBvcmRpbmFyeSB1c2Vy cyBhcmUgbGltaXRlZCBieSBSTElNSVRfTk9GSUxFIHdheSBiZWxvdyBzaXplcyB0aGF0IHdvdWxk IG5lZWQNCj4gdm1hbGxvYygpLCBhbmQgcm9vdCBoYXMgbWFueSBvdGhlciBvcHRpb25zIHRvIERP UyB0aGUgc3lzdGVtIChvciB3b3JzZSkuDQoNClNvbWUgcHJvY2Vzc2VzIG5lZWQgdmVyeSBoaWdo IG51bWJlcnMgb2YgZmQuDQpMaWtlbHkgdGhleSBkb24ndCB1c2Ugc2VsZWN0KCkgb24gdGhlbSwg YnV0IHRyYXNoaW5nIHBlcmZvcm1hbmNlIGlmIHRoZXkNCmRvIGlzIGEgYml0IHNpbGx5Lg0KVHJ5 aW5nIHRvIHNsaXQgdGhlIDMgbWFza3MgZmlyc3Qgc2VlbXMgc2Vuc2libGUuDQoNCglEYXZpZA0K DQo= -- To unsubscribe, send a message with 'unsubscribe linux-mm' in the body to majordomo@kvack.org. For more info on Linux MM, see: http://www.linux-mm.org/ . Don't email: email@kvack.org From mboxrd@z Thu Jan 1 00:00:00 1970 Return-Path: Subject: Re: [PATCH v2] fs/select: add vmalloc fallback for select(2) To: David Laight , Eric Dumazet References: <20160922164359.9035-1-vbabka@suse.cz> <1474562982.23058.140.camel@edumazet-glaptop3.roam.corp.google.com> <12efc491-a0e7-1012-5a8b-6d3533c720db@suse.cz> <1474564068.23058.144.camel@edumazet-glaptop3.roam.corp.google.com> <063D6719AE5E284EB5DD2968C1650D6DB0107DC8@AcuExch.aculab.com> <3bbcc269-ec8b-12dd-e0ae-190c18bc3f47@suse.cz> <063D6719AE5E284EB5DD2968C1650D6DB0107FEB@AcuExch.aculab.com> Cc: Alexander Viro , Andrew Morton , "linux-fsdevel@vger.kernel.org" , "linux-kernel@vger.kernel.org" , "linux-mm@kvack.org" , Michal Hocko , "netdev@vger.kernel.org" , Linux API , "linux-man@vger.kernel.org" From: Vlastimil Babka Message-ID: <5bb958c9-542e-e86b-779c-e3d93dc01632@suse.cz> Date: Mon, 26 Sep 2016 12:01:32 +0200 MIME-Version: 1.0 In-Reply-To: <063D6719AE5E284EB5DD2968C1650D6DB0107FEB@AcuExch.aculab.com> Content-Type: text/plain; charset=utf-8; format=flowed Content-Transfer-Encoding: 7bit Sender: owner-linux-mm@kvack.org List-ID: On 09/23/2016 03:35 PM, David Laight wrote: > From: Vlastimil Babka >> Sent: 23 September 2016 10:59 > ... >> > I suspect that fdt->max_fds is an upper bound for the highest fd the >> > process has open - not the RLIMIT_NOFILE value. >> >> I gathered that the highest fd effectively limits the number of files, >> so it's the same. I might be wrong. > > An application can reduce RLIMIT_NOFILE below that of an open file. OK, I did some more digging in the code, and my understanding is that: - fdt->max_fds is the current size of the fdtable, which isn't allocated upfront to match the limit, but grows as needed. This means it's OK for core_sys_select() to silently cap nfds, as it knows there are no fd's with higher number in the fdtable, so it's a performance optimization. However, to match what the manpage says, there should be another check against RLIMIT_NOFILE to return -EINVAL, which there isn't, AFAICS. - fdtable is expanded (and fdt->max_fds bumped) by expand_files()->expand_fdtable() which checks against fs.nr_open sysctl, which seems to be 1048576 where I checked. - callers of expand_files(), such as dup(), check the rlimit(RLIMIT_NOFILE) to limit the expansion. So yeah, application can reduce RLIMIT_NOFILE, but it has no effect on fdtable and fdt->max_fds that is already above the limit. Select syscall would have to check the rlimit to conform to the manpage. Or (rather?) we should fix the manpage. As for the original vmalloc() flood concern, I still think we're safe, as ordinary users are limited by RLIMIT_NOFILE way below sizes that would need vmalloc(), and root has many other options to DOS the system (or worse). Vlastimil > David > -- To unsubscribe, send a message with 'unsubscribe linux-mm' in the body to majordomo@kvack.org. For more info on Linux MM, see: http://www.linux-mm.org/ . Don't email: email@kvack.org From mboxrd@z Thu Jan 1 00:00:00 1970 Return-Path: To: linux-mm@kvack.org From: Andi Kleen Subject: Re: [PATCH v2] fs/select: add vmalloc fallback for select(2) Date: Sun, 25 Sep 2016 11:50:05 -0700 Message-ID: <87r387oluq.fsf@tassilo.jf.intel.com> References: <20160922164359.9035-1-vbabka@suse.cz> <1474562982.23058.140.camel@edumazet-glaptop3.roam.corp.google.com> Mime-Version: 1.0 Content-Type: text/plain Cc: linux-fsdevel@vger.kernel.org,linux-kernel@vger.kernel.org,netdev@vger.kernel.org Sender: owner-linux-mm@kvack.org List-ID: Eric Dumazet writes: > On Thu, 2016-09-22 at 18:43 +0200, Vlastimil Babka wrote: >> The select(2) syscall performs a kmalloc(size, GFP_KERNEL) where size grows >> with the number of fds passed. We had a customer report page allocation >> failures of order-4 for this allocation. This is a costly order, so it might >> easily fail, as the VM expects such allocation to have a lower-order fallback. >> >> Such trivial fallback is vmalloc(), as the memory doesn't have to be >> physically contiguous. Also the allocation is temporary for the duration of the >> syscall, so it's unlikely to stress vmalloc too much. > > vmalloc() uses a vmap_area_lock spinlock, and TLB flushes. > > So I guess allowing vmalloc() being called from an innocent application > doing a select() might be dangerous, especially if this select() happens > thousands of time per second. Yes it seems like a bad idea because of all the scaling problems here. The right solution would be to fix select to use multiple non virtually contiguous pages. -Andi -- To unsubscribe, send a message with 'unsubscribe linux-mm' in the body to majordomo@kvack.org. For more info on Linux MM, see: http://www.linux-mm.org/ . Don't email: email@kvack.org From mboxrd@z Thu Jan 1 00:00:00 1970 Return-Path: From: David Laight To: 'Vlastimil Babka' , Eric Dumazet CC: Alexander Viro , Andrew Morton , "linux-fsdevel@vger.kernel.org" , "linux-kernel@vger.kernel.org" , "linux-mm@kvack.org" , Michal Hocko , "netdev@vger.kernel.org" , Linux API , "linux-man@vger.kernel.org" Subject: RE: [PATCH v2] fs/select: add vmalloc fallback for select(2) Date: Fri, 23 Sep 2016 13:35:52 +0000 Message-ID: <063D6719AE5E284EB5DD2968C1650D6DB0107FEB@AcuExch.aculab.com> References: <20160922164359.9035-1-vbabka@suse.cz> <1474562982.23058.140.camel@edumazet-glaptop3.roam.corp.google.com> <12efc491-a0e7-1012-5a8b-6d3533c720db@suse.cz> <1474564068.23058.144.camel@edumazet-glaptop3.roam.corp.google.com> <063D6719AE5E284EB5DD2968C1650D6DB0107DC8@AcuExch.aculab.com> <3bbcc269-ec8b-12dd-e0ae-190c18bc3f47@suse.cz> In-Reply-To: <3bbcc269-ec8b-12dd-e0ae-190c18bc3f47@suse.cz> Content-Language: en-US Content-Type: text/plain; charset="utf-8" Content-Transfer-Encoding: base64 MIME-Version: 1.0 Sender: owner-linux-mm@kvack.org List-ID: RnJvbTogVmxhc3RpbWlsIEJhYmthDQo+IFNlbnQ6IDIzIFNlcHRlbWJlciAyMDE2IDEwOjU5DQou Li4NCj4gPiBJIHN1c3BlY3QgdGhhdCBmZHQtPm1heF9mZHMgaXMgYW4gdXBwZXIgYm91bmQgZm9y IHRoZSBoaWdoZXN0IGZkIHRoZQ0KPiA+IHByb2Nlc3MgaGFzIG9wZW4gLSBub3QgdGhlIFJMSU1J VF9OT0ZJTEUgdmFsdWUuDQo+IA0KPiBJIGdhdGhlcmVkIHRoYXQgdGhlIGhpZ2hlc3QgZmQgZWZm ZWN0aXZlbHkgbGltaXRzIHRoZSBudW1iZXIgb2YgZmlsZXMsDQo+IHNvIGl0J3MgdGhlIHNhbWUu IEkgbWlnaHQgYmUgd3JvbmcuDQoNCkFuIGFwcGxpY2F0aW9uIGNhbiByZWR1Y2UgUkxJTUlUX05P RklMRSBiZWxvdyB0aGF0IG9mIGFuIG9wZW4gZmlsZS4NCg0KCURhdmlkDQoNCg== -- To unsubscribe, send a message with 'unsubscribe linux-mm' in the body to majordomo@kvack.org. For more info on Linux MM, see: http://www.linux-mm.org/ . Don't email: email@kvack.org From mboxrd@z Thu Jan 1 00:00:00 1970 Return-Path: Subject: Re: [PATCH v2] fs/select: add vmalloc fallback for select(2) To: David Laight , Eric Dumazet References: <20160922164359.9035-1-vbabka@suse.cz> <1474562982.23058.140.camel@edumazet-glaptop3.roam.corp.google.com> <12efc491-a0e7-1012-5a8b-6d3533c720db@suse.cz> <1474564068.23058.144.camel@edumazet-glaptop3.roam.corp.google.com> <063D6719AE5E284EB5DD2968C1650D6DB0107DC8@AcuExch.aculab.com> Cc: Alexander Viro , Andrew Morton , "linux-fsdevel@vger.kernel.org" , "linux-kernel@vger.kernel.org" , "linux-mm@kvack.org" , Michal Hocko , "netdev@vger.kernel.org" , Linux API , "linux-man@vger.kernel.org" From: Vlastimil Babka Message-ID: <3bbcc269-ec8b-12dd-e0ae-190c18bc3f47@suse.cz> Date: Fri, 23 Sep 2016 11:58:34 +0200 MIME-Version: 1.0 In-Reply-To: <063D6719AE5E284EB5DD2968C1650D6DB0107DC8@AcuExch.aculab.com> Content-Type: text/plain; charset=utf-8 Content-Transfer-Encoding: 7bit Sender: owner-linux-mm@kvack.org List-ID: On 09/23/2016 11:42 AM, David Laight wrote: > From: Vlastimil Babka >> Sent: 22 September 2016 18:55 > ... >> So in the case of select() it seems like the memory we need 6 bits per file >> descriptor, multiplied by the highest possible file descriptor (nfds) as passed >> to the syscall. According to the man page of select: >> >> EINVAL nfds is negative or exceeds the RLIMIT_NOFILE resource limit (see >> getrlimit(2)). > > That second clause is relatively recent. Interesting... so it was added without actually being true in the kernel code? >> The code actually seems to silently cap the value instead of returning EINVAL >> though? (IIUC): >> >> /* max_fds can increase, so grab it once to avoid race */ >> rcu_read_lock(); >> fdt = files_fdtable(current->files); >> max_fds = fdt->max_fds; >> rcu_read_unlock(); >> if (n > max_fds) >> n = max_fds; >> >> The default for this cap seems to be 1024 where I checked (again, IIUC, it's >> what ulimit -n returns?). I wasn't able to change it to more than 2048, which >> makes the bitmaps still below PAGE_SIZE. >> >> So if I get that right, the system admin would have to allow really large >> RLIMIT_NOFILE to even make vmalloc() possible here. So I don't see it as a large >> concern? > > 4k open files isn't that many. > Especially for programs that are using pipes to emulate windows events. Sure but IIUC we need 6 bits per file. That means up to almost 42k files, we should fit into order-3 allocation, which effectively cannot fail right now. > I suspect that fdt->max_fds is an upper bound for the highest fd the > process has open - not the RLIMIT_NOFILE value. I gathered that the highest fd effectively limits the number of files, so it's the same. I might be wrong. > select() shouldn't be silently ignoring large values of 'n' unless > the fd_set bits are zero. Yeah that doesn't seem to conform to the manpage. > Of course, select does scale well for high numbered fds > and neither poll nor select scale well for large numbers of fds. True. > David > -- To unsubscribe, send a message with 'unsubscribe linux-mm' in the body to majordomo@kvack.org. For more info on Linux MM, see: http://www.linux-mm.org/ . Don't email: email@kvack.org From mboxrd@z Thu Jan 1 00:00:00 1970 Return-Path: From: David Laight To: 'Vlastimil Babka' , Eric Dumazet CC: Alexander Viro , Andrew Morton , "linux-fsdevel@vger.kernel.org" , "linux-kernel@vger.kernel.org" , "linux-mm@kvack.org" , Michal Hocko , "netdev@vger.kernel.org" , Linux API , "linux-man@vger.kernel.org" Subject: RE: [PATCH v2] fs/select: add vmalloc fallback for select(2) Date: Fri, 23 Sep 2016 09:42:07 +0000 Message-ID: <063D6719AE5E284EB5DD2968C1650D6DB0107DC8@AcuExch.aculab.com> References: <20160922164359.9035-1-vbabka@suse.cz> <1474562982.23058.140.camel@edumazet-glaptop3.roam.corp.google.com> <12efc491-a0e7-1012-5a8b-6d3533c720db@suse.cz> <1474564068.23058.144.camel@edumazet-glaptop3.roam.corp.google.com> In-Reply-To: Content-Language: en-US Content-Type: text/plain; charset="utf-8" Content-Transfer-Encoding: base64 MIME-Version: 1.0 Sender: owner-linux-mm@kvack.org List-ID: RnJvbTogVmxhc3RpbWlsIEJhYmthDQo+IFNlbnQ6IDIyIFNlcHRlbWJlciAyMDE2IDE4OjU1DQou Li4NCj4gU28gaW4gdGhlIGNhc2Ugb2Ygc2VsZWN0KCkgaXQgc2VlbXMgbGlrZSB0aGUgbWVtb3J5 IHdlIG5lZWQgNiBiaXRzIHBlciBmaWxlDQo+IGRlc2NyaXB0b3IsIG11bHRpcGxpZWQgYnkgdGhl IGhpZ2hlc3QgcG9zc2libGUgZmlsZSBkZXNjcmlwdG9yIChuZmRzKSBhcyBwYXNzZWQNCj4gdG8g dGhlIHN5c2NhbGwuIEFjY29yZGluZyB0byB0aGUgbWFuIHBhZ2Ugb2Ygc2VsZWN0Og0KPiANCj4g ICAgICAgICBFSU5WQUwgbmZkcyBpcyBuZWdhdGl2ZSBvciBleGNlZWRzIHRoZSBSTElNSVRfTk9G SUxFIHJlc291cmNlIGxpbWl0IChzZWUNCj4gZ2V0cmxpbWl0KDIpKS4NCg0KVGhhdCBzZWNvbmQg Y2xhdXNlIGlzIHJlbGF0aXZlbHkgcmVjZW50Lg0KDQo+IFRoZSBjb2RlIGFjdHVhbGx5IHNlZW1z IHRvIHNpbGVudGx5IGNhcCB0aGUgdmFsdWUgaW5zdGVhZCBvZiByZXR1cm5pbmcgRUlOVkFMDQo+ IHRob3VnaD8gKElJVUMpOg0KPiANCj4gICAgICAgICAvKiBtYXhfZmRzIGNhbiBpbmNyZWFzZSwg c28gZ3JhYiBpdCBvbmNlIHRvIGF2b2lkIHJhY2UgKi8NCj4gICAgICAgICAgcmN1X3JlYWRfbG9j aygpOw0KPiAgICAgICAgICBmZHQgPSBmaWxlc19mZHRhYmxlKGN1cnJlbnQtPmZpbGVzKTsNCj4g ICAgICAgICAgbWF4X2ZkcyA9IGZkdC0+bWF4X2ZkczsNCj4gICAgICAgICAgcmN1X3JlYWRfdW5s b2NrKCk7DQo+ICAgICAgICAgIGlmIChuID4gbWF4X2ZkcykNCj4gICAgICAgICAgICAgICAgICBu ID0gbWF4X2ZkczsNCj4gDQo+IFRoZSBkZWZhdWx0IGZvciB0aGlzIGNhcCBzZWVtcyB0byBiZSAx MDI0IHdoZXJlIEkgY2hlY2tlZCAoYWdhaW4sIElJVUMsIGl0J3MNCj4gd2hhdCB1bGltaXQgLW4g cmV0dXJucz8pLiBJIHdhc24ndCBhYmxlIHRvIGNoYW5nZSBpdCB0byBtb3JlIHRoYW4gMjA0OCwg d2hpY2gNCj4gbWFrZXMgdGhlIGJpdG1hcHMgc3RpbGwgYmVsb3cgUEFHRV9TSVpFLg0KPiANCj4g U28gaWYgSSBnZXQgdGhhdCByaWdodCwgdGhlIHN5c3RlbSBhZG1pbiB3b3VsZCBoYXZlIHRvIGFs bG93IHJlYWxseSBsYXJnZQ0KPiBSTElNSVRfTk9GSUxFIHRvIGV2ZW4gbWFrZSB2bWFsbG9jKCkg cG9zc2libGUgaGVyZS4gU28gSSBkb24ndCBzZWUgaXQgYXMgYSBsYXJnZQ0KPiBjb25jZXJuPw0K DQo0ayBvcGVuIGZpbGVzIGlzbid0IHRoYXQgbWFueS4NCkVzcGVjaWFsbHkgZm9yIHByb2dyYW1z IHRoYXQgYXJlIHVzaW5nIHBpcGVzIHRvIGVtdWxhdGUgd2luZG93cyBldmVudHMuDQoNCkkgc3Vz cGVjdCB0aGF0IGZkdC0+bWF4X2ZkcyBpcyBhbiB1cHBlciBib3VuZCBmb3IgdGhlIGhpZ2hlc3Qg ZmQgdGhlDQpwcm9jZXNzIGhhcyBvcGVuIC0gbm90IHRoZSBSTElNSVRfTk9GSUxFIHZhbHVlLg0K c2VsZWN0KCkgc2hvdWxkbid0IGJlIHNpbGVudGx5IGlnbm9yaW5nIGxhcmdlIHZhbHVlcyBvZiAn bicgdW5sZXNzDQp0aGUgZmRfc2V0IGJpdHMgYXJlIHplcm8uDQoNCk9mIGNvdXJzZSwgc2VsZWN0 IGRvZXMgc2NhbGUgd2VsbCBmb3IgaGlnaCBudW1iZXJlZCBmZHMNCmFuZCBuZWl0aGVyIHBvbGwg bm9yIHNlbGVjdCBzY2FsZSB3ZWxsIGZvciBsYXJnZSBudW1iZXJzIG9mIGZkcy4NCg0KCURhdmlk DQoNCg== -- To unsubscribe, send a message with 'unsubscribe linux-mm' in the body to majordomo@kvack.org. For more info on Linux MM, see: http://www.linux-mm.org/ . Don't email: email@kvack.org From mboxrd@z Thu Jan 1 00:00:00 1970 Return-Path: Subject: Re: [PATCH v2] fs/select: add vmalloc fallback for select(2) To: Eric Dumazet References: <20160922164359.9035-1-vbabka@suse.cz> <1474562982.23058.140.camel@edumazet-glaptop3.roam.corp.google.com> <12efc491-a0e7-1012-5a8b-6d3533c720db@suse.cz> <1474564068.23058.144.camel@edumazet-glaptop3.roam.corp.google.com> Cc: Alexander Viro , Andrew Morton , linux-fsdevel@vger.kernel.org, linux-kernel@vger.kernel.org, linux-mm@kvack.org, Michal Hocko , netdev@vger.kernel.org, Linux API , linux-man@vger.kernel.org From: Vlastimil Babka Message-ID: Date: Thu, 22 Sep 2016 19:55:09 +0200 MIME-Version: 1.0 In-Reply-To: <1474564068.23058.144.camel@edumazet-glaptop3.roam.corp.google.com> Content-Type: text/plain; charset=utf-8; format=flowed Content-Transfer-Encoding: 7bit Sender: owner-linux-mm@kvack.org List-ID: On 09/22/2016 07:07 PM, Eric Dumazet wrote: > On Thu, 2016-09-22 at 18:56 +0200, Vlastimil Babka wrote: >> On 09/22/2016 06:49 PM, Eric Dumazet wrote: >> > On Thu, 2016-09-22 at 18:43 +0200, Vlastimil Babka wrote: >> >> The select(2) syscall performs a kmalloc(size, GFP_KERNEL) where size grows >> >> with the number of fds passed. We had a customer report page allocation >> >> failures of order-4 for this allocation. This is a costly order, so it might >> >> easily fail, as the VM expects such allocation to have a lower-order fallback. >> >> >> >> Such trivial fallback is vmalloc(), as the memory doesn't have to be >> >> physically contiguous. Also the allocation is temporary for the duration of the >> >> syscall, so it's unlikely to stress vmalloc too much. >> > >> > vmalloc() uses a vmap_area_lock spinlock, and TLB flushes. >> > >> > So I guess allowing vmalloc() being called from an innocent application >> > doing a select() might be dangerous, especially if this select() happens >> > thousands of time per second. >> >> Isn't seq_buf_alloc() similarly exposed? And ipc_alloc()? > > Possibly. > > We don't have a library function (attempting kmalloc(), fallback to > vmalloc() presumably to avoid abuses, but I guess some patches were > accepted without thinking about this. So in the case of select() it seems like the memory we need 6 bits per file descriptor, multiplied by the highest possible file descriptor (nfds) as passed to the syscall. According to the man page of select: EINVAL nfds is negative or exceeds the RLIMIT_NOFILE resource limit (see getrlimit(2)). The code actually seems to silently cap the value instead of returning EINVAL though? (IIUC): /* max_fds can increase, so grab it once to avoid race */ rcu_read_lock(); fdt = files_fdtable(current->files); max_fds = fdt->max_fds; rcu_read_unlock(); if (n > max_fds) n = max_fds; The default for this cap seems to be 1024 where I checked (again, IIUC, it's what ulimit -n returns?). I wasn't able to change it to more than 2048, which makes the bitmaps still below PAGE_SIZE. So if I get that right, the system admin would have to allow really large RLIMIT_NOFILE to even make vmalloc() possible here. So I don't see it as a large concern? Vlastimil -- To unsubscribe, send a message with 'unsubscribe linux-mm' in the body to majordomo@kvack.org. For more info on Linux MM, see: http://www.linux-mm.org/ . Don't email: email@kvack.org From mboxrd@z Thu Jan 1 00:00:00 1970 Return-Path: Message-ID: <1474564068.23058.144.camel@edumazet-glaptop3.roam.corp.google.com> Subject: Re: [PATCH v2] fs/select: add vmalloc fallback for select(2) From: Eric Dumazet To: Vlastimil Babka Cc: Alexander Viro , Andrew Morton , linux-fsdevel@vger.kernel.org, linux-kernel@vger.kernel.org, linux-mm@kvack.org, Michal Hocko , netdev@vger.kernel.org Date: Thu, 22 Sep 2016 10:07:48 -0700 In-Reply-To: <12efc491-a0e7-1012-5a8b-6d3533c720db@suse.cz> References: <20160922164359.9035-1-vbabka@suse.cz> <1474562982.23058.140.camel@edumazet-glaptop3.roam.corp.google.com> <12efc491-a0e7-1012-5a8b-6d3533c720db@suse.cz> Content-Type: text/plain; charset="UTF-8" Mime-Version: 1.0 Content-Transfer-Encoding: 7bit Sender: owner-linux-mm@kvack.org List-ID: On Thu, 2016-09-22 at 18:56 +0200, Vlastimil Babka wrote: > On 09/22/2016 06:49 PM, Eric Dumazet wrote: > > On Thu, 2016-09-22 at 18:43 +0200, Vlastimil Babka wrote: > >> The select(2) syscall performs a kmalloc(size, GFP_KERNEL) where size grows > >> with the number of fds passed. We had a customer report page allocation > >> failures of order-4 for this allocation. This is a costly order, so it might > >> easily fail, as the VM expects such allocation to have a lower-order fallback. > >> > >> Such trivial fallback is vmalloc(), as the memory doesn't have to be > >> physically contiguous. Also the allocation is temporary for the duration of the > >> syscall, so it's unlikely to stress vmalloc too much. > > > > vmalloc() uses a vmap_area_lock spinlock, and TLB flushes. > > > > So I guess allowing vmalloc() being called from an innocent application > > doing a select() might be dangerous, especially if this select() happens > > thousands of time per second. > > Isn't seq_buf_alloc() similarly exposed? And ipc_alloc()? Possibly. We don't have a library function (attempting kmalloc(), fallback to vmalloc() presumably to avoid abuses, but I guess some patches were accepted without thinking about this. -- To unsubscribe, send a message with 'unsubscribe linux-mm' in the body to majordomo@kvack.org. For more info on Linux MM, see: http://www.linux-mm.org/ . Don't email: email@kvack.org From mboxrd@z Thu Jan 1 00:00:00 1970 Return-Path: Subject: Re: [PATCH v2] fs/select: add vmalloc fallback for select(2) To: Eric Dumazet References: <20160922164359.9035-1-vbabka@suse.cz> <1474562982.23058.140.camel@edumazet-glaptop3.roam.corp.google.com> Cc: Alexander Viro , Andrew Morton , linux-fsdevel@vger.kernel.org, linux-kernel@vger.kernel.org, linux-mm@kvack.org, Michal Hocko , netdev@vger.kernel.org From: Vlastimil Babka Message-ID: <12efc491-a0e7-1012-5a8b-6d3533c720db@suse.cz> Date: Thu, 22 Sep 2016 18:56:48 +0200 MIME-Version: 1.0 In-Reply-To: <1474562982.23058.140.camel@edumazet-glaptop3.roam.corp.google.com> Content-Type: text/plain; charset=utf-8; format=flowed Content-Transfer-Encoding: 7bit Sender: owner-linux-mm@kvack.org List-ID: On 09/22/2016 06:49 PM, Eric Dumazet wrote: > On Thu, 2016-09-22 at 18:43 +0200, Vlastimil Babka wrote: >> The select(2) syscall performs a kmalloc(size, GFP_KERNEL) where size grows >> with the number of fds passed. We had a customer report page allocation >> failures of order-4 for this allocation. This is a costly order, so it might >> easily fail, as the VM expects such allocation to have a lower-order fallback. >> >> Such trivial fallback is vmalloc(), as the memory doesn't have to be >> physically contiguous. Also the allocation is temporary for the duration of the >> syscall, so it's unlikely to stress vmalloc too much. > > vmalloc() uses a vmap_area_lock spinlock, and TLB flushes. > > So I guess allowing vmalloc() being called from an innocent application > doing a select() might be dangerous, especially if this select() happens > thousands of time per second. Isn't seq_buf_alloc() similarly exposed? And ipc_alloc()? -- To unsubscribe, send a message with 'unsubscribe linux-mm' in the body to majordomo@kvack.org. For more info on Linux MM, see: http://www.linux-mm.org/ . Don't email: email@kvack.org From mboxrd@z Thu Jan 1 00:00:00 1970 Return-Path: Message-ID: <1474562982.23058.140.camel@edumazet-glaptop3.roam.corp.google.com> Subject: Re: [PATCH v2] fs/select: add vmalloc fallback for select(2) From: Eric Dumazet To: Vlastimil Babka Cc: Alexander Viro , Andrew Morton , linux-fsdevel@vger.kernel.org, linux-kernel@vger.kernel.org, linux-mm@kvack.org, Michal Hocko , netdev@vger.kernel.org Date: Thu, 22 Sep 2016 09:49:42 -0700 In-Reply-To: <20160922164359.9035-1-vbabka@suse.cz> References: <20160922164359.9035-1-vbabka@suse.cz> Content-Type: text/plain; charset="UTF-8" Mime-Version: 1.0 Content-Transfer-Encoding: 7bit Sender: owner-linux-mm@kvack.org List-ID: On Thu, 2016-09-22 at 18:43 +0200, Vlastimil Babka wrote: > The select(2) syscall performs a kmalloc(size, GFP_KERNEL) where size grows > with the number of fds passed. We had a customer report page allocation > failures of order-4 for this allocation. This is a costly order, so it might > easily fail, as the VM expects such allocation to have a lower-order fallback. > > Such trivial fallback is vmalloc(), as the memory doesn't have to be > physically contiguous. Also the allocation is temporary for the duration of the > syscall, so it's unlikely to stress vmalloc too much. vmalloc() uses a vmap_area_lock spinlock, and TLB flushes. So I guess allowing vmalloc() being called from an innocent application doing a select() might be dangerous, especially if this select() happens thousands of time per second. -- To unsubscribe, send a message with 'unsubscribe linux-mm' in the body to majordomo@kvack.org. For more info on Linux MM, see: http://www.linux-mm.org/ . Don't email: email@kvack.org From mboxrd@z Thu Jan 1 00:00:00 1970 Return-Path: From: Vlastimil Babka To: Alexander Viro , Andrew Morton Cc: linux-fsdevel@vger.kernel.org, linux-kernel@vger.kernel.org, linux-mm@kvack.org, Michal Hocko , netdev@vger.kernel.org, Eric Dumazet , Vlastimil Babka Subject: [PATCH v2] fs/select: add vmalloc fallback for select(2) Date: Thu, 22 Sep 2016 18:43:59 +0200 Message-Id: <20160922164359.9035-1-vbabka@suse.cz> Sender: owner-linux-mm@kvack.org List-ID: The select(2) syscall performs a kmalloc(size, GFP_KERNEL) where size grows with the number of fds passed. We had a customer report page allocation failures of order-4 for this allocation. This is a costly order, so it might easily fail, as the VM expects such allocation to have a lower-order fallback. Such trivial fallback is vmalloc(), as the memory doesn't have to be physically contiguous. Also the allocation is temporary for the duration of the syscall, so it's unlikely to stress vmalloc too much. Note that the poll(2) syscall seems to use a linked list of order-0 pages, so it doesn't need this kind of fallback. [eric.dumazet@gmail.com: fix failure path logic] Signed-off-by: Vlastimil Babka --- fs/select.c | 10 ++++++++-- 1 file changed, 8 insertions(+), 2 deletions(-) diff --git a/fs/select.c b/fs/select.c index 8ed9da50896a..b99e98524fde 100644 --- a/fs/select.c +++ b/fs/select.c @@ -29,6 +29,7 @@ #include #include #include +#include #include @@ -558,6 +559,7 @@ int core_sys_select(int n, fd_set __user *inp, fd_set __user *outp, struct fdtable *fdt; /* Allocate small arguments on the stack to save memory and be faster */ long stack_fds[SELECT_STACK_ALLOC/sizeof(long)]; + unsigned long alloc_size; ret = -EINVAL; if (n < 0) @@ -580,8 +582,12 @@ int core_sys_select(int n, fd_set __user *inp, fd_set __user *outp, bits = stack_fds; if (size > sizeof(stack_fds) / 6) { /* Not enough space in on-stack array; must use kmalloc */ + alloc_size = 6 * size; ret = -ENOMEM; - bits = kmalloc(6 * size, GFP_KERNEL); + bits = kmalloc(alloc_size, GFP_KERNEL|__GFP_NOWARN); + if (!bits && alloc_size > PAGE_SIZE) + bits = vmalloc(alloc_size); + if (!bits) goto out_nofds; } @@ -618,7 +624,7 @@ int core_sys_select(int n, fd_set __user *inp, fd_set __user *outp, out: if (bits != stack_fds) - kfree(bits); + kvfree(bits); out_nofds: return ret; } -- 2.10.0 -- To unsubscribe, send a message with 'unsubscribe linux-mm' in the body to majordomo@kvack.org. For more info on Linux MM, see: http://www.linux-mm.org/ . Don't email: email@kvack.org