* /proc/stat vs. failed order-4 allocation
@ 2014-05-21 12:25 Heiko Carstens
2014-05-21 14:32 ` Christoph Hellwig
2014-05-22 11:29 ` /proc/stat vs. failed order-4 allocation Ian Kent
0 siblings, 2 replies; 21+ messages in thread
From: Heiko Carstens @ 2014-05-21 12:25 UTC (permalink / raw)
To: Andrew Morton, KAMEZAWA Hiroyuki, Andrea Righi, Eric Dumazet
Cc: linux-kernel, linux-fsdevel, Hendrik Brueckner, Thorsten Diehl
Hi all,
I'm just wondering why /proc/stat is a single_open() seq_file and not a
regular seq_file with an iterator (say 48 online cpus for each iteration
or something similar).
Of course, in theory, the "intr" line may be very long as well...
With the current implementation everything must fit into a single buffer.
So if memory is highly fragmented we run into failing higher order
allocations (like below), which effectively means reading /proc/stat
doesn't work anymore.
From stat_open:
size_t size = 1024 + 128 * num_possible_cpus();
[...]
/* minimum size to display an interrupt count : 2 bytes */
size += 2 * nr_irqs;
[...]
buf = kmalloc(size, GFP_KERNEL);
if (!buf)
return -ENOMEM;
With num_possible_cpus() = 256 we end up with an order 4 allocation.
So, would there be any objections, adding a cpu iterator to /proc/stat?
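For reference, the order-4 number can be reproduced with a quick back-of-the-envelope calculation. The sketch below is Python, not kernel code; the nr_irqs value is illustrative (it isn't stated in this report), but the conclusion holds for any small nr_irqs since 1024 + 128 * 256 bytes already crosses the 32 KB (order-3) boundary:

```python
import math

PAGE_SIZE = 4096

def get_order(size):
    # Smallest order such that (PAGE_SIZE << order) >= size,
    # mirroring the kernel's get_order() used for large kmalloc()s.
    pages = math.ceil(size / PAGE_SIZE)
    return max(0, math.ceil(math.log2(pages)))

def stat_open_bufsize(num_possible_cpus, nr_irqs):
    # Mirrors the size calculation quoted above from stat_open().
    return 1024 + 128 * num_possible_cpus + 2 * nr_irqs

size = stat_open_bufsize(num_possible_cpus=256, nr_irqs=512)  # nr_irqs is a guess
print(size, get_order(size))  # 34816 bytes -> order 4, i.e. a 64 KB contiguous chunk
```

So the open() has to find 16 physically contiguous pages, which is exactly what fails in the fragmented-memory log below.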
[62129.701569] sadc: page allocation failure: order:4, mode:0x1040d0
[62129.701573] CPU: 1 PID: 192063 Comm: sadc Not tainted 3.10.0-123.el7.s390x #1
[62129.701574] 00000000edf27840 00000000edf27850 0000000000000002 0000000000000000
00000000edf278e0 00000000edf27858 00000000edf27858 00000000001120c0
0000000000000000 000000000072c7c0 0000000000711836 000000000000000b
00000000edf278a0 00000000edf27840 0000000000000000 0000000000000000
00000000001040d0 00000000001120c0 00000000edf27840 00000000edf278a0
[62129.701586] Call Trace:
[62129.701588] ([<0000000000111fbe>] show_trace+0xe6/0x130)
[62129.701591] [<0000000000112074>] show_stack+0x6c/0xe8
[62129.701593] [<000000000020d356>] warn_alloc_failed+0xd6/0x138
[62129.701596] [<00000000002114d2>] __alloc_pages_nodemask+0x9da/0xb68
[62129.701598] [<000000000021168e>] __get_free_pages+0x2e/0x58
[62129.701599] [<000000000025a05c>] kmalloc_order_trace+0x44/0xc0
[62129.701602] [<00000000002f3ffa>] stat_open+0x5a/0xd8
[62129.701604] [<00000000002e9aaa>] proc_reg_open+0x8a/0x140
[62129.701606] [<0000000000273b64>] do_dentry_open+0x1bc/0x2c8
[62129.701608] [<000000000027411e>] finish_open+0x46/0x60
[62129.701610] [<000000000028675a>] do_last+0x382/0x10d0
[62129.701612] [<0000000000287570>] path_openat+0xc8/0x4f8
[62129.701614] [<0000000000288bde>] do_filp_open+0x46/0xa8
[62129.701616] [<000000000027541c>] do_sys_open+0x114/0x1f0
[62129.701618] [<00000000005b1c1c>] sysc_tracego+0x14/0x1a
[62129.701620] [<000003fffd0040a0>] 0x3fffd0040a0
[62129.701624] Mem-Info:
[62129.701625] DMA per-cpu:
[62129.701627] CPU 0: hi: 186, btch: 31 usd: 0
[62129.701628] CPU 1: hi: 186, btch: 31 usd: 0
[62129.701630] CPU 2: hi: 186, btch: 31 usd: 51
[62129.701631] Normal per-cpu:
[62129.701632] CPU 0: hi: 186, btch: 31 usd: 30
[62129.701634] CPU 1: hi: 186, btch: 31 usd: 0
[62129.701635] CPU 2: hi: 186, btch: 31 usd: 0
[62129.701639] active_anon:5416 inactive_anon:5571 isolated_anon:0
active_file:440513 inactive_file:406221 isolated_file:27
unevictable:1741 dirty:35305 writeback:0 unstable:0
free:40319 slab_reclaimable:41921 slab_unreclaimable:34553
mapped:3921 shmem:1351 pagetables:296 bounce:0
free_cma:0
[62129.701648] DMA free:25192kB min:11800kB low:14748kB high:17700kB active_anon:11032kB inactive_anon:11320kB active_file:1002092kB inactive_file:904260kB unevictable:3772kB isolated(anon):0kB isolated(file):4kB present:2097152kB managed:2070452kB mlocked:3772kB dirty:55072kB writeback:0kB mapped:6316kB shmem:1152kB slab_reclaimable:61192kB slab_unreclaimable:50108kB kernel_stack:2368kB pagetables:532kB unstable:0kB bounce:0kB free_cma:0kB writeback_tmp:0kB pages_scanned:133 all_unreclaimable? no
[62129.701652] lowmem_reserve[]: 0 1837 1837
[62129.701658] Normal free:136084kB min:10724kB low:13404kB high:16084kB active_anon:10632kB inactive_anon:10964kB active_file:759960kB inactive_file:720624kB unevictable:3192kB isolated(anon):0kB isolated(file):4kB present:1966080kB managed:1881776kB mlocked:3192kB dirty:86148kB writeback:0kB mapped:9368kB shmem:4252kB slab_reclaimable:106492kB slab_unreclaimable:88104kB kernel_stack:5808kB pagetables:652kB unstable:0kB bounce:0kB free_cma:0kB writeback_tmp:0kB pages_scanned:0 all_unreclaimable? no
[62129.701661] lowmem_reserve[]: 0 0 0
[62129.701664] DMA: 1540*4kB (UEM) 2217*8kB (UEM) 9*16kB (UEM) 1*32kB (R) 5*64kB (R) 1*128kB (R) 1*256kB (R) 1*512kB (R) 0*1024kB = 25288kB
[62129.701673] Normal: 21631*4kB (UEM) 5755*8kB (UEM) 145*16kB (UEM) 8*32kB (ER) 4*64kB (R) 2*128kB (R) 0*256kB 1*512kB (R) 0*1024kB = 136164kB
[62129.701682] Node 0 hugepages_total=0 hugepages_free=0 hugepages_surp=0 hugepages_size=1024kB
[62129.701684] 849331 total pagecache pages
[62129.701685] 131 pages in swap cache
[62129.701687] Swap cache stats: add 9956, delete 9825, find 1049/1416
[62129.701688] Free swap = 7186784kB
[62129.701689] Total swap = 7212140kB
[62129.710679] 1015808 pages RAM
[62129.710681] 23437 pages reserved
[62129.710682] 1360146 pages shared
[62129.710683] 384507 pages non-shared
^ permalink raw reply [flat|nested] 21+ messages in thread
* Re: /proc/stat vs. failed order-4 allocation 2014-05-21 12:25 /proc/stat vs. failed order-4 allocation Heiko Carstens @ 2014-05-21 14:32 ` Christoph Hellwig 2014-05-22 3:05 ` Elliott, Robert (Server Storage) 2014-05-28 8:58 ` Heiko Carstens 2014-05-22 11:29 ` /proc/stat vs. failed order-4 allocation Ian Kent 1 sibling, 2 replies; 21+ messages in thread From: Christoph Hellwig @ 2014-05-21 14:32 UTC (permalink / raw) To: Heiko Carstens Cc: Andrew Morton, KAMEZAWA Hiroyuki, Andrea Righi, Eric Dumazet, linux-kernel, linux-fsdevel, Hendrik Brueckner, Thorsten Diehl On Wed, May 21, 2014 at 02:25:21PM +0200, Heiko Carstens wrote: > Hi all, > > I'm just wondering why /proc/stat is a single_open() seq_file and not a > regular seq_file with an iterator (say 48 online cpus for each iteration > or something similar). Probably because no one sent a patch for it. I'm pretty sure it used the even more horrible old proc ops before and was converted in batch with various other files. ^ permalink raw reply [flat|nested] 21+ messages in thread
* RE: /proc/stat vs. failed order-4 allocation
  2014-05-21 14:32 ` Christoph Hellwig
@ 2014-05-22  3:05   ` Elliott, Robert (Server Storage)
  2014-05-28  8:58   ` Heiko Carstens
  1 sibling, 0 replies; 21+ messages in thread
From: Elliott, Robert (Server Storage) @ 2014-05-22  3:05 UTC (permalink / raw)
To: Christoph Hellwig, Heiko Carstens
Cc: Andrew Morton, KAMEZAWA Hiroyuki, Andrea Righi, Eric Dumazet,
	linux-kernel, linux-fsdevel, Hendrik Brueckner, Thorsten Diehl

> -----Original Message-----
> From: linux-kernel-owner@vger.kernel.org [mailto:linux-kernel-
> owner@vger.kernel.org] On Behalf Of Christoph Hellwig
> Sent: Wednesday, May 21, 2014 9:32 AM
> To: Heiko Carstens
> Cc: Andrew Morton; KAMEZAWA Hiroyuki; Andrea Righi; Eric Dumazet; linux-
> kernel@vger.kernel.org; linux-fsdevel@vger.kernel.org; Hendrik Brueckner;
> Thorsten Diehl
> Subject: Re: /proc/stat vs. failed order-4 allocation
>
> On Wed, May 21, 2014 at 02:25:21PM +0200, Heiko Carstens wrote:
> > Hi all,
> >
> > I'm just wondering why /proc/stat is a single_open() seq_file and not a
> > regular seq_file with an iterator (say 48 online cpus for each iteration
> > or something similar).
>
> Probably because no one sent a patch for it.  I'm pretty sure it used the
> even more horrible old proc ops before and was converted in batch with
> various other files.

https://lkml.org/lkml/2012/1/20/153 worried about performance and
led to the current code; the reply in
https://lkml.org/lkml/2012/1/23/41 discussed using seq_file, but that
idea was rejected at the time.

Be careful about losing consistency of the information for the CPUs.

^ permalink raw reply	[flat|nested] 21+ messages in thread
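The consistency concern above can be illustrated outside the kernel. This is a simplified, hypothetical model (not kernel code): if per-CPU counters keep advancing while an iterator emits them a chunk at a time, the emitted snapshot mixes values from different points in time.

```python
# Hypothetical model: four per-CPU tick counters read in chunks of two,
# with the counters advancing between chunks -- roughly what an
# iterator-based /proc/stat reader could observe under load.
counters = [100, 100, 100, 100]

def read_in_chunks(counters, chunk=2, tick_between_chunks=True):
    snapshot = []
    for start in range(0, len(counters), chunk):
        snapshot.extend(counters[start:start + chunk])
        if tick_between_chunks:
            # counters advance while the read is still in progress
            for i in range(len(counters)):
                counters[i] += 1
    return snapshot

snap = read_in_chunks(list(counters))
print(snap)  # [100, 100, 101, 101] -- a torn snapshot, not one point in time
```

With single_open() the whole file is formatted in one ->show() call, so all CPU lines come from (nearly) one instant; a chunked iterator trades that property away.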
* Re: /proc/stat vs. failed order-4 allocation
  2014-05-21 14:32 ` Christoph Hellwig
  2014-05-22  3:05   ` Elliott, Robert (Server Storage)
@ 2014-05-28  8:58   ` Heiko Carstens
  2014-05-28  8:59     ` [PATCH 1/2] fs: proc/stat: use num_online_cpus() for buffer size Heiko Carstens
  2014-05-28  9:01     ` [PATCH 2/2] fs: proc/stat: use usual seq_file ops rather than single_open Heiko Carstens
  1 sibling, 2 replies; 21+ messages in thread
From: Heiko Carstens @ 2014-05-28  8:58 UTC (permalink / raw)
To: Christoph Hellwig, Andrew Morton
Cc: Andrew Morton, KAMEZAWA Hiroyuki, Andrea Righi, Eric Dumazet,
	linux-kernel, linux-fsdevel, Hendrik Brueckner, Thorsten Diehl,
	Ian Kent, Elliott, Robert (Server Storage)

On Wed, May 21, 2014 at 07:32:29AM -0700, Christoph Hellwig wrote:
> On Wed, May 21, 2014 at 02:25:21PM +0200, Heiko Carstens wrote:
> > Hi all,
> >
> > I'm just wondering why /proc/stat is a single_open() seq_file and not a
> > regular seq_file with an iterator (say 48 online cpus for each iteration
> > or something similar).
>
> Probably because no one sent a patch for it.  I'm pretty sure it used the
> even more horrible old proc ops before and was converted in batch with
> various other files.

Ok, so how about the two patches sent as reply to this mail.

(btw. if nobody objects to the modified patch from KAMEZAWA Hiroyuki
the first patch could be dropped and/or folded into the second patch)

Performance wise there doesn't seem to be too much of a difference,
however all measurements have been done on a 64 cpu 2nd level guest.
It _looks_ like the new code is < 3% slower.

^ permalink raw reply	[flat|nested] 21+ messages in thread
* [PATCH 1/2] fs: proc/stat: use num_online_cpus() for buffer size
  2014-05-28  8:58 ` Heiko Carstens
@ 2014-05-28  8:59   ` Heiko Carstens
  2014-05-28 11:06     ` Ian Kent
  0 siblings, 1 reply; 21+ messages in thread
From: Heiko Carstens @ 2014-05-28  8:59 UTC (permalink / raw)
To: Christoph Hellwig, Andrew Morton, KAMEZAWA Hiroyuki, Andrea Righi,
	Eric Dumazet, linux-kernel, linux-fsdevel, Hendrik Brueckner,
	Thorsten Diehl, Ian Kent, Elliott, Robert (Server Storage)

The number of bytes contained 'within' /proc/stat depends on the number
of online cpus and not on the number of possible cpus.

This reduces the number of bytes requested for the initial buffer allocation
within stat_open(), which is usually way too high and for nr_possible_cpus()
== 256 cpus would result in an order 4 allocation.

Order 4 allocations however may fail if memory is fragmented and we would
end up with an unreadable /proc/stat file:

[62129.701569] sadc: page allocation failure: order:4, mode:0x1040d0
[62129.701573] CPU: 1 PID: 192063 Comm: sadc Not tainted 3.10.0-123.el7.s390x #1
[...]
[62129.701586] Call Trace:
[62129.701588] ([<0000000000111fbe>] show_trace+0xe6/0x130)
[62129.701591]  [<0000000000112074>] show_stack+0x6c/0xe8
[62129.701593]  [<000000000020d356>] warn_alloc_failed+0xd6/0x138
[62129.701596]  [<00000000002114d2>] __alloc_pages_nodemask+0x9da/0xb68
[62129.701598]  [<000000000021168e>] __get_free_pages+0x2e/0x58
[62129.701599]  [<000000000025a05c>] kmalloc_order_trace+0x44/0xc0
[62129.701602]  [<00000000002f3ffa>] stat_open+0x5a/0xd8
[62129.701604]  [<00000000002e9aaa>] proc_reg_open+0x8a/0x140
[62129.701606]  [<0000000000273b64>] do_dentry_open+0x1bc/0x2c8
[62129.701608]  [<000000000027411e>] finish_open+0x46/0x60
[62129.701610]  [<000000000028675a>] do_last+0x382/0x10d0
[62129.701612]  [<0000000000287570>] path_openat+0xc8/0x4f8
[62129.701614]  [<0000000000288bde>] do_filp_open+0x46/0xa8
[62129.701616]  [<000000000027541c>] do_sys_open+0x114/0x1f0
[62129.701618]  [<00000000005b1c1c>] sysc_tracego+0x14/0x1a

Signed-off-by: Heiko Carstens <heiko.carstens@de.ibm.com>
---
 fs/proc/stat.c | 2 +-
 1 file changed, 1 insertion(+), 1 deletion(-)

diff --git a/fs/proc/stat.c b/fs/proc/stat.c
index 9d231e9e5f0e..3898ca5f1e92 100644
--- a/fs/proc/stat.c
+++ b/fs/proc/stat.c
@@ -184,7 +184,7 @@ static int show_stat(struct seq_file *p, void *v)
 
 static int stat_open(struct inode *inode, struct file *file)
 {
-	size_t size = 1024 + 128 * num_possible_cpus();
+	size_t size = 1024 + 128 * num_online_cpus();
 	char *buf;
 	struct seq_file *m;
 	int res;
-- 
1.8.5.5

^ permalink raw reply related	[flat|nested] 21+ messages in thread
* Re: [PATCH 1/2] fs: proc/stat: use num_online_cpus() for buffer size
  2014-05-28  8:59 ` [PATCH 1/2] fs: proc/stat: use num_online_cpus() for buffer size Heiko Carstens
@ 2014-05-28 11:06   ` Ian Kent
  2014-05-28 11:14     ` Ian Kent
  0 siblings, 1 reply; 21+ messages in thread
From: Ian Kent @ 2014-05-28 11:06 UTC (permalink / raw)
To: Heiko Carstens
Cc: Christoph Hellwig, Andrew Morton, KAMEZAWA Hiroyuki, Andrea Righi,
	Eric Dumazet, linux-kernel, linux-fsdevel, Hendrik Brueckner,
	Thorsten Diehl, Elliott, Robert (Server Storage)

On Wed, 2014-05-28 at 10:59 +0200, Heiko Carstens wrote:
> The number of bytes contained 'within' /proc/stat depends on the number
> of online cpus and not of the numbe of possible cpus.
>
> This reduces the number of bytes requested for the initial buffer allocation
> within stat_open(). Which is usually way too high and for nr_possible_cpus()
> == 256 cpus would result in an order 4 allocation.
>
> Order 4 allocations however may fail if memory is fragmented and we would
> end up with an unreadable /proc/stat file:
>
> [62129.701569] sadc: page allocation failure: order:4, mode:0x1040d0
> [62129.701573] CPU: 1 PID: 192063 Comm: sadc Not tainted 3.10.0-123.el7.s390x #1
> [...]
> [62129.701586] Call Trace:
> [62129.701588] ([<0000000000111fbe>] show_trace+0xe6/0x130)
> [62129.701591]  [<0000000000112074>] show_stack+0x6c/0xe8
> [62129.701593]  [<000000000020d356>] warn_alloc_failed+0xd6/0x138
> [62129.701596]  [<00000000002114d2>] __alloc_pages_nodemask+0x9da/0xb68
> [62129.701598]  [<000000000021168e>] __get_free_pages+0x2e/0x58
> [62129.701599]  [<000000000025a05c>] kmalloc_order_trace+0x44/0xc0
> [62129.701602]  [<00000000002f3ffa>] stat_open+0x5a/0xd8
> [62129.701604]  [<00000000002e9aaa>] proc_reg_open+0x8a/0x140
> [62129.701606]  [<0000000000273b64>] do_dentry_open+0x1bc/0x2c8
> [62129.701608]  [<000000000027411e>] finish_open+0x46/0x60
> [62129.701610]  [<000000000028675a>] do_last+0x382/0x10d0
> [62129.701612]  [<0000000000287570>] path_openat+0xc8/0x4f8
> [62129.701614]  [<0000000000288bde>] do_filp_open+0x46/0xa8
> [62129.701616]  [<000000000027541c>] do_sys_open+0x114/0x1f0
> [62129.701618]  [<00000000005b1c1c>] sysc_tracego+0x14/0x1a
>
> Signed-off-by: Heiko Carstens <heiko.carstens@de.ibm.com>
> ---
>  fs/proc/stat.c | 2 +-
>  1 file changed, 1 insertion(+), 1 deletion(-)
>
> diff --git a/fs/proc/stat.c b/fs/proc/stat.c
> index 9d231e9e5f0e..3898ca5f1e92 100644
> --- a/fs/proc/stat.c
> +++ b/fs/proc/stat.c
> @@ -184,7 +184,7 @@ static int show_stat(struct seq_file *p, void *v)
>
>  static int stat_open(struct inode *inode, struct file *file)
>  {
> -	size_t size = 1024 + 128 * num_possible_cpus();
> +	size_t size = 1024 + 128 * num_online_cpus();

Yes, I thought of this too when I was looking at the problem but was
concerned about the number of online cpus changing during the read.

If a system can hotplug cpus then I guess we don't care much about the
number of cpus increasing during the read, we'll just see incorrect data
once, but what would happen if some cpus were removed? Do we even care
about that case?

> 	char *buf;
> 	struct seq_file *m;
> 	int res;

^ permalink raw reply	[flat|nested] 21+ messages in thread
* Re: [PATCH 1/2] fs: proc/stat: use num_online_cpus() for buffer size
  2014-05-28 11:06 ` Ian Kent
@ 2014-05-28 11:14   ` Ian Kent
  0 siblings, 0 replies; 21+ messages in thread
From: Ian Kent @ 2014-05-28 11:14 UTC (permalink / raw)
To: Heiko Carstens
Cc: Christoph Hellwig, Andrew Morton, KAMEZAWA Hiroyuki, Andrea Righi,
	Eric Dumazet, linux-kernel, linux-fsdevel, Hendrik Brueckner,
	Thorsten Diehl, Elliott, Robert (Server Storage)

On Wed, 2014-05-28 at 19:06 +0800, Ian Kent wrote:
> On Wed, 2014-05-28 at 10:59 +0200, Heiko Carstens wrote:
> > The number of bytes contained 'within' /proc/stat depends on the number
> > of online cpus and not of the numbe of possible cpus.
> >
> > This reduces the number of bytes requested for the initial buffer allocation
> > within stat_open(). Which is usually way too high and for nr_possible_cpus()
> > == 256 cpus would result in an order 4 allocation.
> >
> > Order 4 allocations however may fail if memory is fragmented and we would
> > end up with an unreadable /proc/stat file:
> >
> > [62129.701569] sadc: page allocation failure: order:4, mode:0x1040d0
> > [62129.701573] CPU: 1 PID: 192063 Comm: sadc Not tainted 3.10.0-123.el7.s390x #1
> > [...]
> > [62129.701586] Call Trace:
> > [62129.701588] ([<0000000000111fbe>] show_trace+0xe6/0x130)
> > [62129.701591]  [<0000000000112074>] show_stack+0x6c/0xe8
> > [62129.701593]  [<000000000020d356>] warn_alloc_failed+0xd6/0x138
> > [62129.701596]  [<00000000002114d2>] __alloc_pages_nodemask+0x9da/0xb68
> > [62129.701598]  [<000000000021168e>] __get_free_pages+0x2e/0x58
> > [62129.701599]  [<000000000025a05c>] kmalloc_order_trace+0x44/0xc0
> > [62129.701602]  [<00000000002f3ffa>] stat_open+0x5a/0xd8
> > [62129.701604]  [<00000000002e9aaa>] proc_reg_open+0x8a/0x140
> > [62129.701606]  [<0000000000273b64>] do_dentry_open+0x1bc/0x2c8
> > [62129.701608]  [<000000000027411e>] finish_open+0x46/0x60
> > [62129.701610]  [<000000000028675a>] do_last+0x382/0x10d0
> > [62129.701612]  [<0000000000287570>] path_openat+0xc8/0x4f8
> > [62129.701614]  [<0000000000288bde>] do_filp_open+0x46/0xa8
> > [62129.701616]  [<000000000027541c>] do_sys_open+0x114/0x1f0
> > [62129.701618]  [<00000000005b1c1c>] sysc_tracego+0x14/0x1a
> >
> > Signed-off-by: Heiko Carstens <heiko.carstens@de.ibm.com>
> > ---
> >  fs/proc/stat.c | 2 +-
> >  1 file changed, 1 insertion(+), 1 deletion(-)
> >
> > diff --git a/fs/proc/stat.c b/fs/proc/stat.c
> > index 9d231e9e5f0e..3898ca5f1e92 100644
> > --- a/fs/proc/stat.c
> > +++ b/fs/proc/stat.c
> > @@ -184,7 +184,7 @@ static int show_stat(struct seq_file *p, void *v)
> >
> >  static int stat_open(struct inode *inode, struct file *file)
> >  {
> > -	size_t size = 1024 + 128 * num_possible_cpus();
> > +	size_t size = 1024 + 128 * num_online_cpus();
>
> Yes, I thought of this too when I was looking at the problem but was
> concerned about the number of online cpus changing during the read.
>
> If a system can hotplug cpus then I guess we don't care much about the
> number of cpus increasing during the read, we'll just see incorrect data
> once, but what would happen if some cpus were removed? Do we even care
> about that case?

Oh hang on, that's not right it's the opposite, if the number of cpus
increases between the call to stat_open() and show_stat() there might
not be enough space.

> > 	char *buf;
> > 	struct seq_file *m;
> > 	int res;

^ permalink raw reply	[flat|nested] 21+ messages in thread
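An undersized first guess is not fatal in the seq_file model, though: when a ->show() overflows the buffer, seq_file discards the partial output, reallocates the buffer at twice the size, and retries. The sketch below is a simplified Python model of that grow-and-retry loop (not the real fs/seq_file.c, which also handles partial records and read offsets):

```python
def seq_emit(show, initial_size):
    """Simplified model of seq_file's grow-and-retry behavior:
    'show' returns the formatted record; if it does not fit in the
    current buffer, the buffer is doubled and the record regenerated."""
    size = initial_size
    attempts = 0
    while True:
        attempts += 1
        data = show()
        if len(data) <= size:
            return data, size, attempts
        size *= 2  # seq_read()/traverse() retry with a doubled buffer

# Buffer sized for 4 online CPUs, but 6 CPUs are online by ->show() time:
line = "cpu 1 2 3 4\n"
record = line * 6
data, final_size, attempts = seq_emit(lambda: record, initial_size=len(line) * 4)
```

So a hotplug between stat_open() and show_stat() costs one extra formatting pass plus a reallocation; the read still succeeds (the reallocation itself can of course fail again if it grows back into a high order).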
* [PATCH 2/2] fs: proc/stat: use usual seq_file ops rather than single_open
  2014-05-28  8:58 ` Heiko Carstens
  2014-05-28  8:59   ` [PATCH 1/2] fs: proc/stat: use num_online_cpus() for buffer size Heiko Carstens
@ 2014-05-28  9:01   ` Heiko Carstens
  2014-05-28 22:37     ` Andrew Morton
  1 sibling, 1 reply; 21+ messages in thread
From: Heiko Carstens @ 2014-05-28  9:01 UTC (permalink / raw)
To: Christoph Hellwig, Andrew Morton, KAMEZAWA Hiroyuki, Andrea Righi,
	Eric Dumazet, linux-kernel, linux-fsdevel, Hendrik Brueckner,
	Thorsten Diehl, Ian Kent, Elliott, Robert (Server Storage)

From: KAMEZAWA Hiroyuki <kamezawa.hiroyu@jp.fujitsu.com>

Now, /proc/stat uses single_open() for showing information. This means
all the data will be gathered and buffered once into a (big) buf.

Now, /proc/stat shows stats per cpu and stats per IRQ. To get the
information in one shot, it allocates a big buffer (up to
KMALLOC_MAX_SIZE).

Eric Dumazet reported that the bufsize calculation doesn't take the
number of IRQs into account and the information cannot be read in one
shot. (Because of this, seq_read() will allocate the buffer again and
read the whole data again...)

This patch changes /proc/stat to use seq_open() rather than single_open()
and provides ->start(), ->next(), ->stop(), ->show().

By this, /proc/stat will not need to take care of the size of the buffer.

[heiko.carstens@de.ibm.com]: This is the forward port of a patch
from KAMEZAWA Hiroyuki (https://lkml.org/lkml/2012/1/23/41).
I added a couple of simple changes, e.g. the cpu iterator handles
32 cpus in a batch to avoid lots of iterations.

With this patch it should not happen anymore that reading /proc/stat
fails because of a failing high order memory allocation.

Signed-off-by: KAMEZAWA Hiroyuki <kamezawa.hiroyu@jp.fujitsu.com>
Signed-off-by: Heiko Carstens <heiko.carstens@de.ibm.com>
---
 fs/proc/stat.c | 278 +++++++++++++++++++++++++++++++++++++++++----------------
 1 file changed, 203 insertions(+), 75 deletions(-)

diff --git a/fs/proc/stat.c b/fs/proc/stat.c
index 3898ca5f1e92..652e255fee90 100644
--- a/fs/proc/stat.c
+++ b/fs/proc/stat.c
@@ -77,22 +77,109 @@ static u64 get_iowait_time(int cpu)
 
 #endif
 
-static int show_stat(struct seq_file *p, void *v)
+enum proc_stat_stage	/* The numbers are used as *pos and iter->stage */
+{
+	SHOW_TOTAL_CPU_STAT,
+	SHOW_PERCPU_STAT,
+	SHOW_TOTAL_IRQS,
+	SHOW_IRQ_DETAILS,
+	SHOW_TIMES,
+	SHOW_TOTAL_SOFTIRQ,
+	SHOW_SOFTIRQ_DETAILS,
+	SHOW_EOL,
+	END_STATS,
+};
+
+/*
+ * To reduce the number of ->next(), ->show() calls IRQ numbers are
+ * handled in batch.
+ */
+struct seq_stat_iter {
+	int stage;
+	unsigned long jiffies;
+	int cpu_iter;
+	int irq_iter;
+	int softirq_iter;
+	/* cached information */
+	u64 irq_sum;
+	u64 softirq_sum;
+	u32 per_softirq_sums[NR_SOFTIRQS];
+};
+
+static void *proc_stat_start(struct seq_file *p, loff_t *pos)
+{
+	struct seq_stat_iter *iter = p->private;
+
+	/* At lseek(), *pos==0 is passed.(see travers() in seq_file.c */
+	if (!*pos) {
+		struct timespec boottime;
+
+		memset(iter, 0, sizeof(*iter));
+		iter->stage = SHOW_TOTAL_CPU_STAT;
+		getboottime(&boottime);
+		iter->jiffies = boottime.tv_sec;
+	}
+	if (iter->stage == END_STATS)
+		return NULL;
+	return iter;
+}
+
+static void proc_stat_stop(struct seq_file *p, void *v)
+{
+}
+
+static void *proc_stat_next(struct seq_file *p, void *v, loff_t *pos)
+{
+	struct seq_stat_iter *iter = p->private;
+	int index;
+
+	switch (iter->stage) {
+	case SHOW_TOTAL_CPU_STAT:
+		iter->stage = SHOW_PERCPU_STAT;
+		iter->cpu_iter = cpumask_first(cpu_online_mask);
+		break;
+	case SHOW_PERCPU_STAT:
+		index = cpumask_next(iter->cpu_iter, cpu_online_mask);
+		if (index >= nr_cpu_ids)
+			iter->stage = SHOW_TOTAL_IRQS;
+		else
+			iter->cpu_iter = index;
+		break;
+	case SHOW_TOTAL_IRQS:
+		iter->stage = SHOW_IRQ_DETAILS;
+		iter->irq_iter = 0;
+		break;
+	case SHOW_IRQ_DETAILS:
+		if (iter->irq_iter >= nr_irqs)
+			iter->stage = SHOW_TIMES;
+		break;
+	case SHOW_TIMES:
+		iter->stage = SHOW_TOTAL_SOFTIRQ;
+		break;
+	case SHOW_TOTAL_SOFTIRQ:
+		iter->stage = SHOW_SOFTIRQ_DETAILS;
+		break;
+	case SHOW_SOFTIRQ_DETAILS:
+		iter->stage = SHOW_EOL;
+		break;
+	case SHOW_EOL:
+		iter->stage = END_STATS;
+		return NULL;
+	default:
+		break;
+	}
+	return iter;
+}
+
+static int show_total_cpu_stat(struct seq_file *p, struct seq_stat_iter *iter)
 {
-	int i, j;
-	unsigned long jif;
 	u64 user, nice, system, idle, iowait, irq, softirq, steal;
 	u64 guest, guest_nice;
-	u64 sum = 0;
-	u64 sum_softirq = 0;
-	unsigned int per_softirq_sums[NR_SOFTIRQS] = {0};
-	struct timespec boottime;
+	int i, j;
 
-	user = nice = system = idle = iowait =
-		irq = softirq = steal = 0;
+	user = nice = system = idle = iowait = 0;
+	irq = softirq = steal = 0;
 	guest = guest_nice = 0;
-	getboottime(&boottime);
-	jif = boottime.tv_sec;
 
 	for_each_possible_cpu(i) {
 		user += kcpustat_cpu(i).cpustat[CPUTIME_USER];
@@ -105,17 +192,17 @@ static int show_stat(struct seq_file *p, void *v)
 		steal += kcpustat_cpu(i).cpustat[CPUTIME_STEAL];
 		guest += kcpustat_cpu(i).cpustat[CPUTIME_GUEST];
 		guest_nice += kcpustat_cpu(i).cpustat[CPUTIME_GUEST_NICE];
-		sum += kstat_cpu_irqs_sum(i);
-		sum += arch_irq_stat_cpu(i);
+		iter->irq_sum += kstat_cpu_irqs_sum(i);
+		iter->irq_sum += arch_irq_stat_cpu(i);
 
 		for (j = 0; j < NR_SOFTIRQS; j++) {
 			unsigned int softirq_stat = kstat_softirqs_cpu(j, i);
 
-			per_softirq_sums[j] += softirq_stat;
-			sum_softirq += softirq_stat;
+			iter->per_softirq_sums[j] += softirq_stat;
+			iter->softirq_sum += softirq_stat;
 		}
 	}
-	sum += arch_irq_stat();
+	iter->irq_sum += arch_irq_stat();
 
 	seq_puts(p, "cpu ");
 	seq_put_decimal_ull(p, ' ', cputime64_to_clock_t(user));
@@ -129,20 +216,31 @@ static int show_stat(struct seq_file *p, void *v)
 	seq_put_decimal_ull(p, ' ', cputime64_to_clock_t(guest));
 	seq_put_decimal_ull(p, ' ', cputime64_to_clock_t(guest_nice));
 	seq_putc(p, '\n');
+	return 0;
+}
+
+static int show_online_cpu_stat(struct seq_file *p, struct seq_stat_iter *iter)
+{
+	u64 user, nice, system, idle, iowait, irq, softirq, steal;
+	u64 guest, guest_nice;
+	int i, cpu, index;
 
-	for_each_online_cpu(i) {
+	/* Handle 32 cpus at a time, to avoid lots of seqfile iterations. */
+	cpu = index = iter->cpu_iter;
+	for (i = 0; i < 32 && index < nr_cpu_ids; i++) {
+		cpu = index;
 		/* Copy values here to work around gcc-2.95.3, gcc-2.96 */
-		user = kcpustat_cpu(i).cpustat[CPUTIME_USER];
-		nice = kcpustat_cpu(i).cpustat[CPUTIME_NICE];
-		system = kcpustat_cpu(i).cpustat[CPUTIME_SYSTEM];
-		idle = get_idle_time(i);
-		iowait = get_iowait_time(i);
-		irq = kcpustat_cpu(i).cpustat[CPUTIME_IRQ];
-		softirq = kcpustat_cpu(i).cpustat[CPUTIME_SOFTIRQ];
-		steal = kcpustat_cpu(i).cpustat[CPUTIME_STEAL];
-		guest = kcpustat_cpu(i).cpustat[CPUTIME_GUEST];
-		guest_nice = kcpustat_cpu(i).cpustat[CPUTIME_GUEST_NICE];
-		seq_printf(p, "cpu%d", i);
+		user = kcpustat_cpu(cpu).cpustat[CPUTIME_USER];
+		nice = kcpustat_cpu(cpu).cpustat[CPUTIME_NICE];
+		system = kcpustat_cpu(cpu).cpustat[CPUTIME_SYSTEM];
+		idle = get_idle_time(cpu);
+		iowait = get_iowait_time(cpu);
+		irq = kcpustat_cpu(cpu).cpustat[CPUTIME_IRQ];
+		softirq = kcpustat_cpu(cpu).cpustat[CPUTIME_SOFTIRQ];
+		steal = kcpustat_cpu(cpu).cpustat[CPUTIME_STEAL];
+		guest = kcpustat_cpu(cpu).cpustat[CPUTIME_GUEST];
+		guest_nice = kcpustat_cpu(cpu).cpustat[CPUTIME_GUEST_NICE];
+		seq_printf(p, "cpu%d", cpu);
 		seq_put_decimal_ull(p, ' ', cputime64_to_clock_t(user));
 		seq_put_decimal_ull(p, ' ', cputime64_to_clock_t(nice));
 		seq_put_decimal_ull(p, ' ', cputime64_to_clock_t(system));
@@ -154,66 +252,96 @@ static int show_stat(struct seq_file *p, void *v)
 		seq_put_decimal_ull(p, ' ', cputime64_to_clock_t(guest));
 		seq_put_decimal_ull(p, ' ', cputime64_to_clock_t(guest_nice));
 		seq_putc(p, '\n');
+		index = cpumask_next(cpu, cpu_online_mask);
 	}
-	seq_printf(p, "intr %llu", (unsigned long long)sum);
-
-	/* sum again ? it could be updated? */
-	for_each_irq_nr(j)
-		seq_put_decimal_ull(p, ' ', kstat_irqs(j));
-
-	seq_printf(p,
-		"\nctxt %llu\n"
-		"btime %lu\n"
-		"processes %lu\n"
-		"procs_running %lu\n"
-		"procs_blocked %lu\n",
-		nr_context_switches(),
-		(unsigned long)jif,
-		total_forks,
-		nr_running(),
-		nr_iowait());
-
-	seq_printf(p, "softirq %llu", (unsigned long long)sum_softirq);
-
-	for (i = 0; i < NR_SOFTIRQS; i++)
-		seq_put_decimal_ull(p, ' ', per_softirq_sums[i]);
-	seq_putc(p, '\n');
+	iter->cpu_iter = cpu;
+	return 0;
+}
+
+static int show_irq_details(struct seq_file *p, struct seq_stat_iter *iter)
+{
+	int ret;
+
+	/*
+	 * we update iterater in ->show()...seems ugly but for avoiding
+	 * tons of function calls, print out here as much as possible
+	 */
+	do {
+		ret = seq_put_decimal_ull(p, ' ', kstat_irqs(iter->irq_iter));
+		if (!ret)
+			iter->irq_iter += 1;
+	} while (!ret && iter->irq_iter < nr_irqs);
 	return 0;
 }
 
+static int show_softirq_details(struct seq_file *p, struct seq_stat_iter *iter)
+{
+	int ret;
+
+	do {
+		ret = seq_put_decimal_ull(p, ' ',
+				iter->per_softirq_sums[iter->softirq_iter]);
+		if (!ret)
+			iter->softirq_iter += 1;
+	} while (!ret && iter->softirq_iter < NR_SOFTIRQS);
+	return 0;
+}
+
+static int proc_stat_show(struct seq_file *p, void *v)
+{
+	struct seq_stat_iter *iter = v;
+
+	switch (iter->stage) {
+	case SHOW_TOTAL_CPU_STAT:
+		return show_total_cpu_stat(p, iter);
+	case SHOW_PERCPU_STAT:
+		return show_online_cpu_stat(p, iter);
+	case SHOW_TOTAL_IRQS:
+		return seq_printf(p, "intr %llu",
+				  (unsigned long long)iter->irq_sum);
+	case SHOW_IRQ_DETAILS:
+		return show_irq_details(p, iter);
+	case SHOW_TIMES:
+		return seq_printf(p,
+			"\nctxt %llu\n"
+			"btime %lu\n"
+			"processes %lu\n"
+			"procs_running %lu\n"
+			"procs_blocked %lu\n",
+			nr_context_switches(),
+			(unsigned long)iter->jiffies,
+			total_forks,
+			nr_running(),
+			nr_iowait());
+	case SHOW_TOTAL_SOFTIRQ:
+		return seq_printf(p, "softirq %llu",
+				  (unsigned long long)iter->softirq_sum);
+	case SHOW_SOFTIRQ_DETAILS:
+		return show_softirq_details(p, iter);
+	case SHOW_EOL:
+		return seq_putc(p, '\n');
+	}
+	return 0;
+}
+
+static const struct seq_operations show_stat_op = {
+	.start	= proc_stat_start,
+	.next	= proc_stat_next,
+	.stop	= proc_stat_stop,
+	.show	= proc_stat_show,
+};
+
 static int stat_open(struct inode *inode, struct file *file)
 {
-	size_t size = 1024 + 128 * num_online_cpus();
-	char *buf;
-	struct seq_file *m;
-	int res;
-
-	/* minimum size to display an interrupt count : 2 bytes */
-	size += 2 * nr_irqs;
-
-	/* don't ask for more than the kmalloc() max size */
-	if (size > KMALLOC_MAX_SIZE)
-		size = KMALLOC_MAX_SIZE;
-	buf = kmalloc(size, GFP_KERNEL);
-	if (!buf)
-		return -ENOMEM;
-
-	res = single_open(file, show_stat, NULL);
-	if (!res) {
-		m = file->private_data;
-		m->buf = buf;
-		m->size = ksize(buf);
-	} else
-		kfree(buf);
-	return res;
+	return seq_open_private(file, &show_stat_op, sizeof(struct seq_stat_iter));
 }
 
 static const struct file_operations proc_stat_operations = {
 	.open		= stat_open,
 	.read		= seq_read,
 	.llseek		= seq_lseek,
-	.release	= single_release,
+	.release	= seq_release_private,
 };
 
 static int __init proc_stat_init(void)
-- 
1.8.5.5

^ permalink raw reply related	[flat|nested] 21+ messages in thread
* Re: [PATCH 2/2] fs: proc/stat: use usual seq_file ops rather than single_open
  2014-05-28  9:01 ` [PATCH 2/2] fs: proc/stat: use usual seq_file ops rather than single_open Heiko Carstens
@ 2014-05-28 22:37   ` Andrew Morton
  2014-05-30  8:38     ` Heiko Carstens
  2014-06-09  8:11     ` [PATCH 2/2] fs: proc/stat: use usual seq_file ops rather than single_open Ian Kent
  0 siblings, 2 replies; 21+ messages in thread
From: Andrew Morton @ 2014-05-28 22:37 UTC (permalink / raw)
To: Heiko Carstens
Cc: Christoph Hellwig, KAMEZAWA Hiroyuki, Andrea Righi, Eric Dumazet,
	linux-kernel, linux-fsdevel, Hendrik Brueckner, Thorsten Diehl,
	Ian Kent, Elliott, Robert (Server Storage)

On Wed, 28 May 2014 11:01:53 +0200 Heiko Carstens <heiko.carstens@de.ibm.com> wrote:

> Now, /proc/stat uses single_open() for showing information. This means
> the all data will be gathered and buffered once to a (big) buf.
>
> Now, /proc/stat shows stats per cpu and stats per IRQs. To get information
> in once-shot, it allocates a big buffer (until KMALLOC_MAX_SIZE).
>
> Eric Dumazet reported that the bufsize calculation doesn't take
> the numner of IRQ into account and the information cannot be
> got in one-shot. (By this, seq_read() will allocate buffer again
> and read the whole data gain...)
>
> This patch changes /proc/stat to use seq_open() rather than single_open()
> and provides ->start(), ->next(), ->stop(), ->show().
>
> By this, /proc/stat will not need to take care of size of buffer.
>
> [heiko.carstens@de.ibm.com]: This is the forward port of a patch
> from KAMEZAWA Hiroyuki (https://lkml.org/lkml/2012/1/23/41).
> I added a couple of simple changes like e.g. that the cpu iterator
> handles 32 cpus in a batch to avoid lots of iterations.
>
> With this patch it should not happen anymore that reading /proc/stat
> fails because of a failing high order memory allocation.

So this deletes the problematic allocation which [1/2] kind-of fixed,
yes?

I agree with Ian - there's a hotplugging race.  And [1/2] doesn't do
anything to address the worst-case allocation size.  So I think we may
as well do this all in a single patch.

Without having looked closely at the code I worry a bit about the
effects.  /proc/pid/stat is a complex thing and its contents will vary
in strange ways as the things it is reporting on undergo concurrent
changes.

This patch will presumably replace one set of bizarre behaviour with a
new set of bizarre behaviour and there may be unforeseen consequences to
established userspace.  So we're going to need a lot of testing and a
lot of testing time to identify issues and weed them out.

So.. can we take this up for 3.16-rc1?  See if we can get some careful
review done then and test it for a couple of months?

Meanwhile, the changelog looks a bit hastily thrown together - some
smoothing would be nice, and perhaps some work spent identifying
possible behavioural changes.  Timing changes, locking changes, effects
of concurrent fork/exit activity etc?

^ permalink raw reply	[flat|nested] 21+ messages in thread
* Re: [PATCH 2/2] fs: proc/stat: use usual seq_file ops rather than single_open 2014-05-28 22:37 ` Andrew Morton @ 2014-05-30 8:38 ` Heiko Carstens 2014-05-30 11:36 ` [PATCH] fs: proc/stat: use seq_file iterator interface Heiko Carstens 2014-06-09 8:11 ` [PATCH 2/2] fs: proc/stat: use usual seq_file ops rather than single_open Ian Kent 1 sibling, 1 reply; 21+ messages in thread From: Heiko Carstens @ 2014-05-30 8:38 UTC (permalink / raw) To: Andrew Morton Cc: Christoph Hellwig, KAMEZAWA Hiroyuki, Andrea Righi, Eric Dumazet, linux-kernel, linux-fsdevel, Hendrik Brueckner, Thorsten Diehl, Ian Kent, Elliott, Robert (Server Storage) On Wed, May 28, 2014 at 03:37:04PM -0700, Andrew Morton wrote: > On Wed, 28 May 2014 11:01:53 +0200 Heiko Carstens <heiko.carstens@de.ibm.com> wrote: > > With this patch it should not happen anymore that reading /proc/stat > > fails because of a failing high order memory allocation. > > So this deletes the problematic allocation which [1/2] kind-of fixed, > yes? Yes. > I agree with Ian - there's a hotplugging race. And [1/2] doesn't do > anything to address the worst-case allocation size. So I think we may > as well do this all in a single patch. Fine with me. However the hotplugging race in 1/2 doesn't matter: if the result doesn't fit into the preallocated buffer the seq_file infrastructure would simply allocate a buffer twice as large as before and retry. The point of patch 1/2 was to have a patch that probably solves the problem almost always ;) , without having the problems you describe below. > Without having looked closely at the code I worry a bit about the > effects. /proc/pid/stat is a complex thing and its contents will vary It's /proc/stat not /proc/pid/stat. > So.. can we take this up for 3.16-rc1? See if we can get some careful > review done then and test it for a couple of months? Sure. 
> Meanwhile, the changelog looks a bit hastily thrown together - some > smoothing would be nice, and perhaps some work spent identifying > possible behavioural changes. Timing changes, locking changes, effects > of concurrent fork/exit activity etc? Well... I'll try to come up with something better. Even though I only forward ported an existing patch to address a memory allocation failure. Oh oh... ^ permalink raw reply [flat|nested] 21+ messages in thread
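The buffer-doubling retry behaviour Heiko relies on above ("the seq_file infrastructure would simply allocate a buffer twice as large as before and retry") can be sketched in userspace C. This is a hypothetical simulation of the Eoverflow path in fs/seq_file.c, not the real kernel code; `struct fake_seq` and `fake_seq_read()` are illustrative stand-ins:

```c
#include <assert.h>
#include <stddef.h>

/*
 * Hypothetical userspace sketch of the seq_file overflow handling
 * (the Eoverflow path in fs/seq_file.c): when the gathered output does
 * not fit, the buffer is dropped, its size is doubled and the whole
 * traversal restarts from scratch.
 */
struct fake_seq {
	size_t size;	/* current buffer size, starts at PAGE_SIZE */
	size_t count;	/* bytes the ->show() callbacks want to emit */
	int passes;	/* full traversals needed until everything fits */
};

static size_t fake_seq_read(struct fake_seq *m, size_t needed)
{
	m->count = needed;
	m->passes = 1;
	while (m->count > m->size) {	/* overflow: restart with a  */
		m->size <<= 1;		/* buffer twice as large     */
		m->passes++;
	}
	return m->size;
}
```

With a 4k first guess and the roughly 33k needed for 256 possible cpus, this converges after a handful of passes, which is why an undersized preallocation (as in patch 1/2) costs only extra traversals, never correctness.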
* [PATCH] fs: proc/stat: use seq_file iterator interface 2014-05-30 8:38 ` Heiko Carstens @ 2014-05-30 11:36 ` Heiko Carstens 0 siblings, 0 replies; 21+ messages in thread From: Heiko Carstens @ 2014-05-30 11:36 UTC (permalink / raw) To: Andrew Morton, Christoph Hellwig, KAMEZAWA Hiroyuki, Andrea Righi, Eric Dumazet, linux-kernel, linux-fsdevel, Hendrik Brueckner, Thorsten Diehl, Ian Kent, Elliott, Robert (Server Storage) > Well... I'll try to come up with something better. Even though I only > forward ported an existing patch to address a memory allocation failure. How about the patch below then? From 52a050f85256d7586933365da1b98c6227651449 Mon Sep 17 00:00:00 2001 From: KAMEZAWA Hiroyuki <kamezawa.hiroyu@jp.fujitsu.com> Date: Tue, 27 May 2014 12:41:00 +0200 Subject: [PATCH] fs: proc/stat: use seq_file iterator interface This is the forward port of a patch from KAMEZAWA Hiroyuki with a couple of additional small changes. The original patch was posted here: https://lkml.org/lkml/2012/1/23/41. /proc/stat uses single_open() for showing information. This means all data will be gathered and buffered once into a big buffer. The size of the buffer depends on the number of possible cpus and therefore may get very large. This may lead to large memory allocations, which can fail if memory is fragmented (e.g. for num_possible_cpus() == 256): sadc: page allocation failure: order:4, mode:0x1040d0 CPU: 1 PID: 192063 Comm: sadc Not tainted 3.10.0-123.el7.s390x #1 [...] 
Call Trace: ([<0000000000111fbe>] show_trace+0xe6/0x130) [<0000000000112074>] show_stack+0x6c/0xe8 [<000000000020d356>] warn_alloc_failed+0xd6/0x138 [<00000000002114d2>] __alloc_pages_nodemask+0x9da/0xb68 [<000000000021168e>] __get_free_pages+0x2e/0x58 [<000000000025a05c>] kmalloc_order_trace+0x44/0xc0 [<00000000002f3ffa>] stat_open+0x5a/0xd8 [<00000000002e9aaa>] proc_reg_open+0x8a/0x140 [<0000000000273b64>] do_dentry_open+0x1bc/0x2c8 [<000000000027411e>] finish_open+0x46/0x60 [<000000000028675a>] do_last+0x382/0x10d0 [<0000000000287570>] path_openat+0xc8/0x4f8 [<0000000000288bde>] do_filp_open+0x46/0xa8 [<000000000027541c>] do_sys_open+0x114/0x1f0 [<00000000005b1c1c>] sysc_tracego+0x14/0x1a This in turn means that reading /proc/stat doesn't work at all in such situations. To address this issue, convert /proc/stat to use the seq_file iterator interface. The existing output is split into several pieces so that each piece fits into a single page before it is copied to user space. That way the higher order allocation goes away. Unfortunately this has user space visible side effects. Unlike before, reading /proc/stat may now require several read system calls, even if the buffer provided by user space would be large enough to hold the entire data. However it seems unlikely that such user space programs exist, since the content of /proc/stat varies all the time anyway. Furthermore, the code that generates the output gathers per cpu statistics without any locking. So while a process reads /proc/stat the data may change concurrently, potentially leading to inconsistent data. This is still the case after the change; however, because of the iterator interface and the additional latencies that come with it, the per cpu statistics may get more inconsistent than before. Whether this is really an issue remains to be seen. 
Signed-off-by: KAMEZAWA Hiroyuki <kamezawa.hiroyu@jp.fujitsu.com> Signed-off-by: Heiko Carstens <heiko.carstens@de.ibm.com> --- fs/proc/stat.c | 278 +++++++++++++++++++++++++++++++++++++++++---------------- 1 file changed, 203 insertions(+), 75 deletions(-) diff --git a/fs/proc/stat.c b/fs/proc/stat.c index 9d231e9e5f0e..141c4b773ec1 100644 --- a/fs/proc/stat.c +++ b/fs/proc/stat.c @@ -77,22 +77,109 @@ static u64 get_iowait_time(int cpu) #endif -static int show_stat(struct seq_file *p, void *v) +enum proc_stat_stage /* The numbers are used as *pos and iter->stage */ +{ + SHOW_TOTAL_CPU_STAT, + SHOW_PERCPU_STAT, + SHOW_TOTAL_IRQS, + SHOW_IRQ_DETAILS, + SHOW_TIMES, + SHOW_TOTAL_SOFTIRQ, + SHOW_SOFTIRQ_DETAILS, + SHOW_EOL, + END_STATS, +}; + +/* + * To reduce the number of ->next(), ->show() calls IRQ numbers are + * handled in batch. + */ +struct seq_stat_iter { + int stage; + unsigned long jiffies; + int cpu_iter; + int irq_iter; + int softirq_iter; + /* cached information */ + u64 irq_sum; + u64 softirq_sum; + u32 per_softirq_sums[NR_SOFTIRQS]; +}; + +static void *proc_stat_start(struct seq_file *p, loff_t *pos) +{ + struct seq_stat_iter *iter = p->private; + + /* At lseek(), *pos==0 is passed (see traverse() in seq_file.c. 
*/ + if (!*pos) { + struct timespec boottime; + + memset(iter, 0, sizeof(*iter)); + iter->stage = SHOW_TOTAL_CPU_STAT; + getboottime(&boottime); + iter->jiffies = boottime.tv_sec; + } + if (iter->stage == END_STATS) + return NULL; + return iter; +} + +static void proc_stat_stop(struct seq_file *p, void *v) +{ +} + +static void *proc_stat_next(struct seq_file *p, void *v, loff_t *pos) +{ + struct seq_stat_iter *iter = p->private; + int index; + + switch (iter->stage) { + case SHOW_TOTAL_CPU_STAT: + iter->stage = SHOW_PERCPU_STAT; + iter->cpu_iter = cpumask_first(cpu_online_mask); + break; + case SHOW_PERCPU_STAT: + index = cpumask_next(iter->cpu_iter, cpu_online_mask); + if (index >= nr_cpu_ids) + iter->stage = SHOW_TOTAL_IRQS; + else + iter->cpu_iter = index; + break; + case SHOW_TOTAL_IRQS: + iter->stage = SHOW_IRQ_DETAILS; + iter->irq_iter = 0; + break; + case SHOW_IRQ_DETAILS: + if (iter->irq_iter >= nr_irqs) + iter->stage = SHOW_TIMES; + break; + case SHOW_TIMES: + iter->stage = SHOW_TOTAL_SOFTIRQ; + break; + case SHOW_TOTAL_SOFTIRQ: + iter->stage = SHOW_SOFTIRQ_DETAILS; + break; + case SHOW_SOFTIRQ_DETAILS: + iter->stage = SHOW_EOL; + break; + case SHOW_EOL: + iter->stage = END_STATS; + return NULL; + default: + break; + } + return iter; +} + +static int show_total_cpu_stat(struct seq_file *p, struct seq_stat_iter *iter) { - int i, j; - unsigned long jif; u64 user, nice, system, idle, iowait, irq, softirq, steal; u64 guest, guest_nice; - u64 sum = 0; - u64 sum_softirq = 0; - unsigned int per_softirq_sums[NR_SOFTIRQS] = {0}; - struct timespec boottime; + int i, j; - user = nice = system = idle = iowait = - irq = softirq = steal = 0; + user = nice = system = idle = iowait = 0; + irq = softirq = steal = 0; guest = guest_nice = 0; - getboottime(&boottime); - jif = boottime.tv_sec; for_each_possible_cpu(i) { user += kcpustat_cpu(i).cpustat[CPUTIME_USER]; @@ -105,17 +192,17 @@ static int show_stat(struct seq_file *p, void *v) steal += 
kcpustat_cpu(i).cpustat[CPUTIME_STEAL]; guest += kcpustat_cpu(i).cpustat[CPUTIME_GUEST]; guest_nice += kcpustat_cpu(i).cpustat[CPUTIME_GUEST_NICE]; - sum += kstat_cpu_irqs_sum(i); - sum += arch_irq_stat_cpu(i); + iter->irq_sum += kstat_cpu_irqs_sum(i); + iter->irq_sum += arch_irq_stat_cpu(i); for (j = 0; j < NR_SOFTIRQS; j++) { unsigned int softirq_stat = kstat_softirqs_cpu(j, i); - per_softirq_sums[j] += softirq_stat; - sum_softirq += softirq_stat; + iter->per_softirq_sums[j] += softirq_stat; + iter->softirq_sum += softirq_stat; } } - sum += arch_irq_stat(); + iter->irq_sum += arch_irq_stat(); seq_puts(p, "cpu "); seq_put_decimal_ull(p, ' ', cputime64_to_clock_t(user)); @@ -129,20 +216,31 @@ static int show_stat(struct seq_file *p, void *v) seq_put_decimal_ull(p, ' ', cputime64_to_clock_t(guest)); seq_put_decimal_ull(p, ' ', cputime64_to_clock_t(guest_nice)); seq_putc(p, '\n'); + return 0; +} + +static int show_online_cpu_stat(struct seq_file *p, struct seq_stat_iter *iter) +{ + u64 user, nice, system, idle, iowait, irq, softirq, steal; + u64 guest, guest_nice; + int i, cpu, index; - for_each_online_cpu(i) { + /* Handle 32 cpus at a time, to avoid lots of seqfile iterations. 
*/ + cpu = index = iter->cpu_iter; + for (i = 0; i < 32 && index < nr_cpu_ids; i++) { + cpu = index; /* Copy values here to work around gcc-2.95.3, gcc-2.96 */ - user = kcpustat_cpu(i).cpustat[CPUTIME_USER]; - nice = kcpustat_cpu(i).cpustat[CPUTIME_NICE]; - system = kcpustat_cpu(i).cpustat[CPUTIME_SYSTEM]; - idle = get_idle_time(i); - iowait = get_iowait_time(i); - irq = kcpustat_cpu(i).cpustat[CPUTIME_IRQ]; - softirq = kcpustat_cpu(i).cpustat[CPUTIME_SOFTIRQ]; - steal = kcpustat_cpu(i).cpustat[CPUTIME_STEAL]; - guest = kcpustat_cpu(i).cpustat[CPUTIME_GUEST]; - guest_nice = kcpustat_cpu(i).cpustat[CPUTIME_GUEST_NICE]; - seq_printf(p, "cpu%d", i); + user = kcpustat_cpu(cpu).cpustat[CPUTIME_USER]; + nice = kcpustat_cpu(cpu).cpustat[CPUTIME_NICE]; + system = kcpustat_cpu(cpu).cpustat[CPUTIME_SYSTEM]; + idle = get_idle_time(cpu); + iowait = get_iowait_time(cpu); + irq = kcpustat_cpu(cpu).cpustat[CPUTIME_IRQ]; + softirq = kcpustat_cpu(cpu).cpustat[CPUTIME_SOFTIRQ]; + steal = kcpustat_cpu(cpu).cpustat[CPUTIME_STEAL]; + guest = kcpustat_cpu(cpu).cpustat[CPUTIME_GUEST]; + guest_nice = kcpustat_cpu(cpu).cpustat[CPUTIME_GUEST_NICE]; + seq_printf(p, "cpu%d", cpu); seq_put_decimal_ull(p, ' ', cputime64_to_clock_t(user)); seq_put_decimal_ull(p, ' ', cputime64_to_clock_t(nice)); seq_put_decimal_ull(p, ' ', cputime64_to_clock_t(system)); @@ -154,66 +252,96 @@ static int show_stat(struct seq_file *p, void *v) seq_put_decimal_ull(p, ' ', cputime64_to_clock_t(guest)); seq_put_decimal_ull(p, ' ', cputime64_to_clock_t(guest_nice)); seq_putc(p, '\n'); + index = cpumask_next(cpu, cpu_online_mask); } - seq_printf(p, "intr %llu", (unsigned long long)sum); - - /* sum again ? it could be updated? 
*/ - for_each_irq_nr(j) - seq_put_decimal_ull(p, ' ', kstat_irqs(j)); - - seq_printf(p, - "\nctxt %llu\n" - "btime %lu\n" - "processes %lu\n" - "procs_running %lu\n" - "procs_blocked %lu\n", - nr_context_switches(), - (unsigned long)jif, - total_forks, - nr_running(), - nr_iowait()); - - seq_printf(p, "softirq %llu", (unsigned long long)sum_softirq); - - for (i = 0; i < NR_SOFTIRQS; i++) - seq_put_decimal_ull(p, ' ', per_softirq_sums[i]); - seq_putc(p, '\n'); + iter->cpu_iter = cpu; + return 0; +} + +static int show_irq_details(struct seq_file *p, struct seq_stat_iter *iter) +{ + int ret; + + /* + * we update iterater in ->show()...seems ugly but for avoiding + * tons of function calls, print out here as much as possible + */ + do { + ret = seq_put_decimal_ull(p, ' ', kstat_irqs(iter->irq_iter)); + if (!ret) + iter->irq_iter += 1; + } while (!ret && iter->irq_iter < nr_irqs); return 0; } +static int show_softirq_details(struct seq_file *p, struct seq_stat_iter *iter) +{ + int ret; + + do { + ret = seq_put_decimal_ull(p, ' ', + iter->per_softirq_sums[iter->softirq_iter]); + if (!ret) + iter->softirq_iter += 1; + } while (!ret && iter->softirq_iter < NR_SOFTIRQS); + return 0; +} + +static int proc_stat_show(struct seq_file *p, void *v) +{ + struct seq_stat_iter *iter = v; + + switch (iter->stage) { + case SHOW_TOTAL_CPU_STAT: + return show_total_cpu_stat(p, iter); + case SHOW_PERCPU_STAT: + return show_online_cpu_stat(p, iter); + case SHOW_TOTAL_IRQS: + return seq_printf(p, "intr %llu", + (unsigned long long)iter->irq_sum); + case SHOW_IRQ_DETAILS: + return show_irq_details(p, iter); + case SHOW_TIMES: + return seq_printf(p, + "\nctxt %llu\n" + "btime %lu\n" + "processes %lu\n" + "procs_running %lu\n" + "procs_blocked %lu\n", + nr_context_switches(), + (unsigned long)iter->jiffies, + total_forks, + nr_running(), + nr_iowait()); + case SHOW_TOTAL_SOFTIRQ: + return seq_printf(p, "softirq %llu", + (unsigned long long)iter->softirq_sum); + case SHOW_SOFTIRQ_DETAILS: + 
return show_softirq_details(p, iter); + case SHOW_EOL: + return seq_putc(p, '\n'); + } + return 0; +} + +static const struct seq_operations show_stat_op = { + .start = proc_stat_start, + .next = proc_stat_next, + .stop = proc_stat_stop, + .show = proc_stat_show, +}; + static int stat_open(struct inode *inode, struct file *file) { - size_t size = 1024 + 128 * num_possible_cpus(); - char *buf; - struct seq_file *m; - int res; - - /* minimum size to display an interrupt count : 2 bytes */ - size += 2 * nr_irqs; - - /* don't ask for more than the kmalloc() max size */ - if (size > KMALLOC_MAX_SIZE) - size = KMALLOC_MAX_SIZE; - buf = kmalloc(size, GFP_KERNEL); - if (!buf) - return -ENOMEM; - - res = single_open(file, show_stat, NULL); - if (!res) { - m = file->private_data; - m->buf = buf; - m->size = ksize(buf); - } else - kfree(buf); - return res; + return seq_open_private(file, &show_stat_op, sizeof(struct seq_stat_iter)); } static const struct file_operations proc_stat_operations = { .open = stat_open, .read = seq_read, .llseek = seq_lseek, - .release = single_release, + .release = seq_release_private, }; static int __init proc_stat_init(void) -- 1.8.5.5 ^ permalink raw reply related [flat|nested] 21+ messages in thread
* Re: [PATCH 2/2] fs: proc/stat: use usual seq_file ops rather than single_open 2014-05-28 22:37 ` Andrew Morton 2014-05-30 8:38 ` Heiko Carstens @ 2014-06-09 8:11 ` Ian Kent 2014-06-11 12:43 ` Heiko Carstens 1 sibling, 1 reply; 21+ messages in thread From: Ian Kent @ 2014-06-09 8:11 UTC (permalink / raw) To: Andrew Morton Cc: Heiko Carstens, Christoph Hellwig, KAMEZAWA Hiroyuki, Andrea Righi, Eric Dumazet, linux-kernel, linux-fsdevel, Hendrik Brueckner, Thorsten Diehl, Elliott, Robert (Server Storage) On Wed, 2014-05-28 at 15:37 -0700, Andrew Morton wrote: > On Wed, 28 May 2014 11:01:53 +0200 Heiko Carstens <heiko.carstens@de.ibm.com> wrote: > > > Now, /proc/stat uses single_open() for showing information. This means > > all the data will be gathered and buffered once into a (big) buf. > > > > Now, /proc/stat shows stats per cpu and stats per IRQ. To get the information > > in one shot, it allocates a big buffer (up to KMALLOC_MAX_SIZE). > > > > Eric Dumazet reported that the bufsize calculation doesn't take > > the number of IRQs into account and the information cannot be > > got in one shot. (By this, seq_read() will allocate the buffer again > > and read the whole data again...) > > > > This patch changes /proc/stat to use seq_open() rather than single_open() > > and provides ->start(), ->next(), ->stop(), ->show(). > > > > By this, /proc/stat will not need to take care of the size of the buffer. > > > > [heiko.carstens@de.ibm.com]: This is the forward port of a patch > > from KAMEZAWA Hiroyuki (https://lkml.org/lkml/2012/1/23/41). > > I added a couple of simple changes like e.g. that the cpu iterator > > handles 32 cpus in a batch to avoid lots of iterations. > > > > With this patch it should not happen anymore that reading /proc/stat > > fails because of a failing high order memory allocation. > > So this deletes the problematic allocation which [1/2] kind-of fixed, > yes? > > I agree with Ian - there's a hotplugging race. 
And [1/2] doesn't do > anything to address the worst-case allocation size. So I think we may > as well do this all in a single patch. > > Without having looked closely at the code I worry a bit about the > effects. /proc/pid/stat is a complex thing and its contents will vary > in strange ways as the things it is reporting on undergo concurrent > changes. This patch will presumably replace one set of bizarre > behaviour with a new set of bizarre behaviour and there may be > unforeseen consequences to established userspace. > > So we're going to need a lot of testing and a lot of testing time to > identify issues and weed them out. > > So.. can we take this up for 3.16-rc1? See if we can get some careful > review done then and test it for a couple of months? > > Meanwhile, the changelog looks a bit hastily thrown together - some > smoothing would be nice, and perhaps some work spent identifying > possible behavioural changes. Timing changes, locking changes, effects > of concurrent fork/exit activity etc? > Umm ... I didn't expect this to turn into such a rant, apologies in advance. Certainly using the usual seq_file ops is desirable in the long run and that change should be worked on by those that maintain this area of code but, as Andrew says, it's too large a change to put in without considerable testing. Now the problem has been exposed by a change which sets the number of CPUs to the maximum number the architecture (s390) can have, 256, when no value has been specified and the kernel default configuration is used rather than 32 when hotplug is not set or 64 when it is. The allocation problem doesn't occur when the number of CPUs is set to the previous default of 64, even for low end systems with 2 CPUs and 2G RAM (like the one for which this problem was reported), but becomes a problem when the number of CPUs is set to 256 on these systems. Also, I believe the s390 maintainers are committed to keeping the default configured number of CPUs at 256. 
So the actual problem is the heuristic used to calculate an initial buffer size not accounting for a configured number of CPUs much greater than the hardware can sensibly accommodate. If we assume that the current implementation functions correctly when the buffer overflows, i.e. doubles the allocation size and restarts, then an interim solution to the problem comes down to not much more than what is in patch 1 above. Looking at the current heuristic allocation sizes, without taking the allocation for irqs into account we get:
cpus: 32 size: 5k
cpus: 64 size: 9k
cpus: 128 size: 17k
cpus: 256 size: 33k
I don't know how many irqs need to be accounted for or if that will make a difference to my comments below. Someone else will need to comment on that. We know (from the order 4 allocation fail) that with 256 CPUs kmalloc() is looking for a 64k slab chunk IIUC and on low memory systems will frequently fail to get it. And for the previous default of 64 CPUs kmalloc() would be looking for a 16k slab which we have no evidence it doesn't always get, even on low end systems. So why even use a heuristic calculation, since it can be quite wasteful anyway or, as in this case, badly wrong? Why not just allocate 8k or 16k in the open every time, knowing that if the actual number of CPUs is large we can reasonably expect the system RAM to be correspondingly large, which should avoid allocation failures upon a read retry? Comments please? Ian ^ permalink raw reply [flat|nested] 21+ messages in thread
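Ian's size table and the order-4 claim can be checked with a small throwaway program. This is an illustrative sketch: `stat_bufsize()` simply re-implements the heuristic from the `stat_open()` code quoted at the start of the thread, and irqs are set to 0 as in the table above:

```c
#include <assert.h>
#include <stddef.h>

/* The stat_open() heuristic: 1024 bytes of slack, 128 bytes per
 * possible cpu, plus 2 bytes per interrupt counter. */
static size_t stat_bufsize(unsigned int ncpus, unsigned int nirqs)
{
	return 1024 + 128 * (size_t)ncpus + 2 * (size_t)nirqs;
}

/* Page-allocation order kmalloc ends up needing for `size` bytes,
 * assuming 4k pages: the smallest power-of-two chunk >= size. */
static unsigned int alloc_order(size_t size)
{
	unsigned int order = 0;
	size_t chunk = 4096;

	while (chunk < size) {
		chunk <<= 1;
		order++;
	}
	return order;
}
```

For 256 cpus this gives 33k before irqs are even counted, i.e. a 64k slab chunk and hence the order-4 allocation seen in the failure; with the previous default of 64 cpus it stays within a 16k (order-2) chunk, matching Ian's numbers.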
* Re: [PATCH 2/2] fs: proc/stat: use usual seq_file ops rather than single_open 2014-06-09 8:11 ` [PATCH 2/2] fs: proc/stat: use usual seq_file ops rather than single_open Ian Kent @ 2014-06-11 12:43 ` Heiko Carstens 2014-06-11 22:29 ` David Rientjes 0 siblings, 1 reply; 21+ messages in thread From: Heiko Carstens @ 2014-06-11 12:43 UTC (permalink / raw) To: Ian Kent Cc: Andrew Morton, Christoph Hellwig, KAMEZAWA Hiroyuki, Andrea Righi, Eric Dumazet, linux-kernel, linux-fsdevel, Hendrik Brueckner, Thorsten Diehl, Elliott, Robert (Server Storage), Al Viro [full quote, since I added Al to cc] On Mon, Jun 09, 2014 at 04:11:59PM +0800, Ian Kent wrote: > On Wed, 2014-05-28 at 15:37 -0700, Andrew Morton wrote: > > On Wed, 28 May 2014 11:01:53 +0200 Heiko Carstens <heiko.carstens@de.ibm.com> wrote: > > > > > Now, /proc/stat uses single_open() for showing information. This means > > > all the data will be gathered and buffered once into a (big) buf. > > > > > > Now, /proc/stat shows stats per cpu and stats per IRQ. To get the information > > > in one shot, it allocates a big buffer (up to KMALLOC_MAX_SIZE). > > > > > > Eric Dumazet reported that the bufsize calculation doesn't take > > > the number of IRQs into account and the information cannot be > > > got in one shot. (By this, seq_read() will allocate the buffer again > > > and read the whole data again...) > > > > > > This patch changes /proc/stat to use seq_open() rather than single_open() > > > and provides ->start(), ->next(), ->stop(), ->show(). > > > > > > By this, /proc/stat will not need to take care of the size of the buffer. > > > > > > [heiko.carstens@de.ibm.com]: This is the forward port of a patch > > > from KAMEZAWA Hiroyuki (https://lkml.org/lkml/2012/1/23/41). > > > I added a couple of simple changes like e.g. that the cpu iterator > > > handles 32 cpus in a batch to avoid lots of iterations. 
> > > > > > With this patch it should not happen anymore that reading /proc/stat > > > fails because of a failing high order memory allocation. > > > > So this deletes the problematic allocation which [1/2] kind-of fixed, > > yes? > > > > I agree with Ian - there's a hotplugging race. And [1/2] doesn't do > > anything to address the worst-case allocation size. So I think we may > > as well do this all in a single patch. > > > > Without having looked closely at the code I worry a bit about the > > effects. /proc/pid/stat is a complex thing and its contents will vary > > in strange ways as the things it is reporting on undergo concurrent > > changes. This patch will presumably replace one set of bizarre > > behaviour with a new set of bizarre behaviour and there may be > > unforeseen consequences to established userspace. > > > > So we're going to need a lot of testing and a lot of testing time to > > identify issues and weed them out. > > > > So.. can we take this up for 3.16-rc1? See if we can get some careful > > review done then and test it for a couple of months? > > > > Meanwhile, the changelog looks a bit hastily thrown together - some > > smoothing would be nice, and perhaps some work spent identifying > > possible behavioural changes. Timing changes, locking changes, effects > > of concurrent fork/exit activity etc? > > > > Umm ... I didn't expect this to turn into such a rant, apologies in > advance. > > Certainly using the usual seq_file ops is desirable in the long run and > that change should be worked on by those that maintain this area of code > but, as Andrew says, it's too large a change to put in without > considerable testing. > > Now the problem has been exposed by a change which sets the number of > CPUs to the maximum number the architecture (s390) can have, 256, when > no value has been specified and the kernel default configuration is used > rather than 32 when hotplug is not set or 64 when it is. 
> > The allocation problem doesn't occur when the number of CPUs is set to > the previous default of 64, even for low end systems with 2 CPUs and 2G > RAM (like the one for which this problem was reported), but becomes a > problem when the number of CPUs is set to 256 on these systems. > > Also, I believe the s390 maintainers are committed to keeping the > default configured number of CPUs at 256. > > So the actual problem is the heuristic used to calculate an initial > buffer size not accounting for a configured number of CPUs much greater > than the hardware can sensibly accommodate. > > If we assume that the current implementation functions correctly when > the buffer overflows, i.e. doubles the allocation size and restarts, then > an interim solution to the problem comes down to not much more than what > is in patch 1 above. > > Looking at the current heuristic allocation sizes, without taking the > allocation for irqs into account we get:
> cpus: 32 size: 5k
> cpus: 64 size: 9k
> cpus: 128 size: 17k
> cpus: 256 size: 33k
> > I don't know how many irqs need to be accounted for or if that will make > a difference to my comments below. Someone else will need to comment on > that. > > We know (from the order 4 allocation fail) that with 256 CPUs kmalloc() > is looking for a 64k slab chunk IIUC and on low memory systems will > frequently fail to get it. > > And for the previous default of 64 CPUs kmalloc() would be looking for a > 16k slab which we have no evidence it doesn't always get, even on low end > systems. > > So why even use a heuristic calculation, since it can be quite wasteful > anyway or, as in this case, badly wrong? Why not just allocate 8k or 16k > in the open every time, knowing that if the actual number of CPUs is > large we can reasonably expect the system RAM to be correspondingly > large, which should avoid allocation failures upon a read retry? > > Comments please? 
Alternatively we could also change the seqfile code to fall back to vmalloc allocations. That would probably "fix" all single_open usages where large contiguous memory areas are needed and later on fail due to memory fragmentation. Does anybody like that approach (sample patch below)? --- fs/seq_file.c | 45 +++++++++++++++++++++++++++++++++++++-------- 1 file changed, 37 insertions(+), 8 deletions(-) diff --git a/fs/seq_file.c b/fs/seq_file.c index 1d641bb108d2..fca78a04c0d1 100644 --- a/fs/seq_file.c +++ b/fs/seq_file.c @@ -8,8 +8,10 @@ #include <linux/fs.h> #include <linux/export.h> #include <linux/seq_file.h> +#include <linux/vmalloc.h> #include <linux/slab.h> #include <linux/cred.h> +#include <linux/mm.h> #include <asm/uaccess.h> #include <asm/page.h> @@ -82,6 +84,31 @@ int seq_open(struct file *file, const struct seq_operations *op) } EXPORT_SYMBOL(seq_open); +static void seq_alloc(struct seq_file *m) +{ + m->size = PAGE_SIZE; + m->buf = kmalloc(m->size, GFP_KERNEL | __GFP_NOWARN); + if (!m->buf) + m->buf = vmalloc(m->size); +} + +static void seq_free(struct seq_file *m) +{ + if (unlikely(is_vmalloc_addr(m->buf))) + vfree(m->buf); + else + kfree(m->buf); +} + +static void seq_realloc(struct seq_file *m) +{ + seq_free(m); + m->size <<= 1; + m->buf = kmalloc(m->size, GFP_KERNEL | __GFP_NOWARN); + if (!m->buf) + m->buf = vmalloc(m->size); +} + static int traverse(struct seq_file *m, loff_t offset) { loff_t pos = 0, index; @@ -96,7 +123,7 @@ static int traverse(struct seq_file *m, loff_t offset) return 0; } if (!m->buf) { - m->buf = kmalloc(m->size = PAGE_SIZE, GFP_KERNEL); + seq_alloc(m); if (!m->buf) return -ENOMEM; } @@ -135,9 +162,8 @@ static int traverse(struct seq_file *m, loff_t offset) Eoverflow: m->op->stop(m, p); - kfree(m->buf); m->count = 0; - m->buf = kmalloc(m->size <<= 1, GFP_KERNEL); + seq_realloc(m); return !m->buf ? 
-ENOMEM : -EAGAIN; } @@ -192,7 +218,7 @@ ssize_t seq_read(struct file *file, char __user *buf, size_t size, loff_t *ppos) /* grab buffer if we didn't have one */ if (!m->buf) { - m->buf = kmalloc(m->size = PAGE_SIZE, GFP_KERNEL); + seq_alloc(m); if (!m->buf) goto Enomem; } @@ -232,9 +258,8 @@ ssize_t seq_read(struct file *file, char __user *buf, size_t size, loff_t *ppos) if (m->count < m->size) goto Fill; m->op->stop(m, p); - kfree(m->buf); m->count = 0; - m->buf = kmalloc(m->size <<= 1, GFP_KERNEL); + seq_realloc(m); if (!m->buf) goto Enomem; m->version = 0; @@ -350,7 +375,7 @@ EXPORT_SYMBOL(seq_lseek); int seq_release(struct inode *inode, struct file *file) { struct seq_file *m = file->private_data; - kfree(m->buf); + seq_free(m); kfree(m); return 0; } @@ -605,8 +630,12 @@ EXPORT_SYMBOL(single_open); int single_open_size(struct file *file, int (*show)(struct seq_file *, void *), void *data, size_t size) { - char *buf = kmalloc(size, GFP_KERNEL); + char *buf; int ret; + + buf = kmalloc(size, GFP_KERNEL | __GFP_NOWARN); + if (!buf) + buf = vmalloc(size); if (!buf) return -ENOMEM; ret = single_open(file, show, data); -- 1.8.5.5 ^ permalink raw reply related [flat|nested] 21+ messages in thread
* Re: [PATCH 2/2] fs: proc/stat: use usual seq_file ops rather than single_open 2014-06-11 12:43 ` Heiko Carstens @ 2014-06-11 22:29 ` David Rientjes 2014-06-12 6:24 ` Ian Kent 0 siblings, 1 reply; 21+ messages in thread From: David Rientjes @ 2014-06-11 22:29 UTC (permalink / raw) To: Heiko Carstens Cc: Ian Kent, Andrew Morton, Christoph Hellwig, KAMEZAWA Hiroyuki, Andrea Righi, Eric Dumazet, linux-kernel, linux-fsdevel, Hendrik Brueckner, Thorsten Diehl, Elliott, Robert (Server Storage), Al Viro On Wed, 11 Jun 2014, Heiko Carstens wrote: > Alternatively we could also change the seqfile code to fall back to > vmalloc allocations. That would probably "fix" all single_open usages > where large contiguous memory areas are needed and later on fail due > to memory fragmentation. > Does anybody like that approach (sample patch below)? > > --- > fs/seq_file.c | 45 +++++++++++++++++++++++++++++++++++++-------- > 1 file changed, 37 insertions(+), 8 deletions(-) > > diff --git a/fs/seq_file.c b/fs/seq_file.c > index 1d641bb108d2..fca78a04c0d1 100644 > --- a/fs/seq_file.c > +++ b/fs/seq_file.c > @@ -8,8 +8,10 @@ > #include <linux/fs.h> > #include <linux/export.h> > #include <linux/seq_file.h> > +#include <linux/vmalloc.h> > #include <linux/slab.h> > #include <linux/cred.h> > +#include <linux/mm.h> > > #include <asm/uaccess.h> > #include <asm/page.h> > @@ -82,6 +84,31 @@ int seq_open(struct file *file, const struct seq_operations *op) > } > EXPORT_SYMBOL(seq_open); > > +static void seq_alloc(struct seq_file *m) > +{ > + m->size = PAGE_SIZE; > + m->buf = kmalloc(m->size, GFP_KERNEL | __GFP_NOWARN); > + if (!m->buf) > + m->buf = vmalloc(m->size); > +} > + If m->size is unconditionally PAGE_SIZE, then how is vmalloc() going to allocate this if kmalloc() fails? 
> +static void seq_free(struct seq_file *m) > +{ > + if (unlikely(is_vmalloc_addr(m->buf))) > + vfree(m->buf); > + else > + kfree(m->buf); > +} > + > +static void seq_realloc(struct seq_file *m) > +{ > + seq_free(m); > + m->size <<= 1; > + m->buf = kmalloc(m->size, GFP_KERNEL | __GFP_NOWARN); > + if (!m->buf) > + m->buf = vmalloc(m->size); > +} > + > static int traverse(struct seq_file *m, loff_t offset) > { > loff_t pos = 0, index; > @@ -96,7 +123,7 @@ static int traverse(struct seq_file *m, loff_t offset) > return 0; > } > if (!m->buf) { > - m->buf = kmalloc(m->size = PAGE_SIZE, GFP_KERNEL); > + seq_alloc(m); > if (!m->buf) > return -ENOMEM; > } > @@ -135,9 +162,8 @@ static int traverse(struct seq_file *m, loff_t offset) > > Eoverflow: > m->op->stop(m, p); > - kfree(m->buf); > m->count = 0; > - m->buf = kmalloc(m->size <<= 1, GFP_KERNEL); > + seq_realloc(m); > return !m->buf ? -ENOMEM : -EAGAIN; > } > It seems like traverse() could be rewritten to use krealloc() which does a memcpy() to simplify the calling code. 
> @@ -192,7 +218,7 @@ ssize_t seq_read(struct file *file, char __user *buf, size_t size, loff_t *ppos) > > /* grab buffer if we didn't have one */ > if (!m->buf) { > - m->buf = kmalloc(m->size = PAGE_SIZE, GFP_KERNEL); > + seq_alloc(m); > if (!m->buf) > goto Enomem; > } > @@ -232,9 +258,8 @@ ssize_t seq_read(struct file *file, char __user *buf, size_t size, loff_t *ppos) > if (m->count < m->size) > goto Fill; > m->op->stop(m, p); > - kfree(m->buf); > m->count = 0; > - m->buf = kmalloc(m->size <<= 1, GFP_KERNEL); > + seq_realloc(m); > if (!m->buf) > goto Enomem; > m->version = 0; > @@ -350,7 +375,7 @@ EXPORT_SYMBOL(seq_lseek); > int seq_release(struct inode *inode, struct file *file) > { > struct seq_file *m = file->private_data; > - kfree(m->buf); > + seq_free(m); > kfree(m); > return 0; > } > @@ -605,8 +630,12 @@ EXPORT_SYMBOL(single_open); > int single_open_size(struct file *file, int (*show)(struct seq_file *, void *), > void *data, size_t size) > { > - char *buf = kmalloc(size, GFP_KERNEL); > + char *buf; > int ret; > + > + buf = kmalloc(size, GFP_KERNEL | __GFP_NOWARN); > + if (!buf) > + buf = vmalloc(size); > if (!buf) > return -ENOMEM; > ret = single_open(file, show, data); ^ permalink raw reply [flat|nested] 21+ messages in thread
* Re: [PATCH 2/2] fs: proc/stat: use usual seq_file ops rather than single_open 2014-06-11 22:29 ` David Rientjes @ 2014-06-12 6:24 ` Ian Kent 2014-06-12 6:52 ` David Rientjes 0 siblings, 1 reply; 21+ messages in thread From: Ian Kent @ 2014-06-12 6:24 UTC (permalink / raw) To: David Rientjes Cc: Heiko Carstens, Andrew Morton, Christoph Hellwig, KAMEZAWA Hiroyuki, Andrea Righi, Eric Dumazet, linux-kernel, linux-fsdevel, Hendrik Brueckner, Thorsten Diehl, Elliott, Robert (Server Storage), Al Viro On Wed, 2014-06-11 at 15:29 -0700, David Rientjes wrote: > On Wed, 11 Jun 2014, Heiko Carstens wrote: > > > Alternatively we could also change the seqfile code to fall back to > > vmalloc allocations. That would probably "fix" all single_open usages > > where large contiguous memory areas are needed and later on fail due > > to memory fragmentation. > > Does anybody like that approach (sample patch below)? Personally, I much prefer this from a risk of regression POV. > > > > --- > > fs/seq_file.c | 45 +++++++++++++++++++++++++++++++++++++-------- > > 1 file changed, 37 insertions(+), 8 deletions(-) > > > > diff --git a/fs/seq_file.c b/fs/seq_file.c > > index 1d641bb108d2..fca78a04c0d1 100644 > > --- a/fs/seq_file.c > > +++ b/fs/seq_file.c > > @@ -8,8 +8,10 @@ > > #include <linux/fs.h> > > #include <linux/export.h> > > #include <linux/seq_file.h> > > +#include <linux/vmalloc.h> > > #include <linux/slab.h> > > #include <linux/cred.h> > > +#include <linux/mm.h> > > > > #include <asm/uaccess.h> > > #include <asm/page.h> > > @@ -82,6 +84,31 @@ int seq_open(struct file *file, const struct seq_operations *op) > > } > > EXPORT_SYMBOL(seq_open); > > > > +static void seq_alloc(struct seq_file *m) > > +{ > > + m->size = PAGE_SIZE; > > + m->buf = kmalloc(m->size, GFP_KERNEL | __GFP_NOWARN); > > + if (!m->buf) > > + m->buf = vmalloc(m->size); > > +} > > + > > If m->size is unconditionally PAGE_SIZE, then how is vmalloc() going to > allocate this if kmalloc() fails? 
This is just the initial allocation. If it runs out of room the allocation size doubles. I think 2*PAGE_SIZE is probably better here since that's closer to what the original heuristic allocation requested and is likely to avoid reallocations in most cases. A larger allocation that kmalloc() fails on low-spec hardware with fragmented memory might still succeed when vmalloc() is used, since vmalloc() doesn't require contiguous memory chunks. But I guess the added pressure on the page table might still be a problem; nevertheless it's probably worth trying before bailing out. > > > +static void seq_free(struct seq_file *m) > > +{ > > + if (unlikely(is_vmalloc_addr(m->buf))) > > + vfree(m->buf); > > + else > > + kfree(m->buf); > > +} > > + > > +static void seq_realloc(struct seq_file *m) > > +{ > > + seq_free(m); > > + m->size <<= 1; > > + m->buf = kmalloc(m->size, GFP_KERNEL | __GFP_NOWARN); > > + if (!m->buf) > > + m->buf = vmalloc(m->size); > > +} > > + > > static int traverse(struct seq_file *m, loff_t offset) > > { > > loff_t pos = 0, index; > > @@ -96,7 +123,7 @@ static int traverse(struct seq_file *m, loff_t offset) > > return 0; > > } > > if (!m->buf) { > > - m->buf = kmalloc(m->size = PAGE_SIZE, GFP_KERNEL); > > + seq_alloc(m); > > if (!m->buf) > > return -ENOMEM; > > } > > @@ -135,9 +162,8 @@ static int traverse(struct seq_file *m, loff_t offset) > > > > Eoverflow: > > m->op->stop(m, p); > > - kfree(m->buf); > > m->count = 0; > > - m->buf = kmalloc(m->size <<= 1, GFP_KERNEL); > > + seq_realloc(m); > > return !m->buf ? -ENOMEM : -EAGAIN; > > } > > > > It seems like traverse() could be rewritten to use krealloc() which does a > memcpy() to simplify the calling code. Sure, but that's an improvement that could be considered once the current problem is addressed. 
Even though the struct seq_file fields shouldn't be used outside of the seq_file routines, we can't know that's the case for all users, and we don't know if others make assumptions based on the way it works now. Ian > > > @@ -192,7 +218,7 @@ ssize_t seq_read(struct file *file, char __user *buf, size_t size, loff_t *ppos) > > > > /* grab buffer if we didn't have one */ > > if (!m->buf) { > > - m->buf = kmalloc(m->size = PAGE_SIZE, GFP_KERNEL); > > + seq_alloc(m); > > if (!m->buf) > > goto Enomem; > > } > > @@ -232,9 +258,8 @@ ssize_t seq_read(struct file *file, char __user *buf, size_t size, loff_t *ppos) > > if (m->count < m->size) > > goto Fill; > > m->op->stop(m, p); > > - kfree(m->buf); > > m->count = 0; > > - m->buf = kmalloc(m->size <<= 1, GFP_KERNEL); > > + seq_realloc(m); > > if (!m->buf) > > goto Enomem; > > m->version = 0; > > @@ -350,7 +375,7 @@ EXPORT_SYMBOL(seq_lseek); > > int seq_release(struct inode *inode, struct file *file) > > { > > struct seq_file *m = file->private_data; > > - kfree(m->buf); > > + seq_free(m); > > kfree(m); > > return 0; > > } > > @@ -605,8 +630,12 @@ EXPORT_SYMBOL(single_open); > > int single_open_size(struct file *file, int (*show)(struct seq_file *, void *), > > void *data, size_t size) > > { > > - char *buf = kmalloc(size, GFP_KERNEL); > > + char *buf; > > int ret; > > + > > + buf = kmalloc(size, GFP_KERNEL | __GFP_NOWARN); > > + if (!buf) > > + buf = vmalloc(size); > > if (!buf) > > return -ENOMEM; > > ret = single_open(file, show, data); ^ permalink raw reply [flat|nested] 21+ messages in thread
* Re: [PATCH 2/2] fs: proc/stat: use usual seq_file ops rather than single_open 2014-06-12 6:24 ` Ian Kent @ 2014-06-12 6:52 ` David Rientjes 2014-06-12 7:27 ` Heiko Carstens 2014-06-12 11:09 ` Ian Kent 0 siblings, 2 replies; 21+ messages in thread From: David Rientjes @ 2014-06-12 6:52 UTC (permalink / raw) To: Ian Kent Cc: Heiko Carstens, Andrew Morton, Christoph Hellwig, KAMEZAWA Hiroyuki, Andrea Righi, Eric Dumazet, linux-kernel, linux-fsdevel, Hendrik Brueckner, Thorsten Diehl, Elliott, Robert (Server Storage), Al Viro On Thu, 12 Jun 2014, Ian Kent wrote: > > > diff --git a/fs/seq_file.c b/fs/seq_file.c > > > index 1d641bb108d2..fca78a04c0d1 100644 > > > --- a/fs/seq_file.c > > > +++ b/fs/seq_file.c > > > @@ -8,8 +8,10 @@ > > > #include <linux/fs.h> > > > #include <linux/export.h> > > > #include <linux/seq_file.h> > > > +#include <linux/vmalloc.h> > > > #include <linux/slab.h> > > > #include <linux/cred.h> > > > +#include <linux/mm.h> > > > > > > #include <asm/uaccess.h> > > > #include <asm/page.h> > > > @@ -82,6 +84,31 @@ int seq_open(struct file *file, const struct seq_operations *op) > > > } > > > EXPORT_SYMBOL(seq_open); > > > > > > +static void seq_alloc(struct seq_file *m) > > > +{ > > > + m->size = PAGE_SIZE; > > > + m->buf = kmalloc(m->size, GFP_KERNEL | __GFP_NOWARN); > > > + if (!m->buf) > > > + m->buf = vmalloc(m->size); > > > +} > > > + > > > > If m->size is unconditionally PAGE_SIZE, then how is vmalloc() going to > > allocate this if kmalloc() fails? > > This is just the initial allocation. > If it runs out of room the allocation size doubles. > > I think 2*PAGE_SIZE is probably better here since that's closer to what > the original heuristic allocation requested and is likely to avoid > reallocations in most cases. > > The issue of kmalloc() failing for larger allocations on low speced > hardware with fragmented memory might succeed when vmalloc() is used > since it doesn't require contiguous memory chunks. 
But I guess the added > pressure on the page table might still be a problem, nevertheless it's > probably worth trying before bailing out. > I'm not quarreling about using vmalloc() for allocations that are high-order; I'm referring to the rather obvious fact that m->size is set to PAGE_SIZE unconditionally above, and thus vmalloc() isn't going to help in the slightest. ^ permalink raw reply [flat|nested] 21+ messages in thread
* Re: [PATCH 2/2] fs: proc/stat: use usual seq_file ops rather than single_open 2014-06-12 6:52 ` David Rientjes @ 2014-06-12 7:27 ` Heiko Carstens 2014-06-12 8:18 ` Heiko Carstens 2014-06-12 20:59 ` David Rientjes 2014-06-12 11:09 ` Ian Kent 1 sibling, 2 replies; 21+ messages in thread From: Heiko Carstens @ 2014-06-12 7:27 UTC (permalink / raw) To: David Rientjes Cc: Ian Kent, Andrew Morton, Christoph Hellwig, KAMEZAWA Hiroyuki, Andrea Righi, Eric Dumazet, linux-kernel, linux-fsdevel, Hendrik Brueckner, Thorsten Diehl, Elliott, Robert (Server Storage), Al Viro On Wed, Jun 11, 2014 at 11:52:31PM -0700, David Rientjes wrote: > On Thu, 12 Jun 2014, Ian Kent wrote: > > > > +static void seq_alloc(struct seq_file *m) > > > > +{ > > > > + m->size = PAGE_SIZE; > > > > + m->buf = kmalloc(m->size, GFP_KERNEL | __GFP_NOWARN); > > > > + if (!m->buf) > > > > + m->buf = vmalloc(m->size); > > > > +} > > > > + > > > > > > If m->size is unconditionally PAGE_SIZE, then how is vmalloc() going to > > > allocate this if kmalloc() fails? > > > > This is just the initial allocation. > > If it runs out of room the allocation size doubles. > > > > I think 2*PAGE_SIZE is probably better here since that's closer to what > > the original heuristic allocation requested and is likely to avoid > > reallocations in most cases. > > > > The issue of kmalloc() failing for larger allocations on low speced > > hardware with fragmented memory might succeed when vmalloc() is used > > since it doesn't require contiguous memory chunks. But I guess the added > > pressure on the page table might still be a problem, nevertheless it's > > probably worth trying before bailing out. > > I'm not quarreling about using vmalloc() for allocations that are > high-order, I'm referring to the rather obvious fact that m->size is set > to PAGE_SIZE unconditionally above and thus vmalloc() isn't going to help > in the slightest. Yes, that doesn't make any sense. 
I put the patch together in a hurry and didn't think much about it. So below is what I think is the simplest conversion to a vmalloc fallback approach for seq files. However, the question remains whether this is an acceptable approach at all... --- fs/seq_file.c | 38 +++++++++++++++++++++++++++++--------- 1 file changed, 29 insertions(+), 9 deletions(-) diff --git a/fs/seq_file.c b/fs/seq_file.c index 1d641bb108d2..b710130c6d6b 100644 --- a/fs/seq_file.c +++ b/fs/seq_file.c @@ -8,8 +8,10 @@ #include <linux/fs.h> #include <linux/export.h> #include <linux/seq_file.h> +#include <linux/vmalloc.h> #include <linux/slab.h> #include <linux/cred.h> +#include <linux/mm.h> #include <asm/uaccess.h> #include <asm/page.h> @@ -30,6 +32,24 @@ static void seq_set_overflow(struct seq_file *m) m->count = m->size; } +static void *seq_alloc(unsigned long size) +{ + void *buf; + + buf = kmalloc(size, GFP_KERNEL | __GFP_NOWARN); + if (!buf && size > PAGE_SIZE) + buf = vmalloc(size); + return buf; +} + +static void seq_free(const void *buf) +{ + if (unlikely(is_vmalloc_addr(buf))) + vfree(buf); + else + kfree(buf); +} + /** * seq_open - initialize sequential file * @file: file we initialize @@ -96,7 +116,7 @@ static int traverse(struct seq_file *m, loff_t offset) return 0; } if (!m->buf) { - m->buf = kmalloc(m->size = PAGE_SIZE, GFP_KERNEL); + m->buf = seq_alloc(m->size = PAGE_SIZE); if (!m->buf) return -ENOMEM; } @@ -135,9 +155,9 @@ static int traverse(struct seq_file *m, loff_t offset) Eoverflow: m->op->stop(m, p); - kfree(m->buf); + seq_free(m->buf); m->count = 0; - m->buf = kmalloc(m->size <<= 1, GFP_KERNEL); + m->buf = seq_alloc(m->size <<= 1); return !m->buf ? 
-ENOMEM : -EAGAIN; } @@ -192,7 +212,7 @@ ssize_t seq_read(struct file *file, char __user *buf, size_t size, loff_t *ppos) /* grab buffer if we didn't have one */ if (!m->buf) { - m->buf = kmalloc(m->size = PAGE_SIZE, GFP_KERNEL); + m->buf = seq_alloc(m->size = PAGE_SIZE); if (!m->buf) goto Enomem; } @@ -232,9 +252,9 @@ ssize_t seq_read(struct file *file, char __user *buf, size_t size, loff_t *ppos) if (m->count < m->size) goto Fill; m->op->stop(m, p); - kfree(m->buf); + seq_free(m->buf); m->count = 0; - m->buf = kmalloc(m->size <<= 1, GFP_KERNEL); + m->buf = seq_alloc(m->size <<= 1); if (!m->buf) goto Enomem; m->version = 0; @@ -350,7 +370,7 @@ EXPORT_SYMBOL(seq_lseek); int seq_release(struct inode *inode, struct file *file) { struct seq_file *m = file->private_data; - kfree(m->buf); + seq_free(m->buf); kfree(m); return 0; } @@ -605,13 +625,13 @@ EXPORT_SYMBOL(single_open); int single_open_size(struct file *file, int (*show)(struct seq_file *, void *), void *data, size_t size) { - char *buf = kmalloc(size, GFP_KERNEL); + char *buf = seq_alloc(size); int ret; if (!buf) return -ENOMEM; ret = single_open(file, show, data); if (ret) { - kfree(buf); + seq_free(buf); return ret; } ((struct seq_file *)file->private_data)->buf = buf; -- 1.8.5.5 ^ permalink raw reply related [flat|nested] 21+ messages in thread
* Re: [PATCH 2/2] fs: proc/stat: use usual seq_file ops rather than single_open 2014-06-12 7:27 ` Heiko Carstens @ 2014-06-12 8:18 ` Heiko Carstens 2014-06-12 20:59 ` David Rientjes 1 sibling, 0 replies; 21+ messages in thread From: Heiko Carstens @ 2014-06-12 8:18 UTC (permalink / raw) To: David Rientjes, Ian Kent, Andrew Morton, Christoph Hellwig, KAMEZAWA Hiroyuki, Andrea Righi, Eric Dumazet, linux-kernel, linux-fsdevel, Hendrik Brueckner, Thorsten Diehl, Elliott, Robert (Server Storage), Al Viro On Thu, Jun 12, 2014 at 09:27:41AM +0200, Heiko Carstens wrote: > On Wed, Jun 11, 2014 at 11:52:31PM -0700, David Rientjes wrote: > > On Thu, 12 Jun 2014, Ian Kent wrote: > > > > > +static void seq_alloc(struct seq_file *m) > > > > > +{ > > > > > + m->size = PAGE_SIZE; > > > > > + m->buf = kmalloc(m->size, GFP_KERNEL | __GFP_NOWARN); > > > > > + if (!m->buf) > > > > > + m->buf = vmalloc(m->size); > > > > > +} > > > > > + > > > > > > > > If m->size is unconditionally PAGE_SIZE, then how is vmalloc() going to > > > > allocate this if kmalloc() fails? > > > > > > This is just the initial allocation. > > > If it runs out of room the allocation size doubles. > > > > > > I think 2*PAGE_SIZE is probably better here since that's closer to what > > > the original heuristic allocation requested and is likely to avoid > > > reallocations in most cases. > > > > > > The issue of kmalloc() failing for larger allocations on low speced > > > hardware with fragmented memory might succeed when vmalloc() is used > > > since it doesn't require contiguous memory chunks. But I guess the added > > > pressure on the page table might still be a problem, nevertheless it's > > > probably worth trying before bailing out. > > > > I'm not quarreling about using vmalloc() for allocations that are > > high-order, I'm referring to the rather obvious fact that m->size is set > > to PAGE_SIZE unconditionally above and thus vmalloc() isn't going to help > > in the slightest. 
> > Yes, that doesn't make any sense. I wrote the patch together in a hurry and > didn't think much about it. > So below is the what I think most simple conversion to a vmalloc fallback > approach for seq files. However the question remains if this seems to be an > acceptable approach at all... And so that the vmalloc fallback has any effect on the /proc/stat allocation, it also needs to be converted to use single_open_size() instead: From: Heiko Carstens <heiko.carstens@de.ibm.com> Date: Thu, 12 Jun 2014 10:04:39 +0200 Subject: [PATCH] proc/stat: convert to single_open_size() Use seq_file's single_open_size() to preallocate a buffer that is large enough to hold the whole output, instead of open coding it. Also calculate the requested size using the number of online cpus instead of possible cpus, since the size of the output only depends on the number of online cpus. Signed-off-by: Heiko Carstens <heiko.carstens@de.ibm.com> --- fs/proc/stat.c | 22 ++-------------------- 1 file changed, 2 insertions(+), 20 deletions(-) diff --git a/fs/proc/stat.c b/fs/proc/stat.c index 9d231e9e5f0e..bf2d03f8fd3e 100644 --- a/fs/proc/stat.c +++ b/fs/proc/stat.c @@ -184,29 +184,11 @@ static int show_stat(struct seq_file *p, void *v) static int stat_open(struct inode *inode, struct file *file) { - size_t size = 1024 + 128 * num_possible_cpus(); - char *buf; - struct seq_file *m; - int res; + size_t size = 1024 + 128 * num_online_cpus(); /* minimum size to display an interrupt count : 2 bytes */ size += 2 * nr_irqs; - - /* don't ask for more than the kmalloc() max size */ - if (size > KMALLOC_MAX_SIZE) - size = KMALLOC_MAX_SIZE; - buf = kmalloc(size, GFP_KERNEL); - if (!buf) - return -ENOMEM; - - res = single_open(file, show_stat, NULL); - if (!res) { - m = file->private_data; - m->buf = buf; - m->size = ksize(buf); - } else - kfree(buf); - return res; + return single_open_size(file, show_stat, NULL, size); } static const struct file_operations proc_stat_operations = { -- 1.8.5.5 ^ 
permalink raw reply related [flat|nested] 21+ messages in thread
* Re: [PATCH 2/2] fs: proc/stat: use usual seq_file ops rather than single_open 2014-06-12 7:27 ` Heiko Carstens 2014-06-12 8:18 ` Heiko Carstens @ 2014-06-12 20:59 ` David Rientjes 1 sibling, 0 replies; 21+ messages in thread From: David Rientjes @ 2014-06-12 20:59 UTC (permalink / raw) To: Heiko Carstens Cc: Ian Kent, Andrew Morton, Christoph Hellwig, KAMEZAWA Hiroyuki, Andrea Righi, Eric Dumazet, linux-kernel, linux-fsdevel, Hendrik Brueckner, Thorsten Diehl, Elliott, Robert (Server Storage), Al Viro On Thu, 12 Jun 2014, Heiko Carstens wrote: > Yes, that doesn't make any sense. I wrote the patch together in a hurry and > didn't think much about it. > So below is the what I think most simple conversion to a vmalloc fallback > approach for seq files. However the question remains if this seems to be an > acceptable approach at all... > I think the approach is fine. Note, though, that we now have a generic kvfree() function defined in mm.h that handles the freeing, dispatching on is_vmalloc_addr(), so seq_free() is no longer necessary. ^ permalink raw reply [flat|nested] 21+ messages in thread
* Re: [PATCH 2/2] fs: proc/stat: use usual seq_file ops rather than single_open 2014-06-12 6:52 ` David Rientjes 2014-06-12 7:27 ` Heiko Carstens @ 2014-06-12 11:09 ` Ian Kent 1 sibling, 0 replies; 21+ messages in thread From: Ian Kent @ 2014-06-12 11:09 UTC (permalink / raw) To: David Rientjes Cc: Heiko Carstens, Andrew Morton, Christoph Hellwig, KAMEZAWA Hiroyuki, Andrea Righi, Eric Dumazet, linux-kernel, linux-fsdevel, Hendrik Brueckner, Thorsten Diehl, Elliott, Robert (Server Storage), Al Viro On Wed, 2014-06-11 at 23:52 -0700, David Rientjes wrote: > On Thu, 12 Jun 2014, Ian Kent wrote: > > > > > diff --git a/fs/seq_file.c b/fs/seq_file.c > > > > index 1d641bb108d2..fca78a04c0d1 100644 > > > > --- a/fs/seq_file.c > > > > +++ b/fs/seq_file.c > > > > @@ -8,8 +8,10 @@ > > > > #include <linux/fs.h> > > > > #include <linux/export.h> > > > > #include <linux/seq_file.h> > > > > +#include <linux/vmalloc.h> > > > > #include <linux/slab.h> > > > > #include <linux/cred.h> > > > > +#include <linux/mm.h> > > > > > > > > #include <asm/uaccess.h> > > > > #include <asm/page.h> > > > > @@ -82,6 +84,31 @@ int seq_open(struct file *file, const struct seq_operations *op) > > > > } > > > > EXPORT_SYMBOL(seq_open); > > > > > > > > +static void seq_alloc(struct seq_file *m) > > > > +{ > > > > + m->size = PAGE_SIZE; > > > > + m->buf = kmalloc(m->size, GFP_KERNEL | __GFP_NOWARN); > > > > + if (!m->buf) > > > > + m->buf = vmalloc(m->size); > > > > +} > > > > + > > > > > > If m->size is unconditionally PAGE_SIZE, then how is vmalloc() going to > > > allocate this if kmalloc() fails? > > > > This is just the initial allocation. > > If it runs out of room the allocation size doubles. > > > > I think 2*PAGE_SIZE is probably better here since that's closer to what > > the original heuristic allocation requested and is likely to avoid > > reallocations in most cases. 
> > > > The issue of kmalloc() failing for larger allocations on low speced > > hardware with fragmented memory might succeed when vmalloc() is used > > since it doesn't require contiguous memory chunks. But I guess the added > > pressure on the page table might still be a problem, nevertheless it's > > probably worth trying before bailing out. > > > > I'm not quarreling about using vmalloc() for allocations that are > high-order, I'm referring to the rather obvious fact that m->size is set > to PAGE_SIZE unconditionally above and thus vmalloc() isn't going to help > in the slightest. LOL, yeah, if kmalloc() can't allocate a single page then we're in much bigger trouble! Ian ^ permalink raw reply [flat|nested] 21+ messages in thread
* Re: /proc/stat vs. failed order-4 allocation 2014-05-21 12:25 /proc/stat vs. failed order-4 allocation Heiko Carstens 2014-05-21 14:32 ` Christoph Hellwig @ 2014-05-22 11:29 ` Ian Kent 1 sibling, 0 replies; 21+ messages in thread From: Ian Kent @ 2014-05-22 11:29 UTC (permalink / raw) To: Heiko Carstens Cc: Andrew Morton, KAMEZAWA Hiroyuki, Andrea Righi, Eric Dumazet, linux-kernel, linux-fsdevel, Hendrik Brueckner, Thorsten Diehl On Wed, May 21, 2014 at 02:25:21PM +0200, Heiko Carstens wrote: > Hi all, > > I'm just wondering why /proc/stat is a single_open() seq_file and not a > regular seq_file with an iterator (say 48 online cpus for each iteration > or something similar). > > Of course, in theory, the "intr" line may be very long as well... > > With the current implementation everything must fit into a single buffer. > So if memory is highly fragmented we run into failing higher order > allocations (like below), which effectively means reading /proc/stat > doesn't work anymore. > > From stat_open: > > size_t size = 1024 + 128 * num_possible_cpus(); > [...] > /* minimum size to display an interrupt count : 2 bytes */ > size += 2 * nr_irqs; > [...] > buf = kmalloc(size, GFP_KERNEL); > if (!buf) > return -ENOMEM; > > With num_possible_cpus() = 256 we end up with an order 4 allocation. Apologies in advance, I think my comment is off-topic, but nevertheless .... The previous size calculation requested memory in multiples of page size. Won't the current size calculation result in memory fragmentation? Won't that lead much more rapidly to the page allocation failure below on low-memory systems, even with a small number of CPUs? > > So, would there be any objections, adding a cpu iterator to /proc/stat? 
> > 62129.701569] sadc: page allocation failure: order:4, mode:0x1040d0 > [62129.701573] CPU: 1 PID: 192063 Comm: sadc Not tainted 3.10.0-123.el7.s390x #1 > [62129.701574] 00000000edf27840 00000000edf27850 0000000000000002 0000000000000000 > 00000000edf278e0 00000000edf27858 00000000edf27858 00000000001120c0 > 0000000000000000 000000000072c7c0 0000000000711836 000000000000000b > 00000000edf278a0 00000000edf27840 0000000000000000 0000000000000000 > 00000000001040d0 00000000001120c0 00000000edf27840 00000000edf278a0 > [62129.701586] Call Trace: > [62129.701588] ([<0000000000111fbe>] show_trace+0xe6/0x130) > [62129.701591] [<0000000000112074>] show_stack+0x6c/0xe8 > [62129.701593] [<000000000020d356>] warn_alloc_failed+0xd6/0x138 > [62129.701596] [<00000000002114d2>] __alloc_pages_nodemask+0x9da/0xb68 > [62129.701598] [<000000000021168e>] __get_free_pages+0x2e/0x58 > [62129.701599] [<000000000025a05c>] kmalloc_order_trace+0x44/0xc0 > [62129.701602] [<00000000002f3ffa>] stat_open+0x5a/0xd8 > [62129.701604] [<00000000002e9aaa>] proc_reg_open+0x8a/0x140 > [62129.701606] [<0000000000273b64>] do_dentry_open+0x1bc/0x2c8 > [62129.701608] [<000000000027411e>] finish_open+0x46/0x60 > [62129.701610] [<000000000028675a>] do_last+0x382/0x10d0 > [62129.701612] [<0000000000287570>] path_openat+0xc8/0x4f8 > [62129.701614] [<0000000000288bde>] do_filp_open+0x46/0xa8 > [62129.701616] [<000000000027541c>] do_sys_open+0x114/0x1f0 > [62129.701618] [<00000000005b1c1c>] sysc_tracego+0x14/0x1a > [62129.701620] [<000003fffd0040a0>] 0x3fffd0040a0 > [62129.701624] Mem-Info: > [62129.701625] DMA per-cpu: > [62129.701627] CPU 0: hi: 186, btch: 31 usd: 0 > [62129.701628] CPU 1: hi: 186, btch: 31 usd: 0 > [62129.701630] CPU 2: hi: 186, btch: 31 usd: 51 > [62129.701631] Normal per-cpu: > [62129.701632] CPU 0: hi: 186, btch: 31 usd: 30 > [62129.701634] CPU 1: hi: 186, btch: 31 usd: 0 > [62129.701635] CPU 2: hi: 186, btch: 31 usd: 0 > [62129.701639] active_anon:5416 inactive_anon:5571 isolated_anon:0 
> active_file:440513 inactive_file:406221 isolated_file:27 > unevictable:1741 dirty:35305 writeback:0 unstable:0 > free:40319 slab_reclaimable:41921 slab_unreclaimable:34553 > mapped:3921 shmem:1351 pagetables:296 bounce:0 > free_cma:0 > [62129.701648] DMA free:25192kB min:11800kB low:14748kB high:17700kB active_anon:11032kB inactive_anon:11320kB active_file:1002092kB inactive_file:904260kB unevictable:3772kB isolated(anon):0kB isolated(file):4kB present:2097152kB managed:2070452kB mlocked:3772kB dirty:55072kB writeback:0kB mapped:6316kB shmem:1152kB slab_reclaimable:61192kB slab_unreclaimable:50108kB kernel_stack:2368kB pagetables:532kB unstable:0kB bounce:0kB free_cma:0kB writeback_tmp:0kB pages_scanned:133 all_unreclaimable? no > [62129.701652] lowmem_reserve[]: 0 1837 1837 > [62129.701658] Normal free:136084kB min:10724kB low:13404kB high:16084kB active_anon:10632kB inactive_anon:10964kB active_file:759960kB inactive_file:720624kB unevictable:3192kB isolated(anon):0kB isolated(file):4kB present:1966080kB managed:1881776kB mlocked:3192kB dirty:86148kB writeback:0kB mapped:9368kB shmem:4252kB slab_reclaimable:106492kB slab_unreclaimable:88104kB kernel_stack:5808kB pagetables:652kB unstable:0kB bounce:0kB free_cma:0kB writeback_tmp:0kB pages_scanned:0 all_unreclaimable? 
no > [62129.701661] lowmem_reserve[]: 0 0 0 > [62129.701664] DMA: 1540*4kB (UEM) 2217*8kB (UEM) 9*16kB (UEM) 1*32kB (R) 5*64kB (R) 1*128kB (R) 1*256kB (R) 1*512kB (R) 0*1024kB = 25288kB > [62129.701673] Normal: 21631*4kB (UEM) 5755*8kB (UEM) 145*16kB (UEM) 8*32kB (ER) 4*64kB (R) 2*128kB (R) 0*256kB 1*512kB (R) 0*1024kB = 136164kB > [62129.701682] Node 0 hugepages_total=0 hugepages_free=0 hugepages_surp=0 hugepages_size=1024kB > [62129.701684] 849331 total pagecache pages > [62129.701685] 131 pages in swap cache > [62129.701687] Swap cache stats: add 9956, delete 9825, find 1049/1416 > [62129.701688] Free swap = 7186784kB > [62129.701689] Total swap = 7212140kB > [62129.710679] 1015808 pages RAM > [62129.710681] 23437 pages reserved > [62129.710682] 1360146 pages shared > [62129.710683] 384507 pages non-shared > > -- > To unsubscribe from this list: send the line "unsubscribe linux-kernel" in > the body of a message to majordomo@vger.kernel.org > More majordomo info at http://vger.kernel.org/majordomo-info.html > Please read the FAQ at http://www.tux.org/lkml/ ^ permalink raw reply [flat|nested] 21+ messages in thread