* /proc/stat vs. failed order-4 allocation
@ 2014-05-21 12:25 Heiko Carstens
  2014-05-21 14:32 ` Christoph Hellwig
  2014-05-22 11:29 ` /proc/stat vs. failed order-4 allocation Ian Kent
  0 siblings, 2 replies; 21+ messages in thread
From: Heiko Carstens @ 2014-05-21 12:25 UTC (permalink / raw)
  To: Andrew Morton, KAMEZAWA Hiroyuki, Andrea Righi, Eric Dumazet
  Cc: linux-kernel, linux-fsdevel, Hendrik Brueckner, Thorsten Diehl

Hi all,

I'm just wondering why /proc/stat is a single_open() seq_file and not a
regular seq_file with an iterator (say 48 online cpus for each iteration
or something similar).
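
For reference, this is the shape of the seq_file iterator interface such
a conversion would provide (just a skeleton; the handler names here are
placeholders, not existing code):

        static const struct seq_operations stat_ops = {
                .start = stat_start,    /* position the iterator at *pos  */
                .next  = stat_next,     /* advance, e.g. by one cpu batch */
                .stop  = stat_stop,     /* clean up after a read cycle    */
                .show  = stat_show,     /* format one chunk of output     */
        };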

Of course, in theory, the "intr" line may be very long as well...

With the current implementation everything must fit into a single buffer.
So if memory is highly fragmented we run into failing higher order
allocations (like below), which effectively means reading /proc/stat
doesn't work anymore.

From stat_open:

        size_t size = 1024 + 128 * num_possible_cpus();
	[...]
        /* minimum size to display an interrupt count : 2 bytes */
        size += 2 * nr_irqs;
	[...]
        buf = kmalloc(size, GFP_KERNEL);
        if (!buf)
                return -ENOMEM;

With num_possible_cpus() = 256 we end up with an order 4 allocation.
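
For the record, the arithmetic (assuming 4k pages and power-of-two
kmalloc slab sizes; the nr_irqs value of 512 is just an example):

        size_t size = 1024 + 128 * 256; /* 33792 bytes for the cpu lines */
        size += 2 * 512;                /* intr estimate -> 34816 bytes  */
        /*
         * kmalloc() rounds this up to the next power-of-two slab size,
         * 65536 bytes, i.e. 16 contiguous pages: an order-4 allocation.
         */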

So, would there be any objections to adding a cpu iterator to /proc/stat?

[62129.701569] sadc: page allocation failure: order:4, mode:0x1040d0
[62129.701573] CPU: 1 PID: 192063 Comm: sadc Not tainted 3.10.0-123.el7.s390x #1
[62129.701574] 00000000edf27840 00000000edf27850 0000000000000002 0000000000000000 
00000000edf278e0 00000000edf27858 00000000edf27858 00000000001120c0 
0000000000000000 000000000072c7c0 0000000000711836 000000000000000b 
00000000edf278a0 00000000edf27840 0000000000000000 0000000000000000 
00000000001040d0 00000000001120c0 00000000edf27840 00000000edf278a0 
[62129.701586] Call Trace:
[62129.701588] ([<0000000000111fbe>] show_trace+0xe6/0x130)
[62129.701591] [<0000000000112074>] show_stack+0x6c/0xe8
[62129.701593] [<000000000020d356>] warn_alloc_failed+0xd6/0x138
[62129.701596] [<00000000002114d2>] __alloc_pages_nodemask+0x9da/0xb68
[62129.701598] [<000000000021168e>] __get_free_pages+0x2e/0x58
[62129.701599] [<000000000025a05c>] kmalloc_order_trace+0x44/0xc0
[62129.701602] [<00000000002f3ffa>] stat_open+0x5a/0xd8
[62129.701604] [<00000000002e9aaa>] proc_reg_open+0x8a/0x140
[62129.701606] [<0000000000273b64>] do_dentry_open+0x1bc/0x2c8
[62129.701608] [<000000000027411e>] finish_open+0x46/0x60
[62129.701610] [<000000000028675a>] do_last+0x382/0x10d0
[62129.701612] [<0000000000287570>] path_openat+0xc8/0x4f8
[62129.701614] [<0000000000288bde>] do_filp_open+0x46/0xa8
[62129.701616] [<000000000027541c>] do_sys_open+0x114/0x1f0
[62129.701618] [<00000000005b1c1c>] sysc_tracego+0x14/0x1a
[62129.701620] [<000003fffd0040a0>] 0x3fffd0040a0
[62129.701624] Mem-Info:
[62129.701625] DMA per-cpu:
[62129.701627] CPU 0: hi: 186, btch: 31 usd: 0
[62129.701628] CPU 1: hi: 186, btch: 31 usd: 0
[62129.701630] CPU 2: hi: 186, btch: 31 usd: 51
[62129.701631] Normal per-cpu:
[62129.701632] CPU 0: hi: 186, btch: 31 usd: 30
[62129.701634] CPU 1: hi: 186, btch: 31 usd: 0
[62129.701635] CPU 2: hi: 186, btch: 31 usd: 0
[62129.701639] active_anon:5416 inactive_anon:5571 isolated_anon:0
active_file:440513 inactive_file:406221 isolated_file:27
unevictable:1741 dirty:35305 writeback:0 unstable:0
free:40319 slab_reclaimable:41921 slab_unreclaimable:34553
mapped:3921 shmem:1351 pagetables:296 bounce:0
free_cma:0
[62129.701648] DMA free:25192kB min:11800kB low:14748kB high:17700kB active_anon:11032kB inactive_anon:11320kB active_file:1002092kB inactive_file:904260kB unevictable:3772kB isolated(anon):0kB isolated(file):4kB present:2097152kB managed:2070452kB mlocked:3772kB dirty:55072kB writeback:0kB mapped:6316kB shmem:1152kB slab_reclaimable:61192kB slab_unreclaimable:50108kB kernel_stack:2368kB pagetables:532kB unstable:0kB bounce:0kB free_cma:0kB writeback_tmp:0kB pages_scanned:133 all_unreclaimable? no
[62129.701652] lowmem_reserve[]: 0 1837 1837
[62129.701658] Normal free:136084kB min:10724kB low:13404kB high:16084kB active_anon:10632kB inactive_anon:10964kB active_file:759960kB inactive_file:720624kB unevictable:3192kB isolated(anon):0kB isolated(file):4kB present:1966080kB managed:1881776kB mlocked:3192kB dirty:86148kB writeback:0kB mapped:9368kB shmem:4252kB slab_reclaimable:106492kB slab_unreclaimable:88104kB kernel_stack:5808kB pagetables:652kB unstable:0kB bounce:0kB free_cma:0kB writeback_tmp:0kB pages_scanned:0 all_unreclaimable? no
[62129.701661] lowmem_reserve[]: 0 0 0
[62129.701664] DMA: 1540*4kB (UEM) 2217*8kB (UEM) 9*16kB (UEM) 1*32kB (R) 5*64kB (R) 1*128kB (R) 1*256kB (R) 1*512kB (R) 0*1024kB = 25288kB
[62129.701673] Normal: 21631*4kB (UEM) 5755*8kB (UEM) 145*16kB (UEM) 8*32kB (ER) 4*64kB (R) 2*128kB (R) 0*256kB 1*512kB (R) 0*1024kB = 136164kB
[62129.701682] Node 0 hugepages_total=0 hugepages_free=0 hugepages_surp=0 hugepages_size=1024kB
[62129.701684] 849331 total pagecache pages
[62129.701685] 131 pages in swap cache
[62129.701687] Swap cache stats: add 9956, delete 9825, find 1049/1416
[62129.701688] Free swap = 7186784kB
[62129.701689] Total swap = 7212140kB
[62129.710679] 1015808 pages RAM
[62129.710681] 23437 pages reserved
[62129.710682] 1360146 pages shared
[62129.710683] 384507 pages non-shared


^ permalink raw reply	[flat|nested] 21+ messages in thread

* Re: /proc/stat vs. failed order-4 allocation
  2014-05-21 12:25 /proc/stat vs. failed order-4 allocation Heiko Carstens
@ 2014-05-21 14:32 ` Christoph Hellwig
  2014-05-22  3:05   ` Elliott, Robert (Server Storage)
  2014-05-28  8:58   ` Heiko Carstens
  2014-05-22 11:29 ` /proc/stat vs. failed order-4 allocation Ian Kent
  1 sibling, 2 replies; 21+ messages in thread
From: Christoph Hellwig @ 2014-05-21 14:32 UTC (permalink / raw)
  To: Heiko Carstens
  Cc: Andrew Morton, KAMEZAWA Hiroyuki, Andrea Righi, Eric Dumazet,
	linux-kernel, linux-fsdevel, Hendrik Brueckner, Thorsten Diehl

On Wed, May 21, 2014 at 02:25:21PM +0200, Heiko Carstens wrote:
> Hi all,
> 
> I'm just wondering why /proc/stat is a single_open() seq_file and not a
> regular seq_file with an iterator (say 48 online cpus for each iteration
> or something similar).

Probably because no one sent a patch for it. I'm pretty sure it used the
even more horrible old proc ops before and was converted in batch with
various other files.

^ permalink raw reply	[flat|nested] 21+ messages in thread

* RE: /proc/stat vs. failed order-4 allocation
  2014-05-21 14:32 ` Christoph Hellwig
@ 2014-05-22  3:05   ` Elliott, Robert (Server Storage)
  2014-05-28  8:58   ` Heiko Carstens
  1 sibling, 0 replies; 21+ messages in thread
From: Elliott, Robert (Server Storage) @ 2014-05-22  3:05 UTC (permalink / raw)
  To: Christoph Hellwig, Heiko Carstens
  Cc: Andrew Morton, KAMEZAWA Hiroyuki, Andrea Righi, Eric Dumazet,
	linux-kernel, linux-fsdevel, Hendrik Brueckner, Thorsten Diehl



> -----Original Message-----
> From: linux-kernel-owner@vger.kernel.org [mailto:linux-kernel-
> owner@vger.kernel.org] On Behalf Of Christoph Hellwig
> Sent: Wednesday, May 21, 2014 9:32 AM
> To: Heiko Carstens
> Cc: Andrew Morton; KAMEZAWA Hiroyuki; Andrea Righi; Eric Dumazet; linux-
> kernel@vger.kernel.org; linux-fsdevel@vger.kernel.org; Hendrik Brueckner;
> Thorsten Diehl
> Subject: Re: /proc/stat vs. failed order-4 allocation
> 
> On Wed, May 21, 2014 at 02:25:21PM +0200, Heiko Carstens wrote:
> > Hi all,
> >
> > I'm just wondering why /proc/stat is a single_open() seq_file and not a
> > regular seq_file with an iterator (say 48 online cpus for each iteration
> > or something similar).
> 
> Probably because no one sent a patch for it. I'm pretty sure it used the
> even more horrible old proc ops before and was converted in batch with
> various other files.

https://lkml.org/lkml/2012/1/20/153 worried about performance and led to 
the current code; the reply in https://lkml.org/lkml/2012/1/23/41 
discussed using seq_file, but that idea was rejected at the time.

Be careful about losing consistency of the information for the CPUs.


^ permalink raw reply	[flat|nested] 21+ messages in thread

* Re: /proc/stat vs. failed order-4 allocation
  2014-05-21 12:25 /proc/stat vs. failed order-4 allocation Heiko Carstens
  2014-05-21 14:32 ` Christoph Hellwig
@ 2014-05-22 11:29 ` Ian Kent
  1 sibling, 0 replies; 21+ messages in thread
From: Ian Kent @ 2014-05-22 11:29 UTC (permalink / raw)
  To: Heiko Carstens
  Cc: Andrew Morton, KAMEZAWA Hiroyuki, Andrea Righi, Eric Dumazet,
	linux-kernel, linux-fsdevel, Hendrik Brueckner, Thorsten Diehl

On Wed, May 21, 2014 at 02:25:21PM +0200, Heiko Carstens wrote:
> Hi all,
> 
> I'm just wondering why /proc/stat is a single_open() seq_file and not a
> regular seq_file with an iterator (say 48 online cpus for each iteration
> or something similar).
> 
> Of course, in theory, the "intr" line may be very long as well...
> 
> With the current implementation everything must fit into a single buffer.
> So if memory is highly fragmented we run into failing higher order
> allocations (like below), which effectively means reading /proc/stat
> doesn't work anymore.
> 
> From stat_open:
> 
>         size_t size = 1024 + 128 * num_possible_cpus();
> 	[...]
>         /* minimum size to display an interrupt count : 2 bytes */
>         size += 2 * nr_irqs;
> 	[...]
>         buf = kmalloc(size, GFP_KERNEL);
>         if (!buf)
>                 return -ENOMEM;
> 
> With num_possible_cpus() = 256 we end up with an order 4 allocation.

Apologies in advance, I think my comment is off-topic, but nevertheless ...

The previous size calculation requested memory in multiples of page
size.

Won't the current size calculation result in memory fragmentation?
Won't that lead much more rapidly to the page allocation failure
below on low memory systems even with a small number of CPUs?

> 
> So, would there be any objections to adding a cpu iterator to /proc/stat?
> 
> [62129.701569] sadc: page allocation failure: order:4, mode:0x1040d0
> [62129.701573] CPU: 1 PID: 192063 Comm: sadc Not tainted 3.10.0-123.el7.s390x #1
> [62129.701574] 00000000edf27840 00000000edf27850 0000000000000002 0000000000000000 
> 00000000edf278e0 00000000edf27858 00000000edf27858 00000000001120c0 
> 0000000000000000 000000000072c7c0 0000000000711836 000000000000000b 
> 00000000edf278a0 00000000edf27840 0000000000000000 0000000000000000 
> 00000000001040d0 00000000001120c0 00000000edf27840 00000000edf278a0 
> [62129.701586] Call Trace:
> [62129.701588] ([<0000000000111fbe>] show_trace+0xe6/0x130)
> [62129.701591] [<0000000000112074>] show_stack+0x6c/0xe8
> [62129.701593] [<000000000020d356>] warn_alloc_failed+0xd6/0x138
> [62129.701596] [<00000000002114d2>] __alloc_pages_nodemask+0x9da/0xb68
> [62129.701598] [<000000000021168e>] __get_free_pages+0x2e/0x58
> [62129.701599] [<000000000025a05c>] kmalloc_order_trace+0x44/0xc0
> [62129.701602] [<00000000002f3ffa>] stat_open+0x5a/0xd8
> [62129.701604] [<00000000002e9aaa>] proc_reg_open+0x8a/0x140
> [62129.701606] [<0000000000273b64>] do_dentry_open+0x1bc/0x2c8
> [62129.701608] [<000000000027411e>] finish_open+0x46/0x60
> [62129.701610] [<000000000028675a>] do_last+0x382/0x10d0
> [62129.701612] [<0000000000287570>] path_openat+0xc8/0x4f8
> [62129.701614] [<0000000000288bde>] do_filp_open+0x46/0xa8
> [62129.701616] [<000000000027541c>] do_sys_open+0x114/0x1f0
> [62129.701618] [<00000000005b1c1c>] sysc_tracego+0x14/0x1a
> [62129.701620] [<000003fffd0040a0>] 0x3fffd0040a0
> [62129.701624] Mem-Info:
> [62129.701625] DMA per-cpu:
> [62129.701627] CPU 0: hi: 186, btch: 31 usd: 0
> [62129.701628] CPU 1: hi: 186, btch: 31 usd: 0
> [62129.701630] CPU 2: hi: 186, btch: 31 usd: 51
> [62129.701631] Normal per-cpu:
> [62129.701632] CPU 0: hi: 186, btch: 31 usd: 30
> [62129.701634] CPU 1: hi: 186, btch: 31 usd: 0
> [62129.701635] CPU 2: hi: 186, btch: 31 usd: 0
> [62129.701639] active_anon:5416 inactive_anon:5571 isolated_anon:0
> active_file:440513 inactive_file:406221 isolated_file:27
> unevictable:1741 dirty:35305 writeback:0 unstable:0
> free:40319 slab_reclaimable:41921 slab_unreclaimable:34553
> mapped:3921 shmem:1351 pagetables:296 bounce:0
> free_cma:0
> [62129.701648] DMA free:25192kB min:11800kB low:14748kB high:17700kB active_anon:11032kB inactive_anon:11320kB active_file:1002092kB inactive_file:904260kB unevictable:3772kB isolated(anon):0kB isolated(file):4kB present:2097152kB managed:2070452kB mlocked:3772kB dirty:55072kB writeback:0kB mapped:6316kB shmem:1152kB slab_reclaimable:61192kB slab_unreclaimable:50108kB kernel_stack:2368kB pagetables:532kB unstable:0kB bounce:0kB free_cma:0kB writeback_tmp:0kB pages_scanned:133 all_unreclaimable? no
> [62129.701652] lowmem_reserve[]: 0 1837 1837
> [62129.701658] Normal free:136084kB min:10724kB low:13404kB high:16084kB active_anon:10632kB inactive_anon:10964kB active_file:759960kB inactive_file:720624kB unevictable:3192kB isolated(anon):0kB isolated(file):4kB present:1966080kB managed:1881776kB mlocked:3192kB dirty:86148kB writeback:0kB mapped:9368kB shmem:4252kB slab_reclaimable:106492kB slab_unreclaimable:88104kB kernel_stack:5808kB pagetables:652kB unstable:0kB bounce:0kB free_cma:0kB writeback_tmp:0kB pages_scanned:0 all_unreclaimable? no
> [62129.701661] lowmem_reserve[]: 0 0 0
> [62129.701664] DMA: 1540*4kB (UEM) 2217*8kB (UEM) 9*16kB (UEM) 1*32kB (R) 5*64kB (R) 1*128kB (R) 1*256kB (R) 1*512kB (R) 0*1024kB = 25288kB
> [62129.701673] Normal: 21631*4kB (UEM) 5755*8kB (UEM) 145*16kB (UEM) 8*32kB (ER) 4*64kB (R) 2*128kB (R) 0*256kB 1*512kB (R) 0*1024kB = 136164kB
> [62129.701682] Node 0 hugepages_total=0 hugepages_free=0 hugepages_surp=0 hugepages_size=1024kB
> [62129.701684] 849331 total pagecache pages
> [62129.701685] 131 pages in swap cache
> [62129.701687] Swap cache stats: add 9956, delete 9825, find 1049/1416
> [62129.701688] Free swap = 7186784kB
> [62129.701689] Total swap = 7212140kB
> [62129.710679] 1015808 pages RAM
> [62129.710681] 23437 pages reserved
> [62129.710682] 1360146 pages shared
> [62129.710683] 384507 pages non-shared
> 

^ permalink raw reply	[flat|nested] 21+ messages in thread

* Re: /proc/stat vs. failed order-4 allocation
  2014-05-21 14:32 ` Christoph Hellwig
  2014-05-22  3:05   ` Elliott, Robert (Server Storage)
@ 2014-05-28  8:58   ` Heiko Carstens
  2014-05-28  8:59     ` [PATCH 1/2] fs: proc/stat: use num_online_cpus() for buffer size Heiko Carstens
  2014-05-28  9:01     ` [PATCH 2/2] fs: proc/stat: use usual seq_file ops rather than single_open Heiko Carstens
  1 sibling, 2 replies; 21+ messages in thread
From: Heiko Carstens @ 2014-05-28  8:58 UTC (permalink / raw)
  To: Christoph Hellwig, Andrew Morton
  Cc: Andrew Morton, KAMEZAWA Hiroyuki, Andrea Righi, Eric Dumazet,
	linux-kernel, linux-fsdevel, Hendrik Brueckner, Thorsten Diehl,
	Ian Kent, Elliott, Robert (Server Storage)

On Wed, May 21, 2014 at 07:32:29AM -0700, Christoph Hellwig wrote:
> On Wed, May 21, 2014 at 02:25:21PM +0200, Heiko Carstens wrote:
> > Hi all,
> > 
> > I'm just wondering why /proc/stat is a single_open() seq_file and not a
> > regular seq_file with an iterator (say 48 online cpus for each iteration
> > or something similar).
> 
> Probably because no one sent a patch for it. I'm pretty sure it used the
> even more horrible old proc ops before and was converted in batch with
> various other files.

Ok, so how about the two patches sent in reply to this mail?

(btw. if nobody objects to the modified patch from KAMEZAWA Hiroyuki the
first patch could be dropped and/or folded into the second patch)

Performance-wise there doesn't seem to be too much of a difference;
however, all measurements have been done on a 64 cpu 2nd level guest.
It _looks_ like the new code is < 3% slower.


^ permalink raw reply	[flat|nested] 21+ messages in thread

* [PATCH 1/2] fs: proc/stat: use num_online_cpus() for buffer size
  2014-05-28  8:58   ` Heiko Carstens
@ 2014-05-28  8:59     ` Heiko Carstens
  2014-05-28 11:06       ` Ian Kent
  2014-05-28  9:01     ` [PATCH 2/2] fs: proc/stat: use usual seq_file ops rather than single_open Heiko Carstens
  1 sibling, 1 reply; 21+ messages in thread
From: Heiko Carstens @ 2014-05-28  8:59 UTC (permalink / raw)
  To: Christoph Hellwig, Andrew Morton, KAMEZAWA Hiroyuki,
	Andrea Righi, Eric Dumazet, linux-kernel, linux-fsdevel,
	Hendrik Brueckner, Thorsten Diehl, Ian Kent, Elliott,
	Robert (Server Storage)

The number of bytes contained 'within' /proc/stat depends on the number
of online cpus and not on the number of possible cpus.

This reduces the number of bytes requested for the initial buffer allocation
within stat_open(), which is usually way too high and, for num_possible_cpus()
== 256 cpus, would result in an order-4 allocation.

Order-4 allocations, however, may fail if memory is fragmented and we would
end up with an unreadable /proc/stat file:

[62129.701569] sadc: page allocation failure: order:4, mode:0x1040d0
[62129.701573] CPU: 1 PID: 192063 Comm: sadc Not tainted 3.10.0-123.el7.s390x #1
[...]
[62129.701586] Call Trace:
[62129.701588] ([<0000000000111fbe>] show_trace+0xe6/0x130)
[62129.701591] [<0000000000112074>] show_stack+0x6c/0xe8
[62129.701593] [<000000000020d356>] warn_alloc_failed+0xd6/0x138
[62129.701596] [<00000000002114d2>] __alloc_pages_nodemask+0x9da/0xb68
[62129.701598] [<000000000021168e>] __get_free_pages+0x2e/0x58
[62129.701599] [<000000000025a05c>] kmalloc_order_trace+0x44/0xc0
[62129.701602] [<00000000002f3ffa>] stat_open+0x5a/0xd8
[62129.701604] [<00000000002e9aaa>] proc_reg_open+0x8a/0x140
[62129.701606] [<0000000000273b64>] do_dentry_open+0x1bc/0x2c8
[62129.701608] [<000000000027411e>] finish_open+0x46/0x60
[62129.701610] [<000000000028675a>] do_last+0x382/0x10d0
[62129.701612] [<0000000000287570>] path_openat+0xc8/0x4f8
[62129.701614] [<0000000000288bde>] do_filp_open+0x46/0xa8
[62129.701616] [<000000000027541c>] do_sys_open+0x114/0x1f0
[62129.701618] [<00000000005b1c1c>] sysc_tracego+0x14/0x1a

Signed-off-by: Heiko Carstens <heiko.carstens@de.ibm.com>
---
 fs/proc/stat.c | 2 +-
 1 file changed, 1 insertion(+), 1 deletion(-)

diff --git a/fs/proc/stat.c b/fs/proc/stat.c
index 9d231e9e5f0e..3898ca5f1e92 100644
--- a/fs/proc/stat.c
+++ b/fs/proc/stat.c
@@ -184,7 +184,7 @@ static int show_stat(struct seq_file *p, void *v)
 
 static int stat_open(struct inode *inode, struct file *file)
 {
-	size_t size = 1024 + 128 * num_possible_cpus();
+	size_t size = 1024 + 128 * num_online_cpus();
 	char *buf;
 	struct seq_file *m;
 	int res;
-- 
1.8.5.5


^ permalink raw reply related	[flat|nested] 21+ messages in thread

* [PATCH 2/2] fs: proc/stat: use usual seq_file ops rather than single_open
  2014-05-28  8:58   ` Heiko Carstens
  2014-05-28  8:59     ` [PATCH 1/2] fs: proc/stat: use num_online_cpus() for buffer size Heiko Carstens
@ 2014-05-28  9:01     ` Heiko Carstens
  2014-05-28 22:37       ` Andrew Morton
  1 sibling, 1 reply; 21+ messages in thread
From: Heiko Carstens @ 2014-05-28  9:01 UTC (permalink / raw)
  To: Christoph Hellwig, Andrew Morton, KAMEZAWA Hiroyuki,
	Andrea Righi, Eric Dumazet, linux-kernel, linux-fsdevel,
	Hendrik Brueckner, Thorsten Diehl, Ian Kent, Elliott,
	Robert (Server Storage)

From: KAMEZAWA Hiroyuki <kamezawa.hiroyu@jp.fujitsu.com>

Now, /proc/stat uses single_open() for showing information. This means
all the data will be gathered and buffered once into a (big) buffer.

/proc/stat shows stats per cpu and stats per IRQ. To get the information
in one shot, it allocates a big buffer (up to KMALLOC_MAX_SIZE).

Eric Dumazet reported that the bufsize calculation doesn't take
the number of IRQs into account, so the information cannot be
read in one shot. (Because of this, seq_read() will allocate the buffer
again and read the whole data again...)

This patch changes /proc/stat to use seq_open() rather than single_open()
and provides ->start(), ->next(), ->stop() and ->show().

With this, /proc/stat no longer needs to take care of the buffer size.

[heiko.carstens@de.ibm.com]: This is the forward port of a patch
from KAMEZAWA Hiroyuki (https://lkml.org/lkml/2012/1/23/41).
I added a couple of simple changes, e.g. the cpu iterator
handles 32 cpus in a batch to avoid lots of iterations.

With this patch, reading /proc/stat should no longer fail because of a
failing high order memory allocation.

Signed-off-by: KAMEZAWA Hiroyuki <kamezawa.hiroyu@jp.fujitsu.com>
Signed-off-by: Heiko Carstens <heiko.carstens@de.ibm.com>
---
 fs/proc/stat.c | 278 +++++++++++++++++++++++++++++++++++++++++----------------
 1 file changed, 203 insertions(+), 75 deletions(-)

diff --git a/fs/proc/stat.c b/fs/proc/stat.c
index 3898ca5f1e92..652e255fee90 100644
--- a/fs/proc/stat.c
+++ b/fs/proc/stat.c
@@ -77,22 +77,109 @@ static u64 get_iowait_time(int cpu)
 
 #endif
 
-static int show_stat(struct seq_file *p, void *v)
+enum proc_stat_stage /* The numbers are used as *pos and iter->stage */
+{
+	SHOW_TOTAL_CPU_STAT,
+	SHOW_PERCPU_STAT,
+	SHOW_TOTAL_IRQS,
+	SHOW_IRQ_DETAILS,
+	SHOW_TIMES,
+	SHOW_TOTAL_SOFTIRQ,
+	SHOW_SOFTIRQ_DETAILS,
+	SHOW_EOL,
+	END_STATS,
+};
+
+/*
+ * To reduce the number of ->next(), ->show() calls IRQ numbers are
+ * handled in batch.
+ */
+struct seq_stat_iter {
+	int stage;
+	unsigned long jiffies;
+	int cpu_iter;
+	int irq_iter;
+	int softirq_iter;
+	/* cached information */
+	u64 irq_sum;
+	u64 softirq_sum;
+	u32 per_softirq_sums[NR_SOFTIRQS];
+};
+
+static void *proc_stat_start(struct seq_file *p, loff_t *pos)
+{
+	struct seq_stat_iter *iter = p->private;
+
+	/* At lseek(), *pos==0 is passed (see traverse() in seq_file.c). */
+	if (!*pos) {
+		struct timespec boottime;
+
+		memset(iter, 0, sizeof(*iter));
+		iter->stage = SHOW_TOTAL_CPU_STAT;
+		getboottime(&boottime);
+		iter->jiffies = boottime.tv_sec;
+	}
+	if (iter->stage == END_STATS)
+		return NULL;
+	return iter;
+}
+
+static void proc_stat_stop(struct seq_file *p, void *v)
+{
+}
+
+static void *proc_stat_next(struct seq_file *p, void *v, loff_t *pos)
+{
+	struct seq_stat_iter *iter = p->private;
+	int index;
+
+	switch (iter->stage) {
+	case SHOW_TOTAL_CPU_STAT:
+		iter->stage = SHOW_PERCPU_STAT;
+		iter->cpu_iter = cpumask_first(cpu_online_mask);
+		break;
+	case SHOW_PERCPU_STAT:
+		index = cpumask_next(iter->cpu_iter, cpu_online_mask);
+		if (index >= nr_cpu_ids)
+			iter->stage = SHOW_TOTAL_IRQS;
+		else
+			iter->cpu_iter = index;
+		break;
+	case SHOW_TOTAL_IRQS:
+		iter->stage = SHOW_IRQ_DETAILS;
+		iter->irq_iter = 0;
+		break;
+	case SHOW_IRQ_DETAILS:
+		if (iter->irq_iter >= nr_irqs)
+			iter->stage = SHOW_TIMES;
+		break;
+	case SHOW_TIMES:
+		iter->stage = SHOW_TOTAL_SOFTIRQ;
+		break;
+	case SHOW_TOTAL_SOFTIRQ:
+		iter->stage = SHOW_SOFTIRQ_DETAILS;
+		break;
+	case SHOW_SOFTIRQ_DETAILS:
+		iter->stage = SHOW_EOL;
+		break;
+	case SHOW_EOL:
+		iter->stage = END_STATS;
+		return NULL;
+	default:
+		break;
+	}
+	return iter;
+}
+
+static int show_total_cpu_stat(struct seq_file *p, struct seq_stat_iter *iter)
 {
-	int i, j;
-	unsigned long jif;
 	u64 user, nice, system, idle, iowait, irq, softirq, steal;
 	u64 guest, guest_nice;
-	u64 sum = 0;
-	u64 sum_softirq = 0;
-	unsigned int per_softirq_sums[NR_SOFTIRQS] = {0};
-	struct timespec boottime;
+	int i, j;
 
-	user = nice = system = idle = iowait =
-		irq = softirq = steal = 0;
+	user = nice = system = idle = iowait = 0;
+	irq = softirq = steal = 0;
 	guest = guest_nice = 0;
-	getboottime(&boottime);
-	jif = boottime.tv_sec;
 
 	for_each_possible_cpu(i) {
 		user += kcpustat_cpu(i).cpustat[CPUTIME_USER];
@@ -105,17 +192,17 @@ static int show_stat(struct seq_file *p, void *v)
 		steal += kcpustat_cpu(i).cpustat[CPUTIME_STEAL];
 		guest += kcpustat_cpu(i).cpustat[CPUTIME_GUEST];
 		guest_nice += kcpustat_cpu(i).cpustat[CPUTIME_GUEST_NICE];
-		sum += kstat_cpu_irqs_sum(i);
-		sum += arch_irq_stat_cpu(i);
+		iter->irq_sum += kstat_cpu_irqs_sum(i);
+		iter->irq_sum += arch_irq_stat_cpu(i);
 
 		for (j = 0; j < NR_SOFTIRQS; j++) {
 			unsigned int softirq_stat = kstat_softirqs_cpu(j, i);
 
-			per_softirq_sums[j] += softirq_stat;
-			sum_softirq += softirq_stat;
+			iter->per_softirq_sums[j] += softirq_stat;
+			iter->softirq_sum += softirq_stat;
 		}
 	}
-	sum += arch_irq_stat();
+	iter->irq_sum += arch_irq_stat();
 
 	seq_puts(p, "cpu ");
 	seq_put_decimal_ull(p, ' ', cputime64_to_clock_t(user));
@@ -129,20 +216,31 @@ static int show_stat(struct seq_file *p, void *v)
 	seq_put_decimal_ull(p, ' ', cputime64_to_clock_t(guest));
 	seq_put_decimal_ull(p, ' ', cputime64_to_clock_t(guest_nice));
 	seq_putc(p, '\n');
+	return 0;
+}
+
+static int show_online_cpu_stat(struct seq_file *p, struct seq_stat_iter *iter)
+{
+	u64 user, nice, system, idle, iowait, irq, softirq, steal;
+	u64 guest, guest_nice;
+	int i, cpu, index;
 
-	for_each_online_cpu(i) {
+	/* Handle 32 cpus at a time, to avoid lots of seqfile iterations. */
+	cpu = index = iter->cpu_iter;
+	for (i = 0; i < 32 && index < nr_cpu_ids; i++) {
+		cpu = index;
 		/* Copy values here to work around gcc-2.95.3, gcc-2.96 */
-		user = kcpustat_cpu(i).cpustat[CPUTIME_USER];
-		nice = kcpustat_cpu(i).cpustat[CPUTIME_NICE];
-		system = kcpustat_cpu(i).cpustat[CPUTIME_SYSTEM];
-		idle = get_idle_time(i);
-		iowait = get_iowait_time(i);
-		irq = kcpustat_cpu(i).cpustat[CPUTIME_IRQ];
-		softirq = kcpustat_cpu(i).cpustat[CPUTIME_SOFTIRQ];
-		steal = kcpustat_cpu(i).cpustat[CPUTIME_STEAL];
-		guest = kcpustat_cpu(i).cpustat[CPUTIME_GUEST];
-		guest_nice = kcpustat_cpu(i).cpustat[CPUTIME_GUEST_NICE];
-		seq_printf(p, "cpu%d", i);
+		user = kcpustat_cpu(cpu).cpustat[CPUTIME_USER];
+		nice = kcpustat_cpu(cpu).cpustat[CPUTIME_NICE];
+		system = kcpustat_cpu(cpu).cpustat[CPUTIME_SYSTEM];
+		idle = get_idle_time(cpu);
+		iowait = get_iowait_time(cpu);
+		irq = kcpustat_cpu(cpu).cpustat[CPUTIME_IRQ];
+		softirq = kcpustat_cpu(cpu).cpustat[CPUTIME_SOFTIRQ];
+		steal = kcpustat_cpu(cpu).cpustat[CPUTIME_STEAL];
+		guest = kcpustat_cpu(cpu).cpustat[CPUTIME_GUEST];
+		guest_nice = kcpustat_cpu(cpu).cpustat[CPUTIME_GUEST_NICE];
+		seq_printf(p, "cpu%d", cpu);
 		seq_put_decimal_ull(p, ' ', cputime64_to_clock_t(user));
 		seq_put_decimal_ull(p, ' ', cputime64_to_clock_t(nice));
 		seq_put_decimal_ull(p, ' ', cputime64_to_clock_t(system));
@@ -154,66 +252,96 @@ static int show_stat(struct seq_file *p, void *v)
 		seq_put_decimal_ull(p, ' ', cputime64_to_clock_t(guest));
 		seq_put_decimal_ull(p, ' ', cputime64_to_clock_t(guest_nice));
 		seq_putc(p, '\n');
+		index = cpumask_next(cpu, cpu_online_mask);
 	}
-	seq_printf(p, "intr %llu", (unsigned long long)sum);
-
-	/* sum again ? it could be updated? */
-	for_each_irq_nr(j)
-		seq_put_decimal_ull(p, ' ', kstat_irqs(j));
-
-	seq_printf(p,
-		"\nctxt %llu\n"
-		"btime %lu\n"
-		"processes %lu\n"
-		"procs_running %lu\n"
-		"procs_blocked %lu\n",
-		nr_context_switches(),
-		(unsigned long)jif,
-		total_forks,
-		nr_running(),
-		nr_iowait());
-
-	seq_printf(p, "softirq %llu", (unsigned long long)sum_softirq);
-
-	for (i = 0; i < NR_SOFTIRQS; i++)
-		seq_put_decimal_ull(p, ' ', per_softirq_sums[i]);
-	seq_putc(p, '\n');
+	iter->cpu_iter = cpu;
+	return 0;
+}
+
+static int show_irq_details(struct seq_file *p, struct seq_stat_iter *iter)
+{
+	int ret;
+
+	/*
+	 * We update the iterator in ->show(). This seems ugly, but to avoid
+	 * tons of function calls we print out as much as possible here.
+	 */
+	do {
+		ret = seq_put_decimal_ull(p, ' ', kstat_irqs(iter->irq_iter));
+		if (!ret)
+			iter->irq_iter += 1;
+	} while (!ret && iter->irq_iter < nr_irqs);
 
 	return 0;
 }
 
+static int show_softirq_details(struct seq_file *p, struct seq_stat_iter *iter)
+{
+	int ret;
+
+	do {
+		ret = seq_put_decimal_ull(p, ' ',
+				iter->per_softirq_sums[iter->softirq_iter]);
+		if (!ret)
+			iter->softirq_iter += 1;
+	} while (!ret && iter->softirq_iter < NR_SOFTIRQS);
+	return 0;
+}
+
+static int proc_stat_show(struct seq_file *p, void *v)
+{
+	struct seq_stat_iter *iter = v;
+
+	switch (iter->stage) {
+	case SHOW_TOTAL_CPU_STAT:
+		return show_total_cpu_stat(p, iter);
+	case SHOW_PERCPU_STAT:
+		return show_online_cpu_stat(p, iter);
+	case SHOW_TOTAL_IRQS:
+		return seq_printf(p, "intr %llu",
+				  (unsigned long long)iter->irq_sum);
+	case SHOW_IRQ_DETAILS:
+		return show_irq_details(p, iter);
+	case SHOW_TIMES:
+		return seq_printf(p,
+				  "\nctxt %llu\n"
+				  "btime %lu\n"
+				  "processes %lu\n"
+				  "procs_running %lu\n"
+				  "procs_blocked %lu\n",
+				  nr_context_switches(),
+				  (unsigned long)iter->jiffies,
+				  total_forks,
+				  nr_running(),
+				  nr_iowait());
+	case SHOW_TOTAL_SOFTIRQ:
+		return seq_printf(p, "softirq %llu",
+				  (unsigned long long)iter->softirq_sum);
+	case SHOW_SOFTIRQ_DETAILS:
+		return show_softirq_details(p, iter);
+	case SHOW_EOL:
+		return seq_putc(p, '\n');
+	}
+	return 0;
+}
+
+static const struct seq_operations show_stat_op = {
+	.start = proc_stat_start,
+	.next  = proc_stat_next,
+	.stop  = proc_stat_stop,
+	.show  = proc_stat_show,
+};
+
 static int stat_open(struct inode *inode, struct file *file)
 {
-	size_t size = 1024 + 128 * num_online_cpus();
-	char *buf;
-	struct seq_file *m;
-	int res;
-
-	/* minimum size to display an interrupt count : 2 bytes */
-	size += 2 * nr_irqs;
-
-	/* don't ask for more than the kmalloc() max size */
-	if (size > KMALLOC_MAX_SIZE)
-		size = KMALLOC_MAX_SIZE;
-	buf = kmalloc(size, GFP_KERNEL);
-	if (!buf)
-		return -ENOMEM;
-
-	res = single_open(file, show_stat, NULL);
-	if (!res) {
-		m = file->private_data;
-		m->buf = buf;
-		m->size = ksize(buf);
-	} else
-		kfree(buf);
-	return res;
+	return seq_open_private(file, &show_stat_op, sizeof(struct seq_stat_iter));
 }
 
 static const struct file_operations proc_stat_operations = {
 	.open		= stat_open,
 	.read		= seq_read,
 	.llseek		= seq_lseek,
-	.release	= single_release,
+	.release	= seq_release_private,
 };
 
 static int __init proc_stat_init(void)
-- 
1.8.5.5


^ permalink raw reply related	[flat|nested] 21+ messages in thread

* Re: [PATCH 1/2] fs: proc/stat: use num_online_cpus() for buffer size
  2014-05-28  8:59     ` [PATCH 1/2] fs: proc/stat: use num_online_cpus() for buffer size Heiko Carstens
@ 2014-05-28 11:06       ` Ian Kent
  2014-05-28 11:14         ` Ian Kent
  0 siblings, 1 reply; 21+ messages in thread
From: Ian Kent @ 2014-05-28 11:06 UTC (permalink / raw)
  To: Heiko Carstens
  Cc: Christoph Hellwig, Andrew Morton, KAMEZAWA Hiroyuki,
	Andrea Righi, Eric Dumazet, linux-kernel, linux-fsdevel,
	Hendrik Brueckner, Thorsten Diehl, Elliott,
	Robert (Server Storage)

On Wed, 2014-05-28 at 10:59 +0200, Heiko Carstens wrote:
> The number of bytes contained 'within' /proc/stat depends on the number
> of online cpus and not on the number of possible cpus.
> 
> This reduces the number of bytes requested for the initial buffer allocation
> within stat_open(), which is usually way too high and, for num_possible_cpus()
> == 256 cpus, would result in an order-4 allocation.
> 
> Order-4 allocations, however, may fail if memory is fragmented and we would
> end up with an unreadable /proc/stat file:
> 
> [62129.701569] sadc: page allocation failure: order:4, mode:0x1040d0
> [62129.701573] CPU: 1 PID: 192063 Comm: sadc Not tainted 3.10.0-123.el7.s390x #1
> [...]
> [62129.701586] Call Trace:
> [62129.701588] ([<0000000000111fbe>] show_trace+0xe6/0x130)
> [62129.701591] [<0000000000112074>] show_stack+0x6c/0xe8
> [62129.701593] [<000000000020d356>] warn_alloc_failed+0xd6/0x138
> [62129.701596] [<00000000002114d2>] __alloc_pages_nodemask+0x9da/0xb68
> [62129.701598] [<000000000021168e>] __get_free_pages+0x2e/0x58
> [62129.701599] [<000000000025a05c>] kmalloc_order_trace+0x44/0xc0
> [62129.701602] [<00000000002f3ffa>] stat_open+0x5a/0xd8
> [62129.701604] [<00000000002e9aaa>] proc_reg_open+0x8a/0x140
> [62129.701606] [<0000000000273b64>] do_dentry_open+0x1bc/0x2c8
> [62129.701608] [<000000000027411e>] finish_open+0x46/0x60
> [62129.701610] [<000000000028675a>] do_last+0x382/0x10d0
> [62129.701612] [<0000000000287570>] path_openat+0xc8/0x4f8
> [62129.701614] [<0000000000288bde>] do_filp_open+0x46/0xa8
> [62129.701616] [<000000000027541c>] do_sys_open+0x114/0x1f0
> [62129.701618] [<00000000005b1c1c>] sysc_tracego+0x14/0x1a
> 
> Signed-off-by: Heiko Carstens <heiko.carstens@de.ibm.com>
> ---
>  fs/proc/stat.c | 2 +-
>  1 file changed, 1 insertion(+), 1 deletion(-)
> 
> diff --git a/fs/proc/stat.c b/fs/proc/stat.c
> index 9d231e9e5f0e..3898ca5f1e92 100644
> --- a/fs/proc/stat.c
> +++ b/fs/proc/stat.c
> @@ -184,7 +184,7 @@ static int show_stat(struct seq_file *p, void *v)
>  
>  static int stat_open(struct inode *inode, struct file *file)
>  {
> -	size_t size = 1024 + 128 * num_possible_cpus();
> +	size_t size = 1024 + 128 * num_online_cpus();

Yes, I thought of this too when I was looking at the problem but was
concerned about the number of online cpus changing during the read.

If a system can hotplug cpus then I guess we don't care much about the
number of cpus increasing during the read, we'll just see incorrect data
once, but what would happen if some cpus were removed? Do we even care
about that case?

>  	char *buf;
>  	struct seq_file *m;
>  	int res;



^ permalink raw reply	[flat|nested] 21+ messages in thread

* Re: [PATCH 1/2] fs: proc/stat: use num_online_cpus() for buffer size
  2014-05-28 11:06       ` Ian Kent
@ 2014-05-28 11:14         ` Ian Kent
  0 siblings, 0 replies; 21+ messages in thread
From: Ian Kent @ 2014-05-28 11:14 UTC (permalink / raw)
  To: Heiko Carstens
  Cc: Christoph Hellwig, Andrew Morton, KAMEZAWA Hiroyuki,
	Andrea Righi, Eric Dumazet, linux-kernel, linux-fsdevel,
	Hendrik Brueckner, Thorsten Diehl, Elliott,
	Robert (Server Storage)

On Wed, 2014-05-28 at 19:06 +0800, Ian Kent wrote:
> On Wed, 2014-05-28 at 10:59 +0200, Heiko Carstens wrote:
> > The number of bytes contained 'within' /proc/stat depends on the number
> > of online cpus and not on the number of possible cpus.
> > 
> > This reduces the number of bytes requested for the initial buffer allocation
> > within stat_open(), which is usually way too high and, for num_possible_cpus()
> > == 256 cpus, would result in an order-4 allocation.
> > 
> > Order-4 allocations, however, may fail if memory is fragmented and we would
> > end up with an unreadable /proc/stat file:
> > 
> > [62129.701569] sadc: page allocation failure: order:4, mode:0x1040d0
> > [62129.701573] CPU: 1 PID: 192063 Comm: sadc Not tainted 3.10.0-123.el7.s390x #1
> > [...]
> > [62129.701586] Call Trace:
> > [62129.701588] ([<0000000000111fbe>] show_trace+0xe6/0x130)
> > [62129.701591] [<0000000000112074>] show_stack+0x6c/0xe8
> > [62129.701593] [<000000000020d356>] warn_alloc_failed+0xd6/0x138
> > [62129.701596] [<00000000002114d2>] __alloc_pages_nodemask+0x9da/0xb68
> > [62129.701598] [<000000000021168e>] __get_free_pages+0x2e/0x58
> > [62129.701599] [<000000000025a05c>] kmalloc_order_trace+0x44/0xc0
> > [62129.701602] [<00000000002f3ffa>] stat_open+0x5a/0xd8
> > [62129.701604] [<00000000002e9aaa>] proc_reg_open+0x8a/0x140
> > [62129.701606] [<0000000000273b64>] do_dentry_open+0x1bc/0x2c8
> > [62129.701608] [<000000000027411e>] finish_open+0x46/0x60
> > [62129.701610] [<000000000028675a>] do_last+0x382/0x10d0
> > [62129.701612] [<0000000000287570>] path_openat+0xc8/0x4f8
> > [62129.701614] [<0000000000288bde>] do_filp_open+0x46/0xa8
> > [62129.701616] [<000000000027541c>] do_sys_open+0x114/0x1f0
> > [62129.701618] [<00000000005b1c1c>] sysc_tracego+0x14/0x1a
> > 
> > Signed-off-by: Heiko Carstens <heiko.carstens@de.ibm.com>
> > ---
> >  fs/proc/stat.c | 2 +-
> >  1 file changed, 1 insertion(+), 1 deletion(-)
> > 
> > diff --git a/fs/proc/stat.c b/fs/proc/stat.c
> > index 9d231e9e5f0e..3898ca5f1e92 100644
> > --- a/fs/proc/stat.c
> > +++ b/fs/proc/stat.c
> > @@ -184,7 +184,7 @@ static int show_stat(struct seq_file *p, void *v)
> >  
> >  static int stat_open(struct inode *inode, struct file *file)
> >  {
> > -	size_t size = 1024 + 128 * num_possible_cpus();
> > +	size_t size = 1024 + 128 * num_online_cpus();
> 
> Yes, I thought of this too when I was looking at the problem but was
> concerned about the number of online cpus changing during the read.
> 
> If a system can hotplug cpus then I guess we don't care much about the
> number of cpus increasing during the read, we'll just see incorrect data
> once, but what would happen if some cpus were removed? Do we even care
> about that case?

Oh hang on, that's not right, it's the opposite: if the number of cpus
increases between the call to stat_open() and show_stat() there might
not be enough space.

> 
> >  	char *buf;
> >  	struct seq_file *m;
> >  	int res;
> 



^ permalink raw reply	[flat|nested] 21+ messages in thread

* Re: [PATCH 2/2] fs: proc/stat: use usual seq_file ops rather than single_open
  2014-05-28  9:01     ` [PATCH 2/2] fs: proc/stat: use usual seq_file ops rather than single_open Heiko Carstens
@ 2014-05-28 22:37       ` Andrew Morton
  2014-05-30  8:38         ` Heiko Carstens
  2014-06-09  8:11         ` [PATCH 2/2] fs: proc/stat: use usual seq_file ops rather than single_open Ian Kent
  0 siblings, 2 replies; 21+ messages in thread
From: Andrew Morton @ 2014-05-28 22:37 UTC (permalink / raw)
  To: Heiko Carstens
  Cc: Christoph Hellwig, KAMEZAWA Hiroyuki, Andrea Righi, Eric Dumazet,
	linux-kernel, linux-fsdevel, Hendrik Brueckner, Thorsten Diehl,
	Ian Kent, Elliott, Robert (Server Storage)

On Wed, 28 May 2014 11:01:53 +0200 Heiko Carstens <heiko.carstens@de.ibm.com> wrote:

> Now, /proc/stat uses single_open() for showing information. This means
> all the data will be gathered and buffered once into a (big) buffer.
> 
> /proc/stat shows stats per cpu and stats per IRQ. To get the information
> in one shot, it allocates a big buffer (up to KMALLOC_MAX_SIZE).
> 
> Eric Dumazet reported that the bufsize calculation doesn't take
> the number of IRQs into account, so the information cannot be
> read in one shot. (Because of this, seq_read() will allocate the buffer
> again and read the whole data again...)
> 
> This patch changes /proc/stat to use seq_open() rather than single_open()
> and provides ->start(), ->next(), ->stop() and ->show().
> 
> With this, /proc/stat no longer needs to take care of the buffer size.
> 
> [heiko.carstens@de.ibm.com]: This is the forward port of a patch
> from KAMEZAWA Hiroyuki (https://lkml.org/lkml/2012/1/23/41).
> I added a couple of simple changes, e.g. the cpu iterator
> handles 32 cpus in a batch to avoid lots of iterations.
> 
> With this patch, reading /proc/stat should no longer fail because of a
> failing high order memory allocation.

So this deletes the problematic allocation which [1/2] kind-of fixed,
yes?

I agree with Ian - there's a hotplugging race.  And [1/2] doesn't do
anything to address the worst-case allocation size.  So I think we may
as well do this all in a single patch.

Without having looked closely at the code I worry a bit about the
effects.  /proc/pid/stat is a complex thing and its contents will vary
in strange ways as the things it is reporting on undergo concurrent
changes.  This patch will presumably replace one set of bizarre
behaviour with a new set of bizarre behaviour and there may be
unforeseen consequences to established userspace.

So we're going to need a lot of testing and a lot of testing time to
identify issues and weed them out.

So..  can we take this up for 3.16-rc1?  See if we can get some careful
review done then and test it for a couple of months?

Meanwhile, the changelog looks a bit hastily thrown together - some
smoothing would be nice, and perhaps some work spent identifying
possible behavioural changes.  Timing changes, locking changes, effects
of concurrent fork/exit activity etc?


^ permalink raw reply	[flat|nested] 21+ messages in thread

* Re: [PATCH 2/2] fs: proc/stat: use usual seq_file ops rather than single_open
  2014-05-28 22:37       ` Andrew Morton
@ 2014-05-30  8:38         ` Heiko Carstens
  2014-05-30 11:36           ` [PATCH] fs: proc/stat: use seq_file iterator interface Heiko Carstens
  2014-06-09  8:11         ` [PATCH 2/2] fs: proc/stat: use usual seq_file ops rather than single_open Ian Kent
  1 sibling, 1 reply; 21+ messages in thread
From: Heiko Carstens @ 2014-05-30  8:38 UTC (permalink / raw)
  To: Andrew Morton
  Cc: Christoph Hellwig, KAMEZAWA Hiroyuki, Andrea Righi, Eric Dumazet,
	linux-kernel, linux-fsdevel, Hendrik Brueckner, Thorsten Diehl,
	Ian Kent, Elliott, Robert (Server Storage)

On Wed, May 28, 2014 at 03:37:04PM -0700, Andrew Morton wrote:
> On Wed, 28 May 2014 11:01:53 +0200 Heiko Carstens <heiko.carstens@de.ibm.com> wrote:
> > With this patch, reading /proc/stat should no longer fail because of a
> > failing high order memory allocation.
> 
> So this deletes the problematic allocation which [1/2] kind-of fixed,
> yes?

Yes.

> I agree with Ian - there's a hotplugging race.  And [1/2] doesn't do
> anything to address the worst-case allocation size.  So I think we may
> as well do this all in a single patch.

Fine with me. However the hotplugging race in 1/2 doesn't matter: if the
result doesn't fit into the preallocated buffer the seq_file infrastructure
would simply allocate a buffer twice as large as before and retry.
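
For reference, a simplified sketch of that retry logic from seq_read() in
fs/seq_file.c (error handling and restarting the iterator omitted):

	/* ->show() overflowed the buffer: free it, double the size, retry */
	if (m->count == m->size) {
		kfree(m->buf);
		m->count = 0;
		m->buf = kmalloc(m->size <<= 1, GFP_KERNEL);
		if (!m->buf)
			return -ENOMEM;
		/* ... then start over at the current record ... */
	}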

The point of patch 1/2 was to have a patch that probably solves the problem
almost always ;), without having the problems you describe below.

> Without having looked closely at the code I worry a bit about the
> effects.  /proc/pid/stat is a complex thing and its contents will vary

It's /proc/stat not /proc/pid/stat.

> So..  can we take this up for 3.16-rc1?  See if we can get some careful
> review done then and test it for a couple of months?

Sure.

> Meanwhile, the changelog looks a bit hastily thrown together - some
> smoothing would be nice, and perhaps some work spent identifying
> possible behavioural changes.  Timing changes, locking changes, effects
> of concurrent fork/exit activity etc?

Well... I'll try to come up with something better. Even though I only
forward ported an existing patch to address a memory allocation failure.
Oh oh...


^ permalink raw reply	[flat|nested] 21+ messages in thread

* [PATCH] fs: proc/stat: use seq_file iterator interface
  2014-05-30  8:38         ` Heiko Carstens
@ 2014-05-30 11:36           ` Heiko Carstens
  0 siblings, 0 replies; 21+ messages in thread
From: Heiko Carstens @ 2014-05-30 11:36 UTC (permalink / raw)
  To: Andrew Morton, Christoph Hellwig, KAMEZAWA Hiroyuki,
	Andrea Righi, Eric Dumazet, linux-kernel, linux-fsdevel,
	Hendrik Brueckner, Thorsten Diehl, Ian Kent, Elliott,
	Robert (Server Storage)

> Well... I'll try to come up with something better. Even though I only
> forward ported an existing patch to address a memory allocation failure.

How about the patch below then?

From 52a050f85256d7586933365da1b98c6227651449 Mon Sep 17 00:00:00 2001
From: KAMEZAWA Hiroyuki <kamezawa.hiroyu@jp.fujitsu.com>
Date: Tue, 27 May 2014 12:41:00 +0200
Subject: [PATCH] fs: proc/stat: use seq_file iterator interface

This is the forward port of a patch from KAMEZAWA Hiroyuki with a couple
of additional small changes.
The original patch was posted here: https://lkml.org/lkml/2012/1/23/41.

/proc/stat uses single_open() for showing information. This means all data
will be gathered and buffered once into a big buffer.

The size of the buffer depends on the number of possible cpus and therefore
may get very large. This may lead to large memory allocations, which can
fail if memory is fragmented (e.g. for num_possible_cpus() == 256):

sadc: page allocation failure: order:4, mode:0x1040d0
CPU: 1 PID: 192063 Comm: sadc Not tainted 3.10.0-123.el7.s390x #1
[...]
Call Trace:
([<0000000000111fbe>] show_trace+0xe6/0x130)
[<0000000000112074>] show_stack+0x6c/0xe8
[<000000000020d356>] warn_alloc_failed+0xd6/0x138
[<00000000002114d2>] __alloc_pages_nodemask+0x9da/0xb68
[<000000000021168e>] __get_free_pages+0x2e/0x58
[<000000000025a05c>] kmalloc_order_trace+0x44/0xc0
[<00000000002f3ffa>] stat_open+0x5a/0xd8
[<00000000002e9aaa>] proc_reg_open+0x8a/0x140
[<0000000000273b64>] do_dentry_open+0x1bc/0x2c8
[<000000000027411e>] finish_open+0x46/0x60
[<000000000028675a>] do_last+0x382/0x10d0
[<0000000000287570>] path_openat+0xc8/0x4f8
[<0000000000288bde>] do_filp_open+0x46/0xa8
[<000000000027541c>] do_sys_open+0x114/0x1f0
[<00000000005b1c1c>] sysc_tracego+0x14/0x1a

This in turn means that reading /proc/stat doesn't work at all in such
situations.

To address this issue convert /proc/stat to use the seq_file iterator
interface. The existing output is split into several pieces so that each
piece fits into a single page before it is copied to user space.  That way
the higher order allocation goes away.

Unfortunately this has user space visible side effects. Unlike before,
reading /proc/stat may now require several read system calls, even if the
buffer provided by user space would be large enough to hold the entire
data.  However, it seems unlikely that such user space programs exist,
since the content of /proc/stat varies all the time anyway.
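
For illustration, a hypothetical user space read loop that copes with
this (not part of this patch; any robust reader should look like this
anyway):

	char buf[65536];
	ssize_t n, total = 0;
	int fd = open("/proc/stat", O_RDONLY);

	/* A single read() may return only part of the file, so keep
	 * reading until read() returns 0 (EOF). */
	while ((n = read(fd, buf + total, sizeof(buf) - total)) > 0)
		total += n;
	close(fd);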

Furthermore the code that generated the output gathered per cpu statistics
without any locking. So while a process reads /proc/stat the data may
change concurrently, potentially leading to inconsistent data. This is
still the case after the change; however, because of the iterator interface
and the additional latencies we get with that, the per cpu statistics may
get more inconsistent than before.
Whether this is really an issue remains to be seen.

Signed-off-by: KAMEZAWA Hiroyuki <kamezawa.hiroyu@jp.fujitsu.com>
Signed-off-by: Heiko Carstens <heiko.carstens@de.ibm.com>
---
 fs/proc/stat.c | 278 +++++++++++++++++++++++++++++++++++++++++----------------
 1 file changed, 203 insertions(+), 75 deletions(-)

diff --git a/fs/proc/stat.c b/fs/proc/stat.c
index 9d231e9e5f0e..141c4b773ec1 100644
--- a/fs/proc/stat.c
+++ b/fs/proc/stat.c
@@ -77,22 +77,109 @@ static u64 get_iowait_time(int cpu)
 
 #endif
 
-static int show_stat(struct seq_file *p, void *v)
+enum proc_stat_stage /* The numbers are used as *pos and iter->stage */
+{
+	SHOW_TOTAL_CPU_STAT,
+	SHOW_PERCPU_STAT,
+	SHOW_TOTAL_IRQS,
+	SHOW_IRQ_DETAILS,
+	SHOW_TIMES,
+	SHOW_TOTAL_SOFTIRQ,
+	SHOW_SOFTIRQ_DETAILS,
+	SHOW_EOL,
+	END_STATS,
+};
+
+/*
+ * To reduce the number of ->next(), ->show() calls IRQ numbers are
+ * handled in batch.
+ */
+struct seq_stat_iter {
+	int stage;
+	unsigned long jiffies;
+	int cpu_iter;
+	int irq_iter;
+	int softirq_iter;
+	/* cached information */
+	u64 irq_sum;
+	u64 softirq_sum;
+	u32 per_softirq_sums[NR_SOFTIRQS];
+};
+
+static void *proc_stat_start(struct seq_file *p, loff_t *pos)
+{
+	struct seq_stat_iter *iter = p->private;
+
+	/* At lseek(), *pos==0 is passed (see traverse() in seq_file.c). */
+	if (!*pos) {
+		struct timespec boottime;
+
+		memset(iter, 0, sizeof(*iter));
+		iter->stage = SHOW_TOTAL_CPU_STAT;
+		getboottime(&boottime);
+		iter->jiffies = boottime.tv_sec;
+	}
+	if (iter->stage == END_STATS)
+		return NULL;
+	return iter;
+}
+
+static void proc_stat_stop(struct seq_file *p, void *v)
+{
+}
+
+static void *proc_stat_next(struct seq_file *p, void *v, loff_t *pos)
+{
+	struct seq_stat_iter *iter = p->private;
+	int index;
+
+	switch (iter->stage) {
+	case SHOW_TOTAL_CPU_STAT:
+		iter->stage = SHOW_PERCPU_STAT;
+		iter->cpu_iter = cpumask_first(cpu_online_mask);
+		break;
+	case SHOW_PERCPU_STAT:
+		index = cpumask_next(iter->cpu_iter, cpu_online_mask);
+		if (index >= nr_cpu_ids)
+			iter->stage = SHOW_TOTAL_IRQS;
+		else
+			iter->cpu_iter = index;
+		break;
+	case SHOW_TOTAL_IRQS:
+		iter->stage = SHOW_IRQ_DETAILS;
+		iter->irq_iter = 0;
+		break;
+	case SHOW_IRQ_DETAILS:
+		if (iter->irq_iter >= nr_irqs)
+			iter->stage = SHOW_TIMES;
+		break;
+	case SHOW_TIMES:
+		iter->stage = SHOW_TOTAL_SOFTIRQ;
+		break;
+	case SHOW_TOTAL_SOFTIRQ:
+		iter->stage = SHOW_SOFTIRQ_DETAILS;
+		break;
+	case SHOW_SOFTIRQ_DETAILS:
+		iter->stage = SHOW_EOL;
+		break;
+	case SHOW_EOL:
+		iter->stage = END_STATS;
+		return NULL;
+	default:
+		break;
+	}
+	return iter;
+}
+
+static int show_total_cpu_stat(struct seq_file *p, struct seq_stat_iter *iter)
 {
-	int i, j;
-	unsigned long jif;
 	u64 user, nice, system, idle, iowait, irq, softirq, steal;
 	u64 guest, guest_nice;
-	u64 sum = 0;
-	u64 sum_softirq = 0;
-	unsigned int per_softirq_sums[NR_SOFTIRQS] = {0};
-	struct timespec boottime;
+	int i, j;
 
-	user = nice = system = idle = iowait =
-		irq = softirq = steal = 0;
+	user = nice = system = idle = iowait = 0;
+	irq = softirq = steal = 0;
 	guest = guest_nice = 0;
-	getboottime(&boottime);
-	jif = boottime.tv_sec;
 
 	for_each_possible_cpu(i) {
 		user += kcpustat_cpu(i).cpustat[CPUTIME_USER];
@@ -105,17 +192,17 @@ static int show_stat(struct seq_file *p, void *v)
 		steal += kcpustat_cpu(i).cpustat[CPUTIME_STEAL];
 		guest += kcpustat_cpu(i).cpustat[CPUTIME_GUEST];
 		guest_nice += kcpustat_cpu(i).cpustat[CPUTIME_GUEST_NICE];
-		sum += kstat_cpu_irqs_sum(i);
-		sum += arch_irq_stat_cpu(i);
+		iter->irq_sum += kstat_cpu_irqs_sum(i);
+		iter->irq_sum += arch_irq_stat_cpu(i);
 
 		for (j = 0; j < NR_SOFTIRQS; j++) {
 			unsigned int softirq_stat = kstat_softirqs_cpu(j, i);
 
-			per_softirq_sums[j] += softirq_stat;
-			sum_softirq += softirq_stat;
+			iter->per_softirq_sums[j] += softirq_stat;
+			iter->softirq_sum += softirq_stat;
 		}
 	}
-	sum += arch_irq_stat();
+	iter->irq_sum += arch_irq_stat();
 
 	seq_puts(p, "cpu ");
 	seq_put_decimal_ull(p, ' ', cputime64_to_clock_t(user));
@@ -129,20 +216,31 @@ static int show_stat(struct seq_file *p, void *v)
 	seq_put_decimal_ull(p, ' ', cputime64_to_clock_t(guest));
 	seq_put_decimal_ull(p, ' ', cputime64_to_clock_t(guest_nice));
 	seq_putc(p, '\n');
+	return 0;
+}
+
+static int show_online_cpu_stat(struct seq_file *p, struct seq_stat_iter *iter)
+{
+	u64 user, nice, system, idle, iowait, irq, softirq, steal;
+	u64 guest, guest_nice;
+	int i, cpu, index;
 
-	for_each_online_cpu(i) {
+	/* Handle 32 cpus at a time, to avoid lots of seqfile iterations. */
+	cpu = index = iter->cpu_iter;
+	for (i = 0; i < 32 && index < nr_cpu_ids; i++) {
+		cpu = index;
 		/* Copy values here to work around gcc-2.95.3, gcc-2.96 */
-		user = kcpustat_cpu(i).cpustat[CPUTIME_USER];
-		nice = kcpustat_cpu(i).cpustat[CPUTIME_NICE];
-		system = kcpustat_cpu(i).cpustat[CPUTIME_SYSTEM];
-		idle = get_idle_time(i);
-		iowait = get_iowait_time(i);
-		irq = kcpustat_cpu(i).cpustat[CPUTIME_IRQ];
-		softirq = kcpustat_cpu(i).cpustat[CPUTIME_SOFTIRQ];
-		steal = kcpustat_cpu(i).cpustat[CPUTIME_STEAL];
-		guest = kcpustat_cpu(i).cpustat[CPUTIME_GUEST];
-		guest_nice = kcpustat_cpu(i).cpustat[CPUTIME_GUEST_NICE];
-		seq_printf(p, "cpu%d", i);
+		user = kcpustat_cpu(cpu).cpustat[CPUTIME_USER];
+		nice = kcpustat_cpu(cpu).cpustat[CPUTIME_NICE];
+		system = kcpustat_cpu(cpu).cpustat[CPUTIME_SYSTEM];
+		idle = get_idle_time(cpu);
+		iowait = get_iowait_time(cpu);
+		irq = kcpustat_cpu(cpu).cpustat[CPUTIME_IRQ];
+		softirq = kcpustat_cpu(cpu).cpustat[CPUTIME_SOFTIRQ];
+		steal = kcpustat_cpu(cpu).cpustat[CPUTIME_STEAL];
+		guest = kcpustat_cpu(cpu).cpustat[CPUTIME_GUEST];
+		guest_nice = kcpustat_cpu(cpu).cpustat[CPUTIME_GUEST_NICE];
+		seq_printf(p, "cpu%d", cpu);
 		seq_put_decimal_ull(p, ' ', cputime64_to_clock_t(user));
 		seq_put_decimal_ull(p, ' ', cputime64_to_clock_t(nice));
 		seq_put_decimal_ull(p, ' ', cputime64_to_clock_t(system));
@@ -154,66 +252,96 @@ static int show_stat(struct seq_file *p, void *v)
 		seq_put_decimal_ull(p, ' ', cputime64_to_clock_t(guest));
 		seq_put_decimal_ull(p, ' ', cputime64_to_clock_t(guest_nice));
 		seq_putc(p, '\n');
+		index = cpumask_next(cpu, cpu_online_mask);
 	}
-	seq_printf(p, "intr %llu", (unsigned long long)sum);
-
-	/* sum again ? it could be updated? */
-	for_each_irq_nr(j)
-		seq_put_decimal_ull(p, ' ', kstat_irqs(j));
-
-	seq_printf(p,
-		"\nctxt %llu\n"
-		"btime %lu\n"
-		"processes %lu\n"
-		"procs_running %lu\n"
-		"procs_blocked %lu\n",
-		nr_context_switches(),
-		(unsigned long)jif,
-		total_forks,
-		nr_running(),
-		nr_iowait());
-
-	seq_printf(p, "softirq %llu", (unsigned long long)sum_softirq);
-
-	for (i = 0; i < NR_SOFTIRQS; i++)
-		seq_put_decimal_ull(p, ' ', per_softirq_sums[i]);
-	seq_putc(p, '\n');
+	iter->cpu_iter = cpu;
+	return 0;
+}
+
+static int show_irq_details(struct seq_file *p, struct seq_stat_iter *iter)
+{
+	int ret;
+
+	/*
+	 * We update the iterator in ->show(). This seems ugly, but to avoid
+	 * tons of function calls we print out as much as possible here.
+	 */
+	do {
+		ret = seq_put_decimal_ull(p, ' ', kstat_irqs(iter->irq_iter));
+		if (!ret)
+			iter->irq_iter += 1;
+	} while (!ret && iter->irq_iter < nr_irqs);
 
 	return 0;
 }
 
+static int show_softirq_details(struct seq_file *p, struct seq_stat_iter *iter)
+{
+	int ret;
+
+	do {
+		ret = seq_put_decimal_ull(p, ' ',
+				iter->per_softirq_sums[iter->softirq_iter]);
+		if (!ret)
+			iter->softirq_iter += 1;
+	} while (!ret && iter->softirq_iter < NR_SOFTIRQS);
+	return 0;
+}
+
+static int proc_stat_show(struct seq_file *p, void *v)
+{
+	struct seq_stat_iter *iter = v;
+
+	switch (iter->stage) {
+	case SHOW_TOTAL_CPU_STAT:
+		return show_total_cpu_stat(p, iter);
+	case SHOW_PERCPU_STAT:
+		return show_online_cpu_stat(p, iter);
+	case SHOW_TOTAL_IRQS:
+		return seq_printf(p, "intr %llu",
+				  (unsigned long long)iter->irq_sum);
+	case SHOW_IRQ_DETAILS:
+		return show_irq_details(p, iter);
+	case SHOW_TIMES:
+		return seq_printf(p,
+				  "\nctxt %llu\n"
+				  "btime %lu\n"
+				  "processes %lu\n"
+				  "procs_running %lu\n"
+				  "procs_blocked %lu\n",
+				  nr_context_switches(),
+				  (unsigned long)iter->jiffies,
+				  total_forks,
+				  nr_running(),
+				  nr_iowait());
+	case SHOW_TOTAL_SOFTIRQ:
+		return seq_printf(p, "softirq %llu",
+				  (unsigned long long)iter->softirq_sum);
+	case SHOW_SOFTIRQ_DETAILS:
+		return show_softirq_details(p, iter);
+	case SHOW_EOL:
+		return seq_putc(p, '\n');
+	}
+	return 0;
+}
+
+static const struct seq_operations show_stat_op = {
+	.start = proc_stat_start,
+	.next  = proc_stat_next,
+	.stop  = proc_stat_stop,
+	.show  = proc_stat_show,
+};
+
 static int stat_open(struct inode *inode, struct file *file)
 {
-	size_t size = 1024 + 128 * num_possible_cpus();
-	char *buf;
-	struct seq_file *m;
-	int res;
-
-	/* minimum size to display an interrupt count : 2 bytes */
-	size += 2 * nr_irqs;
-
-	/* don't ask for more than the kmalloc() max size */
-	if (size > KMALLOC_MAX_SIZE)
-		size = KMALLOC_MAX_SIZE;
-	buf = kmalloc(size, GFP_KERNEL);
-	if (!buf)
-		return -ENOMEM;
-
-	res = single_open(file, show_stat, NULL);
-	if (!res) {
-		m = file->private_data;
-		m->buf = buf;
-		m->size = ksize(buf);
-	} else
-		kfree(buf);
-	return res;
+	return seq_open_private(file, &show_stat_op, sizeof(struct seq_stat_iter));
 }
 
 static const struct file_operations proc_stat_operations = {
 	.open		= stat_open,
 	.read		= seq_read,
 	.llseek		= seq_lseek,
-	.release	= single_release,
+	.release	= seq_release_private,
 };
 
 static int __init proc_stat_init(void)
-- 
1.8.5.5


^ permalink raw reply related	[flat|nested] 21+ messages in thread

* Re: [PATCH 2/2] fs: proc/stat: use usual seq_file ops rather than single_open
  2014-05-28 22:37       ` Andrew Morton
  2014-05-30  8:38         ` Heiko Carstens
@ 2014-06-09  8:11         ` Ian Kent
  2014-06-11 12:43           ` Heiko Carstens
  1 sibling, 1 reply; 21+ messages in thread
From: Ian Kent @ 2014-06-09  8:11 UTC (permalink / raw)
  To: Andrew Morton
  Cc: Heiko Carstens, Christoph Hellwig, KAMEZAWA Hiroyuki,
	Andrea Righi, Eric Dumazet, linux-kernel, linux-fsdevel,
	Hendrik Brueckner, Thorsten Diehl, Elliott,
	Robert (Server Storage)

On Wed, 2014-05-28 at 15:37 -0700, Andrew Morton wrote:
> On Wed, 28 May 2014 11:01:53 +0200 Heiko Carstens <heiko.carstens@de.ibm.com> wrote:
> 
> > Now, /proc/stat uses single_open() for showing information. This means
> > all the data will be gathered and buffered once into a (big) buffer.
> > 
> > /proc/stat shows stats per cpu and stats per IRQ. To get the information
> > in one shot, it allocates a big buffer (up to KMALLOC_MAX_SIZE).
> > 
> > Eric Dumazet reported that the bufsize calculation doesn't take
> > the number of IRQs into account, so the information cannot be
> > read in one shot. (Because of this, seq_read() will allocate the buffer
> > again and read the whole data again...)
> > 
> > This patch changes /proc/stat to use seq_open() rather than single_open()
> > and provides ->start(), ->next(), ->stop() and ->show().
> > 
> > With this, /proc/stat no longer needs to take care of the buffer size.
> > 
> > [heiko.carstens@de.ibm.com]: This is the forward port of a patch
> > from KAMEZAWA Hiroyuki (https://lkml.org/lkml/2012/1/23/41).
> > I added a couple of simple changes, e.g. the cpu iterator
> > handles 32 cpus in a batch to avoid lots of iterations.
> > 
> > With this patch, reading /proc/stat should no longer fail because of a
> > failing high order memory allocation.
> 
> So this deletes the problematic allocation which [1/2] kind-of fixed,
> yes?
> 
> I agree with Ian - there's a hotplugging race.  And [1/2] doesn't do
> anything to address the worst-case allocation size.  So I think we may
> as well do this all in a single patch.
> 
> Without having looked closely at the code I worry a bit about the
> effects.  /proc/stat is a complex thing and its contents will vary
> in strange ways as the things it is reporting on undergo concurrent
> changes.  This patch will presumably replace one set of bizarre
> behaviour with a new set of bizarre behaviour and there may be
> unforeseen consequences for established userspace.
> 
> So we're going to need a lot of testing and a lot of testing time to
> identify issues and weed them out.
> 
> So..  can we take this up for 3.16-rc1?  See if we can get some careful
> review done then and test it for a couple of months?
> 
> Meanwhile, the changelog looks a bit hastily thrown together - some
> smoothing would be nice, and perhaps some work spent identifying
> possible behavioural changes.  Timing changes, locking changes, effects
> of concurrent fork/exit activity etc?
> 

Umm ... I didn't expect this to turn into such a rant, apologies in
advance.
 
Certainly using the usual seq_file ops is desirable in the long run, and
that change should be worked on by those who maintain this area of code,
but, as Andrew says, it's too large a change to put in without
considerable testing.

The problem has been exposed by a change which sets the default number
of CPUs to the maximum the architecture (s390) can have, 256, when no
value has been specified in the kernel configuration, rather than the
previous defaults of 32 when CPU hotplug is not set or 64 when it is.

The allocation problem doesn't occur when the number of CPUs is set to
the previous default of 64, even for low-end systems with 2 CPUs and 2G
of RAM (like the one for which this problem was reported), but becomes a
problem when the number of CPUs is set to 256 on these systems.

Also, I believe the s390 maintainers are committed to keeping the
default configured number of CPUs at 256.

So the actual problem is that the heuristic used to calculate an initial
buffer size doesn't account for a configured number of CPUs much greater
than the hardware can sensibly accommodate.

If we assume that the current implementation functions correctly when
the buffer overflows, i.e. doubles the allocation size and restarts, then
an interim solution to the problem comes down to not much more than what
is in patch 1 above.

Looking at the current heuristic allocation sizes, without taking the
allocation for irqs into account we get:
cpus: 32      size: 5k
cpus: 64      size: 9k
cpus: 128     size: 17k
cpus: 256     size: 33k
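
For concreteness, those figures are just the stat_open() heuristic
evaluated for each CPU count (a sketch of the arithmetic only, with
ncpus standing in for num_possible_cpus() and the nr_irqs term ignored
as above):

	size_t size = 1024 + 128 * ncpus;
	/* ncpus =  32: 1024 +  4096 =  5120 bytes (~5k)  */
	/* ncpus =  64: 1024 +  8192 =  9216 bytes (~9k)  */
	/* ncpus = 128: 1024 + 16384 = 17408 bytes (~17k) */
	/* ncpus = 256: 1024 + 32768 = 33792 bytes (~33k) */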

I don't know how many irqs need to be accounted for or if that will make
a difference to my comments below. Someone else will need to comment on
that.

We know (from the order-4 allocation failure) that with 256 CPUs
kmalloc() is looking for a 64k slab chunk, IIUC, and on low-memory
systems will frequently fail to get it.

And for the previous default of 64 CPUs kmalloc() would be looking for
a 16k slab, which we have no evidence it ever fails to get, even on
low-end systems.

So why even use a heuristic calculation, since it can be quite wasteful
anyway or, as in this case, badly wrong? Why not just allocate 8k or 16k
in the open every time, knowing that if the actual number of CPUs is
large we can reasonably expect the system RAM to be correspondingly
large, which should avoid allocation failures upon a read retry.
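
(As a worked example of the order arithmetic, assuming 4k pages: the
~33k request for 256 CPUs is rounded up by kmalloc() to the next
power-of-two slab of 64k, i.e. 16 contiguous pages, an order-4
allocation; the ~9k request for 64 CPUs rounds up to 16k, only an
order-2 allocation.)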

Comments please?

Ian


^ permalink raw reply	[flat|nested] 21+ messages in thread

* Re: [PATCH 2/2] fs: proc/stat: use usual seq_file ops rather than single_open
  2014-06-09  8:11         ` [PATCH 2/2] fs: proc/stat: use usual seq_file ops rather than single_open Ian Kent
@ 2014-06-11 12:43           ` Heiko Carstens
  2014-06-11 22:29             ` David Rientjes
  0 siblings, 1 reply; 21+ messages in thread
From: Heiko Carstens @ 2014-06-11 12:43 UTC (permalink / raw)
  To: Ian Kent
  Cc: Andrew Morton, Christoph Hellwig, KAMEZAWA Hiroyuki,
	Andrea Righi, Eric Dumazet, linux-kernel, linux-fsdevel,
	Hendrik Brueckner, Thorsten Diehl, Elliott,
	Robert (Server Storage),
	Al Viro

[full quote, since I added Al to cc]

On Mon, Jun 09, 2014 at 04:11:59PM +0800, Ian Kent wrote:
> On Wed, 2014-05-28 at 15:37 -0700, Andrew Morton wrote:
> > On Wed, 28 May 2014 11:01:53 +0200 Heiko Carstens <heiko.carstens@de.ibm.com> wrote:
> > 
> > > Currently, /proc/stat uses single_open() for showing information. This
> > > means all the data will be gathered and buffered at once into a (big)
> > > buffer.
> > > 
> > > /proc/stat shows stats per cpu and stats per IRQ. To get the information
> > > in one shot, it allocates a big buffer (up to KMALLOC_MAX_SIZE).
> > > 
> > > Eric Dumazet reported that the bufsize calculation doesn't take
> > > the number of IRQs into account, so the information cannot be
> > > read in one shot. (Because of this, seq_read() will allocate the
> > > buffer again and read the whole data again...)
> > > 
> > > This patch changes /proc/stat to use seq_open() rather than single_open()
> > > and provides ->start(), ->next(), ->stop(), ->show().
> > > 
> > > With this, /proc/stat no longer needs to take care of the buffer size.
> > > 
> > > [heiko.carstens@de.ibm.com]: This is the forward port of a patch
> > > from KAMEZAWA Hiroyuki (https://lkml.org/lkml/2012/1/23/41).
> > > I added a couple of simple changes, e.g. the cpu iterator now
> > > handles 32 cpus in a batch to avoid lots of iterations.
> > > 
> > > With this patch, reading /proc/stat should no longer fail because of
> > > a failing high-order memory allocation.
> > 
> > So this deletes the problematic allocation which [1/2] kind-of fixed,
> > yes?
> > 
> > I agree with Ian - there's a hotplugging race.  And [1/2] doesn't do
> > anything to address the worst-case allocation size.  So I think we may
> > as well do this all in a single patch.
> > 
> > Without having looked closely at the code I worry a bit about the
> > effects.  /proc/stat is a complex thing and its contents will vary
> > in strange ways as the things it is reporting on undergo concurrent
> > changes.  This patch will presumably replace one set of bizarre
> > behaviour with a new set of bizarre behaviour and there may be
> > unforeseen consequences for established userspace.
> > 
> > So we're going to need a lot of testing and a lot of testing time to
> > identify issues and weed them out.
> > 
> > So..  can we take this up for 3.16-rc1?  See if we can get some careful
> > review done then and test it for a couple of months?
> > 
> > Meanwhile, the changelog looks a bit hastily thrown together - some
> > smoothing would be nice, and perhaps some work spent identifying
> > possible behavioural changes.  Timing changes, locking changes, effects
> > of concurrent fork/exit activity etc?
> > 
> 
> Umm ... I didn't expect this to turn into such a rant, apologies in
> advance.
> 
> Certainly using the usual seq_file ops is desirable in the long run, and
> that change should be worked on by those who maintain this area of code,
> but, as Andrew says, it's too large a change to put in without
> considerable testing.
> 
> The problem has been exposed by a change which sets the default number
> of CPUs to the maximum the architecture (s390) can have, 256, when no
> value has been specified in the kernel configuration, rather than the
> previous defaults of 32 when CPU hotplug is not set or 64 when it is.
> 
> The allocation problem doesn't occur when the number of CPUs is set to
> the previous default of 64, even for low-end systems with 2 CPUs and 2G
> of RAM (like the one for which this problem was reported), but becomes a
> problem when the number of CPUs is set to 256 on these systems.
> 
> Also, I believe the s390 maintainers are committed to keeping the
> default configured number of CPUs at 256.
> 
> So the actual problem is that the heuristic used to calculate an initial
> buffer size doesn't account for a configured number of CPUs much greater
> than the hardware can sensibly accommodate.
> 
> If we assume that the current implementation functions correctly when
> the buffer overflows, i.e. doubles the allocation size and restarts, then
> an interim solution to the problem comes down to not much more than what
> is in patch 1 above.
> 
> Looking at the current heuristic allocation sizes, without taking the
> allocation for irqs into account we get:
> cpus: 32      size: 5k
> cpus: 64      size: 9k
> cpus: 128     size: 17k
> cpus: 256     size: 33k
> 
> I don't know how many irqs need to be accounted for or if that will make
> a difference to my comments below. Someone else will need to comment on
> that.
> 
> We know (from the order-4 allocation failure) that with 256 CPUs
> kmalloc() is looking for a 64k slab chunk, IIUC, and on low-memory
> systems will frequently fail to get it.
> 
> And for the previous default of 64 CPUs kmalloc() would be looking for
> a 16k slab, which we have no evidence it ever fails to get, even on
> low-end systems.
> 
> So why even use a heuristic calculation, since it can be quite wasteful
> anyway or, as in this case, badly wrong? Why not just allocate 8k or 16k
> in the open every time, knowing that if the actual number of CPUs is
> large we can reasonably expect the system RAM to be correspondingly
> large, which should avoid allocation failures upon a read retry.
> 
> Comments please?

Alternatively we could also change the seq_file code to fall back to
vmalloc allocations. That would probably "fix" all single_open() usages
where large contiguous memory areas are needed and allocations later
fail due to memory fragmentation.
Does anybody like that approach (sample patch below)?

---
 fs/seq_file.c | 45 +++++++++++++++++++++++++++++++++++++--------
 1 file changed, 37 insertions(+), 8 deletions(-)

diff --git a/fs/seq_file.c b/fs/seq_file.c
index 1d641bb108d2..fca78a04c0d1 100644
--- a/fs/seq_file.c
+++ b/fs/seq_file.c
@@ -8,8 +8,10 @@
 #include <linux/fs.h>
 #include <linux/export.h>
 #include <linux/seq_file.h>
+#include <linux/vmalloc.h>
 #include <linux/slab.h>
 #include <linux/cred.h>
+#include <linux/mm.h>
 
 #include <asm/uaccess.h>
 #include <asm/page.h>
@@ -82,6 +84,31 @@ int seq_open(struct file *file, const struct seq_operations *op)
 }
 EXPORT_SYMBOL(seq_open);
 
+static void seq_alloc(struct seq_file *m)
+{
+	m->size = PAGE_SIZE;
+	m->buf = kmalloc(m->size, GFP_KERNEL | __GFP_NOWARN);
+	if (!m->buf)
+		m->buf = vmalloc(m->size);
+}
+
+static void seq_free(struct seq_file *m)
+{
+	if (unlikely(is_vmalloc_addr(m->buf)))
+		vfree(m->buf);
+	else
+		kfree(m->buf);
+}
+
+static void seq_realloc(struct seq_file *m)
+{
+	seq_free(m);
+	m->size <<= 1;
+	m->buf = kmalloc(m->size, GFP_KERNEL | __GFP_NOWARN);
+	if (!m->buf)
+		m->buf = vmalloc(m->size);
+}
+
 static int traverse(struct seq_file *m, loff_t offset)
 {
 	loff_t pos = 0, index;
@@ -96,7 +123,7 @@ static int traverse(struct seq_file *m, loff_t offset)
 		return 0;
 	}
 	if (!m->buf) {
-		m->buf = kmalloc(m->size = PAGE_SIZE, GFP_KERNEL);
+		seq_alloc(m);
 		if (!m->buf)
 			return -ENOMEM;
 	}
@@ -135,9 +162,8 @@ static int traverse(struct seq_file *m, loff_t offset)
 
 Eoverflow:
 	m->op->stop(m, p);
-	kfree(m->buf);
 	m->count = 0;
-	m->buf = kmalloc(m->size <<= 1, GFP_KERNEL);
+	seq_realloc(m);
 	return !m->buf ? -ENOMEM : -EAGAIN;
 }
 
@@ -192,7 +218,7 @@ ssize_t seq_read(struct file *file, char __user *buf, size_t size, loff_t *ppos)
 
 	/* grab buffer if we didn't have one */
 	if (!m->buf) {
-		m->buf = kmalloc(m->size = PAGE_SIZE, GFP_KERNEL);
+		seq_alloc(m);
 		if (!m->buf)
 			goto Enomem;
 	}
@@ -232,9 +258,8 @@ ssize_t seq_read(struct file *file, char __user *buf, size_t size, loff_t *ppos)
 		if (m->count < m->size)
 			goto Fill;
 		m->op->stop(m, p);
-		kfree(m->buf);
 		m->count = 0;
-		m->buf = kmalloc(m->size <<= 1, GFP_KERNEL);
+		seq_realloc(m);
 		if (!m->buf)
 			goto Enomem;
 		m->version = 0;
@@ -350,7 +375,7 @@ EXPORT_SYMBOL(seq_lseek);
 int seq_release(struct inode *inode, struct file *file)
 {
 	struct seq_file *m = file->private_data;
-	kfree(m->buf);
+	seq_free(m);
 	kfree(m);
 	return 0;
 }
@@ -605,8 +630,12 @@ EXPORT_SYMBOL(single_open);
 int single_open_size(struct file *file, int (*show)(struct seq_file *, void *),
 		void *data, size_t size)
 {
-	char *buf = kmalloc(size, GFP_KERNEL);
+	char *buf;
 	int ret;
+
+	buf = kmalloc(size, GFP_KERNEL | __GFP_NOWARN);
+	if (!buf)
+		buf = vmalloc(size);
 	if (!buf)
 		return -ENOMEM;
 	ret = single_open(file, show, data);
-- 
1.8.5.5


^ permalink raw reply related	[flat|nested] 21+ messages in thread

* Re: [PATCH 2/2] fs: proc/stat: use usual seq_file ops rather than single_open
  2014-06-11 12:43           ` Heiko Carstens
@ 2014-06-11 22:29             ` David Rientjes
  2014-06-12  6:24               ` Ian Kent
  0 siblings, 1 reply; 21+ messages in thread
From: David Rientjes @ 2014-06-11 22:29 UTC (permalink / raw)
  To: Heiko Carstens
  Cc: Ian Kent, Andrew Morton, Christoph Hellwig, KAMEZAWA Hiroyuki,
	Andrea Righi, Eric Dumazet, linux-kernel, linux-fsdevel,
	Hendrik Brueckner, Thorsten Diehl, Elliott,
	Robert (Server Storage),
	Al Viro

On Wed, 11 Jun 2014, Heiko Carstens wrote:

> Alternatively we could also change the seq_file code to fall back to
> vmalloc allocations. That would probably "fix" all single_open() usages
> where large contiguous memory areas are needed and allocations later
> fail due to memory fragmentation.
> Does anybody like that approach (sample patch below)?
> 
> ---
>  fs/seq_file.c | 45 +++++++++++++++++++++++++++++++++++++--------
>  1 file changed, 37 insertions(+), 8 deletions(-)
> 
> diff --git a/fs/seq_file.c b/fs/seq_file.c
> index 1d641bb108d2..fca78a04c0d1 100644
> --- a/fs/seq_file.c
> +++ b/fs/seq_file.c
> @@ -8,8 +8,10 @@
>  #include <linux/fs.h>
>  #include <linux/export.h>
>  #include <linux/seq_file.h>
> +#include <linux/vmalloc.h>
>  #include <linux/slab.h>
>  #include <linux/cred.h>
> +#include <linux/mm.h>
>  
>  #include <asm/uaccess.h>
>  #include <asm/page.h>
> @@ -82,6 +84,31 @@ int seq_open(struct file *file, const struct seq_operations *op)
>  }
>  EXPORT_SYMBOL(seq_open);
>  
> +static void seq_alloc(struct seq_file *m)
> +{
> +	m->size = PAGE_SIZE;
> +	m->buf = kmalloc(m->size, GFP_KERNEL | __GFP_NOWARN);
> +	if (!m->buf)
> +		m->buf = vmalloc(m->size);
> +}
> +

If m->size is unconditionally PAGE_SIZE, then how is vmalloc() going to 
allocate this if kmalloc() fails?

> +static void seq_free(struct seq_file *m)
> +{
> +	if (unlikely(is_vmalloc_addr(m->buf)))
> +		vfree(m->buf);
> +	else
> +		kfree(m->buf);
> +}
> +
> +static void seq_realloc(struct seq_file *m)
> +{
> +	seq_free(m);
> +	m->size <<= 1;
> +	m->buf = kmalloc(m->size, GFP_KERNEL | __GFP_NOWARN);
> +	if (!m->buf)
> +		m->buf = vmalloc(m->size);
> +}
> +
>  static int traverse(struct seq_file *m, loff_t offset)
>  {
>  	loff_t pos = 0, index;
> @@ -96,7 +123,7 @@ static int traverse(struct seq_file *m, loff_t offset)
>  		return 0;
>  	}
>  	if (!m->buf) {
> -		m->buf = kmalloc(m->size = PAGE_SIZE, GFP_KERNEL);
> +		seq_alloc(m);
>  		if (!m->buf)
>  			return -ENOMEM;
>  	}
> @@ -135,9 +162,8 @@ static int traverse(struct seq_file *m, loff_t offset)
>  
>  Eoverflow:
>  	m->op->stop(m, p);
> -	kfree(m->buf);
>  	m->count = 0;
> -	m->buf = kmalloc(m->size <<= 1, GFP_KERNEL);
> +	seq_realloc(m);
>  	return !m->buf ? -ENOMEM : -EAGAIN;
>  }
>  

It seems like traverse() could be rewritten to use krealloc() which does a 
memcpy() to simplify the calling code.
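
Something along these lines, perhaps (an untested sketch with a new
char *new_buf local; krealloc() returns NULL on failure and leaves the
old buffer intact, so it would simply be freed later in seq_release()):

Eoverflow:
	m->op->stop(m, p);
	m->count = 0;
	new_buf = krealloc(m->buf, m->size << 1, GFP_KERNEL);
	if (!new_buf)
		return -ENOMEM;
	m->buf = new_buf;
	m->size <<= 1;
	return -EAGAIN;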

> @@ -192,7 +218,7 @@ ssize_t seq_read(struct file *file, char __user *buf, size_t size, loff_t *ppos)
>  
>  	/* grab buffer if we didn't have one */
>  	if (!m->buf) {
> -		m->buf = kmalloc(m->size = PAGE_SIZE, GFP_KERNEL);
> +		seq_alloc(m);
>  		if (!m->buf)
>  			goto Enomem;
>  	}
> @@ -232,9 +258,8 @@ ssize_t seq_read(struct file *file, char __user *buf, size_t size, loff_t *ppos)
>  		if (m->count < m->size)
>  			goto Fill;
>  		m->op->stop(m, p);
> -		kfree(m->buf);
>  		m->count = 0;
> -		m->buf = kmalloc(m->size <<= 1, GFP_KERNEL);
> +		seq_realloc(m);
>  		if (!m->buf)
>  			goto Enomem;
>  		m->version = 0;
> @@ -350,7 +375,7 @@ EXPORT_SYMBOL(seq_lseek);
>  int seq_release(struct inode *inode, struct file *file)
>  {
>  	struct seq_file *m = file->private_data;
> -	kfree(m->buf);
> +	seq_free(m);
>  	kfree(m);
>  	return 0;
>  }
> @@ -605,8 +630,12 @@ EXPORT_SYMBOL(single_open);
>  int single_open_size(struct file *file, int (*show)(struct seq_file *, void *),
>  		void *data, size_t size)
>  {
> -	char *buf = kmalloc(size, GFP_KERNEL);
> +	char *buf;
>  	int ret;
> +
> +	buf = kmalloc(size, GFP_KERNEL | __GFP_NOWARN);
> +	if (!buf)
> +		buf = vmalloc(size);
>  	if (!buf)
>  		return -ENOMEM;
>  	ret = single_open(file, show, data);

^ permalink raw reply	[flat|nested] 21+ messages in thread

* Re: [PATCH 2/2] fs: proc/stat: use usual seq_file ops rather than single_open
  2014-06-11 22:29             ` David Rientjes
@ 2014-06-12  6:24               ` Ian Kent
  2014-06-12  6:52                 ` David Rientjes
  0 siblings, 1 reply; 21+ messages in thread
From: Ian Kent @ 2014-06-12  6:24 UTC (permalink / raw)
  To: David Rientjes
  Cc: Heiko Carstens, Andrew Morton, Christoph Hellwig,
	KAMEZAWA Hiroyuki, Andrea Righi, Eric Dumazet, linux-kernel,
	linux-fsdevel, Hendrik Brueckner, Thorsten Diehl, Elliott,
	Robert (Server Storage),
	Al Viro

On Wed, 2014-06-11 at 15:29 -0700, David Rientjes wrote:
> On Wed, 11 Jun 2014, Heiko Carstens wrote:
> 
> > Alternatively we could also change the seq_file code to fall back to
> > vmalloc allocations. That would probably "fix" all single_open() usages
> > where large contiguous memory areas are needed and allocations later
> > fail due to memory fragmentation.
> > Does anybody like that approach (sample patch below)?

Personally, I much prefer this from a risk of regression POV.

> > 
> > ---
> >  fs/seq_file.c | 45 +++++++++++++++++++++++++++++++++++++--------
> >  1 file changed, 37 insertions(+), 8 deletions(-)
> > 
> > diff --git a/fs/seq_file.c b/fs/seq_file.c
> > index 1d641bb108d2..fca78a04c0d1 100644
> > --- a/fs/seq_file.c
> > +++ b/fs/seq_file.c
> > @@ -8,8 +8,10 @@
> >  #include <linux/fs.h>
> >  #include <linux/export.h>
> >  #include <linux/seq_file.h>
> > +#include <linux/vmalloc.h>
> >  #include <linux/slab.h>
> >  #include <linux/cred.h>
> > +#include <linux/mm.h>
> >  
> >  #include <asm/uaccess.h>
> >  #include <asm/page.h>
> > @@ -82,6 +84,31 @@ int seq_open(struct file *file, const struct seq_operations *op)
> >  }
> >  EXPORT_SYMBOL(seq_open);
> >  
> > +static void seq_alloc(struct seq_file *m)
> > +{
> > +	m->size = PAGE_SIZE;
> > +	m->buf = kmalloc(m->size, GFP_KERNEL | __GFP_NOWARN);
> > +	if (!m->buf)
> > +		m->buf = vmalloc(m->size);
> > +}
> > +
> 
> If m->size is unconditionally PAGE_SIZE, then how is vmalloc() going to 
> allocate this if kmalloc() fails?

This is just the initial allocation.
If it runs out of room the allocation size doubles.
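
(E.g. for the ~33k of /proc/stat output estimated earlier in the thread,
a 4k initial buffer would be doubled four times, 4k -> 8k -> 16k -> 32k
-> 64k, with the output regenerated after each overflow, before
everything fits.)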

I think 2*PAGE_SIZE is probably better here since that's closer to what
the original heuristic allocation requested and is likely to avoid
reallocations in most cases.

Larger allocations that fail with kmalloc() on low-spec hardware with
fragmented memory might succeed when vmalloc() is used, since it doesn't
require contiguous memory chunks. But I guess the added pressure on the
page tables might still be a problem; nevertheless it's probably worth
trying before bailing out.

> 
> > +static void seq_free(struct seq_file *m)
> > +{
> > +	if (unlikely(is_vmalloc_addr(m->buf)))
> > +		vfree(m->buf);
> > +	else
> > +		kfree(m->buf);
> > +}
> > +
> > +static void seq_realloc(struct seq_file *m)
> > +{
> > +	seq_free(m);
> > +	m->size <<= 1;
> > +	m->buf = kmalloc(m->size, GFP_KERNEL | __GFP_NOWARN);
> > +	if (!m->buf)
> > +		m->buf = vmalloc(m->size);
> > +}
> > +
> >  static int traverse(struct seq_file *m, loff_t offset)
> >  {
> >  	loff_t pos = 0, index;
> > @@ -96,7 +123,7 @@ static int traverse(struct seq_file *m, loff_t offset)
> >  		return 0;
> >  	}
> >  	if (!m->buf) {
> > -		m->buf = kmalloc(m->size = PAGE_SIZE, GFP_KERNEL);
> > +		seq_alloc(m);
> >  		if (!m->buf)
> >  			return -ENOMEM;
> >  	}
> > @@ -135,9 +162,8 @@ static int traverse(struct seq_file *m, loff_t offset)
> >  
> >  Eoverflow:
> >  	m->op->stop(m, p);
> > -	kfree(m->buf);
> >  	m->count = 0;
> > -	m->buf = kmalloc(m->size <<= 1, GFP_KERNEL);
> > +	seq_realloc(m);
> >  	return !m->buf ? -ENOMEM : -EAGAIN;
> >  }
> >  
> 
> It seems like traverse() could be rewritten to use krealloc() which does a 
> memcpy() to simplify the calling code.

Sure, but that's an improvement that could be considered once the
current problem is addressed.

Even though the struct seq_file fields shouldn't be used outside of the
seq_file routines, we can't know that's the case for all users, and we
don't know if others make assumptions based on the way it works now.

Ian

> 
> > @@ -192,7 +218,7 @@ ssize_t seq_read(struct file *file, char __user *buf, size_t size, loff_t *ppos)
> >  
> >  	/* grab buffer if we didn't have one */
> >  	if (!m->buf) {
> > -		m->buf = kmalloc(m->size = PAGE_SIZE, GFP_KERNEL);
> > +		seq_alloc(m);
> >  		if (!m->buf)
> >  			goto Enomem;
> >  	}
> > @@ -232,9 +258,8 @@ ssize_t seq_read(struct file *file, char __user *buf, size_t size, loff_t *ppos)
> >  		if (m->count < m->size)
> >  			goto Fill;
> >  		m->op->stop(m, p);
> > -		kfree(m->buf);
> >  		m->count = 0;
> > -		m->buf = kmalloc(m->size <<= 1, GFP_KERNEL);
> > +		seq_realloc(m);
> >  		if (!m->buf)
> >  			goto Enomem;
> >  		m->version = 0;
> > @@ -350,7 +375,7 @@ EXPORT_SYMBOL(seq_lseek);
> >  int seq_release(struct inode *inode, struct file *file)
> >  {
> >  	struct seq_file *m = file->private_data;
> > -	kfree(m->buf);
> > +	seq_free(m);
> >  	kfree(m);
> >  	return 0;
> >  }
> > @@ -605,8 +630,12 @@ EXPORT_SYMBOL(single_open);
> >  int single_open_size(struct file *file, int (*show)(struct seq_file *, void *),
> >  		void *data, size_t size)
> >  {
> > -	char *buf = kmalloc(size, GFP_KERNEL);
> > +	char *buf;
> >  	int ret;
> > +
> > +	buf = kmalloc(size, GFP_KERNEL | __GFP_NOWARN);
> > +	if (!buf)
> > +		buf = vmalloc(size);
> >  	if (!buf)
> >  		return -ENOMEM;
> >  	ret = single_open(file, show, data);



^ permalink raw reply	[flat|nested] 21+ messages in thread

* Re: [PATCH 2/2] fs: proc/stat: use usual seq_file ops rather than single_open
  2014-06-12  6:24               ` Ian Kent
@ 2014-06-12  6:52                 ` David Rientjes
  2014-06-12  7:27                   ` Heiko Carstens
  2014-06-12 11:09                   ` Ian Kent
  0 siblings, 2 replies; 21+ messages in thread
From: David Rientjes @ 2014-06-12  6:52 UTC (permalink / raw)
  To: Ian Kent
  Cc: Heiko Carstens, Andrew Morton, Christoph Hellwig,
	KAMEZAWA Hiroyuki, Andrea Righi, Eric Dumazet, linux-kernel,
	linux-fsdevel, Hendrik Brueckner, Thorsten Diehl, Elliott,
	Robert (Server Storage),
	Al Viro

On Thu, 12 Jun 2014, Ian Kent wrote:

> > > diff --git a/fs/seq_file.c b/fs/seq_file.c
> > > index 1d641bb108d2..fca78a04c0d1 100644
> > > --- a/fs/seq_file.c
> > > +++ b/fs/seq_file.c
> > > @@ -8,8 +8,10 @@
> > >  #include <linux/fs.h>
> > >  #include <linux/export.h>
> > >  #include <linux/seq_file.h>
> > > +#include <linux/vmalloc.h>
> > >  #include <linux/slab.h>
> > >  #include <linux/cred.h>
> > > +#include <linux/mm.h>
> > >  
> > >  #include <asm/uaccess.h>
> > >  #include <asm/page.h>
> > > @@ -82,6 +84,31 @@ int seq_open(struct file *file, const struct seq_operations *op)
> > >  }
> > >  EXPORT_SYMBOL(seq_open);
> > >  
> > > +static void seq_alloc(struct seq_file *m)
> > > +{
> > > +	m->size = PAGE_SIZE;
> > > +	m->buf = kmalloc(m->size, GFP_KERNEL | __GFP_NOWARN);
> > > +	if (!m->buf)
> > > +		m->buf = vmalloc(m->size);
> > > +}
> > > +
> > 
> > If m->size is unconditionally PAGE_SIZE, then how is vmalloc() going to 
> > allocate this if kmalloc() fails?
> 
> This is just the initial allocation.
> If it runs out of room the allocation size doubles.
> 
> I think 2*PAGE_SIZE is probably better here since that's closer to what
> the original heuristic allocation requested and is likely to avoid
> reallocations in most cases.
> 
> Larger allocations that fail with kmalloc() on low-spec hardware with
> fragmented memory might succeed when vmalloc() is used, since it doesn't
> require contiguous memory chunks. But I guess the added pressure on the
> page tables might still be a problem; nevertheless it's probably worth
> trying before bailing out.
> 

I'm not quarreling about using vmalloc() for allocations that are
high-order; I'm referring to the rather obvious fact that m->size is set
to PAGE_SIZE unconditionally above, and thus vmalloc() isn't going to help
in the slightest.

^ permalink raw reply	[flat|nested] 21+ messages in thread

* Re: [PATCH 2/2] fs: proc/stat: use usual seq_file ops rather than single_open
  2014-06-12  6:52                 ` David Rientjes
@ 2014-06-12  7:27                   ` Heiko Carstens
  2014-06-12  8:18                     ` Heiko Carstens
  2014-06-12 20:59                     ` David Rientjes
  2014-06-12 11:09                   ` Ian Kent
  1 sibling, 2 replies; 21+ messages in thread
From: Heiko Carstens @ 2014-06-12  7:27 UTC (permalink / raw)
  To: David Rientjes
  Cc: Ian Kent, Andrew Morton, Christoph Hellwig, KAMEZAWA Hiroyuki,
	Andrea Righi, Eric Dumazet, linux-kernel, linux-fsdevel,
	Hendrik Brueckner, Thorsten Diehl, Elliott,
	Robert (Server Storage),
	Al Viro

On Wed, Jun 11, 2014 at 11:52:31PM -0700, David Rientjes wrote:
> On Thu, 12 Jun 2014, Ian Kent wrote:
> > > > +static void seq_alloc(struct seq_file *m)
> > > > +{
> > > > +	m->size = PAGE_SIZE;
> > > > +	m->buf = kmalloc(m->size, GFP_KERNEL | __GFP_NOWARN);
> > > > +	if (!m->buf)
> > > > +		m->buf = vmalloc(m->size);
> > > > +}
> > > > +
> > > 
> > > If m->size is unconditionally PAGE_SIZE, then how is vmalloc() going to 
> > > allocate this if kmalloc() fails?
> > 
> > This is just the initial allocation.
> > If it runs out of room the allocation size doubles.
> > 
> > I think 2*PAGE_SIZE is probably better here since that's closer to what
> > the original heuristic allocation requested and is likely to avoid
> > reallocations in most cases.
> > 
> > Larger allocations that fail with kmalloc() on low-spec hardware with
> > fragmented memory might succeed when vmalloc() is used, since it doesn't
> > require contiguous memory chunks. But I guess the added pressure on the
> > page tables might still be a problem; nevertheless it's probably worth
> > trying before bailing out.
> 
> I'm not quarreling about using vmalloc() for allocations that are
> high-order; I'm referring to the rather obvious fact that m->size is set
> to PAGE_SIZE unconditionally above, and thus vmalloc() isn't going to help
> in the slightest.

Yes, that doesn't make any sense. I threw the patch together in a hurry and
didn't think much about it.
So below is what I think is the simplest conversion to a vmalloc fallback
approach for seq files. However, the question remains whether this is an
acceptable approach at all...

---
 fs/seq_file.c | 38 +++++++++++++++++++++++++++++---------
 1 file changed, 29 insertions(+), 9 deletions(-)

diff --git a/fs/seq_file.c b/fs/seq_file.c
index 1d641bb108d2..b710130c6d6b 100644
--- a/fs/seq_file.c
+++ b/fs/seq_file.c
@@ -8,8 +8,10 @@
 #include <linux/fs.h>
 #include <linux/export.h>
 #include <linux/seq_file.h>
+#include <linux/vmalloc.h>
 #include <linux/slab.h>
 #include <linux/cred.h>
+#include <linux/mm.h>
 
 #include <asm/uaccess.h>
 #include <asm/page.h>
@@ -30,6 +32,24 @@ static void seq_set_overflow(struct seq_file *m)
 	m->count = m->size;
 }
 
+static void *seq_alloc(unsigned long size)
+{
+	void *buf;
+
+	buf = kmalloc(size, GFP_KERNEL | __GFP_NOWARN);
+	if (!buf && size > PAGE_SIZE)
+		buf = vmalloc(size);
+	return buf;
+}
+
+static void seq_free(const void *buf)
+{
+	if (unlikely(is_vmalloc_addr(buf)))
+		vfree(buf);
+	else
+		kfree(buf);
+}
+
 /**
  *	seq_open -	initialize sequential file
  *	@file: file we initialize
@@ -96,7 +116,7 @@ static int traverse(struct seq_file *m, loff_t offset)
 		return 0;
 	}
 	if (!m->buf) {
-		m->buf = kmalloc(m->size = PAGE_SIZE, GFP_KERNEL);
+		m->buf = seq_alloc(m->size = PAGE_SIZE);
 		if (!m->buf)
 			return -ENOMEM;
 	}
@@ -135,9 +155,9 @@ static int traverse(struct seq_file *m, loff_t offset)
 
 Eoverflow:
 	m->op->stop(m, p);
-	kfree(m->buf);
+	seq_free(m->buf);
 	m->count = 0;
-	m->buf = kmalloc(m->size <<= 1, GFP_KERNEL);
+	m->buf = seq_alloc(m->size <<= 1);
 	return !m->buf ? -ENOMEM : -EAGAIN;
 }
 
@@ -192,7 +212,7 @@ ssize_t seq_read(struct file *file, char __user *buf, size_t size, loff_t *ppos)
 
 	/* grab buffer if we didn't have one */
 	if (!m->buf) {
-		m->buf = kmalloc(m->size = PAGE_SIZE, GFP_KERNEL);
+		m->buf = seq_alloc(m->size = PAGE_SIZE);
 		if (!m->buf)
 			goto Enomem;
 	}
@@ -232,9 +252,9 @@ ssize_t seq_read(struct file *file, char __user *buf, size_t size, loff_t *ppos)
 		if (m->count < m->size)
 			goto Fill;
 		m->op->stop(m, p);
-		kfree(m->buf);
+		seq_free(m->buf);
 		m->count = 0;
-		m->buf = kmalloc(m->size <<= 1, GFP_KERNEL);
+		m->buf = seq_alloc(m->size <<= 1);
 		if (!m->buf)
 			goto Enomem;
 		m->version = 0;
@@ -350,7 +370,7 @@ EXPORT_SYMBOL(seq_lseek);
 int seq_release(struct inode *inode, struct file *file)
 {
 	struct seq_file *m = file->private_data;
-	kfree(m->buf);
+	seq_free(m->buf);
 	kfree(m);
 	return 0;
 }
@@ -605,13 +625,13 @@ EXPORT_SYMBOL(single_open);
 int single_open_size(struct file *file, int (*show)(struct seq_file *, void *),
 		void *data, size_t size)
 {
-	char *buf = kmalloc(size, GFP_KERNEL);
+	char *buf = seq_alloc(size);
 	int ret;
 	if (!buf)
 		return -ENOMEM;
 	ret = single_open(file, show, data);
 	if (ret) {
-		kfree(buf);
+		seq_free(buf);
 		return ret;
 	}
 	((struct seq_file *)file->private_data)->buf = buf;
-- 
1.8.5.5


^ permalink raw reply related	[flat|nested] 21+ messages in thread

* Re: [PATCH 2/2] fs: proc/stat: use usual seq_file ops rather than single_open
  2014-06-12  7:27                   ` Heiko Carstens
@ 2014-06-12  8:18                     ` Heiko Carstens
  2014-06-12 20:59                     ` David Rientjes
  1 sibling, 0 replies; 21+ messages in thread
From: Heiko Carstens @ 2014-06-12  8:18 UTC (permalink / raw)
  To: David Rientjes, Ian Kent, Andrew Morton, Christoph Hellwig,
	KAMEZAWA Hiroyuki, Andrea Righi, Eric Dumazet, linux-kernel,
	linux-fsdevel, Hendrik Brueckner, Thorsten Diehl, Elliott,
	Robert (Server Storage),
	Al Viro

On Thu, Jun 12, 2014 at 09:27:41AM +0200, Heiko Carstens wrote:
> On Wed, Jun 11, 2014 at 11:52:31PM -0700, David Rientjes wrote:
> > On Thu, 12 Jun 2014, Ian Kent wrote:
> > > > > +static void seq_alloc(struct seq_file *m)
> > > > > +{
> > > > > +	m->size = PAGE_SIZE;
> > > > > +	m->buf = kmalloc(m->size, GFP_KERNEL | __GFP_NOWARN);
> > > > > +	if (!m->buf)
> > > > > +		m->buf = vmalloc(m->size);
> > > > > +}
> > > > > +
> > > > 
> > > > If m->size is unconditionally PAGE_SIZE, then how is vmalloc() going to 
> > > > allocate this if kmalloc() fails?
> > > 
> > > This is just the initial allocation.
> > > If it runs out of room the allocation size doubles.
> > > 
> > > I think 2*PAGE_SIZE is probably better here since that's closer to what
> > > the original heuristic allocation requested and is likely to avoid
> > > reallocations in most cases.
> > > 
> > > Larger allocations that fail with kmalloc() on low-spec hardware with
> > > fragmented memory might succeed when vmalloc() is used, since it doesn't
> > > require contiguous memory chunks. But I guess the added pressure on the
> > > page tables might still be a problem; nevertheless it's probably worth
> > > trying before bailing out.
> > 
> > I'm not quarreling about using vmalloc() for allocations that are
> > high-order; I'm referring to the rather obvious fact that m->size is set
> > to PAGE_SIZE unconditionally above, and thus vmalloc() isn't going to help
> > in the slightest.
> 
> Yes, that doesn't make any sense. I threw the patch together in a hurry and
> didn't think much about it.
> So below is what I think is the simplest conversion to a vmalloc fallback
> approach for seq files. However, the question remains whether this is an
> acceptable approach at all...

And for the vmalloc fallback to have any effect on the /proc/stat allocation,
it also needs to be converted to use single_open_size():

From: Heiko Carstens <heiko.carstens@de.ibm.com>
Date: Thu, 12 Jun 2014 10:04:39 +0200
Subject: [PATCH] proc/stat: convert to single_open_size()

Use seq_file's single_open_size() to preallocate a buffer that is large
enough to hold the whole output, instead of open-coding it.
Also calculate the requested size using the number of online cpus instead
of possible cpus, since the size of the output only depends on the number
of online cpus.

Signed-off-by: Heiko Carstens <heiko.carstens@de.ibm.com>
---
 fs/proc/stat.c | 22 ++--------------------
 1 file changed, 2 insertions(+), 20 deletions(-)

diff --git a/fs/proc/stat.c b/fs/proc/stat.c
index 9d231e9e5f0e..bf2d03f8fd3e 100644
--- a/fs/proc/stat.c
+++ b/fs/proc/stat.c
@@ -184,29 +184,11 @@ static int show_stat(struct seq_file *p, void *v)
 
 static int stat_open(struct inode *inode, struct file *file)
 {
-	size_t size = 1024 + 128 * num_possible_cpus();
-	char *buf;
-	struct seq_file *m;
-	int res;
+	size_t size = 1024 + 128 * num_online_cpus();
 
 	/* minimum size to display an interrupt count : 2 bytes */
 	size += 2 * nr_irqs;
-
-	/* don't ask for more than the kmalloc() max size */
-	if (size > KMALLOC_MAX_SIZE)
-		size = KMALLOC_MAX_SIZE;
-	buf = kmalloc(size, GFP_KERNEL);
-	if (!buf)
-		return -ENOMEM;
-
-	res = single_open(file, show_stat, NULL);
-	if (!res) {
-		m = file->private_data;
-		m->buf = buf;
-		m->size = ksize(buf);
-	} else
-		kfree(buf);
-	return res;
+	return single_open_size(file, show_stat, NULL, size);
 }
 
 static const struct file_operations proc_stat_operations = {
-- 
1.8.5.5


^ permalink raw reply related	[flat|nested] 21+ messages in thread

* Re: [PATCH 2/2] fs: proc/stat: use usual seq_file ops rather than single_open
  2014-06-12  6:52                 ` David Rientjes
  2014-06-12  7:27                   ` Heiko Carstens
@ 2014-06-12 11:09                   ` Ian Kent
  1 sibling, 0 replies; 21+ messages in thread
From: Ian Kent @ 2014-06-12 11:09 UTC (permalink / raw)
  To: David Rientjes
  Cc: Heiko Carstens, Andrew Morton, Christoph Hellwig,
	KAMEZAWA Hiroyuki, Andrea Righi, Eric Dumazet, linux-kernel,
	linux-fsdevel, Hendrik Brueckner, Thorsten Diehl, Elliott,
	Robert (Server Storage),
	Al Viro

On Wed, 2014-06-11 at 23:52 -0700, David Rientjes wrote:
> On Thu, 12 Jun 2014, Ian Kent wrote:
> 
> > > > diff --git a/fs/seq_file.c b/fs/seq_file.c
> > > > index 1d641bb108d2..fca78a04c0d1 100644
> > > > --- a/fs/seq_file.c
> > > > +++ b/fs/seq_file.c
> > > > @@ -8,8 +8,10 @@
> > > >  #include <linux/fs.h>
> > > >  #include <linux/export.h>
> > > >  #include <linux/seq_file.h>
> > > > +#include <linux/vmalloc.h>
> > > >  #include <linux/slab.h>
> > > >  #include <linux/cred.h>
> > > > +#include <linux/mm.h>
> > > >  
> > > >  #include <asm/uaccess.h>
> > > >  #include <asm/page.h>
> > > > @@ -82,6 +84,31 @@ int seq_open(struct file *file, const struct seq_operations *op)
> > > >  }
> > > >  EXPORT_SYMBOL(seq_open);
> > > >  
> > > > +static void seq_alloc(struct seq_file *m)
> > > > +{
> > > > +	m->size = PAGE_SIZE;
> > > > +	m->buf = kmalloc(m->size, GFP_KERNEL | __GFP_NOWARN);
> > > > +	if (!m->buf)
> > > > +		m->buf = vmalloc(m->size);
> > > > +}
> > > > +
> > > 
> > > If m->size is unconditionally PAGE_SIZE, then how is vmalloc() going to 
> > > allocate this if kmalloc() fails?
> > 
> > This is just the initial allocation.
> > If it runs out of room the allocation size doubles.
> > 
> > I think 2*PAGE_SIZE is probably better here since that's closer to what
> > the original heuristic allocation requested and is likely to avoid
> > reallocations in most cases.
> > 
> > Larger allocations that fail with kmalloc() on low-spec hardware with
> > fragmented memory might succeed when vmalloc() is used, since it doesn't
> > require contiguous memory chunks. But I guess the added pressure on the
> > page tables might still be a problem; nevertheless it's probably worth
> > trying before bailing out.
> > 
> 
> I'm not quarreling about using vmalloc() for allocations that are
> high-order; I'm referring to the rather obvious fact that m->size is set
> to PAGE_SIZE unconditionally above, and thus vmalloc() isn't going to help
> in the slightest.

LOL, yeah, if kmalloc() can't allocate a single page then we're in much
bigger trouble!

Ian



^ permalink raw reply	[flat|nested] 21+ messages in thread

* Re: [PATCH 2/2] fs: proc/stat: use usual seq_file ops rather than single_open
  2014-06-12  7:27                   ` Heiko Carstens
  2014-06-12  8:18                     ` Heiko Carstens
@ 2014-06-12 20:59                     ` David Rientjes
  1 sibling, 0 replies; 21+ messages in thread
From: David Rientjes @ 2014-06-12 20:59 UTC (permalink / raw)
  To: Heiko Carstens
  Cc: Ian Kent, Andrew Morton, Christoph Hellwig, KAMEZAWA Hiroyuki,
	Andrea Righi, Eric Dumazet, linux-kernel, linux-fsdevel,
	Hendrik Brueckner, Thorsten Diehl, Elliott,
	Robert (Server Storage),
	Al Viro

On Thu, 12 Jun 2014, Heiko Carstens wrote:

> Yes, that doesn't make any sense. I threw the patch together in a hurry and
> didn't think much about it.
> So below is what I think is the simplest conversion to a vmalloc fallback
> approach for seq files. However, the question remains whether this is an
> acceptable approach at all...
> 

I think the approach is fine.  We do now have a generic kvfree() function
defined in mm.h that will handle the freeing, though, determined by
is_vmalloc_addr(), so seq_free() is no longer necessary.
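
For illustration, with that suggestion the earlier patch's freeing side
collapses to a one-liner (a sketch only; the seq_alloc() side with its
kmalloc-then-vmalloc fallback stays as posted):

	/* in traverse(), seq_read() and seq_release(): */
	kvfree(m->buf);

kvfree() internally does the is_vmalloc_addr() check and calls vfree()
or kfree() accordingly.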

^ permalink raw reply	[flat|nested] 21+ messages in thread

end of thread, other threads:[~2014-06-12 20:59 UTC | newest]

Thread overview: 21+ messages (download: mbox.gz / follow: Atom feed)
-- links below jump to the message on this page --
2014-05-21 12:25 /proc/stat vs. failed order-4 allocation Heiko Carstens
2014-05-21 14:32 ` Christoph Hellwig
2014-05-22  3:05   ` Elliott, Robert (Server Storage)
2014-05-28  8:58   ` Heiko Carstens
2014-05-28  8:59     ` [PATCH 1/2] fs: proc/stat: use num_online_cpus() for buffer size Heiko Carstens
2014-05-28 11:06       ` Ian Kent
2014-05-28 11:14         ` Ian Kent
2014-05-28  9:01     ` [PATCH 2/2] fs: proc/stat: use usual seq_file ops rather than single_open Heiko Carstens
2014-05-28 22:37       ` Andrew Morton
2014-05-30  8:38         ` Heiko Carstens
2014-05-30 11:36           ` [PATCH] fs: proc/stat: use seq_file iterator interface Heiko Carstens
2014-06-09  8:11         ` [PATCH 2/2] fs: proc/stat: use usual seq_file ops rather than single_open Ian Kent
2014-06-11 12:43           ` Heiko Carstens
2014-06-11 22:29             ` David Rientjes
2014-06-12  6:24               ` Ian Kent
2014-06-12  6:52                 ` David Rientjes
2014-06-12  7:27                   ` Heiko Carstens
2014-06-12  8:18                     ` Heiko Carstens
2014-06-12 20:59                     ` David Rientjes
2014-06-12 11:09                   ` Ian Kent
2014-05-22 11:29 ` /proc/stat vs. failed order-4 allocation Ian Kent
