linux-kernel.vger.kernel.org archive mirror
* Large stack usage in fs code (especially for PPC64)
@ 2008-11-17 20:34 Steven Rostedt
  2008-11-17 20:59 ` Steven Rostedt
                   ` (4 more replies)
  0 siblings, 5 replies; 35+ messages in thread
From: Steven Rostedt @ 2008-11-17 20:34 UTC (permalink / raw)
  To: LKML
  Cc: Paul Mackerras, Benjamin Herrenschmidt, linuxppc-dev,
	Linus Torvalds, Andrew Morton, Ingo Molnar, Thomas Gleixner


I've been hitting stack overflows on a PPC64 box, so I ran the ftrace 
stack_tracer. Part of the problem with that box is that it can nest 
interrupts too deep. But what also worries me is that there are some heavy 
stack hitters in generic code; namely, the fs directory has some.

Here's the full dump of the stack (PPC64):

root@electra ~> cat /debug/tracing/stack_trace 
        Depth   Size      Location    (56 entries)
        -----   ----      --------
  0)    14032     112   ftrace_call+0x4/0x14
  1)    13920     128   .sched_clock+0x20/0x60
  2)    13792     128   .sched_clock_cpu+0x34/0x50
  3)    13664     144   .cpu_clock+0x3c/0xa0
  4)    13520     144   .get_timestamp+0x2c/0x50
  5)    13376     192   .softlockup_tick+0x100/0x220
  6)    13184     128   .run_local_timers+0x34/0x50
  7)    13056     160   .update_process_times+0x44/0xb0
  8)    12896     176   .tick_sched_timer+0x8c/0x120
  9)    12720     160   .__run_hrtimer+0xd8/0x130
 10)    12560     240   .hrtimer_interrupt+0x16c/0x220
 11)    12320     160   .timer_interrupt+0xcc/0x110
 12)    12160     240   decrementer_common+0xe0/0x100
 13)    11920     576   0x80
 14)    11344     160   .usb_hcd_irq+0x94/0x150
 15)    11184     160   .handle_IRQ_event+0x80/0x120
 16)    11024     160   .handle_fasteoi_irq+0xd8/0x1e0
 17)    10864     160   .do_IRQ+0xbc/0x150
 18)    10704     144   hardware_interrupt_entry+0x1c/0x3c
 19)    10560     672   0x0
 20)     9888     144   ._spin_unlock_irqrestore+0x84/0xd0
 21)     9744     160   .scsi_dispatch_cmd+0x170/0x360
 22)     9584     208   .scsi_request_fn+0x324/0x5e0
 23)     9376     144   .blk_invoke_request_fn+0xc8/0x1b0
 24)     9232     144   .__blk_run_queue+0x48/0x60
 25)     9088     144   .blk_run_queue+0x40/0x70
 26)     8944     192   .scsi_run_queue+0x3a8/0x3e0
 27)     8752     160   .scsi_next_command+0x58/0x90
 28)     8592     176   .scsi_end_request+0xd4/0x130
 29)     8416     208   .scsi_io_completion+0x15c/0x500
 30)     8208     160   .scsi_finish_command+0x15c/0x190
 31)     8048     160   .scsi_softirq_done+0x138/0x1e0
 32)     7888     160   .blk_done_softirq+0xd0/0x100
 33)     7728     192   .__do_softirq+0xe8/0x1e0
 34)     7536     144   .do_softirq+0xa4/0xd0
 35)     7392     144   .irq_exit+0xb4/0xf0
 36)     7248     160   .do_IRQ+0x114/0x150
 37)     7088     752   hardware_interrupt_entry+0x1c/0x3c
 38)     6336     144   .blk_rq_init+0x28/0xc0
 39)     6192     208   .get_request+0x13c/0x3d0
 40)     5984     240   .get_request_wait+0x60/0x170
 41)     5744     192   .__make_request+0xd4/0x560
 42)     5552     192   .generic_make_request+0x210/0x300
 43)     5360     208   .submit_bio+0x168/0x1a0
 44)     5152     160   .submit_bh+0x188/0x1e0
 45)     4992    1280   .block_read_full_page+0x23c/0x430
 46)     3712    1280   .do_mpage_readpage+0x43c/0x740
 47)     2432     352   .mpage_readpages+0x130/0x1c0
 48)     2080     160   .ext3_readpages+0x50/0x80
 49)     1920     256   .__do_page_cache_readahead+0x1e4/0x340
 50)     1664     160   .do_page_cache_readahead+0x94/0xe0
 51)     1504     240   .filemap_fault+0x360/0x530
 52)     1264     256   .__do_fault+0xb8/0x600
 53)     1008     240   .handle_mm_fault+0x190/0x920
 54)      768     320   .do_page_fault+0x3d4/0x5f0
 55)      448     448   handle_page_fault+0x20/0x5c


Notice at lines 45 and 46 the stack usage of block_read_full_page and 
do_mpage_readpage. They each use 1280 bytes of stack! Looking at the start 
of these two:

int block_read_full_page(struct page *page, get_block_t *get_block)
{
	struct inode *inode = page->mapping->host;
	sector_t iblock, lblock;
	struct buffer_head *bh, *head, *arr[MAX_BUF_PER_PAGE];
	unsigned int blocksize;
	int nr, i;
	int fully_mapped = 1;
[...]

static struct bio *
do_mpage_readpage(struct bio *bio, struct page *page, unsigned nr_pages,
		sector_t *last_block_in_bio, struct buffer_head *map_bh,
		unsigned long *first_logical_block, get_block_t get_block)
{
	struct inode *inode = page->mapping->host;
	const unsigned blkbits = inode->i_blkbits;
	const unsigned blocks_per_page = PAGE_CACHE_SIZE >> blkbits;
	const unsigned blocksize = 1 << blkbits;
	sector_t block_in_file;
	sector_t last_block;
	sector_t last_block_in_file;
	sector_t blocks[MAX_BUF_PER_PAGE];
	unsigned page_block;
	unsigned first_hole = blocks_per_page;
	struct block_device *bdev = NULL;
	int length;
	int fully_mapped = 1;
	unsigned nblocks;
	unsigned relative_block;


The thing that hits my eye on both is the MAX_BUF_PER_PAGE usage. That is 
defined as: 

#define MAX_BUF_PER_PAGE (PAGE_CACHE_SIZE / 512)

Where PAGE_CACHE_SIZE is the same as PAGE_SIZE.

On PPC64 I'm told that the page size is 64K, which makes the above equal 
to: 64K / 512 = 128; multiply that by 8-byte words and we have 1024 bytes.
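
To make that concrete, here's a quick userspace sketch (plain C, not
kernel code; the only assumption is that sector_t and a pointer are both
8 bytes on a 64-bit kernel):

#include <stdio.h>

/* mirrors the kernel's MAX_BUF_PER_PAGE for a given page size */
#define MAX_BUF(page_size)	((page_size) / 512)

int main(void)
{
	unsigned long sizes[] = { 4096, 16384, 65536 };

	for (int i = 0; i < 3; i++) {
		unsigned long n = MAX_BUF(sizes[i]);
		printf("%2luK pages: %3lu entries -> %4lu bytes per on-stack array\n",
		       sizes[i] >> 10, n, n * 8);
	}
	return 0;
}

That prints 64 bytes for 4K pages but 1024 bytes for 64K pages, and both
functions in the trace above keep one such array on the stack.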

The problem with PPC64 is that the stack size is not equal to the page 
size. The stack size is only 16K, not 64K.

The above stack trace was taken right after boot up and it was already at 
14K, not too far from the 16K limit.

Note, I was using a default config that had CONFIG_IRQSTACKS off and
CONFIG_PPC_64K_PAGES on.

-- Steve




* Re: Large stack usage in fs code (especially for PPC64)
  2008-11-17 20:34 Large stack usage in fs code (especially for PPC64) Steven Rostedt
@ 2008-11-17 20:59 ` Steven Rostedt
  2008-11-17 21:18   ` Linus Torvalds
                     ` (3 more replies)
  2008-11-17 21:08 ` Andrew Morton
                   ` (3 subsequent siblings)
  4 siblings, 4 replies; 35+ messages in thread
From: Steven Rostedt @ 2008-11-17 20:59 UTC (permalink / raw)
  To: LKML
  Cc: Paul Mackerras, Benjamin Herrenschmidt, linuxppc-dev,
	Linus Torvalds, Andrew Morton, Ingo Molnar, Thomas Gleixner


On Mon, 17 Nov 2008, Steven Rostedt wrote:
> 
> Note, I was using a default config that had CONFIG_IRQSTACKS off and
> CONFIG_PPC_64K_PAGES on.

Here's my stack after boot up with CONFIG_IRQSTACKS set. Seems that 
softirqs still use the same stack as the process.

root@electra ~> cat /debug/tracing/stack_trace 
        Depth   Size      Location    (59 entries)
        -----   ----      --------
  0)    12384     192   ftrace_call+0x4/0x14
  1)    12192     128   .sched_clock+0x20/0x60
  2)    12064     128   .sched_clock_cpu+0x34/0x50
  3)    11936     144   .cpu_clock+0x3c/0xa0
  4)    11792     144   .get_timestamp+0x2c/0x50
  5)    11648     144   .__touch_softlockup_watchdog+0x3c/0x60
  6)    11504     192   .softlockup_tick+0xe4/0x220
  7)    11312     128   .run_local_timers+0x34/0x50
  8)    11184     160   .update_process_times+0x44/0xb0
  9)    11024     176   .tick_sched_timer+0x8c/0x120
 10)    10848     160   .__run_hrtimer+0xd8/0x130
 11)    10688     240   .hrtimer_interrupt+0x16c/0x220
 12)    10448     160   .timer_interrupt+0xcc/0x110
 13)    10288      96   decrementer_common+0xe0/0x100
 14)    10192     784   .ftrace_test_stop_func+0x4c/0x70
 15)     9408     112   ftrace_call+0x4/0x14
 16)     9296     144   .init_buffer_head+0x20/0x60
 17)     9152     256   .cache_alloc_refill+0x5a0/0x740
 18)     8896     160   .kmem_cache_alloc+0xfc/0x140
 19)     8736     144   .alloc_buffer_head+0x3c/0xc0
 20)     8592     176   .alloc_page_buffers+0xb0/0x130
 21)     8416     160   .create_empty_buffers+0x44/0x1a0
 22)     8256    1280   .block_read_full_page+0x3b0/0x430
 23)     6976     144   .blkdev_readpage+0x38/0x60
 24)     6832     176   .read_cache_page_async+0x124/0x230
 25)     6656     160   .read_cache_page+0x50/0xe0
 26)     6496     160   .read_dev_sector+0x58/0x100
 27)     6336     272   .msdos_partition+0xa4/0x880
 28)     6064     240   .rescan_partitions+0x1f0/0x430
 29)     5824     208   .__blkdev_get+0x124/0x3d0
 30)     5616     144   .blkdev_get+0x3c/0x60
 31)     5472     192   .register_disk+0x194/0x1f0
 32)     5280     160   .add_disk+0xe8/0x160
 33)     5120     224   .sd_probe+0x2f0/0x440
 34)     4896     176   .driver_probe_device+0x14c/0x380
 35)     4720     144   .__device_attach+0x38/0x60
 36)     4576     192   .bus_for_each_drv+0xb8/0x120
 37)     4384     160   .device_attach+0x94/0xe0
 38)     4224     144   .bus_attach_device+0x78/0xb0
 39)     4080     240   .device_add+0x54c/0x730
 40)     3840     160   .scsi_sysfs_add_sdev+0x7c/0x2e0
 41)     3680     336   .scsi_probe_and_add_lun+0xb7c/0xc60
 42)     3344     192   .__scsi_add_device+0x11c/0x150
 43)     3152     176   .ata_scsi_scan_host+0x238/0x330
 44)     2976     208   .ata_host_register+0x320/0x3c0
 45)     2768     192   .ata_host_activate+0xf8/0x190
 46)     2576     224   .mv_pci_init_one+0x284/0x430
 47)     2352     160   .pci_device_probe+0x98/0xf0
 48)     2192     176   .driver_probe_device+0x14c/0x380
 49)     2016     160   .__driver_attach+0xd4/0x110
 50)     1856     192   .bus_for_each_dev+0xa8/0x100
 51)     1664     144   .driver_attach+0x40/0x60
 52)     1520     192   .bus_add_driver+0x268/0x340
 53)     1328     176   .driver_register+0x88/0x1d0
 54)     1152     176   .__pci_register_driver+0x90/0x100
 55)      976     144   .mv_init+0x38/0xa0
 56)      832     544   .do_one_initcall+0x78/0x240
 57)      288     192   .kernel_init+0x21c/0x2b0
 58)       96      96   .kernel_thread+0x54/0x70

This is still 12K. Kind of big even for a 16K stack.

-- Steve



* Re: Large stack usage in fs code (especially for PPC64)
  2008-11-17 20:34 Large stack usage in fs code (especially for PPC64) Steven Rostedt
  2008-11-17 20:59 ` Steven Rostedt
@ 2008-11-17 21:08 ` Andrew Morton
  2008-11-17 21:23   ` Linus Torvalds
  2008-11-17 21:09 ` Linus Torvalds
                   ` (2 subsequent siblings)
  4 siblings, 1 reply; 35+ messages in thread
From: Andrew Morton @ 2008-11-17 21:08 UTC (permalink / raw)
  To: Steven Rostedt
  Cc: linux-kernel, paulus, benh, linuxppc-dev, torvalds, mingo, tglx,
	linux-mm

On Mon, 17 Nov 2008 15:34:13 -0500 (EST)
Steven Rostedt <rostedt@goodmis.org> wrote:

> 
> I've been hitting stack overflows on a PPC64 box, so I ran the ftrace 
> stack_tracer. Part of the problem with that box is that it can nest 
> interrupts too deep. But what also worries me is that there are some heavy 
> stack hitters in generic code; namely, the fs directory has some.
> 
> Here's the full dump of the stack (PPC64):
>
> ...
>
> do_mpage_readpage. They each use 1280 bytes of stack! Looking at the start 
> of these two:
> 
> int block_read_full_page(struct page *page, get_block_t *get_block)
> {
> 	struct inode *inode = page->mapping->host;
> 	sector_t iblock, lblock;
> 	struct buffer_head *bh, *head, *arr[MAX_BUF_PER_PAGE];
> 	unsigned int blocksize;
> 	int nr, i;
> 	int fully_mapped = 1;
> [...]
> 
> static struct bio *
> do_mpage_readpage(struct bio *bio, struct page *page, unsigned nr_pages,
> 		sector_t *last_block_in_bio, struct buffer_head *map_bh,
> 		unsigned long *first_logical_block, get_block_t get_block)
> {
> 	struct inode *inode = page->mapping->host;
> 	const unsigned blkbits = inode->i_blkbits;
> 	const unsigned blocks_per_page = PAGE_CACHE_SIZE >> blkbits;
> 	const unsigned blocksize = 1 << blkbits;
> 	sector_t block_in_file;
> 	sector_t last_block;
> 	sector_t last_block_in_file;
> 	sector_t blocks[MAX_BUF_PER_PAGE];
> 	unsigned page_block;
> 	unsigned first_hole = blocks_per_page;
> 	struct block_device *bdev = NULL;
> 	int length;
> 	int fully_mapped = 1;
> 	unsigned nblocks;
> 	unsigned relative_block;
> 
> 
> The thing that hits my eye on both is the MAX_BUF_PER_PAGE usage. That is 
> defined as: 
> 
> #define MAX_BUF_PER_PAGE (PAGE_CACHE_SIZE / 512)
> 
> Where PAGE_CACHE_SIZE is the same as PAGE_SIZE.
> 
> On PPC64 I'm told that the page size is 64K, which makes the above equal 
> to: 64K / 512 = 128; multiply that by 8-byte words and we have 1024 bytes.
> 
> The problem with PPC64 is that the stack size is not equal to the page 
> size. The stack size is only 16K, not 64K.
> 
> The above stack trace was taken right after boot up and it was already at 
> 14K, not too far from the 16K limit.
> 
> Note, I was using a default config that had CONFIG_IRQSTACKS off and
> CONFIG_PPC_64K_PAGES on.
> 

Far be it from me to apportion blame, but THIS IS ALL LINUS'S FAULT!!!!! :)

I fixed this six years ago.  See http://lkml.org/lkml/2002/6/17/68

I still have the patch but for some reason it appears to get some
rejects.  However I think the approach (whatever it was ;)) is still
usable.  Perhaps some keen young thing has time to get down and redo
it.




This patch fixes some potential stack consumption problems.

The storage for

	sector_t blocks[MAX_BUF_PER_PAGE];
and
	struct buffer_head *arr[MAX_BUF_PER_PAGE];

will consume a kilobyte with 64k page, 64-bit sector_t.  And on
the path

	do_mpage_readpage
	->block_read_full_page
	  -> <page allocation>
	    ->mpage_writepage

they can be nested three-deep.  3k of stack gone.  Presumably in this
case the stack page would be 64k anyway, so that's not a problem. 
However if PAGE_SIZE=4k and PAGE_CACHE_SIZE=64k, we die.

I've yet to see any reason for larger PAGE_CACHE_SIZE, but this is a
neater implementation anyway.  

The patch removes MAX_BUF_PER_PAGE and instead walks the page's buffer
ring to work out which buffers need to be submitted.
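
The heart of it is a circular walk over the page's buffer ring
(b_this_page links a page's buffers into a ring), keeping the state in
the buffer heads themselves instead of an on-stack array.  A
self-contained userspace model of the idea (the struct below is a toy,
not the kernel's buffer_head):

#include <stdio.h>

struct buffer_head {
	struct buffer_head *b_this_page;	/* circular list, as in the kernel */
	int async_read;				/* stands in for the BH state bit */
	int blocknr;
};

/* second pass: submit every buffer that a first pass marked */
static void submit_marked(struct buffer_head *head)
{
	struct buffer_head *bh = head;

	do {
		if (bh->async_read)
			printf("submit block %d\n", bh->blocknr);
	} while ((bh = bh->b_this_page) != head);
}

int main(void)
{
	struct buffer_head b[4];

	for (int i = 0; i < 4; i++) {
		b[i].b_this_page = &b[(i + 1) % 4];
		b[i].async_read = i % 2;	/* pretend a first pass marked these */
		b[i].blocknr = i;
	}
	submit_marked(&b[0]);
	return 0;
}

However many buffers the page has, the walk costs two pointers of stack.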



--- 2.5.22/fs/mpage.c~stack-space	Sun Jun 16 22:50:17 2002
+++ 2.5.22-akpm/fs/mpage.c	Sun Jun 16 22:50:17 2002
@@ -169,8 +169,9 @@ do_mpage_readpage(struct bio *bio, struc
 	const unsigned blocksize = 1 << blkbits;
 	struct bio_vec *bvec;
 	sector_t block_in_file;
-	sector_t last_block;
-	sector_t blocks[MAX_BUF_PER_PAGE];
+	sector_t last_file_block;
+	sector_t first_page_block = -1;
+	sector_t last_page_block = -1;
 	unsigned page_block;
 	unsigned first_hole = blocks_per_page;
 	struct block_device *bdev = NULL;
@@ -180,12 +181,12 @@ do_mpage_readpage(struct bio *bio, struc
 		goto confused;
 
 	block_in_file = page->index << (PAGE_CACHE_SHIFT - blkbits);
-	last_block = (inode->i_size + blocksize - 1) >> blkbits;
+	last_file_block = (inode->i_size + blocksize - 1) >> blkbits;
 
 	for (page_block = 0; page_block < blocks_per_page;
 				page_block++, block_in_file++) {
 		bh.b_state = 0;
-		if (block_in_file < last_block) {
+		if (block_in_file < last_file_block) {
 			if (get_block(inode, block_in_file, &bh, 0))
 				goto confused;
 		}
@@ -199,10 +200,14 @@ do_mpage_readpage(struct bio *bio, struc
 		if (first_hole != blocks_per_page)
 			goto confused;		/* hole -> non-hole */
 
-		/* Contiguous blocks? */
-		if (page_block && blocks[page_block-1] != bh.b_blocknr-1)
-			goto confused;
-		blocks[page_block] = bh.b_blocknr;
+		if (page_block) {
+			/* Contiguous blocks? */
+			if (bh.b_blocknr != last_page_block + 1)
+				goto confused;
+		} else {
+			first_page_block = bh.b_blocknr;
+		}
+		last_page_block = bh.b_blocknr;
 		bdev = bh.b_bdev;
 	}
 
@@ -222,7 +227,7 @@ do_mpage_readpage(struct bio *bio, struc
 	 * This page will go to BIO.  Do we need to send this BIO off first?
 	 */
 	if (bio && (bio->bi_idx == bio->bi_vcnt ||
-			*last_block_in_bio != blocks[0] - 1))
+			*last_block_in_bio != first_page_block - 1))
 		bio = mpage_bio_submit(READ, bio);
 
 	if (bio == NULL) {
@@ -230,7 +235,7 @@ do_mpage_readpage(struct bio *bio, struc
 
 		if (nr_bvecs > nr_pages)
 			nr_bvecs = nr_pages;
-		bio = mpage_alloc(bdev, blocks[0] << (blkbits - 9),
+		bio = mpage_alloc(bdev, first_page_block << (blkbits - 9),
 					nr_bvecs, GFP_KERNEL);
 		if (bio == NULL)
 			goto confused;
@@ -244,7 +249,7 @@ do_mpage_readpage(struct bio *bio, struc
 	if (buffer_boundary(&bh) || (first_hole != blocks_per_page))
 		bio = mpage_bio_submit(READ, bio);
 	else
-		*last_block_in_bio = blocks[blocks_per_page - 1];
+		*last_block_in_bio = last_page_block;
 out:
 	return bio;
 
@@ -322,9 +327,10 @@ mpage_writepage(struct bio *bio, struct 
 	unsigned long end_index;
 	const unsigned blocks_per_page = PAGE_CACHE_SIZE >> blkbits;
 	struct bio_vec *bvec;
-	sector_t last_block;
+	sector_t last_file_block;
 	sector_t block_in_file;
-	sector_t blocks[MAX_BUF_PER_PAGE];
+	sector_t first_page_block = -1;
+	sector_t last_page_block = -1;
 	unsigned page_block;
 	unsigned first_unmapped = blocks_per_page;
 	struct block_device *bdev = NULL;
@@ -355,11 +361,13 @@ mpage_writepage(struct bio *bio, struct 
 
 			if (!buffer_dirty(bh) || !buffer_uptodate(bh))
 				goto confused;
-			if (page_block) {
-				if (bh->b_blocknr != blocks[page_block-1] + 1)
+			if (page_block++) {
+				if (bh->b_blocknr != last_page_block + 1)
 					goto confused;
+			} else {
+				first_page_block = bh->b_blocknr;
 			}
-			blocks[page_block++] = bh->b_blocknr;
+			last_page_block = bh->b_blocknr;
 			boundary = buffer_boundary(bh);
 			bdev = bh->b_bdev;
 		} while ((bh = bh->b_this_page) != head);
@@ -381,7 +389,7 @@ mpage_writepage(struct bio *bio, struct 
 	 */
 	BUG_ON(!PageUptodate(page));
 	block_in_file = page->index << (PAGE_CACHE_SHIFT - blkbits);
-	last_block = (inode->i_size - 1) >> blkbits;
+	last_file_block = (inode->i_size - 1) >> blkbits;
 	for (page_block = 0; page_block < blocks_per_page; ) {
 		struct buffer_head map_bh;
 
@@ -392,13 +400,16 @@ mpage_writepage(struct bio *bio, struct 
 			unmap_underlying_metadata(map_bh.b_bdev,
 						map_bh.b_blocknr);
 		if (page_block) {
-			if (map_bh.b_blocknr != blocks[page_block-1] + 1)
+			if (map_bh.b_blocknr != last_page_block + 1)
 				goto confused;
+		} else {
+			first_page_block = map_bh.b_blocknr;
 		}
-		blocks[page_block++] = map_bh.b_blocknr;
+		page_block++;
+		last_page_block = map_bh.b_blocknr;
 		boundary = buffer_boundary(&map_bh);
 		bdev = map_bh.b_bdev;
-		if (block_in_file == last_block)
+		if (block_in_file == last_file_block)
 			break;
 		block_in_file++;
 	}
@@ -424,13 +435,13 @@ page_is_mapped:
 	 * This page will go to BIO.  Do we need to send this BIO off first?
 	 */
 	if (bio && (bio->bi_idx == bio->bi_vcnt ||
-				*last_block_in_bio != blocks[0] - 1))
+				*last_block_in_bio != first_page_block - 1))
 		bio = mpage_bio_submit(WRITE, bio);
 
 	if (bio == NULL) {
 		unsigned nr_bvecs = MPAGE_BIO_MAX_SIZE / PAGE_CACHE_SIZE;
 
-		bio = mpage_alloc(bdev, blocks[0] << (blkbits - 9),
+		bio = mpage_alloc(bdev, first_page_block << (blkbits - 9),
 					nr_bvecs, GFP_NOFS);
 		if (bio == NULL)
 			goto confused;
@@ -464,7 +475,7 @@ page_is_mapped:
 	if (boundary || (first_unmapped != blocks_per_page))
 		bio = mpage_bio_submit(WRITE, bio);
 	else
-		*last_block_in_bio = blocks[blocks_per_page - 1];
+		*last_block_in_bio = last_page_block;
 	goto out;
 
 confused:
--- 2.5.22/fs/buffer.c~stack-space	Sun Jun 16 22:50:17 2002
+++ 2.5.22-akpm/fs/buffer.c	Sun Jun 16 23:22:48 2002
@@ -1799,7 +1799,7 @@ int block_read_full_page(struct page *pa
 {
 	struct inode *inode = page->mapping->host;
 	unsigned long iblock, lblock;
-	struct buffer_head *bh, *head, *arr[MAX_BUF_PER_PAGE];
+	struct buffer_head *bh, *head;
 	unsigned int blocksize, blocks;
 	int nr, i;
 
@@ -1842,7 +1842,7 @@ int block_read_full_page(struct page *pa
 			if (buffer_uptodate(bh))
 				continue;
 		}
-		arr[nr++] = bh;
+		nr++;
 	} while (i++, iblock++, (bh = bh->b_this_page) != head);
 
 	if (!nr) {
@@ -1857,24 +1857,26 @@ int block_read_full_page(struct page *pa
 	}
 
 	/* Stage two: lock the buffers */
-	for (i = 0; i < nr; i++) {
-		bh = arr[i];
-		lock_buffer(bh);
-		mark_buffer_async_read(bh);
-	}
+	do {
+		if (!buffer_uptodate(bh)) {
+			lock_buffer(bh);
+			mark_buffer_async_read(bh);
+		}
+	} while ((bh = bh->b_this_page) != head);
 
 	/*
 	 * Stage 3: start the IO.  Check for uptodateness
 	 * inside the buffer lock in case another process reading
 	 * the underlying blockdev brought it uptodate (the sct fix).
 	 */
-	for (i = 0; i < nr; i++) {
-		bh = arr[i];
-		if (buffer_uptodate(bh))
-			end_buffer_async_read(bh, 1);
-		else
-			submit_bh(READ, bh);
-	}
+	do {
+		if (buffer_async_read(bh)) {
+			if (buffer_uptodate(bh))
+				end_buffer_async_read(bh, 1);
+			else
+				submit_bh(READ, bh);
+		}
+	} while ((bh = bh->b_this_page) != head);
 	return 0;
 }
 
@@ -2490,8 +2492,8 @@ static void bh_mempool_free(void *elemen
 	return kmem_cache_free(bh_cachep, element);
 }
 
-#define NR_RESERVED (10*MAX_BUF_PER_PAGE)
-#define MAX_UNUSED_BUFFERS NR_RESERVED+20
+
+#define MEMPOOL_BUFFERS (32 * PAGE_CACHE_SIZE / 512)
 
 void __init buffer_init(void)
 {
@@ -2500,7 +2502,7 @@ void __init buffer_init(void)
 	bh_cachep = kmem_cache_create("buffer_head",
 			sizeof(struct buffer_head), 0,
 			SLAB_HWCACHE_ALIGN, init_buffer_head, NULL);
-	bh_mempool = mempool_create(MAX_UNUSED_BUFFERS, bh_mempool_alloc,
+	bh_mempool = mempool_create(MEMPOOL_BUFFERS, bh_mempool_alloc,
 				bh_mempool_free, NULL);
 	for (i = 0; i < ARRAY_SIZE(bh_wait_queue_heads); i++)
 		init_waitqueue_head(&bh_wait_queue_heads[i].wqh);
--- 2.5.22/fs/block_dev.c~stack-space	Sun Jun 16 22:50:17 2002
+++ 2.5.22-akpm/fs/block_dev.c	Sun Jun 16 22:50:17 2002
@@ -23,8 +23,6 @@
 
 #include <asm/uaccess.h>
 
-#define MAX_BUF_PER_PAGE (PAGE_CACHE_SIZE / 512)
-
 static unsigned long max_block(struct block_device *bdev)
 {
 	unsigned int retval = ~0U;
--- 2.5.22/include/linux/buffer_head.h~stack-space	Sun Jun 16 22:50:17 2002
+++ 2.5.22-akpm/include/linux/buffer_head.h	Sun Jun 16 23:22:47 2002
@@ -29,8 +29,6 @@ enum bh_state_bits {
 			 */
 };
 
-#define MAX_BUF_PER_PAGE (PAGE_CACHE_SIZE / 512)
-
 struct page;
 struct kiobuf;
 struct buffer_head;
--- 2.5.22/fs/ntfs/aops.c~stack-space	Sun Jun 16 22:50:17 2002
+++ 2.5.22-akpm/fs/ntfs/aops.c	Sun Jun 16 23:22:46 2002
@@ -104,7 +104,7 @@ static int ntfs_file_read_block(struct p
 	LCN lcn;
 	ntfs_inode *ni;
 	ntfs_volume *vol;
-	struct buffer_head *bh, *head, *arr[MAX_BUF_PER_PAGE];
+	struct buffer_head *bh, *head;
 	sector_t iblock, lblock, zblock;
 	unsigned int blocksize, blocks, vcn_ofs;
 	int i, nr;
@@ -116,11 +116,12 @@ static int ntfs_file_read_block(struct p
 	blocksize_bits = VFS_I(ni)->i_blkbits;
 	blocksize = 1 << blocksize_bits;
 
-	if (!page_has_buffers(page))
+	if (!page_has_buffers(page)) {
 		create_empty_buffers(page, blocksize, 0);
+		if (!page_has_buffers(page))	/* This can't happen */
+			return -ENOMEM;
+	}
 	bh = head = page_buffers(page);
-	if (!bh)
-		return -ENOMEM;
 
 	blocks = PAGE_CACHE_SIZE >> blocksize_bits;
 	iblock = page->index << (PAGE_CACHE_SHIFT - blocksize_bits);
@@ -138,10 +139,12 @@ static int ntfs_file_read_block(struct p
 	/* Loop through all the buffers in the page. */
 	nr = i = 0;
 	do {
+		BUG_ON(buffer_async_read(bh));
 		if (unlikely(buffer_uptodate(bh)))
 			continue;
 		if (unlikely(buffer_mapped(bh))) {
-			arr[nr++] = bh;
+			set_buffer_async_read(bh);
+			nr++;
 			continue;
 		}
 		bh->b_bdev = vol->sb->s_bdev;
@@ -167,7 +170,8 @@ retry_remap:
 				set_buffer_mapped(bh);
 				/* Only read initialized data blocks. */
 				if (iblock < zblock) {
-					arr[nr++] = bh;
+					set_buffer_async_read(bh);
+					nr++;
 					continue;
 				}
 				/* Fully non-initialized data block, zero it. */
@@ -208,15 +212,18 @@ handle_zblock:
 	/* Check we have at least one buffer ready for i/o. */
 	if (nr) {
 		/* Lock the buffers. */
-		for (i = 0; i < nr; i++) {
-			struct buffer_head *tbh = arr[i];
-			lock_buffer(tbh);
-			tbh->b_end_io = end_buffer_read_file_async;
-			set_buffer_async_read(tbh);
-		}
+		do {
+			if (buffer_async_read(bh)) {
+				lock_buffer(bh);
+				bh->b_end_io = end_buffer_read_file_async;
+			}
+		} while ((bh = bh->b_this_page) != head);
+
 		/* Finally, start i/o on the buffers. */
-		for (i = 0; i < nr; i++)
-			submit_bh(READ, arr[i]);
+		do {
+			if (buffer_async_read(bh))
+				submit_bh(READ, bh);
+		} while ((bh = bh->b_this_page) != head);
 		return 0;
 	}
 	/* No i/o was scheduled on any of the buffers. */
@@ -404,7 +411,7 @@ static int ntfs_mftbmp_readpage(ntfs_vol
 {
 	VCN vcn;
 	LCN lcn;
-	struct buffer_head *bh, *head, *arr[MAX_BUF_PER_PAGE];
+	struct buffer_head *bh, *head;
 	sector_t iblock, lblock, zblock;
 	unsigned int blocksize, blocks, vcn_ofs;
 	int nr, i;
@@ -416,11 +423,12 @@ static int ntfs_mftbmp_readpage(ntfs_vol
 	blocksize = vol->sb->s_blocksize;
 	blocksize_bits = vol->sb->s_blocksize_bits;
 
-	if (!page_has_buffers(page))
+	if (!page_has_buffers(page)) {
 		create_empty_buffers(page, blocksize, 0);
+		if (!page_has_buffers(page))	/* This can't happen */
+			return -ENOMEM;
+	}
 	bh = head = page_buffers(page);
-	if (!bh)
-		return -ENOMEM;
 
 	blocks = PAGE_CACHE_SIZE >> blocksize_bits;
 	iblock = page->index << (PAGE_CACHE_SHIFT - blocksize_bits);
@@ -431,10 +439,12 @@ static int ntfs_mftbmp_readpage(ntfs_vol
 	/* Loop through all the buffers in the page. */
 	nr = i = 0;
 	do {
+		BUG_ON(buffer_async_read(bh));
 		if (unlikely(buffer_uptodate(bh)))
 			continue;
 		if (unlikely(buffer_mapped(bh))) {
-			arr[nr++] = bh;
+			set_buffer_async_read(bh);
+			nr++;
 			continue;
 		}
 		bh->b_bdev = vol->sb->s_bdev;
@@ -457,7 +467,8 @@ static int ntfs_mftbmp_readpage(ntfs_vol
 				set_buffer_mapped(bh);
 				/* Only read initialized data blocks. */
 				if (iblock < zblock) {
-					arr[nr++] = bh;
+					set_buffer_async_read(bh);
+					nr++;
 					continue;
 				}
 				/* Fully non-initialized data block, zero it. */
@@ -491,15 +502,18 @@ handle_zblock:
 	/* Check we have at least one buffer ready for i/o. */
 	if (nr) {
 		/* Lock the buffers. */
-		for (i = 0; i < nr; i++) {
-			struct buffer_head *tbh = arr[i];
-			lock_buffer(tbh);
-			tbh->b_end_io = end_buffer_read_mftbmp_async;
-			set_buffer_async_read(tbh);
-		}
+		do {
+			if (buffer_async_read(bh)) {
+				lock_buffer(bh);
+				bh->b_end_io = end_buffer_read_mftbmp_async;
+			}
+		} while ((bh = bh->b_this_page) != head);
+
 		/* Finally, start i/o on the buffers. */
-		for (i = 0; i < nr; i++)
-			submit_bh(READ, arr[i]);
+		do {
+			if (buffer_async_read(bh))
+				submit_bh(READ, bh);
+		} while ((bh = bh->b_this_page) != head);
 		return 0;
 	}
 	/* No i/o was scheduled on any of the buffers. */
@@ -643,7 +657,7 @@ int ntfs_mst_readpage(struct file *dir, 
 	LCN lcn;
 	ntfs_inode *ni;
 	ntfs_volume *vol;
-	struct buffer_head *bh, *head, *arr[MAX_BUF_PER_PAGE];
+	struct buffer_head *bh, *head;
 	sector_t iblock, lblock, zblock;
 	unsigned int blocksize, blocks, vcn_ofs;
 	int i, nr;
@@ -658,11 +672,12 @@ int ntfs_mst_readpage(struct file *dir, 
 	blocksize_bits = VFS_I(ni)->i_blkbits;
 	blocksize = 1 << blocksize_bits;
 
-	if (!page_has_buffers(page))
+	if (!page_has_buffers(page)) {
 		create_empty_buffers(page, blocksize, 0);
+		if (!page_has_buffers(page))	/* This can't happen */
+			return -ENOMEM;
+	}
 	bh = head = page_buffers(page);
-	if (!bh)
-		return -ENOMEM;
 
 	blocks = PAGE_CACHE_SIZE >> blocksize_bits;
 	iblock = page->index << (PAGE_CACHE_SHIFT - blocksize_bits);
@@ -678,10 +693,12 @@ int ntfs_mst_readpage(struct file *dir, 
 	/* Loop through all the buffers in the page. */
 	nr = i = 0;
 	do {
+		BUG_ON(buffer_async_read(bh));
 		if (unlikely(buffer_uptodate(bh)))
 			continue;
 		if (unlikely(buffer_mapped(bh))) {
-			arr[nr++] = bh;
+			set_buffer_async_read(bh);
+			nr++;
 			continue;
 		}
 		bh->b_bdev = vol->sb->s_bdev;
@@ -707,7 +724,8 @@ retry_remap:
 				set_buffer_mapped(bh);
 				/* Only read initialized data blocks. */
 				if (iblock < zblock) {
-					arr[nr++] = bh;
+					set_buffer_async_read(bh);
+					nr++;
 					continue;
 				}
 				/* Fully non-initialized data block, zero it. */
@@ -748,15 +766,18 @@ handle_zblock:
 	/* Check we have at least one buffer ready for i/o. */
 	if (nr) {
 		/* Lock the buffers. */
-		for (i = 0; i < nr; i++) {
-			struct buffer_head *tbh = arr[i];
-			lock_buffer(tbh);
-			tbh->b_end_io = end_buffer_read_mst_async;
-			set_buffer_async_read(tbh);
-		}
+		do {
+			if (buffer_async_read(bh)) {
+				lock_buffer(bh);
+				bh->b_end_io = end_buffer_read_mst_async;
+			}
+		} while ((bh = bh->b_this_page) != head);
+
 		/* Finally, start i/o on the buffers. */
-		for (i = 0; i < nr; i++)
-			submit_bh(READ, arr[i]);
+		do {
+			if (buffer_async_read(bh))
+				submit_bh(READ, bh);
+		} while ((bh = bh->b_this_page) != head);
 		return 0;
 	}
 	/* No i/o was scheduled on any of the buffers. */




* Re: Large stack usage in fs code (especially for PPC64)
  2008-11-17 20:34 Large stack usage in fs code (especially for PPC64) Steven Rostedt
  2008-11-17 20:59 ` Steven Rostedt
  2008-11-17 21:08 ` Andrew Morton
@ 2008-11-17 21:09 ` Linus Torvalds
  2008-11-17 22:53   ` Paul Mackerras
  2008-11-17 23:13   ` Benjamin Herrenschmidt
  2008-11-17 23:03 ` Benjamin Herrenschmidt
  2008-11-17 23:30 ` Benjamin Herrenschmidt
  4 siblings, 2 replies; 35+ messages in thread
From: Linus Torvalds @ 2008-11-17 21:09 UTC (permalink / raw)
  To: Steven Rostedt
  Cc: LKML, Paul Mackerras, Benjamin Herrenschmidt, linuxppc-dev,
	Andrew Morton, Ingo Molnar, Thomas Gleixner



On Mon, 17 Nov 2008, Steven Rostedt wrote:
>
>  45)     4992    1280   .block_read_full_page+0x23c/0x430
>  46)     3712    1280   .do_mpage_readpage+0x43c/0x740

Ouch.

> Notice at lines 45 and 46 the stack usage of block_read_full_page and 
> do_mpage_readpage. They each use 1280 bytes of stack! Looking at the start 
> of these two:
> 
> int block_read_full_page(struct page *page, get_block_t *get_block)
> {
> 	struct inode *inode = page->mapping->host;
> 	sector_t iblock, lblock;
> 	struct buffer_head *bh, *head, *arr[MAX_BUF_PER_PAGE];

Yeah, that's unacceptable.

Well, it's not unacceptable on good CPU's with 4kB blocks (just an 8-entry 
array), but as you say:

> On PPC64 I'm told that the page size is 64K, which makes the above equal 
> to: 64K / 512 = 128; multiply that by 8-byte words and we have 1024 bytes.

Yeah. Not good. I think 64kB pages are insane. In fact, I think 32kB 
pages are insane, and 16kB pages are borderline. I've told people so.

The ppc people run databases, and they don't care about sane people 
telling them the big pages suck. It's made worse by the fact that they 
also have horribly bad TLB fills on their broken CPU's, and years and 
years of telling people that the MMU on ppc's is sh*t has only been 
reacted to with "talk to the hand, we know better".

Quite frankly, 64kB pages are INSANE. But yes, in this case they actually 
cause bugs. With a sane page-size, that *arr[MAX_BUF_PER_PAGE] thing uses 
64 bytes, not 1kB.

I suspect the PPC people need to figure out some way to handle this in 
their broken setups (since I don't really expect them to finally admit 
that they were full of sh*t with their big pages), but since I think it's 
a ppc bug, I'm not at all interested in a fix that penalizes the _good_ 
case.

So either make it some kind of (clean) conditional dynamic non-stack 
allocation, or make it do some outer loop over the whole page that turns 
into a compile-time no-op when the page is sufficiently small to be done 
in one go.
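
A fragment of what that outer-loop option could look like (a sketch
only: STACK_BUFS and collect_buffers() are hypothetical names, not
existing kernel code):

#define STACK_BUFS	8	/* 8 pointers = 64 bytes of stack, always */

	struct buffer_head *arr[STACK_BUFS];
	unsigned int chunk, i, nr;

	for (chunk = 0; chunk < blocks_per_page; chunk += STACK_BUFS) {
		/* gather up to STACK_BUFS buffers that need IO */
		nr = collect_buffers(page, chunk, STACK_BUFS, arr);
		for (i = 0; i < nr; i++)
			submit_bh(READ, arr[i]);
	}

With 4K pages and 512-byte blocks the outer loop runs exactly once, so
the sane-page-size case pays nothing extra.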

Or perhaps say "if you have 64kB pages, you're a moron, and to counteract 
that moronic page size, you cannot do 512-byte granularity IO any more".

Of course, that would likely mean that FAT etc wouldn't work on ppc64, so 
I don't think that's a valid model either. But if the 64kB page size is 
just a "database server crazy-people config option", then maybe it's 
acceptable.

Database people usually don't want to connect their cameras or mp3-players 
with their FAT12 filesystems.

			Linus


* Re: Large stack usage in fs code (especially for PPC64)
  2008-11-17 20:59 ` Steven Rostedt
@ 2008-11-17 21:18   ` Linus Torvalds
  2008-11-17 21:25     ` Pekka Enberg
  2008-11-18  0:54     ` Steven Rostedt
  2008-11-17 22:16   ` Paul Mackerras
                     ` (2 subsequent siblings)
  3 siblings, 2 replies; 35+ messages in thread
From: Linus Torvalds @ 2008-11-17 21:18 UTC (permalink / raw)
  To: Steven Rostedt
  Cc: LKML, Paul Mackerras, Benjamin Herrenschmidt, linuxppc-dev,
	Andrew Morton, Ingo Molnar, Thomas Gleixner



On Mon, 17 Nov 2008, Steven Rostedt wrote:
> 
> Here's my stack after boot up with CONFIG_IRQSTACKS set. Seems that 
> softirqs still use the same stack as the process.

Yes.

> This is still 12K. Kind of big even for a 16K stack.

And while that 1kB+ stack slot for block_read_full_page still stands out 
like a sore thumb, I do agree that there are way too many other functions 
with big stack frames, too.

I do wonder just _what_ it is that causes the stack frames to be so 
horrid. For example, you have

	 18)     8896     160   .kmem_cache_alloc+0xfc/0x140

and I'm looking at my x86-64 compile, and it has a stack frame of just 8 
bytes (!) for local variables plus the save/restore area (which looks like 
three registers plus frame pointer plus return address). IOW, if I'm 
looking at the code right (so big caveat: I did _not_ do a real stack 
dump!) the x86-64 stack cost for that same function is on the order of 48 
bytes. Not 160.

Where does that factor-of-three+ difference come from? From the numbers, I 
suspect ppc64 has a 32-byte stack alignment, which may be part of it, and 
I guess the compiler is more eager to use all those extra registers and 
will happily have many more callee-saved regs that are actually used.

But that's still a _lot_ of extra stack.
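
For reference, the ABI itself accounts for a good chunk of that. As a
sketch (ELFv1 64-bit PowerPC ABI numbers quoted from memory, so treat
them as an assumption):

/*
 * Minimum ppc64 stack frame, before any locals or callee-saved regs:
 *
 *    0(r1)  back chain		 8 bytes
 *    8(r1)  CR save		 8 bytes
 *   16(r1)  LR save		 8 bytes
 *   24(r1)  compiler/linker	16 bytes
 *   40(r1)  TOC save		 8 bytes
 *   48(r1)  parameter save	64 bytes (8 GPR slots)
 *				---------
 *				112 bytes per non-leaf frame
 */

so every call starts roughly 100 bytes behind the x86-64 equivalent,
and frames are then rounded up to 16-byte alignment (every Size value
in the traces above is a multiple of 16).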

Of course, you may have things like spinlock debugging etc enabled. Some 
of our debugging options do tend to blow things up.

			Linus


* Re: Large stack usage in fs code (especially for PPC64)
  2008-11-17 21:08 ` Andrew Morton
@ 2008-11-17 21:23   ` Linus Torvalds
  2008-11-17 21:31     ` Andrew Morton
  0 siblings, 1 reply; 35+ messages in thread
From: Linus Torvalds @ 2008-11-17 21:23 UTC (permalink / raw)
  To: Andrew Morton
  Cc: Steven Rostedt, linux-kernel, paulus, benh, linuxppc-dev, mingo,
	tglx, linux-mm



On Mon, 17 Nov 2008, Andrew Morton wrote:
> 
> Far be it from me to apportion blame, but THIS IS ALL LINUS'S FAULT!!!!! :)
> 
> I fixed this six years ago.  See http://lkml.org/lkml/2002/6/17/68

Btw, in that thread I also said:

  "If we have 64kB pages, such architectures will have to have a bigger 
   kernel stack. Which they will have, simply by virtue of having the very 
   same bigger page. So that problem kind of solves itself."

and that may still be the "right" solution - if somebody is so insane that 
they want 64kB pages, then they might as well have a 64kB kernel stack as 
well. 

Trust me, the kernel stack isn't where you blow your memory with a 64kB 
page. You blow all your memory on the memory fragmentation of your page 
cache. I did the stats for the kernel source tree a long time ago, and I 
think you wasted something like 4GB of RAM with a 64kB page size.
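
The per-file arithmetic behind that is easy to sketch (the 10000-byte
file size below is just an assumed "typical small source file", so the
absolute numbers are illustrative):

#include <stdio.h>

/* bytes allocated in a file's last page-cache page that hold no data */
static unsigned long tail_waste(unsigned long fsize, unsigned long psize)
{
	unsigned long used = fsize % psize;

	return used ? psize - used : 0;
}

int main(void)
{
	unsigned long fsize = 10000;

	printf(" 4K pages: %5lu bytes wasted\n", tail_waste(fsize, 4096));
	printf("64K pages: %5lu bytes wasted\n", tail_waste(fsize, 65536));
	return 0;
}

That's roughly 2KB versus 54KB of waste for one small file; multiply by
the tens of thousands of files in a source tree and the gigabytes
follow.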

			Linus


* Re: Large stack usage in fs code (especially for PPC64)
  2008-11-17 21:18   ` Linus Torvalds
@ 2008-11-17 21:25     ` Pekka Enberg
  2008-11-18  0:54     ` Steven Rostedt
  1 sibling, 0 replies; 35+ messages in thread
From: Pekka Enberg @ 2008-11-17 21:25 UTC (permalink / raw)
  To: Linus Torvalds
  Cc: Steven Rostedt, LKML, Paul Mackerras, Benjamin Herrenschmidt,
	linuxppc-dev, Andrew Morton, Ingo Molnar, Thomas Gleixner

On Mon, Nov 17, 2008 at 11:18 PM, Linus Torvalds
<torvalds@linux-foundation.org> wrote:
> I do wonder just _what_ it is that causes the stack frames to be so
> horrid. For example, you have
>
>         18)     8896     160   .kmem_cache_alloc+0xfc/0x140
>
> and I'm looking at my x86-64 compile, and it has a stack frame of just 8
> bytes (!) for local variables plus the save/restore area (which looks like
> three registers plus frame pointer plus return address). IOW, if I'm
> looking at the code right (so big caveat: I did _not_ do a real stack
> dump!) the x86-64 stack cost for that same function is on the order of 48
> bytes. Not 160.
>
> Where does that factor-of-three+ difference come from? From the numbers, I
> suspect ppc64 has a 32-byte stack alignment, which may be part of it, and
> I guess the compiler is more eager to use all those extra registers and
> will happily have many more callee-saved regs that are actually used.
>
> But that's still a _lot_ of extra stack.
>
> Of course, you may have things like spinlock debugging etc enabled. Some
> of our debugging options do tend to blow things up.

Note that kmem_cache_alloc() is likely to contain lots of inlined
functions for both SLAB and SLUB. Perhaps that blows up stack usage on
ppc?
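
That effect is easy to demonstrate in isolation.  A toy userspace
example of how inlining folds the helpers' locals into the single frame
a stack tracer sees:

#include <stdio.h>
#include <string.h>

static inline int helper_a(const char *s)
{
	char buf[32];			/* these 32 bytes... */

	strncpy(buf, s, sizeof(buf) - 1);
	buf[sizeof(buf) - 1] = '\0';
	return buf[0];
}

static inline int helper_b(const char *s)
{
	char buf2[32];			/* ...and these 32... */

	snprintf(buf2, sizeof(buf2), "<%s>", s);
	return helper_a(buf2);
}

int main(void)
{
	/*
	 * Once helper_a/helper_b are inlined, all of their locals can
	 * land in main()'s one stack frame, so the single visible
	 * frame grows even though each function looks cheap.
	 */
	printf("%d\n", helper_b("slab"));
	return 0;
}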


* Re: Large stack usage in fs code (especially for PPC64)
  2008-11-17 21:23   ` Linus Torvalds
@ 2008-11-17 21:31     ` Andrew Morton
  2008-11-17 21:42       ` Linus Torvalds
  2008-11-17 23:17       ` Benjamin Herrenschmidt
  0 siblings, 2 replies; 35+ messages in thread
From: Andrew Morton @ 2008-11-17 21:31 UTC (permalink / raw)
  To: Linus Torvalds
  Cc: rostedt, linux-kernel, paulus, benh, linuxppc-dev, mingo, tglx, linux-mm

On Mon, 17 Nov 2008 13:23:23 -0800 (PST)
Linus Torvalds <torvalds@linux-foundation.org> wrote:

> 
> 
> On Mon, 17 Nov 2008, Andrew Morton wrote:
> > 
> > Far be it from me to apportion blame, but THIS IS ALL LINUS'S FAULT!!!!! :)
> > 
> > I fixed this six years ago.  See http://lkml.org/lkml/2002/6/17/68
> 
> Btw, in that thread I also said:
> 
>   "If we have 64kB pages, such architectures will have to have a bigger 
>    kernel stack. Which they will have, simply by virtue of having the very 
>    same bigger page. So that problem kind of solves itself."
> 
> and that may still be the "right" solution - if somebody is so insane that 
> they want 64kB pages, then they might as well have a 64kB kernel stack as 
> well. 

I'd have thought so, but I'm sure we're about to hear how important an
optimisation the smaller stacks are ;)

> Trust me, the kernel stack isn't where you blow your memory with a 64kB 
> page. You blow all your memory on the memory fragmentation of your page 
> cache. I did the stats for the kernel source tree a long time ago, and I 
> think you wasted something like 4GB of RAM with a 64kB page size.
> 

Yup.  That being said, the younger me did assert that "this is a neater
implementation anyway".  If we can implement those loops without
needing those on-stack temporary arrays then things probably are better
overall.



* Re: Large stack usage in fs code (especially for PPC64)
  2008-11-17 21:31     ` Andrew Morton
@ 2008-11-17 21:42       ` Linus Torvalds
  2008-11-17 23:17       ` Benjamin Herrenschmidt
  1 sibling, 0 replies; 35+ messages in thread
From: Linus Torvalds @ 2008-11-17 21:42 UTC (permalink / raw)
  To: Andrew Morton
  Cc: rostedt, linux-kernel, paulus, benh, linuxppc-dev, mingo, tglx, linux-mm



On Mon, 17 Nov 2008, Andrew Morton wrote:
> 
> Yup.  That being said, the younger me did assert that "this is a neater
> implementation anyway".  If we can implement those loops without
> needing those on-stack temporary arrays then things probably are better
> overall.

Sure, if it actually ends up being nicer, I'll not argue with it. But from 
an L1 I$ standpoint (and I$ is often very important, especially for kernel 
loads where loops are fairly rare), it's often _much_ better to do two 
"tight" loops over two subsystems (filesystem and block layer) than it is 
to do one bigger loop that contains both. If the L1 can fit both subsystem 
paths, you're fine - but if not, you may get a lot more misses.

So it's often nice if you can "stage" things so that you do a cluster of 
calls to one area, followed by a cluster of calls to another, rather than 
mix it up. 
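
In code terms the difference is just this (names are illustrative):

	/* mixed: every iteration drags both subsystems' code into the I$ */
	for (i = 0; i < n; i++) {
		fs_prepare(item[i]);
		block_submit(item[i]);
	}

	/* staged: each tight loop can stay resident in the L1 I$ */
	for (i = 0; i < n; i++)
		fs_prepare(item[i]);
	for (i = 0; i < n; i++)
		block_submit(item[i]);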

But numbers talk. And code cleanliness. If somebody has numbers that the 
code size actually goes down for example, or the code is just more 
readable, micro-optimizing cache patterns isn't worth it.

			Linus


* Re: Large stack usage in fs code (especially for PPC64)
  2008-11-17 20:59 ` Steven Rostedt
  2008-11-17 21:18   ` Linus Torvalds
@ 2008-11-17 22:16   ` Paul Mackerras
  2008-11-17 23:30     ` Steven Rostedt
  2008-11-17 23:04   ` Benjamin Herrenschmidt
  2008-11-18  2:29   ` Steven Rostedt
  3 siblings, 1 reply; 35+ messages in thread
From: Paul Mackerras @ 2008-11-17 22:16 UTC (permalink / raw)
  To: Steven Rostedt
  Cc: LKML, Benjamin Herrenschmidt, linuxppc-dev, Linus Torvalds,
	Andrew Morton, Ingo Molnar, Thomas Gleixner

Steven Rostedt writes:

> Here's my stack after boot up with CONFIG_IRQSTACKS set. Seems that 
> softirqs still use the same stack as the process.

They shouldn't.  I don't see do_softirq in the trace, though.  Which
functions did you think would be run in a softirq?  It looks to me
like the deepest 10 or so functions are called at hard irq level,
within hrtimer_interrupt called from timer_interrupt.

> root@electra ~> cat /debug/tracing/stack_trace 
>         Depth   Size      Location    (59 entries)
>         -----   ----      --------
>   0)    12384     192   ftrace_call+0x4/0x14
>   1)    12192     128   .sched_clock+0x20/0x60
>   2)    12064     128   .sched_clock_cpu+0x34/0x50
>   3)    11936     144   .cpu_clock+0x3c/0xa0
>   4)    11792     144   .get_timestamp+0x2c/0x50
>   5)    11648     144   .__touch_softlockup_watchdog+0x3c/0x60
>   6)    11504     192   .softlockup_tick+0xe4/0x220
>   7)    11312     128   .run_local_timers+0x34/0x50
>   8)    11184     160   .update_process_times+0x44/0xb0
>   9)    11024     176   .tick_sched_timer+0x8c/0x120
>  10)    10848     160   .__run_hrtimer+0xd8/0x130
>  11)    10688     240   .hrtimer_interrupt+0x16c/0x220
>  12)    10448     160   .timer_interrupt+0xcc/0x110
>  13)    10288      96   decrementer_common+0xe0/0x100

Paul.


* Re: Large stack usage in fs code (especially for PPC64)
  2008-11-17 21:09 ` Linus Torvalds
@ 2008-11-17 22:53   ` Paul Mackerras
  2008-11-18 10:07     ` Nick Piggin
  2008-11-17 23:13   ` Benjamin Herrenschmidt
  1 sibling, 1 reply; 35+ messages in thread
From: Paul Mackerras @ 2008-11-17 22:53 UTC (permalink / raw)
  To: Linus Torvalds
  Cc: Steven Rostedt, LKML, Benjamin Herrenschmidt, linuxppc-dev,
	Andrew Morton, Ingo Molnar, Thomas Gleixner

Linus Torvalds writes:

> The ppc people run databases, and they don't care about sane people 

And HPC apps, and all sorts of other things...

> telling them the big pages suck. It's made worse by the fact that they 
> also have horribly bad TLB fills on their broken CPU's, and years and 

Taking page faults at a 4k granularity also sucks.  A lot of the
performance improvement from using 64k pages comes just from executing
the page fault path (and other paths in the mm code) less frequently.
That's why we see a performance improvement from 64k pages even on
machines that don't support 64k pages in hardware (like the 21%
reduction in system time on a kernel compile that I reported earlier).
That has nothing to do with TLB miss times or anything to do with the
MMU.

I'd love to be able to use a 4k base page size if I could still get
the reduction in page faults and the expanded TLB reach that we get
now with 64k pages.  If we could allocate the page cache for large
files with order-4 allocations wherever possible that would be a good
start.  I think Christoph Lameter had some patches in that direction
but they didn't seem to get very far.

> years of telling people that the MMU on ppc's are sh*t has only been 
> reacted to with "talk to the hand, we know better".

Well, it's been reacted to with "AIX can use small pages where it
makes sense, and large pages where that makes sense, so why can't
Linux?"

> I suspect the PPC people need to figure out some way to handle this in 
> their broken setups (since I don't really expect them to finally admit 
> that they were full of sh*t with their big pages), but since I think it's 
> a ppc bug, I'm not at all interested in a fix that penalizes the _good_ 
> case.

Looks like we should make the stack a bit larger.
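
One way would be to key the stack size off the page-size config.  A
sketch against arch/powerpc's thread_info.h (THREAD_SHIFT is the real
knob there; tying it to CONFIG_PPC_64K_PAGES is my assumption about how
to wire it up):

#ifdef CONFIG_PPC_64K_PAGES
#define THREAD_SHIFT		16	/* 64K stacks to match 64K pages */
#else
#define THREAD_SHIFT		14	/* today's 16K stacks */
#endif
#define THREAD_SIZE		(1 << THREAD_SHIFT)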

Paul.


* Re: Large stack usage in fs code (especially for PPC64)
  2008-11-17 20:34 Large stack usage in fs code (especially for PPC64) Steven Rostedt
                   ` (2 preceding siblings ...)
  2008-11-17 21:09 ` Linus Torvalds
@ 2008-11-17 23:03 ` Benjamin Herrenschmidt
  2008-11-18  9:53   ` Christoph Hellwig
  2008-11-17 23:30 ` Benjamin Herrenschmidt
  4 siblings, 1 reply; 35+ messages in thread
From: Benjamin Herrenschmidt @ 2008-11-17 23:03 UTC (permalink / raw)
  To: Steven Rostedt
  Cc: LKML, Paul Mackerras, linuxppc-dev, Linus Torvalds,
	Andrew Morton, Ingo Molnar, Thomas Gleixner

On Mon, 2008-11-17 at 15:34 -0500, Steven Rostedt wrote:
> Note, I was using a default config that had CONFIG_IRQSTACKS off and
> CONFIG_PPC_64K_PAGES on.

For one, we definitely need to turn IRQSTACKS on by default ... In fact,
I'm pondering just removing the option.

Cheers,
Ben.




* Re: Large stack usage in fs code (especially for PPC64)
  2008-11-17 20:59 ` Steven Rostedt
  2008-11-17 21:18   ` Linus Torvalds
  2008-11-17 22:16   ` Paul Mackerras
@ 2008-11-17 23:04   ` Benjamin Herrenschmidt
  2008-11-18  2:29   ` Steven Rostedt
  3 siblings, 0 replies; 35+ messages in thread
From: Benjamin Herrenschmidt @ 2008-11-17 23:04 UTC (permalink / raw)
  To: Steven Rostedt
  Cc: LKML, Paul Mackerras, linuxppc-dev, Linus Torvalds,
	Andrew Morton, Ingo Molnar, Thomas Gleixner

On Mon, 2008-11-17 at 15:59 -0500, Steven Rostedt wrote:
> On Mon, 17 Nov 2008, Steven Rostedt wrote:
> > 
> > Note, I was using a default config that had CONFIG_IRQSTACKS off and
> > CONFIG_PPC_64K_PAGES on.
> 
> Here's my stack after boot up with CONFIG_IRQSTACKS set. Seems that 
> softirqs still use the same stack as the process.

No, they should use their own, unless somebody changed the softirq code
in ways that break us... (or unless I missed something; we based our
implementation roughly on what x86 did back then).

Ben.

> root@electra ~> cat /debug/tracing/stack_trace 
>         Depth   Size      Location    (59 entries)
>         -----   ----      --------
>   0)    12384     192   ftrace_call+0x4/0x14
>   1)    12192     128   .sched_clock+0x20/0x60
>   2)    12064     128   .sched_clock_cpu+0x34/0x50
>   3)    11936     144   .cpu_clock+0x3c/0xa0
>   4)    11792     144   .get_timestamp+0x2c/0x50
>   5)    11648     144   .__touch_softlockup_watchdog+0x3c/0x60
>   6)    11504     192   .softlockup_tick+0xe4/0x220
>   7)    11312     128   .run_local_timers+0x34/0x50
>   8)    11184     160   .update_process_times+0x44/0xb0
>   9)    11024     176   .tick_sched_timer+0x8c/0x120
>  10)    10848     160   .__run_hrtimer+0xd8/0x130
>  11)    10688     240   .hrtimer_interrupt+0x16c/0x220
>  12)    10448     160   .timer_interrupt+0xcc/0x110
>  13)    10288      96   decrementer_common+0xe0/0x100
>  14)    10192     784   .ftrace_test_stop_func+0x4c/0x70
>  15)     9408     112   ftrace_call+0x4/0x14
>  16)     9296     144   .init_buffer_head+0x20/0x60
>  17)     9152     256   .cache_alloc_refill+0x5a0/0x740
>  18)     8896     160   .kmem_cache_alloc+0xfc/0x140
>  19)     8736     144   .alloc_buffer_head+0x3c/0xc0
>  20)     8592     176   .alloc_page_buffers+0xb0/0x130
>  21)     8416     160   .create_empty_buffers+0x44/0x1a0
>  22)     8256    1280   .block_read_full_page+0x3b0/0x430
>  23)     6976     144   .blkdev_readpage+0x38/0x60
>  24)     6832     176   .read_cache_page_async+0x124/0x230
>  25)     6656     160   .read_cache_page+0x50/0xe0
>  26)     6496     160   .read_dev_sector+0x58/0x100
>  27)     6336     272   .msdos_partition+0xa4/0x880
>  28)     6064     240   .rescan_partitions+0x1f0/0x430
>  29)     5824     208   .__blkdev_get+0x124/0x3d0
>  30)     5616     144   .blkdev_get+0x3c/0x60
>  31)     5472     192   .register_disk+0x194/0x1f0
>  32)     5280     160   .add_disk+0xe8/0x160
>  33)     5120     224   .sd_probe+0x2f0/0x440
>  34)     4896     176   .driver_probe_device+0x14c/0x380
>  35)     4720     144   .__device_attach+0x38/0x60
>  36)     4576     192   .bus_for_each_drv+0xb8/0x120
>  37)     4384     160   .device_attach+0x94/0xe0
>  38)     4224     144   .bus_attach_device+0x78/0xb0
>  39)     4080     240   .device_add+0x54c/0x730
>  40)     3840     160   .scsi_sysfs_add_sdev+0x7c/0x2e0
>  41)     3680     336   .scsi_probe_and_add_lun+0xb7c/0xc60
>  42)     3344     192   .__scsi_add_device+0x11c/0x150
>  43)     3152     176   .ata_scsi_scan_host+0x238/0x330
>  44)     2976     208   .ata_host_register+0x320/0x3c0
>  45)     2768     192   .ata_host_activate+0xf8/0x190
>  46)     2576     224   .mv_pci_init_one+0x284/0x430
>  47)     2352     160   .pci_device_probe+0x98/0xf0
>  48)     2192     176   .driver_probe_device+0x14c/0x380
>  49)     2016     160   .__driver_attach+0xd4/0x110
>  50)     1856     192   .bus_for_each_dev+0xa8/0x100
>  51)     1664     144   .driver_attach+0x40/0x60
>  52)     1520     192   .bus_add_driver+0x268/0x340
>  53)     1328     176   .driver_register+0x88/0x1d0
>  54)     1152     176   .__pci_register_driver+0x90/0x100
>  55)      976     144   .mv_init+0x38/0xa0
>  56)      832     544   .do_one_initcall+0x78/0x240
>  57)      288     192   .kernel_init+0x21c/0x2b0
>  58)       96      96   .kernel_thread+0x54/0x70
> 
> This is still 12K. Kind of big even for a 16K stack.
> 
> -- Steve



* Re: Large stack usage in fs code (especially for PPC64)
  2008-11-17 21:09 ` Linus Torvalds
  2008-11-17 22:53   ` Paul Mackerras
@ 2008-11-17 23:13   ` Benjamin Herrenschmidt
  2008-11-17 23:22     ` Josh Boyer
  2008-11-17 23:28     ` Linus Torvalds
  1 sibling, 2 replies; 35+ messages in thread
From: Benjamin Herrenschmidt @ 2008-11-17 23:13 UTC (permalink / raw)
  To: Linus Torvalds
  Cc: Steven Rostedt, LKML, Paul Mackerras, linuxppc-dev,
	Andrew Morton, Ingo Molnar, Thomas Gleixner


> Well, it's not unacceptable on good CPU's with 4kB blocks (just an 8-entry 
> array), but as you say:
> 
> > On PPC64 I'm told that the page size is 64K, which makes the above equal 
> > to: 64K / 512 = 128; multiply that by 8-byte words and we have 1024 bytes.
> 
> Yeah. Not good. I think 64kB pages are insane. In fact, I think 32kB 
> pages are insane, and 16kB pages are borderline. I've told people so.
> 
> The ppc people run databases, and they don't care about sane people 
> telling them the big pages suck.

Hehe :-)

Guess who is pushing for larger page sizes nowadays ? Embedded
people :-) In fact, we have patches submitted on the list to offer the
option for ... 256K pages on some 44x embedded CPUs :-)

It makes some sort of sense I suppose on very static embedded workloads
with no swap nor demand paging.

> It's made worse by the fact that they 
> also have horribly bad TLB fills on their broken CPU's, and years and 
> years of telling people that the MMU on ppc's is sh*t has only been 
> reacted to with "talk to the hand, we know better".

Who are you talking about here precisely ? I don't think either Paul or
I ever said anything along those lines ... Oh well.

But yeah, our existing server CPUs have pretty poor TLB refills and yes,
64K pages help. And yes, we would like things to be different, but they
aren't.

But there is also pressure to get larger page sizes from small embedded
field, where CPUs have even poorer TLB refill (software loaded
basically) :-)

> Quite frankly, 64kB pages are INSANE. But yes, in this case they actually 
> cause bugs. With a sane page-size, that *arr[MAX_BUF_PER_PAGE] thing uses 
> 64 bytes, not 1kB.

Come on, the code is crap to allocate that on the stack anyway :-)

> Of course, that would likely mean that FAT etc wouldn't work on ppc64, so 
> I don't think that's a valid model either. But if the 64kB page size is 
> just a "database server crazy-people config option", then maybe it's 
> acceptable.

Well, as I said, embedded folks are wanting that too ...

Cheers,
Ben.



* Re: Large stack usage in fs code (especially for PPC64)
  2008-11-17 21:31     ` Andrew Morton
  2008-11-17 21:42       ` Linus Torvalds
@ 2008-11-17 23:17       ` Benjamin Herrenschmidt
  1 sibling, 0 replies; 35+ messages in thread
From: Benjamin Herrenschmidt @ 2008-11-17 23:17 UTC (permalink / raw)
  To: Andrew Morton
  Cc: Linus Torvalds, rostedt, linux-kernel, paulus, linuxppc-dev,
	mingo, tglx, linux-mm


> I'd have thought so, but I'm sure we're about to hear how important an
> optimisation the smaller stacks are ;)

Not sure, I tend to agree that it would make sense to bump our stack to
64K on 64K pages, it's not like we are saving anything and we are
probably adding overhead in alloc/dealloc. I'll see what Paul thinks
here.

> Yup.  That being said, the younger me did assert that "this is a neater
> implementation anyway".  If we can implement those loops without
> needing those on-stack temporary arrays then things probably are better
> overall.

Amen.

Cheers,
Ben.




* Re: Large stack usage in fs code (especially for PPC64)
  2008-11-17 23:13   ` Benjamin Herrenschmidt
@ 2008-11-17 23:22     ` Josh Boyer
  2008-11-17 23:28     ` Linus Torvalds
  1 sibling, 0 replies; 35+ messages in thread
From: Josh Boyer @ 2008-11-17 23:22 UTC (permalink / raw)
  To: Benjamin Herrenschmidt
  Cc: Linus Torvalds, Thomas Gleixner, LKML, Steven Rostedt,
	linuxppc-dev, Paul Mackerras, Andrew Morton, Ingo Molnar

On Tue, 18 Nov 2008 10:13:16 +1100
Benjamin Herrenschmidt <benh@kernel.crashing.org> wrote:

> 
> > Well, it's not unacceptable on good CPU's with 4kB blocks (just an 8-entry 
> > array), but as you say:
> > 
> > > On PPC64 I'm told that the page size is 64K, which makes the above equal 
> > > to: 64K / 512 = 128; multiply that by 8-byte words and we have 1024 bytes.
> > 
> > Yeah. Not good. I think 64kB pages are insane. In fact, I think 32kB 
> > pages are insane, and 16kB pages are borderline. I've told people so.
> > 
> > The ppc people run databases, and they don't care about sane people 
> > telling them the big pages suck.
> 
> Hehe :-)
> 
> Guess who is pushing for larger page sizes nowadays ? Embedded
> people :-) In fact, we have patches submitted on the list to offer the
> option for ... 256K pages on some 44x embedded CPUs :-)

For clarification, that workload is very specific.  Namely embedded 44x
CPUs used in RAID cards.  I'm not entirely convinced bringing 256K
pages into mainline is a good thing yet anyway.

64K pages, while seemingly insane for embedded boards that typically
have less than 512 MiB of DRAM, help for a bit larger set of
workloads.  As a KVM host is the primary winner at the moment.  But
given the small number of TLB entries on these CPUs, it can pay off
elsewhere as well.

josh


* Re: Large stack usage in fs code (especially for PPC64)
  2008-11-17 23:13   ` Benjamin Herrenschmidt
  2008-11-17 23:22     ` Josh Boyer
@ 2008-11-17 23:28     ` Linus Torvalds
  2008-11-18  0:11       ` Paul Mackerras
  2008-11-18  7:25       ` Benjamin Herrenschmidt
  1 sibling, 2 replies; 35+ messages in thread
From: Linus Torvalds @ 2008-11-17 23:28 UTC (permalink / raw)
  To: Benjamin Herrenschmidt
  Cc: Steven Rostedt, LKML, Paul Mackerras, linuxppc-dev,
	Andrew Morton, Ingo Molnar, Thomas Gleixner



On Tue, 18 Nov 2008, Benjamin Herrenschmidt wrote:
> 
> Guess who is pushing for larger page sizes nowadays ? Embedded
> people :-) In fact, we have patches submitted on the list to offer the
> option for ... 256K pages on some 44x embedded CPUs :-)
> 
> It makes some sort of sense I suppose on very static embedded workloads
> with no swap nor demand paging.

It makes perfect sense for anything that doesn't use any MMU. 

The hugepage support seems to cover many of the relevant cases, ie 
databases and things like big static mappings (frame buffers etc).

> > It's made worse by the fact that they 
> > also have horribly bad TLB fills on their broken CPU's, and years and 
> > years of telling people that the MMU on ppc's is sh*t has only been 
> > reacted to with "talk to the hand, we know better".
> 
> Who are you talking about here precisely? I don't think either Paul or
> I ever said anything along those lines ... Oh well.

Every single time I've complained about it, somebody from IBM has said ".. 
but but AIX".

This time it was Paul. Sometimes it has been software people who agree, 
but point to hardware designers who "know better". If it's not some insane 
database person, it's a Fortran program that runs for days.

> But there is also pressure to get larger page sizes from small embedded
> field, where CPUs have even poorer TLB refill (software loaded
> basically) :-)

Yeah, I agree that you _can_ have even worse MMU's. I'm not saying that 
PPC64 is absolutely pessimal and cannot be made worse. Software fill is 
indeed even worse from a performance angle, despite the fact that it's 
really "nice" from a conceptual angle.

Of course, of the software-fill users that remain, many do seem to be 
ppc.  It's like the architecture brings out the worst in hardware 
designers.

> > Quite frankly, 64kB pages are INSANE. But yes, in this case they actually 
> > cause bugs. With a sane page-size, that *arr[MAX_BUF_PER_PAGE] thing uses 
> > 64 bytes, not 1kB.
> 
> Come on, the code is crap to allocate that on the stack anyway :-)

Why? We do actually expect to be able to use stack-space for small 
structures. We do it for a lot of things, including stuff like select() 
optimistically using arrays allocated on the stack for the common small 
case, just because it's, oh, about infinitely faster to do than to use 
kmalloc().
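A hedged sketch of that pattern (illustrative names, not the actual
select() code): keep a small array on the stack for the common case,
and fall back to kmalloc() only when the caller asks for more:

#include <linux/errno.h>
#include <linux/slab.h>
#include <linux/types.h>

#define SMALL_MAX 32	/* illustrative cutoff for the on-stack fast path */

static int consume(unsigned long *buf, size_t n)
{
	return 0;	/* stand-in for the real work */
}

static int do_small_fast(size_t n)
{
	unsigned long stack_buf[SMALL_MAX];
	unsigned long *buf = stack_buf;
	int ret;

	if (n > SMALL_MAX) {
		/* only the rare big case pays the allocator cost */
		buf = kmalloc(n * sizeof(*buf), GFP_KERNEL);
		if (!buf)
			return -ENOMEM;
	}

	ret = consume(buf, n);

	if (buf != stack_buf)
		kfree(buf);
	return ret;
}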

Many of the page cache functions also have the added twist that they get 
called from low-memory setups (e.g. write_whole_page()), and so try to 
minimize allocations for that reason too.

			Linus


* Re: Large stack usage in fs code (especially for PPC64)
  2008-11-17 20:34 Large stack usage in fs code (especially for PPC64) Steven Rostedt
                   ` (3 preceding siblings ...)
  2008-11-17 23:03 ` Benjamin Herrenschmidt
@ 2008-11-17 23:30 ` Benjamin Herrenschmidt
  4 siblings, 0 replies; 35+ messages in thread
From: Benjamin Herrenschmidt @ 2008-11-17 23:30 UTC (permalink / raw)
  To: Steven Rostedt
  Cc: LKML, Paul Mackerras, linuxppc-dev, Linus Torvalds,
	Andrew Morton, Ingo Molnar, Thomas Gleixner

On Mon, 2008-11-17 at 15:34 -0500, Steven Rostedt wrote:
> 
> I've been hitting stack overflows on a PPC64 box, so I ran the ftrace 
> stack_tracer and part of the problem with that box is that it can nest 
> interrupts too deep. But what also worries me is that there's some heavy 
> hitters of stacks in generic code. Namely the fs directory has some.

Note that we shouldn't stack interrupts much in practice. The PIC will
not let same- or lower-priority interrupts in until we have completed
one. However, the timer/decrementer does not go through the PIC, so I
think what happens is this: we get a hw IRQ; on the way back, just
before returning from do_IRQ (so we have completed the IRQ from the
PIC's standpoint), we go into softirqs, at which point, deep inside
SCSI, we get another HW IRQ and we stack a decrementer interrupt on
top of it.

Now, we should do stack switching for both HW IRQs and softirqs with
CONFIG_IRQSTACKS, which should significantly alleviate the problem.
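As a rough illustration of what that stack switching buys (a hedged
sketch only; call_on_stack() and __handle_irq() are hypothetical
stand-ins for the real per-arch assembly stub and handler, not the
actual arch/powerpc code):

#include <linux/sched.h>
#include <linux/smp.h>
#include <asm/ptrace.h>

/* hypothetical: switches the stack pointer to 'stack', calls fn(regs) */
void call_on_stack(struct thread_info *stack,
		   void (*fn)(struct pt_regs *), struct pt_regs *regs);
void __handle_irq(struct pt_regs *regs);	/* hypothetical handler */

static struct thread_info *hardirq_ctx[NR_CPUS];  /* per-CPU irq stacks */

void do_IRQ(struct pt_regs *regs)
{
	struct thread_info *irqtp = hardirq_ctx[smp_processor_id()];

	/*
	 * Run the handler on the dedicated per-CPU irq stack, so
	 * interrupt processing stops eating into whatever task stack
	 * we happened to interrupt.
	 */
	call_on_stack(irqtp, __handle_irq, regs);
}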

Your second trace also shows how horrible the stack traces can be when
the device model kicks in, i.e., register -> probe -> register
sub-device -> etc. That isn't going to be nice on x86 with 4k stacks
either.

I wonder if we should generally recommend that drivers of "bus" devices
not register sub-devices from their own probe() routine, but defer that
to a kernel thread, because the stacking can get pretty bad. I mean,
nobody's done SATA over USB yet, but heh :-)
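A minimal sketch of that deferral, assuming a workqueue rather than a
hand-rolled kernel thread; the struct and function names here are
illustrative, not from any real driver:

#include <linux/device.h>
#include <linux/workqueue.h>

struct my_bus_dev {
	struct device *dev;
	struct work_struct register_work;
};

static void my_register_subdevs(struct work_struct *work)
{
	struct my_bus_dev *bdev =
		container_of(work, struct my_bus_dev, register_work);

	/* register child devices here, from the shallow stack of a
	 * worker thread instead of the deep probe() call chain */
}

static int my_probe(struct device *dev)
{
	struct my_bus_dev *bdev = dev_get_drvdata(dev);

	INIT_WORK(&bdev->register_work, my_register_subdevs);
	schedule_work(&bdev->register_work);	/* defer, don't recurse */
	return 0;
}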

Cheers,
Ben.



* Re: Large stack usage in fs code (especially for PPC64)
  2008-11-17 22:16   ` Paul Mackerras
@ 2008-11-17 23:30     ` Steven Rostedt
  0 siblings, 0 replies; 35+ messages in thread
From: Steven Rostedt @ 2008-11-17 23:30 UTC (permalink / raw)
  To: Paul Mackerras
  Cc: LKML, Benjamin Herrenschmidt, linuxppc-dev, Linus Torvalds,
	Andrew Morton, Ingo Molnar, Thomas Gleixner


On Tue, 18 Nov 2008, Paul Mackerras wrote:

> Steven Rostedt writes:
> 
> > Here's my stack after boot up with CONFIG_IRQSTACKS set. Seems that 
> > softirqs still use the same stack as the process.
> 
> They shouldn't.  I don't see do_softirq in the trace, though.  Which
> functions did you think would be run in a softirq?  It looks to me
> like the deepest 10 or so functions are called at hard irq level,
> within hrtimer_interrupt called from timer_interrupt.
> 
> > root@electra ~> cat /debug/tracing/stack_trace 
> >         Depth   Size      Location    (59 entries)
> >         -----   ----      --------
> >   0)    12384     192   ftrace_call+0x4/0x14
> >   1)    12192     128   .sched_clock+0x20/0x60
> >   2)    12064     128   .sched_clock_cpu+0x34/0x50
> >   3)    11936     144   .cpu_clock+0x3c/0xa0
> >   4)    11792     144   .get_timestamp+0x2c/0x50
> >   5)    11648     144   .__touch_softlockup_watchdog+0x3c/0x60
> >   6)    11504     192   .softlockup_tick+0xe4/0x220
> >   7)    11312     128   .run_local_timers+0x34/0x50
> >   8)    11184     160   .update_process_times+0x44/0xb0
> >   9)    11024     176   .tick_sched_timer+0x8c/0x120
> >  10)    10848     160   .__run_hrtimer+0xd8/0x130
> >  11)    10688     240   .hrtimer_interrupt+0x16c/0x220
> >  12)    10448     160   .timer_interrupt+0xcc/0x110
> >  13)    10288      96   decrementer_common+0xe0/0x100

Ah, you are right. I thought I saw softirq code in there, but must have 
been seeing things. I need to get my eyesight checked. The other day I 
totally botched a hand at cards because I thought I had 3 Aces of 
diamonds when I really only had 2 Aces of diamonds and an Ace of hearts.

-- Steve



* Re: Large stack usage in fs code (especially for PPC64)
  2008-11-17 23:28     ` Linus Torvalds
@ 2008-11-18  0:11       ` Paul Mackerras
  2008-11-18  2:08         ` Linus Torvalds
  2008-11-18  7:25       ` Benjamin Herrenschmidt
  1 sibling, 1 reply; 35+ messages in thread
From: Paul Mackerras @ 2008-11-18  0:11 UTC (permalink / raw)
  To: Linus Torvalds
  Cc: Benjamin Herrenschmidt, Steven Rostedt, LKML, linuxppc-dev,
	Andrew Morton, Ingo Molnar, Thomas Gleixner

Linus Torvalds writes:

> > > It's made worse by the fact that they 
> > > also have horribly bad TLB fills on their broken CPU's, and years and 
> > > years of telling people that the MMU on ppc's are sh*t has only been 
> > > reacted to with "talk to the hand, we know better".
> > 
> > Who are you talking about here precisely? I don't think either Paul or
> > I ever said anything along those lines ... Oh well.
> 
> Every single time I've complained about it, somebody from IBM has said ".. 
> but but AIX".
> 
> This time it was Paul. Sometimes it has been software people who agree, 

Maybe I wasn't clear.  I was reporting the reaction I get from the
hardware designers, who often like to push off the hard problems to
software to fix.

Keep flaming, by the way.  It provides me with some amount of
assistance in my quest to get the hardware designers to do better. ;)

Also, you didn't respond to my comments about the purely software
benefits of a larger page size.

Paul.


* Re: Large stack usage in fs code (especially for PPC64)
  2008-11-17 21:18   ` Linus Torvalds
  2008-11-17 21:25     ` Pekka Enberg
@ 2008-11-18  0:54     ` Steven Rostedt
  2008-11-18  1:05       ` Paul Mackerras
  1 sibling, 1 reply; 35+ messages in thread
From: Steven Rostedt @ 2008-11-18  0:54 UTC (permalink / raw)
  To: Linus Torvalds
  Cc: LKML, Paul Mackerras, Benjamin Herrenschmidt, linuxppc-dev,
	Andrew Morton, Ingo Molnar, Thomas Gleixner


On Mon, 17 Nov 2008, Linus Torvalds wrote:
> 
> I do wonder just _what_ it is that causes the stack frames to be so 
> horrid. For example, you have
> 
> 	 18)     8896     160   .kmem_cache_alloc+0xfc/0x140
> 
> and I'm looking at my x86-64 compile, and it has a stack frame of just 8 
> bytes (!) for local variables plus the save/restore area (which looks like 
> three registers plus frame pointer plus return address). IOW, if I'm 
> looking at the code right (so big caveat: I did _not_ do a real stack 
> dump!) the x86-64 stack cost for that same function is on the order of 48 
> bytes. Not 160.

Out of curiosity, I just ran stack_trace on the latest version of git 
(pulled sometime today) and ran it on my x86_64.

I have SLUB and SLUB debug defined, and here's what I found:

 11)     3592      64   kmem_cache_alloc+0x64/0xa3

64 bytes, still much lower than the 160 of PPC64.

Just to see where it got that number, I ran objdump on slub.o.

000000000000300e <kmem_cache_alloc_node>:
    300e:       55                      push   %rbp
    300f:       48 89 e5                mov    %rsp,%rbp
    3012:       41 57                   push   %r15
    3014:       41 56                   push   %r14
    3016:       41 55                   push   %r13
    3018:       41 54                   push   %r12
    301a:       53                      push   %rbx
    301b:       48 83 ec 08             sub    $0x8,%rsp

Six words pushed, plus the 8 byte stack frame.

  6*8+8 = 56

But we also add the push of the function return address, which is another 
8 bytes, giving us 64 bytes.
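Incidentally, the tree already has a helper for this kind of spot
check; something like

  objdump -d mm/slub.o | perl scripts/checkstack.pl x86_64

(or just "make checkstack" for the whole vmlinux) prints per-function
frame sizes without hand-decoding prologues.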

Here's the complete dump:

[root@bxrhel51 linux-compile.git]# cat /debug/tracing/stack_trace 
        Depth   Size      Location    (54 entries)
        -----   ----      --------
  0)     4504      48   __mod_zone_page_state+0x59/0x68
  1)     4456      64   __rmqueue_smallest+0xa0/0xd9
  2)     4392      80   __rmqueue+0x24/0x172
  3)     4312      96   rmqueue_bulk+0x57/0xa3
  4)     4216     224   get_page_from_freelist+0x371/0x6e1
  5)     3992     160   __alloc_pages_internal+0xe0/0x3f8
  6)     3832      16   __alloc_pages_nodemask+0xe/0x10
  7)     3816      48   alloc_pages_current+0xbe/0xc7
  8)     3768      16   alloc_slab_page+0x28/0x34
  9)     3752      64   new_slab+0x4a/0x1bb
 10)     3688      96   __slab_alloc+0x203/0x364
 11)     3592      64   kmem_cache_alloc+0x64/0xa3
 12)     3528      48   alloc_buffer_head+0x22/0x9d
 13)     3480      64   alloc_page_buffers+0x2f/0xd1
 14)     3416      48   create_empty_buffers+0x22/0xb5
 15)     3368     176   block_read_full_page+0x6b/0x25c
 16)     3192      16   blkdev_readpage+0x18/0x1a
 17)     3176      64   read_cache_page_async+0x85/0x11a
 18)     3112      32   read_cache_page+0x13/0x48
 19)     3080      48   read_dev_sector+0x36/0xaf
 20)     3032      96   read_lba+0x51/0xb0
 21)     2936     176   efi_partition+0x92/0x585
 22)     2760     128   rescan_partitions+0x173/0x308
 23)     2632      96   __blkdev_get+0x22a/0x2ea
 24)     2536      16   blkdev_get+0x10/0x12
 25)     2520      80   register_disk+0xe5/0x14a
 26)     2440      48   add_disk+0xbd/0x11f
 27)     2392      96   sd_probe+0x2c5/0x3ad
 28)     2296      48   driver_probe_device+0xc5/0x14c
 29)     2248      16   __device_attach+0xe/0x10
 30)     2232      64   bus_for_each_drv+0x56/0x8d
 31)     2168      48   device_attach+0x68/0x7f
 32)     2120      32   bus_attach_device+0x2d/0x5e
 33)     2088     112   device_add+0x45f/0x5d3
 34)     1976      64   scsi_sysfs_add_sdev+0xb7/0x246
 35)     1912     336   scsi_probe_and_add_lun+0x9f9/0xad7
 36)     1576      80   __scsi_add_device+0xb6/0xe5
 37)     1496      80   ata_scsi_scan_host+0x9e/0x1c6
 38)     1416      64   ata_host_register+0x238/0x252
 39)     1352      96   ata_pci_sff_activate_host+0x1a1/0x1ce
 40)     1256     256   piix_init_one+0x646/0x6b2
 41)     1000     144   pci_device_probe+0xc9/0x120
 42)      856      48   driver_probe_device+0xc5/0x14c
 43)      808      48   __driver_attach+0x67/0x91
 44)      760      64   bus_for_each_dev+0x54/0x84
 45)      696      16   driver_attach+0x21/0x23
 46)      680      64   bus_add_driver+0xba/0x204
 47)      616      64   driver_register+0x98/0x110
 48)      552      48   __pci_register_driver+0x6b/0xa4
 49)      504      16   piix_init+0x19/0x2c
 50)      488     272   _stext+0x5d/0x145
 51)      216      32   kernel_init+0x127/0x17a
 52)      184     184   child_rip+0xa/0x11

-- Steve



* Re: Large stack usage in fs code (especially for PPC64)
  2008-11-18  0:54     ` Steven Rostedt
@ 2008-11-18  1:05       ` Paul Mackerras
  2008-11-18  1:41         ` Steven Rostedt
  0 siblings, 1 reply; 35+ messages in thread
From: Paul Mackerras @ 2008-11-18  1:05 UTC (permalink / raw)
  To: Steven Rostedt
  Cc: Linus Torvalds, LKML, Benjamin Herrenschmidt, linuxppc-dev,
	Andrew Morton, Ingo Molnar, Thomas Gleixner

Steve,

> On Mon, 17 Nov 2008, Linus Torvalds wrote:
> > 
> > I do wonder just _what_ it is that causes the stack frames to be so 
> > horrid. For example, you have
> > 
> > 	 18)     8896     160   .kmem_cache_alloc+0xfc/0x140
> > 
> > and I'm looking at my x86-64 compile, and it has a stack frame of just 8 
> > bytes (!) for local variables plus the save/restore area (which looks like 
> > three registers plus frame pointer plus return address). IOW, if I'm 
> > looking at the code right (so big caveat: I did _not_ do a real stack 
> > dump!) the x86-64 stack cost for that same function is on the order of 48 
> > bytes. Not 160.
> 
> Out of curiosity, I just ran stack_trace on the latest version of git 
> (pulled sometime today) and ran it on my x86_64.
> 
> I have SLUB and SLUB debug defined, and here's what I found:
> 
>  11)     3592      64   kmem_cache_alloc+0x64/0xa3
> 
> 64 bytes, still much lower than the 160 of PPC64.

The ppc64 ABI has a minimum stack frame of 112 bytes, due to having an
area for called functions to store their parameters (64 bytes) plus 6
slots for saving stuff and for the compiler and linker to use if they
need to.  That's before any local variables are allocated.
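From memory of the 64-bit PowerPC ELF ABI (so treat the exact offsets
as a sketch), those 112 bytes break down as follows, as offsets from
the stack pointer:

    0   back chain (caller's SP)
    8   CR save area
   16   LR save area
   24   reserved for the compiler
   32   reserved for the linker
   40   TOC save area
   48   parameter save area (8 doublewords = 64 bytes)

i.e. six doubleword slots (48 bytes) plus the 64-byte parameter area,
all before the function stores a single local.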

The ppc32 ABI has a minimum stack frame of 16 bytes, which is much
nicer, at the expense of a much more complicated va_arg().

Paul.


* Re: Large stack usage in fs code (especially for PPC64)
  2008-11-18  1:05       ` Paul Mackerras
@ 2008-11-18  1:41         ` Steven Rostedt
  2008-11-18  2:01           ` Steven Rostedt
  0 siblings, 1 reply; 35+ messages in thread
From: Steven Rostedt @ 2008-11-18  1:41 UTC (permalink / raw)
  To: Paul Mackerras
  Cc: Linus Torvalds, LKML, Benjamin Herrenschmidt, linuxppc-dev,
	Andrew Morton, Ingo Molnar, Thomas Gleixner


On Tue, 18 Nov 2008, Paul Mackerras wrote:
> > 
> > 64 bytes, still much lower than the 160 of PPC64.
> 
> The ppc64 ABI has a minimum stack frame of 112 bytes, due to having an
> area for called functions to store their parameters (64 bytes) plus 6
> slots for saving stuff and for the compiler and linker to use if they
> need to.  That's before any local variables are allocated.
> 
> The ppc32 ABI has a minimum stack frame of 16 bytes, which is much
> nicer, at the expense of a much more complicated va_arg().

Here's a full dump that I got from my 32-bit PowerBook:

rostedt@gollum:~$ cat /debug/tracing/stack_trace 
        Depth   Size      Location    (57 entries)
        -----   ----      --------
  0)     3196      48   ftrace_call+0x4/0x48
  1)     3148      48   update_curr+0xa0/0x110
  2)     3100      48   enqueue_task_fair+0x54/0xfc
  3)     3052      32   enqueue_task+0x34/0x54
  4)     3020      16   activate_task+0x40/0x64
  5)     3004      48   try_to_wake_up+0x78/0x164
  6)     2956      16   default_wake_function+0x28/0x40
  7)     2940      48   __wake_up_common+0x54/0xac
  8)     2892      32   __wake_up_sync+0x58/0x94
  9)     2860      16   sock_def_readable+0x58/0xb8
 10)     2844      32   sock_queue_rcv_skb+0x10c/0x128
 11)     2812      32   __udp_queue_rcv_skb+0x2c/0xc4
 12)     2780      32   udp_queue_rcv_skb+0x1d8/0x27c
 13)     2748      96   __udp4_lib_rcv+0x514/0x77c
 14)     2652      16   udp_rcv+0x30/0x48
 15)     2636      32   ip_local_deliver_finish+0xf4/0x1cc
 16)     2604      16   ip_local_deliver+0x94/0xac
 17)     2588      64   ip_rcv_finish+0x2ec/0x31c
 18)     2524      32   ip_rcv+0x23c/0x278
 19)     2492      48   netif_receive_skb+0x48c/0x4c0
 20)     2444      48   process_backlog+0xb0/0x144
 21)     2396      48   net_rx_action+0x8c/0x1bc
 22)     2348      48   __do_softirq+0x84/0x10c
 23)     2300      16   do_softirq+0x50/0x6c
 24)     2284      16   local_bh_enable+0x90/0xc8
 25)     2268      32   dev_queue_xmit+0x564/0x5b0
 26)     2236      48   ip_finish_output+0x2b4/0x2f4
 27)     2188      32   ip_output+0xec/0x104
 28)     2156      16   ip_local_out+0x44/0x5c
 29)     2140      48   ip_push_pending_frames+0x340/0x3ec
 30)     2092      48   udp_push_pending_frames+0x2ac/0x348
 31)     2044     160   udp_sendmsg+0x458/0x5bc
 32)     1884      32   inet_sendmsg+0x70/0x88
 33)     1852     240   sock_sendmsg+0xd0/0xf8
 34)     1612      32   kernel_sendmsg+0x3c/0x58
 35)     1580      96   xs_send_kvec+0xb4/0xcc [sunrpc]
 36)     1484      64   xs_sendpages+0x94/0x1f0 [sunrpc]
 37)     1420      32   xs_udp_send_request+0x44/0x14c [sunrpc]
 38)     1388      32   xprt_transmit+0x100/0x228 [sunrpc]
 39)     1356      32   call_transmit+0x234/0x290 [sunrpc]
 40)     1324      48   __rpc_execute+0xcc/0x2d4 [sunrpc]
 41)     1276      32   rpc_execute+0x48/0x60 [sunrpc]
 42)     1244      32   rpc_run_task+0x74/0x90 [sunrpc]
 43)     1212      64   rpc_call_sync+0x6c/0xa0 [sunrpc]
 44)     1148      80   rpcb_register_call+0xb0/0x110 [sunrpc]
 45)     1068      96   rpcb_register+0xe8/0x100 [sunrpc]
 46)      972      80   svc_register+0x118/0x168 [sunrpc]
 47)      892      64   svc_setup_socket+0xa0/0x354 [sunrpc]
 48)      828     272   svc_create_socket+0x1e0/0x250 [sunrpc]
 49)      556      16   svc_udp_create+0x38/0x50 [sunrpc]
 50)      540      96   svc_create_xprt+0x1c8/0x2fc [sunrpc]
 51)      444      48   lockd_up+0xcc/0x27c [lockd]
 52)      396      96   write_ports+0xf4/0x2cc [nfsd]
 53)      300      32   nfsctl_transaction_write+0x78/0xbc [nfsd]
 54)      268      32   vfs_write+0xc8/0x178
 55)      236      48   sys_write+0x5c/0xa0
 56)      188     188   ret_from_syscall+0x0/0x40

And here's my i386 max stack:

[root@mirhel51 ~]# cat /debug/tracing/stack_trace 
        Depth   Size      Location    (47 entries)
        -----   ----      --------
  0)     2216     240   blk_recount_segments+0x39/0x51
  1)     1976      12   bio_phys_segments+0x16/0x1c
  2)     1964      20   blk_rq_bio_prep+0x29/0xae
  3)     1944      16   init_request_from_bio+0xc9/0xcd
  4)     1928      60   __make_request+0x2f6/0x3c1
  5)     1868     136   generic_make_request+0x36a/0x3a0
  6)     1732      56   submit_bio+0xcd/0xd8
  7)     1676      28   submit_bh+0xd1/0xf0
  8)     1648     112   block_read_full_page+0x299/0x2a9
  9)     1536       8   blkdev_readpage+0x14/0x16
 10)     1528      28   read_cache_page_async+0x7e/0x109
 11)     1500      16   read_cache_page+0x11/0x49
 12)     1484      32   read_dev_sector+0x3c/0x72
 13)     1452      48   read_lba+0x4d/0xaa
 14)     1404     168   efi_partition+0x85/0x61b
 15)     1236      84   rescan_partitions+0x14b/0x316
 16)     1152      40   __blkdev_get+0x1b2/0x258
 17)     1112       8   blkdev_get+0xf/0x11
 18)     1104      36   register_disk+0xbc/0x112
 19)     1068      32   add_disk+0x9f/0xf3
 20)     1036      48   sd_probe+0x2d4/0x394
 21)      988      20   driver_probe_device+0xa5/0x120
 22)      968       8   __device_attach+0xd/0xf
 23)      960      28   bus_for_each_drv+0x3e/0x67
 24)      932      24   device_attach+0x56/0x6a
 25)      908      16   bus_attach_device+0x26/0x4d
 26)      892      64   device_add+0x380/0x4aa
 27)      828      28   scsi_sysfs_add_sdev+0xa1/0x1c9
 28)      800     160   scsi_probe_and_add_lun+0x97e/0xa8f
 29)      640      36   __scsi_add_device+0x88/0xae
 30)      604      40   ata_scsi_scan_host+0x8b/0x1cd
 31)      564      28   ata_host_register+0x1da/0x1ea
 32)      536      24   ata_host_activate+0x98/0xb5
 33)      512     188   ahci_init_one+0xa23/0xa4f
 34)      324      20   pci_device_probe+0x3e/0x5e
 35)      304      20   driver_probe_device+0xa5/0x120
 36)      284      20   __driver_attach+0x51/0x70
 37)      264      32   bus_for_each_dev+0x40/0x62
 38)      232      12   driver_attach+0x19/0x1b
 39)      220      28   bus_add_driver+0x9c/0x1ac
 40)      192      28   driver_register+0x76/0xd2
 41)      164      20   __pci_register_driver+0x44/0x70
 42)      144       8   ahci_init+0x14/0x16
 43)      136      92   _stext+0x4f/0x116
 44)       44      16   kernel_init+0xf7/0x145
 45)       28      28   kernel_thread_helper+0x7/0x10

Note, the i386 box had 4KSTACKS defined and the PPC did not have 
IRQSTACKS defined. I'll compile with IRQSTACKS defined and post that 
result.

-- Steve



* Re: Large stack usage in fs code (especially for PPC64)
  2008-11-18  1:41         ` Steven Rostedt
@ 2008-11-18  2:01           ` Steven Rostedt
  0 siblings, 0 replies; 35+ messages in thread
From: Steven Rostedt @ 2008-11-18  2:01 UTC (permalink / raw)
  To: Paul Mackerras
  Cc: Linus Torvalds, LKML, Benjamin Herrenschmidt, linuxppc-dev,
	Andrew Morton, Ingo Molnar, Thomas Gleixner


On Mon, 17 Nov 2008, Steven Rostedt wrote:
> 
> And here's my i386 max stack:
> 
> [root@mirhel51 ~]# cat /debug/tracing/stack_trace 
>         Depth   Size      Location    (47 entries)
>         -----   ----      --------
>   0)     2216     240   blk_recount_segments+0x39/0x51
>   1)     1976      12   bio_phys_segments+0x16/0x1c
>   2)     1964      20   blk_rq_bio_prep+0x29/0xae
>   3)     1944      16   init_request_from_bio+0xc9/0xcd
>   4)     1928      60   __make_request+0x2f6/0x3c1
>   5)     1868     136   generic_make_request+0x36a/0x3a0
>   6)     1732      56   submit_bio+0xcd/0xd8
>   7)     1676      28   submit_bh+0xd1/0xf0
>   8)     1648     112   block_read_full_page+0x299/0x2a9
>   9)     1536       8   blkdev_readpage+0x14/0x16
>  10)     1528      28   read_cache_page_async+0x7e/0x109
>  11)     1500      16   read_cache_page+0x11/0x49
>  12)     1484      32   read_dev_sector+0x3c/0x72
>  13)     1452      48   read_lba+0x4d/0xaa
>  14)     1404     168   efi_partition+0x85/0x61b
>  15)     1236      84   rescan_partitions+0x14b/0x316
>  16)     1152      40   __blkdev_get+0x1b2/0x258
>  17)     1112       8   blkdev_get+0xf/0x11
>  18)     1104      36   register_disk+0xbc/0x112
>  19)     1068      32   add_disk+0x9f/0xf3
>  20)     1036      48   sd_probe+0x2d4/0x394
>  21)      988      20   driver_probe_device+0xa5/0x120
>  22)      968       8   __device_attach+0xd/0xf
>  23)      960      28   bus_for_each_drv+0x3e/0x67
>  24)      932      24   device_attach+0x56/0x6a
>  25)      908      16   bus_attach_device+0x26/0x4d
>  26)      892      64   device_add+0x380/0x4aa
>  27)      828      28   scsi_sysfs_add_sdev+0xa1/0x1c9
>  28)      800     160   scsi_probe_and_add_lun+0x97e/0xa8f
>  29)      640      36   __scsi_add_device+0x88/0xae
>  30)      604      40   ata_scsi_scan_host+0x8b/0x1cd
>  31)      564      28   ata_host_register+0x1da/0x1ea
>  32)      536      24   ata_host_activate+0x98/0xb5
>  33)      512     188   ahci_init_one+0xa23/0xa4f
>  34)      324      20   pci_device_probe+0x3e/0x5e
>  35)      304      20   driver_probe_device+0xa5/0x120
>  36)      284      20   __driver_attach+0x51/0x70
>  37)      264      32   bus_for_each_dev+0x40/0x62
>  38)      232      12   driver_attach+0x19/0x1b
>  39)      220      28   bus_add_driver+0x9c/0x1ac
>  40)      192      28   driver_register+0x76/0xd2
>  41)      164      20   __pci_register_driver+0x44/0x70
>  42)      144       8   ahci_init+0x14/0x16
>  43)      136      92   _stext+0x4f/0x116
>  44)       44      16   kernel_init+0xf7/0x145
>  45)       28      28   kernel_thread_helper+0x7/0x10
> 
> Note, the i386 box had 4KSTACKS defined and the PPC did not have 
> IRQSTACKS defined. I'll compile with IRQSTACKS defined and post that 
> result.

Here's the PPC32 with IRQSTACKS on:

rostedt@gollum:~$ cat /debug/tracing/stack_trace 
        Depth   Size      Location    (42 entries)
        -----   ----      --------
  0)     2940      48   ftrace_call+0x4/0x48
  1)     2892      16   ide_map_sg+0x4c/0x88
  2)     2876      32   ide_build_sglist+0x30/0xa0
  3)     2844      64   pmac_ide_dma_setup+0x90/0x2a0
  4)     2780      48   do_rw_taskfile+0x250/0x2a0
  5)     2732      96   ide_do_rw_disk+0x290/0x2f8
  6)     2636      16   ide_gd_do_request+0x30/0x48
  7)     2620     176   ide_do_request+0x8e8/0xa70
  8)     2444      16   do_ide_request+0x34/0x4c
  9)     2428      16   __generic_unplug_device+0x4c/0x64
 10)     2412      16   generic_unplug_device+0x4c/0x90
 11)     2396      16   blk_unplug+0x34/0x4c
 12)     2380      80   dm_table_unplug_all+0x54/0xc0 [dm_mod]
 13)     2300      16   dm_unplug_all+0x34/0x54 [dm_mod]
 14)     2284      16   blk_unplug+0x34/0x4c
 15)     2268      16   blk_backing_dev_unplug+0x28/0x40
 16)     2252      16   sync_buffer+0x60/0x80
 17)     2236      48   __wait_on_bit+0x78/0xd4
 18)     2188      80   out_of_line_wait_on_bit+0x88/0xa0
 19)     2108      16   __wait_on_buffer+0x40/0x58
 20)     2092      16   __bread+0xbc/0xf0
 21)     2076      48   ext3_get_branch+0x90/0x118 [ext3]
 22)     2028     208   ext3_get_blocks_handle+0xac/0xa10 [ext3]
 23)     1820      48   ext3_get_block+0xc4/0x114 [ext3]
 24)     1772     160   do_mpage_readpage+0x26c/0x6f4
 25)     1612     144   mpage_readpages+0xe8/0x148
 26)     1468      16   ext3_readpages+0x38/0x50 [ext3]
 27)     1452      80   __do_page_cache_readahead+0x19c/0x248
 28)     1372      32   do_page_cache_readahead+0x80/0x98
 29)     1340      80   filemap_fault+0x1b0/0x444
 30)     1260      80   __do_fault+0x68/0x61c
 31)     1180      64   handle_mm_fault+0x500/0xafc
 32)     1116     176   do_page_fault+0x2d4/0x450
 33)      940      84   handle_page_fault+0xc/0x80
 34)      856     140   0xdeaffa78
 35)      716     128   load_elf_binary+0x6a0/0x1048
 36)      588      64   search_binary_handler+0x10c/0x310
 37)      524     176   load_script+0x248/0x268
 38)      348      64   search_binary_handler+0x10c/0x310
 39)      284      64   do_execve+0x15c/0x210
 40)      220      32   sys_execve+0x68/0x98
 41)      188     188   ret_from_syscall+0x0/0x40

It does indeed help.

-- Steve



* Re: Large stack usage in fs code (especially for PPC64)
  2008-11-18  0:11       ` Paul Mackerras
@ 2008-11-18  2:08         ` Linus Torvalds
  2008-11-18 10:24           ` Nick Piggin
  0 siblings, 1 reply; 35+ messages in thread
From: Linus Torvalds @ 2008-11-18  2:08 UTC (permalink / raw)
  To: Paul Mackerras
  Cc: Benjamin Herrenschmidt, Steven Rostedt, LKML, linuxppc-dev,
	Andrew Morton, Ingo Molnar, Thomas Gleixner



On Tue, 18 Nov 2008, Paul Mackerras wrote:
> 
> Also, you didn't respond to my comments about the purely software
> benefits of a larger page size.

I realize that there are benefits. It's just that the downsides tend to 
swamp the upsides.

The fact is, Intel (and to a lesser degree, AMD) has shown how hardware 
can do good TLB's with essentially gang lookups, giving almost effective 
page sizes of 32kB with hardly any of the downsides. Couple that with 
low-latency fault handling (not for when you miss in the TLB, but for 
when something really isn't in the page tables), and it seems to seldom 
be the biggest issue.

(Don't get me wrong - TLB's are not unimportant on x86 either. But on x86, 
things are generally much better).

Yes, we could prefill the page tables and do other things; but
ultimately, by virtue of big pages meaning you don't need to, some
loads will always benefit from just making the page size larger.

But the people who advocate large pages seem to never really face the 
downsides. They talk about their single loads, and optimize for that and 
nothing else. They don't seem to even acknowledge the fact that a 64kB 
page size is simply NOT EVEN REMOTELY ACCEPTABLE for other loads!

That's what gets to me. These absolute -idiots- talk about how they win 5% 
on some (important, for them) benchmark by doing large pages, but then 
ignore the fact that on other real-world loads they lose by several 
HUNDRED percent because of the memory fragmentation costs.

(And btw, if they win more than 5%, it's because the hardware sucks really 
badly).

THAT is what irritates me.

What also irritates me is the ".. but AIX" argument. The fact is, the AIX 
memory management is very tightly tied to one particular broken MMU model. 
Linux supports something like thirty architectures, and while PPC may be 
one of the top ones, it is NOT EVEN CLOSE to being really relevant.

So ".. but AIX" simply doesn't matter. The Linux VM has other priorities.

And I _guarantee_ that in general, in the high-volume market (which is 
what drives things, like it or not), page sizes will not be growing. In 
that market, terabytes of RAM is not the primary case, and small files 
that want mmap are one _very_ common case.

To make things worse, the biggest performance market has another vendor 
that hasn't been saying ".. but AIX" for the last decade, and that 
actually listens to input. And, perhaps not incidentally, outperforms the 
highest-performance ppc64 chips mostly by a huge margin - while selling 
their chips for a fraction of the price.

I realize that this may be hard to accept for some people. But somebody 
who says "... but AIX" should be taking a damn hard look in the mirror, 
and ask themselves some really tough questions. Because quite frankly, the 
"..but AIX" market isn't the most interesting one.

			Linus


* Re: Large stack usage in fs code (especially for PPC64)
  2008-11-17 20:59 ` Steven Rostedt
                     ` (2 preceding siblings ...)
  2008-11-17 23:04   ` Benjamin Herrenschmidt
@ 2008-11-18  2:29   ` Steven Rostedt
  2008-11-18  2:36     ` Paul Mackerras
  3 siblings, 1 reply; 35+ messages in thread
From: Steven Rostedt @ 2008-11-18  2:29 UTC (permalink / raw)
  To: LKML
  Cc: Paul Mackerras, Benjamin Herrenschmidt, linuxppc-dev,
	Linus Torvalds, Andrew Morton, Ingo Molnar, Thomas Gleixner


On Mon, 17 Nov 2008, Steven Rostedt wrote:
> 
> On Mon, 17 Nov 2008, Steven Rostedt wrote:
> > 
> > Note, I was using a default config that had CONFIG_IRQSTACKS off and
> > CONFIG_PPC_64K_PAGES on.
> 
> Here's my stack after boot up with CONFIG_IRQSTACKS set. Seems that 
> softirqs still use the same stack as the process.
> 

By the way, my box has been running stable ever since I switched to 
CONFIG_IRQSTACKS.

-- Steve



* Re: Large stack usage in fs code (especially for PPC64)
  2008-11-18  2:29   ` Steven Rostedt
@ 2008-11-18  2:36     ` Paul Mackerras
  2008-11-18  5:40       ` David Miller
  0 siblings, 1 reply; 35+ messages in thread
From: Paul Mackerras @ 2008-11-18  2:36 UTC (permalink / raw)
  To: Steven Rostedt
  Cc: LKML, Benjamin Herrenschmidt, linuxppc-dev, Linus Torvalds,
	Andrew Morton, Ingo Molnar, Thomas Gleixner

Steven Rostedt writes:

> By the way, my box has been running stable ever since I switched to 
> CONFIG_IRQSTACKS.

Great.  We probably should remove the config option and just always
use irq stacks.
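If the symbol needs to stay around for the code that tests it, a
minimal sketch of forcing it on (assuming it lives in
arch/powerpc/Kconfig) would be:

config IRQSTACKS
	def_bool y

i.e. keep the option for the #ifdefs, but stop asking the user.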

Paul.


* Re: Large stack usage in fs code (especially for PPC64)
  2008-11-18  2:36     ` Paul Mackerras
@ 2008-11-18  5:40       ` David Miller
  0 siblings, 0 replies; 35+ messages in thread
From: David Miller @ 2008-11-18  5:40 UTC (permalink / raw)
  To: paulus; +Cc: rostedt, linux-kernel, linuxppc-dev, tglx, akpm, torvalds, mingo

From: Paul Mackerras <paulus@samba.org>
Date: Tue, 18 Nov 2008 13:36:16 +1100

> Steven Rostedt writes:
> 
> > By-the-way, my box has been running stable ever since I switched to 
> > CONFIG_IRQSTACKS.
> 
> Great.  We probably should remove the config option and just always
> use irq stacks.

That's what I did from the start on sparc64 when I added
irqstacks support.  It's pretty stupid to make it optional
when we know there are failure cases.

For example, is XFS dependent on IRQSTACKS on x86?  It should be, or
even more so with XFS and NFS both enabled at the same time :-)


* Re: Large stack usage in fs code (especially for PPC64)
  2008-11-17 23:28     ` Linus Torvalds
  2008-11-18  0:11       ` Paul Mackerras
@ 2008-11-18  7:25       ` Benjamin Herrenschmidt
  1 sibling, 0 replies; 35+ messages in thread
From: Benjamin Herrenschmidt @ 2008-11-18  7:25 UTC (permalink / raw)
  To: Linus Torvalds
  Cc: Steven Rostedt, LKML, Paul Mackerras, linuxppc-dev,
	Andrew Morton, Ingo Molnar, Thomas Gleixner


> > It makes some sort of sense I suppose on very static embedded workloads
> > with no swap nor demand paging.
> 
> It makes perfect sense for anything that doesn't use any MMU. 

To a certain extent. There are two different aspects to having an MMU,
and in embedded space it's useful to have one and not the other, i.e.,
protection & address-space isolation vs. paging.

Thus, it does make sense for those embedded devices with few files
(mostly a statically linked busybox, a few /dev entries and some app
stuff) to use large page sizes. Of course, as soon as they try to
populate their ramfs with lots of small files they lose... but mostly,
the idea is that the entire working set fits in the TLB and thus the
cost of a TLB miss becomes irrelevant.

Now, regarding the shortcomings of the powerpc server MMU, well, we know
them, we know your opinion and mostly share it, and until we can get the
HW to change we are stuck with it.

Cheers,
Ben.




* Re: Large stack usage in fs code (especially for PPC64)
  2008-11-17 23:03 ` Benjamin Herrenschmidt
@ 2008-11-18  9:53   ` Christoph Hellwig
  2008-11-18 10:37     ` Ingo Molnar
  0 siblings, 1 reply; 35+ messages in thread
From: Christoph Hellwig @ 2008-11-18  9:53 UTC (permalink / raw)
  To: Benjamin Herrenschmidt
  Cc: Steven Rostedt, LKML, Paul Mackerras, linuxppc-dev,
	Linus Torvalds, Andrew Morton, Ingo Molnar, Thomas Gleixner

On Tue, Nov 18, 2008 at 10:03:25AM +1100, Benjamin Herrenschmidt wrote:
> On Mon, 2008-11-17 at 15:34 -0500, Steven Rostedt wrote:
> > Note, I was using a default config that had CONFIG_IRQSTACKS off and
> > CONFIG_PPC_64K_PAGES on.
> 
> For one, we definitely need to turn IRQSTACKS on by default ... In fact,
> I'm pondering just removing the option.

It shouldn't be a user-visible option.  It shouldn't be on x86 either,
but there it always gets into the ideological 4k-stacks flamewar, so
people usually gave up after a while instead of doing the sensible
thing: 8k + irqstacks by default, 4k as an option.



* Re: Large stack usage in fs code (especially for PPC64)
  2008-11-17 22:53   ` Paul Mackerras
@ 2008-11-18 10:07     ` Nick Piggin
  0 siblings, 0 replies; 35+ messages in thread
From: Nick Piggin @ 2008-11-18 10:07 UTC (permalink / raw)
  To: Paul Mackerras
  Cc: Linus Torvalds, Steven Rostedt, LKML, Benjamin Herrenschmidt,
	linuxppc-dev, Andrew Morton, Ingo Molnar, Thomas Gleixner

On Tuesday 18 November 2008 09:53, Paul Mackerras wrote:

> I'd love to be able to use a 4k base page size if I could still get
> the reduction in page faults and the expanded TLB reach that we get
> now with 64k pages.  If we could allocate the page cache for large
> files with order-4 allocations wherever possible that would be a good
> start.

That can still have nasty side-effects like fragmentation and 64k
granular reclaim. It also adds complexity to mapping the pages to
userspace.

Christoph's patchset IIRC also only did page cache, whereas I suppose
your kernbench workload is gaining mainly from anonymous page faults.


* Re: Large stack usage in fs code (especially for PPC64)
  2008-11-18  2:08         ` Linus Torvalds
@ 2008-11-18 10:24           ` Nick Piggin
  2008-11-18 11:44             ` Paul Mackerras
  2008-11-18 16:02             ` Linus Torvalds
  0 siblings, 2 replies; 35+ messages in thread
From: Nick Piggin @ 2008-11-18 10:24 UTC (permalink / raw)
  To: Linus Torvalds
  Cc: Paul Mackerras, Benjamin Herrenschmidt, Steven Rostedt, LKML,
	linuxppc-dev, Andrew Morton, Ingo Molnar, Thomas Gleixner

On Tuesday 18 November 2008 13:08, Linus Torvalds wrote:
> On Tue, 18 Nov 2008, Paul Mackerras wrote:
> > Also, you didn't respond to my comments about the purely software
> > benefits of a larger page size.
>
> I realize that there are benefits. It's just that the downsides tend to
> swamp the upsides.
>
> The fact is, Intel (and to a lesser degree, AMD) has shown how hardware
> can do good TLB's with essentially gang lookups, giving almost effective
> page sizes of 32kB with hardly any of the downsides. Couple that with

It's much harder to do this with powerpc I think because they would need
to calculate 8 hashes and touch 8 cachelines to prefill 8 translations,
wouldn't they?


> low-latency fault handling (not for when you miss in the TLB, but for
> when something really isn't in the page tables), and it seems to seldom
> be the biggest issue.
>
> (Don't get me wrong - TLB's are not unimportant on x86 either. But on x86,
> things are generally much better).

The per-page processing costs are interesting too, but IMO there is more
work that should be done to speed up order-0 pages. The patches I had
that remove the sync instruction for smp_mb() in unlock_page sped up pagecache
throughput (populate, write(2), reclaim) on my G5 by something really
crazy like 50% (most of that's in, but I'm still sitting on that fancy
unlock_page speedup to remove the final smp_mb).
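For readers following along, a simplified sketch (from memory, not an
exact copy of the mainline unlock_page() of the day) of where that
barrier sits:

void unlock_page(struct page *page)
{
	clear_bit_unlock(PG_locked, &page->flags);  /* release PG_locked */
	/*
	 * Order the bit clear against the waiter check in
	 * wake_up_page(); on ppc64 smp_mb__after_clear_bit() is a full
	 * "sync", which is the expensive instruction in question.
	 */
	smp_mb__after_clear_bit();
	wake_up_page(page, PG_locked);
}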

I suspect some of the costs are also in powerpc specific code to insert
linux ptes into their hash table. I think some of the synchronisation for
those could possibly be shared with generic code so you don't need the
extra layer of locks there.


* Re: Large stack usage in fs code (especially for PPC64)
  2008-11-18  9:53   ` Christoph Hellwig
@ 2008-11-18 10:37     ` Ingo Molnar
  0 siblings, 0 replies; 35+ messages in thread
From: Ingo Molnar @ 2008-11-18 10:37 UTC (permalink / raw)
  To: Christoph Hellwig
  Cc: Benjamin Herrenschmidt, Steven Rostedt, LKML, Paul Mackerras,
	linuxppc-dev, Linus Torvalds, Andrew Morton, Thomas Gleixner,
	Arjan van de Ven


* Christoph Hellwig <hch@infradead.org> wrote:

> On Tue, Nov 18, 2008 at 10:03:25AM +1100, Benjamin Herrenschmidt wrote:
> > On Mon, 2008-11-17 at 15:34 -0500, Steven Rostedt wrote:
> > > Note, I was using a default config that had CONFIG_IRQSTACKS off and
> > > CONFIG_PPC_64K_PAGES on.
> > 
> > For one, we definitely need to turn IRQSTACKS on by default ... In fact,
> > I'm pondering just removing the option.
> 
> It shouldn't be a user-visible option.  It shouldn't on x86 either, 
> but there it always gets into the ideological 4k stacks flameware so 
> people usually gave up after a while instead of doing the sensibke 
> 8k + irqstacks by default, 4k as an option..

yep, i tend to agree that 8k + irqstacks on both 32-bit and 64-bit 
would be the sane default. There's just too much gcc and other noise 
for us not to have that cushion by default on 32-bit too. 64-bit is 
already there; on 32-bit we should decouple irqstacks from 4K stacks 
and just turn irqstacks on by default.

Life's too short to fight kernel stack overflows - and now we've got 
the stack-tracer that will record and show the frame of the worst-ever 
kernel stack situation the system was in since bootup. (see Steve's 
tracer output in this thread)

	Ingo


* Re: Large stack usage in fs code (especially for PPC64)
  2008-11-18 10:24           ` Nick Piggin
@ 2008-11-18 11:44             ` Paul Mackerras
  2008-11-18 16:02             ` Linus Torvalds
  1 sibling, 0 replies; 35+ messages in thread
From: Paul Mackerras @ 2008-11-18 11:44 UTC (permalink / raw)
  To: Nick Piggin
  Cc: Linus Torvalds, Benjamin Herrenschmidt, Steven Rostedt, LKML,
	linuxppc-dev, Andrew Morton, Ingo Molnar, Thomas Gleixner

Nick Piggin writes:

> It's much harder to do this with powerpc I think because they would need
> to calculate 8 hashes and touch 8 cachelines to prefill 8 translations,
> wouldn't they?

Yeah, the hashed page table sucks.  Film at 11, as they say. :)

Paul.


* Re: Large stack usage in fs code (especially for PPC64)
  2008-11-18 10:24           ` Nick Piggin
  2008-11-18 11:44             ` Paul Mackerras
@ 2008-11-18 16:02             ` Linus Torvalds
  1 sibling, 0 replies; 35+ messages in thread
From: Linus Torvalds @ 2008-11-18 16:02 UTC (permalink / raw)
  To: Nick Piggin
  Cc: Paul Mackerras, Benjamin Herrenschmidt, Steven Rostedt, LKML,
	linuxppc-dev, Andrew Morton, Ingo Molnar, Thomas Gleixner



On Tue, 18 Nov 2008, Nick Piggin wrote:
> >
> > The fact is, Intel (and to a lesser degree, AMD) has shown how hardware
> > can do good TLB's with essentially gang lookups, giving almost effective
> > page sizes of 32kB with hardly any of the downsides. Couple that with
> 
> It's much harder to do this with powerpc I think because they would need
> to calculate 8 hashes and touch 8 cachelines to prefill 8 translations,
> wouldn't they?

Oh, absolutely. It's why I despise hashed page tables. It's a broken 
concept.

> The per-page processing costs are interesting too, but IMO there is more
> work that should be done to speed up order-0 pages. The patches I had
> that remove the sync instruction for smp_mb() in unlock_page sped up pagecache
> throughput (populate, write(2), reclaim) on my G5 by something really
> crazy like 50% (most of that's in, but I'm still sitting on that fancy
> unlock_page speedup to remove the final smp_mb).
> 
> I suspect some of the costs are also in powerpc specific code to insert
> linux ptes into their hash table. I think some of the synchronisation for
> those could possibly be shared with generic code so you don't need the
> extra layer of locks there.

Yeah, the hashed page tables get extra costs from the fact that they 
can't share the software page tables with the hardware ones, plus the 
associated coherency logic. It's even worse at unmap time, I think.

			Linus

