On Fri, Aug 12, 2016 at 10:54:42AM +1000, Dave Chinner wrote:
> I'm now going to test Christoph's theory that this is an "overwrite
> doing lots of block mapping" issue. More on that to follow.

Ok, so going back to the profiles, I can say it's not an overwrite
issue, because there is delayed allocation showing up in the profile.
Lots of it. Which led me to think "maybe the benchmark is just
completely *dumb*". And, as usual, that's the answer.

Here's the reproducer:

# sudo mkfs.xfs -f -m crc=0 /dev/pmem1
# sudo mount -o noatime /dev/pmem1 /mnt/scratch
# sudo xfs_io -f -c "pwrite 0 512m -b 1" /mnt/scratch/fooey

And here's the profile:

   4.50%  [kernel]  [k] xfs_bmapi_read
   3.64%  [kernel]  [k] __block_commit_write.isra.30
   3.55%  [kernel]  [k] __radix_tree_lookup
   3.46%  [kernel]  [k] up_write
   3.43%  [kernel]  [k] ___might_sleep
   3.09%  [kernel]  [k] entry_SYSCALL_64_fastpath
   3.01%  [kernel]  [k] xfs_iext_bno_to_ext
   3.01%  [kernel]  [k] find_get_entry
   2.98%  [kernel]  [k] down_write
   2.71%  [kernel]  [k] mark_buffer_dirty
   2.52%  [kernel]  [k] __mark_inode_dirty
   2.38%  [kernel]  [k] unlock_page
   2.14%  [kernel]  [k] xfs_break_layouts
   2.07%  [kernel]  [k] xfs_bmapi_update_map
   2.06%  [kernel]  [k] xfs_bmap_search_extents
   2.04%  [kernel]  [k] xfs_iomap_write_delay
   2.00%  [kernel]  [k] generic_write_checks
   1.96%  [kernel]  [k] xfs_bmap_search_multi_extents
   1.90%  [kernel]  [k] __xfs_bmbt_get_all
   1.89%  [kernel]  [k] balance_dirty_pages_ratelimited
   1.82%  [kernel]  [k] wait_for_stable_page
   1.76%  [kernel]  [k] xfs_file_write_iter
   1.68%  [kernel]  [k] xfs_iomap_eof_want_preallocate
   1.68%  [kernel]  [k] xfs_bmapi_delay
   1.67%  [kernel]  [k] iomap_write_actor
   1.60%  [kernel]  [k] xfs_file_buffered_aio_write
   1.56%  [kernel]  [k] __might_sleep
   1.48%  [kernel]  [k] do_raw_spin_lock
   1.44%  [kernel]  [k] generic_write_end
   1.41%  [kernel]  [k] pagecache_get_page
   1.38%  [kernel]  [k] xfs_bmapi_trim_map
   1.21%  [kernel]  [k] __block_write_begin_int
   1.17%  [kernel]  [k] vfs_write
   1.17%  [kernel]  [k] xfs_file_iomap_begin
   1.17%  [kernel]  [k] xfs_bmbt_get_startoff
   1.14%  [kernel]  [k] iomap_apply
   1.08%  [kernel]  [k] xfs_iunlock
   1.08%  [kernel]  [k] iov_iter_copy_from_user_atomic
   0.97%  [kernel]  [k] xfs_file_aio_write_checks
   0.96%  [kernel]  [k] xfs_ilock
   .....

Yeah, I'm doing a sequential write in *1 byte pwrite() calls*.

Ok, so the benchmark isn't /quite/ that abysmally stupid. It's still,
ah, extremely challenged:

	if (NBUFSIZE != 1024) {		/* enforce known block size */
		fprintf(stderr, "NBUFSIZE changed to %d\n", NBUFSIZE);
		exit(1);
	}

i.e. it's hard-coded to do all its "disk" IO in 1k block sizes. Every
read, every write, every file copy, etc. is done with a 1024 byte
buffer. There are lots of loops that look like:

	while (--n) {
		write(fd, nbuf, sizeof nbuf);
	}

where n is the file size specified in the job file. Those loops are
what generates the profile we see: repeated partial page writes that
extend the file.

IOWs, the benchmark is doing exactly what we document in the fstat()
man page *not to do*, as it will cause inefficient IO patterns:

	The st_blksize field gives the "preferred" blocksize for
	efficient filesystem I/O. (Writing to a file in smaller chunks
	may cause an inefficient read-modify-rewrite.)

The smallest we ever set st_blksize to is PAGE_SIZE, so the benchmark
is running well known and documented (at least 10 years ago) slow
paths through the IO stack.

I'm very tempted now to simply say that the aim7 disk benchmark is
showing its age and, as such, the results are not reflective of what
typical applications will see.
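As an aside, here's a minimal sketch of what a well-behaved writer
looks like: size the buffer from fstat()'s st_blksize instead of
hard-coding 1k. This isn't aim7 code; the path and the 512MB size are
just borrowed from the reproducer above:

	#include <stdio.h>
	#include <stdlib.h>
	#include <string.h>
	#include <fcntl.h>
	#include <unistd.h>
	#include <sys/stat.h>

	int main(void)
	{
		struct stat st;
		off_t towrite = 512 * 1024 * 1024;	/* 512MB, as per the reproducer */
		size_t bufsize;
		char *buf;
		int fd;

		fd = open("/mnt/scratch/fooey", O_WRONLY | O_CREAT | O_TRUNC, 0644);
		if (fd < 0 || fstat(fd, &st) < 0) {
			perror("open/fstat");
			exit(1);
		}

		/* preferred IO size for this fs; never smaller than PAGE_SIZE */
		bufsize = st.st_blksize;
		buf = malloc(bufsize);
		if (!buf) {
			perror("malloc");
			exit(1);
		}
		memset(buf, 0x5a, bufsize);

		while (towrite > 0) {
			size_t n = towrite < (off_t)bufsize ?
						(size_t)towrite : bufsize;
			ssize_t ret = write(fd, buf, n);

			if (ret < 0) {
				perror("write");
				exit(1);
			}
			towrite -= ret;
		}
		free(buf);
		close(fd);
		return 0;
	}

Every write is then at least a full page, so the repeated partial-page
writes that dominate the profile above largely disappear.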
Christoph, maybe there's something we can do to only trigger
speculative prealloc growth checks if the new file size crosses the
end of the currently allocated block at the EOF. That would chop out
a fair chunk of the xfs_bmapi_read calls being done in this workload.
I'm not sure how much effort we should spend optimising this slow
path, though....

Cheers,

Dave.
-- 
Dave Chinner
david@fromorbit.com
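[To make that suggestion concrete, the condition in mind is roughly
the untested sketch below. This is not actual XFS code, just the shape
of the check: the helper name and the byte-based arithmetic are made
up, and the real thing would compare against the last allocated extent
rather than just the old i_size.]

	#include <stdbool.h>
	#include <stdint.h>

	/*
	 * Does a write that moves EOF from old_size to new_size push the
	 * file into a filesystem block beyond the one old EOF sits in?
	 * Only then would we need to go look at the extent list for
	 * speculative preallocation sizing.
	 */
	static inline bool eof_crosses_new_block(uint64_t old_size,
						 uint64_t new_size,
						 unsigned int blocksize)
	{
		uint64_t old_last_fsb;
		uint64_t new_last_fsb;

		/* an empty file always needs its first allocation looked at */
		if (!old_size)
			return true;

		old_last_fsb = (old_size - 1) / blocksize;
		new_last_fsb = (new_size - 1) / blocksize;

		return new_last_fsb > old_last_fsb;
	}

With 1k writes into 4k blocks, three out of every four extending
writes would skip the check entirely, which is roughly the chunk of
xfs_bmapi_read overhead the suggestion is aimed at.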