linux-kernel.vger.kernel.org archive mirror
 help / color / mirror / Atom feed
From: Jan Kara <jack@suse.cz>
To: "Kirill A. Shutemov" <kirill@shutemov.name>
Cc: Jan Kara <jack@suse.cz>,
	"Kirill A. Shutemov" <kirill.shutemov@linux.intel.com>,
	"Theodore Ts'o" <tytso@mit.edu>,
	Andreas Dilger <adilger.kernel@dilger.ca>,
	Jan Kara <jack@suse.com>,
	Andrew Morton <akpm@linux-foundation.org>,
	Alexander Viro <viro@zeniv.linux.org.uk>,
	Hugh Dickins <hughd@google.com>,
	Andrea Arcangeli <aarcange@redhat.com>,
	Dave Hansen <dave.hansen@intel.com>,
	Vlastimil Babka <vbabka@suse.cz>,
	Matthew Wilcox <willy@infradead.org>,
	Ross Zwisler <ross.zwisler@linux.intel.com>,
	linux-ext4@vger.kernel.org, linux-fsdevel@vger.kernel.org,
	linux-kernel@vger.kernel.org, linux-mm@kvack.org,
	linux-block@vger.kernel.org
Subject: Re: [PATCHv3 15/41] filemap: handle huge pages in do_generic_file_read()
Date: Thu, 3 Nov 2016 21:40:12 +0100	[thread overview]
Message-ID: <20161103204012.GC24234@quack2.suse.cz> (raw)
In-Reply-To: <20161102083204.GB13949@node.shutemov.name>

On Wed 02-11-16 11:32:04, Kirill A. Shutemov wrote:
> On Tue, Nov 01, 2016 at 05:39:40PM +0100, Jan Kara wrote:
> > On Mon 31-10-16 21:10:35, Kirill A. Shutemov wrote:
> > > > If I understand the motivation right, it is mostly about being able to mmap
> > > > PMD-sized chunks to userspace. So my naive idea would be that we could just
> > > > implement it by allocating PMD sized chunks of pages when adding pages to
> > > > page cache, we don't even have to read them all unless we come from PMD
> > > > fault path.
> > > 
> > > Well, no. We have one PG_{uptodate,dirty,writeback,mappedtodisk,etc}
> > > per-hugepage, one common list of buffer heads...
> > > 
> > > PG_dirty and PG_uptodate behaviour inhered from anon-THP (where handling
> > > it otherwise doesn't make sense) and handling it differently for file-THP
> > > is nightmare from maintenance POV.
> > 
> > But the complexity of two different page sizes for page cache and *each*
> > filesystem that wants to support it does not make the maintenance easy
> > either.
> 
> I think with time we can make small pages just a subcase of huge pages.
> And some generalization can be made once more than one filesystem with
> backing storage will adopt huge pages.

My objection is that IMHO currently the code is too ugly to go in. Too many
places need to know about THP and I'm not even sure you have patched all
the places or whether some corner cases remained unfixed and how should I
find that out.

> > So I'm not convinced that using the same rules for anon-THP and
> > file-THP is a clear win.
> 
> We already have file-THP with the same rules: tmpfs. Backing storage is
> what changes the picture.

Right, the ugliness comes from access to backing storage having to deal
with huge pages.

> > I'd also note that having PMD-sized pages has some obvious disadvantages as
> > well:
> >
> > 1) I'm not sure buffer head handling code will quite scale to 512 or even
> > 2048 buffer_heads on a linked list referenced from a page. It may work but
> > I suspect the performance will suck.
> 
> Yes, buffer_head list doesn't scale. That's the main reason (along with 4)
> why syscall-based IO sucks. We spend a lot of time looking for desired
> block.
> 
> We need to switch to some other data structure for storing buffer_heads.
> Is there a reason why we have list there in first place?
> Why not just array?
> 
> I will look into it, but this sounds like a separate infrastructure change
> project.

As Christoph said iomap code should help you with that and make things
simpler. If things go as we imagine, we should be able to pretty much avoid
buffer heads. But it will take some time to get there.

> > 2) PMD-sized pages result in increased space & memory usage.
> 
> Space? Do you mean disk space? Not really: we still don't write beyond
> i_size or into holes.
> 
> Behaviour wrt to holes may change with mmap()-IO as we have less
> granularity, but the same can be seen just between different
> architectures: 4k vs. 64k base page size.

Yes, I meant different granularity of mmap based IO. And I agree it isn't a
new problem but the scale of the problem is much larger with 2MB pages than
with say 64K pages. And actually the overhead of higher IO granularity of
64K pages has been one of the reasons we have switched SLES PPC kernels
from 64K pages to 4K pages (we've got complaints from customers). 

> > 3) In ext4 we have to estimate how much metadata we may need to modify when
> > allocating blocks underlying a page in the worst case (you don't seem to
> > update this estimate in your patch set). With 2048 blocks underlying a page,
> > each possibly in a different block group, it is a lot of metadata forcing
> > us to reserve a large transaction (not sure if you'll be able to even
> > reserve such large transaction with the default journal size), which again
> > makes things slower.
> 
> I didn't saw this on profiles. And xfstests looks fine. I probably need to
> run them with 1k blocks once again.

You wouldn't see this in profiles - it is a correctness thing. And it won't
be triggered unless the file is heavily fragmented which likely does not
happen with any test in xfstests. If it happens you'll notice though - the
filesystem will just report error and shut itself down.

> The numbers below generated with fio. The working set is relatively small,
> so it fits into page cache and writing set doesn't hit dirty_ratio.
> 
> I think the mmap performance should be enough to justify initial inclusion
> of an experimental feature: it useful for workloads that targets mmap()-IO.
> It will take time to get feature mature anyway.

I agree it will take time for feature to mature so I'me fine with
suboptimal performance in some cases. But I'm not fine with some of the
hacks you do currently because code maintenability is an issue even if
people don't actually use the feature...

> Configuration:
>  - 2x E5-2697v2, 64G RAM;
>  - INTEL SSDSC2CW24;
>  - IO request size is 4k;
>  - 8 processes, 512MB data set each;

The numbers indeed look interesting for mmaped case. Can you post the fio
cmdline? I'd like to compare profiles...

								Honza
> 
> Workload
>  read/write	baseline	stddev	huge=always	stddev		change
> --------------------------------------------------------------------------------
> sync-read
>  read		  21439.00	348.14	  20297.33	259.62		 -5.33%
> sync-write
>  write		   6833.20	147.08	   3630.13	 52.86		-46.88%
> sync-readwrite
>  read		   4377.17	 17.53	   2366.33	 19.52		-45.94%
>  write		   4378.50	 17.83	   2365.80	 19.94		-45.97%
> sync-randread
>  read		   5491.20	 66.66	  14664.00	288.29		167.05%
> sync-randwrite
>  write		   6396.13	 98.79	   2035.80	  8.17		-68.17%
> sync-randrw
>  read		   2927.30	115.81	   1036.08	 34.67		-64.61%
>  write		   2926.47	116.45	   1036.11	 34.90		-64.60%
> libaio-read
>  read		    254.36	 12.49	    258.63	 11.29		  1.68%
> libaio-write
>  write		   4979.20	122.75	   2904.77	 17.93		-41.66%
> libaio-readwrite
>  read		   2738.57	142.72	   2045.80	  4.12		-25.30%
>  write		   2729.93	141.80	   2039.77	  3.79		-25.28%
> libaio-randread
>  read		    113.63	  2.98	    210.63	  5.07		 85.37%
> libaio-randwrite
>  write		   4456.10	 76.21	   1649.63	  7.00		-62.98%
> libaio-randrw
>  read		     97.85	  8.03	    877.49	 28.27		796.80%
>  write		     97.55	  7.99	    874.83	 28.19		796.77%
> mmap-read
>  read		  20654.67	304.48	  24696.33	1064.07		 19.57%
> mmap-write
>  write		   8652.33	272.44	  13187.33	499.10		 52.41%
> mmap-readwrite
>  read		   6620.57	 16.05	   9221.60	399.56		 39.29%
>  write		   6623.63	 16.34	   9222.13	399.31		 39.23%
> mmap-randread
>  read		   6717.23	1360.55	  21939.33	326.38		226.61%
> mmap-randwrite
>  write		   3204.63	253.66	  12371.00	 61.49		286.03%
> mmap-randrw
>  read		   2150.50	 78.00	   7682.67	188.59		257.25%
>  write		   2149.50	 78.00	   7685.40	188.35		257.54%
> 
> -- 
>  Kirill A. Shutemov
-- 
Jan Kara <jack@suse.com>
SUSE Labs, CR

  parent reply	other threads:[~2016-11-03 20:50 UTC|newest]

Thread overview: 75+ messages / expand[flat|nested]  mbox.gz  Atom feed  top
2016-09-15 11:54 [PATCHv3 00/41] ext4: support of huge pages Kirill A. Shutemov
2016-09-15 11:54 ` [PATCHv3 01/41] tools: Add WARN_ON_ONCE Kirill A. Shutemov
2016-09-15 11:54 ` [PATCHv3 02/41] radix tree test suite: Allow GFP_ATOMIC allocations to fail Kirill A. Shutemov
2016-09-15 11:54 ` [PATCHv3 03/41] radix-tree: Add radix_tree_join Kirill A. Shutemov
2016-09-15 11:54 ` [PATCHv3 04/41] radix-tree: Add radix_tree_split Kirill A. Shutemov
2016-09-15 11:54 ` [PATCHv3 05/41] radix-tree: Add radix_tree_split_preload() Kirill A. Shutemov
2016-09-15 11:54 ` [PATCHv3 06/41] radix-tree: Handle multiorder entries being deleted by replace_clear_tags Kirill A. Shutemov
2016-09-15 11:54 ` [PATCHv3 07/41] mm, shmem: swich huge tmpfs to multi-order radix-tree entries Kirill A. Shutemov
2016-09-16 12:07   ` Kirill A. Shutemov
2016-09-15 11:54 ` [PATCHv3 08/41] Revert "radix-tree: implement radix_tree_maybe_preload_order()" Kirill A. Shutemov
2016-09-15 11:54 ` [PATCHv3 09/41] page-flags: relax page flag policy for few flags Kirill A. Shutemov
2016-09-15 11:54 ` [PATCHv3 10/41] mm, rmap: account file thp pages Kirill A. Shutemov
2016-09-15 11:54 ` [PATCHv3 11/41] thp: try to free page's buffers before attempt split Kirill A. Shutemov
2016-10-11 15:40   ` Jan Kara
2016-10-11 21:43     ` Kirill A. Shutemov
2016-09-15 11:54 ` [PATCHv3 12/41] thp: handle write-protection faults for file THP Kirill A. Shutemov
2016-10-11 15:47   ` Jan Kara
2016-10-11 21:47     ` Kirill A. Shutemov
2016-09-15 11:54 ` [PATCHv3 13/41] truncate: make sure invalidate_mapping_pages() can discard huge pages Kirill A. Shutemov
2016-10-11 15:58   ` Jan Kara
2016-10-11 21:53     ` Kirill A. Shutemov
2016-10-12  6:43       ` Jan Kara
2016-10-24 10:41         ` Kirill A. Shutemov
2016-09-15 11:54 ` [PATCHv3 14/41] filemap: allocate huge page in page_cache_read(), if allowed Kirill A. Shutemov
2016-10-11 16:15   ` Jan Kara
2016-10-11 21:57     ` Kirill A. Shutemov
2016-09-15 11:54 ` [PATCHv3 15/41] filemap: handle huge pages in do_generic_file_read() Kirill A. Shutemov
2016-10-13  9:33   ` Jan Kara
2016-10-31 18:10     ` Kirill A. Shutemov
2016-11-01 16:39       ` Jan Kara
2016-11-02  8:32         ` Kirill A. Shutemov
2016-11-02 14:37           ` Christoph Hellwig
2016-11-03 20:40           ` Jan Kara [this message]
2016-11-07 11:07             ` Kirill A. Shutemov
2016-11-07 14:59               ` Christoph Hellwig
2016-11-02 14:36         ` Christoph Hellwig
2016-11-03 17:56           ` Jan Kara
2016-11-07 11:13           ` Kirill A. Shutemov
2016-11-07 15:01             ` Christoph Hellwig
2016-11-07 16:03               ` Kirill A. Shutemov
2016-09-15 11:54 ` [PATCHv3 16/41] filemap: allocate huge page in pagecache_get_page(), if allowed Kirill A. Shutemov
2016-09-15 11:54 ` [PATCHv3 17/41] filemap: handle huge pages in filemap_fdatawait_range() Kirill A. Shutemov
2016-10-13  9:44   ` Jan Kara
2016-10-13 12:08     ` Kirill A. Shutemov
2016-10-13 13:18       ` Jan Kara
2016-10-24 11:36         ` Kirill A. Shutemov
2016-10-30 17:31           ` Jan Kara
2016-09-15 11:55 ` [PATCHv3 18/41] HACK: readahead: alloc huge pages, if allowed Kirill A. Shutemov
2016-09-15 11:55 ` [PATCHv3 19/41] block: define BIO_MAX_PAGES to HPAGE_PMD_NR if huge page cache enabled Kirill A. Shutemov
2016-09-16 12:09   ` Kirill A. Shutemov
2016-09-15 11:55 ` [PATCHv3 20/41] mm: make write_cache_pages() work on huge pages Kirill A. Shutemov
2016-09-15 11:55 ` [PATCHv3 21/41] thp: introduce hpage_size() and hpage_mask() Kirill A. Shutemov
2016-09-15 11:55 ` [PATCHv3 22/41] thp: do not threat slab pages as huge in hpage_{nr_pages,size,mask} Kirill A. Shutemov
2016-09-15 11:55 ` [PATCHv3 23/41] fs: make block_read_full_page() be able to read huge page Kirill A. Shutemov
2016-09-15 11:55 ` [PATCHv3 24/41] fs: make block_write_{begin,end}() be able to handle huge pages Kirill A. Shutemov
2016-09-15 11:55 ` [PATCHv3 25/41] fs: make block_page_mkwrite() aware about " Kirill A. Shutemov
2016-09-15 11:55 ` [PATCHv3 26/41] truncate: make truncate_inode_pages_range() " Kirill A. Shutemov
2016-09-15 11:55 ` [PATCHv3 27/41] truncate: make invalidate_inode_pages2_range() " Kirill A. Shutemov
2016-09-15 11:55 ` [PATCHv3 28/41] mm, hugetlb: switch hugetlbfs to multi-order radix-tree entries Kirill A. Shutemov
2016-09-15 11:55 ` [PATCHv3 29/41] ext4: make ext4_mpage_readpages() hugepage-aware Kirill A. Shutemov
2016-09-15 12:27   ` Andreas Dilger
2016-09-16 12:17     ` Kirill A. Shutemov
2016-09-16 12:10   ` Kirill A. Shutemov
2016-09-15 11:55 ` [PATCHv3 30/41] ext4: make ext4_writepage() work on huge pages Kirill A. Shutemov
2016-09-15 11:55 ` [PATCHv3 31/41] ext4: handle huge pages in ext4_page_mkwrite() Kirill A. Shutemov
2016-09-15 11:55 ` [PATCHv3 32/41] ext4: handle huge pages in __ext4_block_zero_page_range() Kirill A. Shutemov
2016-09-15 11:55 ` [PATCHv3 33/41] ext4: make ext4_block_write_begin() aware about huge pages Kirill A. Shutemov
2016-09-15 11:55 ` [PATCHv3 34/41] ext4: handle huge pages in ext4_da_write_end() Kirill A. Shutemov
2016-09-15 11:55 ` [PATCHv3 35/41] ext4: make ext4_da_page_release_reservation() aware about huge pages Kirill A. Shutemov
2016-09-15 11:55 ` [PATCHv3 36/41] ext4: handle writeback with " Kirill A. Shutemov
2016-09-15 11:55 ` [PATCHv3 37/41] ext4: make EXT4_IOC_MOVE_EXT work " Kirill A. Shutemov
2016-09-15 11:55 ` [PATCHv3 38/41] ext4: fix SEEK_DATA/SEEK_HOLE for " Kirill A. Shutemov
2016-09-15 11:55 ` [PATCHv3 39/41] ext4: make fallocate() operations work with " Kirill A. Shutemov
2016-09-15 11:55 ` [PATCHv3 40/41] mm, fs, ext4: expand use of page_mapping() and page_to_pgoff() Kirill A. Shutemov
2016-09-15 11:55 ` [PATCHv3 41/41] ext4, vfs: add huge= mount option Kirill A. Shutemov

Reply instructions:

You may reply publicly to this message via plain-text email
using any one of the following methods:

* Save the following mbox file, import it into your mail client,
  and reply-to-all from there: mbox

  Avoid top-posting and favor interleaved quoting:
  https://en.wikipedia.org/wiki/Posting_style#Interleaved_style

* Reply using the --to, --cc, and --in-reply-to
  switches of git-send-email(1):

  git send-email \
    --in-reply-to=20161103204012.GC24234@quack2.suse.cz \
    --to=jack@suse.cz \
    --cc=aarcange@redhat.com \
    --cc=adilger.kernel@dilger.ca \
    --cc=akpm@linux-foundation.org \
    --cc=dave.hansen@intel.com \
    --cc=hughd@google.com \
    --cc=jack@suse.com \
    --cc=kirill.shutemov@linux.intel.com \
    --cc=kirill@shutemov.name \
    --cc=linux-block@vger.kernel.org \
    --cc=linux-ext4@vger.kernel.org \
    --cc=linux-fsdevel@vger.kernel.org \
    --cc=linux-kernel@vger.kernel.org \
    --cc=linux-mm@kvack.org \
    --cc=ross.zwisler@linux.intel.com \
    --cc=tytso@mit.edu \
    --cc=vbabka@suse.cz \
    --cc=viro@zeniv.linux.org.uk \
    --cc=willy@infradead.org \
    /path/to/YOUR_REPLY

  https://kernel.org/pub/software/scm/git/docs/git-send-email.html

* If your mail client supports setting the In-Reply-To header
  via mailto: links, try the mailto: link
Be sure your reply has a Subject: header at the top and a blank line before the message body.
This is a public inbox, see mirroring instructions
for how to clone and mirror all data and code used for this inbox;
as well as URLs for NNTP newsgroup(s).