* Re: [PATCH v1 2/7] mm: Prepare for DAX huge pages
       [not found]     ` <20141008155758.GK5098@wil.cx>
@ 2014-10-08 19:43       ` Kirill A. Shutemov
  2014-10-09 20:40         ` Matthew Wilcox
  0 siblings, 1 reply; 6+ messages in thread
From: Kirill A. Shutemov @ 2014-10-08 19:43 UTC (permalink / raw)
  To: Matthew Wilcox; +Cc: Matthew Wilcox, linux-fsdevel, linux-kernel, linux-mm

On Wed, Oct 08, 2014 at 11:57:58AM -0400, Matthew Wilcox wrote:
> On Wed, Oct 08, 2014 at 06:21:24PM +0300, Kirill A. Shutemov wrote:
> > On Wed, Oct 08, 2014 at 09:25:24AM -0400, Matthew Wilcox wrote:
> > > From: Matthew Wilcox <willy@linux.intel.com>
> > > 
> > > DAX wants to use the 'special' bit to mark PMD entries that are not backed
> > > by struct page, just as for PTEs. 
> > 
> > Hm. I don't see where you use PMD without special set.
> 
> Right ... I don't currently insert PMDs that point to huge pages of DRAM,
> only to huge pages of PMEM.

Looks like you don't need pmd_{mk,}special() then. It seems you have all
the information you need -- the vma -- to find out what's going on. Right?

PMD bits are not something we can assign to a feature without a real need.
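
Something along these lines, perhaps (an untested sketch; the helper name
is made up):

	/*
	 * Sketch: in this patchset a DAX vma is the only source of huge
	 * entries that are not backed by struct pages, so the vma flags
	 * alone can stand in for a pmd_special() bit.
	 */
	static inline bool pmd_is_pageless(struct vm_area_struct *vma)
	{
		return vma->vm_flags & VM_MIXEDMAP;
	}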

> > > @@ -1104,9 +1103,20 @@ int do_huge_pmd_wp_page(struct mm_struct *mm, struct vm_area_struct *vma,
> > >  	if (unlikely(!pmd_same(*pmd, orig_pmd)))
> > >  		goto out_unlock;
> > >  
> > > -	page = pmd_page(orig_pmd);
> > > -	VM_BUG_ON_PAGE(!PageCompound(page) || !PageHead(page), page);
> > > -	if (page_mapcount(page) == 1) {
> > > +	if (pmd_special(orig_pmd)) {
> > > +		/* VM_MIXEDMAP !pfn_valid() case */
> > > +		if ((vma->vm_flags & (VM_WRITE|VM_SHARED)) !=
> > > +				     (VM_WRITE|VM_SHARED)) {
> > > +			pmdp_clear_flush(vma, haddr, pmd);
> > > +			ret = VM_FAULT_FALLBACK;
> > 
> > No private THP pages with DAX? Why?
> > It should be trivial: we already have a code path for the !page case for
> > the zero page, and it shouldn't be too hard to modify do_dax_pmd_fault()
> > to support COW.
> > 
> > I remember I've mentioned that you don't think it's reasonable to
> > allocate a 2M page on COW, but that's what we do for anon memory...
> 
> I agree that it shouldn't be too hard, but I have no evidence that it'll
> be a performance win to COW 2MB pages for MAP_PRIVATE.  I'd rather be
> cautious for now and we can explore COWing 2MB chunks in a future patch.

I would rather make it the other way around: use the same approach as for
anon memory until data shows it doesn't do any good. Then consider
switching COW for *both* anon and file THP to the fallback path.
That way we get consistent behaviour for both types of mappings.

-- 
 Kirill A. Shutemov

* Re: [PATCH v1 5/7] dax: Add huge page fault support
       [not found] ` <1412774729-23956-6-git-send-email-matthew.r.wilcox@intel.com>
@ 2014-10-08 20:11   ` Kirill A. Shutemov
  2014-10-09 20:47     ` Matthew Wilcox
  0 siblings, 1 reply; 6+ messages in thread
From: Kirill A. Shutemov @ 2014-10-08 20:11 UTC (permalink / raw)
  To: Matthew Wilcox; +Cc: linux-fsdevel, linux-kernel, linux-mm, Matthew Wilcox

On Wed, Oct 08, 2014 at 09:25:27AM -0400, Matthew Wilcox wrote:
> +
> +	pgoff = ((address - vma->vm_start) >> PAGE_SHIFT) + vma->vm_pgoff;
> +	size = (i_size_read(inode) + PAGE_SIZE - 1) >> PAGE_SHIFT;
> +	if (pgoff >= size)
> +		return VM_FAULT_SIGBUS;
> +	/* If the PMD would cover blocks out of the file */
> +	if ((pgoff | PG_PMD_COLOUR) >= size)
> +		return VM_FAULT_FALLBACK;

IIUC, zero padding would work too.
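
(To make that concrete: with 4K pages and 2M PMDs, PG_PMD_COLOUR is
presumably 511, so for a file whose size rounds up to, say, 5 pages,
(pgoff | 511) = 511 >= 5 for every pgoff in the first PMD -- such a fault
always falls back, even though the pages beyond EOF could in principle be
zero-padded instead.)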

> +
> +	memset(&bh, 0, sizeof(bh));
> +	block = ((sector_t)pgoff & ~PG_PMD_COLOUR) << (PAGE_SHIFT - blkbits);
> +
> +	/* Start by seeing if we already have an allocated block */
> +	bh.b_size = PMD_SIZE;
> +	length = get_block(inode, block, &bh, 0);

This confuses me: get_block() returns zero on success, right?
Why is the variable called 'length'?

> +	sector = bh.b_blocknr << (blkbits - 9);
> +	length = bdev_direct_access(bh.b_bdev, sector, &kaddr, &pfn, bh.b_size);
> +	if (length < 0)
> +		goto sigbus;
> +	if (length < PMD_SIZE)
> +		goto fallback;
> +	if (pfn & PG_PMD_COLOUR)
> +		goto fallback;	/* not aligned */

So, are you relying on pure luck for get_block() to allocate a 2M-aligned
pfn? Not really productive. You would need assistance from both the fs and
arch_get_unmapped_area() sides.
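
For reference, the alignment being checked, assuming PG_PMD_COLOUR has the
obvious definition (a sketch, not quoted from the patch):

	/*
	 * Low-order pfn bits that must be zero to map a PMD:
	 * PMD_SIZE / PAGE_SIZE - 1 == 511 with 4K pages and 2M PMDs.
	 * (pfn & PG_PMD_COLOUR) == 0 therefore means "2M-aligned pfn",
	 * which nothing in the get_block() contract guarantees.
	 */
	#define PG_PMD_COLOUR	((1UL << (PMD_SHIFT - PAGE_SHIFT)) - 1)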

-- 
 Kirill A. Shutemov

* Re: [PATCH v1 2/7] mm: Prepare for DAX huge pages
  2014-10-08 19:43       ` [PATCH v1 2/7] mm: Prepare for DAX huge pages Kirill A. Shutemov
@ 2014-10-09 20:40         ` Matthew Wilcox
  2014-10-13 20:36           ` Kirill A. Shutemov
  0 siblings, 1 reply; 6+ messages in thread
From: Matthew Wilcox @ 2014-10-09 20:40 UTC (permalink / raw)
  To: Kirill A. Shutemov
  Cc: Matthew Wilcox, Matthew Wilcox, linux-fsdevel, linux-kernel, linux-mm

On Wed, Oct 08, 2014 at 10:43:35PM +0300, Kirill A. Shutemov wrote:
> On Wed, Oct 08, 2014 at 11:57:58AM -0400, Matthew Wilcox wrote:
> > On Wed, Oct 08, 2014 at 06:21:24PM +0300, Kirill A. Shutemov wrote:
> > > On Wed, Oct 08, 2014 at 09:25:24AM -0400, Matthew Wilcox wrote:
> > > > From: Matthew Wilcox <willy@linux.intel.com>
> > > > 
> > > > DAX wants to use the 'special' bit to mark PMD entries that are not backed
> > > > by struct page, just as for PTEs. 
> > > 
> > > Hm. I don't see where you use PMD without special set.
> > 
> > Right ... I don't currently insert PMDs that point to huge pages of DRAM,
> > only to huge pages of PMEM.
> 
> Looks like you don't need pmd_{mk,}special() then. It seems you have all
> the information you need -- the vma -- to find out what's going on. Right?

That would prevent us from putting huge pages of DRAM into a VM_MIXEDMAP |
VM_HUGEPAGE vma.  Is that acceptable to the wider peanut gallery?

> > > No private THP pages with DAX? Why?
> > > It should be trivial: we already have a code path for the !page case for
> > > the zero page, and it shouldn't be too hard to modify do_dax_pmd_fault()
> > > to support COW.
> > > 
> > > I remember I've mentioned that you don't think it's reasonable to
> > > allocate a 2M page on COW, but that's what we do for anon memory...
> > 
> > I agree that it shouldn't be too hard, but I have no evidence that it'll
> > be a performance win to COW 2MB pages for MAP_PRIVATE.  I'd rather be
> > cautious for now and we can explore COWing 2MB chunks in a future patch.
> 
> I would rather make it the other way around: use the same approach as for
> anon memory until data shows it doesn't do any good. Then consider
> switching COW for *both* anon and file THP to the fallback path.
> That way we get consistent behaviour for both types of mappings.

I'm not sure that we want consistent behaviour for both types of mappings.
My understanding is that they're used for different purposes, and having
different behaviour is acceptable.

* Re: [PATCH v1 5/7] dax: Add huge page fault support
  2014-10-08 20:11   ` [PATCH v1 5/7] dax: Add huge page fault support Kirill A. Shutemov
@ 2014-10-09 20:47     ` Matthew Wilcox
  2014-10-13  1:13       ` Dave Chinner
  0 siblings, 1 reply; 6+ messages in thread
From: Matthew Wilcox @ 2014-10-09 20:47 UTC (permalink / raw)
  To: Kirill A. Shutemov
  Cc: Matthew Wilcox, linux-fsdevel, linux-kernel, linux-mm, Matthew Wilcox

On Wed, Oct 08, 2014 at 11:11:00PM +0300, Kirill A. Shutemov wrote:
> On Wed, Oct 08, 2014 at 09:25:27AM -0400, Matthew Wilcox wrote:
> > +	pgoff = ((address - vma->vm_start) >> PAGE_SHIFT) + vma->vm_pgoff;
> > +	size = (i_size_read(inode) + PAGE_SIZE - 1) >> PAGE_SHIFT;
> > +	if (pgoff >= size)
> > +		return VM_FAULT_SIGBUS;
> > +	/* If the PMD would cover blocks out of the file */
> > +	if ((pgoff | PG_PMD_COLOUR) >= size)
> > +		return VM_FAULT_FALLBACK;
> 
> IIUC, zero padding would work too.

The blocks after this file might be allocated to another file already.
I suppose we could ask the filesystem if it wants to allocate them to
this file.

Dave, Jan, is it acceptable to call get_block() for blocks that extend
beyond the current i_size?

> > +
> > +	memset(&bh, 0, sizeof(bh));
> > +	block = ((sector_t)pgoff & ~PG_PMD_COLOUR) << (PAGE_SHIFT - blkbits);
> > +
> > +	/* Start by seeing if we already have an allocated block */
> > +	bh.b_size = PMD_SIZE;
> > +	length = get_block(inode, block, &bh, 0);
> 
> This confuses me: get_block() returns zero on success, right?
> Why is the variable called 'length'?

Historical reasons.  I can go back and change the name of the variable.

> > +	sector = bh.b_blocknr << (blkbits - 9);
> > +	length = bdev_direct_access(bh.b_bdev, sector, &kaddr, &pfn, bh.b_size);
> > +	if (length < 0)
> > +		goto sigbus;
> > +	if (length < PMD_SIZE)
> > +		goto fallback;
> > +	if (pfn & PG_PMD_COLOUR)
> > +		goto fallback;	/* not aligned */
> 
> So, are you relying on pure luck for get_block() to allocate a 2M-aligned
> pfn? Not really productive. You would need assistance from both the fs and
> arch_get_unmapped_area() sides.

Certainly ext4 and XFS will align their allocations; if you ask for a
2MB block, they will try to allocate a 2MB block aligned on a 2MB boundary.

I started looking into get_unmapped_area() (and have the code sitting
around to align specially marked files on special boundaries), but when
I mentioned it to the author of the NVM Library, he said "Oh, I'll just
pick a 1GB aligned area to request it be mapped at", so I haven't taken
it any further.

The upshot is that (confirmed with debugging code), when the tests run,
they pretty much always get a correctly aligned block.

* Re: [PATCH v1 5/7] dax: Add huge page fault support
  2014-10-09 20:47     ` Matthew Wilcox
@ 2014-10-13  1:13       ` Dave Chinner
  0 siblings, 0 replies; 6+ messages in thread
From: Dave Chinner @ 2014-10-13  1:13 UTC (permalink / raw)
  To: Matthew Wilcox
  Cc: Kirill A. Shutemov, Matthew Wilcox, linux-fsdevel, linux-kernel,
	linux-mm

On Thu, Oct 09, 2014 at 04:47:16PM -0400, Matthew Wilcox wrote:
> On Wed, Oct 08, 2014 at 11:11:00PM +0300, Kirill A. Shutemov wrote:
> > On Wed, Oct 08, 2014 at 09:25:27AM -0400, Matthew Wilcox wrote:
> > > +	pgoff = ((address - vma->vm_start) >> PAGE_SHIFT) + vma->vm_pgoff;
> > > +	size = (i_size_read(inode) + PAGE_SIZE - 1) >> PAGE_SHIFT;
> > > +	if (pgoff >= size)
> > > +		return VM_FAULT_SIGBUS;
> > > +	/* If the PMD would cover blocks out of the file */
> > > +	if ((pgoff | PG_PMD_COLOUR) >= size)
> > > +		return VM_FAULT_FALLBACK;
> > 
> > IIUC, zero padding would work too.
> 
> The blocks after this file might be allocated to another file already.
> I suppose we could ask the filesystem if it wants to allocate them to
> this file.
> 
> Dave, Jan, is it acceptable to call get_block() for blocks that extend
> beyond the current i_size?

In what context? XFS basically does nothing for certain cases (e.g.
read mapping for direct IO) where zeroes are always going to be
returned, so essentially filesystems right now may actually just
return a "hole" for any read mapping request beyond EOF.

If "create" is set, then we'll either create or map existing blocks
beyond EOF because the we have to reserve space or allocate blocks
before the EOF gets extended when the write succeeds fully...

> > > +	if (length < PMD_SIZE)
> > > +		goto fallback;
> > > +	if (pfn & PG_PMD_COLOUR)
> > > +		goto fallback;	/* not aligned */
> > 
> > So, are you relying on pure luck for get_block() to allocate a 2M-aligned
> > pfn? Not really productive. You would need assistance from both the fs and
> > arch_get_unmapped_area() sides.
> 
> Certainly ext4 and XFS will align their allocations; if you ask for a
> 2MB block, they will try to allocate a 2MB block aligned on a 2MB boundary.

As a sweeping generalisation, that's wrong. Empty filesystems might
behave that way, but we don't *guarantee* that this sort of
alignment will occur.

XFS has several different extent alignment strategies and
none of them will always work that way. Many of them are dependent
on mkfs parameters, and even then are used only as *guidelines*.
Further, alignment is dependent on the size of the write being done
- on some filesystem configs a 2MB write might be aligned, but on
others it won't be. More complex still is that mount options can
change alignment behaviour, as can per-file extent size hints, as
can truncation that removes post-eof blocks...

IOWs, if you want the filesystem to guarantee alignment to the
underlying hardware in this way for DAX, we're going to need to make
some modifications to the allocator alignment strategy.

Cheers,

Dave.
-- 
Dave Chinner
david@fromorbit.com

* Re: [PATCH v1 2/7] mm: Prepare for DAX huge pages
  2014-10-09 20:40         ` Matthew Wilcox
@ 2014-10-13 20:36           ` Kirill A. Shutemov
  0 siblings, 0 replies; 6+ messages in thread
From: Kirill A. Shutemov @ 2014-10-13 20:36 UTC (permalink / raw)
  To: Matthew Wilcox; +Cc: Matthew Wilcox, linux-fsdevel, linux-kernel, linux-mm

On Thu, Oct 09, 2014 at 04:40:26PM -0400, Matthew Wilcox wrote:
> On Wed, Oct 08, 2014 at 10:43:35PM +0300, Kirill A. Shutemov wrote:
> > On Wed, Oct 08, 2014 at 11:57:58AM -0400, Matthew Wilcox wrote:
> > > On Wed, Oct 08, 2014 at 06:21:24PM +0300, Kirill A. Shutemov wrote:
> > > > On Wed, Oct 08, 2014 at 09:25:24AM -0400, Matthew Wilcox wrote:
> > > > > From: Matthew Wilcox <willy@linux.intel.com>
> > > > > 
> > > > > DAX wants to use the 'special' bit to mark PMD entries that are not backed
> > > > > by struct page, just as for PTEs. 
> > > > 
> > > > Hm. I don't see where you use PMD without special set.
> > > 
> > > Right ... I don't currently insert PMDs that point to huge pages of DRAM,
> > > only to huge pages of PMEM.
> > 
> > Looks like you don't need pmd_{mk,}special() then. It seems you have all
> > the information you need -- the vma -- to find out what's going on. Right?
> 
> That would prevent us from putting huge pages of DRAM into a VM_MIXEDMAP |
> VM_HUGEPAGE vma.  Is that acceptable to the wider peanut gallery?

We didn't have huge pages in VM_MIXEDMAP | VM_HUGEPAGE vmas before, and we
don't have them there after the patchset. Nothing changed.

It's probably worth adding a VM_BUG_ON() in some code path to be able to
catch this situation.
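
Something like this, say (untested, and where exactly it belongs is an
open question):

	/* Sketch: nothing inserts struct-page-backed huge pages into a
	 * VM_MIXEDMAP vma today, so assert that we never find one. */
	VM_BUG_ON((vma->vm_flags & VM_MIXEDMAP) && pmd_trans_huge(*pmd) &&
		  pfn_valid(pmd_pfn(*pmd)));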

> > > > No private THP pages with DAX? Why?
> > > > It should be trivial: we already have a code path for the !page case for
> > > > the zero page, and it shouldn't be too hard to modify do_dax_pmd_fault()
> > > > to support COW.
> > > > 
> > > > I remember I've mentioned that you don't think it's reasonable to
> > > > allocate a 2M page on COW, but that's what we do for anon memory...
> > > 
> > > I agree that it shouldn't be too hard, but I have no evidence that it'll
> > > be a performance win to COW 2MB pages for MAP_PRIVATE.  I'd rather be
> > > cautious for now and we can explore COWing 2MB chunks in a future patch.
> > 
> > I would rather make it the other way around: use the same approach as for
> > anon memory until data shows it doesn't do any good. Then consider
> > switching COW for *both* anon and file THP to the fallback path.
> > That way we get consistent behaviour for both types of mappings.
> 
> I'm not sure that we want consistent behaviour for both types of mappings.
> My understanding is that they're used for different purposes, and having
> different behaviour is acceptable.

This should be described in the commit message, along with the other design
decisions (split wrt. mlock, etc.) and their pros and cons.

-- 
 Kirill A. Shutemov

Thread overview: 6+ messages
     [not found] <1412774729-23956-1-git-send-email-matthew.r.wilcox@intel.com>
     [not found] ` <1412774729-23956-3-git-send-email-matthew.r.wilcox@intel.com>
     [not found]   ` <20141008152124.GA7288@node.dhcp.inet.fi>
     [not found]     ` <20141008155758.GK5098@wil.cx>
2014-10-08 19:43       ` [PATCH v1 2/7] mm: Prepare for DAX huge pages Kirill A. Shutemov
2014-10-09 20:40         ` Matthew Wilcox
2014-10-13 20:36           ` Kirill A. Shutemov
     [not found] ` <1412774729-23956-6-git-send-email-matthew.r.wilcox@intel.com>
2014-10-08 20:11   ` [PATCH v1 5/7] dax: Add huge page fault support Kirill A. Shutemov
2014-10-09 20:47     ` Matthew Wilcox
2014-10-13  1:13       ` Dave Chinner
