* XFS_IOC_FSEMAP requirements
@ 2016-12-20 10:29 Carlos Maiolino
  2016-12-21  1:48 ` Darrick J. Wong
  0 siblings, 1 reply; 6+ messages in thread
From: Carlos Maiolino @ 2016-12-20 10:29 UTC (permalink / raw)
  To: linux-xfs

Hi folks,

I've been working on the implementation of the FSEMAP ioctl, as we discussed
previously. The first discussion about this was about using the same
fiemap structures to retrieve free extents from the btrees.

In our last chat about it (it's been a while, I know, I got busy with more
important stuff :), Dave suggested other uses for FSEMAP that I had not had in
mind, so I think it deserves its own implementation, independent of fiemap,
where the idea originally came from.

So, I'd like to know what else FSEMAP might be used for, beyond iterating free
space extents, so I can design a new struct to be exchanged between userspace
and the kernel. FSEMAP is supposed to complement GETFSMAPX, discussed at LSF
this year, but I don't know whether there is any plan to keep going with
GETFSMAPX, or even whether FSEMAP is still a valuable idea :)

Any comments or suggestions about what direction FSEMAP should take?

Cheers
-- 
Carlos

* Re: XFS_IOC_FSEMAP requirements
  2016-12-20 10:29 XFS_IOC_FSEMAP requirements Carlos Maiolino
@ 2016-12-21  1:48 ` Darrick J. Wong
  2016-12-22  8:57   ` Christoph Hellwig
  2016-12-22  9:28   ` Carlos Maiolino
  0 siblings, 2 replies; 6+ messages in thread
From: Darrick J. Wong @ 2016-12-21  1:48 UTC (permalink / raw)
  To: linux-xfs

On Tue, Dec 20, 2016 at 11:29:35AM +0100, Carlos Maiolino wrote:
> Hi folks,
> 
> I've been working on the implementation of the FSEMAP ioctl, as we
> discussed previously. The first discussion about this was about
> using the same fiemap structures to retrieve free extents from the
> btrees.
> 
> In our last chat about it (it's been a while, I know, I got busy with
> more important stuff :), Dave suggested other uses for FSEMAP that
> I had not had in mind, so I think it deserves its own implementation,
> independent of fiemap, where the idea originally came from.
> 
> So, I'd like to know what else FSEMAP might be used for, beyond
> iterating free space extents, so I can design a new struct to be
> exchanged between userspace and the kernel. FSEMAP is supposed to
> complement GETFSMAPX, discussed at LSF this year, but I don't know
> whether there is any plan to keep going with GETFSMAPX, or even
> whether FSEMAP is still a valuable idea :)
> 
> Any comments or suggestions about what direction FSEMAP should take?

GETFSMAP reports free space extents along with the other space mappings.
If there is no rmapbt, the ioctl reports free space extents from the
bnobt and reports the non-free space as being owned by "unknown".
I was planning to send out the whole GETFSMAP + online scrub series for
review (for 4.11) after the 4.10 merge window closes.  Internally, the
online scrub kernel code cross-references space metadata against the
rmapbt if it's available.
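
The fallback described above (free extents from the bnobt, everything else
reported as "unknown") can be illustrated with a small userspace sketch. This
is not the GETFSMAP implementation or its record format: `struct space_rec`,
`enum owner` and `build_space_map()` are hypothetical names made up for the
example, and the free extents are assumed to be sorted and non-overlapping.

```c
#include <stddef.h>
#include <stdint.h>

/* Hypothetical owner tags; these are NOT the real GETFSMAP owner codes. */
enum owner { OWNER_FREE, OWNER_UNKNOWN };

struct space_rec {
	uint64_t start;   /* first block of the mapping */
	uint64_t len;     /* length in blocks */
	enum owner owner;
};

/*
 * Given free extents sorted by start block and non-overlapping, emit a
 * complete map of [0, total): each free extent as-is, and every gap
 * between them as a record owned by "unknown".
 * Returns the number of records written, or -1 if out[] is too small.
 */
int build_space_map(const struct space_rec *free_ext, size_t nfree,
		    uint64_t total, struct space_rec *out, size_t nout)
{
	size_t n = 0;
	uint64_t pos = 0;   /* first block not yet covered by a record */

	for (size_t i = 0; i <= nfree; i++) {
		/* After the last free extent, the end of the device bounds the gap. */
		uint64_t next = (i < nfree) ? free_ext[i].start : total;

		if (next > pos) {   /* used space, but no owner information */
			if (n == nout)
				return -1;
			out[n++] = (struct space_rec){ pos, next - pos, OWNER_UNKNOWN };
		}
		if (i < nfree) {
			if (n == nout)
				return -1;
			out[n++] = (struct space_rec){ free_ext[i].start,
						       free_ext[i].len, OWNER_FREE };
			pos = free_ext[i].start + free_ext[i].len;
		}
	}
	return (int)n;
}
```

With an rmapbt available, the "unknown" records would instead carry real
owner information; the sketch only models the case without one.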

For xfsprogs 4.11, the userspace online scrub tool uses the fsmap data
to figure out where to do media read testing after having the kernel
perform online checking of the metadata.  I also forward-ported spaceman
to current xfsprogs and getfsmap, so I'll be sending that out for review
for the 4.11 release too.

As for project ideas, I can think of a handful of them -- reworking the
in-core extent tree not to require large contiguous memory allocations,
sorting out reflink+dax, and stomping out the rest of the buffer head
usage, and all the rest of the ongoing cleanups and fix branches.
There's probably more, but let's see if Dave will chime in. :)

(I intend to track all this via google spreadsheet or something to keep
my head on straight.)

--D

> 
> Cheers
> -- 
> Carlos
> --
> To unsubscribe from this list: send the line "unsubscribe linux-xfs" in
> the body of a message to majordomo@vger.kernel.org
> More majordomo info at  http://vger.kernel.org/majordomo-info.html

* Re: XFS_IOC_FSEMAP requirements
  2016-12-21  1:48 ` Darrick J. Wong
@ 2016-12-22  8:57   ` Christoph Hellwig
  2016-12-22 20:07     ` Dave Chinner
  2016-12-22  9:28   ` Carlos Maiolino
  1 sibling, 1 reply; 6+ messages in thread
From: Christoph Hellwig @ 2016-12-22  8:57 UTC (permalink / raw)
  To: Darrick J. Wong; +Cc: linux-xfs

On Tue, Dec 20, 2016 at 05:48:16PM -0800, Darrick J. Wong wrote:
> As for project ideas, I can think of a handful of them -- reworking the
> in-core extent tree not to require large contiguous memory allocations,
> sorting out reflink+dax,

FYI, I have significant work done on both of them.  I just need to sort
out the ENOSPC issues (just got another report where we're running into
ENOSPC for inobt updates, so there might be even more dragons there)
and the DAX DIO rework, which is 90% of the work for reflink+DAX.

For the extent tree I played around with a simple rbtree, but I need
to finish abstracting away the access to the extent list - the first
pile of that landed for 4.10, but there is a lot more to be done before
I can easily change the representation.  I'm also not sure anymore
that the rbtree is the best choice.

> and stomping out the rest of the buffer head
> usage,

The next step for that is to switch bmap and buffered reads over to iomap.
I have a version that works for blocksize == pagesize, and just need
to finish the blocksize < pagesize case.  Once get_blocks is gone
we should be able to just kill buffer_heads trivially for the blocksize
== pagesize case, it's just the small block size that will require
a significant effort.  My initial plan was to just keep buffer_heads
for those as the first step while not using them for the common
blocksize == pagesize case.

* Re: XFS_IOC_FSEMAP requirements
  2016-12-21  1:48 ` Darrick J. Wong
  2016-12-22  8:57   ` Christoph Hellwig
@ 2016-12-22  9:28   ` Carlos Maiolino
  1 sibling, 0 replies; 6+ messages in thread
From: Carlos Maiolino @ 2016-12-22  9:28 UTC (permalink / raw)
  To: Darrick J. Wong; +Cc: linux-xfs

On Tue, Dec 20, 2016 at 05:48:16PM -0800, Darrick J. Wong wrote:
> On Tue, Dec 20, 2016 at 11:29:35AM +0100, Carlos Maiolino wrote:
> > Hi folks,
> > 
> > I've been working on the implementation of the FSEMAP ioctl, as we
> > discussed previously. The first discussion about this was about
> > using the same fiemap structures to retrieve free extents from the
> > btrees.
> > 
> > In our last chat about it (it's been a while, I know, I got busy with
> > more important stuff :), Dave suggested other uses for FSEMAP that
> > I had not had in mind, so I think it deserves its own implementation,
> > independent of fiemap, where the idea originally came from.
> > 
> > So, I'd like to know what else FSEMAP might be used for, beyond
> > iterating free space extents, so I can design a new struct to be
> > exchanged between userspace and the kernel. FSEMAP is supposed to
> > complement GETFSMAPX, discussed at LSF this year, but I don't know
> > whether there is any plan to keep going with GETFSMAPX, or even
> > whether FSEMAP is still a valuable idea :)
> > 
> > Any comments or suggestions about what direction FSEMAP should take?
> 
> GETFSMAP reports free space extents along with the other space mappings.
> If there is no rmapbt, the ioctl reports free space extents from the
> bnobt and reports the non-free space as being owned by "unknown".
> I was planning to send out the whole GETFSMAP + online scrub series for
> review (for 4.11) after the 4.10 merge window closes.  Internally, the
> online scrub kernel code cross-references space metadata against the
> rmapbt if it's available.
> 
> For xfsprogs 4.11, the userspace online scrub tool uses the fsmap data
> to figure out where to do media read testing after having the kernel
> perform online checking of the metadata.  I also forward-ported spaceman
> to current xfsprogs and getfsmap, so I'll be sending that out for review
> for the 4.11 release too.
> 

Ah, OK, I'll switch back to my unfinished work on how we behave when
working together with dm-thin, then =]

> As for project ideas, I can think of a handful of them -- reworking the
> in-core extent tree not to require large contiguous memory allocations,
> sorting out reflink+dax, and stomping out the rest of the buffer head
> usage, and all the rest of the ongoing cleanups and fix branches.
> There's probably more, but let's see if Dave will chime in. :)
> 
> (I intend to track all this via google spreadsheet or something to keep
> my head on straight.)
> 
> --D
> 
> > 
> > Cheers
> > -- 
> > Carlos
> > --
> > To unsubscribe from this list: send the line "unsubscribe linux-xfs" in
> > the body of a message to majordomo@vger.kernel.org
> > More majordomo info at  http://vger.kernel.org/majordomo-info.html
> --
> To unsubscribe from this list: send the line "unsubscribe linux-xfs" in
> the body of a message to majordomo@vger.kernel.org
> More majordomo info at  http://vger.kernel.org/majordomo-info.html

-- 
Carlos

* Re: XFS_IOC_FSEMAP requirements
  2016-12-22  8:57   ` Christoph Hellwig
@ 2016-12-22 20:07     ` Dave Chinner
  2016-12-22 20:24       ` Christoph Hellwig
  0 siblings, 1 reply; 6+ messages in thread
From: Dave Chinner @ 2016-12-22 20:07 UTC (permalink / raw)
  To: Christoph Hellwig; +Cc: Darrick J. Wong, linux-xfs

On Thu, Dec 22, 2016 at 12:57:16AM -0800, Christoph Hellwig wrote:
> On Tue, Dec 20, 2016 at 05:48:16PM -0800, Darrick J. Wong wrote:
> > As for project ideas, I can think of a handful of them -- reworking the
> > in-core extent tree not to require large contiguous memory allocations,
> > sorting out reflink+dax,
> 
> FYI, I have significant work done on both of them.  I just need to sort
> out the ENOSPC issues (just got another report where we're running into
> ENOSPC for inobt updates, so there might be even more dragons there)
> and the DAX DIO rework, which is 90% of the work for reflink+DAX.
> 
> For the extent tree I played around with a simple rbtree, but I need
> to finish abstracting away the access to the extent list - the first
> pile of that landed for 4.10, but there is a lot more to be done before
> I can easily change the representation.  I'm also not sure anymore
> that the rbtree is the best choice.

An extent per rbtree node is almost certainly not the right choice
because of the object count requirement - we do not want to do a
kmalloc for every extent we add to the list.  I was looking at
replacing the indirect array with an rbtree - potentially the
interval tree variant - so that we still have pages of extents and
iteration/walking is still mainly just pointer increments.

That was before I found out how easy it is to use the rhashtable
code and how much faster it is for large lists than an rbtree.
That's the way I've been thinking recently, anyway...
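
To make the "pages of extents" idea concrete, here is a minimal userspace
sketch, not the XFS in-core extent code. Extents live in fixed-size chunks, so
there is one allocation per chunk rather than per extent, and a sequential
walk is mostly pointer increments. The names (`struct chunk`,
`extent_append()`, `EXTS_PER_CHUNK`) are hypothetical, and a plain linked list
stands in for whatever index (rbtree, interval tree, hash) would locate a
chunk by offset.

```c
#include <stdint.h>
#include <stdlib.h>

#define EXTS_PER_CHUNK 4   /* tiny for illustration; a real chunk would fill a page */

struct extent { uint64_t off, len; };

struct chunk {
	struct chunk *next;            /* stand-in for the index structure */
	unsigned int count;            /* extents used in this chunk */
	struct extent ext[EXTS_PER_CHUNK];
};

struct extent_list {
	struct chunk *head, *tail;
	unsigned long allocs;          /* number of chunk allocations so far */
};

/* Append an extent; allocates only when the tail chunk is full. */
int extent_append(struct extent_list *l, uint64_t off, uint64_t len)
{
	struct chunk *c = l->tail;

	if (!c || c->count == EXTS_PER_CHUNK) {
		c = calloc(1, sizeof(*c));
		if (!c)
			return -1;
		if (l->tail)
			l->tail->next = c;
		else
			l->head = c;
		l->tail = c;
		l->allocs++;
	}
	c->ext[c->count++] = (struct extent){ off, len };
	return 0;
}

/* Sequential walk: pointer increments within a chunk, one hop per chunk. */
uint64_t extent_total_len(const struct extent_list *l)
{
	uint64_t sum = 0;

	for (const struct chunk *c = l->head; c; c = c->next)
		for (const struct extent *e = c->ext; e < c->ext + c->count; e++)
			sum += e->len;
	return sum;
}

void extent_list_free(struct extent_list *l)
{
	struct chunk *c = l->head;

	while (c) {
		struct chunk *next = c->next;
		free(c);
		c = next;
	}
	l->head = l->tail = NULL;
}
```

Appending ten extents here costs three allocations instead of ten; inserting
in the middle of a full chunk would need a split, which the sketch leaves out.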

> > and stomping out the rest of the buffer head
> > usage,
> 
> The next step for that is to switch bmap and buffered reads over to iomap.
> I have a version that works for blocksize == pagesize, and just need
> to finish the blocksize < pagesize case.  Once get_blocks is gone
> we should be able to just kill buffer_heads trivially for the blocksize
> == pagesize case, it's just the small block size that will require
> a significant effort.  My initial plan was to just keep buffer_heads
> for those as the first step while not using them for the common
> blocksize == pagesize case.

My plan for the blocksize < pagesize case was simply to track dirtiness
on pages and forget about sub-page dirtiness. That way the
IO path simply iterates entire pages to cover all the
mapped regions of the page. iomap already does that for us, and I
started on making writepage work that way, too. Haven't got to
working writepage code yet, though.
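
As a rough illustration of "iterate the whole page and cover all the mapped
regions", the sketch below turns a per-page bitmap of mapped blocks into
contiguous runs that a write path could submit once the page as a whole is
dirty. It is not iomap or writepage code; `struct run` and
`page_mapped_runs()` are hypothetical, and the bitmap limits the example to at
most 32 blocks per page.

```c
#include <stddef.h>
#include <stdint.h>

struct run {
	unsigned int first;   /* index of the first block in the run */
	unsigned int count;   /* number of contiguous mapped blocks */
};

/*
 * mapped: bit i is set if block i of the page is backed by a disk block.
 * blocks_per_page must be <= 32 in this sketch.
 * Writes up to nout runs into out[] and returns how many were written.
 */
size_t page_mapped_runs(uint32_t mapped, unsigned int blocks_per_page,
			struct run *out, size_t nout)
{
	size_t n = 0;
	unsigned int i = 0;

	while (i < blocks_per_page && n < nout) {
		if (!(mapped & (UINT32_C(1) << i))) {
			i++;            /* hole: nothing to write for this block */
			continue;
		}
		unsigned int first = i;
		while (i < blocks_per_page && (mapped & (UINT32_C(1) << i)))
			i++;
		out[n++] = (struct run){ first, i - first };
	}
	return n;
}
```

No per-block dirty state is consulted: if the page is dirty, every mapped run
is written, which trades some extra I/O for not tracking sub-page dirtiness.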

Cheers,

Dave.
-- 
Dave Chinner
david@fromorbit.com

* Re: XFS_IOC_FSEMAP requirements
  2016-12-22 20:07     ` Dave Chinner
@ 2016-12-22 20:24       ` Christoph Hellwig
  0 siblings, 0 replies; 6+ messages in thread
From: Christoph Hellwig @ 2016-12-22 20:24 UTC (permalink / raw)
  To: Dave Chinner; +Cc: Christoph Hellwig, Darrick J. Wong, linux-xfs

On Fri, Dec 23, 2016 at 07:07:29AM +1100, Dave Chinner wrote:
> An extent per rbtree node is almost certainly not the right choice
> because of the object count requirement - we do not want to do a
> kmalloc for every extent we add to the list.

People are doing a kmalloc for each packet / I/O at millions of
I/Os per second, so I'm not that worried about that.  It's certainly
more efficient than the crazy amount of memmoves we're currently
doing based on my first preliminary numbers.
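
For readers wondering where memmove cost comes from in a flat sorted extent
array, here is a small sketch under simplifying assumptions. It is not the
current XFS code: `struct ext_rec` and `sorted_insert()` are hypothetical, and
the caller is assumed to have left room for one more record. The return value
is the number of records that had to be shifted to make space.

```c
#include <stddef.h>
#include <stdint.h>
#include <string.h>

struct ext_rec { uint64_t off, len; };

/*
 * Insert e into arr[0..*count), kept sorted by offset.
 * The caller guarantees capacity for at least *count + 1 records.
 * Returns how many existing records were shifted by the memmove.
 */
size_t sorted_insert(struct ext_rec *arr, size_t *count, struct ext_rec e)
{
	size_t lo = 0, hi = *count;

	/* Binary search for the first record with off >= e.off. */
	while (lo < hi) {
		size_t mid = lo + (hi - lo) / 2;

		if (arr[mid].off < e.off)
			lo = mid + 1;
		else
			hi = mid;
	}

	size_t moved = *count - lo;

	/* Everything at or after the insertion point slides up by one slot. */
	memmove(&arr[lo + 1], &arr[lo], moved * sizeof(*arr));
	arr[lo] = e;
	(*count)++;
	return moved;
}
```

Inserting at the front of an N-record array shifts all N records, so a
workload that fills a file back to front moves on the order of N^2 records in
total, while a per-node allocation scheme pays one allocation per insert.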

That being said, I'm still looking for something even better.

> That was before I found out how easy it is to use the rhashtable
> code and how much faster it is for large lists than an rbtree.
> That's the way I've been thinking recently, anyway...

Hashes generally aren't very good for sequential iteration, of which
we do a lot for the extent tree.  That being said, it was on my todo
list to simply give it a try after I saw the buffer cache patch.

> My plan for the blocksize < pagesize case was simply to track dirtiness
> on pages and forget about sub-page dirtiness. That way the
> IO path simply iterates entire pages to cover all the
> mapped regions of the page. iomap already does that for us, and I
> started on making writepage work that way, too. Haven't got to
> working writepage code yet, though.

There are four things that buffer_heads are used for in the blocksize <
pagesize case.

 - dirtiness - could be handled as you mentioned
 - uptodateness - we could always read in the whole page and things
   would just work.  But with a 64k page size this actually seems
   to be a performance issue, otherwise we wouldn't have the
   is_partially_uptodate address_space operation
 - tracking the block number for pure overwrites.  Probably not
   really needed
 - tracking of I/O completions - we must write out the whole page
   on a writepage call, and something must track when all I/Os
   for the page have finished so that we can unlock it (or
   drop the writeback bit for the write case).

Nothing unsolvable, but at least the last one is a little nasty,
and doing the dumb thing for the first two might cause performance
regressions.
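
Two of the items above, uptodateness and I/O completion tracking, can be
modelled with a small per-page state object. The following is only a
userspace sketch using C11 atomics, not kernel code: `struct subpage_state`
and its helpers are hypothetical, the `writeback` field merely stands in for
the page's writeback flag, and the bitmap assumes at most 32 blocks per page.

```c
#include <stdatomic.h>
#include <stdbool.h>
#include <stdint.h>

struct subpage_state {
	uint32_t   uptodate;     /* bit i set: block i holds valid data */
	atomic_int io_pending;   /* I/Os still in flight for this page */
	bool       writeback;    /* stand-in for the page's writeback flag */
};

/* True if every block in [first, first + count) is uptodate.
 * Requires first + count <= 32. */
bool range_uptodate(const struct subpage_state *s,
		    unsigned int first, unsigned int count)
{
	uint32_t bits = (count >= 32) ? UINT32_MAX
				      : ((UINT32_C(1) << count) - 1);
	uint32_t mask = bits << first;

	return (s->uptodate & mask) == mask;
}

/* Mark the page as under writeback with nr_ios I/Os about to be submitted. */
void writeback_start(struct subpage_state *s, int nr_ios)
{
	s->writeback = true;
	atomic_store(&s->io_pending, nr_ios);
}

/*
 * Called once per completed I/O.  Only the completion that drops the
 * counter to zero clears the writeback state; it returns true so the
 * caller knows it finished the page.
 */
bool writeback_io_done(struct subpage_state *s)
{
	if (atomic_fetch_sub(&s->io_pending, 1) != 1)
		return false;
	s->writeback = false;
	return true;
}
```

The counter is what lets several independent I/Os for one page complete in
any order while the page-level state is cleared exactly once.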
