* [LSF/MM TOPIC] Virtual block address space mapping
From: Dave Chinner @ 2018-01-29 10:08 UTC (permalink / raw)
  To: lsf-pc; +Cc: linux-fsdevel, linux-xfs

Hi Folks,

I want to talk about virtual block address space abstractions for
the kernel. This is the layer I've added to the IO stack to provide
cloneable subvolumes in XFS, and it's really a generic abstraction
the stack should provide, not be something hidden inside a
filesystem.

Note: this is *not* a block device interface. That's the mistake
we've made previously when trying to more closely integrate
filesystems and block devices.  Filesystems sit on a block address
space but the current stack does not allow the address space to be
separated from the block device.  This means a block based
filesystem can only sit on a block device.  By separating the
address space from block device and replacing it with a mapping
interface we can break the fs-on-bdev requirement and add
functionality that isn't currently possible.

There are two parts; first is to modify the filesystem to use a
virtual block address space, and the second is to implement a
virtual block address space provider. The provider is responsible
for snapshot/cloning subvolumes, so the provider really needs to be
a block device or filesystem that supports COW (dm-thinp,
btrfs, XFS, etc).

I've implemented both sides on XFS to provide the capability for an
XFS filesystem to host XFS subvolumes. However, this is an abstract
interface and so if someone modifies ext4 to use a virtual block
address space, then XFS will be able to host cloneable ext4
subvolumes, too. :P

The core API is a mapping and allocation interface based on the
iomap infrastructure we already use for the pNFS file layout and
fs/iomap.c. In fact, the whole mapping and two-phase write algorithm
is very similar to Christoph's export ops - we may even be able to
merge the two APIs depending on how pNFS ends up handling CoW
operations.

The API also provides space tracking cookies so that the subvolume
filesystem can reserve space in the host ahead of time and pass it
around to all the objects it modifies and writes to ensure space is
available for the writes. This matches the transaction model in
the filesystems, so the host can return ENOSPC before we start
modifying subvolume metadata and doing IO.
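
As a rough sketch of how a subvolume transaction would use this
(names from my current prototype header, posted later in this
thread; subvol_modify() and the attach helper are invented for
illustration, and error handling is elided):

static int
subvol_modify(struct blkspc_host *bsh, u32 nblocks, u32 block_size)
{
	struct blkspc_cookie	*bsc;
	int			error;

	/* Reserve worst-case space in the host before dirtying anything. */
	bsc = bsh->ops->reserve(bsh, nblocks, block_size);
	if (!bsc)
		return -ENOSPC;	/* clean failure, nothing modified yet */

	/*
	 * Pass the cookie to every object this transaction modifies so
	 * the IO paths can allocate against the reservation later.
	 * (Hypothetical helper - each object takes its own reference
	 * via blkspc_cookie_get().)
	 */
	error = subvol_attach_cookie(bsc);

	/* Dropping the last reference returns any unused reservation. */
	blkspc_cookie_put(bsc);
	return error;
}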

If block devices like dm-thinp implement a provider, then we'll also
be able to avoid the fatal ENOSPC-on-write-IO when the pool fills
unexpectedly....

There's lots to talk about here. And, in the end, if nobody thinks
this is useful, then I'll just leave it all internal to XFS. :)

Cheers,

Dave.
-- 
Dave Chinner
david@fromorbit.com


* Re: [LSF/MM TOPIC] Virtual block address space mapping
From: Darrick J. Wong @ 2018-01-31 21:25 UTC (permalink / raw)
  To: Dave Chinner; +Cc: lsf-pc, linux-fsdevel, linux-xfs

On Mon, Jan 29, 2018 at 09:08:34PM +1100, Dave Chinner wrote:
> Hi Folks,
> 
> I want to talk about virtual block address space abstractions for
> the kernel. This is the layer I've added to the IO stack to provide
> cloneable subvolumes in XFS, and it's really a generic abstraction
> the stack should provide, not be something hidden inside a
> filesystem.
> 
> Note: this is *not* a block device interface. That's the mistake
> we've made previously when trying to more closely integrate
> filesystems and block devices.  Filesystems sit on a block address
> space but the current stack does not allow the address space to be
> separated from the block device.  This means a block based
> filesystem can only sit on a block device.  By separating the
> address space from block device and replacing it with a mapping
> interface we can break the fs-on-bdev requirement and add
> functionality that isn't currently possible.
> 
> There are two parts; first is to modify the filesystem to use a
> virtual block address space, and the second is to implement a
> virtual block address space provider. The provider is responsible
> for snapshot/cloning subvolumes, so the provider really needs to be
> a block device or filesystem that supports COW (dm-thinp,
> btrfs, XFS, etc).

Since I've not seen your code, what happens for the xfs that's written to
a raw disk?  Same bdev/buftarg mechanism we use now?

> I've implemented both sides on XFS to provide the capability for an
> XFS filesystem to host XFS subvolumes. However, this is an abstract
> interface and so if someone modifies ext4 to use a virtual block
> address space, then XFS will be able to host cloneable ext4
> subvolumes, too. :P

How hard is it to retrofit an existing bdev fs to use a virtual block
address space?

> The core API is a mapping and allocation interface based on the
> iomap infrastructure we already use for the pNFS file layout and
> fs/iomap.c. In fact, the whole mapping and two-phase write algorithm
> is very similar to Christoph's export ops - we may even be able to
> merge the two APIs depending on how pNFS ends up handling CoW
> operations.

Hm, how /is/ that supposed to happen? :)

I would surmise that pre-cow would work[1] albeit slowly.  It sorta
looks like Christoph is working[2] on this for pnfs.  Looking at 2.4.5,
we preallocate all the cow staging extents, hand the client the old maps
to read from and the new maps to write to, the client deals with the
actual copy-write, and finally when the client commits then we can do
the usual remapping business.

(Yeah, that is much less nasty than my naïve approach.)

[1] https://marc.info/?l=linux-xfs&m=151626136624010&w=2
[2] https://tools.ietf.org/id/draft-hellwig-nfsv4-rdma-layout-00.html

> The API also provides space tracking cookies so that the subvolume
> filesystem can reserve space in the host ahead of time and pass it
> around to all the objects it modifies and writes to ensure space is
> available for the writes. This matches the transaction model in
> the filesystems, so the host can return ENOSPC before we start
> modifying subvolume metadata and doing IO.
> 
> If block devices like dm-thinp implement a provider, then we'll also
> be able to avoid the fatal ENOSPC-on-write-IO when the pool fills
> unexpectedly....

<nod>

--D

> There's lots to talk about here. And, in the end, if nobody thinks
> this is useful, then I'll just leave it all internal to XFS. :)
> 
> Cheers,
> 
> Dave.
> -- 
> Dave Chinner
> david@fromorbit.com

* Re: [LSF/MM TOPIC] Virtual block address space mapping
From: Dave Chinner @ 2018-02-01  2:01 UTC (permalink / raw)
  To: Darrick J. Wong; +Cc: lsf-pc, linux-fsdevel, linux-xfs

On Wed, Jan 31, 2018 at 01:25:01PM -0800, Darrick J. Wong wrote:
> On Mon, Jan 29, 2018 at 09:08:34PM +1100, Dave Chinner wrote:
> > Hi Folks,
> > 
> > I want to talk about virtual block address space abstractions for
> > the kernel. This is the layer I've added to the IO stack to provide
> > cloneable subvolumes in XFS, and it's really a generic abstraction
> > the stack should provide, not be something hidden inside a
> > filesystem.
> > 
> > Note: this is *not* a block device interface. That's the mistake
> > we've made previously when trying to more closely integrate
> > filesystems and block devices.  Filesystems sit on a block address
> > space but the current stack does not allow the address space to be
> > separated from the block device.  This means a block based
> > filesystem can only sit on a block device.  By separating the
> > address space from block device and replacing it with a mapping
> > interface we can break the fs-on-bdev requirement and add
> > functionality that isn't currently possible.
> > 
> > There are two parts; first is to modify the filesystem to use a
> > virtual block address space, and the second is to implement a
> > virtual block address space provider. The provider is responsible
> > for snapshot/cloning subvolumes, so the provider really needs to be
> > a block device or filesystem that supports COW (dm-thinp,
> > btrfs, XFS, etc).
> 
> Since I've not seen your code, what happens for the xfs that's written to
> a raw disk?  Same bdev/buftarg mechanism we use now?

Same.

> > I've implemented both sides on XFS to provide the capability for an
> > XFS filesystem to host XFS subvolumes. However, this is an abstract
> > interface and so if someone modifies ext4 to use a virtual block
> > address space, then XFS will be able to host cloneable ext4
> > subvolumes, too. :P
> 
> How hard is it to retrofit an existing bdev fs to use a virtual block
> address space?

Somewhat difficult, because the space cookies need to be plumbed
through to the IO routines. XFS has its own data and metadata IO
pathways, so it's likely to be much easier to do this for XFS than
for a filesystem using generic writeback infrastructure....

> > The core API is a mapping and allocation interface based on the
> > iomap infrastructure we already use for the pNFS file layout and
> > fs/iomap.c. In fact, the whole mapping and two-phase write algorithm
> > is very similar to Christoph's export ops - we may even be able to
> > merge the two APIs depending on how pNFS ends up handling CoW
> > operations.
> 
> Hm, how /is/ that supposed to happen? :)

Not sure - I'm waiting for Christoph to tell us. :)

> I would surmise that pre-cow would work[1] albeit slowly.  It sorta
> looks like Christoph is working[2] on this for pnfs.  Looking at 2.4.5,
> we preallocate all the cow staging extents, hand the client the old maps
> to read from and the new maps to write to, the client deals with the
> actual copy-write, and finally when the client commits then we can do
> the usual remapping business.
> 
> (Yeah, that is much less nasty than my naïve approach.)

Yup, and that's pretty much what I'm doing. The subvolume already
has the modified data in its cache, so when the subvol IO path
tries to remap the IO range, the underlying FS does the COW
allocation and indicates that it needs to run a commit operation on
IO completion. On completion, the subvol runs a ->commit(off, len,
VBAS_T_COW) operation and the underlying fs does the final
remapping.  A similar process is used to deal with preallocated
regions in the underlying file (remap returns IOMAP_UNWRITTEN,
subvol calls ->commit(VBAS_T_UNWRITTEN) on IO completion).
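
To make the two commit cases concrete, here's a minimal sketch of
the completion side, using the names from the API header posted
later in this thread (where the VBAS_T_* types above are called
BS_TYPE_*). The helper and the iomap flag checks are my
illustration of "indicates that it needs to run a commit
operation", not the actual patch:

/* Illustrative only: error handling elided. */
static void
subvol_write_completion(struct blkspc_cookie *bsc, loff_t off,
			loff_t len, struct iomap *iomap)
{
	/* Phase two: data IO is done, let the host do the final remap. */
	if (iomap->flags & IOMAP_F_SHARED)		/* COW staging extent */
		bsc->host->ops->commit(bsc, off, len, BS_TYPE_COW);
	else if (iomap->type == IOMAP_UNWRITTEN)	/* prealloc conversion */
		bsc->host->ops->commit(bsc, off, len, BS_TYPE_UNWRITTEN);
}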

This means that the subvolume looks just like direct IO to the
underlying host filesystem - the XFS VBAS host implementation is
just a thin wrapper around existing internal iomap interfaces.

I'm working on cleaning it up for an initial patch posting so people
can get a better idea of how it currently works....

Cheers,

Dave.
-- 
Dave Chinner
david@fromorbit.com

* Re: [LSF/MM TOPIC] Virtual block address space mapping
From: J. Bruce Fields @ 2018-02-01  2:23 UTC (permalink / raw)
  To: Dave Chinner; +Cc: lsf-pc, linux-fsdevel, linux-xfs

On Mon, Jan 29, 2018 at 09:08:34PM +1100, Dave Chinner wrote:
> I've implemented both sides on XFS to provide the capability for an
> XFS filesystem to host XFS subvolumes. However, this is an abstract
> interface and so if someone modifies ext4 to use a virtual block
> address space, then XFS will be able to host cloneable ext4
> subvolumes, too. :P

This sounds really interesting.

Would it be possible to write up a brief summary of your current
interface?  Even if it's not completely correct it might be a useful
starting point.

--b.


* Re: [LSF/MM TOPIC] Virtual block address space mapping
From: Dave Chinner @ 2018-02-01  5:21 UTC (permalink / raw)
  To: J. Bruce Fields; +Cc: lsf-pc, linux-fsdevel, linux-xfs

On Wed, Jan 31, 2018 at 09:23:32PM -0500, J. Bruce Fields wrote:
> On Mon, Jan 29, 2018 at 09:08:34PM +1100, Dave Chinner wrote:
> > I've implemented both sides on XFS to provide the capability for an
> > XFS filesystem to host XFS subvolumes. However, this is an abstract
> > interface and so if someone modifies ext4 to use a virtual block
> > address space, then XFS will be able to host cloneable ext4
> > subvolumes, too. :P
> 
> This sounds really interesting.
> 
> Would it be possible to write up a brief summary of your current
> interface?  Even if it's not completely correct it might be a useful
> starting point.

I almost did this, but decided I wouldn't because I haven't finished
data IO path remapping support for XFS yet. But, seeing as you
asked, I've attached the header that contains the API as it stands
below. It's intermingled with bits of the XFS implementation, and it
still has the original name I gave it when I first started ("block
space").

The difference between the "map" and "allocate" ops is that "map"
does a read-only mapping and may return holes. "allocate" is a
write mapping: it fills holes, and it requires reserved space to be
held in a blkspc cookie...

i.e. use "map" for a read IO, "allocate w/ cookie" for a write IO.
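
A minimal sketch of the two paths from the subvolume side (the
subvol_* wrappers are invented for illustration; only the op calls
and cookie helpers come from the header below):

static int
subvol_map_read(struct blkspc_host *bsh, loff_t off, loff_t len,
		struct iomap *iomap)
{
	/* Read-only mapping; the iomap may come back as a hole. */
	return bsh->ops->map(bsh, off, len, iomap);
}

static int
subvol_map_write(struct blkspc_cookie *bsc, loff_t off, loff_t len,
		 struct iomap *iomap)
{
	int error;

	/* Write mapping requires a live reservation cookie. */
	if (!blkspc_cookie_get(bsc))
		return -ENOSPC;	/* reservation was cancelled */
	error = bsc->host->ops->allocate(bsc, off, len, iomap);
	blkspc_cookie_put(bsc);
	return error;
}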

Cheers,

Dave.
-- 
Dave Chinner
david@fromorbit.com


/* SPDX-License-Identifier: GPL-2.0 */
/* Copyright (c) 2017 Red Hat, Inc. All rights reserved. */

#ifndef __XFS_REMAP_H
#define __XFS_REMAP_H	1

struct xfs_buf;
struct xfs_inode;
struct iomap;
struct blkspc_ops;

struct blkspc_host {
	void			*priv;	/* host private data */
	struct block_device	*bdev;	/* target device */
	const struct blkspc_ops	*ops;
};

struct blkspc_cookie {
	refcount_t	ref;
	bool		cancelled;		/* no longer usable */
	struct blkspc_host *host;		/* backpointer to owner */
	void		*priv;			/* blkdev private data */
};

struct blkspc_ops {
	/* Reserve worst-case space for alloc_count blocks of block_size. */
	struct blkspc_cookie * (*reserve)(struct blkspc_host *bsh,
					  u32 alloc_count, u32 block_size);
	/* Return any unused reservation held by the cookie. */
	int	(*unreserve)(struct blkspc_cookie *bsc);
	/* Write mapping: fills holes, drawing on the cookie's reservation. */
	int	(*allocate)(struct blkspc_cookie *bsc, loff_t offset, loff_t count,
			    struct iomap *iomap);
	/* IO completion: final COW remap or unwritten conversion (BS_TYPE_*). */
	int	(*commit)(struct blkspc_cookie *bsc, loff_t offset,
			  loff_t count, int type);
	/* Read-only mapping: may return holes. */
	int	(*map)(struct blkspc_host *bsh, loff_t offset, loff_t count,
			    struct iomap *iomap);
};

enum {
	BS_TYPE_COW,
	BS_TYPE_UNWRITTEN,
};

static inline struct blkspc_cookie *
blkspc_cookie_get(
	struct blkspc_cookie	*bsc)
{
	trace_printk("blkspc_cookie_get %p %d", bsc, refcount_read(&bsc->ref));
	if (bsc->cancelled)
		return NULL;
	if (!refcount_inc_not_zero(&bsc->ref))
		return NULL;
	return bsc;
}

static inline void
blkspc_cookie_put(
	struct blkspc_cookie	*bsc)
{
	trace_printk("blkspc_cookie_put %p %d", bsc, refcount_read(&bsc->ref));
	if (!refcount_dec_and_test(&bsc->ref))
		return;
	/* last ref, return unused reservation */
	bsc->host->ops->unreserve(bsc);
	kmem_free(bsc);
}

static inline void
blkspc_cookie_cancel(
	struct blkspc_cookie	*bsc)
{
	bsc->cancelled = true;
}

/*
 * XXX: need a generic init function
 */
struct blkspc_host *
	xfs_blkspc_init(struct xfs_inode *host_ip, struct block_device *bdev);
void	xfs_blkspc_destroy(struct blkspc_host *bsh);

int	xfs_buf_remap_iodone(struct xfs_buf *bp);
int	xfs_buf_remap(struct xfs_buf *bp, struct blkspc_cookie *bsc);

#endif /* __XFS_REMAP_H */
