* [LSF/MM TOPIC] Remote access to pmem on storage targets
From: Chuck Lever @ 2016-01-25 21:19 UTC
To: lsf-pc
Cc: Linux NFS Mailing List, Linux RDMA Mailing List, linux-fsdevel

I'd like to propose a discussion of how to take advantage of
persistent memory in network-attached storage scenarios.

RDMA runs on high speed network fabrics and offloads data
transfer from host CPUs. Thus it is a good match to the
performance characteristics of persistent memory.

Today Linux supports iSER, SRP, and NFS/RDMA on RDMA
fabrics. What kind of changes are needed in the Linux I/O
stack (in particular, storage targets) and in these storage
protocols to get the most benefit from ultra-low latency
storage?

There have been recent proposals about how storage protocols
and implementations might need to change (eg. Tom Talpey's
SNIA proposals for changing to a push data transfer model,
Sagi's proposal to utilize DAX under the NFS/RDMA server,
and my proposal for a new pNFS layout to drive RDMA data
transfer directly).

The outcome of the discussion would be to understand what
people are working on now and what is the desired
architectural approach in order to determine where storage
developers should be focused.

This could be either a BoF or a session during the main
tracks. There is sure to be a narrow segment of each
track's attendees that would have interest in this topic.

--
Chuck Lever
* Re: [Lsf-pc] [LSF/MM TOPIC] Remote access to pmem on storage targets
From: Jan Kara @ 2016-01-26 8:25 UTC
To: Chuck Lever
Cc: lsf-pc, Linux RDMA Mailing List, linux-fsdevel, Linux NFS Mailing List

Hello,

On Mon 25-01-16 16:19:24, Chuck Lever wrote:
> I'd like to propose a discussion of how to take advantage of
> persistent memory in network-attached storage scenarios.
>
> RDMA runs on high speed network fabrics and offloads data
> transfer from host CPUs. Thus it is a good match to the
> performance characteristics of persistent memory.
>
> Today Linux supports iSER, SRP, and NFS/RDMA on RDMA
> fabrics. What kind of changes are needed in the Linux I/O
> stack (in particular, storage targets) and in these storage
> protocols to get the most benefit from ultra-low latency
> storage?
>
> There have been recent proposals about how storage protocols
> and implementations might need to change (eg. Tom Talpey's
> SNIA proposals for changing to a push data transfer model,
> Sagi's proposal to utilize DAX under the NFS/RDMA server,
> and my proposal for a new pNFS layout to drive RDMA data
> transfer directly).
>
> The outcome of the discussion would be to understand what
> people are working on now and what is the desired
> architectural approach in order to determine where storage
> developers should be focused.
>
> This could be either a BoF or a session during the main
> tracks. There is sure to be a narrow segment of each
> track's attendees that would have interest in this topic.

So hashing out details of pNFS layout isn't interesting to many people.
But if you want a broader architectural discussion about how to overcome
issues (and what those issues actually are) with the use of persistent
memory for NAS, then that may be interesting. So what do you actually want?

                                                                Honza
--
Jan Kara <jack@suse.com>
SUSE Labs, CR
* Re: [Lsf-pc] [LSF/MM TOPIC] Remote access to pmem on storage targets
From: Chuck Lever @ 2016-01-26 15:58 UTC
To: Jan Kara
Cc: lsf-pc, Linux RDMA Mailing List, linux-fsdevel, Linux NFS Mailing List

> On Jan 26, 2016, at 3:25 AM, Jan Kara <jack@suse.cz> wrote:
>
> Hello,
>
> On Mon 25-01-16 16:19:24, Chuck Lever wrote:
>> I'd like to propose a discussion of how to take advantage of
>> persistent memory in network-attached storage scenarios.
>>
>> RDMA runs on high speed network fabrics and offloads data
>> transfer from host CPUs. Thus it is a good match to the
>> performance characteristics of persistent memory.
>>
>> Today Linux supports iSER, SRP, and NFS/RDMA on RDMA
>> fabrics. What kind of changes are needed in the Linux I/O
>> stack (in particular, storage targets) and in these storage
>> protocols to get the most benefit from ultra-low latency
>> storage?
>>
>> There have been recent proposals about how storage protocols
>> and implementations might need to change (eg. Tom Talpey's
>> SNIA proposals for changing to a push data transfer model,
>> Sagi's proposal to utilize DAX under the NFS/RDMA server,
>> and my proposal for a new pNFS layout to drive RDMA data
>> transfer directly).
>>
>> The outcome of the discussion would be to understand what
>> people are working on now and what is the desired
>> architectural approach in order to determine where storage
>> developers should be focused.
>>
>> This could be either a BoF or a session during the main
>> tracks. There is sure to be a narrow segment of each
>> track's attendees that would have interest in this topic.
>
> So hashing out details of pNFS layout isn't interesting to many people.
> But if you want a broader architectural discussion about how to overcome
> issues (and what those issues actually are) with the use of persistent
> memory for NAS, then that may be interesting. So what do you actually want?

I mentioned pNFS briefly only as an example. There have
been a variety of proposals and approaches so far, and
it's time, I believe, to start focusing our efforts.

Thus I'm requesting a "broader architectural discussion
about how to overcome issues with the use of persistent
memory for NAS," in particular how we'd like to do this
with the Linux implementations of the iSER, SRP, and
NFS/RDMA protocols using DAX/pmem or NVM[ef].

It is not going to be like the well-worn paradigm that
involves a page cache on the storage target backed by
slow I/O operations. The protocol layers on storage
targets need a way to discover memory addresses of
persistent memory that will be used as source/sink
buffers for RDMA operations.

And making data durable after a write is going to need
some thought. So I believe some new plumbing will be
necessary.

I know this is not everyone's cup of tea. A BoF would
be fine, if the PC believes that is a better venue (and
I'm kind of leaning that way myself).

--
Chuck Lever
* Re: [Lsf-pc] [LSF/MM TOPIC] Remote access to pmem on storage targets
From: Dave Chinner @ 2016-01-27 0:04 UTC
To: Chuck Lever
Cc: Jan Kara, lsf-pc, Linux RDMA Mailing List, linux-fsdevel, Linux NFS Mailing List

On Tue, Jan 26, 2016 at 10:58:44AM -0500, Chuck Lever wrote:
> It is not going to be like the well-worn paradigm that
> involves a page cache on the storage target backed by
> slow I/O operations. The protocol layers on storage
> targets need a way to discover memory addresses of
> persistent memory that will be used as source/sink
> buffers for RDMA operations.
>
> And making data durable after a write is going to need
> some thought. So I believe some new plumbing will be
> necessary.

Haven't we already solved this for the pNFS file driver that XFS
implements? i.e. these export operations:

	int (*get_uuid)(struct super_block *sb, u8 *buf, u32 *len, u64 *offset);
	int (*map_blocks)(struct inode *inode, loff_t offset,
			  u64 len, struct iomap *iomap,
			  bool write, u32 *device_generation);
	int (*commit_blocks)(struct inode *inode, struct iomap *iomaps,
			     int nr_iomaps, struct iattr *iattr);

so mapping/allocation of file offset to sector mappings, which can
then trivially be used to grab the memory address through the bdev
->direct_access method, yes?

Cheers,
Dave.
--
Dave Chinner
david@fromorbit.com
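To make that suggestion concrete, here is a minimal sketch (illustrative only,
not existing nfsd or XFS code) of how a storage target might chain those two
steps: call the filesystem's ->map_blocks export operation to get the file
offset to sector mapping, then translate the sector into a kernel-virtual pmem
address. pmem_direct_map() is a hypothetical stand-in for the bdev
->direct_access call, and the iomap field names are approximate:

	#include <linux/fs.h>
	#include <linux/exportfs.h>

	/* Hypothetical wrapper around the bdev ->direct_access method. */
	int pmem_direct_map(struct block_device *bdev, sector_t blkno, u64 len,
			    void **kaddr);

	static int target_map_pmem_range(struct inode *inode, loff_t offset,
					 u64 len, bool write,
					 struct iomap *iomap, void **kaddr)
	{
		const struct export_operations *ex = inode->i_sb->s_export_op;
		u32 device_generation;
		int error;

		/* 1. Ask the filesystem for the file offset -> sector mapping. */
		error = ex->map_blocks(inode, offset, len, iomap, write,
				       &device_generation);
		if (error)
			return error;

		/* 2. Turn the sector into a persistent-memory kernel address
		 *    the transport can hand to the RDMA layer as a source/sink. */
		return pmem_direct_map(inode->i_sb->s_bdev, iomap->blkno,
				       iomap->length, kaddr);
	}

A write path would later pair the ->map_blocks call with ->commit_blocks, as
discussed further down the thread.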
* Re: [Lsf-pc] [LSF/MM TOPIC] Remote access to pmem on storage targets
From: Chuck Lever @ 2016-01-27 15:55 UTC
To: Dave Chinner
Cc: Jan Kara, lsf-pc, Linux RDMA Mailing List, linux-fsdevel, Linux NFS Mailing List

> On Jan 26, 2016, at 7:04 PM, Dave Chinner <david@fromorbit.com> wrote:
>
> On Tue, Jan 26, 2016 at 10:58:44AM -0500, Chuck Lever wrote:
>> It is not going to be like the well-worn paradigm that
>> involves a page cache on the storage target backed by
>> slow I/O operations. The protocol layers on storage
>> targets need a way to discover memory addresses of
>> persistent memory that will be used as source/sink
>> buffers for RDMA operations.
>>
>> And making data durable after a write is going to need
>> some thought. So I believe some new plumbing will be
>> necessary.
>
> Haven't we already solved this for the pNFS file driver that XFS
> implements? i.e. these export operations:
>
> 	int (*get_uuid)(struct super_block *sb, u8 *buf, u32 *len, u64 *offset);
> 	int (*map_blocks)(struct inode *inode, loff_t offset,
> 			  u64 len, struct iomap *iomap,
> 			  bool write, u32 *device_generation);
> 	int (*commit_blocks)(struct inode *inode, struct iomap *iomaps,
> 			     int nr_iomaps, struct iattr *iattr);
>
> so mapping/allocation of file offset to sector mappings, which can
> then trivially be used to grab the memory address through the bdev
> ->direct_access method, yes?

Thanks, that makes sense. How would such addresses be
utilized?

I'll speak about the NFS/RDMA server for this example, as
I am more familiar with that than with block targets. When
I say "NFS server" here I mean the software service on the
storage target that speaks the NFS protocol.

In today's RDMA-enabled storage protocols, an initiator
exposes its memory (in small segments) to storage targets,
sends a request, and the target's network transport performs
RDMA Read and Write operations to move the payload data in
that request.

Assuming the NFS server is somehow aware that what it is
getting from ->direct_access is a persistent memory address
and not an LBA, it would then have to pass it down to the
transport layer (svcrdma) so that the address can be used
as a source or sink buffer for RDMA operations.

For an NFS READ, this should be straightforward. An RPC
request comes in, the NFS server identifies the memory that
is to source the READ reply and passes the address of that
memory to the transport, which then pushes the data in
that memory via an RDMA Write to the client.

NFS WRITEs are more difficult. An RPC request comes in,
and today the transport layer gathers incoming payload data
in anonymous pages before the NFS server even knows there
is an incoming RPC. We'd have to add some kind of hook to
enable the NFS server and the underlying filesystem to
provide appropriate sink buffers to the transport.

After the NFS WRITE request has been wholly received, the
NFS server today uses vfs_writev to put that data into the
target file. We'd probably want something more efficient
for pmem-backed filesystems. We want something more
efficient for traditional page cache-based filesystems
anyway.

Every NFS WRITE larger than a page would be essentially
CoW, since the filesystem would need to provide "anonymous"
blocks to sink incoming WRITE data and then transition
those blocks into the target file? Not sure how this works
for pNFS with block devices.

Finally a client needs to perform an NFS COMMIT to ensure
that the written data is at rest on durable storage. We
could insist that all NFS WRITE operations to pmem will
be DATA_SYNC or better (in other words, abandon UNSTABLE
mode). If not, then a separate NFS COMMIT/LAYOUTCOMMIT
is necessary to flush memory caches and ensure data
durability. An extra RPC round trip is likely not a good
idea when the cost structure of NFS WRITE is so much
different than it is for traditional block devices.

NFS WRITE is really the thing we want to go as fast as
possible, btw. NFS READ on RDMA is already faster, for
reasons I won't go into here. Aside from that, NFS READ
results are frequently cached on clients, and some of the
cost of NFS READ is already hidden by read-ahead. Because
read(2) is often satisfied from a local cache, application
progress is more frequently blocked by pending write(2)
calls than by reads.

A fully generic solution would have to provide NFS service
for transports that do not enable direct data placement
(eg TCP), and for filesystems that are legacy page
cache-based (anything residing on a traditional block
device).

I imagine that the issues are similar for block targets, if
they assume block devices are fronted by a memory cache.

--
Chuck Lever
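As a purely illustrative sketch of the NFS READ direction described above:
once the server knows the pmem address sourcing the reply, the transport could
push it to the client's advertised memory segment with a single RDMA Write.
The function name is invented, registration policy and error handling are
omitted, and whether a pmem kernel address can simply be DMA-mapped like this
is itself one of the open plumbing questions:

	#include <rdma/ib_verbs.h>

	static int rdma_push_read_reply_from_pmem(struct ib_qp *qp,
						  struct ib_pd *pd,
						  void *pmem_kaddr, u32 len,
						  u64 client_addr,
						  u32 client_rkey)
	{
		struct ib_sge sge;
		struct ib_rdma_wr wr = { };
		struct ib_send_wr *bad_wr;

		/* DMA-map the pmem region that sources the READ reply. */
		sge.addr   = ib_dma_map_single(qp->device, pmem_kaddr, len,
					       DMA_TO_DEVICE);
		sge.length = len;
		sge.lkey   = pd->local_dma_lkey;

		/* RDMA Write into the memory segment the client exposed. */
		wr.wr.opcode     = IB_WR_RDMA_WRITE;
		wr.wr.sg_list    = &sge;
		wr.wr.num_sge    = 1;
		wr.wr.send_flags = IB_SEND_SIGNALED;
		wr.remote_addr   = client_addr;
		wr.rkey          = client_rkey;

		return ib_post_send(qp, &wr.wr, &bad_wr);
	}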
* Re: [Lsf-pc] [LSF/MM TOPIC] Remote access to pmem on storage targets
From: Dave Chinner @ 2016-01-28 21:10 UTC
To: Chuck Lever
Cc: Jan Kara, lsf-pc, Linux RDMA Mailing List, linux-fsdevel, Linux NFS Mailing List

On Wed, Jan 27, 2016 at 10:55:36AM -0500, Chuck Lever wrote:
>
>> On Jan 26, 2016, at 7:04 PM, Dave Chinner <david@fromorbit.com> wrote:
>>
>> On Tue, Jan 26, 2016 at 10:58:44AM -0500, Chuck Lever wrote:
>>> It is not going to be like the well-worn paradigm that
>>> involves a page cache on the storage target backed by
>>> slow I/O operations. The protocol layers on storage
>>> targets need a way to discover memory addresses of
>>> persistent memory that will be used as source/sink
>>> buffers for RDMA operations.
>>>
>>> And making data durable after a write is going to need
>>> some thought. So I believe some new plumbing will be
>>> necessary.
>>
>> Haven't we already solved this for the pNFS file driver that XFS
>> implements? i.e. these export operations:
>>
>> 	int (*get_uuid)(struct super_block *sb, u8 *buf, u32 *len, u64 *offset);
>> 	int (*map_blocks)(struct inode *inode, loff_t offset,
>> 			  u64 len, struct iomap *iomap,
>> 			  bool write, u32 *device_generation);
>> 	int (*commit_blocks)(struct inode *inode, struct iomap *iomaps,
>> 			     int nr_iomaps, struct iattr *iattr);
>>
>> so mapping/allocation of file offset to sector mappings, which can
>> then trivially be used to grab the memory address through the bdev
>> ->direct_access method, yes?
>
> Thanks, that makes sense. How would such addresses be
> utilized?

That's a different problem, and you need to talk to the IO guys
about that.

> I'll speak about the NFS/RDMA server for this example, as
> I am more familiar with that than with block targets. When
> I say "NFS server" here I mean the software service on the
> storage target that speaks the NFS protocol.
>
> In today's RDMA-enabled storage protocols, an initiator
> exposes its memory (in small segments) to storage targets,
> sends a request, and the target's network transport performs
> RDMA Read and Write operations to move the payload data in
> that request.
>
> Assuming the NFS server is somehow aware that what it is
> getting from ->direct_access is a persistent memory address
> and not an LBA, it would then have to pass it down to the
> transport layer (svcrdma) so that the address can be used
> as a source or sink buffer for RDMA operations.
>
> For an NFS READ, this should be straightforward. An RPC
> request comes in, the NFS server identifies the memory that
> is to source the READ reply and passes the address of that
> memory to the transport, which then pushes the data in
> that memory via an RDMA Write to the client.

Right, it's no different from using the page cache, except for
however the memory address is then mapped by the IO subsystem for
the DMA transfer...

> NFS WRITEs are more difficult. An RPC request comes in,
> and today the transport layer gathers incoming payload data
> in anonymous pages before the NFS server even knows there
> is an incoming RPC. We'd have to add some kind of hook to
> enable the NFS server and the underlying filesystem to
> provide appropriate sink buffers to the transport.

->map_blocks needs to be called to allocate/map the file offset and
return a memory address before the data is sent from the client.

> After the NFS WRITE request has been wholly received, the
> NFS server today uses vfs_writev to put that data into the
> target file. We'd probably want something more efficient
> for pmem-backed filesystems. We want something more
> efficient for traditional page cache-based filesystems
> anyway.

Yup, see above.

> Every NFS WRITE larger than a page would be essentially
> CoW, since the filesystem would need to provide "anonymous"
> blocks to sink incoming WRITE data and then transition
> those blocks into the target file? Not sure how this works
> for pNFS with block devices.

No, ->map_blocks can return blocks that are already allocated to
the file at the given offset, hence overwrite in place works just
fine.

> Finally a client needs to perform an NFS COMMIT to ensure
> that the written data is at rest on durable storage. We
> could insist that all NFS WRITE operations to pmem will
> be DATA_SYNC or better (in other words, abandon UNSTABLE
> mode).

You could, but you'd still need the two map/commit calls into the
filesystem to get the memory and mark the write done...

> If not, then a separate NFS COMMIT/LAYOUTCOMMIT
> is necessary to flush memory caches and ensure data
> durability. An extra RPC round trip is likely not a good
> idea when the cost structure of NFS WRITE is so much
> different than it is for traditional block devices.

IIRC, ->commit_blocks is called from the LAYOUTCOMMIT operation.
You'll need to call this to pair the ->map_blocks call above that
provided the memory as the data sink for the write. This is because
->map_blocks allocates unwritten extents so that stale data will
not be exposed before the write is complete and ->commit_blocks is
called to remove the unwritten extent flag.

> I imagine that the issues are similar for block targets, if
> they assume block devices are fronted by a memory cache.

Yup, hence the "three phase" write operation - map blocks, write
data, commit blocks.

Cheers,
Dave.
--
Dave Chinner
david@fromorbit.com
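A minimal sketch of that three-phase sequence on the server side, to make the
pairing explicit. Only the two export operations are real interfaces here; the
transport helper and the function name are invented for illustration:

	#include <linux/fs.h>
	#include <linux/exportfs.h>
	#include <linux/sunrpc/svc.h>

	/* Hypothetical transport hook: RDMA Read the client's write payload
	 * directly into the pmem blocks described by the iomap. */
	int svcrdma_recv_payload_into_pmem(struct svc_rqst *rqstp,
					   struct iomap *iomap, u64 len);

	static int pmem_backed_nfs_write(struct inode *inode, loff_t offset,
					 u64 len, struct svc_rqst *rqstp,
					 struct iattr *iattr)
	{
		const struct export_operations *ex = inode->i_sb->s_export_op;
		struct iomap iomap;
		u32 device_generation;
		int error;

		/* Phase 1: map/allocate; unwritten extents keep stale data
		 * from being exposed before the write completes. */
		error = ex->map_blocks(inode, offset, len, &iomap, true,
				       &device_generation);
		if (error)
			return error;

		/* Phase 2: land the payload straight in the mapped pmem. */
		error = svcrdma_recv_payload_into_pmem(rqstp, &iomap, len);
		if (error)
			return error;

		/* Phase 3: clear the unwritten flag.  Flushing CPU caches
		 * for durability is a separate problem, as noted elsewhere
		 * in the thread. */
		return ex->commit_blocks(inode, &iomap, 1, iattr);
	}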
* Re: [Lsf-pc] [LSF/MM TOPIC] Remote access to pmem on storage targets
From: Sagi Grimberg @ 2016-01-27 10:52 UTC
To: Chuck Lever, Jan Kara
Cc: lsf-pc, Linux RDMA Mailing List, linux-fsdevel, Linux NFS Mailing List

>> So hashing out details of pNFS layout isn't interesting to many people.
>> But if you want a broader architectural discussion about how to overcome
>> issues (and what those issues actually are) with the use of persistent
>> memory for NAS, then that may be interesting. So what do you actually want?
>
> I mentioned pNFS briefly only as an example. There have
> been a variety of proposals and approaches so far, and
> it's time, I believe, to start focusing our efforts.
>
> Thus I'm requesting a "broader architectural discussion
> about how to overcome issues with the use of persistent
> memory for NAS," in particular how we'd like to do this
> with the Linux implementations of the iSER, SRP, and
> NFS/RDMA protocols using DAX/pmem or NVM[ef].

I agree. I anticipate that we'll gradually see more and more
implementations optimizing remote storage access in the presence of
pmem devices (maybe even not only RDMA?). The straightforward approach
would be for each implementation to have its own logic for accessing
remote pmem devices, but I think we have a chance to consolidate that
in a single API for everyone. I think the most natural way to start is
NFS/RDMA (SCSI would be a bit more challenging...)

> It is not going to be like the well-worn paradigm that
> involves a page cache on the storage target backed by
> slow I/O operations. The protocol layers on storage
> targets need a way to discover memory addresses of
> persistent memory that will be used as source/sink
> buffers for RDMA operations.
> And making data durable after a write is going to need
> some thought. So I believe some new plumbing will be
> necessary.

The challenge here is persistence semantics that are missing in
today's HCAs, so I think we should aim to have a SW solution for
remote persistence semantics with sufficient hooks for possible HW
that might be able to have it in the future...

> I know this is not everyone's cup of tea. A BoF would
> be fine, if the PC believes that is a better venue (and
> I'm kind of leaning that way myself).

I'd be happy to join such a discussion.
* Re: [LSF/MM TOPIC] Remote access to pmem on storage targets
From: Atchley, Scott @ 2016-01-26 15:25 UTC
To: Chuck Lever
Cc: lsf-pc, Linux NFS Mailing List, Linux RDMA Mailing List, linux-fsdevel

> On Jan 25, 2016, at 4:19 PM, Chuck Lever <chuck.lever@oracle.com> wrote:
>
> I'd like to propose a discussion of how to take advantage of
> persistent memory in network-attached storage scenarios.
>
> RDMA runs on high speed network fabrics and offloads data
> transfer from host CPUs. Thus it is a good match to the
> performance characteristics of persistent memory.
>
> Today Linux supports iSER, SRP, and NFS/RDMA on RDMA
> fabrics. What kind of changes are needed in the Linux I/O
> stack (in particular, storage targets) and in these storage
> protocols to get the most benefit from ultra-low latency
> storage?
>
> There have been recent proposals about how storage protocols
> and implementations might need to change (eg. Tom Talpey's
> SNIA proposals for changing to a push data transfer model,
> Sagi's proposal to utilize DAX under the NFS/RDMA server,
> and my proposal for a new pNFS layout to drive RDMA data
> transfer directly).
>
> The outcome of the discussion would be to understand what
> people are working on now and what is the desired
> architectural approach in order to determine where storage
> developers should be focused.
>
> This could be either a BoF or a session during the main
> tracks. There is sure to be a narrow segment of each
> track's attendees that would have interest in this topic.
>
> --
> Chuck Lever

Chuck,

One difference on targets is that some NVM/persistent memory may be
byte-addressable while other NVM is only block addressable.

Another difference is that NVMe-over-Fabrics will allow remote access
of the target’s NVMe devices using the NVMe API.

Scott
* Re: [LSF/MM TOPIC] Remote access to pmem on storage targets
From: Chuck Lever @ 2016-01-26 15:29 UTC
To: Atchley, Scott
Cc: lsf-pc, Linux NFS Mailing List, Linux RDMA Mailing List, linux-fsdevel

> On Jan 26, 2016, at 10:25 AM, Atchley, Scott <atchleyes@ornl.gov> wrote:
>
>> On Jan 25, 2016, at 4:19 PM, Chuck Lever <chuck.lever@oracle.com> wrote:
>>
>> I'd like to propose a discussion of how to take advantage of
>> persistent memory in network-attached storage scenarios.
>>
>> RDMA runs on high speed network fabrics and offloads data
>> transfer from host CPUs. Thus it is a good match to the
>> performance characteristics of persistent memory.
>>
>> Today Linux supports iSER, SRP, and NFS/RDMA on RDMA
>> fabrics. What kind of changes are needed in the Linux I/O
>> stack (in particular, storage targets) and in these storage
>> protocols to get the most benefit from ultra-low latency
>> storage?
>>
>> There have been recent proposals about how storage protocols
>> and implementations might need to change (eg. Tom Talpey's
>> SNIA proposals for changing to a push data transfer model,
>> Sagi's proposal to utilize DAX under the NFS/RDMA server,
>> and my proposal for a new pNFS layout to drive RDMA data
>> transfer directly).
>>
>> The outcome of the discussion would be to understand what
>> people are working on now and what is the desired
>> architectural approach in order to determine where storage
>> developers should be focused.
>>
>> This could be either a BoF or a session during the main
>> tracks. There is sure to be a narrow segment of each
>> track's attendees that would have interest in this topic.
>>
>> --
>> Chuck Lever
>
> Chuck,
>
> One difference on targets is that some NVM/persistent memory may be
> byte-addressable while other NVM is only block addressable.
>
> Another difference is that NVMe-over-Fabrics will allow remote access
> of the target's NVMe devices using the NVMe API.

As I understand it, NVMf devices look like local devices.
NVMf devices need globally unique naming to enable safe use
with pNFS and other remote storage access protocols.

--
Chuck Lever
* Re: [LSF/MM TOPIC] Remote access to pmem on storage targets
From: Christoph Hellwig @ 2016-01-26 17:00 UTC
To: Chuck Lever
Cc: Atchley, Scott, lsf-pc, Linux NFS Mailing List, Linux RDMA Mailing List, linux-fsdevel

On Tue, Jan 26, 2016 at 10:29:35AM -0500, Chuck Lever wrote:
> As I understand it, NVMf devices look like local devices.
> NVMf devices need globally unique naming to enable safe use
> with pNFS and other remote storage access protocols.

NVMe provides globally unique identifiers similar to SCSI, and in fact
there is even a standardised mapping to SCSI. The current SCSI layout
draft will work fine with both multi-ported PCIe NVMe devices as well
as future fabrics devices.
* [LSF/MM TOPIC/ATTEND] RDMA passive target
From: Boaz Harrosh @ 2016-01-27 16:54 UTC
To: Chuck Lever, lsf-pc, Dan Williams, Yigal Korman
Cc: Linux NFS Mailing List, Linux RDMA Mailing List, linux-fsdevel, Jan Kara, Ric Wheeler

On 01/25/2016 11:19 PM, Chuck Lever wrote:
> I'd like to propose a discussion of how to take advantage of
> persistent memory in network-attached storage scenarios.
>
> RDMA runs on high speed network fabrics and offloads data
> transfer from host CPUs. Thus it is a good match to the
> performance characteristics of persistent memory.
>
> Today Linux supports iSER, SRP, and NFS/RDMA on RDMA
> fabrics. What kind of changes are needed in the Linux I/O
> stack (in particular, storage targets) and in these storage
> protocols to get the most benefit from ultra-low latency
> storage?
>
> There have been recent proposals about how storage protocols
> and implementations might need to change (eg. Tom Talpey's
> SNIA proposals for changing to a push data transfer model,
> Sagi's proposal to utilize DAX under the NFS/RDMA server,
> and my proposal for a new pNFS layout to drive RDMA data
> transfer directly).
>
> The outcome of the discussion would be to understand what
> people are working on now and what is the desired
> architectural approach in order to determine where storage
> developers should be focused.
>
> This could be either a BoF or a session during the main
> tracks. There is sure to be a narrow segment of each
> track's attendees that would have interest in this topic.

I would like to attend this talk, and also talk about a target we have
been developing / utilizing that we would like to propose as a Linux
standard driver.

(It would be very important for me to also attend the other pmem talks
in LSF, as well as some of the MM and FS talks proposed so far.)

RDMA passive target
~~~~~~~~~~~~~~~~~~~

The idea is to have a storage brick that exports a very low-level, pure
RDMA API to access its memory-based storage. The brick might be
battery-backed volatile memory, or pmem-based. In any case the brick
might utilize a much higher capacity than memory by utilizing a
"tiering" to slower media, which is enabled by the API.

The API is simple:

1. Alloc_2M_block_at_virtual_address (ADDR_64_BIT)
   ADDR_64_BIT is any virtual address and defines the logical ID of the block.
   If the ID is already allocated an error is returned.
   If storage is exhausted return => ENOSPC
2. Free_2M_block_at_virtual_address (ADDR_64_BIT)
   Space for the logical ID is returned to the free store and the ID becomes
   free for a new allocation.
3. map_virtual_address(ADDR_64_BIT, flags) => RDMA handle
   A previously allocated virtual address is locked in memory and an RDMA
   handle is returned.
   Flags: read-only, read-write, shared, and so on...
4. unmap_virtual_address(ADDR_64_BIT)
   At this point the brick can write data to slower storage if memory space
   is needed. The RDMA handle from [3] is revoked.
5. List_mapped_IDs
   An extent-based list of all allocated ranges. (This is usually used on
   mount or after a crash.)

The dumb brick is not the network allocator / storage manager at all, and it
is not a smart target / server like an iSER target or a pNFS DS. A SW-defined
application can do that, on top of the dumb brick.
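An illustrative C rendering of those five operations, to make the shape of the
interface concrete. Every name and type here (the pbrick_ prefix, the handle
and extent layouts) is invented for the sketch rather than taken from an
existing driver:

	#include <stdint.h>

	typedef uint64_t pbrick_id_t;	/* 64-bit virtual address == logical block ID */

	struct pbrick_rdma_handle {
		uint64_t remote_addr;	/* target-side address of the 2M block */
		uint32_t rkey;		/* remote key for RDMA access */
	};

	struct pbrick_extent {
		pbrick_id_t first;	/* first logical ID in the range */
		uint64_t    nr_blocks;	/* number of consecutive 2M blocks */
	};

	enum pbrick_map_flags {
		PBRICK_MAP_RO     = 1 << 0,
		PBRICK_MAP_RW     = 1 << 1,
		PBRICK_MAP_SHARED = 1 << 2,
	};

	/* 1. Allocate a 2M block with the given logical ID; ENOSPC if
	 *    storage is exhausted, an error if the ID is already allocated. */
	int pbrick_alloc_2m(pbrick_id_t id);

	/* 2. Return the block to the free store; the ID becomes free again. */
	int pbrick_free_2m(pbrick_id_t id);

	/* 3. Lock the block in target memory and return an RDMA handle. */
	int pbrick_map(pbrick_id_t id, enum pbrick_map_flags flags,
		       struct pbrick_rdma_handle *handle);

	/* 4. Revoke the handle; the brick may now tier the data to slower media. */
	int pbrick_unmap(pbrick_id_t id);

	/* 5. Extent-based listing of all allocated ranges (used at mount or
	 *    after a crash). */
	int pbrick_list_mapped(struct pbrick_extent *extents,
			       unsigned int *nr_extents);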
The motivation is a low-level, very low latency API+library, which can be
built upon for higher protocols or used directly for a very low latency
cluster. It does, however, manage a virtual allocation map of
logical-to-physical mappings of the 2M blocks.

Currently both drivers, initiator and target, are in kernel, but with the
latest advancements by Dan Williams it can be implemented in user mode as
well. Almost.

The almost is because:
1. If the target is over a /dev/pmemX then all is fine; we have 2M
   contiguous memory blocks.
2. If the target is over an FS, we have a proposal pending for a falloc_2M
   flag to ask the FS for contiguous 2M allocations only. If any of the 2M
   allocations fails then return ENOSPC from falloc. This way we guarantee
   that each 2M block can be mapped by a single RDMA handle.
   An FS for this purpose is nice for over-allocated / dynamic space usage
   by a target and other resources in the server.

RDMA Initiator
~~~~~~~~~~~~~~~~~~~

The initiator is just a simple library. Both user-mode and kernel-side
versions should be available, for direct access to the RDMA passive brick.

Thanks.
Boaz

> --
> Chuck Lever
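For the filesystem-backed case in point 2, the usage from a user-mode target
might look roughly like this. The FALLOC_FL_2M_CONTIG flag is hypothetical
(the proposal is only pending) and its value is made up for the sketch:

	#define _GNU_SOURCE
	#include <fcntl.h>
	#include <errno.h>

	#define FALLOC_FL_2M_CONTIG	0x100		/* hypothetical, not upstream */
	#define PBRICK_BLOCK_SIZE	(2UL << 20)	/* 2M */

	/* Back one logical 2M block with a physically contiguous 2M chunk of
	 * the file, or fail with ENOSPC if the FS cannot provide one. */
	static int backing_alloc_2m(int fd, off_t file_offset)
	{
		if (fallocate(fd, FALLOC_FL_2M_CONTIG, file_offset,
			      PBRICK_BLOCK_SIZE) == 0)
			return 0;
		return -errno;	/* ENOSPC: no contiguous 2M chunk available */
	}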
* Re: [Lsf-pc] [LSF/MM TOPIC/ATTEND] RDMA passive target
From: James Bottomley @ 2016-01-27 17:02 UTC
To: Boaz Harrosh, Chuck Lever, lsf-pc, Dan Williams, Yigal Korman
Cc: Linux RDMA Mailing List, linux-fsdevel, Linux NFS Mailing List, Jan Kara, Ric Wheeler

On Wed, 2016-01-27 at 18:54 +0200, Boaz Harrosh wrote:
> On 01/25/2016 11:19 PM, Chuck Lever wrote:
> > I'd like to propose a discussion of how to take advantage of
> > persistent memory in network-attached storage scenarios.
> >
> > RDMA runs on high speed network fabrics and offloads data
> > transfer from host CPUs. Thus it is a good match to the
> > performance characteristics of persistent memory.
> >
> > Today Linux supports iSER, SRP, and NFS/RDMA on RDMA
> > fabrics. What kind of changes are needed in the Linux I/O
> > stack (in particular, storage targets) and in these storage
> > protocols to get the most benefit from ultra-low latency
> > storage?
> >
> > There have been recent proposals about how storage protocols
> > and implementations might need to change (eg. Tom Talpey's
> > SNIA proposals for changing to a push data transfer model,
> > Sagi's proposal to utilize DAX under the NFS/RDMA server,
> > and my proposal for a new pNFS layout to drive RDMA data
> > transfer directly).
> >
> > The outcome of the discussion would be to understand what
> > people are working on now and what is the desired
> > architectural approach in order to determine where storage
> > developers should be focused.
> >
> > This could be either a BoF or a session during the main
> > tracks. There is sure to be a narrow segment of each
> > track's attendees that would have interest in this topic.
>
> I would like to attend this talk, and also talk about
> a target we have been developing / utilizing that we would like
> to propose as a Linux standard driver.

For everyone who hasn't sent an attend request in, this is a good
example of how not to get an invitation. When collecting the requests
to attend, the admins tend to fold to the top of the thread, so if you
send a request to attend as a reply to somebody else, it won't be seen
by that process.

You don't need to resend this one, I noticed it, but just in case next
time ...

James
* Re: [LSF/MM TOPIC/ATTEND] RDMA passive target 2016-01-27 16:54 ` [LSF/MM TOPIC/ATTEND] RDMA passive target Boaz Harrosh [not found] ` <56A8F646.5020003-/8YdC2HfS5554TAoqtyWWQ@public.gmane.org> @ 2016-01-27 17:27 ` Sagi Grimberg [not found] ` <56A8FE10.7000309-LDSdmyG8hGV8YrgS2mwiifqBs+8SCbDb@public.gmane.org> 1 sibling, 1 reply; 27+ messages in thread From: Sagi Grimberg @ 2016-01-27 17:27 UTC (permalink / raw) To: Boaz Harrosh, Chuck Lever, lsf-pc, Dan Williams, Yigal Korman Cc: Linux NFS Mailing List, Linux RDMA Mailing List, linux-fsdevel, Jan Kara, Ric Wheeler Hey Boaz, > RDMA passive target > ~~~~~~~~~~~~~~~~~~~ > > The idea is to have a storage brick that exports a very > low level pure RDMA API to access its memory based storage. > The brick might be battery backed volatile based memory, or > pmem based. In any case the brick might utilize a much higher > capacity then memory by utilizing a "tiering" to slower media, > which is enabled by the API. > > The API is simple: > > 1. Alloc_2M_block_at_virtual_address (ADDR_64_BIT) > ADDR_64_BIT is any virtual address and defines the logical ID of the block. > If the ID is already allocated an error is returned. > If storage is exhausted return => ENOSPC > 2. Free_2M_block_at_virtual_address (ADDR_64_BIT) > Space for logical ID is returned to free store and the ID becomes free for > a new allocation. > 3. map_virtual_address(ADDR_64_BIT, flags) => RDMA handle > previously allocated virtual address is locked in memory and an RDMA handle > is returned. > Flags: read-only, read-write, shared and so on... > 4. unmap__virtual_address(ADDR_64_BIT) > At this point the brick can write data to slower storage if memory space > is needed. The RDMA handle from [3] is revoked. > 5. List_mapped_IDs > An extent based list of all allocated ranges. (This is usually used on > mount or after a crash) My understanding is that you're describing a wire protocol correct? > The dumb brick is not the Network allocator / storage manager at all. and it > is not a smart target / server. like an iser-target or pnfs-DS. A SW defined > application can do that, on top of the Dumb-brick. The motivation is a low level > very low latency API+library, which can be built upon for higher protocols or > used directly for very low latency cluster. > It does however mange a virtual allocation map of logical to physical mapping > of the 2M blocks. The challenge in my mind would be to have persistence semantics in place. > > Currently both drivers initiator and target are in Kernel, but with > latest advancement by Dan Williams it can be implemented in user-mode as well, > Almost. > > The almost is because: > 1. If the target is over a /dev/pmemX then all is fine we have 2M contiguous > memory blocks. > 2. If the target is over an FS, we have a proposal pending for an falloc_2M_flag > to ask the FS for a contiguous 2M allocations only. If any of the 2M allocations > fail then return ENOSPC from falloc. This way we guaranty that each 2M block can be > mapped by a single RDAM handle. Umm, you don't need the 2M to be contiguous in order to represent them as a single RDMA handle. If that was true iSER would have never worked. Or I misunderstood what you meant... ^ permalink raw reply [flat|nested] 27+ messages in thread
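The five operations quoted above are only described in prose in the thread; as a rough illustration they could be sketched as a C interface like the one below. All type names, function names, and error conventions here are assumptions made for the example, not part of Boaz's actual proposal.

/* Hypothetical sketch of the passive-target block API described above.
 * Names and signatures are illustrative assumptions only. */
#include <stdint.h>
#include <stddef.h>

typedef uint64_t brick_block_id_t;     /* ADDR_64_BIT: logical ID of a 2M block */

struct brick_rdma_handle {
    uint64_t remote_addr;              /* address to use in RDMA READ/WRITE */
    uint32_t rkey;                     /* remote key for that mapping */
};

struct brick_extent {
    brick_block_id_t first;            /* first allocated 2M block in the extent */
    uint64_t         count;            /* number of consecutive 2M blocks */
};

/* 1. allocate a 2M block at a caller-chosen logical ID; error if taken, -ENOSPC if full */
int brick_alloc_2m(brick_block_id_t id);

/* 2. return the block to the free store; the ID becomes free for a new allocation */
int brick_free_2m(brick_block_id_t id);

/* 3. pin the block in memory and return an RDMA handle; flags: read-only, read-write, shared... */
int brick_map(brick_block_id_t id, unsigned int flags, struct brick_rdma_handle *out);

/* 4. revoke the handle; the brick may now tier the data out to slower media */
int brick_unmap(brick_block_id_t id);

/* 5. list allocated ranges, e.g. at mount time or after a crash */
int brick_list_mapped(struct brick_extent *extents, size_t max, size_t *returned);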
* Re: [LSF/MM TOPIC/ATTEND] RDMA passive target 2016-01-27 17:27 ` Sagi Grimberg @ 2016-01-31 14:20 ` Boaz Harrosh 0 siblings, 0 replies; 27+ messages in thread From: Boaz Harrosh @ 2016-01-31 14:20 UTC (permalink / raw) To: Sagi Grimberg, Chuck Lever, lsf-pc-cunTk1MwBs9QetFLy7KEm3xJsTq8ys+cHZ5vskTnxNA, Dan Williams, Yigal Korman Cc: Linux NFS Mailing List, Linux RDMA Mailing List, linux-fsdevel, Jan Kara, Ric Wheeler On 01/27/2016 07:27 PM, Sagi Grimberg wrote: > Hey Boaz, > >> RDMA passive target >> ~~~~~~~~~~~~~~~~~~~ >> >> The idea is to have a storage brick that exports a very >> low level pure RDMA API to access its memory based storage. >> The brick might be battery backed volatile based memory, or >> pmem based. In any case the brick might utilize a much higher >> capacity then memory by utilizing a "tiering" to slower media, >> which is enabled by the API. >> >> The API is simple: >> >> 1. Alloc_2M_block_at_virtual_address (ADDR_64_BIT) >> ADDR_64_BIT is any virtual address and defines the logical ID of the block. >> If the ID is already allocated an error is returned. >> If storage is exhausted return => ENOSPC >> 2. Free_2M_block_at_virtual_address (ADDR_64_BIT) >> Space for logical ID is returned to free store and the ID becomes free for >> a new allocation. >> 3. map_virtual_address(ADDR_64_BIT, flags) => RDMA handle >> previously allocated virtual address is locked in memory and an RDMA handle >> is returned. >> Flags: read-only, read-write, shared and so on... >> 4. unmap__virtual_address(ADDR_64_BIT) >> At this point the brick can write data to slower storage if memory space >> is needed. The RDMA handle from [3] is revoked. >> 5. List_mapped_IDs >> An extent based list of all allocated ranges. (This is usually used on >> mount or after a crash) > > My understanding is that you're describing a wire protocol correct? > Almost. Not yet a wire protocol, just a high-level functionality description first. But yes, a wire protocol in the sense that I want an open source library that will be good for kernel and user mode. Any non-Linux platform should be able to port the code base and use it. That said, at some early point we should lock the wire protocol for inter-version compatibility, or at least have a feature negotiation as things evolve. >> The dumb brick is not the Network allocator / storage manager at all. and it >> is not a smart target / server. like an iser-target or pnfs-DS. A SW defined >> application can do that, on top of the Dumb-brick. The motivation is a low level >> very low latency API+library, which can be built upon for higher protocols or >> used directly for very low latency cluster. >> It does however mange a virtual allocation map of logical to physical mapping >> of the 2M blocks. > > The challenge in my mind would be to have persistence semantics in > place. > OK, thanks for bringing this up. There are two separate issues here, which are actually not related to the above API; it is more an initiator issue, since once the server has done the above map_virtual_address() and returned a key to the client machine it is out of the way. On the initiator what we do is: all RDMA sends are async. Once the user does an fsync we do a sync-read(0, 1), to guarantee that both the initiator's and the server's NICs flush all write buffers to the server's PCIe controller.
But here lies the problem: in modern servers the PCIe/memory controller chooses to write incoming PCIe data (actually any PCI data) directly to the L3 cache, on the principle that the receiving application will access that memory very soon. This is what is called DDIO. Here there is big uncertainty, and we are still investigating. The only working ADR machine we have has an old NvDIMM-type-12 legacy BIOS. (All the newer type-6, non-NFIT BIOS systems never worked and had various problems with persistence.) That only working system, though advertised as a DDIO machine, does not exhibit the above problem. On a test of RDMA-SEND x X; RDMA-READ(0,1); POWER-OFF we are always fine and never get a compare error between the machines. [I guess it depends on the specific system and the depth of the ADR flushing on power-off; there are 15 milliseconds of power to work with.] But the Intel documentation says differently: in a DDIO system persistence is not guaranteed. There are a few ways to solve this: 1. Put a remote procedure on the passive machine that will do a CLFLUSH of all written regions. We hate that in our system and do not want to do it; it is CPU intensive and will kill our latencies. So no! 2. Disable DDIO for the NIC we use for storage. [In our setup we can do this because there is a 10G management NIC for regular traffic and a 40/100G Mellanox card dedicated to storage, so for the storage NIC DDIO may be disabled. (Though again it makes no difference for us, because in our lab it works the same with or without it.)] 3. There is a future option that we asked Intel for, which we should talk about here: a per-packet header flag which says DDIO off/on, and a way for the PCIe card to enforce it. The Intel guys were positive about this initiative and said they will support it in the next chipsets. But I do not have any specifics on this option. For us, only option two is viable right now. In any case, to answer your question: at the initiator we assume that after a sync-read of a single byte over the RDMA channel, all previous writes are persistent. [With the DDIO flag set to off once option 3 is available.] But this is only the very little information I was able to gather, plus the little experimentation we did here in the lab. Real working NvDIMM ADR systems are very scarce so far, and all vendors came up short for us with real off-the-shelf systems. I was hoping you might have more information for me. >> >> Currently both drivers initiator and target are in Kernel, but with >> latest advancement by Dan Williams it can be implemented in user-mode as well, >> Almost. >> >> The almost is because: >> 1. If the target is over a /dev/pmemX then all is fine we have 2M contiguous >> memory blocks. >> 2. If the target is over an FS, we have a proposal pending for an falloc_2M_flag >> to ask the FS for a contiguous 2M allocations only. If any of the 2M allocations >> fail then return ENOSPC from falloc. This way we guaranty that each 2M block can be >> mapped by a single RDAM handle. > > Umm, you don't need the 2M to be contiguous in order to represent them > as a single RDMA handle. If that was true iSER would have never worked. > Or I misunderstood what you meant... > OK, I will let our RDMA guy Yigal Korman answer that; I guess you might be right. But regardless of this little detail we would like to keep everything 2M. Yes, virtually, on the wire protocol. But even in the server's internal configuration we would like to see a single 2M TLB mapping of all the target's pmem.
Also on PCIe it is nice to have a scatter-list with a single 2M entry instead of 4k entries. And I think it is nice for DAX systems to be able to fallocate and guarantee 2M contiguous allocations for heavily accessed / mmapped files. Thank you for your interest. Boaz -- To unsubscribe from this list: send the line "unsubscribe linux-nfs" in the body of a message to majordomo-u79uwXL29TY76Z2rM5mHXA@public.gmane.org More majordomo info at http://vger.kernel.org/majordomo-info.html ^ permalink raw reply [flat|nested] 27+ messages in thread
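To make the initiator-side flush sequence described above concrete (asynchronous RDMA writes, then a one-byte RDMA read on fsync to push them through to the target's PCIe/memory controller), here is a rough libibverbs sketch. It assumes an already-connected queue pair, a registered local bounce buffer, and the target's {remote_addr, rkey} mapping; the helper name and the surrounding setup are illustrative assumptions, not code from the thread.

/* Sketch: flush-on-fsync over RDMA, as described above.
 * Assumes a connected struct ibv_qp *qp, its send CQ, a local buffer
 * registered as struct ibv_mr *mr, and the server's {remote_addr, rkey}. */
#include <infiniband/verbs.h>
#include <stdint.h>
#include <string.h>

static int rdma_read_one_byte(struct ibv_qp *qp, struct ibv_cq *cq,
                              struct ibv_mr *mr,
                              uint64_t remote_addr, uint32_t rkey)
{
    struct ibv_sge sge = {
        .addr   = (uintptr_t)mr->addr,
        .length = 1,                       /* a single byte is enough to force ordering */
        .lkey   = mr->lkey,
    };
    struct ibv_send_wr wr, *bad_wr = NULL;
    struct ibv_wc wc;
    int n;

    memset(&wr, 0, sizeof(wr));
    wr.opcode              = IBV_WR_RDMA_READ;
    wr.sg_list             = &sge;
    wr.num_sge             = 1;
    wr.send_flags          = IBV_SEND_SIGNALED;
    wr.wr.rdma.remote_addr = remote_addr;
    wr.wr.rdma.rkey        = rkey;

    if (ibv_post_send(qp, &wr, &bad_wr))
        return -1;

    /* Wait for the READ completion: the read response cannot pass the earlier
     * posted writes, so when it completes those writes have reached the
     * target's PCIe/memory controller. */
    do {
        n = ibv_poll_cq(cq, 1, &wc);
    } while (n == 0);

    return (n < 0 || wc.status != IBV_WC_SUCCESS) ? -1 : 0;
}

Note that completion of the read only says the writes have reached the target's PCIe/memory controller; whether they then survive power loss is exactly the ADR/DDIO question discussed above.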
* Re: [LSF/MM TOPIC/ATTEND] RDMA passive target 2016-01-31 14:20 ` Boaz Harrosh (?) @ 2016-01-31 16:55 ` Yigal Korman [not found] ` <CACTTzNaOChdWN2eS9_kzv6HO_LVib-JVdkmeUn0LDe2eKxPEgA-JsoAwUIsXosN+BqQ9rBEUg@public.gmane.org> -1 siblings, 1 reply; 27+ messages in thread From: Yigal Korman @ 2016-01-31 16:55 UTC (permalink / raw) To: Boaz Harrosh Cc: Sagi Grimberg, Chuck Lever, lsf-pc, Dan Williams, Linux NFS Mailing List, Linux RDMA Mailing List, linux-fsdevel, Jan Kara, Ric Wheeler On Sun, Jan 31, 2016 at 4:20 PM, Boaz Harrosh <boaz@plexistor.com> wrote: > > On 01/27/2016 07:27 PM, Sagi Grimberg wrote: > > Hey Boaz, > > > >> RDMA passive target > >> ~~~~~~~~~~~~~~~~~~~ > >> > >> The idea is to have a storage brick that exports a very > >> low level pure RDMA API to access its memory based storage. > >> The brick might be battery backed volatile based memory, or > >> pmem based. In any case the brick might utilize a much higher > >> capacity then memory by utilizing a "tiering" to slower media, > >> which is enabled by the API. > >> > >> The API is simple: > >> > >> 1. Alloc_2M_block_at_virtual_address (ADDR_64_BIT) > >> ADDR_64_BIT is any virtual address and defines the logical ID of the block. > >> If the ID is already allocated an error is returned. > >> If storage is exhausted return => ENOSPC > >> 2. Free_2M_block_at_virtual_address (ADDR_64_BIT) > >> Space for logical ID is returned to free store and the ID becomes free for > >> a new allocation. > >> 3. map_virtual_address(ADDR_64_BIT, flags) => RDMA handle > >> previously allocated virtual address is locked in memory and an RDMA handle > >> is returned. > >> Flags: read-only, read-write, shared and so on... > >> 4. unmap__virtual_address(ADDR_64_BIT) > >> At this point the brick can write data to slower storage if memory space > >> is needed. The RDMA handle from [3] is revoked. > >> 5. List_mapped_IDs > >> An extent based list of all allocated ranges. (This is usually used on > >> mount or after a crash) > > > > My understanding is that you're describing a wire protocol correct? > > > > Almost. Not yet a wire protocol, Just an high level functionality description. > first. But yes a wire protocol in the sense that I want an open source library > that will be good for Kernel and Usermode. Any none Linux platform should be > able to port the code base and use it. > That said at some early point we should lock the wire protocol for inter version > compatibility or at least have a fixture negotiation when things get evolved. > > >> The dumb brick is not the Network allocator / storage manager at all. and it > >> is not a smart target / server. like an iser-target or pnfs-DS. A SW defined > >> application can do that, on top of the Dumb-brick. The motivation is a low level > >> very low latency API+library, which can be built upon for higher protocols or > >> used directly for very low latency cluster. > >> It does however mange a virtual allocation map of logical to physical mapping > >> of the 2M blocks. > > > > The challenge in my mind would be to have persistence semantics in > > place. > > > > Ok Thanks for bringing this up. > > So there is two separate issues here. Which are actually not related to the > above API. It is more an Initiator issue. Since once the server did the above > map_virtual_address() and return a key to client machine it is out of the way. > > On the initiator what we do is: All RDMA async sends. 
Once the user did an fsync > we do a sync-read(0, 1); so to guaranty both initiator and Server's nicks flush all > write buffers, to Server's PCIE controller. > > But here lays the problem: In modern servers the PCIE/memory_controller chooses > to write fast PCIE data (or actually any PCI data) directly to L3 cache on the > principal that receiving application will access that memory very soon. > This is what is called DDIO. > Now here there is big uncertainty. and we are still investigating. The only working > ADR machine we have with an old NvDIMM-type-12 legacy BIOS. (All the newer type-6 > none NFIT BIOS systems never worked and had various problems with persistence) > So that only working system, though advertised as DDIO machine does not exhibit > the above problem. > On a test of RDMA-SEND x X; RDMA-READ(0,1); POWER-OFF; > We always are fine and never get a compare error between the machines. > [I guess it depends on the specific system and the depth of the ADR flushing > on power-off, there are 15 milliseconds of power to work with] > > But the Intel documentation says different. And it says that in a DDIO system > persistence is not Guaranteed. > > There are two ways to solve this: > 1. Put a remote procedure on the passive machine that will do a CLFLUSH of > all written regions. We hate that in our system and will not want to do > so, this is CPU intensive and will kill our latencies. > So NO! > 2. Disable the DDIO for the NIC we use for storage. > [In our setup we can do this because there is a 10G management NIC for > regular trafic, and a 40/100G Melanox card dedicated to storage, so for > the storage NIC DDIO may be disabled. (Though again it makes not difference > for us because in our lab with or without it works the same) > ] > 3. There is a future option that we asked Intel to do, which we should talk about > here. Set a per packet HEADER flag which says DDIO-off/on, and a way for the > PCIE card to enforce it. Intel guys where positive for this initiative and said > They will support it in the next chipsets. > But I do not have any specifics on this option. > > For us. Only option two is viable right now. > > In any way to answer your question at the Initiator we assume that after a sync-read > of a single byte from an RDMA channel, all previous writes are persistent. > [With the DDIO flag set to off when 3. is available] > > But this is only the very little Information I was able to gather and the > little experimentation we did here in the lab. A real working NvDIMM ADR > system is very scarce so far and all Vendors came out short for us with > real off-the-shelf systems. > I was hoping you might have more information for me. > > >> > >> Currently both drivers initiator and target are in Kernel, but with > >> latest advancement by Dan Williams it can be implemented in user-mode as well, > >> Almost. > >> > >> The almost is because: > >> 1. If the target is over a /dev/pmemX then all is fine we have 2M contiguous > >> memory blocks. > >> 2. If the target is over an FS, we have a proposal pending for an falloc_2M_flag > >> to ask the FS for a contiguous 2M allocations only. If any of the 2M allocations > >> fail then return ENOSPC from falloc. This way we guaranty that each 2M block can be > >> mapped by a single RDAM handle. > > > > Umm, you don't need the 2M to be contiguous in order to represent them > > as a single RDMA handle. If that was true iSER would have never worked. > > Or I misunderstood what you meant... 
> > > > OK I will let our RDMA guy Yigal Korman answer that, I guess you might be right. When Boaz says 'RDMA handle', he means the pair [rkey,remote_addr]. AFAIK the remote_addr describes a continuous memory space on the target. So if you want to write to this 'handle' - it must be continuous. Please correct me if I'm wrong. > > But regardless of this little detail we would like to keep everything 2M. Yes > virtually on the wire protocol. But even on the Server Internal configuration > we would like to see a single TLB 2M mapping of all Target's pmem. Also on the > PCIE it is nice a scatter-list with 2M single entry, instead of the 4k. > And I think it is nice for DAX systems to fallocate and guaranty 2M contiguous > allocations of heavy accessed / mmap files. > > Thank you for your interest. > Boaz > Regards, Yigal ^ permalink raw reply [flat|nested] 27+ messages in thread
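For readers less familiar with the verbs API, the 'handle' Yigal refers to is simply the {remote_addr, rkey} pair carried in each RDMA work request, with offsets applied to remote_addr as if the remote block were one flat range. A minimal sketch follows; the helper and its arguments are assumptions for illustration only.

/* Sketch: an RDMA WRITE addressed via the {remote_addr, rkey} 'handle'.
 * qp, mr (local registered buffer), remote_addr and rkey are assumed to exist. */
#include <infiniband/verbs.h>
#include <stdint.h>
#include <string.h>

static int write_at_offset(struct ibv_qp *qp, struct ibv_mr *mr,
                           uint64_t remote_addr, uint32_t rkey,
                           uint64_t offset, uint32_t len)
{
    struct ibv_sge sge = {
        .addr   = (uintptr_t)mr->addr,
        .length = len,
        .lkey   = mr->lkey,
    };
    struct ibv_send_wr wr, *bad_wr = NULL;

    memset(&wr, 0, sizeof(wr));
    wr.opcode              = IBV_WR_RDMA_WRITE;
    wr.sg_list             = &sge;
    wr.num_sge             = 1;
    /* the 2M block is addressed as one flat remote range: handle + offset */
    wr.wr.rdma.remote_addr = remote_addr + offset;
    wr.wr.rdma.rkey        = rkey;

    return ibv_post_send(qp, &wr, &bad_wr);
}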
* Re: [LSF/MM TOPIC/ATTEND] RDMA passive target 2016-01-31 16:55 ` Yigal Korman @ 2016-02-01 10:36 ` Sagi Grimberg 0 siblings, 0 replies; 27+ messages in thread From: Sagi Grimberg @ 2016-02-01 10:36 UTC (permalink / raw) To: Yigal Korman, Boaz Harrosh Cc: Chuck Lever, lsf-pc-cunTk1MwBs9QetFLy7KEm3xJsTq8ys+cHZ5vskTnxNA, Dan Williams, Linux NFS Mailing List, Linux RDMA Mailing List, linux-fsdevel, Jan Kara, Ric Wheeler >>>> The almost is because: >>>> 1. If the target is over a /dev/pmemX then all is fine we have 2M contiguous >>>> memory blocks. >>>> 2. If the target is over an FS, we have a proposal pending for an falloc_2M_flag >>>> to ask the FS for a contiguous 2M allocations only. If any of the 2M allocations >>>> fail then return ENOSPC from falloc. This way we guaranty that each 2M block can be >>>> mapped by a single RDAM handle. >>> >>> Umm, you don't need the 2M to be contiguous in order to represent them >>> as a single RDMA handle. If that was true iSER would have never worked. >>> Or I misunderstood what you meant... >>> >> >> OK I will let our RDMA guy Yigal Korman answer that, I guess you might be right. > > When Boaz says 'RDMA handle', he means the pair [rkey,remote_addr]. > AFAIK the remote_addr describes a continuous memory space on the target. > So if you want to write to this 'handle' - it must be continuous. > Please correct me if I'm wrong. OK, this is definitely wrong. But let's defer this discussion to another thread as it's not relevant to lsf folks... -- To unsubscribe from this list: send the line "unsubscribe linux-rdma" in the body of a message to majordomo-u79uwXL29TY76Z2rM5mHXA@public.gmane.org More majordomo info at http://vger.kernel.org/majordomo-info.html ^ permalink raw reply [flat|nested] 27+ messages in thread
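For context on Sagi's point: a single rkey only requires the registered region to be contiguous in the address space handed to the registration call, not in physical memory; the HCA translates the underlying pages, which is how iSER covers arbitrary page lists with one key. A hedged user-space sketch, assuming the target exposes a 2M block as an mmap of a pmem/DAX file (the path and sizes are assumptions, and whether such mappings can actually be pinned for RDMA depends on the kernel work by Dan Williams mentioned earlier in the thread):

/* Sketch: one rkey covering a 2M region whose physical pages need not be
 * contiguous. Illustrative only. */
#include <infiniband/verbs.h>
#include <sys/mman.h>
#include <fcntl.h>
#include <unistd.h>

#define BLOCK_SZ (2UL * 1024 * 1024)

struct ibv_mr *map_and_register(struct ibv_pd *pd, const char *path, off_t off)
{
    int fd = open(path, O_RDWR);
    if (fd < 0)
        return NULL;

    /* virtually contiguous mapping; the backing pages may be scattered */
    void *buf = mmap(NULL, BLOCK_SZ, PROT_READ | PROT_WRITE, MAP_SHARED, fd, off);
    close(fd);
    if (buf == MAP_FAILED)
        return NULL;

    /* one registration yields one {addr, rkey} pair the initiator can use,
     * regardless of how the pages are laid out physically */
    return ibv_reg_mr(pd, buf, BLOCK_SZ,
                      IBV_ACCESS_LOCAL_WRITE |
                      IBV_ACCESS_REMOTE_READ |
                      IBV_ACCESS_REMOTE_WRITE);
}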