* [LSF/MM TOPIC] Un-addressable device memory and block/fs implications

From: Jerome Glisse @ 2016-12-13 18:15 UTC
To: lsf-pc, linux-mm, linux-block, linux-fsdevel
(4 replies; 31+ messages in thread)

I would like to discuss un-addressable device memory in the context of filesystems and block devices. Specifically, how to handle write-back, read, ... when a filesystem page is migrated to device memory that the CPU can not access.

I intend to post a patchset leveraging the same idea as the existing block bounce helper (block/bounce.c) to handle this. I believe this is worth discussing during the summit, to see how people feel about such a plan and whether they have better ideas.

I would also like to join discussions on:
  - Peer-to-Peer DMA between PCIe devices
  - CDM coherent device memory
  - PMEM
  - overall mm discussions

Cheers,
Jérôme

--
To unsubscribe, send a message with 'unsubscribe linux-mm' in the body to majordomo@kvack.org. For more info on Linux MM, see: http://www.linux-mm.org/ . Don't email: email@kvack.org
* Re: [LSF/MM TOPIC] Un-addressable device memory and block/fs implications

From: James Bottomley @ 2016-12-13 18:20 UTC
To: Jerome Glisse, lsf-pc, linux-mm, linux-block, linux-fsdevel

On Tue, 2016-12-13 at 13:15 -0500, Jerome Glisse wrote:
> I would like to discuss un-addressable device memory in the context
> of filesystem and block device. Specificaly how to handle write-back,
> read, ... when a filesystem page is migrated to device memory that
> CPU can not access.
>
> I intend to post a patchset leveraging the same idea as the existing
> block bounce helper (block/bounce.c) to handle this. I believe this
> is worth discussing during summit see how people feels about such
> plan and if they have better ideas.

Isn't this pretty much what the transcendent memory interfaces we currently have are for? Its current use cases seem to be compressed swap and distributed memory, but there doesn't seem to be any reason in principle why you can't use the interface for this as well.

James

> I also like to join discussions on:
> - Peer-to-Peer DMAs between PCIe devices
> - CDM coherent device memory
> - PMEM
> - overall mm discussions
>
> Cheers,
> Jérôme
* Re: [LSF/MM TOPIC] Un-addressable device memory and block/fs implications

From: Jerome Glisse @ 2016-12-13 18:55 UTC
To: James Bottomley; Cc: lsf-pc, linux-mm, linux-block, linux-fsdevel

On Tue, Dec 13, 2016 at 10:20:52AM -0800, James Bottomley wrote:
> On Tue, 2016-12-13 at 13:15 -0500, Jerome Glisse wrote:
> > I would like to discuss un-addressable device memory in the context
> > of filesystem and block device. Specificaly how to handle write-back,
> > read, ... when a filesystem page is migrated to device memory that
> > CPU can not access.
> >
> > I intend to post a patchset leveraging the same idea as the existing
> > block bounce helper (block/bounce.c) to handle this. I believe this
> > is worth discussing during summit see how people feels about such
> > plan and if they have better ideas.
>
> Isn't this pretty much what the transcendent memory interfaces we
> currently have are for? It's current use cases seem to be compressed
> swap and distributed memory, but there doesn't seem to be any reason in
> principle why you can't use the interface as well.

I am not a specialist of tmem or cleancache, but my understanding is that there is no way to allow a file-backed page to be dirtied while it is in this special memory. In my case, when you migrate a page to the device it might very well be so that the device can write something to it (the results of some sort of computation). So a page might migrate to device memory clean but return from it dirty.

A second aspect is that even though the memory I am dealing with is un-addressable, I still have a struct page for it and I want to be able to use regular page migration.

So given my requirements, I didn't think that cleancache was the way to address them. Maybe I am wrong.

Cheers,
Jérôme
* Re: [LSF/MM TOPIC] Un-addressable device memory and block/fs implications

From: James Bottomley @ 2016-12-13 20:01 UTC
To: Jerome Glisse; Cc: lsf-pc, linux-mm, linux-block, linux-fsdevel

On Tue, 2016-12-13 at 13:55 -0500, Jerome Glisse wrote:
> I am not a specialist of tmem or cleancache

Well, that makes two of us; I just got to sit through Dan Magenheimer's talks and some stuff stuck.

> but my understand is that there is no way to allow for file back
> page to be dirtied while being in this special memory.

Unless you have some other definition of dirtied, I believe that's what an exclusive tmem get in frontswap actually does. It marks the page dirty when it comes back because it may have been modified.

> In my case when you migrate a page to the device it might very well
> be so that the device can write something in it (results of some sort
> of computation). So page might migrate to device memory as clean but
> return from it in dirty state.
>
> Second aspect is that even if memory i am dealing with is
> un-addressable i still have struct page for it and i want to be able to
> use regular page migration.

Tmem keeps a struct page ... what's the problem with page migration? The fact that tmem locks the page when it's not addressable and you want to be able to migrate the page even when it's not addressable?

> So given my requirement i didn't thought that cleancache was the way
> to address them. Maybe i am wrong.

I'm not saying it is, I just asked if you'd considered it, since the requirements look similar.

James
* Re: [LSF/MM TOPIC] Un-addressable device memory and block/fs implications

From: Jerome Glisse @ 2016-12-13 20:22 UTC
To: James Bottomley; Cc: lsf-pc, linux-mm, linux-block, linux-fsdevel

On Tue, Dec 13, 2016 at 12:01:04PM -0800, James Bottomley wrote:
> On Tue, 2016-12-13 at 13:55 -0500, Jerome Glisse wrote:
> > I am not a specialist of tmem or cleancache
>
> Well, that makes two of us; I just got to sit through Dan Magenheimer's
> talks and some stuff stuck.
>
> > but my understand is that there is no way to allow for file back
> > page to be dirtied while being in this special memory.
>
> Unless you have some other definition of dirtied, I believe that's what
> an exclusive tmem get in frontswap actually does. It marks the page
> dirty when it comes back because it may have been modified.

Well, frontswap only supports anonymous or shared pages, not arbitrary filemap pages. So it doesn't help with what I am aiming at :)

Note that in my case the device reports accurate dirty information (whether the device modified the page or not), assuming hardware bugs don't exist.

> > Second aspect is that even if memory i am dealing with is
> > un-addressable i still have struct page for it and i want to be able to
> > use regular page migration.
>
> Tmem keeps a struct page ... what's the problem with page migration?
> the fact that tmem locks the page when it's not addressable and you
> want to be able to migrate the page even when it's not addressable?

Well, the way cleancache or frontswap work is that they are used when the kernel is trying to make room or evict something. In my case it is the device that triggers the migration, for a range of virtual addresses of a process. Sure, I could make a weird helper that forces the pages I want to migrate into frontswap or cleancache, but that seems counter-intuitive to me.

One extra requirement for me is to be able to easily and quickly find a migrated page by looking at the CPU page table of the process. Frontswap adds a level of indirection, where I need to go through frontswap to find the memory. With cleancache there isn't even any information left (the page table entry is cleared).

> > So given my requirement i didn't thought that cleancache was the way
> > to address them. Maybe i am wrong.
>
> I'm not saying it is, I just asked if you'd considered it, since the
> requirements look similar.

Yes, I briefly considered it, but from the high-level overview I had, it did not seem to address all my requirements. Maybe that is because I lack in-depth knowledge of cleancache/frontswap, but skimming through the code didn't convince me that I needed to dig deeper.

The solution I am pursuing uses struct page, and thus to the kernel everything is as if it were a regular page. The only things that don't work are kmap and mapping it into a process, but this can easily be handled. For filesystems, the issues are about anything that does I/O, so read/write/writeback. In many cases, if CPU I/O happens, what I want to do is migrate back to a regular page, so the read/write case is easy. But for writeback, if a page is dirty on the device and the device reports it (calling set_page_dirty()), then I still want writeback to work so I don't lose data (if the device dirtied the page it is probably because it was instructed to save current computations).

With this in mind, the bounce helper, designed to work around block device limitations with respect to the pages they can access, seemed to be a perfect fit. All I care about is providing a bounce page so that writeback can happen without having to go through the "slow" page migration back to a system page.

Jérôme
* Re: [LSF/MM TOPIC] Un-addressable device memory and block/fs implications

From: Dave Hansen @ 2016-12-13 20:27 UTC
To: James Bottomley, Jerome Glisse; Cc: lsf-pc, linux-mm, linux-block, linux-fsdevel

On 12/13/2016 12:01 PM, James Bottomley wrote:
> > Second aspect is that even if memory i am dealing with is
> > un-addressable i still have struct page for it and i want to be able to
> > use regular page migration.
>
> Tmem keeps a struct page ... what's the problem with page migration?
> the fact that tmem locks the page when it's not addressable and you
> want to be able to migrate the page even when it's not addressable?

Hi James,

Why do you say that tmem keeps a 'struct page'? For instance, its ->put_page operation _takes_ a 'struct page', but that's in the delete_from_page_cache() path where the page's last reference has been dropped and it is about to go away. The role of 'struct page' here is just to help create a key so that tmem can find the contents later *without* the original 'struct page'.

Jerome's pages here are a new class of half-crippled 'struct page' which supports more VM features than ZONE_DEVICE pages, but not quite a full feature set. It supports (and needs to support) a heck of a lot more VM features than memory in tmem would, though.
* Re: [LSF/MM TOPIC] Un-addressable device memory and block/fs implications

From: Dave Chinner @ 2016-12-13 20:15 UTC
To: Jerome Glisse; Cc: lsf-pc, linux-mm, linux-block, linux-fsdevel

On Tue, Dec 13, 2016 at 01:15:11PM -0500, Jerome Glisse wrote:
> I would like to discuss un-addressable device memory in the context of
> filesystem and block device. Specificaly how to handle write-back, read,
> ... when a filesystem page is migrated to device memory that CPU can not
> access.

You mean pmem that is DAX-capable that suddenly, without warning, becomes non-DAX capable?

If you are not talking about pmem and DAX, then exactly what does "when a filesystem page is migrated to device memory that CPU can not access" mean? What "filesystem page" are we talking about that can get migrated from main RAM to something the CPU can't access?

Cheers,

Dave.
--
Dave Chinner
david@fromorbit.com
* Re: [LSF/MM TOPIC] Un-addressable device memory and block/fs implications

From: Jerome Glisse @ 2016-12-13 20:31 UTC
To: Dave Chinner; Cc: lsf-pc, linux-mm, linux-block, linux-fsdevel

On Wed, Dec 14, 2016 at 07:15:15AM +1100, Dave Chinner wrote:
> You mean pmem that is DAX-capable that suddenly, without warning,
> becomes non-DAX capable?
>
> If you are not talking about pmem and DAX, then exactly what does
> "when a filesystem page is migrated to device memory that CPU can
> not access" mean? What "filesystem page" are we talking about that
> can get migrated from main RAM to something the CPU can't access?

I am talking about GPUs, FPGAs, ... any PCIe device that has fast on-board memory that can not be exposed transparently to the CPU. I am reusing ZONE_DEVICE for this; you can see the HMM patchset on linux-mm: https://lwn.net/Articles/706856/

So in my case I am only considering non-DAX/PMEM filesystems, i.e. any "regular" filesystem backed by a "regular" block device. I want to be able to migrate mmapped areas of such a filesystem to device memory while the device is actively using that memory.

From the kernel point of view such memory is almost like any other: it has a struct page, and most of the mm code is none the wiser, nor needs to be. CPU access triggers a migration back to a regular CPU-accessible page.

But for things like writeback I want to be able to do the writeback without having to migrate the page back first, so that the data can stay on the device while writeback is happening.

Jérôme
* Re: [LSF/MM TOPIC] Un-addressable device memory and block/fs implications

From: Dave Chinner @ 2016-12-13 21:10 UTC
To: Jerome Glisse; Cc: lsf-pc, linux-mm, linux-block, linux-fsdevel

On Tue, Dec 13, 2016 at 03:31:13PM -0500, Jerome Glisse wrote:
> I am talking about GPU, FPGA, ... any PCIE device that have fast on
> board memory that can not be expose transparently to the CPU. I am
> reusing ZONE_DEVICE for this, you can see HMM patchset on linux-mm
> https://lwn.net/Articles/706856/

So ZONE_DEVICE memory that is a DMA target but not CPU addressable?

> So in my case i am only considering non DAX/PMEM filesystem ie any
> "regular" filesystem back by a "regular" block device. I want to be
> able to migrate mmaped area of such filesystem to device memory while
> the device is actively using that memory.

"migrate mmapped area of such filesystem" means what, exactly?

Are you talking about file data contents that have been copied into the page cache and mmapped into a user process address space? IOWs, migrating ZONE_NORMAL page cache page content and state to a new ZONE_DEVICE page, and then migrating back again somehow?

> From kernel point of view such memory is almost like any other, it
> has a struct page and most of the mm code is non the wiser, nor need
> to be about it. CPU access trigger a migration back to regular CPU
> accessible page.

That sounds ... complex. Page migration on page cache access inside the filesystem IO path locking during read()/write() sounds like a great way to cause deadlocks....

> But for thing like writeback i want to be able to do writeback without
> having to migrate page back first. So that data can stay on the
> device while writeback is happening.

Why can't you do writeback before migration, so only clean pages get moved?

Cheers,

Dave.
--
Dave Chinner
david@fromorbit.com
* Re: [LSF/MM TOPIC] Un-addressable device memory and block/fs implications

From: Jerome Glisse @ 2016-12-13 21:24 UTC
To: Dave Chinner; Cc: lsf-pc, linux-mm, linux-block, linux-fsdevel

On Wed, Dec 14, 2016 at 08:10:41AM +1100, Dave Chinner wrote:
> So ZONE_DEVICE memory that is a DMA target but not CPU addressable?

Well, not only a target, it can be a source too. The device can read and write any system memory, and DMA to/from that memory to its on-board memory.

> "migrate mmapped area of such filesystem" means what, exactly?

fd = open("/path/to/some/file")
ptr = mmap(fd, ...);
gpu_compute_something(ptr);

> Are you talking about file data contents that have been copied into
> the page cache and mmapped into a user process address space?
> IOWs, migrating ZONE_NORMAL page cache page content and state
> to a new ZONE_DEVICE page, and then migrating back again somehow?

Take any existing application that mmaps a file, and allow chunks of that mmapped file to migrate to device memory without the application even knowing about it. So there is nothing special about that mmapped file. It is a regular file on your filesystem.

> That sounds ... complex. Page migration on page cache access inside
> the filesytem IO path locking during read()/write() sounds like
> a great way to cause deadlocks....

There are a few restrictions on device pages: no one can do GUP on them, and thus no one can pin them. Hence they can always be migrated back. Yes, each fs needs modification, but most of it (if not all) is isolated in common filemap helpers.

> Why can't you do writeback before migration, so only clean pages get
> moved?

Because the device can write to the page while the page is in device memory, and we might want to write back to disk while the page stays in device memory and computation continues.

Cheers,
Jérôme
* Re: [LSF/MM TOPIC] Un-addressable device memory and block/fs implications

From: Dave Hansen @ 2016-12-13 22:08 UTC
To: Jerome Glisse, Dave Chinner; Cc: lsf-pc, linux-mm, linux-block, linux-fsdevel, Williams, Dan J

On 12/13/2016 01:24 PM, Jerome Glisse wrote:
> > > From kernel point of view such memory is almost like any other, it
> > > has a struct page and most of the mm code is non the wiser, nor need
> > > to be about it. CPU access trigger a migration back to regular CPU
> > > accessible page.
> >
> > That sounds ... complex. Page migration on page cache access inside
> > the filesytem IO path locking during read()/write() sounds like
> > a great way to cause deadlocks....
>
> There are few restriction on device page, no one can do GUP on them and
> thus no one can pin them. Hence they can always be migrated back. Yes
> each fs need modification, most of it (if not all) is isolated in common
> filemap helpers.

Huh, that's pretty different from the other ZONE_DEVICE uses. For those, you *can* do get_user_pages().

I'd be really interested to see the feature set that these pages have and how it differs from regular memory and the ZONE_DEVICE memory that we have in-kernel today.

BTW, how is this restriction implemented? I would have expected to see follow_page_pte() or vm_normal_page() getting modified. I don't see a single reference to get_user_pages or "GUP" in any of the latest HMM patch set or the changelogs.

As best I can tell, the slow GUP path will get stuck in a loop inside follow_page_pte(), while the fast GUP path will allow you to acquire a reference to the page. But, maybe I'm reading the code wrong.
* Re: [LSF/MM TOPIC] Un-addressable device memory and block/fs implications

From: Jerome Glisse @ 2016-12-13 23:02 UTC
To: Dave Hansen; Cc: Dave Chinner, lsf-pc, linux-mm, linux-block, linux-fsdevel, Williams, Dan J

On Tue, Dec 13, 2016 at 02:08:22PM -0800, Dave Hansen wrote:
> Huh, that's pretty different from the other ZONE_DEVICE uses. For
> those, you *can* do get_user_pages().
>
> I'd be really interested to see the feature set that these pages have
> and how it differs from regular memory and the ZONE_DEVICE memory that
> have have in-kernel today.

Well, I can do a list for the current patchset, in which I do not allow migration of file-backed pages. Roughly, you can not kmap them or do GUP on them. But GUP has many more implications, like direct I/O (as the source or destination of direct I/O) ...

> BTW, how is this restriction implemented? I would have expected to see
> follow_page_pte() or vm_normal_page() getting modified. I don't see a
> single reference to get_user_pages or "GUP" in any of the latest HMM
> patch set or the changelogs.
>
> As best I can tell, the slow GUP path will get stuck in a loop inside
> follow_page_pte(), while the fast GUP path will allow you to acquire a
> reference to the page. But, maybe I'm reading the code wrong.

It is a side effect of having a special swap pte: follow_page_pte() returns NULL, which triggers a page fault through handle_mm_fault(), which triggers migration back to a regular page. The same goes for the fast GUP version. There is never a valid pte for an un-addressable page.

Cheers,
Jérôme
* Re: [LSF/MM TOPIC] Un-addressable device memory and block/fs implications 2016-12-13 21:24 ` Jerome Glisse 2016-12-13 22:08 ` Dave Hansen @ 2016-12-13 22:13 ` Dave Chinner 2016-12-13 22:55 ` Jerome Glisse 2016-12-14 11:13 ` [Lsf-pc] " Jan Kara 2 siblings, 1 reply; 31+ messages in thread From: Dave Chinner @ 2016-12-13 22:13 UTC (permalink / raw) To: Jerome Glisse; +Cc: lsf-pc, linux-mm, linux-block, linux-fsdevel On Tue, Dec 13, 2016 at 04:24:33PM -0500, Jerome Glisse wrote: > On Wed, Dec 14, 2016 at 08:10:41AM +1100, Dave Chinner wrote: > > On Tue, Dec 13, 2016 at 03:31:13PM -0500, Jerome Glisse wrote: > > > On Wed, Dec 14, 2016 at 07:15:15AM +1100, Dave Chinner wrote: > > > > On Tue, Dec 13, 2016 at 01:15:11PM -0500, Jerome Glisse wrote: > > > > > I would like to discuss un-addressable device memory in the context of > > > > > filesystem and block device. Specificaly how to handle write-back, read, > > > > > ... when a filesystem page is migrated to device memory that CPU can not > > > > > access. > > > > > > > > You mean pmem that is DAX-capable that suddenly, without warning, > > > > becomes non-DAX capable? > > > > > > > > If you are not talking about pmem and DAX, then exactly what does > > > > "when a filesystem page is migrated to device memory that CPU can > > > > not access" mean? What "filesystem page" are we talking about that > > > > can get migrated from main RAM to something the CPU can't access? > > > > > > I am talking about GPU, FPGA, ... any PCIE device that have fast on > > > board memory that can not be expose transparently to the CPU. I am > > > reusing ZONE_DEVICE for this, you can see HMM patchset on linux-mm > > > https://lwn.net/Articles/706856/ > > > > So ZONE_DEVICE memory that is a DMA target but not CPU addressable? > > Well not only target, it can be source too. But the device can read > and write any system memory and dma to/from that memory to its on > board memory. 
So you want the device to be able to dirty mmapped pages that the
CPU can't access?

> > > So in my case i am only considering non DAX/PMEM filesystem ie any
> > > "regular" filesystem back by a "regular" block device. I want to be
> > > able to migrate mmaped area of such filesystem to device memory while
> > > the device is actively using that memory.
> >
> > "migrate mmapped area of such filesystem" means what, exactly?
>
> fd = open("/path/to/some/file")
> ptr = mmap(fd, ...);
> gpu_compute_something(ptr);

Thought so. Lots of problems with this.

> > Are you talking about file data contents that have been copied into
> > the page cache and mmapped into a user process address space?
> > IOWs, migrating ZONE_NORMAL page cache page content and state
> > to a new ZONE_DEVICE page, and then migrating back again somehow?
>
> Take any existing application that mmap a file and allow to migrate
> chunk of that mmaped file to device memory without the application
> even knowing about it. So nothing special in respect to that mmaped
> file.

From the application point of view. For the filesystem, page cache,
etc. there are substantial problems here...

> It is a regular file on your filesystem.

... because of this.

> > > From kernel point of view such memory is almost like any other, it
> > > has a struct page and most of the mm code is non the wiser, nor need
> > > to be about it. CPU access trigger a migration back to regular CPU
> > > accessible page.
> >
> > That sounds ... complex. Page migration on page cache access inside
> > the filesytem IO path locking during read()/write() sounds like
> > a great way to cause deadlocks....
>
> There are few restriction on device page, no one can do GUP on them and
> thus no one can pin them. Hence they can always be migrated back. Yes
> each fs need modification, most of it (if not all) is isolated in common
> filemap helpers.
Sure, but you haven't answered my question: how do you propose we
address the issue of placing all the mm locks required for migration
under the filesystem IO path locks?

> > > But for thing like writeback i want to be able to do writeback with-
> > > out having to migrate page back first. So that data can stay on the
> > > device while writeback is happening.
> >
> > Why can't you do writeback before migration, so only clean pages get
> > moved?
>
> Because device can write to the page while the page is inside the device
> memory and we might want to writeback to disk while page stays in device
> memory and computation continues.

Ok. So how does the device trigger ->page_mkwrite on a clean page to
tell the filesystem that the page has been dirtied? So that, for
example, if the page covers a hole because the file is sparse the
filesystem can do the required block allocation and data
initialisation (i.e. zero the cached page) before it gets marked
dirty and any data gets written to it?

And if zeroing the page during such a fault requires CPU access to
the data, how do you propose we handle page migration in the middle
of the page fault to allow the CPU to zero the page? Seems like more
lock order/inversion problems there, too...

Cheers,

Dave.
-- 
Dave Chinner
david@fromorbit.com
* Re: [LSF/MM TOPIC] Un-addressable device memory and block/fs implications 2016-12-13 22:13 ` Dave Chinner @ 2016-12-13 22:55 ` Jerome Glisse 2016-12-14 0:14 ` Dave Chinner 0 siblings, 1 reply; 31+ messages in thread From: Jerome Glisse @ 2016-12-13 22:55 UTC (permalink / raw) To: Dave Chinner; +Cc: lsf-pc, linux-mm, linux-block, linux-fsdevel On Wed, Dec 14, 2016 at 09:13:22AM +1100, Dave Chinner wrote: > On Tue, Dec 13, 2016 at 04:24:33PM -0500, Jerome Glisse wrote: > > On Wed, Dec 14, 2016 at 08:10:41AM +1100, Dave Chinner wrote: > > > On Tue, Dec 13, 2016 at 03:31:13PM -0500, Jerome Glisse wrote: > > > > On Wed, Dec 14, 2016 at 07:15:15AM +1100, Dave Chinner wrote: > > > > > On Tue, Dec 13, 2016 at 01:15:11PM -0500, Jerome Glisse wrote: > > > > > > I would like to discuss un-addressable device memory in the context of > > > > > > filesystem and block device. Specificaly how to handle write-back, read, > > > > > > ... when a filesystem page is migrated to device memory that CPU can not > > > > > > access. > > > > > > > > > > You mean pmem that is DAX-capable that suddenly, without warning, > > > > > becomes non-DAX capable? > > > > > > > > > > If you are not talking about pmem and DAX, then exactly what does > > > > > "when a filesystem page is migrated to device memory that CPU can > > > > > not access" mean? What "filesystem page" are we talking about that > > > > > can get migrated from main RAM to something the CPU can't access? > > > > > > > > I am talking about GPU, FPGA, ... any PCIE device that have fast on > > > > board memory that can not be expose transparently to the CPU. I am > > > > reusing ZONE_DEVICE for this, you can see HMM patchset on linux-mm > > > > https://lwn.net/Articles/706856/ > > > > > > So ZONE_DEVICE memory that is a DMA target but not CPU addressable? > > > > Well not only target, it can be source too. But the device can read > > and write any system memory and dma to/from that memory to its on > > board memory. 
> > So you want the device to be able to dirty mmapped pages that the > CPU can't access? Yes, correct. > > > > So in my case i am only considering non DAX/PMEM filesystem ie any > > > > "regular" filesystem back by a "regular" block device. I want to be > > > > able to migrate mmaped area of such filesystem to device memory while > > > > the device is actively using that memory. > > > > > > "migrate mmapped area of such filesystem" means what, exactly? > > > > fd = open("/path/to/some/file") > > ptr = mmap(fd, ...); > > gpu_compute_something(ptr); > > Thought so. Lots of problems with this. > > > > Are you talking about file data contents that have been copied into > > > the page cache and mmapped into a user process address space? > > > IOWs, migrating ZONE_NORMAL page cache page content and state > > > to a new ZONE_DEVICE page, and then migrating back again somehow? > > > > Take any existing application that mmap a file and allow to migrate > > chunk of that mmaped file to device memory without the application > > even knowing about it. So nothing special in respect to that mmaped > > file. > > From the application point of view. Filesystem, page cache, etc > there's substantial problems here... > > > It is a regular file on your filesystem. > > ... because of this. > > > > > From kernel point of view such memory is almost like any other, it > > > > has a struct page and most of the mm code is non the wiser, nor need > > > > to be about it. CPU access trigger a migration back to regular CPU > > > > accessible page. > > > > > > That sounds ... complex. Page migration on page cache access inside > > > the filesytem IO path locking during read()/write() sounds like > > > a great way to cause deadlocks.... > > > > There are few restriction on device page, no one can do GUP on them and > > thus no one can pin them. Hence they can always be migrated back. Yes > > each fs need modification, most of it (if not all) is isolated in common > > filemap helpers. 
> Sure, but you haven't answered my question: how do you propose we
> address the issue of placing all the mm locks required for migration
> under the filesystem IO path locks?

Two different plans (which are not exclusive of each other). The first
is to use a workqueue and have read/write wait on the workqueue until
it is done migrating the page back.

The second is to use a bounce page during I/O so that there is no need
for migration.

> > > > But for thing like writeback i want to be able to do writeback with-
> > > > out having to migrate page back first. So that data can stay on the
> > > > device while writeback is happening.
> > >
> > > Why can't you do writeback before migration, so only clean pages get
> > > moved?
> >
> > Because device can write to the page while the page is inside the device
> > memory and we might want to writeback to disk while page stays in device
> > memory and computation continues.
>
> Ok. So how does the device trigger ->page_mkwrite on a clean page to
> tell the filesystem that the page has been dirtied? So that, for
> example, if the page covers a hole because the file is sparse the
> filesytem can do the required block allocation and data
> initialisation (i.e. zero the cached page) before it gets marked
> dirty and any data gets written to it?
>
> And if zeroing the page during such a fault requires CPU access to
> the data, how do you propose we handle page migration in the middle
> of the page fault to allow the CPU to zero the page? Seems like more
> lock order/inversion problems there, too...

File-backed pages are never allocated on the device; at least we have
no incentive, for the use cases we care about today, to do so. So a
regular page is first used and initialized (to zero for a hole) before
being migrated to the device. So I do not believe there should be any
major concern about ->page_mkwrite. At least this was my impression
when I looked at the generic filemap helpers, but for some filesystems
this might be problematic.
I intend to enable this kind of migration on a per-filesystem basis,
and to allow userspace control to block such migration for a given
filesystem.

Jérôme
* Re: [LSF/MM TOPIC] Un-addressable device memory and block/fs implications 2016-12-13 22:55 ` Jerome Glisse @ 2016-12-14 0:14 ` Dave Chinner 2016-12-14 1:07 ` Jerome Glisse 0 siblings, 1 reply; 31+ messages in thread From: Dave Chinner @ 2016-12-14 0:14 UTC (permalink / raw) To: Jerome Glisse; +Cc: lsf-pc, linux-mm, linux-block, linux-fsdevel On Tue, Dec 13, 2016 at 05:55:24PM -0500, Jerome Glisse wrote: > On Wed, Dec 14, 2016 at 09:13:22AM +1100, Dave Chinner wrote: > > On Tue, Dec 13, 2016 at 04:24:33PM -0500, Jerome Glisse wrote: > > > On Wed, Dec 14, 2016 at 08:10:41AM +1100, Dave Chinner wrote: > > > > > From kernel point of view such memory is almost like any other, it > > > > > has a struct page and most of the mm code is non the wiser, nor need > > > > > to be about it. CPU access trigger a migration back to regular CPU > > > > > accessible page. > > > > > > > > That sounds ... complex. Page migration on page cache access inside > > > > the filesytem IO path locking during read()/write() sounds like > > > > a great way to cause deadlocks.... > > > > > > There are few restriction on device page, no one can do GUP on them and > > > thus no one can pin them. Hence they can always be migrated back. Yes > > > each fs need modification, most of it (if not all) is isolated in common > > > filemap helpers. > > > > Sure, but you haven't answered my question: how do you propose we > > address the issue of placing all the mm locks required for migration > > under the filesystem IO path locks? > > Two different plans (which are non exclusive of each other). First is to use > workqueue and have read/write wait on the workqueue to be done migrating the > page back. Pushing something to a workqueue and then waiting on the workqueue to complete the work doesn't change lock ordering problems - it just hides them away and makes them harder to debug. > Second solution is to use a bounce page during I/O so that there is no need > for migration. 
Which means the page in the device is left with out-of-date contents, right? If so, how do you prevent data corruption/loss when the device has modified the page out of sight of the CPU and the bounce page doesn't contain those modifications? Or if the dirty device page is written back directly without containing the changes made in the bounce page? Hmmm - what happens when we invalidate and release a range of file pages that have been migrated to a device? e.g. on truncate? > > > > > But for thing like writeback i want to be able to do writeback with- > > > > > out having to migrate page back first. So that data can stay on the > > > > > device while writeback is happening. > > > > > > > > Why can't you do writeback before migration, so only clean pages get > > > > moved? > > > > > > Because device can write to the page while the page is inside the device > > > memory and we might want to writeback to disk while page stays in device > > > memory and computation continues. > > > > Ok. So how does the device trigger ->page_mkwrite on a clean page to > > tell the filesystem that the page has been dirtied? So that, for > > example, if the page covers a hole because the file is sparse the > > filesytem can do the required block allocation and data > > initialisation (i.e. zero the cached page) before it gets marked > > dirty and any data gets written to it? > > > > And if zeroing the page during such a fault requires CPU access to > > the data, how do you propose we handle page migration in the middle > > of the page fault to allow the CPU to zero the page? Seems like more > > lock order/inversion problems there, too... > > File back page are never allocated on device, at least we have no incentive > for usecase we care about today to do so. So a regular page is first use > and initialize (to zero for hole) before being migrated to device. > So i do not believe there should be any major concern on ->page_mkwrite. 
Such deja vu - inodes are not static objects as modern filesystems
are highly dynamic. If you want to have safe, reliable non-coherent
mmap-based file data offload to devices, then I suspect that we're
going to need pretty much all of the same restrictions the pmem
programming model requires for userspace data flushing. i.e.:

https://lkml.org/lkml/2016/9/15/33

At which point I have to ask: why is mmap considered to be the right
model for transferring data in and out of devices that are not
directly CPU addressable?

> At least
> this was my impression when i look at generic filemap one, but for some
> filesystem this might need be problematic.

Definitely problematic for XFS, btrfs, f2fs, ocfs2, and probably
ext4 and others as well.

> and allowing control by userspace to block such
> migration for given fs.

How do you propose doing that?

Cheers,

Dave.
-- 
Dave Chinner
david@fromorbit.com
* Re: [LSF/MM TOPIC] Un-addressable device memory and block/fs implications 2016-12-14 0:14 ` Dave Chinner @ 2016-12-14 1:07 ` Jerome Glisse 2016-12-14 4:23 ` Dave Chinner 0 siblings, 1 reply; 31+ messages in thread From: Jerome Glisse @ 2016-12-14 1:07 UTC (permalink / raw) To: Dave Chinner; +Cc: lsf-pc, linux-mm, linux-block, linux-fsdevel On Wed, Dec 14, 2016 at 11:14:22AM +1100, Dave Chinner wrote: > On Tue, Dec 13, 2016 at 05:55:24PM -0500, Jerome Glisse wrote: > > On Wed, Dec 14, 2016 at 09:13:22AM +1100, Dave Chinner wrote: > > > On Tue, Dec 13, 2016 at 04:24:33PM -0500, Jerome Glisse wrote: > > > > On Wed, Dec 14, 2016 at 08:10:41AM +1100, Dave Chinner wrote: > > > > > > From kernel point of view such memory is almost like any other, it > > > > > > has a struct page and most of the mm code is non the wiser, nor need > > > > > > to be about it. CPU access trigger a migration back to regular CPU > > > > > > accessible page. > > > > > > > > > > That sounds ... complex. Page migration on page cache access inside > > > > > the filesytem IO path locking during read()/write() sounds like > > > > > a great way to cause deadlocks.... > > > > > > > > There are few restriction on device page, no one can do GUP on them and > > > > thus no one can pin them. Hence they can always be migrated back. Yes > > > > each fs need modification, most of it (if not all) is isolated in common > > > > filemap helpers. > > > > > > Sure, but you haven't answered my question: how do you propose we > > > address the issue of placing all the mm locks required for migration > > > under the filesystem IO path locks? > > > > Two different plans (which are non exclusive of each other). First is to use > > workqueue and have read/write wait on the workqueue to be done migrating the > > page back. > > Pushing something to a workqueue and then waiting on the workqueue > to complete the work doesn't change lock ordering problems - it > just hides them away and makes them harder to debug. 
Migration doesn't need many locks; below is a list, and I don't see any
lock issue with respect to ->read or ->write.

lock_page(page);
spin_lock_irq(&mapping->tree_lock);
lock_buffer(bh); // if page has buffer_head
i_mmap_lock_read(mapping);
vma_interval_tree_foreach(vma, &mapping->i_mmap, pgoff, pgoff) {
    // page table lock for each entry
}

I don't think I missed any, and thus I don't see any real issues here.
Care to point to the lock you think is going to be problematic?

> > Second solution is to use a bounce page during I/O so that there is no need
> > for migration.
>
> Which means the page in the device is left with out-of-date
> contents, right?
>
> If so, how do you prevent data corruption/loss when the device
> has modified the page out of sight of the CPU and the bounce page
> doesn't contain those modifications? Or if the dirty device page is
> written back directly without containing the changes made in the
> bounce page?

There is no issue here: if a bounce page is used, then the page is
marked read-only on the device until the write is done and the device
copy is updated with what we have been asked to write. So there is no
coherency issue between the two copies.

> Hmmm - what happens when we invalidate and release a range of
> file pages that have been migrated to a device? e.g. on truncate?

Same as if it were regular memory: access by the device triggers
SIGBUS, which is reported through the device API. In that respect it
follows the exact same code path as a regular page.

> > > > > > But for thing like writeback i want to be able to do writeback with-
> > > > > > out having to migrate page back first. So that data can stay on the
> > > > > > device while writeback is happening.
> > > > >
> > > > > Why can't you do writeback before migration, so only clean pages get
> > > > > moved?
> > > >
> > > > Because device can write to the page while the page is inside the device
> > > > memory and we might want to writeback to disk while page stays in device
> > > > memory and computation continues.
> > > > > > Ok. So how does the device trigger ->page_mkwrite on a clean page to > > > tell the filesystem that the page has been dirtied? So that, for > > > example, if the page covers a hole because the file is sparse the > > > filesytem can do the required block allocation and data > > > initialisation (i.e. zero the cached page) before it gets marked > > > dirty and any data gets written to it? > > > > > > And if zeroing the page during such a fault requires CPU access to > > > the data, how do you propose we handle page migration in the middle > > > of the page fault to allow the CPU to zero the page? Seems like more > > > lock order/inversion problems there, too... > > > > File back page are never allocated on device, at least we have no incentive > > for usecase we care about today to do so. So a regular page is first use > > and initialize (to zero for hole) before being migrated to device. > > So i do not believe there should be any major concern on ->page_mkwrite. > > Such deja vu - inodes are not static objects as modern filesystems > are highly dynamic. If you want to have safe, reliable non-coherent > mmap-based file data offload to devices, then I suspect that we're > going to need pretty much all of the same restrictions the pmem > programming model requires for userspace data flushing. i.e.: > > https://lkml.org/lkml/2016/9/15/33 I don't see any of the issues in that email applying to my case. Like i said from fs/mm point of view my page are _exactly_ like regular page. Only thing is no CPU access. So what would have happen to regular page would happen to device page. There is no differences here whatsoever. > > At which point I have to ask: why is mmap considered to be the right > model for transfering data in and out of devices that are not > directly CPU addressable? That is where the industry is going, OpenCL 2.0/3.0, C++ concurrency and parallelism, OpenACC, OpenMP, HSA, Cuda ... 
all those APIs require a unified address space and transparent use of
device memory. There are hardware solutions in the making, like CCIX
or OpenCAPI, but not all players are willing to move forward and let
PCIe go. So we will need a software solution to cater to those
platforms that decide to stick with PCIe, or otherwise there is a
large range of hardware we will not be able to use to its full
potential (rendering it mostly useless on Linux).

> > At least
> > this was my impression when i look at generic filemap one, but for some
> > filesystem this might need be problematic.
>
> Definitely problematic for XFS, btrfs, f2fs, ocfs2, and probably
> ext4 and others as well.
>
> > and allowing control by userspace to block such
> > migration for given fs.
>
> How do you propose doing that?

A mount flag option is my first idea, but I have no strong opinion
here. It might make sense at finer granularity, but I don't believe so.

Cheers,
Jérôme
* Re: [LSF/MM TOPIC] Un-addressable device memory and block/fs implications 2016-12-14 1:07 ` Jerome Glisse @ 2016-12-14 4:23 ` Dave Chinner 2016-12-14 16:35 ` Jerome Glisse 0 siblings, 1 reply; 31+ messages in thread From: Dave Chinner @ 2016-12-14 4:23 UTC (permalink / raw) To: Jerome Glisse; +Cc: lsf-pc, linux-mm, linux-block, linux-fsdevel On Tue, Dec 13, 2016 at 08:07:58PM -0500, Jerome Glisse wrote: > On Wed, Dec 14, 2016 at 11:14:22AM +1100, Dave Chinner wrote: > > On Tue, Dec 13, 2016 at 05:55:24PM -0500, Jerome Glisse wrote: > > > On Wed, Dec 14, 2016 at 09:13:22AM +1100, Dave Chinner wrote: > > > > On Tue, Dec 13, 2016 at 04:24:33PM -0500, Jerome Glisse wrote: > > > > > On Wed, Dec 14, 2016 at 08:10:41AM +1100, Dave Chinner wrote: > > > > > > > From kernel point of view such memory is almost like any other, it > > > > > > > has a struct page and most of the mm code is non the wiser, nor need > > > > > > > to be about it. CPU access trigger a migration back to regular CPU > > > > > > > accessible page. > > > > > > > > > > > > That sounds ... complex. Page migration on page cache access inside > > > > > > the filesytem IO path locking during read()/write() sounds like > > > > > > a great way to cause deadlocks.... > > > > > > > > > > There are few restriction on device page, no one can do GUP on them and > > > > > thus no one can pin them. Hence they can always be migrated back. Yes > > > > > each fs need modification, most of it (if not all) is isolated in common > > > > > filemap helpers. > > > > > > > > Sure, but you haven't answered my question: how do you propose we > > > > address the issue of placing all the mm locks required for migration > > > > under the filesystem IO path locks? > > > > > > Two different plans (which are non exclusive of each other). First is to use > > > workqueue and have read/write wait on the workqueue to be done migrating the > > > page back. 
> > > > Pushing something to a workqueue and then waiting on the workqueue > > to complete the work doesn't change lock ordering problems - it > > just hides them away and makes them harder to debug. > > Migration doesn't need many lock below is a list and i don't see any lock issue > in respect to ->read or ->write. > > lock_page(page); > spin_lock_irq(&mapping->tree_lock); > lock_buffer(bh); // if page has buffer_head > i_mmap_lock_read(mapping); > vma_interval_tree_foreach(vma, &mapping->i_mmap, pgoff, pgoff) { > // page table lock for each entry > } We can't take the page or mapping tree locks that while we hold various filesystem locks. e.g. The IO path lock order is, in places: inode->i_rwsem get page from page cache lock_page(page) inode->allocation lock zero page data Filesystems are allowed to do this, because the IO path has guaranteed them access to the page cache data on the page that is locked. Your ZONE_DEVICE proposal breaks this guarantee - we might have a locked page, but we don't have access to it's data. Further, in various filesystems once the allocation lock is taken (e.g. the i_lock in XFS) we're not allowed to lock pages or the mapping tree as that leads to deadlocks with truncate, hole punch, etc. Hence if the "zero page data" operation occurs on a ZONE_DEVICE page that requires migration before the zeroing can occur, we can't perform migration here. Why are we even considering migration in situations where we already hold the ZONE_DEVICE page locked, hold other filesystem locks inside the page lock, and have an open dirty filesystem transaction as well? Even if migration si possible and succeeds, the struct page in the mapping tree for the file offset we are operating on is going to be different after migration. That implies we need to completely restart the operation. But given that we've already made changes, backing out at this point is ... complex and may not even be possible. i.e. 
we have an architectural assumption that page contents are always accessable when we have a locked struct page, and your proposal would appear to violate that assumption... > > > Second solution is to use a bounce page during I/O so that there is no need > > > for migration. > > > > Which means the page in the device is left with out-of-date > > contents, right? > > > > If so, how do you prevent data corruption/loss when the device > > has modified the page out of sight of the CPU and the bounce page > > doesn't contain those modifications? Or if the dirty device page is > > written back directly without containing the changes made in the > > bounce page? > > There is no issue here, if bounce page is use then the page is mark as read > only on the device until write is done and device copy is updated with what > we have been ask to write. So no coherency issue between the 2 copy. What if the page is already dirty on the device? You can't just "mark it read only" because then you lose any data the device had written that was not directly overwritten by the IO that needed bouncing. Partial page overwrites do occur... > > > > And if zeroing the page during such a fault requires CPU access to > > > > the data, how do you propose we handle page migration in the middle > > > > of the page fault to allow the CPU to zero the page? Seems like more > > > > lock order/inversion problems there, too... > > > > > > File back page are never allocated on device, at least we have no incentive > > > for usecase we care about today to do so. So a regular page is first use > > > and initialize (to zero for hole) before being migrated to device. > > > So i do not believe there should be any major concern on ->page_mkwrite. > > > > Such deja vu - inodes are not static objects as modern filesystems > > are highly dynamic. 
If you want to have safe, reliable non-coherent > > mmap-based file data offload to devices, then I suspect that we're > > going to need pretty much all of the same restrictions the pmem > > programming model requires for userspace data flushing. i.e.: > > > > https://lkml.org/lkml/2016/9/15/33 > > I don't see any of the issues in that email applying to my case. Like i said > from fs/mm point of view my page are _exactly_ like regular page. Except they aren't... > Only thing > is no CPU access. ... because filesystems need direct CPU access to the data the page points at when migration does not appear to be possible. FWIW, another nasty corner case I just realised: the file data requires some kind of data transformation on writeback. e.g. compression, encryption, parity calculations for RAID, etc. IOWs, it could be the block device underneath the filesystem that requires ZONE_DEVICE->ZONE_NORMAL migration to occur. And to make matters worse, that can occur in code paths that operate in a "must guarantee forwards progress" memory allocation context... > > At which point I have to ask: why is mmap considered to be the right > > model for transfering data in and out of devices that are not > > directly CPU addressable? > > That is where the industry is going, OpenCL 2.0/3.0, C++ concurrency and > parallelism, OpenACC, OpenMP, HSA, Cuda ... all those API require unified > address space and transparent use of device memory. Sure, but that doesn't mean you can just map random files into the user address space and then hand it off to random hardware and expect the filesystem to be perfectly happy with that. > > > migration for given fs. > > > > How do you propose doing that? > > As a mount flag option is my first idea but i have no strong opinion here. No, absolutely not. Mount options are not for controlling random special interest behaviours in filesystems. That makes it impossible to mix "incompatible" technologies in the same filesystem. 
> It might make sense for finer granularity but i don't believe so.

Then you're just not thinking about complex computation engines the
right way, are you?

e.g. you have a pmem filesystem as the central high-speed data store
for your computation engine. Some apps in the pipeline use DAX for
their data access because it's 10x faster than using traditional
buffered mmap access, so the filesystem is mounted "-o dax". But then
you want to add a hardware accelerator to speed up a different stage
of the pipeline by 10x, but it requires page-based ZONE_DEVICE
management. Unfortunately the "-o zone_device" mount option is
incompatible with "-o dax", and because "it doesn't make sense for DAX
to be a fine grained option" you can't combine the two technologies
into the one pipeline....

That'd really suck, wouldn't it?

-Dave.
-- 
Dave Chinner
david@fromorbit.com
* Re: [LSF/MM TOPIC] Un-addressable device memory and block/fs implications 2016-12-14 4:23 ` Dave Chinner @ 2016-12-14 16:35 ` Jerome Glisse 0 siblings, 0 replies; 31+ messages in thread From: Jerome Glisse @ 2016-12-14 16:35 UTC (permalink / raw) To: Dave Chinner; +Cc: lsf-pc, linux-mm, linux-block, linux-fsdevel On Wed, Dec 14, 2016 at 03:23:13PM +1100, Dave Chinner wrote: > On Tue, Dec 13, 2016 at 08:07:58PM -0500, Jerome Glisse wrote: > > On Wed, Dec 14, 2016 at 11:14:22AM +1100, Dave Chinner wrote: > > > On Tue, Dec 13, 2016 at 05:55:24PM -0500, Jerome Glisse wrote: > > > > On Wed, Dec 14, 2016 at 09:13:22AM +1100, Dave Chinner wrote: > > > > > On Tue, Dec 13, 2016 at 04:24:33PM -0500, Jerome Glisse wrote: > > > > > > On Wed, Dec 14, 2016 at 08:10:41AM +1100, Dave Chinner wrote: > > > > > > > > From kernel point of view such memory is almost like any other, it > > > > > > > > has a struct page and most of the mm code is non the wiser, nor need > > > > > > > > to be about it. CPU access trigger a migration back to regular CPU > > > > > > > > accessible page. > > > > > > > > > > > > > > That sounds ... complex. Page migration on page cache access inside > > > > > > > the filesytem IO path locking during read()/write() sounds like > > > > > > > a great way to cause deadlocks.... > > > > > > > > > > > > There are few restriction on device page, no one can do GUP on them and > > > > > > thus no one can pin them. Hence they can always be migrated back. Yes > > > > > > each fs need modification, most of it (if not all) is isolated in common > > > > > > filemap helpers. > > > > > > > > > > Sure, but you haven't answered my question: how do you propose we > > > > > address the issue of placing all the mm locks required for migration > > > > > under the filesystem IO path locks? > > > > > > > > Two different plans (which are non exclusive of each other). 
First is to use > > > > workqueue and have read/write wait on the workqueue to be done migrating the > > > > page back. > > > Pushing something to a workqueue and then waiting on the workqueue > > > to complete the work doesn't change lock ordering problems - it > > > just hides them away and makes them harder to debug. > > Migration doesn't need many locks; below is a list and i don't see any lock issue > > with respect to ->read or ->write. > > > > lock_page(page); > > spin_lock_irq(&mapping->tree_lock); > > lock_buffer(bh); // if page has buffer_head > > i_mmap_lock_read(mapping); > > vma_interval_tree_foreach(vma, &mapping->i_mmap, pgoff, pgoff) { > > // page table lock for each entry > > } > We can't take the page or mapping tree locks while we hold > various filesystem locks. > > e.g. The IO path lock order is, in places: > > inode->i_rwsem > get page from page cache > lock_page(page) > inode->allocation lock > zero page data > > Filesystems are allowed to do this, because the IO path has > guaranteed them access to the page cache data on the page that is > locked. Your ZONE_DEVICE proposal breaks this guarantee - we might > have a locked page, but we don't have access to its data. > > Further, in various filesystems once the allocation lock is taken > (e.g. the i_lock in XFS) we're not allowed to lock pages or the > mapping tree as that leads to deadlocks with truncate, hole punch, > etc. Hence if the "zero page data" operation occurs on a ZONE_DEVICE page that > requires migration before the zeroing can occur, we can't perform > migration here. > > Why are we even considering migration in situations where we already > hold the ZONE_DEVICE page locked, hold other filesystem locks inside > the page lock, and have an open dirty filesystem transaction as well? > > Even if migration is possible and succeeds, the struct page in the > mapping tree for the file offset we are operating on is going to be > different after migration.
That implies we need to completely > restart the operation. But given that we've already made changes, > backing out at this point is ... complex and may not even be > possible. So i skimmed through the xfs code and i still think this is doable. So in the above sequence: inode->i_rwsem page = find_get_page(); if (device_unaddressable(page)) { page = migratepage(); } ... Now there are things like filemap_write_and_wait...() but those can be handled by the bio bounce buffer like i said, i.e. at the block layer we allocate a temporary page; the page is already read only on the device as the device obeys regular things like page_mkclean(). So the page content is stable. The migration uses buffer_migrate_page() and i don't see any deadlock there. So i am not seeing any problem in doing the migration early on, right after page lookup. > > i.e. we have an architectural assumption that page contents are > always accessible when we have a locked struct page, and your > proposal would appear to violate that assumption... And it is accessible: the data might be in device memory but you can use a bounce page to access it, and you can write protect it on the device so that it doesn't change. Looking at xfs, it never does kmap() directly, only through some of the generic code, and those are places where we can use a bounce page.
So no coherency issue between the 2 copies. > > What if the page is already dirty on the device? You can't just > "mark it read only" because then you lose any data the device had > written that was not directly overwritten by the IO that needed > bouncing. > > Partial page overwrites do occur... I should have been more explicit:
- write protect page on device
- alloc bounce page
- dma device data to bounce page
- perform write on bounce page
- dma bounce page back to device data
- write io end
It is just like it would be on the CPU. There is no data hazard, no loss of data or incoherency here. > > > > > And if zeroing the page during such a fault requires CPU access to > > > > > the data, how do you propose we handle page migration in the middle > > > > > of the page fault to allow the CPU to zero the page? Seems like more > > > > > lock order/inversion problems there, too... > > > > File backed pages are never allocated on the device, at least we have no incentive > > > > for the usecases we care about today to do so. So a regular page is first used > > > > and initialized (to zero for a hole) before being migrated to the device. > > > > So i do not believe there should be any major concern on ->page_mkwrite. > > > Such deja vu - inodes are not static objects as modern filesystems > > > are highly dynamic. If you want to have safe, reliable non-coherent > > > mmap-based file data offload to devices, then I suspect that we're > > > going to need pretty much all of the same restrictions the pmem > > > programming model requires for userspace data flushing. i.e.: > > > > > > https://lkml.org/lkml/2016/9/15/33 > > I don't see any of the issues in that email applying to my case. Like i said > > from the fs/mm point of view my pages are _exactly_ like regular pages. > Except they aren't... > > Only thing > > is no CPU access. > ... because filesystems need direct CPU access to the data the page > points at when migration does not appear to be possible.
And it can, the data is always accessible, it is just a matter of using a bounce page. I did a grep on kmap() and 99% of call sites are about meta-data pages which i don't want to migrate. Then there are some in generic helpers for read/write/aio ... these are places where a bounce page can be used if the page is not migrated earlier in the i/o process. > > FWIW, another nasty corner case I just realised: the file data > requires some kind of data transformation on writeback. e.g. > compression, encryption, parity calculations for RAID, etc. IOWs, it > could be the block device underneath the filesystem that requires > ZONE_DEVICE->ZONE_NORMAL migration to occur. And to make matters > worse, that can occur in code paths that operate in a "must > guarantee forwards progress" memory allocation context... Well my proposal is about using the bio bounce code, which was done for ISA block devices, and i don't see any issue there. We allocate a bounce page, copy data from the device into the bounce page, and the block layer does its thing (compress, encrypt, ...) on the bounce page. It is none the wiser. There is no migration happening. Note that at this point the page is already write protected on the device like it would be on the CPU. > > > At which point I have to ask: why is mmap considered to be the right > > > model for transferring data in and out of devices that are not > > > directly CPU addressable? > > That is where the industry is going, OpenCL 2.0/3.0, C++ concurrency and > > parallelism, OpenACC, OpenMP, HSA, Cuda ... all those APIs require a unified > > address space and transparent use of device memory. > Sure, but that doesn't mean you can just map random files into the > user address space and then hand it off to random hardware and > expect the filesystem to be perfectly happy with that. I am not expecting the filesystem will be happy as it is, but i am expecting there is a way to make it happy :) > > > > migration for given fs. > > > How do you propose doing that?
> > > > As a mount flag option is my first idea but i have no strong opinion here. > No, absolutely not. Mount options are not for controlling random > special interest behaviours in filesystems. That makes it impossible > to mix "incompatible" technologies in the same filesystem. I don't have a strong opinion here. I just would like to allow sys-admins to decide somehow if they don't want to allow some fs to be migrated to a device. I don't have good knowledge of what interface would be appropriate for this. > > > It might make sense for finer granularity but i don't believe so. > Then you're just not thinking about complex computation engines the > right way, are you? > > e.g. you have a pmem filesystem as the central high-speed data store > for your computation engine. Some apps in the pipeline use DAX for > their data access because it's 10x faster than using traditional > buffered mmap access, so the filesystem is mounted "-o dax". But > then you want to add a hardware accelerator to speed up a different > stage of the pipeline by 10x, but it requires page based ZONE_DEVICE > management. > > Unfortunately the "-o zone_device" mount option is incompatible with > "-o dax" and because "it doesn't make sense for DAX to be a fine > grained option" you can't combine the two technologies into the one > pipeline.... > > That'd really suck, wouldn't it? Well i don't want to allow migration for a dax fs because dax is a different problem. I think it is only used with pmem and i don't think i want to allow pmem migration. It would break some assumptions people have about pmem. People using both technologies would have to do extra work in their programs to leverage both. Cheers, Jérôme ^ permalink raw reply [flat|nested] 31+ messages in thread
* Re: [Lsf-pc] [LSF/MM TOPIC] Un-addressable device memory and block/fs implications 2016-12-13 21:24 ` Jerome Glisse 2016-12-13 22:08 ` Dave Hansen 2016-12-13 22:13 ` Dave Chinner @ 2016-12-14 11:13 ` Jan Kara 2016-12-14 17:15 ` Jerome Glisse 2016-12-19 17:00 ` Aneesh Kumar K.V 2 siblings, 2 replies; 31+ messages in thread From: Jan Kara @ 2016-12-14 11:13 UTC (permalink / raw) To: Jerome Glisse; +Cc: Dave Chinner, linux-block, linux-mm, lsf-pc, linux-fsdevel On Tue 13-12-16 16:24:33, Jerome Glisse wrote: > On Wed, Dec 14, 2016 at 08:10:41AM +1100, Dave Chinner wrote: > > On Tue, Dec 13, 2016 at 03:31:13PM -0500, Jerome Glisse wrote: > > > On Wed, Dec 14, 2016 at 07:15:15AM +1100, Dave Chinner wrote: > > > > On Tue, Dec 13, 2016 at 01:15:11PM -0500, Jerome Glisse wrote: > > > > > I would like to discuss un-addressable device memory in the context of > > > > > filesystem and block device. Specificaly how to handle write-back, read, > > > > > ... when a filesystem page is migrated to device memory that CPU can not > > > > > access. > > > > > > > > You mean pmem that is DAX-capable that suddenly, without warning, > > > > becomes non-DAX capable? > > > > > > > > If you are not talking about pmem and DAX, then exactly what does > > > > "when a filesystem page is migrated to device memory that CPU can > > > > not access" mean? What "filesystem page" are we talking about that > > > > can get migrated from main RAM to something the CPU can't access? > > > > > > I am talking about GPU, FPGA, ... any PCIE device that have fast on > > > board memory that can not be expose transparently to the CPU. I am > > > reusing ZONE_DEVICE for this, you can see HMM patchset on linux-mm > > > https://lwn.net/Articles/706856/ > > > > So ZONE_DEVICE memory that is a DMA target but not CPU addressable? > > Well not only target, it can be source too. But the device can read > and write any system memory and dma to/from that memory to its on > board memory. 
> > > > > > So in my case i am only considering non DAX/PMEM filesystem ie any > > > "regular" filesystem back by a "regular" block device. I want to be > > > able to migrate mmaped area of such filesystem to device memory while > > > the device is actively using that memory. > > > > "migrate mmapped area of such filesystem" means what, exactly? > > fd = open("/path/to/some/file") > ptr = mmap(fd, ...); > gpu_compute_something(ptr); > > > > > Are you talking about file data contents that have been copied into > > the page cache and mmapped into a user process address space? > > IOWs, migrating ZONE_NORMAL page cache page content and state > > to a new ZONE_DEVICE page, and then migrating back again somehow? > > Take any existing application that mmap a file and allow to migrate > chunk of that mmaped file to device memory without the application > even knowing about it. So nothing special in respect to that mmaped > file. It is a regular file on your filesystem. OK, so I share most of Dave's concerns about this. But let's talk about what we can do and what you need and we may find something usable. First let me understand what is doable / what are the costs on your side. So we have a page cache page that you'd like to migrate to the device. Fine. You are willing to sacrifice direct IO - even better. We can fall back to buffered IO in that case (well, except for XFS which does not do it but that's a minor detail). One thing I'm not sure about: When a page is migrated to the device, is its contents available and is just possibly stale or will something bad happen if we try to access (or even modify) page data? And by migration you really mean page migration? Be aware that migration of pagecache pages may be a problem for some pages of some filesystems on its own - e. g. page migration may fail because there is a filesystem transaction outstanding modifying that page. 
For userspace these will be really hard to understand sporadic errors because it's really filesystem internal thing. So far page migration was widely used only for free space defragmentation and for that purpose if page is not migratable for a minute who cares. So won't it be easier to leave the pagecache page where it is and *copy* it to the device? Can the device notify us *before* it is going to modify a page, not just after it has modified it? Possibly if we just give it the page read-only and it will have to ask CPU to get write permission? If yes, then I belive this could work and even fs support should be doable. Honza -- Jan Kara <jack@suse.com> SUSE Labs, CR -- To unsubscribe, send a message with 'unsubscribe linux-mm' in the body to majordomo@kvack.org. For more info on Linux MM, see: http://www.linux-mm.org/ . Don't email: <a href=mailto:"dont@kvack.org"> email@kvack.org </a> ^ permalink raw reply [flat|nested] 31+ messages in thread
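Jan's copy-based alternative — give the device a read-only copy of the pagecache page and make it fault back to the CPU for write permission — could be sketched like this in kernel-style pseudocode. Every helper name here is hypothetical, not an existing API:

```
/* Device read fault: leave the pagecache page in place and *copy*
 * it to the device, mapped read-only there. */
dev_read_fault(dev, mapping, index)
{
        page = find_get_page(mapping, index);
        dma_copy_to_device(dev, page);          /* copy, not migrate */
        dev_map(dev, index, READ_ONLY);
}

/* Device write fault: the device must ask the CPU first, so the
 * kernel can do its ->page_mkwrite()-style bookkeeping before
 * granting write permission on the device copy. */
dev_write_fault(dev, mapping, index)
{
        page = find_get_page(mapping, index);
        set_page_dirty(page);                   /* CPU-side bookkeeping */
        dev_map(dev, index, READ_WRITE);
}
```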
* Re: [Lsf-pc] [LSF/MM TOPIC] Un-addressable device memory and block/fs implications 2016-12-14 11:13 ` [Lsf-pc] " Jan Kara @ 2016-12-14 17:15 ` Jerome Glisse 2016-12-15 16:19 ` Jan Kara 2016-12-19 17:00 ` Aneesh Kumar K.V 1 sibling, 1 reply; 31+ messages in thread From: Jerome Glisse @ 2016-12-14 17:15 UTC (permalink / raw) To: Jan Kara; +Cc: Dave Chinner, linux-block, linux-mm, lsf-pc, linux-fsdevel On Wed, Dec 14, 2016 at 12:13:51PM +0100, Jan Kara wrote: > On Tue 13-12-16 16:24:33, Jerome Glisse wrote: > > On Wed, Dec 14, 2016 at 08:10:41AM +1100, Dave Chinner wrote: > > > On Tue, Dec 13, 2016 at 03:31:13PM -0500, Jerome Glisse wrote: > > > > On Wed, Dec 14, 2016 at 07:15:15AM +1100, Dave Chinner wrote: > > > > > On Tue, Dec 13, 2016 at 01:15:11PM -0500, Jerome Glisse wrote: > > > > > > I would like to discuss un-addressable device memory in the context of > > > > > > filesystem and block device. Specificaly how to handle write-back, read, > > > > > > ... when a filesystem page is migrated to device memory that CPU can not > > > > > > access. > > > > > > > > > > You mean pmem that is DAX-capable that suddenly, without warning, > > > > > becomes non-DAX capable? > > > > > > > > > > If you are not talking about pmem and DAX, then exactly what does > > > > > "when a filesystem page is migrated to device memory that CPU can > > > > > not access" mean? What "filesystem page" are we talking about that > > > > > can get migrated from main RAM to something the CPU can't access? > > > > > > > > I am talking about GPU, FPGA, ... any PCIE device that have fast on > > > > board memory that can not be expose transparently to the CPU. I am > > > > reusing ZONE_DEVICE for this, you can see HMM patchset on linux-mm > > > > https://lwn.net/Articles/706856/ > > > > > > So ZONE_DEVICE memory that is a DMA target but not CPU addressable? > > > > Well not only target, it can be source too. 
But the device can read > > and write any system memory and dma to/from that memory to its on > > board memory. > > > > > > > > > So in my case i am only considering non DAX/PMEM filesystem ie any > > > > "regular" filesystem back by a "regular" block device. I want to be > > > > able to migrate mmaped area of such filesystem to device memory while > > > > the device is actively using that memory. > > > > > > "migrate mmapped area of such filesystem" means what, exactly? > > > > fd = open("/path/to/some/file") > > ptr = mmap(fd, ...); > > gpu_compute_something(ptr); > > > > > > > > Are you talking about file data contents that have been copied into > > > the page cache and mmapped into a user process address space? > > > IOWs, migrating ZONE_NORMAL page cache page content and state > > > to a new ZONE_DEVICE page, and then migrating back again somehow? > > > > Take any existing application that mmap a file and allow to migrate > > chunk of that mmaped file to device memory without the application > > even knowing about it. So nothing special in respect to that mmaped > > file. It is a regular file on your filesystem. > > OK, so I share most of Dave's concerns about this. But let's talk about > what we can do and what you need and we may find something usable. First > let me understand what is doable / what are the costs on your side. > > So we have a page cache page that you'd like to migrate to the device. > Fine. You are willing to sacrifice direct IO - even better. We can fall > back to buffered IO in that case (well, except for XFS which does not do it > but that's a minor detail). One thing I'm not sure about: When a page is > migrated to the device, is its contents available and is just possibly stale > or will something bad happen if we try to access (or even modify) page data? 
Well i am not ready to sacrifice anything :) the point is that high level languages are evolving in a direction in which they want to transparently use devices like GPUs without the programmer's knowledge, so it is important that all features keep working as if nothing is amiss. Devices behave exactly like CPUs with respect to memory. They have a page table and they have the same kind of capabilities. So devices will follow the same rules. When you start writeback you do page_mkclean() and this will be reflected on the device too; it will write protect the page. Moreover you can access the data at any time, devices are cache coherent, and so when you use their dma engine to retrieve page content you will get the full page content and nothing can be stale (assuming that the page is first write protected). > > And by migration you really mean page migration? Be aware that migration of > pagecache pages may be a problem for some pages of some filesystems on its > own - e. g. page migration may fail because there is a filesystem transaction > outstanding modifying that page. For userspace these will be really hard > to understand sporadic errors because it's really filesystem internal > thing. So far page migration was widely used only for free space > defragmentation and for that purpose if page is not migratable for a minute > who cares. I am aware that page migration can fail because a writeback is underway and i am fine with it. When that happens either the device waits or it uses the system page directly (read only, obviously, as the device obeys read/write protection). > > So won't it be easier to leave the pagecache page where it is and *copy* it > to the device? Can the device notify us *before* it is going to modify a > page, not just after it has modified it? Possibly if we just give it the > page read-only and it will have to ask CPU to get write permission? If yes, > then I believe this could work and even fs support should be doable. Well yes and no.
Devices obey the same rules as the CPU, so if a file backed page is mapped read only in the process it must first do a write fault, which will call into the fs (page_mkwrite() of vm_ops). But once a page has write permission there is no way to be notified by hardware on every write. First, the hardware does not have the capability. Second, we are talking thousands (10 000 is the upper range in today's devices) of concurrent threads, each of which can possibly write to the page under consideration. We really want the device page to behave just like a regular page. Most fs code paths never map file content; it only happens during read/write, and i believe this can be handled either by migrating back or by using a bounce page. I want to provide the choice between the two solutions as one will be better for some workloads and the other for different workloads. Cheers, Jérôme ^ permalink raw reply [flat|nested] 31+ messages in thread
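The device-side write fault path described above — same rules as a CPU fault, including the fs notification — could be sketched as kernel-style pseudocode (the dev_* helpers are hypothetical; the ->page_mkwrite() hook is the real 2016-era vm_operations_struct callback):

```
/* A device write fault on a read-only file-backed mapping goes
 * through the same ->page_mkwrite() notification a CPU fault would. */
dev_handle_write_fault(struct vm_area_struct *vma, struct page *page)
{
        struct vm_fault vmf = { .page = page, /* ... */ };

        if (vma->vm_ops && vma->vm_ops->page_mkwrite)
                vma->vm_ops->page_mkwrite(vma, &vmf);  /* fs gets its say */

        /* Only now is the page mapped writable in the *device* page
         * table; past this point the hardware cannot report individual
         * writes, exactly as with a writable CPU pte. */
        dev_map_writable(vma, page);
}
```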
* Re: [Lsf-pc] [LSF/MM TOPIC] Un-addressable device memory and block/fs implications 2016-12-14 17:15 ` Jerome Glisse @ 2016-12-15 16:19 ` Jan Kara 2016-12-15 19:14 ` Jerome Glisse 2016-12-16 3:10 ` Aneesh Kumar K.V 0 siblings, 2 replies; 31+ messages in thread From: Jan Kara @ 2016-12-15 16:19 UTC (permalink / raw) To: Jerome Glisse Cc: Jan Kara, Dave Chinner, linux-block, linux-mm, lsf-pc, linux-fsdevel On Wed 14-12-16 12:15:14, Jerome Glisse wrote: <snipped explanation that the device has the same cabilities as CPUs wrt page handling> > > So won't it be easier to leave the pagecache page where it is and *copy* it > > to the device? Can the device notify us *before* it is going to modify a > > page, not just after it has modified it? Possibly if we just give it the > > page read-only and it will have to ask CPU to get write permission? If yes, > > then I belive this could work and even fs support should be doable. > > Well yes and no. Device obey the same rule as CPU so if a file back page is > map read only in the process it must first do a write fault which will call > in the fs (page_mkwrite() of vm_ops). But once a page has write permission > there is no way to be notify by hardware on every write. First the hardware > do not have the capability. Second we are talking thousand (10 000 is upper > range in today device) of concurrent thread, each can possibly write to page > under consideration. Sure, I meant whether the device is able to do equivalent of ->page_mkwrite notification which apparently it is. OK. > We really want the device page to behave just like regular page. Most fs code > path never map file content, it only happens during read/write and i believe > this can be handled either by migrating back or by using bounce page. I want > to provide the choice between the two solutions as one will be better for some > workload and the other for different workload. 
I agree with keeping page used by the device behaving as similar as possible as any other page. I'm just exploring different possibilities how to make that happen. E.g. the scheme I was aiming at is: When you want page A to be used by the device, you set up page A' in the device but make sure any access to it will fault. When the device wants to access A', it notifies the CPU, that writeprotects all mappings of A, copy A to A' and map A' read-only for the device. When the device wants to write to A', it notifies CPU, that will clear all mappings of A and mark A as not-uptodate & dirty. When the CPU will then want to access the data in A again - we need to catch ->readpage, ->readpages, ->writepage, ->writepages - it will writeprotect A' in the device, copy data to A, mark A as uptodate & dirty, and off we go. When we want to write to the page on CPU - we get either wp fault if it was via mmap, or we have to catch that in places using kmap() - we just remove access to A' from the device. This scheme makes the device mapping functionality transparent to the filesystem (you actually don't need to hook directly into ->readpage etc. handlers, you can just have wrappers around them for this functionality) and fairly straightforward... It is so transparent that even direct IO works with this since the page cache invalidation pass we do before actually doing the direct IO will make sure to pull all the pages from the device and write them to disk if needed. What do you think? Honza -- Jan Kara <jack@suse.com> SUSE Labs, CR -- To unsubscribe, send a message with 'unsubscribe linux-mm' in the body to majordomo@kvack.org. For more info on Linux MM, see: http://www.linux-mm.org/ . Don't email: <a href=mailto:"dont@kvack.org"> email@kvack.org </a> ^ permalink raw reply [flat|nested] 31+ messages in thread
* Re: [Lsf-pc] [LSF/MM TOPIC] Un-addressable device memory and block/fs implications 2016-12-15 16:19 ` Jan Kara @ 2016-12-15 19:14 ` Jerome Glisse 2016-12-16 8:14 ` Jan Kara 2016-12-16 3:10 ` Aneesh Kumar K.V 1 sibling, 1 reply; 31+ messages in thread From: Jerome Glisse @ 2016-12-15 19:14 UTC (permalink / raw) To: Jan Kara; +Cc: Dave Chinner, linux-block, linux-mm, lsf-pc, linux-fsdevel On Thu, Dec 15, 2016 at 05:19:39PM +0100, Jan Kara wrote: > On Wed 14-12-16 12:15:14, Jerome Glisse wrote: > <snipped explanation that the device has the same cabilities as CPUs wrt > page handling> > > > > So won't it be easier to leave the pagecache page where it is and *copy* it > > > to the device? Can the device notify us *before* it is going to modify a > > > page, not just after it has modified it? Possibly if we just give it the > > > page read-only and it will have to ask CPU to get write permission? If yes, > > > then I belive this could work and even fs support should be doable. > > > > Well yes and no. Device obey the same rule as CPU so if a file back page is > > map read only in the process it must first do a write fault which will call > > in the fs (page_mkwrite() of vm_ops). But once a page has write permission > > there is no way to be notify by hardware on every write. First the hardware > > do not have the capability. Second we are talking thousand (10 000 is upper > > range in today device) of concurrent thread, each can possibly write to page > > under consideration. > > Sure, I meant whether the device is able to do equivalent of ->page_mkwrite > notification which apparently it is. OK. > > > We really want the device page to behave just like regular page. Most fs code > > path never map file content, it only happens during read/write and i believe > > this can be handled either by migrating back or by using bounce page. 
I want > > to provide the choice between the two solutions as one will be better for some > > workload and the other for different workload. > > I agree with keeping page used by the device behaving as similar as > possible as any other page. I'm just exploring different possibilities how > to make that happen. E.g. the scheme I was aiming at is: > > When you want page A to be used by the device, you set up page A' in the > device but make sure any access to it will fault. > > When the device wants to access A', it notifies the CPU, that writeprotects > all mappings of A, copy A to A' and map A' read-only for the device. > > When the device wants to write to A', it notifies CPU, that will clear all > mappings of A and mark A as not-uptodate & dirty. When the CPU will then > want to access the data in A again - we need to catch ->readpage, > ->readpages, ->writepage, ->writepages - it will writeprotect A' in > the device, copy data to A, mark A as uptodate & dirty, and off we go. > > When we want to write to the page on CPU - we get either wp fault if it was > via mmap, or we have to catch that in places using kmap() - we just remove > access to A' from the device. > > This scheme makes the device mapping functionality transparent to the > filesystem (you actually don't need to hook directly into ->readpage etc. > handlers, you can just have wrappers around them for this functionality) > and fairly straightforward... It is so transparent that even direct IO works > with this since the page cache invalidation pass we do before actually doing > the direct IO will make sure to pull all the pages from the device and write > them to disk if needed. What do you think? This is do-able but i think it will require the same amount of changes than what i had in mind (excluding the block bounce code) with one drawback. Doing it that way we can not free page A. 
On some workloads this probably does not hurt much, but on a workload where you read a big dataset from disk and then use it only on the GPU for a long period of time (minutes/hours) you will waste GBs of system memory. Right now i am working on some other patchset; i intend to take a stab at this in the January/February time frame, before the summit, so i can post an RFC and have a clear picture of every code path that needs modifications. I expect this would provide a better frame for discussion. I assume i will have to change ->readpage, ->readpages, ->writepage, ->writepages but i think that the only places i really need to change are do_generic_file_read() and generic_perform_write() (or iov_iter_copy_*). Of course this only applies to fs that use those generic helpers. I also probably will change ->mmap, or rather the helper it uses to set the pte, depending on what looks better. Note that i don't think wrapping is an easy task. I would need to replace page A's mapping (struct page.mapping) to point to a wrapping address_space, but there are enough places in the kernel that directly dereference that and expect to hit the right (real) address_space. I would need to replace all dereferences of page->mapping with a helper function and possibly would need to change some of the call site logic accordingly. This might prove a bigger change than just having to use bounce in do_generic_file_read() and generic_perform_write(). Cheers, Jérôme ^ permalink raw reply [flat|nested] 31+ messages in thread
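The do_generic_file_read() change Jérôme has in mind could plausibly take the following shape (pseudocode; is_device_unaddressable() and the hmm_bounce_* helpers are invented names for illustration — only copy_page_to_iter() is a real kernel function):

```
/* Inside the per-page loop of do_generic_file_read():
 * if the pagecache page lives in un-addressable device memory,
 * copy through a bounce page instead of touching it directly. */
page = find_get_page(mapping, index);
if (is_device_unaddressable(page)) {
        bounce = hmm_bounce_alloc(page);   /* DMA device data in */
        ret = copy_page_to_iter(bounce, offset, bytes, iter);
        hmm_bounce_free(bounce);
} else {
        ret = copy_page_to_iter(page, offset, bytes, iter);
}
```

generic_perform_write() would mirror this on the write side, which is why those two helpers (plus the iov_iter copy routines) are the main touch points for filesystems using the generic paths.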
* Re: [Lsf-pc] [LSF/MM TOPIC] Un-addressable device memory and block/fs implications 2016-12-15 19:14 ` Jerome Glisse @ 2016-12-16 8:14 ` Jan Kara 0 siblings, 0 replies; 31+ messages in thread From: Jan Kara @ 2016-12-16 8:14 UTC (permalink / raw) To: Jerome Glisse Cc: Jan Kara, Dave Chinner, linux-block, linux-mm, lsf-pc, linux-fsdevel On Thu 15-12-16 14:14:53, Jerome Glisse wrote: > On Thu, Dec 15, 2016 at 05:19:39PM +0100, Jan Kara wrote: > > On Wed 14-12-16 12:15:14, Jerome Glisse wrote: > > <snipped explanation that the device has the same cabilities as CPUs wrt > > page handling> > > > > > > So won't it be easier to leave the pagecache page where it is and *copy* it > > > > to the device? Can the device notify us *before* it is going to modify a > > > > page, not just after it has modified it? Possibly if we just give it the > > > > page read-only and it will have to ask CPU to get write permission? If yes, > > > > then I belive this could work and even fs support should be doable. > > > > > > Well yes and no. Device obey the same rule as CPU so if a file back page is > > > map read only in the process it must first do a write fault which will call > > > in the fs (page_mkwrite() of vm_ops). But once a page has write permission > > > there is no way to be notify by hardware on every write. First the hardware > > > do not have the capability. Second we are talking thousand (10 000 is upper > > > range in today device) of concurrent thread, each can possibly write to page > > > under consideration. > > > > Sure, I meant whether the device is able to do equivalent of ->page_mkwrite > > notification which apparently it is. OK. > > > > > We really want the device page to behave just like regular page. Most fs code > > > path never map file content, it only happens during read/write and i believe > > > this can be handled either by migrating back or by using bounce page. 
I want > > > to provide the choice between the two solutions as one will be better for some > > > workload and the other for different workload. > > > > I agree with keeping page used by the device behaving as similar as > > possible as any other page. I'm just exploring different possibilities how > > to make that happen. E.g. the scheme I was aiming at is: > > > > When you want page A to be used by the device, you set up page A' in the > > device but make sure any access to it will fault. > > > > When the device wants to access A', it notifies the CPU, that writeprotects > > all mappings of A, copy A to A' and map A' read-only for the device. > > > > When the device wants to write to A', it notifies CPU, that will clear all > > mappings of A and mark A as not-uptodate & dirty. When the CPU will then > > want to access the data in A again - we need to catch ->readpage, > > ->readpages, ->writepage, ->writepages - it will writeprotect A' in > > the device, copy data to A, mark A as uptodate & dirty, and off we go. > > > > When we want to write to the page on CPU - we get either wp fault if it was > > via mmap, or we have to catch that in places using kmap() - we just remove > > access to A' from the device. > > > > This scheme makes the device mapping functionality transparent to the > > filesystem (you actually don't need to hook directly into ->readpage etc. > > handlers, you can just have wrappers around them for this functionality) > > and fairly straightforward... It is so transparent that even direct IO works > > with this since the page cache invalidation pass we do before actually doing > > the direct IO will make sure to pull all the pages from the device and write > > them to disk if needed. What do you think? > > This is do-able but i think it will require the same amount of changes than > what i had in mind (excluding the block bounce code) with one drawback. Doing > it that way we can not free page A. 
I guess I'd have to see code implementing your approach to be able to judge what ends up being less code - the devil is in the details here I believe. Actually, when thinking about it with a fresh mind, I don't think we'd have to catch kmap() at all with my approach - all writes could be caught either in grab_cache_page_write_begin() or in page_mkwrite(). What I like about my solution is that it is completely fs agnostic and the places that need handling of device pages have very relaxed locking constraints - grabbing locks necessary to update mappings / communicate with the device should be a no-brainer in those contexts.

> On some workload this probably does not hurt much but on workload where you
> read a big dataset from disk and then use it only on the GPU for long period
> of time (minutes/hours) you will waste GB of system memory.

I was thinking about this as well. So you could just leave the page A to be undergoing normal page aging and reclaim. However what you need is to somehow maintain the information that index I in file F is mapped to the device's page A' so that ->readpage() and friends know they should pull the page from the device and not from disk. Traditionally we do this by exceptional entries in the radix tree - i.e., when we reclaim A, we do not insert a shadow exceptional entry into the radix tree recording when the page was evicted, but instead insert an exceptional entry telling us this page is stored in the device.

> Right now i am working on some other patchset, i intend to take a stab at this
> in January/February time frame, before summit so i can post an RFC and have a
> clear picture of every code path that needs modifications. I expect this would
> provide better frame for discussion.

Yeah, that sounds good.

> I assume i will have to change ->readpage ->readpages ->writepage ->writepages
> but i think that the only place i really need to change are do_generic_file_read()
> and generic_perform_write() (or iov_iter_copy_*). Of course this only apply to
> fs that use those generic helpers.

Not really. There is other stuff that can be pulling pagecache pages in memory - e.g. think of readahead, or page faults, or page fault around logic, or splice, or ...

> I also probably will change ->mmap or rather the helper it uses to set the pte
> depending on what looks better.
>
> Note that i don't think wrapping is an easy task. I would need to replace page
> A mapping (struct page.mapping) to point to a wrapping address_space but there
> is enough place in the kernel that directly dereference that and expect to hit
> the right (real) address_space. I would need to replace all dereference of
> page->mapping to an helper function and possibly would need to change some of
> the call site logic accordingly. This might prove a bigger change than just
> having to use bounce in do_generic_file_read() and generic_perform_write().

So what I meant by wrapping is that you'd wrap places that call ->readpage, ->readpages, ->writepage, ->writepages with a helper function that will do what you need.

								Honza
-- 
Jan Kara <jack@suse.com>
SUSE Labs, CR

^ permalink raw reply [flat|nested] 31+ messages in thread
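The A/A' scheme discussed in the messages above amounts to a small per-page state machine. The transitions can be modelled in user-space C; this is a rough illustrative sketch only - the names (mirror_page, device_read_fault, etc.) are invented, and real kernel code would manipulate page flags, page tables and rmap rather than plain ints.

```c
#include <string.h>

enum dev_state { DEV_NONE, DEV_RO, DEV_RW };

struct mirror_page {
	enum dev_state dev;    /* state of the device copy A'             */
	int cpu_writeprot;     /* CPU mappings of A are write-protected   */
	int uptodate, dirty;   /* page-cache style flags on A             */
	char a[8], a_prime[8]; /* toy "data" for pages A and A'           */
};

/* Device wants to read: write-protect A, copy A -> A', map A' read-only. */
static void device_read_fault(struct mirror_page *p)
{
	p->cpu_writeprot = 1;
	memcpy(p->a_prime, p->a, sizeof(p->a));
	p->dev = DEV_RO;
}

/* Device wants to write: drop CPU mappings, A becomes !uptodate & dirty. */
static void device_write_fault(struct mirror_page *p)
{
	p->uptodate = 0;
	p->dirty = 1;          /* the latest data now lives only in A' */
	p->dev = DEV_RW;
}

/* CPU access (->readpage & friends): write-protect A', copy A' -> A,
 * mark A uptodate & dirty, and off we go. */
static void cpu_access(struct mirror_page *p)
{
	if (p->dev == DEV_RW) {
		p->dev = DEV_RO;                     /* writeprotect A'   */
		memcpy(p->a, p->a_prime, sizeof(p->a));
		p->uptodate = 1;
		p->dirty = 1;                        /* needs writeback   */
	}
	p->cpu_writeprot = 0;
}
```

The attraction of modelling it this way is that every transition is driven by a fault, so the filesystem never sees anything other than an ordinary uptodate/dirty page A.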
* Re: [Lsf-pc] [LSF/MM TOPIC] Un-addressable device memory and block/fs implications 2016-12-15 16:19 ` Jan Kara 2016-12-15 19:14 ` Jerome Glisse @ 2016-12-16 3:10 ` Aneesh Kumar K.V 2016-12-19 8:46 ` Jan Kara 1 sibling, 1 reply; 31+ messages in thread From: Aneesh Kumar K.V @ 2016-12-16 3:10 UTC (permalink / raw) To: Jan Kara, Jerome Glisse Cc: Jan Kara, Dave Chinner, linux-block, linux-mm, lsf-pc, linux-fsdevel Jan Kara <jack@suse.cz> writes: > On Wed 14-12-16 12:15:14, Jerome Glisse wrote: > <snipped explanation that the device has the same cabilities as CPUs wrt > page handling> > >> > So won't it be easier to leave the pagecache page where it is and *copy* it >> > to the device? Can the device notify us *before* it is going to modify a >> > page, not just after it has modified it? Possibly if we just give it the >> > page read-only and it will have to ask CPU to get write permission? If yes, >> > then I belive this could work and even fs support should be doable. >> >> Well yes and no. Device obey the same rule as CPU so if a file back page is >> map read only in the process it must first do a write fault which will call >> in the fs (page_mkwrite() of vm_ops). But once a page has write permission >> there is no way to be notify by hardware on every write. First the hardware >> do not have the capability. Second we are talking thousand (10 000 is upper >> range in today device) of concurrent thread, each can possibly write to page >> under consideration. > > Sure, I meant whether the device is able to do equivalent of ->page_mkwrite > notification which apparently it is. OK. > >> We really want the device page to behave just like regular page. Most fs code >> path never map file content, it only happens during read/write and i believe >> this can be handled either by migrating back or by using bounce page. I want >> to provide the choice between the two solutions as one will be better for some >> workload and the other for different workload. 
> > I agree with keeping page used by the device behaving as similar as > possible as any other page. I'm just exploring different possibilities how > to make that happen. E.g. the scheme I was aiming at is: > > When you want page A to be used by the device, you set up page A' in the > device but make sure any access to it will fault. > > When the device wants to access A', it notifies the CPU, that writeprotects > all mappings of A, copy A to A' and map A' read-only for the device. A and A' will have different pfns here and hence different struct page. So what will be there in the address_space->page_tree ? If we place A' in the page cache, then we are essentially bringing lot of locking complexity Dave talked about in previous mails. > > When the device wants to write to A', it notifies CPU, that will clear all > mappings of A and mark A as not-uptodate & dirty. When the CPU will then > want to access the data in A again - we need to catch ->readpage, > ->readpages, ->writepage, ->writepages - it will writeprotect A' in > the device, copy data to A, mark A as uptodate & dirty, and off we go. > > When we want to write to the page on CPU - we get either wp fault if it was > via mmap, or we have to catch that in places using kmap() - we just remove > access to A' from the device. > > This scheme makes the device mapping functionality transparent to the > filesystem (you actually don't need to hook directly into ->readpage etc. > handlers, you can just have wrappers around them for this functionality) > and fairly straightforward... It is so transparent that even direct IO works > with this since the page cache invalidation pass we do before actually doing > the direct IO will make sure to pull all the pages from the device and write > them to disk if needed. What do you think? > -aneesh -- To unsubscribe, send a message with 'unsubscribe linux-mm' in the body to majordomo@kvack.org. For more info on Linux MM, see: http://www.linux-mm.org/ . 
^ permalink raw reply [flat|nested] 31+ messages in thread
* Re: [Lsf-pc] [LSF/MM TOPIC] Un-addressable device memory and block/fs implications 2016-12-16 3:10 ` Aneesh Kumar K.V @ 2016-12-19 8:46 ` Jan Kara 0 siblings, 0 replies; 31+ messages in thread From: Jan Kara @ 2016-12-19 8:46 UTC (permalink / raw) To: Aneesh Kumar K.V Cc: Jan Kara, Jerome Glisse, Dave Chinner, linux-block, linux-mm, lsf-pc, linux-fsdevel On Fri 16-12-16 08:40:38, Aneesh Kumar K.V wrote: > Jan Kara <jack@suse.cz> writes: > > > On Wed 14-12-16 12:15:14, Jerome Glisse wrote: > > <snipped explanation that the device has the same cabilities as CPUs wrt > > page handling> > > > >> > So won't it be easier to leave the pagecache page where it is and *copy* it > >> > to the device? Can the device notify us *before* it is going to modify a > >> > page, not just after it has modified it? Possibly if we just give it the > >> > page read-only and it will have to ask CPU to get write permission? If yes, > >> > then I belive this could work and even fs support should be doable. > >> > >> Well yes and no. Device obey the same rule as CPU so if a file back page is > >> map read only in the process it must first do a write fault which will call > >> in the fs (page_mkwrite() of vm_ops). But once a page has write permission > >> there is no way to be notify by hardware on every write. First the hardware > >> do not have the capability. Second we are talking thousand (10 000 is upper > >> range in today device) of concurrent thread, each can possibly write to page > >> under consideration. > > > > Sure, I meant whether the device is able to do equivalent of ->page_mkwrite > > notification which apparently it is. OK. > > > >> We really want the device page to behave just like regular page. Most fs code > >> path never map file content, it only happens during read/write and i believe > >> this can be handled either by migrating back or by using bounce page. 
I want > >> to provide the choice between the two solutions as one will be better for some > >> workload and the other for different workload. > > > > I agree with keeping page used by the device behaving as similar as > > possible as any other page. I'm just exploring different possibilities how > > to make that happen. E.g. the scheme I was aiming at is: > > > > When you want page A to be used by the device, you set up page A' in the > > device but make sure any access to it will fault. > > > > When the device wants to access A', it notifies the CPU, that writeprotects > > all mappings of A, copy A to A' and map A' read-only for the device. > > > A and A' will have different pfns here and hence different struct page. Yes. In fact I don't think there's need to have struct page for A' in my scheme. At least for the purposes of page cache tracking... Maybe there's good reason to have it from a device driver POV. > So what will be there in the address_space->page_tree ? If we place > A' in the page cache, then we are essentially bringing lot of locking > complexity Dave talked about in previous mails. No, I meant page A will stay in the page_tree. There's no need for migration in my scheme. Honza -- Jan Kara <jack@suse.com> SUSE Labs, CR -- To unsubscribe, send a message with 'unsubscribe linux-mm' in the body to majordomo@kvack.org. For more info on Linux MM, see: http://www.linux-mm.org/ . Don't email: <a href=mailto:"dont@kvack.org"> email@kvack.org </a> ^ permalink raw reply [flat|nested] 31+ messages in thread
* Re: [Lsf-pc] [LSF/MM TOPIC] Un-addressable device memory and block/fs implications 2016-12-14 11:13 ` [Lsf-pc] " Jan Kara 2016-12-14 17:15 ` Jerome Glisse @ 2016-12-19 17:00 ` Aneesh Kumar K.V 1 sibling, 0 replies; 31+ messages in thread From: Aneesh Kumar K.V @ 2016-12-19 17:00 UTC (permalink / raw) To: Jan Kara, Jerome Glisse Cc: Dave Chinner, linux-block, linux-mm, lsf-pc, linux-fsdevel Jan Kara <jack@suse.cz> writes: > On Tue 13-12-16 16:24:33, Jerome Glisse wrote: >> On Wed, Dec 14, 2016 at 08:10:41AM +1100, Dave Chinner wrote: >> > On Tue, Dec 13, 2016 at 03:31:13PM -0500, Jerome Glisse wrote: >> > > On Wed, Dec 14, 2016 at 07:15:15AM +1100, Dave Chinner wrote: >> > > > On Tue, Dec 13, 2016 at 01:15:11PM -0500, Jerome Glisse wrote: >> > > > > I would like to discuss un-addressable device memory in the context of >> > > > > filesystem and block device. Specificaly how to handle write-back, read, >> > > > > ... when a filesystem page is migrated to device memory that CPU can not >> > > > > access. >> > > > >> > > > You mean pmem that is DAX-capable that suddenly, without warning, >> > > > becomes non-DAX capable? >> > > > >> > > > If you are not talking about pmem and DAX, then exactly what does >> > > > "when a filesystem page is migrated to device memory that CPU can >> > > > not access" mean? What "filesystem page" are we talking about that >> > > > can get migrated from main RAM to something the CPU can't access? >> > > >> > > I am talking about GPU, FPGA, ... any PCIE device that have fast on >> > > board memory that can not be expose transparently to the CPU. I am >> > > reusing ZONE_DEVICE for this, you can see HMM patchset on linux-mm >> > > https://lwn.net/Articles/706856/ >> > >> > So ZONE_DEVICE memory that is a DMA target but not CPU addressable? >> >> Well not only target, it can be source too. But the device can read >> and write any system memory and dma to/from that memory to its on >> board memory. 
>> >> > >> > > So in my case i am only considering non DAX/PMEM filesystem ie any >> > > "regular" filesystem back by a "regular" block device. I want to be >> > > able to migrate mmaped area of such filesystem to device memory while >> > > the device is actively using that memory. >> > >> > "migrate mmapped area of such filesystem" means what, exactly? >> >> fd = open("/path/to/some/file") >> ptr = mmap(fd, ...); >> gpu_compute_something(ptr); >> >> > >> > Are you talking about file data contents that have been copied into >> > the page cache and mmapped into a user process address space? >> > IOWs, migrating ZONE_NORMAL page cache page content and state >> > to a new ZONE_DEVICE page, and then migrating back again somehow? >> >> Take any existing application that mmap a file and allow to migrate >> chunk of that mmaped file to device memory without the application >> even knowing about it. So nothing special in respect to that mmaped >> file. It is a regular file on your filesystem. > > OK, so I share most of Dave's concerns about this. But let's talk about > what we can do and what you need and we may find something usable. First > let me understand what is doable / what are the costs on your side. > > So we have a page cache page that you'd like to migrate to the device. > Fine. You are willing to sacrifice direct IO - even better. We can fall > back to buffered IO in that case (well, except for XFS which does not do it > but that's a minor detail). One thing I'm not sure about: When a page is > migrated to the device, is its contents available and is just possibly stale > or will something bad happen if we try to access (or even modify) page data? For Coherent Device Memory case, the CPU can continue to access these device pages. > > And by migration you really mean page migration? Be aware that migration of > pagecache pages may be a problem for some pages of some filesystems on its > own - e. g. 
page migration may fail because there is a filesystem transaction
> outstanding modifying that page. For userspace these will be really hard
> to understand sporadic errors because it's really filesystem internal
> thing. So far page migration was widely used only for free space
> defragmentation and for that purpose if page is not migratable for a minute
> who cares.

On the device driver side, I guess we should be able to handle page migration failures and retry. For the reverse, I guess we need the guarantee that a CPU access can always migrate back these pages without failures? Are there failure conditions we need to handle when migrating pages back to system memory?

> So won't it be easier to leave the pagecache page where it is and *copy* it
> to the device? Can the device notify us *before* it is going to modify a
> page, not just after it has modified it? Possibly if we just give it the
> page read-only and it will have to ask CPU to get write permission? If yes,
> then I belive this could work and even fs support should be doable.

For the coherent device memory scenario, we can live with one copy and both CPU/device can access these pages. In the CDM case the decision to migrate is driven by the frequency of access from the device.

-aneesh

^ permalink raw reply [flat|nested] 31+ messages in thread
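The asymmetry discussed above - migration *to* the device may fail transiently and can simply be retried or abandoned, while migration *back* on CPU access must not fail - suggests the driver side is just a bounded retry with a fallback. A trivial illustrative sketch (invented names, not kernel API):

```c
#include <stdbool.h>

typedef bool (*toy_try_migrate_fn)(void *page);

/* Opportunistic migrate-to-device.  Transient failures (e.g. an
 * outstanding filesystem transaction still referencing the page) are
 * retried a few times; if migration keeps failing the device just
 * falls back to accessing the page in system memory. */
static bool toy_migrate_with_retry(void *page, toy_try_migrate_fn try_migrate,
				   int max_tries)
{
	for (int i = 0; i < max_tries; i++) {
		if (try_migrate(page))
			return true;   /* page now resident on the device */
	}
	return false;                  /* give up: keep using system memory */
}
```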
* Re: [LSF/MM TOPIC] Un-addressable device memory and block/fs implications 2016-12-13 18:15 [LSF/MM TOPIC] Un-addressable device memory and block/fs implications Jerome Glisse 2016-12-13 18:20 ` James Bottomley 2016-12-13 20:15 ` Dave Chinner @ 2016-12-14 3:55 ` Balbir Singh 2016-12-16 3:14 ` [LSF/MM ATTEND] " Aneesh Kumar K.V 3 siblings, 0 replies; 31+ messages in thread From: Balbir Singh @ 2016-12-14 3:55 UTC (permalink / raw) To: Jerome Glisse; +Cc: lsf-pc, linux-mm, linux-block, linux-fsdevel

On Wed, Dec 14, 2016 at 5:15 AM, Jerome Glisse <jglisse@redhat.com> wrote:
> I would like to discuss un-addressable device memory in the context of
> filesystem and block device. Specificaly how to handle write-back, read,
> ... when a filesystem page is migrated to device memory that CPU can not
> access.
>
> I intend to post a patchset leveraging the same idea as the existing
> block bounce helper (block/bounce.c) to handle this. I believe this is
> worth discussing during summit see how people feels about such plan and
> if they have better ideas.

Yes, that would be interesting. I presume all of this is for ZONE_DEVICE and HMM. I think designing such an interface requires careful thought on tracking pages to ensure we don't lose writes and also the impact on things like the writeback subsystem.

From an HMM perspective and an overall MM perspective, I worry that our accounting system is broken with the proposed mirroring and unaddressable memory, and that needs to be addressed as well.

It would also be nice to have a discussion on the migration patches currently on the list:
1. THP migration
2. HMM migration
3. Async migration

> I also like to join discussions on:
> - Peer-to-Peer DMAs between PCIe devices
> - CDM coherent device memory

Yes, this needs discussion. Specifically, whether all of CDM memory is NORMAL or not, and the special requirements we have today for CDM.

> - PMEM
> - overall mm discussions

Balbir Singh.
^ permalink raw reply [flat|nested] 31+ messages in thread
* [LSF/MM ATTEND] Un-addressable device memory and block/fs implications 2016-12-13 18:15 [LSF/MM TOPIC] Un-addressable device memory and block/fs implications Jerome Glisse ` (2 preceding siblings ...) 2016-12-14 3:55 ` Balbir Singh @ 2016-12-16 3:14 ` Aneesh Kumar K.V 2017-01-16 12:04 ` Anshuman Khandual 2017-01-18 11:00 ` [Lsf-pc] " Jan Kara 3 siblings, 2 replies; 31+ messages in thread From: Aneesh Kumar K.V @ 2016-12-16 3:14 UTC (permalink / raw) To: Jerome Glisse, lsf-pc, linux-mm, linux-block, linux-fsdevel Jerome Glisse <jglisse@redhat.com> writes: > I would like to discuss un-addressable device memory in the context of > filesystem and block device. Specificaly how to handle write-back, read, > ... when a filesystem page is migrated to device memory that CPU can not > access. > > I intend to post a patchset leveraging the same idea as the existing > block bounce helper (block/bounce.c) to handle this. I believe this is > worth discussing during summit see how people feels about such plan and > if they have better ideas. > > > I also like to join discussions on: > - Peer-to-Peer DMAs between PCIe devices > - CDM coherent device memory > - PMEM > - overall mm discussions I would like to attend this discussion. I can talk about coherent device memory and how having HMM handle that will make it easy to have one interface for device driver. For Coherent device case we definitely need page cache migration support. -aneesh -- To unsubscribe, send a message with 'unsubscribe linux-mm' in the body to majordomo@kvack.org. For more info on Linux MM, see: http://www.linux-mm.org/ . Don't email: <a href=mailto:"dont@kvack.org"> email@kvack.org </a> ^ permalink raw reply [flat|nested] 31+ messages in thread
* Re: [LSF/MM ATTEND] Un-addressable device memory and block/fs implications 2016-12-16 3:14 ` [LSF/MM ATTEND] " Aneesh Kumar K.V @ 2017-01-16 12:04 ` Anshuman Khandual 2017-01-16 23:15 ` John Hubbard 2017-01-18 11:00 ` [Lsf-pc] " Jan Kara 1 sibling, 1 reply; 31+ messages in thread From: Anshuman Khandual @ 2017-01-16 12:04 UTC (permalink / raw) To: Aneesh Kumar K.V, Jerome Glisse, lsf-pc, linux-mm, linux-block, linux-fsdevel On 12/16/2016 08:44 AM, Aneesh Kumar K.V wrote: > Jerome Glisse <jglisse@redhat.com> writes: > >> I would like to discuss un-addressable device memory in the context of >> filesystem and block device. Specificaly how to handle write-back, read, >> ... when a filesystem page is migrated to device memory that CPU can not >> access. >> >> I intend to post a patchset leveraging the same idea as the existing >> block bounce helper (block/bounce.c) to handle this. I believe this is >> worth discussing during summit see how people feels about such plan and >> if they have better ideas. >> >> >> I also like to join discussions on: >> - Peer-to-Peer DMAs between PCIe devices >> - CDM coherent device memory >> - PMEM >> - overall mm discussions > I would like to attend this discussion. I can talk about coherent device > memory and how having HMM handle that will make it easy to have one > interface for device driver. For Coherent device case we definitely need > page cache migration support. I have been in the discussion on the mailing list about HMM since V13 which got posted back in October. Touched upon many points including how it changes ZONE_DEVICE to accommodate un-addressable device memory, migration capability of currently supported ZONE_DEVICE based persistent memory etc. Looked at the HMM more closely from the perspective whether it can also accommodate coherent device memory which has been already discussed by others on this thread. I too would like to attend to discuss more on this topic. 
^ permalink raw reply [flat|nested] 31+ messages in thread
* Re: [LSF/MM ATTEND] Un-addressable device memory and block/fs implications 2017-01-16 12:04 ` Anshuman Khandual @ 2017-01-16 23:15 ` John Hubbard 0 siblings, 0 replies; 31+ messages in thread From: John Hubbard @ 2017-01-16 23:15 UTC (permalink / raw) To: Anshuman Khandual, Aneesh Kumar K.V, Jerome Glisse, lsf-pc, linux-mm, linux-block, linux-fsdevel

On 01/16/2017 04:04 AM, Anshuman Khandual wrote:
> On 12/16/2016 08:44 AM, Aneesh Kumar K.V wrote:
>> Jerome Glisse <jglisse@redhat.com> writes:
>>> I would like to discuss un-addressable device memory in the context of
>>> filesystem and block device. Specificaly how to handle write-back, read,
>>> ... when a filesystem page is migrated to device memory that CPU can not
>>> access.
>>>
>>> I intend to post a patchset leveraging the same idea as the existing
>>> block bounce helper (block/bounce.c) to handle this. I believe this is
>>> worth discussing during summit see how people feels about such plan and
>>> if they have better ideas.
>>>
>>> I also like to join discussions on:
>>> - Peer-to-Peer DMAs between PCIe devices

Yes! This is looming large, because we keep insisting on building new computers with a *lot* of GPUs in them, and then connecting them up with NICs as well, and oddly enough, people keep trying to do peer-to-peer between GPUs, and from GPUs to NICs, etc. :)

It feels like the linux-rdma and linux-pci discussions in the past sort of stalled, due to not being certain of the long-term direction of the design. So it's worth coming up with that.

>>> - CDM coherent device memory
>>> - PMEM
>>> - overall mm discussions
>> I would like to attend this discussion. I can talk about coherent device
>> memory and how having HMM handle that will make it easy to have one
>> interface for device driver. For Coherent device case we definitely need
>> page cache migration support.
>
> I have been in the discussion on the mailing list about HMM since V13 which
> got posted back in October.
Touched upon many points including how it changes > ZONE_DEVICE to accommodate un-addressable device memory, migration capability > of currently supported ZONE_DEVICE based persistent memory etc. Looked at the > HMM more closely from the perspective whether it can also accommodate coherent > device memory which has been already discussed by others on this thread. I too > would like to attend to discuss more on this topic. Also, on the huge page points (mentioned early in this short thread): some of our GPUs could, at times, match the CPU's large/huge page sizes. It is a delicate thing to achieve, but moving around, say, 2 MB pages between CPU and GPU would be, for some workloads, really fast. I should be able to present performance numbers for HMM on Pascal GPUs, so if anyone would like that, let me know in advance of any particular workloads or configurations that seem most interesting, and I'll gather that. Also would like to attend this one. thanks John Hubbard NVIDIA > > -- > To unsubscribe, send a message with 'unsubscribe linux-mm' in > the body to majordomo@kvack.org. For more info on Linux MM, > see: http://www.linux-mm.org/ . > Don't email: <a href=mailto:"dont@kvack.org"> email@kvack.org </a> > -- To unsubscribe, send a message with 'unsubscribe linux-mm' in the body to majordomo@kvack.org. For more info on Linux MM, see: http://www.linux-mm.org/ . Don't email: <a href=mailto:"dont@kvack.org"> email@kvack.org </a> ^ permalink raw reply [flat|nested] 31+ messages in thread
* Re: [Lsf-pc] [LSF/MM ATTEND] Un-addressable device memory and block/fs implications 2016-12-16 3:14 ` [LSF/MM ATTEND] " Aneesh Kumar K.V 2017-01-16 12:04 ` Anshuman Khandual @ 2017-01-18 11:00 ` Jan Kara 1 sibling, 0 replies; 31+ messages in thread From: Jan Kara @ 2017-01-18 11:00 UTC (permalink / raw) To: Aneesh Kumar K.V Cc: Jerome Glisse, lsf-pc, linux-mm, linux-block, linux-fsdevel On Fri 16-12-16 08:44:11, Aneesh Kumar K.V wrote: > Jerome Glisse <jglisse@redhat.com> writes: > > > I would like to discuss un-addressable device memory in the context of > > filesystem and block device. Specificaly how to handle write-back, read, > > ... when a filesystem page is migrated to device memory that CPU can not > > access. > > > > I intend to post a patchset leveraging the same idea as the existing > > block bounce helper (block/bounce.c) to handle this. I believe this is > > worth discussing during summit see how people feels about such plan and > > if they have better ideas. > > > > > > I also like to join discussions on: > > - Peer-to-Peer DMAs between PCIe devices > > - CDM coherent device memory > > - PMEM > > - overall mm discussions > > I would like to attend this discussion. I can talk about coherent device > memory and how having HMM handle that will make it easy to have one > interface for device driver. For Coherent device case we definitely need > page cache migration support. Aneesh, did you intend this as your request to attend? You posted it as a reply to another email so it is not really clear. Note that each attend request should be a separate email so that it does not get lost... Honza -- Jan Kara <jack@suse.com> SUSE Labs, CR -- To unsubscribe, send a message with 'unsubscribe linux-mm' in the body to majordomo@kvack.org. For more info on Linux MM, see: http://www.linux-mm.org/ . Don't email: <a href=mailto:"dont@kvack.org"> email@kvack.org </a> ^ permalink raw reply [flat|nested] 31+ messages in thread
end of thread, other threads:[~2017-01-18 11:00 UTC | newest]

Thread overview: 31+ messages:
2016-12-13 18:15 [LSF/MM TOPIC] Un-addressable device memory and block/fs implications Jerome Glisse
2016-12-13 18:20 ` James Bottomley
2016-12-13 18:55 ` Jerome Glisse
2016-12-13 20:01 ` James Bottomley
2016-12-13 20:22 ` Jerome Glisse
2016-12-13 20:27 ` Dave Hansen
2016-12-13 20:15 ` Dave Chinner
2016-12-13 20:31 ` Jerome Glisse
2016-12-13 21:10 ` Dave Chinner
2016-12-13 21:24 ` Jerome Glisse
2016-12-13 22:08 ` Dave Hansen
2016-12-13 23:02 ` Jerome Glisse
2016-12-13 22:13 ` Dave Chinner
2016-12-13 22:55 ` Jerome Glisse
2016-12-14 0:14 ` Dave Chinner
2016-12-14 1:07 ` Jerome Glisse
2016-12-14 4:23 ` Dave Chinner
2016-12-14 16:35 ` Jerome Glisse
2016-12-14 11:13 ` [Lsf-pc] " Jan Kara
2016-12-14 17:15 ` Jerome Glisse
2016-12-15 16:19 ` Jan Kara
2016-12-15 19:14 ` Jerome Glisse
2016-12-16 8:14 ` Jan Kara
2016-12-16 3:10 ` Aneesh Kumar K.V
2016-12-19 8:46 ` Jan Kara
2016-12-19 17:00 ` Aneesh Kumar K.V
2016-12-14 3:55 ` Balbir Singh
2016-12-16 3:14 ` [LSF/MM ATTEND] " Aneesh Kumar K.V
2017-01-16 12:04 ` Anshuman Khandual
2017-01-16 23:15 ` John Hubbard
2017-01-18 11:00 ` [Lsf-pc] " Jan Kara