* [PATCH] mm/mincore: allow for making sys_mincore() privileged @ 2019-01-05 17:27 Jiri Kosina 2019-01-05 19:14 ` Vlastimil Babka ` (5 more replies) 0 siblings, 6 replies; 161+ messages in thread From: Jiri Kosina @ 2019-01-05 17:27 UTC (permalink / raw) To: Linus Torvalds, Andrew Morton, Greg KH, Peter Zijlstra, Michal Hocko Cc: linux-mm, linux-kernel, linux-api From: Jiri Kosina <jkosina@suse.cz> There are possibilities [1] how mincore() could be used as a converyor of a sidechannel information about pagecache metadata. Provide vm.mincore_privileged sysctl, which makes it possible to mincore() start returning -EPERM in case it's invoked by a process lacking CAP_SYS_ADMIN. The default behavior stays "mincore() can be used by anybody" in order to be conservative with respect to userspace behavior. [1] https://www.theregister.co.uk/2019/01/05/boffins_beat_page_cache/ Signed-off-by: Jiri Kosina <jkosina@suse.cz> --- Documentation/sysctl/vm.txt | 9 +++++++++ kernel/sysctl.c | 8 ++++++++ mm/mincore.c | 5 +++++ 3 files changed, 22 insertions(+) diff --git a/Documentation/sysctl/vm.txt b/Documentation/sysctl/vm.txt index 187ce4f599a2..afb8635e925e 100644 --- a/Documentation/sysctl/vm.txt +++ b/Documentation/sysctl/vm.txt @@ -41,6 +41,7 @@ Currently, these files are in /proc/sys/vm: - min_free_kbytes - min_slab_ratio - min_unmapped_ratio +- mincore_privileged - mmap_min_addr - mmap_rnd_bits - mmap_rnd_compat_bits @@ -485,6 +486,14 @@ files and similar are considered. The default is 1 percent. ============================================================== +mincore_privileged: + +mincore() could be potentially used to mount a side-channel attack against +pagecache metadata. This sysctl provides system administrators means to +make it available only to processess that own CAP_SYS_ADMIN capability. + +The default is 0, which means mincore() can be used without restrictions. +============================================================== mmap_min_addr diff --git a/kernel/sysctl.c b/kernel/sysctl.c index 1825f712e73b..f03cb07c8dd4 100644 --- a/kernel/sysctl.c +++ b/kernel/sysctl.c @@ -114,6 +114,7 @@ extern unsigned int sysctl_nr_open_min, sysctl_nr_open_max; #ifndef CONFIG_MMU extern int sysctl_nr_trim_pages; #endif +extern int sysctl_mincore_privileged; /* Constants used for minimum and maximum */ #ifdef CONFIG_LOCKUP_DETECTOR @@ -1684,6 +1685,13 @@ static struct ctl_table vm_table[] = { .extra2 = (void *)&mmap_rnd_compat_bits_max, }, #endif + { + .procname = "mincore_privileged", + .data = &sysctl_mincore_privileged, + .maxlen = sizeof(sysctl_mincore_privileged), + .mode = 0644, + .proc_handler = proc_dointvec, + }, { } }; diff --git a/mm/mincore.c b/mm/mincore.c index 218099b5ed31..77d4928cdfaa 100644 --- a/mm/mincore.c +++ b/mm/mincore.c @@ -21,6 +21,8 @@ #include <linux/uaccess.h> #include <asm/pgtable.h> +int sysctl_mincore_privileged; + static int mincore_hugetlb(pte_t *pte, unsigned long hmask, unsigned long addr, unsigned long end, struct mm_walk *walk) { @@ -228,6 +230,9 @@ SYSCALL_DEFINE3(mincore, unsigned long, start, size_t, len, unsigned long pages; unsigned char *tmp; + if (sysctl_mincore_privileged && !capable(CAP_SYS_ADMIN)) + return -EPERM; + /* Check the start address: needs to be page-aligned.. */ if (start & ~PAGE_MASK) return -EINVAL; -- Jiri Kosina SUSE Labs ^ permalink raw reply related [flat|nested] 161+ messages in thread
* Re: [PATCH] mm/mincore: allow for making sys_mincore() privileged 2019-01-05 17:27 [PATCH] mm/mincore: allow for making sys_mincore() privileged Jiri Kosina @ 2019-01-05 19:14 ` Vlastimil Babka 2019-01-05 19:24 ` Jiri Kosina 2019-01-05 19:44 ` kbuild test robot ` (4 subsequent siblings) 5 siblings, 1 reply; 161+ messages in thread From: Vlastimil Babka @ 2019-01-05 19:14 UTC (permalink / raw) To: Jiri Kosina, Linus Torvalds, Andrew Morton, Greg KH, Peter Zijlstra, Michal Hocko Cc: linux-mm, linux-kernel, linux-api On 5.1.2019 18:27, Jiri Kosina wrote: > From: Jiri Kosina <jkosina@suse.cz> > > There are possibilities [1] how mincore() could be used as a converyor of > a sidechannel information about pagecache metadata. > > Provide vm.mincore_privileged sysctl, which makes it possible to mincore() > start returning -EPERM in case it's invoked by a process lacking > CAP_SYS_ADMIN. Haven't checked the details yet, but wouldn't it be safe if anonymous private mincore() kept working, and restrictions were applied only to page cache? > The default behavior stays "mincore() can be used by anybody" in order to > be conservative with respect to userspace behavior. What if we lied instead of returned -EPERM, to not break userspace so obviously? I guess false positive would be the safer lie? > [1] https://www.theregister.co.uk/2019/01/05/boffins_beat_page_cache/ > > Signed-off-by: Jiri Kosina <jkosina@suse.cz> > --- > Documentation/sysctl/vm.txt | 9 +++++++++ > kernel/sysctl.c | 8 ++++++++ > mm/mincore.c | 5 +++++ > 3 files changed, 22 insertions(+) > > diff --git a/Documentation/sysctl/vm.txt b/Documentation/sysctl/vm.txt > index 187ce4f599a2..afb8635e925e 100644 > --- a/Documentation/sysctl/vm.txt > +++ b/Documentation/sysctl/vm.txt > @@ -41,6 +41,7 @@ Currently, these files are in /proc/sys/vm: > - min_free_kbytes > - min_slab_ratio > - min_unmapped_ratio > +- mincore_privileged > - mmap_min_addr > - mmap_rnd_bits > - mmap_rnd_compat_bits > @@ -485,6 +486,14 @@ files and similar are considered. > The default is 1 percent. > > ============================================================== > +mincore_privileged: > + > +mincore() could be potentially used to mount a side-channel attack against > +pagecache metadata. This sysctl provides system administrators means to > +make it available only to processess that own CAP_SYS_ADMIN capability. > + > +The default is 0, which means mincore() can be used without restrictions. > +============================================================== > > mmap_min_addr > > diff --git a/kernel/sysctl.c b/kernel/sysctl.c > index 1825f712e73b..f03cb07c8dd4 100644 > --- a/kernel/sysctl.c > +++ b/kernel/sysctl.c > @@ -114,6 +114,7 @@ extern unsigned int sysctl_nr_open_min, sysctl_nr_open_max; > #ifndef CONFIG_MMU > extern int sysctl_nr_trim_pages; > #endif > +extern int sysctl_mincore_privileged; > > /* Constants used for minimum and maximum */ > #ifdef CONFIG_LOCKUP_DETECTOR > @@ -1684,6 +1685,13 @@ static struct ctl_table vm_table[] = { > .extra2 = (void *)&mmap_rnd_compat_bits_max, > }, > #endif > + { > + .procname = "mincore_privileged", > + .data = &sysctl_mincore_privileged, > + .maxlen = sizeof(sysctl_mincore_privileged), > + .mode = 0644, > + .proc_handler = proc_dointvec, > + }, > { } > }; > > diff --git a/mm/mincore.c b/mm/mincore.c > index 218099b5ed31..77d4928cdfaa 100644 > --- a/mm/mincore.c > +++ b/mm/mincore.c > @@ -21,6 +21,8 @@ > #include <linux/uaccess.h> > #include <asm/pgtable.h> > > +int sysctl_mincore_privileged; > + > static int mincore_hugetlb(pte_t *pte, unsigned long hmask, unsigned long addr, > unsigned long end, struct mm_walk *walk) > { > @@ -228,6 +230,9 @@ SYSCALL_DEFINE3(mincore, unsigned long, start, size_t, len, > unsigned long pages; > unsigned char *tmp; > > + if (sysctl_mincore_privileged && !capable(CAP_SYS_ADMIN)) > + return -EPERM; > + > /* Check the start address: needs to be page-aligned.. */ > if (start & ~PAGE_MASK) > return -EINVAL; > ^ permalink raw reply [flat|nested] 161+ messages in thread
* Re: [PATCH] mm/mincore: allow for making sys_mincore() privileged 2019-01-05 19:14 ` Vlastimil Babka @ 2019-01-05 19:24 ` Jiri Kosina 2019-01-05 19:38 ` Vlastimil Babka 0 siblings, 1 reply; 161+ messages in thread From: Jiri Kosina @ 2019-01-05 19:24 UTC (permalink / raw) To: Vlastimil Babka Cc: Linus Torvalds, Andrew Morton, Greg KH, Peter Zijlstra, Michal Hocko, linux-mm, linux-kernel, linux-api On Sat, 5 Jan 2019, Vlastimil Babka wrote: > > There are possibilities [1] how mincore() could be used as a converyor of > > a sidechannel information about pagecache metadata. > > > > Provide vm.mincore_privileged sysctl, which makes it possible to mincore() > > start returning -EPERM in case it's invoked by a process lacking > > CAP_SYS_ADMIN. > > Haven't checked the details yet, but wouldn't it be safe if anonymous private > mincore() kept working, and restrictions were applied only to page cache? I was considering that, but then I decided not to do so, as that'd make the interface even more confusing and semantics non-obvious in the 'privileged' case. > > The default behavior stays "mincore() can be used by anybody" in order to > > be conservative with respect to userspace behavior. > > What if we lied instead of returned -EPERM, to not break userspace so > obviously? I guess false positive would be the safer lie? So your proposal basically would be if (privileged && !CAP_SYS_ADMIN) if (pagecache) return false; else return do_mincore() right ? I think userspace would hate us for that semantics, but on the other hand I can sort of understand the 'mincore() is racy anyway, so what' argument, if that's what you are suggesting. But then, I have no idea what userspace is using mincore() for. https://codesearch.debian.net/search?q=mincore might provide some insight I guess (thanks Matthew). -- Jiri Kosina SUSE Labs ^ permalink raw reply [flat|nested] 161+ messages in thread
* Re: [PATCH] mm/mincore: allow for making sys_mincore() privileged 2019-01-05 19:24 ` Jiri Kosina @ 2019-01-05 19:38 ` Vlastimil Babka 2019-01-08 9:14 ` Bernd Petrovitsch 0 siblings, 1 reply; 161+ messages in thread From: Vlastimil Babka @ 2019-01-05 19:38 UTC (permalink / raw) To: Jiri Kosina Cc: Linus Torvalds, Andrew Morton, Greg KH, Peter Zijlstra, Michal Hocko, linux-mm, linux-kernel, linux-api On 5.1.2019 20:24, Jiri Kosina wrote: > On Sat, 5 Jan 2019, Vlastimil Babka wrote: > >>> There are possibilities [1] how mincore() could be used as a converyor of >>> a sidechannel information about pagecache metadata. >>> >>> Provide vm.mincore_privileged sysctl, which makes it possible to mincore() >>> start returning -EPERM in case it's invoked by a process lacking >>> CAP_SYS_ADMIN. >> >> Haven't checked the details yet, but wouldn't it be safe if anonymous private >> mincore() kept working, and restrictions were applied only to page cache? > > I was considering that, but then I decided not to do so, as that'd make > the interface even more confusing and semantics non-obvious in the > 'privileged' case. > >>> The default behavior stays "mincore() can be used by anybody" in order to >>> be conservative with respect to userspace behavior. >> >> What if we lied instead of returned -EPERM, to not break userspace so >> obviously? I guess false positive would be the safer lie? > > So your proposal basically would be > > if (privileged && !CAP_SYS_ADMIN) > if (pagecache) > return false; I was thinking about "return true" here, assuming that userspace generally wants to ensure itself there won't be page faults when it starts doing something critical, and if it sees a "false" it will try to do some kind of prefaulting, possibly in a loop. There might be somebody trying to make sure something is out of pagecache (it wants to see "false"), but can't think of anything except benchmarks? > else > return do_mincore() > > right ? > > I think userspace would hate us for that semantics, but on the other hand > I can sort of understand the 'mincore() is racy anyway, so what' argument, > if that's what you are suggesting. > > But then, I have no idea what userspace is using mincore() for. > https://codesearch.debian.net/search?q=mincore might provide some insight > I guess (thanks Matthew). > ^ permalink raw reply [flat|nested] 161+ messages in thread
* Re: [PATCH] mm/mincore: allow for making sys_mincore() privileged 2019-01-05 19:38 ` Vlastimil Babka @ 2019-01-08 9:14 ` Bernd Petrovitsch 2019-01-08 11:37 ` Jiri Kosina 0 siblings, 1 reply; 161+ messages in thread From: Bernd Petrovitsch @ 2019-01-08 9:14 UTC (permalink / raw) To: Vlastimil Babka, Jiri Kosina Cc: Linus Torvalds, Andrew Morton, Greg KH, Peter Zijlstra, Michal Hocko, linux-mm, linux-kernel, linux-api On 05/01/2019 20:38, Vlastimil Babka wrote: [...] > I was thinking about "return true" here, assuming that userspace generally wants > to ensure itself there won't be page faults when it starts doing something > critical, and if it sees a "false" it will try to do some kind of prefaulting, > possibly in a loop. There might be somebody trying to make sure something is out Isn't that racy by design as the pages may get flushed out after the check? Shouldn't the application use e.g. mlock()/.... to guarantee no page faults in the first place? MfG, Bernd -- Bernd Petrovitsch Email : bernd@petrovitsch.priv.at LUGA : http://www.luga.at ^ permalink raw reply [flat|nested] 161+ messages in thread
* Re: [PATCH] mm/mincore: allow for making sys_mincore() privileged 2019-01-08 9:14 ` Bernd Petrovitsch @ 2019-01-08 11:37 ` Jiri Kosina 2019-01-08 13:53 ` Bernd Petrovitsch 0 siblings, 1 reply; 161+ messages in thread From: Jiri Kosina @ 2019-01-08 11:37 UTC (permalink / raw) To: Bernd Petrovitsch Cc: Vlastimil Babka, Linus Torvalds, Andrew Morton, Greg KH, Peter Zijlstra, Michal Hocko, linux-mm, linux-kernel, linux-api On Tue, 8 Jan 2019, Bernd Petrovitsch wrote: > Shouldn't the application use e.g. mlock()/.... to guarantee no page > faults in the first place? Calling mincore() on pages you've just mlock()ed is sort of pointless though. -- Jiri Kosina SUSE Labs ^ permalink raw reply [flat|nested] 161+ messages in thread
* Re: [PATCH] mm/mincore: allow for making sys_mincore() privileged 2019-01-08 11:37 ` Jiri Kosina @ 2019-01-08 13:53 ` Bernd Petrovitsch 2019-01-08 14:08 ` Kirill A. Shutemov 0 siblings, 1 reply; 161+ messages in thread From: Bernd Petrovitsch @ 2019-01-08 13:53 UTC (permalink / raw) To: Jiri Kosina Cc: Vlastimil Babka, Linus Torvalds, Andrew Morton, Greg KH, Peter Zijlstra, Michal Hocko, linux-mm, linux-kernel, linux-api On 08/01/2019 12:37, Jiri Kosina wrote: > On Tue, 8 Jan 2019, Bernd Petrovitsch wrote: > >> Shouldn't the application use e.g. mlock()/.... to guarantee no page >> faults in the first place? > > Calling mincore() on pages you've just mlock()ed is sort of pointless > though. Obviously;-) Sorry for being unclear above: If I want my application to avoid suffering from page faults, I use simply mlock() (and/or friends) to nail the relevant pages into physical RAM and not "look if they are out, if yes, get them in" which has also the risk that these important pages are too soon evicted again. But perhaps I'm missing some very special use cases .... MfG, Brend -- Bernd Petrovitsch Email : bernd@petrovitsch.priv.at LUGA : http://www.luga.at ^ permalink raw reply [flat|nested] 161+ messages in thread
* Re: [PATCH] mm/mincore: allow for making sys_mincore() privileged 2019-01-08 13:53 ` Bernd Petrovitsch @ 2019-01-08 14:08 ` Kirill A. Shutemov 0 siblings, 0 replies; 161+ messages in thread From: Kirill A. Shutemov @ 2019-01-08 14:08 UTC (permalink / raw) To: Bernd Petrovitsch Cc: Jiri Kosina, Vlastimil Babka, Linus Torvalds, Andrew Morton, Greg KH, Peter Zijlstra, Michal Hocko, linux-mm, linux-kernel, linux-api On Tue, Jan 08, 2019 at 02:53:00PM +0100, Bernd Petrovitsch wrote: > On 08/01/2019 12:37, Jiri Kosina wrote: > > On Tue, 8 Jan 2019, Bernd Petrovitsch wrote: > > > >> Shouldn't the application use e.g. mlock()/.... to guarantee no page > >> faults in the first place? > > > > Calling mincore() on pages you've just mlock()ed is sort of pointless > > though. > > Obviously;-) > > Sorry for being unclear above: If I want my application to > avoid suffering from page faults, I use simply mlock() > (and/or friends) to nail the relevant pages into physical > RAM and not "look if they are out, if yes, get them in" which > has also the risk that these important pages are too soon > evicted again. Note, that mlock() doesn't prevent minor page faults. Mlocked memory is still subject to mechanisms that makes the page temporary unmapped. For instance migration (including NUMA balancing), compaction, khugepaged... -- Kirill A. Shutemov ^ permalink raw reply [flat|nested] 161+ messages in thread
* Re: [PATCH] mm/mincore: allow for making sys_mincore() privileged 2019-01-05 17:27 [PATCH] mm/mincore: allow for making sys_mincore() privileged Jiri Kosina 2019-01-05 19:14 ` Vlastimil Babka @ 2019-01-05 19:44 ` kbuild test robot 2019-01-05 19:46 ` Linus Torvalds ` (3 subsequent siblings) 5 siblings, 0 replies; 161+ messages in thread From: kbuild test robot @ 2019-01-05 19:44 UTC (permalink / raw) To: Jiri Kosina Cc: kbuild-all, Linus Torvalds, Andrew Morton, Greg KH, Peter Zijlstra, Michal Hocko, linux-mm, linux-kernel, linux-api [-- Attachment #1: Type: text/plain, Size: 1001 bytes --] Hi Jiri, I love your patch! Yet something to improve: [auto build test ERROR on linus/master] [also build test ERROR on v4.20 next-20190103] [if your patch is applied to the wrong git tree, please drop us a note to help improve the system] url: https://github.com/0day-ci/linux/commits/Jiri-Kosina/mm-mincore-allow-for-making-sys_mincore-privileged/20190106-013707 config: c6x-evmc6678_defconfig (attached as .config) compiler: c6x-elf-gcc (GCC) 8.1.0 reproduce: wget https://raw.githubusercontent.com/intel/lkp-tests/master/sbin/make.cross -O ~/bin/make.cross chmod +x ~/bin/make.cross # save the attached .config to linux build tree GCC_VERSION=8.1.0 make.cross ARCH=c6x All errors (new ones prefixed by >>): >> kernel/sysctl.o:(.fardata+0x6a0): undefined reference to `sysctl_mincore_privileged' --- 0-DAY kernel test infrastructure Open Source Technology Center https://lists.01.org/pipermail/kbuild-all Intel Corporation [-- Attachment #2: .config.gz --] [-- Type: application/gzip, Size: 5258 bytes --] ^ permalink raw reply [flat|nested] 161+ messages in thread
* Re: [PATCH] mm/mincore: allow for making sys_mincore() privileged 2019-01-05 17:27 [PATCH] mm/mincore: allow for making sys_mincore() privileged Jiri Kosina 2019-01-05 19:14 ` Vlastimil Babka 2019-01-05 19:44 ` kbuild test robot @ 2019-01-05 19:46 ` Linus Torvalds 2019-01-05 20:12 ` Jiri Kosina 2019-01-05 20:13 ` Linus Torvalds 2019-01-05 19:56 ` kbuild test robot ` (2 subsequent siblings) 5 siblings, 2 replies; 161+ messages in thread From: Linus Torvalds @ 2019-01-05 19:46 UTC (permalink / raw) To: Jiri Kosina Cc: Andrew Morton, Greg KH, Peter Zijlstra, Michal Hocko, linux-mm, Linux List Kernel Mailing, linux-api On Sat, Jan 5, 2019 at 9:27 AM Jiri Kosina <jikos@kernel.org> wrote: > > From: Jiri Kosina <jkosina@suse.cz> > > There are possibilities [1] how mincore() could be used as a converyor of > a sidechannel information about pagecache metadata. Can we please just limit it to vma's that are either anonymous, or map a file that the user actually owns? Then the capability check could be for "override the file owner check" instead, which makes tons of sense. No new sysctl's for something like this, please. Linus ^ permalink raw reply [flat|nested] 161+ messages in thread
* Re: [PATCH] mm/mincore: allow for making sys_mincore() privileged 2019-01-05 19:46 ` Linus Torvalds @ 2019-01-05 20:12 ` Jiri Kosina 2019-01-05 20:17 ` Linus Torvalds 2019-01-05 20:13 ` Linus Torvalds 1 sibling, 1 reply; 161+ messages in thread From: Jiri Kosina @ 2019-01-05 20:12 UTC (permalink / raw) To: Linus Torvalds Cc: Andrew Morton, Greg KH, Peter Zijlstra, Michal Hocko, linux-mm, Linux List Kernel Mailing, linux-api On Sat, 5 Jan 2019, Linus Torvalds wrote: > > There are possibilities [1] how mincore() could be used as a converyor of > > a sidechannel information about pagecache metadata. > > Can we please just limit it to vma's that are either anonymous, or map > a file that the user actually owns? > > Then the capability check could be for "override the file owner check" > instead, which makes tons of sense. Makes sense. I am still not completely sure what to return in such cases though; we can either blatantly lie and always pretend that the pages are resident (to avoid calling process entering some prefaulting mode), or return -ENOMEM for mappings of files that don't belong to the user (in case it's not CAP_SYS_ADMIN one). -- Jiri Kosina SUSE Labs ^ permalink raw reply [flat|nested] 161+ messages in thread
* Re: [PATCH] mm/mincore: allow for making sys_mincore() privileged 2019-01-05 20:12 ` Jiri Kosina @ 2019-01-05 20:17 ` Linus Torvalds 2019-01-05 20:43 ` Jiri Kosina 0 siblings, 1 reply; 161+ messages in thread From: Linus Torvalds @ 2019-01-05 20:17 UTC (permalink / raw) To: Jiri Kosina Cc: Andrew Morton, Greg KH, Peter Zijlstra, Michal Hocko, linux-mm, Linux List Kernel Mailing, linux-api [ Crossed emails ] On Sat, Jan 5, 2019 at 12:12 PM Jiri Kosina <jikos@kernel.org> wrote: > > I am still not completely sure what to return in such cases though; we can > either blatantly lie and always pretend that the pages are resident That's what my untested patch did. Or maybe just claim they are all not present? And again, that patch was entirely untested, so it may be garbage and have some fundamental problem. I also don't know exactly what rule might make most sense, but "you can write to the file" certainly to me implies that you also could know what parts of it are in-core. Who actually _uses_ mincore()? That's probably the best guide to what we should do. Maybe they open the file read-only even if they are the owner, and we really should look at file ownership instead. I tried to make that "can_do_mincore()" function easy to understand and easy to just modify to some sane state. Again, my patch is meant as a "perhaps something like this?" rather than some "this is _exactly_ how it must be done". Take the patch as a quick suggestion, not some final answer. Linus ^ permalink raw reply [flat|nested] 161+ messages in thread
* Re: [PATCH] mm/mincore: allow for making sys_mincore() privileged 2019-01-05 20:17 ` Linus Torvalds @ 2019-01-05 20:43 ` Jiri Kosina 2019-01-05 21:54 ` Linus Torvalds 0 siblings, 1 reply; 161+ messages in thread From: Jiri Kosina @ 2019-01-05 20:43 UTC (permalink / raw) To: Linus Torvalds Cc: Andrew Morton, Greg KH, Peter Zijlstra, Michal Hocko, linux-mm, Linux List Kernel Mailing, linux-api On Sat, 5 Jan 2019, Linus Torvalds wrote: > > I am still not completely sure what to return in such cases though; we can > > either blatantly lie and always pretend that the pages are resident > > That's what my untested patch did. Or maybe just claim they are all > not present? Thinking about it a little bit more, I believe Vlastimil has a good point with 'non present' potentially causing more bogus activity in userspace in response (in an effort to actually make them present, and failing indefinitely). IOW, I think it's a reasonable expectation that the common scenario is "check if it's present, and if not, try to fault it in" instead of "check if it's present, and if it is, try to evict it". > And again, that patch was entirely untested, so it may be garbage and > have some fundamental problem. I will be travelling for next ~24 hours, but I have just asked our QA guys to run it through some basic battery of testing (which will probably happen on monday anyway). > I also don't know exactly what rule might make most sense, but "you can > write to the file" certainly to me implies that you also could know what > parts of it are in-core. I think it's reasonable; I can't really imagine any sidechannel to a global state be possibly mounted on valid R/W mappings. I'd guess that probably the most interesting here are the code segments of shared libraries, allowing to trace victim's execution. > Who actually _uses_ mincore()? That's probably the best guide to what > we should do. Maybe they open the file read-only even if they are the > owner, and we really should look at file ownership instead. Yeah, well https://codesearch.debian.net/search?q=mincore is a bit too much mess to get some idea quickly I am afraid. -- Jiri Kosina SUSE Labs ^ permalink raw reply [flat|nested] 161+ messages in thread
* Re: [PATCH] mm/mincore: allow for making sys_mincore() privileged 2019-01-05 20:43 ` Jiri Kosina @ 2019-01-05 21:54 ` Linus Torvalds 2019-01-06 11:33 ` Kevin Easton ` (2 more replies) 0 siblings, 3 replies; 161+ messages in thread From: Linus Torvalds @ 2019-01-05 21:54 UTC (permalink / raw) To: Jiri Kosina, Masatake YAMATO Cc: Andrew Morton, Greg KH, Peter Zijlstra, Michal Hocko, linux-mm, Linux List Kernel Mailing, linux-api On Sat, Jan 5, 2019 at 12:43 PM Jiri Kosina <jikos@kernel.org> wrote: > > > Who actually _uses_ mincore()? That's probably the best guide to what > > we should do. Maybe they open the file read-only even if they are the > > owner, and we really should look at file ownership instead. > > Yeah, well > > https://codesearch.debian.net/search?q=mincore > > is a bit too much mess to get some idea quickly I am afraid. Yeah, heh. And the first hit is 'fincore', which probably nobody cares about anyway, but it does fd = open (name, O_RDONLY) .. mmap(window, len, PROT_NONE, MAP_PRIVATE, .. so if we want to keep that working, we'd really need to actually check file ownership rather than just looking at f_mode. But I don't know if anybody *uses* and cares about fincore, and it's particularly questionable for non-root users. And the Android go runtime code seems to oddly use mincore to figure out page size: // try using mincore to detect the physical page size. // mincore should return EINVAL when address is not a multiple of system page size. which is all kinds of odd, but whatever.. Why mincore, rather than something sane and obvious like mmap? Don't ask me... Anyway, the Debian code search just results in mostly non-present stuff. It's sad that google code search is no more. It was great for exactly these kinds of questions. The mono runtime seems to have some mono_pages_not_faulted() function, but I don't know if people use it for file mappings, and I couldn't find any interesting users of it. I didn't find anything that seems to really care, but I gave up after a few pages of really boring stuff. Linus ^ permalink raw reply [flat|nested] 161+ messages in thread
* Re: [PATCH] mm/mincore: allow for making sys_mincore() privileged 2019-01-05 21:54 ` Linus Torvalds @ 2019-01-06 11:33 ` Kevin Easton 2019-01-08 8:50 ` Kevin Easton 2019-01-18 14:23 ` Tejun Heo 2 siblings, 0 replies; 161+ messages in thread From: Kevin Easton @ 2019-01-06 11:33 UTC (permalink / raw) To: Linus Torvalds Cc: Jiri Kosina, Masatake YAMATO, Andrew Morton, Greg KH, Peter Zijlstra, Michal Hocko, linux-mm, Linux List Kernel Mailing, linux-api On Sat, Jan 05, 2019 at 01:54:03PM -0800, Linus Torvalds wrote: > On Sat, Jan 5, 2019 at 12:43 PM Jiri Kosina <jikos@kernel.org> wrote: > > > > > Who actually _uses_ mincore()? That's probably the best guide to what > > > we should do. Maybe they open the file read-only even if they are the > > > owner, and we really should look at file ownership instead. > > > > Yeah, well > > > > https://codesearch.debian.net/search?q=mincore > > > > is a bit too much mess to get some idea quickly I am afraid. > Anyway, the Debian code search just results in mostly non-present > stuff. It's sad that google code search is no more. It was great for > exactly these kinds of questions. If you select the "Group search results by Debian source package" option on the search results page it makes it a lot easier to skim through. It looks to me like Firefox is expecting mincore() not to fail on libraries that it has mapped: https://sources.debian.org/src/firefox-esr/60.4.0esr-1/mozglue/linker/BaseElf.cpp/?hl=98#L98 - Kevin > > The mono runtime seems to have some mono_pages_not_faulted() function, > but I don't know if people use it for file mappings, and I couldn't > find any interesting users of it. > > I didn't find anything that seems to really care, but I gave up after > a few pages of really boring stuff. > > Linus > > ^ permalink raw reply [flat|nested] 161+ messages in thread
* Re: [PATCH] mm/mincore: allow for making sys_mincore() privileged 2019-01-05 21:54 ` Linus Torvalds 2019-01-06 11:33 ` Kevin Easton @ 2019-01-08 8:50 ` Kevin Easton 2019-01-18 14:23 ` Tejun Heo 2 siblings, 0 replies; 161+ messages in thread From: Kevin Easton @ 2019-01-08 8:50 UTC (permalink / raw) To: Linus Torvalds Cc: Jiri Kosina, Masatake YAMATO, Andrew Morton, Greg KH, Peter Zijlstra, Michal Hocko, linux-mm, Linux List Kernel Mailing, linux-api On Sat, Jan 05, 2019 at 01:54:03PM -0800, Linus Torvalds wrote: > On Sat, Jan 5, 2019 at 12:43 PM Jiri Kosina <jikos@kernel.org> wrote: > > > > > Who actually _uses_ mincore()? That's probably the best guide to what > > > we should do. Maybe they open the file read-only even if they are the > > > owner, and we really should look at file ownership instead. > > > > Yeah, well > > > > https://codesearch.debian.net/search?q=mincore > > > > is a bit too much mess to get some idea quickly I am afraid. > > Yeah, heh. > > And the first hit is 'fincore', which probably nobody cares about > anyway, but it does > > fd = open (name, O_RDONLY) > .. > mmap(window, len, PROT_NONE, MAP_PRIVATE, .. > > so if we want to keep that working, we'd really need to actually check > file ownership rather than just looking at f_mode. > > But I don't know if anybody *uses* and cares about fincore, and it's > particularly questionable for non-root users. > ... > I didn't find anything that seems to really care, but I gave up after > a few pages of really boring stuff. I've gone through everything in the Debian code search, and this is the stuff that seems like it would be affected at all by the current patch: util-linux Contains 'fincore' as already noted above. e2fsprogs e4defrag tries to drop pages that it caused to be loaded into the page cache, but it's not clear that this ever worked as designed anyway (it calls mincore() before ioctl(fd, EXT4_IOC_MOVE_EXT ..) but then after the sync_file_range it drops the pages that *were* in the page cache at the time of mincore()). pgfincore postgresql extension used to try to dump/restore page cache status of database backing files across reboots. It uses a fresh mapping with mincore() to try to determine the current page cache status of a file. nocache LD_PRELOAD library that tries to drop any pages that the victim program has caused to be loaded into the page cache, uses mincore on a fresh mapping to see what was resident beforehand. Also includes 'cachestats' command that's basically another 'fincore'. xfsprogs xfs_io has a 'mincore' sub-command that is roughly equivalent to 'fincore'. vmtouch vmtouch is "Portable file system cache diagnostics and control", among other things it implements 'fincore' type functionality, and one of its touted use-cases is "Preserving virtual memory profile when failing over servers". qemu qemu uses mincore() with a fresh PROT_NONE, MAP_PRIVATE mapping to implement the "x-check-cache-dropped" option. ( https://patchwork.kernel.org/patch/10395865/ ) (Everything else I could see was either looking at anonymous VMAs, its own existing mapping that it's been using for actual IO, or was just using mincore() to see if an address was part of any mapping at all). - Kevin ^ permalink raw reply [flat|nested] 161+ messages in thread
* Re: [PATCH] mm/mincore: allow for making sys_mincore() privileged 2019-01-05 21:54 ` Linus Torvalds 2019-01-06 11:33 ` Kevin Easton 2019-01-08 8:50 ` Kevin Easton @ 2019-01-18 14:23 ` Tejun Heo 2 siblings, 0 replies; 161+ messages in thread From: Tejun Heo @ 2019-01-18 14:23 UTC (permalink / raw) To: Linus Torvalds Cc: Jiri Kosina, Masatake YAMATO, Andrew Morton, Greg KH, Peter Zijlstra, Michal Hocko, linux-mm, Linux List Kernel Mailing, linux-api Hello, Linus. On Sat, Jan 05, 2019 at 01:54:03PM -0800, Linus Torvalds wrote: > And the first hit is 'fincore', which probably nobody cares about > anyway, but it does > > fd = open (name, O_RDONLY) > .. > mmap(window, len, PROT_NONE, MAP_PRIVATE, .. So, folks here have been using fincore(1) for diagnostic purposes and are also looking to expand on it to investigate per-cgroup cache usages (mmap -> mincore -> /proc/self/pagemap -> /proc/kpagecgroup -> cgroup path). These are all root-only usages to find out what's going on with the whole page cache. We aren't attached to doing things this particular way but it'd suck if there's no way. Thanks. -- tejun ^ permalink raw reply [flat|nested] 161+ messages in thread
* Re: [PATCH] mm/mincore: allow for making sys_mincore() privileged 2019-01-05 19:46 ` Linus Torvalds 2019-01-05 20:12 ` Jiri Kosina @ 2019-01-05 20:13 ` Linus Torvalds 1 sibling, 0 replies; 161+ messages in thread From: Linus Torvalds @ 2019-01-05 20:13 UTC (permalink / raw) To: Jiri Kosina Cc: Andrew Morton, Greg KH, Peter Zijlstra, Michal Hocko, linux-mm, Linux List Kernel Mailing, linux-api [-- Attachment #1: Type: text/plain, Size: 335 bytes --] On Sat, Jan 5, 2019 at 11:46 AM Linus Torvalds <torvalds@linux-foundation.org> wrote: > > Can we please just limit it to vma's that are either anonymous, or map > a file that the user actually owns? .. or slightly simpler: a file that the user opened for writing. IOW, some (TOTALLY UNTESTED!) patch like this? Linus [-- Attachment #2: patch.diff --] [-- Type: text/x-patch, Size: 1161 bytes --] mm/mincore.c | 14 +++++++++++++- 1 file changed, 13 insertions(+), 1 deletion(-) diff --git a/mm/mincore.c b/mm/mincore.c index 218099b5ed31..61e38895fb02 100644 --- a/mm/mincore.c +++ b/mm/mincore.c @@ -169,6 +169,13 @@ static int mincore_pte_range(pmd_t *pmd, unsigned long addr, unsigned long end, return 0; } +static inline bool can_do_mincore(struct vm_area_struct *vma) +{ + return vma_is_anonymous(vma) + || (vma->vm_file && (vma->vm_file->f_mode & FMODE_WRITE)) + || capable(CAP_SYS_ADMIN); +} + /* * Do a chunk of "sys_mincore()". We've already checked * all the arguments, we hold the mmap semaphore: we should @@ -189,8 +196,13 @@ static long do_mincore(unsigned long addr, unsigned long pages, unsigned char *v vma = find_vma(current->mm, addr); if (!vma || addr < vma->vm_start) return -ENOMEM; - mincore_walk.mm = vma->vm_mm; end = min(vma->vm_end, addr + (pages << PAGE_SHIFT)); + if (!can_do_mincore(vma)) { + unsigned long pages = (end - addr) >> PAGE_SHIFT; + memset(vec, 1, pages); + return pages; + } + mincore_walk.mm = vma->vm_mm; err = walk_page_range(addr, end, &mincore_walk); if (err < 0) return err; ^ permalink raw reply related [flat|nested] 161+ messages in thread
* Re: [PATCH] mm/mincore: allow for making sys_mincore() privileged 2019-01-05 17:27 [PATCH] mm/mincore: allow for making sys_mincore() privileged Jiri Kosina ` (2 preceding siblings ...) 2019-01-05 19:46 ` Linus Torvalds @ 2019-01-05 19:56 ` kbuild test robot 2019-01-05 22:54 ` Jann Horn 2019-01-30 12:44 ` [PATCH 0/3] mincore() and IOCB_NOWAIT adjustments Vlastimil Babka 5 siblings, 0 replies; 161+ messages in thread From: kbuild test robot @ 2019-01-05 19:56 UTC (permalink / raw) To: Jiri Kosina Cc: kbuild-all, Linus Torvalds, Andrew Morton, Greg KH, Peter Zijlstra, Michal Hocko, linux-mm, linux-kernel, linux-api [-- Attachment #1: Type: text/plain, Size: 1018 bytes --] Hi Jiri, I love your patch! Yet something to improve: [auto build test ERROR on linus/master] [also build test ERROR on v4.20 next-20190103] [if your patch is applied to the wrong git tree, please drop us a note to help improve the system] url: https://github.com/0day-ci/linux/commits/Jiri-Kosina/mm-mincore-allow-for-making-sys_mincore-privileged/20190106-013707 config: microblaze-nommu_defconfig (attached as .config) compiler: microblaze-linux-gcc (GCC) 8.1.0 reproduce: wget https://raw.githubusercontent.com/intel/lkp-tests/master/sbin/make.cross -O ~/bin/make.cross chmod +x ~/bin/make.cross # save the attached .config to linux build tree GCC_VERSION=8.1.0 make.cross ARCH=microblaze All errors (new ones prefixed by >>): >> kernel/sysctl.o:(.data+0x67c): undefined reference to `sysctl_mincore_privileged' --- 0-DAY kernel test infrastructure Open Source Technology Center https://lists.01.org/pipermail/kbuild-all Intel Corporation [-- Attachment #2: .config.gz --] [-- Type: application/gzip, Size: 12566 bytes --] ^ permalink raw reply [flat|nested] 161+ messages in thread
* Re: [PATCH] mm/mincore: allow for making sys_mincore() privileged 2019-01-05 17:27 [PATCH] mm/mincore: allow for making sys_mincore() privileged Jiri Kosina ` (3 preceding siblings ...) 2019-01-05 19:56 ` kbuild test robot @ 2019-01-05 22:54 ` Jann Horn 2019-01-05 23:05 ` Linus Torvalds 2019-01-05 23:09 ` Jiri Kosina 2019-01-30 12:44 ` [PATCH 0/3] mincore() and IOCB_NOWAIT adjustments Vlastimil Babka 5 siblings, 2 replies; 161+ messages in thread From: Jann Horn @ 2019-01-05 22:54 UTC (permalink / raw) To: Jiri Kosina Cc: Linus Torvalds, Andrew Morton, Greg KH, Peter Zijlstra, Michal Hocko, Linux-MM, kernel list, Linux API On Sat, Jan 5, 2019 at 6:27 PM Jiri Kosina <jikos@kernel.org> wrote: > There are possibilities [1] how mincore() could be used as a converyor of > a sidechannel information about pagecache metadata. > > Provide vm.mincore_privileged sysctl, which makes it possible to mincore() > start returning -EPERM in case it's invoked by a process lacking > CAP_SYS_ADMIN. > > The default behavior stays "mincore() can be used by anybody" in order to > be conservative with respect to userspace behavior. > > [1] https://www.theregister.co.uk/2019/01/05/boffins_beat_page_cache/ Just checking: I guess /proc/$pid/pagemap (iow, the pagemap_read() handler) is less problematic because it only returns data about the state of page tables, and doesn't query the address_space? In other words, it permits monitoring evictions, but non-intrusively detecting that something has been loaded into memory by another process is harder? ^ permalink raw reply [flat|nested] 161+ messages in thread
* Re: [PATCH] mm/mincore: allow for making sys_mincore() privileged 2019-01-05 22:54 ` Jann Horn @ 2019-01-05 23:05 ` Linus Torvalds 2019-01-05 23:16 ` Linus Torvalds 2019-01-05 23:09 ` Jiri Kosina 1 sibling, 1 reply; 161+ messages in thread From: Linus Torvalds @ 2019-01-05 23:05 UTC (permalink / raw) To: Jann Horn Cc: Jiri Kosina, Andrew Morton, Greg KH, Peter Zijlstra, Michal Hocko, Linux-MM, kernel list, Linux API On Sat, Jan 5, 2019 at 2:55 PM Jann Horn <jannh@google.com> wrote: > > Just checking: I guess /proc/$pid/pagemap (iow, the pagemap_read() > handler) is less problematic because it only returns data about the > state of page tables, and doesn't query the address_space? In other > words, it permits monitoring evictions, but non-intrusively detecting > that something has been loaded into memory by another process is > harder? Yes. I think it was brought up during the original report, but to use the pagemap for this, you'd basically need to first populate all the pages, and then wait for pageout. So pagemap *does* leak very similar data, but it's much harder to use as an attack vector. That said, I think "mincore()" is just the simple one. You *can* (and this was also discussed on the security list) do things like - fault in a single page - the kernel will opportunistically fault in pages that it already has available _around_ that page. - then use pagemap (or just _timing_ - no real kernel support needed) to see if those pages are now mapped in your address space so honestly, mincore is just the "big hammer" and easy way to get at some of this data. But it's probably worth closing exactly because it's easy. There are other ways to get at the "are these pages mapped" information, but they are a lot more combersome to use. Side note: maybe we could just remove the "__mincore_unmapped_range()" thing entirely, and basically make mincore() do what pagemap does, which is to say "are the pages mapped in this VM". That would be nicer than my patch, simply because removing code is always nice. And arguably it's a better semantic anyway. Linus ^ permalink raw reply [flat|nested] 161+ messages in thread
* Re: [PATCH] mm/mincore: allow for making sys_mincore() privileged 2019-01-05 23:05 ` Linus Torvalds @ 2019-01-05 23:16 ` Linus Torvalds 2019-01-05 23:28 ` Linus Torvalds 2019-01-05 23:39 ` Linus Torvalds 0 siblings, 2 replies; 161+ messages in thread From: Linus Torvalds @ 2019-01-05 23:16 UTC (permalink / raw) To: Jann Horn Cc: Jiri Kosina, Andrew Morton, Greg KH, Peter Zijlstra, Michal Hocko, Linux-MM, kernel list, Linux API [-- Attachment #1: Type: text/plain, Size: 762 bytes --] On Sat, Jan 5, 2019 at 3:05 PM Linus Torvalds <torvalds@linux-foundation.org> wrote: > > That would be nicer than my patch, simply because removing code is > always nice. And arguably it's a better semantic anyway. Yeah, I wonder why we did that thing where mincore() walks the page tables, but if they are empty it looks in the page cache. [... goes and looks in history ..] It goes back to forever, it looks like. I can't find a reason. Anyway, a removal patch would look something like the attached, I think. That makes mincore() actually say how many pages are in _this_ mapping, not how many pages could be paged in without doing IO. Hmm. Maybe we should try this first. Simplicity is always good. Again, obviously untested. Linus [-- Attachment #2: patch.diff --] [-- Type: text/x-patch, Size: 2711 bytes --] mm/mincore.c | 74 +++++------------------------------------------------------- 1 file changed, 6 insertions(+), 68 deletions(-) diff --git a/mm/mincore.c b/mm/mincore.c index 218099b5ed31..317eb64ea4ef 100644 --- a/mm/mincore.c +++ b/mm/mincore.c @@ -42,64 +42,12 @@ static int mincore_hugetlb(pte_t *pte, unsigned long hmask, unsigned long addr, return 0; } -/* - * Later we can get more picky about what "in core" means precisely. - * For now, simply check to see if the page is in the page cache, - * and is up to date; i.e. that no page-in operation would be required - * at this time if an application were to map and access this page. - */ -static unsigned char mincore_page(struct address_space *mapping, pgoff_t pgoff) -{ - unsigned char present = 0; - struct page *page; - - /* - * When tmpfs swaps out a page from a file, any process mapping that - * file will not get a swp_entry_t in its pte, but rather it is like - * any other file mapping (ie. marked !present and faulted in with - * tmpfs's .fault). So swapped out tmpfs mappings are tested here. - */ -#ifdef CONFIG_SWAP - if (shmem_mapping(mapping)) { - page = find_get_entry(mapping, pgoff); - /* - * shmem/tmpfs may return swap: account for swapcache - * page too. - */ - if (xa_is_value(page)) { - swp_entry_t swp = radix_to_swp_entry(page); - page = find_get_page(swap_address_space(swp), - swp_offset(swp)); - } - } else - page = find_get_page(mapping, pgoff); -#else - page = find_get_page(mapping, pgoff); -#endif - if (page) { - present = PageUptodate(page); - put_page(page); - } - - return present; -} - static int __mincore_unmapped_range(unsigned long addr, unsigned long end, struct vm_area_struct *vma, unsigned char *vec) { unsigned long nr = (end - addr) >> PAGE_SHIFT; - int i; - if (vma->vm_file) { - pgoff_t pgoff; - - pgoff = linear_page_index(vma, addr); - for (i = 0; i < nr; i++, pgoff++) - vec[i] = mincore_page(vma->vm_file->f_mapping, pgoff); - } else { - for (i = 0; i < nr; i++) - vec[i] = 0; - } + memset(vec, 0, nr); return nr; } @@ -144,21 +92,11 @@ static int mincore_pte_range(pmd_t *pmd, unsigned long addr, unsigned long end, else { /* pte is a swap entry */ swp_entry_t entry = pte_to_swp_entry(pte); - if (non_swap_entry(entry)) { - /* - * migration or hwpoison entries are always - * uptodate - */ - *vec = 1; - } else { -#ifdef CONFIG_SWAP - *vec = mincore_page(swap_address_space(entry), - swp_offset(entry)); -#else - WARN_ON(1); - *vec = 1; -#endif - } + /* + * migration or hwpoison entries are always + * uptodate + */ + *vec = !!non_swap_entry(entry); } vec++; } ^ permalink raw reply related [flat|nested] 161+ messages in thread
* Re: [PATCH] mm/mincore: allow for making sys_mincore() privileged 2019-01-05 23:16 ` Linus Torvalds @ 2019-01-05 23:28 ` Linus Torvalds 2019-01-05 23:39 ` Linus Torvalds 1 sibling, 0 replies; 161+ messages in thread From: Linus Torvalds @ 2019-01-05 23:28 UTC (permalink / raw) To: Jann Horn Cc: Jiri Kosina, Andrew Morton, Greg KH, Peter Zijlstra, Michal Hocko, Linux-MM, kernel list, Linux API On Sat, Jan 5, 2019 at 3:16 PM Linus Torvalds <torvalds@linux-foundation.org> wrote: > > It goes back to forever, it looks like. I can't find a reason. Our man-pages talk abouit the "without doing IO" part. That may be the result of our code, though, not the reason for it. The BSD man-page has other flags, but doesn't describe what "in core" really means: MINCORE_INCORE Page is in core (resident). MINCORE_REFERENCED Page has been referenced by us. MINCORE_MODIFIED Page has been modified by us. MINCORE_REFERENCED_OTHER Page has been referenced. MINCORE_MODIFIED_OTHER Page has been modified. MINCORE_SUPER Page is part of a large (``super'') page. but the fact that it has MINCORE_MODIFIED_OTHER does obviously imply that yes, historically it really did look up the pages elsewhere, not just in the page tables. Still, maybe we can get away with just making it be about our own page tables. That would be lovely. Linus ^ permalink raw reply [flat|nested] 161+ messages in thread
* Re: [PATCH] mm/mincore: allow for making sys_mincore() privileged 2019-01-05 23:16 ` Linus Torvalds 2019-01-05 23:28 ` Linus Torvalds @ 2019-01-05 23:39 ` Linus Torvalds 2019-01-06 0:11 ` Matthew Wilcox 2019-01-07 10:10 ` Michael Ellerman 1 sibling, 2 replies; 161+ messages in thread From: Linus Torvalds @ 2019-01-05 23:39 UTC (permalink / raw) To: Jann Horn Cc: Jiri Kosina, Andrew Morton, Greg KH, Peter Zijlstra, Michal Hocko, Linux-MM, kernel list, Linux API On Sat, Jan 5, 2019 at 3:16 PM Linus Torvalds <torvalds@linux-foundation.org> wrote: > > It goes back to forever, it looks like. I can't find a reason. mincore() was originally added in 2.3.52pre3, it looks like. Around 2000 or so. But sadly before the BK history. And that comment about "Later we can get more picky about what "in core" means precisely." that still exists above mincore_page() goes back to the original patch. Linus ^ permalink raw reply [flat|nested] 161+ messages in thread
* Re: [PATCH] mm/mincore: allow for making sys_mincore() privileged 2019-01-05 23:39 ` Linus Torvalds @ 2019-01-06 0:11 ` Matthew Wilcox 2019-01-06 0:22 ` Linus Torvalds 2019-01-07 10:10 ` Michael Ellerman 1 sibling, 1 reply; 161+ messages in thread From: Matthew Wilcox @ 2019-01-06 0:11 UTC (permalink / raw) To: Linus Torvalds Cc: Jann Horn, Jiri Kosina, Andrew Morton, Greg KH, Peter Zijlstra, Michal Hocko, Linux-MM, kernel list, Linux API On Sat, Jan 05, 2019 at 03:39:10PM -0800, Linus Torvalds wrote: > On Sat, Jan 5, 2019 at 3:16 PM Linus Torvalds > <torvalds@linux-foundation.org> wrote: > > > > It goes back to forever, it looks like. I can't find a reason. > > mincore() was originally added in 2.3.52pre3, it looks like. Around > 2000 or so. But sadly before the BK history. > > And that comment about > > "Later we can get more picky about what "in core" means precisely." > > that still exists above mincore_page() goes back to the original patch. FreeBSD claims to have a manpage from SunOS 4.1.3 with mincore (!) https://www.freebsd.org/cgi/man.cgi?query=mincore&apropos=0&sektion=0&manpath=SunOS+4.1.3&arch=default&format=html DESCRIPTION mincore() returns the primary memory residency status of pages in the address space covered by mappings in the range [addr, addr + len). The status is returned as a char-per-page in the character array referenced by *vec (which the system assumes to be large enough to encompass all the pages in the address range). The least significant bit of each character is set to 1 to indicate that the referenced page is in pri- mary memory, 0 if it is not. The settings of other bits in each char- acter is undefined and may contain other information in the future. ^ permalink raw reply [flat|nested] 161+ messages in thread
* Re: [PATCH] mm/mincore: allow for making sys_mincore() privileged 2019-01-06 0:11 ` Matthew Wilcox @ 2019-01-06 0:22 ` Linus Torvalds 2019-01-06 1:50 ` Linus Torvalds 2019-01-07 4:32 ` Dominique Martinet 0 siblings, 2 replies; 161+ messages in thread From: Linus Torvalds @ 2019-01-06 0:22 UTC (permalink / raw) To: Matthew Wilcox Cc: Jann Horn, Jiri Kosina, Andrew Morton, Greg KH, Peter Zijlstra, Michal Hocko, Linux-MM, kernel list, Linux API On Sat, Jan 5, 2019 at 4:11 PM Matthew Wilcox <willy@infradead.org> wrote: > > FreeBSD claims to have a manpage from SunOS 4.1.3 with mincore (!) > > https://www.freebsd.org/cgi/man.cgi?query=mincore&apropos=0&sektion=0&manpath=SunOS+4.1.3&arch=default&format=html > > DESCRIPTION > mincore() returns the primary memory residency status of pages in the > address space covered by mappings in the range [addr, addr + len). It's still not clear that "primary memory residency status" actually means. Does it mean "mapped", or does it mean "exists in caches and doesn't need IO". I don't even know what kind of caches SunOS 4.1.3 had. The Linux implementation depends on the page cache, and wouldn't work (at least not very well) in a system that has a traditional disk buffer cache. Anyway, I guess it's mostly moot. From a "does this cause regressions" standpoint, the only thing that matters is really whatever Linux programs that have used this since it was introduced 18+ years ago. But I think my patch to just rip out all that page lookup, and just base it on the page table state has the fundamental advantage that it gets rid of code. Maybe I should jst commit it, and see if anything breaks? We do have options in case things break, and then we'd at least know who cares (and perhaps a lot more information of _why_ they care). Linus ^ permalink raw reply [flat|nested] 161+ messages in thread
* Re: [PATCH] mm/mincore: allow for making sys_mincore() privileged 2019-01-06 0:22 ` Linus Torvalds @ 2019-01-06 1:50 ` Linus Torvalds 2019-01-06 21:46 ` Linus Torvalds 2019-01-07 4:32 ` Dominique Martinet 1 sibling, 1 reply; 161+ messages in thread From: Linus Torvalds @ 2019-01-06 1:50 UTC (permalink / raw) To: Matthew Wilcox Cc: Jann Horn, Jiri Kosina, Andrew Morton, Greg KH, Peter Zijlstra, Michal Hocko, Linux-MM, kernel list, Linux API [-- Attachment #1: Type: text/plain, Size: 599 bytes --] On Sat, Jan 5, 2019 at 4:22 PM Linus Torvalds <torvalds@linux-foundation.org> wrote: > > But I think my patch to just rip out all that page lookup, and just > base it on the page table state has the fundamental advantage that it > gets rid of code. Maybe I should jst commit it, and see if anything > breaks? We do have options in case things break, and then we'd at > least know who cares (and perhaps a lot more information of _why_ they > care). Slightly updated patch in case somebody wants to try things out. Also, any comments about the pmd_trans_unstable() case? Linus [-- Attachment #2: patch.diff --] [-- Type: text/x-patch, Size: 3523 bytes --] mm/mincore.c | 94 +++++++++--------------------------------------------------- 1 file changed, 13 insertions(+), 81 deletions(-) diff --git a/mm/mincore.c b/mm/mincore.c index 218099b5ed31..f0f91461a9f4 100644 --- a/mm/mincore.c +++ b/mm/mincore.c @@ -42,72 +42,14 @@ static int mincore_hugetlb(pte_t *pte, unsigned long hmask, unsigned long addr, return 0; } -/* - * Later we can get more picky about what "in core" means precisely. - * For now, simply check to see if the page is in the page cache, - * and is up to date; i.e. that no page-in operation would be required - * at this time if an application were to map and access this page. - */ -static unsigned char mincore_page(struct address_space *mapping, pgoff_t pgoff) -{ - unsigned char present = 0; - struct page *page; - - /* - * When tmpfs swaps out a page from a file, any process mapping that - * file will not get a swp_entry_t in its pte, but rather it is like - * any other file mapping (ie. marked !present and faulted in with - * tmpfs's .fault). So swapped out tmpfs mappings are tested here. - */ -#ifdef CONFIG_SWAP - if (shmem_mapping(mapping)) { - page = find_get_entry(mapping, pgoff); - /* - * shmem/tmpfs may return swap: account for swapcache - * page too. - */ - if (xa_is_value(page)) { - swp_entry_t swp = radix_to_swp_entry(page); - page = find_get_page(swap_address_space(swp), - swp_offset(swp)); - } - } else - page = find_get_page(mapping, pgoff); -#else - page = find_get_page(mapping, pgoff); -#endif - if (page) { - present = PageUptodate(page); - put_page(page); - } - - return present; -} - -static int __mincore_unmapped_range(unsigned long addr, unsigned long end, - struct vm_area_struct *vma, unsigned char *vec) -{ - unsigned long nr = (end - addr) >> PAGE_SHIFT; - int i; - - if (vma->vm_file) { - pgoff_t pgoff; - - pgoff = linear_page_index(vma, addr); - for (i = 0; i < nr; i++, pgoff++) - vec[i] = mincore_page(vma->vm_file->f_mapping, pgoff); - } else { - for (i = 0; i < nr; i++) - vec[i] = 0; - } - return nr; -} - static int mincore_unmapped_range(unsigned long addr, unsigned long end, struct mm_walk *walk) { - walk->private += __mincore_unmapped_range(addr, end, - walk->vma, walk->private); + unsigned char *vec = walk->private; + unsigned long nr = (end - addr) >> PAGE_SHIFT; + + memset(vec, 0, nr); + walk->private += nr; return 0; } @@ -127,8 +69,9 @@ static int mincore_pte_range(pmd_t *pmd, unsigned long addr, unsigned long end, goto out; } + /* We'll consider a THP page under construction to be there */ if (pmd_trans_unstable(pmd)) { - __mincore_unmapped_range(addr, end, vma, vec); + memset(vec, 1, nr); goto out; } @@ -137,28 +80,17 @@ static int mincore_pte_range(pmd_t *pmd, unsigned long addr, unsigned long end, pte_t pte = *ptep; if (pte_none(pte)) - __mincore_unmapped_range(addr, addr + PAGE_SIZE, - vma, vec); + *vec = 0; else if (pte_present(pte)) *vec = 1; else { /* pte is a swap entry */ swp_entry_t entry = pte_to_swp_entry(pte); - if (non_swap_entry(entry)) { - /* - * migration or hwpoison entries are always - * uptodate - */ - *vec = 1; - } else { -#ifdef CONFIG_SWAP - *vec = mincore_page(swap_address_space(entry), - swp_offset(entry)); -#else - WARN_ON(1); - *vec = 1; -#endif - } + /* + * migration or hwpoison entries are always + * uptodate + */ + *vec = !!non_swap_entry(entry); } vec++; } ^ permalink raw reply related [flat|nested] 161+ messages in thread
* Re: [PATCH] mm/mincore: allow for making sys_mincore() privileged 2019-01-06 1:50 ` Linus Torvalds @ 2019-01-06 21:46 ` Linus Torvalds 2019-01-08 4:43 ` Dave Chinner 0 siblings, 1 reply; 161+ messages in thread From: Linus Torvalds @ 2019-01-06 21:46 UTC (permalink / raw) To: Matthew Wilcox Cc: Jann Horn, Jiri Kosina, Andrew Morton, Greg KH, Peter Zijlstra, Michal Hocko, Linux-MM, kernel list, Linux API On Sat, Jan 5, 2019 at 5:50 PM Linus Torvalds <torvalds@linux-foundation.org> wrote: > > Slightly updated patch in case somebody wants to try things out. I decided to just apply that patch. It is *not* marked for stable, very intentionally, because I expect that we will need to wait and see if there are issues with it, and whether we might have to do something entirely different (more like the traditional behavior with some extra "only for owner" logic). But doing a test patch during the merge window (which is about to close) sounds like the right thing to do. Linus ^ permalink raw reply [flat|nested] 161+ messages in thread
* Re: [PATCH] mm/mincore: allow for making sys_mincore() privileged 2019-01-06 21:46 ` Linus Torvalds @ 2019-01-08 4:43 ` Dave Chinner 2019-01-08 17:57 ` Linus Torvalds 0 siblings, 1 reply; 161+ messages in thread From: Dave Chinner @ 2019-01-08 4:43 UTC (permalink / raw) To: Linus Torvalds Cc: Matthew Wilcox, Jann Horn, Jiri Kosina, Andrew Morton, Greg KH, Peter Zijlstra, Michal Hocko, Linux-MM, kernel list, Linux API On Sun, Jan 06, 2019 at 01:46:37PM -0800, Linus Torvalds wrote: > On Sat, Jan 5, 2019 at 5:50 PM Linus Torvalds > <torvalds@linux-foundation.org> wrote: > > > > Slightly updated patch in case somebody wants to try things out. > > I decided to just apply that patch. It is *not* marked for stable, > very intentionally, because I expect that we will need to wait and see > if there are issues with it, and whether we might have to do something > entirely different (more like the traditional behavior with some extra > "only for owner" logic). So, I read the paper and before I was half way through it I figured there are a bunch of other similar page cache invalidation attacks we can perform without needing mincore. i.e. Focussing on mmap() and mincore() misses the wider issues we have with global shared caches. My first thought: fd = open(some_file, O_RDONLY); iov[0].iov_base = buf; iov[0].iov_len = 1; ret = preadv2(fd, iov, 1, off, RWF_NOWAIT); switch (ret) { case -EAGAIN: /* page not resident in page cache */ break; case 1: /* page resident in page cache */ break; default: /* beyond EOF or some other error */ break; } This is "as documented" in the man page for preadv2: RWF_NOWAIT (since Linux 4.14) Do not wait for data which is not immediately available. If this flag is specified, the preadv2() system call will return instantly if it would have to read data from the backing storage or wait for a lock. If some data was successfully read, it will return the number of bytes read. If no bytes were read, it will return -1 and set errno to EAGAIN. Currently, this flag is meaningful only for preadv2(). IOWs, we've explicitly designed interfaces to communicate whether data is "immediately accessible" or not to the application so they can make sensible decisions about IO scheduling. i.e. IO won't block the main application processing loop and so can be scheduled in the background by the app and the data processed when that IO returns. That just so happens to be exactly the same information about the page cache that mincore is making available to userspace. If we "remove" this information from the interfaces like it has been done for mincore(), it doesn't mean userspace can't get it in other ways. e.g. it now just has to time the read(2) syscall duration and infer whether the data came from the page cache or disk from the timing information. IMO, there's nothing new or novel about this page cache information leak - it was just a matter of time before some researcher put 2 and 2 together and realised that sharing the page cache across a security boundary is little different from sharing deduplicated pages across those same security boundaries. i.e. As long as we shared caches across security boundaries and userspace can control both cache invalidation and instantiation, we cannot prevent userspace from constructing these invalidate+read information exfiltration scenarios. And then there is overlayfs. Overlay is really just a way to efficiently share the page cache of the same underlying read-only directory tree across all containers on a host. i.e. we have been specifically designing our container hosting systems to share the underlying read-only page cache across all security boundaries on the host. If overlay is vulnerable to these shared page cache attacks (seems extremely likely) then we've got much bigger problems than mincore to think about.... > But doing a test patch during the merge window (which is about to > close) sounds like the right thing to do. IMO it seems like the wrong thing to do. It's just a hacky band-aid over a specific extraction method and does nothing to reduce the actual scope of the information leak. Focussing on the band-aid means you've missed all the other avenues that the same information is exposed and all the infrastructure we've build on the core concept of sharing kernel side pages across security boundaries. And that's even without considering whether the change breaks userspace. Which it does. e.g. vmtouch is fairly widely used to manage page cache instantiation for rapid bring-up and migration of guest VMs and containers. They save the hot page cache information from a running container and then using that to instantiate the page cache in new instances running the same workload so they run at full speed right from the start. This use case calls mincore() to pull the page cache information from the running container. If anyone else proposed merging a syscall implementation change that was extremely likely to break userspace you'd be shouting at them that "we don't break userspace".... Cheers, Dave. -- Dave Chinner david@fromorbit.com ^ permalink raw reply [flat|nested] 161+ messages in thread
* Re: [PATCH] mm/mincore: allow for making sys_mincore() privileged 2019-01-08 4:43 ` Dave Chinner @ 2019-01-08 17:57 ` Linus Torvalds 2019-01-09 2:24 ` Dave Chinner 0 siblings, 1 reply; 161+ messages in thread From: Linus Torvalds @ 2019-01-08 17:57 UTC (permalink / raw) To: Dave Chinner Cc: Matthew Wilcox, Jann Horn, Jiri Kosina, Andrew Morton, Greg KH, Peter Zijlstra, Michal Hocko, Linux-MM, kernel list, Linux API On Mon, Jan 7, 2019 at 8:43 PM Dave Chinner <david@fromorbit.com> wrote: > > So, I read the paper and before I was half way through it I figured > there are a bunch of other similar page cache invalidation attacks > we can perform without needing mincore. i.e. Focussing on mmap() and > mincore() misses the wider issues we have with global shared caches. Oh, agreed, and that was discussed in the original report too. The thing is, you can also depend on our pre-faulting of pages in the page fault handler, and use that to get the cached status of nearby pages. So do something like "fault one page, then do mincore() to see how many pages near it were mapped". See our "do_fault_around()" logic. But mincore is certainly the easiest interface, and the one that doesn't require much effort or setup. It's also the one where our old behavior was actually arguably simply stupid and actively wrong (ie "in caches" isn't even strictly speaking a valid question, since the caches in question may be invalid). So let's try to see if giving mincore() slightly more well-defined semantics actually causes any pain. I do think that the RWF_NOWAIT case might also be interesting to look at. Linus ^ permalink raw reply [flat|nested] 161+ messages in thread
* Re: [PATCH] mm/mincore: allow for making sys_mincore() privileged 2019-01-08 17:57 ` Linus Torvalds @ 2019-01-09 2:24 ` Dave Chinner 2019-01-09 2:31 ` Jiri Kosina 0 siblings, 1 reply; 161+ messages in thread From: Dave Chinner @ 2019-01-09 2:24 UTC (permalink / raw) To: Linus Torvalds Cc: Matthew Wilcox, Jann Horn, Jiri Kosina, Andrew Morton, Greg KH, Peter Zijlstra, Michal Hocko, Linux-MM, kernel list, Linux API On Tue, Jan 08, 2019 at 09:57:49AM -0800, Linus Torvalds wrote: > On Mon, Jan 7, 2019 at 8:43 PM Dave Chinner <david@fromorbit.com> wrote: > > > > So, I read the paper and before I was half way through it I figured > > there are a bunch of other similar page cache invalidation attacks > > we can perform without needing mincore. i.e. Focussing on mmap() and > > mincore() misses the wider issues we have with global shared caches. > > Oh, agreed, and that was discussed in the original report too. > > The thing is, you can also depend on our pre-faulting of pages in the > page fault handler, and use that to get the cached status of nearby > pages. So do something like "fault one page, then do mincore() to see > how many pages near it were mapped". See our "do_fault_around()" > logic. Observing fault-around could help you detect what code an application is running, but it's not necessary (and can be turned off). Also, such an it observation is not dependent on using mincore. neither fault-around nor mincore are required functionality to exploit the information leaks. And, FWIW, fault-around actually destroys the information in the exfiltration channel described in the paper because it perturbs the carefully constructed page cache residency pattern that encodes the message. > But mincore is certainly the easiest interface, and the one that > doesn't require much effort or setup. Off the top of my head, here's a few vectors for reading the page cache residency state without perturbing the page cache residency pattern: - mincore - preadv2(RWF_NOWAIT) - fadvise(POSIX_FADV_RANDOM); timed read(2) syscalls - madvise(MADV_RANDOM); timed read of first byte in each page i.e. mincore is a messenger, but it's not the only trivial observation technique available. The only difference between mincore and the others will be the observation latency and hence channel bandwidth. IOWs, the question we need to focus on now is not "does breaking mincore affect anyone", it is "how the hell do we mitigate and isolate an information leak exposed by fundamental OS functionality that *everything* depends on for performance"? > It's also the one where our old > behavior was actually arguably simply stupid and actively wrong (ie > "in caches" isn't even strictly speaking a valid question, since the > caches in question may be invalid). This is irrelevant to the problem reported. Sure, mincore may be an awful interface, but it's semantics are not the cause of the information leak. You're just shooting the messenger... > I do think that the RWF_NOWAIT case might also be interesting to look at. As are all the other mechanisms you can use to observer page cache residency without perturbing it's state. Keep in mind that the researchers documented a remote observation technique that leaked the information across the network to a remote host, so this leak has much, much wider scope than changing mincore can address... Cheers, Dave. -- Dave Chinner david@fromorbit.com ^ permalink raw reply [flat|nested] 161+ messages in thread
* Re: [PATCH] mm/mincore: allow for making sys_mincore() privileged 2019-01-09 2:24 ` Dave Chinner @ 2019-01-09 2:31 ` Jiri Kosina 2019-01-09 4:39 ` Dave Chinner 0 siblings, 1 reply; 161+ messages in thread From: Jiri Kosina @ 2019-01-09 2:31 UTC (permalink / raw) To: Dave Chinner Cc: Linus Torvalds, Matthew Wilcox, Jann Horn, Andrew Morton, Greg KH, Peter Zijlstra, Michal Hocko, Linux-MM, kernel list, Linux API On Wed, 9 Jan 2019, Dave Chinner wrote: > > But mincore is certainly the easiest interface, and the one that > > doesn't require much effort or setup. > > Off the top of my head, here's a few vectors for reading the page > cache residency state without perturbing the page cache residency > pattern: > - mincore > - preadv2(RWF_NOWAIT) > - fadvise(POSIX_FADV_RANDOM); timed read(2) syscalls > - madvise(MADV_RANDOM); timed read of first byte in each page While I obviously agree that all those are creating pagecache sidechannel in principle, I think we really should mostly focus on the first two (with mincore() already having been covered). Rationale has been provided by Daniel Gruss in this thread -- if the attacker is left with cache timing as the only available vector, he's going to be much more successful with mounting hardware cache timing attack anyway. Thanks, -- Jiri Kosina SUSE Labs ^ permalink raw reply [flat|nested] 161+ messages in thread
* Re: [PATCH] mm/mincore: allow for making sys_mincore() privileged 2019-01-09 2:31 ` Jiri Kosina @ 2019-01-09 4:39 ` Dave Chinner 2019-01-09 10:08 ` Jiri Kosina 2019-01-09 18:25 ` Linus Torvalds 0 siblings, 2 replies; 161+ messages in thread From: Dave Chinner @ 2019-01-09 4:39 UTC (permalink / raw) To: Jiri Kosina Cc: Linus Torvalds, Matthew Wilcox, Jann Horn, Andrew Morton, Greg KH, Peter Zijlstra, Michal Hocko, Linux-MM, kernel list, Linux API On Wed, Jan 09, 2019 at 03:31:35AM +0100, Jiri Kosina wrote: > On Wed, 9 Jan 2019, Dave Chinner wrote: > > > > But mincore is certainly the easiest interface, and the one that > > > doesn't require much effort or setup. > > > > Off the top of my head, here's a few vectors for reading the page > > cache residency state without perturbing the page cache residency > > pattern: > > - mincore > > - preadv2(RWF_NOWAIT) > > - fadvise(POSIX_FADV_RANDOM); timed read(2) syscalls > > - madvise(MADV_RANDOM); timed read of first byte in each page > > While I obviously agree that all those are creating pagecache sidechannel > in principle, I think we really should mostly focus on the first two (with > mincore() already having been covered). FWIW, I just realised that the easiest, most reliable way to invalidate the page cache over a file range is simply to do a O_DIRECT read on it. IOWs, all three requirements of this information leak - highly specific, reliable cache invalidation control, controlled cache instantiation and 3rd-party detection of cache residency can all be performed with just the read(2) syscall... > Rationale has been provided by Daniel Gruss in this thread -- if the > attacker is left with cache timing as the only available vector, he's > going to be much more successful with mounting hardware cache timing > attack anyway. No, he said: "Restricting mincore() is sufficient to fix the hardware-agnostic part." That's not correct - preadv2(RWF_NOWAIT) is also hardware agnostic and provides exactly the same information about the page cache as mincore. Timed read/mmap access loops for cache observation are also hardware agnostic, and on fast SSD based storage will only be marginally slower bandwidth than preadv2(RWF_NOWAIT). Attackers will pick whatever leak vector we don't fix, so we either fix them all (which I think is probably impossible without removing caching altogether) or we start thinking about how we need to isolate the page cache so that information isn't shared across important security boundaries (e.g. page cache contents are per-mount namespace). Cheers, Dave. -- Dave Chinner david@fromorbit.com ^ permalink raw reply [flat|nested] 161+ messages in thread
* Re: [PATCH] mm/mincore: allow for making sys_mincore() privileged 2019-01-09 4:39 ` Dave Chinner @ 2019-01-09 10:08 ` Jiri Kosina 2019-01-10 1:15 ` Dave Chinner 2019-01-09 18:25 ` Linus Torvalds 1 sibling, 1 reply; 161+ messages in thread From: Jiri Kosina @ 2019-01-09 10:08 UTC (permalink / raw) To: Dave Chinner Cc: Linus Torvalds, Matthew Wilcox, Jann Horn, Andrew Morton, Greg KH, Peter Zijlstra, Michal Hocko, Linux-MM, kernel list, Linux API On Wed, 9 Jan 2019, Dave Chinner wrote: > FWIW, I just realised that the easiest, most reliable way to invalidate > the page cache over a file range is simply to do a O_DIRECT read on it. Neat, good catch indeed. Still, it's only the invalidation part, but the residency check is the crucial one. > > Rationale has been provided by Daniel Gruss in this thread -- if the > > attacker is left with cache timing as the only available vector, he's > > going to be much more successful with mounting hardware cache timing > > attack anyway. > > No, he said: > > "Restricting mincore() is sufficient to fix the hardware-agnostic > part." > > That's not correct - preadv2(RWF_NOWAIT) is also hardware agnostic and > provides exactly the same information about the page cache as mincore. Yeah, preadv2(RWF_NOWAIT) is in the same teritory as mincore(), it has "just" been overlooked. I can't speak for Daniel, but I believe he might be ok with rephrasing the above as "Restricting mincore() and RWF_NOWAIT is sufficient ...". > Timed read/mmap access loops for cache observation are also hardware > agnostic, and on fast SSD based storage will only be marginally slower > bandwidth than preadv2(RWF_NOWAIT). > > Attackers will pick whatever leak vector we don't fix, so we either fix > them all (which I think is probably impossible without removing caching > altogether) We can't really fix the fact that it's possible to do the timing on the HW caches though. > or we start thinking about how we need to isolate the page cache so that > information isn't shared across important security boundaries (e.g. page > cache contents are per-mount namespace). Umm, sorry for being dense, but how would that help that particular attack scenario on a system that doesn't really employ any namespacing? (which I still believe is a majority of the systems out there, but I might have just missed the containers train long time ago :) ). -- Jiri Kosina SUSE Labs ^ permalink raw reply [flat|nested] 161+ messages in thread
* Re: [PATCH] mm/mincore: allow for making sys_mincore() privileged 2019-01-09 10:08 ` Jiri Kosina @ 2019-01-10 1:15 ` Dave Chinner 2019-01-10 7:54 ` Jiri Kosina 0 siblings, 1 reply; 161+ messages in thread From: Dave Chinner @ 2019-01-10 1:15 UTC (permalink / raw) To: Jiri Kosina Cc: Linus Torvalds, Matthew Wilcox, Jann Horn, Andrew Morton, Greg KH, Peter Zijlstra, Michal Hocko, Linux-MM, kernel list, Linux API On Wed, Jan 09, 2019 at 11:08:57AM +0100, Jiri Kosina wrote: > On Wed, 9 Jan 2019, Dave Chinner wrote: > > > FWIW, I just realised that the easiest, most reliable way to invalidate > > the page cache over a file range is simply to do a O_DIRECT read on it. > > Neat, good catch indeed. Still, it's only the invalidation part, but the > residency check is the crucial one. > > > > Rationale has been provided by Daniel Gruss in this thread -- if the > > > attacker is left with cache timing as the only available vector, he's > > > going to be much more successful with mounting hardware cache timing > > > attack anyway. > > > > No, he said: > > > > "Restricting mincore() is sufficient to fix the hardware-agnostic > > part." > > > > That's not correct - preadv2(RWF_NOWAIT) is also hardware agnostic and > > provides exactly the same information about the page cache as mincore. > > Yeah, preadv2(RWF_NOWAIT) is in the same teritory as mincore(), it has > "just" been overlooked. I can't speak for Daniel, but I believe he might > be ok with rephrasing the above as "Restricting mincore() and RWF_NOWAIT > is sufficient ...". Good luck with restricting RWF_NOWAIT. I eagerly await all the fstests that exercise both the existing and new behaviours to demonstrate they work correctly. > > Timed read/mmap access loops for cache observation are also hardware > > agnostic, and on fast SSD based storage will only be marginally slower > > bandwidth than preadv2(RWF_NOWAIT). > > > > Attackers will pick whatever leak vector we don't fix, so we either fix > > them all (which I think is probably impossible without removing caching > > altogether) > > We can't really fix the fact that it's possible to do the timing on the HW > caches though. We can't really fix the fact that it's possible to do the timing on the page cache, either. > > or we start thinking about how we need to isolate the page cache so that > > information isn't shared across important security boundaries (e.g. page > > cache contents are per-mount namespace). > > Umm, sorry for being dense, but how would that help that particular attack > scenario on a system that doesn't really employ any namespacing? What's your security boundary? The "detect what code an app is running" exploit is based on invalidating and then observing how shared, non-user-owned files mapped with execute privileges change cache residency. If the security boundary is within the local container, should users inside that container be allowed to invalidate the cache of executable files and libraries they don't own? In this case, we can't stop observation, because that only require read permissions and high precision timing, hence the only thing that can be done here is prevent non-owners from invalidating the page cache. If the security boundary is a namespace or guest VM, then permission checks don't work - the user may own the file within that container. This problem now is that the page cache is observable and controllable from both sides of the fence. Hence the only way to prevent observation of the code being run in a different namespace is to prevent the page being shared across both containers. The exfiltration exploit requires the page cache to be observable and controllable on both sides of the security boundary. Should users be able to observe and control the cached pages accessed by a different container? KSM page deduplication lessons say no. This is an even harder problem, because page cache residency can be observed from remote machines.... What scares me is that new features being proposed could make our exposure a whole lot worse. e.g. the recent virtio-pmem ("fake-dax") proposal will directly share host page cache pages into guest VMs w/ DAX capability. i.e. the guest directly accesses the host page cache. This opens up the potential for host page cache timing attacks from the guest VMs, and potential guest to guest observation/exploitation is possible if the same files are mapped into multiple guests.... IOws the two questions here are simply: "What's your security boundary?" and "Is the page cache visible and controllable on both sides?". Cheers, Dave. -- Dave Chinner david@fromorbit.com ^ permalink raw reply [flat|nested] 161+ messages in thread
* Re: [PATCH] mm/mincore: allow for making sys_mincore() privileged 2019-01-10 1:15 ` Dave Chinner @ 2019-01-10 7:54 ` Jiri Kosina 0 siblings, 0 replies; 161+ messages in thread From: Jiri Kosina @ 2019-01-10 7:54 UTC (permalink / raw) To: Dave Chinner Cc: Linus Torvalds, Matthew Wilcox, Jann Horn, Andrew Morton, Greg KH, Peter Zijlstra, Michal Hocko, Linux-MM, kernel list, Linux API On Thu, 10 Jan 2019, Dave Chinner wrote: > > Yeah, preadv2(RWF_NOWAIT) is in the same teritory as mincore(), it has > > "just" been overlooked. I can't speak for Daniel, but I believe he might > > be ok with rephrasing the above as "Restricting mincore() and RWF_NOWAIT > > is sufficient ...". > > Good luck with restricting RWF_NOWAIT. I eagerly await all the > fstests that exercise both the existing and new behaviours to > demonstrate they work correctly. Well, we can still resurrect my original aproach of doing this opt-in based on a sysctl setting, and letting the admin choose his poison. If 'secure' mode is selected, RWF_NOWAIT will then probably just always fail wit EAGAIN. -- Jiri Kosina SUSE Labs ^ permalink raw reply [flat|nested] 161+ messages in thread
* Re: [PATCH] mm/mincore: allow for making sys_mincore() privileged 2019-01-09 4:39 ` Dave Chinner 2019-01-09 10:08 ` Jiri Kosina @ 2019-01-09 18:25 ` Linus Torvalds 2019-01-10 0:44 ` Dave Chinner 1 sibling, 1 reply; 161+ messages in thread From: Linus Torvalds @ 2019-01-09 18:25 UTC (permalink / raw) To: Dave Chinner Cc: Jiri Kosina, Matthew Wilcox, Jann Horn, Andrew Morton, Greg KH, Peter Zijlstra, Michal Hocko, Linux-MM, kernel list, Linux API On Tue, Jan 8, 2019 at 8:39 PM Dave Chinner <david@fromorbit.com> wrote: > > FWIW, I just realised that the easiest, most reliable way to > invalidate the page cache over a file range is simply to do a > O_DIRECT read on it. If that's the case, that's actually an O_DIRECT bug. It should only invalidate the caches on write. On reads, it wants to either _flush_ any direct caches before the read, or just take the data from the caches. At no point is "invalidate" a valid model. Of course, I'm not in the least bit shocked if O_DIRECT is buggy like this. But looking at least at the ext4 routine, the read just does ret = filemap_write_and_wait_range(mapping, iocb->ki_pos, and I don't see any invalidation. Having read access to a file absolutely should *not* mean that you can flush caches on it. That's a write op. Any filesystem that invalidates the caches on read is utterly buggy. Can you actually point to such a thing? Let's get that fixed, because it's completely wrong regardless of this whole mincore issue. Linus ^ permalink raw reply [flat|nested] 161+ messages in thread
* Re: [PATCH] mm/mincore: allow for making sys_mincore() privileged 2019-01-09 18:25 ` Linus Torvalds @ 2019-01-10 0:44 ` Dave Chinner 2019-01-10 1:18 ` Linus Torvalds ` (2 more replies) 0 siblings, 3 replies; 161+ messages in thread From: Dave Chinner @ 2019-01-10 0:44 UTC (permalink / raw) To: Linus Torvalds Cc: Jiri Kosina, Matthew Wilcox, Jann Horn, Andrew Morton, Greg KH, Peter Zijlstra, Michal Hocko, Linux-MM, kernel list, Linux API On Wed, Jan 09, 2019 at 10:25:43AM -0800, Linus Torvalds wrote: > On Tue, Jan 8, 2019 at 8:39 PM Dave Chinner <david@fromorbit.com> wrote: > > > > FWIW, I just realised that the easiest, most reliable way to > > invalidate the page cache over a file range is simply to do a > > O_DIRECT read on it. > > If that's the case, that's actually an O_DIRECT bug. > > It should only invalidate the caches on write. Sounds nice from a theoretical POV, but reality has taught us very different lessons. FWIW, a quick check of XFS's history so you understand how long this behaviour has been around. It was introduced in the linux port in 2001 as direct IO support was being added: commit e837eac23662afae603aaaef7c94bc839c1b8f67 Author: Steve Lord <lord@sgi.com> Date: Mon Mar 5 16:47:52 2001 +0000 Add bounds checking for direct I/O, do the cache invalidation for data coherency on direct I/O. This was basically a direct port of the flush+invalidation code in the Irix direct IO path, which was introduced in 1995: > revision 1.149 > date: 1995/08/11 20:09:44; author: ajs; state: Exp; lines: +70 -2 > 280514 Adding page cache flusing calls to make direct > I/O coherent with buffered I/O. IOWs, history tells us that invalidation for direct IO reads has been done on XFS for almost 25 years. I know for certain that there have been applications out there that depend on this invalidation-on-read behaviour (another of those "reality bites" lessons) so we can't just remove it because you *think* it is a bug. i.e. we *could* remove the invalidation on read, but this we have a major behavioural change to the XFS direct IO path. This means we need to determine if we've just awoken sleeping data corruption krakens as well as determine if there are any performance regressions that result from the behavioural change. Which brings me to validation. If the recent clone/dedupe/copy_file_range() debacle has taught me anything, it's that validating a "simple" IO path mechanism is going to take months worth of machine time before we have any confidence that the change is not going to expose users to new data corruption problems. That's the difficulty here - it only takes 5 minutes to change the code, but there's months of machine time needed to determine if it's really safe to make that code change. Testing has a nasty habit of finding invalid assumptions; when those are assumptions about data coherency and integrity we can't test them on our users. And, really, this would be just another band-aid over a symptom of the information leak - it doesn't prevent users from being able to control page cache invalidation. It just removes one method, just like hacking mincore only removes one method of observing the page cache. And, like mincore(), there's every chance it impacts on userspace in a negative manner and so we need to be very careful here. > On reads, it wants to either _flush_ any direct caches before the > read, or just take the data from the caches. At no point is > "invalidate" a valid model. > > Of course, I'm not in the least bit shocked if O_DIRECT is buggy like > this. But looking at least at the ext4 routine, the read just does > > ret = filemap_write_and_wait_range(mapping, iocb->ki_pos, > > and I don't see any invalidation. I wouldn't look at ext4 as an example of a reliable, problem free direct IO implementation because, historically speaking, it's been a series of nasty hacks (*cough* mount -o dioread_nolock *cough*) and been far worse than XFS from data integrity, performance and reliability perspectives. IMO, "because ext4" has been a poor reason for justifying anything for a long time, not the least when talking about features that didn't even originate in extN.... > Can you actually point to such a thing? Let's get that fixed, because > it's completely wrong regardless of this whole mincore issue. The biggest problem that remains today is that we have no mechanism for serialising page faults against DIO. If we leave pages cached in memory while we have a AIO+DIO read (or write!) in progress, we can dirty the page and run a buffered read before the AIO+DIO read returns. This now leaves us in the state where where the AIO+DIO read returns different (stale) data to a buffered read that has already completed because it hit the dirty page cache. i.e. we still have nasty page cache vs direct IO coherency problems, and they are largely unsolvable because of the limitations of the core kernel infrastructure architecture. Yes, you can argue that userspace is doing an insane thing, but every so often we come across coherency issues like this that are out of a user's control (e.g. backup scan vs app accesses) and we do our best to ensure that they don't cause problems given the constraints we have. Invalidating the page cache on dio reads mostly mitigates these coherency race conditions and that's why it's still there in the XFS code paths... Cheers, Dave. -- Dave Chinner david@fromorbit.com ^ permalink raw reply [flat|nested] 161+ messages in thread
* Re: [PATCH] mm/mincore: allow for making sys_mincore() privileged 2019-01-10 0:44 ` Dave Chinner @ 2019-01-10 1:18 ` Linus Torvalds 2019-01-10 5:26 ` Andy Lutomirski 2019-01-10 7:03 ` Dave Chinner 2019-01-10 14:50 ` Matthew Wilcox 2019-01-11 7:36 ` Jiri Kosina 2 siblings, 2 replies; 161+ messages in thread From: Linus Torvalds @ 2019-01-10 1:18 UTC (permalink / raw) To: Dave Chinner Cc: Jiri Kosina, Matthew Wilcox, Jann Horn, Andrew Morton, Greg KH, Peter Zijlstra, Michal Hocko, Linux-MM, kernel list, Linux API On Wed, Jan 9, 2019 at 4:44 PM Dave Chinner <david@fromorbit.com> wrote: > > I wouldn't look at ext4 as an example of a reliable, problem free > direct IO implementation because, historically speaking, it's been a > series of nasty hacks (*cough* mount -o dioread_nolock *cough*) and > been far worse than XFS from data integrity, performance and > reliability perspectives. That's some big words from somebody who just admitted to much worse hacks. Seriously. XFS is buggy in this regard, ext4 apparently isn't. Thinking that it's better to just invalidate the cache for direct IO reads is all kinds of odd. Linus ^ permalink raw reply [flat|nested] 161+ messages in thread
* Re: [PATCH] mm/mincore: allow for making sys_mincore() privileged 2019-01-10 1:18 ` Linus Torvalds @ 2019-01-10 5:26 ` Andy Lutomirski 2019-01-10 14:47 ` Matthew Wilcox 2019-01-11 1:47 ` Dave Chinner 2019-01-10 7:03 ` Dave Chinner 1 sibling, 2 replies; 161+ messages in thread From: Andy Lutomirski @ 2019-01-10 5:26 UTC (permalink / raw) To: Linus Torvalds Cc: Dave Chinner, Jiri Kosina, Matthew Wilcox, Jann Horn, Andrew Morton, Greg KH, Peter Zijlstra, Michal Hocko, Linux-MM, kernel list, Linux API On Wed, Jan 9, 2019 at 5:18 PM Linus Torvalds <torvalds@linux-foundation.org> wrote: > > On Wed, Jan 9, 2019 at 4:44 PM Dave Chinner <david@fromorbit.com> wrote: > > > > I wouldn't look at ext4 as an example of a reliable, problem free > > direct IO implementation because, historically speaking, it's been a > > series of nasty hacks (*cough* mount -o dioread_nolock *cough*) and > > been far worse than XFS from data integrity, performance and > > reliability perspectives. > > That's some big words from somebody who just admitted to much worse hacks. > > Seriously. XFS is buggy in this regard, ext4 apparently isn't. > > Thinking that it's better to just invalidate the cache for direct IO > reads is all kinds of odd. > This whole discussion seems to have gone a little bit off the rails... Linus, I think I agree with Dave's overall sentiment, though, and I think you should consider reverting your patch. Here's why. The basic idea behind the attack is that the authors found efficient ways to do two things: evict a page from page cache and detect, *without filling the cache*, whether a page is cached. The combination lets an attacker efficiently tell when another process reads a page. We need to keep in mind that this attack is a sophisticated attack, and anyone using it won't have any problem using a nontrivial way to detect whether a page is in page cache. So, unless we're going to try for real to make it hard to tell whether a page is cached without causing that page to become cached, it's not worth playing whack-a-mole. And, right now, mincore is whacking a mole. RWF_NOWAIT appears to do essentially the same thing at very little cost. I haven't really dug in, but I assume that various prefaulting tricks combined with various pagetable probing tricks can do similar things, but that's at least a *lot* more complicated. So unless we're going to lock down RWF_NOWAIT as well, I see no reason to lock down mincore(). Direct IO is a red herring -- O_DIRECT is destructive enough that it seems likely to make the attacks a lot less efficient. --- begin digression --- Since direct IO has been brought up, I have a question. I've wondered for years why direct IO works the way it does. If I were implementing it from scratch, my first inclination would be to use the page cache instead of fighting it. To do a single-page direct read, I would look that page up in the page cache (i.e. i_pages these days). If the page is there, I would do a normal buffered read. If the page is not there, I would insert a record into i_pages indicating that direct IO is in progress and then I would do the IO into the destination page. If any other read, direct or otherwise, sees a record saying "under direct IO", it would wait. To do a single-page direct write, I would look it up in i_pages. If it's there, I would do a buffered write followed by a sync (because applications expect a sync). If it's not there, I would again add a record saying "under direct IO" and do the IO. The idea is that this works as much like buffered IO as possible, except that the pages backing the IO aren't normal sharable page cache pages. The multi-page case would be just an optimization on top of the single-page case. The idea would be to somehow mark i_pages with entire extents under direct IO. It's a radix tree -- this can, at least in theory, be done efficiently. As long as all direct IO operations run in increasing order of offset, there shouldn't be lock ordering problems. Other than history and possibly performance, is there any reason that direct IO doesn't work this way? P.S. What, if anything, prevents direct writes from causing trouble when the underlying FS or backing store needs stable pages? Similarly, what, if anything, prevents direct reads from temporarily exposing unintended data to user code if the fs or underlying device transforms the data during the read process (e.g. by decrypting something)? --- end digression --- ^ permalink raw reply [flat|nested] 161+ messages in thread
* Re: [PATCH] mm/mincore: allow for making sys_mincore() privileged 2019-01-10 5:26 ` Andy Lutomirski @ 2019-01-10 14:47 ` Matthew Wilcox 2019-01-10 21:44 ` Dave Chinner 2019-01-11 1:47 ` Dave Chinner 1 sibling, 1 reply; 161+ messages in thread From: Matthew Wilcox @ 2019-01-10 14:47 UTC (permalink / raw) To: Andy Lutomirski Cc: Linus Torvalds, Dave Chinner, Jiri Kosina, Jann Horn, Andrew Morton, Greg KH, Peter Zijlstra, Michal Hocko, Linux-MM, kernel list, Linux API On Wed, Jan 09, 2019 at 09:26:41PM -0800, Andy Lutomirski wrote: > Since direct IO has been brought up, I have a question. I've wondered > for years why direct IO works the way it does. If I were implementing > it from scratch, my first inclination would be to use the page cache > instead of fighting it. To do a single-page direct read, I would look > that page up in the page cache (i.e. i_pages these days). If the page > is there, I would do a normal buffered read. If the page is not > there, I would insert a record into i_pages indicating that direct IO > is in progress and then I would do the IO into the destination page. > If any other read, direct or otherwise, sees a record saying "under > direct IO", it would wait. OK, you're in the same ballpark I am ;-) Kent Overstreet pointed out that what you want to do here is great for the mixed case, but it's pretty inefficient for IOs to files which are wholly uncached. So what I'm currently thinking about is an rwsem which works like this: O_DIRECT task: if i_pages is empty, take rwsem for read, recheck i_pages is empty, do IO, drop rwsem. if i_pages is not empty, insert XA_LOCK_ENTRY, when IO complete, wake waitqueue for that (mapping, index). buffered IO: if i_pages is empty, take rwsem for write, allocate page, insert page, drop rwsem. if i_pages is not empty, look up index, if entry is XA_LOCK_ENTRY sleep on waitqueue. otherwise proceed as now. ^ permalink raw reply [flat|nested] 161+ messages in thread
* Re: [PATCH] mm/mincore: allow for making sys_mincore() privileged 2019-01-10 14:47 ` Matthew Wilcox @ 2019-01-10 21:44 ` Dave Chinner 2019-01-10 21:59 ` Linus Torvalds 0 siblings, 1 reply; 161+ messages in thread From: Dave Chinner @ 2019-01-10 21:44 UTC (permalink / raw) To: Matthew Wilcox Cc: Andy Lutomirski, Linus Torvalds, Jiri Kosina, Jann Horn, Andrew Morton, Greg KH, Peter Zijlstra, Michal Hocko, Linux-MM, kernel list, Linux API On Thu, Jan 10, 2019 at 06:47:11AM -0800, Matthew Wilcox wrote: > On Wed, Jan 09, 2019 at 09:26:41PM -0800, Andy Lutomirski wrote: > > Since direct IO has been brought up, I have a question. I've wondered > > for years why direct IO works the way it does. If I were implementing > > it from scratch, my first inclination would be to use the page cache > > instead of fighting it. To do a single-page direct read, I would look > > that page up in the page cache (i.e. i_pages these days). If the page > > is there, I would do a normal buffered read. If the page is not > > there, I would insert a record into i_pages indicating that direct IO > > is in progress and then I would do the IO into the destination page. > > If any other read, direct or otherwise, sees a record saying "under > > direct IO", it would wait. > > OK, you're in the same ballpark I am ;-) Kent Overstreet pointed out > that what you want to do here is great for the mixed case, but it's > pretty inefficient for IOs to files which are wholly uncached. > > So what I'm currently thinking about is an rwsem which works like this: > > O_DIRECT task: > if i_pages is empty, take rwsem for read, recheck i_pages is empty, do IO, > drop rwsem. GUP does page fault on user buffer which is a mmapped region of same file. page fault sets up for buffered IO, tries to take rwsem for write, deadlocks. Most of the schemes we come up with fall down at this point - you can't hold a lock over gup that is also used in the buffered IO path. That's why XFS (and now ext4) have the IOLOCK and MMAPLOCK for truncation serialisation - we can't lock out both read()/write() and mmap IO paths with the same lock... > if i_pages is not empty, insert XA_LOCK_ENTRY, when IO complete, wake waitqueue for that (mapping, index). I assume you really mean add a tag to the entry? But this means there is no record ofthe direct IO being in flight except for the rwsem being held across the IO. Even if we did insert a flag to say "DIO in progress" and not rely on the lock.... > buffered IO: > if i_pages is empty, take rwsem for write, allocate page, insert page, drop rwsem. > if i_pages is not empty, look up index, if entry is XA_LOCK_ENTRY sleep on > waitqueue. otherwise proceed as now. ... we'll sleep on that flags in the page fault and deadlock anyway. I'm pretty sure we explored this "record DIO state in the radix tree" 2 or 3 years ago and came to the conclusion that it didn't work for reasons like the above. i.e. it doesn't solve the problems we currently have with locking and serialisation between DIO and mmap... Cheers, Dave. -- Dave Chinner david@fromorbit.com ^ permalink raw reply [flat|nested] 161+ messages in thread
* Re: [PATCH] mm/mincore: allow for making sys_mincore() privileged 2019-01-10 21:44 ` Dave Chinner @ 2019-01-10 21:59 ` Linus Torvalds 0 siblings, 0 replies; 161+ messages in thread From: Linus Torvalds @ 2019-01-10 21:59 UTC (permalink / raw) To: Dave Chinner Cc: Matthew Wilcox, Andy Lutomirski, Jiri Kosina, Jann Horn, Andrew Morton, Greg KH, Peter Zijlstra, Michal Hocko, Linux-MM, kernel list, Linux API On Thu, Jan 10, 2019 at 1:44 PM Dave Chinner <david@fromorbit.com> wrote: > > GUP does page fault on user buffer which is a mmapped region of same > file. page fault sets up for buffered IO, tries to take rwsem for > write, deadlocks. > > Most of the schemes we come up with fall down at this point - you > can't hold a lock over gup that is also used in the buffered IO > path. That's why XFS (and now ext4) have the IOLOCK and MMAPLOCK > for truncation serialisation - we can't lock out both read()/write() > and mmap IO paths with the same lock... Side note: a somewhat similar version of is true even in the absence of GUP and dio, for the case of doing a mmap of a file, and then reading or writing from the mapped region into the file itself. There are "interesting" locking scenarios wrt just holding the page locked, and trying to then fill that page with information with just a regular "copy_from_user()". Page fault -> try to read the file -> oops, the page we're trying to read from is locked because we're trying to write to it. So we have that odd dance in generic_perform_write() which does - touch the first user byte without holding any lock - do write_begin() (which gets the page lock) - copy from user space using the "atomic" copy (which just gives up on fault) - if nothing got copied, go back and try again with a smaller copy that can't cross a page. We might have raced with pageout. It might be possible to do something similar for direct IO, although simpler: just do the GUP entirely atomically (and in the fault case just fall back to non-direct IO). Linus ^ permalink raw reply [flat|nested] 161+ messages in thread
* Re: [PATCH] mm/mincore: allow for making sys_mincore() privileged 2019-01-10 5:26 ` Andy Lutomirski 2019-01-10 14:47 ` Matthew Wilcox @ 2019-01-11 1:47 ` Dave Chinner 1 sibling, 0 replies; 161+ messages in thread From: Dave Chinner @ 2019-01-11 1:47 UTC (permalink / raw) To: Andy Lutomirski Cc: Linus Torvalds, Jiri Kosina, Matthew Wilcox, Jann Horn, Andrew Morton, Greg KH, Peter Zijlstra, Michal Hocko, Linux-MM, kernel list, Linux API On Wed, Jan 09, 2019 at 09:26:41PM -0800, Andy Lutomirski wrote: > Since direct IO has been brought up, I have a question. I've wondered > for years why direct IO works the way it does. If I were implementing > it from scratch, my first inclination would be to use the page cache > instead of fighting it. To do a single-page direct read, I would look > that page up in the page cache (i.e. i_pages these days). If the page > is there, I would do a normal buffered read. If the page is not Therein lies the problem. Copying data is prohibitively expensive, and that's the primary reason for O_DIRECT existing. i.e. O_DIRECT is a low-overhead, zero-copy data movement interface. The moment we switch from using CPU to dispatch IO to copying data, performance goes down because we will be unable to keep storage pipelines full. IOWs, any rework of O_DIRECT that involves copying data is a non-starter. But let's bring this back to the issue at hand - observability of page cache residency of file pages. If th epage is caceh resident, then it will have a latency of copying that data out of the page (i.e. very low latency). If the page is not resident, then it will do IO and take much, much longer to complete. i.e. we have clear timing differences between cachce hit and cache miss IO. This is exactly the timing information needed for observing page cache residency. We need to work out how to make page cache residency less observable, not add new, near perfect observation mechanisms that third parties can easily exploit... Cheers, Dave. -- Dave Chinner david@fromorbit.com ^ permalink raw reply [flat|nested] 161+ messages in thread
* Re: [PATCH] mm/mincore: allow for making sys_mincore() privileged 2019-01-10 1:18 ` Linus Torvalds 2019-01-10 5:26 ` Andy Lutomirski @ 2019-01-10 7:03 ` Dave Chinner 2019-01-10 11:47 ` Linus Torvalds 1 sibling, 1 reply; 161+ messages in thread From: Dave Chinner @ 2019-01-10 7:03 UTC (permalink / raw) To: Linus Torvalds Cc: Jiri Kosina, Matthew Wilcox, Jann Horn, Andrew Morton, Greg KH, Peter Zijlstra, Michal Hocko, Linux-MM, kernel list, Linux API On Wed, Jan 09, 2019 at 05:18:21PM -0800, Linus Torvalds wrote: > On Wed, Jan 9, 2019 at 4:44 PM Dave Chinner <david@fromorbit.com> wrote: > > > > I wouldn't look at ext4 as an example of a reliable, problem free > > direct IO implementation because, historically speaking, it's been a > > series of nasty hacks (*cough* mount -o dioread_nolock *cough*) and > > been far worse than XFS from data integrity, performance and > > reliability perspectives. > > That's some big words from somebody who just admitted to much worse hacks. Sorry, what hacks did I just admit to making? This O_DIRECT behaviour long predates me - I'm just the messenger and you are shooting from the hip. Linus, the point I was making is that there are many, many ways to control page cache invalidation and measure page cache residency, and that trying to address them one-by-one is just a game of whack-a-mole. In future, can you please try not to go off the rails when someone mentions O_DIRECT? You have a terrible habit of going off on misdirected rants about O_DIRECT and/or XFS at any opportunity you can get, and all it does is derail whatever useful conversation was taking place. > Seriously. XFS is buggy in this regard, ext4 apparently isn't. So you keep asserting despite being presented with evidence that it mitigates other longstanding bugs that are really hard to solve. Ignoring all the evidence you've been presented with and re-asserting your original statement doesn't make it correct. Did you not think to ask "what are those problems, and what can do to solve them so we can remove the invalidation mitigations that XFS uses?". That would be a useful contribution, whereas shouting about how O_DIRECT is broken just pisses off the people working their asses off to fix the problems you just heard about and are ranting about. > Thinking that it's better to just invalidate the cache for direct IO > reads is all kinds of odd. No, it's practicality. If we can't fix the problem, we have to mitigate it. When we fix the underlying problem we can remove the mitigation code. having you assert that it's broken and demand that it be removed doesn't change the fact that we haven't fixed the underlying problems. It's being worked on, but we're not there yet. -Dave. -- Dave Chinner david@fromorbit.com ^ permalink raw reply [flat|nested] 161+ messages in thread
* Re: [PATCH] mm/mincore: allow for making sys_mincore() privileged 2019-01-10 7:03 ` Dave Chinner @ 2019-01-10 11:47 ` Linus Torvalds 2019-01-10 12:24 ` Dominique Martinet 0 siblings, 1 reply; 161+ messages in thread From: Linus Torvalds @ 2019-01-10 11:47 UTC (permalink / raw) To: Dave Chinner Cc: Jiri Kosina, Matthew Wilcox, Jann Horn, Andrew Morton, Greg KH, Peter Zijlstra, Michal Hocko, Linux-MM, kernel list, Linux API On Wed, Jan 9, 2019 at 11:04 PM Dave Chinner <david@fromorbit.com> wrote: > > Sorry, what hacks did I just admit to making? This O_DIRECT > behaviour long predates me - I'm just the messenger and you are > shooting from the hip. Sure, sorry. I find this whole thing annoying. > Linus, the point I was making is that there are many, many ways to > control page cache invalidation and measure page cache residency, > and that trying to address them one-by-one is just a game of > whack-a-mole. .. and I agree. But let's a step back.Because there are different issues. First off, the whole page cache attack is not necessarily something many people will care about. As has been pointed out, it's often a matter of convenience and (relative) portability. And no, we're *never* going to stop all side channel leaks. Some parts of caching (notably the timing effects of it) are pretty fundamental. So at no point is this going to be some kind of absolute line in the sand _anyway_. There is no black-and-white "you're protected", there's only levels of convenience. A remote attacker is hopefully going to be limited by the interfaces to just timing attacks, although who knows what something like JS might expose. Presumably neither mincore() nor arbitrary O_DIRECT or pread2() flags. Anyway, the reason I was trying to plug mincore() is largely that that code didn't make much sense to begin with, and simply this: mm/mincore.c | 94 +++++++++--------------------------------------------------- 1 file changed, 13 insertions(+), 81 deletions(-) if we can make people happier by removing lines of code and making the semantics more clear anyway, it's worth trying. No? Is that everything? No. As mentioned, you'll never get to that "ok, we plugged everything" point anyway. But removing a fairly easy way to probe the cache that has no real upsides should be fairly non-controversial. But I do have to say that in many ways the page cache is *not* a great attack vector because there's often lots of it, and it's fairly hard to control. Once something is in the page cache for whatever reason, it tends to be pretty sticky, and flushing it tends to be fairly hard to predict. And a cheap and residency (whether a simple probe like mincore of or a NOWAIT flag) check is actually important just to try to control the flushing part. Brute-forcing the flushing is generally very expensive, but if you can't even see if you flushed it, it's way more so. If there's a way to control the cache residency directly, that's actually a much bigger hole than any residency check ever were. Because once you can flush caches by reading, at that point you can just flush a particular page and look at the IO stats for the root partition or something. No residency check even needed. So I do think that yes, as long as you can do a directed cache flush, mincore is *entirely* immaterial. Still, giving mincore clearer semantics and simpler code? Win-win. (Except, of course, if somebody actually notices outside of tests. Which may well happen and just force us to revert that commit. But that's a separate issue entirely). But I do think that we should strive to *never* invalidate caches on read accesses. I don't actually see where you are doing that, honestly: at least dio_complete() only does it for writes. So I'm actually hoping that you are mis-remembering this and it turns out that O_DIRECT reads don't invalidate caches. Linus ^ permalink raw reply [flat|nested] 161+ messages in thread
* Re: [PATCH] mm/mincore: allow for making sys_mincore() privileged 2019-01-10 11:47 ` Linus Torvalds @ 2019-01-10 12:24 ` Dominique Martinet 2019-01-10 22:11 ` Linus Torvalds 0 siblings, 1 reply; 161+ messages in thread From: Dominique Martinet @ 2019-01-10 12:24 UTC (permalink / raw) To: Linus Torvalds Cc: Dave Chinner, Jiri Kosina, Matthew Wilcox, Jann Horn, Andrew Morton, Greg KH, Peter Zijlstra, Michal Hocko, Linux-MM, kernel list, Linux API Linus Torvalds wrote on Thu, Jan 10, 2019: > (Except, of course, if somebody actually notices outside of tests. > Which may well happen and just force us to revert that commit. But > that's a separate issue entirely). Both Dave and I pointed at a couple of utilities that break with this. nocache can arguably work with the new behaviour but will behave differently; vmtouch on the other hand is no longer able to display what's in cache or not - people use that for example to "warm up" a container in page cache based on how it appears after it had been running for a while is a pretty valid usecase to me. >From the list Kevin harvested out of the debian code search, the postgresql use case is pretty similar - probe what pages of the database were in cache at shutdown so when you restart it you can preload these and reach "cruse speed" faster. Sure that's probably not billions of users but this all looks fairly valid to me... -- Dominique ^ permalink raw reply [flat|nested] 161+ messages in thread
* Re: [PATCH] mm/mincore: allow for making sys_mincore() privileged 2019-01-10 12:24 ` Dominique Martinet @ 2019-01-10 22:11 ` Linus Torvalds 2019-01-11 2:03 ` Dave Chinner ` (2 more replies) 0 siblings, 3 replies; 161+ messages in thread From: Linus Torvalds @ 2019-01-10 22:11 UTC (permalink / raw) To: Dominique Martinet Cc: Dave Chinner, Jiri Kosina, Matthew Wilcox, Jann Horn, Andrew Morton, Greg KH, Peter Zijlstra, Michal Hocko, Linux-MM, kernel list, Linux API On Thu, Jan 10, 2019 at 4:25 AM Dominique Martinet <asmadeus@codewreck.org> wrote: > > Linus Torvalds wrote on Thu, Jan 10, 2019: > > (Except, of course, if somebody actually notices outside of tests. > > Which may well happen and just force us to revert that commit. But > > that's a separate issue entirely). > > Both Dave and I pointed at a couple of utilities that break with > this. nocache can arguably work with the new behaviour but will behave > differently; vmtouch on the other hand is no longer able to display > what's in cache or not - people use that for example to "warm up" a > container in page cache based on how it appears after it had been > running for a while is a pretty valid usecase to me. So honestly, the main reason I'm loath to revert is that yes, we know of theoretical differences, but they seem to all be performance-related. It would be really good to hear numbers. Is the warm-up optimization something that changes things from 3ms to 3.5ms? Or does it change things from 3ms to half a second? Because Dave is absolutely correct that mincore() isn't really even all that interesting an information leak if you can do the same with RWF_NOWAIT. But the other side of that same coin is that if we're not able to block mincore() sanely, then there's no point at looking at RWF_NOWAIT either. And we *can* do sane things about RWF_NOWAIT. For example, we could start async IO on RWF_NOWAIT, and suddenly it would go from "probe the page cache" to "probe and fill", and be much harder to use as an attack vector.. Do we want to do that? Maybe, maybe not. But if mincore() can't be fixed, there's no point in even trying. Now, if the mincore() change results in a big performance hit for people who use it as a heuristic for filling caches etc, then reverting the trial balloon is obviously something we must do, but at that point I'd also like to know which load it was that cared so much, and just what it did. Because we did have an alternate patch that just said "was the file writably opened, then we can do the page cache probing". But at least one user (fincore) didn't do even that. So right now, I consider the mincore change to be a "try to probe the state of mincore users", and we haven't really gotten a lot of information back yet. Linus ^ permalink raw reply [flat|nested] 161+ messages in thread
* Re: [PATCH] mm/mincore: allow for making sys_mincore() privileged 2019-01-10 22:11 ` Linus Torvalds @ 2019-01-11 2:03 ` Dave Chinner 2019-01-11 2:18 ` Linus Torvalds 2019-01-11 4:57 ` Dominique Martinet 2019-01-16 0:42 ` Josh Snyder 2 siblings, 1 reply; 161+ messages in thread From: Dave Chinner @ 2019-01-11 2:03 UTC (permalink / raw) To: Linus Torvalds Cc: Dominique Martinet, Jiri Kosina, Matthew Wilcox, Jann Horn, Andrew Morton, Greg KH, Peter Zijlstra, Michal Hocko, Linux-MM, kernel list, Linux API On Thu, Jan 10, 2019 at 02:11:01PM -0800, Linus Torvalds wrote: > And we *can* do sane things about RWF_NOWAIT. For example, we could > start async IO on RWF_NOWAIT, and suddenly it would go from "probe the > page cache" to "probe and fill", and be much harder to use as an > attack vector.. We can only do that if the application submits the read via AIO and has an async IO completion reporting mechanism. Otherwise we have to wait for the IO to complete to copy the data into the user's buffer. And given that the app is using RWF_NOWAIT to explicitly avoid blocking on the IO.... Also, keep in mind that RWF_NOWAIT also prevents blocking on filesystem locks and full request queues. One of the prime drivers of RWF_NOWAIT was to prevent AIO submission from blocking on filesystem locks - it allows userspace to submit other IO while it can't get all the access it requires to a single file or a single block device is congested. Hence I don't think there's a such a simple answer here - blocking for IO breaks RWF_NOWAIT. Cheers, Dave. -- Dave Chinner david@fromorbit.com ^ permalink raw reply [flat|nested] 161+ messages in thread
* Re: [PATCH] mm/mincore: allow for making sys_mincore() privileged 2019-01-11 2:03 ` Dave Chinner @ 2019-01-11 2:18 ` Linus Torvalds 2019-01-11 4:04 ` Dave Chinner 0 siblings, 1 reply; 161+ messages in thread From: Linus Torvalds @ 2019-01-11 2:18 UTC (permalink / raw) To: Dave Chinner Cc: Dominique Martinet, Jiri Kosina, Matthew Wilcox, Jann Horn, Andrew Morton, Greg KH, Peter Zijlstra, Michal Hocko, Linux-MM, kernel list, Linux API On Thu, Jan 10, 2019 at 6:03 PM Dave Chinner <david@fromorbit.com> wrote: > > On Thu, Jan 10, 2019 at 02:11:01PM -0800, Linus Torvalds wrote: > > And we *can* do sane things about RWF_NOWAIT. For example, we could > > start async IO on RWF_NOWAIT, and suddenly it would go from "probe the > > page cache" to "probe and fill", and be much harder to use as an > > attack vector.. > > We can only do that if the application submits the read via AIO and > has an async IO completion reporting mechanism. Oh, no, you misunderstand. RWF_NOWAIT has a lot of situations where it will potentially return early (the DAX and direct IO ones have their own), but I was thinking of the one in generic_file_buffered_read(), which triggers when you don't find a page mapping. That looks like the obvious "probe page cache" case. But we could literally move that test down just a few lines. Let it start read-ahead. .. and then it will actually trigger on the *second* case instead, where we have if (!PageUptodate(page)) { if (iocb->ki_flags & IOCB_NOWAIT) { put_page(page); goto would_block; } and that's where RWF_MNOWAIT would act. It would still return EAGAIN. But it would have started filling the page cache. So now the act of probing would fill the page cache, and the attacker would be left high and dry - the fact that the page cache now exists is because of the attack, not because of whatever it was trying to measure. See? But obviously this kind of change only matters if we also have mincore() not returning the probe data. mincore() obviously can't do the same kind of read-ahead to defeat things. Linus ^ permalink raw reply [flat|nested] 161+ messages in thread
* Re: [PATCH] mm/mincore: allow for making sys_mincore() privileged 2019-01-11 2:18 ` Linus Torvalds @ 2019-01-11 4:04 ` Dave Chinner 2019-01-11 4:08 ` Andy Lutomirski 2019-01-11 7:08 ` Linus Torvalds 0 siblings, 2 replies; 161+ messages in thread From: Dave Chinner @ 2019-01-11 4:04 UTC (permalink / raw) To: Linus Torvalds Cc: Dominique Martinet, Jiri Kosina, Matthew Wilcox, Jann Horn, Andrew Morton, Greg KH, Peter Zijlstra, Michal Hocko, Linux-MM, kernel list, Linux API On Thu, Jan 10, 2019 at 06:18:16PM -0800, Linus Torvalds wrote: > On Thu, Jan 10, 2019 at 6:03 PM Dave Chinner <david@fromorbit.com> wrote: > > > > On Thu, Jan 10, 2019 at 02:11:01PM -0800, Linus Torvalds wrote: > > > And we *can* do sane things about RWF_NOWAIT. For example, we could > > > start async IO on RWF_NOWAIT, and suddenly it would go from "probe the > > > page cache" to "probe and fill", and be much harder to use as an > > > attack vector.. > > > > We can only do that if the application submits the read via AIO and > > has an async IO completion reporting mechanism. > > Oh, no, you misunderstand. > > RWF_NOWAIT has a lot of situations where it will potentially return > early (the DAX and direct IO ones have their own), but I was thinking > of the one in generic_file_buffered_read(), which triggers when you > don't find a page mapping. That looks like the obvious "probe page > cache" case. > > But we could literally move that test down just a few lines. Let it > start read-ahead. > > .. and then it will actually trigger on the *second* case instead, where we have > > if (!PageUptodate(page)) { > if (iocb->ki_flags & IOCB_NOWAIT) { > put_page(page); > goto would_block; > } > > and that's where RWF_MNOWAIT would act. > > It would still return EAGAIN. > > But it would have started filling the page cache. So now the act of > probing would fill the page cache, and the attacker would be left high > and dry - the fact that the page cache now exists is because of the > attack, not because of whatever it was trying to measure. > > See? Except for fadvise(POSIX_FADV_RANDOM) which triggers this code in page_cache_sync_readahead(): /* be dumb */ if (filp && (filp->f_mode & FMODE_RANDOM)) { force_page_cache_readahead(mapping, filp, offset, req_size); return; } So it will only read the single page we tried to access and won't perturb the rest of the message encoded into subsequent pages in file. Cheers, Dave. -- Dave Chinner david@fromorbit.com ^ permalink raw reply [flat|nested] 161+ messages in thread
* Re: [PATCH] mm/mincore: allow for making sys_mincore() privileged 2019-01-11 4:04 ` Dave Chinner @ 2019-01-11 4:08 ` Andy Lutomirski 2019-01-11 7:20 ` Dave Chinner 2019-01-11 7:08 ` Linus Torvalds 1 sibling, 1 reply; 161+ messages in thread From: Andy Lutomirski @ 2019-01-11 4:08 UTC (permalink / raw) To: Dave Chinner Cc: Linus Torvalds, Dominique Martinet, Jiri Kosina, Matthew Wilcox, Jann Horn, Andrew Morton, Greg KH, Peter Zijlstra, Michal Hocko, Linux-MM, kernel list, Linux API > On Jan 10, 2019, at 8:04 PM, Dave Chinner <david@fromorbit.com> wrote: > >> On Thu, Jan 10, 2019 at 06:18:16PM -0800, Linus Torvalds wrote: >>> On Thu, Jan 10, 2019 at 6:03 PM Dave Chinner <david@fromorbit.com> wrote: >>> >>>> On Thu, Jan 10, 2019 at 02:11:01PM -0800, Linus Torvalds wrote: >>>> And we *can* do sane things about RWF_NOWAIT. For example, we could >>>> start async IO on RWF_NOWAIT, and suddenly it would go from "probe the >>>> page cache" to "probe and fill", and be much harder to use as an >>>> attack vector.. >>> >>> We can only do that if the application submits the read via AIO and >>> has an async IO completion reporting mechanism. >> >> Oh, no, you misunderstand. >> >> RWF_NOWAIT has a lot of situations where it will potentially return >> early (the DAX and direct IO ones have their own), but I was thinking >> of the one in generic_file_buffered_read(), which triggers when you >> don't find a page mapping. That looks like the obvious "probe page >> cache" case. >> >> But we could literally move that test down just a few lines. Let it >> start read-ahead. >> >> .. and then it will actually trigger on the *second* case instead, where we have >> >> if (!PageUptodate(page)) { >> if (iocb->ki_flags & IOCB_NOWAIT) { >> put_page(page); >> goto would_block; >> } >> >> and that's where RWF_MNOWAIT would act. >> >> It would still return EAGAIN. >> >> But it would have started filling the page cache. So now the act of >> probing would fill the page cache, and the attacker would be left high >> and dry - the fact that the page cache now exists is because of the >> attack, not because of whatever it was trying to measure. >> >> See? > > Except for fadvise(POSIX_FADV_RANDOM) which triggers this code in > page_cache_sync_readahead(): > > /* be dumb */ > if (filp && (filp->f_mode & FMODE_RANDOM)) { > force_page_cache_readahead(mapping, filp, offset, req_size); > return; > } > > So it will only read the single page we tried to access and won't > perturb the rest of the message encoded into subsequent pages in > file. > There are two types of attacks. One is an intentional side channel where two cooperating processes communicate. This is, under some circumstances, a problem, but it’s not one we’re about to solve in general. The other is an attacker monitoring an unwilling process. I think we care a lot more about that, and Linus’ idea will help. ^ permalink raw reply [flat|nested] 161+ messages in thread
* Re: [PATCH] mm/mincore: allow for making sys_mincore() privileged 2019-01-11 4:08 ` Andy Lutomirski @ 2019-01-11 7:20 ` Dave Chinner 0 siblings, 0 replies; 161+ messages in thread From: Dave Chinner @ 2019-01-11 7:20 UTC (permalink / raw) To: Andy Lutomirski Cc: Linus Torvalds, Dominique Martinet, Jiri Kosina, Matthew Wilcox, Jann Horn, Andrew Morton, Greg KH, Peter Zijlstra, Michal Hocko, Linux-MM, kernel list, Linux API On Thu, Jan 10, 2019 at 08:08:37PM -0800, Andy Lutomirski wrote: > > On Jan 10, 2019, at 8:04 PM, Dave Chinner <david@fromorbit.com> > > wrote: > > > >> On Thu, Jan 10, 2019 at 06:18:16PM -0800, Linus Torvalds > >> wrote: > >>> On Thu, Jan 10, 2019 at 6:03 PM Dave Chinner > >>> <david@fromorbit.com> wrote: > >>> > >>>> On Thu, Jan 10, 2019 at 02:11:01PM -0800, Linus Torvalds > >>>> wrote: And we *can* do sane things about RWF_NOWAIT. For > >>>> example, we could start async IO on RWF_NOWAIT, and suddenly > >>>> it would go from "probe the page cache" to "probe and fill", > >>>> and be much harder to use as an attack vector.. > >>> > >>> We can only do that if the application submits the read via > >>> AIO and has an async IO completion reporting mechanism. > >> > >> Oh, no, you misunderstand. > >> > >> RWF_NOWAIT has a lot of situations where it will potentially > >> return early (the DAX and direct IO ones have their own), but I > >> was thinking of the one in generic_file_buffered_read(), which > >> triggers when you don't find a page mapping. That looks like > >> the obvious "probe page cache" case. > >> > >> But we could literally move that test down just a few lines. > >> Let it start read-ahead. > >> > >> .. and then it will actually trigger on the *second* case > >> instead, where we have > >> > >> if (!PageUptodate(page)) { if (iocb->ki_flags & > >> IOCB_NOWAIT) { put_page(page); goto would_block; > >> } > >> > >> and that's where RWF_MNOWAIT would act. > >> > >> It would still return EAGAIN. > >> > >> But it would have started filling the page cache. So now the > >> act of probing would fill the page cache, and the attacker > >> would be left high and dry - the fact that the page cache now > >> exists is because of the attack, not because of whatever it was > >> trying to measure. > >> > >> See? > > > > Except for fadvise(POSIX_FADV_RANDOM) which triggers this code > > in page_cache_sync_readahead(): > > > > /* be dumb */ if (filp && (filp->f_mode & FMODE_RANDOM)) > > { force_page_cache_readahead(mapping, filp, offset, > > req_size); return; } > > > > So it will only read the single page we tried to access and > > won't perturb the rest of the message encoded into subsequent > > pages in file. > > There are two types of attacks. One is an intentional side > channel where two cooperating processes communicate. This is, > under some circumstances, a problem, Yes, that's the covert communication channel that can cross container and machine boundaries without any required privileges. > but it’s not one > we’re about to solve in general. The other is an attacker > monitoring an unwilling process. Which uses exactly the same mechanisms as the first case. i.e. controlled invalidation and page cache residency monitoring.If we aren't going to solve the first problem case, the we aren't going to solve the second because they are one and the same problem... However, I suspect you have misunderstood the monitoring mechanism here - dispatch IO for this page doesn't prevent the information leak about that page. It's when we return EAGAIN that we leak information about page cache residency. What Linus is attempting to do is perturb the nearby state of the page cache by triggering async readahead in the EAGAIN case. Async readahead will fill all the holes in readahead window and hence destroy the information about where the page fault landed and instantiated the page cache. That would prevent the attacker from determining what code the executable is running as they would only be able to check a single page in an executable at a time and that makes the attack highly impractical. But if the attacker uses FADV_RANDOM, readahead is only triggered for the page the attacker is trying to read. Hence it does not disturb the nearby page cache residency pattern the executable's page faults left behind and so doesn't destroy the information that they are trying to extract from the unwilling process. Sure, Linus's change makes monitoring the executable file after FADV_RANDOM a "read-once" mechanism. However, detection of what code is executing is a repeated invalidate-and-sweep exercise to begin with, so it basically doesn't change the information or the rate at which the monitoring process can extract information from the file. /me hasn't thought about this sort of stuff since he was running page cache invalidation attacks on Irix system libraries way back in 2002 when he found a libc bug that killed the init process and paniced the kernel. This isn't my first rodeo - it's been well known for a long, long time that the system page cache can be exploited to monitor executing code... Cheers, Dave. -- Dave Chinner david@fromorbit.com ^ permalink raw reply [flat|nested] 161+ messages in thread
* Re: [PATCH] mm/mincore: allow for making sys_mincore() privileged 2019-01-11 4:04 ` Dave Chinner 2019-01-11 4:08 ` Andy Lutomirski @ 2019-01-11 7:08 ` Linus Torvalds 2019-01-11 7:36 ` Dave Chinner 1 sibling, 1 reply; 161+ messages in thread From: Linus Torvalds @ 2019-01-11 7:08 UTC (permalink / raw) To: Dave Chinner Cc: Dominique Martinet, Jiri Kosina, Matthew Wilcox, Jann Horn, Andrew Morton, Greg KH, Peter Zijlstra, Michal Hocko, Linux-MM, kernel list, Linux API On Thu, Jan 10, 2019 at 8:04 PM Dave Chinner <david@fromorbit.com> wrote: > > So it will only read the single page we tried to access and won't > perturb the rest of the message encoded into subsequent pages in > file. Dave, you're being intentionally obtuse, aren't you? It's only that single page that *matters*. That's the page that the probe reveals the status of - but it's also the page that the probe then *changes* the status of. See? Think of it as "the act of measurement changes that which is measured". And that makes the measurement pointless. Linus ^ permalink raw reply [flat|nested] 161+ messages in thread
* Re: [PATCH] mm/mincore: allow for making sys_mincore() privileged 2019-01-11 7:08 ` Linus Torvalds @ 2019-01-11 7:36 ` Dave Chinner 2019-01-11 16:26 ` Linus Torvalds 0 siblings, 1 reply; 161+ messages in thread From: Dave Chinner @ 2019-01-11 7:36 UTC (permalink / raw) To: Linus Torvalds Cc: Dominique Martinet, Jiri Kosina, Matthew Wilcox, Jann Horn, Andrew Morton, Greg KH, Peter Zijlstra, Michal Hocko, Linux-MM, kernel list, Linux API On Thu, Jan 10, 2019 at 11:08:07PM -0800, Linus Torvalds wrote: > On Thu, Jan 10, 2019 at 8:04 PM Dave Chinner <david@fromorbit.com> wrote: > > > > So it will only read the single page we tried to access and won't > > perturb the rest of the message encoded into subsequent pages in > > file. > > Dave, you're being intentionally obtuse, aren't you? > > It's only that single page that *matters*. That's the page that the > probe reveals the status of - but it's also the page that the probe > then *changes* the status of. It changes the state of it /after/ we've already got the information we need from it. It's not up to date, it has to come from disk, we return EAGAIN, which means it was not in the cache. i.e. if we return EAGAIN, we've leaked the inforation the attacker wants regardless of how the act of initiating readahead on the page change the state of the page. Yes, it raises the complexity bar a bit, and lowers the monitoring frequency somewhat, but that's about it. Cheers, Dave. -- Dave Chinner david@fromorbit.com ^ permalink raw reply [flat|nested] 161+ messages in thread
* Re: [PATCH] mm/mincore: allow for making sys_mincore() privileged 2019-01-11 7:36 ` Dave Chinner @ 2019-01-11 16:26 ` Linus Torvalds 2019-01-15 23:45 ` Dave Chinner 2019-02-20 15:49 ` Nicolai Stange 0 siblings, 2 replies; 161+ messages in thread From: Linus Torvalds @ 2019-01-11 16:26 UTC (permalink / raw) To: Dave Chinner Cc: Dominique Martinet, Jiri Kosina, Matthew Wilcox, Jann Horn, Andrew Morton, Greg KH, Peter Zijlstra, Michal Hocko, Linux-MM, kernel list, Linux API On Thu, Jan 10, 2019 at 11:36 PM Dave Chinner <david@fromorbit.com> wrote: > > > It's only that single page that *matters*. That's the page that the > > probe reveals the status of - but it's also the page that the probe > > then *changes* the status of. > > It changes the state of it /after/ we've already got the information > we need from it. It's not up to date, it has to come from disk, we > return EAGAIN, which means it was not in the cache. Oh, I see the confusion. Yes, you get the information about whether something was in the cache or not, so the side channel does exist to some degree. But it's actually hugely reduced for a rather important reason: the _primary_ reason for needing to know whether some page is in the cache or not is not actually to see if it was ever accessed - it's to see that the cache has been scrubbed (and to _guide_ the scrubbing), and *when* it was accessed. Think of it this way: the buffer cache residency is actually a horribly bad signal on its own mainly because you generally have a very high hit-rate. In most normal non-streaming situations with sufficient amounts of memory you have pretty much everything cached. So in order to use it as a signal, first you have to first scrub the cache (because if the page was already there, there's no signal at all), and then for the signal to be as useful as possible, you're also going to want to try to get out more than one bit of information: you are going to try to see the patterns and the timings of how it gets filled. And that's actually quite painful. You don't know the initial cache state, and you're not (in general) controlling the machine entirely, because there's also that actual other entity that you're trying to attack and see what it does. So what you want to do is basically to first make sure the cache is scrubbed (only for the pages you're interested in!), then trigger whatever behavior you are looking for, and then look how that affected the cache. In other words, you want *multiple* residency status check - first to see what the cache state is (because you're going to want that for scrubbing), then to see that "yes, it's gone" when doing the scrubbing, and then to see the *pattern* and timings of how things are brought in. And then you're likely to want to do this over and over again, so that you can get real data out of the signal. This is why something that doesn't perturb what you measure is really important. If the act of measurement brings the page in, then you can't use it for that "did I successfully scrub it" phase at all, and you can't use it for measurement but once, so your view into patterns and timings is going to be *much* worse. And notice that this is true even if the act of measurement only affects the *one* page you're measuring. Sure, any additional noise around it would likely be annoying too, but it's not really necessary to make the attack much harder to carry out. In fact, it's almost irrelevant, since the signal you're trying to *see* is going to be affected by prefetching etc too, so the patterns and timings you need to look at are in bigger chunks than the readahead thing. So yes, you as an attacker can remove the prefetching from *your* load, but you can't remove it from the target load anyway, so you'll just have to live with it. Can you brute-force scrubbing? Yes. For something like an L1 cache, that's easy (well, QoS domains make it harder). For something like a disk cache, it's much harder, and makes any attempt to read out state a lot slower. The paper that started this all uses mincore() not just to see "is the page now scrubbed", but also to guide the scrubbing itself (working set estimation etc). And note that in many ways, the *scrubbing* is really the harder part. Populating the cache is really easy: just read the data you want to populate. So if you are looking for a particular signal, say "did this error case trigger so that it faulted in *that* piece of information", you'd want to scrub the target, populate everything else, and then try to measure at "did I trigger that target". Except you wouldn't want to do it one page at a time but see as much pattern of "they were touched in this order" as you can, and you'd like to get timing information of how the pages you are interested were populated too. And you'd generally do this over and over and over again because you're trying to read out some signal. Notice what the expensive operation was? It's the scrubbing.The "did the target do IO" you might actually even see other ways for the trivial cases, like even just look at iostat: just pre-populate everything but the part you care about, then try to trigger whatever you're searching for, and see if it caused IO or not. So it's a bit like a chalkboard: in order to read out the result, you need to erase it first, and doing that blindly is nasty. And you want to look at timings, which is also really nasty if every time you look, you smudge the very place you looked at. It makes it hard to see what somebody else is writing on the board if you're always overwriting what you just looked at. Did you get some new information? If not, now you have to go back and do that scrubbing again, and you'll likely be missing what *else* the person wrote. Ans as always: there is no "black and white". There is no "absolute security", and similarly, there is no "absolute leak proof". It's all about making it inconvenient enough that it's not really practical. Linus ^ permalink raw reply [flat|nested] 161+ messages in thread
* Re: [PATCH] mm/mincore: allow for making sys_mincore() privileged 2019-01-11 16:26 ` Linus Torvalds @ 2019-01-15 23:45 ` Dave Chinner 2019-01-16 4:54 ` Linus Torvalds 2019-02-20 15:49 ` Nicolai Stange 1 sibling, 1 reply; 161+ messages in thread From: Dave Chinner @ 2019-01-15 23:45 UTC (permalink / raw) To: Linus Torvalds Cc: Dominique Martinet, Jiri Kosina, Matthew Wilcox, Jann Horn, Andrew Morton, Greg KH, Peter Zijlstra, Michal Hocko, Linux-MM, kernel list, Linux API On Fri, Jan 11, 2019 at 08:26:14AM -0800, Linus Torvalds wrote: > On Thu, Jan 10, 2019 at 11:36 PM Dave Chinner <david@fromorbit.com> wrote: > > > > > It's only that single page that *matters*. That's the page that the > > > probe reveals the status of - but it's also the page that the probe > > > then *changes* the status of. > > > > It changes the state of it /after/ we've already got the information > > we need from it. It's not up to date, it has to come from disk, we > > return EAGAIN, which means it was not in the cache. > > Oh, I see the confusion. > > Yes, you get the information about whether something was in the cache > or not, so the side channel does exist to some degree. > > But it's actually hugely reduced for a rather important reason: the > _primary_ reason for needing to know whether some page is in the cache > or not is not actually to see if it was ever accessed - it's to see > that the cache has been scrubbed (and to _guide_ the scrubbing), and > *when* it was accessed. Oh, you're assuming that you need to probe the page cache to determine if brute force invalidation has progressed far enough to invalidate the page in question. I'm assuming that you can invalidate the page cache reliably by a means that does not repeated require probing to detect invalidation has occurred. I've mentioned one method in this discussion already... IOWs, just because the specific brute force attack documented in the paper required repeated probing it doesn't mean all future invalidation attacks will require repeated probing. Hence a robust defense mechanism should not rely on the attacker requiring multiple observations of the page cache to extract the information they are seeking... Cheers, Dave. -- Dave Chinner david@fromorbit.com ^ permalink raw reply [flat|nested] 161+ messages in thread
* Re: [PATCH] mm/mincore: allow for making sys_mincore() privileged 2019-01-15 23:45 ` Dave Chinner @ 2019-01-16 4:54 ` Linus Torvalds 2019-01-16 5:49 ` Linus Torvalds 2019-01-17 1:26 ` Dave Chinner 0 siblings, 2 replies; 161+ messages in thread From: Linus Torvalds @ 2019-01-16 4:54 UTC (permalink / raw) To: Dave Chinner Cc: Dominique Martinet, Jiri Kosina, Matthew Wilcox, Jann Horn, Andrew Morton, Greg KH, Peter Zijlstra, Michal Hocko, Linux-MM, kernel list, Linux API On Wed, Jan 16, 2019 at 11:45 AM Dave Chinner <david@fromorbit.com> wrote: > > I'm assuming that you can invalidate the page cache reliably by a > means that does not repeated require probing to detect invalidation > has occurred. I've mentioned one method in this discussion > already... Yes. And it was made clear to you that it was a bug in xfs dio and what the right thing to do was. And you ignored that, and claimed it was a feature. Why do you then bother arguing this thing? We absolutely agree that xfs has an information leak. If you don't care, just _say_ so. Don't try to argue against other people who are trying to fix things. We can easily just say "ok, xfs people don't care", and ignore the xfs invalidation issue. That's fine. But don't try to make it a big deal for other filesystems that _don't_ have the bug. I even pointed out how ext4 does the page cache flushing correcrly. You pooh-poohed it. You can't have it both ways. Either you care or you don't. If you don't care (and so far everything you said seems to imply you don't), then why are you even discussing this? Just admit you don't care, and we're done. Linus ^ permalink raw reply [flat|nested] 161+ messages in thread
* Re: [PATCH] mm/mincore: allow for making sys_mincore() privileged 2019-01-16 4:54 ` Linus Torvalds @ 2019-01-16 5:49 ` Linus Torvalds 2019-01-17 1:26 ` Dave Chinner 1 sibling, 0 replies; 161+ messages in thread From: Linus Torvalds @ 2019-01-16 5:49 UTC (permalink / raw) To: Dave Chinner Cc: Dominique Martinet, Jiri Kosina, Matthew Wilcox, Jann Horn, Andrew Morton, Greg KH, Peter Zijlstra, Michal Hocko, Linux-MM, kernel list, Linux API On Wed, Jan 16, 2019 at 4:54 PM Linus Torvalds <torvalds@linux-foundation.org> wrote: > > On Wed, Jan 16, 2019 at 11:45 AM Dave Chinner <david@fromorbit.com> wrote: > > > > I'm assuming that you can invalidate the page cache reliably by a > > means that does not repeated require probing to detect invalidation > > has occurred. I've mentioned one method in this discussion > > already... > > Yes. And it was made clear to you that it was a bug in xfs dio and > what the right thing to do was. Side note: I actually think we *do* the right thing. Even for xfs. I couldn't find the alleged place that invalidates the page cache on dio reads. The *generic* dio code only does it for writes (which is correct and fine). And maybe xfs has some extra invalidation, but I don't see it. So I actually hope your "you can use direct-io read to do directed invalidating of the page cache" isn't true. I admittedly did *not* try to delve very deeply into it, but the invalidates I found looked correct. The generic code does it for writes, and at least ext4 does the "writeback and wait" for reads. There *does* seem to be a 'invalidate_inode_pages2_range()' call in iomap_dio_rw(). That has a *comment* that says it only is for writes, but it looks to me like it would trigger for reads too. Just a plain bug/oversight? Or me misreading things. So yes, maybe xfs does that "invalidate on read", but it really seems to be just a bug. If the xfs people insist on keeping the bug, fine (looks like gfs2 and xfs are the only users), but it seems kind of sad. Linus ^ permalink raw reply [flat|nested] 161+ messages in thread
* Re: [PATCH] mm/mincore: allow for making sys_mincore() privileged 2019-01-16 4:54 ` Linus Torvalds 2019-01-16 5:49 ` Linus Torvalds @ 2019-01-17 1:26 ` Dave Chinner 1 sibling, 0 replies; 161+ messages in thread From: Dave Chinner @ 2019-01-17 1:26 UTC (permalink / raw) To: Linus Torvalds Cc: Dominique Martinet, Jiri Kosina, Matthew Wilcox, Jann Horn, Andrew Morton, Greg KH, Peter Zijlstra, Michal Hocko, Linux-MM, kernel list, Linux API On Wed, Jan 16, 2019 at 04:54:49PM +1200, Linus Torvalds wrote: > On Wed, Jan 16, 2019 at 11:45 AM Dave Chinner <david@fromorbit.com> wrote: > > > > I'm assuming that you can invalidate the page cache reliably by a > > means that does not repeated require probing to detect invalidation > > has occurred. I've mentioned one method in this discussion > > already... > > Yes. And it was made clear to you that it was a bug in xfs dio and > what the right thing to do was. > > And you ignored that, and claimed it was a feature. Linus, either you aren't listening or you're being intentionally provocative. So, for the *third* time this thread: we can probably remove this code but first we need to be sure it doesn't cause unexpected regressions before we commit such a change. We are not cowboys who test userspace behavioural changes on users without review or discussion. Indeed, I wrote a patch to remove the invalidation /several days ago/ and put it into my test trees, and it's been there since. Just because you don't see immediate changes doesn't mean it isn't happening. > Either you care or you don't. If you don't care (and so far everything > you said seems to imply you don't), Linus, this is just a personal attack and IMO a violation of the CoC. It's straight out wrong, insulting, totally unprofessional and completely uncalled for. This is most definitely not a useful technical response to the issues I raised. i.e you cut out all the context of my response about whether "no probing necessary" page cache invalidation attacks are something we need to care about in the future. We don't need you to shout about existing "no probing necessary" page cache invalidation attacks that are already being addressed, we need to determine if it's going to be a recurring problem in future because that directly affects the mitigation strategies we can implement. -Dave. -- Dave Chinner david@fromorbit.com ^ permalink raw reply [flat|nested] 161+ messages in thread
* Re: [PATCH] mm/mincore: allow for making sys_mincore() privileged 2019-01-11 16:26 ` Linus Torvalds 2019-01-15 23:45 ` Dave Chinner @ 2019-02-20 15:49 ` Nicolai Stange 1 sibling, 0 replies; 161+ messages in thread From: Nicolai Stange @ 2019-02-20 15:49 UTC (permalink / raw) To: Nicolai Stange Cc: Vlastimil Babka, Kirill A. Shutemov, Tejun Heo, Kevin Easton, Cyril Hrubis, Daniel Gruss, Andy Lutomirski, Linus Torvalds, Dave Chinner, Dominique Martinet, Jiri Kosina, Matthew Wilcox, Jann Horn, Andrew Morton, Greg KH, Peter Zijlstra, Michal Hocko, Linux-MM, kernel list, Linux API Linus Torvalds <torvalds@linux-foundation.org> writes: <snip> > So in order to use it as a signal, first you have to first scrub the > cache (because if the page was already there, there's no signal at > all), and then for the signal to be as useful as possible, you're also > going to want to try to get out more than one bit of information: you > are going to try to see the patterns and the timings of how it gets > filled. > > And that's actually quite painful. You don't know the initial cache > state, and you're not (in general) controlling the machine entirely, > because there's also that actual other entity that you're trying to > attack and see what it does. > > So what you want to do is basically to first make sure the cache is > scrubbed (only for the pages you're interested in!), then trigger > whatever behavior you are looking for, and then look how that affected > the cache. > > In other words, you want *multiple* residency status check - first to > see what the cache state is (because you're going to want that for > scrubbing), then to see that "yes, it's gone" when doing the > scrubbing, and then to see the *pattern* and timings of how things are > brought in. <snap> In an attempt to gain a better understanding of the guided eviction part resp. the relevance of mincore() & friends to that, I worked on reproducing the results from [1], section 6.1 ("Efficient Page Cache Eviction on Linux"). In case anybody wants to run their own experiments: the sources can be found at [2]. Disclaimer: I don't have access to the sources used by the [1]-paper's authors nor do I know anything about their experimental setup. So it might very well be the case, that my implementation is completely different and/or inefficient. Anyways, quoting from [1], section 6.1: "Eviction Set 1. These are pages already in the page cache, used by other processes. To keep them in the page cache, a thread continuously accesses these pages while also keep- ing the system load low by using sched yield and sleep. Consequently, they are among the most recently accessed pages of the system and eviction of these pages becomes highly unlikely." I had two questions: 1.) Do the actual contents of "Eviction set 1" matter for the guided eviction's performance or can they as well be arbitrary but fixed? Because if the set's actual contents weren't of any importance, then mincore() would not be needed to initialize it with "pages already in the page cache". 2.) How does keeping some fixed set resident + doing some IO compare to simply streaming a huge random file through the page cache? (To make it explicit: I didn't look into the probe part of the attack or the checking of the victim page's residency status as a termination condition for the eviction run.) Obviously, there are two primary factors affecting the victim page eviction performance: the file page cache size and disk read bandwidth. Naively speaking, I would suppose that keeping a certain set resident is a cheap and stable way to constrain IO to the remaining portion of the page cache and thus, reduce the amount of data required to be read from disk until the victim page gets evicted. Results summary (details can be found at the end of this mail): - The baseline benchmark of simply streaming random data through the page cache behaves as expected: avg of "Inactive(file)" / avg of "victim page eviction time" yields ~480MB/s, which approx. matches my disk's read bandwidth (note: the victim page was mapped as !PROT_EXEC). - I didn't do any sophisticated fine-tuning wrt. to parameters, but for the configuration yielding the best result, the average victim page eviction time was 147ms (stddev(*): 69ms, stderr: 1ms) with the "random but fixed resident set method". That's an improvement by a factor of 2.6 over the baseline "streaming random data method" (with the same amount of anonymous memory, i.e. "Eviction set 3", allocated: 7GB out of a total of 8GB). - In principle, question 1.) can't be answered by experiment without controlling the initial, legitimate system workload. I did some lax tests on my desktop running firefox, libreoffice etc. though and of course, overall responsiveness got a lot better if the "Eviction set 1" had been populated with pages already resident at the time the "attack" started. But the victim page eviction times didn't seem to improve -- at least not by factors such that my biased mind would have been able to recognize any change. In conclusion, keeping an arbitrary, but fixed "Eviction set 1" resident improved the victim page eviction performance by some factor over the "streaming" baseline, where "Eviction set 1" was populated from a single, attacker-controlled file and mincore() was not needed for determining its initial contents. To my surprise though, I needed to rely on mincore() at some other place, namely *my* implementation of keeping the resident set resident. My first naive approach was to have a single thread repeatedly iterating over the pages and reading the first few bytes from each through a mmapped area. That did not work out, because, even for smaller resident sets of 1GB and if run with realtime priority, the accessing thread would at some point in time encounter a reclaimed page and have to wait for the page fault to get served. While waiting for that, even more of the resident pages are likely to get reclaimed, causing additional delays later on. Eventually, the resident page accessor looses the game and will encounter page faults for almost the whole resident set (which isn't resident anymore). I worked around this by making the accessor thread check page residency via mincore(), touch only the resident ones and queue the others to some refault ("rewarm") thread. From briefly looking at iostat, this rewarmer thread actually seemed to saturate the disk and thus, there was no need for additional IO to put pressure on the page cache. For completeness, the amount of pages from the resident set actually found resident made up for ~97% of all file page cache pages (Inactive+Active(file)). Note that the way mincore() is used here is different than for the probe part of the attack: for probing, we'd like to know when a victim page has been faulted in again, while the residency keeper needs to check that some page has not been reclaimed before accessing it. Furthermore, mincore() is run on pages from an attacker-controlled and -owned file here. AFAICS, the patches currently under review (c.f. [3]) don't mitigate against this latter abuse of mincore() & Co. I personally doubt that doing something about it would be worth it though: first of all, until proven otherwise, I'm assuming that the improvement of the "resident set" based eviction method over the "streaming" baseline is not by orders of magnitude and that the victim page eviction times interesting to attackers (let's say better than 500ms) are achievable only under certain conditions: no swap and the ability to block a large fraction of total memory with anonymous mappings. What's more, I can imagine that there are other ways to keep the resident set resident without relying on mincore() at all: I haven't tried it, but simply spawning a larger number of accessor threads, each at a different position within "Eviction set 1", and hoping that most of the time at least one of them finds a resident page to touch, might work, too. Thanks, Nicolai Experiment setup + numbers ========================== My laptop has 8GB of RAM, a SSD and runs OpenSUSE 42.3 kernel package version 4.4.165-81. I stopped almost all services, but didn't setup any CPU isolation. I ran 5 x 2 experiments where I allocated (and filled, of course) 3,4,5,6,7GB of anonymous memory each and compared the baseline "read-in-a-huge-file" (mapped PROT_EXEC) results with the resident set based eviction for each of these. While doing so, I continuously measured eviction times of a !PROT_EXEC victim page and reported the 'Active(file)' + 'Inactive(file)' statistics from /proc/meminfo. Baseline "streaming" benchmark results: anon mem (GB): 7 6 5 4 3 Inactive(file) (MB): 189.3389 709.7063 1221.0910 1735.7510 2247.3340 eviction time (ms): 386.7712 1453.3975 2533.6084 3622.4995 4694.4668 quotient (MB/s): 489.5372 488.3085 481.9573 479.1584 478.7198 and, for comparison with the results from the resident set based eviction below: Inactive+Active(file) (MB): 358 1390 2415 3445 4469 Resident set based eviction: For the results below, I chose to draw the resident set from a single file filled with random data and made it total mem - allocated anon mem in size. The disk bandwidth was saturated all the time. anon mem (GB): 7 6 5 4 3 eviction time (ms): 146.6296 428.9594 620.2084 863.1423 977.8963 improvement over baseline: 2.6 3.4 4.1 4.2 4.8 Inactive+Active(file) (MB): 429 1449 2471 3494 4515 resident from res. set (MB): 417 1428 2426 3424 4429 fraction: 97.3% 98.6% 98.2% 98.0% 98.1% [1] https://arxiv.org/abs/1901.01161 ("Page Cache Attacks") [2] https://github.com/nicstange/pgc [3] https://lkml.kernel.org/r/20190130124420.1834-1-vbabka@suse.cz (*) The distribution of eviction times is not Gaussian. -- SUSE Linux GmbH, GF: Felix Imendörffer, Jane Smithard, Graham Norton, HRB 21284 (AG Nürnberg) ^ permalink raw reply [flat|nested] 161+ messages in thread
* Re: [PATCH] mm/mincore: allow for making sys_mincore() privileged 2019-01-10 22:11 ` Linus Torvalds 2019-01-11 2:03 ` Dave Chinner @ 2019-01-11 4:57 ` Dominique Martinet 2019-01-11 7:11 ` Linus Torvalds 2019-01-16 0:42 ` Josh Snyder 2 siblings, 1 reply; 161+ messages in thread From: Dominique Martinet @ 2019-01-11 4:57 UTC (permalink / raw) To: Linus Torvalds Cc: Dave Chinner, Jiri Kosina, Matthew Wilcox, Jann Horn, Andrew Morton, Greg KH, Peter Zijlstra, Michal Hocko, Linux-MM, kernel list, Linux API Linus Torvalds wrote on Thu, Jan 10, 2019: > On Thu, Jan 10, 2019 at 4:25 AM Dominique Martinet > <asmadeus@codewreck.org> wrote: > > Linus Torvalds wrote on Thu, Jan 10, 2019: > > > (Except, of course, if somebody actually notices outside of tests. > > > Which may well happen and just force us to revert that commit. But > > > that's a separate issue entirely). > > > > Both Dave and I pointed at a couple of utilities that break with > > this. nocache can arguably work with the new behaviour but will behave > > differently; vmtouch on the other hand is no longer able to display > > what's in cache or not - people use that for example to "warm up" a > > container in page cache based on how it appears after it had been > > running for a while is a pretty valid usecase to me. > > So honestly, the main reason I'm loath to revert is that yes, we know > of theoretical differences, but they seem to all be > performance-related. I don't see what other use mincore could have, yes - even the "debugging" use I gave is performance investigations and not hard problems (and I probably would go straight to perf nowadays, you'd get the info that the program doesn't use cache from the call graphs) > It would be really good to hear numbers. Is the warm-up optimization > something that changes things from 3ms to 3.5ms? Or does it change > things from 3ms to half a second? This is heavily workload and storage hardware dependant, so hard to give some absolute value. Trying with some big server, fast SSD, mysql and doing: # echo 3 > /proc/sys/vm/drop_caches # (optional) prefetch table and innodb files # systemctl restart mariadb # time mysql -q db "select * from mytable where id in $ENTRIES" > /dev/null # time mysql -q db "select * from mytable where id in $ENTRIES2" > /dev/null # time mysql -q db "select * from mytable where id in $ENTRIES3" > /dev/null (where ENTRIES* are lists of 1000 id, and id is indexed; the table is 8GB for 62590661 entries so 1000 entries is approx 128KB of data out of that file) I get on average over a few queries approximately a real time of 350ms, 230ms and 220ms immediately after drop cache and service restart, and 150ms, 60ms and 60ms after a prefetch (hand-wavy average over 3 runs, I didn't have the patience to do proper testing). (In both cases, user/sys are less than 10ms; I don't see much difference there) If I restart the service without dropping caches and redo the query I get 60ms from the first query onwards so I must not be preloading everything properly, some real script that would look all over a container to properly restore the page cache would do better than me blindly preloading a few files. Either way, we're talking about a factor of 2-3 until the application has been looking at most of the entries, and I didn't try to see how that would look like on spinning disks or the kind of slow storage one would get on VPS somewhere in the cloud - I'm sure someone with time to waste could get much more impressive figures, but this already look pretty worthwhile to me. -- Dominique Martinet | Asmadeus ^ permalink raw reply [flat|nested] 161+ messages in thread
* Re: [PATCH] mm/mincore: allow for making sys_mincore() privileged 2019-01-11 4:57 ` Dominique Martinet @ 2019-01-11 7:11 ` Linus Torvalds 2019-01-11 7:32 ` Dominique Martinet 0 siblings, 1 reply; 161+ messages in thread From: Linus Torvalds @ 2019-01-11 7:11 UTC (permalink / raw) To: Dominique Martinet Cc: Dave Chinner, Jiri Kosina, Matthew Wilcox, Jann Horn, Andrew Morton, Greg KH, Peter Zijlstra, Michal Hocko, Linux-MM, kernel list, Linux API On Thu, Jan 10, 2019 at 8:58 PM Dominique Martinet <asmadeus@codewreck.org> wrote: > > I get on average over a few queries approximately a real time of 350ms, > 230ms and 220ms immediately after drop cache and service restart, and > 150ms, 60ms and 60ms after a prefetch (hand-wavy average over 3 runs, I > didn't have the patience to do proper testing). > (In both cases, user/sys are less than 10ms; I don't see much difference > there) But those numbers aren't about the mincore() change. That's just from dropping caches. Now, what's the difference with the mincore change, and without? Is it actually measurable? Because that's all that matters: is the mincore change something you can even notice? Is it a big regression? The fact that things are slower when they are cold in the cache isn't the issue. The issue is whether the change to mincore semantics makes any difference to real loads. Linus ^ permalink raw reply [flat|nested] 161+ messages in thread
* Re: [PATCH] mm/mincore: allow for making sys_mincore() privileged 2019-01-11 7:11 ` Linus Torvalds @ 2019-01-11 7:32 ` Dominique Martinet 0 siblings, 0 replies; 161+ messages in thread From: Dominique Martinet @ 2019-01-11 7:32 UTC (permalink / raw) To: Linus Torvalds Cc: Dave Chinner, Jiri Kosina, Matthew Wilcox, Jann Horn, Andrew Morton, Greg KH, Peter Zijlstra, Michal Hocko, Linux-MM, kernel list, Linux API Linus Torvalds wrote on Thu, Jan 10, 2019: > But those numbers aren't about the mincore() change. That's just from > dropping caches. > > Now, what's the difference with the mincore change, and without? Is it > actually measurable? > > Because that's all that matters: is the mincore change something you > can even notice? Is it a big regression? > > The fact that things are slower when they are cold in the cache isn't > the issue. The issue is whether the change to mincore semantics makes > any difference to real loads. mincore itself isn't used to reload the data, but is necessary to know *what* you need to reload. If you don't know what pages are hot, how can you preload them? For small enough a database and with enough memory you can act stupid and load the whole tables in cache, that's what I did because I was lazy and only had a big mysql data set around. But the container warming up automaton Dave mentioned and postgresql db preloading with pgfincore explicitely depend on being able to tell what they need to preload. pgfincore documentation states: ---- set of PostgreSQL functions to manage blocks in memory Those functions let you know which and how many disk block from a relation are in the page cache of the operating system, and eventually write the result to a file. Then using this file, it is possible to restore the page cache state for each block of the relation. ---- If you cannot dump an arbitrary "hot state" x, you cannot restore it. This is all basically a repeat of the other subthread; sure precaching itself doesn't need mincore and if you're all-knowing the syscall isn't needed, but mere mortals need it. If it's about the commit itself, vmtouch tells me 0 page in these database files are in cache when I reboot to a 5.0-rc1 kernel and run some queries, so the difference after a fresh boot is exactly what I stated. I'm probably missing something but I'm not quite sure in what situation the "new mincore" has any use right now. -- Dominique Martinet | Asmadeus ^ permalink raw reply [flat|nested] 161+ messages in thread
* Re: [PATCH] mm/mincore: allow for making sys_mincore() privileged 2019-01-10 22:11 ` Linus Torvalds 2019-01-11 2:03 ` Dave Chinner 2019-01-11 4:57 ` Dominique Martinet @ 2019-01-16 0:42 ` Josh Snyder 2019-01-16 5:00 ` Linus Torvalds 2 siblings, 1 reply; 161+ messages in thread From: Josh Snyder @ 2019-01-16 0:42 UTC (permalink / raw) To: Linus Torvalds Cc: Dominique Martinet, Dave Chinner, Jiri Kosina, Matthew Wilcox, Jann Horn, Andrew Morton, Greg KH, Peter Zijlstra, Michal Hocko, Linux-MM, kernel list, Linux API Linus Torvalds wrote on Thu, Jan 10, 2019: > So right now, I consider the mincore change to be a "try to probe the > state of mincore users", and we haven't really gotten a lot of > information back yet. <raises hand> For Netflix, losing accurate information from the mincore syscall would lengthen database cluster maintenance operations from days to months. We rely on cross-process mincore to migrate the contents of a page cache from machine to machine, and across reboots. To do this, I wrote and maintain happycache [1], a page cache dumper/loader tool. It is quite similar in architecture to pgfincore, except that it is agnostic to workload. The gist of happycache's operation is "produce a dump of residence status for each page, do some operation, then reload exactly the same pages which were present before." happycache is entirely dependent on accurate reporting of the in-core status of file-backed pages, as accessed by another process. We primarily use happycache with Cassandra, which (like Postgres + pgfincore) relies heavily on OS page cache to reduce disk accesses. Because our workloads never experience a cold page cache, we are able to provision hardware for a peak utilization level that is far lower than the hypothetical "every query is a cache miss" peak. A database warmed by happycache can be ready for service in seconds (bounded only by the performance of the drives and the I/O subsystem), with no period of in-service degradation. By contrast, putting a database in service without a page cache entails a potentially unbounded period of degradation (at Netflix, the time to populate a single node's cache via natural cache misses varies by workload from hours to weeks). If a single node upgrade were to take weeks, then upgrading an entire cluster would take months. Since we want to apply security upgrades (and other things) on a somewhat tighter schedule, we would have to develop more complex solutions to provide the same functionality already provided by mincore. At the bottom line, happycache is designed to benignly exploit the same information leak documented in the paper [2]. I think it makes perfect sense to remove cross-process mincore functionality from unprivileged users, but not to remove it entirely. Josh Snyder Netflix Cloud Database Engineering [1] https://github.com/hashbrowncipher/happycache [2] https://arxiv.org/abs/1901.01161 ^ permalink raw reply [flat|nested] 161+ messages in thread
* Re: [PATCH] mm/mincore: allow for making sys_mincore() privileged 2019-01-16 0:42 ` Josh Snyder @ 2019-01-16 5:00 ` Linus Torvalds 2019-01-16 5:25 ` Andy Lutomirski 2019-01-16 12:36 ` Matthew Wilcox 0 siblings, 2 replies; 161+ messages in thread From: Linus Torvalds @ 2019-01-16 5:00 UTC (permalink / raw) To: Josh Snyder Cc: Dominique Martinet, Dave Chinner, Jiri Kosina, Matthew Wilcox, Jann Horn, Andrew Morton, Greg KH, Peter Zijlstra, Michal Hocko, Linux-MM, kernel list, Linux API On Wed, Jan 16, 2019 at 12:42 PM Josh Snyder <joshs@netflix.com> wrote: > > For Netflix, losing accurate information from the mincore syscall would > lengthen database cluster maintenance operations from days to months. We > rely on cross-process mincore to migrate the contents of a page cache from > machine to machine, and across reboots. Ok, this is the kind of feedback we need, and means I guess we can't just use the mapping existence for mincore. The two other ways that we considered were: (a) owner of the file gets to know cache information for that file. (b) having the fd opened *writably* gets you cache residency information. Sadly, taking a look at happycache, you open the file read-only, so (b) doesn't work. Judging just from the source code, I can't tell how the user ownership works. Any input on that? And if you're not the owner of the file, do you have another suggestion for that "Yes, I have the right to see what's in-core for this file". Because the problem is literally that if it's some random read-only system file, the kernel shouldn't leak access patterns to it.. Linus ^ permalink raw reply [flat|nested] 161+ messages in thread
* Re: [PATCH] mm/mincore: allow for making sys_mincore() privileged 2019-01-16 5:00 ` Linus Torvalds @ 2019-01-16 5:25 ` Andy Lutomirski 2019-01-16 5:34 ` Linus Torvalds 2019-01-16 12:36 ` Matthew Wilcox 1 sibling, 1 reply; 161+ messages in thread From: Andy Lutomirski @ 2019-01-16 5:25 UTC (permalink / raw) To: Linus Torvalds Cc: Josh Snyder, Dominique Martinet, Dave Chinner, Jiri Kosina, Matthew Wilcox, Jann Horn, Andrew Morton, Greg KH, Peter Zijlstra, Michal Hocko, Linux-MM, kernel list, Linux API > On Jan 15, 2019, at 9:00 PM, Linus Torvalds <torvalds@linux-foundation.org> wrote: > >> On Wed, Jan 16, 2019 at 12:42 PM Josh Snyder <joshs@netflix.com> wrote: >> >> For Netflix, losing accurate information from the mincore syscall would >> lengthen database cluster maintenance operations from days to months. We >> rely on cross-process mincore to migrate the contents of a page cache from >> machine to machine, and across reboots. > > Ok, this is the kind of feedback we need, and means I guess we can't > just use the mapping existence for mincore. > > The two other ways that we considered were: > > (a) owner of the file gets to know cache information for that file. > > (b) having the fd opened *writably* gets you cache residency information. > > Sadly, taking a look at happycache, you open the file read-only, so > (b) doesn't work. > > Judging just from the source code, I can't tell how the user ownership > works. Any input on that? > > And if you're not the owner of the file, do you have another > suggestion for that "Yes, I have the right to see what's in-core for > this file". Because the problem is literally that if it's some random > read-only system file, the kernel shouldn't leak access patterns to > it.. > > Something like CAP_DAC_READ_SEARCH might not be crazy. ^ permalink raw reply [flat|nested] 161+ messages in thread
* Re: [PATCH] mm/mincore: allow for making sys_mincore() privileged 2019-01-16 5:25 ` Andy Lutomirski @ 2019-01-16 5:34 ` Linus Torvalds 2019-01-16 5:46 ` Dominique Martinet 0 siblings, 1 reply; 161+ messages in thread From: Linus Torvalds @ 2019-01-16 5:34 UTC (permalink / raw) To: Andy Lutomirski Cc: Josh Snyder, Dominique Martinet, Dave Chinner, Jiri Kosina, Matthew Wilcox, Jann Horn, Andrew Morton, Greg KH, Peter Zijlstra, Michal Hocko, Linux-MM, kernel list, Linux API On Wed, Jan 16, 2019 at 5:25 PM Andy Lutomirski <luto@amacapital.net> wrote: > > Something like CAP_DAC_READ_SEARCH might not be crazy. I agree that it would work. In fact' it's what Jiri's patch basically did. Except Jiri used CAP_SYS_ADMIN instead. But that then basically limits it to root (or root-like with capability masks), which is quite likely to not work in practice all that well. That's why I wanted to find alternatives. *Very* few people want to run their databases as root. Jiri's original patch kind of acknowledged that by making the new test be conditional, and off by default. So then it's a "only do this for lockdown mode, because normal people won't find it acceptable". And I'm not a huge fan of that approach. If you don't protect normal people, then what's the point, really? Linus ^ permalink raw reply [flat|nested] 161+ messages in thread
* Re: [PATCH] mm/mincore: allow for making sys_mincore() privileged 2019-01-16 5:34 ` Linus Torvalds @ 2019-01-16 5:46 ` Dominique Martinet 2019-01-16 5:58 ` Linus Torvalds 0 siblings, 1 reply; 161+ messages in thread From: Dominique Martinet @ 2019-01-16 5:46 UTC (permalink / raw) To: Linus Torvalds Cc: Andy Lutomirski, Josh Snyder, Dave Chinner, Jiri Kosina, Matthew Wilcox, Jann Horn, Andrew Morton, Greg KH, Peter Zijlstra, Michal Hocko, Linux-MM, kernel list, Linux API Linus Torvalds wrote on Wed, Jan 16, 2019: > *Very* few people want to run their databases as root. In the case of happycache, this isn't the database doing the dump/restore, but a separate process that could have the cap - it's better if we can do without though, and from his readme he runs as user cassandra in the /var/lib/cassandra directory for example so that'd match the file owner. For pgfincore, it's a postgres extension so the main process does it - but it does have files open as write as well as being the owner. > Jiri's original patch kind of acknowledged that by making the new test > be conditional, and off by default. So then it's a "only do this for > lockdown mode, because normal people won't find it acceptable". > > And I'm not a huge fan of that approach. If you don't protect normal > people, then what's the point, really? I agree with that. "Being owner or has cap" (whichever cap) is probably OK. On the other hand, writeability check makes more sense in general - could we somehow check if the user has write access to the file instead of checking if it currently is opened read-write? -- Dominique ^ permalink raw reply [flat|nested] 161+ messages in thread
* Re: [PATCH] mm/mincore: allow for making sys_mincore() privileged 2019-01-16 5:46 ` Dominique Martinet @ 2019-01-16 5:58 ` Linus Torvalds 2019-01-16 6:34 ` Dominique Martinet 2019-01-16 16:12 ` Jiri Kosina 0 siblings, 2 replies; 161+ messages in thread From: Linus Torvalds @ 2019-01-16 5:58 UTC (permalink / raw) To: Dominique Martinet Cc: Andy Lutomirski, Josh Snyder, Dave Chinner, Jiri Kosina, Matthew Wilcox, Jann Horn, Andrew Morton, Greg KH, Peter Zijlstra, Michal Hocko, Linux-MM, kernel list, Linux API On Wed, Jan 16, 2019 at 5:46 PM Dominique Martinet <asmadeus@codewreck.org> wrote: > > "Being owner or has cap" (whichever cap) is probably OK. > On the other hand, writeability check makes more sense in general - > could we somehow check if the user has write access to the file instead > of checking if it currently is opened read-write? That's likely the best option. We could say "is it open for write, or _could_ we open it for writing?" It's a slightly annoying special case, and I'd have preferred to avoid it, but it doesn't sound *compilcated*. I'm on the road, but I did send out this: https://lore.kernel.org/lkml/CAHk-=wif_9nvNHJiyxHzJ80_WUb0P7CXNBvXkjZz-r1u0ozp7g@mail.gmail.com/ originally. The "let's try to only do the mmap residency" was the optimistic "maybe we can just get rid of this complexity entirely" version.. Anybody willing to test the above patch instead? And replace the || capable(CAP_SYS_ADMIN) check with something like || inode_permission(inode, MAY_WRITE) == 0 instead? (This is obviously after you've reverted the "only check mmap residency" patch..) Linus ^ permalink raw reply [flat|nested] 161+ messages in thread
* Re: [PATCH] mm/mincore: allow for making sys_mincore() privileged 2019-01-16 5:58 ` Linus Torvalds @ 2019-01-16 6:34 ` Dominique Martinet 2019-01-16 7:52 ` Josh Snyder 2019-01-16 16:12 ` Jiri Kosina 1 sibling, 1 reply; 161+ messages in thread From: Dominique Martinet @ 2019-01-16 6:34 UTC (permalink / raw) To: Linus Torvalds Cc: Andy Lutomirski, Josh Snyder, Dave Chinner, Jiri Kosina, Matthew Wilcox, Jann Horn, Andrew Morton, Greg KH, Peter Zijlstra, Michal Hocko, Linux-MM, kernel list, Linux API [-- Attachment #1: Type: text/plain, Size: 1030 bytes --] Linus Torvalds wrote on Wed, Jan 16, 2019: > Anybody willing to test the above patch instead? And replace the > > || capable(CAP_SYS_ADMIN) > > check with something like > > || inode_permission(inode, MAY_WRITE) == 0 > > instead? > > (This is obviously after you've reverted the "only check mmap > residency" patch..) That seems to work on an x86_64 vm. I've tested with the attached patch: - root can lookup pages on any file I tried; - user can lookup page on file it owns, assuming it can write to it (e.g. it won't work on a 0400 file you own) - user cannot lookup pages on e.g. /lib64/libc-2.28.so There is a difference with your previous patch though, that used to list no page in core when it didn't know; this patch lists pages as in core when it refuses to tell. I don't think that's very important, though. If anything, the 0400 user-owner file might be a problem in some edge case (e.g. if you're preloading git directories, many objects are 0444); should we *also* check ownership?... -- Dominique [-- Attachment #2: mincore.diff --] [-- Type: text/x-diff, Size: 1197 bytes --] mm/mincore.c | 14 +++++++++++++- 1 file changed, 13 insertions(+), 1 deletion(-) diff --git a/mm/mincore.c b/mm/mincore.c index 218099b5ed31..11ed7064f4eb 100644 --- a/mm/mincore.c +++ b/mm/mincore.c @@ -169,6 +169,13 @@ static int mincore_pte_range(pmd_t *pmd, unsigned long addr, unsigned long end, return 0; } +static inline bool can_do_mincore(struct vm_area_struct *vma) +{ + return vma_is_anonymous(vma) + || (vma->vm_file && (vma->vm_file->f_mode & FMODE_WRITE)) + || inode_permission(file_inode(vma->vm_file), MAY_WRITE) == 0; +} + /* * Do a chunk of "sys_mincore()". We've already checked * all the arguments, we hold the mmap semaphore: we should @@ -189,8 +196,13 @@ static long do_mincore(unsigned long addr, unsigned long pages, unsigned char *v vma = find_vma(current->mm, addr); if (!vma || addr < vma->vm_start) return -ENOMEM; - mincore_walk.mm = vma->vm_mm; end = min(vma->vm_end, addr + (pages << PAGE_SHIFT)); + if (!can_do_mincore(vma)) { + unsigned long pages = (end - addr) >> PAGE_SHIFT; + memset(vec, 1, pages); + return pages; + } + mincore_walk.mm = vma->vm_mm; err = walk_page_range(addr, end, &mincore_walk); if (err < 0) return err; ^ permalink raw reply related [flat|nested] 161+ messages in thread
* Re: [PATCH] mm/mincore: allow for making sys_mincore() privileged 2019-01-16 6:34 ` Dominique Martinet @ 2019-01-16 7:52 ` Josh Snyder 2019-01-16 12:18 ` Kevin Easton 2019-01-17 21:45 ` Vlastimil Babka 0 siblings, 2 replies; 161+ messages in thread From: Josh Snyder @ 2019-01-16 7:52 UTC (permalink / raw) To: Dominique Martinet Cc: Linus Torvalds, Andy Lutomirski, Dave Chinner, Jiri Kosina, Matthew Wilcox, Jann Horn, Andrew Morton, Greg KH, Peter Zijlstra, Michal Hocko, Linux-MM, kernel list, Linux API On Tue, Jan 15, 2019 at 10:34 PM Dominique Martinet <asmadeus@codewreck.org> wrote: > > There is a difference with your previous patch though, that used to list no > page in core when it didn't know; this patch lists pages as in core when it > refuses to tell. I don't think that's very important, though. Is there a reason not to return -EPERM in this case? > > If anything, the 0400 user-owner file might be a problem in some edge > case (e.g. if you're preloading git directories, many objects are 0444); > should we *also* check ownership?... Yes, this seems valuable. Some databases with immutable files (e.g. git, as you've mentioned) conceivably operate this way. Josh ^ permalink raw reply [flat|nested] 161+ messages in thread
* Re: [PATCH] mm/mincore: allow for making sys_mincore() privileged 2019-01-16 7:52 ` Josh Snyder @ 2019-01-16 12:18 ` Kevin Easton 2019-01-17 21:45 ` Vlastimil Babka 1 sibling, 0 replies; 161+ messages in thread From: Kevin Easton @ 2019-01-16 12:18 UTC (permalink / raw) To: Josh Snyder Cc: Dominique Martinet, Linus Torvalds, Andy Lutomirski, Dave Chinner, Jiri Kosina, Matthew Wilcox, Jann Horn, Andrew Morton, Greg KH, Peter Zijlstra, Michal Hocko, Linux-MM, kernel list, Linux API On Tue, Jan 15, 2019 at 11:52:25PM -0800, Josh Snyder wrote: > On Tue, Jan 15, 2019 at 10:34 PM Dominique Martinet <asmadeus@codewreck.org> > wrote: > > > > There is a difference with your previous patch though, that used to list no > > page in core when it didn't know; this patch lists pages as in core when it > > refuses to tell. I don't think that's very important, though. > > Is there a reason not to return -EPERM in this case? When I was looking through the Debian Code Search results, quite a few of the hits were for code that uses mincore() as a way to check if _anything_ is mapped at an address or not. This code doesn't care about the in core / not in core result of mincore(), just whether it returned an error or not. I think a new error return would break most of the instances of that code I saw. - Kevin ^ permalink raw reply [flat|nested] 161+ messages in thread
* Re: [PATCH] mm/mincore: allow for making sys_mincore() privileged 2019-01-16 7:52 ` Josh Snyder 2019-01-16 12:18 ` Kevin Easton @ 2019-01-17 21:45 ` Vlastimil Babka 2019-01-18 4:49 ` Linus Torvalds 1 sibling, 1 reply; 161+ messages in thread From: Vlastimil Babka @ 2019-01-17 21:45 UTC (permalink / raw) To: Josh Snyder, Dominique Martinet Cc: Linus Torvalds, Andy Lutomirski, Dave Chinner, Jiri Kosina, Matthew Wilcox, Jann Horn, Andrew Morton, Greg KH, Peter Zijlstra, Michal Hocko, Linux-MM, kernel list, Linux API On 1/16/2019 8:52 AM, Josh Snyder wrote: > On Tue, Jan 15, 2019 at 10:34 PM Dominique Martinet <asmadeus@codewreck.org> > wrote: >> >> There is a difference with your previous patch though, that used to list no >> page in core when it didn't know; this patch lists pages as in core when it >> refuses to tell. I don't think that's very important, though. I've argued previously that reporting false positives (as your patch does) should be better, otherwise there might be somebody trying to fault in their pages in a loop until mincore reports positive, which would become an endless loop. So agreed with your change. Or maybe we could resort to the 5.0-rc1 page table check (that is now being reverted) but only in cases when we are not allowed the page cache residency check? Or would that be needlessly complicated? And it would be able to leak if a page was evicted from the page cache... > Is there a reason not to return -EPERM in this case? That would definitely break somebody. >> >> If anything, the 0400 user-owner file might be a problem in some edge >> case (e.g. if you're preloading git directories, many objects are 0444); >> should we *also* check ownership?... > > Yes, this seems valuable. Some databases with immutable files (e.g. git, as > you've mentioned) conceivably operate this way. > > Josh > ^ permalink raw reply [flat|nested] 161+ messages in thread
* Re: [PATCH] mm/mincore: allow for making sys_mincore() privileged 2019-01-17 21:45 ` Vlastimil Babka @ 2019-01-18 4:49 ` Linus Torvalds 2019-01-18 18:58 ` Vlastimil Babka 0 siblings, 1 reply; 161+ messages in thread From: Linus Torvalds @ 2019-01-18 4:49 UTC (permalink / raw) To: Vlastimil Babka Cc: Josh Snyder, Dominique Martinet, Andy Lutomirski, Dave Chinner, Jiri Kosina, Matthew Wilcox, Jann Horn, Andrew Morton, Greg KH, Peter Zijlstra, Michal Hocko, Linux-MM, kernel list, Linux API On Fri, Jan 18, 2019 at 9:45 AM Vlastimil Babka <vbabka@suse.cz> wrote: > > Or maybe we could resort to the 5.0-rc1 page table check (that is now being > reverted) but only in cases when we are not allowed the page cache residency > check? Or would that be needlessly complicated? I think it would be good fallback semantics, but I'm not sure it's worth it. Have you tried writing a patch for it? I don't think you'd want to do the check *when* you find a hole, so you'd have to do it upfront and then pass the cached data down with the private pointer (or have a separate "struct mm_walk" structure, perhaps? So I suspect we're better off with the patch we have. But if somebody *wants* to try to do that fancier patch, and it doesn't look horrendous, I think it might be the "quality" solution. Linus ^ permalink raw reply [flat|nested] 161+ messages in thread
* Re: [PATCH] mm/mincore: allow for making sys_mincore() privileged 2019-01-18 4:49 ` Linus Torvalds @ 2019-01-18 18:58 ` Vlastimil Babka 0 siblings, 0 replies; 161+ messages in thread From: Vlastimil Babka @ 2019-01-18 18:58 UTC (permalink / raw) To: Linus Torvalds Cc: Josh Snyder, Dominique Martinet, Andy Lutomirski, Dave Chinner, Jiri Kosina, Matthew Wilcox, Jann Horn, Andrew Morton, Greg KH, Peter Zijlstra, Michal Hocko, Linux-MM, kernel list, Linux API On 1/18/19 5:49 AM, Linus Torvalds wrote: > On Fri, Jan 18, 2019 at 9:45 AM Vlastimil Babka <vbabka@suse.cz> wrote: >> >> Or maybe we could resort to the 5.0-rc1 page table check (that is now being >> reverted) but only in cases when we are not allowed the page cache residency >> check? Or would that be needlessly complicated? > > I think it would be good fallback semantics, but I'm not sure it's > worth it. Have you tried writing a patch for it? I don't think you'd > want to do the check *when* you find a hole, so you'd have to do it > upfront and then pass the cached data down with the private pointer > (or have a separate "struct mm_walk" structure, perhaps? > > So I suspect we're better off with the patch we have. But if somebody > *wants* to try to do that fancier patch, and it doesn't look > horrendous, I think it might be the "quality" solution. I thought to drop the idea because of leaking that page has been evicted, but then I realized there are other ways to check for that anyway in /proc. So I'll try, but probably not until after next week. If somebody else wants to, they are welcome. As you say, the current solution should be ok, so that would be a patch on top anyway, for bisectability etc. Vlastimil > Linus > ^ permalink raw reply [flat|nested] 161+ messages in thread
* Re: [PATCH] mm/mincore: allow for making sys_mincore() privileged 2019-01-16 5:58 ` Linus Torvalds 2019-01-16 6:34 ` Dominique Martinet @ 2019-01-16 16:12 ` Jiri Kosina 2019-01-16 17:48 ` Linus Torvalds 1 sibling, 1 reply; 161+ messages in thread From: Jiri Kosina @ 2019-01-16 16:12 UTC (permalink / raw) To: Linus Torvalds Cc: Dominique Martinet, Andy Lutomirski, Josh Snyder, Dave Chinner, Matthew Wilcox, Jann Horn, Andrew Morton, Greg KH, Peter Zijlstra, Michal Hocko, Linux-MM, kernel list, Linux API On Wed, 16 Jan 2019, Linus Torvalds wrote: > > "Being owner or has cap" (whichever cap) is probably OK. On the other > > hand, writeability check makes more sense in general - could we > > somehow check if the user has write access to the file instead of > > checking if it currently is opened read-write? > > That's likely the best option. We could say "is it open for write, or > _could_ we open it for writing?" > > It's a slightly annoying special case, and I'd have preferred to avoid > it, but it doesn't sound *compilcated*. > > I'm on the road, but I did send out this: > > https://lore.kernel.org/lkml/CAHk-=wif_9nvNHJiyxHzJ80_WUb0P7CXNBvXkjZz-r1u0ozp7g@mail.gmail.com/ > > originally. The "let's try to only do the mmap residency" was the > optimistic "maybe we can just get rid of this complexity entirely" > version.. > > Anybody willing to test the above patch instead? And replace the > > || capable(CAP_SYS_ADMIN) > > check with something like > > || inode_permission(inode, MAY_WRITE) == 0 > > instead? > > (This is obviously after you've reverted the "only check mmap > residency" patch..) So that seems to deal with mincore() in a reasonable way indeed. It doesn't unfortunately really solve the preadv2(RWF_NOWAIT), nor does it provide any good answer what to do about it, does it? Thanks, -- Jiri Kosina SUSE Labs ^ permalink raw reply [flat|nested] 161+ messages in thread
* Re: [PATCH] mm/mincore: allow for making sys_mincore() privileged 2019-01-16 16:12 ` Jiri Kosina @ 2019-01-16 17:48 ` Linus Torvalds 2019-01-16 20:23 ` Jiri Kosina 0 siblings, 1 reply; 161+ messages in thread From: Linus Torvalds @ 2019-01-16 17:48 UTC (permalink / raw) To: Jiri Kosina Cc: Dominique Martinet, Andy Lutomirski, Josh Snyder, Dave Chinner, Matthew Wilcox, Jann Horn, Andrew Morton, Greg KH, Peter Zijlstra, Michal Hocko, Linux-MM, kernel list, Linux API On Thu, Jan 17, 2019 at 4:12 AM Jiri Kosina <jikos@kernel.org> wrote: > > So that seems to deal with mincore() in a reasonable way indeed. > > It doesn't unfortunately really solve the preadv2(RWF_NOWAIT), nor does it > provide any good answer what to do about it, does it? As I suggested earlier in the thread, the fix for RWF_NOWAIT might be to just move the test down to after readahead. We could/should be smarter than this,but it *might* be as simple as just diff --git a/mm/filemap.c b/mm/filemap.c index 9f5e323e883e..7bcdd36e629d 100644 --- a/mm/filemap.c +++ b/mm/filemap.c @@ -2075,8 +2075,6 @@ static ssize_t generic_file_buffered_read( page = find_get_page(mapping, index); if (!page) { - if (iocb->ki_flags & IOCB_NOWAIT) - goto would_block; page_cache_sync_readahead(mapping, ra, filp, index, last_index - index); which starts readahead even if IOCB_NOWAIT is set and will then test IOCB_NOWAIT _later_ and not actually wait for it. Linus ^ permalink raw reply related [flat|nested] 161+ messages in thread
* Re: [PATCH] mm/mincore: allow for making sys_mincore() privileged 2019-01-16 17:48 ` Linus Torvalds @ 2019-01-16 20:23 ` Jiri Kosina 2019-01-16 21:37 ` Matthew Wilcox ` (2 more replies) 0 siblings, 3 replies; 161+ messages in thread From: Jiri Kosina @ 2019-01-16 20:23 UTC (permalink / raw) To: Linus Torvalds Cc: Dominique Martinet, Andy Lutomirski, Josh Snyder, Dave Chinner, Matthew Wilcox, Jann Horn, Andrew Morton, Greg KH, Peter Zijlstra, Michal Hocko, Linux-MM, kernel list, Linux API [-- Attachment #1: Type: text/plain, Size: 801 bytes --] On Thu, 17 Jan 2019, Linus Torvalds wrote: > > So that seems to deal with mincore() in a reasonable way indeed. > > > > It doesn't unfortunately really solve the preadv2(RWF_NOWAIT), nor does it > > provide any good answer what to do about it, does it? > > As I suggested earlier in the thread, the fix for RWF_NOWAIT might be > to just move the test down to after readahead. So I've done some basic smoke testing (~2 hours of LTP+xfstests) on the kernel with the three topmost patches from https://git.kernel.org/pub/scm/linux/kernel/git/jikos/jikos.git/log/?h=pagecache-sidechannel applied (also attaching to this mail), and no obvious breakage popped up. So if noone sees any principal problem there, I'll happily submit it with proper attribution etc. Thanks, -- Jiri Kosina SUSE Labs [-- Warning: decoded text below may be mangled, UTF-8 assumed --] [-- Attachment #2: Type: text/x-patch; name=0001-Revert-Change-mincore-to-count-mapped-pages-rather-t.patch, Size: 4248 bytes --] From cbf9381eed6766cff5b05f9d948c1d225cb3d78b Mon Sep 17 00:00:00 2001 From: Jiri Kosina <jkosina@suse.cz> Date: Wed, 16 Jan 2019 20:51:31 +0100 Subject: [PATCH 1/3] Revert "Change mincore() to count "mapped" pages rather than "cached" pages" This reverts commit 574823bfab82d9d8fa47f422778043fbb4b4f50e. Another aproach (checking file access permissions in order to decide what mincore() should return in order not to leak data) will be implemented instead. Signed-off-by: Jiri Kosina <jkosina@suse.cz> --- mm/mincore.c | 94 +++++++++++++++++++++++++++++++++++++++++++++++++++--------- 1 file changed, 81 insertions(+), 13 deletions(-) diff --git a/mm/mincore.c b/mm/mincore.c index f0f91461a9f4..218099b5ed31 100644 --- a/mm/mincore.c +++ b/mm/mincore.c @@ -42,14 +42,72 @@ static int mincore_hugetlb(pte_t *pte, unsigned long hmask, unsigned long addr, return 0; } -static int mincore_unmapped_range(unsigned long addr, unsigned long end, - struct mm_walk *walk) +/* + * Later we can get more picky about what "in core" means precisely. + * For now, simply check to see if the page is in the page cache, + * and is up to date; i.e. that no page-in operation would be required + * at this time if an application were to map and access this page. + */ +static unsigned char mincore_page(struct address_space *mapping, pgoff_t pgoff) +{ + unsigned char present = 0; + struct page *page; + + /* + * When tmpfs swaps out a page from a file, any process mapping that + * file will not get a swp_entry_t in its pte, but rather it is like + * any other file mapping (ie. marked !present and faulted in with + * tmpfs's .fault). So swapped out tmpfs mappings are tested here. + */ +#ifdef CONFIG_SWAP + if (shmem_mapping(mapping)) { + page = find_get_entry(mapping, pgoff); + /* + * shmem/tmpfs may return swap: account for swapcache + * page too. + */ + if (xa_is_value(page)) { + swp_entry_t swp = radix_to_swp_entry(page); + page = find_get_page(swap_address_space(swp), + swp_offset(swp)); + } + } else + page = find_get_page(mapping, pgoff); +#else + page = find_get_page(mapping, pgoff); +#endif + if (page) { + present = PageUptodate(page); + put_page(page); + } + + return present; +} + +static int __mincore_unmapped_range(unsigned long addr, unsigned long end, + struct vm_area_struct *vma, unsigned char *vec) { - unsigned char *vec = walk->private; unsigned long nr = (end - addr) >> PAGE_SHIFT; + int i; - memset(vec, 0, nr); - walk->private += nr; + if (vma->vm_file) { + pgoff_t pgoff; + + pgoff = linear_page_index(vma, addr); + for (i = 0; i < nr; i++, pgoff++) + vec[i] = mincore_page(vma->vm_file->f_mapping, pgoff); + } else { + for (i = 0; i < nr; i++) + vec[i] = 0; + } + return nr; +} + +static int mincore_unmapped_range(unsigned long addr, unsigned long end, + struct mm_walk *walk) +{ + walk->private += __mincore_unmapped_range(addr, end, + walk->vma, walk->private); return 0; } @@ -69,9 +127,8 @@ static int mincore_pte_range(pmd_t *pmd, unsigned long addr, unsigned long end, goto out; } - /* We'll consider a THP page under construction to be there */ if (pmd_trans_unstable(pmd)) { - memset(vec, 1, nr); + __mincore_unmapped_range(addr, end, vma, vec); goto out; } @@ -80,17 +137,28 @@ static int mincore_pte_range(pmd_t *pmd, unsigned long addr, unsigned long end, pte_t pte = *ptep; if (pte_none(pte)) - *vec = 0; + __mincore_unmapped_range(addr, addr + PAGE_SIZE, + vma, vec); else if (pte_present(pte)) *vec = 1; else { /* pte is a swap entry */ swp_entry_t entry = pte_to_swp_entry(pte); - /* - * migration or hwpoison entries are always - * uptodate - */ - *vec = !!non_swap_entry(entry); + if (non_swap_entry(entry)) { + /* + * migration or hwpoison entries are always + * uptodate + */ + *vec = 1; + } else { +#ifdef CONFIG_SWAP + *vec = mincore_page(swap_address_space(entry), + swp_offset(entry)); +#else + WARN_ON(1); + *vec = 1; +#endif + } } vec++; } -- 2.12.3 [-- Warning: decoded text below may be mangled, UTF-8 assumed --] [-- Attachment #3: Type: text/x-patch; name=0002-mm-mincore-make-mincore-more-conservative.patch, Size: 2315 bytes --] From ca02a026f40dfaebc29c29edd9c992a0ff10075e Mon Sep 17 00:00:00 2001 From: Jiri Kosina <jkosina@suse.cz> Date: Wed, 16 Jan 2019 20:53:17 +0100 Subject: [PATCH 2/3] mm/mincore: make mincore() more conservative The semantics of what mincore() considers to be resident is not completely clearar, but Linux has always (since 2.3.52, which is when mincore() was initially done) treated it as "page is available in page cache". That's potentially a problem, as that [in]directly exposes meta-information about pagecache / memory mapping state even about memory not strictly belonging to the process executing the syscall, opening possibilities for sidechannel attacks. Change the semantics of mincore() so that it only reveals pagecache information for non-anonymous mappings that belog to files that the calling process could (if it tried to) successfully open for writing. Originally-by: Linus Torvalds <torvalds@linux-foundation.org> Originally-by: Dominique Martinet <asmadeus@codewreck.org> Signed-off-by: Jiri Kosina <jkosina@suse.cz> --- mm/mincore.c | 14 +++++++++++++- 1 file changed, 13 insertions(+), 1 deletion(-) diff --git a/mm/mincore.c b/mm/mincore.c index 218099b5ed31..11ed7064f4eb 100644 --- a/mm/mincore.c +++ b/mm/mincore.c @@ -169,6 +169,13 @@ static int mincore_pte_range(pmd_t *pmd, unsigned long addr, unsigned long end, return 0; } +static inline bool can_do_mincore(struct vm_area_struct *vma) +{ + return vma_is_anonymous(vma) + || (vma->vm_file && (vma->vm_file->f_mode & FMODE_WRITE)) + || inode_permission(file_inode(vma->vm_file), MAY_WRITE) == 0; +} + /* * Do a chunk of "sys_mincore()". We've already checked * all the arguments, we hold the mmap semaphore: we should @@ -189,8 +196,13 @@ static long do_mincore(unsigned long addr, unsigned long pages, unsigned char *v vma = find_vma(current->mm, addr); if (!vma || addr < vma->vm_start) return -ENOMEM; - mincore_walk.mm = vma->vm_mm; end = min(vma->vm_end, addr + (pages << PAGE_SHIFT)); + if (!can_do_mincore(vma)) { + unsigned long pages = (end - addr) >> PAGE_SHIFT; + memset(vec, 1, pages); + return pages; + } + mincore_walk.mm = vma->vm_mm; err = walk_page_range(addr, end, &mincore_walk); if (err < 0) return err; -- 2.12.3 [-- Warning: decoded text below may be mangled, UTF-8 assumed --] [-- Attachment #4: Type: text/x-patch; name=0003-mm-filemap-initiate-readahead-even-if-IOCB_NOWAIT-is.patch, Size: 1368 bytes --] From e7765f317afb193adb4ba00d81251686191cbf4b Mon Sep 17 00:00:00 2001 From: Jiri Kosina <jkosina@suse.cz> Date: Wed, 16 Jan 2019 21:06:58 +0100 Subject: [PATCH 3/3] mm/filemap: initiate readahead even if IOCB_NOWAIT is set for the I/O preadv2(RWF_NOWAIT) can be used to open a side-channel to pagecache contents, as it reveals metadata about residency of pages in pagecache. If preadv2(RWF_NOWAIT) returns immediately, it provides a clear "page not resident" information, and vice versa. Close that sidechannel by always initiating readahead on the cache if we encounter a cache miss for preadv2(RWF_NOWAIT); with that in place, probing the pagecache residency itself will actually populate the cache, making the sidechannel useless. Originally-by: Linus Torvalds <torvalds@linux-foundation.org> Signed-off-by: Jiri Kosina <jkosina@suse.cz> --- mm/filemap.c | 2 -- 1 file changed, 2 deletions(-) diff --git a/mm/filemap.c b/mm/filemap.c index 9f5e323e883e..7bcdd36e629d 100644 --- a/mm/filemap.c +++ b/mm/filemap.c @@ -2075,8 +2075,6 @@ static ssize_t generic_file_buffered_read(struct kiocb *iocb, page = find_get_page(mapping, index); if (!page) { - if (iocb->ki_flags & IOCB_NOWAIT) - goto would_block; page_cache_sync_readahead(mapping, ra, filp, index, last_index - index); -- 2.12.3 ^ permalink raw reply related [flat|nested] 161+ messages in thread
* Re: [PATCH] mm/mincore: allow for making sys_mincore() privileged 2019-01-16 20:23 ` Jiri Kosina @ 2019-01-16 21:37 ` Matthew Wilcox 2019-01-16 21:41 ` Jiri Kosina 2019-01-17 4:51 ` Linus Torvalds 2019-01-17 1:49 ` Dominique Martinet 2019-01-23 20:27 ` Linus Torvalds 2 siblings, 2 replies; 161+ messages in thread From: Matthew Wilcox @ 2019-01-16 21:37 UTC (permalink / raw) To: Jiri Kosina Cc: Linus Torvalds, Dominique Martinet, Andy Lutomirski, Josh Snyder, Dave Chinner, Jann Horn, Andrew Morton, Greg KH, Peter Zijlstra, Michal Hocko, Linux-MM, kernel list, Linux API On Wed, Jan 16, 2019 at 09:23:04PM +0100, Jiri Kosina wrote: > On Thu, 17 Jan 2019, Linus Torvalds wrote: > > As I suggested earlier in the thread, the fix for RWF_NOWAIT might be > > to just move the test down to after readahead. Your patch 3/3 just removes the test. Am I right in thinking that it doesn't need to be *moved* because the existing test after !PageUptodate catches it? Of course, there aren't any tests for RWF_NOWAIT in xfstests. Are there any in LTP? Some typos in the commit messages: > Another aproach (checking file access permissions in order to decide "approach" > Subject: [PATCH 2/3] mm/mincore: make mincore() more conservative > > The semantics of what mincore() considers to be resident is not completely > clearar, but Linux has always (since 2.3.52, which is when mincore() was "clear" > initially done) treated it as "page is available in page cache". > > That's potentially a problem, as that [in]directly exposes meta-information > about pagecache / memory mapping state even about memory not strictly belonging > to the process executing the syscall, opening possibilities for sidechannel > attacks. > > Change the semantics of mincore() so that it only reveals pagecache information > for non-anonymous mappings that belog to files that the calling process could "belong" ^ permalink raw reply [flat|nested] 161+ messages in thread
* Re: [PATCH] mm/mincore: allow for making sys_mincore() privileged 2019-01-16 21:37 ` Matthew Wilcox @ 2019-01-16 21:41 ` Jiri Kosina 2019-01-17 9:52 ` Cyril Hrubis 2019-01-28 13:49 ` Cyril Hrubis 2019-01-17 4:51 ` Linus Torvalds 1 sibling, 2 replies; 161+ messages in thread From: Jiri Kosina @ 2019-01-16 21:41 UTC (permalink / raw) To: Matthew Wilcox Cc: Linus Torvalds, Dominique Martinet, Andy Lutomirski, Josh Snyder, Dave Chinner, Jann Horn, Andrew Morton, Greg KH, Peter Zijlstra, Michal Hocko, Linux-MM, kernel list, Cyril Hrubis, Linux API On Wed, 16 Jan 2019, Matthew Wilcox wrote: > > On Thu, 17 Jan 2019, Linus Torvalds wrote: > > > As I suggested earlier in the thread, the fix for RWF_NOWAIT might be > > > to just move the test down to after readahead. > > Your patch 3/3 just removes the test. Am I right in thinking that it > doesn't need to be *moved* because the existing test after !PageUptodate > catches it? Exactly. It just initiates read-ahead for IOCB_NOWAIT cases as well, and if it's actually set, it'll be handled by the !PageUpdtodate case. > Of course, there aren't any tests for RWF_NOWAIT in xfstests. Are there > any in LTP? Not in the released version AFAIK. I've asked the LTP maintainer (in our internal bugzilla) to take care of this thread a few days ago, but not sure what came out of it. Adding him (Cyril) to CC. > Some typos in the commit messages: > > > Another aproach (checking file access permissions in order to decide > "approach" > > > Subject: [PATCH 2/3] mm/mincore: make mincore() more conservative > > > > The semantics of what mincore() considers to be resident is not completely > > clearar, but Linux has always (since 2.3.52, which is when mincore() was > "clear" > > > initially done) treated it as "page is available in page cache". > > > > That's potentially a problem, as that [in]directly exposes meta-information > > about pagecache / memory mapping state even about memory not strictly belonging > > to the process executing the syscall, opening possibilities for sidechannel > > attacks. > > > > Change the semantics of mincore() so that it only reveals pagecache information > > for non-anonymous mappings that belog to files that the calling process could > "belong" Thanks. -- Jiri Kosina SUSE Labs ^ permalink raw reply [flat|nested] 161+ messages in thread
* Re: [PATCH] mm/mincore: allow for making sys_mincore() privileged 2019-01-16 21:41 ` Jiri Kosina @ 2019-01-17 9:52 ` Cyril Hrubis 2019-01-28 13:49 ` Cyril Hrubis 1 sibling, 0 replies; 161+ messages in thread From: Cyril Hrubis @ 2019-01-17 9:52 UTC (permalink / raw) To: Jiri Kosina Cc: Matthew Wilcox, Linus Torvalds, Dominique Martinet, Andy Lutomirski, Josh Snyder, Dave Chinner, Jann Horn, Andrew Morton, Greg KH, Peter Zijlstra, Michal Hocko, Linux-MM, kernel list, Linux API Hi! > > Your patch 3/3 just removes the test. Am I right in thinking that it > > doesn't need to be *moved* because the existing test after !PageUptodate > > catches it? > > Exactly. It just initiates read-ahead for IOCB_NOWAIT cases as well, and > if it's actually set, it'll be handled by the !PageUpdtodate case. > > > Of course, there aren't any tests for RWF_NOWAIT in xfstests. Are there > > any in LTP? > > Not in the released version AFAIK. I've asked the LTP maintainer (in our > internal bugzilla) to take care of this thread a few days ago, but not > sure what came out of it. Adding him (Cyril) to CC. So far not much, I've looked over our mincore() tests and noted down how to improve them here: https://github.com/linux-test-project/ltp/issues/461 We do plan to test the final mincore() fix: https://github.com/linux-test-project/ltp/issues/460 And we do have RWF_NOWAIT tests on our TODO for some time as well: https://github.com/linux-test-project/ltp/issues/286 I guess I can raise priority for that one so that we have basic functional tests in a week or so. Also if anyone has some RWF_NOWAIT tests already it would be nice if these could be shared with us. [A bit off topic rant] I've been telling kernel developers for years that if they have a test code they used when developing a kernel feature that they should share it with us (LTP community) and we will turn these into automated tests and maintain them for free. LTP is also used in many QA departements around the word so such tests will end up executed in different environments also for free. Sadly this does not happen much and there are only few exceptions so far. But maybe I wasn't shouting loudly enough. -- Cyril Hrubis chrubis@suse.cz ^ permalink raw reply [flat|nested] 161+ messages in thread
* Re: [PATCH] mm/mincore: allow for making sys_mincore() privileged 2019-01-16 21:41 ` Jiri Kosina 2019-01-17 9:52 ` Cyril Hrubis @ 2019-01-28 13:49 ` Cyril Hrubis 1 sibling, 0 replies; 161+ messages in thread From: Cyril Hrubis @ 2019-01-28 13:49 UTC (permalink / raw) To: Jiri Kosina Cc: Matthew Wilcox, Linus Torvalds, Dominique Martinet, Andy Lutomirski, Josh Snyder, Dave Chinner, Jann Horn, Andrew Morton, Greg KH, Peter Zijlstra, Michal Hocko, Linux-MM, kernel list, Linux API Hi! > > Of course, there aren't any tests for RWF_NOWAIT in xfstests. Are there > > any in LTP? Just FYI I've send a patch with basic RWF_NOWAIT test for review to LTP ML and also CCed mailing lists from this thread. https://lkml.org/lkml/2019/1/28/416 -- Cyril Hrubis chrubis@suse.cz ^ permalink raw reply [flat|nested] 161+ messages in thread
* Re: [PATCH] mm/mincore: allow for making sys_mincore() privileged 2019-01-16 21:37 ` Matthew Wilcox 2019-01-16 21:41 ` Jiri Kosina @ 2019-01-17 4:51 ` Linus Torvalds 2019-01-18 4:54 ` Linus Torvalds 1 sibling, 1 reply; 161+ messages in thread From: Linus Torvalds @ 2019-01-17 4:51 UTC (permalink / raw) To: Matthew Wilcox Cc: Jiri Kosina, Dominique Martinet, Andy Lutomirski, Josh Snyder, Dave Chinner, Jann Horn, Andrew Morton, Greg KH, Peter Zijlstra, Michal Hocko, Linux-MM, kernel list, Linux API On Thu, Jan 17, 2019 at 9:37 AM Matthew Wilcox <willy@infradead.org> wrote: > > Your patch 3/3 just removes the test. Am I right in thinking that it > doesn't need to be *moved* because the existing test after !PageUptodate > catches it? That's the _hope_. That's the simplest patch I can come up with as a potential solution. But it's possible that there's some nasty performance regression because somebody really relies on not even triggering read-ahead, and we might need to do some totally different thing. So it may be that somebody has a case that really wants something else, and we'd need to move the RWF_NOWAIT test elsewhere and do something slightly more complicated. As with the mincore() change, maybe reality doesn't like the simplest fix... > Of course, there aren't any tests for RWF_NOWAIT in xfstests. Are there > any in LTP? RWF_NOWAIT is actually _fairly_ new. It was introduced "only" about 18 months ago and made it into 4.13. Which makes me hopeful there aren't a lot of people who care deeply. And starting readahead *may* actually be what a RWF_NOWAIT read user generally wants, so for all we know it might even improve performance and/or allow new uses. With the "start readahead but don't wait for it" semantics, you can have a model where you try to handle all the requests that can be handled out of cache first (using RWF_NOWAIT) and then when you've run out of cached cases you clear the RWF_NOWAIT flag, but now the IO has been started early (and could overlap with the cached request handling), so then when you actually do a blocking version, you get much better performance. So there is an argument that removing that one RWF_NOWAIT case might actually be a good thing in general, outside of the "don't allow probing the cache without changing the state of it" issue. But that's handwavy and optimistic. Reality is often not as accomodating ;) Linus ^ permalink raw reply [flat|nested] 161+ messages in thread
* Re: [PATCH] mm/mincore: allow for making sys_mincore() privileged 2019-01-17 4:51 ` Linus Torvalds @ 2019-01-18 4:54 ` Linus Torvalds 0 siblings, 0 replies; 161+ messages in thread From: Linus Torvalds @ 2019-01-18 4:54 UTC (permalink / raw) To: Matthew Wilcox Cc: Jiri Kosina, Dominique Martinet, Andy Lutomirski, Josh Snyder, Dave Chinner, Jann Horn, Andrew Morton, Greg KH, Peter Zijlstra, Michal Hocko, Linux-MM, kernel list, Linux API On Thu, Jan 17, 2019 at 4:51 PM Linus Torvalds <torvalds@linux-foundation.org> wrote: > > On Thu, Jan 17, 2019 at 9:37 AM Matthew Wilcox <willy@infradead.org> wrote: > > > > Your patch 3/3 just removes the test. Am I right in thinking that it > > doesn't need to be *moved* because the existing test after !PageUptodate > > catches it? > > That's the _hope_. > > That's the simplest patch I can come up with as a potential solution. > But it's possible that there's some nasty performance regression > because somebody really relies on not even triggering read-ahead, and > we might need to do some totally different thing. Oh, and somebody should probably check that there isn't some simple way to just avoid that readahead code entirely. In particular, right now we skip readahead for at least these cases: /* no read-ahead */ if (!ra->ra_pages) return; if (blk_cgroup_congested()) return; and I don't think we need to worry about the cgroup congestion case - if the attack has to also congest its cgroup with IO, I think they have bigger problems. And I think 'ra_pages' can be zero only in the presence of IO errors, but I might be wrong. It would be good if somebody double-checks that. Linus ^ permalink raw reply [flat|nested] 161+ messages in thread
* Re: [PATCH] mm/mincore: allow for making sys_mincore() privileged 2019-01-16 20:23 ` Jiri Kosina 2019-01-16 21:37 ` Matthew Wilcox @ 2019-01-17 1:49 ` Dominique Martinet 2019-01-23 20:27 ` Linus Torvalds 2 siblings, 0 replies; 161+ messages in thread From: Dominique Martinet @ 2019-01-17 1:49 UTC (permalink / raw) To: Jiri Kosina Cc: Linus Torvalds, Andy Lutomirski, Josh Snyder, Dave Chinner, Matthew Wilcox, Jann Horn, Andrew Morton, Greg KH, Peter Zijlstra, Michal Hocko, Linux-MM, kernel list, Linux API Jiri Kosina wrote on Wed, Jan 16, 2019: > So if noone sees any principal problem there, I'll happily submit it with > proper attribution etc. I'm not convinced just the write permission check is enough for mincore(), as Josh also seems to share the concern I raised (e.g. map a git directory "hot" pages) We probably need to add an inode_owner_or_capable() or similar, the open question is do we still need the write access check after that - I don't really know how expensive these calls are. Thanks, -- Dominique ^ permalink raw reply [flat|nested] 161+ messages in thread
* Re: [PATCH] mm/mincore: allow for making sys_mincore() privileged 2019-01-16 20:23 ` Jiri Kosina 2019-01-16 21:37 ` Matthew Wilcox 2019-01-17 1:49 ` Dominique Martinet @ 2019-01-23 20:27 ` Linus Torvalds 2019-01-23 20:35 ` Linus Torvalds 2019-01-24 0:24 ` Dominique Martinet 2 siblings, 2 replies; 161+ messages in thread From: Linus Torvalds @ 2019-01-23 20:27 UTC (permalink / raw) To: Jiri Kosina Cc: Dominique Martinet, Andy Lutomirski, Josh Snyder, Dave Chinner, Matthew Wilcox, Jann Horn, Andrew Morton, Greg KH, Peter Zijlstra, Michal Hocko, Linux-MM, kernel list, Linux API On Thu, Jan 17, 2019 at 9:23 AM Jiri Kosina <jikos@kernel.org> wrote: > > So I've done some basic smoke testing (~2 hours of LTP+xfstests) on the > kernel with the three topmost patches from > > https://git.kernel.org/pub/scm/linux/kernel/git/jikos/jikos.git/log/?h=pagecache-sidechannel > > applied (also attaching to this mail), and no obvious breakage popped up. > > So if noone sees any principal problem there, I'll happily submit it with > proper attribution etc. So this seems to have died down, and a week later we seem to not have a lot of noise here any more. I think it means people either weren't testing it, or just didn't find any real problems. I've reverted the 'let's try to just remove the code' part in my tree. But I didn't apply the two other patches yet. Any final comments before that should happen? Linus ^ permalink raw reply [flat|nested] 161+ messages in thread
* Re: [PATCH] mm/mincore: allow for making sys_mincore() privileged 2019-01-23 20:27 ` Linus Torvalds @ 2019-01-23 20:35 ` Linus Torvalds 2019-01-23 23:12 ` Jiri Kosina 2019-01-24 0:24 ` Dominique Martinet 1 sibling, 1 reply; 161+ messages in thread From: Linus Torvalds @ 2019-01-23 20:35 UTC (permalink / raw) To: Jiri Kosina Cc: Dominique Martinet, Andy Lutomirski, Josh Snyder, Dave Chinner, Matthew Wilcox, Jann Horn, Andrew Morton, Greg KH, Peter Zijlstra, Michal Hocko, Linux-MM, kernel list, Linux API On Thu, Jan 24, 2019 at 9:27 AM Linus Torvalds <torvalds@linux-foundation.org> wrote: > > I've reverted the 'let's try to just remove the code' part in my tree. > But I didn't apply the two other patches yet. Any final comments > before that should happen? Side note: the inode_permission() addition to can_do_mincore() in that patch 0002, seems to be questionable. We do +static inline bool can_do_mincore(struct vm_area_struct *vma) +{ + return vma_is_anonymous(vma) + || (vma->vm_file && (vma->vm_file->f_mode & FMODE_WRITE)) + || inode_permission(file_inode(vma->vm_file), MAY_WRITE) == 0; +} note how it tests whether vma->vm_file is NULL for the FMODE_WRITE test, but not for the inode_permission() test. So either we test unnecessarily in the second line, or we don't properly test it in the third one. I think the "test vm_file" thing may be unnecessary, because a non-anonymous mapping should always have a file pointer and an inode. But I could imagine some odd case (vdso mapping, anyone?) that doesn't have a vm_file, but also isn't anonymous. Anybody? Anyway, it's one reason why I didn't actually apply those other two patches yet. This may be a 5.1 issue.. Linus ^ permalink raw reply [flat|nested] 161+ messages in thread
* Re: [PATCH] mm/mincore: allow for making sys_mincore() privileged 2019-01-23 20:35 ` Linus Torvalds @ 2019-01-23 23:12 ` Jiri Kosina 2019-01-24 0:20 ` Linus Torvalds 0 siblings, 1 reply; 161+ messages in thread From: Jiri Kosina @ 2019-01-23 23:12 UTC (permalink / raw) To: Linus Torvalds Cc: Dominique Martinet, Andy Lutomirski, Josh Snyder, Dave Chinner, Matthew Wilcox, Jann Horn, Andrew Morton, Greg KH, Peter Zijlstra, Michal Hocko, Linux-MM, kernel list, Linux API On Thu, 24 Jan 2019, Linus Torvalds wrote: > Side note: the inode_permission() addition to can_do_mincore() in that > patch 0002, seems to be questionable. We do > > +static inline bool can_do_mincore(struct vm_area_struct *vma) > +{ > + return vma_is_anonymous(vma) > + || (vma->vm_file && (vma->vm_file->f_mode & FMODE_WRITE)) > + || inode_permission(file_inode(vma->vm_file), MAY_WRITE) == 0; > +} > > note how it tests whether vma->vm_file is NULL for the FMODE_WRITE > test, but not for the inode_permission() test. > > So either we test unnecessarily in the second line, or we don't > properly test it in the third one. > > I think the "test vm_file" thing may be unnecessary, because a > non-anonymous mapping should always have a file pointer and an inode. > But I could imagine some odd case (vdso mapping, anyone?) that > doesn't have a vm_file, but also isn't anonymous. Hmm, good point. So dropping the 'vma->vm_file' test and checking whether given vma is special mapping should hopefully provide the desired semantics, shouldn't it? -- Jiri Kosina SUSE Labs ^ permalink raw reply [flat|nested] 161+ messages in thread
* Re: [PATCH] mm/mincore: allow for making sys_mincore() privileged 2019-01-23 23:12 ` Jiri Kosina @ 2019-01-24 0:20 ` Linus Torvalds 0 siblings, 0 replies; 161+ messages in thread From: Linus Torvalds @ 2019-01-24 0:20 UTC (permalink / raw) To: Jiri Kosina Cc: Dominique Martinet, Andy Lutomirski, Josh Snyder, Dave Chinner, Matthew Wilcox, Jann Horn, Andrew Morton, Greg KH, Peter Zijlstra, Michal Hocko, Linux-MM, kernel list, Linux API On Thu, Jan 24, 2019 at 12:12 PM Jiri Kosina <jikos@kernel.org> wrote: > > > > > I think the "test vm_file" thing may be unnecessary, because a > > non-anonymous mapping should always have a file pointer and an inode. > > But I could imagine some odd case (vdso mapping, anyone?) that > > doesn't have a vm_file, but also isn't anonymous. > > Hmm, good point. > > So dropping the 'vma->vm_file' test and checking whether given vma is > special mapping should hopefully provide the desired semantics, shouldn't > it? Maybe. But on the whole I think it would be simpler and more straightforward to just instead add a vm_file test for the inode_permission() case. That way you at least know that you aren't following a NULL pointer. If the file then turns out to be some special thing, it doesn't really _matter_, I think. It won't have anything in the page cache etc, but the code should "work". Linus ^ permalink raw reply [flat|nested] 161+ messages in thread
* Re: [PATCH] mm/mincore: allow for making sys_mincore() privileged 2019-01-23 20:27 ` Linus Torvalds 2019-01-23 20:35 ` Linus Torvalds @ 2019-01-24 0:24 ` Dominique Martinet 2019-01-24 12:45 ` Dominique Martinet 1 sibling, 1 reply; 161+ messages in thread From: Dominique Martinet @ 2019-01-24 0:24 UTC (permalink / raw) To: Linus Torvalds, Jiri Kosina Cc: Andy Lutomirski, Josh Snyder, Dave Chinner, Matthew Wilcox, Jann Horn, Andrew Morton, Greg KH, Peter Zijlstra, Michal Hocko, Linux-MM, kernel list, Linux API Linus Torvalds wrote on Thu, Jan 24, 2019: > I've reverted the 'let's try to just remove the code' part in my tree. > But I didn't apply the two other patches yet. Any final comments > before that should happen? I mentionned when sending the updated version that just checking file permission might not be enough, e.g. a git tree is full of read-only objects that someone might want to preload and think we might really want to check both despite the overhead in the denied case. Josh agreed and I meant to send a new version since nothing was happening but work priorities got the better of me, and I was kind of waiting for the ltp testcases[1] as well because aside from the few tests I ran by hand I'm not sure the few hours of ltp/xfstests Jiri ran did much but this is probably going to be a chicken-or-egg problem.. [1] https://github.com/linux-test-project/ltp/issues/461 Jiri Kosina wrote on Thu, Jan 24, 2019: > On Thu, 24 Jan 2019, Linus Torvalds wrote: > > > Side note: the inode_permission() addition to can_do_mincore() in that > > patch 0002, seems to be questionable. We do > > > > +static inline bool can_do_mincore(struct vm_area_struct *vma) > > +{ > > + return vma_is_anonymous(vma) > > + || (vma->vm_file && (vma->vm_file->f_mode & FMODE_WRITE)) > > + || inode_permission(file_inode(vma->vm_file), MAY_WRITE) == 0; > > +} > > > > note how it tests whether vma->vm_file is NULL for the FMODE_WRITE > > test, but not for the inode_permission() test. > > > > So either we test unnecessarily in the second line, or we don't > > properly test it in the third one. > > > > I think the "test vm_file" thing may be unnecessary, because a > > non-anonymous mapping should always have a file pointer and an inode. > > But I could imagine some odd case (vdso mapping, anyone?) that > > doesn't have a vm_file, but also isn't anonymous. > > Hmm, good point. > > So dropping the 'vma->vm_file' test and checking whether given vma is > special mapping should hopefully provide the desired semantics, shouldn't > it? I think it's probably better to keep this simple, if we're going to check something before accessing vm_file we might as well directly check it. I was thinking of something along the lines of: return vma_is_anonymous(vma) || (vma->vm_file && (inode_owner_or_capable(file_inode(vma->vm_file)) || inode_permission(file_inode(vma->vm_file), MAY_WRITE) == 0)); I dropped the first f_mode check because none of the known mincore users open the files read-write, and the check is redundant with inode_permission() so while it would probably be an optimisation in some cases I do not think it is useful in practice. On the other hand, I have no idea how expensive the inode_permission and owner checks really are - do they try to refresh attributes on a networked filesystem or would it trust the cache or is it fs dependant? Honestly this is more a case of "the people who's be interested in seeing this have no idea what they're doing" than lack of interest.. I wouldn't mind if there were tests doing mincore on a bunch of special files/mappings but I just tried on a few regular files by hand, this isn't proper coverage; I'll try to take more time to test various mappings today (JST). Thanks, -- Dominique ^ permalink raw reply [flat|nested] 161+ messages in thread
* Re: [PATCH] mm/mincore: allow for making sys_mincore() privileged 2019-01-24 0:24 ` Dominique Martinet @ 2019-01-24 12:45 ` Dominique Martinet 2019-01-24 14:25 ` Jiri Kosina 0 siblings, 1 reply; 161+ messages in thread From: Dominique Martinet @ 2019-01-24 12:45 UTC (permalink / raw) To: Linus Torvalds, Jiri Kosina Cc: Andy Lutomirski, Josh Snyder, Dave Chinner, Matthew Wilcox, Jann Horn, Andrew Morton, Greg KH, Peter Zijlstra, Michal Hocko, Linux-MM, kernel list, Linux API Dominique Martinet wrote on Thu, Jan 24, 2019: > I was thinking of something along the lines of: > return vma_is_anonymous(vma) || (vma->vm_file && > (inode_owner_or_capable(file_inode(vma->vm_file)) > || inode_permission(file_inode(vma->vm_file), MAY_WRITE) == 0)); > > I dropped the first f_mode check because none of the known mincore users > open the files read-write, and the check is redundant with > inode_permission() so while it would probably be an optimisation in some > cases I do not think it is useful in practice. > On the other hand, I have no idea how expensive the inode_permission and > owner checks really are - do they try to refresh attributes on a > networked filesystem or would it trust the cache or is it fs dependant? > > Honestly this is more a case of "the people who's be interested in > seeing this have no idea what they're doing" than lack of interest.. I > wouldn't mind if there were tests doing mincore on a bunch of special > files/mappings but I just tried on a few regular files by hand, this > isn't proper coverage; I'll try to take more time to test various > mappings today (JST). I've done some tests with this, it appears OK. Obviously the tests I previously had done still work: - user's own files are ok, even if read-only now. - non-user writable files are ok. - non-user non-writable files (e.g. system libs) aren't. - root can still do anything. On new tests: - there are vmas with no file that aren't anonymous and come all the way there (vvar and vdso), so factoring vma->vm_file check is definitely needed. - vsyscall doesn't reach can_do_mincore() - [heap] [stack] and other fileless regular maps are anonymous - I tried a char device (/dev/zero) and it was marked anonymous despite mapping with MAP_SHARED, which is somewhat expected I guess? - I couldn't map /proc or /sys files (no such device), so no mincore there. I'd post my test program but I actually added pr_info messages in can_do_mincore to check what it returned because madvise dontneed isn't guaranteed to evict pages so we can't rely on madvise dontneed + mincore to return 0; not sure what to do for ltp... If anyone has a good idea of how to check if mincore actually got granted permissions without drop_caches I'll post to the ltp github. Anything else to try? Jiri, you've offered resubmitting the last two patches properly, can you incorporate this change or should I just send this directly? (I'd take most of your commit message and add your name somewhere) Thanks, -- Dominique ^ permalink raw reply [flat|nested] 161+ messages in thread
* Re: [PATCH] mm/mincore: allow for making sys_mincore() privileged 2019-01-24 12:45 ` Dominique Martinet @ 2019-01-24 14:25 ` Jiri Kosina 2019-01-27 22:35 ` Jiri Kosina 0 siblings, 1 reply; 161+ messages in thread From: Jiri Kosina @ 2019-01-24 14:25 UTC (permalink / raw) To: Dominique Martinet Cc: Linus Torvalds, Andy Lutomirski, Josh Snyder, Dave Chinner, Matthew Wilcox, Jann Horn, Andrew Morton, Greg KH, Peter Zijlstra, Michal Hocko, Linux-MM, kernel list, Linux API [-- Attachment #1: Type: text/plain, Size: 591 bytes --] On Thu, 24 Jan 2019, Dominique Martinet wrote: > Jiri, you've offered resubmitting the last two patches properly, can you > incorporate this change or should I just send this directly? (I'd take > most of your commit message and add your name somewhere) I've been running some basic smoke testing with the kernel from https://git.kernel.org/pub/scm/linux/kernel/git/jikos/jikos.git/log/?h=pagecache-sidechannel-v2 (attaching the respective two patches to apply on top of latest Linus' tree to this mail as well), and everything looks good so far. Thanks, -- Jiri Kosina SUSE Labs [-- Warning: decoded text below may be mangled, UTF-8 assumed --] [-- Attachment #2: Type: text/x-patch; name=0001-mm-mincore-make-mincore-more-conservative.patch, Size: 2335 bytes --] From 9810565f1d5f966a84900cdcb85e33aa7571afbe Mon Sep 17 00:00:00 2001 From: Jiri Kosina <jkosina@suse.cz> Date: Wed, 16 Jan 2019 20:53:17 +0100 Subject: [PATCH 1/2] mm/mincore: make mincore() more conservative The semantics of what mincore() considers to be resident is not completely clear, but Linux has always (since 2.3.52, which is when mincore() was initially done) treated it as "page is available in page cache". That's potentially a problem, as that [in]directly exposes meta-information about pagecache / memory mapping state even about memory not strictly belonging to the process executing the syscall, opening possibilities for sidechannel attacks. Change the semantics of mincore() so that it only reveals pagecache information for non-anonymous mappings that belog to files that the calling process could (if it tried to) successfully open for writing. Originally-by: Linus Torvalds <torvalds@linux-foundation.org> Originally-by: Dominique Martinet <asmadeus@codewreck.org> Signed-off-by: Jiri Kosina <jkosina@suse.cz> --- mm/mincore.c | 15 ++++++++++++++- 1 file changed, 14 insertions(+), 1 deletion(-) diff --git a/mm/mincore.c b/mm/mincore.c index 218099b5ed31..747a4907a3ac 100644 --- a/mm/mincore.c +++ b/mm/mincore.c @@ -169,6 +169,14 @@ static int mincore_pte_range(pmd_t *pmd, unsigned long addr, unsigned long end, return 0; } +static inline bool can_do_mincore(struct vm_area_struct *vma) +{ + return vma_is_anonymous(vma) || + (vma->vm_file && + (inode_owner_or_capable(file_inode(vma->vm_file)) + || inode_permission(file_inode(vma->vm_file), MAY_WRITE) == 0)); +} + /* * Do a chunk of "sys_mincore()". We've already checked * all the arguments, we hold the mmap semaphore: we should @@ -189,8 +197,13 @@ static long do_mincore(unsigned long addr, unsigned long pages, unsigned char *v vma = find_vma(current->mm, addr); if (!vma || addr < vma->vm_start) return -ENOMEM; - mincore_walk.mm = vma->vm_mm; end = min(vma->vm_end, addr + (pages << PAGE_SHIFT)); + if (!can_do_mincore(vma)) { + unsigned long pages = (end - addr) >> PAGE_SHIFT; + memset(vec, 1, pages); + return pages; + } + mincore_walk.mm = vma->vm_mm; err = walk_page_range(addr, end, &mincore_walk); if (err < 0) return err; -- 2.12.3 [-- Warning: decoded text below may be mangled, UTF-8 assumed --] [-- Attachment #3: Type: text/x-patch; name=0002-mm-filemap-initiate-readahead-even-if-IOCB_NOWAIT-is.patch, Size: 1368 bytes --] From f287185fc5e0ffbbb380f2d68dd19290715829a8 Mon Sep 17 00:00:00 2001 From: Jiri Kosina <jkosina@suse.cz> Date: Wed, 16 Jan 2019 21:06:58 +0100 Subject: [PATCH 2/2] mm/filemap: initiate readahead even if IOCB_NOWAIT is set for the I/O preadv2(RWF_NOWAIT) can be used to open a side-channel to pagecache contents, as it reveals metadata about residency of pages in pagecache. If preadv2(RWF_NOWAIT) returns immediately, it provides a clear "page not resident" information, and vice versa. Close that sidechannel by always initiating readahead on the cache if we encounter a cache miss for preadv2(RWF_NOWAIT); with that in place, probing the pagecache residency itself will actually populate the cache, making the sidechannel useless. Originally-by: Linus Torvalds <torvalds@linux-foundation.org> Signed-off-by: Jiri Kosina <jkosina@suse.cz> --- mm/filemap.c | 2 -- 1 file changed, 2 deletions(-) diff --git a/mm/filemap.c b/mm/filemap.c index 9f5e323e883e..7bcdd36e629d 100644 --- a/mm/filemap.c +++ b/mm/filemap.c @@ -2075,8 +2075,6 @@ static ssize_t generic_file_buffered_read(struct kiocb *iocb, page = find_get_page(mapping, index); if (!page) { - if (iocb->ki_flags & IOCB_NOWAIT) - goto would_block; page_cache_sync_readahead(mapping, ra, filp, index, last_index - index); -- 2.12.3 ^ permalink raw reply related [flat|nested] 161+ messages in thread
* Re: [PATCH] mm/mincore: allow for making sys_mincore() privileged 2019-01-24 14:25 ` Jiri Kosina @ 2019-01-27 22:35 ` Jiri Kosina 2019-01-28 0:05 ` Dominique Martinet 0 siblings, 1 reply; 161+ messages in thread From: Jiri Kosina @ 2019-01-27 22:35 UTC (permalink / raw) To: Dominique Martinet Cc: Linus Torvalds, Andy Lutomirski, Josh Snyder, Dave Chinner, Matthew Wilcox, Jann Horn, Andrew Morton, Greg KH, Peter Zijlstra, Michal Hocko, Linux-MM, kernel list, Linux API On Thu, 24 Jan 2019, Jiri Kosina wrote: > > Jiri, you've offered resubmitting the last two patches properly, can you > > incorporate this change or should I just send this directly? (I'd take > > most of your commit message and add your name somewhere) > > I've been running some basic smoke testing with the kernel from > > https://git.kernel.org/pub/scm/linux/kernel/git/jikos/jikos.git/log/?h=pagecache-sidechannel-v2 > > (attaching the respective two patches to apply on top of latest Linus' > tree to this mail as well), and everything looks good so far. So, any objections to aproaching it this way? I've not been able to spot any obvious breakage so far with it. Thanks, -- Jiri Kosina SUSE Labs ^ permalink raw reply [flat|nested] 161+ messages in thread
* Re: [PATCH] mm/mincore: allow for making sys_mincore() privileged 2019-01-27 22:35 ` Jiri Kosina @ 2019-01-28 0:05 ` Dominique Martinet 2019-01-29 23:52 ` Jiri Kosina 0 siblings, 1 reply; 161+ messages in thread From: Dominique Martinet @ 2019-01-28 0:05 UTC (permalink / raw) To: Jiri Kosina Cc: Linus Torvalds, Andy Lutomirski, Josh Snyder, Dave Chinner, Matthew Wilcox, Jann Horn, Andrew Morton, Greg KH, Peter Zijlstra, Michal Hocko, Linux-MM, kernel list, Linux API Jiri Kosina wrote on Sun, Jan 27, 2019: > So, any objections to aproaching it this way? I'm not sure why I'm the main recipient of that mail but answering because I am -- let's get these patches in through the regular -mm tree though -- Dominique ^ permalink raw reply [flat|nested] 161+ messages in thread
* Re: [PATCH] mm/mincore: allow for making sys_mincore() privileged 2019-01-28 0:05 ` Dominique Martinet @ 2019-01-29 23:52 ` Jiri Kosina 2019-01-30 9:09 ` Michal Hocko 0 siblings, 1 reply; 161+ messages in thread From: Jiri Kosina @ 2019-01-29 23:52 UTC (permalink / raw) To: Dominique Martinet Cc: Linus Torvalds, Andy Lutomirski, Josh Snyder, Dave Chinner, Matthew Wilcox, Jann Horn, Andrew Morton, Greg KH, Peter Zijlstra, Michal Hocko, Linux-MM, kernel list, Linux API On Mon, 28 Jan 2019, Dominique Martinet wrote: > > So, any objections to aproaching it this way? > > I'm not sure why I'm the main recipient of that mail but answering > because I am -- let's get these patches in through the regular -mm tree > though *prod to mm maintainers* (at least for an opinion) -- Jiri Kosina SUSE Labs ^ permalink raw reply [flat|nested] 161+ messages in thread
* Re: [PATCH] mm/mincore: allow for making sys_mincore() privileged 2019-01-29 23:52 ` Jiri Kosina @ 2019-01-30 9:09 ` Michal Hocko 2019-01-30 12:29 ` Jiri Kosina 0 siblings, 1 reply; 161+ messages in thread From: Michal Hocko @ 2019-01-30 9:09 UTC (permalink / raw) To: Jiri Kosina Cc: Dominique Martinet, Linus Torvalds, Andy Lutomirski, Josh Snyder, Dave Chinner, Matthew Wilcox, Jann Horn, Andrew Morton, Greg KH, Peter Zijlstra, Linux-MM, kernel list, Linux API On Wed 30-01-19 00:52:02, Jiri Kosina wrote: > On Mon, 28 Jan 2019, Dominique Martinet wrote: > > > > So, any objections to aproaching it this way? > > > > I'm not sure why I'm the main recipient of that mail but answering > > because I am -- let's get these patches in through the regular -mm tree > > though > > *prod to mm maintainers* (at least for an opinion) Could you repost those patches please? The thread is long and it is not really clear what is the most up-to-date state of patches (at least to me). -- Michal Hocko SUSE Labs ^ permalink raw reply [flat|nested] 161+ messages in thread
* Re: [PATCH] mm/mincore: allow for making sys_mincore() privileged 2019-01-30 9:09 ` Michal Hocko @ 2019-01-30 12:29 ` Jiri Kosina 0 siblings, 0 replies; 161+ messages in thread From: Jiri Kosina @ 2019-01-30 12:29 UTC (permalink / raw) To: Michal Hocko Cc: Dominique Martinet, Linus Torvalds, Andy Lutomirski, Josh Snyder, Dave Chinner, Matthew Wilcox, Jann Horn, Andrew Morton, Greg KH, Peter Zijlstra, Linux-MM, kernel list, Linux API On Wed, 30 Jan 2019, Michal Hocko wrote: > > > I'm not sure why I'm the main recipient of that mail but answering > > > because I am -- let's get these patches in through the regular -mm > > > tree though > > > > *prod to mm maintainers* (at least for an opinion) > > Could you repost those patches please? The thread is long and it is not > really clear what is the most up-to-date state of patches (at least to > me). Vlastimil seems to have one extra patch to go on top, so we agreed that he'll be sending that as a complete self-contained series (either as a followup to the very first e-mail in this monsterthread, or completely separately) shortly. Thanks, -- Jiri Kosina SUSE Labs ^ permalink raw reply [flat|nested] 161+ messages in thread
* Re: [PATCH] mm/mincore: allow for making sys_mincore() privileged 2019-01-16 5:00 ` Linus Torvalds 2019-01-16 5:25 ` Andy Lutomirski @ 2019-01-16 12:36 ` Matthew Wilcox 1 sibling, 0 replies; 161+ messages in thread From: Matthew Wilcox @ 2019-01-16 12:36 UTC (permalink / raw) To: Linus Torvalds Cc: Josh Snyder, Dominique Martinet, Dave Chinner, Jiri Kosina, Jann Horn, Andrew Morton, Greg KH, Peter Zijlstra, Michal Hocko, Linux-MM, kernel list, Linux API On Wed, Jan 16, 2019 at 05:00:25PM +1200, Linus Torvalds wrote: > And if you're not the owner of the file, do you have another > suggestion for that "Yes, I have the right to see what's in-core for > this file". Because the problem is literally that if it's some random > read-only system file, the kernel shouldn't leak access patterns to > it.. This probably isn't a good heuristic, but thought I'd mention it anyway ... if the file is executable and you're not the owner, mincore always/never says its pages are resident. That'd fix all library leaks, but then there's probably a smart way of figuring out something from access patterns to a data file of some kind (/etc/passwd?) ^ permalink raw reply [flat|nested] 161+ messages in thread
* Re: [PATCH] mm/mincore: allow for making sys_mincore() privileged 2019-01-10 0:44 ` Dave Chinner 2019-01-10 1:18 ` Linus Torvalds @ 2019-01-10 14:50 ` Matthew Wilcox 2019-01-11 7:36 ` Jiri Kosina 2 siblings, 0 replies; 161+ messages in thread From: Matthew Wilcox @ 2019-01-10 14:50 UTC (permalink / raw) To: Dave Chinner Cc: Linus Torvalds, Jiri Kosina, Jann Horn, Andrew Morton, Greg KH, Peter Zijlstra, Michal Hocko, Linux-MM, kernel list, Linux API On Thu, Jan 10, 2019 at 11:44:24AM +1100, Dave Chinner wrote: > And, really, this would be just another band-aid over a symptom of > the information leak - it doesn't prevent users from being able to > control page cache invalidation. It just removes one method, just > like hacking mincore only removes one method of observing the page > cache. And, like mincore(), there's every chance it impacts on > userspace in a negative manner and so we need to be very careful > here. Putting the mincore() / cache timing information leak aside though, the current behaviour of XFS means that an attacker can screw up the performance of random applications just by repeatedly doing O_DIRECT reads of libc.so. Maybe O_DIRECT reads should be forbidden from files on XFS unless you also have write access to them? (eg owner). ^ permalink raw reply [flat|nested] 161+ messages in thread
* Re: [PATCH] mm/mincore: allow for making sys_mincore() privileged 2019-01-10 0:44 ` Dave Chinner 2019-01-10 1:18 ` Linus Torvalds 2019-01-10 14:50 ` Matthew Wilcox @ 2019-01-11 7:36 ` Jiri Kosina 2019-01-17 2:22 ` Dave Chinner 2 siblings, 1 reply; 161+ messages in thread From: Jiri Kosina @ 2019-01-11 7:36 UTC (permalink / raw) To: Dave Chinner Cc: Linus Torvalds, Matthew Wilcox, Jann Horn, Andrew Morton, Greg KH, Peter Zijlstra, Michal Hocko, Linux-MM, kernel list, Linux API On Thu, 10 Jan 2019, Dave Chinner wrote: > Sounds nice from a theoretical POV, but reality has taught us very > different lessons. > > FWIW, a quick check of XFS's history so you understand how long this > behaviour has been around. It was introduced in the linux port in 2001 > as direct IO support was being added: > > commit e837eac23662afae603aaaef7c94bc839c1b8f67 > Author: Steve Lord <lord@sgi.com> > Date: Mon Mar 5 16:47:52 2001 +0000 > > Add bounds checking for direct I/O, do the cache invalidation for > data coherency on direct I/O. Out of curiosity, which repository is this from please? Even google doesn't seem to know about this SHA. Thanks, -- Jiri Kosina SUSE Labs ^ permalink raw reply [flat|nested] 161+ messages in thread
* Re: [PATCH] mm/mincore: allow for making sys_mincore() privileged 2019-01-11 7:36 ` Jiri Kosina @ 2019-01-17 2:22 ` Dave Chinner 2019-01-17 8:18 ` Jiri Kosina 0 siblings, 1 reply; 161+ messages in thread From: Dave Chinner @ 2019-01-17 2:22 UTC (permalink / raw) To: Jiri Kosina Cc: Linus Torvalds, Matthew Wilcox, Jann Horn, Andrew Morton, Greg KH, Peter Zijlstra, Michal Hocko, Linux-MM, kernel list, Linux API On Fri, Jan 11, 2019 at 08:36:55AM +0100, Jiri Kosina wrote: > On Thu, 10 Jan 2019, Dave Chinner wrote: > > > Sounds nice from a theoretical POV, but reality has taught us very > > different lessons. > > > > FWIW, a quick check of XFS's history so you understand how long this > > behaviour has been around. It was introduced in the linux port in 2001 > > as direct IO support was being added: > > > > commit e837eac23662afae603aaaef7c94bc839c1b8f67 > > Author: Steve Lord <lord@sgi.com> > > Date: Mon Mar 5 16:47:52 2001 +0000 > > > > Add bounds checking for direct I/O, do the cache invalidation for > > data coherency on direct I/O. > > Out of curiosity, which repository is this from please? Even google > doesn't seem to know about this SHA. because oss.sgi.com is no longer with us, it's fallen out of all the search engines. It was from the "archive/xfs-import.git" tree on oss.sgi.com: https://web.archive.org/web/20120326044237/http://oss.sgi.com:80/cgi-bin/gitweb.cgi but archive.org doesn't have a copy of the git tree. It contained the XFS history right back to the first Irix commit in 1993. Some of us still have copies of it sitting around.... Cheers, Dave. -- Dave Chinner david@fromorbit.com ^ permalink raw reply [flat|nested] 161+ messages in thread
* Re: [PATCH] mm/mincore: allow for making sys_mincore() privileged 2019-01-17 2:22 ` Dave Chinner @ 2019-01-17 8:18 ` Jiri Kosina 2019-01-17 21:06 ` Dave Chinner 0 siblings, 1 reply; 161+ messages in thread From: Jiri Kosina @ 2019-01-17 8:18 UTC (permalink / raw) To: Dave Chinner Cc: Linus Torvalds, Matthew Wilcox, Jann Horn, Andrew Morton, Greg KH, Peter Zijlstra, Michal Hocko, Linux-MM, kernel list, Linux API On Thu, 17 Jan 2019, Dave Chinner wrote: > > > commit e837eac23662afae603aaaef7c94bc839c1b8f67 > > > Author: Steve Lord <lord@sgi.com> > > > Date: Mon Mar 5 16:47:52 2001 +0000 > > > > > > Add bounds checking for direct I/O, do the cache invalidation for > > > data coherency on direct I/O. > > > > Out of curiosity, which repository is this from please? Even google > > doesn't seem to know about this SHA. > > because oss.sgi.com is no longer with us, it's fallen out of all the > search engines. It was from the "archive/xfs-import.git" tree on > oss.sgi.com: > > https://web.archive.org/web/20120326044237/http://oss.sgi.com:80/cgi-bin/gitweb.cgi > > but archive.org doesn't have a copy of the git tree. It contained > the XFS history right back to the first Irix commit in 1993. Some of > us still have copies of it sitting around.... For cases like this, would it be worth pushing it to git.kernel.org as an frozen historical reference archive? Thanks, -- Jiri Kosina SUSE Labs ^ permalink raw reply [flat|nested] 161+ messages in thread
* Re: [PATCH] mm/mincore: allow for making sys_mincore() privileged 2019-01-17 8:18 ` Jiri Kosina @ 2019-01-17 21:06 ` Dave Chinner 0 siblings, 0 replies; 161+ messages in thread From: Dave Chinner @ 2019-01-17 21:06 UTC (permalink / raw) To: Jiri Kosina Cc: Linus Torvalds, Matthew Wilcox, Jann Horn, Andrew Morton, Greg KH, Peter Zijlstra, Michal Hocko, Linux-MM, kernel list, Linux API On Thu, Jan 17, 2019 at 09:18:41AM +0100, Jiri Kosina wrote: > On Thu, 17 Jan 2019, Dave Chinner wrote: > > > > > commit e837eac23662afae603aaaef7c94bc839c1b8f67 > > > > Author: Steve Lord <lord@sgi.com> > > > > Date: Mon Mar 5 16:47:52 2001 +0000 > > > > > > > > Add bounds checking for direct I/O, do the cache invalidation for > > > > data coherency on direct I/O. > > > > > > Out of curiosity, which repository is this from please? Even google > > > doesn't seem to know about this SHA. > > > > because oss.sgi.com is no longer with us, it's fallen out of all the > > search engines. It was from the "archive/xfs-import.git" tree on > > oss.sgi.com: > > > > https://web.archive.org/web/20120326044237/http://oss.sgi.com:80/cgi-bin/gitweb.cgi > > > > but archive.org doesn't have a copy of the git tree. It contained > > the XFS history right back to the first Irix commit in 1993. Some of > > us still have copies of it sitting around.... > > For cases like this, would it be worth pushing it to git.kernel.org as an > frozen historical reference archive? I'm not sure we should be putting code from Irix on kernel.org. I uploaded a copy to github a few months ago so XFS devs could easily reference relevant commits in email. The reasons we decided not to upload it to kernel.org should be clear from the readme... https://github.com/dchinner/xfs-history Cheers, Dave. -- Dave Chinner david@fromorbit.com ^ permalink raw reply [flat|nested] 161+ messages in thread
* Re: [PATCH] mm/mincore: allow for making sys_mincore() privileged 2019-01-06 0:22 ` Linus Torvalds 2019-01-06 1:50 ` Linus Torvalds @ 2019-01-07 4:32 ` Dominique Martinet 2019-01-07 10:33 ` Vlastimil Babka 1 sibling, 1 reply; 161+ messages in thread From: Dominique Martinet @ 2019-01-07 4:32 UTC (permalink / raw) To: Linus Torvalds Cc: Matthew Wilcox, Jann Horn, Jiri Kosina, Andrew Morton, Greg KH, Peter Zijlstra, Michal Hocko, Linux-MM, kernel list, Linux API Linus Torvalds wrote on Sat, Jan 05, 2019: > But I think my patch to just rip out all that page lookup, and just > base it on the page table state has the fundamental advantage that it > gets rid of code. Maybe I should jst commit it, and see if anything > breaks? We do have options in case things break, and then we'd at > least know who cares (and perhaps a lot more information of _why_ they > care). There actually are many tools like fincore which depend on mincore to try to tell whether a file is "loaded in cache" or not (I personally use vmtouch[1], but I know of at least nocache[2] uses it as well to only try to evict used pages) [1] https://hoytech.com/vmtouch/ [2] https://github.com/Feh/nocache I mostly use these to either fadvise(POSIX_FADV_DONTNEED) or prefetch/lock whole files so my "production" use-cases don't actually rely on the mincore part of them; but when playing with these actions it's actually fairly useful to be able to visualize which part of a file ended in cache or monitor how a file's content evolve in cache... There are various non-obvious behaviours where being able to poke around is enlightening (e.g. fadvise dontneed is actually a hint, so even if nothing uses the file linux sometimes keep the data around if it thinks that would be useful and nocache has a mode to call fadvise multiple times and things like that...) Anyway, I agree the use of mincore for this is rather ugly, and frankly some "cache management API" might be better in the long run if only for performance reason (don't try these tools on a hundred TB sparse file...), but until that pipe dream comes true I think mincore as it was is useful for system admins. Linus Torvalds wrote on Sun, Jan 06, 2019: > I decided to just apply that patch. It is *not* marked for stable, > very intentionally, because I expect that we will need to wait and see > if there are issues with it, and whether we might have to do something > entirely different (more like the traditional behavior with some extra > "only for owner" logic). FWIW I personally don't care much about "only for owner" or depending on mmap options; I don't understand much of the security implications honestly so I'm not sure how these limitations actually help. On the other hand, a simple CAP_SYS_ADMIN check making the call take either behaviour should be safe and would cover what I described above. (by the way, while we are discussing permissions, a regular user can use fadvise dontneed on files it doesn't own as well as long as it can open them for reading; I'm not sure if that would need restricting as well in the context of the security issue. Frankly even with mincore someone could likely tell the difference through timing, if they just do it a few times. Do magic, probe, flush out, repeat until satisfied.) Thanks, -- Dominique Martinet | Asmadeus ^ permalink raw reply [flat|nested] 161+ messages in thread
* Re: [PATCH] mm/mincore: allow for making sys_mincore() privileged 2019-01-07 4:32 ` Dominique Martinet @ 2019-01-07 10:33 ` Vlastimil Babka 2019-01-07 11:08 ` Dominique Martinet 0 siblings, 1 reply; 161+ messages in thread From: Vlastimil Babka @ 2019-01-07 10:33 UTC (permalink / raw) To: Dominique Martinet, Linus Torvalds Cc: Matthew Wilcox, Jann Horn, Jiri Kosina, Andrew Morton, Greg KH, Peter Zijlstra, Michal Hocko, Linux-MM, kernel list, Linux API, daniel On 1/7/19 5:32 AM, Dominique Martinet wrote: > Linus Torvalds wrote on Sat, Jan 05, 2019: >> But I think my patch to just rip out all that page lookup, and just >> base it on the page table state has the fundamental advantage that it >> gets rid of code. Maybe I should jst commit it, and see if anything >> breaks? We do have options in case things break, and then we'd at >> least know who cares (and perhaps a lot more information of _why_ they >> care). > > There actually are many tools like fincore which depend on mincore to > try to tell whether a file is "loaded in cache" or not (I personally use > vmtouch[1], but I know of at least nocache[2] uses it as well to only > try to evict used pages) nocache could probably do fine without mincore. IIUC the point is to not evict anything that was already resident prior to running some command wrapped in nocache. Without the mincore checks, posix_fadvise(POSIX_FADV_DONTNEED) will still not drop anything that others have mapped. That means without mincore() it will drop data that's in cache but not currently in use by anybody, which shouldn't cause large performance regressions? > [1] https://hoytech.com/vmtouch/ > [2] https://github.com/Feh/nocache > > > I mostly use these to either fadvise(POSIX_FADV_DONTNEED) or > prefetch/lock whole files so my "production" use-cases don't actually > rely on the mincore part of them; Ah so you seem to confirm my above point. ... > FWIW I personally don't care much about "only for owner" or depending on > mmap options; I don't understand much of the security implications > honestly so I'm not sure how these limitations actually help. > On the other hand, a simple CAP_SYS_ADMIN check making the call take > either behaviour should be safe and would cover what I described above. So without CAP_SYS_ADMIN, mincore() would return mapping status, and with CAP_SYS_ADMIN, it would return cache residency status? Very clumsy :( Maybe if we introduced mincore2() with flags similar to BSD mentioned earlier in the thread, and the cache residency flag would require CAP_SYS_ADMIN or something similar. > (by the way, while we are discussing permissions, a regular user can use > fadvise dontneed on files it doesn't own as well as long as it can open > them for reading; I'm not sure if that would need restricting as well in > the context of the security issue. Probably not, as I've mentioned it won't evict what's mapped by somebody else. And eviction is also possible via controlling LRU, which is what the paper [1] does anyway (and also mentions that DONTNEED doesn't work). Being able to evict somebody's page is AFAIU not sufficient for attack, the side channel is about knowing that somebody brought that page back to RAM by touching it. > Frankly even with mincore someone > could likely tell the difference through timing, if they just do it a > few times. Do magic, probe, flush out, repeat until satisfied.) That's my bigger concern here. In [1] there's described a remote attack (on webserver) using the page fault timing differences for present/not present page cache pages. Noisy but works, and I expect locally it to be much less noisy. Yet the countermeasures section only mentions restricting mincore() as if it was sufficient (and also how to make evictions harder, but that's secondary IMHO). [1] https://arxiv.org/abs/1901.01161 > > Thanks, > ^ permalink raw reply [flat|nested] 161+ messages in thread
* Re: [PATCH] mm/mincore: allow for making sys_mincore() privileged 2019-01-07 10:33 ` Vlastimil Babka @ 2019-01-07 11:08 ` Dominique Martinet 2019-01-07 11:59 ` Vlastimil Babka 2019-01-07 13:29 ` Daniel Gruss 0 siblings, 2 replies; 161+ messages in thread From: Dominique Martinet @ 2019-01-07 11:08 UTC (permalink / raw) To: Vlastimil Babka Cc: Linus Torvalds, Matthew Wilcox, Jann Horn, Jiri Kosina, Andrew Morton, Greg KH, Peter Zijlstra, Michal Hocko, Linux-MM, kernel list, Linux API, daniel Vlastimil Babka wrote on Mon, Jan 07, 2019: > On 1/7/19 5:32 AM, Dominique Martinet wrote: > > Linus Torvalds wrote on Sat, Jan 05, 2019: > >> But I think my patch to just rip out all that page lookup, and just > >> base it on the page table state has the fundamental advantage that it > >> gets rid of code. Maybe I should jst commit it, and see if anything > >> breaks? We do have options in case things break, and then we'd at > >> least know who cares (and perhaps a lot more information of _why_ they > >> care). > > > > There actually are many tools like fincore which depend on mincore to > > try to tell whether a file is "loaded in cache" or not (I personally use > > vmtouch[1], but I know of at least nocache[2] uses it as well to only > > try to evict used pages) > > nocache could probably do fine without mincore. IIUC the point is to not > evict anything that was already resident prior to running some command > wrapped in nocache. Without the mincore checks, > posix_fadvise(POSIX_FADV_DONTNEED) will still not drop anything that > others have mapped. That means without mincore() it will drop data > that's in cache but not currently in use by anybody, which shouldn't > cause large performance regressions? Yes, I sent them a patch to allow this behaviour (which got merged ages ago); but the default nocache usage will try to preserve pages that were mapped before even if the user mapped it themselves so it's a little bit more robust that just trusting linux to do the right thing. I did various tests with fadvise dontneed and honestly the way it works is far from obvious when operating as a single user at least; I didn't test multi-users as that didn't really concern me. With the current mincore change, it will think everything was "in core" and not flush anything unless my option to just fadvise dontneed everything is passed though ; so even if we can make it work it is a change of behaviour that is breaking an existing application, and it has no way of telling it didn't work. Honestly though, as I said, mincore() is much more useful for debugging for me ; the application can be changed if required. I just pointed it out as it'll need changing, and it has no obvious way of testing at runtime if the syscall works (except dumb kernel version check, but that won't work with stable backports); so it's not that obvious. > > FWIW I personally don't care much about "only for owner" or depending on > > mmap options; I don't understand much of the security implications > > honestly so I'm not sure how these limitations actually help. > > On the other hand, a simple CAP_SYS_ADMIN check making the call take > > either behaviour should be safe and would cover what I described above. > > So without CAP_SYS_ADMIN, mincore() would return mapping status, and > with CAP_SYS_ADMIN, it would return cache residency status? Very clumsy > :( Maybe if we introduced mincore2() with flags similar to BSD mentioned > earlier in the thread, and the cache residency flag would require > CAP_SYS_ADMIN or something similar. I agree, that's rather clumsy... Or rather might lead to some unexpected behaviours. I'm open to other ideas. I'm not sure how the BSD flags help though? > > (by the way, while we are discussing permissions, a regular user can use > > fadvise dontneed on files it doesn't own as well as long as it can open > > them for reading; I'm not sure if that would need restricting as well in > > the context of the security issue. > > Probably not, as I've mentioned it won't evict what's mapped by somebody > else. And eviction is also possible via controlling LRU, which is what > the paper [1] does anyway (and also mentions that DONTNEED doesn't > work). Being able to evict somebody's page is AFAIU not sufficient for > attack, the side channel is about knowing that somebody brought that > page back to RAM by touching it. Thanks for the link to the paper, I hadn't taken the time to extract it from the news article but it's much more interesting indeed. Some of what's been said here makes more sense to me now (checking about ownership and similar might indeed be good enough). For evicting page I agree most targets would be different users so I guess that makes sense as well; there seem to be other ways of efficiently evicting a page from cache anyway. (BTW, this was brought up earlier, the paper also mentions /proc/self/pagemap as not being useful.) > > Frankly even with mincore someone > > could likely tell the difference through timing, if they just do it a > > few times. Do magic, probe, flush out, repeat until satisfied.) (I meant "without mincore", obviously, as you understood properly) > That's my bigger concern here. In [1] there's described a remote attack > (on webserver) using the page fault timing differences for present/not > present page cache pages. Noisy but works, and I expect locally it to be > much less noisy. Yet the countermeasures section only mentions > restricting mincore() as if it was sufficient (and also how to make > evictions harder, but that's secondary IMHO). I'd suggest making clock rougher for non-root users but javascript tried that and it wasn't enough... :) Honestly won't be of much help there, good luck? Thanks, -- Dominique ^ permalink raw reply [flat|nested] 161+ messages in thread
* Re: [PATCH] mm/mincore: allow for making sys_mincore() privileged 2019-01-07 11:08 ` Dominique Martinet @ 2019-01-07 11:59 ` Vlastimil Babka 2019-01-07 13:29 ` Daniel Gruss 1 sibling, 0 replies; 161+ messages in thread From: Vlastimil Babka @ 2019-01-07 11:59 UTC (permalink / raw) To: Dominique Martinet Cc: Linus Torvalds, Matthew Wilcox, Jann Horn, Jiri Kosina, Andrew Morton, Greg KH, Peter Zijlstra, Michal Hocko, Linux-MM, kernel list, Linux API, daniel On 1/7/19 12:08 PM, Dominique Martinet wrote: > > With the current mincore change, it will think everything was "in core" > and not flush anything unless my option to just fadvise dontneed > everything is passed though ; so even if we can make it work it is a > change of behaviour that is breaking an existing application, and it has > no way of telling it didn't work. IIUC the current change is commit 574823bfab82 ("Change mincore() to count "mapped" pages rather than "cached" pages") which will not pretend everything is "in core", but only pages that the calling process has populated page table mapping for (which implies in core, but the opposite doesn't hold). "nocache" most certainly doesn't populate the mappings before calling mincore(), as that would bring pages to page cache and defeat the purpose of determining if they were already there prior the nocache execution. Instead it will think that nothing was "in core", and thus later call fadvise dontneed or everything, but as I've said earlier that shouldn't matter much. > Honestly though, as I said, mincore() is much more useful for debugging > for me ; the application can be changed if required. I just pointed it > out as it'll need changing, and it has no obvious way of testing at > runtime if the syscall works (except dumb kernel version check, but that > won't work with stable backports); so it's not that obvious. Agree. >>> FWIW I personally don't care much about "only for owner" or depending on >>> mmap options; I don't understand much of the security implications >>> honestly so I'm not sure how these limitations actually help. >>> On the other hand, a simple CAP_SYS_ADMIN check making the call take >>> either behaviour should be safe and would cover what I described above. >> >> So without CAP_SYS_ADMIN, mincore() would return mapping status, and >> with CAP_SYS_ADMIN, it would return cache residency status? Very clumsy >> :( Maybe if we introduced mincore2() with flags similar to BSD mentioned >> earlier in the thread, and the cache residency flag would require >> CAP_SYS_ADMIN or something similar. > > I agree, that's rather clumsy... Or rather might lead to some unexpected > behaviours. I'm open to other ideas. > I'm not sure how the BSD flags help though? Definitely it would be a long-term solution, introducing new API, waiting for userspace to use it... and meanwhile we would have to keep the status quo or some kind of the clumsy/subtle approach. >>> (by the way, while we are discussing permissions, a regular user can use >>> fadvise dontneed on files it doesn't own as well as long as it can open >>> them for reading; I'm not sure if that would need restricting as well in >>> the context of the security issue. >> >> Probably not, as I've mentioned it won't evict what's mapped by somebody >> else. And eviction is also possible via controlling LRU, which is what >> the paper [1] does anyway (and also mentions that DONTNEED doesn't >> work). Being able to evict somebody's page is AFAIU not sufficient for >> attack, the side channel is about knowing that somebody brought that >> page back to RAM by touching it. > > Thanks for the link to the paper, I hadn't taken the time to extract it > from the news article but it's much more interesting indeed. It went public only this morning, the article was older :) ^ permalink raw reply [flat|nested] 161+ messages in thread
* Re: [PATCH] mm/mincore: allow for making sys_mincore() privileged 2019-01-07 11:08 ` Dominique Martinet 2019-01-07 11:59 ` Vlastimil Babka @ 2019-01-07 13:29 ` Daniel Gruss 1 sibling, 0 replies; 161+ messages in thread From: Daniel Gruss @ 2019-01-07 13:29 UTC (permalink / raw) To: Dominique Martinet, Vlastimil Babka Cc: Linus Torvalds, Matthew Wilcox, Jann Horn, Jiri Kosina, Andrew Morton, Greg KH, Peter Zijlstra, Michal Hocko, Linux-MM, kernel list, Linux API On 1/7/19 12:08 PM, Dominique Martinet wrote: >> That's my bigger concern here. In [1] there's described a remote attack >> (on webserver) using the page fault timing differences for present/not >> present page cache pages. Noisy but works, and I expect locally it to be >> much less noisy. Yet the countermeasures section only mentions >> restricting mincore() as if it was sufficient (and also how to make >> evictions harder, but that's secondary IMHO). > > I'd suggest making clock rougher for non-root users but javascript tried > that and it wasn't enough... :) > Honestly won't be of much help there, good luck? Restricting mincore() is sufficient to fix the hardware-agnostic part. If the attack is not hardware-agnostic anymore, an attacker could also just use a hardware cache attack, which has a higher temporal and spatial resolution, so there's no reason why the attacker would use page cache attacks instead then. Cheers, Daniel ^ permalink raw reply [flat|nested] 161+ messages in thread
* Re: [PATCH] mm/mincore: allow for making sys_mincore() privileged 2019-01-05 23:39 ` Linus Torvalds 2019-01-06 0:11 ` Matthew Wilcox @ 2019-01-07 10:10 ` Michael Ellerman 1 sibling, 0 replies; 161+ messages in thread From: Michael Ellerman @ 2019-01-07 10:10 UTC (permalink / raw) To: Linus Torvalds, Jann Horn Cc: Jiri Kosina, Andrew Morton, Greg KH, Peter Zijlstra, Michal Hocko, Linux-MM, kernel list, Linux API Linus Torvalds <torvalds@linux-foundation.org> writes: > On Sat, Jan 5, 2019 at 3:16 PM Linus Torvalds > <torvalds@linux-foundation.org> wrote: >> >> It goes back to forever, it looks like. I can't find a reason. > > mincore() was originally added in 2.3.52pre3, it looks like. Around > 2000 or so. But sadly before the BK history. Yeah, it's here in the commit titled "Import 2.3.52pre3" (takes a second or two to load): https://github.com/mpe/linux-fullhistory/commit/a1bcda3256956318c95c8da8bee09f79190bb034#diff-fd2d793b8b4760b4887c8c7bbb3451d7R1730 But no further detail. (Instructions for using that tree https://github.com/mpe/linux-fullhistory/wiki) cheers ^ permalink raw reply [flat|nested] 161+ messages in thread
* Re: [PATCH] mm/mincore: allow for making sys_mincore() privileged 2019-01-05 22:54 ` Jann Horn 2019-01-05 23:05 ` Linus Torvalds @ 2019-01-05 23:09 ` Jiri Kosina 1 sibling, 0 replies; 161+ messages in thread From: Jiri Kosina @ 2019-01-05 23:09 UTC (permalink / raw) To: Jann Horn Cc: Linus Torvalds, Andrew Morton, Greg KH, Peter Zijlstra, Michal Hocko, Linux-MM, kernel list, Linux API On Sat, 5 Jan 2019, Jann Horn wrote: > > Provide vm.mincore_privileged sysctl, which makes it possible to mincore() > > start returning -EPERM in case it's invoked by a process lacking > > CAP_SYS_ADMIN. > > > > The default behavior stays "mincore() can be used by anybody" in order to > > be conservative with respect to userspace behavior. > > > > [1] https://www.theregister.co.uk/2019/01/05/boffins_beat_page_cache/ > > Just checking: I guess /proc/$pid/pagemap (iow, the pagemap_read() > handler) is less problematic because it only returns data about the > state of page tables, and doesn't query the address_space? In other > words, it permits monitoring evictions, but non-intrusively detecting > that something has been loaded into memory by another process is > harder? So I was just about to immediately reply that we don't expose pagemap anymore due to rowhammer, but apparently that's not true any more (this behavioud was originally introduced by ab676b7d6fbf, but then changed via 1c90308e7a77 (and further adjusted for swap entries in ab6ecf247a, but I guess that's not all that interesting). Hmm. But unless it has been mapped with MAP_POPULATE (whcih is outside the attacker's control), there is no guarantee that the mappings would actually be there, right? -- Jiri Kosina SUSE Labs ^ permalink raw reply [flat|nested] 161+ messages in thread
* [PATCH 0/3] mincore() and IOCB_NOWAIT adjustments 2019-01-05 17:27 [PATCH] mm/mincore: allow for making sys_mincore() privileged Jiri Kosina ` (4 preceding siblings ...) 2019-01-05 22:54 ` Jann Horn @ 2019-01-30 12:44 ` Vlastimil Babka 2019-01-30 12:44 ` [PATCH 1/3] mm/mincore: make mincore() more conservative Vlastimil Babka ` (4 more replies) 5 siblings, 5 replies; 161+ messages in thread From: Vlastimil Babka @ 2019-01-30 12:44 UTC (permalink / raw) To: Andrew Morton, Linus Torvalds Cc: linux-kernel, linux-mm, linux-api, Peter Zijlstra, Greg KH, Jann Horn, Vlastimil Babka, Andy Lutomirski, Cyril Hrubis, Daniel Gruss, Dave Chinner, Dominique Martinet, Jiri Kosina, Jiri Kosina, Kevin Easton, Kirill A. Shutemov, Matthew Wilcox, Tejun Heo I've collected the patches from the discussion for formal posting. The first two should be settled already, third one is the possible improvement I've mentioned earlier, where only in restricted case we resort to existence of page table mapping (the original and later reverted approach from Linus) instead of faking the result completely. Review and testing welcome. The consensus seems to be going through -mm tree for 5.1, unless Linus wants them alredy for 5.0. Jiri Kosina (2): mm/mincore: make mincore() more conservative mm/filemap: initiate readahead even if IOCB_NOWAIT is set for the I/O Vlastimil Babka (1): mm/mincore: provide mapped status when cached status is not allowed mm/filemap.c | 2 -- mm/mincore.c | 54 ++++++++++++++++++++++++++++++++++++++++------------ 2 files changed, 42 insertions(+), 14 deletions(-) -- 2.20.1 ^ permalink raw reply [flat|nested] 161+ messages in thread
* [PATCH 1/3] mm/mincore: make mincore() more conservative 2019-01-30 12:44 ` [PATCH 0/3] mincore() and IOCB_NOWAIT adjustments Vlastimil Babka @ 2019-01-30 12:44 ` Vlastimil Babka 2019-01-31 9:43 ` Michal Hocko ` (2 more replies) 2019-01-30 12:44 ` [PATCH 2/3] mm/filemap: initiate readahead even if IOCB_NOWAIT is set for the I/O Vlastimil Babka ` (3 subsequent siblings) 4 siblings, 3 replies; 161+ messages in thread From: Vlastimil Babka @ 2019-01-30 12:44 UTC (permalink / raw) To: Andrew Morton, Linus Torvalds Cc: linux-kernel, linux-mm, linux-api, Peter Zijlstra, Greg KH, Jann Horn, Jiri Kosina, Dominique Martinet, Andy Lutomirski, Dave Chinner, Kevin Easton, Matthew Wilcox, Cyril Hrubis, Tejun Heo, Kirill A . Shutemov, Daniel Gruss, Vlastimil Babka, Jiri Kosina From: Jiri Kosina <jkosina@suse.cz> The semantics of what mincore() considers to be resident is not completely clear, but Linux has always (since 2.3.52, which is when mincore() was initially done) treated it as "page is available in page cache". That's potentially a problem, as that [in]directly exposes meta-information about pagecache / memory mapping state even about memory not strictly belonging to the process executing the syscall, opening possibilities for sidechannel attacks. Change the semantics of mincore() so that it only reveals pagecache information for non-anonymous mappings that belog to files that the calling process could (if it tried to) successfully open for writing. Originally-by: Linus Torvalds <torvalds@linux-foundation.org> Originally-by: Dominique Martinet <asmadeus@codewreck.org> Cc: Dominique Martinet <asmadeus@codewreck.org> Cc: Andy Lutomirski <luto@amacapital.net> Cc: Dave Chinner <david@fromorbit.com> Cc: Kevin Easton <kevin@guarana.org> Cc: Matthew Wilcox <willy@infradead.org> Cc: Cyril Hrubis <chrubis@suse.cz> Cc: Tejun Heo <tj@kernel.org> Cc: Kirill A. Shutemov <kirill@shutemov.name> Cc: Daniel Gruss <daniel@gruss.cc> Signed-off-by: Jiri Kosina <jkosina@suse.cz> Signed-off-by: Vlastimil Babka <vbabka@suse.cz> --- mm/mincore.c | 15 ++++++++++++++- 1 file changed, 14 insertions(+), 1 deletion(-) diff --git a/mm/mincore.c b/mm/mincore.c index 218099b5ed31..747a4907a3ac 100644 --- a/mm/mincore.c +++ b/mm/mincore.c @@ -169,6 +169,14 @@ static int mincore_pte_range(pmd_t *pmd, unsigned long addr, unsigned long end, return 0; } +static inline bool can_do_mincore(struct vm_area_struct *vma) +{ + return vma_is_anonymous(vma) || + (vma->vm_file && + (inode_owner_or_capable(file_inode(vma->vm_file)) + || inode_permission(file_inode(vma->vm_file), MAY_WRITE) == 0)); +} + /* * Do a chunk of "sys_mincore()". We've already checked * all the arguments, we hold the mmap semaphore: we should @@ -189,8 +197,13 @@ static long do_mincore(unsigned long addr, unsigned long pages, unsigned char *v vma = find_vma(current->mm, addr); if (!vma || addr < vma->vm_start) return -ENOMEM; - mincore_walk.mm = vma->vm_mm; end = min(vma->vm_end, addr + (pages << PAGE_SHIFT)); + if (!can_do_mincore(vma)) { + unsigned long pages = (end - addr) >> PAGE_SHIFT; + memset(vec, 1, pages); + return pages; + } + mincore_walk.mm = vma->vm_mm; err = walk_page_range(addr, end, &mincore_walk); if (err < 0) return err; -- 2.20.1 ^ permalink raw reply related [flat|nested] 161+ messages in thread
* Re: [PATCH 1/3] mm/mincore: make mincore() more conservative 2019-01-30 12:44 ` [PATCH 1/3] mm/mincore: make mincore() more conservative Vlastimil Babka @ 2019-01-31 9:43 ` Michal Hocko 2019-01-31 9:51 ` Dominique Martinet 2019-01-31 17:46 ` Josh Snyder 2019-02-01 8:56 ` Vlastimil Babka 2019-03-06 23:13 ` Andrew Morton 2 siblings, 2 replies; 161+ messages in thread From: Michal Hocko @ 2019-01-31 9:43 UTC (permalink / raw) To: Vlastimil Babka Cc: Andrew Morton, Linus Torvalds, linux-kernel, linux-mm, linux-api, Peter Zijlstra, Greg KH, Jann Horn, Jiri Kosina, Dominique Martinet, Andy Lutomirski, Dave Chinner, Kevin Easton, Matthew Wilcox, Cyril Hrubis, Tejun Heo, Kirill A . Shutemov, Daniel Gruss, Jiri Kosina On Wed 30-01-19 13:44:18, Vlastimil Babka wrote: > From: Jiri Kosina <jkosina@suse.cz> > > The semantics of what mincore() considers to be resident is not completely > clear, but Linux has always (since 2.3.52, which is when mincore() was > initially done) treated it as "page is available in page cache". > > That's potentially a problem, as that [in]directly exposes meta-information > about pagecache / memory mapping state even about memory not strictly belonging > to the process executing the syscall, opening possibilities for sidechannel > attacks. > > Change the semantics of mincore() so that it only reveals pagecache information > for non-anonymous mappings that belog to files that the calling process could > (if it tried to) successfully open for writing. I agree that this is a better way than the original 574823bfab82 ("Change mincore() to count "mapped" pages rather than "cached" pages"). One thing is still not clear to me though. Is the new owner/writeable check OK for the Netflix-like usecases? I mean does happycache have appropriate access to the cache data? I have tried to re-read the original thread but couldn't find any confirmation. I nit below > Originally-by: Linus Torvalds <torvalds@linux-foundation.org> > Originally-by: Dominique Martinet <asmadeus@codewreck.org> > Cc: Dominique Martinet <asmadeus@codewreck.org> > Cc: Andy Lutomirski <luto@amacapital.net> > Cc: Dave Chinner <david@fromorbit.com> > Cc: Kevin Easton <kevin@guarana.org> > Cc: Matthew Wilcox <willy@infradead.org> > Cc: Cyril Hrubis <chrubis@suse.cz> > Cc: Tejun Heo <tj@kernel.org> > Cc: Kirill A. Shutemov <kirill@shutemov.name> > Cc: Daniel Gruss <daniel@gruss.cc> > Signed-off-by: Jiri Kosina <jkosina@suse.cz> > Signed-off-by: Vlastimil Babka <vbabka@suse.cz> other than that looks good to me. Acked-by: Michal Hocko <mhocko@suse.com> If this still doesn't help happycache kind of workloads then we should add a capability check IMO but this looks like a decent foundation to me. > --- > mm/mincore.c | 15 ++++++++++++++- > 1 file changed, 14 insertions(+), 1 deletion(-) > > diff --git a/mm/mincore.c b/mm/mincore.c > index 218099b5ed31..747a4907a3ac 100644 > --- a/mm/mincore.c > +++ b/mm/mincore.c > @@ -169,6 +169,14 @@ static int mincore_pte_range(pmd_t *pmd, unsigned long addr, unsigned long end, > return 0; > } > > +static inline bool can_do_mincore(struct vm_area_struct *vma) > +{ > + return vma_is_anonymous(vma) || > + (vma->vm_file && > + (inode_owner_or_capable(file_inode(vma->vm_file)) > + || inode_permission(file_inode(vma->vm_file), MAY_WRITE) == 0)); > +} This is hard to read. Can we do if (vma_is_anonymous(vma)) return true; if (!vma->vm_file) return false; return inode_owner_or_capable(file_inode(vma->vm_file)) || inode_permission(file_inode(vma->vm_file), MAY_WRITE) == 0; -- Michal Hocko SUSE Labs ^ permalink raw reply [flat|nested] 161+ messages in thread
* Re: [PATCH 1/3] mm/mincore: make mincore() more conservative 2019-01-31 9:43 ` Michal Hocko @ 2019-01-31 9:51 ` Dominique Martinet 2019-01-31 17:46 ` Josh Snyder 1 sibling, 0 replies; 161+ messages in thread From: Dominique Martinet @ 2019-01-31 9:51 UTC (permalink / raw) To: Michal Hocko, Josh Snyder Cc: Vlastimil Babka, Andrew Morton, Linus Torvalds, linux-kernel, linux-mm, linux-api, Peter Zijlstra, Greg KH, Jann Horn, Jiri Kosina, Andy Lutomirski, Dave Chinner, Kevin Easton, Matthew Wilcox, Cyril Hrubis, Tejun Heo, Kirill A . Shutemov, Daniel Gruss, Jiri Kosina Michal Hocko wrote on Thu, Jan 31, 2019: > > Change the semantics of mincore() so that it only reveals pagecache information > > for non-anonymous mappings that belog to files that the calling process could > > (if it tried to) successfully open for writing. > > I agree that this is a better way than the original 574823bfab82 > ("Change mincore() to count "mapped" pages rather than "cached" pages"). > One thing is still not clear to me though. Is the new owner/writeable > check OK for the Netflix-like usecases? I mean does happycache have > appropriate access to the cache data? I have tried to re-read the > original thread but couldn't find any confirmation. It's enough for my use cases and Josh didn't seem to oppose, but since he's not in Cc I don't think he would answer -- added him now :) FWIW happycache writes in the current directory by default so I assume in the way they use it it would usually have access one way or another. > If this still doesn't help happycache kind of workloads then we should > add a capability check IMO but this looks like a decent foundation to > me. the inode_owner_or_capable/inode_permission helpers already do allow quite a few capabilities there -- Dominique ^ permalink raw reply [flat|nested] 161+ messages in thread
* Re: [PATCH 1/3] mm/mincore: make mincore() more conservative 2019-01-31 9:43 ` Michal Hocko 2019-01-31 9:51 ` Dominique Martinet @ 2019-01-31 17:46 ` Josh Snyder 1 sibling, 0 replies; 161+ messages in thread From: Josh Snyder @ 2019-01-31 17:46 UTC (permalink / raw) To: Michal Hocko Cc: Vlastimil Babka, Andrew Morton, Linus Torvalds, open list, Linux-MM, Linux API, Peter Zijlstra, Greg KH, Jann Horn, Jiri Kosina, Dominique Martinet, Andy Lutomirski, Dave Chinner, Kevin Easton, Matthew Wilcox, Cyril Hrubis, Tejun Heo, Kirill A . Shutemov, Daniel Gruss, Jiri Kosina On Thu, Jan 31, 2019 at 1:44 AM Michal Hocko <mhocko@kernel.org> wrote: > One thing is still not clear to me though. Is the new owner/writeable > check OK for the Netflix-like usecases? I mean does happycache have > appropriate access to the cache data? I have tried to re-read the > original thread but couldn't find any confirmation. The owner/writable check will suit every database that I've ever used happycache with, including cassandra, postgres and git. Acked-by: Josh Snyder <joshs@netflix.com> ^ permalink raw reply [flat|nested] 161+ messages in thread
* Re: [PATCH 1/3] mm/mincore: make mincore() more conservative 2019-01-30 12:44 ` [PATCH 1/3] mm/mincore: make mincore() more conservative Vlastimil Babka 2019-01-31 9:43 ` Michal Hocko @ 2019-02-01 8:56 ` Vlastimil Babka 2019-03-06 23:13 ` Andrew Morton 2 siblings, 0 replies; 161+ messages in thread From: Vlastimil Babka @ 2019-02-01 8:56 UTC (permalink / raw) To: Andrew Morton, Linus Torvalds Cc: linux-kernel, linux-mm, linux-api, Peter Zijlstra, Greg KH, Jann Horn, Jiri Kosina, Dominique Martinet, Andy Lutomirski, Dave Chinner, Kevin Easton, Matthew Wilcox, Cyril Hrubis, Tejun Heo, Kirill A . Shutemov, Daniel Gruss, Jiri Kosina Here's updated version with Michal's suggestion, and acks: I think this patch is fine to go, less sure about 2/3 and 3/3. ----8<---- >From 49f17d9f6a42ecc2a508125b0c880ff0402a6f49 Mon Sep 17 00:00:00 2001 From: Jiri Kosina <jkosina@suse.cz> Date: Wed, 16 Jan 2019 20:53:17 +0100 Subject: [PATCH v2] mm/mincore: make mincore() more conservative The semantics of what mincore() considers to be resident is not completely clear, but Linux has always (since 2.3.52, which is when mincore() was initially done) treated it as "page is available in page cache". That's potentially a problem, as that [in]directly exposes meta-information about pagecache / memory mapping state even about memory not strictly belonging to the process executing the syscall, opening possibilities for sidechannel attacks. Change the semantics of mincore() so that it only reveals pagecache information for non-anonymous mappings that belog to files that the calling process could (if it tried to) successfully open for writing. [mhocko@suse.com: restructure can_do_mincore() conditions] Originally-by: Linus Torvalds <torvalds@linux-foundation.org> Originally-by: Dominique Martinet <asmadeus@codewreck.org> Cc: Dominique Martinet <asmadeus@codewreck.org> Cc: Andy Lutomirski <luto@amacapital.net> Cc: Dave Chinner <david@fromorbit.com> Cc: Kevin Easton <kevin@guarana.org> Cc: Matthew Wilcox <willy@infradead.org> Cc: Cyril Hrubis <chrubis@suse.cz> Cc: Tejun Heo <tj@kernel.org> Cc: Kirill A. Shutemov <kirill@shutemov.name> Cc: Daniel Gruss <daniel@gruss.cc> Signed-off-by: Jiri Kosina <jkosina@suse.cz> Signed-off-by: Vlastimil Babka <vbabka@suse.cz> Acked-by: Josh Snyder <joshs@netflix.com> Acked-by: Michal Hocko <mhocko@suse.com> --- mm/mincore.c | 17 ++++++++++++++++- 1 file changed, 16 insertions(+), 1 deletion(-) diff --git a/mm/mincore.c b/mm/mincore.c index 218099b5ed31..b8842b849604 100644 --- a/mm/mincore.c +++ b/mm/mincore.c @@ -169,6 +169,16 @@ static int mincore_pte_range(pmd_t *pmd, unsigned long addr, unsigned long end, return 0; } +static inline bool can_do_mincore(struct vm_area_struct *vma) +{ + if (vma_is_anonymous(vma)) + return true; + if (!vma->vm_file) + return false; + return inode_owner_or_capable(file_inode(vma->vm_file)) || + inode_permission(file_inode(vma->vm_file), MAY_WRITE) == 0; +} + /* * Do a chunk of "sys_mincore()". We've already checked * all the arguments, we hold the mmap semaphore: we should @@ -189,8 +199,13 @@ static long do_mincore(unsigned long addr, unsigned long pages, unsigned char *v vma = find_vma(current->mm, addr); if (!vma || addr < vma->vm_start) return -ENOMEM; - mincore_walk.mm = vma->vm_mm; end = min(vma->vm_end, addr + (pages << PAGE_SHIFT)); + if (!can_do_mincore(vma)) { + unsigned long pages = (end - addr) >> PAGE_SHIFT; + memset(vec, 1, pages); + return pages; + } + mincore_walk.mm = vma->vm_mm; err = walk_page_range(addr, end, &mincore_walk); if (err < 0) return err; -- 2.20.1 ^ permalink raw reply related [flat|nested] 161+ messages in thread
* Re: [PATCH 1/3] mm/mincore: make mincore() more conservative 2019-01-30 12:44 ` [PATCH 1/3] mm/mincore: make mincore() more conservative Vlastimil Babka 2019-01-31 9:43 ` Michal Hocko 2019-02-01 8:56 ` Vlastimil Babka @ 2019-03-06 23:13 ` Andrew Morton 2019-03-07 0:01 ` Jiri Kosina 2 siblings, 1 reply; 161+ messages in thread From: Andrew Morton @ 2019-03-06 23:13 UTC (permalink / raw) To: Vlastimil Babka Cc: Linus Torvalds, linux-kernel, linux-mm, linux-api, Peter Zijlstra, Greg KH, Jann Horn, Jiri Kosina, Dominique Martinet, Andy Lutomirski, Dave Chinner, Kevin Easton, Matthew Wilcox, Cyril Hrubis, Tejun Heo, Kirill A . Shutemov, Daniel Gruss, Jiri Kosina On Wed, 30 Jan 2019 13:44:18 +0100 Vlastimil Babka <vbabka@suse.cz> wrote: > From: Jiri Kosina <jkosina@suse.cz> > > The semantics of what mincore() considers to be resident is not completely > clear, but Linux has always (since 2.3.52, which is when mincore() was > initially done) treated it as "page is available in page cache". > > That's potentially a problem, as that [in]directly exposes meta-information > about pagecache / memory mapping state even about memory not strictly belonging > to the process executing the syscall, opening possibilities for sidechannel > attacks. > > Change the semantics of mincore() so that it only reveals pagecache information > for non-anonymous mappings that belog to files that the calling process could > (if it tried to) successfully open for writing. "for writing" comes as a bit of a surprise. Why not for reading? Could we please explain the reasoning in the changelog and in the (presently absent) comments which describe can_do_mincore()? > @@ -189,8 +197,13 @@ static long do_mincore(unsigned long addr, unsigned long pages, unsigned char *v > vma = find_vma(current->mm, addr); > if (!vma || addr < vma->vm_start) > return -ENOMEM; > - mincore_walk.mm = vma->vm_mm; > end = min(vma->vm_end, addr + (pages << PAGE_SHIFT)); > + if (!can_do_mincore(vma)) { > + unsigned long pages = (end - addr) >> PAGE_SHIFT; I'm not sure this is correct in all cases. If addr = 4095 vma->vm_end = 4096 pages = 1000 then `end' is 4096 and `(end - addr) << PAGE_SHIFT' is zero, but it should have been 1. Please check? A mincore test suite in tools/testing/selftests would be useful, methinks. To exercise such corner cases, check for future breakage, etc. > + memset(vec, 1, pages); > + return pages; > + } > + mincore_walk.mm = vma->vm_mm; > err = walk_page_range(addr, end, &mincore_walk); > if (err < 0) > return err; ^ permalink raw reply [flat|nested] 161+ messages in thread
* Re: [PATCH 1/3] mm/mincore: make mincore() more conservative 2019-03-06 23:13 ` Andrew Morton @ 2019-03-07 0:01 ` Jiri Kosina 2019-03-07 0:40 ` Dominique Martinet 0 siblings, 1 reply; 161+ messages in thread From: Jiri Kosina @ 2019-03-07 0:01 UTC (permalink / raw) To: Andrew Morton Cc: Vlastimil Babka, Linus Torvalds, linux-kernel, linux-mm, linux-api, Peter Zijlstra, Greg KH, Jann Horn, Dominique Martinet, Andy Lutomirski, Dave Chinner, Kevin Easton, Matthew Wilcox, Cyril Hrubis, Tejun Heo, Kirill A . Shutemov, Daniel Gruss On Wed, 6 Mar 2019, Andrew Morton wrote: > > The semantics of what mincore() considers to be resident is not completely > > clear, but Linux has always (since 2.3.52, which is when mincore() was > > initially done) treated it as "page is available in page cache". > > > > That's potentially a problem, as that [in]directly exposes meta-information > > about pagecache / memory mapping state even about memory not strictly belonging > > to the process executing the syscall, opening possibilities for sidechannel > > attacks. > > > > Change the semantics of mincore() so that it only reveals pagecache information > > for non-anonymous mappings that belog to files that the calling process could > > (if it tried to) successfully open for writing. > > "for writing" comes as a bit of a surprise. Why not for reading? I guess this is a rhetorical question from you :) but fair enough, good point, I'll explain this a bit more in the changelog and in the code comments. > > @@ -189,8 +197,13 @@ static long do_mincore(unsigned long addr, unsigned long pages, unsigned char *v > > vma = find_vma(current->mm, addr); > > if (!vma || addr < vma->vm_start) > > return -ENOMEM; > > - mincore_walk.mm = vma->vm_mm; > > end = min(vma->vm_end, addr + (pages << PAGE_SHIFT)); > > + if (!can_do_mincore(vma)) { > > + unsigned long pages = (end - addr) >> PAGE_SHIFT; > > I'm not sure this is correct in all cases. If > > addr = 4095 > vma->vm_end = 4096 > pages = 1000 > > then `end' is 4096 and `(end - addr) << PAGE_SHIFT' is zero, but it > should have been 1. Good catch! It should rather be something like unsigned long pages = (end >> PAGE_SHIFT) - (addr >> PAGE_SHIFT); I'll fix that up and resend tomorrow. Thanks, -- Jiri Kosina SUSE Labs ^ permalink raw reply [flat|nested] 161+ messages in thread
* Re: [PATCH 1/3] mm/mincore: make mincore() more conservative 2019-03-07 0:01 ` Jiri Kosina @ 2019-03-07 0:40 ` Dominique Martinet 2019-03-07 5:46 ` Jiri Kosina 0 siblings, 1 reply; 161+ messages in thread From: Dominique Martinet @ 2019-03-07 0:40 UTC (permalink / raw) To: Jiri Kosina Cc: Andrew Morton, Vlastimil Babka, Linus Torvalds, linux-kernel, linux-mm, linux-api, Peter Zijlstra, Greg KH, Jann Horn, Andy Lutomirski, Dave Chinner, Kevin Easton, Matthew Wilcox, Cyril Hrubis, Tejun Heo, Kirill A . Shutemov, Daniel Gruss Jiri Kosina wrote on Thu, Mar 07, 2019: > > I'm not sure this is correct in all cases. If > > > > addr = 4095 > > vma->vm_end = 4096 > > pages = 1000 > > > > then `end' is 4096 and `(end - addr) << PAGE_SHIFT' is zero, but it > > should have been 1. > > Good catch! It should rather be something like > > unsigned long pages = (end >> PAGE_SHIFT) - (addr >> PAGE_SHIFT); That would be 0 for addr = 0 and vma->vm_end = 1; I assume we would still want to count that as one page. I'm not too familiar with this area of the code, but I think there's a handy macro we can use for this, perhaps DIV_ROUND_UP(end - addr, PAGE_SIZE) ? kernel/kexec_core.c has defined PAGE_COUNT() which seems more appropriate but I do not see a global equivalent #define PAGE_COUNT(x) (((x) + PAGE_SIZE - 1) >> PAGE_SHIFT) -- Dominique ^ permalink raw reply [flat|nested] 161+ messages in thread
* Re: [PATCH 1/3] mm/mincore: make mincore() more conservative 2019-03-07 0:40 ` Dominique Martinet @ 2019-03-07 5:46 ` Jiri Kosina 0 siblings, 0 replies; 161+ messages in thread From: Jiri Kosina @ 2019-03-07 5:46 UTC (permalink / raw) To: Dominique Martinet Cc: Andrew Morton, Vlastimil Babka, Linus Torvalds, linux-kernel, linux-mm, linux-api, Peter Zijlstra, Greg KH, Jann Horn, Andy Lutomirski, Dave Chinner, Kevin Easton, Matthew Wilcox, Cyril Hrubis, Tejun Heo, Kirill A . Shutemov, Daniel Gruss On Thu, 7 Mar 2019, Dominique Martinet wrote: > > > > > > addr = 4095 > > > vma->vm_end = 4096 > > > pages = 1000 > > > > > > then `end' is 4096 and `(end - addr) << PAGE_SHIFT' is zero, but it > > > should have been 1. > > > > Good catch! It should rather be something like > > > > unsigned long pages = (end >> PAGE_SHIFT) - (addr >> PAGE_SHIFT); > > That would be 0 for addr = 0 and vma->vm_end = 1; I assume we would > still want to count that as one page. Yeah, that was bogus as well, ETOOTIRED yesterday, sorry for the noise. Both the variants are off. > I'm not too familiar with this area of the code, but I think there's a > handy macro we can use for this, perhaps > DIV_ROUND_UP(end - addr, PAGE_SIZE) ? > > kernel/kexec_core.c has defined PAGE_COUNT() which seems more > appropriate but I do not see a global equivalent > #define PAGE_COUNT(x) (((x) + PAGE_SIZE - 1) >> PAGE_SHIFT) I'll fix that up when doing the other changes requested by Andrew. Thanks, -- Jiri Kosina SUSE Labs ^ permalink raw reply [flat|nested] 161+ messages in thread
* [PATCH 2/3] mm/filemap: initiate readahead even if IOCB_NOWAIT is set for the I/O 2019-01-30 12:44 ` [PATCH 0/3] mincore() and IOCB_NOWAIT adjustments Vlastimil Babka 2019-01-30 12:44 ` [PATCH 1/3] mm/mincore: make mincore() more conservative Vlastimil Babka @ 2019-01-30 12:44 ` Vlastimil Babka 2019-01-30 15:04 ` Florian Weimer ` (2 more replies) 2019-01-30 12:44 ` [PATCH 3/3] mm/mincore: provide mapped status when cached status is not allowed Vlastimil Babka ` (2 subsequent siblings) 4 siblings, 3 replies; 161+ messages in thread From: Vlastimil Babka @ 2019-01-30 12:44 UTC (permalink / raw) To: Andrew Morton, Linus Torvalds Cc: linux-kernel, linux-mm, linux-api, Peter Zijlstra, Greg KH, Jann Horn, Jiri Kosina, Dominique Martinet, Andy Lutomirski, Dave Chinner, Kevin Easton, Matthew Wilcox, Cyril Hrubis, Tejun Heo, Kirill A . Shutemov, Daniel Gruss, Vlastimil Babka, Jiri Kosina From: Jiri Kosina <jkosina@suse.cz> preadv2(RWF_NOWAIT) can be used to open a side-channel to pagecache contents, as it reveals metadata about residency of pages in pagecache. If preadv2(RWF_NOWAIT) returns immediately, it provides a clear "page not resident" information, and vice versa. Close that sidechannel by always initiating readahead on the cache if we encounter a cache miss for preadv2(RWF_NOWAIT); with that in place, probing the pagecache residency itself will actually populate the cache, making the sidechannel useless. Originally-by: Linus Torvalds <torvalds@linux-foundation.org> Cc: Dominique Martinet <asmadeus@codewreck.org> Cc: Andy Lutomirski <luto@amacapital.net> Cc: Dave Chinner <david@fromorbit.com> Cc: Kevin Easton <kevin@guarana.org> Cc: Matthew Wilcox <willy@infradead.org> Cc: Cyril Hrubis <chrubis@suse.cz> Cc: Tejun Heo <tj@kernel.org> Cc: Kirill A. Shutemov <kirill@shutemov.name> Cc: Daniel Gruss <daniel@gruss.cc> Signed-off-by: Jiri Kosina <jkosina@suse.cz> Signed-off-by: Vlastimil Babka <vbabka@suse.cz> --- mm/filemap.c | 2 -- 1 file changed, 2 deletions(-) diff --git a/mm/filemap.c b/mm/filemap.c index 9f5e323e883e..7bcdd36e629d 100644 --- a/mm/filemap.c +++ b/mm/filemap.c @@ -2075,8 +2075,6 @@ static ssize_t generic_file_buffered_read(struct kiocb *iocb, page = find_get_page(mapping, index); if (!page) { - if (iocb->ki_flags & IOCB_NOWAIT) - goto would_block; page_cache_sync_readahead(mapping, ra, filp, index, last_index - index); -- 2.20.1 ^ permalink raw reply related [flat|nested] 161+ messages in thread
* Re: [PATCH 2/3] mm/filemap: initiate readahead even if IOCB_NOWAIT is set for the I/O 2019-01-30 12:44 ` [PATCH 2/3] mm/filemap: initiate readahead even if IOCB_NOWAIT is set for the I/O Vlastimil Babka @ 2019-01-30 15:04 ` Florian Weimer 2019-01-30 15:15 ` Jiri Kosina 2019-01-31 9:56 ` Michal Hocko 2019-01-31 12:04 ` Daniel Gruss 2 siblings, 1 reply; 161+ messages in thread From: Florian Weimer @ 2019-01-30 15:04 UTC (permalink / raw) To: Vlastimil Babka Cc: Andrew Morton, Linus Torvalds, linux-kernel, linux-mm, linux-api, Peter Zijlstra, Greg KH, Jann Horn, Jiri Kosina, Dominique Martinet, Andy Lutomirski, Dave Chinner, Kevin Easton, Matthew Wilcox, Cyril Hrubis, Tejun Heo, Kirill A . Shutemov, Daniel Gruss, Jiri Kosina * Vlastimil Babka: > preadv2(RWF_NOWAIT) can be used to open a side-channel to pagecache > contents, as it reveals metadata about residency of pages in > pagecache. > > If preadv2(RWF_NOWAIT) returns immediately, it provides a clear "page > not resident" information, and vice versa. > > Close that sidechannel by always initiating readahead on the cache if > we encounter a cache miss for preadv2(RWF_NOWAIT); with that in place, > probing the pagecache residency itself will actually populate the > cache, making the sidechannel useless. I think this needs to use a different flag because the semantics are so much different. If I understand this change correctly, previously, RWF_NOWAIT essentially avoided any I/O, and now it does not. Thanks, Florian ^ permalink raw reply [flat|nested] 161+ messages in thread
* Re: [PATCH 2/3] mm/filemap: initiate readahead even if IOCB_NOWAIT is set for the I/O 2019-01-30 15:04 ` Florian Weimer @ 2019-01-30 15:15 ` Jiri Kosina 2019-01-31 10:47 ` Florian Weimer 0 siblings, 1 reply; 161+ messages in thread From: Jiri Kosina @ 2019-01-30 15:15 UTC (permalink / raw) To: Florian Weimer Cc: Vlastimil Babka, Andrew Morton, Linus Torvalds, linux-kernel, linux-mm, linux-api, Peter Zijlstra, Greg KH, Jann Horn, Dominique Martinet, Andy Lutomirski, Dave Chinner, Kevin Easton, Matthew Wilcox, Cyril Hrubis, Tejun Heo, Kirill A . Shutemov, Daniel Gruss On Wed, 30 Jan 2019, Florian Weimer wrote: > > preadv2(RWF_NOWAIT) can be used to open a side-channel to pagecache > > contents, as it reveals metadata about residency of pages in > > pagecache. > > > > If preadv2(RWF_NOWAIT) returns immediately, it provides a clear "page > > not resident" information, and vice versa. > > > > Close that sidechannel by always initiating readahead on the cache if > > we encounter a cache miss for preadv2(RWF_NOWAIT); with that in place, > > probing the pagecache residency itself will actually populate the > > cache, making the sidechannel useless. > > I think this needs to use a different flag because the semantics are so > much different. If I understand this change correctly, previously, > RWF_NOWAIT essentially avoided any I/O, and now it does not. It still avoid synchronous I/O, due to this code still being in place: if (!PageUptodate(page)) { if (iocb->ki_flags & IOCB_NOWAIT) { put_page(page); goto would_block; } but goes the would_block path only after initiating asynchronous readahead. -- Jiri Kosina SUSE Labs ^ permalink raw reply [flat|nested] 161+ messages in thread
* Re: [PATCH 2/3] mm/filemap: initiate readahead even if IOCB_NOWAIT is set for the I/O 2019-01-30 15:15 ` Jiri Kosina @ 2019-01-31 10:47 ` Florian Weimer 2019-01-31 11:34 ` Jiri Kosina 0 siblings, 1 reply; 161+ messages in thread From: Florian Weimer @ 2019-01-31 10:47 UTC (permalink / raw) To: Jiri Kosina Cc: Vlastimil Babka, Andrew Morton, Linus Torvalds, linux-kernel, linux-mm, linux-api, Peter Zijlstra, Greg KH, Jann Horn, Dominique Martinet, Andy Lutomirski, Dave Chinner, Kevin Easton, Matthew Wilcox, Cyril Hrubis, Tejun Heo, Kirill A . Shutemov, Daniel Gruss * Jiri Kosina: > On Wed, 30 Jan 2019, Florian Weimer wrote: > >> > preadv2(RWF_NOWAIT) can be used to open a side-channel to pagecache >> > contents, as it reveals metadata about residency of pages in >> > pagecache. >> > >> > If preadv2(RWF_NOWAIT) returns immediately, it provides a clear "page >> > not resident" information, and vice versa. >> > >> > Close that sidechannel by always initiating readahead on the cache if >> > we encounter a cache miss for preadv2(RWF_NOWAIT); with that in place, >> > probing the pagecache residency itself will actually populate the >> > cache, making the sidechannel useless. >> >> I think this needs to use a different flag because the semantics are so >> much different. If I understand this change correctly, previously, >> RWF_NOWAIT essentially avoided any I/O, and now it does not. > > It still avoid synchronous I/O, due to this code still being in place: > > if (!PageUptodate(page)) { > if (iocb->ki_flags & IOCB_NOWAIT) { > put_page(page); > goto would_block; > } > > but goes the would_block path only after initiating asynchronous > readahead. But it wouldn't schedule asynchronous readahead before? I'm worried that something, say PostgreSQL doing a sequential scan, would implement a two-pass approach, first using RWF_NOWAIT to process what's in the kernel page cache, and then read the rest without it. If RWF_NOWAIT is treated as a prefetch hint, there could be much more read activity, and a lot of it would be pointless because the data might have to be evicted before userspace can use it. Thanks, Florian ^ permalink raw reply [flat|nested] 161+ messages in thread
* Re: [PATCH 2/3] mm/filemap: initiate readahead even if IOCB_NOWAIT is set for the I/O 2019-01-31 10:47 ` Florian Weimer @ 2019-01-31 11:34 ` Jiri Kosina 0 siblings, 0 replies; 161+ messages in thread From: Jiri Kosina @ 2019-01-31 11:34 UTC (permalink / raw) To: Florian Weimer Cc: Vlastimil Babka, Andrew Morton, Linus Torvalds, linux-kernel, linux-mm, linux-api, Peter Zijlstra, Greg KH, Jann Horn, Dominique Martinet, Andy Lutomirski, Dave Chinner, Kevin Easton, Matthew Wilcox, Cyril Hrubis, Tejun Heo, Kirill A . Shutemov, Daniel Gruss On Thu, 31 Jan 2019, Florian Weimer wrote: > >> I think this needs to use a different flag because the semantics are so > >> much different. If I understand this change correctly, previously, > >> RWF_NOWAIT essentially avoided any I/O, and now it does not. > > > > It still avoid synchronous I/O, due to this code still being in place: > > > > if (!PageUptodate(page)) { > > if (iocb->ki_flags & IOCB_NOWAIT) { > > put_page(page); > > goto would_block; > > } > > > > but goes the would_block path only after initiating asynchronous > > readahead. > > But it wouldn't schedule asynchronous readahead before? It would, that's kind of the whole point. > I'm worried that something, say PostgreSQL doing a sequential scan, > would implement a two-pass approach, first using RWF_NOWAIT to process > what's in the kernel page cache, and then read the rest without it. If > RWF_NOWAIT is treated as a prefetch hint, there could be much more read > activity, and a lot of it would be pointless because the data might have > to be evicted before userspace can use it. So are you aware of anything already existing, that'd implement this semantics? I've quickly grepped https://github.com/postgres/postgres for RWF_NOWAIT, and they don't seem to use it at all. RWF_NOWAIT is rather new. The usecase I am aware of is to make sure that the thread doing io_submit() doesn't get blocked for too long, because it has other things to do quickly in order to avoid starving other sub-threads (and delegate the I/O submission to asynchronous context). -- Jiri Kosina SUSE Labs ^ permalink raw reply [flat|nested] 161+ messages in thread
* Re: [PATCH 2/3] mm/filemap: initiate readahead even if IOCB_NOWAIT is set for the I/O 2019-01-30 12:44 ` [PATCH 2/3] mm/filemap: initiate readahead even if IOCB_NOWAIT is set for the I/O Vlastimil Babka 2019-01-30 15:04 ` Florian Weimer @ 2019-01-31 9:56 ` Michal Hocko 2019-01-31 10:15 ` Jiri Kosina 2019-02-01 1:44 ` Dave Chinner 2019-01-31 12:04 ` Daniel Gruss 2 siblings, 2 replies; 161+ messages in thread From: Michal Hocko @ 2019-01-31 9:56 UTC (permalink / raw) To: Vlastimil Babka Cc: Andrew Morton, Linus Torvalds, linux-kernel, linux-mm, linux-api, Peter Zijlstra, Greg KH, Jann Horn, Jiri Kosina, Dominique Martinet, Andy Lutomirski, Dave Chinner, Kevin Easton, Matthew Wilcox, Cyril Hrubis, Tejun Heo, Kirill A . Shutemov, Daniel Gruss, Jiri Kosina, linux-fsdevel [Cc fs-devel] On Wed 30-01-19 13:44:19, Vlastimil Babka wrote: > From: Jiri Kosina <jkosina@suse.cz> > > preadv2(RWF_NOWAIT) can be used to open a side-channel to pagecache contents, as > it reveals metadata about residency of pages in pagecache. > > If preadv2(RWF_NOWAIT) returns immediately, it provides a clear "page not > resident" information, and vice versa. > > Close that sidechannel by always initiating readahead on the cache if we > encounter a cache miss for preadv2(RWF_NOWAIT); with that in place, probing > the pagecache residency itself will actually populate the cache, making the > sidechannel useless. I guess the current wording doesn't disallow background IO to be triggered for EAGAIN case. I am not sure whether that breaks clever applications which try to perform larger IO for those cases though. > Originally-by: Linus Torvalds <torvalds@linux-foundation.org> > Cc: Dominique Martinet <asmadeus@codewreck.org> > Cc: Andy Lutomirski <luto@amacapital.net> > Cc: Dave Chinner <david@fromorbit.com> > Cc: Kevin Easton <kevin@guarana.org> > Cc: Matthew Wilcox <willy@infradead.org> > Cc: Cyril Hrubis <chrubis@suse.cz> > Cc: Tejun Heo <tj@kernel.org> > Cc: Kirill A. Shutemov <kirill@shutemov.name> > Cc: Daniel Gruss <daniel@gruss.cc> > Signed-off-by: Jiri Kosina <jkosina@suse.cz> > Signed-off-by: Vlastimil Babka <vbabka@suse.cz> > --- > mm/filemap.c | 2 -- > 1 file changed, 2 deletions(-) > > diff --git a/mm/filemap.c b/mm/filemap.c > index 9f5e323e883e..7bcdd36e629d 100644 > --- a/mm/filemap.c > +++ b/mm/filemap.c > @@ -2075,8 +2075,6 @@ static ssize_t generic_file_buffered_read(struct kiocb *iocb, > > page = find_get_page(mapping, index); > if (!page) { > - if (iocb->ki_flags & IOCB_NOWAIT) > - goto would_block; > page_cache_sync_readahead(mapping, > ra, filp, > index, last_index - index); Maybe a stupid question but I am not really familiar with this path but what exactly does prevent a sync read down page_cache_sync_readahead path? -- Michal Hocko SUSE Labs ^ permalink raw reply [flat|nested] 161+ messages in thread
* Re: [PATCH 2/3] mm/filemap: initiate readahead even if IOCB_NOWAIT is set for the I/O 2019-01-31 9:56 ` Michal Hocko @ 2019-01-31 10:15 ` Jiri Kosina 2019-01-31 10:23 ` Michal Hocko 2019-02-01 1:44 ` Dave Chinner 1 sibling, 1 reply; 161+ messages in thread From: Jiri Kosina @ 2019-01-31 10:15 UTC (permalink / raw) To: Michal Hocko Cc: Vlastimil Babka, Andrew Morton, Linus Torvalds, linux-kernel, linux-mm, linux-api, Peter Zijlstra, Greg KH, Jann Horn, Dominique Martinet, Andy Lutomirski, Dave Chinner, Kevin Easton, Matthew Wilcox, Cyril Hrubis, Tejun Heo, Kirill A . Shutemov, Daniel Gruss, linux-fsdevel On Thu, 31 Jan 2019, Michal Hocko wrote: > > diff --git a/mm/filemap.c b/mm/filemap.c > > index 9f5e323e883e..7bcdd36e629d 100644 > > --- a/mm/filemap.c > > +++ b/mm/filemap.c > > @@ -2075,8 +2075,6 @@ static ssize_t generic_file_buffered_read(struct kiocb *iocb, > > > > page = find_get_page(mapping, index); > > if (!page) { > > - if (iocb->ki_flags & IOCB_NOWAIT) > > - goto would_block; > > page_cache_sync_readahead(mapping, > > ra, filp, > > index, last_index - index); > > Maybe a stupid question but I am not really familiar with this path but > what exactly does prevent a sync read down page_cache_sync_readahead > path? page_cache_sync_readahead() only submits the read ahead request(s), it doesn't wait for it to finish. -- Jiri Kosina SUSE Labs ^ permalink raw reply [flat|nested] 161+ messages in thread
* Re: [PATCH 2/3] mm/filemap: initiate readahead even if IOCB_NOWAIT is set for the I/O 2019-01-31 10:15 ` Jiri Kosina @ 2019-01-31 10:23 ` Michal Hocko 2019-01-31 10:30 ` Jiri Kosina 2019-01-31 17:54 ` Linus Torvalds 0 siblings, 2 replies; 161+ messages in thread From: Michal Hocko @ 2019-01-31 10:23 UTC (permalink / raw) To: Jiri Kosina Cc: Vlastimil Babka, Andrew Morton, Linus Torvalds, linux-kernel, linux-mm, linux-api, Peter Zijlstra, Greg KH, Jann Horn, Dominique Martinet, Andy Lutomirski, Dave Chinner, Kevin Easton, Matthew Wilcox, Cyril Hrubis, Tejun Heo, Kirill A . Shutemov, Daniel Gruss, linux-fsdevel On Thu 31-01-19 11:15:28, Jiri Kosina wrote: > On Thu, 31 Jan 2019, Michal Hocko wrote: > > > > diff --git a/mm/filemap.c b/mm/filemap.c > > > index 9f5e323e883e..7bcdd36e629d 100644 > > > --- a/mm/filemap.c > > > +++ b/mm/filemap.c > > > @@ -2075,8 +2075,6 @@ static ssize_t generic_file_buffered_read(struct kiocb *iocb, > > > > > > page = find_get_page(mapping, index); > > > if (!page) { > > > - if (iocb->ki_flags & IOCB_NOWAIT) > > > - goto would_block; > > > page_cache_sync_readahead(mapping, > > > ra, filp, > > > index, last_index - index); > > > > Maybe a stupid question but I am not really familiar with this path but > > what exactly does prevent a sync read down page_cache_sync_readahead > > path? > > page_cache_sync_readahead() only submits the read ahead request(s), it > doesn't wait for it to finish. OK, I guess my question was not precise. What does prevent taking fs locks down the path? -- Michal Hocko SUSE Labs ^ permalink raw reply [flat|nested] 161+ messages in thread
* Re: [PATCH 2/3] mm/filemap: initiate readahead even if IOCB_NOWAIT is set for the I/O 2019-01-31 10:23 ` Michal Hocko @ 2019-01-31 10:30 ` Jiri Kosina 2019-01-31 11:32 ` Michal Hocko 2019-01-31 17:54 ` Linus Torvalds 1 sibling, 1 reply; 161+ messages in thread From: Jiri Kosina @ 2019-01-31 10:30 UTC (permalink / raw) To: Michal Hocko Cc: Vlastimil Babka, Andrew Morton, Linus Torvalds, linux-kernel, linux-mm, linux-api, Peter Zijlstra, Greg KH, Jann Horn, Dominique Martinet, Andy Lutomirski, Dave Chinner, Kevin Easton, Matthew Wilcox, Cyril Hrubis, Tejun Heo, Kirill A . Shutemov, Daniel Gruss, linux-fsdevel On Thu, 31 Jan 2019, Michal Hocko wrote: > > > > diff --git a/mm/filemap.c b/mm/filemap.c > > > > index 9f5e323e883e..7bcdd36e629d 100644 > > > > --- a/mm/filemap.c > > > > +++ b/mm/filemap.c > > > > @@ -2075,8 +2075,6 @@ static ssize_t generic_file_buffered_read(struct kiocb *iocb, > > > > > > > > page = find_get_page(mapping, index); > > > > if (!page) { > > > > - if (iocb->ki_flags & IOCB_NOWAIT) > > > > - goto would_block; > > > > page_cache_sync_readahead(mapping, > > > > ra, filp, > > > > index, last_index - index); > > > > > > Maybe a stupid question but I am not really familiar with this path but > > > what exactly does prevent a sync read down page_cache_sync_readahead > > > path? > > > > page_cache_sync_readahead() only submits the read ahead request(s), it > > doesn't wait for it to finish. > > OK, I guess my question was not precise. What does prevent taking fs > locks down the path? Well, RWF_NOWAIT doesn't mean the kernel can't reschedule while executing preadv2(), right? It just means it will not wait for the arrival of the whole data blob into pagecache in case it's not there. -- Jiri Kosina SUSE Labs ^ permalink raw reply [flat|nested] 161+ messages in thread
* Re: [PATCH 2/3] mm/filemap: initiate readahead even if IOCB_NOWAIT is set for the I/O 2019-01-31 10:30 ` Jiri Kosina @ 2019-01-31 11:32 ` Michal Hocko 0 siblings, 0 replies; 161+ messages in thread From: Michal Hocko @ 2019-01-31 11:32 UTC (permalink / raw) To: Jiri Kosina Cc: Vlastimil Babka, Andrew Morton, Linus Torvalds, linux-kernel, linux-mm, linux-api, Peter Zijlstra, Greg KH, Jann Horn, Dominique Martinet, Andy Lutomirski, Dave Chinner, Kevin Easton, Matthew Wilcox, Cyril Hrubis, Tejun Heo, Kirill A . Shutemov, Daniel Gruss, linux-fsdevel On Thu 31-01-19 11:30:24, Jiri Kosina wrote: > On Thu, 31 Jan 2019, Michal Hocko wrote: > > > > > > diff --git a/mm/filemap.c b/mm/filemap.c > > > > > index 9f5e323e883e..7bcdd36e629d 100644 > > > > > --- a/mm/filemap.c > > > > > +++ b/mm/filemap.c > > > > > @@ -2075,8 +2075,6 @@ static ssize_t generic_file_buffered_read(struct kiocb *iocb, > > > > > > > > > > page = find_get_page(mapping, index); > > > > > if (!page) { > > > > > - if (iocb->ki_flags & IOCB_NOWAIT) > > > > > - goto would_block; > > > > > page_cache_sync_readahead(mapping, > > > > > ra, filp, > > > > > index, last_index - index); > > > > > > > > Maybe a stupid question but I am not really familiar with this path but > > > > what exactly does prevent a sync read down page_cache_sync_readahead > > > > path? > > > > > > page_cache_sync_readahead() only submits the read ahead request(s), it > > > doesn't wait for it to finish. > > > > OK, I guess my question was not precise. What does prevent taking fs > > locks down the path? > > Well, RWF_NOWAIT doesn't mean the kernel can't reschedule while executing > preadv2(), right? It just means it will not wait for the arrival of the > whole data blob into pagecache in case it's not there. No, it can reschedule for sure but the man page says: : If this flag is specified, the preadv2() system call will return : instantly if it would have to read data from the backing storage or wait : for a lock. I assume that the lock is meant to be a filesystem lock here. -- Michal Hocko SUSE Labs ^ permalink raw reply [flat|nested] 161+ messages in thread
* Re: [PATCH 2/3] mm/filemap: initiate readahead even if IOCB_NOWAIT is set for the I/O 2019-01-31 10:23 ` Michal Hocko 2019-01-31 10:30 ` Jiri Kosina @ 2019-01-31 17:54 ` Linus Torvalds 2019-02-01 5:13 ` Dave Chinner 1 sibling, 1 reply; 161+ messages in thread From: Linus Torvalds @ 2019-01-31 17:54 UTC (permalink / raw) To: Michal Hocko Cc: Jiri Kosina, Vlastimil Babka, Andrew Morton, Linux List Kernel Mailing, Linux-MM, Linux API, Peter Zijlstra, Greg KH, Jann Horn, Dominique Martinet, Andy Lutomirski, Dave Chinner, Kevin Easton, Matthew Wilcox, Cyril Hrubis, Tejun Heo, Kirill A . Shutemov, Daniel Gruss, linux-fsdevel On Thu, Jan 31, 2019 at 2:23 AM Michal Hocko <mhocko@kernel.org> wrote: > > OK, I guess my question was not precise. What does prevent taking fs > locks down the path? IOCB_NOWAIT has never meant that, and will never mean it. We will never give user space those kinds of guarantees. We do locking for various reasons. For example, we'll do the mm lock just when fetching/storing data from/to user space if there's a page fault. Or - more obviously - we'll also check for - and sleep on - mandatory locks in rw_verify_area(). There is nothing like "atomic IO" to user space. We simply do not give those kinds of guarantees. That's even more true when this is a information leak that we shouldn't expose to user space in the first place. Linus ^ permalink raw reply [flat|nested] 161+ messages in thread
* Re: [PATCH 2/3] mm/filemap: initiate readahead even if IOCB_NOWAIT is set for the I/O 2019-01-31 17:54 ` Linus Torvalds @ 2019-02-01 5:13 ` Dave Chinner 2019-02-01 7:05 ` Linus Torvalds 0 siblings, 1 reply; 161+ messages in thread From: Dave Chinner @ 2019-02-01 5:13 UTC (permalink / raw) To: Linus Torvalds Cc: Michal Hocko, Jiri Kosina, Vlastimil Babka, Andrew Morton, Linux List Kernel Mailing, Linux-MM, Linux API, Peter Zijlstra, Greg KH, Jann Horn, Dominique Martinet, Andy Lutomirski, Kevin Easton, Matthew Wilcox, Cyril Hrubis, Tejun Heo, Kirill A . Shutemov, Daniel Gruss, linux-fsdevel On Thu, Jan 31, 2019 at 09:54:16AM -0800, Linus Torvalds wrote: > On Thu, Jan 31, 2019 at 2:23 AM Michal Hocko <mhocko@kernel.org> wrote: > > > > OK, I guess my question was not precise. What does prevent taking fs > > locks down the path? > > IOCB_NOWAIT has never meant that, and will never mean it. I think you're wrong, Linus. IOCB_NOWAIT was specifically designed to prevent blocking on filesystem locks during AIO submission. The initial commits spell that out pretty clearly: commit b745fafaf70c0a98a2e1e7ac8cb14542889ceb0e Author: Goldwyn Rodrigues <rgoldwyn@suse.com> Date: Tue Jun 20 07:05:43 2017 -0500 fs: Introduce RWF_NOWAIT and FMODE_AIO_NOWAIT RWF_NOWAIT informs kernel to bail out if an AIO request will block for reasons such as file allocations, or a writeback triggered, or would block while allocating requests while performing direct I/O. RWF_NOWAIT is translated to IOCB_NOWAIT for iocb->ki_flags. FMODE_AIO_NOWAIT is a flag which identifies the file opened is capable of returning -EAGAIN if the AIO call will block. This must be set by supporting filesystems in the ->open() call. Filesystems xfs, btrfs and ext4 would be supported in the following patches. Reviewed-by: Christoph Hellwig <hch@lst.de> Reviewed-by: Jan Kara <jack@suse.cz> Signed-off-by: Goldwyn Rodrigues <rgoldwyn@suse.com> Signed-off-by: Jens Axboe <axboe@kernel.dk> commit 29a5d29ec181ebdc98a26cedbd76ce9870248892 Author: Goldwyn Rodrigues <rgoldwyn@suse.com> Date: Tue Jun 20 07:05:48 2017 -0500 xfs: nowait aio support If IOCB_NOWAIT is set, bail if the i_rwsem is not lockable immediately. IF IOMAP_NOWAIT is set, return EAGAIN in xfs_file_iomap_begin if it needs allocation either due to file extension, writing to a hole, or COW or waiting for other DIOs to finish. Return -EAGAIN if we don't have extent list in memory. Signed-off-by: Goldwyn Rodrigues <rgoldwyn@suse.com> Reviewed-by: Christoph Hellwig <hch@lst.de> Reviewed-by: Darrick J. Wong <darrick.wong@oracle.com> Signed-off-by: Jens Axboe <axboe@kernel.dk> commit 728fbc0e10b7f3ce2ee043b32e3453fd5201c055 Author: Goldwyn Rodrigues <rgoldwyn@suse.com> Date: Tue Jun 20 07:05:47 2017 -0500 ext4: nowait aio support Return EAGAIN if any of the following checks fail for direct I/O: + i_rwsem is lockable + Writing beyond end of file (will trigger allocation) + Blocks are not allocated at the write location Signed-off-by: Goldwyn Rodrigues <rgoldwyn@suse.com> Reviewed-by: Jan Kara <jack@suse.cz> Signed-off-by: Jens Axboe <axboe@kernel.dk> > We will never give user space those kinds of guarantees. We do locking > for various reasons. For example, we'll do the mm lock just when > fetching/storing data from/to user space if there's a page fault. You are conflating "best effort non-blocking operation" with "atomic guarantee". RWF_NOWAIT/IOCB_NOWAIT is the former, not the latter. i.e. RWF_NOWAIT addresses the "every second IO submission blocks" problems that AIO submission suffered from due to filesystem lock contention, not the rare and unusual things like "page fault during get_user_pages in direct IO submission". Maybe one day, but right now those rare cases are not pain points for applications that require nonblock AIO submission via RWF_NOWAIT. > Or - > more obviously - we'll also check for - and sleep on - mandatory locks > in rw_verify_area(). Well, only if you don't use fcntl(O_NONBLOCK) on the file to tell mandatory locking to fail with -EAGAIN instead of sleeping. -Dave. -- Dave Chinner david@fromorbit.com ^ permalink raw reply [flat|nested] 161+ messages in thread
* Re: [PATCH 2/3] mm/filemap: initiate readahead even if IOCB_NOWAIT is set for the I/O 2019-02-01 5:13 ` Dave Chinner @ 2019-02-01 7:05 ` Linus Torvalds 2019-02-01 7:21 ` Linus Torvalds 0 siblings, 1 reply; 161+ messages in thread From: Linus Torvalds @ 2019-02-01 7:05 UTC (permalink / raw) To: Dave Chinner Cc: Michal Hocko, Jiri Kosina, Vlastimil Babka, Andrew Morton, Linux List Kernel Mailing, Linux-MM, Linux API, Peter Zijlstra, Greg KH, Jann Horn, Dominique Martinet, Andy Lutomirski, Kevin Easton, Matthew Wilcox, Cyril Hrubis, Tejun Heo, Kirill A . Shutemov, Daniel Gruss, linux-fsdevel On Thu, Jan 31, 2019 at 9:16 PM Dave Chinner <david@fromorbit.com> wrote: > > You are conflating "best effort non-blocking operation" with > "atomic guarantee". RWF_NOWAIT/IOCB_NOWAIT is the > former, not the latter. Right. That's my *point*, Dave. It's not 'atomic guarantee", and never will be. We are in 100% agreement. That's what I _said_. And part of "best effort" is very much "not a security information leak". I really don't see why you are so argumentative. As I mentioned earlier in the thread, it's actually quite possible that users will actually find that starting read-ahead is a *good* thing, Dave. Even - in fact *particularly* - the user you brought up: samba using RWF_NOWAIT to try to do things synchronously quickly. So Dave, why are you being so negative? Linus ^ permalink raw reply [flat|nested] 161+ messages in thread
* Re: [PATCH 2/3] mm/filemap: initiate readahead even if IOCB_NOWAIT is set for the I/O 2019-02-01 7:05 ` Linus Torvalds @ 2019-02-01 7:21 ` Linus Torvalds 0 siblings, 0 replies; 161+ messages in thread From: Linus Torvalds @ 2019-02-01 7:21 UTC (permalink / raw) To: Dave Chinner Cc: Michal Hocko, Jiri Kosina, Vlastimil Babka, Andrew Morton, Linux List Kernel Mailing, Linux-MM, Linux API, Peter Zijlstra, Greg KH, Jann Horn, Dominique Martinet, Andy Lutomirski, Kevin Easton, Matthew Wilcox, Cyril Hrubis, Tejun Heo, Kirill A . Shutemov, Daniel Gruss, linux-fsdevel On Thu, Jan 31, 2019 at 11:05 PM Linus Torvalds <torvalds@linux-foundation.org> wrote: > > And part of "best effort" is very much "not a security information leak". Side note: it's entirely possible that the preadv2(RWF_NOWAIT) interface is actually already effectively too slow to be effectively used as much of an attack vector. One of the advantages of mincore() for the attack was that you could just get a lot of page status information in one go. With RWF_NOWAIT, you only really get "up to the first non-cached page", so it's already a weaker signal than mincore() gave. System calls aren't horrendously slow (at least not with fixed non-meltdown CPU's), but it might still be a somewhat noticeable inconvenience in an attack that is already probably not all that easy to do on an arbitrary target. So it might not be a huge deal. But I think we should at least try to make things less useful for these kinds of attack vectors. And no, that doesn't mean "stop all theoretical attacks". It means "let's try to make things less convenient as a data leak". That's why things like "oh, you can still see the signal if you can keep the backing device congested" is not something I'd worry about. It's just another (big) inconvenience, and not all that simple to do. At some point, it's simply not worth it as an attack vector any more. Linus ^ permalink raw reply [flat|nested] 161+ messages in thread
* Re: [PATCH 2/3] mm/filemap: initiate readahead even if IOCB_NOWAIT is set for the I/O 2019-01-31 9:56 ` Michal Hocko 2019-01-31 10:15 ` Jiri Kosina @ 2019-02-01 1:44 ` Dave Chinner 2019-02-12 15:48 ` Jiri Kosina 1 sibling, 1 reply; 161+ messages in thread From: Dave Chinner @ 2019-02-01 1:44 UTC (permalink / raw) To: Michal Hocko Cc: Vlastimil Babka, Andrew Morton, Linus Torvalds, linux-kernel, linux-mm, linux-api, Peter Zijlstra, Greg KH, Jann Horn, Jiri Kosina, Dominique Martinet, Andy Lutomirski, Kevin Easton, Matthew Wilcox, Cyril Hrubis, Tejun Heo, Kirill A . Shutemov, Daniel Gruss, Jiri Kosina, linux-fsdevel On Thu, Jan 31, 2019 at 10:56:44AM +0100, Michal Hocko wrote: > [Cc fs-devel] > > On Wed 30-01-19 13:44:19, Vlastimil Babka wrote: > > From: Jiri Kosina <jkosina@suse.cz> > > > > preadv2(RWF_NOWAIT) can be used to open a side-channel to pagecache contents, as > > it reveals metadata about residency of pages in pagecache. > > > > If preadv2(RWF_NOWAIT) returns immediately, it provides a clear "page not > > resident" information, and vice versa. > > > > Close that sidechannel by always initiating readahead on the cache if we > > encounter a cache miss for preadv2(RWF_NOWAIT); with that in place, probing > > the pagecache residency itself will actually populate the cache, making the > > sidechannel useless. > > I guess the current wording doesn't disallow background IO to be > triggered for EAGAIN case. I am not sure whether that breaks clever > applications which try to perform larger IO for those cases though. Actually, it does: RWF_NOWAIT (since Linux 4.14) Do not wait for data which is not immediately available. If this flag is specified, the preadv2() system call will return instantly if it would have to read data from the backing storage or wait for a lock. page_cache_sync_readahead() can block on page allocation, it calls ->readpages() which means there are page locks and filesystem locks in play (e.g. for block mapping), there's potential for blocking on metadata IO (both submission and completion) to read block maps, the data readahead can be submitted for IO so it can get stuck anywhere in the IO path, etc... Basically, it completely subverts the documented behaviour of RWF_NOWAIT. There are applications (like Samba (*)) that are planning to use this to avoid blocking their main processing threads on buffered IO. This change makes RWF_NOWAIT pretty much useless to them - it /was/ the only solution we had for reliably issuing non-blocking IO, with this patch it isn't a viable solution at all. (*) https://github.com/samba-team/samba/commit/6381044c0270a647c20935d22fd23f235d19b328 IOWs, if this change goes through, it needs to be documented as an intentional behavioural bug in the preadv2 manpage so that userspace developers are aware of the new limitations of RWF_NOWAIT and should avoid it like the plague. But worse than that is nobody has bothered to (or ask someone familiar with the code to) do an audit of RWF_NOWAIT usage after I pointed out the behavioural issues. The one person who was engaged and /had done an audit/ got shouted down with so much bullshit they just walked away.... So, I'll invite the incoherent, incandescent O_DIRECT rage flames of Linus to be unleashed again and point out the /other reference/ to IOCB_NOWAIT in mm/filemap.c. That is, in generic_file_read_iter(), in the *generic O_DIRECT read path*: if (iocb->ki_flags & IOCB_DIRECT) { ..... if (iocb->ki_flags & IOCB_NOWAIT) { if (filemap_range_has_page(mapping, iocb->ki_pos, iocb->ki_pos + count - 1)) return -EAGAIN; } else { ..... This page cache probe is about 100 lines of code down from the code that this patch modifies, in it's direct caller. It's not hard to find, I shouldn't have to point it out, nor have to explain how it makes this patch completely irrelevant. > > diff --git a/mm/filemap.c b/mm/filemap.c > > index 9f5e323e883e..7bcdd36e629d 100644 > > --- a/mm/filemap.c > > +++ b/mm/filemap.c > > @@ -2075,8 +2075,6 @@ static ssize_t generic_file_buffered_read(struct kiocb *iocb, > > > > page = find_get_page(mapping, index); > > if (!page) { > > - if (iocb->ki_flags & IOCB_NOWAIT) > > - goto would_block; > > page_cache_sync_readahead(mapping, > > ra, filp, > > index, last_index - index); > > Maybe a stupid question but I am not really familiar with this path but > what exactly does prevent a sync read down page_cache_sync_readahead > path? It's effectively useless as a workaround because you can avoid the readahead IO being issued relatively easily: void page_cache_sync_readahead(struct address_space *mapping, struct file_ra_state *ra, struct file *filp, pgoff_t offset, unsigned long req_size) { /* no read-ahead */ if (!ra->ra_pages) return; if (blk_cgroup_congested()) return; .... IOWs, we just have to issue enough IO to congest the block device (or, even easier, a rate-limited cgroup), and we can still use RWF_NOWAIT to probe the page cache. Or if we can convince ra->ra_pages to be zero (e.g. it's on bdi device with no readahead configured because it's real fast) then it doesn't work there, either. So this a) isn't a robust workaround, b) it breaks documented API semantics and c) isn't the only path to page cache probing via RWF_NOWAIT. It's just a new game of whack-a-mole. Cheers, Dave. -- Dave Chinner david@fromorbit.com ^ permalink raw reply [flat|nested] 161+ messages in thread
* Re: [PATCH 2/3] mm/filemap: initiate readahead even if IOCB_NOWAIT is set for the I/O 2019-02-01 1:44 ` Dave Chinner @ 2019-02-12 15:48 ` Jiri Kosina 0 siblings, 0 replies; 161+ messages in thread From: Jiri Kosina @ 2019-02-12 15:48 UTC (permalink / raw) To: Dave Chinner Cc: Michal Hocko, Vlastimil Babka, Andrew Morton, Linus Torvalds, linux-kernel, linux-mm, linux-api, Peter Zijlstra, Greg KH, Jann Horn, Dominique Martinet, Andy Lutomirski, Kevin Easton, Matthew Wilcox, Cyril Hrubis, Tejun Heo, Kirill A . Shutemov, Daniel Gruss, linux-fsdevel On Fri, 1 Feb 2019, Dave Chinner wrote: > So, I'll invite the incoherent, incandescent O_DIRECT rage flames of > Linus to be unleashed again and point out the /other reference/ to > IOCB_NOWAIT in mm/filemap.c. That is, in generic_file_read_iter(), > in the *generic O_DIRECT read path*: > > if (iocb->ki_flags & IOCB_DIRECT) { > ..... > if (iocb->ki_flags & IOCB_NOWAIT) { > if (filemap_range_has_page(mapping, iocb->ki_pos, > iocb->ki_pos + count - 1)) > return -EAGAIN; > } else { > ..... OK, thanks Dave, this is a good point I've missed in this mail before (probabably as I focused only on the aspect of disagreement what NONBLOCK actually means :) ). I will look into fixing it for next iteration. > It's effectively useless as a workaround because you can avoid the > readahead IO being issued relatively easily: > > void page_cache_sync_readahead(struct address_space *mapping, > struct file_ra_state *ra, struct file *filp, > pgoff_t offset, unsigned long req_size) > { > /* no read-ahead */ > if (!ra->ra_pages) > return; > > if (blk_cgroup_congested()) > return; > .... > > IOWs, we just have to issue enough IO to congest the block device (or, > even easier, a rate-limited cgroup), and we can still use RWF_NOWAIT > to probe the page cache. Or if we can convince ra->ra_pages to be > zero (e.g. it's on bdi device with no readahead configured because > it's real fast) then it doesn't work there, either. It's though questionable whether the noise level here wouldn't be too high already for any sidechannel to work reliably. So I'd suggest to operate under the assumption that it would be too noisy, unless anyone is able to prove otherwise. Thanks, -- Jiri Kosina SUSE Labs ^ permalink raw reply [flat|nested] 161+ messages in thread
* Re: [PATCH 2/3] mm/filemap: initiate readahead even if IOCB_NOWAIT is set for the I/O 2019-01-30 12:44 ` [PATCH 2/3] mm/filemap: initiate readahead even if IOCB_NOWAIT is set for the I/O Vlastimil Babka 2019-01-30 15:04 ` Florian Weimer 2019-01-31 9:56 ` Michal Hocko @ 2019-01-31 12:04 ` Daniel Gruss 2019-01-31 12:06 ` Vlastimil Babka 2019-01-31 12:08 ` Jiri Kosina 2 siblings, 2 replies; 161+ messages in thread From: Daniel Gruss @ 2019-01-31 12:04 UTC (permalink / raw) To: Vlastimil Babka, Andrew Morton, Linus Torvalds Cc: linux-kernel, linux-mm, linux-api, Peter Zijlstra, Greg KH, Jann Horn, Jiri Kosina, Dominique Martinet, Andy Lutomirski, Dave Chinner, Kevin Easton, Matthew Wilcox, Cyril Hrubis, Tejun Heo, Kirill A . Shutemov, Jiri Kosina On 1/30/19 1:44 PM, Vlastimil Babka wrote: > Close that sidechannel by always initiating readahead on the cache if we > encounter a cache miss for preadv2(RWF_NOWAIT); with that in place, probing > the pagecache residency itself will actually populate the cache, making the > sidechannel useless. I fear this does not really close the side channel. You can time the preadv2 function and infer which path it took, so you just bring it down to the same as using mmap and timing accesses. If I understood it correctly, this patch just removes the advantages of preadv2 over mmmap+access for the attacker. Cheers, Daniel ^ permalink raw reply [flat|nested] 161+ messages in thread
* Re: [PATCH 2/3] mm/filemap: initiate readahead even if IOCB_NOWAIT is set for the I/O 2019-01-31 12:04 ` Daniel Gruss @ 2019-01-31 12:06 ` Vlastimil Babka 2019-01-31 12:08 ` Jiri Kosina 1 sibling, 0 replies; 161+ messages in thread From: Vlastimil Babka @ 2019-01-31 12:06 UTC (permalink / raw) To: Daniel Gruss, Andrew Morton, Linus Torvalds Cc: linux-kernel, linux-mm, linux-api, Peter Zijlstra, Greg KH, Jann Horn, Jiri Kosina, Dominique Martinet, Andy Lutomirski, Dave Chinner, Kevin Easton, Matthew Wilcox, Cyril Hrubis, Tejun Heo, Kirill A . Shutemov, Jiri Kosina On 1/31/19 1:04 PM, Daniel Gruss wrote: > On 1/30/19 1:44 PM, Vlastimil Babka wrote: >> Close that sidechannel by always initiating readahead on the cache if we >> encounter a cache miss for preadv2(RWF_NOWAIT); with that in place, probing >> the pagecache residency itself will actually populate the cache, making the >> sidechannel useless. > > I fear this does not really close the side channel. You can time the > preadv2 function and infer which path it took, so you just bring it down > to the same as using mmap and timing accesses. > If I understood it correctly, this patch just removes the advantages of > preadv2 over mmmap+access for the attacker. But isn't that the same with mincore()? We can't simply remove the possibility of mmap+access, but we are closing the simpler methods? Vlastimil > Cheers, > Daniel > ^ permalink raw reply [flat|nested] 161+ messages in thread
* Re: [PATCH 2/3] mm/filemap: initiate readahead even if IOCB_NOWAIT is set for the I/O 2019-01-31 12:04 ` Daniel Gruss 2019-01-31 12:06 ` Vlastimil Babka @ 2019-01-31 12:08 ` Jiri Kosina 2019-01-31 12:57 ` Daniel Gruss 1 sibling, 1 reply; 161+ messages in thread From: Jiri Kosina @ 2019-01-31 12:08 UTC (permalink / raw) To: Daniel Gruss Cc: Vlastimil Babka, Andrew Morton, Linus Torvalds, linux-kernel, linux-mm, linux-api, Peter Zijlstra, Greg KH, Jann Horn, Dominique Martinet, Andy Lutomirski, Dave Chinner, Kevin Easton, Matthew Wilcox, Cyril Hrubis, Tejun Heo, Kirill A . Shutemov On Thu, 31 Jan 2019, Daniel Gruss wrote: > If I understood it correctly, this patch just removes the advantages of > preadv2 over mmmap+access for the attacker. Which is the desired effect. We are not trying to solve the timing aspect, as I don't think there is a reasonable way to do it, is there? -- Jiri Kosina SUSE Labs ^ permalink raw reply [flat|nested] 161+ messages in thread
* Re: [PATCH 2/3] mm/filemap: initiate readahead even if IOCB_NOWAIT is set for the I/O 2019-01-31 12:08 ` Jiri Kosina @ 2019-01-31 12:57 ` Daniel Gruss 0 siblings, 0 replies; 161+ messages in thread From: Daniel Gruss @ 2019-01-31 12:57 UTC (permalink / raw) To: Jiri Kosina Cc: Vlastimil Babka, Andrew Morton, Linus Torvalds, linux-kernel, linux-mm, linux-api, Peter Zijlstra, Greg KH, Jann Horn, Dominique Martinet, Andy Lutomirski, Dave Chinner, Kevin Easton, Matthew Wilcox, Cyril Hrubis, Tejun Heo, Kirill A . Shutemov On 1/31/19 1:08 PM, Jiri Kosina wrote: > On Thu, 31 Jan 2019, Daniel Gruss wrote: > >> If I understood it correctly, this patch just removes the advantages of >> preadv2 over mmmap+access for the attacker. > > Which is the desired effect. We are not trying to solve the timing aspect, > as I don't think there is a reasonable way to do it, is there? There are two building blocks to cache attacks, bringing the cache into a state, and observing a state change, you can mitigate them by breaking either of these building blocks. For most attacks the attacker would be interested in observing *when* a specific victim page is loaded into the page cache rather than observing whether it is in the page cache right now (it could be there for ages if the system was not under memory pressure). So, one could try to prevent interference in the page cache between attacker and victim -> working set algorithms do that to some extent. Simpler idea (with more side effects) would be limiting the maximum share of the page cache per user (or per process, depending on the threat model)... Cheers, Daniel ^ permalink raw reply [flat|nested] 161+ messages in thread
* [PATCH 3/3] mm/mincore: provide mapped status when cached status is not allowed 2019-01-30 12:44 ` [PATCH 0/3] mincore() and IOCB_NOWAIT adjustments Vlastimil Babka 2019-01-30 12:44 ` [PATCH 1/3] mm/mincore: make mincore() more conservative Vlastimil Babka 2019-01-30 12:44 ` [PATCH 2/3] mm/filemap: initiate readahead even if IOCB_NOWAIT is set for the I/O Vlastimil Babka @ 2019-01-30 12:44 ` Vlastimil Babka 2019-01-31 10:09 ` Michal Hocko 2019-03-06 12:11 ` [PATCH 0/3] mincore() and IOCB_NOWAIT adjustments Jiri Kosina 2019-03-12 14:17 ` [PATCH v2 0/2] prevent mincore() page cache leaks Vlastimil Babka 4 siblings, 1 reply; 161+ messages in thread From: Vlastimil Babka @ 2019-01-30 12:44 UTC (permalink / raw) To: Andrew Morton, Linus Torvalds Cc: linux-kernel, linux-mm, linux-api, Peter Zijlstra, Greg KH, Jann Horn, Vlastimil Babka, Jiri Kosina, Dominique Martinet, Andy Lutomirski, Dave Chinner, Kevin Easton, Matthew Wilcox, Cyril Hrubis, Tejun Heo, Kirill A . Shutemov, Daniel Gruss, Jiri Kosina After "mm/mincore: make mincore() more conservative" we sometimes restrict the information about page cache residency, which we have to do without breaking existing userspace, if possible. We thus fake the resulting values as 1, which should be safer than faking them as 0, as there might theoretically exist code that would try to fault in the page(s) until mincore() returns 1. Faking 1 however means that such code would not fault in a page even if it was not in page cache, with unwanted performance implications. We can improve the situation by revisting the approach of 574823bfab82 ("Change mincore() to count "mapped" pages rather than "cached" pages") but only applying it to cases where page cache residency check is restricted. Thus mincore() will return 0 for an unmapped page (which may or may not be resident in a pagecache), and 1 after the process faults it in. One potential downside is that mincore() will be again able to recognize when a previously mapped page was reclaimed. While that might be useful for some attack scenarios, it's not as crucial as recognizing that somebody else faulted the page in, and there are also other ways to recognize reclaimed pages anyway. Cc: Jiri Kosina <jikos@kernel.org> Cc: Dominique Martinet <asmadeus@codewreck.org> Cc: Andy Lutomirski <luto@amacapital.net> Cc: Dave Chinner <david@fromorbit.com> Cc: Kevin Easton <kevin@guarana.org> Cc: Matthew Wilcox <willy@infradead.org> Cc: Cyril Hrubis <chrubis@suse.cz> Cc: Tejun Heo <tj@kernel.org> Cc: Kirill A. Shutemov <kirill@shutemov.name> Cc: Daniel Gruss <daniel@gruss.cc> Signed-off-by: Vlastimil Babka <vbabka@suse.cz> --- mm/mincore.c | 49 +++++++++++++++++++++++++++++++++---------------- 1 file changed, 33 insertions(+), 16 deletions(-) diff --git a/mm/mincore.c b/mm/mincore.c index 747a4907a3ac..d6784a803ae7 100644 --- a/mm/mincore.c +++ b/mm/mincore.c @@ -21,12 +21,18 @@ #include <linux/uaccess.h> #include <asm/pgtable.h> +struct mincore_walk_private { + unsigned char *vec; + bool can_check_pagecache; +}; + static int mincore_hugetlb(pte_t *pte, unsigned long hmask, unsigned long addr, unsigned long end, struct mm_walk *walk) { #ifdef CONFIG_HUGETLB_PAGE unsigned char present; - unsigned char *vec = walk->private; + struct mincore_walk_private *walk_private = walk->private; + unsigned char *vec = walk_private->vec; /* * Hugepages under user process are always in RAM and never @@ -35,7 +41,7 @@ static int mincore_hugetlb(pte_t *pte, unsigned long hmask, unsigned long addr, present = pte && !huge_pte_none(huge_ptep_get(pte)); for (; addr != end; vec++, addr += PAGE_SIZE) *vec = present; - walk->private = vec; + walk_private->vec = vec; #else BUG(); #endif @@ -85,7 +91,8 @@ static unsigned char mincore_page(struct address_space *mapping, pgoff_t pgoff) } static int __mincore_unmapped_range(unsigned long addr, unsigned long end, - struct vm_area_struct *vma, unsigned char *vec) + struct vm_area_struct *vma, unsigned char *vec, + bool can_check_pagecache) { unsigned long nr = (end - addr) >> PAGE_SHIFT; int i; @@ -95,7 +102,9 @@ static int __mincore_unmapped_range(unsigned long addr, unsigned long end, pgoff = linear_page_index(vma, addr); for (i = 0; i < nr; i++, pgoff++) - vec[i] = mincore_page(vma->vm_file->f_mapping, pgoff); + vec[i] = can_check_pagecache ? + mincore_page(vma->vm_file->f_mapping, pgoff) + : 0; } else { for (i = 0; i < nr; i++) vec[i] = 0; @@ -106,8 +115,11 @@ static int __mincore_unmapped_range(unsigned long addr, unsigned long end, static int mincore_unmapped_range(unsigned long addr, unsigned long end, struct mm_walk *walk) { - walk->private += __mincore_unmapped_range(addr, end, - walk->vma, walk->private); + struct mincore_walk_private *walk_private = walk->private; + unsigned char *vec = walk_private->vec; + + walk_private->vec += __mincore_unmapped_range(addr, end, walk->vma, + vec, walk_private->can_check_pagecache); return 0; } @@ -117,7 +129,8 @@ static int mincore_pte_range(pmd_t *pmd, unsigned long addr, unsigned long end, spinlock_t *ptl; struct vm_area_struct *vma = walk->vma; pte_t *ptep; - unsigned char *vec = walk->private; + struct mincore_walk_private *walk_private = walk->private; + unsigned char *vec = walk_private->vec; int nr = (end - addr) >> PAGE_SHIFT; ptl = pmd_trans_huge_lock(pmd, vma); @@ -128,7 +141,8 @@ static int mincore_pte_range(pmd_t *pmd, unsigned long addr, unsigned long end, } if (pmd_trans_unstable(pmd)) { - __mincore_unmapped_range(addr, end, vma, vec); + __mincore_unmapped_range(addr, end, vma, vec, + walk_private->can_check_pagecache); goto out; } @@ -138,7 +152,7 @@ static int mincore_pte_range(pmd_t *pmd, unsigned long addr, unsigned long end, if (pte_none(pte)) __mincore_unmapped_range(addr, addr + PAGE_SIZE, - vma, vec); + vma, vec, walk_private->can_check_pagecache); else if (pte_present(pte)) *vec = 1; else { /* pte is a swap entry */ @@ -152,8 +166,12 @@ static int mincore_pte_range(pmd_t *pmd, unsigned long addr, unsigned long end, *vec = 1; } else { #ifdef CONFIG_SWAP - *vec = mincore_page(swap_address_space(entry), + if (walk_private->can_check_pagecache) + *vec = mincore_page( + swap_address_space(entry), swp_offset(entry)); + else + *vec = 0; #else WARN_ON(1); *vec = 1; @@ -187,22 +205,21 @@ static long do_mincore(unsigned long addr, unsigned long pages, unsigned char *v struct vm_area_struct *vma; unsigned long end; int err; + struct mincore_walk_private walk_private = { + .vec = vec + }; struct mm_walk mincore_walk = { .pmd_entry = mincore_pte_range, .pte_hole = mincore_unmapped_range, .hugetlb_entry = mincore_hugetlb, - .private = vec, + .private = &walk_private }; vma = find_vma(current->mm, addr); if (!vma || addr < vma->vm_start) return -ENOMEM; end = min(vma->vm_end, addr + (pages << PAGE_SHIFT)); - if (!can_do_mincore(vma)) { - unsigned long pages = (end - addr) >> PAGE_SHIFT; - memset(vec, 1, pages); - return pages; - } + walk_private.can_check_pagecache = can_do_mincore(vma); mincore_walk.mm = vma->vm_mm; err = walk_page_range(addr, end, &mincore_walk); if (err < 0) -- 2.20.1 ^ permalink raw reply related [flat|nested] 161+ messages in thread
* Re: [PATCH 3/3] mm/mincore: provide mapped status when cached status is not allowed 2019-01-30 12:44 ` [PATCH 3/3] mm/mincore: provide mapped status when cached status is not allowed Vlastimil Babka @ 2019-01-31 10:09 ` Michal Hocko 2019-02-01 9:04 ` Vlastimil Babka 0 siblings, 1 reply; 161+ messages in thread From: Michal Hocko @ 2019-01-31 10:09 UTC (permalink / raw) To: Vlastimil Babka Cc: Andrew Morton, Linus Torvalds, linux-kernel, linux-mm, linux-api, Peter Zijlstra, Greg KH, Jann Horn, Jiri Kosina, Dominique Martinet, Andy Lutomirski, Dave Chinner, Kevin Easton, Matthew Wilcox, Cyril Hrubis, Tejun Heo, Kirill A . Shutemov, Daniel Gruss, Jiri Kosina On Wed 30-01-19 13:44:20, Vlastimil Babka wrote: > After "mm/mincore: make mincore() more conservative" we sometimes restrict the > information about page cache residency, which we have to do without breaking > existing userspace, if possible. We thus fake the resulting values as 1, which > should be safer than faking them as 0, as there might theoretically exist code > that would try to fault in the page(s) until mincore() returns 1. > > Faking 1 however means that such code would not fault in a page even if it was > not in page cache, with unwanted performance implications. We can improve the > situation by revisting the approach of 574823bfab82 ("Change mincore() to count > "mapped" pages rather than "cached" pages") but only applying it to cases where > page cache residency check is restricted. Thus mincore() will return 0 for an > unmapped page (which may or may not be resident in a pagecache), and 1 after > the process faults it in. > > One potential downside is that mincore() will be again able to recognize when a > previously mapped page was reclaimed. While that might be useful for some > attack scenarios, it's not as crucial as recognizing that somebody else faulted > the page in, and there are also other ways to recognize reclaimed pages anyway. Is this really worth it? Do we know about any specific usecase that would benefit from this change? TBH I would rather wait for the report than add a hard to evaluate side channel. > Cc: Jiri Kosina <jikos@kernel.org> > Cc: Dominique Martinet <asmadeus@codewreck.org> > Cc: Andy Lutomirski <luto@amacapital.net> > Cc: Dave Chinner <david@fromorbit.com> > Cc: Kevin Easton <kevin@guarana.org> > Cc: Matthew Wilcox <willy@infradead.org> > Cc: Cyril Hrubis <chrubis@suse.cz> > Cc: Tejun Heo <tj@kernel.org> > Cc: Kirill A. Shutemov <kirill@shutemov.name> > Cc: Daniel Gruss <daniel@gruss.cc> > Signed-off-by: Vlastimil Babka <vbabka@suse.cz> > --- > mm/mincore.c | 49 +++++++++++++++++++++++++++++++++---------------- > 1 file changed, 33 insertions(+), 16 deletions(-) > > diff --git a/mm/mincore.c b/mm/mincore.c > index 747a4907a3ac..d6784a803ae7 100644 > --- a/mm/mincore.c > +++ b/mm/mincore.c > @@ -21,12 +21,18 @@ > #include <linux/uaccess.h> > #include <asm/pgtable.h> > > +struct mincore_walk_private { > + unsigned char *vec; > + bool can_check_pagecache; > +}; > + > static int mincore_hugetlb(pte_t *pte, unsigned long hmask, unsigned long addr, > unsigned long end, struct mm_walk *walk) > { > #ifdef CONFIG_HUGETLB_PAGE > unsigned char present; > - unsigned char *vec = walk->private; > + struct mincore_walk_private *walk_private = walk->private; > + unsigned char *vec = walk_private->vec; > > /* > * Hugepages under user process are always in RAM and never > @@ -35,7 +41,7 @@ static int mincore_hugetlb(pte_t *pte, unsigned long hmask, unsigned long addr, > present = pte && !huge_pte_none(huge_ptep_get(pte)); > for (; addr != end; vec++, addr += PAGE_SIZE) > *vec = present; > - walk->private = vec; > + walk_private->vec = vec; > #else > BUG(); > #endif > @@ -85,7 +91,8 @@ static unsigned char mincore_page(struct address_space *mapping, pgoff_t pgoff) > } > > static int __mincore_unmapped_range(unsigned long addr, unsigned long end, > - struct vm_area_struct *vma, unsigned char *vec) > + struct vm_area_struct *vma, unsigned char *vec, > + bool can_check_pagecache) > { > unsigned long nr = (end - addr) >> PAGE_SHIFT; > int i; > @@ -95,7 +102,9 @@ static int __mincore_unmapped_range(unsigned long addr, unsigned long end, > > pgoff = linear_page_index(vma, addr); > for (i = 0; i < nr; i++, pgoff++) > - vec[i] = mincore_page(vma->vm_file->f_mapping, pgoff); > + vec[i] = can_check_pagecache ? > + mincore_page(vma->vm_file->f_mapping, pgoff) > + : 0; > } else { > for (i = 0; i < nr; i++) > vec[i] = 0; > @@ -106,8 +115,11 @@ static int __mincore_unmapped_range(unsigned long addr, unsigned long end, > static int mincore_unmapped_range(unsigned long addr, unsigned long end, > struct mm_walk *walk) > { > - walk->private += __mincore_unmapped_range(addr, end, > - walk->vma, walk->private); > + struct mincore_walk_private *walk_private = walk->private; > + unsigned char *vec = walk_private->vec; > + > + walk_private->vec += __mincore_unmapped_range(addr, end, walk->vma, > + vec, walk_private->can_check_pagecache); > return 0; > } > > @@ -117,7 +129,8 @@ static int mincore_pte_range(pmd_t *pmd, unsigned long addr, unsigned long end, > spinlock_t *ptl; > struct vm_area_struct *vma = walk->vma; > pte_t *ptep; > - unsigned char *vec = walk->private; > + struct mincore_walk_private *walk_private = walk->private; > + unsigned char *vec = walk_private->vec; > int nr = (end - addr) >> PAGE_SHIFT; > > ptl = pmd_trans_huge_lock(pmd, vma); > @@ -128,7 +141,8 @@ static int mincore_pte_range(pmd_t *pmd, unsigned long addr, unsigned long end, > } > > if (pmd_trans_unstable(pmd)) { > - __mincore_unmapped_range(addr, end, vma, vec); > + __mincore_unmapped_range(addr, end, vma, vec, > + walk_private->can_check_pagecache); > goto out; > } > > @@ -138,7 +152,7 @@ static int mincore_pte_range(pmd_t *pmd, unsigned long addr, unsigned long end, > > if (pte_none(pte)) > __mincore_unmapped_range(addr, addr + PAGE_SIZE, > - vma, vec); > + vma, vec, walk_private->can_check_pagecache); > else if (pte_present(pte)) > *vec = 1; > else { /* pte is a swap entry */ > @@ -152,8 +166,12 @@ static int mincore_pte_range(pmd_t *pmd, unsigned long addr, unsigned long end, > *vec = 1; > } else { > #ifdef CONFIG_SWAP > - *vec = mincore_page(swap_address_space(entry), > + if (walk_private->can_check_pagecache) > + *vec = mincore_page( > + swap_address_space(entry), > swp_offset(entry)); > + else > + *vec = 0; > #else > WARN_ON(1); > *vec = 1; > @@ -187,22 +205,21 @@ static long do_mincore(unsigned long addr, unsigned long pages, unsigned char *v > struct vm_area_struct *vma; > unsigned long end; > int err; > + struct mincore_walk_private walk_private = { > + .vec = vec > + }; > struct mm_walk mincore_walk = { > .pmd_entry = mincore_pte_range, > .pte_hole = mincore_unmapped_range, > .hugetlb_entry = mincore_hugetlb, > - .private = vec, > + .private = &walk_private > }; > > vma = find_vma(current->mm, addr); > if (!vma || addr < vma->vm_start) > return -ENOMEM; > end = min(vma->vm_end, addr + (pages << PAGE_SHIFT)); > - if (!can_do_mincore(vma)) { > - unsigned long pages = (end - addr) >> PAGE_SHIFT; > - memset(vec, 1, pages); > - return pages; > - } > + walk_private.can_check_pagecache = can_do_mincore(vma); > mincore_walk.mm = vma->vm_mm; > err = walk_page_range(addr, end, &mincore_walk); > if (err < 0) > -- > 2.20.1 -- Michal Hocko SUSE Labs ^ permalink raw reply [flat|nested] 161+ messages in thread
* Re: [PATCH 3/3] mm/mincore: provide mapped status when cached status is not allowed 2019-01-31 10:09 ` Michal Hocko @ 2019-02-01 9:04 ` Vlastimil Babka 2019-02-01 9:11 ` Michal Hocko 2019-02-12 3:44 ` Jiri Kosina 0 siblings, 2 replies; 161+ messages in thread From: Vlastimil Babka @ 2019-02-01 9:04 UTC (permalink / raw) To: Michal Hocko Cc: Andrew Morton, Linus Torvalds, linux-kernel, linux-mm, linux-api, Peter Zijlstra, Greg KH, Jann Horn, Jiri Kosina, Dominique Martinet, Andy Lutomirski, Dave Chinner, Kevin Easton, Matthew Wilcox, Cyril Hrubis, Tejun Heo, Kirill A . Shutemov, Daniel Gruss, Jiri Kosina, Josh Snyder On 1/31/19 11:09 AM, Michal Hocko wrote: > On Wed 30-01-19 13:44:20, Vlastimil Babka wrote: >> After "mm/mincore: make mincore() more conservative" we sometimes restrict the >> information about page cache residency, which we have to do without breaking >> existing userspace, if possible. We thus fake the resulting values as 1, which >> should be safer than faking them as 0, as there might theoretically exist code >> that would try to fault in the page(s) until mincore() returns 1. >> >> Faking 1 however means that such code would not fault in a page even if it was >> not in page cache, with unwanted performance implications. We can improve the >> situation by revisting the approach of 574823bfab82 ("Change mincore() to count >> "mapped" pages rather than "cached" pages") but only applying it to cases where >> page cache residency check is restricted. Thus mincore() will return 0 for an >> unmapped page (which may or may not be resident in a pagecache), and 1 after >> the process faults it in. >> >> One potential downside is that mincore() will be again able to recognize when a >> previously mapped page was reclaimed. While that might be useful for some >> attack scenarios, it's not as crucial as recognizing that somebody else faulted >> the page in, and there are also other ways to recognize reclaimed pages anyway. > > Is this really worth it? Do we know about any specific usecase that > would benefit from this change? TBH I would rather wait for the report > than add a hard to evaluate side channel. Well it's not that complicated IMHO. Linus said it's worth trying, so let's see how he likes the result. The side channel exists anyway as long as process can e.g. check if its rss shrinked, and I doubt we are going to remove that possibility. Also CC Josh Snyder since I forgot originally, and keeping rest of mail for reference. >> Cc: Jiri Kosina <jikos@kernel.org> >> Cc: Dominique Martinet <asmadeus@codewreck.org> >> Cc: Andy Lutomirski <luto@amacapital.net> >> Cc: Dave Chinner <david@fromorbit.com> >> Cc: Kevin Easton <kevin@guarana.org> >> Cc: Matthew Wilcox <willy@infradead.org> >> Cc: Cyril Hrubis <chrubis@suse.cz> >> Cc: Tejun Heo <tj@kernel.org> >> Cc: Kirill A. Shutemov <kirill@shutemov.name> >> Cc: Daniel Gruss <daniel@gruss.cc> >> Signed-off-by: Vlastimil Babka <vbabka@suse.cz> >> --- >> mm/mincore.c | 49 +++++++++++++++++++++++++++++++++---------------- >> 1 file changed, 33 insertions(+), 16 deletions(-) >> >> diff --git a/mm/mincore.c b/mm/mincore.c >> index 747a4907a3ac..d6784a803ae7 100644 >> --- a/mm/mincore.c >> +++ b/mm/mincore.c >> @@ -21,12 +21,18 @@ >> #include <linux/uaccess.h> >> #include <asm/pgtable.h> >> >> +struct mincore_walk_private { >> + unsigned char *vec; >> + bool can_check_pagecache; >> +}; >> + >> static int mincore_hugetlb(pte_t *pte, unsigned long hmask, unsigned long addr, >> unsigned long end, struct mm_walk *walk) >> { >> #ifdef CONFIG_HUGETLB_PAGE >> unsigned char present; >> - unsigned char *vec = walk->private; >> + struct mincore_walk_private *walk_private = walk->private; >> + unsigned char *vec = walk_private->vec; >> >> /* >> * Hugepages under user process are always in RAM and never >> @@ -35,7 +41,7 @@ static int mincore_hugetlb(pte_t *pte, unsigned long hmask, unsigned long addr, >> present = pte && !huge_pte_none(huge_ptep_get(pte)); >> for (; addr != end; vec++, addr += PAGE_SIZE) >> *vec = present; >> - walk->private = vec; >> + walk_private->vec = vec; >> #else >> BUG(); >> #endif >> @@ -85,7 +91,8 @@ static unsigned char mincore_page(struct address_space *mapping, pgoff_t pgoff) >> } >> >> static int __mincore_unmapped_range(unsigned long addr, unsigned long end, >> - struct vm_area_struct *vma, unsigned char *vec) >> + struct vm_area_struct *vma, unsigned char *vec, >> + bool can_check_pagecache) >> { >> unsigned long nr = (end - addr) >> PAGE_SHIFT; >> int i; >> @@ -95,7 +102,9 @@ static int __mincore_unmapped_range(unsigned long addr, unsigned long end, >> >> pgoff = linear_page_index(vma, addr); >> for (i = 0; i < nr; i++, pgoff++) >> - vec[i] = mincore_page(vma->vm_file->f_mapping, pgoff); >> + vec[i] = can_check_pagecache ? >> + mincore_page(vma->vm_file->f_mapping, pgoff) >> + : 0; >> } else { >> for (i = 0; i < nr; i++) >> vec[i] = 0; >> @@ -106,8 +115,11 @@ static int __mincore_unmapped_range(unsigned long addr, unsigned long end, >> static int mincore_unmapped_range(unsigned long addr, unsigned long end, >> struct mm_walk *walk) >> { >> - walk->private += __mincore_unmapped_range(addr, end, >> - walk->vma, walk->private); >> + struct mincore_walk_private *walk_private = walk->private; >> + unsigned char *vec = walk_private->vec; >> + >> + walk_private->vec += __mincore_unmapped_range(addr, end, walk->vma, >> + vec, walk_private->can_check_pagecache); >> return 0; >> } >> >> @@ -117,7 +129,8 @@ static int mincore_pte_range(pmd_t *pmd, unsigned long addr, unsigned long end, >> spinlock_t *ptl; >> struct vm_area_struct *vma = walk->vma; >> pte_t *ptep; >> - unsigned char *vec = walk->private; >> + struct mincore_walk_private *walk_private = walk->private; >> + unsigned char *vec = walk_private->vec; >> int nr = (end - addr) >> PAGE_SHIFT; >> >> ptl = pmd_trans_huge_lock(pmd, vma); >> @@ -128,7 +141,8 @@ static int mincore_pte_range(pmd_t *pmd, unsigned long addr, unsigned long end, >> } >> >> if (pmd_trans_unstable(pmd)) { >> - __mincore_unmapped_range(addr, end, vma, vec); >> + __mincore_unmapped_range(addr, end, vma, vec, >> + walk_private->can_check_pagecache); >> goto out; >> } >> >> @@ -138,7 +152,7 @@ static int mincore_pte_range(pmd_t *pmd, unsigned long addr, unsigned long end, >> >> if (pte_none(pte)) >> __mincore_unmapped_range(addr, addr + PAGE_SIZE, >> - vma, vec); >> + vma, vec, walk_private->can_check_pagecache); >> else if (pte_present(pte)) >> *vec = 1; >> else { /* pte is a swap entry */ >> @@ -152,8 +166,12 @@ static int mincore_pte_range(pmd_t *pmd, unsigned long addr, unsigned long end, >> *vec = 1; >> } else { >> #ifdef CONFIG_SWAP >> - *vec = mincore_page(swap_address_space(entry), >> + if (walk_private->can_check_pagecache) >> + *vec = mincore_page( >> + swap_address_space(entry), >> swp_offset(entry)); >> + else >> + *vec = 0; >> #else >> WARN_ON(1); >> *vec = 1; >> @@ -187,22 +205,21 @@ static long do_mincore(unsigned long addr, unsigned long pages, unsigned char *v >> struct vm_area_struct *vma; >> unsigned long end; >> int err; >> + struct mincore_walk_private walk_private = { >> + .vec = vec >> + }; >> struct mm_walk mincore_walk = { >> .pmd_entry = mincore_pte_range, >> .pte_hole = mincore_unmapped_range, >> .hugetlb_entry = mincore_hugetlb, >> - .private = vec, >> + .private = &walk_private >> }; >> >> vma = find_vma(current->mm, addr); >> if (!vma || addr < vma->vm_start) >> return -ENOMEM; >> end = min(vma->vm_end, addr + (pages << PAGE_SHIFT)); >> - if (!can_do_mincore(vma)) { >> - unsigned long pages = (end - addr) >> PAGE_SHIFT; >> - memset(vec, 1, pages); >> - return pages; >> - } >> + walk_private.can_check_pagecache = can_do_mincore(vma); >> mincore_walk.mm = vma->vm_mm; >> err = walk_page_range(addr, end, &mincore_walk); >> if (err < 0) >> -- >> 2.20.1 > ^ permalink raw reply [flat|nested] 161+ messages in thread
* Re: [PATCH 3/3] mm/mincore: provide mapped status when cached status is not allowed 2019-02-01 9:04 ` Vlastimil Babka @ 2019-02-01 9:11 ` Michal Hocko 2019-02-01 9:27 ` Vlastimil Babka 2019-02-12 3:44 ` Jiri Kosina 1 sibling, 1 reply; 161+ messages in thread From: Michal Hocko @ 2019-02-01 9:11 UTC (permalink / raw) To: Vlastimil Babka Cc: Andrew Morton, Linus Torvalds, linux-kernel, linux-mm, linux-api, Peter Zijlstra, Greg KH, Jann Horn, Jiri Kosina, Dominique Martinet, Andy Lutomirski, Dave Chinner, Kevin Easton, Matthew Wilcox, Cyril Hrubis, Tejun Heo, Kirill A . Shutemov, Daniel Gruss, Jiri Kosina, Josh Snyder On Fri 01-02-19 10:04:23, Vlastimil Babka wrote: > The side channel exists anyway as long as process can e.g. check if > its rss shrinked, and I doubt we are going to remove that possibility. Well, but rss update will not tell you that the page has been faulted in which is the most interesting part. You shouldn't be able to sniff on /proc/$vicimt/smaps as an attacker. -- Michal Hocko SUSE Labs ^ permalink raw reply [flat|nested] 161+ messages in thread
* Re: [PATCH 3/3] mm/mincore: provide mapped status when cached status is not allowed 2019-02-01 9:11 ` Michal Hocko @ 2019-02-01 9:27 ` Vlastimil Babka 2019-02-06 20:14 ` Jiri Kosina 0 siblings, 1 reply; 161+ messages in thread From: Vlastimil Babka @ 2019-02-01 9:27 UTC (permalink / raw) To: Michal Hocko Cc: Andrew Morton, Linus Torvalds, linux-kernel, linux-mm, linux-api, Peter Zijlstra, Greg KH, Jann Horn, Jiri Kosina, Dominique Martinet, Andy Lutomirski, Dave Chinner, Kevin Easton, Matthew Wilcox, Cyril Hrubis, Tejun Heo, Kirill A . Shutemov, Daniel Gruss, Jiri Kosina, Josh Snyder On 2/1/19 10:11 AM, Michal Hocko wrote: > On Fri 01-02-19 10:04:23, Vlastimil Babka wrote: >> The side channel exists anyway as long as process can e.g. check if >> its rss shrinked, and I doubt we are going to remove that possibility. > > Well, but rss update will not tell you that the page has been faulted in > which is the most interesting part. Sure, but the patch doesn't add back that capability neither. It allows to recognize page being reclaimed, and I argue you can infer that from rss change as well. That change is mentioned in the last paragraph in changelog, and I thought "add a hard to evaluate side channel" in your reply referred to that. It doesn't add back the "original" side channel to detect somebody else accessed a page. > You shouldn't be able to sniff on > /proc/$vicimt/smaps as an attacker. ^ permalink raw reply [flat|nested] 161+ messages in thread
* Re: [PATCH 3/3] mm/mincore: provide mapped status when cached status is not allowed 2019-02-01 9:27 ` Vlastimil Babka @ 2019-02-06 20:14 ` Jiri Kosina 0 siblings, 0 replies; 161+ messages in thread From: Jiri Kosina @ 2019-02-06 20:14 UTC (permalink / raw) To: Vlastimil Babka, Linus Torvalds Cc: Michal Hocko, Andrew Morton, linux-kernel, linux-mm, linux-api, Peter Zijlstra, Greg KH, Jann Horn, Dominique Martinet, Andy Lutomirski, Dave Chinner, Kevin Easton, Matthew Wilcox, Cyril Hrubis, Tejun Heo, Kirill A . Shutemov, Daniel Gruss, Josh Snyder On Fri, 1 Feb 2019, Vlastimil Babka wrote: > > Well, but rss update will not tell you that the page has been faulted in > > which is the most interesting part. > > Sure, but the patch doesn't add back that capability neither. It allows > to recognize page being reclaimed, and I argue you can infer that from > rss change as well. That change is mentioned in the last paragraph in > changelog, and I thought "add a hard to evaluate side channel" in your > reply referred to that. It doesn't add back the "original" side channel > to detect somebody else accessed a page. On Fri, 1 Feb 2019, Vlastimil Babka wrote: > > Is this really worth it? Do we know about any specific usecase that > > would benefit from this change? TBH I would rather wait for the report > > than add a hard to evaluate side channel. > > Well it's not that complicated IMHO. Linus said it's worth trying, so > let's see how he likes the result. The side channel exists anyway as > long as process can e.g. check if its rss shrinked, and I doubt we are > going to remove that possibility. Linus, do you have any opinion here? I have a hunch that mm maintainers are keeping this on a backburner because there might still open question(s) in the air. Thanks, -- Jiri Kosina SUSE Labs ^ permalink raw reply [flat|nested] 161+ messages in thread
* Re: [PATCH 3/3] mm/mincore: provide mapped status when cached status is not allowed 2019-02-01 9:04 ` Vlastimil Babka 2019-02-01 9:11 ` Michal Hocko @ 2019-02-12 3:44 ` Jiri Kosina 2019-02-12 6:36 ` Michal Hocko 1 sibling, 1 reply; 161+ messages in thread From: Jiri Kosina @ 2019-02-12 3:44 UTC (permalink / raw) To: Vlastimil Babka Cc: Michal Hocko, Andrew Morton, Linus Torvalds, linux-kernel, linux-mm, linux-api, Peter Zijlstra, Greg KH, Jann Horn, Dominique Martinet, Andy Lutomirski, Dave Chinner, Kevin Easton, Matthew Wilcox, Cyril Hrubis, Tejun Heo, Kirill A . Shutemov, Daniel Gruss, Josh Snyder On Fri, 1 Feb 2019, Vlastimil Babka wrote: > >> After "mm/mincore: make mincore() more conservative" we sometimes restrict the > >> information about page cache residency, which we have to do without breaking > >> existing userspace, if possible. We thus fake the resulting values as 1, which > >> should be safer than faking them as 0, as there might theoretically exist code > >> that would try to fault in the page(s) until mincore() returns 1. > >> > >> Faking 1 however means that such code would not fault in a page even if it was > >> not in page cache, with unwanted performance implications. We can improve the > >> situation by revisting the approach of 574823bfab82 ("Change mincore() to count > >> "mapped" pages rather than "cached" pages") but only applying it to cases where > >> page cache residency check is restricted. Thus mincore() will return 0 for an > >> unmapped page (which may or may not be resident in a pagecache), and 1 after > >> the process faults it in. > >> > >> One potential downside is that mincore() will be again able to recognize when a > >> previously mapped page was reclaimed. While that might be useful for some > >> attack scenarios, it's not as crucial as recognizing that somebody else faulted > >> the page in, and there are also other ways to recognize reclaimed pages anyway. > > > > Is this really worth it? Do we know about any specific usecase that > > would benefit from this change? TBH I would rather wait for the report > > than add a hard to evaluate side channel. > > Well it's not that complicated IMHO. Linus said it's worth trying, so > let's see how he likes the result. The side channel exists anyway as > long as process can e.g. check if its rss shrinked, and I doubt we are > going to remove that possibility. So, where do we go from here? Either Linus and Andrew like the mincore() return value tweak, or this could be further discussed (*). But in either of the cases, I think patches 1 and 2 should be at least queued for 5.1. (*) I'd personally include it as well, as I don't see how it would break anything, it's pretty straightforward, and brings back some sanity to mincore() return value. Thanks, -- Jiri Kosina SUSE Labs ^ permalink raw reply [flat|nested] 161+ messages in thread
* Re: [PATCH 3/3] mm/mincore: provide mapped status when cached status is not allowed 2019-02-12 3:44 ` Jiri Kosina @ 2019-02-12 6:36 ` Michal Hocko 2019-02-12 13:09 ` Jiri Kosina 0 siblings, 1 reply; 161+ messages in thread From: Michal Hocko @ 2019-02-12 6:36 UTC (permalink / raw) To: Jiri Kosina Cc: Vlastimil Babka, Andrew Morton, Linus Torvalds, linux-kernel, linux-mm, linux-api, Peter Zijlstra, Greg KH, Jann Horn, Dominique Martinet, Andy Lutomirski, Dave Chinner, Kevin Easton, Matthew Wilcox, Cyril Hrubis, Tejun Heo, Kirill A . Shutemov, Daniel Gruss, Josh Snyder On Tue 12-02-19 04:44:30, Jiri Kosina wrote: > On Fri, 1 Feb 2019, Vlastimil Babka wrote: > > > >> After "mm/mincore: make mincore() more conservative" we sometimes restrict the > > >> information about page cache residency, which we have to do without breaking > > >> existing userspace, if possible. We thus fake the resulting values as 1, which > > >> should be safer than faking them as 0, as there might theoretically exist code > > >> that would try to fault in the page(s) until mincore() returns 1. > > >> > > >> Faking 1 however means that such code would not fault in a page even if it was > > >> not in page cache, with unwanted performance implications. We can improve the > > >> situation by revisting the approach of 574823bfab82 ("Change mincore() to count > > >> "mapped" pages rather than "cached" pages") but only applying it to cases where > > >> page cache residency check is restricted. Thus mincore() will return 0 for an > > >> unmapped page (which may or may not be resident in a pagecache), and 1 after > > >> the process faults it in. > > >> > > >> One potential downside is that mincore() will be again able to recognize when a > > >> previously mapped page was reclaimed. While that might be useful for some > > >> attack scenarios, it's not as crucial as recognizing that somebody else faulted > > >> the page in, and there are also other ways to recognize reclaimed pages anyway. > > > > > > Is this really worth it? Do we know about any specific usecase that > > > would benefit from this change? TBH I would rather wait for the report > > > than add a hard to evaluate side channel. > > > > Well it's not that complicated IMHO. Linus said it's worth trying, so > > let's see how he likes the result. The side channel exists anyway as > > long as process can e.g. check if its rss shrinked, and I doubt we are > > going to remove that possibility. > > So, where do we go from here? > > Either Linus and Andrew like the mincore() return value tweak, or this > could be further discussed (*). But in either of the cases, I think > patches 1 and 2 should be at least queued for 5.1. I would go with patch 1 for 5.1. Patches 2 still sounds controversial or incomplete to me. And patch 3, well I will leave the decision to Andrew/Linus. > (*) I'd personally include it as well, as I don't see how it would break > anything, it's pretty straightforward, and brings back some sanity to > mincore() return value. -- Michal Hocko SUSE Labs ^ permalink raw reply [flat|nested] 161+ messages in thread
* Re: [PATCH 3/3] mm/mincore: provide mapped status when cached status is not allowed 2019-02-12 6:36 ` Michal Hocko @ 2019-02-12 13:09 ` Jiri Kosina 2019-02-12 14:01 ` Michal Hocko 0 siblings, 1 reply; 161+ messages in thread From: Jiri Kosina @ 2019-02-12 13:09 UTC (permalink / raw) To: Michal Hocko Cc: Vlastimil Babka, Andrew Morton, Linus Torvalds, linux-kernel, linux-mm, linux-api, Peter Zijlstra, Greg KH, Jann Horn, Dominique Martinet, Andy Lutomirski, Dave Chinner, Kevin Easton, Matthew Wilcox, Cyril Hrubis, Tejun Heo, Kirill A . Shutemov, Daniel Gruss, Josh Snyder On Tue, 12 Feb 2019, Michal Hocko wrote: > I would go with patch 1 for 5.1. Patches 2 still sounds controversial or > incomplete to me. Is it because of the disagreement what 'non-blocking' really means, or do you see something else missing? Merging patch just patch 1 withouth patch 2 is probably sort of useless excercise, unfortunately. Thanks, -- Jiri Kosina SUSE Labs ^ permalink raw reply [flat|nested] 161+ messages in thread
* Re: [PATCH 3/3] mm/mincore: provide mapped status when cached status is not allowed 2019-02-12 13:09 ` Jiri Kosina @ 2019-02-12 14:01 ` Michal Hocko 0 siblings, 0 replies; 161+ messages in thread From: Michal Hocko @ 2019-02-12 14:01 UTC (permalink / raw) To: Jiri Kosina Cc: Vlastimil Babka, Andrew Morton, Linus Torvalds, linux-kernel, linux-mm, linux-api, Peter Zijlstra, Greg KH, Jann Horn, Dominique Martinet, Andy Lutomirski, Dave Chinner, Kevin Easton, Matthew Wilcox, Cyril Hrubis, Tejun Heo, Kirill A . Shutemov, Daniel Gruss, Josh Snyder On Tue 12-02-19 14:09:03, Jiri Kosina wrote: > On Tue, 12 Feb 2019, Michal Hocko wrote: > > > I would go with patch 1 for 5.1. Patches 2 still sounds controversial or > > incomplete to me. > > Is it because of the disagreement what 'non-blocking' really means, or do > you see something else missing? Not only. See the remark from Dave [1] that the patch in its current form seems to be incomplete. Also FS people were not involved properly to evaluate all the potential fallouts. Even if the only way forward is to "cripple" IOCB_NOWAIT then the documentation should go along with the change rather than suprise people much later when the system behaves unexpectedly. So I _think_ this patch is not really ready yet. Also I haven't heard any discussion whether we can reduce the effect of the change in a similar way we do for mincore. > Merging patch just patch 1 withouth patch 2 is probably sort of useless > excercise, unfortunately. Why would that be the case. We know that mincore is the simplest way _right now_. Closing it makes sense on its own. [1] http://lkml.kernel.org/r/20190201014446.GU6173@dastard -- Michal Hocko SUSE Labs ^ permalink raw reply [flat|nested] 161+ messages in thread
* Re: [PATCH 0/3] mincore() and IOCB_NOWAIT adjustments 2019-01-30 12:44 ` [PATCH 0/3] mincore() and IOCB_NOWAIT adjustments Vlastimil Babka ` (2 preceding siblings ...) 2019-01-30 12:44 ` [PATCH 3/3] mm/mincore: provide mapped status when cached status is not allowed Vlastimil Babka @ 2019-03-06 12:11 ` Jiri Kosina 2019-03-06 22:35 ` Andrew Morton 2019-03-12 14:17 ` [PATCH v2 0/2] prevent mincore() page cache leaks Vlastimil Babka 4 siblings, 1 reply; 161+ messages in thread From: Jiri Kosina @ 2019-03-06 12:11 UTC (permalink / raw) To: Vlastimil Babka, Andrew Morton Cc: Linus Torvalds, linux-kernel, linux-mm, linux-api, Peter Zijlstra, Greg KH, Jann Horn, Andy Lutomirski, Cyril Hrubis, Daniel Gruss, Dave Chinner, Dominique Martinet, Kevin Easton, Kirill A. Shutemov, Matthew Wilcox, Tejun Heo On Wed, 30 Jan 2019, Vlastimil Babka wrote: > I've collected the patches from the discussion for formal posting. The first > two should be settled already, third one is the possible improvement I've > mentioned earlier, where only in restricted case we resort to existence of page > table mapping (the original and later reverted approach from Linus) instead of > faking the result completely. Review and testing welcome. > > The consensus seems to be going through -mm tree for 5.1, unless Linus wants > them alredy for 5.0. > > Jiri Kosina (2): > mm/mincore: make mincore() more conservative > mm/filemap: initiate readahead even if IOCB_NOWAIT is set for the I/O > > Vlastimil Babka (1): > mm/mincore: provide mapped status when cached status is not allowed Andrew, could you please take at least the correct and straightforward fix for mincore() before we figure out how to deal with the slightly less practical RWF_NOWAIT? Thanks. -- Jiri Kosina SUSE Labs ^ permalink raw reply [flat|nested] 161+ messages in thread
* Re: [PATCH 0/3] mincore() and IOCB_NOWAIT adjustments 2019-03-06 12:11 ` [PATCH 0/3] mincore() and IOCB_NOWAIT adjustments Jiri Kosina @ 2019-03-06 22:35 ` Andrew Morton 2019-03-06 22:48 ` Jiri Kosina 0 siblings, 1 reply; 161+ messages in thread From: Andrew Morton @ 2019-03-06 22:35 UTC (permalink / raw) To: Jiri Kosina Cc: Vlastimil Babka, Linus Torvalds, linux-kernel, linux-mm, linux-api, Peter Zijlstra, Greg KH, Jann Horn, Andy Lutomirski, Cyril Hrubis, Daniel Gruss, Dave Chinner, Dominique Martinet, Kevin Easton, Kirill A. Shutemov, Matthew Wilcox, Tejun Heo On Wed, 6 Mar 2019 13:11:39 +0100 (CET) Jiri Kosina <jikos@kernel.org> wrote: > On Wed, 30 Jan 2019, Vlastimil Babka wrote: > > > I've collected the patches from the discussion for formal posting. The first > > two should be settled already, third one is the possible improvement I've > > mentioned earlier, where only in restricted case we resort to existence of page > > table mapping (the original and later reverted approach from Linus) instead of > > faking the result completely. Review and testing welcome. > > > > The consensus seems to be going through -mm tree for 5.1, unless Linus wants > > them alredy for 5.0. > > > > Jiri Kosina (2): > > mm/mincore: make mincore() more conservative > > mm/filemap: initiate readahead even if IOCB_NOWAIT is set for the I/O > > > > Vlastimil Babka (1): > > mm/mincore: provide mapped status when cached status is not allowed > > Andrew, > > could you please take at least the correct and straightforward fix for > mincore() before we figure out how to deal with the slightly less > practical RWF_NOWAIT? Thanks. I assume we're talking about [1/3] and [2/3] from this thread? Can we have a resend please? Gather the various acks and revisions, make changelog changes to address the review questions and comments? Thanks. ^ permalink raw reply [flat|nested] 161+ messages in thread
* Re: [PATCH 0/3] mincore() and IOCB_NOWAIT adjustments 2019-03-06 22:35 ` Andrew Morton @ 2019-03-06 22:48 ` Jiri Kosina 2019-03-06 23:23 ` Andrew Morton 0 siblings, 1 reply; 161+ messages in thread From: Jiri Kosina @ 2019-03-06 22:48 UTC (permalink / raw) To: Andrew Morton Cc: Vlastimil Babka, Linus Torvalds, linux-kernel, linux-mm, linux-api, Peter Zijlstra, Greg KH, Jann Horn, Andy Lutomirski, Cyril Hrubis, Daniel Gruss, Dave Chinner, Dominique Martinet, Kevin Easton, Kirill A. Shutemov, Matthew Wilcox, Tejun Heo On Wed, 6 Mar 2019, Andrew Morton wrote: > > could you please take at least the correct and straightforward fix for > > mincore() before we figure out how to deal with the slightly less > > practical RWF_NOWAIT? Thanks. > > I assume we're talking about [1/3] and [2/3] from this thread? > > Can we have a resend please? Gather the various acks and revisions, > make changelog changes to address the review questions and comments? 1/3 is clearly the one to be merged. The version with all the acks gathered is in this thread, at https://lore.kernel.org/lkml/de52b3bd-4e39-c133-542a-0a9c5e357404@suse.cz/ Attaching the patch also at the end of this mail so that it could be easily picked up. I am unfortunately not sure what changelog changes you are talking about, there were none requested during the review as far as I know. 2/3 is clearly postponed for now, it needs more thinking. 3/3 is actually waiting for your decision, see https://lore.kernel.org/lkml/20190212063643.GL15609@dhcp22.suse.cz/ The 1/3 patch to be merged in any case: === cut here === From: Jiri Kosina <jkosina@suse.cz> Date: Wed, 16 Jan 2019 20:53:17 +0100 Subject: [PATCH v2] mm/mincore: make mincore() more conservative The semantics of what mincore() considers to be resident is not completely clear, but Linux has always (since 2.3.52, which is when mincore() was initially done) treated it as "page is available in page cache". That's potentially a problem, as that [in]directly exposes meta-information about pagecache / memory mapping state even about memory not strictly belonging to the process executing the syscall, opening possibilities for sidechannel attacks. Change the semantics of mincore() so that it only reveals pagecache information for non-anonymous mappings that belog to files that the calling process could (if it tried to) successfully open for writing. [mhocko@suse.com: restructure can_do_mincore() conditions] Originally-by: Linus Torvalds <torvalds@linux-foundation.org> Originally-by: Dominique Martinet <asmadeus@codewreck.org> Cc: Dominique Martinet <asmadeus@codewreck.org> Cc: Andy Lutomirski <luto@amacapital.net> Cc: Dave Chinner <david@fromorbit.com> Cc: Kevin Easton <kevin@guarana.org> Cc: Matthew Wilcox <willy@infradead.org> Cc: Cyril Hrubis <chrubis@suse.cz> Cc: Tejun Heo <tj@kernel.org> Cc: Kirill A. Shutemov <kirill@shutemov.name> Cc: Daniel Gruss <daniel@gruss.cc> Signed-off-by: Jiri Kosina <jkosina@suse.cz> Signed-off-by: Vlastimil Babka <vbabka@suse.cz> Acked-by: Josh Snyder <joshs@netflix.com> Acked-by: Michal Hocko <mhocko@suse.com> --- mm/mincore.c | 17 ++++++++++++++++- 1 file changed, 16 insertions(+), 1 deletion(-) diff --git a/mm/mincore.c b/mm/mincore.c index 218099b5ed31..b8842b849604 100644 --- a/mm/mincore.c +++ b/mm/mincore.c @@ -169,6 +169,16 @@ static int mincore_pte_range(pmd_t *pmd, unsigned long addr, unsigned long end, return 0; } +static inline bool can_do_mincore(struct vm_area_struct *vma) +{ + if (vma_is_anonymous(vma)) + return true; + if (!vma->vm_file) + return false; + return inode_owner_or_capable(file_inode(vma->vm_file)) || + inode_permission(file_inode(vma->vm_file), MAY_WRITE) == 0; +} + /* * Do a chunk of "sys_mincore()". We've already checked * all the arguments, we hold the mmap semaphore: we should @@ -189,8 +199,13 @@ static long do_mincore(unsigned long addr, unsigned long pages, unsigned char *v vma = find_vma(current->mm, addr); if (!vma || addr < vma->vm_start) return -ENOMEM; - mincore_walk.mm = vma->vm_mm; end = min(vma->vm_end, addr + (pages << PAGE_SHIFT)); + if (!can_do_mincore(vma)) { + unsigned long pages = (end - addr) >> PAGE_SHIFT; + memset(vec, 1, pages); + return pages; + } + mincore_walk.mm = vma->vm_mm; err = walk_page_range(addr, end, &mincore_walk); if (err < 0) return err; -- Jiri Kosina SUSE Labs ^ permalink raw reply related [flat|nested] 161+ messages in thread
* Re: [PATCH 0/3] mincore() and IOCB_NOWAIT adjustments 2019-03-06 22:48 ` Jiri Kosina @ 2019-03-06 23:23 ` Andrew Morton 2019-03-06 23:32 ` Dominique Martinet 0 siblings, 1 reply; 161+ messages in thread From: Andrew Morton @ 2019-03-06 23:23 UTC (permalink / raw) To: Jiri Kosina Cc: Vlastimil Babka, Linus Torvalds, linux-kernel, linux-mm, linux-api, Peter Zijlstra, Greg KH, Jann Horn, Andy Lutomirski, Cyril Hrubis, Daniel Gruss, Dave Chinner, Dominique Martinet, Kevin Easton, Kirill A. Shutemov, Matthew Wilcox, Tejun Heo On Wed, 6 Mar 2019 23:48:03 +0100 (CET) Jiri Kosina <jikos@kernel.org> wrote: > 3/3 is actually waiting for your decision, see > > https://lore.kernel.org/lkml/20190212063643.GL15609@dhcp22.suse.cz/ I pity anyone who tried to understand this code by reading this code. Can we please get some careful commentary in there explaining what is going on, and why things are thus? I guess the [3/3] change makes sense, although it's unclear whether anyone really needs it? 5.0 was released with 574823bfab8 ("Change mincore() to count "mapped" pages rather than "cached" pages") so we'll have a release cycle to somewhat determine how much impact 574823bfab8 has on users. How about I queue up [3/3] and we reevaluate its desirability in a couple of months? ^ permalink raw reply [flat|nested] 161+ messages in thread
* Re: [PATCH 0/3] mincore() and IOCB_NOWAIT adjustments 2019-03-06 23:23 ` Andrew Morton @ 2019-03-06 23:32 ` Dominique Martinet 2019-03-06 23:38 ` Andrew Morton 0 siblings, 1 reply; 161+ messages in thread From: Dominique Martinet @ 2019-03-06 23:32 UTC (permalink / raw) To: Andrew Morton Cc: Jiri Kosina, Vlastimil Babka, Linus Torvalds, linux-kernel, linux-mm, linux-api, Peter Zijlstra, Greg KH, Jann Horn, Andy Lutomirski, Cyril Hrubis, Daniel Gruss, Dave Chinner, Kevin Easton, Kirill A. Shutemov, Matthew Wilcox, Tejun Heo Andrew Morton wrote on Wed, Mar 06, 2019: > On Wed, 6 Mar 2019 23:48:03 +0100 (CET) Jiri Kosina <jikos@kernel.org> wrote: > > > 3/3 is actually waiting for your decision, see > > > > https://lore.kernel.org/lkml/20190212063643.GL15609@dhcp22.suse.cz/ > > I pity anyone who tried to understand this code by reading this code. > Can we please get some careful commentary in there explaining what is > going on, and why things are thus? > > I guess the [3/3] change makes sense, although it's unclear whether > anyone really needs it? 5.0 was released with 574823bfab8 ("Change > mincore() to count "mapped" pages rather than "cached" pages") so we'll > have a release cycle to somewhat determine how much impact 574823bfab8 > has on users. How about I queue up [3/3] and we reevaluate its > desirability in a couple of months? FWIW, 574823bfab8 has been reverted in 30bac164aca750, included in 5.0-rc4, so the controversial change has only been there from 5.0-rc1 to 5.0-rc3 -- Dominique ^ permalink raw reply [flat|nested] 161+ messages in thread
* Re: [PATCH 0/3] mincore() and IOCB_NOWAIT adjustments 2019-03-06 23:32 ` Dominique Martinet @ 2019-03-06 23:38 ` Andrew Morton 2019-03-09 16:53 ` Linus Torvalds 0 siblings, 1 reply; 161+ messages in thread From: Andrew Morton @ 2019-03-06 23:38 UTC (permalink / raw) To: Dominique Martinet Cc: Jiri Kosina, Vlastimil Babka, Linus Torvalds, linux-kernel, linux-mm, linux-api, Peter Zijlstra, Greg KH, Jann Horn, Andy Lutomirski, Cyril Hrubis, Daniel Gruss, Dave Chinner, Kevin Easton, Kirill A. Shutemov, Matthew Wilcox, Tejun Heo On Thu, 7 Mar 2019 00:32:09 +0100 Dominique Martinet <asmadeus@codewreck.org> wrote: > Andrew Morton wrote on Wed, Mar 06, 2019: > > On Wed, 6 Mar 2019 23:48:03 +0100 (CET) Jiri Kosina <jikos@kernel.org> wrote: > > > > > 3/3 is actually waiting for your decision, see > > > > > > https://lore.kernel.org/lkml/20190212063643.GL15609@dhcp22.suse.cz/ > > > > I pity anyone who tried to understand this code by reading this code. > > Can we please get some careful commentary in there explaining what is > > going on, and why things are thus? > > > > I guess the [3/3] change makes sense, although it's unclear whether > > anyone really needs it? 5.0 was released with 574823bfab8 ("Change > > mincore() to count "mapped" pages rather than "cached" pages") so we'll > > have a release cycle to somewhat determine how much impact 574823bfab8 > > has on users. How about I queue up [3/3] and we reevaluate its > > desirability in a couple of months? > > FWIW, > > 574823bfab8 has been reverted in 30bac164aca750, included in 5.0-rc4, so > the controversial change has only been there from 5.0-rc1 to 5.0-rc3 Ah, OK, thanks, I misread. Linus, do you have thoughts on http://lkml.kernel.org/r/20190130124420.1834-4-vbabka@suse.cz ? ^ permalink raw reply [flat|nested] 161+ messages in thread
* Re: [PATCH 0/3] mincore() and IOCB_NOWAIT adjustments 2019-03-06 23:38 ` Andrew Morton @ 2019-03-09 16:53 ` Linus Torvalds 0 siblings, 0 replies; 161+ messages in thread From: Linus Torvalds @ 2019-03-09 16:53 UTC (permalink / raw) To: Andrew Morton Cc: Dominique Martinet, Jiri Kosina, Vlastimil Babka, Linux List Kernel Mailing, Linux-MM, Linux API, Peter Zijlstra, Greg KH, Jann Horn, Andy Lutomirski, Cyril Hrubis, Daniel Gruss, Dave Chinner, Kevin Easton, Kirill A. Shutemov, Matthew Wilcox, Tejun Heo On Wed, Mar 6, 2019 at 3:38 PM Andrew Morton <akpm@linux-foundation.org> wrote: > > Linus, do you have thoughts on > http://lkml.kernel.org/r/20190130124420.1834-4-vbabka@suse.cz ? I think that's fine, and probably the right thing to do, but I also suspect that nobody actually cares ;( Linus ^ permalink raw reply [flat|nested] 161+ messages in thread
* [PATCH v2 0/2] prevent mincore() page cache leaks 2019-01-30 12:44 ` [PATCH 0/3] mincore() and IOCB_NOWAIT adjustments Vlastimil Babka ` (3 preceding siblings ...) 2019-03-06 12:11 ` [PATCH 0/3] mincore() and IOCB_NOWAIT adjustments Jiri Kosina @ 2019-03-12 14:17 ` Vlastimil Babka 2019-03-12 14:17 ` [PATCH v2 1/2] mm/mincore: make mincore() more conservative Vlastimil Babka 2019-03-12 14:17 ` [PATCH v2 2/2] mm/mincore: provide mapped status when cached status is not allowed Vlastimil Babka 4 siblings, 2 replies; 161+ messages in thread From: Vlastimil Babka @ 2019-03-12 14:17 UTC (permalink / raw) To: Andrew Morton Cc: Linus Torvalds, Jann Horn, Michal Hocko, linux-mm, linux-kernel, linux-api, Vlastimil Babka, Andy Lutomirski, Cyril Hrubis, Daniel Gruss, Dave Chinner, Dominique Martinet, Jiri Kosina, Jiri Kosina, Josh Snyder, Kevin Easton, Kirill A. Shutemov, Matthew Wilcox, Michal Hocko, Tejun Heo Here's a new version of the mincore() patches, with feedback from Andrew Morton applied. The IOCB_NOWAIT patch was dropped since David Chinner pointed out it's incomplete. We definitely want the first patch, while for the second Linus said: I think that's fine, and probably the right thing to do, but I also suspect that nobody actually cares ;( Whether or not somebody cares, we should hear of no breakage. If somebody does care after all, without second patch we might hear of breakage, so I would suggest applying it. It's not that complicated after all (famous last words?) Jiri Kosina (1): mm/mincore: make mincore() more conservative Vlastimil Babka (1): mm/mincore: provide mapped status when cached status is not allowed mm/mincore.c | 80 ++++++++++++++++++++++++++++++++++++++++++++-------- 1 file changed, 68 insertions(+), 12 deletions(-) -- 2.20.1 ^ permalink raw reply [flat|nested] 161+ messages in thread
* [PATCH v2 1/2] mm/mincore: make mincore() more conservative 2019-03-12 14:17 ` [PATCH v2 0/2] prevent mincore() page cache leaks Vlastimil Babka @ 2019-03-12 14:17 ` Vlastimil Babka 2019-03-12 14:17 ` [PATCH v2 2/2] mm/mincore: provide mapped status when cached status is not allowed Vlastimil Babka 1 sibling, 0 replies; 161+ messages in thread From: Vlastimil Babka @ 2019-03-12 14:17 UTC (permalink / raw) To: Andrew Morton Cc: Linus Torvalds, Jann Horn, Michal Hocko, linux-mm, linux-kernel, linux-api, Jiri Kosina, Dominique Martinet, Andy Lutomirski, Dave Chinner, Kevin Easton, Matthew Wilcox, Cyril Hrubis, Tejun Heo, Kirill A . Shutemov, Daniel Gruss, Vlastimil Babka, Josh Snyder, Michal Hocko, Jiri Kosina From: Jiri Kosina <jkosina@suse.cz> The semantics of what mincore() considers to be resident is not completely clear, but Linux has always (since 2.3.52, which is when mincore() was initially done) treated it as "page is available in page cache". That's potentially a problem, as that [in]directly exposes meta-information about pagecache / memory mapping state even about memory not strictly belonging to the process executing the syscall, opening possibilities for sidechannel attacks. Change the semantics of mincore() so that it only reveals pagecache information for non-anonymous mappings that belog to files that the calling process could (if it tried to) successfully open for writing; otherwise we'd be including shared non-exclusive mappings, which - is the sidechannel - is not the usecase for mincore(), as that's primarily used for data, not (shared) text [mhocko@suse.com: restructure can_do_mincore() conditions] Originally-by: Linus Torvalds <torvalds@linux-foundation.org> Originally-by: Dominique Martinet <asmadeus@codewreck.org> Cc: Dominique Martinet <asmadeus@codewreck.org> Cc: Andy Lutomirski <luto@amacapital.net> Cc: Dave Chinner <david@fromorbit.com> Cc: Kevin Easton <kevin@guarana.org> Cc: Matthew Wilcox <willy@infradead.org> Cc: Cyril Hrubis <chrubis@suse.cz> Cc: Tejun Heo <tj@kernel.org> Cc: Kirill A. Shutemov <kirill@shutemov.name> Cc: Daniel Gruss <daniel@gruss.cc> Signed-off-by: Jiri Kosina <jkosina@suse.cz> Signed-off-by: Vlastimil Babka <vbabka@suse.cz> Acked-by: Josh Snyder <joshs@netflix.com> Acked-by: Michal Hocko <mhocko@suse.com> Signed-off-by: Jiri Kosina <jkosina@suse.cz> Signed-off-by: Vlastimil Babka <vbabka@suse.cz> --- mm/mincore.c | 23 ++++++++++++++++++++++- 1 file changed, 22 insertions(+), 1 deletion(-) diff --git a/mm/mincore.c b/mm/mincore.c index 218099b5ed31..c3f058bd0faf 100644 --- a/mm/mincore.c +++ b/mm/mincore.c @@ -169,6 +169,22 @@ static int mincore_pte_range(pmd_t *pmd, unsigned long addr, unsigned long end, return 0; } +static inline bool can_do_mincore(struct vm_area_struct *vma) +{ + if (vma_is_anonymous(vma)) + return true; + if (!vma->vm_file) + return false; + /* + * Reveal pagecache information only for non-anonymous mappings that + * correspond to the files the calling process could (if tried) open + * for writing; otherwise we'd be including shared non-exclusive + * mappings, which opens a side channel. + */ + return inode_owner_or_capable(file_inode(vma->vm_file)) || + inode_permission(file_inode(vma->vm_file), MAY_WRITE) == 0; +} + /* * Do a chunk of "sys_mincore()". We've already checked * all the arguments, we hold the mmap semaphore: we should @@ -189,8 +205,13 @@ static long do_mincore(unsigned long addr, unsigned long pages, unsigned char *v vma = find_vma(current->mm, addr); if (!vma || addr < vma->vm_start) return -ENOMEM; - mincore_walk.mm = vma->vm_mm; end = min(vma->vm_end, addr + (pages << PAGE_SHIFT)); + if (!can_do_mincore(vma)) { + unsigned long pages = DIV_ROUND_UP(end - addr, PAGE_SIZE); + memset(vec, 1, pages); + return pages; + } + mincore_walk.mm = vma->vm_mm; err = walk_page_range(addr, end, &mincore_walk); if (err < 0) return err; -- 2.20.1 ^ permalink raw reply related [flat|nested] 161+ messages in thread
* [PATCH v2 2/2] mm/mincore: provide mapped status when cached status is not allowed 2019-03-12 14:17 ` [PATCH v2 0/2] prevent mincore() page cache leaks Vlastimil Babka 2019-03-12 14:17 ` [PATCH v2 1/2] mm/mincore: make mincore() more conservative Vlastimil Babka @ 2019-03-12 14:17 ` Vlastimil Babka 1 sibling, 0 replies; 161+ messages in thread From: Vlastimil Babka @ 2019-03-12 14:17 UTC (permalink / raw) To: Andrew Morton Cc: Linus Torvalds, Jann Horn, Michal Hocko, linux-mm, linux-kernel, linux-api, Vlastimil Babka, Jiri Kosina, Dominique Martinet, Andy Lutomirski, Dave Chinner, Kevin Easton, Matthew Wilcox, Cyril Hrubis, Tejun Heo, Kirill A . Shutemov, Daniel Gruss, Jiri Kosina, Josh Snyder, Michal Hocko After "mm/mincore: make mincore() more conservative" we sometimes restrict the information about page cache residency, which needs to be done without breaking existing userspace, as much as possible. Instead of returning with error, we thus fake the results. For that we return residency values as 1, which should be safer than faking them as 0, as there might theoretically exist code that would try to fault in the page(s) in a loop until mincore() returns 1 for them. Faking 1 however means that such code would not fault in a page even if it was not truly in page cache, with possibly unwanted performance implications. We can improve the situation by revisting the approach of 574823bfab82 ("Change mincore() to count "mapped" pages rather than "cached" pages"), later reverted by 30bac164aca7 and replaced by restricting/faking the results. In this patch we apply the approach only to cases where page cache residency check is restricted. Thus mincore() will return 0 for an unmapped page (which may or may not be resident in a pagecache), and 1 after the process faults it in. One potential downside is that mincore() users will be now able to recognize when a previously mapped page was reclaimed. While that might be useful for some attack scenarios, it is not as crucial as recognizing that somebody else faulted the page in, which is the main reason we are making mincore() more conservative. For detecting that pages being reclaimed, there are also other existing ways anyway. Cc: Jiri Kosina <jikos@kernel.org> Cc: Dominique Martinet <asmadeus@codewreck.org> Cc: Andy Lutomirski <luto@amacapital.net> Cc: Dave Chinner <david@fromorbit.com> Cc: Kevin Easton <kevin@guarana.org> Cc: Matthew Wilcox <willy@infradead.org> Cc: Cyril Hrubis <chrubis@suse.cz> Cc: Tejun Heo <tj@kernel.org> Cc: Kirill A. Shutemov <kirill@shutemov.name> Cc: Daniel Gruss <daniel@gruss.cc> Signed-off-by: Vlastimil Babka <vbabka@suse.cz> --- mm/mincore.c | 67 +++++++++++++++++++++++++++++++++++++++------------- 1 file changed, 51 insertions(+), 16 deletions(-) diff --git a/mm/mincore.c b/mm/mincore.c index c3f058bd0faf..c9a265abc631 100644 --- a/mm/mincore.c +++ b/mm/mincore.c @@ -21,12 +21,23 @@ #include <linux/uaccess.h> #include <asm/pgtable.h> +/* + * mincore() page walk's private structure. Contains pointer to the array + * of return values to be set, and whether the current vma passed the + * can_do_mincore() check. + */ +struct mincore_walk_private { + unsigned char *vec; + bool can_check_pagecache; +}; + static int mincore_hugetlb(pte_t *pte, unsigned long hmask, unsigned long addr, unsigned long end, struct mm_walk *walk) { #ifdef CONFIG_HUGETLB_PAGE unsigned char present; - unsigned char *vec = walk->private; + struct mincore_walk_private *walk_private = walk->private; + unsigned char *vec = walk_private->vec; /* * Hugepages under user process are always in RAM and never @@ -35,7 +46,7 @@ static int mincore_hugetlb(pte_t *pte, unsigned long hmask, unsigned long addr, present = pte && !huge_pte_none(huge_ptep_get(pte)); for (; addr != end; vec++, addr += PAGE_SIZE) *vec = present; - walk->private = vec; + walk_private->vec = vec; #else BUG(); #endif @@ -85,7 +96,8 @@ static unsigned char mincore_page(struct address_space *mapping, pgoff_t pgoff) } static int __mincore_unmapped_range(unsigned long addr, unsigned long end, - struct vm_area_struct *vma, unsigned char *vec) + struct vm_area_struct *vma, unsigned char *vec, + bool can_check_pagecache) { unsigned long nr = (end - addr) >> PAGE_SHIFT; int i; @@ -95,7 +107,14 @@ static int __mincore_unmapped_range(unsigned long addr, unsigned long end, pgoff = linear_page_index(vma, addr); for (i = 0; i < nr; i++, pgoff++) - vec[i] = mincore_page(vma->vm_file->f_mapping, pgoff); + /* + * Return page cache residency state if we are allowed + * to, otherwise return mapping state, which is 0 for + * an unmapped range. + */ + vec[i] = can_check_pagecache ? + mincore_page(vma->vm_file->f_mapping, pgoff) + : 0; } else { for (i = 0; i < nr; i++) vec[i] = 0; @@ -106,8 +125,11 @@ static int __mincore_unmapped_range(unsigned long addr, unsigned long end, static int mincore_unmapped_range(unsigned long addr, unsigned long end, struct mm_walk *walk) { - walk->private += __mincore_unmapped_range(addr, end, - walk->vma, walk->private); + struct mincore_walk_private *walk_private = walk->private; + unsigned char *vec = walk_private->vec; + + walk_private->vec += __mincore_unmapped_range(addr, end, walk->vma, + vec, walk_private->can_check_pagecache); return 0; } @@ -117,7 +139,8 @@ static int mincore_pte_range(pmd_t *pmd, unsigned long addr, unsigned long end, spinlock_t *ptl; struct vm_area_struct *vma = walk->vma; pte_t *ptep; - unsigned char *vec = walk->private; + struct mincore_walk_private *walk_private = walk->private; + unsigned char *vec = walk_private->vec; int nr = (end - addr) >> PAGE_SHIFT; ptl = pmd_trans_huge_lock(pmd, vma); @@ -128,7 +151,8 @@ static int mincore_pte_range(pmd_t *pmd, unsigned long addr, unsigned long end, } if (pmd_trans_unstable(pmd)) { - __mincore_unmapped_range(addr, end, vma, vec); + __mincore_unmapped_range(addr, end, vma, vec, + walk_private->can_check_pagecache); goto out; } @@ -138,7 +162,7 @@ static int mincore_pte_range(pmd_t *pmd, unsigned long addr, unsigned long end, if (pte_none(pte)) __mincore_unmapped_range(addr, addr + PAGE_SIZE, - vma, vec); + vma, vec, walk_private->can_check_pagecache); else if (pte_present(pte)) *vec = 1; else { /* pte is a swap entry */ @@ -152,8 +176,20 @@ static int mincore_pte_range(pmd_t *pmd, unsigned long addr, unsigned long end, *vec = 1; } else { #ifdef CONFIG_SWAP - *vec = mincore_page(swap_address_space(entry), + /* + * If tmpfs pages are being swapped out, treat + * it with same restrictions on mincore() as + * the page cache so we don't expose that + * somebody else brought them back from swap. + * In the restricted case return 0 as swap + * entry means the page is not mapped. + */ + if (walk_private->can_check_pagecache) + *vec = mincore_page( + swap_address_space(entry), swp_offset(entry)); + else + *vec = 0; #else WARN_ON(1); *vec = 1; @@ -195,22 +231,21 @@ static long do_mincore(unsigned long addr, unsigned long pages, unsigned char *v struct vm_area_struct *vma; unsigned long end; int err; + struct mincore_walk_private walk_private = { + .vec = vec + }; struct mm_walk mincore_walk = { .pmd_entry = mincore_pte_range, .pte_hole = mincore_unmapped_range, .hugetlb_entry = mincore_hugetlb, - .private = vec, + .private = &walk_private }; vma = find_vma(current->mm, addr); if (!vma || addr < vma->vm_start) return -ENOMEM; end = min(vma->vm_end, addr + (pages << PAGE_SHIFT)); - if (!can_do_mincore(vma)) { - unsigned long pages = DIV_ROUND_UP(end - addr, PAGE_SIZE); - memset(vec, 1, pages); - return pages; - } + walk_private.can_check_pagecache = can_do_mincore(vma); mincore_walk.mm = vma->vm_mm; err = walk_page_range(addr, end, &mincore_walk); if (err < 0) -- 2.20.1 ^ permalink raw reply related [flat|nested] 161+ messages in thread
end of thread, other threads:[~2019-03-12 14:17 UTC | newest] Thread overview: 161+ messages (download: mbox.gz / follow: Atom feed) -- links below jump to the message on this page -- 2019-01-05 17:27 [PATCH] mm/mincore: allow for making sys_mincore() privileged Jiri Kosina 2019-01-05 19:14 ` Vlastimil Babka 2019-01-05 19:24 ` Jiri Kosina 2019-01-05 19:38 ` Vlastimil Babka 2019-01-08 9:14 ` Bernd Petrovitsch 2019-01-08 11:37 ` Jiri Kosina 2019-01-08 13:53 ` Bernd Petrovitsch 2019-01-08 14:08 ` Kirill A. Shutemov 2019-01-05 19:44 ` kbuild test robot 2019-01-05 19:46 ` Linus Torvalds 2019-01-05 20:12 ` Jiri Kosina 2019-01-05 20:17 ` Linus Torvalds 2019-01-05 20:43 ` Jiri Kosina 2019-01-05 21:54 ` Linus Torvalds 2019-01-06 11:33 ` Kevin Easton 2019-01-08 8:50 ` Kevin Easton 2019-01-18 14:23 ` Tejun Heo 2019-01-05 20:13 ` Linus Torvalds 2019-01-05 19:56 ` kbuild test robot 2019-01-05 22:54 ` Jann Horn 2019-01-05 23:05 ` Linus Torvalds 2019-01-05 23:16 ` Linus Torvalds 2019-01-05 23:28 ` Linus Torvalds 2019-01-05 23:39 ` Linus Torvalds 2019-01-06 0:11 ` Matthew Wilcox 2019-01-06 0:22 ` Linus Torvalds 2019-01-06 1:50 ` Linus Torvalds 2019-01-06 21:46 ` Linus Torvalds 2019-01-08 4:43 ` Dave Chinner 2019-01-08 17:57 ` Linus Torvalds 2019-01-09 2:24 ` Dave Chinner 2019-01-09 2:31 ` Jiri Kosina 2019-01-09 4:39 ` Dave Chinner 2019-01-09 10:08 ` Jiri Kosina 2019-01-10 1:15 ` Dave Chinner 2019-01-10 7:54 ` Jiri Kosina 2019-01-09 18:25 ` Linus Torvalds 2019-01-10 0:44 ` Dave Chinner 2019-01-10 1:18 ` Linus Torvalds 2019-01-10 5:26 ` Andy Lutomirski 2019-01-10 14:47 ` Matthew Wilcox 2019-01-10 21:44 ` Dave Chinner 2019-01-10 21:59 ` Linus Torvalds 2019-01-11 1:47 ` Dave Chinner 2019-01-10 7:03 ` Dave Chinner 2019-01-10 11:47 ` Linus Torvalds 2019-01-10 12:24 ` Dominique Martinet 2019-01-10 22:11 ` Linus Torvalds 2019-01-11 2:03 ` Dave Chinner 2019-01-11 2:18 ` Linus Torvalds 2019-01-11 4:04 ` Dave Chinner 2019-01-11 4:08 ` Andy Lutomirski 2019-01-11 7:20 ` Dave Chinner 2019-01-11 7:08 ` Linus Torvalds 2019-01-11 7:36 ` Dave Chinner 2019-01-11 16:26 ` Linus Torvalds 2019-01-15 23:45 ` Dave Chinner 2019-01-16 4:54 ` Linus Torvalds 2019-01-16 5:49 ` Linus Torvalds 2019-01-17 1:26 ` Dave Chinner 2019-02-20 15:49 ` Nicolai Stange 2019-01-11 4:57 ` Dominique Martinet 2019-01-11 7:11 ` Linus Torvalds 2019-01-11 7:32 ` Dominique Martinet 2019-01-16 0:42 ` Josh Snyder 2019-01-16 5:00 ` Linus Torvalds 2019-01-16 5:25 ` Andy Lutomirski 2019-01-16 5:34 ` Linus Torvalds 2019-01-16 5:46 ` Dominique Martinet 2019-01-16 5:58 ` Linus Torvalds 2019-01-16 6:34 ` Dominique Martinet 2019-01-16 7:52 ` Josh Snyder 2019-01-16 12:18 ` Kevin Easton 2019-01-17 21:45 ` Vlastimil Babka 2019-01-18 4:49 ` Linus Torvalds 2019-01-18 18:58 ` Vlastimil Babka 2019-01-16 16:12 ` Jiri Kosina 2019-01-16 17:48 ` Linus Torvalds 2019-01-16 20:23 ` Jiri Kosina 2019-01-16 21:37 ` Matthew Wilcox 2019-01-16 21:41 ` Jiri Kosina 2019-01-17 9:52 ` Cyril Hrubis 2019-01-28 13:49 ` Cyril Hrubis 2019-01-17 4:51 ` Linus Torvalds 2019-01-18 4:54 ` Linus Torvalds 2019-01-17 1:49 ` Dominique Martinet 2019-01-23 20:27 ` Linus Torvalds 2019-01-23 20:35 ` Linus Torvalds 2019-01-23 23:12 ` Jiri Kosina 2019-01-24 0:20 ` Linus Torvalds 2019-01-24 0:24 ` Dominique Martinet 2019-01-24 12:45 ` Dominique Martinet 2019-01-24 14:25 ` Jiri Kosina 2019-01-27 22:35 ` Jiri Kosina 2019-01-28 0:05 ` Dominique Martinet 2019-01-29 23:52 ` Jiri Kosina 2019-01-30 9:09 ` Michal Hocko 2019-01-30 12:29 ` Jiri Kosina 2019-01-16 12:36 ` Matthew Wilcox 2019-01-10 14:50 ` Matthew Wilcox 2019-01-11 7:36 ` Jiri Kosina 2019-01-17 2:22 ` Dave Chinner 2019-01-17 8:18 ` Jiri Kosina 2019-01-17 21:06 ` Dave Chinner 2019-01-07 4:32 ` Dominique Martinet 2019-01-07 10:33 ` Vlastimil Babka 2019-01-07 11:08 ` Dominique Martinet 2019-01-07 11:59 ` Vlastimil Babka 2019-01-07 13:29 ` Daniel Gruss 2019-01-07 10:10 ` Michael Ellerman 2019-01-05 23:09 ` Jiri Kosina 2019-01-30 12:44 ` [PATCH 0/3] mincore() and IOCB_NOWAIT adjustments Vlastimil Babka 2019-01-30 12:44 ` [PATCH 1/3] mm/mincore: make mincore() more conservative Vlastimil Babka 2019-01-31 9:43 ` Michal Hocko 2019-01-31 9:51 ` Dominique Martinet 2019-01-31 17:46 ` Josh Snyder 2019-02-01 8:56 ` Vlastimil Babka 2019-03-06 23:13 ` Andrew Morton 2019-03-07 0:01 ` Jiri Kosina 2019-03-07 0:40 ` Dominique Martinet 2019-03-07 5:46 ` Jiri Kosina 2019-01-30 12:44 ` [PATCH 2/3] mm/filemap: initiate readahead even if IOCB_NOWAIT is set for the I/O Vlastimil Babka 2019-01-30 15:04 ` Florian Weimer 2019-01-30 15:15 ` Jiri Kosina 2019-01-31 10:47 ` Florian Weimer 2019-01-31 11:34 ` Jiri Kosina 2019-01-31 9:56 ` Michal Hocko 2019-01-31 10:15 ` Jiri Kosina 2019-01-31 10:23 ` Michal Hocko 2019-01-31 10:30 ` Jiri Kosina 2019-01-31 11:32 ` Michal Hocko 2019-01-31 17:54 ` Linus Torvalds 2019-02-01 5:13 ` Dave Chinner 2019-02-01 7:05 ` Linus Torvalds 2019-02-01 7:21 ` Linus Torvalds 2019-02-01 1:44 ` Dave Chinner 2019-02-12 15:48 ` Jiri Kosina 2019-01-31 12:04 ` Daniel Gruss 2019-01-31 12:06 ` Vlastimil Babka 2019-01-31 12:08 ` Jiri Kosina 2019-01-31 12:57 ` Daniel Gruss 2019-01-30 12:44 ` [PATCH 3/3] mm/mincore: provide mapped status when cached status is not allowed Vlastimil Babka 2019-01-31 10:09 ` Michal Hocko 2019-02-01 9:04 ` Vlastimil Babka 2019-02-01 9:11 ` Michal Hocko 2019-02-01 9:27 ` Vlastimil Babka 2019-02-06 20:14 ` Jiri Kosina 2019-02-12 3:44 ` Jiri Kosina 2019-02-12 6:36 ` Michal Hocko 2019-02-12 13:09 ` Jiri Kosina 2019-02-12 14:01 ` Michal Hocko 2019-03-06 12:11 ` [PATCH 0/3] mincore() and IOCB_NOWAIT adjustments Jiri Kosina 2019-03-06 22:35 ` Andrew Morton 2019-03-06 22:48 ` Jiri Kosina 2019-03-06 23:23 ` Andrew Morton 2019-03-06 23:32 ` Dominique Martinet 2019-03-06 23:38 ` Andrew Morton 2019-03-09 16:53 ` Linus Torvalds 2019-03-12 14:17 ` [PATCH v2 0/2] prevent mincore() page cache leaks Vlastimil Babka 2019-03-12 14:17 ` [PATCH v2 1/2] mm/mincore: make mincore() more conservative Vlastimil Babka 2019-03-12 14:17 ` [PATCH v2 2/2] mm/mincore: provide mapped status when cached status is not allowed Vlastimil Babka
This is a public inbox, see mirroring instructions for how to clone and mirror all data and code used for this inbox; as well as URLs for NNTP newsgroup(s).