All of lore.kernel.org
 help / color / mirror / Atom feed
* [PATCHv3 0/2] mm: map few pages around fault address if they are in page cache
@ 2014-02-27 19:53 ` Kirill A. Shutemov
  0 siblings, 0 replies; 73+ messages in thread
From: Kirill A. Shutemov @ 2014-02-27 19:53 UTC (permalink / raw)
  To: Linus Torvalds, Andrew Morton, Mel Gorman, Rik van Riel
  Cc: Andi Kleen, Matthew Wilcox, Dave Hansen, Alexander Viro,
	Dave Chinner, Ning Qu, linux-mm, linux-fsdevel, linux-kernel,
	Kirill A. Shutemov

Here's new version of faultaround patchset. It took a while to tune it and
collect performance data.

First patch adds new callback ->map_pages to vm_operations_struct.

->map_pages() is called when VM asks to map easy accessible pages.
Filesystem should find and map pages associated with offsets from "pgoff"
till "max_pgoff". ->map_pages() is called with page table locked and must
not block.  If it's not possible to reach a page without blocking,
filesystem should skip it. Filesystem should use do_set_pte() to setup
page table entry. Pointer to entry associated with offset "pgoff" is
passed in "pte" field in vm_fault structure. Pointers to entries for other
offsets should be calculated relative to "pte".

Currently VM use ->map_pages only on read page fault path. We try to map
FAULT_AROUND_PAGES a time. FAULT_AROUND_PAGES is 16 for now. Performance
data for different FAULT_AROUND_ORDER is below.

TODO:
 - implement ->map_pages() for shmem/tmpfs;
 - modify get_user_pages() to be able to use ->map_pages() and implement
   mmap(MAP_POPULATE|MAP_NONBLOCK) on top.

Please consider applying.

=========================================================================
Tested on 4-socket machine (120 threads) with 128GiB of RAM.

Few real-world workloads. The sweet spot for FAULT_AROUND_ORDER here is
somewhere between 3 and 5. Let's say 4 :)

Linux build (make -j60)
FAULT_AROUND_ORDER		Baseline	1		3		4		5		7		9
	minor-faults		283,301,572	247,151,987	212,215,789	204,772,882	199,568,944	194,703,779	193,381,485
	time, seconds		151.227629483	153.920996480	151.356125472	150.863792049	150.879207877	151.150764954	151.450962358
Linux rebuild (make -j60)
FAULT_AROUND_ORDER		Baseline	1		3		4		5		7		9
	minor-faults		5,396,854	4,148,444	2,855,286	2,577,282	2,361,957	2,169,573	2,112,643
	time, seconds		27.404543757	27.559725591	27.030057426	26.855045126	26.678618635	26.974523490	26.761320095
Git test suite (make -j60 test)
FAULT_AROUND_ORDER		Baseline	1		3		4		5		7		9
	minor-faults		129,591,823	99,200,751	66,106,718	57,606,410	51,510,808	45,776,813	44,085,515
	time, seconds		66.087215026	64.784546905	64.401156567	65.282708668	66.034016829	66.793780811	67.237810413

Two synthetic tests: access every word in file in sequential/random order.
It doesn't improve much after FAULT_AROUND_ORDER == 4.

Sequential access 16GiB file
FAULT_AROUND_ORDER		Baseline	1		3		4		5		7		9
 1 thread
	minor-faults		4,195,437	2,098,275	525,068		262,251		131,170		32,856		8,282
	time, seconds		7.250461742	6.461711074	5.493859139	5.488488147	5.707213983	5.898510832	5.109232856
 8 threads
	minor-faults		33,557,540	16,892,728	4,515,848	2,366,999	1,423,382	442,732		142,339
	time, seconds		16.649304881	9.312555263	6.612490639	6.394316732	6.669827501	6.75078944	6.371900528
 32 threads
	minor-faults		134,228,222	67,526,810	17,725,386	9,716,537	4,763,731	1,668,921	537,200
	time, seconds		49.164430543	29.712060103	12.938649729	10.175151004	11.840094583	9.594081325	9.928461797
 60 threads
	minor-faults		251,687,988	126,146,952	32,919,406	18,208,804	10,458,947	2,733,907	928,217
	time, seconds		86.260656897	49.626551828	22.335007632	17.608243696	16.523119035	16.339489186	16.326390902
 120 threads
	minor-faults		503,352,863	252,939,677	67,039,168	35,191,827	19,170,091	4,688,357	1,471,862
	time, seconds		124.589206333	79.757867787	39.508707872	32.167281632	29.972989292	28.729834575	28.042251622
Random access 1GiB file
 1 thread
	minor-faults		262,636		132,743		34,369		17,299		8,527		3,451		1,222
	time, seconds		15.351890914	16.613802482	16.569227308	15.179220992	16.557356122	16.578247824	15.365266994
 8 threads
	minor-faults		2,098,948	1,061,871	273,690		154,501		87,110		25,663		7,384
	time, seconds		15.040026343	15.096933500	14.474757288	14.289129964	14.411537468	14.296316837	14.395635804
 32 threads
	minor-faults		8,390,734	4,231,023	1,054,432	528,847		269,242		97,746		26,881
	time, seconds		20.430433109	21.585235358	22.115062928	14.872878951	14.880856305	14.883370649	14.821261690
 60 threads
	minor-faults		15,733,258	7,892,809	1,973,393	988,266		594,789		164,994		51,691
	time, seconds		26.577302548	25.692397770	18.728863715	20.153026398	21.619101933	17.745086260	17.613215273
 120 threads
	minor-faults		31,471,111	15,816,616	3,959,209	1,978,685	1,008,299	264,635		96,010
	time, seconds		41.835322703	40.459786095	36.085306105	35.313894834	35.814445675	36.552633793	34.289210594

Worst case scenario: we touch one page every 2M, so faultaround is useless
here. Just to demonstrate how much overhead we add.

Touch only one page in page table in 16GiB file
FAULT_AROUND_ORDER		Baseline	1		3		4		5		7		9
 1 thread
	minor-faults		8,372		8,324		8,270		8,260		8,249		8,239		8,237
	time, seconds		0.039892712	0.045369149	0.051846126	0.063681685	0.079095975	0.17652406	0.541213386
 8 threads
	minor-faults		65,731		65,681		65,628		65,620		65,608		65,599		65,596
	time, seconds		0.124159196	0.488600638	0.156854426	0.191901957	0.242631486	0.543569456	1.677303984
 32 threads
	minor-faults		262,388		262,341		262,285		262,276		262,266		262,257		263,183
	time, seconds		0.452421421	0.488600638	0.565020946	0.648229739	0.789850823	1.651584361	5.000361559
 60 threads
	minor-faults		491,822		491,792		491,723		491,711		491,701		491,691		491,825
	time, seconds		0.763288616	0.869620515	0.980727360	1.161732354	1.466915814	3.04041448	9.308612938
 120 threads
	minor-faults		983,466		983,655		983,366		983,372		983,363		984,083		984,164
	time, seconds		1.595846553	1.667902182	2.008959376	2.425380942	2.941368804	5.977807890	18.401846125

Kirill A. Shutemov (2):
  mm: introduce vm_ops->map_pages()
  mm: implement ->map_pages for page cache

 Documentation/filesystems/Locking | 10 ++++++
 fs/9p/vfs_file.c                  |  2 ++
 fs/btrfs/file.c                   |  1 +
 fs/cifs/file.c                    |  1 +
 fs/ext4/file.c                    |  1 +
 fs/f2fs/file.c                    |  1 +
 fs/fuse/file.c                    |  1 +
 fs/gfs2/file.c                    |  1 +
 fs/nfs/file.c                     |  1 +
 fs/nilfs2/file.c                  |  1 +
 fs/ubifs/file.c                   |  1 +
 fs/xfs/xfs_file.c                 |  1 +
 include/linux/mm.h                |  9 +++++
 mm/filemap.c                      | 72 +++++++++++++++++++++++++++++++++++++++
 mm/memory.c                       | 67 ++++++++++++++++++++++++++++++++++--
 mm/nommu.c                        |  6 ++++
 16 files changed, 173 insertions(+), 3 deletions(-)

-- 
1.9.0


^ permalink raw reply	[flat|nested] 73+ messages in thread

* [PATCHv3 0/2] mm: map few pages around fault address if they are in page cache
@ 2014-02-27 19:53 ` Kirill A. Shutemov
  0 siblings, 0 replies; 73+ messages in thread
From: Kirill A. Shutemov @ 2014-02-27 19:53 UTC (permalink / raw)
  To: Linus Torvalds, Andrew Morton, Mel Gorman, Rik van Riel
  Cc: Andi Kleen, Matthew Wilcox, Dave Hansen, Alexander Viro,
	Dave Chinner, Ning Qu, linux-mm, linux-fsdevel, linux-kernel,
	Kirill A. Shutemov

Here's new version of faultaround patchset. It took a while to tune it and
collect performance data.

First patch adds new callback ->map_pages to vm_operations_struct.

->map_pages() is called when VM asks to map easy accessible pages.
Filesystem should find and map pages associated with offsets from "pgoff"
till "max_pgoff". ->map_pages() is called with page table locked and must
not block.  If it's not possible to reach a page without blocking,
filesystem should skip it. Filesystem should use do_set_pte() to setup
page table entry. Pointer to entry associated with offset "pgoff" is
passed in "pte" field in vm_fault structure. Pointers to entries for other
offsets should be calculated relative to "pte".

Currently VM use ->map_pages only on read page fault path. We try to map
FAULT_AROUND_PAGES a time. FAULT_AROUND_PAGES is 16 for now. Performance
data for different FAULT_AROUND_ORDER is below.

TODO:
 - implement ->map_pages() for shmem/tmpfs;
 - modify get_user_pages() to be able to use ->map_pages() and implement
   mmap(MAP_POPULATE|MAP_NONBLOCK) on top.

Please consider applying.

=========================================================================
Tested on 4-socket machine (120 threads) with 128GiB of RAM.

Few real-world workloads. The sweet spot for FAULT_AROUND_ORDER here is
somewhere between 3 and 5. Let's say 4 :)

Linux build (make -j60)
FAULT_AROUND_ORDER		Baseline	1		3		4		5		7		9
	minor-faults		283,301,572	247,151,987	212,215,789	204,772,882	199,568,944	194,703,779	193,381,485
	time, seconds		151.227629483	153.920996480	151.356125472	150.863792049	150.879207877	151.150764954	151.450962358
Linux rebuild (make -j60)
FAULT_AROUND_ORDER		Baseline	1		3		4		5		7		9
	minor-faults		5,396,854	4,148,444	2,855,286	2,577,282	2,361,957	2,169,573	2,112,643
	time, seconds		27.404543757	27.559725591	27.030057426	26.855045126	26.678618635	26.974523490	26.761320095
Git test suite (make -j60 test)
FAULT_AROUND_ORDER		Baseline	1		3		4		5		7		9
	minor-faults		129,591,823	99,200,751	66,106,718	57,606,410	51,510,808	45,776,813	44,085,515
	time, seconds		66.087215026	64.784546905	64.401156567	65.282708668	66.034016829	66.793780811	67.237810413

Two synthetic tests: access every word in file in sequential/random order.
It doesn't improve much after FAULT_AROUND_ORDER == 4.

Sequential access 16GiB file
FAULT_AROUND_ORDER		Baseline	1		3		4		5		7		9
 1 thread
	minor-faults		4,195,437	2,098,275	525,068		262,251		131,170		32,856		8,282
	time, seconds		7.250461742	6.461711074	5.493859139	5.488488147	5.707213983	5.898510832	5.109232856
 8 threads
	minor-faults		33,557,540	16,892,728	4,515,848	2,366,999	1,423,382	442,732		142,339
	time, seconds		16.649304881	9.312555263	6.612490639	6.394316732	6.669827501	6.75078944	6.371900528
 32 threads
	minor-faults		134,228,222	67,526,810	17,725,386	9,716,537	4,763,731	1,668,921	537,200
	time, seconds		49.164430543	29.712060103	12.938649729	10.175151004	11.840094583	9.594081325	9.928461797
 60 threads
	minor-faults		251,687,988	126,146,952	32,919,406	18,208,804	10,458,947	2,733,907	928,217
	time, seconds		86.260656897	49.626551828	22.335007632	17.608243696	16.523119035	16.339489186	16.326390902
 120 threads
	minor-faults		503,352,863	252,939,677	67,039,168	35,191,827	19,170,091	4,688,357	1,471,862
	time, seconds		124.589206333	79.757867787	39.508707872	32.167281632	29.972989292	28.729834575	28.042251622
Random access 1GiB file
 1 thread
	minor-faults		262,636		132,743		34,369		17,299		8,527		3,451		1,222
	time, seconds		15.351890914	16.613802482	16.569227308	15.179220992	16.557356122	16.578247824	15.365266994
 8 threads
	minor-faults		2,098,948	1,061,871	273,690		154,501		87,110		25,663		7,384
	time, seconds		15.040026343	15.096933500	14.474757288	14.289129964	14.411537468	14.296316837	14.395635804
 32 threads
	minor-faults		8,390,734	4,231,023	1,054,432	528,847		269,242		97,746		26,881
	time, seconds		20.430433109	21.585235358	22.115062928	14.872878951	14.880856305	14.883370649	14.821261690
 60 threads
	minor-faults		15,733,258	7,892,809	1,973,393	988,266		594,789		164,994		51,691
	time, seconds		26.577302548	25.692397770	18.728863715	20.153026398	21.619101933	17.745086260	17.613215273
 120 threads
	minor-faults		31,471,111	15,816,616	3,959,209	1,978,685	1,008,299	264,635		96,010
	time, seconds		41.835322703	40.459786095	36.085306105	35.313894834	35.814445675	36.552633793	34.289210594

Worst case scenario: we touch one page every 2M, so faultaround is useless
here. Just to demonstrate how much overhead we add.

Touch only one page in page table in 16GiB file
FAULT_AROUND_ORDER		Baseline	1		3		4		5		7		9
 1 thread
	minor-faults		8,372		8,324		8,270		8,260		8,249		8,239		8,237
	time, seconds		0.039892712	0.045369149	0.051846126	0.063681685	0.079095975	0.17652406	0.541213386
 8 threads
	minor-faults		65,731		65,681		65,628		65,620		65,608		65,599		65,596
	time, seconds		0.124159196	0.488600638	0.156854426	0.191901957	0.242631486	0.543569456	1.677303984
 32 threads
	minor-faults		262,388		262,341		262,285		262,276		262,266		262,257		263,183
	time, seconds		0.452421421	0.488600638	0.565020946	0.648229739	0.789850823	1.651584361	5.000361559
 60 threads
	minor-faults		491,822		491,792		491,723		491,711		491,701		491,691		491,825
	time, seconds		0.763288616	0.869620515	0.980727360	1.161732354	1.466915814	3.04041448	9.308612938
 120 threads
	minor-faults		983,466		983,655		983,366		983,372		983,363		984,083		984,164
	time, seconds		1.595846553	1.667902182	2.008959376	2.425380942	2.941368804	5.977807890	18.401846125

Kirill A. Shutemov (2):
  mm: introduce vm_ops->map_pages()
  mm: implement ->map_pages for page cache

 Documentation/filesystems/Locking | 10 ++++++
 fs/9p/vfs_file.c                  |  2 ++
 fs/btrfs/file.c                   |  1 +
 fs/cifs/file.c                    |  1 +
 fs/ext4/file.c                    |  1 +
 fs/f2fs/file.c                    |  1 +
 fs/fuse/file.c                    |  1 +
 fs/gfs2/file.c                    |  1 +
 fs/nfs/file.c                     |  1 +
 fs/nilfs2/file.c                  |  1 +
 fs/ubifs/file.c                   |  1 +
 fs/xfs/xfs_file.c                 |  1 +
 include/linux/mm.h                |  9 +++++
 mm/filemap.c                      | 72 +++++++++++++++++++++++++++++++++++++++
 mm/memory.c                       | 67 ++++++++++++++++++++++++++++++++++--
 mm/nommu.c                        |  6 ++++
 16 files changed, 173 insertions(+), 3 deletions(-)

-- 
1.9.0

--
To unsubscribe, send a message with 'unsubscribe linux-mm' in
the body to majordomo@kvack.org.  For more info on Linux MM,
see: http://www.linux-mm.org/ .
Don't email: <a href=mailto:"dont@kvack.org"> email@kvack.org </a>

^ permalink raw reply	[flat|nested] 73+ messages in thread

* [PATCHv3 1/2] mm: introduce vm_ops->map_pages()
  2014-02-27 19:53 ` Kirill A. Shutemov
@ 2014-02-27 19:53   ` Kirill A. Shutemov
  -1 siblings, 0 replies; 73+ messages in thread
From: Kirill A. Shutemov @ 2014-02-27 19:53 UTC (permalink / raw)
  To: Linus Torvalds, Andrew Morton, Mel Gorman, Rik van Riel
  Cc: Andi Kleen, Matthew Wilcox, Dave Hansen, Alexander Viro,
	Dave Chinner, Ning Qu, linux-mm, linux-fsdevel, linux-kernel,
	Kirill A. Shutemov

The patch introduces new vm_ops callback ->map_pages() and uses it for
mapping easy accessible pages around fault address.

On read page fault, if filesystem provides ->map_pages(), we try to map
up to FAULT_AROUND_PAGES pages around page fault address in hope to
reduce number of minor page faults.

We call ->map_pages first and use ->fault() as fallback if page by the
offset is not ready to be mapped (cold page cache or something).

Signed-off-by: Kirill A. Shutemov <kirill.shutemov@linux.intel.com>
---
 Documentation/filesystems/Locking | 10 ++++++
 include/linux/mm.h                |  8 +++++
 mm/memory.c                       | 67 +++++++++++++++++++++++++++++++++++++--
 3 files changed, 82 insertions(+), 3 deletions(-)

diff --git a/Documentation/filesystems/Locking b/Documentation/filesystems/Locking
index 5b0c083d7c0e..767930f04a12 100644
--- a/Documentation/filesystems/Locking
+++ b/Documentation/filesystems/Locking
@@ -525,6 +525,7 @@ locking rules:
 open:		yes
 close:		yes
 fault:		yes		can return with page locked
+map_pages:	yes
 page_mkwrite:	yes		can return with page locked
 access:		yes
 
@@ -536,6 +537,15 @@ the page, then ensure it is not already truncated (the page lock will block
 subsequent truncate), and then return with VM_FAULT_LOCKED, and the page
 locked. The VM will unlock the page.
 
+	->map_pages() is called when VM asks to map easy accessible pages.
+Filesystem should find and map pages associated with offsets from "pgoff"
+till "max_pgoff". ->map_pages() is called with page table locked and must
+not block.  If it's not possible to reach a page without blocking,
+filesystem should skip it. Filesystem should use do_set_pte() to setup
+page table entry. Pointer to entry associated with offset "pgoff" is
+passed in "pte" field in vm_fault structure. Pointers to entries for other
+offsets should be calculated relative to "pte".
+
 	->page_mkwrite() is called when a previously read-only pte is
 about to become writeable. The filesystem again must ensure that there are
 no truncate/invalidate races, and then return with the page locked. If
diff --git a/include/linux/mm.h b/include/linux/mm.h
index f28f46eade6a..aed92cb17127 100644
--- a/include/linux/mm.h
+++ b/include/linux/mm.h
@@ -210,6 +210,10 @@ struct vm_fault {
 					 * is set (which is also implied by
 					 * VM_FAULT_ERROR).
 					 */
+	/* for ->map_pages() only */
+	pgoff_t max_pgoff;		/* map pages for offset from pgoff till
+					 * max_pgoff inclusive */
+	pte_t *pte;			/* pte entry associated with ->pgoff */
 };
 
 /*
@@ -221,6 +225,7 @@ struct vm_operations_struct {
 	void (*open)(struct vm_area_struct * area);
 	void (*close)(struct vm_area_struct * area);
 	int (*fault)(struct vm_area_struct *vma, struct vm_fault *vmf);
+	void (*map_pages)(struct vm_area_struct *vma, struct vm_fault *vmf);
 
 	/* notification that a previously read-only page is about to become
 	 * writable, if an error is returned it will cause a SIGBUS */
@@ -571,6 +576,9 @@ static inline pte_t maybe_mkwrite(pte_t pte, struct vm_area_struct *vma)
 		pte = pte_mkwrite(pte);
 	return pte;
 }
+
+void do_set_pte(struct vm_area_struct *vma, unsigned long address,
+		struct page *page, pte_t *pte, bool write, bool anon);
 #endif
 
 /*
diff --git a/mm/memory.c b/mm/memory.c
index 7f52c46ef1e1..3f17a60e817f 100644
--- a/mm/memory.c
+++ b/mm/memory.c
@@ -3318,7 +3318,8 @@ static int __do_fault(struct vm_area_struct *vma, unsigned long address,
 	return ret;
 }
 
-static void do_set_pte(struct vm_area_struct *vma, unsigned long address,
+
+void do_set_pte(struct vm_area_struct *vma, unsigned long address,
 		struct page *page, pte_t *pte, bool write, bool anon)
 {
 	pte_t entry;
@@ -3342,6 +3343,52 @@ static void do_set_pte(struct vm_area_struct *vma, unsigned long address,
 	update_mmu_cache(vma, address, pte);
 }
 
+#define FAULT_AROUND_ORDER 4
+#define FAULT_AROUND_PAGES (1UL << FAULT_AROUND_ORDER)
+#define FAULT_AROUND_MASK ~((1UL << (PAGE_SHIFT + FAULT_AROUND_ORDER)) - 1)
+
+static void do_fault_around(struct vm_area_struct *vma, unsigned long address,
+		pte_t *pte, pgoff_t pgoff, unsigned int flags)
+{
+	unsigned long start_addr;
+	pgoff_t max_pgoff;
+	struct vm_fault vmf;
+	int off;
+
+	BUILD_BUG_ON(FAULT_AROUND_PAGES > PTRS_PER_PTE);
+
+	start_addr = max(address & FAULT_AROUND_MASK, vma->vm_start);
+	off = ((address - start_addr) >> PAGE_SHIFT) & (PTRS_PER_PTE - 1);
+	pte -= off;
+	pgoff -= off;
+
+	/*
+	 *  max_pgoff is either end of page table or end of vma
+	 *  or FAULT_AROUND_PAGES from pgoff, depending what is neast.
+	 */
+	max_pgoff = pgoff - ((start_addr >> PAGE_SHIFT) & (PTRS_PER_PTE - 1)) +
+		PTRS_PER_PTE - 1;
+	max_pgoff = min3(max_pgoff, vma_pages(vma) + vma->vm_pgoff - 1,
+			pgoff + FAULT_AROUND_PAGES - 1);
+
+	/* Check if it makes any sense to call ->map_pages */
+	while (!pte_none(*pte)) {
+		if (++pgoff > max_pgoff)
+			return;
+		start_addr += PAGE_SIZE;
+		if (start_addr >= vma->vm_end)
+			return;
+		pte++;
+	}
+
+	vmf.virtual_address = (void __user *) start_addr;
+	vmf.pte = pte;
+	vmf.pgoff = pgoff;
+	vmf.max_pgoff = max_pgoff;
+	vmf.flags = flags;
+	vma->vm_ops->map_pages(vma, &vmf);
+}
+
 static int do_read_fault(struct mm_struct *mm, struct vm_area_struct *vma,
 		unsigned long address, pmd_t *pmd,
 		pgoff_t pgoff, unsigned int flags, pte_t orig_pte)
@@ -3349,7 +3396,20 @@ static int do_read_fault(struct mm_struct *mm, struct vm_area_struct *vma,
 	struct page *fault_page;
 	spinlock_t *ptl;
 	pte_t *pte;
-	int ret;
+	int ret = 0;
+
+	/*
+	 * Let's call ->map_pages() first and use ->fault() as fallback
+	 * if page by the offset is not ready to be mapped (cold cache or
+	 * something).
+	 */
+	if (vma->vm_ops->map_pages) {
+		pte = pte_offset_map_lock(mm, pmd, address, &ptl);
+		do_fault_around(vma, address, pte, pgoff, flags);
+		if (!pte_same(*pte, orig_pte))
+			goto unlock_out;
+		pte_unmap_unlock(pte, ptl);
+	}
 
 	ret = __do_fault(vma, address, pgoff, flags, &fault_page);
 	if (unlikely(ret & (VM_FAULT_ERROR | VM_FAULT_NOPAGE | VM_FAULT_RETRY)))
@@ -3363,8 +3423,9 @@ static int do_read_fault(struct mm_struct *mm, struct vm_area_struct *vma,
 		return ret;
 	}
 	do_set_pte(vma, address, fault_page, pte, false, false);
-	pte_unmap_unlock(pte, ptl);
 	unlock_page(fault_page);
+unlock_out:
+	pte_unmap_unlock(pte, ptl);
 	return ret;
 }
 
-- 
1.9.0


^ permalink raw reply related	[flat|nested] 73+ messages in thread

* [PATCHv3 1/2] mm: introduce vm_ops->map_pages()
@ 2014-02-27 19:53   ` Kirill A. Shutemov
  0 siblings, 0 replies; 73+ messages in thread
From: Kirill A. Shutemov @ 2014-02-27 19:53 UTC (permalink / raw)
  To: Linus Torvalds, Andrew Morton, Mel Gorman, Rik van Riel
  Cc: Andi Kleen, Matthew Wilcox, Dave Hansen, Alexander Viro,
	Dave Chinner, Ning Qu, linux-mm, linux-fsdevel, linux-kernel,
	Kirill A. Shutemov

The patch introduces new vm_ops callback ->map_pages() and uses it for
mapping easy accessible pages around fault address.

On read page fault, if filesystem provides ->map_pages(), we try to map
up to FAULT_AROUND_PAGES pages around page fault address in hope to
reduce number of minor page faults.

We call ->map_pages first and use ->fault() as fallback if page by the
offset is not ready to be mapped (cold page cache or something).

Signed-off-by: Kirill A. Shutemov <kirill.shutemov@linux.intel.com>
---
 Documentation/filesystems/Locking | 10 ++++++
 include/linux/mm.h                |  8 +++++
 mm/memory.c                       | 67 +++++++++++++++++++++++++++++++++++++--
 3 files changed, 82 insertions(+), 3 deletions(-)

diff --git a/Documentation/filesystems/Locking b/Documentation/filesystems/Locking
index 5b0c083d7c0e..767930f04a12 100644
--- a/Documentation/filesystems/Locking
+++ b/Documentation/filesystems/Locking
@@ -525,6 +525,7 @@ locking rules:
 open:		yes
 close:		yes
 fault:		yes		can return with page locked
+map_pages:	yes
 page_mkwrite:	yes		can return with page locked
 access:		yes
 
@@ -536,6 +537,15 @@ the page, then ensure it is not already truncated (the page lock will block
 subsequent truncate), and then return with VM_FAULT_LOCKED, and the page
 locked. The VM will unlock the page.
 
+	->map_pages() is called when VM asks to map easy accessible pages.
+Filesystem should find and map pages associated with offsets from "pgoff"
+till "max_pgoff". ->map_pages() is called with page table locked and must
+not block.  If it's not possible to reach a page without blocking,
+filesystem should skip it. Filesystem should use do_set_pte() to setup
+page table entry. Pointer to entry associated with offset "pgoff" is
+passed in "pte" field in vm_fault structure. Pointers to entries for other
+offsets should be calculated relative to "pte".
+
 	->page_mkwrite() is called when a previously read-only pte is
 about to become writeable. The filesystem again must ensure that there are
 no truncate/invalidate races, and then return with the page locked. If
diff --git a/include/linux/mm.h b/include/linux/mm.h
index f28f46eade6a..aed92cb17127 100644
--- a/include/linux/mm.h
+++ b/include/linux/mm.h
@@ -210,6 +210,10 @@ struct vm_fault {
 					 * is set (which is also implied by
 					 * VM_FAULT_ERROR).
 					 */
+	/* for ->map_pages() only */
+	pgoff_t max_pgoff;		/* map pages for offset from pgoff till
+					 * max_pgoff inclusive */
+	pte_t *pte;			/* pte entry associated with ->pgoff */
 };
 
 /*
@@ -221,6 +225,7 @@ struct vm_operations_struct {
 	void (*open)(struct vm_area_struct * area);
 	void (*close)(struct vm_area_struct * area);
 	int (*fault)(struct vm_area_struct *vma, struct vm_fault *vmf);
+	void (*map_pages)(struct vm_area_struct *vma, struct vm_fault *vmf);
 
 	/* notification that a previously read-only page is about to become
 	 * writable, if an error is returned it will cause a SIGBUS */
@@ -571,6 +576,9 @@ static inline pte_t maybe_mkwrite(pte_t pte, struct vm_area_struct *vma)
 		pte = pte_mkwrite(pte);
 	return pte;
 }
+
+void do_set_pte(struct vm_area_struct *vma, unsigned long address,
+		struct page *page, pte_t *pte, bool write, bool anon);
 #endif
 
 /*
diff --git a/mm/memory.c b/mm/memory.c
index 7f52c46ef1e1..3f17a60e817f 100644
--- a/mm/memory.c
+++ b/mm/memory.c
@@ -3318,7 +3318,8 @@ static int __do_fault(struct vm_area_struct *vma, unsigned long address,
 	return ret;
 }
 
-static void do_set_pte(struct vm_area_struct *vma, unsigned long address,
+
+void do_set_pte(struct vm_area_struct *vma, unsigned long address,
 		struct page *page, pte_t *pte, bool write, bool anon)
 {
 	pte_t entry;
@@ -3342,6 +3343,52 @@ static void do_set_pte(struct vm_area_struct *vma, unsigned long address,
 	update_mmu_cache(vma, address, pte);
 }
 
+#define FAULT_AROUND_ORDER 4
+#define FAULT_AROUND_PAGES (1UL << FAULT_AROUND_ORDER)
+#define FAULT_AROUND_MASK ~((1UL << (PAGE_SHIFT + FAULT_AROUND_ORDER)) - 1)
+
+static void do_fault_around(struct vm_area_struct *vma, unsigned long address,
+		pte_t *pte, pgoff_t pgoff, unsigned int flags)
+{
+	unsigned long start_addr;
+	pgoff_t max_pgoff;
+	struct vm_fault vmf;
+	int off;
+
+	BUILD_BUG_ON(FAULT_AROUND_PAGES > PTRS_PER_PTE);
+
+	start_addr = max(address & FAULT_AROUND_MASK, vma->vm_start);
+	off = ((address - start_addr) >> PAGE_SHIFT) & (PTRS_PER_PTE - 1);
+	pte -= off;
+	pgoff -= off;
+
+	/*
+	 *  max_pgoff is either end of page table or end of vma
+	 *  or FAULT_AROUND_PAGES from pgoff, depending what is neast.
+	 */
+	max_pgoff = pgoff - ((start_addr >> PAGE_SHIFT) & (PTRS_PER_PTE - 1)) +
+		PTRS_PER_PTE - 1;
+	max_pgoff = min3(max_pgoff, vma_pages(vma) + vma->vm_pgoff - 1,
+			pgoff + FAULT_AROUND_PAGES - 1);
+
+	/* Check if it makes any sense to call ->map_pages */
+	while (!pte_none(*pte)) {
+		if (++pgoff > max_pgoff)
+			return;
+		start_addr += PAGE_SIZE;
+		if (start_addr >= vma->vm_end)
+			return;
+		pte++;
+	}
+
+	vmf.virtual_address = (void __user *) start_addr;
+	vmf.pte = pte;
+	vmf.pgoff = pgoff;
+	vmf.max_pgoff = max_pgoff;
+	vmf.flags = flags;
+	vma->vm_ops->map_pages(vma, &vmf);
+}
+
 static int do_read_fault(struct mm_struct *mm, struct vm_area_struct *vma,
 		unsigned long address, pmd_t *pmd,
 		pgoff_t pgoff, unsigned int flags, pte_t orig_pte)
@@ -3349,7 +3396,20 @@ static int do_read_fault(struct mm_struct *mm, struct vm_area_struct *vma,
 	struct page *fault_page;
 	spinlock_t *ptl;
 	pte_t *pte;
-	int ret;
+	int ret = 0;
+
+	/*
+	 * Let's call ->map_pages() first and use ->fault() as fallback
+	 * if page by the offset is not ready to be mapped (cold cache or
+	 * something).
+	 */
+	if (vma->vm_ops->map_pages) {
+		pte = pte_offset_map_lock(mm, pmd, address, &ptl);
+		do_fault_around(vma, address, pte, pgoff, flags);
+		if (!pte_same(*pte, orig_pte))
+			goto unlock_out;
+		pte_unmap_unlock(pte, ptl);
+	}
 
 	ret = __do_fault(vma, address, pgoff, flags, &fault_page);
 	if (unlikely(ret & (VM_FAULT_ERROR | VM_FAULT_NOPAGE | VM_FAULT_RETRY)))
@@ -3363,8 +3423,9 @@ static int do_read_fault(struct mm_struct *mm, struct vm_area_struct *vma,
 		return ret;
 	}
 	do_set_pte(vma, address, fault_page, pte, false, false);
-	pte_unmap_unlock(pte, ptl);
 	unlock_page(fault_page);
+unlock_out:
+	pte_unmap_unlock(pte, ptl);
 	return ret;
 }
 
-- 
1.9.0

--
To unsubscribe, send a message with 'unsubscribe linux-mm' in
the body to majordomo@kvack.org.  For more info on Linux MM,
see: http://www.linux-mm.org/ .
Don't email: <a href=mailto:"dont@kvack.org"> email@kvack.org </a>

^ permalink raw reply related	[flat|nested] 73+ messages in thread

* [PATCHv3 2/2] mm: implement ->map_pages for page cache
  2014-02-27 19:53 ` Kirill A. Shutemov
@ 2014-02-27 19:53   ` Kirill A. Shutemov
  -1 siblings, 0 replies; 73+ messages in thread
From: Kirill A. Shutemov @ 2014-02-27 19:53 UTC (permalink / raw)
  To: Linus Torvalds, Andrew Morton, Mel Gorman, Rik van Riel
  Cc: Andi Kleen, Matthew Wilcox, Dave Hansen, Alexander Viro,
	Dave Chinner, Ning Qu, linux-mm, linux-fsdevel, linux-kernel,
	Kirill A. Shutemov

filemap_map_pages() is generic implementation of ->map_pages() for
filesystems who uses page cache.

It should be safe to use filemap_map_pages() for ->map_pages() if
filesystem use filemap_fault() for ->fault().

Signed-off-by: Kirill A. Shutemov <kirill.shutemov@linux.intel.com>
---
 fs/9p/vfs_file.c   |  2 ++
 fs/btrfs/file.c    |  1 +
 fs/cifs/file.c     |  1 +
 fs/ext4/file.c     |  1 +
 fs/f2fs/file.c     |  1 +
 fs/fuse/file.c     |  1 +
 fs/gfs2/file.c     |  1 +
 fs/nfs/file.c      |  1 +
 fs/nilfs2/file.c   |  1 +
 fs/ubifs/file.c    |  1 +
 fs/xfs/xfs_file.c  |  1 +
 include/linux/mm.h |  1 +
 mm/filemap.c       | 72 ++++++++++++++++++++++++++++++++++++++++++++++++++++++
 mm/nommu.c         |  6 +++++
 14 files changed, 91 insertions(+)

diff --git a/fs/9p/vfs_file.c b/fs/9p/vfs_file.c
index a16b0ff497ca..d8223209d4b1 100644
--- a/fs/9p/vfs_file.c
+++ b/fs/9p/vfs_file.c
@@ -832,6 +832,7 @@ static void v9fs_mmap_vm_close(struct vm_area_struct *vma)
 
 static const struct vm_operations_struct v9fs_file_vm_ops = {
 	.fault = filemap_fault,
+	.map_pages = filemap_map_pages,
 	.page_mkwrite = v9fs_vm_page_mkwrite,
 	.remap_pages = generic_file_remap_pages,
 };
@@ -839,6 +840,7 @@ static const struct vm_operations_struct v9fs_file_vm_ops = {
 static const struct vm_operations_struct v9fs_mmap_file_vm_ops = {
 	.close = v9fs_mmap_vm_close,
 	.fault = filemap_fault,
+	.map_pages = filemap_map_pages,
 	.page_mkwrite = v9fs_vm_page_mkwrite,
 	.remap_pages = generic_file_remap_pages,
 };
diff --git a/fs/btrfs/file.c b/fs/btrfs/file.c
index 0165b8672f09..d1f0415bbfb1 100644
--- a/fs/btrfs/file.c
+++ b/fs/btrfs/file.c
@@ -1993,6 +1993,7 @@ out:
 
 static const struct vm_operations_struct btrfs_file_vm_ops = {
 	.fault		= filemap_fault,
+	.map_pages	= filemap_map_pages,
 	.page_mkwrite	= btrfs_page_mkwrite,
 	.remap_pages	= generic_file_remap_pages,
 };
diff --git a/fs/cifs/file.c b/fs/cifs/file.c
index 755584684f6c..6d081de57fdb 100644
--- a/fs/cifs/file.c
+++ b/fs/cifs/file.c
@@ -3094,6 +3094,7 @@ cifs_page_mkwrite(struct vm_area_struct *vma, struct vm_fault *vmf)
 
 static struct vm_operations_struct cifs_file_vm_ops = {
 	.fault = filemap_fault,
+	.map_pages = filemap_map_pages,
 	.page_mkwrite = cifs_page_mkwrite,
 	.remap_pages = generic_file_remap_pages,
 };
diff --git a/fs/ext4/file.c b/fs/ext4/file.c
index 1a5073959f32..46e78f8a133f 100644
--- a/fs/ext4/file.c
+++ b/fs/ext4/file.c
@@ -200,6 +200,7 @@ ext4_file_write(struct kiocb *iocb, const struct iovec *iov,
 
 static const struct vm_operations_struct ext4_file_vm_ops = {
 	.fault		= filemap_fault,
+	.map_pages	= filemap_map_pages,
 	.page_mkwrite   = ext4_page_mkwrite,
 	.remap_pages	= generic_file_remap_pages,
 };
diff --git a/fs/f2fs/file.c b/fs/f2fs/file.c
index 0dfcef53a6ed..129a3bdb05ca 100644
--- a/fs/f2fs/file.c
+++ b/fs/f2fs/file.c
@@ -84,6 +84,7 @@ out:
 
 static const struct vm_operations_struct f2fs_file_vm_ops = {
 	.fault		= filemap_fault,
+	.map_pages	= filemap_map_pages,
 	.page_mkwrite	= f2fs_vm_page_mkwrite,
 	.remap_pages	= generic_file_remap_pages,
 };
diff --git a/fs/fuse/file.c b/fs/fuse/file.c
index 77bcc303c3ae..da99a76668f8 100644
--- a/fs/fuse/file.c
+++ b/fs/fuse/file.c
@@ -1940,6 +1940,7 @@ static int fuse_page_mkwrite(struct vm_area_struct *vma, struct vm_fault *vmf)
 static const struct vm_operations_struct fuse_file_vm_ops = {
 	.close		= fuse_vma_close,
 	.fault		= filemap_fault,
+	.map_pages	= filemap_map_pages,
 	.page_mkwrite	= fuse_page_mkwrite,
 	.remap_pages	= generic_file_remap_pages,
 };
diff --git a/fs/gfs2/file.c b/fs/gfs2/file.c
index efc078f0ee4e..2739f3c3bb8f 100644
--- a/fs/gfs2/file.c
+++ b/fs/gfs2/file.c
@@ -494,6 +494,7 @@ out:
 
 static const struct vm_operations_struct gfs2_vm_ops = {
 	.fault = filemap_fault,
+	.map_pages = filemap_map_pages,
 	.page_mkwrite = gfs2_page_mkwrite,
 	.remap_pages = generic_file_remap_pages,
 };
diff --git a/fs/nfs/file.c b/fs/nfs/file.c
index 5bb790a69c71..284ca901fe16 100644
--- a/fs/nfs/file.c
+++ b/fs/nfs/file.c
@@ -617,6 +617,7 @@ out:
 
 static const struct vm_operations_struct nfs_file_vm_ops = {
 	.fault = filemap_fault,
+	.map_pages = filemap_map_pages,
 	.page_mkwrite = nfs_vm_page_mkwrite,
 	.remap_pages = generic_file_remap_pages,
 };
diff --git a/fs/nilfs2/file.c b/fs/nilfs2/file.c
index 08fdb77852ac..f3a82fbcae02 100644
--- a/fs/nilfs2/file.c
+++ b/fs/nilfs2/file.c
@@ -134,6 +134,7 @@ static int nilfs_page_mkwrite(struct vm_area_struct *vma, struct vm_fault *vmf)
 
 static const struct vm_operations_struct nilfs_file_vm_ops = {
 	.fault		= filemap_fault,
+	.map_pages	= filemap_map_pages,
 	.page_mkwrite	= nilfs_page_mkwrite,
 	.remap_pages	= generic_file_remap_pages,
 };
diff --git a/fs/ubifs/file.c b/fs/ubifs/file.c
index 123c79b7261e..4f34dbae823d 100644
--- a/fs/ubifs/file.c
+++ b/fs/ubifs/file.c
@@ -1538,6 +1538,7 @@ out_unlock:
 
 static const struct vm_operations_struct ubifs_file_vm_ops = {
 	.fault        = filemap_fault,
+	.map_pages = filemap_map_pages,
 	.page_mkwrite = ubifs_vm_page_mkwrite,
 	.remap_pages = generic_file_remap_pages,
 };
diff --git a/fs/xfs/xfs_file.c b/fs/xfs/xfs_file.c
index 64b48eade91d..b2be204e16ca 100644
--- a/fs/xfs/xfs_file.c
+++ b/fs/xfs/xfs_file.c
@@ -1465,6 +1465,7 @@ const struct file_operations xfs_dir_file_operations = {
 
 static const struct vm_operations_struct xfs_file_vm_ops = {
 	.fault		= filemap_fault,
+	.map_pages	= filemap_map_pages,
 	.page_mkwrite	= xfs_vm_page_mkwrite,
 	.remap_pages	= generic_file_remap_pages,
 };
diff --git a/include/linux/mm.h b/include/linux/mm.h
index aed92cb17127..7c4bfb725e63 100644
--- a/include/linux/mm.h
+++ b/include/linux/mm.h
@@ -1818,6 +1818,7 @@ extern void truncate_inode_pages_range(struct address_space *,
 
 /* generic vm_area_ops exported for stackable file systems */
 extern int filemap_fault(struct vm_area_struct *, struct vm_fault *);
+extern void filemap_map_pages(struct vm_area_struct *vma, struct vm_fault *vmf);
 extern int filemap_page_mkwrite(struct vm_area_struct *vma, struct vm_fault *vmf);
 
 /* mm/page-writeback.c */
diff --git a/mm/filemap.c b/mm/filemap.c
index 7a13f6ac5421..1bc12a96060d 100644
--- a/mm/filemap.c
+++ b/mm/filemap.c
@@ -33,6 +33,7 @@
 #include <linux/hardirq.h> /* for BUG_ON(!in_atomic()) only */
 #include <linux/memcontrol.h>
 #include <linux/cleancache.h>
+#include <linux/rmap.h>
 #include "internal.h"
 
 #define CREATE_TRACE_POINTS
@@ -1726,6 +1727,76 @@ page_not_uptodate:
 }
 EXPORT_SYMBOL(filemap_fault);
 
+void filemap_map_pages(struct vm_area_struct *vma, struct vm_fault *vmf)
+{
+	struct radix_tree_iter iter;
+	void **slot;
+	struct file *file = vma->vm_file;
+	struct address_space *mapping = file->f_mapping;
+	loff_t size;
+	struct page *page;
+	unsigned long address = (unsigned long) vmf->virtual_address;
+	unsigned long addr;
+	pte_t *pte;
+
+	rcu_read_lock();
+	radix_tree_for_each_slot(slot, &mapping->page_tree, &iter, vmf->pgoff) {
+		if (iter.index > vmf->max_pgoff)
+			break;
+repeat:
+		page = radix_tree_deref_slot(slot);
+		if (radix_tree_exception(page)) {
+			if (radix_tree_deref_retry(page))
+				break;
+			else
+				goto next;
+		}
+
+		if (!page_cache_get_speculative(page))
+			goto repeat;
+
+		/* Has the page moved? */
+		if (unlikely(page != *slot)) {
+			page_cache_release(page);
+			goto repeat;
+		}
+
+		if (!PageUptodate(page) ||
+				PageReadahead(page) ||
+				PageHWPoison(page))
+			goto skip;
+		if (!trylock_page(page))
+			goto skip;
+
+		if (page->mapping != mapping || !PageUptodate(page))
+			goto unlock;
+
+		size = i_size_read(mapping->host) + PAGE_CACHE_SIZE - 1;
+		if (page->index >= size	>> PAGE_CACHE_SHIFT)
+			goto unlock;
+
+		pte = vmf->pte + page->index - vmf->pgoff;
+		if (!pte_none(*pte))
+			goto unlock;
+
+		if (file->f_ra.mmap_miss > 0)
+			file->f_ra.mmap_miss--;
+		addr = address + (page->index - vmf->pgoff) * PAGE_SIZE;
+		do_set_pte(vma, addr, page, pte, false, false);
+		unlock_page(page);
+		goto next;
+unlock:
+		unlock_page(page);
+skip:
+		page_cache_release(page);
+next:
+		if (page->index == vmf->max_pgoff)
+			break;
+	}
+	rcu_read_unlock();
+}
+EXPORT_SYMBOL(filemap_map_pages);
+
 int filemap_page_mkwrite(struct vm_area_struct *vma, struct vm_fault *vmf)
 {
 	struct page *page = vmf->page;
@@ -1755,6 +1826,7 @@ EXPORT_SYMBOL(filemap_page_mkwrite);
 
 const struct vm_operations_struct generic_file_vm_ops = {
 	.fault		= filemap_fault,
+	.map_pages	= filemap_map_pages,
 	.page_mkwrite	= filemap_page_mkwrite,
 	.remap_pages	= generic_file_remap_pages,
 };
diff --git a/mm/nommu.c b/mm/nommu.c
index 8740213b1647..ce401e37e29f 100644
--- a/mm/nommu.c
+++ b/mm/nommu.c
@@ -1985,6 +1985,12 @@ int filemap_fault(struct vm_area_struct *vma, struct vm_fault *vmf)
 }
 EXPORT_SYMBOL(filemap_fault);
 
+void filemap_map_pages(struct vm_area_struct *vma, struct vm_fault *vmf)
+{
+	BUG();
+}
+EXPORT_SYMBOL(filemap_map_pages);
+
 int generic_file_remap_pages(struct vm_area_struct *vma, unsigned long addr,
 			     unsigned long size, pgoff_t pgoff)
 {
-- 
1.9.0


^ permalink raw reply related	[flat|nested] 73+ messages in thread

* [PATCHv3 2/2] mm: implement ->map_pages for page cache
@ 2014-02-27 19:53   ` Kirill A. Shutemov
  0 siblings, 0 replies; 73+ messages in thread
From: Kirill A. Shutemov @ 2014-02-27 19:53 UTC (permalink / raw)
  To: Linus Torvalds, Andrew Morton, Mel Gorman, Rik van Riel
  Cc: Andi Kleen, Matthew Wilcox, Dave Hansen, Alexander Viro,
	Dave Chinner, Ning Qu, linux-mm, linux-fsdevel, linux-kernel,
	Kirill A. Shutemov

filemap_map_pages() is generic implementation of ->map_pages() for
filesystems who uses page cache.

It should be safe to use filemap_map_pages() for ->map_pages() if
filesystem use filemap_fault() for ->fault().

Signed-off-by: Kirill A. Shutemov <kirill.shutemov@linux.intel.com>
---
 fs/9p/vfs_file.c   |  2 ++
 fs/btrfs/file.c    |  1 +
 fs/cifs/file.c     |  1 +
 fs/ext4/file.c     |  1 +
 fs/f2fs/file.c     |  1 +
 fs/fuse/file.c     |  1 +
 fs/gfs2/file.c     |  1 +
 fs/nfs/file.c      |  1 +
 fs/nilfs2/file.c   |  1 +
 fs/ubifs/file.c    |  1 +
 fs/xfs/xfs_file.c  |  1 +
 include/linux/mm.h |  1 +
 mm/filemap.c       | 72 ++++++++++++++++++++++++++++++++++++++++++++++++++++++
 mm/nommu.c         |  6 +++++
 14 files changed, 91 insertions(+)

diff --git a/fs/9p/vfs_file.c b/fs/9p/vfs_file.c
index a16b0ff497ca..d8223209d4b1 100644
--- a/fs/9p/vfs_file.c
+++ b/fs/9p/vfs_file.c
@@ -832,6 +832,7 @@ static void v9fs_mmap_vm_close(struct vm_area_struct *vma)
 
 static const struct vm_operations_struct v9fs_file_vm_ops = {
 	.fault = filemap_fault,
+	.map_pages = filemap_map_pages,
 	.page_mkwrite = v9fs_vm_page_mkwrite,
 	.remap_pages = generic_file_remap_pages,
 };
@@ -839,6 +840,7 @@ static const struct vm_operations_struct v9fs_file_vm_ops = {
 static const struct vm_operations_struct v9fs_mmap_file_vm_ops = {
 	.close = v9fs_mmap_vm_close,
 	.fault = filemap_fault,
+	.map_pages = filemap_map_pages,
 	.page_mkwrite = v9fs_vm_page_mkwrite,
 	.remap_pages = generic_file_remap_pages,
 };
diff --git a/fs/btrfs/file.c b/fs/btrfs/file.c
index 0165b8672f09..d1f0415bbfb1 100644
--- a/fs/btrfs/file.c
+++ b/fs/btrfs/file.c
@@ -1993,6 +1993,7 @@ out:
 
 static const struct vm_operations_struct btrfs_file_vm_ops = {
 	.fault		= filemap_fault,
+	.map_pages	= filemap_map_pages,
 	.page_mkwrite	= btrfs_page_mkwrite,
 	.remap_pages	= generic_file_remap_pages,
 };
diff --git a/fs/cifs/file.c b/fs/cifs/file.c
index 755584684f6c..6d081de57fdb 100644
--- a/fs/cifs/file.c
+++ b/fs/cifs/file.c
@@ -3094,6 +3094,7 @@ cifs_page_mkwrite(struct vm_area_struct *vma, struct vm_fault *vmf)
 
 static struct vm_operations_struct cifs_file_vm_ops = {
 	.fault = filemap_fault,
+	.map_pages = filemap_map_pages,
 	.page_mkwrite = cifs_page_mkwrite,
 	.remap_pages = generic_file_remap_pages,
 };
diff --git a/fs/ext4/file.c b/fs/ext4/file.c
index 1a5073959f32..46e78f8a133f 100644
--- a/fs/ext4/file.c
+++ b/fs/ext4/file.c
@@ -200,6 +200,7 @@ ext4_file_write(struct kiocb *iocb, const struct iovec *iov,
 
 static const struct vm_operations_struct ext4_file_vm_ops = {
 	.fault		= filemap_fault,
+	.map_pages	= filemap_map_pages,
 	.page_mkwrite   = ext4_page_mkwrite,
 	.remap_pages	= generic_file_remap_pages,
 };
diff --git a/fs/f2fs/file.c b/fs/f2fs/file.c
index 0dfcef53a6ed..129a3bdb05ca 100644
--- a/fs/f2fs/file.c
+++ b/fs/f2fs/file.c
@@ -84,6 +84,7 @@ out:
 
 static const struct vm_operations_struct f2fs_file_vm_ops = {
 	.fault		= filemap_fault,
+	.map_pages	= filemap_map_pages,
 	.page_mkwrite	= f2fs_vm_page_mkwrite,
 	.remap_pages	= generic_file_remap_pages,
 };
diff --git a/fs/fuse/file.c b/fs/fuse/file.c
index 77bcc303c3ae..da99a76668f8 100644
--- a/fs/fuse/file.c
+++ b/fs/fuse/file.c
@@ -1940,6 +1940,7 @@ static int fuse_page_mkwrite(struct vm_area_struct *vma, struct vm_fault *vmf)
 static const struct vm_operations_struct fuse_file_vm_ops = {
 	.close		= fuse_vma_close,
 	.fault		= filemap_fault,
+	.map_pages	= filemap_map_pages,
 	.page_mkwrite	= fuse_page_mkwrite,
 	.remap_pages	= generic_file_remap_pages,
 };
diff --git a/fs/gfs2/file.c b/fs/gfs2/file.c
index efc078f0ee4e..2739f3c3bb8f 100644
--- a/fs/gfs2/file.c
+++ b/fs/gfs2/file.c
@@ -494,6 +494,7 @@ out:
 
 static const struct vm_operations_struct gfs2_vm_ops = {
 	.fault = filemap_fault,
+	.map_pages = filemap_map_pages,
 	.page_mkwrite = gfs2_page_mkwrite,
 	.remap_pages = generic_file_remap_pages,
 };
diff --git a/fs/nfs/file.c b/fs/nfs/file.c
index 5bb790a69c71..284ca901fe16 100644
--- a/fs/nfs/file.c
+++ b/fs/nfs/file.c
@@ -617,6 +617,7 @@ out:
 
 static const struct vm_operations_struct nfs_file_vm_ops = {
 	.fault = filemap_fault,
+	.map_pages = filemap_map_pages,
 	.page_mkwrite = nfs_vm_page_mkwrite,
 	.remap_pages = generic_file_remap_pages,
 };
diff --git a/fs/nilfs2/file.c b/fs/nilfs2/file.c
index 08fdb77852ac..f3a82fbcae02 100644
--- a/fs/nilfs2/file.c
+++ b/fs/nilfs2/file.c
@@ -134,6 +134,7 @@ static int nilfs_page_mkwrite(struct vm_area_struct *vma, struct vm_fault *vmf)
 
 static const struct vm_operations_struct nilfs_file_vm_ops = {
 	.fault		= filemap_fault,
+	.map_pages	= filemap_map_pages,
 	.page_mkwrite	= nilfs_page_mkwrite,
 	.remap_pages	= generic_file_remap_pages,
 };
diff --git a/fs/ubifs/file.c b/fs/ubifs/file.c
index 123c79b7261e..4f34dbae823d 100644
--- a/fs/ubifs/file.c
+++ b/fs/ubifs/file.c
@@ -1538,6 +1538,7 @@ out_unlock:
 
 static const struct vm_operations_struct ubifs_file_vm_ops = {
 	.fault        = filemap_fault,
+	.map_pages = filemap_map_pages,
 	.page_mkwrite = ubifs_vm_page_mkwrite,
 	.remap_pages = generic_file_remap_pages,
 };
diff --git a/fs/xfs/xfs_file.c b/fs/xfs/xfs_file.c
index 64b48eade91d..b2be204e16ca 100644
--- a/fs/xfs/xfs_file.c
+++ b/fs/xfs/xfs_file.c
@@ -1465,6 +1465,7 @@ const struct file_operations xfs_dir_file_operations = {
 
 static const struct vm_operations_struct xfs_file_vm_ops = {
 	.fault		= filemap_fault,
+	.map_pages	= filemap_map_pages,
 	.page_mkwrite	= xfs_vm_page_mkwrite,
 	.remap_pages	= generic_file_remap_pages,
 };
diff --git a/include/linux/mm.h b/include/linux/mm.h
index aed92cb17127..7c4bfb725e63 100644
--- a/include/linux/mm.h
+++ b/include/linux/mm.h
@@ -1818,6 +1818,7 @@ extern void truncate_inode_pages_range(struct address_space *,
 
 /* generic vm_area_ops exported for stackable file systems */
 extern int filemap_fault(struct vm_area_struct *, struct vm_fault *);
+extern void filemap_map_pages(struct vm_area_struct *vma, struct vm_fault *vmf);
 extern int filemap_page_mkwrite(struct vm_area_struct *vma, struct vm_fault *vmf);
 
 /* mm/page-writeback.c */
diff --git a/mm/filemap.c b/mm/filemap.c
index 7a13f6ac5421..1bc12a96060d 100644
--- a/mm/filemap.c
+++ b/mm/filemap.c
@@ -33,6 +33,7 @@
 #include <linux/hardirq.h> /* for BUG_ON(!in_atomic()) only */
 #include <linux/memcontrol.h>
 #include <linux/cleancache.h>
+#include <linux/rmap.h>
 #include "internal.h"
 
 #define CREATE_TRACE_POINTS
@@ -1726,6 +1727,76 @@ page_not_uptodate:
 }
 EXPORT_SYMBOL(filemap_fault);
 
+void filemap_map_pages(struct vm_area_struct *vma, struct vm_fault *vmf)
+{
+	struct radix_tree_iter iter;
+	void **slot;
+	struct file *file = vma->vm_file;
+	struct address_space *mapping = file->f_mapping;
+	loff_t size;
+	struct page *page;
+	unsigned long address = (unsigned long) vmf->virtual_address;
+	unsigned long addr;
+	pte_t *pte;
+
+	rcu_read_lock();
+	radix_tree_for_each_slot(slot, &mapping->page_tree, &iter, vmf->pgoff) {
+		if (iter.index > vmf->max_pgoff)
+			break;
+repeat:
+		page = radix_tree_deref_slot(slot);
+		if (radix_tree_exception(page)) {
+			if (radix_tree_deref_retry(page))
+				break;
+			else
+				goto next;
+		}
+
+		if (!page_cache_get_speculative(page))
+			goto repeat;
+
+		/* Has the page moved? */
+		if (unlikely(page != *slot)) {
+			page_cache_release(page);
+			goto repeat;
+		}
+
+		if (!PageUptodate(page) ||
+				PageReadahead(page) ||
+				PageHWPoison(page))
+			goto skip;
+		if (!trylock_page(page))
+			goto skip;
+
+		if (page->mapping != mapping || !PageUptodate(page))
+			goto unlock;
+
+		size = i_size_read(mapping->host) + PAGE_CACHE_SIZE - 1;
+		if (page->index >= size	>> PAGE_CACHE_SHIFT)
+			goto unlock;
+
+		pte = vmf->pte + page->index - vmf->pgoff;
+		if (!pte_none(*pte))
+			goto unlock;
+
+		if (file->f_ra.mmap_miss > 0)
+			file->f_ra.mmap_miss--;
+		addr = address + (page->index - vmf->pgoff) * PAGE_SIZE;
+		do_set_pte(vma, addr, page, pte, false, false);
+		unlock_page(page);
+		goto next;
+unlock:
+		unlock_page(page);
+skip:
+		page_cache_release(page);
+next:
+		if (page->index == vmf->max_pgoff)
+			break;
+	}
+	rcu_read_unlock();
+}
+EXPORT_SYMBOL(filemap_map_pages);
+
 int filemap_page_mkwrite(struct vm_area_struct *vma, struct vm_fault *vmf)
 {
 	struct page *page = vmf->page;
@@ -1755,6 +1826,7 @@ EXPORT_SYMBOL(filemap_page_mkwrite);
 
 const struct vm_operations_struct generic_file_vm_ops = {
 	.fault		= filemap_fault,
+	.map_pages	= filemap_map_pages,
 	.page_mkwrite	= filemap_page_mkwrite,
 	.remap_pages	= generic_file_remap_pages,
 };
diff --git a/mm/nommu.c b/mm/nommu.c
index 8740213b1647..ce401e37e29f 100644
--- a/mm/nommu.c
+++ b/mm/nommu.c
@@ -1985,6 +1985,12 @@ int filemap_fault(struct vm_area_struct *vma, struct vm_fault *vmf)
 }
 EXPORT_SYMBOL(filemap_fault);
 
+void filemap_map_pages(struct vm_area_struct *vma, struct vm_fault *vmf)
+{
+	BUG();
+}
+EXPORT_SYMBOL(filemap_map_pages);
+
 int generic_file_remap_pages(struct vm_area_struct *vma, unsigned long addr,
 			     unsigned long size, pgoff_t pgoff)
 {
-- 
1.9.0

--
To unsubscribe, send a message with 'unsubscribe linux-mm' in
the body to majordomo@kvack.org.  For more info on Linux MM,
see: http://www.linux-mm.org/ .
Don't email: <a href=mailto:"dont@kvack.org"> email@kvack.org </a>

^ permalink raw reply related	[flat|nested] 73+ messages in thread

* Re: [PATCHv3 0/2] mm: map few pages around fault address if they are in page cache
  2014-02-27 19:53 ` Kirill A. Shutemov
@ 2014-02-27 21:28   ` Linus Torvalds
  -1 siblings, 0 replies; 73+ messages in thread
From: Linus Torvalds @ 2014-02-27 21:28 UTC (permalink / raw)
  To: Kirill A. Shutemov
  Cc: Andrew Morton, Mel Gorman, Rik van Riel, Andi Kleen,
	Matthew Wilcox, Dave Hansen, Alexander Viro, Dave Chinner,
	Ning Qu, linux-mm, linux-fsdevel, Linux Kernel Mailing List

On Thu, Feb 27, 2014 at 11:53 AM, Kirill A. Shutemov
<kirill.shutemov@linux.intel.com> wrote:
> Here's new version of faultaround patchset. It took a while to tune it and
> collect performance data.

Andrew, mind taking this into -mm with my acks? It's based on top of
Kirill's cleanup patches that I think are also in your tree.

Kirill - no complaints from me. I do have two minor issues that you
might satisfy, but I think the patch is fine as-is.

The issues/questions are:

 (a) could you test this on a couple of different architectures? Even
if you just have access to intel machines, testing it across a couple
of generations of microarchitectures would be good. The reason I say
that is that from my profiles, it *looks* like the page fault costs
are relatively higher on Ivybridge/Haswell than on some earlier
uarchs.

   Now, I may well be wrong about the uarch issue, and maybe I just
didn't notice it as much before. I've stared at a lot of profiles over
the years, though, and the page fault cost seems to stand out much
more than it used to. And don't get me wrong - it might not be because
Ivy/Haswell is any worse, it might just be that exception performance
hasn't improved together with some other improvements.

 (b) I suspect we should try to strongly discourage filesystems from
actually using map_pages unless they use the standard
filemap_map_pages function as-is. Even with the fairly clean
interface, and forcing people to use "do_set_pte()", I think the docs
might want to try to more explicitly discourage people from using this
to do their own hacks..

Hmm? Either way, even without those questions answered, I'm happy with
how your patches look.

                    Linus

^ permalink raw reply	[flat|nested] 73+ messages in thread

* Re: [PATCHv3 0/2] mm: map few pages around fault address if they are in page cache
@ 2014-02-27 21:28   ` Linus Torvalds
  0 siblings, 0 replies; 73+ messages in thread
From: Linus Torvalds @ 2014-02-27 21:28 UTC (permalink / raw)
  To: Kirill A. Shutemov
  Cc: Andrew Morton, Mel Gorman, Rik van Riel, Andi Kleen,
	Matthew Wilcox, Dave Hansen, Alexander Viro, Dave Chinner,
	Ning Qu, linux-mm, linux-fsdevel, Linux Kernel Mailing List

On Thu, Feb 27, 2014 at 11:53 AM, Kirill A. Shutemov
<kirill.shutemov@linux.intel.com> wrote:
> Here's new version of faultaround patchset. It took a while to tune it and
> collect performance data.

Andrew, mind taking this into -mm with my acks? It's based on top of
Kirill's cleanup patches that I think are also in your tree.

Kirill - no complaints from me. I do have two minor issues that you
might satisfy, but I think the patch is fine as-is.

The issues/questions are:

 (a) could you test this on a couple of different architectures? Even
if you just have access to intel machines, testing it across a couple
of generations of microarchitectures would be good. The reason I say
that is that from my profiles, it *looks* like the page fault costs
are relatively higher on Ivybridge/Haswell than on some earlier
uarchs.

   Now, I may well be wrong about the uarch issue, and maybe I just
didn't notice it as much before. I've stared at a lot of profiles over
the years, though, and the page fault cost seems to stand out much
more than it used to. And don't get me wrong - it might not be because
Ivy/Haswell is any worse, it might just be that exception performance
hasn't improved together with some other improvements.

 (b) I suspect we should try to strongly discourage filesystems from
actually using map_pages unless they use the standard
filemap_map_pages function as-is. Even with the fairly clean
interface, and forcing people to use "do_set_pte()", I think the docs
might want to try to more explicitly discourage people from using this
to do their own hacks..

Hmm? Either way, even without those questions answered, I'm happy with
how your patches look.

                    Linus

--
To unsubscribe, send a message with 'unsubscribe linux-mm' in
the body to majordomo@kvack.org.  For more info on Linux MM,
see: http://www.linux-mm.org/ .
Don't email: <a href=mailto:"dont@kvack.org"> email@kvack.org </a>

^ permalink raw reply	[flat|nested] 73+ messages in thread

* Re: [PATCHv3 2/2] mm: implement ->map_pages for page cache
  2014-02-27 19:53   ` Kirill A. Shutemov
@ 2014-02-27 21:47     ` Andrew Morton
  -1 siblings, 0 replies; 73+ messages in thread
From: Andrew Morton @ 2014-02-27 21:47 UTC (permalink / raw)
  To: Kirill A. Shutemov
  Cc: Linus Torvalds, Mel Gorman, Rik van Riel, Andi Kleen,
	Matthew Wilcox, Dave Hansen, Alexander Viro, Dave Chinner,
	Ning Qu, linux-mm, linux-fsdevel, linux-kernel

On Thu, 27 Feb 2014 21:53:47 +0200 "Kirill A. Shutemov" <kirill.shutemov@linux.intel.com> wrote:

> filemap_map_pages() is generic implementation of ->map_pages() for
> filesystems who uses page cache.
> 
> It should be safe to use filemap_map_pages() for ->map_pages() if
> filesystem use filemap_fault() for ->fault().
> 
> ...
>
> --- a/include/linux/mm.h
> +++ b/include/linux/mm.h
> @@ -1818,6 +1818,7 @@ extern void truncate_inode_pages_range(struct address_space *,
>  
>  /* generic vm_area_ops exported for stackable file systems */
>  extern int filemap_fault(struct vm_area_struct *, struct vm_fault *);
> +extern void filemap_map_pages(struct vm_area_struct *vma, struct vm_fault *vmf);
>  extern int filemap_page_mkwrite(struct vm_area_struct *vma, struct vm_fault *vmf);
>  
>  /* mm/page-writeback.c */
> diff --git a/mm/filemap.c b/mm/filemap.c
> index 7a13f6ac5421..1bc12a96060d 100644
> --- a/mm/filemap.c
> +++ b/mm/filemap.c
> @@ -33,6 +33,7 @@
>  #include <linux/hardirq.h> /* for BUG_ON(!in_atomic()) only */
>  #include <linux/memcontrol.h>
>  #include <linux/cleancache.h>
> +#include <linux/rmap.h>
>  #include "internal.h"
>  
>  #define CREATE_TRACE_POINTS
> @@ -1726,6 +1727,76 @@ page_not_uptodate:
>  }
>  EXPORT_SYMBOL(filemap_fault);
>  
> +void filemap_map_pages(struct vm_area_struct *vma, struct vm_fault *vmf)
> +{
> +	struct radix_tree_iter iter;
> +	void **slot;
> +	struct file *file = vma->vm_file;
> +	struct address_space *mapping = file->f_mapping;
> +	loff_t size;
> +	struct page *page;
> +	unsigned long address = (unsigned long) vmf->virtual_address;
> +	unsigned long addr;
> +	pte_t *pte;
> +
> +	rcu_read_lock();
> +	radix_tree_for_each_slot(slot, &mapping->page_tree, &iter, vmf->pgoff) {
> +		if (iter.index > vmf->max_pgoff)
> +			break;
> +repeat:
> +		page = radix_tree_deref_slot(slot);
> +		if (radix_tree_exception(page)) {
> +			if (radix_tree_deref_retry(page))
> +				break;
> +			else
> +				goto next;
> +		}
> +
> +		if (!page_cache_get_speculative(page))
> +			goto repeat;
> +
> +		/* Has the page moved? */
> +		if (unlikely(page != *slot)) {
> +			page_cache_release(page);
> +			goto repeat;
> +		}
> +
> +		if (!PageUptodate(page) ||
> +				PageReadahead(page) ||
> +				PageHWPoison(page))
> +			goto skip;
> +		if (!trylock_page(page))
> +			goto skip;
> +
> +		if (page->mapping != mapping || !PageUptodate(page))
> +			goto unlock;
> +
> +		size = i_size_read(mapping->host) + PAGE_CACHE_SIZE - 1;

Could perhaps use round_up here.

> +		if (page->index >= size	>> PAGE_CACHE_SHIFT)
> +			goto unlock;
> +		pte = vmf->pte + page->index - vmf->pgoff;
> +		if (!pte_none(*pte))
> +			goto unlock;
> +
> +		if (file->f_ra.mmap_miss > 0)
> +			file->f_ra.mmap_miss--;

I'm wondering about this.  We treat every speculative faultahead as a
hit, whether or not userspace will actually touch that page.

What's the effect of this?  To cause the amount of physical readahead
to increase?  But if userspace is in fact touching the file in a sparse
random fashion, that is exactly the wrong thing to do?

> +		addr = address + (page->index - vmf->pgoff) * PAGE_SIZE;
> +		do_set_pte(vma, addr, page, pte, false, false);
> +		unlock_page(page);
> +		goto next;
> +unlock:
> +		unlock_page(page);
> +skip:
> +		page_cache_release(page);
> +next:
> +		if (page->index == vmf->max_pgoff)
> +			break;
> +	}
> +	rcu_read_unlock();
> +}
> +EXPORT_SYMBOL(filemap_map_pages);
> +


^ permalink raw reply	[flat|nested] 73+ messages in thread

* Re: [PATCHv3 2/2] mm: implement ->map_pages for page cache
@ 2014-02-27 21:47     ` Andrew Morton
  0 siblings, 0 replies; 73+ messages in thread
From: Andrew Morton @ 2014-02-27 21:47 UTC (permalink / raw)
  To: Kirill A. Shutemov
  Cc: Linus Torvalds, Mel Gorman, Rik van Riel, Andi Kleen,
	Matthew Wilcox, Dave Hansen, Alexander Viro, Dave Chinner,
	Ning Qu, linux-mm, linux-fsdevel, linux-kernel

On Thu, 27 Feb 2014 21:53:47 +0200 "Kirill A. Shutemov" <kirill.shutemov@linux.intel.com> wrote:

> filemap_map_pages() is generic implementation of ->map_pages() for
> filesystems who uses page cache.
> 
> It should be safe to use filemap_map_pages() for ->map_pages() if
> filesystem use filemap_fault() for ->fault().
> 
> ...
>
> --- a/include/linux/mm.h
> +++ b/include/linux/mm.h
> @@ -1818,6 +1818,7 @@ extern void truncate_inode_pages_range(struct address_space *,
>  
>  /* generic vm_area_ops exported for stackable file systems */
>  extern int filemap_fault(struct vm_area_struct *, struct vm_fault *);
> +extern void filemap_map_pages(struct vm_area_struct *vma, struct vm_fault *vmf);
>  extern int filemap_page_mkwrite(struct vm_area_struct *vma, struct vm_fault *vmf);
>  
>  /* mm/page-writeback.c */
> diff --git a/mm/filemap.c b/mm/filemap.c
> index 7a13f6ac5421..1bc12a96060d 100644
> --- a/mm/filemap.c
> +++ b/mm/filemap.c
> @@ -33,6 +33,7 @@
>  #include <linux/hardirq.h> /* for BUG_ON(!in_atomic()) only */
>  #include <linux/memcontrol.h>
>  #include <linux/cleancache.h>
> +#include <linux/rmap.h>
>  #include "internal.h"
>  
>  #define CREATE_TRACE_POINTS
> @@ -1726,6 +1727,76 @@ page_not_uptodate:
>  }
>  EXPORT_SYMBOL(filemap_fault);
>  
> +void filemap_map_pages(struct vm_area_struct *vma, struct vm_fault *vmf)
> +{
> +	struct radix_tree_iter iter;
> +	void **slot;
> +	struct file *file = vma->vm_file;
> +	struct address_space *mapping = file->f_mapping;
> +	loff_t size;
> +	struct page *page;
> +	unsigned long address = (unsigned long) vmf->virtual_address;
> +	unsigned long addr;
> +	pte_t *pte;
> +
> +	rcu_read_lock();
> +	radix_tree_for_each_slot(slot, &mapping->page_tree, &iter, vmf->pgoff) {
> +		if (iter.index > vmf->max_pgoff)
> +			break;
> +repeat:
> +		page = radix_tree_deref_slot(slot);
> +		if (radix_tree_exception(page)) {
> +			if (radix_tree_deref_retry(page))
> +				break;
> +			else
> +				goto next;
> +		}
> +
> +		if (!page_cache_get_speculative(page))
> +			goto repeat;
> +
> +		/* Has the page moved? */
> +		if (unlikely(page != *slot)) {
> +			page_cache_release(page);
> +			goto repeat;
> +		}
> +
> +		if (!PageUptodate(page) ||
> +				PageReadahead(page) ||
> +				PageHWPoison(page))
> +			goto skip;
> +		if (!trylock_page(page))
> +			goto skip;
> +
> +		if (page->mapping != mapping || !PageUptodate(page))
> +			goto unlock;
> +
> +		size = i_size_read(mapping->host) + PAGE_CACHE_SIZE - 1;

Could perhaps use round_up here.

> +		if (page->index >= size	>> PAGE_CACHE_SHIFT)
> +			goto unlock;
> +		pte = vmf->pte + page->index - vmf->pgoff;
> +		if (!pte_none(*pte))
> +			goto unlock;
> +
> +		if (file->f_ra.mmap_miss > 0)
> +			file->f_ra.mmap_miss--;

I'm wondering about this.  We treat every speculative faultahead as a
hit, whether or not userspace will actually touch that page.

What's the effect of this?  To cause the amount of physical readahead
to increase?  But if userspace is in fact touching the file in a sparse
random fashion, that is exactly the wrong thing to do?

> +		addr = address + (page->index - vmf->pgoff) * PAGE_SIZE;
> +		do_set_pte(vma, addr, page, pte, false, false);
> +		unlock_page(page);
> +		goto next;
> +unlock:
> +		unlock_page(page);
> +skip:
> +		page_cache_release(page);
> +next:
> +		if (page->index == vmf->max_pgoff)
> +			break;
> +	}
> +	rcu_read_unlock();
> +}
> +EXPORT_SYMBOL(filemap_map_pages);
> +

--
To unsubscribe, send a message with 'unsubscribe linux-mm' in
the body to majordomo@kvack.org.  For more info on Linux MM,
see: http://www.linux-mm.org/ .
Don't email: <a href=mailto:"dont@kvack.org"> email@kvack.org </a>

^ permalink raw reply	[flat|nested] 73+ messages in thread

* Re: [PATCHv3 1/2] mm: introduce vm_ops->map_pages()
  2014-02-27 19:53   ` Kirill A. Shutemov
@ 2014-02-27 21:59     ` Dave Hansen
  -1 siblings, 0 replies; 73+ messages in thread
From: Dave Hansen @ 2014-02-27 21:59 UTC (permalink / raw)
  To: Kirill A. Shutemov, Linus Torvalds, Andrew Morton, Mel Gorman,
	Rik van Riel
  Cc: Andi Kleen, Matthew Wilcox, Alexander Viro, Dave Chinner,
	Ning Qu, linux-mm, linux-fsdevel, linux-kernel

On 02/27/2014 11:53 AM, Kirill A. Shutemov wrote:
> +#define FAULT_AROUND_ORDER 4
> +#define FAULT_AROUND_PAGES (1UL << FAULT_AROUND_ORDER)
> +#define FAULT_AROUND_MASK ~((1UL << (PAGE_SHIFT + FAULT_AROUND_ORDER)) - 1)

Looking at the performance data made me think of this: do we really want
this to be static?  It seems like the kind of thing that will cause a
regression _somewhere_.

Also, the folks with larger base bage sizes probably don't want a
FAULT_AROUND_ORDER=4.  That's 1MB of fault-around for ppc64, for example.

^ permalink raw reply	[flat|nested] 73+ messages in thread

* Re: [PATCHv3 1/2] mm: introduce vm_ops->map_pages()
@ 2014-02-27 21:59     ` Dave Hansen
  0 siblings, 0 replies; 73+ messages in thread
From: Dave Hansen @ 2014-02-27 21:59 UTC (permalink / raw)
  To: Kirill A. Shutemov, Linus Torvalds, Andrew Morton, Mel Gorman,
	Rik van Riel
  Cc: Andi Kleen, Matthew Wilcox, Alexander Viro, Dave Chinner,
	Ning Qu, linux-mm, linux-fsdevel, linux-kernel

On 02/27/2014 11:53 AM, Kirill A. Shutemov wrote:
> +#define FAULT_AROUND_ORDER 4
> +#define FAULT_AROUND_PAGES (1UL << FAULT_AROUND_ORDER)
> +#define FAULT_AROUND_MASK ~((1UL << (PAGE_SHIFT + FAULT_AROUND_ORDER)) - 1)

Looking at the performance data made me think of this: do we really want
this to be static?  It seems like the kind of thing that will cause a
regression _somewhere_.

Also, the folks with larger base bage sizes probably don't want a
FAULT_AROUND_ORDER=4.  That's 1MB of fault-around for ppc64, for example.

--
To unsubscribe, send a message with 'unsubscribe linux-mm' in
the body to majordomo@kvack.org.  For more info on Linux MM,
see: http://www.linux-mm.org/ .
Don't email: <a href=mailto:"dont@kvack.org"> email@kvack.org </a>

^ permalink raw reply	[flat|nested] 73+ messages in thread

* Re: [PATCHv3 1/2] mm: introduce vm_ops->map_pages()
  2014-02-27 21:59     ` Dave Hansen
@ 2014-02-27 22:06       ` Linus Torvalds
  -1 siblings, 0 replies; 73+ messages in thread
From: Linus Torvalds @ 2014-02-27 22:06 UTC (permalink / raw)
  To: Dave Hansen
  Cc: Kirill A. Shutemov, Andrew Morton, Mel Gorman, Rik van Riel,
	Andi Kleen, Matthew Wilcox, Alexander Viro, Dave Chinner,
	Ning Qu, linux-mm, linux-fsdevel, Linux Kernel Mailing List

On Thu, Feb 27, 2014 at 1:59 PM, Dave Hansen
<dave.hansen@linux.intel.com> wrote:
>
> Also, the folks with larger base bage sizes probably don't want a
> FAULT_AROUND_ORDER=4.  That's 1MB of fault-around for ppc64, for example.

Actually, I'd expect that they won't mind, because there's no real
extra cost (the costs are indepenent of page size).

For small mappings the mapping size itself will avoid the
fault-around, and for big mappings they'll get the reduced page
faults.

They chose 64kB pages for a reason (although arguably that reason is
"our TLB fills are horrible crap"), they'll be fine with that "let's
try to map a few pages around us".

That said, making it runtime configurable for testing is likely a good
thing anyway, with some hardcoded maximum fault-around size for
sanity.

              Linus

^ permalink raw reply	[flat|nested] 73+ messages in thread

* Re: [PATCHv3 1/2] mm: introduce vm_ops->map_pages()
@ 2014-02-27 22:06       ` Linus Torvalds
  0 siblings, 0 replies; 73+ messages in thread
From: Linus Torvalds @ 2014-02-27 22:06 UTC (permalink / raw)
  To: Dave Hansen
  Cc: Kirill A. Shutemov, Andrew Morton, Mel Gorman, Rik van Riel,
	Andi Kleen, Matthew Wilcox, Alexander Viro, Dave Chinner,
	Ning Qu, linux-mm, linux-fsdevel, Linux Kernel Mailing List

On Thu, Feb 27, 2014 at 1:59 PM, Dave Hansen
<dave.hansen@linux.intel.com> wrote:
>
> Also, the folks with larger base bage sizes probably don't want a
> FAULT_AROUND_ORDER=4.  That's 1MB of fault-around for ppc64, for example.

Actually, I'd expect that they won't mind, because there's no real
extra cost (the costs are indepenent of page size).

For small mappings the mapping size itself will avoid the
fault-around, and for big mappings they'll get the reduced page
faults.

They chose 64kB pages for a reason (although arguably that reason is
"our TLB fills are horrible crap"), they'll be fine with that "let's
try to map a few pages around us".

That said, making it runtime configurable for testing is likely a good
thing anyway, with some hardcoded maximum fault-around size for
sanity.

              Linus

--
To unsubscribe, send a message with 'unsubscribe linux-mm' in
the body to majordomo@kvack.org.  For more info on Linux MM,
see: http://www.linux-mm.org/ .
Don't email: <a href=mailto:"dont@kvack.org"> email@kvack.org </a>

^ permalink raw reply	[flat|nested] 73+ messages in thread

* Re: [PATCHv3 1/2] mm: introduce vm_ops->map_pages()
  2014-02-27 21:59     ` Dave Hansen
@ 2014-02-27 22:08       ` Andrew Morton
  -1 siblings, 0 replies; 73+ messages in thread
From: Andrew Morton @ 2014-02-27 22:08 UTC (permalink / raw)
  To: Dave Hansen
  Cc: Kirill A. Shutemov, Linus Torvalds, Mel Gorman, Rik van Riel,
	Andi Kleen, Matthew Wilcox, Alexander Viro, Dave Chinner,
	Ning Qu, linux-mm, linux-fsdevel, linux-kernel

On Thu, 27 Feb 2014 13:59:59 -0800 Dave Hansen <dave.hansen@linux.intel.com> wrote:

> On 02/27/2014 11:53 AM, Kirill A. Shutemov wrote:
> > +#define FAULT_AROUND_ORDER 4
> > +#define FAULT_AROUND_PAGES (1UL << FAULT_AROUND_ORDER)
> > +#define FAULT_AROUND_MASK ~((1UL << (PAGE_SHIFT + FAULT_AROUND_ORDER)) - 1)
> 
> Looking at the performance data made me think of this: do we really want
> this to be static?  It seems like the kind of thing that will cause a
> regression _somewhere_.

Yes, allowing people to tweak it at runtime would improve testability a
lot.

I don't think we want to let yet another tunable out into the wild
unless we really need to - perhaps a not-for-mainline add-on patch, or
something in debugfs so we have the option of taking it away later.

> Also, the folks with larger base bage sizes probably don't want a
> FAULT_AROUND_ORDER=4.  That's 1MB of fault-around for ppc64, for example.

Yup, we don't want the same app to trigger dramatically different
kernel behaviour when it is moved from x86 to ppc.

^ permalink raw reply	[flat|nested] 73+ messages in thread

* Re: [PATCHv3 1/2] mm: introduce vm_ops->map_pages()
@ 2014-02-27 22:08       ` Andrew Morton
  0 siblings, 0 replies; 73+ messages in thread
From: Andrew Morton @ 2014-02-27 22:08 UTC (permalink / raw)
  To: Dave Hansen
  Cc: Kirill A. Shutemov, Linus Torvalds, Mel Gorman, Rik van Riel,
	Andi Kleen, Matthew Wilcox, Alexander Viro, Dave Chinner,
	Ning Qu, linux-mm, linux-fsdevel, linux-kernel

On Thu, 27 Feb 2014 13:59:59 -0800 Dave Hansen <dave.hansen@linux.intel.com> wrote:

> On 02/27/2014 11:53 AM, Kirill A. Shutemov wrote:
> > +#define FAULT_AROUND_ORDER 4
> > +#define FAULT_AROUND_PAGES (1UL << FAULT_AROUND_ORDER)
> > +#define FAULT_AROUND_MASK ~((1UL << (PAGE_SHIFT + FAULT_AROUND_ORDER)) - 1)
> 
> Looking at the performance data made me think of this: do we really want
> this to be static?  It seems like the kind of thing that will cause a
> regression _somewhere_.

Yes, allowing people to tweak it at runtime would improve testability a
lot.

I don't think we want to let yet another tunable out into the wild
unless we really need to - perhaps a not-for-mainline add-on patch, or
something in debugfs so we have the option of taking it away later.

> Also, the folks with larger base bage sizes probably don't want a
> FAULT_AROUND_ORDER=4.  That's 1MB of fault-around for ppc64, for example.

Yup, we don't want the same app to trigger dramatically different
kernel behaviour when it is moved from x86 to ppc.

--
To unsubscribe, send a message with 'unsubscribe linux-mm' in
the body to majordomo@kvack.org.  For more info on Linux MM,
see: http://www.linux-mm.org/ .
Don't email: <a href=mailto:"dont@kvack.org"> email@kvack.org </a>

^ permalink raw reply	[flat|nested] 73+ messages in thread

* Re: [PATCHv3 1/2] mm: introduce vm_ops->map_pages()
  2014-02-27 22:06       ` Linus Torvalds
@ 2014-02-27 22:34         ` Dave Hansen
  -1 siblings, 0 replies; 73+ messages in thread
From: Dave Hansen @ 2014-02-27 22:34 UTC (permalink / raw)
  To: Linus Torvalds
  Cc: Kirill A. Shutemov, Andrew Morton, Mel Gorman, Rik van Riel,
	Andi Kleen, Matthew Wilcox, Alexander Viro, Dave Chinner,
	Ning Qu, linux-mm, linux-fsdevel, Linux Kernel Mailing List,
	anton, Paul Mackerras, Benjamin Herrenschmidt

On 02/27/2014 02:06 PM, Linus Torvalds wrote:
> On Thu, Feb 27, 2014 at 1:59 PM, Dave Hansen
> <dave.hansen@linux.intel.com> wrote:
>>
>> Also, the folks with larger base bage sizes probably don't want a
>> FAULT_AROUND_ORDER=4.  That's 1MB of fault-around for ppc64, for example.
> 
> Actually, I'd expect that they won't mind, because there's no real
> extra cost (the costs are indepenent of page size).

The question is really whether or not we ever access the mapping that we
faulted around, though.  If we never access it, then the cost (however
small it was) is a loss.  That's the mechanism that I'd expect causes
Kirill's numbers to go up after they hit their minimum at ~order-4.

> For small mappings the mapping size itself will avoid the
> fault-around, and for big mappings they'll get the reduced page
> faults.

Kirill's git test suite runs did show that it _can_ hurt in some cases.
 Orders 1 to 4 were improvements.  But, Order-5 was even with the
baseline, and orders 7 and 9 got a bit worse:

> Git test suite (make -j60 test)
> FAULT_AROUND_ORDER		Baseline	1		3		4		5		7		9
> 	minor-faults		129,591,823	99,200,751	66,106,718	57,606,410	51,510,808	45,776,813	44,085,515
> 	time, seconds		66.087215026	64.784546905	64.401156567	65.282708668	66.034016829	66.793780811	67.237810413



^ permalink raw reply	[flat|nested] 73+ messages in thread

* Re: [PATCHv3 1/2] mm: introduce vm_ops->map_pages()
@ 2014-02-27 22:34         ` Dave Hansen
  0 siblings, 0 replies; 73+ messages in thread
From: Dave Hansen @ 2014-02-27 22:34 UTC (permalink / raw)
  To: Linus Torvalds
  Cc: Kirill A. Shutemov, Andrew Morton, Mel Gorman, Rik van Riel,
	Andi Kleen, Matthew Wilcox, Alexander Viro, Dave Chinner,
	Ning Qu, linux-mm, linux-fsdevel, Linux Kernel Mailing List,
	anton, Paul Mackerras, Benjamin Herrenschmidt

On 02/27/2014 02:06 PM, Linus Torvalds wrote:
> On Thu, Feb 27, 2014 at 1:59 PM, Dave Hansen
> <dave.hansen@linux.intel.com> wrote:
>>
>> Also, the folks with larger base bage sizes probably don't want a
>> FAULT_AROUND_ORDER=4.  That's 1MB of fault-around for ppc64, for example.
> 
> Actually, I'd expect that they won't mind, because there's no real
> extra cost (the costs are indepenent of page size).

The question is really whether or not we ever access the mapping that we
faulted around, though.  If we never access it, then the cost (however
small it was) is a loss.  That's the mechanism that I'd expect causes
Kirill's numbers to go up after they hit their minimum at ~order-4.

> For small mappings the mapping size itself will avoid the
> fault-around, and for big mappings they'll get the reduced page
> faults.

Kirill's git test suite runs did show that it _can_ hurt in some cases.
 Orders 1 to 4 were improvements.  But, Order-5 was even with the
baseline, and orders 7 and 9 got a bit worse:

> Git test suite (make -j60 test)
> FAULT_AROUND_ORDER		Baseline	1		3		4		5		7		9
> 	minor-faults		129,591,823	99,200,751	66,106,718	57,606,410	51,510,808	45,776,813	44,085,515
> 	time, seconds		66.087215026	64.784546905	64.401156567	65.282708668	66.034016829	66.793780811	67.237810413


--
To unsubscribe, send a message with 'unsubscribe linux-mm' in
the body to majordomo@kvack.org.  For more info on Linux MM,
see: http://www.linux-mm.org/ .
Don't email: <a href=mailto:"dont@kvack.org"> email@kvack.org </a>

^ permalink raw reply	[flat|nested] 73+ messages in thread

* Re: [PATCHv3 0/2] mm: map few pages around fault address if they are in page cache
  2014-02-27 21:28   ` Linus Torvalds
@ 2014-02-28  0:10     ` Kirill A. Shutemov
  -1 siblings, 0 replies; 73+ messages in thread
From: Kirill A. Shutemov @ 2014-02-28  0:10 UTC (permalink / raw)
  To: Linus Torvalds
  Cc: Kirill A. Shutemov, Andrew Morton, Mel Gorman, Rik van Riel,
	Andi Kleen, Matthew Wilcox, Dave Hansen, Alexander Viro,
	Dave Chinner, Ning Qu, linux-mm, linux-fsdevel,
	Linux Kernel Mailing List

On Thu, Feb 27, 2014 at 01:28:22PM -0800, Linus Torvalds wrote:
> On Thu, Feb 27, 2014 at 11:53 AM, Kirill A. Shutemov
> <kirill.shutemov@linux.intel.com> wrote:
> > Here's new version of faultaround patchset. It took a while to tune it and
> > collect performance data.
> 
> Andrew, mind taking this into -mm with my acks? It's based on top of
> Kirill's cleanup patches that I think are also in your tree.
> 
> Kirill - no complaints from me. I do have two minor issues that you
> might satisfy, but I think the patch is fine as-is.
> 
> The issues/questions are:
> 
>  (a) could you test this on a couple of different architectures? Even
> if you just have access to intel machines, testing it across a couple
> of generations of microarchitectures would be good. The reason I say
> that is that from my profiles, it *looks* like the page fault costs
> are relatively higher on Ivybridge/Haswell than on some earlier
> uarchs.

These numbers were from Ivy Bridge.
I'll bring some numbers for Westmere and Haswell.

>  (b) I suspect we should try to strongly discourage filesystems from
> actually using map_pages unless they use the standard
> filemap_map_pages function as-is. Even with the fairly clean
> interface, and forcing people to use "do_set_pte()", I think the docs
> might want to try to more explicitly discourage people from using this
> to do their own hacks..

We would need ->map_pages() at least for shmem/tmpfs. It should be
benefitial there.

Also Matthew noticed that some drivers do ugly hacks like fault in whole
VMA on first page fault. IIUC, it's for performance reasons. See
psbfb_vm_fault() or ttm_bo_vm_fault().

I thought it could be reasonable to have ->map_pages() there and do VMA
population get_user_pages() on mmap() instead.

What do you think?

-- 
 Kirill A. Shutemov

^ permalink raw reply	[flat|nested] 73+ messages in thread

* Re: [PATCHv3 0/2] mm: map few pages around fault address if they are in page cache
@ 2014-02-28  0:10     ` Kirill A. Shutemov
  0 siblings, 0 replies; 73+ messages in thread
From: Kirill A. Shutemov @ 2014-02-28  0:10 UTC (permalink / raw)
  To: Linus Torvalds
  Cc: Kirill A. Shutemov, Andrew Morton, Mel Gorman, Rik van Riel,
	Andi Kleen, Matthew Wilcox, Dave Hansen, Alexander Viro,
	Dave Chinner, Ning Qu, linux-mm, linux-fsdevel,
	Linux Kernel Mailing List

On Thu, Feb 27, 2014 at 01:28:22PM -0800, Linus Torvalds wrote:
> On Thu, Feb 27, 2014 at 11:53 AM, Kirill A. Shutemov
> <kirill.shutemov@linux.intel.com> wrote:
> > Here's new version of faultaround patchset. It took a while to tune it and
> > collect performance data.
> 
> Andrew, mind taking this into -mm with my acks? It's based on top of
> Kirill's cleanup patches that I think are also in your tree.
> 
> Kirill - no complaints from me. I do have two minor issues that you
> might satisfy, but I think the patch is fine as-is.
> 
> The issues/questions are:
> 
>  (a) could you test this on a couple of different architectures? Even
> if you just have access to intel machines, testing it across a couple
> of generations of microarchitectures would be good. The reason I say
> that is that from my profiles, it *looks* like the page fault costs
> are relatively higher on Ivybridge/Haswell than on some earlier
> uarchs.

These numbers were from Ivy Bridge.
I'll bring some numbers for Westmere and Haswell.

>  (b) I suspect we should try to strongly discourage filesystems from
> actually using map_pages unless they use the standard
> filemap_map_pages function as-is. Even with the fairly clean
> interface, and forcing people to use "do_set_pte()", I think the docs
> might want to try to more explicitly discourage people from using this
> to do their own hacks..

We would need ->map_pages() at least for shmem/tmpfs. It should be
benefitial there.

Also Matthew noticed that some drivers do ugly hacks like fault in whole
VMA on first page fault. IIUC, it's for performance reasons. See
psbfb_vm_fault() or ttm_bo_vm_fault().

I thought it could be reasonable to have ->map_pages() there and do VMA
population get_user_pages() on mmap() instead.

What do you think?

-- 
 Kirill A. Shutemov

--
To unsubscribe, send a message with 'unsubscribe linux-mm' in
the body to majordomo@kvack.org.  For more info on Linux MM,
see: http://www.linux-mm.org/ .
Don't email: <a href=mailto:"dont@kvack.org"> email@kvack.org </a>

^ permalink raw reply	[flat|nested] 73+ messages in thread

* Re: [PATCHv3 1/2] mm: introduce vm_ops->map_pages()
  2014-02-27 22:34         ` Dave Hansen
@ 2014-02-28  0:18           ` Kirill A. Shutemov
  -1 siblings, 0 replies; 73+ messages in thread
From: Kirill A. Shutemov @ 2014-02-28  0:18 UTC (permalink / raw)
  To: Dave Hansen
  Cc: Linus Torvalds, Kirill A. Shutemov, Andrew Morton, Mel Gorman,
	Rik van Riel, Andi Kleen, Matthew Wilcox, Alexander Viro,
	Dave Chinner, Ning Qu, linux-mm, linux-fsdevel,
	Linux Kernel Mailing List, anton, Paul Mackerras,
	Benjamin Herrenschmidt

On Thu, Feb 27, 2014 at 02:34:55PM -0800, Dave Hansen wrote:
> Kirill's git test suite runs did show that it _can_ hurt in some cases.

And see last use-case for how much it can hurt. :)

It shouldn't differs much for the same *number* of pages between [u]archs
unless setup of the pte is significantly more expensive or page fault is
faster.

But of course, I can move FAULT_AROUND_PAGES to arch/x86/ and let
architecture mantainers to decide if they want the feature. ;)

-- 
 Kirill A. Shutemov

^ permalink raw reply	[flat|nested] 73+ messages in thread

* Re: [PATCHv3 1/2] mm: introduce vm_ops->map_pages()
@ 2014-02-28  0:18           ` Kirill A. Shutemov
  0 siblings, 0 replies; 73+ messages in thread
From: Kirill A. Shutemov @ 2014-02-28  0:18 UTC (permalink / raw)
  To: Dave Hansen
  Cc: Linus Torvalds, Kirill A. Shutemov, Andrew Morton, Mel Gorman,
	Rik van Riel, Andi Kleen, Matthew Wilcox, Alexander Viro,
	Dave Chinner, Ning Qu, linux-mm, linux-fsdevel,
	Linux Kernel Mailing List, anton, Paul Mackerras,
	Benjamin Herrenschmidt

On Thu, Feb 27, 2014 at 02:34:55PM -0800, Dave Hansen wrote:
> Kirill's git test suite runs did show that it _can_ hurt in some cases.

And see last use-case for how much it can hurt. :)

It shouldn't differs much for the same *number* of pages between [u]archs
unless setup of the pte is significantly more expensive or page fault is
faster.

But of course, I can move FAULT_AROUND_PAGES to arch/x86/ and let
architecture mantainers to decide if they want the feature. ;)

-- 
 Kirill A. Shutemov

--
To unsubscribe, send a message with 'unsubscribe linux-mm' in
the body to majordomo@kvack.org.  For more info on Linux MM,
see: http://www.linux-mm.org/ .
Don't email: <a href=mailto:"dont@kvack.org"> email@kvack.org </a>

^ permalink raw reply	[flat|nested] 73+ messages in thread

* Re: [PATCHv3 2/2] mm: implement ->map_pages for page cache
  2014-02-27 21:47     ` Andrew Morton
@ 2014-02-28  0:31       ` Kirill A. Shutemov
  -1 siblings, 0 replies; 73+ messages in thread
From: Kirill A. Shutemov @ 2014-02-28  0:31 UTC (permalink / raw)
  To: Andrew Morton
  Cc: Kirill A. Shutemov, Linus Torvalds, Mel Gorman, Rik van Riel,
	Andi Kleen, Matthew Wilcox, Dave Hansen, Alexander Viro,
	Dave Chinner, Ning Qu, linux-mm, linux-fsdevel, linux-kernel

On Thu, Feb 27, 2014 at 01:47:11PM -0800, Andrew Morton wrote:
> On Thu, 27 Feb 2014 21:53:47 +0200 "Kirill A. Shutemov" <kirill.shutemov@linux.intel.com> wrote:
> 
> > filemap_map_pages() is generic implementation of ->map_pages() for
> > filesystems who uses page cache.
> > 
> > It should be safe to use filemap_map_pages() for ->map_pages() if
> > filesystem use filemap_fault() for ->fault().
> > 
> > ...
> >
> > --- a/include/linux/mm.h
> > +++ b/include/linux/mm.h
> > @@ -1818,6 +1818,7 @@ extern void truncate_inode_pages_range(struct address_space *,
> >  
> >  /* generic vm_area_ops exported for stackable file systems */
> >  extern int filemap_fault(struct vm_area_struct *, struct vm_fault *);
> > +extern void filemap_map_pages(struct vm_area_struct *vma, struct vm_fault *vmf);
> >  extern int filemap_page_mkwrite(struct vm_area_struct *vma, struct vm_fault *vmf);
> >  
> >  /* mm/page-writeback.c */
> > diff --git a/mm/filemap.c b/mm/filemap.c
> > index 7a13f6ac5421..1bc12a96060d 100644
> > --- a/mm/filemap.c
> > +++ b/mm/filemap.c
> > @@ -33,6 +33,7 @@
> >  #include <linux/hardirq.h> /* for BUG_ON(!in_atomic()) only */
> >  #include <linux/memcontrol.h>
> >  #include <linux/cleancache.h>
> > +#include <linux/rmap.h>
> >  #include "internal.h"
> >  
> >  #define CREATE_TRACE_POINTS
> > @@ -1726,6 +1727,76 @@ page_not_uptodate:
> >  }
> >  EXPORT_SYMBOL(filemap_fault);
> >  
> > +void filemap_map_pages(struct vm_area_struct *vma, struct vm_fault *vmf)
> > +{
> > +	struct radix_tree_iter iter;
> > +	void **slot;
> > +	struct file *file = vma->vm_file;
> > +	struct address_space *mapping = file->f_mapping;
> > +	loff_t size;
> > +	struct page *page;
> > +	unsigned long address = (unsigned long) vmf->virtual_address;
> > +	unsigned long addr;
> > +	pte_t *pte;
> > +
> > +	rcu_read_lock();
> > +	radix_tree_for_each_slot(slot, &mapping->page_tree, &iter, vmf->pgoff) {
> > +		if (iter.index > vmf->max_pgoff)
> > +			break;
> > +repeat:
> > +		page = radix_tree_deref_slot(slot);
> > +		if (radix_tree_exception(page)) {
> > +			if (radix_tree_deref_retry(page))
> > +				break;
> > +			else
> > +				goto next;
> > +		}
> > +
> > +		if (!page_cache_get_speculative(page))
> > +			goto repeat;
> > +
> > +		/* Has the page moved? */
> > +		if (unlikely(page != *slot)) {
> > +			page_cache_release(page);
> > +			goto repeat;
> > +		}
> > +
> > +		if (!PageUptodate(page) ||
> > +				PageReadahead(page) ||
> > +				PageHWPoison(page))
> > +			goto skip;
> > +		if (!trylock_page(page))
> > +			goto skip;
> > +
> > +		if (page->mapping != mapping || !PageUptodate(page))
> > +			goto unlock;
> > +
> > +		size = i_size_read(mapping->host) + PAGE_CACHE_SIZE - 1;
> 
> Could perhaps use round_up here.

Okay, I'll update tomorrow.

> > +		if (page->index >= size	>> PAGE_CACHE_SHIFT)
> > +			goto unlock;
> > +		pte = vmf->pte + page->index - vmf->pgoff;
> > +		if (!pte_none(*pte))
> > +			goto unlock;
> > +
> > +		if (file->f_ra.mmap_miss > 0)
> > +			file->f_ra.mmap_miss--;
> 
> I'm wondering about this.  We treat every speculative faultahead as a
> hit, whether or not userspace will actually touch that page.
> 
> What's the effect of this?  To cause the amount of physical readahead
> to increase?  But if userspace is in fact touching the file in a sparse
> random fashion, that is exactly the wrong thing to do?

IIUC, it will not increase readahead window: readahead recalculate the
window when it hits PageReadahead() and we don't touch these pages.

It can increase number of readahead retry before give up.

I'm not sure what we should do here if anything. We could decrease
->mmap_miss by half of mapped pages or something. But I think
MMAP_LOTSAMISS is pretty arbitrary anyway.

> 
> > +		addr = address + (page->index - vmf->pgoff) * PAGE_SIZE;
> > +		do_set_pte(vma, addr, page, pte, false, false);
> > +		unlock_page(page);
> > +		goto next;
> > +unlock:
> > +		unlock_page(page);
> > +skip:
> > +		page_cache_release(page);
> > +next:
> > +		if (page->index == vmf->max_pgoff)
> > +			break;
> > +	}
> > +	rcu_read_unlock();
> > +}
> > +EXPORT_SYMBOL(filemap_map_pages);
> > +
> 
> --
> To unsubscribe from this list: send the line "unsubscribe linux-fsdevel" in
> the body of a message to majordomo@vger.kernel.org
> More majordomo info at  http://vger.kernel.org/majordomo-info.html

-- 
 Kirill A. Shutemov

^ permalink raw reply	[flat|nested] 73+ messages in thread

* Re: [PATCHv3 2/2] mm: implement ->map_pages for page cache
@ 2014-02-28  0:31       ` Kirill A. Shutemov
  0 siblings, 0 replies; 73+ messages in thread
From: Kirill A. Shutemov @ 2014-02-28  0:31 UTC (permalink / raw)
  To: Andrew Morton
  Cc: Kirill A. Shutemov, Linus Torvalds, Mel Gorman, Rik van Riel,
	Andi Kleen, Matthew Wilcox, Dave Hansen, Alexander Viro,
	Dave Chinner, Ning Qu, linux-mm, linux-fsdevel, linux-kernel

On Thu, Feb 27, 2014 at 01:47:11PM -0800, Andrew Morton wrote:
> On Thu, 27 Feb 2014 21:53:47 +0200 "Kirill A. Shutemov" <kirill.shutemov@linux.intel.com> wrote:
> 
> > filemap_map_pages() is generic implementation of ->map_pages() for
> > filesystems who uses page cache.
> > 
> > It should be safe to use filemap_map_pages() for ->map_pages() if
> > filesystem use filemap_fault() for ->fault().
> > 
> > ...
> >
> > --- a/include/linux/mm.h
> > +++ b/include/linux/mm.h
> > @@ -1818,6 +1818,7 @@ extern void truncate_inode_pages_range(struct address_space *,
> >  
> >  /* generic vm_area_ops exported for stackable file systems */
> >  extern int filemap_fault(struct vm_area_struct *, struct vm_fault *);
> > +extern void filemap_map_pages(struct vm_area_struct *vma, struct vm_fault *vmf);
> >  extern int filemap_page_mkwrite(struct vm_area_struct *vma, struct vm_fault *vmf);
> >  
> >  /* mm/page-writeback.c */
> > diff --git a/mm/filemap.c b/mm/filemap.c
> > index 7a13f6ac5421..1bc12a96060d 100644
> > --- a/mm/filemap.c
> > +++ b/mm/filemap.c
> > @@ -33,6 +33,7 @@
> >  #include <linux/hardirq.h> /* for BUG_ON(!in_atomic()) only */
> >  #include <linux/memcontrol.h>
> >  #include <linux/cleancache.h>
> > +#include <linux/rmap.h>
> >  #include "internal.h"
> >  
> >  #define CREATE_TRACE_POINTS
> > @@ -1726,6 +1727,76 @@ page_not_uptodate:
> >  }
> >  EXPORT_SYMBOL(filemap_fault);
> >  
> > +void filemap_map_pages(struct vm_area_struct *vma, struct vm_fault *vmf)
> > +{
> > +	struct radix_tree_iter iter;
> > +	void **slot;
> > +	struct file *file = vma->vm_file;
> > +	struct address_space *mapping = file->f_mapping;
> > +	loff_t size;
> > +	struct page *page;
> > +	unsigned long address = (unsigned long) vmf->virtual_address;
> > +	unsigned long addr;
> > +	pte_t *pte;
> > +
> > +	rcu_read_lock();
> > +	radix_tree_for_each_slot(slot, &mapping->page_tree, &iter, vmf->pgoff) {
> > +		if (iter.index > vmf->max_pgoff)
> > +			break;
> > +repeat:
> > +		page = radix_tree_deref_slot(slot);
> > +		if (radix_tree_exception(page)) {
> > +			if (radix_tree_deref_retry(page))
> > +				break;
> > +			else
> > +				goto next;
> > +		}
> > +
> > +		if (!page_cache_get_speculative(page))
> > +			goto repeat;
> > +
> > +		/* Has the page moved? */
> > +		if (unlikely(page != *slot)) {
> > +			page_cache_release(page);
> > +			goto repeat;
> > +		}
> > +
> > +		if (!PageUptodate(page) ||
> > +				PageReadahead(page) ||
> > +				PageHWPoison(page))
> > +			goto skip;
> > +		if (!trylock_page(page))
> > +			goto skip;
> > +
> > +		if (page->mapping != mapping || !PageUptodate(page))
> > +			goto unlock;
> > +
> > +		size = i_size_read(mapping->host) + PAGE_CACHE_SIZE - 1;
> 
> Could perhaps use round_up here.

Okay, I'll update tomorrow.

> > +		if (page->index >= size	>> PAGE_CACHE_SHIFT)
> > +			goto unlock;
> > +		pte = vmf->pte + page->index - vmf->pgoff;
> > +		if (!pte_none(*pte))
> > +			goto unlock;
> > +
> > +		if (file->f_ra.mmap_miss > 0)
> > +			file->f_ra.mmap_miss--;
> 
> I'm wondering about this.  We treat every speculative faultahead as a
> hit, whether or not userspace will actually touch that page.
> 
> What's the effect of this?  To cause the amount of physical readahead
> to increase?  But if userspace is in fact touching the file in a sparse
> random fashion, that is exactly the wrong thing to do?

IIUC, it will not increase readahead window: readahead recalculate the
window when it hits PageReadahead() and we don't touch these pages.

It can increase number of readahead retry before give up.

I'm not sure what we should do here if anything. We could decrease
->mmap_miss by half of mapped pages or something. But I think
MMAP_LOTSAMISS is pretty arbitrary anyway.

> 
> > +		addr = address + (page->index - vmf->pgoff) * PAGE_SIZE;
> > +		do_set_pte(vma, addr, page, pte, false, false);
> > +		unlock_page(page);
> > +		goto next;
> > +unlock:
> > +		unlock_page(page);
> > +skip:
> > +		page_cache_release(page);
> > +next:
> > +		if (page->index == vmf->max_pgoff)
> > +			break;
> > +	}
> > +	rcu_read_unlock();
> > +}
> > +EXPORT_SYMBOL(filemap_map_pages);
> > +
> 
> --
> To unsubscribe from this list: send the line "unsubscribe linux-fsdevel" in
> the body of a message to majordomo@vger.kernel.org
> More majordomo info at  http://vger.kernel.org/majordomo-info.html

-- 
 Kirill A. Shutemov

--
To unsubscribe, send a message with 'unsubscribe linux-mm' in
the body to majordomo@kvack.org.  For more info on Linux MM,
see: http://www.linux-mm.org/ .
Don't email: <a href=mailto:"dont@kvack.org"> email@kvack.org </a>

^ permalink raw reply	[flat|nested] 73+ messages in thread

* RE: [PATCHv3 0/2] mm: map few pages around fault address if they are in page cache
  2014-02-28  0:10     ` Kirill A. Shutemov
@ 2014-02-28  3:52       ` Wilcox, Matthew R
  -1 siblings, 0 replies; 73+ messages in thread
From: Wilcox, Matthew R @ 2014-02-28  3:52 UTC (permalink / raw)
  To: Kirill A. Shutemov, Linus Torvalds
  Cc: Kirill A. Shutemov, Andrew Morton, Mel Gorman, Rik van Riel,
	Andi Kleen, Dave Hansen, Alexander Viro, Dave Chinner, Ning Qu,
	linux-mm, linux-fsdevel, Linux Kernel Mailing List

I think the psbfb case is just horribly broken; they probably want to populate the entire VMA at mmap time rather than fault time.  It'll be less code for them.

ttm is more nuanced, and there're one or two other graphics drivers that have similar requirements of "faulting around".  But all of the ones that try it have the nasty feature of potentially faulting in some pages that weren't the one that actually faulted on, and then failing before faulting in the requested one, then returning to userspace.

Granted, this is a pretty rare case.  You'd have to be incredibly low on memory to fail to allocate a page table page.  But it can happen, and shouldn't.

So I was thinking about a helper that these drivers could use to "fault around" in ->fault, then Kirill let me in on ->map_pages, and I think this way could work too.

________________________________________
From: Kirill A. Shutemov [kirill@shutemov.name]
Sent: February 27, 2014 4:10 PM
To: Linus Torvalds
Cc: Kirill A. Shutemov; Andrew Morton; Mel Gorman; Rik van Riel; Andi Kleen; Wilcox, Matthew R; Dave Hansen; Alexander Viro; Dave Chinner; Ning Qu; linux-mm; linux-fsdevel; Linux Kernel Mailing List
Subject: Re: [PATCHv3 0/2] mm: map few pages around fault address if they are in page cache

On Thu, Feb 27, 2014 at 01:28:22PM -0800, Linus Torvalds wrote:
> On Thu, Feb 27, 2014 at 11:53 AM, Kirill A. Shutemov
> <kirill.shutemov@linux.intel.com> wrote:
> > Here's new version of faultaround patchset. It took a while to tune it and
> > collect performance data.
>
> Andrew, mind taking this into -mm with my acks? It's based on top of
> Kirill's cleanup patches that I think are also in your tree.
>
> Kirill - no complaints from me. I do have two minor issues that you
> might satisfy, but I think the patch is fine as-is.
>
> The issues/questions are:
>
>  (a) could you test this on a couple of different architectures? Even
> if you just have access to intel machines, testing it across a couple
> of generations of microarchitectures would be good. The reason I say
> that is that from my profiles, it *looks* like the page fault costs
> are relatively higher on Ivybridge/Haswell than on some earlier
> uarchs.

These numbers were from Ivy Bridge.
I'll bring some numbers for Westmere and Haswell.

>  (b) I suspect we should try to strongly discourage filesystems from
> actually using map_pages unless they use the standard
> filemap_map_pages function as-is. Even with the fairly clean
> interface, and forcing people to use "do_set_pte()", I think the docs
> might want to try to more explicitly discourage people from using this
> to do their own hacks..

We would need ->map_pages() at least for shmem/tmpfs. It should be
benefitial there.

Also Matthew noticed that some drivers do ugly hacks like fault in whole
VMA on first page fault. IIUC, it's for performance reasons. See
psbfb_vm_fault() or ttm_bo_vm_fault().

I thought it could be reasonable to have ->map_pages() there and do VMA
population get_user_pages() on mmap() instead.

What do you think?

--
 Kirill A. Shutemov

^ permalink raw reply	[flat|nested] 73+ messages in thread

* RE: [PATCHv3 0/2] mm: map few pages around fault address if they are in page cache
@ 2014-02-28  3:52       ` Wilcox, Matthew R
  0 siblings, 0 replies; 73+ messages in thread
From: Wilcox, Matthew R @ 2014-02-28  3:52 UTC (permalink / raw)
  To: Kirill A. Shutemov, Linus Torvalds
  Cc: Kirill A. Shutemov, Andrew Morton, Mel Gorman, Rik van Riel,
	Andi Kleen, Dave Hansen, Alexander Viro, Dave Chinner, Ning Qu,
	linux-mm, linux-fsdevel, Linux Kernel Mailing List

I think the psbfb case is just horribly broken; they probably want to populate the entire VMA at mmap time rather than fault time.  It'll be less code for them.

ttm is more nuanced, and there're one or two other graphics drivers that have similar requirements of "faulting around".  But all of the ones that try it have the nasty feature of potentially faulting in some pages that weren't the one that actually faulted on, and then failing before faulting in the requested one, then returning to userspace.

Granted, this is a pretty rare case.  You'd have to be incredibly low on memory to fail to allocate a page table page.  But it can happen, and shouldn't.

So I was thinking about a helper that these drivers could use to "fault around" in ->fault, then Kirill let me in on ->map_pages, and I think this way could work too.

________________________________________
From: Kirill A. Shutemov [kirill@shutemov.name]
Sent: February 27, 2014 4:10 PM
To: Linus Torvalds
Cc: Kirill A. Shutemov; Andrew Morton; Mel Gorman; Rik van Riel; Andi Kleen; Wilcox, Matthew R; Dave Hansen; Alexander Viro; Dave Chinner; Ning Qu; linux-mm; linux-fsdevel; Linux Kernel Mailing List
Subject: Re: [PATCHv3 0/2] mm: map few pages around fault address if they are in page cache

On Thu, Feb 27, 2014 at 01:28:22PM -0800, Linus Torvalds wrote:
> On Thu, Feb 27, 2014 at 11:53 AM, Kirill A. Shutemov
> <kirill.shutemov@linux.intel.com> wrote:
> > Here's new version of faultaround patchset. It took a while to tune it and
> > collect performance data.
>
> Andrew, mind taking this into -mm with my acks? It's based on top of
> Kirill's cleanup patches that I think are also in your tree.
>
> Kirill - no complaints from me. I do have two minor issues that you
> might satisfy, but I think the patch is fine as-is.
>
> The issues/questions are:
>
>  (a) could you test this on a couple of different architectures? Even
> if you just have access to intel machines, testing it across a couple
> of generations of microarchitectures would be good. The reason I say
> that is that from my profiles, it *looks* like the page fault costs
> are relatively higher on Ivybridge/Haswell than on some earlier
> uarchs.

These numbers were from Ivy Bridge.
I'll bring some numbers for Westmere and Haswell.

>  (b) I suspect we should try to strongly discourage filesystems from
> actually using map_pages unless they use the standard
> filemap_map_pages function as-is. Even with the fairly clean
> interface, and forcing people to use "do_set_pte()", I think the docs
> might want to try to more explicitly discourage people from using this
> to do their own hacks..

We would need ->map_pages() at least for shmem/tmpfs. It should be
benefitial there.

Also Matthew noticed that some drivers do ugly hacks like fault in whole
VMA on first page fault. IIUC, it's for performance reasons. See
psbfb_vm_fault() or ttm_bo_vm_fault().

I thought it could be reasonable to have ->map_pages() there and do VMA
population get_user_pages() on mmap() instead.

What do you think?

--
 Kirill A. Shutemov

--
To unsubscribe, send a message with 'unsubscribe linux-mm' in
the body to majordomo@kvack.org.  For more info on Linux MM,
see: http://www.linux-mm.org/ .
Don't email: <a href=mailto:"dont@kvack.org"> email@kvack.org </a>

^ permalink raw reply	[flat|nested] 73+ messages in thread

* Re: [PATCHv3 1/2] mm: introduce vm_ops->map_pages()
  2014-02-27 22:34         ` Dave Hansen
@ 2014-02-28 11:50           ` Benjamin Herrenschmidt
  -1 siblings, 0 replies; 73+ messages in thread
From: Benjamin Herrenschmidt @ 2014-02-28 11:50 UTC (permalink / raw)
  To: Dave Hansen
  Cc: Linus Torvalds, Kirill A. Shutemov, Andrew Morton, Mel Gorman,
	Rik van Riel, Andi Kleen, Matthew Wilcox, Alexander Viro,
	Dave Chinner, Ning Qu, linux-mm, linux-fsdevel,
	Linux Kernel Mailing List, anton, Paul Mackerras

On Thu, 2014-02-27 at 14:34 -0800, Dave Hansen wrote:
> 
> The question is really whether or not we ever access the mapping that we
> faulted around, though.  If we never access it, then the cost (however
> small it was) is a loss.  That's the mechanism that I'd expect causes
> Kirill's numbers to go up after they hit their minimum at ~order-4.

On the other hand, the cost of our faults on ppc64 is higher. The two hash
lookups by the MMU (generally L2 misses) before it even decides to take the
fault, followed by a generally longer code path before we get to Linux
fault handler.

So there might be a bigger win for us, especially if the "around" pages
get pre-hashed (ie, via update_mmu_cache)

I don't have the bandwidth to play around with that myself at the moment
but I'll try to find somebody who can.

Cheers,
Ben.



^ permalink raw reply	[flat|nested] 73+ messages in thread

* Re: [PATCHv3 1/2] mm: introduce vm_ops->map_pages()
@ 2014-02-28 11:50           ` Benjamin Herrenschmidt
  0 siblings, 0 replies; 73+ messages in thread
From: Benjamin Herrenschmidt @ 2014-02-28 11:50 UTC (permalink / raw)
  To: Dave Hansen
  Cc: Linus Torvalds, Kirill A. Shutemov, Andrew Morton, Mel Gorman,
	Rik van Riel, Andi Kleen, Matthew Wilcox, Alexander Viro,
	Dave Chinner, Ning Qu, linux-mm, linux-fsdevel,
	Linux Kernel Mailing List, anton, Paul Mackerras

On Thu, 2014-02-27 at 14:34 -0800, Dave Hansen wrote:
> 
> The question is really whether or not we ever access the mapping that we
> faulted around, though.  If we never access it, then the cost (however
> small it was) is a loss.  That's the mechanism that I'd expect causes
> Kirill's numbers to go up after they hit their minimum at ~order-4.

On the other hand, the cost of our faults on ppc64 is higher. The two hash
lookups by the MMU (generally L2 misses) before it even decides to take the
fault, followed by a generally longer code path before we get to Linux
fault handler.

So there might be a bigger win for us, especially if the "around" pages
get pre-hashed (ie, via update_mmu_cache)

I don't have the bandwidth to play around with that myself at the moment
but I'll try to find somebody who can.

Cheers,
Ben.


--
To unsubscribe, send a message with 'unsubscribe linux-mm' in
the body to majordomo@kvack.org.  For more info on Linux MM,
see: http://www.linux-mm.org/ .
Don't email: <a href=mailto:"dont@kvack.org"> email@kvack.org </a>

^ permalink raw reply	[flat|nested] 73+ messages in thread

* Re: [PATCHv3 0/2] mm: map few pages around fault address if they are in page cache
  2014-02-28  0:10     ` Kirill A. Shutemov
@ 2014-02-28 23:08       ` Linus Torvalds
  -1 siblings, 0 replies; 73+ messages in thread
From: Linus Torvalds @ 2014-02-28 23:08 UTC (permalink / raw)
  To: Kirill A. Shutemov
  Cc: Kirill A. Shutemov, Andrew Morton, Mel Gorman, Rik van Riel,
	Andi Kleen, Matthew Wilcox, Dave Hansen, Alexander Viro,
	Dave Chinner, Ning Qu, linux-mm, linux-fsdevel,
	Linux Kernel Mailing List

On Thu, Feb 27, 2014 at 6:10 PM, Kirill A. Shutemov
<kirill@shutemov.name> wrote:
>
> Also Matthew noticed that some drivers do ugly hacks like fault in whole
> VMA on first page fault. IIUC, it's for performance reasons. See
> psbfb_vm_fault() or ttm_bo_vm_fault().

I guarantee it's not for performance reasons, it's probably some other breakage.

And if anything really does want to populate things fully, doing so at
fault time is wrong anyway, you should just use MAP_POPULATE at mmap
time.

So I'll believe the shm/tmpfs issue, and that's "mm internal" anyway.
But random crappy drivers? No way in hell.

               Linus

^ permalink raw reply	[flat|nested] 73+ messages in thread

* Re: [PATCHv3 0/2] mm: map few pages around fault address if they are in page cache
@ 2014-02-28 23:08       ` Linus Torvalds
  0 siblings, 0 replies; 73+ messages in thread
From: Linus Torvalds @ 2014-02-28 23:08 UTC (permalink / raw)
  To: Kirill A. Shutemov
  Cc: Kirill A. Shutemov, Andrew Morton, Mel Gorman, Rik van Riel,
	Andi Kleen, Matthew Wilcox, Dave Hansen, Alexander Viro,
	Dave Chinner, Ning Qu, linux-mm, linux-fsdevel,
	Linux Kernel Mailing List

On Thu, Feb 27, 2014 at 6:10 PM, Kirill A. Shutemov
<kirill@shutemov.name> wrote:
>
> Also Matthew noticed that some drivers do ugly hacks like fault in whole
> VMA on first page fault. IIUC, it's for performance reasons. See
> psbfb_vm_fault() or ttm_bo_vm_fault().

I guarantee it's not for performance reasons, it's probably some other breakage.

And if anything really does want to populate things fully, doing so at
fault time is wrong anyway, you should just use MAP_POPULATE at mmap
time.

So I'll believe the shm/tmpfs issue, and that's "mm internal" anyway.
But random crappy drivers? No way in hell.

               Linus

--
To unsubscribe, send a message with 'unsubscribe linux-mm' in
the body to majordomo@kvack.org.  For more info on Linux MM,
see: http://www.linux-mm.org/ .
Don't email: <a href=mailto:"dont@kvack.org"> email@kvack.org </a>

^ permalink raw reply	[flat|nested] 73+ messages in thread

* Re: [PATCHv3 1/2] mm: introduce vm_ops->map_pages()
  2014-02-27 19:53   ` Kirill A. Shutemov
  (?)
  (?)
@ 2014-03-03 23:16   ` Andrew Morton
  2014-03-04  1:26       ` Kirill A. Shutemov
  2014-03-05  0:04       ` Rusty Russell
  -1 siblings, 2 replies; 73+ messages in thread
From: Andrew Morton @ 2014-03-03 23:16 UTC (permalink / raw)
  To: Kirill A. Shutemov
  Cc: Linus Torvalds, Mel Gorman, Rik van Riel, Andi Kleen,
	Matthew Wilcox, Dave Hansen, Alexander Viro, Dave Chinner,
	Ning Qu, linux-mm, linux-fsdevel, linux-kernel, Rusty Russell

On Thu, 27 Feb 2014 21:53:46 +0200 "Kirill A. Shutemov" <kirill.shutemov@linux.intel.com> wrote:

> The patch introduces new vm_ops callback ->map_pages() and uses it for
> mapping easy accessible pages around fault address.
> 
> On read page fault, if filesystem provides ->map_pages(), we try to map
> up to FAULT_AROUND_PAGES pages around page fault address in hope to
> reduce number of minor page faults.
> 
> We call ->map_pages first and use ->fault() as fallback if page by the
> offset is not ready to be mapped (cold page cache or something).
>
> ...
>
> --- a/include/linux/mm.h
> +++ b/include/linux/mm.h
>
> ...
>
> @@ -571,6 +576,9 @@ static inline pte_t maybe_mkwrite(pte_t pte, struct vm_area_struct *vma)
>  		pte = pte_mkwrite(pte);
>  	return pte;
>  }
> +
> +void do_set_pte(struct vm_area_struct *vma, unsigned long address,
> +		struct page *page, pte_t *pte, bool write, bool anon);
>  #endif
>  
>  /*

lguest made a dubious naming decision:

drivers/lguest/page_tables.c:890: error: conflicting types for 'do_set_pte'
include/linux/mm.h:593: note: previous declaration of 'do_set_pte' was here

I'll rename lguest's do_set_pte() to do_guest_set_pte() as a
preparatory patch.


btw, do_set_pte() could really do with some documentation.  It's not a
trivial function and it does a lot of stuff.  It's exported to other
compilation units and we should explain the what, the why and
particularly the locking preconditions.

--
To unsubscribe, send a message with 'unsubscribe linux-mm' in
the body to majordomo@kvack.org.  For more info on Linux MM,
see: http://www.linux-mm.org/ .
Don't email: <a href=mailto:"dont@kvack.org"> email@kvack.org </a>

^ permalink raw reply	[flat|nested] 73+ messages in thread

* Re: [PATCHv3 1/2] mm: introduce vm_ops->map_pages()
  2014-03-03 23:16   ` Andrew Morton
  2014-03-04  1:26       ` Kirill A. Shutemov
@ 2014-03-04  1:26       ` Kirill A. Shutemov
  1 sibling, 0 replies; 73+ messages in thread
From: Kirill A. Shutemov @ 2014-03-04  1:26 UTC (permalink / raw)
  To: Andrew Morton
  Cc: Kirill A. Shutemov, Linus Torvalds, Mel Gorman, Rik van Riel,
	Andi Kleen, Matthew Wilcox, Dave Hansen, Alexander Viro,
	Dave Chinner, Ning Qu, linux-mm, linux-fsdevel, linux-kernel,
	Rusty Russell

On Mon, Mar 03, 2014 at 03:16:11PM -0800, Andrew Morton wrote:
> On Thu, 27 Feb 2014 21:53:46 +0200 "Kirill A. Shutemov" <kirill.shutemov@linux.intel.com> wrote:
> 
> > The patch introduces new vm_ops callback ->map_pages() and uses it for
> > mapping easy accessible pages around fault address.
> > 
> > On read page fault, if filesystem provides ->map_pages(), we try to map
> > up to FAULT_AROUND_PAGES pages around page fault address in hope to
> > reduce number of minor page faults.
> > 
> > We call ->map_pages first and use ->fault() as fallback if page by the
> > offset is not ready to be mapped (cold page cache or something).
> >
> > ...
> >
> > --- a/include/linux/mm.h
> > +++ b/include/linux/mm.h
> >
> > ...
> >
> > @@ -571,6 +576,9 @@ static inline pte_t maybe_mkwrite(pte_t pte, struct vm_area_struct *vma)
> >  		pte = pte_mkwrite(pte);
> >  	return pte;
> >  }
> > +
> > +void do_set_pte(struct vm_area_struct *vma, unsigned long address,
> > +		struct page *page, pte_t *pte, bool write, bool anon);
> >  #endif
> >  
> >  /*
> 
> lguest made a dubious naming decision:
> 
> drivers/lguest/page_tables.c:890: error: conflicting types for 'do_set_pte'
> include/linux/mm.h:593: note: previous declaration of 'do_set_pte' was here
> 
> I'll rename lguest's do_set_pte() to do_guest_set_pte() as a
> preparatory patch.

Thanks.

> btw, do_set_pte() could really do with some documentation.  It's not a
> trivial function and it does a lot of stuff.  It's exported to other
> compilation units and we should explain the what, the why and
> particularly the locking preconditions.

Like this?

>From f9be5307ed59ab65645377c1da03eaf4167bfcf2 Mon Sep 17 00:00:00 2001
From: "Kirill A. Shutemov" <kirill.shutemov@linux.intel.com>
Date: Tue, 4 Mar 2014 03:15:43 +0200
Subject: [PATCH] mm: documentaton for do_set_pte()

do_set_pte() now exported to other compilation unit. Let's document
interface.

Signed-off-by: Kirill A. Shutemov <kirill.shutemov@linux.intel.com>
---
 mm/memory.c | 16 +++++++++++++++-
 1 file changed, 15 insertions(+), 1 deletion(-)

diff --git a/mm/memory.c b/mm/memory.c
index e2d54e818c5b..6dc622b06915 100644
--- a/mm/memory.c
+++ b/mm/memory.c
@@ -3319,7 +3319,21 @@ static int __do_fault(struct vm_area_struct *vma, unsigned long address,
 	return ret;
 }
 
-
+/**
+ * do_set_pte - setup new PTE entry for given page and add reverse page mapping.
+ *
+ * @vma: virtual memory area
+ * @address: user virtual address
+ * @page: page to map
+ * @pte: pointer to target page table entry
+ * @write: true, if new entry is writable
+ * @anon: true, if it's anonymous page
+ *
+ * Caller must hold page table lock relevant for @pte.
+ *
+ * Target users are page handler itself and implementations of
+ * vm_ops->map_pages.
+ */
 void do_set_pte(struct vm_area_struct *vma, unsigned long address,
 		struct page *page, pte_t *pte, bool write, bool anon)
 {
-- 
 Kirill A. Shutemov

^ permalink raw reply related	[flat|nested] 73+ messages in thread

* Re: [PATCHv3 1/2] mm: introduce vm_ops->map_pages()
@ 2014-03-04  1:26       ` Kirill A. Shutemov
  0 siblings, 0 replies; 73+ messages in thread
From: Kirill A. Shutemov @ 2014-03-04  1:26 UTC (permalink / raw)
  To: Andrew Morton
  Cc: Kirill A. Shutemov, Linus Torvalds, Mel Gorman, Rik van Riel,
	Andi Kleen, Matthew Wilcox, Dave Hansen, Alexander Viro,
	Dave Chinner, Ning Qu, linux-mm, linux-fsdevel, linux-kernel,
	Rusty Russell

On Mon, Mar 03, 2014 at 03:16:11PM -0800, Andrew Morton wrote:
> On Thu, 27 Feb 2014 21:53:46 +0200 "Kirill A. Shutemov" <kirill.shutemov@linux.intel.com> wrote:
> 
> > The patch introduces new vm_ops callback ->map_pages() and uses it for
> > mapping easy accessible pages around fault address.
> > 
> > On read page fault, if filesystem provides ->map_pages(), we try to map
> > up to FAULT_AROUND_PAGES pages around page fault address in hope to
> > reduce number of minor page faults.
> > 
> > We call ->map_pages first and use ->fault() as fallback if page by the
> > offset is not ready to be mapped (cold page cache or something).
> >
> > ...
> >
> > --- a/include/linux/mm.h
> > +++ b/include/linux/mm.h
> >
> > ...
> >
> > @@ -571,6 +576,9 @@ static inline pte_t maybe_mkwrite(pte_t pte, struct vm_area_struct *vma)
> >  		pte = pte_mkwrite(pte);
> >  	return pte;
> >  }
> > +
> > +void do_set_pte(struct vm_area_struct *vma, unsigned long address,
> > +		struct page *page, pte_t *pte, bool write, bool anon);
> >  #endif
> >  
> >  /*
> 
> lguest made a dubious naming decision:
> 
> drivers/lguest/page_tables.c:890: error: conflicting types for 'do_set_pte'
> include/linux/mm.h:593: note: previous declaration of 'do_set_pte' was here
> 
> I'll rename lguest's do_set_pte() to do_guest_set_pte() as a
> preparatory patch.

Thanks.

> btw, do_set_pte() could really do with some documentation.  It's not a
> trivial function and it does a lot of stuff.  It's exported to other
> compilation units and we should explain the what, the why and
> particularly the locking preconditions.

Like this?

>From f9be5307ed59ab65645377c1da03eaf4167bfcf2 Mon Sep 17 00:00:00 2001
From: "Kirill A. Shutemov" <kirill.shutemov@linux.intel.com>
Date: Tue, 4 Mar 2014 03:15:43 +0200
Subject: [PATCH] mm: documentaton for do_set_pte()

do_set_pte() now exported to other compilation unit. Let's document
interface.

Signed-off-by: Kirill A. Shutemov <kirill.shutemov@linux.intel.com>
---
 mm/memory.c | 16 +++++++++++++++-
 1 file changed, 15 insertions(+), 1 deletion(-)

diff --git a/mm/memory.c b/mm/memory.c
index e2d54e818c5b..6dc622b06915 100644
--- a/mm/memory.c
+++ b/mm/memory.c
@@ -3319,7 +3319,21 @@ static int __do_fault(struct vm_area_struct *vma, unsigned long address,
 	return ret;
 }
 
-
+/**
+ * do_set_pte - setup new PTE entry for given page and add reverse page mapping.
+ *
+ * @vma: virtual memory area
+ * @address: user virtual address
+ * @page: page to map
+ * @pte: pointer to target page table entry
+ * @write: true, if new entry is writable
+ * @anon: true, if it's anonymous page
+ *
+ * Caller must hold page table lock relevant for @pte.
+ *
+ * Target users are page handler itself and implementations of
+ * vm_ops->map_pages.
+ */
 void do_set_pte(struct vm_area_struct *vma, unsigned long address,
 		struct page *page, pte_t *pte, bool write, bool anon)
 {
-- 
 Kirill A. Shutemov

--
To unsubscribe, send a message with 'unsubscribe linux-mm' in
the body to majordomo@kvack.org.  For more info on Linux MM,
see: http://www.linux-mm.org/ .
Don't email: <a href=mailto:"dont@kvack.org"> email@kvack.org </a>

^ permalink raw reply related	[flat|nested] 73+ messages in thread

* Re: [PATCHv3 1/2] mm: introduce vm_ops->map_pages()
@ 2014-03-04  1:26       ` Kirill A. Shutemov
  0 siblings, 0 replies; 73+ messages in thread
From: Kirill A. Shutemov @ 2014-03-04  1:26 UTC (permalink / raw)
  To: Andrew Morton
  Cc: Kirill A. Shutemov, Linus Torvalds, Mel Gorman, Rik van Riel,
	Andi Kleen, Matthew Wilcox, Dave Hansen, Alexander Viro,
	Dave Chinner, Ning Qu, linux-mm, linux-fsdevel, linux-kernel,
	Rusty Russell

On Mon, Mar 03, 2014 at 03:16:11PM -0800, Andrew Morton wrote:
> On Thu, 27 Feb 2014 21:53:46 +0200 "Kirill A. Shutemov" <kirill.shutemov@linux.intel.com> wrote:
> 
> > The patch introduces new vm_ops callback ->map_pages() and uses it for
> > mapping easy accessible pages around fault address.
> > 
> > On read page fault, if filesystem provides ->map_pages(), we try to map
> > up to FAULT_AROUND_PAGES pages around page fault address in hope to
> > reduce number of minor page faults.
> > 
> > We call ->map_pages first and use ->fault() as fallback if page by the
> > offset is not ready to be mapped (cold page cache or something).
> >
> > ...
> >
> > --- a/include/linux/mm.h
> > +++ b/include/linux/mm.h
> >
> > ...
> >
> > @@ -571,6 +576,9 @@ static inline pte_t maybe_mkwrite(pte_t pte, struct vm_area_struct *vma)
> >  		pte = pte_mkwrite(pte);
> >  	return pte;
> >  }
> > +
> > +void do_set_pte(struct vm_area_struct *vma, unsigned long address,
> > +		struct page *page, pte_t *pte, bool write, bool anon);
> >  #endif
> >  
> >  /*
> 
> lguest made a dubious naming decision:
> 
> drivers/lguest/page_tables.c:890: error: conflicting types for 'do_set_pte'
> include/linux/mm.h:593: note: previous declaration of 'do_set_pte' was here
> 
> I'll rename lguest's do_set_pte() to do_guest_set_pte() as a
> preparatory patch.

Thanks.

> btw, do_set_pte() could really do with some documentation.  It's not a
> trivial function and it does a lot of stuff.  It's exported to other
> compilation units and we should explain the what, the why and
> particularly the locking preconditions.

Like this?

^ permalink raw reply	[flat|nested] 73+ messages in thread

* Re: [PATCHv3 1/2] mm: introduce vm_ops->map_pages()
  2014-03-03 23:16   ` Andrew Morton
@ 2014-03-05  0:04       ` Rusty Russell
  2014-03-05  0:04       ` Rusty Russell
  1 sibling, 0 replies; 73+ messages in thread
From: Rusty Russell @ 2014-03-05  0:04 UTC (permalink / raw)
  To: Andrew Morton, Kirill A. Shutemov
  Cc: Linus Torvalds, Mel Gorman, Rik van Riel, Andi Kleen,
	Matthew Wilcox, Dave Hansen, Alexander Viro, Dave Chinner,
	Ning Qu, linux-mm, linux-fsdevel, linux-kernel

Andrew Morton <akpm@linux-foundation.org> writes:
> On Thu, 27 Feb 2014 21:53:46 +0200 "Kirill A. Shutemov" <kirill.shutemov@linux.intel.com> wrote:
>> +
>> +void do_set_pte(struct vm_area_struct *vma, unsigned long address,
>> +		struct page *page, pte_t *pte, bool write, bool anon);
>>  #endif
>>  
>>  /*
>
> lguest made a dubious naming decision:
>
> drivers/lguest/page_tables.c:890: error: conflicting types for 'do_set_pte'
> include/linux/mm.h:593: note: previous declaration of 'do_set_pte' was here
>
> I'll rename lguest's do_set_pte() to do_guest_set_pte() as a
> preparatory patch.

s/do_/ if you don't mind; if we're going to prefix it, we don't need the
extra verb.

Thanks,
Rusty.

^ permalink raw reply	[flat|nested] 73+ messages in thread

* Re: [PATCHv3 1/2] mm: introduce vm_ops->map_pages()
@ 2014-03-05  0:04       ` Rusty Russell
  0 siblings, 0 replies; 73+ messages in thread
From: Rusty Russell @ 2014-03-05  0:04 UTC (permalink / raw)
  To: Andrew Morton, Kirill A. Shutemov
  Cc: Linus Torvalds, Mel Gorman, Rik van Riel, Andi Kleen,
	Matthew Wilcox, Dave Hansen, Alexander Viro, Dave Chinner,
	Ning Qu, linux-mm, linux-fsdevel, linux-kernel

Andrew Morton <akpm@linux-foundation.org> writes:
> On Thu, 27 Feb 2014 21:53:46 +0200 "Kirill A. Shutemov" <kirill.shutemov@linux.intel.com> wrote:
>> +
>> +void do_set_pte(struct vm_area_struct *vma, unsigned long address,
>> +		struct page *page, pte_t *pte, bool write, bool anon);
>>  #endif
>>  
>>  /*
>
> lguest made a dubious naming decision:
>
> drivers/lguest/page_tables.c:890: error: conflicting types for 'do_set_pte'
> include/linux/mm.h:593: note: previous declaration of 'do_set_pte' was here
>
> I'll rename lguest's do_set_pte() to do_guest_set_pte() as a
> preparatory patch.

s/do_/ if you don't mind; if we're going to prefix it, we don't need the
extra verb.

Thanks,
Rusty.

--
To unsubscribe, send a message with 'unsubscribe linux-mm' in
the body to majordomo@kvack.org.  For more info on Linux MM,
see: http://www.linux-mm.org/ .
Don't email: <a href=mailto:"dont@kvack.org"> email@kvack.org </a>

^ permalink raw reply	[flat|nested] 73+ messages in thread

* Re: [PATCHv3 1/2] mm: introduce vm_ops->map_pages()
  2014-03-05  0:04       ` Rusty Russell
@ 2014-03-05 20:02         ` Andrew Morton
  -1 siblings, 0 replies; 73+ messages in thread
From: Andrew Morton @ 2014-03-05 20:02 UTC (permalink / raw)
  To: Rusty Russell
  Cc: Kirill A. Shutemov, Linus Torvalds, Mel Gorman, Rik van Riel,
	Andi Kleen, Matthew Wilcox, Dave Hansen, Alexander Viro,
	Dave Chinner, Ning Qu, linux-mm, linux-fsdevel, linux-kernel

On Wed, 05 Mar 2014 10:34:15 +1030 Rusty Russell <rusty@rustcorp.com.au> wrote:

> Andrew Morton <akpm@linux-foundation.org> writes:
> > On Thu, 27 Feb 2014 21:53:46 +0200 "Kirill A. Shutemov" <kirill.shutemov@linux.intel.com> wrote:
> >> +
> >> +void do_set_pte(struct vm_area_struct *vma, unsigned long address,
> >> +		struct page *page, pte_t *pte, bool write, bool anon);
> >>  #endif
> >>  
> >>  /*
> >
> > lguest made a dubious naming decision:
> >
> > drivers/lguest/page_tables.c:890: error: conflicting types for 'do_set_pte'
> > include/linux/mm.h:593: note: previous declaration of 'do_set_pte' was here
> >
> > I'll rename lguest's do_set_pte() to do_guest_set_pte() as a
> > preparatory patch.
> 
> s/do_/ if you don't mind; if we're going to prefix it, we don't need the
> extra verb.

drivers/lguest/page_tables.c already has a guest_set_pte(), which calls
do_guest_set_pte().

How about we use the __ tradition, so __guest_set_pte() is a helper for
guest_set_pte()?


--- a/drivers/lguest/page_tables.c~drivers-lguest-page_tablesc-rename-do_set_pte
+++ a/drivers/lguest/page_tables.c
@@ -887,7 +887,7 @@ void guest_new_pagetable(struct lg_cpu *
  * _PAGE_ACCESSED then we can put a read-only PTE entry in immediately, and if
  * they set _PAGE_DIRTY then we can put a writable PTE entry in immediately.
  */
-static void do_set_pte(struct lg_cpu *cpu, int idx,
+static void __guest_set_pte(struct lg_cpu *cpu, int idx,
 		       unsigned long vaddr, pte_t gpte)
 {
 	/* Look up the matching shadow page directory entry. */
@@ -960,13 +960,13 @@ void guest_set_pte(struct lg_cpu *cpu,
 		unsigned int i;
 		for (i = 0; i < ARRAY_SIZE(cpu->lg->pgdirs); i++)
 			if (cpu->lg->pgdirs[i].pgdir)
-				do_set_pte(cpu, i, vaddr, gpte);
+				__guest_set_pte(cpu, i, vaddr, gpte);
 	} else {
 		/* Is this page table one we have a shadow for? */
 		int pgdir = find_pgdir(cpu->lg, gpgdir);
 		if (pgdir != ARRAY_SIZE(cpu->lg->pgdirs))
 			/* If so, do the update. */
-			do_set_pte(cpu, pgdir, vaddr, gpte);
+			__guest_set_pte(cpu, pgdir, vaddr, gpte);
 	}
 }
 
_


^ permalink raw reply	[flat|nested] 73+ messages in thread

* Re: [PATCHv3 1/2] mm: introduce vm_ops->map_pages()
@ 2014-03-05 20:02         ` Andrew Morton
  0 siblings, 0 replies; 73+ messages in thread
From: Andrew Morton @ 2014-03-05 20:02 UTC (permalink / raw)
  To: Rusty Russell
  Cc: Kirill A. Shutemov, Linus Torvalds, Mel Gorman, Rik van Riel,
	Andi Kleen, Matthew Wilcox, Dave Hansen, Alexander Viro,
	Dave Chinner, Ning Qu, linux-mm, linux-fsdevel, linux-kernel

On Wed, 05 Mar 2014 10:34:15 +1030 Rusty Russell <rusty@rustcorp.com.au> wrote:

> Andrew Morton <akpm@linux-foundation.org> writes:
> > On Thu, 27 Feb 2014 21:53:46 +0200 "Kirill A. Shutemov" <kirill.shutemov@linux.intel.com> wrote:
> >> +
> >> +void do_set_pte(struct vm_area_struct *vma, unsigned long address,
> >> +		struct page *page, pte_t *pte, bool write, bool anon);
> >>  #endif
> >>  
> >>  /*
> >
> > lguest made a dubious naming decision:
> >
> > drivers/lguest/page_tables.c:890: error: conflicting types for 'do_set_pte'
> > include/linux/mm.h:593: note: previous declaration of 'do_set_pte' was here
> >
> > I'll rename lguest's do_set_pte() to do_guest_set_pte() as a
> > preparatory patch.
> 
> s/do_/ if you don't mind; if we're going to prefix it, we don't need the
> extra verb.

drivers/lguest/page_tables.c already has a guest_set_pte(), which calls
do_guest_set_pte().

How about we use the __ tradition, so __guest_set_pte() is a helper for
guest_set_pte()?


--- a/drivers/lguest/page_tables.c~drivers-lguest-page_tablesc-rename-do_set_pte
+++ a/drivers/lguest/page_tables.c
@@ -887,7 +887,7 @@ void guest_new_pagetable(struct lg_cpu *
  * _PAGE_ACCESSED then we can put a read-only PTE entry in immediately, and if
  * they set _PAGE_DIRTY then we can put a writable PTE entry in immediately.
  */
-static void do_set_pte(struct lg_cpu *cpu, int idx,
+static void __guest_set_pte(struct lg_cpu *cpu, int idx,
 		       unsigned long vaddr, pte_t gpte)
 {
 	/* Look up the matching shadow page directory entry. */
@@ -960,13 +960,13 @@ void guest_set_pte(struct lg_cpu *cpu,
 		unsigned int i;
 		for (i = 0; i < ARRAY_SIZE(cpu->lg->pgdirs); i++)
 			if (cpu->lg->pgdirs[i].pgdir)
-				do_set_pte(cpu, i, vaddr, gpte);
+				__guest_set_pte(cpu, i, vaddr, gpte);
 	} else {
 		/* Is this page table one we have a shadow for? */
 		int pgdir = find_pgdir(cpu->lg, gpgdir);
 		if (pgdir != ARRAY_SIZE(cpu->lg->pgdirs))
 			/* If so, do the update. */
-			do_set_pte(cpu, pgdir, vaddr, gpte);
+			__guest_set_pte(cpu, pgdir, vaddr, gpte);
 	}
 }
 
_

--
To unsubscribe, send a message with 'unsubscribe linux-mm' in
the body to majordomo@kvack.org.  For more info on Linux MM,
see: http://www.linux-mm.org/ .
Don't email: <a href=mailto:"dont@kvack.org"> email@kvack.org </a>

^ permalink raw reply	[flat|nested] 73+ messages in thread

* Re: [PATCHv3 0/2] mm: map few pages around fault address if they are in page cache
  2014-02-27 21:28   ` Linus Torvalds
@ 2014-03-12 14:22     ` Kirill A. Shutemov
  -1 siblings, 0 replies; 73+ messages in thread
From: Kirill A. Shutemov @ 2014-03-12 14:22 UTC (permalink / raw)
  To: Linus Torvalds
  Cc: Kirill A. Shutemov, Andrew Morton, Mel Gorman, Rik van Riel,
	Andi Kleen, Matthew Wilcox, Dave Hansen, Alexander Viro,
	Dave Chinner, Ning Qu, linux-mm, linux-fsdevel,
	Linux Kernel Mailing List

On Thu, Feb 27, 2014 at 01:28:22PM -0800, Linus Torvalds wrote:
>  (a) could you test this on a couple of different architectures?


Here's data from 4-socket Westmere. The advantage of faultaround is not that
obvious here.

FAULT_AROUND_ORDER              baseline        2               4               6

Linux build (make -j40)
minor-faults                    297,626,681     245,934,795     227,800,211     221,052,532
time, seconds                   293.4322        291.4602        292.9198        295.2355

Linux rebuild (make -j40)
minor-faults                    5,903,936       3,802,148       3,018,728       2,735,602
time, seconds                   41.5657         41.0301         40.7621         41.1161

Git test suite
minor-faults                    171,314,056     109,187,718     81,955,503      70,172,157
time, seconds                   223.2327        220.2623        223.8355        231.7843

I don't have time to test on Haswell now. Probably later.

-- 
 Kirill A. Shutemov

^ permalink raw reply	[flat|nested] 73+ messages in thread

* Re: [PATCHv3 0/2] mm: map few pages around fault address if they are in page cache
@ 2014-03-12 14:22     ` Kirill A. Shutemov
  0 siblings, 0 replies; 73+ messages in thread
From: Kirill A. Shutemov @ 2014-03-12 14:22 UTC (permalink / raw)
  To: Linus Torvalds
  Cc: Kirill A. Shutemov, Andrew Morton, Mel Gorman, Rik van Riel,
	Andi Kleen, Matthew Wilcox, Dave Hansen, Alexander Viro,
	Dave Chinner, Ning Qu, linux-mm, linux-fsdevel,
	Linux Kernel Mailing List

On Thu, Feb 27, 2014 at 01:28:22PM -0800, Linus Torvalds wrote:
>  (a) could you test this on a couple of different architectures?


Here's data from 4-socket Westmere. The advantage of faultaround is not that
obvious here.

FAULT_AROUND_ORDER              baseline        2               4               6

Linux build (make -j40)
minor-faults                    297,626,681     245,934,795     227,800,211     221,052,532
time, seconds                   293.4322        291.4602        292.9198        295.2355

Linux rebuild (make -j40)
minor-faults                    5,903,936       3,802,148       3,018,728       2,735,602
time, seconds                   41.5657         41.0301         40.7621         41.1161

Git test suite
minor-faults                    171,314,056     109,187,718     81,955,503      70,172,157
time, seconds                   223.2327        220.2623        223.8355        231.7843

I don't have time to test on Haswell now. Probably later.

-- 
 Kirill A. Shutemov

--
To unsubscribe, send a message with 'unsubscribe linux-mm' in
the body to majordomo@kvack.org.  For more info on Linux MM,
see: http://www.linux-mm.org/ .
Don't email: <a href=mailto:"dont@kvack.org"> email@kvack.org </a>

^ permalink raw reply	[flat|nested] 73+ messages in thread

* Re: [PATCHv3 2/2] mm: implement ->map_pages for page cache
  2014-02-27 19:53   ` Kirill A. Shutemov
@ 2014-04-02 18:03     ` Konstantin Khlebnikov
  -1 siblings, 0 replies; 73+ messages in thread
From: Konstantin Khlebnikov @ 2014-04-02 18:03 UTC (permalink / raw)
  To: Kirill A. Shutemov
  Cc: Linus Torvalds, Andrew Morton, Mel Gorman, Rik van Riel,
	Andi Kleen, Matthew Wilcox, Dave Hansen, Alexander Viro,
	Dave Chinner, Ning Qu, linux-mm, linux-fsdevel,
	Linux Kernel Mailing List

On Thu, Feb 27, 2014 at 11:53 PM, Kirill A. Shutemov
<kirill.shutemov@linux.intel.com> wrote:
> filemap_map_pages() is generic implementation of ->map_pages() for
> filesystems who uses page cache.
>
> It should be safe to use filemap_map_pages() for ->map_pages() if
> filesystem use filemap_fault() for ->fault().
>
> Signed-off-by: Kirill A. Shutemov <kirill.shutemov@linux.intel.com>
> ---
>  fs/9p/vfs_file.c   |  2 ++
>  fs/btrfs/file.c    |  1 +
>  fs/cifs/file.c     |  1 +
>  fs/ext4/file.c     |  1 +
>  fs/f2fs/file.c     |  1 +
>  fs/fuse/file.c     |  1 +
>  fs/gfs2/file.c     |  1 +
>  fs/nfs/file.c      |  1 +
>  fs/nilfs2/file.c   |  1 +
>  fs/ubifs/file.c    |  1 +
>  fs/xfs/xfs_file.c  |  1 +
>  include/linux/mm.h |  1 +
>  mm/filemap.c       | 72 ++++++++++++++++++++++++++++++++++++++++++++++++++++++
>  mm/nommu.c         |  6 +++++
>  14 files changed, 91 insertions(+)
>
> diff --git a/fs/9p/vfs_file.c b/fs/9p/vfs_file.c
> index a16b0ff497ca..d8223209d4b1 100644
> --- a/fs/9p/vfs_file.c
> +++ b/fs/9p/vfs_file.c
> @@ -832,6 +832,7 @@ static void v9fs_mmap_vm_close(struct vm_area_struct *vma)
>
>  static const struct vm_operations_struct v9fs_file_vm_ops = {
>         .fault = filemap_fault,
> +       .map_pages = filemap_map_pages,
>         .page_mkwrite = v9fs_vm_page_mkwrite,
>         .remap_pages = generic_file_remap_pages,
>  };
> @@ -839,6 +840,7 @@ static const struct vm_operations_struct v9fs_file_vm_ops = {
>  static const struct vm_operations_struct v9fs_mmap_file_vm_ops = {
>         .close = v9fs_mmap_vm_close,
>         .fault = filemap_fault,
> +       .map_pages = filemap_map_pages,
>         .page_mkwrite = v9fs_vm_page_mkwrite,
>         .remap_pages = generic_file_remap_pages,
>  };
> diff --git a/fs/btrfs/file.c b/fs/btrfs/file.c
> index 0165b8672f09..d1f0415bbfb1 100644
> --- a/fs/btrfs/file.c
> +++ b/fs/btrfs/file.c
> @@ -1993,6 +1993,7 @@ out:
>
>  static const struct vm_operations_struct btrfs_file_vm_ops = {
>         .fault          = filemap_fault,
> +       .map_pages      = filemap_map_pages,
>         .page_mkwrite   = btrfs_page_mkwrite,
>         .remap_pages    = generic_file_remap_pages,
>  };
> diff --git a/fs/cifs/file.c b/fs/cifs/file.c
> index 755584684f6c..6d081de57fdb 100644
> --- a/fs/cifs/file.c
> +++ b/fs/cifs/file.c
> @@ -3094,6 +3094,7 @@ cifs_page_mkwrite(struct vm_area_struct *vma, struct vm_fault *vmf)
>
>  static struct vm_operations_struct cifs_file_vm_ops = {
>         .fault = filemap_fault,
> +       .map_pages = filemap_map_pages,
>         .page_mkwrite = cifs_page_mkwrite,
>         .remap_pages = generic_file_remap_pages,
>  };
> diff --git a/fs/ext4/file.c b/fs/ext4/file.c
> index 1a5073959f32..46e78f8a133f 100644
> --- a/fs/ext4/file.c
> +++ b/fs/ext4/file.c
> @@ -200,6 +200,7 @@ ext4_file_write(struct kiocb *iocb, const struct iovec *iov,
>
>  static const struct vm_operations_struct ext4_file_vm_ops = {
>         .fault          = filemap_fault,
> +       .map_pages      = filemap_map_pages,
>         .page_mkwrite   = ext4_page_mkwrite,
>         .remap_pages    = generic_file_remap_pages,
>  };
> diff --git a/fs/f2fs/file.c b/fs/f2fs/file.c
> index 0dfcef53a6ed..129a3bdb05ca 100644
> --- a/fs/f2fs/file.c
> +++ b/fs/f2fs/file.c
> @@ -84,6 +84,7 @@ out:
>
>  static const struct vm_operations_struct f2fs_file_vm_ops = {
>         .fault          = filemap_fault,
> +       .map_pages      = filemap_map_pages,
>         .page_mkwrite   = f2fs_vm_page_mkwrite,
>         .remap_pages    = generic_file_remap_pages,
>  };
> diff --git a/fs/fuse/file.c b/fs/fuse/file.c
> index 77bcc303c3ae..da99a76668f8 100644
> --- a/fs/fuse/file.c
> +++ b/fs/fuse/file.c
> @@ -1940,6 +1940,7 @@ static int fuse_page_mkwrite(struct vm_area_struct *vma, struct vm_fault *vmf)
>  static const struct vm_operations_struct fuse_file_vm_ops = {
>         .close          = fuse_vma_close,
>         .fault          = filemap_fault,
> +       .map_pages      = filemap_map_pages,
>         .page_mkwrite   = fuse_page_mkwrite,
>         .remap_pages    = generic_file_remap_pages,
>  };
> diff --git a/fs/gfs2/file.c b/fs/gfs2/file.c
> index efc078f0ee4e..2739f3c3bb8f 100644
> --- a/fs/gfs2/file.c
> +++ b/fs/gfs2/file.c
> @@ -494,6 +494,7 @@ out:
>
>  static const struct vm_operations_struct gfs2_vm_ops = {
>         .fault = filemap_fault,
> +       .map_pages = filemap_map_pages,
>         .page_mkwrite = gfs2_page_mkwrite,
>         .remap_pages = generic_file_remap_pages,
>  };
> diff --git a/fs/nfs/file.c b/fs/nfs/file.c
> index 5bb790a69c71..284ca901fe16 100644
> --- a/fs/nfs/file.c
> +++ b/fs/nfs/file.c
> @@ -617,6 +617,7 @@ out:
>
>  static const struct vm_operations_struct nfs_file_vm_ops = {
>         .fault = filemap_fault,
> +       .map_pages = filemap_map_pages,
>         .page_mkwrite = nfs_vm_page_mkwrite,
>         .remap_pages = generic_file_remap_pages,
>  };
> diff --git a/fs/nilfs2/file.c b/fs/nilfs2/file.c
> index 08fdb77852ac..f3a82fbcae02 100644
> --- a/fs/nilfs2/file.c
> +++ b/fs/nilfs2/file.c
> @@ -134,6 +134,7 @@ static int nilfs_page_mkwrite(struct vm_area_struct *vma, struct vm_fault *vmf)
>
>  static const struct vm_operations_struct nilfs_file_vm_ops = {
>         .fault          = filemap_fault,
> +       .map_pages      = filemap_map_pages,
>         .page_mkwrite   = nilfs_page_mkwrite,
>         .remap_pages    = generic_file_remap_pages,
>  };
> diff --git a/fs/ubifs/file.c b/fs/ubifs/file.c
> index 123c79b7261e..4f34dbae823d 100644
> --- a/fs/ubifs/file.c
> +++ b/fs/ubifs/file.c
> @@ -1538,6 +1538,7 @@ out_unlock:
>
>  static const struct vm_operations_struct ubifs_file_vm_ops = {
>         .fault        = filemap_fault,
> +       .map_pages = filemap_map_pages,
>         .page_mkwrite = ubifs_vm_page_mkwrite,
>         .remap_pages = generic_file_remap_pages,
>  };
> diff --git a/fs/xfs/xfs_file.c b/fs/xfs/xfs_file.c
> index 64b48eade91d..b2be204e16ca 100644
> --- a/fs/xfs/xfs_file.c
> +++ b/fs/xfs/xfs_file.c
> @@ -1465,6 +1465,7 @@ const struct file_operations xfs_dir_file_operations = {
>
>  static const struct vm_operations_struct xfs_file_vm_ops = {
>         .fault          = filemap_fault,
> +       .map_pages      = filemap_map_pages,
>         .page_mkwrite   = xfs_vm_page_mkwrite,
>         .remap_pages    = generic_file_remap_pages,
>  };
> diff --git a/include/linux/mm.h b/include/linux/mm.h
> index aed92cb17127..7c4bfb725e63 100644
> --- a/include/linux/mm.h
> +++ b/include/linux/mm.h
> @@ -1818,6 +1818,7 @@ extern void truncate_inode_pages_range(struct address_space *,
>
>  /* generic vm_area_ops exported for stackable file systems */
>  extern int filemap_fault(struct vm_area_struct *, struct vm_fault *);
> +extern void filemap_map_pages(struct vm_area_struct *vma, struct vm_fault *vmf);
>  extern int filemap_page_mkwrite(struct vm_area_struct *vma, struct vm_fault *vmf);
>
>  /* mm/page-writeback.c */
> diff --git a/mm/filemap.c b/mm/filemap.c
> index 7a13f6ac5421..1bc12a96060d 100644
> --- a/mm/filemap.c
> +++ b/mm/filemap.c
> @@ -33,6 +33,7 @@
>  #include <linux/hardirq.h> /* for BUG_ON(!in_atomic()) only */
>  #include <linux/memcontrol.h>
>  #include <linux/cleancache.h>
> +#include <linux/rmap.h>
>  #include "internal.h"
>
>  #define CREATE_TRACE_POINTS
> @@ -1726,6 +1727,76 @@ page_not_uptodate:
>  }
>  EXPORT_SYMBOL(filemap_fault);
>
> +void filemap_map_pages(struct vm_area_struct *vma, struct vm_fault *vmf)
> +{
> +       struct radix_tree_iter iter;
> +       void **slot;
> +       struct file *file = vma->vm_file;
> +       struct address_space *mapping = file->f_mapping;
> +       loff_t size;
> +       struct page *page;
> +       unsigned long address = (unsigned long) vmf->virtual_address;
> +       unsigned long addr;
> +       pte_t *pte;
> +
> +       rcu_read_lock();
> +       radix_tree_for_each_slot(slot, &mapping->page_tree, &iter, vmf->pgoff) {
> +               if (iter.index > vmf->max_pgoff)
> +                       break;
> +repeat:
> +               page = radix_tree_deref_slot(slot);

Here is obvious race with memory reclaimer/truncate. Pointer to page
might become NULL.

> +               if (radix_tree_exception(page)) {
> +                       if (radix_tree_deref_retry(page))
> +                               break;
> +                       else
> +                               goto next;
> +               }
> +
> +               if (!page_cache_get_speculative(page))
> +                       goto repeat;
> +
> +               /* Has the page moved? */
> +               if (unlikely(page != *slot)) {
> +                       page_cache_release(page);
> +                       goto repeat;
> +               }
> +
> +               if (!PageUptodate(page) ||
> +                               PageReadahead(page) ||
> +                               PageHWPoison(page))
> +                       goto skip;
> +               if (!trylock_page(page))
> +                       goto skip;
> +
> +               if (page->mapping != mapping || !PageUptodate(page))
> +                       goto unlock;
> +
> +               size = i_size_read(mapping->host) + PAGE_CACHE_SIZE - 1;
> +               if (page->index >= size >> PAGE_CACHE_SHIFT)
> +                       goto unlock;
> +
> +               pte = vmf->pte + page->index - vmf->pgoff;
> +               if (!pte_none(*pte))
> +                       goto unlock;
> +
> +               if (file->f_ra.mmap_miss > 0)
> +                       file->f_ra.mmap_miss--;
> +               addr = address + (page->index - vmf->pgoff) * PAGE_SIZE;
> +               do_set_pte(vma, addr, page, pte, false, false);
> +               unlock_page(page);
> +               goto next;
> +unlock:
> +               unlock_page(page);
> +skip:
> +               page_cache_release(page);
> +next:
> +               if (page->index == vmf->max_pgoff)
> +                       break;
> +       }
> +       rcu_read_unlock();
> +}
> +EXPORT_SYMBOL(filemap_map_pages);
> +
>  int filemap_page_mkwrite(struct vm_area_struct *vma, struct vm_fault *vmf)
>  {
>         struct page *page = vmf->page;
> @@ -1755,6 +1826,7 @@ EXPORT_SYMBOL(filemap_page_mkwrite);
>
>  const struct vm_operations_struct generic_file_vm_ops = {
>         .fault          = filemap_fault,
> +       .map_pages      = filemap_map_pages,
>         .page_mkwrite   = filemap_page_mkwrite,
>         .remap_pages    = generic_file_remap_pages,
>  };
> diff --git a/mm/nommu.c b/mm/nommu.c
> index 8740213b1647..ce401e37e29f 100644
> --- a/mm/nommu.c
> +++ b/mm/nommu.c
> @@ -1985,6 +1985,12 @@ int filemap_fault(struct vm_area_struct *vma, struct vm_fault *vmf)
>  }
>  EXPORT_SYMBOL(filemap_fault);
>
> +void filemap_map_pages(struct vm_area_struct *vma, struct vm_fault *vmf)
> +{
> +       BUG();
> +}
> +EXPORT_SYMBOL(filemap_map_pages);
> +
>  int generic_file_remap_pages(struct vm_area_struct *vma, unsigned long addr,
>                              unsigned long size, pgoff_t pgoff)
>  {
> --
> 1.9.0
>
> --
> To unsubscribe, send a message with 'unsubscribe linux-mm' in
> the body to majordomo@kvack.org.  For more info on Linux MM,
> see: http://www.linux-mm.org/ .
> Don't email: <a href=mailto:"dont@kvack.org"> email@kvack.org </a>

^ permalink raw reply	[flat|nested] 73+ messages in thread

* Re: [PATCHv3 2/2] mm: implement ->map_pages for page cache
@ 2014-04-02 18:03     ` Konstantin Khlebnikov
  0 siblings, 0 replies; 73+ messages in thread
From: Konstantin Khlebnikov @ 2014-04-02 18:03 UTC (permalink / raw)
  To: Kirill A. Shutemov
  Cc: Linus Torvalds, Andrew Morton, Mel Gorman, Rik van Riel,
	Andi Kleen, Matthew Wilcox, Dave Hansen, Alexander Viro,
	Dave Chinner, Ning Qu, linux-mm, linux-fsdevel,
	Linux Kernel Mailing List

On Thu, Feb 27, 2014 at 11:53 PM, Kirill A. Shutemov
<kirill.shutemov@linux.intel.com> wrote:
> filemap_map_pages() is generic implementation of ->map_pages() for
> filesystems who uses page cache.
>
> It should be safe to use filemap_map_pages() for ->map_pages() if
> filesystem use filemap_fault() for ->fault().
>
> Signed-off-by: Kirill A. Shutemov <kirill.shutemov@linux.intel.com>
> ---
>  fs/9p/vfs_file.c   |  2 ++
>  fs/btrfs/file.c    |  1 +
>  fs/cifs/file.c     |  1 +
>  fs/ext4/file.c     |  1 +
>  fs/f2fs/file.c     |  1 +
>  fs/fuse/file.c     |  1 +
>  fs/gfs2/file.c     |  1 +
>  fs/nfs/file.c      |  1 +
>  fs/nilfs2/file.c   |  1 +
>  fs/ubifs/file.c    |  1 +
>  fs/xfs/xfs_file.c  |  1 +
>  include/linux/mm.h |  1 +
>  mm/filemap.c       | 72 ++++++++++++++++++++++++++++++++++++++++++++++++++++++
>  mm/nommu.c         |  6 +++++
>  14 files changed, 91 insertions(+)
>
> diff --git a/fs/9p/vfs_file.c b/fs/9p/vfs_file.c
> index a16b0ff497ca..d8223209d4b1 100644
> --- a/fs/9p/vfs_file.c
> +++ b/fs/9p/vfs_file.c
> @@ -832,6 +832,7 @@ static void v9fs_mmap_vm_close(struct vm_area_struct *vma)
>
>  static const struct vm_operations_struct v9fs_file_vm_ops = {
>         .fault = filemap_fault,
> +       .map_pages = filemap_map_pages,
>         .page_mkwrite = v9fs_vm_page_mkwrite,
>         .remap_pages = generic_file_remap_pages,
>  };
> @@ -839,6 +840,7 @@ static const struct vm_operations_struct v9fs_file_vm_ops = {
>  static const struct vm_operations_struct v9fs_mmap_file_vm_ops = {
>         .close = v9fs_mmap_vm_close,
>         .fault = filemap_fault,
> +       .map_pages = filemap_map_pages,
>         .page_mkwrite = v9fs_vm_page_mkwrite,
>         .remap_pages = generic_file_remap_pages,
>  };
> diff --git a/fs/btrfs/file.c b/fs/btrfs/file.c
> index 0165b8672f09..d1f0415bbfb1 100644
> --- a/fs/btrfs/file.c
> +++ b/fs/btrfs/file.c
> @@ -1993,6 +1993,7 @@ out:
>
>  static const struct vm_operations_struct btrfs_file_vm_ops = {
>         .fault          = filemap_fault,
> +       .map_pages      = filemap_map_pages,
>         .page_mkwrite   = btrfs_page_mkwrite,
>         .remap_pages    = generic_file_remap_pages,
>  };
> diff --git a/fs/cifs/file.c b/fs/cifs/file.c
> index 755584684f6c..6d081de57fdb 100644
> --- a/fs/cifs/file.c
> +++ b/fs/cifs/file.c
> @@ -3094,6 +3094,7 @@ cifs_page_mkwrite(struct vm_area_struct *vma, struct vm_fault *vmf)
>
>  static struct vm_operations_struct cifs_file_vm_ops = {
>         .fault = filemap_fault,
> +       .map_pages = filemap_map_pages,
>         .page_mkwrite = cifs_page_mkwrite,
>         .remap_pages = generic_file_remap_pages,
>  };
> diff --git a/fs/ext4/file.c b/fs/ext4/file.c
> index 1a5073959f32..46e78f8a133f 100644
> --- a/fs/ext4/file.c
> +++ b/fs/ext4/file.c
> @@ -200,6 +200,7 @@ ext4_file_write(struct kiocb *iocb, const struct iovec *iov,
>
>  static const struct vm_operations_struct ext4_file_vm_ops = {
>         .fault          = filemap_fault,
> +       .map_pages      = filemap_map_pages,
>         .page_mkwrite   = ext4_page_mkwrite,
>         .remap_pages    = generic_file_remap_pages,
>  };
> diff --git a/fs/f2fs/file.c b/fs/f2fs/file.c
> index 0dfcef53a6ed..129a3bdb05ca 100644
> --- a/fs/f2fs/file.c
> +++ b/fs/f2fs/file.c
> @@ -84,6 +84,7 @@ out:
>
>  static const struct vm_operations_struct f2fs_file_vm_ops = {
>         .fault          = filemap_fault,
> +       .map_pages      = filemap_map_pages,
>         .page_mkwrite   = f2fs_vm_page_mkwrite,
>         .remap_pages    = generic_file_remap_pages,
>  };
> diff --git a/fs/fuse/file.c b/fs/fuse/file.c
> index 77bcc303c3ae..da99a76668f8 100644
> --- a/fs/fuse/file.c
> +++ b/fs/fuse/file.c
> @@ -1940,6 +1940,7 @@ static int fuse_page_mkwrite(struct vm_area_struct *vma, struct vm_fault *vmf)
>  static const struct vm_operations_struct fuse_file_vm_ops = {
>         .close          = fuse_vma_close,
>         .fault          = filemap_fault,
> +       .map_pages      = filemap_map_pages,
>         .page_mkwrite   = fuse_page_mkwrite,
>         .remap_pages    = generic_file_remap_pages,
>  };
> diff --git a/fs/gfs2/file.c b/fs/gfs2/file.c
> index efc078f0ee4e..2739f3c3bb8f 100644
> --- a/fs/gfs2/file.c
> +++ b/fs/gfs2/file.c
> @@ -494,6 +494,7 @@ out:
>
>  static const struct vm_operations_struct gfs2_vm_ops = {
>         .fault = filemap_fault,
> +       .map_pages = filemap_map_pages,
>         .page_mkwrite = gfs2_page_mkwrite,
>         .remap_pages = generic_file_remap_pages,
>  };
> diff --git a/fs/nfs/file.c b/fs/nfs/file.c
> index 5bb790a69c71..284ca901fe16 100644
> --- a/fs/nfs/file.c
> +++ b/fs/nfs/file.c
> @@ -617,6 +617,7 @@ out:
>
>  static const struct vm_operations_struct nfs_file_vm_ops = {
>         .fault = filemap_fault,
> +       .map_pages = filemap_map_pages,
>         .page_mkwrite = nfs_vm_page_mkwrite,
>         .remap_pages = generic_file_remap_pages,
>  };
> diff --git a/fs/nilfs2/file.c b/fs/nilfs2/file.c
> index 08fdb77852ac..f3a82fbcae02 100644
> --- a/fs/nilfs2/file.c
> +++ b/fs/nilfs2/file.c
> @@ -134,6 +134,7 @@ static int nilfs_page_mkwrite(struct vm_area_struct *vma, struct vm_fault *vmf)
>
>  static const struct vm_operations_struct nilfs_file_vm_ops = {
>         .fault          = filemap_fault,
> +       .map_pages      = filemap_map_pages,
>         .page_mkwrite   = nilfs_page_mkwrite,
>         .remap_pages    = generic_file_remap_pages,
>  };
> diff --git a/fs/ubifs/file.c b/fs/ubifs/file.c
> index 123c79b7261e..4f34dbae823d 100644
> --- a/fs/ubifs/file.c
> +++ b/fs/ubifs/file.c
> @@ -1538,6 +1538,7 @@ out_unlock:
>
>  static const struct vm_operations_struct ubifs_file_vm_ops = {
>         .fault        = filemap_fault,
> +       .map_pages = filemap_map_pages,
>         .page_mkwrite = ubifs_vm_page_mkwrite,
>         .remap_pages = generic_file_remap_pages,
>  };
> diff --git a/fs/xfs/xfs_file.c b/fs/xfs/xfs_file.c
> index 64b48eade91d..b2be204e16ca 100644
> --- a/fs/xfs/xfs_file.c
> +++ b/fs/xfs/xfs_file.c
> @@ -1465,6 +1465,7 @@ const struct file_operations xfs_dir_file_operations = {
>
>  static const struct vm_operations_struct xfs_file_vm_ops = {
>         .fault          = filemap_fault,
> +       .map_pages      = filemap_map_pages,
>         .page_mkwrite   = xfs_vm_page_mkwrite,
>         .remap_pages    = generic_file_remap_pages,
>  };
> diff --git a/include/linux/mm.h b/include/linux/mm.h
> index aed92cb17127..7c4bfb725e63 100644
> --- a/include/linux/mm.h
> +++ b/include/linux/mm.h
> @@ -1818,6 +1818,7 @@ extern void truncate_inode_pages_range(struct address_space *,
>
>  /* generic vm_area_ops exported for stackable file systems */
>  extern int filemap_fault(struct vm_area_struct *, struct vm_fault *);
> +extern void filemap_map_pages(struct vm_area_struct *vma, struct vm_fault *vmf);
>  extern int filemap_page_mkwrite(struct vm_area_struct *vma, struct vm_fault *vmf);
>
>  /* mm/page-writeback.c */
> diff --git a/mm/filemap.c b/mm/filemap.c
> index 7a13f6ac5421..1bc12a96060d 100644
> --- a/mm/filemap.c
> +++ b/mm/filemap.c
> @@ -33,6 +33,7 @@
>  #include <linux/hardirq.h> /* for BUG_ON(!in_atomic()) only */
>  #include <linux/memcontrol.h>
>  #include <linux/cleancache.h>
> +#include <linux/rmap.h>
>  #include "internal.h"
>
>  #define CREATE_TRACE_POINTS
> @@ -1726,6 +1727,76 @@ page_not_uptodate:
>  }
>  EXPORT_SYMBOL(filemap_fault);
>
> +void filemap_map_pages(struct vm_area_struct *vma, struct vm_fault *vmf)
> +{
> +       struct radix_tree_iter iter;
> +       void **slot;
> +       struct file *file = vma->vm_file;
> +       struct address_space *mapping = file->f_mapping;
> +       loff_t size;
> +       struct page *page;
> +       unsigned long address = (unsigned long) vmf->virtual_address;
> +       unsigned long addr;
> +       pte_t *pte;
> +
> +       rcu_read_lock();
> +       radix_tree_for_each_slot(slot, &mapping->page_tree, &iter, vmf->pgoff) {
> +               if (iter.index > vmf->max_pgoff)
> +                       break;
> +repeat:
> +               page = radix_tree_deref_slot(slot);

Here is obvious race with memory reclaimer/truncate. Pointer to page
might become NULL.

> +               if (radix_tree_exception(page)) {
> +                       if (radix_tree_deref_retry(page))
> +                               break;
> +                       else
> +                               goto next;
> +               }
> +
> +               if (!page_cache_get_speculative(page))
> +                       goto repeat;
> +
> +               /* Has the page moved? */
> +               if (unlikely(page != *slot)) {
> +                       page_cache_release(page);
> +                       goto repeat;
> +               }
> +
> +               if (!PageUptodate(page) ||
> +                               PageReadahead(page) ||
> +                               PageHWPoison(page))
> +                       goto skip;
> +               if (!trylock_page(page))
> +                       goto skip;
> +
> +               if (page->mapping != mapping || !PageUptodate(page))
> +                       goto unlock;
> +
> +               size = i_size_read(mapping->host) + PAGE_CACHE_SIZE - 1;
> +               if (page->index >= size >> PAGE_CACHE_SHIFT)
> +                       goto unlock;
> +
> +               pte = vmf->pte + page->index - vmf->pgoff;
> +               if (!pte_none(*pte))
> +                       goto unlock;
> +
> +               if (file->f_ra.mmap_miss > 0)
> +                       file->f_ra.mmap_miss--;
> +               addr = address + (page->index - vmf->pgoff) * PAGE_SIZE;
> +               do_set_pte(vma, addr, page, pte, false, false);
> +               unlock_page(page);
> +               goto next;
> +unlock:
> +               unlock_page(page);
> +skip:
> +               page_cache_release(page);
> +next:
> +               if (page->index == vmf->max_pgoff)
> +                       break;
> +       }
> +       rcu_read_unlock();
> +}
> +EXPORT_SYMBOL(filemap_map_pages);
> +
>  int filemap_page_mkwrite(struct vm_area_struct *vma, struct vm_fault *vmf)
>  {
>         struct page *page = vmf->page;
> @@ -1755,6 +1826,7 @@ EXPORT_SYMBOL(filemap_page_mkwrite);
>
>  const struct vm_operations_struct generic_file_vm_ops = {
>         .fault          = filemap_fault,
> +       .map_pages      = filemap_map_pages,
>         .page_mkwrite   = filemap_page_mkwrite,
>         .remap_pages    = generic_file_remap_pages,
>  };
> diff --git a/mm/nommu.c b/mm/nommu.c
> index 8740213b1647..ce401e37e29f 100644
> --- a/mm/nommu.c
> +++ b/mm/nommu.c
> @@ -1985,6 +1985,12 @@ int filemap_fault(struct vm_area_struct *vma, struct vm_fault *vmf)
>  }
>  EXPORT_SYMBOL(filemap_fault);
>
> +void filemap_map_pages(struct vm_area_struct *vma, struct vm_fault *vmf)
> +{
> +       BUG();
> +}
> +EXPORT_SYMBOL(filemap_map_pages);
> +
>  int generic_file_remap_pages(struct vm_area_struct *vma, unsigned long addr,
>                              unsigned long size, pgoff_t pgoff)
>  {
> --
> 1.9.0
>
> --
> To unsubscribe, send a message with 'unsubscribe linux-mm' in
> the body to majordomo@kvack.org.  For more info on Linux MM,
> see: http://www.linux-mm.org/ .
> Don't email: <a href=mailto:"dont@kvack.org"> email@kvack.org </a>

--
To unsubscribe, send a message with 'unsubscribe linux-mm' in
the body to majordomo@kvack.org.  For more info on Linux MM,
see: http://www.linux-mm.org/ .
Don't email: <a href=mailto:"dont@kvack.org"> email@kvack.org </a>

^ permalink raw reply	[flat|nested] 73+ messages in thread

* Re: [PATCHv3 2/2] mm: implement ->map_pages for page cache
  2014-04-02 18:03     ` Konstantin Khlebnikov
@ 2014-04-02 19:07       ` Kirill A. Shutemov
  -1 siblings, 0 replies; 73+ messages in thread
From: Kirill A. Shutemov @ 2014-04-02 19:07 UTC (permalink / raw)
  To: Konstantin Khlebnikov
  Cc: Kirill A. Shutemov, Linus Torvalds, Andrew Morton, Mel Gorman,
	Rik van Riel, Andi Kleen, Matthew Wilcox, Dave Hansen,
	Alexander Viro, Dave Chinner, Ning Qu, linux-mm, linux-fsdevel,
	Linux Kernel Mailing List

On Wed, Apr 02, 2014 at 10:03:24PM +0400, Konstantin Khlebnikov wrote:
> > +void filemap_map_pages(struct vm_area_struct *vma, struct vm_fault *vmf)
> > +{
> > +       struct radix_tree_iter iter;
> > +       void **slot;
> > +       struct file *file = vma->vm_file;
> > +       struct address_space *mapping = file->f_mapping;
> > +       loff_t size;
> > +       struct page *page;
> > +       unsigned long address = (unsigned long) vmf->virtual_address;
> > +       unsigned long addr;
> > +       pte_t *pte;
> > +
> > +       rcu_read_lock();
> > +       radix_tree_for_each_slot(slot, &mapping->page_tree, &iter, vmf->pgoff) {
> > +               if (iter.index > vmf->max_pgoff)
> > +                       break;
> > +repeat:
> > +               page = radix_tree_deref_slot(slot);
> 
> Here is obvious race with memory reclaimer/truncate. Pointer to page
> might become NULL.

Thanks for noticing that. It has been fixed in -mm already.

-- 
 Kirill A. Shutemov

^ permalink raw reply	[flat|nested] 73+ messages in thread

* Re: [PATCHv3 2/2] mm: implement ->map_pages for page cache
@ 2014-04-02 19:07       ` Kirill A. Shutemov
  0 siblings, 0 replies; 73+ messages in thread
From: Kirill A. Shutemov @ 2014-04-02 19:07 UTC (permalink / raw)
  To: Konstantin Khlebnikov
  Cc: Kirill A. Shutemov, Linus Torvalds, Andrew Morton, Mel Gorman,
	Rik van Riel, Andi Kleen, Matthew Wilcox, Dave Hansen,
	Alexander Viro, Dave Chinner, Ning Qu, linux-mm, linux-fsdevel,
	Linux Kernel Mailing List

On Wed, Apr 02, 2014 at 10:03:24PM +0400, Konstantin Khlebnikov wrote:
> > +void filemap_map_pages(struct vm_area_struct *vma, struct vm_fault *vmf)
> > +{
> > +       struct radix_tree_iter iter;
> > +       void **slot;
> > +       struct file *file = vma->vm_file;
> > +       struct address_space *mapping = file->f_mapping;
> > +       loff_t size;
> > +       struct page *page;
> > +       unsigned long address = (unsigned long) vmf->virtual_address;
> > +       unsigned long addr;
> > +       pte_t *pte;
> > +
> > +       rcu_read_lock();
> > +       radix_tree_for_each_slot(slot, &mapping->page_tree, &iter, vmf->pgoff) {
> > +               if (iter.index > vmf->max_pgoff)
> > +                       break;
> > +repeat:
> > +               page = radix_tree_deref_slot(slot);
> 
> Here is obvious race with memory reclaimer/truncate. Pointer to page
> might become NULL.

Thanks for noticing that. It has been fixed in -mm already.

-- 
 Kirill A. Shutemov

--
To unsubscribe, send a message with 'unsubscribe linux-mm' in
the body to majordomo@kvack.org.  For more info on Linux MM,
see: http://www.linux-mm.org/ .
Don't email: <a href=mailto:"dont@kvack.org"> email@kvack.org </a>

^ permalink raw reply	[flat|nested] 73+ messages in thread

* Re: [PATCHv3 1/2] mm: introduce vm_ops->map_pages()
  2014-02-27 19:53   ` Kirill A. Shutemov
@ 2014-07-24  3:33     ` Sasha Levin
  -1 siblings, 0 replies; 73+ messages in thread
From: Sasha Levin @ 2014-07-24  3:33 UTC (permalink / raw)
  To: Kirill A. Shutemov, Linus Torvalds, Andrew Morton, Mel Gorman,
	Rik van Riel
  Cc: Andi Kleen, Matthew Wilcox, Dave Hansen, Alexander Viro,
	Dave Chinner, Ning Qu, linux-mm, linux-fsdevel, linux-kernel,
	Dave Jones, Andrey Ryabinin

On 02/27/2014 02:53 PM, Kirill A. Shutemov wrote:
> The patch introduces new vm_ops callback ->map_pages() and uses it for
> mapping easy accessible pages around fault address.
> 
> On read page fault, if filesystem provides ->map_pages(), we try to map
> up to FAULT_AROUND_PAGES pages around page fault address in hope to
> reduce number of minor page faults.
> 
> We call ->map_pages first and use ->fault() as fallback if page by the
> offset is not ready to be mapped (cold page cache or something).
> 
> Signed-off-by: Kirill A. Shutemov <kirill.shutemov@linux.intel.com>
> ---

Hi all,

This patch triggers use-after-free when fuzzing using trinity and the KASAN
patchset.

KASAN's report is:

[  663.269187] AddressSanitizer: use after free in do_read_fault.isra.40+0x3c2/0x510 at addr ffff88048a733110
[  663.275260] page:ffffea001229ccc0 count:0 mapcount:0 mapping:          (null) index:0x0
[  663.277061] page flags: 0xafffff80008000(tail)
[  663.278759] page dumped because: kasan error
[  663.280645] CPU: 6 PID: 9262 Comm: trinity-c104 Not tainted 3.16.0-rc6-next-20140723-sasha-00047-g289342b-dirty #929
[  663.282898]  00000000000000fb 0000000000000000 ffffea001229ccc0 ffff88038ac0fb78
[  663.288759]  ffffffffa5e40903 ffff88038ac0fc48 ffff88038ac0fc38 ffffffffa142acfc
[  663.291496]  0000000000000001 ffff880509ff5aa8 ffff88038ac10038 ffff88038ac0fbb0
[  663.294379] Call Trace:
[  663.294806] dump_stack (lib/dump_stack.c:52)
[  663.300665] kasan_report_error (mm/kasan/report.c:98 mm/kasan/report.c:166)
[  663.301659] ? debug_smp_processor_id (lib/smp_processor_id.c:57)
[  663.304645] ? preempt_count_sub (kernel/sched/core.c:2606)
[  663.305800] ? put_lock_stats.isra.13 (./arch/x86/include/asm/preempt.h:98 kernel/locking/lockdep.c:254)
[  663.306839] ? do_read_fault.isra.40 (mm/memory.c:2784 mm/memory.c:2849 mm/memory.c:2898)
[  663.307515] __asan_load8 (mm/kasan/kasan.c:364)
[  663.308038] ? do_read_fault.isra.40 (mm/memory.c:2864 mm/memory.c:2898)
[  663.309158] do_read_fault.isra.40 (mm/memory.c:2864 mm/memory.c:2898)
[  663.310311] ? _raw_spin_unlock (./arch/x86/include/asm/preempt.h:98 include/linux/spinlock_api_smp.h:152 kernel/locking/spinlock.c:183)
[  663.311282] ? __pte_alloc (mm/memory.c:598)
[  663.312331] handle_mm_fault (mm/memory.c:3092 mm/memory.c:3225 mm/memory.c:3345 mm/memory.c:3374)
[  663.313895] ? pud_huge (./arch/x86/include/asm/paravirt.h:611 arch/x86/mm/hugetlbpage.c:76)
[  663.314793] __get_user_pages (mm/gup.c:286 mm/gup.c:478)
[  663.315775] __mlock_vma_pages_range (mm/mlock.c:262)
[  663.316879] __mm_populate (mm/mlock.c:710)
[  663.317813] SyS_remap_file_pages (mm/mmap.c:2653 mm/mmap.c:2593)
[  663.318848] tracesys (arch/x86/kernel/entry_64.S:541)
[  663.319683] Read of size 8 by thread T9262:
[  663.320580] Memory state around the buggy address:
[  663.321392]  ffff88048a732e80: fb fb fb fb fb fb fb fb fb fb fb fb fb fb fb fb
[  663.322573]  ffff88048a732f00: fb fb fb fb fb fb fb fb fb fb fb fb fb fb fb fb
[  663.323802]  ffff88048a732f80: fb fb fb fb fb fb fb fb fb fb fb fb fb fb fb fb
[  663.325080]  ffff88048a733000: fb fb fb fb fb fb fb fb fb fb fb fb fb fb fb fb
[  663.326327]  ffff88048a733080: fb fb fb fb fb fb fb fb fb fb fb fb fb fb fb fb
[  663.327572] >ffff88048a733100: fb fb fb fb fb fb fb fb fb fb fb fb fb fb fb fb
[  663.328840]                          ^
[  663.329487]  ffff88048a733180: fb fb fb fb fb fb fb fb fb fb fb fb fb fb fb fb
[  663.330762]  ffff88048a733200: fb fb fb fb fb fb fb fb fb fb fb fb fb fb fb fb
[  663.331994]  ffff88048a733280: fb fb fb fb fb fb fb fb fb fb fb fb fb fb fb fb
[  663.333262]  ffff88048a733300: fb fb fb fb fb fb fb fb fb fb fb fb fb fb fb fb
[  663.334488]  ffff88048a733380: fb fb fb fb fb fb fb fb fb fb fb fb fb fb fb fb

This also proceeds with the traditional:

[  663.474532] BUG: unable to handle kernel paging request at ffff88048a635de8
[  663.474548] IP: do_read_fault.isra.40 (mm/memory.c:2864 mm/memory.c:2898)

But the rest of it got scrambled between the KASAN prints and the other BUG info + trace (who
broke printk?!).


Thanks,
Sasha

^ permalink raw reply	[flat|nested] 73+ messages in thread

* Re: [PATCHv3 1/2] mm: introduce vm_ops->map_pages()
@ 2014-07-24  3:33     ` Sasha Levin
  0 siblings, 0 replies; 73+ messages in thread
From: Sasha Levin @ 2014-07-24  3:33 UTC (permalink / raw)
  To: Kirill A. Shutemov, Linus Torvalds, Andrew Morton, Mel Gorman,
	Rik van Riel
  Cc: Andi Kleen, Matthew Wilcox, Dave Hansen, Alexander Viro,
	Dave Chinner, Ning Qu, linux-mm, linux-fsdevel, linux-kernel,
	Dave Jones, Andrey Ryabinin

On 02/27/2014 02:53 PM, Kirill A. Shutemov wrote:
> The patch introduces new vm_ops callback ->map_pages() and uses it for
> mapping easy accessible pages around fault address.
> 
> On read page fault, if filesystem provides ->map_pages(), we try to map
> up to FAULT_AROUND_PAGES pages around page fault address in hope to
> reduce number of minor page faults.
> 
> We call ->map_pages first and use ->fault() as fallback if page by the
> offset is not ready to be mapped (cold page cache or something).
> 
> Signed-off-by: Kirill A. Shutemov <kirill.shutemov@linux.intel.com>
> ---

Hi all,

This patch triggers use-after-free when fuzzing using trinity and the KASAN
patchset.

KASAN's report is:

[  663.269187] AddressSanitizer: use after free in do_read_fault.isra.40+0x3c2/0x510 at addr ffff88048a733110
[  663.275260] page:ffffea001229ccc0 count:0 mapcount:0 mapping:          (null) index:0x0
[  663.277061] page flags: 0xafffff80008000(tail)
[  663.278759] page dumped because: kasan error
[  663.280645] CPU: 6 PID: 9262 Comm: trinity-c104 Not tainted 3.16.0-rc6-next-20140723-sasha-00047-g289342b-dirty #929
[  663.282898]  00000000000000fb 0000000000000000 ffffea001229ccc0 ffff88038ac0fb78
[  663.288759]  ffffffffa5e40903 ffff88038ac0fc48 ffff88038ac0fc38 ffffffffa142acfc
[  663.291496]  0000000000000001 ffff880509ff5aa8 ffff88038ac10038 ffff88038ac0fbb0
[  663.294379] Call Trace:
[  663.294806] dump_stack (lib/dump_stack.c:52)
[  663.300665] kasan_report_error (mm/kasan/report.c:98 mm/kasan/report.c:166)
[  663.301659] ? debug_smp_processor_id (lib/smp_processor_id.c:57)
[  663.304645] ? preempt_count_sub (kernel/sched/core.c:2606)
[  663.305800] ? put_lock_stats.isra.13 (./arch/x86/include/asm/preempt.h:98 kernel/locking/lockdep.c:254)
[  663.306839] ? do_read_fault.isra.40 (mm/memory.c:2784 mm/memory.c:2849 mm/memory.c:2898)
[  663.307515] __asan_load8 (mm/kasan/kasan.c:364)
[  663.308038] ? do_read_fault.isra.40 (mm/memory.c:2864 mm/memory.c:2898)
[  663.309158] do_read_fault.isra.40 (mm/memory.c:2864 mm/memory.c:2898)
[  663.310311] ? _raw_spin_unlock (./arch/x86/include/asm/preempt.h:98 include/linux/spinlock_api_smp.h:152 kernel/locking/spinlock.c:183)
[  663.311282] ? __pte_alloc (mm/memory.c:598)
[  663.312331] handle_mm_fault (mm/memory.c:3092 mm/memory.c:3225 mm/memory.c:3345 mm/memory.c:3374)
[  663.313895] ? pud_huge (./arch/x86/include/asm/paravirt.h:611 arch/x86/mm/hugetlbpage.c:76)
[  663.314793] __get_user_pages (mm/gup.c:286 mm/gup.c:478)
[  663.315775] __mlock_vma_pages_range (mm/mlock.c:262)
[  663.316879] __mm_populate (mm/mlock.c:710)
[  663.317813] SyS_remap_file_pages (mm/mmap.c:2653 mm/mmap.c:2593)
[  663.318848] tracesys (arch/x86/kernel/entry_64.S:541)
[  663.319683] Read of size 8 by thread T9262:
[  663.320580] Memory state around the buggy address:
[  663.321392]  ffff88048a732e80: fb fb fb fb fb fb fb fb fb fb fb fb fb fb fb fb
[  663.322573]  ffff88048a732f00: fb fb fb fb fb fb fb fb fb fb fb fb fb fb fb fb
[  663.323802]  ffff88048a732f80: fb fb fb fb fb fb fb fb fb fb fb fb fb fb fb fb
[  663.325080]  ffff88048a733000: fb fb fb fb fb fb fb fb fb fb fb fb fb fb fb fb
[  663.326327]  ffff88048a733080: fb fb fb fb fb fb fb fb fb fb fb fb fb fb fb fb
[  663.327572] >ffff88048a733100: fb fb fb fb fb fb fb fb fb fb fb fb fb fb fb fb
[  663.328840]                          ^
[  663.329487]  ffff88048a733180: fb fb fb fb fb fb fb fb fb fb fb fb fb fb fb fb
[  663.330762]  ffff88048a733200: fb fb fb fb fb fb fb fb fb fb fb fb fb fb fb fb
[  663.331994]  ffff88048a733280: fb fb fb fb fb fb fb fb fb fb fb fb fb fb fb fb
[  663.333262]  ffff88048a733300: fb fb fb fb fb fb fb fb fb fb fb fb fb fb fb fb
[  663.334488]  ffff88048a733380: fb fb fb fb fb fb fb fb fb fb fb fb fb fb fb fb

This also proceeds with the traditional:

[  663.474532] BUG: unable to handle kernel paging request at ffff88048a635de8
[  663.474548] IP: do_read_fault.isra.40 (mm/memory.c:2864 mm/memory.c:2898)

But the rest of it got scrambled between the KASAN prints and the other BUG info + trace (who
broke printk?!).


Thanks,
Sasha

--
To unsubscribe, send a message with 'unsubscribe linux-mm' in
the body to majordomo@kvack.org.  For more info on Linux MM,
see: http://www.linux-mm.org/ .
Don't email: <a href=mailto:"dont@kvack.org"> email@kvack.org </a>

^ permalink raw reply	[flat|nested] 73+ messages in thread

* Re: [PATCHv3 1/2] mm: introduce vm_ops->map_pages()
  2014-07-24  3:33     ` Sasha Levin
@ 2014-07-24  6:53       ` Andrey Ryabinin
  -1 siblings, 0 replies; 73+ messages in thread
From: Andrey Ryabinin @ 2014-07-24  6:53 UTC (permalink / raw)
  To: Sasha Levin, Kirill A. Shutemov, Linus Torvalds, Andrew Morton,
	Mel Gorman, Rik van Riel
  Cc: Andi Kleen, Matthew Wilcox, Dave Hansen, Alexander Viro,
	Dave Chinner, Ning Qu, linux-mm, linux-fsdevel, linux-kernel,
	Dave Jones, Konstantin Khlebnikov

On 07/24/14 07:33, Sasha Levin wrote:
> On 02/27/2014 02:53 PM, Kirill A. Shutemov wrote:
>> The patch introduces new vm_ops callback ->map_pages() and uses it for
>> mapping easy accessible pages around fault address.
>>
>> On read page fault, if filesystem provides ->map_pages(), we try to map
>> up to FAULT_AROUND_PAGES pages around page fault address in hope to
>> reduce number of minor page faults.
>>
>> We call ->map_pages first and use ->fault() as fallback if page by the
>> offset is not ready to be mapped (cold page cache or something).
>>
>> Signed-off-by: Kirill A. Shutemov <kirill.shutemov@linux.intel.com>
>> ---
> 
> Hi all,
> 
> This patch triggers use-after-free when fuzzing using trinity and the KASAN
> patchset.
> 

I think this should be fixed already by following patch:

From: Konstantin Khlebnikov <koct9i@gmail.com>
Subject: mm: do not call do_fault_around for non-linear fault

Ingo Korb reported that "repeated mapping of the same file on tmpfs using
remap_file_pages sometimes triggers a BUG at mm/filemap.c:202 when the
process exits".  He bisected the bug to d7c1755179b82d ("mm: implement
->map_pages for shmem/tmpfs"), although the bug was actually added by
8c6e50b0290c4 ("mm: introduce vm_ops->map_pages()").

Problem is caused by calling do_fault_around for _non-linear_ faiult.  In
this case pgoff is shifted and might become negative during calculation.

Faulting around non-linear page-fault has no sense and breaks logic in
do_fault_around because pgoff is shifted.

Signed-off-by: Konstantin Khlebnikov <koct9i@gmail.com>
Reported-by: Ingo Korb <ingo.korb@tu-dortmund.de>
Tested-by: Ingo Korb <ingo.korb@tu-dortmund.de>
Cc: Hugh Dickins <hughd@google.com>
Cc: Sasha Levin <sasha.levin@oracle.com>
Cc: Dave Jones <davej@redhat.com>
Cc: Ning Qu <quning@google.com>
Cc: "Kirill A. Shutemov" <kirill.shutemov@linux.intel.com>
Cc: <stable@vger.kernel.org>	[3.15.x]
Signed-off-by: Andrew Morton <akpm@linux-foundation.org>
---

 mm/memory.c |    3 ++-
 1 file changed, 2 insertions(+), 1 deletion(-)

diff -puN mm/memory.c~mm-do-not-call-do_fault_around-for-non-linear-fault mm/memory.c
--- a/mm/memory.c~mm-do-not-call-do_fault_around-for-non-linear-fault
+++ a/mm/memory.c
@@ -2882,7 +2882,8 @@ static int do_read_fault(struct mm_struc
 	 * if page by the offset is not ready to be mapped (cold cache or
 	 * something).
 	 */
-	if (vma->vm_ops->map_pages && fault_around_pages() > 1) {
+	if (vma->vm_ops->map_pages && !(flags & FAULT_FLAG_NONLINEAR) &&
+	    fault_around_pages() > 1) {
 		pte = pte_offset_map_lock(mm, pmd, address, &ptl);
 		do_fault_around(vma, address, pte, pgoff, flags);
 		if (!pte_same(*pte, orig_pte))


> KASAN's report is:
> 
> [  663.269187] AddressSanitizer: use after free in do_read_fault.isra.40+0x3c2/0x510 at addr ffff88048a733110
> [  663.275260] page:ffffea001229ccc0 count:0 mapcount:0 mapping:          (null) index:0x0
> [  663.277061] page flags: 0xafffff80008000(tail)
> [  663.278759] page dumped because: kasan error
> [  663.280645] CPU: 6 PID: 9262 Comm: trinity-c104 Not tainted 3.16.0-rc6-next-20140723-sasha-00047-g289342b-dirty #929
> [  663.282898]  00000000000000fb 0000000000000000 ffffea001229ccc0 ffff88038ac0fb78
> [  663.288759]  ffffffffa5e40903 ffff88038ac0fc48 ffff88038ac0fc38 ffffffffa142acfc
> [  663.291496]  0000000000000001 ffff880509ff5aa8 ffff88038ac10038 ffff88038ac0fbb0
> [  663.294379] Call Trace:
> [  663.294806] dump_stack (lib/dump_stack.c:52)
> [  663.300665] kasan_report_error (mm/kasan/report.c:98 mm/kasan/report.c:166)
> [  663.301659] ? debug_smp_processor_id (lib/smp_processor_id.c:57)
> [  663.304645] ? preempt_count_sub (kernel/sched/core.c:2606)
> [  663.305800] ? put_lock_stats.isra.13 (./arch/x86/include/asm/preempt.h:98 kernel/locking/lockdep.c:254)
> [  663.306839] ? do_read_fault.isra.40 (mm/memory.c:2784 mm/memory.c:2849 mm/memory.c:2898)
> [  663.307515] __asan_load8 (mm/kasan/kasan.c:364)
> [  663.308038] ? do_read_fault.isra.40 (mm/memory.c:2864 mm/memory.c:2898)
> [  663.309158] do_read_fault.isra.40 (mm/memory.c:2864 mm/memory.c:2898)
> [  663.310311] ? _raw_spin_unlock (./arch/x86/include/asm/preempt.h:98 include/linux/spinlock_api_smp.h:152 kernel/locking/spinlock.c:183)
> [  663.311282] ? __pte_alloc (mm/memory.c:598)
> [  663.312331] handle_mm_fault (mm/memory.c:3092 mm/memory.c:3225 mm/memory.c:3345 mm/memory.c:3374)
> [  663.313895] ? pud_huge (./arch/x86/include/asm/paravirt.h:611 arch/x86/mm/hugetlbpage.c:76)
> [  663.314793] __get_user_pages (mm/gup.c:286 mm/gup.c:478)
> [  663.315775] __mlock_vma_pages_range (mm/mlock.c:262)
> [  663.316879] __mm_populate (mm/mlock.c:710)
> [  663.317813] SyS_remap_file_pages (mm/mmap.c:2653 mm/mmap.c:2593)
> [  663.318848] tracesys (arch/x86/kernel/entry_64.S:541)
> [  663.319683] Read of size 8 by thread T9262:
> [  663.320580] Memory state around the buggy address:
> [  663.321392]  ffff88048a732e80: fb fb fb fb fb fb fb fb fb fb fb fb fb fb fb fb
> [  663.322573]  ffff88048a732f00: fb fb fb fb fb fb fb fb fb fb fb fb fb fb fb fb
> [  663.323802]  ffff88048a732f80: fb fb fb fb fb fb fb fb fb fb fb fb fb fb fb fb
> [  663.325080]  ffff88048a733000: fb fb fb fb fb fb fb fb fb fb fb fb fb fb fb fb
> [  663.326327]  ffff88048a733080: fb fb fb fb fb fb fb fb fb fb fb fb fb fb fb fb
> [  663.327572] >ffff88048a733100: fb fb fb fb fb fb fb fb fb fb fb fb fb fb fb fb
> [  663.328840]                          ^
> [  663.329487]  ffff88048a733180: fb fb fb fb fb fb fb fb fb fb fb fb fb fb fb fb
> [  663.330762]  ffff88048a733200: fb fb fb fb fb fb fb fb fb fb fb fb fb fb fb fb
> [  663.331994]  ffff88048a733280: fb fb fb fb fb fb fb fb fb fb fb fb fb fb fb fb
> [  663.333262]  ffff88048a733300: fb fb fb fb fb fb fb fb fb fb fb fb fb fb fb fb
> [  663.334488]  ffff88048a733380: fb fb fb fb fb fb fb fb fb fb fb fb fb fb fb fb
> 
> This also proceeds with the traditional:
> 
> [  663.474532] BUG: unable to handle kernel paging request at ffff88048a635de8
> [  663.474548] IP: do_read_fault.isra.40 (mm/memory.c:2864 mm/memory.c:2898)
> 
> But the rest of it got scrambled between the KASAN prints and the other BUG info + trace (who
> broke printk?!).
> 
> 
> Thanks,
> Sasha
> 


^ permalink raw reply	[flat|nested] 73+ messages in thread

* Re: [PATCHv3 1/2] mm: introduce vm_ops->map_pages()
@ 2014-07-24  6:53       ` Andrey Ryabinin
  0 siblings, 0 replies; 73+ messages in thread
From: Andrey Ryabinin @ 2014-07-24  6:53 UTC (permalink / raw)
  To: Sasha Levin, Kirill A. Shutemov, Linus Torvalds, Andrew Morton,
	Mel Gorman, Rik van Riel
  Cc: Andi Kleen, Matthew Wilcox, Dave Hansen, Alexander Viro,
	Dave Chinner, Ning Qu, linux-mm, linux-fsdevel, linux-kernel,
	Dave Jones, Konstantin Khlebnikov

On 07/24/14 07:33, Sasha Levin wrote:
> On 02/27/2014 02:53 PM, Kirill A. Shutemov wrote:
>> The patch introduces new vm_ops callback ->map_pages() and uses it for
>> mapping easy accessible pages around fault address.
>>
>> On read page fault, if filesystem provides ->map_pages(), we try to map
>> up to FAULT_AROUND_PAGES pages around page fault address in hope to
>> reduce number of minor page faults.
>>
>> We call ->map_pages first and use ->fault() as fallback if page by the
>> offset is not ready to be mapped (cold page cache or something).
>>
>> Signed-off-by: Kirill A. Shutemov <kirill.shutemov@linux.intel.com>
>> ---
> 
> Hi all,
> 
> This patch triggers use-after-free when fuzzing using trinity and the KASAN
> patchset.
> 

I think this should be fixed already by following patch:

From: Konstantin Khlebnikov <koct9i@gmail.com>
Subject: mm: do not call do_fault_around for non-linear fault

Ingo Korb reported that "repeated mapping of the same file on tmpfs using
remap_file_pages sometimes triggers a BUG at mm/filemap.c:202 when the
process exits".  He bisected the bug to d7c1755179b82d ("mm: implement
->map_pages for shmem/tmpfs"), although the bug was actually added by
8c6e50b0290c4 ("mm: introduce vm_ops->map_pages()").

Problem is caused by calling do_fault_around for _non-linear_ faiult.  In
this case pgoff is shifted and might become negative during calculation.

Faulting around non-linear page-fault has no sense and breaks logic in
do_fault_around because pgoff is shifted.

Signed-off-by: Konstantin Khlebnikov <koct9i@gmail.com>
Reported-by: Ingo Korb <ingo.korb@tu-dortmund.de>
Tested-by: Ingo Korb <ingo.korb@tu-dortmund.de>
Cc: Hugh Dickins <hughd@google.com>
Cc: Sasha Levin <sasha.levin@oracle.com>
Cc: Dave Jones <davej@redhat.com>
Cc: Ning Qu <quning@google.com>
Cc: "Kirill A. Shutemov" <kirill.shutemov@linux.intel.com>
Cc: <stable@vger.kernel.org>	[3.15.x]
Signed-off-by: Andrew Morton <akpm@linux-foundation.org>
---

 mm/memory.c |    3 ++-
 1 file changed, 2 insertions(+), 1 deletion(-)

diff -puN mm/memory.c~mm-do-not-call-do_fault_around-for-non-linear-fault mm/memory.c
--- a/mm/memory.c~mm-do-not-call-do_fault_around-for-non-linear-fault
+++ a/mm/memory.c
@@ -2882,7 +2882,8 @@ static int do_read_fault(struct mm_struc
 	 * if page by the offset is not ready to be mapped (cold cache or
 	 * something).
 	 */
-	if (vma->vm_ops->map_pages && fault_around_pages() > 1) {
+	if (vma->vm_ops->map_pages && !(flags & FAULT_FLAG_NONLINEAR) &&
+	    fault_around_pages() > 1) {
 		pte = pte_offset_map_lock(mm, pmd, address, &ptl);
 		do_fault_around(vma, address, pte, pgoff, flags);
 		if (!pte_same(*pte, orig_pte))


> KASAN's report is:
> 
> [  663.269187] AddressSanitizer: use after free in do_read_fault.isra.40+0x3c2/0x510 at addr ffff88048a733110
> [  663.275260] page:ffffea001229ccc0 count:0 mapcount:0 mapping:          (null) index:0x0
> [  663.277061] page flags: 0xafffff80008000(tail)
> [  663.278759] page dumped because: kasan error
> [  663.280645] CPU: 6 PID: 9262 Comm: trinity-c104 Not tainted 3.16.0-rc6-next-20140723-sasha-00047-g289342b-dirty #929
> [  663.282898]  00000000000000fb 0000000000000000 ffffea001229ccc0 ffff88038ac0fb78
> [  663.288759]  ffffffffa5e40903 ffff88038ac0fc48 ffff88038ac0fc38 ffffffffa142acfc
> [  663.291496]  0000000000000001 ffff880509ff5aa8 ffff88038ac10038 ffff88038ac0fbb0
> [  663.294379] Call Trace:
> [  663.294806] dump_stack (lib/dump_stack.c:52)
> [  663.300665] kasan_report_error (mm/kasan/report.c:98 mm/kasan/report.c:166)
> [  663.301659] ? debug_smp_processor_id (lib/smp_processor_id.c:57)
> [  663.304645] ? preempt_count_sub (kernel/sched/core.c:2606)
> [  663.305800] ? put_lock_stats.isra.13 (./arch/x86/include/asm/preempt.h:98 kernel/locking/lockdep.c:254)
> [  663.306839] ? do_read_fault.isra.40 (mm/memory.c:2784 mm/memory.c:2849 mm/memory.c:2898)
> [  663.307515] __asan_load8 (mm/kasan/kasan.c:364)
> [  663.308038] ? do_read_fault.isra.40 (mm/memory.c:2864 mm/memory.c:2898)
> [  663.309158] do_read_fault.isra.40 (mm/memory.c:2864 mm/memory.c:2898)
> [  663.310311] ? _raw_spin_unlock (./arch/x86/include/asm/preempt.h:98 include/linux/spinlock_api_smp.h:152 kernel/locking/spinlock.c:183)
> [  663.311282] ? __pte_alloc (mm/memory.c:598)
> [  663.312331] handle_mm_fault (mm/memory.c:3092 mm/memory.c:3225 mm/memory.c:3345 mm/memory.c:3374)
> [  663.313895] ? pud_huge (./arch/x86/include/asm/paravirt.h:611 arch/x86/mm/hugetlbpage.c:76)
> [  663.314793] __get_user_pages (mm/gup.c:286 mm/gup.c:478)
> [  663.315775] __mlock_vma_pages_range (mm/mlock.c:262)
> [  663.316879] __mm_populate (mm/mlock.c:710)
> [  663.317813] SyS_remap_file_pages (mm/mmap.c:2653 mm/mmap.c:2593)
> [  663.318848] tracesys (arch/x86/kernel/entry_64.S:541)
> [  663.319683] Read of size 8 by thread T9262:
> [  663.320580] Memory state around the buggy address:
> [  663.321392]  ffff88048a732e80: fb fb fb fb fb fb fb fb fb fb fb fb fb fb fb fb
> [  663.322573]  ffff88048a732f00: fb fb fb fb fb fb fb fb fb fb fb fb fb fb fb fb
> [  663.323802]  ffff88048a732f80: fb fb fb fb fb fb fb fb fb fb fb fb fb fb fb fb
> [  663.325080]  ffff88048a733000: fb fb fb fb fb fb fb fb fb fb fb fb fb fb fb fb
> [  663.326327]  ffff88048a733080: fb fb fb fb fb fb fb fb fb fb fb fb fb fb fb fb
> [  663.327572] >ffff88048a733100: fb fb fb fb fb fb fb fb fb fb fb fb fb fb fb fb
> [  663.328840]                          ^
> [  663.329487]  ffff88048a733180: fb fb fb fb fb fb fb fb fb fb fb fb fb fb fb fb
> [  663.330762]  ffff88048a733200: fb fb fb fb fb fb fb fb fb fb fb fb fb fb fb fb
> [  663.331994]  ffff88048a733280: fb fb fb fb fb fb fb fb fb fb fb fb fb fb fb fb
> [  663.333262]  ffff88048a733300: fb fb fb fb fb fb fb fb fb fb fb fb fb fb fb fb
> [  663.334488]  ffff88048a733380: fb fb fb fb fb fb fb fb fb fb fb fb fb fb fb fb
> 
> This also proceeds with the traditional:
> 
> [  663.474532] BUG: unable to handle kernel paging request at ffff88048a635de8
> [  663.474548] IP: do_read_fault.isra.40 (mm/memory.c:2864 mm/memory.c:2898)
> 
> But the rest of it got scrambled between the KASAN prints and the other BUG info + trace (who
> broke printk?!).
> 
> 
> Thanks,
> Sasha
> 

--
To unsubscribe, send a message with 'unsubscribe linux-mm' in
the body to majordomo@kvack.org.  For more info on Linux MM,
see: http://www.linux-mm.org/ .
Don't email: <a href=mailto:"dont@kvack.org"> email@kvack.org </a>

^ permalink raw reply	[flat|nested] 73+ messages in thread

* Re: [PATCHv3 1/2] mm: introduce vm_ops->map_pages()
  2014-07-24  6:53       ` Andrey Ryabinin
@ 2014-07-24 12:48         ` Sasha Levin
  -1 siblings, 0 replies; 73+ messages in thread
From: Sasha Levin @ 2014-07-24 12:48 UTC (permalink / raw)
  To: Andrey Ryabinin, Kirill A. Shutemov, Linus Torvalds,
	Andrew Morton, Mel Gorman, Rik van Riel
  Cc: Andi Kleen, Matthew Wilcox, Dave Hansen, Alexander Viro,
	Dave Chinner, Ning Qu, linux-mm, linux-fsdevel, linux-kernel,
	Dave Jones, Konstantin Khlebnikov

On 07/24/2014 02:53 AM, Andrey Ryabinin wrote:
> On 07/24/14 07:33, Sasha Levin wrote:
>> > On 02/27/2014 02:53 PM, Kirill A. Shutemov wrote:
>>> >> The patch introduces new vm_ops callback ->map_pages() and uses it for
>>> >> mapping easy accessible pages around fault address.
>>> >>
>>> >> On read page fault, if filesystem provides ->map_pages(), we try to map
>>> >> up to FAULT_AROUND_PAGES pages around page fault address in hope to
>>> >> reduce number of minor page faults.
>>> >>
>>> >> We call ->map_pages first and use ->fault() as fallback if page by the
>>> >> offset is not ready to be mapped (cold page cache or something).
>>> >>
>>> >> Signed-off-by: Kirill A. Shutemov <kirill.shutemov@linux.intel.com>
>>> >> ---
>> > 
>> > Hi all,
>> > 
>> > This patch triggers use-after-free when fuzzing using trinity and the KASAN
>> > patchset.
>> > 
> I think this should be fixed already by following patch:
> 
> From: Konstantin Khlebnikov <koct9i@gmail.com>
> Subject: mm: do not call do_fault_around for non-linear fault

I don't think so. It's supposed to deal with a different issue, and it was already
in my -next tree which triggered the issue I've reported.


Thanks,
Sasha

^ permalink raw reply	[flat|nested] 73+ messages in thread

* Re: [PATCHv3 1/2] mm: introduce vm_ops->map_pages()
@ 2014-07-24 12:48         ` Sasha Levin
  0 siblings, 0 replies; 73+ messages in thread
From: Sasha Levin @ 2014-07-24 12:48 UTC (permalink / raw)
  To: Andrey Ryabinin, Kirill A. Shutemov, Linus Torvalds,
	Andrew Morton, Mel Gorman, Rik van Riel
  Cc: Andi Kleen, Matthew Wilcox, Dave Hansen, Alexander Viro,
	Dave Chinner, Ning Qu, linux-mm, linux-fsdevel, linux-kernel,
	Dave Jones, Konstantin Khlebnikov

On 07/24/2014 02:53 AM, Andrey Ryabinin wrote:
> On 07/24/14 07:33, Sasha Levin wrote:
>> > On 02/27/2014 02:53 PM, Kirill A. Shutemov wrote:
>>> >> The patch introduces new vm_ops callback ->map_pages() and uses it for
>>> >> mapping easy accessible pages around fault address.
>>> >>
>>> >> On read page fault, if filesystem provides ->map_pages(), we try to map
>>> >> up to FAULT_AROUND_PAGES pages around page fault address in hope to
>>> >> reduce number of minor page faults.
>>> >>
>>> >> We call ->map_pages first and use ->fault() as fallback if page by the
>>> >> offset is not ready to be mapped (cold page cache or something).
>>> >>
>>> >> Signed-off-by: Kirill A. Shutemov <kirill.shutemov@linux.intel.com>
>>> >> ---
>> > 
>> > Hi all,
>> > 
>> > This patch triggers use-after-free when fuzzing using trinity and the KASAN
>> > patchset.
>> > 
> I think this should be fixed already by following patch:
> 
> From: Konstantin Khlebnikov <koct9i@gmail.com>
> Subject: mm: do not call do_fault_around for non-linear fault

I don't think so. It's supposed to deal with a different issue, and it was already
in my -next tree which triggered the issue I've reported.


Thanks,
Sasha

--
To unsubscribe, send a message with 'unsubscribe linux-mm' in
the body to majordomo@kvack.org.  For more info on Linux MM,
see: http://www.linux-mm.org/ .
Don't email: <a href=mailto:"dont@kvack.org"> email@kvack.org </a>

^ permalink raw reply	[flat|nested] 73+ messages in thread

* Re: [PATCHv3 1/2] mm: introduce vm_ops->map_pages()
  2014-07-24  3:33     ` Sasha Levin
@ 2014-07-24 13:05       ` Sasha Levin
  -1 siblings, 0 replies; 73+ messages in thread
From: Sasha Levin @ 2014-07-24 13:05 UTC (permalink / raw)
  To: Kirill A. Shutemov, Linus Torvalds, Andrew Morton, Mel Gorman,
	Rik van Riel
  Cc: Andi Kleen, Matthew Wilcox, Dave Hansen, Alexander Viro,
	Dave Chinner, Ning Qu, linux-mm, linux-fsdevel, linux-kernel,
	Dave Jones, Andrey Ryabinin

On 07/23/2014 11:33 PM, Sasha Levin wrote:
> On 02/27/2014 02:53 PM, Kirill A. Shutemov wrote:
>> > The patch introduces new vm_ops callback ->map_pages() and uses it for
>> > mapping easy accessible pages around fault address.
>> > 
>> > On read page fault, if filesystem provides ->map_pages(), we try to map
>> > up to FAULT_AROUND_PAGES pages around page fault address in hope to
>> > reduce number of minor page faults.
>> > 
>> > We call ->map_pages first and use ->fault() as fallback if page by the
>> > offset is not ready to be mapped (cold page cache or something).
>> > 
>> > Signed-off-by: Kirill A. Shutemov <kirill.shutemov@linux.intel.com>
>> > ---
> Hi all,
> 
> This patch triggers use-after-free when fuzzing using trinity and the KASAN
> patchset.

FWIW, if it helps, here's another KASAN report with the conventional BUG following it:

[  360.498001] ==================================================================
[  360.500896] AddressSanitizer: use after free in do_read_fault.isra.40+0x3c2/0x510 at addr ffff880581ee3fd0
[  360.504474] page:ffffea001607b8c0 count:0 mapcount:0 mapping:          (null) index:0x0
[  360.507264] page flags: 0xefffff80000000()
[  360.508655] page dumped because: kasan error
[  360.509489] CPU: 8 PID: 9251 Comm: trinity-c159 Not tainted 3.16.0-rc6-next-20140723-sasha-00047-g289342b-dirty #929
[  360.511717]  00000000000000ff 0000000000000000 ffffea001607b8c0 ffff8801ba8bbb98
[  360.513272]  ffffffff8fe40903 ffff8801ba8bbc68 ffff8801ba8bbc58 ffffffff8b42acfc
[  360.514729]  0000000000000001 ffff880592deaa48 ffff8801ba84b038 ffff8801ba8bbbd0
[  360.516156] Call Trace:
[  360.516622] dump_stack (lib/dump_stack.c:52)
[  360.517566] kasan_report_error (mm/kasan/report.c:98 mm/kasan/report.c:166)
[  360.518745] ? debug_smp_processor_id (lib/smp_processor_id.c:57)
[  360.519923] ? preempt_count_sub (kernel/sched/core.c:2606)
[  360.521124] ? put_lock_stats.isra.13 (./arch/x86/include/asm/preempt.h:98 kernel/locking/lockdep.c:254)
[  360.522431] ? do_read_fault.isra.40 (mm/memory.c:2784 mm/memory.c:2849 mm/memory.c:2898)
[  360.523651] __asan_load8 (mm/kasan/kasan.c:364)
[  360.524625] ? do_read_fault.isra.40 (mm/memory.c:2864 mm/memory.c:2898)
[  360.525887] do_read_fault.isra.40 (mm/memory.c:2864 mm/memory.c:2898)
[  360.527156] ? __rcu_read_unlock (kernel/rcu/update.c:101)
[  360.528251] handle_mm_fault (mm/memory.c:3092 mm/memory.c:3225 mm/memory.c:3345 mm/memory.c:3374)
[  360.529308] ? vmacache_update (mm/vmacache.c:61)
[  360.530505] ? find_vma (mm/mmap.c:2027)
[  360.531453] __do_page_fault (arch/x86/mm/fault.c:1231)
[  360.532503] ? context_tracking_user_exit (kernel/context_tracking.c:184)
[  360.533744] ? __this_cpu_preempt_check (lib/smp_processor_id.c:63)
[  360.535001] ? trace_hardirqs_off_caller (kernel/locking/lockdep.c:2639 (discriminator 8))
[  360.536262] ? trace_hardirqs_off (kernel/locking/lockdep.c:2645)
[  360.537360] trace_do_page_fault (arch/x86/mm/fault.c:1314 include/linux/jump_label.h:115 include/linux/context_tracking_state.h:27 include/linux/context_tracking.h:45 arch/x86/mm/fault.c:1315)
[  360.538493] do_async_page_fault (arch/x86/kernel/kvm.c:279)
[  360.539567] async_page_fault (arch/x86/kernel/entry_64.S:1321)
[  360.540668] Read of size 8 by thread T9251:
[  360.541485] Memory state around the buggy address:
[  360.542463]  ffff880581ee3d00: ff ff ff ff ff ff ff ff ff ff ff ff ff ff ff ff
[  360.543816]  ffff880581ee3d80: ff ff ff ff ff ff ff ff ff ff ff ff ff ff ff ff
[  360.545116]  ffff880581ee3e00: ff ff ff ff ff ff ff ff ff ff ff ff ff ff ff ff
[  360.546476]  ffff880581ee3e80: ff ff ff ff ff ff ff ff ff ff ff ff ff ff ff ff
[  360.547806]  ffff880581ee3f00: ff ff ff ff ff ff ff ff ff ff ff ff ff ff ff ff
[  360.549112] >ffff880581ee3f80: ff ff ff ff ff ff ff ff ff ff ff ff ff ff ff ff
[  360.550507]                                                  ^
[  360.551574]  ffff880581ee4000: 00 00 00 00 00 00 00 00 00 00 00 00 00 00 00 00
[  360.552910]  ffff880581ee4080: 00 00 00 00 00 00 00 00 00 00 00 00 00 00 00 00
[  360.554207]  ffff880581ee4100: 00 00 00 00 00 00 00 00 00 00 00 00 00 00 00 00
[  360.555621]  ffff880581ee4180: 00 00 00 00 00 00 00 00 00 00 00 00 00 00 00 00
[  360.557035]  ffff880581ee4200: 00 00 00 00 00 00 00 00 00 00 00 00 00 00 00 00
[  360.558489] ==================================================================
[  360.559804] BUG: unable to handle kernel paging request at ffff880581ee3fd0
[  360.559817] IP: do_read_fault.isra.40 (mm/memory.c:2864 mm/memory.c:2898)
[  360.559827] PGD 147b9067 PUD 70353d067 PMD 70352d067 PTE 8000000581ee3060
[  360.559836] Oops: 0000 [#1] PREEMPT SMP DEBUG_PAGEALLOC
[  360.559892] Dumping ftrace buffer:
[  360.560117]    (ftrace buffer empty)
[  360.560132] Modules linked in:
[  360.560145] CPU: 8 PID: 9251 Comm: trinity-c159 Not tainted 3.16.0-rc6-next-20140723-sasha-00047-g289342b-dirty #929
[  360.560154] task: ffff8801ba84b000 ti: ffff8801ba8b8000 task.ti: ffff8801ba8b8000
[  360.560172] RIP: do_read_fault.isra.40 (mm/memory.c:2864 mm/memory.c:2898)
[  360.560180] RSP: 0000:ffff8801ba8bbc98  EFLAGS: 00010296
[  360.560186] RAX: 0000000000000000 RBX: ffff8801ba8d5a00 RCX: 0000000000000006
[  360.560192] RDX: 0000000000000006 RSI: ffffffff912f5fd5 RDI: 0000000000000282
[  360.560199] RBP: ffff8801ba8bbd58 R08: 0000000000000001 R09: 0000000000000000
[  360.560206] R10: 0000000000000000 R11: 0000000000000000 R12: ffff8801ba8ad470
[  360.560213] R13: 00007f2411bfa000 R14: 0000000000000000 R15: ffff880581ee3fd0
[  360.560222] FS:  00007f2412de6700(0000) GS:ffff8801de000000(0000) knlGS:0000000000000000
[  360.560228] CS:  0010 DS: 0000 ES: 0000 CR0: 0000000080050033
[  360.560234] CR2: ffff880581ee3fd0 CR3: 00000001ba89b000 CR4: 00000000000006a0
[  360.560255] DR0: 00000000006ec000 DR1: 0000000000000000 DR2: 0000000000000000
[  360.560262] DR3: 0000000000000000 DR6: 00000000ffff0ff0 DR7: 00000000000d0602
[  360.560264] Stack:
[  360.560278]  0000000000000001 0000000000000000 ffff880581ee4030 ffff880592deaa30
[  360.560290]  0000000000000005 00007f2411c07000 ffff8801ba8d5a90 ffff880592deaa30
[  360.560297]  0000000000000000 000000a8ba84b038 000000000000000c 00007f2411c06220
[  360.560299] Call Trace:
[  360.560317] ? __rcu_read_unlock (kernel/rcu/update.c:101)
[  360.560332] handle_mm_fault (mm/memory.c:3092 mm/memory.c:3225 mm/memory.c:3345 mm/memory.c:3374)
[  360.560345] ? vmacache_update (mm/vmacache.c:61)
[  360.560358] ? find_vma (mm/mmap.c:2027)
[  360.560370] __do_page_fault (arch/x86/mm/fault.c:1231)
[  360.560386] ? context_tracking_user_exit (kernel/context_tracking.c:184)
[  360.560400] ? __this_cpu_preempt_check (lib/smp_processor_id.c:63)
[  360.560413] ? trace_hardirqs_off_caller (kernel/locking/lockdep.c:2639 (discriminator 8))
[  360.560425] ? trace_hardirqs_off (kernel/locking/lockdep.c:2645)
[  360.560432] trace_do_page_fault (arch/x86/mm/fault.c:1314 include/linux/jump_label.h:115 include/linux/context_tracking_state.h:27 include/linux/context_tracking.h:45 arch/x86/mm/fault.c:1315)
[  360.560440] do_async_page_fault (arch/x86/kernel/kvm.c:279)
[  360.560447] async_page_fault (arch/x86/kernel/entry_64.S:1321)
[ 360.560458] Code: 47 f9 48 8b 8d 68 ff ff ff 48 29 f1 48 c1 e9 0c 49 8d 44 08 ff 48 39 c7 48 0f 46 c7 4c 89 ff 48 89 85 60 ff ff ff e8 3e 19 07 00 <49> 83 3f 00 0f 84 8f 00 00 00 49 83 c6 01 4c 39 b5 60 ff ff ff
All code
========
   0:	47 f9                	rex.RXB stc
   2:	48 8b 8d 68 ff ff ff 	mov    -0x98(%rbp),%rcx
   9:	48 29 f1             	sub    %rsi,%rcx
   c:	48 c1 e9 0c          	shr    $0xc,%rcx
  10:	49 8d 44 08 ff       	lea    -0x1(%r8,%rcx,1),%rax
  15:	48 39 c7             	cmp    %rax,%rdi
  18:	48 0f 46 c7          	cmovbe %rdi,%rax
  1c:	4c 89 ff             	mov    %r15,%rdi
  1f:	48 89 85 60 ff ff ff 	mov    %rax,-0xa0(%rbp)
  26:	e8 3e 19 07 00       	callq  0x71969
  2b:*	49 83 3f 00          	cmpq   $0x0,(%r15)		<-- trapping instruction
  2f:	0f 84 8f 00 00 00    	je     0xc4
  35:	49 83 c6 01          	add    $0x1,%r14
  39:	4c 39 b5 60 ff ff ff 	cmp    %r14,-0xa0(%rbp)
	...

Code starting with the faulting instruction
===========================================
   0:	49 83 3f 00          	cmpq   $0x0,(%r15)
   4:	0f 84 8f 00 00 00    	je     0x99
   a:	49 83 c6 01          	add    $0x1,%r14
   e:	4c 39 b5 60 ff ff ff 	cmp    %r14,-0xa0(%rbp)
	...
[  360.560458] RIP do_read_fault.isra.40 (mm/memory.c:2864 mm/memory.c:2898)
[  360.560458]  RSP <ffff8801ba8bbc98>
[  360.560458] CR2: ffff880581ee3fd0
[  360.560458] ---[ end trace ccd7cee352be7945 ]---

Thanks,
Sasha

^ permalink raw reply	[flat|nested] 73+ messages in thread

* Re: [PATCHv3 1/2] mm: introduce vm_ops->map_pages()
@ 2014-07-24 13:05       ` Sasha Levin
  0 siblings, 0 replies; 73+ messages in thread
From: Sasha Levin @ 2014-07-24 13:05 UTC (permalink / raw)
  To: Kirill A. Shutemov, Linus Torvalds, Andrew Morton, Mel Gorman,
	Rik van Riel
  Cc: Andi Kleen, Matthew Wilcox, Dave Hansen, Alexander Viro,
	Dave Chinner, Ning Qu, linux-mm, linux-fsdevel, linux-kernel,
	Dave Jones, Andrey Ryabinin

On 07/23/2014 11:33 PM, Sasha Levin wrote:
> On 02/27/2014 02:53 PM, Kirill A. Shutemov wrote:
>> > The patch introduces new vm_ops callback ->map_pages() and uses it for
>> > mapping easy accessible pages around fault address.
>> > 
>> > On read page fault, if filesystem provides ->map_pages(), we try to map
>> > up to FAULT_AROUND_PAGES pages around page fault address in hope to
>> > reduce number of minor page faults.
>> > 
>> > We call ->map_pages first and use ->fault() as fallback if page by the
>> > offset is not ready to be mapped (cold page cache or something).
>> > 
>> > Signed-off-by: Kirill A. Shutemov <kirill.shutemov@linux.intel.com>
>> > ---
> Hi all,
> 
> This patch triggers use-after-free when fuzzing using trinity and the KASAN
> patchset.

FWIW, if it helps, here's another KASAN report with the conventional BUG following it:

[  360.498001] ==================================================================
[  360.500896] AddressSanitizer: use after free in do_read_fault.isra.40+0x3c2/0x510 at addr ffff880581ee3fd0
[  360.504474] page:ffffea001607b8c0 count:0 mapcount:0 mapping:          (null) index:0x0
[  360.507264] page flags: 0xefffff80000000()
[  360.508655] page dumped because: kasan error
[  360.509489] CPU: 8 PID: 9251 Comm: trinity-c159 Not tainted 3.16.0-rc6-next-20140723-sasha-00047-g289342b-dirty #929
[  360.511717]  00000000000000ff 0000000000000000 ffffea001607b8c0 ffff8801ba8bbb98
[  360.513272]  ffffffff8fe40903 ffff8801ba8bbc68 ffff8801ba8bbc58 ffffffff8b42acfc
[  360.514729]  0000000000000001 ffff880592deaa48 ffff8801ba84b038 ffff8801ba8bbbd0
[  360.516156] Call Trace:
[  360.516622] dump_stack (lib/dump_stack.c:52)
[  360.517566] kasan_report_error (mm/kasan/report.c:98 mm/kasan/report.c:166)
[  360.518745] ? debug_smp_processor_id (lib/smp_processor_id.c:57)
[  360.519923] ? preempt_count_sub (kernel/sched/core.c:2606)
[  360.521124] ? put_lock_stats.isra.13 (./arch/x86/include/asm/preempt.h:98 kernel/locking/lockdep.c:254)
[  360.522431] ? do_read_fault.isra.40 (mm/memory.c:2784 mm/memory.c:2849 mm/memory.c:2898)
[  360.523651] __asan_load8 (mm/kasan/kasan.c:364)
[  360.524625] ? do_read_fault.isra.40 (mm/memory.c:2864 mm/memory.c:2898)
[  360.525887] do_read_fault.isra.40 (mm/memory.c:2864 mm/memory.c:2898)
[  360.527156] ? __rcu_read_unlock (kernel/rcu/update.c:101)
[  360.528251] handle_mm_fault (mm/memory.c:3092 mm/memory.c:3225 mm/memory.c:3345 mm/memory.c:3374)
[  360.529308] ? vmacache_update (mm/vmacache.c:61)
[  360.530505] ? find_vma (mm/mmap.c:2027)
[  360.531453] __do_page_fault (arch/x86/mm/fault.c:1231)
[  360.532503] ? context_tracking_user_exit (kernel/context_tracking.c:184)
[  360.533744] ? __this_cpu_preempt_check (lib/smp_processor_id.c:63)
[  360.535001] ? trace_hardirqs_off_caller (kernel/locking/lockdep.c:2639 (discriminator 8))
[  360.536262] ? trace_hardirqs_off (kernel/locking/lockdep.c:2645)
[  360.537360] trace_do_page_fault (arch/x86/mm/fault.c:1314 include/linux/jump_label.h:115 include/linux/context_tracking_state.h:27 include/linux/context_tracking.h:45 arch/x86/mm/fault.c:1315)
[  360.538493] do_async_page_fault (arch/x86/kernel/kvm.c:279)
[  360.539567] async_page_fault (arch/x86/kernel/entry_64.S:1321)
[  360.540668] Read of size 8 by thread T9251:
[  360.541485] Memory state around the buggy address:
[  360.542463]  ffff880581ee3d00: ff ff ff ff ff ff ff ff ff ff ff ff ff ff ff ff
[  360.543816]  ffff880581ee3d80: ff ff ff ff ff ff ff ff ff ff ff ff ff ff ff ff
[  360.545116]  ffff880581ee3e00: ff ff ff ff ff ff ff ff ff ff ff ff ff ff ff ff
[  360.546476]  ffff880581ee3e80: ff ff ff ff ff ff ff ff ff ff ff ff ff ff ff ff
[  360.547806]  ffff880581ee3f00: ff ff ff ff ff ff ff ff ff ff ff ff ff ff ff ff
[  360.549112] >ffff880581ee3f80: ff ff ff ff ff ff ff ff ff ff ff ff ff ff ff ff
[  360.550507]                                                  ^
[  360.551574]  ffff880581ee4000: 00 00 00 00 00 00 00 00 00 00 00 00 00 00 00 00
[  360.552910]  ffff880581ee4080: 00 00 00 00 00 00 00 00 00 00 00 00 00 00 00 00
[  360.554207]  ffff880581ee4100: 00 00 00 00 00 00 00 00 00 00 00 00 00 00 00 00
[  360.555621]  ffff880581ee4180: 00 00 00 00 00 00 00 00 00 00 00 00 00 00 00 00
[  360.557035]  ffff880581ee4200: 00 00 00 00 00 00 00 00 00 00 00 00 00 00 00 00
[  360.558489] ==================================================================
[  360.559804] BUG: unable to handle kernel paging request at ffff880581ee3fd0
[  360.559817] IP: do_read_fault.isra.40 (mm/memory.c:2864 mm/memory.c:2898)
[  360.559827] PGD 147b9067 PUD 70353d067 PMD 70352d067 PTE 8000000581ee3060
[  360.559836] Oops: 0000 [#1] PREEMPT SMP DEBUG_PAGEALLOC
[  360.559892] Dumping ftrace buffer:
[  360.560117]    (ftrace buffer empty)
[  360.560132] Modules linked in:
[  360.560145] CPU: 8 PID: 9251 Comm: trinity-c159 Not tainted 3.16.0-rc6-next-20140723-sasha-00047-g289342b-dirty #929
[  360.560154] task: ffff8801ba84b000 ti: ffff8801ba8b8000 task.ti: ffff8801ba8b8000
[  360.560172] RIP: do_read_fault.isra.40 (mm/memory.c:2864 mm/memory.c:2898)
[  360.560180] RSP: 0000:ffff8801ba8bbc98  EFLAGS: 00010296
[  360.560186] RAX: 0000000000000000 RBX: ffff8801ba8d5a00 RCX: 0000000000000006
[  360.560192] RDX: 0000000000000006 RSI: ffffffff912f5fd5 RDI: 0000000000000282
[  360.560199] RBP: ffff8801ba8bbd58 R08: 0000000000000001 R09: 0000000000000000
[  360.560206] R10: 0000000000000000 R11: 0000000000000000 R12: ffff8801ba8ad470
[  360.560213] R13: 00007f2411bfa000 R14: 0000000000000000 R15: ffff880581ee3fd0
[  360.560222] FS:  00007f2412de6700(0000) GS:ffff8801de000000(0000) knlGS:0000000000000000
[  360.560228] CS:  0010 DS: 0000 ES: 0000 CR0: 0000000080050033
[  360.560234] CR2: ffff880581ee3fd0 CR3: 00000001ba89b000 CR4: 00000000000006a0
[  360.560255] DR0: 00000000006ec000 DR1: 0000000000000000 DR2: 0000000000000000
[  360.560262] DR3: 0000000000000000 DR6: 00000000ffff0ff0 DR7: 00000000000d0602
[  360.560264] Stack:
[  360.560278]  0000000000000001 0000000000000000 ffff880581ee4030 ffff880592deaa30
[  360.560290]  0000000000000005 00007f2411c07000 ffff8801ba8d5a90 ffff880592deaa30
[  360.560297]  0000000000000000 000000a8ba84b038 000000000000000c 00007f2411c06220
[  360.560299] Call Trace:
[  360.560317] ? __rcu_read_unlock (kernel/rcu/update.c:101)
[  360.560332] handle_mm_fault (mm/memory.c:3092 mm/memory.c:3225 mm/memory.c:3345 mm/memory.c:3374)
[  360.560345] ? vmacache_update (mm/vmacache.c:61)
[  360.560358] ? find_vma (mm/mmap.c:2027)
[  360.560370] __do_page_fault (arch/x86/mm/fault.c:1231)
[  360.560386] ? context_tracking_user_exit (kernel/context_tracking.c:184)
[  360.560400] ? __this_cpu_preempt_check (lib/smp_processor_id.c:63)
[  360.560413] ? trace_hardirqs_off_caller (kernel/locking/lockdep.c:2639 (discriminator 8))
[  360.560425] ? trace_hardirqs_off (kernel/locking/lockdep.c:2645)
[  360.560432] trace_do_page_fault (arch/x86/mm/fault.c:1314 include/linux/jump_label.h:115 include/linux/context_tracking_state.h:27 include/linux/context_tracking.h:45 arch/x86/mm/fault.c:1315)
[  360.560440] do_async_page_fault (arch/x86/kernel/kvm.c:279)
[  360.560447] async_page_fault (arch/x86/kernel/entry_64.S:1321)
[ 360.560458] Code: 47 f9 48 8b 8d 68 ff ff ff 48 29 f1 48 c1 e9 0c 49 8d 44 08 ff 48 39 c7 48 0f 46 c7 4c 89 ff 48 89 85 60 ff ff ff e8 3e 19 07 00 <49> 83 3f 00 0f 84 8f 00 00 00 49 83 c6 01 4c 39 b5 60 ff ff ff
All code
========
   0:	47 f9                	rex.RXB stc
   2:	48 8b 8d 68 ff ff ff 	mov    -0x98(%rbp),%rcx
   9:	48 29 f1             	sub    %rsi,%rcx
   c:	48 c1 e9 0c          	shr    $0xc,%rcx
  10:	49 8d 44 08 ff       	lea    -0x1(%r8,%rcx,1),%rax
  15:	48 39 c7             	cmp    %rax,%rdi
  18:	48 0f 46 c7          	cmovbe %rdi,%rax
  1c:	4c 89 ff             	mov    %r15,%rdi
  1f:	48 89 85 60 ff ff ff 	mov    %rax,-0xa0(%rbp)
  26:	e8 3e 19 07 00       	callq  0x71969
  2b:*	49 83 3f 00          	cmpq   $0x0,(%r15)		<-- trapping instruction
  2f:	0f 84 8f 00 00 00    	je     0xc4
  35:	49 83 c6 01          	add    $0x1,%r14
  39:	4c 39 b5 60 ff ff ff 	cmp    %r14,-0xa0(%rbp)
	...

Code starting with the faulting instruction
===========================================
   0:	49 83 3f 00          	cmpq   $0x0,(%r15)
   4:	0f 84 8f 00 00 00    	je     0x99
   a:	49 83 c6 01          	add    $0x1,%r14
   e:	4c 39 b5 60 ff ff ff 	cmp    %r14,-0xa0(%rbp)
	...
[  360.560458] RIP do_read_fault.isra.40 (mm/memory.c:2864 mm/memory.c:2898)
[  360.560458]  RSP <ffff8801ba8bbc98>
[  360.560458] CR2: ffff880581ee3fd0
[  360.560458] ---[ end trace ccd7cee352be7945 ]---

Thanks,
Sasha

--
To unsubscribe, send a message with 'unsubscribe linux-mm' in
the body to majordomo@kvack.org.  For more info on Linux MM,
see: http://www.linux-mm.org/ .
Don't email: <a href=mailto:"dont@kvack.org"> email@kvack.org </a>

^ permalink raw reply	[flat|nested] 73+ messages in thread

* Re: [PATCHv3 1/2] mm: introduce vm_ops->map_pages()
  2014-07-24 13:05       ` Sasha Levin
@ 2014-07-24 13:30         ` Konstantin Khlebnikov
  -1 siblings, 0 replies; 73+ messages in thread
From: Konstantin Khlebnikov @ 2014-07-24 13:30 UTC (permalink / raw)
  To: Sasha Levin
  Cc: Kirill A. Shutemov, Linus Torvalds, Andrew Morton, Mel Gorman,
	Rik van Riel, Andi Kleen, Matthew Wilcox, Dave Hansen,
	Alexander Viro, Dave Chinner, Ning Qu, linux-mm, linux-fsdevel,
	Linux Kernel Mailing List, Dave Jones, Andrey Ryabinin

On Thu, Jul 24, 2014 at 5:05 PM, Sasha Levin <sasha.levin@oracle.com> wrote:
> On 07/23/2014 11:33 PM, Sasha Levin wrote:
>> On 02/27/2014 02:53 PM, Kirill A. Shutemov wrote:
>>> > The patch introduces new vm_ops callback ->map_pages() and uses it for
>>> > mapping easy accessible pages around fault address.
>>> >
>>> > On read page fault, if filesystem provides ->map_pages(), we try to map
>>> > up to FAULT_AROUND_PAGES pages around page fault address in hope to
>>> > reduce number of minor page faults.
>>> >
>>> > We call ->map_pages first and use ->fault() as fallback if page by the
>>> > offset is not ready to be mapped (cold page cache or something).
>>> >
>>> > Signed-off-by: Kirill A. Shutemov <kirill.shutemov@linux.intel.com>
>>> > ---
>> Hi all,
>>
>> This patch triggers use-after-free when fuzzing using trinity and the KASAN
>> patchset.
>
> FWIW, if it helps, here's another KASAN report with the conventional BUG following it:

I haven't looked into the code yet, but this patch has a potential to
fix this bug.
https://lkml.org/lkml/2014/7/15/159

This pgoff-mess still looks suspicious for me.

>
> [  360.498001] ==================================================================
> [  360.500896] AddressSanitizer: use after free in do_read_fault.isra.40+0x3c2/0x510 at addr ffff880581ee3fd0
> [  360.504474] page:ffffea001607b8c0 count:0 mapcount:0 mapping:          (null) index:0x0
> [  360.507264] page flags: 0xefffff80000000()
> [  360.508655] page dumped because: kasan error
> [  360.509489] CPU: 8 PID: 9251 Comm: trinity-c159 Not tainted 3.16.0-rc6-next-20140723-sasha-00047-g289342b-dirty #929
> [  360.511717]  00000000000000ff 0000000000000000 ffffea001607b8c0 ffff8801ba8bbb98
> [  360.513272]  ffffffff8fe40903 ffff8801ba8bbc68 ffff8801ba8bbc58 ffffffff8b42acfc
> [  360.514729]  0000000000000001 ffff880592deaa48 ffff8801ba84b038 ffff8801ba8bbbd0
> [  360.516156] Call Trace:
> [  360.516622] dump_stack (lib/dump_stack.c:52)
> [  360.517566] kasan_report_error (mm/kasan/report.c:98 mm/kasan/report.c:166)
> [  360.518745] ? debug_smp_processor_id (lib/smp_processor_id.c:57)
> [  360.519923] ? preempt_count_sub (kernel/sched/core.c:2606)
> [  360.521124] ? put_lock_stats.isra.13 (./arch/x86/include/asm/preempt.h:98 kernel/locking/lockdep.c:254)
> [  360.522431] ? do_read_fault.isra.40 (mm/memory.c:2784 mm/memory.c:2849 mm/memory.c:2898)
> [  360.523651] __asan_load8 (mm/kasan/kasan.c:364)
> [  360.524625] ? do_read_fault.isra.40 (mm/memory.c:2864 mm/memory.c:2898)
> [  360.525887] do_read_fault.isra.40 (mm/memory.c:2864 mm/memory.c:2898)
> [  360.527156] ? __rcu_read_unlock (kernel/rcu/update.c:101)
> [  360.528251] handle_mm_fault (mm/memory.c:3092 mm/memory.c:3225 mm/memory.c:3345 mm/memory.c:3374)
> [  360.529308] ? vmacache_update (mm/vmacache.c:61)
> [  360.530505] ? find_vma (mm/mmap.c:2027)
> [  360.531453] __do_page_fault (arch/x86/mm/fault.c:1231)
> [  360.532503] ? context_tracking_user_exit (kernel/context_tracking.c:184)
> [  360.533744] ? __this_cpu_preempt_check (lib/smp_processor_id.c:63)
> [  360.535001] ? trace_hardirqs_off_caller (kernel/locking/lockdep.c:2639 (discriminator 8))
> [  360.536262] ? trace_hardirqs_off (kernel/locking/lockdep.c:2645)
> [  360.537360] trace_do_page_fault (arch/x86/mm/fault.c:1314 include/linux/jump_label.h:115 include/linux/context_tracking_state.h:27 include/linux/context_tracking.h:45 arch/x86/mm/fault.c:1315)
> [  360.538493] do_async_page_fault (arch/x86/kernel/kvm.c:279)
> [  360.539567] async_page_fault (arch/x86/kernel/entry_64.S:1321)
> [  360.540668] Read of size 8 by thread T9251:
> [  360.541485] Memory state around the buggy address:
> [  360.542463]  ffff880581ee3d00: ff ff ff ff ff ff ff ff ff ff ff ff ff ff ff ff
> [  360.543816]  ffff880581ee3d80: ff ff ff ff ff ff ff ff ff ff ff ff ff ff ff ff
> [  360.545116]  ffff880581ee3e00: ff ff ff ff ff ff ff ff ff ff ff ff ff ff ff ff
> [  360.546476]  ffff880581ee3e80: ff ff ff ff ff ff ff ff ff ff ff ff ff ff ff ff
> [  360.547806]  ffff880581ee3f00: ff ff ff ff ff ff ff ff ff ff ff ff ff ff ff ff
> [  360.549112] >ffff880581ee3f80: ff ff ff ff ff ff ff ff ff ff ff ff ff ff ff ff
> [  360.550507]                                                  ^
> [  360.551574]  ffff880581ee4000: 00 00 00 00 00 00 00 00 00 00 00 00 00 00 00 00
> [  360.552910]  ffff880581ee4080: 00 00 00 00 00 00 00 00 00 00 00 00 00 00 00 00
> [  360.554207]  ffff880581ee4100: 00 00 00 00 00 00 00 00 00 00 00 00 00 00 00 00
> [  360.555621]  ffff880581ee4180: 00 00 00 00 00 00 00 00 00 00 00 00 00 00 00 00
> [  360.557035]  ffff880581ee4200: 00 00 00 00 00 00 00 00 00 00 00 00 00 00 00 00
> [  360.558489] ==================================================================
> [  360.559804] BUG: unable to handle kernel paging request at ffff880581ee3fd0
> [  360.559817] IP: do_read_fault.isra.40 (mm/memory.c:2864 mm/memory.c:2898)
> [  360.559827] PGD 147b9067 PUD 70353d067 PMD 70352d067 PTE 8000000581ee3060
> [  360.559836] Oops: 0000 [#1] PREEMPT SMP DEBUG_PAGEALLOC
> [  360.559892] Dumping ftrace buffer:
> [  360.560117]    (ftrace buffer empty)
> [  360.560132] Modules linked in:
> [  360.560145] CPU: 8 PID: 9251 Comm: trinity-c159 Not tainted 3.16.0-rc6-next-20140723-sasha-00047-g289342b-dirty #929
> [  360.560154] task: ffff8801ba84b000 ti: ffff8801ba8b8000 task.ti: ffff8801ba8b8000
> [  360.560172] RIP: do_read_fault.isra.40 (mm/memory.c:2864 mm/memory.c:2898)
> [  360.560180] RSP: 0000:ffff8801ba8bbc98  EFLAGS: 00010296
> [  360.560186] RAX: 0000000000000000 RBX: ffff8801ba8d5a00 RCX: 0000000000000006
> [  360.560192] RDX: 0000000000000006 RSI: ffffffff912f5fd5 RDI: 0000000000000282
> [  360.560199] RBP: ffff8801ba8bbd58 R08: 0000000000000001 R09: 0000000000000000
> [  360.560206] R10: 0000000000000000 R11: 0000000000000000 R12: ffff8801ba8ad470
> [  360.560213] R13: 00007f2411bfa000 R14: 0000000000000000 R15: ffff880581ee3fd0
> [  360.560222] FS:  00007f2412de6700(0000) GS:ffff8801de000000(0000) knlGS:0000000000000000
> [  360.560228] CS:  0010 DS: 0000 ES: 0000 CR0: 0000000080050033
> [  360.560234] CR2: ffff880581ee3fd0 CR3: 00000001ba89b000 CR4: 00000000000006a0
> [  360.560255] DR0: 00000000006ec000 DR1: 0000000000000000 DR2: 0000000000000000
> [  360.560262] DR3: 0000000000000000 DR6: 00000000ffff0ff0 DR7: 00000000000d0602
> [  360.560264] Stack:
> [  360.560278]  0000000000000001 0000000000000000 ffff880581ee4030 ffff880592deaa30
> [  360.560290]  0000000000000005 00007f2411c07000 ffff8801ba8d5a90 ffff880592deaa30
> [  360.560297]  0000000000000000 000000a8ba84b038 000000000000000c 00007f2411c06220
> [  360.560299] Call Trace:
> [  360.560317] ? __rcu_read_unlock (kernel/rcu/update.c:101)
> [  360.560332] handle_mm_fault (mm/memory.c:3092 mm/memory.c:3225 mm/memory.c:3345 mm/memory.c:3374)
> [  360.560345] ? vmacache_update (mm/vmacache.c:61)
> [  360.560358] ? find_vma (mm/mmap.c:2027)
> [  360.560370] __do_page_fault (arch/x86/mm/fault.c:1231)
> [  360.560386] ? context_tracking_user_exit (kernel/context_tracking.c:184)
> [  360.560400] ? __this_cpu_preempt_check (lib/smp_processor_id.c:63)
> [  360.560413] ? trace_hardirqs_off_caller (kernel/locking/lockdep.c:2639 (discriminator 8))
> [  360.560425] ? trace_hardirqs_off (kernel/locking/lockdep.c:2645)
> [  360.560432] trace_do_page_fault (arch/x86/mm/fault.c:1314 include/linux/jump_label.h:115 include/linux/context_tracking_state.h:27 include/linux/context_tracking.h:45 arch/x86/mm/fault.c:1315)
> [  360.560440] do_async_page_fault (arch/x86/kernel/kvm.c:279)
> [  360.560447] async_page_fault (arch/x86/kernel/entry_64.S:1321)
> [ 360.560458] Code: 47 f9 48 8b 8d 68 ff ff ff 48 29 f1 48 c1 e9 0c 49 8d 44 08 ff 48 39 c7 48 0f 46 c7 4c 89 ff 48 89 85 60 ff ff ff e8 3e 19 07 00 <49> 83 3f 00 0f 84 8f 00 00 00 49 83 c6 01 4c 39 b5 60 ff ff ff
> All code
> ========
>    0:   47 f9                   rex.RXB stc
>    2:   48 8b 8d 68 ff ff ff    mov    -0x98(%rbp),%rcx
>    9:   48 29 f1                sub    %rsi,%rcx
>    c:   48 c1 e9 0c             shr    $0xc,%rcx
>   10:   49 8d 44 08 ff          lea    -0x1(%r8,%rcx,1),%rax
>   15:   48 39 c7                cmp    %rax,%rdi
>   18:   48 0f 46 c7             cmovbe %rdi,%rax
>   1c:   4c 89 ff                mov    %r15,%rdi
>   1f:   48 89 85 60 ff ff ff    mov    %rax,-0xa0(%rbp)
>   26:   e8 3e 19 07 00          callq  0x71969
>   2b:*  49 83 3f 00             cmpq   $0x0,(%r15)              <-- trapping instruction
>   2f:   0f 84 8f 00 00 00       je     0xc4
>   35:   49 83 c6 01             add    $0x1,%r14
>   39:   4c 39 b5 60 ff ff ff    cmp    %r14,-0xa0(%rbp)
>         ...
>
> Code starting with the faulting instruction
> ===========================================
>    0:   49 83 3f 00             cmpq   $0x0,(%r15)
>    4:   0f 84 8f 00 00 00       je     0x99
>    a:   49 83 c6 01             add    $0x1,%r14
>    e:   4c 39 b5 60 ff ff ff    cmp    %r14,-0xa0(%rbp)
>         ...
> [  360.560458] RIP do_read_fault.isra.40 (mm/memory.c:2864 mm/memory.c:2898)
> [  360.560458]  RSP <ffff8801ba8bbc98>
> [  360.560458] CR2: ffff880581ee3fd0
> [  360.560458] ---[ end trace ccd7cee352be7945 ]---
>
> Thanks,
> Sasha
>
> --
> To unsubscribe, send a message with 'unsubscribe linux-mm' in
> the body to majordomo@kvack.org.  For more info on Linux MM,
> see: http://www.linux-mm.org/ .
> Don't email: <a href=mailto:"dont@kvack.org"> email@kvack.org </a>

^ permalink raw reply	[flat|nested] 73+ messages in thread

* Re: [PATCHv3 1/2] mm: introduce vm_ops->map_pages()
@ 2014-07-24 13:30         ` Konstantin Khlebnikov
  0 siblings, 0 replies; 73+ messages in thread
From: Konstantin Khlebnikov @ 2014-07-24 13:30 UTC (permalink / raw)
  To: Sasha Levin
  Cc: Kirill A. Shutemov, Linus Torvalds, Andrew Morton, Mel Gorman,
	Rik van Riel, Andi Kleen, Matthew Wilcox, Dave Hansen,
	Alexander Viro, Dave Chinner, Ning Qu, linux-mm, linux-fsdevel,
	Linux Kernel Mailing List, Dave Jones, Andrey Ryabinin

On Thu, Jul 24, 2014 at 5:05 PM, Sasha Levin <sasha.levin@oracle.com> wrote:
> On 07/23/2014 11:33 PM, Sasha Levin wrote:
>> On 02/27/2014 02:53 PM, Kirill A. Shutemov wrote:
>>> > The patch introduces new vm_ops callback ->map_pages() and uses it for
>>> > mapping easy accessible pages around fault address.
>>> >
>>> > On read page fault, if filesystem provides ->map_pages(), we try to map
>>> > up to FAULT_AROUND_PAGES pages around page fault address in hope to
>>> > reduce number of minor page faults.
>>> >
>>> > We call ->map_pages first and use ->fault() as fallback if page by the
>>> > offset is not ready to be mapped (cold page cache or something).
>>> >
>>> > Signed-off-by: Kirill A. Shutemov <kirill.shutemov@linux.intel.com>
>>> > ---
>> Hi all,
>>
>> This patch triggers use-after-free when fuzzing using trinity and the KASAN
>> patchset.
>
> FWIW, if it helps, here's another KASAN report with the conventional BUG following it:

I haven't looked into the code yet, but this patch has a potential to
fix this bug.
https://lkml.org/lkml/2014/7/15/159

This pgoff-mess still looks suspicious for me.

>
> [  360.498001] ==================================================================
> [  360.500896] AddressSanitizer: use after free in do_read_fault.isra.40+0x3c2/0x510 at addr ffff880581ee3fd0
> [  360.504474] page:ffffea001607b8c0 count:0 mapcount:0 mapping:          (null) index:0x0
> [  360.507264] page flags: 0xefffff80000000()
> [  360.508655] page dumped because: kasan error
> [  360.509489] CPU: 8 PID: 9251 Comm: trinity-c159 Not tainted 3.16.0-rc6-next-20140723-sasha-00047-g289342b-dirty #929
> [  360.511717]  00000000000000ff 0000000000000000 ffffea001607b8c0 ffff8801ba8bbb98
> [  360.513272]  ffffffff8fe40903 ffff8801ba8bbc68 ffff8801ba8bbc58 ffffffff8b42acfc
> [  360.514729]  0000000000000001 ffff880592deaa48 ffff8801ba84b038 ffff8801ba8bbbd0
> [  360.516156] Call Trace:
> [  360.516622] dump_stack (lib/dump_stack.c:52)
> [  360.517566] kasan_report_error (mm/kasan/report.c:98 mm/kasan/report.c:166)
> [  360.518745] ? debug_smp_processor_id (lib/smp_processor_id.c:57)
> [  360.519923] ? preempt_count_sub (kernel/sched/core.c:2606)
> [  360.521124] ? put_lock_stats.isra.13 (./arch/x86/include/asm/preempt.h:98 kernel/locking/lockdep.c:254)
> [  360.522431] ? do_read_fault.isra.40 (mm/memory.c:2784 mm/memory.c:2849 mm/memory.c:2898)
> [  360.523651] __asan_load8 (mm/kasan/kasan.c:364)
> [  360.524625] ? do_read_fault.isra.40 (mm/memory.c:2864 mm/memory.c:2898)
> [  360.525887] do_read_fault.isra.40 (mm/memory.c:2864 mm/memory.c:2898)
> [  360.527156] ? __rcu_read_unlock (kernel/rcu/update.c:101)
> [  360.528251] handle_mm_fault (mm/memory.c:3092 mm/memory.c:3225 mm/memory.c:3345 mm/memory.c:3374)
> [  360.529308] ? vmacache_update (mm/vmacache.c:61)
> [  360.530505] ? find_vma (mm/mmap.c:2027)
> [  360.531453] __do_page_fault (arch/x86/mm/fault.c:1231)
> [  360.532503] ? context_tracking_user_exit (kernel/context_tracking.c:184)
> [  360.533744] ? __this_cpu_preempt_check (lib/smp_processor_id.c:63)
> [  360.535001] ? trace_hardirqs_off_caller (kernel/locking/lockdep.c:2639 (discriminator 8))
> [  360.536262] ? trace_hardirqs_off (kernel/locking/lockdep.c:2645)
> [  360.537360] trace_do_page_fault (arch/x86/mm/fault.c:1314 include/linux/jump_label.h:115 include/linux/context_tracking_state.h:27 include/linux/context_tracking.h:45 arch/x86/mm/fault.c:1315)
> [  360.538493] do_async_page_fault (arch/x86/kernel/kvm.c:279)
> [  360.539567] async_page_fault (arch/x86/kernel/entry_64.S:1321)
> [  360.540668] Read of size 8 by thread T9251:
> [  360.541485] Memory state around the buggy address:
> [  360.542463]  ffff880581ee3d00: ff ff ff ff ff ff ff ff ff ff ff ff ff ff ff ff
> [  360.543816]  ffff880581ee3d80: ff ff ff ff ff ff ff ff ff ff ff ff ff ff ff ff
> [  360.545116]  ffff880581ee3e00: ff ff ff ff ff ff ff ff ff ff ff ff ff ff ff ff
> [  360.546476]  ffff880581ee3e80: ff ff ff ff ff ff ff ff ff ff ff ff ff ff ff ff
> [  360.547806]  ffff880581ee3f00: ff ff ff ff ff ff ff ff ff ff ff ff ff ff ff ff
> [  360.549112] >ffff880581ee3f80: ff ff ff ff ff ff ff ff ff ff ff ff ff ff ff ff
> [  360.550507]                                                  ^
> [  360.551574]  ffff880581ee4000: 00 00 00 00 00 00 00 00 00 00 00 00 00 00 00 00
> [  360.552910]  ffff880581ee4080: 00 00 00 00 00 00 00 00 00 00 00 00 00 00 00 00
> [  360.554207]  ffff880581ee4100: 00 00 00 00 00 00 00 00 00 00 00 00 00 00 00 00
> [  360.555621]  ffff880581ee4180: 00 00 00 00 00 00 00 00 00 00 00 00 00 00 00 00
> [  360.557035]  ffff880581ee4200: 00 00 00 00 00 00 00 00 00 00 00 00 00 00 00 00
> [  360.558489] ==================================================================
> [  360.559804] BUG: unable to handle kernel paging request at ffff880581ee3fd0
> [  360.559817] IP: do_read_fault.isra.40 (mm/memory.c:2864 mm/memory.c:2898)
> [  360.559827] PGD 147b9067 PUD 70353d067 PMD 70352d067 PTE 8000000581ee3060
> [  360.559836] Oops: 0000 [#1] PREEMPT SMP DEBUG_PAGEALLOC
> [  360.559892] Dumping ftrace buffer:
> [  360.560117]    (ftrace buffer empty)
> [  360.560132] Modules linked in:
> [  360.560145] CPU: 8 PID: 9251 Comm: trinity-c159 Not tainted 3.16.0-rc6-next-20140723-sasha-00047-g289342b-dirty #929
> [  360.560154] task: ffff8801ba84b000 ti: ffff8801ba8b8000 task.ti: ffff8801ba8b8000
> [  360.560172] RIP: do_read_fault.isra.40 (mm/memory.c:2864 mm/memory.c:2898)
> [  360.560180] RSP: 0000:ffff8801ba8bbc98  EFLAGS: 00010296
> [  360.560186] RAX: 0000000000000000 RBX: ffff8801ba8d5a00 RCX: 0000000000000006
> [  360.560192] RDX: 0000000000000006 RSI: ffffffff912f5fd5 RDI: 0000000000000282
> [  360.560199] RBP: ffff8801ba8bbd58 R08: 0000000000000001 R09: 0000000000000000
> [  360.560206] R10: 0000000000000000 R11: 0000000000000000 R12: ffff8801ba8ad470
> [  360.560213] R13: 00007f2411bfa000 R14: 0000000000000000 R15: ffff880581ee3fd0
> [  360.560222] FS:  00007f2412de6700(0000) GS:ffff8801de000000(0000) knlGS:0000000000000000
> [  360.560228] CS:  0010 DS: 0000 ES: 0000 CR0: 0000000080050033
> [  360.560234] CR2: ffff880581ee3fd0 CR3: 00000001ba89b000 CR4: 00000000000006a0
> [  360.560255] DR0: 00000000006ec000 DR1: 0000000000000000 DR2: 0000000000000000
> [  360.560262] DR3: 0000000000000000 DR6: 00000000ffff0ff0 DR7: 00000000000d0602
> [  360.560264] Stack:
> [  360.560278]  0000000000000001 0000000000000000 ffff880581ee4030 ffff880592deaa30
> [  360.560290]  0000000000000005 00007f2411c07000 ffff8801ba8d5a90 ffff880592deaa30
> [  360.560297]  0000000000000000 000000a8ba84b038 000000000000000c 00007f2411c06220
> [  360.560299] Call Trace:
> [  360.560317] ? __rcu_read_unlock (kernel/rcu/update.c:101)
> [  360.560332] handle_mm_fault (mm/memory.c:3092 mm/memory.c:3225 mm/memory.c:3345 mm/memory.c:3374)
> [  360.560345] ? vmacache_update (mm/vmacache.c:61)
> [  360.560358] ? find_vma (mm/mmap.c:2027)
> [  360.560370] __do_page_fault (arch/x86/mm/fault.c:1231)
> [  360.560386] ? context_tracking_user_exit (kernel/context_tracking.c:184)
> [  360.560400] ? __this_cpu_preempt_check (lib/smp_processor_id.c:63)
> [  360.560413] ? trace_hardirqs_off_caller (kernel/locking/lockdep.c:2639 (discriminator 8))
> [  360.560425] ? trace_hardirqs_off (kernel/locking/lockdep.c:2645)
> [  360.560432] trace_do_page_fault (arch/x86/mm/fault.c:1314 include/linux/jump_label.h:115 include/linux/context_tracking_state.h:27 include/linux/context_tracking.h:45 arch/x86/mm/fault.c:1315)
> [  360.560440] do_async_page_fault (arch/x86/kernel/kvm.c:279)
> [  360.560447] async_page_fault (arch/x86/kernel/entry_64.S:1321)
> [ 360.560458] Code: 47 f9 48 8b 8d 68 ff ff ff 48 29 f1 48 c1 e9 0c 49 8d 44 08 ff 48 39 c7 48 0f 46 c7 4c 89 ff 48 89 85 60 ff ff ff e8 3e 19 07 00 <49> 83 3f 00 0f 84 8f 00 00 00 49 83 c6 01 4c 39 b5 60 ff ff ff
> All code
> ========
>    0:   47 f9                   rex.RXB stc
>    2:   48 8b 8d 68 ff ff ff    mov    -0x98(%rbp),%rcx
>    9:   48 29 f1                sub    %rsi,%rcx
>    c:   48 c1 e9 0c             shr    $0xc,%rcx
>   10:   49 8d 44 08 ff          lea    -0x1(%r8,%rcx,1),%rax
>   15:   48 39 c7                cmp    %rax,%rdi
>   18:   48 0f 46 c7             cmovbe %rdi,%rax
>   1c:   4c 89 ff                mov    %r15,%rdi
>   1f:   48 89 85 60 ff ff ff    mov    %rax,-0xa0(%rbp)
>   26:   e8 3e 19 07 00          callq  0x71969
>   2b:*  49 83 3f 00             cmpq   $0x0,(%r15)              <-- trapping instruction
>   2f:   0f 84 8f 00 00 00       je     0xc4
>   35:   49 83 c6 01             add    $0x1,%r14
>   39:   4c 39 b5 60 ff ff ff    cmp    %r14,-0xa0(%rbp)
>         ...
>
> Code starting with the faulting instruction
> ===========================================
>    0:   49 83 3f 00             cmpq   $0x0,(%r15)
>    4:   0f 84 8f 00 00 00       je     0x99
>    a:   49 83 c6 01             add    $0x1,%r14
>    e:   4c 39 b5 60 ff ff ff    cmp    %r14,-0xa0(%rbp)
>         ...
> [  360.560458] RIP do_read_fault.isra.40 (mm/memory.c:2864 mm/memory.c:2898)
> [  360.560458]  RSP <ffff8801ba8bbc98>
> [  360.560458] CR2: ffff880581ee3fd0
> [  360.560458] ---[ end trace ccd7cee352be7945 ]---
>
> Thanks,
> Sasha
>
> --
> To unsubscribe, send a message with 'unsubscribe linux-mm' in
> the body to majordomo@kvack.org.  For more info on Linux MM,
> see: http://www.linux-mm.org/ .
> Don't email: <a href=mailto:"dont@kvack.org"> email@kvack.org </a>

--
To unsubscribe, send a message with 'unsubscribe linux-mm' in
the body to majordomo@kvack.org.  For more info on Linux MM,
see: http://www.linux-mm.org/ .
Don't email: <a href=mailto:"dont@kvack.org"> email@kvack.org </a>

^ permalink raw reply	[flat|nested] 73+ messages in thread

* [PATCH] mm: don't allow fault_around_bytes to be 0
  2014-07-24  3:33     ` Sasha Levin
@ 2014-07-28  7:43       ` Andrey Ryabinin
  -1 siblings, 0 replies; 73+ messages in thread
From: Andrey Ryabinin @ 2014-07-28  7:43 UTC (permalink / raw)
  To: Andrew Morton
  Cc: Linus Torvalds, Andi Kleen, Matthew Wilcox, Dave Hansen,
	Alexander Viro, Dave Chinner, Ning Qu, linux-mm, linux-fsdevel,
	linux-kernel, Dave Jones, Andrey Ryabinin, stable,
	Kirill A. Shutemov, Mel Gorman, Rik van Riel,
	Konstantin Khlebnikov, Hugh Dickins

Sasha Levin triggered use-after-free when fuzzing using trinity and the KASAN
patchset:

	AddressSanitizer: use after free in do_read_fault.isra.40+0x3c2/0x510 at addr ffff88048a733110
	page:ffffea001229ccc0 count:0 mapcount:0 mapping:          (null) index:0x0
	page flags: 0xafffff80008000(tail)
	page dumped because: kasan error
	CPU: 6 PID: 9262 Comm: trinity-c104 Not tainted 3.16.0-rc6-next-20140723-sasha-00047-g289342b-dirty #929
	 00000000000000fb 0000000000000000 ffffea001229ccc0 ffff88038ac0fb78
	 ffffffffa5e40903 ffff88038ac0fc48 ffff88038ac0fc38 ffffffffa142acfc
	 0000000000000001 ffff880509ff5aa8 ffff88038ac10038 ffff88038ac0fbb0
	Call Trace:
	dump_stack (lib/dump_stack.c:52)
	kasan_report_error (mm/kasan/report.c:98 mm/kasan/report.c:166)
	? debug_smp_processor_id (lib/smp_processor_id.c:57)
	? preempt_count_sub (kernel/sched/core.c:2606)
	? put_lock_stats.isra.13 (./arch/x86/include/asm/preempt.h:98 kernel/locking/lockdep.c:254)
	? do_read_fault.isra.40 (mm/memory.c:2784 mm/memory.c:2849 mm/memory.c:2898)
	__asan_load8 (mm/kasan/kasan.c:364)
	? do_read_fault.isra.40 (mm/memory.c:2864 mm/memory.c:2898)
	do_read_fault.isra.40 (mm/memory.c:2864 mm/memory.c:2898)
	? _raw_spin_unlock (./arch/x86/include/asm/preempt.h:98 include/linux/spinlock_api_smp.h:152 kernel/locking/spinlock.c:183)
	? __pte_alloc (mm/memory.c:598)
	handle_mm_fault (mm/memory.c:3092 mm/memory.c:3225 mm/memory.c:3345 mm/memory.c:3374)
	? pud_huge (./arch/x86/include/asm/paravirt.h:611 arch/x86/mm/hugetlbpage.c:76)
	__get_user_pages (mm/gup.c:286 mm/gup.c:478)
	__mlock_vma_pages_range (mm/mlock.c:262)
	__mm_populate (mm/mlock.c:710)
	SyS_remap_file_pages (mm/mmap.c:2653 mm/mmap.c:2593)
	tracesys (arch/x86/kernel/entry_64.S:541)
	Read of size 8 by thread T9262:
	Memory state around the buggy address:
	 ffff88048a732e80: fb fb fb fb fb fb fb fb fb fb fb fb fb fb fb fb
	 ffff88048a732f00: fb fb fb fb fb fb fb fb fb fb fb fb fb fb fb fb
	 ffff88048a732f80: fb fb fb fb fb fb fb fb fb fb fb fb fb fb fb fb
	 ffff88048a733000: fb fb fb fb fb fb fb fb fb fb fb fb fb fb fb fb
	 ffff88048a733080: fb fb fb fb fb fb fb fb fb fb fb fb fb fb fb fb
	>ffff88048a733100: fb fb fb fb fb fb fb fb fb fb fb fb fb fb fb fb
	                         ^
	 ffff88048a733180: fb fb fb fb fb fb fb fb fb fb fb fb fb fb fb fb
	 ffff88048a733200: fb fb fb fb fb fb fb fb fb fb fb fb fb fb fb fb
	 ffff88048a733280: fb fb fb fb fb fb fb fb fb fb fb fb fb fb fb fb
	 ffff88048a733300: fb fb fb fb fb fb fb fb fb fb fb fb fb fb fb fb
	 ffff88048a733380: fb fb fb fb fb fb fb fb fb fb fb fb fb fb fb fb


It looks like that pte pointer is invalid in do_fault_around().
This could happen if fault_around_bytes is set to 0.
fault_around_pages() and fault_around_mask() calls rounddown_pow_of_to(fault_around_bytes)
The result of rounddown_pow_of_to is undefined if parameter == 0
(in my environment it returns 0x8000000000000000).

One way to fix this would be to return 0 from fault_around_pages() if fault_around_bytes == 0,
however this would add extra code on fault path.

So let's just forbid to set fault_around_bytes to zero.
Fault around is not used if fault_around_pages() <= 1, so if anyone doesn't want to use
it, fault_around_bytes could be set to any value in range [1, 2*PAGE_SIZE - 1]
instead of 0.

Fixes: a9b0f861("mm: nominate faultaround area in bytes rather than page order")
Signed-off-by: Andrey Ryabinin <a.ryabinin@samsung.com>
Reported-by: Sasha Levin <sasha.levin@oracle.com>
Cc: <stable@vger.kernel.org> # 3.15.x
Cc: "Kirill A. Shutemov" <kirill.shutemov@linux.intel.com>
Cc: Mel Gorman <mgorman@suse.de>
Cc: Rik van Riel <riel@redhat.com>
Cc: Dave Jones <davej@redhat.com>
Cc: Konstantin Khlebnikov <koct9i@gmail.com>
Cc: Hugh Dickins <hughd@google.com>
---
 mm/memory.c | 2 +-
 1 file changed, 1 insertion(+), 1 deletion(-)

diff --git a/mm/memory.c b/mm/memory.c
index 7e8d820..5927c42 100644
--- a/mm/memory.c
+++ b/mm/memory.c
@@ -2784,7 +2784,7 @@ static int fault_around_bytes_get(void *data, u64 *val)
 
 static int fault_around_bytes_set(void *data, u64 val)
 {
-	if (val / PAGE_SIZE > PTRS_PER_PTE)
+	if (!val || val / PAGE_SIZE > PTRS_PER_PTE)
 		return -EINVAL;
 	fault_around_bytes = val;
 	return 0;
-- 
1.8.5.5


^ permalink raw reply related	[flat|nested] 73+ messages in thread

* [PATCH] mm: don't allow fault_around_bytes to be 0
@ 2014-07-28  7:43       ` Andrey Ryabinin
  0 siblings, 0 replies; 73+ messages in thread
From: Andrey Ryabinin @ 2014-07-28  7:43 UTC (permalink / raw)
  To: Andrew Morton
  Cc: Linus Torvalds, Andi Kleen, Matthew Wilcox, Dave Hansen,
	Alexander Viro, Dave Chinner, Ning Qu, linux-mm, linux-fsdevel,
	linux-kernel, Dave Jones, Andrey Ryabinin, stable,
	Kirill A. Shutemov, Mel Gorman, Rik van Riel,
	Konstantin Khlebnikov, Hugh Dickins

Sasha Levin triggered use-after-free when fuzzing using trinity and the KASAN
patchset:

	AddressSanitizer: use after free in do_read_fault.isra.40+0x3c2/0x510 at addr ffff88048a733110
	page:ffffea001229ccc0 count:0 mapcount:0 mapping:          (null) index:0x0
	page flags: 0xafffff80008000(tail)
	page dumped because: kasan error
	CPU: 6 PID: 9262 Comm: trinity-c104 Not tainted 3.16.0-rc6-next-20140723-sasha-00047-g289342b-dirty #929
	 00000000000000fb 0000000000000000 ffffea001229ccc0 ffff88038ac0fb78
	 ffffffffa5e40903 ffff88038ac0fc48 ffff88038ac0fc38 ffffffffa142acfc
	 0000000000000001 ffff880509ff5aa8 ffff88038ac10038 ffff88038ac0fbb0
	Call Trace:
	dump_stack (lib/dump_stack.c:52)
	kasan_report_error (mm/kasan/report.c:98 mm/kasan/report.c:166)
	? debug_smp_processor_id (lib/smp_processor_id.c:57)
	? preempt_count_sub (kernel/sched/core.c:2606)
	? put_lock_stats.isra.13 (./arch/x86/include/asm/preempt.h:98 kernel/locking/lockdep.c:254)
	? do_read_fault.isra.40 (mm/memory.c:2784 mm/memory.c:2849 mm/memory.c:2898)
	__asan_load8 (mm/kasan/kasan.c:364)
	? do_read_fault.isra.40 (mm/memory.c:2864 mm/memory.c:2898)
	do_read_fault.isra.40 (mm/memory.c:2864 mm/memory.c:2898)
	? _raw_spin_unlock (./arch/x86/include/asm/preempt.h:98 include/linux/spinlock_api_smp.h:152 kernel/locking/spinlock.c:183)
	? __pte_alloc (mm/memory.c:598)
	handle_mm_fault (mm/memory.c:3092 mm/memory.c:3225 mm/memory.c:3345 mm/memory.c:3374)
	? pud_huge (./arch/x86/include/asm/paravirt.h:611 arch/x86/mm/hugetlbpage.c:76)
	__get_user_pages (mm/gup.c:286 mm/gup.c:478)
	__mlock_vma_pages_range (mm/mlock.c:262)
	__mm_populate (mm/mlock.c:710)
	SyS_remap_file_pages (mm/mmap.c:2653 mm/mmap.c:2593)
	tracesys (arch/x86/kernel/entry_64.S:541)
	Read of size 8 by thread T9262:
	Memory state around the buggy address:
	 ffff88048a732e80: fb fb fb fb fb fb fb fb fb fb fb fb fb fb fb fb
	 ffff88048a732f00: fb fb fb fb fb fb fb fb fb fb fb fb fb fb fb fb
	 ffff88048a732f80: fb fb fb fb fb fb fb fb fb fb fb fb fb fb fb fb
	 ffff88048a733000: fb fb fb fb fb fb fb fb fb fb fb fb fb fb fb fb
	 ffff88048a733080: fb fb fb fb fb fb fb fb fb fb fb fb fb fb fb fb
	>ffff88048a733100: fb fb fb fb fb fb fb fb fb fb fb fb fb fb fb fb
	                         ^
	 ffff88048a733180: fb fb fb fb fb fb fb fb fb fb fb fb fb fb fb fb
	 ffff88048a733200: fb fb fb fb fb fb fb fb fb fb fb fb fb fb fb fb
	 ffff88048a733280: fb fb fb fb fb fb fb fb fb fb fb fb fb fb fb fb
	 ffff88048a733300: fb fb fb fb fb fb fb fb fb fb fb fb fb fb fb fb
	 ffff88048a733380: fb fb fb fb fb fb fb fb fb fb fb fb fb fb fb fb


It looks like that pte pointer is invalid in do_fault_around().
This could happen if fault_around_bytes is set to 0.
fault_around_pages() and fault_around_mask() calls rounddown_pow_of_to(fault_around_bytes)
The result of rounddown_pow_of_to is undefined if parameter == 0
(in my environment it returns 0x8000000000000000).

One way to fix this would be to return 0 from fault_around_pages() if fault_around_bytes == 0,
however this would add extra code on fault path.

So let's just forbid to set fault_around_bytes to zero.
Fault around is not used if fault_around_pages() <= 1, so if anyone doesn't want to use
it, fault_around_bytes could be set to any value in range [1, 2*PAGE_SIZE - 1]
instead of 0.

Fixes: a9b0f861("mm: nominate faultaround area in bytes rather than page order")
Signed-off-by: Andrey Ryabinin <a.ryabinin@samsung.com>
Reported-by: Sasha Levin <sasha.levin@oracle.com>
Cc: <stable@vger.kernel.org> # 3.15.x
Cc: "Kirill A. Shutemov" <kirill.shutemov@linux.intel.com>
Cc: Mel Gorman <mgorman@suse.de>
Cc: Rik van Riel <riel@redhat.com>
Cc: Dave Jones <davej@redhat.com>
Cc: Konstantin Khlebnikov <koct9i@gmail.com>
Cc: Hugh Dickins <hughd@google.com>
---
 mm/memory.c | 2 +-
 1 file changed, 1 insertion(+), 1 deletion(-)

diff --git a/mm/memory.c b/mm/memory.c
index 7e8d820..5927c42 100644
--- a/mm/memory.c
+++ b/mm/memory.c
@@ -2784,7 +2784,7 @@ static int fault_around_bytes_get(void *data, u64 *val)
 
 static int fault_around_bytes_set(void *data, u64 val)
 {
-	if (val / PAGE_SIZE > PTRS_PER_PTE)
+	if (!val || val / PAGE_SIZE > PTRS_PER_PTE)
 		return -EINVAL;
 	fault_around_bytes = val;
 	return 0;
-- 
1.8.5.5

--
To unsubscribe, send a message with 'unsubscribe linux-mm' in
the body to majordomo@kvack.org.  For more info on Linux MM,
see: http://www.linux-mm.org/ .
Don't email: <a href=mailto:"dont@kvack.org"> email@kvack.org </a>

^ permalink raw reply related	[flat|nested] 73+ messages in thread

* Re: [PATCH] mm: don't allow fault_around_bytes to be 0
  2014-07-28  7:43       ` Andrey Ryabinin
@ 2014-07-28  7:47         ` Andrey Ryabinin
  -1 siblings, 0 replies; 73+ messages in thread
From: Andrey Ryabinin @ 2014-07-28  7:47 UTC (permalink / raw)
  To: Andrew Morton
  Cc: Linus Torvalds, Andi Kleen, Matthew Wilcox, Dave Hansen,
	Alexander Viro, Dave Chinner, Ning Qu, linux-mm, linux-fsdevel,
	linux-kernel, Dave Jones, stable, Kirill A. Shutemov, Mel Gorman,
	Rik van Riel, Konstantin Khlebnikov, Hugh Dickins, Sasha Levin

On 07/28/14 11:43, Andrey Ryabinin wrote:
> Sasha Levin triggered use-after-free when fuzzing using trinity and the KASAN
> patchset:
> 
> 	AddressSanitizer: use after free in do_read_fault.isra.40+0x3c2/0x510 at addr ffff88048a733110
> 	page:ffffea001229ccc0 count:0 mapcount:0 mapping:          (null) index:0x0
> 	page flags: 0xafffff80008000(tail)
> 	page dumped because: kasan error
> 	CPU: 6 PID: 9262 Comm: trinity-c104 Not tainted 3.16.0-rc6-next-20140723-sasha-00047-g289342b-dirty #929
> 	 00000000000000fb 0000000000000000 ffffea001229ccc0 ffff88038ac0fb78
> 	 ffffffffa5e40903 ffff88038ac0fc48 ffff88038ac0fc38 ffffffffa142acfc
> 	 0000000000000001 ffff880509ff5aa8 ffff88038ac10038 ffff88038ac0fbb0
> 	Call Trace:
> 	dump_stack (lib/dump_stack.c:52)
> 	kasan_report_error (mm/kasan/report.c:98 mm/kasan/report.c:166)
> 	? debug_smp_processor_id (lib/smp_processor_id.c:57)
> 	? preempt_count_sub (kernel/sched/core.c:2606)
> 	? put_lock_stats.isra.13 (./arch/x86/include/asm/preempt.h:98 kernel/locking/lockdep.c:254)
> 	? do_read_fault.isra.40 (mm/memory.c:2784 mm/memory.c:2849 mm/memory.c:2898)
> 	__asan_load8 (mm/kasan/kasan.c:364)
> 	? do_read_fault.isra.40 (mm/memory.c:2864 mm/memory.c:2898)
> 	do_read_fault.isra.40 (mm/memory.c:2864 mm/memory.c:2898)
> 	? _raw_spin_unlock (./arch/x86/include/asm/preempt.h:98 include/linux/spinlock_api_smp.h:152 kernel/locking/spinlock.c:183)
> 	? __pte_alloc (mm/memory.c:598)
> 	handle_mm_fault (mm/memory.c:3092 mm/memory.c:3225 mm/memory.c:3345 mm/memory.c:3374)
> 	? pud_huge (./arch/x86/include/asm/paravirt.h:611 arch/x86/mm/hugetlbpage.c:76)
> 	__get_user_pages (mm/gup.c:286 mm/gup.c:478)
> 	__mlock_vma_pages_range (mm/mlock.c:262)
> 	__mm_populate (mm/mlock.c:710)
> 	SyS_remap_file_pages (mm/mmap.c:2653 mm/mmap.c:2593)
> 	tracesys (arch/x86/kernel/entry_64.S:541)
> 	Read of size 8 by thread T9262:
> 	Memory state around the buggy address:
> 	 ffff88048a732e80: fb fb fb fb fb fb fb fb fb fb fb fb fb fb fb fb
> 	 ffff88048a732f00: fb fb fb fb fb fb fb fb fb fb fb fb fb fb fb fb
> 	 ffff88048a732f80: fb fb fb fb fb fb fb fb fb fb fb fb fb fb fb fb
> 	 ffff88048a733000: fb fb fb fb fb fb fb fb fb fb fb fb fb fb fb fb
> 	 ffff88048a733080: fb fb fb fb fb fb fb fb fb fb fb fb fb fb fb fb
> 	>ffff88048a733100: fb fb fb fb fb fb fb fb fb fb fb fb fb fb fb fb
> 	                         ^
> 	 ffff88048a733180: fb fb fb fb fb fb fb fb fb fb fb fb fb fb fb fb
> 	 ffff88048a733200: fb fb fb fb fb fb fb fb fb fb fb fb fb fb fb fb
> 	 ffff88048a733280: fb fb fb fb fb fb fb fb fb fb fb fb fb fb fb fb
> 	 ffff88048a733300: fb fb fb fb fb fb fb fb fb fb fb fb fb fb fb fb
> 	 ffff88048a733380: fb fb fb fb fb fb fb fb fb fb fb fb fb fb fb fb
> 
> 
> It looks like that pte pointer is invalid in do_fault_around().
> This could happen if fault_around_bytes is set to 0.
> fault_around_pages() and fault_around_mask() calls rounddown_pow_of_to(fault_around_bytes)
> The result of rounddown_pow_of_to is undefined if parameter == 0
> (in my environment it returns 0x8000000000000000).
> 
> One way to fix this would be to return 0 from fault_around_pages() if fault_around_bytes == 0,
> however this would add extra code on fault path.
> 
> So let's just forbid to set fault_around_bytes to zero.
> Fault around is not used if fault_around_pages() <= 1, so if anyone doesn't want to use
> it, fault_around_bytes could be set to any value in range [1, 2*PAGE_SIZE - 1]
> instead of 0.
> 
> Fixes: a9b0f861("mm: nominate faultaround area in bytes rather than page order")
> Signed-off-by: Andrey Ryabinin <a.ryabinin@samsung.com>
> Reported-by: Sasha Levin <sasha.levin@oracle.com>
> Cc: <stable@vger.kernel.org> # 3.15.x
> Cc: "Kirill A. Shutemov" <kirill.shutemov@linux.intel.com>
> Cc: Mel Gorman <mgorman@suse.de>
> Cc: Rik van Riel <riel@redhat.com>
> Cc: Dave Jones <davej@redhat.com>
> Cc: Konstantin Khlebnikov <koct9i@gmail.com>
> Cc: Hugh Dickins <hughd@google.com>
> ---
>  mm/memory.c | 2 +-
>  1 file changed, 1 insertion(+), 1 deletion(-)
> 
> diff --git a/mm/memory.c b/mm/memory.c
> index 7e8d820..5927c42 100644
> --- a/mm/memory.c
> +++ b/mm/memory.c
> @@ -2784,7 +2784,7 @@ static int fault_around_bytes_get(void *data, u64 *val)
>  
>  static int fault_around_bytes_set(void *data, u64 val)
>  {
> -	if (val / PAGE_SIZE > PTRS_PER_PTE)
> +	if (!val || val / PAGE_SIZE > PTRS_PER_PTE)
>  		return -EINVAL;
>  	fault_around_bytes = val;
>  	return 0;
> 

Adding Sasha to cc.

^ permalink raw reply	[flat|nested] 73+ messages in thread

* Re: [PATCH] mm: don't allow fault_around_bytes to be 0
@ 2014-07-28  7:47         ` Andrey Ryabinin
  0 siblings, 0 replies; 73+ messages in thread
From: Andrey Ryabinin @ 2014-07-28  7:47 UTC (permalink / raw)
  To: Andrew Morton
  Cc: Linus Torvalds, Andi Kleen, Matthew Wilcox, Dave Hansen,
	Alexander Viro, Dave Chinner, Ning Qu, linux-mm, linux-fsdevel,
	linux-kernel, Dave Jones, stable, Kirill A. Shutemov, Mel Gorman,
	Rik van Riel, Konstantin Khlebnikov, Hugh Dickins, Sasha Levin

On 07/28/14 11:43, Andrey Ryabinin wrote:
> Sasha Levin triggered use-after-free when fuzzing using trinity and the KASAN
> patchset:
> 
> 	AddressSanitizer: use after free in do_read_fault.isra.40+0x3c2/0x510 at addr ffff88048a733110
> 	page:ffffea001229ccc0 count:0 mapcount:0 mapping:          (null) index:0x0
> 	page flags: 0xafffff80008000(tail)
> 	page dumped because: kasan error
> 	CPU: 6 PID: 9262 Comm: trinity-c104 Not tainted 3.16.0-rc6-next-20140723-sasha-00047-g289342b-dirty #929
> 	 00000000000000fb 0000000000000000 ffffea001229ccc0 ffff88038ac0fb78
> 	 ffffffffa5e40903 ffff88038ac0fc48 ffff88038ac0fc38 ffffffffa142acfc
> 	 0000000000000001 ffff880509ff5aa8 ffff88038ac10038 ffff88038ac0fbb0
> 	Call Trace:
> 	dump_stack (lib/dump_stack.c:52)
> 	kasan_report_error (mm/kasan/report.c:98 mm/kasan/report.c:166)
> 	? debug_smp_processor_id (lib/smp_processor_id.c:57)
> 	? preempt_count_sub (kernel/sched/core.c:2606)
> 	? put_lock_stats.isra.13 (./arch/x86/include/asm/preempt.h:98 kernel/locking/lockdep.c:254)
> 	? do_read_fault.isra.40 (mm/memory.c:2784 mm/memory.c:2849 mm/memory.c:2898)
> 	__asan_load8 (mm/kasan/kasan.c:364)
> 	? do_read_fault.isra.40 (mm/memory.c:2864 mm/memory.c:2898)
> 	do_read_fault.isra.40 (mm/memory.c:2864 mm/memory.c:2898)
> 	? _raw_spin_unlock (./arch/x86/include/asm/preempt.h:98 include/linux/spinlock_api_smp.h:152 kernel/locking/spinlock.c:183)
> 	? __pte_alloc (mm/memory.c:598)
> 	handle_mm_fault (mm/memory.c:3092 mm/memory.c:3225 mm/memory.c:3345 mm/memory.c:3374)
> 	? pud_huge (./arch/x86/include/asm/paravirt.h:611 arch/x86/mm/hugetlbpage.c:76)
> 	__get_user_pages (mm/gup.c:286 mm/gup.c:478)
> 	__mlock_vma_pages_range (mm/mlock.c:262)
> 	__mm_populate (mm/mlock.c:710)
> 	SyS_remap_file_pages (mm/mmap.c:2653 mm/mmap.c:2593)
> 	tracesys (arch/x86/kernel/entry_64.S:541)
> 	Read of size 8 by thread T9262:
> 	Memory state around the buggy address:
> 	 ffff88048a732e80: fb fb fb fb fb fb fb fb fb fb fb fb fb fb fb fb
> 	 ffff88048a732f00: fb fb fb fb fb fb fb fb fb fb fb fb fb fb fb fb
> 	 ffff88048a732f80: fb fb fb fb fb fb fb fb fb fb fb fb fb fb fb fb
> 	 ffff88048a733000: fb fb fb fb fb fb fb fb fb fb fb fb fb fb fb fb
> 	 ffff88048a733080: fb fb fb fb fb fb fb fb fb fb fb fb fb fb fb fb
> 	>ffff88048a733100: fb fb fb fb fb fb fb fb fb fb fb fb fb fb fb fb
> 	                         ^
> 	 ffff88048a733180: fb fb fb fb fb fb fb fb fb fb fb fb fb fb fb fb
> 	 ffff88048a733200: fb fb fb fb fb fb fb fb fb fb fb fb fb fb fb fb
> 	 ffff88048a733280: fb fb fb fb fb fb fb fb fb fb fb fb fb fb fb fb
> 	 ffff88048a733300: fb fb fb fb fb fb fb fb fb fb fb fb fb fb fb fb
> 	 ffff88048a733380: fb fb fb fb fb fb fb fb fb fb fb fb fb fb fb fb
> 
> 
> It looks like that pte pointer is invalid in do_fault_around().
> This could happen if fault_around_bytes is set to 0.
> fault_around_pages() and fault_around_mask() calls rounddown_pow_of_to(fault_around_bytes)
> The result of rounddown_pow_of_to is undefined if parameter == 0
> (in my environment it returns 0x8000000000000000).
> 
> One way to fix this would be to return 0 from fault_around_pages() if fault_around_bytes == 0,
> however this would add extra code on fault path.
> 
> So let's just forbid to set fault_around_bytes to zero.
> Fault around is not used if fault_around_pages() <= 1, so if anyone doesn't want to use
> it, fault_around_bytes could be set to any value in range [1, 2*PAGE_SIZE - 1]
> instead of 0.
> 
> Fixes: a9b0f861("mm: nominate faultaround area in bytes rather than page order")
> Signed-off-by: Andrey Ryabinin <a.ryabinin@samsung.com>
> Reported-by: Sasha Levin <sasha.levin@oracle.com>
> Cc: <stable@vger.kernel.org> # 3.15.x
> Cc: "Kirill A. Shutemov" <kirill.shutemov@linux.intel.com>
> Cc: Mel Gorman <mgorman@suse.de>
> Cc: Rik van Riel <riel@redhat.com>
> Cc: Dave Jones <davej@redhat.com>
> Cc: Konstantin Khlebnikov <koct9i@gmail.com>
> Cc: Hugh Dickins <hughd@google.com>
> ---
>  mm/memory.c | 2 +-
>  1 file changed, 1 insertion(+), 1 deletion(-)
> 
> diff --git a/mm/memory.c b/mm/memory.c
> index 7e8d820..5927c42 100644
> --- a/mm/memory.c
> +++ b/mm/memory.c
> @@ -2784,7 +2784,7 @@ static int fault_around_bytes_get(void *data, u64 *val)
>  
>  static int fault_around_bytes_set(void *data, u64 val)
>  {
> -	if (val / PAGE_SIZE > PTRS_PER_PTE)
> +	if (!val || val / PAGE_SIZE > PTRS_PER_PTE)
>  		return -EINVAL;
>  	fault_around_bytes = val;
>  	return 0;
> 

Adding Sasha to cc.

--
To unsubscribe, send a message with 'unsubscribe linux-mm' in
the body to majordomo@kvack.org.  For more info on Linux MM,
see: http://www.linux-mm.org/ .
Don't email: <a href=mailto:"dont@kvack.org"> email@kvack.org </a>

^ permalink raw reply	[flat|nested] 73+ messages in thread

* Re: [PATCH] mm: don't allow fault_around_bytes to be 0
  2014-07-28  7:43       ` Andrey Ryabinin
  (?)
@ 2014-07-28  9:36         ` Kirill A. Shutemov
  -1 siblings, 0 replies; 73+ messages in thread
From: Kirill A. Shutemov @ 2014-07-28  9:36 UTC (permalink / raw)
  To: Andrey Ryabinin, Sasha Levin
  Cc: Andrew Morton, Linus Torvalds, Andi Kleen, Matthew Wilcox,
	Dave Hansen, Alexander Viro, Dave Chinner, Ning Qu, linux-mm,
	linux-fsdevel, linux-kernel, Dave Jones, stable,
	Kirill A. Shutemov, Mel Gorman, Rik van Riel,
	Konstantin Khlebnikov, Hugh Dickins

On Mon, Jul 28, 2014 at 11:43:20AM +0400, Andrey Ryabinin wrote:
> Sasha Levin triggered use-after-free when fuzzing using trinity and the KASAN
> patchset:
> 
> 	AddressSanitizer: use after free in do_read_fault.isra.40+0x3c2/0x510 at addr ffff88048a733110
> 	page:ffffea001229ccc0 count:0 mapcount:0 mapping:          (null) index:0x0
> 	page flags: 0xafffff80008000(tail)
> 	page dumped because: kasan error
> 	CPU: 6 PID: 9262 Comm: trinity-c104 Not tainted 3.16.0-rc6-next-20140723-sasha-00047-g289342b-dirty #929
> 	 00000000000000fb 0000000000000000 ffffea001229ccc0 ffff88038ac0fb78
> 	 ffffffffa5e40903 ffff88038ac0fc48 ffff88038ac0fc38 ffffffffa142acfc
> 	 0000000000000001 ffff880509ff5aa8 ffff88038ac10038 ffff88038ac0fbb0
> 	Call Trace:
> 	dump_stack (lib/dump_stack.c:52)
> 	kasan_report_error (mm/kasan/report.c:98 mm/kasan/report.c:166)
> 	? debug_smp_processor_id (lib/smp_processor_id.c:57)
> 	? preempt_count_sub (kernel/sched/core.c:2606)
> 	? put_lock_stats.isra.13 (./arch/x86/include/asm/preempt.h:98 kernel/locking/lockdep.c:254)
> 	? do_read_fault.isra.40 (mm/memory.c:2784 mm/memory.c:2849 mm/memory.c:2898)
> 	__asan_load8 (mm/kasan/kasan.c:364)
> 	? do_read_fault.isra.40 (mm/memory.c:2864 mm/memory.c:2898)
> 	do_read_fault.isra.40 (mm/memory.c:2864 mm/memory.c:2898)
> 	? _raw_spin_unlock (./arch/x86/include/asm/preempt.h:98 include/linux/spinlock_api_smp.h:152 kernel/locking/spinlock.c:183)
> 	? __pte_alloc (mm/memory.c:598)
> 	handle_mm_fault (mm/memory.c:3092 mm/memory.c:3225 mm/memory.c:3345 mm/memory.c:3374)
> 	? pud_huge (./arch/x86/include/asm/paravirt.h:611 arch/x86/mm/hugetlbpage.c:76)
> 	__get_user_pages (mm/gup.c:286 mm/gup.c:478)
> 	__mlock_vma_pages_range (mm/mlock.c:262)
> 	__mm_populate (mm/mlock.c:710)
> 	SyS_remap_file_pages (mm/mmap.c:2653 mm/mmap.c:2593)
> 	tracesys (arch/x86/kernel/entry_64.S:541)
> 	Read of size 8 by thread T9262:
> 	Memory state around the buggy address:
> 	 ffff88048a732e80: fb fb fb fb fb fb fb fb fb fb fb fb fb fb fb fb
> 	 ffff88048a732f00: fb fb fb fb fb fb fb fb fb fb fb fb fb fb fb fb
> 	 ffff88048a732f80: fb fb fb fb fb fb fb fb fb fb fb fb fb fb fb fb
> 	 ffff88048a733000: fb fb fb fb fb fb fb fb fb fb fb fb fb fb fb fb
> 	 ffff88048a733080: fb fb fb fb fb fb fb fb fb fb fb fb fb fb fb fb
> 	>ffff88048a733100: fb fb fb fb fb fb fb fb fb fb fb fb fb fb fb fb
> 	                         ^
> 	 ffff88048a733180: fb fb fb fb fb fb fb fb fb fb fb fb fb fb fb fb
> 	 ffff88048a733200: fb fb fb fb fb fb fb fb fb fb fb fb fb fb fb fb
> 	 ffff88048a733280: fb fb fb fb fb fb fb fb fb fb fb fb fb fb fb fb
> 	 ffff88048a733300: fb fb fb fb fb fb fb fb fb fb fb fb fb fb fb fb
> 	 ffff88048a733380: fb fb fb fb fb fb fb fb fb fb fb fb fb fb fb fb
> 
> 
> It looks like that pte pointer is invalid in do_fault_around().
> This could happen if fault_around_bytes is set to 0.
> fault_around_pages() and fault_around_mask() calls rounddown_pow_of_to(fault_around_bytes)
> The result of rounddown_pow_of_to is undefined if parameter == 0
> (in my environment it returns 0x8000000000000000).

Ouch. Good catch!

Although, I'm not convinced that it caused the issue. Sasha, did you touch the
debugfs handle?

> One way to fix this would be to return 0 from fault_around_pages() if fault_around_bytes == 0,
> however this would add extra code on fault path.
> 
> So let's just forbid to set fault_around_bytes to zero.
> Fault around is not used if fault_around_pages() <= 1, so if anyone doesn't want to use
> it, fault_around_bytes could be set to any value in range [1, 2*PAGE_SIZE - 1]
> instead of 0.

>From user point of view, 0 is perfectly fine. What about untested patch
below?

Other option: get rid of debugfs interface, so fault_around_pages() and
fault_around_mask() will always be known compile time.

There's other problem with the debugfs handle: we don't have serialization
between fault_around_bytes_set() and do_fault_around(). It can end up
badly if fault_around_bytes will be changed under do_fault_around()...

I don't think it worth adding the serialization to hot path to protect
against debug interface.
Any thoughts?

>From 2932fbcefe4ec21c046348e21981149ecce5d161 Mon Sep 17 00:00:00 2001
From: "Kirill A. Shutemov" <kirill.shutemov@linux.intel.com>
Date: Mon, 28 Jul 2014 12:16:49 +0300
Subject: [PATCH] mm, debugfs: workaround undefined behaviour of
 rounddown_pow_of_two(0)

Result of rounddown_pow_of_two(0) is not defined. It can cause a bug if
user will set fault_around_bytes to 0 via debugfs interface.

Let's set fault_around_bytes to PAGE_SIZE if user tries to set it to
something below PAGE_SIZE.

Signed-off-by: Kirill A. Shutemov <kirill.shutemov@linux.intel.com>
---
 mm/memory.c | 3 ++-
 1 file changed, 2 insertions(+), 1 deletion(-)

diff --git a/mm/memory.c b/mm/memory.c
index 7e8d8205b610..2d8fa7a7b0ee 100644
--- a/mm/memory.c
+++ b/mm/memory.c
@@ -2786,7 +2786,8 @@ static int fault_around_bytes_set(void *data, u64 val)
 {
 	if (val / PAGE_SIZE > PTRS_PER_PTE)
 		return -EINVAL;
-	fault_around_bytes = val;
+	/* rounddown_pow_of_two(0) is not defined */
+	fault_around_bytes = max(val, PAGE_SIZE);
 	return 0;
 }
 DEFINE_SIMPLE_ATTRIBUTE(fault_around_bytes_fops,
-- 
 Kirill A. Shutemov

^ permalink raw reply related	[flat|nested] 73+ messages in thread

* Re: [PATCH] mm: don't allow fault_around_bytes to be 0
@ 2014-07-28  9:36         ` Kirill A. Shutemov
  0 siblings, 0 replies; 73+ messages in thread
From: Kirill A. Shutemov @ 2014-07-28  9:36 UTC (permalink / raw)
  To: Andrey Ryabinin, Sasha Levin
  Cc: Andrew Morton, Linus Torvalds, Andi Kleen, Matthew Wilcox,
	Dave Hansen, Alexander Viro, Dave Chinner, Ning Qu, linux-mm,
	linux-fsdevel, linux-kernel, Dave Jones, stable,
	Kirill A. Shutemov, Mel Gorman, Rik van Riel,
	Konstantin Khlebnikov, Hugh Dickins

On Mon, Jul 28, 2014 at 11:43:20AM +0400, Andrey Ryabinin wrote:
> Sasha Levin triggered use-after-free when fuzzing using trinity and the KASAN
> patchset:
> 
> 	AddressSanitizer: use after free in do_read_fault.isra.40+0x3c2/0x510 at addr ffff88048a733110
> 	page:ffffea001229ccc0 count:0 mapcount:0 mapping:          (null) index:0x0
> 	page flags: 0xafffff80008000(tail)
> 	page dumped because: kasan error
> 	CPU: 6 PID: 9262 Comm: trinity-c104 Not tainted 3.16.0-rc6-next-20140723-sasha-00047-g289342b-dirty #929
> 	 00000000000000fb 0000000000000000 ffffea001229ccc0 ffff88038ac0fb78
> 	 ffffffffa5e40903 ffff88038ac0fc48 ffff88038ac0fc38 ffffffffa142acfc
> 	 0000000000000001 ffff880509ff5aa8 ffff88038ac10038 ffff88038ac0fbb0
> 	Call Trace:
> 	dump_stack (lib/dump_stack.c:52)
> 	kasan_report_error (mm/kasan/report.c:98 mm/kasan/report.c:166)
> 	? debug_smp_processor_id (lib/smp_processor_id.c:57)
> 	? preempt_count_sub (kernel/sched/core.c:2606)
> 	? put_lock_stats.isra.13 (./arch/x86/include/asm/preempt.h:98 kernel/locking/lockdep.c:254)
> 	? do_read_fault.isra.40 (mm/memory.c:2784 mm/memory.c:2849 mm/memory.c:2898)
> 	__asan_load8 (mm/kasan/kasan.c:364)
> 	? do_read_fault.isra.40 (mm/memory.c:2864 mm/memory.c:2898)
> 	do_read_fault.isra.40 (mm/memory.c:2864 mm/memory.c:2898)
> 	? _raw_spin_unlock (./arch/x86/include/asm/preempt.h:98 include/linux/spinlock_api_smp.h:152 kernel/locking/spinlock.c:183)
> 	? __pte_alloc (mm/memory.c:598)
> 	handle_mm_fault (mm/memory.c:3092 mm/memory.c:3225 mm/memory.c:3345 mm/memory.c:3374)
> 	? pud_huge (./arch/x86/include/asm/paravirt.h:611 arch/x86/mm/hugetlbpage.c:76)
> 	__get_user_pages (mm/gup.c:286 mm/gup.c:478)
> 	__mlock_vma_pages_range (mm/mlock.c:262)
> 	__mm_populate (mm/mlock.c:710)
> 	SyS_remap_file_pages (mm/mmap.c:2653 mm/mmap.c:2593)
> 	tracesys (arch/x86/kernel/entry_64.S:541)
> 	Read of size 8 by thread T9262:
> 	Memory state around the buggy address:
> 	 ffff88048a732e80: fb fb fb fb fb fb fb fb fb fb fb fb fb fb fb fb
> 	 ffff88048a732f00: fb fb fb fb fb fb fb fb fb fb fb fb fb fb fb fb
> 	 ffff88048a732f80: fb fb fb fb fb fb fb fb fb fb fb fb fb fb fb fb
> 	 ffff88048a733000: fb fb fb fb fb fb fb fb fb fb fb fb fb fb fb fb
> 	 ffff88048a733080: fb fb fb fb fb fb fb fb fb fb fb fb fb fb fb fb
> 	>ffff88048a733100: fb fb fb fb fb fb fb fb fb fb fb fb fb fb fb fb
> 	                         ^
> 	 ffff88048a733180: fb fb fb fb fb fb fb fb fb fb fb fb fb fb fb fb
> 	 ffff88048a733200: fb fb fb fb fb fb fb fb fb fb fb fb fb fb fb fb
> 	 ffff88048a733280: fb fb fb fb fb fb fb fb fb fb fb fb fb fb fb fb
> 	 ffff88048a733300: fb fb fb fb fb fb fb fb fb fb fb fb fb fb fb fb
> 	 ffff88048a733380: fb fb fb fb fb fb fb fb fb fb fb fb fb fb fb fb
> 
> 
> It looks like that pte pointer is invalid in do_fault_around().
> This could happen if fault_around_bytes is set to 0.
> fault_around_pages() and fault_around_mask() calls rounddown_pow_of_to(fault_around_bytes)
> The result of rounddown_pow_of_to is undefined if parameter == 0
> (in my environment it returns 0x8000000000000000).

Ouch. Good catch!

Although, I'm not convinced that it caused the issue. Sasha, did you touch the
debugfs handle?

> One way to fix this would be to return 0 from fault_around_pages() if fault_around_bytes == 0,
> however this would add extra code on fault path.
> 
> So let's just forbid to set fault_around_bytes to zero.
> Fault around is not used if fault_around_pages() <= 1, so if anyone doesn't want to use
> it, fault_around_bytes could be set to any value in range [1, 2*PAGE_SIZE - 1]
> instead of 0.

>From user point of view, 0 is perfectly fine. What about untested patch
below?

Other option: get rid of debugfs interface, so fault_around_pages() and
fault_around_mask() will always be known compile time.

There's other problem with the debugfs handle: we don't have serialization
between fault_around_bytes_set() and do_fault_around(). It can end up
badly if fault_around_bytes will be changed under do_fault_around()...

I don't think it worth adding the serialization to hot path to protect
against debug interface.
Any thoughts?

>From 2932fbcefe4ec21c046348e21981149ecce5d161 Mon Sep 17 00:00:00 2001
From: "Kirill A. Shutemov" <kirill.shutemov@linux.intel.com>
Date: Mon, 28 Jul 2014 12:16:49 +0300
Subject: [PATCH] mm, debugfs: workaround undefined behaviour of
 rounddown_pow_of_two(0)

Result of rounddown_pow_of_two(0) is not defined. It can cause a bug if
user will set fault_around_bytes to 0 via debugfs interface.

Let's set fault_around_bytes to PAGE_SIZE if user tries to set it to
something below PAGE_SIZE.

Signed-off-by: Kirill A. Shutemov <kirill.shutemov@linux.intel.com>
---
 mm/memory.c | 3 ++-
 1 file changed, 2 insertions(+), 1 deletion(-)

diff --git a/mm/memory.c b/mm/memory.c
index 7e8d8205b610..2d8fa7a7b0ee 100644
--- a/mm/memory.c
+++ b/mm/memory.c
@@ -2786,7 +2786,8 @@ static int fault_around_bytes_set(void *data, u64 val)
 {
 	if (val / PAGE_SIZE > PTRS_PER_PTE)
 		return -EINVAL;
-	fault_around_bytes = val;
+	/* rounddown_pow_of_two(0) is not defined */
+	fault_around_bytes = max(val, PAGE_SIZE);
 	return 0;
 }
 DEFINE_SIMPLE_ATTRIBUTE(fault_around_bytes_fops,
-- 
 Kirill A. Shutemov

--
To unsubscribe, send a message with 'unsubscribe linux-mm' in
the body to majordomo@kvack.org.  For more info on Linux MM,
see: http://www.linux-mm.org/ .
Don't email: <a href=mailto:"dont@kvack.org"> email@kvack.org </a>

^ permalink raw reply related	[flat|nested] 73+ messages in thread

* Re: [PATCH] mm: don't allow fault_around_bytes to be 0
@ 2014-07-28  9:36         ` Kirill A. Shutemov
  0 siblings, 0 replies; 73+ messages in thread
From: Kirill A. Shutemov @ 2014-07-28  9:36 UTC (permalink / raw)
  To: Andrey Ryabinin, Sasha Levin
  Cc: Andrew Morton, Linus Torvalds, Andi Kleen, Matthew Wilcox,
	Dave Hansen, Alexander Viro, Dave Chinner, Ning Qu, linux-mm,
	linux-fsdevel, linux-kernel, Dave Jones, stable,
	Kirill A. Shutemov, Mel Gorman, Rik van Riel,
	Konstantin Khlebnikov, Hugh Dickins

On Mon, Jul 28, 2014 at 11:43:20AM +0400, Andrey Ryabinin wrote:
> Sasha Levin triggered use-after-free when fuzzing using trinity and the KASAN
> patchset:
> 
> 	AddressSanitizer: use after free in do_read_fault.isra.40+0x3c2/0x510 at addr ffff88048a733110
> 	page:ffffea001229ccc0 count:0 mapcount:0 mapping:          (null) index:0x0
> 	page flags: 0xafffff80008000(tail)
> 	page dumped because: kasan error
> 	CPU: 6 PID: 9262 Comm: trinity-c104 Not tainted 3.16.0-rc6-next-20140723-sasha-00047-g289342b-dirty #929
> 	 00000000000000fb 0000000000000000 ffffea001229ccc0 ffff88038ac0fb78
> 	 ffffffffa5e40903 ffff88038ac0fc48 ffff88038ac0fc38 ffffffffa142acfc
> 	 0000000000000001 ffff880509ff5aa8 ffff88038ac10038 ffff88038ac0fbb0
> 	Call Trace:
> 	dump_stack (lib/dump_stack.c:52)
> 	kasan_report_error (mm/kasan/report.c:98 mm/kasan/report.c:166)
> 	? debug_smp_processor_id (lib/smp_processor_id.c:57)
> 	? preempt_count_sub (kernel/sched/core.c:2606)
> 	? put_lock_stats.isra.13 (./arch/x86/include/asm/preempt.h:98 kernel/locking/lockdep.c:254)
> 	? do_read_fault.isra.40 (mm/memory.c:2784 mm/memory.c:2849 mm/memory.c:2898)
> 	__asan_load8 (mm/kasan/kasan.c:364)
> 	? do_read_fault.isra.40 (mm/memory.c:2864 mm/memory.c:2898)
> 	do_read_fault.isra.40 (mm/memory.c:2864 mm/memory.c:2898)
> 	? _raw_spin_unlock (./arch/x86/include/asm/preempt.h:98 include/linux/spinlock_api_smp.h:152 kernel/locking/spinlock.c:183)
> 	? __pte_alloc (mm/memory.c:598)
> 	handle_mm_fault (mm/memory.c:3092 mm/memory.c:3225 mm/memory.c:3345 mm/memory.c:3374)
> 	? pud_huge (./arch/x86/include/asm/paravirt.h:611 arch/x86/mm/hugetlbpage.c:76)
> 	__get_user_pages (mm/gup.c:286 mm/gup.c:478)
> 	__mlock_vma_pages_range (mm/mlock.c:262)
> 	__mm_populate (mm/mlock.c:710)
> 	SyS_remap_file_pages (mm/mmap.c:2653 mm/mmap.c:2593)
> 	tracesys (arch/x86/kernel/entry_64.S:541)
> 	Read of size 8 by thread T9262:
> 	Memory state around the buggy address:
> 	 ffff88048a732e80: fb fb fb fb fb fb fb fb fb fb fb fb fb fb fb fb
> 	 ffff88048a732f00: fb fb fb fb fb fb fb fb fb fb fb fb fb fb fb fb
> 	 ffff88048a732f80: fb fb fb fb fb fb fb fb fb fb fb fb fb fb fb fb
> 	 ffff88048a733000: fb fb fb fb fb fb fb fb fb fb fb fb fb fb fb fb
> 	 ffff88048a733080: fb fb fb fb fb fb fb fb fb fb fb fb fb fb fb fb
> 	>ffff88048a733100: fb fb fb fb fb fb fb fb fb fb fb fb fb fb fb fb
> 	                         ^
> 	 ffff88048a733180: fb fb fb fb fb fb fb fb fb fb fb fb fb fb fb fb
> 	 ffff88048a733200: fb fb fb fb fb fb fb fb fb fb fb fb fb fb fb fb
> 	 ffff88048a733280: fb fb fb fb fb fb fb fb fb fb fb fb fb fb fb fb
> 	 ffff88048a733300: fb fb fb fb fb fb fb fb fb fb fb fb fb fb fb fb
> 	 ffff88048a733380: fb fb fb fb fb fb fb fb fb fb fb fb fb fb fb fb
> 
> 
> It looks like that pte pointer is invalid in do_fault_around().
> This could happen if fault_around_bytes is set to 0.
> fault_around_pages() and fault_around_mask() calls rounddown_pow_of_to(fault_around_bytes)
> The result of rounddown_pow_of_to is undefined if parameter == 0
> (in my environment it returns 0x8000000000000000).

Ouch. Good catch!

Although, I'm not convinced that it caused the issue. Sasha, did you touch the
debugfs handle?

> One way to fix this would be to return 0 from fault_around_pages() if fault_around_bytes == 0,
> however this would add extra code on fault path.
> 
> So let's just forbid to set fault_around_bytes to zero.
> Fault around is not used if fault_around_pages() <= 1, so if anyone doesn't want to use
> it, fault_around_bytes could be set to any value in range [1, 2*PAGE_SIZE - 1]
> instead of 0.

>From user point of view, 0 is perfectly fine. What about untested patch
below?

Other option: get rid of debugfs interface, so fault_around_pages() and
fault_around_mask() will always be known compile time.

There's other problem with the debugfs handle: we don't have serialization
between fault_around_bytes_set() and do_fault_around(). It can end up
badly if fault_around_bytes will be changed under do_fault_around()...

I don't think it worth adding the serialization to hot path to protect
against debug interface.
Any thoughts?

^ permalink raw reply	[flat|nested] 73+ messages in thread

* Re: [PATCH] mm: don't allow fault_around_bytes to be 0
  2014-07-28  9:36         ` Kirill A. Shutemov
  (?)
  (?)
@ 2014-07-28 10:27           ` Andrey Ryabinin
  -1 siblings, 0 replies; 73+ messages in thread
From: Andrey Ryabinin @ 2014-07-28 10:27 UTC (permalink / raw)
  To: Kirill A. Shutemov, Sasha Levin
  Cc: Andrew Morton, Linus Torvalds, Andi Kleen, Matthew Wilcox,
	Dave Hansen, Alexander Viro, Dave Chinner, Ning Qu, linux-mm,
	linux-fsdevel, linux-kernel, Dave Jones, stable,
	Kirill A. Shutemov, Mel Gorman, Rik van Riel,
	Konstantin Khlebnikov, Hugh Dickins

On 07/28/14 13:36, Kirill A. Shutemov wrote:
> On Mon, Jul 28, 2014 at 11:43:20AM +0400, Andrey Ryabinin wrote:
>> Sasha Levin triggered use-after-free when fuzzing using trinity and the KASAN
>> patchset:
>>
>> 	AddressSanitizer: use after free in do_read_fault.isra.40+0x3c2/0x510 at addr ffff88048a733110
>> 	page:ffffea001229ccc0 count:0 mapcount:0 mapping:          (null) index:0x0
>> 	page flags: 0xafffff80008000(tail)
>> 	page dumped because: kasan error
>> 	CPU: 6 PID: 9262 Comm: trinity-c104 Not tainted 3.16.0-rc6-next-20140723-sasha-00047-g289342b-dirty #929
>> 	 00000000000000fb 0000000000000000 ffffea001229ccc0 ffff88038ac0fb78
>> 	 ffffffffa5e40903 ffff88038ac0fc48 ffff88038ac0fc38 ffffffffa142acfc
>> 	 0000000000000001 ffff880509ff5aa8 ffff88038ac10038 ffff88038ac0fbb0
>> 	Call Trace:
>> 	dump_stack (lib/dump_stack.c:52)
>> 	kasan_report_error (mm/kasan/report.c:98 mm/kasan/report.c:166)
>> 	? debug_smp_processor_id (lib/smp_processor_id.c:57)
>> 	? preempt_count_sub (kernel/sched/core.c:2606)
>> 	? put_lock_stats.isra.13 (./arch/x86/include/asm/preempt.h:98 kernel/locking/lockdep.c:254)
>> 	? do_read_fault.isra.40 (mm/memory.c:2784 mm/memory.c:2849 mm/memory.c:2898)
>> 	__asan_load8 (mm/kasan/kasan.c:364)
>> 	? do_read_fault.isra.40 (mm/memory.c:2864 mm/memory.c:2898)
>> 	do_read_fault.isra.40 (mm/memory.c:2864 mm/memory.c:2898)
>> 	? _raw_spin_unlock (./arch/x86/include/asm/preempt.h:98 include/linux/spinlock_api_smp.h:152 kernel/locking/spinlock.c:183)
>> 	? __pte_alloc (mm/memory.c:598)
>> 	handle_mm_fault (mm/memory.c:3092 mm/memory.c:3225 mm/memory.c:3345 mm/memory.c:3374)
>> 	? pud_huge (./arch/x86/include/asm/paravirt.h:611 arch/x86/mm/hugetlbpage.c:76)
>> 	__get_user_pages (mm/gup.c:286 mm/gup.c:478)
>> 	__mlock_vma_pages_range (mm/mlock.c:262)
>> 	__mm_populate (mm/mlock.c:710)
>> 	SyS_remap_file_pages (mm/mmap.c:2653 mm/mmap.c:2593)
>> 	tracesys (arch/x86/kernel/entry_64.S:541)
>> 	Read of size 8 by thread T9262:
>> 	Memory state around the buggy address:
>> 	 ffff88048a732e80: fb fb fb fb fb fb fb fb fb fb fb fb fb fb fb fb
>> 	 ffff88048a732f00: fb fb fb fb fb fb fb fb fb fb fb fb fb fb fb fb
>> 	 ffff88048a732f80: fb fb fb fb fb fb fb fb fb fb fb fb fb fb fb fb
>> 	 ffff88048a733000: fb fb fb fb fb fb fb fb fb fb fb fb fb fb fb fb
>> 	 ffff88048a733080: fb fb fb fb fb fb fb fb fb fb fb fb fb fb fb fb
>> 	>ffff88048a733100: fb fb fb fb fb fb fb fb fb fb fb fb fb fb fb fb
>> 	                         ^
>> 	 ffff88048a733180: fb fb fb fb fb fb fb fb fb fb fb fb fb fb fb fb
>> 	 ffff88048a733200: fb fb fb fb fb fb fb fb fb fb fb fb fb fb fb fb
>> 	 ffff88048a733280: fb fb fb fb fb fb fb fb fb fb fb fb fb fb fb fb
>> 	 ffff88048a733300: fb fb fb fb fb fb fb fb fb fb fb fb fb fb fb fb
>> 	 ffff88048a733380: fb fb fb fb fb fb fb fb fb fb fb fb fb fb fb fb
>>
>>
>> It looks like that pte pointer is invalid in do_fault_around().
>> This could happen if fault_around_bytes is set to 0.
>> fault_around_pages() and fault_around_mask() calls rounddown_pow_of_to(fault_around_bytes)
>> The result of rounddown_pow_of_to is undefined if parameter == 0
>> (in my environment it returns 0x8000000000000000).
> 
> Ouch. Good catch!
> 
> Although, I'm not convinced that it caused the issue. Sasha, did you touch the
> debugfs handle?
> 

I suppose trinity could change it, no? I've got the very same spew after setting fault_around_bytes to 0.

>> One way to fix this would be to return 0 from fault_around_pages() if fault_around_bytes == 0,
>> however this would add extra code on fault path.
>>
>> So let's just forbid to set fault_around_bytes to zero.
>> Fault around is not used if fault_around_pages() <= 1, so if anyone doesn't want to use
>> it, fault_around_bytes could be set to any value in range [1, 2*PAGE_SIZE - 1]
>> instead of 0.
> 
>>From user point of view, 0 is perfectly fine. What about untested patch
> below?
> 

In case if we are not going to get rid of debugfs interface I would better keep
faul_around_bytes always roundded down, like in following patch:


>From f41b7777b29f06dc62f80526e5617cae82a38709 Mon Sep 17 00:00:00 2001
From: Andrey Ryabinin <a.ryabinin@samsung.com>
Date: Mon, 28 Jul 2014 13:46:10 +0400
Subject: [PATCH] mm: debugfs: move rounddown_pow_of_two() out from do_fault
 path

do_fault_around expects fault_around_bytes rounded down to nearest
page order. Instead of calling rounddown_pow_of_two every time
in fault_around_pages()/fault_around_mask() we could do round down
when user changes fault_around_bytes via debugfs interface.

This also fixes bug when user set fault_around_bytes to 0.
Result of rounddown_pow_of_two(0) is not defined, therefore
fault_around_bytes == 0 doesn't work without this patch.

Let's set fault_around_bytes to PAGE_SIZE if user sets to something
less than PAGE_SIZE

Fixes: a9b0f861("mm: nominate faultaround area in bytes rather than page order")
Signed-off-by: Andrey Ryabinin <a.ryabinin@samsung.com>
Cc: <stable@vger.kernel.org> # 3.15.x
---
 mm/memory.c | 19 +++++++++++--------
 1 file changed, 11 insertions(+), 8 deletions(-)

diff --git a/mm/memory.c b/mm/memory.c
index 7e8d820..e0c6fd6 100644
--- a/mm/memory.c
+++ b/mm/memory.c
@@ -2758,20 +2758,16 @@ void do_set_pte(struct vm_area_struct *vma, unsigned long address,
 	update_mmu_cache(vma, address, pte);
 }

-static unsigned long fault_around_bytes = 65536;
+static unsigned long fault_around_bytes = rounddown_pow_of_two(65536);

-/*
- * fault_around_pages() and fault_around_mask() round down fault_around_bytes
- * to nearest page order. It's what do_fault_around() expects to see.
- */
 static inline unsigned long fault_around_pages(void)
 {
-	return rounddown_pow_of_two(fault_around_bytes) / PAGE_SIZE;
+	return fault_around_bytes >> PAGE_SHIFT;
 }

 static inline unsigned long fault_around_mask(void)
 {
-	return ~(rounddown_pow_of_two(fault_around_bytes) - 1) & PAGE_MASK;
+	return ~(fault_around_bytes - 1) & PAGE_MASK;
 }


@@ -2782,11 +2778,18 @@ static int fault_around_bytes_get(void *data, u64 *val)
 	return 0;
 }

+/*
+ * fault_around_pages() and fault_around_mask() expects fault_around_bytes
+ * rounded down to nearest page order. It's what do_fault_around() expects to see.
+ */
 static int fault_around_bytes_set(void *data, u64 val)
 {
 	if (val / PAGE_SIZE > PTRS_PER_PTE)
 		return -EINVAL;
-	fault_around_bytes = val;
+	if (val > PAGE_SIZE)
+		fault_around_bytes = rounddown_pow_of_two(val);
+	else
+		fault_around_bytes = PAGE_SIZE; /* rounddown_pow_of_two(0) is undefined */
 	return 0;
 }
 DEFINE_SIMPLE_ATTRIBUTE(fault_around_bytes_fops,
-- 
1.8.5.5








^ permalink raw reply related	[flat|nested] 73+ messages in thread

* Re: [PATCH] mm: don't allow fault_around_bytes to be 0
@ 2014-07-28 10:27           ` Andrey Ryabinin
  0 siblings, 0 replies; 73+ messages in thread
From: Andrey Ryabinin @ 2014-07-28 10:27 UTC (permalink / raw)
  To: Kirill A. Shutemov, Sasha Levin
  Cc: Andrew Morton, Linus Torvalds, Andi Kleen, Matthew Wilcox,
	Dave Hansen, Alexander Viro, Dave Chinner, Ning Qu, linux-mm,
	linux-fsdevel, linux-kernel, Dave Jones, stable,
	Kirill A. Shutemov, Mel Gorman, Rik van Riel,
	Konstantin Khlebnikov, Hugh Dickins

On 07/28/14 13:36, Kirill A. Shutemov wrote:
> On Mon, Jul 28, 2014 at 11:43:20AM +0400, Andrey Ryabinin wrote:
>> Sasha Levin triggered use-after-free when fuzzing using trinity and the KASAN
>> patchset:
>>
>> 	AddressSanitizer: use after free in do_read_fault.isra.40+0x3c2/0x510 at addr ffff88048a733110
>> 	page:ffffea001229ccc0 count:0 mapcount:0 mapping:          (null) index:0x0
>> 	page flags: 0xafffff80008000(tail)
>> 	page dumped because: kasan error
>> 	CPU: 6 PID: 9262 Comm: trinity-c104 Not tainted 3.16.0-rc6-next-20140723-sasha-00047-g289342b-dirty #929
>> 	 00000000000000fb 0000000000000000 ffffea001229ccc0 ffff88038ac0fb78
>> 	 ffffffffa5e40903 ffff88038ac0fc48 ffff88038ac0fc38 ffffffffa142acfc
>> 	 0000000000000001 ffff880509ff5aa8 ffff88038ac10038 ffff88038ac0fbb0
>> 	Call Trace:
>> 	dump_stack (lib/dump_stack.c:52)
>> 	kasan_report_error (mm/kasan/report.c:98 mm/kasan/report.c:166)
>> 	? debug_smp_processor_id (lib/smp_processor_id.c:57)
>> 	? preempt_count_sub (kernel/sched/core.c:2606)
>> 	? put_lock_stats.isra.13 (./arch/x86/include/asm/preempt.h:98 kernel/locking/lockdep.c:254)
>> 	? do_read_fault.isra.40 (mm/memory.c:2784 mm/memory.c:2849 mm/memory.c:2898)
>> 	__asan_load8 (mm/kasan/kasan.c:364)
>> 	? do_read_fault.isra.40 (mm/memory.c:2864 mm/memory.c:2898)
>> 	do_read_fault.isra.40 (mm/memory.c:2864 mm/memory.c:2898)
>> 	? _raw_spin_unlock (./arch/x86/include/asm/preempt.h:98 include/linux/spinlock_api_smp.h:152 kernel/locking/spinlock.c:183)
>> 	? __pte_alloc (mm/memory.c:598)
>> 	handle_mm_fault (mm/memory.c:3092 mm/memory.c:3225 mm/memory.c:3345 mm/memory.c:3374)
>> 	? pud_huge (./arch/x86/include/asm/paravirt.h:611 arch/x86/mm/hugetlbpage.c:76)
>> 	__get_user_pages (mm/gup.c:286 mm/gup.c:478)
>> 	__mlock_vma_pages_range (mm/mlock.c:262)
>> 	__mm_populate (mm/mlock.c:710)
>> 	SyS_remap_file_pages (mm/mmap.c:2653 mm/mmap.c:2593)
>> 	tracesys (arch/x86/kernel/entry_64.S:541)
>> 	Read of size 8 by thread T9262:
>> 	Memory state around the buggy address:
>> 	 ffff88048a732e80: fb fb fb fb fb fb fb fb fb fb fb fb fb fb fb fb
>> 	 ffff88048a732f00: fb fb fb fb fb fb fb fb fb fb fb fb fb fb fb fb
>> 	 ffff88048a732f80: fb fb fb fb fb fb fb fb fb fb fb fb fb fb fb fb
>> 	 ffff88048a733000: fb fb fb fb fb fb fb fb fb fb fb fb fb fb fb fb
>> 	 ffff88048a733080: fb fb fb fb fb fb fb fb fb fb fb fb fb fb fb fb
>> 	>ffff88048a733100: fb fb fb fb fb fb fb fb fb fb fb fb fb fb fb fb
>> 	                         ^
>> 	 ffff88048a733180: fb fb fb fb fb fb fb fb fb fb fb fb fb fb fb fb
>> 	 ffff88048a733200: fb fb fb fb fb fb fb fb fb fb fb fb fb fb fb fb
>> 	 ffff88048a733280: fb fb fb fb fb fb fb fb fb fb fb fb fb fb fb fb
>> 	 ffff88048a733300: fb fb fb fb fb fb fb fb fb fb fb fb fb fb fb fb
>> 	 ffff88048a733380: fb fb fb fb fb fb fb fb fb fb fb fb fb fb fb fb
>>
>>
>> It looks like that pte pointer is invalid in do_fault_around().
>> This could happen if fault_around_bytes is set to 0.
>> fault_around_pages() and fault_around_mask() calls rounddown_pow_of_to(fault_around_bytes)
>> The result of rounddown_pow_of_to is undefined if parameter == 0
>> (in my environment it returns 0x8000000000000000).
> 
> Ouch. Good catch!
> 
> Although, I'm not convinced that it caused the issue. Sasha, did you touch the
> debugfs handle?
> 

I suppose trinity could change it, no? I've got the very same spew after setting fault_around_bytes to 0.

>> One way to fix this would be to return 0 from fault_around_pages() if fault_around_bytes == 0,
>> however this would add extra code on fault path.
>>
>> So let's just forbid to set fault_around_bytes to zero.
>> Fault around is not used if fault_around_pages() <= 1, so if anyone doesn't want to use
>> it, fault_around_bytes could be set to any value in range [1, 2*PAGE_SIZE - 1]
>> instead of 0.
> 
>>From user point of view, 0 is perfectly fine. What about untested patch
> below?
> 

In case if we are not going to get rid of debugfs interface I would better keep
faul_around_bytes always roundded down, like in following patch:


>From f41b7777b29f06dc62f80526e5617cae82a38709 Mon Sep 17 00:00:00 2001
From: Andrey Ryabinin <a.ryabinin@samsung.com>
Date: Mon, 28 Jul 2014 13:46:10 +0400
Subject: [PATCH] mm: debugfs: move rounddown_pow_of_two() out from do_fault
 path

do_fault_around expects fault_around_bytes rounded down to nearest
page order. Instead of calling rounddown_pow_of_two every time
in fault_around_pages()/fault_around_mask() we could do round down
when user changes fault_around_bytes via debugfs interface.

This also fixes bug when user set fault_around_bytes to 0.
Result of rounddown_pow_of_two(0) is not defined, therefore
fault_around_bytes == 0 doesn't work without this patch.

Let's set fault_around_bytes to PAGE_SIZE if user sets to something
less than PAGE_SIZE

Fixes: a9b0f861("mm: nominate faultaround area in bytes rather than page order")
Signed-off-by: Andrey Ryabinin <a.ryabinin@samsung.com>
Cc: <stable@vger.kernel.org> # 3.15.x
---
 mm/memory.c | 19 +++++++++++--------
 1 file changed, 11 insertions(+), 8 deletions(-)

diff --git a/mm/memory.c b/mm/memory.c
index 7e8d820..e0c6fd6 100644
--- a/mm/memory.c
+++ b/mm/memory.c
@@ -2758,20 +2758,16 @@ void do_set_pte(struct vm_area_struct *vma, unsigned long address,
 	update_mmu_cache(vma, address, pte);
 }

-static unsigned long fault_around_bytes = 65536;
+static unsigned long fault_around_bytes = rounddown_pow_of_two(65536);

-/*
- * fault_around_pages() and fault_around_mask() round down fault_around_bytes
- * to nearest page order. It's what do_fault_around() expects to see.
- */
 static inline unsigned long fault_around_pages(void)
 {
-	return rounddown_pow_of_two(fault_around_bytes) / PAGE_SIZE;
+	return fault_around_bytes >> PAGE_SHIFT;
 }

 static inline unsigned long fault_around_mask(void)
 {
-	return ~(rounddown_pow_of_two(fault_around_bytes) - 1) & PAGE_MASK;
+	return ~(fault_around_bytes - 1) & PAGE_MASK;
 }


@@ -2782,11 +2778,18 @@ static int fault_around_bytes_get(void *data, u64 *val)
 	return 0;
 }

+/*
+ * fault_around_pages() and fault_around_mask() expects fault_around_bytes
+ * rounded down to nearest page order. It's what do_fault_around() expects to see.
+ */
 static int fault_around_bytes_set(void *data, u64 val)
 {
 	if (val / PAGE_SIZE > PTRS_PER_PTE)
 		return -EINVAL;
-	fault_around_bytes = val;
+	if (val > PAGE_SIZE)
+		fault_around_bytes = rounddown_pow_of_two(val);
+	else
+		fault_around_bytes = PAGE_SIZE; /* rounddown_pow_of_two(0) is undefined */
 	return 0;
 }
 DEFINE_SIMPLE_ATTRIBUTE(fault_around_bytes_fops,
-- 
1.8.5.5







--
To unsubscribe, send a message with 'unsubscribe linux-mm' in
the body to majordomo@kvack.org.  For more info on Linux MM,
see: http://www.linux-mm.org/ .
Don't email: <a href=mailto:"dont@kvack.org"> email@kvack.org </a>

^ permalink raw reply related	[flat|nested] 73+ messages in thread

* Re: [PATCH] mm: don't allow fault_around_bytes to be 0
@ 2014-07-28 10:27           ` Andrey Ryabinin
  0 siblings, 0 replies; 73+ messages in thread
From: Andrey Ryabinin @ 2014-07-28 10:27 UTC (permalink / raw)
  To: Kirill A. Shutemov, Sasha Levin
  Cc: Andrew Morton, Linus Torvalds, Andi Kleen, Matthew Wilcox,
	Dave Hansen, Alexander Viro, Dave Chinner, Ning Qu, linux-mm,
	linux-fsdevel, linux-kernel, Dave Jones, stable,
	Kirill A. Shutemov, Mel Gorman, Rik van Riel,
	Konstantin Khlebnikov, Hugh Dickins

On 07/28/14 13:36, Kirill A. Shutemov wrote:
> On Mon, Jul 28, 2014 at 11:43:20AM +0400, Andrey Ryabinin wrote:
>> Sasha Levin triggered use-after-free when fuzzing using trinity and the KASAN
>> patchset:
>>
>> 	AddressSanitizer: use after free in do_read_fault.isra.40+0x3c2/0x510 at addr ffff88048a733110
>> 	page:ffffea001229ccc0 count:0 mapcount:0 mapping:          (null) index:0x0
>> 	page flags: 0xafffff80008000(tail)
>> 	page dumped because: kasan error
>> 	CPU: 6 PID: 9262 Comm: trinity-c104 Not tainted 3.16.0-rc6-next-20140723-sasha-00047-g289342b-dirty #929
>> 	 00000000000000fb 0000000000000000 ffffea001229ccc0 ffff88038ac0fb78
>> 	 ffffffffa5e40903 ffff88038ac0fc48 ffff88038ac0fc38 ffffffffa142acfc
>> 	 0000000000000001 ffff880509ff5aa8 ffff88038ac10038 ffff88038ac0fbb0
>> 	Call Trace:
>> 	dump_stack (lib/dump_stack.c:52)
>> 	kasan_report_error (mm/kasan/report.c:98 mm/kasan/report.c:166)
>> 	? debug_smp_processor_id (lib/smp_processor_id.c:57)
>> 	? preempt_count_sub (kernel/sched/core.c:2606)
>> 	? put_lock_stats.isra.13 (./arch/x86/include/asm/preempt.h:98 kernel/locking/lockdep.c:254)
>> 	? do_read_fault.isra.40 (mm/memory.c:2784 mm/memory.c:2849 mm/memory.c:2898)
>> 	__asan_load8 (mm/kasan/kasan.c:364)
>> 	? do_read_fault.isra.40 (mm/memory.c:2864 mm/memory.c:2898)
>> 	do_read_fault.isra.40 (mm/memory.c:2864 mm/memory.c:2898)
>> 	? _raw_spin_unlock (./arch/x86/include/asm/preempt.h:98 include/linux/spinlock_api_smp.h:152 kernel/locking/spinlock.c:183)
>> 	? __pte_alloc (mm/memory.c:598)
>> 	handle_mm_fault (mm/memory.c:3092 mm/memory.c:3225 mm/memory.c:3345 mm/memory.c:3374)
>> 	? pud_huge (./arch/x86/include/asm/paravirt.h:611 arch/x86/mm/hugetlbpage.c:76)
>> 	__get_user_pages (mm/gup.c:286 mm/gup.c:478)
>> 	__mlock_vma_pages_range (mm/mlock.c:262)
>> 	__mm_populate (mm/mlock.c:710)
>> 	SyS_remap_file_pages (mm/mmap.c:2653 mm/mmap.c:2593)
>> 	tracesys (arch/x86/kernel/entry_64.S:541)
>> 	Read of size 8 by thread T9262:
>> 	Memory state around the buggy address:
>> 	 ffff88048a732e80: fb fb fb fb fb fb fb fb fb fb fb fb fb fb fb fb
>> 	 ffff88048a732f00: fb fb fb fb fb fb fb fb fb fb fb fb fb fb fb fb
>> 	 ffff88048a732f80: fb fb fb fb fb fb fb fb fb fb fb fb fb fb fb fb
>> 	 ffff88048a733000: fb fb fb fb fb fb fb fb fb fb fb fb fb fb fb fb
>> 	 ffff88048a733080: fb fb fb fb fb fb fb fb fb fb fb fb fb fb fb fb
>> 	>ffff88048a733100: fb fb fb fb fb fb fb fb fb fb fb fb fb fb fb fb
>> 	                         ^
>> 	 ffff88048a733180: fb fb fb fb fb fb fb fb fb fb fb fb fb fb fb fb
>> 	 ffff88048a733200: fb fb fb fb fb fb fb fb fb fb fb fb fb fb fb fb
>> 	 ffff88048a733280: fb fb fb fb fb fb fb fb fb fb fb fb fb fb fb fb
>> 	 ffff88048a733300: fb fb fb fb fb fb fb fb fb fb fb fb fb fb fb fb
>> 	 ffff88048a733380: fb fb fb fb fb fb fb fb fb fb fb fb fb fb fb fb
>>
>>
>> It looks like that pte pointer is invalid in do_fault_around().
>> This could happen if fault_around_bytes is set to 0.
>> fault_around_pages() and fault_around_mask() calls rounddown_pow_of_to(fault_around_bytes)
>> The result of rounddown_pow_of_to is undefined if parameter == 0
>> (in my environment it returns 0x8000000000000000).
> 
> Ouch. Good catch!
> 
> Although, I'm not convinced that it caused the issue. Sasha, did you touch the
> debugfs handle?
> 

I suppose trinity could change it, no? I've got the very same spew after setting fault_around_bytes to 0.

>> One way to fix this would be to return 0 from fault_around_pages() if fault_around_bytes == 0,
>> however this would add extra code on fault path.
>>
>> So let's just forbid to set fault_around_bytes to zero.
>> Fault around is not used if fault_around_pages() <= 1, so if anyone doesn't want to use
>> it, fault_around_bytes could be set to any value in range [1, 2*PAGE_SIZE - 1]
>> instead of 0.
> 
>>>From user point of view, 0 is perfectly fine. What about untested patch
> below?
> 

In case if we are not going to get rid of debugfs interface I would better keep
faul_around_bytes always roundded down, like in following patch:


>From f41b7777b29f06dc62f80526e5617cae82a38709 Mon Sep 17 00:00:00 2001
From: Andrey Ryabinin <a.ryabinin@samsung.com>
Date: Mon, 28 Jul 2014 13:46:10 +0400
Subject: [PATCH] mm: debugfs: move rounddown_pow_of_two() out from do_fault
 path

do_fault_around expects fault_around_bytes rounded down to nearest
page order. Instead of calling rounddown_pow_of_two every time
in fault_around_pages()/fault_around_mask() we could do round down
when user changes fault_around_bytes via debugfs interface.

This also fixes bug when user set fault_around_bytes to 0.
Result of rounddown_pow_of_two(0) is not defined, therefore
fault_around_bytes == 0 doesn't work without this patch.

Let's set fault_around_bytes to PAGE_SIZE if user sets to something
less than PAGE_SIZE

Fixes: a9b0f861("mm: nominate faultaround area in bytes rather than page order")
Signed-off-by: Andrey Ryabinin <a.ryabinin@samsung.com>
Cc: <stable@vger.kernel.org> # 3.15.x
---
 mm/memory.c | 19 +++++++++++--------
 1 file changed, 11 insertions(+), 8 deletions(-)

diff --git a/mm/memory.c b/mm/memory.c
index 7e8d820..e0c6fd6 100644
--- a/mm/memory.c
+++ b/mm/memory.c
@@ -2758,20 +2758,16 @@ void do_set_pte(struct vm_area_struct *vma, unsigned long address,
 	update_mmu_cache(vma, address, pte);
 }

-static unsigned long fault_around_bytes = 65536;
+static unsigned long fault_around_bytes = rounddown_pow_of_two(65536);

-/*
- * fault_around_pages() and fault_around_mask() round down fault_around_bytes
- * to nearest page order. It's what do_fault_around() expects to see.
- */
 static inline unsigned long fault_around_pages(void)
 {
-	return rounddown_pow_of_two(fault_around_bytes) / PAGE_SIZE;
+	return fault_around_bytes >> PAGE_SHIFT;
 }

 static inline unsigned long fault_around_mask(void)
 {
-	return ~(rounddown_pow_of_two(fault_around_bytes) - 1) & PAGE_MASK;
+	return ~(fault_around_bytes - 1) & PAGE_MASK;
 }


@@ -2782,11 +2778,18 @@ static int fault_around_bytes_get(void *data, u64 *val)
 	return 0;
 }

+/*
+ * fault_around_pages() and fault_around_mask() expects fault_around_bytes
+ * rounded down to nearest page order. It's what do_fault_around() expects to see.
+ */
 static int fault_around_bytes_set(void *data, u64 val)
 {
 	if (val / PAGE_SIZE > PTRS_PER_PTE)
 		return -EINVAL;
-	fault_around_bytes = val;
+	if (val > PAGE_SIZE)
+		fault_around_bytes = rounddown_pow_of_two(val);
+	else
+		fault_around_bytes = PAGE_SIZE; /* rounddown_pow_of_two(0) is undefined */
 	return 0;
 }
 DEFINE_SIMPLE_ATTRIBUTE(fault_around_bytes_fops,
-- 
1.8.5.5








^ permalink raw reply related	[flat|nested] 73+ messages in thread

* Re: [PATCH] mm: don't allow fault_around_bytes to be 0
@ 2014-07-28 10:27           ` Andrey Ryabinin
  0 siblings, 0 replies; 73+ messages in thread
From: Andrey Ryabinin @ 2014-07-28 10:27 UTC (permalink / raw)
  To: Kirill A. Shutemov, Sasha Levin
  Cc: Andrew Morton, Linus Torvalds, Andi Kleen, Matthew Wilcox,
	Dave Hansen, Alexander Viro, Dave Chinner, Ning Qu, linux-mm,
	linux-fsdevel, linux-kernel, Dave Jones, stable,
	Kirill A. Shutemov, Mel Gorman, Rik van Riel,
	Konstantin Khlebnikov, Hugh Dickins

On 07/28/14 13:36, Kirill A. Shutemov wrote:
> On Mon, Jul 28, 2014 at 11:43:20AM +0400, Andrey Ryabinin wrote:
>> Sasha Levin triggered use-after-free when fuzzing using trinity and the KASAN
>> patchset:
>>
>> 	AddressSanitizer: use after free in do_read_fault.isra.40+0x3c2/0x510 at addr ffff88048a733110
>> 	page:ffffea001229ccc0 count:0 mapcount:0 mapping:          (null) index:0x0
>> 	page flags: 0xafffff80008000(tail)
>> 	page dumped because: kasan error
>> 	CPU: 6 PID: 9262 Comm: trinity-c104 Not tainted 3.16.0-rc6-next-20140723-sasha-00047-g289342b-dirty #929
>> 	 00000000000000fb 0000000000000000 ffffea001229ccc0 ffff88038ac0fb78
>> 	 ffffffffa5e40903 ffff88038ac0fc48 ffff88038ac0fc38 ffffffffa142acfc
>> 	 0000000000000001 ffff880509ff5aa8 ffff88038ac10038 ffff88038ac0fbb0
>> 	Call Trace:
>> 	dump_stack (lib/dump_stack.c:52)
>> 	kasan_report_error (mm/kasan/report.c:98 mm/kasan/report.c:166)
>> 	? debug_smp_processor_id (lib/smp_processor_id.c:57)
>> 	? preempt_count_sub (kernel/sched/core.c:2606)
>> 	? put_lock_stats.isra.13 (./arch/x86/include/asm/preempt.h:98 kernel/locking/lockdep.c:254)
>> 	? do_read_fault.isra.40 (mm/memory.c:2784 mm/memory.c:2849 mm/memory.c:2898)
>> 	__asan_load8 (mm/kasan/kasan.c:364)
>> 	? do_read_fault.isra.40 (mm/memory.c:2864 mm/memory.c:2898)
>> 	do_read_fault.isra.40 (mm/memory.c:2864 mm/memory.c:2898)
>> 	? _raw_spin_unlock (./arch/x86/include/asm/preempt.h:98 include/linux/spinlock_api_smp.h:152 kernel/locking/spinlock.c:183)
>> 	? __pte_alloc (mm/memory.c:598)
>> 	handle_mm_fault (mm/memory.c:3092 mm/memory.c:3225 mm/memory.c:3345 mm/memory.c:3374)
>> 	? pud_huge (./arch/x86/include/asm/paravirt.h:611 arch/x86/mm/hugetlbpage.c:76)
>> 	__get_user_pages (mm/gup.c:286 mm/gup.c:478)
>> 	__mlock_vma_pages_range (mm/mlock.c:262)
>> 	__mm_populate (mm/mlock.c:710)
>> 	SyS_remap_file_pages (mm/mmap.c:2653 mm/mmap.c:2593)
>> 	tracesys (arch/x86/kernel/entry_64.S:541)
>> 	Read of size 8 by thread T9262:
>> 	Memory state around the buggy address:
>> 	 ffff88048a732e80: fb fb fb fb fb fb fb fb fb fb fb fb fb fb fb fb
>> 	 ffff88048a732f00: fb fb fb fb fb fb fb fb fb fb fb fb fb fb fb fb
>> 	 ffff88048a732f80: fb fb fb fb fb fb fb fb fb fb fb fb fb fb fb fb
>> 	 ffff88048a733000: fb fb fb fb fb fb fb fb fb fb fb fb fb fb fb fb
>> 	 ffff88048a733080: fb fb fb fb fb fb fb fb fb fb fb fb fb fb fb fb
>> 	>ffff88048a733100: fb fb fb fb fb fb fb fb fb fb fb fb fb fb fb fb
>> 	                         ^
>> 	 ffff88048a733180: fb fb fb fb fb fb fb fb fb fb fb fb fb fb fb fb
>> 	 ffff88048a733200: fb fb fb fb fb fb fb fb fb fb fb fb fb fb fb fb
>> 	 ffff88048a733280: fb fb fb fb fb fb fb fb fb fb fb fb fb fb fb fb
>> 	 ffff88048a733300: fb fb fb fb fb fb fb fb fb fb fb fb fb fb fb fb
>> 	 ffff88048a733380: fb fb fb fb fb fb fb fb fb fb fb fb fb fb fb fb
>>
>>
>> It looks like that pte pointer is invalid in do_fault_around().
>> This could happen if fault_around_bytes is set to 0.
>> fault_around_pages() and fault_around_mask() calls rounddown_pow_of_to(fault_around_bytes)
>> The result of rounddown_pow_of_to is undefined if parameter == 0
>> (in my environment it returns 0x8000000000000000).
> 
> Ouch. Good catch!
> 
> Although, I'm not convinced that it caused the issue. Sasha, did you touch the
> debugfs handle?
> 

I suppose trinity could change it, no? I've got the very same spew after setting fault_around_bytes to 0.

>> One way to fix this would be to return 0 from fault_around_pages() if fault_around_bytes == 0,
>> however this would add extra code on fault path.
>>
>> So let's just forbid to set fault_around_bytes to zero.
>> Fault around is not used if fault_around_pages() <= 1, so if anyone doesn't want to use
>> it, fault_around_bytes could be set to any value in range [1, 2*PAGE_SIZE - 1]
>> instead of 0.
> 
>>From user point of view, 0 is perfectly fine. What about untested patch
> below?
> 

In case if we are not going to get rid of debugfs interface I would better keep
faul_around_bytes always roundded down, like in following patch:

^ permalink raw reply	[flat|nested] 73+ messages in thread

* Re: [PATCH] mm: don't allow fault_around_bytes to be 0
  2014-07-28 10:27           ` Andrey Ryabinin
@ 2014-07-28 10:52             ` Kirill A. Shutemov
  -1 siblings, 0 replies; 73+ messages in thread
From: Kirill A. Shutemov @ 2014-07-28 10:52 UTC (permalink / raw)
  To: Andrey Ryabinin
  Cc: Sasha Levin, Andrew Morton, Linus Torvalds, Andi Kleen,
	Matthew Wilcox, Dave Hansen, Alexander Viro, Dave Chinner,
	Ning Qu, linux-mm, linux-fsdevel, linux-kernel, Dave Jones,
	stable, Kirill A. Shutemov, Mel Gorman, Rik van Riel,
	Konstantin Khlebnikov, Hugh Dickins

On Mon, Jul 28, 2014 at 02:27:37PM +0400, Andrey Ryabinin wrote:
> On 07/28/14 13:36, Kirill A. Shutemov wrote:
> > On Mon, Jul 28, 2014 at 11:43:20AM +0400, Andrey Ryabinin wrote:
> >> Sasha Levin triggered use-after-free when fuzzing using trinity and the KASAN
> >> patchset:
> >>
> >> 	AddressSanitizer: use after free in do_read_fault.isra.40+0x3c2/0x510 at addr ffff88048a733110
> >> 	page:ffffea001229ccc0 count:0 mapcount:0 mapping:          (null) index:0x0
> >> 	page flags: 0xafffff80008000(tail)
> >> 	page dumped because: kasan error
> >> 	CPU: 6 PID: 9262 Comm: trinity-c104 Not tainted 3.16.0-rc6-next-20140723-sasha-00047-g289342b-dirty #929
> >> 	 00000000000000fb 0000000000000000 ffffea001229ccc0 ffff88038ac0fb78
> >> 	 ffffffffa5e40903 ffff88038ac0fc48 ffff88038ac0fc38 ffffffffa142acfc
> >> 	 0000000000000001 ffff880509ff5aa8 ffff88038ac10038 ffff88038ac0fbb0
> >> 	Call Trace:
> >> 	dump_stack (lib/dump_stack.c:52)
> >> 	kasan_report_error (mm/kasan/report.c:98 mm/kasan/report.c:166)
> >> 	? debug_smp_processor_id (lib/smp_processor_id.c:57)
> >> 	? preempt_count_sub (kernel/sched/core.c:2606)
> >> 	? put_lock_stats.isra.13 (./arch/x86/include/asm/preempt.h:98 kernel/locking/lockdep.c:254)
> >> 	? do_read_fault.isra.40 (mm/memory.c:2784 mm/memory.c:2849 mm/memory.c:2898)
> >> 	__asan_load8 (mm/kasan/kasan.c:364)
> >> 	? do_read_fault.isra.40 (mm/memory.c:2864 mm/memory.c:2898)
> >> 	do_read_fault.isra.40 (mm/memory.c:2864 mm/memory.c:2898)
> >> 	? _raw_spin_unlock (./arch/x86/include/asm/preempt.h:98 include/linux/spinlock_api_smp.h:152 kernel/locking/spinlock.c:183)
> >> 	? __pte_alloc (mm/memory.c:598)
> >> 	handle_mm_fault (mm/memory.c:3092 mm/memory.c:3225 mm/memory.c:3345 mm/memory.c:3374)
> >> 	? pud_huge (./arch/x86/include/asm/paravirt.h:611 arch/x86/mm/hugetlbpage.c:76)
> >> 	__get_user_pages (mm/gup.c:286 mm/gup.c:478)
> >> 	__mlock_vma_pages_range (mm/mlock.c:262)
> >> 	__mm_populate (mm/mlock.c:710)
> >> 	SyS_remap_file_pages (mm/mmap.c:2653 mm/mmap.c:2593)
> >> 	tracesys (arch/x86/kernel/entry_64.S:541)
> >> 	Read of size 8 by thread T9262:
> >> 	Memory state around the buggy address:
> >> 	 ffff88048a732e80: fb fb fb fb fb fb fb fb fb fb fb fb fb fb fb fb
> >> 	 ffff88048a732f00: fb fb fb fb fb fb fb fb fb fb fb fb fb fb fb fb
> >> 	 ffff88048a732f80: fb fb fb fb fb fb fb fb fb fb fb fb fb fb fb fb
> >> 	 ffff88048a733000: fb fb fb fb fb fb fb fb fb fb fb fb fb fb fb fb
> >> 	 ffff88048a733080: fb fb fb fb fb fb fb fb fb fb fb fb fb fb fb fb
> >> 	>ffff88048a733100: fb fb fb fb fb fb fb fb fb fb fb fb fb fb fb fb
> >> 	                         ^
> >> 	 ffff88048a733180: fb fb fb fb fb fb fb fb fb fb fb fb fb fb fb fb
> >> 	 ffff88048a733200: fb fb fb fb fb fb fb fb fb fb fb fb fb fb fb fb
> >> 	 ffff88048a733280: fb fb fb fb fb fb fb fb fb fb fb fb fb fb fb fb
> >> 	 ffff88048a733300: fb fb fb fb fb fb fb fb fb fb fb fb fb fb fb fb
> >> 	 ffff88048a733380: fb fb fb fb fb fb fb fb fb fb fb fb fb fb fb fb
> >>
> >>
> >> It looks like that pte pointer is invalid in do_fault_around().
> >> This could happen if fault_around_bytes is set to 0.
> >> fault_around_pages() and fault_around_mask() calls rounddown_pow_of_to(fault_around_bytes)
> >> The result of rounddown_pow_of_to is undefined if parameter == 0
> >> (in my environment it returns 0x8000000000000000).
> > 
> > Ouch. Good catch!
> > 
> > Although, I'm not convinced that it caused the issue. Sasha, did you touch the
> > debugfs handle?
> > 
> 
> I suppose trinity could change it, no? I've got the very same spew after setting fault_around_bytes to 0.
> 
> >> One way to fix this would be to return 0 from fault_around_pages() if fault_around_bytes == 0,
> >> however this would add extra code on fault path.
> >>
> >> So let's just forbid to set fault_around_bytes to zero.
> >> Fault around is not used if fault_around_pages() <= 1, so if anyone doesn't want to use
> >> it, fault_around_bytes could be set to any value in range [1, 2*PAGE_SIZE - 1]
> >> instead of 0.
> > 
> >>From user point of view, 0 is perfectly fine. What about untested patch
> > below?
> > 
> 
> In case if we are not going to get rid of debugfs interface I would better keep
> faul_around_bytes always roundded down, like in following patch:
> 
> 
> From f41b7777b29f06dc62f80526e5617cae82a38709 Mon Sep 17 00:00:00 2001
> From: Andrey Ryabinin <a.ryabinin@samsung.com>
> Date: Mon, 28 Jul 2014 13:46:10 +0400
> Subject: [PATCH] mm: debugfs: move rounddown_pow_of_two() out from do_fault
>  path
> 
> do_fault_around expects fault_around_bytes rounded down to nearest
> page order. Instead of calling rounddown_pow_of_two every time
> in fault_around_pages()/fault_around_mask() we could do round down
> when user changes fault_around_bytes via debugfs interface.
> 
> This also fixes bug when user set fault_around_bytes to 0.
> Result of rounddown_pow_of_two(0) is not defined, therefore
> fault_around_bytes == 0 doesn't work without this patch.
> 
> Let's set fault_around_bytes to PAGE_SIZE if user sets to something
> less than PAGE_SIZE
> 
> Fixes: a9b0f861("mm: nominate faultaround area in bytes rather than page order")
> Signed-off-by: Andrey Ryabinin <a.ryabinin@samsung.com>
> Cc: <stable@vger.kernel.org> # 3.15.x
> ---
>  mm/memory.c | 19 +++++++++++--------
>  1 file changed, 11 insertions(+), 8 deletions(-)
> 
> diff --git a/mm/memory.c b/mm/memory.c
> index 7e8d820..e0c6fd6 100644
> --- a/mm/memory.c
> +++ b/mm/memory.c
> @@ -2758,20 +2758,16 @@ void do_set_pte(struct vm_area_struct *vma, unsigned long address,
>  	update_mmu_cache(vma, address, pte);
>  }
> 
> -static unsigned long fault_around_bytes = 65536;
> +static unsigned long fault_around_bytes = rounddown_pow_of_two(65536);

This looks weird, but okay...

Acked-by: Kirill A. Shutemov <kirill.shutemov@linux.intel.com>

-- 
 Kirill A. Shutemov

^ permalink raw reply	[flat|nested] 73+ messages in thread

* Re: [PATCH] mm: don't allow fault_around_bytes to be 0
@ 2014-07-28 10:52             ` Kirill A. Shutemov
  0 siblings, 0 replies; 73+ messages in thread
From: Kirill A. Shutemov @ 2014-07-28 10:52 UTC (permalink / raw)
  To: Andrey Ryabinin
  Cc: Sasha Levin, Andrew Morton, Linus Torvalds, Andi Kleen,
	Matthew Wilcox, Dave Hansen, Alexander Viro, Dave Chinner,
	Ning Qu, linux-mm, linux-fsdevel, linux-kernel, Dave Jones,
	stable, Kirill A. Shutemov, Mel Gorman, Rik van Riel,
	Konstantin Khlebnikov, Hugh Dickins

On Mon, Jul 28, 2014 at 02:27:37PM +0400, Andrey Ryabinin wrote:
> On 07/28/14 13:36, Kirill A. Shutemov wrote:
> > On Mon, Jul 28, 2014 at 11:43:20AM +0400, Andrey Ryabinin wrote:
> >> Sasha Levin triggered use-after-free when fuzzing using trinity and the KASAN
> >> patchset:
> >>
> >> 	AddressSanitizer: use after free in do_read_fault.isra.40+0x3c2/0x510 at addr ffff88048a733110
> >> 	page:ffffea001229ccc0 count:0 mapcount:0 mapping:          (null) index:0x0
> >> 	page flags: 0xafffff80008000(tail)
> >> 	page dumped because: kasan error
> >> 	CPU: 6 PID: 9262 Comm: trinity-c104 Not tainted 3.16.0-rc6-next-20140723-sasha-00047-g289342b-dirty #929
> >> 	 00000000000000fb 0000000000000000 ffffea001229ccc0 ffff88038ac0fb78
> >> 	 ffffffffa5e40903 ffff88038ac0fc48 ffff88038ac0fc38 ffffffffa142acfc
> >> 	 0000000000000001 ffff880509ff5aa8 ffff88038ac10038 ffff88038ac0fbb0
> >> 	Call Trace:
> >> 	dump_stack (lib/dump_stack.c:52)
> >> 	kasan_report_error (mm/kasan/report.c:98 mm/kasan/report.c:166)
> >> 	? debug_smp_processor_id (lib/smp_processor_id.c:57)
> >> 	? preempt_count_sub (kernel/sched/core.c:2606)
> >> 	? put_lock_stats.isra.13 (./arch/x86/include/asm/preempt.h:98 kernel/locking/lockdep.c:254)
> >> 	? do_read_fault.isra.40 (mm/memory.c:2784 mm/memory.c:2849 mm/memory.c:2898)
> >> 	__asan_load8 (mm/kasan/kasan.c:364)
> >> 	? do_read_fault.isra.40 (mm/memory.c:2864 mm/memory.c:2898)
> >> 	do_read_fault.isra.40 (mm/memory.c:2864 mm/memory.c:2898)
> >> 	? _raw_spin_unlock (./arch/x86/include/asm/preempt.h:98 include/linux/spinlock_api_smp.h:152 kernel/locking/spinlock.c:183)
> >> 	? __pte_alloc (mm/memory.c:598)
> >> 	handle_mm_fault (mm/memory.c:3092 mm/memory.c:3225 mm/memory.c:3345 mm/memory.c:3374)
> >> 	? pud_huge (./arch/x86/include/asm/paravirt.h:611 arch/x86/mm/hugetlbpage.c:76)
> >> 	__get_user_pages (mm/gup.c:286 mm/gup.c:478)
> >> 	__mlock_vma_pages_range (mm/mlock.c:262)
> >> 	__mm_populate (mm/mlock.c:710)
> >> 	SyS_remap_file_pages (mm/mmap.c:2653 mm/mmap.c:2593)
> >> 	tracesys (arch/x86/kernel/entry_64.S:541)
> >> 	Read of size 8 by thread T9262:
> >> 	Memory state around the buggy address:
> >> 	 ffff88048a732e80: fb fb fb fb fb fb fb fb fb fb fb fb fb fb fb fb
> >> 	 ffff88048a732f00: fb fb fb fb fb fb fb fb fb fb fb fb fb fb fb fb
> >> 	 ffff88048a732f80: fb fb fb fb fb fb fb fb fb fb fb fb fb fb fb fb
> >> 	 ffff88048a733000: fb fb fb fb fb fb fb fb fb fb fb fb fb fb fb fb
> >> 	 ffff88048a733080: fb fb fb fb fb fb fb fb fb fb fb fb fb fb fb fb
> >> 	>ffff88048a733100: fb fb fb fb fb fb fb fb fb fb fb fb fb fb fb fb
> >> 	                         ^
> >> 	 ffff88048a733180: fb fb fb fb fb fb fb fb fb fb fb fb fb fb fb fb
> >> 	 ffff88048a733200: fb fb fb fb fb fb fb fb fb fb fb fb fb fb fb fb
> >> 	 ffff88048a733280: fb fb fb fb fb fb fb fb fb fb fb fb fb fb fb fb
> >> 	 ffff88048a733300: fb fb fb fb fb fb fb fb fb fb fb fb fb fb fb fb
> >> 	 ffff88048a733380: fb fb fb fb fb fb fb fb fb fb fb fb fb fb fb fb
> >>
> >>
> >> It looks like that pte pointer is invalid in do_fault_around().
> >> This could happen if fault_around_bytes is set to 0.
> >> fault_around_pages() and fault_around_mask() calls rounddown_pow_of_to(fault_around_bytes)
> >> The result of rounddown_pow_of_to is undefined if parameter == 0
> >> (in my environment it returns 0x8000000000000000).
> > 
> > Ouch. Good catch!
> > 
> > Although, I'm not convinced that it caused the issue. Sasha, did you touch the
> > debugfs handle?
> > 
> 
> I suppose trinity could change it, no? I've got the very same spew after setting fault_around_bytes to 0.
> 
> >> One way to fix this would be to return 0 from fault_around_pages() if fault_around_bytes == 0,
> >> however this would add extra code on fault path.
> >>
> >> So let's just forbid to set fault_around_bytes to zero.
> >> Fault around is not used if fault_around_pages() <= 1, so if anyone doesn't want to use
> >> it, fault_around_bytes could be set to any value in range [1, 2*PAGE_SIZE - 1]
> >> instead of 0.
> > 
> >>From user point of view, 0 is perfectly fine. What about untested patch
> > below?
> > 
> 
> In case if we are not going to get rid of debugfs interface I would better keep
> faul_around_bytes always roundded down, like in following patch:
> 
> 
> From f41b7777b29f06dc62f80526e5617cae82a38709 Mon Sep 17 00:00:00 2001
> From: Andrey Ryabinin <a.ryabinin@samsung.com>
> Date: Mon, 28 Jul 2014 13:46:10 +0400
> Subject: [PATCH] mm: debugfs: move rounddown_pow_of_two() out from do_fault
>  path
> 
> do_fault_around expects fault_around_bytes rounded down to nearest
> page order. Instead of calling rounddown_pow_of_two every time
> in fault_around_pages()/fault_around_mask() we could do round down
> when user changes fault_around_bytes via debugfs interface.
> 
> This also fixes bug when user set fault_around_bytes to 0.
> Result of rounddown_pow_of_two(0) is not defined, therefore
> fault_around_bytes == 0 doesn't work without this patch.
> 
> Let's set fault_around_bytes to PAGE_SIZE if user sets to something
> less than PAGE_SIZE
> 
> Fixes: a9b0f861("mm: nominate faultaround area in bytes rather than page order")
> Signed-off-by: Andrey Ryabinin <a.ryabinin@samsung.com>
> Cc: <stable@vger.kernel.org> # 3.15.x
> ---
>  mm/memory.c | 19 +++++++++++--------
>  1 file changed, 11 insertions(+), 8 deletions(-)
> 
> diff --git a/mm/memory.c b/mm/memory.c
> index 7e8d820..e0c6fd6 100644
> --- a/mm/memory.c
> +++ b/mm/memory.c
> @@ -2758,20 +2758,16 @@ void do_set_pte(struct vm_area_struct *vma, unsigned long address,
>  	update_mmu_cache(vma, address, pte);
>  }
> 
> -static unsigned long fault_around_bytes = 65536;
> +static unsigned long fault_around_bytes = rounddown_pow_of_two(65536);

This looks weird, but okay...

Acked-by: Kirill A. Shutemov <kirill.shutemov@linux.intel.com>

-- 
 Kirill A. Shutemov

--
To unsubscribe, send a message with 'unsubscribe linux-mm' in
the body to majordomo@kvack.org.  For more info on Linux MM,
see: http://www.linux-mm.org/ .
Don't email: <a href=mailto:"dont@kvack.org"> email@kvack.org </a>

^ permalink raw reply	[flat|nested] 73+ messages in thread

* Re: [PATCH] mm: don't allow fault_around_bytes to be 0
  2014-07-28 10:27           ` Andrey Ryabinin
@ 2014-07-28 12:32             ` Sasha Levin
  -1 siblings, 0 replies; 73+ messages in thread
From: Sasha Levin @ 2014-07-28 12:32 UTC (permalink / raw)
  To: Andrey Ryabinin, Kirill A. Shutemov
  Cc: Andrew Morton, Linus Torvalds, Andi Kleen, Matthew Wilcox,
	Dave Hansen, Alexander Viro, Dave Chinner, Ning Qu, linux-mm,
	linux-fsdevel, linux-kernel, Dave Jones, stable,
	Kirill A. Shutemov, Mel Gorman, Rik van Riel,
	Konstantin Khlebnikov, Hugh Dickins

On 07/28/2014 06:27 AM, Andrey Ryabinin wrote:
>> Although, I'm not convinced that it caused the issue. Sasha, did you touch the
>> > debugfs handle?
>> > 
> I suppose trinity could change it, no? I've got the very same spew after setting fault_around_bytes to 0.

Not on purpose, but as Andrey said - it's very possible that trinity did.


Thanks,
Sasha

^ permalink raw reply	[flat|nested] 73+ messages in thread

* Re: [PATCH] mm: don't allow fault_around_bytes to be 0
@ 2014-07-28 12:32             ` Sasha Levin
  0 siblings, 0 replies; 73+ messages in thread
From: Sasha Levin @ 2014-07-28 12:32 UTC (permalink / raw)
  To: Andrey Ryabinin, Kirill A. Shutemov
  Cc: Andrew Morton, Linus Torvalds, Andi Kleen, Matthew Wilcox,
	Dave Hansen, Alexander Viro, Dave Chinner, Ning Qu, linux-mm,
	linux-fsdevel, linux-kernel, Dave Jones, stable,
	Kirill A. Shutemov, Mel Gorman, Rik van Riel,
	Konstantin Khlebnikov, Hugh Dickins

On 07/28/2014 06:27 AM, Andrey Ryabinin wrote:
>> Although, I'm not convinced that it caused the issue. Sasha, did you touch the
>> > debugfs handle?
>> > 
> I suppose trinity could change it, no? I've got the very same spew after setting fault_around_bytes to 0.

Not on purpose, but as Andrey said - it's very possible that trinity did.


Thanks,
Sasha

--
To unsubscribe, send a message with 'unsubscribe linux-mm' in
the body to majordomo@kvack.org.  For more info on Linux MM,
see: http://www.linux-mm.org/ .
Don't email: <a href=mailto:"dont@kvack.org"> email@kvack.org </a>

^ permalink raw reply	[flat|nested] 73+ messages in thread

* Re: [PATCH] mm: don't allow fault_around_bytes to be 0
  2014-07-28  9:36         ` Kirill A. Shutemov
@ 2014-07-28 15:26           ` Dave Hansen
  -1 siblings, 0 replies; 73+ messages in thread
From: Dave Hansen @ 2014-07-28 15:26 UTC (permalink / raw)
  To: Kirill A. Shutemov, Andrey Ryabinin, Sasha Levin
  Cc: Andrew Morton, Linus Torvalds, Andi Kleen, Matthew Wilcox,
	Alexander Viro, Dave Chinner, Ning Qu, linux-mm, linux-fsdevel,
	linux-kernel, Dave Jones, stable, Kirill A. Shutemov, Mel Gorman,
	Rik van Riel, Konstantin Khlebnikov, Hugh Dickins

On 07/28/2014 02:36 AM, Kirill A. Shutemov wrote:
> +++ b/mm/memory.c
> @@ -2786,7 +2786,8 @@ static int fault_around_bytes_set(void *data, u64 val)
>  {
>  	if (val / PAGE_SIZE > PTRS_PER_PTE)
>  		return -EINVAL;
> -	fault_around_bytes = val;
> +	/* rounddown_pow_of_two(0) is not defined */
> +	fault_around_bytes = max(val, PAGE_SIZE);
>  	return 0;
>  }

It's also possible to race and have fault_around_bytes change between
when fault_around_mask() and fault_around_pages() are called so that
they don't match any more.  The min()/max() in do_fault_around() should
keep this from doing anything _too_ nasty, but it's worth thinking about
at least.

The safest thing to do might be to use an ACCESS_ONCE() at the beginning
of do_fault_around() for fault_around_bytes and generate
fault_around_mask() from the ACCESS_ONCE() result.

^ permalink raw reply	[flat|nested] 73+ messages in thread

* Re: [PATCH] mm: don't allow fault_around_bytes to be 0
@ 2014-07-28 15:26           ` Dave Hansen
  0 siblings, 0 replies; 73+ messages in thread
From: Dave Hansen @ 2014-07-28 15:26 UTC (permalink / raw)
  To: Kirill A. Shutemov, Andrey Ryabinin, Sasha Levin
  Cc: Andrew Morton, Linus Torvalds, Andi Kleen, Matthew Wilcox,
	Alexander Viro, Dave Chinner, Ning Qu, linux-mm, linux-fsdevel,
	linux-kernel, Dave Jones, stable, Kirill A. Shutemov, Mel Gorman,
	Rik van Riel, Konstantin Khlebnikov, Hugh Dickins

On 07/28/2014 02:36 AM, Kirill A. Shutemov wrote:
> +++ b/mm/memory.c
> @@ -2786,7 +2786,8 @@ static int fault_around_bytes_set(void *data, u64 val)
>  {
>  	if (val / PAGE_SIZE > PTRS_PER_PTE)
>  		return -EINVAL;
> -	fault_around_bytes = val;
> +	/* rounddown_pow_of_two(0) is not defined */
> +	fault_around_bytes = max(val, PAGE_SIZE);
>  	return 0;
>  }

It's also possible to race and have fault_around_bytes change between
when fault_around_mask() and fault_around_pages() are called so that
they don't match any more.  The min()/max() in do_fault_around() should
keep this from doing anything _too_ nasty, but it's worth thinking about
at least.

The safest thing to do might be to use an ACCESS_ONCE() at the beginning
of do_fault_around() for fault_around_bytes and generate
fault_around_mask() from the ACCESS_ONCE() result.

--
To unsubscribe, send a message with 'unsubscribe linux-mm' in
the body to majordomo@kvack.org.  For more info on Linux MM,
see: http://www.linux-mm.org/ .
Don't email: <a href=mailto:"dont@kvack.org"> email@kvack.org </a>

^ permalink raw reply	[flat|nested] 73+ messages in thread

* Re: [PATCH] mm: don't allow fault_around_bytes to be 0
  2014-07-28 10:27           ` Andrey Ryabinin
@ 2014-07-28 22:43             ` David Rientjes
  -1 siblings, 0 replies; 73+ messages in thread
From: David Rientjes @ 2014-07-28 22:43 UTC (permalink / raw)
  To: Andrey Ryabinin
  Cc: Kirill A. Shutemov, Sasha Levin, Andrew Morton, Linus Torvalds,
	Andi Kleen, Matthew Wilcox, Dave Hansen, Alexander Viro,
	Dave Chinner, Ning Qu, linux-mm, linux-fsdevel, linux-kernel,
	Dave Jones, stable, Kirill A. Shutemov, Mel Gorman, Rik van Riel,
	Konstantin Khlebnikov, Hugh Dickins

On Mon, 28 Jul 2014, Andrey Ryabinin wrote:

> do_fault_around expects fault_around_bytes rounded down to nearest
> page order. Instead of calling rounddown_pow_of_two every time
> in fault_around_pages()/fault_around_mask() we could do round down
> when user changes fault_around_bytes via debugfs interface.
> 

If you're going to optimize this, it seems like fault_around_bytes would 
benefit from being __read_mostly.

^ permalink raw reply	[flat|nested] 73+ messages in thread

* Re: [PATCH] mm: don't allow fault_around_bytes to be 0
@ 2014-07-28 22:43             ` David Rientjes
  0 siblings, 0 replies; 73+ messages in thread
From: David Rientjes @ 2014-07-28 22:43 UTC (permalink / raw)
  To: Andrey Ryabinin
  Cc: Kirill A. Shutemov, Sasha Levin, Andrew Morton, Linus Torvalds,
	Andi Kleen, Matthew Wilcox, Dave Hansen, Alexander Viro,
	Dave Chinner, Ning Qu, linux-mm, linux-fsdevel, linux-kernel,
	Dave Jones, stable, Kirill A. Shutemov, Mel Gorman, Rik van Riel,
	Konstantin Khlebnikov, Hugh Dickins

On Mon, 28 Jul 2014, Andrey Ryabinin wrote:

> do_fault_around expects fault_around_bytes rounded down to nearest
> page order. Instead of calling rounddown_pow_of_two every time
> in fault_around_pages()/fault_around_mask() we could do round down
> when user changes fault_around_bytes via debugfs interface.
> 

If you're going to optimize this, it seems like fault_around_bytes would 
benefit from being __read_mostly.

--
To unsubscribe, send a message with 'unsubscribe linux-mm' in
the body to majordomo@kvack.org.  For more info on Linux MM,
see: http://www.linux-mm.org/ .
Don't email: <a href=mailto:"dont@kvack.org"> email@kvack.org </a>

^ permalink raw reply	[flat|nested] 73+ messages in thread

end of thread, other threads:[~2014-07-28 22:43 UTC | newest]

Thread overview: 73+ messages (download: mbox.gz / follow: Atom feed)
-- links below jump to the message on this page --
2014-02-27 19:53 [PATCHv3 0/2] mm: map few pages around fault address if they are in page cache Kirill A. Shutemov
2014-02-27 19:53 ` Kirill A. Shutemov
2014-02-27 19:53 ` [PATCHv3 1/2] mm: introduce vm_ops->map_pages() Kirill A. Shutemov
2014-02-27 19:53   ` Kirill A. Shutemov
2014-02-27 21:59   ` Dave Hansen
2014-02-27 21:59     ` Dave Hansen
2014-02-27 22:06     ` Linus Torvalds
2014-02-27 22:06       ` Linus Torvalds
2014-02-27 22:34       ` Dave Hansen
2014-02-27 22:34         ` Dave Hansen
2014-02-28  0:18         ` Kirill A. Shutemov
2014-02-28  0:18           ` Kirill A. Shutemov
2014-02-28 11:50         ` Benjamin Herrenschmidt
2014-02-28 11:50           ` Benjamin Herrenschmidt
2014-02-27 22:08     ` Andrew Morton
2014-02-27 22:08       ` Andrew Morton
2014-03-03 23:16   ` Andrew Morton
2014-03-04  1:26     ` Kirill A. Shutemov
2014-03-04  1:26       ` Kirill A. Shutemov
2014-03-04  1:26       ` Kirill A. Shutemov
2014-03-05  0:04     ` Rusty Russell
2014-03-05  0:04       ` Rusty Russell
2014-03-05 20:02       ` Andrew Morton
2014-03-05 20:02         ` Andrew Morton
2014-07-24  3:33   ` Sasha Levin
2014-07-24  3:33     ` Sasha Levin
2014-07-24  6:53     ` Andrey Ryabinin
2014-07-24  6:53       ` Andrey Ryabinin
2014-07-24 12:48       ` Sasha Levin
2014-07-24 12:48         ` Sasha Levin
2014-07-24 13:05     ` Sasha Levin
2014-07-24 13:05       ` Sasha Levin
2014-07-24 13:30       ` Konstantin Khlebnikov
2014-07-24 13:30         ` Konstantin Khlebnikov
2014-07-28  7:43     ` [PATCH] mm: don't allow fault_around_bytes to be 0 Andrey Ryabinin
2014-07-28  7:43       ` Andrey Ryabinin
2014-07-28  7:47       ` Andrey Ryabinin
2014-07-28  7:47         ` Andrey Ryabinin
2014-07-28  9:36       ` Kirill A. Shutemov
2014-07-28  9:36         ` Kirill A. Shutemov
2014-07-28  9:36         ` Kirill A. Shutemov
2014-07-28 10:27         ` Andrey Ryabinin
2014-07-28 10:27           ` Andrey Ryabinin
2014-07-28 10:27           ` Andrey Ryabinin
2014-07-28 10:27           ` Andrey Ryabinin
2014-07-28 10:52           ` Kirill A. Shutemov
2014-07-28 10:52             ` Kirill A. Shutemov
2014-07-28 12:32           ` Sasha Levin
2014-07-28 12:32             ` Sasha Levin
2014-07-28 22:43           ` David Rientjes
2014-07-28 22:43             ` David Rientjes
2014-07-28 15:26         ` Dave Hansen
2014-07-28 15:26           ` Dave Hansen
2014-02-27 19:53 ` [PATCHv3 2/2] mm: implement ->map_pages for page cache Kirill A. Shutemov
2014-02-27 19:53   ` Kirill A. Shutemov
2014-02-27 21:47   ` Andrew Morton
2014-02-27 21:47     ` Andrew Morton
2014-02-28  0:31     ` Kirill A. Shutemov
2014-02-28  0:31       ` Kirill A. Shutemov
2014-04-02 18:03   ` Konstantin Khlebnikov
2014-04-02 18:03     ` Konstantin Khlebnikov
2014-04-02 19:07     ` Kirill A. Shutemov
2014-04-02 19:07       ` Kirill A. Shutemov
2014-02-27 21:28 ` [PATCHv3 0/2] mm: map few pages around fault address if they are in " Linus Torvalds
2014-02-27 21:28   ` Linus Torvalds
2014-02-28  0:10   ` Kirill A. Shutemov
2014-02-28  0:10     ` Kirill A. Shutemov
2014-02-28  3:52     ` Wilcox, Matthew R
2014-02-28  3:52       ` Wilcox, Matthew R
2014-02-28 23:08     ` Linus Torvalds
2014-02-28 23:08       ` Linus Torvalds
2014-03-12 14:22   ` Kirill A. Shutemov
2014-03-12 14:22     ` Kirill A. Shutemov

This is an external index of several public inboxes,
see mirroring instructions on how to clone and mirror
all data and code used by this external index.