* [RFC, PATCHv2 0/2] mm: map few pages around fault address if they are in page cache
@ 2014-02-17 18:38 ` Kirill A. Shutemov
From: Kirill A. Shutemov @ 2014-02-17 18:38 UTC
  To: Linus Torvalds, Andrew Morton, Mel Gorman, Rik van Riel
  Cc: Andi Kleen, Matthew Wilcox, Dave Hansen, Alexander Viro,
	Dave Chinner, linux-mm, linux-fsdevel, linux-kernel,
	Kirill A. Shutemov

I've rewritten it from scratch since v1. Locking multiple pages at once
was not that great an idea: lock_page() is a mess. It seems we don't
have good lock-ordering rules for lock_page() vs. lock_page(), and it
triggers deadlocks.

Now we have ->fault_nonblock() to ask the filesystem for a page, if
it's reachable without blocking. We request one page at a time. It's
not terribly efficient, and I will probably rethink the interface once
again to expose an iterator or something...

Synthetic tests show pretty impressive numbers: time to complete is
reduced by up to 80% on sequential multi-threaded access to a file.

Unfortunately, the scripting use-cases don't show any change. I've
tried the git test-suite, a kernel build and a kernel rebuild: no
significant difference.

I will try reducing FAULT_AROUND_PAGES to see if it makes any difference.

One important use-case which is not covered yet is shmem/tmpfs: we need
to implement a separate ->fault_nonblock() for it.

Yet another feature to implement is MAP_POPULATE|MAP_NONBLOCK on top of
->fault_nonblock(). It should be useful for the dynamic linker, for
example.
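
For illustration, the call a dynamic linker might make (the flags exist
today, but MAP_NONBLOCK currently makes MAP_POPULATE a no-op; the idea
is to give it real non-blocking populate semantics on top of
->fault_nonblock()):

	/* prefault whatever is already in the page cache, but don't
	 * block on I/O for the rest */
	void *text = mmap(NULL, len, PROT_READ | PROT_EXEC,
			  MAP_PRIVATE | MAP_POPULATE | MAP_NONBLOCK,
			  fd, 0);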

Benchmark data is below.

Any comments?

=========================================================================

Git test-suite make -j60 test:

Base:
 1,529,278,935,718      cycles                     ( +-  0.23% ) [100.00%]
   799,283,349,594      instructions              #    0.52  insns per cycle
                                                  #    3.14  stalled cycles per insn  ( +-  0.12% ) [100.00%]
 2,509,889,276,415      stalled-cycles-frontend   #  164.12% frontend cycles idle     ( +-  0.18% )
       115,493,976      minor-faults                                                  ( +-  0.00% ) [100.00%]
                 1      major-faults

      59.686645587 seconds time elapsed                                          ( +-  0.30% )
Patched:
 1,613,998,434,527      cycles                     ( +-  0.24% ) [100.00%]
   865,776,308,589      instructions              #    0.54  insns per cycle
                                                  #    3.07  stalled cycles per insn  ( +-  0.18% ) [100.00%]
 2,661,812,494,310      stalled-cycles-frontend   #  164.92% frontend cycles idle     ( +-  0.17% )
        47,428,068      minor-faults                                                  ( +-  0.00% ) [100.00%]
                 1      major-faults                                                  ( +- 50.00% )

      60.241766430 seconds time elapsed                                          ( +-  0.85% )

Run make -j60 on a clean allmodconfig kernel tree:

Base:
19,511,808,321,876      cycles                    [100.00%]
17,551,075,994,147      instructions              #    0.90  insns per cycle
                                                  #    1.55  stalled cycles per insn [100.00%]
27,227,785,020,522      stalled-cycles-frontend   #  139.55% frontend cycles idle
       268,039,365      minor-faults                                                 [100.00%]
                 0      major-faults

     132.830612471 seconds time elapsed
Patched:
19,421,069,663,037      cycles                    [100.00%]
17,587,643,648,161      instructions              #    0.91  insns per cycle
                                                  #    1.54  stalled cycles per insn [100.00%]
27,063,152,011,598      stalled-cycles-frontend   #  139.35% frontend cycles idle
       193,550,437      minor-faults                                                 [100.00%]
                 0      major-faults

     132.851823758 seconds time elapsed

Run make -j60 on an already-built allmodconfig kernel tree:

Base:
   286,539,116,594      cycles                     ( +-  0.05% ) [100.00%]
   386,447,067,412      instructions              #    1.35  insns per cycle
                                                  #    0.96  stalled cycles per insn  ( +-  0.00% ) [100.00%]
   372,549,722,846      stalled-cycles-frontend   #  130.02% frontend cycles idle     ( +-  0.04% )
         4,967,540      minor-faults                                                  ( +-  0.06% ) [100.00%]
                 0      major-faults

      27.215434226 seconds time elapsed                                          ( +-  0.18% )
Patched:
   286,435,472,207      cycles                     ( +-  0.07% ) [100.00%]
   388,217,937,147      instructions              #    1.36  insns per cycle
                                                  #    0.96  stalled cycles per insn  ( +-  0.01% ) [100.00%]
   373,681,951,018      stalled-cycles-frontend   #  130.46% frontend cycles idle     ( +-  0.10% )
         2,285,563      minor-faults                                                  ( +-  0.26% ) [100.00%]
                 0      major-faults

      27.292854546 seconds time elapsed                                          ( +-  0.29% )

Access a 16g file sequentially:

Threads		Base, seconds	Patched, seconds	Diff, %
1		8.284232583	5.424205564		-34.52
8		18.338648302	6.330400940		-65.48
32		54.004043136	10.774241948		-80.04
60		88.928810991	16.571960471		-81.36
120		144.729413453	28.875873029		-80.04

Access a 1g file randomly:

Threads		Base, seconds	Patched, seconds	Diff, %
1		18.247095738	15.003616508		-17.77
8		15.226716773	14.239478275		-6.48
32		22.080582106	16.797777162		-23.92
60		25.759900660	22.768611526		-11.61
120		44.443953873	38.197095497		-14.05

Below is `perf stat' output for the tests above.

=========================================================================

# usemem -t <threads> -f testfile -o 16g #

## 1 thread ##
### base ###
   28,284,111,426      cycles                     ( +-  1.53% ) [100.00%]
    44,120,803,329      instructions              #    1.56  insns per cycle
                                                  #    1.02  stalled cycles per insn  ( +-  0.00% ) [100.00%]
    45,148,666,958      stalled-cycles-frontend   #  159.63% frontend cycles idle     ( +-  1.89% )
         4,194,709      minor-faults                                                  ( +-  0.00% ) [100.00%]
                 0      major-faults

       8.284232583 seconds time elapsed                                          ( +-  1.67% )
### patched ###
    18,615,784,628      cycles                     ( +-  8.46% ) [100.00%]
    41,590,187,000      instructions              #    2.23  insns per cycle
                                                  #    0.66  stalled cycles per insn  ( +-  0.02% ) [100.00%]
    27,374,671,499      stalled-cycles-frontend   #  147.05% frontend cycles idle     ( +- 11.48% )
           131,260      minor-faults                                                  ( +-  0.03% ) [100.00%]
                 0      major-faults

       5.424205564 seconds time elapsed                                          ( +-  8.54% )

## 8 threads ##
### base ###
   447,114,910,645      cycles                     ( +-  3.22% ) [100.00%]
   350,993,124,470      instructions              #    0.79  insns per cycle
                                                  #    2.29  stalled cycles per insn  ( +-  0.01% ) [100.00%]
   803,040,468,490      stalled-cycles-frontend   #  179.60% frontend cycles idle     ( +-  3.59% )
        33,556,008      minor-faults                                                  ( +-  0.00% ) [100.00%]
                 0      major-faults

      18.338648302 seconds time elapsed                                          ( +-  3.20% )
### patched ###
   136,437,995,805      cycles                     ( +-  0.87% ) [100.00%]
   328,096,731,149      instructions              #    2.40  insns per cycle
                                                  #    0.60  stalled cycles per insn  ( +-  0.02% ) [100.00%]
   197,237,317,698      stalled-cycles-frontend   #  144.56% frontend cycles idle     ( +-  1.19% )
         1,338,884      minor-faults                                                  ( +-  3.11% ) [100.00%]
                 0      major-faults

       6.330400940 seconds time elapsed                                          ( +-  1.50% )

## 32 threads ##
### base ###
 4,844,333,104,028      cycles                     ( +-  2.17% ) [100.00%]
 1,401,233,609,645      instructions              #    0.29  insns per cycle
                                                  #    6.61  stalled cycles per insn  ( +-  0.02% ) [100.00%]
 9,265,429,354,789      stalled-cycles-frontend   #  191.26% frontend cycles idle     ( +-  1.81% )
       134,228,040      minor-faults                                                  ( +-  0.00% ) [100.00%]
                 0      major-faults

      54.004043136 seconds time elapsed                                          ( +-  2.03% )
### patched ###
   512,226,985,443      cycles                     ( +-  2.21% ) [100.00%]
 1,307,350,894,141      instructions              #    2.55  insns per cycle
                                                  #    0.52  stalled cycles per insn  ( +-  0.02% ) [100.00%]
   677,259,628,341      stalled-cycles-frontend   #  132.22% frontend cycles idle     ( +-  1.56% )
         4,820,979      minor-faults                                                  ( +-  3.88% ) [100.00%]
                 0      major-faults

      10.774241948 seconds time elapsed                                          ( +-  2.40% )
## 60 threads ##
### base ###
14,878,749,780,310      cycles                     ( +-  1.24% ) [100.00%]
 2,633,166,794,060      instructions              #    0.18  insns per cycle
                                                  #   10.75  stalled cycles per insn  ( +-  0.01% ) [100.00%]
28,301,405,498,867      stalled-cycles-frontend   #  190.21% frontend cycles idle     ( +-  0.88% )
       251,687,643      minor-faults                                                  ( +-  0.00% ) [100.00%]
                 0      major-faults

      88.928810991 seconds time elapsed                                          ( +-  1.44% )
### patched ###
 1,156,172,977,262      cycles                     ( +-  3.24% ) [100.00%]
 2,450,915,465,588      instructions              #    2.12  insns per cycle
                                                  #    0.63  stalled cycles per insn  ( +-  0.01% ) [100.00%]
 1,534,996,614,819      stalled-cycles-frontend   #  132.77% frontend cycles idle     ( +-  4.37% )
         9,239,706      minor-faults                                                  ( +-  1.66% ) [100.00%]
                 0      major-faults

      16.571960471 seconds time elapsed                                          ( +-  2.64% )
## 120 threads ##
### base ###
47,350,885,336,341      cycles                     ( +-  2.28% ) [100.00%]
 5,282,921,722,672      instructions              #    0.11  insns per cycle
                                                  #    8.96  stalled cycles per insn  ( +-  0.04% ) [100.00%]
47,332,933,971,860      stalled-cycles-frontend   #   99.96% frontend cycles idle     ( +-  2.40% )
       503,357,749      minor-faults                                                  ( +-  0.00% ) [100.00%]
                 0      major-faults

     144.729413453 seconds time elapsed                                          ( +-  1.75% )
### patched ###
 3,791,855,572,557      cycles                     ( +-  2.15% ) [100.00%]
 4,901,681,561,338      instructions              #    1.29  insns per cycle
                                                  #    0.59  stalled cycles per insn  ( +-  0.01% ) [100.00%]
 2,880,236,854,539      stalled-cycles-frontend   #   75.96% frontend cycles idle     ( +-  3.69% )
        18,080,707      minor-faults                                                  ( +-  1.22% ) [100.00%]
                 0      major-faults

      28.875873029 seconds time elapsed                                          ( +-  1.19% )

# usemem -t <threads> -R -f testfile -o 1g #

## 1 thread ##
### base ###
    62,357,670,556      cycles                     ( +-  0.58% ) [100.00%]
    11,612,293,315      instructions              #    0.19  insns per cycle
                                                  #   10.31  stalled cycles per insn  ( +-  0.03% ) [100.00%]
   119,734,125,613      stalled-cycles-frontend   #  192.01% frontend cycles idle     ( +-  0.58% )
           262,723      minor-faults                                                  ( +-  0.01% ) [100.00%]
                 0      major-faults

      18.247095738 seconds time elapsed                                          ( +-  0.58% )
### patched ###
    51,421,735,331      cycles                     ( +- 11.53% ) [100.00%]
    11,446,081,379      instructions              #    0.22  insns per cycle
                                                  #    8.55  stalled cycles per insn  ( +-  0.19% ) [100.00%]
    97,900,718,100      stalled-cycles-frontend   #  190.39% frontend cycles idle     ( +- 12.09% )
             8,629      minor-faults                                                  ( +-  1.20% ) [100.00%]
                 0      major-faults                                                  ( +-100.00% )

      15.003616508 seconds time elapsed                                          ( +- 11.56% )

## 8 threads ##
### base ###
   395,785,250,840      cycles                     ( +-  0.60% ) [100.00%]
    93,850,278,507      instructions              #    0.24  insns per cycle
                                                  #    8.02  stalled cycles per insn  ( +-  0.03% ) [100.00%]
   752,951,585,171      stalled-cycles-frontend   #  190.24% frontend cycles idle     ( +-  0.61% )
         2,097,988      minor-faults                                                  ( +-  0.00% ) [100.00%]
                 0      major-faults

      15.226716773 seconds time elapsed                                          ( +-  0.41% )
### patched ###
   369,032,514,042      cycles                     ( +-  0.42% ) [100.00%]
    92,357,503,760      instructions              #    0.25  insns per cycle
                                                  #    7.59  stalled cycles per insn  ( +-  0.01% ) [100.00%]
   700,555,940,601      stalled-cycles-frontend   #  189.84% frontend cycles idle     ( +-  0.44% )
            89,186      minor-faults                                                  ( +-  1.88% ) [100.00%]
                 0      major-faults

      14.239478275 seconds time elapsed                                          ( +-  0.48% )

## 32 threads ##
### base ###
 1,791,516,702,606      cycles                     ( +-  1.57% ) [100.00%]
   369,936,392,907      instructions              #    0.21  insns per cycle
                                                  #    9.07  stalled cycles per insn  ( +-  0.07% ) [100.00%]
 3,357,017,162,189      stalled-cycles-frontend   #  187.38% frontend cycles idle     ( +-  1.77% )
         8,390,395      minor-faults                                                  ( +-  0.00% ) [100.00%]
                 0      major-faults

      22.080582106 seconds time elapsed                                          ( +-  9.50% )
### patched ###
 1,517,335,667,819      cycles                     ( +-  0.63% ) [100.00%]
   363,401,817,392      instructions              #    0.24  insns per cycle
                                                  #    7.45  stalled cycles per insn  ( +-  0.01% ) [100.00%]
 2,705,627,050,641      stalled-cycles-frontend   #  178.31% frontend cycles idle     ( +-  1.15% )
           269,287      minor-faults                                                  ( +-  0.16% ) [100.00%]
                 0      major-faults

      16.797777162 seconds time elapsed                                          ( +-  7.69% )

## 60 threads ##
### base ###
 3,983,862,013,747      cycles                     ( +-  1.42% ) [100.00%]
   693,310,458,766      instructions              #    0.17  insns per cycle
                                                  #   10.75  stalled cycles per insn  ( +-  0.01% ) [100.00%]
 7,454,138,580,163      stalled-cycles-frontend   #  187.11% frontend cycles idle     ( +-  1.27% )
        15,735,826      minor-faults                                                  ( +-  0.02% ) [100.00%]
                 1      major-faults                                                  ( +-100.00% )

      25.759900660 seconds time elapsed                                          ( +-  0.64% )
### patched ###
 2,871,988,372,901      cycles                     ( +-  1.17% ) [100.00%]
   681,452,765,745      instructions              #    0.24  insns per cycle
                                                  #    7.61  stalled cycles per insn  ( +-  0.03% ) [100.00%]
 5,186,017,674,808      stalled-cycles-frontend   #  180.57% frontend cycles idle     ( +-  1.24% )
           498,198      minor-faults                                                  ( +-  0.05% ) [100.00%]
                 0      major-faults

      22.768611526 seconds time elapsed                                          ( +-  3.13% )

## 120 threads ##
### base ###
14,891,491,866,813      cycles                     ( +-  0.83% ) [100.00%]
 1,391,880,325,881      instructions              #    0.09  insns per cycle
                                                  #   10.51  stalled cycles per insn  ( +-  0.02% ) [100.00%]
14,629,621,106,104      stalled-cycles-frontend   #   98.24% frontend cycles idle     ( +-  1.04% )
        31,468,057      minor-faults                                                  ( +-  0.00% ) [100.00%]
                 0      major-faults

      44.443953873 seconds time elapsed                                          ( +-  0.91% )
### patched ###
12,450,645,304,818      cycles                     ( +-  2.94% ) [100.00%]
 1,367,989,003,638      instructions              #    0.11  insns per cycle
                                                  #    8.98  stalled cycles per insn  ( +-  0.04% ) [100.00%]
12,282,274,614,247      stalled-cycles-frontend   #   98.65% frontend cycles idle     ( +-  3.34% )
           996,361      minor-faults                                                  ( +-  0.02% ) [100.00%]
                 0      major-faults

      38.197095497 seconds time elapsed                                          ( +-  2.

Kirill A. Shutemov (2):
  mm: introduce vm_ops->fault_nonblock()
  mm: implement ->fault_nonblock() for page cache

 Documentation/filesystems/Locking |  8 ++++++++
 fs/9p/vfs_file.c                  |  2 ++
 fs/btrfs/file.c                   |  1 +
 fs/cifs/file.c                    |  1 +
 fs/ext4/file.c                    |  1 +
 fs/f2fs/file.c                    |  1 +
 fs/fuse/file.c                    |  1 +
 fs/gfs2/file.c                    |  1 +
 fs/nfs/file.c                     |  1 +
 fs/nilfs2/file.c                  |  1 +
 fs/ubifs/file.c                   |  1 +
 fs/xfs/xfs_file.c                 |  1 +
 include/linux/mm.h                |  3 +++
 mm/filemap.c                      | 35 +++++++++++++++++++++++++++++++++++
 mm/memory.c                       | 38 +++++++++++++++++++++++++++++++++++++-
 15 files changed, 95 insertions(+), 1 deletion(-)

-- 
1.9.0.rc3


* [PATCH 1/2] mm: introduce vm_ops->fault_nonblock()
  2014-02-17 18:38 ` Kirill A. Shutemov
@ 2014-02-17 18:38   ` Kirill A. Shutemov
From: Kirill A. Shutemov @ 2014-02-17 18:38 UTC
  To: Linus Torvalds, Andrew Morton, Mel Gorman, Rik van Riel
  Cc: Andi Kleen, Matthew Wilcox, Dave Hansen, Alexander Viro,
	Dave Chinner, linux-mm, linux-fsdevel, linux-kernel,
	Kirill A. Shutemov

The patch introduces a new vm_ops callback ->fault_nonblock() and uses
it to map easily accessible pages around the fault address.

On a read page fault, if the filesystem provides ->fault_nonblock(), we
try to map up to FAULT_AROUND_PAGES (32 at the moment) pages around the
page fault address, in the hope of reducing the number of minor page
faults.
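
For illustration, assuming 4k pages (PAGE_SHIFT == 12), the window is
the naturally aligned 32-page (128k) block containing the fault
address, clipped to the VMA boundaries:

	/* FAULT_AROUND_MASK == ~((1UL << 17) - 1) == ~0x1ffff, so a
	 * fault at 0x40003000 gives
	 *   start_addr = 0x40003000 & ~0x1ffff = 0x40000000
	 * and the loop walks 32 ptes covering 0x40000000..0x4001ffff,
	 * with off = (0x40003000 - 0x40000000) >> PAGE_SHIFT = 3.
	 */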

Signed-off-by: Kirill A. Shutemov <kirill.shutemov@linux.intel.com>
---
 Documentation/filesystems/Locking |  8 ++++++++
 include/linux/mm.h                |  3 +++
 mm/memory.c                       | 38 +++++++++++++++++++++++++++++++++++++-
 3 files changed, 48 insertions(+), 1 deletion(-)

diff --git a/Documentation/filesystems/Locking b/Documentation/filesystems/Locking
index 5b0c083d7c0e..11506b97e3b7 100644
--- a/Documentation/filesystems/Locking
+++ b/Documentation/filesystems/Locking
@@ -525,6 +525,7 @@ locking rules:
 open:		yes
 close:		yes
 fault:		yes		can return with page locked
+fault_nonblock:	yes		must return with page locked
 page_mkwrite:	yes		can return with page locked
 access:		yes
 
@@ -536,6 +537,13 @@ the page, then ensure it is not already truncated (the page lock will block
 subsequent truncate), and then return with VM_FAULT_LOCKED, and the page
 locked. The VM will unlock the page.
 
+	->fault_nonblock() is called when the VM tries to map easily
+accessible pages. The filesystem must find and return the page associated
+with the passed-in "pgoff" in the vm_fault structure. If it's not possible
+to return a page without blocking, NULL should be returned. The page must
+be locked and the filesystem must ensure it is not truncated. The VM will
+unlock the page. ->fault_nonblock() is called with the page table locked.
+
 	->page_mkwrite() is called when a previously read-only pte is
 about to become writeable. The filesystem again must ensure that there are
 no truncate/invalidate races, and then return with the page locked. If
diff --git a/include/linux/mm.h b/include/linux/mm.h
index f28f46eade6a..b9a688dbd62a 100644
--- a/include/linux/mm.h
+++ b/include/linux/mm.h
@@ -221,6 +221,8 @@ struct vm_operations_struct {
 	void (*open)(struct vm_area_struct * area);
 	void (*close)(struct vm_area_struct * area);
 	int (*fault)(struct vm_area_struct *vma, struct vm_fault *vmf);
+	void (*fault_nonblock)(struct vm_area_struct *vma,
+			struct vm_fault *vmf);
 
 	/* notification that a previously read-only page is about to become
 	 * writable, if an error is returned it will cause a SIGBUS */
@@ -1810,6 +1812,7 @@ extern void truncate_inode_pages_range(struct address_space *,
 
 /* generic vm_area_ops exported for stackable file systems */
 extern int filemap_fault(struct vm_area_struct *, struct vm_fault *);
+extern void filemap_fault_nonblock(struct vm_area_struct *, struct vm_fault *);
 extern int filemap_page_mkwrite(struct vm_area_struct *vma, struct vm_fault *vmf);
 
 /* mm/page-writeback.c */
diff --git a/mm/memory.c b/mm/memory.c
index 7f52c46ef1e1..f4990fb66770 100644
--- a/mm/memory.c
+++ b/mm/memory.c
@@ -3342,6 +3342,39 @@ static void do_set_pte(struct vm_area_struct *vma, unsigned long address,
 	update_mmu_cache(vma, address, pte);
 }
 
+#define FAULT_AROUND_ORDER 5
+#define FAULT_AROUND_PAGES (1UL << FAULT_AROUND_ORDER)
+#define FAULT_AROUND_MASK ~((1UL << (PAGE_SHIFT + FAULT_AROUND_ORDER)) - 1)
+
+static void do_fault_around(struct vm_area_struct *vma, unsigned long address,
+		pte_t *pte, pgoff_t pgoff, unsigned int flags)
+{
+	struct vm_fault vmf;
+	unsigned long start_addr = address & FAULT_AROUND_MASK;
+	int off = (address - start_addr) >> PAGE_SHIFT;
+	int i;
+
+	for (i = 0; i < FAULT_AROUND_PAGES; i++) {
+		unsigned long addr = start_addr + i * PAGE_SIZE;
+		pte_t *_pte = pte - off + i;
+
+		if (!pte_none(*_pte))
+			continue;
+		if (addr < vma->vm_start || addr >= vma->vm_end)
+			continue;
+
+		vmf.virtual_address = (void __user *) addr;
+		vmf.pgoff = pgoff - off + i;
+		vmf.flags = flags;
+		vmf.page = NULL;
+		vma->vm_ops->fault_nonblock(vma, &vmf);
+		if (!vmf.page)
+			continue;
+		do_set_pte(vma, addr, vmf.page, _pte, false, false);
+		unlock_page(vmf.page);
+	}
+}
+
 static int do_read_fault(struct mm_struct *mm, struct vm_area_struct *vma,
 		unsigned long address, pmd_t *pmd,
 		pgoff_t pgoff, unsigned int flags, pte_t orig_pte)
@@ -3363,8 +3396,11 @@ static int do_read_fault(struct mm_struct *mm, struct vm_area_struct *vma,
 		return ret;
 	}
 	do_set_pte(vma, address, fault_page, pte, false, false);
-	pte_unmap_unlock(pte, ptl);
 	unlock_page(fault_page);
+
+	if (vma->vm_ops->fault_nonblock)
+		do_fault_around(vma, address, pte, pgoff, flags);
+	pte_unmap_unlock(pte, ptl);
 	return ret;
 }
 
-- 
1.9.0.rc3


* [PATCH 2/2] mm: implement ->fault_nonblock() for page cache
  2014-02-17 18:38 ` Kirill A. Shutemov
@ 2014-02-17 18:38   ` Kirill A. Shutemov
From: Kirill A. Shutemov @ 2014-02-17 18:38 UTC
  To: Linus Torvalds, Andrew Morton, Mel Gorman, Rik van Riel
  Cc: Andi Kleen, Matthew Wilcox, Dave Hansen, Alexander Viro,
	Dave Chinner, linux-mm, linux-fsdevel, linux-kernel,
	Kirill A. Shutemov

filemap_fault_nonblock() is a generic implementation of
->fault_nonblock() for filesystems that use the page cache.

It should be safe to use filemap_fault_nonblock() for
->fault_nonblock() if the filesystem uses filemap_fault() for
->fault().
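
For such a filesystem the wiring is a one-line addition next to the
existing .fault hook, as in the hunks below (the example_* names here
are placeholders):

	static const struct vm_operations_struct example_file_vm_ops = {
		.fault		= filemap_fault,
		.fault_nonblock	= filemap_fault_nonblock,
		.page_mkwrite	= example_page_mkwrite,
		.remap_pages	= generic_file_remap_pages,
	};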

Signed-off-by: Kirill A. Shutemov <kirill.shutemov@linux.intel.com>
---
 fs/9p/vfs_file.c  |  2 ++
 fs/btrfs/file.c   |  1 +
 fs/cifs/file.c    |  1 +
 fs/ext4/file.c    |  1 +
 fs/f2fs/file.c    |  1 +
 fs/fuse/file.c    |  1 +
 fs/gfs2/file.c    |  1 +
 fs/nfs/file.c     |  1 +
 fs/nilfs2/file.c  |  1 +
 fs/ubifs/file.c   |  1 +
 fs/xfs/xfs_file.c |  1 +
 mm/filemap.c      | 35 +++++++++++++++++++++++++++++++++++
 12 files changed, 47 insertions(+)

diff --git a/fs/9p/vfs_file.c b/fs/9p/vfs_file.c
index a16b0ff497ca..a7f7e41bec37 100644
--- a/fs/9p/vfs_file.c
+++ b/fs/9p/vfs_file.c
@@ -832,6 +832,7 @@ static void v9fs_mmap_vm_close(struct vm_area_struct *vma)
 
 static const struct vm_operations_struct v9fs_file_vm_ops = {
 	.fault = filemap_fault,
+	.fault_nonblock = filemap_fault_nonblock,
 	.page_mkwrite = v9fs_vm_page_mkwrite,
 	.remap_pages = generic_file_remap_pages,
 };
@@ -839,6 +840,7 @@ static const struct vm_operations_struct v9fs_file_vm_ops = {
 static const struct vm_operations_struct v9fs_mmap_file_vm_ops = {
 	.close = v9fs_mmap_vm_close,
 	.fault = filemap_fault,
+	.fault_nonblock = filemap_fault_nonblock,
 	.page_mkwrite = v9fs_vm_page_mkwrite,
 	.remap_pages = generic_file_remap_pages,
 };
diff --git a/fs/btrfs/file.c b/fs/btrfs/file.c
index 0165b8672f09..13523a63e1f3 100644
--- a/fs/btrfs/file.c
+++ b/fs/btrfs/file.c
@@ -1993,6 +1993,7 @@ out:
 
 static const struct vm_operations_struct btrfs_file_vm_ops = {
 	.fault		= filemap_fault,
+	.fault_nonblock	= filemap_fault_nonblock,
 	.page_mkwrite	= btrfs_page_mkwrite,
 	.remap_pages	= generic_file_remap_pages,
 };
diff --git a/fs/cifs/file.c b/fs/cifs/file.c
index 755584684f6c..71aff75e067c 100644
--- a/fs/cifs/file.c
+++ b/fs/cifs/file.c
@@ -3094,6 +3094,7 @@ cifs_page_mkwrite(struct vm_area_struct *vma, struct vm_fault *vmf)
 
 static struct vm_operations_struct cifs_file_vm_ops = {
 	.fault = filemap_fault,
+	.fault_nonblock = filemap_fault_nonblock,
 	.page_mkwrite = cifs_page_mkwrite,
 	.remap_pages = generic_file_remap_pages,
 };
diff --git a/fs/ext4/file.c b/fs/ext4/file.c
index 1a5073959f32..182ae5543a1d 100644
--- a/fs/ext4/file.c
+++ b/fs/ext4/file.c
@@ -200,6 +200,7 @@ ext4_file_write(struct kiocb *iocb, const struct iovec *iov,
 
 static const struct vm_operations_struct ext4_file_vm_ops = {
 	.fault		= filemap_fault,
+	.fault_nonblock	= filemap_fault_nonblock,
 	.page_mkwrite   = ext4_page_mkwrite,
 	.remap_pages	= generic_file_remap_pages,
 };
diff --git a/fs/f2fs/file.c b/fs/f2fs/file.c
index 0dfcef53a6ed..7c48fd2eb99c 100644
--- a/fs/f2fs/file.c
+++ b/fs/f2fs/file.c
@@ -84,6 +84,7 @@ out:
 
 static const struct vm_operations_struct f2fs_file_vm_ops = {
 	.fault		= filemap_fault,
+	.fault_nonblock	= filemap_fault_nonblock,
 	.page_mkwrite	= f2fs_vm_page_mkwrite,
 	.remap_pages	= generic_file_remap_pages,
 };
diff --git a/fs/fuse/file.c b/fs/fuse/file.c
index 77bcc303c3ae..e95e52ec7bc2 100644
--- a/fs/fuse/file.c
+++ b/fs/fuse/file.c
@@ -1940,6 +1940,7 @@ static int fuse_page_mkwrite(struct vm_area_struct *vma, struct vm_fault *vmf)
 static const struct vm_operations_struct fuse_file_vm_ops = {
 	.close		= fuse_vma_close,
 	.fault		= filemap_fault,
+	.fault_nonblock	= filemap_fault_nonblock,
 	.page_mkwrite	= fuse_page_mkwrite,
 	.remap_pages	= generic_file_remap_pages,
 };
diff --git a/fs/gfs2/file.c b/fs/gfs2/file.c
index efc078f0ee4e..7c4b2f096ac8 100644
--- a/fs/gfs2/file.c
+++ b/fs/gfs2/file.c
@@ -494,6 +494,7 @@ out:
 
 static const struct vm_operations_struct gfs2_vm_ops = {
 	.fault = filemap_fault,
+	.fault_nonblock = filemap_fault_nonblock,
 	.page_mkwrite = gfs2_page_mkwrite,
 	.remap_pages = generic_file_remap_pages,
 };
diff --git a/fs/nfs/file.c b/fs/nfs/file.c
index 5bb790a69c71..8fbe80168d1f 100644
--- a/fs/nfs/file.c
+++ b/fs/nfs/file.c
@@ -617,6 +617,7 @@ out:
 
 static const struct vm_operations_struct nfs_file_vm_ops = {
 	.fault = filemap_fault,
+	.fault_nonblock = filemap_fault_nonblock,
 	.page_mkwrite = nfs_vm_page_mkwrite,
 	.remap_pages = generic_file_remap_pages,
 };
diff --git a/fs/nilfs2/file.c b/fs/nilfs2/file.c
index 08fdb77852ac..adc4aa07d7d8 100644
--- a/fs/nilfs2/file.c
+++ b/fs/nilfs2/file.c
@@ -134,6 +134,7 @@ static int nilfs_page_mkwrite(struct vm_area_struct *vma, struct vm_fault *vmf)
 
 static const struct vm_operations_struct nilfs_file_vm_ops = {
 	.fault		= filemap_fault,
+	.fault_nonblock	= filemap_fault_nonblock,
 	.page_mkwrite	= nilfs_page_mkwrite,
 	.remap_pages	= generic_file_remap_pages,
 };
diff --git a/fs/ubifs/file.c b/fs/ubifs/file.c
index 123c79b7261e..f27c4c401a3f 100644
--- a/fs/ubifs/file.c
+++ b/fs/ubifs/file.c
@@ -1538,6 +1538,7 @@ out_unlock:
 
 static const struct vm_operations_struct ubifs_file_vm_ops = {
 	.fault        = filemap_fault,
+	.fault_nonblock = filemap_fault_nonblock,
 	.page_mkwrite = ubifs_vm_page_mkwrite,
 	.remap_pages = generic_file_remap_pages,
 };
diff --git a/fs/xfs/xfs_file.c b/fs/xfs/xfs_file.c
index 64b48eade91d..bc619150c960 100644
--- a/fs/xfs/xfs_file.c
+++ b/fs/xfs/xfs_file.c
@@ -1465,6 +1465,7 @@ const struct file_operations xfs_dir_file_operations = {
 
 static const struct vm_operations_struct xfs_file_vm_ops = {
 	.fault		= filemap_fault,
+	.fault_nonblock	= filemap_fault_nonblock,
 	.page_mkwrite	= xfs_vm_page_mkwrite,
 	.remap_pages	= generic_file_remap_pages,
 };
diff --git a/mm/filemap.c b/mm/filemap.c
index 7a13f6ac5421..7b7c9c600544 100644
--- a/mm/filemap.c
+++ b/mm/filemap.c
@@ -1726,6 +1726,40 @@ page_not_uptodate:
 }
 EXPORT_SYMBOL(filemap_fault);
 
+void filemap_fault_nonblock(struct vm_area_struct *vma, struct vm_fault *vmf)
+{
+	struct file *file = vma->vm_file;
+	struct address_space *mapping = file->f_mapping;
+	pgoff_t size;
+	struct page *page;
+
+	page = find_get_page(mapping, vmf->pgoff);
+	if (!page)
+		return;
+	if (PageReadahead(page) || PageHWPoison(page))
+		goto put;
+	if (!trylock_page(page))
+		goto put;
+	/* Truncated? */
+	if (unlikely(page->mapping != mapping))
+		goto unlock;
+	if (unlikely(!PageUptodate(page)))
+		goto unlock;
+	size = (i_size_read(mapping->host) + PAGE_CACHE_SIZE - 1)
+		>> PAGE_CACHE_SHIFT;
+	if (unlikely(page->index >= size))
+		goto unlock;
+	if (file->f_ra.mmap_miss > 0)
+		file->f_ra.mmap_miss--;
+	vmf->page = page;
+	return;
+unlock:
+	unlock_page(page);
+put:
+	put_page(page);
+}
+EXPORT_SYMBOL(filemap_fault_nonblock);
+
 int filemap_page_mkwrite(struct vm_area_struct *vma, struct vm_fault *vmf)
 {
 	struct page *page = vmf->page;
@@ -1755,6 +1789,7 @@ EXPORT_SYMBOL(filemap_page_mkwrite);
 
 const struct vm_operations_struct generic_file_vm_ops = {
 	.fault		= filemap_fault,
+	.fault_nonblock	= filemap_fault_nonblock,
 	.page_mkwrite	= filemap_page_mkwrite,
 	.remap_pages	= generic_file_remap_pages,
 };
-- 
1.9.0.rc3


^ permalink raw reply related	[flat|nested] 37+ messages in thread

* Re: [RFC, PATCHv2 0/2] mm: map few pages around fault address if they are in page cache
  2014-02-17 18:38 ` Kirill A. Shutemov
@ 2014-02-17 19:01   ` Linus Torvalds
  -1 siblings, 0 replies; 37+ messages in thread
From: Linus Torvalds @ 2014-02-17 19:01 UTC (permalink / raw)
  To: Kirill A. Shutemov
  Cc: Andrew Morton, Mel Gorman, Rik van Riel, Andi Kleen,
	Matthew Wilcox, Dave Hansen, Alexander Viro, Dave Chinner,
	linux-mm, linux-fsdevel, Linux Kernel Mailing List

On Mon, Feb 17, 2014 at 10:38 AM, Kirill A. Shutemov
<kirill.shutemov@linux.intel.com> wrote:
>
> Now we have ->fault_nonblock() to ask filesystem for a page, if it's
> reachable without blocking. We request one page a time. It's not terribly
> efficient and I will probably re-think the interface once again to expose
> iterator or something...

Hmm. Yeah, clearly this isn't working, since the real workloads all
end up looking like

>        115,493,976      minor-faults                                                  ( +-  0.00% ) [100.00%]
>       59.686645587 seconds time elapsed                                          ( +-  0.30% )
 becomes
>         47,428,068      minor-faults                                                  ( +-  0.00% ) [100.00%]
>       60.241766430 seconds time elapsed                                          ( +-  0.85% )

and

>        268,039,365      minor-faults                                                 [100.00%]
>      132.830612471 seconds time elapsed
becomes
>        193,550,437      minor-faults                                                 [100.00%]
>      132.851823758 seconds time elapsed

and

>          4,967,540      minor-faults                                                  ( +-  0.06% ) [100.00%]
>       27.215434226 seconds time elapsed                                          ( +-  0.18% )
becomes
>          2,285,563      minor-faults                                                  ( +-  0.26% ) [100.00%]
>       27.292854546 seconds time elapsed                                          ( +-  0.29% )

ie it shows a clear reduction in faults, but the added costs clearly
eat up any wins and it all becomes (just _slightly_) slower.

Sad.

I do wonder if we really need to lock the pages we fault in. We lock
them in order to test for being up-to-date and still mapped. The
up-to-date check we don't really need to worry about: that we can test
without locking by just reading "page->flags" atomically and verifying
that it's uptodate and not locked.

The other reasons to lock the page are:

 - for anonymous pages we need the lock for rmap, so the VM generally
always locks the page. But that's not an issue for file-backed pages:
the "rmap" for a filebacked page is just the page mapcount and the
cgroup statistics, and those don't need the page lock.

 - the whole truncation/unmapping thing

So the complex part is racing with truncate/unmapping the page. But
since we hold the page table lock, I *think* what we should be able to
do is:

 - increment the page _mapcount (iow, do "page_add_file_rmap()"
early). This guarantees that any *subsequent* unmap activity on this
page will walk the file mapping lists, and become serialized by the
page table lock we hold.

 - mb_after_atomic_inc() (this is generally free)

 - test that the page is still unlocked and uptodate, and the page
mapping still points to our page.

 - if that is true, we're all good, we can use the page, otherwise we
decrement the mapcount (page_remove_rmap()) and skip the page.

Hmm? Doing something like this means that we would never lock the
pages we prefault, and you can go back to your gang lookup rather than
that "one page at a time". And the race case is basically never going
to trigger.
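
In rough, untested pseudo-C, the per-page step would look something
like this (do_set_pte() being the existing pte-install helper in
mm/memory.c; the barrier name is approximate):

        page_add_file_rmap(page);       /* early: serializes later unmaps */
        smp_mb__after_atomic_inc();

        flags = ACCESS_ONCE(page->flags);
        if ((flags & (1UL << PG_locked)) ||
            !(flags & (1UL << PG_uptodate)) ||
            page->mapping != mapping) {
                /* lost the race with truncate/unmap: back out */
                page_remove_rmap(page);
        } else {
                do_set_pte(vma, addr, page, pte, false, false);
        }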

Comments?

                  Linus

^ permalink raw reply	[flat|nested] 37+ messages in thread

* Re: [RFC, PATCHv2 0/2] mm: map few pages around fault address if they are in page cache
  2014-02-17 19:01   ` Linus Torvalds
@ 2014-02-17 19:49     ` Kirill A. Shutemov
  -1 siblings, 0 replies; 37+ messages in thread
From: Kirill A. Shutemov @ 2014-02-17 19:49 UTC (permalink / raw)
  To: Linus Torvalds
  Cc: Kirill A. Shutemov, Andrew Morton, Mel Gorman, Rik van Riel,
	Andi Kleen, Matthew Wilcox, Dave Hansen, Alexander Viro,
	Dave Chinner, linux-mm, linux-fsdevel, Linux Kernel Mailing List

On Mon, Feb 17, 2014 at 11:01:58AM -0800, Linus Torvalds wrote:
> On Mon, Feb 17, 2014 at 10:38 AM, Kirill A. Shutemov
> <kirill.shutemov@linux.intel.com> wrote:
> >
> > Now we have ->fault_nonblock() to ask filesystem for a page, if it's
> > reachable without blocking. We request one page a time. It's not terribly
> > efficient and I will probably re-think the interface once again to expose
> > iterator or something...
> 
> Hmm. Yeah, clearly this isn't working, since the real workloads all
> end up looking like
> 
> >        115,493,976      minor-faults                                                  ( +-  0.00% ) [100.00%]
> >       59.686645587 seconds time elapsed                                          ( +-  0.30% )
>  becomes
> >         47,428,068      minor-faults                                                  ( +-  0.00% ) [100.00%]
> >       60.241766430 seconds time elapsed                                          ( +-  0.85% )
> 
> and
> 
> >        268,039,365      minor-faults                                                 [100.00%]
> >      132.830612471 seconds time elapsed
> becomes
> >        193,550,437      minor-faults                                                 [100.00%]
> >      132.851823758 seconds time elapsed
> 
> and
> 
> >          4,967,540      minor-faults                                                  ( +-  0.06% ) [100.00%]
> >       27.215434226 seconds time elapsed                                          ( +-  0.18% )
> becomes
> >          2,285,563      minor-faults                                                  ( +-  0.26% ) [100.00%]
> >       27.292854546 seconds time elapsed                                          ( +-  0.29% )
> 
> ie it shows a clear reduction in faults, but the added costs clearly
> eat up any wins and it all becomes (just _slightly_) slower.
> 
> Sad.
> 
> I do wonder if we really need to lock the pages we fault in. We lock
> them in order to test for being up-to-date and still mapped. The
> up-to-date check we don't really need to worry about: that we can test
> without locking by just reading "page->flags" atomically and verifying
> that it's uptodate and not locked.
> 
> The other reason to lock the page is:
> 
>  - for anonymous pages we need the lock for rmap, so the VM generally
> always locks the page. But that's not an issue for file-backed pages:
> the "rmap" for a filebacked page is just the page mapcount and the
> cgroup statistics, and those don't need the page lock.
> 
>  - the whole truncation/unmapping thing
> 
> So the complex part is racing with truncate/unmapping the page. But
> since we hold the page table lock, I *think* what we should be able to
> do is:
> 
>  - increment the page _mapcount (iow, do "page_add_file_rmap()"
> early). This guarantees that any *subsequent* unmap activity on this
> page will walk the file mapping lists, and become serialized by the
> page table lock we hold.
> 
>  - mb_after_atomic_inc() (this is generally free)
> 
>  - test that the page is still unlocked and uptodate, and the page
> mapping still points to our page.
> 
>  - if that is true, we're all good, we can use the page, otherwise we
> decrement the mapcount (page_remove_rmap()) and skip the page.
> 
> Hmm? Doing something like this means that we would never lock the
> pages we prefault, and you can go back to your gang lookup rather than
> that "one page at a time". And the race case is basically never going
> to trigger.
> 
> Comments?

Sounds reasonable to me. I'll take a closer look tomorrow.

But it could be safer to keep locking in place and reduce lookup cost by
exposing something like ->fault_iter_init() and ->fault_iter_next(). It
will still return one page a time, but it will keep radix-tree context
around for cheaper next-page lookup.
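
Something like this, just to sketch the shape of the interface (names
made up):

        struct fault_iter {
                struct radix_tree_iter  iter;
                void                    **slot;
        };

        /* in vm_operations_struct: */
        void (*fault_iter_init)(struct vm_area_struct *vma,
                        struct vm_fault *vmf, struct fault_iter *fi);
        struct page *(*fault_iter_next)(struct fault_iter *fi);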

-- 
 Kirill A. Shutemov

^ permalink raw reply	[flat|nested] 37+ messages in thread

* Re: [RFC, PATCHv2 0/2] mm: map few pages around fault address if they are in page cache
  2014-02-17 19:49     ` Kirill A. Shutemov
@ 2014-02-17 20:24       ` Linus Torvalds
  -1 siblings, 0 replies; 37+ messages in thread
From: Linus Torvalds @ 2014-02-17 20:24 UTC (permalink / raw)
  To: Kirill A. Shutemov
  Cc: Kirill A. Shutemov, Andrew Morton, Mel Gorman, Rik van Riel,
	Andi Kleen, Matthew Wilcox, Dave Hansen, Alexander Viro,
	Dave Chinner, linux-mm, linux-fsdevel, Linux Kernel Mailing List

On Mon, Feb 17, 2014 at 11:49 AM, Kirill A. Shutemov
<kirill@shutemov.name> wrote:
>
> But it could be safer to keep locking in place and reduce lookup cost by
> exposing something like ->fault_iter_init() and ->fault_iter_next(). It
> will still return one page a time, but it will keep radix-tree context
> around for cheaper next-page lookup.

I really would prefer for the loop to be much smaller than that, and
not contain indirect calls to helpers that pretty much guarantee that
you can't generate nice code.

Plus I'd rather not have the mm layer know too much about the radix
tree iterations anyway, and try to use the existing page array
functions we already have (ie "find_get_pages()").

So I'd really prefer if we can do this with tight loops over explicit
pages, rather than some loop over an iterator.
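
Something like this (hand-wavy, untested; 'max_pgoff' as in your
patch):

        struct page *pages[FAULT_AROUND_PAGES];
        int i, nr;

        nr = find_get_pages(mapping, vmf->pgoff, FAULT_AROUND_PAGES, pages);
        for (i = 0; i < nr; i++) {
                if (pages[i]->index > max_pgoff)
                        break;
                /* validate flags/mapping and set the pte right here,
                 * with no indirect calls inside the loop */
        }
        release_pages(pages, nr, 0);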

              Linus

^ permalink raw reply	[flat|nested] 37+ messages in thread

* Re: [RFC, PATCHv2 0/2] mm: map few pages around fault address if they are in page cache
  2014-02-17 19:01   ` Linus Torvalds
@ 2014-02-18 13:28     ` Rik van Riel
  -1 siblings, 0 replies; 37+ messages in thread
From: Rik van Riel @ 2014-02-18 13:28 UTC (permalink / raw)
  To: Linus Torvalds, Kirill A. Shutemov
  Cc: Andrew Morton, Mel Gorman, Andi Kleen, Matthew Wilcox,
	Dave Hansen, Alexander Viro, Dave Chinner, linux-mm,
	linux-fsdevel, Linux Kernel Mailing List

On 02/17/2014 02:01 PM, Linus Torvalds wrote:

>  - increment the page _mapcount (iow, do "page_add_file_rmap()"
> early). This guarantees that any *subsequent* unmap activity on this
> page will walk the file mapping lists, and become serialized by the
> page table lock we hold.
> 
>  - mb_after_atomic_inc() (this is generally free)
> 
>  - test that the page is still unlocked and uptodate, and the page
> mapping still points to our page.
> 
>  - if that is true, we're all good, we can use the page, otherwise we
> decrement the mapcount (page_remove_rmap()) and skip the page.
> 
> Hmm? Doing something like this means that we would never lock the
> pages we prefault, and you can go back to your gang lookup rather than
> that "one page at a time". And the race case is basically never going
> to trigger.
> 
> Comments?

What would the direct io code do when it runs into a page with
elevated mapcount, but for which a mapping cannot be found yet?

Looking at the code, it looks like the above scheme could cause
some trouble with invalidate_inode_pages2_range(), which has
the following sequence:

                        if (page_mapped(page)) {
				... unmap page
			}
                        BUG_ON(page_mapped(page));

In other words, it looks like incrementing _mapcount first could
lead to an oops in the truncate and direct IO code.

The page lock is used to prevent such races.
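
For instance, a hypothetical interleaving:

        CPU A (prefault, holds ptl)     CPU B (invalidate)
        ---------------------------     ------------------------------
                                        page_mapped(page) == false,
                                        so nothing gets unmapped
        page_add_file_rmap(page)
                                        BUG_ON(page_mapped(page)) fires
        recheck fails, back out:
        page_remove_rmap(page)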

*sigh*

-- 
All rights reversed

^ permalink raw reply	[flat|nested] 37+ messages in thread

* RE: [RFC, PATCHv2 0/2] mm: map few pages around fault address if they are in page cache
  2014-02-18 13:28     ` Rik van Riel
@ 2014-02-18 14:15       ` Wilcox, Matthew R
  -1 siblings, 0 replies; 37+ messages in thread
From: Wilcox, Matthew R @ 2014-02-18 14:15 UTC (permalink / raw)
  To: Rik van Riel, Linus Torvalds, Kirill A. Shutemov
  Cc: Andrew Morton, Mel Gorman, Andi Kleen, Dave Hansen,
	Alexander Viro, Dave Chinner, linux-mm, linux-fsdevel,
	Linux Kernel Mailing List

We don't really need to lock all the pages being returned to protect against truncate.  We only need to lock the one at the highest index, and check i_size while that lock is held since truncate_inode_pages_range() will block on any page that is locked.

We're still vulnerable to holepunches, but there's no locking currently between holepunches and truncate, so we're no worse off than we are now.
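
Roughly (untested; 'pages' being the gang-lookup result):

        /* lock only the page at the highest index */
        last = pages[nr - 1];
        if (trylock_page(last)) {
                size = (i_size_read(inode) + PAGE_CACHE_SIZE - 1)
                        >> PAGE_CACHE_SHIFT;
                if (last->mapping == mapping && last->index < size) {
                        /* truncate_inode_pages_range() will block on
                         * 'last', so the lower-index pages cannot be
                         * truncated under us while we map them */
                        ... map pages[0..nr-1] ...
                }
                unlock_page(last);
        }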
________________________________________
From: Rik van Riel [riel@redhat.com]
Sent: February 18, 2014 5:28 AM
To: Linus Torvalds; Kirill A. Shutemov
Cc: Andrew Morton; Mel Gorman; Andi Kleen; Wilcox, Matthew R; Dave Hansen; Alexander Viro; Dave Chinner; linux-mm; linux-fsdevel; Linux Kernel Mailing List
Subject: Re: [RFC, PATCHv2 0/2] mm: map few pages around fault address if they are in page cache

On 02/17/2014 02:01 PM, Linus Torvalds wrote:

>  - increment the page _mapcount (iow, do "page_add_file_rmap()"
> early). This guarantees that any *subsequent* unmap activity on this
> page will walk the file mapping lists, and become serialized by the
> page table lock we hold.
>
>  - mb_after_atomic_inc() (this is generally free)
>
>  - test that the page is still unlocked and uptodate, and the page
> mapping still points to our page.
>
>  - if that is true, we're all good, we can use the page, otherwise we
> decrement the mapcount (page_remove_rmap()) and skip the page.
>
> Hmm? Doing something like this means that we would never lock the
> pages we prefault, and you can go back to your gang lookup rather than
> that "one page at a time". And the race case is basically never going
> to trigger.
>
> Comments?

What would the direct io code do when it runs into a page with
elevated mapcount, but for which a mapping cannot be found yet?

Looking at the code, it looks like the above scheme could cause
some trouble with invalidate_inode_pages2_range(), which has
the following sequence:

                        if (page_mapped(page)) {
                                ... unmap page
                        }
                        BUG_ON(page_mapped(page));

In other words, it looks like incrementing _mapcount first could
lead to an oops in the truncate and direct IO code.

The page lock is used to prevent such races.

*sigh*

--
All rights reversed

^ permalink raw reply	[flat|nested] 37+ messages in thread

* Re: [RFC, PATCHv2 0/2] mm: map few pages around fault address if they are in page cache
  2014-02-18 13:28     ` Rik van Riel
@ 2014-02-18 14:23       ` Kirill A. Shutemov
  -1 siblings, 0 replies; 37+ messages in thread
From: Kirill A. Shutemov @ 2014-02-18 14:23 UTC (permalink / raw)
  To: Rik van Riel
  Cc: Linus Torvalds, Kirill A. Shutemov, Andrew Morton, Mel Gorman,
	Andi Kleen, Matthew Wilcox, Dave Hansen, Alexander Viro,
	Dave Chinner, linux-mm, linux-fsdevel, Linux Kernel Mailing List

On Tue, Feb 18, 2014 at 08:28:02AM -0500, Rik van Riel wrote:
> On 02/17/2014 02:01 PM, Linus Torvalds wrote:
> 
> >  - increment the page _mapcount (iow, do "page_add_file_rmap()"
> > early). This guarantees that any *subsequent* unmap activity on this
> > page will walk the file mapping lists, and become serialized by the
> > page table lock we hold.
> > 
> >  - mb_after_atomic_inc() (this is generally free)
> > 
> >  - test that the page is still unlocked and uptodate, and the page
> > mapping still points to our page.
> > 
> >  - if that is true, we're all good, we can use the page, otherwise we
> > decrement the mapcount (page_remove_rmap()) and skip the page.
> > 
> > Hmm? Doing something like this means that we would never lock the
> > pages we prefault, and you can go back to your gang lookup rather than
> > that "one page at a time". And the race case is basically never going
> > to trigger.
> > 
> > Comments?
> 
> What would the direct io code do when it runs into a page with
> elevated mapcount, but for which a mapping cannot be found yet?
> 
> Looking at the code, it looks like the above scheme could cause
> some trouble with invalidate_inode_pages2_range(), which has
> the following sequence:
> 
>                         if (page_mapped(page)) {
> 				... unmap page
> 			}
>                         BUG_ON(page_mapped(page));
> 
> In other words, it looks like incrementing _mapcount first could
> lead to an oops in the truncate and direct IO code.
> 
> The page lock is used to prevent such races.
> 
> *sigh*

What if we retry the unmap once again before triggering BUG()?
The second unmap will be serialized by the page table lock, right?
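
I.e. something like this in invalidate_inode_pages2_range() (untested):

        if (page_mapped(page)) {
                /* ... unmap page ... */
        }
        if (page_mapped(page)) {
                /*
                 * A racing prefault either backs out or completes
                 * under its page table lock, so a second unmap pass
                 * is serialized against it.
                 */
                /* ... unmap page again ... */
        }
        BUG_ON(page_mapped(page));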

-- 
 Kirill A. Shutemov

^ permalink raw reply	[flat|nested] 37+ messages in thread

* Re: [RFC, PATCHv2 0/2] mm: map few pages around fault address if they are in page cache
  2014-02-18 13:28     ` Rik van Riel
@ 2014-02-18 17:51       ` Linus Torvalds
  -1 siblings, 0 replies; 37+ messages in thread
From: Linus Torvalds @ 2014-02-18 17:51 UTC (permalink / raw)
  To: Rik van Riel
  Cc: Kirill A. Shutemov, Andrew Morton, Mel Gorman, Andi Kleen,
	Matthew Wilcox, Dave Hansen, Alexander Viro, Dave Chinner,
	linux-mm, linux-fsdevel, Linux Kernel Mailing List

On Tue, Feb 18, 2014 at 5:28 AM, Rik van Riel <riel@redhat.com> wrote:
>
> What would the direct io code do when it runs into a page with
> elevated mapcount, but for which a mapping cannot be found yet?

Actually, you cannot get into that situation, since the definition of
"found" is that you have to follow the page tables (remember: this is
a *file* mapping, not an anonymous one, so you don't actually have an
rmap list, you have the inode mapping list).

And since we hold the page table lock, you cannot actually get to the
point where you see that it's not mapped yet. See?

That said:

> Looking at the code, it looks like the above scheme could cause
> some trouble with invalidate_inode_pages2_range(), which has
> the following sequence:
>
>                         if (page_mapped(page)) {
>                                 ... unmap page
>                         }
>                         BUG_ON(page_mapped(page));

The BUG_ON() itself could trigger, because it could race with us
optimistically trying to increment the page mapcount. And yes, we
might have to remove that.

But the actual "unmap page" logic should not be able to ever see any
difference.

See my argument?

              Linus

^ permalink raw reply	[flat|nested] 37+ messages in thread

* Re: [RFC, PATCHv2 0/2] mm: map few pages around fault address if they are in page cache
  2014-02-17 19:01   ` Linus Torvalds
@ 2014-02-18 17:59     ` Kirill A. Shutemov
  -1 siblings, 0 replies; 37+ messages in thread
From: Kirill A. Shutemov @ 2014-02-18 17:59 UTC (permalink / raw)
  To: Linus Torvalds
  Cc: Kirill A. Shutemov, Andrew Morton, Mel Gorman, Rik van Riel,
	Andi Kleen, Matthew Wilcox, Dave Hansen, Alexander Viro,
	Dave Chinner, linux-mm, linux-fsdevel, Linux Kernel Mailing List

Linus Torvalds wrote:
> On Mon, Feb 17, 2014 at 10:38 AM, Kirill A. Shutemov
> <kirill.shutemov@linux.intel.com> wrote:
> >
> > Now we have ->fault_nonblock() to ask filesystem for a page, if it's
> > reachable without blocking. We request one page a time. It's not terribly
> > efficient and I will probably re-think the interface once again to expose
> > iterator or something...
> 
> Hmm. Yeah, clearly this isn't working, since the real workloads all
> end up looking like
> 
> >        115,493,976      minor-faults                                                  ( +-  0.00% ) [100.00%]
> >       59.686645587 seconds time elapsed                                          ( +-  0.30% )
>  becomes
> >         47,428,068      minor-faults                                                  ( +-  0.00% ) [100.00%]
> >       60.241766430 seconds time elapsed                                          ( +-  0.85% )
> 
> and
> 
> >        268,039,365      minor-faults                                                 [100.00%]
> >      132.830612471 seconds time elapsed
> becomes
> >        193,550,437      minor-faults                                                 [100.00%]
> >      132.851823758 seconds time elapsed
> 
> and
> 
> >          4,967,540      minor-faults                                                  ( +-  0.06% ) [100.00%]
> >       27.215434226 seconds time elapsed                                          ( +-  0.18% )
> becomes
> >          2,285,563      minor-faults                                                  ( +-  0.26% ) [100.00%]
> >       27.292854546 seconds time elapsed                                          ( +-  0.29% )
> 
> ie it shows a clear reduction in faults, but the added costs clearly
> eat up any wins and it all becomes (just _slightly_) slower.

I did an experiment with setting up the pte directly in filemap_fault_nonblock() to
see how much we can get from it. And it helps:

git:		-1.21s
clean build:	-2.22s
rebuild:	-0.63s

Is it a layering violation to set up the pte directly in ->fault_nonblock()?

perf stat and patch below.

Git test-suite make -j60 test:
 1,591,184,058,944      cycles                     ( +-  0.05% ) [100.00%]
   811,200,260,823      instructions              #    0.51  insns per cycle
                                                  #    3.24  stalled cycles per insn  ( +-  0.19% ) [100.00%]
 2,631,511,271,429      stalled-cycles-frontend   #  165.38% frontend cycles idle     ( +-  0.08% )
        47,305,697      minor-faults                                                  ( +-  0.00% ) [100.00%]
                 1      major-faults

      59.028360009 seconds time elapsed                                          ( +-  0.58% )

Run make -j60 on clean allmodconfig kernel tree:
19,163,958,689,310      cycles                    [100.00%]
17,446,888,861,177      instructions              #    0.91  insns per cycle
                                                  #    1.53  stalled cycles per insn [100.00%]
26,777,884,033,091      stalled-cycles-frontend   #  139.73% frontend cycles idle
       193,118,569      minor-faults                                                 [100.00%]
                 0      major-faults

     130.631767214 seconds time elapsed

Run make -j60 on already built allmodconfig kernel tree:
   282,398,537,719      cycles                     ( +-  0.03% ) [100.00%]
   385,807,937,931      instructions              #    1.37  insns per cycle
                                                  #    0.95  stalled cycles per insn  ( +-  0.01% ) [100.00%]
   365,940,576,310      stalled-cycles-frontend   #  129.58% frontend cycles idle     ( +-  0.07% )
         2,254,887      minor-faults                                                  ( +-  0.02% ) [100.00%]
                 0      major-faults

      26.660708754 seconds time elapsed                                          ( +-  0.29% )

diff --git a/include/linux/mm.h b/include/linux/mm.h
index b9a688dbd62a..e671dd5abe27 100644
--- a/include/linux/mm.h
+++ b/include/linux/mm.h
@@ -221,8 +221,8 @@ struct vm_operations_struct {
 	void (*open)(struct vm_area_struct * area);
 	void (*close)(struct vm_area_struct * area);
 	int (*fault)(struct vm_area_struct *vma, struct vm_fault *vmf);
-	void (*fault_nonblock)(struct vm_area_struct *vma,
-			struct vm_fault *vmf);
+	int (*fault_nonblock)(struct vm_area_struct *vma, struct vm_fault *vmf,
+		pgoff_t max_pgoff, int nr_pages, pte_t *pte);
 
 	/* notification that a previously read-only page is about to become
 	 * writable, if an error is returned it will cause a SIGBUS */
@@ -1812,7 +1812,8 @@ extern void truncate_inode_pages_range(struct address_space *,
 
 /* generic vm_area_ops exported for stackable file systems */
 extern int filemap_fault(struct vm_area_struct *, struct vm_fault *);
-extern void filemap_fault_nonblock(struct vm_area_struct *, struct vm_fault *);
+int filemap_fault_nonblock(struct vm_area_struct *vma, struct vm_fault *vmf,
+		pgoff_t max_pgoff, int nr_pages, pte_t *pte);
 extern int filemap_page_mkwrite(struct vm_area_struct *vma, struct vm_fault *vmf);
 
 /* mm/page-writeback.c */
diff --git a/mm/filemap.c b/mm/filemap.c
index 7b7c9c600544..0a8884efbcd8 100644
--- a/mm/filemap.c
+++ b/mm/filemap.c
@@ -33,6 +33,7 @@
 #include <linux/hardirq.h> /* for BUG_ON(!in_atomic()) only */
 #include <linux/memcontrol.h>
 #include <linux/cleancache.h>
+#include <linux/rmap.h>
 #include "internal.h"
 
 #define CREATE_TRACE_POINTS
@@ -134,6 +135,9 @@ void __delete_from_page_cache(struct page *page)
 	__dec_zone_page_state(page, NR_FILE_PAGES);
 	if (PageSwapBacked(page))
 		__dec_zone_page_state(page, NR_SHMEM);
+	if (page_mapped(page)) {
+		dump_page(page, "");
+	}
 	BUG_ON(page_mapped(page));
 
 	/*
@@ -1726,37 +1730,90 @@ page_not_uptodate:
 }
 EXPORT_SYMBOL(filemap_fault);
 
-void filemap_fault_nonblock(struct vm_area_struct *vma, struct vm_fault *vmf)
+void do_set_pte(struct vm_area_struct *vma, unsigned long address,
+		struct page *page, pte_t *pte, bool write, bool anon);
* Re: [RFC, PATCHv2 0/2] mm: map few pages around fault address if they are in page cache
@ 2014-02-18 17:59     ` Kirill A. Shutemov
  0 siblings, 0 replies; 37+ messages in thread
From: Kirill A. Shutemov @ 2014-02-18 17:59 UTC (permalink / raw)
  To: Linus Torvalds
  Cc: Kirill A. Shutemov, Andrew Morton, Mel Gorman, Rik van Riel,
	Andi Kleen, Matthew Wilcox, Dave Hansen, Alexander Viro,
	Dave Chinner, linux-mm, linux-fsdevel, Linux Kernel Mailing List

Linus Torvalds wrote:
> On Mon, Feb 17, 2014 at 10:38 AM, Kirill A. Shutemov
> <kirill.shutemov@linux.intel.com> wrote:
> >
> > Now we have ->fault_nonblock() to ask filesystem for a page, if it's
> > reachable without blocking. We request one page a time. It's not terribly
> > efficient and I will probably re-think the interface once again to expose
> > iterator or something...
> 
> Hmm. Yeah, clearly this isn't working, since the real workloads all
> end up looking like
> 
> >        115,493,976      minor-faults                                                  ( +-  0.00% ) [100.00%]
> >       59.686645587 seconds time elapsed                                          ( +-  0.30% )
>  becomes
> >         47,428,068      minor-faults                                                  ( +-  0.00% ) [100.00%]
> >       60.241766430 seconds time elapsed                                          ( +-  0.85% )
> 
> and
> 
> >        268,039,365      minor-faults                                                 [100.00%]
> >      132.830612471 seconds time elapsed
> becomes
> >        193,550,437      minor-faults                                                 [100.00%]
> >      132.851823758 seconds time elapsed
> 
> and
> 
> >          4,967,540      minor-faults                                                  ( +-  0.06% ) [100.00%]
> >       27.215434226 seconds time elapsed                                          ( +-  0.18% )
> becomes
> >          2,285,563      minor-faults                                                  ( +-  0.26% ) [100.00%]
> >       27.292854546 seconds time elapsed                                          ( +-  0.29% )
> 
> ie it shows a clear reduction in faults, but the added costs clearly
> eat up any wins and it all becomes (just _slightly_) slower.

I did an experiment with setting up the PTE directly in
filemap_fault_nonblock() to see how much we can get from it. And it helps:

git:		-1.21s
clean build:	-2.22s
rebuild:	-0.63s
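
For a quick cross-check of those fault counts outside perf, here is a
minimal userspace sketch (hypothetical, not part of the patch; it assumes
a non-empty ./testfile) that reports the minor faults taken while touching
an mmap'ed file one byte per page:

/*
 * Hypothetical helper, not part of the patch: report minor faults
 * taken while touching each page of an mmap'ed file exactly once.
 */
#include <stdio.h>
#include <fcntl.h>
#include <unistd.h>
#include <sys/mman.h>
#include <sys/stat.h>
#include <sys/resource.h>

int main(void)
{
	struct rusage before, after;
	struct stat st;
	volatile char sink = 0;
	long off, page = sysconf(_SC_PAGESIZE);
	int fd = open("testfile", O_RDONLY);
	char *p;

	if (fd < 0 || fstat(fd, &st) < 0 || st.st_size == 0)
		return 1;
	p = mmap(NULL, st.st_size, PROT_READ, MAP_PRIVATE, fd, 0);
	if (p == MAP_FAILED)
		return 1;
	getrusage(RUSAGE_SELF, &before);
	for (off = 0; off < st.st_size; off += page)
		sink += p[off];		/* at most one fault per page */
	getrusage(RUSAGE_SELF, &after);
	printf("minor faults: %ld\n", after.ru_minflt - before.ru_minflt);
	return 0;
}

With fault-around in effect, a cold sequential pass should report roughly
one minor fault per FAULT_AROUND_PAGES pages instead of one per page.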

Is it a layering violation to set up the PTE directly in ->fault_nonblock()?

perf stat and patch below.

Git test-suite make -j60 test:
 1,591,184,058,944      cycles                     ( +-  0.05% ) [100.00%]
   811,200,260,823      instructions              #    0.51  insns per cycle
                                                  #    3.24  stalled cycles per insn  ( +-  0.19% ) [100.00%]
 2,631,511,271,429      stalled-cycles-frontend   #  165.38% frontend cycles idle     ( +-  0.08% )
        47,305,697      minor-faults                                                  ( +-  0.00% ) [100.00%]
                 1      major-faults

      59.028360009 seconds time elapsed                                          ( +-  0.58% )

Run make -j60 on clean allmodconfig kernel tree:
19,163,958,689,310      cycles                    [100.00%]
17,446,888,861,177      instructions              #    0.91  insns per cycle
                                                  #    1.53  stalled cycles per insn [100.00%]
26,777,884,033,091      stalled-cycles-frontend   #  139.73% frontend cycles idle
       193,118,569      minor-faults                                                 [100.00%]
                 0      major-faults

     130.631767214 seconds time elapsed

Run make -j60 on already built allmodconfig kernel tree:
   282,398,537,719      cycles                     ( +-  0.03% ) [100.00%]
   385,807,937,931      instructions              #    1.37  insns per cycle
                                                  #    0.95  stalled cycles per insn  ( +-  0.01% ) [100.00%]
   365,940,576,310      stalled-cycles-frontend   #  129.58% frontend cycles idle     ( +-  0.07% )
         2,254,887      minor-faults                                                  ( +-  0.02% ) [100.00%]
                 0      major-faults

      26.660708754 seconds time elapsed                                          ( +-  0.29% )

diff --git a/include/linux/mm.h b/include/linux/mm.h
index b9a688dbd62a..e671dd5abe27 100644
--- a/include/linux/mm.h
+++ b/include/linux/mm.h
@@ -221,8 +221,8 @@ struct vm_operations_struct {
 	void (*open)(struct vm_area_struct * area);
 	void (*close)(struct vm_area_struct * area);
 	int (*fault)(struct vm_area_struct *vma, struct vm_fault *vmf);
-	void (*fault_nonblock)(struct vm_area_struct *vma,
-			struct vm_fault *vmf);
+	int (*fault_nonblock)(struct vm_area_struct *vma, struct vm_fault *vmf,
+		pgoff_t max_pgoff, int nr_pages, pte_t *pte);
 
 	/* notification that a previously read-only page is about to become
 	 * writable, if an error is returned it will cause a SIGBUS */
@@ -1812,7 +1812,8 @@ extern void truncate_inode_pages_range(struct address_space *,
 
 /* generic vm_area_ops exported for stackable file systems */
 extern int filemap_fault(struct vm_area_struct *, struct vm_fault *);
-extern void filemap_fault_nonblock(struct vm_area_struct *, struct vm_fault *);
+int filemap_fault_nonblock(struct vm_area_struct *vma, struct vm_fault *vmf,
+		pgoff_t max_pgoff, int nr_pages, pte_t *pte);
 extern int filemap_page_mkwrite(struct vm_area_struct *vma, struct vm_fault *vmf);
 
 /* mm/page-writeback.c */
diff --git a/mm/filemap.c b/mm/filemap.c
index 7b7c9c600544..0a8884efbcd8 100644
--- a/mm/filemap.c
+++ b/mm/filemap.c
@@ -33,6 +33,7 @@
 #include <linux/hardirq.h> /* for BUG_ON(!in_atomic()) only */
 #include <linux/memcontrol.h>
 #include <linux/cleancache.h>
+#include <linux/rmap.h>
 #include "internal.h"
 
 #define CREATE_TRACE_POINTS
@@ -134,6 +135,9 @@ void __delete_from_page_cache(struct page *page)
 	__dec_zone_page_state(page, NR_FILE_PAGES);
 	if (PageSwapBacked(page))
 		__dec_zone_page_state(page, NR_SHMEM);
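+	/* Debugging aid for this RFC: dump the offending page before the
+	 * BUG_ON() below fires. */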
+	if (page_mapped(page)) {
+		dump_page(page, "");
+	}
 	BUG_ON(page_mapped(page));
 
 	/*
@@ -1726,37 +1730,90 @@ page_not_uptodate:
 }
 EXPORT_SYMBOL(filemap_fault);
 
-void filemap_fault_nonblock(struct vm_area_struct *vma, struct vm_fault *vmf)
+void do_set_pte(struct vm_area_struct *vma, unsigned long address,
+		struct page *page, pte_t *pte, bool write, bool anon);
+int filemap_fault_nonblock(struct vm_area_struct *vma, struct vm_fault *vmf,
+		pgoff_t max_pgoff, int nr_pages, pte_t *pte)
 {
+	struct radix_tree_iter iter;
+	void **slot;
 	struct file *file = vma->vm_file;
 	struct address_space *mapping = file->f_mapping;
 	pgoff_t size;
 	struct page *page;
+	unsigned long address = (unsigned long) vmf->virtual_address;
+	unsigned long addr;
+	pte_t *_pte;
+	int ret = 0;
 
-	page = find_get_page(mapping, vmf->pgoff);
-	if (!page)
-		return;
-	if (PageReadahead(page) || PageHWPoison(page))
-		goto put;
-	if (!trylock_page(page))
-		goto put;
-	/* Truncated? */
-	if (unlikely(page->mapping != mapping))
-		goto unlock;
-	if (unlikely(!PageUptodate(page)))
-		goto unlock;
-	size = (i_size_read(mapping->host) + PAGE_CACHE_SIZE - 1)
-		>> PAGE_CACHE_SHIFT;
-	if (unlikely(page->index >= size))
-		goto unlock;
-	if (file->f_ra.mmap_miss > 0)
-		file->f_ra.mmap_miss--;
-	vmf->page = page;
-	return;
+	rcu_read_lock();
+restart:
+	radix_tree_for_each_slot(slot, &mapping->page_tree, &iter, vmf->pgoff) {
+repeat:
+		page = radix_tree_deref_slot(slot);
+
+		if (radix_tree_exception(page)) {
+			if (radix_tree_deref_retry(page)) {
+				/*
+				 * Transient condition which can only trigger
+				 * when entry at index 0 moves out of or back
+				 * to root: none yet gotten, safe to restart.
+				 */
+				WARN_ON(iter.index);
+				goto restart;
+			}
+			/*
+			 * Otherwise, shmem/tmpfs must be storing a swap entry
+			 * here as an exceptional entry: there is no page to
+			 * map, so just skip over it.
+			 */
+			continue;
+		}
+
+		if (!page_cache_get_speculative(page))
+			goto repeat;
+
+		/* Has the page moved? */
+		if (unlikely(page != *slot)) {
+			page_cache_release(page);
+			goto repeat;
+		}
+
+		if (page->index > max_pgoff) {
+			page_cache_release(page);
+			break;
+		}
+
+		if (PageReadahead(page) || PageHWPoison(page) ||
+				!PageUptodate(page))
+			goto skip;
+		if (!trylock_page(page))
+			goto skip;
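+		/*
+		 * Recheck under the page lock: truncation clears
+		 * page->mapping with the page locked, so a mapping that
+		 * still matches here rules out a racing truncate.
+		 */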
+		if (page->mapping != mapping || !PageUptodate(page))
+			goto unlock;
+		size = (i_size_read(mapping->host) + PAGE_CACHE_SIZE - 1)
+			>> PAGE_CACHE_SHIFT;
+		if (page->index >= size)
+			goto unlock;
+		if (file->f_ra.mmap_miss > 0)
+			file->f_ra.mmap_miss--;
+		addr = address + (page->index - vmf->pgoff) * PAGE_SIZE;
+		_pte = pte + page->index - vmf->pgoff;
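+		/* A PTE populated before this fault is left untouched. */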
+		if (!pte_none(*_pte))
+			goto unlock;
+		do_set_pte(vma, addr, page, _pte, false, false);
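+		/* The speculative reference now backs the new PTE mapping,
+		 * so it is deliberately not dropped on this path. */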
+
+		unlock_page(page);
+		if (++ret == nr_pages || page->index == max_pgoff)
+			break;
+		continue;
 unlock:
-	unlock_page(page);
-put:
-	put_page(page);
+		unlock_page(page);
+skip:
+		page_cache_release(page);
+	}
+	rcu_read_unlock();
+	return ret;
 }
 EXPORT_SYMBOL(filemap_fault_nonblock);
 
diff --git a/mm/memory.c b/mm/memory.c
index f4990fb66770..1af0e3a3f0f1 100644
--- a/mm/memory.c
+++ b/mm/memory.c
@@ -3318,7 +3318,8 @@ static int __do_fault(struct vm_area_struct *vma, unsigned long address,
 	return ret;
 }
 
-static void do_set_pte(struct vm_area_struct *vma, unsigned long address,
+
+void do_set_pte(struct vm_area_struct *vma, unsigned long address,
 		struct page *page, pte_t *pte, bool write, bool anon)
 {
 	pte_t entry;
@@ -3342,37 +3343,47 @@ static void do_set_pte(struct vm_area_struct *vma, unsigned long address,
 	update_mmu_cache(vma, address, pte);
 }
 
-#define FAULT_AROUND_ORDER 5
+#define FAULT_AROUND_ORDER 3
 #define FAULT_AROUND_PAGES (1UL << FAULT_AROUND_ORDER)
 #define FAULT_AROUND_MASK ~((1UL << (PAGE_SHIFT + FAULT_AROUND_ORDER)) - 1)
 
 static void do_fault_around(struct vm_area_struct *vma, unsigned long address,
 		pte_t *pte, pgoff_t pgoff, unsigned int flags)
 {
+	unsigned long start_addr;
+	pgoff_t max_pgoff;
 	struct vm_fault vmf;
-	unsigned long start_addr = address & FAULT_AROUND_MASK;
-	int off = (address - start_addr) >> PAGE_SHIFT;
-	int i;
-
-	for (i = 0; i < FAULT_AROUND_PAGES; i++) {
-		unsigned long addr = start_addr + i * PAGE_SIZE;
-		pte_t *_pte = pte - off +i;
+	int off, ret;
 
-		if (!pte_none(*_pte))
-			continue;
-		if (addr < vma->vm_start || addr >= vma->vm_end)
-			continue;
+	/* Do not cross vma or page table border */
+	max_pgoff = min(pgoff - pte_index(address) + PTRS_PER_PTE - 1,
+			vma_pages(vma) + vma->vm_pgoff - 1);
 
-		vmf.virtual_address = (void __user *) addr;
-		vmf.pgoff = pgoff - off + i;
-		vmf.flags = flags;
-		vmf.page = NULL;
-		vma->vm_ops->fault_nonblock(vma, &vmf);
-		if (!vmf.page)
-			continue;
-		do_set_pte(vma, addr, vmf.page, _pte, false, false);
-		unlock_page(vmf.page);
+	start_addr = max(address & FAULT_AROUND_MASK, vma->vm_start);
+	BUG_ON((start_addr & PMD_MASK) != (address & PMD_MASK));
+	off = pte_index(start_addr) - pte_index(address);
+	pte += off;
+	pgoff += off;
+
+	/*
+	 * Skip over PTEs that are already populated; give up if we reach
+	 * the vma or page table border before finding an empty one.
+	 */
+	while (!pte_none(*pte)) {
+		pte++;
+		pgoff++;
+		start_addr += PAGE_SIZE;
+		/* Do not cross vma or page table border */
+		if (!pte_index(start_addr) || start_addr >= vma->vm_end)
+			return;
+		BUG_ON((start_addr & PMD_MASK) != (address & PMD_MASK));
 	}
+
+	vmf.virtual_address = (void __user *) start_addr;
+	vmf.pgoff = pgoff;
+	vmf.flags = flags;
+	ret = vma->vm_ops->fault_nonblock(vma, &vmf,
+			max_pgoff, FAULT_AROUND_PAGES, pte);
 }
 
 static int do_read_fault(struct mm_struct *mm, struct vm_area_struct *vma,
-- 
 Kirill A. Shutemov

^ permalink raw reply related	[flat|nested] 37+ messages in thread

* Re: [RFC, PATCHv2 0/2] mm: map few pages around fault address if they are in page cache
  2014-02-18 14:15       ` Wilcox, Matthew R
@ 2014-02-18 18:02         ` Linus Torvalds
  -1 siblings, 0 replies; 37+ messages in thread
From: Linus Torvalds @ 2014-02-18 18:02 UTC (permalink / raw)
  To: Wilcox, Matthew R
  Cc: Rik van Riel, Kirill A. Shutemov, Andrew Morton, Mel Gorman,
	Andi Kleen, Dave Hansen, Alexander Viro, Dave Chinner, linux-mm,
	linux-fsdevel, Linux Kernel Mailing List

On Tue, Feb 18, 2014 at 6:15 AM, Wilcox, Matthew R
<matthew.r.wilcox@intel.com> wrote:
> We don't really need to lock all the pages being returned to protect against truncate.  We only need to lock the one at the highest index, and check i_size while that lock is held since truncate_inode_pages_range() will block on any page that is locked.
>
> We're still vulnerable to holepunches, but there's no locking currently between holepunches and truncate, so we're no worse off now.
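
In code, that scheme might look roughly like the uncompiled sketch below;
pages[], addrs[], ptes[] and nr are hypothetical placeholders for whatever
a batched page-cache lookup would hand back:

static int map_batch(struct vm_area_struct *vma,
		struct address_space *mapping, struct page **pages,
		unsigned long *addrs, pte_t **ptes, int nr)
{
	struct page *last = pages[nr - 1];
	pgoff_t size;
	int i;

	if (!trylock_page(last))
		goto drop_pages;
	/*
	 * truncate_inode_pages_range() blocks on a locked page, so the
	 * i_size sampled here cannot shrink under us while we map.
	 */
	size = (i_size_read(mapping->host) + PAGE_CACHE_SIZE - 1)
		>> PAGE_CACHE_SHIFT;
	if (last->mapping != mapping || last->index >= size)
		goto unlock_last;
	for (i = 0; i < nr; i++)
		do_set_pte(vma, addrs[i], pages[i], ptes[i], false, false);
	unlock_page(last);
	return nr;
unlock_last:
	unlock_page(last);
drop_pages:
	for (i = 0; i < nr; i++)
		page_cache_release(pages[i]);
	return 0;
}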

It's not "holepunches and truncate", it's "holepunches and page
mapping", and I do think we currently serialize the two - the whole
"check page->mapping still being non-NULL" before mapping it while
having the page locked does that.

Besides, that per-page locking should serialize against truncate too.
No, there is no "global" serialization, but there *is* exactly that
page-level serialization where both truncation and hole punching end
up making sure that the page no longer exists in the page cache and
isn't mapped.

I'm just claiming that *because* of the way rmap works for file
mappings (walking the i_mmap list and page tables), we should
actually be ok.  The anonymous rmap list is protected by the page
lock, but the file-backed rmap is protected by the pte lock (well, and
the "i_mmap_mutex" that in turn protects the i_mmap list etc).

       Linus

^ permalink raw reply	[flat|nested] 37+ messages in thread

* Re: [RFC, PATCHv2 0/2] mm: map few pages around fault address if they are in page cache
  2014-02-18 17:59     ` Kirill A. Shutemov
@ 2014-02-18 18:07       ` Kirill A. Shutemov
  -1 siblings, 0 replies; 37+ messages in thread
From: Kirill A. Shutemov @ 2014-02-18 18:07 UTC (permalink / raw)
  To: Kirill A. Shutemov
  Cc: Linus Torvalds, Kirill A. Shutemov, Andrew Morton, Mel Gorman,
	Rik van Riel, Andi Kleen, Matthew Wilcox, Dave Hansen,
	Alexander Viro, Dave Chinner, linux-mm, linux-fsdevel,
	Linux Kernel Mailing List

Kirill A. Shutemov wrote:
> Linus Torvalds wrote:
> > On Mon, Feb 17, 2014 at 10:38 AM, Kirill A. Shutemov
> > <kirill.shutemov@linux.intel.com> wrote:
> > >
> > > Now we have ->fault_nonblock() to ask filesystem for a page, if it's
> > > reachable without blocking. We request one page a time. It's not terribly
> > > efficient and I will probably re-think the interface once again to expose
> > > iterator or something...
> > 
> > Hmm. Yeah, clearly this isn't working, since the real workloads all
> > end up looking like
> > 
> > >        115,493,976      minor-faults                                                  ( +-  0.00% ) [100.00%]
> > >       59.686645587 seconds time elapsed                                          ( +-  0.30% )
> >  becomes
> > >         47,428,068      minor-faults                                                  ( +-  0.00% ) [100.00%]
> > >       60.241766430 seconds time elapsed                                          ( +-  0.85% )
> > 
> > and
> > 
> > >        268,039,365      minor-faults                                                 [100.00%]
> > >      132.830612471 seconds time elapsed
> > becomes
> > >        193,550,437      minor-faults                                                 [100.00%]
> > >      132.851823758 seconds time elapsed
> > 
> > and
> > 
> > >          4,967,540      minor-faults                                                  ( +-  0.06% ) [100.00%]
> > >       27.215434226 seconds time elapsed                                          ( +-  0.18% )
> > becomes
> > >          2,285,563      minor-faults                                                  ( +-  0.26% ) [100.00%]
> > >       27.292854546 seconds time elapsed                                          ( +-  0.29% )
> > 
> > ie it shows a clear reduction in faults, but the added costs clearly
> > eat up any wins and it all becomes (just _slightly_) slower.
> 
> I did an experiment with setting up the PTE directly in
> filemap_fault_nonblock() to see how much we can get from it. And it helps:
> 
> git:		-1.21s
> clean build:	-2.22s
> rebuild:	-0.63s
> 
> Is it a layering violation to set up the PTE directly in ->fault_nonblock()?
> 
> perf stat and patch below.
> 
> Git test-suite make -j60 test:
>  1,591,184,058,944      cycles                     ( +-  0.05% ) [100.00%]
>    811,200,260,823      instructions              #    0.51  insns per cycle
>                                                   #    3.24  stalled cycles per insn  ( +-  0.19% ) [100.00%]
>  2,631,511,271,429      stalled-cycles-frontend   #  165.38% frontend cycles idle     ( +-  0.08% )
>         47,305,697      minor-faults                                                  ( +-  0.00% ) [100.00%]
>                  1      major-faults
> 
>       59.028360009 seconds time elapsed                                          ( +-  0.58% )
> 
> Run make -j60 on clean allmodconfig kernel tree:
> 19,163,958,689,310      cycles                    [100.00%]
> 17,446,888,861,177      instructions              #    0.91  insns per cycle
>                                                   #    1.53  stalled cycles per insn [100.00%]
> 26,777,884,033,091      stalled-cycles-frontend   #  139.73% frontend cycles idle
>        193,118,569      minor-faults                                                 [100.00%]
>                  0      major-faults
> 
>      130.631767214 seconds time elapsed
> 
> Run make -j60 on already built allmodconfig kernel tree:
>    282,398,537,719      cycles                     ( +-  0.03% ) [100.00%]
>    385,807,937,931      instructions              #    1.37  insns per cycle
>                                                   #    0.95  stalled cycles per insn  ( +-  0.01% ) [100.00%]
>    365,940,576,310      stalled-cycles-frontend   #  129.58% frontend cycles idle     ( +-  0.07% )
>          2,254,887      minor-faults                                                  ( +-  0.02% ) [100.00%]
>                  0      major-faults
> 
>       26.660708754 seconds time elapsed                                          ( +-  0.29% )

Patch is wrong. Correct one is below.

diff --git a/fs/9p/vfs_file.c b/fs/9p/vfs_file.c
index a16b0ff497ca..a7f7e41bec37 100644
--- a/fs/9p/vfs_file.c
+++ b/fs/9p/vfs_file.c
@@ -832,6 +832,7 @@ static void v9fs_mmap_vm_close(struct vm_area_struct *vma)
 
 static const struct vm_operations_struct v9fs_file_vm_ops = {
 	.fault = filemap_fault,
+	.fault_nonblock = filemap_fault_nonblock,
 	.page_mkwrite = v9fs_vm_page_mkwrite,
 	.remap_pages = generic_file_remap_pages,
 };
@@ -839,6 +840,7 @@ static const struct vm_operations_struct v9fs_file_vm_ops = {
 static const struct vm_operations_struct v9fs_mmap_file_vm_ops = {
 	.close = v9fs_mmap_vm_close,
 	.fault = filemap_fault,
+	.fault_nonblock = filemap_fault_nonblock,
 	.page_mkwrite = v9fs_vm_page_mkwrite,
 	.remap_pages = generic_file_remap_pages,
 };
diff --git a/fs/btrfs/file.c b/fs/btrfs/file.c
index 0165b8672f09..13523a63e1f3 100644
--- a/fs/btrfs/file.c
+++ b/fs/btrfs/file.c
@@ -1993,6 +1993,7 @@ out:
 
 static const struct vm_operations_struct btrfs_file_vm_ops = {
 	.fault		= filemap_fault,
+	.fault_nonblock	= filemap_fault_nonblock,
 	.page_mkwrite	= btrfs_page_mkwrite,
 	.remap_pages	= generic_file_remap_pages,
 };
diff --git a/fs/cifs/file.c b/fs/cifs/file.c
index 755584684f6c..71aff75e067c 100644
--- a/fs/cifs/file.c
+++ b/fs/cifs/file.c
@@ -3094,6 +3094,7 @@ cifs_page_mkwrite(struct vm_area_struct *vma, struct vm_fault *vmf)
 
 static struct vm_operations_struct cifs_file_vm_ops = {
 	.fault = filemap_fault,
+	.fault_nonblock = filemap_fault_nonblock,
 	.page_mkwrite = cifs_page_mkwrite,
 	.remap_pages = generic_file_remap_pages,
 };
diff --git a/fs/ext4/file.c b/fs/ext4/file.c
index 1a5073959f32..182ae5543a1d 100644
--- a/fs/ext4/file.c
+++ b/fs/ext4/file.c
@@ -200,6 +200,7 @@ ext4_file_write(struct kiocb *iocb, const struct iovec *iov,
 
 static const struct vm_operations_struct ext4_file_vm_ops = {
 	.fault		= filemap_fault,
+	.fault_nonblock	= filemap_fault_nonblock,
 	.page_mkwrite   = ext4_page_mkwrite,
 	.remap_pages	= generic_file_remap_pages,
 };
diff --git a/fs/f2fs/file.c b/fs/f2fs/file.c
index 0dfcef53a6ed..7c48fd2eb99c 100644
--- a/fs/f2fs/file.c
+++ b/fs/f2fs/file.c
@@ -84,6 +84,7 @@ out:
 
 static const struct vm_operations_struct f2fs_file_vm_ops = {
 	.fault		= filemap_fault,
+	.fault_nonblock	= filemap_fault_nonblock,
 	.page_mkwrite	= f2fs_vm_page_mkwrite,
 	.remap_pages	= generic_file_remap_pages,
 };
diff --git a/fs/fuse/file.c b/fs/fuse/file.c
index 77bcc303c3ae..e95e52ec7bc2 100644
--- a/fs/fuse/file.c
+++ b/fs/fuse/file.c
@@ -1940,6 +1940,7 @@ static int fuse_page_mkwrite(struct vm_area_struct *vma, struct vm_fault *vmf)
 static const struct vm_operations_struct fuse_file_vm_ops = {
 	.close		= fuse_vma_close,
 	.fault		= filemap_fault,
+	.fault_nonblock	= filemap_fault_nonblock,
 	.page_mkwrite	= fuse_page_mkwrite,
 	.remap_pages	= generic_file_remap_pages,
 };
diff --git a/fs/gfs2/file.c b/fs/gfs2/file.c
index efc078f0ee4e..7c4b2f096ac8 100644
--- a/fs/gfs2/file.c
+++ b/fs/gfs2/file.c
@@ -494,6 +494,7 @@ out:
 
 static const struct vm_operations_struct gfs2_vm_ops = {
 	.fault = filemap_fault,
+	.fault_nonblock = filemap_fault_nonblock,
 	.page_mkwrite = gfs2_page_mkwrite,
 	.remap_pages = generic_file_remap_pages,
 };
diff --git a/fs/nfs/file.c b/fs/nfs/file.c
index 5bb790a69c71..8fbe80168d1f 100644
--- a/fs/nfs/file.c
+++ b/fs/nfs/file.c
@@ -617,6 +617,7 @@ out:
 
 static const struct vm_operations_struct nfs_file_vm_ops = {
 	.fault = filemap_fault,
+	.fault_nonblock = filemap_fault_nonblock,
 	.page_mkwrite = nfs_vm_page_mkwrite,
 	.remap_pages = generic_file_remap_pages,
 };
diff --git a/fs/nilfs2/file.c b/fs/nilfs2/file.c
index 08fdb77852ac..adc4aa07d7d8 100644
--- a/fs/nilfs2/file.c
+++ b/fs/nilfs2/file.c
@@ -134,6 +134,7 @@ static int nilfs_page_mkwrite(struct vm_area_struct *vma, struct vm_fault *vmf)
 
 static const struct vm_operations_struct nilfs_file_vm_ops = {
 	.fault		= filemap_fault,
+	.fault_nonblock	= filemap_fault_nonblock,
 	.page_mkwrite	= nilfs_page_mkwrite,
 	.remap_pages	= generic_file_remap_pages,
 };
diff --git a/fs/ubifs/file.c b/fs/ubifs/file.c
index 123c79b7261e..f27c4c401a3f 100644
--- a/fs/ubifs/file.c
+++ b/fs/ubifs/file.c
@@ -1538,6 +1538,7 @@ out_unlock:
 
 static const struct vm_operations_struct ubifs_file_vm_ops = {
 	.fault        = filemap_fault,
+	.fault_nonblock = filemap_fault_nonblock,
 	.page_mkwrite = ubifs_vm_page_mkwrite,
 	.remap_pages = generic_file_remap_pages,
 };
diff --git a/fs/xfs/xfs_file.c b/fs/xfs/xfs_file.c
index 64b48eade91d..bc619150c960 100644
--- a/fs/xfs/xfs_file.c
+++ b/fs/xfs/xfs_file.c
@@ -1465,6 +1465,7 @@ const struct file_operations xfs_dir_file_operations = {
 
 static const struct vm_operations_struct xfs_file_vm_ops = {
 	.fault		= filemap_fault,
+	.fault_nonblock	= filemap_fault_nonblock,
 	.page_mkwrite	= xfs_vm_page_mkwrite,
 	.remap_pages	= generic_file_remap_pages,
 };
diff --git a/include/linux/mm.h b/include/linux/mm.h
index f28f46eade6a..e671dd5abe27 100644
--- a/include/linux/mm.h
+++ b/include/linux/mm.h
@@ -221,6 +221,8 @@ struct vm_operations_struct {
 	void (*open)(struct vm_area_struct * area);
 	void (*close)(struct vm_area_struct * area);
 	int (*fault)(struct vm_area_struct *vma, struct vm_fault *vmf);
+	int (*fault_nonblock)(struct vm_area_struct *vma, struct vm_fault *vmf,
+		pgoff_t max_pgoff, int nr_pages, pte_t *pte);
 
 	/* notification that a previously read-only page is about to become
 	 * writable, if an error is returned it will cause a SIGBUS */
@@ -1810,6 +1812,8 @@ extern void truncate_inode_pages_range(struct address_space *,
 
 /* generic vm_area_ops exported for stackable file systems */
 extern int filemap_fault(struct vm_area_struct *, struct vm_fault *);
+int filemap_fault_nonblock(struct vm_area_struct *vma, struct vm_fault *vmf,
+		pgoff_t max_pgoff, int nr_pages, pte_t *pte);
 extern int filemap_page_mkwrite(struct vm_area_struct *vma, struct vm_fault *vmf);
 
 /* mm/page-writeback.c */
diff --git a/mm/filemap.c b/mm/filemap.c
index 7a13f6ac5421..0a8884efbcd8 100644
--- a/mm/filemap.c
+++ b/mm/filemap.c
@@ -33,6 +33,7 @@
 #include <linux/hardirq.h> /* for BUG_ON(!in_atomic()) only */
 #include <linux/memcontrol.h>
 #include <linux/cleancache.h>
+#include <linux/rmap.h>
 #include "internal.h"
 
 #define CREATE_TRACE_POINTS
@@ -1726,6 +1730,93 @@ page_not_uptodate:
 }
 EXPORT_SYMBOL(filemap_fault);
 
+void do_set_pte(struct vm_area_struct *vma, unsigned long address,
+		struct page *page, pte_t *pte, bool write, bool anon);
+int filemap_fault_nonblock(struct vm_area_struct *vma, struct vm_fault *vmf,
+		pgoff_t max_pgoff, int nr_pages, pte_t *pte)
+{
+	struct radix_tree_iter iter;
+	void **slot;
+	struct file *file = vma->vm_file;
+	struct address_space *mapping = file->f_mapping;
+	pgoff_t size;
+	struct page *page;
+	unsigned long address = (unsigned long) vmf->virtual_address;
+	unsigned long addr;
+	pte_t *_pte;
+	int ret = 0;
+
+	rcu_read_lock();
+restart:
+	radix_tree_for_each_slot(slot, &mapping->page_tree, &iter, vmf->pgoff) {
+repeat:
+		page = radix_tree_deref_slot(slot);
+
+		if (radix_tree_exception(page)) {
+			if (radix_tree_deref_retry(page)) {
+				/*
+				 * Transient condition which can only trigger
+				 * when entry at index 0 moves out of or back
+				 * to root: none yet gotten, safe to restart.
+				 */
+				WARN_ON(iter.index);
+				goto restart;
+			}
+			/*
+			 * Otherwise, shmem/tmpfs must be storing a swap entry
+			 * here as an exceptional entry: there is no page to
+			 * map, so just skip over it.
+			 */
+			continue;
+		}
+
+		if (!page_cache_get_speculative(page))
+			goto repeat;
+
+		/* Has the page moved? */
+		if (unlikely(page != *slot)) {
+			page_cache_release(page);
+			goto repeat;
+		}
+
+		if (page->index > max_pgoff) {
+			page_cache_release(page);
+			break;
+		}
+
+		if (PageReadahead(page) || PageHWPoison(page) ||
+				!PageUptodate(page))
+			goto skip;
+		if (!trylock_page(page))
+			goto skip;
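+		/*
+		 * Recheck under the page lock: truncation clears
+		 * page->mapping with the page locked, so a mapping that
+		 * still matches here rules out a racing truncate.
+		 */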
+		if (page->mapping != mapping || !PageUptodate(page))
+			goto unlock;
+		size = (i_size_read(mapping->host) + PAGE_CACHE_SIZE - 1)
+			>> PAGE_CACHE_SHIFT;
+		if (page->index >= size)
+			goto unlock;
+		if (file->f_ra.mmap_miss > 0)
+			file->f_ra.mmap_miss--;
+		addr = address + (page->index - vmf->pgoff) * PAGE_SIZE;
+		_pte = pte + page->index - vmf->pgoff;
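+		/* A PTE populated before this fault is left untouched. */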
+		if (!pte_none(*_pte))
+			goto unlock;
+		do_set_pte(vma, addr, page, _pte, false, false);
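+		/* The speculative reference now backs the new PTE mapping,
+		 * so it is deliberately not dropped on this path. */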
+
+		unlock_page(page);
+		if (++ret == nr_pages || page->index == max_pgoff)
+			break;
+		continue;
+unlock:
+		unlock_page(page);
+skip:
+		page_cache_release(page);
+	}
+	rcu_read_unlock();
+	return ret;
+}
+EXPORT_SYMBOL(filemap_fault_nonblock);
+
 int filemap_page_mkwrite(struct vm_area_struct *vma, struct vm_fault *vmf)
 {
 	struct page *page = vmf->page;
@@ -1755,6 +1846,7 @@ EXPORT_SYMBOL(filemap_page_mkwrite);
 
 const struct vm_operations_struct generic_file_vm_ops = {
 	.fault		= filemap_fault,
+	.fault_nonblock	= filemap_fault_nonblock,
 	.page_mkwrite	= filemap_page_mkwrite,
 	.remap_pages	= generic_file_remap_pages,
 };
diff --git a/mm/memory.c b/mm/memory.c
index 7f52c46ef1e1..e79c8d6f6f47 100644
--- a/mm/memory.c
+++ b/mm/memory.c
@@ -3318,7 +3318,8 @@ static int __do_fault(struct vm_area_struct *vma, unsigned long address,
 	return ret;
 }
 
-static void do_set_pte(struct vm_area_struct *vma, unsigned long address,
+
+void do_set_pte(struct vm_area_struct *vma, unsigned long address,
 		struct page *page, pte_t *pte, bool write, bool anon)
 {
 	pte_t entry;
@@ -3342,6 +3343,49 @@ static void do_set_pte(struct vm_area_struct *vma, unsigned long address,
 	update_mmu_cache(vma, address, pte);
 }
 
+#define FAULT_AROUND_ORDER 5
+#define FAULT_AROUND_PAGES (1UL << FAULT_AROUND_ORDER)
+#define FAULT_AROUND_MASK ~((1UL << (PAGE_SHIFT + FAULT_AROUND_ORDER)) - 1)
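+/*
+ * Speculatively map up to FAULT_AROUND_PAGES pages around a read fault.
+ * FAULT_AROUND_MASK rounds the fault address down to the start of the
+ * surrounding, naturally aligned window.
+ */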
+
+static void do_fault_around(struct vm_area_struct *vma, unsigned long address,
+		pte_t *pte, pgoff_t pgoff, unsigned int flags)
+{
+	unsigned long start_addr;
+	pgoff_t max_pgoff;
+	struct vm_fault vmf;
+	int off, ret;
+
+	/* Do not cross vma or page table border */
+	max_pgoff = min(pgoff - pte_index(address) + PTRS_PER_PTE - 1,
+			vma_pages(vma) + vma->vm_pgoff - 1);
+
+	start_addr = max(address & FAULT_AROUND_MASK, vma->vm_start);
+	BUG_ON((start_addr & PMD_MASK) != (address & PMD_MASK));
+	off = pte_index(start_addr) - pte_index(address);
+	pte += off;
+	pgoff += off;
+
+	/*
+	 * Skip over PTEs that are already populated; give up if we reach
+	 * the vma or page table border before finding an empty one.
+	 */
+	while (!pte_none(*pte)) {
+		pte++;
+		pgoff++;
+		start_addr += PAGE_SIZE;
+		/* Do not cross vma or page table border */
+		if (!pte_index(start_addr) || start_addr >= vma->vm_end)
+			return;
+		BUG_ON((start_addr & PMD_MASK) != (address & PMD_MASK));
+	}
+
+	vmf.virtual_address = (void __user *) start_addr;
+	vmf.pgoff = pgoff;
+	vmf.flags = flags;
+	ret = vma->vm_ops->fault_nonblock(vma, &vmf,
+			max_pgoff, FAULT_AROUND_PAGES, pte);
+}
+
 static int do_read_fault(struct mm_struct *mm, struct vm_area_struct *vma,
 		unsigned long address, pmd_t *pmd,
 		pgoff_t pgoff, unsigned int flags, pte_t orig_pte)
@@ -3363,8 +3407,11 @@ static int do_read_fault(struct mm_struct *mm, struct vm_area_struct *vma,
 		return ret;
 	}
 	do_set_pte(vma, address, fault_page, pte, false, false);
-	pte_unmap_unlock(pte, ptl);
 	unlock_page(fault_page);
+
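+	/* Fault around while still holding the pte lock; it is only
+	 * dropped by pte_unmap_unlock() below. */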
+	if (vma->vm_ops->fault_nonblock)
+		do_fault_around(vma, address, pte, pgoff, flags);
+	pte_unmap_unlock(pte, ptl);
 	return ret;
 }
 
-- 
 Kirill A. Shutemov

^ permalink raw reply related	[flat|nested] 37+ messages in thread

+	int (*fault_nonblock)(struct vm_area_struct *vma, struct vm_fault *vmf,
+		pgoff_t max_pgoff, int nr_pages, pte_t *pte);
 
 	/* notification that a previously read-only page is about to become
 	 * writable, if an error is returned it will cause a SIGBUS */
@@ -1810,6 +1812,8 @@ extern void truncate_inode_pages_range(struct address_space *,
 
 /* generic vm_area_ops exported for stackable file systems */
 extern int filemap_fault(struct vm_area_struct *, struct vm_fault *);
+int filemap_fault_nonblock(struct vm_area_struct *vma, struct vm_fault *vmf,
+		pgoff_t max_pgoff, int nr_pages, pte_t *pte);
 extern int filemap_page_mkwrite(struct vm_area_struct *vma, struct vm_fault *vmf);
 
 /* mm/page-writeback.c */
diff --git a/mm/filemap.c b/mm/filemap.c
index 7a13f6ac5421..0a8884efbcd8 100644
--- a/mm/filemap.c
+++ b/mm/filemap.c
@@ -33,6 +33,7 @@
 #include <linux/hardirq.h> /* for BUG_ON(!in_atomic()) only */
 #include <linux/memcontrol.h>
 #include <linux/cleancache.h>
+#include <linux/rmap.h>
 #include "internal.h"
 
 #define CREATE_TRACE_POINTS
@@ -1726,6 +1730,93 @@ page_not_uptodate:
 }
 EXPORT_SYMBOL(filemap_fault);
 
+void do_set_pte(struct vm_area_struct *vma, unsigned long address,
+		struct page *page, pte_t *pte, bool write, bool anon);
+int filemap_fault_nonblock(struct vm_area_struct *vma, struct vm_fault *vmf,
+		pgoff_t max_pgoff, int nr_pages, pte_t *pte)
+{
+	struct radix_tree_iter iter;
+	void **slot;
+	struct file *file = vma->vm_file;
+	struct address_space *mapping = file->f_mapping;
+	pgoff_t size;
+	struct page *page;
+	unsigned long address = (unsigned long) vmf->virtual_address;
+	unsigned long addr;
+	pte_t *_pte;
+	int ret = 0;
+
+	rcu_read_lock();
+restart:
+	radix_tree_for_each_slot(slot, &mapping->page_tree, &iter, vmf->pgoff) {
+repeat:
+		page = radix_tree_deref_slot(slot);
+
+		if (radix_tree_exception(page)) {
+			if (radix_tree_deref_retry(page)) {
+				/*
+				 * Transient condition which can only trigger
+				 * when entry at index 0 moves out of or back
+				 * to root: none yet gotten, safe to restart.
+				 */
+				WARN_ON(iter.index);
+				goto restart;
+			}
+			/*
+			 * Otherwise, shmem/tmpfs must be storing a swap entry
+			 * here as an exceptional entry: so skip over it -
+			 * we only reach this from invalidate_mapping_pages().
+			 */
+			continue;
+		}
+
+		if (!page_cache_get_speculative(page))
+			goto repeat;
+
+		/* Has the page moved? */
+		if (unlikely(page != *slot)) {
+			page_cache_release(page);
+			goto repeat;
+		}
+
+		if (page->index > max_pgoff) {
+			page_cache_release(page);
+			break;
+		}
+
+		if (PageReadahead(page) || PageHWPoison(page) ||
+				!PageUptodate(page))
+			goto skip;
+		if (!trylock_page(page))
+			goto skip;
+		if (page->mapping != mapping || !PageUptodate(page))
+			goto unlock;
+		size = (i_size_read(mapping->host) + PAGE_CACHE_SIZE - 1)
+			>> PAGE_CACHE_SHIFT;
+		if (page->index >= size)
+			goto unlock;
+		if (file->f_ra.mmap_miss > 0)
+			file->f_ra.mmap_miss--;
+		addr = address + (page->index - vmf->pgoff) * PAGE_SIZE;
+		_pte = pte + page->index - vmf->pgoff;
+		if (!pte_none(*_pte))
+			goto unlock;
+		do_set_pte(vma, addr, page, _pte, false, false);
+
+		unlock_page(page);
+		if (++ret == nr_pages || page->index == max_pgoff)
+			break;
+		continue;
+unlock:
+		unlock_page(page);
+skip:
+		page_cache_release(page);
+	}
+	rcu_read_unlock();
+	return ret;
+}
+EXPORT_SYMBOL(filemap_fault_nonblock);
+
 int filemap_page_mkwrite(struct vm_area_struct *vma, struct vm_fault *vmf)
 {
 	struct page *page = vmf->page;
@@ -1755,6 +1846,7 @@ EXPORT_SYMBOL(filemap_page_mkwrite);
 
 const struct vm_operations_struct generic_file_vm_ops = {
 	.fault		= filemap_fault,
+	.fault_nonblock	= filemap_fault_nonblock,
 	.page_mkwrite	= filemap_page_mkwrite,
 	.remap_pages	= generic_file_remap_pages,
 };
diff --git a/mm/memory.c b/mm/memory.c
index 7f52c46ef1e1..e79c8d6f6f47 100644
--- a/mm/memory.c
+++ b/mm/memory.c
@@ -3318,7 +3318,8 @@ static int __do_fault(struct vm_area_struct *vma, unsigned long address,
 	return ret;
 }
 
-static void do_set_pte(struct vm_area_struct *vma, unsigned long address,
+
+void do_set_pte(struct vm_area_struct *vma, unsigned long address,
 		struct page *page, pte_t *pte, bool write, bool anon)
 {
 	pte_t entry;
@@ -3342,6 +3343,49 @@ static void do_set_pte(struct vm_area_struct *vma, unsigned long address,
 	update_mmu_cache(vma, address, pte);
 }
 
+#define FAULT_AROUND_ORDER 5
+#define FAULT_AROUND_PAGES (1UL << FAULT_AROUND_ORDER)
+#define FAULT_AROUND_MASK ~((1UL << (PAGE_SHIFT + FAULT_AROUND_ORDER)) - 1)
+
+static void do_fault_around(struct vm_area_struct *vma, unsigned long address,
+		pte_t *pte, pgoff_t pgoff, unsigned int flags)
+{
+	unsigned long start_addr;
+	pgoff_t max_pgoff;
+	struct vm_fault vmf;
+	int off, ret;
+
+	/* Do not cross vma or page table border */
+	max_pgoff = min(pgoff - pte_index(address) + PTRS_PER_PTE - 1,
+			vma_pages(vma) + vma->vm_pgoff - 1);
+
+	start_addr = max(address & FAULT_AROUND_MASK, vma->vm_start);
+	if ((start_addr & PMD_MASK) != (address & PMD_MASK))
+		BUG();
+	off = pte_index(start_addr) - pte_index(address);
+	pte += off;
+	pgoff += off;
+
+	/* Check if it makes any sense to call ->fault_nonblock */
+	while (!pte_none(*pte)) {
+		pte++;
+		pgoff++;
+		start_addr += PAGE_SIZE;
+		/* Do not cross vma or page table border */
+		if (!pte_index(start_addr) || start_addr >= vma->vm_end)
+			return;
+		if ((start_addr & PMD_MASK) != (address & PMD_MASK))
+			BUG();
+	}
+
+
+	vmf.virtual_address = (void __user *) start_addr;
+	vmf.pgoff = pgoff;
+	vmf.flags = flags;
+	ret = vma->vm_ops->fault_nonblock(vma, &vmf,
+			max_pgoff, FAULT_AROUND_PAGES, pte);
+}
+
 static int do_read_fault(struct mm_struct *mm, struct vm_area_struct *vma,
 		unsigned long address, pmd_t *pmd,
 		pgoff_t pgoff, unsigned int flags, pte_t orig_pte)
@@ -3363,8 +3407,11 @@ static int do_read_fault(struct mm_struct *mm, struct vm_area_struct *vma,
 		return ret;
 	}
 	do_set_pte(vma, address, fault_page, pte, false, false);
-	pte_unmap_unlock(pte, ptl);
 	unlock_page(fault_page);
+
+	if (vma->vm_ops->fault_nonblock)
+		do_fault_around(vma, address, pte, pgoff, flags);
+	pte_unmap_unlock(pte, ptl);
 	return ret;
 }
 
-- 
 Kirill A. Shutemov

* Re: [RFC, PATCHv2 0/2] mm: map few pages around fault address if they are in page cache
  2014-02-18 18:07       ` Kirill A. Shutemov
@ 2014-02-18 18:28         ` Linus Torvalds
  -1 siblings, 0 replies; 37+ messages in thread
From: Linus Torvalds @ 2014-02-18 18:28 UTC (permalink / raw)
  To: Kirill A. Shutemov
  Cc: Andrew Morton, Mel Gorman, Rik van Riel, Andi Kleen,
	Matthew Wilcox, Dave Hansen, Alexander Viro, Dave Chinner,
	linux-mm, linux-fsdevel, Linux Kernel Mailing List

On Tue, Feb 18, 2014 at 10:07 AM, Kirill A. Shutemov
<kirill.shutemov@linux.intel.com> wrote:
>
> The patch is wrong. The correct one is below.

Hmm. I don't hate this. Looking through it, it's fairly simple
conceptually, and the code isn't that complex either. I can live with
this.

I think it's a bit odd how you pass both "max_pgoff" and "nr_pages" to
the fault-around function, though. In fact, I'd consider that a bug.
Passing in "FAULT_AROUND_PAGES" is just wrong, since the code cannot -
and in fact *must* not - actually fault in that many pages, since the
starting/ending address can be limited by other things.

So I think that part of the code is bogus. You need to remove
nr_pages, because any use of it is just incorrect. I don't think it
can actually matter, since the max_pgoff checks are more restrictive,
but if you think it can matter please explain how and why it wouldn't
be a major bug?
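
Something like this is what I'd expect instead - a hypothetical, untested
sketch reusing the patch's names, with the pte_none() skip loop left out
for brevity:

static void do_fault_around(struct vm_area_struct *vma, unsigned long address,
		pte_t *pte, pgoff_t pgoff, unsigned int flags)
{
	unsigned long start_addr;
	pgoff_t max_pgoff;
	struct vm_fault vmf;
	int off;

	start_addr = max(address & FAULT_AROUND_MASK, vma->vm_start);
	off = pte_index(start_addr) - pte_index(address);
	pte += off;
	pgoff += off;

	/* Bound by the page table, the vma and the fault-around window */
	max_pgoff = min3(pgoff - pte_index(start_addr) + PTRS_PER_PTE - 1,
			 vma_pages(vma) + vma->vm_pgoff - 1,
			 pgoff + FAULT_AROUND_PAGES - 1);

	vmf.virtual_address = (void __user *) start_addr;
	vmf.pgoff = pgoff;
	vmf.flags = flags;
	/* note: ->fault_nonblock() takes no nr_pages argument here */
	vma->vm_ops->fault_nonblock(vma, &vmf, max_pgoff, pte);
}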

Apart from that, I'd really like to see numbers for different ranges
of FAULT_AROUND_ORDER, because I think 5 is pretty high, but on the
whole I don't find this horrible, and you still lock the page so it
doesn't involve any new rules. I'm not hugely happy with another raw
radix-tree user, but it's not horrible.

Btw, is the "radix_tree_deref_retry(page) -> goto restart" really
necessary? I'd almost be more inclined to just make it do a
"break;" to break out of the loop and stop doing anything clever at
all.
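
Something like this, say (untested sketch):

		if (radix_tree_exception(page)) {
			/*
			 * Fault-around is purely opportunistic: give up on
			 * any exceptional entry and leave it to the regular
			 * fault path.
			 */
			break;
		}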

IOW, from a quick look there are a couple of small details I don't like
that look odd, but ...

                Linus

* Re: [RFC, PATCHv2 0/2] mm: map few pages around fault address if they are in page cache
  2014-02-18 18:02         ` Linus Torvalds
@ 2014-02-18 18:53           ` Matthew Wilcox
  -1 siblings, 0 replies; 37+ messages in thread
From: Matthew Wilcox @ 2014-02-18 18:53 UTC (permalink / raw)
  To: Linus Torvalds
  Cc: Wilcox, Matthew R, Rik van Riel, Kirill A. Shutemov,
	Andrew Morton, Mel Gorman, Andi Kleen, Dave Hansen,
	Alexander Viro, Dave Chinner, linux-mm, linux-fsdevel,
	Linux Kernel Mailing List

On Tue, Feb 18, 2014 at 10:02:26AM -0800, Linus Torvalds wrote:
> On Tue, Feb 18, 2014 at 6:15 AM, Wilcox, Matthew R
> <matthew.r.wilcox@intel.com> wrote:
> > We don't really need to lock all the pages being returned to protect
> > against truncate.  We only need to lock the one at the highest index,
> > and check i_size while that lock is held since truncate_inode_pages_range()
> > will block on any page that is locked.
> >
> > We're still vulnerable to holepunches, but there's no locking currently
> > between holepunches and truncate, so we're no worse off now.
> 
> It's not "holepunches and truncate", it's "holepunches and page
> mapping", and I do think we currently serialize the two - the whole
> "check page->mapping still being non-NULL" before mapping it while
> having the page locked does that.

Yes, I did mean "holepunches and page faults".  But here's the race I see:

Process A			Process B
ext4_fallocate()
ext4_punch_hole()
filemap_write_and_wait_range()
mutex_lock(&inode->i_mutex);
truncate_pagecache_range()
unmap_mapping_range()
				__do_fault()
				filemap_fault()
				lock_page_or_retry()
				(page->mapping == mapping at this point)
				set_pte_at()
				unlock_page()
truncate_inode_pages_range()
(now the pte is pointing at a page that
 is no longer attached to this file)
mutex_unlock(&inode->i_mutex);

Would we solve the problem by putting in a second call to
unmap_mapping_range() after calling truncate_inode_pages_range() in
truncate_pagecache_range(), like truncate_pagecache() does?
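
Sketched out, it would look something like this (untested; the real
function rounds the unmap bounds to page boundaries, which I've elided):

void truncate_pagecache_range(struct inode *inode, loff_t lstart, loff_t lend)
{
	struct address_space *mapping = inode->i_mapping;

	unmap_mapping_range(mapping, lstart, 1 + lend - lstart, 0);
	truncate_inode_pages_range(mapping, lstart, lend);
	/*
	 * Second unmap, as truncate_pagecache() does: shoot down any ptes
	 * that a racing fault set up between the first unmap and the
	 * truncate above.
	 */
	unmap_mapping_range(mapping, lstart, 1 + lend - lstart, 0);
}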

> Besides, that per-page locking should serialize against truncate too.
> No, there is no "global" serialization, but there *is* exactly that
> page-level serialization where both truncation and hole punching end
> up making sure that the page no longer exists in the page cache and
> isn't mapped.

What I'm suggesting is going back to Kirill's earlier patch, but only
locking the page with the highest index instead of all of the pages.
truncate() will block on that page and then we'll notice that some or
all of the other pages are also now past i_size and give up.
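
Roughly this, in terms of the pages[] array that the earlier patch
collected (hypothetical, since the exact structure depends on that code):

	/* Lock only the highest-index page; truncate will block on it */
	lock_page(pages[nr - 1]);
	size = (i_size_read(mapping->host) + PAGE_CACHE_SIZE - 1)
		>> PAGE_CACHE_SHIFT;
	if (pages[nr - 1]->index >= size) {
		/* Racing truncate/holepunch: drop the whole batch */
		unlock_page(pages[nr - 1]);
		nr = 0;
	}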

* Re: [RFC, PATCHv2 0/2] mm: map few pages around fault address if they are in page cache
  2014-02-18 18:53           ` Matthew Wilcox
@ 2014-02-18 19:07             ` Linus Torvalds
  -1 siblings, 0 replies; 37+ messages in thread
From: Linus Torvalds @ 2014-02-18 19:07 UTC (permalink / raw)
  To: Matthew Wilcox
  Cc: Wilcox, Matthew R, Rik van Riel, Kirill A. Shutemov,
	Andrew Morton, Mel Gorman, Andi Kleen, Dave Hansen,
	Alexander Viro, Dave Chinner, linux-mm, linux-fsdevel,
	Linux Kernel Mailing List

On Tue, Feb 18, 2014 at 10:53 AM, Matthew Wilcox <willy@linux.intel.com> wrote:
>
> Yes, I did mean "holepunches and page faults".  But here's the race I see:

Hmm. With truncate, we should be protected by i_size being changed
first (iirc - I didn't actually check), but I think you're right that
hole punching might race with a page being mapped at the same time.

> What I'm suggesting is going back to Kirill's earlier patch, but only
> locking the page with the highest index instead of all of the pages.
> truncate() will block on that page and then we'll notice that some or
> all of the other pages are also now past i_size and give up.

Actually, Kirill's latest patch seems to solve the problem with
locking - by simply never locking more than one page at a time. So I
guess it's all moot, at least wrt the page preload...

          Linus

* Re: [RFC, PATCHv2 0/2] mm: map few pages around fault address if they are in page cache
  2014-02-18 18:28         ` Linus Torvalds
@ 2014-02-18 23:57           ` Kirill A. Shutemov
  -1 siblings, 0 replies; 37+ messages in thread
From: Kirill A. Shutemov @ 2014-02-18 23:57 UTC (permalink / raw)
  To: Linus Torvalds
  Cc: Kirill A. Shutemov, Andrew Morton, Mel Gorman, Rik van Riel,
	Andi Kleen, Matthew Wilcox, Dave Hansen, Alexander Viro,
	Dave Chinner, linux-mm, linux-fsdevel, Linux Kernel Mailing List

On Tue, Feb 18, 2014 at 10:28:11AM -0800, Linus Torvalds wrote:
> On Tue, Feb 18, 2014 at 10:07 AM, Kirill A. Shutemov
> <kirill.shutemov@linux.intel.com> wrote:
> >
> > The patch is wrong. The correct one is below.
> 
> Hmm. I don't hate this. Looking through it, it's fairly simple
> conceptually, and the code isn't that complex either. I can live with
> this.
> 
> I think it's a bit odd how you pass both "max_pgoff" and "nr_pages" to
> the fault-around function, though. In fact, I'd consider that a bug.
> Passing in "FAULT_AROUND_PAGES" is just wrong, since the code cannot -
> and in fact *must* not - actually fault in that many pages, since the
> starting/ending address can be limited by other things.
> 
> So I think that part of the code is bogus. You need to remove
> nr_pages, because any use of it is just incorrect. I don't think it
> can actually matter, since the max_pgoff checks are more restrictive,
> but if you think it can matter please explain how and why it wouldn't
> be a major bug?

I don't like this either...

Currently max_pgoff is the end of the page table (or the end of the vma,
if it ends first).

If we drop nr_pages but keep the current max_pgoff, we could potentially
set up PTRS_PER_PTE pages at a time: i.e. take a page fault to the first
page of a page table while all of its pages are already in page cache.
nr_pages limits that number.

It's not necessarily a bad idea to populate the whole page table at once.
I need to measure how much latency we would add by doing that.

The only problem I see is that we would hold the ptl a bit too long. But
with split ptl it will only affect the page table we populate.

Another approach is to limit ourselves to FAULT_AROUND_PAGES from
start_addr. In that case we will sometimes do a useless radix-tree lookup
even when we had a chance to populate pages further into the page table.
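
I.e. something like this on top of the posted do_fault_around(), after
pgoff has been adjusted to start_addr (hypothetical):

	/* Never look further than FAULT_AROUND_PAGES from start_addr */
	max_pgoff = min(max_pgoff, pgoff + FAULT_AROUND_PAGES - 1);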

> Apart from that, I'd really like to see numbers for different ranges
> of FAULT_AROUND_ORDER, because I think 5 is pretty high, but on the
> whole I don't find this horrible, and you still lock the page so it
> doesn't involve any new rules. I'm not hugely happy with another raw
> radix-tree user, but it's not horrible.
> 
> Btw, is the "radix_tree_deref_retry(page) -> goto restart" really
> necessary? I'd almost be more inclined to just make it do a
> "break;" to break out of the loop and stop doing anything clever at
> all.

The code is not ready yet. I'll rework it. It's just what I had by the end
of the day. I wanted to know whether setting up ptes directly from
->fault_nonblock() is an okayish approach or considered a layering violation.

-- 
 Kirill A. Shutemov

* Re: [RFC, PATCHv2 0/2] mm: map few pages around fault address if they are in page cache
  2014-02-18 23:57           ` Kirill A. Shutemov
@ 2014-02-19  0:29             ` Linus Torvalds
  -1 siblings, 0 replies; 37+ messages in thread
From: Linus Torvalds @ 2014-02-19  0:29 UTC (permalink / raw)
  To: Kirill A. Shutemov
  Cc: Kirill A. Shutemov, Andrew Morton, Mel Gorman, Rik van Riel,
	Andi Kleen, Matthew Wilcox, Dave Hansen, Alexander Viro,
	Dave Chinner, linux-mm, linux-fsdevel, Linux Kernel Mailing List

On Tue, Feb 18, 2014 at 3:57 PM, Kirill A. Shutemov
<kirill@shutemov.name> wrote:
>
> Currently max_pgoff is the end of the page table (or the end of the vma,
> if it ends first).

Yeah, but that should be trivial to do, and limit it to FAULT_AROUND_ORDER.

> Another approach is to limit ourselves to FAULT_AROUND_PAGES from
> start_addr. In that case we will sometimes do a useless radix-tree lookup
> even when we had a chance to populate pages further into the page table.

So the reason I'd prefer to limit the whole thing to that is to not
generate too many extra cache misses. It would be lovely if we stayed
within one or two cachelines of the page table entry that we have to
modify anyway.
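
(For scale: with 8-byte ptes and 64-byte cachelines that's 8 entries per
line, so FAULT_AROUND_ORDER 3 stays within a single cacheline and order 5
can touch four.)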

But it would be really interesting to see the numbers for different
FAULT_AROUND_ORDER and perhaps different variations of this.

>> Btw, is the "radix_tree_deref_retry(page) -> goto restart" really
>> necessary? I'd almost be more inclined to just make it do a
>> "break;" to break out of the loop and stop doing anything clever at
>> all.
>
> The code is not ready yet. I'll rework it. It's just what I had by the end
> of the day. I wanted to know whether setting up ptes directly from
> ->fault_nonblock() is an okayish approach or considered a layering violation.

Ok. Maybe somebody else screams bloody murder, but considering that
you got 1%+ performance improvements (if I read your numbers right), I
think it looks quite promising, and not overly horrid.

Having some complexity and layering violation that is strictly all in
mm/filemap.c I don't see as horrid.

I would probably *not* like random drivers to start using that new
'fault_nonblock' thing, though.

                 Linus
