* [PATCHv4 00/39] Transparent huge page cache
@ 2013-05-12  1:22 ` Kirill A. Shutemov
  0 siblings, 0 replies; 243+ messages in thread
From: Kirill A. Shutemov @ 2013-05-12  1:22 UTC (permalink / raw)
  To: Andrea Arcangeli, Andrew Morton
  Cc: Al Viro, Hugh Dickins, Wu Fengguang, Jan Kara, Mel Gorman,
	linux-mm, Andi Kleen, Matthew Wilcox, Kirill A. Shutemov,
	Hillf Danton, Dave Hansen, linux-fsdevel, linux-kernel,
	Kirill A. Shutemov

From: "Kirill A. Shutemov" <kirill.shutemov@linux.intel.com>

This is version 4. You can also use the git tree:

git://git.kernel.org/pub/scm/linux/kernel/git/kas/linux.git

branch thp/pagecache.

If you want to check the changes since v3, you can look at the diff between
the tags thp/pagecache/v3 and thp/pagecache/v4-prerebase.

Intro
-----

The goal of the project is to prepare kernel infrastructure to handle huge
pages in the page cache.

To prove that the proposed changes are functional, we enable the feature
for the simplest file system -- ramfs. ramfs is not that useful by
itself, but it's a good pilot project: it gives an idea of what
performance boost we should expect on other file systems.

Design overview
---------------

Every huge page is represented in the page cache radix tree by HPAGE_PMD_NR
(512 on x86-64) entries: one entry for the head page and HPAGE_PMD_NR-1
entries for the tail pages.

Radix tree manipulations are implemented in a batched way: we add and
remove a whole huge page at once, under one tree_lock. To make this
possible, we extended the radix-tree interface to pre-allocate enough
memory to insert a number of *contiguous* elements (kudos to Matthew
Wilcox).
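
A minimal sketch of what a batched add looks like with this interface
(radix_tree_preload_count() is introduced later in the series; memcg
charging, page flags and error unwinding are omitted -- the real code
is in the rewritten add_to_page_cache_locked()):

/*
 * Simplified sketch: add one huge page as HPAGE_PMD_NR contiguous
 * radix-tree entries, the head page at 'index' and the tail pages
 * right after it.
 */
static int add_huge_page_sketch(struct address_space *mapping,
		struct page *page, pgoff_t index, gfp_t gfp_mask)
{
	int i, err;

	/* reserve enough nodes for HPAGE_PMD_NR contiguous slots */
	err = radix_tree_preload_count(HPAGE_PMD_NR,
			gfp_mask & ~__GFP_HIGHMEM);
	if (err)
		return err;

	spin_lock_irq(&mapping->tree_lock);
	for (i = 0; i < HPAGE_PMD_NR; i++) {
		err = radix_tree_insert(&mapping->page_tree,
				index + i, page + i);
		if (err)
			break;	/* the real code unwinds the slots added so far */
	}
	spin_unlock_irq(&mapping->tree_lock);
	radix_tree_preload_end();

	return err;
}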

Huge pages can be added to the page cache in two ways: write(2) to a file
or a page fault on a sparse file. Potentially, a third way is collapsing
small pages, but it's outside the initial implementation.

[ While preparing the patchset I've found one more place where we could
  allocate a huge page: read(2) on a sparse file. With the current code we
  will get 4k pages. It's okay, but not optimal. Will be fixed later. ]

File systems are the decision makers on allocating huge or small pages:
they have better visibility into whether it's useful in each particular
case.

For write(2) the decision point is the address_space's a_ops->write_begin().
For ramfs it's simple_write_begin().

For page faults, it's vm_ops->fault(): the mm core will call ->fault() with
FAULT_FLAG_TRANSHUGE if a huge page is appropriate. ->fault() can return
VM_FAULT_FALLBACK if it wants a small page instead. For ramfs, ->fault() is
filemap_fault().
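
A minimal sketch of the fault-side contract, assuming a filesystem that
prefers to opt out (the handler name is hypothetical; FAULT_FLAG_TRANSHUGE
and VM_FAULT_FALLBACK come from this series):

/*
 * Hypothetical ->fault() handler that declines huge pages: when the mm
 * core asks for one, return VM_FAULT_FALLBACK and let it retry the
 * fault with a small page.
 */
static int myfs_fault(struct vm_area_struct *vma, struct vm_fault *vmf)
{
	if (vmf->flags & FAULT_FLAG_TRANSHUGE)
		return VM_FAULT_FALLBACK;

	return filemap_fault(vma, vmf);	/* normal small-page path */
}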

Performance
-----------

The numbers I posted with v3 were too good to be true: I forgot to
disable debug options in the kernel config :-P

The test machine is a 4-socket Westmere: 4x10 cores + HT.

I've used IOzone for benchmarking. The base command is:

iozone -s 8g/$threads -t $threads -r 4 -i 0 -i 1 -i 2 -i 3

Units are KB/s. I've used the "Children see throughput" field from the
IOzone report.

Using mmap (-B option):

** Initial writers **
threads:	          1          2          4          8         10         20         30         40         50         60         70         80        120        160        200
baseline:	    1444052    3010882    6055090   11746060   14404889   25109004   28310733   29044218   29619191   29618651   29514987   29348440   29315639   29326998   29410809
patched:	    2207350    4707001    9642674   18356751   21399813   27011674   26775610   24088924   18549342   15453297   13876530   13358992   13166737   13095453   13111227
speed-up(times):       1.53       1.56       1.59       1.56       1.49       1.08       0.95       0.83       0.63       0.52       0.47       0.46       0.45       0.45       0.45

** Rewriters **
threads:	          1          2          4          8         10         20         30         40         50         60         70         80        120        160        200
baseline:	    2012192    3941325    7179208   13093224   13978721   19120624   14938912   16672082   16430882   14384357   12311291   16421748   13485785   10642142   11461610
patched:	    3106380    5822011   11657398   17109111   15498272   18507004   16960717   14877209   17498172   15317104   15470030   19190455   14758974    9242583   10548081
speed-up(times):       1.54       1.48       1.62       1.31       1.11       0.97       1.14       0.89       1.06       1.06       1.26       1.17       1.09       0.87       0.92

** Readers **
threads:	          1          2          4          8         10         20         30         40         50         60         70         80        120        160        200
baseline:	    1541551    3301643    5624206   11672717   16145085   27885416   38730976   42438132   47526802   48077097   47126201   45950491   45108567   45011088   46310317
patched:	    1800898    3582243    8062851   14418948   17587027   34938636   46653133   46561002   50396044   49525385   47731629   46594399   46424568   45357496   45258561
speed-up(times):       1.17       1.08       1.43       1.24       1.09       1.25       1.20       1.10       1.06       1.03       1.01       1.01       1.03       1.01       0.98

** Re-readers **
threads:	          1          2          4          8         10         20         30         40         50         60         70         80        120        160        200
baseline:	    1407462    3022304    5944814   12290200   15700871   27452022   38785250   45720460   47958008   48616065   47805237   45933767   45139644   44752527   45324330
patched:	    1880030    4265188    7406094   15220592   19781387   33994635   43689297   47557123   51175499   50607686   48695647   46799726   46250685   46108964   45180965
speed-up(times):       1.34       1.41       1.25       1.24       1.26       1.24       1.13       1.04       1.07       1.04       1.02       1.02       1.02       1.03       1.00

** Reverse readers **
threads:	          1          2          4          8         10         20         30         40         50         60         70         80        120        160        200
baseline:	    1790475    3547606    6639853   14323339   17029576   30420579   39954056   44082873   45397731   45956797   46861276   46149824   44356709   43789684   44961204
patched:	    1848356    3470499    7270728   15685450   19329038   33186403   43574373   48972628   47398951   48588366   48233477   46959725   46383543   43998385   45272745
speed-up(times):       1.03       0.98       1.10       1.10       1.14       1.09       1.09       1.11       1.04       1.06       1.03       1.02       1.05       1.00       1.01

** Random_readers **
threads:	          1          2          4          8         10         20         30         40         50         60         70         80        120        160        200
baseline:	    1098140    2549558    4625359    9248630   11764863   22648276   32809857   37617500   39028665   41283083   41886214   44448720   43535904   43481063   44041363
patched:	    1893732    4034810    8218138   15051324   24400039   35208044   41339655   48233519   51046118   47613022   46427129   45893974   45190367   45158010   45944107
speed-up(times):       1.72       1.58       1.78       1.63       2.07       1.55       1.26       1.28       1.31       1.15       1.11       1.03       1.04       1.04       1.04

** Random_writers **
threads:	          1          2          4          8         10         20         30         40         50         60         70         80        120        160        200
baseline:	    1366232    2863721    5714268   10615938   12711800   18768227   19430964   19895410   19108420   19666818   19189895   19666578   18953431   18712664   18676119
patched:	    3308906    6093588   11885456   21035728   21744093   21940402   20155000   20800063   21107088   20821950   21369886   21324576   21019851   20418478   20547713
speed-up(times):       2.42       2.13       2.08       1.98       1.71       1.17       1.04       1.05       1.10       1.06       1.11       1.08       1.11       1.09       1.10

****************************

Using syscall (no -B option):

** Initial writers **
threads:	          1          2          4          8         10         20         30         40         50         60         70         80        120        160        200
baseline:	    1786744    3693529    7600563   14594702   17645248   26197482   28938801   29700591   29858369   29831816   29730708   29606829   29621126   29538778   29589533
patched:	    1817240    3732281    7598178   14578689   17824204   27186214   29552434   26634121   22304410   18631185   16485981   15801835   15590995   15514384   15483872
speed-up(times):       1.02       1.01       1.00       1.00       1.01       1.04       1.02       0.90       0.75       0.62       0.55       0.53       0.53       0.53       0.52

** Rewriters **
threads:	          1          2          4          8         10         20         30         40         50         60         70         80        120        160        200
baseline:	    2025119    3891368    8662423   14477011   17815278   20618509   18330301   14184305   14421901   12488145   12329534   12285723   12049399   12101321   12017546
patched:	    2071648    4106464    8915170   15475594   18461212   23360704   25107019   26244308   26634094   27680123   27342845   27006682   26239505   25881556   26030227
speed-up(times):       1.02       1.06       1.03       1.07       1.04       1.13       1.37       1.85       1.85       2.22       2.22       2.20       2.18       2.14       2.17

** Readers **
threads:	          1          2          4          8         10         20         30         40         50         60         70         80        120        160        200
baseline:	    2414037    5609352    9326943   20594508   22135032   37437276   35593047   41574568   45919334   45903379   45680066   45703659   42766312   42265067   44491712
patched:	    2388758    4573606    9867239   18485205   22269461   36172618   46830113   45828302   45974984   48244870   45334303   45395237   44213071   44418922   44881804
speed-up(times):       0.99       0.82       1.06       0.90       1.01       0.97       1.32       1.10       1.00       1.05       0.99       0.99       1.03       1.05       1.01

** Re-readers **
threads:	          1          2          4          8         10         20         30         40         50         60         70         80        120        160        200
baseline:	    2410474    5006316    9620458   19420701   24929010   37301471   37897701   48067032   46620958   44619322   45474645   45627080   38448032   44844358   44529239
patched:	    2210495    4588974    9330074   18237863   23200139   36691762   43412170   48349035   46607100   47318490   45429944   45285141   44631543   44601157   44913130
speed-up(times):       0.92       0.92       0.97       0.94       0.93       0.98       1.15       1.01       1.00       1.06       1.00       0.99       1.16       0.99       1.01

** Reverse readers **
threads:	          1          2          4          8         10         20         30         40         50         60         70         80        120        160        200
baseline:	    2383446    4633256    9572545   18500373   21489130   36958118   31747157   39855519   31440942   32131944   37714689   42428280   17402480   14893057   16207342
patched:	    2240576    4847211    8373112   17181179   20205163   35186361   42922118   45388409   46244837   47153867   45257508   45476325   43479030   43613958   43296206
speed-up(times):       0.94       1.05       0.87       0.93       0.94       0.95       1.35       1.14       1.47       1.47       1.20       1.07       2.50       2.93       2.67

** Random_readers **
threads:	          1          2          4          8         10         20         30         40         50         60         70         80        120        160        200
baseline:	    1821175    3575869    8742168   13764493   20136443   30901949   37823254   43994032   41037782   43925224   41853227   42095250   39393426   33851319   41424361
patched:	    1458968    3169634    6244046   12271864   15474602   29337377   35430875   39734695   41587609   42676631   42077827   41473062   40933033   40944148   41846858
speed-up(times):       0.80       0.89       0.71       0.89       0.77       0.95       0.94       0.90       1.01       0.97       1.01       0.99       1.04       1.21       1.01

** Random_writers **
threads:	          1          2          4          8         10         20         30         40         50         60         70         80        120        160        200
baseline:	    1556393    3063377    6014016   12199163   16187258   24737005   27293400   27678633   26549637   26963066   26202907   26090764   26159003   25842459   26009927
patched:	    1642937    3461512    6405111   12425923   16990495   25404113   27340882   27467380   27057498   27297246   26627644   26733315   26624258   26787503   26603172
speed-up(times):       1.06       1.13       1.07       1.02       1.05       1.03       1.00       0.99       1.02       1.01       1.02       1.02       1.02       1.04       1.02

I haven't yet analyzed why it behaves poorly with a high number of
processes, but I will.

Changelog
---------

v4:
 - Drop RFC tag;
 - Consolidate thp and non-thp code (net diff to v3 is -177 lines);
 - Compile-time and sysfs knob for the feature;
 - Rework zone_stat for huge pages;
 - x86-64 only for now;
 - ...
v3:
 - set RADIX_TREE_PRELOAD_NR to 512 only if we build with THP;
 - rewrite lru_add_page_tail() to address a few bugs;
 - memcg accounting;
 - represent file thp pages in meminfo and friends;
 - dump page order in filemap trace;
 - add missed flush_dcache_page() in zero_huge_user_segment;
 - random cleanups based on feedback.
v2:
 - mmap();
 - fix add_to_page_cache_locked() and delete_from_page_cache();
 - introduce mapping_can_have_hugepages();
 - call split_huge_page() only for head page in filemap_fault();
 - wait_split_huge_page(): serialize over i_mmap_mutex too;
 - lru_add_page_tail: avoid PageUnevictable on active/inactive lru lists;
 - fix off-by-one in zero_huge_user_segment();
 - THP_WRITE_ALLOC/THP_WRITE_FAILED counters;

Kirill A. Shutemov (39):
  mm: drop actor argument of do_generic_file_read()
  block: implement add_bdi_stat()
  mm: implement zero_huge_user_segment and friends
  radix-tree: implement preload for multiple contiguous elements
  memcg, thp: charge huge cache pages
  thp, mm: avoid PageUnevictable on active/inactive lru lists
  thp, mm: basic defines for transparent huge page cache
  thp: compile-time and sysfs knob for thp pagecache
  thp, mm: introduce mapping_can_have_hugepages() predicate
  thp: account anon transparent huge pages into NR_ANON_PAGES
  thp: represent file thp pages in meminfo and friends
  thp, mm: rewrite add_to_page_cache_locked() to support huge pages
  mm: trace filemap: dump page order
  thp, mm: rewrite delete_from_page_cache() to support huge pages
  thp, mm: trigger bug in replace_page_cache_page() on THP
  thp, mm: locking tail page is a bug
  thp, mm: handle tail pages in page_cache_get_speculative()
  thp, mm: add event counters for huge page alloc on write to a file
  thp, mm: allocate huge pages in grab_cache_page_write_begin()
  thp, mm: naive support of thp in generic read/write routines
  thp, libfs: initial support of thp in
    simple_read/write_begin/write_end
  thp: handle file pages in split_huge_page()
  thp: wait_split_huge_page(): serialize over i_mmap_mutex too
  thp, mm: truncate support for transparent huge page cache
  thp, mm: split huge page on mmap file page
  ramfs: enable transparent huge page cache
  x86-64, mm: proper alignment mappings with hugepages
  thp: prepare zap_huge_pmd() to uncharge file pages
  thp: move maybe_pmd_mkwrite() out of mk_huge_pmd()
  thp: do_huge_pmd_anonymous_page() cleanup
  thp: consolidate code between handle_mm_fault() and
    do_huge_pmd_anonymous_page()
  mm: cleanup __do_fault() implementation
  thp, mm: implement do_huge_linear_fault()
  thp, mm: handle huge pages in filemap_fault()
  mm: decomposite do_wp_page() and get rid of some 'goto' logic
  mm: do_wp_page(): extract VM_WRITE|VM_SHARED case to separate
    function
  thp: handle write-protect exception to file-backed huge pages
  thp: vma_adjust_trans_huge(): adjust file-backed VMA too
  thp: map file-backed huge pages on fault

 arch/x86/kernel/sys_x86_64.c   |   12 +-
 drivers/base/node.c            |   10 +-
 fs/libfs.c                     |   50 +++-
 fs/proc/meminfo.c              |    9 +-
 fs/ramfs/inode.c               |    6 +-
 include/linux/backing-dev.h    |   10 +
 include/linux/fs.h             |    1 +
 include/linux/huge_mm.h        |   92 +++++--
 include/linux/mm.h             |   19 +-
 include/linux/mmzone.h         |    1 +
 include/linux/pagemap.h        |   33 ++-
 include/linux/radix-tree.h     |   11 +
 include/linux/vm_event_item.h  |    2 +
 include/trace/events/filemap.h |    7 +-
 lib/radix-tree.c               |   33 ++-
 mm/Kconfig                     |   10 +
 mm/filemap.c                   |  216 +++++++++++----
 mm/huge_memory.c               |  257 +++++++++--------
 mm/memcontrol.c                |    2 -
 mm/memory.c                    |  597 ++++++++++++++++++++++++++--------------
 mm/rmap.c                      |   18 +-
 mm/swap.c                      |   20 +-
 mm/truncate.c                  |   13 +
 mm/vmstat.c                    |    3 +
 24 files changed, 988 insertions(+), 444 deletions(-)

-- 
1.7.10.4


^ permalink raw reply	[flat|nested] 243+ messages in thread

* [PATCHv4 01/39] mm: drop actor argument of do_generic_file_read()
  2013-05-12  1:22 ` Kirill A. Shutemov
@ 2013-05-12  1:22   ` Kirill A. Shutemov
  -1 siblings, 0 replies; 243+ messages in thread
From: Kirill A. Shutemov @ 2013-05-12  1:22 UTC (permalink / raw)
  To: Andrea Arcangeli, Andrew Morton
  Cc: Al Viro, Hugh Dickins, Wu Fengguang, Jan Kara, Mel Gorman,
	linux-mm, Andi Kleen, Matthew Wilcox, Kirill A. Shutemov,
	Hillf Danton, Dave Hansen, linux-fsdevel, linux-kernel,
	Kirill A. Shutemov

From: "Kirill A. Shutemov" <kirill.shutemov@linux.intel.com>

There's only one caller of do_generic_file_read() and the only actor is
file_read_actor(). No reason to have a callback parameter.

Signed-off-by: Kirill A. Shutemov <kirill.shutemov@linux.intel.com>
---
 mm/filemap.c |   10 +++++-----
 1 file changed, 5 insertions(+), 5 deletions(-)

diff --git a/mm/filemap.c b/mm/filemap.c
index e989fb1..61158ac 100644
--- a/mm/filemap.c
+++ b/mm/filemap.c
@@ -1088,7 +1088,6 @@ static void shrink_readahead_size_eio(struct file *filp,
  * @filp:	the file to read
  * @ppos:	current file position
  * @desc:	read_descriptor
- * @actor:	read method
  *
  * This is a generic file read routine, and uses the
  * mapping->a_ops->readpage() function for the actual low-level stuff.
@@ -1097,7 +1096,7 @@ static void shrink_readahead_size_eio(struct file *filp,
  * of the logic when it comes to error handling etc.
  */
 static void do_generic_file_read(struct file *filp, loff_t *ppos,
-		read_descriptor_t *desc, read_actor_t actor)
+		read_descriptor_t *desc)
 {
 	struct address_space *mapping = filp->f_mapping;
 	struct inode *inode = mapping->host;
@@ -1198,13 +1197,14 @@ page_ok:
 		 * Ok, we have the page, and it's up-to-date, so
 		 * now we can copy it to user space...
 		 *
-		 * The actor routine returns how many bytes were actually used..
+		 * The file_read_actor routine returns how many bytes were
+		 * actually used..
 		 * NOTE! This may not be the same as how much of a user buffer
 		 * we filled up (we may be padding etc), so we can only update
 		 * "pos" here (the actor routine has to update the user buffer
 		 * pointers and the remaining count).
 		 */
-		ret = actor(desc, page, offset, nr);
+		ret = file_read_actor(desc, page, offset, nr);
 		offset += ret;
 		index += offset >> PAGE_CACHE_SHIFT;
 		offset &= ~PAGE_CACHE_MASK;
@@ -1477,7 +1477,7 @@ generic_file_aio_read(struct kiocb *iocb, const struct iovec *iov,
 		if (desc.count == 0)
 			continue;
 		desc.error = 0;
-		do_generic_file_read(filp, ppos, &desc, file_read_actor);
+		do_generic_file_read(filp, ppos, &desc);
 		retval += desc.written;
 		if (desc.error) {
 			retval = retval ?: desc.error;
-- 
1.7.10.4


^ permalink raw reply related	[flat|nested] 243+ messages in thread

* [PATCHv4 02/39] block: implement add_bdi_stat()
  2013-05-12  1:22 ` Kirill A. Shutemov
@ 2013-05-12  1:22   ` Kirill A. Shutemov
  -1 siblings, 0 replies; 243+ messages in thread
From: Kirill A. Shutemov @ 2013-05-12  1:22 UTC (permalink / raw)
  To: Andrea Arcangeli, Andrew Morton
  Cc: Al Viro, Hugh Dickins, Wu Fengguang, Jan Kara, Mel Gorman,
	linux-mm, Andi Kleen, Matthew Wilcox, Kirill A. Shutemov,
	Hillf Danton, Dave Hansen, linux-fsdevel, linux-kernel,
	Kirill A. Shutemov

From: "Kirill A. Shutemov" <kirill.shutemov@linux.intel.com>

We're going to add/remove a number of page cache entries at once. This
patch implements add_bdi_stat(), which adjusts bdi stats by an arbitrary
amount. It's required for batched page cache manipulations.
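
A hypothetical example of a batched caller (the real users appear later
in the series, when add_to_page_cache_locked() and
delete_from_page_cache() learn about huge pages):

    /*
     * Hypothetical example: account HPAGE_PMD_NR reclaimable pages
     * leaving the page cache with a single stat update instead of
     * HPAGE_PMD_NR dec_bdi_stat() calls.
     */
    add_bdi_stat(mapping->backing_dev_info, BDI_RECLAIMABLE,
    		-HPAGE_PMD_NR);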

Signed-off-by: Kirill A. Shutemov <kirill.shutemov@linux.intel.com>
---
 include/linux/backing-dev.h |   10 ++++++++++
 1 file changed, 10 insertions(+)

diff --git a/include/linux/backing-dev.h b/include/linux/backing-dev.h
index 3504599..b05d961 100644
--- a/include/linux/backing-dev.h
+++ b/include/linux/backing-dev.h
@@ -167,6 +167,16 @@ static inline void __dec_bdi_stat(struct backing_dev_info *bdi,
 	__add_bdi_stat(bdi, item, -1);
 }
 
+static inline void add_bdi_stat(struct backing_dev_info *bdi,
+		enum bdi_stat_item item, s64 amount)
+{
+	unsigned long flags;
+
+	local_irq_save(flags);
+	__add_bdi_stat(bdi, item, amount);
+	local_irq_restore(flags);
+}
+
 static inline void dec_bdi_stat(struct backing_dev_info *bdi,
 		enum bdi_stat_item item)
 {
-- 
1.7.10.4


^ permalink raw reply related	[flat|nested] 243+ messages in thread

* [PATCHv4 03/39] mm: implement zero_huge_user_segment and friends
  2013-05-12  1:22 ` Kirill A. Shutemov
@ 2013-05-12  1:23   ` Kirill A. Shutemov
  -1 siblings, 0 replies; 243+ messages in thread
From: Kirill A. Shutemov @ 2013-05-12  1:23 UTC (permalink / raw)
  To: Andrea Arcangeli, Andrew Morton
  Cc: Al Viro, Hugh Dickins, Wu Fengguang, Jan Kara, Mel Gorman,
	linux-mm, Andi Kleen, Matthew Wilcox, Kirill A. Shutemov,
	Hillf Danton, Dave Hansen, linux-fsdevel, linux-kernel,
	Kirill A. Shutemov

From: "Kirill A. Shutemov" <kirill.shutemov@linux.intel.com>

Let's add helpers to clear huge page segment(s). They provide the same
functionality as zero_user_segment() and zero_user(), but for huge pages.
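
A usage sketch (the caller and variables here are hypothetical; the real
users come later in the series, when the generic write path and libfs
handle huge pages):

    /*
     * Hypothetical example: 'page' is a huge page and a write filled
     * bytes [from, to) of it; zero everything around the written range.
     */
    zero_huge_user_segment(page, 0, from);
    zero_huge_user_segment(page, to, HPAGE_PMD_SIZE);

    /* or, for a single range of 'len' bytes starting at 'offset': */
    zero_huge_user(page, offset, len);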

Signed-off-by: Kirill A. Shutemov <kirill.shutemov@linux.intel.com>
---
 include/linux/mm.h |    7 +++++++
 mm/memory.c        |   36 ++++++++++++++++++++++++++++++++++++
 2 files changed, 43 insertions(+)

diff --git a/include/linux/mm.h b/include/linux/mm.h
index c05d7cf..5e156fb 100644
--- a/include/linux/mm.h
+++ b/include/linux/mm.h
@@ -1797,6 +1797,13 @@ extern void dump_page(struct page *page);
 extern void clear_huge_page(struct page *page,
 			    unsigned long addr,
 			    unsigned int pages_per_huge_page);
+extern void zero_huge_user_segment(struct page *page,
+		unsigned start, unsigned end);
+static inline void zero_huge_user(struct page *page,
+		unsigned start, unsigned len)
+{
+	zero_huge_user_segment(page, start, start + len);
+}
 extern void copy_user_huge_page(struct page *dst, struct page *src,
 				unsigned long addr, struct vm_area_struct *vma,
 				unsigned int pages_per_huge_page);
diff --git a/mm/memory.c b/mm/memory.c
index f7a1fba..f02a8be 100644
--- a/mm/memory.c
+++ b/mm/memory.c
@@ -4266,6 +4266,42 @@ void clear_huge_page(struct page *page,
 	}
 }
 
+void zero_huge_user_segment(struct page *page, unsigned start, unsigned end)
+{
+	int i;
+	unsigned start_idx, end_idx;
+	unsigned start_off, end_off;
+
+	BUG_ON(end < start);
+
+	might_sleep();
+
+	if (start == end)
+		return;
+
+	start_idx = start >> PAGE_SHIFT;
+	start_off = start & ~PAGE_MASK;
+	end_idx = (end - 1) >> PAGE_SHIFT;
+	end_off = ((end - 1) & ~PAGE_MASK) + 1;
+
+	/*
+	 * if start and end are on the same small page we can call
+	 * zero_user_segment() once and save one kmap_atomic().
+	 */
+	if (start_idx == end_idx)
+		return zero_user_segment(page + start_idx, start_off, end_off);
+
+	/* zero the first (possibly partial) page */
+	zero_user_segment(page + start_idx, start_off, PAGE_SIZE);
+	for (i = start_idx + 1; i < end_idx; i++) {
+		cond_resched();
+		clear_highpage(page + i);
+		flush_dcache_page(page + i);
+	}
+	/* zero the last (possibly partial) page */
+	zero_user_segment(page + end_idx, 0, end_off);
+}
+
 static void copy_user_gigantic_page(struct page *dst, struct page *src,
 				    unsigned long addr,
 				    struct vm_area_struct *vma,
-- 
1.7.10.4


^ permalink raw reply related	[flat|nested] 243+ messages in thread

* [PATCHv4 04/39] radix-tree: implement preload for multiple contiguous elements
  2013-05-12  1:22 ` Kirill A. Shutemov
@ 2013-05-12  1:23   ` Kirill A. Shutemov
  -1 siblings, 0 replies; 243+ messages in thread
From: Kirill A. Shutemov @ 2013-05-12  1:23 UTC (permalink / raw)
  To: Andrea Arcangeli, Andrew Morton
  Cc: Al Viro, Hugh Dickins, Wu Fengguang, Jan Kara, Mel Gorman,
	linux-mm, Andi Kleen, Matthew Wilcox, Kirill A. Shutemov,
	Hillf Danton, Dave Hansen, linux-fsdevel, linux-kernel,
	Kirill A. Shutemov

From: "Kirill A. Shutemov" <kirill.shutemov@linux.intel.com>

The radix tree is variable-height, so an insert operation not only has
to build the branch to its corresponding item, it also has to build the
branch to existing items if the size has to be increased (by
radix_tree_extend).

The worst case is a zero height tree with just a single item at index 0,
and then inserting an item at index ULONG_MAX. This requires 2 new branches
of RADIX_TREE_MAX_PATH size to be created, with only the root node shared.

The radix tree is usually protected by a spin lock, which means we want
to pre-allocate the required memory before taking the lock.

Currently radix_tree_preload() only guarantees enough nodes to insert
one element. It's a hard limit. For transparent huge page cache we want
to insert HPAGE_PMD_NR (512 on x86-64) entries into an address_space at
once.

This patch introduces radix_tree_preload_count(). It allows preallocating
enough nodes to insert a number of *contiguous* elements.

The worst case for adding N contiguous items is adding entries at indexes
(ULONG_MAX - N) to ULONG_MAX. It requires the nodes needed to insert a
single worst-case item, plus extra nodes if you cross the boundary from one
node to the next.

Preload uses a per-CPU array to store nodes. The total cost of preload is
"array size" * sizeof(void *) * NR_CPUS. We want to increase the array
size to be able to handle 512 entries at once.

The size of the array depends on the system bitness and on
RADIX_TREE_MAP_SHIFT.

We have three possible values of RADIX_TREE_MAP_SHIFT:

 #ifdef __KERNEL__
 #define RADIX_TREE_MAP_SHIFT	(CONFIG_BASE_SMALL ? 4 : 6)
 #else
 #define RADIX_TREE_MAP_SHIFT	3	/* For more stressful testing */
 #endif

On a 64-bit system:
For RADIX_TREE_MAP_SHIFT=3, old array size is 43, new is 107.
For RADIX_TREE_MAP_SHIFT=4, old array size is 31, new is 63.
For RADIX_TREE_MAP_SHIFT=6, old array size is 21, new is 30.

On a 32-bit system:
For RADIX_TREE_MAP_SHIFT=3, old array size is 21, new is 84.
For RADIX_TREE_MAP_SHIFT=4, old array size is 15, new is 46.
For RADIX_TREE_MAP_SHIFT=6, old array size is 11, new is 19.

On most machines we will have RADIX_TREE_MAP_SHIFT=6.

Since only THP uses batched preload at the moment, we disable it (set the
maximum preload to 1) if !CONFIG_TRANSPARENT_HUGEPAGE_PAGECACHE. This can
be changed in the future.
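
A usage sketch of the new call (hypothetical caller; it pairs with the
existing radix_tree_preload_end() exactly like radix_tree_preload()):

    /* reserve nodes for HPAGE_PMD_NR contiguous insertions */
    err = radix_tree_preload_count(HPAGE_PMD_NR, GFP_KERNEL);
    if (err)
    	return err;

    spin_lock_irq(&mapping->tree_lock);
    /* ... up to HPAGE_PMD_NR contiguous radix_tree_insert() calls
     * are now guaranteed not to fail with -ENOMEM ... */
    spin_unlock_irq(&mapping->tree_lock);

    radix_tree_preload_end();	/* re-enables preemption */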

Signed-off-by: Matthew Wilcox <matthew.r.wilcox@intel.com>
Signed-off-by: Kirill A. Shutemov <kirill.shutemov@linux.intel.com>
---
 include/linux/radix-tree.h |   11 +++++++++++
 lib/radix-tree.c           |   33 ++++++++++++++++++++++++++-------
 2 files changed, 37 insertions(+), 7 deletions(-)

diff --git a/include/linux/radix-tree.h b/include/linux/radix-tree.h
index ffc444c..a859195 100644
--- a/include/linux/radix-tree.h
+++ b/include/linux/radix-tree.h
@@ -83,6 +83,16 @@ do {									\
 	(root)->rnode = NULL;						\
 } while (0)
 
+#ifdef CONFIG_TRANSPARENT_HUGEPAGE_PAGECACHE
+/*
+ * At the moment only THP uses preload for more than one item for batched
+ * pagecache manipulations.
+ */
+#define RADIX_TREE_PRELOAD_NR	512
+#else
+#define RADIX_TREE_PRELOAD_NR	1
+#endif
+
 /**
  * Radix-tree synchronization
  *
@@ -231,6 +241,7 @@ unsigned long radix_tree_next_hole(struct radix_tree_root *root,
 unsigned long radix_tree_prev_hole(struct radix_tree_root *root,
 				unsigned long index, unsigned long max_scan);
 int radix_tree_preload(gfp_t gfp_mask);
+int radix_tree_preload_count(unsigned size, gfp_t gfp_mask);
 void radix_tree_init(void);
 void *radix_tree_tag_set(struct radix_tree_root *root,
 			unsigned long index, unsigned int tag);
diff --git a/lib/radix-tree.c b/lib/radix-tree.c
index e796429..1bc352f 100644
--- a/lib/radix-tree.c
+++ b/lib/radix-tree.c
@@ -81,16 +81,24 @@ static struct kmem_cache *radix_tree_node_cachep;
  * The worst case is a zero height tree with just a single item at index 0,
  * and then inserting an item at index ULONG_MAX. This requires 2 new branches
  * of RADIX_TREE_MAX_PATH size to be created, with only the root node shared.
+ *
+ * Worst case for adding N contiguous items is adding entries at indexes
+ * (ULONG_MAX - N) to ULONG_MAX. It requires nodes to insert single worst-case
+ * item plus extra nodes if you cross the boundary from one node to the next.
+ *
  * Hence:
  */
-#define RADIX_TREE_PRELOAD_SIZE (RADIX_TREE_MAX_PATH * 2 - 1)
+#define RADIX_TREE_PRELOAD_MIN (RADIX_TREE_MAX_PATH * 2 - 1)
+#define RADIX_TREE_PRELOAD_MAX \
+	(RADIX_TREE_PRELOAD_MIN + \
+	 DIV_ROUND_UP(RADIX_TREE_PRELOAD_NR - 1, RADIX_TREE_MAP_SIZE))
 
 /*
  * Per-cpu pool of preloaded nodes
  */
 struct radix_tree_preload {
 	int nr;
-	struct radix_tree_node *nodes[RADIX_TREE_PRELOAD_SIZE];
+	struct radix_tree_node *nodes[RADIX_TREE_PRELOAD_MAX];
 };
 static DEFINE_PER_CPU(struct radix_tree_preload, radix_tree_preloads) = { 0, };
 
@@ -257,29 +265,35 @@ radix_tree_node_free(struct radix_tree_node *node)
 
 /*
  * Load up this CPU's radix_tree_node buffer with sufficient objects to
- * ensure that the addition of a single element in the tree cannot fail.  On
- * success, return zero, with preemption disabled.  On error, return -ENOMEM
+ * ensure that the addition of *contiguous* elements in the tree cannot fail.
+ * On success, return zero, with preemption disabled.  On error, return -ENOMEM
  * with preemption not disabled.
  *
  * To make use of this facility, the radix tree must be initialised without
  * __GFP_WAIT being passed to INIT_RADIX_TREE().
  */
-int radix_tree_preload(gfp_t gfp_mask)
+int radix_tree_preload_count(unsigned size, gfp_t gfp_mask)
 {
 	struct radix_tree_preload *rtp;
 	struct radix_tree_node *node;
 	int ret = -ENOMEM;
+	int preload_target = RADIX_TREE_PRELOAD_MIN +
+		DIV_ROUND_UP(size - 1, RADIX_TREE_MAP_SIZE);
+
+	if (WARN_ONCE(size > RADIX_TREE_PRELOAD_NR,
+				"too large preload requested"))
+		return -ENOMEM;
 
 	preempt_disable();
 	rtp = &__get_cpu_var(radix_tree_preloads);
-	while (rtp->nr < ARRAY_SIZE(rtp->nodes)) {
+	while (rtp->nr < preload_target) {
 		preempt_enable();
 		node = kmem_cache_alloc(radix_tree_node_cachep, gfp_mask);
 		if (node == NULL)
 			goto out;
 		preempt_disable();
 		rtp = &__get_cpu_var(radix_tree_preloads);
-		if (rtp->nr < ARRAY_SIZE(rtp->nodes))
+		if (rtp->nr < preload_target)
 			rtp->nodes[rtp->nr++] = node;
 		else
 			kmem_cache_free(radix_tree_node_cachep, node);
@@ -288,6 +302,11 @@ int radix_tree_preload(gfp_t gfp_mask)
 out:
 	return ret;
 }
+
+int radix_tree_preload(gfp_t gfp_mask)
+{
+	return radix_tree_preload_count(1, gfp_mask);
+}
 EXPORT_SYMBOL(radix_tree_preload);
 
 /*
-- 
1.7.10.4


^ permalink raw reply related	[flat|nested] 243+ messages in thread

* [PATCHv4 05/39] memcg, thp: charge huge cache pages
  2013-05-12  1:22 ` Kirill A. Shutemov
@ 2013-05-12  1:23   ` Kirill A. Shutemov
  -1 siblings, 0 replies; 243+ messages in thread
From: Kirill A. Shutemov @ 2013-05-12  1:23 UTC (permalink / raw)
  To: Andrea Arcangeli, Andrew Morton
  Cc: Al Viro, Hugh Dickins, Wu Fengguang, Jan Kara, Mel Gorman,
	linux-mm, Andi Kleen, Matthew Wilcox, Kirill A. Shutemov,
	Hillf Danton, Dave Hansen, linux-fsdevel, linux-kernel,
	Kirill A. Shutemov, Michal Hocko, KAMEZAWA Hiroyuki

From: "Kirill A. Shutemov" <kirill.shutemov@linux.intel.com>

mem_cgroup_cache_charge() has a check for PageCompound(). The check
prevents charging huge cache pages.

I don't see a reason why the check is present. It looks like it's just
legacy (introduced in 52d4b9a "memcg: allocate all page_cgroup at boot").

Let's just drop it.

Signed-off-by: Kirill A. Shutemov <kirill.shutemov@linux.intel.com>
Cc: Michal Hocko <mhocko@suse.cz>
Cc: KAMEZAWA Hiroyuki <kamezawa.hiroyu@jp.fujitsu.com>
---
 mm/memcontrol.c |    2 --
 1 file changed, 2 deletions(-)

diff --git a/mm/memcontrol.c b/mm/memcontrol.c
index fe4f123..a7de6a7 100644
--- a/mm/memcontrol.c
+++ b/mm/memcontrol.c
@@ -4080,8 +4080,6 @@ int mem_cgroup_cache_charge(struct page *page, struct mm_struct *mm,
 
 	if (mem_cgroup_disabled())
 		return 0;
-	if (PageCompound(page))
-		return 0;
 
 	if (!PageSwapCache(page))
 		ret = mem_cgroup_charge_common(page, mm, gfp_mask, type);
-- 
1.7.10.4


^ permalink raw reply related	[flat|nested] 243+ messages in thread

* [PATCHv4 06/39] thp, mm: avoid PageUnevictable on active/inactive lru lists
  2013-05-12  1:22 ` Kirill A. Shutemov
@ 2013-05-12  1:23   ` Kirill A. Shutemov
  -1 siblings, 0 replies; 243+ messages in thread
From: Kirill A. Shutemov @ 2013-05-12  1:23 UTC (permalink / raw)
  To: Andrea Arcangeli, Andrew Morton
  Cc: Al Viro, Hugh Dickins, Wu Fengguang, Jan Kara, Mel Gorman,
	linux-mm, Andi Kleen, Matthew Wilcox, Kirill A. Shutemov,
	Hillf Danton, Dave Hansen, linux-fsdevel, linux-kernel,
	Kirill A. Shutemov

From: "Kirill A. Shutemov" <kirill.shutemov@linux.intel.com>

active/inactive lru lists can contain unevictable pages (i.e. ramfs pages
that have been placed on the LRU lists when first allocated), but these
pages must not have PageUnevictable set - otherwise shrink_active_list
goes crazy:

kernel BUG at /home/space/kas/git/public/linux-next/mm/vmscan.c:1122!
invalid opcode: 0000 [#1] SMP
CPU 0
Pid: 293, comm: kswapd0 Not tainted 3.8.0-rc6-next-20130202+ #531
RIP: 0010:[<ffffffff81110478>]  [<ffffffff81110478>] isolate_lru_pages.isra.61+0x138/0x260
RSP: 0000:ffff8800796d9b28  EFLAGS: 00010082
RAX: 00000000ffffffea RBX: 0000000000000012 RCX: 0000000000000001
RDX: 0000000000000001 RSI: 0000000000000000 RDI: ffffea0001de8040
RBP: ffff8800796d9b88 R08: ffff8800796d9df0 R09: 0000000000000000
R10: 0000000000000000 R11: 0000000000000000 R12: 0000000000000012
R13: ffffea0001de8060 R14: ffffffff818818e8 R15: ffff8800796d9bf8
FS:  0000000000000000(0000) GS:ffff88007a200000(0000) knlGS:0000000000000000
CS:  0010 DS: 0000 ES: 0000 CR0: 0000000080050033
CR2: 00007f1bfc108000 CR3: 000000000180b000 CR4: 00000000000406f0
DR0: 0000000000000000 DR1: 0000000000000000 DR2: 0000000000000000
DR3: 0000000000000000 DR6: 00000000ffff0ff0 DR7: 0000000000000400
Process kswapd0 (pid: 293, threadinfo ffff8800796d8000, task ffff880079e0a6e0)
Stack:
 ffff8800796d9b48 ffffffff81881880 ffff8800796d9df0 ffff8800796d9be0
 0000000000000002 000000000000001f ffff8800796d9b88 ffffffff818818c8
 ffffffff81881480 ffff8800796d9dc0 0000000000000002 000000000000001f
Call Trace:
 [<ffffffff81111e98>] shrink_inactive_list+0x108/0x4a0
 [<ffffffff8109ce3d>] ? trace_hardirqs_off+0xd/0x10
 [<ffffffff8107b8bf>] ? local_clock+0x4f/0x60
 [<ffffffff8110ff5d>] ? shrink_slab+0x1fd/0x4c0
 [<ffffffff811125a1>] shrink_zone+0x371/0x610
 [<ffffffff8110ff75>] ? shrink_slab+0x215/0x4c0
 [<ffffffff81112dfc>] kswapd+0x5bc/0xb60
 [<ffffffff81112840>] ? shrink_zone+0x610/0x610
 [<ffffffff81066676>] kthread+0xd6/0xe0
 [<ffffffff810665a0>] ? __kthread_bind+0x40/0x40
 [<ffffffff814fed6c>] ret_from_fork+0x7c/0xb0
 [<ffffffff810665a0>] ? __kthread_bind+0x40/0x40
Code: 1f 40 00 49 8b 45 08 49 8b 75 00 48 89 46 08 48 89 30 49 8b 06 4c 89 68 08 49 89 45 00 4d 89 75 08 4d 89 2e eb 9c 0f 1f 44 00 00 <0f> 0b 66 0f 1f 44 00 00 31 db 45 31 e4 eb 9b 0f 0b 0f 0b 65 48
RIP  [<ffffffff81110478>] isolate_lru_pages.isra.61+0x138/0x260
 RSP <ffff8800796d9b28>

For lru_add_page_tail(), this means we should not set PageUnevictable()
on tail pages unless we're sure that they will go to LRU_UNEVICTABLE.
Let's just copy PG_active and PG_unevictable from the head page in
__split_huge_page_refcount(); this will simplify lru_add_page_tail().

This also fixes one more bug in lru_add_page_tail():
if page_evictable(page_tail) is false and PageLRU(page) is true, page_tail
will go to the same lru as page, but nothing syncs page_tail's
active/inactive state with page's, so we can end up with an inactive
page on the active lru.
The patch fixes this as well, since we copy PG_active from the head page.
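
With both flags copied, page_lru() can derive the correct list for a tail
page from the tail page's own flags. For reference, page_lru() is roughly
(include/linux/mm_inline.h):

 static inline enum lru_list page_lru(struct page *page)
 {
 	enum lru_list lru;

 	if (PageUnevictable(page))
 		lru = LRU_UNEVICTABLE;
 	else {
 		/* LRU_INACTIVE_ANON or LRU_INACTIVE_FILE */
 		lru = page_lru_base_type(page);
 		if (PageActive(page))
 			lru += LRU_ACTIVE;
 	}
 	return lru;
 }

which is what the add_page_to_lru_list(page_tail, lruvec, page_lru(page_tail))
call below relies on.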

Signed-off-by: Kirill A. Shutemov <kirill.shutemov@linux.intel.com>
---
 mm/huge_memory.c |    4 +++-
 mm/swap.c        |   20 ++------------------
 2 files changed, 5 insertions(+), 19 deletions(-)

diff --git a/mm/huge_memory.c b/mm/huge_memory.c
index 03a89a2..b39fa01 100644
--- a/mm/huge_memory.c
+++ b/mm/huge_memory.c
@@ -1612,7 +1612,9 @@ static void __split_huge_page_refcount(struct page *page,
 				     ((1L << PG_referenced) |
 				      (1L << PG_swapbacked) |
 				      (1L << PG_mlocked) |
-				      (1L << PG_uptodate)));
+				      (1L << PG_uptodate) |
+				      (1L << PG_active) |
+				      (1L << PG_unevictable)));
 		page_tail->flags |= (1L << PG_dirty);
 
 		/* clear PageTail before overwriting first_page */
diff --git a/mm/swap.c b/mm/swap.c
index acd40bf..9b0a64b 100644
--- a/mm/swap.c
+++ b/mm/swap.c
@@ -739,8 +739,6 @@ EXPORT_SYMBOL(__pagevec_release);
 void lru_add_page_tail(struct page *page, struct page *page_tail,
 		       struct lruvec *lruvec, struct list_head *list)
 {
-	int uninitialized_var(active);
-	enum lru_list lru;
 	const int file = 0;
 
 	VM_BUG_ON(!PageHead(page));
@@ -752,20 +750,6 @@ void lru_add_page_tail(struct page *page, struct page *page_tail,
 	if (!list)
 		SetPageLRU(page_tail);
 
-	if (page_evictable(page_tail)) {
-		if (PageActive(page)) {
-			SetPageActive(page_tail);
-			active = 1;
-			lru = LRU_ACTIVE_ANON;
-		} else {
-			active = 0;
-			lru = LRU_INACTIVE_ANON;
-		}
-	} else {
-		SetPageUnevictable(page_tail);
-		lru = LRU_UNEVICTABLE;
-	}
-
 	if (likely(PageLRU(page)))
 		list_add_tail(&page_tail->lru, &page->lru);
 	else if (list) {
@@ -781,13 +765,13 @@ void lru_add_page_tail(struct page *page, struct page *page_tail,
 		 * Use the standard add function to put page_tail on the list,
 		 * but then correct its position so they all end up in order.
 		 */
-		add_page_to_lru_list(page_tail, lruvec, lru);
+		add_page_to_lru_list(page_tail, lruvec, page_lru(page_tail));
 		list_head = page_tail->lru.prev;
 		list_move_tail(&page_tail->lru, list_head);
 	}
 
 	if (!PageUnevictable(page))
-		update_page_reclaim_stat(lruvec, file, active);
+		update_page_reclaim_stat(lruvec, file, PageActive(page_tail));
 }
 #endif /* CONFIG_TRANSPARENT_HUGEPAGE */
 
-- 
1.7.10.4


^ permalink raw reply related	[flat|nested] 243+ messages in thread

* [PATCHv4 07/39] thp, mm: basic defines for transparent huge page cache
  2013-05-12  1:22 ` Kirill A. Shutemov
@ 2013-05-12  1:23   ` Kirill A. Shutemov
  -1 siblings, 0 replies; 243+ messages in thread
From: Kirill A. Shutemov @ 2013-05-12  1:23 UTC (permalink / raw)
  To: Andrea Arcangeli, Andrew Morton
  Cc: Al Viro, Hugh Dickins, Wu Fengguang, Jan Kara, Mel Gorman,
	linux-mm, Andi Kleen, Matthew Wilcox, Kirill A. Shutemov,
	Hillf Danton, Dave Hansen, linux-fsdevel, linux-kernel,
	Kirill A. Shutemov

From: "Kirill A. Shutemov" <kirill.shutemov@linux.intel.com>

Signed-off-by: Kirill A. Shutemov <kirill.shutemov@linux.intel.com>
---
 include/linux/huge_mm.h |    8 ++++++++
 1 file changed, 8 insertions(+)

diff --git a/include/linux/huge_mm.h b/include/linux/huge_mm.h
index 528454c..6b4c9b2 100644
--- a/include/linux/huge_mm.h
+++ b/include/linux/huge_mm.h
@@ -64,6 +64,10 @@ extern pmd_t *page_check_address_pmd(struct page *page,
 #define HPAGE_PMD_MASK HPAGE_MASK
 #define HPAGE_PMD_SIZE HPAGE_SIZE
 
+#define HPAGE_CACHE_ORDER      (HPAGE_SHIFT - PAGE_CACHE_SHIFT)
+#define HPAGE_CACHE_NR         (1L << HPAGE_CACHE_ORDER)
+#define HPAGE_CACHE_INDEX_MASK (HPAGE_CACHE_NR - 1)
+
 extern bool is_vma_temporary_stack(struct vm_area_struct *vma);
 
 #define transparent_hugepage_enabled(__vma)				\
@@ -185,6 +189,10 @@ extern int do_huge_pmd_numa_page(struct mm_struct *mm, struct vm_area_struct *vm
 #define HPAGE_PMD_MASK ({ BUILD_BUG(); 0; })
 #define HPAGE_PMD_SIZE ({ BUILD_BUG(); 0; })
 
+#define HPAGE_CACHE_ORDER      ({ BUILD_BUG(); 0; })
+#define HPAGE_CACHE_NR         ({ BUILD_BUG(); 0; })
+#define HPAGE_CACHE_INDEX_MASK ({ BUILD_BUG(); 0; })
+
 #define hpage_nr_pages(x) 1
 
 #define transparent_hugepage_enabled(__vma) 0
-- 
1.7.10.4


^ permalink raw reply related	[flat|nested] 243+ messages in thread

* [PATCHv4 08/39] thp: compile-time and sysfs knob for thp pagecache
  2013-05-12  1:22 ` Kirill A. Shutemov
@ 2013-05-12  1:23   ` Kirill A. Shutemov
  -1 siblings, 0 replies; 243+ messages in thread
From: Kirill A. Shutemov @ 2013-05-12  1:23 UTC (permalink / raw)
  To: Andrea Arcangeli, Andrew Morton
  Cc: Al Viro, Hugh Dickins, Wu Fengguang, Jan Kara, Mel Gorman,
	linux-mm, Andi Kleen, Matthew Wilcox, Kirill A. Shutemov,
	Hillf Danton, Dave Hansen, linux-fsdevel, linux-kernel,
	Kirill A. Shutemov

From: "Kirill A. Shutemov" <kirill.shutemov@linux.intel.com>

For now, TRANSPARENT_HUGEPAGE_PAGECACHE is only implemented for X86_64.
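
For illustration only, a caller would gate its huge page cache path on the
new helper roughly like this (hypothetical function, not part of the patch):

 static struct page *try_alloc_thp_cache_page_sketch(gfp_t gfp_mask)
 {
 	/* compile-time option off, or disabled via the sysfs knob */
 	if (!transparent_hugepage_pagecache())
 		return NULL;
 	return alloc_pages(gfp_mask | __GFP_COMP, HPAGE_PMD_ORDER);
 }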

Signed-off-by: Kirill A. Shutemov <kirill.shutemov@linux.intel.com>
---
 include/linux/huge_mm.h |    7 +++++++
 mm/Kconfig              |   10 ++++++++++
 mm/huge_memory.c        |   19 +++++++++++++++++++
 3 files changed, 36 insertions(+)

diff --git a/include/linux/huge_mm.h b/include/linux/huge_mm.h
index 6b4c9b2..88b44e2 100644
--- a/include/linux/huge_mm.h
+++ b/include/linux/huge_mm.h
@@ -40,6 +40,7 @@ enum transparent_hugepage_flag {
 	TRANSPARENT_HUGEPAGE_DEFRAG_FLAG,
 	TRANSPARENT_HUGEPAGE_DEFRAG_REQ_MADV_FLAG,
 	TRANSPARENT_HUGEPAGE_DEFRAG_KHUGEPAGED_FLAG,
+	TRANSPARENT_HUGEPAGE_PAGECACHE,
 	TRANSPARENT_HUGEPAGE_USE_ZERO_PAGE_FLAG,
 #ifdef CONFIG_DEBUG_VM
 	TRANSPARENT_HUGEPAGE_DEBUG_COW_FLAG,
@@ -240,4 +241,10 @@ static inline int do_huge_pmd_numa_page(struct mm_struct *mm, struct vm_area_str
 
 #endif /* CONFIG_TRANSPARENT_HUGEPAGE */
 
+static inline bool transparent_hugepage_pagecache(void)
+{
+	if (!IS_ENABLED(CONFIG_TRANSPARENT_HUGEPAGE_PAGECACHE))
+		return 0;
+	return transparent_hugepage_flags & (1<<TRANSPARENT_HUGEPAGE_PAGECACHE);
+}
 #endif /* _LINUX_HUGE_MM_H */
diff --git a/mm/Kconfig b/mm/Kconfig
index e742d06..3a271b7 100644
--- a/mm/Kconfig
+++ b/mm/Kconfig
@@ -420,6 +420,16 @@ choice
 	  benefit.
 endchoice
 
+config TRANSPARENT_HUGEPAGE_PAGECACHE
+	bool "Transparent Hugepage Support for page cache"
+	depends on X86_64 && TRANSPARENT_HUGEPAGE
+	default y
+	help
+	  Enabling this option adds support for huge pages in file-backed
+	  mappings. It requires transparent hugepage support from the
+	  filesystem side. For now, the only filesystem which supports
+	  huge pages is ramfs.
+
 config CROSS_MEMORY_ATTACH
 	bool "Cross Memory Support"
 	depends on MMU
diff --git a/mm/huge_memory.c b/mm/huge_memory.c
index b39fa01..bd8ef7f 100644
--- a/mm/huge_memory.c
+++ b/mm/huge_memory.c
@@ -42,6 +42,9 @@ unsigned long transparent_hugepage_flags __read_mostly =
 #endif
 	(1<<TRANSPARENT_HUGEPAGE_DEFRAG_FLAG)|
 	(1<<TRANSPARENT_HUGEPAGE_DEFRAG_KHUGEPAGED_FLAG)|
+#ifdef CONFIG_TRANSPARENT_HUGEPAGE_PAGECACHE
+	(1<<TRANSPARENT_HUGEPAGE_PAGECACHE)|
+#endif
 	(1<<TRANSPARENT_HUGEPAGE_USE_ZERO_PAGE_FLAG);
 
 /* default scan 8*512 pte (or vmas) every 30 second */
@@ -357,6 +360,21 @@ static ssize_t defrag_store(struct kobject *kobj,
 static struct kobj_attribute defrag_attr =
 	__ATTR(defrag, 0644, defrag_show, defrag_store);
 
+static ssize_t page_cache_show(struct kobject *kobj,
+		struct kobj_attribute *attr, char *buf)
+{
+	return single_flag_show(kobj, attr, buf,
+				TRANSPARENT_HUGEPAGE_PAGECACHE);
+}
+static ssize_t page_cache_store(struct kobject *kobj,
+		struct kobj_attribute *attr, const char *buf, size_t count)
+{
+	return single_flag_store(kobj, attr, buf, count,
+				 TRANSPARENT_HUGEPAGE_PAGECACHE);
+}
+static struct kobj_attribute page_cache_attr =
+	__ATTR(page_cache, 0644, page_cache_show, page_cache_store);
+
 static ssize_t use_zero_page_show(struct kobject *kobj,
 		struct kobj_attribute *attr, char *buf)
 {
@@ -392,6 +410,7 @@ static struct kobj_attribute debug_cow_attr =
 static struct attribute *hugepage_attr[] = {
 	&enabled_attr.attr,
 	&defrag_attr.attr,
+	&page_cache_attr.attr,
 	&use_zero_page_attr.attr,
 #ifdef CONFIG_DEBUG_VM
 	&debug_cow_attr.attr,
-- 
1.7.10.4


^ permalink raw reply related	[flat|nested] 243+ messages in thread

* [PATCHv4 09/39] thp, mm: introduce mapping_can_have_hugepages() predicate
  2013-05-12  1:22 ` Kirill A. Shutemov
@ 2013-05-12  1:23   ` Kirill A. Shutemov
  -1 siblings, 0 replies; 243+ messages in thread
From: Kirill A. Shutemov @ 2013-05-12  1:23 UTC (permalink / raw)
  To: Andrea Arcangeli, Andrew Morton
  Cc: Al Viro, Hugh Dickins, Wu Fengguang, Jan Kara, Mel Gorman,
	linux-mm, Andi Kleen, Matthew Wilcox, Kirill A. Shutemov,
	Hillf Danton, Dave Hansen, linux-fsdevel, linux-kernel,
	Kirill A. Shutemov

From: "Kirill A. Shutemov" <kirill.shutemov@linux.intel.com>

Returns true if the mapping can have huge pages. For now, just check for
__GFP_COMP in the mapping's gfp mask.
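
A filesystem opts in by including __GFP_COMP (e.g. via GFP_TRANSHUGE) in
the mapping's gfp mask. A sketch, with a made-up helper name:

 static void enable_thp_pagecache_sketch(struct inode *inode)
 {
 	/*
 	 * GFP_TRANSHUGE contains __GFP_COMP, so with the runtime knob
 	 * enabled mapping_can_have_hugepages() returns true for this
 	 * mapping.
 	 */
 	mapping_set_gfp_mask(inode->i_mapping, GFP_TRANSHUGE);
 }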

Signed-off-by: Kirill A. Shutemov <kirill.shutemov@linux.intel.com>
---
 include/linux/pagemap.h |   12 ++++++++++++
 1 file changed, 12 insertions(+)

diff --git a/include/linux/pagemap.h b/include/linux/pagemap.h
index e3dea75..28597ec 100644
--- a/include/linux/pagemap.h
+++ b/include/linux/pagemap.h
@@ -84,6 +84,18 @@ static inline void mapping_set_gfp_mask(struct address_space *m, gfp_t mask)
 				(__force unsigned long)mask;
 }
 
+static inline bool mapping_can_have_hugepages(struct address_space *m)
+{
+	if (IS_ENABLED(CONFIG_TRANSPARENT_HUGEPAGE_PAGECACHE)) {
+		gfp_t gfp_mask = mapping_gfp_mask(m);
+		/* __GFP_COMP is key part of GFP_TRANSHUGE */
+		return !!(gfp_mask & __GFP_COMP) &&
+			transparent_hugepage_pagecache();
+	}
+
+	return false;
+}
+
 /*
  * The page cache can done in larger chunks than
  * one page, because it allows for more efficient
-- 
1.7.10.4


^ permalink raw reply related	[flat|nested] 243+ messages in thread

* [PATCHv4 10/39] thp: account anon transparent huge pages into NR_ANON_PAGES
  2013-05-12  1:22 ` Kirill A. Shutemov
@ 2013-05-12  1:23   ` Kirill A. Shutemov
  -1 siblings, 0 replies; 243+ messages in thread
From: Kirill A. Shutemov @ 2013-05-12  1:23 UTC (permalink / raw)
  To: Andrea Arcangeli, Andrew Morton
  Cc: Al Viro, Hugh Dickins, Wu Fengguang, Jan Kara, Mel Gorman,
	linux-mm, Andi Kleen, Matthew Wilcox, Kirill A. Shutemov,
	Hillf Danton, Dave Hansen, linux-fsdevel, linux-kernel,
	Kirill A. Shutemov

From: "Kirill A. Shutemov" <kirill.shutemov@linux.intel.com>

We use NR_ANON_PAGES as the basis for reporting AnonPages to userspace.
There's not much sense in leaving transparent huge pages out of that
counter only to add them back in when printing to userspace.

Let's account transparent huge pages in NR_ANON_PAGES in the first place.
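
The conversion below relies on hpage_nr_pages(), which is roughly
(include/linux/huge_mm.h):

 static inline int hpage_nr_pages(struct page *page)
 {
 	if (unlikely(PageTransHuge(page)))
 		return HPAGE_PMD_NR;
 	return 1;
 }

so a single __mod_zone_page_state(..., hpage_nr_pages(page)) call covers
both the small page and the huge page case.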

Signed-off-by: Kirill A. Shutemov <kirill.shutemov@linux.intel.com>
---
 drivers/base/node.c |    6 ------
 fs/proc/meminfo.c   |    6 ------
 mm/huge_memory.c    |    1 -
 mm/rmap.c           |   18 +++++++++---------
 4 files changed, 9 insertions(+), 22 deletions(-)

diff --git a/drivers/base/node.c b/drivers/base/node.c
index 7616a77..bc9f43b 100644
--- a/drivers/base/node.c
+++ b/drivers/base/node.c
@@ -125,13 +125,7 @@ static ssize_t node_read_meminfo(struct device *dev,
 		       nid, K(node_page_state(nid, NR_WRITEBACK)),
 		       nid, K(node_page_state(nid, NR_FILE_PAGES)),
 		       nid, K(node_page_state(nid, NR_FILE_MAPPED)),
-#ifdef CONFIG_TRANSPARENT_HUGEPAGE
-		       nid, K(node_page_state(nid, NR_ANON_PAGES)
-			+ node_page_state(nid, NR_ANON_TRANSPARENT_HUGEPAGES) *
-			HPAGE_PMD_NR),
-#else
 		       nid, K(node_page_state(nid, NR_ANON_PAGES)),
-#endif
 		       nid, K(node_page_state(nid, NR_SHMEM)),
 		       nid, node_page_state(nid, NR_KERNEL_STACK) *
 				THREAD_SIZE / 1024,
diff --git a/fs/proc/meminfo.c b/fs/proc/meminfo.c
index 5aa847a..59d85d6 100644
--- a/fs/proc/meminfo.c
+++ b/fs/proc/meminfo.c
@@ -132,13 +132,7 @@ static int meminfo_proc_show(struct seq_file *m, void *v)
 		K(i.freeswap),
 		K(global_page_state(NR_FILE_DIRTY)),
 		K(global_page_state(NR_WRITEBACK)),
-#ifdef CONFIG_TRANSPARENT_HUGEPAGE
-		K(global_page_state(NR_ANON_PAGES)
-		  + global_page_state(NR_ANON_TRANSPARENT_HUGEPAGES) *
-		  HPAGE_PMD_NR),
-#else
 		K(global_page_state(NR_ANON_PAGES)),
-#endif
 		K(global_page_state(NR_FILE_MAPPED)),
 		K(global_page_state(NR_SHMEM)),
 		K(global_page_state(NR_SLAB_RECLAIMABLE) +
diff --git a/mm/huge_memory.c b/mm/huge_memory.c
index bd8ef7f..ed31e90 100644
--- a/mm/huge_memory.c
+++ b/mm/huge_memory.c
@@ -1672,7 +1672,6 @@ static void __split_huge_page_refcount(struct page *page,
 	BUG_ON(atomic_read(&page->_count) <= 0);
 
 	__mod_zone_page_state(zone, NR_ANON_TRANSPARENT_HUGEPAGES, -1);
-	__mod_zone_page_state(zone, NR_ANON_PAGES, HPAGE_PMD_NR);
 
 	ClearPageCompound(page);
 	compound_unlock(page);
diff --git a/mm/rmap.c b/mm/rmap.c
index 6280da8..6abf387 100644
--- a/mm/rmap.c
+++ b/mm/rmap.c
@@ -1055,11 +1055,11 @@ void do_page_add_anon_rmap(struct page *page,
 {
 	int first = atomic_inc_and_test(&page->_mapcount);
 	if (first) {
-		if (!PageTransHuge(page))
-			__inc_zone_page_state(page, NR_ANON_PAGES);
-		else
+		if (PageTransHuge(page))
 			__inc_zone_page_state(page,
 					      NR_ANON_TRANSPARENT_HUGEPAGES);
+		__mod_zone_page_state(page_zone(page), NR_ANON_PAGES,
+				hpage_nr_pages(page));
 	}
 	if (unlikely(PageKsm(page)))
 		return;
@@ -1088,10 +1088,10 @@ void page_add_new_anon_rmap(struct page *page,
 	VM_BUG_ON(address < vma->vm_start || address >= vma->vm_end);
 	SetPageSwapBacked(page);
 	atomic_set(&page->_mapcount, 0); /* increment count (starts at -1) */
-	if (!PageTransHuge(page))
-		__inc_zone_page_state(page, NR_ANON_PAGES);
-	else
+	if (PageTransHuge(page))
 		__inc_zone_page_state(page, NR_ANON_TRANSPARENT_HUGEPAGES);
+	__mod_zone_page_state(page_zone(page), NR_ANON_PAGES,
+			hpage_nr_pages(page));
 	__page_set_anon_rmap(page, vma, address, 1);
 	if (!mlocked_vma_newpage(vma, page))
 		lru_cache_add_lru(page, LRU_ACTIVE_ANON);
@@ -1150,11 +1150,11 @@ void page_remove_rmap(struct page *page)
 		goto out;
 	if (anon) {
 		mem_cgroup_uncharge_page(page);
-		if (!PageTransHuge(page))
-			__dec_zone_page_state(page, NR_ANON_PAGES);
-		else
+		if (PageTransHuge(page))
 			__dec_zone_page_state(page,
 					      NR_ANON_TRANSPARENT_HUGEPAGES);
+		__mod_zone_page_state(page_zone(page), NR_ANON_PAGES,
+				hpage_nr_pages(page));
 	} else {
 		__dec_zone_page_state(page, NR_FILE_MAPPED);
 		mem_cgroup_dec_page_stat(page, MEMCG_NR_FILE_MAPPED);
-- 
1.7.10.4


^ permalink raw reply related	[flat|nested] 243+ messages in thread

* [PATCHv4 11/39] thp: represent file thp pages in meminfo and friends
  2013-05-12  1:22 ` Kirill A. Shutemov
@ 2013-05-12  1:23   ` Kirill A. Shutemov
  -1 siblings, 0 replies; 243+ messages in thread
From: Kirill A. Shutemov @ 2013-05-12  1:23 UTC (permalink / raw)
  To: Andrea Arcangeli, Andrew Morton
  Cc: Al Viro, Hugh Dickins, Wu Fengguang, Jan Kara, Mel Gorman,
	linux-mm, Andi Kleen, Matthew Wilcox, Kirill A. Shutemov,
	Hillf Danton, Dave Hansen, linux-fsdevel, linux-kernel,
	Kirill A. Shutemov

From: "Kirill A. Shutemov" <kirill.shutemov@linux.intel.com>

The patch adds a new zone stat to count file transparent huge pages and
adjusts the related places.

For now we don't count mapped or dirty file thp pages separately.
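
The counter is maintained where huge pages enter and leave the page cache.
As a sketch (made-up helper name; the body mirrors what the
add_to_page_cache_locked() rework later in the series does under
mapping->tree_lock):

 static void account_pagecache_add_sketch(struct page *page, int nr)
 {
 	__mod_zone_page_state(page_zone(page), NR_FILE_PAGES, nr);
 	if (PageTransHuge(page))
 		__inc_zone_page_state(page, NR_FILE_TRANSPARENT_HUGEPAGES);
 }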

Signed-off-by: Kirill A. Shutemov <kirill.shutemov@linux.intel.com>
---
 drivers/base/node.c    |    4 ++++
 fs/proc/meminfo.c      |    3 +++
 include/linux/mmzone.h |    1 +
 mm/vmstat.c            |    1 +
 4 files changed, 9 insertions(+)

diff --git a/drivers/base/node.c b/drivers/base/node.c
index bc9f43b..de261f5 100644
--- a/drivers/base/node.c
+++ b/drivers/base/node.c
@@ -119,6 +119,7 @@ static ssize_t node_read_meminfo(struct device *dev,
 		       "Node %d SUnreclaim:     %8lu kB\n"
 #ifdef CONFIG_TRANSPARENT_HUGEPAGE
 		       "Node %d AnonHugePages:  %8lu kB\n"
+		       "Node %d FileHugePages:  %8lu kB\n"
 #endif
 			,
 		       nid, K(node_page_state(nid, NR_FILE_DIRTY)),
@@ -140,6 +141,9 @@ static ssize_t node_read_meminfo(struct device *dev,
 		       nid, K(node_page_state(nid, NR_SLAB_UNRECLAIMABLE))
 			, nid,
 			K(node_page_state(nid, NR_ANON_TRANSPARENT_HUGEPAGES) *
+			HPAGE_PMD_NR)
+			, nid,
+			K(node_page_state(nid, NR_FILE_TRANSPARENT_HUGEPAGES) *
 			HPAGE_PMD_NR));
 #else
 		       nid, K(node_page_state(nid, NR_SLAB_UNRECLAIMABLE)));
diff --git a/fs/proc/meminfo.c b/fs/proc/meminfo.c
index 59d85d6..a62952c 100644
--- a/fs/proc/meminfo.c
+++ b/fs/proc/meminfo.c
@@ -104,6 +104,7 @@ static int meminfo_proc_show(struct seq_file *m, void *v)
 #endif
 #ifdef CONFIG_TRANSPARENT_HUGEPAGE
 		"AnonHugePages:  %8lu kB\n"
+		"FileHugePages:  %8lu kB\n"
 #endif
 		,
 		K(i.totalram),
@@ -158,6 +159,8 @@ static int meminfo_proc_show(struct seq_file *m, void *v)
 #ifdef CONFIG_TRANSPARENT_HUGEPAGE
 		,K(global_page_state(NR_ANON_TRANSPARENT_HUGEPAGES) *
 		   HPAGE_PMD_NR)
+		,K(global_page_state(NR_FILE_TRANSPARENT_HUGEPAGES) *
+		   HPAGE_PMD_NR)
 #endif
 		);
 
diff --git a/include/linux/mmzone.h b/include/linux/mmzone.h
index 72e1cb5..33fd258 100644
--- a/include/linux/mmzone.h
+++ b/include/linux/mmzone.h
@@ -142,6 +142,7 @@ enum zone_stat_item {
 	NUMA_OTHER,		/* allocation from other node */
 #endif
 	NR_ANON_TRANSPARENT_HUGEPAGES,
+	NR_FILE_TRANSPARENT_HUGEPAGES,
 	NR_FREE_CMA_PAGES,
 	NR_VM_ZONE_STAT_ITEMS };
 
diff --git a/mm/vmstat.c b/mm/vmstat.c
index 7a35116..7945285 100644
--- a/mm/vmstat.c
+++ b/mm/vmstat.c
@@ -738,6 +738,7 @@ const char * const vmstat_text[] = {
 	"numa_other",
 #endif
 	"nr_anon_transparent_hugepages",
+	"nr_file_transparent_hugepages",
 	"nr_free_cma",
 	"nr_dirty_threshold",
 	"nr_dirty_background_threshold",
-- 
1.7.10.4


^ permalink raw reply related	[flat|nested] 243+ messages in thread

* [PATCHv4 12/39] thp, mm: rewrite add_to_page_cache_locked() to support huge pages
  2013-05-12  1:22 ` Kirill A. Shutemov
@ 2013-05-12  1:23   ` Kirill A. Shutemov
  -1 siblings, 0 replies; 243+ messages in thread
From: Kirill A. Shutemov @ 2013-05-12  1:23 UTC (permalink / raw)
  To: Andrea Arcangeli, Andrew Morton
  Cc: Al Viro, Hugh Dickins, Wu Fengguang, Jan Kara, Mel Gorman,
	linux-mm, Andi Kleen, Matthew Wilcox, Kirill A. Shutemov,
	Hillf Danton, Dave Hansen, linux-fsdevel, linux-kernel,
	Kirill A. Shutemov

From: "Kirill A. Shutemov" <kirill.shutemov@linux.intel.com>

For a huge page we add HPAGE_CACHE_NR pages to the radix tree at once:
the head page for the specified index and HPAGE_CACHE_NR-1 tail pages
for the following indexes.
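
The head page is expected at a huge-page-aligned index so that the tail
pages occupy the following slots. A hypothetical caller (not from the
patchset; the page must already be locked, as for any add_to_page_cache()
user) would look like:

 static int add_huge_page_sketch(struct page *hpage,
 				struct address_space *mapping,
 				pgoff_t index, gfp_t gfp_mask)
 {
 	/* round down to the huge page boundary */
 	pgoff_t aligned = index & ~(pgoff_t)HPAGE_CACHE_INDEX_MASK;

 	VM_BUG_ON(!PageTransHuge(hpage));
 	return add_to_page_cache_locked(hpage, mapping, aligned, gfp_mask);
 }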

Signed-off-by: Kirill A. Shutemov <kirill.shutemov@linux.intel.com>
---
 mm/filemap.c |   71 ++++++++++++++++++++++++++++++++++++++--------------------
 1 file changed, 47 insertions(+), 24 deletions(-)

diff --git a/mm/filemap.c b/mm/filemap.c
index 61158ac..b0c7c8c 100644
--- a/mm/filemap.c
+++ b/mm/filemap.c
@@ -460,39 +460,62 @@ int add_to_page_cache_locked(struct page *page, struct address_space *mapping,
 		pgoff_t offset, gfp_t gfp_mask)
 {
 	int error;
+	int i, nr;
 
 	VM_BUG_ON(!PageLocked(page));
 	VM_BUG_ON(PageSwapBacked(page));
 
+	/* memory cgroup controller handles thp pages on its side */
 	error = mem_cgroup_cache_charge(page, current->mm,
 					gfp_mask & GFP_RECLAIM_MASK);
 	if (error)
-		goto out;
-
-	error = radix_tree_preload(gfp_mask & ~__GFP_HIGHMEM);
-	if (error == 0) {
-		page_cache_get(page);
-		page->mapping = mapping;
-		page->index = offset;
+		return error;
 
-		spin_lock_irq(&mapping->tree_lock);
-		error = radix_tree_insert(&mapping->page_tree, offset, page);
-		if (likely(!error)) {
-			mapping->nrpages++;
-			__inc_zone_page_state(page, NR_FILE_PAGES);
-			spin_unlock_irq(&mapping->tree_lock);
-			trace_mm_filemap_add_to_page_cache(page);
-		} else {
-			page->mapping = NULL;
-			/* Leave page->index set: truncation relies upon it */
-			spin_unlock_irq(&mapping->tree_lock);
-			mem_cgroup_uncharge_cache_page(page);
-			page_cache_release(page);
-		}
-		radix_tree_preload_end();
-	} else
+	if (IS_ENABLED(CONFIG_TRANSPARENT_HUGEPAGE_PAGECACHE)) {
+		BUILD_BUG_ON(HPAGE_CACHE_NR > RADIX_TREE_PRELOAD_NR);
+		nr = hpage_nr_pages(page);
+	} else {
+		BUG_ON(PageTransHuge(page));
+		nr = 1;
+	}
+	error = radix_tree_preload_count(nr, gfp_mask & ~__GFP_HIGHMEM);
+	if (error) {
 		mem_cgroup_uncharge_cache_page(page);
-out:
+		return error;
+	}
+
+	spin_lock_irq(&mapping->tree_lock);
+	for (i = 0; i < nr; i++) {
+		page_cache_get(page + i);
+		page[i].index = offset + i;
+		page[i].mapping = mapping;
+		error = radix_tree_insert(&mapping->page_tree,
+				offset + i, page + i);
+		if (error)
+			goto err;
+	}
+	__mod_zone_page_state(page_zone(page), NR_FILE_PAGES, nr);
+	if (PageTransHuge(page))
+		__inc_zone_page_state(page, NR_FILE_TRANSPARENT_HUGEPAGES);
+	mapping->nrpages += nr;
+	spin_unlock_irq(&mapping->tree_lock);
+	radix_tree_preload_end();
+	trace_mm_filemap_add_to_page_cache(page);
+	return 0;
+err:
+	if (i != 0)
+		error = -ENOSPC; /* no space for a huge page */
+	page_cache_release(page + i);
+	page[i].mapping = NULL;
+	for (i--; i >= 0; i--) {
+		/* Leave page->index set: truncation relies upon it */
+		page[i].mapping = NULL;
+		radix_tree_delete(&mapping->page_tree, offset + i);
+		page_cache_release(page + i);
+	}
+	spin_unlock_irq(&mapping->tree_lock);
+	radix_tree_preload_end();
+	mem_cgroup_uncharge_cache_page(page);
 	return error;
 }
 EXPORT_SYMBOL(add_to_page_cache_locked);
-- 
1.7.10.4


^ permalink raw reply related	[flat|nested] 243+ messages in thread

* [PATCHv4 13/39] mm: trace filemap: dump page order
  2013-05-12  1:22 ` Kirill A. Shutemov
@ 2013-05-12  1:23   ` Kirill A. Shutemov
  -1 siblings, 0 replies; 243+ messages in thread
From: Kirill A. Shutemov @ 2013-05-12  1:23 UTC (permalink / raw)
  To: Andrea Arcangeli, Andrew Morton
  Cc: Al Viro, Hugh Dickins, Wu Fengguang, Jan Kara, Mel Gorman,
	linux-mm, Andi Kleen, Matthew Wilcox, Kirill A. Shutemov,
	Hillf Danton, Dave Hansen, linux-fsdevel, linux-kernel,
	Kirill A. Shutemov

From: "Kirill A. Shutemov" <kirill.shutemov@linux.intel.com>

Dump the page order into the trace event so that small pages and huge
pages in the page cache can be told apart.
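
With the order field added, each trace line gains a trailing "order="
value, e.g. (values made up for illustration; the format string is the
one in the patch below):

	dev 8:1 ino 1234 page=ffffea0004d18000 pfn=1268224 ofs=2097152 order=9
	dev 8:1 ino 1234 page=ffffea0004d1c000 pfn=1268320 ofs=4194304 order=0

order=9 corresponds to a 2M huge page (512 4k subpages), order=0 to a
regular small page.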

Signed-off-by: Kirill A. Shutemov <kirill.shutemov@linux.intel.com>
---
 include/trace/events/filemap.h |    7 +++++--
 1 file changed, 5 insertions(+), 2 deletions(-)

diff --git a/include/trace/events/filemap.h b/include/trace/events/filemap.h
index 0421f49..7e14b13 100644
--- a/include/trace/events/filemap.h
+++ b/include/trace/events/filemap.h
@@ -21,6 +21,7 @@ DECLARE_EVENT_CLASS(mm_filemap_op_page_cache,
 		__field(struct page *, page)
 		__field(unsigned long, i_ino)
 		__field(unsigned long, index)
+		__field(int, order)
 		__field(dev_t, s_dev)
 	),
 
@@ -28,18 +29,20 @@ DECLARE_EVENT_CLASS(mm_filemap_op_page_cache,
 		__entry->page = page;
 		__entry->i_ino = page->mapping->host->i_ino;
 		__entry->index = page->index;
+		__entry->order = compound_order(page);
 		if (page->mapping->host->i_sb)
 			__entry->s_dev = page->mapping->host->i_sb->s_dev;
 		else
 			__entry->s_dev = page->mapping->host->i_rdev;
 	),
 
-	TP_printk("dev %d:%d ino %lx page=%p pfn=%lu ofs=%lu",
+	TP_printk("dev %d:%d ino %lx page=%p pfn=%lu ofs=%lu order=%d",
 		MAJOR(__entry->s_dev), MINOR(__entry->s_dev),
 		__entry->i_ino,
 		__entry->page,
 		page_to_pfn(__entry->page),
-		__entry->index << PAGE_SHIFT)
+		__entry->index << PAGE_SHIFT,
+		__entry->order)
 );
 
 DEFINE_EVENT(mm_filemap_op_page_cache, mm_filemap_delete_from_page_cache,
-- 
1.7.10.4


^ permalink raw reply related	[flat|nested] 243+ messages in thread

* [PATCHv4 14/39] thp, mm: rewrite delete_from_page_cache() to support huge pages
  2013-05-12  1:22 ` Kirill A. Shutemov
@ 2013-05-12  1:23   ` Kirill A. Shutemov
  -1 siblings, 0 replies; 243+ messages in thread
From: Kirill A. Shutemov @ 2013-05-12  1:23 UTC (permalink / raw)
  To: Andrea Arcangeli, Andrew Morton
  Cc: Al Viro, Hugh Dickins, Wu Fengguang, Jan Kara, Mel Gorman,
	linux-mm, Andi Kleen, Matthew Wilcox, Kirill A. Shutemov,
	Hillf Danton, Dave Hansen, linux-fsdevel, linux-kernel,
	Kirill A. Shutemov

From: "Kirill A. Shutemov" <kirill.shutemov@linux.intel.com>

As with add_to_page_cache_locked(), we handle HPAGE_CACHE_NR pages at
a time.
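
On x86-64 with 4k base pages and 2M huge pages HPAGE_CACHE_NR is 512,
so deleting one huge page removes 512 radix-tree entries and drops
mapping->nrpages and NR_FILE_PAGES by 512 in a single pass under
tree_lock, instead of doing 512 separate removals.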

Signed-off-by: Kirill A. Shutemov <kirill.shutemov@linux.intel.com>
---
 mm/filemap.c |   31 +++++++++++++++++++++++++------
 1 file changed, 25 insertions(+), 6 deletions(-)

diff --git a/mm/filemap.c b/mm/filemap.c
index b0c7c8c..657ce82 100644
--- a/mm/filemap.c
+++ b/mm/filemap.c
@@ -115,6 +115,9 @@
 void __delete_from_page_cache(struct page *page)
 {
 	struct address_space *mapping = page->mapping;
+	bool thp = PageTransHuge(page) &&
+		IS_ENABLED(CONFIG_TRANSPARENT_HUGEPAGE_PAGECACHE);
+	int nr;
 
 	trace_mm_filemap_delete_from_page_cache(page);
 	/*
@@ -127,13 +130,29 @@ void __delete_from_page_cache(struct page *page)
 	else
 		cleancache_invalidate_page(mapping, page);
 
-	radix_tree_delete(&mapping->page_tree, page->index);
+	if (thp) {
+		int i;
+
+		nr = HPAGE_CACHE_NR;
+		radix_tree_delete(&mapping->page_tree, page->index);
+		for (i = 1; i < HPAGE_CACHE_NR; i++) {
+			radix_tree_delete(&mapping->page_tree, page->index + i);
+			page[i].mapping = NULL;
+			page_cache_release(page + i);
+		}
+		__dec_zone_page_state(page, NR_FILE_TRANSPARENT_HUGEPAGES);
+	} else {
+		BUG_ON(PageTransHuge(page));
+		nr = 1;
+		radix_tree_delete(&mapping->page_tree, page->index);
+	}
+
 	page->mapping = NULL;
 	/* Leave page->index set: truncation lookup relies upon it */
-	mapping->nrpages--;
-	__dec_zone_page_state(page, NR_FILE_PAGES);
+	mapping->nrpages -= nr;
+	__mod_zone_page_state(page_zone(page), NR_FILE_PAGES, -nr);
 	if (PageSwapBacked(page))
-		__dec_zone_page_state(page, NR_SHMEM);
+		__mod_zone_page_state(page_zone(page), NR_SHMEM, -nr);
 	BUG_ON(page_mapped(page));
 
 	/*
@@ -144,8 +163,8 @@ void __delete_from_page_cache(struct page *page)
 	 * having removed the page entirely.
 	 */
 	if (PageDirty(page) && mapping_cap_account_dirty(mapping)) {
-		dec_zone_page_state(page, NR_FILE_DIRTY);
-		dec_bdi_stat(mapping->backing_dev_info, BDI_RECLAIMABLE);
+		mod_zone_page_state(page_zone(page), NR_FILE_DIRTY, -nr);
+		add_bdi_stat(mapping->backing_dev_info, BDI_RECLAIMABLE, -nr);
 	}
 }
 
-- 
1.7.10.4


^ permalink raw reply related	[flat|nested] 243+ messages in thread

* [PATCHv4 15/39] thp, mm: trigger bug in replace_page_cache_page() on THP
  2013-05-12  1:22 ` Kirill A. Shutemov
@ 2013-05-12  1:23   ` Kirill A. Shutemov
  -1 siblings, 0 replies; 243+ messages in thread
From: Kirill A. Shutemov @ 2013-05-12  1:23 UTC (permalink / raw)
  To: Andrea Arcangeli, Andrew Morton
  Cc: Al Viro, Hugh Dickins, Wu Fengguang, Jan Kara, Mel Gorman,
	linux-mm, Andi Kleen, Matthew Wilcox, Kirill A. Shutemov,
	Hillf Danton, Dave Hansen, linux-fsdevel, linux-kernel,
	Kirill A. Shutemov

From: "Kirill A. Shutemov" <kirill.shutemov@linux.intel.com>

replace_page_cache_page() is only used by FUSE. It's unlikely that we
will support THP in the FUSE page cache any time soon.

Let's postpone implementation of THP handling in
replace_page_cache_page() until somebody actually needs it.

Signed-off-by: Kirill A. Shutemov <kirill.shutemov@linux.intel.com>
---
 mm/filemap.c |    2 ++
 1 file changed, 2 insertions(+)

diff --git a/mm/filemap.c b/mm/filemap.c
index 657ce82..3a03426 100644
--- a/mm/filemap.c
+++ b/mm/filemap.c
@@ -428,6 +428,8 @@ int replace_page_cache_page(struct page *old, struct page *new, gfp_t gfp_mask)
 {
 	int error;
 
+	VM_BUG_ON(PageTransHuge(old));
+	VM_BUG_ON(PageTransHuge(new));
 	VM_BUG_ON(!PageLocked(old));
 	VM_BUG_ON(!PageLocked(new));
 	VM_BUG_ON(new->mapping);
-- 
1.7.10.4


^ permalink raw reply related	[flat|nested] 243+ messages in thread

* [PATCHv4 16/39] thp, mm: locking tail page is a bug
  2013-05-12  1:22 ` Kirill A. Shutemov
@ 2013-05-12  1:23   ` Kirill A. Shutemov
  -1 siblings, 0 replies; 243+ messages in thread
From: Kirill A. Shutemov @ 2013-05-12  1:23 UTC (permalink / raw)
  To: Andrea Arcangeli, Andrew Morton
  Cc: Al Viro, Hugh Dickins, Wu Fengguang, Jan Kara, Mel Gorman,
	linux-mm, Andi Kleen, Matthew Wilcox, Kirill A. Shutemov,
	Hillf Danton, Dave Hansen, linux-fsdevel, linux-kernel,
	Kirill A. Shutemov

From: "Kirill A. Shutemov" <kirill.shutemov@linux.intel.com>

Locking the head page means locking the entire compound page.
If we try to lock a tail page, something has gone wrong.

Signed-off-by: Kirill A. Shutemov <kirill.shutemov@linux.intel.com>
---
 mm/filemap.c |    2 ++
 1 file changed, 2 insertions(+)

diff --git a/mm/filemap.c b/mm/filemap.c
index 3a03426..9ea46a4 100644
--- a/mm/filemap.c
+++ b/mm/filemap.c
@@ -681,6 +681,7 @@ void __lock_page(struct page *page)
 {
 	DEFINE_WAIT_BIT(wait, &page->flags, PG_locked);
 
+	VM_BUG_ON(PageTail(page));
 	__wait_on_bit_lock(page_waitqueue(page), &wait, sleep_on_page,
 							TASK_UNINTERRUPTIBLE);
 }
@@ -690,6 +691,7 @@ int __lock_page_killable(struct page *page)
 {
 	DEFINE_WAIT_BIT(wait, &page->flags, PG_locked);
 
+	VM_BUG_ON(PageTail(page));
 	return __wait_on_bit_lock(page_waitqueue(page), &wait,
 					sleep_on_page_killable, TASK_KILLABLE);
 }
-- 
1.7.10.4


^ permalink raw reply related	[flat|nested] 243+ messages in thread

* [PATCHv4 17/39] thp, mm: handle tail pages in page_cache_get_speculative()
  2013-05-12  1:22 ` Kirill A. Shutemov
@ 2013-05-12  1:23   ` Kirill A. Shutemov
  -1 siblings, 0 replies; 243+ messages in thread
From: Kirill A. Shutemov @ 2013-05-12  1:23 UTC (permalink / raw)
  To: Andrea Arcangeli, Andrew Morton
  Cc: Al Viro, Hugh Dickins, Wu Fengguang, Jan Kara, Mel Gorman,
	linux-mm, Andi Kleen, Matthew Wilcox, Kirill A. Shutemov,
	Hillf Danton, Dave Hansen, linux-fsdevel, linux-kernel,
	Kirill A. Shutemov

From: "Kirill A. Shutemov" <kirill.shutemov@linux.intel.com>

For a tail page we call __get_page_tail(). It has the same semantics,
but operates on the tail page.

Signed-off-by: Kirill A. Shutemov <kirill.shutemov@linux.intel.com>
---
 include/linux/pagemap.h |    4 +++-
 1 file changed, 3 insertions(+), 1 deletion(-)

diff --git a/include/linux/pagemap.h b/include/linux/pagemap.h
index 28597ec..2e86251 100644
--- a/include/linux/pagemap.h
+++ b/include/linux/pagemap.h
@@ -161,6 +161,9 @@ static inline int page_cache_get_speculative(struct page *page)
 {
 	VM_BUG_ON(in_interrupt());
 
+	if (unlikely(PageTail(page)))
+		return __get_page_tail(page);
+
 #ifdef CONFIG_TINY_RCU
 # ifdef CONFIG_PREEMPT_COUNT
 	VM_BUG_ON(!in_atomic());
@@ -187,7 +190,6 @@ static inline int page_cache_get_speculative(struct page *page)
 		return 0;
 	}
 #endif
-	VM_BUG_ON(PageTail(page));
 
 	return 1;
 }
-- 
1.7.10.4


^ permalink raw reply related	[flat|nested] 243+ messages in thread

* [PATCHv4 18/39] thp, mm: add event counters for huge page alloc on write to a file
  2013-05-12  1:22 ` Kirill A. Shutemov
@ 2013-05-12  1:23   ` Kirill A. Shutemov
  -1 siblings, 0 replies; 243+ messages in thread
From: Kirill A. Shutemov @ 2013-05-12  1:23 UTC (permalink / raw)
  To: Andrea Arcangeli, Andrew Morton
  Cc: Al Viro, Hugh Dickins, Wu Fengguang, Jan Kara, Mel Gorman,
	linux-mm, Andi Kleen, Matthew Wilcox, Kirill A. Shutemov,
	Hillf Danton, Dave Hansen, linux-fsdevel, linux-kernel,
	Kirill A. Shutemov

From: "Kirill A. Shutemov" <kirill.shutemov@linux.intel.com>

Existing stats specify the source of a THP page: fault or collapse.
We're going to allocate a new huge page on write(2), which is neither
a fault nor a collapse.

Let's introduce new events for that.
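
After this change the counters show up in /proc/vmstat next to the
existing THP events, e.g. (values are illustrative):

	thp_write_alloc 128
	thp_write_alloc_failed 3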

Signed-off-by: Kirill A. Shutemov <kirill.shutemov@linux.intel.com>
---
 include/linux/vm_event_item.h |    2 ++
 mm/vmstat.c                   |    2 ++
 2 files changed, 4 insertions(+)

diff --git a/include/linux/vm_event_item.h b/include/linux/vm_event_item.h
index d4b7a18..584c71c 100644
--- a/include/linux/vm_event_item.h
+++ b/include/linux/vm_event_item.h
@@ -71,6 +71,8 @@ enum vm_event_item { PGPGIN, PGPGOUT, PSWPIN, PSWPOUT,
 		THP_FAULT_FALLBACK,
 		THP_COLLAPSE_ALLOC,
 		THP_COLLAPSE_ALLOC_FAILED,
+		THP_WRITE_ALLOC,
+		THP_WRITE_ALLOC_FAILED,
 		THP_SPLIT,
 		THP_ZERO_PAGE_ALLOC,
 		THP_ZERO_PAGE_ALLOC_FAILED,
diff --git a/mm/vmstat.c b/mm/vmstat.c
index 7945285..df8dcda 100644
--- a/mm/vmstat.c
+++ b/mm/vmstat.c
@@ -821,6 +821,8 @@ const char * const vmstat_text[] = {
 	"thp_fault_fallback",
 	"thp_collapse_alloc",
 	"thp_collapse_alloc_failed",
+	"thp_write_alloc",
+	"thp_write_alloc_failed",
 	"thp_split",
 	"thp_zero_page_alloc",
 	"thp_zero_page_alloc_failed",
-- 
1.7.10.4


^ permalink raw reply related	[flat|nested] 243+ messages in thread

* [PATCHv4 19/39] thp, mm: allocate huge pages in grab_cache_page_write_begin()
  2013-05-12  1:22 ` Kirill A. Shutemov
@ 2013-05-12  1:23   ` Kirill A. Shutemov
  -1 siblings, 0 replies; 243+ messages in thread
From: Kirill A. Shutemov @ 2013-05-12  1:23 UTC (permalink / raw)
  To: Andrea Arcangeli, Andrew Morton
  Cc: Al Viro, Hugh Dickins, Wu Fengguang, Jan Kara, Mel Gorman,
	linux-mm, Andi Kleen, Matthew Wilcox, Kirill A. Shutemov,
	Hillf Danton, Dave Hansen, linux-fsdevel, linux-kernel,
	Kirill A. Shutemov

From: "Kirill A. Shutemov" <kirill.shutemov@linux.intel.com>

Try to allocate a huge page if flags has AOP_FLAG_TRANSHUGE set.

If, for some reason, it's not possible to allocate a huge page at this
position, return NULL. The caller should take care of falling back to
small pages.
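
A caller is expected to use it roughly like this (fragment modelled on
the libfs change later in the series; error handling trimmed):

	page = grab_cache_page_write_begin(mapping,
			index & ~HPAGE_CACHE_INDEX_MASK,
			flags | AOP_FLAG_TRANSHUGE);
	if (!page)	/* no huge page at this position: fall back */
		page = grab_cache_page_write_begin(mapping, index, flags);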

Signed-off-by: Kirill A. Shutemov <kirill.shutemov@linux.intel.com>
---
 include/linux/fs.h      |    1 +
 include/linux/huge_mm.h |    3 +++
 include/linux/pagemap.h |    9 ++++++++-
 mm/filemap.c            |   29 ++++++++++++++++++++++++-----
 4 files changed, 36 insertions(+), 6 deletions(-)

diff --git a/include/linux/fs.h b/include/linux/fs.h
index 2c28271..a70b0ac 100644
--- a/include/linux/fs.h
+++ b/include/linux/fs.h
@@ -280,6 +280,7 @@ enum positive_aop_returns {
 #define AOP_FLAG_NOFS			0x0004 /* used by filesystem to direct
 						* helper code (eg buffer layer)
 						* to clear GFP_FS from alloc */
+#define AOP_FLAG_TRANSHUGE		0x0008 /* allocate transhuge page */
 
 /*
  * oh the beauties of C type declarations.
diff --git a/include/linux/huge_mm.h b/include/linux/huge_mm.h
index 88b44e2..74494a2 100644
--- a/include/linux/huge_mm.h
+++ b/include/linux/huge_mm.h
@@ -194,6 +194,9 @@ extern int do_huge_pmd_numa_page(struct mm_struct *mm, struct vm_area_struct *vm
 #define HPAGE_CACHE_NR         ({ BUILD_BUG(); 0; })
 #define HPAGE_CACHE_INDEX_MASK ({ BUILD_BUG(); 0; })
 
+#define THP_WRITE_ALLOC		({ BUILD_BUG(); 0; })
+#define THP_WRITE_ALLOC_FAILED	({ BUILD_BUG(); 0; })
+
 #define hpage_nr_pages(x) 1
 
 #define transparent_hugepage_enabled(__vma) 0
diff --git a/include/linux/pagemap.h b/include/linux/pagemap.h
index 2e86251..8feeecc 100644
--- a/include/linux/pagemap.h
+++ b/include/linux/pagemap.h
@@ -270,8 +270,15 @@ unsigned find_get_pages_contig(struct address_space *mapping, pgoff_t start,
 unsigned find_get_pages_tag(struct address_space *mapping, pgoff_t *index,
 			int tag, unsigned int nr_pages, struct page **pages);
 
-struct page *grab_cache_page_write_begin(struct address_space *mapping,
+struct page *__grab_cache_page_write_begin(struct address_space *mapping,
 			pgoff_t index, unsigned flags);
+static inline struct page *grab_cache_page_write_begin(
+		struct address_space *mapping, pgoff_t index, unsigned flags)
+{
+	if (!transparent_hugepage_pagecache() && (flags & AOP_FLAG_TRANSHUGE))
+		return NULL;
+	return __grab_cache_page_write_begin(mapping, index, flags);
+}
 
 /*
  * Returns locked page at given index in given cache, creating it if needed.
diff --git a/mm/filemap.c b/mm/filemap.c
index 9ea46a4..e086ef0 100644
--- a/mm/filemap.c
+++ b/mm/filemap.c
@@ -2309,25 +2309,44 @@ EXPORT_SYMBOL(generic_file_direct_write);
  * Find or create a page at the given pagecache position. Return the locked
  * page. This function is specifically for buffered writes.
  */
-struct page *grab_cache_page_write_begin(struct address_space *mapping,
-					pgoff_t index, unsigned flags)
+struct page *__grab_cache_page_write_begin(struct address_space *mapping,
+		pgoff_t index, unsigned flags)
 {
 	int status;
 	gfp_t gfp_mask;
 	struct page *page;
 	gfp_t gfp_notmask = 0;
+	bool thp = (flags & AOP_FLAG_TRANSHUGE) &&
+		IS_ENABLED(CONFIG_TRANSPARENT_HUGEPAGE_PAGECACHE);
 
 	gfp_mask = mapping_gfp_mask(mapping);
 	if (mapping_cap_account_dirty(mapping))
 		gfp_mask |= __GFP_WRITE;
 	if (flags & AOP_FLAG_NOFS)
 		gfp_notmask = __GFP_FS;
+	if (thp) {
+		BUG_ON(index & HPAGE_CACHE_INDEX_MASK);
+		BUG_ON(!(gfp_mask & __GFP_COMP));
+	}
 repeat:
 	page = find_lock_page(mapping, index);
-	if (page)
+	if (page) {
+		if (thp && !PageTransHuge(page)) {
+			unlock_page(page);
+			page_cache_release(page);
+			return NULL;
+		}
 		goto found;
+	}
 
-	page = __page_cache_alloc(gfp_mask & ~gfp_notmask);
+	if (thp) {
+		page = alloc_pages(gfp_mask & ~gfp_notmask, HPAGE_PMD_ORDER);
+		if (page)
+			count_vm_event(THP_WRITE_ALLOC);
+		else
+			count_vm_event(THP_WRITE_ALLOC_FAILED);
+	} else
+		page = __page_cache_alloc(gfp_mask & ~gfp_notmask);
 	if (!page)
 		return NULL;
 	status = add_to_page_cache_lru(page, mapping, index,
@@ -2342,7 +2361,7 @@ found:
 	wait_for_stable_page(page);
 	return page;
 }
-EXPORT_SYMBOL(grab_cache_page_write_begin);
+EXPORT_SYMBOL(__grab_cache_page_write_begin);
 
 static ssize_t generic_perform_write(struct file *file,
 				struct iov_iter *i, loff_t pos)
-- 
1.7.10.4


^ permalink raw reply related	[flat|nested] 243+ messages in thread

* [PATCHv4 20/39] thp, mm: naive support of thp in generic read/write routines
  2013-05-12  1:22 ` Kirill A. Shutemov
@ 2013-05-12  1:23   ` Kirill A. Shutemov
  -1 siblings, 0 replies; 243+ messages in thread
From: Kirill A. Shutemov @ 2013-05-12  1:23 UTC (permalink / raw)
  To: Andrea Arcangeli, Andrew Morton
  Cc: Al Viro, Hugh Dickins, Wu Fengguang, Jan Kara, Mel Gorman,
	linux-mm, Andi Kleen, Matthew Wilcox, Kirill A. Shutemov,
	Hillf Danton, Dave Hansen, linux-fsdevel, linux-kernel,
	Kirill A. Shutemov

From: "Kirill A. Shutemov" <kirill.shutemov@linux.intel.com>

For now we still write/read at most PAGE_CACHE_SIZE bytes at a time.

This implementation doesn't cover address spaces with a backing store.
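
For example, with 2M huge pages and 4k base pages, a write at
pos = 0x201234 that lands in a huge page covering file offsets
0x200000-0x3fffff is handled as:

	offset  = pos & ~HPAGE_PMD_MASK               = 0x1234
	subpage = page + (offset >> PAGE_CACHE_SHIFT) = page + 1
	in-page = offset & ~PAGE_CACHE_MASK           = 0x234

so the copy still touches only one 4k subpage at a time.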

Signed-off-by: Kirill A. Shutemov <kirill.shutemov@linux.intel.com>
---
 mm/filemap.c |   19 ++++++++++++++++++-
 1 file changed, 18 insertions(+), 1 deletion(-)

diff --git a/mm/filemap.c b/mm/filemap.c
index e086ef0..ebd361a 100644
--- a/mm/filemap.c
+++ b/mm/filemap.c
@@ -1177,6 +1177,17 @@ find_page:
 			if (unlikely(page == NULL))
 				goto no_cached_page;
 		}
+		if (PageTransCompound(page)) {
+			struct page *head = compound_trans_head(page);
+			/*
+			 * We don't yet support huge pages in page cache
+			 * for filesystems with backing device, so pages
+			 * should always be up-to-date.
+			 */
+			BUG_ON(ra->ra_pages);
+			BUG_ON(!PageUptodate(head));
+			goto page_ok;
+		}
 		if (PageReadahead(page)) {
 			page_cache_async_readahead(mapping,
 					ra, filp, page,
@@ -2413,8 +2424,13 @@ again:
 		if (mapping_writably_mapped(mapping))
 			flush_dcache_page(page);
 
+		if (PageTransHuge(page))
+			offset = pos & ~HPAGE_PMD_MASK;
+
 		pagefault_disable();
-		copied = iov_iter_copy_from_user_atomic(page, i, offset, bytes);
+		copied = iov_iter_copy_from_user_atomic(
+				page + (offset >> PAGE_CACHE_SHIFT),
+				i, offset & ~PAGE_CACHE_MASK, bytes);
 		pagefault_enable();
 		flush_dcache_page(page);
 
@@ -2437,6 +2453,7 @@ again:
 			 * because not all segments in the iov can be copied at
 			 * once without a pagefault.
 			 */
+			offset = pos & ~PAGE_CACHE_MASK;
 			bytes = min_t(unsigned long, PAGE_CACHE_SIZE - offset,
 						iov_iter_single_seg_count(i));
 			goto again;
-- 
1.7.10.4


^ permalink raw reply related	[flat|nested] 243+ messages in thread

* [PATCHv4 21/39] thp, libfs: initial support of thp in simple_read/write_begin/write_end
  2013-05-12  1:22 ` Kirill A. Shutemov
@ 2013-05-12  1:23   ` Kirill A. Shutemov
  -1 siblings, 0 replies; 243+ messages in thread
From: Kirill A. Shutemov @ 2013-05-12  1:23 UTC (permalink / raw)
  To: Andrea Arcangeli, Andrew Morton
  Cc: Al Viro, Hugh Dickins, Wu Fengguang, Jan Kara, Mel Gorman,
	linux-mm, Andi Kleen, Matthew Wilcox, Kirill A. Shutemov,
	Hillf Danton, Dave Hansen, linux-fsdevel, linux-kernel,
	Kirill A. Shutemov

From: "Kirill A. Shutemov" <kirill.shutemov@linux.intel.com>

For now we try to grab a huge cache page if gfp_mask has __GFP_COMP
set. This is probably too weak a condition and will need to be
reworked later.
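
For example, a 100-byte write at pos = 0x200400 into a freshly
allocated (not yet uptodate) huge page covering 0x200000-0x3fffff
takes from = 0x400 and zeroes the ranges [0, 0x400) and
[0x464, HPAGE_PMD_SIZE) with zero_huge_user_segment(), leaving only
the bytes about to be copied untouched.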

Signed-off-by: Kirill A. Shutemov <kirill.shutemov@linux.intel.com>
---
 fs/libfs.c              |   50 ++++++++++++++++++++++++++++++++++++-----------
 include/linux/pagemap.h |    8 ++++++++
 2 files changed, 47 insertions(+), 11 deletions(-)

diff --git a/fs/libfs.c b/fs/libfs.c
index 916da8c..ce807fe 100644
--- a/fs/libfs.c
+++ b/fs/libfs.c
@@ -383,7 +383,7 @@ EXPORT_SYMBOL(simple_setattr);
 
 int simple_readpage(struct file *file, struct page *page)
 {
-	clear_highpage(page);
+	clear_pagecache_page(page);
 	flush_dcache_page(page);
 	SetPageUptodate(page);
 	unlock_page(page);
@@ -394,21 +394,44 @@ int simple_write_begin(struct file *file, struct address_space *mapping,
 			loff_t pos, unsigned len, unsigned flags,
 			struct page **pagep, void **fsdata)
 {
-	struct page *page;
+	struct page *page = NULL;
 	pgoff_t index;
 
 	index = pos >> PAGE_CACHE_SHIFT;
 
-	page = grab_cache_page_write_begin(mapping, index, flags);
+	/* XXX: too weak condition? */
+	if (mapping_can_have_hugepages(mapping)) {
+		page = grab_cache_page_write_begin(mapping,
+				index & ~HPAGE_CACHE_INDEX_MASK,
+				flags | AOP_FLAG_TRANSHUGE);
+		/* fallback to small page */
+		if (!page) {
+			unsigned long offset;
+			offset = pos & ~PAGE_CACHE_MASK;
+			len = min_t(unsigned long,
+					len, PAGE_CACHE_SIZE - offset);
+		}
+		BUG_ON(page && !PageTransHuge(page));
+	}
+	if (!page)
+		page = grab_cache_page_write_begin(mapping, index, flags);
 	if (!page)
 		return -ENOMEM;
-
 	*pagep = page;
 
-	if (!PageUptodate(page) && (len != PAGE_CACHE_SIZE)) {
-		unsigned from = pos & (PAGE_CACHE_SIZE - 1);
-
-		zero_user_segments(page, 0, from, from + len, PAGE_CACHE_SIZE);
+	if (!PageUptodate(page)) {
+		unsigned from;
+
+		if (PageTransHuge(page) && len != HPAGE_PMD_SIZE) {
+			from = pos & ~HPAGE_PMD_MASK;
+			zero_huge_user_segment(page, 0, from);
+			zero_huge_user_segment(page,
+					from + len, HPAGE_PMD_SIZE);
+		} else if (len != PAGE_CACHE_SIZE) {
+			from = pos & ~PAGE_CACHE_MASK;
+			zero_user_segments(page, 0, from,
+					from + len, PAGE_CACHE_SIZE);
+		}
 	}
 	return 0;
 }
@@ -443,9 +466,14 @@ int simple_write_end(struct file *file, struct address_space *mapping,
 
 	/* zero the stale part of the page if we did a short copy */
 	if (copied < len) {
-		unsigned from = pos & (PAGE_CACHE_SIZE - 1);
-
-		zero_user(page, from + copied, len - copied);
+		unsigned from;
+		if (PageTransHuge(page)) {
+			from = pos & ~HPAGE_PMD_MASK;
+			zero_huge_user(page, from + copied, len - copied);
+		} else {
+			from = pos & ~PAGE_CACHE_MASK;
+			zero_user(page, from + copied, len - copied);
+		}
 	}
 
 	if (!PageUptodate(page))
diff --git a/include/linux/pagemap.h b/include/linux/pagemap.h
index 8feeecc..462fcca 100644
--- a/include/linux/pagemap.h
+++ b/include/linux/pagemap.h
@@ -579,4 +579,12 @@ static inline int add_to_page_cache(struct page *page,
 	return error;
 }
 
+static inline void clear_pagecache_page(struct page *page)
+{
+	if (PageTransHuge(page))
+		zero_huge_user(page, 0, HPAGE_PMD_SIZE);
+	else
+		clear_highpage(page);
+}
+
 #endif /* _LINUX_PAGEMAP_H */
-- 
1.7.10.4


^ permalink raw reply related	[flat|nested] 243+ messages in thread

* [PATCHv4 22/39] thp: handle file pages in split_huge_page()
  2013-05-12  1:22 ` Kirill A. Shutemov
@ 2013-05-12  1:23   ` Kirill A. Shutemov
  -1 siblings, 0 replies; 243+ messages in thread
From: Kirill A. Shutemov @ 2013-05-12  1:23 UTC (permalink / raw)
  To: Andrea Arcangeli, Andrew Morton
  Cc: Al Viro, Hugh Dickins, Wu Fengguang, Jan Kara, Mel Gorman,
	linux-mm, Andi Kleen, Matthew Wilcox, Kirill A. Shutemov,
	Hillf Danton, Dave Hansen, linux-fsdevel, linux-kernel,
	Kirill A. Shutemov

From: "Kirill A. Shutemov" <kirill.shutemov@linux.intel.com>

The base scheme is the same as for anonymous pages, but we walk
mapping->i_mmap rather than anon_vma->rb_root.

Signed-off-by: Kirill A. Shutemov <kirill.shutemov@linux.intel.com>
---
 mm/huge_memory.c |   68 +++++++++++++++++++++++++++++++++++++++++++++---------
 1 file changed, 57 insertions(+), 11 deletions(-)

diff --git a/mm/huge_memory.c b/mm/huge_memory.c
index ed31e90..73974e8 100644
--- a/mm/huge_memory.c
+++ b/mm/huge_memory.c
@@ -1655,23 +1655,23 @@ static void __split_huge_page_refcount(struct page *page,
 		*/
 		page_tail->_mapcount = page->_mapcount;
 
-		BUG_ON(page_tail->mapping);
 		page_tail->mapping = page->mapping;
 
 		page_tail->index = page->index + i;
 		page_nid_xchg_last(page_tail, page_nid_last(page));
 
-		BUG_ON(!PageAnon(page_tail));
 		BUG_ON(!PageUptodate(page_tail));
 		BUG_ON(!PageDirty(page_tail));
-		BUG_ON(!PageSwapBacked(page_tail));
 
 		lru_add_page_tail(page, page_tail, lruvec, list);
 	}
 	atomic_sub(tail_count, &page->_count);
 	BUG_ON(atomic_read(&page->_count) <= 0);
 
-	__mod_zone_page_state(zone, NR_ANON_TRANSPARENT_HUGEPAGES, -1);
+	if (PageAnon(page))
+		__mod_zone_page_state(zone, NR_ANON_TRANSPARENT_HUGEPAGES, -1);
+	else
+		__mod_zone_page_state(zone, NR_FILE_TRANSPARENT_HUGEPAGES, -1);
 
 	ClearPageCompound(page);
 	compound_unlock(page);
@@ -1771,7 +1771,7 @@ static int __split_huge_page_map(struct page *page,
 }
 
 /* must be called with anon_vma->root->rwsem held */
-static void __split_huge_page(struct page *page,
+static void __split_anon_huge_page(struct page *page,
 			      struct anon_vma *anon_vma,
 			      struct list_head *list)
 {
@@ -1795,7 +1795,7 @@ static void __split_huge_page(struct page *page,
 	 * and establishes a child pmd before
 	 * __split_huge_page_splitting() freezes the parent pmd (so if
 	 * we fail to prevent copy_huge_pmd() from running until the
-	 * whole __split_huge_page() is complete), we will still see
+	 * whole __split_anon_huge_page() is complete), we will still see
 	 * the newly established pmd of the child later during the
 	 * walk, to be able to set it as pmd_trans_splitting too.
 	 */
@@ -1826,14 +1826,11 @@ static void __split_huge_page(struct page *page,
  * from the hugepage.
  * Return 0 if the hugepage is split successfully otherwise return 1.
  */
-int split_huge_page_to_list(struct page *page, struct list_head *list)
+static int split_anon_huge_page(struct page *page, struct list_head *list)
 {
 	struct anon_vma *anon_vma;
 	int ret = 1;
 
-	BUG_ON(is_huge_zero_page(page));
-	BUG_ON(!PageAnon(page));
-
 	/*
 	 * The caller does not necessarily hold an mmap_sem that would prevent
 	 * the anon_vma disappearing so we first we take a reference to it
@@ -1851,7 +1848,7 @@ int split_huge_page_to_list(struct page *page, struct list_head *list)
 		goto out_unlock;
 
 	BUG_ON(!PageSwapBacked(page));
-	__split_huge_page(page, anon_vma, list);
+	__split_anon_huge_page(page, anon_vma, list);
 	count_vm_event(THP_SPLIT);
 
 	BUG_ON(PageCompound(page));
@@ -1862,6 +1859,55 @@ out:
 	return ret;
 }
 
+static int split_file_huge_page(struct page *page, struct list_head *list)
+{
+	struct address_space *mapping = page->mapping;
+	pgoff_t pgoff = page->index << (PAGE_CACHE_SHIFT - PAGE_SHIFT);
+	struct vm_area_struct *vma;
+	int mapcount, mapcount2;
+
+	BUG_ON(!PageHead(page));
+	BUG_ON(PageTail(page));
+
+	mutex_lock(&mapping->i_mmap_mutex);
+	mapcount = 0;
+	vma_interval_tree_foreach(vma, &mapping->i_mmap, pgoff, pgoff) {
+		unsigned long addr = vma_address(page, vma);
+		mapcount += __split_huge_page_splitting(page, vma, addr);
+	}
+
+	if (mapcount != page_mapcount(page))
+		printk(KERN_ERR "mapcount %d page_mapcount %d\n",
+		       mapcount, page_mapcount(page));
+	BUG_ON(mapcount != page_mapcount(page));
+
+	__split_huge_page_refcount(page, list);
+
+	mapcount2 = 0;
+	vma_interval_tree_foreach(vma, &mapping->i_mmap, pgoff, pgoff) {
+		unsigned long addr = vma_address(page, vma);
+		mapcount2 += __split_huge_page_map(page, vma, addr);
+	}
+
+	if (mapcount != mapcount2)
+		printk(KERN_ERR "mapcount %d mapcount2 %d page_mapcount %d\n",
+		       mapcount, mapcount2, page_mapcount(page));
+	BUG_ON(mapcount != mapcount2);
+	count_vm_event(THP_SPLIT);
+	mutex_unlock(&mapping->i_mmap_mutex);
+	return 0;
+}
+
+int split_huge_page_to_list(struct page *page, struct list_head *list)
+{
+	BUG_ON(is_huge_zero_page(page));
+
+	if (PageAnon(page))
+		return split_anon_huge_page(page, list);
+	else
+		return split_file_huge_page(page, list);
+}
+
 #define VM_NO_THP (VM_SPECIAL|VM_MIXEDMAP|VM_HUGETLB|VM_SHARED|VM_MAYSHARE)
 
 int hugepage_madvise(struct vm_area_struct *vma,
-- 
1.7.10.4


^ permalink raw reply related	[flat|nested] 243+ messages in thread

* [PATCHv4 23/39] thp: wait_split_huge_page(): serialize over i_mmap_mutex too
  2013-05-12  1:22 ` Kirill A. Shutemov
@ 2013-05-12  1:23   ` Kirill A. Shutemov
  -1 siblings, 0 replies; 243+ messages in thread
From: Kirill A. Shutemov @ 2013-05-12  1:23 UTC (permalink / raw)
  To: Andrea Arcangeli, Andrew Morton
  Cc: Al Viro, Hugh Dickins, Wu Fengguang, Jan Kara, Mel Gorman,
	linux-mm, Andi Kleen, Matthew Wilcox, Kirill A. Shutemov,
	Hillf Danton, Dave Hansen, linux-fsdevel, linux-kernel,
	Kirill A. Shutemov

From: "Kirill A. Shutemov" <kirill.shutemov@linux.intel.com>

Since we're going to have huge pages backed by files,
wait_split_huge_page() has to serialize not only over the anon_vma lock,
but also over i_mmap_mutex.

Signed-off-by: Kirill A. Shutemov <kirill.shutemov@linux.intel.com>
---
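A minimal sketch of why taking and dropping the lock is enough (illustrative,
not part of the diff): the thread doing the split holds the anon_vma write
lock (anonymous pages) or i_mmap_mutex (file pages, as of the previous patch)
across the whole split, so a waiter that sees pmd_trans_splitting() only needs
to acquire and release the relevant lock to know the split has finished:

	if (unlikely(pmd_trans_splitting(*pmd))) {
		spin_unlock(&mm->page_table_lock);
		/* returns only after the splitter dropped its lock */
		wait_split_huge_page(vma, pmd);
	}
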
 include/linux/huge_mm.h |   15 ++++++++++++---
 mm/huge_memory.c        |    4 ++--
 mm/memory.c             |    4 ++--
 3 files changed, 16 insertions(+), 7 deletions(-)

diff --git a/include/linux/huge_mm.h b/include/linux/huge_mm.h
index 74494a2..9e6425f 100644
--- a/include/linux/huge_mm.h
+++ b/include/linux/huge_mm.h
@@ -118,11 +118,20 @@ extern void __split_huge_page_pmd(struct vm_area_struct *vma,
 			__split_huge_page_pmd(__vma, __address,		\
 					____pmd);			\
 	}  while (0)
-#define wait_split_huge_page(__anon_vma, __pmd)				\
+#define wait_split_huge_page(__vma, __pmd)				\
 	do {								\
 		pmd_t *____pmd = (__pmd);				\
-		anon_vma_lock_write(__anon_vma);			\
-		anon_vma_unlock_write(__anon_vma);			\
+		struct address_space *__mapping = (__vma)->vm_file ?	\
+				(__vma)->vm_file->f_mapping : NULL;	\
+		struct anon_vma *__anon_vma = (__vma)->anon_vma;	\
+		if (__mapping)						\
+			mutex_lock(&__mapping->i_mmap_mutex);		\
+		if (__anon_vma) {					\
+			anon_vma_lock_write(__anon_vma);		\
+			anon_vma_unlock_write(__anon_vma);		\
+		}							\
+		if (__mapping)						\
+			mutex_unlock(&__mapping->i_mmap_mutex);		\
 		BUG_ON(pmd_trans_splitting(*____pmd) ||			\
 		       pmd_trans_huge(*____pmd));			\
 	} while (0)
diff --git a/mm/huge_memory.c b/mm/huge_memory.c
index 73974e8..7ad458d 100644
--- a/mm/huge_memory.c
+++ b/mm/huge_memory.c
@@ -924,7 +924,7 @@ int copy_huge_pmd(struct mm_struct *dst_mm, struct mm_struct *src_mm,
 		spin_unlock(&dst_mm->page_table_lock);
 		pte_free(dst_mm, pgtable);
 
-		wait_split_huge_page(vma->anon_vma, src_pmd); /* src_vma */
+		wait_split_huge_page(vma, src_pmd); /* src_vma */
 		goto out;
 	}
 	src_page = pmd_page(pmd);
@@ -1497,7 +1497,7 @@ int __pmd_trans_huge_lock(pmd_t *pmd, struct vm_area_struct *vma)
 	if (likely(pmd_trans_huge(*pmd))) {
 		if (unlikely(pmd_trans_splitting(*pmd))) {
 			spin_unlock(&vma->vm_mm->page_table_lock);
-			wait_split_huge_page(vma->anon_vma, pmd);
+			wait_split_huge_page(vma, pmd);
 			return -1;
 		} else {
 			/* Thp mapped by 'pmd' is stable, so we can
diff --git a/mm/memory.c b/mm/memory.c
index f02a8be..c845cf2 100644
--- a/mm/memory.c
+++ b/mm/memory.c
@@ -620,7 +620,7 @@ int __pte_alloc(struct mm_struct *mm, struct vm_area_struct *vma,
 	if (new)
 		pte_free(mm, new);
 	if (wait_split_huge_page)
-		wait_split_huge_page(vma->anon_vma, pmd);
+		wait_split_huge_page(vma, pmd);
 	return 0;
 }
 
@@ -1530,7 +1530,7 @@ struct page *follow_page_mask(struct vm_area_struct *vma,
 		if (likely(pmd_trans_huge(*pmd))) {
 			if (unlikely(pmd_trans_splitting(*pmd))) {
 				spin_unlock(&mm->page_table_lock);
-				wait_split_huge_page(vma->anon_vma, pmd);
+				wait_split_huge_page(vma, pmd);
 			} else {
 				page = follow_trans_huge_pmd(vma, address,
 							     pmd, flags);
-- 
1.7.10.4


^ permalink raw reply related	[flat|nested] 243+ messages in thread

* [PATCHv4 24/39] thp, mm: truncate support for transparent huge page cache
  2013-05-12  1:22 ` Kirill A. Shutemov
@ 2013-05-12  1:23   ` Kirill A. Shutemov
  -1 siblings, 0 replies; 243+ messages in thread
From: Kirill A. Shutemov @ 2013-05-12  1:23 UTC (permalink / raw)
  To: Andrea Arcangeli, Andrew Morton
  Cc: Al Viro, Hugh Dickins, Wu Fengguang, Jan Kara, Mel Gorman,
	linux-mm, Andi Kleen, Matthew Wilcox, Kirill A. Shutemov,
	Hillf Danton, Dave Hansen, linux-fsdevel, linux-kernel,
	Kirill A. Shutemov

From: "Kirill A. Shutemov" <kirill.shutemov@linux.intel.com>

If the starting position of the truncation is in a tail page, we have to
split the huge page first.

We also have to split if the end is within the huge page. Otherwise we can
truncate the whole huge page at once.

Signed-off-by: Kirill A. Shutemov <kirill.shutemov@linux.intel.com>
---
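A worked example of the end check, assuming 4k base pages and 2M huge pages
on x86-64, so HPAGE_CACHE_NR is 512 and HPAGE_CACHE_INDEX_MASK is 511
(illustrative only):

	A huge page with head index 1024 covers indices 1024..1535.
	end == 1300: 1300 & ~511 == 1024 == index, the end falls inside this
	huge page, so it has to be split.
	end == 2000: 2000 & ~511 == 1536 != index, the whole huge page lies
	inside the truncated range, so it is dropped at once and the loop
	skips the 511 tail entries (i += HPAGE_CACHE_NR - 1).
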
 mm/truncate.c |   13 +++++++++++++
 1 file changed, 13 insertions(+)

diff --git a/mm/truncate.c b/mm/truncate.c
index c75b736..0152feb 100644
--- a/mm/truncate.c
+++ b/mm/truncate.c
@@ -231,6 +231,17 @@ void truncate_inode_pages_range(struct address_space *mapping,
 			if (index > end)
 				break;
 
+			/* split page if we start from tail page */
+			if (PageTransTail(page))
+				split_huge_page(compound_trans_head(page));
+			if (PageTransHuge(page)) {
+				/* split if end is within huge page */
+				if (index == (end & ~HPAGE_CACHE_INDEX_MASK))
+					split_huge_page(page);
+				else
+					/* skip tail pages */
+					i += HPAGE_CACHE_NR - 1;
+			}
 			if (!trylock_page(page))
 				continue;
 			WARN_ON(page->index != index);
@@ -280,6 +291,8 @@ void truncate_inode_pages_range(struct address_space *mapping,
 			if (index > end)
 				break;
 
+			if (PageTransHuge(page))
+				split_huge_page(page);
 			lock_page(page);
 			WARN_ON(page->index != index);
 			wait_on_page_writeback(page);
-- 
1.7.10.4


^ permalink raw reply related	[flat|nested] 243+ messages in thread

* [PATCHv4 25/39] thp, mm: split huge page on mmap file page
  2013-05-12  1:22 ` Kirill A. Shutemov
@ 2013-05-12  1:23   ` Kirill A. Shutemov
  -1 siblings, 0 replies; 243+ messages in thread
From: Kirill A. Shutemov @ 2013-05-12  1:23 UTC (permalink / raw)
  To: Andrea Arcangeli, Andrew Morton
  Cc: Al Viro, Hugh Dickins, Wu Fengguang, Jan Kara, Mel Gorman,
	linux-mm, Andi Kleen, Matthew Wilcox, Kirill A. Shutemov,
	Hillf Danton, Dave Hansen, linux-fsdevel, linux-kernel,
	Kirill A. Shutemov

From: "Kirill A. Shutemov" <kirill.shutemov@linux.intel.com>

We are not ready to mmap file-backed transparent huge pages. Let's split
them on fault attempt.

Later in the patchset we'll implement mmap() properly and this code path
will be used for fallback cases.

Signed-off-by: Kirill A. Shutemov <kirill.shutemov@linux.intel.com>
---
 mm/filemap.c |    2 ++
 1 file changed, 2 insertions(+)

diff --git a/mm/filemap.c b/mm/filemap.c
index ebd361a..9877347 100644
--- a/mm/filemap.c
+++ b/mm/filemap.c
@@ -1700,6 +1700,8 @@ retry_find:
 			goto no_cached_page;
 	}
 
+	if (PageTransCompound(page))
+		split_huge_page(compound_trans_head(page));
 	if (!lock_page_or_retry(page, vma->vm_mm, vmf->flags)) {
 		page_cache_release(page);
 		return ret | VM_FAULT_RETRY;
-- 
1.7.10.4


^ permalink raw reply related	[flat|nested] 243+ messages in thread

* [PATCHv4 26/39] ramfs: enable transparent huge page cache
  2013-05-12  1:22 ` Kirill A. Shutemov
@ 2013-05-12  1:23   ` Kirill A. Shutemov
  -1 siblings, 0 replies; 243+ messages in thread
From: Kirill A. Shutemov @ 2013-05-12  1:23 UTC (permalink / raw)
  To: Andrea Arcangeli, Andrew Morton
  Cc: Al Viro, Hugh Dickins, Wu Fengguang, Jan Kara, Mel Gorman,
	linux-mm, Andi Kleen, Matthew Wilcox, Kirill A. Shutemov,
	Hillf Danton, Dave Hansen, linux-fsdevel, linux-kernel,
	Kirill A. Shutemov

From: "Kirill A. Shutemov" <kirill.shutemov@linux.intel.com>

ramfs is the simplest fs from the page cache point of view. Let's start
transparent huge page cache enabling here.

For now we allocate only non-movable huge pages, since ramfs pages cannot
be moved yet.

Signed-off-by: Kirill A. Shutemov <kirill.shutemov@linux.intel.com>
---
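A short note on the gfp change below (the exact flag composition of
GFP_TRANSHUGE depends on the kernel version, so treat this as a sketch):

	/*
	 * GFP_TRANSHUGE & ~__GFP_MOVABLE still gives a compound huge page
	 * allocation with the usual THP allocation behaviour, but without
	 * __GFP_MOVABLE the pages are not placed in ZONE_MOVABLE/CMA
	 * regions, which rely on being able to migrate pages away --
	 * something ramfs cannot do yet.
	 */
	mapping_set_gfp_mask(inode->i_mapping, GFP_TRANSHUGE & ~__GFP_MOVABLE);
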
 fs/ramfs/inode.c |    6 +++++-
 1 file changed, 5 insertions(+), 1 deletion(-)

diff --git a/fs/ramfs/inode.c b/fs/ramfs/inode.c
index c24f1e1..54d69c7 100644
--- a/fs/ramfs/inode.c
+++ b/fs/ramfs/inode.c
@@ -61,7 +61,11 @@ struct inode *ramfs_get_inode(struct super_block *sb,
 		inode_init_owner(inode, dir, mode);
 		inode->i_mapping->a_ops = &ramfs_aops;
 		inode->i_mapping->backing_dev_info = &ramfs_backing_dev_info;
-		mapping_set_gfp_mask(inode->i_mapping, GFP_HIGHUSER);
+		/*
+		 * TODO: make ramfs pages movable
+		 */
+		mapping_set_gfp_mask(inode->i_mapping,
+				GFP_TRANSHUGE & ~__GFP_MOVABLE);
 		mapping_set_unevictable(inode->i_mapping);
 		inode->i_atime = inode->i_mtime = inode->i_ctime = CURRENT_TIME;
 		switch (mode & S_IFMT) {
-- 
1.7.10.4


^ permalink raw reply related	[flat|nested] 243+ messages in thread

* [PATCHv4 27/39] x86-64, mm: proper alignment mappings with hugepages
  2013-05-12  1:22 ` Kirill A. Shutemov
@ 2013-05-12  1:23   ` Kirill A. Shutemov
  -1 siblings, 0 replies; 243+ messages in thread
From: Kirill A. Shutemov @ 2013-05-12  1:23 UTC (permalink / raw)
  To: Andrea Arcangeli, Andrew Morton
  Cc: Al Viro, Hugh Dickins, Wu Fengguang, Jan Kara, Mel Gorman,
	linux-mm, Andi Kleen, Matthew Wilcox, Kirill A. Shutemov,
	Hillf Danton, Dave Hansen, linux-fsdevel, linux-kernel,
	Kirill A. Shutemov

From: "Kirill A. Shutemov" <kirill.shutemov@linux.intel.com>

Make arch_get_unmapped_area() return an unmapped area aligned for huge
pages (using HPAGE_MASK) if the file mapping can have huge pages.

Signed-off-by: Kirill A. Shutemov <kirill.shutemov@linux.intel.com>
---
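A worked example of the new alignment mask, assuming x86-64 with 4k base
pages and 2M huge pages (illustrative only):

	PAGE_MASK               == 0xfffffffffffff000
	HPAGE_MASK              == 0xffffffffffe00000
	PAGE_MASK & ~HPAGE_MASK == 0x00000000001ff000

With info.align_mask set to 0x1ff000 and info.align_offset = pgoff <<
PAGE_SHIFT, vm_unmapped_area() picks an address whose offset within a 2M
region matches the file offset's, so a huge-page-aligned file offset ends
up at a huge-page-aligned virtual address.
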
 arch/x86/kernel/sys_x86_64.c |   12 ++++++++++--
 1 file changed, 10 insertions(+), 2 deletions(-)

diff --git a/arch/x86/kernel/sys_x86_64.c b/arch/x86/kernel/sys_x86_64.c
index dbded5a..d97ab40 100644
--- a/arch/x86/kernel/sys_x86_64.c
+++ b/arch/x86/kernel/sys_x86_64.c
@@ -15,6 +15,7 @@
 #include <linux/random.h>
 #include <linux/uaccess.h>
 #include <linux/elf.h>
+#include <linux/pagemap.h>
 
 #include <asm/ia32.h>
 #include <asm/syscalls.h>
@@ -34,6 +35,13 @@ static unsigned long get_align_mask(void)
 	return va_align.mask;
 }
 
+static inline unsigned long mapping_align_mask(struct address_space *mapping)
+{
+	if (mapping_can_have_hugepages(mapping))
+		return PAGE_MASK & ~HPAGE_MASK;
+	return get_align_mask();
+}
+
 unsigned long align_vdso_addr(unsigned long addr)
 {
 	unsigned long align_mask = get_align_mask();
@@ -135,7 +143,7 @@ arch_get_unmapped_area(struct file *filp, unsigned long addr,
 	info.length = len;
 	info.low_limit = begin;
 	info.high_limit = end;
-	info.align_mask = filp ? get_align_mask() : 0;
+	info.align_mask = filp ? mapping_align_mask(filp->f_mapping) : 0;
 	info.align_offset = pgoff << PAGE_SHIFT;
 	return vm_unmapped_area(&info);
 }
@@ -174,7 +182,7 @@ arch_get_unmapped_area_topdown(struct file *filp, const unsigned long addr0,
 	info.length = len;
 	info.low_limit = PAGE_SIZE;
 	info.high_limit = mm->mmap_base;
-	info.align_mask = filp ? get_align_mask() : 0;
+	info.align_mask = filp ? mapping_align_mask(filp->f_mapping) : 0;
 	info.align_offset = pgoff << PAGE_SHIFT;
 	addr = vm_unmapped_area(&info);
 	if (!(addr & ~PAGE_MASK))
-- 
1.7.10.4


^ permalink raw reply related	[flat|nested] 243+ messages in thread

* [PATCHv4 28/39] thp: prepare zap_huge_pmd() to uncharge file pages
  2013-05-12  1:22 ` Kirill A. Shutemov
@ 2013-05-12  1:23   ` Kirill A. Shutemov
  -1 siblings, 0 replies; 243+ messages in thread
From: Kirill A. Shutemov @ 2013-05-12  1:23 UTC (permalink / raw)
  To: Andrea Arcangeli, Andrew Morton
  Cc: Al Viro, Hugh Dickins, Wu Fengguang, Jan Kara, Mel Gorman,
	linux-mm, Andi Kleen, Matthew Wilcox, Kirill A. Shutemov,
	Hillf Danton, Dave Hansen, linux-fsdevel, linux-kernel,
	Kirill A. Shutemov

From: "Kirill A. Shutemov" <kirill.shutemov@linux.intel.com>

Uncharge pages from the correct counter.

Signed-off-by: Kirill A. Shutemov <kirill.shutemov@linux.intel.com>
---
 mm/huge_memory.c |    4 +++-
 1 file changed, 3 insertions(+), 1 deletion(-)

diff --git a/mm/huge_memory.c b/mm/huge_memory.c
index 7ad458d..a88f9b2 100644
--- a/mm/huge_memory.c
+++ b/mm/huge_memory.c
@@ -1385,10 +1385,12 @@ int zap_huge_pmd(struct mmu_gather *tlb, struct vm_area_struct *vma,
 			spin_unlock(&tlb->mm->page_table_lock);
 			put_huge_zero_page();
 		} else {
+			int member;
 			page = pmd_page(orig_pmd);
 			page_remove_rmap(page);
 			VM_BUG_ON(page_mapcount(page) < 0);
-			add_mm_counter(tlb->mm, MM_ANONPAGES, -HPAGE_PMD_NR);
+			member = PageAnon(page) ? MM_ANONPAGES : MM_FILEPAGES;
+			add_mm_counter(tlb->mm, member, -HPAGE_PMD_NR);
 			VM_BUG_ON(!PageHead(page));
 			tlb->mm->nr_ptes--;
 			spin_unlock(&tlb->mm->page_table_lock);
-- 
1.7.10.4


^ permalink raw reply related	[flat|nested] 243+ messages in thread

* [PATCHv4 29/39] thp: move maybe_pmd_mkwrite() out of mk_huge_pmd()
  2013-05-12  1:22 ` Kirill A. Shutemov
@ 2013-05-12  1:23   ` Kirill A. Shutemov
  -1 siblings, 0 replies; 243+ messages in thread
From: Kirill A. Shutemov @ 2013-05-12  1:23 UTC (permalink / raw)
  To: Andrea Arcangeli, Andrew Morton
  Cc: Al Viro, Hugh Dickins, Wu Fengguang, Jan Kara, Mel Gorman,
	linux-mm, Andi Kleen, Matthew Wilcox, Kirill A. Shutemov,
	Hillf Danton, Dave Hansen, linux-fsdevel, linux-kernel,
	Kirill A. Shutemov

From: "Kirill A. Shutemov" <kirill.shutemov@linux.intel.com>

It's confusing that mk_huge_pmd() has semantics different from mk_pte()
or mk_pmd().

Let's move maybe_pmd_mkwrite() out of mk_huge_pmd() and adjust its
prototype to match mk_pte().

Signed-off-by: Kirill A. Shutemov <kirill.shutemov@linux.intel.com>
---
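After the change the huge pmd path mirrors the convention of the pte path;
a minimal sketch of the resulting call sequence (the write/dirty part is
only applied on write faults in the real callers):

	/* pte path (existing convention): */
	entry = mk_pte(page, vma->vm_page_prot);
	entry = maybe_mkwrite(pte_mkdirty(entry), vma);

	/* huge pmd path (after this patch, now analogous): */
	entry = mk_huge_pmd(page, vma->vm_page_prot);
	entry = maybe_pmd_mkwrite(pmd_mkdirty(entry), vma);
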
 mm/huge_memory.c |   14 ++++++++------
 1 file changed, 8 insertions(+), 6 deletions(-)

diff --git a/mm/huge_memory.c b/mm/huge_memory.c
index a88f9b2..575f29b 100644
--- a/mm/huge_memory.c
+++ b/mm/huge_memory.c
@@ -709,11 +709,10 @@ pmd_t maybe_pmd_mkwrite(pmd_t pmd, struct vm_area_struct *vma)
 	return pmd;
 }
 
-static inline pmd_t mk_huge_pmd(struct page *page, struct vm_area_struct *vma)
+static inline pmd_t mk_huge_pmd(struct page *page, pgprot_t prot)
 {
 	pmd_t entry;
-	entry = mk_pmd(page, vma->vm_page_prot);
-	entry = maybe_pmd_mkwrite(pmd_mkdirty(entry), vma);
+	entry = mk_pmd(page, prot);
 	entry = pmd_mkhuge(entry);
 	return entry;
 }
@@ -746,7 +745,8 @@ static int __do_huge_pmd_anonymous_page(struct mm_struct *mm,
 		pte_free(mm, pgtable);
 	} else {
 		pmd_t entry;
-		entry = mk_huge_pmd(page, vma);
+		entry = mk_huge_pmd(page, vma->vm_page_prot);
+		entry = maybe_pmd_mkwrite(pmd_mkdirty(entry), vma);
 		page_add_new_anon_rmap(page, vma, haddr);
 		set_pmd_at(mm, haddr, pmd, entry);
 		pgtable_trans_huge_deposit(mm, pgtable);
@@ -1229,7 +1229,8 @@ alloc:
 		goto out_mn;
 	} else {
 		pmd_t entry;
-		entry = mk_huge_pmd(new_page, vma);
+		entry = mk_huge_pmd(new_page, vma->vm_page_prot);
+		entry = maybe_pmd_mkwrite(pmd_mkdirty(entry), vma);
 		pmdp_clear_flush(vma, haddr, pmd);
 		page_add_new_anon_rmap(new_page, vma, haddr);
 		set_pmd_at(mm, haddr, pmd, entry);
@@ -2410,7 +2411,8 @@ static void collapse_huge_page(struct mm_struct *mm,
 	__SetPageUptodate(new_page);
 	pgtable = pmd_pgtable(_pmd);
 
-	_pmd = mk_huge_pmd(new_page, vma);
+	_pmd = mk_huge_pmd(new_page, vma->vm_page_prot);
+	_pmd = maybe_pmd_mkwrite(pmd_mkdirty(_pmd), vma);
 
 	/*
 	 * spin_lock() below is not the equivalent of smp_wmb(), so
-- 
1.7.10.4


^ permalink raw reply related	[flat|nested] 243+ messages in thread

* [PATCHv4 30/39] thp: do_huge_pmd_anonymous_page() cleanup
  2013-05-12  1:22 ` Kirill A. Shutemov
@ 2013-05-12  1:23   ` Kirill A. Shutemov
  -1 siblings, 0 replies; 243+ messages in thread
From: Kirill A. Shutemov @ 2013-05-12  1:23 UTC (permalink / raw)
  To: Andrea Arcangeli, Andrew Morton
  Cc: Al Viro, Hugh Dickins, Wu Fengguang, Jan Kara, Mel Gorman,
	linux-mm, Andi Kleen, Matthew Wilcox, Kirill A. Shutemov,
	Hillf Danton, Dave Hansen, linux-fsdevel, linux-kernel,
	Kirill A. Shutemov

From: "Kirill A. Shutemov" <kirill.shutemov@linux.intel.com>

Minor cleanup: unindent most of the function's code by inverting one
condition. It's preparation for the next patch.

No functional changes.

Signed-off-by: Kirill A. Shutemov <kirill.shutemov@linux.intel.com>
---
 mm/huge_memory.c |   83 +++++++++++++++++++++++++++---------------------------
 1 file changed, 41 insertions(+), 42 deletions(-)

diff --git a/mm/huge_memory.c b/mm/huge_memory.c
index 575f29b..ab07f5d 100644
--- a/mm/huge_memory.c
+++ b/mm/huge_memory.c
@@ -804,55 +804,54 @@ int do_huge_pmd_anonymous_page(struct mm_struct *mm, struct vm_area_struct *vma,
 	unsigned long haddr = address & HPAGE_PMD_MASK;
 	pte_t *pte;
 
-	if (haddr >= vma->vm_start && haddr + HPAGE_PMD_SIZE <= vma->vm_end) {
-		if (unlikely(anon_vma_prepare(vma)))
-			return VM_FAULT_OOM;
-		if (unlikely(khugepaged_enter(vma)))
+	if (haddr < vma->vm_start || haddr + HPAGE_PMD_SIZE > vma->vm_end)
+		goto out;
+	if (unlikely(anon_vma_prepare(vma)))
+		return VM_FAULT_OOM;
+	if (unlikely(khugepaged_enter(vma)))
+		return VM_FAULT_OOM;
+	if (!(flags & FAULT_FLAG_WRITE) &&
+			transparent_hugepage_use_zero_page()) {
+		pgtable_t pgtable;
+		struct page *zero_page;
+		bool set;
+		pgtable = pte_alloc_one(mm, haddr);
+		if (unlikely(!pgtable))
 			return VM_FAULT_OOM;
-		if (!(flags & FAULT_FLAG_WRITE) &&
-				transparent_hugepage_use_zero_page()) {
-			pgtable_t pgtable;
-			struct page *zero_page;
-			bool set;
-			pgtable = pte_alloc_one(mm, haddr);
-			if (unlikely(!pgtable))
-				return VM_FAULT_OOM;
-			zero_page = get_huge_zero_page();
-			if (unlikely(!zero_page)) {
-				pte_free(mm, pgtable);
-				count_vm_event(THP_FAULT_FALLBACK);
-				goto out;
-			}
-			spin_lock(&mm->page_table_lock);
-			set = set_huge_zero_page(pgtable, mm, vma, haddr, pmd,
-					zero_page);
-			spin_unlock(&mm->page_table_lock);
-			if (!set) {
-				pte_free(mm, pgtable);
-				put_huge_zero_page();
-			}
-			return 0;
-		}
-		page = alloc_hugepage_vma(transparent_hugepage_defrag(vma),
-					  vma, haddr, numa_node_id(), 0);
-		if (unlikely(!page)) {
+		zero_page = get_huge_zero_page();
+		if (unlikely(!zero_page)) {
+			pte_free(mm, pgtable);
 			count_vm_event(THP_FAULT_FALLBACK);
 			goto out;
 		}
-		count_vm_event(THP_FAULT_ALLOC);
-		if (unlikely(mem_cgroup_newpage_charge(page, mm, GFP_KERNEL))) {
-			put_page(page);
-			goto out;
-		}
-		if (unlikely(__do_huge_pmd_anonymous_page(mm, vma, haddr, pmd,
-							  page))) {
-			mem_cgroup_uncharge_page(page);
-			put_page(page);
-			goto out;
+		spin_lock(&mm->page_table_lock);
+		set = set_huge_zero_page(pgtable, mm, vma, haddr, pmd,
+				zero_page);
+		spin_unlock(&mm->page_table_lock);
+		if (!set) {
+			pte_free(mm, pgtable);
+			put_huge_zero_page();
 		}
-
 		return 0;
 	}
+	page = alloc_hugepage_vma(transparent_hugepage_defrag(vma),
+			vma, haddr, numa_node_id(), 0);
+	if (unlikely(!page)) {
+		count_vm_event(THP_FAULT_FALLBACK);
+		goto out;
+	}
+	count_vm_event(THP_FAULT_ALLOC);
+	if (unlikely(mem_cgroup_newpage_charge(page, mm, GFP_KERNEL))) {
+		put_page(page);
+		goto out;
+	}
+	if (unlikely(__do_huge_pmd_anonymous_page(mm, vma, haddr, pmd, page))) {
+		mem_cgroup_uncharge_page(page);
+		put_page(page);
+		goto out;
+	}
+
+	return 0;
 out:
 	/*
 	 * Use __pte_alloc instead of pte_alloc_map, because we can't
-- 
1.7.10.4


^ permalink raw reply related	[flat|nested] 243+ messages in thread

* [PATCHv4 31/39] thp: consolidate code between handle_mm_fault() and do_huge_pmd_anonymous_page()
  2013-05-12  1:22 ` Kirill A. Shutemov
@ 2013-05-12  1:23   ` Kirill A. Shutemov
  -1 siblings, 0 replies; 243+ messages in thread
From: Kirill A. Shutemov @ 2013-05-12  1:23 UTC (permalink / raw)
  To: Andrea Arcangeli, Andrew Morton
  Cc: Al Viro, Hugh Dickins, Wu Fengguang, Jan Kara, Mel Gorman,
	linux-mm, Andi Kleen, Matthew Wilcox, Kirill A. Shutemov,
	Hillf Danton, Dave Hansen, linux-fsdevel, linux-kernel,
	Kirill A. Shutemov

From: "Kirill A. Shutemov" <kirill.shutemov@linux.intel.com>

do_huge_pmd_anonymous_page() has a copy-pasted piece of handle_mm_fault()
to handle the fallback path.

Let's consolidate the code by introducing a VM_FAULT_FALLBACK return
code.

Signed-off-by: Kirill A. Shutemov <kirill.shutemov@linux.intel.com>
---
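The caller-side pattern this enables (a sketch matching the mm/memory.c hunk
below): a huge page fault path can bail out with VM_FAULT_FALLBACK and let
the generic 4k path run, instead of open-coding it:

	ret = do_huge_pmd_anonymous_page(mm, vma, address, pmd, flags);
	if (!(ret & VM_FAULT_FALLBACK))
		return ret;
	/* otherwise fall through and handle the fault with regular ptes */
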
 include/linux/huge_mm.h |    3 ---
 include/linux/mm.h      |    3 ++-
 mm/huge_memory.c        |   31 +++++--------------------------
 mm/memory.c             |    9 ++++++---
 4 files changed, 13 insertions(+), 33 deletions(-)

diff --git a/include/linux/huge_mm.h b/include/linux/huge_mm.h
index 9e6425f..d688271 100644
--- a/include/linux/huge_mm.h
+++ b/include/linux/huge_mm.h
@@ -101,9 +101,6 @@ extern int copy_pte_range(struct mm_struct *dst_mm, struct mm_struct *src_mm,
 			  pmd_t *dst_pmd, pmd_t *src_pmd,
 			  struct vm_area_struct *vma,
 			  unsigned long addr, unsigned long end);
-extern int handle_pte_fault(struct mm_struct *mm,
-			    struct vm_area_struct *vma, unsigned long address,
-			    pte_t *pte, pmd_t *pmd, unsigned int flags);
 extern int split_huge_page_to_list(struct page *page, struct list_head *list);
 static inline int split_huge_page(struct page *page)
 {
diff --git a/include/linux/mm.h b/include/linux/mm.h
index 5e156fb..280b414 100644
--- a/include/linux/mm.h
+++ b/include/linux/mm.h
@@ -881,11 +881,12 @@ static inline int page_mapped(struct page *page)
 #define VM_FAULT_NOPAGE	0x0100	/* ->fault installed the pte, not return page */
 #define VM_FAULT_LOCKED	0x0200	/* ->fault locked the returned page */
 #define VM_FAULT_RETRY	0x0400	/* ->fault blocked, must retry */
+#define VM_FAULT_FALLBACK 0x0800	/* huge page fault failed, fall back to small */
 
 #define VM_FAULT_HWPOISON_LARGE_MASK 0xf000 /* encodes hpage index for large hwpoison */
 
 #define VM_FAULT_ERROR	(VM_FAULT_OOM | VM_FAULT_SIGBUS | VM_FAULT_HWPOISON | \
-			 VM_FAULT_HWPOISON_LARGE)
+			 VM_FAULT_FALLBACK | VM_FAULT_HWPOISON_LARGE)
 
 /* Encode hstate index for a hwpoisoned large page */
 #define VM_FAULT_SET_HINDEX(x) ((x) << 12)
diff --git a/mm/huge_memory.c b/mm/huge_memory.c
index ab07f5d..facfdac 100644
--- a/mm/huge_memory.c
+++ b/mm/huge_memory.c
@@ -802,10 +802,9 @@ int do_huge_pmd_anonymous_page(struct mm_struct *mm, struct vm_area_struct *vma,
 {
 	struct page *page;
 	unsigned long haddr = address & HPAGE_PMD_MASK;
-	pte_t *pte;
 
 	if (haddr < vma->vm_start || haddr + HPAGE_PMD_SIZE > vma->vm_end)
-		goto out;
+		return VM_FAULT_FALLBACK;
 	if (unlikely(anon_vma_prepare(vma)))
 		return VM_FAULT_OOM;
 	if (unlikely(khugepaged_enter(vma)))
@@ -822,7 +821,7 @@ int do_huge_pmd_anonymous_page(struct mm_struct *mm, struct vm_area_struct *vma,
 		if (unlikely(!zero_page)) {
 			pte_free(mm, pgtable);
 			count_vm_event(THP_FAULT_FALLBACK);
-			goto out;
+			return VM_FAULT_FALLBACK;
 		}
 		spin_lock(&mm->page_table_lock);
 		set = set_huge_zero_page(pgtable, mm, vma, haddr, pmd,
@@ -838,40 +837,20 @@ int do_huge_pmd_anonymous_page(struct mm_struct *mm, struct vm_area_struct *vma,
 			vma, haddr, numa_node_id(), 0);
 	if (unlikely(!page)) {
 		count_vm_event(THP_FAULT_FALLBACK);
-		goto out;
+		return VM_FAULT_FALLBACK;
 	}
 	count_vm_event(THP_FAULT_ALLOC);
 	if (unlikely(mem_cgroup_newpage_charge(page, mm, GFP_KERNEL))) {
 		put_page(page);
-		goto out;
+		return VM_FAULT_FALLBACK;
 	}
 	if (unlikely(__do_huge_pmd_anonymous_page(mm, vma, haddr, pmd, page))) {
 		mem_cgroup_uncharge_page(page);
 		put_page(page);
-		goto out;
+		return VM_FAULT_FALLBACK;
 	}
 
 	return 0;
-out:
-	/*
-	 * Use __pte_alloc instead of pte_alloc_map, because we can't
-	 * run pte_offset_map on the pmd, if an huge pmd could
-	 * materialize from under us from a different thread.
-	 */
-	if (unlikely(pmd_none(*pmd)) &&
-	    unlikely(__pte_alloc(mm, vma, pmd, address)))
-		return VM_FAULT_OOM;
-	/* if an huge pmd materialized from under us just retry later */
-	if (unlikely(pmd_trans_huge(*pmd)))
-		return 0;
-	/*
-	 * A regular pmd is established and it can't morph into a huge pmd
-	 * from under us anymore at this point because we hold the mmap_sem
-	 * read mode and khugepaged takes it in write mode. So now it's
-	 * safe to run pte_offset_map().
-	 */
-	pte = pte_offset_map(pmd, address);
-	return handle_pte_fault(mm, vma, address, pte, pmd, flags);
 }
 
 int copy_huge_pmd(struct mm_struct *dst_mm, struct mm_struct *src_mm,
diff --git a/mm/memory.c b/mm/memory.c
index c845cf2..4008d93 100644
--- a/mm/memory.c
+++ b/mm/memory.c
@@ -3701,7 +3701,7 @@ static int do_pmd_numa_page(struct mm_struct *mm, struct vm_area_struct *vma,
  * but allow concurrent faults), and pte mapped but not yet locked.
  * We return with mmap_sem still held, but pte unmapped and unlocked.
  */
-int handle_pte_fault(struct mm_struct *mm,
+static int handle_pte_fault(struct mm_struct *mm,
 		     struct vm_area_struct *vma, unsigned long address,
 		     pte_t *pte, pmd_t *pmd, unsigned int flags)
 {
@@ -3788,9 +3788,12 @@ retry:
 	if (!pmd)
 		return VM_FAULT_OOM;
 	if (pmd_none(*pmd) && transparent_hugepage_enabled(vma)) {
+		int ret = 0;
 		if (!vma->vm_ops)
-			return do_huge_pmd_anonymous_page(mm, vma, address,
-							  pmd, flags);
+			ret = do_huge_pmd_anonymous_page(mm, vma, address,
+					pmd, flags);
+		if ((ret & VM_FAULT_FALLBACK) == 0)
+			return ret;
 	} else {
 		pmd_t orig_pmd = *pmd;
 		int ret;
-- 
1.7.10.4


^ permalink raw reply related	[flat|nested] 243+ messages in thread

* [PATCHv4 31/39] thp: consolidate code between handle_mm_fault() and do_huge_pmd_anonymous_page()
@ 2013-05-12  1:23   ` Kirill A. Shutemov
  0 siblings, 0 replies; 243+ messages in thread
From: Kirill A. Shutemov @ 2013-05-12  1:23 UTC (permalink / raw)
  To: Andrea Arcangeli, Andrew Morton
  Cc: Al Viro, Hugh Dickins, Wu Fengguang, Jan Kara, Mel Gorman,
	linux-mm, Andi Kleen, Matthew Wilcox, Kirill A. Shutemov,
	Hillf Danton, Dave Hansen, linux-fsdevel, linux-kernel,
	Kirill A. Shutemov

From: "Kirill A. Shutemov" <kirill.shutemov@linux.intel.com>

do_huge_pmd_anonymous_page() has copy-pasted piece of handle_mm_fault()
to handle fallback path.

Let's consolidate code back by introducing VM_FAULT_FALLBACK return
code.

Signed-off-by: Kirill A. Shutemov <kirill.shutemov@linux.intel.com>
---
 include/linux/huge_mm.h |    3 ---
 include/linux/mm.h      |    3 ++-
 mm/huge_memory.c        |   31 +++++--------------------------
 mm/memory.c             |    9 ++++++---
 4 files changed, 13 insertions(+), 33 deletions(-)

diff --git a/include/linux/huge_mm.h b/include/linux/huge_mm.h
index 9e6425f..d688271 100644
--- a/include/linux/huge_mm.h
+++ b/include/linux/huge_mm.h
@@ -101,9 +101,6 @@ extern int copy_pte_range(struct mm_struct *dst_mm, struct mm_struct *src_mm,
 			  pmd_t *dst_pmd, pmd_t *src_pmd,
 			  struct vm_area_struct *vma,
 			  unsigned long addr, unsigned long end);
-extern int handle_pte_fault(struct mm_struct *mm,
-			    struct vm_area_struct *vma, unsigned long address,
-			    pte_t *pte, pmd_t *pmd, unsigned int flags);
 extern int split_huge_page_to_list(struct page *page, struct list_head *list);
 static inline int split_huge_page(struct page *page)
 {
diff --git a/include/linux/mm.h b/include/linux/mm.h
index 5e156fb..280b414 100644
--- a/include/linux/mm.h
+++ b/include/linux/mm.h
@@ -881,11 +881,12 @@ static inline int page_mapped(struct page *page)
 #define VM_FAULT_NOPAGE	0x0100	/* ->fault installed the pte, not return page */
 #define VM_FAULT_LOCKED	0x0200	/* ->fault locked the returned page */
 #define VM_FAULT_RETRY	0x0400	/* ->fault blocked, must retry */
+#define VM_FAULT_FALLBACK 0x0800	/* huge page fault failed, fall back to small */
 
 #define VM_FAULT_HWPOISON_LARGE_MASK 0xf000 /* encodes hpage index for large hwpoison */
 
 #define VM_FAULT_ERROR	(VM_FAULT_OOM | VM_FAULT_SIGBUS | VM_FAULT_HWPOISON | \
-			 VM_FAULT_HWPOISON_LARGE)
+			 VM_FAULT_FALLBACK | VM_FAULT_HWPOISON_LARGE)
 
 /* Encode hstate index for a hwpoisoned large page */
 #define VM_FAULT_SET_HINDEX(x) ((x) << 12)
diff --git a/mm/huge_memory.c b/mm/huge_memory.c
index ab07f5d..facfdac 100644
--- a/mm/huge_memory.c
+++ b/mm/huge_memory.c
@@ -802,10 +802,9 @@ int do_huge_pmd_anonymous_page(struct mm_struct *mm, struct vm_area_struct *vma,
 {
 	struct page *page;
 	unsigned long haddr = address & HPAGE_PMD_MASK;
-	pte_t *pte;
 
 	if (haddr < vma->vm_start || haddr + HPAGE_PMD_SIZE > vma->vm_end)
-		goto out;
+		return VM_FAULT_FALLBACK;
 	if (unlikely(anon_vma_prepare(vma)))
 		return VM_FAULT_OOM;
 	if (unlikely(khugepaged_enter(vma)))
@@ -822,7 +821,7 @@ int do_huge_pmd_anonymous_page(struct mm_struct *mm, struct vm_area_struct *vma,
 		if (unlikely(!zero_page)) {
 			pte_free(mm, pgtable);
 			count_vm_event(THP_FAULT_FALLBACK);
-			goto out;
+			return VM_FAULT_FALLBACK;
 		}
 		spin_lock(&mm->page_table_lock);
 		set = set_huge_zero_page(pgtable, mm, vma, haddr, pmd,
@@ -838,40 +837,20 @@ int do_huge_pmd_anonymous_page(struct mm_struct *mm, struct vm_area_struct *vma,
 			vma, haddr, numa_node_id(), 0);
 	if (unlikely(!page)) {
 		count_vm_event(THP_FAULT_FALLBACK);
-		goto out;
+		return VM_FAULT_FALLBACK;
 	}
 	count_vm_event(THP_FAULT_ALLOC);
 	if (unlikely(mem_cgroup_newpage_charge(page, mm, GFP_KERNEL))) {
 		put_page(page);
-		goto out;
+		return VM_FAULT_FALLBACK;
 	}
 	if (unlikely(__do_huge_pmd_anonymous_page(mm, vma, haddr, pmd, page))) {
 		mem_cgroup_uncharge_page(page);
 		put_page(page);
-		goto out;
+		return VM_FAULT_FALLBACK;
 	}
 
 	return 0;
-out:
-	/*
-	 * Use __pte_alloc instead of pte_alloc_map, because we can't
-	 * run pte_offset_map on the pmd, if an huge pmd could
-	 * materialize from under us from a different thread.
-	 */
-	if (unlikely(pmd_none(*pmd)) &&
-	    unlikely(__pte_alloc(mm, vma, pmd, address)))
-		return VM_FAULT_OOM;
-	/* if an huge pmd materialized from under us just retry later */
-	if (unlikely(pmd_trans_huge(*pmd)))
-		return 0;
-	/*
-	 * A regular pmd is established and it can't morph into a huge pmd
-	 * from under us anymore at this point because we hold the mmap_sem
-	 * read mode and khugepaged takes it in write mode. So now it's
-	 * safe to run pte_offset_map().
-	 */
-	pte = pte_offset_map(pmd, address);
-	return handle_pte_fault(mm, vma, address, pte, pmd, flags);
 }
 
 int copy_huge_pmd(struct mm_struct *dst_mm, struct mm_struct *src_mm,
diff --git a/mm/memory.c b/mm/memory.c
index c845cf2..4008d93 100644
--- a/mm/memory.c
+++ b/mm/memory.c
@@ -3701,7 +3701,7 @@ static int do_pmd_numa_page(struct mm_struct *mm, struct vm_area_struct *vma,
  * but allow concurrent faults), and pte mapped but not yet locked.
  * We return with mmap_sem still held, but pte unmapped and unlocked.
  */
-int handle_pte_fault(struct mm_struct *mm,
+static int handle_pte_fault(struct mm_struct *mm,
 		     struct vm_area_struct *vma, unsigned long address,
 		     pte_t *pte, pmd_t *pmd, unsigned int flags)
 {
@@ -3788,9 +3788,12 @@ retry:
 	if (!pmd)
 		return VM_FAULT_OOM;
 	if (pmd_none(*pmd) && transparent_hugepage_enabled(vma)) {
+		int ret = 0;
 		if (!vma->vm_ops)
-			return do_huge_pmd_anonymous_page(mm, vma, address,
-							  pmd, flags);
+			ret = do_huge_pmd_anonymous_page(mm, vma, address,
+					pmd, flags);
+		if ((ret & VM_FAULT_FALLBACK) == 0)
+			return ret;
 	} else {
 		pmd_t orig_pmd = *pmd;
 		int ret;
-- 
1.7.10.4


^ permalink raw reply related	[flat|nested] 243+ messages in thread

* [PATCHv4 32/39] mm: cleanup __do_fault() implementation
  2013-05-12  1:22 ` Kirill A. Shutemov
@ 2013-05-12  1:23   ` Kirill A. Shutemov
  -1 siblings, 0 replies; 243+ messages in thread
From: Kirill A. Shutemov @ 2013-05-12  1:23 UTC (permalink / raw)
  To: Andrea Arcangeli, Andrew Morton
  Cc: Al Viro, Hugh Dickins, Wu Fengguang, Jan Kara, Mel Gorman,
	linux-mm, Andi Kleen, Matthew Wilcox, Kirill A. Shutemov,
	Hillf Danton, Dave Hansen, linux-fsdevel, linux-kernel,
	Kirill A. Shutemov

From: "Kirill A. Shutemov" <kirill.shutemov@linux.intel.com>

Let's clean up __do_fault() to prepare it for injecting transparent huge
page support.

Cleanups:
 - int -> bool where appropriate;
 - unindent some code by inverting an 'if' condition (see the sketch
   below);
 - extract the !pte_same() path to make it clearer;
 - separate the pte update from the mm stats update;
 - reformat some comments;

Functionality is not changed.
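
As an illustration of the "unindent" item, condensed from the diff below
(not a standalone hunk): the write-only part of the path is now skipped
up front instead of being wrapped in one large conditional:

	if (!write)
		goto update_pgtable;
	/* the former body of "if (flags & FAULT_FLAG_WRITE) { ... }"
	 * continues here, one indentation level to the left */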

Signed-off-by: Kirill A. Shutemov <kirill.shutemov@linux.intel.com>
---
 mm/memory.c |  157 +++++++++++++++++++++++++++++------------------------------
 1 file changed, 76 insertions(+), 81 deletions(-)

diff --git a/mm/memory.c b/mm/memory.c
index 4008d93..97b22c7 100644
--- a/mm/memory.c
+++ b/mm/memory.c
@@ -3301,21 +3301,18 @@ static int __do_fault(struct mm_struct *mm, struct vm_area_struct *vma,
 {
 	pte_t *page_table;
 	spinlock_t *ptl;
-	struct page *page;
-	struct page *cow_page;
+	struct page *page, *cow_page, *dirty_page = NULL;
 	pte_t entry;
-	int anon = 0;
-	struct page *dirty_page = NULL;
+	bool anon = false, page_mkwrite = false;
+	bool write = flags & FAULT_FLAG_WRITE;
 	struct vm_fault vmf;
 	int ret;
-	int page_mkwrite = 0;
 
 	/*
 	 * If we do COW later, allocate page befor taking lock_page()
 	 * on the file cache page. This will reduce lock holding time.
 	 */
-	if ((flags & FAULT_FLAG_WRITE) && !(vma->vm_flags & VM_SHARED)) {
-
+	if (write && !(vma->vm_flags & VM_SHARED)) {
 		if (unlikely(anon_vma_prepare(vma)))
 			return VM_FAULT_OOM;
 
@@ -3336,8 +3333,7 @@ static int __do_fault(struct mm_struct *mm, struct vm_area_struct *vma,
 	vmf.page = NULL;
 
 	ret = vma->vm_ops->fault(vma, &vmf);
-	if (unlikely(ret & (VM_FAULT_ERROR | VM_FAULT_NOPAGE |
-			    VM_FAULT_RETRY)))
+	if (unlikely(ret & (VM_FAULT_ERROR | VM_FAULT_NOPAGE | VM_FAULT_RETRY)))
 		goto uncharge_out;
 
 	if (unlikely(PageHWPoison(vmf.page))) {
@@ -3356,98 +3352,89 @@ static int __do_fault(struct mm_struct *mm, struct vm_area_struct *vma,
 	else
 		VM_BUG_ON(!PageLocked(vmf.page));
 
+	page = vmf.page;
+	if (!write)
+		goto update_pgtable;
+
 	/*
 	 * Should we do an early C-O-W break?
 	 */
-	page = vmf.page;
-	if (flags & FAULT_FLAG_WRITE) {
-		if (!(vma->vm_flags & VM_SHARED)) {
-			page = cow_page;
-			anon = 1;
-			copy_user_highpage(page, vmf.page, address, vma);
-			__SetPageUptodate(page);
-		} else {
-			/*
-			 * If the page will be shareable, see if the backing
-			 * address space wants to know that the page is about
-			 * to become writable
-			 */
-			if (vma->vm_ops->page_mkwrite) {
-				int tmp;
-
+	if (!(vma->vm_flags & VM_SHARED)) {
+		page = cow_page;
+		anon = true;
+		copy_user_highpage(page, vmf.page, address, vma);
+		__SetPageUptodate(page);
+	} else if (vma->vm_ops->page_mkwrite) {
+		/*
+		 * If the page will be shareable, see if the backing address
+		 * space wants to know that the page is about to become writable
+		 */
+		int tmp;
+
+		unlock_page(page);
+		vmf.flags = FAULT_FLAG_WRITE | FAULT_FLAG_MKWRITE;
+		tmp = vma->vm_ops->page_mkwrite(vma, &vmf);
+		if (unlikely(tmp & (VM_FAULT_ERROR | VM_FAULT_NOPAGE))) {
+			ret = tmp;
+			goto unwritable_page;
+		}
+		if (unlikely(!(tmp & VM_FAULT_LOCKED))) {
+			lock_page(page);
+			if (!page->mapping) {
+				ret = 0; /* retry the fault */
 				unlock_page(page);
-				vmf.flags = FAULT_FLAG_WRITE|FAULT_FLAG_MKWRITE;
-				tmp = vma->vm_ops->page_mkwrite(vma, &vmf);
-				if (unlikely(tmp &
-					  (VM_FAULT_ERROR | VM_FAULT_NOPAGE))) {
-					ret = tmp;
-					goto unwritable_page;
-				}
-				if (unlikely(!(tmp & VM_FAULT_LOCKED))) {
-					lock_page(page);
-					if (!page->mapping) {
-						ret = 0; /* retry the fault */
-						unlock_page(page);
-						goto unwritable_page;
-					}
-				} else
-					VM_BUG_ON(!PageLocked(page));
-				page_mkwrite = 1;
+				goto unwritable_page;
 			}
-		}
-
+		} else
+			VM_BUG_ON(!PageLocked(page));
+		page_mkwrite = true;
 	}
 
+update_pgtable:
 	page_table = pte_offset_map_lock(mm, pmd, address, &ptl);
+	/* Only go through if we didn't race with anybody else... */
+	if (unlikely(!pte_same(*page_table, orig_pte))) {
+		pte_unmap_unlock(page_table, ptl);
+		goto race_out;
+	}
+
+	flush_icache_page(vma, page);
+	if (anon) {
+		inc_mm_counter_fast(mm, MM_ANONPAGES);
+		page_add_new_anon_rmap(page, vma, address);
+	} else {
+		inc_mm_counter_fast(mm, MM_FILEPAGES);
+		page_add_file_rmap(page);
+		if (write) {
+			dirty_page = page;
+			get_page(dirty_page);
+		}
+	}
 
 	/*
-	 * This silly early PAGE_DIRTY setting removes a race
-	 * due to the bad i386 page protection. But it's valid
-	 * for other architectures too.
+	 * This silly early PAGE_DIRTY setting removes a race due to the bad
+	 * i386 page protection. But it's valid for other architectures too.
 	 *
-	 * Note that if FAULT_FLAG_WRITE is set, we either now have
-	 * an exclusive copy of the page, or this is a shared mapping,
-	 * so we can make it writable and dirty to avoid having to
-	 * handle that later.
+	 * Note that if FAULT_FLAG_WRITE is set, we either now have an
+	 * exclusive copy of the page, or this is a shared mapping, so we can
+	 * make it writable and dirty to avoid having to handle that later.
 	 */
-	/* Only go through if we didn't race with anybody else... */
-	if (likely(pte_same(*page_table, orig_pte))) {
-		flush_icache_page(vma, page);
-		entry = mk_pte(page, vma->vm_page_prot);
-		if (flags & FAULT_FLAG_WRITE)
-			entry = maybe_mkwrite(pte_mkdirty(entry), vma);
-		if (anon) {
-			inc_mm_counter_fast(mm, MM_ANONPAGES);
-			page_add_new_anon_rmap(page, vma, address);
-		} else {
-			inc_mm_counter_fast(mm, MM_FILEPAGES);
-			page_add_file_rmap(page);
-			if (flags & FAULT_FLAG_WRITE) {
-				dirty_page = page;
-				get_page(dirty_page);
-			}
-		}
-		set_pte_at(mm, address, page_table, entry);
+	entry = mk_pte(page, vma->vm_page_prot);
+	if (write)
+		entry = maybe_mkwrite(pte_mkdirty(entry), vma);
+	set_pte_at(mm, address, page_table, entry);
 
-		/* no need to invalidate: a not-present page won't be cached */
-		update_mmu_cache(vma, address, page_table);
-	} else {
-		if (cow_page)
-			mem_cgroup_uncharge_page(cow_page);
-		if (anon)
-			page_cache_release(page);
-		else
-			anon = 1; /* no anon but release faulted_page */
-	}
+	/* no need to invalidate: a not-present page won't be cached */
+	update_mmu_cache(vma, address, page_table);
 
 	pte_unmap_unlock(page_table, ptl);
 
 	if (dirty_page) {
 		struct address_space *mapping = page->mapping;
-		int dirtied = 0;
+		bool dirtied = false;
 
 		if (set_page_dirty(dirty_page))
-			dirtied = 1;
+			dirtied = true;
 		unlock_page(dirty_page);
 		put_page(dirty_page);
 		if ((dirtied || page_mkwrite) && mapping) {
@@ -3479,6 +3466,14 @@ uncharge_out:
 		page_cache_release(cow_page);
 	}
 	return ret;
+race_out:
+	if (cow_page)
+		mem_cgroup_uncharge_page(cow_page);
+	if (anon)
+		page_cache_release(page);
+	unlock_page(vmf.page);
+	page_cache_release(vmf.page);
+	return ret;
 }
 
 static int do_linear_fault(struct mm_struct *mm, struct vm_area_struct *vma,
-- 
1.7.10.4


^ permalink raw reply related	[flat|nested] 243+ messages in thread

* [PATCHv4 33/39] thp, mm: implement do_huge_linear_fault()
  2013-05-12  1:22 ` Kirill A. Shutemov
@ 2013-05-12  1:23   ` Kirill A. Shutemov
  -1 siblings, 0 replies; 243+ messages in thread
From: Kirill A. Shutemov @ 2013-05-12  1:23 UTC (permalink / raw)
  To: Andrea Arcangeli, Andrew Morton
  Cc: Al Viro, Hugh Dickins, Wu Fengguang, Jan Kara, Mel Gorman,
	linux-mm, Andi Kleen, Matthew Wilcox, Kirill A. Shutemov,
	Hillf Danton, Dave Hansen, linux-fsdevel, linux-kernel,
	Kirill A. Shutemov

From: "Kirill A. Shutemov" <kirill.shutemov@linux.intel.com>

Let's modify __do_fault() to handle transhuge pages. To indicate that a
huge page is required, the caller passes flags with FAULT_FLAG_TRANSHUGE
set.

__do_fault() now returns VM_FAULT_FALLBACK to indicate that a fallback
to small pages is required.
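
A minimal sketch of the calling convention this establishes (simplified,
not a verbatim hunk from this patch; do_huge_linear_fault() is the
wrapper added at the end of the diff below, and the surrounding
variables are assumed to be the usual fault-path locals):

	ret = do_huge_linear_fault(mm, vma, address, pmd, flags);
	if (!(ret & VM_FAULT_FALLBACK))
		return ret;		/* huge page mapped, or a hard error */
	/* otherwise retry the fault with small pages as before */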

Signed-off-by: Kirill A. Shutemov <kirill.shutemov@linux.intel.com>
---
 include/linux/huge_mm.h |   41 +++++++++++++
 include/linux/mm.h      |    5 ++
 mm/huge_memory.c        |   22 -------
 mm/memory.c             |  148 ++++++++++++++++++++++++++++++++++++++++-------
 4 files changed, 172 insertions(+), 44 deletions(-)

diff --git a/include/linux/huge_mm.h b/include/linux/huge_mm.h
index d688271..b20334a 100644
--- a/include/linux/huge_mm.h
+++ b/include/linux/huge_mm.h
@@ -188,6 +188,28 @@ static inline struct page *compound_trans_head(struct page *page)
 	return page;
 }
 
+static inline gfp_t alloc_hugepage_gfpmask(int defrag, gfp_t extra_gfp)
+{
+	return (GFP_TRANSHUGE & ~(defrag ? 0 : __GFP_WAIT)) | extra_gfp;
+}
+
+static inline struct page *alloc_hugepage_vma(int defrag,
+					      struct vm_area_struct *vma,
+					      unsigned long haddr, int nd,
+					      gfp_t extra_gfp)
+{
+	return alloc_pages_vma(alloc_hugepage_gfpmask(defrag, extra_gfp),
+			       HPAGE_PMD_ORDER, vma, haddr, nd);
+}
+
+static inline pmd_t mk_huge_pmd(struct page *page, pgprot_t prot)
+{
+	pmd_t entry;
+	entry = mk_pmd(page, prot);
+	entry = pmd_mkhuge(entry);
+	return entry;
+}
+
 extern int do_huge_pmd_numa_page(struct mm_struct *mm, struct vm_area_struct *vma,
 				unsigned long addr, pmd_t pmd, pmd_t *pmdp);
 
@@ -200,12 +222,15 @@ extern int do_huge_pmd_numa_page(struct mm_struct *mm, struct vm_area_struct *vm
 #define HPAGE_CACHE_NR         ({ BUILD_BUG(); 0; })
 #define HPAGE_CACHE_INDEX_MASK ({ BUILD_BUG(); 0; })
 
+#define THP_FAULT_ALLOC		({ BUILD_BUG(); 0; })
+#define THP_FAULT_FALLBACK	({ BUILD_BUG(); 0; })
 #define THP_WRITE_ALLOC		({ BUILD_BUG(); 0; })
 #define THP_WRITE_ALLOC_FAILED	({ BUILD_BUG(); 0; })
 
 #define hpage_nr_pages(x) 1
 
 #define transparent_hugepage_enabled(__vma) 0
+#define transparent_hugepage_defrag(__vma) 0
 
 #define transparent_hugepage_flags 0UL
 static inline int
@@ -242,6 +267,22 @@ static inline int pmd_trans_huge_lock(pmd_t *pmd,
 	return 0;
 }
 
+static inline pmd_t mk_huge_pmd(struct page *page, pgprot_t prot)
+{
+	pmd_t entry;
+	BUILD_BUG();
+	return entry;
+}
+
+static inline struct page *alloc_hugepage_vma(int defrag,
+		struct vm_area_struct *vma,
+		unsigned long haddr, int nd,
+		gfp_t extra_gfp)
+{
+	BUILD_BUG();
+	return NULL;
+}
+
 static inline int do_huge_pmd_numa_page(struct mm_struct *mm, struct vm_area_struct *vma,
 					unsigned long addr, pmd_t pmd, pmd_t *pmdp)
 {
diff --git a/include/linux/mm.h b/include/linux/mm.h
index 280b414..563c8b7 100644
--- a/include/linux/mm.h
+++ b/include/linux/mm.h
@@ -167,6 +167,11 @@ extern pgprot_t protection_map[16];
 #define FAULT_FLAG_RETRY_NOWAIT	0x10	/* Don't drop mmap_sem and wait when retrying */
 #define FAULT_FLAG_KILLABLE	0x20	/* The fault task is in SIGKILL killable region */
 #define FAULT_FLAG_TRIED	0x40	/* second try */
+#ifdef CONFIG_TRANSPARENT_HUGEPAGE_PAGECACHE
+#define FAULT_FLAG_TRANSHUGE	0x80	/* Try to allocate transhuge page */
+#else
+#define FAULT_FLAG_TRANSHUGE	0	/* Optimize out THP code if disabled */
+#endif
 
 /*
  * vm_fault is filled by the the pagefault handler and passed to the vma's
diff --git a/mm/huge_memory.c b/mm/huge_memory.c
index facfdac..893cc69 100644
--- a/mm/huge_memory.c
+++ b/mm/huge_memory.c
@@ -709,14 +709,6 @@ pmd_t maybe_pmd_mkwrite(pmd_t pmd, struct vm_area_struct *vma)
 	return pmd;
 }
 
-static inline pmd_t mk_huge_pmd(struct page *page, pgprot_t prot)
-{
-	pmd_t entry;
-	entry = mk_pmd(page, prot);
-	entry = pmd_mkhuge(entry);
-	return entry;
-}
-
 static int __do_huge_pmd_anonymous_page(struct mm_struct *mm,
 					struct vm_area_struct *vma,
 					unsigned long haddr, pmd_t *pmd,
@@ -758,20 +750,6 @@ static int __do_huge_pmd_anonymous_page(struct mm_struct *mm,
 	return 0;
 }
 
-static inline gfp_t alloc_hugepage_gfpmask(int defrag, gfp_t extra_gfp)
-{
-	return (GFP_TRANSHUGE & ~(defrag ? 0 : __GFP_WAIT)) | extra_gfp;
-}
-
-static inline struct page *alloc_hugepage_vma(int defrag,
-					      struct vm_area_struct *vma,
-					      unsigned long haddr, int nd,
-					      gfp_t extra_gfp)
-{
-	return alloc_pages_vma(alloc_hugepage_gfpmask(defrag, extra_gfp),
-			       HPAGE_PMD_ORDER, vma, haddr, nd);
-}
-
 #ifndef CONFIG_NUMA
 static inline struct page *alloc_hugepage(int defrag)
 {
diff --git a/mm/memory.c b/mm/memory.c
index 97b22c7..8997cd8 100644
--- a/mm/memory.c
+++ b/mm/memory.c
@@ -59,6 +59,7 @@
 #include <linux/gfp.h>
 #include <linux/migrate.h>
 #include <linux/string.h>
+#include <linux/khugepaged.h>
 
 #include <asm/io.h>
 #include <asm/pgalloc.h>
@@ -167,6 +168,7 @@ static void check_sync_rss_stat(struct task_struct *task)
 }
 #else /* SPLIT_RSS_COUNTING */
 
+#define add_mm_counter_fast(mm, member, val) add_mm_counter(mm, member, val)
 #define inc_mm_counter_fast(mm, member) inc_mm_counter(mm, member)
 #define dec_mm_counter_fast(mm, member) dec_mm_counter(mm, member)
 
@@ -3282,6 +3284,38 @@ oom:
 	return VM_FAULT_OOM;
 }
 
+static inline bool transhuge_vma_suitable(struct vm_area_struct *vma,
+		unsigned long addr)
+{
+	unsigned long haddr = addr & HPAGE_PMD_MASK;
+
+	if (((vma->vm_start >> PAGE_SHIFT) & HPAGE_CACHE_INDEX_MASK) !=
+			(vma->vm_pgoff & HPAGE_CACHE_INDEX_MASK))
+		return false;
+	if (haddr < vma->vm_start || haddr + HPAGE_PMD_SIZE > vma->vm_end)
+		return false;
+	return true;
+}
+
+static struct page *alloc_fault_page_vma(struct vm_area_struct *vma,
+		unsigned long addr, unsigned int flags)
+{
+
+	if (flags & FAULT_FLAG_TRANSHUGE) {
+		struct page *page;
+		unsigned long haddr = addr & HPAGE_PMD_MASK;
+
+		page = alloc_hugepage_vma(transparent_hugepage_defrag(vma),
+				vma, haddr, numa_node_id(), 0);
+		if (page)
+			count_vm_event(THP_FAULT_ALLOC);
+		else
+			count_vm_event(THP_FAULT_FALLBACK);
+		return page;
+	}
+	return alloc_page_vma(GFP_HIGHUSER_MOVABLE, vma, addr);
+}
+
 /*
  * __do_fault() tries to create a new page mapping. It aggressively
  * tries to share with existing pages, but makes a separate copy if
@@ -3301,12 +3335,23 @@ static int __do_fault(struct mm_struct *mm, struct vm_area_struct *vma,
 {
 	pte_t *page_table;
 	spinlock_t *ptl;
+	pgtable_t pgtable = NULL;
 	struct page *page, *cow_page, *dirty_page = NULL;
-	pte_t entry;
 	bool anon = false, page_mkwrite = false;
 	bool write = flags & FAULT_FLAG_WRITE;
+	bool thp = flags & FAULT_FLAG_TRANSHUGE;
+	unsigned long addr_aligned;
 	struct vm_fault vmf;
-	int ret;
+	int nr, ret;
+
+	if (thp) {
+		if (!transhuge_vma_suitable(vma, address))
+			return VM_FAULT_FALLBACK;
+		if (unlikely(khugepaged_enter(vma)))
+			return VM_FAULT_OOM;
+		addr_aligned = address & HPAGE_PMD_MASK;
+	} else
+		addr_aligned = address & PAGE_MASK;
 
 	/*
 	 * If we do COW later, allocate page befor taking lock_page()
@@ -3316,17 +3361,25 @@ static int __do_fault(struct mm_struct *mm, struct vm_area_struct *vma,
 		if (unlikely(anon_vma_prepare(vma)))
 			return VM_FAULT_OOM;
 
-		cow_page = alloc_page_vma(GFP_HIGHUSER_MOVABLE, vma, address);
+		cow_page = alloc_fault_page_vma(vma, address, flags);
 		if (!cow_page)
-			return VM_FAULT_OOM;
+			return VM_FAULT_OOM | VM_FAULT_FALLBACK;
 
 		if (mem_cgroup_newpage_charge(cow_page, mm, GFP_KERNEL)) {
 			page_cache_release(cow_page);
-			return VM_FAULT_OOM;
+			return VM_FAULT_OOM | VM_FAULT_FALLBACK;
 		}
 	} else
 		cow_page = NULL;
 
+	if (thp) {
+		pgtable = pte_alloc_one(mm, address);
+		if (unlikely(!pgtable)) {
+			ret = VM_FAULT_OOM;
+			goto uncharge_out;
+		}
+	}
+
 	vmf.virtual_address = (void __user *)(address & PAGE_MASK);
 	vmf.pgoff = pgoff;
 	vmf.flags = flags;
@@ -3353,6 +3406,13 @@ static int __do_fault(struct mm_struct *mm, struct vm_area_struct *vma,
 		VM_BUG_ON(!PageLocked(vmf.page));
 
 	page = vmf.page;
+
+	/*
+	 * If we asked for a huge page we expect to get it or VM_FAULT_FALLBACK.
+	 * If we don't ask for a huge page it must be split in ->fault().
+	 */
+	BUG_ON(PageTransHuge(page) != thp);
+
 	if (!write)
 		goto update_pgtable;
 
@@ -3362,7 +3422,11 @@ static int __do_fault(struct mm_struct *mm, struct vm_area_struct *vma,
 	if (!(vma->vm_flags & VM_SHARED)) {
 		page = cow_page;
 		anon = true;
-		copy_user_highpage(page, vmf.page, address, vma);
+		if (thp)
+			copy_user_huge_page(page, vmf.page, addr_aligned, vma,
+					HPAGE_PMD_NR);
+		else
+			copy_user_highpage(page, vmf.page, address, vma);
 		__SetPageUptodate(page);
 	} else if (vma->vm_ops->page_mkwrite) {
 		/*
@@ -3373,6 +3437,8 @@ static int __do_fault(struct mm_struct *mm, struct vm_area_struct *vma,
 
 		unlock_page(page);
 		vmf.flags = FAULT_FLAG_WRITE | FAULT_FLAG_MKWRITE;
+		if (thp)
+			vmf.flags |= FAULT_FLAG_TRANSHUGE;
 		tmp = vma->vm_ops->page_mkwrite(vma, &vmf);
 		if (unlikely(tmp & (VM_FAULT_ERROR | VM_FAULT_NOPAGE))) {
 			ret = tmp;
@@ -3391,19 +3457,30 @@ static int __do_fault(struct mm_struct *mm, struct vm_area_struct *vma,
 	}
 
 update_pgtable:
-	page_table = pte_offset_map_lock(mm, pmd, address, &ptl);
 	/* Only go through if we didn't race with anybody else... */
-	if (unlikely(!pte_same(*page_table, orig_pte))) {
-		pte_unmap_unlock(page_table, ptl);
-		goto race_out;
+	if (thp) {
+		spin_lock(&mm->page_table_lock);
+		if (!pmd_none(*pmd)) {
+			spin_unlock(&mm->page_table_lock);
+			goto race_out;
+		}
+		/* make GCC happy */
+		ptl = NULL; page_table = NULL;
+	} else {
+		page_table = pte_offset_map_lock(mm, pmd, address, &ptl);
+		if (unlikely(!pte_same(*page_table, orig_pte))) {
+			pte_unmap_unlock(page_table, ptl);
+			goto race_out;
+		}
 	}
 
 	flush_icache_page(vma, page);
+	nr = thp ? HPAGE_PMD_NR : 1;
 	if (anon) {
-		inc_mm_counter_fast(mm, MM_ANONPAGES);
-		page_add_new_anon_rmap(page, vma, address);
+		add_mm_counter_fast(mm, MM_ANONPAGES, nr);
+		page_add_new_anon_rmap(page, vma, addr_aligned);
 	} else {
-		inc_mm_counter_fast(mm, MM_FILEPAGES);
+		add_mm_counter_fast(mm, MM_FILEPAGES, nr);
 		page_add_file_rmap(page);
 		if (write) {
 			dirty_page = page;
@@ -3419,15 +3496,23 @@ update_pgtable:
 	 * exclusive copy of the page, or this is a shared mapping, so we can
 	 * make it writable and dirty to avoid having to handle that later.
 	 */
-	entry = mk_pte(page, vma->vm_page_prot);
-	if (write)
-		entry = maybe_mkwrite(pte_mkdirty(entry), vma);
-	set_pte_at(mm, address, page_table, entry);
-
-	/* no need to invalidate: a not-present page won't be cached */
-	update_mmu_cache(vma, address, page_table);
-
-	pte_unmap_unlock(page_table, ptl);
+	if (thp) {
+		pmd_t entry = mk_huge_pmd(page, vma->vm_page_prot);
+		if (flags & FAULT_FLAG_WRITE)
+			entry = maybe_pmd_mkwrite(pmd_mkdirty(entry), vma);
+		set_pmd_at(mm, address, pmd, entry);
+		pgtable_trans_huge_deposit(mm, pgtable);
+		mm->nr_ptes++;
+		update_mmu_cache_pmd(vma, address, pmd);
+		spin_unlock(&mm->page_table_lock);
+	} else {
+		pte_t entry = mk_pte(page, vma->vm_page_prot);
+		if (write)
+			entry = maybe_mkwrite(pte_mkdirty(entry), vma);
+		set_pte_at(mm, address, page_table, entry);
+		update_mmu_cache(vma, address, page_table);
+		pte_unmap_unlock(page_table, ptl);
+	}
 
 	if (dirty_page) {
 		struct address_space *mapping = page->mapping;
@@ -3457,9 +3542,13 @@ update_pgtable:
 	return ret;
 
 unwritable_page:
+	if (pgtable)
+		pte_free(mm, pgtable);
 	page_cache_release(page);
 	return ret;
 uncharge_out:
+	if (pgtable)
+		pte_free(mm, pgtable);
 	/* fs's fault handler get error */
 	if (cow_page) {
 		mem_cgroup_uncharge_page(cow_page);
@@ -3467,6 +3556,8 @@ uncharge_out:
 	}
 	return ret;
 race_out:
+	if (pgtable)
+		pte_free(mm, pgtable);
 	if (cow_page)
 		mem_cgroup_uncharge_page(cow_page);
 	if (anon)
@@ -3519,6 +3610,19 @@ static int do_nonlinear_fault(struct mm_struct *mm, struct vm_area_struct *vma,
 	return __do_fault(mm, vma, address, pmd, pgoff, flags, orig_pte);
 }
 
+static int do_huge_linear_fault(struct mm_struct *mm,
+		struct vm_area_struct *vma, unsigned long address, pmd_t *pmd,
+		unsigned int flags)
+{
+	pgoff_t pgoff = (((address & PAGE_MASK)
+			- vma->vm_start) >> PAGE_SHIFT) + vma->vm_pgoff;
+	pte_t __unused; /* unused with FAULT_FLAG_TRANSHUGE */
+
+	flags |= FAULT_FLAG_TRANSHUGE;
+
+	return __do_fault(mm, vma, address, pmd, pgoff, flags, __unused);
+}
+
 int numa_migrate_prep(struct page *page, struct vm_area_struct *vma,
 				unsigned long addr, int current_nid)
 {
-- 
1.7.10.4


^ permalink raw reply related	[flat|nested] 243+ messages in thread

* [PATCHv4 34/39] thp, mm: handle huge pages in filemap_fault()
  2013-05-12  1:22 ` Kirill A. Shutemov
@ 2013-05-12  1:23   ` Kirill A. Shutemov
  -1 siblings, 0 replies; 243+ messages in thread
From: Kirill A. Shutemov @ 2013-05-12  1:23 UTC (permalink / raw)
  To: Andrea Arcangeli, Andrew Morton
  Cc: Al Viro, Hugh Dickins, Wu Fengguang, Jan Kara, Mel Gorman,
	linux-mm, Andi Kleen, Matthew Wilcox, Kirill A. Shutemov,
	Hillf Danton, Dave Hansen, linux-fsdevel, linux-kernel,
	Kirill A. Shutemov

From: "Kirill A. Shutemov" <kirill.shutemov@linux.intel.com>

If the caller asks for a huge page (flags & FAULT_FLAG_TRANSHUGE),
filemap_fault() returns it if there is already a huge page at that
offset.

If the area of the page cache required to back a huge page is empty, we
create a new huge page and return it.

Otherwise we return VM_FAULT_FALLBACK to indicate that a fallback to
small pages is required.
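
Condensed from the diff below into a sketch (not the literal code;
locking and error paths trimmed), the fallback decision once a page has
been found is:

	if (thp && !PageTransHuge(page)) {
		/* a small page already occupies this offset:
		 * let the caller retry with 4k pages */
		unlock_page(page);
		page_cache_release(page);
		return VM_FAULT_FALLBACK;
	}

If nothing is cached in the (huge-page-aligned) range, page_cache_read()
allocates an HPAGE_PMD_ORDER page and feeds it to ->readpage(); if the
allocation fails, the fault ends up returning
VM_FAULT_OOM | VM_FAULT_FALLBACK.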

Signed-off-by: Kirill A. Shutemov <kirill.shutemov@linux.intel.com>
---
 mm/filemap.c |   52 +++++++++++++++++++++++++++++++++++++++++++---------
 1 file changed, 43 insertions(+), 9 deletions(-)

diff --git a/mm/filemap.c b/mm/filemap.c
index 9877347..1deedd6 100644
--- a/mm/filemap.c
+++ b/mm/filemap.c
@@ -1557,14 +1557,23 @@ EXPORT_SYMBOL(generic_file_aio_read);
  * This adds the requested page to the page cache if it isn't already there,
  * and schedules an I/O to read in its contents from disk.
  */
-static int page_cache_read(struct file *file, pgoff_t offset)
+static int page_cache_read(struct file *file, pgoff_t offset, bool thp)
 {
 	struct address_space *mapping = file->f_mapping;
-	struct page *page; 
+	struct page *page;
 	int ret;
 
 	do {
-		page = page_cache_alloc_cold(mapping);
+		if (thp) {
+			gfp_t gfp_mask = mapping_gfp_mask(mapping) | __GFP_COLD;
+			BUG_ON(offset & HPAGE_CACHE_INDEX_MASK);
+			page = alloc_pages(gfp_mask, HPAGE_PMD_ORDER);
+			if (page)
+				count_vm_event(THP_FAULT_ALLOC);
+			else
+				count_vm_event(THP_FAULT_FALLBACK);
+		} else
+			page = page_cache_alloc_cold(mapping);
 		if (!page)
 			return -ENOMEM;
 
@@ -1573,11 +1582,18 @@ static int page_cache_read(struct file *file, pgoff_t offset)
 			ret = mapping->a_ops->readpage(file, page);
 		else if (ret == -EEXIST)
 			ret = 0; /* losing race to add is OK */
+		else if (ret == -ENOSPC)
+			/*
+			 * No space in the page cache to add a huge page.
+			 * For the caller it's the same as -ENOMEM: falling
+			 * back to small pages is required.
+			 */
+			ret = -ENOMEM;
 
 		page_cache_release(page);
 
 	} while (ret == AOP_TRUNCATED_PAGE);
-		
+
 	return ret;
 }
 
@@ -1669,13 +1685,20 @@ int filemap_fault(struct vm_area_struct *vma, struct vm_fault *vmf)
 	struct address_space *mapping = file->f_mapping;
 	struct file_ra_state *ra = &file->f_ra;
 	struct inode *inode = mapping->host;
+	bool thp = vmf->flags & FAULT_FLAG_TRANSHUGE;
 	pgoff_t offset = vmf->pgoff;
+	unsigned long address = (unsigned long)vmf->virtual_address;
 	struct page *page;
 	pgoff_t size;
 	int ret = 0;
 
+	if (thp) {
+		BUG_ON(ra->ra_pages);
+		offset = linear_page_index(vma, address & HPAGE_PMD_MASK);
+	}
+
 	size = (i_size_read(inode) + PAGE_CACHE_SIZE - 1) >> PAGE_CACHE_SHIFT;
-	if (offset >= size)
+	if (vmf->pgoff >= size)
 		return VM_FAULT_SIGBUS;
 
 	/*
@@ -1700,7 +1723,8 @@ retry_find:
 			goto no_cached_page;
 	}
 
-	if (PageTransCompound(page))
+	/* Split huge page if we don't want huge page to be here */
+	if (!thp && PageTransCompound(page))
 		split_huge_page(compound_trans_head(page));
 	if (!lock_page_or_retry(page, vma->vm_mm, vmf->flags)) {
 		page_cache_release(page);
@@ -1722,12 +1746,22 @@ retry_find:
 	if (unlikely(!PageUptodate(page)))
 		goto page_not_uptodate;
 
+	if (thp && !PageTransHuge(page)) {
+		/*
+		 * Caller asked for huge page, but we have small page
+		 * by this offset. Fallback to small pages.
+		 */
+		unlock_page(page);
+		page_cache_release(page);
+		return VM_FAULT_FALLBACK;
+	}
+
 	/*
 	 * Found the page and have a reference on it.
 	 * We must recheck i_size under page lock.
 	 */
 	size = (i_size_read(inode) + PAGE_CACHE_SIZE - 1) >> PAGE_CACHE_SHIFT;
-	if (unlikely(offset >= size)) {
+	if (unlikely(vmf->pgoff >= size)) {
 		unlock_page(page);
 		page_cache_release(page);
 		return VM_FAULT_SIGBUS;
@@ -1741,7 +1775,7 @@ no_cached_page:
 	 * We're only likely to ever get here if MADV_RANDOM is in
 	 * effect.
 	 */
-	error = page_cache_read(file, offset);
+	error = page_cache_read(file, offset, thp);
 
 	/*
 	 * The page we want has now been added to the page cache.
@@ -1757,7 +1791,7 @@ no_cached_page:
 	 * to schedule I/O.
 	 */
 	if (error == -ENOMEM)
-		return VM_FAULT_OOM;
+		return VM_FAULT_OOM | VM_FAULT_FALLBACK;
 	return VM_FAULT_SIGBUS;
 
 page_not_uptodate:
-- 
1.7.10.4


^ permalink raw reply related	[flat|nested] 243+ messages in thread

* [PATCHv4 35/39] mm: decomposite do_wp_page() and get rid of some 'goto' logic
  2013-05-12  1:22 ` Kirill A. Shutemov
@ 2013-05-12  1:23   ` Kirill A. Shutemov
  -1 siblings, 0 replies; 243+ messages in thread
From: Kirill A. Shutemov @ 2013-05-12  1:23 UTC (permalink / raw)
  To: Andrea Arcangeli, Andrew Morton
  Cc: Al Viro, Hugh Dickins, Wu Fengguang, Jan Kara, Mel Gorman,
	linux-mm, Andi Kleen, Matthew Wilcox, Kirill A. Shutemov,
	Hillf Danton, Dave Hansen, linux-fsdevel, linux-kernel,
	Kirill A. Shutemov

From: "Kirill A. Shutemov" <kirill.shutemov@linux.intel.com>

Let's extract the 'reuse' path into separate helper functions and use
them instead of the ugly goto.
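
With the helpers in place, the former 'goto reuse' jumps become direct
calls, e.g. (taken from the diff below):

	mkwrite_pte(vma, address, page_table, orig_pte);
	pte_unmap_unlock(page_table, ptl);
	return VM_FAULT_WRITE;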

Signed-off-by: Kirill A. Shutemov <kirill.shutemov@linux.intel.com>
---
 mm/memory.c |  110 ++++++++++++++++++++++++++++++++---------------------------
 1 file changed, 59 insertions(+), 51 deletions(-)

diff --git a/mm/memory.c b/mm/memory.c
index 8997cd8..eb99ab1 100644
--- a/mm/memory.c
+++ b/mm/memory.c
@@ -2594,6 +2594,52 @@ static inline void cow_user_page(struct page *dst, struct page *src, unsigned lo
 		copy_user_highpage(dst, src, va, vma);
 }
 
+static void dirty_page(struct vm_area_struct *vma, struct page *page,
+		bool page_mkwrite)
+{
+	/*
+	 * Yes, Virginia, this is actually required to prevent a race
+	 * with clear_page_dirty_for_io() from clearing the page dirty
+	 * bit after it clear all dirty ptes, but before a racing
+	 * do_wp_page installs a dirty pte.
+	 *
+	 * __do_fault is protected similarly.
+	 */
+	if (!page_mkwrite) {
+		wait_on_page_locked(page);
+		set_page_dirty_balance(page, page_mkwrite);
+		/* file_update_time outside page_lock */
+		if (vma->vm_file)
+			file_update_time(vma->vm_file);
+	}
+	put_page(page);
+	if (page_mkwrite) {
+		struct address_space *mapping = page->mapping;
+
+		set_page_dirty(page);
+		unlock_page(page);
+		page_cache_release(page);
+		if (mapping)	{
+			/*
+			 * Some device drivers do not set page.mapping
+			 * but still dirty their pages
+			 */
+			balance_dirty_pages_ratelimited(mapping);
+		}
+	}
+}
+
+static void mkwrite_pte(struct vm_area_struct *vma, unsigned long address,
+		pte_t *page_table, pte_t orig_pte)
+{
+	pte_t entry;
+	flush_cache_page(vma, address, pte_pfn(orig_pte));
+	entry = pte_mkyoung(orig_pte);
+	entry = maybe_mkwrite(pte_mkdirty(entry), vma);
+	if (ptep_set_access_flags(vma, address, page_table, entry, 1))
+		update_mmu_cache(vma, address, page_table);
+}
+
 /*
  * This routine handles present pages, when users try to write
  * to a shared page. It is done by copying the page to a new address
@@ -2618,10 +2664,8 @@ static int do_wp_page(struct mm_struct *mm, struct vm_area_struct *vma,
 	__releases(ptl)
 {
 	struct page *old_page, *new_page = NULL;
-	pte_t entry;
 	int ret = 0;
 	int page_mkwrite = 0;
-	struct page *dirty_page = NULL;
 	unsigned long mmun_start = 0;	/* For mmu_notifiers */
 	unsigned long mmun_end = 0;	/* For mmu_notifiers */
 
@@ -2635,8 +2679,11 @@ static int do_wp_page(struct mm_struct *mm, struct vm_area_struct *vma,
 		 * accounting on raw pfn maps.
 		 */
 		if ((vma->vm_flags & (VM_WRITE|VM_SHARED)) ==
-				     (VM_WRITE|VM_SHARED))
-			goto reuse;
+				     (VM_WRITE|VM_SHARED)) {
+			mkwrite_pte(vma, address, page_table, orig_pte);
+			pte_unmap_unlock(page_table, ptl);
+			return VM_FAULT_WRITE;
+		}
 		goto gotten;
 	}
 
@@ -2665,7 +2712,9 @@ static int do_wp_page(struct mm_struct *mm, struct vm_area_struct *vma,
 			 */
 			page_move_anon_rmap(old_page, vma, address);
 			unlock_page(old_page);
-			goto reuse;
+			mkwrite_pte(vma, address, page_table, orig_pte);
+			pte_unmap_unlock(page_table, ptl);
+			return VM_FAULT_WRITE;
 		}
 		unlock_page(old_page);
 	} else if (unlikely((vma->vm_flags & (VM_WRITE|VM_SHARED)) ==
@@ -2727,53 +2776,11 @@ static int do_wp_page(struct mm_struct *mm, struct vm_area_struct *vma,
 
 			page_mkwrite = 1;
 		}
-		dirty_page = old_page;
-		get_page(dirty_page);
-
-reuse:
-		flush_cache_page(vma, address, pte_pfn(orig_pte));
-		entry = pte_mkyoung(orig_pte);
-		entry = maybe_mkwrite(pte_mkdirty(entry), vma);
-		if (ptep_set_access_flags(vma, address, page_table, entry,1))
-			update_mmu_cache(vma, address, page_table);
+		get_page(old_page);
+		mkwrite_pte(vma, address, page_table, orig_pte);
 		pte_unmap_unlock(page_table, ptl);
-		ret |= VM_FAULT_WRITE;
-
-		if (!dirty_page)
-			return ret;
-
-		/*
-		 * Yes, Virginia, this is actually required to prevent a race
-		 * with clear_page_dirty_for_io() from clearing the page dirty
-		 * bit after it clear all dirty ptes, but before a racing
-		 * do_wp_page installs a dirty pte.
-		 *
-		 * __do_fault is protected similarly.
-		 */
-		if (!page_mkwrite) {
-			wait_on_page_locked(dirty_page);
-			set_page_dirty_balance(dirty_page, page_mkwrite);
-			/* file_update_time outside page_lock */
-			if (vma->vm_file)
-				file_update_time(vma->vm_file);
-		}
-		put_page(dirty_page);
-		if (page_mkwrite) {
-			struct address_space *mapping = dirty_page->mapping;
-
-			set_page_dirty(dirty_page);
-			unlock_page(dirty_page);
-			page_cache_release(dirty_page);
-			if (mapping)	{
-				/*
-				 * Some device drivers do not set page.mapping
-				 * but still dirty their pages
-				 */
-				balance_dirty_pages_ratelimited(mapping);
-			}
-		}
-
-		return ret;
+		dirty_page(vma, old_page, page_mkwrite);
+		return ret | VM_FAULT_WRITE;
 	}
 
 	/*
@@ -2810,6 +2817,7 @@ gotten:
 	 */
 	page_table = pte_offset_map_lock(mm, pmd, address, &ptl);
 	if (likely(pte_same(*page_table, orig_pte))) {
+		pte_t entry;
 		if (old_page) {
 			if (!PageAnon(old_page)) {
 				dec_mm_counter_fast(mm, MM_FILEPAGES);
-- 
1.7.10.4


^ permalink raw reply related	[flat|nested] 243+ messages in thread

* [PATCHv4 36/39] mm: do_wp_page(): extract VM_WRITE|VM_SHARED case to separate function
  2013-05-12  1:22 ` Kirill A. Shutemov
@ 2013-05-12  1:23   ` Kirill A. Shutemov
  -1 siblings, 0 replies; 243+ messages in thread
From: Kirill A. Shutemov @ 2013-05-12  1:23 UTC (permalink / raw)
  To: Andrea Arcangeli, Andrew Morton
  Cc: Al Viro, Hugh Dickins, Wu Fengguang, Jan Kara, Mel Gorman,
	linux-mm, Andi Kleen, Matthew Wilcox, Kirill A. Shutemov,
	Hillf Danton, Dave Hansen, linux-fsdevel, linux-kernel,
	Kirill A. Shutemov

From: "Kirill A. Shutemov" <kirill.shutemov@linux.intel.com>

The code will be shared with transhuge pages.

Signed-off-by: Kirill A. Shutemov <kirill.shutemov@linux.intel.com>
---
 mm/memory.c |  142 ++++++++++++++++++++++++++++++-----------------------------
 1 file changed, 73 insertions(+), 69 deletions(-)

diff --git a/mm/memory.c b/mm/memory.c
index eb99ab1..4685dd1 100644
--- a/mm/memory.c
+++ b/mm/memory.c
@@ -2641,6 +2641,76 @@ static void mkwrite_pte(struct vm_area_struct *vma, unsigned long address,
 }
 
 /*
+ * Only catch write-faults on shared writable pages, read-only shared pages can
+ * get COWed by get_user_pages(.write=1, .force=1).
+ */
+static int do_wp_page_shared(struct mm_struct *mm, struct vm_area_struct *vma,
+		unsigned long address, pte_t *page_table, pmd_t *pmd,
+		spinlock_t *ptl, pte_t orig_pte, struct page *page)
+{
+	struct vm_fault vmf;
+	bool page_mkwrite = false;
+	int tmp, ret = 0;
+
+	if (vma->vm_ops && vma->vm_ops->page_mkwrite)
+		goto mkwrite_done;
+
+	vmf.virtual_address = (void __user *)(address & PAGE_MASK);
+	vmf.pgoff = page->index;
+	vmf.flags = FAULT_FLAG_WRITE|FAULT_FLAG_MKWRITE;
+	vmf.page = page;
+
+	/*
+	 * Notify the address space that the page is about to
+	 * become writable so that it can prohibit this or wait
+	 * for the page to get into an appropriate state.
+	 *
+	 * We do this without the lock held, so that it can
+	 * sleep if it needs to.
+	 */
+	page_cache_get(page);
+	pte_unmap_unlock(page_table, ptl);
+
+	tmp = vma->vm_ops->page_mkwrite(vma, &vmf);
+	if (unlikely(tmp & (VM_FAULT_ERROR | VM_FAULT_NOPAGE))) {
+		ret = tmp;
+		page_cache_release(page);
+		return ret;
+	}
+	if (unlikely(!(tmp & VM_FAULT_LOCKED))) {
+		lock_page(page);
+		if (!page->mapping) {
+			unlock_page(page);
+			page_cache_release(page);
+			return ret;
+		}
+	} else
+		VM_BUG_ON(!PageLocked(page));
+
+	/*
+	 * Since we dropped the lock we need to revalidate
+	 * the PTE as someone else may have changed it.  If
+	 * they did, we just return, as we can count on the
+	 * MMU to tell us if they didn't also make it writable.
+	 */
+	page_table = pte_offset_map_lock(mm, pmd, address, &ptl);
+	if (!pte_same(*page_table, orig_pte)) {
+		unlock_page(page);
+		pte_unmap_unlock(page_table, ptl);
+		page_cache_release(page);
+		return ret;
+	}
+
+	page_mkwrite = true;
+mkwrite_done:
+	get_page(page);
+	mkwrite_pte(vma, address, page_table, orig_pte);
+	pte_unmap_unlock(page_table, ptl);
+	dirty_page(vma, page, page_mkwrite);
+	return ret | VM_FAULT_WRITE;
+}
+
+/*
  * This routine handles present pages, when users try to write
  * to a shared page. It is done by copying the page to a new address
  * and decrementing the shared-page counter for the old page.
@@ -2665,7 +2735,6 @@ static int do_wp_page(struct mm_struct *mm, struct vm_area_struct *vma,
 {
 	struct page *old_page, *new_page = NULL;
 	int ret = 0;
-	int page_mkwrite = 0;
 	unsigned long mmun_start = 0;	/* For mmu_notifiers */
 	unsigned long mmun_end = 0;	/* For mmu_notifiers */
 
@@ -2718,70 +2787,9 @@ static int do_wp_page(struct mm_struct *mm, struct vm_area_struct *vma,
 		}
 		unlock_page(old_page);
 	} else if (unlikely((vma->vm_flags & (VM_WRITE|VM_SHARED)) ==
-					(VM_WRITE|VM_SHARED))) {
-		/*
-		 * Only catch write-faults on shared writable pages,
-		 * read-only shared pages can get COWed by
-		 * get_user_pages(.write=1, .force=1).
-		 */
-		if (vma->vm_ops && vma->vm_ops->page_mkwrite) {
-			struct vm_fault vmf;
-			int tmp;
-
-			vmf.virtual_address = (void __user *)(address &
-								PAGE_MASK);
-			vmf.pgoff = old_page->index;
-			vmf.flags = FAULT_FLAG_WRITE|FAULT_FLAG_MKWRITE;
-			vmf.page = old_page;
-
-			/*
-			 * Notify the address space that the page is about to
-			 * become writable so that it can prohibit this or wait
-			 * for the page to get into an appropriate state.
-			 *
-			 * We do this without the lock held, so that it can
-			 * sleep if it needs to.
-			 */
-			page_cache_get(old_page);
-			pte_unmap_unlock(page_table, ptl);
-
-			tmp = vma->vm_ops->page_mkwrite(vma, &vmf);
-			if (unlikely(tmp &
-					(VM_FAULT_ERROR | VM_FAULT_NOPAGE))) {
-				ret = tmp;
-				goto unwritable_page;
-			}
-			if (unlikely(!(tmp & VM_FAULT_LOCKED))) {
-				lock_page(old_page);
-				if (!old_page->mapping) {
-					ret = 0; /* retry the fault */
-					unlock_page(old_page);
-					goto unwritable_page;
-				}
-			} else
-				VM_BUG_ON(!PageLocked(old_page));
-
-			/*
-			 * Since we dropped the lock we need to revalidate
-			 * the PTE as someone else may have changed it.  If
-			 * they did, we just return, as we can count on the
-			 * MMU to tell us if they didn't also make it writable.
-			 */
-			page_table = pte_offset_map_lock(mm, pmd, address,
-							 &ptl);
-			if (!pte_same(*page_table, orig_pte)) {
-				unlock_page(old_page);
-				goto unlock;
-			}
-
-			page_mkwrite = 1;
-		}
-		get_page(old_page);
-		mkwrite_pte(vma, address, page_table, orig_pte);
-		pte_unmap_unlock(page_table, ptl);
-		dirty_page(vma, old_page, page_mkwrite);
-		return ret | VM_FAULT_WRITE;
-	}
+					(VM_WRITE|VM_SHARED)))
+		return do_wp_page_shared(mm, vma, address, page_table, pmd, ptl,
+				orig_pte, old_page);
 
 	/*
 	 * Ok, we need to copy. Oh, well..
@@ -2900,10 +2908,6 @@ oom:
 	if (old_page)
 		page_cache_release(old_page);
 	return VM_FAULT_OOM;
-
-unwritable_page:
-	page_cache_release(old_page);
-	return ret;
 }
 
 static void unmap_mapping_range_vma(struct vm_area_struct *vma,
-- 
1.7.10.4


^ permalink raw reply related	[flat|nested] 243+ messages in thread
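
The extracted do_wp_page_shared() above drives the ->page_mkwrite()
notification protocol: the handler is called without the PTE lock held and
must either return with the page locked (VM_FAULT_LOCKED) or leave it for
the core to lock and revalidate. As a rough illustration of the other side
of that contract, here is a minimal handler sketch, modeled loosely on the
generic filemap_page_mkwrite() behaviour; the function name is hypothetical
and real handlers typically also take freeze protection and update file
times:

static int example_page_mkwrite(struct vm_area_struct *vma,
				struct vm_fault *vmf)
{
	struct page *page = vmf->page;
	struct inode *inode = file_inode(vma->vm_file);

	lock_page(page);
	/* The page may have been truncated while we slept on the lock. */
	if (page->mapping != inode->i_mapping) {
		unlock_page(page);
		return VM_FAULT_NOPAGE;	/* fault will be retried */
	}
	/*
	 * Mark the page dirty up front; do_wp_page_shared() will dirty it
	 * again and unlock it in dirty_page() once the pte is writable.
	 */
	set_page_dirty(page);
	return VM_FAULT_LOCKED;
}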

* [PATCHv4 37/39] thp: handle write-protect exception to file-backed huge pages
  2013-05-12  1:22 ` Kirill A. Shutemov
@ 2013-05-12  1:23   ` Kirill A. Shutemov
  -1 siblings, 0 replies; 243+ messages in thread
From: Kirill A. Shutemov @ 2013-05-12  1:23 UTC (permalink / raw)
  To: Andrea Arcangeli, Andrew Morton
  Cc: Al Viro, Hugh Dickins, Wu Fengguang, Jan Kara, Mel Gorman,
	linux-mm, Andi Kleen, Matthew Wilcox, Kirill A. Shutemov,
	Hillf Danton, Dave Hansen, linux-fsdevel, linux-kernel,
	Kirill A. Shutemov

From: "Kirill A. Shutemov" <kirill.shutemov@linux.intel.com>

The VM_WRITE|VM_SHARED case is already almost covered by
do_wp_page_shared(). We only need to handle locking differently and set up
a pmd instead of a pte.

do_huge_pmd_wp_page() itself needs only a few minor changes:

- we may now need to allocate an anon_vma on write-protect fault: having a
  huge page to COW doesn't mean we have an anon_vma, since the huge page
  can be file-backed.
- we need to adjust the mm counters when a file page is COWed.

Signed-off-by: Kirill A. Shutemov <kirill.shutemov@linux.intel.com>
---
 include/linux/mm.h |    4 +++
 mm/huge_memory.c   |   17 +++++++++++--
 mm/memory.c        |   70 +++++++++++++++++++++++++++++++++++++++++-----------
 3 files changed, 74 insertions(+), 17 deletions(-)

diff --git a/include/linux/mm.h b/include/linux/mm.h
index 563c8b7..7f3bc24 100644
--- a/include/linux/mm.h
+++ b/include/linux/mm.h
@@ -1001,6 +1001,10 @@ extern int handle_mm_fault(struct mm_struct *mm, struct vm_area_struct *vma,
 			unsigned long address, unsigned int flags);
 extern int fixup_user_fault(struct task_struct *tsk, struct mm_struct *mm,
 			    unsigned long address, unsigned int fault_flags);
+extern int do_wp_page_shared(struct mm_struct *mm, struct vm_area_struct *vma,
+		unsigned long address, pmd_t *pmd, struct page *page,
+		pte_t *page_table, spinlock_t *ptl,
+		pte_t orig_pte, pmd_t orig_pmd);
 #else
 static inline int handle_mm_fault(struct mm_struct *mm,
 			struct vm_area_struct *vma, unsigned long address,
diff --git a/mm/huge_memory.c b/mm/huge_memory.c
index 893cc69..d7c9df5 100644
--- a/mm/huge_memory.c
+++ b/mm/huge_memory.c
@@ -1110,7 +1110,6 @@ int do_huge_pmd_wp_page(struct mm_struct *mm, struct vm_area_struct *vma,
 	unsigned long mmun_start;	/* For mmu_notifiers */
 	unsigned long mmun_end;		/* For mmu_notifiers */
 
-	VM_BUG_ON(!vma->anon_vma);
 	haddr = address & HPAGE_PMD_MASK;
 	if (is_huge_zero_pmd(orig_pmd))
 		goto alloc;
@@ -1120,7 +1119,7 @@ int do_huge_pmd_wp_page(struct mm_struct *mm, struct vm_area_struct *vma,
 
 	page = pmd_page(orig_pmd);
 	VM_BUG_ON(!PageCompound(page) || !PageHead(page));
-	if (page_mapcount(page) == 1) {
+	if (PageAnon(page) && page_mapcount(page) == 1) {
 		pmd_t entry;
 		entry = pmd_mkyoung(orig_pmd);
 		entry = maybe_pmd_mkwrite(pmd_mkdirty(entry), vma);
@@ -1129,9 +1128,18 @@ int do_huge_pmd_wp_page(struct mm_struct *mm, struct vm_area_struct *vma,
 		ret |= VM_FAULT_WRITE;
 		goto out_unlock;
 	}
+
+	if ((vma->vm_flags & (VM_WRITE|VM_SHARED)) == (VM_WRITE|VM_SHARED)) {
+		pte_t __unused;
+		return do_wp_page_shared(mm, vma, address, pmd, page,
+			       NULL, NULL, __unused, orig_pmd);
+	}
 	get_page(page);
 	spin_unlock(&mm->page_table_lock);
 alloc:
+	if (unlikely(anon_vma_prepare(vma)))
+		return VM_FAULT_OOM;
+
 	if (transparent_hugepage_enabled(vma) &&
 	    !transparent_hugepage_debug_cow())
 		new_page = alloc_hugepage_vma(transparent_hugepage_defrag(vma),
@@ -1195,6 +1203,11 @@ alloc:
 			add_mm_counter(mm, MM_ANONPAGES, HPAGE_PMD_NR);
 			put_huge_zero_page();
 		} else {
+			if (!PageAnon(page)) {
+				/* File page COWed with anon page */
+				add_mm_counter(mm, MM_FILEPAGES, -HPAGE_PMD_NR);
+				add_mm_counter(mm, MM_ANONPAGES, HPAGE_PMD_NR);
+			}
 			VM_BUG_ON(!PageHead(page));
 			page_remove_rmap(page);
 			put_page(page);
diff --git a/mm/memory.c b/mm/memory.c
index 4685dd1..ebff552 100644
--- a/mm/memory.c
+++ b/mm/memory.c
@@ -2640,16 +2640,33 @@ static void mkwrite_pte(struct vm_area_struct *vma, unsigned long address,
 		update_mmu_cache(vma, address, page_table);
 }
 
+static void mkwrite_pmd(struct vm_area_struct *vma, unsigned long address,
+		pmd_t *pmd, pmd_t orig_pmd)
+{
+	pmd_t entry;
+	unsigned long haddr = address & HPAGE_PMD_MASK;
+
+	flush_cache_page(vma, address, pmd_pfn(orig_pmd));
+	entry = pmd_mkyoung(orig_pmd);
+	entry = maybe_pmd_mkwrite(pmd_mkdirty(entry), vma);
+	if (pmdp_set_access_flags(vma, haddr, pmd, entry,  1))
+		update_mmu_cache_pmd(vma, address, pmd);
+}
+
 /*
  * Only catch write-faults on shared writable pages, read-only shared pages can
  * get COWed by get_user_pages(.write=1, .force=1).
  */
-static int do_wp_page_shared(struct mm_struct *mm, struct vm_area_struct *vma,
-		unsigned long address, pte_t *page_table, pmd_t *pmd,
-		spinlock_t *ptl, pte_t orig_pte, struct page *page)
+int do_wp_page_shared(struct mm_struct *mm, struct vm_area_struct *vma,
+		unsigned long address, pmd_t *pmd, struct page *page,
+		pte_t *page_table, spinlock_t *ptl,
+		pte_t orig_pte, pmd_t orig_pmd)
 {
 	struct vm_fault vmf;
 	bool page_mkwrite = false;
+	/* no page_table means caller asks for THP */
+	bool thp = (page_table == NULL) &&
+		IS_ENABLED(CONFIG_TRANSPARENT_HUGEPAGE_PAGECACHE);
 	int tmp, ret = 0;
 
 	if (vma->vm_ops && vma->vm_ops->page_mkwrite)
@@ -2660,6 +2677,9 @@ static int do_wp_page_shared(struct mm_struct *mm, struct vm_area_struct *vma,
 	vmf.flags = FAULT_FLAG_WRITE|FAULT_FLAG_MKWRITE;
 	vmf.page = page;
 
+	if (thp)
+		vmf.flags |= FAULT_FLAG_TRANSHUGE;
+
 	/*
 	 * Notify the address space that the page is about to
 	 * become writable so that it can prohibit this or wait
@@ -2669,7 +2689,10 @@ static int do_wp_page_shared(struct mm_struct *mm, struct vm_area_struct *vma,
 	 * sleep if it needs to.
 	 */
 	page_cache_get(page);
-	pte_unmap_unlock(page_table, ptl);
+	if (thp)
+		spin_unlock(&mm->page_table_lock);
+	else
+		pte_unmap_unlock(page_table, ptl);
 
 	tmp = vma->vm_ops->page_mkwrite(vma, &vmf);
 	if (unlikely(tmp & (VM_FAULT_ERROR | VM_FAULT_NOPAGE))) {
@@ -2693,19 +2716,34 @@ static int do_wp_page_shared(struct mm_struct *mm, struct vm_area_struct *vma,
 	 * they did, we just return, as we can count on the
 	 * MMU to tell us if they didn't also make it writable.
 	 */
-	page_table = pte_offset_map_lock(mm, pmd, address, &ptl);
-	if (!pte_same(*page_table, orig_pte)) {
-		unlock_page(page);
-		pte_unmap_unlock(page_table, ptl);
-		page_cache_release(page);
-		return ret;
+	if (thp) {
+		spin_lock(&mm->page_table_lock);
+		if (unlikely(!pmd_same(*pmd, orig_pmd))) {
+			unlock_page(page);
+			spin_unlock(&mm->page_table_lock);
+			page_cache_release(page);
+			return ret;
+		}
+	} else {
+		page_table = pte_offset_map_lock(mm, pmd, address, &ptl);
+		if (!pte_same(*page_table, orig_pte)) {
+			unlock_page(page);
+			pte_unmap_unlock(page_table, ptl);
+			page_cache_release(page);
+			return ret;
+		}
 	}
 
 	page_mkwrite = true;
 mkwrite_done:
 	get_page(page);
-	mkwrite_pte(vma, address, page_table, orig_pte);
-	pte_unmap_unlock(page_table, ptl);
+	if (thp) {
+		mkwrite_pmd(vma, address, pmd, orig_pmd);
+		spin_unlock(&mm->page_table_lock);
+	} else {
+		mkwrite_pte(vma, address, page_table, orig_pte);
+		pte_unmap_unlock(page_table, ptl);
+	}
 	dirty_page(vma, page, page_mkwrite);
 	return ret | VM_FAULT_WRITE;
 }
@@ -2787,9 +2825,11 @@ static int do_wp_page(struct mm_struct *mm, struct vm_area_struct *vma,
 		}
 		unlock_page(old_page);
 	} else if (unlikely((vma->vm_flags & (VM_WRITE|VM_SHARED)) ==
-					(VM_WRITE|VM_SHARED)))
-		return do_wp_page_shared(mm, vma, address, page_table, pmd, ptl,
-				orig_pte, old_page);
+					(VM_WRITE|VM_SHARED))) {
+		pmd_t __unused;
+		return do_wp_page_shared(mm, vma, address, pmd, old_page,
+				page_table, ptl, orig_pte, __unused);
+	}
 
 	/*
 	 * Ok, we need to copy. Oh, well..
-- 
1.7.10.4


^ permalink raw reply related	[flat|nested] 243+ messages in thread

* [PATCHv4 38/39] thp: vma_adjust_trans_huge(): adjust file-backed VMA too
  2013-05-12  1:22 ` Kirill A. Shutemov
@ 2013-05-12  1:23   ` Kirill A. Shutemov
  -1 siblings, 0 replies; 243+ messages in thread
From: Kirill A. Shutemov @ 2013-05-12  1:23 UTC (permalink / raw)
  To: Andrea Arcangeli, Andrew Morton
  Cc: Al Viro, Hugh Dickins, Wu Fengguang, Jan Kara, Mel Gorman,
	linux-mm, Andi Kleen, Matthew Wilcox, Kirill A. Shutemov,
	Hillf Danton, Dave Hansen, linux-fsdevel, linux-kernel,
	Kirill A. Shutemov

From: "Kirill A. Shutemov" <kirill.shutemov@linux.intel.com>

Since we're going to have huge pages in the page cache, we need to adjust
file-backed VMAs too, as they can potentially contain huge pages.

For now we call it for all VMAs.

Probably later we will need to introduce a flag to indicate that the VMA
has huge pages.

Signed-off-by: Kirill A. Shutemov <kirill.shutemov@linux.intel.com>
---
 include/linux/huge_mm.h |   11 +----------
 mm/huge_memory.c        |    2 +-
 2 files changed, 2 insertions(+), 11 deletions(-)

diff --git a/include/linux/huge_mm.h b/include/linux/huge_mm.h
index b20334a..f4d6626 100644
--- a/include/linux/huge_mm.h
+++ b/include/linux/huge_mm.h
@@ -139,7 +139,7 @@ extern void split_huge_page_pmd_mm(struct mm_struct *mm, unsigned long address,
 #endif
 extern int hugepage_madvise(struct vm_area_struct *vma,
 			    unsigned long *vm_flags, int advice);
-extern void __vma_adjust_trans_huge(struct vm_area_struct *vma,
+extern void vma_adjust_trans_huge(struct vm_area_struct *vma,
 				    unsigned long start,
 				    unsigned long end,
 				    long adjust_next);
@@ -155,15 +155,6 @@ static inline int pmd_trans_huge_lock(pmd_t *pmd,
 	else
 		return 0;
 }
-static inline void vma_adjust_trans_huge(struct vm_area_struct *vma,
-					 unsigned long start,
-					 unsigned long end,
-					 long adjust_next)
-{
-	if (!vma->anon_vma || vma->vm_ops)
-		return;
-	__vma_adjust_trans_huge(vma, start, end, adjust_next);
-}
 static inline int hpage_nr_pages(struct page *page)
 {
 	if (unlikely(PageTransHuge(page)))
diff --git a/mm/huge_memory.c b/mm/huge_memory.c
index d7c9df5..9c3815b 100644
--- a/mm/huge_memory.c
+++ b/mm/huge_memory.c
@@ -2783,7 +2783,7 @@ static void split_huge_page_address(struct mm_struct *mm,
 	split_huge_page_pmd_mm(mm, address, pmd);
 }
 
-void __vma_adjust_trans_huge(struct vm_area_struct *vma,
+void vma_adjust_trans_huge(struct vm_area_struct *vma,
 			     unsigned long start,
 			     unsigned long end,
 			     long adjust_next)
-- 
1.7.10.4


^ permalink raw reply related	[flat|nested] 243+ messages in thread

* [PATCHv4 39/39] thp: map file-backed huge pages on fault
  2013-05-12  1:22 ` Kirill A. Shutemov
@ 2013-05-12  1:23   ` Kirill A. Shutemov
  -1 siblings, 0 replies; 243+ messages in thread
From: Kirill A. Shutemov @ 2013-05-12  1:23 UTC (permalink / raw)
  To: Andrea Arcangeli, Andrew Morton
  Cc: Al Viro, Hugh Dickins, Wu Fengguang, Jan Kara, Mel Gorman,
	linux-mm, Andi Kleen, Matthew Wilcox, Kirill A. Shutemov,
	Hillf Danton, Dave Hansen, linux-fsdevel, linux-kernel,
	Kirill A. Shutemov

From: "Kirill A. Shutemov" <kirill.shutemov@linux.intel.com>

It looks like all the pieces are in place: we can map file-backed huge
pages now.

Signed-off-by: Kirill A. Shutemov <kirill.shutemov@linux.intel.com>
---
 include/linux/huge_mm.h |    4 +++-
 mm/memory.c             |    5 ++++-
 2 files changed, 7 insertions(+), 2 deletions(-)

diff --git a/include/linux/huge_mm.h b/include/linux/huge_mm.h
index f4d6626..903f097 100644
--- a/include/linux/huge_mm.h
+++ b/include/linux/huge_mm.h
@@ -78,7 +78,9 @@ extern bool is_vma_temporary_stack(struct vm_area_struct *vma);
 	   (1<<TRANSPARENT_HUGEPAGE_REQ_MADV_FLAG) &&			\
 	   ((__vma)->vm_flags & VM_HUGEPAGE))) &&			\
 	 !((__vma)->vm_flags & VM_NOHUGEPAGE) &&			\
-	 !is_vma_temporary_stack(__vma))
+	 !is_vma_temporary_stack(__vma) &&				\
+	 (!(__vma)->vm_ops ||						\
+		  mapping_can_have_hugepages((__vma)->vm_file->f_mapping)))
 #define transparent_hugepage_defrag(__vma)				\
 	((transparent_hugepage_flags &					\
 	  (1<<TRANSPARENT_HUGEPAGE_DEFRAG_FLAG)) ||			\
diff --git a/mm/memory.c b/mm/memory.c
index ebff552..7fe9752 100644
--- a/mm/memory.c
+++ b/mm/memory.c
@@ -3939,10 +3939,13 @@ retry:
 	if (!pmd)
 		return VM_FAULT_OOM;
 	if (pmd_none(*pmd) && transparent_hugepage_enabled(vma)) {
-		int ret = 0;
+		int ret;
 		if (!vma->vm_ops)
 			ret = do_huge_pmd_anonymous_page(mm, vma, address,
 					pmd, flags);
+		else
+			ret = do_huge_linear_fault(mm, vma, address,
+					pmd, flags);
 		if ((ret & VM_FAULT_FALLBACK) == 0)
 			return ret;
 	} else {
-- 
1.7.10.4


^ permalink raw reply related	[flat|nested] 243+ messages in thread
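
Rewritten as an entirely hypothetical helper for readability, the extended
transparent_hugepage_enabled() check above amounts to roughly the
following. thp_enabled_for_vma() and thp_policy_allows() are made-up names
(the latter stands in for the existing sysfs "always"/"madvise" half of the
macro), while mapping_can_have_hugepages() is the helper referenced in the
diff:

/* Readability sketch only -- the real check is the macro in huge_mm.h. */
static inline bool thp_enabled_for_vma(struct vm_area_struct *vma)
{
	/* global "always" policy, or "madvise" policy plus VM_HUGEPAGE */
	if (!thp_policy_allows(vma))
		return false;
	if (vma->vm_flags & VM_NOHUGEPAGE)
		return false;
	if (is_vma_temporary_stack(vma))
		return false;
	/* new in this patch: a file-backed VMA qualifies only if its
	 * mapping can actually host huge pages */
	if (vma->vm_ops &&
	    !mapping_can_have_hugepages(vma->vm_file->f_mapping))
		return false;
	return true;
}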

* Re: [PATCHv4 01/39] mm: drop actor argument of do_generic_file_read()
  2013-05-12  1:22   ` Kirill A. Shutemov
@ 2013-05-21 18:22     ` Dave Hansen
  -1 siblings, 0 replies; 243+ messages in thread
From: Dave Hansen @ 2013-05-21 18:22 UTC (permalink / raw)
  To: Kirill A. Shutemov
  Cc: Andrea Arcangeli, Andrew Morton, Al Viro, Hugh Dickins,
	Wu Fengguang, Jan Kara, Mel Gorman, linux-mm, Andi Kleen,
	Matthew Wilcox, Kirill A. Shutemov, Hillf Danton, linux-fsdevel,
	linux-kernel

On 05/11/2013 06:22 PM, Kirill A. Shutemov wrote:
> From: "Kirill A. Shutemov" <kirill.shutemov@linux.intel.com>
> 
> There's only one caller of do_generic_file_read() and the only actor is
> file_read_actor(). No reason to have a callback parameter.

Looks sane.  This can and should go up separately from the rest of the set.

Acked-by: Dave Hansen <dave.hansen@linux.intel.com>


^ permalink raw reply	[flat|nested] 243+ messages in thread
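
For readers without the tree handy, the change being acked boils down to a
prototype simplification along these lines. This is a sketch built from the
quoted description and a reading of mm/filemap.c of that era, so the exact
signatures may differ from the actual patch:

/* before: the actor callback was a parameter, although file_read_actor()
 * was the only actor ever passed in */
static void do_generic_file_read(struct file *filp, loff_t *ppos,
				 read_descriptor_t *desc, read_actor_t actor);

/* after: the parameter is dropped and file_read_actor() is called
 * directly inside do_generic_file_read() */
static void do_generic_file_read(struct file *filp, loff_t *ppos,
				 read_descriptor_t *desc);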

* Re: [PATCHv4 02/39] block: implement add_bdi_stat()
  2013-05-12  1:22   ` Kirill A. Shutemov
@ 2013-05-21 18:25     ` Dave Hansen
  -1 siblings, 0 replies; 243+ messages in thread
From: Dave Hansen @ 2013-05-21 18:25 UTC (permalink / raw)
  To: Kirill A. Shutemov
  Cc: Andrea Arcangeli, Andrew Morton, Al Viro, Hugh Dickins,
	Wu Fengguang, Jan Kara, Mel Gorman, linux-mm, Andi Kleen,
	Matthew Wilcox, Kirill A. Shutemov, Hillf Danton, linux-fsdevel,
	linux-kernel

On 05/11/2013 06:22 PM, Kirill A. Shutemov wrote:
> From: "Kirill A. Shutemov" <kirill.shutemov@linux.intel.com>
> 
> We're going to add/remove a number of page cache entries at once. This
> patch implements add_bdi_stat() which adjusts bdi stats by arbitrary
> amount. It's required for batched page cache manipulations.

Add, but no dec?

I'd also move this closer to where it gets used in the series.


^ permalink raw reply	[flat|nested] 243+ messages in thread
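
For context, the helper under discussion presumably mirrors the existing
__add_bdi_stat()/inc_bdi_stat() wrappers in include/linux/backing-dev.h; a
sketch under that assumption (not the actual patch) could look like the
code below. Since the amount is signed, passing a negative value would also
cover the "dec" case raised above:

/* Sketch only: assumes the patch adds an irq-safe wrapper around the
 * existing __add_bdi_stat() percpu-counter helper. */
static inline void add_bdi_stat(struct backing_dev_info *bdi,
		enum bdi_stat_item item, s64 amount)
{
	unsigned long flags;

	local_irq_save(flags);
	__add_bdi_stat(bdi, item, amount);
	local_irq_restore(flags);
}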

* Re: [PATCHv4 00/39] Transparent huge page cache
  2013-05-12  1:22 ` Kirill A. Shutemov
@ 2013-05-21 18:37   ` Dave Hansen
  -1 siblings, 0 replies; 243+ messages in thread
From: Dave Hansen @ 2013-05-21 18:37 UTC (permalink / raw)
  To: Kirill A. Shutemov
  Cc: Andrea Arcangeli, Andrew Morton, Al Viro, Hugh Dickins,
	Wu Fengguang, Jan Kara, Mel Gorman, linux-mm, Andi Kleen,
	Matthew Wilcox, Kirill A. Shutemov, Hillf Danton, linux-fsdevel,
	linux-kernel

On 05/11/2013 06:22 PM, Kirill A. Shutemov wrote:
> From: "Kirill A. Shutemov" <kirill.shutemov@linux.intel.com>
> 
> It's version 4. You can also use git tree:
> 
> git://git.kernel.org/pub/scm/linux/kernel/git/kas/linux.git
> 
> branch thp/pagecache.
> 
> If you want to check changes since v3 you can look at diff between tags
> thp/pagecache/v3 and thp/pagecache/v4-prerebase.

What's the purpose of posting these patches?  Do you want them merged?
Or are they useful as they stand, or are they just here so folks can
play with them as you improve them?

> The goal of the project is preparing kernel infrastructure to handle huge
> pages in page cache.
> 
> To proof that the proposed changes are functional we enable the feature
> for the most simple file system -- ramfs. ramfs is not that useful by
> itself, but it's good pilot project. It provides information on what
> performance boost we should expect on other files systems.

Do you think folks would use ramfs in practice?  Or is this just a toy?
 Could this replace some (or all) existing hugetlbfs use, for instance?

^ permalink raw reply	[flat|nested] 243+ messages in thread

* Re: [PATCHv4 04/39] radix-tree: implement preload for multiple contiguous elements
  2013-05-12  1:23   ` Kirill A. Shutemov
@ 2013-05-21 18:58     ` Dave Hansen
  -1 siblings, 0 replies; 243+ messages in thread
From: Dave Hansen @ 2013-05-21 18:58 UTC (permalink / raw)
  To: Kirill A. Shutemov
  Cc: Andrea Arcangeli, Andrew Morton, Al Viro, Hugh Dickins,
	Wu Fengguang, Jan Kara, Mel Gorman, linux-mm, Andi Kleen,
	Matthew Wilcox, Kirill A. Shutemov, Hillf Danton, linux-fsdevel,
	linux-kernel

On 05/11/2013 06:23 PM, Kirill A. Shutemov wrote:
> Currently radix_tree_preload() only guarantees enough nodes to insert
> one element. It's a hard limit. For transparent huge page cache we want
> to insert HPAGE_PMD_NR (512 on x86-64) entires to address_space at once.

                                       ^^entries

> This patch introduces radix_tree_preload_count(). It allows to
> preallocate nodes enough to insert a number of *contiguous* elements.

Would radix_tree_preload_contig() be a better name, then?

...
> On 64-bit system:
> For RADIX_TREE_MAP_SHIFT=3, old array size is 43, new is 107.
> For RADIX_TREE_MAP_SHIFT=4, old array size is 31, new is 63.
> For RADIX_TREE_MAP_SHIFT=6, old array size is 21, new is 30.
> 
> On 32-bit system:
> For RADIX_TREE_MAP_SHIFT=3, old array size is 21, new is 84.
> For RADIX_TREE_MAP_SHIFT=4, old array size is 15, new is 46.
> For RADIX_TREE_MAP_SHIFT=6, old array size is 11, new is 19.
> 
> On most machines we will have RADIX_TREE_MAP_SHIFT=6.

Thanks for adding that to the description.  The array you're talking
about is just pointers, right?

107-43 = 64.  So, we have 64 extra pointers * NR_CPUS, plus 64 extra
radix tree nodes that we will keep around most of the time.  On x86_64,
that's 512 bytes plus 64*560 bytes of nodes which is ~35k of memory per CPU.

That's not bad I guess, but I do bet it's something that some folks want
to configure out.  Please make sure to call out the actual size cost in
bytes per CPU in future patch postings, at least for the common case
(64-bit non-CONFIG_BASE_SMALL).

> Since only THP uses batched preload at the , we disable (set max preload
> to 1) it if !CONFIG_TRANSPARENT_HUGEPAGE_PAGECACHE. This can be changed
> in the future.

"at the..."  Is there something missing in that sentence?

No major nits, so:

Acked-by: Dave Hansen <dave.hansen@linux.intel.com>

^ permalink raw reply	[flat|nested] 243+ messages in thread
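
The per-CPU overhead quoted above can be reproduced with a
back-of-the-envelope calculation. This is a standalone sketch; the 8-byte
pointer and 560-byte radix_tree_node sizes are the figures from the mail,
not re-measured:

#include <stdio.h>

int main(void)
{
	unsigned long extra_slots = 107 - 43;           /* 64 extra preload slots */
	unsigned long ptr_bytes   = extra_slots * 8;    /* 512 bytes of pointers  */
	unsigned long node_bytes  = extra_slots * 560;  /* 35840 bytes of nodes   */

	/* prints: per-CPU overhead: 36352 bytes (~35 KiB) */
	printf("per-CPU overhead: %lu bytes (~%lu KiB)\n",
	       ptr_bytes + node_bytes, (ptr_bytes + node_bytes) / 1024);
	return 0;
}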

* Re: [PATCHv4 05/39] memcg, thp: charge huge cache pages
  2013-05-12  1:23   ` Kirill A. Shutemov
@ 2013-05-21 19:04     ` Dave Hansen
  -1 siblings, 0 replies; 243+ messages in thread
From: Dave Hansen @ 2013-05-21 19:04 UTC (permalink / raw)
  To: Kirill A. Shutemov
  Cc: Andrea Arcangeli, Andrew Morton, Al Viro, Hugh Dickins,
	Wu Fengguang, Jan Kara, Mel Gorman, linux-mm, Andi Kleen,
	Matthew Wilcox, Kirill A. Shutemov, Hillf Danton, linux-fsdevel,
	linux-kernel, Michal Hocko, KAMEZAWA Hiroyuki

On 05/11/2013 06:23 PM, Kirill A. Shutemov wrote:
> From: "Kirill A. Shutemov" <kirill.shutemov@linux.intel.com>
> 
> mem_cgroup_cache_charge() has check for PageCompound(). The check
> prevents charging huge cache pages.
> 
> I don't see a reason why the check is present. Looks like it's just
> legacy (introduced in 52d4b9a memcg: allocate all page_cgroup at boot).

FWIW, that commit introduced two PageCompound() checks.  The other one
went away inexplicably in 01b1ae63c22.

Acked-by: Dave Hansen <dave.hansen@linux.intel.com>

^ permalink raw reply	[flat|nested] 243+ messages in thread

* Re: [PATCHv4 06/39] thp, mm: avoid PageUnevictable on active/inactive lru lists
  2013-05-12  1:23   ` Kirill A. Shutemov
@ 2013-05-21 19:17     ` Dave Hansen
  -1 siblings, 0 replies; 243+ messages in thread
From: Dave Hansen @ 2013-05-21 19:17 UTC (permalink / raw)
  To: Kirill A. Shutemov
  Cc: Andrea Arcangeli, Andrew Morton, Al Viro, Hugh Dickins,
	Wu Fengguang, Jan Kara, Mel Gorman, linux-mm, Andi Kleen,
	Matthew Wilcox, Kirill A. Shutemov, Hillf Danton, linux-fsdevel,
	linux-kernel

On 05/11/2013 06:23 PM, Kirill A. Shutemov wrote:
> From: "Kirill A. Shutemov" <kirill.shutemov@linux.intel.com>
> 
> active/inactive lru lists can contain unevicable pages (i.e. ramfs pages
> that have been placed on the LRU lists when first allocated), but these
> pages must not have PageUnevictable set - otherwise shrink_active_list
> goes crazy:
> 
> kernel BUG at /home/space/kas/git/public/linux-next/mm/vmscan.c:1122!
> invalid opcode: 0000 [#1] SMP
> CPU 0
> Pid: 293, comm: kswapd0 Not tainted 3.8.0-rc6-next-20130202+ #531
> RIP: 0010:[<ffffffff81110478>]  [<ffffffff81110478>] isolate_lru_pages.isra.61+0x138/0x260
> RSP: 0000:ffff8800796d9b28  EFLAGS: 00010082'
...

I'd much rather see a code snippet and a description of the BUG_ON() than a
register and stack dump.  That line number is wrong already. ;)

> For lru_add_page_tail(), it means we should not set PageUnevictable()
> for tail pages unless we're sure that it will go to LRU_UNEVICTABLE.
> Let's just copy PG_active and PG_unevictable from head page in
> __split_huge_page_refcount(), it will simplify lru_add_page_tail().
> 
> This will fix one more bug in lru_add_page_tail():
> if page_evictable(page_tail) is false and PageLRU(page) is true, page_tail
> will go to the same lru as page, but nobody cares to sync page_tail
> active/inactive state with page. So we can end up with inactive page on
> active lru.
> The patch will fix it as well since we copy PG_active from head page.

This all seems good, and if it fixes a bug, it should really get merged
as it stands.  Have you been actually able to trigger that bug in any
way in practice?

Acked-by: Dave Hansen <dave.hansen@linux.intel.com>

^ permalink raw reply	[flat|nested] 243+ messages in thread

* Re: [PATCHv4 09/39] thp, mm: introduce mapping_can_have_hugepages() predicate
  2013-05-12  1:23   ` Kirill A. Shutemov
@ 2013-05-21 19:28     ` Dave Hansen
  -1 siblings, 0 replies; 243+ messages in thread
From: Dave Hansen @ 2013-05-21 19:28 UTC (permalink / raw)
  To: Kirill A. Shutemov
  Cc: Andrea Arcangeli, Andrew Morton, Al Viro, Hugh Dickins,
	Wu Fengguang, Jan Kara, Mel Gorman, linux-mm, Andi Kleen,
	Matthew Wilcox, Kirill A. Shutemov, Hillf Danton, linux-fsdevel,
	linux-kernel

On 05/11/2013 06:23 PM, Kirill A. Shutemov wrote:
> From: "Kirill A. Shutemov" <kirill.shutemov@linux.intel.com>
> 
> Returns true if mapping can have huge pages. Just check for __GFP_COMP
> in gfp mask of the mapping for now.
> 
> Signed-off-by: Kirill A. Shutemov <kirill.shutemov@linux.intel.com>
> ---
>  include/linux/pagemap.h |   12 ++++++++++++
>  1 file changed, 12 insertions(+)
> 
> diff --git a/include/linux/pagemap.h b/include/linux/pagemap.h
> index e3dea75..28597ec 100644
> --- a/include/linux/pagemap.h
> +++ b/include/linux/pagemap.h
> @@ -84,6 +84,18 @@ static inline void mapping_set_gfp_mask(struct address_space *m, gfp_t mask)
>  				(__force unsigned long)mask;
>  }
>  
> +static inline bool mapping_can_have_hugepages(struct address_space *m)
> +{
> +	if (IS_ENABLED(CONFIG_TRANSPARENT_HUGEPAGE_PAGECACHE)) {
> +		gfp_t gfp_mask = mapping_gfp_mask(m);
> +		/* __GFP_COMP is key part of GFP_TRANSHUGE */
> +		return !!(gfp_mask & __GFP_COMP) &&
> +			transparent_hugepage_pagecache();
> +	}
> +
> +	return false;
> +}

transparent_hugepage_pagecache() already has the same IS_ENABLED()
check.  Is it really necessary to do it again here?

IOW, can you do this?

> +static inline bool mapping_can_have_hugepages(struct address_space
> +{
> +		gfp_t gfp_mask = mapping_gfp_mask(m);
		if (!transparent_hugepage_pagecache())
			return false;
> +		/* __GFP_COMP is key part of GFP_TRANSHUGE */
> +		return !!(gfp_mask & __GFP_COMP);
> +}
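
For reference, a cleaned-up sketch of that suggestion (untested; it
assumes transparent_hugepage_pagecache() already carries the
IS_ENABLED(CONFIG_TRANSPARENT_HUGEPAGE_PAGECACHE) check internally):

	static inline bool mapping_can_have_hugepages(struct address_space *m)
	{
		gfp_t gfp_mask = mapping_gfp_mask(m);

		if (!transparent_hugepage_pagecache())
			return false;

		/* __GFP_COMP is a key part of GFP_TRANSHUGE */
		return !!(gfp_mask & __GFP_COMP);
	}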

I know we talked about this in the past, but I've forgotten already.
Why is this checking for __GFP_COMP instead of GFP_TRANSHUGE?

Please flesh out the comment.

Also, what happens if "transparent_hugepage_flags &
(1<<TRANSPARENT_HUGEPAGE_PAGECACHE)" becomes false at runtime and you
have some already-instantiated huge page cache mappings around?  Will
things like mapping_align_mask() break?

^ permalink raw reply	[flat|nested] 243+ messages in thread

* Re: [PATCHv4 10/39] thp: account anon transparent huge pages into NR_ANON_PAGES
  2013-05-12  1:23   ` Kirill A. Shutemov
@ 2013-05-21 19:32     ` Dave Hansen
  -1 siblings, 0 replies; 243+ messages in thread
From: Dave Hansen @ 2013-05-21 19:32 UTC (permalink / raw)
  To: Kirill A. Shutemov
  Cc: Andrea Arcangeli, Andrew Morton, Al Viro, Hugh Dickins,
	Wu Fengguang, Jan Kara, Mel Gorman, linux-mm, Andi Kleen,
	Matthew Wilcox, Kirill A. Shutemov, Hillf Danton, Dave Hansen,
	linux-fsdevel, linux-kernel

On 05/11/2013 06:23 PM, Kirill A. Shutemov wrote:
> From: "Kirill A. Shutemov" <kirill.shutemov@linux.intel.com>
> 
> We use NR_ANON_PAGES as base for reporting AnonPages to user.
> There's not much sense in not accounting transparent huge pages there, but
> add them on printing to user.
> 
> Let's account transparent huge pages in NR_ANON_PAGES in the first place.

This is another one that needs to be pretty carefully considered
_independently_ of the rest of this set.  It also has potential
user-visible changes, so it would be nice to have a blurb in the patch
description if you've thought about this, and why you think it's OK.

But, it still makes solid sense to me, and simplifies the code.

Acked-by: Dave Hansen <dave.hansen@linux.intel.com>

^ permalink raw reply	[flat|nested] 243+ messages in thread

* Re: [PATCHv4 11/39] thp: represent file thp pages in meminfo and friends
  2013-05-12  1:23   ` Kirill A. Shutemov
@ 2013-05-21 19:34     ` Dave Hansen
  -1 siblings, 0 replies; 243+ messages in thread
From: Dave Hansen @ 2013-05-21 19:34 UTC (permalink / raw)
  To: Kirill A. Shutemov
  Cc: Andrea Arcangeli, Andrew Morton, Al Viro, Hugh Dickins,
	Wu Fengguang, Jan Kara, Mel Gorman, linux-mm, Andi Kleen,
	Matthew Wilcox, Kirill A. Shutemov, Hillf Danton, linux-fsdevel,
	linux-kernel

On 05/11/2013 06:23 PM, Kirill A. Shutemov wrote:
> From: "Kirill A. Shutemov" <kirill.shutemov@linux.intel.com>
> 
> The patch adds new zone stat to count file transparent huge pages and
> adjust related places.
> 
> For now we don't count mapped or dirty file thp pages separately.

You need to call out that this depends on the previous "NR_ANON_PAGES"
behaviour change to make sense.  Otherwise,

Acked-by: Dave Hansen <dave.hansen@linux.intel.com>

^ permalink raw reply	[flat|nested] 243+ messages in thread

* Re: [PATCHv4 13/39] mm: trace filemap: dump page order
  2013-05-12  1:23   ` Kirill A. Shutemov
@ 2013-05-21 19:35     ` Dave Hansen
  -1 siblings, 0 replies; 243+ messages in thread
From: Dave Hansen @ 2013-05-21 19:35 UTC (permalink / raw)
  To: Kirill A. Shutemov
  Cc: Andrea Arcangeli, Andrew Morton, Al Viro, Hugh Dickins,
	Wu Fengguang, Jan Kara, Mel Gorman, linux-mm, Andi Kleen,
	Matthew Wilcox, Kirill A. Shutemov, Hillf Danton, linux-fsdevel,
	linux-kernel

On 05/11/2013 06:23 PM, Kirill A. Shutemov wrote:
> From: "Kirill A. Shutemov" <kirill.shutemov@linux.intel.com>
> 
> Dump page order to trace to be able to distinguish between small page
> and huge page in page cache.

Acked-by: Dave Hansen <dave.hansen@linux.intel.com>


^ permalink raw reply	[flat|nested] 243+ messages in thread

* Re: [PATCHv4 12/39] thp, mm: rewrite add_to_page_cache_locked() to support huge pages
  2013-05-12  1:23   ` Kirill A. Shutemov
@ 2013-05-21 19:59     ` Dave Hansen
  -1 siblings, 0 replies; 243+ messages in thread
From: Dave Hansen @ 2013-05-21 19:59 UTC (permalink / raw)
  To: Kirill A. Shutemov
  Cc: Andrea Arcangeli, Andrew Morton, Al Viro, Hugh Dickins,
	Wu Fengguang, Jan Kara, Mel Gorman, linux-mm, Andi Kleen,
	Matthew Wilcox, Kirill A. Shutemov, Hillf Danton, linux-fsdevel,
	linux-kernel

On 05/11/2013 06:23 PM, Kirill A. Shutemov wrote:
> From: "Kirill A. Shutemov" <kirill.shutemov@linux.intel.com>
> 
> For huge page we add to radix tree HPAGE_CACHE_NR pages at once: head
> page for the specified index and HPAGE_CACHE_NR-1 tail pages for
> following indexes.

The really nice way to do these patches is to refactor them first, with
no behavior change, in one patch, then introduce the new support in the
second one.

> diff --git a/mm/filemap.c b/mm/filemap.c
> index 61158ac..b0c7c8c 100644
> --- a/mm/filemap.c
> +++ b/mm/filemap.c
> @@ -460,39 +460,62 @@ int add_to_page_cache_locked(struct page *page, struct address_space *mapping,
>  		pgoff_t offset, gfp_t gfp_mask)
>  {
>  	int error;
> +	int i, nr;
>  
>  	VM_BUG_ON(!PageLocked(page));
>  	VM_BUG_ON(PageSwapBacked(page));
>  
> +	/* memory cgroup controller handles thp pages on its side */
>  	error = mem_cgroup_cache_charge(page, current->mm,
>  					gfp_mask & GFP_RECLAIM_MASK);
>  	if (error)
> -		goto out;
> -
> -	error = radix_tree_preload(gfp_mask & ~__GFP_HIGHMEM);
> -	if (error == 0) {
> -		page_cache_get(page);
> -		page->mapping = mapping;
> -		page->index = offset;
> +		return error;
>  
> -		spin_lock_irq(&mapping->tree_lock);
> -		error = radix_tree_insert(&mapping->page_tree, offset, page);
> -		if (likely(!error)) {
> -			mapping->nrpages++;
> -			__inc_zone_page_state(page, NR_FILE_PAGES);
> -			spin_unlock_irq(&mapping->tree_lock);
> -			trace_mm_filemap_add_to_page_cache(page);
> -		} else {
> -			page->mapping = NULL;
> -			/* Leave page->index set: truncation relies upon it */
> -			spin_unlock_irq(&mapping->tree_lock);
> -			mem_cgroup_uncharge_cache_page(page);
> -			page_cache_release(page);
> -		}
> -		radix_tree_preload_end();
> -	} else
> +	if (IS_ENABLED(CONFIG_TRANSPARENT_HUGEPAGE_PAGECACHE)) {
> +		BUILD_BUG_ON(HPAGE_CACHE_NR > RADIX_TREE_PRELOAD_NR);
> +		nr = hpage_nr_pages(page);
> +	} else {
> +		BUG_ON(PageTransHuge(page));
> +		nr = 1;
> +	}

Why can't this just be

		nr = hpage_nr_pages(page);

Are you trying to optimize for the THP=y, but THP-pagecache=n case?

> +	error = radix_tree_preload_count(nr, gfp_mask & ~__GFP_HIGHMEM);
> +	if (error) {
>  		mem_cgroup_uncharge_cache_page(page);
> -out:
> +		return error;
> +	}
> +
> +	spin_lock_irq(&mapping->tree_lock);
> +	for (i = 0; i < nr; i++) {
> +		page_cache_get(page + i);
> +		page[i].index = offset + i;
> +		page[i].mapping = mapping;
> +		error = radix_tree_insert(&mapping->page_tree,
> +				offset + i, page + i);
> +		if (error)
> +			goto err;

I know it's not a super-common thing in the kernel, but could you call
this "insert_err" or something?

> +	}
> +	__mod_zone_page_state(page_zone(page), NR_FILE_PAGES, nr);
> +	if (PageTransHuge(page))
> +		__inc_zone_page_state(page, NR_FILE_TRANSPARENT_HUGEPAGES);
> +	mapping->nrpages += nr;
> +	spin_unlock_irq(&mapping->tree_lock);
> +	radix_tree_preload_end();
> +	trace_mm_filemap_add_to_page_cache(page);
> +	return 0;
> +err:
> +	if (i != 0)
> +		error = -ENOSPC; /* no space for a huge page */
> +	page_cache_release(page + i);
> +	page[i].mapping = NULL;

I guess it's a slight behaviour change (I think it's harmless) but if
you delay doing the page_cache_get() and page[i].mapping= until after
the radix tree insertion, you can avoid these two lines.
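
Roughly like this (untested sketch of the reordering, also using the
"insert_err" label suggested above):

	spin_lock_irq(&mapping->tree_lock);
	for (i = 0; i < nr; i++) {
		page[i].index = offset + i;
		error = radix_tree_insert(&mapping->page_tree,
				offset + i, page + i);
		if (error)
			goto insert_err;
		/* the slot is ours, now take the reference */
		page_cache_get(page + i);
		page[i].mapping = mapping;
	}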

> +	for (i--; i >= 0; i--) {

I kinda glossed over that initial "i--".  It might be worth a quick
comment to call it out.

> +		/* Leave page->index set: truncation relies upon it */
> +		page[i].mapping = NULL;
> +		radix_tree_delete(&mapping->page_tree, offset + i);
> +		page_cache_release(page + i);
> +	}
> +	spin_unlock_irq(&mapping->tree_lock);
> +	radix_tree_preload_end();
> +	mem_cgroup_uncharge_cache_page(page);
>  	return error;
>  }

FWIW, I think you can move the radix_tree_preload_end() up a bit.  I
guess it won't make any practical difference since you're holding a
spinlock, but it at least makes the point that you're not depending on
it any more.

I'm also trying to figure out how and when you'd actually have to unroll
a partial-huge-page worth of radix_tree_insert().  In the small-page
case, you can collide with another guy inserting into the page cache.
But, can that happen in the _middle_ of a THP?

Despite my nits, the code still looks correct here, so:

Acked-by: Dave Hansen <dave.hansen@linux.intel.com>

^ permalink raw reply	[flat|nested] 243+ messages in thread

* Re: [PATCHv4 14/39] thp, mm: rewrite delete_from_page_cache() to support huge pages
  2013-05-12  1:23   ` Kirill A. Shutemov
@ 2013-05-21 20:14     ` Dave Hansen
  -1 siblings, 0 replies; 243+ messages in thread
From: Dave Hansen @ 2013-05-21 20:14 UTC (permalink / raw)
  To: Kirill A. Shutemov
  Cc: Andrea Arcangeli, Andrew Morton, Al Viro, Hugh Dickins,
	Wu Fengguang, Jan Kara, Mel Gorman, linux-mm, Andi Kleen,
	Matthew Wilcox, Kirill A. Shutemov, Hillf Danton, linux-fsdevel,
	linux-kernel

On 05/11/2013 06:23 PM, Kirill A. Shutemov wrote:
> From: "Kirill A. Shutemov" <kirill.shutemov@linux.intel.com>
> 
> As with add_to_page_cache_locked() we handle HPAGE_CACHE_NR pages a
> time.
> 
> Signed-off-by: Kirill A. Shutemov <kirill.shutemov@linux.intel.com>
> ---
>  mm/filemap.c |   31 +++++++++++++++++++++++++------
>  1 file changed, 25 insertions(+), 6 deletions(-)
> 
> diff --git a/mm/filemap.c b/mm/filemap.c
> index b0c7c8c..657ce82 100644
> --- a/mm/filemap.c
> +++ b/mm/filemap.c
> @@ -115,6 +115,9 @@
>  void __delete_from_page_cache(struct page *page)
>  {
>  	struct address_space *mapping = page->mapping;
> +	bool thp = PageTransHuge(page) &&
> +		IS_ENABLED(CONFIG_TRANSPARENT_HUGEPAGE_PAGECACHE);
> +	int nr;

Is that check for the config option really necessary?  How would we get
a page with PageTransHuge() set without it being enabled?

>  	trace_mm_filemap_delete_from_page_cache(page);
>  	/*
> @@ -127,13 +130,29 @@ void __delete_from_page_cache(struct page *page)
>  	else
>  		cleancache_invalidate_page(mapping, page);
>  
> -	radix_tree_delete(&mapping->page_tree, page->index);
> +	if (thp) {
> +		int i;
> +
> +		nr = HPAGE_CACHE_NR;
> +		radix_tree_delete(&mapping->page_tree, page->index);
> +		for (i = 1; i < HPAGE_CACHE_NR; i++) {
> +			radix_tree_delete(&mapping->page_tree, page->index + i);
> +			page[i].mapping = NULL;
> +			page_cache_release(page + i);
> +		}
> +		__dec_zone_page_state(page, NR_FILE_TRANSPARENT_HUGEPAGES);
> +	} else {
> +		BUG_ON(PageTransHuge(page));
> +		nr = 1;
> +		radix_tree_delete(&mapping->page_tree, page->index);
> +	}
>  	page->mapping = NULL;

I like to rewrite your code. :)

	nr = hpage_nr_pages(page);
	for (i = 0; i < nr; i++) {
		page[i].mapping = NULL;
		radix_tree_delete(&mapping->page_tree, page->index + i);
		/* tail pages: */
		if (i)
			page_cache_release(page + i);
	}
	if (thp)
	     __dec_zone_page_state(page, NR_FILE_TRANSPARENT_HUGEPAGES);

I like this because it explicitly calls out the logic that tail pages
are different from head pages.  We handle their reference counts
differently.

Which reminds me...  Why do we handle their reference counts differently? :)

It seems like we could easily put a for loop in delete_from_page_cache()
that will release their reference counts along with the head page.
Wouldn't that make the code less special-cased for tail pages?
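
Something like this at the end of delete_from_page_cache(), for
instance (untested, just the idea):

	int i;

	/* drop the cache reference for the head page and all tails */
	for (i = 0; i < hpage_nr_pages(page); i++)
		page_cache_release(page + i);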

>  	/* Leave page->index set: truncation lookup relies upon it */
> -	mapping->nrpages--;
> -	__dec_zone_page_state(page, NR_FILE_PAGES);
> +	mapping->nrpages -= nr;
> +	__mod_zone_page_state(page_zone(page), NR_FILE_PAGES, -nr);
>  	if (PageSwapBacked(page))
> -		__dec_zone_page_state(page, NR_SHMEM);
> +		__mod_zone_page_state(page_zone(page), NR_SHMEM, -nr);
>  	BUG_ON(page_mapped(page));

Man, we suck:

	__dec_zone_page_state()
and
	__mod_zone_page_state()

take a differently-typed first argument.  <sigh>

Would there be any benefit in making __dec_zone_page_state() check to see
if the page we passed in _is_ a compound page, and adjusting its
behaviour accordingly?
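
Purely hypothetical, but I'm thinking of something along these lines
(made-up helper name, just to illustrate):

	/* account the whole compound page in one call */
	static inline void __dec_zone_compound_page_state(struct page *page,
					enum zone_stat_item item)
	{
		__mod_zone_page_state(page_zone(page), item,
				      -hpage_nr_pages(page));
	}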

>  	/*
> @@ -144,8 +163,8 @@ void __delete_from_page_cache(struct page *page)
>  	 * having removed the page entirely.
>  	 */
>  	if (PageDirty(page) && mapping_cap_account_dirty(mapping)) {
> -		dec_zone_page_state(page, NR_FILE_DIRTY);
> -		dec_bdi_stat(mapping->backing_dev_info, BDI_RECLAIMABLE);
> +		mod_zone_page_state(page_zone(page), NR_FILE_DIRTY, -nr);
> +		add_bdi_stat(mapping->backing_dev_info, BDI_RECLAIMABLE, -nr);
>  	}
>  }

Ahh, I see now why you didn't need a dec_bdi_stat().  Oh well...


^ permalink raw reply	[flat|nested] 243+ messages in thread

* Re: [PATCHv4 15/39] thp, mm: trigger bug in replace_page_cache_page() on THP
  2013-05-12  1:23   ` Kirill A. Shutemov
@ 2013-05-21 20:17     ` Dave Hansen
  -1 siblings, 0 replies; 243+ messages in thread
From: Dave Hansen @ 2013-05-21 20:17 UTC (permalink / raw)
  To: Kirill A. Shutemov
  Cc: Andrea Arcangeli, Andrew Morton, Al Viro, Hugh Dickins,
	Wu Fengguang, Jan Kara, Mel Gorman, linux-mm, Andi Kleen,
	Matthew Wilcox, Kirill A. Shutemov, Hillf Danton, linux-fsdevel,
	linux-kernel

On 05/11/2013 06:23 PM, Kirill A. Shutemov wrote:
> From: "Kirill A. Shutemov" <kirill.shutemov@linux.intel.com>
> 
> replace_page_cache_page() is only used by FUSE. It's unlikely that we
> will support THP in FUSE page cache any soon.
> 
> Let's pospone implemetation of THP handling in replace_page_cache_page()
> until any will use it.
...
> diff --git a/mm/filemap.c b/mm/filemap.c
> index 657ce82..3a03426 100644
> --- a/mm/filemap.c
> +++ b/mm/filemap.c
> @@ -428,6 +428,8 @@ int replace_page_cache_page(struct page *old, struct page *new, gfp_t gfp_mask)
>  {
>  	int error;
>  
> +	VM_BUG_ON(PageTransHuge(old));
> +	VM_BUG_ON(PageTransHuge(new));
>  	VM_BUG_ON(!PageLocked(old));
>  	VM_BUG_ON(!PageLocked(new));
>  	VM_BUG_ON(new->mapping);

The code calling replace_page_cache_page() has a bunch of fallback and
error returning code.  It seems a little bit silly to bring the whole
machine down when you could just WARN_ONCE() and return an error code
like fuse already does:

>         /*
>          * This is a new and locked page, it shouldn't be mapped or
>          * have any special flags on it
>          */
>         if (WARN_ON(page_mapped(oldpage)))
>                 goto out_fallback_unlock;
>         if (WARN_ON(page_has_private(oldpage)))
>                 goto out_fallback_unlock;
>         if (WARN_ON(PageDirty(oldpage) || PageWriteback(oldpage)))
>                 goto out_fallback_unlock;
>         if (WARN_ON(PageMlocked(oldpage)))
>                 goto out_fallback_unlock;
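
Something like this near the top of replace_page_cache_page() would
keep the box alive instead (untested sketch):

	if (WARN_ONCE(PageTransHuge(old) || PageTransHuge(new),
		      "%s: THP is not supported\n", __func__))
		return -EINVAL;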


^ permalink raw reply	[flat|nested] 243+ messages in thread

* Re: [PATCHv4 16/39] thp, mm: locking tail page is a bug
  2013-05-12  1:23   ` Kirill A. Shutemov
@ 2013-05-21 20:18     ` Dave Hansen
  -1 siblings, 0 replies; 243+ messages in thread
From: Dave Hansen @ 2013-05-21 20:18 UTC (permalink / raw)
  To: Kirill A. Shutemov
  Cc: Andrea Arcangeli, Andrew Morton, Al Viro, Hugh Dickins,
	Wu Fengguang, Jan Kara, Mel Gorman, linux-mm, Andi Kleen,
	Matthew Wilcox, Kirill A. Shutemov, Hillf Danton, linux-fsdevel,
	linux-kernel

On 05/11/2013 06:23 PM, Kirill A. Shutemov wrote:
> From: "Kirill A. Shutemov" <kirill.shutemov@linux.intel.com>
> 
> Locking head page means locking entire compound page.
> If we try to lock tail page, something went wrong.

Have you actually triggered this in your development?

This is another one that can theoretically get merged separately.

Acked-by: Dave Hansen <dave.hansen@linux.intel.com>

^ permalink raw reply	[flat|nested] 243+ messages in thread

* Re: [PATCHv4 17/39] thp, mm: handle tail pages in page_cache_get_speculative()
  2013-05-12  1:23   ` Kirill A. Shutemov
@ 2013-05-21 20:49     ` Dave Hansen
  -1 siblings, 0 replies; 243+ messages in thread
From: Dave Hansen @ 2013-05-21 20:49 UTC (permalink / raw)
  To: Kirill A. Shutemov
  Cc: Andrea Arcangeli, Andrew Morton, Al Viro, Hugh Dickins,
	Wu Fengguang, Jan Kara, Mel Gorman, linux-mm, Andi Kleen,
	Matthew Wilcox, Kirill A. Shutemov, Hillf Danton, linux-fsdevel,
	linux-kernel

On 05/11/2013 06:23 PM, Kirill A. Shutemov wrote:
> From: "Kirill A. Shutemov" <kirill.shutemov@linux.intel.com>
> 
> For tail page we call __get_page_tail(). It has the same semantics, but
> for tail page.

page_cache_get_speculative() has a ~50-line comment above it with lots
of scariness about grace periods and RCU.  A two line comment saying
that the semantics are the same doesn't make me feel great that you've
done your homework here.

Are there any performance implications here?  __get_page_tail() says:
"It implements the slow path of get_page().".
page_cache_get_speculative() seems awfully speculative which would make
me think that it is part of a _fast_ path.

> diff --git a/include/linux/pagemap.h b/include/linux/pagemap.h
> index 28597ec..2e86251 100644
> --- a/include/linux/pagemap.h
> +++ b/include/linux/pagemap.h
> @@ -161,6 +161,9 @@ static inline int page_cache_get_speculative(struct page *page)
>  {
>  	VM_BUG_ON(in_interrupt());
>  
> +	if (unlikely(PageTail(page)))
> +		return __get_page_tail(page);
> +
>  #ifdef CONFIG_TINY_RCU
>  # ifdef CONFIG_PREEMPT_COUNT
>  	VM_BUG_ON(!in_atomic());
> @@ -187,7 +190,6 @@ static inline int page_cache_get_speculative(struct page *page)
>  		return 0;
>  	}
>  #endif
> -	VM_BUG_ON(PageTail(page));
>  
>  	return 1;
>  }

FWIW, that VM_BUG_ON() should theoretically be able to stay there since
it's unreachable now that you've short-circuited the function for
PageTail() pages.


^ permalink raw reply	[flat|nested] 243+ messages in thread

* Re: [PATCHv4 18/39] thp, mm: add event counters for huge page alloc on write to a file
  2013-05-12  1:23   ` Kirill A. Shutemov
@ 2013-05-21 20:54     ` Dave Hansen
  -1 siblings, 0 replies; 243+ messages in thread
From: Dave Hansen @ 2013-05-21 20:54 UTC (permalink / raw)
  To: Kirill A. Shutemov
  Cc: Andrea Arcangeli, Andrew Morton, Al Viro, Hugh Dickins,
	Wu Fengguang, Jan Kara, Mel Gorman, linux-mm, Andi Kleen,
	Matthew Wilcox, Kirill A. Shutemov, Hillf Danton, linux-fsdevel,
	linux-kernel

On 05/11/2013 06:23 PM, Kirill A. Shutemov wrote:
> diff --git a/include/linux/vm_event_item.h b/include/linux/vm_event_item.h
> index d4b7a18..584c71c 100644
> --- a/include/linux/vm_event_item.h
> +++ b/include/linux/vm_event_item.h
> @@ -71,6 +71,8 @@ enum vm_event_item { PGPGIN, PGPGOUT, PSWPIN, PSWPOUT,
>  		THP_FAULT_FALLBACK,
>  		THP_COLLAPSE_ALLOC,
>  		THP_COLLAPSE_ALLOC_FAILED,
> +		THP_WRITE_ALLOC,
> +		THP_WRITE_ALLOC_FAILED,
>  		THP_SPLIT,
>  		THP_ZERO_PAGE_ALLOC,
>  		THP_ZERO_PAGE_ALLOC_FAILED,
> diff --git a/mm/vmstat.c b/mm/vmstat.c
> index 7945285..df8dcda 100644
> --- a/mm/vmstat.c
> +++ b/mm/vmstat.c
> @@ -821,6 +821,8 @@ const char * const vmstat_text[] = {
>  	"thp_fault_fallback",
>  	"thp_collapse_alloc",
>  	"thp_collapse_alloc_failed",
> +	"thp_write_alloc",
> +	"thp_write_alloc_failed",
>  	"thp_split",
>  	"thp_zero_page_alloc",
>  	"thp_zero_page_alloc_failed",

I guess these new counters are _consistent_ with all the others.  But,
why do we need a separate "_failed" for each one of these?  While I'm
nitpicking, does "thp_write_alloc" mean allocs or _successful_ allocs?
I had to look at the code to tell.

I think it's probably safe to combine this patch with the next one.
Breaking them apart just makes it harder to review.  If _anything_,
this, plus the use of the counters, should go into a different patch
from the true code changes in "mm: allocate huge pages in
grab_cache_page_write_begin()".

^ permalink raw reply	[flat|nested] 243+ messages in thread

* Re: [PATCHv4 19/39] thp, mm: allocate huge pages in grab_cache_page_write_begin()
  2013-05-12  1:23   ` Kirill A. Shutemov
@ 2013-05-21 21:14     ` Dave Hansen
  -1 siblings, 0 replies; 243+ messages in thread
From: Dave Hansen @ 2013-05-21 21:14 UTC (permalink / raw)
  To: Kirill A. Shutemov
  Cc: Andrea Arcangeli, Andrew Morton, Al Viro, Hugh Dickins,
	Wu Fengguang, Jan Kara, Mel Gorman, linux-mm, Andi Kleen,
	Matthew Wilcox, Kirill A. Shutemov, Hillf Danton, linux-fsdevel,
	linux-kernel

On 05/11/2013 06:23 PM, Kirill A. Shutemov wrote:
> From: "Kirill A. Shutemov" <kirill.shutemov@linux.intel.com>
> 
> Try to allocate huge page if flags has AOP_FLAG_TRANSHUGE.

Why do we need this flag?  When might we set it, and when would we not
set it?  What kinds of callers need to check for and act on it?

Some of this, at least, needs to make it in to the comment by the #define.

> --- a/include/linux/huge_mm.h
> +++ b/include/linux/huge_mm.h
> @@ -194,6 +194,9 @@ extern int do_huge_pmd_numa_page(struct mm_struct *mm, struct vm_area_struct *vm
>  #define HPAGE_CACHE_NR         ({ BUILD_BUG(); 0; })
>  #define HPAGE_CACHE_INDEX_MASK ({ BUILD_BUG(); 0; })
>  
> +#define THP_WRITE_ALLOC		({ BUILD_BUG(); 0; })
> +#define THP_WRITE_ALLOC_FAILED	({ BUILD_BUG(); 0; })

Doesn't this belong in the previous patch?

>  #define hpage_nr_pages(x) 1
>  
>  #define transparent_hugepage_enabled(__vma) 0
> diff --git a/include/linux/pagemap.h b/include/linux/pagemap.h
> index 2e86251..8feeecc 100644
> --- a/include/linux/pagemap.h
> +++ b/include/linux/pagemap.h
> @@ -270,8 +270,15 @@ unsigned find_get_pages_contig(struct address_space *mapping, pgoff_t start,
>  unsigned find_get_pages_tag(struct address_space *mapping, pgoff_t *index,
>  			int tag, unsigned int nr_pages, struct page **pages);
>  
> -struct page *grab_cache_page_write_begin(struct address_space *mapping,
> +struct page *__grab_cache_page_write_begin(struct address_space *mapping,
>  			pgoff_t index, unsigned flags);
> +static inline struct page *grab_cache_page_write_begin(
> +		struct address_space *mapping, pgoff_t index, unsigned flags)
> +{
> +	if (!transparent_hugepage_pagecache() && (flags & AOP_FLAG_TRANSHUGE))
> +		return NULL;
> +	return __grab_cache_page_write_begin(mapping, index, flags);
> +}

OK, so there's some of the behavior.

Could you also call out why you refactored this code?  It seems like
you're trying to optimize for the case where AOP_FLAG_TRANSHUGE isn't
set and where the compiler knows that it isn't set.

Could you talk a little bit about the cases that you're thinking of here?

>  /*
>   * Returns locked page at given index in given cache, creating it if needed.
> diff --git a/mm/filemap.c b/mm/filemap.c
> index 9ea46a4..e086ef0 100644
> --- a/mm/filemap.c
> +++ b/mm/filemap.c
> @@ -2309,25 +2309,44 @@ EXPORT_SYMBOL(generic_file_direct_write);
>   * Find or create a page at the given pagecache position. Return the locked
>   * page. This function is specifically for buffered writes.
>   */
> -struct page *grab_cache_page_write_begin(struct address_space *mapping,
> -					pgoff_t index, unsigned flags)
> +struct page *__grab_cache_page_write_begin(struct address_space *mapping,
> +		pgoff_t index, unsigned flags)
>  {
>  	int status;
>  	gfp_t gfp_mask;
>  	struct page *page;
>  	gfp_t gfp_notmask = 0;
> +	bool thp = (flags & AOP_FLAG_TRANSHUGE) &&
> +		IS_ENABLED(CONFIG_TRANSPARENT_HUGEPAGE_PAGECACHE);

Instead of 'thp', how about 'must_use_thp'?  The flag seems to be a
pretty strong edict rather than a hint, so it should be reflected in the
variables derived from it.

"IS_ENABLED(CONFIG_TRANSPARENT_HUGEPAGE_PAGECACHE)" has also popped up
enough times in the code that it's probably time to start thinking about
shortening it up.  It's a wee bit verbose.
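
E.g. just the rename (the expression itself is unchanged from the
patch):

	bool must_use_thp = (flags & AOP_FLAG_TRANSHUGE) &&
		IS_ENABLED(CONFIG_TRANSPARENT_HUGEPAGE_PAGECACHE);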

>  	gfp_mask = mapping_gfp_mask(mapping);
>  	if (mapping_cap_account_dirty(mapping))
>  		gfp_mask |= __GFP_WRITE;
>  	if (flags & AOP_FLAG_NOFS)
>  		gfp_notmask = __GFP_FS;
> +	if (thp) {
> +		BUG_ON(index & HPAGE_CACHE_INDEX_MASK);
> +		BUG_ON(!(gfp_mask & __GFP_COMP));
> +	}
>  repeat:
>  	page = find_lock_page(mapping, index);
> -	if (page)
> +	if (page) {
> +		if (thp && !PageTransHuge(page)) {
> +			unlock_page(page);
> +			page_cache_release(page);
> +			return NULL;
> +		}
>  		goto found;
> +	}
>  
> -	page = __page_cache_alloc(gfp_mask & ~gfp_notmask);
> +	if (thp) {
> +		page = alloc_pages(gfp_mask & ~gfp_notmask, HPAGE_PMD_ORDER);
> +		if (page)
> +			count_vm_event(THP_WRITE_ALLOC);
> +		else
> +			count_vm_event(THP_WRITE_ALLOC_FAILED);
> +	} else
> +		page = __page_cache_alloc(gfp_mask & ~gfp_notmask);
>  	if (!page)
>  		return NULL;
>  	status = add_to_page_cache_lru(page, mapping, index,
> @@ -2342,7 +2361,7 @@ found:
>  	wait_for_stable_page(page);
>  	return page;
>  }
> -EXPORT_SYMBOL(grab_cache_page_write_begin);
> +EXPORT_SYMBOL(__grab_cache_page_write_begin);
>  
>  static ssize_t generic_perform_write(struct file *file,
>  				struct iov_iter *i, loff_t pos)
> 


^ permalink raw reply	[flat|nested] 243+ messages in thread

* Re: [PATCHv4 20/39] thp, mm: naive support of thp in generic read/write routines
  2013-05-12  1:23   ` Kirill A. Shutemov
@ 2013-05-21 21:28     ` Dave Hansen
  -1 siblings, 0 replies; 243+ messages in thread
From: Dave Hansen @ 2013-05-21 21:28 UTC (permalink / raw)
  To: Kirill A. Shutemov
  Cc: Andrea Arcangeli, Andrew Morton, Al Viro, Hugh Dickins,
	Wu Fengguang, Jan Kara, Mel Gorman, linux-mm, Andi Kleen,
	Matthew Wilcox, Kirill A. Shutemov, Hillf Danton, linux-fsdevel,
	linux-kernel

On 05/11/2013 06:23 PM, Kirill A. Shutemov wrote:
> +		if (PageTransHuge(page))
> +			offset = pos & ~HPAGE_PMD_MASK;
> +
>  		pagefault_disable();
> -		copied = iov_iter_copy_from_user_atomic(page, i, offset, bytes);
> +		copied = iov_iter_copy_from_user_atomic(
> +				page + (offset >> PAGE_CACHE_SHIFT),
> +				i, offset & ~PAGE_CACHE_MASK, bytes);
>  		pagefault_enable();
>  		flush_dcache_page(page);

I think there's enough voodoo in there to warrant a comment or some
temporary variables.  There are three things going on that you want
to convey:

1. Offset is normally <PAGE_SIZE, but you make it <HPAGE_PMD_SIZE if
   you are dealing with a huge page
2. (offset >> PAGE_CACHE_SHIFT) is always 0 for small pages since
    offset < PAGE_SIZE
3. "offset & ~PAGE_CACHE_MASK" does nothing for small-page offsets, but
   it turns a large-page offset back in to a small-page-offset.

I think you can do it with something like this:

	int subpage_nr = 0;
	off_t smallpage_offset = offset;
	if (PageTransHuge(page)) {
		/*
		 * Transform 'offset' into the offset within the huge
		 * page instead of within the PAGE_SIZE page.
		 */
		offset = pos & ~HPAGE_PMD_MASK;
		subpage_nr = (offset >> PAGE_CACHE_SHIFT);
	}
	
> +		copied = iov_iter_copy_from_user_atomic(
> +				page + subpage_nr,
> +				i, smallpage_offset, bytes);


> @@ -2437,6 +2453,7 @@ again:
>  			 * because not all segments in the iov can be copied at
>  			 * once without a pagefault.
>  			 */
> +			offset = pos & ~PAGE_CACHE_MASK;

Urg, and now it's *BACK* to a small-page offset?

This means that 'offset' has two _different_ meanings and it morphs
between them during the function a couple of times.  That seems very
error-prone to me.
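
One way to keep the two meanings apart would be to never reuse 'offset' at
all -- a sketch with invented names:

	unsigned long pg_offset = pos & ~PAGE_CACHE_MASK;  /* < PAGE_SIZE */
	unsigned long hp_offset = pos & ~HPAGE_PMD_MASK;   /* < HPAGE_PMD_SIZE */
	int subpage_nr = PageTransHuge(page) ?
				hp_offset >> PAGE_CACHE_SHIFT : 0;

	copied = iov_iter_copy_from_user_atomic(page + subpage_nr, i,
						pg_offset, bytes);

Then the small-page offset never changes meaning and the retry path
shouldn't need to recompute anything.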

>  			bytes = min_t(unsigned long, PAGE_CACHE_SIZE - offset,
>  						iov_iter_single_seg_count(i));
>  			goto again;
> 


^ permalink raw reply	[flat|nested] 243+ messages in thread

* Re: [PATCHv4 21/39] thp, libfs: initial support of thp in simple_read/write_begin/write_end
  2013-05-12  1:23   ` Kirill A. Shutemov
@ 2013-05-21 21:49     ` Dave Hansen
  -1 siblings, 0 replies; 243+ messages in thread
From: Dave Hansen @ 2013-05-21 21:49 UTC (permalink / raw)
  To: Kirill A. Shutemov
  Cc: Andrea Arcangeli, Andrew Morton, Al Viro, Hugh Dickins,
	Wu Fengguang, Jan Kara, Mel Gorman, linux-mm, Andi Kleen,
	Matthew Wilcox, Kirill A. Shutemov, Hillf Danton, linux-fsdevel,
	linux-kernel

On 05/11/2013 06:23 PM, Kirill A. Shutemov wrote:
> From: "Kirill A. Shutemov" <kirill.shutemov@linux.intel.com>
> 
> For now we try to grab a huge cache page if gfp_mask has __GFP_COMP.
> It's probably to weak condition and need to be reworked later.
> 
> Signed-off-by: Kirill A. Shutemov <kirill.shutemov@linux.intel.com>
> ---
>  fs/libfs.c              |   50 ++++++++++++++++++++++++++++++++++++-----------
>  include/linux/pagemap.h |    8 ++++++++
>  2 files changed, 47 insertions(+), 11 deletions(-)
> 
> diff --git a/fs/libfs.c b/fs/libfs.c
> index 916da8c..ce807fe 100644
> --- a/fs/libfs.c
> +++ b/fs/libfs.c
> @@ -383,7 +383,7 @@ EXPORT_SYMBOL(simple_setattr);
>  
>  int simple_readpage(struct file *file, struct page *page)
>  {
> -	clear_highpage(page);
> +	clear_pagecache_page(page);
>  	flush_dcache_page(page);
>  	SetPageUptodate(page);
>  	unlock_page(page);
> @@ -394,21 +394,44 @@ int simple_write_begin(struct file *file, struct address_space *mapping,
>  			loff_t pos, unsigned len, unsigned flags,
>  			struct page **pagep, void **fsdata)
>  {
> -	struct page *page;
> +	struct page *page = NULL;
>  	pgoff_t index;

I know ramfs uses simple_write_begin(), but it's not the only one.  I
think you probably want to create a new ->write_begin() function just
for ramfs rather than modifying this one.

The optimization that you just put in a few patches ago:

>> +static inline struct page *grab_cache_page_write_begin(
>> +{
>> +	if (!transparent_hugepage_pagecache() && (flags & AOP_FLAG_TRANSHUGE))
>> +		return NULL;
>> +	return __grab_cache_page_write_begin(mapping, index, flags);


is now worthless for any user of simple_readpage().
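
Something like this is the shape I have in mind -- only a skeleton, the
function is invented here, and the !PageUptodate zeroing the patch adds
would still need to move along with it:

	static int ramfs_write_begin(struct file *file,
			struct address_space *mapping,
			loff_t pos, unsigned len, unsigned flags,
			struct page **pagep, void **fsdata)
	{
		pgoff_t index = pos >> PAGE_CACHE_SHIFT;
		struct page *page = NULL;

		if (mapping_can_have_hugepages(mapping))
			page = grab_cache_page_write_begin(mapping,
					index & ~HPAGE_CACHE_INDEX_MASK,
					flags | AOP_FLAG_TRANSHUGE);
		if (page) {
			*pagep = page;
			return 0;
		}
		/* fall back to the untouched generic small-page path */
		return simple_write_begin(file, mapping, pos, len, flags,
				pagep, fsdata);
	}

That keeps simple_write_begin() exactly what it is today for everybody else.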

>  	index = pos >> PAGE_CACHE_SHIFT;
>  
> -	page = grab_cache_page_write_begin(mapping, index, flags);
> +	/* XXX: too weak condition? */

Why would it be too weak?

> +	if (mapping_can_have_hugepages(mapping)) {
> +		page = grab_cache_page_write_begin(mapping,
> +				index & ~HPAGE_CACHE_INDEX_MASK,
> +				flags | AOP_FLAG_TRANSHUGE);
> +		/* fallback to small page */
> +		if (!page) {
> +			unsigned long offset;
> +			offset = pos & ~PAGE_CACHE_MASK;
> +			len = min_t(unsigned long,
> +					len, PAGE_CACHE_SIZE - offset);
> +		}

Why does this have to muck with 'len'?  It doesn't appear to be undoing
anything from earlier in the function.  What is it fixing up?

> +		BUG_ON(page && !PageTransHuge(page));
> +	}

So, those semantics for AOP_FLAG_TRANSHUGE are actually pretty strong.
They mean that you can only return a transparent pagecache page, but you
better not return a small page.

Would it have been possible for a huge page to get returned from
grab_cache_page_write_begin(), but then have been split up between there
and the BUG_ON()?

Which reminds me... under what circumstances _do_ we split these huge
pages?  How are those circumstances different from the anonymous ones?

> +	if (!page)
> +		page = grab_cache_page_write_begin(mapping, index, flags);
>  	if (!page)
>  		return -ENOMEM;
> -
>  	*pagep = page;
>  
> -	if (!PageUptodate(page) && (len != PAGE_CACHE_SIZE)) {
> -		unsigned from = pos & (PAGE_CACHE_SIZE - 1);
> -
> -		zero_user_segments(page, 0, from, from + len, PAGE_CACHE_SIZE);
> +	if (!PageUptodate(page)) {
> +		unsigned from;
> +
> +		if (PageTransHuge(page) && len != HPAGE_PMD_SIZE) {
> +			from = pos & ~HPAGE_PMD_MASK;
> +			zero_huge_user_segment(page, 0, from);
> +			zero_huge_user_segment(page,
> +					from + len, HPAGE_PMD_SIZE);
> +		} else if (len != PAGE_CACHE_SIZE) {
> +			from = pos & ~PAGE_CACHE_MASK;
> +			zero_user_segments(page, 0, from,
> +					from + len, PAGE_CACHE_SIZE);
> +		}
>  	}
>  	return 0;
>  }
> @@ -443,9 +466,14 @@ int simple_write_end(struct file *file, struct address_space *mapping,
>  
>  	/* zero the stale part of the page if we did a short copy */
>  	if (copied < len) {
> -		unsigned from = pos & (PAGE_CACHE_SIZE - 1);
> -
> -		zero_user(page, from + copied, len - copied);
> +		unsigned from;
> +		if (PageTransHuge(page)) {
> +			from = pos & ~HPAGE_PMD_MASK;
> +			zero_huge_user(page, from + copied, len - copied);
> +		} else {
> +			from = pos & ~PAGE_CACHE_MASK;
> +			zero_user(page, from + copied, len - copied);
> +		}
>  	}

When I see stuff going into the simple_* functions, I fear that this
code will end up getting copied into each and every one of the
filesystems that implement these on their own.

I guess this works for now, but I'm worried that the next fs is just
going to copy-and-paste these.  Guess I'll yell at them when they do it. :)


^ permalink raw reply	[flat|nested] 243+ messages in thread

* Re: [PATCHv4 23/39] thp: wait_split_huge_page(): serialize over i_mmap_mutex too
  2013-05-12  1:23   ` Kirill A. Shutemov
@ 2013-05-21 22:05     ` Dave Hansen
  -1 siblings, 0 replies; 243+ messages in thread
From: Dave Hansen @ 2013-05-21 22:05 UTC (permalink / raw)
  To: Kirill A. Shutemov
  Cc: Andrea Arcangeli, Andrew Morton, Al Viro, Hugh Dickins,
	Wu Fengguang, Jan Kara, Mel Gorman, linux-mm, Andi Kleen,
	Matthew Wilcox, Kirill A. Shutemov, Hillf Danton, linux-fsdevel,
	linux-kernel

On 05/11/2013 06:23 PM, Kirill A. Shutemov wrote:
> From: "Kirill A. Shutemov" <kirill.shutemov@linux.intel.com>
> 
> Since we're going to have huge pages backed by files,
> wait_split_huge_page() has to serialize not only over anon_vma_lock,
> but over i_mmap_mutex too.
...
> -#define wait_split_huge_page(__anon_vma, __pmd)				\
> +#define wait_split_huge_page(__vma, __pmd)				\
>  	do {								\
>  		pmd_t *____pmd = (__pmd);				\
> -		anon_vma_lock_write(__anon_vma);			\
> -		anon_vma_unlock_write(__anon_vma);			\
> +		struct address_space *__mapping =			\
> +					vma->vm_file->f_mapping;	\
> +		struct anon_vma *__anon_vma = (__vma)->anon_vma;	\
> +		if (__mapping)						\
> +			mutex_lock(&__mapping->i_mmap_mutex);		\
> +		if (__anon_vma) {					\
> +			anon_vma_lock_write(__anon_vma);		\
> +			anon_vma_unlock_write(__anon_vma);		\
> +		}							\
> +		if (__mapping)						\
> +			mutex_unlock(&__mapping->i_mmap_mutex);		\
>  		BUG_ON(pmd_trans_splitting(*____pmd) ||			\
>  		       pmd_trans_huge(*____pmd));			\
>  	} while (0)

Kirill, I asked about this patch in the previous series, and you wrote
some very nice, detailed answers to my stupid questions.  But, you
didn't add any comments or update the patch description.  So, if a
reviewer or anybody looking at the changelog in the future has my same
stupid questions, they're unlikely to find the very nice description
that you wrote up.

I'd highly suggest that you go back through the comments you've received
before and make sure that you both answered the questions *and* covered
them either in the code or in the patch descriptions.

Could you also describe the lengths to which you've gone to try to keep
this macro from growing into any larger of an abomination?  Is it truly
_impossible_ to turn this into a normal function?  Or is it simply more
work than you can do right now?  What would it take?
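
For reference, roughly what the non-macro form would look like -- just a
sketch; whether all of these types are visible where huge_mm.h gets
included is exactly the question (note it also grows a vm_file NULL check,
which the macro as posted seems to want anyway):

	static inline void wait_split_huge_page(struct vm_area_struct *vma,
						pmd_t *pmd)
	{
		struct address_space *mapping = vma->vm_file ?
				vma->vm_file->f_mapping : NULL;

		if (mapping)
			mutex_lock(&mapping->i_mmap_mutex);
		if (vma->anon_vma) {
			anon_vma_lock_write(vma->anon_vma);
			anon_vma_unlock_write(vma->anon_vma);
		}
		if (mapping)
			mutex_unlock(&mapping->i_mmap_mutex);
		BUG_ON(pmd_trans_splitting(*pmd) || pmd_trans_huge(*pmd));
	}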

^ permalink raw reply	[flat|nested] 243+ messages in thread

* Re: [PATCHv4 24/39] thp, mm: truncate support for transparent huge page cache
  2013-05-12  1:23   ` Kirill A. Shutemov
@ 2013-05-21 22:39     ` Dave Hansen
  -1 siblings, 0 replies; 243+ messages in thread
From: Dave Hansen @ 2013-05-21 22:39 UTC (permalink / raw)
  To: Kirill A. Shutemov
  Cc: Andrea Arcangeli, Andrew Morton, Al Viro, Hugh Dickins,
	Wu Fengguang, Jan Kara, Mel Gorman, linux-mm, Andi Kleen,
	Matthew Wilcox, Kirill A. Shutemov, Hillf Danton, linux-fsdevel,
	linux-kernel

On 05/11/2013 06:23 PM, Kirill A. Shutemov wrote:
> If we starting position of truncation is in tail page we have to spilit
> the huge page page first.

That's a very interesting sentence sentence. :)

> We also have to split if end is within the huge page. Otherwise we can
> truncate whole huge page at once.

How about something more like this as a description?

Splitting a huge page is relatively expensive.  If at all possible, we
would like to do truncation without first splitting a page.  However, if
the truncation request starts or ends in the middle of a huge page, we
have no choice and must split it.

> Signed-off-by: Kirill A. Shutemov <kirill.shutemov@linux.intel.com>
> ---
>  mm/truncate.c |   13 +++++++++++++
>  1 file changed, 13 insertions(+)
> 
> diff --git a/mm/truncate.c b/mm/truncate.c
> index c75b736..0152feb 100644
> --- a/mm/truncate.c
> +++ b/mm/truncate.c
> @@ -231,6 +231,17 @@ void truncate_inode_pages_range(struct address_space *mapping,
>  			if (index > end)
>  				break;
>  
> +			/* split page if we start from tail page */
> +			if (PageTransTail(page))
> +				split_huge_page(compound_trans_head(page));

I know it makes no logical difference, but should this be an "else if"?
It would make it clearer to me that PageTransTail() and
PageTransHuge() are mutually exclusive.

> +			if (PageTransHuge(page)) {
> +				/* split if end is within huge page */
> +				if (index == (end & ~HPAGE_CACHE_INDEX_MASK))

How about:

	if ((end - index) > HPAGE_CACHE_NR)

That seems a bit more straightforward, to me at least.

> +					split_huge_page(page);
> +				else
> +					/* skip tail pages */
> +					i += HPAGE_CACHE_NR - 1;
> +			}


Hmm..  This is all inside a loop, right?

                for (i = 0; i < pagevec_count(&pvec); i++) {
                        struct page *page = pvec.pages[i];

PAGEVEC_SIZE is only 14 here, so it seems a bit odd to be incrementing i
by 512-1.  We'll break out of the pagevec loop, but won't 'index' be set
to the wrong thing on the next iteration of the loop?  Did you want to
be incrementing 'index' instead of 'i'?

This is also another case where I wonder about racing split_huge_page()
operations.

>  			if (!trylock_page(page))
>  				continue;
>  			WARN_ON(page->index != index);
> @@ -280,6 +291,8 @@ void truncate_inode_pages_range(struct address_space *mapping,
>  			if (index > end)
>  				break;
>  
> +			if (PageTransHuge(page))
> +				split_huge_page(page);
>  			lock_page(page);
>  			WARN_ON(page->index != index);
>  			wait_on_page_writeback(page);

This seems to imply that we would have taken care of the case where we
encountered a tail page in the first pass.  Should we put a comment in
to explain that assumption?
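
Something as small as this would do -- the wording is mine, so double-check
that it states the real invariant:

	/*
	 * The first pass already split any huge page that the start of
	 * the range pointed into the middle of, so here we do not expect
	 * to meet tail pages; a head page is simply split before locking.
	 */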


^ permalink raw reply	[flat|nested] 243+ messages in thread

* Re: [PATCHv4 26/39] ramfs: enable transparent huge page cache
  2013-05-12  1:23   ` Kirill A. Shutemov
@ 2013-05-21 22:43     ` Dave Hansen
  -1 siblings, 0 replies; 243+ messages in thread
From: Dave Hansen @ 2013-05-21 22:43 UTC (permalink / raw)
  To: Kirill A. Shutemov
  Cc: Andrea Arcangeli, Andrew Morton, Al Viro, Hugh Dickins,
	Wu Fengguang, Jan Kara, Mel Gorman, linux-mm, Andi Kleen,
	Matthew Wilcox, Kirill A. Shutemov, Hillf Danton, linux-fsdevel,
	linux-kernel

On 05/11/2013 06:23 PM, Kirill A. Shutemov wrote:
> From: "Kirill A. Shutemov" <kirill.shutemov@linux.intel.com>
> 
> ramfs is the most simple fs from page cache point of view. Let's start
> transparent huge page cache enabling here.
> 
> For now we allocate only non-movable huge page. ramfs pages cannot be
> moved yet.
> 
> Signed-off-by: Kirill A. Shutemov <kirill.shutemov@linux.intel.com>
> ---
>  fs/ramfs/inode.c |    6 +++++-
>  1 file changed, 5 insertions(+), 1 deletion(-)
> 
> diff --git a/fs/ramfs/inode.c b/fs/ramfs/inode.c
> index c24f1e1..54d69c7 100644
> --- a/fs/ramfs/inode.c
> +++ b/fs/ramfs/inode.c
> @@ -61,7 +61,11 @@ struct inode *ramfs_get_inode(struct super_block *sb,
>  		inode_init_owner(inode, dir, mode);
>  		inode->i_mapping->a_ops = &ramfs_aops;
>  		inode->i_mapping->backing_dev_info = &ramfs_backing_dev_info;
> -		mapping_set_gfp_mask(inode->i_mapping, GFP_HIGHUSER);
> +		/*
> +		 * TODO: make ramfs pages movable
> +		 */
> +		mapping_set_gfp_mask(inode->i_mapping,
> +				GFP_TRANSHUGE & ~__GFP_MOVABLE);

So, before these patches, ramfs was movable.  Now, even on architectures
or configurations that have no chance of using THP-pagecache, ramfs
pages are no longer movable.  Right?

That seems unfortunate, and probably not something we want to
intentionally merge in this state.

Worst-case, we should at least make sure the pages remain movable in
configurations where THP-pagecache is unavailable.
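
Concretely, the worst-case fix could be as small as this (a sketch, reusing
the symbols already in the patch):

	gfp_t gfp = GFP_HIGHUSER;

	if (IS_ENABLED(CONFIG_TRANSPARENT_HUGEPAGE_PAGECACHE))
		/* TODO: make ramfs pages movable, then drop this */
		gfp = GFP_TRANSHUGE & ~__GFP_MOVABLE;
	mapping_set_gfp_mask(inode->i_mapping, gfp);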

^ permalink raw reply	[flat|nested] 243+ messages in thread

* Re: [PATCHv4 27/39] x86-64, mm: proper alignment mappings with hugepages
  2013-05-12  1:23   ` Kirill A. Shutemov
@ 2013-05-21 22:56     ` Dave Hansen
  -1 siblings, 0 replies; 243+ messages in thread
From: Dave Hansen @ 2013-05-21 22:56 UTC (permalink / raw)
  To: Kirill A. Shutemov
  Cc: Andrea Arcangeli, Andrew Morton, Al Viro, Hugh Dickins,
	Wu Fengguang, Jan Kara, Mel Gorman, linux-mm, Andi Kleen,
	Matthew Wilcox, Kirill A. Shutemov, Hillf Danton, linux-fsdevel,
	linux-kernel

On 05/11/2013 06:23 PM, Kirill A. Shutemov wrote:
> +static inline unsigned long mapping_align_mask(struct address_space *mapping)
> +{
> +	if (mapping_can_have_hugepages(mapping))
> +		return PAGE_MASK & ~HPAGE_MASK;
> +	return get_align_mask();
> +}

get_align_mask() appears to be a bit more complicated to me than just a
plain old mask.  Are you sure you don't need to pick up any of its
behavior for the mapping_can_have_hugepages() case?

^ permalink raw reply	[flat|nested] 243+ messages in thread

* Re: [PATCHv4 27/39] x86-64, mm: proper alignment mappings with hugepages
  2013-05-12  1:23   ` Kirill A. Shutemov
@ 2013-05-21 23:20     ` Dave Hansen
  -1 siblings, 0 replies; 243+ messages in thread
From: Dave Hansen @ 2013-05-21 23:20 UTC (permalink / raw)
  To: Kirill A. Shutemov
  Cc: Andrea Arcangeli, Andrew Morton, Al Viro, Hugh Dickins,
	Wu Fengguang, Jan Kara, Mel Gorman, linux-mm, Andi Kleen,
	Matthew Wilcox, Kirill A. Shutemov, Hillf Danton, linux-fsdevel,
	linux-kernel

On 05/11/2013 06:23 PM, Kirill A. Shutemov wrote:
> From: "Kirill A. Shutemov" <kirill.shutemov@linux.intel.com>
> 
> Make arch_get_unmapped_area() return unmapped area aligned to HPAGE_MASK
> if the file mapping can have huge pages.

OK, so there are at least four phases of this patch set which are
distinct to me.

1. Prep work that can go upstream now
2. Making the page cache able to hold compound pages
3. Making thp-cache work with ramfs
4. Making mmap() work with thp-cache

(1) needs to go upstream now.

(2) and (3) are related and should go upstream together.  There should
be enough performance benefits from this alone to let them get merged.

(4) has a lot of the code complexity, and is certainly required...
eventually.  I think you should stop for the _moment_ posting things in
this category and wait until you get the other stuff merged.  Go ahead
and keep it in your git tree for toying around with, but don't try to
get it merged until parts 1-3 are in.

^ permalink raw reply	[flat|nested] 243+ messages in thread

* Re: [PATCHv4 29/39] thp: move maybe_pmd_mkwrite() out of mk_huge_pmd()
  2013-05-12  1:23   ` Kirill A. Shutemov
@ 2013-05-21 23:23     ` Dave Hansen
  -1 siblings, 0 replies; 243+ messages in thread
From: Dave Hansen @ 2013-05-21 23:23 UTC (permalink / raw)
  To: Kirill A. Shutemov
  Cc: Andrea Arcangeli, Andrew Morton, Al Viro, Hugh Dickins,
	Wu Fengguang, Jan Kara, Mel Gorman, linux-mm, Andi Kleen,
	Matthew Wilcox, Kirill A. Shutemov, Hillf Danton, linux-fsdevel,
	linux-kernel

On 05/11/2013 06:23 PM, Kirill A. Shutemov wrote:
> From: "Kirill A. Shutemov" <kirill.shutemov@linux.intel.com>
> 
> It's confusing that mk_huge_pmd() has sematics different from mk_pte()
> or mk_pmd().
> 
> Let's move maybe_pmd_mkwrite() out of mk_huge_pmd() and adjust
> prototype to match mk_pte().

Was there a motivation to do this beyond adding consistency?  Do you use
this later or something?

> @@ -746,7 +745,8 @@ static int __do_huge_pmd_anonymous_page(struct mm_struct *mm,
>  		pte_free(mm, pgtable);
>  	} else {
>  		pmd_t entry;
> -		entry = mk_huge_pmd(page, vma);
> +		entry = mk_huge_pmd(page, vma->vm_page_prot);
> +		entry = maybe_pmd_mkwrite(pmd_mkdirty(entry), vma);
>  		page_add_new_anon_rmap(page, vma, haddr);
>  		set_pmd_at(mm, haddr, pmd, entry);

I'm not the biggest fan since this does add lines of code, but I do
appreciate the consistency it adds, so:

Acked-by: Dave Hansen <dave.hansen@linux.intel.com>

^ permalink raw reply	[flat|nested] 243+ messages in thread

* Re: [PATCHv4 29/39] thp: move maybe_pmd_mkwrite() out of mk_huge_pmd()
  2013-05-12  1:23   ` Kirill A. Shutemov
@ 2013-05-21 23:23     ` Dave Hansen
  -1 siblings, 0 replies; 243+ messages in thread
From: Dave Hansen @ 2013-05-21 23:23 UTC (permalink / raw)
  To: Kirill A. Shutemov
  Cc: Andrea Arcangeli, Andrew Morton, Al Viro, Hugh Dickins,
	Wu Fengguang, Jan Kara, Mel Gorman, linux-mm, Andi Kleen,
	Matthew Wilcox, Kirill A. Shutemov, Hillf Danton, linux-fsdevel,
	linux-kernel

On 05/11/2013 06:23 PM, Kirill A. Shutemov wrote:
> From: "Kirill A. Shutemov" <kirill.shutemov@linux.intel.com>
> 
> It's confusing that mk_huge_pmd() has sematics different from mk_pte()
> or mk_pmd().
> 
> Let's move maybe_pmd_mkwrite() out of mk_huge_pmd() and adjust
> prototype to match mk_pte().

Oh, and please stick this in your queue of stuff to go upstream, first.


^ permalink raw reply	[flat|nested] 243+ messages in thread

* Re: [PATCHv4 31/39] thp: consolidate code between handle_mm_fault() and do_huge_pmd_anonymous_page()
  2013-05-12  1:23   ` Kirill A. Shutemov
@ 2013-05-21 23:38     ` Dave Hansen
  -1 siblings, 0 replies; 243+ messages in thread
From: Dave Hansen @ 2013-05-21 23:38 UTC (permalink / raw)
  To: Kirill A. Shutemov
  Cc: Andrea Arcangeli, Andrew Morton, Al Viro, Hugh Dickins,
	Wu Fengguang, Jan Kara, Mel Gorman, linux-mm, Andi Kleen,
	Matthew Wilcox, Kirill A. Shutemov, Hillf Danton, linux-fsdevel,
	linux-kernel

On 05/11/2013 06:23 PM, Kirill A. Shutemov wrote:
> From: "Kirill A. Shutemov" <kirill.shutemov@linux.intel.com>
> 
> do_huge_pmd_anonymous_page() has copy-pasted piece of handle_mm_fault()
> to handle fallback path.
> 
> Let's consolidate code back by introducing VM_FAULT_FALLBACK return
> code.
> 
> Signed-off-by: Kirill A. Shutemov <kirill.shutemov@linux.intel.com>
> ---
>  include/linux/huge_mm.h |    3 ---
>  include/linux/mm.h      |    3 ++-
>  mm/huge_memory.c        |   31 +++++--------------------------
>  mm/memory.c             |    9 ++++++---
>  4 files changed, 13 insertions(+), 33 deletions

Wow, nice diffstat!

This and the previous patch can go in the cleanups pile, no?

> @@ -3788,9 +3788,12 @@ retry:
>  	if (!pmd)
>  		return VM_FAULT_OOM;
>  	if (pmd_none(*pmd) && transparent_hugepage_enabled(vma)) {
> +		int ret = 0;
>  		if (!vma->vm_ops)
> -			return do_huge_pmd_anonymous_page(mm, vma, address,
> -							  pmd, flags);
> +			ret = do_huge_pmd_anonymous_page(mm, vma, address,
> +					pmd, flags);
> +		if ((ret & VM_FAULT_FALLBACK) == 0)
> +			return ret;

This could use a small comment about where the code flow is going, when
and why.  FWIW, I vastly prefer the '!' form in these:

	if (!(ret & VM_FAULT_FALLBACK))
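
E.g. something like this (comment wording is just a suggestion):

	if (!vma->vm_ops)
		ret = do_huge_pmd_anonymous_page(mm, vma, address,
				pmd, flags);
	if (!(ret & VM_FAULT_FALLBACK))
		return ret;
	/*
	 * The huge page handler wants us to retry with small pages,
	 * so keep going down the normal pte path below.
	 */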

^ permalink raw reply	[flat|nested] 243+ messages in thread

* Re: [PATCHv4 32/39] mm: cleanup __do_fault() implementation
  2013-05-12  1:23   ` Kirill A. Shutemov
@ 2013-05-21 23:57     ` Dave Hansen
  -1 siblings, 0 replies; 243+ messages in thread
From: Dave Hansen @ 2013-05-21 23:57 UTC (permalink / raw)
  To: Kirill A. Shutemov
  Cc: Andrea Arcangeli, Andrew Morton, Al Viro, Hugh Dickins,
	Wu Fengguang, Jan Kara, Mel Gorman, linux-mm, Andi Kleen,
	Matthew Wilcox, Kirill A. Shutemov, Hillf Danton, linux-fsdevel,
	linux-kernel

On 05/11/2013 06:23 PM, Kirill A. Shutemov wrote:
> From: "Kirill A. Shutemov" <kirill.shutemov@linux.intel.com>
> 
> Let's cleanup __do_fault() to prepare it for transparent huge pages
> support injection.
> 
> Cleanups:
>  - int -> bool where appropriate;
>  - unindent some code by reverting 'if' condition;
>  - extract !pte_same() path to get it clear;
>  - separate pte update from mm stats update;
>  - some comments reformated;

I've scanned through the rest of these patches.  They look OK, and I
don't have _too_ much to say.  They definitely need some closer review,
but I think you should concentrate your attention on the stuff _before_
this point in the series.


^ permalink raw reply	[flat|nested] 243+ messages in thread

* Re: [PATCHv4 31/39] thp: consolidate code between handle_mm_fault() and do_huge_pmd_anonymous_page()
  2013-05-12  1:23   ` Kirill A. Shutemov
@ 2013-05-22  6:51     ` Hillf Danton
  -1 siblings, 0 replies; 243+ messages in thread
From: Hillf Danton @ 2013-05-22  6:51 UTC (permalink / raw)
  To: Kirill A. Shutemov
  Cc: Andrea Arcangeli, Andrew Morton, Al Viro, Hugh Dickins,
	Wu Fengguang, Jan Kara, Mel Gorman, linux-mm, Andi Kleen,
	Matthew Wilcox, Kirill A. Shutemov, Dave Hansen, linux-fsdevel,
	linux-kernel

On Sun, May 12, 2013 at 9:23 AM, Kirill A. Shutemov
<kirill.shutemov@linux.intel.com> wrote:
> From: "Kirill A. Shutemov" <kirill.shutemov@linux.intel.com>
>
> do_huge_pmd_anonymous_page() has copy-pasted piece of handle_mm_fault()
> to handle fallback path.
>
> Let's consolidate code back by introducing VM_FAULT_FALLBACK return
> code.
>
> Signed-off-by: Kirill A. Shutemov <kirill.shutemov@linux.intel.com>
> ---

Acked-by: Hillf Danton <dhillf@gmail.com>

^ permalink raw reply	[flat|nested] 243+ messages in thread

* Re: [PATCHv4 28/39] thp: prepare zap_huge_pmd() to uncharge file pages
  2013-05-12  1:23   ` Kirill A. Shutemov
@ 2013-05-22  7:26     ` Hillf Danton
  -1 siblings, 0 replies; 243+ messages in thread
From: Hillf Danton @ 2013-05-22  7:26 UTC (permalink / raw)
  To: Kirill A. Shutemov
  Cc: Andrea Arcangeli, Andrew Morton, Al Viro, Hugh Dickins,
	Wu Fengguang, Jan Kara, Mel Gorman, linux-mm, Andi Kleen,
	Matthew Wilcox, Kirill A. Shutemov, Dave Hansen, linux-fsdevel,
	linux-kernel

On Sun, May 12, 2013 at 9:23 AM, Kirill A. Shutemov
<kirill.shutemov@linux.intel.com> wrote:
> From: "Kirill A. Shutemov" <kirill.shutemov@linux.intel.com>
>
> Uncharge pages from correct counter.
>
> Signed-off-by: Kirill A. Shutemov <kirill.shutemov@linux.intel.com>
> ---
>  mm/huge_memory.c |    4 +++-
>  1 file changed, 3 insertions(+), 1 deletion(-)
>
> diff --git a/mm/huge_memory.c b/mm/huge_memory.c
> index 7ad458d..a88f9b2 100644
> --- a/mm/huge_memory.c
> +++ b/mm/huge_memory.c
> @@ -1385,10 +1385,12 @@ int zap_huge_pmd(struct mmu_gather *tlb, struct vm_area_struct *vma,
>                         spin_unlock(&tlb->mm->page_table_lock);
>                         put_huge_zero_page();
>                 } else {
> +                       int member;
>                         page = pmd_page(orig_pmd);

Better _if_ member is determined before we touch rmap, conceptually?
(See the sketch after the quoted hunk.)

>                         page_remove_rmap(page);
>                         VM_BUG_ON(page_mapcount(page) < 0);
> -                       add_mm_counter(tlb->mm, MM_ANONPAGES, -HPAGE_PMD_NR);
> +                       member = PageAnon(page) ? MM_ANONPAGES : MM_FILEPAGES;
> +                       add_mm_counter(tlb->mm, member, -HPAGE_PMD_NR);
>                         VM_BUG_ON(!PageHead(page));
>                         tlb->mm->nr_ptes--;
>                         spin_unlock(&tlb->mm->page_table_lock);
> --
> 1.7.10.4
>
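
A sketch of the reordering Hillf suggests above, rearranging the quoted
hunk for illustration (not a tested change):

        page = pmd_page(orig_pmd);
        /* read PageAnon() before the rmap accounting is torn down */
        member = PageAnon(page) ? MM_ANONPAGES : MM_FILEPAGES;
        page_remove_rmap(page);
        VM_BUG_ON(page_mapcount(page) < 0);
        add_mm_counter(tlb->mm, member, -HPAGE_PMD_NR);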

^ permalink raw reply	[flat|nested] 243+ messages in thread

* Re: [PATCHv4 02/39] block: implement add_bdi_stat()
  2013-05-21 18:25     ` Dave Hansen
@ 2013-05-22 11:06       ` Kirill A. Shutemov
  -1 siblings, 0 replies; 243+ messages in thread
From: Kirill A. Shutemov @ 2013-05-22 11:06 UTC (permalink / raw)
  To: Dave Hansen
  Cc: Kirill A. Shutemov, Andrea Arcangeli, Andrew Morton, Al Viro,
	Hugh Dickins, Wu Fengguang, Jan Kara, Mel Gorman, linux-mm,
	Andi Kleen, Matthew Wilcox, Kirill A. Shutemov, Hillf Danton,
	linux-fsdevel, linux-kernel

Dave Hansen wrote:
> On 05/11/2013 06:22 PM, Kirill A. Shutemov wrote:
> > From: "Kirill A. Shutemov" <kirill.shutemov@linux.intel.com>
> > 
> > We're going to add/remove a number of page cache entries at once. This
> > patch implements add_bdi_stat() which adjusts bdi stats by arbitrary
> > amount. It's required for batched page cache manipulations.
> 
> Add, but no dec?

'sub', I guess, not 'dec'. For that we use add_bdi_stat(m, item, -nr).
It's consistent with __add_bdi_stat() usage.
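
For reference, a minimal sketch of such a helper, assuming it mirrors
the existing inc_bdi_stat()/__inc_bdi_stat() wrappers in
include/linux/backing-dev.h (a sketch, not necessarily the exact patch):

        static inline void add_bdi_stat(struct backing_dev_info *bdi,
                        enum bdi_stat_item item, s64 amount)
        {
                unsigned long flags;

                local_irq_save(flags);
                __add_bdi_stat(bdi, item, amount);
                local_irq_restore(flags);
        }

A negative 'amount' does the subtraction, so no separate sub_bdi_stat()
is needed.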

> I'd also move this closer to where it gets used in the series.

Okay.

-- 
 Kirill A. Shutemov

^ permalink raw reply	[flat|nested] 243+ messages in thread

* Re: [PATCHv4 08/39] thp: compile-time and sysfs knob for thp pagecache
  2013-05-12  1:23   ` Kirill A. Shutemov
@ 2013-05-22 11:19     ` Hillf Danton
  -1 siblings, 0 replies; 243+ messages in thread
From: Hillf Danton @ 2013-05-22 11:19 UTC (permalink / raw)
  To: Kirill A. Shutemov
  Cc: Andrea Arcangeli, Andrew Morton, Al Viro, Hugh Dickins,
	Wu Fengguang, Jan Kara, Mel Gorman, linux-mm, Andi Kleen,
	Matthew Wilcox, Kirill A. Shutemov, Dave Hansen, linux-fsdevel,
	linux-kernel

On Sun, May 12, 2013 at 9:23 AM, Kirill A. Shutemov
<kirill.shutemov@linux.intel.com> wrote:
> From: "Kirill A. Shutemov" <kirill.shutemov@linux.intel.com>
>
> For now, TRANSPARENT_HUGEPAGE_PAGECACHE is only implemented for X86_64.
>
How about THPC, TRANSPARENT_HUGEPAGE_CACHE?

^ permalink raw reply	[flat|nested] 243+ messages in thread

* Re: [PATCHv4 34/39] thp, mm: handle huge pages in filemap_fault()
  2013-05-12  1:23   ` Kirill A. Shutemov
@ 2013-05-22 11:37     ` Hillf Danton
  -1 siblings, 0 replies; 243+ messages in thread
From: Hillf Danton @ 2013-05-22 11:37 UTC (permalink / raw)
  To: Kirill A. Shutemov
  Cc: Andrea Arcangeli, Andrew Morton, Al Viro, Hugh Dickins,
	Wu Fengguang, Jan Kara, Mel Gorman, linux-mm, Andi Kleen,
	Matthew Wilcox, Kirill A. Shutemov, Dave Hansen, linux-fsdevel,
	linux-kernel

On Sun, May 12, 2013 at 9:23 AM, Kirill A. Shutemov
<kirill.shutemov@linux.intel.com> wrote:
> From: "Kirill A. Shutemov" <kirill.shutemov@linux.intel.com>
>
> If caller asks for huge page (flags & FAULT_FLAG_TRANSHUGE),
> filemap_fault() returns it if there's already a huge page at the offset.
>
> If the area of page cache required to create a huge page is empty, we create a
> new huge page and return it.
>
> Otherwise we return VM_FAULT_FALLBACK to indicate that fallback to small
> pages is required.
>
s/small/regular/g ?
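
To make the three-way contract easier to see, a hedged sketch from the
caller's side (illustrative only; FAULT_FLAG_TRANSHUGE and
VM_FAULT_FALLBACK are introduced earlier in the series):

        /*
         * Outcomes of ->fault() with FAULT_FLAG_TRANSHUGE set, per the
         * changelog above:
         *  - a huge page already covers the PMD-aligned offset: return it;
         *  - the PMD-aligned range of page cache is empty: allocate a new
         *    huge page, add it to the cache and return it;
         *  - otherwise: VM_FAULT_FALLBACK, and the caller retries with
         *    regular pages, e.g.:
         */
        ret = vma->vm_ops->fault(vma, &vmf);
        if (ret & VM_FAULT_FALLBACK) {
                flags &= ~FAULT_FLAG_TRANSHUGE;
                /* ... redo the fault with regular pages ... */
        }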

^ permalink raw reply	[flat|nested] 243+ messages in thread

* Re: [PATCHv4 30/39] thp: do_huge_pmd_anonymous_page() cleanup
  2013-05-12  1:23   ` Kirill A. Shutemov
@ 2013-05-22 11:45     ` Hillf Danton
  -1 siblings, 0 replies; 243+ messages in thread
From: Hillf Danton @ 2013-05-22 11:45 UTC (permalink / raw)
  To: Kirill A. Shutemov
  Cc: Andrea Arcangeli, Andrew Morton, Al Viro, Hugh Dickins,
	Wu Fengguang, Jan Kara, Mel Gorman, linux-mm, Andi Kleen,
	Matthew Wilcox, Kirill A. Shutemov, Dave Hansen, linux-fsdevel,
	linux-kernel

On Sun, May 12, 2013 at 9:23 AM, Kirill A. Shutemov
<kirill.shutemov@linux.intel.com> wrote:
> From: "Kirill A. Shutemov" <kirill.shutemov@linux.intel.com>
>
> Minor cleanup: unindent most code of the function by inverting one
> condition. It's preparation for the next patch.
>
> No functional changes.
>
> Signed-off-by: Kirill A. Shutemov <kirill.shutemov@linux.intel.com>
> ---
Acked-by: Hillf Danton <dhillf@gmail.com>

^ permalink raw reply	[flat|nested] 243+ messages in thread

* Re: [PATCHv4 04/39] radix-tree: implement preload for multiple contiguous elements
  2013-05-21 18:58     ` Dave Hansen
@ 2013-05-22 12:03       ` Kirill A. Shutemov
  -1 siblings, 0 replies; 243+ messages in thread
From: Kirill A. Shutemov @ 2013-05-22 12:03 UTC (permalink / raw)
  To: Dave Hansen
  Cc: Kirill A. Shutemov, Andrea Arcangeli, Andrew Morton, Al Viro,
	Hugh Dickins, Wu Fengguang, Jan Kara, Mel Gorman, linux-mm,
	Andi Kleen, Matthew Wilcox, Kirill A. Shutemov, Hillf Danton,
	linux-fsdevel, linux-kernel

Dave Hansen wrote:
> On 05/11/2013 06:23 PM, Kirill A. Shutemov wrote:
> > This patch introduces radix_tree_preload_count(). It allows
> > preallocating enough nodes to insert a number of *contiguous* elements.
> 
> Would radix_tree_preload_contig() be a better name, then?

Yes. Will rename.

> ...
> > On 64-bit system:
> > For RADIX_TREE_MAP_SHIFT=3, old array size is 43, new is 107.
> > For RADIX_TREE_MAP_SHIFT=4, old array size is 31, new is 63.
> > For RADIX_TREE_MAP_SHIFT=6, old array size is 21, new is 30.
> > 
> > On 32-bit system:
> > For RADIX_TREE_MAP_SHIFT=3, old array size is 21, new is 84.
> > For RADIX_TREE_MAP_SHIFT=4, old array size is 15, new is 46.
> > For RADIX_TREE_MAP_SHIFT=6, old array size is 11, new is 19.
> > 
> > On most machines we will have RADIX_TREE_MAP_SHIFT=6.
> 
> Thanks for adding that to the description.  The array you're talking
> about is just pointers, right?
> 
> 107-43 = 64.  So, we have 64 extra pointers * NR_CPUS, plus 64 extra
> radix tree nodes that we will keep around most of the time.  On x86_64,
> that's 512 bytes plus 64*560 bytes of nodes which is ~35k of memory per CPU.
> 
> That's not bad I guess, but I do bet it's something that some folks want
> to configure out.  Please make sure to call out the actual size cost in
> bytes per CPU in future patch postings, at least for the common case
> (64-bit non-CONFIG_BASE_SMALL).

I will add this to the commit message:

On most machines we will have RADIX_TREE_MAP_SHIFT=6. In this case,
on 64-bit system the per-CPU feature overhead is
 for preload array:
   (30 - 21) * sizeof(void*) = 72 bytes
 plus, if the preload array is full
   (30 - 21) * sizeof(struct radix_tree_node) = 9 * 560 = 5040 bytes
 total: 5112 bytes

on 32-bit system the per-CPU feature overhead is
 for preload array:
   (19 - 11) * sizeof(void*) = 32 bytes
 plus, if the preload array is full
   (19 - 11) * sizeof(struct radix_tree_node) = 8 * 296 = 2368 bytes
 total: 2400 bytes
---

Is it good enough?

I will probably add a !BASE_SMALL dependency to the
TRANSPARENT_HUGEPAGE_PAGECACHE config option.
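
As an illustration of where the preloaded nodes go, a sketch of the
intended usage (radix_tree_preload_count() is the interface under
discussion and may yet be renamed; error handling trimmed):

        /* preallocate enough nodes for HPAGE_PMD_NR contiguous entries */
        err = radix_tree_preload_count(HPAGE_PMD_NR, GFP_KERNEL);
        if (err)
                return err;
        /* ... then insert head + tail entries under a single tree_lock */
        spin_lock_irq(&mapping->tree_lock);
        for (i = 0; i < HPAGE_PMD_NR; i++)
                radix_tree_insert(&mapping->page_tree, index + i, page + i);
        spin_unlock_irq(&mapping->tree_lock);
        radix_tree_preload_end();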

> 
> > Since only THP uses batched preload at the , we disable (set max preload
> > to 1) it if !CONFIG_TRANSPARENT_HUGEPAGE_PAGECACHE. This can be changed
> > in the future.
> 
> "at the..."  Is there something missing in that sentence?

at the moment :)

> No major nits, so:
> 
> Acked-by: Dave Hansen <dave.hansen@linux.intel.com>

Thanks!

-- 
 Kirill A. Shutemov

^ permalink raw reply	[flat|nested] 243+ messages in thread

* Re: [PATCHv4 06/39] thp, mm: avoid PageUnevictable on active/inactive lru lists
  2013-05-21 19:17     ` Dave Hansen
@ 2013-05-22 12:34       ` Kirill A. Shutemov
  -1 siblings, 0 replies; 243+ messages in thread
From: Kirill A. Shutemov @ 2013-05-22 12:34 UTC (permalink / raw)
  To: Dave Hansen
  Cc: Kirill A. Shutemov, Andrea Arcangeli, Andrew Morton, Al Viro,
	Hugh Dickins, Wu Fengguang, Jan Kara, Mel Gorman, linux-mm,
	Andi Kleen, Matthew Wilcox, Kirill A. Shutemov, Hillf Danton,
	linux-fsdevel, linux-kernel

Dave Hansen wrote:
> On 05/11/2013 06:23 PM, Kirill A. Shutemov wrote:
> > From: "Kirill A. Shutemov" <kirill.shutemov@linux.intel.com>
> > 
> > active/inactive lru lists can contain unevictable pages (i.e. ramfs pages
> > that have been placed on the LRU lists when first allocated), but these
> > pages must not have PageUnevictable set - otherwise shrink_active_list
> > goes crazy:
> > 
> > kernel BUG at /home/space/kas/git/public/linux-next/mm/vmscan.c:1122!
> > invalid opcode: 0000 [#1] SMP
> > CPU 0
> > Pid: 293, comm: kswapd0 Not tainted 3.8.0-rc6-next-20130202+ #531
> > RIP: 0010:[<ffffffff81110478>]  [<ffffffff81110478>] isolate_lru_pages.isra.61+0x138/0x260
> > RSP: 0000:ffff8800796d9b28  EFLAGS: 00010082'
> ...
> 
> I'd much rather see a code snippet and description of the BUG_ON() than a
> register and stack dump.  That line number is wrong already. ;)

Good point.

> > For lru_add_page_tail(), it means we should not set PageUnevictable()
> > for tail pages unless we're sure that it will go to LRU_UNEVICTABLE.
> > Let's just copy PG_active and PG_unevictable from head page in
> > __split_huge_page_refcount(), it will simplify lru_add_page_tail().
> > 
> > This will fix one more bug in lru_add_page_tail():
> > if page_evictable(page_tail) is false and PageLRU(page) is true, page_tail
> > will go to the same lru as page, but nobody cares to sync page_tail
> > active/inactive state with page. So we can end up with inactive page on
> > active lru.
> > The patch will fix it as well since we copy PG_active from head page.
> 
> This all seems good, and if it fixes a bug, it should really get merged
> as it stands.  Have you been actually able to trigger that bug in any
> way in practice?

I was only able to trigger it when splitting ramfs transhuge pages.
I doubt there's a way to reproduce it on current upstream code.
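
A sketch of what the fix boils down to in __split_huge_page_refcount()
(the surrounding mask is assumed; the point is the two added bits):

        page_tail->flags |= (page->flags &
                             ((1L << PG_referenced) |
                              (1L << PG_swapbacked) |
                              (1L << PG_mlocked) |
                              (1L << PG_uptodate) |
                              (1L << PG_active) |       /* added */
                              (1L << PG_unevictable))); /* added */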

> Acked-by: Dave Hansen <dave.hansen@linux.intel.com>

Thanks!

-- 
 Kirill A. Shutemov

^ permalink raw reply	[flat|nested] 243+ messages in thread

* Re: [PATCHv4 33/39] thp, mm: implement do_huge_linear_fault()
  2013-05-12  1:23   ` Kirill A. Shutemov
@ 2013-05-22 12:47     ` Hillf Danton
  -1 siblings, 0 replies; 243+ messages in thread
From: Hillf Danton @ 2013-05-22 12:47 UTC (permalink / raw)
  To: Kirill A. Shutemov
  Cc: Andrea Arcangeli, Andrew Morton, Al Viro, Hugh Dickins,
	Wu Fengguang, Jan Kara, Mel Gorman, linux-mm, Andi Kleen,
	Matthew Wilcox, Kirill A. Shutemov, Dave Hansen, linux-fsdevel,
	linux-kernel

On Sun, May 12, 2013 at 9:23 AM, Kirill A. Shutemov
<kirill.shutemov@linux.intel.com> wrote:
> @@ -3301,12 +3335,23 @@ static int __do_fault(struct mm_struct *mm, struct vm_area_struct *vma,
>  {
>         pte_t *page_table;
>         spinlock_t *ptl;
> +       pgtable_t pgtable = NULL;
>         struct page *page, *cow_page, *dirty_page = NULL;
> -       pte_t entry;
>         bool anon = false, page_mkwrite = false;
>         bool write = flags & FAULT_FLAG_WRITE;
> +       bool thp = flags & FAULT_FLAG_TRANSHUGE;
> +       unsigned long addr_aligned;
>         struct vm_fault vmf;
> -       int ret;
> +       int nr, ret;
> +
> +       if (thp) {
> +               if (!transhuge_vma_suitable(vma, address))
> +                       return VM_FAULT_FALLBACK;
> +               if (unlikely(khugepaged_enter(vma)))

vma->vm_mm is now under the care of khugepaged, why?

> +                       return VM_FAULT_OOM;
> +               addr_aligned = address & HPAGE_PMD_MASK;
> +       } else
> +               addr_aligned = address & PAGE_MASK;
>

^ permalink raw reply	[flat|nested] 243+ messages in thread

* Re: [PATCHv4 33/39] thp, mm: implement do_huge_linear_fault()
  2013-05-12  1:23   ` Kirill A. Shutemov
@ 2013-05-22 12:56     ` Hillf Danton
  -1 siblings, 0 replies; 243+ messages in thread
From: Hillf Danton @ 2013-05-22 12:56 UTC (permalink / raw)
  To: Kirill A. Shutemov
  Cc: Andrea Arcangeli, Andrew Morton, Al Viro, Hugh Dickins,
	Wu Fengguang, Jan Kara, Mel Gorman, linux-mm, Andi Kleen,
	Matthew Wilcox, Kirill A. Shutemov, Dave Hansen, linux-fsdevel,
	linux-kernel

On Sun, May 12, 2013 at 9:23 AM, Kirill A. Shutemov
<kirill.shutemov@linux.intel.com> wrote:
> @@ -3316,17 +3361,25 @@ static int __do_fault(struct mm_struct *mm, struct vm_area_struct *vma,
>                 if (unlikely(anon_vma_prepare(vma)))
>                         return VM_FAULT_OOM;
>
> -               cow_page = alloc_page_vma(GFP_HIGHUSER_MOVABLE, vma, address);
> +               cow_page = alloc_fault_page_vma(vma, address, flags);
>                 if (!cow_page)
> -                       return VM_FAULT_OOM;
> +                       return VM_FAULT_OOM | VM_FAULT_FALLBACK;
>

Does fallback make sense with !thp?

>                 if (mem_cgroup_newpage_charge(cow_page, mm, GFP_KERNEL)) {
>                         page_cache_release(cow_page);
> -                       return VM_FAULT_OOM;
> +                       return VM_FAULT_OOM | VM_FAULT_FALLBACK;
>                 }
>         } else
>                 cow_page = NULL;

^ permalink raw reply	[flat|nested] 243+ messages in thread

* Re: [PATCHv4 33/39] thp, mm: implement do_huge_linear_fault()
  2013-05-12  1:23   ` Kirill A. Shutemov
@ 2013-05-22 13:24     ` Hillf Danton
  -1 siblings, 0 replies; 243+ messages in thread
From: Hillf Danton @ 2013-05-22 13:24 UTC (permalink / raw)
  To: Kirill A. Shutemov
  Cc: Andrea Arcangeli, Andrew Morton, Al Viro, Hugh Dickins,
	Wu Fengguang, Jan Kara, Mel Gorman, linux-mm, Andi Kleen,
	Matthew Wilcox, Kirill A. Shutemov, Dave Hansen, linux-fsdevel,
	linux-kernel

On Sun, May 12, 2013 at 9:23 AM, Kirill A. Shutemov
<kirill.shutemov@linux.intel.com> wrote:
>         page = vmf.page;
> +
> +       /*
> +        * If we asked for huge page we expect to get it or VM_FAULT_FALLBACK.
> +        * If we don't ask for huge page it must be splitted in ->fault().
> +        */
> +       BUG_ON(PageTransHuge(page) != thp);
> +
Based on the log message in 34/39 ("If the area of page cache required
to create a huge page is empty, we create a new huge page and return
it."), the above trap looks bogus.

	if (thp)
		BUG_ON(!PageTransHuge(page));

^ permalink raw reply	[flat|nested] 243+ messages in thread

* Re: [PATCHv4 09/39] thp, mm: introduce mapping_can_have_hugepages() predicate
  2013-05-21 19:28     ` Dave Hansen
@ 2013-05-22 13:51       ` Kirill A. Shutemov
  -1 siblings, 0 replies; 243+ messages in thread
From: Kirill A. Shutemov @ 2013-05-22 13:51 UTC (permalink / raw)
  To: Dave Hansen
  Cc: Kirill A. Shutemov, Andrea Arcangeli, Andrew Morton, Al Viro,
	Hugh Dickins, Wu Fengguang, Jan Kara, Mel Gorman, linux-mm,
	Andi Kleen, Matthew Wilcox, Kirill A. Shutemov, Hillf Danton,
	linux-fsdevel, linux-kernel

Dave Hansen wrote:
> On 05/11/2013 06:23 PM, Kirill A. Shutemov wrote:
> > From: "Kirill A. Shutemov" <kirill.shutemov@linux.intel.com>
> > 
> > Returns true if mapping can have huge pages. Just check for __GFP_COMP
> > in gfp mask of the mapping for now.
> > 
> > Signed-off-by: Kirill A. Shutemov <kirill.shutemov@linux.intel.com>
> > ---
> >  include/linux/pagemap.h |   12 ++++++++++++
> >  1 file changed, 12 insertions(+)
> > 
> > diff --git a/include/linux/pagemap.h b/include/linux/pagemap.h
> > index e3dea75..28597ec 100644
> > --- a/include/linux/pagemap.h
> > +++ b/include/linux/pagemap.h
> > @@ -84,6 +84,18 @@ static inline void mapping_set_gfp_mask(struct address_space *m, gfp_t mask)
> >  				(__force unsigned long)mask;
> >  }
> >  
> > +static inline bool mapping_can_have_hugepages(struct address_space *m)
> > +{
> > +	if (IS_ENABLED(CONFIG_TRANSPARENT_HUGEPAGE_PAGECACHE)) {
> > +		gfp_t gfp_mask = mapping_gfp_mask(m);
> > +		/* __GFP_COMP is key part of GFP_TRANSHUGE */
> > +		return !!(gfp_mask & __GFP_COMP) &&
> > +			transparent_hugepage_pagecache();
> > +	}
> > +
> > +	return false;
> > +}
> 
> transparent_hugepage_pagecache() already has the same IS_ENABLED()
> check,  Is it really necessary to do it again here?
> 
> IOW, can you do this?
> 
> > +static inline bool mapping_can_have_hugepages(struct address_space
> > +{
> > +		gfp_t gfp_mask = mapping_gfp_mask(m);
> 		if (!transparent_hugepage_pagecache())
> 			return false;
> > +		/* __GFP_COMP is key part of GFP_TRANSHUGE */
> > +		return !!(gfp_mask & __GFP_COMP);
> > +}

Yeah, it's better.
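
Spelled out, the reworked helper then reads roughly as follows (a sketch
combining the two quoted versions):

        static inline bool mapping_can_have_hugepages(struct address_space *m)
        {
                gfp_t gfp_mask = mapping_gfp_mask(m);

                if (!transparent_hugepage_pagecache())
                        return false;
                /* __GFP_COMP is the key part of GFP_TRANSHUGE */
                return !!(gfp_mask & __GFP_COMP);
        }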

> I know we talked about this in the past, but I've forgotten already.
> Why is this checking for __GFP_COMP instead of GFP_TRANSHUGE?

It's up to the filesystem what gfp mask to use. For example, ramfs pages
are not currently movable. So, we check only the part which matters.

> Please flesh out the comment.

I'll make the comment in code a bit more descriptive.

> Also, what happens if "transparent_hugepage_flags &
> (1<<TRANSPARENT_HUGEPAGE_PAGECACHE)" becomes false at runtime and you
> have some already-instantiated huge page cache mappings around?  Will
> things like mapping_align_mask() break?

We will not touch existing huge pages in existing VMAs. Userspace can
use them until they are unmapped or split. It's consistent with anon
THP pages.

If anybody mmap()s the file after disabling the feature, we will not
set up huge pages anymore: the transparent_hugepage_enabled() check in
handle_mm_fault() will fail and the page will be split.

mapping_align_mask() is part of the mmap() call path, so the only
consequence is that a VMA may end up aligned more strictly than needed.
Nothing to worry about.

-- 
 Kirill A. Shutemov

^ permalink raw reply	[flat|nested] 243+ messages in thread

* Re: [PATCHv4 16/39] thp, mm: locking tail page is a bug
  2013-05-21 20:18     ` Dave Hansen
@ 2013-05-22 14:12       ` Kirill A. Shutemov
  -1 siblings, 0 replies; 243+ messages in thread
From: Kirill A. Shutemov @ 2013-05-22 14:12 UTC (permalink / raw)
  To: Dave Hansen
  Cc: Kirill A. Shutemov, Andrea Arcangeli, Andrew Morton, Al Viro,
	Hugh Dickins, Wu Fengguang, Jan Kara, Mel Gorman, linux-mm,
	Andi Kleen, Matthew Wilcox, Kirill A. Shutemov, Hillf Danton,
	linux-fsdevel, linux-kernel

Dave Hansen wrote:
> On 05/11/2013 06:23 PM, Kirill A. Shutemov wrote:
> > From: "Kirill A. Shutemov" <kirill.shutemov@linux.intel.com>
> > 
> > Locking head page means locking entire compound page.
> > If we try to lock tail page, something went wrong.
> 
> Have you actually triggered this in your development?

Yes, on early prototypes.
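
For context, the check presumably amounts to something like this (a
sketch; the exact placement in the patch may differ):

        static inline void lock_page(struct page *page)
        {
                might_sleep();
                VM_BUG_ON(PageTail(page));      /* lock the head page instead */
                if (!trylock_page(page))
                        __lock_page(page);
        }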

-- 
 Kirill A. Shutemov

^ permalink raw reply	[flat|nested] 243+ messages in thread

* Re: [PATCHv4 04/39] radix-tree: implement preload for multiple contiguous elements
  2013-05-22 12:03       ` Kirill A. Shutemov
@ 2013-05-22 14:20         ` Dave Hansen
  -1 siblings, 0 replies; 243+ messages in thread
From: Dave Hansen @ 2013-05-22 14:20 UTC (permalink / raw)
  To: Kirill A. Shutemov
  Cc: Andrea Arcangeli, Andrew Morton, Al Viro, Hugh Dickins,
	Wu Fengguang, Jan Kara, Mel Gorman, linux-mm, Andi Kleen,
	Matthew Wilcox, Kirill A. Shutemov, Hillf Danton, linux-fsdevel,
	linux-kernel

On 05/22/2013 05:03 AM, Kirill A. Shutemov wrote:
> On most machines we will have RADIX_TREE_MAP_SHIFT=6. In this case,
> on 64-bit system the per-CPU feature overhead is
>  for preload array:
>    (30 - 21) * sizeof(void*) = 72 bytes
>  plus, if the preload array is full
>    (30 - 21) * sizeof(struct radix_tree_node) = 9 * 560 = 5040 bytes
>  total: 5112 bytes
> 
> on 32-bit system the per-CPU feature overhead is
>  for preload array:
>    (19 - 11) * sizeof(void*) = 32 bytes
>  plus, if the preload array is full
>    (19 - 11) * sizeof(struct radix_tree_node) = 8 * 296 = 2368 bytes
>  total: 2400 bytes
> ---
> 
> Is it good enough?

Yup, just stick the calculations way down in the commit message.  You
can put the description that it "eats about 5k more memory per-cpu than
existing code" up in the very beginning.

^ permalink raw reply	[flat|nested] 243+ messages in thread

* Re: [PATCHv4 26/39] ramfs: enable transparent huge page cache
  2013-05-21 22:43     ` Dave Hansen
@ 2013-05-22 14:22       ` Kirill A. Shutemov
  -1 siblings, 0 replies; 243+ messages in thread
From: Kirill A. Shutemov @ 2013-05-22 14:22 UTC (permalink / raw)
  To: Dave Hansen
  Cc: Kirill A. Shutemov, Andrea Arcangeli, Andrew Morton, Al Viro,
	Hugh Dickins, Wu Fengguang, Jan Kara, Mel Gorman, linux-mm,
	Andi Kleen, Matthew Wilcox, Kirill A. Shutemov, Hillf Danton,
	linux-fsdevel, linux-kernel

Dave Hansen wrote:
> On 05/11/2013 06:23 PM, Kirill A. Shutemov wrote:
> > From: "Kirill A. Shutemov" <kirill.shutemov@linux.intel.com>
> > 
> > ramfs is the most simple fs from page cache point of view. Let's start
> > transparent huge page cache enabling here.
> > 
> > For now we allocate only non-movable huge page. ramfs pages cannot be
> > moved yet.
> > 
> > Signed-off-by: Kirill A. Shutemov <kirill.shutemov@linux.intel.com>
> > ---
> >  fs/ramfs/inode.c |    6 +++++-
> >  1 file changed, 5 insertions(+), 1 deletion(-)
> > 
> > diff --git a/fs/ramfs/inode.c b/fs/ramfs/inode.c
> > index c24f1e1..54d69c7 100644
> > --- a/fs/ramfs/inode.c
> > +++ b/fs/ramfs/inode.c
> > @@ -61,7 +61,11 @@ struct inode *ramfs_get_inode(struct super_block *sb,
> >  		inode_init_owner(inode, dir, mode);
> >  		inode->i_mapping->a_ops = &ramfs_aops;
> >  		inode->i_mapping->backing_dev_info = &ramfs_backing_dev_info;
> > -		mapping_set_gfp_mask(inode->i_mapping, GFP_HIGHUSER);
> > +		/*
> > +		 * TODO: make ramfs pages movable
> > +		 */
> > +		mapping_set_gfp_mask(inode->i_mapping,
> > +				GFP_TRANSHUGE & ~__GFP_MOVABLE);
> 
> So, before these patches, ramfs was movable.  Now, even on architectures
> or configurations that have no chance of using THP-pagecache, ramfs
> pages are no longer movable.  Right?

No, it wasn't movable. GFP_HIGHUSER is not GFP_HIGHUSER_MOVABLE (yeah,
names of gfp constants could be more consistent).
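
For reference, gfp.h spells the flags out in full, but the relation
boils down to:

        #define GFP_HIGHUSER            (GFP_USER | __GFP_HIGHMEM)
        #define GFP_HIGHUSER_MOVABLE    (GFP_HIGHUSER | __GFP_MOVABLE)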

ramfs should be fixed to use movable pages, but it's outside the scope of the
patchset.

See more details: http://lkml.org/lkml/2013/4/2/720

-- 
 Kirill A. Shutemov

^ permalink raw reply	[flat|nested] 243+ messages in thread

* Re: [PATCHv4 29/39] thp: move maybe_pmd_mkwrite() out of mk_huge_pmd()
  2013-05-21 23:23     ` Dave Hansen
@ 2013-05-22 14:37       ` Kirill A. Shutemov
  -1 siblings, 0 replies; 243+ messages in thread
From: Kirill A. Shutemov @ 2013-05-22 14:37 UTC (permalink / raw)
  To: Dave Hansen
  Cc: Kirill A. Shutemov, Andrea Arcangeli, Andrew Morton, Al Viro,
	Hugh Dickins, Wu Fengguang, Jan Kara, Mel Gorman, linux-mm,
	Andi Kleen, Matthew Wilcox, Kirill A. Shutemov, Hillf Danton,
	linux-fsdevel, linux-kernel

Dave Hansen wrote:
> On 05/11/2013 06:23 PM, Kirill A. Shutemov wrote:
> > From: "Kirill A. Shutemov" <kirill.shutemov@linux.intel.com>
> > 
> > It's confusing that mk_huge_pmd() has semantics different from mk_pte()
> > or mk_pmd().
> > 
> > Let's move maybe_pmd_mkwrite() out of mk_huge_pmd() and adjust
> > prototype to match mk_pte().
> 
> Was there a motivation to do this beyond adding consistency?  Do you use
> this later or something?

I spent some time debugging a problem caused by this inconsistency, so at
that point I was motivated to fix it. :)
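
For the record, a before/after sketch of the call sites (illustrative,
not the literal diff):

        /* before: write/dirty handling hidden inside the helper */
        entry = mk_huge_pmd(page, vma);

        /* after: mk_huge_pmd() takes a pgprot_t, matching mk_pte() */
        entry = mk_huge_pmd(page, vma->vm_page_prot);
        entry = maybe_pmd_mkwrite(pmd_mkdirty(entry), vma);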

-- 
 Kirill A. Shutemov

^ permalink raw reply	[flat|nested] 243+ messages in thread

* Re: [PATCHv4 16/39] thp, mm: locking tail page is a bug
  2013-05-22 14:12       ` Kirill A. Shutemov
@ 2013-05-22 14:53         ` Dave Hansen
  -1 siblings, 0 replies; 243+ messages in thread
From: Dave Hansen @ 2013-05-22 14:53 UTC (permalink / raw)
  To: Kirill A. Shutemov
  Cc: Andrea Arcangeli, Andrew Morton, Al Viro, Hugh Dickins,
	Wu Fengguang, Jan Kara, Mel Gorman, linux-mm, Andi Kleen,
	Matthew Wilcox, Kirill A. Shutemov, Hillf Danton, linux-fsdevel,
	linux-kernel

On 05/22/2013 07:12 AM, Kirill A. Shutemov wrote:
> Dave Hansen wrote:
>> On 05/11/2013 06:23 PM, Kirill A. Shutemov wrote:
>>> From: "Kirill A. Shutemov" <kirill.shutemov@linux.intel.com>
>>>
>>> Locking head page means locking entire compound page.
>>> If we try to lock tail page, something went wrong.
>>
>> Have you actually triggered this in your development?
> 
> Yes, on early prototypes.

I'd mention this in the description, and think about how necessary this
is with your _current_ code.


^ permalink raw reply	[flat|nested] 243+ messages in thread

* Re: [PATCHv4 26/39] ramfs: enable transparent huge page cache
  2013-05-22 14:22       ` Kirill A. Shutemov
@ 2013-05-22 14:55         ` Dave Hansen
  -1 siblings, 0 replies; 243+ messages in thread
From: Dave Hansen @ 2013-05-22 14:55 UTC (permalink / raw)
  To: Kirill A. Shutemov
  Cc: Andrea Arcangeli, Andrew Morton, Al Viro, Hugh Dickins,
	Wu Fengguang, Jan Kara, Mel Gorman, linux-mm, Andi Kleen,
	Matthew Wilcox, Kirill A. Shutemov, Hillf Danton, linux-fsdevel,
	linux-kernel

On 05/22/2013 07:22 AM, Kirill A. Shutemov wrote:
> Dave Hansen wrote:
>>> +		/*
>>> +		 * TODO: make ramfs pages movable
>>> +		 */
>>> +		mapping_set_gfp_mask(inode->i_mapping,
>>> +				GFP_TRANSHUGE & ~__GFP_MOVABLE);
>>
>> So, before these patches, ramfs was movable.  Now, even on architectures
>> or configurations that have no chance of using THP-pagecache, ramfs
>> pages are no longer movable.  Right?
> 
> No, it wasn't movable. GFP_HIGHUSER is not GFP_HIGHUSER_MOVABLE (yeah,
> names of gfp constants could be more consistent).
> 
> ramfs should be fixed to use movable pages, but it's outside the scope of the
> patchset.
> 
> See more details: http://lkml.org/lkml/2013/4/2/720

Please make sure this is clear from the patch description.

Personally, I wouldn't be adding TODO's to the code that I'm not
planning to go fix, lest I would get tagged with _doing_ it. :)


^ permalink raw reply	[flat|nested] 243+ messages in thread

* Re: [PATCHv4 29/39] thp: move maybe_pmd_mkwrite() out of mk_huge_pmd()
  2013-05-22 14:37       ` Kirill A. Shutemov
@ 2013-05-22 14:56         ` Dave Hansen
  -1 siblings, 0 replies; 243+ messages in thread
From: Dave Hansen @ 2013-05-22 14:56 UTC (permalink / raw)
  To: Kirill A. Shutemov
  Cc: Andrea Arcangeli, Andrew Morton, Al Viro, Hugh Dickins,
	Wu Fengguang, Jan Kara, Mel Gorman, linux-mm, Andi Kleen,
	Matthew Wilcox, Kirill A. Shutemov, Hillf Danton, linux-fsdevel,
	linux-kernel

On 05/22/2013 07:37 AM, Kirill A. Shutemov wrote:
> Dave Hansen wrote:
>> On 05/11/2013 06:23 PM, Kirill A. Shutemov wrote:
>>> From: "Kirill A. Shutemov" <kirill.shutemov@linux.intel.com>
>>>
>>> It's confusing that mk_huge_pmd() has semantics different from mk_pte()
>>> or mk_pmd().
>>>
>>> Let's move maybe_pmd_mkwrite() out of mk_huge_pmd() and adjust
>>> prototype to match mk_pte().
>>
>> Was there a motivation to do this beyond adding consistency?  Do you use
>> this later or something?
> 
> I spent some time debugging a problem caused by this inconsistency, so at
> that point I was motivated to fix it. :)

A little anecdote about how this bit you in practice, to help show that
this isn't just random code churn, would be nice.


^ permalink raw reply	[flat|nested] 243+ messages in thread

* Re: [PATCHv4 33/39] thp, mm: implement do_huge_linear_fault()
  2013-05-22 12:47     ` Hillf Danton
@ 2013-05-22 15:13       ` Kirill A. Shutemov
  -1 siblings, 0 replies; 243+ messages in thread
From: Kirill A. Shutemov @ 2013-05-22 15:13 UTC (permalink / raw)
  To: Hillf Danton
  Cc: Kirill A. Shutemov, Andrea Arcangeli, Andrew Morton, Al Viro,
	Hugh Dickins, Wu Fengguang, Jan Kara, Mel Gorman, linux-mm,
	Andi Kleen, Matthew Wilcox, Kirill A. Shutemov, Dave Hansen,
	linux-fsdevel, linux-kernel

Hillf Danton wrote:
> On Sun, May 12, 2013 at 9:23 AM, Kirill A. Shutemov
> <kirill.shutemov@linux.intel.com> wrote:
> > @@ -3301,12 +3335,23 @@ static int __do_fault(struct mm_struct *mm, struct vm_area_struct *vma,
> >  {
> >         pte_t *page_table;
> >         spinlock_t *ptl;
> > +       pgtable_t pgtable = NULL;
> >         struct page *page, *cow_page, *dirty_page = NULL;
> > -       pte_t entry;
> >         bool anon = false, page_mkwrite = false;
> >         bool write = flags & FAULT_FLAG_WRITE;
> > +       bool thp = flags & FAULT_FLAG_TRANSHUGE;
> > +       unsigned long addr_aligned;
> >         struct vm_fault vmf;
> > -       int ret;
> > +       int nr, ret;
> > +
> > +       if (thp) {
> > +               if (!transhuge_vma_suitable(vma, address))
> > +                       return VM_FAULT_FALLBACK;
> > +               if (unlikely(khugepaged_enter(vma)))
> 
> vma->vm_mm now is under the care of khugepaged, why?

Because it has at least one VMA suitable for huge pages.

Yes, we can't collapse pages in file-backed VMAs yet, but it's better to
be consistent now to avoid issues once collapsing is implemented.

-- 
 Kirill A. Shutemov

^ permalink raw reply	[flat|nested] 243+ messages in thread

* Re: [PATCHv4 33/39] thp, mm: implement do_huge_linear_fault()
  2013-05-22 12:56     ` Hillf Danton
@ 2013-05-22 15:14       ` Kirill A. Shutemov
  -1 siblings, 0 replies; 243+ messages in thread
From: Kirill A. Shutemov @ 2013-05-22 15:14 UTC (permalink / raw)
  To: Hillf Danton
  Cc: Kirill A. Shutemov, Andrea Arcangeli, Andrew Morton, Al Viro,
	Hugh Dickins, Wu Fengguang, Jan Kara, Mel Gorman, linux-mm,
	Andi Kleen, Matthew Wilcox, Kirill A. Shutemov, Dave Hansen,
	linux-fsdevel, linux-kernel

Hillf Danton wrote:
> On Sun, May 12, 2013 at 9:23 AM, Kirill A. Shutemov
> <kirill.shutemov@linux.intel.com> wrote:
> > @@ -3316,17 +3361,25 @@ static int __do_fault(struct mm_struct *mm, struct vm_area_struct *vma,
> >                 if (unlikely(anon_vma_prepare(vma)))
> >                         return VM_FAULT_OOM;
> >
> > -               cow_page = alloc_page_vma(GFP_HIGHUSER_MOVABLE, vma, address);
> > +               cow_page = alloc_fault_page_vma(vma, address, flags);
> >                 if (!cow_page)
> > -                       return VM_FAULT_OOM;
> > +                       return VM_FAULT_OOM | VM_FAULT_FALLBACK;
> >
> 
> Fallback makes sense with !thp ?

No, it's a nop there: handle_pte_fault() will notice only VM_FAULT_OOM,
which is exactly what we need.
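
(A standalone illustration -- a minimal user-space sketch with flag values
assumed for the sketch only, not taken from the kernel -- of why the extra
bit is harmless: the pte-level caller tests only the OOM bit.)

	#include <stdio.h>

	/* Values assumed for this sketch; only the bit test matters. */
	#define VM_FAULT_OOM      0x0001
	#define VM_FAULT_FALLBACK 0x0800

	int main(void)
	{
		int ret = VM_FAULT_OOM | VM_FAULT_FALLBACK;

		if (ret & VM_FAULT_OOM)	/* the only bit the pte path acts on */
			printf("OOM handled; FALLBACK bit ignored\n");
		return 0;
	}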

-- 
 Kirill A. Shutemov

^ permalink raw reply	[flat|nested] 243+ messages in thread

* Re: [PATCHv4 33/39] thp, mm: implement do_huge_linear_fault()
  2013-05-22 13:24     ` Hillf Danton
@ 2013-05-22 15:26       ` Kirill A. Shutemov
  -1 siblings, 0 replies; 243+ messages in thread
From: Kirill A. Shutemov @ 2013-05-22 15:26 UTC (permalink / raw)
  To: Hillf Danton
  Cc: Kirill A. Shutemov, Andrea Arcangeli, Andrew Morton, Al Viro,
	Hugh Dickins, Wu Fengguang, Jan Kara, Mel Gorman, linux-mm,
	Andi Kleen, Matthew Wilcox, Kirill A. Shutemov, Dave Hansen,
	linux-fsdevel, linux-kernel

Hillf Danton wrote:
> On Sun, May 12, 2013 at 9:23 AM, Kirill A. Shutemov
> <kirill.shutemov@linux.intel.com> wrote:
> >         page = vmf.page;
> > +
> > +       /*
> > +        * If we asked for a huge page we expect to get it or VM_FAULT_FALLBACK.
> > +        * If we don't ask for a huge page it must be split in ->fault().
> > +        */
> > +       BUG_ON(PageTransHuge(page) != thp);
> > +
> Based on the log message in 34/39 ("If the area of page cache required
> to create a huge page is empty, we create a new huge page and return
> it."), the above trap looks bogus.

The statement in 34/39 is true for (flags & FAULT_FLAG_TRANSHUGE).
For !(flags & FAULT_FLAG_TRANSHUGE), the huge page must be split in ->fault().

The BUG_ON() above is a shortcut for two checks:

if (thp)
	BUG_ON(!PageTransHuge(page));
if (!thp)
	BUG_ON(PageTransHuge(page));

-- 
 Kirill A. Shutemov

^ permalink raw reply	[flat|nested] 243+ messages in thread

* Re: [PATCHv4 09/39] thp, mm: introduce mapping_can_have_hugepages() predicate
  2013-05-22 13:51       ` Kirill A. Shutemov
@ 2013-05-22 15:31         ` Dave Hansen
  -1 siblings, 0 replies; 243+ messages in thread
From: Dave Hansen @ 2013-05-22 15:31 UTC (permalink / raw)
  To: Kirill A. Shutemov
  Cc: Andrea Arcangeli, Andrew Morton, Al Viro, Hugh Dickins,
	Wu Fengguang, Jan Kara, Mel Gorman, linux-mm, Andi Kleen,
	Matthew Wilcox, Kirill A. Shutemov, Hillf Danton, linux-fsdevel,
	linux-kernel

On 05/22/2013 06:51 AM, Kirill A. Shutemov wrote:
> Dave Hansen wrote:
>> Also, what happens if "transparent_hugepage_flags &
>> (1<<TRANSPARENT_HUGEPAGE_PAGECACHE)" becomes false at runtime and you
>> have some already-instantiated huge page cache mappings around?  Will
>> things like mapping_align_mask() break?
> 
> We will not touch existing huge pages in existing VMAs. Userspace can
> use them until they are unmapped or split. It's consistent with anon
> THP pages.
> 
> If anybody mmap()s the file after disabling the feature, we will not
> set up huge pages anymore: the transparent_hugepage_enabled() check in
> handle_mm_fault() will fail and the page will be split.
> 
> mapping_align_mask() is part of the mmap() call path, so the worst case
> is that we get a VMA aligned more strictly than needed. Nothing to worry
> about.

Could we get a little blurb along those lines somewhere?  Maybe even in
your docs that you've added to Documentation/.  Oh, wait, you don't have
any documentation? :)

You did add a sysfs knob, so you do owe us some docs for it.

"If the THP-cache sysfs tunable is disabled, huge pages will no longer
be mapped with new mmap()s, but they will remain in place in the page
cache.  You might still see some benefits from read/write operations,
etc..."

^ permalink raw reply	[flat|nested] 243+ messages in thread

* Re: [PATCHv4 34/39] thp, mm: handle huge pages in filemap_fault()
  2013-05-22 11:37     ` Hillf Danton
@ 2013-05-22 15:34       ` Kirill A. Shutemov
  -1 siblings, 0 replies; 243+ messages in thread
From: Kirill A. Shutemov @ 2013-05-22 15:34 UTC (permalink / raw)
  To: Hillf Danton
  Cc: Kirill A. Shutemov, Andrea Arcangeli, Andrew Morton, Al Viro,
	Hugh Dickins, Wu Fengguang, Jan Kara, Mel Gorman, linux-mm,
	Andi Kleen, Matthew Wilcox, Kirill A. Shutemov, Dave Hansen,
	linux-fsdevel, linux-kernel

Hillf Danton wrote:
> On Sun, May 12, 2013 at 9:23 AM, Kirill A. Shutemov
> <kirill.shutemov@linux.intel.com> wrote:
> > From: "Kirill A. Shutemov" <kirill.shutemov@linux.intel.com>
> >
> > If the caller asks for a huge page (flags & FAULT_FLAG_TRANSHUGE),
> > filemap_fault() returns it if there's already a huge page at that offset.
> >
> > If the area of page cache required to create a huge page is empty, we
> > create a new huge page and return it.
> >
> > Otherwise we return VM_FAULT_FALLBACK to indicate that fallback to small
> > pages is required.
> >
> s/small/regular/g ?

% git log --oneline -p -i --grep 'small.\?page' | wc -l
5962
% git log --oneline -p -i --grep 'regular.\?page' | wc -l
3623
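
(A rough sketch of the ->fault() flow the quoted log message describes for
FAULT_FLAG_TRANSHUGE callers; the helper names below are hypothetical, not
from the patch:)

	page = find_get_huge_page(mapping, index);	/* hypothetical helper */
	if (page)
		goto page_ok;			/* huge page already in the cache */
	if (huge_range_empty(mapping, index))	/* hypothetical helper */
		page = add_new_huge_page(mapping, index);
	if (!page)
		return VM_FAULT_FALLBACK;	/* caller retries with small pages */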

-- 
 Kirill A. Shutemov

^ permalink raw reply	[flat|nested] 243+ messages in thread

* Re: [PATCHv4 03/39] mm: implement zero_huge_user_segment and friends
  2013-05-12  1:23   ` Kirill A. Shutemov
@ 2013-05-23 10:32     ` Hillf Danton
  -1 siblings, 0 replies; 243+ messages in thread
From: Hillf Danton @ 2013-05-23 10:32 UTC (permalink / raw)
  To: Kirill A. Shutemov
  Cc: Andrea Arcangeli, Andrew Morton, Al Viro, Hugh Dickins,
	Wu Fengguang, Jan Kara, Mel Gorman, linux-mm, Andi Kleen,
	Matthew Wilcox, Kirill A. Shutemov, Dave Hansen, linux-fsdevel,
	linux-kernel

On Sun, May 12, 2013 at 9:23 AM, Kirill A. Shutemov
<kirill.shutemov@linux.intel.com> wrote:
> From: "Kirill A. Shutemov" <kirill.shutemov@linux.intel.com>
>
> Let's add helpers to clear huge page segment(s). They provide the same
> functionality as zero_user_segment and zero_user, but for huge pages.
>
> Signed-off-by: Kirill A. Shutemov <kirill.shutemov@linux.intel.com>
> ---
>  include/linux/mm.h |    7 +++++++
>  mm/memory.c        |   36 ++++++++++++++++++++++++++++++++++++
>  2 files changed, 43 insertions(+)
>
> diff --git a/include/linux/mm.h b/include/linux/mm.h
> index c05d7cf..5e156fb 100644
> --- a/include/linux/mm.h
> +++ b/include/linux/mm.h
> @@ -1797,6 +1797,13 @@ extern void dump_page(struct page *page);
>  extern void clear_huge_page(struct page *page,
>                             unsigned long addr,
>                             unsigned int pages_per_huge_page);
> +extern void zero_huge_user_segment(struct page *page,
> +               unsigned start, unsigned end);
> +static inline void zero_huge_user(struct page *page,
> +               unsigned start, unsigned len)
> +{
> +       zero_huge_user_segment(page, start, start + len);
> +}
>  extern void copy_user_huge_page(struct page *dst, struct page *src,
>                                 unsigned long addr, struct vm_area_struct *vma,
>                                 unsigned int pages_per_huge_page);
> diff --git a/mm/memory.c b/mm/memory.c
> index f7a1fba..f02a8be 100644
> --- a/mm/memory.c
> +++ b/mm/memory.c
> @@ -4266,6 +4266,42 @@ void clear_huge_page(struct page *page,
>         }
>  }
>
> +void zero_huge_user_segment(struct page *page, unsigned start, unsigned end)
> +{
> +       int i;
> +       unsigned start_idx, end_idx;
> +       unsigned start_off, end_off;
> +
> +       BUG_ON(end < start);
> +
> +       might_sleep();
> +
> +       if (start == end)
> +               return;
> +
> +       start_idx = start >> PAGE_SHIFT;
> +       start_off = start & ~PAGE_MASK;
> +       end_idx = (end - 1) >> PAGE_SHIFT;
> +       end_off = ((end - 1) & ~PAGE_MASK) + 1;
> +
> +       /*
> +        * if start and end are on the same small page we can call
> +        * zero_user_segment() once and save one kmap_atomic().
> +        */
> +       if (start_idx == end_idx)
> +               return zero_user_segment(page + start_idx, start_off, end_off);
> +
> +       /* zero the first (possibly partial) page */
> +       zero_user_segment(page + start_idx, start_off, PAGE_SIZE);
> +       for (i = start_idx + 1; i < end_idx; i++) {
> +               cond_resched();
> +               clear_highpage(page + i);
> +               flush_dcache_page(page + i);

Can we use the function again?
	zero_user_segment(page + i, 0, PAGE_SIZE);

> +       }
> +       /* zero the last (possibly partial) page */
> +       zero_user_segment(page + end_idx, 0, end_off);
> +}
> +
>  static void copy_user_gigantic_page(struct page *dst, struct page *src,
>                                     unsigned long addr,
>                                     struct vm_area_struct *vma,
> --
> 1.7.10.4
>

^ permalink raw reply	[flat|nested] 243+ messages in thread

* Re: [PATCHv4 07/39] thp, mm: basic defines for transparent huge page cache
  2013-05-12  1:23   ` Kirill A. Shutemov
@ 2013-05-23 10:36     ` Hillf Danton
  -1 siblings, 0 replies; 243+ messages in thread
From: Hillf Danton @ 2013-05-23 10:36 UTC (permalink / raw)
  To: Kirill A. Shutemov
  Cc: Andrea Arcangeli, Andrew Morton, Al Viro, Hugh Dickins,
	Wu Fengguang, Jan Kara, Mel Gorman, linux-mm, Andi Kleen,
	Matthew Wilcox, Kirill A. Shutemov, Dave Hansen, linux-fsdevel,
	linux-kernel

On Sun, May 12, 2013 at 9:23 AM, Kirill A. Shutemov
<kirill.shutemov@linux.intel.com> wrote:
> From: "Kirill A. Shutemov" <kirill.shutemov@linux.intel.com>
>

It would be better to add one or two sentences explaining why the
following defines are necessary.
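
(For example -- a hypothetical usage sketch, not taken from the patchset --
the defines below let page-cache code round an index down to a huge-page
boundary and walk its subpages:)

	pgoff_t head_index = index & ~HPAGE_CACHE_INDEX_MASK;
	int i;

	for (i = 0; i < HPAGE_CACHE_NR; i++)
		process_subpage(head_index + i);	/* hypothetical helper */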

> Signed-off-by: Kirill A. Shutemov <kirill.shutemov@linux.intel.com>
> ---
>  include/linux/huge_mm.h |    8 ++++++++
>  1 file changed, 8 insertions(+)
>
> diff --git a/include/linux/huge_mm.h b/include/linux/huge_mm.h
> index 528454c..6b4c9b2 100644
> --- a/include/linux/huge_mm.h
> +++ b/include/linux/huge_mm.h
> @@ -64,6 +64,10 @@ extern pmd_t *page_check_address_pmd(struct page *page,
>  #define HPAGE_PMD_MASK HPAGE_MASK
>  #define HPAGE_PMD_SIZE HPAGE_SIZE
>
> +#define HPAGE_CACHE_ORDER      (HPAGE_SHIFT - PAGE_CACHE_SHIFT)
> +#define HPAGE_CACHE_NR         (1L << HPAGE_CACHE_ORDER)
> +#define HPAGE_CACHE_INDEX_MASK (HPAGE_CACHE_NR - 1)
> +
>  extern bool is_vma_temporary_stack(struct vm_area_struct *vma);
>
>  #define transparent_hugepage_enabled(__vma)                            \
> @@ -185,6 +189,10 @@ extern int do_huge_pmd_numa_page(struct mm_struct *mm, struct vm_area_struct *vm
>  #define HPAGE_PMD_MASK ({ BUILD_BUG(); 0; })
>  #define HPAGE_PMD_SIZE ({ BUILD_BUG(); 0; })
>
> +#define HPAGE_CACHE_ORDER      ({ BUILD_BUG(); 0; })
> +#define HPAGE_CACHE_NR         ({ BUILD_BUG(); 0; })
> +#define HPAGE_CACHE_INDEX_MASK ({ BUILD_BUG(); 0; })
> +
>  #define hpage_nr_pages(x) 1
>
>  #define transparent_hugepage_enabled(__vma) 0
> --
> 1.7.10.4
>

^ permalink raw reply	[flat|nested] 243+ messages in thread

* Re: [PATCHv4 38/39] thp: vma_adjust_trans_huge(): adjust file-backed VMA too
  2013-05-12  1:23   ` Kirill A. Shutemov
@ 2013-05-23 11:01     ` Hillf Danton
  -1 siblings, 0 replies; 243+ messages in thread
From: Hillf Danton @ 2013-05-23 11:01 UTC (permalink / raw)
  To: Kirill A. Shutemov
  Cc: Andrea Arcangeli, Andrew Morton, Al Viro, Hugh Dickins,
	Wu Fengguang, Jan Kara, Mel Gorman, linux-mm, Andi Kleen,
	Matthew Wilcox, Kirill A. Shutemov, Dave Hansen, linux-fsdevel,
	linux-kernel

On Sun, May 12, 2013 at 9:23 AM, Kirill A. Shutemov
<kirill.shutemov@linux.intel.com> wrote:
> From: "Kirill A. Shutemov" <kirill.shutemov@linux.intel.com>
>
> Since we're going to have huge pages in the page cache, we need to call
> vma_adjust_trans_huge() for file-backed VMAs too, since they can
> potentially contain huge pages.
>
> For now we call it for all VMAs.
>
> Probably later we will need to introduce a flag to indicate that the VMA
> has huge pages.
>
> Signed-off-by: Kirill A. Shutemov <kirill.shutemov@linux.intel.com>
> ---

Acked-by: Hillf Danton <dhillf@gmail.com>

>  include/linux/huge_mm.h |   11 +----------
>  mm/huge_memory.c        |    2 +-
>  2 files changed, 2 insertions(+), 11 deletions(-)
>
> diff --git a/include/linux/huge_mm.h b/include/linux/huge_mm.h
> index b20334a..f4d6626 100644
> --- a/include/linux/huge_mm.h
> +++ b/include/linux/huge_mm.h
> @@ -139,7 +139,7 @@ extern void split_huge_page_pmd_mm(struct mm_struct *mm, unsigned long address,
>  #endif
>  extern int hugepage_madvise(struct vm_area_struct *vma,
>                             unsigned long *vm_flags, int advice);
> -extern void __vma_adjust_trans_huge(struct vm_area_struct *vma,
> +extern void vma_adjust_trans_huge(struct vm_area_struct *vma,
>                                     unsigned long start,
>                                     unsigned long end,
>                                     long adjust_next);
> @@ -155,15 +155,6 @@ static inline int pmd_trans_huge_lock(pmd_t *pmd,
>         else
>                 return 0;
>  }
> -static inline void vma_adjust_trans_huge(struct vm_area_struct *vma,
> -                                        unsigned long start,
> -                                        unsigned long end,
> -                                        long adjust_next)
> -{
> -       if (!vma->anon_vma || vma->vm_ops)
> -               return;
> -       __vma_adjust_trans_huge(vma, start, end, adjust_next);
> -}
>  static inline int hpage_nr_pages(struct page *page)
>  {
>         if (unlikely(PageTransHuge(page)))
> diff --git a/mm/huge_memory.c b/mm/huge_memory.c
> index d7c9df5..9c3815b 100644
> --- a/mm/huge_memory.c
> +++ b/mm/huge_memory.c
> @@ -2783,7 +2783,7 @@ static void split_huge_page_address(struct mm_struct *mm,
>         split_huge_page_pmd_mm(mm, address, pmd);
>  }
>
> -void __vma_adjust_trans_huge(struct vm_area_struct *vma,
> +void vma_adjust_trans_huge(struct vm_area_struct *vma,
>                              unsigned long start,
>                              unsigned long end,
>                              long adjust_next)
> --
> 1.7.10.4
>

^ permalink raw reply	[flat|nested] 243+ messages in thread

* Re: [PATCHv4 03/39] mm: implement zero_huge_user_segment and friends
  2013-05-23 10:32     ` Hillf Danton
@ 2013-05-23 11:32       ` Kirill A. Shutemov
  -1 siblings, 0 replies; 243+ messages in thread
From: Kirill A. Shutemov @ 2013-05-23 11:32 UTC (permalink / raw)
  To: Hillf Danton
  Cc: Kirill A. Shutemov, Andrea Arcangeli, Andrew Morton, Al Viro,
	Hugh Dickins, Wu Fengguang, Jan Kara, Mel Gorman, linux-mm,
	Andi Kleen, Matthew Wilcox, Kirill A. Shutemov, Dave Hansen,
	linux-fsdevel, linux-kernel

Hillf Danton wrote:
> On Sun, May 12, 2013 at 9:23 AM, Kirill A. Shutemov
> <kirill.shutemov@linux.intel.com> wrote:
> > From: "Kirill A. Shutemov" <kirill.shutemov@linux.intel.com>
> >
> > Let's add helpers to clear huge page segment(s). They provide the same
> > functionality as zero_user_segment and zero_user, but for huge pages.
> >
> > Signed-off-by: Kirill A. Shutemov <kirill.shutemov@linux.intel.com>
> > ---
> >  include/linux/mm.h |    7 +++++++
> >  mm/memory.c        |   36 ++++++++++++++++++++++++++++++++++++
> >  2 files changed, 43 insertions(+)
> >
> > diff --git a/include/linux/mm.h b/include/linux/mm.h
> > index c05d7cf..5e156fb 100644
> > --- a/include/linux/mm.h
> > +++ b/include/linux/mm.h
> > @@ -1797,6 +1797,13 @@ extern void dump_page(struct page *page);
> >  extern void clear_huge_page(struct page *page,
> >                             unsigned long addr,
> >                             unsigned int pages_per_huge_page);
> > +extern void zero_huge_user_segment(struct page *page,
> > +               unsigned start, unsigned end);
> > +static inline void zero_huge_user(struct page *page,
> > +               unsigned start, unsigned len)
> > +{
> > +       zero_huge_user_segment(page, start, start + len);
> > +}
> >  extern void copy_user_huge_page(struct page *dst, struct page *src,
> >                                 unsigned long addr, struct vm_area_struct *vma,
> >                                 unsigned int pages_per_huge_page);
> > diff --git a/mm/memory.c b/mm/memory.c
> > index f7a1fba..f02a8be 100644
> > --- a/mm/memory.c
> > +++ b/mm/memory.c
> > @@ -4266,6 +4266,42 @@ void clear_huge_page(struct page *page,
> >         }
> >  }
> >
> > +void zero_huge_user_segment(struct page *page, unsigned start, unsigned end)
> > +{
> > +       int i;
> > +       unsigned start_idx, end_idx;
> > +       unsigned start_off, end_off;
> > +
> > +       BUG_ON(end < start);
> > +
> > +       might_sleep();
> > +
> > +       if (start == end)
> > +               return;
> > +
> > +       start_idx = start >> PAGE_SHIFT;
> > +       start_off = start & ~PAGE_MASK;
> > +       end_idx = (end - 1) >> PAGE_SHIFT;
> > +       end_off = ((end - 1) & ~PAGE_MASK) + 1;
> > +
> > +       /*
> > +        * if start and end are on the same small page we can call
> > +        * zero_user_segment() once and save one kmap_atomic().
> > +        */
> > +       if (start_idx == end_idx)
> > +               return zero_user_segment(page + start_idx, start_off, end_off);
> > +
> > +       /* zero the first (possibly partial) page */
> > +       zero_user_segment(page + start_idx, start_off, PAGE_SIZE);
> > +       for (i = start_idx + 1; i < end_idx; i++) {
> > +               cond_resched();
> > +               clear_highpage(page + i);
> > +               flush_dcache_page(page + i);
> 
> Can we use the function again?
> 	zero_user_segment(page + i, 0, PAGE_SIZE);

No. zero_user_segment() is memset()-based, while clear_highpage() is
highly optimized for whole-page clearing on many architectures.
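
(A simplified conceptual sketch of the difference; this is only the shape
of the two helpers, not the actual kernel implementations:)

	/* zero_user_segment(): generic byte-wise clear of an arbitrary sub-range */
	static inline void sketch_zero_user_segment(struct page *page,
						    unsigned start, unsigned end)
	{
		void *kaddr = kmap_atomic(page);

		memset(kaddr + start, 0, end - start);
		kunmap_atomic(kaddr);
		flush_dcache_page(page);
	}

	/* clear_highpage(): whole page, so the arch-optimized clear_page() applies */
	static inline void sketch_clear_highpage(struct page *page)
	{
		void *kaddr = kmap_atomic(page);

		clear_page(kaddr);
		kunmap_atomic(kaddr);
	}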

-- 
 Kirill A. Shutemov

^ permalink raw reply	[flat|nested] 243+ messages in thread

* Re: [PATCHv4 39/39] thp: map file-backed huge pages on fault
  2013-05-12  1:23   ` Kirill A. Shutemov
@ 2013-05-23 11:36     ` Hillf Danton
  -1 siblings, 0 replies; 243+ messages in thread
From: Hillf Danton @ 2013-05-23 11:36 UTC (permalink / raw)
  To: Kirill A. Shutemov
  Cc: Andrea Arcangeli, Andrew Morton, Al Viro, Hugh Dickins,
	Wu Fengguang, Jan Kara, Mel Gorman, linux-mm, Andi Kleen,
	Matthew Wilcox, Kirill A. Shutemov, Dave Hansen, linux-fsdevel,
	linux-kernel

On Sun, May 12, 2013 at 9:23 AM, Kirill A. Shutemov
<kirill.shutemov@linux.intel.com> wrote:
> From: "Kirill A. Shutemov" <kirill.shutemov@linux.intel.com>
>
> Looks like all the pieces are in place; we can map file-backed huge
> pages now.
>
> Signed-off-by: Kirill A. Shutemov <kirill.shutemov@linux.intel.com>
> ---
>  include/linux/huge_mm.h |    4 +++-
>  mm/memory.c             |    5 ++++-
>  2 files changed, 7 insertions(+), 2 deletions(-)
>
> diff --git a/include/linux/huge_mm.h b/include/linux/huge_mm.h
> index f4d6626..903f097 100644
> --- a/include/linux/huge_mm.h
> +++ b/include/linux/huge_mm.h
> @@ -78,7 +78,9 @@ extern bool is_vma_temporary_stack(struct vm_area_struct *vma);
>            (1<<TRANSPARENT_HUGEPAGE_REQ_MADV_FLAG) &&                   \
>            ((__vma)->vm_flags & VM_HUGEPAGE))) &&                       \
>          !((__vma)->vm_flags & VM_NOHUGEPAGE) &&                        \
> -        !is_vma_temporary_stack(__vma))
> +        !is_vma_temporary_stack(__vma) &&                              \
> +        (!(__vma)->vm_ops ||                                           \
> +                 mapping_can_have_hugepages((__vma)->vm_file->f_mapping)))

Redefine, why?

>  #define transparent_hugepage_defrag(__vma)                             \
>         ((transparent_hugepage_flags &                                  \
>           (1<<TRANSPARENT_HUGEPAGE_DEFRAG_FLAG)) ||                     \
> diff --git a/mm/memory.c b/mm/memory.c
> index ebff552..7fe9752 100644
> --- a/mm/memory.c
> +++ b/mm/memory.c
> @@ -3939,10 +3939,13 @@ retry:
>         if (!pmd)
>                 return VM_FAULT_OOM;
>         if (pmd_none(*pmd) && transparent_hugepage_enabled(vma)) {
> -               int ret = 0;
> +               int ret;
>                 if (!vma->vm_ops)
>                         ret = do_huge_pmd_anonymous_page(mm, vma, address,
>                                         pmd, flags);

Ah vma->vm_ops is checked here, so
		else if (mapping_can_have_hugepages())

> +               else
> +                       ret = do_huge_linear_fault(mm, vma, address,
> +                                       pmd, flags);
>                 if ((ret & VM_FAULT_FALLBACK) == 0)
>                         return ret;
>         } else {
> --
> 1.7.10.4
>

^ permalink raw reply	[flat|nested] 243+ messages in thread

* Re: [PATCHv4 39/39] thp: map file-backed huge pages on fault
  2013-05-23 11:36     ` Hillf Danton
@ 2013-05-23 11:48       ` Kirill A. Shutemov
  -1 siblings, 0 replies; 243+ messages in thread
From: Kirill A. Shutemov @ 2013-05-23 11:48 UTC (permalink / raw)
  To: Hillf Danton
  Cc: Kirill A. Shutemov, Andrea Arcangeli, Andrew Morton, Al Viro,
	Hugh Dickins, Wu Fengguang, Jan Kara, Mel Gorman, linux-mm,
	Andi Kleen, Matthew Wilcox, Kirill A. Shutemov, Dave Hansen,
	linux-fsdevel, linux-kernel

Hillf Danton wrote:
> On Sun, May 12, 2013 at 9:23 AM, Kirill A. Shutemov
> <kirill.shutemov@linux.intel.com> wrote:
> > From: "Kirill A. Shutemov" <kirill.shutemov@linux.intel.com>
> >
> > Looks like all the pieces are in place; we can map file-backed huge
> > pages now.
> >
> > Signed-off-by: Kirill A. Shutemov <kirill.shutemov@linux.intel.com>
> > ---
> >  include/linux/huge_mm.h |    4 +++-
> >  mm/memory.c             |    5 ++++-
> >  2 files changed, 7 insertions(+), 2 deletions(-)
> >
> > diff --git a/include/linux/huge_mm.h b/include/linux/huge_mm.h
> > index f4d6626..903f097 100644
> > --- a/include/linux/huge_mm.h
> > +++ b/include/linux/huge_mm.h
> > @@ -78,7 +78,9 @@ extern bool is_vma_temporary_stack(struct vm_area_struct *vma);
> >            (1<<TRANSPARENT_HUGEPAGE_REQ_MADV_FLAG) &&                   \
> >            ((__vma)->vm_flags & VM_HUGEPAGE))) &&                       \
> >          !((__vma)->vm_flags & VM_NOHUGEPAGE) &&                        \
> > -        !is_vma_temporary_stack(__vma))
> > +        !is_vma_temporary_stack(__vma) &&                              \
> > +        (!(__vma)->vm_ops ||                                           \
> > +                 mapping_can_have_hugepages((__vma)->vm_file->f_mapping)))
> 
> Redefine, why?
> 
> >  #define transparent_hugepage_defrag(__vma)                             \
> >         ((transparent_hugepage_flags &                                  \
> >           (1<<TRANSPARENT_HUGEPAGE_DEFRAG_FLAG)) ||                     \
> > diff --git a/mm/memory.c b/mm/memory.c
> > index ebff552..7fe9752 100644
> > --- a/mm/memory.c
> > +++ b/mm/memory.c
> > @@ -3939,10 +3939,13 @@ retry:
> >         if (!pmd)
> >                 return VM_FAULT_OOM;
> >         if (pmd_none(*pmd) && transparent_hugepage_enabled(vma)) {
> > -               int ret = 0;
> > +               int ret;
> >                 if (!vma->vm_ops)
> >                         ret = do_huge_pmd_anonymous_page(mm, vma, address,
> >                                         pmd, flags);
> 
> Ah vma->vm_ops is checked here, so
> 		else if (mapping_can_have_hugepages())

Okay, it's cleaner.
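
(Roughly, the resulting dispatch would look like this -- a sketch of the
suggestion above, not the committed code:)

	if (pmd_none(*pmd) && transparent_hugepage_enabled(vma)) {
		int ret;

		if (!vma->vm_ops)
			ret = do_huge_pmd_anonymous_page(mm, vma, address,
					pmd, flags);
		else if (mapping_can_have_hugepages(vma->vm_file->f_mapping))
			ret = do_huge_linear_fault(mm, vma, address,
					pmd, flags);
		else
			ret = VM_FAULT_FALLBACK;

		if ((ret & VM_FAULT_FALLBACK) == 0)
			return ret;
	}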

-- 
 Kirill A. Shutemov

^ permalink raw reply	[flat|nested] 243+ messages in thread

* Re: [PATCHv4 37/39] thp: handle write-protect exception to file-backed huge pages
  2013-05-12  1:23   ` Kirill A. Shutemov
@ 2013-05-23 11:57     ` Hillf Danton
  -1 siblings, 0 replies; 243+ messages in thread
From: Hillf Danton @ 2013-05-23 11:57 UTC (permalink / raw)
  To: Kirill A. Shutemov
  Cc: Andrea Arcangeli, Andrew Morton, Al Viro, Hugh Dickins,
	Wu Fengguang, Jan Kara, Mel Gorman, linux-mm, Andi Kleen,
	Matthew Wilcox, Kirill A. Shutemov, Dave Hansen, linux-fsdevel,
	linux-kernel

On Sun, May 12, 2013 at 9:23 AM, Kirill A. Shutemov
<kirill.shutemov@linux.intel.com> wrote:
> @@ -1120,7 +1119,7 @@ int do_huge_pmd_wp_page(struct mm_struct *mm, struct vm_area_struct *vma,
>
>         page = pmd_page(orig_pmd);
>         VM_BUG_ON(!PageCompound(page) || !PageHead(page));
> -       if (page_mapcount(page) == 1) {
> +       if (PageAnon(page) && page_mapcount(page) == 1) {

Could we avoid copying huge page if
no-one else is using it, no matter anon?

>                 pmd_t entry;
>                 entry = pmd_mkyoung(orig_pmd);
>                 entry = maybe_pmd_mkwrite(pmd_mkdirty(entry), vma);

^ permalink raw reply	[flat|nested] 243+ messages in thread

* Re: [PATCHv4 37/39] thp: handle write-protect exception to file-backed huge pages
@ 2013-05-23 11:57     ` Hillf Danton
  0 siblings, 0 replies; 243+ messages in thread
From: Hillf Danton @ 2013-05-23 11:57 UTC (permalink / raw)
  To: Kirill A. Shutemov
  Cc: Andrea Arcangeli, Andrew Morton, Al Viro, Hugh Dickins,
	Wu Fengguang, Jan Kara, Mel Gorman, linux-mm, Andi Kleen,
	Matthew Wilcox, Kirill A. Shutemov, Dave Hansen, linux-fsdevel,
	linux-kernel

On Sun, May 12, 2013 at 9:23 AM, Kirill A. Shutemov
<kirill.shutemov@linux.intel.com> wrote:
> @@ -1120,7 +1119,7 @@ int do_huge_pmd_wp_page(struct mm_struct *mm, struct vm_area_struct *vma,
>
>         page = pmd_page(orig_pmd);
>         VM_BUG_ON(!PageCompound(page) || !PageHead(page));
> -       if (page_mapcount(page) == 1) {
> +       if (PageAnon(page) && page_mapcount(page) == 1) {

Could we avoid copying huge page if
no-one else is using it, no matter anon?

>                 pmd_t entry;
>                 entry = pmd_mkyoung(orig_pmd);
>                 entry = maybe_pmd_mkwrite(pmd_mkdirty(entry), vma);

^ permalink raw reply	[flat|nested] 243+ messages in thread

* Re: [PATCHv4 37/39] thp: handle write-protect exception to file-backed huge pages
  2013-05-23 11:57     ` Hillf Danton
@ 2013-05-23 12:08       ` Kirill A. Shutemov
  -1 siblings, 0 replies; 243+ messages in thread
From: Kirill A. Shutemov @ 2013-05-23 12:08 UTC (permalink / raw)
  To: Hillf Danton
  Cc: Kirill A. Shutemov, Andrea Arcangeli, Andrew Morton, Al Viro,
	Hugh Dickins, Wu Fengguang, Jan Kara, Mel Gorman, linux-mm,
	Andi Kleen, Matthew Wilcox, Kirill A. Shutemov, Dave Hansen,
	linux-fsdevel, linux-kernel

Hillf Danton wrote:
> On Sun, May 12, 2013 at 9:23 AM, Kirill A. Shutemov
> <kirill.shutemov@linux.intel.com> wrote:
> > @@ -1120,7 +1119,7 @@ int do_huge_pmd_wp_page(struct mm_struct *mm, struct vm_area_struct *vma,
> >
> >         page = pmd_page(orig_pmd);
> >         VM_BUG_ON(!PageCompound(page) || !PageHead(page));
> > -       if (page_mapcount(page) == 1) {
> > +       if (PageAnon(page) && page_mapcount(page) == 1) {
> 
> Could we avoid copying huge page if
> no-one else is using it, no matter anon?

No. The page is still in the page cache and can be accessed later.
We could isolate the page from the page cache, but I'm not sure whether
that's a good idea.

do_wp_page() does exactly the same for small pages, so let's keep it
consistent.
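
For reference, the small-page decision boils down to roughly the sketch
below (a simplified illustration of the do_wp_page() logic, not verbatim
kernel code; page locking and error handling are omitted):

	if (PageAnon(old_page) && !PageKsm(old_page) &&
	    reuse_swap_page(old_page)) {
		/* exclusive anonymous page: reuse it, just make the PTE writable */
	} else if ((vma->vm_flags & (VM_WRITE|VM_SHARED)) ==
					(VM_WRITE|VM_SHARED)) {
		/* shared writable mapping: keep the page cache page, notify the fs */
	} else {
		/* private mapping of a file page: CoW into a new anonymous page */
	}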

-- 
 Kirill A. Shutemov

^ permalink raw reply	[flat|nested] 243+ messages in thread

* Re: [PATCHv4 37/39] thp: handle write-protect exception to file-backed huge pages
@ 2013-05-23 12:08       ` Kirill A. Shutemov
  0 siblings, 0 replies; 243+ messages in thread
From: Kirill A. Shutemov @ 2013-05-23 12:08 UTC (permalink / raw)
  To: Hillf Danton
  Cc: Kirill A. Shutemov, Andrea Arcangeli, Andrew Morton, Al Viro,
	Hugh Dickins, Wu Fengguang, Jan Kara, Mel Gorman, linux-mm,
	Andi Kleen, Matthew Wilcox, Kirill A. Shutemov, Dave Hansen,
	linux-fsdevel, linux-kernel

Hillf Danton wrote:
> On Sun, May 12, 2013 at 9:23 AM, Kirill A. Shutemov
> <kirill.shutemov@linux.intel.com> wrote:
> > @@ -1120,7 +1119,7 @@ int do_huge_pmd_wp_page(struct mm_struct *mm, struct vm_area_struct *vma,
> >
> >         page = pmd_page(orig_pmd);
> >         VM_BUG_ON(!PageCompound(page) || !PageHead(page));
> > -       if (page_mapcount(page) == 1) {
> > +       if (PageAnon(page) && page_mapcount(page) == 1) {
> 
> Could we avoid copying huge page if
> no-one else is using it, no matter anon?

No. The page is still in the page cache and can be accessed later.
We could isolate the page from the page cache, but I'm not sure whether
that's a good idea.

do_wp_page() does exactly the same for small pages, so let's keep it
consistent.

-- 
 Kirill A. Shutemov

^ permalink raw reply	[flat|nested] 243+ messages in thread

* Re: [PATCHv4 37/39] thp: handle write-protect exception to file-backed huge pages
  2013-05-23 12:08       ` Kirill A. Shutemov
@ 2013-05-23 12:12         ` Hillf Danton
  -1 siblings, 0 replies; 243+ messages in thread
From: Hillf Danton @ 2013-05-23 12:12 UTC (permalink / raw)
  To: Kirill A. Shutemov
  Cc: Andrea Arcangeli, Andrew Morton, Al Viro, Hugh Dickins,
	Wu Fengguang, Jan Kara, Mel Gorman, linux-mm, Andi Kleen,
	Matthew Wilcox, Kirill A. Shutemov, Dave Hansen, linux-fsdevel,
	linux-kernel

On Thu, May 23, 2013 at 8:08 PM, Kirill A. Shutemov
<kirill.shutemov@linux.intel.com> wrote:
> Hillf Danton wrote:
>> On Sun, May 12, 2013 at 9:23 AM, Kirill A. Shutemov
>> <kirill.shutemov@linux.intel.com> wrote:
>> > @@ -1120,7 +1119,7 @@ int do_huge_pmd_wp_page(struct mm_struct *mm, struct vm_area_struct *vma,
>> >
>> >         page = pmd_page(orig_pmd);
>> >         VM_BUG_ON(!PageCompound(page) || !PageHead(page));
>> > -       if (page_mapcount(page) == 1) {
>> > +       if (PageAnon(page) && page_mapcount(page) == 1) {
>>
>> Could we avoid copying huge page if
>> no-one else is using it, no matter anon?
>
> No. The page is still in the page cache and can be accessed later.
> We could isolate the page from the page cache, but I'm not sure whether
> that's a good idea.
>
Hugetlb tries to avoid copying the page.

	/* If no-one else is actually using this page, avoid the copy
	 * and just make the page writable */
	avoidcopy = (page_mapcount(old_page) == 1);

^ permalink raw reply	[flat|nested] 243+ messages in thread

* Re: [PATCHv4 37/39] thp: handle write-protect exception to file-backed huge pages
@ 2013-05-23 12:12         ` Hillf Danton
  0 siblings, 0 replies; 243+ messages in thread
From: Hillf Danton @ 2013-05-23 12:12 UTC (permalink / raw)
  To: Kirill A. Shutemov
  Cc: Andrea Arcangeli, Andrew Morton, Al Viro, Hugh Dickins,
	Wu Fengguang, Jan Kara, Mel Gorman, linux-mm, Andi Kleen,
	Matthew Wilcox, Kirill A. Shutemov, Dave Hansen, linux-fsdevel,
	linux-kernel

On Thu, May 23, 2013 at 8:08 PM, Kirill A. Shutemov
<kirill.shutemov@linux.intel.com> wrote:
> Hillf Danton wrote:
>> On Sun, May 12, 2013 at 9:23 AM, Kirill A. Shutemov
>> <kirill.shutemov@linux.intel.com> wrote:
>> > @@ -1120,7 +1119,7 @@ int do_huge_pmd_wp_page(struct mm_struct *mm, struct vm_area_struct *vma,
>> >
>> >         page = pmd_page(orig_pmd);
>> >         VM_BUG_ON(!PageCompound(page) || !PageHead(page));
>> > -       if (page_mapcount(page) == 1) {
>> > +       if (PageAnon(page) && page_mapcount(page) == 1) {
>>
>> Could we avoid copying huge page if
>> no-one else is using it, no matter anon?
>
> No. The page is still in the page cache and can be accessed later.
> We could isolate the page from the page cache, but I'm not sure whether
> that's a good idea.
>
Hugetlb tries to avoid copying the page.

	/* If no-one else is actually using this page, avoid the copy
	 * and just make the page writable */
	avoidcopy = (page_mapcount(old_page) == 1);

^ permalink raw reply	[flat|nested] 243+ messages in thread

* Re: [PATCHv4 37/39] thp: handle write-protect exception to file-backed huge pages
  2013-05-23 12:12         ` Hillf Danton
@ 2013-05-23 12:33           ` Kirill A. Shutemov
  -1 siblings, 0 replies; 243+ messages in thread
From: Kirill A. Shutemov @ 2013-05-23 12:33 UTC (permalink / raw)
  To: Hillf Danton
  Cc: Kirill A. Shutemov, Andrea Arcangeli, Andrew Morton, Al Viro,
	Hugh Dickins, Wu Fengguang, Jan Kara, Mel Gorman, linux-mm,
	Andi Kleen, Matthew Wilcox, Kirill A. Shutemov, Dave Hansen,
	linux-fsdevel, linux-kernel

Hillf Danton wrote:
> On Thu, May 23, 2013 at 8:08 PM, Kirill A. Shutemov
> <kirill.shutemov@linux.intel.com> wrote:
> > Hillf Danton wrote:
> >> On Sun, May 12, 2013 at 9:23 AM, Kirill A. Shutemov
> >> <kirill.shutemov@linux.intel.com> wrote:
> >> > @@ -1120,7 +1119,7 @@ int do_huge_pmd_wp_page(struct mm_struct *mm, struct vm_area_struct *vma,
> >> >
> >> >         page = pmd_page(orig_pmd);
> >> >         VM_BUG_ON(!PageCompound(page) || !PageHead(page));
> >> > -       if (page_mapcount(page) == 1) {
> >> > +       if (PageAnon(page) && page_mapcount(page) == 1) {
> >>
> >> Could we avoid copying huge page if
> >> no-one else is using it, no matter anon?
> >
> > No. The page is still in the page cache and can be accessed later.
> > We could isolate the page from the page cache, but I'm not sure whether
> > that's a good idea.
> >
> Hugetlb tries to avoid copying the page.
> 
> 	/* If no-one else is actually using this page, avoid the copy
> 	 * and just make the page writable */
> 	avoidcopy = (page_mapcount(old_page) == 1);

It makes sense for hugetlb, since it is RAM-backed only.

Currently, the project supports only ramfs, but I hope we will bring in
storage-backed filesystems later. For them it would be much cheaper to
copy the page than to bring it back later from storage.

And one more point: we must never reuse dirty pages, since that would lead
to data loss. And ramfs pages are always dirty.
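
To spell out how that reads in the hunk above (the comments are mine,
added for clarity, not part of the patch):

	page = pmd_page(orig_pmd);
	VM_BUG_ON(!PageCompound(page) || !PageHead(page));
	/*
	 * Reuse in place only for an exclusively mapped anonymous huge page.
	 * A file-backed page is still reachable through the page cache, so
	 * reusing it would either expose the private write to other users of
	 * the file or, if we isolated it from the page cache, drop dirty data
	 * (ramfs pages are always dirty).  Fall through to the copy path,
	 * just like do_wp_page() does for small pages.
	 */
	if (PageAnon(page) && page_mapcount(page) == 1) {
		pmd_t entry;
		entry = pmd_mkyoung(orig_pmd);
		entry = maybe_pmd_mkwrite(pmd_mkdirty(entry), vma);
		/* ... mark the pmd writable and return ... */
	}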

-- 
 Kirill A. Shutemov

^ permalink raw reply	[flat|nested] 243+ messages in thread

* Re: [PATCHv4 37/39] thp: handle write-protect exception to file-backed huge pages
@ 2013-05-23 12:33           ` Kirill A. Shutemov
  0 siblings, 0 replies; 243+ messages in thread
From: Kirill A. Shutemov @ 2013-05-23 12:33 UTC (permalink / raw)
  To: Hillf Danton
  Cc: Kirill A. Shutemov, Andrea Arcangeli, Andrew Morton, Al Viro,
	Hugh Dickins, Wu Fengguang, Jan Kara, Mel Gorman, linux-mm,
	Andi Kleen, Matthew Wilcox, Kirill A. Shutemov, Dave Hansen,
	linux-fsdevel, linux-kernel

Hillf Danton wrote:
> On Thu, May 23, 2013 at 8:08 PM, Kirill A. Shutemov
> <kirill.shutemov@linux.intel.com> wrote:
> > Hillf Danton wrote:
> >> On Sun, May 12, 2013 at 9:23 AM, Kirill A. Shutemov
> >> <kirill.shutemov@linux.intel.com> wrote:
> >> > @@ -1120,7 +1119,7 @@ int do_huge_pmd_wp_page(struct mm_struct *mm, struct vm_area_struct *vma,
> >> >
> >> >         page = pmd_page(orig_pmd);
> >> >         VM_BUG_ON(!PageCompound(page) || !PageHead(page));
> >> > -       if (page_mapcount(page) == 1) {
> >> > +       if (PageAnon(page) && page_mapcount(page) == 1) {
> >>
> >> Could we avoid copying huge page if
> >> no-one else is using it, no matter anon?
> >
> > No. The page is still in the page cache and can be accessed later.
> > We could isolate the page from the page cache, but I'm not sure whether
> > that's a good idea.
> >
> Hugetlb tries to avoid copying the page.
> 
> 	/* If no-one else is actually using this page, avoid the copy
> 	 * and just make the page writable */
> 	avoidcopy = (page_mapcount(old_page) == 1);

It makes sense for hugetlb, since it is RAM-backed only.

Currently, the project supports only ramfs, but I hope we will bring in
storage-backed filesystems later. For them it would be much cheaper to
copy the page than to bring it back later from storage.

And one more point: we must never reuse dirty pages, since that would lead
to data loss. And ramfs pages are always dirty.

-- 
 Kirill A. Shutemov

^ permalink raw reply	[flat|nested] 243+ messages in thread

* Re: [PATCHv4 12/39] thp, mm: rewrite add_to_page_cache_locked() to support huge pages
  2013-05-21 19:59     ` Dave Hansen
@ 2013-05-23 14:36       ` Kirill A. Shutemov
  -1 siblings, 0 replies; 243+ messages in thread
From: Kirill A. Shutemov @ 2013-05-23 14:36 UTC (permalink / raw)
  To: Dave Hansen
  Cc: Kirill A. Shutemov, Andrea Arcangeli, Andrew Morton, Al Viro,
	Hugh Dickins, Wu Fengguang, Jan Kara, Mel Gorman, linux-mm,
	Andi Kleen, Matthew Wilcox, Kirill A. Shutemov, Hillf Danton,
	linux-fsdevel, linux-kernel

Dave Hansen wrote:
> On 05/11/2013 06:23 PM, Kirill A. Shutemov wrote:
> > From: "Kirill A. Shutemov" <kirill.shutemov@linux.intel.com>
> > 
> > For huge page we add to radix tree HPAGE_CACHE_NR pages at once: head
> > page for the specified index and HPAGE_CACHE_NR-1 tail pages for
> > following indexes.
> 
> The really nice way to do these patches is to refactor them first, with no
> behavior change, in one patch, then introduce the new support in the
> second one.

I've split it into two patches.

> > diff --git a/mm/filemap.c b/mm/filemap.c
> > index 61158ac..b0c7c8c 100644
> > --- a/mm/filemap.c
> > +++ b/mm/filemap.c
> > @@ -460,39 +460,62 @@ int add_to_page_cache_locked(struct page *page, struct address_space *mapping,
> >  		pgoff_t offset, gfp_t gfp_mask)
> >  {
> >  	int error;
> > +	int i, nr;
> >  
> >  	VM_BUG_ON(!PageLocked(page));
> >  	VM_BUG_ON(PageSwapBacked(page));
> >  
> > +	/* memory cgroup controller handles thp pages on its side */
> >  	error = mem_cgroup_cache_charge(page, current->mm,
> >  					gfp_mask & GFP_RECLAIM_MASK);
> >  	if (error)
> > -		goto out;
> > -
> > -	error = radix_tree_preload(gfp_mask & ~__GFP_HIGHMEM);
> > -	if (error == 0) {
> > -		page_cache_get(page);
> > -		page->mapping = mapping;
> > -		page->index = offset;
> > +		return error;
> >  
> > -		spin_lock_irq(&mapping->tree_lock);
> > -		error = radix_tree_insert(&mapping->page_tree, offset, page);
> > -		if (likely(!error)) {
> > -			mapping->nrpages++;
> > -			__inc_zone_page_state(page, NR_FILE_PAGES);
> > -			spin_unlock_irq(&mapping->tree_lock);
> > -			trace_mm_filemap_add_to_page_cache(page);
> > -		} else {
> > -			page->mapping = NULL;
> > -			/* Leave page->index set: truncation relies upon it */
> > -			spin_unlock_irq(&mapping->tree_lock);
> > -			mem_cgroup_uncharge_cache_page(page);
> > -			page_cache_release(page);
> > -		}
> > -		radix_tree_preload_end();
> > -	} else
> > +	if (IS_ENABLED(CONFIG_TRANSPARENT_HUGEPAGE_PAGECACHE)) {
> > +		BUILD_BUG_ON(HPAGE_CACHE_NR > RADIX_TREE_PRELOAD_NR);
> > +		nr = hpage_nr_pages(page);
> > +	} else {
> > +		BUG_ON(PageTransHuge(page));
> > +		nr = 1;
> > +	}
> 
> Why can't this just be
> 
> 		nr = hpage_nr_pages(page);
> 
> Are you trying to optimize for the THP=y, but THP-pagecache=n case?

Yes, I try to optimize for the case.

> > +		if (error)
> > +			goto err;
> 
> I know it's not a super-common thing in the kernel, but could you call
> this "insert_err" or something?

I've changed it to err_insert.

> > +	}
> > +	__mod_zone_page_state(page_zone(page), NR_FILE_PAGES, nr);
> > +	if (PageTransHuge(page))
> > +		__inc_zone_page_state(page, NR_FILE_TRANSPARENT_HUGEPAGES);
> > +	mapping->nrpages += nr;
> > +	spin_unlock_irq(&mapping->tree_lock);
> > +	radix_tree_preload_end();
> > +	trace_mm_filemap_add_to_page_cache(page);
> > +	return 0;
> > +err:
> > +	if (i != 0)
> > +		error = -ENOSPC; /* no space for a huge page */
> > +	page_cache_release(page + i);
> > +	page[i].mapping = NULL;
> 
> I guess it's a slight behaviour change (I think it's harmless) but if
> you delay doing the page_cache_get() and page[i].mapping= until after
> the radix tree insertion, you can avoid these two lines.

Hm. I don't think it's safe. The spinlock protects radix-tree against
modification, but find_get_page() can see it just after
radix_tree_insert().

The page is locked and IIUC never uptodate at this point, so nobody will
be able to do much with it, but leaving it without a valid ->mapping is a
bad idea.

> > +	for (i--; i >= 0; i--) {
> 
> I kinda glossed over that initial "i--".  It might be worth a quick
> comment to call it out.

Okay.

> > +		/* Leave page->index set: truncation relies upon it */
> > +		page[i].mapping = NULL;
> > +		radix_tree_delete(&mapping->page_tree, offset + i);
> > +		page_cache_release(page + i);
> > +	}
> > +	spin_unlock_irq(&mapping->tree_lock);
> > +	radix_tree_preload_end();
> > +	mem_cgroup_uncharge_cache_page(page);
> >  	return error;
> >  }
> 
> FWIW, I think you can move the radix_tree_preload_end() up a bit.  I
> guess it won't make any practical difference since you're holding a
> spinlock, but it at least makes the point that you're not depending on
> it any more.

Good point.

> I'm also trying to figure out how and when you'd actually have to unroll
> a partial-huge-page worth of radix_tree_insert().  In the small-page
> case, you can collide with another guy inserting in to the page cache.
> But, can that happen in the _middle_ of a THP?

E.g. if you enable THP after some uptime, the mapping can contain small
pages already.
Or if a process maps the file with bad alignment (MAP_FIXED) and touches
the area, it will get small pages.

> Despite my nits, the code still looks correct here, so:
> 
> Acked-by: Dave Hansen <dave.hansen@linux.intel.com>

The incremental diff for the patch is below. I guess it's still valid to
use your ack, right?

diff --git a/mm/filemap.c b/mm/filemap.c
index f643062..d004331 100644
--- a/mm/filemap.c
+++ b/mm/filemap.c
@@ -492,29 +492,33 @@ int add_to_page_cache_locked(struct page *page, struct address_space *mapping,
 		error = radix_tree_insert(&mapping->page_tree,
 				offset + i, page + i);
 		if (error)
-			goto err;
+			goto err_insert;
 	}
+	radix_tree_preload_end();
 	__mod_zone_page_state(page_zone(page), NR_FILE_PAGES, nr);
 	if (PageTransHuge(page))
 		__inc_zone_page_state(page, NR_FILE_TRANSPARENT_HUGEPAGES);
 	mapping->nrpages += nr;
 	spin_unlock_irq(&mapping->tree_lock);
-	radix_tree_preload_end();
 	trace_mm_filemap_add_to_page_cache(page);
 	return 0;
-err:
+err_insert:
+	radix_tree_preload_end();
 	if (i != 0)
 		error = -ENOSPC; /* no space for a huge page */
+
+	/* page[i] was not inserted to tree, handle separately */
 	page_cache_release(page + i);
 	page[i].mapping = NULL;
-	for (i--; i >= 0; i--) {
+	i--;
+
+	for (; i >= 0; i--) {
 		/* Leave page->index set: truncation relies upon it */
 		page[i].mapping = NULL;
 		radix_tree_delete(&mapping->page_tree, offset + i);
 		page_cache_release(page + i);
 	}
 	spin_unlock_irq(&mapping->tree_lock);
-	radix_tree_preload_end();
 	mem_cgroup_uncharge_cache_page(page);
 	return error;
 }
-- 
 Kirill A. Shutemov

^ permalink raw reply related	[flat|nested] 243+ messages in thread

* Re: [PATCHv4 12/39] thp, mm: rewrite add_to_page_cache_locked() to support huge pages
@ 2013-05-23 14:36       ` Kirill A. Shutemov
  0 siblings, 0 replies; 243+ messages in thread
From: Kirill A. Shutemov @ 2013-05-23 14:36 UTC (permalink / raw)
  To: Dave Hansen
  Cc: Kirill A. Shutemov, Andrea Arcangeli, Andrew Morton, Al Viro,
	Hugh Dickins, Wu Fengguang, Jan Kara, Mel Gorman, linux-mm,
	Andi Kleen, Matthew Wilcox, Kirill A. Shutemov, Hillf Danton,
	linux-fsdevel, linux-kernel

Dave Hansen wrote:
> On 05/11/2013 06:23 PM, Kirill A. Shutemov wrote:
> > From: "Kirill A. Shutemov" <kirill.shutemov@linux.intel.com>
> > 
> > For huge page we add to radix tree HPAGE_CACHE_NR pages at once: head
> > page for the specified index and HPAGE_CACHE_NR-1 tail pages for
> > following indexes.
> 
> The really nice way to do these patches is to refactor them first, with no
> behavior change, in one patch, then introduce the new support in the
> second one.

I've split it into two patches.

> > diff --git a/mm/filemap.c b/mm/filemap.c
> > index 61158ac..b0c7c8c 100644
> > --- a/mm/filemap.c
> > +++ b/mm/filemap.c
> > @@ -460,39 +460,62 @@ int add_to_page_cache_locked(struct page *page, struct address_space *mapping,
> >  		pgoff_t offset, gfp_t gfp_mask)
> >  {
> >  	int error;
> > +	int i, nr;
> >  
> >  	VM_BUG_ON(!PageLocked(page));
> >  	VM_BUG_ON(PageSwapBacked(page));
> >  
> > +	/* memory cgroup controller handles thp pages on its side */
> >  	error = mem_cgroup_cache_charge(page, current->mm,
> >  					gfp_mask & GFP_RECLAIM_MASK);
> >  	if (error)
> > -		goto out;
> > -
> > -	error = radix_tree_preload(gfp_mask & ~__GFP_HIGHMEM);
> > -	if (error == 0) {
> > -		page_cache_get(page);
> > -		page->mapping = mapping;
> > -		page->index = offset;
> > +		return error;
> >  
> > -		spin_lock_irq(&mapping->tree_lock);
> > -		error = radix_tree_insert(&mapping->page_tree, offset, page);
> > -		if (likely(!error)) {
> > -			mapping->nrpages++;
> > -			__inc_zone_page_state(page, NR_FILE_PAGES);
> > -			spin_unlock_irq(&mapping->tree_lock);
> > -			trace_mm_filemap_add_to_page_cache(page);
> > -		} else {
> > -			page->mapping = NULL;
> > -			/* Leave page->index set: truncation relies upon it */
> > -			spin_unlock_irq(&mapping->tree_lock);
> > -			mem_cgroup_uncharge_cache_page(page);
> > -			page_cache_release(page);
> > -		}
> > -		radix_tree_preload_end();
> > -	} else
> > +	if (IS_ENABLED(CONFIG_TRANSPARENT_HUGEPAGE_PAGECACHE)) {
> > +		BUILD_BUG_ON(HPAGE_CACHE_NR > RADIX_TREE_PRELOAD_NR);
> > +		nr = hpage_nr_pages(page);
> > +	} else {
> > +		BUG_ON(PageTransHuge(page));
> > +		nr = 1;
> > +	}
> 
> Why can't this just be
> 
> 		nr = hpage_nr_pages(page);
> 
> Are you trying to optimize for the THP=y, but THP-pagecache=n case?

Yes, I try to optimize for the case.

> > +		if (error)
> > +			goto err;
> 
> I know it's not a super-common thing in the kernel, but could you call
> this "insert_err" or something?

I've changed it to err_insert.

> > +	}
> > +	__mod_zone_page_state(page_zone(page), NR_FILE_PAGES, nr);
> > +	if (PageTransHuge(page))
> > +		__inc_zone_page_state(page, NR_FILE_TRANSPARENT_HUGEPAGES);
> > +	mapping->nrpages += nr;
> > +	spin_unlock_irq(&mapping->tree_lock);
> > +	radix_tree_preload_end();
> > +	trace_mm_filemap_add_to_page_cache(page);
> > +	return 0;
> > +err:
> > +	if (i != 0)
> > +		error = -ENOSPC; /* no space for a huge page */
> > +	page_cache_release(page + i);
> > +	page[i].mapping = NULL;
> 
> I guess it's a slight behaviour change (I think it's harmless) but if
> you delay doing the page_cache_get() and page[i].mapping= until after
> the radix tree insertion, you can avoid these two lines.

Hm. I don't think it's safe. The spinlock protects radix-tree against
modification, but find_get_page() can see it just after
radix_tree_insert().

The page is locked and IIUC never uptodate at this point, so nobody will
be able to do much with it, but leaving it without a valid ->mapping is a
bad idea.

> > +	for (i--; i >= 0; i--) {
> 
> I kinda glossed over that initial "i--".  It might be worth a quick
> comment to call it out.

Okay.

> > +		/* Leave page->index set: truncation relies upon it */
> > +		page[i].mapping = NULL;
> > +		radix_tree_delete(&mapping->page_tree, offset + i);
> > +		page_cache_release(page + i);
> > +	}
> > +	spin_unlock_irq(&mapping->tree_lock);
> > +	radix_tree_preload_end();
> > +	mem_cgroup_uncharge_cache_page(page);
> >  	return error;
> >  }
> 
> FWIW, I think you can move the radix_tree_preload_end() up a bit.  I
> guess it won't make any practical difference since you're holding a
> spinlock, but it at least makes the point that you're not depending on
> it any more.

Good point.

> I'm also trying to figure out how and when you'd actually have to unroll
> a partial-huge-page worth of radix_tree_insert().  In the small-page
> case, you can collide with another guy inserting in to the page cache.
> But, can that happen in the _middle_ of a THP?

E.g. if you enable THP after some uptime, the mapping can contain small
pages already.
Or if a process maps the file with bad alignment (MAP_FIXED) and touches
the area, it will get small pages.

> Despite my nits, the code still looks correct here, so:
> 
> Acked-by: Dave Hansen <dave.hansen@linux.intel.com>

The incremental diff for the patch is below. I guess it's still valid to
use your ack, right?

diff --git a/mm/filemap.c b/mm/filemap.c
index f643062..d004331 100644
--- a/mm/filemap.c
+++ b/mm/filemap.c
@@ -492,29 +492,33 @@ int add_to_page_cache_locked(struct page *page, struct address_space *mapping,
 		error = radix_tree_insert(&mapping->page_tree,
 				offset + i, page + i);
 		if (error)
-			goto err;
+			goto err_insert;
 	}
+	radix_tree_preload_end();
 	__mod_zone_page_state(page_zone(page), NR_FILE_PAGES, nr);
 	if (PageTransHuge(page))
 		__inc_zone_page_state(page, NR_FILE_TRANSPARENT_HUGEPAGES);
 	mapping->nrpages += nr;
 	spin_unlock_irq(&mapping->tree_lock);
-	radix_tree_preload_end();
 	trace_mm_filemap_add_to_page_cache(page);
 	return 0;
-err:
+err_insert:
+	radix_tree_preload_end();
 	if (i != 0)
 		error = -ENOSPC; /* no space for a huge page */
+
+	/* page[i] was not inserted to tree, handle separately */
 	page_cache_release(page + i);
 	page[i].mapping = NULL;
-	for (i--; i >= 0; i--) {
+	i--;
+
+	for (; i >= 0; i--) {
 		/* Leave page->index set: truncation relies upon it */
 		page[i].mapping = NULL;
 		radix_tree_delete(&mapping->page_tree, offset + i);
 		page_cache_release(page + i);
 	}
 	spin_unlock_irq(&mapping->tree_lock);
-	radix_tree_preload_end();
 	mem_cgroup_uncharge_cache_page(page);
 	return error;
 }
-- 
 Kirill A. Shutemov

^ permalink raw reply related	[flat|nested] 243+ messages in thread

* Re: [PATCHv4 07/39] thp, mm: basic defines for transparent huge page cache
  2013-05-23 10:36     ` Hillf Danton
@ 2013-05-23 15:49       ` Dave Hansen
  -1 siblings, 0 replies; 243+ messages in thread
From: Dave Hansen @ 2013-05-23 15:49 UTC (permalink / raw)
  To: Hillf Danton
  Cc: Kirill A. Shutemov, Andrea Arcangeli, Andrew Morton, Al Viro,
	Hugh Dickins, Wu Fengguang, Jan Kara, Mel Gorman, linux-mm,
	Andi Kleen, Matthew Wilcox, Kirill A. Shutemov, linux-fsdevel,
	linux-kernel

On 05/23/2013 03:36 AM, Hillf Danton wrote:
> On Sun, May 12, 2013 at 9:23 AM, Kirill A. Shutemov
> <kirill.shutemov@linux.intel.com> wrote:
>> > From: "Kirill A. Shutemov" <kirill.shutemov@linux.intel.com>
> Better if one or two sentences are prepared to show that the following
> defines are necessary.
...
>> >
>> > +#define HPAGE_CACHE_ORDER      (HPAGE_SHIFT - PAGE_CACHE_SHIFT)
>> > +#define HPAGE_CACHE_NR         (1L << HPAGE_CACHE_ORDER)
>> > +#define HPAGE_CACHE_INDEX_MASK (HPAGE_CACHE_NR - 1)

Yeah, or just stick them in the patch that uses them first.  These
aren't exactly rocket science.

^ permalink raw reply	[flat|nested] 243+ messages in thread

* Re: [PATCHv4 07/39] thp, mm: basic defines for transparent huge page cache
@ 2013-05-23 15:49       ` Dave Hansen
  0 siblings, 0 replies; 243+ messages in thread
From: Dave Hansen @ 2013-05-23 15:49 UTC (permalink / raw)
  To: Hillf Danton
  Cc: Kirill A. Shutemov, Andrea Arcangeli, Andrew Morton, Al Viro,
	Hugh Dickins, Wu Fengguang, Jan Kara, Mel Gorman, linux-mm,
	Andi Kleen, Matthew Wilcox, Kirill A. Shutemov, linux-fsdevel,
	linux-kernel

On 05/23/2013 03:36 AM, Hillf Danton wrote:
> On Sun, May 12, 2013 at 9:23 AM, Kirill A. Shutemov
> <kirill.shutemov@linux.intel.com> wrote:
>> > From: "Kirill A. Shutemov" <kirill.shutemov@linux.intel.com>
> Better if one or two sentences are prepared to show that the following
> defines are necessary.
...
>> >
>> > +#define HPAGE_CACHE_ORDER      (HPAGE_SHIFT - PAGE_CACHE_SHIFT)
>> > +#define HPAGE_CACHE_NR         (1L << HPAGE_CACHE_ORDER)
>> > +#define HPAGE_CACHE_INDEX_MASK (HPAGE_CACHE_NR - 1)

Yeah, or just stick them in the patch that uses them first.  These
aren't exactly rocket science.

^ permalink raw reply	[flat|nested] 243+ messages in thread

* Re: [PATCHv4 12/39] thp, mm: rewrite add_to_page_cache_locked() to support huge pages
  2013-05-23 14:36       ` Kirill A. Shutemov
@ 2013-05-23 16:00         ` Dave Hansen
  -1 siblings, 0 replies; 243+ messages in thread
From: Dave Hansen @ 2013-05-23 16:00 UTC (permalink / raw)
  To: Kirill A. Shutemov
  Cc: Andrea Arcangeli, Andrew Morton, Al Viro, Hugh Dickins,
	Wu Fengguang, Jan Kara, Mel Gorman, linux-mm, Andi Kleen,
	Matthew Wilcox, Kirill A. Shutemov, Hillf Danton, linux-fsdevel,
	linux-kernel

On 05/23/2013 07:36 AM, Kirill A. Shutemov wrote:
>>> +	if (IS_ENABLED(CONFIG_TRANSPARENT_HUGEPAGE_PAGECACHE)) {
>>> +		BUILD_BUG_ON(HPAGE_CACHE_NR > RADIX_TREE_PRELOAD_NR);
>>> +		nr = hpage_nr_pages(page);
>>> +	} else {
>>> +		BUG_ON(PageTransHuge(page));
>>> +		nr = 1;
>>> +	}
>>
>> Why can't this just be
>>
>> 		nr = hpage_nr_pages(page);
>>
>> Are you trying to optimize for the THP=y, but THP-pagecache=n case?
> 
> Yes, I try to optimize for the case.

I'd suggest either optimizing in _common_ code, or not optimizing it at
all.  Once in production, and all the config options are on, the
optimization goes away anyway.

You could create a hpagecache_nr_pages() helper or something I guess.
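
Something like the sketch below is what I have in mind (hypothetical
helper, name and placement up for grabs):

	/*
	 * Number of page cache slots the page occupies; compiles down to 1
	 * when THP pagecache is configured out, so the optimization lives
	 * in common code instead of in every caller.
	 */
	static inline int hpagecache_nr_pages(struct page *page)
	{
		if (IS_ENABLED(CONFIG_TRANSPARENT_HUGEPAGE_PAGECACHE))
			return hpage_nr_pages(page);
		VM_BUG_ON(PageTransHuge(page));
		return 1;
	}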

>>> +	}
>>> +	__mod_zone_page_state(page_zone(page), NR_FILE_PAGES, nr);
>>> +	if (PageTransHuge(page))
>>> +		__inc_zone_page_state(page, NR_FILE_TRANSPARENT_HUGEPAGES);
>>> +	mapping->nrpages += nr;
>>> +	spin_unlock_irq(&mapping->tree_lock);
>>> +	radix_tree_preload_end();
>>> +	trace_mm_filemap_add_to_page_cache(page);
>>> +	return 0;
>>> +err:
>>> +	if (i != 0)
>>> +		error = -ENOSPC; /* no space for a huge page */
>>> +	page_cache_release(page + i);
>>> +	page[i].mapping = NULL;
>>
>> I guess it's a slight behaviour change (I think it's harmless) but if
>> you delay doing the page_cache_get() and page[i].mapping= until after
>> the radix tree insertion, you can avoid these two lines.
> 
> Hm. I don't think it's safe. The spinlock protects radix-tree against
> modification, but find_get_page() can see it just after
> radix_tree_insert().

Except that the mapping->tree_lock is still held.  I don't think
find_get_page() can find it in the radix tree without taking the lock.

> The page is locked and IIUC never uptodate at this point, so nobody will
> be able to do much with it, but leaving it without a valid ->mapping is a
> bad idea.

->mapping changes are protected by lock_page().  You can't keep
->mapping stable without holding it.  If you unlock_page(), you have to
recheck ->mapping after you reacquire the lock.

In other words, I think the code is fine.

>> I'm also trying to figure out how and when you'd actually have to unroll
>> a partial-huge-page worth of radix_tree_insert().  In the small-page
>> case, you can collide with another guy inserting in to the page cache.
>> But, can that happen in the _middle_ of a THP?
> 
> E.g. if you enable THP after some uptime, the mapping can contain small pages
> already.
> Or if a process maps the file with bad alignment (MAP_FIXED) and touches
> the area, it will get small pages.

Could you put a comment in explaining this case a bit?  It's a bit subtle.


^ permalink raw reply	[flat|nested] 243+ messages in thread

* Re: [PATCHv4 12/39] thp, mm: rewrite add_to_page_cache_locked() to support huge pages
@ 2013-05-23 16:00         ` Dave Hansen
  0 siblings, 0 replies; 243+ messages in thread
From: Dave Hansen @ 2013-05-23 16:00 UTC (permalink / raw)
  To: Kirill A. Shutemov
  Cc: Andrea Arcangeli, Andrew Morton, Al Viro, Hugh Dickins,
	Wu Fengguang, Jan Kara, Mel Gorman, linux-mm, Andi Kleen,
	Matthew Wilcox, Kirill A. Shutemov, Hillf Danton, linux-fsdevel,
	linux-kernel

On 05/23/2013 07:36 AM, Kirill A. Shutemov wrote:
>>> +	if (IS_ENABLED(CONFIG_TRANSPARENT_HUGEPAGE_PAGECACHE)) {
>>> +		BUILD_BUG_ON(HPAGE_CACHE_NR > RADIX_TREE_PRELOAD_NR);
>>> +		nr = hpage_nr_pages(page);
>>> +	} else {
>>> +		BUG_ON(PageTransHuge(page));
>>> +		nr = 1;
>>> +	}
>>
>> Why can't this just be
>>
>> 		nr = hpage_nr_pages(page);
>>
>> Are you trying to optimize for the THP=y, but THP-pagecache=n case?
> 
> Yes, I try to optimize for the case.

I'd suggest either optimizing in _common_ code, or not optimizing it at
all.  Once in production, and all the config options are on, the
optimization goes away anyway.

You could create a hpagecache_nr_pages() helper or something I guess.

>>> +	}
>>> +	__mod_zone_page_state(page_zone(page), NR_FILE_PAGES, nr);
>>> +	if (PageTransHuge(page))
>>> +		__inc_zone_page_state(page, NR_FILE_TRANSPARENT_HUGEPAGES);
>>> +	mapping->nrpages += nr;
>>> +	spin_unlock_irq(&mapping->tree_lock);
>>> +	radix_tree_preload_end();
>>> +	trace_mm_filemap_add_to_page_cache(page);
>>> +	return 0;
>>> +err:
>>> +	if (i != 0)
>>> +		error = -ENOSPC; /* no space for a huge page */
>>> +	page_cache_release(page + i);
>>> +	page[i].mapping = NULL;
>>
>> I guess it's a slight behaviour change (I think it's harmless) but if
>> you delay doing the page_cache_get() and page[i].mapping= until after
>> the radix tree insertion, you can avoid these two lines.
> 
> Hm. I don't think it's safe. The spinlock protects radix-tree against
> modification, but find_get_page() can see it just after
> radix_tree_insert().

Except that the mapping->tree_lock is still held.  I don't think
find_get_page() can find it in the radix tree without taking the lock.

> The page is locked and IIUC never uptodate at this point, so nobody will
> be able to do much with it, but leaving it without a valid ->mapping is a
> bad idea.

->mapping changes are protected by lock_page().  You can't keep
->mapping stable without holding it.  If you unlock_page(), you have to
recheck ->mapping after you reacquire the lock.

In other words, I think the code is fine.

>> I'm also trying to figure out how and when you'd actually have to unroll
>> a partial-huge-page worth of radix_tree_insert().  In the small-page
>> case, you can collide with another guy inserting in to the page cache.
>> But, can that happen in the _middle_ of a THP?
> 
> E.g. if you enable THP after some uptime, the mapping can contain small pages
> already.
> Or if a process maps the file with bad alignment (MAP_FIXED) and touches
> the area, it will get small pages.

Could you put a comment in explaining this case a bit?  It's a bit subtle.

^ permalink raw reply	[flat|nested] 243+ messages in thread

* Re: [PATCHv4 12/39] thp, mm: rewrite add_to_page_cache_locked() to support huge pages
  2013-05-23 16:00         ` Dave Hansen
@ 2013-05-28 11:59           ` Kirill A. Shutemov
  -1 siblings, 0 replies; 243+ messages in thread
From: Kirill A. Shutemov @ 2013-05-28 11:59 UTC (permalink / raw)
  To: Dave Hansen
  Cc: Kirill A. Shutemov, Andrea Arcangeli, Andrew Morton, Al Viro,
	Hugh Dickins, Wu Fengguang, Jan Kara, Mel Gorman, linux-mm,
	Andi Kleen, Matthew Wilcox, Kirill A. Shutemov, Hillf Danton,
	linux-fsdevel, linux-kernel

Dave Hansen wrote:
> You could create a hpagecache_nr_pages() helper or something I guess.

Makes sense.
> 
> >>> +	}
> >>> +	__mod_zone_page_state(page_zone(page), NR_FILE_PAGES, nr);
> >>> +	if (PageTransHuge(page))
> >>> +		__inc_zone_page_state(page, NR_FILE_TRANSPARENT_HUGEPAGES);
> >>> +	mapping->nrpages += nr;
> >>> +	spin_unlock_irq(&mapping->tree_lock);
> >>> +	radix_tree_preload_end();
> >>> +	trace_mm_filemap_add_to_page_cache(page);
> >>> +	return 0;
> >>> +err:
> >>> +	if (i != 0)
> >>> +		error = -ENOSPC; /* no space for a huge page */
> >>> +	page_cache_release(page + i);
> >>> +	page[i].mapping = NULL;
> >>
> >> I guess it's a slight behaviour change (I think it's harmless) but if
> >> you delay doing the page_cache_get() and page[i].mapping= until after
> >> the radix tree insertion, you can avoid these two lines.
> > 
> > Hm. I don't think it's safe. The spinlock protects radix-tree against
> > modification, but find_get_page() can see it just after
> > radix_tree_insert().
> 
> Except that the mapping->tree_lock is still held.  I don't think
> find_get_page() can find it in the radix tree without taking the lock.

It can. Lookup is rcu-protected. ->tree_lock is only for add/delete/replace.
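
Roughly, the lockless lookup path looks like this (heavily simplified from
find_get_page(); the retry loop and exceptional-entry handling are left
out):

	rcu_read_lock();
	page = radix_tree_lookup(&mapping->page_tree, offset);
	/* the page is visible here without taking mapping->tree_lock */
	if (page && !page_cache_get_speculative(page))
		page = NULL;	/* raced with removal; caller retries */
	rcu_read_unlock();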

> 
> > The page is locked and IIUC never uptodate at this point, so nobody will
> > be able to do much with it, but leaving it without a valid ->mapping is a
> > bad idea.
> 
> ->mapping changes are protected by lock_page().  You can't keep
> ->mapping stable without holding it.  If you unlock_page(), you have to
> recheck ->mapping after you reacquire the lock.
> 
> In other words, I think the code is fine.

You are right.

> 
> >> I'm also trying to figure out how and when you'd actually have to unroll
> >> a partial-huge-page worth of radix_tree_insert().  In the small-page
> >> case, you can collide with another guy inserting in to the page cache.
> >> But, can that happen in the _middle_ of a THP?
> > 
> > E.g. if you enable THP after some uptime, the mapping can contain small pages
> > already.
> > Or if a process maps the file with bad alignment (MAP_FIXED) and touches
> > the area, it will get small pages.
> 
> Could you put a comment in explaining this case a bit?  It's a bit subtle.

okay.

-- 
 Kirill A. Shutemov

^ permalink raw reply	[flat|nested] 243+ messages in thread

* Re: [PATCHv4 12/39] thp, mm: rewrite add_to_page_cache_locked() to support huge pages
@ 2013-05-28 11:59           ` Kirill A. Shutemov
  0 siblings, 0 replies; 243+ messages in thread
From: Kirill A. Shutemov @ 2013-05-28 11:59 UTC (permalink / raw)
  To: Dave Hansen
  Cc: Kirill A. Shutemov, Andrea Arcangeli, Andrew Morton, Al Viro,
	Hugh Dickins, Wu Fengguang, Jan Kara, Mel Gorman, linux-mm,
	Andi Kleen, Matthew Wilcox, Kirill A. Shutemov, Hillf Danton,
	linux-fsdevel, linux-kernel

Dave Hansen wrote:
> You could create a hpagecache_nr_pages() helper or something I guess.

Makes sense.
> 
> >>> +	}
> >>> +	__mod_zone_page_state(page_zone(page), NR_FILE_PAGES, nr);
> >>> +	if (PageTransHuge(page))
> >>> +		__inc_zone_page_state(page, NR_FILE_TRANSPARENT_HUGEPAGES);
> >>> +	mapping->nrpages += nr;
> >>> +	spin_unlock_irq(&mapping->tree_lock);
> >>> +	radix_tree_preload_end();
> >>> +	trace_mm_filemap_add_to_page_cache(page);
> >>> +	return 0;
> >>> +err:
> >>> +	if (i != 0)
> >>> +		error = -ENOSPC; /* no space for a huge page */
> >>> +	page_cache_release(page + i);
> >>> +	page[i].mapping = NULL;
> >>
> >> I guess it's a slight behaviour change (I think it's harmless) but if
> >> you delay doing the page_cache_get() and page[i].mapping= until after
> >> the radix tree insertion, you can avoid these two lines.
> > 
> > Hm. I don't think it's safe. The spinlock protects radix-tree against
> > modification, but find_get_page() can see it just after
> > radix_tree_insert().
> 
> Except that the mapping->tree_lock is still held.  I don't think
> find_get_page() can find it in the radix tree without taking the lock.

It can. Lookup is rcu-protected. ->tree_lock is only for add/delete/replace.

> 
> > The page is locked and IIUC never uptodate at this point, so nobody will
> > be able to do much with it, but leaving it without a valid ->mapping is a
> > bad idea.
> 
> ->mapping changes are protected by lock_page().  You can't keep
> ->mapping stable without holding it.  If you unlock_page(), you have to
> recheck ->mapping after you reacquire the lock.
> 
> In other words, I think the code is fine.

You are right.

> 
> >> I'm also trying to figure out how and when you'd actually have to unroll
> >> a partial-huge-page worth of radix_tree_insert().  In the small-page
> >> case, you can collide with another guy inserting in to the page cache.
> >> But, can that happen in the _middle_ of a THP?
> > 
> > E.g. if you enable THP after some uptime, the mapping can contain small pages
> > already.
> > Or if a process maps the file with bad alignment (MAP_FIXED) and touches
> > the area, it will get small pages.
> 
> Could you put a comment in explaining this case a bit?  It's a bit subtle.

okay.

-- 
 Kirill A. Shutemov

^ permalink raw reply	[flat|nested] 243+ messages in thread

* Re: [PATCHv4 14/39] thp, mm: rewrite delete_from_page_cache() to support huge pages
  2013-05-21 20:14     ` Dave Hansen
@ 2013-05-28 12:28       ` Kirill A. Shutemov
  -1 siblings, 0 replies; 243+ messages in thread
From: Kirill A. Shutemov @ 2013-05-28 12:28 UTC (permalink / raw)
  To: Dave Hansen
  Cc: Kirill A. Shutemov, Andrea Arcangeli, Andrew Morton, Al Viro,
	Hugh Dickins, Wu Fengguang, Jan Kara, Mel Gorman, linux-mm,
	Andi Kleen, Matthew Wilcox, Kirill A. Shutemov, Hillf Danton,
	linux-fsdevel, linux-kernel

Dave Hansen wrote:
> On 05/11/2013 06:23 PM, Kirill A. Shutemov wrote:
> > From: "Kirill A. Shutemov" <kirill.shutemov@linux.intel.com>
> > 
> > As with add_to_page_cache_locked() we handle HPAGE_CACHE_NR pages at a
> > time.
> > 
> > Signed-off-by: Kirill A. Shutemov <kirill.shutemov@linux.intel.com>
> > ---
> >  mm/filemap.c |   31 +++++++++++++++++++++++++------
> >  1 file changed, 25 insertions(+), 6 deletions(-)
> > 
> > diff --git a/mm/filemap.c b/mm/filemap.c
> > index b0c7c8c..657ce82 100644
> > --- a/mm/filemap.c
> > +++ b/mm/filemap.c
> > @@ -115,6 +115,9 @@
> >  void __delete_from_page_cache(struct page *page)
> >  {
> >  	struct address_space *mapping = page->mapping;
> > +	bool thp = PageTransHuge(page) &&
> > +		IS_ENABLED(CONFIG_TRANSPARENT_HUGEPAGE_PAGECACHE);
> > +	int nr;
> 
> Is that check for the config option really necessary?  How would we get
> a page with PageTransHuge() set without it being enabled?

I'll drop it and use hpagecache_nr_pages() instead.

> I like to rewrite your code. :)

It's nice. Thanks.

> Which reminds me...  Why do we handle their reference counts differently? :)
> 
> It seems like we could easily put a for loop in delete_from_page_cache()
> that will release their reference counts along with the head page.
> Wouldn't that make the code less special-cased for tail pages?

delete_from_page_cache() is not the only user of
__delete_from_page_cache()...

It seems I did it wrong in add_to_page_cache_locked(). We shouldn't take
references on tail pages there, only one on the head page. On split it
will be distributed properly.

> >  	/* Leave page->index set: truncation lookup relies upon it */
> > -	mapping->nrpages--;
> > -	__dec_zone_page_state(page, NR_FILE_PAGES);
> > +	mapping->nrpages -= nr;
> > +	__mod_zone_page_state(page_zone(page), NR_FILE_PAGES, -nr);
> >  	if (PageSwapBacked(page))
> > -		__dec_zone_page_state(page, NR_SHMEM);
> > +		__mod_zone_page_state(page_zone(page), NR_SHMEM, -nr);
> >  	BUG_ON(page_mapped(page));
> 
> Man, we suck:
> 
> 	__dec_zone_page_state()
> and
> 	__mod_zone_page_state()
> 
> take a differently-typed first argument.  <sigh>
> 
> Would there be any good to making __dec_zone_page_state() check to see
> if the page we passed in _is_ a compound page, and adjusting its
> behaviour accordingly?

Yeah, it would be better, but I think it's outside the scope of the
patchset. Probably later.
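
If somebody picks it up later, it could be as simple as the following
(hypothetical, untested sketch):

	/* Adjust a zone counter by the whole compound page, not just the head. */
	static inline void __dec_zone_compound_page_state(struct page *page,
						enum zone_stat_item item)
	{
		__mod_zone_page_state(page_zone(page), item,
					-hpage_nr_pages(page));
	}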

-- 
 Kirill A. Shutemov

^ permalink raw reply	[flat|nested] 243+ messages in thread

* Re: [PATCHv4 14/39] thp, mm: rewrite delete_from_page_cache() to support huge pages
@ 2013-05-28 12:28       ` Kirill A. Shutemov
  0 siblings, 0 replies; 243+ messages in thread
From: Kirill A. Shutemov @ 2013-05-28 12:28 UTC (permalink / raw)
  To: Dave Hansen
  Cc: Kirill A. Shutemov, Andrea Arcangeli, Andrew Morton, Al Viro,
	Hugh Dickins, Wu Fengguang, Jan Kara, Mel Gorman, linux-mm,
	Andi Kleen, Matthew Wilcox, Kirill A. Shutemov, Hillf Danton,
	linux-fsdevel, linux-kernel

Dave Hansen wrote:
> On 05/11/2013 06:23 PM, Kirill A. Shutemov wrote:
> > From: "Kirill A. Shutemov" <kirill.shutemov@linux.intel.com>
> > 
> > As with add_to_page_cache_locked() we handle HPAGE_CACHE_NR pages at a
> > time.
> > 
> > Signed-off-by: Kirill A. Shutemov <kirill.shutemov@linux.intel.com>
> > ---
> >  mm/filemap.c |   31 +++++++++++++++++++++++++------
> >  1 file changed, 25 insertions(+), 6 deletions(-)
> > 
> > diff --git a/mm/filemap.c b/mm/filemap.c
> > index b0c7c8c..657ce82 100644
> > --- a/mm/filemap.c
> > +++ b/mm/filemap.c
> > @@ -115,6 +115,9 @@
> >  void __delete_from_page_cache(struct page *page)
> >  {
> >  	struct address_space *mapping = page->mapping;
> > +	bool thp = PageTransHuge(page) &&
> > +		IS_ENABLED(CONFIG_TRANSPARENT_HUGEPAGE_PAGECACHE);
> > +	int nr;
> 
> Is that check for the config option really necessary?  How would we get
> a page with PageTransHuge() set without it being enabled?

I'll drop it and use hpagecache_nr_pages() instead.

> I like to rewrite your code. :)

It's nice. Thanks.

> Which reminds me...  Why do we handle their reference counts differently? :)
> 
> It seems like we could easily put a for loop in delete_from_page_cache()
> that will release their reference counts along with the head page.
> Wouldn't that make the code less special-cased for tail pages?

delete_from_page_cache() is not the only user of
__delete_from_page_cache()...

It seems I did it wrong in add_to_page_cache_locked(). We shouldn't take
references on tail pages there, only one on the head page. On split it
will be distributed properly.

> >  	/* Leave page->index set: truncation lookup relies upon it */
> > -	mapping->nrpages--;
> > -	__dec_zone_page_state(page, NR_FILE_PAGES);
> > +	mapping->nrpages -= nr;
> > +	__mod_zone_page_state(page_zone(page), NR_FILE_PAGES, -nr);
> >  	if (PageSwapBacked(page))
> > -		__dec_zone_page_state(page, NR_SHMEM);
> > +		__mod_zone_page_state(page_zone(page), NR_SHMEM, -nr);
> >  	BUG_ON(page_mapped(page));
> 
> Man, we suck:
> 
> 	__dec_zone_page_state()
> and
> 	__mod_zone_page_state()
> 
> take a differently-typed first argument.  <sigh>
> 
> Would there be any good to making __dec_zone_page_state() check to see
> if the page we passed in _is_ a compound page, and adjusting its
> behaviour accordingly?

Yeah, it would be better, but I think it's outside the scope of the
patchset. Probably later.

-- 
 Kirill A. Shutemov

^ permalink raw reply	[flat|nested] 243+ messages in thread

* Re: [PATCHv4 15/39] thp, mm: trigger bug in replace_page_cache_page() on THP
  2013-05-21 20:17     ` Dave Hansen
@ 2013-05-28 12:53       ` Kirill A. Shutemov
  -1 siblings, 0 replies; 243+ messages in thread
From: Kirill A. Shutemov @ 2013-05-28 12:53 UTC (permalink / raw)
  To: Dave Hansen
  Cc: Kirill A. Shutemov, Andrea Arcangeli, Andrew Morton, Al Viro,
	Hugh Dickins, Wu Fengguang, Jan Kara, Mel Gorman, linux-mm,
	Andi Kleen, Matthew Wilcox, Kirill A. Shutemov, Hillf Danton,
	linux-fsdevel, linux-kernel

Dave Hansen wrote:
> On 05/11/2013 06:23 PM, Kirill A. Shutemov wrote:
> > From: "Kirill A. Shutemov" <kirill.shutemov@linux.intel.com>
> > 
> > replace_page_cache_page() is only used by FUSE. It's unlikely that we
> > will support THP in the FUSE page cache anytime soon.
> > 
> > Let's postpone the implementation of THP handling in
> > replace_page_cache_page() until anyone needs it.
> ...
> > diff --git a/mm/filemap.c b/mm/filemap.c
> > index 657ce82..3a03426 100644
> > --- a/mm/filemap.c
> > +++ b/mm/filemap.c
> > @@ -428,6 +428,8 @@ int replace_page_cache_page(struct page *old, struct page *new, gfp_t gfp_mask)
> >  {
> >  	int error;
> >  
> > +	VM_BUG_ON(PageTransHuge(old));
> > +	VM_BUG_ON(PageTransHuge(new));
> >  	VM_BUG_ON(!PageLocked(old));
> >  	VM_BUG_ON(!PageLocked(new));
> >  	VM_BUG_ON(new->mapping);
> 
> The code calling replace_page_cache_page() has a bunch of fallback and
> error returning code.  It seems a little bit silly to bring the whole
> machine down when you could just WARN_ONCE() and return an error code
> like fuse already does:

What about:

	if (WARN_ONCE(PageTransHuge(old) || PageTransHuge(new),
		     "%s: unexpected huge page\n", __func__))
		return -EINVAL;

?

-- 
 Kirill A. Shutemov

^ permalink raw reply	[flat|nested] 243+ messages in thread

* Re: [PATCHv4 15/39] thp, mm: trigger bug in replace_page_cache_page() on THP
@ 2013-05-28 12:53       ` Kirill A. Shutemov
  0 siblings, 0 replies; 243+ messages in thread
From: Kirill A. Shutemov @ 2013-05-28 12:53 UTC (permalink / raw)
  To: Dave Hansen
  Cc: Kirill A. Shutemov, Andrea Arcangeli, Andrew Morton, Al Viro,
	Hugh Dickins, Wu Fengguang, Jan Kara, Mel Gorman, linux-mm,
	Andi Kleen, Matthew Wilcox, Kirill A. Shutemov, Hillf Danton,
	linux-fsdevel, linux-kernel

Dave Hansen wrote:
> On 05/11/2013 06:23 PM, Kirill A. Shutemov wrote:
> > From: "Kirill A. Shutemov" <kirill.shutemov@linux.intel.com>
> > 
> > replace_page_cache_page() is only used by FUSE. It's unlikely that we
> > will support THP in the FUSE page cache anytime soon.
> > 
> > Let's postpone the implementation of THP handling in
> > replace_page_cache_page() until anyone needs it.
> ...
> > diff --git a/mm/filemap.c b/mm/filemap.c
> > index 657ce82..3a03426 100644
> > --- a/mm/filemap.c
> > +++ b/mm/filemap.c
> > @@ -428,6 +428,8 @@ int replace_page_cache_page(struct page *old, struct page *new, gfp_t gfp_mask)
> >  {
> >  	int error;
> >  
> > +	VM_BUG_ON(PageTransHuge(old));
> > +	VM_BUG_ON(PageTransHuge(new));
> >  	VM_BUG_ON(!PageLocked(old));
> >  	VM_BUG_ON(!PageLocked(new));
> >  	VM_BUG_ON(new->mapping);
> 
> The code calling replace_page_cache_page() has a bunch of fallback and
> error returning code.  It seems a little bit silly to bring the whole
> machine down when you could just WARN_ONCE() and return an error code
> like fuse already does:

What about:

	if (WARN_ONCE(PageTransHuge(old) || PageTransHuge(new),
		     "%s: unexpected huge page\n", __func__))
		return -EINVAL;

?

-- 
 Kirill A. Shutemov

^ permalink raw reply	[flat|nested] 243+ messages in thread

* Re: [PATCHv4 15/39] thp, mm: trigger bug in replace_page_cache_page() on THP
  2013-05-28 12:53       ` Kirill A. Shutemov
@ 2013-05-28 16:33         ` Dave Hansen
  -1 siblings, 0 replies; 243+ messages in thread
From: Dave Hansen @ 2013-05-28 16:33 UTC (permalink / raw)
  To: Kirill A. Shutemov
  Cc: Andrea Arcangeli, Andrew Morton, Al Viro, Hugh Dickins,
	Wu Fengguang, Jan Kara, Mel Gorman, linux-mm, Andi Kleen,
	Matthew Wilcox, Kirill A. Shutemov, Hillf Danton, linux-fsdevel,
	linux-kernel

On 05/28/2013 05:53 AM, Kirill A. Shutemov wrote:
> Dave Hansen wrote:
>> On 05/11/2013 06:23 PM, Kirill A. Shutemov wrote:
>>> +	VM_BUG_ON(PageTransHuge(old));
>>> +	VM_BUG_ON(PageTransHuge(new));
>>>  	VM_BUG_ON(!PageLocked(old));
>>>  	VM_BUG_ON(!PageLocked(new));
>>>  	VM_BUG_ON(new->mapping);
>>
>> The code calling replace_page_cache_page() has a bunch of fallback and
>> error returning code.  It seems a little bit silly to bring the whole
>> machine down when you could just WARN_ONCE() and return an error code
>> like fuse already does:
> 
> What about:
> 
> 	if (WARN_ONCE(PageTransHuge(old) || PageTransHuge(new),
> 		     "%s: unexpected huge page\n", __func__))
> 		return -EINVAL;

That looks sane to me.  But, please do make sure to differentiate in the
error message between thp and hugetlbfs (if you have the room).

BTW, I'm also not sure you need to print the function name.  The
WARN_ON() register dump usually has the function name.
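
I.e. something along these lines (sketch only):

	if (WARN_ONCE(PageTransHuge(old) || PageTransHuge(new),
		      "unexpected transparent huge page\n"))
		return -EINVAL;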

^ permalink raw reply	[flat|nested] 243+ messages in thread

* Re: [PATCHv4 15/39] thp, mm: trigger bug in replace_page_cache_page() on THP
@ 2013-05-28 16:33         ` Dave Hansen
  0 siblings, 0 replies; 243+ messages in thread
From: Dave Hansen @ 2013-05-28 16:33 UTC (permalink / raw)
  To: Kirill A. Shutemov
  Cc: Andrea Arcangeli, Andrew Morton, Al Viro, Hugh Dickins,
	Wu Fengguang, Jan Kara, Mel Gorman, linux-mm, Andi Kleen,
	Matthew Wilcox, Kirill A. Shutemov, Hillf Danton, linux-fsdevel,
	linux-kernel

On 05/28/2013 05:53 AM, Kirill A. Shutemov wrote:
> Dave Hansen wrote:
>> On 05/11/2013 06:23 PM, Kirill A. Shutemov wrote:
>>> +	VM_BUG_ON(PageTransHuge(old));
>>> +	VM_BUG_ON(PageTransHuge(new));
>>>  	VM_BUG_ON(!PageLocked(old));
>>>  	VM_BUG_ON(!PageLocked(new));
>>>  	VM_BUG_ON(new->mapping);
>>
>> The code calling replace_page_cache_page() has a bunch of fallback and
>> error returning code.  It seems a little bit silly to bring the whole
>> machine down when you could just WARN_ONCE() and return an error code
>> like fuse already does:
> 
> What about:
> 
> 	if (WARN_ONCE(PageTransHuge(old) || PageTransHuge(new),
> 		     "%s: unexpected huge page\n", __func__))
> 		return -EINVAL;

That looks sane to me.  But, please do make sure to differentiate in the
error message between thp and hugetlbfs (if you have the room).

BTW, I'm also not sure you need to print the function name.  The
WARN_ON() register dump usually has the function name.

^ permalink raw reply	[flat|nested] 243+ messages in thread

* Re: [PATCHv4 19/39] thp, mm: allocate huge pages in grab_cache_page_write_begin()
  2013-05-21 21:14     ` Dave Hansen
@ 2013-05-30 13:20       ` Kirill A. Shutemov
  -1 siblings, 0 replies; 243+ messages in thread
From: Kirill A. Shutemov @ 2013-05-30 13:20 UTC (permalink / raw)
  To: Dave Hansen
  Cc: Kirill A. Shutemov, Andrea Arcangeli, Andrew Morton, Al Viro,
	Hugh Dickins, Wu Fengguang, Jan Kara, Mel Gorman, linux-mm,
	Andi Kleen, Matthew Wilcox, Kirill A. Shutemov, Hillf Danton,
	linux-fsdevel, linux-kernel

Dave Hansen wrote:
> On 05/11/2013 06:23 PM, Kirill A. Shutemov wrote:
> > From: "Kirill A. Shutemov" <kirill.shutemov@linux.intel.com>
> > 
> > Try to allocate huge page if flags has AOP_FLAG_TRANSHUGE.
> 
> Why do we need this flag?

I don't see another way to indicate to grab_cache_page_write_begin() that
we want THP here.

> When might we set it, and when would we not set it?  What kinds of
> callers need to check for and act on it?

The decision whether to allocate a huge page or not is up to the
filesystem. In the ramfs case we just use mapping_can_have_hugepages();
on other filesystems the check might be more complicated.
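
For ramfs that boils down to something like the sketch below (illustrative
only; everything except mapping_can_have_hugepages(), AOP_FLAG_TRANSHUGE
and grab_cache_page_write_begin() is made up, and huge-page index
alignment and fallback handling are omitted):

	static int example_write_begin(struct file *file,
			struct address_space *mapping, loff_t pos,
			unsigned len, unsigned flags,
			struct page **pagep, void **fsdata)
	{
		pgoff_t index = pos >> PAGE_CACHE_SHIFT;

		/* The filesystem decides: only ask for a huge page if the
		 * mapping can hold one at all. */
		if (mapping_can_have_hugepages(mapping))
			flags |= AOP_FLAG_TRANSHUGE;

		*pagep = grab_cache_page_write_begin(mapping, index, flags);
		if (!*pagep)
			return -ENOMEM;
		return 0;
	}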

> Some of this, at least, needs to make it in to the comment by the #define.

Sorry, I fail to see what kind of comment you want me to add there.

> > --- a/include/linux/huge_mm.h
> > +++ b/include/linux/huge_mm.h
> > @@ -194,6 +194,9 @@ extern int do_huge_pmd_numa_page(struct mm_struct *mm, struct vm_area_struct *vm
> >  #define HPAGE_CACHE_NR         ({ BUILD_BUG(); 0; })
> >  #define HPAGE_CACHE_INDEX_MASK ({ BUILD_BUG(); 0; })
> >  
> > +#define THP_WRITE_ALLOC		({ BUILD_BUG(); 0; })
> > +#define THP_WRITE_ALLOC_FAILED	({ BUILD_BUG(); 0; })
> 
> Doesn't this belong in the previous patch?

Yes. Fixed.

> >  #define hpage_nr_pages(x) 1
> >  
> >  #define transparent_hugepage_enabled(__vma) 0
> > diff --git a/include/linux/pagemap.h b/include/linux/pagemap.h
> > index 2e86251..8feeecc 100644
> > --- a/include/linux/pagemap.h
> > +++ b/include/linux/pagemap.h
> > @@ -270,8 +270,15 @@ unsigned find_get_pages_contig(struct address_space *mapping, pgoff_t start,
> >  unsigned find_get_pages_tag(struct address_space *mapping, pgoff_t *index,
> >  			int tag, unsigned int nr_pages, struct page **pages);
> >  
> > -struct page *grab_cache_page_write_begin(struct address_space *mapping,
> > +struct page *__grab_cache_page_write_begin(struct address_space *mapping,
> >  			pgoff_t index, unsigned flags);
> > +static inline struct page *grab_cache_page_write_begin(
> > +		struct address_space *mapping, pgoff_t index, unsigned flags)
> > +{
> > +	if (!transparent_hugepage_pagecache() && (flags & AOP_FLAG_TRANSHUGE))
> > +		return NULL;
> > +	return __grab_cache_page_write_begin(mapping, index, flags);
> > +}
> 
> OK, so there's some of the behavior.
> 
> Could you also call out why you refactored this code?  It seems like
> you're trying to optimize for the case where AOP_FLAG_TRANSHUGE isn't
> set and where the compiler knows that it isn't set.
> 
> Could you talk a little bit about the cases that you're thinking of here?

I just tried to make it cheaper for the !CONFIG_TRANSPARENT_HUGEPAGE_PAGECACHE
case, but it seems not worth it: the only user calls it from under
'if (mapping_can_have_hugepages())', so I'll drop this.

> >  /*
> >   * Returns locked page at given index in given cache, creating it if needed.
> > diff --git a/mm/filemap.c b/mm/filemap.c
> > index 9ea46a4..e086ef0 100644
> > --- a/mm/filemap.c
> > +++ b/mm/filemap.c
> > @@ -2309,25 +2309,44 @@ EXPORT_SYMBOL(generic_file_direct_write);
> >   * Find or create a page at the given pagecache position. Return the locked
> >   * page. This function is specifically for buffered writes.
> >   */
> > -struct page *grab_cache_page_write_begin(struct address_space *mapping,
> > -					pgoff_t index, unsigned flags)
> > +struct page *__grab_cache_page_write_begin(struct address_space *mapping,
> > +		pgoff_t index, unsigned flags)
> >  {
> >  	int status;
> >  	gfp_t gfp_mask;
> >  	struct page *page;
> >  	gfp_t gfp_notmask = 0;
> > +	bool thp = (flags & AOP_FLAG_TRANSHUGE) &&
> > +		IS_ENABLED(CONFIG_TRANSPARENT_HUGEPAGE_PAGECACHE);
> 
> Instead of 'thp', how about 'must_use_thp'?  The flag seems to be a
> pretty strong edict rather than a hint, so it should be reflected in the
> variables derived from it.

Ok.

> "IS_ENABLED(CONFIG_TRANSPARENT_HUGEPAGE_PAGECACHE)" has also popped up
> enough times in the code that it's probably time to start thinking about
> shortening it up.  It's a wee bit verbose.

I'll leave it as is for now and probably come back to it later.


-- 
 Kirill A. Shutemov

^ permalink raw reply	[flat|nested] 243+ messages in thread

* Re: [PATCHv4 23/39] thp: wait_split_huge_page(): serialize over i_mmap_mutex too
  2013-05-21 22:05     ` Dave Hansen
@ 2013-06-03 15:02       ` Kirill A. Shutemov
  -1 siblings, 0 replies; 243+ messages in thread
From: Kirill A. Shutemov @ 2013-06-03 15:02 UTC (permalink / raw)
  To: Dave Hansen
  Cc: Kirill A. Shutemov, Andrea Arcangeli, Andrew Morton, Al Viro,
	Hugh Dickins, Wu Fengguang, Jan Kara, Mel Gorman, linux-mm,
	Andi Kleen, Matthew Wilcox, Kirill A. Shutemov, Hillf Danton,
	linux-fsdevel, linux-kernel

Dave Hansen wrote:
> On 05/11/2013 06:23 PM, Kirill A. Shutemov wrote:
> > From: "Kirill A. Shutemov" <kirill.shutemov@linux.intel.com>
> > 
> > Since we're going to have huge pages backed by files,
> > wait_split_huge_page() has to serialize not only over anon_vma_lock,
> > but over i_mmap_mutex too.
> ...
> > -#define wait_split_huge_page(__anon_vma, __pmd)				\
> > +#define wait_split_huge_page(__vma, __pmd)				\
> >  	do {								\
> >  		pmd_t *____pmd = (__pmd);				\
> > -		anon_vma_lock_write(__anon_vma);			\
> > -		anon_vma_unlock_write(__anon_vma);			\
> > +		struct address_space *__mapping =			\
> > +					vma->vm_file->f_mapping;	\
> > +		struct anon_vma *__anon_vma = (__vma)->anon_vma;	\
> > +		if (__mapping)						\
> > +			mutex_lock(&__mapping->i_mmap_mutex);		\
> > +		if (__anon_vma) {					\
> > +			anon_vma_lock_write(__anon_vma);		\
> > +			anon_vma_unlock_write(__anon_vma);		\
> > +		}							\
> > +		if (__mapping)						\
> > +			mutex_unlock(&__mapping->i_mmap_mutex);		\
> >  		BUG_ON(pmd_trans_splitting(*____pmd) ||			\
> >  		       pmd_trans_huge(*____pmd));			\
> >  	} while (0)
> 
> Kirill, I asked about this patch in the previous series, and you wrote
> some very nice, detailed answers to my stupid questions.  But, you
> didn't add any comments or update the patch description.  So, if a
> reviewer or anybody looking at the changelog in the future has my same
> stupid questions, they're unlikely to find the very nice description
> that you wrote up.
> 
> I'd highly suggest that you go back through the comments you've received
> before and make sure that you both answered the questions, *and* made
> sure to cover those questions either in the code or in the patch
> descriptions.

Will do.

> Could you also describe the lengths to which you've gone to try and keep
> this macro from growing in to any larger of an abomination.  Is it truly
> _impossible_ to turn this in to a normal function?  Or will it simply be
> a larger amount of work that you can do right now?  What would it take?

Okay, I've tried once again. The patch is below. It looks too invasive for
me. What do you think?

diff --git a/include/linux/huge_mm.h b/include/linux/huge_mm.h
index 19c8c14..7ed4412 100644
--- a/include/linux/huge_mm.h
+++ b/include/linux/huge_mm.h
@@ -1,6 +1,8 @@
 #ifndef _LINUX_HUGE_MM_H
 #define _LINUX_HUGE_MM_H
 
+#include <linux/fs.h>
+
 extern int do_huge_pmd_anonymous_page(struct mm_struct *mm,
 				      struct vm_area_struct *vma,
 				      unsigned long address, pmd_t *pmd,
@@ -114,23 +116,22 @@ extern void __split_huge_page_pmd(struct vm_area_struct *vma,
 			__split_huge_page_pmd(__vma, __address,		\
 					____pmd);			\
 	}  while (0)
-#define wait_split_huge_page(__vma, __pmd)				\
-	do {								\
-		pmd_t *____pmd = (__pmd);				\
-		struct address_space *__mapping =			\
-					vma->vm_file->f_mapping;	\
-		struct anon_vma *__anon_vma = (__vma)->anon_vma;	\
-		if (__mapping)						\
-			mutex_lock(&__mapping->i_mmap_mutex);		\
-		if (__anon_vma) {					\
-			anon_vma_lock_write(__anon_vma);		\
-			anon_vma_unlock_write(__anon_vma);		\
-		}							\
-		if (__mapping)						\
-			mutex_unlock(&__mapping->i_mmap_mutex);		\
-		BUG_ON(pmd_trans_splitting(*____pmd) ||			\
-		       pmd_trans_huge(*____pmd));			\
-	} while (0)
+static inline void wait_split_huge_page(struct vm_area_struct *vma,
+		pmd_t *pmd)
+{
+	struct address_space *mapping = vma->vm_file->f_mapping;
+
+	if (mapping)
+		mutex_lock(&mapping->i_mmap_mutex);
+	if (vma->anon_vma) {
+		anon_vma_lock_write(vma->anon_vma);
+		anon_vma_unlock_write(vma->anon_vma);
+	}
+	if (mapping)
+		mutex_unlock(&mapping->i_mmap_mutex);
+	BUG_ON(pmd_trans_splitting(*pmd));
+	BUG_ON(pmd_trans_huge(*pmd));
+}
 extern void split_huge_page_pmd_mm(struct mm_struct *mm, unsigned long address,
 		pmd_t *pmd);
 #if HPAGE_PMD_ORDER > MAX_ORDER
diff --git a/include/linux/mm.h b/include/linux/mm.h
index 0a60f28..9fc126e 100644
--- a/include/linux/mm.h
+++ b/include/linux/mm.h
@@ -19,7 +19,6 @@
 #include <linux/shrinker.h>
 
 struct mempolicy;
-struct anon_vma;
 struct anon_vma_chain;
 struct file_ra_state;
 struct user_struct;
@@ -260,7 +259,6 @@ static inline int get_freepage_migratetype(struct page *page)
  * files which need it (119 of them)
  */
 #include <linux/page-flags.h>
-#include <linux/huge_mm.h>
 
 /*
  * Methods to modify the page usage count.
@@ -1475,6 +1473,28 @@ void anon_vma_interval_tree_verify(struct anon_vma_chain *node);
 	for (avc = anon_vma_interval_tree_iter_first(root, start, last); \
 	     avc; avc = anon_vma_interval_tree_iter_next(avc, start, last))
 
+static inline void anon_vma_lock_write(struct anon_vma *anon_vma)
+{
+	down_write(&anon_vma->root->rwsem);
+}
+
+static inline void anon_vma_unlock_write(struct anon_vma *anon_vma)
+{
+	up_write(&anon_vma->root->rwsem);
+}
+
+static inline void anon_vma_lock_read(struct anon_vma *anon_vma)
+{
+	down_read(&anon_vma->root->rwsem);
+}
+
+static inline void anon_vma_unlock_read(struct anon_vma *anon_vma)
+{
+	up_read(&anon_vma->root->rwsem);
+}
+
+#include <linux/huge_mm.h>
+
 /* mmap.c */
 extern int __vm_enough_memory(struct mm_struct *mm, long pages, int cap_sys_admin);
 extern int vma_adjust(struct vm_area_struct *vma, unsigned long start,
diff --git a/include/linux/mm_types.h b/include/linux/mm_types.h
index fb425aa..9805e55 100644
--- a/include/linux/mm_types.h
+++ b/include/linux/mm_types.h
@@ -453,4 +453,41 @@ static inline cpumask_t *mm_cpumask(struct mm_struct *mm)
 	return mm->cpu_vm_mask_var;
 }
 
+/*
+ * The anon_vma heads a list of private "related" vmas, to scan if
+ * an anonymous page pointing to this anon_vma needs to be unmapped:
+ * the vmas on the list will be related by forking, or by splitting.
+ *
+ * Since vmas come and go as they are split and merged (particularly
+ * in mprotect), the mapping field of an anonymous page cannot point
+ * directly to a vma: instead it points to an anon_vma, on whose list
+ * the related vmas can be easily linked or unlinked.
+ *
+ * After unlinking the last vma on the list, we must garbage collect
+ * the anon_vma object itself: we're guaranteed no page can be
+ * pointing to this anon_vma once its vma list is empty.
+ */
+struct anon_vma {
+	struct anon_vma *root;		/* Root of this anon_vma tree */
+	struct rw_semaphore rwsem;	/* W: modification, R: walking the list */
+	/*
+	 * The refcount is taken on an anon_vma when there is no
+	 * guarantee that the vma of page tables will exist for
+	 * the duration of the operation. A caller that takes
+	 * the reference is responsible for clearing up the
+	 * anon_vma if they are the last user on release
+	 */
+	atomic_t refcount;
+
+	/*
+	 * NOTE: the LSB of the rb_root.rb_node is set by
+	 * mm_take_all_locks() _after_ taking the above lock. So the
+	 * rb_root must only be read/written after taking the above lock
+	 * to be sure to see a valid next pointer. The LSB bit itself
+	 * is serialized by a system wide lock only visible to
+	 * mm_take_all_locks() (mm_all_locks_mutex).
+	 */
+	struct rb_root rb_root;	/* Interval tree of private "related" vmas */
+};
+
 #endif /* _LINUX_MM_TYPES_H */
diff --git a/include/linux/rmap.h b/include/linux/rmap.h
index 6dacb93..22c7278 100644
--- a/include/linux/rmap.h
+++ b/include/linux/rmap.h
@@ -11,43 +11,6 @@
 #include <linux/memcontrol.h>
 
 /*
- * The anon_vma heads a list of private "related" vmas, to scan if
- * an anonymous page pointing to this anon_vma needs to be unmapped:
- * the vmas on the list will be related by forking, or by splitting.
- *
- * Since vmas come and go as they are split and merged (particularly
- * in mprotect), the mapping field of an anonymous page cannot point
- * directly to a vma: instead it points to an anon_vma, on whose list
- * the related vmas can be easily linked or unlinked.
- *
- * After unlinking the last vma on the list, we must garbage collect
- * the anon_vma object itself: we're guaranteed no page can be
- * pointing to this anon_vma once its vma list is empty.
- */
-struct anon_vma {
-	struct anon_vma *root;		/* Root of this anon_vma tree */
-	struct rw_semaphore rwsem;	/* W: modification, R: walking the list */
-	/*
-	 * The refcount is taken on an anon_vma when there is no
-	 * guarantee that the vma of page tables will exist for
-	 * the duration of the operation. A caller that takes
-	 * the reference is responsible for clearing up the
-	 * anon_vma if they are the last user on release
-	 */
-	atomic_t refcount;
-
-	/*
-	 * NOTE: the LSB of the rb_root.rb_node is set by
-	 * mm_take_all_locks() _after_ taking the above lock. So the
-	 * rb_root must only be read/written after taking the above lock
-	 * to be sure to see a valid next pointer. The LSB bit itself
-	 * is serialized by a system wide lock only visible to
-	 * mm_take_all_locks() (mm_all_locks_mutex).
-	 */
-	struct rb_root rb_root;	/* Interval tree of private "related" vmas */
-};
-
-/*
  * The copy-on-write semantics of fork mean that an anon_vma
  * can become associated with multiple processes. Furthermore,
  * each child process will have its own anon_vma, where new
@@ -118,27 +81,6 @@ static inline void vma_unlock_anon_vma(struct vm_area_struct *vma)
 		up_write(&anon_vma->root->rwsem);
 }
 
-static inline void anon_vma_lock_write(struct anon_vma *anon_vma)
-{
-	down_write(&anon_vma->root->rwsem);
-}
-
-static inline void anon_vma_unlock_write(struct anon_vma *anon_vma)
-{
-	up_write(&anon_vma->root->rwsem);
-}
-
-static inline void anon_vma_lock_read(struct anon_vma *anon_vma)
-{
-	down_read(&anon_vma->root->rwsem);
-}
-
-static inline void anon_vma_unlock_read(struct anon_vma *anon_vma)
-{
-	up_read(&anon_vma->root->rwsem);
-}
-
-
 /*
  * anon_vma helper functions.
  */
diff --git a/mm/memory.c b/mm/memory.c
index c845cf2..2f4fb39 100644
--- a/mm/memory.c
+++ b/mm/memory.c
@@ -589,7 +589,7 @@ int __pte_alloc(struct mm_struct *mm, struct vm_area_struct *vma,
 		pmd_t *pmd, unsigned long address)
 {
 	pgtable_t new = pte_alloc_one(mm, address);
-	int wait_split_huge_page;
+	int wait_split;
 	if (!new)
 		return -ENOMEM;
 
@@ -609,17 +609,17 @@ int __pte_alloc(struct mm_struct *mm, struct vm_area_struct *vma,
 	smp_wmb(); /* Could be smp_wmb__xxx(before|after)_spin_lock */
 
 	spin_lock(&mm->page_table_lock);
-	wait_split_huge_page = 0;
+	wait_split = 0;
 	if (likely(pmd_none(*pmd))) {	/* Has another populated it ? */
 		mm->nr_ptes++;
 		pmd_populate(mm, pmd, new);
 		new = NULL;
 	} else if (unlikely(pmd_trans_splitting(*pmd)))
-		wait_split_huge_page = 1;
+		wait_split = 1;
 	spin_unlock(&mm->page_table_lock);
 	if (new)
 		pte_free(mm, new);
-	if (wait_split_huge_page)
+	if (wait_split)
 		wait_split_huge_page(vma, pmd);
 	return 0;
 }
-- 
 Kirill A. Shutemov

^ permalink raw reply related	[flat|nested] 243+ messages in thread

* Re: [PATCHv4 23/39] thp: wait_split_huge_page(): serialize over i_mmap_mutex too
  2013-06-03 15:02       ` Kirill A. Shutemov
@ 2013-06-03 15:53         ` Dave Hansen
  -1 siblings, 0 replies; 243+ messages in thread
From: Dave Hansen @ 2013-06-03 15:53 UTC (permalink / raw)
  To: Kirill A. Shutemov
  Cc: Andrea Arcangeli, Andrew Morton, Al Viro, Hugh Dickins,
	Wu Fengguang, Jan Kara, Mel Gorman, linux-mm, Andi Kleen,
	Matthew Wilcox, Kirill A. Shutemov, Hillf Danton, linux-fsdevel,
	linux-kernel


On 06/03/2013 08:02 AM, Kirill A. Shutemov wrote:
> Dave Hansen wrote:
>> On 05/11/2013 06:23 PM, Kirill A. Shutemov wrote:
>>> From: "Kirill A. Shutemov" <kirill.shutemov@linux.intel.com
>>> -#define wait_split_huge_page(__anon_vma, __pmd)				\
>>> +#define wait_split_huge_page(__vma, __pmd)				\
>>>  	do {								\
>>>  		pmd_t *____pmd = (__pmd);				\
>>> -		anon_vma_lock_write(__anon_vma);			\
>>> -		anon_vma_unlock_write(__anon_vma);			\
>>> +		struct address_space *__mapping =			\
>>> +					vma->vm_file->f_mapping;	\
>>> +		struct anon_vma *__anon_vma = (__vma)->anon_vma;	\
>>> +		if (__mapping)						\
>>> +			mutex_lock(&__mapping->i_mmap_mutex);		\
>>> +		if (__anon_vma) {					\
>>> +			anon_vma_lock_write(__anon_vma);		\
>>> +			anon_vma_unlock_write(__anon_vma);		\
>>> +		}							\
>>> +		if (__mapping)						\
>>> +			mutex_unlock(&__mapping->i_mmap_mutex);		\
>>>  		BUG_ON(pmd_trans_splitting(*____pmd) ||			\
>>>  		       pmd_trans_huge(*____pmd));			\
>>>  	} while (0)
...
>> Could you also describe the lengths to which you've gone to try and keep
>> this macro from growing in to any larger of an abomination.  Is it truly
>> _impossible_ to turn this in to a normal function?  Or will it simply be
>> a larger amount of work that you can do right now?  What would it take?
> 
> Okay, I've tried once again. The patch is below. It looks too invasive for
> me. What do you think?

That patch looks great to me, actually.  It really looks to just be
superficially moving code around.  The diffstat is even, too:

>  include/linux/huge_mm.h  |   35 ++++++++++++++--------------
>  include/linux/mm.h       |   24 +++++++++++++++++--
>  include/linux/mm_types.h |   37 +++++++++++++++++++++++++++++
>  include/linux/rmap.h     |   58 -----------------------------------------------
>  mm/memory.c              |    8 +++---
>  5 files changed, 81 insertions(+), 81 deletions(-)


^ permalink raw reply	[flat|nested] 243+ messages in thread

* Re: [PATCHv4 23/39] thp: wait_split_huge_page(): serialize over i_mmap_mutex too
  2013-06-03 15:53         ` Dave Hansen
@ 2013-06-03 16:09           ` Kirill A. Shutemov
  -1 siblings, 0 replies; 243+ messages in thread
From: Kirill A. Shutemov @ 2013-06-03 16:09 UTC (permalink / raw)
  To: Dave Hansen
  Cc: Kirill A. Shutemov, Andrea Arcangeli, Andrew Morton, Al Viro,
	Hugh Dickins, Wu Fengguang, Jan Kara, Mel Gorman, linux-mm,
	Andi Kleen, Matthew Wilcox, Kirill A. Shutemov, Hillf Danton,
	linux-fsdevel, linux-kernel

Dave Hansen wrote:
> 
> On 06/03/2013 08:02 AM, Kirill A. Shutemov wrote:
> > Dave Hansen wrote:
> >> On 05/11/2013 06:23 PM, Kirill A. Shutemov wrote:
> >>> From: "Kirill A. Shutemov" <kirill.shutemov@linux.intel.com
> >>> -#define wait_split_huge_page(__anon_vma, __pmd)				\
> >>> +#define wait_split_huge_page(__vma, __pmd)				\
> >>>  	do {								\
> >>>  		pmd_t *____pmd = (__pmd);				\
> >>> -		anon_vma_lock_write(__anon_vma);			\
> >>> -		anon_vma_unlock_write(__anon_vma);			\
> >>> +		struct address_space *__mapping =			\
> >>> +					vma->vm_file->f_mapping;	\
> >>> +		struct anon_vma *__anon_vma = (__vma)->anon_vma;	\
> >>> +		if (__mapping)						\
> >>> +			mutex_lock(&__mapping->i_mmap_mutex);		\
> >>> +		if (__anon_vma) {					\
> >>> +			anon_vma_lock_write(__anon_vma);		\
> >>> +			anon_vma_unlock_write(__anon_vma);		\
> >>> +		}							\
> >>> +		if (__mapping)						\
> >>> +			mutex_unlock(&__mapping->i_mmap_mutex);		\
> >>>  		BUG_ON(pmd_trans_splitting(*____pmd) ||			\
> >>>  		       pmd_trans_huge(*____pmd));			\
> >>>  	} while (0)
> ...
> >> Could you also describe the lengths to which you've gone to try and keep
> >> this macro from growing in to any larger of an abomination.  Is it truly
> >> _impossible_ to turn this in to a normal function?  Or will it simply be
> >> a larger amount of work that you can do right now?  What would it take?
> > 
> > Okay, I've tried once again. The patch is below. It looks too invasive for
> > me. What do you think?
> 
> That patch looks great to me, actually.  It really looks to just be
> superficially moving code around.  The diffstat is even too:

One blocker I see is the new dependency <linux/mm.h> -> <linux/fs.h>.
It makes the header file nightmare worse.
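
[ If the header dependency is the main blocker, one possible middle ground,
  an untested sketch rather than part of the patchset, is to keep only a
  declaration in huge_mm.h and move the body out of line into
  mm/huge_memory.c, which can include <linux/fs.h> and <linux/rmap.h>
  directly without dragging them into <linux/mm.h>: ]

/* include/linux/huge_mm.h */
extern void wait_split_huge_page(struct vm_area_struct *vma, pmd_t *pmd);

/* mm/huge_memory.c */
void wait_split_huge_page(struct vm_area_struct *vma, pmd_t *pmd)
{
	/* vm_file is NULL for purely anonymous vmas */
	struct address_space *mapping =
		vma->vm_file ? vma->vm_file->f_mapping : NULL;

	if (mapping)
		mutex_lock(&mapping->i_mmap_mutex);
	if (vma->anon_vma) {
		anon_vma_lock_write(vma->anon_vma);
		anon_vma_unlock_write(vma->anon_vma);
	}
	if (mapping)
		mutex_unlock(&mapping->i_mmap_mutex);
	BUG_ON(pmd_trans_splitting(*pmd));
	BUG_ON(pmd_trans_huge(*pmd));
}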

-- 
 Kirill A. Shutemov

^ permalink raw reply	[flat|nested] 243+ messages in thread

* Re: [PATCHv4 14/39] thp, mm: rewrite delete_from_page_cache() to support huge pages
  2013-05-28 12:28       ` Kirill A. Shutemov
  (?)
@ 2013-06-07 15:10         ` Kirill A. Shutemov
  -1 siblings, 0 replies; 243+ messages in thread
From: Kirill A. Shutemov @ 2013-06-07 15:10 UTC (permalink / raw)
  To: Kirill A. Shutemov
  Cc: Dave Hansen, Kirill A. Shutemov, Andrea Arcangeli, Andrew Morton,
	Al Viro, Hugh Dickins, Wu Fengguang, Jan Kara, Mel Gorman,
	linux-mm, Andi Kleen, Matthew Wilcox, Kirill A. Shutemov,
	Hillf Danton, linux-fsdevel, linux-kernel

Kirill A. Shutemov wrote:
> Dave Hansen wrote:
> > Which reminds me...  Why do we handle their reference counts differently? :)
> > 
> > It seems like we could easily put a for loop in delete_from_page_cache()
> > that will release their reference counts along with the head page.
> > Wouldn't that make the code less special-cased for tail pages?
> 
> delete_from_page_cache() is not the only user of
> __delete_from_page_cache()...
> 
> It seems I did it wrong in add_to_page_cache_locked(). We shouldn't take
> references on tail pages there, only one on head. On split it will be
> distributed properly.

This way:

diff --git a/mm/huge_memory.c b/mm/huge_memory.c
index b267859..c2c0df2 100644
--- a/mm/huge_memory.c
+++ b/mm/huge_memory.c
@@ -1556,6 +1556,7 @@ static void __split_huge_page_refcount(struct page *page,
 	struct zone *zone = page_zone(page);
 	struct lruvec *lruvec;
 	int tail_count = 0;
+	int init_tail_refcount;
 
 	/* prevent PageLRU to go away from under us, and freeze lru stats */
 	spin_lock_irq(&zone->lru_lock);
@@ -1565,6 +1566,13 @@ static void __split_huge_page_refcount(struct page *page,
 	/* complete memcg works before add pages to LRU */
 	mem_cgroup_split_huge_fixup(page);
 
+	/*
+	 * When we add a huge page to the page cache we take only a reference to
+	 * the head page, but on split we need to take an additional reference to
+	 * all tail pages since they are still in the page cache after splitting.
+	 */
+	init_tail_refcount = PageAnon(page) ? 0 : 1;
+
 	for (i = HPAGE_PMD_NR - 1; i >= 1; i--) {
 		struct page *page_tail = page + i;
 
@@ -1587,8 +1595,9 @@ static void __split_huge_page_refcount(struct page *page,
 		 * atomic_set() here would be safe on all archs (and
 		 * not only on x86), it's safer to use atomic_add().
 		 */
-		atomic_add(page_mapcount(page) + page_mapcount(page_tail) + 1,
-			   &page_tail->_count);
+		atomic_add(init_tail_refcount + page_mapcount(page) +
+				page_mapcount(page_tail) + 1,
+				&page_tail->_count);
 
 		/* after clearing PageTail the gup refcount can be released */
 		smp_mb();
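
[ For context, a hedged sketch, not the actual add_to_page_cache_locked()
  hunk being referred to, of what "only one reference, on the head" means on
  the insertion side: ]

	/* one reference pins the whole compound page while it is in the cache */
	page_cache_get(page);		/* head page only */
	for (i = 0; i < hpage_nr_pages(page); i++)
		/* error unwinding omitted in this sketch */
		radix_tree_insert(&mapping->page_tree, offset + i, page + i);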
-- 
 Kirill A. Shutemov

^ permalink raw reply related	[flat|nested] 243+ messages in thread

* Re: [PATCHv4 20/39] thp, mm: naive support of thp in generic read/write routines
  2013-05-21 21:28     ` Dave Hansen
@ 2013-06-07 15:17       ` Kirill A. Shutemov
  -1 siblings, 0 replies; 243+ messages in thread
From: Kirill A. Shutemov @ 2013-06-07 15:17 UTC (permalink / raw)
  To: Dave Hansen
  Cc: Kirill A. Shutemov, Andrea Arcangeli, Andrew Morton, Al Viro,
	Hugh Dickins, Wu Fengguang, Jan Kara, Mel Gorman, linux-mm,
	Andi Kleen, Matthew Wilcox, Kirill A. Shutemov, Hillf Danton,
	linux-fsdevel, linux-kernel

Dave Hansen wrote:
> On 05/11/2013 06:23 PM, Kirill A. Shutemov wrote:
> > +		if (PageTransHuge(page))
> > +			offset = pos & ~HPAGE_PMD_MASK;
> > +
> >  		pagefault_disable();
> > -		copied = iov_iter_copy_from_user_atomic(page, i, offset, bytes);
> > +		copied = iov_iter_copy_from_user_atomic(
> > +				page + (offset >> PAGE_CACHE_SHIFT),
> > +				i, offset & ~PAGE_CACHE_MASK, bytes);
> >  		pagefault_enable();
> >  		flush_dcache_page(page);
> 
> I think there's enough voodoo in there to warrant a comment or adding
> some temporary variables.  There are three things going on that you wan
> to convey:
> 
> 1. Offset is normally <PAGE_SIZE, but you make it <HPAGE_PMD_SIZE if
>    you are dealing with a huge page
> 2. (offset >> PAGE_CACHE_SHIFT) is always 0 for small pages since
>     offset < PAGE_SIZE
> 3. "offset & ~PAGE_CACHE_MASK" does nothing for small-page offsets, but
>    it turns a large-page offset back in to a small-page-offset.
> 
> I think you can do it with something like this:
> 
>  	int subpage_nr = 0;
> 	off_t smallpage_offset = offset;
> 	if (PageTransHuge(page)) {
> 		// we transform 'offset' to be offset in to the huge
> 		// page instead of inside the PAGE_SIZE page
> 		offset = pos & ~HPAGE_PMD_MASK;
> 		subpage_nr = (offset >> PAGE_CACHE_SHIFT);
> 	}
> 	
> > +		copied = iov_iter_copy_from_user_atomic(
> > +				page + subpage_nr,
> > +				i, smallpage_offset, bytes);
> 
> 
> > @@ -2437,6 +2453,7 @@ again:
> >  			 * because not all segments in the iov can be copied at
> >  			 * once without a pagefault.
> >  			 */
> > +			offset = pos & ~PAGE_CACHE_MASK;
> 
> Urg, and now it's *BACK* in to a small-page offset?
> 
> This means that 'offset' has two _different_ meanings and it morphs
> between them during the function a couple of times.  That seems very
> error-prone to me.

I guess this way is better, right?

@@ -2382,6 +2393,7 @@ static ssize_t generic_perform_write(struct file *file,
                unsigned long bytes;    /* Bytes to write to page */
                size_t copied;          /* Bytes copied from user */
                void *fsdata;
+               int subpage_nr = 0;
 
                offset = (pos & (PAGE_CACHE_SIZE - 1));
                bytes = min_t(unsigned long, PAGE_CACHE_SIZE - offset,
@@ -2411,8 +2423,14 @@ again:
                if (mapping_writably_mapped(mapping))
                        flush_dcache_page(page);
 
+               if (PageTransHuge(page)) {
+                       off_t huge_offset = pos & ~HPAGE_PMD_MASK;
+                       subpage_nr = huge_offset >> PAGE_CACHE_SHIFT;
+               }
+
                pagefault_disable();
-               copied = iov_iter_copy_from_user_atomic(page, i, offset, bytes);
+               copied = iov_iter_copy_from_user_atomic(page + subpage_nr, i,
+                               offset, bytes);
                pagefault_enable();
                flush_dcache_page(page);
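
[ A quick sanity check of the arithmetic, assuming 4k PAGE_CACHE_SIZE and 2M
  HPAGE_PMD_SIZE: for pos = 2M + 5*4k + 100, offset stays 100, huge_offset is
  5*4k + 100, so subpage_nr is 5 and the copy lands in page + 5 at offset 100,
  the same byte the previous version addressed via offset >> PAGE_CACHE_SHIFT
  and offset & ~PAGE_CACHE_MASK. ]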
 
-- 
 Kirill A. Shutemov

^ permalink raw reply	[flat|nested] 243+ messages in thread

* Re: [PATCHv4 20/39] thp, mm: naive support of thp in generic read/write routines
  2013-06-07 15:17       ` Kirill A. Shutemov
@ 2013-06-07 15:29         ` Dave Hansen
  -1 siblings, 0 replies; 243+ messages in thread
From: Dave Hansen @ 2013-06-07 15:29 UTC (permalink / raw)
  To: Kirill A. Shutemov
  Cc: Andrea Arcangeli, Andrew Morton, Al Viro, Hugh Dickins,
	Wu Fengguang, Jan Kara, Mel Gorman, linux-mm, Andi Kleen,
	Matthew Wilcox, Kirill A. Shutemov, Hillf Danton, linux-fsdevel,
	linux-kernel

On 06/07/2013 08:17 AM, Kirill A. Shutemov wrote:
<snip>
> I guess this way is better, right?
> 
> @@ -2382,6 +2393,7 @@ static ssize_t generic_perform_write(struct file *file,
>                 unsigned long bytes;    /* Bytes to write to page */
>                 size_t copied;          /* Bytes copied from user */
>                 void *fsdata;
> +               int subpage_nr = 0;
>  
>                 offset = (pos & (PAGE_CACHE_SIZE - 1));
>                 bytes = min_t(unsigned long, PAGE_CACHE_SIZE - offset,
> @@ -2411,8 +2423,14 @@ again:
>                 if (mapping_writably_mapped(mapping))
>                         flush_dcache_page(page);
>  
> +               if (PageTransHuge(page)) {
> +                       off_t huge_offset = pos & ~HPAGE_PMD_MASK;
> +                       subpage_nr = huge_offset >> PAGE_CACHE_SHIFT;
> +               }
> +
>                 pagefault_disable();
> -               copied = iov_iter_copy_from_user_atomic(page, i, offset, bytes);
> +               copied = iov_iter_copy_from_user_atomic(page + subpage_nr, i,
> +                               offset, bytes);
>                 pagefault_enable();
>                 flush_dcache_page(page);

That looks substantially easier for me to understand.  Nice.

^ permalink raw reply	[flat|nested] 243+ messages in thread

* Re: [PATCHv4 14/39] thp, mm: rewrite delete_from_page_cache() to support huge pages
  2013-06-07 15:10         ` Kirill A. Shutemov
@ 2013-06-07 15:56           ` Dave Hansen
  -1 siblings, 0 replies; 243+ messages in thread
From: Dave Hansen @ 2013-06-07 15:56 UTC (permalink / raw)
  To: Kirill A. Shutemov
  Cc: Andrea Arcangeli, Andrew Morton, Al Viro, Hugh Dickins,
	Wu Fengguang, Jan Kara, Mel Gorman, linux-mm, Andi Kleen,
	Matthew Wilcox, Kirill A. Shutemov, Hillf Danton, linux-fsdevel,
	linux-kernel

On 06/07/2013 08:10 AM, Kirill A. Shutemov wrote:
> +	/*
> +	 * When we add a huge page to the page cache we take a reference only on
> +	 * the head page, but on split we need to take an additional reference on
> +	 * all tail pages since they are still in the page cache after splitting.
> +	 */
> +	init_tail_refcount = PageAnon(page) ? 0 : 1;

What's the "init" for in the name?

In add_to_page_cache_locked() in patch 12/39, you do
> +       spin_lock_irq(&mapping->tree_lock);
> +       for (i = 0; i < nr; i++) {
> +               page_cache_get(page + i);

That looks to me to be taking references to the tail pages.  What gives? :)

>  	for (i = HPAGE_PMD_NR - 1; i >= 1; i--) {
>  		struct page *page_tail = page + i;
>  
> @@ -1587,8 +1595,9 @@ static void __split_huge_page_refcount(struct page *page,
>  		 * atomic_set() here would be safe on all archs (and
>  		 * not only on x86), it's safer to use atomic_add().
>  		 */
> -		atomic_add(page_mapcount(page) + page_mapcount(page_tail) + 1,
> -			   &page_tail->_count);
> +		atomic_add(init_tail_refcount + page_mapcount(page) +
> +				page_mapcount(page_tail) + 1,
> +				&page_tail->_count);
>  
>  		/* after clearing PageTail the gup refcount can be released */
>  		smp_mb();

This does look much better in general, though.


^ permalink raw reply	[flat|nested] 243+ messages in thread

* Re: [PATCHv4 14/39] thp, mm: rewrite delete_from_page_cache() to support huge pages
  2013-06-07 15:56           ` Dave Hansen
@ 2013-06-10 17:41             ` Kirill A. Shutemov
  -1 siblings, 0 replies; 243+ messages in thread
From: Kirill A. Shutemov @ 2013-06-10 17:41 UTC (permalink / raw)
  To: Dave Hansen
  Cc: Kirill A. Shutemov, Andrea Arcangeli, Andrew Morton, Al Viro,
	Hugh Dickins, Wu Fengguang, Jan Kara, Mel Gorman, linux-mm,
	Andi Kleen, Matthew Wilcox, Kirill A. Shutemov, Hillf Danton,
	linux-fsdevel, linux-kernel

Dave Hansen wrote:
> On 06/07/2013 08:10 AM, Kirill A. Shutemov wrote:
> > +	/*
> > +	 * When we add a huge page to the page cache we take a reference only on
> > +	 * the head page, but on split we need to take an additional reference on
> > +	 * all tail pages since they are still in the page cache after splitting.
> > +	 */
> > +	init_tail_refcount = PageAnon(page) ? 0 : 1;
> 
> What's the "init" for in the name?

initial_tail_refcount?

> In add_to_page_cache_locked() in patch 12/39, you do
> > +       spin_lock_irq(&mapping->tree_lock);
> > +       for (i = 0; i < nr; i++) {
> > +               page_cache_get(page + i);
> 
> That looks to me to be taking references to the tail pages.  What gives? :)

The point is to drop this from add_to_page_cache_locked() and distribute the
references on split instead.
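
To illustrate the idea, here is a toy userspace model (not the kernel
code; 512 subpages and init_tail_refcount = 1 mirror the x86-64 file-page
case, and the mapcount terms the real split adds are left out):

#include <stdio.h>

#define HPAGE_PMD_NR 512

int main(void)
{
	int count[HPAGE_PMD_NR] = { 0 };	/* stands in for page->_count */
	int init_tail_refcount = 1;		/* 1 for page cache, 0 for anon */
	int i;

	/* add_to_page_cache_locked(): only the head takes the cache ref */
	count[0]++;

	/* __split_huge_page_refcount(): hand a cache ref to every tail */
	for (i = HPAGE_PMD_NR - 1; i >= 1; i--)
		count[i] += init_tail_refcount;

	/* every 4k subpage now holds its own page cache reference */
	printf("head=%d first_tail=%d last_tail=%d\n",
	       count[0], count[1], count[HPAGE_PMD_NR - 1]);
	return 0;
}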

-- 
 Kirill A. Shutemov

^ permalink raw reply	[flat|nested] 243+ messages in thread

* Re: [PATCHv4 27/39] x86-64, mm: proper alignment mappings with hugepages
  2013-05-21 22:56     ` Dave Hansen
@ 2013-06-25 14:56       ` Kirill A. Shutemov
  -1 siblings, 0 replies; 243+ messages in thread
From: Kirill A. Shutemov @ 2013-06-25 14:56 UTC (permalink / raw)
  To: Dave Hansen
  Cc: Kirill A. Shutemov, Andrea Arcangeli, Andrew Morton, Al Viro,
	Hugh Dickins, Wu Fengguang, Jan Kara, Mel Gorman, linux-mm,
	Andi Kleen, Matthew Wilcox, Kirill A. Shutemov, Hillf Danton,
	linux-fsdevel, linux-kernel

Dave Hansen wrote:
> On 05/11/2013 06:23 PM, Kirill A. Shutemov wrote:
> > +static inline unsigned long mapping_align_mask(struct address_space *mapping)
> > +{
> > +	if (mapping_can_have_hugepages(mapping))
> > +		return PAGE_MASK & ~HPAGE_MASK;
> > +	return get_align_mask();
> > +}
> 
> get_align_mask() appears to be a bit more complicated to me than just a
> plain old mask.  Are you sure you don't need to pick up any of its
> behavior for the mapping_can_have_hugepages() case?

get_align_mask() never returns a stricter mask than the one we use in the
mapping_can_have_hugepages() case.

I can modify it this way:

        unsigned long mask = get_align_mask();

        if (mapping_can_have_hugepages(mapping))
                mask &= PAGE_MASK & ~HPAGE_MASK;
        return mask;

But it looks more confusing to me. What do you think?
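
For reference, with the usual x86-64 constants the hugepage case works out
to 0x1ff000, i.e. the bits between the 4k page offset and the 2M boundary,
so the chosen address ends up 2M-aligned.  A quick standalone check of the
arithmetic (constants assumed, just for illustration):

#include <stdio.h>

#define PAGE_SHIFT	12
#define PAGE_MASK	(~((1UL << PAGE_SHIFT) - 1))
#define HPAGE_SHIFT	21
#define HPAGE_MASK	(~((1UL << HPAGE_SHIFT) - 1))

int main(void)
{
	/* alignment bits required for a hugepage-capable mapping */
	unsigned long mask = PAGE_MASK & ~HPAGE_MASK;

	printf("hugepage align mask: %#lx\n", mask);	/* 0x1ff000 */
	return 0;
}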

-- 
 Kirill A. Shutemov

^ permalink raw reply	[flat|nested] 243+ messages in thread

* Re: [PATCHv4 27/39] x86-64, mm: proper alignment mappings with hugepages
  2013-06-25 14:56       ` Kirill A. Shutemov
@ 2013-06-25 16:46         ` Dave Hansen
  -1 siblings, 0 replies; 243+ messages in thread
From: Dave Hansen @ 2013-06-25 16:46 UTC (permalink / raw)
  To: Kirill A. Shutemov
  Cc: Andrea Arcangeli, Andrew Morton, Al Viro, Hugh Dickins,
	Wu Fengguang, Jan Kara, Mel Gorman, linux-mm, Andi Kleen,
	Matthew Wilcox, Kirill A. Shutemov, Hillf Danton, linux-fsdevel,
	linux-kernel

On 06/25/2013 07:56 AM, Kirill A. Shutemov wrote:
> Dave Hansen wrote:
>> On 05/11/2013 06:23 PM, Kirill A. Shutemov wrote:
>>> +static inline unsigned long mapping_align_mask(struct address_space *mapping)
>>> +{
>>> +	if (mapping_can_have_hugepages(mapping))
>>> +		return PAGE_MASK & ~HPAGE_MASK;
>>> +	return get_align_mask();
>>> +}
>>
>> get_align_mask() appears to be a bit more complicated to me than just a
>> plain old mask.  Are you sure you don't need to pick up any of its
>> behavior for the mapping_can_have_hugepages() case?
> 
> get_align_mask() never returns a stricter mask than the one we use in the
> mapping_can_have_hugepages() case.
> 
> I can modify it this way:
> 
>         unsigned long mask = get_align_mask();
> 
>         if (mapping_can_have_hugepages(mapping))
>                 mask &= PAGE_MASK & ~HPAGE_MASK;
>         return mask;
> 
> But it looks more confusing to me. What do you think?

Personally, I find that a *LOT* clearer.  The &= pretty much spells out
what you said in your explanation: get_align_mask()'s mask can only be
made stricter when we encounter a huge page.

The relationship between the two masks is not apparent at all in your
original code.  This is all nitpicking, though; I just wanted to make
sure you'd considered whether you were accidentally changing behavior.

^ permalink raw reply	[flat|nested] 243+ messages in thread

* Re: [PATCHv4 17/39] thp, mm: handle tail pages in page_cache_get_speculative()
  2013-05-21 20:49     ` Dave Hansen
@ 2013-06-27 12:40       ` Kirill A. Shutemov
  -1 siblings, 0 replies; 243+ messages in thread
From: Kirill A. Shutemov @ 2013-06-27 12:40 UTC (permalink / raw)
  To: Dave Hansen
  Cc: Kirill A. Shutemov, Andrea Arcangeli, Andrew Morton, Al Viro,
	Hugh Dickins, Wu Fengguang, Jan Kara, Mel Gorman, linux-mm,
	Andi Kleen, Matthew Wilcox, Kirill A. Shutemov, Hillf Danton,
	linux-fsdevel, linux-kernel

Dave Hansen wrote:
> On 05/11/2013 06:23 PM, Kirill A. Shutemov wrote:
> > From: "Kirill A. Shutemov" <kirill.shutemov@linux.intel.com>
> > 
> > For tail page we call __get_page_tail(). It has the same semantics, but
> > for tail page.
> 
> page_cache_get_speculative() has a ~50-line comment above it with lots
> of scariness about grace periods and RCU.  A two line comment saying
> that the semantics are the same doesn't make me feel great that you've
> done your homework here.

Okay. I will fix the commit message and the comment.

> Are there any performance implications here?  __get_page_tail() says:
> "It implements the slow path of get_page().".
> page_cache_get_speculative() seems awfully speculative which would make
> me think that it is part of a _fast_ path.

It's a slow path in the sense that we have to do more for a tail page than
for a non-compound or head page.

We can probably make it a bit faster by unrolling the function calls and doing
only what is relevant for our case. Like this:

diff --git a/include/linux/pagemap.h b/include/linux/pagemap.h
index ad60dcc..57ad1ae 100644
--- a/include/linux/pagemap.h
+++ b/include/linux/pagemap.h
@@ -161,6 +161,8 @@ void release_pages(struct page **pages, int nr, int cold);
  */
 static inline int page_cache_get_speculative(struct page *page)
 {
+	struct page *page_head = compound_trans_head(page);
+
 	VM_BUG_ON(in_interrupt());
 
 #ifdef CONFIG_TINY_RCU
@@ -176,11 +178,11 @@ static inline int page_cache_get_speculative(struct page *page)
 	 * disabling preempt, and hence no need for the "speculative get" that
 	 * SMP requires.
 	 */
-	VM_BUG_ON(page_count(page) == 0);
+	VM_BUG_ON(page_count(page_head) == 0);
 	atomic_inc(&page->_count);
 
 #else
-	if (unlikely(!get_page_unless_zero(page))) {
+	if (unlikely(!get_page_unless_zero(page_head))) {
 		/*
 		 * Either the page has been freed, or will be freed.
 		 * In either case, retry here and the caller should
@@ -189,7 +191,23 @@ static inline int page_cache_get_speculative(struct page *page)
 		return 0;
 	}
 #endif
-	VM_BUG_ON(PageTail(page));
+
+	if (unlikely(PageTransTail(page))) {
+		unsigned long flags;
+		int got = 0;
+
+		flags = compound_lock_irqsave(page_head);
+		if (likely(PageTransTail(page))) {
+			atomic_inc(&page->_mapcount);
+			got = 1;
+		}
+		compound_unlock_irqrestore(page_head, flags);
+
+		if (unlikely(!got))
+			put_page(page_head);
+
+		return got;
+	}
 
 	return 1;
 }

What do you think? Is it better?
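
(As an aside, for anyone not familiar with the pattern: the "speculative"
part is essentially get-unless-zero.  A toy userspace model of just that
part -- not the kernel implementation, and separate from the tail-page
handling above:)

#include <stdatomic.h>
#include <stdio.h>

struct obj { atomic_int count; };

/* take a reference only if the object is still live; 0 means the caller
 * raced with the final put and must redo the lookup */
static int get_unless_zero(struct obj *o)
{
	int c = atomic_load(&o->count);

	while (c != 0) {
		if (atomic_compare_exchange_weak(&o->count, &c, c + 1))
			return 1;
	}
	return 0;
}

int main(void)
{
	struct obj live, dead;

	atomic_init(&live.count, 1);	/* still referenced elsewhere */
	atomic_init(&dead.count, 0);	/* final reference already dropped */

	printf("live: got=%d count=%d\n", get_unless_zero(&live),
	       atomic_load(&live.count));
	printf("dead: got=%d\n", get_unless_zero(&dead));
	return 0;
}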

-- 
 Kirill A. Shutemov

^ permalink raw reply related	[flat|nested] 243+ messages in thread

end of thread, other threads:[~2013-06-27 12:40 UTC | newest]

Thread overview: 243+ messages (download: mbox.gz / follow: Atom feed)
-- links below jump to the message on this page --
2013-05-12  1:22 [PATCHv4 00/39] Transparent huge page cache Kirill A. Shutemov
2013-05-12  1:22 ` Kirill A. Shutemov
2013-05-12  1:22 ` [PATCHv4 01/39] mm: drop actor argument of do_generic_file_read() Kirill A. Shutemov
2013-05-12  1:22   ` Kirill A. Shutemov
2013-05-21 18:22   ` Dave Hansen
2013-05-21 18:22     ` Dave Hansen
2013-05-12  1:22 ` [PATCHv4 02/39] block: implement add_bdi_stat() Kirill A. Shutemov
2013-05-12  1:22   ` Kirill A. Shutemov
2013-05-21 18:25   ` Dave Hansen
2013-05-21 18:25     ` Dave Hansen
2013-05-22 11:06     ` Kirill A. Shutemov
2013-05-22 11:06       ` Kirill A. Shutemov
2013-05-12  1:23 ` [PATCHv4 03/39] mm: implement zero_huge_user_segment and friends Kirill A. Shutemov
2013-05-12  1:23   ` Kirill A. Shutemov
2013-05-23 10:32   ` Hillf Danton
2013-05-23 10:32     ` Hillf Danton
2013-05-23 11:32     ` Kirill A. Shutemov
2013-05-23 11:32       ` Kirill A. Shutemov
2013-05-12  1:23 ` [PATCHv4 04/39] radix-tree: implement preload for multiple contiguous elements Kirill A. Shutemov
2013-05-12  1:23   ` Kirill A. Shutemov
2013-05-21 18:58   ` Dave Hansen
2013-05-21 18:58     ` Dave Hansen
2013-05-22 12:03     ` Kirill A. Shutemov
2013-05-22 12:03       ` Kirill A. Shutemov
2013-05-22 14:20       ` Dave Hansen
2013-05-22 14:20         ` Dave Hansen
2013-05-12  1:23 ` [PATCHv4 05/39] memcg, thp: charge huge cache pages Kirill A. Shutemov
2013-05-12  1:23   ` Kirill A. Shutemov
2013-05-21 19:04   ` Dave Hansen
2013-05-21 19:04     ` Dave Hansen
2013-05-12  1:23 ` [PATCHv4 06/39] thp, mm: avoid PageUnevictable on active/inactive lru lists Kirill A. Shutemov
2013-05-12  1:23   ` Kirill A. Shutemov
2013-05-21 19:17   ` Dave Hansen
2013-05-21 19:17     ` Dave Hansen
2013-05-22 12:34     ` Kirill A. Shutemov
2013-05-22 12:34       ` Kirill A. Shutemov
2013-05-12  1:23 ` [PATCHv4 07/39] thp, mm: basic defines for transparent huge page cache Kirill A. Shutemov
2013-05-12  1:23   ` Kirill A. Shutemov
2013-05-23 10:36   ` Hillf Danton
2013-05-23 10:36     ` Hillf Danton
2013-05-23 15:49     ` Dave Hansen
2013-05-23 15:49       ` Dave Hansen
2013-05-12  1:23 ` [PATCHv4 08/39] thp: compile-time and sysfs knob for thp pagecache Kirill A. Shutemov
2013-05-12  1:23   ` Kirill A. Shutemov
2013-05-22 11:19   ` Hillf Danton
2013-05-22 11:19     ` Hillf Danton
2013-05-12  1:23 ` [PATCHv4 09/39] thp, mm: introduce mapping_can_have_hugepages() predicate Kirill A. Shutemov
2013-05-12  1:23   ` Kirill A. Shutemov
2013-05-21 19:28   ` Dave Hansen
2013-05-21 19:28     ` Dave Hansen
2013-05-22 13:51     ` Kirill A. Shutemov
2013-05-22 13:51       ` Kirill A. Shutemov
2013-05-22 15:31       ` Dave Hansen
2013-05-22 15:31         ` Dave Hansen
2013-05-12  1:23 ` [PATCHv4 10/39] thp: account anon transparent huge pages into NR_ANON_PAGES Kirill A. Shutemov
2013-05-12  1:23   ` Kirill A. Shutemov
2013-05-21 19:32   ` Dave Hansen
2013-05-21 19:32     ` Dave Hansen
2013-05-12  1:23 ` [PATCHv4 11/39] thp: represent file thp pages in meminfo and friends Kirill A. Shutemov
2013-05-12  1:23   ` Kirill A. Shutemov
2013-05-21 19:34   ` Dave Hansen
2013-05-21 19:34     ` Dave Hansen
2013-05-12  1:23 ` [PATCHv4 12/39] thp, mm: rewrite add_to_page_cache_locked() to support huge pages Kirill A. Shutemov
2013-05-12  1:23   ` Kirill A. Shutemov
2013-05-21 19:59   ` Dave Hansen
2013-05-21 19:59     ` Dave Hansen
2013-05-23 14:36     ` Kirill A. Shutemov
2013-05-23 14:36       ` Kirill A. Shutemov
2013-05-23 16:00       ` Dave Hansen
2013-05-23 16:00         ` Dave Hansen
2013-05-28 11:59         ` Kirill A. Shutemov
2013-05-28 11:59           ` Kirill A. Shutemov
2013-05-12  1:23 ` [PATCHv4 13/39] mm: trace filemap: dump page order Kirill A. Shutemov
2013-05-12  1:23   ` Kirill A. Shutemov
2013-05-21 19:35   ` Dave Hansen
2013-05-21 19:35     ` Dave Hansen
2013-05-12  1:23 ` [PATCHv4 14/39] thp, mm: rewrite delete_from_page_cache() to support huge pages Kirill A. Shutemov
2013-05-12  1:23   ` Kirill A. Shutemov
2013-05-21 20:14   ` Dave Hansen
2013-05-21 20:14     ` Dave Hansen
2013-05-28 12:28     ` Kirill A. Shutemov
2013-05-28 12:28       ` Kirill A. Shutemov
2013-06-07 15:10       ` Kirill A. Shutemov
2013-06-07 15:10         ` Kirill A. Shutemov
2013-06-07 15:10         ` Kirill A. Shutemov
2013-06-07 15:56         ` Dave Hansen
2013-06-07 15:56           ` Dave Hansen
2013-06-10 17:41           ` Kirill A. Shutemov
2013-06-10 17:41             ` Kirill A. Shutemov
2013-05-12  1:23 ` [PATCHv4 15/39] thp, mm: trigger bug in replace_page_cache_page() on THP Kirill A. Shutemov
2013-05-12  1:23   ` Kirill A. Shutemov
2013-05-21 20:17   ` Dave Hansen
2013-05-21 20:17     ` Dave Hansen
2013-05-28 12:53     ` Kirill A. Shutemov
2013-05-28 12:53       ` Kirill A. Shutemov
2013-05-28 16:33       ` Dave Hansen
2013-05-28 16:33         ` Dave Hansen
2013-05-12  1:23 ` [PATCHv4 16/39] thp, mm: locking tail page is a bug Kirill A. Shutemov
2013-05-12  1:23   ` Kirill A. Shutemov
2013-05-21 20:18   ` Dave Hansen
2013-05-21 20:18     ` Dave Hansen
2013-05-22 14:12     ` Kirill A. Shutemov
2013-05-22 14:12       ` Kirill A. Shutemov
2013-05-22 14:53       ` Dave Hansen
2013-05-22 14:53         ` Dave Hansen
2013-05-12  1:23 ` [PATCHv4 17/39] thp, mm: handle tail pages in page_cache_get_speculative() Kirill A. Shutemov
2013-05-12  1:23   ` Kirill A. Shutemov
2013-05-21 20:49   ` Dave Hansen
2013-05-21 20:49     ` Dave Hansen
2013-06-27 12:40     ` Kirill A. Shutemov
2013-06-27 12:40       ` Kirill A. Shutemov
2013-05-12  1:23 ` [PATCHv4 18/39] thp, mm: add event counters for huge page alloc on write to a file Kirill A. Shutemov
2013-05-12  1:23   ` Kirill A. Shutemov
2013-05-21 20:54   ` Dave Hansen
2013-05-21 20:54     ` Dave Hansen
2013-05-12  1:23 ` [PATCHv4 19/39] thp, mm: allocate huge pages in grab_cache_page_write_begin() Kirill A. Shutemov
2013-05-12  1:23   ` Kirill A. Shutemov
2013-05-21 21:14   ` Dave Hansen
2013-05-21 21:14     ` Dave Hansen
2013-05-30 13:20     ` Kirill A. Shutemov
2013-05-30 13:20       ` Kirill A. Shutemov
2013-05-12  1:23 ` [PATCHv4 20/39] thp, mm: naive support of thp in generic read/write routines Kirill A. Shutemov
2013-05-12  1:23   ` Kirill A. Shutemov
2013-05-21 21:28   ` Dave Hansen
2013-05-21 21:28     ` Dave Hansen
2013-06-07 15:17     ` Kirill A. Shutemov
2013-06-07 15:17       ` Kirill A. Shutemov
2013-06-07 15:29       ` Dave Hansen
2013-06-07 15:29         ` Dave Hansen
2013-05-12  1:23 ` [PATCHv4 21/39] thp, libfs: initial support of thp in simple_read/write_begin/write_end Kirill A. Shutemov
2013-05-12  1:23   ` Kirill A. Shutemov
2013-05-21 21:49   ` Dave Hansen
2013-05-21 21:49     ` Dave Hansen
2013-05-12  1:23 ` [PATCHv4 22/39] thp: handle file pages in split_huge_page() Kirill A. Shutemov
2013-05-12  1:23   ` Kirill A. Shutemov
2013-05-12  1:23 ` [PATCHv4 23/39] thp: wait_split_huge_page(): serialize over i_mmap_mutex too Kirill A. Shutemov
2013-05-12  1:23   ` Kirill A. Shutemov
2013-05-21 22:05   ` Dave Hansen
2013-05-21 22:05     ` Dave Hansen
2013-06-03 15:02     ` Kirill A. Shutemov
2013-06-03 15:02       ` Kirill A. Shutemov
2013-06-03 15:53       ` Dave Hansen
2013-06-03 15:53         ` Dave Hansen
2013-06-03 16:09         ` Kirill A. Shutemov
2013-06-03 16:09           ` Kirill A. Shutemov
2013-05-12  1:23 ` [PATCHv4 24/39] thp, mm: truncate support for transparent huge page cache Kirill A. Shutemov
2013-05-12  1:23   ` Kirill A. Shutemov
2013-05-21 22:39   ` Dave Hansen
2013-05-21 22:39     ` Dave Hansen
2013-05-12  1:23 ` [PATCHv4 25/39] thp, mm: split huge page on mmap file page Kirill A. Shutemov
2013-05-12  1:23   ` Kirill A. Shutemov
2013-05-12  1:23 ` [PATCHv4 26/39] ramfs: enable transparent huge page cache Kirill A. Shutemov
2013-05-12  1:23   ` Kirill A. Shutemov
2013-05-21 22:43   ` Dave Hansen
2013-05-21 22:43     ` Dave Hansen
2013-05-22 14:22     ` Kirill A. Shutemov
2013-05-22 14:22       ` Kirill A. Shutemov
2013-05-22 14:55       ` Dave Hansen
2013-05-22 14:55         ` Dave Hansen
2013-05-12  1:23 ` [PATCHv4 27/39] x86-64, mm: proper alignment mappings with hugepages Kirill A. Shutemov
2013-05-12  1:23   ` Kirill A. Shutemov
2013-05-21 22:56   ` Dave Hansen
2013-05-21 22:56     ` Dave Hansen
2013-06-25 14:56     ` Kirill A. Shutemov
2013-06-25 14:56       ` Kirill A. Shutemov
2013-06-25 16:46       ` Dave Hansen
2013-06-25 16:46         ` Dave Hansen
2013-05-21 23:20   ` Dave Hansen
2013-05-21 23:20     ` Dave Hansen
2013-05-12  1:23 ` [PATCHv4 28/39] thp: prepare zap_huge_pmd() to uncharge file pages Kirill A. Shutemov
2013-05-12  1:23   ` Kirill A. Shutemov
2013-05-22  7:26   ` Hillf Danton
2013-05-22  7:26     ` Hillf Danton
2013-05-12  1:23 ` [PATCHv4 29/39] thp: move maybe_pmd_mkwrite() out of mk_huge_pmd() Kirill A. Shutemov
2013-05-12  1:23   ` Kirill A. Shutemov
2013-05-21 23:23   ` Dave Hansen
2013-05-21 23:23     ` Dave Hansen
2013-05-22 14:37     ` Kirill A. Shutemov
2013-05-22 14:37       ` Kirill A. Shutemov
2013-05-22 14:56       ` Dave Hansen
2013-05-22 14:56         ` Dave Hansen
2013-05-21 23:23   ` Dave Hansen
2013-05-21 23:23     ` Dave Hansen
2013-05-12  1:23 ` [PATCHv4 30/39] thp: do_huge_pmd_anonymous_page() cleanup Kirill A. Shutemov
2013-05-12  1:23   ` Kirill A. Shutemov
2013-05-22 11:45   ` Hillf Danton
2013-05-22 11:45     ` Hillf Danton
2013-05-12  1:23 ` [PATCHv4 31/39] thp: consolidate code between handle_mm_fault() and do_huge_pmd_anonymous_page() Kirill A. Shutemov
2013-05-12  1:23   ` Kirill A. Shutemov
2013-05-21 23:38   ` Dave Hansen
2013-05-21 23:38     ` Dave Hansen
2013-05-22  6:51   ` Hillf Danton
2013-05-22  6:51     ` Hillf Danton
2013-05-12  1:23 ` [PATCHv4 32/39] mm: cleanup __do_fault() implementation Kirill A. Shutemov
2013-05-12  1:23   ` Kirill A. Shutemov
2013-05-21 23:57   ` Dave Hansen
2013-05-21 23:57     ` Dave Hansen
2013-05-12  1:23 ` [PATCHv4 33/39] thp, mm: implement do_huge_linear_fault() Kirill A. Shutemov
2013-05-12  1:23   ` Kirill A. Shutemov
2013-05-22 12:47   ` Hillf Danton
2013-05-22 12:47     ` Hillf Danton
2013-05-22 15:13     ` Kirill A. Shutemov
2013-05-22 15:13       ` Kirill A. Shutemov
2013-05-22 12:56   ` Hillf Danton
2013-05-22 12:56     ` Hillf Danton
2013-05-22 15:14     ` Kirill A. Shutemov
2013-05-22 15:14       ` Kirill A. Shutemov
2013-05-22 13:24   ` Hillf Danton
2013-05-22 13:24     ` Hillf Danton
2013-05-22 15:26     ` Kirill A. Shutemov
2013-05-22 15:26       ` Kirill A. Shutemov
2013-05-12  1:23 ` [PATCHv4 34/39] thp, mm: handle huge pages in filemap_fault() Kirill A. Shutemov
2013-05-12  1:23   ` Kirill A. Shutemov
2013-05-22 11:37   ` Hillf Danton
2013-05-22 11:37     ` Hillf Danton
2013-05-22 15:34     ` Kirill A. Shutemov
2013-05-22 15:34       ` Kirill A. Shutemov
2013-05-12  1:23 ` [PATCHv4 35/39] mm: decomposite do_wp_page() and get rid of some 'goto' logic Kirill A. Shutemov
2013-05-12  1:23   ` Kirill A. Shutemov
2013-05-12  1:23 ` [PATCHv4 36/39] mm: do_wp_page(): extract VM_WRITE|VM_SHARED case to separate function Kirill A. Shutemov
2013-05-12  1:23   ` Kirill A. Shutemov
2013-05-12  1:23 ` [PATCHv4 37/39] thp: handle write-protect exception to file-backed huge pages Kirill A. Shutemov
2013-05-12  1:23   ` Kirill A. Shutemov
2013-05-23 11:57   ` Hillf Danton
2013-05-23 11:57     ` Hillf Danton
2013-05-23 12:08     ` Kirill A. Shutemov
2013-05-23 12:08       ` Kirill A. Shutemov
2013-05-23 12:12       ` Hillf Danton
2013-05-23 12:12         ` Hillf Danton
2013-05-23 12:33         ` Kirill A. Shutemov
2013-05-23 12:33           ` Kirill A. Shutemov
2013-05-12  1:23 ` [PATCHv4 38/39] thp: vma_adjust_trans_huge(): adjust file-backed VMA too Kirill A. Shutemov
2013-05-12  1:23   ` Kirill A. Shutemov
2013-05-23 11:01   ` Hillf Danton
2013-05-23 11:01     ` Hillf Danton
2013-05-12  1:23 ` [PATCHv4 39/39] thp: map file-backed huge pages on fault Kirill A. Shutemov
2013-05-12  1:23   ` Kirill A. Shutemov
2013-05-23 11:36   ` Hillf Danton
2013-05-23 11:36     ` Hillf Danton
2013-05-23 11:48     ` Kirill A. Shutemov
2013-05-23 11:48       ` Kirill A. Shutemov
2013-05-21 18:37 ` [PATCHv4 00/39] Transparent huge page cache Dave Hansen
2013-05-21 18:37   ` Dave Hansen
