* [RFC][PATCH] HWPOISON: remove the unsafe __set_page_locked()
@ 2009-09-26  3:15 ` Wu Fengguang
  0 siblings, 0 replies; 54+ messages in thread
From: Wu Fengguang @ 2009-09-26  3:15 UTC (permalink / raw)
  To: Andrew Morton; +Cc: Hugh Dickins, Nick Piggin, Andi Kleen, linux-mm, LKML

The swap cache and page cache code assume that they 'own' the newly
allocated page and therefore can disregard the locking rules. However,
hwpoison can now hit any page at any time.

So use the safer lock_page()/trylock_page(). The main intention is not
to close the small time window of memory corruption, but to avoid the
kernel oopses that may result from such races, and to avoid raising
false alerts in hwpoison stress tests.

In theory this slightly increases page cache/swap cache overheads;
however, the difference seems too small to be measurable in benchmarks.

CC: Hugh Dickins <hugh.dickins@tiscali.co.uk>
CC: Andi Kleen <andi@firstfloor.org> 
Signed-off-by: Wu Fengguang <fengguang.wu@intel.com>
---
 include/linux/pagemap.h |   13 ++++---------
 mm/migrate.c            |    2 +-
 mm/swap_state.c         |    4 ++--
 3 files changed, 7 insertions(+), 12 deletions(-)

--- sound-2.6.orig/mm/swap_state.c	2009-09-14 10:50:19.000000000 +0800
+++ sound-2.6/mm/swap_state.c	2009-09-25 18:42:23.000000000 +0800
@@ -306,7 +306,7 @@ struct page *read_swap_cache_async(swp_e
 		 * re-using the just freed swap entry for an existing page.
 		 * May fail (-ENOMEM) if radix-tree node allocation failed.
 		 */
-		__set_page_locked(new_page);
+		lock_page(new_page);
 		SetPageSwapBacked(new_page);
 		err = add_to_swap_cache(new_page, entry, gfp_mask & GFP_KERNEL);
 		if (likely(!err)) {
@@ -318,7 +318,7 @@ struct page *read_swap_cache_async(swp_e
 			return new_page;
 		}
 		ClearPageSwapBacked(new_page);
-		__clear_page_locked(new_page);
+		unlock_page(new_page);
 		swapcache_free(entry, NULL);
 	} while (err != -ENOMEM);
 
--- sound-2.6.orig/include/linux/pagemap.h	2009-09-14 10:50:19.000000000 +0800
+++ sound-2.6/include/linux/pagemap.h	2009-09-25 18:42:19.000000000 +0800
@@ -292,11 +292,6 @@ extern int __lock_page_killable(struct p
 extern void __lock_page_nosync(struct page *page);
 extern void unlock_page(struct page *page);
 
-static inline void __set_page_locked(struct page *page)
-{
-	__set_bit(PG_locked, &page->flags);
-}
-
 static inline void __clear_page_locked(struct page *page)
 {
 	__clear_bit(PG_locked, &page->flags);
@@ -435,18 +430,18 @@ extern void remove_from_page_cache(struc
 extern void __remove_from_page_cache(struct page *page);
 
 /*
- * Like add_to_page_cache_locked, but used to add newly allocated pages:
- * the page is new, so we can just run __set_page_locked() against it.
+ * Like add_to_page_cache_locked, but used to add newly allocated pages.
  */
 static inline int add_to_page_cache(struct page *page,
 		struct address_space *mapping, pgoff_t offset, gfp_t gfp_mask)
 {
 	int error;
 
-	__set_page_locked(page);
+	if (!trylock_page(page))
+		return -EIO;	/* hwpoisoned */
 	error = add_to_page_cache_locked(page, mapping, offset, gfp_mask);
 	if (unlikely(error))
-		__clear_page_locked(page);
+		unlock_page(page);
 	return error;
 }
 
--- sound-2.6.orig/mm/migrate.c	2009-09-14 10:50:19.000000000 +0800
+++ sound-2.6/mm/migrate.c	2009-09-25 18:42:19.000000000 +0800
@@ -551,7 +551,7 @@ static int move_to_new_page(struct page 
 	 * holding a reference to the new page at this point.
 	 */
 	if (!trylock_page(newpage))
-		BUG();
+		return -EAGAIN;		/* page taken by hwpoison */
 
 	/* Prepare mapping for the new page.*/
 	newpage->index = page->index;

* Re: [RFC][PATCH] HWPOISON: remove the unsafe __set_page_locked()
  2009-09-26  3:15 ` Wu Fengguang
@ 2009-09-26  3:49   ` Andi Kleen
  -1 siblings, 0 replies; 54+ messages in thread
From: Andi Kleen @ 2009-09-26  3:49 UTC (permalink / raw)
  To: Wu Fengguang
  Cc: Andrew Morton, Hugh Dickins, Nick Piggin, Andi Kleen, linux-mm, LKML

On Sat, Sep 26, 2009 at 11:15:37AM +0800, Wu Fengguang wrote:
> The swap cache and page cache code assume that they 'own' the newly
> allocated page and therefore can disregard the locking rules. However,
> hwpoison can now hit any page at any time.
>
> So use the safer lock_page()/trylock_page(). The main intention is not
> to close the small time window of memory corruption, but to avoid the
> kernel oopses that may result from such races, and to avoid raising
> false alerts in hwpoison stress tests.
>
> In theory this slightly increases page cache/swap cache overheads;
> however, the difference seems too small to be measurable in benchmarks.

Thanks. Can you please describe what benchmarks you used?

Acked-by: Andi Kleen <ak@linux.intel.com>
-andi

* Re: [RFC][PATCH] HWPOISON: remove the unsafe __set_page_locked()
  2009-09-26  3:49   ` Andi Kleen
@ 2009-09-26 10:52     ` Wu Fengguang
  -1 siblings, 0 replies; 54+ messages in thread
From: Wu Fengguang @ 2009-09-26 10:52 UTC (permalink / raw)
  To: Andi Kleen; +Cc: Andrew Morton, Hugh Dickins, Nick Piggin, linux-mm, LKML

On Sat, Sep 26, 2009 at 11:49:36AM +0800, Andi Kleen wrote:
> On Sat, Sep 26, 2009 at 11:15:37AM +0800, Wu Fengguang wrote:
> > The swap cache and page cache code assume that they 'own' the newly
> > allocated page and therefore can disregard the locking rules. However,
> > hwpoison can now hit any page at any time.
> >
> > So use the safer lock_page()/trylock_page(). The main intention is not
> > to close the small time window of memory corruption, but to avoid the
> > kernel oopses that may result from such races, and to avoid raising
> > false alerts in hwpoison stress tests.
> >
> > In theory this slightly increases page cache/swap cache overheads;
> > however, the difference seems too small to be measurable in benchmarks.
> 
> Thanks. Can you please describe what benchmarks you used?

Good question. The test case is to copy a 1GB sparse file over NFS.

I ran it 100 times (2 hours total wall time); the added overhead is ~0.008% of total time.

The average time reported by the function profiler:

                total time      add_to_page_cache_lru   percent
        before  1940083424.361  4841664.805             0.242%
        after   1914258447.188  4889044.435             0.250%

The script used is:

        for i in `seq 1 100`
        do      
                # copy 1GB sparse file over NFS
                echo 1 > /debug/tracing/function_profile_enabled
                cp /tmp/sparse/1G /dev/null
                echo 0 > /debug/tracing/function_profile_enabled

                fadvise /tmp/sparse/1G 0 0 dontneed
                cat /debug/tracing/trace_stat/function0 > function0.$i
                cat /debug/tracing/trace_stat/function1 > function1.$i
        done


File function0.1 after patch:

  Function                               Hit    Time            Avg
  --------                               ---    ----            ---
  schedule                             32492    474358128 us     14599.22 us 
  schedule_timeout                        26    99994288 us     3845934 us 
  poll_schedule_timeout                    8    52914376 us     6614297 us 
  sys_poll                                 6    52914373 us     8819062 us 
  schedule_hrtimeout_range                 8    52914373 us     6614296 us 
  do_sys_poll                              6    52914355 us     8819059 us 
  sys_read                             86997    40703194 us     467.868 us  
  vfs_read                             87002    40550651 us     466.088 us  
  do_sync_read                         86987    40011054 us     459.965 us  
  nfs_file_read                        86985    39945631 us     459.224 us  
  generic_file_aio_read                86985    39785858 us     457.387 us  
  page_cache_async_readahead            2886    34379069 us     11912.35 us 
  ondemand_readahead                    2889    34371850 us     11897.49 us 
  ra_submit                             2891    34368378 us     11888.05 us 
  __do_page_cache_readahead             2891    34360744 us     11885.41 us 
  nfs_readpages                         2876    33198107 us     11543.15 us 
  read_cache_pages                      2876    31617284 us     10993.49 us 
  __rpc_execute                        84468    25404441 us     300.758 us  
  readpage_async_filler               174071    24947004 us     143.315 us  
  do_softirq                           44822    24829675 us     553.961 us  
  __do_softirq                         44822    24803408 us     553.375 us  
  net_rx_action                        38424    24357207 us     633.906 us  
  e1000_clean                          39042    24253141 us     621.206 us  
  nfs_pageio_doio                      43534    22993433 us     528.171 us  
  nfs_pagein_one                       43520    22939752 us     527.108 us  
  nfs_read_rpcsetup                    43513    22533176 us     517.849 us  
  e1000_clean_rx_irq                   39042    22047306 us     564.707 us  
  rpc_run_task                         43528    22042766 us     506.404 us  
  call_transmit                        45873    21842655 us     476.154 us  
  nfs_pageio_add_request              174091    21583666 us     123.979 us  
  rpc_execute                          43528    21254402 us     488.292 us  
  cpuidle_idle_call                    18073    19445493 us     1075.941 us 
  acpi_idle_enter_c1                   17999    19396945 us     1077.667 us 
  e1000_receive_skb                   436279    18818686 us     43.134 us   
  tcp_v4_do_rcv                       513521    18681598 us     36.379 us   
  xprt_transmit                        43346    18532249 us     427.542 us  
  tcp_rcv_established                 513521    18384627 us     35.801 us   
  napi_gro_receive                    436279    18114586 us     41.520 us   
  napi_skb_finish                     436279    17130261 us     39.264 us   
  netif_receive_skb                   436279    16886046 us     38.704 us   
  ip_rcv                              436279    16500338 us     37.820 us   
  xs_tcp_send_request                  43346    16418581 us     378.779 us  
  xs_sendpages                         43346    16367619 us     377.603 us  
  xs_send_kvec                         43347    16310325 us     376.273 us  
  kernel_sendmsg                       43347    16254698 us     374.990 us  
  sock_sendmsg                         43347    16177184 us     373.201 us  
  tcp_sendmsg                          43347    16127746 us     372.061 us  
  do_IRQ                               45271    15968995 us     352.742 us  
  ip_local_deliver                    436279    15816133 us     36.252 us   
  irq_exit                             53709    15421475 us     287.130 us  
  acpi_processor_ffh_cstate_ente       17999    15394735 us     855.310 us  
  mwait_idle_with_hints                17999    15386089 us     854.830 us  
  tcp_v4_rcv                          436217    15173852 us     34.785 us   
  release_sock                         43364    13536313 us     312.155 us  
  xs_tcp_data_ready                   512045    10092920 us     19.711 us   
  _spin_unlock_bh                     518900    9851221 us     18.984 us   
  local_bh_enable_ip                  511352    9403704 us     18.389 us   
  tcp_read_sock                      1024090    8832051 us     8.624 us    
  __tcp_ack_snd_check                 512045    6644838 us     12.977 us   
  add_to_page_cache_lru               174072    6287717 us     36.121 us   
  tcp_transmit_skb                    296572    5962431 us     20.104 us   
  tcp_send_ack                        253210    5775135 us     22.807 us   
  __kfree_skb                         802932    5216451 us     6.496 us    
  add_to_page_cache_locked            174072    4716425 us     27.094 us   
[...]

Detailed times before patch (100 runs x 2 CPUs):

Total          add_to_page_cache_lru    percent
1927694859.404 6703561                  0.34775011030908648%
1994166534.446 7311783                  0.36665859514242066%
2132812997.930 6249683                  0.29302536162643533%
1972582396.980 7908593                  0.40092586307715006%
2415371209.622 7670327                  0.31756307144194151%
2163473353.153 7406991                  0.34236571433640306%
1833544521.635 7952113                  0.43370165851816772%
2087748556.683 7995535                  0.38297404035585819%
2185868844.795 7117368                  0.32560819085499598%
1980423680.193 6514699                  0.32895481230385598%
1958021484.352 6745228                  0.34449203207963308%
2008592063.702 7398967                  0.36836583862444905%
2448117568.931 6224454                  0.25425470079519025%
2213021186.789 6938228                  0.31351837214297412%
1828902235.789 7479842                  0.40897987074596959%
1866644864.837 7720792                  0.41361868802366963%
2054526163.572 8475856                  0.41254553727677401%
1902862587.026 8334661                  0.43800645705196778%
2148449373.925 8672498                  0.40366313050031155%
2104880443.231 7741619                  0.36779376353160392%
2270646549.540 6486688                  0.2856758134071597%
2250814274.826 7512331                  0.33376058984612328%
2214008898.106 6391259                  0.28867359139646992%
2446226035.589 7738434                  0.31634173978271579%
1920377000.358 6464992                  0.33665223020244384%
2030446733.714 6022617                  0.2966153654759367%
2157128575.464 7375670                  0.34192074055731642%
2185670134.683 7943199                  0.36342167438509865%
2224381465.969 8492920                  0.38181041021667828%
2479045291.455 8161934                  0.32923698603382928%
1971641972.432 8114434                  0.41155717485517562%
2208169216.013 8138519                  0.36856410011433138%
1971312194.985 8122576                  0.41203904793283169%
2279129743.345 7365286                  0.3231622079219687%
2360162707.877 7412468                  0.3140659741491984%
1885022652.013 8252339                  0.43778460652382062%
2065345722.062 7985361                  0.3866355600759942%
1975982196.916 7925371                  0.4010851419799969%
2169738273.075 7203856                  0.33201497569522709%
2034570957.887 7396992                  0.36356520136717829%
1789538123.758 7380292                  0.41241323121418116%
1912812047.546 8697404                  0.45469203370807615%
2221047646.893 7451321                  0.33548676951723994%
2163981380.559 7982481                  0.36887937538251664%
1969189605.396 7353112                  0.37340802428831149%
2278909389.270 7088803                  0.31106120468750831%
2104996788.904 7342692                  0.34882200479854836%
2374705789.293 7327329                  0.30855733931492624%
2137944282.456 8112283                  0.3794431438915179%
2268117031.230 6889022                  0.30373309247909852%
1899091967.945 8173918                  0.43041190937398177%
2098779842.987 7858621                  0.37443760603377763%
2124662819.215 8156209                  0.38388251190904155%
2317714239.959 7048899                  0.30413149638864423%
2218944220.453 6659501                  0.30012025262358577%
1945126557.801 7704643                  0.3960998305791596%
2092060468.851 7616315                  0.36405807161888715%
1933379809.531 7300893                  0.37762331870896354%
1903832170.769 7508856                  0.39440745435912067%
2435478683.656 6790800                  0.27882814354203433%
2126189870.763 7104836                  0.33415811530747125%
2161363441.772 7011298                  0.32439236569356278%
2064449342.765 6822211                  0.33046153561039859%
2116816508.947 8156178                  0.38530396779913867%
1982774633.856 7131335                  0.35966442571092105%
2323253766.182 7275361                  0.3131539526978242%
1905927501.820 7477458                  0.39232646534874271%
2252162684.001 8064426                  0.35807475442552977%
2027146552.281 7875866                  0.38851981328818402%
1845397260.011 7598767                  0.41176862915439172%
2177206310.501 7366288                  0.33833670077434846%
2041629016.560 6600368                  0.32328929234759562%
2189308825.112 8348863                  0.38134697600613249%
1930990433.231 7344038                  0.38032492930126544%
2102320863.469 7299411                  0.34720727586536809%
1895798071.683 8051595                  0.42470741585110772%
2019648719.623 7992832                  0.39575357448754711%
1961118039.267 7837733                  0.39965636147681766%
2142238585.905 8084509                  0.37738602288244455%
2263212432.641 7757018                  0.34274369865263332%
2200111036.549 7069702                  0.32133387281622078%
1941699880.568 7786856                  0.4010329339734075%
1939777157.562 7729686                  0.39848319534368681%
1983015138.927 8136252                  0.41029701893261827%
2040218650.019 7968023                  0.39054750332400873%
1850721920.862 7835912                  0.42339758943095634%
2167986192.252 8656589                  0.39929170356052646%
2128868496.040 8248906                  0.38747841942065214%
1984840247.084 7754853                  0.39070413910605312%
2130780080.168 7053725                  0.33103955990821227%
2062729676.752 7968701                  0.38631824081513272%
1995756518.448 7903687                  0.39602461156666058%
1921150053.008 8307746                  0.43243608103345821%
2253731991.230 8033382                  0.35644797301810893%
2143807312.838 6755593                  0.31512127790332323%
1948787937.040 7152297                  0.36701258582622248%
2276500726.783 7277806                  0.31969267193182554%
2282733838.581 6336839                  0.27759867983291148%
2003104438.909 7092547                  0.35407774363791977%
2239684291.880 6982252                  0.3117516171950766%
1801833773.742 2645976                  0.14684906224756283%
2016683111.150 2261656                  0.11214731692329717%
1962727948.601 2767184                  0.14098663046870061%
1845203884.375 2266317                  0.1228220371304734%
1516914718.568 1975612                  0.13023883121557553%
1557730294.184 2109274                  0.13540688063108641%
2171303320.533 1865384                  0.085910797554627022%
1617322895.641 2073254                  0.12819048104666192%
1867107731.934 1901409                  0.10183713384500154%
1772588329.145 2045529                  0.11539786008783279%
2030802387.562 2161163                  0.10641916777508319%
1853583237.019 1803299                  0.097287187539531861%
1453728764.040 3231016                  0.22225714176699726%
1989811798.738 2418414                  0.1215398361560542%
1956758511.567 2254818                  0.11523230826241863%
2001579398.410 1952360                  0.097540971971978793%
1636515132.388 1532408                  0.093638486419854422%
1926271619.408 1918912                  0.09961793449408439%
1642519133.238 1384193                  0.084272564744573439%
1677033195.955 1340416                  0.07992781557533149%
1693983317.244 3135804                  0.18511421972571418%
1487434932.071 2467493                  0.1658891388656871%
1720442169.343 3034092                  0.17635536108479918%
1640362983.552 1944463                  0.11853858075908964%
1821382248.526 2659828                  0.1460334864991977%
1947322513.836 3540608                  0.18181929160904178%
1673422299.278 2769832                  0.16551900863249205%
1836039325.613 2175872                  0.11850900847526999%
1582929718.539 2072026                  0.13089816785500888%
1476014468.174 1996943                  0.13529291501257767%
1847131237.607 2120975                  0.11482535495137695%
1734796451.802 1842211                  0.10619176665287874%
1804827020.363 1988047                  0.11015166426310204%
1647250758.752 2634152                  0.15991202225917941%
1506833507.228 2271966                  0.15077750720977479%
1919459367.994 1608374                  0.083793073550751385%
1913364635.281 2126200                  0.11112361756846952%
1733934046.047 2031440                  0.11715785872197679%
1762796865.803 2285487                  0.12965118354456009%
1711072016.023 2270996                  0.13272357789349024%
2082560830.257 2273483                  0.10916766353083858%
1857948273.729 1266725                  0.06817870109255604%
1738287034.188 2404600                  0.1383315846409251%
1618942973.292 2016207                  0.12453848179100423%
1820789056.854 2369799                  0.13015230902664757%
1561747605.669 2487149                  0.15925422206327566%
1884502059.843 2512148                  0.13330566485076115%
1352203145.315 2618029                  0.19361210695824271%
1911003204.592 1648292                  0.086252707271200574%
1525244016.511 2555354                  0.16753738892517536%
2190575134.148 1939935                  0.088558249829422828%
1643983673.168 1747429                  0.10629235730989092%
1781900292.843 1686195                  0.094629032094141294%
1566821289.746 2291057                  0.14622324926229507%
1777199439.264 2456866                  0.13824368530171682%
1633063412.403 1953080                  0.11959609070698031%
1913601422.874 1821650                  0.095194849785599556%
2094732749.709 2322465                  0.11087166132876074%
1951529182.943 2270128                  0.11632559839953495%
1450396654.276 2867633                  0.19771370759479912%
1643954213.946 2460211                  0.14965203891504558%
1845971802.604 2593942                  0.14051904781757141%
1671974226.831 2552423                  0.15265923116756236%
1792483105.554 1547320                  0.086322710390164162%
1774273751.516 2101959                  0.1184686973024323%
1797523443.598 1781759                  0.099122990932099045%
1877038808.523 2131598                  0.11356174365288201%
1485614991.186 1796722                  0.12094129439052281%
2008042349.453 1972441                  0.09822706182154485%
2178564679.489 1948988                  0.089462021410222761%
1462911389.586 2076158                  0.14191960051575969%
1976013519.965 2692701                  0.13626936115536772%
1473561005.083 1369873                  0.092963439944099252%
2190713955.687 2450228                  0.11184609444967988%
1648795783.114 2557892                  0.15513698095279177%
2116292380.702 1948901                  0.092090347145394241%
1704227210.803 1980402                  0.11620527987385389%
1984573918.515 2325530                  0.11718031655581404%
1856414008.018 1798348                  0.096872141248277158%
1401593348.875 2285205                  0.16304336788086488%
1832475317.969 2373585                  0.12952889333487622%
1840826298.949 2112188                  0.11474129857911804%
2014475390.352 2104652                  0.10447643143618861%
1730176303.748 1975645                  0.11418749613667999%
2016553863.650 2009045                  0.099627638825555156%
1891795883.646 2160557                  0.11420666567029553%
1881160108.475 1492908                  0.079361027978115894%
1562754135.542 1978972                  0.12663361145505117%
2061996559.038 2071212                  0.10044691834821971%
1519528906.706 2460493                  0.16192472477103448%
1796244525.568 2099254                  0.1168690548596764%
1998463642.677 1988093                  0.099481069234609226%
1872169708.204 1762657                  0.094150492462082555%
1664425938.102 1710870                  0.10279039522485223%
1791543540.761 1992110                  0.11119517637588673%
1883913354.514 1770358                  0.093972368514618113%
1813806740.237 1877185                  0.10349421238531283%
1547586337.236 2702003                  0.17459465329900728%
2020313568.921 2100334                  0.10396079263684484%
1727289498.314 2059429                  0.1192289423405978%

Detailed times after patch:

Total          add_to_page_cache_lru    percent
2151489401.227 6287717                  0.29224949918015392%
1919030541.464 6872841                  0.3581413037208262%
2277104485.950 8538296                  0.37496285535785828%
2142847301.255 7992466                  0.37298345968558083%
1909650847.324 6960671                  0.36449966808089612%
2109123648.836 6825371                  0.32361170497361974%
2128503623.032 7728883                  0.36311345286743746%
1925232499.136 7526064                  0.39091714914315673%
1935780958.829 7499162                  0.38739723964101919%
1722848267.832 7353451                  0.42681942091473007%
1855159456.452 7600681                  0.40970499724785564%
1862769335.900 8607712                  0.46209221045831578%
2272365556.764 6895008                  0.30342864419310028%
2049429274.204 7252089                  0.35385895435775488%
2001218334.232 8155639                  0.4075336938750293%
1947131845.644 7229822                  0.37130623774523025%
1982915408.746 7513518                  0.37891268416495716%
2038654062.353 7531559                  0.36943781385386831%
2134879530.359 6738727                  0.31564905205057731%
2029761134.284 7472398                  0.36814174208907074%
1956557094.935 7280443                  0.37210480690019776%
1740007857.373 6663035                  0.38293131676196024%
2054278081.418 7765213                  0.37800203732105886%
2367585922.353 7629987                  0.32226864199365651%
2023763123.141 7730698                  0.38199618876350999%
1826074987.177 7579606                  0.41507638258150648%
2077873224.783 7414675                  0.35683962387909118%
2103719398.720 7093390                  0.33718327664402131%
2164758918.301 7181094                  0.33172719323572736%
1953199921.042 7750654                  0.39681826302067191%
2179298642.359 6395109                  0.29344803303679207%
1971922836.328 7215479                  0.36591081897688477%
2225552712.785 8153246                  0.36634701812105075%
2124780733.448 7080842                  0.33325048032177529%
2139511762.563 8708060                  0.40701155059639843%
1934365642.348 6842114                  0.35371358186939283%
1950765025.910 6818753                  0.34954250816646476%
2013188877.577 6495929                  0.32266863146086228%
1921391492.273 6536576                  0.34020011154870117%
2164698840.945 6712421                  0.31008567441509277%
1992440404.239 7467130                  0.3747730664422067%
2073623249.374 7113197                  0.34303227465005426%
1938481781.041 6902169                  0.35606055561138211%
1963963682.421 7776006                  0.39593430721765843%
1807704433.895 7206899                  0.39867684477995868%
1902413226.299 7687850                  0.40411041585093088%
1718395340.661 7096452                  0.41296969516166576%
2014139552.830 7653595                  0.37999328245384933%
1976902280.499 7781174                  0.3936043818026207%
1920129734.441 6808420                  0.3545812492707483%
1786139514.638 7179970                  0.4019825966089316%
2011990483.317 6739406                  0.33496212113733792%
2186643920.919 7830962                  0.35812698743876015%
1822975832.465 8627042                  0.47323951565143058%
1872249428.425 7264440                  0.38800599373722838%
1971668701.573 6818269                  0.34581210294408871%
2054585589.388 7261214                  0.35341501651254642%
2252858576.565 7228902                  0.32087686618225808%
2196771348.820 6634658                  0.30201859668116204%
2224872312.502 6948714                  0.31231967609798522%
1924947573.139 6935037                  0.360271474235066%
2070207954.999 7564561                  0.36540102078796494%
2070702822.858 6994172                  0.33776802362912656%
2099826181.510 7199665                  0.34286957003377627%
2369037401.585 7009501                  0.29587971027009985%
2109093017.803 7056062                  0.33455432929886414%
2141046841.968 7831694                  0.36578807368835009%
2284785719.942 7358167                  0.3220506385249462%
2088247084.213 6857813                  0.32840045853982414%
1880913709.753 6697468                  0.35607523967059107%
1984586172.943 7361786                  0.37094816543455983%
2130320004.839 7840067                  0.36802297223850727%
2314059227.236 6669054                  0.28819720435443524%
1935408704.712 8309322                  0.42933164348025787%
2239440115.241 7040432                  0.31438357972087749%
2387500130.769 6555638                  0.27458168129559291%
2497500671.323 6969731                  0.27906823329532587%
2229438175.095 6759889                  0.30321042653322067%
1979317661.013 7870099                  0.39761677243723181%
2228213884.202 5883791                  0.26405862748257614%
2264526373.743 6563122                  0.28982316461838858%
2256376049.198 6670128                  0.29561242694323103%
2085817937.654 6878115                  0.32975624937505721%
2177225113.484 7421724                  0.34087995559282069%
1843473456.551 7156295                  0.38819625932607071%
2403022978.094 6958096                  0.28955594946157931%
1935191866.634 6872608                  0.35513832599730566%
1917810458.850 7816937                  0.40759695328220108%
2016838266.381 7257719                  0.35985627211562177%
1904575178.166 7484129                  0.39295529448235278%
2087232238.930 7372521                  0.35321996577532039%
1976631116.176 6438233                  0.32571747693901715%
2004191806.805 6735567                  0.33607397141980955%
2110550986.328 7174599                  0.33993961986592813%
2152621184.889 7337361                  0.34085704681840484%
2059135774.936 6363510                  0.30903790208772336%
2353978555.486 7265456                  0.30864580236160993%
1945677192.325 5927833                  0.30466682877217138%
2255676557.335 6379740                  0.28283044301074045%
2034627875.756 7319043                  0.35972391252530572%
1712556465.847 3079747                  0.17983331127577226%
1837380343.831 3111263                  0.1693314620702272%
1680363376.083 1611968                  0.095929727042585719%
1769978476.954 2017096                  0.11396161175198416%
1829700109.543 2612874                  0.14280340184559598%
1783959966.875 3184558                  0.17851062014460206%
1772032973.967 2133741                  0.12041203698502599%
1667688494.152 1985120                  0.11903422053705601%
2001252356.720 2633874                  0.13161128785961057%
2077009766.270 2858188                  0.13761071548223289%
2076612330.379 2340135                  0.11269002720276176%
1834812072.536 1713713                  0.093399919569494547%
1787949752.795 2522212                  0.14106727529994451%
1907464219.876 2476303                  0.12982172741153589%
1609254879.065 1903707                  0.11829741980375917%
2069557067.911 2484547                  0.12005211349440528%
1725151836.363 2491288                  0.1444097816486799%
2077004221.561 2508662                  0.12078271069254648%
1650748196.246 3195100                  0.19355465644403197%
1843881436.725 2731880                  0.14815920078094685%
1677080490.297 2752567                  0.16412849686853959%
2103777441.563 3252694                  0.15461207710181613%
1598451912.768 2345538                  0.1467381021139566%
1825835416.807 2381575                  0.13043755083713246%
1969692431.120 2198750                  0.11162910336969484%
1867823718.088 1813348                  0.097083465770326324%
1929972836.031 2140104                  0.11088777831718792%
1627140530.277 2519227                  0.15482541016731566%
1779407314.725 2828755                  0.15897175293095681%
1724839045.836 1937500                  0.11232932166496305%
1780135194.997 2889504                  0.16231935687361485%
1758590417.053 2193512                  0.12473126082853507%
1729351400.188 1523643                  0.088104881392778986%
1484255938.207 2153531                  0.14509162096406991%
1504352024.282 1910774                  0.12701641432044322%
1916441202.247 2423510                  0.12645887581411155%
1651813655.113 2892863                  0.1751325272705837%
1772411947.023 2693122                  0.15194673024651262%
1738795901.180 3064160                  0.17622309771495132%
1717902110.376 3266140                  0.19012375503078779%
1858583685.387 2246750                  0.12088505982619635%
1641867344.242 2901893                  0.17674345069209688%
2037933222.333 3290649                  0.16146991294606342%
1713754879.753 2136278                  0.1246548164640615%
2265307843.106 2195967                  0.096939010151886218%
2066343912.900 2117667                  0.10248376307446183%
2033259596.459 2652081                  0.13043494321230309%
2029016228.457 2447926                  0.12064595470789147%
1649943910.454 2243384                  0.13596728869302646%
2064977308.315 2517573                  0.12191770775700743%
1962826877.902 2403546                  0.12245328546596376%
1991037543.226 3214700                  0.16145853255942866%
1469646744.644 1960481                  0.133398111290676%
2195532272.023 1693642                  0.077140382839349028%
1902754342.931 2542547                  0.13362455376575119%
1922726134.774 2726990                  0.14182935108023245%
1629134454.130 2814497                  0.17276026499009955%
1442857977.322 2886065                  0.2000241912483062%
1782430811.270 2884583                  0.16183421997427802%
1574026949.337 2873023                  0.18252692568003068%
2064769936.712 2717133                  0.13159495165484839%
1621381781.294 2143854                  0.13222388611576869%
1783777994.311 2515565                  0.14102455619605617%
1561198792.593 2402227                  0.15387066729728466%
1658682002.165 2924639                  0.17632306832669586%
1675144757.604 2216513                  0.13231769910860314%
1712859849.041 2274046                  0.13276310967725691%
1448178253.374 2507727                  0.17316424923226392%
1621787692.726 2822681                  0.17404750403892047%
1924315582.555 2275728                  0.11826168330344308%
2044659318.750 2606738                  0.12749008972280165%
1559320878.499 2049522                  0.13143683434629996%
1641314354.590 2786145                  0.16975084585158451%
1769603624.855 2008994                  0.11352790940200608%
1707747444.550 2813289                  0.16473682973311757%
1366611645.499 2930513                  0.21443641356721807%
1435165128.768 3161655                  0.22029903992400404%
1440555681.860 2831258                  0.19653929630435177%
1627011278.312 1956565                  0.12025515901954331%
1668464384.654 3863649                  0.23156916237089648%
1589045506.115 2743558                  0.17265446391825656%
1766150748.153 3207045                  0.18158387687765917%
1668645748.936 2997211                  0.17961937109246526%
1808149919.182 2420007                  0.13383884678626659%
1831096162.185 2226869                  0.12161398434382248%
1473136553.216 2774496                  0.18833936296964501%
1873520917.567 2427200                  0.12955286366121929%
1971938578.205 2207736                  0.11195764535473715%
1670588149.850 2437810                  0.14592525394238479%
1912140400.200 2407058                  0.12588291109524352%
1925513187.452 2415130                  0.12542786077699639%
1658671061.956 3111210                  0.18757245311382492%
1917423750.996 2888924                  0.15066695603928745%
1553962205.772 3386946                  0.21795549386076499%
1834101490.364 2367281                  0.12907033838842713%
1626846627.678 3362957                  0.20671629044711806%
1519340210.104 2320735                  0.15274623712098975%
1725615314.262 3813557                  0.22099693764197714%
1725823077.103 3137784                  0.1818137699993643%
1633124107.941 2007344                  0.12291435722731486%

Thanks,
Fengguang


* Re: [RFC][PATCH] HWPOISON: remove the unsafe __set_page_locked()
@ 2009-09-26 10:52     ` Wu Fengguang
  0 siblings, 0 replies; 54+ messages in thread
From: Wu Fengguang @ 2009-09-26 10:52 UTC (permalink / raw)
  To: Andi Kleen; +Cc: Andrew Morton, Hugh Dickins, Nick Piggin, linux-mm, LKML

On Sat, Sep 26, 2009 at 11:49:36AM +0800, Andi Kleen wrote:
> On Sat, Sep 26, 2009 at 11:15:37AM +0800, Wu Fengguang wrote:
> > The swap cache and page cache code assume that they 'own' the newly
> > allocated page and therefore can disregard the locking rules. However,
> > hwpoison can now hit at any time on any page.
> > 
> > So use the safer lock_page()/trylock_page(). The main intention is not
> > to close such a small time window of memory corruption, but to avoid
> > kernel oopses that may result from such races, and also to avoid
> > raising false alerts in hwpoison stress tests.
> > 
> > This will in theory slightly increase page cache/swap cache overheads;
> > however, the difference seems too small to be measurable in benchmarks.
> 
> Thanks. Can you please describe what benchmarks you used?

Good question. The test case is to copy a 1GB sparse file over NFS.

I ran it 100 times (2 hours total wall time); the added overhead is ~0.008% of total time.

The average time reported by the function profiler:

                total time      add_to_page_cache_lru   percent
        before  1940083424.361  4841664.805             0.242%
        after   1914258447.188  4889044.435             0.250%
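For reference, the "percent" column in the per-run tables further down is
simply the function time divided by that run's total time. A minimal sketch
(the sample row is copied from the "after patch" data; the helper name is
mine, not from the kernel tooling):

```python
def percent(total_us, func_us):
    # share of the run's total profiled time spent in one function
    return func_us / total_us * 100.0

# one row from the "after patch" table: total time, add_to_page_cache_lru time
print(percent(1925232499.136, 7526064))
```

(The summary figures above are averages across the 100 runs, so they need not
match any single row.)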

The script used is:

        for i in `seq 1 100`
        do      
                # copy 1GB sparse file over NFS
                echo 1 > /debug/tracing/function_profile_enabled
                cp /tmp/sparse/1G /dev/null
                echo 0 > /debug/tracing/function_profile_enabled

                fadvise /tmp/sparse/1G 0 0 dontneed
                cat /debug/tracing/trace_stat/function0 > function0.$i
                cat /debug/tracing/trace_stat/function1 > function1.$i
        done
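To reduce the saved snapshots afterwards, something like the following works
(not part of the original script; the stand-in file name and sample line are
mine, mimicking a real function0.$i snapshot):

```shell
# fabricate a two-line stand-in for /debug/tracing/trace_stat/function0
cat > function0.sample <<'EOF'
  Function                               Hit    Time            Avg
  add_to_page_cache_lru               174072    6287717 us      36.121 us
EOF

# field 2 is the hit count, field 3 the accumulated time in microseconds
awk '/add_to_page_cache_lru/ { print $2, $3 }' function0.sample
```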


File function0.1 after patch:

  Function                               Hit    Time            Avg
  --------                               ---    ----            ---
  schedule                             32492    474358128 us     14599.22 us 
  schedule_timeout                        26    99994288 us     3845934 us 
  poll_schedule_timeout                    8    52914376 us     6614297 us 
  sys_poll                                 6    52914373 us     8819062 us 
  schedule_hrtimeout_range                 8    52914373 us     6614296 us 
  do_sys_poll                              6    52914355 us     8819059 us 
  sys_read                             86997    40703194 us     467.868 us  
  vfs_read                             87002    40550651 us     466.088 us  
  do_sync_read                         86987    40011054 us     459.965 us  
  nfs_file_read                        86985    39945631 us     459.224 us  
  generic_file_aio_read                86985    39785858 us     457.387 us  
  page_cache_async_readahead            2886    34379069 us     11912.35 us 
  ondemand_readahead                    2889    34371850 us     11897.49 us 
  ra_submit                             2891    34368378 us     11888.05 us 
  __do_page_cache_readahead             2891    34360744 us     11885.41 us 
  nfs_readpages                         2876    33198107 us     11543.15 us 
  read_cache_pages                      2876    31617284 us     10993.49 us 
  __rpc_execute                        84468    25404441 us     300.758 us  
  readpage_async_filler               174071    24947004 us     143.315 us  
  do_softirq                           44822    24829675 us     553.961 us  
  __do_softirq                         44822    24803408 us     553.375 us  
  net_rx_action                        38424    24357207 us     633.906 us  
  e1000_clean                          39042    24253141 us     621.206 us  
  nfs_pageio_doio                      43534    22993433 us     528.171 us  
  nfs_pagein_one                       43520    22939752 us     527.108 us  
  nfs_read_rpcsetup                    43513    22533176 us     517.849 us  
  e1000_clean_rx_irq                   39042    22047306 us     564.707 us  
  rpc_run_task                         43528    22042766 us     506.404 us  
  call_transmit                        45873    21842655 us     476.154 us  
  nfs_pageio_add_request              174091    21583666 us     123.979 us  
  rpc_execute                          43528    21254402 us     488.292 us  
  cpuidle_idle_call                    18073    19445493 us     1075.941 us 
  acpi_idle_enter_c1                   17999    19396945 us     1077.667 us 
  e1000_receive_skb                   436279    18818686 us     43.134 us   
  tcp_v4_do_rcv                       513521    18681598 us     36.379 us   
  xprt_transmit                        43346    18532249 us     427.542 us  
  tcp_rcv_established                 513521    18384627 us     35.801 us   
  napi_gro_receive                    436279    18114586 us     41.520 us   
  napi_skb_finish                     436279    17130261 us     39.264 us   
  netif_receive_skb                   436279    16886046 us     38.704 us   
  ip_rcv                              436279    16500338 us     37.820 us   
  xs_tcp_send_request                  43346    16418581 us     378.779 us  
  xs_sendpages                         43346    16367619 us     377.603 us  
  xs_send_kvec                         43347    16310325 us     376.273 us  
  kernel_sendmsg                       43347    16254698 us     374.990 us  
  sock_sendmsg                         43347    16177184 us     373.201 us  
  tcp_sendmsg                          43347    16127746 us     372.061 us  
  do_IRQ                               45271    15968995 us     352.742 us  
  ip_local_deliver                    436279    15816133 us     36.252 us   
  irq_exit                             53709    15421475 us     287.130 us  
  acpi_processor_ffh_cstate_ente       17999    15394735 us     855.310 us  
  mwait_idle_with_hints                17999    15386089 us     854.830 us  
  tcp_v4_rcv                          436217    15173852 us     34.785 us   
  release_sock                         43364    13536313 us     312.155 us  
  xs_tcp_data_ready                   512045    10092920 us     19.711 us   
  _spin_unlock_bh                     518900    9851221 us     18.984 us   
  local_bh_enable_ip                  511352    9403704 us     18.389 us   
  tcp_read_sock                      1024090    8832051 us     8.624 us    
  __tcp_ack_snd_check                 512045    6644838 us     12.977 us   
  add_to_page_cache_lru               174072    6287717 us     36.121 us   
  tcp_transmit_skb                    296572    5962431 us     20.104 us   
  tcp_send_ack                        253210    5775135 us     22.807 us   
  __kfree_skb                         802932    5216451 us     6.496 us    
  add_to_page_cache_locked            174072    4716425 us     27.094 us   
[...]
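The before/after summary near the top presumably comes from reducing these
per-run tables. A sketch of averaging the percent column (the helper name is
mine; the two rows are copied from the "before patch" data below):

```python
def mean_percent(rows):
    # parse "total_time  func_time  percent%" rows, average the percent column
    vals = [float(line.split()[2].rstrip('%')) for line in rows if line.strip()]
    return sum(vals) / len(vals)

sample = [
    "1927694859.404 6703561                  0.34775011030908648%",
    "1994166534.446 7311783                  0.36665859514242066%",
]
print(f"{mean_percent(sample):.4f}%")
```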

Detailed time before patch (100 runs * 2 CPUs):

Total          add_to_page_cache_lru    percent
1927694859.404 6703561                  0.34775011030908648%
1994166534.446 7311783                  0.36665859514242066%
2132812997.930 6249683                  0.29302536162643533%
1972582396.980 7908593                  0.40092586307715006%
2415371209.622 7670327                  0.31756307144194151%
2163473353.153 7406991                  0.34236571433640306%
1833544521.635 7952113                  0.43370165851816772%
2087748556.683 7995535                  0.38297404035585819%
2185868844.795 7117368                  0.32560819085499598%
1980423680.193 6514699                  0.32895481230385598%
1958021484.352 6745228                  0.34449203207963308%
2008592063.702 7398967                  0.36836583862444905%
2448117568.931 6224454                  0.25425470079519025%
2213021186.789 6938228                  0.31351837214297412%
1828902235.789 7479842                  0.40897987074596959%
1866644864.837 7720792                  0.41361868802366963%
2054526163.572 8475856                  0.41254553727677401%
1902862587.026 8334661                  0.43800645705196778%
2148449373.925 8672498                  0.40366313050031155%
2104880443.231 7741619                  0.36779376353160392%
2270646549.540 6486688                  0.2856758134071597%
2250814274.826 7512331                  0.33376058984612328%
2214008898.106 6391259                  0.28867359139646992%
2446226035.589 7738434                  0.31634173978271579%
1920377000.358 6464992                  0.33665223020244384%
2030446733.714 6022617                  0.2966153654759367%
2157128575.464 7375670                  0.34192074055731642%
2185670134.683 7943199                  0.36342167438509865%
2224381465.969 8492920                  0.38181041021667828%
2479045291.455 8161934                  0.32923698603382928%
1971641972.432 8114434                  0.41155717485517562%
2208169216.013 8138519                  0.36856410011433138%
1971312194.985 8122576                  0.41203904793283169%
2279129743.345 7365286                  0.3231622079219687%
2360162707.877 7412468                  0.3140659741491984%
1885022652.013 8252339                  0.43778460652382062%
2065345722.062 7985361                  0.3866355600759942%
1975982196.916 7925371                  0.4010851419799969%
2169738273.075 7203856                  0.33201497569522709%
2034570957.887 7396992                  0.36356520136717829%
1789538123.758 7380292                  0.41241323121418116%
1912812047.546 8697404                  0.45469203370807615%
2221047646.893 7451321                  0.33548676951723994%
2163981380.559 7982481                  0.36887937538251664%
1969189605.396 7353112                  0.37340802428831149%
2278909389.270 7088803                  0.31106120468750831%
2104996788.904 7342692                  0.34882200479854836%
2374705789.293 7327329                  0.30855733931492624%
2137944282.456 8112283                  0.3794431438915179%
2268117031.230 6889022                  0.30373309247909852%
1899091967.945 8173918                  0.43041190937398177%
2098779842.987 7858621                  0.37443760603377763%
2124662819.215 8156209                  0.38388251190904155%
2317714239.959 7048899                  0.30413149638864423%
2218944220.453 6659501                  0.30012025262358577%
1945126557.801 7704643                  0.3960998305791596%
2092060468.851 7616315                  0.36405807161888715%
1933379809.531 7300893                  0.37762331870896354%
1903832170.769 7508856                  0.39440745435912067%
2435478683.656 6790800                  0.27882814354203433%
2126189870.763 7104836                  0.33415811530747125%
2161363441.772 7011298                  0.32439236569356278%
2064449342.765 6822211                  0.33046153561039859%
2116816508.947 8156178                  0.38530396779913867%
1982774633.856 7131335                  0.35966442571092105%
2323253766.182 7275361                  0.3131539526978242%
1905927501.820 7477458                  0.39232646534874271%
2252162684.001 8064426                  0.35807475442552977%
2027146552.281 7875866                  0.38851981328818402%
1845397260.011 7598767                  0.41176862915439172%
2177206310.501 7366288                  0.33833670077434846%
2041629016.560 6600368                  0.32328929234759562%
2189308825.112 8348863                  0.38134697600613249%
1930990433.231 7344038                  0.38032492930126544%
2102320863.469 7299411                  0.34720727586536809%
1895798071.683 8051595                  0.42470741585110772%
2019648719.623 7992832                  0.39575357448754711%
1961118039.267 7837733                  0.39965636147681766%
2142238585.905 8084509                  0.37738602288244455%
2263212432.641 7757018                  0.34274369865263332%
2200111036.549 7069702                  0.32133387281622078%
1941699880.568 7786856                  0.4010329339734075%
1939777157.562 7729686                  0.39848319534368681%
1983015138.927 8136252                  0.41029701893261827%
2040218650.019 7968023                  0.39054750332400873%
1850721920.862 7835912                  0.42339758943095634%
2167986192.252 8656589                  0.39929170356052646%
2128868496.040 8248906                  0.38747841942065214%
1984840247.084 7754853                  0.39070413910605312%
2130780080.168 7053725                  0.33103955990821227%
2062729676.752 7968701                  0.38631824081513272%
1995756518.448 7903687                  0.39602461156666058%
1921150053.008 8307746                  0.43243608103345821%
2253731991.230 8033382                  0.35644797301810893%
2143807312.838 6755593                  0.31512127790332323%
1948787937.040 7152297                  0.36701258582622248%
2276500726.783 7277806                  0.31969267193182554%
2282733838.581 6336839                  0.27759867983291148%
2003104438.909 7092547                  0.35407774363791977%
2239684291.880 6982252                  0.3117516171950766%
1801833773.742 2645976                  0.14684906224756283%
2016683111.150 2261656                  0.11214731692329717%
1962727948.601 2767184                  0.14098663046870061%
1845203884.375 2266317                  0.1228220371304734%
1516914718.568 1975612                  0.13023883121557553%
1557730294.184 2109274                  0.13540688063108641%
2171303320.533 1865384                  0.085910797554627022%
1617322895.641 2073254                  0.12819048104666192%
1867107731.934 1901409                  0.10183713384500154%
1772588329.145 2045529                  0.11539786008783279%
2030802387.562 2161163                  0.10641916777508319%
1853583237.019 1803299                  0.097287187539531861%
1453728764.040 3231016                  0.22225714176699726%
1989811798.738 2418414                  0.1215398361560542%
1956758511.567 2254818                  0.11523230826241863%
2001579398.410 1952360                  0.097540971971978793%
1636515132.388 1532408                  0.093638486419854422%
1926271619.408 1918912                  0.09961793449408439%
1642519133.238 1384193                  0.084272564744573439%
1677033195.955 1340416                  0.07992781557533149%
1693983317.244 3135804                  0.18511421972571418%
1487434932.071 2467493                  0.1658891388656871%
1720442169.343 3034092                  0.17635536108479918%
1640362983.552 1944463                  0.11853858075908964%
1821382248.526 2659828                  0.1460334864991977%
1947322513.836 3540608                  0.18181929160904178%
1673422299.278 2769832                  0.16551900863249205%
1836039325.613 2175872                  0.11850900847526999%
1582929718.539 2072026                  0.13089816785500888%
1476014468.174 1996943                  0.13529291501257767%
1847131237.607 2120975                  0.11482535495137695%
1734796451.802 1842211                  0.10619176665287874%
1804827020.363 1988047                  0.11015166426310204%
1647250758.752 2634152                  0.15991202225917941%
1506833507.228 2271966                  0.15077750720977479%
1919459367.994 1608374                  0.083793073550751385%
1913364635.281 2126200                  0.11112361756846952%
1733934046.047 2031440                  0.11715785872197679%
1762796865.803 2285487                  0.12965118354456009%
1711072016.023 2270996                  0.13272357789349024%
2082560830.257 2273483                  0.10916766353083858%
1857948273.729 1266725                  0.06817870109255604%
1738287034.188 2404600                  0.1383315846409251%
1618942973.292 2016207                  0.12453848179100423%
1820789056.854 2369799                  0.13015230902664757%
1561747605.669 2487149                  0.15925422206327566%
1884502059.843 2512148                  0.13330566485076115%
1352203145.315 2618029                  0.19361210695824271%
1911003204.592 1648292                  0.086252707271200574%
1525244016.511 2555354                  0.16753738892517536%
2190575134.148 1939935                  0.088558249829422828%
1643983673.168 1747429                  0.10629235730989092%
1781900292.843 1686195                  0.094629032094141294%
1566821289.746 2291057                  0.14622324926229507%
1777199439.264 2456866                  0.13824368530171682%
1633063412.403 1953080                  0.11959609070698031%
1913601422.874 1821650                  0.095194849785599556%
2094732749.709 2322465                  0.11087166132876074%
1951529182.943 2270128                  0.11632559839953495%
1450396654.276 2867633                  0.19771370759479912%
1643954213.946 2460211                  0.14965203891504558%
1845971802.604 2593942                  0.14051904781757141%
1671974226.831 2552423                  0.15265923116756236%
1792483105.554 1547320                  0.086322710390164162%
1774273751.516 2101959                  0.1184686973024323%
1797523443.598 1781759                  0.099122990932099045%
1877038808.523 2131598                  0.11356174365288201%
1485614991.186 1796722                  0.12094129439052281%
2008042349.453 1972441                  0.09822706182154485%
2178564679.489 1948988                  0.089462021410222761%
1462911389.586 2076158                  0.14191960051575969%
1976013519.965 2692701                  0.13626936115536772%
1473561005.083 1369873                  0.092963439944099252%
2190713955.687 2450228                  0.11184609444967988%
1648795783.114 2557892                  0.15513698095279177%
2116292380.702 1948901                  0.092090347145394241%
1704227210.803 1980402                  0.11620527987385389%
1984573918.515 2325530                  0.11718031655581404%
1856414008.018 1798348                  0.096872141248277158%
1401593348.875 2285205                  0.16304336788086488%
1832475317.969 2373585                  0.12952889333487622%
1840826298.949 2112188                  0.11474129857911804%
2014475390.352 2104652                  0.10447643143618861%
1730176303.748 1975645                  0.11418749613667999%
2016553863.650 2009045                  0.099627638825555156%
1891795883.646 2160557                  0.11420666567029553%
1881160108.475 1492908                  0.079361027978115894%
1562754135.542 1978972                  0.12663361145505117%
2061996559.038 2071212                  0.10044691834821971%
1519528906.706 2460493                  0.16192472477103448%
1796244525.568 2099254                  0.1168690548596764%
1998463642.677 1988093                  0.099481069234609226%
1872169708.204 1762657                  0.094150492462082555%
1664425938.102 1710870                  0.10279039522485223%
1791543540.761 1992110                  0.11119517637588673%
1883913354.514 1770358                  0.093972368514618113%
1813806740.237 1877185                  0.10349421238531283%
1547586337.236 2702003                  0.17459465329900728%
2020313568.921 2100334                  0.10396079263684484%
1727289498.314 2059429                  0.1192289423405978%

Detailed time after patch:

Total          add_to_page_cache_lru    percent
2151489401.227 6287717                  0.29224949918015392%
1919030541.464 6872841                  0.3581413037208262%
2277104485.950 8538296                  0.37496285535785828%
2142847301.255 7992466                  0.37298345968558083%
1909650847.324 6960671                  0.36449966808089612%
2109123648.836 6825371                  0.32361170497361974%
2128503623.032 7728883                  0.36311345286743746%
1979317661.013 7870099                  0.39761677243723181%
2228213884.202 5883791                  0.26405862748257614%
2264526373.743 6563122                  0.28982316461838858%
2256376049.198 6670128                  0.29561242694323103%
2085817937.654 6878115                  0.32975624937505721%
2177225113.484 7421724                  0.34087995559282069%
1843473456.551 7156295                  0.38819625932607071%
2403022978.094 6958096                  0.28955594946157931%
1935191866.634 6872608                  0.35513832599730566%
1917810458.850 7816937                  0.40759695328220108%
2016838266.381 7257719                  0.35985627211562177%
1904575178.166 7484129                  0.39295529448235278%
2087232238.930 7372521                  0.35321996577532039%
1976631116.176 6438233                  0.32571747693901715%
2004191806.805 6735567                  0.33607397141980955%
2110550986.328 7174599                  0.33993961986592813%
2152621184.889 7337361                  0.34085704681840484%
2059135774.936 6363510                  0.30903790208772336%
2353978555.486 7265456                  0.30864580236160993%
1945677192.325 5927833                  0.30466682877217138%
2255676557.335 6379740                  0.28283044301074045%
2034627875.756 7319043                  0.35972391252530572%
1712556465.847 3079747                  0.17983331127577226%
1837380343.831 3111263                  0.1693314620702272%
1680363376.083 1611968                  0.095929727042585719%
1769978476.954 2017096                  0.11396161175198416%
1829700109.543 2612874                  0.14280340184559598%
1783959966.875 3184558                  0.17851062014460206%
1772032973.967 2133741                  0.12041203698502599%
1667688494.152 1985120                  0.11903422053705601%
2001252356.720 2633874                  0.13161128785961057%
2077009766.270 2858188                  0.13761071548223289%
2076612330.379 2340135                  0.11269002720276176%
1834812072.536 1713713                  0.093399919569494547%
1787949752.795 2522212                  0.14106727529994451%
1907464219.876 2476303                  0.12982172741153589%
1609254879.065 1903707                  0.11829741980375917%
2069557067.911 2484547                  0.12005211349440528%
1725151836.363 2491288                  0.1444097816486799%
2077004221.561 2508662                  0.12078271069254648%
1650748196.246 3195100                  0.19355465644403197%
1843881436.725 2731880                  0.14815920078094685%
1677080490.297 2752567                  0.16412849686853959%
2103777441.563 3252694                  0.15461207710181613%
1598451912.768 2345538                  0.1467381021139566%
1825835416.807 2381575                  0.13043755083713246%
1969692431.120 2198750                  0.11162910336969484%
1867823718.088 1813348                  0.097083465770326324%
1929972836.031 2140104                  0.11088777831718792%
1627140530.277 2519227                  0.15482541016731566%
1779407314.725 2828755                  0.15897175293095681%
1724839045.836 1937500                  0.11232932166496305%
1780135194.997 2889504                  0.16231935687361485%
1758590417.053 2193512                  0.12473126082853507%
1729351400.188 1523643                  0.088104881392778986%
1484255938.207 2153531                  0.14509162096406991%
1504352024.282 1910774                  0.12701641432044322%
1916441202.247 2423510                  0.12645887581411155%
1651813655.113 2892863                  0.1751325272705837%
1772411947.023 2693122                  0.15194673024651262%
1738795901.180 3064160                  0.17622309771495132%
1717902110.376 3266140                  0.19012375503078779%
1858583685.387 2246750                  0.12088505982619635%
1641867344.242 2901893                  0.17674345069209688%
2037933222.333 3290649                  0.16146991294606342%
1713754879.753 2136278                  0.1246548164640615%
2265307843.106 2195967                  0.096939010151886218%
2066343912.900 2117667                  0.10248376307446183%
2033259596.459 2652081                  0.13043494321230309%
2029016228.457 2447926                  0.12064595470789147%
1649943910.454 2243384                  0.13596728869302646%
2064977308.315 2517573                  0.12191770775700743%
1962826877.902 2403546                  0.12245328546596376%
1991037543.226 3214700                  0.16145853255942866%
1469646744.644 1960481                  0.133398111290676%
2195532272.023 1693642                  0.077140382839349028%
1902754342.931 2542547                  0.13362455376575119%
1922726134.774 2726990                  0.14182935108023245%
1629134454.130 2814497                  0.17276026499009955%
1442857977.322 2886065                  0.2000241912483062%
1782430811.270 2884583                  0.16183421997427802%
1574026949.337 2873023                  0.18252692568003068%
2064769936.712 2717133                  0.13159495165484839%
1621381781.294 2143854                  0.13222388611576869%
1783777994.311 2515565                  0.14102455619605617%
1561198792.593 2402227                  0.15387066729728466%
1658682002.165 2924639                  0.17632306832669586%
1675144757.604 2216513                  0.13231769910860314%
1712859849.041 2274046                  0.13276310967725691%
1448178253.374 2507727                  0.17316424923226392%
1621787692.726 2822681                  0.17404750403892047%
1924315582.555 2275728                  0.11826168330344308%
2044659318.750 2606738                  0.12749008972280165%
1559320878.499 2049522                  0.13143683434629996%
1641314354.590 2786145                  0.16975084585158451%
1769603624.855 2008994                  0.11352790940200608%
1707747444.550 2813289                  0.16473682973311757%
1366611645.499 2930513                  0.21443641356721807%
1435165128.768 3161655                  0.22029903992400404%
1440555681.860 2831258                  0.19653929630435177%
1627011278.312 1956565                  0.12025515901954331%
1668464384.654 3863649                  0.23156916237089648%
1589045506.115 2743558                  0.17265446391825656%
1766150748.153 3207045                  0.18158387687765917%
1668645748.936 2997211                  0.17961937109246526%
1808149919.182 2420007                  0.13383884678626659%
1831096162.185 2226869                  0.12161398434382248%
1473136553.216 2774496                  0.18833936296964501%
1873520917.567 2427200                  0.12955286366121929%
1971938578.205 2207736                  0.11195764535473715%
1670588149.850 2437810                  0.14592525394238479%
1912140400.200 2407058                  0.12588291109524352%
1925513187.452 2415130                  0.12542786077699639%
1658671061.956 3111210                  0.18757245311382492%
1917423750.996 2888924                  0.15066695603928745%
1553962205.772 3386946                  0.21795549386076499%
1834101490.364 2367281                  0.12907033838842713%
1626846627.678 3362957                  0.20671629044711806%
1519340210.104 2320735                  0.15274623712098975%
1725615314.262 3813557                  0.22099693764197714%
1725823077.103 3137784                  0.1818137699993643%
1633124107.941 2007344                  0.12291435722731486%

Thanks,
Fengguang

--
To unsubscribe, send a message with 'unsubscribe linux-mm' in
the body to majordomo@kvack.org.  For more info on Linux MM,
see: http://www.linux-mm.org/ .
Don't email: <a href=mailto:"dont@kvack.org"> email@kvack.org </a>

^ permalink raw reply	[flat|nested] 54+ messages in thread

* Re: [RFC][PATCH] HWPOISON: remove the unsafe __set_page_locked()
  2009-09-26  3:15 ` Wu Fengguang
@ 2009-09-26 11:09   ` Hugh Dickins
  -1 siblings, 0 replies; 54+ messages in thread
From: Hugh Dickins @ 2009-09-26 11:09 UTC (permalink / raw)
  To: Wu Fengguang; +Cc: Andrew Morton, Nick Piggin, Andi Kleen, linux-mm, LKML

On Sat, 26 Sep 2009, Wu Fengguang wrote:

> The swap cache and page cache code assume that they 'own' the newly
> allocated page and therefore can disregard the locking rules. However
> now hwpoison can hit any time on any page.
> 
> So use the safer lock_page()/trylock_page(). The main intention is not
> to close such a small time window of memory corruption. But to avoid
> kernel oops that may result from such races, and also avoid raising
> false alerts in hwpoison stress tests.
> 
> This in theory will slightly increase page cache/swap cache overheads,
> however it seems to be too small to be measurable in benchmark.

No.

But I'd most certainly defer to Nick if he disagrees with me.

I don't think anyone would want to quarrel very long over the swap
and migration mods alone, but add_to_page_cache() is of a higher
order of magnitude.

I can't see any reason to surrender add_to_page_cache() optimizations
to the remote possibility of hwpoison (infinitely remote for most of
us); though I wouldn't myself want to run the benchmark to defend them.

You'd later be sending a patch to replace __SetPageUptodate()s by
SetPageUptodate()s etc, wouldn't you?  Because any non-atomic op
on page->flags might wipe your locked bit (or your hwpoison bit).

You could make #ifdef CONFIG_HWPOISON_INJECT select the slower
versions of these things; or #ifdef CONFIG_MEMORY_FAILURE? that
would pose distros with a harder choice.  But I'd much prefer
not to go that way.

Please accept that there will be quite a number of places where
the code "knows" it's the only user of the page, and hwpoison
handling and testing should work around those places (can shift
things around slightly to suit itself better, but not add cost).

Look into why you think you want the page lock: I can see it's
going to be useful if you're taking a page out of a file (but then
why bother if page->mapping not set?), or if you're using rmap-style
lookup (but then why bother if !page_mapped?).

I suspect if memory_failure() did something like:
	if (page->mapping)
		lock_page_nosync(p);
then you'd be okay, perhaps with a few additional _inexpensive_
tweaks here and there.  With the "necessary" memory barriers?
no, we probably wouldn't want to be adding any in hot paths.

But I definitely say "something like": remember that page_mapping()
does that weird thing with PageSwapCache (a mistake from day one in
my opinion), which might or might not be what you want.  There are
probably various reasons why it's not as simple as I suggest above.

It seems to me that the Intel hardware guys have done half a job
here: the sooner they get to remapping the bad pages, the better.

Hugh

> 
> CC: Hugh Dickins <hugh.dickins@tiscali.co.uk>
> CC: Andi Kleen <andi@firstfloor.org> 
> Signed-off-by: Wu Fengguang <fengguang.wu@intel.com>
> ---
>  include/linux/pagemap.h |   13 ++++---------
>  mm/migrate.c            |    2 +-
>  mm/swap_state.c         |    4 ++--
>  3 files changed, 7 insertions(+), 12 deletions(-)
> 
> --- sound-2.6.orig/mm/swap_state.c	2009-09-14 10:50:19.000000000 +0800
> +++ sound-2.6/mm/swap_state.c	2009-09-25 18:42:23.000000000 +0800
> @@ -306,7 +306,7 @@ struct page *read_swap_cache_async(swp_e
>  		 * re-using the just freed swap entry for an existing page.
>  		 * May fail (-ENOMEM) if radix-tree node allocation failed.
>  		 */
> -		__set_page_locked(new_page);
> +		lock_page(new_page);
>  		SetPageSwapBacked(new_page);
>  		err = add_to_swap_cache(new_page, entry, gfp_mask & GFP_KERNEL);
>  		if (likely(!err)) {
> @@ -318,7 +318,7 @@ struct page *read_swap_cache_async(swp_e
>  			return new_page;
>  		}
>  		ClearPageSwapBacked(new_page);
> -		__clear_page_locked(new_page);
> +		unlock_page(new_page);
>  		swapcache_free(entry, NULL);
>  	} while (err != -ENOMEM);
>  
> --- sound-2.6.orig/include/linux/pagemap.h	2009-09-14 10:50:19.000000000 +0800
> +++ sound-2.6/include/linux/pagemap.h	2009-09-25 18:42:19.000000000 +0800
> @@ -292,11 +292,6 @@ extern int __lock_page_killable(struct p
>  extern void __lock_page_nosync(struct page *page);
>  extern void unlock_page(struct page *page);
>  
> -static inline void __set_page_locked(struct page *page)
> -{
> -	__set_bit(PG_locked, &page->flags);
> -}
> -
>  static inline void __clear_page_locked(struct page *page)
>  {
>  	__clear_bit(PG_locked, &page->flags);
> @@ -435,18 +430,18 @@ extern void remove_from_page_cache(struc
>  extern void __remove_from_page_cache(struct page *page);
>  
>  /*
> - * Like add_to_page_cache_locked, but used to add newly allocated pages:
> - * the page is new, so we can just run __set_page_locked() against it.
> + * Like add_to_page_cache_locked, but used to add newly allocated pages.
>   */
>  static inline int add_to_page_cache(struct page *page,
>  		struct address_space *mapping, pgoff_t offset, gfp_t gfp_mask)
>  {
>  	int error;
>  
> -	__set_page_locked(page);
> +	if (!trylock_page(page))
> +		return -EIO;	/* hwpoisoned */
>  	error = add_to_page_cache_locked(page, mapping, offset, gfp_mask);
>  	if (unlikely(error))
> -		__clear_page_locked(page);
> +		unlock_page(page);
>  	return error;
>  }
>  
> --- sound-2.6.orig/mm/migrate.c	2009-09-14 10:50:19.000000000 +0800
> +++ sound-2.6/mm/migrate.c	2009-09-25 18:42:19.000000000 +0800
> @@ -551,7 +551,7 @@ static int move_to_new_page(struct page 
>  	 * holding a reference to the new page at this point.
>  	 */
>  	if (!trylock_page(newpage))
> -		BUG();
> +		return -EAGAIN;		/* got by hwpoison */
>  
>  	/* Prepare mapping for the new page.*/
>  	newpage->index = page->index;

^ permalink raw reply	[flat|nested] 54+ messages in thread

* Re: [RFC][PATCH] HWPOISON: remove the unsafe __set_page_locked()
  2009-09-26 10:52     ` Wu Fengguang
@ 2009-09-26 11:31       ` Wu Fengguang
  -1 siblings, 0 replies; 54+ messages in thread
From: Wu Fengguang @ 2009-09-26 11:31 UTC (permalink / raw)
  To: Andi Kleen; +Cc: Andrew Morton, Hugh Dickins, Nick Piggin, linux-mm, LKML

On Sat, Sep 26, 2009 at 06:52:59PM +0800, Wu Fengguang wrote:
> On Sat, Sep 26, 2009 at 11:49:36AM +0800, Andi Kleen wrote:
> > On Sat, Sep 26, 2009 at 11:15:37AM +0800, Wu Fengguang wrote:
> > > The swap cache and page cache code assume that they 'own' the newly
> > > allocated page and therefore can disregard the locking rules. However
> > > now hwpoison can hit any time on any page.
> > > 
> > > So use the safer lock_page()/trylock_page(). The main intention is not
> > > to close such a small time window of memory corruption. But to avoid
> > > kernel oops that may result from such races, and also avoid raising
> > > false alerts in hwpoison stress tests.
> > > 
> > > This in theory will slightly increase page cache/swap cache overheads,
> > > however it seems to be too small to be measurable in benchmark.
> > 
> > Thanks. Can you please describe what benchmarks you used?
> 
> Good question. The test case is to copy 1GB sparse file over NFS.
> 
> I ran it 100 times (2 hours total wall time) and the overheads are ~0.008%.

And the standard deviation is 0.04%, much larger than the 0.008% difference ..

Thanks,
Fengguang

> The average time reported by the function profiler:
> 
>                 total time      add_to_page_cache_lru   percent
>         before  1940083424.361  4841664.805             0.242%
>         after   1914258447.188  4889044.435             0.250%
> 
> The script used is:
> 
>         for i in `seq 1 100`
>         do      
>                 # copy 1GB sparse file over NFS
>                 echo 1 > /debug/tracing/function_profile_enabled
>                 cp /tmp/sparse/1G /dev/null
>                 echo 0 > /debug/tracing/function_profile_enabled
> 
>                 fadvise /tmp/sparse/1G 0 0 dontneed
>                 cat /debug/tracing/trace_stat/function0 > function0.$i
>                 cat /debug/tracing/trace_stat/function1 > function1.$i
>         done
> 
> 
> File function0.1 after patch:
> 
>   Function                               Hit    Time            Avg
>   --------                               ---    ----            ---
>   schedule                             32492    474358128 us     14599.22 us 
>   schedule_timeout                        26    99994288 us     3845934 us 
>   poll_schedule_timeout                    8    52914376 us     6614297 us 
>   sys_poll                                 6    52914373 us     8819062 us 
>   schedule_hrtimeout_range                 8    52914373 us     6614296 us 
>   do_sys_poll                              6    52914355 us     8819059 us 
>   sys_read                             86997    40703194 us     467.868 us  
>   vfs_read                             87002    40550651 us     466.088 us  
>   do_sync_read                         86987    40011054 us     459.965 us  
>   nfs_file_read                        86985    39945631 us     459.224 us  
>   generic_file_aio_read                86985    39785858 us     457.387 us  
>   page_cache_async_readahead            2886    34379069 us     11912.35 us 
>   ondemand_readahead                    2889    34371850 us     11897.49 us 
>   ra_submit                             2891    34368378 us     11888.05 us 
>   __do_page_cache_readahead             2891    34360744 us     11885.41 us 
>   nfs_readpages                         2876    33198107 us     11543.15 us 
>   read_cache_pages                      2876    31617284 us     10993.49 us 
>   __rpc_execute                        84468    25404441 us     300.758 us  
>   readpage_async_filler               174071    24947004 us     143.315 us  
>   do_softirq                           44822    24829675 us     553.961 us  
>   __do_softirq                         44822    24803408 us     553.375 us  
>   net_rx_action                        38424    24357207 us     633.906 us  
>   e1000_clean                          39042    24253141 us     621.206 us  
>   nfs_pageio_doio                      43534    22993433 us     528.171 us  
>   nfs_pagein_one                       43520    22939752 us     527.108 us  
>   nfs_read_rpcsetup                    43513    22533176 us     517.849 us  
>   e1000_clean_rx_irq                   39042    22047306 us     564.707 us  
>   rpc_run_task                         43528    22042766 us     506.404 us  
>   call_transmit                        45873    21842655 us     476.154 us  
>   nfs_pageio_add_request              174091    21583666 us     123.979 us  
>   rpc_execute                          43528    21254402 us     488.292 us  
>   cpuidle_idle_call                    18073    19445493 us     1075.941 us 
>   acpi_idle_enter_c1                   17999    19396945 us     1077.667 us 
>   e1000_receive_skb                   436279    18818686 us     43.134 us   
>   tcp_v4_do_rcv                       513521    18681598 us     36.379 us   
>   xprt_transmit                        43346    18532249 us     427.542 us  
>   tcp_rcv_established                 513521    18384627 us     35.801 us   
>   napi_gro_receive                    436279    18114586 us     41.520 us   
>   napi_skb_finish                     436279    17130261 us     39.264 us   
>   netif_receive_skb                   436279    16886046 us     38.704 us   
>   ip_rcv                              436279    16500338 us     37.820 us   
>   xs_tcp_send_request                  43346    16418581 us     378.779 us  
>   xs_sendpages                         43346    16367619 us     377.603 us  
>   xs_send_kvec                         43347    16310325 us     376.273 us  
>   kernel_sendmsg                       43347    16254698 us     374.990 us  
>   sock_sendmsg                         43347    16177184 us     373.201 us  
>   tcp_sendmsg                          43347    16127746 us     372.061 us  
>   do_IRQ                               45271    15968995 us     352.742 us  
>   ip_local_deliver                    436279    15816133 us     36.252 us   
>   irq_exit                             53709    15421475 us     287.130 us  
>   acpi_processor_ffh_cstate_ente       17999    15394735 us     855.310 us  
>   mwait_idle_with_hints                17999    15386089 us     854.830 us  
>   tcp_v4_rcv                          436217    15173852 us     34.785 us   
>   release_sock                         43364    13536313 us     312.155 us  
>   xs_tcp_data_ready                   512045    10092920 us     19.711 us   
>   _spin_unlock_bh                     518900    9851221 us     18.984 us   
>   local_bh_enable_ip                  511352    9403704 us     18.389 us   
>   tcp_read_sock                      1024090    8832051 us     8.624 us    
>   __tcp_ack_snd_check                 512045    6644838 us     12.977 us   
>   add_to_page_cache_lru               174072    6287717 us     36.121 us   
>   tcp_transmit_skb                    296572    5962431 us     20.104 us   
>   tcp_send_ack                        253210    5775135 us     22.807 us   
>   __kfree_skb                         802932    5216451 us     6.496 us    
>   add_to_page_cache_locked            174072    4716425 us     27.094 us   
> [...]
> 
> Detailed time before patch (100 runs * 2 CPU):
> 
> Total          add_to_page_cache_lru    percent
> 1927694859.404 6703561                  0.34775011030908648%
> 1994166534.446 7311783                  0.36665859514242066%
> 2132812997.930 6249683                  0.29302536162643533%
> 1972582396.980 7908593                  0.40092586307715006%
> 2415371209.622 7670327                  0.31756307144194151%
> 2163473353.153 7406991                  0.34236571433640306%
> 1833544521.635 7952113                  0.43370165851816772%
> 2087748556.683 7995535                  0.38297404035585819%
> 2185868844.795 7117368                  0.32560819085499598%
> 1980423680.193 6514699                  0.32895481230385598%
> 1958021484.352 6745228                  0.34449203207963308%
> 2008592063.702 7398967                  0.36836583862444905%
> 2448117568.931 6224454                  0.25425470079519025%
> 2213021186.789 6938228                  0.31351837214297412%
> 1828902235.789 7479842                  0.40897987074596959%
> 1866644864.837 7720792                  0.41361868802366963%
> 2054526163.572 8475856                  0.41254553727677401%
> 1902862587.026 8334661                  0.43800645705196778%
> 2148449373.925 8672498                  0.40366313050031155%
> 2104880443.231 7741619                  0.36779376353160392%
> 2270646549.540 6486688                  0.2856758134071597%
> 2250814274.826 7512331                  0.33376058984612328%
> 2214008898.106 6391259                  0.28867359139646992%
> 2446226035.589 7738434                  0.31634173978271579%
> 1920377000.358 6464992                  0.33665223020244384%
> 2030446733.714 6022617                  0.2966153654759367%
> 2157128575.464 7375670                  0.34192074055731642%
> 2185670134.683 7943199                  0.36342167438509865%
> 2224381465.969 8492920                  0.38181041021667828%
> 2479045291.455 8161934                  0.32923698603382928%
> 1971641972.432 8114434                  0.41155717485517562%
> 2208169216.013 8138519                  0.36856410011433138%
> 1971312194.985 8122576                  0.41203904793283169%
> 2279129743.345 7365286                  0.3231622079219687%
> 2360162707.877 7412468                  0.3140659741491984%
> 1885022652.013 8252339                  0.43778460652382062%
> 2065345722.062 7985361                  0.3866355600759942%
> 1975982196.916 7925371                  0.4010851419799969%
> 2169738273.075 7203856                  0.33201497569522709%
> 2034570957.887 7396992                  0.36356520136717829%
> 1789538123.758 7380292                  0.41241323121418116%
> 1912812047.546 8697404                  0.45469203370807615%
> 2221047646.893 7451321                  0.33548676951723994%
> 2163981380.559 7982481                  0.36887937538251664%
> 1969189605.396 7353112                  0.37340802428831149%
> 2278909389.270 7088803                  0.31106120468750831%
> 2104996788.904 7342692                  0.34882200479854836%
> 2374705789.293 7327329                  0.30855733931492624%
> 2137944282.456 8112283                  0.3794431438915179%
> 2268117031.230 6889022                  0.30373309247909852%
> 1899091967.945 8173918                  0.43041190937398177%
> 2098779842.987 7858621                  0.37443760603377763%
> 2124662819.215 8156209                  0.38388251190904155%
> 2317714239.959 7048899                  0.30413149638864423%
> 2218944220.453 6659501                  0.30012025262358577%
> 1945126557.801 7704643                  0.3960998305791596%
> 2092060468.851 7616315                  0.36405807161888715%
> 1933379809.531 7300893                  0.37762331870896354%
> 1903832170.769 7508856                  0.39440745435912067%
> 2435478683.656 6790800                  0.27882814354203433%
> 2126189870.763 7104836                  0.33415811530747125%
> 2161363441.772 7011298                  0.32439236569356278%
> 2064449342.765 6822211                  0.33046153561039859%
> 2116816508.947 8156178                  0.38530396779913867%
> 1982774633.856 7131335                  0.35966442571092105%
> 2323253766.182 7275361                  0.3131539526978242%
> 1905927501.820 7477458                  0.39232646534874271%
> 2252162684.001 8064426                  0.35807475442552977%
> 2027146552.281 7875866                  0.38851981328818402%
> 1845397260.011 7598767                  0.41176862915439172%
> 2177206310.501 7366288                  0.33833670077434846%
> 2041629016.560 6600368                  0.32328929234759562%
> 2189308825.112 8348863                  0.38134697600613249%
> 1930990433.231 7344038                  0.38032492930126544%
> 2102320863.469 7299411                  0.34720727586536809%
> 1895798071.683 8051595                  0.42470741585110772%
> 2019648719.623 7992832                  0.39575357448754711%
> 1961118039.267 7837733                  0.39965636147681766%
> 2142238585.905 8084509                  0.37738602288244455%
> 2263212432.641 7757018                  0.34274369865263332%
> 2200111036.549 7069702                  0.32133387281622078%
> 1941699880.568 7786856                  0.4010329339734075%
> 1939777157.562 7729686                  0.39848319534368681%
> 1983015138.927 8136252                  0.41029701893261827%
> 2040218650.019 7968023                  0.39054750332400873%
> 1850721920.862 7835912                  0.42339758943095634%
> 2167986192.252 8656589                  0.39929170356052646%
> 2128868496.040 8248906                  0.38747841942065214%
> 1984840247.084 7754853                  0.39070413910605312%
> 2130780080.168 7053725                  0.33103955990821227%
> 2062729676.752 7968701                  0.38631824081513272%
> 1995756518.448 7903687                  0.39602461156666058%
> 1921150053.008 8307746                  0.43243608103345821%
> 2253731991.230 8033382                  0.35644797301810893%
> 2143807312.838 6755593                  0.31512127790332323%
> 1948787937.040 7152297                  0.36701258582622248%
> 2276500726.783 7277806                  0.31969267193182554%
> 2282733838.581 6336839                  0.27759867983291148%
> 2003104438.909 7092547                  0.35407774363791977%
> 2239684291.880 6982252                  0.3117516171950766%
> 1801833773.742 2645976                  0.14684906224756283%
> 2016683111.150 2261656                  0.11214731692329717%
> 1962727948.601 2767184                  0.14098663046870061%
> 1845203884.375 2266317                  0.1228220371304734%
> 1516914718.568 1975612                  0.13023883121557553%
> 1557730294.184 2109274                  0.13540688063108641%
> 2171303320.533 1865384                  0.085910797554627022%
> 1617322895.641 2073254                  0.12819048104666192%
> 1867107731.934 1901409                  0.10183713384500154%
> 1772588329.145 2045529                  0.11539786008783279%
> 2030802387.562 2161163                  0.10641916777508319%
> 1853583237.019 1803299                  0.097287187539531861%
> 1453728764.040 3231016                  0.22225714176699726%
> 1989811798.738 2418414                  0.1215398361560542%
> 1956758511.567 2254818                  0.11523230826241863%
> 2001579398.410 1952360                  0.097540971971978793%
> 1636515132.388 1532408                  0.093638486419854422%
> 1926271619.408 1918912                  0.09961793449408439%
> 1642519133.238 1384193                  0.084272564744573439%
> 1677033195.955 1340416                  0.07992781557533149%
> 1693983317.244 3135804                  0.18511421972571418%
> 1487434932.071 2467493                  0.1658891388656871%
> 1720442169.343 3034092                  0.17635536108479918%
> 1640362983.552 1944463                  0.11853858075908964%
> 1821382248.526 2659828                  0.1460334864991977%
> 1947322513.836 3540608                  0.18181929160904178%
> 1673422299.278 2769832                  0.16551900863249205%
> 1836039325.613 2175872                  0.11850900847526999%
> 1582929718.539 2072026                  0.13089816785500888%
> 1476014468.174 1996943                  0.13529291501257767%
> 1847131237.607 2120975                  0.11482535495137695%
> 1734796451.802 1842211                  0.10619176665287874%
> 1804827020.363 1988047                  0.11015166426310204%
> 1647250758.752 2634152                  0.15991202225917941%
> 1506833507.228 2271966                  0.15077750720977479%
> 1919459367.994 1608374                  0.083793073550751385%
> 1913364635.281 2126200                  0.11112361756846952%
> 1733934046.047 2031440                  0.11715785872197679%
> 1762796865.803 2285487                  0.12965118354456009%
> 1711072016.023 2270996                  0.13272357789349024%
> 2082560830.257 2273483                  0.10916766353083858%
> 1857948273.729 1266725                  0.06817870109255604%
> 1738287034.188 2404600                  0.1383315846409251%
> 1618942973.292 2016207                  0.12453848179100423%
> 1820789056.854 2369799                  0.13015230902664757%
> 1561747605.669 2487149                  0.15925422206327566%
> 1884502059.843 2512148                  0.13330566485076115%
> 1352203145.315 2618029                  0.19361210695824271%
> 1911003204.592 1648292                  0.086252707271200574%
> 1525244016.511 2555354                  0.16753738892517536%
> 2190575134.148 1939935                  0.088558249829422828%
> 1643983673.168 1747429                  0.10629235730989092%
> 1781900292.843 1686195                  0.094629032094141294%
> 1566821289.746 2291057                  0.14622324926229507%
> 1777199439.264 2456866                  0.13824368530171682%
> 1633063412.403 1953080                  0.11959609070698031%
> 1913601422.874 1821650                  0.095194849785599556%
> 2094732749.709 2322465                  0.11087166132876074%
> 1951529182.943 2270128                  0.11632559839953495%
> 1450396654.276 2867633                  0.19771370759479912%
> 1643954213.946 2460211                  0.14965203891504558%
> 1845971802.604 2593942                  0.14051904781757141%
> 1671974226.831 2552423                  0.15265923116756236%
> 1792483105.554 1547320                  0.086322710390164162%
> 1774273751.516 2101959                  0.1184686973024323%
> 1797523443.598 1781759                  0.099122990932099045%
> 1877038808.523 2131598                  0.11356174365288201%
> 1485614991.186 1796722                  0.12094129439052281%
> 2008042349.453 1972441                  0.09822706182154485%
> 2178564679.489 1948988                  0.089462021410222761%
> 1462911389.586 2076158                  0.14191960051575969%
> 1976013519.965 2692701                  0.13626936115536772%
> 1473561005.083 1369873                  0.092963439944099252%
> 2190713955.687 2450228                  0.11184609444967988%
> 1648795783.114 2557892                  0.15513698095279177%
> 2116292380.702 1948901                  0.092090347145394241%
> 1704227210.803 1980402                  0.11620527987385389%
> 1984573918.515 2325530                  0.11718031655581404%
> 1856414008.018 1798348                  0.096872141248277158%
> 1401593348.875 2285205                  0.16304336788086488%
> 1832475317.969 2373585                  0.12952889333487622%
> 1840826298.949 2112188                  0.11474129857911804%
> 2014475390.352 2104652                  0.10447643143618861%
> 1730176303.748 1975645                  0.11418749613667999%
> 2016553863.650 2009045                  0.099627638825555156%
> 1891795883.646 2160557                  0.11420666567029553%
> 1881160108.475 1492908                  0.079361027978115894%
> 1562754135.542 1978972                  0.12663361145505117%
> 2061996559.038 2071212                  0.10044691834821971%
> 1519528906.706 2460493                  0.16192472477103448%
> 1796244525.568 2099254                  0.1168690548596764%
> 1998463642.677 1988093                  0.099481069234609226%
> 1872169708.204 1762657                  0.094150492462082555%
> 1664425938.102 1710870                  0.10279039522485223%
> 1791543540.761 1992110                  0.11119517637588673%
> 1883913354.514 1770358                  0.093972368514618113%
> 1813806740.237 1877185                  0.10349421238531283%
> 1547586337.236 2702003                  0.17459465329900728%
> 2020313568.921 2100334                  0.10396079263684484%
> 1727289498.314 2059429                  0.1192289423405978%
> 
> Detailed time after patch:
> 
> Total          add_to_page_cache_lru    percent
> 2151489401.227 6287717                  0.29224949918015392%
> 1919030541.464 6872841                  0.3581413037208262%
> 2277104485.950 8538296                  0.37496285535785828%
> 2142847301.255 7992466                  0.37298345968558083%
> 1909650847.324 6960671                  0.36449966808089612%
> 2109123648.836 6825371                  0.32361170497361974%
> 2128503623.032 7728883                  0.36311345286743746%
> 1925232499.136 7526064                  0.39091714914315673%
> 1935780958.829 7499162                  0.38739723964101919%
> 1722848267.832 7353451                  0.42681942091473007%
> 1855159456.452 7600681                  0.40970499724785564%
> 1862769335.900 8607712                  0.46209221045831578%
> 2272365556.764 6895008                  0.30342864419310028%
> 2049429274.204 7252089                  0.35385895435775488%
> 2001218334.232 8155639                  0.4075336938750293%
> 1947131845.644 7229822                  0.37130623774523025%
> 1982915408.746 7513518                  0.37891268416495716%
> 2038654062.353 7531559                  0.36943781385386831%
> 2134879530.359 6738727                  0.31564905205057731%
> 2029761134.284 7472398                  0.36814174208907074%
> 1956557094.935 7280443                  0.37210480690019776%
> 1740007857.373 6663035                  0.38293131676196024%
> 2054278081.418 7765213                  0.37800203732105886%
> 2367585922.353 7629987                  0.32226864199365651%
> 2023763123.141 7730698                  0.38199618876350999%
> 1826074987.177 7579606                  0.41507638258150648%
> 2077873224.783 7414675                  0.35683962387909118%
> 2103719398.720 7093390                  0.33718327664402131%
> 2164758918.301 7181094                  0.33172719323572736%
> 1953199921.042 7750654                  0.39681826302067191%
> 2179298642.359 6395109                  0.29344803303679207%
> 1971922836.328 7215479                  0.36591081897688477%
> 2225552712.785 8153246                  0.36634701812105075%
> 2124780733.448 7080842                  0.33325048032177529%
> 2139511762.563 8708060                  0.40701155059639843%
> 1934365642.348 6842114                  0.35371358186939283%
> 1950765025.910 6818753                  0.34954250816646476%
> 2013188877.577 6495929                  0.32266863146086228%
> 1921391492.273 6536576                  0.34020011154870117%
> 2164698840.945 6712421                  0.31008567441509277%
> 1992440404.239 7467130                  0.3747730664422067%
> 2073623249.374 7113197                  0.34303227465005426%
> 1938481781.041 6902169                  0.35606055561138211%
> 1963963682.421 7776006                  0.39593430721765843%
> 1807704433.895 7206899                  0.39867684477995868%
> 1902413226.299 7687850                  0.40411041585093088%
> 1718395340.661 7096452                  0.41296969516166576%
> 2014139552.830 7653595                  0.37999328245384933%
> 1976902280.499 7781174                  0.3936043818026207%
> 1920129734.441 6808420                  0.3545812492707483%
> 1786139514.638 7179970                  0.4019825966089316%
> 2011990483.317 6739406                  0.33496212113733792%
> 2186643920.919 7830962                  0.35812698743876015%
> 1822975832.465 8627042                  0.47323951565143058%
> 1872249428.425 7264440                  0.38800599373722838%
> 1971668701.573 6818269                  0.34581210294408871%
> 2054585589.388 7261214                  0.35341501651254642%
> 2252858576.565 7228902                  0.32087686618225808%
> 2196771348.820 6634658                  0.30201859668116204%
> 2224872312.502 6948714                  0.31231967609798522%
> 1924947573.139 6935037                  0.360271474235066%
> 2070207954.999 7564561                  0.36540102078796494%
> 2070702822.858 6994172                  0.33776802362912656%
> 2099826181.510 7199665                  0.34286957003377627%
> 2369037401.585 7009501                  0.29587971027009985%
> 2109093017.803 7056062                  0.33455432929886414%
> 2141046841.968 7831694                  0.36578807368835009%
> 2284785719.942 7358167                  0.3220506385249462%
> 2088247084.213 6857813                  0.32840045853982414%
> 1880913709.753 6697468                  0.35607523967059107%
> 1984586172.943 7361786                  0.37094816543455983%
> 2130320004.839 7840067                  0.36802297223850727%
> 2314059227.236 6669054                  0.28819720435443524%
> 1935408704.712 8309322                  0.42933164348025787%
> 2239440115.241 7040432                  0.31438357972087749%
> 2387500130.769 6555638                  0.27458168129559291%
> 2497500671.323 6969731                  0.27906823329532587%
> 2229438175.095 6759889                  0.30321042653322067%
> 1979317661.013 7870099                  0.39761677243723181%
> 2228213884.202 5883791                  0.26405862748257614%
> 2264526373.743 6563122                  0.28982316461838858%
> 2256376049.198 6670128                  0.29561242694323103%
> 2085817937.654 6878115                  0.32975624937505721%
> 2177225113.484 7421724                  0.34087995559282069%
> 1843473456.551 7156295                  0.38819625932607071%
> 2403022978.094 6958096                  0.28955594946157931%
> 1935191866.634 6872608                  0.35513832599730566%
> 1917810458.850 7816937                  0.40759695328220108%
> 2016838266.381 7257719                  0.35985627211562177%
> 1904575178.166 7484129                  0.39295529448235278%
> 2087232238.930 7372521                  0.35321996577532039%
> 1976631116.176 6438233                  0.32571747693901715%
> 2004191806.805 6735567                  0.33607397141980955%
> 2110550986.328 7174599                  0.33993961986592813%
> 2152621184.889 7337361                  0.34085704681840484%
> 2059135774.936 6363510                  0.30903790208772336%
> 2353978555.486 7265456                  0.30864580236160993%
> 1945677192.325 5927833                  0.30466682877217138%
> 2255676557.335 6379740                  0.28283044301074045%
> 2034627875.756 7319043                  0.35972391252530572%
> 1712556465.847 3079747                  0.17983331127577226%
> 1837380343.831 3111263                  0.1693314620702272%
> 1680363376.083 1611968                  0.095929727042585719%
> 1769978476.954 2017096                  0.11396161175198416%
> 1829700109.543 2612874                  0.14280340184559598%
> 1783959966.875 3184558                  0.17851062014460206%
> 1772032973.967 2133741                  0.12041203698502599%
> 1667688494.152 1985120                  0.11903422053705601%
> 2001252356.720 2633874                  0.13161128785961057%
> 2077009766.270 2858188                  0.13761071548223289%
> 2076612330.379 2340135                  0.11269002720276176%
> 1834812072.536 1713713                  0.093399919569494547%
> 1787949752.795 2522212                  0.14106727529994451%
> 1907464219.876 2476303                  0.12982172741153589%
> 1609254879.065 1903707                  0.11829741980375917%
> 2069557067.911 2484547                  0.12005211349440528%
> 1725151836.363 2491288                  0.1444097816486799%
> 2077004221.561 2508662                  0.12078271069254648%
> 1650748196.246 3195100                  0.19355465644403197%
> 1843881436.725 2731880                  0.14815920078094685%
> 1677080490.297 2752567                  0.16412849686853959%
> 2103777441.563 3252694                  0.15461207710181613%
> 1598451912.768 2345538                  0.1467381021139566%
> 1825835416.807 2381575                  0.13043755083713246%
> 1969692431.120 2198750                  0.11162910336969484%
> 1867823718.088 1813348                  0.097083465770326324%
> 1929972836.031 2140104                  0.11088777831718792%
> 1627140530.277 2519227                  0.15482541016731566%
> 1779407314.725 2828755                  0.15897175293095681%
> 1724839045.836 1937500                  0.11232932166496305%
> 1780135194.997 2889504                  0.16231935687361485%
> 1758590417.053 2193512                  0.12473126082853507%
> 1729351400.188 1523643                  0.088104881392778986%
> 1484255938.207 2153531                  0.14509162096406991%
> 1504352024.282 1910774                  0.12701641432044322%
> 1916441202.247 2423510                  0.12645887581411155%
> 1651813655.113 2892863                  0.1751325272705837%
> 1772411947.023 2693122                  0.15194673024651262%
> 1738795901.180 3064160                  0.17622309771495132%
> 1717902110.376 3266140                  0.19012375503078779%
> 1858583685.387 2246750                  0.12088505982619635%
> 1641867344.242 2901893                  0.17674345069209688%
> 2037933222.333 3290649                  0.16146991294606342%
> 1713754879.753 2136278                  0.1246548164640615%
> 2265307843.106 2195967                  0.096939010151886218%
> 2066343912.900 2117667                  0.10248376307446183%
> 2033259596.459 2652081                  0.13043494321230309%
> 2029016228.457 2447926                  0.12064595470789147%
> 1649943910.454 2243384                  0.13596728869302646%
> 2064977308.315 2517573                  0.12191770775700743%
> 1962826877.902 2403546                  0.12245328546596376%
> 1991037543.226 3214700                  0.16145853255942866%
> 1469646744.644 1960481                  0.133398111290676%
> 2195532272.023 1693642                  0.077140382839349028%
> 1902754342.931 2542547                  0.13362455376575119%
> 1922726134.774 2726990                  0.14182935108023245%
> 1629134454.130 2814497                  0.17276026499009955%
> 1442857977.322 2886065                  0.2000241912483062%
> 1782430811.270 2884583                  0.16183421997427802%
> 1574026949.337 2873023                  0.18252692568003068%
> 2064769936.712 2717133                  0.13159495165484839%
> 1621381781.294 2143854                  0.13222388611576869%
> 1783777994.311 2515565                  0.14102455619605617%
> 1561198792.593 2402227                  0.15387066729728466%
> 1658682002.165 2924639                  0.17632306832669586%
> 1675144757.604 2216513                  0.13231769910860314%
> 1712859849.041 2274046                  0.13276310967725691%
> 1448178253.374 2507727                  0.17316424923226392%
> 1621787692.726 2822681                  0.17404750403892047%
> 1924315582.555 2275728                  0.11826168330344308%
> 2044659318.750 2606738                  0.12749008972280165%
> 1559320878.499 2049522                  0.13143683434629996%
> 1641314354.590 2786145                  0.16975084585158451%
> 1769603624.855 2008994                  0.11352790940200608%
> 1707747444.550 2813289                  0.16473682973311757%
> 1366611645.499 2930513                  0.21443641356721807%
> 1435165128.768 3161655                  0.22029903992400404%
> 1440555681.860 2831258                  0.19653929630435177%
> 1627011278.312 1956565                  0.12025515901954331%
> 1668464384.654 3863649                  0.23156916237089648%
> 1589045506.115 2743558                  0.17265446391825656%
> 1766150748.153 3207045                  0.18158387687765917%
> 1668645748.936 2997211                  0.17961937109246526%
> 1808149919.182 2420007                  0.13383884678626659%
> 1831096162.185 2226869                  0.12161398434382248%
> 1473136553.216 2774496                  0.18833936296964501%
> 1873520917.567 2427200                  0.12955286366121929%
> 1971938578.205 2207736                  0.11195764535473715%
> 1670588149.850 2437810                  0.14592525394238479%
> 1912140400.200 2407058                  0.12588291109524352%
> 1925513187.452 2415130                  0.12542786077699639%
> 1658671061.956 3111210                  0.18757245311382492%
> 1917423750.996 2888924                  0.15066695603928745%
> 1553962205.772 3386946                  0.21795549386076499%
> 1834101490.364 2367281                  0.12907033838842713%
> 1626846627.678 3362957                  0.20671629044711806%
> 1519340210.104 2320735                  0.15274623712098975%
> 1725615314.262 3813557                  0.22099693764197714%
> 1725823077.103 3137784                  0.1818137699993643%
> 1633124107.941 2007344                  0.12291435722731486%
> 
> Thanks,
> Fengguang


* Re: [RFC][PATCH] HWPOISON: remove the unsafe __set_page_locked()
@ 2009-09-26 11:31       ` Wu Fengguang
  0 siblings, 0 replies; 54+ messages in thread
From: Wu Fengguang @ 2009-09-26 11:31 UTC (permalink / raw)
  To: Andi Kleen; +Cc: Andrew Morton, Hugh Dickins, Nick Piggin, linux-mm, LKML

On Sat, Sep 26, 2009 at 06:52:59PM +0800, Wu Fengguang wrote:
> On Sat, Sep 26, 2009 at 11:49:36AM +0800, Andi Kleen wrote:
> > On Sat, Sep 26, 2009 at 11:15:37AM +0800, Wu Fengguang wrote:
> > > The swap cache and page cache code assume that they 'own' the newly
> > > allocated page and therefore can disregard the locking rules. However
> > > now hwpoison can hit any time on any page.
> > > 
> > > So use the safer lock_page()/trylock_page(). The main intention is not
> > > to close such a small time window of memory corruption. But to avoid
> > > kernel oops that may result from such races, and also avoid raising
> > > false alerts in hwpoison stress tests.
> > > 
> > > This in theory will slightly increase page cache/swap cache overheads,
> > > however it seems to be too small to be measurable in benchmark.
> > 
> > Thanks. Can you please describe what benchmarks you used?
> 
> Good question. The test case is to copy 1GB sparse file over NFS.
> 
> I ran it 100 times (2 hours total wall time); the overhead increase is ~0.008%.

And the standard deviation is 0.04%, much larger than the 0.008% difference.

Thanks,
Fengguang
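The noise comparison above (a 0.008% shift against a 0.04% standard deviation) can be reproduced from the per-run columns quoted below. A minimal sketch, assuming the percent column is simply function time divided by total time; the three rows here are a small hypothetical subset of the 100 runs, not the full data set:

```python
import statistics

def overhead_percent(runs):
    """Each run is (total_time, add_to_page_cache_lru_time) in the same
    units; return the per-run overhead percentages."""
    return [func / total * 100 for total, func in runs]

# A few sample rows in the format of the tables below: total, function time.
before = [
    (1980423680.193, 6514699),
    (1958021484.352, 6745228),
    (2008592063.702, 7398967),
]

percents = overhead_percent(before)
mean = statistics.mean(percents)
stdev = statistics.stdev(percents)
# When the before/after difference in means is much smaller than stdev,
# the measured change is within the run-to-run noise.
```

Run over all 100 before-patch and after-patch rows, the same computation yields the ~0.242% vs ~0.250% means quoted earlier.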

> The average time reported by the function profiler:
> 
>                 total time      add_to_page_cache_lru   percent
>         before  1940083424.361  4841664.805             0.242%
>         after   1914258447.188  4889044.435             0.250%
> 
> The script used is:
> 
>         for i in `seq 1 100`
>         do      
>                 # copy 1GB sparse file over NFS
>                 echo 1 > /debug/tracing/function_profile_enabled
>                 cp /tmp/sparse/1G /dev/null
>                 echo 0 > /debug/tracing/function_profile_enabled
> 
>                 fadvise /tmp/sparse/1G 0 0 dontneed
>                 cat /debug/tracing/trace_stat/function0 > function0.$i
>                 cat /debug/tracing/trace_stat/function1 > function1.$i
>         done
> 
> 
> File function0.1 after patch:
> 
>   Function                               Hit    Time            Avg
>   --------                               ---    ----            ---
>   schedule                             32492    474358128 us     14599.22 us 
>   schedule_timeout                        26    99994288 us     3845934 us 
>   poll_schedule_timeout                    8    52914376 us     6614297 us 
>   sys_poll                                 6    52914373 us     8819062 us 
>   schedule_hrtimeout_range                 8    52914373 us     6614296 us 
>   do_sys_poll                              6    52914355 us     8819059 us 
>   sys_read                             86997    40703194 us     467.868 us  
>   vfs_read                             87002    40550651 us     466.088 us  
>   do_sync_read                         86987    40011054 us     459.965 us  
>   nfs_file_read                        86985    39945631 us     459.224 us  
>   generic_file_aio_read                86985    39785858 us     457.387 us  
>   page_cache_async_readahead            2886    34379069 us     11912.35 us 
>   ondemand_readahead                    2889    34371850 us     11897.49 us 
>   ra_submit                             2891    34368378 us     11888.05 us 
>   __do_page_cache_readahead             2891    34360744 us     11885.41 us 
>   nfs_readpages                         2876    33198107 us     11543.15 us 
>   read_cache_pages                      2876    31617284 us     10993.49 us 
>   __rpc_execute                        84468    25404441 us     300.758 us  
>   readpage_async_filler               174071    24947004 us     143.315 us  
>   do_softirq                           44822    24829675 us     553.961 us  
>   __do_softirq                         44822    24803408 us     553.375 us  
>   net_rx_action                        38424    24357207 us     633.906 us  
>   e1000_clean                          39042    24253141 us     621.206 us  
>   nfs_pageio_doio                      43534    22993433 us     528.171 us  
>   nfs_pagein_one                       43520    22939752 us     527.108 us  
>   nfs_read_rpcsetup                    43513    22533176 us     517.849 us  
>   e1000_clean_rx_irq                   39042    22047306 us     564.707 us  
>   rpc_run_task                         43528    22042766 us     506.404 us  
>   call_transmit                        45873    21842655 us     476.154 us  
>   nfs_pageio_add_request              174091    21583666 us     123.979 us  
>   rpc_execute                          43528    21254402 us     488.292 us  
>   cpuidle_idle_call                    18073    19445493 us     1075.941 us 
>   acpi_idle_enter_c1                   17999    19396945 us     1077.667 us 
>   e1000_receive_skb                   436279    18818686 us     43.134 us   
>   tcp_v4_do_rcv                       513521    18681598 us     36.379 us   
>   xprt_transmit                        43346    18532249 us     427.542 us  
>   tcp_rcv_established                 513521    18384627 us     35.801 us   
>   napi_gro_receive                    436279    18114586 us     41.520 us   
>   napi_skb_finish                     436279    17130261 us     39.264 us   
>   netif_receive_skb                   436279    16886046 us     38.704 us   
>   ip_rcv                              436279    16500338 us     37.820 us   
>   xs_tcp_send_request                  43346    16418581 us     378.779 us  
>   xs_sendpages                         43346    16367619 us     377.603 us  
>   xs_send_kvec                         43347    16310325 us     376.273 us  
>   kernel_sendmsg                       43347    16254698 us     374.990 us  
>   sock_sendmsg                         43347    16177184 us     373.201 us  
>   tcp_sendmsg                          43347    16127746 us     372.061 us  
>   do_IRQ                               45271    15968995 us     352.742 us  
>   ip_local_deliver                    436279    15816133 us     36.252 us   
>   irq_exit                             53709    15421475 us     287.130 us  
>   acpi_processor_ffh_cstate_ente       17999    15394735 us     855.310 us  
>   mwait_idle_with_hints                17999    15386089 us     854.830 us  
>   tcp_v4_rcv                          436217    15173852 us     34.785 us   
>   release_sock                         43364    13536313 us     312.155 us  
>   xs_tcp_data_ready                   512045    10092920 us     19.711 us   
>   _spin_unlock_bh                     518900    9851221 us     18.984 us   
>   local_bh_enable_ip                  511352    9403704 us     18.389 us   
>   tcp_read_sock                      1024090    8832051 us     8.624 us    
>   __tcp_ack_snd_check                 512045    6644838 us     12.977 us   
>   add_to_page_cache_lru               174072    6287717 us     36.121 us   
>   tcp_transmit_skb                    296572    5962431 us     20.104 us   
>   tcp_send_ack                        253210    5775135 us     22.807 us   
>   __kfree_skb                         802932    5216451 us     6.496 us    
>   add_to_page_cache_locked            174072    4716425 us     27.094 us   
> [...]
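The per-function numbers in the listing above come from the trace_stat dumps saved as function0.$i by the script. A hedged sketch of extracting one function's cumulative time from such a dump; the field layout (name, hit count, time, "us", average) is assumed from the listing above:

```python
def function_time_us(path, name):
    """Scan a trace_stat function-profile dump for `name` and return its
    cumulative time in microseconds (the 'Time' column), or None."""
    with open(path) as f:
        for line in f:
            fields = line.split()
            # Data lines look like:
            #   add_to_page_cache_lru  174072  6287717 us  36.121 us
            # The header and '----' separator lines never match a name.
            if fields and fields[0] == name:
                return float(fields[2])
    return None
```

Applied to each function0.$i / function1.$i pair, this recovers the add_to_page_cache_lru column of the per-run tables below.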
> 
> Detailed time before patch (100 runs * 2 CPU):
> 
> Total          add_to_page_cache_lru    percent
> 1927694859.404 6703561                  0.34775011030908648%
> 1994166534.446 7311783                  0.36665859514242066%
> 2132812997.930 6249683                  0.29302536162643533%
> 1972582396.980 7908593                  0.40092586307715006%
> 2415371209.622 7670327                  0.31756307144194151%
> 2163473353.153 7406991                  0.34236571433640306%
> 1833544521.635 7952113                  0.43370165851816772%
> 2087748556.683 7995535                  0.38297404035585819%
> 2185868844.795 7117368                  0.32560819085499598%
> 1980423680.193 6514699                  0.32895481230385598%
> 1958021484.352 6745228                  0.34449203207963308%
> 2008592063.702 7398967                  0.36836583862444905%
> 2448117568.931 6224454                  0.25425470079519025%
> 2213021186.789 6938228                  0.31351837214297412%
> 1828902235.789 7479842                  0.40897987074596959%
> 1866644864.837 7720792                  0.41361868802366963%
> 2054526163.572 8475856                  0.41254553727677401%
> 1902862587.026 8334661                  0.43800645705196778%
> 2148449373.925 8672498                  0.40366313050031155%
> 2104880443.231 7741619                  0.36779376353160392%
> 2270646549.540 6486688                  0.2856758134071597%
> 2250814274.826 7512331                  0.33376058984612328%
> 2214008898.106 6391259                  0.28867359139646992%
> 2446226035.589 7738434                  0.31634173978271579%
> 1920377000.358 6464992                  0.33665223020244384%
> 2030446733.714 6022617                  0.2966153654759367%
> 2157128575.464 7375670                  0.34192074055731642%
> 2185670134.683 7943199                  0.36342167438509865%
> 2224381465.969 8492920                  0.38181041021667828%
> 2479045291.455 8161934                  0.32923698603382928%
> 1971641972.432 8114434                  0.41155717485517562%
> 2208169216.013 8138519                  0.36856410011433138%
> 1971312194.985 8122576                  0.41203904793283169%
> 2279129743.345 7365286                  0.3231622079219687%
> 2360162707.877 7412468                  0.3140659741491984%
> 1885022652.013 8252339                  0.43778460652382062%
> 2065345722.062 7985361                  0.3866355600759942%
> 1975982196.916 7925371                  0.4010851419799969%
> 2169738273.075 7203856                  0.33201497569522709%
> 2034570957.887 7396992                  0.36356520136717829%
> 1789538123.758 7380292                  0.41241323121418116%
> 1912812047.546 8697404                  0.45469203370807615%
> 2221047646.893 7451321                  0.33548676951723994%
> 2163981380.559 7982481                  0.36887937538251664%
> 1969189605.396 7353112                  0.37340802428831149%
> 2278909389.270 7088803                  0.31106120468750831%
> 2104996788.904 7342692                  0.34882200479854836%
> 2374705789.293 7327329                  0.30855733931492624%
> 2137944282.456 8112283                  0.3794431438915179%
> 2268117031.230 6889022                  0.30373309247909852%
> 1899091967.945 8173918                  0.43041190937398177%
> 2098779842.987 7858621                  0.37443760603377763%
> 2124662819.215 8156209                  0.38388251190904155%
> 2317714239.959 7048899                  0.30413149638864423%
> 2218944220.453 6659501                  0.30012025262358577%
> 1945126557.801 7704643                  0.3960998305791596%
> 2092060468.851 7616315                  0.36405807161888715%
> 1933379809.531 7300893                  0.37762331870896354%
> 1903832170.769 7508856                  0.39440745435912067%
> 2435478683.656 6790800                  0.27882814354203433%
> 2126189870.763 7104836                  0.33415811530747125%
> 2161363441.772 7011298                  0.32439236569356278%
> 2064449342.765 6822211                  0.33046153561039859%
> 2116816508.947 8156178                  0.38530396779913867%
> 1982774633.856 7131335                  0.35966442571092105%
> 2323253766.182 7275361                  0.3131539526978242%
> 1905927501.820 7477458                  0.39232646534874271%
> 2252162684.001 8064426                  0.35807475442552977%
> 2027146552.281 7875866                  0.38851981328818402%
> 1845397260.011 7598767                  0.41176862915439172%
> 2177206310.501 7366288                  0.33833670077434846%
> 2041629016.560 6600368                  0.32328929234759562%
> 2189308825.112 8348863                  0.38134697600613249%
> 1930990433.231 7344038                  0.38032492930126544%
> 2102320863.469 7299411                  0.34720727586536809%
> 1895798071.683 8051595                  0.42470741585110772%
> 2019648719.623 7992832                  0.39575357448754711%
> 1961118039.267 7837733                  0.39965636147681766%
> 2142238585.905 8084509                  0.37738602288244455%
> 2263212432.641 7757018                  0.34274369865263332%
> 2200111036.549 7069702                  0.32133387281622078%
> 1941699880.568 7786856                  0.4010329339734075%
> 1939777157.562 7729686                  0.39848319534368681%
> 1983015138.927 8136252                  0.41029701893261827%
> 2040218650.019 7968023                  0.39054750332400873%
> 1850721920.862 7835912                  0.42339758943095634%
> 2167986192.252 8656589                  0.39929170356052646%
> 2128868496.040 8248906                  0.38747841942065214%
> 1984840247.084 7754853                  0.39070413910605312%
> 2130780080.168 7053725                  0.33103955990821227%
> 2062729676.752 7968701                  0.38631824081513272%
> 1995756518.448 7903687                  0.39602461156666058%
> 1921150053.008 8307746                  0.43243608103345821%
> 2253731991.230 8033382                  0.35644797301810893%
> 2143807312.838 6755593                  0.31512127790332323%
> 1948787937.040 7152297                  0.36701258582622248%
> 2276500726.783 7277806                  0.31969267193182554%
> 2282733838.581 6336839                  0.27759867983291148%
> 2003104438.909 7092547                  0.35407774363791977%
> 2239684291.880 6982252                  0.3117516171950766%
> 1801833773.742 2645976                  0.14684906224756283%
> 2016683111.150 2261656                  0.11214731692329717%
> 1962727948.601 2767184                  0.14098663046870061%
> 1845203884.375 2266317                  0.1228220371304734%
> 1516914718.568 1975612                  0.13023883121557553%
> 1557730294.184 2109274                  0.13540688063108641%
> 2171303320.533 1865384                  0.085910797554627022%
> 1617322895.641 2073254                  0.12819048104666192%
> 1867107731.934 1901409                  0.10183713384500154%
> 1772588329.145 2045529                  0.11539786008783279%
> 2030802387.562 2161163                  0.10641916777508319%
> 1853583237.019 1803299                  0.097287187539531861%
> 1453728764.040 3231016                  0.22225714176699726%
> 1989811798.738 2418414                  0.1215398361560542%
> 1956758511.567 2254818                  0.11523230826241863%
> 2001579398.410 1952360                  0.097540971971978793%
> 1636515132.388 1532408                  0.093638486419854422%
> 1926271619.408 1918912                  0.09961793449408439%
> 1642519133.238 1384193                  0.084272564744573439%
> 1677033195.955 1340416                  0.07992781557533149%
> 1693983317.244 3135804                  0.18511421972571418%
> 1487434932.071 2467493                  0.1658891388656871%
> 1720442169.343 3034092                  0.17635536108479918%
> 1640362983.552 1944463                  0.11853858075908964%
> 1821382248.526 2659828                  0.1460334864991977%
> 1947322513.836 3540608                  0.18181929160904178%
> 1673422299.278 2769832                  0.16551900863249205%
> 1836039325.613 2175872                  0.11850900847526999%
> 1582929718.539 2072026                  0.13089816785500888%
> 1476014468.174 1996943                  0.13529291501257767%
> 1847131237.607 2120975                  0.11482535495137695%
> 1734796451.802 1842211                  0.10619176665287874%
> 1804827020.363 1988047                  0.11015166426310204%
> 1647250758.752 2634152                  0.15991202225917941%
> 1506833507.228 2271966                  0.15077750720977479%
> 1919459367.994 1608374                  0.083793073550751385%
> 1913364635.281 2126200                  0.11112361756846952%
> 1733934046.047 2031440                  0.11715785872197679%
> 1762796865.803 2285487                  0.12965118354456009%
> 1711072016.023 2270996                  0.13272357789349024%
> 2082560830.257 2273483                  0.10916766353083858%
> 1857948273.729 1266725                  0.06817870109255604%
> 1738287034.188 2404600                  0.1383315846409251%
> 1618942973.292 2016207                  0.12453848179100423%
> 1820789056.854 2369799                  0.13015230902664757%
> 1561747605.669 2487149                  0.15925422206327566%
> 1884502059.843 2512148                  0.13330566485076115%
> 1352203145.315 2618029                  0.19361210695824271%
> 1911003204.592 1648292                  0.086252707271200574%
> 1525244016.511 2555354                  0.16753738892517536%
> 2190575134.148 1939935                  0.088558249829422828%
> 1643983673.168 1747429                  0.10629235730989092%
> 1781900292.843 1686195                  0.094629032094141294%
> 1566821289.746 2291057                  0.14622324926229507%
> 1777199439.264 2456866                  0.13824368530171682%
> 1633063412.403 1953080                  0.11959609070698031%
> 1913601422.874 1821650                  0.095194849785599556%
> 2094732749.709 2322465                  0.11087166132876074%
> 1951529182.943 2270128                  0.11632559839953495%
> 1450396654.276 2867633                  0.19771370759479912%
> 1643954213.946 2460211                  0.14965203891504558%
> 1845971802.604 2593942                  0.14051904781757141%
> 1671974226.831 2552423                  0.15265923116756236%
> 1792483105.554 1547320                  0.086322710390164162%
> 1774273751.516 2101959                  0.1184686973024323%
> 1797523443.598 1781759                  0.099122990932099045%
> 1877038808.523 2131598                  0.11356174365288201%
> 1485614991.186 1796722                  0.12094129439052281%
> 2008042349.453 1972441                  0.09822706182154485%
> 2178564679.489 1948988                  0.089462021410222761%
> 1462911389.586 2076158                  0.14191960051575969%
> 1976013519.965 2692701                  0.13626936115536772%
> 1473561005.083 1369873                  0.092963439944099252%
> 2190713955.687 2450228                  0.11184609444967988%
> 1648795783.114 2557892                  0.15513698095279177%
> 2116292380.702 1948901                  0.092090347145394241%
> 1704227210.803 1980402                  0.11620527987385389%
> 1984573918.515 2325530                  0.11718031655581404%
> 1856414008.018 1798348                  0.096872141248277158%
> 1401593348.875 2285205                  0.16304336788086488%
> 1832475317.969 2373585                  0.12952889333487622%
> 1840826298.949 2112188                  0.11474129857911804%
> 2014475390.352 2104652                  0.10447643143618861%
> 1730176303.748 1975645                  0.11418749613667999%
> 2016553863.650 2009045                  0.099627638825555156%
> 1891795883.646 2160557                  0.11420666567029553%
> 1881160108.475 1492908                  0.079361027978115894%
> 1562754135.542 1978972                  0.12663361145505117%
> 2061996559.038 2071212                  0.10044691834821971%
> 1519528906.706 2460493                  0.16192472477103448%
> 1796244525.568 2099254                  0.1168690548596764%
> 1998463642.677 1988093                  0.099481069234609226%
> 1872169708.204 1762657                  0.094150492462082555%
> 1664425938.102 1710870                  0.10279039522485223%
> 1791543540.761 1992110                  0.11119517637588673%
> 1883913354.514 1770358                  0.093972368514618113%
> 1813806740.237 1877185                  0.10349421238531283%
> 1547586337.236 2702003                  0.17459465329900728%
> 2020313568.921 2100334                  0.10396079263684484%
> 1727289498.314 2059429                  0.1192289423405978%
> 
> Detailed time after patch:
> 
> Total          add_to_page_cache_lru    percent
> 2151489401.227 6287717                  0.29224949918015392%
> 1919030541.464 6872841                  0.3581413037208262%
> 2277104485.950 8538296                  0.37496285535785828%
> 2142847301.255 7992466                  0.37298345968558083%
> 1909650847.324 6960671                  0.36449966808089612%
> 2109123648.836 6825371                  0.32361170497361974%
> 2128503623.032 7728883                  0.36311345286743746%
> 1925232499.136 7526064                  0.39091714914315673%
> 1935780958.829 7499162                  0.38739723964101919%
> 1722848267.832 7353451                  0.42681942091473007%
> 1855159456.452 7600681                  0.40970499724785564%
> 1862769335.900 8607712                  0.46209221045831578%
> 2272365556.764 6895008                  0.30342864419310028%
> 2049429274.204 7252089                  0.35385895435775488%
> 2001218334.232 8155639                  0.4075336938750293%
> 1947131845.644 7229822                  0.37130623774523025%
> 1982915408.746 7513518                  0.37891268416495716%
> 2038654062.353 7531559                  0.36943781385386831%
> 2134879530.359 6738727                  0.31564905205057731%
> 2029761134.284 7472398                  0.36814174208907074%
> 1956557094.935 7280443                  0.37210480690019776%
> 1740007857.373 6663035                  0.38293131676196024%
> 2054278081.418 7765213                  0.37800203732105886%
> 2367585922.353 7629987                  0.32226864199365651%
> 2023763123.141 7730698                  0.38199618876350999%
> 1826074987.177 7579606                  0.41507638258150648%
> 2077873224.783 7414675                  0.35683962387909118%
> 2103719398.720 7093390                  0.33718327664402131%
> 2164758918.301 7181094                  0.33172719323572736%
> 1953199921.042 7750654                  0.39681826302067191%
> 2179298642.359 6395109                  0.29344803303679207%
> 1971922836.328 7215479                  0.36591081897688477%
> 2225552712.785 8153246                  0.36634701812105075%
> 2124780733.448 7080842                  0.33325048032177529%
> 2139511762.563 8708060                  0.40701155059639843%
> 1934365642.348 6842114                  0.35371358186939283%
> 1950765025.910 6818753                  0.34954250816646476%
> 2013188877.577 6495929                  0.32266863146086228%
> 1921391492.273 6536576                  0.34020011154870117%
> 2164698840.945 6712421                  0.31008567441509277%
> 1992440404.239 7467130                  0.3747730664422067%
> 2073623249.374 7113197                  0.34303227465005426%
> 1938481781.041 6902169                  0.35606055561138211%
> 1963963682.421 7776006                  0.39593430721765843%
> 1807704433.895 7206899                  0.39867684477995868%
> 1902413226.299 7687850                  0.40411041585093088%
> 1718395340.661 7096452                  0.41296969516166576%
> 2014139552.830 7653595                  0.37999328245384933%
> 1976902280.499 7781174                  0.3936043818026207%
> 1920129734.441 6808420                  0.3545812492707483%
> 1786139514.638 7179970                  0.4019825966089316%
> 2011990483.317 6739406                  0.33496212113733792%
> 2186643920.919 7830962                  0.35812698743876015%
> 1822975832.465 8627042                  0.47323951565143058%
> 1872249428.425 7264440                  0.38800599373722838%
> 1971668701.573 6818269                  0.34581210294408871%
> 2054585589.388 7261214                  0.35341501651254642%
> 2252858576.565 7228902                  0.32087686618225808%
> 2196771348.820 6634658                  0.30201859668116204%
> 2224872312.502 6948714                  0.31231967609798522%
> 1924947573.139 6935037                  0.360271474235066%
> 2070207954.999 7564561                  0.36540102078796494%
> 2070702822.858 6994172                  0.33776802362912656%
> 2099826181.510 7199665                  0.34286957003377627%
> 2369037401.585 7009501                  0.29587971027009985%
> 2109093017.803 7056062                  0.33455432929886414%
> 2141046841.968 7831694                  0.36578807368835009%
> 2284785719.942 7358167                  0.3220506385249462%
> 2088247084.213 6857813                  0.32840045853982414%
> 1880913709.753 6697468                  0.35607523967059107%
> 1984586172.943 7361786                  0.37094816543455983%
> 2130320004.839 7840067                  0.36802297223850727%
> 2314059227.236 6669054                  0.28819720435443524%
> 1935408704.712 8309322                  0.42933164348025787%
> 2239440115.241 7040432                  0.31438357972087749%
> 2387500130.769 6555638                  0.27458168129559291%
> 2497500671.323 6969731                  0.27906823329532587%
> 2229438175.095 6759889                  0.30321042653322067%
> 1979317661.013 7870099                  0.39761677243723181%
> 2228213884.202 5883791                  0.26405862748257614%
> 2264526373.743 6563122                  0.28982316461838858%
> 2256376049.198 6670128                  0.29561242694323103%
> 2085817937.654 6878115                  0.32975624937505721%
> 2177225113.484 7421724                  0.34087995559282069%
> 1843473456.551 7156295                  0.38819625932607071%
> 2403022978.094 6958096                  0.28955594946157931%
> 1935191866.634 6872608                  0.35513832599730566%
> 1917810458.850 7816937                  0.40759695328220108%
> 2016838266.381 7257719                  0.35985627211562177%
> 1904575178.166 7484129                  0.39295529448235278%
> 2087232238.930 7372521                  0.35321996577532039%
> 1976631116.176 6438233                  0.32571747693901715%
> 2004191806.805 6735567                  0.33607397141980955%
> 2110550986.328 7174599                  0.33993961986592813%
> 2152621184.889 7337361                  0.34085704681840484%
> 2059135774.936 6363510                  0.30903790208772336%
> 2353978555.486 7265456                  0.30864580236160993%
> 1945677192.325 5927833                  0.30466682877217138%
> 2255676557.335 6379740                  0.28283044301074045%
> 2034627875.756 7319043                  0.35972391252530572%
> 1712556465.847 3079747                  0.17983331127577226%
> 1837380343.831 3111263                  0.1693314620702272%
> 1680363376.083 1611968                  0.095929727042585719%
> 1769978476.954 2017096                  0.11396161175198416%
> 1829700109.543 2612874                  0.14280340184559598%
> 1783959966.875 3184558                  0.17851062014460206%
> 1772032973.967 2133741                  0.12041203698502599%
> 1667688494.152 1985120                  0.11903422053705601%
> 2001252356.720 2633874                  0.13161128785961057%
> 2077009766.270 2858188                  0.13761071548223289%
> 2076612330.379 2340135                  0.11269002720276176%
> 1834812072.536 1713713                  0.093399919569494547%
> 1787949752.795 2522212                  0.14106727529994451%
> 1907464219.876 2476303                  0.12982172741153589%
> 1609254879.065 1903707                  0.11829741980375917%
> 2069557067.911 2484547                  0.12005211349440528%
> 1725151836.363 2491288                  0.1444097816486799%
> 2077004221.561 2508662                  0.12078271069254648%
> 1650748196.246 3195100                  0.19355465644403197%
> 1843881436.725 2731880                  0.14815920078094685%
> 1677080490.297 2752567                  0.16412849686853959%
> 2103777441.563 3252694                  0.15461207710181613%
> 1598451912.768 2345538                  0.1467381021139566%
> 1825835416.807 2381575                  0.13043755083713246%
> 1969692431.120 2198750                  0.11162910336969484%
> 1867823718.088 1813348                  0.097083465770326324%
> 1929972836.031 2140104                  0.11088777831718792%
> 1627140530.277 2519227                  0.15482541016731566%
> 1779407314.725 2828755                  0.15897175293095681%
> 1724839045.836 1937500                  0.11232932166496305%
> 1780135194.997 2889504                  0.16231935687361485%
> 1758590417.053 2193512                  0.12473126082853507%
> 1729351400.188 1523643                  0.088104881392778986%
> 1484255938.207 2153531                  0.14509162096406991%
> 1504352024.282 1910774                  0.12701641432044322%
> 1916441202.247 2423510                  0.12645887581411155%
> 1651813655.113 2892863                  0.1751325272705837%
> 1772411947.023 2693122                  0.15194673024651262%
> 1738795901.180 3064160                  0.17622309771495132%
> 1717902110.376 3266140                  0.19012375503078779%
> 1858583685.387 2246750                  0.12088505982619635%
> 1641867344.242 2901893                  0.17674345069209688%
> 2037933222.333 3290649                  0.16146991294606342%
> 1713754879.753 2136278                  0.1246548164640615%
> 2265307843.106 2195967                  0.096939010151886218%
> 2066343912.900 2117667                  0.10248376307446183%
> 2033259596.459 2652081                  0.13043494321230309%
> 2029016228.457 2447926                  0.12064595470789147%
> 1649943910.454 2243384                  0.13596728869302646%
> 2064977308.315 2517573                  0.12191770775700743%
> 1962826877.902 2403546                  0.12245328546596376%
> 1991037543.226 3214700                  0.16145853255942866%
> 1469646744.644 1960481                  0.133398111290676%
> 2195532272.023 1693642                  0.077140382839349028%
> 1902754342.931 2542547                  0.13362455376575119%
> 1922726134.774 2726990                  0.14182935108023245%
> 1629134454.130 2814497                  0.17276026499009955%
> 1442857977.322 2886065                  0.2000241912483062%
> 1782430811.270 2884583                  0.16183421997427802%
> 1574026949.337 2873023                  0.18252692568003068%
> 2064769936.712 2717133                  0.13159495165484839%
> 1621381781.294 2143854                  0.13222388611576869%
> 1783777994.311 2515565                  0.14102455619605617%
> 1561198792.593 2402227                  0.15387066729728466%
> 1658682002.165 2924639                  0.17632306832669586%
> 1675144757.604 2216513                  0.13231769910860314%
> 1712859849.041 2274046                  0.13276310967725691%
> 1448178253.374 2507727                  0.17316424923226392%
> 1621787692.726 2822681                  0.17404750403892047%
> 1924315582.555 2275728                  0.11826168330344308%
> 2044659318.750 2606738                  0.12749008972280165%
> 1559320878.499 2049522                  0.13143683434629996%
> 1641314354.590 2786145                  0.16975084585158451%
> 1769603624.855 2008994                  0.11352790940200608%
> 1707747444.550 2813289                  0.16473682973311757%
> 1366611645.499 2930513                  0.21443641356721807%
> 1435165128.768 3161655                  0.22029903992400404%
> 1440555681.860 2831258                  0.19653929630435177%
> 1627011278.312 1956565                  0.12025515901954331%
> 1668464384.654 3863649                  0.23156916237089648%
> 1589045506.115 2743558                  0.17265446391825656%
> 1766150748.153 3207045                  0.18158387687765917%
> 1668645748.936 2997211                  0.17961937109246526%
> 1808149919.182 2420007                  0.13383884678626659%
> 1831096162.185 2226869                  0.12161398434382248%
> 1473136553.216 2774496                  0.18833936296964501%
> 1873520917.567 2427200                  0.12955286366121929%
> 1971938578.205 2207736                  0.11195764535473715%
> 1670588149.850 2437810                  0.14592525394238479%
> 1912140400.200 2407058                  0.12588291109524352%
> 1925513187.452 2415130                  0.12542786077699639%
> 1658671061.956 3111210                  0.18757245311382492%
> 1917423750.996 2888924                  0.15066695603928745%
> 1553962205.772 3386946                  0.21795549386076499%
> 1834101490.364 2367281                  0.12907033838842713%
> 1626846627.678 3362957                  0.20671629044711806%
> 1519340210.104 2320735                  0.15274623712098975%
> 1725615314.262 3813557                  0.22099693764197714%
> 1725823077.103 3137784                  0.1818137699993643%
> 1633124107.941 2007344                  0.12291435722731486%
> 
> Thanks,
> Fengguang

--
To unsubscribe, send a message with 'unsubscribe linux-mm' in
the body to majordomo@kvack.org.  For more info on Linux MM,
see: http://www.linux-mm.org/ .
Don't email: <a href=mailto:"dont@kvack.org"> email@kvack.org </a>

^ permalink raw reply	[flat|nested] 54+ messages in thread

* Re: [RFC][PATCH] HWPOISON: remove the unsafe __set_page_locked()
  2009-09-26 11:09   ` Hugh Dickins
@ 2009-09-26 11:48     ` Wu Fengguang
  -1 siblings, 0 replies; 54+ messages in thread
From: Wu Fengguang @ 2009-09-26 11:48 UTC (permalink / raw)
  To: Hugh Dickins; +Cc: Andrew Morton, Nick Piggin, Andi Kleen, linux-mm, LKML

On Sat, Sep 26, 2009 at 07:09:21PM +0800, Hugh Dickins wrote:
> On Sat, 26 Sep 2009, Wu Fengguang wrote:
> 
> > The swap cache and page cache code assume that they 'own' the newly
> > allocated page and therefore can disregard the locking rules. However
> > now hwpoison can hit any time on any page.
> > 
> > So use the safer lock_page()/trylock_page(). The main intention is not
> > to close such a small time window of memory corruption. But to avoid
> > kernel oops that may result from such races, and also avoid raising
> > false alerts in hwpoison stress tests.
> > 
> > This in theory will slightly increase page cache/swap cache overheads,
> > however it seems to be too small to be measurable in benchmark.
> 
> No.
> 
> But I'd most certainly defer to Nick if he disagrees with me.
> 
> I don't think anyone would want to quarrel very long over the swap
> and migration mods alone, but add_to_page_cache() is of a higher
> order of magnitude.

Yup, add_to_page_cache() is a hot path.

> I can't see any reason to surrender add_to_page_cache() optimizations
> to the remote possibility of hwpoison (infinitely remote for most of
> us); though I wouldn't myself want to run the benchmark to defend them.
> 
> You'd later be sending a patch to replace __SetPageUptodate()s by
> SetPageUptodate()s etc, wouldn't you?  Because any non-atomic op
> on page->flags might wipe your locked bit (or your hwpoison bit).

That's a sad fact; there may be more holes than we are willing or able to handle...

> You could make #ifdef CONFIG_HWPOISON_INJECT select the slower
> versions of these things; or #ifdef CONFIG_MEMORY_FAILURE? that
> would pose distros with a harder choice.  But I'd much prefer
> not to go that way.
> 
> Please accept that there will be quite a number of places where
> the code "knows" it's the only user of the page, and hwpoison
> handling and testing should work around those places (can shift
> things around slightly to suit itself better, but not add cost).
> 
> Look into why you think you want the page lock: I can see it's
> going to be useful if you're taking a page out of a file (but then
> why bother if page->mapping not set?), or if you're using rmap-style
> lookup (but then why bother if !page_mapped?).
> 
> I suspect if memory_failure() did something like:
> 	if (page->mapping)
> 		lock_page_nosync(p);
> then you'd be okay, perhaps with a few additional _inexpensive_
> tweaks here and there.  With the "necessary" memory barriers?
> no, we probably wouldn't want to be adding any in hot paths.
> 
> But I definitely say "something like": remember that page_mapping()
> does that weird thing with PageSwapCache (a mistake from day one in
> my opinion), which might or might not be what you want.  There are
> probably various reasons why it's not as simple as I suggest above.

Good point! We could do it if it turns out to be simple.

However, we may well end up accepting the fact that "we just cannot do
hwpoison 100% correctly", and settling for simple, 99%-correct code.

> It seems to me that the Intel hardware guys have done half a job
> here: the sooner they get to remapping the bad pages, the better.

When we can afford to set aside half of memory :)

Thanks,
Fengguang

> > CC: Hugh Dickins <hugh.dickins@tiscali.co.uk>
> > CC: Andi Kleen <andi@firstfloor.org> 
> > Signed-off-by: Wu Fengguang <fengguang.wu@intel.com>
> > ---
> >  include/linux/pagemap.h |   13 ++++---------
> >  mm/migrate.c            |    2 +-
> >  mm/swap_state.c         |    4 ++--
> >  3 files changed, 7 insertions(+), 12 deletions(-)
> > 
> > --- sound-2.6.orig/mm/swap_state.c	2009-09-14 10:50:19.000000000 +0800
> > +++ sound-2.6/mm/swap_state.c	2009-09-25 18:42:23.000000000 +0800
> > @@ -306,7 +306,7 @@ struct page *read_swap_cache_async(swp_e
> >  		 * re-using the just freed swap entry for an existing page.
> >  		 * May fail (-ENOMEM) if radix-tree node allocation failed.
> >  		 */
> > -		__set_page_locked(new_page);
> > +		lock_page(new_page);
> >  		SetPageSwapBacked(new_page);
> >  		err = add_to_swap_cache(new_page, entry, gfp_mask & GFP_KERNEL);
> >  		if (likely(!err)) {
> > @@ -318,7 +318,7 @@ struct page *read_swap_cache_async(swp_e
> >  			return new_page;
> >  		}
> >  		ClearPageSwapBacked(new_page);
> > -		__clear_page_locked(new_page);
> > +		unlock_page(new_page);
> >  		swapcache_free(entry, NULL);
> >  	} while (err != -ENOMEM);
> >  
> > --- sound-2.6.orig/include/linux/pagemap.h	2009-09-14 10:50:19.000000000 +0800
> > +++ sound-2.6/include/linux/pagemap.h	2009-09-25 18:42:19.000000000 +0800
> > @@ -292,11 +292,6 @@ extern int __lock_page_killable(struct p
> >  extern void __lock_page_nosync(struct page *page);
> >  extern void unlock_page(struct page *page);
> >  
> > -static inline void __set_page_locked(struct page *page)
> > -{
> > -	__set_bit(PG_locked, &page->flags);
> > -}
> > -
> >  static inline void __clear_page_locked(struct page *page)
> >  {
> >  	__clear_bit(PG_locked, &page->flags);
> > @@ -435,18 +430,18 @@ extern void remove_from_page_cache(struc
> >  extern void __remove_from_page_cache(struct page *page);
> >  
> >  /*
> > - * Like add_to_page_cache_locked, but used to add newly allocated pages:
> > - * the page is new, so we can just run __set_page_locked() against it.
> > + * Like add_to_page_cache_locked, but used to add newly allocated pages.
> >   */
> >  static inline int add_to_page_cache(struct page *page,
> >  		struct address_space *mapping, pgoff_t offset, gfp_t gfp_mask)
> >  {
> >  	int error;
> >  
> > -	__set_page_locked(page);
> > +	if (!trylock_page(page))
> > +		return -EIO;	/* hwpoisoned */
> >  	error = add_to_page_cache_locked(page, mapping, offset, gfp_mask);
> >  	if (unlikely(error))
> > -		__clear_page_locked(page);
> > +		unlock_page(page);
> >  	return error;
> >  }
> >  
> > --- sound-2.6.orig/mm/migrate.c	2009-09-14 10:50:19.000000000 +0800
> > +++ sound-2.6/mm/migrate.c	2009-09-25 18:42:19.000000000 +0800
> > @@ -551,7 +551,7 @@ static int move_to_new_page(struct page 
> >  	 * holding a reference to the new page at this point.
> >  	 */
> >  	if (!trylock_page(newpage))
> > -		BUG();
> > +		return -EAGAIN;		/* got by hwpoison */
> >  
> >  	/* Prepare mapping for the new page.*/
> >  	newpage->index = page->index;

^ permalink raw reply	[flat|nested] 54+ messages in thread

* Re: [RFC][PATCH] HWPOISON: remove the unsafe __set_page_locked()
@ 2009-09-26 11:48     ` Wu Fengguang
  0 siblings, 0 replies; 54+ messages in thread
From: Wu Fengguang @ 2009-09-26 11:48 UTC (permalink / raw)
  To: Hugh Dickins; +Cc: Andrew Morton, Nick Piggin, Andi Kleen, linux-mm, LKML

On Sat, Sep 26, 2009 at 07:09:21PM +0800, Hugh Dickins wrote:
> On Sat, 26 Sep 2009, Wu Fengguang wrote:
> 
> > The swap cache and page cache code assume that they 'own' the newly
> > allocated page and therefore can disregard the locking rules. However
> > now hwpoison can hit any time on any page.
> > 
> > So use the safer lock_page()/trylock_page(). The main intention is not
> > to close such a small time window of memory corruption. But to avoid
> > kernel oops that may result from such races, and also avoid raising
> > false alerts in hwpoison stress tests.
> > 
> > This in theory will slightly increase page cache/swap cache overheads,
> > however it seems to be too small to be measurable in benchmark.
> 
> No.
> 
> But I'd most certainly defer to Nick if he disagrees with me.
> 
> I don't think anyone would want to quarrel very long over the swap
> and migration mods alone, but add_to_page_cache() is of a higher
> order of magnitude.

Yup, add_to_page_cache() is hot path.

> I can't see any reason to surrender add_to_page_cache() optimizations
> to the remote possibility of hwpoison (infinitely remote for most of
> us); though I wouldn't myself want to run the benchmark to defend them.
> 
> You'd later be sending a patch to replace __SetPageUptodate()s by
> SetPageUptodate()s etc, wouldn't you?  Because any non-atomic op
> on page->flags might wipe your locked bit (or your hwpoison bit).

That's a sad fact; there may be more holes than we are willing or able to handle...

> You could make #ifdef CONFIG_HWPOISON_INJECT select the slower
> versions of these things; or #ifdef CONFIG_MEMORY_FAILURE? that
> would pose distros with a harder choice.  But I'd much prefer
> not to go that way.
> 
> Please accept that there will be quite a number of places where
> the code "knows" it's the only user of the page, and hwpoison
> handling and testing should work around those places (can shift
> things around slightly to suit itself better, but not add cost).
> 
> Look into why you think you want the page lock: I can see it's
> going to be useful if you're taking a page out of a file (but then
> why bother if page->mapping not set?), or if you're using rmap-style
> lookup (but then why bother if !page_mapped?).
> 
> I suspect if memory_failure() did something like:
> 	if (page->mapping)
> 		lock_page_nosync(p);
> then you'd be okay, perhaps with a few additional _inexpensive_
> tweaks here and there.  With the "necessary" memory barriers?
> no, we probably wouldn't want to be adding any in hot paths.
> 
> But I definitely say "something like": remember that page_mapping()
> does that weird thing with PageSwapCache (a mistake from day one in
> my opinion), which might or might not be what you want.  There are
> probably various reasons why it's not as simple as I suggest above.

Good point! Could do it if it turns out to be simple.

However we may well end up accepting the fact that "we just cannot do
hwpoison 100% correct", and settle for simple and 99% correct code.

> It seems to me that the Intel hardware guys have done half a job
> here: the sooner they get to remapping the bad pages, the better.

When we can afford to set aside half the memory :)

Thanks,
Fengguang

> > CC: Hugh Dickins <hugh.dickins@tiscali.co.uk>
> > CC: Andi Kleen <andi@firstfloor.org> 
> > Signed-off-by: Wu Fengguang <fengguang.wu@intel.com>
> > ---
> >  include/linux/pagemap.h |   13 ++++---------
> >  mm/migrate.c            |    2 +-
> >  mm/swap_state.c         |    4 ++--
> >  3 files changed, 7 insertions(+), 12 deletions(-)
> > 
> > --- sound-2.6.orig/mm/swap_state.c	2009-09-14 10:50:19.000000000 +0800
> > +++ sound-2.6/mm/swap_state.c	2009-09-25 18:42:23.000000000 +0800
> > @@ -306,7 +306,7 @@ struct page *read_swap_cache_async(swp_e
> >  		 * re-using the just freed swap entry for an existing page.
> >  		 * May fail (-ENOMEM) if radix-tree node allocation failed.
> >  		 */
> > -		__set_page_locked(new_page);
> > +		lock_page(new_page);
> >  		SetPageSwapBacked(new_page);
> >  		err = add_to_swap_cache(new_page, entry, gfp_mask & GFP_KERNEL);
> >  		if (likely(!err)) {
> > @@ -318,7 +318,7 @@ struct page *read_swap_cache_async(swp_e
> >  			return new_page;
> >  		}
> >  		ClearPageSwapBacked(new_page);
> > -		__clear_page_locked(new_page);
> > +		unlock_page(new_page);
> >  		swapcache_free(entry, NULL);
> >  	} while (err != -ENOMEM);
> >  
> > --- sound-2.6.orig/include/linux/pagemap.h	2009-09-14 10:50:19.000000000 +0800
> > +++ sound-2.6/include/linux/pagemap.h	2009-09-25 18:42:19.000000000 +0800
> > @@ -292,11 +292,6 @@ extern int __lock_page_killable(struct p
> >  extern void __lock_page_nosync(struct page *page);
> >  extern void unlock_page(struct page *page);
> >  
> > -static inline void __set_page_locked(struct page *page)
> > -{
> > -	__set_bit(PG_locked, &page->flags);
> > -}
> > -
> >  static inline void __clear_page_locked(struct page *page)
> >  {
> >  	__clear_bit(PG_locked, &page->flags);
> > @@ -435,18 +430,18 @@ extern void remove_from_page_cache(struc
> >  extern void __remove_from_page_cache(struct page *page);
> >  
> >  /*
> > - * Like add_to_page_cache_locked, but used to add newly allocated pages:
> > - * the page is new, so we can just run __set_page_locked() against it.
> > + * Like add_to_page_cache_locked, but used to add newly allocated pages.
> >   */
> >  static inline int add_to_page_cache(struct page *page,
> >  		struct address_space *mapping, pgoff_t offset, gfp_t gfp_mask)
> >  {
> >  	int error;
> >  
> > -	__set_page_locked(page);
> > +	if (!trylock_page(page))
> > +		return -EIO;	/* hwpoisoned */
> >  	error = add_to_page_cache_locked(page, mapping, offset, gfp_mask);
> >  	if (unlikely(error))
> > -		__clear_page_locked(page);
> > +		unlock_page(page);
> >  	return error;
> >  }
> >  
> > --- sound-2.6.orig/mm/migrate.c	2009-09-14 10:50:19.000000000 +0800
> > +++ sound-2.6/mm/migrate.c	2009-09-25 18:42:19.000000000 +0800
> > @@ -551,7 +551,7 @@ static int move_to_new_page(struct page 
> >  	 * holding a reference to the new page at this point.
> >  	 */
> >  	if (!trylock_page(newpage))
> > -		BUG();
> > +		return -EAGAIN;		/* got by hwpoison */
> >  
> >  	/* Prepare mapping for the new page.*/
> >  	newpage->index = page->index;

--
To unsubscribe, send a message with 'unsubscribe linux-mm' in
the body to majordomo@kvack.org.  For more info on Linux MM,
see: http://www.linux-mm.org/ .
Don't email: <dont@kvack.org>


* Re: [RFC][PATCH] HWPOISON: remove the unsafe __set_page_locked()
  2009-09-26 11:48     ` Wu Fengguang
@ 2009-09-26 11:58       ` Hugh Dickins
  0 siblings, 0 replies; 54+ messages in thread
From: Hugh Dickins @ 2009-09-26 11:58 UTC (permalink / raw)
  To: Wu Fengguang; +Cc: Andrew Morton, Nick Piggin, Andi Kleen, linux-mm, LKML

On Sat, 26 Sep 2009, Wu Fengguang wrote:
> 
> However we may well end up accepting the fact that "we just cannot do
> hwpoison 100% correct", and settle for simple and 99% correct code.

I thought we already accepted that: you cannot do hwpoison 100% correct
(or not without a radical rewrite of the OS); you're just looking to
get it right almost all the time on those systems which have almost
all their pages in pagecache or userspace or free.

Hugh

* Re: [RFC][PATCH] HWPOISON: remove the unsafe __set_page_locked()
  2009-09-26 11:48     ` Wu Fengguang
@ 2009-09-26 15:05       ` Andi Kleen
  0 siblings, 0 replies; 54+ messages in thread
From: Andi Kleen @ 2009-09-26 15:05 UTC (permalink / raw)
  To: Wu Fengguang
  Cc: Hugh Dickins, Andrew Morton, Nick Piggin, Andi Kleen, linux-mm, LKML

> However we may well end up accepting the fact that "we just cannot do
> hwpoison 100% correct", and settle for simple and 99% correct code.

I would prefer to avoid any oopses, but if they are unlikely enough
and too hard to fix that's bearable. The race window here is certainly rather 
small. 

On the other hand if you cannot detect a difference in benchmarks I see
no reason not to add the additional steps, as long as the code isn't
complicated or ugly. These changes are neither.

-Andi
-- 
ak@linux.intel.com -- Speaking for myself only.


* Re: [RFC][PATCH] HWPOISON: remove the unsafe __set_page_locked()
  2009-09-26 11:09   ` Hugh Dickins
@ 2009-09-26 19:06     ` Nick Piggin
  0 siblings, 0 replies; 54+ messages in thread
From: Nick Piggin @ 2009-09-26 19:06 UTC (permalink / raw)
  To: Hugh Dickins; +Cc: Wu Fengguang, Andrew Morton, Andi Kleen, linux-mm, LKML

On Sat, Sep 26, 2009 at 12:09:21PM +0100, Hugh Dickins wrote:
> On Sat, 26 Sep 2009, Wu Fengguang wrote:
> 
> > The swap cache and page cache code assume that they 'own' the newly
> > allocated page and therefore can disregard the locking rules. However
> > now hwpoison can hit any time on any page.
> > 
> > So use the safer lock_page()/trylock_page(). The main intention is not
> > to close such a small time window of memory corruption. But to avoid
> > kernel oops that may result from such races, and also avoid raising
> > false alerts in hwpoison stress tests.
> > 
> > This in theory will slightly increase page cache/swap cache overheads,
> > however it seems to be too small to be measurable in benchmark.
> 
> No.
> 
> But I'd most certainly defer to Nick if he disagrees with me.
> 
> I don't think anyone would want to quarrel very long over the swap
> and migration mods alone, but add_to_page_cache() is of a higher
> order of magnitude.
> 
> I can't see any reason to surrender add_to_page_cache() optimizations
> to the remote possibility of hwpoison (infinitely remote for most of
> us); though I wouldn't myself want to run the benchmark to defend them.

Thanks Hugh, I definitely agree with you. And I agree the page lock
is a strange thing: it's only really well defined for some types of
pages (eg. pagecache pages), so it's not clear what it even really
means to take the page lock on non-pagecache or soon to be pagecache
pages.

We don't need to run benchmarks: it's unquestionably slower if we
have to go back to full atomic ops here. What we need to focus on
throughout the kernel is reducing atomic ops and unpredictable
branches rather than adding them, because our fastpaths are getting
monotonically slower *anyway*.

I would much prefer that hwpoison code first ensures it has a valid
pagecache page and is pinning it before it ever tries to do a
lock_page.

This is a bit tricky to do right now; you have a chicken and egg
problem between locking the page and pinning the inode mapping.
Possibly you could get the page ref, then check mapping != NULL,
and in that case lock the page. You'd just have to check ordering
on the other side... and if you do something crazy like that,
then please add comments in the core code saying that hwpoison
has added a particular dependency.


> I suspect if memory_failure() did something like:
> 	if (page->mapping)
> 		lock_page_nosync(p);
> then you'd be okay, perhaps with a few additional _inexpensive_
> tweaks here and there.  With the "necessary" memory barriers?
> no, we probably wouldn't want to be adding any in hot paths.

Ah, you came to the same idea. Yes memory barriers in the fastpath
are no good, but you can effectively have a memory barrier on
all other CPUs by doing a synchronize_rcu() with no cost to the
fastpaths.




* Re: [RFC][PATCH] HWPOISON: remove the unsafe __set_page_locked()
  2009-09-26 15:05       ` Andi Kleen
@ 2009-09-26 19:12         ` Nick Piggin
  0 siblings, 0 replies; 54+ messages in thread
From: Nick Piggin @ 2009-09-26 19:12 UTC (permalink / raw)
  To: Andi Kleen; +Cc: Wu Fengguang, Hugh Dickins, Andrew Morton, linux-mm, LKML

On Sat, Sep 26, 2009 at 05:05:55PM +0200, Andi Kleen wrote:
> > However we may well end up accepting the fact that "we just cannot do
> > hwpoison 100% correct", and settle for simple and 99% correct code.
> 
> I would prefer to avoid any oopses, but if they are unlikely enough
> and too hard to fix that's bearable. The race window here is certainly rather 
> small. 

Well, several places non-atomically modify page flags, including
within preempt-enabled regions... It's nasty to introduce these
oopses in the hwpoison code! I'm ashamed I didn't pick up on this
problem seeing as I introduced several of them.

 
> On the other hand if you cannot detect a difference in benchmarks I see
> no reason not to add the additional steps, as long as the code isn't
> complicated or ugly. These changes are neither.

The patch to add atomics back into the fastpaths? I don't think that's
acceptable at all. A config option doesn't go far enough either because
distros will have to turn it on.



* Re: [RFC][PATCH] HWPOISON: remove the unsafe __set_page_locked()
  2009-09-26 11:48     ` Wu Fengguang
@ 2009-09-26 19:14       ` Nick Piggin
  0 siblings, 0 replies; 54+ messages in thread
From: Nick Piggin @ 2009-09-26 19:14 UTC (permalink / raw)
  To: Wu Fengguang; +Cc: Hugh Dickins, Andrew Morton, Andi Kleen, linux-mm, LKML

On Sat, Sep 26, 2009 at 07:48:06PM +0800, Wu Fengguang wrote:
> On Sat, Sep 26, 2009 at 07:09:21PM +0800, Hugh Dickins wrote:
> > It seems to me that the Intel hardware guys have done half a job
> > here: the sooner they get to remapping the bad pages, the better.
> 
> When we can afford to set aside half the memory :)

Maybe even adding to the ECC error codes so uncorrected errors
are reduced to a similar frequency to other sources of errors in
the hardware. That seems like the sanest thing to me, but it's
not mine to wonder...



* Re: [RFC][PATCH] HWPOISON: remove the unsafe __set_page_locked()
  2009-09-26 19:06     ` Nick Piggin
@ 2009-09-26 21:32       ` Andi Kleen
  0 siblings, 0 replies; 54+ messages in thread
From: Andi Kleen @ 2009-09-26 21:32 UTC (permalink / raw)
  To: Nick Piggin
  Cc: Hugh Dickins, Wu Fengguang, Andrew Morton, Andi Kleen, linux-mm, LKML

> This is a bit tricky to do right now; you have a chicken and egg
> problem between locking the page and pinning the inode mapping.

One possibly simple solution would be to just allocate the page
locked (GFP_LOCKED). When the allocator clears the flags it already
modifies the state, so it might as well set the lock bit too. No
atomics needed. And then clearing it later is also atomic-free.

Would that satisfy the concerns?

Again another way is to just ignore it.

-Andi


* Re: [RFC][PATCH] HWPOISON: remove the unsafe __set_page_locked()
  2009-09-26 11:31       ` Wu Fengguang
@ 2009-09-27 10:47         ` Wu Fengguang
  0 siblings, 0 replies; 54+ messages in thread
From: Wu Fengguang @ 2009-09-27 10:47 UTC (permalink / raw)
  To: Andi Kleen; +Cc: Andrew Morton, Hugh Dickins, Nick Piggin, linux-mm, LKML

On Sat, Sep 26, 2009 at 07:31:56PM +0800, Wu Fengguang wrote:
> On Sat, Sep 26, 2009 at 06:52:59PM +0800, Wu Fengguang wrote:
> > On Sat, Sep 26, 2009 at 11:49:36AM +0800, Andi Kleen wrote:
> > > On Sat, Sep 26, 2009 at 11:15:37AM +0800, Wu Fengguang wrote:
> > > > The swap cache and page cache code assume that they 'own' the newly
> > > > allocated page and therefore can disregard the locking rules. However
> > > > now hwpoison can hit any time on any page.
> > > > 
> > > > So use the safer lock_page()/trylock_page(). The main intention is not
> > > > to close such a small time window of memory corruption. But to avoid
> > > > kernel oops that may result from such races, and also avoid raising
> > > > false alerts in hwpoison stress tests.
> > > > 
> > > > This in theory will slightly increase page cache/swap cache overheads,
> > > > however it seems to be too small to be measurable in benchmark.
> > > 
> > > Thanks. Can you please describe what benchmarks you used?
> > 
> > Good question. The test case is to copy 1GB sparse file over NFS.
> > 
> > I ran it 100 times (2 hours total wall time) and the overheads are ~0.008%.
> 
> And standard deviation is 0.04%, much larger than the difference 0.008% ..

Sorry, that's not correct. I improved the accounting by treating
function0+function1 from the two CPUs as a single entity:

                 total time      add_to_page_cache_lru   percent  stddev
         before  3880166848.722  9683329.610             0.250%   0.014%
         after   3828516894.376  9778088.870             0.256%   0.012%
         delta                                           0.006%

Thanks,
Fengguang


> > The average time reported by the function profiler:
> > 
> >                 total time      add_to_page_cache_lru   percent
> >         before  1940083424.361  4841664.805             0.242%
> >         after   1914258447.188  4889044.435             0.250%
> > 
> > The script used is:
> > 
> >         for i in `seq 1 100`
> >         do      
> >                 # copy 1GB sparse file over NFS
> >                 echo 1 > /debug/tracing/function_profile_enabled
> >                 cp /tmp/sparse/1G /dev/null
> >                 echo 0 > /debug/tracing/function_profile_enabled
> > 
> >                 fadvise /tmp/sparse/1G 0 0 dontneed
> >                 cat /debug/tracing/trace_stat/function0 > function0.$i
> >                 cat /debug/tracing/trace_stat/function1 > function1.$i
> >         done
> > 
> > 
> > File function0.1 after patch:
> > 
> >   Function                               Hit    Time            Avg
> >   --------                               ---    ----            ---
> >   schedule                             32492    474358128 us     14599.22 us 
> >   schedule_timeout                        26    99994288 us     3845934 us 
> >   poll_schedule_timeout                    8    52914376 us     6614297 us 
> >   sys_poll                                 6    52914373 us     8819062 us 
> >   schedule_hrtimeout_range                 8    52914373 us     6614296 us 
> >   do_sys_poll                              6    52914355 us     8819059 us 
> >   sys_read                             86997    40703194 us     467.868 us  
> >   vfs_read                             87002    40550651 us     466.088 us  
> >   do_sync_read                         86987    40011054 us     459.965 us  
> >   nfs_file_read                        86985    39945631 us     459.224 us  
> >   generic_file_aio_read                86985    39785858 us     457.387 us  
> >   page_cache_async_readahead            2886    34379069 us     11912.35 us 
> >   ondemand_readahead                    2889    34371850 us     11897.49 us 
> >   ra_submit                             2891    34368378 us     11888.05 us 
> >   __do_page_cache_readahead             2891    34360744 us     11885.41 us 
> >   nfs_readpages                         2876    33198107 us     11543.15 us 
> >   read_cache_pages                      2876    31617284 us     10993.49 us 
> >   __rpc_execute                        84468    25404441 us     300.758 us  
> >   readpage_async_filler               174071    24947004 us     143.315 us  
> >   do_softirq                           44822    24829675 us     553.961 us  
> >   __do_softirq                         44822    24803408 us     553.375 us  
> >   net_rx_action                        38424    24357207 us     633.906 us  
> >   e1000_clean                          39042    24253141 us     621.206 us  
> >   nfs_pageio_doio                      43534    22993433 us     528.171 us  
> >   nfs_pagein_one                       43520    22939752 us     527.108 us  
> >   nfs_read_rpcsetup                    43513    22533176 us     517.849 us  
> >   e1000_clean_rx_irq                   39042    22047306 us     564.707 us  
> >   rpc_run_task                         43528    22042766 us     506.404 us  
> >   call_transmit                        45873    21842655 us     476.154 us  
> >   nfs_pageio_add_request              174091    21583666 us     123.979 us  
> >   rpc_execute                          43528    21254402 us     488.292 us  
> >   cpuidle_idle_call                    18073    19445493 us     1075.941 us 
> >   acpi_idle_enter_c1                   17999    19396945 us     1077.667 us 
> >   e1000_receive_skb                   436279    18818686 us     43.134 us   
> >   tcp_v4_do_rcv                       513521    18681598 us     36.379 us   
> >   xprt_transmit                        43346    18532249 us     427.542 us  
> >   tcp_rcv_established                 513521    18384627 us     35.801 us   
> >   napi_gro_receive                    436279    18114586 us     41.520 us   
> >   napi_skb_finish                     436279    17130261 us     39.264 us   
> >   netif_receive_skb                   436279    16886046 us     38.704 us   
> >   ip_rcv                              436279    16500338 us     37.820 us   
> >   xs_tcp_send_request                  43346    16418581 us     378.779 us  
> >   xs_sendpages                         43346    16367619 us     377.603 us  
> >   xs_send_kvec                         43347    16310325 us     376.273 us  
> >   kernel_sendmsg                       43347    16254698 us     374.990 us  
> >   sock_sendmsg                         43347    16177184 us     373.201 us  
> >   tcp_sendmsg                          43347    16127746 us     372.061 us  
> >   do_IRQ                               45271    15968995 us     352.742 us  
> >   ip_local_deliver                    436279    15816133 us     36.252 us   
> >   irq_exit                             53709    15421475 us     287.130 us  
> >   acpi_processor_ffh_cstate_ente       17999    15394735 us     855.310 us  
> >   mwait_idle_with_hints                17999    15386089 us     854.830 us  
> >   tcp_v4_rcv                          436217    15173852 us     34.785 us   
> >   release_sock                         43364    13536313 us     312.155 us  
> >   xs_tcp_data_ready                   512045    10092920 us     19.711 us   
> >   _spin_unlock_bh                     518900    9851221 us     18.984 us   
> >   local_bh_enable_ip                  511352    9403704 us     18.389 us   
> >   tcp_read_sock                      1024090    8832051 us     8.624 us    
> >   __tcp_ack_snd_check                 512045    6644838 us     12.977 us   
> >   add_to_page_cache_lru               174072    6287717 us     36.121 us   
> >   tcp_transmit_skb                    296572    5962431 us     20.104 us   
> >   tcp_send_ack                        253210    5775135 us     22.807 us   
> >   __kfree_skb                         802932    5216451 us     6.496 us    
> >   add_to_page_cache_locked            174072    4716425 us     27.094 us   
> > [...]
> > 
> > Detailed time before patch (100 runs * 2 CPU):
> > 
> > Total          add_to_page_cache_lru    percent
> > 1927694859.404 6703561                  0.34775011030908648%
> > 1994166534.446 7311783                  0.36665859514242066%
> > 2132812997.930 6249683                  0.29302536162643533%
> > 1972582396.980 7908593                  0.40092586307715006%
> > 2415371209.622 7670327                  0.31756307144194151%
> > 2163473353.153 7406991                  0.34236571433640306%
> > 1833544521.635 7952113                  0.43370165851816772%
> > 2087748556.683 7995535                  0.38297404035585819%
> > 2185868844.795 7117368                  0.32560819085499598%
> > 1980423680.193 6514699                  0.32895481230385598%
> > 1958021484.352 6745228                  0.34449203207963308%
> > 2008592063.702 7398967                  0.36836583862444905%
> > 2448117568.931 6224454                  0.25425470079519025%
> > 2213021186.789 6938228                  0.31351837214297412%
> > 1828902235.789 7479842                  0.40897987074596959%
> > 1866644864.837 7720792                  0.41361868802366963%
> > 2054526163.572 8475856                  0.41254553727677401%
> > 1902862587.026 8334661                  0.43800645705196778%
> > 2148449373.925 8672498                  0.40366313050031155%
> > 2104880443.231 7741619                  0.36779376353160392%
> > 2270646549.540 6486688                  0.2856758134071597%
> > 2250814274.826 7512331                  0.33376058984612328%
> > 2214008898.106 6391259                  0.28867359139646992%
> > 2446226035.589 7738434                  0.31634173978271579%
> > 1920377000.358 6464992                  0.33665223020244384%
> > 2030446733.714 6022617                  0.2966153654759367%
> > 2157128575.464 7375670                  0.34192074055731642%
> > 2185670134.683 7943199                  0.36342167438509865%
> > 2224381465.969 8492920                  0.38181041021667828%
> > 2479045291.455 8161934                  0.32923698603382928%
> > 1971641972.432 8114434                  0.41155717485517562%
> > 2208169216.013 8138519                  0.36856410011433138%
> > 1971312194.985 8122576                  0.41203904793283169%
> > 2279129743.345 7365286                  0.3231622079219687%
> > 2360162707.877 7412468                  0.3140659741491984%
> > 1885022652.013 8252339                  0.43778460652382062%
> > 2065345722.062 7985361                  0.3866355600759942%
> > 1975982196.916 7925371                  0.4010851419799969%
> > 2169738273.075 7203856                  0.33201497569522709%
> > 2034570957.887 7396992                  0.36356520136717829%
> > 1789538123.758 7380292                  0.41241323121418116%
> > 1912812047.546 8697404                  0.45469203370807615%
> > 2221047646.893 7451321                  0.33548676951723994%
> > 2163981380.559 7982481                  0.36887937538251664%
> > 1969189605.396 7353112                  0.37340802428831149%
> > 2278909389.270 7088803                  0.31106120468750831%
> > 2104996788.904 7342692                  0.34882200479854836%
> > 2374705789.293 7327329                  0.30855733931492624%
> > 2137944282.456 8112283                  0.3794431438915179%
> > 2268117031.230 6889022                  0.30373309247909852%
> > 1899091967.945 8173918                  0.43041190937398177%
> > 2098779842.987 7858621                  0.37443760603377763%
> > 2124662819.215 8156209                  0.38388251190904155%
> > 2317714239.959 7048899                  0.30413149638864423%
> > 2218944220.453 6659501                  0.30012025262358577%
> > 1945126557.801 7704643                  0.3960998305791596%
> > 2092060468.851 7616315                  0.36405807161888715%
> > 1933379809.531 7300893                  0.37762331870896354%
> > 1903832170.769 7508856                  0.39440745435912067%
> > 2435478683.656 6790800                  0.27882814354203433%
> > 2126189870.763 7104836                  0.33415811530747125%
> > 2161363441.772 7011298                  0.32439236569356278%
> > 2064449342.765 6822211                  0.33046153561039859%
> > 2116816508.947 8156178                  0.38530396779913867%
> > 1982774633.856 7131335                  0.35966442571092105%
> > 2323253766.182 7275361                  0.3131539526978242%
> > 1905927501.820 7477458                  0.39232646534874271%
> > 2252162684.001 8064426                  0.35807475442552977%
> > 2027146552.281 7875866                  0.38851981328818402%
> > 1845397260.011 7598767                  0.41176862915439172%
> > 2177206310.501 7366288                  0.33833670077434846%
> > 2041629016.560 6600368                  0.32328929234759562%
> > 2189308825.112 8348863                  0.38134697600613249%
> > 1930990433.231 7344038                  0.38032492930126544%
> > 2102320863.469 7299411                  0.34720727586536809%
> > 1895798071.683 8051595                  0.42470741585110772%
> > 2019648719.623 7992832                  0.39575357448754711%
> > 1961118039.267 7837733                  0.39965636147681766%
> > 2142238585.905 8084509                  0.37738602288244455%
> > 2263212432.641 7757018                  0.34274369865263332%
> > 2200111036.549 7069702                  0.32133387281622078%
> > 1941699880.568 7786856                  0.4010329339734075%
> > 1939777157.562 7729686                  0.39848319534368681%
> > 1983015138.927 8136252                  0.41029701893261827%
> > 2040218650.019 7968023                  0.39054750332400873%
> > 1850721920.862 7835912                  0.42339758943095634%
> > 2167986192.252 8656589                  0.39929170356052646%
> > 2128868496.040 8248906                  0.38747841942065214%
> > 1984840247.084 7754853                  0.39070413910605312%
> > 2130780080.168 7053725                  0.33103955990821227%
> > 2062729676.752 7968701                  0.38631824081513272%
> > 1995756518.448 7903687                  0.39602461156666058%
> > 1921150053.008 8307746                  0.43243608103345821%
> > 2253731991.230 8033382                  0.35644797301810893%
> > 2143807312.838 6755593                  0.31512127790332323%
> > 1948787937.040 7152297                  0.36701258582622248%
> > 2276500726.783 7277806                  0.31969267193182554%
> > 2282733838.581 6336839                  0.27759867983291148%
> > 2003104438.909 7092547                  0.35407774363791977%
> > 2239684291.880 6982252                  0.3117516171950766%
> > 1801833773.742 2645976                  0.14684906224756283%
> > 2016683111.150 2261656                  0.11214731692329717%
> > 1962727948.601 2767184                  0.14098663046870061%
> > 1845203884.375 2266317                  0.1228220371304734%
> > 1516914718.568 1975612                  0.13023883121557553%
> > 1557730294.184 2109274                  0.13540688063108641%
> > 2171303320.533 1865384                  0.085910797554627022%
> > 1617322895.641 2073254                  0.12819048104666192%
> > 1867107731.934 1901409                  0.10183713384500154%
> > 1772588329.145 2045529                  0.11539786008783279%
> > 2030802387.562 2161163                  0.10641916777508319%
> > 1853583237.019 1803299                  0.097287187539531861%
> > 1453728764.040 3231016                  0.22225714176699726%
> > 1989811798.738 2418414                  0.1215398361560542%
> > 1956758511.567 2254818                  0.11523230826241863%
> > 2001579398.410 1952360                  0.097540971971978793%
> > 1636515132.388 1532408                  0.093638486419854422%
> > 1926271619.408 1918912                  0.09961793449408439%
> > 1642519133.238 1384193                  0.084272564744573439%
> > 1677033195.955 1340416                  0.07992781557533149%
> > 1693983317.244 3135804                  0.18511421972571418%
> > 1487434932.071 2467493                  0.1658891388656871%
> > 1720442169.343 3034092                  0.17635536108479918%
> > 1640362983.552 1944463                  0.11853858075908964%
> > 1821382248.526 2659828                  0.1460334864991977%
> > 1947322513.836 3540608                  0.18181929160904178%
> > 1673422299.278 2769832                  0.16551900863249205%
> > 1836039325.613 2175872                  0.11850900847526999%
> > 1582929718.539 2072026                  0.13089816785500888%
> > 1476014468.174 1996943                  0.13529291501257767%
> > 1847131237.607 2120975                  0.11482535495137695%
> > 1734796451.802 1842211                  0.10619176665287874%
> > 1804827020.363 1988047                  0.11015166426310204%
> > 1647250758.752 2634152                  0.15991202225917941%
> > 1506833507.228 2271966                  0.15077750720977479%
> > 1919459367.994 1608374                  0.083793073550751385%
> > 1913364635.281 2126200                  0.11112361756846952%
> > 1733934046.047 2031440                  0.11715785872197679%
> > 1762796865.803 2285487                  0.12965118354456009%
> > 1711072016.023 2270996                  0.13272357789349024%
> > 2082560830.257 2273483                  0.10916766353083858%
> > 1857948273.729 1266725                  0.06817870109255604%
> > 1738287034.188 2404600                  0.1383315846409251%
> > 1618942973.292 2016207                  0.12453848179100423%
> > 1820789056.854 2369799                  0.13015230902664757%
> > 1561747605.669 2487149                  0.15925422206327566%
> > 1884502059.843 2512148                  0.13330566485076115%
> > 1352203145.315 2618029                  0.19361210695824271%
> > 1911003204.592 1648292                  0.086252707271200574%
> > 1525244016.511 2555354                  0.16753738892517536%
> > 2190575134.148 1939935                  0.088558249829422828%
> > 1643983673.168 1747429                  0.10629235730989092%
> > 1781900292.843 1686195                  0.094629032094141294%
> > 1566821289.746 2291057                  0.14622324926229507%
> > 1777199439.264 2456866                  0.13824368530171682%
> > 1633063412.403 1953080                  0.11959609070698031%
> > 1913601422.874 1821650                  0.095194849785599556%
> > 2094732749.709 2322465                  0.11087166132876074%
> > 1951529182.943 2270128                  0.11632559839953495%
> > 1450396654.276 2867633                  0.19771370759479912%
> > 1643954213.946 2460211                  0.14965203891504558%
> > 1845971802.604 2593942                  0.14051904781757141%
> > 1671974226.831 2552423                  0.15265923116756236%
> > 1792483105.554 1547320                  0.086322710390164162%
> > 1774273751.516 2101959                  0.1184686973024323%
> > 1797523443.598 1781759                  0.099122990932099045%
> > 1877038808.523 2131598                  0.11356174365288201%
> > 1485614991.186 1796722                  0.12094129439052281%
> > 2008042349.453 1972441                  0.09822706182154485%
> > 2178564679.489 1948988                  0.089462021410222761%
> > 1462911389.586 2076158                  0.14191960051575969%
> > 1976013519.965 2692701                  0.13626936115536772%
> > 1473561005.083 1369873                  0.092963439944099252%
> > 2190713955.687 2450228                  0.11184609444967988%
> > 1648795783.114 2557892                  0.15513698095279177%
> > 2116292380.702 1948901                  0.092090347145394241%
> > 1704227210.803 1980402                  0.11620527987385389%
> > 1984573918.515 2325530                  0.11718031655581404%
> > 1856414008.018 1798348                  0.096872141248277158%
> > 1401593348.875 2285205                  0.16304336788086488%
> > 1832475317.969 2373585                  0.12952889333487622%
> > 1840826298.949 2112188                  0.11474129857911804%
> > 2014475390.352 2104652                  0.10447643143618861%
> > 1730176303.748 1975645                  0.11418749613667999%
> > 2016553863.650 2009045                  0.099627638825555156%
> > 1891795883.646 2160557                  0.11420666567029553%
> > 1881160108.475 1492908                  0.079361027978115894%
> > 1562754135.542 1978972                  0.12663361145505117%
> > 2061996559.038 2071212                  0.10044691834821971%
> > 1519528906.706 2460493                  0.16192472477103448%
> > 1796244525.568 2099254                  0.1168690548596764%
> > 1998463642.677 1988093                  0.099481069234609226%
> > 1872169708.204 1762657                  0.094150492462082555%
> > 1664425938.102 1710870                  0.10279039522485223%
> > 1791543540.761 1992110                  0.11119517637588673%
> > 1883913354.514 1770358                  0.093972368514618113%
> > 1813806740.237 1877185                  0.10349421238531283%
> > 1547586337.236 2702003                  0.17459465329900728%
> > 2020313568.921 2100334                  0.10396079263684484%
> > 1727289498.314 2059429                  0.1192289423405978%
> > 
> > Detailed time after patch:
> > 
> > Total          add_to_page_cache_lru    percent
> > 2151489401.227 6287717                  0.29224949918015392%
> > 1919030541.464 6872841                  0.3581413037208262%
> > 2277104485.950 8538296                  0.37496285535785828%
> > 2142847301.255 7992466                  0.37298345968558083%
> > 1909650847.324 6960671                  0.36449966808089612%
> > 2109123648.836 6825371                  0.32361170497361974%
> > 2128503623.032 7728883                  0.36311345286743746%
> > 1925232499.136 7526064                  0.39091714914315673%
> > 1935780958.829 7499162                  0.38739723964101919%
> > 1722848267.832 7353451                  0.42681942091473007%
> > 1855159456.452 7600681                  0.40970499724785564%
> > 1862769335.900 8607712                  0.46209221045831578%
> > 2272365556.764 6895008                  0.30342864419310028%
> > 2049429274.204 7252089                  0.35385895435775488%
> > 2001218334.232 8155639                  0.4075336938750293%
> > 1947131845.644 7229822                  0.37130623774523025%
> > 1982915408.746 7513518                  0.37891268416495716%
> > 2038654062.353 7531559                  0.36943781385386831%
> > 2134879530.359 6738727                  0.31564905205057731%
> > 2029761134.284 7472398                  0.36814174208907074%
> > 1956557094.935 7280443                  0.37210480690019776%
> > 1740007857.373 6663035                  0.38293131676196024%
> > 2054278081.418 7765213                  0.37800203732105886%
> > 2367585922.353 7629987                  0.32226864199365651%
> > 2023763123.141 7730698                  0.38199618876350999%
> > 1826074987.177 7579606                  0.41507638258150648%
> > 2077873224.783 7414675                  0.35683962387909118%
> > 2103719398.720 7093390                  0.33718327664402131%
> > 2164758918.301 7181094                  0.33172719323572736%
> > 1953199921.042 7750654                  0.39681826302067191%
> > 2179298642.359 6395109                  0.29344803303679207%
> > 1971922836.328 7215479                  0.36591081897688477%
> > 2225552712.785 8153246                  0.36634701812105075%
> > 2124780733.448 7080842                  0.33325048032177529%
> > 2139511762.563 8708060                  0.40701155059639843%
> > 1934365642.348 6842114                  0.35371358186939283%
> > 1950765025.910 6818753                  0.34954250816646476%
> > 2013188877.577 6495929                  0.32266863146086228%
> > 1921391492.273 6536576                  0.34020011154870117%
> > 2164698840.945 6712421                  0.31008567441509277%
> > 1992440404.239 7467130                  0.3747730664422067%
> > 2073623249.374 7113197                  0.34303227465005426%
> > 1938481781.041 6902169                  0.35606055561138211%
> > 1963963682.421 7776006                  0.39593430721765843%
> > 1807704433.895 7206899                  0.39867684477995868%
> > 1902413226.299 7687850                  0.40411041585093088%
> > 1718395340.661 7096452                  0.41296969516166576%
> > 2014139552.830 7653595                  0.37999328245384933%
> > 1976902280.499 7781174                  0.3936043818026207%
> > 1920129734.441 6808420                  0.3545812492707483%
> > 1786139514.638 7179970                  0.4019825966089316%
> > 2011990483.317 6739406                  0.33496212113733792%
> > 2186643920.919 7830962                  0.35812698743876015%
> > 1822975832.465 8627042                  0.47323951565143058%
> > 1872249428.425 7264440                  0.38800599373722838%
> > 1971668701.573 6818269                  0.34581210294408871%
> > 2054585589.388 7261214                  0.35341501651254642%
> > 2252858576.565 7228902                  0.32087686618225808%
> > 2196771348.820 6634658                  0.30201859668116204%
> > 2224872312.502 6948714                  0.31231967609798522%
> > 1924947573.139 6935037                  0.360271474235066%
> > 2070207954.999 7564561                  0.36540102078796494%
> > 2070702822.858 6994172                  0.33776802362912656%
> > 2099826181.510 7199665                  0.34286957003377627%
> > 2369037401.585 7009501                  0.29587971027009985%
> > 2109093017.803 7056062                  0.33455432929886414%
> > 2141046841.968 7831694                  0.36578807368835009%
> > 2284785719.942 7358167                  0.3220506385249462%
> > 2088247084.213 6857813                  0.32840045853982414%
> > 1880913709.753 6697468                  0.35607523967059107%
> > 1984586172.943 7361786                  0.37094816543455983%
> > 2130320004.839 7840067                  0.36802297223850727%
> > 2314059227.236 6669054                  0.28819720435443524%
> > 1935408704.712 8309322                  0.42933164348025787%
> > 2239440115.241 7040432                  0.31438357972087749%
> > 2387500130.769 6555638                  0.27458168129559291%
> > 2497500671.323 6969731                  0.27906823329532587%
> > 2229438175.095 6759889                  0.30321042653322067%
> > 1979317661.013 7870099                  0.39761677243723181%
> > 2228213884.202 5883791                  0.26405862748257614%
> > 2264526373.743 6563122                  0.28982316461838858%
> > 2256376049.198 6670128                  0.29561242694323103%
> > 2085817937.654 6878115                  0.32975624937505721%
> > 2177225113.484 7421724                  0.34087995559282069%
> > 1843473456.551 7156295                  0.38819625932607071%
> > 2403022978.094 6958096                  0.28955594946157931%
> > 1935191866.634 6872608                  0.35513832599730566%
> > 1917810458.850 7816937                  0.40759695328220108%
> > 2016838266.381 7257719                  0.35985627211562177%
> > 1904575178.166 7484129                  0.39295529448235278%
> > 2087232238.930 7372521                  0.35321996577532039%
> > 1976631116.176 6438233                  0.32571747693901715%
> > 2004191806.805 6735567                  0.33607397141980955%
> > 2110550986.328 7174599                  0.33993961986592813%
> > 2152621184.889 7337361                  0.34085704681840484%
> > 2059135774.936 6363510                  0.30903790208772336%
> > 2353978555.486 7265456                  0.30864580236160993%
> > 1945677192.325 5927833                  0.30466682877217138%
> > 2255676557.335 6379740                  0.28283044301074045%
> > 2034627875.756 7319043                  0.35972391252530572%
> > 1712556465.847 3079747                  0.17983331127577226%
> > 1837380343.831 3111263                  0.1693314620702272%
> > 1680363376.083 1611968                  0.095929727042585719%
> > 1769978476.954 2017096                  0.11396161175198416%
> > 1829700109.543 2612874                  0.14280340184559598%
> > 1783959966.875 3184558                  0.17851062014460206%
> > 1772032973.967 2133741                  0.12041203698502599%
> > 1667688494.152 1985120                  0.11903422053705601%
> > 2001252356.720 2633874                  0.13161128785961057%
> > 2077009766.270 2858188                  0.13761071548223289%
> > 2076612330.379 2340135                  0.11269002720276176%
> > 1834812072.536 1713713                  0.093399919569494547%
> > 1787949752.795 2522212                  0.14106727529994451%
> > 1907464219.876 2476303                  0.12982172741153589%
> > 1609254879.065 1903707                  0.11829741980375917%
> > 2069557067.911 2484547                  0.12005211349440528%
> > 1725151836.363 2491288                  0.1444097816486799%
> > 2077004221.561 2508662                  0.12078271069254648%
> > 1650748196.246 3195100                  0.19355465644403197%
> > 1843881436.725 2731880                  0.14815920078094685%
> > 1677080490.297 2752567                  0.16412849686853959%
> > 2103777441.563 3252694                  0.15461207710181613%
> > 1598451912.768 2345538                  0.1467381021139566%
> > 1825835416.807 2381575                  0.13043755083713246%
> > 1969692431.120 2198750                  0.11162910336969484%
> > 1867823718.088 1813348                  0.097083465770326324%
> > 1929972836.031 2140104                  0.11088777831718792%
> > 1627140530.277 2519227                  0.15482541016731566%
> > 1779407314.725 2828755                  0.15897175293095681%
> > 1724839045.836 1937500                  0.11232932166496305%
> > 1780135194.997 2889504                  0.16231935687361485%
> > 1758590417.053 2193512                  0.12473126082853507%
> > 1729351400.188 1523643                  0.088104881392778986%
> > 1484255938.207 2153531                  0.14509162096406991%
> > 1504352024.282 1910774                  0.12701641432044322%
> > 1916441202.247 2423510                  0.12645887581411155%
> > 1651813655.113 2892863                  0.1751325272705837%
> > 1772411947.023 2693122                  0.15194673024651262%
> > 1738795901.180 3064160                  0.17622309771495132%
> > 1717902110.376 3266140                  0.19012375503078779%
> > 1858583685.387 2246750                  0.12088505982619635%
> > 1641867344.242 2901893                  0.17674345069209688%
> > 2037933222.333 3290649                  0.16146991294606342%
> > 1713754879.753 2136278                  0.1246548164640615%
> > 2265307843.106 2195967                  0.096939010151886218%
> > 2066343912.900 2117667                  0.10248376307446183%
> > 2033259596.459 2652081                  0.13043494321230309%
> > 2029016228.457 2447926                  0.12064595470789147%
> > 1649943910.454 2243384                  0.13596728869302646%
> > 2064977308.315 2517573                  0.12191770775700743%
> > 1962826877.902 2403546                  0.12245328546596376%
> > 1991037543.226 3214700                  0.16145853255942866%
> > 1469646744.644 1960481                  0.133398111290676%
> > 2195532272.023 1693642                  0.077140382839349028%
> > 1902754342.931 2542547                  0.13362455376575119%
> > 1922726134.774 2726990                  0.14182935108023245%
> > 1629134454.130 2814497                  0.17276026499009955%
> > 1442857977.322 2886065                  0.2000241912483062%
> > 1782430811.270 2884583                  0.16183421997427802%
> > 1574026949.337 2873023                  0.18252692568003068%
> > 2064769936.712 2717133                  0.13159495165484839%
> > 1621381781.294 2143854                  0.13222388611576869%
> > 1783777994.311 2515565                  0.14102455619605617%
> > 1561198792.593 2402227                  0.15387066729728466%
> > 1658682002.165 2924639                  0.17632306832669586%
> > 1675144757.604 2216513                  0.13231769910860314%
> > 1712859849.041 2274046                  0.13276310967725691%
> > 1448178253.374 2507727                  0.17316424923226392%
> > 1621787692.726 2822681                  0.17404750403892047%
> > 1924315582.555 2275728                  0.11826168330344308%
> > 2044659318.750 2606738                  0.12749008972280165%
> > 1559320878.499 2049522                  0.13143683434629996%
> > 1641314354.590 2786145                  0.16975084585158451%
> > 1769603624.855 2008994                  0.11352790940200608%
> > 1707747444.550 2813289                  0.16473682973311757%
> > 1366611645.499 2930513                  0.21443641356721807%
> > 1435165128.768 3161655                  0.22029903992400404%
> > 1440555681.860 2831258                  0.19653929630435177%
> > 1627011278.312 1956565                  0.12025515901954331%
> > 1668464384.654 3863649                  0.23156916237089648%
> > 1589045506.115 2743558                  0.17265446391825656%
> > 1766150748.153 3207045                  0.18158387687765917%
> > 1668645748.936 2997211                  0.17961937109246526%
> > 1808149919.182 2420007                  0.13383884678626659%
> > 1831096162.185 2226869                  0.12161398434382248%
> > 1473136553.216 2774496                  0.18833936296964501%
> > 1873520917.567 2427200                  0.12955286366121929%
> > 1971938578.205 2207736                  0.11195764535473715%
> > 1670588149.850 2437810                  0.14592525394238479%
> > 1912140400.200 2407058                  0.12588291109524352%
> > 1925513187.452 2415130                  0.12542786077699639%
> > 1658671061.956 3111210                  0.18757245311382492%
> > 1917423750.996 2888924                  0.15066695603928745%
> > 1553962205.772 3386946                  0.21795549386076499%
> > 1834101490.364 2367281                  0.12907033838842713%
> > 1626846627.678 3362957                  0.20671629044711806%
> > 1519340210.104 2320735                  0.15274623712098975%
> > 1725615314.262 3813557                  0.22099693764197714%
> > 1725823077.103 3137784                  0.1818137699993643%
> > 1633124107.941 2007344                  0.12291435722731486%
> > 
> > Thanks,
> > Fengguang


* Re: [RFC][PATCH] HWPOISON: remove the unsafe __set_page_locked()
@ 2009-09-27 10:47         ` Wu Fengguang
  0 siblings, 0 replies; 54+ messages in thread
From: Wu Fengguang @ 2009-09-27 10:47 UTC (permalink / raw)
  To: Andi Kleen; +Cc: Andrew Morton, Hugh Dickins, Nick Piggin, linux-mm, LKML

On Sat, Sep 26, 2009 at 07:31:56PM +0800, Wu Fengguang wrote:
> On Sat, Sep 26, 2009 at 06:52:59PM +0800, Wu Fengguang wrote:
> > On Sat, Sep 26, 2009 at 11:49:36AM +0800, Andi Kleen wrote:
> > > On Sat, Sep 26, 2009 at 11:15:37AM +0800, Wu Fengguang wrote:
> > > > The swap cache and page cache code assume that they 'own' the newly
> > > > allocated page and therefore can disregard the locking rules. However
> > > > now hwpoison can hit any time on any page.
> > > > 
> > > > So use the safer lock_page()/trylock_page(). The main intention is not
> > > > to close such a small time window of memory corruption. But to avoid
> > > > kernel oops that may result from such races, and also avoid raising
> > > > false alerts in hwpoison stress tests.
> > > > 
> > > > This in theory will slightly increase page cache/swap cache overheads,
> > > > however it seems to be too small to be measurable in benchmark.
> > > 
> > > Thanks. Can you please describe what benchmarks you used?
> > 
> > Good question. The test case is to copy 1GB sparse file over NFS.
> > 
> > I ran it 100 times (2 hours total wall time) and the overheads are ~0.008%.
> 
> And standard deviation is 0.04%, much larger than the difference 0.008% ..

Sorry, that's not correct. I have improved the accounting by treating
function0+function1 from the two CPUs as a single entity per run:

                 total time      add_to_page_cache_lru   percent  stddev
         before  3880166848.722  9683329.610             0.250%   0.014%
         after   3828516894.376  9778088.870             0.256%   0.012%
         delta                                           0.006%

Thanks,
Fengguang


> > The average time reported by the function profiler:
> > 
> >                 total time      add_to_page_cache_lru   percent
> >         before  1940083424.361  4841664.805             0.242%
> >         after   1914258447.188  4889044.435             0.250%
> > 
> > The script used is:
> > 
> >         for i in `seq 1 100`
> >         do      
> >                 # copy 1GB sparse file over NFS
> >                 echo 1 > /debug/tracing/function_profile_enabled
> >                 cp /tmp/sparse/1G /dev/null
> >                 echo 0 > /debug/tracing/function_profile_enabled
> > 
> >                 fadvise /tmp/sparse/1G 0 0 dontneed
> >                 cat /debug/tracing/trace_stat/function0 > function0.$i
> >                 cat /debug/tracing/trace_stat/function1 > function1.$i
> >         done
> > 
> > 
> > File function0.1 after patch:
> > 
> >   Function                               Hit    Time            Avg
> >   --------                               ---    ----            ---
> >   schedule                             32492    474358128 us     14599.22 us 
> >   schedule_timeout                        26    99994288 us     3845934 us 
> >   poll_schedule_timeout                    8    52914376 us     6614297 us 
> >   sys_poll                                 6    52914373 us     8819062 us 
> >   schedule_hrtimeout_range                 8    52914373 us     6614296 us 
> >   do_sys_poll                              6    52914355 us     8819059 us 
> >   sys_read                             86997    40703194 us     467.868 us  
> >   vfs_read                             87002    40550651 us     466.088 us  
> >   do_sync_read                         86987    40011054 us     459.965 us  
> >   nfs_file_read                        86985    39945631 us     459.224 us  
> >   generic_file_aio_read                86985    39785858 us     457.387 us  
> >   page_cache_async_readahead            2886    34379069 us     11912.35 us 
> >   ondemand_readahead                    2889    34371850 us     11897.49 us 
> >   ra_submit                             2891    34368378 us     11888.05 us 
> >   __do_page_cache_readahead             2891    34360744 us     11885.41 us 
> >   nfs_readpages                         2876    33198107 us     11543.15 us 
> >   read_cache_pages                      2876    31617284 us     10993.49 us 
> >   __rpc_execute                        84468    25404441 us     300.758 us  
> >   readpage_async_filler               174071    24947004 us     143.315 us  
> >   do_softirq                           44822    24829675 us     553.961 us  
> >   __do_softirq                         44822    24803408 us     553.375 us  
> >   net_rx_action                        38424    24357207 us     633.906 us  
> >   e1000_clean                          39042    24253141 us     621.206 us  
> >   nfs_pageio_doio                      43534    22993433 us     528.171 us  
> >   nfs_pagein_one                       43520    22939752 us     527.108 us  
> >   nfs_read_rpcsetup                    43513    22533176 us     517.849 us  
> >   e1000_clean_rx_irq                   39042    22047306 us     564.707 us  
> >   rpc_run_task                         43528    22042766 us     506.404 us  
> >   call_transmit                        45873    21842655 us     476.154 us  
> >   nfs_pageio_add_request              174091    21583666 us     123.979 us  
> >   rpc_execute                          43528    21254402 us     488.292 us  
> >   cpuidle_idle_call                    18073    19445493 us     1075.941 us 
> >   acpi_idle_enter_c1                   17999    19396945 us     1077.667 us 
> >   e1000_receive_skb                   436279    18818686 us     43.134 us   
> >   tcp_v4_do_rcv                       513521    18681598 us     36.379 us   
> >   xprt_transmit                        43346    18532249 us     427.542 us  
> >   tcp_rcv_established                 513521    18384627 us     35.801 us   
> >   napi_gro_receive                    436279    18114586 us     41.520 us   
> >   napi_skb_finish                     436279    17130261 us     39.264 us   
> >   netif_receive_skb                   436279    16886046 us     38.704 us   
> >   ip_rcv                              436279    16500338 us     37.820 us   
> >   xs_tcp_send_request                  43346    16418581 us     378.779 us  
> >   xs_sendpages                         43346    16367619 us     377.603 us  
> >   xs_send_kvec                         43347    16310325 us     376.273 us  
> >   kernel_sendmsg                       43347    16254698 us     374.990 us  
> >   sock_sendmsg                         43347    16177184 us     373.201 us  
> >   tcp_sendmsg                          43347    16127746 us     372.061 us  
> >   do_IRQ                               45271    15968995 us     352.742 us  
> >   ip_local_deliver                    436279    15816133 us     36.252 us   
> >   irq_exit                             53709    15421475 us     287.130 us  
> >   acpi_processor_ffh_cstate_ente       17999    15394735 us     855.310 us  
> >   mwait_idle_with_hints                17999    15386089 us     854.830 us  
> >   tcp_v4_rcv                          436217    15173852 us     34.785 us   
> >   release_sock                         43364    13536313 us     312.155 us  
> >   xs_tcp_data_ready                   512045    10092920 us     19.711 us   
> >   _spin_unlock_bh                     518900    9851221 us     18.984 us   
> >   local_bh_enable_ip                  511352    9403704 us     18.389 us   
> >   tcp_read_sock                      1024090    8832051 us     8.624 us    
> >   __tcp_ack_snd_check                 512045    6644838 us     12.977 us   
> >   add_to_page_cache_lru               174072    6287717 us     36.121 us   
> >   tcp_transmit_skb                    296572    5962431 us     20.104 us   
> >   tcp_send_ack                        253210    5775135 us     22.807 us   
> >   __kfree_skb                         802932    5216451 us     6.496 us    
> >   add_to_page_cache_locked            174072    4716425 us     27.094 us   
> > [...]
> > 
> > Detailed time before patch (100 runs * 2 CPU):
> > 
> > Total          add_to_page_cache_lru    percent
> > 1927694859.404 6703561                  0.34775011030908648%
> > 1994166534.446 7311783                  0.36665859514242066%
> > 2132812997.930 6249683                  0.29302536162643533%
> > 1972582396.980 7908593                  0.40092586307715006%
> > 2415371209.622 7670327                  0.31756307144194151%
> > 2163473353.153 7406991                  0.34236571433640306%
> > 1833544521.635 7952113                  0.43370165851816772%
> > 2087748556.683 7995535                  0.38297404035585819%
> > 2185868844.795 7117368                  0.32560819085499598%
> > 1980423680.193 6514699                  0.32895481230385598%
> > 1958021484.352 6745228                  0.34449203207963308%
> > 2008592063.702 7398967                  0.36836583862444905%
> > 2448117568.931 6224454                  0.25425470079519025%
> > 2213021186.789 6938228                  0.31351837214297412%
> > 1828902235.789 7479842                  0.40897987074596959%
> > 1866644864.837 7720792                  0.41361868802366963%
> > 2054526163.572 8475856                  0.41254553727677401%
> > 1902862587.026 8334661                  0.43800645705196778%
> > 2148449373.925 8672498                  0.40366313050031155%
> > 2104880443.231 7741619                  0.36779376353160392%
> > 2270646549.540 6486688                  0.2856758134071597%
> > 2250814274.826 7512331                  0.33376058984612328%
> > 2214008898.106 6391259                  0.28867359139646992%
> > 2446226035.589 7738434                  0.31634173978271579%
> > 1920377000.358 6464992                  0.33665223020244384%
> > 2030446733.714 6022617                  0.2966153654759367%
> > 2157128575.464 7375670                  0.34192074055731642%
> > 2185670134.683 7943199                  0.36342167438509865%
> > 2224381465.969 8492920                  0.38181041021667828%
> > 2479045291.455 8161934                  0.32923698603382928%
> > 1971641972.432 8114434                  0.41155717485517562%
> > 2208169216.013 8138519                  0.36856410011433138%
> > 1971312194.985 8122576                  0.41203904793283169%
> > 2279129743.345 7365286                  0.3231622079219687%
> > 2360162707.877 7412468                  0.3140659741491984%
> > 1885022652.013 8252339                  0.43778460652382062%
> > 2065345722.062 7985361                  0.3866355600759942%
> > 1975982196.916 7925371                  0.4010851419799969%
> > 2169738273.075 7203856                  0.33201497569522709%
> > 2034570957.887 7396992                  0.36356520136717829%
> > 1789538123.758 7380292                  0.41241323121418116%
> > 1912812047.546 8697404                  0.45469203370807615%
> > 2221047646.893 7451321                  0.33548676951723994%
> > 2163981380.559 7982481                  0.36887937538251664%
> > 1969189605.396 7353112                  0.37340802428831149%
> > 2278909389.270 7088803                  0.31106120468750831%
> > 2104996788.904 7342692                  0.34882200479854836%
> > 2374705789.293 7327329                  0.30855733931492624%
> > 2137944282.456 8112283                  0.3794431438915179%
> > 2268117031.230 6889022                  0.30373309247909852%
> > 1899091967.945 8173918                  0.43041190937398177%
> > 2098779842.987 7858621                  0.37443760603377763%
> > 2124662819.215 8156209                  0.38388251190904155%
> > 2317714239.959 7048899                  0.30413149638864423%
> > 2218944220.453 6659501                  0.30012025262358577%
> > 1945126557.801 7704643                  0.3960998305791596%
> > 2092060468.851 7616315                  0.36405807161888715%
> > 1933379809.531 7300893                  0.37762331870896354%
> > 1903832170.769 7508856                  0.39440745435912067%
> > 2435478683.656 6790800                  0.27882814354203433%
> > 2126189870.763 7104836                  0.33415811530747125%
> > 2161363441.772 7011298                  0.32439236569356278%
> > 2064449342.765 6822211                  0.33046153561039859%
> > 2116816508.947 8156178                  0.38530396779913867%
> > 1982774633.856 7131335                  0.35966442571092105%
> > 2323253766.182 7275361                  0.3131539526978242%
> > 1905927501.820 7477458                  0.39232646534874271%
> > 2252162684.001 8064426                  0.35807475442552977%
> > 2027146552.281 7875866                  0.38851981328818402%
> > 1845397260.011 7598767                  0.41176862915439172%
> > 2177206310.501 7366288                  0.33833670077434846%
> > 2041629016.560 6600368                  0.32328929234759562%
> > 2189308825.112 8348863                  0.38134697600613249%
> > 1930990433.231 7344038                  0.38032492930126544%
> > 2102320863.469 7299411                  0.34720727586536809%
> > 1895798071.683 8051595                  0.42470741585110772%
> > 2019648719.623 7992832                  0.39575357448754711%
> > 1961118039.267 7837733                  0.39965636147681766%
> > 2142238585.905 8084509                  0.37738602288244455%
> > 2263212432.641 7757018                  0.34274369865263332%
> > 2200111036.549 7069702                  0.32133387281622078%
> > 1941699880.568 7786856                  0.4010329339734075%
> > 1939777157.562 7729686                  0.39848319534368681%
> > 1983015138.927 8136252                  0.41029701893261827%
> > 2040218650.019 7968023                  0.39054750332400873%
> > 1850721920.862 7835912                  0.42339758943095634%
> > 2167986192.252 8656589                  0.39929170356052646%
> > 2128868496.040 8248906                  0.38747841942065214%
> > 1984840247.084 7754853                  0.39070413910605312%
> > 2130780080.168 7053725                  0.33103955990821227%
> > 2062729676.752 7968701                  0.38631824081513272%
> > 1995756518.448 7903687                  0.39602461156666058%
> > 1921150053.008 8307746                  0.43243608103345821%
> > 2253731991.230 8033382                  0.35644797301810893%
> > 2143807312.838 6755593                  0.31512127790332323%
> > 1948787937.040 7152297                  0.36701258582622248%
> > 2276500726.783 7277806                  0.31969267193182554%
> > 2282733838.581 6336839                  0.27759867983291148%
> > 2003104438.909 7092547                  0.35407774363791977%
> > 2239684291.880 6982252                  0.3117516171950766%
> > 1801833773.742 2645976                  0.14684906224756283%
> > 2016683111.150 2261656                  0.11214731692329717%
> > 1962727948.601 2767184                  0.14098663046870061%
> > 1845203884.375 2266317                  0.1228220371304734%
> > 1516914718.568 1975612                  0.13023883121557553%
> > 1557730294.184 2109274                  0.13540688063108641%
> > 2171303320.533 1865384                  0.085910797554627022%
> > 1617322895.641 2073254                  0.12819048104666192%
> > 1867107731.934 1901409                  0.10183713384500154%
> > 1772588329.145 2045529                  0.11539786008783279%
> > 2030802387.562 2161163                  0.10641916777508319%
> > 1853583237.019 1803299                  0.097287187539531861%
> > 1453728764.040 3231016                  0.22225714176699726%
> > 1989811798.738 2418414                  0.1215398361560542%
> > 1956758511.567 2254818                  0.11523230826241863%
> > 2001579398.410 1952360                  0.097540971971978793%
> > 1636515132.388 1532408                  0.093638486419854422%
> > 1926271619.408 1918912                  0.09961793449408439%
> > 1642519133.238 1384193                  0.084272564744573439%
> > 1677033195.955 1340416                  0.07992781557533149%
> > 1693983317.244 3135804                  0.18511421972571418%
> > 1487434932.071 2467493                  0.1658891388656871%
> > 1720442169.343 3034092                  0.17635536108479918%
> > 1640362983.552 1944463                  0.11853858075908964%
> > 1821382248.526 2659828                  0.1460334864991977%
> > 1947322513.836 3540608                  0.18181929160904178%
> > 1673422299.278 2769832                  0.16551900863249205%
> > 1836039325.613 2175872                  0.11850900847526999%
> > 1582929718.539 2072026                  0.13089816785500888%
> > 1476014468.174 1996943                  0.13529291501257767%
> > 1847131237.607 2120975                  0.11482535495137695%
> > 1734796451.802 1842211                  0.10619176665287874%
> > 1804827020.363 1988047                  0.11015166426310204%
> > 1647250758.752 2634152                  0.15991202225917941%
> > 1506833507.228 2271966                  0.15077750720977479%
> > 1919459367.994 1608374                  0.083793073550751385%
> > 1913364635.281 2126200                  0.11112361756846952%
> > 1733934046.047 2031440                  0.11715785872197679%
> > 1762796865.803 2285487                  0.12965118354456009%
> > 1711072016.023 2270996                  0.13272357789349024%
> > 2082560830.257 2273483                  0.10916766353083858%
> > 1857948273.729 1266725                  0.06817870109255604%
> > 1738287034.188 2404600                  0.1383315846409251%
> > 1618942973.292 2016207                  0.12453848179100423%
> > 1820789056.854 2369799                  0.13015230902664757%
> > 1561747605.669 2487149                  0.15925422206327566%
> > 1884502059.843 2512148                  0.13330566485076115%
> > 1352203145.315 2618029                  0.19361210695824271%
> > 1911003204.592 1648292                  0.086252707271200574%
> > 1525244016.511 2555354                  0.16753738892517536%
> > 2190575134.148 1939935                  0.088558249829422828%
> > 1643983673.168 1747429                  0.10629235730989092%
> > 1781900292.843 1686195                  0.094629032094141294%
> > 1566821289.746 2291057                  0.14622324926229507%
> > 1777199439.264 2456866                  0.13824368530171682%
> > 1633063412.403 1953080                  0.11959609070698031%
> > 1913601422.874 1821650                  0.095194849785599556%
> > 2094732749.709 2322465                  0.11087166132876074%
> > 1951529182.943 2270128                  0.11632559839953495%
> > 1450396654.276 2867633                  0.19771370759479912%
> > 1643954213.946 2460211                  0.14965203891504558%
> > 1845971802.604 2593942                  0.14051904781757141%
> > 1671974226.831 2552423                  0.15265923116756236%
> > 1792483105.554 1547320                  0.086322710390164162%
> > 1774273751.516 2101959                  0.1184686973024323%
> > 1797523443.598 1781759                  0.099122990932099045%
> > 1877038808.523 2131598                  0.11356174365288201%
> > 1485614991.186 1796722                  0.12094129439052281%
> > 2008042349.453 1972441                  0.09822706182154485%
> > 2178564679.489 1948988                  0.089462021410222761%
> > 1462911389.586 2076158                  0.14191960051575969%
> > 1976013519.965 2692701                  0.13626936115536772%
> > 1473561005.083 1369873                  0.092963439944099252%
> > 2190713955.687 2450228                  0.11184609444967988%
> > 1648795783.114 2557892                  0.15513698095279177%
> > 2116292380.702 1948901                  0.092090347145394241%
> > 1704227210.803 1980402                  0.11620527987385389%
> > 1984573918.515 2325530                  0.11718031655581404%
> > 1856414008.018 1798348                  0.096872141248277158%
> > 1401593348.875 2285205                  0.16304336788086488%
> > 1832475317.969 2373585                  0.12952889333487622%
> > 1840826298.949 2112188                  0.11474129857911804%
> > 2014475390.352 2104652                  0.10447643143618861%
> > 1730176303.748 1975645                  0.11418749613667999%
> > 2016553863.650 2009045                  0.099627638825555156%
> > 1891795883.646 2160557                  0.11420666567029553%
> > 1881160108.475 1492908                  0.079361027978115894%
> > 1562754135.542 1978972                  0.12663361145505117%
> > 2061996559.038 2071212                  0.10044691834821971%
> > 1519528906.706 2460493                  0.16192472477103448%
> > 1796244525.568 2099254                  0.1168690548596764%
> > 1998463642.677 1988093                  0.099481069234609226%
> > 1872169708.204 1762657                  0.094150492462082555%
> > 1664425938.102 1710870                  0.10279039522485223%
> > 1791543540.761 1992110                  0.11119517637588673%
> > 1883913354.514 1770358                  0.093972368514618113%
> > 1813806740.237 1877185                  0.10349421238531283%
> > 1547586337.236 2702003                  0.17459465329900728%
> > 2020313568.921 2100334                  0.10396079263684484%
> > 1727289498.314 2059429                  0.1192289423405978%
> > 
> > Detailed time after patch:
> > 
> > Total          add_to_page_cache_lru    percent
> > 2151489401.227 6287717                  0.29224949918015392%
> > 1919030541.464 6872841                  0.3581413037208262%
> > 2277104485.950 8538296                  0.37496285535785828%
> > 2142847301.255 7992466                  0.37298345968558083%
> > 1909650847.324 6960671                  0.36449966808089612%
> > 2109123648.836 6825371                  0.32361170497361974%
> > 2128503623.032 7728883                  0.36311345286743746%
> > 1925232499.136 7526064                  0.39091714914315673%
> > 1935780958.829 7499162                  0.38739723964101919%
> > 1722848267.832 7353451                  0.42681942091473007%
> > 1855159456.452 7600681                  0.40970499724785564%
> > 1862769335.900 8607712                  0.46209221045831578%
> > 2272365556.764 6895008                  0.30342864419310028%
> > 2049429274.204 7252089                  0.35385895435775488%
> > 2001218334.232 8155639                  0.4075336938750293%
> > 1947131845.644 7229822                  0.37130623774523025%
> > 1982915408.746 7513518                  0.37891268416495716%
> > 2038654062.353 7531559                  0.36943781385386831%
> > 2134879530.359 6738727                  0.31564905205057731%
> > 2029761134.284 7472398                  0.36814174208907074%
> > 1956557094.935 7280443                  0.37210480690019776%
> > 1740007857.373 6663035                  0.38293131676196024%
> > 2054278081.418 7765213                  0.37800203732105886%
> > 2367585922.353 7629987                  0.32226864199365651%
> > 2023763123.141 7730698                  0.38199618876350999%
> > 1826074987.177 7579606                  0.41507638258150648%
> > 2077873224.783 7414675                  0.35683962387909118%
> > 2103719398.720 7093390                  0.33718327664402131%
> > 2164758918.301 7181094                  0.33172719323572736%
> > 1953199921.042 7750654                  0.39681826302067191%
> > 2179298642.359 6395109                  0.29344803303679207%
> > 1971922836.328 7215479                  0.36591081897688477%
> > 2225552712.785 8153246                  0.36634701812105075%
> > 2124780733.448 7080842                  0.33325048032177529%
> > 2139511762.563 8708060                  0.40701155059639843%
> > 1934365642.348 6842114                  0.35371358186939283%
> > 1950765025.910 6818753                  0.34954250816646476%
> > 2013188877.577 6495929                  0.32266863146086228%
> > 1921391492.273 6536576                  0.34020011154870117%
> > 2164698840.945 6712421                  0.31008567441509277%
> > 1992440404.239 7467130                  0.3747730664422067%
> > 2073623249.374 7113197                  0.34303227465005426%
> > 1938481781.041 6902169                  0.35606055561138211%
> > 1963963682.421 7776006                  0.39593430721765843%
> > 1807704433.895 7206899                  0.39867684477995868%
> > 1902413226.299 7687850                  0.40411041585093088%
> > 1718395340.661 7096452                  0.41296969516166576%
> > 2014139552.830 7653595                  0.37999328245384933%
> > 1976902280.499 7781174                  0.3936043818026207%
> > 1920129734.441 6808420                  0.3545812492707483%
> > 1786139514.638 7179970                  0.4019825966089316%
> > 2011990483.317 6739406                  0.33496212113733792%
> > 2186643920.919 7830962                  0.35812698743876015%
> > 1822975832.465 8627042                  0.47323951565143058%
> > 1872249428.425 7264440                  0.38800599373722838%
> > 1971668701.573 6818269                  0.34581210294408871%
> > 2054585589.388 7261214                  0.35341501651254642%
> > 2252858576.565 7228902                  0.32087686618225808%
> > 2196771348.820 6634658                  0.30201859668116204%
> > 2224872312.502 6948714                  0.31231967609798522%
> > 1924947573.139 6935037                  0.360271474235066%
> > 2070207954.999 7564561                  0.36540102078796494%
> > 2070702822.858 6994172                  0.33776802362912656%
> > 2099826181.510 7199665                  0.34286957003377627%
> > 2369037401.585 7009501                  0.29587971027009985%
> > 2109093017.803 7056062                  0.33455432929886414%
> > 2141046841.968 7831694                  0.36578807368835009%
> > 2284785719.942 7358167                  0.3220506385249462%
> > 2088247084.213 6857813                  0.32840045853982414%
> > 1880913709.753 6697468                  0.35607523967059107%
> > 1984586172.943 7361786                  0.37094816543455983%
> > 2130320004.839 7840067                  0.36802297223850727%
> > 2314059227.236 6669054                  0.28819720435443524%
> > 1935408704.712 8309322                  0.42933164348025787%
> > 2239440115.241 7040432                  0.31438357972087749%
> > 2387500130.769 6555638                  0.27458168129559291%
> > 2497500671.323 6969731                  0.27906823329532587%
> > 2229438175.095 6759889                  0.30321042653322067%
> > 1979317661.013 7870099                  0.39761677243723181%
> > 2228213884.202 5883791                  0.26405862748257614%
> > 2264526373.743 6563122                  0.28982316461838858%
> > 2256376049.198 6670128                  0.29561242694323103%
> > 2085817937.654 6878115                  0.32975624937505721%
> > 2177225113.484 7421724                  0.34087995559282069%
> > 1843473456.551 7156295                  0.38819625932607071%
> > 2403022978.094 6958096                  0.28955594946157931%
> > 1935191866.634 6872608                  0.35513832599730566%
> > 1917810458.850 7816937                  0.40759695328220108%
> > 2016838266.381 7257719                  0.35985627211562177%
> > 1904575178.166 7484129                  0.39295529448235278%
> > 2087232238.930 7372521                  0.35321996577532039%
> > 1976631116.176 6438233                  0.32571747693901715%
> > 2004191806.805 6735567                  0.33607397141980955%
> > 2110550986.328 7174599                  0.33993961986592813%
> > 2152621184.889 7337361                  0.34085704681840484%
> > 2059135774.936 6363510                  0.30903790208772336%
> > 2353978555.486 7265456                  0.30864580236160993%
> > 1945677192.325 5927833                  0.30466682877217138%
> > 2255676557.335 6379740                  0.28283044301074045%
> > 2034627875.756 7319043                  0.35972391252530572%
> > 1712556465.847 3079747                  0.17983331127577226%
> > 1837380343.831 3111263                  0.1693314620702272%
> > 1680363376.083 1611968                  0.095929727042585719%
> > 1769978476.954 2017096                  0.11396161175198416%
> > 1829700109.543 2612874                  0.14280340184559598%
> > 1783959966.875 3184558                  0.17851062014460206%
> > 1772032973.967 2133741                  0.12041203698502599%
> > 1667688494.152 1985120                  0.11903422053705601%
> > 2001252356.720 2633874                  0.13161128785961057%
> > 2077009766.270 2858188                  0.13761071548223289%
> > 2076612330.379 2340135                  0.11269002720276176%
> > 1834812072.536 1713713                  0.093399919569494547%
> > 1787949752.795 2522212                  0.14106727529994451%
> > 1907464219.876 2476303                  0.12982172741153589%
> > 1609254879.065 1903707                  0.11829741980375917%
> > 2069557067.911 2484547                  0.12005211349440528%
> > 1725151836.363 2491288                  0.1444097816486799%
> > 2077004221.561 2508662                  0.12078271069254648%
> > 1650748196.246 3195100                  0.19355465644403197%
> > 1843881436.725 2731880                  0.14815920078094685%
> > 1677080490.297 2752567                  0.16412849686853959%
> > 2103777441.563 3252694                  0.15461207710181613%
> > 1598451912.768 2345538                  0.1467381021139566%
> > 1825835416.807 2381575                  0.13043755083713246%
> > 1969692431.120 2198750                  0.11162910336969484%
> > 1867823718.088 1813348                  0.097083465770326324%
> > 1929972836.031 2140104                  0.11088777831718792%
> > 1627140530.277 2519227                  0.15482541016731566%
> > 1779407314.725 2828755                  0.15897175293095681%
> > 1724839045.836 1937500                  0.11232932166496305%
> > 1780135194.997 2889504                  0.16231935687361485%
> > 1758590417.053 2193512                  0.12473126082853507%
> > 1729351400.188 1523643                  0.088104881392778986%
> > 1484255938.207 2153531                  0.14509162096406991%
> > 1504352024.282 1910774                  0.12701641432044322%
> > 1916441202.247 2423510                  0.12645887581411155%
> > 1651813655.113 2892863                  0.1751325272705837%
> > 1772411947.023 2693122                  0.15194673024651262%
> > 1738795901.180 3064160                  0.17622309771495132%
> > 1717902110.376 3266140                  0.19012375503078779%
> > 1858583685.387 2246750                  0.12088505982619635%
> > 1641867344.242 2901893                  0.17674345069209688%
> > 2037933222.333 3290649                  0.16146991294606342%
> > 1713754879.753 2136278                  0.1246548164640615%
> > 2265307843.106 2195967                  0.096939010151886218%
> > 2066343912.900 2117667                  0.10248376307446183%
> > 2033259596.459 2652081                  0.13043494321230309%
> > 2029016228.457 2447926                  0.12064595470789147%
> > 1649943910.454 2243384                  0.13596728869302646%
> > 2064977308.315 2517573                  0.12191770775700743%
> > 1962826877.902 2403546                  0.12245328546596376%
> > 1991037543.226 3214700                  0.16145853255942866%
> > 1469646744.644 1960481                  0.133398111290676%
> > 2195532272.023 1693642                  0.077140382839349028%
> > 1902754342.931 2542547                  0.13362455376575119%
> > 1922726134.774 2726990                  0.14182935108023245%
> > 1629134454.130 2814497                  0.17276026499009955%
> > 1442857977.322 2886065                  0.2000241912483062%
> > 1782430811.270 2884583                  0.16183421997427802%
> > 1574026949.337 2873023                  0.18252692568003068%
> > 2064769936.712 2717133                  0.13159495165484839%
> > 1621381781.294 2143854                  0.13222388611576869%
> > 1783777994.311 2515565                  0.14102455619605617%
> > 1561198792.593 2402227                  0.15387066729728466%
> > 1658682002.165 2924639                  0.17632306832669586%
> > 1675144757.604 2216513                  0.13231769910860314%
> > 1712859849.041 2274046                  0.13276310967725691%
> > 1448178253.374 2507727                  0.17316424923226392%
> > 1621787692.726 2822681                  0.17404750403892047%
> > 1924315582.555 2275728                  0.11826168330344308%
> > 2044659318.750 2606738                  0.12749008972280165%
> > 1559320878.499 2049522                  0.13143683434629996%
> > 1641314354.590 2786145                  0.16975084585158451%
> > 1769603624.855 2008994                  0.11352790940200608%
> > 1707747444.550 2813289                  0.16473682973311757%
> > 1366611645.499 2930513                  0.21443641356721807%
> > 1435165128.768 3161655                  0.22029903992400404%
> > 1440555681.860 2831258                  0.19653929630435177%
> > 1627011278.312 1956565                  0.12025515901954331%
> > 1668464384.654 3863649                  0.23156916237089648%
> > 1589045506.115 2743558                  0.17265446391825656%
> > 1766150748.153 3207045                  0.18158387687765917%
> > 1668645748.936 2997211                  0.17961937109246526%
> > 1808149919.182 2420007                  0.13383884678626659%
> > 1831096162.185 2226869                  0.12161398434382248%
> > 1473136553.216 2774496                  0.18833936296964501%
> > 1873520917.567 2427200                  0.12955286366121929%
> > 1971938578.205 2207736                  0.11195764535473715%
> > 1670588149.850 2437810                  0.14592525394238479%
> > 1912140400.200 2407058                  0.12588291109524352%
> > 1925513187.452 2415130                  0.12542786077699639%
> > 1658671061.956 3111210                  0.18757245311382492%
> > 1917423750.996 2888924                  0.15066695603928745%
> > 1553962205.772 3386946                  0.21795549386076499%
> > 1834101490.364 2367281                  0.12907033838842713%
> > 1626846627.678 3362957                  0.20671629044711806%
> > 1519340210.104 2320735                  0.15274623712098975%
> > 1725615314.262 3813557                  0.22099693764197714%
> > 1725823077.103 3137784                  0.1818137699993643%
> > 1633124107.941 2007344                  0.12291435722731486%
> > 
> > Thanks,
> > Fengguang

--
To unsubscribe, send a message with 'unsubscribe linux-mm' in
the body to majordomo@kvack.org.  For more info on Linux MM,
see: http://www.linux-mm.org/ .
Don't email: <dont@kvack.org>

^ permalink raw reply	[flat|nested] 54+ messages in thread

* Re: [RFC][PATCH] HWPOISON: remove the unsafe __set_page_locked()
  2009-09-26 21:32       ` Andi Kleen
@ 2009-09-27 16:26         ` Hugh Dickins
  -1 siblings, 0 replies; 54+ messages in thread
From: Hugh Dickins @ 2009-09-27 16:26 UTC (permalink / raw)
  To: Andi Kleen; +Cc: Nick Piggin, Wu Fengguang, Andrew Morton, linux-mm, LKML

On Sat, 26 Sep 2009, Andi Kleen wrote:
> > This is a bit tricky to do right now; you have a chicken and egg
> > problem between locking the page and pinning the inode mapping.
> 
> One possibly simple solution would be to just allocate the page
> locked (GFP_LOCKED). When the allocator clears the flags it already
> modifies the state, so it could as well set the lock bit too. No
> atomics needed.  And then clearing it later is also atomic free.

That's a good idea.

I don't particularly like adding a GFP_LOCKED just for this, and I
don't particularly like having to remember to unlock the thing on the
various(?) error paths between getting the page and adding it to cache.

But it is a good idea, and if doing it that way would really close a
race window which checking page->mapping (or whatever) cannot (I'm
simply not sure about that), then it would seem the best way to go.

Hugh

^ permalink raw reply	[flat|nested] 54+ messages in thread

* Re: [RFC][PATCH] HWPOISON: remove the unsafe __set_page_locked()
  2009-09-27 10:47         ` Wu Fengguang
@ 2009-09-27 19:20           ` Nick Piggin
  -1 siblings, 0 replies; 54+ messages in thread
From: Nick Piggin @ 2009-09-27 19:20 UTC (permalink / raw)
  To: Wu Fengguang; +Cc: Andi Kleen, Andrew Morton, Hugh Dickins, linux-mm, LKML

On Sun, Sep 27, 2009 at 06:47:39PM +0800, Wu Fengguang wrote:
> > 
> > And standard deviation is 0.04%, much larger than the difference 0.008% ..
> 
> Sorry that's not correct. I improved the accounting by treating
> function0+function1 from two CPUs as an integral entity:
> 
>                  total time      add_to_page_cache_lru   percent  stddev
>          before  3880166848.722  9683329.610             0.250%   0.014%
>          after   3828516894.376  9778088.870             0.256%   0.012%
>          delta                                           0.006%

I don't understand why you're using this NFS workload for measurement.
I see significant nfs, networking protocol and device overheads in
your profiles, and you're also hitting some locks or something that
is causing massive context switching. So I don't think this is a
good test. But anyway, as Hugh points out, you need to compare with
a *completely* fixed kernel, which includes auditing all users of
non-atomic page flag operations (slab, notably, but possibly also
other places).

One other thing to keep in mind that I will mention is that I am
going to push in a patch to the page allocator to allow callers
to avoid the refcounting (atomic_dec_and_test) in page lifetime,
which is especially important for SLUB and takes more cycles off
the page allocator...

I don't know exactly what you're going to do after that to get a
stable reference to slab pages. I guess you can read the page
flags and speculatively take some slab locks and recheck etc...


^ permalink raw reply	[flat|nested] 54+ messages in thread


* Re: [RFC][PATCH] HWPOISON: remove the unsafe __set_page_locked()
  2009-09-27 16:26         ` Hugh Dickins
@ 2009-09-27 19:22           ` Nick Piggin
  -1 siblings, 0 replies; 54+ messages in thread
From: Nick Piggin @ 2009-09-27 19:22 UTC (permalink / raw)
  To: Hugh Dickins; +Cc: Andi Kleen, Wu Fengguang, Andrew Morton, linux-mm, LKML

On Sun, Sep 27, 2009 at 05:26:25PM +0100, Hugh Dickins wrote:
> On Sat, 26 Sep 2009, Andi Kleen wrote:
> > > This is a bit tricky to do right now; you have a chicken and egg
> > > problem between locking the page and pinning the inode mapping.
> > 
> > One possibly simple solution would be to just allocate the page
> > locked (GFP_LOCKED). When the allocator clears the flags it already
> > modifies the state, so it could as well set the lock bit too. No
> > atomics needed.  And then clearing it later is also atomic free.
> 
> That's a good idea.
> 
> I don't particularly like adding a GFP_LOCKED just for this, and I
> don't particularly like having to remember to unlock the thing on the
> various(?) error paths between getting the page and adding it to cache.

God no, please no more crazy branches in the page allocator.

I'm going to resubmit my patches to allow 0-ref page allocations,
so the pagecache will be able to work with those to do what we
want here.

 
> But it is a good idea, and if doing it that way would really close a
> race window which checking page->mapping (or whatever) cannot (I'm
> simply not sure about that), then it would seem the best way to go.

Yep, seems reasonable: the ordering is no technical burden, and a
simple comment pointing to hwpoison will keep it maintainable.


^ permalink raw reply	[flat|nested] 54+ messages in thread


* Re: [RFC][PATCH] HWPOISON: remove the unsafe __set_page_locked()
  2009-09-27 19:22           ` Nick Piggin
@ 2009-09-27 21:57             ` Hugh Dickins
  -1 siblings, 0 replies; 54+ messages in thread
From: Hugh Dickins @ 2009-09-27 21:57 UTC (permalink / raw)
  To: Nick Piggin; +Cc: Andi Kleen, Wu Fengguang, Andrew Morton, linux-mm, LKML

On Sun, 27 Sep 2009, Nick Piggin wrote:
> On Sun, Sep 27, 2009 at 05:26:25PM +0100, Hugh Dickins wrote:
> > 
> > I don't particularly like adding a GFP_LOCKED just for this, and I
> > don't particularly like having to remember to unlock the thing on the
> > various(?) error paths between getting the page and adding it to cache.
> 
> God no, please no more crazy branches in the page allocator.
> 
> I'm going to resubmit my patches to allow 0-ref page allocations,
> so the pagecache will be able to work with those to do what we
> want here.
>  
> > But it is a good idea, and if doing it that way would really close a
> > race window which checking page->mapping (or whatever) cannot (I'm
> > simply not sure about that), then it would seem the best way to go.
> 
> Yep, seems reasonable: the ordering is no technical burden, and a
> simple comment pointing to hwpoison will keep it maintainable.

You move from "God no" to "Yep, seems reasonable"!

I think perhaps you couldn't bring yourself to believe that I was
giving any support to Andi's GFP_LOCKED idea.  Pretend I did not!

I'll assume we stick with the "God no", and we'll see how what
you come up with affects what they want.

Hugh

^ permalink raw reply	[flat|nested] 54+ messages in thread


* Re: [RFC][PATCH] HWPOISON: remove the unsafe __set_page_locked()
  2009-09-27 21:57             ` Hugh Dickins
@ 2009-09-27 23:01               ` Nick Piggin
  -1 siblings, 0 replies; 54+ messages in thread
From: Nick Piggin @ 2009-09-27 23:01 UTC (permalink / raw)
  To: Hugh Dickins; +Cc: Andi Kleen, Wu Fengguang, Andrew Morton, linux-mm, LKML

On Sun, Sep 27, 2009 at 10:57:29PM +0100, Hugh Dickins wrote:
> On Sun, 27 Sep 2009, Nick Piggin wrote:
> > On Sun, Sep 27, 2009 at 05:26:25PM +0100, Hugh Dickins wrote:
> > > 
> > > I don't particularly like adding a GFP_LOCKED just for this, and I
> > > don't particularly like having to remember to unlock the thing on the
> > > various(?) error paths between getting the page and adding it to cache.
> > 
> > God no, please no more crazy branches in the page allocator.
> > 
> > I'm going to resubmit my patches to allow 0-ref page allocations,
> > so the pagecache will be able to work with those to do what we
> > want here.
> >  
> > > But it is a good idea, and if doing it that way would really close a
> > > race window which checking page->mapping (or whatever) cannot (I'm
> > > simply not sure about that), then it would seem the best way to go.
> > 
> > Yep, seems reasonable: the ordering is no technical burden, and a
> > simple comment pointing to hwpoison will keep it maintainable.
> 
> You move from "God no" to "Yep, seems reasonable"!
> 
> I think perhaps you couldn't bring yourself to believe that I was
> giving any support to Andi's GFP_LOCKED idea.  Pretend I did not!
> 
> I'll assume we stick with the "God no", and we'll see how what
> you come up with affects what they want.

Well, yes, I mean "no" to a GFP_LOCKED... If you follow me :)

Reasonable being the basic idea of setting up our flags before we
increment page count, although of course we'd want to see how all
the error cases etc pan out.

There is no real rush AFAIKS to fix this one single pagecache site
while we have problems with slab allocators and all other unaudited
places that nonatomically modify page flags with an elevated
page reference ... just mark HWPOISON as broken for the moment, or
cut it down to do something much simpler I guess?


^ permalink raw reply	[flat|nested] 54+ messages in thread


* Re: [RFC][PATCH] HWPOISON: remove the unsafe __set_page_locked()
  2009-09-27 23:01               ` Nick Piggin
@ 2009-09-28  1:19                 ` Andi Kleen
  -1 siblings, 0 replies; 54+ messages in thread
From: Andi Kleen @ 2009-09-28  1:19 UTC (permalink / raw)
  To: Nick Piggin
  Cc: Hugh Dickins, Andi Kleen, Wu Fengguang, Andrew Morton, linux-mm, LKML

> There is no real rush AFAIKS to fix this one single pagecache site
> while we have problems with slab allocators and all other unaudited
> places that nonatomically modify page flags with an elevated

hwpoison ignores slab pages.

> page reference ... just mark HWPOISON as broken for the moment, or
> cut it down to do something much simpler I guess?

Erm no. These cases are *EXTREMELY* unlikely to hit.

I'll look into exploiting the ordering of the mapping assignment.

-Andi

-- 
ak@linux.intel.com -- Speaking for myself only.

^ permalink raw reply	[flat|nested] 54+ messages in thread


* Re: [RFC][PATCH] HWPOISON: remove the unsafe __set_page_locked()
  2009-09-28  1:19                 ` Andi Kleen
@ 2009-09-28  1:52                   ` Wu Fengguang
  -1 siblings, 0 replies; 54+ messages in thread
From: Wu Fengguang @ 2009-09-28  1:52 UTC (permalink / raw)
  To: Andi Kleen; +Cc: Nick Piggin, Hugh Dickins, Andrew Morton, linux-mm, LKML

On Mon, Sep 28, 2009 at 09:19:43AM +0800, Andi Kleen wrote:
> > There is no real rush AFAIKS to fix this one single pagecache site
> > while we have problems with slab allocators and all other unaudited
> > places that nonatomically modify page flags with an elevated
> 
> hwpoison ignores slab pages.
> 
> > page reference ... just mark HWPOISON as broken for the moment, or
> > cut it down to do something much simpler I guess?
> 
> Erm no. These cases are *EXTREMELY* unlikely to hit.
> 
> I'll look into exploiting the ordering of the mapping assignment.

Andi, given that the overhead of this patch is considered unacceptable,
I think we can just drop it.

The proposed schemes are already tricky enough (and may not achieve
100% correctness). We have not even considered the interaction with
free buddy pages, unpoison, and hwpoison filtering.

It may well result in something unmanageable.

On the other hand, we may just ignore the __set_page_locked race, 

- it could trigger BUG() on unlock_page(); however, that's _no worse_
  than a plain kernel without hwpoison. A plain kernel will also die when
  trying to fill data into the newly allocated pages.
- it is _not yet_ a LRU page. So it does not hurt the general idea of
  "hwpoison can handle LRU pages reliably".
- in hwpoison stress testing, we can avoid such pages by checking the
  PG_lru bit. Thus we can make the tests immune to this race.

Or,
- the page being __set_page_locked() is _not_ a proper LRU page
- we can prevent the kernel panic in the tests
- for a production workload, this presents merely another (rare) type
  of kernel page we cannot rescue.

Thanks,
Fengguang

^ permalink raw reply	[flat|nested] 54+ messages in thread


* Re: [RFC][PATCH] HWPOISON: remove the unsafe __set_page_locked()
  2009-09-28  1:19                 ` Andi Kleen
@ 2009-09-28  2:57                   ` Nick Piggin
  -1 siblings, 0 replies; 54+ messages in thread
From: Nick Piggin @ 2009-09-28  2:57 UTC (permalink / raw)
  To: Andi Kleen; +Cc: Hugh Dickins, Wu Fengguang, Andrew Morton, linux-mm, LKML

On Mon, Sep 28, 2009 at 03:19:43AM +0200, Andi Kleen wrote:
> > There is no real rush AFAIKS to fix this one single pagecache site
> > while we have problems with slab allocators and all other unaudited
> > places that nonatomically modify page flags with an elevated
> 
> hwpoison ignores slab pages.

"ignores" them *after* it has already written to page flags?
By that time it's too late.

 
> > page reference ... just mark HWPOISON as broken for the moment, or
> > cut it down to do something much simpler I guess?
> 
> Erm no. These cases are *EXTREMLY* unlikely to hit.

Well it's fundamentally badly buggy, rare or not. We could avoid
lots of nasty atomic operations if we just care that it works
most of the time. I guess it's a matter of perspective but I
won't push for one thing or the other in hwpoison code so long
as it stays out of core code for the most part.


^ permalink raw reply	[flat|nested] 54+ messages in thread


* Re: [RFC][PATCH] HWPOISON: remove the unsafe __set_page_locked()
  2009-09-28  2:57                   ` Nick Piggin
@ 2009-09-28  4:11                     ` Andi Kleen
  -1 siblings, 0 replies; 54+ messages in thread
From: Andi Kleen @ 2009-09-28  4:11 UTC (permalink / raw)
  To: Nick Piggin
  Cc: Andi Kleen, Hugh Dickins, Wu Fengguang, Andrew Morton, linux-mm, LKML

On Mon, Sep 28, 2009 at 04:57:41AM +0200, Nick Piggin wrote:
> On Mon, Sep 28, 2009 at 03:19:43AM +0200, Andi Kleen wrote:
> > > There is no real rush AFAIKS to fix this one single pagecache site
> > > while we have problems with slab allocators and all other unaudited
> > > places that nonatomically modify page flags with an elevated
> > 
> > hwpoison ignores slab pages.
> 
> "ignores" them *after* it has already written to page flags?
> By that time it's too late.

Yes, currently the page lock comes first. The only exception 
is for page count == 0 pages. I suppose we could move the slab check
up, but then it only helps when slab is set.

So if you make slab use refcount == 0 pages that would help.

> Well it's fundamentally badly buggy, rare or not. We could avoid

Let's put it like this -- any access to the poisoned cache lines
in that page will trigger a panic anyways.

-Andi

-- 
ak@linux.intel.com -- Speaking for myself only.

^ permalink raw reply	[flat|nested] 54+ messages in thread


* Re: [RFC][PATCH] HWPOISON: remove the unsafe __set_page_locked()
  2009-09-28  4:11                     ` Andi Kleen
@ 2009-09-28  4:29                       ` Nick Piggin
  -1 siblings, 0 replies; 54+ messages in thread
From: Nick Piggin @ 2009-09-28  4:29 UTC (permalink / raw)
  To: Andi Kleen; +Cc: Hugh Dickins, Wu Fengguang, Andrew Morton, linux-mm, LKML

On Mon, Sep 28, 2009 at 06:11:08AM +0200, Andi Kleen wrote:
> On Mon, Sep 28, 2009 at 04:57:41AM +0200, Nick Piggin wrote:
> > On Mon, Sep 28, 2009 at 03:19:43AM +0200, Andi Kleen wrote:
> > > > There is no real rush AFAIKS to fix this one single pagecache site
> > > > while we have problems with slab allocators and all other unaudited
> > > > places that nonatomically modify page flags with an elevated
> > > 
> > > hwpoison ignores slab pages.
> > 
> > "ignores" them *after* it has already written to page flags?
> > By that time it's too late.
> 
> Yes, currently the page lock comes first. The only exception 
> is for page count == 0 pages. I suppose we could move the slab check
> up, but then it only helps when slab is set.

Yes, so it misses other potential non-atomic page flags manipulations.

 
> So if you make slab use refcount == 0 pages that would help.

Yes it would help here and also help with the pagecache part too,
and most other cases I suspect. I have some patches to do this at
home so I'll post them when I get back.

 
> > Well it's fundamentally badly buggy, rare or not. We could avoid
> 
> Let's put it like this -- any access to the poisoned cache lines
> in that page will trigger a panic anyways.

Well yes, although maybe people who care about this feature will
care more about having a reliable panic than introducing
random data corruption. I guess the chance of an ECC failure
combined with a chance the race window hits could be some orders
of magnitude less likely than other sources of bugs ;) but still
I don't like using that argument to allow known bugs -- it leads
to interesting things if we take it to a conclusion.


^ permalink raw reply	[flat|nested] 54+ messages in thread


* Re: [RFC][PATCH] HWPOISON: remove the unsafe __set_page_locked()
  2009-09-27 19:20           ` Nick Piggin
@ 2009-09-28  8:44             ` Wu Fengguang
  -1 siblings, 0 replies; 54+ messages in thread
From: Wu Fengguang @ 2009-09-28  8:44 UTC (permalink / raw)
  To: Nick Piggin; +Cc: Andi Kleen, Andrew Morton, Hugh Dickins, linux-mm, LKML

On Mon, Sep 28, 2009 at 03:20:25AM +0800, Nick Piggin wrote:
> On Sun, Sep 27, 2009 at 06:47:39PM +0800, Wu Fengguang wrote:
> > > 
> > > And standard deviation is 0.04%, much larger than the difference 0.008% ..
> > 
> > Sorry that's not correct. I improved the accounting by treating
> > function0+function1 from two CPUs as an integral entity:
> > 
> >                  total time      add_to_page_cache_lru   percent  stddev
> >          before  3880166848.722  9683329.610             0.250%   0.014%
> >          after   3828516894.376  9778088.870             0.256%   0.012%
> >          delta                                           0.006%
> 
> I don't understand why you're doing this NFS workload to measure?

Because it is the first convenient workload that came to mind, and it
avoids real disk IO :)

> I see significant nfs, networking protocol and device overheads in
> your profiles, also you're hitting some locks or something which
> is causing massive context switching. So I don't think this is a
> good test.

Yes, there are overheads. However, it is a real and common workload.

> But anyway as Hugh points out, you need to compare with a
> *completely* fixed kernel, which includes auditing all users of page
> flags non-atomically (slab, notably, but possibly also other
> places).

That's a good point. We can do more benchmarks when more fixes are
available. However, I suspect their design goal will be "fix them
without introducing noticeable overheads" :)

> One other thing to keep in mind that I will mention is that I am
> going to push in a patch to the page allocator to allow callers
> to avoid the refcounting (atomic_dec_and_test) in page lifetime,
> which is especially important for SLUB and takes more cycles off
> the page allocator...
>
> I don't know exactly what you're going to do after that to get a
> stable reference to slab pages. I guess you can read the page
> flags and speculatively take some slab locks and recheck etc...

For reliability we could skip the page lock on zero-refcounted pages.

We may lose the PG_hwpoison bit on races with __SetPageSlub*(); however,
that should be an acceptable imperfection.

Thanks,
Fengguang



* Re: [RFC][PATCH] HWPOISON: remove the unsafe __set_page_locked()
  2009-09-28  8:44             ` Wu Fengguang
@ 2009-09-29  5:16               ` Wu Fengguang
  -1 siblings, 0 replies; 54+ messages in thread
From: Wu Fengguang @ 2009-09-29  5:16 UTC (permalink / raw)
  To: Nick Piggin; +Cc: Andi Kleen, Andrew Morton, Hugh Dickins, linux-mm, LKML

On Mon, Sep 28, 2009 at 04:44:01PM +0800, Wu Fengguang wrote:
> On Mon, Sep 28, 2009 at 03:20:25AM +0800, Nick Piggin wrote:

> > One other thing to keep in mind that I will mention is that I am
> > going to push in a patch to the page allocator to allow callers
> > to avoid the refcounting (atomic_dec_and_test) in page lifetime,
> > which is especially important for SLUB and takes more cycles off
> > the page allocator...
> >
> > I don't know exactly what you're going to do after that to get a
> > stable reference to slab pages. I guess you can read the page
> > flags and speculatively take some slab locks and recheck etc...
> 
> For reliably we could skip page lock on zero refcounted pages.
> 
> We may lose the PG_hwpoison bit on races with __SetPageSlub*, however
> it should be an acceptable imperfection.

I'd like to propose this fix for 2.6.32, which achieves 100% correctness
for the discussed races :)

In brief it is

        if (is not lru page)
                return and don't touch page lock;

Any comments?

Thanks,
Fengguang

---
HWPOISON: return early on non-LRU pages

This avoids unnecessary races with __set_page_locked() and
__SetPageSlab*() and maybe more non-atomic page flag operations.

Signed-off-by: Wu Fengguang <fengguang.wu@intel.com>
---
 mm/memory-failure.c |   54 ++++++++++++++----------------------------
 1 file changed, 19 insertions(+), 35 deletions(-)

--- sound-2.6.orig/mm/memory-failure.c	2009-09-29 12:27:36.000000000 +0800
+++ sound-2.6/mm/memory-failure.c	2009-09-29 12:32:52.000000000 +0800
@@ -327,16 +327,6 @@ static const char *action_name[] = {
 };
 
 /*
- * Error hit kernel page.
- * Do nothing, try to be lucky and not touch this instead. For a few cases we
- * could be more sophisticated.
- */
-static int me_kernel(struct page *p, unsigned long pfn)
-{
-	return DELAYED;
-}
-
-/*
  * Already poisoned page.
  */
 static int me_ignore(struct page *p, unsigned long pfn)
@@ -370,9 +360,6 @@ static int me_pagecache_clean(struct pag
 	int ret = FAILED;
 	struct address_space *mapping;
 
-	if (!isolate_lru_page(p))
-		page_cache_release(p);
-
 	/*
 	 * For anonymous pages we're done the only reference left
 	 * should be the one m_f() holds.
@@ -498,30 +485,18 @@ static int me_pagecache_dirty(struct pag
  */
 static int me_swapcache_dirty(struct page *p, unsigned long pfn)
 {
-	int ret = FAILED;
-
 	ClearPageDirty(p);
 	/* Trigger EIO in shmem: */
 	ClearPageUptodate(p);
 
-	if (!isolate_lru_page(p)) {
-		page_cache_release(p);
-		ret = DELAYED;
-	}
-
-	return ret;
+	return DELAYED;
 }
 
 static int me_swapcache_clean(struct page *p, unsigned long pfn)
 {
-	int ret = FAILED;
-
-	if (!isolate_lru_page(p)) {
-		page_cache_release(p);
-		ret = RECOVERED;
-	}
 	delete_from_swap_cache(p);
-	return ret;
+
+	return RECOVERED;
 }
 
 /*
@@ -576,13 +551,6 @@ static struct page_state {
 	{ reserved,	reserved,	"reserved kernel",	me_ignore },
 	{ buddy,	buddy,		"free kernel",	me_free },
 
-	/*
-	 * Could in theory check if slab page is free or if we can drop
-	 * currently unused objects without touching them. But just
-	 * treat it as standard kernel for now.
-	 */
-	{ slab,		slab,		"kernel slab",	me_kernel },
-
 #ifdef CONFIG_PAGEFLAGS_EXTENDED
 	{ head,		head,		"huge",		me_huge_page },
 	{ tail,		tail,		"huge",		me_huge_page },
@@ -775,6 +743,22 @@ int __memory_failure(unsigned long pfn, 
 	}
 
 	/*
+	 * We ignore non-LRU pages for good reasons.
+	 * - PG_locked is only well defined for LRU pages and a few others
+	 * - to avoid races with __set_page_locked()
+	 * - to avoid races with __SetPageSlab*() (and more non-atomic ops)
+	 * The check (unnecessarily) ignores LRU pages being isolated and
+	 * walked by the page reclaim code, however that's not a big loss.
+	 */
+	if (!PageLRU(p))
+		lru_add_drain_all();
+	if (isolate_lru_page(p)) {
+		action_result(pfn, "non LRU", IGNORED);
+		return -EBUSY;
+	}
+	page_cache_release(p);
+
+	/*
 	 * Lock the page and wait for writeback to finish.
 	 * It's very difficult to mess with pages currently under IO
 	 * and in many cases impossible, so we just avoid it here.



* Re: [RFC][PATCH] HWPOISON: remove the unsafe __set_page_locked()
  2009-09-28  8:44             ` Wu Fengguang
@ 2009-10-01  2:02               ` Nick Piggin
  -1 siblings, 0 replies; 54+ messages in thread
From: Nick Piggin @ 2009-10-01  2:02 UTC (permalink / raw)
  To: Wu Fengguang; +Cc: Andi Kleen, Andrew Morton, Hugh Dickins, linux-mm, LKML

On Mon, Sep 28, 2009 at 04:44:01PM +0800, Wu Fengguang wrote:
> On Mon, Sep 28, 2009 at 03:20:25AM +0800, Nick Piggin wrote:
> > On Sun, Sep 27, 2009 at 06:47:39PM +0800, Wu Fengguang wrote:
> > > > 
> > > > And standard deviation is 0.04%, much larger than the difference 0.008% ..
> > > 
> > > Sorry that's not correct. I improved the accounting by treating
> > > function0+function1 from two CPUs as an integral entity:
> > > 
> > >                  total time      add_to_page_cache_lru   percent  stddev
> > >          before  3880166848.722  9683329.610             0.250%   0.014%
> > >          after   3828516894.376  9778088.870             0.256%   0.012%
> > >          delta                                           0.006%
> > 
> > I don't understand why you're doing this NFS workload to measure?
> 
> Because it is the first convenient workload hit my mind, and avoids
> real disk IO :)

Using tmpfs or sparse files is probably a lot easier.

 
> > I see significant nfs, networking protocol and device overheads in
> > your profiles, also you're hitting some locks or something which
> > is causing massive context switching. So I don't think this is a
> > good test.
> 
> Yes there are overheads. However it is a real and common workload.

Right, but so are lots of other workloads that don't hit
add_to_page_cache heavily :)
 

> > But anyway as Hugh points out, you need to compare with a
> > *completely* fixed kernel, which includes auditing all users of page
> > flags non-atomically (slab, notably, but possibly also other
> > places).
> 
> That's good point. We can do more benchmarks when more fixes are
> available. However I suspect their design goal will be "fix them
> without introducing noticeable overheads" :)

s/noticeable//

The problem with all the non-noticeable overheads that we're
continually adding to the kernel is that we're adding them to
the kernel. The non-noticeable part only makes it worse, because
you can't bisect them :)
 

> > One other thing to keep in mind that I will mention is that I am
> > going to push in a patch to the page allocator to allow callers
> > to avoid the refcounting (atomic_dec_and_test) in page lifetime,
> > which is especially important for SLUB and takes more cycles off
> > the page allocator...
> >
> > I don't know exactly what you're going to do after that to get a
> > stable reference to slab pages. I guess you can read the page
> > flags and speculatively take some slab locks and recheck etc...
> 
> For reliably we could skip page lock on zero refcounted pages.
> 
> We may lose the PG_hwpoison bit on races with __SetPageSlub*, however
> it should be an acceptable imperfection.

I think if you're willing to accept these problems, then it is
completely reasonable to also accept similar races in kernel
fastpaths to avoid extra overheads there.




* Re: [RFC][PATCH] HWPOISON: remove the unsafe __set_page_locked()
  2009-10-01  2:02               ` Nick Piggin
@ 2009-10-02 10:54                 ` Wu Fengguang
  -1 siblings, 0 replies; 54+ messages in thread
From: Wu Fengguang @ 2009-10-02 10:54 UTC (permalink / raw)
  To: Nick Piggin; +Cc: Andi Kleen, Andrew Morton, Hugh Dickins, linux-mm, LKML

On Thu, Oct 01, 2009 at 10:02:07AM +0800, Nick Piggin wrote:
> On Mon, Sep 28, 2009 at 04:44:01PM +0800, Wu Fengguang wrote:
> > On Mon, Sep 28, 2009 at 03:20:25AM +0800, Nick Piggin wrote:
> > > On Sun, Sep 27, 2009 at 06:47:39PM +0800, Wu Fengguang wrote:
> > > > > 
> > > > > And standard deviation is 0.04%, much larger than the difference 0.008% ..
> > > > 
> > > > Sorry that's not correct. I improved the accounting by treating
> > > > function0+function1 from two CPUs as an integral entity:
> > > > 
> > > >                  total time      add_to_page_cache_lru   percent  stddev
> > > >          before  3880166848.722  9683329.610             0.250%   0.014%
> > > >          after   3828516894.376  9778088.870             0.256%   0.012%
> > > >          delta                                           0.006%
> > > 
> > > I don't understand why you're doing this NFS workload to measure?
> > 
> > Because it is the first convenient workload hit my mind, and avoids
> > real disk IO :)
> 
> Using tmpfs or sparse files is probably a lot easier.

Good ideas. In fact I tried them in the very beginning.
The ratios are roughly at the same level (which is somewhat unexpected):

        total time      add_to_page_cache_lru   percent  stddev
tmpfs   1579056274.576  2615476.234             0.1656354036338758%
sparse  1074931917.425  3001273                 0.27920586888791538%

The workload is to copy 1G of /dev/zero to /dev/shm/, or a 1G sparse
file (ext2) to /dev/null.

 echo 1 > /debug/tracing/function_profile_enabled
 cp /dev/zero /dev/shm/
 echo 0 > /debug/tracing/function_profile_enabled

 dd if=/dev/zero of=/mnt/test/sparse bs=1k count=1 seek=1048575
 echo 1 > /debug/tracing/function_profile_enabled
 cp /mnt/test/sparse /dev/null
 echo 0 > /debug/tracing/function_profile_enabled

> > > I see significant nfs, networking protocol and device overheads in
> > > your profiles, also you're hitting some locks or something which
> > > is causing massive context switching. So I don't think this is a
> > > good test.
> > 
> > Yes there are overheads. However it is a real and common workload.
> 
> Right, but so are lots of other workloads that don't hit
> add_to_page_cache heavily :)
>  
> 
> > > But anyway as Hugh points out, you need to compare with a
> > > *completely* fixed kernel, which includes auditing all users of page
> > > flags non-atomically (slab, notably, but possibly also other
> > > places).
> > 
> > That's good point. We can do more benchmarks when more fixes are
> > available. However I suspect their design goal will be "fix them
> > without introducing noticeable overheads" :)
> 
> s/noticeable//
> 
> The problem with all the non-noticeable overheads that we're
> continually adding to the kernel is that we're adding them to
> the kernel. Non-noticeable part only makes it worse because
> you can't bisect them :)

Yes it makes sense.

> > > One other thing to keep in mind that I will mention is that I am
> > > going to push in a patch to the page allocator to allow callers
> > > to avoid the refcounting (atomic_dec_and_test) in page lifetime,
> > > which is especially important for SLUB and takes more cycles off
> > > the page allocator...
> > >
> > > I don't know exactly what you're going to do after that to get a
> > > stable reference to slab pages. I guess you can read the page
> > > flags and speculatively take some slab locks and recheck etc...
> > 
> > For reliably we could skip page lock on zero refcounted pages.
> > 
> > We may lose the PG_hwpoison bit on races with __SetPageSlub*, however
> > it should be an acceptable imperfection.
> 
> I think if you're wiling to accept these problems, then it is
> completely reasonable to also accept similar races with kernel
> fastpaths to avoid extra overheads there.

Yes I do. Even better, for this particular race, we managed to avoid
it completely without introducing overheads in the fast path :)

Thanks,
Fengguang



