* [PATCH 00/13] kdump, vmcore: support mmap() on /proc/vmcore
From: HATAYAMA Daisuke @ 2013-02-14 10:11 UTC (permalink / raw)
  To: ebiederm, vgoyal, cpw, kumagai-atsushi, lisa.mitchell; +Cc: kexec, linux-kernel

Currently, reads of /proc/vmcore are handled by read_oldmem(), which
performs an ioremap/iounmap cycle for every single page. For example,
if memory is 1GB, ioremap/iounmap is called (1GB / 4KB) = 262144
times. This causes a big performance degradation.

To address the issue, this patch set implements mmap() on /proc/vmcore
to improve read performance. My simple benchmark shows an improvement
from 200 [MiB/sec] to over 50.0 [GiB/sec].

Benchmark
=========

= Machine spec
  - CPU: Intel(R) Xeon(R) CPU E7- 4820 @ 2.00GHz (4 sockets, 8 cores) (*)
  - memory: 32GB
  - kernel: 3.8-rc6 with this patch
  - vmcore size: 31.7GB

  (*) only 1 CPU is used in the 2nd kernel.

= Benchmark Case

1) copy /proc/vmcore *WITHOUT* mmap() on /proc/vmcore

$ time dd bs=4096 if=/proc/vmcore of=/dev/null
8307246+1 records in
8307246+1 records out
real    2m 31.50s
user    0m 1.06s
sys     2m 27.60s

So the performance is 31.7 GiB / 151.5 s = 214.26 [MiB/sec].

2) copy /proc/vmcore with mmap()

  I ran the following command and recorded the real time:

  $ for n in $(seq 1 15) ; do \
  >   time copyvmcore2 --blocksize=$((4096 * (1 << (n - 1)))) /proc/vmcore /dev/null
  > done

  where copyvmcore2 is an ad-hoc test tool that reads data from
  /proc/vmcore via mmap() in a given block-size unit and writes it to
  some file.
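
  As a reference, a minimal sketch of such a tool is shown below. This
  is not the actual copyvmcore2 source (that is an ad-hoc program not
  included in this series); it only illustrates the access pattern and
  assumes the block size is a multiple of the page size:

  #include <fcntl.h>
  #include <stdio.h>
  #include <stdlib.h>
  #include <sys/mman.h>
  #include <sys/stat.h>
  #include <unistd.h>

  int main(int argc, char **argv)
  {
      /* block (map) size; assumed to be a multiple of the page size */
      size_t bs = (argc > 1) ? strtoul(argv[1], NULL, 0) : 4096;
      int in = open("/proc/vmcore", O_RDONLY);  /* paths hard-coded for brevity */
      int out = open("/dev/null", O_WRONLY);
      struct stat st;

      if (in < 0 || out < 0 || fstat(in, &st) < 0) {
          perror("setup");
          return 1;
      }

      for (off_t off = 0; off < st.st_size; off += bs) {
          size_t len = (st.st_size - off < (off_t)bs) ? st.st_size - off : bs;
          /* one mmap()/munmap() pair per block: larger blocks mean fewer
           * page table setups and TLB flushes, hence the numbers below */
          void *p = mmap(NULL, len, PROT_READ, MAP_PRIVATE, in, off);
          if (p == MAP_FAILED) {
              perror("mmap");
              return 1;
          }
          if (write(out, p, len) != (ssize_t)len) {
              perror("write");
              return 1;
          }
          munmap(p, len);
      }
      return 0;
  }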

|  n | map size |  time | page table | performance |
|    |          | (sec) |       size |   [GiB/sec] |
|----+----------+-------+------------+-------------|
|  1 | 4 KiB    | 78.35 | 8 B        |        0.40 |
|  2 | 8 KiB    | 45.29 | 16 B       |        0.70 |
|  3 | 16 KiB   | 23.82 | 32 B       |        1.33 |
|  4 | 32 KiB   | 12.90 | 64 B       |        2.46 |
|  5 | 64 KiB   |  6.13 | 128 B      |        5.17 |
|  6 | 128 KiB  |  3.26 | 256 B      |        9.72 |
|  7 | 256 KiB  |  1.86 | 512 B      |       17.04 |
|  8 | 512 KiB  |  1.13 | 1 KiB      |       28.04 |
|  9 | 1 MiB    |  0.77 | 2 KiB      |       41.16 |
| 10 | 2 MiB    |  0.58 | 4 KiB      |       54.64 |
| 11 | 4 MiB    |  0.50 | 8 KiB      |       63.38 |
| 12 | 8 MiB    |  0.46 | 16 KiB     |       68.89 |
| 13 | 16 MiB   |  0.44 | 32 KiB     |       72.02 |
| 14 | 32 MiB   |  0.44 | 64 KiB     |       72.02 |
| 15 | 64 MiB   |  0.45 | 128 KiB    |       70.42 |

3) copy /proc/vmcore with mmap() on /dev/oldmem

I posted another patch series for mmap() on /dev/oldmem a few weeks ago.
See: https://lkml.org/lkml/2013/2/3/431

The following table, from that post, shows those benchmark results.

|  n | map size |  time | page table | performance |
|    |          | (sec) |       size |   [GiB/sec] |
|----+----------+-------+------------+-------------|
|  1 | 4 KiB    | 41.86 | 8 B        |        0.76 |
|  2 | 8 KiB    | 25.43 | 16 B       |        1.25 |
|  3 | 16 KiB   | 13.28 | 32 B       |        2.39 |
|  4 | 32 KiB   |  7.20 | 64 B       |        4.40 |
|  5 | 64 KiB   |  3.45 | 128 B      |        9.19 |
|  6 | 128 KiB  |  1.82 | 256 B      |       17.42 |
|  7 | 256 KiB  |  1.03 | 512 B      |       30.78 |
|  8 | 512 KiB  |  0.61 | 1 KiB      |       51.97 |
|  9 | 1 MiB    |  0.41 | 2 KiB      |       77.32 |
| 10 | 2 MiB    |  0.32 | 4 KiB      |       99.06 |
| 11 | 4 MiB    |  0.27 | 8 KiB      |      117.41 |
| 12 | 8 MiB    |  0.24 | 16 KiB     |      132.08 |
| 13 | 16 MiB   |  0.23 | 32 KiB     |      137.83 |
| 14 | 32 MiB   |  0.22 | 64 KiB     |      144.09 |
| 15 | 64 MiB   |  0.22 | 128 KiB    |      144.09 |

= Discussion

- For small map sizes, the mmap() case shows performance degradation
  due to the many page table modifications and TLB flushes, similar to
  the read_oldmem() case. But for large map sizes the performance is
  clearly improved.

  Each application needs to choose an appropriate map size for its
  preferred performance.

- mmap() on /dev/oldmem appears faster than mmap() on /proc/vmcore. But
  actual dump processing involves not only copying but also I/O work,
  so this difference is not a problem in practice.

- Both mmap() cases show drastically better performance than the
  previous RFC patch set's roughly 2.5 [GiB/sec], which mapped all dump
  target memory into the kernel direct mapping address space. This is
  because there is no longer a memcpy() from kernel space to user space.

Design
======

= Support Range

- mmap() on /proc/vmcore is supported for the ELF64 interface only. The
  ELF32 interface is used only if the dump target memory is smaller
  than 4GB, in which case the existing read interface performs well
  enough.

= Change of /proc/vmcore format

To satisfy mmap()'s page-size boundary requirement, the layout of
/proc/vmcore has changed so that its objects are now placed on
page-size boundaries (a small illustrative sketch follows the list
below).

- The buffer for ELF headers is allocated on a page-size boundary.
  => See [PATCH 01/13].

- ELF note objects scattered over old memory are copied into a single
  page-size aligned buffer in the 2nd kernel, and that buffer is
  remapped to user-space.
  => See [PATCH 09/13].

- The head and/or tail pages of memory chunks are also copied into the
  2nd kernel if either of their ends is not page-size aligned.
  => See [PATCH 12/13].
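
To illustrate the layout rule, here is a small, purely hypothetical
user-space sketch (the object sizes are made up; the real logic lives
in the patches referenced above):

  /* Illustrative only: per-object offsets in /proc/vmcore are rounded
   * up to page-size boundaries so that each object can be mmap()ed
   * independently. The gaps created by the round-up are the "holes"
   * counted into the total vmcore size. */
  #include <stdio.h>

  #define PAGE_SIZE     4096UL
  #define PAGE_ALIGN(x) (((x) + PAGE_SIZE - 1) & ~(PAGE_SIZE - 1))

  int main(void)
  {
      /* hypothetical objects: ELF headers, note buffer, two memory chunks */
      unsigned long sizes[] = { 6000, 9000, 1UL << 30, 123456 };
      unsigned long off = 0;

      for (unsigned int i = 0; i < sizeof(sizes) / sizeof(sizes[0]); i++) {
          printf("object %u: offset %#lx, size %lu\n", i, off, sizes[i]);
          /* round the next offset up to the next page boundary */
          off = PAGE_ALIGN(off + sizes[i]);
      }
      printf("total (page-aligned) vmcore size: %#lx\n", off);
      return 0;
  }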

= 32-bit PAE limitation

- With 32-bit PAE, mmap_vmcore() can handle only up to 16TB of memory,
  since remap_pfn_range()'s third argument, pfn, is defined as unsigned
  long and is thus only 32 bits wide there: 2^32 pages * 4 KiB/page
  = 16 TiB.

TODO
====

- Modify makedumpfile to use mmap() on /proc/vmcore and benchmark it to
  confirm that the performance improvement is sufficient.

Test
====

Done on x86-64 and x86-32, each in environments with 1GB and with more
than 4GB of memory.

---

HATAYAMA Daisuke (13):
      vmcore: introduce mmap_vmcore()
      vmcore: copy non page-size aligned head and tail pages in 2nd kernel
      vmcore: count holes generated by round-up operation for vmcore size
      vmcore: round-up offset of vmcore object in page-size boundary
      vmcore: copy ELF note segments in buffer on 2nd kernel
      vmcore: remove unused helper function
      vmcore: modify read_vmcore() to read buffer on 2nd kernel
      vmcore: modify vmcore clean-up function to free buffer on 2nd kernel
      vmcore: modify ELF32 code according to new type
      vmcore: introduce types for objects copied in 2nd kernel
      vmcore: fill unused part of buffer for ELF headers with 0
      vmcore: round up buffer size of ELF headers by PAGE_SIZE
      vmcore: allocate buffer for ELF headers on page-size alignment


 fs/proc/vmcore.c        |  408 +++++++++++++++++++++++++++++++++++------------
 include/linux/proc_fs.h |   11 +
 2 files changed, 313 insertions(+), 106 deletions(-)

-- 

Thanks.
HATAYAMA, Daisuke

* makedumpfile mmap() benchmark
From: Cliff Wickman @ 2013-05-03 19:10 UTC (permalink / raw)
  To: kexec, linux-kernel
  Cc: vgoyal, ebiederm, kumagai-atsushi, lisa.mitchell, jingbai.ma, d.hatayama


> Jingbai Ma wrote on 27 Mar 2013:
> I have tested the makedumpfile mmap patch on a machine with 2TB memory, 
> here is testing results:
> Test environment:
> Machine: HP ProLiant DL980 G7 with 2TB RAM.
> CPU: Intel(R) Xeon(R) CPU E7- 2860  @ 2.27GHz (8 sockets, 10 cores)
> (Only 1 CPU was enabled in the 2nd kernel)
> Kernel: 3.9.0-rc3+ with mmap kernel patch v3
> vmcore size: 2.0TB
> Dump file size: 3.6GB
> makedumpfile mmap branch with parameters: -c --message-level 23 -d 31 
> --map-size <map-size>
> All measured time from debug message of makedumpfile.
> 
> As a comparison, I also have tested with original kernel and original 
> makedumpfile 1.5.1 and 1.5.3.
> I added all [Excluding unnecessary pages] and [Excluding free pages] 
> time together as "Filter Pages", and [Copying Data] as "Copy data" here.
> 
> makedumpfile  Kernel                 map-size (KB)  Filter pages (s)  Copy data (s)  Total (s)
> 1.5.1         3.7.0-0.36.el7.x86_64  N/A                      940.28        1269.25    2209.53
> 1.5.3         3.7.0-0.36.el7.x86_64  N/A                      380.09         992.77    1372.86
> 1.5.3         v3.9-rc3               N/A                      197.77         892.27    1090.04
> 1.5.3+mmap    v3.9-rc3+mmap          0                        164.87         606.06     770.93
> 1.5.3+mmap    v3.9-rc3+mmap          4                         88.62         576.07     664.69
> 1.5.3+mmap    v3.9-rc3+mmap          1024                      83.66         477.23     560.89
> 1.5.3+mmap    v3.9-rc3+mmap          2048                      83.44         477.21     560.65
> 1.5.3+mmap    v3.9-rc3+mmap          10240                     83.84         476.56     560.4

I have also tested the makedumpfile mmap patch on a machine with 2TB of
memory; here are the results:
Test environment:
Machine: SGI UV1000 with 2TB RAM.
CPU: Intel(R) Xeon(R) CPU E7- 8837  @ 2.67GHz
(only 1 cpu was enabled in the 2nd kernel)
Kernel: 3.0.13 with mmap kernel patch v3 (I had to tweak the patch a bit)
vmcore size: 2.0TB
Dump file size: 3.6GB
makedumpfile mmap branch with parameters: -c --message-level 23 -d 31 
   --map-size <map-size>
All measured times are actual clock times.
All tests are noncyclic.   Crash kernel memory: crashkernel=512M

Like Jingbai Ma, I also tested with an unpatched kernel and makedumpfile
1.5.1 and 1.5.3.  Those versions do two filtering scans, unnecessary
pages and free pages, whose times are added together here as the
filter-pages time.

makedumpfile  Kernel       map-size (KB)  Filter pages (s)  Copy data (s)  Total (s)
1.5.1         3.0.13       N/A                         671            511       1182
1.5.3         3.0.13       N/A                         294            535        829
1.5.3+mmap    3.0.13+mmap  0                            54            506        560
1.5.3+mmap    3.0.13+mmap  4096                         40            416        456
1.5.3+mmap    3.0.13+mmap  10240                        37            424        461

Using mmap for copying the data as well as for filtering pages made
little difference:
1.5.3+mmap    3.0.13+mmap  4096                         37            414        451

My results are quite similar to Jingbai Ma's.
The mmap patch to the kernel greatly speeds the filtering of pages, so
we at SGI would very much like to see this patch in the 3.10 kernel.
  http://marc.info/?l=linux-kernel&m=136627770125345&w=2

What puzzles me is that the patch greatly speeds up reads of /proc/vmcore
(where map-size is 0) as well as providing the mmap ability.  I can now
seek/read page structures almost as fast as mmap'ing and copying them
(versus Jingbai Ma's results, where mmap almost doubled the speed of reads).
I have put counters in to verify, and we are doing several million
seek/reads vs. a few thousand mmaps.  Yet the performance is similar
(54 sec vs. 37 sec, above).  I can't rationalize that much improvement.

Thanks,
Cliff Wickman

