* [RFC 0/7] pstore: Improve performance of ftrace backend with ramoops
@ 2016-10-08  5:28 Joel Fernandes
  2016-10-08  5:28 ` [PATCH 1/7] pstore: Make spinlock per zone instead of global Joel Fernandes
                   ` (7 more replies)
  0 siblings, 8 replies; 22+ messages in thread
From: Joel Fernandes @ 2016-10-08  5:28 UTC (permalink / raw)
  To: linux-kernel; +Cc: Steven Rostedt, Joel Fernandes

Here's an early RFC for a patch series on improving ftrace throughput with
ramoops. I am hoping to get some early comments so I'm releasing it in advance.
It is functional and tested.

Currently ramoops uses a single zone to store function traces. To make this
work, it has to use locking to synchronize accesses to the buffers. Recently
the synchronization was completely moved from a cmpxchg mechanism to raw
spinlocks due to difficulties in using cmpxchg on uncached memory and also on
RAMs behind PCIe. [1] This change further dropped the performance of the
ramoops pstore backend by more than half in my tests.

This patch series improves throughput by around 280% over the current state
by creating a ramoops persistent zone for each CPU and avoiding locking
altogether for ftrace. At init time, the per-CPU zones are then merged
together.
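
To illustrate why a zone per CPU can drop the lock, here is a minimal
userspace sketch, not code from this series. It assumes each zone only ever
has one writer at a time (its own CPU), so an append never races with another
append to the same zone. The names zone_buf, zone_write, NR_ZONES and
ZONE_SIZE are made up for the example and do not match the pstore code.

```c
#include <stddef.h>
#include <stdint.h>
#include <string.h>

#define NR_ZONES  4   /* assumed number of CPUs, one zone each */
#define ZONE_SIZE 64  /* assumed bytes per zone (total space / NR_ZONES) */

/* Hypothetical per-CPU ring buffer; not the kernel's persistent_ram_zone. */
struct zone_buf {
	uint8_t data[ZONE_SIZE];
	size_t start;  /* offset of the next write */
	size_t size;   /* number of valid bytes, capped at ZONE_SIZE */
};

static struct zone_buf zones[NR_ZONES];

/*
 * Append a record to the zone owned by `cpu`, wrapping around at the end.
 * No lock is taken: the sketch assumes a single writer per zone.
 * Returns 0 on success, -1 on a bad CPU index or oversized record.
 */
int zone_write(unsigned int cpu, const void *rec, size_t len)
{
	const uint8_t *src = rec;
	struct zone_buf *z;
	size_t first;

	if (cpu >= NR_ZONES || len > ZONE_SIZE)
		return -1;

	z = &zones[cpu];

	/* Copy up to the end of the buffer, then the remainder at offset 0. */
	first = ZONE_SIZE - z->start;
	if (first > len)
		first = len;
	memcpy(z->data + z->start, src, first);
	memcpy(z->data, src + first, len - first);

	z->start = (z->start + len) % ZONE_SIZE;
	z->size = (z->size + len > ZONE_SIZE) ? ZONE_SIZE : z->size + len;
	return 0;
}
```

With a single shared zone, the same function would need a lock around the
copy and the offset update, which is the cost the per-CPU split avoids.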

Here are some tests to show the improvements. Tested using a qemu quad-core
x86_64 instance with -mem-path to persist the guest RAM to a file. I measured
the average throughput of dd over 30 seconds:

dd if=/dev/zero | pv | dd of=/dev/null

Without this patch series: 24 MB/s
With per-cpu buffers and counter increment: 91.5 MB/s (improvement of ~281%)
With per-cpu buffers and trace_clock: 51.9 MB/s
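
For reference, the percentages follow from the figures above as relative
improvement over the 24 MB/s baseline. The small helper below is only an
illustration of that arithmetic (the function and macro names are made up):
91.5 MB/s works out to +281.25%, 51.9 MB/s to +116.25%, and the trace_clock
variant reaches roughly 0.57 of the counter variant's throughput.

```c
/* Throughput figures quoted above, in MB/s. */
#define BASELINE_MBPS    24.0  /* without the series */
#define COUNTER_MBPS     91.5  /* per-CPU buffers + counter increment */
#define TRACE_CLOCK_MBPS 51.9  /* per-CPU buffers + trace_clock */

/* Relative improvement of `now` over `base`, in percent. */
double improvement_pct(double base, double now)
{
	return (now - base) / base * 100.0;
}

/* Ratio of throughput `a` to throughput `b`. */
double throughput_ratio(double a, double b)
{
	return a / b;
}
```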

Some more considerations:
1. In order to merge the individual buffers, I am using racy counters
since I didn't want to sacrifice throughput for perfect timestamps.
Using trace_clock() for timestamps did the job, but gave almost half the
throughput of counter-based timestamps.

2. Since the patches divide the available ftrace persistent space by the number
of CPUs, less space will now be available per CPU. However, the user is free to
disable the per-CPU behavior and revert to the old behavior via the
PSTORE_PER_CPU flag. It's a space vs. performance trade-off, so if the user has
enough space and not a lot of CPUs, then using per-CPU persistent buffers makes
sense for better performance.

3. Without using any counters or timestamps, the throughput is even higher
(~140 MB/s), but the buffers cannot be merged.
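
As a rough illustration of the merge step discussed in the considerations
above, the sketch below does a k-way merge of per-CPU zones ordered by a
stored counter value. It is a userspace approximation under stated
assumptions, not the implementation in patch 7/7: the record layout
(trace_rec), the zone descriptor, MAX_ZONES and merge_zones() are all
hypothetical, and records within a zone are assumed to already be in write
order. Because the counters are racy, the merged order is only approximate
across CPUs.

```c
#include <stddef.h>
#include <stdint.h>

#define MAX_ZONES 64  /* assumed upper bound on per-CPU zones */

/* Hypothetical record layout; the real pstore ftrace record differs. */
struct trace_rec {
	uint64_t ts;   /* counter value sampled at write time (may be racy) */
	uint32_t cpu;  /* CPU that wrote the record */
	uint64_t ip;   /* traced function address */
};

/* One per-CPU zone whose records are already in write order. */
struct zone {
	const struct trace_rec *recs;
	size_t count;
};

/*
 * Merge all zones into out[], ordered by ts. On equal ts, the zone with
 * the lower index is emitted first. Returns the number of records written,
 * or 0 if there are too many zones or out[] is too small.
 */
size_t merge_zones(const struct zone *zones, size_t nzones,
		   struct trace_rec *out, size_t out_cap)
{
	size_t cursor[MAX_ZONES] = {0};
	size_t total = 0, written = 0;

	if (nzones > MAX_ZONES)
		return 0;
	for (size_t i = 0; i < nzones; i++)
		total += zones[i].count;
	if (total > out_cap)
		return 0;

	while (written < total) {
		size_t best = nzones;  /* sentinel: no candidate chosen yet */

		/* Pick the zone whose next unread record has the smallest ts. */
		for (size_t i = 0; i < nzones; i++) {
			if (cursor[i] == zones[i].count)
				continue;  /* zone exhausted */
			if (best == nzones ||
			    zones[i].recs[cursor[i]].ts <
			    zones[best].recs[cursor[best]].ts)
				best = i;
		}
		out[written++] = zones[best].recs[cursor[best]++];
	}
	return written;
}
```

Without any counter or timestamp in the records (point 3), there is no key
to compare across zones, which is why the buffers cannot be merged in that
configuration.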

[1] https://lkml.org/lkml/2016/9/8/375

Joel Fernandes (7):
  pstore: Make spinlock per zone instead of global
  pstore: locking: dont lock unless caller asks to
  pstore: Remove case of PSTORE_TYPE_PMSG write using deprecated
    function
  pstore: Make ramoops_init_przs generic for other prz arrays
  ramoops: Split ftrace buffer space into per-CPU zones
  pstore: Add support to store timestamp counter in ftrace records
  pstore: Merge per-CPU ftrace zones into one zone for output

 fs/pstore/ftrace.c         |   3 +
 fs/pstore/inode.c          |   7 +-
 fs/pstore/internal.h       |  34 -------
 fs/pstore/ram.c            | 234 +++++++++++++++++++++++++++++++++++----------
 fs/pstore/ram_core.c       |  30 +++---
 include/linux/pstore.h     |  69 +++++++++++++
 include/linux/pstore_ram.h |   6 +-
 7 files changed, 280 insertions(+), 103 deletions(-)

-- 
2.7.4

* [PATCH 0/7] pstore: Improve performance of ftrace backend with ramoops
@ 2016-10-20  7:17 Joel Fernandes
  2016-10-20  7:17 ` [PATCH 7/7] pstore: Merge per-CPU ftrace zones into one zone for output Joel Fernandes
  0 siblings, 1 reply; 22+ messages in thread
From: Joel Fernandes @ 2016-10-20  7:17 UTC (permalink / raw)
  To: linux-kernel; +Cc: Kees Cook, Joel Fernandes

Currently ramoops uses a single zone to store function traces. To make this
work, it has to use locking to synchronize accesses to the buffers. Recently
the synchronization was completely moved from a cmpxchg mechanism to raw
spinlocks due to difficulties in using cmpxchg on uncached memory and also on
RAMs behind PCIe. [1] This change further dropped the performance of the
ramoops pstore backend by more than half in my tests.

This patch series improves throughput by around 280% over the current state
by creating a ramoops persistent zone for each CPU and avoiding locking
altogether for ftrace. At init time, the per-CPU zones are then merged
together.

Here are some tests to show the improvements. Tested using a qemu quad-core
x86_64 instance with -mem-path to persist the guest RAM to a file. I measured
the average throughput of dd over 30 seconds:

dd if=/dev/zero | pv | dd of=/dev/null

Without this patch series: 24 MB/s
With per-cpu buffers and trace_clock: 51.9 MB/s
With per-cpu buffers and counter increment: 91.5 MB/s (improvement of ~281%)

Changes since RFC [2]:
- improve commit message clarity for optional locking of zone buffers.
- use macro for better code clarity of locking requirements
- use kcalloc instead of kmalloc for allocating prz array
- print warning if pmsg calls write_buf instead of write_buf_user
- free zones properly for the ftrace per-CPU use case.

[1] https://lkml.org/lkml/2016/9/8/375
[2] https://lkml.org/lkml/2016/10/8/12

Joel Fernandes (7):
  pstore: Make spinlock per zone instead of global
  pstore: locking: dont lock unless caller asks to
  pstore: Warn for the case of PSTORE_TYPE_PMSG write using deprecated
    function
  pstore: Make ramoops_init_przs generic for other prz arrays
  ramoops: Split ftrace buffer space into per-CPU zones
  pstore: Add support to store timestamp counter in ftrace records
  pstore: Merge per-CPU ftrace zones into one zone for output

 fs/pstore/ftrace.c         |   3 +
 fs/pstore/inode.c          |   7 +-
 fs/pstore/internal.h       |  34 -------
 fs/pstore/ram.c            | 236 +++++++++++++++++++++++++++++++++++----------
 fs/pstore/ram_core.c       |  30 +++---
 include/linux/pstore.h     |  69 +++++++++++++
 include/linux/pstore_ram.h |  14 ++-
 7 files changed, 291 insertions(+), 102 deletions(-)

-- 
2.7.4

Thread overview: 22+ messages
2016-10-08  5:28 [RFC 0/7] pstore: Improve performance of ftrace backend with ramoops Joel Fernandes
2016-10-08  5:28 ` [PATCH 1/7] pstore: Make spinlock per zone instead of global Joel Fernandes
2016-10-08  5:44   ` Joel Fernandes
2016-10-10 23:44   ` Kees Cook
2016-10-08  5:28 ` [PATCH 2/7] pstore: locking: dont lock unless caller asks to Joel Fernandes
2016-10-10 23:48   ` Kees Cook
2016-10-11 14:41     ` Joel Fernandes
2016-10-08  5:28 ` [PATCH 3/7] pstore: Remove case of PSTORE_TYPE_PMSG write using deprecated function Joel Fernandes
2016-10-10 23:52   ` Kees Cook
2016-10-11 14:46     ` Joel Fernandes
2016-10-08  5:28 ` [PATCH 4/7] pstore: Make ramoops_init_przs generic for other prz arrays Joel Fernandes
2016-10-10 23:55   ` Kees Cook
2016-10-08  5:28 ` [PATCH 5/7] ramoops: Split ftrace buffer space into per-CPU zones Joel Fernandes
2016-10-09 17:15   ` Joel Fernandes
2016-10-10 23:59     ` Kees Cook
2016-10-11  0:00       ` Kees Cook
2016-10-16 17:40       ` Joel Fernandes
2016-10-18 20:37         ` Kees Cook
2016-10-08  5:28 ` [PATCH 6/7] pstore: Add support to store timestamp counter in ftrace records Joel Fernandes
2016-10-08  5:28 ` [PATCH 7/7] pstore: Merge per-CPU ftrace zones into one zone for output Joel Fernandes
2016-10-11  9:57 ` [RFC 0/7] pstore: Improve performance of ftrace backend with ramoops Steven Rostedt
2016-10-20  7:17 [PATCH " Joel Fernandes
2016-10-20  7:17 ` [PATCH 7/7] pstore: Merge per-CPU ftrace zones into one zone for output Joel Fernandes
