From mboxrd@z Thu Jan 1 00:00:00 1970
Return-Path:
Received: (majordomo@vger.kernel.org) by vger.kernel.org via listexpand
	id S965069AbcJHF26 (ORCPT ); Sat, 8 Oct 2016 01:28:58 -0400
Received: from mail-pa0-f44.google.com ([209.85.220.44]:35119 "EHLO
	mail-pa0-f44.google.com" rhost-flags-OK-OK-OK-OK) by vger.kernel.org
	with ESMTP id S934857AbcJHF25 (ORCPT );
	Sat, 8 Oct 2016 01:28:57 -0400
From: Joel Fernandes <joelaf@google.com>
To: linux-kernel@vger.kernel.org
Cc: Steven Rostedt, Joel Fernandes
Subject: [RFC 0/7] pstore: Improve performance of ftrace backend with ramoops
Date: Fri, 7 Oct 2016 22:28:27 -0700
Message-Id: <1475904515-24970-1-git-send-email-joelaf@google.com>
X-Mailer: git-send-email 2.7.4
Sender: linux-kernel-owner@vger.kernel.org
List-ID:
X-Mailing-List: linux-kernel@vger.kernel.org

Here's an early RFC for a patch series that improves ftrace throughput
with ramoops. I am hoping to get some early comments, so I'm releasing
it in advance. It is functional and tested.

Currently ramoops uses a single zone to store function traces. To make
this work, it has to use locking to synchronize accesses to the buffer.
Recently the synchronization was moved entirely from a cmpxchg mechanism
to raw spinlocks, due to difficulties in using cmpxchg on uncached
memory and also on RAM behind PCIe [1]. This change further dropped the
performance of the ramoops pstore backend by more than half in my tests.

This patch series improves the situation dramatically, by around 280%
over the current state, by creating one ramoops persistent zone per CPU
and avoiding locking altogether for ftrace (a sketch of the resulting
write path follows the considerations below). At init time, the per-CPU
persistent zones are then merged together.

Here are some tests to show the improvements. Tested using a qemu
quad-core x86_64 instance with -mem-path to persist the guest RAM to a
file. I measured the average throughput of dd over 30 seconds:

  dd if=/dev/zero | pv | dd of=/dev/null

  Without this patch series:                     24 MB/s
  With per-CPU buffers and counter increment:    91.5 MB/s (~281% better)
  With per-CPU buffers and trace_clock():        51.9 MB/s

Some more considerations:

1. In order to merge the individual buffers, I am using racy counters,
since I didn't want to sacrifice throughput for perfect timestamps.
Using trace_clock() for the timestamps did the job, but at almost half
the throughput of the counter-based stamps (a sketch of the counter is
below).

2. Since the patches divide the available ftrace persistent space by
the number of CPUs, less space is now available per CPU; however, the
user is free to disable the per-CPU behavior and revert to the old
behavior through the PSTORE_PER_CPU flag. It's a space vs. performance
trade-off: for example, a 128 KiB ftrace region on a quad-core system
leaves only 32 KiB per CPU. So if the user has enough space and not too
many CPUs, per-CPU persistent buffers make sense for better performance.

3. Without any counters or timestamps at all, the improvement is even
greater (~140 MB/s), but then the buffers cannot be merged.
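
For reference, here is a minimal sketch of the lock-free write path the
series is aiming for. The identifiers below (max_ftrace_cnt, fprzs[])
are illustrative and may not match the patches exactly:

  /*
   * Sketch only: with one persistent zone per CPU, the ftrace write
   * path can index the zone array by CPU number and skip the spinlock
   * entirely, since no other CPU ever writes into that zone.
   */
  static int notrace ramoops_write_ftrace(struct ramoops_context *cxt,
                                          const void *buf, size_t size)
  {
          int zonenum = smp_processor_id() % cxt->max_ftrace_cnt;

          return persistent_ram_write(cxt->fprzs[zonenum], buf, size);
  }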
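
The racy counter from consideration 1 amounts to something like the
following. The increment is deliberately non-atomic: concurrent writers
may occasionally produce duplicate or out-of-order stamps, which is the
accepted price for throughput (again a sketch, with illustrative names):

  /* Global stamp for ftrace records; deliberately not atomic. */
  static u64 pstore_ftrace_stamp;

  static void notrace pstore_ftrace_set_stamp(struct pstore_ftrace_record *rec)
  {
          rec->ts = pstore_ftrace_stamp++;        /* racy by design */
  }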
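
The init-time merge is then conceptually an n-way merge on those stamps.
Roughly, with records_remain(), peek_record(), pop_record() and
emit_record() as hypothetical helpers:

  /*
   * Sketch of the merge: repeatedly pick the per-CPU zone whose next
   * record carries the smallest stamp and append that record to the
   * combined output, yielding one (approximately) ordered stream.
   */
  while (records_remain(zones, nr_zones)) {
          int i, best = 0;
          u64 best_ts = U64_MAX;

          for (i = 0; i < nr_zones; i++) {
                  struct pstore_ftrace_record *rec = peek_record(zones[i]);

                  if (rec && rec->ts <= best_ts) {
                          best_ts = rec->ts;
                          best = i;
                  }
          }
          emit_record(pop_record(zones[best]));
  }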

[1] https://lkml.org/lkml/2016/9/8/375

Joel Fernandes (7):
  pstore: Make spinlock per zone instead of global
  pstore: locking: dont lock unless caller asks to
  pstore: Remove case of PSTORE_TYPE_PMSG write using deprecated function
  pstore: Make ramoops_init_przs generic for other prz arrays
  ramoops: Split ftrace buffer space into per-CPU zones
  pstore: Add support to store timestamp counter in ftrace records
  pstore: Merge per-CPU ftrace zones into one zone for output

 fs/pstore/ftrace.c         |   3 +
 fs/pstore/inode.c          |   7 +-
 fs/pstore/internal.h       |  34 -------
 fs/pstore/ram.c            | 234 +++++++++++++++++++++++++++++++++----------
 fs/pstore/ram_core.c       |  30 +++---
 include/linux/pstore.h     |  69 +++++++++++++
 include/linux/pstore_ram.h |   6 +-
 7 files changed, 280 insertions(+), 103 deletions(-)

-- 
2.7.4