From mboxrd@z Thu Jan 1 00:00:00 1970 Return-Path: X-Spam-Checker-Version: SpamAssassin 3.4.0 (2014-02-07) on aws-us-west-2-korg-lkml-1.web.codeaurora.org X-Spam-Level: X-Spam-Status: No, score=-3.7 required=3.0 tests=BAYES_00, HEADER_FROM_DIFFERENT_DOMAINS,MAILING_LIST_MULTI,SPF_HELO_NONE,SPF_PASS, UNPARSEABLE_RELAY,URIBL_BLOCKED autolearn=no autolearn_force=no version=3.4.0 Received: from mail.kernel.org (mail.kernel.org [198.145.29.99]) by smtp.lore.kernel.org (Postfix) with ESMTP id 929A7C43461 for ; Thu, 10 Sep 2020 06:20:01 +0000 (UTC) Received: from isis.lip6.fr (isis.lip6.fr [132.227.60.2]) (using TLSv1.2 with cipher ECDHE-RSA-AES256-GCM-SHA384 (256/256 bits)) (No client certificate requested) by mail.kernel.org (Postfix) with ESMTPS id E583D207DE for ; Thu, 10 Sep 2020 06:20:00 +0000 (UTC) DMARC-Filter: OpenDMARC Filter v1.3.2 mail.kernel.org E583D207DE Authentication-Results: mail.kernel.org; dmarc=fail (p=none dis=none) header.from=collabora.com Authentication-Results: mail.kernel.org; spf=pass smtp.mailfrom=cocci-bounces@systeme.lip6.fr Received: from systeme.lip6.fr (systeme.lip6.fr [132.227.104.7]) by isis.lip6.fr (8.15.2/8.15.2) with ESMTP id 08A6JmDD012585; Thu, 10 Sep 2020 08:19:48 +0200 (CEST) Received: from systeme.lip6.fr (systeme.lip6.fr [127.0.0.1]) by systeme.lip6.fr (Postfix) with ESMTP id 8392E778C; Thu, 10 Sep 2020 08:19:48 +0200 (CEST) Received: from isis.lip6.fr (isis.lip6.fr [132.227.60.2]) (using TLSv1 with cipher DHE-RSA-AES256-SHA (256/256 bits)) (No client certificate requested) by systeme.lip6.fr (Postfix) with ESMTPS id DCD077595 for ; Thu, 10 Sep 2020 08:19:46 +0200 (CEST) Received: from bhuna.collabora.co.uk (bhuna.collabora.co.uk [IPv6:2a00:1098:0:82:1000:25:2eeb:e3e3]) by isis.lip6.fr (8.15.2/8.15.2) with ESMTP id 08A6Jkiq020727 for ; Thu, 10 Sep 2020 08:19:46 +0200 (CEST) Received: from [127.0.0.1] (localhost [127.0.0.1]) (Authenticated sender: jaskaran_singh) with ESMTPSA id 7E51C29AF39 Message-ID: <642305cb62e2920e1a02e5c6f623fcc073df240d.camel@collabora.com> From: Jaskaran Singh To: cocci@systeme.lip6.fr Date: Thu, 10 Sep 2020 11:49:41 +0530 In-Reply-To: <20200910061703.2397-1-jaskaran.singh@collabora.com> References: <20200910061703.2397-1-jaskaran.singh@collabora.com> Organization: Collabora User-Agent: Evolution 3.32.5 (3.32.5-1.fc30) MIME-Version: 1.0 X-Greylist: Sender IP whitelisted, Sender e-mail whitelisted, not delayed by milter-greylist-4.4.3 (isis.lip6.fr [132.227.60.2]); Thu, 10 Sep 2020 08:19:48 +0200 (CEST) X-Greylist: Sender passed SPF test, not delayed by milter-greylist-4.4.3 (isis.lip6.fr [IPv6:2001:660:3302:283c:0:0:0:2]); Thu, 10 Sep 2020 08:19:46 +0200 (CEST) X-Scanned-By: MIMEDefang 2.78 on 132.227.60.2 X-Scanned-By: MIMEDefang 2.78 Subject: Re: [Cocci] [PATCH v2 0/3] parsing_c: Optimize recursive header file parsing X-BeenThere: cocci@systeme.lip6.fr X-Mailman-Version: 2.1.13 Precedence: list List-Id: List-Unsubscribe: , List-Archive: List-Post: List-Help: List-Subscribe: , Content-Type: text/plain; charset="us-ascii" Content-Transfer-Encoding: 7bit Sender: cocci-bounces@systeme.lip6.fr Errors-To: cocci-bounces@systeme.lip6.fr On Thu, 2020-09-10 at 11:47 +0530, Jaskaran Singh wrote: > This patch series aims to optimize performance for recursively > parsing > header files in Coccinelle. > > Coccinelle's C parsing subsystem has an option called --recursive- > includes > to recursively parse header files. This is used for type > inference/annotation. > > Previously, using --recursive-includes on the entire Linux kernel > source > code would take far too long. On my computer with the following > specs, > - Processor: AMD Ryzen 5 3550H > - RAM: 8 GB > it would take close to 7 hours to complete. The optimization that > this > patch series implements reduces that time to 1 hour. > > The following is a high-level description of what has been > implemented: > - As header files are recursively parsed, they are scanned for the > following: > - fields of structs/unions/enums > - typedefs > - function prototypes > - global variables > The names of the above are stored in a "name cache", i.e. a > hashtable to > map the name to the files it is declared in. > - A dependency graph is built to determine dependencies between all > the > files in the codebase. > - In the type annotation phase of the C subsystem, if a function > call, > struct/union field or identifier is encountered, the type of which > is > not known to the annoter, the name cache is checked for the name. > - The name cache gives a list of files that the name is > declared/defined > in. These files are cross checked with the dependency graph to > determine if any of these are reachable by the file that the > annoter is > working on. > - If a reachable header file is found, that file is parsed and the > type > associated to the name is returned. > > Different approaches that were attempted to alleviate this issue, and > the > problems with each are as follows: > - Caching the most recently used files: A LRU cache to store ASTs of > the > most recently encountered header files. The problem with this > approach > is the amount of memory it takes to cache the header file ASTs. > - Caching the most troublesome files: A pseudo-LFU cache to store > files > that cumulatively take the longest to parse, and thus bloat the > time > taken. The problem with this approach is the amount of memory it > takes > to cache the header file ASTs. > - Skipping unparsable locations in header files: Skipping top-level > items > in a header file that cannot be parsed. This approach does not > produce > even close to the amount of optimization needed. > > The next step from here would be: > - Maintain a small but persistent cache of header files in groups of > directories. Leverage multiprocessing for parsing these header > files. > - Leverage multiprocessing to parse header files initially for name > extraction. > - Performing some initial matching with the semantic patch to > determine if > a C file matches. If matches are found, call the annoter and > recursively > parse header files for type annotation. > - Recursively parse all header files only once and build a large type > environment. Use the dependency graph to determine reachability. > This > has potential memory usage issues though. > > > Changes in v2: > -------------- > - Change occurences of 'begin' and 'match' on the same line with > something else > to the next line for better readability. > > > Makefile | 2 > parsing_c/includes_cache.ml | 286 > +++++++++++++++++++++++++++++++++++++++++++ > parsing_c/includes_cache.mli | 47 +++++++ > parsing_c/parse_c.ml | 27 +++- > parsing_c/type_annoter_c.ml | 130 ++++++++++++++++--- > 5 files changed, 466 insertions(+), 26 deletions(-) > Yikes, the diffstat is the old one. Here's the latest (not very different) Makefile | 2 parsing_c/includes_cache.ml | 290 +++++++++++++++++++++++++++++++++++++++++++ parsing_c/includes_cache.mli | 47 ++++++ parsing_c/parse_c.ml | 28 +++- parsing_c/type_annoter_c.ml | 134 +++++++++++++++++-- 5 files changed, 475 insertions(+), 26 deletions(-) > _______________________________________________ Cocci mailing list Cocci@systeme.lip6.fr https://systeme.lip6.fr/mailman/listinfo/cocci