* [Buildroot] [git commit branch/2021.02.x] support/scripts/pkg-stats: reimplement CPE parsing in pkg-stats
@ 2022-04-04 12:39 Peter Korsgaard
From: Peter Korsgaard @ 2022-04-04 12:39 UTC (permalink / raw)
  To: buildroot

commit: https://git.buildroot.net/buildroot/commit/?id=b8d6d7c3d317e5b52686857b7d08ffe2533e8ff2
branch: https://git.buildroot.net/buildroot/commit/?id=refs/heads/2021.02.x

pkg-stats currently uses the services from support/scripts/cpedb.py to
match the CPE identifiers of packages with the official CPE database.

Unfortunately, the cpedb.py code uses regular ElementTree parsing,
which involves loading the full XML tree into memory. This causes the
pkg-stats process to consume a huge amount of memory:

thomas   1310458 85.2 21.4 3708952 3450164 pts/5 R+   16:04   0:33  |   |   \_ python3 ./support/scripts/pkg-stats

So, the pkg-stats process uses 3.7 GB of VSZ and 3.4 GB of RSS. This
causes the OOM killer to kick in on machines with relatively little
memory.

This commit reimplements the XML parsing needed to do the CPE matching
directly in pkg-stats, using the XMLParser functionality of
ElementTree, also called "streaming parsing". Thanks to this, we never
load the entire XML tree into RAM: we only stream it through the
parser and build a very simple list of all CPE identifiers. The
maximum memory consumption of pkg-stats is now:

thomas   1317511 74.2  0.9 381104 152224 pts/5   R+   16:08   0:17  |   |   \_ python3 ./support/scripts/pkg-stats

So, 381 MB of VSZ and 152 MB of RSS, i.e. roughly 22 times less RSS
than before.
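
For illustration, the streaming approach described above can be
sketched as a standalone script. This is only a minimal sketch, not the
code of this commit: the names CpeNameCollector, collect_cpe_names and
SAMPLE are hypothetical, the sample document is a hand-written stand-in
for the real NIST dictionary, and the use of a set and of an instance
attribute are choices made for the sketch.

```python
import io
import xml.etree.ElementTree as ET

CPE23_ITEM = "{http://scap.nist.gov/schema/cpe-extension/2.3}cpe23-item"


class CpeNameCollector:
    """Parser target (hypothetical name) recording each cpe23-item name."""

    def __init__(self):
        # Per-instance set: membership tests are O(1) and nothing is
        # shared between two parsers.
        self.names = set()

    def start(self, tag, attrib):
        # Called for every opening tag; no Element objects are built.
        if tag == CPE23_ITEM and "name" in attrib:
            self.names.add(attrib["name"])

    def close(self):
        # The value returned here is what XMLParser.close() returns.
        return self.names


def collect_cpe_names(fileobj, chunk_size=1024 * 1024):
    """Feed fileobj to the parser chunk by chunk and return the names."""
    parser = ET.XMLParser(target=CpeNameCollector())
    while True:
        chunk = fileobj.read(chunk_size)
        if not chunk:
            break
        parser.feed(chunk)
    return parser.close()


# Hand-written stand-in for the NIST dictionary (assumption).
SAMPLE = b"""<?xml version="1.0" encoding="UTF-8"?>
<cpe-list xmlns="http://cpe.mitre.org/dictionary/2.0"
          xmlns:cpe-23="http://scap.nist.gov/schema/cpe-extension/2.3">
  <cpe-item name="cpe:/a:busybox:busybox:1.33.0">
    <cpe-23:cpe23-item name="cpe:2.3:a:busybox:busybox:1.33.0:*:*:*:*:*:*:*"/>
  </cpe-item>
</cpe-list>
"""

# A tiny chunk size forces the parser to handle tags split across reads.
names = collect_cpe_names(io.BytesIO(SAMPLE), chunk_size=64)
```

Because the target only implements start() and close(), the parser
never allocates a tree; memory use is bounded by the chunk size plus
the collected names.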

The JSON output of pkg-stats for the full package set is identical
before and after this commit.

Now, one will probably wonder why this isn't changed directly in
cpedb.py. The reason is simple: cpedb.py is also used by
support/scripts/missing-cpe, which (for now) relies heavily on having
the ElementTree objects in memory to regenerate a snippet of XML
that allows us to submit new CPE entries to NIST.

So, future work could include one of these two options:

 (1) Re-integrate cpedb.py into missing-cpe directly, and live with
     two different ways of processing the CPE database.

 (2) Rewrite the missing-cpe logic to also be compatible with
     streaming parsing, which would allow this logic to be shared
     again between pkg-stats and missing-cpe.
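
As a rough idea of what option (2) might look like, the sketch below
streams the dictionary with ElementTree's iterparse(), keeps the
serialized XML of only the cpe-item elements that match a wanted
(vendor, product) pair, and discards everything else as it goes. This
is purely illustrative and is not how missing-cpe works today: the
function name find_items, the wanted_products parameter and the sample
data are hypothetical, and splitting the CPE name on ":" ignores
escaped colons.

```python
import io
import xml.etree.ElementTree as ET

DICT_NS = "{http://cpe.mitre.org/dictionary/2.0}"
CPE23_NS = "{http://scap.nist.gov/schema/cpe-extension/2.3}"


def find_items(fileobj, wanted_products):
    """Return {(vendor, product): cpe-item XML string} for wanted products.

    If a product appears several times, the last cpe-item seen is kept.
    """
    found = {}
    context = ET.iterparse(fileobj, events=("start", "end"))
    _, root = next(context)  # first event is the start of the root element
    for event, elem in context:
        if event != "end" or elem.tag != DICT_NS + "cpe-item":
            continue
        item23 = elem.find(CPE23_NS + "cpe23-item")
        if item23 is not None:
            # cpe:2.3:<part>:<vendor>:<product>:<version>:...
            fields = item23.get("name", "").split(":")
            if len(fields) > 4 and (fields[3], fields[4]) in wanted_products:
                key = (fields[3], fields[4])
                found[key] = ET.tostring(elem, encoding="unicode")
        # Drop the children processed so far so the tree never grows.
        root.clear()
    return found


# Hand-written stand-in for the NIST dictionary (assumption).
SAMPLE = b"""<?xml version="1.0" encoding="UTF-8"?>
<cpe-list xmlns="http://cpe.mitre.org/dictionary/2.0"
          xmlns:cpe-23="http://scap.nist.gov/schema/cpe-extension/2.3">
  <cpe-item name="cpe:/a:busybox:busybox:1.33.0">
    <cpe-23:cpe23-item name="cpe:2.3:a:busybox:busybox:1.33.0:*:*:*:*:*:*:*"/>
  </cpe-item>
  <cpe-item name="cpe:/a:openssl:openssl:1.1.1k">
    <cpe-23:cpe23-item name="cpe:2.3:a:openssl:openssl:1.1.1k:*:*:*:*:*:*:*"/>
  </cpe-item>
</cpe-list>
"""

found = find_items(io.BytesIO(SAMPLE), {("busybox", "busybox")})
```

Unlike a pure parser target, iterparse() still builds Element objects,
but only one cpe-item is alive at a time, so an XML snippet can be
regenerated from it without holding the full tree in memory.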

Signed-off-by: Thomas Petazzoni <thomas.petazzoni@bootlin.com>
[yann.morin.1998@free.fr:
  - add missing import of requests
  - import CPEDB_URL from cpedb, instead of duplicating it
  - fix flake8 errors
]
Signed-off-by: Yann E. MORIN <yann.morin.1998@free.fr>
(cherry picked from commit bd1798ad959a901ccf5009b0c199fa5470912cc2)
Signed-off-by: Peter Korsgaard <peter@korsgaard.com>
---
 support/scripts/pkg-stats | 41 +++++++++++++++++++++++++++++++++++++----
 1 file changed, 37 insertions(+), 4 deletions(-)

diff --git a/support/scripts/pkg-stats b/support/scripts/pkg-stats
index 491faa984f..1928c9950c 100755
--- a/support/scripts/pkg-stats
+++ b/support/scripts/pkg-stats
@@ -27,12 +27,16 @@ import re
 import subprocess
 import json
 import sys
+import time
+import gzip
+import xml.etree.ElementTree
+import requests
 
 brpath = os.path.normpath(os.path.join(os.path.dirname(__file__), "..", ".."))
 
 sys.path.append(os.path.join(brpath, "utils"))
 from getdeveloperlib import parse_developers  # noqa: E402
-from cpedb import CPEDB  # noqa: E402
+from cpedb import CPEDB_URL  # noqa: E402
 
 INFRA_RE = re.compile(r"\$\(eval \$\(([a-z-]*)-package\)\)")
 URL_RE = re.compile(r"\s*https?://\S*\s*$")
@@ -613,12 +617,41 @@ def check_package_cves(nvd_path, packages):
 
 
 def check_package_cpes(nvd_path, packages):
-    cpedb = CPEDB(nvd_path)
-    cpedb.get_xml_dict()
+    class CpeXmlParser:
+        cpes = []
+
+        def start(self, tag, attrib):
+            if tag == "{http://scap.nist.gov/schema/cpe-extension/2.3}cpe23-item":
+                self.cpes.append(attrib['name'])
+
+        def close(self):
+            return self.cpes
+
+    print("CPE: Setting up NIST dictionary")
+    if not os.path.exists(os.path.join(nvd_path, "cpe")):
+        os.makedirs(os.path.join(nvd_path, "cpe"))
+
+    cpe_dict_local = os.path.join(nvd_path, "cpe", os.path.basename(CPEDB_URL))
+    if not os.path.exists(cpe_dict_local) or os.stat(cpe_dict_local).st_mtime < time.time() - 86400:
+        print("CPE: Fetching xml manifest from [" + CPEDB_URL + "]")
+        cpe_dict = requests.get(CPEDB_URL)
+        open(cpe_dict_local, "wb").write(cpe_dict.content)
+
+    print("CPE: Unzipping xml manifest...")
+    nist_cpe_file = gzip.GzipFile(fileobj=open(cpe_dict_local, 'rb'))
+
+    parser = xml.etree.ElementTree.XMLParser(target=CpeXmlParser())
+    while True:
+        c = nist_cpe_file.read(1024*1024)
+        if not c:
+            break
+        parser.feed(c)
+    cpes = parser.close()
+
     for p in packages:
         if not p.cpeid:
             continue
-        if cpedb.find(p.cpeid):
+        if p.cpeid in cpes:
             p.status['cpe'] = ("ok", "verified CPE identifier")
         else:
             p.status['cpe'] = ("error", "CPE identifier unknown in CPE database")
_______________________________________________
buildroot mailing list
buildroot@buildroot.org
https://lists.buildroot.org/mailman/listinfo/buildroot
