All of lore.kernel.org
 help / color / mirror / Atom feed
From: Quentin Schulz <foss@0leil.net>
To: bitbake-devel@lists.openembedded.org,
	Richard Purdie <richard.purdie@linuxfoundation.org>
Subject: Re: [bitbake-devel] [PATCH 2/2] siggen: Change file format of siginfo files to use zstd compressed json
Date: Wed, 13 Oct 2021 23:08:07 +0200	[thread overview]
Message-ID: <EFA30A03-8DEA-4751-A51B-851C9B9EE7BB@0leil.net> (raw)
In-Reply-To: <20211013153540.2873636-2-richard.purdie@linuxfoundation.org>

Hi Richard,

On October 13, 2021 5:35:40 PM GMT+02:00, Richard Purdie <richard.purdie@linuxfoundation.org> wrote:
>Since OE is about to change to zstd compression of sstate, it would make it
>timely to convert the siginfo files from pickle which isn't reproducible
>to json which is both reproducible and also human readable. At the same time

I'm not sure this is actually true for python 3.6 according to my reading of the docs.

"""
Note

This module’s encoders and decoders preserve input and output order by default. Order is only lost if the underlying containers are unordered.

Prior to Python 3.7, dict was not guaranteed to be ordered, so inputs and outputs were typically scrambled unless collections.OrderedDict was specifically requested. Starting with Python 3.7, the regular dict became order preserving, so it is no longer necessary to specify collections.OrderedDict for JSON generation and parsing.

"""

C.f. second note https://docs.python.org/3.8/library/json.html

C.f. https://docs.python.org/3/whatsnew/3.6.html#whatsnew36-compactdict
Says it is behaving this way but shouldn't be relied upon.

It seems we need to make sure dict are ordered by using collections.OrderedDict instead?

In json.load, object_hook_pairs=collections.OrderedDict should do the trick afaiu?

Or I guess we could do the same thing as for sets but for dicts?

Note that OrderedDict is not alphabetically ordered but ordered by "time-of-insertion". I assume that with sort_keys=True passed to JSON.dump, if OrderedDict is used for JSON.load, the function will return a fully reproducible dictionary which will happen to be alphabetically sorted too.

I'm not entirely sure of my reading and understanding of the docs but wanted to bring this up.

Hope this isn't just noise :)

Cheers,
Quentin

>add zstd compression. This makes the siginfo files smaller, reprodubicle
>and easier to debug.
>
>Backwards compatibility mixing the two formats hasn't been supported since
>in reality if sstate changes at the same time, files will be in one format
>or the new one but comparing mixed formats won't make much sense.
>
>Since json doesn't support sets, we translate them into lists in the files
>themselves. We only use sets in bitbake since it makes things easier in
>the internal code, sorted lists are fine for the file format.
>
>[YOCTO #13973]
>
>Signed-off-by: Richard Purdie <richard.purdie@linuxfoundation.org>
>---
> lib/bb/siggen.py | 29 +++++++++++++++++------------
> 1 file changed, 17 insertions(+), 12 deletions(-)
>
>diff --git a/lib/bb/siggen.py b/lib/bb/siggen.py
>index 625a9cf3bb..f526792efd 100644
>--- a/lib/bb/siggen.py
>+++ b/lib/bb/siggen.py
>@@ -11,6 +11,8 @@ import pickle
> import bb.data
> import difflib
> import simplediff
>+import json
>+import bb.compress.zstd
> from bb.checksum import FileChecksumCache
> from bb import runqueue
> import hashserv
>@@ -19,6 +21,12 @@ import hashserv.client
> logger = logging.getLogger('BitBake.SigGen')
> hashequiv_logger = logging.getLogger('BitBake.SigGen.HashEquiv')
> 
>+class SetEncoder(json.JSONEncoder):
>+    def default(self, obj):
>+        if isinstance(obj, set):
>+            return list(sorted(obj))
>+        return json.JSONEncoder.default(self, obj)
>+
> def init(d):
>     siggens = [obj for obj in globals().values()
>                       if type(obj) is type and issubclass(obj, SignatureGenerator)]
>@@ -398,9 +406,9 @@ class SignatureGeneratorBasic(SignatureGenerator):
> 
>         fd, tmpfile = tempfile.mkstemp(dir=os.path.dirname(sigfile), prefix="sigtask.")
>         try:
>-            with os.fdopen(fd, "wb") as stream:
>-                p = pickle.dump(data, stream, -1)
>-                stream.flush()
>+            with bb.compress.zstd.open(fd, "wt", encoding="utf-8", num_threads=1) as f:
>+                json.dump(data, f, sort_keys=True, separators=(",", ":"), cls=SetEncoder)
>+                f.flush()
>             os.chmod(tmpfile, 0o664)
>             bb.utils.rename(tmpfile, sigfile)
>         except (OSError, IOError) as err:
>@@ -794,12 +802,10 @@ def compare_sigfiles(a, b, recursecb=None, color=False, collapsed=False):
>         formatparams.update(values)
>         return formatstr.format(**formatparams)
> 
>-    with open(a, 'rb') as f:
>-        p1 = pickle.Unpickler(f)
>-        a_data = p1.load()
>-    with open(b, 'rb') as f:
>-        p2 = pickle.Unpickler(f)
>-        b_data = p2.load()
>+    with bb.compress.zstd.open(a, "rt", encoding="utf-8", num_threads=1) as f:
>+        a_data = json.load(f)
>+    with bb.compress.zstd.open(b, "rt", encoding="utf-8", num_threads=1) as f:
>+        b_data = json.load(f)
> 
>     def dict_diff(a, b, whitelist=set()):
>         sa = set(a.keys())
>@@ -1031,9 +1037,8 @@ def calc_taskhash(sigdata):
> def dump_sigfile(a):
>     output = []
> 
>-    with open(a, 'rb') as f:
>-        p1 = pickle.Unpickler(f)
>-        a_data = p1.load()
>+    with bb.compress.zstd.open(a, "rt", encoding="utf-8", num_threads=1) as f:
>+        a_data = json.load(f)
> 
>     output.append("basewhitelist: %s" % (a_data['basewhitelist']))
> 


  reply	other threads:[~2021-10-13 21:08 UTC|newest]

Thread overview: 4+ messages / expand[flat|nested]  mbox.gz  Atom feed  top
2021-10-13 15:35 [PATCH 1/2] compress: Allow to operate on file descriptor Richard Purdie
2021-10-13 15:35 ` [PATCH 2/2] siggen: Change file format of siginfo files to use zstd compressed json Richard Purdie
2021-10-13 21:08   ` Quentin Schulz [this message]
     [not found]     ` <44a0906e6279b9ab34d29f8129d54b598b783397.camel@linuxfoundation.org>
2021-10-14  8:00       ` [bitbake-devel] " Quentin Schulz

Reply instructions:

You may reply publicly to this message via plain-text email
using any one of the following methods:

* Save the following mbox file, import it into your mail client,
  and reply-to-all from there: mbox

  Avoid top-posting and favor interleaved quoting:
  https://en.wikipedia.org/wiki/Posting_style#Interleaved_style

* Reply using the --to, --cc, and --in-reply-to
  switches of git-send-email(1):

  git send-email \
    --in-reply-to=EFA30A03-8DEA-4751-A51B-851C9B9EE7BB@0leil.net \
    --to=foss@0leil.net \
    --cc=bitbake-devel@lists.openembedded.org \
    --cc=richard.purdie@linuxfoundation.org \
    /path/to/YOUR_REPLY

  https://kernel.org/pub/software/scm/git/docs/git-send-email.html

* If your mail client supports setting the In-Reply-To header
  via mailto: links, try the mailto: link
Be sure your reply has a Subject: header at the top and a blank line before the message body.
This is an external index of several public inboxes,
see mirroring instructions on how to clone and mirror
all data and code used by this external index.