bitbake-devel.lists.openembedded.org archive mirror
 help / color / mirror / Atom feed
* [PATCH 1/2] compress: Allow to operate on file descriptor
@ 2021-10-13 15:35 Richard Purdie
  2021-10-13 15:35 ` [PATCH 2/2] siggen: Change file format of siginfo files to use zstd compressed json Richard Purdie
  0 siblings, 1 reply; 4+ messages in thread
From: Richard Purdie @ 2021-10-13 15:35 UTC (permalink / raw)
  To: bitbake-devel

The code works fine if we pass a file descriptor in and we need to
do this from the siggen code so add that as a valid input.

Signed-off-by: Richard Purdie <richard.purdie@linuxfoundation.org>
---
 lib/bb/compress/_pipecompress.py | 2 +-
 1 file changed, 1 insertion(+), 1 deletion(-)

diff --git a/lib/bb/compress/_pipecompress.py b/lib/bb/compress/_pipecompress.py
index 4b9f662143..5de17a82e2 100644
--- a/lib/bb/compress/_pipecompress.py
+++ b/lib/bb/compress/_pipecompress.py
@@ -49,7 +49,7 @@ def open_wrap(
             raise ValueError("Argument 'newline' not supported in binary mode")
 
     file_mode = mode.replace("t", "")
-    if isinstance(filename, (str, bytes, os.PathLike)):
+    if isinstance(filename, (str, bytes, os.PathLike, int)):
         binary_file = cls(filename, file_mode, **kwargs)
     elif hasattr(filename, "read") or hasattr(filename, "write"):
         binary_file = cls(None, file_mode, fileobj=filename, **kwargs)
-- 
2.32.0



^ permalink raw reply related	[flat|nested] 4+ messages in thread

* [PATCH 2/2] siggen: Change file format of siginfo files to use zstd compressed json
  2021-10-13 15:35 [PATCH 1/2] compress: Allow to operate on file descriptor Richard Purdie
@ 2021-10-13 15:35 ` Richard Purdie
  2021-10-13 21:08   ` [bitbake-devel] " Quentin Schulz
  0 siblings, 1 reply; 4+ messages in thread
From: Richard Purdie @ 2021-10-13 15:35 UTC (permalink / raw)
  To: bitbake-devel

Since OE is about to change to zstd compression of sstate, it would make it
timely to convert the siginfo files from pickle which isn't reproducible
to json which is both reproducible and also human readable. At the same time
add zstd compression. This makes the siginfo files smaller, reprodubicle
and easier to debug.

Backwards compatibility mixing the two formats hasn't been supported since
in reality if sstate changes at the same time, files will be in one format
or the new one but comparing mixed formats won't make much sense.

Since json doesn't support sets, we translate them into lists in the files
themselves. We only use sets in bitbake since it makes things easier in
the internal code, sorted lists are fine for the file format.

[YOCTO #13973]

Signed-off-by: Richard Purdie <richard.purdie@linuxfoundation.org>
---
 lib/bb/siggen.py | 29 +++++++++++++++++------------
 1 file changed, 17 insertions(+), 12 deletions(-)

diff --git a/lib/bb/siggen.py b/lib/bb/siggen.py
index 625a9cf3bb..f526792efd 100644
--- a/lib/bb/siggen.py
+++ b/lib/bb/siggen.py
@@ -11,6 +11,8 @@ import pickle
 import bb.data
 import difflib
 import simplediff
+import json
+import bb.compress.zstd
 from bb.checksum import FileChecksumCache
 from bb import runqueue
 import hashserv
@@ -19,6 +21,12 @@ import hashserv.client
 logger = logging.getLogger('BitBake.SigGen')
 hashequiv_logger = logging.getLogger('BitBake.SigGen.HashEquiv')
 
+class SetEncoder(json.JSONEncoder):
+    def default(self, obj):
+        if isinstance(obj, set):
+            return list(sorted(obj))
+        return json.JSONEncoder.default(self, obj)
+
 def init(d):
     siggens = [obj for obj in globals().values()
                       if type(obj) is type and issubclass(obj, SignatureGenerator)]
@@ -398,9 +406,9 @@ class SignatureGeneratorBasic(SignatureGenerator):
 
         fd, tmpfile = tempfile.mkstemp(dir=os.path.dirname(sigfile), prefix="sigtask.")
         try:
-            with os.fdopen(fd, "wb") as stream:
-                p = pickle.dump(data, stream, -1)
-                stream.flush()
+            with bb.compress.zstd.open(fd, "wt", encoding="utf-8", num_threads=1) as f:
+                json.dump(data, f, sort_keys=True, separators=(",", ":"), cls=SetEncoder)
+                f.flush()
             os.chmod(tmpfile, 0o664)
             bb.utils.rename(tmpfile, sigfile)
         except (OSError, IOError) as err:
@@ -794,12 +802,10 @@ def compare_sigfiles(a, b, recursecb=None, color=False, collapsed=False):
         formatparams.update(values)
         return formatstr.format(**formatparams)
 
-    with open(a, 'rb') as f:
-        p1 = pickle.Unpickler(f)
-        a_data = p1.load()
-    with open(b, 'rb') as f:
-        p2 = pickle.Unpickler(f)
-        b_data = p2.load()
+    with bb.compress.zstd.open(a, "rt", encoding="utf-8", num_threads=1) as f:
+        a_data = json.load(f)
+    with bb.compress.zstd.open(b, "rt", encoding="utf-8", num_threads=1) as f:
+        b_data = json.load(f)
 
     def dict_diff(a, b, whitelist=set()):
         sa = set(a.keys())
@@ -1031,9 +1037,8 @@ def calc_taskhash(sigdata):
 def dump_sigfile(a):
     output = []
 
-    with open(a, 'rb') as f:
-        p1 = pickle.Unpickler(f)
-        a_data = p1.load()
+    with bb.compress.zstd.open(a, "rt", encoding="utf-8", num_threads=1) as f:
+        a_data = json.load(f)
 
     output.append("basewhitelist: %s" % (a_data['basewhitelist']))
 
-- 
2.32.0



^ permalink raw reply related	[flat|nested] 4+ messages in thread

* Re: [bitbake-devel] [PATCH 2/2] siggen: Change file format of siginfo files to use zstd compressed json
  2021-10-13 15:35 ` [PATCH 2/2] siggen: Change file format of siginfo files to use zstd compressed json Richard Purdie
@ 2021-10-13 21:08   ` Quentin Schulz
       [not found]     ` <44a0906e6279b9ab34d29f8129d54b598b783397.camel@linuxfoundation.org>
  0 siblings, 1 reply; 4+ messages in thread
From: Quentin Schulz @ 2021-10-13 21:08 UTC (permalink / raw)
  To: bitbake-devel, Richard Purdie

Hi Richard,

On October 13, 2021 5:35:40 PM GMT+02:00, Richard Purdie <richard.purdie@linuxfoundation.org> wrote:
>Since OE is about to change to zstd compression of sstate, it would make it
>timely to convert the siginfo files from pickle which isn't reproducible
>to json which is both reproducible and also human readable. At the same time

I'm not sure this is actually true for python 3.6 according to my reading of the docs.

"""
Note

This module’s encoders and decoders preserve input and output order by default. Order is only lost if the underlying containers are unordered.

Prior to Python 3.7, dict was not guaranteed to be ordered, so inputs and outputs were typically scrambled unless collections.OrderedDict was specifically requested. Starting with Python 3.7, the regular dict became order preserving, so it is no longer necessary to specify collections.OrderedDict for JSON generation and parsing.

"""

C.f. second note https://docs.python.org/3.8/library/json.html

C.f. https://docs.python.org/3/whatsnew/3.6.html#whatsnew36-compactdict
Says it is behaving this way but shouldn't be relied upon.

It seems we need to make sure dict are ordered by using collections.OrderedDict instead?

In json.load, object_hook_pairs=collections.OrderedDict should do the trick afaiu?

Or I guess we could do the same thing as for sets but for dicts?

Note that OrderedDict is not alphabetically ordered but ordered by "time-of-insertion". I assume that with sort_keys=True passed to JSON.dump, if OrderedDict is used for JSON.load, the function will return a fully reproducible dictionary which will happen to be alphabetically sorted too.

I'm not entirely sure of my reading and understanding of the docs but wanted to bring this up.

Hope this isn't just noise :)

Cheers,
Quentin

>add zstd compression. This makes the siginfo files smaller, reprodubicle
>and easier to debug.
>
>Backwards compatibility mixing the two formats hasn't been supported since
>in reality if sstate changes at the same time, files will be in one format
>or the new one but comparing mixed formats won't make much sense.
>
>Since json doesn't support sets, we translate them into lists in the files
>themselves. We only use sets in bitbake since it makes things easier in
>the internal code, sorted lists are fine for the file format.
>
>[YOCTO #13973]
>
>Signed-off-by: Richard Purdie <richard.purdie@linuxfoundation.org>
>---
> lib/bb/siggen.py | 29 +++++++++++++++++------------
> 1 file changed, 17 insertions(+), 12 deletions(-)
>
>diff --git a/lib/bb/siggen.py b/lib/bb/siggen.py
>index 625a9cf3bb..f526792efd 100644
>--- a/lib/bb/siggen.py
>+++ b/lib/bb/siggen.py
>@@ -11,6 +11,8 @@ import pickle
> import bb.data
> import difflib
> import simplediff
>+import json
>+import bb.compress.zstd
> from bb.checksum import FileChecksumCache
> from bb import runqueue
> import hashserv
>@@ -19,6 +21,12 @@ import hashserv.client
> logger = logging.getLogger('BitBake.SigGen')
> hashequiv_logger = logging.getLogger('BitBake.SigGen.HashEquiv')
> 
>+class SetEncoder(json.JSONEncoder):
>+    def default(self, obj):
>+        if isinstance(obj, set):
>+            return list(sorted(obj))
>+        return json.JSONEncoder.default(self, obj)
>+
> def init(d):
>     siggens = [obj for obj in globals().values()
>                       if type(obj) is type and issubclass(obj, SignatureGenerator)]
>@@ -398,9 +406,9 @@ class SignatureGeneratorBasic(SignatureGenerator):
> 
>         fd, tmpfile = tempfile.mkstemp(dir=os.path.dirname(sigfile), prefix="sigtask.")
>         try:
>-            with os.fdopen(fd, "wb") as stream:
>-                p = pickle.dump(data, stream, -1)
>-                stream.flush()
>+            with bb.compress.zstd.open(fd, "wt", encoding="utf-8", num_threads=1) as f:
>+                json.dump(data, f, sort_keys=True, separators=(",", ":"), cls=SetEncoder)
>+                f.flush()
>             os.chmod(tmpfile, 0o664)
>             bb.utils.rename(tmpfile, sigfile)
>         except (OSError, IOError) as err:
>@@ -794,12 +802,10 @@ def compare_sigfiles(a, b, recursecb=None, color=False, collapsed=False):
>         formatparams.update(values)
>         return formatstr.format(**formatparams)
> 
>-    with open(a, 'rb') as f:
>-        p1 = pickle.Unpickler(f)
>-        a_data = p1.load()
>-    with open(b, 'rb') as f:
>-        p2 = pickle.Unpickler(f)
>-        b_data = p2.load()
>+    with bb.compress.zstd.open(a, "rt", encoding="utf-8", num_threads=1) as f:
>+        a_data = json.load(f)
>+    with bb.compress.zstd.open(b, "rt", encoding="utf-8", num_threads=1) as f:
>+        b_data = json.load(f)
> 
>     def dict_diff(a, b, whitelist=set()):
>         sa = set(a.keys())
>@@ -1031,9 +1037,8 @@ def calc_taskhash(sigdata):
> def dump_sigfile(a):
>     output = []
> 
>-    with open(a, 'rb') as f:
>-        p1 = pickle.Unpickler(f)
>-        a_data = p1.load()
>+    with bb.compress.zstd.open(a, "rt", encoding="utf-8", num_threads=1) as f:
>+        a_data = json.load(f)
> 
>     output.append("basewhitelist: %s" % (a_data['basewhitelist']))
> 


^ permalink raw reply	[flat|nested] 4+ messages in thread

* Re: [bitbake-devel] [PATCH 2/2] siggen: Change file format of siginfo files to use zstd compressed json
       [not found]     ` <44a0906e6279b9ab34d29f8129d54b598b783397.camel@linuxfoundation.org>
@ 2021-10-14  8:00       ` Quentin Schulz
  0 siblings, 0 replies; 4+ messages in thread
From: Quentin Schulz @ 2021-10-14  8:00 UTC (permalink / raw)
  To: Richard Purdie; +Cc: Quentin Schulz, bitbake-devel

Hi Richard,

On Wed, Oct 13, 2021 at 11:02:39PM +0100, Richard Purdie wrote:
> On Wed, 2021-10-13 at 23:08 +0200, Quentin Schulz wrote:
> > Hi Richard,
> > 
> > On October 13, 2021 5:35:40 PM GMT+02:00, Richard Purdie
> > <richard.purdie@linuxfoundation.org> wrote:
> > > Since OE is about to change to zstd compression of sstate, it would make it
> > > timely to convert the siginfo files from pickle which isn't reproducible
> > > to json which is both reproducible and also human readable. At the same time
> > 
> > I'm not sure this is actually true for python 3.6 according to my reading of
> > the docs.
> > 
> > """
> > Note
> > 
> > This module’s encoders and decoders preserve input and output order by
> > default. Order is only lost if the underlying containers are unordered.
> > 
> > Prior to Python 3.7, dict was not guaranteed to be ordered, so inputs and
> > outputs were typically scrambled unless collections.OrderedDict was
> > specifically requested. Starting with Python 3.7, the regular dict became
> > order preserving, so it is no longer necessary to specify
> > collections.OrderedDict for JSON generation and parsing.
> > 
> > """
> > 
> > C.f. second note https://urldefense.proofpoint.com/v2/url?u=https-3A__docs.python.org_3.8_library_json.html&d=DwIDaQ&c=_sEr5x9kUWhuk4_nFwjJtA&r=LYjLexDn7rXIzVmkNPvw5ymA1XTSqHGq8yBP6m6qZZ4njZguQhZhkI_-172IIy1t&m=9L-zhcxLgvsuEdoQhhdpX3Wl2Kq17Ua7Dv0zjWLo3x8FrOOKUrlGj0SSwTEs8huq&s=eutqZsapNnq6RK-NosLiiqHQX8xc-U5riHIjtk4nbrY&e= 
> > 
> > C.f. https://urldefense.proofpoint.com/v2/url?u=https-3A__docs.python.org_3_whatsnew_3.6.html-23whatsnew36-2Dcompactdict&d=DwIDaQ&c=_sEr5x9kUWhuk4_nFwjJtA&r=LYjLexDn7rXIzVmkNPvw5ymA1XTSqHGq8yBP6m6qZZ4njZguQhZhkI_-172IIy1t&m=9L-zhcxLgvsuEdoQhhdpX3Wl2Kq17Ua7Dv0zjWLo3x8FrOOKUrlGj0SSwTEs8huq&s=WTUEXJ-RwwYdmjzjvsTMIdd7V1CPnbqozRu1Xo3MMAQ&e= 
> > Says it is behaving this way but shouldn't be relied upon.
> > 
> > It seems we need to make sure dict are ordered by using
> > collections.OrderedDict instead?
> > 
> > In json.load, object_hook_pairs=collections.OrderedDict should do the trick
> > afaiu?
> > 
> > Or I guess we could do the same thing as for sets but for dicts?
> > 
> > Note that OrderedDict is not alphabetically ordered but ordered by "time-of-
> > insertion". I assume that with sort_keys=True passed to JSON.dump, if
> > OrderedDict is used for JSON.load, the function will return a fully
> > reproducible dictionary which will happen to be alphabetically sorted too.
> > 
> > I'm not entirely sure of my reading and understanding of the docs but wanted
> > to bring this up.
> > 
> > Hope this isn't just noise :)
> 
> My understanding is that by having sort_keys=True passed to json, the dict's 
> keys are sorted and this does make things reproducible for our uses?
> 

I know nothing of the code you're modifying so it is possible I'm
voicing concern for something that does not apply to bitbake in this
scenario.

Serializing seems ok, I have my doubts on deserializing though. However,
I'm not sure non-reproducible deserialized dict is an issue if the
serialized dict is the only thing we care about (well, dict into reproducible
serialized JSON). If we only care about reproducibility of the JSON
file, then you can probably safely ignore the rest of this mail :)

You're using json.load without any argument, therefore this is the code
you're using:
https://github.com/python/cpython/blob/3.6/Lib/json/__init__.py#L274-L354
The last line being the decoder you're using, the default decoder
instanciated here:
https://github.com/python/cpython/blob/3.6/Lib/json/__init__.py#L241

https://github.com/python/cpython/blob/3.6/Lib/json/decoder.py#L291-L298
is what I'm worried about.

Specifically:

"""
This feature can be used to implement custom decoders that rely on the
order that the key and value pairs are decoded (for example,
collections.OrderedDict will remember the order of insertion).
"""

This seems to indicate that the resulting dict after a JSON
deserializing is not guaranteed to be reproducible. Therefore, depending
on the logic applied to this deserialized-from-JSON dict, that might be
an issue, re-serializion not being an issue there.

Cheers,
Quentin


^ permalink raw reply	[flat|nested] 4+ messages in thread

end of thread, other threads:[~2021-10-14  8:00 UTC | newest]

Thread overview: 4+ messages (download: mbox.gz / follow: Atom feed)
-- links below jump to the message on this page --
2021-10-13 15:35 [PATCH 1/2] compress: Allow to operate on file descriptor Richard Purdie
2021-10-13 15:35 ` [PATCH 2/2] siggen: Change file format of siginfo files to use zstd compressed json Richard Purdie
2021-10-13 21:08   ` [bitbake-devel] " Quentin Schulz
     [not found]     ` <44a0906e6279b9ab34d29f8129d54b598b783397.camel@linuxfoundation.org>
2021-10-14  8:00       ` Quentin Schulz

This is a public inbox, see mirroring instructions
for how to clone and mirror all data and code used for this inbox;
as well as URLs for NNTP newsgroup(s).