All of lore.kernel.org
 help / color / mirror / Atom feed
* [Buildroot] [PATCH] core/instrumentation: filter-out packages that install identical files
@ 2019-01-03 16:04 Yann E. MORIN
  2019-01-03 21:37 ` Thomas Petazzoni
  0 siblings, 1 reply; 6+ messages in thread
From: Yann E. MORIN @ 2019-01-03 16:04 UTC (permalink / raw)
  To: buildroot

Currently, we check that no two packages write to the same files, as a
sanity check. We do so by checking which files were touched since the
end of the build (aka begining of the installation).

However, when the packages do install the exact same file, i,e, the
same content, we in fact do not really care what package had provided
said file.

In the past, we avoided that situation because we were md5sum-inf every
files before and after installation. Anything that changed was new or
modified, and everything that did not change was not modified (but could
have been reinstalled).

However, since 7fb6e78254 (core/instrumentation: shave minutes off the
build time), we're now using mtimes, and we're in the situation that the
exact same file installed by two-or-more packages is reported.

In such a situation, it is not very interesting to know what package
installed the file, because whatever the ordering, or whatever the
subset of said packages, we'd have ended up with the same file anyway.
One prominent case where this happens, if the fftw familly of pcackages,
that all install different libraries, but install the same set of
identical headers and some common utilities.

The rationale for 7fb6e78254 was that the impact on the build time was
horrible, so we can't revert it.

However, we can partially restore use of md5 to detect if the files were
actually modified or not, and limiting the md5sum-ing only to those
actually modified files. The comparison of the md5 is then offloaded to
later, during the final check-uniq-files calls.

Since the md5sum is run only on new files, they are cache-hot, and so
the md5 is not going to storage to fetch them (and if they were not
cache-hot, it would mean memory pressure for a reason or another, and
the system would already be slowed for other reasons).

Using a defconfig with a various set of ~236 packages, the build times
reported by running "time make >log.file 2>&1", on an otherwise unloaded
system, were (smallest value of 5 runs each):

    before:     35:15.92
    after:      35:22.03

Which is a slight overhead of just 6.11s, which is less than 26ms per
package. Also, the system, although pretty idle, was still doing
background work like fetching mails and such, so the overhead is not
even consistent across various measurements...

Signed-off-by: "Yann E. MORIN" <yann.morin.1998@free.fr>
Cc: Trent Piepho <tpiepho@impinj.com>
Cc: Thomas Petazzoni <thomas.petazzoni@bootlin.com>
Cc: Gwenhael Goavec-Merou <gwenhael.goavec-merou@trabucayre.com>
Cc: Peter Korsgaard <peter@korsgaard.com>
Cc: Arnout Vandecappelle <arnout@mind.be>

---
Note: as said above, the measurements were not consistent, and I just
reported the two smallest values. The two values usually differed by
less than 4s overall, and the two smallest values just exacerbate the
delta. I.e we know that the "before" can go as fast as that, but we
don't know if the "after" was the fastest.

I'm also interested in people actually having the overhead issue before
7fb6e78254, to test this new patch and report how they are impacted.
---
 package/pkg-generic.mk           | 10 ++++++----
 support/scripts/check-uniq-files | 34 ++++++++++++++++++++--------------
 support/scripts/size-stats       |  2 +-
 3 files changed, 27 insertions(+), 19 deletions(-)

diff --git a/package/pkg-generic.mk b/package/pkg-generic.mk
index f5cab2b9c2..7c6a995c19 100644
--- a/package/pkg-generic.mk
+++ b/package/pkg-generic.mk
@@ -64,12 +64,14 @@ GLOBAL_INSTRUMENTATION_HOOKS += step_time
 # $(3): suffix of file  (optional)
 define step_pkg_size_inner
 	@touch $(BUILD_DIR)/packages-file-list$(3).txt
-	$(SED) '/^$(1),/d' $(BUILD_DIR)/packages-file-list$(3).txt
+	$(SED) '/^$(1)  /d' $(BUILD_DIR)/packages-file-list$(3).txt
 	cd $(2); \
-	find . \( -type f -o -type l \) \
+	find . -xtype f \
 		-newer $($(PKG)_DIR)/.stamp_built \
-		-exec printf '$(1),%s\n' {} + \
-		>> $(BUILD_DIR)/packages-file-list$(3).txt
+		-print0 \
+	|xargs -0 -r md5sum \
+	|sed -r -e 's/^/$(1)  /' \
+	>> $(BUILD_DIR)/packages-file-list$(3).txt
 endef
 
 define step_pkg_size
diff --git a/support/scripts/check-uniq-files b/support/scripts/check-uniq-files
index fbc6b5d6e7..94deea01db 100755
--- a/support/scripts/check-uniq-files
+++ b/support/scripts/check-uniq-files
@@ -24,24 +24,30 @@ def main():
         sys.stderr.write('No type was provided\n')
         return False
 
-    file_to_pkg = defaultdict(list)
+    file_to_pkg = defaultdict(set)
+    file_md5 = defaultdict(set)
     with open(args.packages_file_list[0], 'rb') as pkg_file_list:
         for line in pkg_file_list.readlines():
-            pkg, _, file = line.rstrip(b'\n').partition(b',')
-            file_to_pkg[file].append(pkg)
+            pkg, md5, file = line.rstrip(b'\n').split(b'  ', 2)
+            file_to_pkg[file].add(pkg)
+            file_md5[file].add(md5)
 
     for file in file_to_pkg:
-        if len(file_to_pkg[file]) > 1:
-            # If possible, try to decode the binary strings with
-            # the default user's locale
-            try:
-                sys.stderr.write(warn.format(args.type, file.decode(),
-                                             [p.decode() for p in file_to_pkg[file]]))
-            except UnicodeDecodeError:
-                # ... but fallback to just dumping them raw if they
-                # contain non-representable chars
-                sys.stderr.write(warn.format(args.type, file,
-                                             file_to_pkg[file]))
+        if len(file_to_pkg[file]) == 1 or len(file_md5[file]) == 1:
+            # If only one package installed that file, or they all
+            # installed the same content, don't report that file.
+            continue
+
+        # If possible, try to decode the binary strings with
+        # the default user's locale
+        try:
+            sys.stderr.write(warn.format(args.type, file.decode(),
+                                         [p.decode() for p in file_to_pkg[file]]))
+        except UnicodeDecodeError:
+            # ... but fallback to just dumping them raw if they
+            # contain non-representable chars
+            sys.stderr.write(warn.format(args.type, file,
+                                         file_to_pkg[file]))
 
 
 if __name__ == "__main__":
diff --git a/support/scripts/size-stats b/support/scripts/size-stats
index deea92e278..f67fe1d268 100755
--- a/support/scripts/size-stats
+++ b/support/scripts/size-stats
@@ -68,7 +68,7 @@ def build_package_dict(builddir):
     filesdict = {}
     with open(os.path.join(builddir, "build", "packages-file-list.txt")) as filelistf:
         for l in filelistf.readlines():
-            pkg, fpath = l.split(",", 1)
+            pkg, _, fpath = l.split("  ", 2)
             # remove the initial './' in each file path
             fpath = fpath.strip()[2:]
             fullpath = os.path.join(builddir, "target", fpath)
-- 
2.14.1

^ permalink raw reply related	[flat|nested] 6+ messages in thread

* [Buildroot] [PATCH] core/instrumentation: filter-out packages that install identical files
  2019-01-03 16:04 [Buildroot] [PATCH] core/instrumentation: filter-out packages that install identical files Yann E. MORIN
@ 2019-01-03 21:37 ` Thomas Petazzoni
  2019-01-03 22:38   ` Yann E. MORIN
  0 siblings, 1 reply; 6+ messages in thread
From: Thomas Petazzoni @ 2019-01-03 21:37 UTC (permalink / raw)
  To: buildroot

Hello,

On Thu,  3 Jan 2019 17:04:31 +0100, Yann E. MORIN wrote:
> Currently, we check that no two packages write to the same files, as a
> sanity check. We do so by checking which files were touched since the
> end of the build (aka begining of the installation).
> 
> However, when the packages do install the exact same file, i,e, the
> same content, we in fact do not really care what package had provided
> said file.
> 
> In the past, we avoided that situation because we were md5sum-inf every
> files before and after installation. Anything that changed was new or
> modified, and everything that did not change was not modified (but could
> have been reinstalled).
> 
> However, since 7fb6e78254 (core/instrumentation: shave minutes off the
> build time), we're now using mtimes, and we're in the situation that the
> exact same file installed by two-or-more packages is reported.
> 
> In such a situation, it is not very interesting to know what package
> installed the file, because whatever the ordering, or whatever the
> subset of said packages, we'd have ended up with the same file anyway.
> One prominent case where this happens, if the fftw familly of pcackages,
> that all install different libraries, but install the same set of
> identical headers and some common utilities.
> 
> The rationale for 7fb6e78254 was that the impact on the build time was
> horrible, so we can't revert it.
> 
> However, we can partially restore use of md5 to detect if the files were
> actually modified or not, and limiting the md5sum-ing only to those
> actually modified files. The comparison of the md5 is then offloaded to
> later, during the final check-uniq-files calls.
> 
> Since the md5sum is run only on new files, they are cache-hot, and so
> the md5 is not going to storage to fetch them (and if they were not
> cache-hot, it would mean memory pressure for a reason or another, and
> the system would already be slowed for other reasons).
> 
> Using a defconfig with a various set of ~236 packages, the build times
> reported by running "time make >log.file 2>&1", on an otherwise unloaded
> system, were (smallest value of 5 runs each):
> 
>     before:     35:15.92
>     after:      35:22.03
> 
> Which is a slight overhead of just 6.11s, which is less than 26ms per
> package. Also, the system, although pretty idle, was still doing
> background work like fetching mails and such, so the overhead is not
> even consistent across various measurements...

Overall, I find the idea good and useful.

A couple of comments/questions (here and below):

 - I believe we should still fix the fact that the .la file timestamp
   is changed everytime we $(SED) the .la files, even if they are not
   changed. This will also help reduce the number of files to check
   using md5sum: a .la file should be fixed only once, and its
   timestamp never touched again.

 - Could we make a check-uniq-files error a hard build failure, as a
   follow-up patch from yours ? So far, check-uniq-files errors have
   been a bit useless, because they do not cause the build to fail, so
   nobody really cared.

> diff --git a/package/pkg-generic.mk b/package/pkg-generic.mk
> index f5cab2b9c2..7c6a995c19 100644
> --- a/package/pkg-generic.mk
> +++ b/package/pkg-generic.mk
> @@ -64,12 +64,14 @@ GLOBAL_INSTRUMENTATION_HOOKS += step_time
>  # $(3): suffix of file  (optional)
>  define step_pkg_size_inner
>  	@touch $(BUILD_DIR)/packages-file-list$(3).txt
> -	$(SED) '/^$(1),/d' $(BUILD_DIR)/packages-file-list$(3).txt
> +	$(SED) '/^$(1)  /d' $(BUILD_DIR)/packages-file-list$(3).txt

Why is the separator changed from , to space ?

>  	cd $(2); \
> -	find . \( -type f -o -type l \) \
> +	find . -xtype f \

Why are you changing from -type f -o -type l to -xtype f ? Is this
really related to the change ? Is -xtype supported in old version of
find ?

>  		-newer $($(PKG)_DIR)/.stamp_built \
> -		-exec printf '$(1),%s\n' {} + \
> -		>> $(BUILD_DIR)/packages-file-list$(3).txt  
> +		-print0 \
> +	|xargs -0 -r md5sum \

Ah, you're changing the separator to two spaces, because md5sum's output
already uses two spaces as a separator ?

This could break people who have scripts that parse the
package-file-list.txt contents. Maybe we should keep the compatibility,
and keep using , as a separator, and push the md5sum as an additional
third column ?

Like:
	|xargs -0 -r md5sum \
	| sed -r -e 's/([0-9a-f]{32})  (.*)/$(1),\2,\1/'

One drawback is that it won't work with filenames that contain a comma.
But in this case, we're forced to keep the filename field as the last
one, and therefore break backward compatibility of the
package-file-lists.txt format every time we want to add a new
information to it.

> +	|sed -r -e 's/^/$(1)  /' \
> +	>> $(BUILD_DIR)/packages-file-list$(3).txt
>  endef
>  
>  define step_pkg_size
> diff --git a/support/scripts/check-uniq-files b/support/scripts/check-uniq-files
> index fbc6b5d6e7..94deea01db 100755
> --- a/support/scripts/check-uniq-files
> +++ b/support/scripts/check-uniq-files
> @@ -24,24 +24,30 @@ def main():
>          sys.stderr.write('No type was provided\n')
>          return False
>  
> -    file_to_pkg = defaultdict(list)
> +    file_to_pkg = defaultdict(set)

You're changing from list to set, it's a bit unrelated. Should be a
separate commit ?

> +    file_md5 = defaultdict(set)
>      with open(args.packages_file_list[0], 'rb') as pkg_file_list:
>          for line in pkg_file_list.readlines():
> -            pkg, _, file = line.rstrip(b'\n').partition(b',')
> -            file_to_pkg[file].append(pkg)
> +            pkg, md5, file = line.rstrip(b'\n').split(b'  ', 2)

The ", 2" in split() is to make sure that if the filename contains two
consecutive spaces, we don't split it up ?

> +            file_to_pkg[file].add(pkg)
> +            file_md5[file].add(md5)
>  
>      for file in file_to_pkg:
> -        if len(file_to_pkg[file]) > 1:
> -            # If possible, try to decode the binary strings with
> -            # the default user's locale
> -            try:
> -                sys.stderr.write(warn.format(args.type, file.decode(),
> -                                             [p.decode() for p in file_to_pkg[file]]))
> -            except UnicodeDecodeError:
> -                # ... but fallback to just dumping them raw if they
> -                # contain non-representable chars
> -                sys.stderr.write(warn.format(args.type, file,
> -                                             file_to_pkg[file]))
> +        if len(file_to_pkg[file]) == 1 or len(file_md5[file]) == 1:
> +            # If only one package installed that file, or they all
> +            # installed the same content, don't report that file.
> +            continue
> +
> +        # If possible, try to decode the binary strings with
> +        # the default user's locale
> +        try:
> +            sys.stderr.write(warn.format(args.type, file.decode(),
> +                                         [p.decode() for p in file_to_pkg[file]]))
> +        except UnicodeDecodeError:
> +            # ... but fallback to just dumping them raw if they
> +            # contain non-representable chars
> +            sys.stderr.write(warn.format(args.type, file,
> +                                         file_to_pkg[file]))

I'm not sure why the whole test is inverted, it causes a lot of noise
in the diff. If there is a good reason for it, should be a separate
patch.

Best regards,

Thomas
-- 
Thomas Petazzoni, CTO, Bootlin
Embedded Linux and Kernel engineering
https://bootlin.com

^ permalink raw reply	[flat|nested] 6+ messages in thread

* [Buildroot] [PATCH] core/instrumentation: filter-out packages that install identical files
  2019-01-03 21:37 ` Thomas Petazzoni
@ 2019-01-03 22:38   ` Yann E. MORIN
  2019-01-04  9:24     ` Thomas Petazzoni
  0 siblings, 1 reply; 6+ messages in thread
From: Yann E. MORIN @ 2019-01-03 22:38 UTC (permalink / raw)
  To: buildroot

Thomas, All,

On 2019-01-03 22:37 +0100, Thomas Petazzoni spake thusly:
> On Thu,  3 Jan 2019 17:04:31 +0100, Yann E. MORIN wrote:
> > Currently, we check that no two packages write to the same files, as a
> > sanity check. We do so by checking which files were touched since the
> > end of the build (aka begining of the installation).
> > 
> > However, when the packages do install the exact same file, i,e, the
> > same content, we in fact do not really care what package had provided
> > said file.
> > 
> > In the past, we avoided that situation because we were md5sum-inf every
> > files before and after installation. Anything that changed was new or
> > modified, and everything that did not change was not modified (but could
> > have been reinstalled).
> > 
> > However, since 7fb6e78254 (core/instrumentation: shave minutes off the
> > build time), we're now using mtimes, and we're in the situation that the
> > exact same file installed by two-or-more packages is reported.
> > 
> > In such a situation, it is not very interesting to know what package
> > installed the file, because whatever the ordering, or whatever the
> > subset of said packages, we'd have ended up with the same file anyway.
> > One prominent case where this happens, if the fftw familly of pcackages,
> > that all install different libraries, but install the same set of
> > identical headers and some common utilities.
> > 
> > The rationale for 7fb6e78254 was that the impact on the build time was
> > horrible, so we can't revert it.
> > 
> > However, we can partially restore use of md5 to detect if the files were
> > actually modified or not, and limiting the md5sum-ing only to those
> > actually modified files. The comparison of the md5 is then offloaded to
> > later, during the final check-uniq-files calls.
> > 
> > Since the md5sum is run only on new files, they are cache-hot, and so
> > the md5 is not going to storage to fetch them (and if they were not
> > cache-hot, it would mean memory pressure for a reason or another, and
> > the system would already be slowed for other reasons).
> > 
> > Using a defconfig with a various set of ~236 packages, the build times
> > reported by running "time make >log.file 2>&1", on an otherwise unloaded
> > system, were (smallest value of 5 runs each):
> > 
> >     before:     35:15.92
> >     after:      35:22.03
> > 
> > Which is a slight overhead of just 6.11s, which is less than 26ms per
> > package. Also, the system, although pretty idle, was still doing
> > background work like fetching mails and such, so the overhead is not
> > even consistent across various measurements...
> 
> Overall, I find the idea good and useful.
> 
> A couple of comments/questions (here and below):
> 
>  - I believe we should still fix the fact that the .la file timestamp
>    is changed everytime we $(SED) the .la files, even if they are not
>    changed. This will also help reduce the number of files to check
>    using md5sum: a .la file should be fixed only once, and its
>    timestamp never touched again.

That is a different change, indeed, that should be done in the staging
install commands. I can tackle that.

>  - Could we make a check-uniq-files error a hard build failure, as a
>    follow-up patch from yours ? So far, check-uniq-files errors have
>    been a bit useless, because they do not cause the build to fail, so
>    nobody really cared.

That's fine, I already have the patch ready here: ;-)
    https://git.buildroot.org/~ymorin/git/buildroot/commit/?h=yem/misc&id=2c1ba76ee86a8f6bd9b9d4a67fa1e9bf28e840da

> > diff --git a/package/pkg-generic.mk b/package/pkg-generic.mk
> > index f5cab2b9c2..7c6a995c19 100644
> > --- a/package/pkg-generic.mk
> > +++ b/package/pkg-generic.mk
> > @@ -64,12 +64,14 @@ GLOBAL_INSTRUMENTATION_HOOKS += step_time
> >  # $(3): suffix of file  (optional)
> >  define step_pkg_size_inner
> >  	@touch $(BUILD_DIR)/packages-file-list$(3).txt
> > -	$(SED) '/^$(1),/d' $(BUILD_DIR)/packages-file-list$(3).txt
> > +	$(SED) '/^$(1)  /d' $(BUILD_DIR)/packages-file-list$(3).txt
> 
> Why is the separator changed from , to space ?

See below, to match the separator with md5.

> >  	cd $(2); \
> > -	find . \( -type f -o -type l \) \
> > +	find . -xtype f \
> 
> Why are you changing from -type f -o -type l to -xtype f ? Is this
> really related to the change ? Is -xtype supported in old version of
> find ?

-xtype is what we were using before 7fb6e78254:
    https://git.buildroot.org/buildroot/commit/package/pkg-generic.mk?id=7fb6e782542fc440c2da226ec4525236d0508b77

> >  		-newer $($(PKG)_DIR)/.stamp_built \
> > -		-exec printf '$(1),%s\n' {} + \
> > -		>> $(BUILD_DIR)/packages-file-list$(3).txt  
> > +		-print0 \
> > +	|xargs -0 -r md5sum \
> 
> Ah, you're changing the separator to two spaces, because md5sum's output
> already uses two spaces as a separator ?

Yes.

> This could break people who have scripts that parse the
> package-file-list.txt contents. Maybe we should keep the compatibility,
> and keep using , as a separator, and push the md5sum as an additional
> third column ?

Well, if they were splitting on the first comma, adding the md5 as third
column would also break their scripts.

Also, it is not sane to break on the comma after the file name, becuase
a filename is arbitrary. There should be nothing after it.

> Like:
> 	|xargs -0 -r md5sum \
> 	| sed -r -e 's/([0-9a-f]{32})  (.*)/$(1),\2,\1/'
> 
> One drawback is that it won't work with filenames that contain a comma.
> But in this case, we're forced to keep the filename field as the last
> one, and therefore break backward compatibility of the
> package-file-lists.txt format every time we want to add a new
> information to it.

In any case, we have to break backward compatibility. I actually thought
about it, but forgot to say so in the commit message, my bad...

> > +	|sed -r -e 's/^/$(1)  /' \
> > +	>> $(BUILD_DIR)/packages-file-list$(3).txt
> >  endef
> >  
> >  define step_pkg_size
> > diff --git a/support/scripts/check-uniq-files b/support/scripts/check-uniq-files
> > index fbc6b5d6e7..94deea01db 100755
> > --- a/support/scripts/check-uniq-files
> > +++ b/support/scripts/check-uniq-files
> > @@ -24,24 +24,30 @@ def main():
> >          sys.stderr.write('No type was provided\n')
> >          return False
> >  
> > -    file_to_pkg = defaultdict(list)
> > +    file_to_pkg = defaultdict(set)
> 
> You're changing from list to set, it's a bit unrelated. Should be a
> separate commit ?

Yep, will do.

> > +    file_md5 = defaultdict(set)
> >      with open(args.packages_file_list[0], 'rb') as pkg_file_list:
> >          for line in pkg_file_list.readlines():
> > -            pkg, _, file = line.rstrip(b'\n').partition(b',')
> > -            file_to_pkg[file].append(pkg)
> > +            pkg, md5, file = line.rstrip(b'\n').split(b'  ', 2)
> 
> The ", 2" in split() is to make sure that if the filename contains two
> consecutive spaces, we don't split it up ?

Exactly. We know we have exactly three fields, so we need two splits.

> > +            file_to_pkg[file].add(pkg)
> > +            file_md5[file].add(md5)
> >  
> >      for file in file_to_pkg:
> > -        if len(file_to_pkg[file]) > 1:
> > -            # If possible, try to decode the binary strings with
> > -            # the default user's locale
> > -            try:
> > -                sys.stderr.write(warn.format(args.type, file.decode(),
> > -                                             [p.decode() for p in file_to_pkg[file]]))
> > -            except UnicodeDecodeError:
> > -                # ... but fallback to just dumping them raw if they
> > -                # contain non-representable chars
> > -                sys.stderr.write(warn.format(args.type, file,
> > -                                             file_to_pkg[file]))
> > +        if len(file_to_pkg[file]) == 1 or len(file_md5[file]) == 1:
> > +            # If only one package installed that file, or they all
> > +            # installed the same content, don't report that file.
> > +            continue
> > +
> > +        # If possible, try to decode the binary strings with
> > +        # the default user's locale
> > +        try:
> > +            sys.stderr.write(warn.format(args.type, file.decode(),
> > +                                         [p.decode() for p in file_to_pkg[file]]))
> > +        except UnicodeDecodeError:
> > +            # ... but fallback to just dumping them raw if they
> > +            # contain non-representable chars
> > +            sys.stderr.write(warn.format(args.type, file,
> > +                                         file_to_pkg[file]))
> 
> I'm not sure why the whole test is inverted, it causes a lot of noise
> in the diff. If there is a good reason for it, should be a separate
> patch.

Sure, will do.

Thanks!

Regards,
Yann E. MORIN.

> Best regards,
> 
> Thomas
> -- 
> Thomas Petazzoni, CTO, Bootlin
> Embedded Linux and Kernel engineering
> https://bootlin.com

-- 
.-----------------.--------------------.------------------.--------------------.
|  Yann E. MORIN  | Real-Time Embedded | /"\ ASCII RIBBON | Erics' conspiracy: |
| +33 662 376 056 | Software  Designer | \ / CAMPAIGN     |  ___               |
| +33 223 225 172 `------------.-------:  X  AGAINST      |  \e/  There is no  |
| http://ymorin.is-a-geek.org/ | _/*\_ | / \ HTML MAIL    |   v   conspiracy.  |
'------------------------------^-------^------------------^--------------------'

^ permalink raw reply	[flat|nested] 6+ messages in thread

* [Buildroot] [PATCH] core/instrumentation: filter-out packages that install identical files
  2019-01-03 22:38   ` Yann E. MORIN
@ 2019-01-04  9:24     ` Thomas Petazzoni
  2019-01-04  9:51       ` Yann E. MORIN
  0 siblings, 1 reply; 6+ messages in thread
From: Thomas Petazzoni @ 2019-01-04  9:24 UTC (permalink / raw)
  To: buildroot

Hello,

On Thu, 3 Jan 2019 23:38:32 +0100, Yann E. MORIN wrote:

> > A couple of comments/questions (here and below):
> > 
> >  - I believe we should still fix the fact that the .la file timestamp
> >    is changed everytime we $(SED) the .la files, even if they are not
> >    changed. This will also help reduce the number of files to check
> >    using md5sum: a .la file should be fixed only once, and its
> >    timestamp never touched again.  
> 
> That is a different change, indeed, that should be done in the staging
> install commands. I can tackle that.

It is a different change, but a change we can easily forget once this
md5 logic is in place, as we will no longer notice that those files are
rewritten over and over again.

> >  - Could we make a check-uniq-files error a hard build failure, as a
> >    follow-up patch from yours ? So far, check-uniq-files errors have
> >    been a bit useless, because they do not cause the build to fail, so
> >    nobody really cared.  
> 
> That's fine, I already have the patch ready here: ;-)
>     https://git.buildroot.org/~ymorin/git/buildroot/commit/?h=yem/misc&id=2c1ba76ee86a8f6bd9b9d4a67fa1e9bf28e840da

I tried it on top of this patch, and after fixing the conflict, I get a
weird exception:

# Check files that are touched by more than one package
./support/scripts/check-uniq-files --fail -t target /home/thomas/projets/buildroot/output/build/packages-file-list.txt
Traceback (most recent call last):
  File "./support/scripts/check-uniq-files", line 61, in <module>
    sys.exit(main())
  File "./support/scripts/check-uniq-files", line 17, in main
    help='Fail if a file is touched by more than one package')
  File "/usr/lib64/python2.7/argparse.py", line 1294, in add_argument
    action = action_class(**kwargs)
TypeError: __init__() got an unexpected keyword argument 'type'
make[1]: *** [Makefile:725: target-finalize] Error 1
make: *** [Makefile:84: _all] Error 2

Clearly weird, because .add_argument() definitely takes a type= keyword
argument. I don't have the time to investigate right now.

> -xtype is what we were using before 7fb6e78254:
>     https://git.buildroot.org/buildroot/commit/package/pkg-generic.mk?id=7fb6e782542fc440c2da226ec4525236d0508b77

OK, but is this change to go back to -xtype related ?

> > One drawback is that it won't work with filenames that contain a comma.
> > But in this case, we're forced to keep the filename field as the last
> > one, and therefore break backward compatibility of the
> > package-file-lists.txt format every time we want to add a new
> > information to it.  
> 
> In any case, we have to break backward compatibility. I actually thought
> about it, but forgot to say so in the commit message, my bad...

I don't have a good solution to offer right away, but breaking backward
compatibility on a file like package-file-list.txt is problematic I
believe. Indeed, we have several times recommended to people to do some
post processing using this file. I am myself considering using it for
the toolchains.bootlin.com stuff to remove from the toolchain tarballs
the files that are not really needed (like files installed by
host-automake, host-autoconf, etc.). This would be some external
scripting that parses package-file-list.txt, and I would not be very
happy to see this being broken once in a while. Especially since I tend
to build using two different Buildroot versions (for stable and
bleeding-edge), and I would even have an intermediate period where my
bleeding-edge builds have the new package-file-list.txt format, and the
stable builds have the old format. Not nice.

So I'm not again changing the format now, but I'm not happy with the
idea that we might have to change it over and over and over again, due
to the need for the file name field to be the last one.

Best regards,

Thomas
-- 
Thomas Petazzoni, CTO, Bootlin
Embedded Linux and Kernel engineering
https://bootlin.com

^ permalink raw reply	[flat|nested] 6+ messages in thread

* [Buildroot] [PATCH] core/instrumentation: filter-out packages that install identical files
  2019-01-04  9:24     ` Thomas Petazzoni
@ 2019-01-04  9:51       ` Yann E. MORIN
  2019-01-04 10:03         ` Thomas Petazzoni
  0 siblings, 1 reply; 6+ messages in thread
From: Yann E. MORIN @ 2019-01-04  9:51 UTC (permalink / raw)
  To: buildroot

Thomas, All,

On 2019-01-04 10:24 +0100, Thomas Petazzoni spake thusly:
> On Thu, 3 Jan 2019 23:38:32 +0100, Yann E. MORIN wrote:
[--SNIP--]
> > >  - Could we make a check-uniq-files error a hard build failure, as a
> > >    follow-up patch from yours ? So far, check-uniq-files errors have
> > >    been a bit useless, because they do not cause the build to fail, so
> > >    nobody really cared.  
> > 
> > That's fine, I already have the patch ready here: ;-)
> >     https://git.buildroot.org/~ymorin/git/buildroot/commit/?h=yem/misc&id=2c1ba76ee86a8f6bd9b9d4a67fa1e9bf28e840da
> 
> I tried it on top of this patch, and after fixing the conflict, I get a
> weird exception:

Hmm, did you see the date the commit was made? 1 year ago now. I need to
refresh it (OK, I lied when I saud "ready", I meant "I already got
prepared for that").

That patch is meant to be applied to master as it is today.

Additionally, I wanted to add a --(no-)md5 option to check-uniq-file, to
ignore the md5 and bail out if files are only just barely touched by two
packages.

> # Check files that are touched by more than one package
> ./support/scripts/check-uniq-files --fail -t target /home/thomas/projets/buildroot/output/build/packages-file-list.txt
> Traceback (most recent call last):
>   File "./support/scripts/check-uniq-files", line 61, in <module>
>     sys.exit(main())
>   File "./support/scripts/check-uniq-files", line 17, in main
>     help='Fail if a file is touched by more than one package')
>   File "/usr/lib64/python2.7/argparse.py", line 1294, in add_argument
>     action = action_class(**kwargs)
> TypeError: __init__() got an unexpected keyword argument 'type'
> make[1]: *** [Makefile:725: target-finalize] Error 1
> make: *** [Makefile:84: _all] Error 2
> 
> Clearly weird, because .add_argument() definitely takes a type= keyword
> argument. I don't have the time to investigate right now.

I'll look at it before posting, of course.

> > -xtype is what we were using before 7fb6e78254:
> >     https://git.buildroot.org/buildroot/commit/package/pkg-generic.mk?id=7fb6e782542fc440c2da226ec4525236d0508b77
> 
> OK, but is this change to go back to -xtype related ?

I just mostly "reverted" to the previous find command we were using.

> > > One drawback is that it won't work with filenames that contain a comma.
> > > But in this case, we're forced to keep the filename field as the last
> > > one, and therefore break backward compatibility of the
> > > package-file-lists.txt format every time we want to add a new
> > > information to it.  
> > 
> > In any case, we have to break backward compatibility. I actually thought
> > about it, but forgot to say so in the commit message, my bad...
> 
> I don't have a good solution to offer right away, but breaking backward
> compatibility on a file like package-file-list.txt is problematic I
> believe. Indeed, we have several times recommended to people to do some
> post processing using this file. I am myself considering using it for
> the toolchains.bootlin.com stuff to remove from the toolchain tarballs
> the files that are not really needed (like files installed by
> host-automake, host-autoconf, etc.). This would be some external
> scripting that parses package-file-list.txt, and I would not be very
> happy to see this being broken once in a while. Especially since I tend
> to build using two different Buildroot versions (for stable and
> bleeding-edge), and I would even have an intermediate period where my
> bleeding-edge builds have the new package-file-list.txt format, and the
> stable builds have the old format. Not nice.
> 
> So I'm not again changing the format now, but I'm not happy with the
> idea that we might have to change it over and over and over again, due
> to the need for the file name field to be the last one.

OK, let me think of it...

Quick suggestion: let's break it now and never break it again. We could
use \0 as a field separator: \0 *is* guaranteed to never be part of a
filename. So, the new format would be:

    package-name\0file-name\0md5\n

So, a user would have to split on \0 and extract by field number rather
than until EOL. This will allow us to add new fields as need be.

Would that be OK?

Of course, splitting on \0 is a bit less easy to do in shell scripts,
but sed, grep, and awk all allow it pretty easily. It's less trivial in
bash to use it to split fields ?-la "${var#,*}", but heck, that's
already pretty unsafe.

Note that \n is allowed in a filename, so we'd need to be a bit smart
when reading this file. Or ignore the problem and blame whoever creates
a file with a \n in it (and detect it and bail out, too).

Regards,
Yann E. MORIN.

-- 
.-----------------.--------------------.------------------.--------------------.
|  Yann E. MORIN  | Real-Time Embedded | /"\ ASCII RIBBON | Erics' conspiracy: |
| +33 662 376 056 | Software  Designer | \ / CAMPAIGN     |  ___               |
| +33 223 225 172 `------------.-------:  X  AGAINST      |  \e/  There is no  |
| http://ymorin.is-a-geek.org/ | _/*\_ | / \ HTML MAIL    |   v   conspiracy.  |
'------------------------------^-------^------------------^--------------------'

^ permalink raw reply	[flat|nested] 6+ messages in thread

* [Buildroot] [PATCH] core/instrumentation: filter-out packages that install identical files
  2019-01-04  9:51       ` Yann E. MORIN
@ 2019-01-04 10:03         ` Thomas Petazzoni
  0 siblings, 0 replies; 6+ messages in thread
From: Thomas Petazzoni @ 2019-01-04 10:03 UTC (permalink / raw)
  To: buildroot

Hello,

On Fri, 4 Jan 2019 10:51:10 +0100, Yann E. MORIN wrote:

> > I tried it on top of this patch, and after fixing the conflict, I get a
> > weird exception:  
> 
> Hmm, did you see the date the commit was made? 1 year ago now. I need to
> refresh it (OK, I lied when I saud "ready", I meant "I already got
> prepared for that").

I know, but the commit was pretty trivial, and I wanted to do some
testing.

> That patch is meant to be applied to master as it is today.

I guess you wanted to say "NOT meant".

> Additionally, I wanted to add a --(no-)md5 option to check-uniq-file, to
> ignore the md5 and bail out if files are only just barely touched by two
> packages.

I'm not sure how much useful that is going to be.

> > Clearly weird, because .add_argument() definitely takes a type= keyword
> > argument. I don't have the time to investigate right now.  
> 
> I'll look at it before posting, of course.

OK.

> > > -xtype is what we were using before 7fb6e78254:
> > >     https://git.buildroot.org/buildroot/commit/package/pkg-generic.mk?id=7fb6e782542fc440c2da226ec4525236d0508b77  
> > 
> > OK, but is this change to go back to -xtype related ?  
> 
> I just mostly "reverted" to the previous find command we were using.

But is there a reason ?

> > So I'm not again changing the format now, but I'm not happy with the
> > idea that we might have to change it over and over and over again, due
> > to the need for the file name field to be the last one.  
> 
> OK, let me think of it...
> 
> Quick suggestion: let's break it now and never break it again. We could
> use \0 as a field separator: \0 *is* guaranteed to never be part of a
> filename. So, the new format would be:
> 
>     package-name\0file-name\0md5\n
> 
> So, a user would have to split on \0 and extract by field number rather
> than until EOL. This will allow us to add new fields as need be.
> 
> Would that be OK?
> 
> Of course, splitting on \0 is a bit less easy to do in shell scripts,
> but sed, grep, and awk all allow it pretty easily. It's less trivial in
> bash to use it to split fields ?-la "${var#,*}", but heck, that's
> already pretty unsafe.
> 
> Note that \n is allowed in a filename, so we'd need to be a bit smart
> when reading this file. Or ignore the problem and blame whoever creates
> a file with a \n in it (and detect it and bail out, too).

\0 would indeed work, but as you said it's not the most practical
separator. Let's see if Arnout or Peter have some additional feedback
on this. Another option is to use a more "structured" format.

Best regards,

Thomas
-- 
Thomas Petazzoni, CTO, Bootlin
Embedded Linux and Kernel engineering
https://bootlin.com

^ permalink raw reply	[flat|nested] 6+ messages in thread

end of thread, other threads:[~2019-01-04 10:03 UTC | newest]

Thread overview: 6+ messages (download: mbox.gz / follow: Atom feed)
-- links below jump to the message on this page --
2019-01-03 16:04 [Buildroot] [PATCH] core/instrumentation: filter-out packages that install identical files Yann E. MORIN
2019-01-03 21:37 ` Thomas Petazzoni
2019-01-03 22:38   ` Yann E. MORIN
2019-01-04  9:24     ` Thomas Petazzoni
2019-01-04  9:51       ` Yann E. MORIN
2019-01-04 10:03         ` Thomas Petazzoni

This is an external index of several public inboxes,
see mirroring instructions on how to clone and mirror
all data and code used by this external index.