* [Qemu-devel] DMG chunk size independence
@ 2017-04-15  8:38 Ashijeet Acharya
  2017-04-17 20:29 ` John Snow
                   ` (2 more replies)
  0 siblings, 3 replies; 11+ messages in thread
From: Ashijeet Acharya @ 2017-04-15  8:38 UTC (permalink / raw)
  To: Stefan Hajnoczi
  Cc: Fam Zheng, John Snow, Kevin Wolf, Max Reitz, QEMU Developers

Hi,

Some of you are already aware but for the benefit of the open list,
this mail is regarding the task mentioned
Here -> http://wiki.qemu-project.org/ToDo/Block/DmgChunkSizeIndependence

I had a chat with Fam regarding this and he suggested a solution where
we fix the output buffer size to a max of, say, 64K and keep inflating
until we reach the end of the input stream. We extract the required
data when we enter the desired range and discard the rest. Fam,
however, termed this only a "quick fix".
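The quick fix amounts to the following, sketched here in Python with the
zlib module purely for illustration (the real driver is C and the
function name is invented): inflate the whole chunk through a fixed-size
output buffer and keep only the bytes that fall in the requested range.

```python
import zlib

def read_range_quickfix(compressed, offset, length, bufsize=64 * 1024):
    """Inflate an entire chunk through a fixed 64K output buffer and
    keep only the bytes in [offset, offset + length)."""
    d = zlib.decompressobj()
    out = bytearray()
    produced = 0          # decompressed bytes seen so far
    pos = 0               # position in the compressed input
    while not d.eof and pos < len(compressed):
        data = d.decompress(compressed[pos:pos + bufsize], bufsize)
        pos += bufsize
        while True:
            lo = max(offset - produced, 0)
            hi = min(offset + length - produced, len(data))
            if lo < hi:
                out += data[lo:hi]          # inside the desired range
            produced += len(data)
            if not d.unconsumed_tail:       # this input slice is drained
                break
            data = d.decompress(d.unconsumed_tail, bufsize)
    return bytes(out)
```

Memory stays bounded by bufsize no matter how large the chunk is, which
is the whole point of the quick fix; the cost is re-inflating everything
before the desired range on every read.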

The ideal fix would obviously be if we can somehow predict the exact
location inside the compressed stream relative to the desired offset
in the output decompressed stream, such as a specific sector in a
chunk. Unfortunately this is not possible without doing a first pass
over the decompressed stream as answered on the zlib FAQ page
Here -> http://zlib.net/zlib_faq.html#faq28

AFAICT after reading the zran.c example in zlib, the above mentioned
ideal fix would ultimately lead us to decompress the whole chunk in
steps at least once to maintain an access point lookup table. This
solution is better if we get several random access requests over
different read requests, otherwise it ends up being equal to the fix
suggested by Fam plus some extra effort needed in building and
maintaining access points.
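The zran-style scheme can be modelled in Python because zlib's
decompressor objects support .copy(): a first full pass records access
points (input offset, output offset, snapshotted state), and later
random reads resume from the nearest point. This is only an illustrative
sketch of the idea, not zran.c itself; SPAN and STEP are arbitrary.

```python
import zlib

SPAN = 64 * 1024   # target distance between access points (output bytes)
STEP = 4096        # compressed input fed per call

def build_index(compressed):
    """First pass over the chunk: snapshot the decompressor roughly
    every SPAN decompressed bytes."""
    d = zlib.decompressobj()
    index = [(0, 0, d.copy())]            # (in_off, out_off, state)
    out_total = 0
    for pos in range(0, len(compressed), STEP):
        out_total += len(d.decompress(compressed[pos:pos + STEP]))
        if out_total - index[-1][1] >= SPAN and not d.eof:
            index.append((pos + STEP, out_total, d.copy()))
    return index

def read_at(compressed, index, offset, length):
    """Random access: resume from the nearest access point <= offset."""
    in_off, out_off, state = max(
        (e for e in index if e[1] <= offset), key=lambda e: e[1])
    d = state.copy()
    out = bytearray()
    for pos in range(in_off, len(compressed), STEP):
        data = d.decompress(compressed[pos:pos + STEP])
        lo = max(offset - out_off, 0)
        hi = min(offset + length - out_off, len(data))
        if lo < hi:
            out += data[lo:hi]
        out_off += len(data)
        if out_off >= offset + length or d.eof:
            break
    return bytes(out)
```

As noted above, building the index still costs one full decompression of
the chunk, so it only pays off when several random reads hit the same
chunk later.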

I have not explored the bzip2 compressed chunks yet but have naively
assumed that we will face the same situation there?

I would like to hear the community's opinion on this, along with any
suggestions that might give me some new thinking points.

Thanks
Ashijeet

^ permalink raw reply	[flat|nested] 11+ messages in thread

* Re: [Qemu-devel] DMG chunk size independence
  2017-04-15  8:38 [Qemu-devel] DMG chunk size independence Ashijeet Acharya
@ 2017-04-17 20:29 ` John Snow
  2017-04-18 10:21   ` Ashijeet Acharya
  2017-04-18 10:29 ` Kevin Wolf
  2017-04-25 10:48 ` Ashijeet Acharya
  2 siblings, 1 reply; 11+ messages in thread
From: John Snow @ 2017-04-17 20:29 UTC (permalink / raw)
  To: Ashijeet Acharya, Stefan Hajnoczi
  Cc: Kevin Wolf, Fam Zheng, QEMU Developers, Max Reitz



On 04/15/2017 04:38 AM, Ashijeet Acharya wrote:
> Hi,
> 
> Some of you are already aware but for the benefit of the open list,
> this mail is regarding the task mentioned
> Here -> http://wiki.qemu-project.org/ToDo/Block/DmgChunkSizeIndependence
> 

OK, so the idea here is that we should be able to read portions of
chunks instead of buffering entire chunks, because chunks can be quite
large and an unverified DMG file should not be able to cause QEMU to
allocate large portions of memory.

Currently, QEMU has a maximum chunk size and it will not open DMG files
that have chunks that exceed that size, correct?

> I had a chat with Fam regarding this and he suggested a solution where
> we fix the output buffer size to a max of say "64K" and keep inflating
> until we reach the end of the input stream. We extract the required
> data when we enter the desired range and discard the rest. Fam however
> termed this as only a  "quick fix".
> 

So it looks like your problem now is how to allow reads to subsets while
tolerating zipped chunks, right?

We can't predict where the data we want is going to appear mid-stream,
but I'm not that familiar with the DMG format, so what does the data
look like and how do we seek to it in general?

We've got the mish blocks stored inside of the ResourceFork (right?), and
each mish block contains one-or-more chunk records. So given any offset
into the virtual file, we at least know which chunk it belongs to, but
thanks to zlib, we can't just read the bits we care about.

(Correct so far?)
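The lookup step itself is cheap, since the chunk records give sector
ranges. A sketch of just that step (the table layout and field names
here are hypothetical, standing in for whatever the driver parses out of
the mish block):

```python
from bisect import bisect_right

SECTOR_SIZE = 512

# Hypothetical chunk table parsed from a mish block: each chunk covers
# [sector_start, sector_start + sector_count) sectors of the virtual disk.
chunks = [
    {"sector_start": 0,   "sector_count": 64,  "type": "zlib"},
    {"sector_start": 64,  "sector_count": 128, "type": "raw"},
    {"sector_start": 192, "sector_count": 32,  "type": "bz2"},
]
starts = [c["sector_start"] * SECTOR_SIZE for c in chunks]

def chunk_for_offset(offset):
    """Return the chunk containing byte `offset` of the virtual file,
    or None if the offset is past the end."""
    i = bisect_right(starts, offset) - 1
    if i < 0:
        return None
    c = chunks[i]
    end = (c["sector_start"] + c["sector_count"]) * SECTOR_SIZE
    return c if offset < end else None
```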

> The ideal fix would obviously be if we can somehow predict the exact
> location inside the compressed stream relative to the desired offset
> in the output decompressed stream, such as a specific sector in a
> chunk. Unfortunately this is not possible without doing a first pass
> over the decompressed stream as answered on the zlib FAQ page
> Here -> http://zlib.net/zlib_faq.html#faq28
> 

Yeah, I think you need to start reading the data from the beginning of
each chunk -- but it depends on the zlib data. It COULD be broken up
into different pieces, but there's no way to know without scanning it in
advance.

(Unrelated:

Do we have a zlib format driver?

It might be cute to break up such DMG files and offload zlib
optimization to another driver, like this:

[dmg]-->[zlib]-->[raw]

And we could pretend that each zlib chunk in this file is virtually its
own zlib "file" and access it with modified offsets as appropriate.

Any optimizations we make could just apply to this driver.

[anyway...])


Pre-scanning for these sync points is probably a waste of time as
there's no way to know (*I THINK*) how big each sync-block would be
decompressed, so there's still no way this helps you seek within a
compressed block...

> AFAICT after reading the zran.c example in zlib, the above mentioned
> ideal fix would ultimately lead us to decompress the whole chunk in
> steps at least once to maintain an access point lookup table. This
> solution is better if we get several random access requests over
> different read requests, otherwise it ends up being equal to the fix
> suggested by Fam plus some extra effort needed in building and
> maintaining access points.
> 

Yeah, probably not worth it overall... I have to imagine that most uses
of DMG files are for iso-like cases for installers where accesses are
going to be either sequential (or mostly sequential) and most data will
not be read twice.

I could be wrong, but that's my hunch.

Maybe you can cache the state of the INFLATE process such that once you
fill the cache with data, we can simply resume the INFLATE procedure
when the guest almost inevitably asks for the next subsequent bytes.

That'd probably be efficient /enough/ in most cases without having to
worry about a metadata cache for zlib blocks or a literal data cache for
inflated data.
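That caching idea can be sketched in Python, with zlib's decompressobj
standing in for whatever inflate state the driver would save (class and
method names invented for illustration): keep the decompressor and both
stream positions between reads, resume when the next request continues
where the last one stopped, and restart only on a backwards seek.

```python
import zlib

class ChunkReader:
    """Cache inflate state so a sequential read resumes instead of
    re-inflating the chunk from the start."""

    def __init__(self, compressed):
        self.compressed = compressed
        self._reset()

    def _reset(self):
        self.d = zlib.decompressobj()
        self.in_pos = 0    # next compressed byte to feed
        self.out_pos = 0   # decompressed bytes already produced

    def read(self, offset, length):
        if offset < self.out_pos:          # backwards seek: start over
            self._reset()
        while self.out_pos < offset:       # discard up to `offset`
            if not self._inflate(min(offset - self.out_pos, 65536)):
                return b""                 # past end of stream
        return self._inflate(length)

    def _inflate(self, n, step=4096):
        """Produce exactly n decompressed bytes (fewer only at EOF)."""
        out = bytearray()
        while len(out) < n:
            if self.d.unconsumed_tail:
                chunk = self.d.unconsumed_tail
            else:
                chunk = self.compressed[self.in_pos:self.in_pos + step]
                self.in_pos += len(chunk)
                if not chunk:
                    break                  # compressed input exhausted
            data = self.d.decompress(chunk, n - len(out))
            out += data
            self.out_pos += len(data)
            if self.d.eof:
                break
        return bytes(out)
```

For the mostly sequential access pattern described above, each read then
inflates only the new bytes it actually needs.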

Or maybe I'm full of crap, I don't know -- I'd probably try a few
approaches and see which one empirically worked better.

> I have not explored the bzip2 compressed chunks yet but have naively
> assumed that we will face the same situation there?
> 

Not sure.

> I would like the community's opinion on this and add their suggestions
> if possible to give me some new thinking points.
> 
> Thanks
> Ashijeet
> 


* Re: [Qemu-devel] DMG chunk size independence
  2017-04-17 20:29 ` John Snow
@ 2017-04-18 10:21   ` Ashijeet Acharya
  2017-04-18 17:05     ` John Snow
  0 siblings, 1 reply; 11+ messages in thread
From: Ashijeet Acharya @ 2017-04-18 10:21 UTC (permalink / raw)
  To: John Snow, Stefan Hajnoczi
  Cc: Fam Zheng, Kevin Wolf, Max Reitz, QEMU Developers

On Tue, Apr 18, 2017 at 01:59 John Snow <jsnow@redhat.com> wrote:

>
>
> On 04/15/2017 04:38 AM, Ashijeet Acharya wrote:
> > Hi,
> >
> > Some of you are already aware but for the benefit of the open list,
> > this mail is regarding the task mentioned
> > Here -> http://wiki.qemu-project.org/ToDo/Block/DmgChunkSizeIndependence
> >
>
> OK, so the idea here is that we should be able to read portions of
> chunks instead of buffering entire chunks, because chunks can be quite
> large and an unverified DMG file should not be able to cause QEMU to
> allocate large portions of memory.
>
> Currently, QEMU has a maximum chunk size and it will not open DMG files
> that have chunks that exceed that size, correct?
>

Yes, it has an upper limit of 64 MiB at the moment and refuses to cater
to anything beyond that.


> > I had a chat with Fam regarding this and he suggested a solution where
> > we fix the output buffer size to a max of say "64K" and keep inflating
> > until we reach the end of the input stream. We extract the required
> > data when we enter the desired range and discard the rest. Fam however
> > termed this as only a  "quick fix".
> >
>
> So it looks like your problem now is how to allow reads to subsets while
> tolerating zipped chunks, right?


Yes

>
>
> We can't predict where the data we want is going to appear mid-stream,
> but I'm not that familiar with the DMG format, so what does the data
> look like and how do we seek to it in general?


If I understood correctly what you meant:
The data is divided into three types
a) Uncompressed
b) zlib compressed
c) bz2 compressed

All these chunks appear in random order depending on the file.

ATM we decompress the whole chunk into a buffer and start reading
sector by sector until we have what we need or we run out of output in
that chunk.

If you meant something else there, let me know.


>
> We've got the mish blocks stored inside of the ResourceFork (right?), and


I haven't understood yet what a ResourceFork is, but it's safe to say
from what I know that mish blocks do appear inside resource forks and
contain all the required info about the chunks.

>
> each mish block contains one-or-more chunk records. So given any offset
> into the virtual file, we at least know which chunk it belongs to, but
> thanks to zlib, we can't just read the bits we care about.
>
> (Correct so far?)


Absolutely


>
> > The ideal fix would obviously be if we can somehow predict the exact
> > location inside the compressed stream relative to the desired offset
> > in the output decompressed stream, such as a specific sector in a
> > chunk. Unfortunately this is not possible without doing a first pass
> > over the decompressed stream as answered on the zlib FAQ page
> > Here -> http://zlib.net/zlib_faq.html#faq28
> >
>
> Yeah, I think you need to start reading the data from the beginning of
> each chunk -- but it depends on the zlib data. It COULD be broken up
> into different pieces, but there's no way to know without scanning it in
> advance.


Hmm, that's the real issue I am facing. MAYBE break it like

a) inflate till the required starting offset in one go
b) save the access point and discard the undesired data
c) proceed by inflating one sector at a time and stop if we hit chunk's end
or request's end


>
> (Unrelated:
>
> Do we have a zlib format driver?
>
> It might be cute to break up such DMG files and offload zlib
> optimization to another driver, like this:
>
> [dmg]-->[zlib]-->[raw]
>
> And we could pretend that each zlib chunk in this file is virtually its
> own zlib "file" and access it with modified offsets as appropriate.
>
> Any optimizations we make could just apply to this driver.
>
> [anyway...])


Are you thinking about implementing zlib just like we have bz2 implemented
currently?


>
>
> Pre-scanning for these sync points is probably a waste of time as
> there's no way to know (*I THINK*) how big each sync-block would be
> decompressed, so there's still no way this helps you seek within a
> compressed block...
>

I think we can predict that actually, because we know the number of sectors
present in that chunk and each sector's size too. So...
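To make the point concrete: the decompressed size of a whole chunk is
fixed by its metadata, even though nothing inside the compressed stream
is predictable (a trivial illustration, with the helper name invented):

```python
SECTOR_SIZE = 512

def chunk_decompressed_size(sector_count):
    """A chunk always inflates to sector_count * 512 bytes; the mish
    block records the sector count per chunk."""
    return sector_count * SECTOR_SIZE
```

So a 2048-sector chunk always inflates to exactly 1 MiB, which bounds
the whole chunk but still says nothing about sync points inside it.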


> > AFAICT after reading the zran.c example in zlib, the above mentioned
> > ideal fix would ultimately lead us to decompress the whole chunk in
> > steps at least once to maintain an access point lookup table. This
> > solution is better if we get several random access requests over
> > different read requests, otherwise it ends up being equal to the fix
> > suggested by Fam plus some extra effort needed in building and
> > maintaining access points.
> >
>
> Yeah, probably not worth it overall... I have to imagine that most uses
> of DMG files are for iso-like cases for installers where accesses are
> going to be either sequential (or mostly sequential) and most data will
> not be read twice.


Exactly, if we are sure that there will be no requests to read the same
data twice, it's a completely wasted effort. But I am not aware of the
use cases of DMG since I only learned about it last week, so maybe
someone can enlighten me on those if possible?


>
> I could be wrong, but that's my hunch.
>
> Maybe you can cache the state of the INFLATE process such that once you
> fill the cache with data, we can simply resume the INFLATE procedure
> when the guest almost inevitably asks for the next subsequent bytes.
>
> That'd probably be efficient /enough/ in most cases without having to
> worry about a metadata cache for zlib blocks or a literal data cache for
> inflated data.


Yes, I have a similar approach in mind to inflate one sector at a time and
save the offset in the compressed stream and treat it as an access point
for the next one.


>
> Or maybe I'm full of crap, I don't know -- I'd probably try a few
> approaches and see which one empirically worked better.
>
> > I have not explored the bzip2 compressed chunks yet but have naively
> > assumed that we will face the same situation there?
> >
>
> Not sure.
>

I will look it up :)

Stefan/Kevin, do you have any other preferred solution in mind? I am
more or less inclined to start by inflating one sector at a time and
submit v1.

Ashijeet


* Re: [Qemu-devel] DMG chunk size independence
  2017-04-15  8:38 [Qemu-devel] DMG chunk size independence Ashijeet Acharya
  2017-04-17 20:29 ` John Snow
@ 2017-04-18 10:29 ` Kevin Wolf
  2017-04-25 10:48 ` Ashijeet Acharya
  2 siblings, 0 replies; 11+ messages in thread
From: Kevin Wolf @ 2017-04-18 10:29 UTC (permalink / raw)
  To: Ashijeet Acharya
  Cc: Stefan Hajnoczi, Fam Zheng, John Snow, Max Reitz, QEMU Developers

Am 15.04.2017 um 10:38 hat Ashijeet Acharya geschrieben:
> Hi,
> 
> Some of you are already aware but for the benefit of the open list,
> this mail is regarding the task mentioned
> Here -> http://wiki.qemu-project.org/ToDo/Block/DmgChunkSizeIndependence
> 
> I had a chat with Fam regarding this and he suggested a solution where
> we fix the output buffer size to a max of say "64K" and keep inflating
> until we reach the end of the input stream. We extract the required
> data when we enter the desired range and discard the rest. Fam however
> termed this as only a  "quick fix".

You can cache the current position for a very easy optimisation of
sequential reads.

> The ideal fix would obviously be if we can somehow predict the exact
> location inside the compressed stream relative to the desired offset
> in the output decompressed stream, such as a specific sector in a
> chunk. Unfortunately this is not possible without doing a first pass
> over the decompressed stream as answered on the zlib FAQ page
> Here -> http://zlib.net/zlib_faq.html#faq28
> 
> AFAICT after reading the zran.c example in zlib, the above mentioned
> ideal fix would ultimately lead us to decompress the whole chunk in
> steps at least once to maintain an access point lookup table. This
> solution is better if we get several random access requests over
> different read requests, otherwise it ends up being equal to the fix
> suggested by Fam plus some extra effort needed in building and
> maintaining access points.

I'm not sure if it's worth the additional effort.

If we take a step back, what the dmg driver is used for in practice is
converting images into a different format, that is strictly sequential
I/O. The important thing is that images with large chunk sizes can be
read at all, performance (with sequential I/O) is secondary, and
performance with random I/O is almost irrelevant, as far as I am
concerned.

Kevin


* Re: [Qemu-devel] DMG chunk size independence
  2017-04-18 10:21   ` Ashijeet Acharya
@ 2017-04-18 17:05     ` John Snow
  2017-04-18 17:43       ` Ashijeet Acharya
  0 siblings, 1 reply; 11+ messages in thread
From: John Snow @ 2017-04-18 17:05 UTC (permalink / raw)
  To: Ashijeet Acharya, Stefan Hajnoczi
  Cc: Fam Zheng, Kevin Wolf, Max Reitz, QEMU Developers



On 04/18/2017 06:21 AM, Ashijeet Acharya wrote:
> 
> On Tue, Apr 18, 2017 at 01:59 John Snow <jsnow@redhat.com> wrote:
> 
> 
> 
>     On 04/15/2017 04:38 AM, Ashijeet Acharya wrote:
>     > Hi,
>     >
>     > Some of you are already aware but for the benefit of the open list,
>     > this mail is regarding the task mentioned
>     > Here ->
>     http://wiki.qemu-project.org/ToDo/Block/DmgChunkSizeIndependence
>     >
> 
>     OK, so the idea here is that we should be able to read portions of
>     chunks instead of buffering entire chunks, because chunks can be quite
>     large and an unverified DMG file should not be able to cause QEMU to
>     allocate large portions of memory.
> 
>     Currently, QEMU has a maximum chunk size and it will not open DMG files
>     that have chunks that exceed that size, correct?
> 
> 
> Yes, it has an upper limit 64MiB at the moment and refuses to cater
> anything beyond that.
> 
> 
>     > I had a chat with Fam regarding this and he suggested a solution where
>     > we fix the output buffer size to a max of say "64K" and keep inflating
>     > until we reach the end of the input stream. We extract the required
>     > data when we enter the desired range and discard the rest. Fam however
>     > termed this as only a  "quick fix".
>     >
> 
>     So it looks like your problem now is how to allow reads to subsets while
>     tolerating zipped chunks, right?
> 
> 
> Yes
> 
> 
> 
>     We can't predict where the data we want is going to appear mid-stream,
>     but I'm not that familiar with the DMG format, so what does the data
>     look like and how do we seek to it in general?
> 
> 
> If I understood correctly what you meant;
> The data is divided into three types
> a) Uncompressed
> b) zlib compressed
> c) bz2 compressed
> 
> All these chunks appear in random order depending on the file.
> 
> ATM we are decompressing the whole chunk in a buffer and start reading
> sector by sector until we have what we need or we run out of output in
> that chunk.
> 
> If you meant something else there, let me know.
> 
> 
> 
>     We've got the mish blocks stored inside of the ResourceFork (right?), and
> 
> 
> I haven't understood yet what a ResourceFork is but its safe to say from
> what I know that mish blocks do appear inside resource forks and contain
> all the required info about the chunks.
> 
> 
>     each mish block contains one-or-more chunk records. So given any offset
>     into the virtual file, we at least know which chunk it belongs to, but
>     thanks to zlib, we can't just read the bits we care about.
> 
>     (Correct so far?)
> 
> 
> Absolutely
> 
> 
> 
>     > The ideal fix would obviously be if we can somehow predict the exact
>     > location inside the compressed stream relative to the desired offset
>     > in the output decompressed stream, such as a specific sector in a
>     > chunk. Unfortunately this is not possible without doing a first pass
>     > over the decompressed stream as answered on the zlib FAQ page
>     > Here -> http://zlib.net/zlib_faq.html#faq28
>     >
> 
>     Yeah, I think you need to start reading the data from the beginning of
>     each chunk -- but it depends on the zlib data. It COULD be broken up
>     into different pieces, but there's no way to know without scanning it in
>     advance.
> 
> 
> Hmm, that's the real issue I am facing. MAYBE break it like
> 
> a) inflate till the required starting offset in one go
> b) save the access point and discard the undesired data
> c) proceed by inflating one sector at a time and stop if we hit chunk's
> end or request's end
> 
> 
> 
>     (Unrelated:
> 
>     Do we have a zlib format driver?
> 
>     It might be cute to break up such DMG files and offload zlib
>     optimization to another driver, like this:
> 
>     [dmg]-->[zlib]-->[raw]
> 
>     And we could pretend that each zlib chunk in this file is virtually its
>     own zlib "file" and access it with modified offsets as appropriate.
> 
>     Any optimizations we make could just apply to this driver.
> 
>     [anyway...])
> 
> 
> Are you thinking about implementing zlib just like we have bz2
> implemented currently?
> 
> 
> 
> 
>     Pre-scanning for these sync points is probably a waste of time as
>     there's no way to know (*I THINK*) how big each sync-block would be
>     decompressed, so there's still no way this helps you seek within a
>     compressed block...
> 
> 
> I think we can predict that actually, because we know the number of
> sectors present in that chunk and each sector's size too. So...
> 
> 
>     > AFAICT after reading the zran.c example in zlib, the above mentioned
>     > ideal fix would ultimately lead us to decompress the whole chunk in
>     > steps at least once to maintain an access point lookup table. This
>     > solution is better if we get several random access requests over
>     > different read requests, otherwise it ends up being equal to the fix
>     > suggested by Fam plus some extra effort needed in building and
>     > maintaining access points.
>     >
> 
>     Yeah, probably not worth it overall... I have to imagine that most uses
>     of DMG files are for iso-like cases for installers where accesses are
>     going to be either sequential (or mostly sequential) and most data will
>     not be read twice.
> 
> 
> Exactly, if we are sure that there will be no requests to read the same
> data twice, its completely a wasted effort. But I am not aware of the
> use cases of DMG since I only learned about it last week. So maybe
> someone can enlighten me on those if possible?
> 
> 
> 
>     I could be wrong, but that's my hunch.
> 
>     Maybe you can cache the state of the INFLATE process such that once you
>     fill the cache with data, we can simply resume the INFLATE procedure
>     when the guest almost inevitably asks for the next subsequent bytes.
> 
>     That'd probably be efficient /enough/ in most cases without having to
>     worry about a metadata cache for zlib blocks or a literal data cache for
>     inflated data.
> 
> 
> Yes, I have a similar approach in mind to inflate one sector at a time
> and save the offset in the compressed stream and treat it as an access
> point for the next one.
> 

Right, just save whatever zlib library state you need to save and resume
inflating. Probably the most reasonable way to go for v1. As long as you
can avoid re-inflating prior data in a chunk when possible this is
probably good.

> 
> 
>     Or maybe I'm full of crap, I don't know -- I'd probably try a few
>     approaches and see which one empirically worked better.
> 
>     > I have not explored the bzip2 compressed chunks yet but have naively
>     > assumed that we will face the same situation there?
>     >
> 
>     Not sure.
> 
> 
> I will look it up :)
> 
> Stefan/Kevin, Do you have any other preferred solution in your mind?
> Because I am more or less getting inclined towards starting to inflate
> one sector at a time and submit v1 
> 
> 
> Ashijeet
> 
> 


* Re: [Qemu-devel] DMG chunk size independence
  2017-04-18 17:05     ` John Snow
@ 2017-04-18 17:43       ` Ashijeet Acharya
  2017-04-23  9:03         ` Ashijeet Acharya
  0 siblings, 1 reply; 11+ messages in thread
From: Ashijeet Acharya @ 2017-04-18 17:43 UTC (permalink / raw)
  To: John Snow
  Cc: Stefan Hajnoczi, Fam Zheng, Kevin Wolf, Max Reitz, QEMU Developers

On Tue, Apr 18, 2017 at 10:35 PM, John Snow <jsnow@redhat.com> wrote:
>
>>
>>     I could be wrong, but that's my hunch.
>>
>>     Maybe you can cache the state of the INFLATE process such that once you
>>     fill the cache with data, we can simply resume the INFLATE procedure
>>     when the guest almost inevitably asks for the next subsequent bytes.
>>
>>     That'd probably be efficient /enough/ in most cases without having to
>>     worry about a metadata cache for zlib blocks or a literal data cache for
>>     inflated data.
>>
>>
>> Yes, I have a similar approach in mind to inflate one sector at a time
>> and save the offset in the compressed stream and treat it as an access
>> point for the next one.
>>
>
> Right, just save whatever zlib library state you need to save and resume
> inflating. Probably the most reasonable way to go for v1. As long as you
> can avoid re-inflating prior data in a chunk when possible this is
> probably good.

Yup, I have started with that. Something you should know is that I had
an IRC discussion with Kevin and he suggested fixing the buffer size to
a max of 2 MiB, as 512 bytes (which I proposed in my previous response)
is excessively low and would slow down the driver drastically!

Ashijeet


* Re: [Qemu-devel] DMG chunk size independence
  2017-04-18 17:43       ` Ashijeet Acharya
@ 2017-04-23  9:03         ` Ashijeet Acharya
  2017-04-24 21:19           ` John Snow
  0 siblings, 1 reply; 11+ messages in thread
From: Ashijeet Acharya @ 2017-04-23  9:03 UTC (permalink / raw)
  To: John Snow
  Cc: Stefan Hajnoczi, Fam Zheng, Kevin Wolf, Max Reitz, QEMU Developers

Hi,

Great news!
I have almost completed this task and the results look promising. I
have not yet attended to DMG files with bz2-compressed chunks, but that
should be easy and pretty similar to my approach for zlib-compressed
files. So, no worries there.

For testing, I first convert the images to raw format and then compare
the resulting image with the one converted using the v2.9.0 DMG driver;
after battling with my code for 2 days, it finally prints "Images are
identical." According to John, that should be pretty conclusive, and I
completely agree.

Now, the real thing I wanted to ask: is anyone aware of a DMG file
which has a chunk size above 64 MiB, so that I can test those too? If
yes, please share the download link with me.
Currently I am testing the ones posted by Peter Wu while submitting
his DMG work in 2014.
Here -> https://lists.nongnu.org/archive/html/qemu-devel/2014-12/msg03606.html

Expect v1 soon...

Ashijeet


* Re: [Qemu-devel] DMG chunk size independence
  2017-04-23  9:03         ` Ashijeet Acharya
@ 2017-04-24 21:19           ` John Snow
  2017-04-25  5:20             ` Ashijeet Acharya
  2017-04-25  9:50             ` Peter Wu
  0 siblings, 2 replies; 11+ messages in thread
From: John Snow @ 2017-04-24 21:19 UTC (permalink / raw)
  To: Ashijeet Acharya
  Cc: Stefan Hajnoczi, Fam Zheng, Kevin Wolf, Max Reitz,
	QEMU Developers, Peter Wu



On 04/23/2017 05:03 AM, Ashijeet Acharya wrote:
> Hi,
> 
> Great news!
> I have almost completed this task and the results are looking
> promising. I have not yet attended to the DMG files having bz2
> compressed chunks but that should be easy and pretty similar to my
> approach for zlib compressed files. So, no worries there.
> 
> For testing I am first converting the images to raw format and then
> comparing the resulting image with the one converted using v2.9.0 DMG
> driver and after battling for 2 days with my code, it finally prints
> "Images are identical." According to John, that should be pretty
> conclusive and I completely agree.
> 

Yes, comparing a sample.dmg against a raw file generated from the 2.9.0
qemu-img tool should be reasonably good evidence that you have not
altered the behavior of the tool.

> Now, the real thing I wanted to ask was, if someone is aware of a DMG
> file which has a chunk size above 64 MiB so that I can test those too.
> If yes, please share the download link with me.
> Currently I am testing the ones posted by Peter Wu while submitting
> his DMG work in 2014.
> Here -> https://lists.nongnu.org/archive/html/qemu-devel/2014-12/msg03606.html
> 

Are any of those over 64MB? I assume you're implying that they aren't.

Maybe Peter knows?...

> Expect v1 soon...
> 
> Ashijeet
> 


* Re: [Qemu-devel] DMG chunk size independence
  2017-04-24 21:19           ` John Snow
@ 2017-04-25  5:20             ` Ashijeet Acharya
  2017-04-25  9:50             ` Peter Wu
  1 sibling, 0 replies; 11+ messages in thread
From: Ashijeet Acharya @ 2017-04-25  5:20 UTC (permalink / raw)
  To: John Snow
  Cc: Stefan Hajnoczi, Fam Zheng, Kevin Wolf, Max Reitz,
	QEMU Developers, Peter Wu

>> For testing I am first converting the images to raw format and then
>> comparing the resulting image with the one converted using v2.9.0 DMG
>> driver and after battling for 2 days with my code, it finally prints
>> "Images are identical." According to John, that should be pretty
>> conclusive and I completely agree.
>>
>
> Yes, comparing a sample.dmg against a raw file generated from the 2.9.0
> qemu-img tool should be reasonably good evidence that you have not
> altered the behavior of the tool.
>
>> Now, the real thing I wanted to ask was, if someone is aware of a DMG
>> file which has a chunk size above 64 MiB so that I can test those too.
>> If yes, please share the download link with me.
>> Currently I am testing the ones posted by Peter Wu while submitting
>> his DMG work in 2014.
>> Here -> https://lists.nongnu.org/archive/html/qemu-devel/2014-12/msg03606.html
>>
>
> Are any of those over 64MB? I assume you're implying that they aren't.

No, they are not, because none of them crash while converting using
the qemu-img tool from 2.9.0 (which has the 64 MiB limitation).

>
> Maybe Peter knows?...

Yes, I contacted him and he has been of great help so far :-)

Ashijeet


* Re: [Qemu-devel] DMG chunk size independence
  2017-04-24 21:19           ` John Snow
  2017-04-25  5:20             ` Ashijeet Acharya
@ 2017-04-25  9:50             ` Peter Wu
  1 sibling, 0 replies; 11+ messages in thread
From: Peter Wu @ 2017-04-25  9:50 UTC (permalink / raw)
  To: John Snow
  Cc: Ashijeet Acharya, Stefan Hajnoczi, Fam Zheng, Kevin Wolf,
	Max Reitz, QEMU Developers

On Mon, Apr 24, 2017 at 05:19:48PM -0400, John Snow wrote:
> 
> 
> On 04/23/2017 05:03 AM, Ashijeet Acharya wrote:
> > Hi,
> > 
> > Great news!
> > I have almost completed this task and the results are looking
> > promising. I have not yet attended to the DMG files having bz2
> > compressed chunks but that should be easy and pretty similar to my
> > approach for zlib compressed files. So, no worries there.
> > 
> > For testing I am first converting the images to raw format and then
> > comparing the resulting image with the one converted using v2.9.0 DMG
> > driver and after battling for 2 days with my code, it finally prints
> > "Images are identical." According to John, that should be pretty
> > conclusive and I completely agree.
> > 
> 
> Yes, comparing a sample.dmg against a raw file generated from the 2.9.0
> qemu-img tool should be reasonably good evidence that you have not
> altered the behavior of the tool.
> 
> > Now, the real thing I wanted to ask was, if someone is aware of a DMG
> > file which has a chunk size above 64 MiB so that I can test those too.
> > If yes, please share the download link with me.
> > Currently I am testing the ones posted by Peter Wu while submitting
> > his DMG work in 2014.
> > Here -> https://lists.nongnu.org/archive/html/qemu-devel/2014-12/msg03606.html
> > 
> 
> Are any of those over 64MB? I assume you're implying that they aren't.
> 
> Maybe Peter knows?...

I don't know of a DMG with bzip2-compressed chunks over 64M. Looking
through more recent files, there is this log for "Install macOS Sierra
10.12(16A323)-B.dmg", which contains only zlib-compressed or raw data
where the uncompressed size (in the MISH block) is always at most 1 MiB:
https://github.com/Lekensteyn/dmg2img/issues/1#issuecomment-273662984

In an Xcode_7.2.dmg file the situation was similar: only zlib or raw,
also with a maximum uncompressed size of 1 MiB (actually an exact size
of 1 MiB in both cases, based on "sectorCount").

Perhaps bzip2-compressed chunks are not so common for larger disk images
since zlib is faster.
-- 
Kind regards,
Peter Wu
https://lekensteyn.nl

^ permalink raw reply	[flat|nested] 11+ messages in thread

* Re: [Qemu-devel] DMG chunk size independence
  2017-04-15  8:38 [Qemu-devel] DMG chunk size independence Ashijeet Acharya
  2017-04-17 20:29 ` John Snow
  2017-04-18 10:29 ` Kevin Wolf
@ 2017-04-25 10:48 ` Ashijeet Acharya
  2 siblings, 0 replies; 11+ messages in thread
From: Ashijeet Acharya @ 2017-04-25 10:48 UTC (permalink / raw)
  To: Stefan Hajnoczi
  Cc: Fam Zheng, John Snow, Kevin Wolf, Max Reitz, QEMU Developers, Peter Wu

Hi,

cc'ing Peter Wu in...

Currently I have completed the task for zlib, uncompressed and zeroed
chunks in a DMG file using the approach we discussed earlier.
Unfortunately, this approach does not work for bz2 chunks: we cannot
restart decompression from a cached access point, because bz2
decompression checks for the special magic key 'BZh' before it starts
decompressing. Since our cached point can land at an arbitrary
location inside the compressed stream, and not necessarily at the
start of a "block", dmg_uncompress_bz2_do() fails with the error value
BZ_DATA_ERROR_MAGIC (-5), and so our approach fails.
This blog post here explains this limitation too ->
https://blastedbio.blogspot.in/2011/11/random-access-to-bzip2.html
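To make the failure mode concrete, here is a small sketch using
Python's bz2 bindings to libbz2 (illustrative only, not the QEMU
driver) of why resuming from a cached mid-stream offset cannot work:

```python
import bz2

data = bz2.compress(b"some chunk payload " * 512)

# From the stream start (the 'BZh' magic) decompression succeeds:
assert bz2.decompress(data) == b"some chunk payload " * 512

# Resuming from an arbitrary cached offset inside the stream fails,
# because the decompressor insists on seeing the 'BZh' magic first;
# libbz2 reports this case as BZ_DATA_ERROR_MAGIC.
try:
    bz2.decompress(data[10:])
except OSError as exc:
    print("resume from mid-stream failed:", exc)
```

This is exactly the restriction zlib's zran.c-style access points work
around, and which bz2 does not allow without block-level bookkeeping.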

Now, there is an interesting thing I found out about bz2-compressed
streams: the size of a compressed block varies from 0 up to a maximum
of 900 KiB. This is guaranteed and can be verified, because each block
starts with a 4-byte header in which the first three bytes are the
magic key "BZh", followed by a digit from 1-9. This digit tells us the
maximum size that block can have, in increments of 100 KiB per level
(e.g. BZh3 means a maximum of 300 KiB).
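As a rough illustration, that header check can be sketched like this
(a hypothetical helper, not QEMU code; note the bzip2 format actually
counts block sizes in multiples of 100,000 bytes per level):

```python
import bz2

def bz2_block_size_limit(stream_header: bytes) -> int:
    """Hypothetical helper: parse the 4-byte bzip2 stream header
    ('BZh' plus a digit 1-9) and return the maximum uncompressed
    block size it announces (level * 100,000 bytes)."""
    if stream_header[:3] != b"BZh":
        raise ValueError("not a bzip2 stream (missing 'BZh' magic)")
    level = stream_header[3] - ord("0")
    if not 1 <= level <= 9:
        raise ValueError("invalid block-size digit")
    return level * 100_000

# The compresslevel chosen at compression time shows up in the header:
data = bz2.compress(b"sample", compresslevel=5)
print(data[:4], bz2_block_size_limit(data[:4]))  # b'BZh5' 500000
```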

Now, the Wikipedia page here
(https://en.wikipedia.org/wiki/Bzip2#File_format) states that a 900 KiB
block can expand to at most about 46 MiB in its uncompressed form.
Thus we need not worry about QEMU allocating wildly sized memory at
once, as we currently have a limit of 64 MiB, and we can stick to the
approach of decompressing the whole block every time we enter it. This
solves our problem of caching an access point and ultimately failing
with the error value BZ_DATA_ERROR_MAGIC (-5).
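A minimal sketch of that "decompress the whole block each time"
strategy (again in Python for illustration; the helper name and cap
constant are assumptions, not QEMU code):

```python
import bz2

MAX_CHUNK_UNCOMPRESSED = 64 * 1024 * 1024  # mirrors the 64 MiB cap discussed

def read_from_bz2_chunk(chunk: bytes, offset: int, length: int) -> bytes:
    """Sketch: decompress the whole chunk from its 'BZh' header every
    time, capped at 64 MiB, then slice out the requested byte range.
    No mid-stream access points are needed (or possible)."""
    decomp = bz2.BZ2Decompressor()
    out = decomp.decompress(chunk, MAX_CHUNK_UNCOMPRESSED)
    return out[offset:offset + length]

chunk = bz2.compress(bytes(range(256)) * 16)  # 4 KiB of sample sector data
assert read_from_bz2_chunk(chunk, 256, 4) == bytes([0, 1, 2, 3])
```

The obvious cost is re-decompressing the block prefix on every read,
but with blocks bounded at ~46 MiB that stays within the 64 MiB limit.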

I am hesitant about this approach because I am not yet sure whether
"blocks" and "chunks" mean the same thing and are just two different
terminologies (i.e. chunks == blocks), or whether chunks are made up
of multiple blocks (i.e. chunks == x * blocks).

I approached Peter Wu (who worked on the DMG driver a few years ago)
about this, and he is not sure either.

(Peter, you may skip this part as I already explained it to you earlier :-) )
I did a little naive test of my own: I downloaded one of the bz2 DMG
images and examined it with a hex editor.

First, I manually calculated the distance between the offsets of two
consecutive magic keys ('BZh'), which marks the length of the block
starting at the first magic key's offset. Next, I compared it to the
size of the corresponding chunk (s->lengths[chunk]) that we get by
reading the mish blocks while opening the image in QEMU, and
interestingly the two sizes turned out to be equal. I repeated this
for quite a few chunks, and the test held for all of them.
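That manual check can be mechanized roughly as follows (a naive scan,
with the same caveat Peter raises below: the 3-byte pattern can in
principle occur by chance inside compressed data):

```python
import bz2

def magic_offsets(data: bytes, magic: bytes = b"BZh") -> list[int]:
    # Naive scan, like the hex-editor test: record every offset at
    # which the magic bytes appear. False positives are possible.
    offsets, i = [], data.find(magic)
    while i != -1:
        offsets.append(i)
        i = data.find(magic, i + 1)
    return offsets

# Two bz2 streams back to back, standing in for two adjacent chunks:
first = bz2.compress(b"a" * 4096)
second = bz2.compress(b"b" * 4096)
offs = magic_offsets(first + second)
# The gap between consecutive magic hits should match the chunk length
# (s->lengths[chunk] in QEMU terms), barring chance 'BZh' occurrences.
print(offs[0], len(first) in offs)
```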

Peter thinks we cannot rely on this test, so I wouldn't mind more
views on it...

Ashijeet

^ permalink raw reply	[flat|nested] 11+ messages in thread


Thread overview: 11+ messages
-- links below jump to the message on this page --
2017-04-15  8:38 [Qemu-devel] DMG chunk size independence Ashijeet Acharya
2017-04-17 20:29 ` John Snow
2017-04-18 10:21   ` Ashijeet Acharya
2017-04-18 17:05     ` John Snow
2017-04-18 17:43       ` Ashijeet Acharya
2017-04-23  9:03         ` Ashijeet Acharya
2017-04-24 21:19           ` John Snow
2017-04-25  5:20             ` Ashijeet Acharya
2017-04-25  9:50             ` Peter Wu
2017-04-18 10:29 ` Kevin Wolf
2017-04-25 10:48 ` Ashijeet Acharya
