All of lore.kernel.org
 help / color / mirror / Atom feed
* [PATCH] sbc: bitstream packing optimization for encoder.
@ 2008-12-11 20:27 Siarhei Siamashka
  2008-12-12  1:39 ` Marcel Holtmann
  0 siblings, 1 reply; 12+ messages in thread
From: Siarhei Siamashka @ 2008-12-11 20:27 UTC (permalink / raw)
  To: linux-bluetooth

[-- Attachment #1: Type: text/plain, Size: 1398 bytes --]

Hi All,

Appears that bitstream packing is also one of the most CPU hungry
operations in SBC encoder, which did not get proper attention yet.
Could somebody review this patch and provide feedback/comments?

Benchmarks were done by running the following script:

./sbcenc -s 8 big_buck_bunny_480p_mono.au > /dev/null
./sbcenc -s 8 big_buck_bunny_480p_stereo.au > /dev/null
./sbcenc -s 8 -j big_buck_bunny_480p_stereo.au > /dev/null
./sbcenc -s 4 big_buck_bunny_480p_mono.au > /dev/null
./sbcenc -s 4 big_buck_bunny_480p_stereo.au > /dev/null
./sbcenc -s 4 -j big_buck_bunny_480p_stereo.au > /dev/null
./sbcenc -b 128 -s 8 big_buck_bunny_480p_mono.au > /dev/null
./sbcenc -b 128 -s 8 big_buck_bunny_480p_stereo.au > /dev/null
./sbcenc -b 128 -s 8 -j big_buck_bunny_480p_stereo.au > /dev/null
./sbcenc -b 64 -s 4 big_buck_bunny_480p_mono.au > /dev/null
./sbcenc -b 64 -s 4 big_buck_bunny_480p_stereo.au > /dev/null
./sbcenc -b 64 -s 4 -j big_buck_bunny_480p_stereo.au > /dev/null

Times from ARM device (ARM11):

before patch:
real    9m 42.98s
user    8m 51.06s
sys     0m 39.50s

after patch:
real    6m 35.55s
user    5m 44.09s
sys     0m 40.43s

Times from x86 PC (Intel Core2):

before patch:
real    0m49.437s
user    0m45.267s
sys     0m3.708s

after patch:
real    0m27.606s
user    0m23.869s
sys     0m3.656s

 
Best regards,
Siarhei Siamashka

[-- Attachment #2: 0001-sbc-bitstream-writing-optimization-for-encoder.patch --]
[-- Type: text/x-diff, Size: 4736 bytes --]

>From cec91fe56d834109db6280330085832c79ded6b6 Mon Sep 17 00:00:00 2001
From: Siarhei Siamashka <siarhei.siamashka@nokia.com>
Date: Thu, 11 Dec 2008 21:21:28 +0200
Subject: [PATCH] sbc: bitstream writing optimization for encoder.

SBC encoder performance improvement up to 1.5x for ARM11
and almost twice faster for Intel Core2 in some cases.

Signed-off-by: Siarhei Siamashka <siarhei.siamashka@nokia.com>
---
 sbc/sbc.c |   65 ++++++++++++++++++++++++++++++++++--------------------------
 1 files changed, 37 insertions(+), 28 deletions(-)

diff --git a/sbc/sbc.c b/sbc/sbc.c
index 7072673..9b676e0 100644
--- a/sbc/sbc.c
+++ b/sbc/sbc.c
@@ -909,6 +909,29 @@ static int sbc_analyze_audio(struct sbc_encoder_state *state,
 	}
 }
 
+/* Supplementary bitstream writing macros for 'sbc_pack_frame' */
+
+#define PUT_BITS(v, n)\
+	{\
+		bits_cache = (v) | (bits_cache << (n));\
+		bits_count += (n);\
+		if (bits_count >= 16) {\
+			bits_count -= 8;\
+			*data_ptr++ = (uint8_t)(bits_cache >> bits_count);\
+			bits_count -= 8;\
+			*data_ptr++ = (uint8_t)(bits_cache >> bits_count);\
+		}\
+	} while (0);\
+
+#define FLUSH_BITS()\
+	while (bits_count >= 8) {\
+		bits_count -= 8;\
+		*data_ptr++ = (uint8_t)(bits_cache >> bits_count);\
+	}\
+	if (bits_count > 0) {\
+	    *data_ptr++ = (uint8_t)(bits_cache << (8 - bits_count));\
+	}\
+
 /*
  * Packs the SBC frame from frame into the memory at data. At most len
  * bytes will be used, should more memory be needed an appropriate
@@ -926,14 +949,18 @@ static int sbc_analyze_audio(struct sbc_encoder_state *state,
 
 static int sbc_pack_frame(uint8_t *data, struct sbc_frame *frame, size_t len)
 {
-	int produced;
+	/* Bitstream writer starts from the fourth byte */
+	uint8_t *data_ptr = data + 4;
+	uint32_t bits_cache = 0;
+	uint32_t bits_count = 0;
+
 	/* Will copy the header parts for CRC-8 calculation here */
 	uint8_t crc_header[11] = { 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0 };
 	int crc_pos = 0;
 
 	uint16_t audio_sample;
 
-	int ch, sb, blk, bit;	/* channel, subband, block and bit counters */
+	int ch, sb, blk;	/* channel, subband, block and bit counters */
 	int bits[2][8];		/* bits distribution */
 	int levels[2][8];	/* levels are derived from that */
 
@@ -973,8 +1000,6 @@ static int sbc_pack_frame(uint8_t *data, struct sbc_frame *frame, size_t len)
 
 	/* Can't fill in crc yet */
 
-	produced = 32;
-
 	crc_header[0] = data[1];
 	crc_header[1] = data[2];
 	crc_pos = 16;
@@ -999,6 +1024,7 @@ static int sbc_pack_frame(uint8_t *data, struct sbc_frame *frame, size_t len)
 		u_int32_t scalefactor_j[2];
 		uint8_t scale_factor_j[2];
 
+		uint8_t joint = 0;
 		frame->joint = 0;
 
 		for (sb = 0; sb < frame->subbands - 1; sb++) {
@@ -1031,6 +1057,7 @@ static int sbc_pack_frame(uint8_t *data, struct sbc_frame *frame, size_t len)
 			if ((scalefactor[0][sb] + scalefactor[1][sb]) >
 					(scalefactor_j[0] + scalefactor_j[1]) ) {
 				/* use joint stereo for this subband */
+				joint |= 1 << (frame->subbands - 1 - sb);
 				frame->joint |= 1 << sb;
 				frame->scale_factor[0][sb] = scale_factor_j[0];
 				frame->scale_factor[1][sb] = scale_factor_j[1];
@@ -1045,24 +1072,16 @@ static int sbc_pack_frame(uint8_t *data, struct sbc_frame *frame, size_t len)
 			}
 		}
 
-		data[4] = 0;
-		for (sb = 0; sb < frame->subbands - 1; sb++)
-			data[4] |= ((frame->joint >> sb) & 0x01) << (frame->subbands - 1 - sb);
-
-		crc_header[crc_pos >> 3] = data[4];
-
-		produced += frame->subbands;
+		PUT_BITS(joint, frame->subbands);
+		crc_header[crc_pos >> 3] = joint;
 		crc_pos += frame->subbands;
 	}
 
 	for (ch = 0; ch < frame->channels; ch++) {
 		for (sb = 0; sb < frame->subbands; sb++) {
-			data[produced >> 3] <<= 4;
+			PUT_BITS(frame->scale_factor[ch][sb] & 0x0F, 4);
 			crc_header[crc_pos >> 3] <<= 4;
-			data[produced >> 3] |= frame->scale_factor[ch][sb] & 0x0F;
 			crc_header[crc_pos >> 3] |= frame->scale_factor[ch][sb] & 0x0F;
-
-			produced += 4;
 			crc_pos += 4;
 		}
 	}
@@ -1088,25 +1107,15 @@ static int sbc_pack_frame(uint8_t *data, struct sbc_frame *frame, size_t len)
 						(uint16_t) ((((frame->sb_sample_f[blk][ch][sb]*levels[ch][sb]) >>
 									(frame->scale_factor[ch][sb] + 1)) +
 								levels[ch][sb]) >> 1);
-					audio_sample <<= 16 - bits[ch][sb];
-					for (bit = 0; bit < bits[ch][sb]; bit++) {
-						data[produced >> 3] <<= 1;
-						if (audio_sample & 0x8000)
-							data[produced >> 3] |= 0x1;
-						audio_sample <<= 1;
-						produced++;
-					}
+					PUT_BITS(audio_sample, bits[ch][sb]);
 				}
 			}
 		}
 	}
 
-	/* align the last byte */
-	if (produced % 8) {
-		data[produced >> 3] <<= 8 - (produced % 8);
-	}
+	FLUSH_BITS();
 
-	return (produced + 7) >> 3;
+	return data_ptr - data;
 }
 
 struct sbc_priv {
-- 
1.5.6.5


^ permalink raw reply related	[flat|nested] 12+ messages in thread

* Re: [PATCH] sbc: bitstream packing optimization for encoder.
  2008-12-11 20:27 [PATCH] sbc: bitstream packing optimization for encoder Siarhei Siamashka
@ 2008-12-12  1:39 ` Marcel Holtmann
  2008-12-12  9:37   ` Siarhei Siamashka
  2008-12-16 21:46   ` Siarhei Siamashka
  0 siblings, 2 replies; 12+ messages in thread
From: Marcel Holtmann @ 2008-12-12  1:39 UTC (permalink / raw)
  To: Siarhei Siamashka; +Cc: linux-bluetooth

Hi Siarhei,

> Appears that bitstream packing is also one of the most CPU hungry
> operations in SBC encoder, which did not get proper attention yet.
> Could somebody review this patch and provide feedback/comments?

again thanks for working on this. I am actually willing to include your
patches quickly in one of the next releases to get more wider testing
audience.

One comment from my side is that this should work with little-endian and
big-endian as input streams. Please keep that in mind.

Regards

Marcel



^ permalink raw reply	[flat|nested] 12+ messages in thread

* Re: [PATCH] sbc: bitstream packing optimization for encoder.
  2008-12-12  1:39 ` Marcel Holtmann
@ 2008-12-12  9:37   ` Siarhei Siamashka
  2008-12-12 11:04     ` Marcel Holtmann
  2008-12-16 21:46   ` Siarhei Siamashka
  1 sibling, 1 reply; 12+ messages in thread
From: Siarhei Siamashka @ 2008-12-12  9:37 UTC (permalink / raw)
  To: ext Marcel Holtmann; +Cc: linux-bluetooth

On Friday 12 December 2008 03:39:34 ext Marcel Holtmann wrote:
> Hi Siarhei,
>
> > Appears that bitstream packing is also one of the most CPU hungry
> > operations in SBC encoder, which did not get proper attention yet.
> > Could somebody review this patch and provide feedback/comments?
>
> again thanks for working on this. I am actually willing to include your
> patches quickly in one of the next releases to get more wider testing
> audience.

Thanks. Are the contributors of the last SBC-related modifications reachable
nowadays? They could probably help with the review of our patches.

SBC code still can be improved in many ways from the performance point of
view. The current implementation follows SBC specification pretty much
literally. But rearranging the code a bit and getting rid of multidimentional
arrays can provide identical output, but work noticeably faster. Also SBC
can benefit from ARM assembly optimizations (for ARM11 and Cortex-A8),
so these could be applied a bit later once all the high level stuff is in
place.

Current implementation only needs to be checked to ensure that it is correct
and fully adheres to the specification before applying large scale
optimizations.

> One comment from my side is that this should work with little-endian and
> big-endian as input streams. Please keep that in mind.

Yes, it should work fine on both big and little endian systems, though I did
not test it on big-endian hardware myself. Actually SBC bitstream packer
favors big-endian systems, because the following part...

+               if (bits_count >= 16) {\
+                       bits_count -= 8;\
+                       *data_ptr++ = (uint8_t)(bits_cache >> bits_count);\
+                       bits_count -= 8;\
+                       *data_ptr++ = (uint8_t)(bits_cache >> bits_count);\
+               }\

...could be implemented as something like this for big-endian systems
(provided that we additionally take care of alignment and convert 'data_ptr'
into a pointer to uint16_t):

+               if (bits_count >= 16) {\
+                       bits_count -= 16;\
+                       *data_ptr++ = (uint16_t)(bits_cache >> bits_count);\
+               }\

But for little endian-systems and also to ensure endian neutrality, data
writes are done one byte at a time.


And Just an additional comment about portability, there are also systems where
uint8_t data type is not supported (and CHAR_BIT is different from 8). For
example, there is a port SBC encoder from bluez to C55x DSP [1].  I'm in no
way involved in this project, but they could probably want to submit some
changes to mainline SBC code to make it usable on such systems with a bit
less tweaks.

1. https://garage.maemo.org/projects/dsp-sbc


Best regards,
Siarhei Siamashka

^ permalink raw reply	[flat|nested] 12+ messages in thread

* Re: [PATCH] sbc: bitstream packing optimization for encoder.
  2008-12-12  9:37   ` Siarhei Siamashka
@ 2008-12-12 11:04     ` Marcel Holtmann
  2008-12-12 15:23       ` Christian Hoene
  2008-12-12 16:07       ` Siarhei Siamashka
  0 siblings, 2 replies; 12+ messages in thread
From: Marcel Holtmann @ 2008-12-12 11:04 UTC (permalink / raw)
  To: Siarhei Siamashka; +Cc: linux-bluetooth

Hi Siarhei,

> > > Appears that bitstream packing is also one of the most CPU hungry
> > > operations in SBC encoder, which did not get proper attention yet.
> > > Could somebody review this patch and provide feedback/comments?
> >
> > again thanks for working on this. I am actually willing to include your
> > patches quickly in one of the next releases to get more wider testing
> > audience.
> 
> Thanks. Are the contributors of the last SBC-related modifications reachable
> nowadays? They could probably help with the review of our patches.

they should be at least all subscribed to this mailing list now. Since I
will actually close the other ones in two weeks.

> SBC code still can be improved in many ways from the performance point of
> view. The current implementation follows SBC specification pretty much
> literally. But rearranging the code a bit and getting rid of multidimentional
> arrays can provide identical output, but work noticeably faster. Also SBC
> can benefit from ARM assembly optimizations (for ARM11 and Cortex-A8),
> so these could be applied a bit later once all the high level stuff is in
> place.
> 
> Current implementation only needs to be checked to ensure that it is correct
> and fully adheres to the specification before applying large scale
> optimizations.

I agree. So lets get a fully compliant Linux version first and then we
go from there.

> > One comment from my side is that this should work with little-endian and
> > big-endian as input streams. Please keep that in mind.
> 
> Yes, it should work fine on both big and little endian systems, though I did
> not test it on big-endian hardware myself. Actually SBC bitstream packer
> favors big-endian systems, because the following part...
> 
> +               if (bits_count >= 16) {\
> +                       bits_count -= 8;\
> +                       *data_ptr++ = (uint8_t)(bits_cache >> bits_count);\
> +                       bits_count -= 8;\
> +                       *data_ptr++ = (uint8_t)(bits_cache >> bits_count);\
> +               }\
> 
> ...could be implemented as something like this for big-endian systems
> (provided that we additionally take care of alignment and convert 'data_ptr'
> into a pointer to uint16_t):
> 
> +               if (bits_count >= 16) {\
> +                       bits_count -= 16;\
> +                       *data_ptr++ = (uint16_t)(bits_cache >> bits_count);\
> +               }\
> 
> But for little endian-systems and also to ensure endian neutrality, data
> writes are done one byte at a time.

Lets keep doing it byte by byte for now. However keep in mind that the
input stream can also be in big endian even it is runs on a little
machine. And vice-versa.

> And Just an additional comment about portability, there are also systems where
> uint8_t data type is not supported (and CHAR_BIT is different from 8). For
> example, there is a port SBC encoder from bluez to C55x DSP [1].  I'm in no
> way involved in this project, but they could probably want to submit some
> changes to mainline SBC code to make it usable on such systems with a bit
> less tweaks.

We are doing Linux first and then worry about other systems.

Regards

Marcel



^ permalink raw reply	[flat|nested] 12+ messages in thread

* RE: [PATCH] sbc: bitstream packing optimization for encoder.
  2008-12-12 11:04     ` Marcel Holtmann
@ 2008-12-12 15:23       ` Christian Hoene
  2008-12-12 15:30         ` Marcel Holtmann
  2008-12-17 14:16         ` Siarhei Siamashka
  2008-12-12 16:07       ` Siarhei Siamashka
  1 sibling, 2 replies; 12+ messages in thread
From: Christian Hoene @ 2008-12-12 15:23 UTC (permalink / raw)
  To: 'Marcel Holtmann', 'Siarhei Siamashka'; +Cc: linux-bluetooth

Hi all,

next week I will discuss the bug fixing of SBC with Jaska Uimonen, who is
ill this week.

The sound quality of the fix point implementation still remains below of the
quality of the floating point version.

Maybe, we shall support both depending on the performance requirements?

Regards,
 
Christian


> -----Original Message-----
> From: linux-bluetooth-owner@vger.kernel.org [mailto:linux-bluetooth-
> owner@vger.kernel.org] On Behalf Of Marcel Holtmann
> Sent: Friday, December 12, 2008 12:04 PM
> To: Siarhei Siamashka
> Cc: linux-bluetooth@vger.kernel.org
> Subject: Re: [PATCH] sbc: bitstream packing optimization for encoder.
> 
> Hi Siarhei,
> 
> > > > Appears that bitstream packing is also one of the most CPU hungry
> > > > operations in SBC encoder, which did not get proper attention
> yet.
> > > > Could somebody review this patch and provide feedback/comments?
> > >
> > > again thanks for working on this. I am actually willing to include
> your
> > > patches quickly in one of the next releases to get more wider
> testing
> > > audience.
> >
> > Thanks. Are the contributors of the last SBC-related modifications
> reachable
> > nowadays? They could probably help with the review of our patches.
> 
> they should be at least all subscribed to this mailing list now. Since
> I
> will actually close the other ones in two weeks.
> 
> > SBC code still can be improved in many ways from the performance
> point of
> > view. The current implementation follows SBC specification pretty
> much
> > literally. But rearranging the code a bit and getting rid of
> multidimentional
> > arrays can provide identical output, but work noticeably faster. Also
> SBC
> > can benefit from ARM assembly optimizations (for ARM11 and Cortex-
> A8),
> > so these could be applied a bit later once all the high level stuff
> is in
> > place.
> >
> > Current implementation only needs to be checked to ensure that it is
> correct
> > and fully adheres to the specification before applying large scale
> > optimizations.
> 
> I agree. So lets get a fully compliant Linux version first and then we
> go from there.
> 
> > > One comment from my side is that this should work with little-
> endian and
> > > big-endian as input streams. Please keep that in mind.
> >
> > Yes, it should work fine on both big and little endian systems,
> though I did
> > not test it on big-endian hardware myself. Actually SBC bitstream
> packer
> > favors big-endian systems, because the following part...
> >
> > +               if (bits_count >= 16) {\
> > +                       bits_count -= 8;\
> > +                       *data_ptr++ = (uint8_t)(bits_cache >>
> bits_count);\
> > +                       bits_count -= 8;\
> > +                       *data_ptr++ = (uint8_t)(bits_cache >>
> bits_count);\
> > +               }\
> >
> > ...could be implemented as something like this for big-endian systems
> > (provided that we additionally take care of alignment and convert
> 'data_ptr'
> > into a pointer to uint16_t):
> >
> > +               if (bits_count >= 16) {\
> > +                       bits_count -= 16;\
> > +                       *data_ptr++ = (uint16_t)(bits_cache >>
> bits_count);\
> > +               }\
> >
> > But for little endian-systems and also to ensure endian neutrality,
> data
> > writes are done one byte at a time.
> 
> Lets keep doing it byte by byte for now. However keep in mind that the
> input stream can also be in big endian even it is runs on a little
> machine. And vice-versa.
> 
> > And Just an additional comment about portability, there are also
> systems where
> > uint8_t data type is not supported (and CHAR_BIT is different from
> 8). For
> > example, there is a port SBC encoder from bluez to C55x DSP [1].  I'm
> in no
> > way involved in this project, but they could probably want to submit
> some
> > changes to mainline SBC code to make it usable on such systems with a
> bit
> > less tweaks.
> 
> We are doing Linux first and then worry about other systems.
> 
> Regards
> 
> Marcel
> 
> 
> --
> To unsubscribe from this list: send the line "unsubscribe linux-
> bluetooth" in
> the body of a message to majordomo@vger.kernel.org
> More majordomo info at  http://vger.kernel.org/majordomo-info.html

^ permalink raw reply	[flat|nested] 12+ messages in thread

* RE: [PATCH] sbc: bitstream packing optimization for encoder.
  2008-12-12 15:23       ` Christian Hoene
@ 2008-12-12 15:30         ` Marcel Holtmann
  2008-12-12 15:49           ` Christian Hoene
  2008-12-12 16:01           ` Siarhei Siamashka
  2008-12-17 14:16         ` Siarhei Siamashka
  1 sibling, 2 replies; 12+ messages in thread
From: Marcel Holtmann @ 2008-12-12 15:30 UTC (permalink / raw)
  To: Christian Hoene; +Cc: 'Siarhei Siamashka', linux-bluetooth

Hi Christian,

can we please keep this mailing a top-posting free zone. I know it is
hard for Outlock to get this right, but it is possible.

> next week I will discuss the bug fixing of SBC with Jaska Uimonen, who is
> ill this week.
> 
> The sound quality of the fix point implementation still remains below of the
> quality of the floating point version.
> 
> Maybe, we shall support both depending on the performance requirements?

I think we should focus a fixed point version. There should be no need
for floating point at all. If fixed point isn't good enough, then we
screwed it up.

And in case of embedded devices we are seriously limited with floating
point and doing that in software just doesn't work out. And this is
mostly not directly a performance problem. It is more a power
consumption problem. We don't wanna have A2DP drain the battery.

Regards

Marcel



^ permalink raw reply	[flat|nested] 12+ messages in thread

* RE: [PATCH] sbc: bitstream packing optimization for encoder.
  2008-12-12 15:30         ` Marcel Holtmann
@ 2008-12-12 15:49           ` Christian Hoene
  2008-12-12 15:56             ` Marcel Holtmann
  2008-12-12 16:01           ` Siarhei Siamashka
  1 sibling, 1 reply; 12+ messages in thread
From: Christian Hoene @ 2008-12-12 15:49 UTC (permalink / raw)
  To: 'Marcel Holtmann'; +Cc: 'Siarhei Siamashka', linux-bluetooth

> > The sound quality of the fix point implementation still remains below
> of the
> > quality of the floating point version.
> >
> > Maybe, we shall support both depending on the performance
> requirements?
> 
> I think we should focus a fixed point version. There should be no need
> for floating point at all. If fixed point isn't good enough, then we
> screwed it up.
> 
> And in case of embedded devices we are seriously limited with floating
> point and doing that in software just doesn't work out. And this is
> mostly not directly a performance problem. It is more a power
> consumption problem. We don't wanna have A2DP drain the battery.
>

It would help if we use 64 bit integers if a good sound is required but this
is again a problem for embedded devices.

Regards
 Christian


> 

^ permalink raw reply	[flat|nested] 12+ messages in thread

* RE: [PATCH] sbc: bitstream packing optimization for encoder.
  2008-12-12 15:49           ` Christian Hoene
@ 2008-12-12 15:56             ` Marcel Holtmann
  0 siblings, 0 replies; 12+ messages in thread
From: Marcel Holtmann @ 2008-12-12 15:56 UTC (permalink / raw)
  To: Christian Hoene; +Cc: 'Siarhei Siamashka', linux-bluetooth

Hi Christian,

> > > The sound quality of the fix point implementation still remains below
> > of the
> > > quality of the floating point version.
> > >
> > > Maybe, we shall support both depending on the performance
> > requirements?
> > 
> > I think we should focus a fixed point version. There should be no need
> > for floating point at all. If fixed point isn't good enough, then we
> > screwed it up.
> > 
> > And in case of embedded devices we are seriously limited with floating
> > point and doing that in software just doesn't work out. And this is
> > mostly not directly a performance problem. It is more a power
> > consumption problem. We don't wanna have A2DP drain the battery.
> >
> 
> It would help if we use 64 bit integers if a good sound is required but this
> is again a problem for embedded devices.

depends on. Looking at the libmad library, they do a pretty good job in
just using integers.

Regards

Marcel



^ permalink raw reply	[flat|nested] 12+ messages in thread

* Re: [PATCH] sbc: bitstream packing optimization for encoder.
  2008-12-12 15:30         ` Marcel Holtmann
  2008-12-12 15:49           ` Christian Hoene
@ 2008-12-12 16:01           ` Siarhei Siamashka
  1 sibling, 0 replies; 12+ messages in thread
From: Siarhei Siamashka @ 2008-12-12 16:01 UTC (permalink / raw)
  To: ext Marcel Holtmann; +Cc: Christian Hoene, linux-bluetooth

On Friday 12 December 2008 17:30:47 ext Marcel Holtmann wrote:
> Hi Christian,
>
> can we please keep this mailing a top-posting free zone. I know it is
> hard for Outlock to get this right, but it is possible.
>
> > next week I will discuss the bug fixing of SBC with Jaska Uimonen, who is
> > ill this week.
> >
> > The sound quality of the fix point implementation still remains below of
> > the quality of the floating point version.
> >
> > Maybe, we shall support both depending on the performance requirements?

Well, it is quite natural that sound quality for 16 bit fixed point version
can't be better than the quality of the floating point one. But when
dealing with lossy compression, it is not always strictly necessary to
precisely match the output of the reference implementation, carefully
reproducing all the compression artefacts :)

More important is whether this implementation passes a standard SBC
conformance test. If it does, it would be good to have 16 bit fixed point
implementation.

Sound quality of Jaska's patch can be probably tweaked a bit by trying
different bias for coefficient values (not just clip values to 16 bit, but
also try different directions for rounding).

> I think we should focus a fixed point version. There should be no need
> for floating point at all. If fixed point isn't good enough, then we
> screwed it up.
>
> And in case of embedded devices we are seriously limited with floating
> point and doing that in software just doesn't work out. And this is
> mostly not directly a performance problem. It is more a power
> consumption problem. We don't wanna have A2DP drain the battery.

At least for ARM, performance of the possible polyphase filter implementations
can be approximately ranged in the following way (from fastest to slowest):
1. 16-bit*16-bit->32-bit integer multiplications (the best for any 
ARM cores >=armv5te)
2. single precision floating point multiplications (if floating point math is 
supported by hardware)
3. 32-bit*32-bit->32-bit  integer multiplications
4. double precision floating point multiplications (if floating point math is 
supported by hardware and FPU is pipelined for this kind of operation - not 
the case for Cortex-A8 unfortunately)
5. 32-bit*32-bit->64-bit integer multiplications

Currently SBC contains ARM inline assembly macros to utilize 
32-bit*32-bit->32-bit multiplication (multiply-accumulate variant) which
is not the fastest option. It is only good for armv4 cores which do not
support anything better. Even quite an old Nokia 770 internet tablet 
has support for armv5te instructions.

For x86 platform, 16-bit integer multiplications can be done quite efficiently
using MMX or SSE2.

To sum up, the 16-bit fixed point version will be the fastest and is the best
variant if it can provide the required precision. Otherwise single precision
floating point version is the next natural choice.

Surely there are also other factors to consider and raw multiplications
throughput performance is not the only thing. The need of scaling values for
fixed point versions and int->float/float->int conversions for the floating
point ones also plays some role.

-- 
Best regards,
Siarhei Siamashka

^ permalink raw reply	[flat|nested] 12+ messages in thread

* Re: [PATCH] sbc: bitstream packing optimization for encoder.
  2008-12-12 11:04     ` Marcel Holtmann
  2008-12-12 15:23       ` Christian Hoene
@ 2008-12-12 16:07       ` Siarhei Siamashka
  1 sibling, 0 replies; 12+ messages in thread
From: Siarhei Siamashka @ 2008-12-12 16:07 UTC (permalink / raw)
  To: ext Marcel Holtmann; +Cc: linux-bluetooth

On Friday 12 December 2008 13:04:09 ext Marcel Holtmann wrote:
[...]
> > But for little endian-systems and also to ensure endian neutrality, data
> > writes are done one byte at a time.
>
> Lets keep doing it byte by byte for now. However keep in mind that the
> input stream can also be in big endian even it is runs on a little
> machine. And vice-versa.

Yes I know, thanks for reminding :) This patch does not touch endian
conversion for the input data, so everything should be OK. Anyway, it's
better to do some real tests on big-endian systems just to be sure.

The endian conversion and channels deinterleaving part for the input data
is actually another performance bottleneck in SBC and can be also optimized.

-- 
Best regards,
Siarhei Siamashka

^ permalink raw reply	[flat|nested] 12+ messages in thread

* Re: [PATCH] sbc: bitstream packing optimization for encoder.
  2008-12-12  1:39 ` Marcel Holtmann
  2008-12-12  9:37   ` Siarhei Siamashka
@ 2008-12-16 21:46   ` Siarhei Siamashka
  1 sibling, 0 replies; 12+ messages in thread
From: Siarhei Siamashka @ 2008-12-16 21:46 UTC (permalink / raw)
  To: ext Marcel Holtmann; +Cc: linux-bluetooth

[-- Attachment #1: Type: text/plain, Size: 782 bytes --]

On Friday 12 December 2008 03:39:34 ext Marcel Holtmann wrote:
> Hi Siarhei,
>
> > Appears that bitstream packing is also one of the most CPU hungry
> > operations in SBC encoder, which did not get proper attention yet.
> > Could somebody review this patch and provide feedback/comments?
>
> again thanks for working on this. I am actually willing to include your
> patches quickly in one of the next releases to get more wider testing
> audience.

Well, after doing a bit more testing, appears that a bug slipped in:
+					PUT_BITS(audio_sample, bits[ch][sb]);

There should be the following line instead:
+					PUT_BITS(audio_sample & levels[ch][sb], bits[ch][sb]);

Updated patch is attached. Any additional testing/feedback is very much
welcome.

Best regards
Siarhei Siamashka

[-- Attachment #2: 0001-sbc-bitstream-writing-optimization-for-encoder-try2.patch --]
[-- Type: text/x-diff, Size: 4753 bytes --]

>From 0a3818bd86342c60cda2fa87c59ecd5f9d6d4591 Mon Sep 17 00:00:00 2001
From: Siarhei Siamashka <siarhei.siamashka@nokia.com>
Date: Thu, 11 Dec 2008 21:21:28 +0200
Subject: [PATCH] sbc: bitstream writing optimization for encoder.

SBC encoder performance improvement up to 1.5x for ARM11
and almost twice faster for Intel Core2 in some cases.

Signed-off-by: Siarhei Siamashka <siarhei.siamashka@nokia.com>
---
 sbc/sbc.c |   65 ++++++++++++++++++++++++++++++++++--------------------------
 1 files changed, 37 insertions(+), 28 deletions(-)

diff --git a/sbc/sbc.c b/sbc/sbc.c
index 7072673..c495a75 100644
--- a/sbc/sbc.c
+++ b/sbc/sbc.c
@@ -909,6 +909,29 @@ static int sbc_analyze_audio(struct sbc_encoder_state *state,
 	}
 }
 
+/* Supplementary bitstream writing macros for 'sbc_pack_frame' */
+
+#define PUT_BITS(v, n)\
+	{\
+		bits_cache = (v) | (bits_cache << (n));\
+		bits_count += (n);\
+		if (bits_count >= 16) {\
+			bits_count -= 8;\
+			*data_ptr++ = (uint8_t)(bits_cache >> bits_count);\
+			bits_count -= 8;\
+			*data_ptr++ = (uint8_t)(bits_cache >> bits_count);\
+		}\
+	} while (0);\
+
+#define FLUSH_BITS()\
+	while (bits_count >= 8) {\
+		bits_count -= 8;\
+		*data_ptr++ = (uint8_t)(bits_cache >> bits_count);\
+	}\
+	if (bits_count > 0) {\
+	    *data_ptr++ = (uint8_t)(bits_cache << (8 - bits_count));\
+	}\
+
 /*
  * Packs the SBC frame from frame into the memory at data. At most len
  * bytes will be used, should more memory be needed an appropriate
@@ -926,14 +949,18 @@ static int sbc_analyze_audio(struct sbc_encoder_state *state,
 
 static int sbc_pack_frame(uint8_t *data, struct sbc_frame *frame, size_t len)
 {
-	int produced;
+	/* Bitstream writer starts from the fourth byte */
+	uint8_t *data_ptr = data + 4;
+	uint32_t bits_cache = 0;
+	uint32_t bits_count = 0;
+
 	/* Will copy the header parts for CRC-8 calculation here */
 	uint8_t crc_header[11] = { 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0 };
 	int crc_pos = 0;
 
 	uint16_t audio_sample;
 
-	int ch, sb, blk, bit;	/* channel, subband, block and bit counters */
+	int ch, sb, blk;	/* channel, subband, block and bit counters */
 	int bits[2][8];		/* bits distribution */
 	int levels[2][8];	/* levels are derived from that */
 
@@ -973,8 +1000,6 @@ static int sbc_pack_frame(uint8_t *data, struct sbc_frame *frame, size_t len)
 
 	/* Can't fill in crc yet */
 
-	produced = 32;
-
 	crc_header[0] = data[1];
 	crc_header[1] = data[2];
 	crc_pos = 16;
@@ -999,6 +1024,7 @@ static int sbc_pack_frame(uint8_t *data, struct sbc_frame *frame, size_t len)
 		u_int32_t scalefactor_j[2];
 		uint8_t scale_factor_j[2];
 
+		uint8_t joint = 0;
 		frame->joint = 0;
 
 		for (sb = 0; sb < frame->subbands - 1; sb++) {
@@ -1031,6 +1057,7 @@ static int sbc_pack_frame(uint8_t *data, struct sbc_frame *frame, size_t len)
 			if ((scalefactor[0][sb] + scalefactor[1][sb]) >
 					(scalefactor_j[0] + scalefactor_j[1]) ) {
 				/* use joint stereo for this subband */
+				joint |= 1 << (frame->subbands - 1 - sb);
 				frame->joint |= 1 << sb;
 				frame->scale_factor[0][sb] = scale_factor_j[0];
 				frame->scale_factor[1][sb] = scale_factor_j[1];
@@ -1045,24 +1072,16 @@ static int sbc_pack_frame(uint8_t *data, struct sbc_frame *frame, size_t len)
 			}
 		}
 
-		data[4] = 0;
-		for (sb = 0; sb < frame->subbands - 1; sb++)
-			data[4] |= ((frame->joint >> sb) & 0x01) << (frame->subbands - 1 - sb);
-
-		crc_header[crc_pos >> 3] = data[4];
-
-		produced += frame->subbands;
+		PUT_BITS(joint, frame->subbands);
+		crc_header[crc_pos >> 3] = joint;
 		crc_pos += frame->subbands;
 	}
 
 	for (ch = 0; ch < frame->channels; ch++) {
 		for (sb = 0; sb < frame->subbands; sb++) {
-			data[produced >> 3] <<= 4;
+			PUT_BITS(frame->scale_factor[ch][sb] & 0x0F, 4);
 			crc_header[crc_pos >> 3] <<= 4;
-			data[produced >> 3] |= frame->scale_factor[ch][sb] & 0x0F;
 			crc_header[crc_pos >> 3] |= frame->scale_factor[ch][sb] & 0x0F;
-
-			produced += 4;
 			crc_pos += 4;
 		}
 	}
@@ -1088,25 +1107,15 @@ static int sbc_pack_frame(uint8_t *data, struct sbc_frame *frame, size_t len)
 						(uint16_t) ((((frame->sb_sample_f[blk][ch][sb]*levels[ch][sb]) >>
 									(frame->scale_factor[ch][sb] + 1)) +
 								levels[ch][sb]) >> 1);
-					audio_sample <<= 16 - bits[ch][sb];
-					for (bit = 0; bit < bits[ch][sb]; bit++) {
-						data[produced >> 3] <<= 1;
-						if (audio_sample & 0x8000)
-							data[produced >> 3] |= 0x1;
-						audio_sample <<= 1;
-						produced++;
-					}
+					PUT_BITS(audio_sample & levels[ch][sb], bits[ch][sb]);
 				}
 			}
 		}
 	}
 
-	/* align the last byte */
-	if (produced % 8) {
-		data[produced >> 3] <<= 8 - (produced % 8);
-	}
+	FLUSH_BITS();
 
-	return (produced + 7) >> 3;
+	return data_ptr - data;
 }
 
 struct sbc_priv {
-- 
1.5.6.5


^ permalink raw reply related	[flat|nested] 12+ messages in thread

* Re: [PATCH] sbc: bitstream packing optimization for encoder.
  2008-12-12 15:23       ` Christian Hoene
  2008-12-12 15:30         ` Marcel Holtmann
@ 2008-12-17 14:16         ` Siarhei Siamashka
  1 sibling, 0 replies; 12+ messages in thread
From: Siarhei Siamashka @ 2008-12-17 14:16 UTC (permalink / raw)
  To: ext Christian Hoene; +Cc: 'Marcel Holtmann', linux-bluetooth

On Friday 12 December 2008 17:23:33 ext Christian Hoene wrote:
> Hi all,
>
> next week I will discuss the bug fixing of SBC with Jaska Uimonen, who is
> ill this week.
>
> The sound quality of the fix point implementation still remains below of
> the quality of the floating point version.

Has the sound quality improved with the latest patches?

By the way, which floating point version do you have in mind? Is it btsco,
which is referenced from this wiki page: http://wiki.bluez.org/wiki/SBC ?

I tried to have a look at btsco, but it seems to be much worse quality wise
than the fixed point version with an updated filtering function. I tried to
encode some audio sample, decode it back and compare result with the
original file in both cases.

Also I compared 32-bit fixed point vs. 16-bit fixed point, and looks like
16-bit fixed point version has a sufficient precision. Anyway, the values,
received from the filter, are quantized later and much more information
about lower order bits is usually lost at this stage.

-- 
Best regards,
Siarhei Siamashka

^ permalink raw reply	[flat|nested] 12+ messages in thread

end of thread, other threads:[~2008-12-17 14:16 UTC | newest]

Thread overview: 12+ messages (download: mbox.gz / follow: Atom feed)
-- links below jump to the message on this page --
2008-12-11 20:27 [PATCH] sbc: bitstream packing optimization for encoder Siarhei Siamashka
2008-12-12  1:39 ` Marcel Holtmann
2008-12-12  9:37   ` Siarhei Siamashka
2008-12-12 11:04     ` Marcel Holtmann
2008-12-12 15:23       ` Christian Hoene
2008-12-12 15:30         ` Marcel Holtmann
2008-12-12 15:49           ` Christian Hoene
2008-12-12 15:56             ` Marcel Holtmann
2008-12-12 16:01           ` Siarhei Siamashka
2008-12-17 14:16         ` Siarhei Siamashka
2008-12-12 16:07       ` Siarhei Siamashka
2008-12-16 21:46   ` Siarhei Siamashka

This is an external index of several public inboxes,
see mirroring instructions on how to clone and mirror
all data and code used by this external index.