To: Carnac who wrote (3345)
From: SBHX
5/5/1999 9:17:00 AM
How about this: if you use the std Berkeley code,

1. Tweak the hell out of the getbits and showbits macros with MMX, using mm0 as a 64-bit storage buffer and mm1 as the number of bits to shift.

2. In getblk.c, for the four macroblock-type decodes, perform the VLD but leave deZigZag and IDCT to the chip. If the chip does deQ as well, great, that saves a few MMX mults on the CPU; otherwise it is just 4 (or 8) 4-way MMX mults [even here, there's room to cut corners]. If you take the avg number of non-zero terms in each macroblock (over a frame) to be 16 (a very conservative, high number), then what you are saying is that you want to budget 2k cycles / 16 terms. This leaves > 120 cycles per non-zero term (2000 / 16 = 125). This should be tons of time, especially if the decoded cmd-stream lives in write-combining space that doesn't pollute your cache.

3. If you are willing to live with some artifacts (courtesy of Hitachi's patent), you could do what Intel is rumoured to be planning: downscale to 960x1080 or 960x540 in s/w by having the IDCT spit out 4x8 blocks instead of 8x8. This saves some clocks on the CPU, but will behave even better if there is h/w assist to run in this mode [it also shaves the memBW reqd for MC]. Eh?

4. This thing isn't as cut and dried as you think. S/W decode of MPEG-2 has been marked by major breakthroughs that came from neat tricks trading off precision for speed. These are very creative people who are good at giving ccube fits. Recently, the precision loss [other than the tricks in compression] has been minimal with h/w assist, which makes ccube even more unhappy.

Tell us how to decode video in < 2k cycles/macroblock on a Pentium III 500 MHz and external MCP+IDCT h/w acceleration, but with enough headroom left on the host CPU for the other stuff.

Great, if they are just lucky and their competition is so inept, then this bodes great for their stock. All chip companies have to contend with yield, as they need to solve it to be profitable.
This is usually the domain of experienced chip-makers like S3, ATI and Matrox, as they have been doing it well for a long time, but hired guns with extensive manufacturing experience can make a big difference at nvidia and tdfx. Hard to say what is happening here.

I have a theory of why chip companies failed to execute in the past. If you look at the balance sheets of Cirrus Logic and (to a lesser extent) S3, they start to have problems a yr after those balance sheets show marketing spending at >2X of R&D. I think the company that has a balanced mix of R&D and marketing will have a better shot at growing and selling. As soon as you see some company's marketing spending skyrocket and R&D fall drastically, it's a warning to get out.

I think you indirectly answered why ATI has done so well up to now: their competition was stupid, produced new silicon too infrequently, and had yield problems. C-Cube has also survived as long as it has mostly thanks to the incompetence of its competition. JMHO.
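
On point 1 of the decode list: a rough sketch, in plain C rather than MMX, of what a 64-bit bit buffer for showbits/getbits might look like. Here `buf` stands in for mm0 and the per-call shift count plays the role of mm1. The names `BitReader`, `br_init`, `br_refill`, `br_showbits`, `br_flushbits` and `br_getbits` are made up for illustration and are not taken from the Berkeley code; the left-aligned buffer layout and the 1..32 bit limit per call are also assumptions of this sketch.

```c
#include <assert.h>
#include <stddef.h>
#include <stdint.h>

/* Hypothetical bit reader: not the Berkeley decoder's actual structure. */
typedef struct {
    const uint8_t *src;   /* next unread byte of the bitstream */
    const uint8_t *end;   /* one past the last byte */
    uint64_t buf;         /* pending bits, left-aligned (next bit is the MSB) */
    unsigned nbits;       /* number of valid bits currently in buf */
} BitReader;

static inline void br_init(BitReader *br, const uint8_t *data, size_t len) {
    br->src = data;
    br->end = data + len;
    br->buf = 0;
    br->nbits = 0;
}

/* Top up the buffer one byte at a time while a whole byte still fits. */
static inline void br_refill(BitReader *br) {
    while (br->nbits <= 56 && br->src < br->end) {
        br->buf |= (uint64_t)*br->src++ << (56 - br->nbits);
        br->nbits += 8;
    }
}

/* Peek at the next n bits (1..32) without consuming them.
   Past the end of the data the missing bits read as zero. */
static inline uint32_t br_showbits(BitReader *br, unsigned n) {
    assert(n >= 1 && n <= 32);
    br_refill(br);
    return (uint32_t)(br->buf >> (64 - n));
}

/* Discard n bits (1..32); meant to follow a br_showbits call. */
static inline void br_flushbits(BitReader *br, unsigned n) {
    assert(n >= 1 && n <= 32);
    br->buf <<= n;
    br->nbits = (br->nbits > n) ? br->nbits - n : 0;
}

/* Read and consume the next n bits (1..32). */
static inline uint32_t br_getbits(BitReader *br, unsigned n) {
    uint32_t v = br_showbits(br, n);
    br_flushbits(br, n);
    return v;
}
```

Feeding it the bytes 00 00 01 B3 2D 01 E0 and calling `br_getbits` with 32, 12 and 12 should give the sequence-header start code 0x000001B3 followed by 720 and 480. With a 64-bit buffer the refill runs only occasionally, so most showbits/getbits calls are just a shift, which is where the cycle savings would come from.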