Architecture is key to execution in Java -- Part 4: Languages And Operating Systems
George Shaw, Shaw Laboratories Ltd., Consulting Computer Engineering, Hayward, Calif.
06/16/97, Electronic Engineering Times, Page 119. Copyright 1997 CMP Publications Inc.
While Java has gained considerable momentum in the marketplace as a programming language of choice, most professional programmers will find that it requires them to think about the programming process in an entirely new way.
In the classical programming process, it is typically necessary, or even required, that all steps except possibly program execution be performed on the same type of computer. Java program development is different. A programmer translates an idea into the syntax of the Java programming language. A Java compiler then translates the Java source into "bytecodes," which serve as the compiler's intermediate code in place of the classical abstract intermediate code. Bytecodes are also the machine instruction set of the Java Virtual Machine (VM) and the distributed executable format for the program.
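To make this concrete, the unit of distribution is the class file: a developer compiles source to bytecodes once, and any machine with a Java VM can then execute them. A minimal sketch (file and class names here are illustrative):

    // HelloEmbedded.java -- compiled once ("javac HelloEmbedded.java")
    // to HelloEmbedded.class; the resulting bytecodes then execute
    // unchanged on any computer with a Java VM ("java HelloEmbedded").
    public class HelloEmbedded {
        public static void main(String[] args) {
            System.out.println("Same bytecodes, any Java VM");
        }
    }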
Since the bytecodes are the intermediate code, compiler optimizations are performed directly on them. (Some compilers may still include an abstract intermediate-code step.) Early Java compilers, however, have been fairly poor: some do not even perform the live-variable analysis that would allow register reuse within a method executing on the target computer. At this point, the translation process at the programmer's site is complete; the remainder occurs on the user's computer.
Bytecodes are actual machine instructions for a heretofore hypothetical stack computer, or Java processor. Although Sun has licensed the design for a stack computer that executes bytecodes directly, silicon has yet to be produced. Stack computer architectures, also known as zero-address or zero-operand architectures, differ from more conventional register-based architectures in that compute operations do not specify the addresses of the operands used. The operands manipulated are on the top of an operand stack.
By contrast, most 32-bit RISC processors are three-address architectures: each instruction specifies two source addresses and a destination address. Each address consumes bits within the instruction; for example, 5 bits per address (15 bits in all) if the processor has 32 registers. Stack architectures thus typically have short instructions, because no bits are spent specifying operand addresses. Although extra instructions must be executed to load operands onto the operand stack and to store results from it, fewer explicit operand addresses are needed overall because intermediate results can simply be left on the stack. Programs are therefore often shorter than their counterparts on register architectures.
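To make the contrast concrete, consider a trivial method and the stack code the JDK compiler produces for it (visible with the javap -c disassembler); the three-address sequence shown alongside is a hypothetical RISC rendering for comparison only.

    // AddExample.java -- illustrates stack vs. three-address encoding.
    class AddExample {
        int add(int a, int b) {
            return a + b;
        }
    }
    // javap -c shows the stack code javac emits for add():
    //   iload_1    push local 1 (a) onto the operand stack
    //   iload_2    push local 2 (b)
    //   iadd       pop two ints, push their sum
    //   ireturn    return the top of the operand stack
    // Each opcode fits in a single byte; no operand addresses appear.
    // A hypothetical three-address RISC rendering of the same work:
    //   add r3, r1, r2   ; 15 of this instruction's bits merely
    //                    ; name the two sources and the destination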
Rich bytecodes
Hence, most Java VM instructions are encoded in 8 bits; the bytecodes are nonetheless semantically rich. For example, bytecodes that access storage locations are data-typed, and other bytecodes represent high-level functions such as creating an object or invoking a method. For the most part, this semantic content is useful in completing the compilation process but not during execution. For example, the identical operations are performed to load a 32-bit integer or a 32-bit floating-point number from a variable onto the operand stack, even though the two loads are encoded as different bytecode instructions.
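The typed loads just mentioned are easy to see in compiled code. In the sketch below (names are illustrative), the integer and floating-point copies perform the same 32-bit move, yet javac encodes them with different opcodes:

    // TypedLoads.java -- two operationally identical 32-bit loads that
    // are nonetheless encoded as distinct, data-typed bytecodes.
    class TypedLoads {
        static int summarize(int i, float f) {
            int a = i;     // typically compiles to iload_0 / istore_2
            float b = f;   // typically compiles to fload_1 / fstore_3
            return a + (int) b;
        }
    }
    // Both iload and fload copy one 32-bit slot from a local variable
    // to the operand stack; the type distinction serves the verifier
    // and the back end of the compiler, not the execution itself.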
There are two basic choices for execution of Java bytecodes on a non-Java processor: interpret the bytecodes, or continue the compilation process and translate the bytecodes into the machine instructions for the user's computer, also known as just-in-time (JIT) compilation.
Bytecode interpretation, which emulates a Java processor, is simple but incurs a dispatch overhead of about three instructions and five memory references for every Java bytecode executed. In addition, maintaining the Java VM stack in software is costly, and typically no code optimizations can be performed.
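A minimal sketch of such an interpreter's inner loop makes the per-bytecode cost visible. The opcode values below are the real Java VM encodings, but the scaffolding around them is invented for illustration:

    // MiniInterp.java -- sketch of a bytecode interpreter's dispatch
    // loop. Opcode values match the Java VM encoding; everything else
    // is illustrative scaffolding.
    class MiniInterp {
        static final int ICONST_1 = 0x04, IADD = 0x60, RETURN = 0xB1;

        static void run(byte[] code) {
            int[] stack = new int[64];        // operand stack kept in software
            int sp = 0;                       // stack pointer
            int pc = 0;                       // bytecode program counter
            while (true) {
                int op = code[pc++] & 0xFF;   // fetch: a memory reference
                switch (op) {                 // dispatch: an indirect branch
                    case ICONST_1: stack[sp++] = 1; break;
                    case IADD:     sp--; stack[sp - 1] += stack[sp]; break;
                    case RETURN:   return;
                    default: throw new IllegalStateException("opcode " + op);
                }
            }
        }
    }
    // Every bytecode pays the fetch and dispatch cost, and every stack
    // push or pop is an explicit memory reference -- the overhead the
    // figures above describe.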
Interpreted execution is fairly slow, at three to 20 times slower than compiled code, but the memory overhead (40 kbytes to 50 kbytes) is relatively low, at less than one-fifth that of most JIT compilers. The results of execution also appear immediately, albeit slowly, without the delay of the JIT compilation step. An interpreter also tends to be more stable, since compiler optimizations are often a source of bugs that can open security holes, several of which have already been reported.
Compiling bytecodes can range from a fairly straightforward to a very complex process. A straightforward JIT compiler can simply substitute equivalent in-line code for the semantics of each bytecode. The substituted code might even contain some optimizations, but at the very least it eliminates the overhead of the interpreter's bytecode dispatcher.
A more complex JIT compiler would eliminate as many stack operations as possible and keep more stack elements in registers. It would perform peephole optimization to reduce multiple bytecodes to more optimal instruction sequences, and it would map stack and local-variable slots to registers, as sketched below. It might even begin compilation during the application download to reduce the apparent JIT compilation overhead.
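As a hedged illustration of such a peephole reduction, the sketch below collapses the common four-bytecode idiom "iload a; iload b; iadd; istore c" into one register-to-register instruction, assuming each Java VM local has already been mapped to a register of a hypothetical target:

    // Peephole.java -- sketch of a peephole pass over bytecodes. The
    // opcode values are the real Java VM encodings; the emitted target
    // instruction syntax is hypothetical.
    import java.util.ArrayList;
    import java.util.List;

    class Peephole {
        static final int ILOAD = 0x15, IADD = 0x60, ISTORE = 0x36;

        static List<String> translate(int[] bc) {
            List<String> out = new ArrayList<>();
            for (int i = 0; i < bc.length; ) {
                // Window: iload x, iload y, iadd, istore z
                if (i + 6 < bc.length
                        && bc[i] == ILOAD && bc[i + 2] == ILOAD
                        && bc[i + 4] == IADD && bc[i + 5] == ISTORE) {
                    // Emit one instruction instead of four stack operations.
                    out.add("add r" + bc[i + 6] + ", r" + bc[i + 1]
                            + ", r" + bc[i + 3]);
                    i += 7;
                } else {
                    out.add("emit 0x" + Integer.toHexString(bc[i])); // 1:1 fallback
                    i++;
                }
            }
            return out;
        }
    }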
Some additional optimizations are more difficult, though. While bytecodes are semantically rich and the bytecode representation does contain the same overall semantic content as the original Java program, it does not contain all the information present when the Java source code was processed. Given enough resources and time, though, an aggressive JIT compiler should produce a result similar to that of a Java compiler that directly produces machine code.
However, the memory footprint of both the JIT compiler and the compiled code is much larger than that of an interpreter. A simple JIT compiler requires about 200 kbytes; an aggressive JIT compiler could conceivably require several megabytes, or even disk storage. On 80X86 CPUs, program expansion of three to five times the bytecode size is typical after JIT compilation.
If Java execution is critical to the application, use of a CPU with an architecture closer to the Java VM would make sense since this would reduce or eliminate the JIT compilation overhead. However, there may be reasons to stay with an existing register-based CPU, including legacy applications or existing hardware designs. If resources are limited or immediate execution is required, an interpreter may be more appropriate than a JIT compiler. But if the hardware design is not frozen and performance is important, a stack-based CPU architecture may be much more cost effective.
One of the first Java-like CPUs is the PSC1000 from Patriot Scientific Corp. (San Diego). The Java VM maps so closely onto its architecture that 38 percent of the Java VM bytecodes translate directly into the same or fewer bytes of PSC1000 opcodes. Since, like the Java VM, it is a stack architecture, the bytecode-to-machine-code translation is direct; no register mapping or complex optimization is required. Bytecode expansion is only 20 percent, compared with 300 percent to 500 percent on 80X86 chips. Not surprisingly, the JIT compiler is really just a simple opcode translator and thus requires less than 20 kbytes of memory. It runs as part of class loading and bytecode verification during downloading, so translation overhead is effectively zero.
The PSC1000 CPU was designed from the ground up to execute languages like C, C++, Forth and Postscript efficiently, and Java is really just a better C++. Although Sun has not made public which bytecodes are being added to picoJava to support languages like C and C++, support is said to be good. If so, support for legacy C code, operating systems, low-level device drivers and other non-Java code should be adequate. However, research has shown that performance improves more by first adding eight global registers than by increasing the number of local registers beyond eight. Such a large architectural change to picoJava as adding a global register space seems unlikely, so its performance with conventional languages may be less than desirable.
Though picoJava and the PSC1000 have comparable performance, picoJava uses more complex technology to keep up. PicoJava, as benchmarked by Sun, has a 4-kbyte instruction cache and an 8-kbyte data cache. Other than separate caches for a 16-deep local-register stack and an 18-deep operand stack (both implemented mostly as RAM) and a 32-bit instruction-prefetch register, the PSC1000 has no caches. PicoJava also uses instruction pipelining and register forwarding; the PSC1000 uses neither. These omissions greatly reduce chip area, complexity and core power dissipation compared with picoJava.
While both are available as licensable cores, the PSC1000 is also implemented in a less-advanced 0.5-micron, two-layer-metal technology. Because it is a portable Verilog/VHDL design, adding instruction and data caches and moving to 0.35-micron technology should yield a performance improvement of at least three times.
The JIT compilation overhead greatly reduces the performance of register architectures when running Java.
Last year, Sun Microsystems published a white paper comparing the performance of picoJava, Pentium and 80486 processors that claims picoJava is five times faster than a Pentium. The comparison is flawed, however, because Sun compared the processors on a clock-speed-equivalent basis of 100 MHz but did not benchmark the competing processors actually running at that speed. Instead, it simply adjusted the measured results by the ratio of each benchmark system's actual clock speed to the 100-MHz picoJava target.
Off scale
This adjustment is inappropriate because bus speeds, a limiting factor in processor performance, do not scale directly with processor clock speed. Thus, Sun's 166-MHz Pentium benchmark times were multiplied by 1.66 (166/100) to scale them to 100 MHz, when published benchmarks (e.g., Intel Corp.'s iCOMP Index 2.0) indicate that 1.41 is a more likely ratio.