To: Randy Giese who wrote (3003), 7/5/1997 5:41:00 PM
From: Urlman
 
Architecture is key to execution in Java <MUST READ>
Embedded Systems -- Part 4: Languages And Operating Systems
George Shaw, Shaw Laboratories Ltd., Consulting Computer Engineering, Hayward, Calif.

06/16/97
Electronic Engineering Times
Page 119
Copyright 1997 CMP Publications Inc.

While Java has gained considerable momentum in the marketplace as the
programming language of choice, most professional programmers will find
that it requires them to think about programming in an entirely new way.

In the classical programming process, it is typically necessary, or even
required, that all steps except possibly program execution be performed on
the same type of computer. Java program development is different. There, a
programmer translates an idea into the syntax of the Java programming
language. Then a Java compiler translates the Java language into
"bytecodes," which serve as the compiler's intermediate code in place of
the classical abstract intermediate code. Bytecodes are also the machine
instruction set of the Java Virtual Machine (VM) and the distributed
executable format for the program.
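
As a minimal sketch of that translation (the class is hypothetical, and the
listing is roughly what the javap -c disassembler shows for the .class file
that gets distributed):

  // Counter.java -- hypothetical example
  public class Counter {
      private int count;

      int next() {
          count = count + 1;
          return count;
      }
  }
  // One straightforward translation of next() into bytecodes:
  //   aload_0          push 'this' (receiver for the later putfield)
  //   aload_0
  //   getfield count   push this.count
  //   iconst_1         push the constant 1
  //   iadd             count + 1
  //   putfield count   store the sum back into this.count
  //   aload_0
  //   getfield count   reload the updated value
  //   ireturn          return the int on top of the operand stack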

Since the bytecodes are the intermediate code, compiler optimizations are
performed directly on them. (Note that some compilers may include the
abstract intermediate code step.) However, early Java compilers have
been fairly poor: some do not even perform live-variable analysis to allow
register reuse within a method executing on the target computer. At this
point, the translation process at the programmer's site is complete. The
remainder of the translation process occurs on the user's computer.

Bytecodes are actual machine instructions for a heretofore hypothetical
stack computer, or Java processor. Although Sun has licensed the design
for a stack computer that executes bytecodes directly, silicon has yet to be
produced. Stack computer architectures, also known as zero-address or
zero-operand architectures, differ from more conventional register-based
architectures in that compute operations do not specify the addresses of
the operands used. The operands manipulated are on the top of an
operand stack.

By contrast, most 32-bit RISC processors are three-address
architectures, i.e., each instruction specifies two source addresses and a
destination address. Each address requires bits within the instruction; for
example, 5 bits for each of the addresses (a total of 15 bits) if the
processor has 32 registers. Stack architectures thus typically have short
instructions because no bits are used to specify operand addresses.
Although instructions must be executed to load operands onto the operand
stack and to store results from the operand stack, fewer explicit operand
addresses are required because intermediate results can be left on the
operand stack. Thus programs are often shorter than those on register
architectures.
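
As an illustrative sketch (variable and register names are hypothetical),
compare what each style of machine needs to evaluate (a + b) * c:

  static int combine(int a, int b, int c) {
      return (a + b) * c;
  }
  // Zero-address (Java VM) bytecodes for combine -- no instruction names its
  // operands, and the intermediate result a + b simply stays on the operand stack:
  //   iload_0 ; iload_1 ; iadd ; iload_2 ; imul ; ireturn
  // A generic three-address (register) equivalent spends instruction bits
  // naming two sources and a destination each time (register numbers illustrative):
  //   add r3, r0, r1      r3 = a + b
  //   mul r3, r3, r2      r3 = r3 * c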

Rich bytecodes

Hence, most Java VM instructions are encoded in 8 bits; however, the
bytecodes are semantically rich. For example, bytecodes that access
storage locations are data-typed and other bytecodes represent high-level
functions such as create object or invoke method. For the most part, the
semantic content of the bytecodes is useful in completing the compilation
process, but is not useful during execution. For example, identical
processes would be performed to load a 32-bit integer or a 32-bit
floating-point number from a variable onto the operand stack even though
these loads are encoded as two different bytecode instructions.
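
A minimal illustration (method names hypothetical): both methods below do
identical 32-bit loads at run-time, yet they compile to different bytecodes
because the instructions carry the operand's type.

  static int   firstInt(int i, int j)       { return i; }   // compiles to: iload_0 ; ireturn
  static float firstFloat(float f, float g) { return f; }   // compiles to: fload_0 ; freturn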

There are two basic choices for execution of Java bytecodes on a
non-Java processor: interpret the bytecodes, or continue the compilation
process and translate the bytecodes into the machine instructions for the
user's computer, also known as just-in-time (JIT) compilation.

Bytecode interpretation, which emulates a Java processor, is simple but
requires dispatch overhead of about three instructions and five memory
references for every Java bytecode executed. Additionally, maintaining the
Java VM stack in software is costly, and typically no code optimizations
can be performed.
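
As a minimal sketch of where that dispatch cost comes from (the opcode
subset and stack layout are simplified illustrations written in Java, not
how any shipping VM is implemented):

  // Simplified bytecode interpreter: every iteration pays the fetch, decode
  // and dispatch cost of the switch before doing any useful work.
  static final int ICONST_1 = 0x04, IADD = 0x60, IRETURN = 0xAC;  // real JVM opcode values

  static int interpret(byte[] code, int[] stack) {
      int pc = 0;   // bytecode program counter
      int sp = 0;   // operand-stack pointer
      while (true) {
          int op = code[pc++] & 0xFF;               // fetch and decode
          switch (op) {                             // dispatch
              case ICONST_1: stack[sp++] = 1; break;                       // push constant 1
              case IADD:     stack[sp - 2] += stack[sp - 1]; sp--; break;  // add top two entries
              case IRETURN:  return stack[--sp];                           // pop and return
              default:       throw new IllegalStateException("opcode " + op);
          }
      }
  }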

Interpreted execution overhead is fairly high at three to 20 times slower
than compiled code, but the memory overhead (40 kbytes to 50 kbytes) is
relatively low at less than one-fifth that of most JIT compilers. The results
of execution also appear immediately, albeit slowly, without the delay of
the JIT compilation step. Also, an interpreter tends to be more stable since
compiler optimizations are often the source of bugs that could result in
security holes, several of which have already been reported.

Compiling bytecodes can range from a fairly straightforward to a very
complex process. A straightforward JIT compiler can simply substitute
equivalent in-line code for the semantics of each bytecode. The substituted
code might even contain some optimizations, but at the very least it
eliminates the overhead of the bytecode dispatcher in the interpreter.
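
For illustration only (the target "instructions" are just strings and the
opcode subset is tiny), a sketch of that substitution step in Java:

  // Template substitution: each bytecode is expanded in-line into a fixed
  // sequence of target instructions, so the dispatch loop disappears from
  // the generated code.
  static java.util.List<String> expand(byte[] code) {
      java.util.List<String> out = new java.util.ArrayList<String>();
      for (int pc = 0; pc < code.length; pc++) {
          switch (code[pc] & 0xFF) {
              case 0x04: out.add("push 1");                    break;  // iconst_1
              case 0x60: out.add("pop r1");                             // iadd expands to a
                         out.add("pop r0");                             // three-instruction
                         out.add("push r0 + r1");              break;  // in-line template
              case 0xAC: out.add("pop r0");
                         out.add("return r0");                 break;  // ireturn
              default:   throw new IllegalStateException("unhandled bytecode");
          }
      }
      return out;
  }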

A more complex JIT compiler would eliminate as many stack operations
as possible and keep more of the stack's elements in registers. It would
perform peephole optimization to reduce multiple bytecodes to shorter
instruction sequences, and it would map stack slots and local variables
onto machine registers. It might even begin compilation during the
application download to reduce the apparent JIT compilation overhead.
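
As a hedged sketch of the kind of peephole reduction meant here (the opcode
values are the real Java VM encodings; the emitted register instruction is
illustrative), the three-bytecode sequence that adds two locals collapses
into a single register add once the stack traffic is mapped away:

  // Peephole pattern: "iload_1 ; iload_2 ; iadd" (0x1B, 0x1C, 0x60) becomes
  // one register-to-register add, with locals 1 and 2 mapped to r1 and r2.
  static String peephole(byte[] code, int pc) {
      if (pc + 2 < code.length
              && (code[pc]     & 0xFF) == 0x1B      // iload_1
              && (code[pc + 1] & 0xFF) == 0x1C      // iload_2
              && (code[pc + 2] & 0xFF) == 0x60) {   // iadd
          return "add r3, r1, r2";                  // single target instruction
      }
      return null;   // no match: fall back to per-bytecode templates
  }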

Some additional optimizations are more difficult, though. While bytecodes
are semantically rich and the bytecode representation does contain the
same overall semantic content as the original Java program, it does not
contain all the information present when the Java source code was
processed. Given enough resources and time, though, an aggressive JIT
compiler should produce a result similar to that of a Java compiler that
directly produces machine code.

However, the memory footprint for both the JIT compiler and the
compiled code is much larger than that of an interpreter. A simple JIT
compiler requires 200 kbytes; an aggressive JIT compiler conceivably
requires several megabytes, or even disk storage. Program expansion
from JIT compilation of three to five times the bytecode size is typical on
80X86 CPUs.

If Java execution is critical to the application, use of a CPU with an
architecture closer to the Java VM would make sense since this would
reduce or eliminate the JIT compilation overhead. However, there may be
reasons to stay with an existing register-based CPU, including legacy
applications or existing hardware designs. If resources are limited or
immediate execution is required, an interpreter may be more appropriate
than a JIT compiler. But if the hardware design is not frozen and
performance is important, a stack-based CPU architecture may be much
more cost effective.

One of the first Java-like CPUs is the PSC1000 from Patriot Scientific
Corp. (San Diego) <PTSC: OTC BB; ptsc.com>. The Java VM maps so closely onto its architecture
that 38 percent of the Java VM bytecodes translate directly into the same
or fewer bytes of opcodes on the PSC1000. Since, like the Java VM, it is
also a stack architecture, the bytecode-to-machine-code translation is
direct; no register mappings or complex optimizations are required.
Bytecode expansion is only 20 percent compared to 300 percent to 500
percent on 80X86 chips. Not surprisingly, the JIT compiler is really just a
simple opcode translator and thus requires less than 20 kbytes of memory.
It is actually part of the class loader and bytecode verification that execute
during downloading, so translation overhead is effectively zero.

The PSC1000 CPU was designed from the ground up to execute
languages like C, C++, Forth and Postscript efficiently, and Java is really
just a better C++. Although Sun has not made public what bytecodes are
being added to picoJava to support languages like C and C++, support is
said to be good. If this is the case, support for legacy C code, operating
systems, low-level device drivers and other non-Java code should be
adequate. However, research has shown that performance is improved
more by first adding eight global registers than by increasing the number of
local registers beyond eight. It would seem that such a large architectural
change to picoJava as adding a global register space is unlikely, so
performance with conventional languages may be less than desirable.

Though picoJava and the PSC1000 have comparable performance,
picoJava is using more complex technology to keep up. PicoJava, as
benchmarked by Sun, has a 4-kbyte instruction cache and an 8-kbyte
data cache. Other than separate caches for a 16-deep local-register stack
and 18-deep operand stack (both implemented mostly as RAM) and a
32-bit instruction-prefetch register, the PSC1000 has no caches. PicoJava
also uses instruction pipelining and register forwarding, whereas the
PSC1000 uses neither. This greatly reduces chip area, complexity and
core power dissipation compared with picoJava.

While both are available as licensable cores, the PSC1000 is also
implemented in a less-advanced 0.5-micron, two-layer-metal technology.
Because the PSC1000 is a portable Verilog/VHDL design, adding instruction
and data caches and moving to a 0.35-micron process would yield a
performance improvement of at least three times.

The JIT compilation overhead greatly reduces the performance of register
architectures when running Java.

Last year Sun Microsystems published a white paper comparing the
performance of picoJava, Pentium and 80486 processors; it claims picoJava
is five times faster than a Pentium. The comparison is flawed, however,
because Sun chose to compare the processors on a clock-speed-equivalent
basis of 100 MHz but did not benchmark its competing processors actually
running at that speed. Instead, it simply adjusted the performance results
by the ratio of each benchmark system's actual clock speed to the picoJava
target of 100 MHz.

Off scale

This adjustment is inappropriate because bus speeds, a limiting factor in
processor performance, do not scale directly with processor speed. Thus,
Sun's 166-MHz Pentium benchmark time measurements were multiplied
by 1.66 to scale the time to 100 MHz, when published benchmarks (e.g.,
Intel Corp.'s iCOMP Index 2.0) indicate that 1.41 is a more likely ratio.
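
To make the adjustment concrete with a hypothetical figure: a benchmark
that ran in 10.0 seconds on the 166-MHz Pentium would have been reported as
16.6 seconds at "100 MHz" (10.0 x 166/100), whereas the 1.41 ratio suggests
roughly 14.1 seconds, overstating the Pentium's time, and hence picoJava's
apparent advantage, by nearly 18 percent.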

Copyright (c) 1997 CMP Media Inc.