Benedict & Kunal... Here is some technobabble 4 ya: Reposted from Usenet
From Keith Wootten <Keith@wootten.demon.co.uk>
Organization DMG Reading
Date Tue, 19 May 1998 09:28:23 +0100
Newsgroups comp.arch,comp.arch.fpga,comp.arch.embedded
Message-ID <GDzk9BAnKUY1Ewtr@wootten.demon.co.uk>
References 1 2 3 4 5
In article <Et6LoJ.LE0@world.std.com>, Joseph H Allen <jhallen@world.std.com> writes
>In article <X0F+fHAUJLY1Ew92@wootten.demon.co.uk>,
>Keith Wootten <Keith@wootten.demon.co.uk> wrote:
>>In article <Et568q.1zq@world.std.com>, Joseph H Allen
>><jhallen@world.std.com> writes
>>>In article <355f5eaf.6311596@news.megsinet.net>, <msimon@tefbbs.com> wrote:
>>>
>>>>Read a book called "Stack Computers" by Phil Koopman available free on
>>>>the net.
>>>
>>>>Some of the ideas in this book are being used in a Java Engine by
>>>>Patriot Scientific.
>>>
>>>I've been thinking about these stack processors, but I'm still not
>>>convinced. They don't do particularly well with structure accesses or even
>>>with simple block copies, and they tend to require that a lot of the stack
>>>to be cached on the FPGA (so they're bigger).
>>
>>I'm not sure what you mean about simple block copies - surely this is
>>just something which a chip either does well or doesn't do well
>>depending on the hardware available, irrespective of whether it's a
>>stack machine or not?
>>
>>Load source address to register A
>>Load destination address to register B
>>Load count to register C
>>
>>Read data at A, post incrementing A
>>Write data at B, post incrementing B
>>Decrement C and repeat till zero.
>
>>FWIW the PSC1000 does the above loop with three 8bit opcodes for up to
>>2^32 iterations, these three making 3/4 of a 32bit instruction group.
>
>Huh? Wouldn't you need something like:
>
>       1000            ; push destination address
>       2000            ; push source address
>loop:
>       over            ; dup destination
>       over            ; dup source
>       @               ; replace source with contents
>       !               ; write contents to destination
>       2               ; increment source
>       add
>       swap            ; increment dest
>       2
>       add
>       swap
>       ...
>
>I.E., 10 instructions to move each word (unless I'm really missing something
>about these 0-address top of stack machines).
[snipped]
Yes, if your stack machine were to be *only* a hardware implementation of the 'standard' Forth virtual machine. Stack machines (actual Silicon ones) never implement exactly and only this or any other virtual machine, but always add extra useful instructions.
eg on the PSC1000 (using Patriot's syntax and after loading the three registers)
align to 4 byte boundary
copying ld[x++]        \ push TOS with (x) and increment x by 4
        st[r0++]       \ pop TOS to (r0) and increment r0 by 4
        mloop copying  \ decrement ct and goto copying if non-zero
which is three 8bit opcodes. The mloop instruction works with up to three preceding opcodes in the same 32bit (four instruction) memory group. Other stack machines do differently, just as register based machines do (eg Z80 block move), but I don't know of any which would be limited to the Forth virtual machine. This is really my point - IMO block moving ability is pretty much unconnected to the *fundamental* chip architecture. In the above example, the stack isn't really used as a stack, just a convenient transfer register.
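For readers who don't know the PSC1000, the three-opcode loop above behaves roughly like the C sketch below. This is only an illustration, not Patriot's code: the function name `block_copy_words` is made up here, `x` and `r0` stand in for the two post-incrementing address registers, `ct` for the loop counter, and `tos` for the top-of-stack cell used as a transfer register. The early return for a zero count is an assumption of the sketch; the thread doesn't say how mloop treats a zero counter.

```c
#include <stdint.h>

/* Hypothetical C model of the three-opcode copy loop; all names are illustrative. */
void block_copy_words(uint32_t *r0, const uint32_t *x, uint32_t ct)
{
    uint32_t tos;            /* top of stack, used only as a transfer register */

    if (ct == 0)             /* assumption of this sketch: nothing to copy */
        return;

    do {
        tos = *x++;          /* ld[x++]  : push (x), step x on by one 4-byte cell */
        *r0++ = tos;         /* st[r0++] : pop to (r0), step r0 on by one 4-byte cell */
    } while (--ct != 0);     /* mloop    : decrement ct, branch back if non-zero */
}
```

Calling `block_copy_words(dst, src, 4)` copies four 32-bit cells from `src` to `dst`. Note that no stack depth beyond one cell is ever needed, which is the point about the stack acting as a plain transfer register.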
Cheers
--
Keith Wootten
======================================================================
Re: Minimal ALU instruction set.
From Bernd Paysan <bernd.paysan@remove.muenchen.this.org.junk>
Organization Siemens AG, Semiconductor Group
Date Tue, 19 May 1998 09:40:50 +0200
Newsgroups comp.arch,comp.arch.fpga,comp.arch.embedded
Message-ID <35613782.4D08@remove.muenchen.this.org.junk>
References 1 2 3 4 5
Joseph H Allen wrote:
> >FWIW the PSC1000 does the above loop with three 8bit opcodes for up to
> >2^32 iterations, these three making 3/4 of a 32bit instruction group.
>
> Huh? Wouldn't you need something like:
>
>         1000            ; push destination address
>         2000            ; push source address
> loop:
>         over            ; dup destination
>         over            ; dup source
>         @               ; replace source with contents
>         !               ; write contents to destination
>         2               ; increment source
>         add
>         swap            ; increment dest
>         2
>         add
>         swap
>         ...
>
> I.E., 10 instructions to move each word (unless I'm really missing something
> about these 0-address top of stack machines).
No, you are just missing a few special case operations of the PSC1000. There are two address registers (A and top of return stack) which allow addressing with postincrement mode. And there is a counter and a decrement and branch if not zero instruction. The PSC1000 is not really a MISC processor.
--
Bernd Paysan
"Late answers are wrong answers!"
jwdt.com
==================================================================
Re: Minimal ALU instruction set.
From Keith Wootten <Keith@wootten.demon.co.uk>
Organization Dragonfly Designs
Date Mon, 18 May 1998 23:12:36 +0100
Newsgroups comp.arch,comp.arch.fpga,comp.arch.embedded
Message-ID <X0F+fHAUJLY1Ew92@wootten.demon.co.uk>
References 1 2 3 4 5
In article <Et568q.1zq@world.std.com>, Joseph H Allen <jhallen@world.std.com> writes
>In article <355f5eaf.6311596@news.megsinet.net>, <msimon@tefbbs.com> wrote:
>
>>Read a book called "Stack Computers" by Phil Koopman available free on
>>the net.
>
>>Some of the ideas in this book are being used in a Java Engine by
>>Patriot Scientific.
>
>I've been thinking about these stack processors, but I'm still not
>convinced. They don't do particularly well with structure accesses or even
>with simple block copies, and they tend to require that a lot of the stack
>to be cached on the FPGA (so they're bigger).
I'm not sure what you mean about simple block copies - surely this is just something which a chip either does well or doesn't do well depending on the hardware available, irrespective of whether it's a stack machine or not?
Load source address to register A
Load destination address to register B
Load count to register C

Read data at A, post incrementing A
Write data at B, post incrementing B
Decrement C and repeat till zero.
FWIW the PSC1000 does the above loop with three 8bit opcodes for up to 2^32 iterations, these three making 3/4 of a 32bit instruction group.
IMO you're right about the stacks - these ideally need to be on-chip for efficiency and this can use (for an FPGA) a fair amount of resources. An alternative is to have an external RAM for each of the two stacks like the Novix chip had. This approach, of course, uses a lot of I/O as it needs three data busses, one address bus for main memory, and two small address busses for stack memory plus associated control lines.
[rest snipped]
Cheers
--
Keith Wootten
============================================================================
Re: Minimal ALU instruction set.
From jhallen@world.std.com (Joseph H Allen)
Organization The World Public Access UNIX, Brookline, MA
Date Tue, 19 May 1998 02:02:43 GMT
Newsgroups comp.arch,comp.arch.fpga,comp.arch.embedded
Message-ID <Et6LoJ.LE0@world.std.com>
References 1 2 3 4
In article <X0F+fHAUJLY1Ew92@wootten.demon.co.uk>,
Keith Wootten <Keith@wootten.demon.co.uk> wrote:
>In article <Et568q.1zq@world.std.com>, Joseph H Allen
><jhallen@world.std.com> writes
>>In article <355f5eaf.6311596@news.megsinet.net>, <msimon@tefbbs.com> wrote:
>>
>>>Read a book called "Stack Computers" by Phil Koopman available free on
>>>the net.
>>
>>>Some of the ideas in this book are being used in a Java Engine by
>>>Patriot Scientific.
>>
>>I've been thinking about these stack processors, but I'm still not
>>convinced. They don't do particularly well with structure accesses or even
>>with simple block copies, and they tend to require that a lot of the stack
>>to be cached on the FPGA (so they're bigger).
>
>I'm not sure what you mean about simple block copies - surely this is
>just something which a chip either does well or doesn't do well
>depending on the hardware available, irrespective of whether it's a
>stack machine or not?
>
>Load source address to register A
>Load destination address to register B
>Load count to register C
>
>Read data at A, post incrementing A
>Write data at B, post incrementing B
>Decrement C and repeat till zero.

>FWIW the PSC1000 does the above loop with three 8bit opcodes for up to
>2^32 iterations, these three making 3/4 of a 32bit instruction group.
Huh? Wouldn't you need something like:
        1000            ; push destination address
        2000            ; push source address
loop:
        over            ; dup destination
        over            ; dup source
        @               ; replace source with contents
        !               ; write contents to destination
        2               ; increment source
        add
        swap            ; increment dest
        2
        add
        swap
        ...
I.E., 10 instructions to move each word (unless I'm really missing something about these 0-address top of stack machines). Maybe 8 if you have an increment by 2 instruction. The load and store each require at least 2 cycles and the rest require at least one each, so that's 10 cycles per word.
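The cycle estimate above is easy to tally. The sketch below is not from the original post; it just applies the stated minimums (2 cycles for each load or store, 1 cycle for everything else), and the function name `loop_cycles` is made up for illustration. Under those assumptions the 10-cycle figure matches the 8-instruction variant with an increment-by-2 opcode, while the full 10-instruction loop would come to 12.

```c
/* Cycles per copied word, assuming (as the post does) that each memory
   operation (@ or !) costs 2 cycles and every other instruction costs 1.
   The function name is hypothetical. */
unsigned loop_cycles(unsigned total_instructions, unsigned memory_ops)
{
    unsigned other_ops = total_instructions - memory_ops;
    return memory_ops * 2u + other_ops * 1u;
}
```

Here `loop_cycles(10, 2)` gives 12 and `loop_cycles(8, 2)` gives 10, both counted per word moved and both lower bounds, since the post only states minimum cycle counts.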
A one-address machine with indexing (and a 16-bit accumulator) needs:
loop:   lda 0,r1
        sta 0,r2
        lda 1,r1
        sta 1,r2
        ...
I.E., two instructions per move.
>IMO you're right about the stacks - these ideally need to be on-chip for
>efficiency and this can use (for an FPGA) a fair amount of resources.
>An alternative is to have an external RAM for each of the two stacks
>like the Novix chip had. This approach, of course, uses a lot of I/O as
>it needs three data busses, one address bus for main memory, and two
>small address busses for stack memory plus associated control lines.
>Keith Wootten
--
/* jhallen@world.std.com (192.74.137.5) */ /* Joseph H. Allen */
int a[1817];main(z,p,q,r){for(p=80;q+p-80;p-=2*a[p])for(z=9;z--;)q=3&(r=time(0)
+r*57)/7,q=q?q-1?q-2?1-p%79?-1:0:p%79-77?1:0:p<1659?79:0:p>158?-79:0,q?!a[p+q*2
]?a[p+=a[p+=q]=q]=q:0:0;for(;q++-1817;)printf(q%79?"%c":"%c\n"," #"[!a[q-1]]);}