


 [ 39 posts ]  Go to page 1, 2, 3  Next
 Tugman 18-bit Stack CPU 

Joined: Tue Sep 03, 2024 6:20 pm
Posts: 33
I finally implemented an ISA design I came up with over a decade ago, hereby dubbed `Tugman`.

The current proof-of-concept is an 18-bit implementation running at up to 50MHz (all instructions execute in 1 cycle) with 8K-word memory, on a Tang Nano 9K. (On Xilinx or Tang Nano 20K it should work at 100MHz+.) The entire SoC with a UART and IO decoding takes up 6% of FPGA resources (~500 LUTs, ~120 flops, 9 BRAMs).

New: on Tang Nano 20K, stable at 111MHz.

https://tildegit.org/potato/Tugman

The CPU is a stack machine with a VLIW-like instruction set based loosely around a J1, but with a more complex ALU, with direct operations on NOS, memory, IO, or top of return stack.

The instruction set is quirky, as each instruction can simultaneously do an ALU operation, issue a memory read, adjust stack pointers, and issue a write to memory, IO, or stacks, or return from a subroutine. Jumps and calls may be conditional on Z and C flags. This makes coding interesting.

Here is a quick summary of the instruction layout:
Code:
  00_JJJd_oooo_oooo_oooo   jmp (JJJ=call,jmp,jz,jnz,jc,jnc,jmi,jpl) drop,offset
  01_N..._...._...._....   Negate TOS before ALU operation (for -)
  01_.C.._...._...._....   Carry on
  01_..XX_X..._...._....   ALU op (+/-,  &, |, ^, portB, >>)
  01_...._.XXX_...._....   B mux  (TOS, NOS, TOR, IO, MEM, 1,-1)
  01_...._...._X..._....   return
  01_...._...._.XXX_....   write control (nothing,mem,IO)
  01_...._...._...._XX..   RSP control (nop,push,pop,inc-tor)
  01_...._...._...._..XX   DSP control (nop,push,pop,write-NOS)
  1X_XXXX_XXXX_XXXX_XXXX   literal
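
To make the field boundaries in the layout above concrete, here is a minimal Python sketch that splits an 18-bit word into its fields. The bit positions follow the summary; the function name `decode`, the dictionary keys, and the assumption that the JJJ values are numbered in the order the names are listed are all mine, not taken from the repo. The 12-bit offset is returned raw, since the summary doesn't say how it is applied.

```python
MASK18 = (1 << 18) - 1

# Assumption: JJJ values 0..7 follow the order the names appear in the summary.
JUMPS = ["call", "jmp", "jz", "jnz", "jc", "jnc", "jmi", "jpl"]

def decode(word):
    """Split an 18-bit instruction word into its fields (hypothetical helper)."""
    word &= MASK18
    if word >> 17:                        # 1X_...: 17-bit literal
        return {"kind": "literal", "value": word & 0x1FFFF}
    if (word >> 16) == 0:                 # 00_...: jump or call
        return {
            "kind": "jump",
            "cond": JUMPS[(word >> 13) & 7],
            "drop": bool((word >> 12) & 1),
            "offset": word & 0xFFF,       # raw 12-bit field, not interpreted
        }
    return {                              # 01_...: ALU instruction
        "kind": "alu",
        "negate": bool((word >> 15) & 1), # N: negate TOS before the ALU
        "carry": bool((word >> 14) & 1),  # C: carry on
        "op": (word >> 11) & 7,           # ALU op
        "bmux": (word >> 8) & 7,          # B mux select
        "ret": bool((word >> 7) & 1),     # return
        "write": (word >> 4) & 7,         # write control
        "rsp": (word >> 2) & 3,           # RSP control
        "dsp": word & 3,                  # DSP control
    }
```

For example, `decode(0b10_0000_0000_0000_0101)` comes back as a literal with value 5.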

Unlike every other minimalistic stack machine I've ever worked with, coding it in assembly is actually fun, and the instruction density seems to be really good.

Some highlights:
* Single-cycle execution, 0-cycle return in most cases;
* Instruction density competing with register machines;
* A reasonably complete instruction set;
* 1-cycle interrupt latency (not yet implemented, but may be trivial);
* 1-cycle per memory indirection, as deep as you need;
* All calls and jumps can optionally drop;
* All logical operators can clear, set or keep the carry flag unchanged;
* Forth in hardware;
* Co-routines
̶*̶ ̶P̶r̶e̶d̶i̶c̶a̶t̶e̶d̶ ̶c̶a̶l̶l̶s̶
* Other weird stuff (TBD)

The repo contains a proof-of-concept system with a UART (115200 baud), a loader, and a simple monitor.

A simple FASMG-based assembler is in the repo.

[This top entry is edited to be up-to-date. The rest of the thread follows my development effort over time]


Last edited by enso1 on Thu Apr 03, 2025 4:12 pm, edited 17 times in total.



Tue Sep 03, 2024 6:40 pm

Joined: Wed Jan 09, 2013 6:54 pm
Posts: 1821
Interesting! (And welcome back) and thanks for sharing your repo - I see you have some details of your assembly language too.


Wed Sep 04, 2024 9:51 am

Joined: Tue Sep 03, 2024 6:20 pm
Posts: 33
Last night I managed to improve fMax to 45MHz and reduce utilization substantially, while adding a ton of possible instructions. The trick was to eliminate the explicit subtraction, and instead introduce a bit to invert TOS into the ALU (and invert the carry input). Subtraction is now synthesized as NOT_TOS + ALU_B, with the carry turning the NOT into a proper two's complement negate.
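
The arithmetic behind that trick can be checked with a small Python model. This is only a sketch of the identity B - TOS = NOT(TOS) + B + 1 at 18 bits; the function names and the returned raw carry-out are my own illustration, not the Verilog in the repo.

```python
WIDTH = 18
MASK = (1 << WIDTH) - 1

def alu_add(a, b, carry_in, negate_a=False):
    """18-bit adder; negate_a inverts the A (TOS) input before adding."""
    if negate_a:
        a = ~a & MASK
    total = (a & MASK) + (b & MASK) + carry_in
    return total & MASK, total >> WIDTH   # (result, raw carry-out of the adder)

def alu_sub(tos, b):
    """B - TOS, built as NOT(TOS) + B with the carry input set to 1."""
    return alu_add(tos, b, carry_in=1, negate_a=True)
```

With the carry input at 0 instead, the same path yields B + NOT(TOS), i.e. B - TOS - 1, which is why the carry input has to flip together with the invert bit.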

Negating TOS is actually very useful, since literals are only 17 bits (a pre-negate on ALU_A saves an explicit negation instruction). And it's very useful for masking bits for logic operations.

I also switched the 8 operations into two sets of 4 operations: one set uses ALU_A and ALU_B inputs, while the other, only ALU_B (currently, B passthrough and shift right), and removed rotate left (which can almost be replaced by TOS + TOS with carry).

The core is now 20% smaller than the stock J1 CPU, but infinitely more usable and fun. It is 25% slower theoretically, but I'd say the instruction density is close to double.


Wed Sep 04, 2024 12:09 pm

Joined: Tue Sep 03, 2024 6:20 pm
Posts: 33
Copying Memory

memcpy is always a pain with a stack machine: incrementing two pointers and keeping a count... Currently, a 7-cycle loop is the best I can do. It's not bad, actually. I keep wanting to add autoincrement, but it's probably an unnecessary expense for a processor with an 8K memory space.
[Note: later I present a 6-cycle memory copy loop!]
Code:
;======================================================================================
; copy memory from src to dst
;(cnt,src,dst   )
   push            
   push            ;dstack: (cnt-- )  rstack: (dst,src--)
.loop:                          ;    D         R       Mem
   lit  1                       ;                               increment by 1
   op   OP_ADD,B_TOR,RDSP,WR    ;(cnt,src++   src++  reading    src,read RSP, inc src
   op   OP_B,B_MEM,RSPD         ;(cnt,val     dst               read result, RSP=dst
   op   OP_B,B_TOR,WM,DSPD      ;(cnt,dst     dst    write      store val
   op   OP_ADD,B_1,WR           ;(cnt--       dst++             inc dst in RSP
   op   OP_SUB,B_1,RSPI         ;(cnt--       src               cnt--. RSP=SRC
   jnz  .loop                   ;loop while the count is not zero

   op   OP_B,B_NOS,DSPD,RSPD2   ;drop count and both pointers
   op   OP_B,B_TOS,RSPD,RET


This takes advantage of the fact that reads are issued on ALU_B (TOR, top of the return stack) while it is being incremented in the ALU, and the result is written back, all in a single instruction. The read result is available on the next cycle, while RSP is adjusted so that TOR is now the destination address. The next instruction stores to memory, and the one after that increments TOR. Finally, we decrement the count (while adjusting RSP back to the source), and loop while it is not zero.

Looking at it today, I think I can hoist the lit 1 out of the loop and take it down to 6 cycles!

Kind of cool, actually.
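
For anyone who wants to sanity-check the cycle counts, here is a rough Python model of the routine above. It only mirrors the visible structure (two pushes, a 7-instruction loop body, two cleanup instructions) and assumes one cycle per instruction and a count of at least 1; `copy_words` and `loop_cycles` are names made up for this sketch.

```python
def copy_words(mem, cnt, src, dst, loop_cycles=7):
    """Copy cnt words from src to dst and return an estimated cycle count.

    Assumes cnt >= 1, because the loop tests the count only after the body.
    """
    cycles = 2                      # the two pushes before the loop
    while True:
        mem[dst] = mem[src]         # read through TOR, then store
        src += 1
        dst += 1
        cnt -= 1
        cycles += loop_cycles       # one cycle per instruction in the loop
        if cnt == 0:
            break
    return cycles + 2               # two cleanup instructions, including the return
```

Passing `loop_cycles=6` gives the estimate for the version with `lit 1` hoisted out of the loop.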


Last edited by enso1 on Sun Feb 23, 2025 7:30 pm, edited 1 time in total.



Wed Sep 04, 2024 1:03 pm

Joined: Sat Feb 02, 2013 9:40 am
Posts: 2307
Location: Canada
Very cool. It is small enough that many cores could fit in a larger FPGA.
Have you seen the 4-stack CPU?

_________________
Robert Finch http://www.finitron.ca


Thu Sep 05, 2024 4:15 am

Joined: Tue Sep 03, 2024 6:20 pm
Posts: 33
It is tiny, and can be configured to run off a single BRAM with 1K words. This makes it a great picocontroller in a bigger system - an IO, disk, or communications controller, or a video subsystem.

I've admired Bernd's CPU from afar. It seems like a nightmare to program though -- too many things to keep track of. For me, the Tugman processor is at the limit of pleasant.


Last edited by enso1 on Thu Sep 05, 2024 4:05 pm, edited 1 time in total.



Thu Sep 05, 2024 2:28 pm

Joined: Tue Sep 03, 2024 6:20 pm
Posts: 33
Today I added CMP and TST. These affect flags and drop, which is usually what is necessary; I hated what my monitor code looked like without these!

I dreaded adding the extra logic, but amazingly, it didn't affect size and actually improved fMax by 1MHz. Weird, but whatever. It almost makes sense because LUTs have 4 inputs and I was using two in a couple of places, but speed improvement is harder to explain -- mostly luck I suppose.

Code:
     Operation                ALU_B select
===  ==========          ===  ===============
000  ALU_B +/- TOS       000  TOS
001  ALU_B & TOS         001  NOS
010  ALU_B | TOS         010  TOR
011  ALU_B ^ TOS         011  [ALU_B] memory
100  C̶o̶m̶p̶a̶r̶e̶ ̶&̶ ̶d̶r̶o̶p̶      100  IO input
101  T̶e̶s̶t̶ ̶&̶ ̶d̶r̶o̶p̶         101  1
110  ALU_B               110  -1
111  ALU_B >>            111  0


Last edited by enso1 on Mon Sep 09, 2024 2:59 pm, edited 1 time in total.



Thu Sep 05, 2024 4:02 pm

Joined: Tue Sep 03, 2024 6:20 pm
Posts: 33
After some consideration decided that conditional calls are not really worth it, and I'd rather have a call, a jump, and six other conditional jumps.

This and a couple of other small changes reduced utilization back to 460 LUTs and improved fMax to 49.192 MHz.


Thu Sep 05, 2024 7:13 pm

Joined: Tue Sep 03, 2024 6:20 pm
Posts: 33
Spent the day working on a monitor and reviewing hardware. Other than fMax dropping horribly as soon as I touch the IP mux... I am stuck with call, jump, jz, jnz, and jc -- as soon as I add jnc I lose 5MHz and gain layers of logic. I tried separating the condition code mux, but it's even worse... Left it alone, as I like 49.195MHz max, for now. Someday I'll have to figure out how to add timing constraints, or pin down some components with location constraints.

I also figured out a way to sneak extra instructions in -- for instructions that don't affect the carry flag normally, like loading portB into TOS, I can do complicated tests and set the carry -- without any extra opcodes!

I was writing an ASCII-HEX routine, and range checking was a pain. I put in an instruction that does a range check on NOS against high and low bytes in TOS. That worked fine, but added 50 LUTs and slowed down the system by 8MHz... Then I took it out and rewrote the converter without it, and it was only 4 instructions longer... Completely not worth it!

But I will keep this in mind for later.

I now have a hex dump and hex input, will write a monitor tomorrow.


Thu Sep 05, 2024 11:43 pm

Joined: Tue Sep 03, 2024 6:20 pm
Posts: 33
I just realized that memory locations 1 and 3FFFF may be used as scrap registers with almost no overhead.

The ISA allows 1 and -1 to be used as ALU_B constants, and since a memory read is issued on ALU_B every cycle, the contents of memory locations 1 and 3FFFF (-1 as an 18-bit word) are available to read into ALU_B on the next cycle. For writing, setting the WMEM bit stores TOS into location 1 or -1 when ALU_B is B_1 or B_N1.
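
A quick Python check of where those two addresses come from. The 18-bit wrap is plain two's complement; the last helper assumes the 8K-word memory decodes only the low 13 address bits, which is my guess and not something stated in the thread. All names here are hypothetical.

```python
WIDTH = 18
MASK = (1 << WIDTH) - 1

def to_word(value):
    """Wrap a Python int to an 18-bit machine word."""
    return value & MASK

SCRATCH_A = to_word(1)     # address produced by the B mux constant 1
SCRATCH_B = to_word(-1)    # address produced by the B mux constant -1

def physical_address(addr, words=8192):
    """Assumed address truncation for a power-of-two memory size."""
    return addr & (words - 1)
```

Under that assumption the second scratch word would sit in the last word of RAM, so code or data placed there would collide with it.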


Sat Sep 07, 2024 1:52 pm

Joined: Mon Oct 07, 2019 2:41 am
Posts: 768
Looks like a nice machine. Block RAM makes a big difference for many designs.


Sun Sep 08, 2024 6:20 am

Joined: Tue Sep 03, 2024 6:20 pm
Posts: 33
BRAMs are dog**** slow on this device! I am used to old Xilinx XC3S which is at least twice as fast. I think it's 5-6 ns before the BRAM responds after the clock. I should really move to the $30 Nano20K device which should be almost Xilinx speed.

Speaking of BRAM, my 18 x 8192 RAM synthesizes as 9 2-bit BRAMs instead of the expected 8 18-bit ones, which is sensible as it avoids muxing multiple devices. I made a feeble attempt to build with a single 18-bit BRAM, but it wouldn't synthesize, insisting that I was making a read-before-write memory, which I clearly wasn't. I will have to try again later; for now it's 8K words.
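
The 9-versus-8 count can be reproduced with a little tiling arithmetic. This sketch assumes each block is an 18 Kbit RAM that can be configured as 8192 x 2 or 1024 x 18 (among other shapes); those shapes are my assumption about the part, so check the vendor documentation before relying on them.

```python
from math import ceil

def blocks_needed(depth, width, shape):
    """Blocks required to tile a depth x width memory with one block shape."""
    block_depth, block_width = shape
    return ceil(depth / block_depth) * ceil(width / block_width)

def needs_depth_mux(depth, shape):
    """True when blocks are stacked in depth, so read data must be muxed."""
    return depth > shape[0]

WIDE_SHALLOW = (1024, 18)   # assumed 1K x 18 configuration
NARROW_DEEP = (8192, 2)     # assumed 8K x 2 configuration
```

For an 8192 x 18 memory, the narrow-and-deep shape costs one extra block but every block spans the whole address range, so no output mux is needed.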

Although I am almost happy with the ISA, every time I try to finish the monitor I get annoyed with the details. My attempt at the Z and C flags along with test and compare instructions that do not modify TOS but just set flags seemed clumsy, and I still found simple things like returning from a subroutine with some kind of a flag hard to implement without inflating and slowing down the CPU (for some reason, just keeping flags in registers is a lot more expensive than pushing them onto the return stack with the return address and restoring them, even though it is a lot more Verilog to do that!).

So I took an extra bit in jump instructions for an optional drop, and removed test and compare instructions. For compares, I can subtract and jc/drop or je/drop; flags are no longer restored on return. je/jne test TOS as it exists, so it's easy to pass flags around and drop them during the test in a single instruction. I can add js/jns for testing the sign bit, but it costs a little and I haven't needed it yet.

Unfortunately I am out of bits in the jump/call instruction, so I had to switch to relative branches with a 4K reach (or just limit the RAM to 4K)... I could keep unconditional jumps to full 8K and just limit the conditional branches to 4K, but that would require another decoder and a bigger IP mux, and with this software you never know if it will make it much bigger or much smaller and faster until you try.

Keeping the flags separate across calls/returns bloated me to 500 LUTs, but I am just over 50MHz now.

Amazingly, there were almost no changes in the assembler and the software.


Sun Sep 08, 2024 1:27 pm

Joined: Mon Oct 07, 2019 2:41 am
Posts: 768
enso1 wrote:
Although I am almost happy with the ISA, every time I try to finish the monitor I get annoyed with the details. My attempt at the Z and C flags along with test and compare instructions that do not modify TOS but just set flags seemed clumsy, and I still found simple things like returning from a subroutine with some kind of a flag hard to implement without inflating and slowing down the CPU (for some reason, just keeping flags in registers is a lot more expensive than pushing them onto the return stack with the return address and restoring them, even though it is a lot more Verilog to do that!).

Unfortunately I am out of bits in the jump/call instruction, so I had to switch to relative branches with a 4K reach (or just limit the RAM to 4K)... I could keep unconditional jumps to full 8K and just limit the conditional branches to 4K, but that would require another decoder and a bigger IP mux, and with this software you never know if it will make it much bigger or much smaller and faster until you try.


Not using flags might be a better option, if you look at something like fig-forth. 0branch, branch, and Forth calls/returns are the only program counter operations.
Most of the older computers used skip-on-condition and jump, and that kept them simple.
When I was playing with the other brand of FPGAs, the design would often compile but not run in the hardware.
I don't care about fmax; I care more about whether the setup and hold times are met. The software never could tell me.
I have gone to CPLDs that I can design with (WinCUPL), and the programmer was on sale. https://store.rosco-m68k.com/products/l ... programmer


Sun Sep 08, 2024 7:40 pm

Joined: Sat Feb 02, 2013 9:40 am
Posts: 2307
Location: Canada
Quote:
I don't care about fmax; I care more about whether the setup and hold times are met. The software never could tell me.

I think the FPGA software automatically adjusts the placement and routing of internal logic to meet setup and hold times. If it cannot, it gives a warning. I believe one can also specify what the delay (-max for setup and -min for hold) needs to be for IO signals. One can even specify the I/O standard (voltage levels and timing). There are all kinds of things that can be specified in constraint files. It is a bit of a learning curve though.

_________________
Robert Finch http://www.finitron.ca


Mon Sep 09, 2024 6:00 am

Joined: Tue Sep 03, 2024 6:20 pm
Posts: 33
I am very happy with how it's turned out (except that I lost a bit of jump range): it's mostly Forth-like, but carry flag is still there if you want it, and you can drop or not while jumping. Z branches just test TOS, C is flopped from last ALU result.
Code:
; Forth-like compare and jump if less than 10:
    lit    10
    op   OP_SUB,SRC_NOS     ;(n,n-10--
    jcd                     ;jump on carry and drop

; or test result of operation, without a drop
    lit    '0'
    op   OP_SUB,SRC_NOS,DSPD ;(n-'0'--
    jc    .notdigit          ;jump on carry, do not drop
...
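
Here is a rough Python model of the two checks above, in case the flag logic is easier to follow that way. It assumes C acts as a borrow after a subtraction, which is what the comments ("jump if less than 10", "jc .notdigit") suggest; the thread does not spell out the flag polarity, and chaining the two checks into one function is my own illustration.

```python
MASK = (1 << 18) - 1

def sub_flags(nos, tos):
    """Return (result, c) for NOS - TOS; c is assumed to mean 'borrow'."""
    diff = (nos & MASK) - (tos & MASK)
    return diff & MASK, diff < 0

def is_decimal_digit(ch):
    """ch - '0' must not borrow, and then (ch - '0') - 10 must borrow."""
    value, borrow = sub_flags(ord(ch), ord("0"))
    if borrow:
        return False                  # corresponds to: jc .notdigit
    _, below_ten = sub_flags(value, 10)
    return below_ten                  # corresponds to: jcd taken when value < 10
```

So `'7'` passes both tests, while `'/'` fails the first and `':'` fails the second.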


Mon Sep 09, 2024 2:53 pm