Author |
Message |
enso1
Joined: Tue Sep 03, 2024 6:20 pm Posts: 33
|
I finally implemented an ISA design I came up with over a decade ago, hereby dubbed `Tugman`. The current proof-of-concept is an 18-bit implementation running at up to 50MHz (all instructions execute in 1 cycle) with 8K-word memory, on a Tang Nano 9K. (On Xilinx or Tang Nano 20K it should work at 100MHz+.) The entire SoC with a UART and IO decoding takes up 6% of FPGA resources (~500 LUTs, ~120 flops, 9 BRAMs). New: on Tang Nano 20K, stable at 111MHz.
https://tildegit.org/potato/Tugman
The CPU is a stack machine with a VLIW-like instruction set loosely based on the J1, but with a more complex ALU, with direct operations on NOS, memory, IO, or top of return stack. The instruction set is quirky, as each instruction can simultaneously do an ALU operation, issue a memory read, adjust stack pointers, and issue a write to memory, IO, or stacks, or return from a subroutine. Jumps and calls may be conditional on Z and C flags. This makes coding interesting. Here is a quick summary of the instruction layout:
Code:
00_JJJd_oooo_oooo_oooo  jmp (JJJ=call,jmp,jz,jnz,jc,jnc,jmi,jpl) drop,offset
01_N..._...._...._....  Negate TOS before ALU operation (for -)
01_.C.._...._...._....  Carry on
01_..XX_X..._...._....  ALU op (+/-, &, |, ^, portB, >>)
01_...._.XXX_...._....  B mux (TOS, NOS, TOR, IO, MEM, 1,-1)
01_...._...._X..._....  return
01_...._...._.XXX_....  write control (nothing,mem,IO)
01_...._...._...._XX..  RSP control (nop,push,pop,inc-tor)
01_...._...._...._..XX  DSP control (nop,push,pop,write-NOS)
1X_XXXX_XXXX_XXXX_XXXX  literal
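As a rough illustration of the layout above, here is a small Python sketch that splits an 18-bit word into its raw fields. The bit positions are read directly off the summary table; the function name `decode` and the field names are made up for this sketch, and no meaning is assigned to the individual field values since the thread does not list the encodings.

```python
WORD_BITS = 18
WORD_MASK = (1 << WORD_BITS) - 1


def decode(word: int) -> dict:
    """Split one 18-bit instruction word into raw, unnamed field values."""
    word &= WORD_MASK
    if word >> 17:                       # 1X_XXXX_...: literal, 17 payload bits
        return {"kind": "literal", "literal": word & 0x1FFFF}
    if (word >> 16) == 0:                # 00_JJJd_oooo_oooo_oooo: jump/call
        return {
            "kind": "jump",
            "jjj": (word >> 13) & 0x7,   # call/jump/condition selector
            "drop": (word >> 12) & 0x1,  # optional drop bit
            "offset": word & 0xFFF,      # 12-bit offset field
        }
    return {                             # 01_...: ALU / stack operation
        "kind": "alu",
        "negate": (word >> 15) & 0x1,    # N: negate TOS before the ALU
        "carry": (word >> 14) & 0x1,     # C: carry on
        "alu_op": (word >> 11) & 0x7,    # 3-bit ALU operation
        "b_mux": (word >> 8) & 0x7,      # 3-bit ALU_B source select
        "ret": (word >> 7) & 0x1,        # return bit
        "write": (word >> 4) & 0x7,      # write control
        "rsp": (word >> 2) & 0x3,        # return stack pointer control
        "dsp": word & 0x3,               # data stack pointer control
    }
```

The 12-bit offset field is what is left after the two opcode bits, JJJ, and the drop bit are taken out of 18 bits; whether the hardware treats it as signed is not stated here, so the sketch returns it unsigned.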
Unlike every other minimalistic stack machine I've ever worked with, coding it in assembly is actually fun, and the instruction density seems to be really good. Some highlights:
* Single-cycle execution, 0-cycle return in most cases;
* Instruction density competing with register machines;
* A reasonably complete instruction set;
* 1-cycle interrupt latency (not yet implemented, but may be trivial);
* 1-cycle per memory indirection, as deep as you need;
* All calls and jumps can optionally drop;
* All logical operators can clear, set or keep the carry flag unchanged;
* Forth in hardware;
* Co-routines
* P̶r̶e̶d̶i̶c̶a̶t̶e̶d̶ ̶c̶a̶l̶l̶s̶
* Other weird stuff (TBD)
The repo contains a proof-of-concept system with a UART (115200 baud), a loader, and a simple monitor. A simple FASMG-based assembler is in the repo.
[This top entry is edited to be up-to-date. The rest of the thread follows my development effort over time]
Last edited by enso1 on Thu Apr 03, 2025 4:12 pm, edited 17 times in total.
|
Tue Sep 03, 2024 6:40 pm |
|
 |
BigEd
Joined: Wed Jan 09, 2013 6:54 pm Posts: 1821
|
Interesting! (And welcome back) and thanks for sharing your repo - I see you have some details of your assembly language too.
|
Wed Sep 04, 2024 9:51 am |
|
 |
enso1
Joined: Tue Sep 03, 2024 6:20 pm Posts: 33
|
Last night I managed to improve fMax to 45MHz and reduce utilization substantially, while adding a ton of possible instructions. The trick was to eliminate the explicit subtraction, and instead introduce a bit to invert TOS into the ALU (and invert the carry input). Subtraction is now synthesized as NOT_TOS + ALU_B, with the carry turning the NOT into a proper two's complement negate.
Negating TOS is actually very useful, since literals are only 17 bits (a pre-negate on ALU_A saves an explicit negation instruction). And it's very useful for masking bits for logic operations.
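The subtraction trick can be checked with a few lines of Python. This is only a sketch of the arithmetic, assuming an 18-bit word and a carry-in of 0 for plain addition; the function name `alu_add` and its `negate` parameter are invented for the example.

```python
WORD_BITS = 18
MASK = (1 << WORD_BITS) - 1


def alu_add(tos: int, alu_b: int, negate: bool = False) -> int:
    """Model the adder: optionally invert TOS and flip the carry-in to 1."""
    a = (~tos & MASK) if negate else (tos & MASK)
    carry_in = 1 if negate else 0        # inverted carry-in completes the negate
    return (a + (alu_b & MASK) + carry_in) & MASK


# ALU_B - TOS falls out of NOT_TOS + ALU_B + 1:
print(alu_add(10, 25, negate=True))      # 25 - 10
# With ALU_B = 0 the same path gives a plain negate of TOS:
print(hex(alu_add(5, 0, negate=True)))   # -5 as an 18-bit word
```

Because inverting all bits and adding 1 is the two's complement negate, the same adder serves both addition and subtraction, and the negate-with-zero case covers negative constants that a 17-bit literal cannot express directly.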
I also switched the 8 operations into two sets of 4 operations: one set uses ALU_A and ALU_B inputs, while the other, only ALU_B (currently, B passthrough and shift right), and removed rotate left (which can almost be replaced by TOS + TOS with carry).
The core is now 20% smaller than the stock J1 CPU, but infinitely more usable and fun. It is 25% slower theoretically, but I'd say the instruction density is close to double.
|
Wed Sep 04, 2024 12:09 pm |
|
 |
enso1
Joined: Tue Sep 03, 2024 6:20 pm Posts: 33
|
Copying Memory
memcpy is always a pain with a stack machine: incrementing two pointers and keeping a count... Currently, a 7-cycle loop is the best I can do. It's not bad, actually. I keep wanting to add autoincrement, but it's probably an unnecessary expense for a processor with an 8K memory space. [Note: later I present a 6-cycle memory copy loop!]
Code:
;======================================================================================
; copy memory from src to dst
;(cnt,src,dst )
    push
    push                        ;dstack: (cnt-- ) rstack: (dst,src--)
.loop:                          ; D          R      Mem
    lit 1                       ; increment by 1
    op OP_ADD,B_TOR,RDSP,WR     ;(cnt,src++  src++  reading src,read RSP, inc src
    op OP_B,B_MEM,RSPD          ;(cnt,val    dst    read result, RSP=dst
    op OP_B,B_TOR,WM,DSPD       ;(cnt,dst    dst    write store val
    op OP_ADD,B_1,WR            ;(cnt--      dst++  inc dst in RSP
    op OP_SUB,B_1,RSPI          ;(cnt--      src    cnt--. RSP=SRC
    jz loop

    op OP_B,B_NOS,DSPD,RSPD2    ;drop count and both pointers
    op OP_B,B_TOS,RSPD,RET
This takes advantage of the fact that reads are issued on ALU_B (TOR, TopOfRstack) while it's being incremented in the ALU, and the result is written back, all in a single instruction. The read result is available on the next cycle, while RSP is adjusted so that TOR is now the destination address. The next instruction stores to memory, and the following one increments TOR. Finally, we decrement the count (while adjusting RSP back to the source) and loop while not zero. Looking at it today, I think I can hoist the lit 1 out of the loop and take it down to 6 cycles! Kind of cool, actually.
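For reference, here is what the loop computes, written as a plain Python sketch together with a cycle estimate. The function names are invented for this example. The overhead of 4 cycles is simply counted from the listing (two pushes before the loop, two ops after it) at one cycle per instruction; assuming the same overhead for the 6-cycle variant is a guess.

```python
from typing import List


def copy_words(mem: List[int], src: int, dst: int, cnt: int) -> None:
    """Copy cnt words from src to dst, one word per loop pass (cnt >= 1)."""
    while True:
        mem[dst] = mem[src]   # read through one pointer, store through the other
        src += 1              # bump the source pointer
        dst += 1              # bump the destination pointer
        cnt -= 1              # count down; the test comes after the body
        if cnt == 0:
            break


def loop_cycles(cnt: int, per_pass: int = 7, overhead: int = 4) -> int:
    """Estimated cycles: fixed setup/teardown plus per_pass cycles per word."""
    return overhead + per_pass * cnt


mem = list(range(16))
copy_words(mem, 0, 8, 4)
print(mem[8:12], loop_cycles(100), loop_cycles(100, per_pass=6))
```

At one instruction per cycle, the per-word cost is just the number of instructions between the label and the branch, which is why hoisting the literal out of the loop saves exactly one cycle per word.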
Last edited by enso1 on Sun Feb 23, 2025 7:30 pm, edited 1 time in total.
|
Wed Sep 04, 2024 1:03 pm |
|
 |
robfinch
Joined: Sat Feb 02, 2013 9:40 am Posts: 2307 Location: Canada
|
Very cool. It is small enough that many cores could fit in a larger FPGA. Have you seen the 4-stack CPU?
_________________Robert Finch http://www.finitron.ca
|
Thu Sep 05, 2024 4:15 am |
|
 |
enso1
Joined: Tue Sep 03, 2024 6:20 pm Posts: 33
|
It is tiny, and can be configured to run off a single BRAM with 1K words. This makes it a great picocontroller in a bigger system - an IO, disk, or communications controller, or a video subsystem.
I've admired Bernd's CPU from afar. It seems like a nightmare to program though -- too many things to keep track of. For me, the Tugman processor is at the limit of pleasant.
Last edited by enso1 on Thu Sep 05, 2024 4:05 pm, edited 1 time in total.
|
Thu Sep 05, 2024 2:28 pm |
|
 |
enso1
Joined: Tue Sep 03, 2024 6:20 pm Posts: 33
|
Today I added CMP and TST. These affect flags and drop, which is usually what is necessary; I hated what my monitor code looked like without these! I dreaded adding the extra logic, but amazingly, it didn't affect size and actually improved fMax by 1MHz. Weird, but whatever. It almost makes sense because LUTs have 4 inputs and I was using two in a couple of places, but the speed improvement is harder to explain -- mostly luck I suppose.
Code:
    Operation            ALU_B select
=== ==========       === ===============
000 ALU_B +/- TOS    000 TOS
001 ALU_B & TOS      001 NOS
010 ALU_B | TOS      010 TOR
011 ALU_B ^ TOS      011 [ALU_B] memory
100 C̶o̶m̶p̶a̶r̶e̶ ̶&̶ ̶d̶r̶o̶p̶   100 IO input
101 T̶e̶s̶t̶ ̶&̶ ̶d̶r̶o̶p̶      101 1
110 ALU_B            110 -1
111 ALU_B >>         111 0
Last edited by enso1 on Mon Sep 09, 2024 2:59 pm, edited 1 time in total.
|
Thu Sep 05, 2024 4:02 pm |
|
 |
enso1
Joined: Tue Sep 03, 2024 6:20 pm Posts: 33
|
After some consideration I decided that conditional calls are not really worth it, and I'd rather have a call, a jump, and six other conditional jumps.
This and a couple of other small changes reduced utilization back to 460 LUTs and improved fMax to 49.192 MHz.
|
Thu Sep 05, 2024 7:13 pm |
|
 |
enso1
Joined: Tue Sep 03, 2024 6:20 pm Posts: 33
|
Spent the day working on a monitor and reviewing hardware. All good, other than fMax dropping horribly as soon as I touch the IP mux... I am stuck with call, jump, jz, jnz, and jc -- as soon as I add jnc I lose 5MHz and gain layers of logic. I tried separating the condition code mux, but it's even worse... Left it alone, as I like 49.195MHz max, for now. Someday I'll have to figure out how to add timing constraints, or pin down some components with location constraints.
I also figured out a way to sneak extra instructions in -- for instructions that don't affect the carry flag normally, like loading portB into TOS, I can do complicated tests and set the carry -- without any extra opcodes!
I was writing an ASCII-HEX routine, and range checking was a pain. I put in an instruction that does a range check on NOS against high and low bytes in TOS. That worked fine, but added 50 LUTs and slowed down the system by 8MHz... Then I took it out and rewrote the converter without it, and it was only 4 instructions longer... Completely not worth it!
But I will keep this in mind for later.
I now have a hex dump and hex input, will write a monitor tomorrow.
|
Thu Sep 05, 2024 11:43 pm |
|
 |
enso1
Joined: Tue Sep 03, 2024 6:20 pm Posts: 33
|
I just realized that memory locations 1 and 3FFFF may be used as scrap registers with almost no overhead.
The ISA allows 1 and -1 to be used as ALU_B constants, and since a memory read is issued on ALU_B every cycle, selecting either constant also reads memory location 1 or 3FFFF, and on the next cycle that value may be taken into ALU_B. For writing, just setting the WMEM bit stores TOS into location 1 or -1 (3FFFF) when ALU_B is B_1 or B_N1.
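A quick Python check of why the two scratch locations are 1 and 3FFFF: the constant -1, seen as an 18-bit unsigned address, is all ones. The helper name `as_address` is made up for this sketch, and how the 8K-word memory decodes the upper address bits is not modelled.

```python
WORD_BITS = 18
MASK = (1 << WORD_BITS) - 1


def as_address(constant: int) -> int:
    """Interpret an ALU_B constant as an 18-bit unsigned address."""
    return constant & MASK


print(hex(as_address(1)), hex(as_address(-1)))
```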
|
Sat Sep 07, 2024 1:52 pm |
|
 |
oldben
Joined: Mon Oct 07, 2019 2:41 am Posts: 768
|
Looks like a nice machine. Block ram makes big difference for many designs.
|
Sun Sep 08, 2024 6:20 am |
|
 |
enso1
Joined: Tue Sep 03, 2024 6:20 pm Posts: 33
|
BRAMS are dog**** slow on this device! I am used to old Xilinx XC3S which is at least twice as fast. I think it's 5-6 ns before the BRAM responds after the clock. I should really move to the $30 Nano20K device which should be almost Xilinx speed.
Speaking of BRAM, my 18 x 8192 RAM synthesizes as 9 2-bit BRAMS instead of the expected 8 18-bit ones, which is sensible as it avoids muxing multiple devices. I made a feeble attempt to build with a single 18-bit BRAM, but it wouldn't synthesize, insisting that I was making a read-before-write memory, which I clearly wasn't. I will have to try again later, for now it's 8K words.
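The two ways of building an 18 x 8192 memory from block RAMs can be compared with a little arithmetic. This Python sketch assumes each block can be used either as 8192 x 2 (bit-sliced) or as 1024 x 18 (depth-stacked), which matches the figures mentioned in this thread; the function names are invented for the example.

```python
import math

WIDTH, DEPTH = 18, 8192          # memory described in the post
TOTAL_BITS = WIDTH * DEPTH       # total storage in bits


def blocks_bit_sliced(slice_width: int) -> int:
    """Blocks needed when each block spans the full depth, slice_width bits wide."""
    return math.ceil(WIDTH / slice_width)


def blocks_depth_stacked(block_depth: int) -> int:
    """Blocks needed when each block is full width but only block_depth words deep."""
    return math.ceil(DEPTH / block_depth)


print(TOTAL_BITS, blocks_bit_sliced(2), blocks_depth_stacked(1024))
```

The bit-sliced arrangement costs one extra block, but every block sees the same address and contributes two bits of each word, so no output mux is needed. The depth-stacked arrangement would need an 8-to-1 mux on the read data, selected by the top three address bits.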
Although I am almost happy with the ISA, every time I try to finish the monitor I get annoyed with the details. My attempt at the Z and C flags along with test and compare instructions that do not modify TOS but just set flags seemed clumsy, and I still found simple things like returning from a subroutine with some kind of a flag hard to implement without inflating and slowing down the CPU (for some reason, just keeping flags in registers is a lot more expensive than pushing them onto the return stack with the return address and restoring them, even though it is a lot more Verilog to do that!).
So I took an extra bit in jump instructions for an optional drop, and removed test and compare instructions. For compares, I can subtract and jc/drop or je/drop, flags are no longer restored on return. je/jne tests tos as it exists, so it's easy to pass flags around and drop them during the test in a single instruction. I can add js/jns for testing the sign bit, but it costs a little and I haven't needed it yet.
Unfortunately I am out of bits in the jump/call instruction, so I had to switch to relative branches with a 4K reach (or just limit the RAM to 4k)... I could keep unconditional jumps to full 8K and just limit the conditional branches to 4K, but that would require another decoder and a bigger IP mux, and with this software you never know if it will make it much bigger or much smaller and faster until you try.
Keeping the flags separate across calls/returns bloated me to 500 LUTs, but I am just over 50MHz now.
Amazingly, there were almost no changes in the assembler and the software.
|
Sun Sep 08, 2024 1:27 pm |
|
 |
oldben
Joined: Mon Oct 07, 2019 2:41 am Posts: 768
|
enso1 wrote: Although I am almost happy with the ISA, every time I try to finish the monitor I get annoyed with the details. My attempt at the Z and C flags along with test and compare instructions that do not modify TOS but just set flags seemed clumsy, and I still found simple things like returning from a subroutine with some kind of a flag hard to implement without inflating and slowing down the CPU (for some reason, just keeping flags in registers is a lot more expensive than pushing them onto the return stack with the return address and restoring them, even though it is a lot more Verilog to do that!).
Unfortunately I am out of bits in the jump/call instruction, so I had to switch to relative branches with a 4K reach (or just limit the RAM to 4k)... I could keep unconditional jumps to full 8K and just limit the conditional branches to 4K, but that would require another decoder and a bigger IP mux, and with this software you never know if it will make it much bigger or much smaller and faster until you try.
Not using flags might be a better option, if you look at something like fig-FORTH: 0branch, branch, and Forth calls/returns are the only program counter operations. Most of the older computers used skip-on-condition and jump, and that kept them simple. When I was playing with the other brand of FPGAs, the design would often compile but not run in the hardware. I don't care about fmax so much as whether the setup and hold times are met. The software never could tell me. I have gone to CPLDs that I can design with (WinCUPL), and the programmer was on sale. https://store.rosco-m68k.com/products/l ... programmer
|
Sun Sep 08, 2024 7:40 pm |
|
 |
robfinch
Joined: Sat Feb 02, 2013 9:40 am Posts: 2307 Location: Canada
|
Quote: I don't care about fmax so much as whether the setup and hold times are met. The software never could tell me.
I think the FPGA software automatically adjusts the timing of internal logic to meet setup and hold times. If it cannot, it gives a warning. I believe one can also specify what the delay (-max for setup and -min for hold) needs to be for IO signals. One can even specify the I/O standard (voltage levels and timing). There are all kinds of things that can be specified in constraint files. It is a bit of a learning curve though.
_________________Robert Finch http://www.finitron.ca
|
Mon Sep 09, 2024 6:00 am |
|
 |
enso1
Joined: Tue Sep 03, 2024 6:20 pm Posts: 33
|
I am very happy with how it's turned out (except that I lost a bit of jump range): it's mostly Forth-like, but the carry flag is still there if you want it, and you can drop or not while jumping. Z branches just test TOS; C is flopped from the last ALU result.
Code:
; Forth-like compare and jump if less than 10:
    lit 10
    op OP_SUB,SRC_NOS         ;(n,n-10--
    jcd                       ;jump on carry and drop

; or test result of operation, without a drop
    lit '0'
    op OP_SUB,SRC_NOS,DSPD    ;(n-'0'--
    jc .notdigit              ;jump on carry, do not drop
    ...
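The subtract-then-branch-on-carry pattern above can be mimicked in Python. This sketch assumes that C after a subtraction acts as a borrow (set when the result wraps below zero), which is what the comments in the listing suggest but which is not spelled out in the thread; the helper names are invented for the example.

```python
from typing import Tuple

WORD_BITS = 18
MASK = (1 << WORD_BITS) - 1


def sub_with_borrow(n: int, k: int) -> Tuple[int, bool]:
    """Return (n - k) as an 18-bit word plus an assumed borrow flag."""
    diff = (n & MASK) - (k & MASK)
    return diff & MASK, diff < 0


def is_decimal_digit(ch: str) -> bool:
    """Range-check a character using only subtractions and the borrow flag."""
    value, borrow = sub_with_borrow(ord(ch), ord("0"))  # n - '0'
    if borrow:                                          # below '0': not a digit
        return False
    _, borrow = sub_with_borrow(value, 10)              # compare against 10
    return borrow                                       # borrow means value < 10


print([c for c in "/0123456789:A" if is_decimal_digit(c)])
```

The first subtraction both tests the lower bound and leaves the digit value behind, so only one more compare is needed for the upper bound.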
|
Mon Sep 09, 2024 2:53 pm |
|