 Tugman 18-bit Stack CPU 

Joined: Tue Sep 03, 2024 6:20 pm
Posts: 33
No breaking changes to the assembly or existing code, but I got everything clean. Sitting at 512 LUTs / 50.9MHz.

AND TOS preserves the C flag, to allow the instruction to return with C intact (other ALU operations trashed C so I couldn't return flags). This makes coding nicer. For instance, here is an input routine that converts ASCII to hex nybbles, returning C and original character on error:
Code:
;==============================================================================
; a2h (a--h/nc or a,?/c)   If carry set, it is not a hex digit.
;
a2h:     lit     $30
         op      OP_XOR,B_NOS                 ;(a,n--) transform so '0-9' is 0-9
         lit     10
         op      OP_SUB,B_NOS                 ;(a,n,n-10) check upper range
         jcd     .digit
         lit     $3FF89
         op      OP_ADD,B_NOS,DSPD            ;move 'A-F' to -6:-1; G will set C
         jc      .return                      ;>F
         lit     $3FFFA
         op      OP_SUB,B_NOS
         jcd     .return                      ;<A
.digit:  lit     $F
         op      OP_AND,B_NOS,DSPD2,RSPD,RET  ;no carry
.return: op      OP_AND,B_TOS,RSPD,RET        ;carry-preserving return
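
For checking the routine off-target, here is a rough Python reference model of what the listing above appears to compute. It is a sketch based on one reading of the comments, not a cycle-accurate simulation: the 18-bit mask is inferred from the `$3FF89` and `$3FFFA` literals, and returning a `(value, carry)` tuple is purely an illustrative convention.

```python
MASK18 = 0x3FFFF  # 18-bit data path, inferred from the $3FFxx literals


def a2h(ch):
    """Return (value, carry).

    carry=True means "not a hex digit"; value is then the original
    character code, mirroring the a,?/c exit of the assembly routine.
    """
    code = ord(ch)
    x = code ^ 0x30                  # '0'-'9' become 0-9
    if x < 10:                       # x - 10 borrows, so the digit path is taken
        return x & 0xF, False
    total = x + 0x3FF89              # moves 'A'-'F' to -6..-1 in 18 bits
    if total > MASK18:               # x >= $77 (for example 'G') carries out
        return code, True
    if (total & MASK18) < 0x3FFFA:   # below 'A': the second subtract borrows
        return code, True
    return total & 0xF, False        # low nybble is $A-$F
```

With this model `a2h('C')` gives `(12, False)` and `a2h('G')` gives `(0x47, True)`. Lowercase letters are rejected, as they appear to be by the listing.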


Tue Sep 10, 2024 1:15 am

Joined: Tue Sep 03, 2024 6:20 pm
Posts: 33
It got too confusing, so I cleared up the meaning of a couple of things, keeping the ISA identical:

It's all about the ALU_C and ALU_N bits in the instruction.

All logical operations can clear, set, or preserve the carry flag. Normally clear; ALU_C means keep carry, ALU_N means negate carry (and ALU_A!). Now it is very obvious how to deal with carry.

For the shift instruction, ALU_C and ALU_N are used to turn it into rotate (ALU_N), or rotate with carry (ALU_C).

For add, ALU_N makes it subtract, and ALU_C makes it into ADC/SBC.

Still at 480 LUTs / 49.626MHz.
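
A small Python model may make the ALU_C / ALU_N rules for add and the logical operations easier to follow. Several things here are assumptions rather than facts from the design: the function names, the NOS-minus-TOS operand order, carry acting as a borrow on subtract (suggested by the a2h routine earlier in the thread), and reading "negate carry" as an XOR applied after the clear-or-keep choice. The shift/rotate case is left out because the direction is not stated.

```python
MASK18 = (1 << 18) - 1  # 18-bit word


def alu_add(nos, tos, carry_in=0, alu_c=False, alu_n=False):
    """Model ADD / SUB / ADC / SBC. Returns (result, carry_out)."""
    cin = carry_in if alu_c else 0          # ALU_C folds the carry flag in
    if alu_n:                               # ALU_N turns add into subtract
        raw = nos - tos - cin
        return raw & MASK18, int(raw < 0)   # carry doubles as borrow (assumed)
    raw = nos + tos + cin
    return raw & MASK18, int(raw > MASK18)


def logic_carry(carry_in, alu_c=False, alu_n=False):
    """Carry after a logical op: clear by default, keep with ALU_C,
    then inverted if ALU_N is set (so ALU_N alone sets carry)."""
    base = carry_in if alu_c else 0
    return base ^ int(alu_n)
```

Under these assumptions the four flag combinations on a logical op give clear, keep, set and invert, which matches the "clear, set, or preserve" description.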


Tue Sep 10, 2024 5:35 pm

Joined: Tue Sep 03, 2024 6:20 pm
Posts: 33
Spent all day yesterday tracking down a terrible bug. Turned out that the UART dropped a bit under certain strange conditions. More on that in the UART topic at 6502.org http://forum.6502.org/viewtopic.php?f=10&t=8142&p=109615#p109615.

But I think I love coding with this CPU in assembly. The monitor is almost done.


Wed Sep 11, 2024 4:40 pm

Joined: Tue Sep 03, 2024 6:20 pm
Posts: 33
Problem solved!

The clue: Sequences like $3FFFF, $0010C would result in very consistent corruption of an address ending in C. On a whim I changed the data written to $0010E, and the corrupted address ended in $E!

It's not the UART.

It is the memory write mechanism. The address comes out of NOS, which is not a register, but is muxed from the datastack using DSP, which itself is computed based on a number of bits in the opcode. When an instruction contains a DROP or another datastack adjustment, and is followed by a write, the setup or hold violation from the previous cycle affects the BRAM write cycle in a bizarre way that mixes the address with data, setting some bits to 0. Perhaps only one of the 9 BRAMs comprising each word is affected due to the FPGA layout and weird delays in the routing matrix.

The toolchain should have reported a timing violation.

Anyway, strategically placing a nop before the write cycle lets TOS and NOS settle and the problem goes away.

James Bowman's stack Verilog, taken from the J1, is at fault. For over a decade I've suspected it would bite me, but somehow assumed it was OK. I will have to rewrite it with a separate NOS register.

If you don't do it yourself, it will suck.

If you do it yourself, it will also suck, but you will know why.


Sat Sep 14, 2024 5:25 pm

Joined: Wed Jan 09, 2013 6:54 pm
Posts: 1821
Good to have an answer! (It's a different thing, but I think others have reported problems with initialised block RAM, being synthesised to gates, and not in a correct way. Maybe the tools are a bit immature with respect to block RAMs)


Sat Sep 14, 2024 6:38 pm

Joined: Tue Sep 03, 2024 6:20 pm
Posts: 33
I still cannot synthesize a system with a single 1Kx18 BRAM. It works with 9 8Kx2 BRAMs, but with a single BRAM it errors out, stating that mode 02 (read-before-write) is not allowed, even though I am clearly not doing that!


Sat Sep 14, 2024 7:28 pm

Joined: Sat Feb 02, 2013 9:40 am
Posts: 2307
Location: Canada
Do you use the BRAM templates?

I found I had a lot of trouble getting the tools to infer RAMs properly. So, I started using the templates.

I had what I thought might be a timing issue, and it turned out that the BRAM had to be set to "read_first" so that the old value would be read out during a write cycle.
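
To illustrate what the port modes mean, here is a minimal behavioural sketch in Python of a synchronous RAM port. The class and method names are made up for illustration, and the three mode names follow the usual Xilinx-style terminology (read_first, write_first, no_change); how a particular vendor tool maps them onto its primitives is not modelled.

```python
class SyncRamPort:
    """One clocked RAM port with a registered data output."""

    def __init__(self, depth, mode="read_first"):
        if mode not in ("read_first", "write_first", "no_change"):
            raise ValueError("unknown mode: " + mode)
        self.mem = [0] * depth
        self.mode = mode
        self.dout = 0

    def clock(self, addr, we=False, din=0):
        """Apply one clock edge and return the new output register value."""
        old = self.mem[addr]
        if we:
            self.mem[addr] = din
            if self.mode == "read_first":
                self.dout = old      # old contents appear during the write
            elif self.mode == "write_first":
                self.dout = din      # newly written data appears
            # no_change: output register keeps its previous value
        else:
            self.dout = old
        return self.dout
```

Writing 7 to an empty location returns 0 in read_first mode and 7 in write_first mode; a following plain read returns 7 in both.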

_________________
Robert Finch http://www.finitron.ca


Sun Sep 15, 2024 4:19 am

Joined: Tue Sep 03, 2024 6:20 pm
Posts: 33
My code looks almost exactly like the template, except I have one read-only port for instruction fetch and one read/write port for data. I am absolutely careful to not do read before write (read happens only if there is no write), but I get the same error when I do!

I have to re-check the BRAM manual, perhaps 18x1024 is not supported by this particular chip...


Sun Sep 15, 2024 11:15 am

Joined: Sat Feb 02, 2013 9:40 am
Posts: 2307
Location: Canada
What brand of FPGA?
I think Xilinx/AMD supports up to 18x2048.

_________________
Robert Finch http://www.finitron.ca


Wed Sep 18, 2024 7:05 am

Joined: Tue Sep 03, 2024 6:20 pm
Posts: 33
This project was built on the $20 Chinese Tang Nano 9K board, built around the GoWin FPGA (GW1NR-LV9QN88PC6/I5).

Its BRAMs are very much like those of old Xilinx parts and really should work in 18x1K mode...

So far I've relied on Verilog inference of a memory array, using common sense and the provided templates. That works for the 8K memory (which is synthesized as 9 two-bit memories, not a bad choice).

Sometime very soon I will try it by instantiating a BRAM directly. I haven't yet because then I'd have to figure out how to convert my binary init data (currently $readmemh a bunch of 32-bit words, with only low 18 bits used) to the weird format required.
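
As a starting point for such a conversion, here is a hedged Python sketch that reads a simple `$readmemh` image, keeps the low 18 bits of each word, and splits the words into nine 2-bit lanes like the synthesized layout mentioned above. The function names and the lane ordering (lane 0 holds bits 1:0) are assumptions, `@address` records are deliberately not handled, and the actual init-parameter format the GoWin primitive expects is not reproduced here.

```python
def parse_readmemh(text):
    """Parse whitespace-separated hex words with optional // comments.

    Only the low 18 bits of each word are kept.
    """
    words = []
    for line in text.splitlines():
        line = line.split("//", 1)[0]          # drop trailing comment
        for token in line.split():
            if token.startswith("@"):
                raise ValueError("address records are not supported in this sketch")
            words.append(int(token.replace("_", ""), 16) & 0x3FFFF)
    return words


def to_bit_slices(words, slice_bits=2, slices=9):
    """Split each word into `slices` lanes; lane 0 holds the lowest bits."""
    mask = (1 << slice_bits) - 1
    lanes = [[] for _ in range(slices)]
    for w in words:
        for i in range(slices):
            lanes[i].append((w >> (i * slice_bits)) & mask)
    return lanes
```

Each lane could then be formatted into whatever per-BRAM init strings the primitive needs, once that format has been looked up in the vendor documentation.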


Fri Sep 20, 2024 7:26 pm

Joined: Tue Sep 03, 2024 6:20 pm
Posts: 33
Decision time

As mentioned, reading from NOS into TOS is problematic due to the way James Bowman has implemented the datastack.

I fixed it with a strategic NOP, but I am still somewhat dissatisfied and did some experiments.

Reading memory Forth-style, from [TOS] and writing NOS into [TOS] also works, because TOS is a proper register and can address memory without issues. However while converting the loader and the monitor I became somewhat depressed. It was no longer fun, because everything was in the wrong place, and I had to swap a lot to get things to work. It was no longer pleasant!

So I am going to try another idea I've had. After all, this is a research CPU project, existing to satisfy my curiosity about many strange CPU concepts. Anyway, what I want to do is address memory via TOR, the top of return stack pointer. While I am at it, I want to add the ability to increment TOR! This is something I've wanted for a while.

I don't have any free bits, but currently the two increment-control bits allow 0, +1, -1 and -2. -2 is nearly useless, and I can use it to mean increment TOR and leave RSP alone.

It's worth a try, and unless it really slows or bloats the CPU I am going to give it a shot.

UPDATE: implemented. fMax is still over 50MHz, and gate count is just over 500. Loader and monitor are much simpler and shorter!
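
A sketch of how the repurposed increment-control field might decode, written as Python for clarity. The bit patterns are an assumption: the values 0, +1, -1 and -2 are exactly what a 2-bit two's-complement field gives, so the model treats `0b10` (formerly -2) as "increment TOR, leave RSP alone". The function name and the returned tuple are illustrative only.

```python
def decode_rsp_field(bits):
    """Return (rsp_delta, tor_increment) for the 2-bit control field."""
    if not 0 <= bits <= 3:
        raise ValueError("field is two bits wide")
    if bits == 0b10:                 # formerly -2, now repurposed (assumed encoding)
        return 0, True               # RSP unchanged, TOR incremented
    delta = bits - 4 if bits & 0b10 else bits   # two's-complement: 0, +1, -1
    return delta, False
```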


Sat Sep 21, 2024 12:52 am

Joined: Mon Oct 07, 2019 2:41 am
Posts: 768
enso1 wrote:
UPDATE: implemented. fMax is still over 50MHz, and gate count is just over 500. Loader and monitor are much simpler and shorter!

If white mice show up, leave quickly.

I have always liked using a 2-phase clock, so your 50 MHz is only 12.5 MHz for me.

Playing around with CPLDs, I try to keep it simple, as you can tell how quickly logic grows too big: the software (DOS) just crashes.
With FPGAs it is even harder to tell what logic gets compiled. Things just stop working or become flaky.
The versions that gave me the most headaches were the simple changes to the instruction set: ADD is now XXX rather than YYY. Funny, the serial ports are not working, but the welcome screen has been OK for the last few major updates.
Keep versions around as milestone software.


Sat Sep 21, 2024 6:28 am

Joined: Tue Sep 03, 2024 6:20 pm
Posts: 33
Oldben: yes, these are strange creatures.

For instance, just changing the constant delay values that decide the UART-RX timing and sample offset changes fMax by as much as 20%! I empirically found a reasonably optimal setting (for 27MHz) of 237 and 200, resulting in fMax of 53.674MHz. Other settings, such as the more appropriate 234 and 117, result in a system bloated by 30 LUTs running at fMax of 46.2MHz.
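
For reference, the "more appropriate" pair can be derived from the clock with a few lines of Python. The 115200 baud rate is an assumption (it is not stated in the post, but it is the rate for which a 27MHz clock gives 234 and 117), and the function names are illustrative.

```python
def uart_timing(clk_hz, baud):
    """Return (bit_divisor, mid_bit_sample_offset) in clock cycles."""
    divisor = round(clk_hz / baud)
    return divisor, divisor // 2


def baud_error(clk_hz, divisor, baud):
    """Relative error of the baud rate actually produced by `divisor`."""
    return (clk_hz / divisor - baud) / baud
```

`uart_timing(27_000_000, 115200)` returns `(234, 117)`. Under the same assumption, a divisor of 237 runs roughly 1.1% slow, which is normally within what a UART receiver tolerates.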

On Xilinx I got into instantiating simple circuits that fit into specific shapes, and using constraints to keep them as such -- basically manually placing circuits, which was tedious but usually pretty solid. It wasn't as bad as you think because of relative placement constraints -- you can basically build more complex blocks from smaller/simpler blocks that literally fit inside...

Since I am in git, old versions are still around... This one with memory access from the return stack is not bad for repeatedly accessing memory, but accessing individual locations requires pushing the address, waiting for it to issue a read, and loading the result (while cleaning up simultaneously)... 3 cycles is not too bad I suppose, but it works better in a loop where the address is pre-loaded and the incrementor is used...


Sat Sep 21, 2024 3:38 pm

Joined: Tue Sep 03, 2024 6:20 pm
Posts: 33
Added TOR overflow flag and jumps. Since I can easily increment TOR, I can use it as a loop counter.

Amazingly, no extra resources used (after some reshuffling of the condition muxes), and fMax edged up to 53.811MHz.

I started with a J1, a much less capable CPU, and actually reduced size, increased fMax, and added a ****ton of instructions and capabilities!

I feel I'm coming close to a design freeze, as everything is working as well as can be expected and I'm running out of instruction bits anyway.
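
The TOR loop counter mentioned above can be pictured with a short Python model. This is only a sketch under stated assumptions: TOR is taken to be 18 bits wide, the counter is preloaded with the negated count, and the overflow flag is assumed to fire when the increment wraps TOR to zero. The function name and callback style are illustrative.

```python
MASK18 = (1 << 18) - 1


def run_counted_loop(n, body):
    """Run `body` n times, counting up in an 18-bit TOR until it wraps."""
    if not 1 <= n <= MASK18:
        raise ValueError("count must fit in 18 bits and be at least 1")
    tor = (-n) & MASK18              # preload TOR with -n
    iterations = 0
    while True:
        body(iterations)
        iterations += 1
        tor = (tor + 1) & MASK18     # increment TOR
        if tor == 0:                 # assumed overflow condition: wrap to zero
            break
    return iterations
```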


Sat Sep 21, 2024 5:29 pm

Joined: Mon Oct 07, 2019 2:41 am
Posts: 768
enso1 wrote:
I feel I'm coming close to a design freeze, as everything is working as well as can be expected and I'm running out of instruction bits anyway.


I have a few free bits with the bit-slice design, so I am trying to see if I can push the 9-bit slice to 11 bits and have a 22-bit CPU. This way I can use an 11-bit FAT table for a disk operating system and pack three 7-bit bytes per word. My design was mostly frozen for 18 bits, but now I need to think about software issues and memory layout. I have a working version of Small-C (8080), but I need lots of memory for the code produced, so a larger address space of 19 bits looks about right: 256kb max for the user, 64kb plus for the OS. No time sharing like Unix.
Software and OS design has a big impact on the design layout, so I hope your design can handle bigger software projects.
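
As a quick illustration of the packing idea, three 7-bit characters take 21 bits and so fit in a 22-bit word with one bit to spare. The Python sketch below assumes the first character goes in the highest field; the function names and the bit layout are hypothetical, not taken from the design.

```python
def pack3(chars):
    """Pack exactly three 7-bit characters into one word (21 bits used)."""
    if len(chars) != 3 or any(c > 0x7F for c in chars):
        raise ValueError("need exactly three 7-bit characters")
    a, b, c = chars
    return (a << 14) | (b << 7) | c


def unpack3(word):
    """Recover the three 7-bit characters from a packed word."""
    return bytes(((word >> 14) & 0x7F, (word >> 7) & 0x7F, word & 0x7F))
```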


Sun Sep 22, 2024 4:22 am