 Qupls (Q+) 

Joined: Sat Feb 02, 2013 9:40 am
Posts: 2462
Location: Canada
Re-wrote the functions in the memory scheduler as separate modules. The memory scheduler keeps getting elided from the design, and I have not figured out why yet. I had hoped that breaking it up into smaller modules would help isolate the issue.

Finally got to the first simulation. The memory scheduler shows up in simulation. The simulator does not cut it out.

_________________
Robert Finch http://www.finitron.ca


Tue Dec 30, 2025 6:32 am WWW

Joined: Sat Feb 02, 2013 9:40 am
Posts: 2462
Location: Canada
Finally pulled the load / store queue out of the Qupls4 mainline into its own module. Dunno if I did it “the right way” but I bolted a command interface onto the LSQ. It can process up to 10 commands in parallel. I am not sure that is enough parallel commands. There could be several branches wanting to invalidate LSQ entries while ROB entries that are done also want to invalidate LSQ entries, all in the same clock cycle.
Commands are: Invalidate, enqueue, set address, set data, increment address.
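
A minimal Python model of such a command interface, offered as a sketch and not as the actual RTL. The five command names and the limit of 10 per cycle come from the post; the entry fields, the queue size, the `index`/`value` arguments, and applying commands in list order are all assumptions made for illustration.

```python
from dataclasses import dataclass
from enum import Enum, auto
from typing import List, Optional

MAX_CMDS_PER_CYCLE = 10  # from the post: up to 10 commands in parallel


class Cmd(Enum):
    INVALIDATE = auto()
    ENQUEUE = auto()
    SET_ADDRESS = auto()
    SET_DATA = auto()
    INCREMENT_ADDRESS = auto()


@dataclass
class Entry:
    # Hypothetical entry layout; the real LSQ entry has more fields.
    valid: bool = False
    address: Optional[int] = None
    data: Optional[int] = None


@dataclass
class Command:
    op: Cmd
    index: int      # which LSQ entry the command targets (assumed)
    value: int = 0  # address, data, or increment amount (assumed)


class LSQ:
    def __init__(self, size: int = 8):  # size is an assumption
        self.entries = [Entry() for _ in range(size)]

    def clock(self, cmds: List[Command]) -> List[Command]:
        """Apply one cycle of commands; return those that did not fit."""
        accepted = cmds[:MAX_CMDS_PER_CYCLE]
        deferred = cmds[MAX_CMDS_PER_CYCLE:]
        for c in accepted:
            e = self.entries[c.index]
            if c.op is Cmd.ENQUEUE:
                self.entries[c.index] = Entry(valid=True)
            elif c.op is Cmd.INVALIDATE:
                e.valid = False
            elif c.op is Cmd.SET_ADDRESS:
                e.address = c.value
            elif c.op is Cmd.SET_DATA:
                e.data = c.value
            elif c.op is Cmd.INCREMENT_ADDRESS:
                e.address = (e.address or 0) + c.value
        return deferred
```

The returned `deferred` list makes the worry in the post concrete: if branches and committing ROB entries together issue more than 10 commands in a cycle, something has to hold the overflow for the next cycle.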

_________________
Robert Finch http://www.finitron.ca


Wed Dec 31, 2025 7:20 am WWW

Joined: Sat Feb 02, 2013 9:40 am
Posts: 2462
Location: Canada
Tonight’s quandary: getting read data to the reservation stations at reasonable speed.

Multiplexing register tags from the issue queues in the reservation stations onto a four-wide bus for register read requests took 289 logic levels. I forgot to register the outputs, which I had intended to do. Registering the outputs moved the timing critical path elsewhere, but it looks like a few more registers are required.
I am guessing so many logic levels are required because the multiplexers are built out of cascaded LUTs. Discrete logic could probably do better.

I am hoping to get 40 MHz performance out of the core, which should give roughly the same performance as (or better than) an 80 MHz in-order design.

I have not figured out why some modules are being removed from the design by synthesis. But I have found what seem to be minor flaws causing some modules to be removed. Most of the design is present now.
The 6551 UART was being eliminated, but I found that the state machine was advancing too quickly, not allowing output registers to be set, so they were always at zero when the state changed. The tools picked up on the fact and simply removed the component from the design. This was the result of changes made to support two different bus protocols.

Found out the read port select logic was way too slow (291 logic levels). The logic dynamically selects ports for reading. It was packing the port selects into the minimum number of read ports, considering only active ports, which takes a ton of multiplexers. Now it is coded differently as shown in the diagram below.
Attachment:
Qupls4_read_port_selector.png

After a few minor adjustments the timing is up to 37 MHz. It may need to run under 40 MHz as I cannot see a way to improve the timing. The critical path is now in instruction dispatch, which basically copies values from a pipeline register into another pipeline register feeding the reservation stations.



_________________
Robert Finch http://www.finitron.ca


Thu Jan 01, 2026 2:31 am WWW

Joined: Wed Jan 09, 2013 6:54 pm
Posts: 1870
wow, that's quite the logic depth! what's the depth down to now?


Thu Jan 01, 2026 2:13 pm

Joined: Sat Feb 02, 2013 9:40 am
Posts: 2462
Location: Canada
Quote:
wow, that's quite the logic depth! what's the depth down to now?
I think I have got the depth down to around 100 logic levels now. I am not sure how the tools calculate the logic level depth. I am assuming a lookup table counts as multiple levels of logic. I seem to recall hearing that a superscalar design is somewhere around 20 logic levels. The FPGA implementation sometimes needs to cascade logic; I suspect a custom discrete logic design would not need to.

The timing report is telling me it should work to 46 MHz now, which is good as 40 MHz is desired (the video dot clock rate, which is handy).

Got the Qupls4 Arpl compiler working. It at least generates code that looks like it should work.

There was a nasty issue with a delete in a doubly-linked list that did not work properly. I have yet to figure out why. The delete causes push and only push instructions to be removed from the code. The delete works just fine around other instructions. I think it may be some sort of weird memory dependency having to do with pointer aliasing. I am just guessing. It is “fixed” at the moment by not doing a delete, and instead putting a special NOP opcode in the place of the deleted instruction. When the output routine sees this it just does not output anything. The effect is that the output code is right, but there is an extra linkage in the code list.
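
The workaround can be sketched in Python as follows. This is an illustration of the tombstone idea only, not the Arpl compiler's code; the `Instr` class, the `DELETED` marker name, and the helper functions are hypothetical. A conventional `unlink` is included for comparison.

```python
class Instr:
    """One node of a doubly-linked code list (hypothetical layout)."""
    def __init__(self, opcode):
        self.opcode = opcode
        self.prev = None
        self.next = None


DELETED = "nop.deleted"  # hypothetical name for the special NOP opcode


def build(opcodes):
    """Build a doubly-linked list from a sequence of opcodes."""
    head = prev = None
    for op in opcodes:
        node = Instr(op)
        if prev is None:
            head = node
        else:
            prev.next = node
            node.prev = prev
        prev = node
    return head


def unlink(head, node):
    """Conventional delete: patch both neighbours, return the new head."""
    if node.prev is not None:
        node.prev.next = node.next
    else:
        head = node.next
    if node.next is not None:
        node.next.prev = node.prev
    node.prev = node.next = None
    return head


def tombstone(node):
    """Workaround: leave the links alone and mark the node as deleted."""
    node.opcode = DELETED


def emit(head):
    """Output routine: walk the list and skip tombstoned nodes."""
    out = []
    node = head
    while node is not None:
        if node.opcode != DELETED:
            out.append(node.opcode)
        node = node.next
    return out
```

With the tombstone approach the emitted code is the same as after a real delete, but the list keeps one extra node per removed instruction.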

Did some work on the assembler too. The assembler will progress along more slowly than the compiler.

All this work while waiting for synthesis.

Scrapped the store immediate instruction. It had too much overlap with ordinary stores with constant postfixes applied. The only difference is that store immediate would allow a four bit constant field in the instruction to be stored. This would be handy for storing zeros for instance. But the same thing can be done with a postfix instruction, except that it takes up more room in the program.

_________________
Robert Finch http://www.finitron.ca


Fri Jan 02, 2026 4:32 am WWW

Joined: Sat Feb 02, 2013 9:40 am
Posts: 2462
Location: Canada
Learned a new trick the other day reading comp.arch newsposts. Finally got around to implementing it. Using the move instruction as a renamer command. MOVE does not need to do anything other than assign the source register tag to the destination register. There is no need for processing on a move instruction. No new tag is assigned. MOVE is subsequently treated as a NOP. NOPs get removed from the pipeline by the dispatcher.
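
A small Python model of the trick, as a sketch under stated assumptions and not the Qupls4 renamer itself. The table sizes, the `rename` signature, and the simple free list are hypothetical. The sketch also ignores a real concern: once two architectural registers share one tag, the tag can only be freed after both mappings are gone.

```python
class Renamer:
    def __init__(self, num_arch_regs=64, num_phys_regs=128):  # sizes assumed
        # Register alias table: architectural register -> physical tag.
        self.rat = list(range(num_arch_regs))
        self.free = list(range(num_arch_regs, num_phys_regs))

    def rename(self, op, dst, srcs):
        """Return (op, dst_tag, src_tags) as seen by the rest of the pipeline."""
        src_tags = [self.rat[s] for s in srcs]
        if op == "move":
            # No new tag: the destination simply aliases the source tag,
            # and the instruction continues down the pipeline as a NOP.
            self.rat[dst] = src_tags[0]
            return ("nop", None, [])
        new_tag = self.free.pop(0)
        self.rat[dst] = new_tag
        return (op, new_tag, src_tags)


r = Renamer(num_arch_regs=8, num_phys_regs=16)
r.rename("add", 3, [1, 2])      # r3 gets a fresh tag
r.rename("move", 4, [3])        # r4 now maps to the same tag as r3
print(r.rat[3] == r.rat[4])     # True
```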

Freed up two opcodes used for IP relative addressing. Turns out they were not needed as the IP can be specified as register #62 in ordinary load / store instructions. Changed the code for a zero register to register #63. This makes the r0 register completely general-purpose, works the same as any other GPR.

Sacrificed a bit of branch displacement, repurposed to indicate increment or decrement of the tested register. It only works for specific branches: iblt, ible, ibltu, ibleu, dbne and dbnez. 22 bits is still plenty of displacement bits for conditional branches.

Did some work on Qupls version 5. Comparing:

Qupls4 (48-bit inst.)
Fibonacci: 21 instructions, 126 bytes
Serial driver: 260 instructions, 1622 bytes
Xmodem: 177 instructions, 1100 bytes

Qupls5 (32-bit inst.)
Fibonacci: 22 instructions, 100 bytes
Serial driver: 277 instructions, 1224 bytes
Xmodem: 186 instructions, 832 bytes

While Qupls4 instructions are 50% wider, the code is only about 32% larger across these three samples. The difference is made up by a lower instruction count: Qupls4 uses about 6% fewer instructions, meaning its code may execute slightly faster.
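
The comparison can be reproduced from the totals listed above with a few lines of Python. The numbers are the ones from the post; summing the three samples before taking the ratio is a choice made here, and per-program ratios differ slightly.

```python
# (instructions, bytes) per sample, copied from the listing above.
samples = {
    "Fibonacci":     {"q4": (21, 126),   "q5": (22, 100)},
    "Serial driver": {"q4": (260, 1622), "q5": (277, 1224)},
    "Xmodem":        {"q4": (177, 1100), "q5": (186, 832)},
}


def totals(isa):
    """Sum instruction counts and byte sizes over all samples for one ISA."""
    instrs = sum(s[isa][0] for s in samples.values())
    size = sum(s[isa][1] for s in samples.values())
    return instrs, size


q4_instrs, q4_bytes = totals("q4")
q5_instrs, q5_bytes = totals("q5")

size_penalty = q4_bytes / q5_bytes - 1    # how much larger Qupls4 code is
instr_saving = 1 - q4_instrs / q5_instrs  # how many fewer instructions it uses

print(f"Qupls4 code is {size_penalty:.1%} larger")
print(f"Qupls4 uses {instr_saving:.1%} fewer instructions")
```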

_________________
Robert Finch http://www.finitron.ca


Mon Jan 05, 2026 1:02 am WWW

Joined: Sat Feb 02, 2013 9:40 am
Posts: 2462
Location: Canada
Updated the name supplier component so that newly freed tags cannot be used for at least 15 clock cycles. This prevents pipelining issues. For instance, the reservation stations monitor the write bus to the register file, looking for operands and there is a 13 clock-cycle window. If registers were reused too quickly then stale data might be read from the write bus. The 15 cycle delay was implemented using SRL components which are very efficient shift registers.

Then I removed the rotating of the available list. The idea of rotating the list was to prevent a tag from being reused too soon. With the delay line that is not necessary. Changes to the name supplier cut the size of the component to less than half size and likely improved the timing. Using a delay line is less expensive than using the rotating selector.
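
A behavioural sketch of the delay-line idea in Python, not the actual component. The 15-cycle delay comes from the post; the tag count, the method names, and the lowest-free-bit allocation policy are assumptions. The `deque` plays the role of the SRL shift register.

```python
from collections import deque


class NameSupplier:
    def __init__(self, num_tags=32, delay=15):  # num_tags is an assumption
        self.available = [True] * num_tags      # allocation bitmap
        # Delay line: one slot per cycle, each holding the tags freed then.
        self.delay_line = deque([] for _ in range(delay - 1))
        self.freed_this_cycle = []

    def allocate(self):
        """Return the lowest available tag, or None if none is free."""
        for tag, ok in enumerate(self.available):
            if ok:
                self.available[tag] = False
                return tag
        return None

    def free(self, tag):
        """Queue a tag for release; it is not reusable yet."""
        self.freed_this_cycle.append(tag)

    def clock(self):
        """Advance one cycle; tags leaving the delay line become available."""
        self.delay_line.append(self.freed_this_cycle)
        self.freed_this_cycle = []
        for tag in self.delay_line.popleft():
            self.available[tag] = True
```

Because a freed tag only becomes visible to `allocate` after `delay` calls to `clock`, no rotation of the available list is needed to keep reuse apart.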

Reduced the maximum tap delay on the write bus history to 13 clock cycles. It used to be slightly longer than the longest running instruction, but this might cause stale data to be used if the register tag is reused.

_________________
Robert Finch http://www.finitron.ca


Tue Jan 06, 2026 3:12 am WWW

Joined: Sat Feb 02, 2013 9:40 am
Posts: 2462
Location: Canada
Finally got around to writing a small test bench for the RAT.

Still learning more about the toolset. To get a timing summary for the RAT I created a clock signal. According to the timing summary report it should be good up to 170 MHz. Since the goal was 40 MHz for the CPU, this will likely work.

The BTB was too slow, it failed to meet 40 MHz timing. There were too many levels of changes to the instruction pointer. So, I took an axe to the component. There used to be a separate table for each possible instruction in a group requiring a 4:1 mux, now there is just a single table for all instructions in the group.
Switching of the IP immediately for interrupts was removed, to remove a couple of levels of muxes. Instead, the IP will switch at commit time. This does increase the latency to reach the interrupt subroutine, but it is better than slowing the whole machine down for every instruction.
With the changes to the BTB it should work out to 150 MHz+ now.

Feeling optimistic tonight, I may make my new goal for timing 100+ MHz.

Modified the decoder. It should be good out to about 180 MHz now. There was a way too complex micro-op buffer setup limiting performance to just under 40 MHz. The setup is much simpler now, but may result in many more NOPs (bubbles) entering the pipeline. This would decrease performance, but the much higher clock rate more than makes up for it.

Did some work adding more pipelining to the float components:
FMA64L45: 3300 LUTs / 3900 FFs (280 MHz)
Previously the FMA component with a latency of eight worked to just over 40 MHz. With about 45 pipeline stages the FMA should work to about 280 MHz.
I may remove some of the pipelining to reduce the latency. Can probably get away with every other stage clocked. Or a double CPU frequency clock could be supplied to the FPU.
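
To put the latency trade-off in numbers, here is a short Python calculation. The stage counts and clock rates are the approximate figures from the post; the 100 MHz core clock in the last step is an assumption based on the new timing goal.

```python
def latency_ns(stages: int, clock_mhz: float) -> float:
    """Time for one operation to pass through `stages` stages at `clock_mhz`."""
    return stages * 1000.0 / clock_mhz


old = latency_ns(8, 40)    # latency-of-eight FMA at about 40 MHz
new = latency_ns(45, 280)  # about 45 stages at about 280 MHz
print(f"old: {old:.0f} ns, new: {new:.0f} ns")

# Latency seen by a core at an assumed 100 MHz when the FPU is clocked
# at double the core frequency (200 MHz).
core_cycles = latency_ns(45, 200) / latency_ns(1, 100)
print(f"45 stages at 2x clock = {core_cycles:.1f} core cycles")
```

In absolute time the deeper pipeline is a little faster per operation (about 161 ns against 200 ns), but measured in core cycles the latency is much longer, which is why removing stages or double-clocking the FPU is attractive.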

_________________
Robert Finch http://www.finitron.ca


Thu Jan 08, 2026 6:02 am WWW

Joined: Sat Feb 02, 2013 9:40 am
Posts: 2462
Location: Canada
Got the register file updated. It now works to 290 MHz, suitable for my goal of over 100 MHz operation. I added a pipeline register, which increases the read latency by a cycle.

Did some more experimentation with the name supplier, trying again to get a FIFO based approach to work. I desired a FIFO based approach as it is potentially smaller and faster than my current approach. Did not have much luck. My register allocation bitmap approach seems to work. The size is about 2000 LUTs and works to 160 MHz, probably good enough for my 100 MHz operation goal. Other parts of the design are slower.

Worked on improving the instruction dispatch timing too. It will work out to 120 MHz now. I cannot see a way ATM of improving the timing further.

_________________
Robert Finch http://www.finitron.ca


Sun Jan 11, 2026 8:14 am WWW

Joined: Sat Feb 02, 2013 9:40 am
Posts: 2462
Location: Canada
Made micro-op translation its own pipeline stage. The goal is to have each pipeline stage take only one clock cycle and be very fast.

Got the bright idea of clocking most of the in-order portion of the pipeline at double the core clock frequency. The stages can operate at over 200 MHz, which is double the clock frequency of other parts of the design. This effectively shortens the pipeline.

I finally got synthesis far enough to get timing for the whole core. The first try was 3 MHz operation due to 900! logic levels.

Timing is up to 25 MHz now, due to a reduction from 900 logic levels to 88. The 900 levels had to do with load / store queue interfacing, which has been improved with a bit of work. There were unnecessary loops iterating through the LSQ.

Got it down to two signals now preventing 100 MHz operation. Both have to do with branch logic. After that the number of logic levels drops to 19.

The core size is also reduced.

_________________
Robert Finch http://www.finitron.ca


Tue Jan 13, 2026 5:15 am WWW

Joined: Sat Feb 02, 2013 9:40 am
Posts: 2462
Location: Canada
Finally got the timing for the whole CPU to just over 100 MHz. It still can be improved a bit further. Now to shoot for 120 MHz, the limit set by dispatch. There was much fiddling with many components.

LSQ entries are invalidated at a max rate of four per clock cycle when a load / store commits. Previously they were invalidated willy-nilly in several places such as when a branch invalidated instructions. This was costly resource and timing wise.

Dynamic write port assignment was too slow. On its own it ran at about 100 MHz, but combined with write queue timing the delay was too much, dropping the fmax down to 68 MHz. Selecting the port was being done using four bitmaps, one for each write port. This was reduced to two bitmaps by searching with both FFO (find first one) and FLO (find last one) instead of just FFO. That change and a couple of well-placed registers moved the selection off the critical timing path. Dynamic write port assignment now times to 270 MHz.
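
One way to read the two-bitmap scheme, sketched in Python. This is an interpretation and not the actual logic: it assumes each bitmap feeds two ports, the lowest requester found by FFO and the highest found by FLO, and that the second bitmap is the first with those two bits cleared. The function names are hypothetical.

```python
def ffo(bits: int) -> int:
    """Find first one: index of the lowest set bit, or -1 if none."""
    return (bits & -bits).bit_length() - 1


def flo(bits: int) -> int:
    """Find last one: index of the highest set bit, or -1 if none."""
    return bits.bit_length() - 1


def assign_ports(requests: int):
    """Assign up to four requesters to four ports using two bitmaps.

    Returns (ports, remaining): ports is a list of four requester indexes
    (None for an idle port), remaining is the bitmap of unserved requests.
    """
    ports = []
    remaining = requests
    for _ in range(2):  # two bitmaps, two ports each
        lo, hi = ffo(remaining), flo(remaining)
        ports.append(lo if lo >= 0 else None)
        ports.append(hi if hi > lo else None)  # same bit: second port idle
        if lo >= 0:
            remaining &= ~((1 << lo) | (1 << hi))
    return ports, remaining


print(assign_ports(0b10110100))  # ([2, 7, 4, 5], 0)
```

Searching from both ends halves the chain of dependent bitmaps compared with four cascaded find-first-one searches, which is plausibly where the timing gain comes from.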

The multiplier in Qupls4 AGEN for index scaling finally showed up on the timing path. It was not an issue when shooting for 40 MHz timing. But now 100+ MHz is the goal and AGEN limited things to 88 MHz. However, the multipliers were cascaded with other logic without intervening registers, so the first thing tried was adding registers. Adding registers worked.

Registers were inserted in a few other places to break up long chains of logic. I think things are sitting at about 19 logic levels now.

_________________
Robert Finch http://www.finitron.ca


Wed Jan 14, 2026 4:26 am WWW

Joined: Sat Feb 02, 2013 9:40 am
Posts: 2462
Location: Canada
It turned out that when the entire system (SoC) was built, the instruction dispatcher was on the critical timing list, limiting the fmax to about 70 MHz. While the component by itself was good to about 120 MHz, the timing was worse when it was included in the rest of the system.

So, I found a way to do the dispatch almost twice as fast (205 MHz) using 10% of the resources. Copy across all the fields needed for the reservation station into the *same* number of entries as there are ROB entries, so the mapping is just 1:1 wires. Then select the entries being sent to the reservation stations. The difference between this and the previous method is that the entries are loaded at the same time as the selection is being made, instead of being loaded after the selection is made.

I had to add “Keep” directives to the code for the re-order buffer and the load / store queue. This prevents them from being removed from the design. With the keep directives it looks like the entire design is present and it is only about 140k LUTs. Whether it works or not is something else.

_________________
Robert Finch http://www.finitron.ca


Thu Jan 15, 2026 6:15 am WWW

Joined: Wed Jan 09, 2013 6:54 pm
Posts: 1870
That's a remarkable improvement!


Thu Jan 15, 2026 9:31 am

Joined: Sat Feb 02, 2013 9:40 am
Posts: 2462
Location: Canada
Quote:
That's a remarkable improvement!
Thanks :)

After a couple of days banging away at the keyboard, I managed to get a simple test ROM compiled and linked. It will allow testing of the system. Spent most of the time updating the compiler and pre-processor. A few assembler updates were also made.

The compiler had to be updated as it was not generating the correct labels for global constants. It would use the correct label in the code, but when the constant values were dumped, they were dumped as if they were locals. This led to unresolved symbol errors. It was trickier to fix than it seems.

The pre-processor also needed fixes. I updated it significantly a year or so ago to use a new buffer type but never put it to real use. It had its share of glitches mostly to do with #including files.

_________________
Robert Finch http://www.finitron.ca


Sat Jan 17, 2026 7:20 am WWW

Joined: Sat Feb 02, 2013 9:40 am
Posts: 2462
Location: Canada
Got the CPU fetching the first two cache lines in simulation. But then it does a jump to the BRK routine as if it had encountered a BRK instruction. There is something amiss with the pipeline; there are not supposed to be any BRK instructions in it. And the pipeline is missing the reset JMP instruction. But at least it looks like the BRK vectoring is working.

I may have figured this out. By default, the pipeline registers were initialized to zero at reset, which mostly works, except that zero is the fault code for a debug fault. With the fault code field reset to zero, the CPU was taking a debug fault for the first instruction encountered. Things could almost work this way. It is tempting to redefine zero as the reset fault.
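
The pitfall can be shown with a toy Python model. Only the fact that zero encodes a debug fault comes from the post; the `NONE` encoding, the register fields, and the `pending_fault` check are hypothetical and exist only to illustrate how an all-zero reset value can decode as a real fault.

```python
from dataclasses import dataclass
from enum import IntEnum


class Fault(IntEnum):
    DEBUG = 0   # from the post: zero is the debug fault code
    NONE = 255  # hypothetical "no fault" encoding, chosen for this sketch


@dataclass
class PipelineReg:
    valid: int
    opcode: int
    fault: int


def reset_all_zero() -> PipelineReg:
    """Reset every field to zero, as the pipeline registers did by default."""
    return PipelineReg(valid=0, opcode=0, fault=0)


def reset_no_fault() -> PipelineReg:
    """Reset with the fault field explicitly set to the no-fault code."""
    return PipelineReg(valid=0, opcode=0, fault=Fault.NONE)


def pending_fault(reg: PipelineReg):
    """Return the fault carried by this stage, or None if there is none."""
    return None if reg.fault == Fault.NONE else Fault(reg.fault)


print(pending_fault(reset_all_zero()))  # Fault.DEBUG
print(pending_fault(reset_no_fault()))  # None
```

Either giving the fault field a non-zero reset value or redefining code zero so that it is harmless at reset removes the spurious fault in this model.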

_________________
Robert Finch http://www.finitron.ca


Mon Jan 19, 2026 5:46 am WWW