 Qupls (Q+) 

Joined: Sat Feb 02, 2013 9:40 am
Posts: 2307
Location: Canada
Added a separate CPU clock generator to the system. The CPU needs 1x, 2x, and 5x clocks, and there were not enough clocks on the system clock generator to support varying the CPU clock. The core missed 20 MHz timing by about 950 ps, so I made some minor adjustments. The one signal not meeting timing has only two logic levels, but 95% of its delay is in routing. Most of the timing limitation is in the RAT, so I am investigating how this can be improved. Now working with a 125/50/25 clock combo.

Moved the capabilities instructions from the ALU to the FPU. The FPU has access to register pairs via the quad-float extension prefix (QFEXT), which allows it to make use of 128-bit values. Any of 256 registers may be used to extend a register to 128 bits. Four of the vector registers could be sacrificed for the purpose of extending all registers to 128 bits.
I should be able to change things to allow the use of uncompressed 32-bit capabilities which require 4x wide registers.

_________________
Robert Finch http://www.finitron.ca


Mon Sep 16, 2024 3:20 am WWW

Joined: Sat Feb 02, 2013 9:40 am
Posts: 2307
Location: Canada
Boosted the clock rate up to 25 MHz, and the response buffer reared its ugly head again as the timing limiter. So, I completely rewrote it as a FIFO, using the FIFO template in the toolset. Timing now meets 25 MHz. Next, try for 30 MHz. The FIFO runs on a 5x clock. Up to four inputs are time-multiplexed into the input. On the output side, although reads use the 5x clock, only one value is read per 1x clock. The bus rate is one value per clock. But sometimes multiple responders will respond at the same time, hence the FIFO.
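
To illustrate the buffering behaviour described above, here is a minimal Python model. It is only a sketch, not the actual RTL: the function name and the list-based "clock" are made up for illustration, and it assumes four of the five fast-clock slots are used for writes.

```python
from collections import deque

def run_response_fifo(cycles):
    """cycles: one list per 1x clock, holding up to four responses that
    arrive together. Returns the value placed on the bus in each 1x clock."""
    fifo = deque()
    bus = []
    for responses in cycles:
        assert len(responses) <= 4
        # Fast (5x) clock: one write slot per input port inside a single 1x clock.
        for slot in range(4):
            if slot < len(responses):
                fifo.append(responses[slot])
        # Output side: only one value leaves the FIFO per 1x clock.
        bus.append(fifo.popleft() if fifo else None)
    return bus

# Three responders answer in the same clock, then one more two clocks later.
print(run_response_fifo([["a", "b", "c"], [], ["d"], [], []]))
# ['a', 'b', 'c', 'd', None]
```

The simultaneous responses are spread over the following clocks, so the bus still carries one value per clock.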

Finally got the idea to perform some of the simpler ALU ops on the FPU as well. This increases the number of parallel ops that can be performed. Simple ops include add, sub, compare, and, or, xor, mov and cmov. Multi-cycle ops like integer mul and div and shifts are not done on the FPU. Less common ops are also not performed on the FPU. Supporting ALU ops on the FPU required minor modifications to the scheduler and decoder. In a similar fashion, FCMP, FABS, and FNEG can be performed on the ALU.

I busted something in the core along the way. The core still works – sort of. It only executes a few instructions before it hangs though. More debugging.

_________________
Robert Finch http://www.finitron.ca


Tue Sep 17, 2024 2:47 am WWW

Joined: Sat Feb 02, 2013 9:40 am
Posts: 2307
Location: Canada
Ran into an issue with a ROB slot being issued for execution multiple cycles in a row. This should not be able to happen. Once issued, the ROB slot is busy executing the instruction. It is not freed up until commit time, a number of cycles later. I think this happened because the ROB slot was not marked busy until the end of the clock cycle, and the scheduler logic did not get a chance to see the busy status. So, I put some flip-flops in the scheduler logic to prevent issuing the same ROB slot in consecutive clock cycles. It is possible for an ALU to be available every clock cycle when it is executing single-cycle instructions. The idea is to keep the ALU as busy as possible. It gets fed from multiple ROB slots.
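
A rough Python model of the double-issue problem and the flip-flop fix, for illustration only. It assumes every slot holds a ready instruction and that the scheduler sees the busy bits one cycle late; the names `busy_seen` and `issued_last` are hypothetical and do not come from the core.

```python
def simulate(num_slots, cycles, use_inhibit):
    """Return the ROB slot picked for issue in each cycle."""
    busy = [False] * num_slots         # set at the end of the issuing cycle
    busy_seen = [False] * num_slots    # the scheduler's view, one cycle stale
    issued_last = [False] * num_slots  # the added flip-flops
    trace = []
    for _ in range(cycles):
        pick = None
        for slot in range(num_slots):
            blocked = busy_seen[slot] or (use_inhibit and issued_last[slot])
            if not blocked:
                pick = slot
                break
        trace.append(pick)
        busy_seen = list(busy)         # catch up with the state before this issue
        issued_last = [s == pick for s in range(num_slots)]
        if pick is not None:
            busy[pick] = True          # marked busy only at the end of the clock
    return trace

print(simulate(4, 4, use_inhibit=False))  # [0, 0, 1, 1]  each slot issued twice
print(simulate(4, 4, use_inhibit=True))   # [0, 1, 2, 3]  one issue per slot
```

With the inhibit in place a different slot is issued every cycle, so the ALU stays fed without any slot being issued twice.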

Added some registers in the ALU/FPU output paths to account for pipeline differences in signals.

Wrote a short blurb on implementing the RAT in Qupls and posted it on my website.
http://www.finitron.ca

Synthesis did not retime the FPU's FMA combo unit, leading to 133 logic levels, and a failure to meet timing.
There are about eight levels of registers added to the output of the FMA so that synthesis can break up the logic across multiple clock cycles.

_________________
Robert Finch http://www.finitron.ca


Wed Sep 18, 2024 7:03 am WWW

Joined: Sat Feb 02, 2013 9:40 am
Posts: 2307
Location: Canada
I ended up using a registered version of the FMA that has a latency of eight clocks.

There is a timing issue that is causing implementation to take forever to complete. I cannot tell where the timing issue is because no timing report is produced. It is as if the tool is trying to meet an impossible timing requirement. One thing I did do was turn retiming on. So, I have turned it off now for another rebuild.

Some experimentation with the RAT was done. I think I managed to eliminate some bypass logic that was not needed. For instance, bypassing the valid bit to TRUE for register zero (since it is a fixed value and always valid) was removed. Instead, if there is an attempt to update the valid status for register zero, it is squashed during update. The difference is that only four ports need to be checked at update; otherwise 20 read ports would need bypassing.
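
As a small illustration of the idea (a Python sketch with hypothetical names, not the RAT code itself): squashing writes to register zero on the update side means the read side needs no special case.

```python
class ValidBits:
    """Toy model of the register valid bits."""

    def __init__(self, num_regs):
        self.valid = [True] * num_regs  # register zero starts valid and stays valid

    def update(self, writes):
        """writes: up to four (reg, is_valid) pairs, one per update port."""
        for reg, is_valid in writes:
            if reg == 0:
                continue                # squash: register zero is always valid
            self.valid[reg] = is_valid

    def read(self, reg):
        return self.valid[reg]          # no register-zero bypass on the read ports

bits = ValidBits(8)
bits.update([(0, False), (3, False)])
print(bits.read(0), bits.read(3))  # True False
```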

Timing in the RAT is affected by reading the mapping of architectural to physical registers, which is then used to read the valid bit for the physical register. The output of one RAM feeds the address input of a second RAM, and both accesses have only ½ clock cycle to complete. I came up with a way to avoid the doubling up on the RAM access; however, it is then too large to implement, as it doubles the size of the RAT. I may be able to move the lookup of the architectural-to-physical register mapping back to a prior CPU stage. That would help a lot.

_________________
Robert Finch http://www.finitron.ca


Thu Sep 19, 2024 5:30 am WWW

Joined: Sat Feb 02, 2013 9:40 am
Posts: 2307
Location: Canada
The timing issue turned out to be badly referenced clock signal names. For some reason the toolset added an “_1” to the clock signal names, so they no longer matched what was set up in the constraint file. As a result, the toolset was attempting to meet multiple clock constraints as if they were dependent.

The scheduler turned up in the list of things impacting timing. It uses both clock edges so is working within ½ a clock cycle. The “minor” modifications made the other day were just enough to impact timing.

Improved the PRED modifier implementation. It was working only for a single following instruction. It should be applied for up to eight instructions.

_________________
Robert Finch http://www.finitron.ca


Fri Sep 20, 2024 1:48 am WWW

Joined: Sat Feb 02, 2013 9:40 am
Posts: 2307
Location: Canada
A second branch was started before the first one was complete. This messed up the branch state machine. The scheduler is not supposed to issue another branch unless the flow control unit is idle.

Register file updates are not happening properly. It is tricky to debug because of the multiplexed write ports.

_________________
Robert Finch http://www.finitron.ca


Sat Sep 21, 2024 3:01 am WWW

Joined: Sat Feb 02, 2013 9:40 am
Posts: 2307
Location: Canada
Had fun for a while getting correct values in the correct registers. But things seem to be working now. It even seems to work across branch checkpoint restores.

Runs now in SIM up until a TLB miss occurs trying to update the random number generator.

Had a loop in the boot ROM to initialize the page table, followed by initialization of the timer, random number generator, and LEDs. An issue is that the CPU reads instructions quite a bit beyond what is executing. In this case it was trying to begin initialization of the random number generator before the page table was set up. (The CPU can translate the memory addresses for instructions without performing the memory op yet.) This led to a hang. The only thing I can think of to do is put a ton of NOPs in after the page table setup loop so those would be fetched instead of the initialization code. With seven pipeline stages before the re-order buffer, that is 28 instructions; adding the instructions in the re-order buffer brings it to about 44 instructions. That means placing about 50 to 60 NOPs in the code. Then a great idea: use vector NOP instructions. That way code bloat is less, as only eight vector NOPs are required.
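
The in-flight instruction count above can be checked with a few lines of Python. This is just the arithmetic; the four-wide machine width is an assumption inferred from 28 = 7 x 4, and the 16-entry re-order buffer figure is taken from the list of limits further down in this post.

```python
PIPELINE_STAGES_BEFORE_ROB = 7  # from the post
MACHINE_WIDTH = 4               # assumed: four instructions per stage
ROB_ENTRIES = 16                # re-order buffer limited to 16 entries

in_flight_front_end = PIPELINE_STAGES_BEFORE_ROB * MACHINE_WIDTH
in_flight_total = in_flight_front_end + ROB_ENTRIES

print(in_flight_front_end, in_flight_total)  # 28 44
```

Padding with 50 to 60 NOPs therefore leaves a small margin over the 44 instructions that can be in flight.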

SIM is running with 224 architectural registers, which is 28 out of a maximum of 32 vector registers. Any more registers make the core too big to fit. That is supported by 448 physical registers. The core size is being controlled by limiting the number of registers. This probably has the biggest impact on size.

Things left out so the core would fit:
* 4+ vector registers
* multi-precision operations where a register could be treated as 4x16 bits, 2x32 bits or 1x64 bits.
* 128-bit floating point ops
* Capabilities
* branch checkpoints limited to 4 instead of 16
* alternate fetch paths
* 2nd ALU and 2nd FPU
* 2nd load unit
* un-aligned memory access
* re-order buffer limited to 16 entries

Under consideration is 51-bit instructions, but it requires making a five-wide machine. 51-bit instructions would fit five per 256 bits, which almost works out evenly (5 x 51 = 255). It would add only 3 bits per instruction. By fetching five instructions at once, the instructions would not need to be right-aligned, eliminating the alignment shifter. Also gone would be the fetch of two cache lines to accommodate instructions crossing a cache-line boundary. Instead, only ½ of a cache line need be fetched. The logic simplification would help offset the cost of widening the fetch path to five instructions. It could maybe be done in the current FPGA by having the fifth instruction always be a NOP that is not fed into the pipeline, so the machine would be only four-wide.
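
A quick Python sketch of the packing idea, for illustration. The slot order (slot 0 in the least significant bits) and the helper names are assumptions; the post does not specify a layout.

```python
INSN_BITS = 51
SLOTS = 5
FETCH_BITS = 256

def pack(insns):
    """Pack five 51-bit instructions into one 256-bit fetch word."""
    assert len(insns) == SLOTS
    word = 0
    for i, insn in enumerate(insns):
        assert 0 <= insn < (1 << INSN_BITS)
        word |= insn << (i * INSN_BITS)   # assumed: slot 0 in the low bits
    return word

def unpack(word):
    """Slice the fetch word at fixed positions; no alignment shifter needed."""
    mask = (1 << INSN_BITS) - 1
    return [(word >> (i * INSN_BITS)) & mask for i in range(SLOTS)]

unused_bits = FETCH_BITS - SLOTS * INSN_BITS
print(unused_bits)  # 1

insns = [1, 2, 3, 4, (1 << INSN_BITS) - 1]
print(unpack(pack(insns)) == insns)  # True
```

Because every instruction sits at a fixed bit position, each slot can be wired straight to a decoder, and only one bit of the 256 goes unused.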

_________________
Robert Finch http://www.finitron.ca


Sun Sep 22, 2024 6:12 am WWW

Joined: Sat Feb 02, 2013 9:40 am
Posts: 2307
Location: Canada
Added background execution buffers for the block operation instructions which allows them to be executed in the background. Added them while trying to figure out a way to implement something resembling pipeline loop mode. Branches currently have a large impact on performance.

Spent part of the day adapting the core for a larger FPGA. The main bus is now 256 bits wide. Dozens of small changes. Widening the bus decreased the amount of control logic required in the data cache, saving some LUTs.

_________________
Robert Finch http://www.finitron.ca


Mon Sep 23, 2024 5:30 am WWW

Joined: Sat Feb 02, 2013 9:40 am
Posts: 2307
Location: Canada
Found an issue having to do with the instruction aligner. When aligning instructions to the right, the left-hand side of the cache line is filled with zeros. This meant that if an instruction was outside of the bounds for alignment it would be set to zero. The solution was to set the out-of-bounds area to NOPs instead of zeros. This causes out-of-bounds instructions to be executed as NOPs, so the PC is reset to an aligned boundary to re-execute the out-of-bounds instructions. Update: further fixed this issue by reading two cache lines so instructions are never out-of-bounds (restoring an older i-cache version which worked). There is no need to fudge the PC then.

Finally hit a conditional branch instruction, making it possible to debug further.

Embarking on many changes to Q+ to use 64-bit instructions. Going to have a massive number of registers (256) instead of vector registers. At least two pipeline stages can be eliminated that way. Changing things in a piecemeal fashion, starting by aligning 48-bit instructions on 64-bit boundaries. What vector instructions provided was code density and a reduction in fetch bandwidth. Code density is now out the window; a simpler design is preferred. (Code density is not as bad as 200% larger, since large constants would otherwise require multiple 32-bit instructions.) And the fetch bandwidth reduction is not as valuable as it may have been at one time. There seems to be loads of fetch bandwidth. Also predicating all instructions. Branches are very slow. It may be faster to skip over large numbers of predicated instructions instead of branching.

Scratching my head over how to calculate IPC. If I$ misses are excluded from the number of clock ticks, the IPC jumps up to about 4 or more. But that includes instructions invalidated due to branches. The core can execute 3 simple ALU-type instructions per clock. I think a more realistic IPC is somewhere between 1 and 2.
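
One way to make the two IPC figures concrete is sketched below in Python. The counter names are hypothetical and the sample numbers are made up purely to exercise the function; they are not measurements from the core.

```python
def ipc(committed, invalidated, total_cycles, icache_miss_cycles):
    """Return an optimistic and a conservative IPC figure."""
    executed = committed + invalidated  # counts work later thrown away on branches
    return {
        # Ignores I$ miss cycles and counts invalidated instructions.
        "optimistic": executed / (total_cycles - icache_miss_cycles),
        # Counts only committed instructions over every clock tick.
        "conservative": committed / total_cycles,
    }

# Made-up counter values, for illustration only.
result = ipc(committed=600, invalidated=200, total_cycles=400, icache_miss_cycles=200)
print(result)  # {'optimistic': 4.0, 'conservative': 1.5}
```

Counting only committed instructions against all clock ticks is the more common convention, since invalidated instructions do no useful work.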

_________________
Robert Finch http://www.finitron.ca


Thu Sep 26, 2024 7:44 am WWW

Joined: Sat Feb 02, 2013 9:40 am
Posts: 2307
Location: Canada
Flat-lined again, after about 70 microseconds. Debugging by running in SIM until something flat-lines, figuring out why it flat-lined, fixing it, running in SIM again, and repeating.
Not sure about my debugging methodology.
In this case the PC is being set to zero. It looks like maybe a bad return address. The link register has zero in it.

Attachment:
Flatlined.png



_________________
Robert Finch http://www.finitron.ca


Fri Sep 27, 2024 1:58 pm WWW

Joined: Sat Feb 02, 2013 9:40 am
Posts: 2307
Location: Canada
Bug Fixes
A bad opcode for subtract was one issue. The displacement was not encoded correctly causing multiple store operations to all store to the same address.
Something went amiss with DRAM access. The instruction was sent to the MEM unit but not decoded as a load or a store. So, the MEM unit hung. Not sure how this happened but code was put in to unhang the MEM unit in such an event.

Finally found a serious error in the construction of the RAT and renamer. It was freeing up registers too soon, always freeing at commit time. It needed to free up a register only once a new target for the same register was about to commit. That way it is known the old one is not needed anymore. Fixing the issue doubled the size of the RAT, as there needs to be a history maintained. Fortunately it is only one deep, and looks a lot like a shift register.
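
The corrected freeing rule can be sketched in Python as below. This is a simplified model with hypothetical names, not the actual renamer: the committed mapping doubles as the one-deep history, and a physical register goes back on the free list only when a newer target for the same architectural register commits.

```python
from collections import deque

class Renamer:
    def __init__(self, num_arch, num_phys):
        # Committed mapping: architectural register -> physical register.
        self.committed = list(range(num_arch))
        self.free = deque(range(num_arch, num_phys))

    def rename(self):
        """Allocate a new physical register for an instruction's target."""
        return self.free.popleft()

    def commit(self, arch, phys):
        """A new target for `arch` commits; only now is the old mapping dead."""
        old = self.committed[arch]    # the one-deep history
        self.committed[arch] = phys
        self.free.append(old)         # free the previous register, not the new one
        return old

r = Renamer(num_arch=4, num_phys=8)
p1 = r.rename()             # physical register 4
print(r.commit(2, p1))      # 2  (the original mapping of r2 is freed)
p2 = r.rename()             # physical register 5
print(r.commit(2, p2))      # 4  (p1 is freed only now)
```

Until the second write to the same architectural register commits, `p1` stays off the free list, so later readers of that register still find its value.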

Getting longer runs now. Hopefully something simple like Fibonacci will be able to run soon.

_________________
Robert Finch http://www.finitron.ca


Sat Sep 28, 2024 6:54 am WWW

Joined: Sat Feb 02, 2013 9:40 am
Posts: 2307
Location: Canada
Been working on getting Fibonacci running in SIM. Some fixes to the MEM unit were required. It now starts up okay but gets off-track when a conditional branch fails; it starts executing other code in the boot program. It is fairly difficult to track the execution path when there are branches involved. There are a lot of excess instructions that get dumped from the pipeline, and it is confusing to see them all.

DRAM/dcache logic needed some work. The “unstick” logic added a day ago caused loads to be finished too soon, resulting in zeros loading into registers. The signal needed some more qualification.

Some of the bypass logic in the RAT needed to be adjusted.

An interesting issue arose with zero being in the link register at startup. Since it is during boot there is no return address. There is a simple jump to the bootstrap code. The issue then is that if the CPU hits a return instruction it will speculate ahead and fetch code from the return address. But when the link register contains zero, the CPU ends up fetching and trying to process instructions at address zero and up. This causes the CPU to hang. So, the link register is now initialized at startup to point to a long sequence of NOPs. That way the CPU speculates into NOP territory and does not hang.

Code:
                                           start:
                                           ; set global pointers
02:0000000000000000 840F0000F0FFFEFF          ldi %sp,$0xFFFFFFFFFFFEFFF0
02:0000000000000008 840E000000000000          lda %gp,_start_bss
                                           ; point the link register at the NOP run below, so a speculated return fetches NOPs
02:0000000000000010 8414000038000000          ldi %lr1,exit_adr
02:0000000000000018 840F1F00F8FFFFFF          subtract %sp,%sp,$8
                                           ; store 13 on the stack (presumably the argument for _Fibonacci)
02:0000000000000020 840400000D000000          ldi %t0,$13
02:0000000000000028 D3041F0000000000          store %t0,[%sp]
02:0000000000000030 2020040000000000          bra _Fibonacci
                                           ;   bra _bootrom
                                           exit_adr:
                                           .rept 32
                                              nop
                                           .endr

_________________
Robert Finch http://www.finitron.ca


Sun Sep 29, 2024 8:22 am WWW

Joined: Mon Oct 07, 2019 2:41 am
Posts: 768
I think making the reset virtual might be a better idea, with the concept of virtualization now better known.
I like the idea of the DMA hardware just moving a list of data/address pairs at boot: PC at XXXX, SP at XXXX, MMU at XXX.
Will the read-ahead on the return link be a problem when returning to a lower-level ring?
shell() while(1) if(!abort_flag)dostuff();else return;


Sun Sep 29, 2024 7:08 pm

Joined: Sat Feb 02, 2013 9:40 am
Posts: 2307
Location: Canada
Quote:
I think making the reset virtual might be a better idea, with the concept of virtualization now better known.
I like the idea of the DMA hardware just moving a list of data/address pairs at boot: PC at XXXX, SP at XXXX, MMU at XXX.
Will the read-ahead on the return link be a problem when returning to a lower-level ring?
shell() while(1) if(!abort_flag)dostuff();else return;

I do not think the read-ahead will cause problems. Instructions that are not supposed to execute get stomped on, and do not become part of the state of the machine.
The NOPs are more of a temporary fix so things can run in SIM. There should be a timeout occurring on access to invalid RAM which causes an exception instead of causing the machine to hang.

******

Bug fixes
Address lines were connected to the branch-target-buffer as if it were a cache, with just the high-order address lines connected. This caused the BTB not to work. It should have been lower order address lines connected.
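
A short Python illustration of why the high-order address lines fail as a BTB index. The table size, address width, and the 8-byte instruction spacing are assumptions chosen for the example, not values taken from the core.

```python
ADDR_BITS = 32   # assumed address width
INDEX_BITS = 6   # assumed 64-entry BTB
INSN_SHIFT = 3   # assumed 8-byte instruction spacing

def btb_index_high(pc):
    """The buggy wiring: index taken from the top address bits."""
    return (pc >> (ADDR_BITS - INDEX_BITS)) & ((1 << INDEX_BITS) - 1)

def btb_index_low(pc):
    """The fixed wiring: index taken from the low-order address bits."""
    return (pc >> INSN_SHIFT) & ((1 << INDEX_BITS) - 1)

pcs = [0x30, 0x38, 0x40, 0x48]   # nearby branch addresses
print([btb_index_high(pc) for pc in pcs])  # [0, 0, 0, 0]
print([btb_index_low(pc) for pc in pcs])   # [6, 7, 8, 9]
```

With the high-order bits, every branch in the same region of code lands on the same entry and keeps overwriting it; the low-order bits spread neighbouring branches across the table.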

The DRAM load flag was being set but not cleared, causing a perpetual DRAM load state. It went unnoticed for a while as stores override loads.

There is something amiss in the renamer as it causes a lot of stalls. 100 stalls for 1500 instructions.

Changes / Additions

Register tags were being freed-up for rename at the end of execution by a functional unit. This was changed to free the tags at commit time. While I think it would work the other way, it potentially requires more write ports on the RAT, as there may be more functional units. By releasing the registers at the commit stage, ports are only required for the machine width, which is likely less than the number of units in the machine. The drawback is that there may be more outstanding registers, or fewer for allocation, which might cause the machine to stall.

Request cancel logic needed to be added to the data cache. The CPU hung because the data cache request buffers were all full, due to cancelled operations.

Added alternate fetch pathing to the pipeline. Whenever a conditional branch is detected, a pair of PC values is set up, one for the predicted path and one for the other path. Fetches occur along the alternate path when the predicted path leaves holes in its fetch. Then if the predicted path turns out to be incorrect, its instructions are invalidated and instructions from the alternate path are chosen. Most of the instructions for the alternate path are already in the pipeline, so branches take a minimum of cycles to complete. Great theory; it is coded, but it does not work yet. For some reason still to be worked out, it always sticks to the predicted path. The alternate path is not being fetched.

Fibonacci
Not sure if it works. I do not think it did, as register values do not appear to be correct. However, it loops a number of times, then returns. It seems to run for about the right length of time.

Getting there slowly.

_________________
Robert Finch http://www.finitron.ca


Mon Sep 30, 2024 7:59 am WWW

Joined: Sat Feb 02, 2013 9:40 am
Posts: 2307
Location: Canada
Bug fixes
A couple of issues found in the checkpoint valid RAM. The wrong clock signal was being used, causing most of the RAM updates to fail. Several signals needed to be set during reset for SIM to work properly.

Bypassing logic was masking issues with RAT updates. Things mostly worked because registers were being used only for short periods of time and values were coming from the bypass logic.

DRAM access cancel logic worked too well. It was cancelling almost all DRAM requests. The same task was being used to both cancel requests and mark requests invalid once they were complete. It needed a switch to disable the cancel.

Changes
The logic to load the rename fifos with tags on a reset was removed. Instead, the list of registers required to be freed was initialized to include all registers. The non-reset logic will automatically add registers being freed to the fifos. So, there was no need to do it twice. This causes the renamer to stall a bit at startup, but it still starts faster as the reset time is shorter.
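
The start-up behaviour described above might be modeled roughly as follows. This is only a sketch: the single FIFO, the `frees_per_cycle` parameter, and the function name are hypothetical simplifications.

```python
from collections import deque

def start_up(num_regs, frees_per_cycle):
    """Fill the rename FIFO through the normal free path instead of at reset."""
    to_free = deque(range(num_regs))  # at reset: every register is queued to be freed
    rename_fifo = deque()             # starts empty; no reset-time fill loop
    cycles = 0
    while to_free:
        # The ordinary, non-reset logic moves freed registers into the FIFO.
        for _ in range(min(frees_per_cycle, len(to_free))):
            rename_fifo.append(to_free.popleft())
        cycles += 1
    return rename_fifo, cycles

fifo, cycles = start_up(num_regs=16, frees_per_cycle=4)
print(len(fifo), cycles)  # 16 4
```

The renamer can begin allocating as soon as the first tags arrive, which matches the brief start-up stall mentioned above.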

_________________
Robert Finch http://www.finitron.ca


Tue Oct 01, 2024 9:27 am WWW