


 [ 204 posts ]  Go to page Previous  1 ... 6, 7, 8, 9, 10, 11, 12 ... 14  Next
 Qupls (Q+) 
Author Message

Joined: Sat Feb 02, 2013 9:40 am
Posts: 2307
Location: Canada
Bug fixes
On a read-hit the data cache controller was still trying to run a bus cycle. Eventually the CPU hung because the data cache request queue was full.
Stomp logic was off by one. An extra instruction group was being stomped on. This caused a return instruction to be missed.

Changes
Split part of the data cache controller out into its own module to make things more modular. In the process found a couple of signals that were not being used.

Reached a milestone: got the Q+ CPU to execute Fibonacci in SIM. The CPU is working in superscalar mode.
Running blinkin lights after Fibonacci appears to work now.

_________________
Robert Finch http://www.finitron.ca


Thu Nov 07, 2024 4:10 am WWW

Joined: Sat Feb 02, 2013 9:40 am
Posts: 2307
Location: Canada
Decided to try to implement backout and restore. While the core does seem to work without it, it is somewhat faster if backout-restore is included. Fibonacci is almost running with checkpoint backout-restore. It runs about 16 iterations, then starts writing to low memory because a pointer in a register gets corrupted.

Bugs
There is an issue with a register restore. Near as I can tell, logical register 20 should be restored to physical register 143, but it gets restored to register 144. Register 144 has a zero in it. Then that causes the Fibonacci to count 0,1,1,2,3,3,3,6,9,15…
Instead of adding 2+3 it adds 0+3 creating another 3 in the sequence.
I do not know where the 144 is coming from. It has not been explained yet. My best guess is that it is an error in the simulation due to a bad memory bit perhaps.
I put in a couple of debug dumps to figure out where the 144 was coming from, then the issue disappeared.

Bug fixes
Found a spot where an instruction was not being stomped on when it should be. It was the third instruction slot in the rename stage.

Registers were being freed too soon causing them to be reused before it was safe to do so. Registers needed to be freed at the commit / retire stage. The register valid status however needs to be set as soon as the operation is done. Additional inputs to the rename stage were required.
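A minimal sketch of that freeing policy, in Python rather than the core's HDL. The class, its method names, and the register counts are hypothetical, and checkpoints are ignored. It assumes the usual arrangement where the physical register being replaced is the one returned to the free list when the replacing instruction commits, while the valid bit of the new register is set as soon as the operation completes.

```python
from collections import deque


class RenameModel:
    """Toy rename model; names and sizes are illustrative assumptions."""

    def __init__(self, num_logical=4, num_physical=8):
        # Logical register i starts out mapped to physical register i.
        self.rat = list(range(num_logical))
        self.free = deque(range(num_logical, num_physical))
        self.valid = [True] * num_logical + [False] * (num_physical - num_logical)

    def rename(self, logical):
        """Allocate a new physical register and remember the one it replaces."""
        new = self.free.popleft()
        old = self.rat[logical]
        self.rat[logical] = new
        self.valid[new] = False
        return new, old

    def complete(self, new):
        """Result written: dependents may read it as soon as the op is done."""
        self.valid[new] = True

    def commit(self, old):
        """Only at commit / retire is the replaced register safe to reuse."""
        self.free.append(old)


m = RenameModel()
new, old = m.rename(2)
m.complete(new)
free_before_commit = old in m.free   # still held, even though the op is done
m.commit(old)
free_after_commit = old in m.free    # released once the instruction retires
```

Freeing `old` inside `complete()` instead would reproduce the bug: an older, not yet retired reader of `old` could see it reallocated underneath it.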

_________________
Robert Finch http://www.finitron.ca


Sat Nov 09, 2024 12:04 pm WWW

Joined: Sat Feb 02, 2013 9:40 am
Posts: 2307
Location: Canada
Changes
Modified the core to commit up to six instructions at once now, provided the last two instructions are invalid. There could be a lot of invalid instructions in the pipeline due to pipeline flushes for branches. Since the invalid instructions do not update any state, they can easily be skipped over.

Also modified the commit so that it commits only instructions with the same checkpoint index in any given cycle. This is usually the case. The RAT cannot handle updates to multiple different checkpoints at the same time.
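The two commit rules can be sketched as a small selection function. This is a Python model under stated assumptions, not the core's logic: the base commit width of four is assumed, invalid entries are assumed to carry a checkpoint index like any other entry, and the names `RobEntry` and `commit_count` are made up for the example.

```python
from typing import NamedTuple


class RobEntry(NamedTuple):
    valid: bool       # False for instructions flushed by a branch miss
    done: bool        # execution finished
    checkpoint: int   # checkpoint index the instruction belongs to


def commit_count(window, base_width=4, extended_width=6):
    """Return how many entries at the ROB head may commit this cycle.

    window lists the entries at the head of the ROB, oldest first.
    """
    width = base_width
    tail = window[base_width:extended_width]
    # Extend the commit only if the extra slots all hold invalid instructions.
    if len(tail) == extended_width - base_width and not any(e.valid for e in tail):
        width = extended_width
    count = 0
    for entry in window[:width]:
        if entry.valid and not entry.done:
            break  # an unfinished instruction blocks everything younger
        if entry.checkpoint != window[0].checkpoint:
            break  # the RAT cannot update two checkpoints in one cycle
        count += 1
    return count
```

With four finished instructions followed by two flushed ones, all on the same checkpoint, the function returns 6; if the third entry belongs to a different checkpoint it returns 2.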

Fibonacci runs for the selected number of iterations (20), then blinkin lights runs a couple of iterations and crashes. This is with checkpoint backout-restore. Successful run lengths are getting longer.

Checkpoints work based on instruction groups, not individual instructions. If there are two branch instructions in the group, then only a single checkpoint is made. This means it may require fewer checkpoints than a design that checkpoints every branch. The CPU relies on a backout mechanism for other instructions in the same group as a branch.
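A rough illustration of the saving, as a Python sketch. The group contents and the group size of four are invented for the example; only the rule "one checkpoint per group that contains a branch" comes from the description above.

```python
def checkpoints_needed(groups, per_group=True):
    """Count checkpoints for a list of instruction groups.

    groups is a list of groups, each a list of mnemonic strings.
    per_group=True models one checkpoint per group containing a branch;
    per_group=False models one checkpoint per branch instruction.
    """
    total = 0
    for group in groups:
        branches = sum(1 for op in group if op == "branch")
        if per_group:
            total += 1 if branches else 0
        else:
            total += branches
    return total


# Hypothetical instruction stream, four instructions per group.
stream = [
    ["add", "branch", "load", "branch"],
    ["add", "add", "add", "add"],
    ["branch", "add", "add", "add"],
]
by_group = checkpoints_needed(stream, per_group=True)
by_branch = checkpoints_needed(stream, per_group=False)
```

Here the group-based scheme needs two checkpoints where a per-branch scheme would need three.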

_________________
Robert Finch http://www.finitron.ca


Sun Nov 10, 2024 4:29 am WWW

Joined: Sat Feb 02, 2013 9:40 am
Posts: 2307
Location: Canada
Bug Fixes
It has been a while since the compiler was updated. I found a bug today in the calculation of the arg space. It was including the return block size. This caused the stack pointer to be incorrect on a return and deallocate instruction.

The CPU is running better now. It can successfully execute the Fibonacci in serial and superscalar mode. Superscalar mode is about twice as fast as serial mode. It then runs blinkin lights, then crashes trying to run blinkin lights a second time. The crash is due to a load from low memory due to a bad pointer.
The crash is after executing about 2000 instructions.

Changes
The number of branch checkpoints was changed to 32 from 16. It does not require very much more logic as the checkpoint RAM is made from LUTs and will use the same number of LUTs. This change was to get things running further. There is an issue where the CPU runs out of checkpoints after a while. Either extra checkpoints are being allocated, or checkpoints are not being freed. I have not found the spot yet.

Calls and returns were added to the list of instructions using checkpoints. As call / return instructions are only about 1% of instructions this does not affect the checkpoint usage very much. The issue was that instructions in the shadow of a call / return operation were being renamed, and the renames had to be backed out. The easiest way to do that was just to trigger the branch backout mechanism.

Contemplating a debug run mode. If the CPU can be made to run for 10,000 cycles for instance, before a state machine screw-up, it might be possible to save context, reset the CPU, then restore the context and continue. If this was done every 9,000 clock cycles the CPU could be made to run in spurts.

Also approaching the point where it may be useful to have debug registers added to the CPU.

If it could just be made to run well enough to display text messages on-screen....

_________________
Robert Finch http://www.finitron.ca


Mon Nov 11, 2024 3:58 pm WWW

Joined: Mon Oct 07, 2019 2:41 am
Posts: 768
robfinch wrote:
Bug Fixes
If it could just be made to run well enough to display text messages on-screen....

Does the text output work?


Tue Nov 12, 2024 6:34 am

Joined: Sat Feb 02, 2013 9:40 am
Posts: 2307
Location: Canada
Quote:
Does the text output work?

I am sure the text controller works as it can display random characters on-screen (a power on reset feature), but I have not been able to get the CPU to write to it yet. When running in the FPGA the CPU hangs during boot-up. It seems to oscillate back and forth between addresses that write to the LEDs as if it were looping, which it should be doing, but the LEDs are not being affected. Since it takes an hour or so to build the system, I have just been testing things mostly in simulation.

Put together a package encapsulating the AMBA / AXI4 bus signals. The SoC and CPU will probably be modified to use the AXI4 bus. It may help. The FTA bus is very high performance, but I suspect it may not be as reliable.

_________________
Robert Finch http://www.finitron.ca


Tue Nov 12, 2024 8:19 am WWW

Joined: Sat Feb 02, 2013 9:40 am
Posts: 2307
Location: Canada
Running on the FPGA, the CPU appears to be running the blinking LEDs code, then starts running the clear screen routine. The PC is visible in the logic analyzer display. This is as programmed for boot-up. But the LEDs are not flashing and the screen does not clear. So, there appears to be an issue in the I/O path somewhere.

Trying a trick with the LED port on the FPGA board. It is going to be clocked at five times the CPU rate, instead of the CPU rate. The CPU’s request will then look five times as long. This may reveal issues with timing in the CPU bus. The CPU does bus transactions using a single clock cycle. So, if the timing is off then maybe the bus transaction gets missed.

_________________
Robert Finch http://www.finitron.ca


Tue Nov 12, 2024 6:14 pm WWW

Joined: Sat Feb 02, 2013 9:40 am
Posts: 2307
Location: Canada
Changes
Changed the time-multiplexed five write port register file into a non-time multiplexed one with six write ports. This makes the register file significantly larger, but removes the five times clock, making it easier to use a higher clock frequency for the core.

I have been looking at the RISCV-BOOM pipeline diagram comparing it to Q+ and realized I left the issue queues out of my design. Back to the drawing board.
Now I am wondering if things can work at a reasonable pace without issue queues. Since ALUs are small, could the lack of issue queues be made up for by having more ALUs? Many ALU instructions are single-cycle so I think the issue queue does not buy much. If there are five ALUs, for instance, and one instruction takes more clocks, then the other four ALUs can absorb the difference.

_________________
Robert Finch http://www.finitron.ca


Fri Nov 15, 2024 1:10 pm WWW

Joined: Sat Feb 02, 2013 9:40 am
Posts: 2307
Location: Canada
Bug fixes
Found a spot where a checkpoint was not being set in the ROB. This caused it to be missed when it came time to free the checkpoint. Checkpoints are working better now; after the fix, the machine does not stall waiting for checkpoints.
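One way to catch this kind of leak in simulation is to compare the allocated checkpoints against the ones the ROB entries refer to. The helper below is a hypothetical Python sketch of such a check, not something taken from the Q+ test bench.

```python
def leaked_checkpoints(allocated, rob_checkpoints):
    """Return allocated checkpoint indexes that no ROB entry refers to.

    allocated: checkpoint indexes currently marked in use.
    rob_checkpoints: the checkpoint index recorded in each ROB entry.
    """
    return sorted(set(allocated) - set(rob_checkpoints))


# Checkpoint 2 is in use but never recorded in the ROB, so nothing will free it.
leaks = leaked_checkpoints([0, 1, 2, 3], [0, 0, 1, 3, 3])
```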

Changes
Timing at 20 MHz was missed in two places. One was in the memory controller, which operates at 200 MHz, so it was not really the 20 MHz timing that was missed. Timing was missed by 20 ps for atomic memory operations. I simply commented out the atomic memory operations for now. The other spot where timing was missed was in the branch-target-buffer when processing branch misses. To fix this some of the branch miss logic was registered in a one-hot fashion.

Registering the branch miss logic helped, but it could not meet the next level, 33 MHz timing. The branch target buffer was being read on the negative edge of the clock to simulate an asynchronous RAM. Clocking on the negative edge provided only ½ of a clock cycle to fetch data. This was switched to the positive edge of the clock by carefully adjusting the read address input to be appropriate for the positive clock edge. This should double the amount of time the BTB has for reads. So, the goal is to meet 40 MHz timing now.

The scheduler is highest on the list of things not meeting 40 MHz timing, indicating it will run at only 36 MHz. So, a small fix to the logic may have helped reduce the number of logic levels.

_________________
Robert Finch http://www.finitron.ca


Sat Nov 16, 2024 6:40 am WWW

Joined: Sat Feb 02, 2013 9:40 am
Posts: 2307
Location: Canada
Milestones
Timing is met for 40 MHz operation.

Changes
Still trying to reduce the number of logic levels in the scheduler. Timing is just 1 ns short of 50 MHz or a top speed of about 48 MHz.
The first thing done to improve timing was to move some of the comparisons on the re-order buffer back into the mainline where they are then registered and stored in the ROB. This reduced the number of logic levels to about 59. The next thing done was to remove the indirection on the ROB pointers and replace it with direct indexes. The issue with doing this is that now the entire ROB must be searched by the scheduler instead of just a smaller window. References like heads[hd] got replaced with just hd. The difference is it removes a level of muxing on the index.
With the scheduler timing “fixed”, now there are about 1200 other signals not meeting 50 MHz timing by 736 ps. It looks like the max frequency of the design is about 48 MHz without a tremendous amount of re-work.
So, it is going to run at a convenient 40 MHz, the video dot clock rate.
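The 48 MHz estimate follows from adding the timing shortfall to the target clock period. A small Python check of that arithmetic (the function name is just for illustration):

```python
def fmax_mhz(target_mhz, shortfall_ns):
    """Estimate the achievable clock when timing misses the target by shortfall_ns."""
    period_ns = 1000.0 / target_mhz          # 50 MHz -> 20 ns
    return 1000.0 / (period_ns + shortfall_ns)


scheduler_fmax = fmax_mhz(50, 1.0)      # scheduler path, 1 ns short
other_fmax = fmax_mhz(50, 0.736)        # remaining paths, 736 ps short
```

A 1 ns miss gives roughly 47.6 MHz and a 736 ps miss roughly 48.2 MHz, both in line with "about 48 MHz".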

Peak performance should be about 160 MIPS, unless one counts dual-operand instructions as two instructions, in which case peak would be 320 MIPS. More realistically the goal is 40 MIPS sustained. I suspect when performance is measured it will not be that good.
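The peak figures are consistent with four instructions per clock at 40 MHz. The width of four is inferred here from 160 / 40 and is not stated in the post, so treat it as an assumption in this Python sketch.

```python
def peak_mips(clock_mhz, instructions_per_clock):
    """Peak rate in millions of instructions per second."""
    return clock_mhz * instructions_per_clock


# ASSUMPTION: a width of four is inferred from 160 MIPS at 40 MHz.
single_counted = peak_mips(40, 4)
double_counted = peak_mips(40, 4) * 2   # counting each instruction as two
```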

Other work. After about 4000 clocks the core runs out of available physical registers and hangs. This is something I thought had been previously fixed. Apparently physical registers are still not being freed properly.

_________________
Robert Finch http://www.finitron.ca


Sun Nov 17, 2024 6:22 am WWW

Joined: Mon Oct 07, 2019 2:41 am
Posts: 768
robfinch wrote:
Milestones
Timing is met for 40 MHz operation.

Changes

With the scheduler timing “fixed”, now there are about 1200 other signals not meeting 50 MHz timing by 736 ps. It looks like the max frequency of the design is about 48 MHz without a tremendous amount of re-work.
So, it is going to run at a convenient 40 MHz, the video dot clock rate.

Peak performance should be about 160 MIPS, unless one counts dual-operand instructions as two instructions, in which case peak would be 320 MIPS. More realistically the goal is 40 MIPS sustained. I suspect when performance is measured it will not be that good.



What video formats had you planned?


Sun Nov 17, 2024 7:24 am

Joined: Sat Feb 02, 2013 9:40 am
Posts: 2307
Location: Canada
Quote:
What video formats had you planned?

I plan on using 800x600x16bpp. I tried to get 1366x768 video working but could not. The TV/monitor kept saying "unsupported video mode" even though I am pretty sure that is its native mode.
The frame buffer and text controller can divide down the pixels into lower resolutions, for example 400x300. Color depth is also programmable: 8, 16, 24, or 32 bpp. 24 bpp is a little weird as it does not work out evenly. It uses as many 24-bit pixels as will fit into a 256-bit strip, 10 pixels per 256 bits in other words.
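The strip packing works out as follows in a short Python sketch. The 256-bit strip width and the pixel depths come from the description above; the function names and the choice to place pixel 0 at bit offset 0 are assumptions for illustration.

```python
STRIP_BITS = 256


def pixels_per_strip(bpp):
    """Whole pixels that fit in one 256-bit strip."""
    return STRIP_BITS // bpp


def pixel_location(index, bpp):
    """Return (strip number, bit offset within the strip) for a pixel index.

    ASSUMPTION: pixels are packed from bit 0 upward within each strip.
    """
    per_strip = pixels_per_strip(bpp)
    return index // per_strip, (index % per_strip) * bpp


unused_bits_24bpp = STRIP_BITS - pixels_per_strip(24) * 24
```

At 24 bpp each strip holds 10 pixels and leaves 16 bits unused, so pixel 10 starts a new strip, while 8, 16, and 32 bpp divide the strip evenly.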

I trimmed a clock generator from the system. It used to have a separate clock generator for video which allowed me to experiment with the video mode. Now the video, CPU, and DRAM all use the same clock generator. The DRAM controller also has its own internal clock generator.

_________________
Robert Finch http://www.finitron.ca


Sun Nov 17, 2024 2:21 pm WWW

Joined: Sat Feb 02, 2013 9:40 am
Posts: 2307
Location: Canada
Changes / Additions
1) Added a simple issue queue for floating-point instructions.

I “forgot” to put issue queues in my design, realized it when reading up on the RISCV-BOOM. Then got to wondering why Q+ worked not too badly anyway. I reasoned that because most instructions were single cycle, the issue queue would not help much. Then, I decided to try reading up on issue queues.
I have read that issue queues for branch and agen improve performance only slightly, as I expected.
So, I decided to implement issue queues for FPU operations only, where a lot of ops are multi-cycle. Since Q+ can perform simple integer instructions on the FPUs there are effectively more integer function units available. If an ALU is stalled on a multi-cycle op, the instruction can very likely be issued to another ALU or FPU. Since there are queues on the FPU some integer operations can be queued as well.

I gleaned a trick from studying the Itanium, which is that functional units do not have to be limited to executing instructions generally regarded as for a specific unit. IIRC some compares and integer ops could be performed by the MEM unit in the Itanium.

So, in Q+ there is a mix of operations that can be performed on both an ALU and an FPU. Since the register file is common between ALU and FPU operations it is easy to perform either type of operation on the other functional unit.

2) Added a return address stack predictor and more decode logic in the MUX stage to detect calls and returns. The RAS was added to the branch-target-buffer code since a return takes place in the same stage. The RAS stack is 32 entries deep as that is a convenient size for the FPGA.
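For readers unfamiliar with the idea, a return address stack can be modeled as a small circular buffer: a call pushes its return address, and a return pops the prediction. The Python sketch below uses the 32-entry depth mentioned above; the class name and the overflow behavior (wrap around and overwrite the oldest entry) are assumptions, not a description of the Q+ logic.

```python
class ReturnAddressStack:
    """Toy return address stack predictor with wrap-around on overflow."""

    def __init__(self, depth=32):
        self.entries = [0] * depth
        self.depth = depth
        self.top = 0      # index of the next free slot
        self.count = 0    # number of valid entries, capped at depth

    def push(self, return_address):
        """Called when a call instruction is detected."""
        self.entries[self.top] = return_address
        self.top = (self.top + 1) % self.depth
        self.count = min(self.count + 1, self.depth)

    def pop(self):
        """Called when a return is detected; None means no prediction."""
        if self.count == 0:
            return None
        self.top = (self.top - 1) % self.depth
        self.count -= 1
        return self.entries[self.top]


ras = ReturnAddressStack()
ras.push(0x1000)             # outer call
ras.push(0x2000)             # nested call
inner = ras.pop()            # predicted target of the inner return
outer = ras.pop()            # predicted target of the outer return
empty = ras.pop()            # stack exhausted
```

Calls nested deeper than 32 levels lose the oldest entries in this model, so the outermost returns would simply go unpredicted.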

_________________
Robert Finch http://www.finitron.ca


Fri Nov 22, 2024 4:23 am WWW

Joined: Sat Feb 02, 2013 9:40 am
Posts: 2307
Location: Canada
Q+ has been sitting on the shelf for just about 2 months now. I have been busy watching TV and playing with the trainset. I am not sure when I will get back to Q+.

I have been thinking about playing with AI, seeing as it is so much in the news, and about developing a CPU for AI work. Q+ has an ANN layer and instructions for support, but the CPU is big and complex. Spending more resources on ANN and less on the CPU may be better for AI.

_________________
Robert Finch http://www.finitron.ca


Wed Jan 22, 2025 5:08 am WWW

Joined: Sat Feb 02, 2013 9:40 am
Posts: 2307
Location: Canada
Got rid of the PUSH/POP all instructions. It may have worked okay on a machine with 32/64 registers but this pair of instructions is unlikely to be used on a machine with 128/256 registers. It just does not make sense to push/pop that many registers for an ISR. The regular PUSH / POP register instructions can push or pop up to six registers with a single instruction. Six regs is probably enough for many ISRs. Multiple instructions can be used to handle more regs.
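To give a feel for the numbers, this Python sketch splits a register list into groups of up to six, one group per PUSH instruction. The six-register limit is from the post; the function name and register numbering are made up for the example.

```python
def push_sequence(registers, max_per_instruction=6):
    """Split a register list into one group per PUSH instruction."""
    return [
        registers[i:i + max_per_instruction]
        for i in range(0, len(registers), max_per_instruction)
    ]


isr_pushes = push_sequence(list(range(1, 15)))      # an ISR saving 14 registers
all_regs_pushes = push_sequence(list(range(128)))   # saving all 128 registers
```

Saving 14 registers takes three PUSH instructions, while saving all 128 would take 22, which illustrates why pushing everything is rarely worthwhile for an ISR.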

I have been working with 128 regs, even though up to 256 regs are supported by the ISA. Trying to come up with a way to get rid of the micro-code.

I spent some time sketching out another smaller ISA, but it exploded into something resembling Q+, so I decided to revert to working on Q+.

_________________
Robert Finch http://www.finitron.ca


Sat Feb 01, 2025 3:40 am WWW