Author |
Message |
oldben
Joined: Mon Oct 07, 2019 2:41 am Posts: 768
|
Synthesis reports 73,000 LUTs for Stark. This is considerably smaller than Qupls, which I think was over 100,000 LUTs for the same configuration: 2 ALUs, 1 FPU, 1 MEM, and 1 branch unit. Going to try using up some of the difference for a larger ROB.
I just finished my CPU. Let me count the LUTs; it might take a while if I need to use my toes. 303 LUTs for a simple 18-bit CPU split over 3 CPLDs. 1985 tech vs 2025 tech. Not sure when 128-macrocell CPLDs came out. I sent off the ALU PCBs with all the changes made since December 2024.
This sure shows a real big change in tech over the years. Ben.
|
Wed Apr 23, 2025 3:20 am |
|
 |
robfinch
Joined: Sat Feb 02, 2013 9:40 am Posts: 2307 Location: Canada
|
Now up to 92,000 LUTs with a 32-entry re-order buffer. It took seven hours to synthesize. I just cannot design anything with less than 1000 LUTs anymore. The logic puzzle is not captivating enough. Moved rename logic out to an asynch process operating on the ROB.
Quote: 303 LUT's for a simple 18 bit cpu
303 LUTs for a CPU is amazing. I think the 6502 is somewhere around 600 LUTs. It is amazing the number of transistors a modern CPU may use, and how much can be done with just a few transistors.
_________________Robert Finch http://www.finitron.ca
|
Thu Apr 24, 2025 3:45 am |
|
 |
robfinch
Joined: Sat Feb 02, 2013 9:40 am Posts: 2307 Location: Canada
|
Migrating the machine to a micro-op based design. There are just too many ports per ISA instruction to handle directly. So, the solution is to break up the ISA instructions into micro-ops. A simple micro-op decoder was made. It decodes each ISA instruction into one to eight micro-ops at the decode stage. With four instructions processed at decode, up to 32 micro-ops could be produced. These are buffered in a shift register. Decode then consumes four micro-ops per clock from the head of the shift register. When all the micro-ops are used up, a new set is fetched and decoded. Many instructions only need a single micro-op, so in many cases the machine is processing four ISA instructions at a time. However, a complex instruction may take more than four micro-ops to process.
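The buffering scheme described above can be modelled in a few lines of Python. This is only a behavioural sketch, not the Stark RTL: the instruction names, the `expand` helper, and the idea that an instruction is described by a `(name, uop_count)` pair are all assumptions made for illustration. It also assumes the refill and the first consume happen in the same modelled cycle.

```python
from collections import deque

DECODE_WIDTH = 4       # ISA instructions decoded per group (from the post)
MAX_UOPS_PER_INSN = 8  # upper bound on micro-ops per instruction (from the post)
CONSUME_WIDTH = 4      # micro-ops taken from the head of the buffer per cycle

def expand(insn):
    """Hypothetical micro-op decoder: insn is a (name, uop_count) pair."""
    name, count = insn
    if not 1 <= count <= MAX_UOPS_PER_INSN:
        raise ValueError("an instruction expands to 1..8 micro-ops")
    return [f"{name}.{i}" for i in range(count)]

def run(program):
    """Return the micro-ops consumed in each cycle, as a list of lists."""
    buffer = deque()   # stands in for the shift register (at most 4 * 8 = 32 entries)
    cycles = []
    pc = 0
    while pc < len(program) or buffer:
        if not buffer:
            # Buffer drained: fetch and decode the next group of four instructions.
            group = program[pc:pc + DECODE_WIDTH]
            pc += len(group)
            for insn in group:
                buffer.extend(expand(insn))
        take = min(CONSUME_WIDTH, len(buffer))
        cycles.append([buffer.popleft() for _ in range(take)])
    return cycles
```

With four single-micro-op instructions the group drains in one cycle. If one of the four needs six micro-ops (a made-up count), the same group takes three cycles: four, four, then one micro-op.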
_________________Robert Finch http://www.finitron.ca
|
Fri Apr 25, 2025 7:45 am |
|
 |
robfinch
Joined: Sat Feb 02, 2013 9:40 am Posts: 2307 Location: Canada
|
Put in a small fix to suppress stomp logic when the destination of a branch is found in the reorder buffer.
I think I have got the single-step mode logic restored.
Another hoop to jump through: making modifiers work when in single-stepping mode.
More work on micro-ops, and back-tracking on all the ALU result ports.
Interrupts that occur in the middle of an instruction's micro-op stream are now deferred to the start of the next instruction.
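A tiny model of the interrupt deferral, for anyone following along. It assumes each micro-op carries a flag marking it as the last micro-op of its instruction; that flag and the function name are hypothetical, not taken from the Stark source.

```python
def accept_index(last_flags, irq_index):
    """Return the micro-op index at which a pending interrupt is accepted.

    last_flags[i] is True when micro-op i is the last one of its instruction
    (a hypothetical marker). The interrupt is asserted at irq_index and is
    only accepted at a micro-op that starts an instruction.
    """
    for i in range(irq_index, len(last_flags)):
        starts_instruction = (i == 0) or last_flags[i - 1]
        if starts_instruction:
            return i
    # Stream ended on an instruction boundary: accept before the next fetch.
    return len(last_flags)
```

For a stream of three instructions with 3, 1, and 2 micro-ops, the flags are `[False, False, True, True, False, True]`. An interrupt asserted at micro-op 1 is held until index 3, the first micro-op of the second instruction.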
_________________Robert Finch http://www.finitron.ca
|
Sat Apr 26, 2025 4:17 am |
|
 |
robfinch
Joined: Sat Feb 02, 2013 9:40 am Posts: 2307 Location: Canada
|
A busy day. A lot of minor changes to improve synthesis; the result is a larger core, but hopefully closer to working. Also some more major changes.
* Got rid of processing for the fifth instruction in the in-order pipeline stages. Processing is now limited to four instructions. The fifth instruction was for postfixes, which have been removed from the design.
* Moved the backout flag out to a separate module.
* Moved the restore flag generation out to a separate module.
* Moved copy destination flags logic out to a separate module.
* Created a module for register validation in the reservation stations. This is instead of using a task. Synthesis warned about the assignments using a task. I was not sure if it would work or not, so I made sure by creating a module instead.
* Moved inline code for the dram done signal to a separate module. There were two copies of the code in the mainline, one for each dram port. There is now only a single copy to maintain.
Found an alternate way to implement sync and flow control dependencies. If there is a sync instruction, the instructions following it should not issue. Similarly, if there is a flow control op, the memory store instructions following it should not issue. This was previously implemented by searching the ROB for preceding sync or flow control instructions. It is now done by recording the ROB entry of a sync or flow control at enqueue time. At commit time, when the sync or flow control commits, dependent instructions are cleared of the dependency. I am not sure it is any better. The idea was to try and reduce the amount of logic.
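Roughly, the enqueue-time recording could look like the Python model below. It is a sketch under assumptions, not the actual logic: the class and field names are invented, the ROB is a growing list rather than a circular buffer, and only the most recent uncommitted sync and flow control op are tracked.

```python
from dataclasses import dataclass
from typing import List, Optional

@dataclass
class RobEntry:
    kind: str                        # "sync", "flow", "store", or "other"
    wait_sync: Optional[int] = None  # ROB index of the sync this entry waits on
    wait_flow: Optional[int] = None  # ROB index of the flow op a store waits on

class Rob:
    def __init__(self) -> None:
        self.entries: List[RobEntry] = []
        self.last_sync: Optional[int] = None  # newest uncommitted sync
        self.last_flow: Optional[int] = None  # newest uncommitted flow control op

    def enqueue(self, kind: str) -> int:
        # Record the dependency once, at enqueue, instead of searching the ROB later.
        entry = RobEntry(kind, wait_sync=self.last_sync)
        if kind == "store":
            entry.wait_flow = self.last_flow
        self.entries.append(entry)
        idx = len(self.entries) - 1
        if kind == "sync":
            self.last_sync = idx
        elif kind == "flow":
            self.last_flow = idx
        return idx

    def can_issue(self, idx: int) -> bool:
        entry = self.entries[idx]
        return entry.wait_sync is None and entry.wait_flow is None

    def commit(self, idx: int) -> None:
        # When the sync or flow op commits, clear the dependency in its dependents.
        if self.last_sync == idx:
            self.last_sync = None
        if self.last_flow == idx:
            self.last_flow = None
        for entry in self.entries:
            if entry.wait_sync == idx:
                entry.wait_sync = None
            if entry.wait_flow == idx:
                entry.wait_flow = None
```

The clear at commit is still a pass over the entries, but it is a simple index compare per entry rather than a search backwards for preceding instructions, which is the kind of logic reduction being aimed for.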
_________________Robert Finch http://www.finitron.ca
|
Sun Apr 27, 2025 4:15 am |
|
 |
oldben
Joined: Mon Oct 07, 2019 2:41 am Posts: 768
|
Do you have any error correcting on memory?
|
Sun Apr 27, 2025 4:25 pm |
|
 |
robfinch
Joined: Sat Feb 02, 2013 9:40 am Posts: 2307 Location: Canada
|
Quote: Do you have any error correcting on memory?
Nope.
Broke the ALU / FPU up into more components with different latencies to get better performance. Most of the components could handle a new instruction every clock cycle, but were limited by a ‘done’ signal for longer latency components. For instance, integer multiply takes three clocks, but can start a new multiply every clock cycle. The previous configuration stalled the integer ALU for three clocks while the multiply completed. It had to because it was in the same pipeline as other integer operations. Now there is no stall.
Made up a nice PowerPoint slide set for the in-order pipeline. Makes it easier to see what I am doing. Needs a lot of changes in Stark.sv.
Attachment: StarkCPU_execute_stage.png
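To put rough numbers on the stall removal, here is a back-of-envelope cycle count in Python. It is an idealized model with assumptions of my own: one operation issues per clock, operations are independent, and there are no result-bus conflicts. The function names are made up.

```python
def cycles_blocking(latencies):
    """Shared pipeline gated by a 'done' signal: each op waits for the previous one."""
    return sum(latencies)

def cycles_pipelined(latencies):
    """Separate pipelines: op i issues at clock i and finishes latency clocks later."""
    return max((issue + lat for issue, lat in enumerate(latencies)), default=0)
```

For a three-clock multiply followed by three single-clock operations, `cycles_blocking([3, 1, 1, 1])` gives 6 clocks and `cycles_pipelined([3, 1, 1, 1])` gives 4. For four back-to-back multiplies the counts are 12 and 6.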
_________________Robert Finch http://www.finitron.ca
|
Tue Apr 29, 2025 2:53 am |
|
 |
robfinch
Joined: Sat Feb 02, 2013 9:40 am Posts: 2307 Location: Canada
|
Major re-write day. Got rid of the scheduler component and replaced it with an instruction dispatcher and scheduling in the reservation stations. The result is a little larger, but should perform better.
Also made the reservation station generic so the same station can be instantiated for different functional units. Added a three-entry queue to the station. This makes the station quite a bit larger, so it is an option. Each station can request a register file read for up to four registers per clock cycle. With 11 stations this is 44 reads. I tried building the read port selector as a 44:16 mux and it was quite large. So, I changed it to a 64:16 mux and the result was 33% smaller. I guess the input count not being a power of two made it harder for the synthesizer to optimize. It is a 64:16 mux now with 20 slots unused.
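The mux sizing arithmetic, spelled out in Python. The helper names are invented; the only inputs taken from the post are 11 stations and four reads per station.

```python
def next_pow2(n):
    """Smallest power of two that is >= n, for n >= 1."""
    return 1 << (n - 1).bit_length()

def mux_inputs(stations, reads_per_station):
    """Return (requested, padded, unused, select_bits) for the read port selector."""
    requested = stations * reads_per_station
    padded = next_pow2(requested)
    select_bits = padded.bit_length() - 1
    return requested, padded, padded - requested, select_bits
```

`mux_inputs(11, 4)` returns `(44, 64, 20, 6)`: 44 requested inputs, padded to 64, with 20 unused slots and a 6-bit select per output port.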
A big addition was the reservation_station_entry_t structure. Reservation entries are now passed around instead of individual signals. It makes the code a little cleaner.
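As an illustration of passing one entry around instead of loose signals, here is a Python stand-in for a generic station with the optional three-entry queue. Every field name is hypothetical, since the real reservation_station_entry_t layout is not shown in this thread, and the oldest-ready-first issue policy is an assumption as well.

```python
from collections import deque
from dataclasses import dataclass, field
from typing import List, Optional

QUEUE_DEPTH = 3  # three-entry queue per station (from the post)

@dataclass
class ReservationStationEntry:
    # Hypothetical fields; not the actual reservation_station_entry_t members.
    op: str
    dest: int
    src_valid: List[bool] = field(default_factory=lambda: [False] * 4)

    def ready(self) -> bool:
        # An entry may issue once all of its (up to four) source operands are valid.
        return all(self.src_valid)

class ReservationStation:
    """Generic station: one class instantiated per functional unit."""

    def __init__(self, unit: str) -> None:
        self.unit = unit
        self.queue = deque()

    def dispatch(self, entry: ReservationStationEntry) -> bool:
        if len(self.queue) >= QUEUE_DEPTH:
            return False  # station full, dispatcher must hold the instruction
        self.queue.append(entry)
        return True

    def issue(self) -> Optional[ReservationStationEntry]:
        # Assumed policy: issue the oldest entry whose operands are all valid.
        for entry in self.queue:
            if entry.ready():
                self.queue.remove(entry)
                return entry
        return None
```

The same class serves an ALU, FPU, or memory unit by changing only the `unit` label, which mirrors the intent of making the station generic.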
The organization of the CPU is now such that there are parallel pipelines for execution units, which may have different latencies. Some of the stalls were eliminated.
The core was around 110,000 LUTs but recent changes likely made it significantly larger. I am guessing 150,000 LUTs. The core is synthesizing ATM (it takes about 3 hours).
_________________Robert Finch http://www.finitron.ca
|
Wed Apr 30, 2025 4:28 am |
|
 |
robfinch
Joined: Sat Feb 02, 2013 9:40 am Posts: 2307 Location: Canada
|
Spent time fixing hundreds of minor bugs in preparation for simulation. Also got working on micro-ops. Here is a diagram of how they fit into the pipeline.
Attachment: StarkCPU_decode_stageA.png
Attachment: StarkCPU_decode_stageB.png
_________________Robert Finch http://www.finitron.ca
|
Thu May 01, 2025 5:24 am |
|