Author |
Message |
robfinch
Joined: Sat Feb 02, 2013 9:40 am Posts: 2307 Location: Canada
|
Tracked down a bug involving store operations. The data to store is not needed when the store is queued in the load-store buffer; it can be fetched at a later time. The issue was that the physical register number needed to fetch the data got altered by a later instruction reusing the corresponding architectural register. Because a different register number was used, the valid flag being checked did not correspond to the one for the store operation. This caused the CPU to hang waiting for the wrong register to be made valid.
The inc/dec indicator for branches was not used correctly when decoding the target register. It was decoded using a hard-coded constant, and the wrong constant was used. This was changed to use a struct field name.
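A minimal Python sketch of the store bug described above, not the actual Q+ logic. It assumes the cure is to capture the physical register number when the store is queued; the post does not say exactly how it was fixed, and the names `StoreEntry`, `data_reg_buggy` and `data_reg_fixed` are made up for illustration.

```python
# Toy model: architectural -> physical register map (RAT) plus one store-queue entry.
# All names are hypothetical; this only illustrates the failure mode.

class StoreEntry:
    def __init__(self, arch_reg, rat):
        self.arch_reg = arch_reg
        # Capture the physical register at queue time so later renames cannot change it.
        self.phys_reg = rat[arch_reg]

def data_reg_buggy(entry, rat):
    # Looks the register up again later and sees whatever mapping is current by then.
    return rat[entry.arch_reg]

def data_reg_fixed(entry, rat):
    # Uses the mapping that was captured when the store was queued.
    return entry.phys_reg

rat = {5: 40}                   # r5 -> p40 when the store is queued
store = StoreEntry(5, rat)
rat[5] = 41                     # a later instruction reuses r5 and renames it to p41
valid = {40: True, 41: False}   # p40 holds its value, p41 is still pending

buggy_ready = valid[data_reg_buggy(store, rat)]   # waits on p41, the wrong register
fixed_ready = valid[data_reg_fixed(store, rat)]   # p40 is valid, the store can proceed
```

In the buggy lookup the store keeps waiting on a valid flag that belongs to a different instruction, which matches the hang described.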
_________________Robert Finch http://www.finitron.ca
|
Wed Oct 02, 2024 6:50 am |
|
 |
robfinch
Joined: Sat Feb 02, 2013 9:40 am Posts: 2307 Location: Canada
|
Bug fixes
Update of the RAT was fixed by having the input to the checkpoint RAM registered only when the pipeline enable signal is active. Without being qualified in this manner, the input would cause the registers to be mapped to zero.
The branch taken flag was not propagated properly, resulting in the branchmiss signal always being true for branches. This caused a lot of pipeline flushing.
Changes
An icache miss no longer stalls the pipeline. Instead, NOPs are fed into the pipeline so that other instructions can be processed while the cache miss is serviced.
The icache hit logic was modified so that it reports a hit if the fetched instruction group fits within a single cache line. That way only one cache line needs to be ready instead of two, so processing can sometimes proceed when only a single line is available from the icache. Two lines are still required when the group spans a cache line boundary. The change is a PERFORMANCE option.
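The hit rule above can be sketched in Python as follows. The 64-byte line size and the 32-byte group size used in the example are assumptions for illustration only; the post does not give either figure.

```python
LINE_BYTES = 64  # assumed cache line size, not stated in the post

def lines_needed(group_addr, group_bytes, line_bytes=LINE_BYTES):
    """Return the cache line indexes touched by a fetch group."""
    first = group_addr // line_bytes
    last = (group_addr + group_bytes - 1) // line_bytes
    return list(range(first, last + 1))

def icache_hit(group_addr, group_bytes, ready_lines, line_bytes=LINE_BYTES):
    """Hit if every line the group actually touches is ready."""
    return all(line in ready_lines
               for line in lines_needed(group_addr, group_bytes, line_bytes))

# A 32-byte group at 0x100 sits entirely inside line 4: one ready line is enough.
fits_one_line = icache_hit(0x100, 32, {4})
# A 32-byte group at 0x130 spans lines 4 and 5: both must be ready.
spans_missing = icache_hit(0x130, 32, {4})
spans_ready = icache_hit(0x130, 32, {4, 5})
```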
Q+ now has only a single memory addressing mode, scaled indexed with displacement. The displacement is 23 bits, which is close enough to the 32-bit displacement of regind. to make the register indirect with displacement mode redundant. Shifting opcodes around again.
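A rough Python sketch of the address calculation for scaled indexed with displacement. The signed displacement, the power-of-two scale encoding and the 64-bit wrap are all assumptions; only the 23-bit displacement width comes from the post.

```python
MASK64 = (1 << 64) - 1

def sign_extend(value, bits):
    """Interpret the low `bits` bits of value as a two's complement number."""
    sign = 1 << (bits - 1)
    return (value & (sign - 1)) - (value & sign)

def effective_address(base, index, scale_log2, disp, disp_bits=23):
    # The index is scaled; the displacement is added unscaled.
    return (base + (index << scale_log2) + sign_extend(disp, disp_bits)) & MASK64

# Index 3 scaled by 8, plus a displacement of 16.
ea_scaled = effective_address(0x1000, 3, 3, 0x10)
# With a zero index the mode behaves like register indirect with displacement.
ea_regind = effective_address(0x1000, 0, 0, (1 << 23) - 8)   # displacement of -8
```

With the index set to zero this covers the register indirect with displacement case, which is why the separate mode becomes redundant.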
Postfixed immediates were added back into the design. A, B, or C source regs may now be overridden with a 64-bit immediate value.
_________________Robert Finch http://www.finitron.ca
|
Fri Oct 04, 2024 4:27 am |
|
 |
BigEd
Joined: Wed Jan 09, 2013 6:54 pm Posts: 1821
|
Ah, "only a single memory addressing mode, scaled indexed with displacement" - that feels a bit like what we did in OPC land. Simple to implement, and adequately functional.
|
Fri Oct 04, 2024 7:31 am |
|
 |
oldben
Joined: Mon Oct 07, 2019 2:41 am Posts: 768
|
That may be simple today when you have access to a multi-port register file, but at one time only the BIG BOYS had that feature. Does the offset scale as well?
|
Sat Oct 05, 2024 7:17 am |
|
 |
robfinch
Joined: Sat Feb 02, 2013 9:40 am Posts: 2307 Location: Canada
|
Quote: Does the offset scale as well?
No, the offset does not scale. It is 20+ (21 or 23?) bits so it does not really need to be scaled. It is handy sometimes to be able to position things at arbitrary locations.
Quote: That may be simple today when you have access to a multi-port register file, but at one time only the BIG BOYS had that feature.
BIG BOYS are possible now in small footprints like FPGA implementations. I am really designing something big. I pretty much conquered the small so I need to work on the large.
Quote: Ah, "only a single memory addressing mode, scaled indexed with displacement" - that feels a bit like what we did in OPC land. Simple to implement, and adequately functional.
Simplifying the logic a little bit was a goal. It is strange. A wider opcode makes fewer instructions possible.
Bug fixes
The done flag was being reset for single-cycle operations performed by ALU #1, causing them to be performed twice. This was bad as the target register was freed after the first execution, and ended up being reused before the second execution of the operation. Results got mixed up. Missing was a check in the ALU #1 flag setting that the operation was multi-cycle.
Other Issues
Struggling with the reg valid flag tonight. Somehow the reg valid flag is being set when it should not be, or is not being reset, and that causes a whole chain of instructions to use incorrect register values. I re-wrote the register valid flag logic. It is quite a bit larger now, so I had to reduce the number of supported registers to 128 to try and manage the size. It is so large because it is made entirely of FFs. It has some properties that make it almost impossible to implement as a RAM. For instance, all the reg. valid flags of one checkpoint need to be able to be copied to the next checkpoint in a single clock cycle. This means something like a giant shift register with tap points for setting and getting values would be handy. It also requires eight write ports and 24 read ports. It is a lot of logic; fortunately it is just a single bit per register that needs to be controlled. Making it multi-cycle for updates and reads cannot be done as that would adversely affect the performance of the CPU.
There are issues with the renamer and RAT. I managed to fix up the renamer so that it does not stall as often as it used to. At least that is moving in the right direction. However, the CPU appears to be reusing the same physical register too quickly. The same register was both being released from use and reused at the same time. That should not be possible. The only thing I can think of is that the register may be on the free list more than once. A physical register being reused inherits the value of the register when it is allocated. This is okay because it is a target register and about to be overwritten with a new value. But when things go amiss it leads to interesting operation. In the most recent mistake, the stack pointer ends up with the iteration count in it.
Milestone: got an LED to light up in SIM for the blinkin-lights delay routine.
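The suspicion that a register is on the free list more than once is the kind of thing a simulation-side check can confirm. A small Python sketch of such a check follows; the `FreeList` class and `duplicate_free_regs` helper are hypothetical and not part of the Q+ test bench.

```python
from collections import Counter

def duplicate_free_regs(free_list):
    """Return the physical registers that appear more than once on a free list."""
    counts = Counter(free_list)
    return sorted(reg for reg, n in counts.items() if n > 1)

class FreeList:
    """Free list that refuses a double release instead of silently accepting it."""
    def __init__(self, regs):
        self.free = set(regs)

    def release(self, reg):
        if reg in self.free:
            raise ValueError(f"physical register {reg} released twice")
        self.free.add(reg)

    def allocate(self):
        # Hand out the lowest free register; returns None when nothing is free.
        if not self.free:
            return None
        reg = min(self.free)
        self.free.remove(reg)
        return reg

# Register 3 appears twice in this dumped free list, so two instructions could own it.
dupes = duplicate_free_regs([3, 7, 3, 9])
```

Run over a dump of the free list each cycle, a check like this would flag the first cycle in which a register becomes available twice.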
_________________Robert Finch http://www.finitron.ca
|
Sun Oct 06, 2024 6:52 am |
|
 |
robfinch
Joined: Sat Feb 02, 2013 9:40 am Posts: 2307 Location: Canada
|
Writing about the checkpoint valid RAM made me realize I perceived it to be more complex than need be. It does not need to copy the valid bits on a new checkpoint. Therefore, it could be made much smaller.
_________________Robert Finch http://www.finitron.ca
|
Sun Oct 06, 2024 7:47 am |
|
 |
robfinch
Joined: Sat Feb 02, 2013 9:40 am Posts: 2307 Location: Canada
|
Latest Perplexation
A signal is causing updates to the load store queue to occur when they should not. Inside the RAT module the signal is a zero as expected. But on the wire connected to the RAT port outside the module the signal becomes a one, and it is screwing things up. This looks to me like either an error in the simulator or more likely a bit error on the workstation. There is no good reason why a piece of wire would have two different values. So, I am leaving testing until tomorrow to see if it makes a difference.
Almost Milestone: We have blinkin lights in SIM! <not quite>
I let it rip, and tried running the loop for 400 us. After 22 iterations it hung. It is strange. One would think it would reach a steady state after only three or four iterations. Figured out why it was hanging, and now there is a bit error issue causing zero to be output to the LEDs every other iteration. The hang occurred because the renamer ran out of registers to use on slot #3. Apparently, all the registers piled up in the other three slots. So, the renamer was re-written again. This time it places the tags back into the slots they came from, rather than going through a rotator. It seems to work.
I put in a hack so that instructions after a branch do not issue until the branch resolves. This does adversely affect performance, but allows things to work. A loop with five instructions including a store operation takes about 31 clock cycles. Once the predictor kicks in the loop is 11 clock cycles. Instructions executing after a branch were toasting register values. The checkpoint restore is supposed to take care of restoring machine state if the branch does not go the way it was predicted. Something in the restore process must not be working yet.
_________________Robert Finch http://www.finitron.ca
|
Mon Oct 07, 2024 3:15 am |
|
 |
robfinch
Joined: Sat Feb 02, 2013 9:40 am Posts: 2307 Location: Canada
|
The issue with the wire having two different values appears to be gone.
Wrote another renamer, this one based on an FFO (find-first-one) operation on a bitmap. This renamer works better than the FIFO-based or SRL-based ones. It rarely stalls the CPU. It is also smaller than the FIFO-based renamer.
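For illustration, a bitmap renamer built on find-first-one could be modelled in Python like this. Picking the lowest set bit and allocating one register per call are assumptions; the real renamer serves several slots per cycle and its search direction is not stated.

```python
class BitmapRenamer:
    """Free-register bitmap: bit i set means physical register i is free."""
    def __init__(self, num_regs):
        self.num_regs = num_regs
        self.free = (1 << num_regs) - 1

    def allocate(self):
        if self.free == 0:
            return None                      # nothing free: the pipeline would stall
        lowest = self.free & -self.free      # isolate the lowest set bit (find-first-one)
        self.free &= ~lowest                 # mark that register as in use
        return lowest.bit_length() - 1       # convert the one-hot bit to a register number

    def release(self, reg):
        self.free |= 1 << reg                # return the register to the pool

renamer = BitmapRenamer(4)
first = renamer.allocate()
second = renamer.allocate()
renamer.release(first)
reused = renamer.allocate()                  # the released register is found again
```

Unlike a FIFO, a released register is visible to the search immediately, which may be part of why this version stalls less often.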
Wrote a recursive task to invalidate ROB entries for a branch path not taken. It hung the synthesizer. Trying to get the alternate path fetching working.
Nasty issue arose. When a branch is correctly predicted, the instructions in the same group that are not supposed to be executed are stomped on. The issue is that register renaming has already taken place for the registers in the group. Normally this would be undone by restoring a checkpoint, but there is no checkpoint restore since the branch is correctly predicted. The solution is to back out the register mappings using values from the re-order buffer. This required writing a state machine.
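A sketch of backing out register mappings from the re-order buffer, written in Python. It assumes each ROB entry records the architectural register, the newly assigned physical register and the previous mapping; the field names `arch`, `new_phys` and `old_phys` are invented here.

```python
def back_out(rat, rob_group, first_stomped):
    """Undo the renames of stomped instructions, newest first.

    rat:        dict mapping architectural register -> physical register
    rob_group:  ROB entries of one instruction group, in program order
    Returns the physical registers that can go back on the free list.
    """
    freed = []
    for entry in reversed(rob_group[first_stomped:]):
        rat[entry["arch"]] = entry["old_phys"]   # restore the previous mapping
        freed.append(entry["new_phys"])          # the new register was never needed
    return freed

# Three stomped instructions after a correctly predicted taken branch.
rat = {1: 31, 2: 32}                             # state after the group was renamed
group = [
    {"arch": 1, "old_phys": 10, "new_phys": 30},
    {"arch": 1, "old_phys": 30, "new_phys": 31},
    {"arch": 2, "old_phys": 20, "new_phys": 32},
]
freed = back_out(rat, group, first_stomped=0)
```

Walking from the newest entry to the oldest matters when the same architectural register is renamed more than once in the group, as r1 is here.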
With all the changes in the past day or so, things are busted back down to just displaying one LED on, before a crash.
_________________Robert Finch http://www.finitron.ca
|
Wed Oct 09, 2024 8:34 am |
|
 |
robfinch
Joined: Sat Feb 02, 2013 9:40 am Posts: 2307 Location: Canada
|
Worked on improving the branch checkpoints. There is now only a single checkpoint allowed per instruction group, even if there are multiple branches in the group. Once a checkpoint is restored, the ROB is walked backwards within the group to restore the appropriate registers. The checkpoint can be freed only once all instructions in the group have finished.
Ran into an issue with updates to the checkpoint RAM. Updates from a previous write cycle were being overwritten with old values, as it takes a clock cycle for the RAM to update. The RAM is updated in a read-modify-write fashion. A buffer is now used to hold the previously written values. If there are two write cycles in a row, the buffer is referenced instead of the RAM output, which would be out of date by one cycle.
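The read-modify-write hazard can be reproduced with a small cycle-level model in Python. The `CheckpointRam` class is a made-up stand-in with a one-cycle write latency, and matching on the address before forwarding is an assumption.

```python
class CheckpointRam:
    """RAM whose writes become visible one cycle later, with optional forwarding."""
    def __init__(self, depth, forward=True):
        self.mem = [0] * depth
        self.pending = None          # (addr, data) written last cycle, not yet in mem
        self.forward = forward

    def _commit(self):
        if self.pending is not None:
            addr, data = self.pending
            self.mem[addr] = data
            self.pending = None

    def rmw_set_bit(self, addr, bit):
        # Read phase: the RAM output is stale if the previous write has not landed yet.
        if self.forward and self.pending is not None and self.pending[0] == addr:
            old = self.pending[1]    # use the buffered value instead of the RAM output
        else:
            old = self.mem[addr]
        self._commit()               # the previous cycle's write lands now
        self.pending = (addr, old | (1 << bit))

    def settle(self):
        self._commit()
        return list(self.mem)

def run(forward):
    ram = CheckpointRam(depth=4, forward=forward)
    ram.rmw_set_bit(0, 0)            # cycle 1: set bit 0
    ram.rmw_set_bit(0, 1)            # cycle 2: set bit 1, back to back
    return ram.settle()[0]

without_buffer = run(forward=False)  # bit 0 is lost
with_buffer = run(forward=True)      # both bits survive
```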
_________________Robert Finch http://www.finitron.ca
|
Thu Oct 10, 2024 6:52 am |
|
 |
oldben
Joined: Mon Oct 07, 2019 2:41 am Posts: 768
|
Do all the changes still work with a virtual memory page fault?
|
Thu Oct 10, 2024 10:39 pm |
|
 |
robfinch
Joined: Sat Feb 02, 2013 9:40 am Posts: 2307 Location: Canada
|
Quote: Do all the changes still work with a virtual memory page fault?
IDK. I have been testing parts of the CPU with virtual memory disabled to avoid issues with that system. Addresses still go to the MMU, but the virtual address is returned instead of the physical one. Addresses that have a 1:1 mapping of virtual to physical are being used so the software will run. A page fault handler has not been written yet. The TLB / page table is initialized with addresses that will not cause a page fault when the boot program is running.
Bug Fixes
In a classic blunder, port #16 was being used to provide the argument C register number, but the valid status was being read from port #23. It should have been read from the same port. This bug caused the CPU to hang trying to perform a store operation. <- this turned out not to be the cause of the hang. It turns out arg C has its own dedicated port to the register file, and it was set correctly.
An ‘else’ clause needed to be removed from the checkpoint valid RAM update. It was preventing the update of the status to valid when a new checkpoint is created at the same time as a valid status update.
In another blunder, the address generation for register indirect with displacement had not been removed from the address generator. This would cause incorrect addresses to be generated. However, a case of a bad address had not been encountered yet.
Positive edge logic that detects when to use a new checkpoint in the RAT was too sticky. This resulted in cycling through all available checkpoints, causing a stall and CPU hang.
Current Milestone: getting a '1' to appear on the LED in SIM.
_________________Robert Finch http://www.finitron.ca
|
Fri Oct 11, 2024 6:08 am |
|
 |
robfinch
Joined: Sat Feb 02, 2013 9:40 am Posts: 2307 Location: Canada
|
Lots of fiddling with the RAT today. Things are much closer to working but still not perfect. Got back up to ‘3’ output on the LEDs in SIM.
Indexing was left off of bypassing signals in the RAT, leading to bypass logic not working. This resulted in a lack of bypassing of register names, and thus incorrect values appearing. <- this bypassing was later removed.
The renamer was modified to not use a register name unless it has been available for at least three cycles. This is to prevent the names from being reused too quickly. This was mainly to help avoid confusion in the SIM dumps. Seeing the same register used in several different places in the dump makes it hard to determine the flow of values.
The ROB queue pointer is now being reset after instructions are stomped on during a branch. This is to improve performance by not requiring the CPU to skip over stomped instructions.
Some stats from the simulator. The IPC is somewhat misleading as a lot of useless NOP instructions, and instructions that have been stomped on, are included. At least the CPU can skip over them pretty fast.
Code:
----- Stats -----
Clock ticks: 297
Instructions: 624: 610
IPC: 2.101010
Peak: 3.270440
Copy targets: 4
Stomped instructions: 34
Stalls for checkpoints: x
Stalls due to renamer: 0
Stalls due to I-Cache miss: 219
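The reported IPC follows directly from the first two counters. A quick Python check is below; the "useful" figure that subtracts stomped instructions is only an illustration of why the raw number is optimistic, since the NOP count is not listed and cannot be removed here.

```python
def ipc(instructions, clock_ticks):
    """Instructions per clock."""
    return instructions / clock_ticks

ticks = 297
instructions = 624      # first figure on the "Instructions" line
stomped = 34

raw_ipc = ipc(instructions, ticks)
# Illustrative only: removes stomped instructions but not the NOPs.
useful_ipc = ipc(instructions - stomped, ticks)

print(f"raw IPC:            {raw_ipc:.6f}")
print(f"IPC without stomps: {useful_ipc:.6f}")
```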
_________________Robert Finch http://www.finitron.ca
|
Sat Oct 12, 2024 6:59 am |
|
 |
robfinch
Joined: Sat Feb 02, 2013 9:40 am Posts: 2307 Location: Canada
|
Tricks
Learned a new debugging trick. Make the CPU run in a serial fashion to debug the stomp logic. To do this the scheduler was modified to issue instructions only if the previous instruction is done. This slows down the CPU considerably but makes debugging easier. It is a config option.
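The serializing trick amounts to one extra condition in the issue check. A Python sketch follows, with a hypothetical `serialize` flag standing in for the config option and a list of dicts standing in for the ROB.

```python
def can_issue(rob, idx, serialize):
    """Decide whether the instruction at rob[idx] may issue this cycle."""
    entry = rob[idx]
    if entry["issued"] or entry["done"]:
        return False
    # Debug mode: hold every instruction until its predecessor has finished.
    if serialize and idx > 0 and not rob[idx - 1]["done"]:
        return False
    return True

rob = [
    {"issued": True, "done": False},    # still executing
    {"issued": False, "done": False},   # independent and ready to go
]
normal_mode = can_issue(rob, 1, serialize=False)   # issues out of order
serial_mode = can_issue(rob, 1, serialize=True)    # waits for the previous instruction
```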
Bug Fixes
Found out that sometimes the fetched instruction group was replicated. This caused instruction groups to be executed twice. Code was put into place to turn a second copy of the instruction group into NOPs. Not sure exactly why there were two copies showing up. It has something to do with the fetch logic.
Changes
Made the RAT an option. It got too large. Until I can find a way to reduce the size it is disabled. It is approximately 2/5 of the size of the core. Core size is sitting at about 195k LUTs with the RAT, and 110k LUTs without.
Milestone: blinkin lights working in SIM!!!! Now to try it in the FPGA. Then onto Fibonacci.
_________________Robert Finch http://www.finitron.ca
|
Mon Oct 14, 2024 10:41 am |
|
 |
robfinch
Joined: Sat Feb 02, 2013 9:40 am Posts: 2307 Location: Canada
|
In the FPGA only a single LED lit up. Pressing reset caused a different LED to light up. But there were timing issues reported on the timing report, so that may be causing issues.
Fibonacci did not work at all.
Extracted the decode stage from another module and made it its own module. Extracted the rename stage code from the mainline and made it into its own module. Working towards a module per pipeline stage. When initially coded the CPU was mostly in a single file coded as a giant module, which makes things easy to work with in the beginning. But now that things have expanded outwards it is easier to manage with multiple modules.
_________________Robert Finch http://www.finitron.ca
|
Wed Oct 16, 2024 4:06 am |
|
 |
robfinch
Joined: Sat Feb 02, 2013 9:40 am Posts: 2307 Location: Canada
|
Took a break from Qupls and developed the rf6847 VDG, video display generator. There are a few differences from the 6847 as the circuit is designed to output to VGA. The circuit works in the same manner as the 6847 IC with the same modes. It has a small footprint, using five block RAMs and about 200 LUTs. It has an internal display memory as well as a character generator ROM. There is 16kB of display memory and a 4kB character ROM. It uses 800x600 mode scaled down to 400x300. The 6847’s 256x192 screen is centered in this display area with a larger border. Alternately, the full 400x300 area can be used, resulting in a 50x25 text display.
Text mode: 32x16 (legacy) or 50x25
Graphics modes: 64x64, 128x64, 128x96, 128x192, and 256x192 (legacy) OR 100x100, 200x100, 200x150, 200x300, 400x300 (full graphics mode)
Semi-graphics modes also supported.
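The geometry quoted above can be checked with a few lines of Python. The border widths and the character cell size are derived from the stated resolutions; they are not figures given in the post.

```python
def centered_border(outer, inner):
    """Border on each side when `inner` is centered inside `outer` (width, height)."""
    return ((outer[0] - inner[0]) // 2, (outer[1] - inner[1]) // 2)

def cell_size(pixels, chars):
    """Pixel size of one character cell for a given text grid."""
    return (pixels[0] // chars[0], pixels[1] // chars[1])

display = (400, 300)        # 800x600 scaled down by two
legacy = (256, 192)         # the 6847 screen area

border = centered_border(display, legacy)        # side and top/bottom borders
legacy_cell = cell_size(legacy, (32, 16))        # legacy 32x16 text
full_cell = cell_size(display, (50, 25))         # full-area 50x25 text
```

Both text modes work out to the same cell size, so one character generator ROM layout can plausibly serve both.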
_________________Robert Finch http://www.finitron.ca
|
Fri Oct 18, 2024 5:26 am |
|