Author |
Message |
robfinch
Joined: Sat Feb 02, 2013 9:40 am Posts: 2307 Location: Canada
|
Changes: Restructured the condition registers to make better use of them. Each CR now contains four separate bytes, one for each operating mode. Previously all operating modes shared the lower eight bits. With the new structure the condition registers do not need to be saved and restored when the operating mode changes. As with the stack pointers, each mode has its own copy. Not pushing and popping the CRs on a mode change trims four memory ops off the time for an environment call. The lightweight call now saves only seven registers.
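To make the per-mode layout concrete, here is a rough Python model of a 32-bit CR holding one byte per operating mode. It is only a sketch: the byte ordering (mode 0 in bits 7:0 up to mode 3 in bits 31:24) and the names `read_cr` and `write_cr` are assumptions for illustration, not taken from the actual design.

```python
# Sketch only: byte ordering and function names are hypothetical.
CR_WIDTH_MASK = 0xFFFFFFFF  # the CRs live in a 32-bit register file
NUM_MODES = 4               # one condition byte per operating mode


def read_cr(cr: int, mode: int) -> int:
    """Return the eight condition bits belonging to one operating mode."""
    if not 0 <= mode < NUM_MODES:
        raise ValueError("mode must be in the range 0..3")
    return (cr >> (8 * mode)) & 0xFF


def write_cr(cr: int, mode: int, flags: int) -> int:
    """Update only the byte for `mode`, leaving the other three untouched."""
    if not 0 <= mode < NUM_MODES:
        raise ValueError("mode must be in the range 0..3")
    shift = 8 * mode
    byte_mask = 0xFF << shift
    return ((cr & ~byte_mask) | ((flags & 0xFF) << shift)) & CR_WIDTH_MASK
```

Because a write touches only the byte for the current mode, switching modes leaves the other modes' flags in place, which is why no save and restore is needed.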
Changed the BLI (branch-and-link interrupt) instruction to JALI (jump-and-link interrupt), which uses absolute rather than relative addressing. A full 32-bit address may be specified, along with the destination operating mode and the software stack required. All this information took two instruction words; the second word was a NOP instruction containing what would not fit into the first. Then I got rid of the instruction altogether and reused the BLRL instruction instead.
In the assembler, disabled the constant-sharing code. It turns out that constants that should not be shared were being shared. For instance, if the value of a symbol is unknown at assembly time, a relocation record is output for the value. Unfortunately, the values set in the assembled code may match even though the relocated values are not supposed to. It may be possible to fix this in a future version of the assembler. Removing constant sharing increased the size of the text by about 2%.
Added the program base and limit back into the design, this time as part of the MMU rather than as CSRs. The base defaults to zero and the limit defaults to the maximum, so there is no effect unless they are changed. By setting the program base and limit and using based addresses, it is possible to use only a single page table. Otherwise a separate page table is required for every process.
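As a rough illustration of how a base and limit might be applied before the page table lookup, here is a small Python sketch. The details are assumptions: the post does not say whether the limit is compared against the based address or the program-relative one, nor how wide the address is. The names `based_address` and `BoundsFault` are hypothetical.

```python
# Sketch only: a 32-bit address and a limit checked against the based
# address are assumptions for illustration.
ADDRESS_BITS = 32
MAX_ADDRESS = (1 << ADDRESS_BITS) - 1


class BoundsFault(Exception):
    """Raised when a based address falls outside the program's limit."""


def based_address(program_addr: int, base: int = 0,
                  limit: int = MAX_ADDRESS) -> int:
    """Add the program base, then check the result against the limit."""
    address = base + program_addr
    if address < base or address > limit:
        raise BoundsFault(hex(address))
    return address
```

With the defaults (base zero, limit at the maximum) every address passes through unchanged. Giving each process its own base and limit places them in disjoint regions of one address space, so a single page table can serve all of them.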
_________________Robert Finch http://www.finitron.ca
|
Thu Apr 10, 2025 6:38 am |
|
 |
BigEd
Joined: Wed Jan 09, 2013 6:54 pm Posts: 1821
|
I like the idea of keeping multiple contexts on-chip for different modes - it's what ARM did, in a limited way, with their limited silicon. If on-chip resources are cheap, it's well worthwhile using them to decrease external memory accesses.
|
Thu Apr 10, 2025 7:25 am |
|
 |
robfinch
Joined: Sat Feb 02, 2013 9:40 am Posts: 2307 Location: Canada
|
It turned out that the condition registers are 32 bits wide anyway because they are implemented in the same register file as the GPRs. They needed to be subject to renaming, and I was not going to implement a separate renamer for them. It was a bit of a waste not using the full register, so I got to thinking about how the other 24 bits could be used. The one drawback is that information may be leaked from other modes, since the entire CR is visible: it could be pushed to the stack and then loaded into a different register for viewing. Not sure how big a security risk that is. There are no pointers or anything available. Not sure what good it would do malware to know that the carry flag is set or clear in a different mode. There may need to be byte write enables on the register file to prevent updates of the wrong mode's byte.
I would have liked to have separate register files for each mode, but it is just too many registers: 96x4=384 regs, which would need at least 1024 regs for register renaming. It turns out a second set of regs could be supported “for free” as there are 512 regs available in a single BRAM. However, it would double the size of the renamer to support it.
Additions / Changes: Added the program base and limit back into the design as part of the MMU. The base defaults to zero and the limit defaults to the maximum, so there is no effect unless they are changed. By setting the program base and limit and using based addresses, it is possible to use only a single page table. Otherwise a separate page table is required for every process. The BRAM used to implement the registers can support 32 sets of registers at the same cost. The base/limit registers and associated logic added only about 1% to the size of the MMU. They may not be that useful, but they are also inexpensive.
Bugs: A bug was found in the MMU. The index of the miss queue entry corresponding to the bus transaction taking place was not being set. This would cause the wrong miss address and ASID to be selected from the miss queue, which in turn would cause the wrong entry to be updated in the TLB. It would likely work only for the first miss.
_________________Robert Finch http://www.finitron.ca
|
Fri Apr 11, 2025 2:11 am |
|
 |
robfinch
Joined: Sat Feb 02, 2013 9:40 am Posts: 2307 Location: Canada
|
Messed up on my bit counting. Conditional branches were encoded with 33 bits. So, I trimmed a bit from the condition field. There were four bits reserved for this, but only three are needed.
Updated the MMU. Added an attributes checking module, but the MMU updates are not quite complete. There are a couple of signals (write enable and operating mode) that need to be propagated through to the attributes checker. Rebuilt the MMU with dummy signals and it turned out to decrease in size, but the number of block RAMs in use increased.
Working on the CPU proper. Instructions can generate up to three results. We want six functional units to be able to operate at a time, so in the StarkCPU there are potentially 18 results generated in a single clock cycle. Typically, however, there are probably only four or five. So, I am using a register file with six write ports, and queues for the 18 results. A selector circuit chooses six of the 18 queues to update the register file, rotating the first queue looked at every clock cycle. I am tempted to try developing a six-wide machine instead of four-wide.
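The rotating selection can be modelled in a few lines of Python. This is a behavioural sketch under stated assumptions, not the actual circuit: it assumes at most one result is taken from each queue per cycle and that the starting queue advances by one position each cycle. The name `select_results` is hypothetical.

```python
from collections import deque

NUM_QUEUES = 18       # three possible results from each of six units
NUM_WRITE_PORTS = 6   # register file write ports


def select_results(queues, start):
    """Pick up to six pending results, scanning from queue `start`.

    Returns the chosen (queue index, result) pairs and the starting
    index for the next cycle. Advancing the start by one per cycle is
    an assumption for illustration.
    """
    selected = []
    for offset in range(NUM_QUEUES):
        index = (start + offset) % NUM_QUEUES
        if queues[index]:
            selected.append((index, queues[index].popleft()))
            if len(selected) == NUM_WRITE_PORTS:
                break
    return selected, (start + 1) % NUM_QUEUES
```

Rotating the starting point keeps any one queue from always being served first, so a burst of results does not starve the queues that happen to sit at the end of the scan.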
_________________Robert Finch http://www.finitron.ca
|
Sun Apr 13, 2025 6:35 am |
|
 |
robfinch
Joined: Sat Feb 02, 2013 9:40 am Posts: 2307 Location: Canada
|
Porting modules over from Q+ and slowly adapting them for StarkCPU. The branch evaluator for StarkCPU is only about 1/8th the size it was for Qupls. 60 LUTs versus 450 LUTs.
Separated out the icache and dcache and moved them into their own repository. They should be usable with different CPUs.
_________________Robert Finch http://www.finitron.ca
|
Mon Apr 14, 2025 4:28 am |
|
 |
robfinch
Joined: Sat Feb 02, 2013 9:40 am Posts: 2307 Location: Canada
|
Coded the CARRY modifier. The CARRY modifier is processed in the decode stage. It applies the carry register spec for the carry input and output of each instruction following the CARRY modifier in the pipeline. The logic is a bit tricky because it forms a loop from the last carry output setting back to the start of decode for the next cycle. At the same time, there could be another CARRY instruction inside the group of instructions being processed at decode. So, the carry settings may need to be reset in the middle of the instruction group. The modifier state also needs to be available to be stored and restored across context switches. So, there is a CSR for this. Saving the carry mod state during an exception is okay. One line of code to copy the carry mod to the CSR saving state. Returning from an exception is a different story. It has to be treated almost like the pc. The carry mod needs to be fed into the start of the pipeline in sync with instruction fetch. It then has to propagate forward down the pipeline to the decode stage.
Did significant work on interrupt support. Some more tricky code. Interrupts are collected at the start of the pipeline but are not processed until the commit stage. This means that if there was an interrupt, all the work in the pipeline is dumped. It is about the simplest approach, as it sidesteps the issue of what happens if interrupts are disabled or enabled by previous instructions: at commit time there are no previous instructions left. A check is made at commit time to see if an interrupt occurred but was disabled. If so, it is added to a queue which feeds the start of the pipeline. At the decode stage the ATOM instruction is detected, and interrupts are masked if ATOM is present.
_________________Robert Finch http://www.finitron.ca
|
Tue Apr 15, 2025 3:43 am |
|
 |
oldben
Joined: Mon Oct 07, 2019 2:41 am Posts: 768
|
How do you enable IRQs to be in sync with a return from interrupt and/or MMU mapping? What little software I have seen always makes the hardware IRQs look just like the software IRQs. Looking at them from that standpoint, does it change anything?
|
Tue Apr 15, 2025 4:04 am |
|
 |
robfinch
Joined: Sat Feb 02, 2013 9:40 am Posts: 2307 Location: Canada
|
Quote: How do you enable IRQs to be in sync with a return from interrupt and/or MMU mapping?
They do appear to be in sync. The instructions in the pipeline have not permanently updated the machine state, so it is safe to dump them. It is known exactly which instruction the IRQ occurred on, so the return from interrupt can return to that same instruction.
Changing the structure of the core. Currently the core stores the argument valid flags in the re-order buffer along with physical register specifications. The argument valid flags in the re-order buffer control when the instruction is issued. Values for the instructions are read from the register file. This is being changed to keep track of the valid status in the reservation station instead of the re-order buffer. The reservation station will monitor the read ports for values that need to be updated at the station.
Did some monkeying around with the register file. It’s been reduced to a 4w16r port file from a 4w24r file. This makes it about 2/3 the size. The read ports are now going to be multiplexed to serve the number of read ports required. The number of required read ports increased because of the ISA: 26 read ports are required. This is just too many, especially when most of the time far fewer read ports are needed. Making the file smaller only saves about 1600 LUTs, but it also saves 40 BRAMs.
_________________Robert Finch http://www.finitron.ca
|
Wed Apr 16, 2025 4:15 am |
|
 |
robfinch
Joined: Sat Feb 02, 2013 9:40 am Posts: 2307 Location: Canada
|
Worked mainly on the StarkCPU decode stage and decoder today. Had some fun figuring out how to handle constants. Decided to back-track a bit again and place constants on the cache-line literally anywhere. The position of a constant is encoded in a four-bit field in the instruction. So, the decode stage scans all possible instructions on the cache line for constant positions, then marks all the ones it found as invalid instructions. It does this before performing other decode functions. An issue that arose was that there could be a branch into the middle of a cache line meaning it is not good enough to just process the four instructions being decoded for constants. There may have been earlier instructions on the line with constants placed after the branch landing point. Hence the need to scan the entire cache line.
Experimented with the number of registers, both architectural and physical. It looks like there may be enough resources to support two sets of architectural registers. The register file for the StarkCPU is smaller than the original Qupls one (96 regs vs 128 regs). That, combined with the reduction in the number of read ports, means there is more logic available for the RAT. No extra BRAMs are needed to support the second set of registers. The only issue I have to figure out now is which modes are best served by their own register set. There are four modes, but only two sets of registers.
_________________Robert Finch http://www.finitron.ca
|
Thu Apr 17, 2025 3:19 am |
|
 |
robfinch
Joined: Sat Feb 02, 2013 9:40 am Posts: 2307 Location: Canada
|
Found a way to get four complete register sets squeezed into the register file without increasing the size of the RAT further. It can be noted that the OS probably does not need floating-point registers. So, only the user app has them available. By re-arranging the register tags to be more conducive to the operating mode, the number of registers can be reduced. Also noted that the condition registers are already broken into groups for the operating mode, so the same set of registers can be reused in each mode. In total for the four register sets, we have 96 (user app), 40 supervisor, 40 hypervisor, and 48 secure mode registers. The total is 224 registers. 512 physical registers can still be used! There are about 2.3 physical registers for each architectural one.
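The register budget above is easy to check with a few lines of Python. The per-mode counts and the 512 physical registers come from the post; the dictionary and the function name are just for illustration.

```python
# Architectural register counts per operating mode, as given in the post.
ARCH_REGS = {
    "user": 96,
    "supervisor": 40,
    "hypervisor": 40,
    "secure": 48,
}
PHYSICAL_REGS = 512


def register_budget(arch_regs=ARCH_REGS, physical=PHYSICAL_REGS):
    """Return the architectural total and physical-per-architectural ratio."""
    total = sum(arch_regs.values())
    return total, physical / total


total, ratio = register_budget()
print(f"{total} architectural registers, {ratio:.1f} physical per architectural")
```

For comparison, four full sets of 96 registers would come to 384 architectural registers, leaving only about 1.3 physical registers each out of 512.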
Added the “MPO” instruction, standing for memory privilege override. Previously one had to manipulate CSRs to override memory privileges.
_________________Robert Finch http://www.finitron.ca
|
Fri Apr 18, 2025 4:20 am |
|
 |
robfinch
Joined: Sat Feb 02, 2013 9:40 am Posts: 2307 Location: Canada
|
Documented some more of the float instructions. I left a few float instructions out of the design for now until I get things working.
Additions: Added the XORC instruction, which exclusive-ors with the complement. I had XORI r1,r1,-1 in code with the intent of complementing a register, but it turns out that -1 is a 64-bit constant, which was unexpected. The XORI instruction only takes unsigned numbers, so it is not possible to encode the -1 within the instruction itself. Then I got the idea to xor with the complement of r0, since the complement of r0 is -1. Hence the XORC instruction. It seemed fitting as there are already ANDC and ORC instructions, and it is probably more useful than adding a specific “COM” instruction.
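The identity behind this can be checked with a small Python model. It is a sketch with assumptions: 64-bit registers, r0 reading as zero, and XORC computing the first operand xor the complement of the second, by analogy with ANDC and ORC. The function names are hypothetical.

```python
MASK64 = (1 << 64) - 1  # assumes 64-bit registers
R0 = 0                  # assumes r0 always reads as zero


def xorc(a: int, b: int) -> int:
    """Model of XORC: a XOR (NOT b), truncated to 64 bits."""
    return (a ^ (~b & MASK64)) & MASK64


def complement(a: int) -> int:
    """Complement a register using XORC with r0, with no -1 constant needed."""
    return xorc(a, R0)
```

Since the complement of zero is all ones, `xorc(a, R0)` flips every bit of `a`, which is the same result XORI with -1 was meant to produce.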
Added an option to the assembler, “abits”, which allows the number of address bits in use to be specified. The assembler can then generate shorter constants. When I looked at the assembled code, there were a lot of 64-bit constants for addresses, hence the “abits” option. It was used as “-abits=32” to reduce the size of address constants to 32 bits.
Fixed up about a dozen encoding errors in the assembler. Assembled code is looking better.
There was a lot of fiddling to get constants coming out at the end of a cache line properly. I eyeballed a few dozen cases, and they seemed right. Hopefully, everything is okay. It is interesting to note that constants can be output anywhere on the cache line except for the beginning. They do not have to be at the end. They could also be embedded in-line following instructions.
Added some more instruction decode. A good chunk of it is done now.
_________________Robert Finch http://www.finitron.ca
|
Sat Apr 19, 2025 2:59 am |
|
 |
oldben
Joined: Mon Oct 07, 2019 2:41 am Posts: 768
|
Quote: Added some more instruction decode. A good chunk of it is done now. What was not being decoded?
|
Sat Apr 19, 2025 6:51 pm |
|
 |
robfinch
Joined: Sat Feb 02, 2013 9:40 am Posts: 2307 Location: Canada
|
Quote: What was not being decoded?
There were several classes of instructions. The Qupls decoder code was ported over piecemeal, a few decodes at a time. I had the decodes that were not ported yet commented out, and then uncommented them one by one as I ported them.
Refactored branch station code that was inline into a separate module.
Did some work updating Stark.sv, the master CPU component. Arguments are marked valid a bit differently than in Qupls. The valid status comes from the stations, which detect when all args are valid.
Decided to get rid of the predicate logic and replace it with better branch prediction logic. Detecting predicated instructions is done by the scheduler and, as I have implemented it, it is a real piece of work. It searches the ROB backwards for predicate instructions. That is one factor slowing the scheduler down, and the scheduler is on the critical timing path. I may rethink this later if I can come up with a faster way to handle predicates.
_________________Robert Finch http://www.finitron.ca
|
Sun Apr 20, 2025 4:07 am |
|
 |
robfinch
Joined: Sat Feb 02, 2013 9:40 am Posts: 2307 Location: Canada
|
I put some elbow grease into the predicate logic. I had a long personal log of what I was doing, but decided not to post it. Ultimately, branches that go a short distance forward (fewer than seven instructions) now automatically turn into predicates. I will likely get rid of the other predicate logic as it is too complex. It would result in a larger, slower CPU. Ran into lots of issues. What if there are two predicates in the ROB at the same time? That is possible because they are usually short. What happens if an interrupt occurs in the predicate shadow? What to do about micro-machine instructions which feed the pipeline in the middle of a predicate shadow? The LOC started to explode.
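The conversion rule can be sketched in Python as follows. This is only an illustration: it assumes the distance is counted in instructions from the branch to its target and that the instructions strictly between the two form the predicate shadow. The names are hypothetical and the real logic handles many more cases.

```python
MAX_FORWARD_DISTANCE = 7  # branches shorter than this become predicates


def converts_to_predicate(branch_index: int, target_index: int) -> bool:
    """True for a forward branch of fewer than seven instructions."""
    distance = target_index - branch_index
    return 0 < distance < MAX_FORWARD_DISTANCE


def predicate_shadow(branch_index: int, target_index: int) -> list:
    """Instruction indices skipped when the branch is taken (assumed shadow)."""
    if not converts_to_predicate(branch_index, target_index):
        return []
    return list(range(branch_index + 1, target_index))
```

A branch at index 10 with its target at index 14 would be converted, with instructions 11 to 13 in its shadow. A backward branch, or one reaching seven or more instructions ahead, would stay an ordinary branch.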
I tried synthesizing StarkCPU. Had to fix up about 100 little mistakes. Still have not got the whole CPU to synthesize, but it is a lot closer. I would like to get an idea of size as some things are done differently than with Qupls.
_________________Robert Finch http://www.finitron.ca
|
Mon Apr 21, 2025 6:31 am |
|
 |
robfinch
Joined: Sat Feb 02, 2013 9:40 am Posts: 2307 Location: Canada
|
Fixed up miscellaneous errors that prevented synthesis from completing. 20-30 seconds per error. They were all simple, but there were a lot of them. I refactored register designations, using Rs1, Rs2, Rs3 etc. instead of Ra, Rb, Rc and I missed a few spots. I got the names of struct fields wrong in some places. A few typos. That sort of thing.
Somehow, I managed to toast the single-step-mode logic I had put in place a few weeks ago. I went back to the repository to try to restore an old copy of the logic but could not find it anywhere. I must have overwritten the logic at some point before saving it in the repository, so I am left redoing it. It was just a single module that was overwritten, but most of the SSM logic was in that module.
Synthesis reports 73,000 LUTs for Stark. This is considerably smaller than Qupls, which I think was over 100,000 LUTs for the same configuration: 2 ALUs, 1 FPU, 1 MEM, and 1 branch unit. Going to try using up some of the difference for a larger ROB.
_________________Robert Finch http://www.finitron.ca
|
Tue Apr 22, 2025 5:28 am |
|