robfinch
Joined: Sat Feb 02, 2013 9:40 am Posts: 2428 Location: Canada
|
Worked some on the reservation stations and bypassing networks. The bypassing networks are not shown on the pipeline diagrams as that would clutter up the diagram.
Reservation stations queue up to four arguments for up to three instructions. The argument values are set from the register file or from the bypassing networks. There are at least four bypassing inputs (parameterized). The current design has eight inputs.
Four of the bypassing inputs come from the input to the register file. This trims a clock cycle off of register access time. The other four inputs come from the outputs of frequently used functional units. For instance, the output of the first simple arithmetic unit (SAU) is bypassed back to its input so that back-to-back instructions can be made single cycle. It also feeds the input to other functional units.
The reservation stations are set up to be generic in nature. The same component is used to support different functional units. While the stations support up to four instruction arguments, not all types of instructions (functional units) need that many arguments. The hardware for unneeded arguments will get trimmed by the synthesizer.
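As a rough illustration of how a station entry might pick up its arguments from the bypass inputs, here is a small behavioural model in Python. It is only a sketch, not the RTL: the names `BypassInput`, `StationEntry` and `capture`, and the idea of matching on a register tag, are assumptions made for the example.

```python
from dataclasses import dataclass, field
from typing import List, Optional

NUM_ARGS = 4  # the stations queue up to four arguments per instruction


@dataclass
class BypassInput:
    """One bypass source: a register tag plus the value being forwarded."""
    valid: bool
    tag: int      # hypothetical register number being written
    value: int


@dataclass
class StationEntry:
    """One queued instruction waiting for its arguments."""
    tags: List[int]  # register each argument is waiting on
    values: List[Optional[int]] = field(
        default_factory=lambda: [None] * NUM_ARGS)

    def ready(self) -> bool:
        # Only the arguments this instruction actually uses must be present.
        return all(v is not None for v in self.values[:len(self.tags)])


def capture(entry: StationEntry, bypasses: List[BypassInput]) -> None:
    """Fill any missing argument whose tag matches a valid bypass input."""
    for i, tag in enumerate(entry.tags):
        if entry.values[i] is not None:
            continue  # argument already captured
        for b in bypasses:
            if b.valid and b.tag == tag:
                entry.values[i] = b.value
                break
```

An entry that uses only two arguments never looks at the other two value slots, which loosely mirrors the unused argument hardware being trimmed at synthesis.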
*****
Used up eight opcodes for SIMD support. Also, it was decided to move the precision field out of the branch format and into the opcode. This caused eight more opcodes to be used, but it gives two more bits for the branch displacement.
To support lower precision non-SIMD operations the upper bits of the destination register are set to zero.
There are about 24 opcodes left open.
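The zeroing of the upper destination bits for lower precision operations can be pictured with a one-line model. This is a sketch only; the function name `write_result` is made up for the example, and the precision is passed as a bit count rather than as whatever encoding the opcode uses.

```python
def write_result(value: int, precision_bits: int) -> int:
    """Keep the low `precision_bits` of a result; all upper bits become zero."""
    return value & ((1 << precision_bits) - 1)
```

For instance, a 32-bit add of 0xFFFFFFFF and 1 wraps to zero, and the upper half of the destination register reads as zero as well.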
_________________Robert Finch http://www.finitron.ca
|
| Fri Nov 14, 2025 3:10 am |
|
 |
|
robfinch
Joined: Sat Feb 02, 2013 9:40 am Posts: 2428 Location: Canada
|
Some work on extended precision arithmetic. Added an ADC instruction that adds three source operands and produces low order and high order (carry bit) in two destination registers. A 256-bit add can then be done with just four instructions.
Format: adc Rd1, Rs1, Rs2, Rs3, Rd2
Code:
adc a3, a1, a2, 0, cy0
adc b3, b1, b2, cy0, cy1
adc c3, c1, c2, cy1, cy2
adc d3, d1, d2, cy2, cy3
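To check the carry chain on paper, the four-instruction sequence can be modelled in Python. This is a sketch assuming 64-bit registers; `adc` here is a model of the instruction's described behaviour, and `add256` is a helper invented for the example.

```python
MASK64 = (1 << 64) - 1


def adc(rs1: int, rs2: int, rs3: int):
    """Model of adc Rd1, Rs1, Rs2, Rs3, Rd2: returns (low 64 bits, carry out)."""
    total = rs1 + rs2 + rs3
    return total & MASK64, total >> 64


def add256(x: int, y: int):
    """Add two 256-bit values as four chained 64-bit adc operations."""
    carry = 0
    result = 0
    for limb in range(4):
        a = (x >> (64 * limb)) & MASK64
        b = (y >> (64 * limb)) & MASK64
        low, carry = adc(a, b, carry)  # carry out feeds the next limb
        result |= low << (64 * limb)
    return result, carry
```

As long as the third operand is a carry of 0 or 1, the sum of the three operands stays below 2^65, so the high-order result is itself only 0 or 1.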
Shift instructions were also added that save the upper or lower bits of the shift result in a second destination register.
Added some more conditional move instructions: conditional move if even (CMOVEVN), move if less than zero, and move if less than or equal to zero.
Decided to get rid of the ADDnUI instructions. I cannot see them being used that often, and the same functionality is available using a regular ADD_ASL instruction by substituting an immediate for Rs2. It is a little less code dense, but it is probably worth it to simplify the instruction set.
Here is a table of the root opcodes:
Attachment: Qupls2026_opcodes.jpg
|
| Sat Nov 15, 2025 4:14 am |
|
 |
|
robfinch
Joined: Sat Feb 02, 2013 9:40 am Posts: 2428 Location: Canada
|
Changed the PUSH and POP instructions from being implemented with micro-code to being implemented using the micro-op translator. PUSH and POP are now translated into one to five micro-ops depending on how many registers are used. There is less overhead and better performance of the operations when translated to micro-ops.
Changed the base data-path width to 128 bits. I am going to try it and see if it will fit.
There are now 128 logical registers available in Qupls. It turns out that the BRAM setup is 512 registers deep no matter whether there are 32, 64 or 128 registers. So, they may as well be made available.
|
| Tue Nov 18, 2025 3:33 am |
|
 |
|
robfinch
Joined: Sat Feb 02, 2013 9:40 am Posts: 2428 Location: Canada
|
Changed the base data-path width back to 64-bits. There is no longer any micro-code or HW state machine. Some of the operations done with micro-code could be done using the micro-op translator. Broke the ENTER and EXIT instructions into two separate instructions each so that they would fit into the micro-op translator. ENTER and EXIT no longer push and pop registers from the stack. That is done by a second (or third) instruction now.
Code:
ENTER 64 ; allocate 64 bytes for non-safe stack usage
PUSHSS s0, s1, s2, s3 ; push regs onto safe stack
…
POPSS s3, s2, s1, s0 ; pop regs from safe stack
EXIT 64 ; deallocate and return
Decided to drop the sign control bits from the instruction set. In many cases having sign control bits did not make sense. For instance, when a base register is used during an address calculation the base register value would probably never be negated, so a sign control bit is wasted in that case. Another case is branch instructions. Because there is branching on relative conditions, if a change in the sign of an operand is needed it can often be done by swapping operands. Sign control is now sometimes controlled by the opcode, as is typical in many machines. Rather than having an ADD with sign control there is now both an ADD and a SUB. It was desired to support 128 registers, and removing the sign control bits makes this possible. The ISA uses the 128 registers as SIMD registers by grouping registers into groups of four. That makes 32 x 256-bit SIMD registers available.
|
| Wed Nov 19, 2025 12:57 am |
|
 |
|
robfinch
Joined: Sat Feb 02, 2013 9:40 am Posts: 2428 Location: Canada
|
Worked on supporting vectors (SIMD) with Qupls. Like many other designs, Qupls uses a scalar register to contain a mask for vector operations. Many instructions directly support masked operations. To mask specific elements of the vector the appropriate bit mask must be generated. This can be done using one of the SET instructions. The SET instruction will set or clear bits required to reference particular elements of the vector.
Rather than a vector length register, Qupls uses a global mask (vgm) register. This register needs to be set up to contain a bit-mask corresponding to the elements that should be active. To set this register in a manner analogous to setting a vector length register, a special 256-bit constant can be loaded into a vector register, then a SET instruction used. Like the following:
Code:
OR r8,r0,$0x0706050403020100,0
OR r9,r0,$0x0F0E0D0C0B0A0908,0
OR r10,r0,$0x1716151413121110,0
OR r9,r0,$0x1F1E1D1C1B1A1918,0
SLT.BP vgm,v2,#12,$-1
Which sets up the mask register for 12 elements that are a byte wide. The mask can also be set much more easily with an immediate constant:
OR vgm, $0xFFF, $0, $0 ; mask for 12 elements, any element size
The vector element size (VELSZ) register contains a code indicating the size of a vector element. Elements may be 8, 16, 32, or 64 bits wide for integers, or 16, 32, 64, or 128 bits wide for floats (128-bit floats not being supported currently). The VELSZ register allows size-agnostic vector instructions to be used. VADD v1, v2, v3 will add two vectors according to the vector element size. This makes it possible to write a vector routine without a specific element size specified.
Worked on updating the SAU (simple arithmetic unit) to support vector operations.
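The two ways of building the mask can be compared with a small Python model. It is a sketch under two assumptions: that there is one mask bit per element, with bit i covering element i, and that the byte-parallel set-less-than compares each byte lane against the immediate. The function names are invented for the example, and the final `$-1` operand of SLT.BP is not modelled.

```python
def lane_index_bytes(num_lanes: int = 32):
    """The 256-bit constant from the post: byte lane i holds the value i."""
    return list(range(num_lanes))


def slt_byte_mask(lanes, limit: int) -> int:
    """Set mask bit i when the value in byte lane i is less than `limit`."""
    mask = 0
    for i, value in enumerate(lanes):
        if value < limit:
            mask |= 1 << i
    return mask


def length_mask(n: int) -> int:
    """Immediate form: n active elements give n low-order one bits."""
    return (1 << n) - 1
```

Under those assumptions both routes give the same value for twelve active elements, namely twelve low-order ones (0xFFF).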
|
| Thu Nov 20, 2025 1:28 am |
|
 |
|
gfoot
Joined: Sat Oct 04, 2025 10:54 am Posts: 25
|
robfinch wrote:
Like the following:
Code:
OR r8,r0,$0x0706050403020100,0
OR r9,r0,$0x0F0E0D0C0B0A0908,0
OR r10,r0,$0x1716151413121110,0
OR r9,r0,$0x1F1E1D1C1B1A1918,0
SLT.BP vgm,v2,#12,$-1
Which sets up the mask register for 12 elements that are a byte wide.

I was trying to understand this - is the fourth "OR" call meant to be loading r11 though, rather than r9?
|
| Thu Nov 20, 2025 10:44 am |
|
 |
|
robfinch
Joined: Sat Feb 02, 2013 9:40 am Posts: 2428 Location: Canada
|
Quote:
was trying to understand this - is the fourth "OR" call meant to be loading r11 though, rather than r9?

Sharp eyes. Definitely. Should be:
Code:
OR r8,r0,$0x0706050403020100,0
OR r9,r0,$0x0F0E0D0C0B0A0908,0
OR r10,r0,$0x1716151413121110,0
OR r11,r0,$0x1F1E1D1C1B1A1918,0
SLT.BP vgm,v2,#12,$-1
v2 and r8 to r11 are aliased.
|
| Thu Nov 20, 2025 4:43 pm |
|
 |
|
robfinch
Joined: Sat Feb 02, 2013 9:40 am Posts: 2428 Location: Canada
|
I skipped posting for a day or so, so I have logged a lot in the meantime.
A lot of work was done on the memory logic, especially to support unaligned memory ops. The mechanism previously used probably was not working. I never got as far as testing unaligned memory access. It was re-dispatching a memory instruction in the event of an unaligned memory op. This was bad because a different load / store queue slot would be used. It is doubtful that the instruction would be completed properly.
Unaligned memory ops are now handled using the memory controller logic and state machine without dispatching the instruction again. A couple of additional states were added to the memory state machine, triggered when there is an unaligned memory op.
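To show why an unaligned op needs extra steps in the memory state machine, here is a Python sketch that splits one access into the pieces that each stay inside a single cache line. The 64-byte line size and the name `split_access` are assumptions for the example; the post does not state the line size, and the real logic is a hardware state machine rather than a loop.

```python
LINE_BYTES = 64  # hypothetical cache-line size, not stated in the post


def split_access(addr: int, size: int, line_bytes: int = LINE_BYTES):
    """Return the (address, length) pieces needed to cover [addr, addr + size)."""
    pieces = []
    remaining = size
    while remaining > 0:
        room = line_bytes - (addr % line_bytes)  # bytes left in this line
        chunk = min(room, remaining)
        pieces.append((addr, chunk))
        addr += chunk
        remaining -= chunk
    return pieces
```

An access that fits inside one line yields a single piece. One that straddles a line boundary yields two, which is the case the additional states would have to cover.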
Implementing vector memory ops makes this more complex. Vector memory ops require both processing for unaligned access and re-dispatching of instructions. Instructions need to be dispatched again for vector ops as the memory address may change with each lane processed. The address needs to be recomputed by the address generator, which is triggered by dispatching the instruction. At the same time, a vector lane might span an alignment boundary, requiring unaligned access.
Yet to be built is exception logging for vector instructions. There should be a vector of lane accesses that failed.
The logic for micro-code was removed, which reduced code size by over 500 LOC.
Gave the vector length and vector lane size registers separate fields for up to eight different data types. This allows a different length and lane size to be used for each type. The five current data types are integer, floating-point, fixed-point, character, and addresses. Addresses are used in indexing the load / store instructions. It is possible to use only an eight-bit address offset vector while moving 64-bit data such as a float to or from memory. The float vector would consume eight registers, while the address vector would need only a single register.
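The register counts in that example follow from simple arithmetic, sketched below in Python. The eight-element vector length is an assumption chosen to match the numbers quoted, and `registers_needed` is a name invented for the example.

```python
REG_BITS = 64  # base data-path width given earlier in the thread


def registers_needed(num_elements: int, element_bits: int,
                     reg_bits: int = REG_BITS) -> int:
    """Number of scalar registers needed to hold one vector."""
    total_bits = num_elements * element_bits
    return -(-total_bits // reg_bits)  # ceiling division
```

With eight elements, a vector of 64-bit floats needs eight registers while a vector of eight-bit address offsets fits in one.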
The max vector length and max vector size constants are returned by the GETINFO instruction, which returns CPU-specific information.
|
| Sat Nov 22, 2025 10:32 am |
|
 |
|
oldben
Joined: Mon Oct 07, 2019 2:41 am Posts: 878
|
Would it be possible to have a load/store unaligned memory order code? The only use that is with 16 bit addressing for 8 bit byte code of some kind, like P-code or emulating a 8088/8086. Internally it would be n 8 bit memory references,
|
| Sat Nov 22, 2025 7:50 pm |
|
 |
|
robfinch
Joined: Sat Feb 02, 2013 9:40 am Posts: 2428 Location: Canada
|
Quote:
Would it be possible to have a load/store unaligned memory order code? The only use that is with 16 bit addressing for 8 bit byte code of some kind, like P-code or emulating a 8088/8086. Internally it would be n 8 bit memory references,

Currently memory operations are in-order, including unaligned accesses. Memory is loaded a cache line at a time, then the proper bytes are extracted. A byte load instruction will load only the specified byte from the cache. Therefore, emulating an 8088/8086 or running P-code should be straightforward.
*****
Added a trio of instructions to translate virtual to physical addresses. The first instruction is for scalar translations, the second for indexed vector translations, and the last for regular vector addressing.
|
| Sun Nov 23, 2025 2:38 am |
|
 |
|
robfinch
Joined: Sat Feb 02, 2013 9:40 am Posts: 2428 Location: Canada
|
Added vector reduction instructions. Have been looking at the RISC-V vector extensions, with brief surveys of other architectures, to make sure I do not miss anything.
Added a few FP instructions to the documentation. The opcodes were reserved but they were undocumented. The docs are about 540 pages and there is still a ton of documentation to do.
Copied the graphics transform code from Thor2021 and converted it to use floating-point numbers in addition to fixed point. The floating point is 13x larger and likely several times slower than the fixed point. The floating-point code likely has a few more bits precision. The fixed point is only 18.18.
|
| Tue Nov 25, 2025 4:21 am |
|
 |
|
robfinch
Joined: Sat Feb 02, 2013 9:40 am Posts: 2428 Location: Canada
|
Worked on 128-bit decimal float support today. Created a reservation station that supports 128-bit operands with interfacing to the 64-bit register file. Even / odd register pairs are used for 128-bit operands.
Just when I thought I had the ISA instruction formats worked out, I decided to change them. The register spec field was wasting bits, so it was reduced back to six bits. The number of registers directly available was changed. There is little reason to have 128 GPRs available. Most of the time the compiler cannot use that many. They were being used to support vector instructions. There are now up to 32 vector registers supported, specified using five bits of the six-bit register spec rather than by specifying the first scalar register of the vector. The unused sixth bit is used for sign control. Most instructions now support sign control again.
Micro-ops were modified to support eight-bit register specs. Vector registers are mapped into the last 128 micro-op registers. The GPRs are mapped into the first 40 micro-op registers. This means 168 registers are mapped. This is good for register renaming as it then has three or more registers available for each micro-op register.
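The mapping arithmetic can be checked with a short Python sketch. Several things here are assumptions rather than facts from the design: that each of the 32 vector registers takes four consecutive micro-op registers (128 / 32), that the "last 128" of an eight-bit spec means numbers 128 to 255, and that the 512-deep register BRAM mentioned earlier in the thread still applies. The function names are invented for the example.

```python
NUM_GPR_SLOTS = 40    # GPRs occupy the first 40 micro-op registers
VEC_BASE = 128        # assumed start of the last 128 of 256 micro-op registers
NUM_VECTORS = 32
REGS_PER_VECTOR = 4   # assumed: 128 slots / 32 vector registers
PHYSICAL_REGS = 512   # BRAM depth mentioned earlier in the thread


def gpr_slot(n: int) -> int:
    """Micro-op register number for general-purpose register n."""
    if not 0 <= n < NUM_GPR_SLOTS:
        raise ValueError("GPR number out of range")
    return n


def vector_slot(v: int, part: int) -> int:
    """Micro-op register number for one 64-bit part of vector register v."""
    if not 0 <= v < NUM_VECTORS or not 0 <= part < REGS_PER_VECTOR:
        raise ValueError("vector register or part out of range")
    return VEC_BASE + v * REGS_PER_VECTOR + part


mapped = NUM_GPR_SLOTS + NUM_VECTORS * REGS_PER_VECTOR  # 168 mapped registers
ratio = PHYSICAL_REGS / mapped  # physical registers per mapped register
```

With those numbers the ratio comes out just over three, which agrees with the "three or more" figure for renaming.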
|
| Wed Nov 26, 2025 1:30 am |
|
 |
|
robfinch
Joined: Sat Feb 02, 2013 9:40 am Posts: 2428 Location: Canada
|
Doing work on the micro-op translator.
Added another instruction MOVMR, heard about on comp.arch newsgroup, to move multiple registers to other registers. This is useful for calling a subroutine where the register arguments need to be set up from random register sources. MOVMR can transfer up to seven registers to argument registers.
Modified the instruction_t type, updated for Qupls4 instead of Stark. It is much simpler now. There are a lot fewer formats to deal with. Added some documentation for the float package.
Had two types doing basically the same thing: instruction_t, a type holding the instruction formats, and micro_op_t, a type holding the formats of micro-ops. It is micro_op_t that contains the instruction information that needs to be decoded. instruction_t was a holdover from before there were micro-ops in the design. All references to instruction_t were removed and/or replaced with micro_op_t. TG for file search and replace.
Fixed up numerous minor bugs so that the core could synthesize. It took 18 hours to synthesize the last successful run.
|
| Sat Nov 29, 2025 12:32 am |
|