


 [ 54 posts ]  Go to page Previous  1, 2, 3, 4
 Bigfoot 

Joined: Sat Feb 02, 2013 9:40 am
Posts: 2307
Location: Canada
Started working on the instruction decoder for Bigfoot. Paused for a bit to work out the encodings for floating-point operations. Managed to pack the FMA type instructions into a single root opcode, freeing up three root opcodes that were previously reserved for the function. Six opcodes free at the root level now. I have a use for at least one of them.

Decided to add more complex branch conditions, including floating-point branches. Previously branches just tested a single bit in the condition register. This would make floating-point branches somewhat inefficient as the unordered bit would need to be tested separately. Either more branch conditions were needed or the condition registers would need to be expanded to 16 bits. Branches now test multiple bits to determine whether to branch. The branch displacement was reduced by a bit to allow more conditions to be specified. The displacement is nine or seventeen bits. I would prefer more, but it should work okay. Needing to branch conditionally by more than +/-64k is rare.
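
To illustrate why testing several condition-register bits at once helps floating-point branches, here is a minimal Python sketch. The bit layout (`LT`, `EQ`, `GT`, `UN`), the condition names, and the "branch if any masked bit is set" rule are assumptions for illustration only, not Bigfoot's actual encoding. `disp_range` just shows the reach of a signed displacement field, in whatever unit the displacement counts.

```python
import math

# Hypothetical condition-register bit layout; not Bigfoot's real encoding.
LT, EQ, GT, UN = 1 << 0, 1 << 1, 1 << 2, 1 << 3

def fcmp(a, b):
    """Return a condition-register value for a floating-point compare."""
    if math.isnan(a) or math.isnan(b):
        return UN              # NaN operand: only the unordered bit is set
    if a < b:
        return LT
    if a > b:
        return GT
    return EQ

# Each (hypothetical) branch condition is a mask of bits to test together.
CONDITIONS = {
    "FBLT": LT,                # ordered less-than
    "FBLE": LT | EQ,           # ordered less-than-or-equal
    "FBULT": LT | UN,          # less-than or unordered, in a single test
    "FBUN": UN,                # unordered only
}

def branch_taken(cond, cr):
    """Branch if any of the bits selected by the condition is set."""
    return (cr & CONDITIONS[cond]) != 0

def disp_range(bits):
    """Range of a signed two's-complement displacement of the given width."""
    return -(1 << (bits - 1)), (1 << (bits - 1)) - 1
```

With single-bit tests, "less-than or unordered" would take two branches; with a mask it is one. `disp_range(17)` gives the +/-64k reach mentioned above.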

The native mode Bigfoot instruction set is almost twice as dense code-wise as the last Qupls instruction set. Low code density was one of the things I was not fond of about Qupls.

_________________
Robert Finch http://www.finitron.ca


Wed Sep 04, 2024 4:26 am

Joined: Sat Feb 02, 2013 9:40 am
Posts: 2307
Location: Canada
Got most of the instruction decoder for Bigfoot done. Worked on porting over portions of the Qupls code to Bigfoot. A lot of code required only minor changes.

Did some exploration of multi-pumping the register file to support a lot of read ports. Using a six times clock, the amount of BRAM required to support the register file can be reduced significantly. The amount of BRAM required had increased to accommodate the larger 128-bit registers for capabilities. So, the size needed to be reduced. The six times clock was to support four reads of the register file during a single CPU clock. The number of read ports was set to 20 (a multiple of four) as about 18 are needed. Five ports are read four times to get the equivalent of 20 ports.
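
As a rough behavioural illustration of the multi-pumping idea (not the actual HDL), the Python sketch below models five physical read ports that are each used four times within one CPU clock. The class and method names are made up for this example; only the 5 x 4 = 20 arrangement comes from the description above.

```python
class MultiPumpedRegFile:
    """Behavioural model: a few physical read ports, reused several
    times per CPU clock to look like many logical read ports."""

    def __init__(self, num_regs=512, phys_ports=5, pumps=4):
        self.regs = [0] * num_regs
        self.phys_ports = phys_ports   # real RAM read ports
        self.pumps = pumps             # reads per port per CPU clock

    @property
    def logical_ports(self):
        return self.phys_ports * self.pumps

    def write(self, reg, value):
        self.regs[reg] = value

    def read_cycle(self, addrs):
        """Serve one CPU clock's worth of reads.

        Returns (data, fast_ticks), where fast_ticks is the number of
        fast-clock read slots that were actually needed."""
        if len(addrs) > self.logical_ports:
            raise ValueError("more reads requested than logical ports")
        data = []
        fast_ticks = 0
        # Each fast-clock tick services up to phys_ports addresses.
        for start in range(0, len(addrs), self.phys_ports):
            group = addrs[start:start + self.phys_ports]
            data.extend(self.regs[a] for a in group)
            fast_ticks += 1
        return data, fast_ticks
```

Eighteen reads fit in four fast-clock slots (5 + 5 + 5 + 3), which is why 20 logical ports were chosen even though about 18 are needed.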

The CPU core is going to be too big and complex again. Extra alignment shifts to support the variable length instruction set added to the size of the core. The instruction extract module is now about 20kLUTs. Previously it was 5kLUTs.

Now thinking of doing a much simpler core ‘Little toe’ as a demonstration of concepts like register renaming, the RAT, and others. There would be no cache and it would execute directly out of BRAM. As much as I can fit into something < 5k LOC.

_________________
Robert Finch http://www.finitron.ca


Thu Sep 05, 2024 4:06 am

Joined: Wed Jan 09, 2013 6:54 pm
Posts: 1821
wow, that's an interesting possible tradeoff - to time-multiplex the register file to get the ports. Usually access to the register file is performance critical. In fact one CPU I worked on had, I think, a pair of coherent register files, to offer more ports without a time cost, but of course with an area cost. (The CPU shipped and worked but was immediately cancelled so was never a product.)


Thu Sep 05, 2024 7:32 am

Joined: Sat Feb 02, 2013 9:40 am
Posts: 2307
Location: Canada
Quote:
wow, that's an interesting possible tradeoff - to time-multiplex the register file to get the ports.

Yes, performance is being traded off to get something that might possibly work in the FPGA.

Logic in the FPGA is implemented with RAMs. So, if there are a few levels of logic, the BRAM may be faster than the logic. This is probably only true with an FPGA. The other logic in the core probably limits the max to 40 MHz (that is a logic depth of 10 LUTs) which is a couple of times slower than what the BRAMs can do.

BRAMs are very fast, though not as fast as LUT RAMs. LUT RAMs, I have heard, can work at 400+MHz if one is careful. I am hoping to get the BRAMs working at 100+MHz. They would likely be on the timing path still though. Muxing the BRAMs likely moves the FMAX down to around 20 MHz. But the issue is that the register file is huge and it simply won’t fit otherwise. Even muxed the register file is about 50 BRAMs. 256x128-bit registers need 512+ physical registers. It is a vector register file, with the vector registers being renamed. There are four copies of the register file, one for each write port, and a FF array (live value table) tracks which copy is valid. 4 copies X 20 read ports = 80 ports to be supplied. Note that writes are not muxed, just the reads.
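
The live value table arrangement can be sketched in a few lines of Python. This is a behavioural model under assumed names (`LvtRegFile`, `banks`, `lvt`), not the real implementation: each write port owns its own copy of the register file, and a small table remembers which copy was written last for each register.

```python
class LvtRegFile:
    """One RAM copy per write port, plus a live value table (LVT)
    recording which copy holds the newest value of each register."""

    def __init__(self, num_regs=512, write_ports=4):
        self.banks = [[0] * num_regs for _ in range(write_ports)]
        self.lvt = [0] * num_regs      # per register: index of the valid copy

    def write(self, port, reg, value):
        self.banks[port][reg] = value  # each port writes only its own copy
        self.lvt[reg] = port           # mark that copy as the live one

    def read(self, reg):
        # A read consults the LVT, then selects from the matching copy.
        return self.banks[self.lvt[reg]][reg]

def ram_read_ports(write_ports, read_ports):
    """Every logical read port must be able to read every copy."""
    return write_ports * read_ports
```

`ram_read_ports(4, 20)` gives the 80 ports quoted above. The model ignores two write ports targeting the same register in the same cycle.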

*****

Modified the transaction buffer component of the MMU to handle different PTE sizes. I added the capability to handle different sized PTEs to the MMU a while back, but forgot to update the tran buffer. So, it finally got updated. The tran buffer component does not care about the format of the PTE, only the size. Only three sizes are supported, 4, 8 and 16 bytes. Much of the MMU is PTE agnostic. The tran buffer component buffers bus transactions coming from the MMU allowing the MMU to have multiple outstanding requests. The MMU can be busy translating multiple addresses at the same time.

_________________
Robert Finch http://www.finitron.ca


Fri Sep 06, 2024 3:09 am

Joined: Sat Feb 02, 2013 9:40 am
Posts: 2307
Location: Canada
duh, if the read ports are muxed, why not mux the write ports too? The register file now has the write ports muxed too, making it much smaller.
If I read the BRAM specs correctly they should work close to 300MHz max. There is already a 200MHz clock for the video. With the CPU at 1/6 it would be 33.33MHz which is close to the max the logic will allow anyway.

_________________
Robert Finch http://www.finitron.ca


Fri Sep 06, 2024 6:17 am

Joined: Sat Feb 02, 2013 9:40 am
Posts: 2307
Location: Canada
Wrote a test bench to test the register file. It updates the register file randomly (random register numbers and values). Then reads back the same registers to ensure that data got written correctly. Readback does not begin until long after the register has been updated. This is to avoid false positives generated by the bypass logic. I managed to get the register file working with a five times clock instead of a six times one. If things work out okay, the video clocks can be re-used to clock the register file and CPU. There is a five times dot rate video clock in the system already.
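
The shape of that test can be sketched in Python. The real test bench is presumably written in an HDL; the function below is only an analogue that takes hypothetical `write` and `read` callables. All writes finish before any readback starts, which mirrors delaying readback so the bypass logic cannot mask a faulty RAM write.

```python
import random

def run_regfile_test(write, read, num_regs=512, width=128, count=200, seed=1):
    """Write random values to random registers, then read them all back.

    Returns a list of (reg, expected, actual) mismatches; empty means pass."""
    rng = random.Random(seed)          # fixed seed keeps failures reproducible
    expected = {}
    # Phase 1: random updates. A later write to the same register wins.
    for _ in range(count):
        reg = rng.randrange(num_regs)
        value = rng.getrandbits(width)
        write(reg, value)
        expected[reg] = value
    # Phase 2: readback, started only after every write has completed.
    mismatches = []
    for reg, value in expected.items():
        actual = read(reg)
        if actual != value:
            mismatches.append((reg, value, actual))
    return mismatches
```

For example, a plain dictionary works as a trivially correct model: `store = {}` followed by `run_regfile_test(store.__setitem__, store.__getitem__)` returns an empty list.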

Worked on the expansion of vector instructions. It is a bit cheesy. Vector instructions are replicated according to the vector length and placed in an expansion buffer. (All instructions go through the expansion buffer.) The register number field is modified to correspond to the vector element to process. Effectively the vector instruction is turned into a series of scalar instructions. The expansion buffer is large enough to hold four eight-element vectors; in other words, it has 32 entries. Four is the maximum number of instructions that would be fetched at one time. My first attempt at implementing the expansion buffer packed all instructions in the buffer so that they occupied a minimum amount of space and were all consecutive. This resulted in quite a large component (50k LUTs), which was deemed too large. So instead, instructions are no longer packed in the buffer. This reduced the size by a factor of 50, to about 1k LUTs! The trade-off is that there are now NOPs in the buffer which get fed to the processing core. I am looking into partially packing the buffer to improve performance. The typical case of no vector instructions is handled by packing scalar instructions in the first four slots.
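
A minimal Python sketch of the unpacked expansion scheme follows. The `Instr` fields and the mapping from vector register plus element to a scalar register number (`v * MAX_VLEN + e`) are assumptions made for illustration; the post only says that the register number field is modified per element.

```python
from collections import namedtuple

# Hypothetical instruction record; field names are for illustration only.
Instr = namedtuple("Instr", "op rd ra rb is_vector")
NOP = Instr("nop", 0, 0, 0, False)

FETCH_WIDTH = 4   # at most four instructions fetched at a time
MAX_VLEN = 8      # eight elements per vector, so 32 buffer entries

def element_reg(vreg, elem):
    """Assumed mapping from (vector register, element) to a scalar register."""
    return vreg * MAX_VLEN + elem

def expand_unpacked(group, vlen):
    """Expand a fetch group into the 32-entry buffer without packing."""
    buf = [NOP] * (FETCH_WIDTH * MAX_VLEN)
    if not any(ins.is_vector for ins in group):
        # Typical case: scalars only, packed into the first four slots.
        buf[:len(group)] = group
        return buf
    for slot, ins in enumerate(group):
        base = slot * MAX_VLEN          # each instruction owns a fixed block
        if ins.is_vector:
            for e in range(vlen):
                buf[base + e] = Instr(ins.op,
                                      element_reg(ins.rd, e),
                                      element_reg(ins.ra, e),
                                      element_reg(ins.rb, e),
                                      False)
        else:
            buf[base] = ins             # rest of the block stays NOP
    return buf
```

Because every instruction gets a fixed eight-entry block, no shifting logic is needed, at the cost of NOPs between the useful entries.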

The next trick is going to be reducing the size of the RAT. It is 47kLUTs and 72 BRAMs. Most of the BRAMs are for the checkpoint valid RAM, which has 17 read ports. It takes 68 BRAMs. Unfortunately, the trick of time-multiplexing the read and write ports is not so easy to do in this case. The BRAMs are read on the negative edge of the clock to simulate an asynchronous RAM. That gives only ½ clock cycle to work in. On the write side of things, a write performs a read-then-write cycle and is pipelined. The next read begins before the previous write is complete. This makes it challenging to multiplex. Most of the LUTs in the RAT are not used by the RAM components. The RAM components only use about 17kLUTs. The other 30kLUTs are logic (forwarding etc.) in the RAT. The RAT logic needs to be simplified.

_________________
Robert Finch http://www.finitron.ca


Sat Sep 07, 2024 3:39 am

Joined: Wed Jan 09, 2013 6:54 pm
Posts: 1821
This term wasn't in my cache:
Quote:
The RAT (Register Allocation Table) is the key structure for out-of-order exec CPUs to find out which previous outputs an instruction/uop is reading. Each write of a register is allocated a physical register, and updates the RAT's mapping of architectural to physical.
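
The mapping described in the quote can be sketched as a toy Python model. The sizes and names (`RenameTable`, `rename`) are assumptions, and the sketch leaves out checkpoints, freeing of physical registers, and recovery, which is where most of a real RAT's complexity lives.

```python
class RenameTable:
    """Toy register alias table: architectural -> physical mapping."""

    def __init__(self, num_arch=32, num_phys=64):
        # Initially architectural register i lives in physical register i.
        self.map = list(range(num_arch))
        self.free = list(range(num_arch, num_phys))

    def rename(self, dst, srcs):
        """Rename one instruction; returns (phys_dst, phys_srcs)."""
        # Sources are looked up first, so they see the previous producers.
        phys_srcs = [self.map[s] for s in srcs]
        # Each write gets a fresh physical register (raises if none are free).
        phys_dst = self.free.pop(0)
        self.map[dst] = phys_dst
        return phys_dst, phys_srcs
```

Looking up the sources before updating the destination matters for an instruction such as `r1 = r1 + 1`, which must read the old physical register and write a new one.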


Sat Sep 07, 2024 7:08 am

Joined: Tue Sep 03, 2024 6:20 pm
Posts: 33
How long are your build times? I find myself getting horribly distracted when more than 30 seconds pass...


Sat Sep 07, 2024 1:15 pm

Joined: Sat Feb 02, 2013 9:40 am
Posts: 2307
Location: Canada
Quote:
How long are your build times?

It depends. I usually test synthesize components. If they are small, synthesis is quick, a couple of minutes at most. Larger components take longer. I am currently working with the Qupls CPU, which is pretty big, and it takes about 10 minutes to synthesize. To build the whole system can take about an hour. I do not build the whole system very often. I am multi-tasking; I work away while things are synthesizing :)

******

I think I found a way to pack the expansion buffer and not use too many LUTs. It uses a level of indirection: a buffer index array that contains five-bit buffer indexes. The index array is shifted around instead of the buffer. Since the values are only five bits wide instead of 200+ bits it uses a lot less logic than shifting the buffer. It still adds 5k LUTs to the implementation, but that is much better than 45k LUTs. And the buffer looks like it is packed, meaning there is no performance loss.
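
The indirection can be illustrated with a short Python sketch. It is only a software analogy with invented function names: the wide buffer entries stay where they are, and a list of small indexes is compacted instead. Empty slots are represented here by `None` rather than NOP instructions.

```python
BUF_SIZE = 32   # expansion buffer entries, so an index fits in five bits

def build_index_array(buf):
    """Collect the positions of the occupied entries, in order.

    Only these small indexes are moved around; the wide entries in
    buf are never shifted."""
    return [i for i, entry in enumerate(buf) if entry is not None]

def packed_view(buf, index_array):
    """Read the buffer through the index array so it appears packed."""
    return [buf[i] for i in index_array]
```

A consumer that reads entries through `packed_view` sees consecutive instructions with no gaps, even though the underlying buffer still has holes.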

Found several errors while porting the Qupls code to Bigfoot, so I went back and updated the Qupls code. I was having trouble getting the Qupls code to run in simulation so maybe the fixes will help things.

_________________
Robert Finch http://www.finitron.ca


Sun Sep 08, 2024 1:25 am
