Author |
Message |
robinsonb5
Joined: Wed Nov 20, 2019 12:56 pm Posts: 92
|
robfinch wrote: The Thor2022 scheduler component is driving me crazy. It is now reporting as being over 100,000 LUTs in size, totally ridiculous and blowing the LUT budget, when included in the top module. If I synthesize the module by itself, it reports as being 51 LUTs in size, which I think is the proper size. So, I am experimenting to try and find out why the difference.
That smells like a RAM block not being inferred - if it works in isolation, does the full integrated design maybe end up doing something like feeding the output of one memory block directly into the address input of another?
|
Fri Aug 26, 2022 9:15 am |
|
 |
robfinch
Joined: Sat Feb 02, 2013 9:40 am Posts: 2307 Location: Canada
|
Quote: That smells like a RAM block not being inferred - if it works in isolation, does the full integrated design maybe end up doing something like feeding the output of one memory block directly into the address input of another?
It does work like that a little bit. One 8x3bits wide ram is used to address a second ram. I figured it may make an 8x8 matrix but that is only about 3200 LUTs. I had a very complex scheduler and it worked out to about 20,000 LUTs in size. I figured it may be turning the RAM access into a matrix. So I went ahead and really simplified the scheduler and things went nuts. I am sure it is just something that I cannot see ATM.
_________________Robert Finch http://www.finitron.ca
|
Sat Aug 27, 2022 3:53 am |
|
 |
robinsonb5
Joined: Wed Nov 20, 2019 12:56 pm Posts: 92
|
robfinch wrote: It does work like that a little bit. One 8x3bits wide ram is used to address a second ram.
I'm going to hand-wave here, since I'm fuzzy on the details (and they no doubt vary between FPGAs anyway), but my guess would be that the tool wants to pack the address signal as a register either inside or immediately adjacent to the RAM block, and likewise for the output of the other RAM block. If the design considers those registers to be one and the same, then it can't satisfy both requirements simultaneously, and thus uses logic instead of RAM blocks. If so, adding an extra register between the two blocks should help, but obviously will cost you an extra cycle. (I don't know which FPGA and toolchain you're using, but there was an update to Quartus 18.1 which fixed some RAM block corner cases.)
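The latency cost of that extra register can be illustrated with a small behavioural model. This is only a sketch under assumptions: both RAMs are taken to have a registered (synchronous) read, and the names `SyncRam` and `run_chain` are made up for the example. It says nothing about how any particular synthesis tool infers RAM blocks; it only shows that the lookup result arrives one clock later when a register sits between the two memories.

```python
class SyncRam:
    """Synchronous-read RAM: the word addressed at a clock edge appears on q after that edge."""

    def __init__(self, contents):
        self.mem = list(contents)
        self.q = 0  # registered read output

    def clock(self, addr):
        self.q = self.mem[addr]


def run_chain(index_ram, data_ram, addr_stream, extra_register):
    """Clock an index RAM feeding a data RAM; return the data RAM output after each edge."""
    a = SyncRam(index_ram)   # small RAM whose output addresses the second RAM
    b = SyncRam(data_ram)
    pipe = 0                 # optional pipeline register between the two RAMs
    outputs = []
    for addr in addr_stream:
        # Every register samples the values that were present before the edge.
        b_addr = pipe if extra_register else a.q
        next_pipe = a.q
        b.clock(b_addr)
        a.clock(addr)
        pipe = next_pipe
        outputs.append(b.q)
    return outputs
```

With this model, the result for the address presented at edge i shows up in `outputs[i + 1]` without the extra register and in `outputs[i + 2]` with it.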
|
Sat Aug 27, 2022 10:29 am |
|
 |
robfinch
Joined: Sat Feb 02, 2013 9:40 am Posts: 2307 Location: Canada
|
I am back onto this project yet again with the 2023 version. It keeps a chunk of the original Thor like 64 GPRs and vectors but adds some new tricks like sign control on operands and immediate operand swapping.
Project Thor2023 underway.
- Fixed 40-bit instruction format, instruction postfix words for extended constants
- Predication via PRED modifier
- 64 GPRs, unified register file, 96-bit registers
- Sign control on operands, immediate operand swapping
- 88-bit extended precision binary floating point, optional 96-bit decimal float
- 32/64 bit addressing
- Block tagging of data in MMU
- 1 address mode, base plus scaled index plus displacement
- Vector operations
- Bit/Bit pair manipulation instructions
- Vector instructions always use vector mask register #0 unless overridden with a VMASK modifier
- Four operating modes, App, supervisor, hypervisor and machine
- 512 entry relocatable vector tables for each operating mode
- Load using stack canary register, gpr #54, checks canary value and exceptions if differs
- 24-bit branch displacements
- BSR, PIC
- Loading / storing groups of five registers to / from cache line
_________________Robert Finch http://www.finitron.ca
|
Sun Jan 01, 2023 4:50 am |
|
 |
robfinch
Joined: Sat Feb 02, 2013 9:40 am Posts: 2307 Location: Canada
|
Decided to switch the spec to 96-bit triple precision from 88 bits. For the demo some of the low order bits of the float may not be supported to make best use of the DSP blocks. But best to keep things officially to a standard precision. Got enough of the assembler working to assemble the Fibonacci program. The assembler needed to be coded to handle 96-bit integer values. Fortunately, the assembler had most of the logic in place already. It was just a matter of making use of it.
Code:
                                      13: start:
02:0000000000000000 0302000160        14: CSRRD r2,r0,0x3001 # get the thread number
02:0000000000000005 0802020F00        15: AND r2,r2,15 # 0 to 3
02:000000000000000A DC09820000        16: BNZ r2,stall # Allow only thread 0 to work
                                      17:
02:000000000000000F 040200FD00        18: LDI r2,0xFD
02:0000000000000014 0402000100        19: LDI r2,0x01 # x = 1
02:0000000000000019 52000000001F00FC  20: STT r2,0xFFFC0000
02:0000000000000021 FF00
                                      21:
02:0000000000000023 0403001000        22: LDI r3,0x10 # calculates 16th fibonacci number (13 = D in hex) (CHANGE HERE IF YOU WANT TO CALCULATE ANOTHER NUMBER)
02:0000000000000028 0201030024        23: OR r1,r3,r0 # transfer y register to accumulator
02:000000000000002D 040303FDFE        24: ADD r3,r3,-3 # handles the algorithm iteration counting
                                      25:
02:0000000000000032 0401000200        26: LDI r1,2 # a = 2
02:0000000000000037 52040000001F00FC  27: STT r1,0xFFFC0004 # stores a
02:000000000000003F FF00
                                      28:
                                      29: floop:
02:0000000000000041 50040000001F00FC  30: LDT r2,0xFFFC0004 # x = a
02:0000000000000049 FF00
02:000000000000004B 0201010210        31: ADD r1,r1,r2 # a += x
02:0000000000000050 52040000001F00FC  32: STT r1,0xFFFC0004 # stores a
02:0000000000000058 FF00
02:000000000000005A 52000000001F00FC  33: STT r2,0xFFFC0000 # stores x
02:0000000000000062 FF00
02:0000000000000064 040303FFFE        34: ADD r3,r3,-1 # y -= 1
02:0000000000000069 DC0DD8FFFF        35: BNZ r3,floop # jumps back to loop if Z bit != 0 (y's decremention isn't zero yet)
02:000000000000006E 9F00000000        36: NOP
02:0000000000000073 9F00000000        37: NOP
02:0000000000000078 9F00000000        38: NOP
02:000000000000007D 9F00000000        39: NOP
02:0000000000000082 9F00000000        40: NOP
02:0000000000000087 9F00000000        41: NOP
                                      42: stall:
02:000000000000008C DC00000000        43: BRA stall
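For checking a simulation run against a known answer, a reference model of the iteration is handy. The sketch below is not an instruction-by-instruction transcription of the listing. It assumes a conventional Fibonacci step in which `x` takes the old `a` and `a` becomes `a` plus the old `x`, with the starting values (x = 1, a = 2) and the iteration count (0x10 - 3 = 13) taken from the listing. The function name `fib_reference` is made up for the example.

```python
def fib_reference(iterations, x=1, a=2):
    """Reference model: advance the pair (x, a) -> (a, a + x) the given number of times."""
    for _ in range(iterations):
        x, a = a, a + x
    return a


# The listing loads 0x10 into the counter and subtracts 3 before the loop.
result = fib_reference(0x10 - 3)
print(hex(result))
```

Under these assumptions the model ends with a = 987 (0x3DB), which is the 16th Fibonacci number when counting from F(1) = F(2) = 1.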
_________________Robert Finch http://www.finitron.ca
|
Tue Jan 03, 2023 6:22 am |
|
 |
robfinch
Joined: Sat Feb 02, 2013 9:40 am Posts: 2307 Location: Canada
|
Motoring along on Thor2023. Today put the bus interface unit into place, stolen from the rfPhoenix project. Coded up part of a state machine to run the core. Going to use a simple state machine driven approach as most of the clock cycles will be burned up accessing memory. An instruction cache is being used which is part of the BIU.
The current goal is having enough of the machine in place to simulate Fibonacci.
_________________Robert Finch http://www.finitron.ca
|
Wed Jan 04, 2023 5:18 am |
|
 |
robfinch
Joined: Sat Feb 02, 2013 9:40 am Posts: 2307 Location: Canada
|
More work on coding the core and updating the spec document. It will be a while before the fun of debugging begins.
_________________Robert Finch http://www.finitron.ca
|
Thu Jan 05, 2023 3:59 am |
|
 |
robfinch
Joined: Sat Feb 02, 2013 9:40 am Posts: 2307 Location: Canada
|
Spent today mulling over the operation of atomic memory ops. Started coding support for them in the multi-port memory controller mpmc10. It is interesting that it is the memory controller that needs to be able to execute the atomic operations. The cpu more or less just passes the instruction through to the memory controller.
AMO ops supported are ADD, AND, OR, EOR, ASL, LSR, MIN, MAX and CMPXCHG.
Not planning on getting these working right away, but they are planned in.
_________________Robert Finch http://www.finitron.ca
|
Fri Jan 06, 2023 3:47 am |
|
 |
MichaelM
Joined: Wed Apr 24, 2013 9:40 pm Posts: 213 Location: Huntsville, AL
|
Rob:
Given the long list of instructions to which the AMO attribute / behavior applies, maybe an AMO instruction would be a more efficient approach to flagging the AMO behavior to the memory controller? Usage from a compiler may require a special syntax or structure, but the potential performance penalty of AMO would not apply to common instructions such as ADD, EOR, etc.
_________________ Michael A.
|
Fri Jan 06, 2023 9:15 am |
|
 |
robfinch
Joined: Sat Feb 02, 2013 9:40 am Posts: 2307 Location: Canada
|
Quote: Given the long list of instructions to which the AMO attribute / behavior applies, maybe an AMO instruction would be a more efficient approach to flagging the AMO behavior to the memory controller? Usage from a compiler may require a special syntax or structure, but the potential performance penalty of AMO would not apply to common instructions such as ADD, EOR, etc.
Yes. There is a separate set of instructions independent of the usual ADD, AND, etc. just for AMO operations. I did not mean to imply that regular instructions were passed to the memory controller. I am calling them AMADD, AMAND, AMOR, etc. I got lazy and left the 'AM' prefix off when listing them. Supporting them with the compiler could be interesting. Normal 'C' may use intrinsic functions. But I think I will make up a way to support them in my non-C, C-like compiler. It may be as simple as keywords like "amo_add".
Spent time today working on AMO, atomic memory operations. Realizing that the way to do them is at the coherence point. So, an opcode needs to be passed to the memory controller indicating the AMO to perform. The address and data are supplied by the CPU, but the operation actually takes place in the memory controller. The Wishbone bus I have been using does not support this mechanism, so I have added onto it. It needed an opcode field and another data field. The AMO operations increased the size of the memory controller by about 25%. That combined with the bus interface unit is quite large.
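The division of labour described here (the CPU supplies opcode, address and data, and the memory controller does the read-modify-write) can be sketched as a small model. Everything below is an assumption made for illustration and not the mpmc10 design: the 32-bit word width, the unsigned MIN/MAX, returning the old memory value, the mnemonics beyond AMADD, AMAND and AMOR, and the names `AmoRequest`, `data2` and `MemoryController`. The second data field is used here as the swap value for a compare-and-exchange.

```python
from dataclasses import dataclass

MASK = 0xFFFFFFFF  # assumed 32-bit memory word

# Two-operand AMOs: new memory value as a function of (old value, CPU data).
OPS = {
    "AMADD": lambda old, d: old + d,
    "AMAND": lambda old, d: old & d,
    "AMOR":  lambda old, d: old | d,
    "AMEOR": lambda old, d: old ^ d,
    "AMASL": lambda old, d: old << (d & 31),
    "AMLSR": lambda old, d: old >> (d & 31),
    "AMMIN": min,   # unsigned comparison, an assumption
    "AMMAX": max,
}


@dataclass
class AmoRequest:
    opcode: str      # which AMO to perform (the added opcode field)
    address: int
    data: int        # operand, or the expected value for AMCMPXCHG
    data2: int = 0   # the added second data field: swap value for AMCMPXCHG


class MemoryController:
    """Performs the whole read-modify-write itself, so no other access can interleave."""

    def __init__(self):
        self.mem = {}

    def execute(self, req):
        old = self.mem.get(req.address, 0)
        if req.opcode == "AMCMPXCHG":
            new = req.data2 if old == req.data else old
        else:
            new = OPS[req.opcode](old, req.data)
        self.mem[req.address] = new & MASK
        return old  # value sent back to the CPU (assumed)
```

A compare-and-exchange in this model would be issued as `mc.execute(AmoRequest("AMCMPXCHG", addr, expected, new_value))`, and the caller can tell whether it succeeded by comparing the returned value with `expected`.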
_________________Robert Finch http://www.finitron.ca
|
Sat Jan 07, 2023 4:13 am |
|
 |
robfinch
Joined: Sat Feb 02, 2013 9:40 am Posts: 2307 Location: Canada
|
Fixed the float compare module to include comparisons of infinities. The compare, while taking only a single clock cycle, must act like a subtract operation. If two infinities are being subtracted then the result should be a NaN. If only one of the operands is an infinity then it should compare greater than the other, non-infinite operand.
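The "compare acts like a subtract" rule can be sketched with ordinary floating-point arithmetic. This is a software illustration of the behaviour described above, not the hardware module, and the function name and result strings are made up for the example. It assumes the outcome follows the sign of a - b, so a positive infinity compares greater than any finite value and a negative infinity compares less. Note that this treats two same-signed infinities as unordered, whereas an IEEE 754 comparison would report them as equal.

```python
import math


def compare_like_subtract(a, b):
    """Classify a versus b from the result of a - b, as a subtract-based comparator would."""
    diff = a - b            # inf - inf (same sign) yields NaN in IEEE arithmetic
    if math.isnan(diff):
        return "unordered"  # NaN operand, or two like-signed infinities
    if diff > 0:
        return "gt"
    if diff < 0:
        return "lt"
    return "eq"
```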
Got the basic machine coded with ADD, CMP, AND, OR, EOR, LOAD, STORE and branch instructions and predicates too. No complex ops yet but it should be enough to run the Fibonacci example.
Using a simple state machine to implement the core, performance will not be the best but hopefully the core will fit into the FPGA.
_________________Robert Finch http://www.finitron.ca
|
Mon Jan 09, 2023 3:45 am |
|
 |
robfinch
Joined: Sat Feb 02, 2013 9:40 am Posts: 2307 Location: Canada
|
Some work on a first simulation. Nothing works at the moment. But I was able to see the I$ loaded with instructions.
_________________Robert Finch http://www.finitron.ca
|
Tue Jan 10, 2023 4:58 am |
|
 |
robfinch
Joined: Sat Feb 02, 2013 9:40 am Posts: 2307 Location: Canada
|
Modifying the ISA to remove the rounding mode spec on some FP instructions. The available bits will be used for FP constants. Rounding specification will be available with an instruction modifier.
Added FP16 to FPx conversion routines, where x is 32, 64, 96 or 128. The conversion is probably fast enough to use inline as combinational logic with the immediate input coming from the instruction. It is basically just a bit copy with a small adder needed for the exponent.
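The "bit copy plus a small adder" structure can be seen in a software sketch of the widening conversion. The code assumes IEEE 754 binary16 on the input side and an IEEE-style target described by its exponent and fraction widths (8/23 for FP32, 11/52 for FP64). The layouts of the 96-bit and 128-bit formats in Thor2023 are not given here, so they are not covered. The function name is made up, and the subnormal branch is an addition for completeness: it needs a normalising shift, which goes beyond a plain bit copy.

```python
import struct


def fp16_to_wider_bits(h, exp_bits, frac_bits):
    """Widen a binary16 bit pattern to a format with the given exponent/fraction widths."""
    sign = (h >> 15) & 1
    exp = (h >> 10) & 0x1F
    frac = h & 0x3FF
    rebias = ((1 << (exp_bits - 1)) - 1) - 15    # the "small adder": difference of biases
    if exp == 0x1F:                              # infinity or NaN: all-ones exponent
        out_exp = (1 << exp_bits) - 1
        out_frac = frac << (frac_bits - 10)
    elif exp == 0 and frac == 0:                 # signed zero
        out_exp, out_frac = 0, 0
    elif exp == 0:                               # subnormal input: normalise first
        n = frac.bit_length()                    # position of the leading one, 1..10
        out_exp = n - 10 + rebias
        out_frac = (frac << (frac_bits + 1 - n)) & ((1 << frac_bits) - 1)
    else:                                        # normal number: copy bits, re-bias exponent
        out_exp = exp + rebias
        out_frac = frac << (frac_bits - 10)
    return (sign << (exp_bits + frac_bits)) | (out_exp << frac_bits) | out_frac


# Example: 1.0 in binary16 is 0x3C00; widened to FP32 it should read back as 1.0.
bits32 = fp16_to_wider_bits(0x3C00, 8, 23)
print(struct.unpack("<f", struct.pack("<I", bits32))[0])
```

For normal numbers the fraction bits are simply placed at the top of the wider fraction field and the exponent has a constant added, which matches the description of a bit copy with one small adder.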
_________________Robert Finch http://www.finitron.ca
|
Wed Jan 11, 2023 6:42 am |
|
 |
robfinch
Joined: Sat Feb 02, 2013 9:40 am Posts: 2307 Location: Canada
|
Spent some time reading up on the AM2901 bit-slice.
Modified the Thor2023 ISA to be more like the Thor2022 ISA. Rather than a vector register indicator on every register there is a single bit which indicates a vector instruction along with an additional bit for register spec B to indicate a vector register and the same for register spec C if present. This saves one bit in many instructions.
Have not been able to get simulation to execute instructions yet. There is an issue loading the I$. The instruction data is being fetched from memory but not loaded. There is a control glitch somewhere.
_________________Robert Finch http://www.finitron.ca
|
Thu Jan 12, 2023 3:14 am |
|
 |
robfinch
Joined: Sat Feb 02, 2013 9:40 am Posts: 2307 Location: Canada
|
Still unable to get the I$ loaded in simulation.
Accessing memory is slightly complicated as things are pipelined and requests and responses are asynchronous. Responses can come back in a different order than they were requested. The BIU keeps a table of outstanding requests and, as responses come in, it matches each incoming response to its request. It also keeps track of the burst length for requests, and the burst length is what does not seem to match properly. The cache is updated only once all data in the burst is retrieved. The memory segment, CODE, must also match. The BIU has evolved over time and seemed to be working for the Phoenix project. It was just plugged into Thor2023 and does not appear to work. There is probably something not being initialized in the same manner.
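The bookkeeping described above can be sketched as a small table keyed by a transaction ID. This is a model under assumptions and not the actual BIU: the use of a transaction ID to pair responses with requests, the beat numbering, and the names `OutstandingRequests`, `issue` and `respond` are all made up for illustration. It shows the three conditions mentioned: a response must match an outstanding request, the segment must match, and the line is released to the cache only after every beat of the burst has arrived.

```python
class OutstandingRequests:
    """Tracks in-flight burst reads and releases a line only when the burst is complete."""

    def __init__(self):
        self.table = {}  # transaction ID -> request entry

    def issue(self, tid, address, burst_len, segment):
        self.table[tid] = {
            "address": address,
            "burst_len": burst_len,
            "segment": segment,
            "beats": {},          # beat number -> data received so far
        }

    def respond(self, tid, beat, data, segment):
        """Record one response beat; return (address, line) once the burst is complete."""
        entry = self.table.get(tid)
        if entry is None or entry["segment"] != segment:
            return None           # no matching outstanding request
        if not 0 <= beat < entry["burst_len"]:
            return None           # beat outside the recorded burst length
        entry["beats"][beat] = data
        if len(entry["beats"]) < entry["burst_len"]:
            return None           # still waiting for the rest of the burst
        del self.table[tid]
        line = [entry["beats"][i] for i in range(entry["burst_len"])]
        return entry["address"], line
```

In a model like this, a request whose recorded burst length is larger than the number of beats actually returned never completes, so the cache is never written. That is one way a burst-length mismatch could present as instruction data being fetched but not loaded.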
_________________Robert Finch http://www.finitron.ca
|
Wed Jan 18, 2023 4:48 am |
|