robfinch
Joined: Sat Feb 02, 2013 9:40 am Posts: 2307 Location: Canada
|
Moved the instruction victim cache from the bus interface unit to the instruction cache component, and made the victim cache optional. Added logic to invalidate the victim cache on a snoop hit in the victim cache.
It was pointed out on comp.arch that fully associative comparators are not required for snooping. Comparators are needed only for the ways of a single set, since a given address always maps to the same set. Updating the cache to account for this reduced its size by 13%.
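The saving is easy to see in a toy model. This Python sketch uses made-up geometry (64 sets, 4 ways, 64-byte lines; not the actual cache's parameters): the snoop address selects exactly one set, so only that set's way comparators are needed.

```python
# Toy snoop check: the snoop address indexes one set, so only NUM_WAYS
# tag comparators are required, not a fully associative search.
# Geometry below is assumed for illustration, not taken from the design.
NUM_SETS = 64
NUM_WAYS = 4
LINE_BITS = 6  # 64-byte lines

def snoop_hit(tags, snoop_addr):
    """tags[set][way] holds a tag or None; return the list of hitting ways."""
    set_idx = (snoop_addr >> LINE_BITS) % NUM_SETS
    tag = snoop_addr >> (LINE_BITS + 6)  # 6 set-index bits
    # Only NUM_WAYS comparisons happen here:
    return [w for w in range(NUM_WAYS) if tags[set_idx][w] == tag]
```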
_________________Robert Finch http://www.finitron.ca
|
Wed Mar 08, 2023 6:17 am |
|
 |
robfinch
|
Completely re-writing the bus interface unit. This has come from updating the cache modules and writing cache controllers. A lot of the logic that was in the bus interface unit is ending up in the cache controllers. Moving the TLB out to a shared component also results in lots of changes.
|
Sun Mar 12, 2023 4:05 am |
|
 |
robfinch
|
Just pouring the work into the bus interface unit, BIU. Whittled it down to 2,700 lines from 3,330. It should end up much smaller yet. The BIU previously included just about everything to interface to the outside bus. It included hardware table walkers. Some of the code is being moved outside of the unit to allow for multiple cores.
|
Tue Mar 14, 2023 5:23 am |
|
 |
robfinch
|
Working on the memory request queue, also called a load/store queue, today. The code is not terribly large, about 500 LOC, but it generates a lot of logic because each queue entry is manipulated separately. The queue has parameters allowing it to perform store merging and load bypassing. With everything enabled the core takes about 66,500 LUTs, which is too large for the current project; the minimal version of the core is just 16,500 LUTs. Store merging combines stores to the same cache line, resulting in a single store operation on the external bus. For example, storing a byte to offset zero of a cache line, then storing a wyde to offset 12 of the same line, merges the store data together and ends up performing only a single store operation. Any number of stores can be merged. When storing byte-by-byte, up to eight bytes will be merged into a single cache line, limited by the queue depth of eight, and a single store operation will take place. This is significantly faster than performing the individual stores. The core also features load bypassing: if a load address matches a queued store address and the selected range of data is available in the queue, the load is satisfied out of the queue instead of from external memory.
Both store merging and load bypassing respect the memory cache-ability of the operation. If the operation is non-cacheable because it targets I/O, store merging and load bypassing are disabled.
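The merging and bypassing rules can be modelled in a few lines. This Python sketch is a behavioural toy, not the queue's RTL; the line size, depth, and byte-valid representation are my assumptions:

```python
# Behavioural toy of the load/store queue's store merging and load
# bypassing. Line size and depth are assumptions for illustration.
LINE = 64  # bytes per cache line (assumed)

class StoreQueue:
    def __init__(self, depth=8):
        self.depth = depth
        self.entries = {}     # line base address -> (bytearray, valid-byte set)
        self.bus_writes = []  # non-cacheable stores go straight out

    def store(self, addr, data, cacheable=True):
        if not cacheable:                 # I/O: never merged
            self.bus_writes.append((addr, bytes(data)))
            return
        line, off = addr & ~(LINE - 1), addr & (LINE - 1)
        if line not in self.entries:
            if len(self.entries) >= self.depth:
                raise RuntimeError("queue full; would drain to the bus")
            self.entries[line] = (bytearray(LINE), set())
        buf, valid = self.entries[line]
        buf[off:off + len(data)] = data   # merge into the pending line
        valid.update(range(off, off + len(data)))

    def load(self, addr, size, cacheable=True):
        # Load bypassing: serve the load from the queue only when every
        # requested byte is valid there; otherwise defer to memory (None).
        line, off = addr & ~(LINE - 1), addr & (LINE - 1)
        if cacheable and line in self.entries:
            buf, valid = self.entries[line]
            if all(b in valid for b in range(off, off + size)):
                return bytes(buf[off:off + size])
        return None
```

Storing a byte at offset 0 and a wyde at offset 12 of the same line leaves one queue entry, so only one external store is eventually performed.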
|
Wed Mar 15, 2023 4:37 am |
|
 |
robfinch
|
Moved the region table into the shared TLB. The PCI config space is now shared between the TLB and region table.
I finally figured out a lower-cost way to represent a super-page in the TLB: increase the associativity. Two ways are now dedicated to 16MB super-pages and four ways to 16kB pages. I chose to increase the number of ways rather than add another read port to the TLB because it is a lot less expensive. Adding a port would quadruple the RAM requirements of the TLB; adding two ways only increases them by 50%.
The TLB currently piles up all the 16MB pages in the first 256 entries of the TLB. This occurs because there are not enough distinguishing address bits incoming: 16MB pages have only eight significant address bits when the address bus is 32 bits.
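The way-partitioned lookup can be sketched as follows. The set count and field widths here are assumptions (256 sets, 14-bit page offset, 24-bit super-page offset); the real TLB's parameters may differ:

```python
# Sketch of a dual-page-size lookup: one set array holds four ways of
# 16kB entries and two ways of 16MB super-page entries, each indexed and
# tagged with its own page-size shift. Parameters are assumptions.
SETS = 256
PAGE_BITS = 14    # 16kB pages
SUPER_BITS = 24   # 16MB super-pages
IDX_BITS = 8      # log2(SETS)

def tlb_lookup(tlb, vaddr):
    """tlb[set][way] holds a tag or None; ways 0-3 are 16kB, 4-5 are 16MB."""
    small_set = (vaddr >> PAGE_BITS) % SETS
    super_set = (vaddr >> SUPER_BITS) % SETS
    for way in range(4):
        if tlb[small_set][way] == vaddr >> (PAGE_BITS + IDX_BITS):
            return ("16kB", small_set, way)
    for way in range(4, 6):
        if tlb[super_set][way] == vaddr >> (SUPER_BITS + IDX_BITS):
            return ("16MB", super_set, way)
    return None
```

With a 32-bit address, `vaddr >> SUPER_BITS` has only eight significant bits, which is the pile-up effect described above.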
|
Mon Mar 20, 2023 10:43 pm |
|
 |
robfinch
|
Expanded size of the LVL field in the PTE to four bits, to allow up to 16 levels of paging. With a smaller page size and larger virtual address range, more than eight levels may be required. Wrote a TLB miss interrupt routine after researching some routines on the web, in part to get a feel for how well the ISA works. I think I am doing too much work in the TLB component. There are updates to pages required on a cache miss that should not really be in the TLB. The idea for the TLB miss routine is to have a separate routine for each entry level from the root pointer. This avoids having branches in the miss routine at the cost of some code replication.

Code:
; TLB miss handler
; Handles a 34-bit virtual address
; The TLB device needs to be permanently mapped into the system's address space
; since it is MMIO and uses the TLB.
;
tlb_miss_irq34:
    st96 a0,[sp]                ; save working registers
    st96 a1,12[sp]
    st96 a2,24[sp]
    st96 a3,36[sp]
    st96 a4,48[sp]
    ld96 a0,TLB_TLBMISS_ADR     ; a0 = miss address
    ld96 a1,PTBR                ; a1 = page table base
    clr a1,a1,0,13              ; clear 14 LSBs, address is page aligned
    extu a2,a0,24,9             ; get miss address bits 24 to 33, index into top level page table
    ld96 a3,[a1+a2*]            ; get PTP from top level table
    bbc a3,PTE_V,.noL1PTE       ; check that entry is valid
    bbc a3,PTE_T,.L1superPage   ; check for 16MB superpage
    extu a1,a3,PTE_PPNLO,22     ; get PTP pointer low bits
    extu a4,a3,PTE_PPNHI,32     ; and high bits
    asl a4,a4,22                ; build into one variable
    or a1,a1,a4
    asl a1,a1,14                ; convert PPN to table address
    extu a2,a0,14,9             ; get miss address bits 14 to 23
    ld96 a3,[a1+a2*]            ; get MPP
    bbc a3,PTE_V,.noL0PTE       ; check that entry is valid
    bbs a3,PTE_T,.corrupt       ; should be a PTE, otherwise table corrupt
.L1superPage:
    st96 a3,TLB_TLBE_HOLD       ; store MPP in holding reg
    st16 r0,TLB_TLBE_TRIGGER    ; update TLB
    ld96 a0,[sp]                ; restore working registers
    ld96 a1,12[sp]
    ld96 a2,24[sp]
    ld96 a3,36[sp]
    ld96 a4,48[sp]
    rti

; Here, memory was not mapped to support the access. So, the program must be
; trying to read or write a random address. Abort the program.
.noL1PTE:
.noL0PTE:
.corrupt:
    ldi a0,ABORT_PROGRAM
    ldi a1,ERR_TLBMISS
    syscall
|
Wed Mar 22, 2023 3:50 am |
|
 |
robfinch
|
Coded high-performance semaphores to be shared between CPU cores. Any core may set or read the semaphores via a CSR register. One use of the semaphores is to obtain exclusive access to the TLB registers. I re-wrote the TLB miss handler to account for exclusive access.

Code:
; shared TLB miss handler
; Handles a 34-bit virtual address
; Slightly more complex than an unshared TLB as the TLB registers need to be
; protected via a semaphore. Updates must be restricted to one core at a time.
; The TLB device needs to be permanently mapped into the system's address space
; since it is MMIO and uses the TLB.
; The stack must be mapped into a global address space.
;
tlb_miss_irq34:
    st96 t0,[sp]                ; save working registers
    st96 t1,12[sp]
    st96 t2,24[sp]
    st96 t3,36[sp]
    st96 t4,48[sp]
    st96 t5,60[sp]
    ld96 t0,TLB_MISS_ADR        ; t0 = miss address, reading miss address clears interrupt
    csrrs r0,3,M_IE             ; enable interrupts
    csrrd t1,r0,S_PTBR          ; t1 = page table base
    clr t1,t1,0,13              ; clear 14 LSBs, address is page aligned
    extu t2,t0,24,9             ; get miss address bits 24 to 33, index into top level page table
    ld96 t3,[t1+t2*]            ; get PTP from top level table
    bbc t3,PTE_V,.noL1PTE       ; check that entry is valid
    extu t5,t3,PTE_T,0          ; get PTE.T bit
    bbc t3,PTE_T,.L1superPage   ; check for 16MB superpage
    extu t1,t3,PTE_PPN,63       ; get PTP pointer
    asl t1,t1,14                ; convert PPN to table address
    extu t2,t0,14,9             ; get miss address bits 14 to 23
    ld96 t3,[t1+t2*]            ; get MPP
    bbc t3,PTE_V,.noL0PTE       ; check that entry is valid
    bbs t3,PTE_T,.corrupt       ; should be a PTE, otherwise table corrupt
.L1superPage:
    extu t1,t0,20,75            ; VPN bits 6 to 83 = miss address bits 20 to 95
    csrrd t2,r0,S_ASID          ; add ASID to miss address
    asl t2,t2,80
    or t1,t1,t2                 ; t1 = VPN+ASID
    extu t2,t0,14,9             ; t2 = address bits 14 to 23 = TLB entry number
    asl t2,t2,5                 ; shift into position
    csrrd t4,0,S_LFSR           ; choose a random way to replace
    pne t5,"TFFIIIII"           ; way depends on page level
    and t4,t4,3                 ; way 0 to 3 ; normal page
    and t4,t4,1                 ; way 0 or 1 ; superpage
    add t4,t4,4                 ; way 4 or 5 ; superpage
    or t2,t2,t4                 ; bump out
    csrrc r0,3,M_IE             ; disable interrupts
.lock:
    csrwr t4,1,M_SEMA           ; try and set semaphore
    csrrd t4,0,M_SEMA           ; check and see if set, zero returned if set
    bbs t4,0,.lock              ; must have been clear
    st96 t1,TLB_PTE             ; do quick stores to memory
    st96 t3,TLB_VPN
    st96 t2,TBL_CTRL
    st8 r0,TLB_WTRIG            ; trigger update
    csrrc r0,1,M_SEMA           ; release semaphore
    csrrs r0,3,M_IE             ; enable interrupts
    ld96 t0,[sp]                ; restore working registers
    ld96 t1,12[sp]
    ld96 t2,24[sp]
    ld96 t3,36[sp]
    ld96 t4,48[sp]
    ld96 t5,60[sp]
    rti

; Here, memory was not mapped to support the access. So, the program must be
; trying to read or write a random address. Abort the program.
.noL1PTE:
.noL0PTE:
.corrupt:
    ldi a0,ABORT_PROGRAM
    ldi a1,ERR_TLBMISS
    syscall
|
Thu Mar 23, 2023 6:19 am |
|
 |
robfinch
|
Worked on the hardware card table, HCT, and cache coherency tonight. The HCT is a telescopic memory reflecting with progressively more detail where a pointer store occurred in memory. The write barrier for pointer stores ends up looking like the following:

Code:
; Milli-code routine for garbage collect write barrier.
; This sequence is short enough to be used in-line.
; Three level card memory.
; a2 is a register pointing to the card table.
; STPTR will cause an update of the master card table, and hardware card table.
;
GCWriteBarrier:
    STPTR a0,[a1]       ; store the pointer value to memory at a1
    LSR t0,a1,#8        ; compute card address
    ST8 r0,[a2+t0]      ; clear byte in card memory
The STPTR instruction updates two levels of the HCT automatically. The highest level is only a single 32-bit word indicating which 16MB page a STPTR happened on. The second highest level is a 4kB memory accessed as 1k x 32 bits indicating which 16kB page a STPTR happened on. It is then necessary to scan only the cards in a 16kB page. There are only 64 cards to check and likely only 32 pointers to check in a card.
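The three levels can be modelled directly from the sizes above. This Python illustration shows what STPTR and the write barrier together maintain; the 256-byte card span follows the `LSR t0,a1,#8` in the barrier, and the truncated card array is an artifact of the sketch:

```python
# Toy model of the telescopic card table: a 32-bit top level (one bit
# per 16MB region), a 1k x 32-bit middle level (one bit per 16kB page),
# and one byte per 256-byte card. The card array is truncated here to
# keep the sketch small.
class CardTable:
    def __init__(self):
        self.level2 = 0                  # 32 bits, one per 16MB region
        self.level1 = [0] * 1024         # 1k x 32 bits, one per 16kB page
        self.cards = bytearray(1 << 20)  # one byte per 256-byte card

    def stptr(self, addr):
        # What a pointer store marks at every level:
        self.level2 |= 1 << ((addr >> 24) & 31)
        self.level1[(addr >> 19) & 1023] |= 1 << ((addr >> 14) & 31)
        self.cards[(addr >> 8) % len(self.cards)] = 1

    def dirty_cards_in_page(self, page):
        # The scan described above: only 64 cards per 16kB page.
        base = (page << 14) >> 8
        return [c for c in range(64) if self.cards[(base + c) % len(self.cards)]]
```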
|
Sun Mar 26, 2023 4:21 am |
|
 |
robfinch
|
Yesterday: put some elbow grease into the MPU component. The MPU ties together several other components including the interval timers, interrupt controller, serial port, and shared TLB.
Putting together the system-on-chip using the MPU component. The system does not synthesize to the correct size; most of the system is being omitted from the build. Obviously something is amiss.
Have the group register load and store instructions partially implemented. The idea is that a group of registers is stored with a single instruction. The group, in this case five registers, occupies 480 bits of a 512-bit cache line. Registers are stored in groups of five beginning with r0 to r4, then r5 to r9, r10 to r14, etc. The entire register context can be saved with 13 store instructions instead of 64. These instructions could also be handy during function prolog and epilog. Setting up the register file for group access was a challenge. The register file is broken into five groups of 16 registers each, of which only the lowest 13 are used, resulting in 65 available registers. There is a map involved to convert the six-bit register code into a three-bit group and four-bit index. The ABI should be set up to make best use of the groups of five registers.
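One way to make the group/index split concrete is to interleave the 65 registers across the five banks so that any aligned group of five (r0-r4, r5-r9, ...) touches each bank exactly once and can be read in parallel. This formula is my guess for illustration; the post does not give the actual mapping:

```python
# Hypothetical mapping of a 6-bit register code to a 3-bit group (bank)
# and 4-bit index. Interleaving means an aligned group of five registers
# hits all five banks at once. The formula is an assumption.
def map_reg(code):
    assert 0 <= code <= 64   # 65 architectural registers
    group = code % 5         # bank select, 0..4 (3 bits)
    index = code // 5        # 0..12 of the 16 slots used (4 bits)
    return group, index
```

With this mapping, a group store of r10-r14 reads index 2 from each of the five banks simultaneously.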
|
Tue Mar 28, 2023 2:46 am |
|
 |
robfinch
|
Started working on the compiler, CC64, which is seriously out of date. Lots of strcpy() to change to strcpy_s() and so on. Also had to update the preprocessor FPP64.
|
Thu Mar 30, 2023 3:33 am |
|
 |
robfinch
|
Got a first pass at the compiler done, beginning with the source code of an earlier version. I believe it compiles code close to correctly, but there are some performance issues. For instance, the compiler is not using base plus scaled-index addressing when it could be. This results in code like the following:

Code:
# if(flags[i]){
    sll t0,s0,1
    lea t1,_flags[gp]
    ldw t7,[t0+t1]
    beqz t7,.00035

Which uses two extra instructions and two temporaries, when it could look like:

Code:
    ldw t7,_flags[r0+s0*2]
    beqz t7,.00035
The best place to fix this in the compiler is at the expression parsing stage, when it builds expression nodes representing the indexing operation. It might also be possible to handle this with pattern matching in the peephole optimizer but that would not allow the temporaries to be used elsewhere.
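A sketch of what the expression-stage fix might look like. The node shapes and names here are hypothetical, not CC64's actual tree representation:

```python
# Hypothetical expression-node rewrite: fold sym + (reg << k) into a
# single base-plus-scaled-index address node when k gives a legal scale.
from dataclasses import dataclass

@dataclass
class Node:
    op: str            # 'add', 'shl', 'reg', 'sym', 'const', 'scaled_addr'
    args: tuple = ()
    value: object = None

def fold_scaled_index(node):
    if node.op == 'add':
        base, idx = node.args
        if base.op == 'sym' and idx.op == 'shl':
            reg, sh = idx.args
            if reg.op == 'reg' and sh.value in (0, 1, 2, 3):
                # One node the code generator can turn into
                # ldw t7,_flags[r0+s0*2] instead of sll/lea/ldw.
                return Node('scaled_addr', (base, reg), 1 << sh.value)
    return node
```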
|
Fri Mar 31, 2023 4:01 am |
|
 |
robfinch
|
Heavy-duty work on the compiler today. Re-wrote the processing of initialization of aggregate types. It does not work as well as it used to, though I think the code itself is improved. It is tricky to do because types must be matched up between the variable and the initialization data. All the initialization data is grabbed at once and stored in expression trees, which should evaluate to constants. Having the data in trees creates an issue matching the data up with the variable's elements, so the expression trees are converted to linear lists in several places. It was necessary to add an 'order' number to the expression nodes recording the order in which they were encountered. Unfortunately, it does not quite work correctly yet in all cases; sometimes incorrect data is output for initialization.
Got back a compiler test suite off the web. The test suite I had disappeared during the great hard drive crash of 2022.
|
Sun Apr 09, 2023 3:58 am |
|
 |
robfinch
|
Added the ORF instruction. It operates the same way as the OR instruction except that it uses an immediate value encoded as a float. Half, single, double, and quad precisions are supported. The instruction can be used to load a floating-point immediate value into a register. As a single cycle operation it is faster than using FADD to load a value.
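The effect of ORF can be illustrated in Python: ORing an IEEE-754 bit pattern into a zero source register deposits the constant in one logical operation. This models the idea only, for the double-precision case; the actual immediate encoding is the instruction's own:

```python
# Why OR with a float-encoded immediate acts as a one-cycle float load:
# with r0 (zero) as the source, the OR just deposits the bit pattern.
import struct

def orf(rs_bits, imm_float):
    imm_bits = struct.unpack('<Q', struct.pack('<d', imm_float))[0]
    return rs_bits | imm_bits

bits = orf(0, 1.5)   # 0x3FF8000000000000, the double encoding of 1.5
```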
Converted immediate constants from 96 to 128 bits. Dropping the whole 96-bit machine idea and just going with 128 bits. Changed the way the PFX2 instruction works; it is now issued twice in succession to provide 64 bits of constant information. This frees up a prefix.
Got the first try on a sequential machine coded, but there is a signal amiss. It does not synthesize correctly, and leaves out the data cache.
|
Sun Apr 16, 2023 5:35 am |
|
 |
robfinch
|
Made page-relative branching an option, and changed the default branching mechanism to simple relative addressing. The issue with page-relative branching is that it leaves code more exposed to attacks, because it amounts to almost absolute addressing. The core also now supports unaligned 64-byte accesses, which hopefully will make the compiler code easier to manage.
|
Sun Apr 23, 2023 8:08 am |
|
 |
robfinch
|
Forgot to provide an ASID for instruction accesses. Fixing this did not fix the issue of code being elided during synthesis.
Decided to make vector instructions one byte wider than scalar ones, so they are 48-bit instructions. The extra byte specifies the vector mask register to use; there is one bit left over. The issue was that the compiler would always spit out a vector mask modifier before every vector instruction, then rely on the peephole optimizer to merge the mask instructions together where possible. A vector mask instruction occupied five bytes of storage, so that approach turns out to be probably no more storage-efficient than just adding an extra byte to each instruction. The optimizer also could not merge mask instructions across flow-control boundaries, and it was otherwise tricky to do properly.
Getting rid of the decimal mode flag. The issue is that there needs to be a way to mix binary and decimal mode instructions *at the same time*. This was discussed on a newsgroup. There is an extra bit available in some register-register operate instructions which will probably be used to indicate decimal mode.
|
Mon Apr 24, 2023 3:52 am |
|