74xx based CPU (yet another)
Author |
Message |
oldben
Joined: Mon Oct 07, 2019 2:41 am Posts: 768
|
I just design with CMOS 22V10s. They may not be as fast as other PALs, but I can get LS TTL power and LS TTL speeds. All the CPU ROMs I have been using are small, 256x8 or 32x8, so by adding don't-care terms I can fit the PROMs into 22V10s. I am downsizing my 20-bit CPU into a 9-bit CPU at the moment, so soon I will have another 74xx-based CPU. PALs could replace TTL in a hardware build to ease PCB layout, but with slow 150 ns memory, speed is not a factor. The programmer I currently have does not work, so I need a low-cost (< $300) programmer for PLDs and EEPROMs that runs on Windows 7.
|
Wed Oct 07, 2020 6:52 pm |
|
 |
joanlluch
Joined: Fri Mar 22, 2019 8:03 am Posts: 328 Location: Girona-Catalonia
|
oldben wrote: I just design with CMOS 22V10s. They may not be as fast as other PALs, but I can get LS TTL power and LS TTL speeds. All the CPU ROMs I have been using are small, 256x8 or 32x8, so by adding don't-care terms I can fit the PROMs into 22V10s. I am downsizing my 20-bit CPU into a 9-bit CPU at the moment, so soon I will have another 74xx-based CPU. PALs could replace TTL in a hardware build to ease PCB layout, but with slow 150 ns memory, speed is not a factor. The programmer I currently have does not work, so I need a low-cost (< $300) programmer for PLDs and EEPROMs that runs on Windows 7.

Atmel/Microchip still have 16V8s and 22V10s in production, with max propagation delays as low as 5 ns: ATF16V8C-5JX and ATF22V10C-5JX.

I also recently came across some remaining stock of old Lattice GAL22V10 (thanks Dieter) at speeds as fast as 2.3 ns (Digikey Gal Search), such as this ISPGAL22V10AB-23LN-ND. My understanding is that they are out of production, so once the stock is gone, it is gone.

Some hobbyists have reported issues attempting to program the ATF variants, though. I am unsure what the right programmer for them is. I wonder if we need to go for an expensive one for this to work (?)
|
Mon Oct 12, 2020 2:54 pm |
|
 |
joanlluch
Joined: Fri Mar 22, 2019 8:03 am Posts: 328 Location: Girona-Catalonia
|
Over the last several days, I've been working on completing and refining the Logisim model. At some point I decided to tweak the instruction set encodings a bit more, and ended up removing the zero-extended byte load with register+offset indirect addressing, "ld.zb [Rs, K], Rd", from the set. Therefore, the only remaining addressing mode with implicit zero-extended loads is the register+register indirect mode, "ld.zb [Rs, Rn], Rd". The zero-extending load was already rarely used in regular compiled C code, because byte-sized memory accesses are either sign agnostic or explicitly sign extended. The compiler can still choose the sign-extending load for 'any ext' loads, with zero performance penalty, or place an explicit zero-extend instruction in the few cases where it is really needed.

The modification above, along with the repositioning of a 'reserved' slot, allowed a further refactoring of the instruction type encodings, so that it is now possible to determine the instruction type by looking at just the first 2 or 3 encoding bits. This helped to 'clean' the circuitry for immediate decoding and instruction pre-decoding. In addition, I also moved the default encoding field positions to improve pre-decoding. For example, immediate fields are now split into two parts, with the least significant bits always going to the same positions across all the instruction types.

For reference, this is the link to the most up-to-date instruction set: https://github.com/John-Lluch/CPU74/blob/master/Docs/CPU74InstrSetV10.pdf
Changes are depicted in blue. I updated the compiler for the removed instruction, but I have yet to update the assembler and the software simulator with the new opcodes, so testing of the new encodings is still pending.
---
The most interesting part is that I took some time to draw a 'timing chart' diagram to help me find the critical path visually and get an initial estimate of the maximum clock frequency this processor can run at. So this is it:

Attachment: TimingChartV10.png

This corresponds to the Logisim model posted as png files in this directory: https://github.com/John-Lluch/CPU74/tree/master/Docs/LogisimDocsV10

I marked the critical path for the 'Fetch Pipe' in purple, and the critical path for the decode-execute pipe in red. There are also several clock frequency scenarios depicted at the top of the drawing. For propagation delays I used the 'typical' values in the Texas Instruments or Fairchild datasheets, or the average of the min and max values if a typical value was not specified.

As can be seen from the drawing, the processor in its current design manages just slightly better than 12 MHz. My goal is still 16 MHz though, and I can already think of a couple of improvements that will hopefully improve on the current figure.

So that's all for now.
Joan
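The estimation rule described in this post (use the typical propagation delay, or the min/max average when no typical value is given, then add up the stages on a path) can be sketched as below. This is only an illustration: the `Stage` struct, the function names, and the convention that a negative `typ_ns` means "not specified" are assumptions made for the example, not something taken from the CPU74 repository.

```c
#include <stddef.h>

/* One stage on a timing path. A negative typ_ns is used here to mean
   "the datasheet gives no typical value" (a convention of this sketch). */
typedef struct {
    const char *part;  /* label only */
    double min_ns;
    double typ_ns;
    double max_ns;
} Stage;

/* Delay used for one stage: typical if specified, else (min + max) / 2. */
double stage_delay_ns(const Stage *s)
{
    if (s->typ_ns >= 0.0)
        return s->typ_ns;
    return (s->min_ns + s->max_ns) / 2.0;
}

/* Sum of the stage delays along one path, in nanoseconds. */
double path_delay_ns(const Stage *path, size_t n)
{
    double total = 0.0;
    for (size_t i = 0; i < n; i++)
        total += stage_delay_ns(&path[i]);
    return total;
}

/* Clock frequency in MHz whose period equals the given path delay. */
double max_clock_mhz(double path_ns)
{
    return 1000.0 / path_ns;
}
```

With made-up numbers, a path that sums to 80 ns would correspond to 12.5 MHz, and the 16 MHz goal requires the slowest path to fit within 62.5 ns.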
Last edited by joanlluch on Mon Oct 12, 2020 4:13 pm, edited 1 time in total.
|
Mon Oct 12, 2020 4:00 pm |
|
 |
BigEd
Joined: Wed Jan 09, 2013 6:54 pm Posts: 1821
|
Nice diagram!
|
Mon Oct 12, 2020 4:07 pm |
|
 |
joanlluch
Joined: Fri Mar 22, 2019 8:03 am Posts: 328 Location: Girona-Catalonia
|
Time for an update. Since my last post the following changes have been made:

Register File
I improved performance by observing that the register ports to the ALU inputs can be selected directly from the instruction opcode, instead of having to go through the delay of the instruction decoder. This gives the opportunity to have register outputs already preselected by the time the enable signal from the instruction decoder arrives. The trick is to connect the already preselected register to 74AC541 buffers, which in turn present the register to the ALU input buses as soon as the enable signal arrives. Although the offending path was the A_BUS only, the same has been done for the B_BUS. As a bonus, we have reduced capacitance on the ALU input lines, because instead of 8 three-state ICs connected for the register file, we now have only one. The internal capacitance delay introduced from the registers to the output buffer is not relevant because it is not on the critical path. As before, a duplicated set of identical registers is used to feed the A_BUS and the B_BUS independently.
https://github.com/John-Lluch/CPU74/blob/master/Docs/LogisimDocsV11/Registers.png
This reduced the critical path by about 7 ns, so the processor can now run at a theoretical frequency of nearly 14 MHz. That's two megahertz better than the previous 12 MHz.

Instruction Decoder
The above change is reflected in the Instruction Decoder module by the addition of a 74LVC1G3157 2:1 analog switch, which is the actual component providing the selection information to the register file. This uses the property that only instructions of the R3 type use the instruction field 'Rs' to indicate the register that must be connected to the A_BUS. For all remaining cases, including the instructions of all the other types and the additional cycles of any instruction type, the field 'Rd' is used.
The Instruction Decoder module just provides a bit indicating this circumstance ahead of the actual instruction decoding, so that the register file can use it to prepare registers before their possible selection for the ALU.
https://github.com/John-Lluch/CPU74/blob/master/Docs/LogisimDocsV11/InstDecoder.png

Data Memory
The Logisim model has been completed with the incorporation of Data Memory as well as the load/store circuitry. The main CPU circuit now looks like this:
https://github.com/John-Lluch/CPU74/blob/master/Docs/LogisimDocsV11/Main.png
The processor is able to perform 'truncated stores' and 'sign/zero extended loads'. This functionality has been divided into three units, besides the memory access itself, described below:

* Memory
Memory is divided into two banks organised as two memory chips of 32K x 8 bit each, for a total of 64K bytes. The low and high memory banks store the low and high bytes respectively at any given address. The processor allows both word and byte accesses to memory. Word stores are always word aligned, so the least significant address bit is ignored. Byte loads/stores can be performed at any address, so in this case the least significant address bit is used to determine which memory bank to use.

* The MAR-CTRL unit processes and tweaks control signals from the Instruction Decoder to meet the needs of the load/store instructions. The most important functions are (a) selecting whether it's a load or a store, (b) in case of a store, whether it is a word store or a byte store, and in the latter case whether it is an aligned or unaligned store, (c) in case of a load, whether it is a word load or a byte load, and in the latter case whether it is an aligned or unaligned load, and whether it needs zero or sign extension.
https://github.com/John-Lluch/CPU74/blob/master/Docs/LogisimDocsV11/MarCtrl.png
The available control signals are shown in a table at the top of the document above.
The circuit also takes into account that any memory access is always a two-cycle operation.
- The first cycle is used to compute a memory address, which will be available on the ALU output bus by the end of the cycle. Here, the computed address is just clocked into the Address Register, MAR. In addition, some control signals are stored in an internal register too, MemCy2. The latter is done in order to have those signals ready and available early in the next cycle, which is the actual memory store or load operation.
- On the second cycle the stored control signals are used to drive the /CE, /WE and /OE of the memory chips, according to what's required.
- Out of paranoia, I also incorporated a trick around one spare bit of MemCy2 that allows immediate deactivation of the /WE signal when the clock pulse at the end of the second cycle arrives. This makes sure that the memory is only activated for writes during the length of the second cycle, but not beyond that. Also, writes are guaranteed not to corrupt undesired memory locations, by making sure that the address is available earlier in the cycle, whereas the /WE low signal comes later from the instruction decoder.

* The SHL8 unit provides the functionality for byte stores, by shifting the input word 8 bits left, so that the relevant byte can be stored in the appropriate memory bank.
https://github.com/John-Lluch/CPU74/blob/master/Docs/LogisimDocsV11/SHL8.png

* The SHR8EXT unit provides the functionality for byte loads. It can shift the memory output 8 bits right and add a sign or zero extension to the result.
https://github.com/John-Lluch/CPU74/blob/master/Docs/LogisimDocsV11/SHR8EXT.png

* The design also allows the use of the SHL8 and SHR8EXT units alone, without memory intervention, to execute several processor instructions.

Initially, I made the SHL8 unit perform a SWAP byte operation instead of an 8-bit shift.
The hardware implementation is virtually identical, except that the lower byte is fed from the upper byte instead of zeros. However, after some thought I changed my mind and decided to implement the shift version as shown. The motivation was that the compiler uses 'swapb' in the context of 8-bit shifts only, in combination with additional sign or zero extensions. So why not just replace the 'swap' instruction and provide the 8-bit shift instructions directly? So that's what I did. The resulting compiler-generated code, in situations where the swapb instruction was used, is more compact and faster. For example, this piece of code to perform an 8-bit shift on a 32-bit variable:

Code:
long sh8 ( long in )
{
  return in>>8 ;
}

Which previously resulted in this:

Code:
  .globl sh8
sh8:
  zext r1, r2
  bswap r2, r2
  bswap r0, r0
  zext r0, r0
  or r0, r2, r0
  bswap r1, r1
  sext r1, r1
  ret

And now it is just this:

Code:
  .globl sh8
sh8:
  lsr8 r0, r0
  lsl8 r1, r2
  or r0, r2, r0
  asr8 r1, r1
  ret

The next goal is to try to reduce the critical path further, to hopefully bring the processor to the initial 16 MHz goal (it's currently at 14 MHz). I see the opportunity to move the evaluation of the Condition Flags to the beginning of the next cycle, still in time for the next ALU operation to use them, instead of using valuable processor time at the end of the current cycle, but that's for a future post...

So that's it for now.

EDIT: Forgot to add a link to the current timing diagram reflecting the latest changes and showing the 'almost' 14 MHz critical path: https://github.com/John-Lluch/CPU74/blob/master/Docs/TimingChartV11.png

Joan
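As a rough behavioural model of the two-bank memory and the byte-shift units described in this post, the C sketch below may help. Several things in it are assumptions rather than facts from the Logisim files: that even addresses map to the low bank and odd addresses to the high bank, that `lsl8`, `lsr8` and `asr8` behave like SHL8 and SHR8EXT used without memory, that the assembly operands are ordered source first and destination last, and that `r0`/`r1` carry the low/high halves of the 32-bit argument. All C function names are made up for the example.

```c
#include <stdint.h>

/* Two 32K x 8 banks: bank_lo holds the low byte of every 16-bit word,
   bank_hi the high byte. Both are indexed by the address without bit 0. */
static uint8_t bank_lo[0x8000];
static uint8_t bank_hi[0x8000];

/* Assumed behaviour of the byte-shift units when used as instructions. */
uint16_t lsl8(uint16_t x) { return (uint16_t)(x << 8); }
uint16_t lsr8(uint16_t x) { return (uint16_t)(x >> 8); }
uint16_t asr8(uint16_t x)
{
    uint16_t ext = (x & 0x8000u) ? 0xFF00u : 0x0000u; /* copy the sign bit */
    return (uint16_t)((x >> 8) | ext);
}

/* Word accesses ignore address bit 0. */
void store_word(uint16_t addr, uint16_t value)
{
    bank_lo[addr >> 1] = (uint8_t)(value & 0xFFu);
    bank_hi[addr >> 1] = (uint8_t)(value >> 8);
}

uint16_t load_word(uint16_t addr)
{
    return (uint16_t)(bank_lo[addr >> 1] | (bank_hi[addr >> 1] << 8));
}

/* Truncated store: address bit 0 picks the bank. For the high bank the
   byte is first moved to the upper lane, which is the SHL8 role. */
void store_byte(uint16_t addr, uint16_t value)
{
    if (addr & 1u)
        bank_hi[addr >> 1] = (uint8_t)(lsl8(value) >> 8);
    else
        bank_lo[addr >> 1] = (uint8_t)(value & 0xFFu);
}

/* Byte load with zero or sign extension, which is the SHR8EXT role. */
uint16_t load_byte(uint16_t addr, int sign_extend)
{
    uint16_t word = load_word(addr);
    if (addr & 1u)
        return sign_extend ? asr8(word) : lsr8(word);
    word &= 0x00FFu;
    if (sign_extend && (word & 0x0080u))
        word |= 0xFF00u;
    return word;
}

/* The four-instruction sh8 sequence, on the bit pattern of a 32-bit value. */
uint32_t sh8_bits(uint32_t in)
{
    uint16_t r0 = (uint16_t)(in & 0xFFFFu);  /* low half  */
    uint16_t r1 = (uint16_t)(in >> 16);      /* high half */
    uint16_t r2;
    r0 = lsr8(r0);                /* lsr8 r0, r0   */
    r2 = lsl8(r1);                /* lsl8 r1, r2   */
    r0 = (uint16_t)(r0 | r2);     /* or r0, r2, r0 */
    r1 = asr8(r1);                /* asr8 r1, r1   */
    return ((uint32_t)r1 << 16) | r0;
}
```

Under these assumptions, `sh8_bits(0x12345678)` gives `0x00123456`, and a value with the top bit set keeps its sign in the upper byte, matching an arithmetic shift right by 8.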
|
Thu Oct 29, 2020 3:20 pm |
|
 |
oldben
Joined: Mon Oct 07, 2019 2:41 am Posts: 768
|
Don't forget to add clock jitter into the model. Clocks only come in a few sizes: 14.32 MHz is what you can get, rather than 14 MHz. Ben.
|
Thu Oct 29, 2020 6:38 pm |
|
 |
joanlluch
Joined: Fri Mar 22, 2019 8:03 am Posts: 328 Location: Girona-Catalonia
|
oldben wrote: Don't forget to add clock jitter into the model. Clocks only come in a few sizes: 14.32 MHz is what you can get, rather than 14 MHz. Ben.

Hi Ben, thanks for your input.

The processor timing chart can be regarded as an approximation. I'm using typical propagation delays for the ICs that specify them, or the average of the min and max figures otherwise. That is, I am not computing any explicit capacitance delays, or considering PCB trace effects, temperature, or clock skew, but just using the default propagation delays that are available in the specs. As long as I take care not to connect single gate outputs to a large number of inputs, the specified delays should be rather conservative in most cases, so actual delays should be slightly better than specified. Or so I think.

At the end of the day, the timing chart is mostly a tool to become aware of where the critical path might be, to help with optimising performance. The aim is getting to 16 MHz, which I think is feasible, so I guess I should not need a 14 MHz clock anyway.
|
Fri Oct 30, 2020 6:38 pm |
|
 |
oldben
Joined: Mon Oct 07, 2019 2:41 am Posts: 768
|
I have been sticking to LS TTL since you get slow edge rates. High-speed chips need careful design and often a 4-layer PCB. Good luck with the 16 MHz clock. Ben.
|
Fri Oct 30, 2020 8:25 pm |
|
 |
joanlluch
Joined: Fri Mar 22, 2019 8:03 am Posts: 328 Location: Girona-Catalonia
|
oldben wrote: I have been sticking to LS TTL since you get slow edge rates. High-speed chips need careful design and often a 4-layer PCB. Good luck with the 16 MHz clock. Ben.

Hi Ben,

You have a really good point here, and I'm just an inexperienced noob at hardware design and PCB layout, so I can't really say anything about this. On the positive side, I can count on the invaluable help of Dieter (ttlworks in the 6502 forum) and Drass, who are both very experienced with high-MHz designs. They are both answering a lot of the questions I ask them in the background. In fact, I have learned almost everything I know about hardware design from them.

Drass is currently attempting an out-of-this-world pipelined, discrete-chip-based 6502 processor with a 100 MHz goal in mind. See this thread in the 6502.org forum in case you have not seen it before: http://forum.6502.org/viewtopic.php?f=4&t=6282

Well, in the face of that, my project just looks ridiculously naive, and totally feasible. This is not to claim, however, that this processor will someday leave the "on paper" state. It may never do so, because I have some health/neurological issues that may prevent me from building it in real hardware.

Joan
|
Sat Oct 31, 2020 11:54 am |
|
 |
oldben
Joined: Mon Oct 07, 2019 2:41 am Posts: 768
|
A 100 MHz 6502 is an interesting goal. I would go for a 50 MHz version, so that one can find memory for it to run with. Ben.
|
Sat Oct 31, 2020 5:05 pm |
|
 |
Garth
Joined: Tue Dec 11, 2012 8:03 am Posts: 285 Location: California
|
oldben wrote: A 100 MHz 6502 is an interesting goal. I would go for a 50 MHz version, so that one can find memory for it to run with. Ben.

I've seen SRAM down to 6 ns (and that was quite a few years ago), which should be plenty fast for 50 MHz, hopefully more. 10 ns is pretty run-of-the-mill today.
_________________http://WilsonMinesCo.com/ lots of 6502 resources
|
Sat Oct 31, 2020 6:12 pm |
|
 |
joanlluch
Joined: Fri Mar 22, 2019 8:03 am Posts: 328 Location: Girona-Catalonia
|
I am now forwarding the ALU result to the next cycle for the purpose of computing the expensive flags (namely the Z flag) and the T flag. This allows me to end the cycle as soon as the ALU output is ready. The data flow of the ALU has been modified so that flags are no longer required immediately by the ALU, so there's enough time to compute them in the cycle where they may be used.

This reduced the critical path of the decode-execute stage by 16 ns, so I am now well into the 16 MHz zone, at 56 ns in particular. If we concede that some delays may be better in practice, I could be lucky and get it working at 20 MHz (keeping it properly cooled, that is). Now the fetch stage needs to be reworked, because that's where the critical path now goes.

These are the relevant logic models of the changes referred to above: https://github.com/John-Lluch/CPU74/tree/master/Docs/LogisimDocsV12
and the new resulting timing chart: https://github.com/John-Lluch/CPU74/blob/master/Docs/TimingChartV12.png

The refactoring of the ALU involves all the instructions that use the T flag, which are the conditional branches and selects. These instructions now have a late use of the T flag, which happens almost at the end of the execution cycle, and there are no longer two versions of the ALU control signals.

The deferred computing of the Status Register flags as described has an unwelcome hidden problem, however. The SR is no longer available as a physical register, but is computed from registered data on every cycle. It's kind of a 'virtual' register. This means that SR reads can be easily implemented, since the 'virtual' Z, C, T values are available by the end of any cycle. Unfortunately, it complicates SR writes or restores, for example upon returning from an interrupt. That's a problem I will need to solve if I stick with this implementation.
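The clock figures quoted in this post can be checked with a little arithmetic. The sketch below only converts between clock period and frequency and reports the slack of the 56 ns path against a candidate clock; the function names are made up for the example.

```c
/* Clock period in nanoseconds for a frequency given in MHz. */
double period_ns(double mhz)
{
    return 1000.0 / mhz;
}

/* Frequency in MHz whose period equals the given critical path. */
double limit_mhz(double path_ns)
{
    return 1000.0 / path_ns;
}

/* Slack of a path against a clock: positive means the path fits
   inside one period, negative means it is too slow by that much. */
double slack_ns(double path_ns, double mhz)
{
    return period_ns(mhz) - path_ns;
}
```

A 56 ns path corresponds to a limit of roughly 17.9 MHz. At 16 MHz (62.5 ns period) it leaves 6.5 ns of slack, while 20 MHz (50 ns period) would need the real delays to come out about 6 ns better than the figures used in the chart.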
|
Mon Nov 02, 2020 11:57 am |
|
 |
BigEd
Joined: Wed Jan 09, 2013 6:54 pm Posts: 1821
|
Will the flags persist, such that you can take a series of branches after a single arithmetic operation? (Will a store recompute the flags, as it does on 6800, but does not on the 6502?)
|
Mon Nov 02, 2020 12:08 pm |
|
 |
joanlluch
Joined: Fri Mar 22, 2019 8:03 am Posts: 328 Location: Girona-Catalonia
|
BigEd wrote: Will the flags persist, such that you can take a series of branches after a single arithmetic operation? (Will a store recompute the flags, as it does on 6800, but does not on the 6502?)

That's an interesting observation. Only some instructions modify the flags. Others do not. For example, register-to-register moves, memory loads/stores, and a number of other instructions do not alter the flags. Or, looking at it the other way round, only logic and arithmetic operations modify the flags.

So yes, the flags persist through non-flag-altering instructions, and this is in fact a practice heavily exploited by the compiler, particularly around select instructions. The fact that the SR register is now 'virtual' does not change that behaviour, because after all the 'registered' data that is stored to compute the actual flags is only updated after flag-altering instructions. The net effect is as if physical flags actually existed.
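A minimal behavioural sketch of this 'virtual' flag scheme, in C, is given below. It is only an interpretation of the description in this thread: the `FlagSource` struct and the function names are hypothetical, latching the carry directly is an assumption, and the T flag is left out because its exact definition is not given here.

```c
#include <stdbool.h>
#include <stdint.h>

/* Registered data from which the flags are derived later. */
typedef struct {
    uint16_t last_result;  /* ALU output of the last flag-altering op */
    bool carry;            /* assumed to be latched as-is */
} FlagSource;

/* End of an execute cycle: the data is latched only when the
   instruction is a flag-altering (logic or arithmetic) one. */
void end_of_cycle(FlagSource *fs, uint16_t alu_out, bool carry_out,
                  bool alters_flags)
{
    if (alters_flags) {
        fs->last_result = alu_out;
        fs->carry = carry_out;
    }
}

/* Z is not stored anywhere; it is recomputed whenever it is needed. */
bool flag_z(const FlagSource *fs)
{
    return fs->last_result == 0;
}

bool flag_c(const FlagSource *fs)
{
    return fs->carry;
}
```

Because a move or a load/store leaves `FlagSource` untouched, `flag_z` keeps returning the same value across several branches, which is the "as if physical flags existed" effect. The sketch also hints at why restoring SR is awkward: there is no stored Z bit to write back, only a result value it is derived from.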
|
Mon Nov 02, 2020 12:21 pm |
|
 |
BigEd
Joined: Wed Jan 09, 2013 6:54 pm Posts: 1821
|
Thanks - that sounds good.
(I think my instinct might be to update the flags on a register-register move. Perhaps ideally it falls out of the way the machine is built.)
|
Mon Nov 02, 2020 12:42 pm |
|