Last visit was: Thu May 01, 2025 5:14 pm
|
It is currently Thu May 01, 2025 5:14 pm
|
Author |
Message |
roelh
Joined: Mon Oct 07, 2019 1:26 pm Posts: 50
|
Hi Joan, Robinson, thank you for your replies.
Joan, yes my code generator does not do any optimizing yet. Everything simply gets allocated in the workspace area. A pity that all code is optimized away, this leaves almost nothing to compare. Perhaps I must use another demo code that can not be optimized so much.
Robinsonb5, your assembly code takes some practice to understand. I read through your postings and I understand that it is this way because you can assemble this to machine code without having a dedicated assembler. Personally I think PDP-11 or 68000 code is easier to read, that's why I choose pdp-11 lookalike instructions for the Kobold assembly. In your case I would rename temp to ACC and have explicit instructions like MOV (R2),ACC
It is surprising to see that your CPU uses up to 2 times more instructions as Kobold to do the same thing. And the Kobold code has not even been optimized. But since you're using an FPGA there is plenty of speed so I guess it's not a problem for you. I like your endevour to get performance out of 8-bit instructions. But the single 'temp' accumulator seems to be a bottleneck. Did you think about using a small stack of accumulators (8,16 or 32 of them) ? Good that your C compiler doesn't optimize (as Joan's does), because that would also optimize everything away and leave nothing to compare.
|
Sun Dec 29, 2019 9:44 am |
|
 |
joanlluch
Joined: Fri Mar 22, 2019 8:03 am Posts: 328 Location: Girona-Catalonia
|
roelh wrote: Hi Joan, Robinson, thank you for your replies. Joan, yes my code generator does not do any optimizing yet. Everything simply gets allocated in the workspace area. A pity that all code is optimized away, this leaves almost nothing to compare. Perhaps I must use another demo code that can not be optimized so much.
Hi Roelh, LLVM optimises away load/stores to local variables that have not an effect outside the function, or that can be evaluated as constant expressions. The trick to prevent this is to use function arguments that are actually used to compute return values, or to update global vars, or to update arguments passed by reference. Also, small functions within the visibility scope of the compiler (namely on the same C source file, or with known bodies in included header files) tend to be inlined so they are not called at all, but this can be explicitly prevented if required by applying the 'noinline' attribute to these functions. I made some changes to your code to show this, and this is the result: Modified C code Code: __attribute__((noinline)) int sum (int num1, int num2) { return num1+num2; } int (*f2p) (int, int); int n;
int f(int op1, int *op2) { *op2 = sum(op1,n); f2p = sum; //Calling a function using function pointer op1 = (*f2p)(10, 13); return op1; }
// demo structure member, array element, and if statement
char ar[8];
struct tt { int r1 ; int r2; };
int i;
void test(struct tt* p){ if (i==5) { i = p-> r2; } else ar[6] = i; } CPU74 Code: # --------------------------------------------- # sum # --------------------------------------------- .globl sum sum: add r1, r0, r0 ret
# --------------------------------------------- # f # --------------------------------------------- .globl f f: // function enters with arguments in r0 and r1 add SP, -2, SP // make room on the stack for callee register saving (this is equivalent to 'push r4', but I removed push/pop instructions from the set) st.w r4, [SP, 0] // save r4 to the stack (required as per the calling conventions) mov r1, r4 // second argument (it's a pointer, but this does not matter at this point) goes to r4 ld.w [&n], r1 // load r1 with the contents of global 'n' call @sum // call sum with r0 and r1 as arguments st.w r0, [r4, 0] // store the sum retun value to the address pointed to by r4 mov &sum, r0 // load the address of sum in r0 st.w r0, [&f2p] // store the address of sum into global var f2p mov 10, r0 // set r0 to 10 mov 13, r1 // set r1 to 13 call @sum // call sum with arguments in r0 and r1, result goes to r0 ld.w [SP, 0], r4 // restore r4 ( this, along with the next instruction is equivalent to 'pop r4') add SP, 2, SP // restore the previous frame ret // return with the result in r0
# --------------------------------------------- # test # --------------------------------------------- .globl test test: ld.w [&i], r1 cmp.ne r1, 5 brcc .LBB2_2 ld.w [r0, 2], r0 st.w r0, [&i] ret .LBB2_2: st.b r1, [&ar+6] ret
# --------------------------------------------- # Global Data # --------------------------------------------- .comm n,2,2 .comm f2p,2,2 .comm i,2,2 .comm ar,8,2
If I do not apply the noinline attribute to the sum function then I get what it would be more natural for LLVM, which is this: Code: # --------------------------------------------- # sum # --------------------------------------------- .globl sum sum: add r1, r0, r0 ret
# --------------------------------------------- # f # --------------------------------------------- .globl f f: // function enters with arguments in r0 and r1 (r1 is a pointer) ld.w [&n], r2 // load variable n in r2 add r2, r0, r0 // add to first argument (this is the inlined sum function) st.w r0, [r1, 0] // store the sum retun value to the address pointed to by r1 mov &sum, r0 // load the address of sum in r0 st.w r0, [&f2p] // store the address of sum into global var f2p mov 23, r0 // this again is the inlined call to sum, this time resulting in a constant value ret // return r0
# --------------------------------------------- # test # --------------------------------------------- .globl test test: ld.w [&i], r1 cmp.ne r1, 5 brcc .LBB2_2 ld.w [r0, 2], r0 st.w r0, [&i] ret .LBB2_2: st.b r1, [&ar+6] ret
# --------------------------------------------- # Global Data # --------------------------------------------- .comm n,2,2 .comm f2p,2,2 .comm i,2,2 .comm ar,8,2
Now, the sum function is inlined and the result is much faster and shorter code. In any of the examples, the function pointer is still not seen to actually being used to call a function. This is because the compiler knows which function the pointer points to at compile time, so it just generates code to call the function directly. For the compiler to actually use a function pointer, it must be dereferenced in a scope where the target function is not known. For example if the pointer is set on a function, and used on a different one, or if the pointer is passed as an argument. I hope this is of interest. Joan
|
Sun Dec 29, 2019 12:18 pm |
|
 |
roelh
Joined: Mon Oct 07, 2019 1:26 pm Posts: 50
|
Hi Joan, I took your modified code as the new demo code. Also modified the array access to show some real indexing (with I).
For function f, the difference is not very great, with 18 for Kobold and 14 for CPU74. Other functions will also improve when I put some optimizations in the generated code.
It is tempting to work on the code generator, but I have an empty Kobold PCB waiting to be populated...
|
Sun Dec 29, 2019 1:06 pm |
|
 |
joanlluch
Joined: Fri Mar 22, 2019 8:03 am Posts: 328 Location: Girona-Catalonia
|
roelh wrote: Hi Joan, I took your modified code as the new demo code. Also modified the array access to show some real indexing (with I).
For function f, the difference is not very great, with 18 for Kobold and 14 for CPU74. Other functions will also improve when I put some optimizations in the generated code.
It is tempting to work on the code generator, but I have an empty Kobold PCB waiting to be populated... Indeed, the difference is not that great. One aspect that slightly hurts the CPU74 ISA is the lack of push/pop instructions. The ISA defines that up to 4 function arguments are passed on registers (r0 to r3) and the remaining ones (or any varargs) are passed on the stack. Functions are not meant to preserve registers r0 to r3, but must preserve r4 to r7 (The SP and PC are not part of the general purpose set). Any function can potentially modify r0...r3, so these registers can not be used across calls, instead r4 is used in your code example. But r4 is one of the 'must preserve' registers, so that's why it must be saved/restored from the stack. The lack of push/pop means 4 instructions in total just for saving/restoring one register, and that's why inlining greatly benefits this particular case. On the more general case, the penalty is not that big because a stack frame must be generally allocated anyway for local variables or register spills, or a greater number of registers must be saved. In some cases, implementing the 'tail call' optimisation would also help, but I have not gone that far... I think you will have to consider the same kind of balance when you define what to do with registers across function calls, should them all be preserved?, only some of them?, or none of them? In all cases this adds more code generated in order to adhere with the chosen convention. Your current approach seems to be to store all arguments to the local function frame, which is fine, so you can access arguments across function calls, and probably you assume that registers are not preserved by called functions. Depending on the available memory addressing modes, this may or may not reduce code as compared with making a more extensive use of registers. My gut feeling is that pure load/store architectures benefit more from saved/restored registers performed once per function call, than a potentially bigger number of memory accesses. But I guess this is highly debatable and dependent on the actual architecture.
|
Sun Dec 29, 2019 2:18 pm |
|
 |
BigEd
Joined: Wed Jan 09, 2013 6:54 pm Posts: 1821
|
(Is this something one could explore and tune? Recompile the same benchmarks with the compiler set up to honour different calling conventions and see how many registers it's best to use for scratch, how many for parameters, how many to preserve?)
|
Sun Dec 29, 2019 3:01 pm |
|
 |
robinsonb5
Joined: Wed Nov 20, 2019 12:56 pm Posts: 92
|
roelh wrote: Robinsonb5, your assembly code takes some practice to understand. I read through your postings and I understand that it is this way because you can assemble this to machine code without having a dedicated assembler. Personally I think PDP-11 or 68000 code is easier to read, that's why I choose pdp-11 lookalike instructions for the Kobold assembly. 68000 was the first assembly language I learned, so it's naturally the one I consider easiest to read! Quote: In your case I would rename temp to ACC and have explicit instructions like MOV (R2),ACC I originally intended that register to be an accumulator, i.e. the target of ALU instructions, but actually I ended up with ALU instructions writing their results to the nominated register instead, with tmp simply providing the implicit second operand, so I'm not sure ACC is strictly appropriate. It would certainly be clearer to be specify the second operand explicitly, though. Quote: It is surprising to see that your CPU uses up to 2 times more instructions as Kobold to do the same thing. And the Kobold code has not even been optimized. Yes, though the code currently emitted by the compiler is functional but far from optimal - I've found that by hand-coding I can reduce code size by anything between a quarter and two thirds - for exactly the reasons that prompted you to start this project. So I shall be watching its progress with interest! Quote: But since you're using an FPGA there is plenty of speed so I guess it's not a problem for you. True - an Altera DE2 board can run this comfortably at 133MHz. Quote: I like your endevour to get performance out of 8-bit instructions. But the single 'temp' accumulator seems to be a bottleneck. Yes, it is - especially since I haven't yet implemented any kind of result forwarding. Quote: Did you think about using a small stack of accumulators (8,16 or 32 of them)? Yes, I considered that, and it would be interesting to explore. I haven't yet because this project doesn't exist purely for its own sake - I have an application in mind for it: a project where there's next to no block RAM left, so I need a CPU with small logic footprint, no block RAM used for a register file and good code density so the ROM doesn't take up too much block RAM either. I'm hoping that I can improve the code generator to the point where the last requirement is met!
|
Sun Dec 29, 2019 8:52 pm |
|
 |
robfinch
Joined: Sat Feb 02, 2013 9:40 am Posts: 2307 Location: Canada
|
I figured I'd share the output from my compiler. The compiler isn't smart enough to use registers automatically, if one wants to use registers the register keyword must be used. Register arguments can generally be used only for leaf routines. I modified the test program slightly: Code: int sum (register int num1, register int num2) { return num1+num2; } int (*f2p) (register int, register int); int n;
int f(int op1, int *op2) { *op2 = sum(op1,n); f2p = sum; //Calling a function using function pointer op1 = (*f2p)(10, 13); return op1; }
// demo structure member, array element, and if statement
char ar[8];
struct tt { int r1 ; int r2; };
int i;
void test(register struct tt* p){ if (i==5) { i = p-> r2; } else ar[6] = i; } Compiler output isn't too bad, but it's no match for LLVM. Code: code align 8 ;==================================================== ; Basic Block 0 ;==================================================== public code _sum: add $v0,$a0,$a1 TestC74_8: ret $lk1 endpublic
bss align 2 public bss _n: fill.b 4,0x00
endpublic code align 8 ;==================================================== ; Basic Block 0 ;==================================================== public code _f: sub $sp,$sp,#16 st $fp,[$sp] st $r0,4[$sp] mov $r23,$lk1 st $r23,12[$sp] mov $fp,$sp sub $sp,$sp,#16 st $r11,0[$sp] ld $r11,16[$fp] ; *op2 = sum(op1,n); ld $t0,20[$fp] st $t0,-8[$fp] ld $a1,_n mov $a0,$r11 jal $lk1,_sum ld $t0,-8[$fp] mov $t1,$v0 st $t1,[$t0] ; f2p = sum; ldi $t0,#_sum st $t0,_f2p ; op1 = (*f2p)(10, 13); ld $t0,_f2p st $t0,-8[$fp] ldi $a1,#13 ldi $a0,#10 jal $lk1,[$t0] mov $t0,$v0 mov $r11,$t0 ; return op1; mov $v0,$r11 TestC74_13: TestC74_16: ld $r11,0[$sp] mov $sp,$fp ld $fp,[$sp] ld $r23,12[$sp] mov $lk1,$r23 add $sp,$sp,#24 ret $lk1 endpublic
bss align 8 align 8 dw $FFF0200000000001 public bss _ar: fill.b 8,0x00
endpublic align 2 public bss _i: fill.b 4,0x00
endpublic code align 8 ;==================================================== ; Basic Block 0 ;==================================================== public code _test: ; if (i==5) ld $v0,_i cmp $cr2,$v0,#5 bne $cr2,TestC74_27 ; { i = p-> r2; } ld $v0,4[$a0] st $v0,_i bra TestC74_28 TestC74_27: ; else ar[6] = i; ldb $v0,_i stb $v0,_ar+6 TestC74_28: TestC74_23: TestC74_26: ret $lk1 endpublic
rodata align 16 ; global _test ; global _ar ; global _f2p ; global _sum ; global _f ; global _i ; global _n
_________________Robert Finch http://www.finitron.ca
|
Mon Dec 30, 2019 6:57 am |
|
 |
roelh
Joined: Mon Oct 07, 2019 1:26 pm Posts: 50
|
Hi Rob, thanks for sharing you compiler output. Is this the C64 compiler that I found on your website ? I tried to look at C64.zip but I got a 404 error. (on this page: http://www.finitron.ca/Projects/C64.html ). For which processor is this compiled ? $lk1, $t0, $t1, $v0 did puzzle me. Are these alias names for certain registers ? I saw that when a stack frame is needed, the entry and exit code is quite large...
|
Mon Dec 30, 2019 9:34 am |
|
 |
robfinch
Joined: Sat Feb 02, 2013 9:40 am Posts: 2307 Location: Canada
|
No, it's a much more recent version with a lot of fixes. I should really update the website. Quote: For which processor is this compiled ? The compile is for a custom processor I'm working on. It has an oddball number of bits per byte, to make things interesting. Quote: $lk1, $t0, $t1, $v0 did puzzle me. Are these alias names for certain registers ?
Yes, those are aliases for certain registers. $tx is a temporary, $vx is a return value, $lkx is a link register. $crx is a compare result register, for which there is a separate register file (8 entries). There are four link registers available as a separate register file. Quote: I saw that when a stack frame is needed, the entry and exit code is quite large... Yes, 16 bytes. It's a fixed size to keep the compiler simpler. There's the return address, exception return address and frame pointer to store, plus one extra. I wasn't too worried about the frame size because there's lots of ram and stacks generally aren't very deep.
_________________Robert Finch http://www.finitron.ca
|
Mon Dec 30, 2019 1:52 pm |
|
 |
roelh
Joined: Mon Oct 07, 2019 1:26 pm Posts: 50
|
Hi all, I had some progress on the C compiler. As an example, I took the 8-queens problem (as posted in the article https://www.sigbus.info/how-i-wrote-a-self-hosting-c-compiler-in-40-days ). The program is compiled by the C compiler, that delivers Kobold Assembly. The c compiler has no I/O yet, so in the assemby output I added instructions to write to screen. In the simulator, everything that is written to 0xF000 is going to screen, so I added in the printf function: - mov 0xF000,A2 - mov D3,(A2) The resulting Kobold machine code was successfully executed by the simulator on the webpage. The simulator did about 100 instructions per second, and the first solution took about 20 minutes. Here is the (slightly modified) C program: Code: void printf(char c) { }
void print_board(int board[8][8]) { int i,j; for (i = 0; i < 8; i++) { for (j = 0; j < 8; j++) { if (board[i][j]) printf('Q'); else printf('.'); } printf('\n'); } printf('\n'); printf('\n'); }
int conflict(int board[8][8], int row, int col) { int i; for (i = 0; i < row; i++) { if (board[i][col]) return 1; int j; j = row - i; if (0 < col - j + 1) if (board[i][col - j]) return 1; if (col + j < 8) if (board[i][col + j]) return 1; } return 0; }
void solve(int board[8][8], int row) { if (row == 8) { print_board(board); return ; } int i; for ( i = 0; i < 8; i++) { if (conflict(board, row, i)==0) { board[row][i] = 1; solve(board, row + 1); board[row][i] = 0; } } } int board[8][8];
int main() { int i,j; for (i = 0; i < 8; i++) for (j = 0; j < 8; j++) board[i][j] = 0; solve(board, 0); }
And here is the resulting Kobold K2 assembly listing (modified so it will give character output): Code: 0 ;--- list of types --- 1 ; int basic 2 2 ; void basic 0 3 ; char basic 1 4 ; float basic 4 5 ; bool basic 1 6 ; array-of-chars array 0 7 ; -none- basic 2 8 ; _tt1 array 16 9 ; _tt2 array 128 10 ; _tt3 array 16 11 ; _tt4 array 128 12 ; _tt5 array 16 13 ; _tt6 array 128 14 ; _tt7 array 16 15 ; _tt8 array 128 16 code section 00000 211Er 17 MOVP 1,WP; wp page init 00002 611Cs 18 MOV 256,WP; wp init 00004 3A01 19 MOVP 1,A2; A2 page init 00006 3B01 20 MOVP 1,A3; A3 page init 00008 601At 21 JMP main 0000A AAAA 22 data section 0000C AAAA 0000E AAAA 00010 AAAA 00012 AAAA 00014 AAAA 00016 AAAA 00018 AAAA 0001A 01EAt 0001C 0100s 0001E 0001r 23 Z_0: EQU 32 24 Z_2: EQU 34 25 Z_4: EQU 36 26 Z_6: EQU 38 27 Z_8: EQU 40 28 Z_10: EQU 42 29 Z_12: EQU 44 30 Z_14: EQU 46 31 Z_16: EQU 48 32 Z_18: EQU 50 33 Z_20: EQU 52 34 Z_22: EQU 54 35 Z_24: EQU 56 36 Z_26: EQU 58 37 Z_28: EQU 60 38 Z_30: EQU 62 39 board: EQU 0x1000 40 ;--- list of global vars --- 41 ; board _tt8 4096 42 code section 43 ;---> Removed lines that contain debugging output of the optimizer 219 220 ;***************************************** 221 ; (1) function: printf 222 ;***************************************** 223 ; This is a leaf function 224 225 ;--- function entry --- 226 printf: 227 ; MOV D3,(Z_0) // parameter 1 c 228 ;--- list of local vars --- 229 ; c char (Z_0) 230 ;--- function body --- 00020 621Eu 231 mov 0xF000,A2 00022 9740 232 mov D3,(A2) 233 ;--- function exit --- 00024 5800 234 MOV D0,PC; return 235 236 ;***************************************** 237 ; (2) function: print_board 238 ;***************************************** 239 240 ;--- function entry --- 241 print_board: 00026 943E 242 MOV D0,(WP+30); save return address 00028 7D20 243 MOV 32,D1; wp increment 0002A 5120 244 ADD D1,WP; 0002C 9720 245 MOV D3,(WP+0) // parameter 1 board 246 ;--- list of local vars --- 247 ; board _tt2 (WP+0) 248 ; i int (WP+2) 249 ; j int (WP+4) 250 ;--- function body --- 251 ; for [[ i = 0] [bool i < 8] [int i = [int 252 ; i + 1]]] 253 ; [ i = 0] 0002E 7F00 254 MOV 0,D3; load 0 00030 9722 255 MOV D3,(WP+2); store at i 00032 700E+ 256 L_1: 00034 3F08 257 MOVC 8,D3; compare: mov_compl 00036 4723 b258 ADDI (WP+2),D3; 00038 A008v 259 BRC L_2 260 ; for [[ j = 0] [bool j < 8] [int j = [int 261 ; j + 1]]] 262 ; [ j = 0] 0003A 700C/ 263 MOV 0,D3; load 0 0003C 008Av 0003E F000u 00040 7F00 00042 9724 264 MOV D3,(WP+4); store at j 00044 7006+ 265 L_3: 00046 3F08 266 MOVC 8,D3; compare: mov_compl 00048 4725 b267 ADDI (WP+4),D3; 0004A A018g 268 BRC L_4 269 ; if [bool [int board : [int [int 8 * i] + 270 ; j]] != 0] 0004C 6722 271 MOV (WP+2),D3; 272 ; multiply by 8 0004E 9FFE 273 SHL D3 00050 9FFE 274 SHL D3 00052 9FFE 275 SHL D3 00054 4724 276 ADD (WP+4),D3; 00056 9FFE 277 SHL D3; multiply by 2 00058 4320 278 ADD (WP+0),D3,A3; add local array base 0005A 6760 279 MOV (A3+0),D3; 0005C 701A/b280 ADD 65535,D3; 0005E 007Ag 00060 471Er 00062 F00C. 281 BRNC L_5 282 ; [void printf [81]] 00064 7F51 283 MOV 81,D3; 'Q' load arg 0 00066 7C6A 284 MOV L_7,D0 00068 7820 285 CALL printf; 286 L_7: 0006A 7874 287 JMP L_6 288 L_5: 289 ; [void printf [46]] 0006C 7F2E 290 MOV 46,D3; '.' load arg 0 0006E 7C72 291 MOV L_8,D0 00070 7820 292 CALL printf; 293 L_8: 00072 7002+ 294 L_6: 295 ; [int j = [int j + 1]] 00074 6725 296 MOVI (WP+4),D3; load int,j,+,1 00076 9724 297 MOV D3,(WP+4); store at j 00078 7846 298 JMP L_3 299 L_4: 300 ; [void printf [10]] 0007A 7F0A 301 MOV 10,D3; load arg 0 0007C 7006/ 302 MOV L_9,D0 0007E FFFFr 00080 7C84 00082 7820 303 CALL printf; 304 L_9: 305 ; [int i = [int i + 1]] 00084 6723 306 MOVI (WP+2),D3; load int,i,+,1 00086 9722 307 MOV D3,(WP+2); store at i 00088 7834 308 JMP L_1 309 L_2: 310 ; [void printf [10]] 0008A 7F0A 311 MOV 10,D3; load arg 0 0008C 7C90 312 MOV L_10,D0 0008E 7820 313 CALL printf; 314 L_10: 315 ; [void printf [10]] 00090 7F0A 316 MOV 10,D3; load arg 0 00092 7C96 317 MOV L_11,D0 00094 7820 318 CALL printf; 319 L_11: 320 ;--- function exit --- 00096 3D20 321 MOVC 32,D1; wp decrement 00098 5121 322 ADDI D1,WP; 0009A 603E 323 MOV (WP+30),PC; return 324 325 ;***************************************** 326 ; (3) function: conflict 327 ;***************************************** 328 ; This is a leaf function 329 330 ;--- function entry --- 331 conflict: 0009C 9F20 332 MOV D3,(Z_0) // parameter 1 board 0009E 7004/ 333 MOV D2,(Z_2) // parameter 2 row 000A0 9E22 000A2 7760 334 MOV A3,D3 000A4 9F24 335 MOV D3,(Z_4) // parameter 3 col 336 ;--- list of local vars --- 337 ; board _tt4 (Z_0) 338 ; row int (Z_2) 339 ; col int (Z_4) 340 ; i int (Z_6) 341 ; j int (Z_8) 342 ;--- function body --- 343 ; for [[ i = 0] [bool i < row] [int i = 344 ; [int i + 1]]] 345 ; [ i = 0] 000A6 7F00 346 MOV 0,D3; load 0 000A8 9F26 347 MOV D3,(Z_6); store at i 000AA 700C+ 348 L_12: 000AC 2F22 349 MOVC (Z_2),D3; compare: mov_compl 000AE 4F27 b350 ADDI (Z_6),D3; 000B0 A012r 351 BRC L_13 352 ; if [bool [int board : [int [int 8 * i] + 353 ; col]] != 0] 000B2 6F26 354 MOV (Z_6),D3; 355 ; multiply by 8 000B4 9FFE 356 SHL D3 000B6 9FFE 357 SHL D3 000B8 9FFE 358 SHL D3 000BA 4F24 359 ADD (Z_4),D3; 000BC 7014/ 360 SHL D3; multiply by 2 000BE 0150r 000C0 9FFE 000C2 4B20 361 ADD (Z_0),D3,A3; add local array base 000C4 6760 362 MOV (A3+0),D3; 000C6 471Egb363 ADD 65535,D3; 000C8 E01Ch 364 BRNC L_14 365 ; {return:1} 000CA 7F01 366 MOV 1,D3; return value 000CC 5800 367 MOV D0,PC; return 368 L_14: 369 ; [ j = [int row - i]] 000CE 2F26 370 MOVC (Z_6),D3; load int,row,-,i 000D0 4F23 371 ADDI (Z_2),D3; 000D2 9F28 372 MOV D3,(Z_8); store at j 373 ; if [bool 0 < [bool [bool col - j] + 1]] 000D4 2F28 374 MOVC (Z_8),D3; compare: mov_compl 000D6 4F25 375 ADDI (Z_4),D3; 000D8 5F01 376 ADD 1,D3; 000DA 7012/ 377 COM D3; complement 000DC 00CEh 000DE FFFFg 000E0 1F00 000E2 5F01 b378 ADD 1,D3; +1 000E4 A01Er 379 BRC L_15 380 ; if [bool [int board : [int [int 8 * i] + 381 ; [int col - j]]] != 0] 000E6 6F26 382 MOV (Z_6),D3; 383 ; multiply by 8 000E8 9FFE 384 SHL D3 000EA 9FFE 385 SHL D3 000EC 9FFE 386 SHL D3 000EE 9F3C 387 MOV D3,(Z_28); 000F0 2F28 388 MOVC (Z_8),D3; 000F2 4F25 389 ADDI (Z_4),D3; 000F4 4F3C 390 ADD (Z_28),D3; 000F6 9FFE 391 SHL D3; multiply by 2 000F8 4B20 392 ADD (Z_0),D3,A3; add local array base 000FA 601C! 393 MOV (A3+0),D3; 000FC 0100! 000FE 010Cr 00100 6760 00102 471Egb394 ADD 65535,D3; 00104 E01Ch 395 BRNC L_16 396 ; {return:1} 00106 7F01 397 MOV 1,D3; return value 00108 5800 398 MOV D0,PC; return 399 L_16: 0010A 7002+ 400 L_15: 401 ; if [bool [bool col + j] < 8] 0010C 6F28 402 MOV (Z_8),D3; compare using stack 0010E 4F24 403 ADD (Z_4),D3; 00110 9F3C 404 MOV D3,(Z_28); 00112 3F08 405 MOVC 8,D3; compare: mov_compl 00114 4F3D b406 ADDI (Z_28),D3; compare from stack 00116 A00Ei 407 BRC L_17 408 ; if [bool [int board : [int [int 8 * i] + 409 ; [int col + j]]] != 0] 00118 7014/ 410 MOV (Z_6),D3; 0011A 014Ai 0011C 010Ah 0011E FFFFg 00120 6F26 411 ; multiply by 8 00122 9FFE 412 SHL D3 00124 9FFE 413 SHL D3 00126 9FFE 414 SHL D3 00128 9F3C 415 MOV D3,(Z_28); 0012A 6F28 416 MOV (Z_8),D3; 0012C 4F24 417 ADD (Z_4),D3; 0012E 4F3C 418 ADD (Z_28),D3; 00130 9FFE 419 SHL D3; multiply by 2 00132 4B20 420 ADD (Z_0),D3,A3; add local array base 00134 6760 421 MOV (A3+0),D3; 00136 601E!b422 ADD 65535,D3; 00138 ???? 0013A ???? 423 BRNC L_18 0013C ???? 424 ; {return:1} 0013E 0140! 425 MOV 1,D3; return value 00140 471Eg 00142 E01Ch 00144 7F01 00146 5800 426 MOV D0,PC; return 427 L_18: 00148 7002+ 428 L_17: 429 ; [int i = [int i + 1]] 0014A 6F27 430 MOVI (Z_6),D3; load int,i,+,1 0014C 9F26 431 MOV D3,(Z_6); store at i 0014E 78AC 432 JMP L_12 433 L_13: 434 ; {return:0} 00150 7F00 435 MOV 0,D3; return value 00152 5800 436 MOV D0,PC; return 437 ;--- function exit --- 438 ; Exit was handled by return statement 439 440 ;***************************************** 441 ; (4) function: solve 442 ;***************************************** 443 444 ;--- function entry --- 445 solve: 00154 943E 446 MOV D0,(WP+30); save return address 00156 7D20 447 MOV 32,D1; wp increment 00158 5120 448 ADD D1,WP; 0015A 700C/ 449 MOV D3,(WP+0) // parameter 1 board 0015C 0148h 0015E FFFFg 00160 9720 00162 9622 450 MOV D2,(WP+2) // parameter 2 row 451 ;--- list of local vars --- 452 ; board _tt6 (WP+0) 453 ; row int (WP+2) 454 ; i int (WP+4) 455 ;--- function body --- 456 ; if [bool row == 8] 00164 2722 457 MOVC (WP+2),D3; compare: mov_complement 00166 5F09 458 ADD 9,D3; +1 00168 471Erb459 ADD 65535,D3; 0016A A01Cs 460 BRC L_19 461 ; [void print_board [board]] 0016C 6720 462 MOV (WP+0),D3; load arg 0 0016E 641At 463 MOV L_20,D0 00170 7826 464 CALL print_board; 465 L_20: 466 ; {return:0} 00172 3D20 467 MOVC 32,D1; wp decrement 00174 5121 468 ADDI D1,WP; 00176 603E 469 MOV (WP+30),PC; return 470 L_19: 471 ; for [[ i = 0] [bool i < 8] [int i = [int 472 ; i + 1]]] 473 ; [ i = 0] 00178 7008/ 474 MOV 0,D3; load 0 0017A 0172t 0017C 0178s 0017E FFFFr 00180 7F00 00182 9724 475 MOV D3,(WP+4); store at i 00184 7006+ 476 L_21: 00186 3F08 477 MOVC 8,D3; compare: mov_compl 00188 4725 b478 ADDI (WP+4),D3; 0018A A018g 479 BRC L_22 480 ; if [bool [int conflict [board row i]] == 481 ; 0] 0018C 6324 482 MOV (WP+4),A3; load arg 2 0018E 6622 483 MOV (WP+2),D2; load arg 1 00190 6720 484 MOV (WP+0),D3; load arg 0 00192 6416h 485 MOV L_24,D0 00194 789C 486 CALL conflict; 487 L_24: 00196 700A/b488 ADD 65535,D3; 00198 ???? 0019A ???? 489 BRC L_23 0019C 0196h 490 ; [ [int board : [int [int 8 * row] + i]] 0019E 01E4g 001A0 471Er 001A2 A01Cs 491 ; = 1] 001A4 7F01 492 MOV 1,D3; load 1 001A6 6622 493 MOV (WP+2),D2; 494 ; multiply by 8 001A8 9EFE 495 SHL D2 001AA 9EFE 496 SHL D2 001AC 9EFE 497 SHL D2 001AE 4624 498 ADD (WP+4),D2; 001B0 9EFE 499 SHL D2; multiply by 2 001B2 4220 500 ADD (WP+0),D2,A2; add local array base 001B4 9740 501 MOV D3,(A2+0); store at int,board,:,int,int,8,*,row,+,i 502 ; [void solve [board [int row + 1]]] 001B6 6623 503 MOVI (WP+2),D2; calc arg 1 001B8 601A! 504 MOV (WP+0),D3; load arg 0 001BA 01C0! 001BC 01DAs 001BE FFFFr 001C0 6720 001C2 7406. 505 MOV L_25,D0 001C4 601Eg 506 CALL solve; 507 L_25: 508 ; [ [int board : [int [int 8 * row] + i]] 509 ; = 0] 001C6 7F00 510 MOV 0,D3; load 0 001C8 6622 511 MOV (WP+2),D2; 512 ; multiply by 8 001CA 9EFE 513 SHL D2 001CC 9EFE 514 SHL D2 001CE 9EFE 515 SHL D2 001D0 4624 516 ADD (WP+4),D2; 001D2 9EFE 517 SHL D2; multiply by 2 001D4 4220 518 ADD (WP+0),D2,A2; add local array base 001D6 9740 519 MOV D3,(A2+0); store at int,board,:,int,int,8,*,row,+,i 001D8 7014+ 520 L_23: 521 ; [int i = [int i + 1]] 001DA 6725 522 MOVI (WP+4),D3; load int,i,+,1 001DC 7006/ 523 MOV D3,(WP+4); store at i 001DE 0154g 001E0 9724 001E2 601Er 524 JMP L_21 525 L_22: 526 ;--- function exit --- 001E4 3D20 527 MOVC 32,D1; wp decrement 001E6 5121 528 ADDI D1,WP; 001E8 603E 529 MOV (WP+30),PC; return 530 531 ;***************************************** 532 ; (5) function: main 533 ;***************************************** 534 535 ;--- function entry --- 536 main: 001EA 943E 537 MOV D0,(WP+30); save return address 001EC 7D20 538 MOV 32,D1; wp increment 001EE 5120 539 ADD D1,WP; 540 ;--- list of local vars --- 541 ; i int (WP+0) 542 ; j int (WP+2) 543 ;--- function body --- 544 ; for [[ i = 0] [bool i < 8] [int i = [int 545 ; i + 1]]] 546 ; [ i = 0] 001F0 7F00 547 MOV 0,D3; load 0 001F2 9720 548 MOV D3,(WP+0); store at i 001F4 700C+ 549 L_26: 001F6 3F08 550 MOVC 8,D3; compare: mov_compl 001F8 700A/b551 ADDI (WP+0),D3; 001FA ???? 001FC ???? 552 BRC L_27 001FE 0186r 553 ; for [[ j = 0] [bool j < 8] [int j = [int 00200 4721 00202 A01Eg 554 ; j + 1]]] 555 ; [ j = 0] 00204 7F00 556 MOV 0,D3; load 0 00206 9722 557 MOV D3,(WP+2); store at j 00208 700A+ 558 L_28: 0020A 3F08 559 MOVC 8,D3; compare: mov_compl 0020C 4723 b560 ADDI (WP+2),D3; 0020E A012h 561 BRC L_29 562 ; [ [int board : [int [int 8 * i] + j]] 563 ; = 0] 00210 7F00 564 MOV 0,D3; load 0 00212 6620 565 MOV (WP+0),D2; 566 ; multiply by 8 00214 9EFE 567 SHL D2 00216 9EFE 568 SHL D2 00218 9EFE 569 SHL D2 0021A 7016/ 570 ADD (WP+2),D2; 0021C 022Eh 0021E 0234g 00220 4622 00222 9EFE 571 SHL D2; multiply by 2 00224 421Er 572 ADD board,D2,A2; add global array base 00226 9740 573 MOV D3,(A2+0); store at int,board,:,int,int,8,*,i,+,j 574 ; [int j = [int j + 1]] 00228 6723 575 MOVI (WP+2),D3; load int,j,+,1 0022A 9722 576 MOV D3,(WP+2); store at j 0022C 601Cs 577 JMP L_28 578 L_29: 579 ; [int i = [int i + 1]] 0022E 6721 580 MOVI (WP+0),D3; load int,i,+,1 00230 9720 581 MOV D3,(WP+0); store at i 00232 600Ct 582 JMP L_26 583 L_27: 584 ; [void solve [board 0]] 00234 7E00 585 MOV 0,D2; load arg 1 00236 700C/ 586 MOV board,D3; load address 00238 ???? 0023A 01F6t 587 MOV L_30,D0 0023C 020As 0023E 1000r 00240 671Eg 00242 7406. 00244 601Ch 588 CALL solve; 589 L_30: 590 ;--- function exit --- 00246 3D20 591 MOVC 32,D1; wp decrement 00248 5121 592 ADDI D1,WP; 0024A 603E 593 MOV (WP+30),PC; return 594 0024C ???? 595 0024E ???? 595 00250 ???? 595 00252 ???? 595 00254 ???? 595 00256 ???? 595 00258 ???? 595 0025A ???? 595 0025C 0154h 595 0025E 1000g 595
Just a few notes about the assembler output: The output is grouped in sections of 0x0020 bytes. 16-bit immediates are not directly after their instruction, but grouped together at the end of a section. In order to make clear which immediates belong to which instruction, they have a lower-case character appended to the instruction word (on the listing only, not in binary output). So: 000E4 A01Er 380 BRC L_15 ; address 000E4, instruction A01E, line 380, conditional branch to label L_15. The 'r' after A01E indicates where the immediate is found. 000FE 010Cr ; address 000FE, contents 010C, has code 'r'. So the value of L_15 is 010C, and that is where the CPU will jump to if there was a carry. At the end of a section, the assembler generates an instruction to jump over the 16-bit immediates to the next section. A "/","+" or "!" means that this is a jump instruction that is inserted by the assembler, in most cases to jump to the next section of code. In the case of '!' it is a jump with 16-bit immediate, that is located at the '!' that follows it. A '.' means that it is a absolute branch or jump that was replaced by a relative jump. The compiler does several optimizations, but most of them could not be used in this program. But you can see, that it does NOT use the WP (workspace) for local variables when the function is a leaf function (calls no other functions). In that case, the locals are in Z-page at addr 0x0020 and upwards. This saves incrementing and decrementing the WP in a leaf function. Next thing to do, is to run it on the real thing.
|
Sun Feb 09, 2020 1:36 pm |
|
 |
joanlluch
Joined: Fri Mar 22, 2019 8:03 am Posts: 328 Location: Girona-Catalonia
|
Hi Roelh, That's a great achievement. I couldn't resist to input the same C code to my CPU74 compiler and run it on the simulator. This is the resulting compiled code: CPU74 Code: .text .file "queens.c"
# --------------------------------------------- # print_board # --------------------------------------------- .globl print_board print_board: add SP, -6, SP st.w r4, [SP, 4] st.w r5, [SP, 2] st.w r6, [SP, 0] mov 0, r1 mov 81, r2 mov 46, r3 mov 10, r4 .LBB1_1: mov 0, r5 .LBB1_2: ld.w [r0, r5], r6 cmp.eq r6, 0 selcc r3, r2, r6 st.b r6, [-1L] lea r5, 2, r5 cmp.eq r5, 16L brncc .LBB1_2 st.b r4, [-1L] lea r0, 16, r0 lea r1, 1, r1 cmp.eq r1, 8 brncc .LBB1_1 mov 10, r0 st.b r0, [-1L] ld.w [SP, 0], r6 ld.w [SP, 2], r5 ld.w [SP, 4], r4 add SP, 6, SP ret
# --------------------------------------------- # conflict # --------------------------------------------- .globl conflict conflict: add SP, -8, SP st.w r4, [SP, 6] st.w r5, [SP, 4] st.w r6, [SP, 2] st.w r7, [SP, 0] cmp.lt r1, 1 brcc .LBB2_8 sub r2, r1, r3 add r3, r3, r4 add r2, r2, r3 add r0, r3, r3 add r0, r4, r4 add r2, r1, r5 add r5, r5, r6 add r0, r6, r6 neg r1, r1 .LBB2_2: ld.w [r3, 0], r0 cmp.eq r0, 0 mov 1, r0 brncc .LBB2_9 add r2, r1, r7 cmp.lt r7, 0 brcc .LBB2_5 ld.w [r4, 0], r7 cmp.eq r7, 0 brncc .LBB2_9 .LBB2_5: cmp.gt r5, 7 brcc .LBB2_7 ld.w [r6, 0], r7 cmp.eq r7, 0 brncc .LBB2_9 .LBB2_7: lea r3, 16, r3 sub r5, 1, r5 lea r4, 18, r4 lea r6, 14, r6 add r1, 1, r1 brncc .LBB2_2 .LBB2_8: mov 0, r0 .LBB2_9: ld.w [SP, 0], r7 ld.w [SP, 2], r6 ld.w [SP, 4], r5 ld.w [SP, 6], r4 add SP, 8, SP ret
# --------------------------------------------- # solve # --------------------------------------------- .globl solve solve: add SP, -12, SP st.w r4, [SP, 10] st.w r5, [SP, 8] st.w r6, [SP, 6] st.w r7, [SP, 4] mov r1, r5 mov r0, r4 cmp.eq r5, 8 brncc .LBB3_6 mov 0, r0 mov 81, r1 mov 46, r2 mov 10, r3 .LBB3_2: mov 0, r5 .LBB3_3: ld.w [r4, r5], r6 cmp.eq r6, 0 selcc r2, r1, r6 st.b r6, [-1L] lea r5, 2, r5 cmp.eq r5, 16L brncc .LBB3_3 st.b r3, [-1L] lea r4, 16, r4 lea r0, 1, r0 cmp.eq r0, 8 brncc .LBB3_2 mov 10, r0 st.b r0, [-1L] jmp .LBB3_10 .LBB3_6: lea r5, 1, r0 st.w r0, [SP, 0] add r5, r5, r0 add r0, r0, r0 add r0, r0, r0 add r0, r0, r6 mov 0, r7 st.w r5, [SP, 2] .LBB3_7: mov r4, r0 mov r5, r1 mov r7, r2 call @conflict cmp.eq r0, 0 brncc .LBB3_9 add r4, r6, r5 mov 1, r0 st.w r0, [r5, 0] mov r4, r0 ld.w [SP, 0], r1 call @solve mov 0, r0 st.w r0, [r5, 0] ld.w [SP, 2], r5 .LBB3_9: lea r6, 2, r6 lea r7, 1, r7 cmp.eq r7, 8 brncc .LBB3_7 .LBB3_10: ld.w [SP, 4], r7 ld.w [SP, 6], r6 ld.w [SP, 8], r5 ld.w [SP, 10], r4 add SP, 12, SP ret
# --------------------------------------------- # main # --------------------------------------------- .globl main main: add SP, -2, SP st.w r4, [SP, 0] mov &board, r4 mov 128L, r2 mov r4, r0 mov 0, r1 call @memset mov r4, r0 mov 0, r1 call @solve mov 0, r0 ld.w [SP, 0], r4 add SP, 2, SP ret
# --------------------------------------------- # Global Data # --------------------------------------------- .comm board,128,2
The above is compiled with the -Os compiler flag, which is a balance between speed and size. It can be gotten slightly shorter although somewhat slower with the -Oz flag. In particular the "solve" function gets shortened by about 20% with the -Oz flag. I do not even try the aggressiveness of -O2 or -O3 anymore because that tends to unroll everything (starting by loops) and to produce extremely long assembly outputs. For non-pipelined architectures these compiler optimisation options don't really make sense. So that's why I stick to just -Os or -Oz for the CPU74 backend. The simulator produces the following output for the execution of the 8-queens code above: Code: /Users/joan/Documents-Local/Relay/CPU74/Simulator/DerivedData/Simulator/Build/Products/Release/c74-sim Q....... ....Q... .......Q .....Q.. ..Q..... ......Q. .Q...... ...Q....
Q....... .....Q.. .......Q ..Q..... ......Q. ...Q.... .Q...... ....Q...
Q....... ......Q. ...Q.... .....Q.. .......Q .Q...... ....Q... ..Q.....
Q....... ......Q. ....Q... .......Q .Q...... ...Q.... .....Q.. ..Q.....
.Q...... ...Q.... .....Q.. .......Q ..Q..... Q....... ......Q. ....Q...
.Q...... ....Q... ......Q. Q....... ..Q..... .......Q .....Q.. ...Q....
.Q...... ....Q... ......Q. ...Q.... Q....... .......Q .....Q.. ..Q.....
.Q...... .....Q.. Q....... ......Q. ...Q.... .......Q ..Q..... ....Q...
.Q...... .....Q.. .......Q ..Q..... Q....... ...Q.... ......Q. ....Q...
.Q...... ......Q. ..Q..... .....Q.. .......Q ....Q... Q....... ...Q....
.Q...... ......Q. ....Q... .......Q Q....... ...Q.... .....Q.. ..Q.....
.Q...... .......Q .....Q.. Q....... ..Q..... ....Q... ......Q. ...Q....
..Q..... Q....... ......Q. ....Q... .......Q .Q...... ...Q.... .....Q..
..Q..... ....Q... .Q...... .......Q Q....... ......Q. ...Q.... .....Q..
..Q..... ....Q... .Q...... .......Q .....Q.. ...Q.... ......Q. Q.......
..Q..... ....Q... ......Q. Q....... ...Q.... .Q...... .......Q .....Q..
..Q..... ....Q... .......Q ...Q.... Q....... ......Q. .Q...... .....Q..
..Q..... .....Q.. .Q...... ....Q... .......Q Q....... ......Q. ...Q....
..Q..... .....Q.. .Q...... ......Q. Q....... ...Q.... .......Q ....Q...
..Q..... .....Q.. .Q...... ......Q. ....Q... Q....... .......Q ...Q....
..Q..... .....Q.. ...Q.... Q....... .......Q ....Q... ......Q. .Q......
..Q..... .....Q.. ...Q.... .Q...... .......Q ....Q... ......Q. Q.......
..Q..... .....Q.. .......Q Q....... ...Q.... ......Q. ....Q... .Q......
..Q..... .....Q.. .......Q Q....... ....Q... ......Q. .Q...... ...Q....
..Q..... .....Q.. .......Q .Q...... ...Q.... Q....... ......Q. ....Q...
..Q..... ......Q. .Q...... .......Q ....Q... Q....... ...Q.... .....Q..
..Q..... ......Q. .Q...... .......Q .....Q.. ...Q.... Q....... ....Q...
..Q..... .......Q ...Q.... ......Q. Q....... .....Q.. .Q...... ....Q...
...Q.... Q....... ....Q... .......Q .Q...... ......Q. ..Q..... .....Q..
...Q.... Q....... ....Q... .......Q .....Q.. ..Q..... ......Q. .Q......
...Q.... .Q...... ....Q... .......Q .....Q.. Q....... ..Q..... ......Q.
...Q.... .Q...... ......Q. ..Q..... .....Q.. .......Q Q....... ....Q...
...Q.... .Q...... ......Q. ..Q..... .....Q.. .......Q ....Q... Q.......
...Q.... .Q...... ......Q. ....Q... Q....... .......Q .....Q.. ..Q.....
...Q.... .Q...... .......Q ....Q... ......Q. Q....... ..Q..... .....Q..
...Q.... .Q...... .......Q .....Q.. Q....... ..Q..... ....Q... ......Q.
...Q.... .....Q.. Q....... ....Q... .Q...... .......Q ..Q..... ......Q.
...Q.... .....Q.. .......Q .Q...... ......Q. Q....... ..Q..... ....Q...
...Q.... .....Q.. .......Q ..Q..... Q....... ......Q. ....Q... .Q......
...Q.... ......Q. Q....... .......Q ....Q... .Q...... .....Q.. ..Q.....
...Q.... ......Q. ..Q..... .......Q .Q...... ....Q... Q....... .....Q..
...Q.... ......Q. ....Q... .Q...... .....Q.. Q....... ..Q..... .......Q
...Q.... ......Q. ....Q... ..Q..... Q....... .....Q.. .......Q .Q......
...Q.... .......Q Q....... ..Q..... .....Q.. .Q...... ......Q. ....Q...
...Q.... .......Q Q....... ....Q... ......Q. .Q...... .....Q.. ..Q.....
...Q.... .......Q ....Q... ..Q..... Q....... ......Q. .Q...... .....Q..
....Q... Q....... ...Q.... .....Q.. .......Q .Q...... ......Q. ..Q.....
....Q... Q....... .......Q ...Q.... .Q...... ......Q. ..Q..... .....Q..
....Q... Q....... .......Q .....Q.. ..Q..... ......Q. .Q...... ...Q....
....Q... .Q...... ...Q.... .....Q.. .......Q ..Q..... Q....... ......Q.
....Q... .Q...... ...Q.... ......Q. ..Q..... .......Q .....Q.. Q.......
....Q... .Q...... .....Q.. Q....... ......Q. ...Q.... .......Q ..Q.....
....Q... .Q...... .......Q Q....... ...Q.... ......Q. ..Q..... .....Q..
....Q... ..Q..... Q....... .....Q.. .......Q .Q...... ...Q.... ......Q.
....Q... ..Q..... Q....... ......Q. .Q...... .......Q .....Q.. ...Q....
....Q... ..Q..... .......Q ...Q.... ......Q. Q....... .....Q.. .Q......
....Q... ......Q. Q....... ..Q..... .......Q .....Q.. ...Q.... .Q......
....Q... ......Q. Q....... ...Q.... .Q...... .......Q .....Q.. ..Q.....
....Q... ......Q. .Q...... ...Q.... .......Q Q....... ..Q..... .....Q..
....Q... ......Q. .Q...... .....Q.. ..Q..... Q....... ...Q.... .......Q
....Q... ......Q. .Q...... .....Q.. ..Q..... Q....... .......Q ...Q....
....Q... ......Q. ...Q.... Q....... ..Q..... .......Q .....Q.. .Q......
....Q... .......Q ...Q.... Q....... ..Q..... .....Q.. .Q...... ......Q.
....Q... .......Q ...Q.... Q....... ......Q. .Q...... .....Q.. ..Q.....
.....Q.. Q....... ....Q... .Q...... .......Q ..Q..... ......Q. ...Q....
.....Q.. .Q...... ......Q. Q....... ..Q..... ....Q... .......Q ...Q....
.....Q.. .Q...... ......Q. Q....... ...Q.... .......Q ....Q... ..Q.....
.....Q.. ..Q..... Q....... ......Q. ....Q... .......Q .Q...... ...Q....
.....Q.. ..Q..... Q....... .......Q ...Q.... .Q...... ......Q. ....Q...
.....Q.. ..Q..... Q....... .......Q ....Q... .Q...... ...Q.... ......Q.
.....Q.. ..Q..... ....Q... ......Q. Q....... ...Q.... .Q...... .......Q
.....Q.. ..Q..... ....Q... .......Q Q....... ...Q.... .Q...... ......Q.
.....Q.. ..Q..... ......Q. .Q...... ...Q.... .......Q Q....... ....Q...
.....Q.. ..Q..... ......Q. .Q...... .......Q ....Q... Q....... ...Q....
.....Q.. ..Q..... ......Q. ...Q.... Q....... .......Q .Q...... ....Q...
.....Q.. ...Q.... Q....... ....Q... .......Q .Q...... ......Q. ..Q.....
.....Q.. ...Q.... .Q...... .......Q ....Q... ......Q. Q....... ..Q.....
.....Q.. ...Q.... ......Q. Q....... ..Q..... ....Q... .Q...... .......Q
.....Q.. ...Q.... ......Q. Q....... .......Q .Q...... ....Q... ..Q.....
.....Q.. .......Q .Q...... ...Q.... Q....... ......Q. ....Q... ..Q.....
......Q. Q....... ..Q..... .......Q .....Q.. ...Q.... .Q...... ....Q...
......Q. .Q...... ...Q.... Q....... .......Q ....Q... ..Q..... .....Q..
......Q. .Q...... .....Q.. ..Q..... Q....... ...Q.... .......Q ....Q...
......Q. ..Q..... Q....... .....Q.. .......Q ....Q... .Q...... ...Q....
......Q. ..Q..... .......Q .Q...... ....Q... Q....... .....Q.. ...Q....
......Q. ...Q.... .Q...... ....Q... .......Q Q....... ..Q..... .....Q..
......Q. ...Q.... .Q...... .......Q .....Q.. Q....... ..Q..... ....Q...
......Q. ....Q... ..Q..... Q....... .....Q.. .......Q .Q...... ...Q....
.......Q .Q...... ...Q.... Q....... ......Q. ....Q... ..Q..... .....Q..
.......Q .Q...... ....Q... ..Q..... Q....... ......Q. ...Q.... .....Q..
.......Q ..Q..... Q....... .....Q.. .Q...... ....Q... ......Q. ...Q....
.......Q ...Q.... Q....... ..Q..... .....Q.. .Q...... ......Q. ....Q...
Program ended with exit code: 0 I suppose these are all the possible solutions to the 8-queens problem. I have not setup a timer to determine how long the simulation exactly takes on my 12 years old computer, but it's about 2.5 seconds for the complete set of solutions given above. This is in spite that the simulator is not made to be fast, but to simulate all individual cpu control signals and execute microinstructions based on them. My 2.5 second for the entire set of solutions, is in huge contrast with your stated 20 minutes required only for the first solution. That's obviously explained by using a compiled language (Apple Swift in my case) instead of a web-based or interpreted one (python?, javascript?). Although I'm surprised that the difference is SO huge. Code: Next thing to do, is to run it on the real thing.
That's great too, and something that is still very far from possible in my case. Joan [EDIT: re-posted the code generated by the compiler after fixing a subtle optimisation bug]
Last edited by joanlluch on Mon Feb 10, 2020 1:16 pm, edited 1 time in total.
|
Sun Feb 09, 2020 3:36 pm |
|
 |
BigEd
Joined: Wed Jan 09, 2013 6:54 pm Posts: 1821
|
(Might be interesting to know, in each case, how many instructions or clock cycles were needed - if it were cheap to find out.)
|
Sun Feb 09, 2020 6:55 pm |
|
 |
joanlluch
Joined: Fri Mar 22, 2019 8:03 am Posts: 328 Location: Girona-Catalonia
|
BigEd wrote: (Might be interesting to know, in each case, how many instructions or clock cycles were needed - if it were cheap to find out.) That's quite interesting. I added some code to accumulate the number of instructions and cycles (the simulator is cycle accurate) as well as the computation of the elapsed time and these are the results For compiler option -Oz Code: Executed instruction count: 1473350 Total cycle count: 1980095 Elapsed time: 2.70678 seconds For compiler option -Os Code: Executed instruction count: 1321973 Total cycle count: 1762834 Elapsed time: 2.46086 seconds (note that elapsed time may have some variations because the computer is of course sharing time with other running processes) So the simulator appears to run at an equivalent of 1762834 / 2.46086 = 0.72 MHz. I want to state again that the cpu74 simulator is not optimised or designed for speed, but to keep track of all control signals as they will be produced in the processor. I'm pretty sure that if the goal was fast cpu emulation alone, it could run more than 10x or 20x faster, approaching or surpassing the expected clock frequency of the real thing. (But of course, this is possible because the simulator runs on a 64 bit machine (intel) at 3 GHz) [EDIT: I found a subtle compiler optimisation bug that in rare circumstances prevented an -Os optimisation to apply. In this case, it affected the 'confilct' function of the queens program. So I am editing the compiler assembly output I posted on the previous post, and updated the -Os figures in this post]
|
Sun Feb 09, 2020 9:39 pm |
|
 |
robfinch
Joined: Sat Feb 02, 2013 9:40 am Posts: 2307 Location: Canada
|
I figured I’d add the CC64’s compiler output to the list. The following is untested but looks relatively good. The target is a RISCV compatible. I just spotted a couple of errors in the code, but things should be good for a chuckle. Yes, this is optimized code, unoptimized is much larger. Code: code align 16 ;==================================================== ; Basic Block 0 ;==================================================== public code _printf: queens_8: ret endpublic
data align 8 code align 16 ;==================================================== ; Basic Block 0 ;==================================================== public code _print_board: sub $sp,$sp,#32 sto $fp,[$sp] sto $x0,8[$sp] sto $ra,24[$sp] mov $fp,$sp sub $sp,$sp,#54 sto $s0,0[$sp] sto $s1,8[$sp] ldo $s0,-18[$fp] ; for (i = 0; i < 8; i++) { mov $s1,$x0 slt $t0,$s1,#8 beqz $t0,queens_26 queens_25: ; for (j = 0; j < 8; j++) { mov $s0,$x0 slt $t0,$s0,#8 beqz $t0,queens_29 queens_28: ; if (board[i][j]) mov $t1,$s0 ldi $t2,#3 sll $t0,$t1,$t2 mov $t3,$s1 ldi $t4,#6 sll $t2,$t3,$t4 mov $t3,$t2 lea $t4,32[$fp] sto $t0,-38[$fp] mov $t0,$t4 add $t1,$t3,$t0 ldo $t0,-38[$fp] sto $t4,-38[$fp] add $t4,$t1,$t0 beq $t4,$x0,queens_31 ; printf('Q'); sub $sp,$sp,#8 ldi $t0,#81 sto $t0,0[$sp] call _printf add $sp,$sp,#8 bra queens_32 queens_31: ; printf('.'); sub $sp,$sp,#8 ldi $t0,#46 sto $t0,0[$sp] call _printf add $sp,$sp,#8 queens_32: add $s0,$s0,#1 slt $t0,$s0,#8 bnez $t0,queens_28 queens_29: ; printf('\n'); sub $sp,$sp,#8 ldi $t0,#10 sto $t0,0[$sp] call _printf add $sp,$sp,#8 add $s1,$s1,#1 slt $t0,$s1,#8 bnez $t0,queens_25 queens_26: ; printf('\n'); sub $sp,$sp,#8 ldi $t0,#10 sto $t0,0[$sp] call _printf add $sp,$sp,#8 sub $sp,$sp,#8 sto $t0,0[$sp] call _printf add $sp,$sp,#8 queens_21: queens_24: ldo $s0,0[$sp] ldo $s1,8[$sp] mov $sp,$fp ldo $fp,[$sp] ldo $ra,24[$sp] add $sp,$sp,#32 ret endpublic
data align 8 code align 16 ;==================================================== ; Basic Block 0 ;==================================================== public code _conflict: sub $sp,$sp,#32 sto $fp,[$sp] sto $x0,8[$sp] mov $fp,$sp sub $sp,$sp,#78 sto $s0,0[$sp] sto $s1,8[$sp] sto $s2,16[$sp] sto $s3,24[$sp] sto $s4,32[$sp] ldo $s0,-8[$fp] ldo $s1,552[$fp] ldo $s2,-18[$fp] ldo $s3,562[$fp] lea $t0,32[$fp] mov $s4,$t0 ; for (i = 0; i < row; i++) { mov $s0,$x0 bge $s0,$s1,queens_55 queens_54: ; if (board[i][col]) mov $t1,$s3 ldi $t2,#3 sll $t0,$t1,$t2 mov $t3,$s0 ldi $t4,#6 sll $t2,$t3,$t4 mov $t3,$t2 mov $t4,$s4 add $t1,$t3,$t4 add $t3,$t1,$t0 beq $t3,$x0,queens_57 ; return 1; ldi $v0,#1 queens_53: ldo $s0,0[$sp] ldo $s1,8[$sp] ldo $s2,16[$sp] ldo $s3,24[$sp] ldo $s4,32[$sp] mov $sp,$fp ldo $fp,[$sp] add $sp,$sp,#32 ret queens_57: ; j = row - i; mov $t1,$s1 mov $t2,$s0 sub $t0,$t1,$t2 mov $s2,$t0 ; if (0 < col - j + 1) ldi $t0,#0 mov $t3,$s3 mov $t4,$s2 sub $t2,$t3,$t4 mov $t3,$t2 ldi $t4,#1 add $t1,$t3,$t4 bge $t0,$t1,queens_59 ; if (board[i][col - j]) mov $t2,$s3 mov $t3,$s2 sub $t1,$t2,$t3 mov $t2,$t1 ldi $t3,#3 sll $t0,$t2,$t3 mov $t4,$s0 sto $t0,-38[$fp] ldi $t0,#6 sll $t3,$t4,$t0 ldo $t0,-38[$fp] mov $t4,$t3 mov $t0,$s4 add $t2,$t4,$t0 add $t4,$t2,$t0 beq $t4,$x0,queens_61 ; return 1; bra queens_53 queens_61: queens_59: ; if (col + j < 8) mov $t1,$s3 mov $t2,$s2 add $t0,$t1,$t2 slt $t1,$t0,#8 beqz $t1,queens_63 ; if (board[i][col + j]) mov $t2,$s3 mov $t3,$s2 add $t1,$t2,$t3 mov $t2,$t1 ldi $t3,#3 sll $t0,$t2,$t3 mov $t4,$s0 sto $t0,-38[$fp] ldi $t0,#6 sll $t3,$t4,$t0 ldo $t0,-38[$fp] mov $t4,$t3 mov $t0,$s4 add $t2,$t4,$t0 add $t4,$t2,$t0 beq $t4,$x0,queens_65 ; return 1; bra queens_53 queens_65: queens_63: add $s0,$s0,#1 blt $s0,$s1,queens_54 queens_55: ; return 0; mov $v0,$x0 bra queens_53 endpublic
;==================================================== ; Basic Block 0 ;==================================================== public code _solve: sub $sp,$sp,#32 sto $fp,[$sp] sto $x0,8[$sp] sto $ra,24[$sp] mov $fp,$sp sub $sp,$sp,#42 sto $s0,0[$sp] sto $s1,8[$sp] sto $s2,16[$sp] ldo $s0,-8[$fp] lea $s1,32[$fp] ldo $s2,552[$fp] ; if (row == 8) { xor $t0,$s2,#8 bnez $t0,queens_82 ; print_board(board); sub $sp,$sp,#8 sto $s1,0[$sp] call _print_board add $sp,$sp,#8 queens_78: queens_81: ldo $s0,0[$sp] ldo $s1,8[$sp] ldo $s2,16[$sp] mov $sp,$fp ldo $fp,[$sp] ldo $ra,24[$sp] add $sp,$sp,#32 ret queens_82: ; for ( i = 0; i < 8; i++) { mov $s0,$x0 slt $t0,$s0,#8 beqz $t0,queens_85 queens_84: ; if (conflict(board, row, i)==0) { sub $sp,$sp,#24 sto $s1,0[$sp] sto $s2,8[$sp] sto $s0,16[$sp] call _conflict add $sp,$sp,#24 mov $t0,$ra bne $t0,$x0,queens_87 ; board[row][i] = 1; mov $t1,$s0 ldi $t2,#3 sll $t0,$t1,$t2 mov $t3,$s2 ldi $t4,#6 sll $t2,$t3,$t4 mov $t3,$t2 mov $t4,$s1 add $t1,$t3,$t4 ldi $t3,#1 ; solve(board, row + 1); sub $sp,$sp,#16 sto $s1,0[$sp] mov $t1,$s2 ldi $t2,#1 add $t0,$t1,$t2 sto $t0,8[$sp] call _solve add $sp,$sp,#16 ; board[row][i] = 0; mov $t1,$s0 ldi $t2,#3 sll $t0,$t1,$t2 mov $t3,$s2 ldi $t4,#6 sll $t2,$t3,$t4 mov $t3,$t2 mov $t4,$s1 add $t1,$t3,$t4 add $t3,$t1,$t0 sto $x0,[$t3] queens_87: add $s0,$s0,#1 slt $t0,$s0,#8 bnez $t0,queens_84 queens_85: bra queens_81 endpublic
bss align 8 align 8 dw $FFF0200000000040 public bss _board: fill.b 512,0x00
endpublic code align 16 data align 8 code align 16 ;==================================================== ; Basic Block 0 ;==================================================== public code _main: sub $sp,$sp,#32 sto $fp,[$sp] sto $x0,8[$sp] sto $ra,24[$sp] mov $fp,$sp sub $sp,$sp,#46 sto $s0,0[$sp] sto $s1,8[$sp] ldo $s0,-18[$fp] ; for (i = 0; i < 8; i++) mov $s1,$x0 slt $t0,$s1,#8 beqz $t0,queens_104 queens_103: ; for (j = 0; j < 8; j++) mov $s0,$x0 slt $t0,$s0,#8 beqz $t0,queens_107 queens_106: ; board[i][j] = 0; mov $t1,$s0 ldi $t2,#3 sll $t0,$t1,$t2 mov $t3,$s1 ldi $t4,#6 sll $t2,$t3,$t4 mov $t3,$t2 lea $t4,_board add $t1,$t3,$t4 add $t3,$t1,$t0 sto $x0,[$t3] add $s0,$s0,#1 slt $t0,$s0,#8 bnez $t0,queens_106 queens_107: add $s1,$s1,#1 slt $t0,$s1,#8 bnez $t0,queens_103 queens_104: ; solve(board, 0); sub $sp,$sp,#16 lea $t0,_board sto $t0,0[$sp] sto $x0,8[$sp] call _solve add $sp,$sp,#16 queens_99: queens_102: ldo $s0,0[$sp] ldo $s1,8[$sp] mov $sp,$fp ldo $fp,[$sp] ldo $ra,24[$sp] add $sp,$sp,#32 ret endpublic
rodata align 16 ; global _main ; global _board ; global _solve ; global _conflict ; global _printf ; global _print_board
_________________Robert Finch http://www.finitron.ca
|
Fri Feb 14, 2020 9:56 am |
|
 |
roelh
Joined: Mon Oct 07, 2019 1:26 pm Posts: 50
|
Hi all, my C compiler is now integrated into the Kobold K2 online Javascript assembler/simulator. On a single browser page you can now compile, assemble and simulate the C code, and download a file to program into the Kobold. See it live at http://www.enscope.nl/simas_kobold2/ ! The page loads a prime-number generator as default program. You only have to press the RUN button to start it. However, the simulator is quite slow because it simulates almost at the gate level.
|
Wed Mar 11, 2020 10:04 pm |
|
Who is online |
Users browsing this forum: claudebot and 0 guests |
|
You cannot post new topics in this forum You cannot reply to topics in this forum You cannot edit your posts in this forum You cannot delete your posts in this forum You cannot post attachments in this forum
|
|