	 Perils of Optimization On Modern High-Performance Processors   
	
        
        
        
MichaelM
Joined: Wed Apr 24, 2013 9:40 pm
Posts: 213
Location: Huntsville, AL
			
				
				
					
Recently, I've been studying the book "From Mathematics to Generic Programming" by Alexander A. Stepanov and Daniel E. Rose. I took up the book in order to gain an understanding of the process by which Stepanov designed the C++ Standard Template Library. I find it well written and intellectually stimulating, and I particularly enjoy the authors' linking of the algorithms described in the book to their mathematical foundations.

(Note: in the spirit of full disclosure of my personal biases on this subject, I am not a particular fan of canned algorithms such as those in the Standard Template Library, Matlab, Simulink, and similar toolsets, which, in my opinion, many software developers, engineers, and scientists use blindly.)

There is a mathematical basis for many algorithms commonly available in standard libraries. However, there appears to be a blind trust that many of these algorithms behave correctly under a wide range of conditions, an assumption generally unsupported by the documentation that accompanies these collections. To be fair to the developers and suppliers of these libraries and toolsets, many software developers, engineers, and scientists do not critically test the libraries for suitability to their particular application domain. Perhaps the lack of curiosity is driven in part by a lack of confidence with mathematics. There are many algorithms whose mathematical basis is certainly difficult, but there are also many whose mathematical basis should be easily understood by anyone capable of learning and using a programming language.

The second chapter of the book covers the Egyptian/Russian Peasant multiplication algorithm, applicable to unsigned non-zero integers.
This is an algorithm that should be easily understood by anybody who has learned to program. I decided to work through the development of the authors' implementations in C: I implemented each example the authors presented, and compiled and tested each "improved" version of the algorithm. After developing and testing all 8 algorithms, which progress from simple recursion to non-recursive looping, I tested their efficiency with a program that exhaustively calculates all 65536 products of two arguments, each ranging from 1 to 256. Interestingly, I found that a recursive algorithm yielded the best time, not the non-recursive looping implementation.

Throughout the elaboration of the algorithms, the authors clearly emphasize the cost of testing variables and calling subroutines; they want the reader to pick up on the processing costs of testing variables and recursively calling the core multiplication subroutine. For the recursive algorithms (Algorithms 1-5), the authors build up the progression until Algorithm 5 is implemented in strict tail-recursive form. I've used recursion in the past, but it is not a natural way for me to develop algorithms, and until reading the definition of strictly tail recursive the authors provide, I had never followed through on determining the conditions needed to call an implementation tail recursive. For the non-recursive looping algorithms (Algorithms 6-8), the authors aim to reduce the number of tests and loop iterations required to form the desired product.

(Note: in the multiplication algorithm being considered, the multiplier should be less than the multiplicand.
Examination of the multiplier develops the partial products, which are then added together to form the product. To minimize the number of recursions/loops, the multiplier should be the lesser of the two numbers. The algorithms as implemented do not ensure this. My testing is arranged so that for roughly half of the products the multiplier is less than the multiplicand, and greater than or equal to the multiplicand for the remainder.)

The following table shows the results I obtained for the 8 implementations. I used Eclipse and GCC on a 64-bit Linux Mint platform powered by a 1 GHz, 4 GB, AMD E1 processor. My test program consists of three nested loops: n, an iteration loop to ensure a sufficiently long test time; k, the multiplier loop; and l, the multiplicand loop. The multiplier and multiplicand loops each iterate from 1 to 256 to generate 65536 products, and the iteration loop repeats the inner two loops. The product generated by each implementation is tested against the product of k and l; if there is an error, an error message is printed and a break is taken. I use the argument vector to input the number of iterations and the algorithm number, which indexes an array of function pointers. Each algorithm is timed using the Linux time utility: $ time test_pgm n_iterations algorithm_index.

Code:
Routine_Nm,Dbg/-O3,Rls/-Os,Rls/-O3,Rls/-O2,Rls/-O1,Rls/-O0
Multiply_1,  7.234,  9.585,  7.389,  9.904, 10.778, 17.481
Multiply_2,  4.975,  7.063,  4.988,  8.005, 15.532, 17.464
Multiply_3,  4.725,  6.218,  4.721,  4.711, 11.791, 17.492
Multiply_4,  6.633,  6.236,  6.625,  6.687, 12.668, 17.995
Multiply_5,  4.286,  6.065,  4.283,  4.287, 14.193, 17.850
Multiply_6,  5.939,  6.961,  5.930,  5.935,  7.199,  9.832
Multiply_7,  4.481,  6.667,  4.481,  4.481,  7.178,  9.755
Multiply_8,  5.016,  6.192,  5.013,  5.015,  6.709,  9.001
Across the top are the optimization settings employed. Dbg signifies the Debug build, and Rls the Release build from GCC. I captured results for all five optimization settings selectable with the -Ox option. At optimization levels -O2 and -O3, the strictly tail-recursive Algorithm 5 provides the best times. Without optimization (-O0) or with minimal optimization (-O1), the non-recursive looping algorithms, Algorithms 6-8, handily beat the recursive Algorithms 1-5.

There was some variation in the measurements using the time utility, so I captured the following data using the perf utility, which provides more consistent timing by reading the performance counters inside the processor. I captured data at the Rls/-O3 setting for all 8 algorithms. Interestingly, there is a wide variation in the effective number of instructions per cycle, the total number of branches, and the number of missed branches. Algorithm 5 is still the fastest, but its instructions per cycle is not the highest, nor does it have the fewest branches or branch misses.

Code:
morrisma@morrisma5 ~/Development/CProjects/Stepanov $ perf stat Release/Stepanov 2000 1
   Performance counter stats for 'Release/Stepanov 2000 1':
        7280.156042      task-clock (msec)         #    0.995 CPUs utilized
                677      context-switches          #    0.093 K/sec
                 28      cpu-migrations            #    0.004 K/sec
                 52      page-faults               #    0.007 K/sec
      7,238,360,838      cycles                    #    0.994 GHz                     [50.01%]
                  0      stalled-cycles-frontend   #    0.00% frontend cycles idle    [50.05%]
                  0      stalled-cycles-backend    #    0.00% backend  cycles idle    [50.21%]
     13,485,495,935      instructions              #    1.86  insns per cycle         [50.04%]
      1,680,883,125      branches                  #  230.886 M/sec                   [49.97%]
            861,832      branch-misses             #    0.05% of all branches         [49.87%]
         7.317273564 seconds time elapsed
  morrisma@morrisma5 ~/Development/CProjects/Stepanov $ perf stat Release/Stepanov 2000 2
   Performance counter stats for 'Release/Stepanov 2000 2':
        5015.594286      task-clock (msec)         #    0.994 CPUs utilized
                478      context-switches          #    0.095 K/sec
                 13      cpu-migrations            #    0.003 K/sec
                 53      page-faults               #    0.011 K/sec
      4,979,463,334      cycles                    #    0.993 GHz                     [49.96%]
                  0      stalled-cycles-frontend   #    0.00% frontend cycles idle    [50.03%]
                  0      stalled-cycles-backend    #    0.00% backend  cycles idle    [50.18%]
      9,100,716,091      instructions              #    1.83  insns per cycle         [50.14%]
      1,450,183,464      branches                  #  289.135 M/sec                   [50.03%]
            848,119      branch-misses             #    0.06% of all branches         [49.84%]
         5.045531490 seconds time elapsed
  morrisma@morrisma5 ~/Development/CProjects/Stepanov $ perf stat Release/Stepanov 2000 3
   Performance counter stats for 'Release/Stepanov 2000 3':
        4666.820274      task-clock (msec)         #    0.994 CPUs utilized
                594      context-switches          #    0.127 K/sec
                  5      cpu-migrations            #    0.001 K/sec
                 53      page-faults               #    0.011 K/sec
      4,644,182,243      cycles                    #    0.995 GHz                     [50.08%]
                  0      stalled-cycles-frontend   #    0.00% frontend cycles idle    [50.00%]
                  0      stalled-cycles-backend    #    0.00% backend  cycles idle    [49.97%]
      7,500,004,657      instructions              #    1.61  insns per cycle         [49.99%]
      1,449,465,718      branches                  #  310.590 M/sec                   [50.03%]
            807,622      branch-misses             #    0.06% of all branches         [50.15%]
         4.696254481 seconds time elapsed
  morrisma@morrisma5 ~/Development/CProjects/Stepanov $ perf stat Release/Stepanov 2000 4
   Performance counter stats for 'Release/Stepanov 2000 4':
        6437.925075      task-clock (msec)         #    0.989 CPUs utilized
                673      context-switches          #    0.105 K/sec
                 25      cpu-migrations            #    0.004 K/sec
                 52      page-faults               #    0.008 K/sec
      6,405,362,667      cycles                    #    0.995 GHz                     [49.99%]
                  0      stalled-cycles-frontend   #    0.00% frontend cycles idle    [50.03%]
                  0      stalled-cycles-backend    #    0.00% backend  cycles idle    [50.07%]
      6,836,760,873      instructions              #    1.07  insns per cycle         [50.11%]
      2,232,507,206      branches                  #  346.774 M/sec                   [50.03%]
          8,277,203      branch-misses             #    0.37% of all branches         [49.96%]
         6.508887512 seconds time elapsed
  morrisma@morrisma5 ~/Development/CProjects/Stepanov $ perf stat Release/Stepanov 2000 5
   Performance counter stats for 'Release/Stepanov 2000 5':
        4329.726333      task-clock (msec)         #    0.996 CPUs utilized
                386      context-switches          #    0.089 K/sec
                 15      cpu-migrations            #    0.003 K/sec
                 51      page-faults               #    0.012 K/sec
      4,306,586,028      cycles                    #    0.995 GHz                     [49.95%]
                  0      stalled-cycles-frontend   #    0.00% frontend cycles idle    [49.96%]
                  0      stalled-cycles-backend    #    0.00% backend  cycles idle    [50.15%]
      6,839,918,869      instructions              #    1.59  insns per cycle         [50.10%]
      2,235,540,097      branches                  #  516.324 M/sec                   [50.04%]
          9,797,084      branch-misses             #    0.44% of all branches         [49.95%]
         4.345760144 seconds time elapsed
  morrisma@morrisma5 ~/Development/CProjects/Stepanov $ perf stat Release/Stepanov 2000 6
   Performance counter stats for 'Release/Stepanov 2000 6':
        5819.323941      task-clock (msec)         #    0.994 CPUs utilized
                531      context-switches          #    0.091 K/sec
                 25      cpu-migrations            #    0.004 K/sec
                 55      page-faults               #    0.009 K/sec
      5,790,127,256      cycles                    #    0.995 GHz                     [49.91%]
                  0      stalled-cycles-frontend   #    0.00% frontend cycles idle    [49.96%]
                  0      stalled-cycles-backend    #    0.00% backend  cycles idle    [50.08%]
      7,097,873,094      instructions              #    1.23  insns per cycle         [50.11%]
      2,234,531,073      branches                  #  383.985 M/sec                   [50.14%]
          7,263,670      branch-misses             #    0.33% of all branches         [49.98%]
         5.851833687 seconds time elapsed
  morrisma@morrisma5 ~/Development/CProjects/Stepanov $ perf stat Release/Stepanov 2000 7
   Performance counter stats for 'Release/Stepanov 2000 7':
        4518.088795      task-clock (msec)         #    0.969 CPUs utilized
                521      context-switches          #    0.115 K/sec
                  9      cpu-migrations            #    0.002 K/sec
                 54      page-faults               #    0.012 K/sec
      4,493,347,007      cycles                    #    0.995 GHz                     [50.03%]
                  0      stalled-cycles-frontend   #    0.00% frontend cycles idle    [50.03%]
                  0      stalled-cycles-backend    #    0.00% backend  cycles idle    [50.00%]
      7,025,496,576      instructions              #    1.56  insns per cycle         [49.97%]
      2,226,158,240      branches                  #  492.721 M/sec                   [50.02%]
         12,654,222      branch-misses             #    0.57% of all branches         [50.03%]
         4.664287850 seconds time elapsed
  morrisma@morrisma5 ~/Development/CProjects/Stepanov $ perf stat Release/Stepanov 2000 8
   Performance counter stats for 'Release/Stepanov 2000 8':
        5037.158903      task-clock (msec)         #    0.996 CPUs utilized
                466      context-switches          #    0.093 K/sec
                 18      cpu-migrations            #    0.004 K/sec
                 53      page-faults               #    0.011 K/sec
      5,012,721,116      cycles                    #    0.995 GHz                     [50.09%]
                  0      stalled-cycles-frontend   #    0.00% frontend cycles idle    [50.09%]
                  0      stalled-cycles-backend    #    0.00% backend  cycles idle    [50.10%]
      6,643,054,184      instructions              #    1.33  insns per cycle         [49.91%]
      2,097,877,713      branches                  #  416.480 M/sec                   [50.00%]
          5,236,597      branch-misses             #    0.25% of all branches         [49.95%]
         5.058676239 seconds time elapsed
  morrisma@morrisma5 ~/Development/CProjects/Stepanov $ 
I'm not exactly sure what these results mean for a different application; my objective was simply to time the various algorithms. The implication from the book was that each succeeding algorithm was a better implementation than the previous one. From the results presented above, that is clearly not the case. It appears to me that the more the program's construction allows the optimizer to extract instruction-level parallelism (ILP), the faster the program will be. I don't know exactly how to interpret the performance effects of the branch prediction logic.

The code used in this test is included below. With these results in mind, I will attempt to build and test a version of Algorithm 5 using the MiniCPU instruction set. The next chapter in the book deals with the prime number sieve ascribed to Eratosthenes. I am looking forward to developing and testing Stepanov and Rose's algorithms for that ancient method, which is often used as a processor/language benchmark.

Code:
/*
 ============================================================================
 Name        : Stepanov.c
 Author      : Michael A. Morris
 Version     :
 Copyright   : Copyright 2017 - GPLv3
 Description : Various Algorithms for Egyptian/Russian Peasant
               Multiplication in C, Ansi-style
 ============================================================================
 */
#include <stdio.h>
#include <stdlib.h>

#define half(x) ((x) >> 1)
#define odd(x)  ((x) & 1)

int multiply1(int, int);

int multiply2(int, int);
int mult_acc0(int, int, int);

int multiply3(int, int);
int mult_acc1(int, int, int);

int multiply4(int, int);
int mult_acc2(int, int, int);

int multiply5(int, int);
int mult_acc3(int, int, int);

int multiply6(int, int);
int mult_acc4(int, int, int);

int multiply7(int, int);
int multiply8(int, int);

typedef int (*MFP_t)(int, int);
int main(int argc, char *argv[])
{
   int i, j, k, l;
   int n, prod;
   MFP_t mfp[8] = {
      &multiply1, &multiply2, &multiply3, &multiply4,
      &multiply5, &multiply6, &multiply7, &multiply8
   };

   i = atoi(argv[1]);
   j = atoi(argv[2]) - 1;

   for(n = 0; n < i; n++) {
      //printf("\n********************* n=%5d **************************\n\n", n);
      for(k = 1; k <= 256; k++) {
         //printf("k=%d\n", k);
         for(l = 1; l <= 256; l++) {
            prod = mfp[j](k, l);
            if(prod != (k * l)) {
               printf("error: %d; %d\n", prod, k*l);
               break;
            }
         }
      }
   }

   return EXIT_SUCCESS;
}
 
/*
 * int multiply1(int n, int a) - Multiply two unsigned integers using the
 *                               Egyptian multiplication algorithm.
 */
int multiply1(int n, int a)
{
   if(1 == n) return(a);
   int result = multiply1(half(n), a << 1);
   if(odd(n)) return(result + a);
   return(result);
}
 
int multiply2(int n, int a)
{
   if(1 == n) return(a);
   return(mult_acc0(0, n, a));
}

int mult_acc0(int r, int n, int a)
{
   if(1 == n) return(r + a);
   if(odd(n)) {
      return(mult_acc0(r + a, half(n), a << 1));
   } else {
      return(mult_acc0(r, half(n), a << 1));
   }
}
 
int multiply3(int n, int a)
{
   if(1 == n) return(a);
   return(mult_acc1(0, n, a));
}

int mult_acc1(int r, int n, int a)
{
   if(1 == n) return(r + a);
   if(odd(n)) r += a;
   return(mult_acc1(r, half(n), a << 1));
}
 
int multiply4(int n, int a)
{
   if(1 == n) return(a);
   return(mult_acc2(0, n, a));
}

int mult_acc2(int r, int n, int a)
{
   if(odd(n)) {
      r += a;
      if(1 == n) return(r);
   }
   return(mult_acc2(r, half(n), a << 1));
}
 
int multiply5(int n, int a)
{
   if(1 == n) return(a);
   return(mult_acc3(0, n, a));
}

int mult_acc3(int r, int n, int a)
{
   if(odd(n)) {
      r += a;
      if(1 == n) return(r);
   }
   n = half(n);
   a = a << 1;
   return(mult_acc3(r, n, a));
}
 
int multiply6(int n, int a)
{
   if(1 == n) return(a);
   return(mult_acc4(a, n - 1, a));
}

int mult_acc4(int r, int n, int a)
{
   while(1) {
      if(odd(n)) {
         r += a;
         if(1 == n) return(r);
      }
      n = half(n);
      a = a << 1;
   }
}
 
int multiply7(int n, int a)
{
   while(!odd(n)) {
      a = a << 1;
      n = half(n);
   }
   if(1 == n) return(a);
   return(mult_acc4(a, n - 1, a));
}
 
int multiply8(int n, int a)
{
   while(!odd(n)) {
      a = a << 1;
      n = half(n);
   }
   if(1 == n) return(a);
   return(mult_acc4(a, half(n - 1), a << 1));
}
   
					
_________________
Michael A.

Posted: Sun Sep 03, 2017 11:46 pm
	 
	
BigEd
Joined: Wed Jan 09, 2013 6:54 pm
Posts: 1852
			
				
				
					
Interesting! A very thorough job of doing the exercises, measuring, and then having a think. I'm guilty of almost always skipping exercises, which is very self-limiting.

I was going to see if I could add anything by colouring: first according to best/worst per row, which looks at the compiler, and then according to best/worst in each column, which looks at the execution. But it seems I can't add colour to code blocks. Still, I think even writing that paragraph shows that we've got three things going on:
- the algorithm, in the abstract
- the compiler's competence in building a code sequence
- the CPU's competence in executing that code sequence

As your thread title indicates, the modern CPU is something of a beast in forging ahead: speculatively executing through conditional branches, and making light work of jumps, calls, and returns. Let's have a look without the benefit of colour:

MichaelM wrote:
Code:
Routine_Nm,Dbg/-O3,Rls/-Os,Rls/-O3,Rls/-O2,Rls/-O1,Rls/-O0
Multiply_1,  7.234,  9.585,  7.389,  9.904, 10.778, 17.481
Multiply_2,  4.975,  7.063,  4.988,  8.005, 15.532, 17.464
Multiply_3,  4.725,  6.218,  4.721,  4.711, 11.791, 17.492
Multiply_4,  6.633,  6.236,  6.625,  6.687, 12.668, 17.995
Multiply_5,  4.286,  6.065,  4.283,  4.287, 14.193, 17.850
Multiply_6,  5.939,  6.961,  5.930,  5.935,  7.199,  9.832
Multiply_7,  4.481,  6.667,  4.481,  4.481,  7.178,  9.755
Multiply_8,  5.016,  6.192,  5.013,  5.015,  6.709,  9.001
We can see from the final column, where the compiler isn't being clever, that versions 6, 7 and 8 are a lot "better" - but we can't tell if that's the algorithm or the CPU. In an effort to gain some insight, I've cropped the 'perf' results down to what I think are the most informative stats, noting that these are for the O3 compilation, so the compiler has already tried to squeeze as much as it can out of the source code:

MichaelM wrote:
Code:
 Performance counter stats for 'Release/Stepanov 2000 1':
      7,238,360,838      cycles                    #    0.994 GHz                     [50.01%]
     13,485,495,935      instructions              #    1.86  insns per cycle         [50.04%]
      1,680,883,125      branches                  #  230.886 M/sec                   [49.97%]
            861,832      branch-misses             #    0.05% of all branches         [49.87%]
 Performance counter stats for 'Release/Stepanov 2000 2':
      4,979,463,334      cycles                    #    0.993 GHz                     [49.96%]
      9,100,716,091      instructions              #    1.83  insns per cycle         [50.14%]
      1,450,183,464      branches                  #  289.135 M/sec                   [50.03%]
            848,119      branch-misses             #    0.06% of all branches         [49.84%]
 Performance counter stats for 'Release/Stepanov 2000 3':
      4,644,182,243      cycles                    #    0.995 GHz                     [50.08%]
      7,500,004,657      instructions              #    1.61  insns per cycle         [49.99%]
      1,449,465,718      branches                  #  310.590 M/sec                   [50.03%]
            807,622      branch-misses             #    0.06% of all branches         [50.15%]
 Performance counter stats for 'Release/Stepanov 2000 4':
      6,405,362,667      cycles                    #    0.995 GHz                     [49.99%]
      6,836,760,873      instructions              #    1.07  insns per cycle         [50.11%]
      2,232,507,206      branches                  #  346.774 M/sec                   [50.03%]
          8,277,203      branch-misses             #    0.37% of all branches         [49.96%]
 Performance counter stats for 'Release/Stepanov 2000 5':
      4,306,586,028      cycles                    #    0.995 GHz                     [49.95%]
      6,839,918,869      instructions              #    1.59  insns per cycle         [50.10%]
      2,235,540,097      branches                  #  516.324 M/sec                   [50.04%]
          9,797,084      branch-misses             #    0.44% of all branches         [49.95%]
 Performance counter stats for 'Release/Stepanov 2000 6':
      5,790,127,256      cycles                    #    0.995 GHz                     [49.91%]
      7,097,873,094      instructions              #    1.23  insns per cycle         [50.11%]
      2,234,531,073      branches                  #  383.985 M/sec                   [50.14%]
          7,263,670      branch-misses             #    0.33% of all branches         [49.98%]
 Performance counter stats for 'Release/Stepanov 2000 7':
      4,493,347,007      cycles                    #    0.995 GHz                     [50.03%]
      7,025,496,576      instructions              #    1.56  insns per cycle         [49.97%]
      2,226,158,240      branches                  #  492.721 M/sec                   [50.02%]
         12,654,222      branch-misses             #    0.57% of all branches         [50.03%]
 Performance counter stats for 'Release/Stepanov 2000 8':
      5,012,721,116      cycles                    #    0.995 GHz                     [50.09%]
      6,643,054,184      instructions              #    1.33  insns per cycle         [49.91%]
      2,097,877,713      branches                  #  416.480 M/sec                   [50.00%]
          5,236,597      branch-misses             #    0.25% of all branches         [49.95%]
We do at least see these points, I think:
- the number of executed instructions goes down at first, from 1 to 2 to 3, and then for the rest of the pack.
- the final code version, 8, executes the fewest branches of the 4-8 group.
- and 8 misses the fewest in that group too, which means its branches are predictable.

I'd like to be able to say something useful about the instructions per cycle figure, but I'm not sure I can.
					
  
Posted: Mon Sep 04, 2017 3:53 am
	 
	
rwiker
Joined: Tue Aug 08, 2017 1:33 pm
Posts: 12
			
				
				
					
I just tested this on a Linux box (dual Xeon E5630, Ubuntu Mate 16.04, clang), and found version 3 to be the fastest with -O3. At first, I didn't even notice version 3 (since I was focussing on version 5), and so wrote this:

Code:
int multiply9(int n, int a)
{
   return mult_acc5(0, n, a);
}

int mult_acc5(int r, int n, int a)
{
   r += (odd(n) ? a : 0);
   return ((n > 1) ? mult_acc5(r, n >> 1, a << 1) : r);
}
which gives identical runtime to version 3 on my setup. The tail-recursive implementations are turned into iterative code at sufficiently high optimization levels (basically, the recursive calls become gotos to a point near the start of the function), so there is effectively no function-call overhead in this situation. Disassembling the code should be interesting, too.
					
  
Posted: Mon Sep 04, 2017 8:25 am
	 
	
MichaelM
			
				
				
					
Very nice, compact realization. I tried several construction changes like those rwiker includes in multiply9. With GCC, I did not get performance improvements using the conditional operator (?:), but I did see a minor improvement of a few percent using a+a instead of a<<1.

I don't have ready access to different x86 processor variants, but it may be interesting to see a comparison of results from several variants using multiply3, multiply5, and multiply9, while varying the compiler and optimization level. I suspect that, even for a simple benchmark like these multiply routines, we should see performance variations attributable to the architectural differences between the processors. AMD and Intel processors, although they present compatible ISAs, have significant differences in micro-architecture: different caching policies, different instruction decoding logic, different pipeline depths and ILP strategies, etc. I suspect that, on average, the compiler code generators and optimization routines produce code that is a good match to the x86 processors at some level.

I copied rwiker's code and ran it in my environment. The results are posted below:

Code:
morrisma@morrisma5 ~/Development/CProjects/Stepanov $ perf stat Release/Stepanov 2000 9
   Performance counter stats for 'Release/Stepanov 2000 9':
        4677.083417      task-clock (msec)         #    0.998 CPUs utilized
                411      context-switches          #    0.088 K/sec
                  8      cpu-migrations            #    0.002 K/sec
                 51      page-faults               #    0.011 K/sec
      4,653,583,491      cycles                    #    0.995 GHz                     [50.04%]
                  0      stalled-cycles-frontend   #    0.00% frontend cycles idle    [50.08%]
                  0      stalled-cycles-backend    #    0.00% backend  cycles idle    [50.07%]
      8,707,304,550      instructions              #    1.87  insns per cycle         [49.98%]
      1,578,000,951      branches                  #  337.390 M/sec                   [50.03%]
            732,542      branch-misses             #    0.05% of all branches         [50.00%]
         4.687538150 seconds time elapsed
  morrisma@morrisma5 ~/Development/CProjects/Stepanov $ 
In my environment, this code improves on the performance my tests gave for multiply3 at the -O3 optimization level, but it doesn't beat multiply5. Perhaps the difference in the results can be attributed to the micro-architectural differences between the Intel Xeon E5630 and the AMD E1 processors.
					
_________________
Michael A.

Posted: Mon Sep 04, 2017 1:14 pm
	 
	
rwiker
			
				
				
					
It's quite likely that both architectural differences and compiler details affect the performance of the different implementations. I compiled the code using clang (-O3) on my MacBook, and disassembled it in the debugger:

Code:
.../stepanov {703} $ lldb stepanov
(lldb) target create "stepanov"
Current executable set to 'stepanov' (x86_64).
(lldb) disass -n multiply9
stepanov`multiply9:
stepanov[0x100000e00] <+0>:  pushq  %rbp
stepanov[0x100000e01] <+1>:  movq   %rsp, %rbp
stepanov[0x100000e04] <+4>:  movl   %edi, %eax
stepanov[0x100000e06] <+6>:  andl   $0x1, %eax
stepanov[0x100000e09] <+9>:  cmovnel %esi, %eax
stepanov[0x100000e0c] <+12>: cmpl   $0x2, %edi
stepanov[0x100000e0f] <+15>: jl     0x100000e33               ; <+51>
stepanov[0x100000e11] <+17>: nopw   %cs:(%rax,%rax)
stepanov[0x100000e20] <+32>: sarl   %edi
stepanov[0x100000e22] <+34>: addl   %esi, %esi
stepanov[0x100000e24] <+36>: movl   %edi, %ecx
stepanov[0x100000e26] <+38>: andl   $0x1, %ecx
stepanov[0x100000e29] <+41>: cmovnel %esi, %ecx
stepanov[0x100000e2c] <+44>: addl   %ecx, %eax
stepanov[0x100000e2e] <+46>: cmpl   $0x1, %edi
stepanov[0x100000e31] <+49>: jg     0x100000e20               ; <+32>
stepanov[0x100000e33] <+51>: popq   %rbp
stepanov[0x100000e34] <+52>: retq
stepanov[0x100000e35] <+53>: nopw   %cs:(%rax,%rax)
  (lldb) disass -n mult_acc5 stepanov`mult_acc5: stepanov[0x100000f30] <+0>:  pushq  %rbp stepanov[0x100000f31] <+1>:  movq   %rsp, %rbp stepanov[0x100000f34] <+4>:  movl   %esi, %eax stepanov[0x100000f36] <+6>:  andl   $0x1, %eax stepanov[0x100000f39] <+9>:  cmovnel %edx, %eax stepanov[0x100000f3c] <+12>: addl   %edi, %eax stepanov[0x100000f3e] <+14>: cmpl   $0x2, %esi stepanov[0x100000f41] <+17>: jl     0x100000f63               ; <+51> stepanov[0x100000f43] <+19>: nopw   %cs:(%rax,%rax) stepanov[0x100000f50] <+32>: sarl   %esi stepanov[0x100000f52] <+34>: addl   %edx, %edx stepanov[0x100000f54] <+36>: movl   %esi, %ecx stepanov[0x100000f56] <+38>: andl   $0x1, %ecx stepanov[0x100000f59] <+41>: cmovnel %edx, %ecx stepanov[0x100000f5c] <+44>: addl   %ecx, %eax stepanov[0x100000f5e] <+46>: cmpl   $0x1, %esi stepanov[0x100000f61] <+49>: jg     0x100000f50               ; <+32> stepanov[0x100000f63] <+51>: popq   %rbp stepanov[0x100000f64] <+52>: retq   
 
 
It seems to me that the compiler decided to inline mult_acc5 into multiply9, while still emitting mult_acc5 as a standalone function for callers outside the current file. I keep getting surprised by how good modern compilers can be (then again, I've been surprised by compilers since at least 1992)... It would be interesting to see how well GCC does on this.
					
  
			 | 
		 
		
			| Mon Sep 04, 2017 5:21 pm | 
			
				
			 | 
    	
		 
	
	
		  | 
	 
	
			| 
				
				 rwiker 
				
				
					 Joined: Tue Aug 08, 2017 1:33 pm Posts: 12
				 
				 
			 | 
			
				
				
					
If I move mult_acc5 into a separate source file, the disassembly of multiply9 looks like this:

Code:
(lldb) disass -n multiply9
stepanov`multiply9:
stepanov[0x100000e20] <+0>:  pushq  %rbp
stepanov[0x100000e21] <+1>:  movq   %rsp, %rbp
stepanov[0x100000e24] <+4>:  movl   %esi, %eax
stepanov[0x100000e26] <+6>:  movl   %edi, %ecx
stepanov[0x100000e28] <+8>:  xorl   %edi, %edi
stepanov[0x100000e2a] <+10>: movl   %ecx, %esi
stepanov[0x100000e2c] <+12>: movl   %eax, %edx
stepanov[0x100000e2e] <+14>: popq   %rbp
stepanov[0x100000e2f] <+15>: jmp    0x100000f30               ; mult_acc5
stepanov[0x100000e34] <+20>: nopw   %cs:(%rax,%rax)
  (lldb) 
  So, clang definitely inline-expanded mult_acc5 into multiply9 when both functions were in the same source file.  
					
  
			 | 
		 
		
			| Mon Sep 04, 2017 5:27 pm | 
			
				
			 | 
    	
		 
	
	
		  | 
	 
	
			| 
				
				 MichaelM 
				
				
					 Joined: Wed Apr 24, 2013 9:40 pm Posts: 213 Location: Huntsville, AL
				 
				 
			 | 
			
				
				
					
The following code represents the disassembly of my Stepanov object module. I am much more familiar with 8086/80186 assembly language than with that of the larger members of the ISA, but it appears to me that the mult_accX functions have been inlined into the multiplyY functions. The transformations/optimizations that these compilers implement are rather amazing. I need to find and read an architecture description for these processors; my knowledge of the ISA is rather dated. (Note: there is a very interesting discussion of the repz ret instruction that follows the je offset instruction continuing the loop/recursion in the multiply5 function. I can't say that the reason given for that specific construction applies to Intel platforms, but apparently it is a holdover in gcc to support AMD's recommendations for the K8 processor. Another side note: one AMD recommendation for the early 64-bit devices was to limit the number of branches to 3 or fewer in a 16-byte block, which I interpret to include return instructions as well. Apparently, that processor linked branch prediction to the instruction cache and provided three branch predictions per 16-byte block.)

Code:
morrisma@morrisma5 ~/Development/CProjects/Stepanov $ objdump -d Release/src/Stepanov.o
  Release/src/Stepanov.o:     file format elf64-x86-64
 
  Disassembly of section .text:
  0000000000000000 <multiply2>:    0:   31 c0                   xor    %eax,%eax    2:   83 ff 01                cmp    $0x1,%edi    5:   74 29                   je     30 <multiply2+0x30>    7:   66 0f 1f 84 00 00 00    nopw   0x0(%rax,%rax,1)    e:   00 00    10:   89 f9                   mov    %edi,%ecx   12:   8d 14 30                lea    (%rax,%rsi,1),%edx   15:   d1 ff                   sar    %edi   17:   83 e1 01                and    $0x1,%ecx   1a:   85 c9                   test   %ecx,%ecx   1c:   0f 45 c2                cmovne %edx,%eax   1f:   01 f6                   add    %esi,%esi   21:   83 ff 01                cmp    $0x1,%edi   24:   75 ea                   jne    10 <multiply2+0x10>   26:   01 f0                   add    %esi,%eax   28:   c3                      retq      29:   0f 1f 80 00 00 00 00    nopl   0x0(%rax)   30:   89 f0                   mov    %esi,%eax   32:   c3                      retq      33:   66 66 66 66 2e 0f 1f    data16 data16 data16 nopw %cs:0x0(%rax,%rax,1)   3a:   84 00 00 00 00 00 
  0000000000000040 <multiply3>:   40:   31 c0                   xor    %eax,%eax   42:   83 ff 01                cmp    $0x1,%edi   45:   74 29                   je     70 <multiply3+0x30>   47:   66 0f 1f 84 00 00 00    nopw   0x0(%rax,%rax,1)   4e:   00 00    50:   8d 14 30                lea    (%rax,%rsi,1),%edx   53:   40 f6 c7 01             test   $0x1,%dil   57:   0f 45 c2                cmovne %edx,%eax   5a:   d1 ff                   sar    %edi   5c:   01 f6                   add    %esi,%esi   5e:   83 ff 01                cmp    $0x1,%edi   61:   75 ed                   jne    50 <multiply3+0x10>   63:   01 f0                   add    %esi,%eax   65:   c3                      retq      66:   66 2e 0f 1f 84 00 00    nopw   %cs:0x0(%rax,%rax,1)   6d:   00 00 00    70:   89 f0                   mov    %esi,%eax   72:   c3                      retq      73:   66 66 66 66 2e 0f 1f    data16 data16 data16 nopw %cs:0x0(%rax,%rax,1)   7a:   84 00 00 00 00 00 
  0000000000000080 <multiply4>:   80:   83 ff 01                cmp    $0x1,%edi   83:   74 23                   je     a8 <multiply4+0x28>   85:   31 c0                   xor    %eax,%eax   87:   eb 0b                   jmp    94 <multiply4+0x14>   89:   0f 1f 80 00 00 00 00    nopl   0x0(%rax)   90:   01 f6                   add    %esi,%esi   92:   d1 ff                   sar    %edi   94:   40 f6 c7 01             test   $0x1,%dil   98:   74 f6                   je     90 <multiply4+0x10>   9a:   01 f0                   add    %esi,%eax   9c:   83 ff 01                cmp    $0x1,%edi   9f:   75 ef                   jne    90 <multiply4+0x10>   a1:   f3 c3                   repz retq    a3:   0f 1f 44 00 00          nopl   0x0(%rax,%rax,1)   a8:   89 f0                   mov    %esi,%eax   aa:   c3                      retq      ab:   0f 1f 44 00 00          nopl   0x0(%rax,%rax,1)
  00000000000000b0 <multiply5>:   b0:   83 ff 01                cmp    $0x1,%edi   b3:   74 23                   je     d8 <multiply5+0x28>   b5:   31 c0                   xor    %eax,%eax   b7:   eb 0b                   jmp    c4 <multiply5+0x14>   b9:   0f 1f 80 00 00 00 00    nopl   0x0(%rax)   c0:   d1 ff                   sar    %edi   c2:   01 f6                   add    %esi,%esi   c4:   40 f6 c7 01             test   $0x1,%dil   c8:   74 f6                   je     c0 <multiply5+0x10>   ca:   01 f0                   add    %esi,%eax   cc:   83 ff 01                cmp    $0x1,%edi   cf:   75 ef                   jne    c0 <multiply5+0x10>   d1:   f3 c3                   repz retq    d3:   0f 1f 44 00 00          nopl   0x0(%rax,%rax,1)   d8:   89 f0                   mov    %esi,%eax   da:   c3                      retq      db:   0f 1f 44 00 00          nopl   0x0(%rax,%rax,1)
  00000000000000e0 <multiply6>:   e0:   83 ff 01                cmp    $0x1,%edi   e3:   89 f0                   mov    %esi,%eax   e5:   74 1a                   je     101 <multiply6+0x21>   e7:   83 ef 01                sub    $0x1,%edi   ea:   89 f2                   mov    %esi,%edx   ec:   eb 06                   jmp    f4 <multiply6+0x14>   ee:   66 90                   xchg   %ax,%ax   f0:   d1 ff                   sar    %edi   f2:   01 d2                   add    %edx,%edx   f4:   40 f6 c7 01             test   $0x1,%dil   f8:   74 f6                   je     f0 <multiply6+0x10>   fa:   01 d0                   add    %edx,%eax   fc:   83 ff 01                cmp    $0x1,%edi   ff:   75 ef                   jne    f0 <multiply6+0x10>  101:   f3 c3                   repz retq   103:   66 66 66 66 2e 0f 1f    data16 data16 data16 nopw %cs:0x0(%rax,%rax,1)  10a:   84 00 00 00 00 00 
  0000000000000110 <multiply7>:  110:   40 f6 c7 01             test   $0x1,%dil  114:   89 f0                   mov    %esi,%eax  116:   75 12                   jne    12a <multiply7+0x1a>  118:   0f 1f 84 00 00 00 00    nopl   0x0(%rax,%rax,1)  11f:   00   120:   d1 ff                   sar    %edi  122:   01 c0                   add    %eax,%eax  124:   40 f6 c7 01             test   $0x1,%dil  128:   74 f6                   je     120 <multiply7+0x10>  12a:   83 ff 01                cmp    $0x1,%edi  12d:   74 22                   je     151 <multiply7+0x41>  12f:   83 ef 01                sub    $0x1,%edi  132:   89 c2                   mov    %eax,%edx  134:   eb 0e                   jmp    144 <multiply7+0x34>  136:   66 2e 0f 1f 84 00 00    nopw   %cs:0x0(%rax,%rax,1)  13d:   00 00 00   140:   d1 ff                   sar    %edi  142:   01 d2                   add    %edx,%edx  144:   40 f6 c7 01             test   $0x1,%dil  148:   74 f6                   je     140 <multiply7+0x30>  14a:   01 d0                   add    %edx,%eax  14c:   83 ff 01                cmp    $0x1,%edi  14f:   75 ef                   jne    140 <multiply7+0x30>  151:   f3 c3                   repz retq   153:   66 66 66 66 2e 0f 1f    data16 data16 data16 nopw %cs:0x0(%rax,%rax,1)  15a:   84 00 00 00 00 00 
  0000000000000160 <multiply8>:  160:   40 f6 c7 01             test   $0x1,%dil  164:   89 f0                   mov    %esi,%eax  166:   75 12                   jne    17a <multiply8+0x1a>  168:   0f 1f 84 00 00 00 00    nopl   0x0(%rax,%rax,1)  16f:   00   170:   d1 ff                   sar    %edi  172:   01 c0                   add    %eax,%eax  174:   40 f6 c7 01             test   $0x1,%dil  178:   74 f6                   je     170 <multiply8+0x10>  17a:   83 ff 01                cmp    $0x1,%edi  17d:   74 22                   je     1a1 <multiply8+0x41>  17f:   83 ef 01                sub    $0x1,%edi  182:   8d 14 00                lea    (%rax,%rax,1),%edx  185:   d1 ff                   sar    %edi  187:   eb 0b                   jmp    194 <multiply8+0x34>  189:   0f 1f 80 00 00 00 00    nopl   0x0(%rax)  190:   d1 ff                   sar    %edi  192:   01 d2                   add    %edx,%edx  194:   40 f6 c7 01             test   $0x1,%dil  198:   74 f6                   je     190 <multiply8+0x30>  19a:   01 d0                   add    %edx,%eax  19c:   83 ff 01                cmp    $0x1,%edi  19f:   75 ef                   jne    190 <multiply8+0x30>  1a1:   f3 c3                   repz retq   1a3:   66 66 66 66 2e 0f 1f    data16 data16 data16 nopw %cs:0x0(%rax,%rax,1)  1aa:   84 00 00 00 00 00 
  00000000000001b0 <multiply9>:  1b0:   83 ff 01                cmp    $0x1,%edi  1b3:   7e 23                   jle    1d8 <multiply9+0x28>  1b5:   31 c0                   xor    %eax,%eax  1b7:   eb 0b                   jmp    1c4 <multiply9+0x14>  1b9:   0f 1f 80 00 00 00 00    nopl   0x0(%rax)  1c0:   01 f6                   add    %esi,%esi  1c2:   d1 ff                   sar    %edi  1c4:   89 fa                   mov    %edi,%edx  1c6:   83 e2 01                and    $0x1,%edx  1c9:   0f 45 d6                cmovne %esi,%edx  1cc:   01 d0                   add    %edx,%eax  1ce:   83 ff 01                cmp    $0x1,%edi  1d1:   75 ed                   jne    1c0 <multiply9+0x10>  1d3:   f3 c3                   repz retq   1d5:   0f 1f 00                nopl   (%rax)  1d8:   89 f0                   mov    %esi,%eax  1da:   c3                      retq     1db:   0f 1f 44 00 00          nopl   0x0(%rax,%rax,1)
  00000000000001e0 <multiply1>:  1e0:   41 57                   push   %r15  1e2:   41 56                   push   %r14  1e4:   41 55                   push   %r13  1e6:   41 54                   push   %r12  1e8:   41 89 f4                mov    %esi,%r12d  1eb:   55                      push   %rbp  1ec:   53                      push   %rbx  1ed:   48 83 ec 18             sub    $0x18,%rsp  1f1:   83 ff 01                cmp    $0x1,%edi  1f4:   0f 84 b1 00 00 00       je     2ab <multiply1+0xcb>  1fa:   89 fa                   mov    %edi,%edx  1fc:   89 f3                   mov    %esi,%ebx  1fe:   89 fd                   mov    %edi,%ebp  200:   d1 fa                   sar    %edx  202:   44 8d 24 36             lea    (%rsi,%rsi,1),%r12d  206:   83 fa 01                cmp    $0x1,%edx  209:   0f 84 92 00 00 00       je     2a1 <multiply1+0xc1>  20f:   89 f9                   mov    %edi,%ecx  211:   44 8d 2c b5 00 00 00    lea    0x0(,%rsi,4),%r13d  218:   00   219:   c1 f9 02                sar    $0x2,%ecx  21c:   83 f9 01                cmp    $0x1,%ecx  21f:   74 76                   je     297 <multiply1+0xb7>  221:   41 89 f8                mov    %edi,%r8d  224:   44 8d 34 f5 00 00 00    lea    0x0(,%rsi,8),%r14d  22b:   00   22c:   41 c1 f8 03             sar    $0x3,%r8d  230:   41 83 f8 01             cmp    $0x1,%r8d  234:   74 57                   je     28d <multiply1+0xad>  236:   41 89 f9                mov    %edi,%r9d  239:   41 89 f7                mov    %esi,%r15d  23c:   41 c1 f9 04             sar    $0x4,%r9d  240:   41 c1 e7 04             shl    $0x4,%r15d  244:   41 83 f9 01             cmp    $0x1,%r9d  248:   44 89 0c 24             mov    %r9d,(%rsp)  24c:   74 34                   je     282 <multiply1+0xa2>  24e:   c1 e6 05                shl    $0x5,%esi  251:   c1 ff 05                sar    $0x5,%edi  254:   44 89 44 24 0c          mov    %r8d,0xc(%rsp)  259:   89 4c 24 08             mov    %ecx,0x8(%rsp)  25d:   89 54 24 
04             mov    %edx,0x4(%rsp)  261:   e8 00 00 00 00          callq  266 <multiply1+0x86>  266:   44 8b 0c 24             mov    (%rsp),%r9d  26a:   44 8b 44 24 0c          mov    0xc(%rsp),%r8d  26f:   41 01 c7                add    %eax,%r15d  272:   8b 4c 24 08             mov    0x8(%rsp),%ecx  276:   8b 54 24 04             mov    0x4(%rsp),%edx  27a:   41 83 e1 01             and    $0x1,%r9d  27e:   44 0f 44 f8             cmove  %eax,%r15d  282:   45 01 fe                add    %r15d,%r14d  285:   41 83 e0 01             and    $0x1,%r8d  289:   45 0f 44 f7             cmove  %r15d,%r14d  28d:   45 01 f5                add    %r14d,%r13d  290:   83 e1 01                and    $0x1,%ecx  293:   45 0f 44 ee             cmove  %r14d,%r13d  297:   45 01 ec                add    %r13d,%r12d  29a:   83 e2 01                and    $0x1,%edx  29d:   45 0f 44 e5             cmove  %r13d,%r12d  2a1:   44 01 e3                add    %r12d,%ebx  2a4:   83 e5 01                and    $0x1,%ebp  2a7:   44 0f 45 e3             cmovne %ebx,%r12d  2ab:   48 83 c4 18             add    $0x18,%rsp  2af:   44 89 e0                mov    %r12d,%eax  2b2:   5b                      pop    %rbx  2b3:   5d                      pop    %rbp  2b4:   41 5c                   pop    %r12  2b6:   41 5d                   pop    %r13  2b8:   41 5e                   pop    %r14  2ba:   41 5f                   pop    %r15  2bc:   c3                      retq     2bd:   0f 1f 00                nopl   (%rax)
  00000000000002c0 <mult_acc0>:  2c0:   83 fe 01                cmp    $0x1,%esi  2c3:   89 d0                   mov    %edx,%eax  2c5:   74 21                   je     2e8 <mult_acc0+0x28>  2c7:   66 0f 1f 84 00 00 00    nopw   0x0(%rax,%rax,1)  2ce:   00 00   2d0:   89 f1                   mov    %esi,%ecx  2d2:   8d 04 12                lea    (%rdx,%rdx,1),%eax  2d5:   d1 fe                   sar    %esi  2d7:   83 e1 01                and    $0x1,%ecx  2da:   01 fa                   add    %edi,%edx  2dc:   85 c9                   test   %ecx,%ecx  2de:   0f 45 fa                cmovne %edx,%edi  2e1:   83 fe 01                cmp    $0x1,%esi  2e4:   89 c2                   mov    %eax,%edx  2e6:   75 e8                   jne    2d0 <mult_acc0+0x10>  2e8:   01 f8                   add    %edi,%eax  2ea:   c3                      retq     2eb:   0f 1f 44 00 00          nopl   0x0(%rax,%rax,1)
  00000000000002f0 <mult_acc1>:  2f0:   eb 11                   jmp    303 <mult_acc1+0x13>  2f2:   66 0f 1f 44 00 00       nopw   0x0(%rax,%rax,1)  2f8:   40 f6 c6 01             test   $0x1,%sil  2fc:   0f 45 f8                cmovne %eax,%edi  2ff:   01 d2                   add    %edx,%edx  301:   d1 fe                   sar    %esi  303:   83 fe 01                cmp    $0x1,%esi  306:   8d 04 17                lea    (%rdi,%rdx,1),%eax  309:   75 ed                   jne    2f8 <mult_acc1+0x8>  30b:   f3 c3                   repz retq   30d:   0f 1f 00                nopl   (%rax)
  0000000000000310 <mult_acc2>:  310:   89 f8                   mov    %edi,%eax  312:   eb 08                   jmp    31c <mult_acc2+0xc>  314:   0f 1f 40 00             nopl   0x0(%rax)  318:   01 d2                   add    %edx,%edx  31a:   d1 fe                   sar    %esi  31c:   40 f6 c6 01             test   $0x1,%sil  320:   74 f6                   je     318 <mult_acc2+0x8>  322:   01 d0                   add    %edx,%eax  324:   83 fe 01                cmp    $0x1,%esi  327:   75 ef                   jne    318 <mult_acc2+0x8>  329:   f3 c3                   repz retq   32b:   0f 1f 44 00 00          nopl   0x0(%rax,%rax,1)
  0000000000000330 <mult_acc3>:  330:   89 f8                   mov    %edi,%eax  332:   eb 08                   jmp    33c <mult_acc3+0xc>  334:   0f 1f 40 00             nopl   0x0(%rax)  338:   d1 fe                   sar    %esi  33a:   01 d2                   add    %edx,%edx  33c:   40 f6 c6 01             test   $0x1,%sil  340:   74 f6                   je     338 <mult_acc3+0x8>  342:   01 d0                   add    %edx,%eax  344:   83 fe 01                cmp    $0x1,%esi  347:   75 ef                   jne    338 <mult_acc3+0x8>  349:   f3 c3                   repz retq   34b:   0f 1f 44 00 00          nopl   0x0(%rax,%rax,1)
  0000000000000350 <mult_acc4>:  350:   89 f8                   mov    %edi,%eax  352:   eb 08                   jmp    35c <mult_acc4+0xc>  354:   0f 1f 40 00             nopl   0x0(%rax)  358:   d1 fe                   sar    %esi  35a:   01 d2                   add    %edx,%edx  35c:   40 f6 c6 01             test   $0x1,%sil  360:   74 f6                   je     358 <mult_acc4+0x8>  362:   01 d0                   add    %edx,%eax  364:   83 fe 01                cmp    $0x1,%esi  367:   75 ef                   jne    358 <mult_acc4+0x8>  369:   f3 c3                   repz retq   36b:   0f 1f 44 00 00          nopl   0x0(%rax,%rax,1)
  0000000000000370 <mult_acc5>:  370:   89 f8                   mov    %edi,%eax  372:   eb 08                   jmp    37c <mult_acc5+0xc>  374:   0f 1f 40 00             nopl   0x0(%rax)  378:   01 d2                   add    %edx,%edx  37a:   d1 fe                   sar    %esi  37c:   89 f1                   mov    %esi,%ecx  37e:   83 e1 01                and    $0x1,%ecx  381:   0f 45 ca                cmovne %edx,%ecx  384:   01 c8                   add    %ecx,%eax  386:   83 fe 01                cmp    $0x1,%esi  389:   7f ed                   jg     378 <mult_acc5+0x8>  38b:   f3 c3                   repz retq 
  Disassembly of section .text.startup:
  0000000000000000 <main>:    0:   41 57                   push   %r15    2:   41 56                   push   %r14    4:   ba 0a 00 00 00          mov    $0xa,%edx    9:   41 55                   push   %r13    b:   41 54                   push   %r12    d:   55                      push   %rbp    e:   53                      push   %rbx    f:   48 89 f3                mov    %rsi,%rbx   12:   48 83 ec 58             sub    $0x58,%rsp   16:   48 8b 7e 08             mov    0x8(%rsi),%rdi   1a:   31 f6                   xor    %esi,%esi   1c:   48 c7 04 24 00 00 00    movq   $0x0,(%rsp)   23:   00    24:   48 c7 44 24 08 00 00    movq   $0x0,0x8(%rsp)   2b:   00 00    2d:   48 c7 44 24 10 00 00    movq   $0x0,0x10(%rsp)   34:   00 00    36:   48 c7 44 24 18 00 00    movq   $0x0,0x18(%rsp)   3d:   00 00    3f:   48 c7 44 24 20 00 00    movq   $0x0,0x20(%rsp)   46:   00 00    48:   48 c7 44 24 28 00 00    movq   $0x0,0x28(%rsp)   4f:   00 00    51:   48 c7 44 24 30 00 00    movq   $0x0,0x30(%rsp)   58:   00 00    5a:   48 c7 44 24 38 00 00    movq   $0x0,0x38(%rsp)   61:   00 00    63:   48 c7 44 24 40 00 00    movq   $0x0,0x40(%rsp)   6a:   00 00    6c:   e8 00 00 00 00          callq  71 <main+0x71>   71:   48 8b 7b 10             mov    0x10(%rbx),%rdi   75:   49 89 c6                mov    %rax,%r14   78:   ba 0a 00 00 00          mov    $0xa,%edx   7d:   31 f6                   xor    %esi,%esi   7f:   e8 00 00 00 00          callq  84 <main+0x84>   84:   45 85 f6                test   %r14d,%r14d   87:   8d 50 ff                lea    -0x1(%rax),%edx   8a:   7e 64                   jle    f0 <main+0xf0>   8c:   48 63 d2                movslq %edx,%rdx   8f:   45 31 ff                xor    %r15d,%r15d   92:   4c 8b 2c d4             mov    (%rsp,%rdx,8),%r13   96:   41 bc 01 00 00 00       mov    $0x1,%r12d   9c:   0f 1f 40 00             nopl   0x0(%rax)   a0:   44 89 e5                mov    %r12d,%ebp   a3:   bb 01 00 00 00          mov    $0x1,%ebx   a8:   
eb 14                   jmp    be <main+0xbe>   aa:   66 0f 1f 44 00 00       nopw   0x0(%rax,%rax,1)   b0:   83 c3 01                add    $0x1,%ebx   b3:   44 01 e5                add    %r12d,%ebp   b6:   81 fb 01 01 00 00       cmp    $0x101,%ebx   bc:   74 1c                   je     da <main+0xda>   be:   89 de                   mov    %ebx,%esi   c0:   44 89 e7                mov    %r12d,%edi   c3:   41 ff d5                callq  *%r13   c6:   39 c5                   cmp    %eax,%ebp   c8:   74 e6                   je     b0 <main+0xb0>   ca:   89 c6                   mov    %eax,%esi   cc:   89 ea                   mov    %ebp,%edx   ce:   bf 00 00 00 00          mov    $0x0,%edi   d3:   31 c0                   xor    %eax,%eax   d5:   e8 00 00 00 00          callq  da <main+0xda>   da:   41 83 c4 01             add    $0x1,%r12d   de:   41 81 fc 01 01 00 00    cmp    $0x101,%r12d   e5:   75 b9                   jne    a0 <main+0xa0>   e7:   41 83 c7 01             add    $0x1,%r15d   eb:   45 39 f7                cmp    %r14d,%r15d   ee:   75 a6                   jne    96 <main+0x96>   f0:   48 83 c4 58             add    $0x58,%rsp   f4:   31 c0                   xor    %eax,%eax   f6:   5b                      pop    %rbx   f7:   5d                      pop    %rbp   f8:   41 5c                   pop    %r12   fa:   41 5d                   pop    %r13   fc:   41 5e                   pop    %r14   fe:   41 5f                   pop    %r15  100:   c3                      retq    morrisma@morrisma5 ~/Development/CProjects/Stepanov $
   
					
						 _________________ Michael A. 
					
  
			 | 
		 
		
			| Mon Sep 04, 2017 8:31 pm | 
			
				
			 | 
    	
		 
	
	
		  | 
	 
	
			| 
				
				 BigEd 
				
				
					 Joined: Wed Jan 09, 2013 6:54 pm Posts: 1852
				 
				 
			 | 
			
				
				
					
In the land of GCC there are also -march and -mtune. See perhaps  https://stackoverflow.com/questions/106 ... chitecture 
					
  
			 | 
		 
		
			| Tue Sep 05, 2017 1:14 pm | 
			
				
			 | 
    	
		 
	
	
		  | 
	 
	
			| 
				
				 BigEd 
				
				
					 Joined: Wed Jan 09, 2013 6:54 pm Posts: 1852
				 
				 
			 | 
			
				
				
					
Also well worth a mention is Matt Godbolt's GCC Explorer, which allows you to explore:
- different versions of GCC
- GCC for x86 or for ARM
- any different command line options
and to do that for any short sequence of C code you care to paste in, from the comfort of your browser: https://godbolt.org/ Oh, and in a variety of source languages, and also to share links to interesting findings.
					
  
			 | 
		 
		
			| Tue Sep 05, 2017 6:12 pm | 
			
				
			 | 
    	
		 
	
	
		  | 
	 
	
			| 
				
				 rwiker 
				
				
					 Joined: Tue Aug 08, 2017 1:33 pm Posts: 12
				 
				 
			 | 
			
				
				
					
Just to muddy the waters a bit: I implemented this algorithm in Common Lisp, using (optional) declarations to induce the compiler (SBCL, in this case) to use fixnum arithmetic. This implementation took about 5.5 seconds (5000 iterations), which compares fairly well with the 3.16 seconds (I think) that I got from multiply3 with clang -O3.
					
  
			 | 
		 
		
			| Tue Sep 05, 2017 8:09 pm | 
			
				
			 | 
    	
		 
	
	
		  | 
	 
	
			| 
				
				 MichaelM 
				
				
					 Joined: Wed Apr 24, 2013 9:40 pm Posts: 213 Location: Huntsville, AL
				 
				 
			 | 
			
				
				
					
						 Big Ed:
  Very cool link. Bookmarked it. 
					
						 _________________ Michael A. 
					
  
			 | 
		 
		
			| Wed Sep 06, 2017 2:01 am | 
			
				
			 | 
    	
		 
	
	 
	
	
 
	 
	
	