I am trying to learn more about OpenMP and cache contention, so I wrote a simple program to better understand how it works. I am getting bad thread scaling for a simple addition of vectors, and I don't understand why. This is my program:

    #include <iostream>
    #include <omp.h>
    #include <vector>

    using namespace std;

    int main(){

        // initialize stuff
        int nuelements=20000000; // number of elements
        int i;
        vector<int> x, y, z;
        x.assign(nuelements,0);
        y.assign(nuelements,0);
        z.assign(nuelements,0);
        double start; // timer

        for (i=0;i<nuelements;++i){
            x[i]=i;
            y[i]=i;
        }

        // increase threads by 1 every time, and add the 2 vectors
        for (int t=1;t<5;++t){

            // re-set z vector values; assign() keeps the size, whereas
            // clear() would empty z and make z[i] below out of bounds
            z.assign(nuelements,0);

            // set number of threads for this iteration
            omp_set_num_threads(t);

            // start timer
            start=omp_get_wtime();

            // parallel for (the loop variable i is private to each thread)
            #pragma omp parallel for
            for (i=0;i<nuelements;++i)
            {
                z[i]=x[i]+y[i];
            }

            // print wall time
            cout<<"time "<<omp_get_max_threads()<<" thread(s) : "<<omp_get_wtime()-start<<endl;
        }
        return 0;
    }

Running this produces the following output:

    time 1 thread(s) : 0.020606
    time 2 thread(s) : 0.022671
    time 3 thread(s) : 0.026737
    time 4 thread(s) : 0.02825

I compiled with the command: clang++ -O3 -std=c++11 -fopenmp=libiomp5 test_omp.cpp

As you can see, the scaling gets worse as the number of threads increases. I am running this on a 4-core Intel i7 processor. Does anyone know what's happening?

You are limited by memory bandwidth, not CPU speed. It only takes one CPU core to keep the memory bus busy if all you're doing is addition and copying, so adding more cores doesn't help.
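To put a rough number on that, here is a minimal sketch that estimates the memory traffic implied by the timings in the question. The function name `effective_bandwidth_gbs` is made up for illustration, and the estimate assumes 4-byte `int`s and counts only the two reads and one write per element, so it ignores extra traffic such as write-allocate and should be read as a lower bound.

```cpp
#include <cassert>
#include <cstddef>

// Hypothetical helper (not from the question): estimates the memory traffic
// of a streaming loop in GB/s, where 1 GB = 1e9 bytes.
//   elements          number of loop iterations (vector length)
//   bytes_per_element size of one element, assumed 4 for int
//   arrays_touched    arrays read or written per iteration (x, y, z -> 3)
//   seconds           measured wall time of the loop
double effective_bandwidth_gbs(std::size_t elements,
                               std::size_t bytes_per_element,
                               int arrays_touched,
                               double seconds) {
    assert(seconds > 0.0);
    const double bytes = static_cast<double>(elements) * bytes_per_element * arrays_touched;
    return bytes / seconds / 1e9;
}
```

With the question's numbers, `effective_bandwidth_gbs(20000000, 4, 3, 0.020606)` comes out at about 11.6 GB/s for the single-threaded run. If that is already close to what your memory system can sustain (a streaming benchmark on your own machine would tell you), extra threads have nothing left to win.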

If you want to see the benefit of adding more threads, try executing more complex operations on a block of memory small enough to fit in the L1 or L2 cache.
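As a sketch of what such a test could look like (the names `heavy_update` and `time_heavy_update`, and the particular arithmetic, are my own choices, not anything from the question): each element gets many dependent multiply-add steps, so the loop spends its time computing rather than waiting on memory. A buffer of a few thousand doubles is assumed to fit in cache on typical desktop CPUs.

```cpp
#include <cassert>
#include <chrono>
#include <vector>

// Hypothetical compute-bound kernel: applies `rounds` multiply-add steps to
// every element. Each element is loaded once and stored once, so with a
// large `rounds` the arithmetic dominates the memory traffic.
void heavy_update(std::vector<double>& data, int rounds, int threads) {
    assert(threads >= 1);
    const long n = static_cast<long>(data.size());
    #pragma omp parallel for num_threads(threads)
    for (long i = 0; i < n; ++i) {
        double v = data[i];
        for (int r = 0; r < rounds; ++r) {
            v = v * 0.5 + 1.0;  // dependent chain, converges towards 2.0
        }
        data[i] = v;
    }
}

// Times one call to heavy_update and returns the wall time in seconds.
// The caller owns `data`, so the results stay observable after the call.
double time_heavy_update(std::vector<double>& data, int rounds, int threads) {
    const auto t0 = std::chrono::steady_clock::now();
    heavy_update(data, rounds, threads);
    const auto t1 = std::chrono::steady_clock::now();
    return std::chrono::duration<double>(t1 - t0).count();
}
```

Try it with something like a 4096-element vector (32 KiB of doubles) and a large `rounds` value, calling `time_heavy_update` with `threads` from 1 to 4 and comparing the times. Compile with `-fopenmp` as in the question. Without it the pragma is ignored and the loop simply runs serially.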


