Efficiently merge 2 large csv files in java by common labels -


i need merge 2 large csv files (approximately 40 million data elements in each ~500mb) common row or column labels, can specified user. example, if dataset1.csv contained:

patient_id    x1     x2    x3 pi1           1      2     3 pi3           4      5     6 

and dataset2.csv contained:

patient_id    y1    y2    y3 pi0           0     0     0 pi1           11    12    13 pi2           99    98    97 pi3           14    15    16 

the user specify merge these 2 files row labels (the patient ids) , resulting output.csv be:

patient_id    x1   x2   x3   y1    y2   y3 pi1           1    2    3    11    12   13 pi3           4    5    6    14    15   16 

since combine information patient ids common (intersection) both input files. strategy problem create hashmap, row or column labels merged (in case row labels, patient ids) keys , data patient id stored arraylist value. create hashmap each input data file , merge values based on similar keys. represent data 2-d arraylist of type arraylist> merged data has type. iterate through merged arraylist> object, call data type object, , print file. code below:

below datamerge class dependent on below data class file.

import java.util.hashmap; import java.util.arraylist;  public class datamerge {   /**merges 2 data objects similar label. example, if 2 data sets represent  * different data same set of patients, represented unique patient  * id, mergedata return data set containing patient ids common both  * data sets along data represented in both data sets. labelinrow1 , labelinrow2 separately   * indicate whether common labels in separate rows(true) of d1 , d2, respectively, or separate columns otherwise.*/   public static data mergedata(data d1, data d2, boolean labelinrow1,          boolean labelinrow2){     arraylist<arraylist<string>> mergeddata = new arraylist<arraylist<string>>();     hashmap<string,arraylist<string>> d1map = d1.mapfeaturetodata(labelinrow1);     hashmap<string,arraylist<string>> d2map = d2.mapfeaturetodata(labelinrow2);     arraylist<string> d1features;     arraylist<string> d2features;      if (labelinrow1){         d1features = d1.getcolumnlabels();     } else {         d1features = d1.getrowlabels();     }     if (labelinrow2){         d2features = d2.getcolumnlabels();     } else {         d2features = d2.getrowlabels();     }     d1features.trimtosize();     d2features.trimtosize();      arraylist<string> mergedfeatures = new arraylist<string>();     if ((d1.getlabellabel() != "") && (d1.getlabellabel() == "")) {         mergedfeatures.add(d1.getlabellabel());     }     else if ((d1.getlabellabel() == "") && (d1.getlabellabel() != "")) {         mergedfeatures.add(d2.getlabellabel());     } else {         mergedfeatures.add(d1.getlabellabel());     }      mergedfeatures.addall(d1features);     mergedfeatures.addall(d2features);     mergedfeatures.trimtosize();     mergeddata.add(mergedfeatures);      (string key : d1map.keyset()){         arraylist<string> currow = new arraylist<string>();         if (d2map.containskey(key)){             currow.add(key);             currow.addall(d1map.get(key));             currow.addall(d2map.get(key));             currow.trimtosize();             mergeddata.add(currow);         }     }     mergeddata.trimtosize();     data result = new data(mergeddata, true);     return result; }  } 

below data type object along associated hashmap generating functions row , column label extraction methods.

import java.util.*; import java.io.*;  /**represents unlabeled or labeled data set series of nested     arraylists, each nested   * arraylist represents line of input data.*/  public class data { private arraylist<string> collabels = new arraylist<string>();  //row labels  private arraylist<string> rowlabels = new arraylist<string>();  //column labels  private string labellabel;  private arraylist<arraylist<string>> unlabeleddata; //data without row , column labels    /**returns arraylist of arraylists, each nested arraylist represents line *of input file.*/ @suppresswarnings("resource") private static arraylist<arraylist<string>> readfile(string filepath, string filesep){     arraylist<arraylist<string>> result = new arraylist<arraylist<string>>();     try{         bufferedreader input = new bufferedreader(new filereader (filepath));         string line = input.readline();         while (line != null){             string[] splitline = line.split(filesep);             result.add(new arraylist<string>(arrays.aslist(splitline)));             line = input.readline();         }     }     catch (exception e){         system.err.println(e);     }     result.trimtosize();;     return result; }   /**returns arraylist of arraylists, each nested arraylist represents line of input  * data without row or column labels*/   private arraylist<arraylist<string>> extractlabelsanddata(string filepath, string filesep){     arraylist<arraylist<string>> tempdata = new arraylist<arraylist<string>>();     tempdata.addall(readfile(filepath, filesep));     tempdata.trimtosize();     this.collabels.addall(tempdata.remove(0));     this.labellabel = this.collabels.remove(0);     this.collabels.trimtosize();     (arraylist<string> line : tempdata){         this.rowlabels.add(line.remove(0));     }     this.rowlabels.trimtosize();     return tempdata; }     /**returns arraylist of arraylists, each nested arraylist represents line of input  * data without row or column labels. mutate original data*/ private arraylist<arraylist<string>> extractlabelsanddata (arraylist<arraylist<string>> data){     arraylist<arraylist<string>> result = new arraylist<arraylist<string>>();     (arraylist<string> line : data){         arraylist<string> temp = new arraylist<string>();         (string element : line){             temp.add(element);         }         temp.trimtosize();         result.add(temp);     }     this.collabels.addall(result.remove(0));     this.labellabel = this.collabels.remove(0);     this.collabels.trimtosize();     (arraylist<string> line : result){         this.rowlabels.add(line.remove(0));     }     this.rowlabels.trimtosize();     result.trimtosize();     return result; }   /**returns labellabel data*/ public string getlabellabel(){     return this.labellabel; }   /**returns arraylist of labels while maintaining order * in appear in data. row indicates desired * features in same row. assumed labels in * first row of data. */ public arraylist<string> getcolumnlabels(){     return this.collabels; }   /**returns arraylist of labels while maintaining order * in appear in data. column indicates desired * features in same column. assumed labels in * first column of data.*/ public arraylist<string> getrowlabels(){     return this.rowlabels; }   /**creates hashmap list of feature labels mapped entire data. example,  * if data set contains patient ids , test results, function can used create  * hashmap keys patient ids , values arraylist of test  * results. boolean input isrow, which, when true, designates  * desired keys listed in rows or false if in columns.*/ public hashmap<string, arraylist<string>> mapfeaturetodata(boolean isrow){     hashmap<string, arraylist<string>> featuremap = new hashmap<string,arraylist<string>>();     if (!isrow){         (arraylist<string> line : this.unlabeleddata){             (int = 0; < this.collabels.size(); i++){                 if (featuremap.containskey(this.collabels.get(i))){                     featuremap.get(this.collabels.get(i)).add(line.get(i));                 } else{                     arraylist<string> firstvalue = new arraylist<string>();                     firstvalue.add(line.get(i));                     featuremap.put(this.collabels.get(i), firstvalue);                 }             }         }     } else {         (int = 0; < this.rowlabels.size(); i++){             if (!featuremap.containskey(this.rowlabels.get(i))){                 featuremap.put(this.rowlabels.get(i), this.unlabeleddata.get(i));             } else {                 featuremap.get(this.rowlabels.get(i)).addall(this.unlabeleddata.get(i));             }         }     }     return featuremap; }    /**writes data file in specified outputpath. sep indicates data delimiter.  * labeledoutput indicates whether or not user wants data written file   * labeled or unlabeled. if data unlabeled begin with, labeledoutput   * should not set true. */ public void writedatatofile(string outputpath, string sep){     try {         printstream writer = new printstream(new bufferedoutputstream (new fileoutputstream (outputpath, true)));         string sol = this.labellabel + sep;         (int n = 0; n < this.collabels.size(); n++){             if (n == this.collabels.size()-1){                 sol += this.collabels.get(n) + "\n";             } else {                 sol += this.collabels.get(n) + sep;             }         }         (int = 0; < this.unlabeleddata.size(); i++){             arraylist<string> line = this.unlabeleddata.get(i);             sol += this.rowlabels.get(i) + sep;             (int j = 0; j < line.size(); j++){                 if (j == line.size()-1){                     sol += line.get(j);                 } else {                     sol += line.get(j) + sep;                 }             }             sol += "\n";         }         sol = sol.trim();         writer.print(sol);         writer.close();      } catch (exception e){         system.err.println(e);     } }   /**constructor data object. filepath specifies input file directory,  * filesep indicates file separator used in input file, , haslabels  * designates whether input data has row , column labels. note if   * haslabels set true, assumed there both row , column labels*/ public data(string filepath, string filesep, boolean haslabels){     if (haslabels){         this.unlabeleddata = extractlabelsanddata(filepath, filesep);         this.unlabeleddata.trimtosize();     } else {         this.unlabeleddata = readfile(filepath, filesep);         this.unlabeleddata.trimtosize();     }  }   /**constructor data object accepts nested arraylists inputs*/ public data (arraylist<arraylist<string>> data, boolean haslabels){     if (haslabels){         this.unlabeleddata = extractlabelsanddata(data);         this.unlabeleddata.trimtosize();     } else {         this.unlabeleddata = data;         this.unlabeleddata.trimtosize();     } } } 

the program works small datasets it's been 5+ days , merge still hasn't finished. i'm looking more efficient time , memory solution. suggested using byte arrays instead of strings, may make run faster. have suggestions?

edit: did digging around in code , found reading input files , merging them takes no time (like 20 seconds). writing file part takes 5+ days

you concatenating data fields millions of rows of data 1 ginormous string writing single string. slow death memory thrashing allocate , reallocate extremely large strings, copying them on , on , on each field , separator you're adding string. around 3rd or 4th day each string ... multiple millions of characters long? ... , poor garbage collector sweating , taking out on you.

don't that.

build each line of output file separately , write it. build next line.

furthermore, use stringbuilder class build lines, though you'll such improvement previous step might not bother this. though it's way , should learn how.


Comments

Popular posts from this blog

scala - 'wrong top statement declaration' when using slick in IntelliJ -

c# - DevExpress.Wpf.Grid.InfiniteGridSizeException was unhandled -

PySide and Qt Properties: Connecting signals from Python to QML -