Efficiently merge 2 large csv files in java by common labels -
i need merge 2 large csv files (approximately 40 million data elements in each ~500mb) common row or column labels, can specified user. example, if dataset1.csv contained:
patient_id x1 x2 x3 pi1 1 2 3 pi3 4 5 6
and dataset2.csv contained:
patient_id y1 y2 y3 pi0 0 0 0 pi1 11 12 13 pi2 99 98 97 pi3 14 15 16
the user specify merge these 2 files row labels (the patient ids) , resulting output.csv be:
patient_id x1 x2 x3 y1 y2 y3 pi1 1 2 3 11 12 13 pi3 4 5 6 14 15 16
since combine information patient ids common (intersection) both input files. strategy problem create hashmap, row or column labels merged (in case row labels, patient ids) keys , data patient id stored arraylist value. create hashmap each input data file , merge values based on similar keys. represent data 2-d arraylist of type arraylist> merged data has type. iterate through merged arraylist> object, call data type object, , print file. code below:
below datamerge class dependent on below data class file.
import java.util.hashmap; import java.util.arraylist; public class datamerge { /**merges 2 data objects similar label. example, if 2 data sets represent * different data same set of patients, represented unique patient * id, mergedata return data set containing patient ids common both * data sets along data represented in both data sets. labelinrow1 , labelinrow2 separately * indicate whether common labels in separate rows(true) of d1 , d2, respectively, or separate columns otherwise.*/ public static data mergedata(data d1, data d2, boolean labelinrow1, boolean labelinrow2){ arraylist<arraylist<string>> mergeddata = new arraylist<arraylist<string>>(); hashmap<string,arraylist<string>> d1map = d1.mapfeaturetodata(labelinrow1); hashmap<string,arraylist<string>> d2map = d2.mapfeaturetodata(labelinrow2); arraylist<string> d1features; arraylist<string> d2features; if (labelinrow1){ d1features = d1.getcolumnlabels(); } else { d1features = d1.getrowlabels(); } if (labelinrow2){ d2features = d2.getcolumnlabels(); } else { d2features = d2.getrowlabels(); } d1features.trimtosize(); d2features.trimtosize(); arraylist<string> mergedfeatures = new arraylist<string>(); if ((d1.getlabellabel() != "") && (d1.getlabellabel() == "")) { mergedfeatures.add(d1.getlabellabel()); } else if ((d1.getlabellabel() == "") && (d1.getlabellabel() != "")) { mergedfeatures.add(d2.getlabellabel()); } else { mergedfeatures.add(d1.getlabellabel()); } mergedfeatures.addall(d1features); mergedfeatures.addall(d2features); mergedfeatures.trimtosize(); mergeddata.add(mergedfeatures); (string key : d1map.keyset()){ arraylist<string> currow = new arraylist<string>(); if (d2map.containskey(key)){ currow.add(key); currow.addall(d1map.get(key)); currow.addall(d2map.get(key)); currow.trimtosize(); mergeddata.add(currow); } } mergeddata.trimtosize(); data result = new data(mergeddata, true); return result; } }
below data type object along associated hashmap generating functions row , column label extraction methods.
import java.util.*; import java.io.*; /**represents unlabeled or labeled data set series of nested arraylists, each nested * arraylist represents line of input data.*/ public class data { private arraylist<string> collabels = new arraylist<string>(); //row labels private arraylist<string> rowlabels = new arraylist<string>(); //column labels private string labellabel; private arraylist<arraylist<string>> unlabeleddata; //data without row , column labels /**returns arraylist of arraylists, each nested arraylist represents line *of input file.*/ @suppresswarnings("resource") private static arraylist<arraylist<string>> readfile(string filepath, string filesep){ arraylist<arraylist<string>> result = new arraylist<arraylist<string>>(); try{ bufferedreader input = new bufferedreader(new filereader (filepath)); string line = input.readline(); while (line != null){ string[] splitline = line.split(filesep); result.add(new arraylist<string>(arrays.aslist(splitline))); line = input.readline(); } } catch (exception e){ system.err.println(e); } result.trimtosize();; return result; } /**returns arraylist of arraylists, each nested arraylist represents line of input * data without row or column labels*/ private arraylist<arraylist<string>> extractlabelsanddata(string filepath, string filesep){ arraylist<arraylist<string>> tempdata = new arraylist<arraylist<string>>(); tempdata.addall(readfile(filepath, filesep)); tempdata.trimtosize(); this.collabels.addall(tempdata.remove(0)); this.labellabel = this.collabels.remove(0); this.collabels.trimtosize(); (arraylist<string> line : tempdata){ this.rowlabels.add(line.remove(0)); } this.rowlabels.trimtosize(); return tempdata; } /**returns arraylist of arraylists, each nested arraylist represents line of input * data without row or column labels. mutate original data*/ private arraylist<arraylist<string>> extractlabelsanddata (arraylist<arraylist<string>> data){ arraylist<arraylist<string>> result = new arraylist<arraylist<string>>(); (arraylist<string> line : data){ arraylist<string> temp = new arraylist<string>(); (string element : line){ temp.add(element); } temp.trimtosize(); result.add(temp); } this.collabels.addall(result.remove(0)); this.labellabel = this.collabels.remove(0); this.collabels.trimtosize(); (arraylist<string> line : result){ this.rowlabels.add(line.remove(0)); } this.rowlabels.trimtosize(); result.trimtosize(); return result; } /**returns labellabel data*/ public string getlabellabel(){ return this.labellabel; } /**returns arraylist of labels while maintaining order * in appear in data. row indicates desired * features in same row. assumed labels in * first row of data. */ public arraylist<string> getcolumnlabels(){ return this.collabels; } /**returns arraylist of labels while maintaining order * in appear in data. column indicates desired * features in same column. assumed labels in * first column of data.*/ public arraylist<string> getrowlabels(){ return this.rowlabels; } /**creates hashmap list of feature labels mapped entire data. example, * if data set contains patient ids , test results, function can used create * hashmap keys patient ids , values arraylist of test * results. boolean input isrow, which, when true, designates * desired keys listed in rows or false if in columns.*/ public hashmap<string, arraylist<string>> mapfeaturetodata(boolean isrow){ hashmap<string, arraylist<string>> featuremap = new hashmap<string,arraylist<string>>(); if (!isrow){ (arraylist<string> line : this.unlabeleddata){ (int = 0; < this.collabels.size(); i++){ if (featuremap.containskey(this.collabels.get(i))){ featuremap.get(this.collabels.get(i)).add(line.get(i)); } else{ arraylist<string> firstvalue = new arraylist<string>(); firstvalue.add(line.get(i)); featuremap.put(this.collabels.get(i), firstvalue); } } } } else { (int = 0; < this.rowlabels.size(); i++){ if (!featuremap.containskey(this.rowlabels.get(i))){ featuremap.put(this.rowlabels.get(i), this.unlabeleddata.get(i)); } else { featuremap.get(this.rowlabels.get(i)).addall(this.unlabeleddata.get(i)); } } } return featuremap; } /**writes data file in specified outputpath. sep indicates data delimiter. * labeledoutput indicates whether or not user wants data written file * labeled or unlabeled. if data unlabeled begin with, labeledoutput * should not set true. */ public void writedatatofile(string outputpath, string sep){ try { printstream writer = new printstream(new bufferedoutputstream (new fileoutputstream (outputpath, true))); string sol = this.labellabel + sep; (int n = 0; n < this.collabels.size(); n++){ if (n == this.collabels.size()-1){ sol += this.collabels.get(n) + "\n"; } else { sol += this.collabels.get(n) + sep; } } (int = 0; < this.unlabeleddata.size(); i++){ arraylist<string> line = this.unlabeleddata.get(i); sol += this.rowlabels.get(i) + sep; (int j = 0; j < line.size(); j++){ if (j == line.size()-1){ sol += line.get(j); } else { sol += line.get(j) + sep; } } sol += "\n"; } sol = sol.trim(); writer.print(sol); writer.close(); } catch (exception e){ system.err.println(e); } } /**constructor data object. filepath specifies input file directory, * filesep indicates file separator used in input file, , haslabels * designates whether input data has row , column labels. note if * haslabels set true, assumed there both row , column labels*/ public data(string filepath, string filesep, boolean haslabels){ if (haslabels){ this.unlabeleddata = extractlabelsanddata(filepath, filesep); this.unlabeleddata.trimtosize(); } else { this.unlabeleddata = readfile(filepath, filesep); this.unlabeleddata.trimtosize(); } } /**constructor data object accepts nested arraylists inputs*/ public data (arraylist<arraylist<string>> data, boolean haslabels){ if (haslabels){ this.unlabeleddata = extractlabelsanddata(data); this.unlabeleddata.trimtosize(); } else { this.unlabeleddata = data; this.unlabeleddata.trimtosize(); } } }
the program works small datasets it's been 5+ days , merge still hasn't finished. i'm looking more efficient time , memory solution. suggested using byte arrays instead of strings, may make run faster. have suggestions?
edit: did digging around in code , found reading input files , merging them takes no time (like 20 seconds). writing file part takes 5+ days
you concatenating data fields millions of rows of data 1 ginormous string writing single string. slow death memory thrashing allocate , reallocate extremely large strings, copying them on , on , on each field , separator you're adding string. around 3rd or 4th day each string ... multiple millions of characters long? ... , poor garbage collector sweating , taking out on you.
don't that.
build each line of output file separately , write it. build next line.
furthermore, use stringbuilder
class build lines, though you'll such improvement previous step might not bother this. though it's way , should learn how.
Comments
Post a Comment