python - Hadoop streaming: how to multiply a matrix by a vector when they're stored in many files
I have a matrix like:
1,1,2
2,3,4
6,4,6
1,2,4
3,6,3
4,6,2
4,5,8
3,4,4
and a vector:
1,3
4,5
5,4
6,2
They're stored in 2 different files. I need to multiply them, matching on the column index. Matrix records have the form m(i,j,v), where i is the row number, j is the column number, and v is the value. Vector records have the form v(j,v).
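To make the intended result concrete, here is a minimal single-process sketch of the product for the sample records above. It is only an illustration, not the Hadoop job itself. The function name `multiply` is hypothetical, and the sketch assumes that a column with no vector entry contributes zero, which the post does not state.

```python
from collections import defaultdict

matrix_lines = ["1,1,2", "2,3,4", "6,4,6", "1,2,4", "3,6,3", "4,6,2", "4,5,8", "3,4,4"]
vector_lines = ["1,3", "4,5", "5,4", "6,2"]

def multiply(matrix_lines, vector_lines):
    """Return {row i: sum over j of m(i,j) * v(j)} for sparse text records."""
    vector = {}
    for line in vector_lines:
        j, v = line.split(",")
        vector[int(j)] = int(v)

    result = defaultdict(int)
    for line in matrix_lines:
        i, j, v = (int(field) for field in line.split(","))
        # Assumption: a column with no vector entry contributes zero.
        result[i] += v * vector.get(j, 0)
    return dict(result)

print(sorted(multiply(matrix_lines, vector_lines).items()))
# [(1, 6), (2, 0), (3, 26), (4, 36), (6, 30)]
```

Row 2 comes out as 0 because its only entry is in column 3, and the sample vector has no value for j = 3.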
I wrote this mapper:

#!/usr/bin/env python
import sys

# class to store matrix records
class matrixrecord(object):
    def __init__(self):
        self.i = None
        self.j = None
        self.v = None

# class to store vector records
class vectorrecord(object):
    def __init__(self):
        self.j = None
        self.v = None

# lists to store the objects
listofmatrixrecords = []
listofvectorrecords = []

# input comes from stdin (standard input)
for line in sys.stdin:
    # remove leading and trailing whitespace, then split
    splittedline = line.strip().split(",")
    # if it's a matrix element, the body looks like 1,3,6
    if len(splittedline) == 3:
        x = matrixrecord()
        x.i = splittedline[0]
        x.j = splittedline[1]
        x.v = splittedline[2]
        listofmatrixrecords.append(x)  # add to the matrix records list
    # if it's a vector element, the body looks like 2,4
    else:
        y = vectorrecord()
        y.j = splittedline[0]
        y.v = splittedline[1]
        listofvectorrecords.append(y)  # add to the vector records list

# get the matrix records and multiply them by the vector values
vectorposition = {record.j for record in listofvectorrecords}  # j values of the vector objects
matrixposition = {record.j for record in listofmatrixrecords}  # j values of the matrix objects

for duplicate in vectorposition & matrixposition:  # j values present in both matrix and vector
    for x in listofmatrixrecords:
        if x.j == duplicate:  # a match means we must multiply
            for y in listofvectorrecords:
                if y.j == x.j:
                    x.v = int(x.v) * int(y.v)

# write the result to stdout; the reducer takes it as input
for x in listofmatrixrecords:
    print('%s\t%s' % (x.i, x.v))
But it works only if everything is stored in 1 input file, not in many, because a new mapper is created for each file, and therefore
listofmatrixrecords = []
listofvectorrecords = []
never contain both the matrix and the vector records.
Is there a way to write a custom shuffle method for Hadoop streaming, perhaps?
I launch Hadoop like this:
hadoop jar "d:\hadoop-2.7.1\share\hadoop\tools\lib\hadoop-streaming-2.7.1.jar" -mapper "python d:\map.py" -reducer "python d:\reducer.py" -input /input/* -output /output
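For the multi-file situation described above, one commonly used pattern (not something the post itself proposes) is a reduce-side join. The mapper does no joining; it keys every record by the column index j and tags it as matrix or vector. Hadoop streaming treats the text before the first tab as the key and sorts by it before the reducer runs, so matching records meet there. Below is a minimal sketch of that idea, written as plain functions so it can be run locally. The names `map_line` and `reduce_sorted` and the `M`/`V` tags are hypothetical, and the sketch assumes a missing vector entry counts as zero.

```python
from itertools import groupby

def map_line(line):
    """Key one record by column index j and tag it as matrix (M) or vector (V)."""
    fields = line.strip().split(",")
    if len(fields) == 3:      # matrix record: i,j,v
        i, j, v = fields
        return "%s\tM,%s,%s" % (j, i, v)
    if len(fields) == 2:      # vector record: j,v
        j, v = fields
        return "%s\tV,%s" % (j, v)
    return None               # blank or malformed line: skip

def reduce_sorted(lines):
    """Take 'j<TAB>value' lines grouped by j; yield (i, m(i,j) * v(j)) pairs."""
    pairs = (line.rstrip("\n").split("\t", 1) for line in lines)
    for j, group in groupby(pairs, key=lambda pair: pair[0]):
        vector_value = None
        matrix_entries = []
        for _, value in group:
            parts = value.split(",")
            if parts[0] == "V":
                vector_value = int(parts[1])
            else:
                matrix_entries.append((parts[1], int(parts[2])))
        if vector_value is None:
            continue          # assumption: no vector entry for j means zero
        for i, v in matrix_entries:
            yield i, v * vector_value

# Local simulation of map -> sort by key -> reduce, with records from both files mixed.
sample = ["1,1,2", "2,3,4", "6,4,6", "1,2,4", "3,6,3", "4,6,2", "4,5,8", "3,4,4",
          "1,3", "4,5", "5,4", "6,2"]
mapped = [out for out in map(map_line, sample) if out is not None]
mapped.sort(key=lambda out: out.split("\t", 1)[0])
for i, product in reduce_sorted(mapped):
    print("%s\t%s" % (i, product))
```

In an actual streaming job, each function would be driven by a loop over sys.stdin in its own script. The reducer here emits one partial product per matrix entry, so a further step that sums the products by row number i would still be needed to obtain the final vector. It also buffers the matrix entries for one column in memory, which is only reasonable if a single column is not too large.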