python - Hadoop streaming: how to multiply a matrix by a vector when they're stored in many files


I have a matrix like:

1,1,2
2,3,4
6,4,6
1,2,4
3,6,3
4,6,2
4,5,8
3,4,4

and a vector:

1,3
4,5
5,4
6,2

They're stored in 2 different files, and I need to multiply them by column. Each matrix record has the form m(i,j,v), where i is the row number, j is the column number, and v is the value. Each vector record has the form v(j,v).
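To make the target result concrete, here is a minimal in-memory sketch (no Hadoop involved) of the product for the sample data above. The function name `multiply` is hypothetical, and treating a missing vector entry as zero is an assumption, not something the record format itself specifies.

```python
from collections import defaultdict


def multiply(matrix_records, vector_records):
    """Return {row i: sum over j of m(i,j) * v(j)} for sparse records."""
    # Look up vector values by column index j.
    vector = {j: v for j, v in vector_records}
    result = defaultdict(int)
    for i, j, v in matrix_records:
        # Assumption: a column with no vector entry contributes zero.
        result[i] += v * vector.get(j, 0)
    return dict(result)


matrix = [(1, 1, 2), (2, 3, 4), (6, 4, 6), (1, 2, 4),
          (3, 6, 3), (4, 6, 2), (4, 5, 8), (3, 4, 4)]
vector = [(1, 3), (4, 5), (5, 4), (6, 2)]

print(multiply(matrix, vector))
```

For example, row 3 has entries in columns 6 and 4, so its result is 3*2 + 4*5 = 26.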

I wrote this mapper:

#!/usr/bin/env python

import sys

# class to store matrix records
class matrixrecord(object):
    def __init__(self):
        self.i = None
        self.j = None
        self.v = None

# class to store vector records
class vectorrecord(object):
    def __init__(self):
        self.j = None
        self.v = None

# lists to store the objects
listofmatrixrecords = []
listofvectorrecords = []

# input comes from stdin (standard input)
for line in sys.stdin:
    # remove leading and trailing whitespace and split
    splittedline = line.strip().split(",")

    # if it's a matrix element, the body looks like
    # 1,3,6
    if len(splittedline) == 3:
        x = matrixrecord()
        x.i = splittedline[0]
        x.j = splittedline[1]
        x.v = splittedline[2]
        listofmatrixrecords.append(x)  # add to the matrix records list
    # if it's a vector element, the body looks like
    # 2,4
    else:
        y = vectorrecord()
        y.j = splittedline[0]
        y.v = splittedline[1]
        listofvectorrecords.append(y)  # add to the vector records list

# get the matrix records and multiply them by the vector values
vectorposition = {record.j for record in listofvectorrecords}  # gets the j properties of the vector objects
matrixposition = {record.j for record in listofmatrixrecords}  # gets the j properties of the matrix objects

for duplicate in vectorposition & matrixposition:  # checks for duplicates between matrix and vector
    for x in listofmatrixrecords:
        if x.j == duplicate:  # if there's a duplicate, it means we must multiply
            for y in listofvectorrecords:
                if y.j == x.j:
                    x.v = int(x.v) * int(y.v)

# return the result to stdout; the reducer will take it as input
for x in listofmatrixrecords:
    print('%s\t%s' % (x.i, x.v))

But this works only if everything is stored in 1 input file, not many, because a new mapper is created for each file, and therefore

listofmatrixrecords = []
listofvectorrecords = []

will never contain both the matrix and the vector records.

Is there a way to write a custom shuffle method for Hadoop streaming, perhaps?

I launch Hadoop like this:

hadoop jar "d:\hadoop-2.7.1\share\hadoop\tools\lib\hadoop-streaming-2.7.1.jar" -mapper "python d:\map.py" -reducer "python d:\reducer.py" -input /input/* -output /output 
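Since each mapper sees only one file, one common alternative to a custom shuffle is a reduce-side join: the mapper only tags each record and keys it by the column index j, and the default shuffle then brings the matrix and vector records with the same j to the same reducer. The sketch below is a minimal illustration under stated assumptions, not a tested Hadoop job. The names `map_line`, `reduce_lines`, and `simulate` are hypothetical, the `M`/`V` tags are an invented convention, and a missing vector entry is assumed to count as zero. `simulate` only imitates the sort-by-key step locally.

```python
from itertools import groupby


def map_line(line):
    """Mapper step: key every record by column index j and tag its origin."""
    fields = line.strip().split(",")
    if len(fields) == 3:            # matrix record: i,j,v
        i, j, v = fields
        return "%s\tM,%s,%s" % (j, i, v)
    if len(fields) == 2:            # vector record: j,v
        j, v = fields
        return "%s\tV,%s" % (j, v)
    return None                     # skip blank or malformed lines


def reduce_lines(sorted_lines):
    """Reducer step: for each j, multiply matrix values by the vector value."""
    keyed = (l.rstrip("\n").split("\t", 1) for l in sorted_lines)
    for j, group in groupby(keyed, key=lambda kv: kv[0]):
        vector_value = None
        matrix_entries = []
        for _, payload in group:
            parts = payload.split(",")
            if parts[0] == "V":
                vector_value = int(parts[1])
            else:
                matrix_entries.append((parts[1], int(parts[2])))
        if vector_value is None:
            continue                # assumption: missing vector entry is zero
        for i, v in matrix_entries:
            yield i, v * vector_value


def simulate(lines):
    """Local stand-in for map, sort by key, reduce, then sum per row i."""
    mapped = [m for m in map(map_line, lines) if m is not None]
    totals = {}
    for i, product in reduce_lines(sorted(mapped)):
        totals[i] = totals.get(i, 0) + product
    return totals
```

In a real streaming job the reducer would read the sorted lines from `sys.stdin` and print `i<TAB>product`. Because those partial products are grouped by j rather than by i, summing them per row would need a second pass keyed by i; the dictionary in `simulate` stands in for that pass.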

