python - ValueError: could not convert string to float


I have a text file that contains data. The data looks like this:

join2_train = sc.textFile('join2_train.csv', 4)
join2_train.take(3)

[u'21.9059,ta-00002,s-0066,7/7/2013,0,0,yes,1,sp-0019,6.35,0.71,137,8,19.05,n,n,n,n,ef-008,ef-008,0,0,0',
 u'12.3412,ta-00002,s-0066,7/7/2013,0,0,yes,2,sp-0019,6.35,0.71,137,8,19.05,n,n,n,n,ef-008,ef-008,0,0,0',
 u'6.60183,ta-00002,s-0066,7/7/2013,0,0,yes,5,sp-0019,6.35,0.71,137,8,19.05,n,n,n,n,ef-008,ef-008,0,0,0']

Now I am trying to parse the strings with a function that splits each line of text and converts it into a LabeledPoint. I have included a line that converts the string elements to float.

The function is as follows:

from pyspark.mllib.regression import LabeledPoint
import numpy as np

def parsePoint(line):
    """Converts a comma separated unicode string into a `LabeledPoint`.

    Args:
        line (unicode): Comma separated unicode string where the first element
            is the label and the remaining elements are features.

    Returns:
        LabeledPoint: The line converted into a `LabeledPoint`, which consists
            of a label and features.
    """
    values = line.split(',')
    value1 = [map(float, i) for i in values]
    return LabeledPoint(value1[0], value1[1:])

Now when I run an action on the parsed RDD, I get a ValueError (map is lazy, so the error only surfaces when an action like take() forces evaluation). The action I try is below:

parse_train = join2_train.map(parsePoint)

parse_train.take(5)

The error message is below:

---------------------------------------------------------------------------
Py4JJavaError                             Traceback (most recent call last)
<ipython-input-63-f53b10964381> in <module>()
      1 parse_train = join2_train.map(parsePoint)
      2
----> 3 parse_train.take(5)

/usr/local/bin/spark-1.3.1-bin-hadoop2.6/python/pyspark/rdd.py in take(self, num)
   1222
   1223             p = range(partsScanned, min(partsScanned + numPartsToTry, totalParts))
-> 1224             res = self.context.runJob(self, takeUpToNumLeft, p, True)
   1225
   1226             items += res

/usr/local/bin/spark-1.3.1-bin-hadoop2.6/python/pyspark/context.py in runJob(self, rdd, partitionFunc, partitions, allowLocal)
    840         mappedRDD = rdd.mapPartitions(partitionFunc)
    841         port = self._jvm.PythonRDD.runJob(self._jsc.sc(), mappedRDD._jrdd, javaPartitions,
--> 842                                           allowLocal)
    843         return list(_load_from_socket(port, mappedRDD._jrdd_deserializer))
    844

/usr/local/bin/spark-1.3.1-bin-hadoop2.6/python/lib/py4j-0.8.2.1-src.zip/py4j/java_gateway.py in __call__(self, *args)
    536         answer = self.gateway_client.send_command(command)
    537         return_value = get_return_value(answer, self.gateway_client,
--> 538                 self.target_id, self.name)
    539
    540         for temp_arg in temp_args:

/usr/local/bin/spark-1.3.1-bin-hadoop2.6/python/lib/py4j-0.8.2.1-src.zip/py4j/protocol.py in get_return_value(answer, gateway_client, target_id, name)
    298                 raise Py4JJavaError(
    299                     'An error occurred while calling {0}{1}{2}.\n'.
--> 300                     format(target_id, '.', name), value)
    301             else:
    302                 raise Py4JError(

Py4JJavaError: An error occurred while calling z:org.apache.spark.api.python.PythonRDD.runJob.
: org.apache.spark.SparkException: Job aborted due to stage failure: Task 0 in stage 22.0 failed 1 times, most recent failure: Lost task 0.0 in stage 22.0 (TID 31, localhost): org.apache.spark.api.python.PythonException: Traceback (most recent call last):
  File "/usr/local/bin/spark-1.3.1-bin-hadoop2.6/python/pyspark/worker.py", line 101, in main
    process()
  File "/usr/local/bin/spark-1.3.1-bin-hadoop2.6/python/pyspark/worker.py", line 96, in process
    serializer.dump_stream(func(split_index, iterator), outfile)
  File "/usr/local/bin/spark-1.3.1-bin-hadoop2.6/python/pyspark/serializers.py", line 236, in dump_stream
    vs = list(itertools.islice(iterator, batch))
  File "/usr/local/bin/spark-1.3.1-bin-hadoop2.6/python/pyspark/rdd.py", line 1220, in takeUpToNumLeft
    yield next(iterator)
  File "<ipython-input-62-0243c4dd1876>", line 18, in parsePoint
ValueError: could not convert string to float: .

    at org.apache.spark.api.python.PythonRDD$$anon$1.read(PythonRDD.scala:135)
    at org.apache.spark.api.python.PythonRDD$$anon$1.<init>(PythonRDD.scala:176)
    at org.apache.spark.api.python.PythonRDD.compute(PythonRDD.scala:94)
    at org.apache.spark.rdd.RDD.computeOrReadCheckpoint(RDD.scala:277)
    at org.apache.spark.rdd.RDD.iterator(RDD.scala:244)
    at org.apache.spark.scheduler.ResultTask.runTask(ResultTask.scala:61)
    at org.apache.spark.scheduler.Task.run(Task.scala:64)
    at org.apache.spark.executor.Executor$TaskRunner.run(Executor.scala:203)
    at java.util.concurrent.ThreadPoolExecutor.runWorker(ThreadPoolExecutor.java:1145)
    at java.util.concurrent.ThreadPoolExecutor$Worker.run(ThreadPoolExecutor.java:615)
    at java.lang.Thread.run(Thread.java:745)

Driver stacktrace:
    at org.apache.spark.scheduler.DAGScheduler.org$apache$spark$scheduler$DAGScheduler$$failJobAndIndependentStages(DAGScheduler.scala:1204)
    at org.apache.spark.scheduler.DAGScheduler$$anonfun$abortStage$1.apply(DAGScheduler.scala:1193)
    at org.apache.spark.scheduler.DAGScheduler$$anonfun$abortStage$1.apply(DAGScheduler.scala:1192)
    at scala.collection.mutable.ResizableArray$class.foreach(ResizableArray.scala:59)
    at scala.collection.mutable.ArrayBuffer.foreach(ArrayBuffer.scala:47)
    at org.apache.spark.scheduler.DAGScheduler.abortStage(DAGScheduler.scala:1192)
    at org.apache.spark.scheduler.DAGScheduler$$anonfun$handleTaskSetFailed$1.apply(DAGScheduler.scala:693)
    at org.apache.spark.scheduler.DAGScheduler$$anonfun$handleTaskSetFailed$1.apply(DAGScheduler.scala:693)
    at scala.Option.foreach(Option.scala:236)
    at org.apache.spark.scheduler.DAGScheduler.handleTaskSetFailed(DAGScheduler.scala:693)
    at org.apache.spark.scheduler.DAGSchedulerEventProcessLoop.onReceive(DAGScheduler.scala:1393)
    at org.apache.spark.scheduler.DAGSchedulerEventProcessLoop.onReceive(DAGScheduler.scala:1354)
    at org.apache.spark.util.EventLoop$$anon$1.run(EventLoop.scala:48)
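The stray '.' in the last line of the traceback is the clue: map(float, i) does not convert the field i to a float, it applies float to every character of i. The first field of the first row already reproduces the error in plain Python 2 (the u'...' literals above indicate Python 2), as this quick sketch shows:

# Mapping float over a unicode string iterates its characters, not fields:
s = u'21.9059'
map(float, s)
# float(u'2') -> 2.0, float(u'1') -> 1.0, then float(u'.') raises:
# ValueError: could not convert string to float: .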

Add a function to check whether a string can be converted to a float:

def isfloat(string):
    try:
        float(string)
        return True
    except ValueError:
        return False
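For example, against fields from the sample rows above, the check behaves like this:

print isfloat(u'21.9059')   # True  -> kept (the label)
print isfloat(u'ta-00002')  # False -> dropped
print isfloat(u'7/7/2013')  # False -> dropped
print isfloat(u'6.35')      # True  -> kept (a feature)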

And in parsePoint:

value1 = [float(i) for i in values if isfloat(i)]

By modifying the float line as follows:

value1 = [float(i) for i in values]

and parsing a string containing only numeric values, you can get correct LabeledPoints. The real problem is that you are trying to make LabeledPoint objects out of strings that cannot be converted to float, such as ta-00002 in your join2_train object.
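Putting the pieces together, here is a minimal sketch of a corrected parsePoint, assuming it is acceptable to simply drop the non-numeric fields (the IDs, the date, and the yes/n flags) rather than encode them:

from pyspark.mllib.regression import LabeledPoint

def isfloat(string):
    """Return True if the string can be converted to a float."""
    try:
        float(string)
        return True
    except ValueError:
        return False

def parsePoint(line):
    """Converts a comma separated unicode string into a `LabeledPoint`,
    keeping only the fields that parse as floats."""
    values = line.split(',')
    # float(i) converts each whole field; the isfloat(i) filter drops
    # non-numeric fields such as u'ta-00002' and u'7/7/2013'.
    value1 = [float(i) for i in values if isfloat(i)]
    # The first surviving value is the label; the rest are features.
    return LabeledPoint(value1[0], value1[1:])

parse_train = join2_train.map(parsePoint)
parse_train.take(5)

If the categorical columns carry signal you would want to encode them (for example with a one-hot scheme) rather than discard them, but that is beyond the scope of this error.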

