Python - ValueError: could not convert string to float
I have a text file that contains data. The data is as follows:
    join2_train = sc.textFile('join2_train.csv', 4)
    join2_train.take(3)

    [u'21.9059,ta-00002,s-0066,7/7/2013,0,0,yes,1,sp-0019,6.35,0.71,137,8,19.05,n,n,n,n,ef-008,ef-008,0,0,0',
     u'12.3412,ta-00002,s-0066,7/7/2013,0,0,yes,2,sp-0019,6.35,0.71,137,8,19.05,n,n,n,n,ef-008,ef-008,0,0,0',
     u'6.60183,ta-00002,s-0066,7/7/2013,0,0,yes,5,sp-0019,6.35,0.71,137,8,19.05,n,n,n,n,ef-008,ef-008,0,0,0']
Now I am trying to parse the strings with a function that splits each line of text and converts it into a LabeledPoint. I have included a line that converts the string elements to float.
The function is as follows:
    from pyspark.mllib.regression import LabeledPoint
    import numpy as np

    def parsePoint(line):
        """Converts a comma separated unicode string into a `LabeledPoint`.

        Args:
            line (unicode): Comma separated unicode string where the first element
                is the label and the remaining elements are features.

        Returns:
            LabeledPoint: The line converted into a `LabeledPoint`, which consists
                of a label and features.
        """
        values = line.split(',')
        value1 = [map(float, i) for i in values]
        return LabeledPoint(value1[0], value1[1:])
Now when I try actions on the parsed lines, I get a ValueError. The action I try is below:
    parse_train = join2_train.map(parsePoint)
    parse_train.take(5)
The error message is below:
    ---------------------------------------------------------------------------
    Py4JJavaError                             Traceback (most recent call last)
    <ipython-input-63-f53b10964381> in <module>()
          1 parse_train = join2_train.map(parsePoint)
          2
    ----> 3 parse_train.take(5)

    /usr/local/bin/spark-1.3.1-bin-hadoop2.6/python/pyspark/rdd.py in take(self, num)
       1222
       1223             p = range(partsScanned, min(partsScanned + numPartsToTry, totalParts))
    -> 1224             res = self.context.runJob(self, takeUpToNumLeft, p, True)
       1225
       1226             items += res

    /usr/local/bin/spark-1.3.1-bin-hadoop2.6/python/pyspark/context.py in runJob(self, rdd, partitionFunc, partitions, allowLocal)
        840         mappedRDD = rdd.mapPartitions(partitionFunc)
        841         port = self._jvm.PythonRDD.runJob(self._jsc.sc(), mappedRDD._jrdd, javaPartitions,
    --> 842                                           allowLocal)
        843         return list(_load_from_socket(port, mappedRDD._jrdd_deserializer))
        844

    /usr/local/bin/spark-1.3.1-bin-hadoop2.6/python/lib/py4j-0.8.2.1-src.zip/py4j/java_gateway.py in __call__(self, *args)
        536         answer = self.gateway_client.send_command(command)
        537         return_value = get_return_value(answer, self.gateway_client,
    --> 538                 self.target_id, self.name)
        539
        540         for temp_arg in temp_args:

    /usr/local/bin/spark-1.3.1-bin-hadoop2.6/python/lib/py4j-0.8.2.1-src.zip/py4j/protocol.py in get_return_value(answer, gateway_client, target_id, name)
        298                 raise Py4JJavaError(
        299                     'An error occurred while calling {0}{1}{2}.\n'.
    --> 300                     format(target_id, '.', name), value)
        301             else:
        302                 raise Py4JError(

    Py4JJavaError: An error occurred while calling z:org.apache.spark.api.python.PythonRDD.runJob.
    : org.apache.spark.SparkException: Job aborted due to stage failure: Task 0 in stage 22.0 failed 1 times, most recent failure: Lost task 0.0 in stage 22.0 (TID 31, localhost): org.apache.spark.api.python.PythonException: Traceback (most recent call last):
      File "/usr/local/bin/spark-1.3.1-bin-hadoop2.6/python/pyspark/worker.py", line 101, in main
        process()
      File "/usr/local/bin/spark-1.3.1-bin-hadoop2.6/python/pyspark/worker.py", line 96, in process
        serializer.dump_stream(func(split_index, iterator), outfile)
      File "/usr/local/bin/spark-1.3.1-bin-hadoop2.6/python/pyspark/serializers.py", line 236, in dump_stream
        vs = list(itertools.islice(iterator, batch))
      File "/usr/local/bin/spark-1.3.1-bin-hadoop2.6/python/pyspark/rdd.py", line 1220, in takeUpToNumLeft
        yield next(iterator)
      File "<ipython-input-62-0243c4dd1876>", line 18, in parsePoint
    ValueError: could not convert string to float: .

        at org.apache.spark.api.python.PythonRDD$$anon$1.read(PythonRDD.scala:135)
        at org.apache.spark.api.python.PythonRDD$$anon$1.<init>(PythonRDD.scala:176)
        at org.apache.spark.api.python.PythonRDD.compute(PythonRDD.scala:94)
        at org.apache.spark.rdd.RDD.computeOrReadCheckpoint(RDD.scala:277)
        at org.apache.spark.rdd.RDD.iterator(RDD.scala:244)
        at org.apache.spark.scheduler.ResultTask.runTask(ResultTask.scala:61)
        at org.apache.spark.scheduler.Task.run(Task.scala:64)
        at org.apache.spark.executor.Executor$TaskRunner.run(Executor.scala:203)
        at java.util.concurrent.ThreadPoolExecutor.runWorker(ThreadPoolExecutor.java:1145)
        at java.util.concurrent.ThreadPoolExecutor$Worker.run(ThreadPoolExecutor.java:615)
        at java.lang.Thread.run(Thread.java:745)

    Driver stacktrace:
        at org.apache.spark.scheduler.DAGScheduler.org$apache$spark$scheduler$DAGScheduler$$failJobAndIndependentStages(DAGScheduler.scala:1204)
        at org.apache.spark.scheduler.DAGScheduler$$anonfun$abortStage$1.apply(DAGScheduler.scala:1193)
        at org.apache.spark.scheduler.DAGScheduler$$anonfun$abortStage$1.apply(DAGScheduler.scala:1192)
        at scala.collection.mutable.ResizableArray$class.foreach(ResizableArray.scala:59)
        at scala.collection.mutable.ArrayBuffer.foreach(ArrayBuffer.scala:47)
        at org.apache.spark.scheduler.DAGScheduler.abortStage(DAGScheduler.scala:1192)
        at org.apache.spark.scheduler.DAGScheduler$$anonfun$handleTaskSetFailed$1.apply(DAGScheduler.scala:693)
        at org.apache.spark.scheduler.DAGScheduler$$anonfun$handleTaskSetFailed$1.apply(DAGScheduler.scala:693)
        at scala.Option.foreach(Option.scala:236)
        at org.apache.spark.scheduler.DAGScheduler.handleTaskSetFailed(DAGScheduler.scala:693)
        at org.apache.spark.scheduler.DAGSchedulerEventProcessLoop.onReceive(DAGScheduler.scala:1393)
        at org.apache.spark.scheduler.DAGSchedulerEventProcessLoop.onReceive(DAGScheduler.scala:1354)
        at org.apache.spark.util.EventLoop$$anon$1.run(EventLoop.scala:48)
Add a function that checks whether a string can be converted to a float:
    def isfloat(string):
        try:
            float(string)
            return True
        except ValueError:
            return False
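For example, against fields taken from the sample lines above, the helper behaves like this:

    >>> isfloat('21.9059')
    True
    >>> isfloat('ta-00002')
    False
    >>> isfloat('7/7/2013')
    False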
And in parsePoint:
    value1 = [float(i) for i in values if isfloat(i)]
By modifying the float line as follows:
    value1 = [float(i) for i in values]
and parsing the string only for numeric values, you can get correct LabeledPoints. The real problem is that you are trying to build LabeledPoint objects out of strings that cannot be converted to float, such as ta-00002, in your join2_train object.
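Putting the two changes together, here is a minimal sketch of a revised parsePoint. It assumes the label is always the numeric first field and that any non-numeric field (ta-00002, s-0066, the date, the yes/n flags, ef-008, and so on) is simply dropped from the features rather than encoded; the sample line and expected values below are only illustrative.

    from pyspark.mllib.regression import LabeledPoint

    def isfloat(string):
        """Return True if the string can be parsed as a float."""
        try:
            float(string)
            return True
        except ValueError:
            return False

    def parsePoint(line):
        """Convert a comma separated line into a LabeledPoint, keeping only numeric fields."""
        values = line.split(',')
        # Use float(i), not map(float, i): in Python 2, map(float, i) applies float to
        # every character of the string, which is consistent with the traceback ending
        # in "could not convert string to float: ." (the '.' inside '21.9059').
        numeric = [float(i) for i in values if isfloat(i)]
        return LabeledPoint(numeric[0], numeric[1:])

    # Illustrative check on the first sample line from the question:
    sample = u'21.9059,ta-00002,s-0066,7/7/2013,0,0,yes,1,sp-0019,6.35,0.71,137,8,19.05,n,n,n,n,ef-008,ef-008,0,0,0'
    point = parsePoint(sample)
    # Expected: label 21.9059, features [0.0, 0.0, 1.0, 6.35, 0.71, 137.0, 8.0, 19.05, 0.0, 0.0, 0.0]

With this version, parse_train = join2_train.map(parsePoint) followed by parse_train.take(5) should no longer raise the ValueError, assuming every line in join2_train has a numeric first field.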