Sunday, October 22, 2017

Spark SQL & 3rd Party Spark Machine Learning libraries

Spark machine learning inventory

Spark SQL and Dataframes: Python and Spark


Getting the Data and Creating the RDD

import urllib
f = urllib.urlretrieve ("", "kddcup.data_10_percent.gz")
data_file = "./kddcup.data_10_percent.gz"
raw_data = sc.textFile(data_file).cache()

Step 2: Create a Dataframe

A Spark DataFrame is a distributed collection of data organized into named columns. It is conceptually equivalent to a table in a relational database or a data frame in R or Pandas. They can be constructed from a wide array of sources such as an existing RDD in our case.

The entry point into all SQL functionality in Spark is the SQLContext class. To create a basic instance, all we need is a SparkContext reference.

from pyspark.sql import SQLContext
sqlContext = SQLContext(sc)

Inferring the Schema: With a SQLContext, we are ready to create a DataFrame from our existing RDD. But first we need to tell Spark SQL the schema in our data.

Step 3:

Once we have our RDD of Row we can infer and register the schema.
interactions_df = sqlContext.createDataFrame(row_data)
Now we can run SQL queries over our data frame that has been registered as a table.

# Select tcp network interactions with more than 1 second duration and no transfer from destination
tcp_interactions = sqlContext.sql(""" SELECT duration, dst_bytes FROM interactions WHERE protocol_type = 'tcp' AND duration > 1000 AND dst_bytes = 0 """)
The results of SQL queries are RDDs and support all the normal RDD operations.
# Output duration together with dst_bytes
tcp_interactions_out = p: "Duration: {}, Dest. bytes: {}".format(p.duration, p.dst_bytes))
for ti_out in tcp_interactions_out.collect():
  print ti_out

One thing to remember: You can’t map a dataframe, but you can convert the dataframe to an RDD and map that by doing Prior to Spark 2.0, would alias to With Spark 2.0, you must explicitly call .rdd first.

No comments:

Post a Comment