Spark machine learning inventory
https://github.com/claesenm/spark-ml-inventory
Apache Spark & Python (pySpark) tutorials for Big Data Analysis and Machine Learning as IPython / Jupyter notebooks: http://jadianes.me/spark-py-notebooks/
- MLlib: Classification with Logistic Regression: https://github.com/jadianes/spark-py-notebooks/blob/master/nb8-mllib-logit/nb8-mllib-logit.ipynb
Spark SQL and DataFrames: Python and Spark
Step 1: Getting the Data and Creating the RDD
import urllib
# Download the KDD Cup 1999 "10 percent" dataset (Python 2 urllib; in Python 3 this lives in urllib.request)
f = urllib.urlretrieve("http://kdd.ics.uci.edu/databases/kddcup99/kddcup.data_10_percent.gz", "kddcup.data_10_percent.gz")
# Load the gzipped file into an RDD of text lines and cache it, since we will reuse it
data_file = "./kddcup.data_10_percent.gz"
raw_data = sc.textFile(data_file).cache()
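As a quick, optional sanity check (not part of the original steps), we can count the records and peek at the first raw CSV line; the count forces the file to be read and cached:
# Count the lines and look at one raw record
print(raw_data.count())
print(raw_data.take(1))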
Step 2: Create a DataFrame
A Spark DataFrame is a distributed collection of data organized into named columns. It is conceptually equivalent to a table in a relational database or a data frame in R or Pandas. DataFrames can be constructed from a wide array of sources, such as, in our case, an existing RDD.
The entry point into all SQL functionality in Spark is the SQLContext class. To create a basic instance, all we need is a SparkContext reference.
from pyspark.sql import SQLContext
# sc is the SparkContext already available in the notebook
sqlContext = SQLContext(sc)
Inferring the Schema: With a SQLContext, we are ready to create a DataFrame from our existing RDD. But first we need to tell Spark SQL the schema of our data.
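One way to do this is to turn each CSV line into a Row with named fields and let Spark SQL infer the types. The sketch below assumes the standard KDD Cup 99 column order (duration, protocol_type, service, flag, src_bytes, dst_bytes as the first six fields) and keeps only those; the queries later in this post only need duration, protocol_type and dst_bytes.
from pyspark.sql import Row
# Split each raw CSV line into its fields
csv_data = raw_data.map(lambda l: l.split(","))
# Build an RDD of Row objects with named columns, casting numeric fields to int
row_data = csv_data.map(lambda p: Row(
    duration=int(p[0]),
    protocol_type=p[1],
    service=p[2],
    flag=p[3],
    src_bytes=int(p[4]),
    dst_bytes=int(p[5])
))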
Step 3: Infer the Schema and Register the Table
Once we have our RDD of Row objects, we can infer the schema and register the resulting DataFrame as a temporary table.
interactions_df = sqlContext.createDataFrame(row_data)
interactions_df.registerTempTable("interactions")
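To check what Spark SQL inferred from the Row objects, we can print the schema (a quick optional check; the exact field list depends on the Row definition above):
interactions_df.printSchema()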
Now we can run SQL queries over our DataFrame, since it has been registered as a table.
# Select tcp network interactions with duration longer than 1000 seconds and no bytes transferred from the destination
tcp_interactions = sqlContext.sql("""
SELECT duration, dst_bytes FROM interactions WHERE protocol_type = 'tcp' AND duration > 1000 AND dst_bytes = 0
""")
tcp_interactions.show()
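For comparison, the same selection can be expressed with DataFrame operations instead of SQL (a minimal sketch using the column-expression API; assuming the Row fields defined earlier, it selects the same rows as the query above):
# Equivalent filter/select written as DataFrame operations
tcp_interactions_df = interactions_df.filter(
    (interactions_df.protocol_type == "tcp") &
    (interactions_df.duration > 1000) &
    (interactions_df.dst_bytes == 0)
).select("duration", "dst_bytes")
tcp_interactions_df.show()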
The results of SQL queries are DataFrames; by dropping down to the underlying RDD with .rdd, they support all the normal RDD operations.
# Output duration together with dst_bytes, formatted as strings via the underlying RDD
tcp_interactions_out = tcp_interactions.rdd.map(lambda p: "Duration: {}, Dest. bytes: {}".format(p.duration, p.dst_bytes))
# collect() brings the (small) filtered result to the driver so we can print it
for ti_out in tcp_interactions_out.collect():
    print ti_out
One thing to remember: you can't map a DataFrame directly, but you can convert it to an RDD and map that with spark_df.rdd.map(). Prior to Spark 2.0, spark_df.map would alias to spark_df.rdd.map(); with Spark 2.0 you must call .rdd explicitly first.
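As a small hypothetical snippet (not from the notebook) illustrating that point with the tcp_interactions DataFrame from above:
# In Spark 2.x (Python) a DataFrame has no .map(); go through the underlying RDD instead
durations = tcp_interactions.rdd.map(lambda p: p.duration)
print(durations.take(5))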