Monday, October 23, 2017

Google BigQuery & Apache Hive

Google BIGQUERY is a fast, economical and fully-managed enterprise data warehouse for large-scale data analytics. Details of querying your custom table in BigQuery:

The Apache Hive™ data warehouse software facilitates reading, writing, and managing large datasets residing in distributed storage and queried using SQL syntax. Built on top of Apache Hadoop™, Hive provides the following features:
  • Tools to enable easy access to data via SQL, thus enabling data warehousing tasks such as extract/transform/load (ETL), reporting, and data analysis.
  • A mechanism to impose structure on a variety of data formats
  • Access to files stored either directly in Apache HDFS™ or in other data storage systems such as Apache HBase™
  • Query execution via Apache Tez™, Apache Spark™, or MapReduce
  • Procedural language with HPL-SQL
  • Sub-second query retrieval via Hive LLAP, Apache YARN and Apache Slider.

More details on getting started:

Sunday, October 22, 2017

Spark SQL & 3rd Party Spark Machine Learning libraries

Spark machine learning inventory

Spark SQL and Dataframes: Python and Spark


Getting the Data and Creating the RDD

import urllib
f = urllib.urlretrieve ("", "kddcup.data_10_percent.gz")
data_file = "./kddcup.data_10_percent.gz"
raw_data = sc.textFile(data_file).cache()

Step 2: Create a Dataframe

A Spark DataFrame is a distributed collection of data organized into named columns. It is conceptually equivalent to a table in a relational database or a data frame in R or Pandas. They can be constructed from a wide array of sources such as an existing RDD in our case.

The entry point into all SQL functionality in Spark is the SQLContext class. To create a basic instance, all we need is a SparkContext reference.

from pyspark.sql import SQLContext
sqlContext = SQLContext(sc)

Inferring the Schema: With a SQLContext, we are ready to create a DataFrame from our existing RDD. But first we need to tell Spark SQL the schema in our data.

Step 3:

Once we have our RDD of Row we can infer and register the schema.
interactions_df = sqlContext.createDataFrame(row_data)
Now we can run SQL queries over our data frame that has been registered as a table.

# Select tcp network interactions with more than 1 second duration and no transfer from destination
tcp_interactions = sqlContext.sql(""" SELECT duration, dst_bytes FROM interactions WHERE protocol_type = 'tcp' AND duration > 1000 AND dst_bytes = 0 """)
The results of SQL queries are RDDs and support all the normal RDD operations.
# Output duration together with dst_bytes
tcp_interactions_out = p: "Duration: {}, Dest. bytes: {}".format(p.duration, p.dst_bytes))
for ti_out in tcp_interactions_out.collect():
  print ti_out

One thing to remember: You can’t map a dataframe, but you can convert the dataframe to an RDD and map that by doing Prior to Spark 2.0, would alias to With Spark 2.0, you must explicitly call .rdd first.

Saturday, October 21, 2017

HDFS (Hadoop), Scikit-Learn & Apache Spark MLlib

On Linux: Ubuntu 14.04.5 LTS, Release: 14.04, trusty.

Apache Hadoop is an open source software framework that enables large data sets to be broken up into blocks, distributed to multiple servers for storage and processing. Hadoop’s strength comes from a server network – known as a Hadoop cluster – that can process data much more quickly than a single machine. The non-profit Apache Software Foundation supports the free open source Hadoop project, but commercial versions have become very common.

The Hadoop Distributed File System (HDFS) is the place in a Hadoop cluster that you store data. Built for data-intensive applications, the HDFS is designed to run on clusters of inexpensive commodity servers. HDFS is optimized for high performance, read intensive operations and resilient to failures in the cluster. It does not prevent failures but is unlikely to lose data, since by default HDFS makes multiple copies of each of its data blocks.

Hadoop does batch processing i.e processing of blocks of data already stored over a period of time. Initially Hadoop's MapReduce technique was the best framework for processing data in batches. Spark is an open-source cluster computing framework for real-time processing. Spark's additional functionality is that it can process data in real time and since it was built on top of Hadoop MapReduce and it extends the MapReduce model to efficiently use more types of computations it is also about 100 times faster than Hadoop MapReduce in batch processing large data sets.

Spark can create distributed datasets from any file stored in the Hadoop distributed filesystem (HDFS) or other storage systems supported by the Hadoop APIs (including your local filesystem, Amazon S3, Cassandra, Hive, HBase, etc.). Spark does not require Hadoop; it simply has support for storage systems implementing the Hadoop APIs. Spark supports text files, SequenceFiles etc and any other Hadoop InputFormat.

More differences and Spark details are here:

Install Hadoop in Stand-Alone Mode on Ubuntu 16.04

Once installed run it as:

Scikit-Learn ML Examples: