

Showing posts from October, 2017

Google BigQuery & Apache Hive

Google BigQuery is a fast, economical, and fully managed enterprise data warehouse for large-scale data analytics. Details on querying your custom table in BigQuery: https://cloud.google.com/bigquery/quickstart-web-ui

The Apache Hive™ data warehouse software facilitates reading, writing, and managing large datasets residing in distributed storage and queried using SQL syntax. Built on top of Apache Hadoop™, Hive provides the following features:

- Tools to enable easy access to data via SQL, enabling data warehousing tasks such as extract/transform/load (ETL), reporting, and data analysis
- A mechanism to impose structure on a variety of data formats
- Access to files stored either directly in Apache HDFS™ or in other data storage systems such as Apache HBase™
- Query execution via Apache Tez™, Apache Spark™, or MapReduce
- Procedural language with HPL-SQL
- Sub-second query retrieval via Hive LLAP, Apache YARN, and Apache Slider

More details on getting started are available in the Hive documentation.
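As a quick taste of the BigQuery side, here is a minimal sketch of running a query from Python with the google-cloud-bigquery client library; the project, dataset, table, and column names are placeholders, not a real schema.

    # pip install google-cloud-bigquery
    from google.cloud import bigquery

    # Credentials and the default project are picked up from the environment
    # (e.g. GOOGLE_APPLICATION_CREDENTIALS).
    client = bigquery.Client()

    # Placeholder table and columns: substitute your own dataset.
    query = """
        SELECT name, COUNT(*) AS total
        FROM `my-project.my_dataset.my_table`
        GROUP BY name
        ORDER BY total DESC
        LIMIT 10
    """

    # Run the query and iterate over the result rows.
    for row in client.query(query).result():
        print(row.name, row.total)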

Spark SQL & 3rd Party Spark Machine Learning libraries

Spark machine learning inventory: https://github.com/claesenm/spark-ml-inventory

Apache Spark & Python (pySpark) tutorials for Big Data Analysis and Machine Learning, as IPython / Jupyter notebooks: http://jadianes.me/spark-py-notebooks/

MLlib: Classification with Logistic Regression: https://github.com/jadianes/spark-py-notebooks/blob/master/nb8-mllib-logit/nb8-mllib-logit.ipynb

Spark SQL and DataFrames: Python and Spark

Step 1: Get the data and create the RDD

    # Python 2, as in the original notebooks
    # (on Python 3, use urllib.request.urlretrieve)
    import urllib
    f = urllib.urlretrieve("http://kdd.ics.uci.edu/databases/kddcup99/kddcup.data_10_percent.gz",
                           "kddcup.data_10_percent.gz")
    data_file = "./kddcup.data_10_percent.gz"
    raw_data = sc.textFile(data_file).cache()

Step 2: Create a DataFrame

A Spark DataFrame is a distributed collection of data organized into named columns. It is conceptually equivalent to a table in a relational database or a data frame in R or Pandas. DataFrames can be constructed from a wide array of sources, such as structured data files, Hive tables, external databases, or existing RDDs; a sketch of this step follows below.
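Following the linked notebooks, here is a minimal sketch of Step 2 using the Spark 1.x-era SQLContext API. It assumes the `sc` and `raw_data` defined above, and only the first few KDD Cup 99 fields are mapped to named columns for brevity.

    from pyspark.sql import SQLContext, Row

    sqlContext = SQLContext(sc)

    # Split each CSV line and map the leading fields to named columns.
    csv_data = raw_data.map(lambda line: line.split(","))
    row_data = csv_data.map(lambda p: Row(
        duration=int(p[0]),
        protocol_type=p[1],
        service=p[2],
        flag=p[3],
        src_bytes=int(p[4]),
        dst_bytes=int(p[5])
    ))

    # Infer the schema and register the DataFrame as a SQL temp table.
    interactions_df = sqlContext.createDataFrame(row_data)
    interactions_df.registerTempTable("interactions")

    # Example query: long-running TCP interactions with no data received.
    tcp_interactions = sqlContext.sql("""
        SELECT duration, dst_bytes FROM interactions
        WHERE protocol_type = 'tcp' AND duration > 1000 AND dst_bytes = 0
    """)
    tcp_interactions.show(5)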

HDFS (Hadoop), Scikit-Learn & Apache Spark MLlib

Environment: Ubuntu 14.04.5 LTS (trusty).

Apache Hadoop is an open-source software framework that enables large data sets to be broken up into blocks and distributed to multiple servers for storage and processing. Hadoop's strength comes from a network of servers, known as a Hadoop cluster, that can process data much more quickly than a single machine. The non-profit Apache Software Foundation supports the free, open-source Hadoop project, but commercial distributions have become very common.

The Hadoop Distributed File System (HDFS) is where a Hadoop cluster stores its data. Built for data-intensive applications, HDFS is designed to run on clusters of inexpensive commodity servers. It is optimized for high-performance, read-intensive operations and is resilient to failures in the cluster: it does not prevent failures, but it is unlikely to lose data, since by default HDFS keeps three copies of each data block. Hadoop does batch processing, i.e. it processes blocks of data that have already been stored over a period of time, rather than handling records as they arrive.
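Since block replication is what makes HDFS resilient, a quick way to see it in action is through the `hdfs dfs` command-line tool. The sketch below drives it from Python via subprocess; the paths are illustrative, and it assumes a running cluster with the `hdfs` CLI on the PATH.

    import subprocess

    # Create a directory in HDFS and copy a local file into it.
    subprocess.check_call(["hdfs", "dfs", "-mkdir", "-p", "/user/demo"])
    subprocess.check_call(["hdfs", "dfs", "-put", "-f",
                           "kddcup.data_10_percent.gz", "/user/demo/"])

    # The second column of `-ls` output is each file's replication factor
    # (3 by default).
    print(subprocess.check_output(["hdfs", "dfs", "-ls", "/user/demo"]).decode())

    # Change the replication factor for one file and wait for it to apply.
    subprocess.check_call(["hdfs", "dfs", "-setrep", "-w", "2",
                           "/user/demo/kddcup.data_10_percent.gz"])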