
HDFS (Hadoop), Scikit-Learn & Apache Spark MLlib

On Linux: Ubuntu 14.04.5 LTS, Release: 14.04, trusty.

Apache Hadoop is an open source software framework that breaks large data sets into blocks and distributes them across multiple servers for storage and processing. Hadoop's strength comes from this network of servers, known as a Hadoop cluster, which can process data much more quickly than a single machine. The non-profit Apache Software Foundation supports the free, open source Hadoop project, but commercial distributions have also become very common.

The Hadoop Distributed File System (HDFS) is where a Hadoop cluster stores its data. Built for data-intensive applications, HDFS is designed to run on clusters of inexpensive commodity servers. It is optimized for high-performance, read-intensive operations and is resilient to failures in the cluster. It does not prevent failures, but it is unlikely to lose data, because by default HDFS keeps multiple copies (three, unless configured otherwise) of each data block.
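
To make the block-and-copy idea concrete, here is a rough sketch of the arithmetic. It assumes the usual defaults in Hadoop 2.x and later (128 MiB blocks, replication factor 3); the function name and the 1 GiB file are made up for illustration, and a real cluster may be configured differently.

```python
# Assumed defaults (Hadoop 2.x and later); both are configurable per cluster.
BLOCK_SIZE = 128 * 1024 * 1024   # 128 MiB per block
REPLICATION = 3                  # copies kept of each block

def hdfs_footprint(file_size, block_size=BLOCK_SIZE, replication=REPLICATION):
    """Return (number of blocks, total block replicas, raw bytes stored)."""
    # Integer ceiling division: a partial final block still counts as a block.
    blocks = (file_size + block_size - 1) // block_size
    # The final block is not padded, so raw usage is simply size x replication.
    return blocks, blocks * replication, file_size * replication

blocks, replicas, raw = hdfs_footprint(1024 ** 3)   # hypothetical 1 GiB file
print(blocks, replicas, raw // 1024 ** 2)           # prints: 8 24 3072
```

So a 1 GiB file becomes 8 blocks, stored as 24 block replicas that take about 3 GiB of raw disk across the cluster.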

Hadoop does batch processing, i.e. processing blocks of data that have already been stored over a period of time. Initially, Hadoop's MapReduce was the best framework for processing data in batches. Spark is an open-source cluster computing framework that can also process data in near real time. It is not built on top of Hadoop MapReduce; rather, it extends the MapReduce model to support more types of computation efficiently and keeps intermediate data in memory where it can. For workloads that fit in memory, the Spark project has claimed speeds up to 100 times faster than Hadoop MapReduce when batch processing large data sets.
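
For readers new to the MapReduce model mentioned above, here is a small single-machine sketch of its three steps (map, shuffle, reduce) applied to a word count. It is plain Python, not Hadoop or Spark code, and the function names are only illustrative; a real framework runs the same steps in parallel across the cluster.

```python
from collections import defaultdict

def map_phase(line):
    # Map: emit a (word, 1) pair for every word in one input line.
    return [(word.lower(), 1) for word in line.split()]

def shuffle(pairs):
    # Shuffle: group all values by key, as the framework does between phases.
    groups = defaultdict(list)
    for key, value in pairs:
        groups[key].append(value)
    return groups

def reduce_phase(key, values):
    # Reduce: combine all values for one key into a single result.
    return key, sum(values)

def word_count(lines):
    pairs = [pair for line in lines for pair in map_phase(line)]
    return dict(reduce_phase(k, v) for k, v in shuffle(pairs).items())

print(word_count(["spark reads hdfs", "hadoop writes hdfs"]))
# prints: {'spark': 1, 'reads': 1, 'hdfs': 2, 'hadoop': 1, 'writes': 1}
```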

Spark can create distributed datasets from any file stored in the Hadoop Distributed File System (HDFS) or in other storage systems supported by the Hadoop APIs (including your local filesystem, Amazon S3, Cassandra, Hive, HBase, etc.). Spark does not require Hadoop; it simply supports storage systems that implement the Hadoop APIs. Spark supports text files, SequenceFiles, and any other Hadoop InputFormat.

More differences and Spark details are here: https://www.edureka.co/blog/spark-tutorial/



Install Hadoop in Stand-Alone Mode on Ubuntu 16.04

Once installed, run it as:
/usr/local/hadoop/bin/hadoop


Scikit-Learn ML Examples:
http://scikit-learn.org/stable/auto_examples/index.html#






Spark Examples:

https://spark.apache.org/examples.html

Welcome to
      ____              __
     / __/__  ___ _____/ /__
    _\ \/ _ \/ _ `/ __/  '_/
   /__ / .__/\_,_/_/ /_/\_\   version ...
      /_/

Running pyspark initially failed with an error like the following:
Exception in thread "main" java.lang.UnsupportedClassVersionError: org/apache/spark/launcher/Main : Unsupported major.minor version 52.0

The error means the JVM in use was older than the one Spark was compiled for. To fix it, Apache Maven and JDK 8 had to be installed. Details here:
https://www.digitalocean.com/community/tutorials/how-to-install-java-with-apt-get-on-ubuntu-16-04
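
As a quick aid for reading that kind of error, the sketch below turns the "major.minor version" number into the Java release it requires. The helper names are made up for illustration; the mapping relies on class-file major versions starting at 45 for JDK 1.1 and rising by one per release, so releases 1.1 to 1.4 come out as 1 to 4.

```python
import re

def java_release_for(major):
    # Class-file major versions start at 45 (JDK 1.1) and rise by one per
    # release, so major 51 is Java 7 and major 52 is Java 8.
    if major < 45:
        raise ValueError("not a valid class-file major version")
    return major - 44

def required_java(error_message):
    # Pull the "major.minor version NN.N" figure out of the JVM error text.
    match = re.search(r"major\.minor version (\d+)\.\d+", error_message)
    return java_release_for(int(match.group(1))) if match else None

msg = ("java.lang.UnsupportedClassVersionError: org/apache/spark/launcher/Main : "
       "Unsupported major.minor version 52.0")
print(required_java(msg))  # prints: 8
```

Comparing that number with the output of java -version shows whether the JVM on the PATH is too old.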

Another problem to keep in mind: some of the Spark MLlib example code on the website assumes the context and session variables already exist, so add them yourself:

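from pyspark.context import SparkContext
from pyspark.sql.session import SparkSession

# Start Spark in local mode, then wrap the context in a session
# so the DataFrame-based examples can use the `spark` variable.
sc = SparkContext('local')
spark = SparkSession(sc)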


Big Data with Apache Spark




HDFS with Spark: https://cbw.sh/spark.html

Setting Up Your Environment - In order to use HDFS and Spark, you first need to configure your environment so that you have access to the required tools. The easiest way to do this is to modify the .bashrc configuration file in your home directory.
