Friday, November 10, 2017

Convolutional Neural Networks - CNNs


Convolution: The primary purpose of convolution in a CNN is to extract features from the input image. Convolution preserves the spatial relationship between pixels by learning image features from small local regions of the input.


In the computation above we slide the orange matrix over the original image (green) by 1 pixel at a time (a step size called the 'stride') and, for every position, compute the element-wise multiplication between the two matrices and add the products to get a single value, which forms one element of the output matrix (pink). Note that the 3×3 matrix "sees" only a part of the input image at each position.
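The same computation is easy to spell out in NumPy. This is only a sketch: the 5×5 image and 3×3 filter values below are illustrative stand-ins for the green and orange matrices in the figure.

import numpy as np

image = np.array([[1, 1, 1, 0, 0],
                  [0, 1, 1, 1, 0],
                  [0, 0, 1, 1, 1],
                  [0, 0, 1, 1, 0],
                  [0, 1, 1, 0, 0]])
kernel = np.array([[1, 0, 1],
                   [0, 1, 0],
                   [1, 0, 1]])

stride = 1
k = kernel.shape[0]
out_size = (image.shape[0] - k) // stride + 1
feature_map = np.zeros((out_size, out_size), dtype=int)

# Slide the filter over the image; at each position, multiply element-wise
# and sum the products to get one element of the output feature map.
for i in range(out_size):
    for j in range(out_size):
        patch = image[i * stride:i * stride + k, j * stride:j * stride + k]
        feature_map[i, j] = np.sum(patch * kernel)

print(feature_map)  # 3x3 output ("convolved feature")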


Saturday, November 04, 2017

Machine Learning - Model Evaluation Metrics


Confusion Matrix:
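As a quick illustration, scikit-learn can compute the matrix from actual and predicted labels (the labels below are made up, not tied to any particular dataset):

from sklearn.metrics import confusion_matrix

# 1 = positive class, 0 = negative class (toy labels for illustration).
y_true = [1, 0, 1, 1, 0, 1, 0, 0]
y_pred = [1, 0, 0, 1, 0, 1, 1, 0]

# With labels=[0, 1] the layout is:
# [[TN, FP],
#  [FN, TP]]
print(confusion_matrix(y_true, y_pred, labels=[0, 1]))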




ROC (Receiver Operating Characteristics) and Area Under Curve (AUC)


ROC graphs are two-dimensional graphs in which the tp rate (true positive rate, or recall) is plotted on the Y axis and the fp rate (false positive rate) is plotted on the X axis. An ROC graph depicts the relative trade-offs between benefits (true positives) and costs (false positives). More details: An introduction to ROC analysis by Tom Fawcett.

For example, when you consider the results of a particular test in two populations, one population with a disease, the other population without the disease, you will rarely observe a perfect separation between the two groups. Indeed, the distribution of the test results will overlap, as shown in the following figure.


For every possible cut-off point or criterion value you select to discriminate between the two populations, there will be some cases with the disease correctly classified as positive (TP = True Positive fraction), but some cases with the disease will be classified negative (FN = False Negative fraction). On the other hand, some cases without the disease will be correctly classified as negative (TN = True Negative fraction), but some cases without the disease will be classified as positive (FP = False Positive fraction).

  • Sensitivity: probability that a test result will be positive when the disease is present (true positive rate, expressed as a percentage). 
  • Specificity: probability that a test result will be negative when the disease is not present (true negative rate, expressed as a percentage). 
  • Positive likelihood ratio: the ratio between the probability of a positive test result given the presence of the disease and the probability of a positive test result given its absence, i.e. True positive rate / False positive rate = Sensitivity / (1 - Specificity).
  • Negative likelihood ratio: the ratio between the probability of a negative test result given the presence of the disease and the probability of a negative test result given its absence, i.e. False negative rate / True negative rate = (1 - Sensitivity) / Specificity.
  • Positive predictive value: probability that the disease is present when the test is positive (expressed as a percentage).
  • Negative predictive value: probability that the disease is not present when the test is negative (expressed as a percentage). All of these quantities can be computed directly from confusion-matrix counts, as shown in the sketch after this list.
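A minimal sketch of those definitions in Python (the TP, FN, TN, FP counts below are made-up numbers, purely for illustration):

# Hypothetical confusion-matrix counts, chosen only for illustration.
TP, FN, TN, FP = 80, 20, 90, 10

sensitivity = TP / float(TP + FN)            # true positive rate
specificity = TN / float(TN + FP)            # true negative rate
positive_lr = sensitivity / (1 - specificity)
negative_lr = (1 - sensitivity) / specificity
ppv = TP / float(TP + FP)                    # positive predictive value
npv = TN / float(TN + FN)                    # negative predictive value

print("Sensitivity: {:.3f}  Specificity: {:.3f}".format(sensitivity, specificity))
print("LR+: {:.2f}  LR-: {:.2f}  PPV: {:.3f}  NPV: {:.3f}".format(positive_lr, negative_lr, ppv, npv))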

In a Receiver Operating Characteristic (ROC) curve the true positive rate (Sensitivity) is plotted as a function of the false positive rate (100 - Specificity) for different cut-off points. Each point on the ROC curve represents a sensitivity/specificity pair corresponding to a particular decision threshold. A test with perfect discrimination (no overlap between the two distributions) has a ROC curve that passes through the upper left corner (100% sensitivity, 100% specificity). Therefore, the closer the ROC curve is to the upper left corner, the higher the overall accuracy of the test - see the next diagram.
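A quick way to see this in practice is scikit-learn's roc_curve and roc_auc_score: each threshold yields one sensitivity/(1 - specificity) point on the curve. The labels and scores below are made up for illustration:

from sklearn.metrics import roc_curve, roc_auc_score

# Made-up ground-truth labels and continuous test scores.
y_true  = [0, 0, 1, 1, 0, 1, 0, 1, 1, 0]
y_score = [0.1, 0.4, 0.35, 0.8, 0.2, 0.9, 0.5, 0.7, 0.6, 0.3]

fpr, tpr, thresholds = roc_curve(y_true, y_score)  # one (fpr, tpr) point per cut-off
print("AUC: {:.3f}".format(roc_auc_score(y_true, y_score)))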


The most important metrics are the following:



Thursday, November 02, 2017

Deploy a Python app on Google Cloud

STEPS:
# Set the tutorial working directory (replace [YOUR_PROJECT_ID] with your project ID)
TUTORIALDIR=src/[YOUR_PROJECT_ID]/python_gae_quickstart-2017-11-01-23-03
# Clone the sample code
git clone https://github.com/GoogleCloudPlatform/python-docs-samples $TUTORIALDIR
# Move into the Hello World sample
cd $TUTORIALDIR/appengine/standard/hello_world
# Run the app locally with the App Engine development server
dev_appserver.py $PWD
# Deploy to App Engine
gcloud app deploy app.yaml --project [YOUR_PROJECT_ID]
The app runs on: https://[YOUR_PROJECT_ID].appspot.com/

Flask App on Google App Engine:

cd $TUTORIALDIR/appengine/standard/flask/tutorial
gcloud app deploy app.yaml --project [YOUR_PROJECT_ID]
Run the Flask app on: https://[YOUR_PROJECT_ID].appspot.com/form
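For orientation, the deployed app is an ordinary Flask application; a minimal sketch (not the tutorial's exact code - App Engine serves the WSGI object named app, and the /form route here is only indicative):

from flask import Flask, request

app = Flask(__name__)

@app.route('/')
def hello():
    return 'Hello from App Engine!'

@app.route('/form', methods=['GET', 'POST'])
def form():
    # Echo back a submitted field; the field name is a placeholder.
    if request.method == 'POST':
        return 'Submitted: {}'.format(request.form.get('name', ''))
    return '<form method="post"><input name="name"><input type="submit"></form>'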

Using the Container Engine:

https://cloud.google.com/container-engine/docs/quickstart#optional_hello_app_code_review

This example makes use of a web app framework - a web application framework can simplify development by taking care of the details of the interface, letting you focus development effort on your application's features. App Engine includes a simple web application framework called webapp2 - a lightweight framework that lets you quickly build simple web applications for the Python 2.7 runtime.
webapp2 is compatible with the WSGI standard for Python web applications. You don't have to use webapp2 to write Python applications for App Engine. Other web application frameworks, such as Django, work with App Engine, and App Engine supports any Python code that uses the CGI standard. The webapp2 project, by Rodrigo Moraes, started as a fork of the App Engine webapp framework, which was used by the Python 2.5 runtime. webapp2 includes a number of features that make developing web applications easier, such as improved support for URI routing, session management and localization. The Python 2.7 runtime uses webapp2, and the project is maintained externally to App Engine. It is supported, but not maintained, by Google.
For more information about webapp2, see the official documentation.
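In outline, a minimal webapp2 app for App Engine looks like this (essentially the standard hello-world handler):

import webapp2

class MainPage(webapp2.RequestHandler):
    def get(self):
        self.response.headers['Content-Type'] = 'text/plain'
        self.response.write('Hello, World!')

# App Engine routes incoming requests to this WSGI application object.
app = webapp2.WSGIApplication([('/', MainPage)], debug=True)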

RUNNING Django + setup a MySQL database instance on AppEngine: 
https://cloud.google.com/python/django/appengine#configure_the_database_settings

Wednesday, November 01, 2017

Google Cloud - Natural Language API


Apart from the Quickstart here:
https://cloud.google.com/natural-language/docs/

I used this code:
https://raw.githubusercontent.com/GoogleCloudPlatform/python-docs-samples/master/language/cloud-client/v1/quickstart.py
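In outline, that quickstart does roughly the following (a sketch assuming the google-cloud-language client library; exact import paths vary by library version):

from google.cloud import language

client = language.LanguageServiceClient()

text = u'Hello, world!'
document = language.types.Document(
    content=text,
    type=language.enums.Document.Type.PLAIN_TEXT)

# Analyze the sentiment of the whole document.
sentiment = client.analyze_sentiment(document=document).document_sentiment

print('Text: {}'.format(text))
print('Sentiment: {}, {}'.format(sentiment.score, sentiment.magnitude))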

I had to do a few additional steps including:

Setting up Google Application Default Credentials - including setting the environment variable GOOGLE_APPLICATION_CREDENTIALS

Well described here:
https://developers.google.com/identity/protocols/application-default-credentials

Then I had to disable unwanted warnings based on the details here:
https://urllib3.readthedocs.io/en/latest/advanced-usage.html#ssl-warnings
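In my case the suppression amounted to something like the following (only do this once you understand why the warning is raised):

import urllib3

# Silence the SSL-related InsecureRequestWarning described at the link above.
urllib3.disable_warnings(urllib3.exceptions.InsecureRequestWarning)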


Monday, October 23, 2017

Google BigQuery & Apache Hive

Google BigQuery is a fast, economical and fully managed enterprise data warehouse for large-scale data analytics. Details on querying your own custom table in BigQuery:

https://cloud.google.com/bigquery/quickstart-web-ui
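As a rough sketch, querying a table from Python with the google-cloud-bigquery client library looks something like this (the project/dataset/table names are placeholders, and the client API has changed across versions):

from google.cloud import bigquery

client = bigquery.Client()
query_job = client.query(
    'SELECT name, COUNT(*) AS n '
    'FROM `my_project.my_dataset.my_table` '
    'GROUP BY name ORDER BY n DESC LIMIT 10')

for row in query_job.result():
    print('{}: {}'.format(row.name, row.n))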


The Apache Hive™ data warehouse software facilitates reading, writing, and managing large datasets residing in distributed storage and queried using SQL syntax. Built on top of Apache Hadoop™, Hive provides the following features:
  • Tools to enable easy access to data via SQL, thus enabling data warehousing tasks such as extract/transform/load (ETL), reporting, and data analysis.
  • A mechanism to impose structure on a variety of data formats
  • Access to files stored either directly in Apache HDFS™ or in other data storage systems such as Apache HBase™
  • Query execution via Apache Tez™, Apache Spark™, or MapReduce
  • Procedural language with HPL-SQL
  • Sub-second query retrieval via Hive LLAP, Apache YARN and Apache Slider.

More details on getting started: https://cwiki.apache.org/confluence/display/Hive/GettingStarted

Sunday, October 22, 2017

Spark SQL & 3rd Party Spark Machine Learning libraries

Spark machine learning inventory

https://github.com/claesenm/spark-ml-inventory



Spark SQL and Dataframes: Python and Spark

Steps:

Getting the Data and Creating the RDD

import urllib

# Download the KDD Cup 99 "10 percent" dataset and load it as an RDD.
# `sc` is the SparkContext provided by the pyspark shell/notebook.
f = urllib.urlretrieve("http://kdd.ics.uci.edu/databases/kddcup99/kddcup.data_10_percent.gz", "kddcup.data_10_percent.gz")
data_file = "./kddcup.data_10_percent.gz"
raw_data = sc.textFile(data_file).cache()


Step 2: Create a Dataframe

A Spark DataFrame is a distributed collection of data organized into named columns. It is conceptually equivalent to a table in a relational database or a data frame in R or Pandas. DataFrames can be constructed from a wide array of sources - in our case, an existing RDD.

The entry point into all SQL functionality in Spark is the SQLContext class. To create a basic instance, all we need is a SparkContext reference.

from pyspark.sql import SQLContext
sqlContext = SQLContext(sc)

Inferring the Schema: With a SQLContext, we are ready to create a DataFrame from our existing RDD. But first we need to tell Spark SQL the schema of our data.
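For example, we can split each CSV line and build Row objects (a sketch - the columns picked out here are based on the KDD Cup 99 schema and cover only a subset of the fields):

from pyspark.sql import Row

csv_data = raw_data.map(lambda l: l.split(","))
row_data = csv_data.map(lambda p: Row(
    duration=int(p[0]),
    protocol_type=p[1],
    service=p[2],
    flag=p[3],
    src_bytes=int(p[4]),
    dst_bytes=int(p[5])))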

Step 3:

Once we have our RDD of Row objects we can infer and register the schema.
interactions_df = sqlContext.createDataFrame(row_data)
interactions_df.registerTempTable("interactions")
Now we can run SQL queries over our data frame that has been registered as a table.

# Select tcp network interactions with more than 1 second duration and no transfer from destination
tcp_interactions = sqlContext.sql("""
    SELECT duration, dst_bytes FROM interactions
    WHERE protocol_type = 'tcp' AND duration > 1000 AND dst_bytes = 0
""")
tcp_interactions.show()
The result of the SQL query is a DataFrame; converting it to an RDD (via .rdd) gives us access to all the normal RDD operations.
# Output duration together with dst_bytes
tcp_interactions_out = tcp_interactions.rdd.map(lambda p: "Duration: {}, Dest. bytes: {}".format(p.duration, p.dst_bytes))
for ti_out in tcp_interactions_out.collect():
  print ti_out


One thing to remember: you can't map a DataFrame directly, but you can convert the DataFrame to an RDD and map that by doing spark_df.rdd.map(). Prior to Spark 2.0, spark_df.map aliased to spark_df.rdd.map(); with Spark 2.0 you must explicitly call .rdd first.
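For example (duration being one of the columns registered above):

durations = interactions_df.rdd.map(lambda row: row.duration)
print(durations.take(5))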




Saturday, October 21, 2017

HDFS (Hadoop), Scikit-Learn & Apache Spark MLlib

On Linux: Ubuntu 14.04.5 LTS, Release: 14.04, trusty.

Apache Hadoop is an open source software framework that enables large data sets to be broken up into blocks, distributed to multiple servers for storage and processing. Hadoop’s strength comes from a server network – known as a Hadoop cluster – that can process data much more quickly than a single machine. The non-profit Apache Software Foundation supports the free open source Hadoop project, but commercial versions have become very common.

The Hadoop Distributed File System (HDFS) is the place in a Hadoop cluster where you store data. Built for data-intensive applications, HDFS is designed to run on clusters of inexpensive commodity servers. HDFS is optimized for high-performance, read-intensive operations and is resilient to failures in the cluster. It does not prevent failures, but it is unlikely to lose data, since by default HDFS keeps multiple copies of each of its data blocks.

Hadoop does batch processing, i.e. processing of blocks of data that have already been stored over a period of time, and MapReduce was initially its main framework for doing so. Spark is an open-source cluster computing framework that supports both batch and (near) real-time processing. It extends the MapReduce model to efficiently cover more types of computations and, by keeping intermediate results in memory, can be up to around 100 times faster than Hadoop MapReduce when batch processing large data sets.

Spark can create distributed datasets from any file stored in the Hadoop distributed filesystem (HDFS) or other storage systems supported by the Hadoop APIs (including your local filesystem, Amazon S3, Cassandra, Hive, HBase, etc.). Spark does not require Hadoop; it simply has support for storage systems implementing the Hadoop APIs. Spark supports text files, SequenceFiles, and any other Hadoop InputFormat.
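For example, the same sc.textFile call works across storage back ends; the URIs below are placeholders and assume the corresponding connectors/credentials are configured:

local_rdd = sc.textFile("file:///tmp/data.txt")             # local filesystem
hdfs_rdd = sc.textFile("hdfs://namenode:9000/data/input")   # HDFS
s3_rdd = sc.textFile("s3a://my-bucket/data/input")          # Amazon S3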

More differences and Spark details are here: https://www.edureka.co/blog/spark-tutorial/



Install Hadoop in Stand-Alone Mode on Ubuntu 16.04

Once installed run it as:
/usr/local/hadoop/bin/hadoop


Scikit-Learn ML Examples:
http://scikit-learn.org/stable/auto_examples/index.html#
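A minimal example in the spirit of that gallery (a sketch using the built-in iris dataset, not taken from any specific gallery page):

from sklearn.datasets import load_iris
from sklearn.model_selection import train_test_split
from sklearn.linear_model import LogisticRegression

# Train/test split plus a simple classifier on the iris dataset.
X, y = load_iris(return_X_y=True)
X_train, X_test, y_train, y_test = train_test_split(X, y, test_size=0.3, random_state=0)

clf = LogisticRegression().fit(X_train, y_train)
print("Test accuracy: {:.3f}".format(clf.score(X_test, y_test)))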