Friday, November 10, 2017

Convolutional Neural Networks - CNNs


Convolution: The primary purpose of convolution in a CNN is to extract features from the input image. Convolution preserves the spatial relationship between pixels by learning image features from small local regions of the input.


In the computation above we slide the orange matrix over the original image (green) by 1 pixel at a time (a step size called the 'stride') and, for every position, compute the element-wise multiplication between the two matrices and add the products to get a single value, which forms one element of the output matrix (pink). Note that the 3×3 matrix "sees" only a part of the input image at each position.
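The same computation is easy to spell out in NumPy. This is only a sketch: the 5×5 image and 3×3 filter values below are illustrative stand-ins for the green and orange matrices in the figure.

import numpy as np

image = np.array([[1, 1, 1, 0, 0],
                  [0, 1, 1, 1, 0],
                  [0, 0, 1, 1, 1],
                  [0, 0, 1, 1, 0],
                  [0, 1, 1, 0, 0]])
kernel = np.array([[1, 0, 1],
                   [0, 1, 0],
                   [1, 0, 1]])

stride = 1
k = kernel.shape[0]
out_size = (image.shape[0] - k) // stride + 1
feature_map = np.zeros((out_size, out_size), dtype=int)

# Slide the filter over the image; at each position, multiply element-wise
# and sum the products to get one element of the output feature map.
for i in range(out_size):
    for j in range(out_size):
        patch = image[i * stride:i * stride + k, j * stride:j * stride + k]
        feature_map[i, j] = np.sum(patch * kernel)

print(feature_map)  # 3x3 output ("convolved feature")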


Saturday, November 04, 2017

Machine Learning - Model Evaluation Metrics


Confusion Matrix:
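As a quick illustration, scikit-learn can compute the matrix from actual and predicted labels (the labels below are made up, not tied to any particular dataset):

from sklearn.metrics import confusion_matrix

# 1 = positive class, 0 = negative class (toy labels for illustration).
y_true = [1, 0, 1, 1, 0, 1, 0, 0]
y_pred = [1, 0, 0, 1, 0, 1, 1, 0]

# With labels=[0, 1] the layout is:
# [[TN, FP],
#  [FN, TP]]
print(confusion_matrix(y_true, y_pred, labels=[0, 1]))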




ROC (Receiver Operating Characteristics) and Area Under Curve (AUC)


ROC graphs are two-dimensional graphs in which the tp rate (true positive rate, or recall) is plotted on the Y axis and the fp rate (false positive rate) is plotted on the X axis. An ROC graph depicts the relative trade-offs between benefits (true positives) and costs (false positives). More details: An introduction to ROC analysis by Tom Fawcett.

For example, when you consider the results of a particular test in two populations, one population with a disease, the other population without the disease, you will rarely observe a perfect separation between the two groups. Indeed, the distribution of the test results will overlap, as shown in the following figure.


For every possible cut-off point or criterion value you select to discriminate between the two populations, there will be some cases with the disease correctly classified as positive (TP = True Positive fraction), but some cases with the disease will be classified negative (FN = False Negative fraction). On the other hand, some cases without the disease will be correctly classified as negative (TN = True Negative fraction), but some cases without the disease will be classified as positive (FP = False Positive fraction).

  • Sensitivity: probability that a test result will be positive when the disease is present (true positive rate, expressed as a percentage). 
  • Specificity: probability that a test result will be negative when the disease is not present (true negative rate, expressed as a percentage). 
  • Positive likelihood ratio: the ratio between the probability of a positive test result given the presence of the disease and the probability of a positive test result given its absence, i.e. True positive rate / False positive rate = Sensitivity / (1 - Specificity).
  • Negative likelihood ratio: the ratio between the probability of a negative test result given the presence of the disease and the probability of a negative test result given its absence, i.e. False negative rate / True negative rate = (1 - Sensitivity) / Specificity.
  • Positive predictive value: probability that the disease is present when the test is positive (expressed as a percentage).
  • Negative predictive value: probability that the disease is not present when the test is negative (expressed as a percentage). All of these quantities can be computed directly from confusion-matrix counts, as shown in the sketch after this list.
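A minimal sketch of those definitions in Python (the TP, FN, TN, FP counts below are made-up numbers, purely for illustration):

# Hypothetical confusion-matrix counts, chosen only for illustration.
TP, FN, TN, FP = 80, 20, 90, 10

sensitivity = TP / float(TP + FN)            # true positive rate
specificity = TN / float(TN + FP)            # true negative rate
positive_lr = sensitivity / (1 - specificity)
negative_lr = (1 - sensitivity) / specificity
ppv = TP / float(TP + FP)                    # positive predictive value
npv = TN / float(TN + FN)                    # negative predictive value

print("Sensitivity: {:.3f}  Specificity: {:.3f}".format(sensitivity, specificity))
print("LR+: {:.2f}  LR-: {:.2f}  PPV: {:.3f}  NPV: {:.3f}".format(positive_lr, negative_lr, ppv, npv))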

In a Receiver Operating Characteristic (ROC) curve the true positive rate (Sensitivity) is plotted as a function of the false positive rate (100 - Specificity) for different cut-off points. Each point on the ROC curve represents a sensitivity/specificity pair corresponding to a particular decision threshold. A test with perfect discrimination (no overlap between the two distributions) has a ROC curve that passes through the upper left corner (100% sensitivity, 100% specificity). Therefore, the closer the ROC curve is to the upper left corner, the higher the overall accuracy of the test - see the next diagram.
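A quick way to see this in practice is scikit-learn's roc_curve and roc_auc_score: each threshold yields one sensitivity/(1 - specificity) point on the curve. The labels and scores below are made up for illustration:

from sklearn.metrics import roc_curve, roc_auc_score

# Made-up ground-truth labels and continuous test scores.
y_true  = [0, 0, 1, 1, 0, 1, 0, 1, 1, 0]
y_score = [0.1, 0.4, 0.35, 0.8, 0.2, 0.9, 0.5, 0.7, 0.6, 0.3]

fpr, tpr, thresholds = roc_curve(y_true, y_score)  # one (fpr, tpr) point per cut-off
print("AUC: {:.3f}".format(roc_auc_score(y_true, y_score)))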


The most important metrics are the following:



Thursday, November 02, 2017

Deploy a Python app on Google Cloud

STEPS:
# Set the tutorial working directory (replace [YOUR_PROJECT_ID] with your project ID)
TUTORIALDIR=src/[YOUR_PROJECT_ID]/python_gae_quickstart-2017-11-01-23-03
# Clone the sample code
git clone https://github.com/GoogleCloudPlatform/python-docs-samples $TUTORIALDIR
# Move into the Hello World sample
cd $TUTORIALDIR/appengine/standard/hello_world
# Run the app locally with the App Engine development server
dev_appserver.py $PWD
# Deploy to App Engine
gcloud app deploy app.yaml --project [YOUR_PROJECT_ID]
The app runs on: https://[YOUR_PROJECT_ID].appspot.com/

Flask App on Google App Engine:

cd $TUTORIALDIR/appengine/standard/flask/tutorial
gcloud app deploy app.yaml --project [YOUR_PROJECT_ID]
Run the Flask app on: https://[YOUR_PROJECT_ID].appspot.com/form
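For orientation, the deployed app is an ordinary Flask application; a minimal sketch (not the tutorial's exact code - App Engine serves the WSGI object named app, and the /form route here is only indicative):

from flask import Flask, request

app = Flask(__name__)

@app.route('/')
def hello():
    return 'Hello from App Engine!'

@app.route('/form', methods=['GET', 'POST'])
def form():
    # Echo back a submitted field; the field name is a placeholder.
    if request.method == 'POST':
        return 'Submitted: {}'.format(request.form.get('name', ''))
    return '<form method="post"><input name="name"><input type="submit"></form>'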

Using the Container Engine:

https://cloud.google.com/container-engine/docs/quickstart#optional_hello_app_code_review

This example makes use of a web app framework - a web application framework can simplify development by taking care of the details of the interface, letting you focus development effort on your application's features. App Engine includes a simple web application framework called webapp2 - a lightweight framework that lets you quickly build simple web applications for the Python 2.7 runtime.
webapp2 is compatible with the WSGI standard for Python web applications. You don't have to use webapp2 to write Python applications for App Engine. Other web application frameworks, such as Django, work with App Engine, and App Engine supports any Python code that uses the CGI standard. The webapp2 project, by Rodrigo Moraes, started as a fork of the App Engine webapp framework, which was used by the Python 2.5 runtime. webapp2 includes a number of features that make developing web applications easier, such as improved support for URI routing, session management and localization. The Python 2.7 runtime uses webapp2, and the project is maintained externally to App Engine. It is supported, but not maintained, by Google.
For more information about webapp2, see the official documentation.
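In outline, a minimal webapp2 app for App Engine looks like this (essentially the standard hello-world handler):

import webapp2

class MainPage(webapp2.RequestHandler):
    def get(self):
        self.response.headers['Content-Type'] = 'text/plain'
        self.response.write('Hello, World!')

# App Engine routes incoming requests to this WSGI application object.
app = webapp2.WSGIApplication([('/', MainPage)], debug=True)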

RUNNING Django + setup a MySQL database instance on AppEngine: 
https://cloud.google.com/python/django/appengine#configure_the_database_settings

Wednesday, November 01, 2017

Google Cloud - Natural Language API


Apart from the Quickstart here:
https://cloud.google.com/natural-language/docs/

I used this code:
https://raw.githubusercontent.com/GoogleCloudPlatform/python-docs-samples/master/language/cloud-client/v1/quickstart.py
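In outline, that quickstart does roughly the following (a sketch assuming the google-cloud-language client library; exact import paths vary by library version):

from google.cloud import language

client = language.LanguageServiceClient()

text = u'Hello, world!'
document = language.types.Document(
    content=text,
    type=language.enums.Document.Type.PLAIN_TEXT)

# Analyze the sentiment of the whole document.
sentiment = client.analyze_sentiment(document=document).document_sentiment

print('Text: {}'.format(text))
print('Sentiment: {}, {}'.format(sentiment.score, sentiment.magnitude))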

I had to do a few additional steps including:

Setting up Google Application Default Credentials - including setting the environment variable GOOGLE_APPLICATION_CREDENTIALS

Well described here:
https://developers.google.com/identity/protocols/application-default-credentials

Then I had to disable unwanted warnings based on the details here:
https://urllib3.readthedocs.io/en/latest/advanced-usage.html#ssl-warnings
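In my case the suppression amounted to something like the following (only do this once you understand why the warning is raised):

import urllib3

# Silence the SSL-related InsecureRequestWarning described at the link above.
urllib3.disable_warnings(urllib3.exceptions.InsecureRequestWarning)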


Monday, October 23, 2017

Google BigQuery & Apache Hive

Google BigQuery is a fast, economical and fully managed enterprise data warehouse for large-scale data analytics. Details on querying your own custom table in BigQuery:

https://cloud.google.com/bigquery/quickstart-web-ui
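As a rough sketch, querying a table from Python with the google-cloud-bigquery client library looks something like this (the project/dataset/table names are placeholders, and the client API has changed across versions):

from google.cloud import bigquery

client = bigquery.Client()
query_job = client.query(
    'SELECT name, COUNT(*) AS n '
    'FROM `my_project.my_dataset.my_table` '
    'GROUP BY name ORDER BY n DESC LIMIT 10')

for row in query_job.result():
    print('{}: {}'.format(row.name, row.n))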


The Apache Hive™ data warehouse software facilitates reading, writing, and managing large datasets residing in distributed storage and queried using SQL syntax. Built on top of Apache Hadoop™, Hive provides the following features:
  • Tools to enable easy access to data via SQL, thus enabling data warehousing tasks such as extract/transform/load (ETL), reporting, and data analysis.
  • A mechanism to impose structure on a variety of data formats
  • Access to files stored either directly in Apache HDFS™ or in other data storage systems such as Apache HBase™
  • Query execution via Apache Tez™, Apache Spark™, or MapReduce
  • Procedural language with HPL-SQL
  • Sub-second query retrieval via Hive LLAP, Apache YARN and Apache Slider.

More details on getting started: https://cwiki.apache.org/confluence/display/Hive/GettingStarted

Sunday, October 22, 2017

Spark SQL & 3rd Party Spark Machine Learning libraries

Spark machine learning inventory

https://github.com/claesenm/spark-ml-inventory



Spark SQL and Dataframes: Python and Spark

Steps:

Getting the Data and Creating the RDD

import urllib

# Download the KDD Cup 99 "10 percent" dataset and load it as an RDD.
# `sc` is the SparkContext provided by the pyspark shell/notebook.
f = urllib.urlretrieve("http://kdd.ics.uci.edu/databases/kddcup99/kddcup.data_10_percent.gz", "kddcup.data_10_percent.gz")
data_file = "./kddcup.data_10_percent.gz"
raw_data = sc.textFile(data_file).cache()


Step 2: Create a Dataframe

A Spark DataFrame is a distributed collection of data organized into named columns. It is conceptually equivalent to a table in a relational database or a data frame in R or Pandas. DataFrames can be constructed from a wide array of sources - in our case, an existing RDD.

The entry point into all SQL functionality in Spark is the SQLContext class. To create a basic instance, all we need is a SparkContext reference.

from pyspark.sql import SQLContext
sqlContext = SQLContext(sc)

Inferring the Schema: With a SQLContext, we are ready to create a DataFrame from our existing RDD. But first we need to tell Spark SQL the schema of our data.
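For example, we can split each CSV line and build Row objects (a sketch - the columns picked out here are based on the KDD Cup 99 schema and cover only a subset of the fields):

from pyspark.sql import Row

csv_data = raw_data.map(lambda l: l.split(","))
row_data = csv_data.map(lambda p: Row(
    duration=int(p[0]),
    protocol_type=p[1],
    service=p[2],
    flag=p[3],
    src_bytes=int(p[4]),
    dst_bytes=int(p[5])))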

Step 3:

Once we have our RDD of Row objects we can infer and register the schema.
interactions_df = sqlContext.createDataFrame(row_data)
interactions_df.registerTempTable("interactions")
Now we can run SQL queries over our data frame that has been registered as a table.

# Select tcp network interactions with more than 1 second duration and no transfer from destination
tcp_interactions = sqlContext.sql("""
    SELECT duration, dst_bytes FROM interactions
    WHERE protocol_type = 'tcp' AND duration > 1000 AND dst_bytes = 0
""")
tcp_interactions.show()
The result of the SQL query is a DataFrame; converting it to an RDD (via .rdd) gives us access to all the normal RDD operations.
# Output duration together with dst_bytes
tcp_interactions_out = tcp_interactions.rdd.map(lambda p: "Duration: {}, Dest. bytes: {}".format(p.duration, p.dst_bytes))
for ti_out in tcp_interactions_out.collect():
  print ti_out


One thing to remember: you can't map a DataFrame directly, but you can convert the DataFrame to an RDD and map that by doing spark_df.rdd.map(). Prior to Spark 2.0, spark_df.map aliased to spark_df.rdd.map(); with Spark 2.0 you must explicitly call .rdd first.
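For example (duration being one of the columns registered above):

durations = interactions_df.rdd.map(lambda row: row.duration)
print(durations.take(5))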




Saturday, October 21, 2017

HDFS (Hadoop), Scikit-Learn & Apache Spark MLlib

On Linux: Ubuntu 14.04.5 LTS, Release: 14.04, trusty.

Apache Hadoop is an open source software framework that enables large data sets to be broken up into blocks, distributed to multiple servers for storage and processing. Hadoop’s strength comes from a server network – known as a Hadoop cluster – that can process data much more quickly than a single machine. The non-profit Apache Software Foundation supports the free open source Hadoop project, but commercial versions have become very common.

The Hadoop Distributed File System (HDFS) is the place in a Hadoop cluster where you store data. Built for data-intensive applications, HDFS is designed to run on clusters of inexpensive commodity servers. HDFS is optimized for high-performance, read-intensive operations and is resilient to failures in the cluster. It does not prevent failures, but it is unlikely to lose data, since by default HDFS keeps multiple copies of each of its data blocks.

Hadoop does batch processing, i.e. processing of blocks of data that have already been stored over a period of time, and MapReduce was initially its main framework for doing so. Spark is an open-source cluster computing framework that supports both batch and (near) real-time processing. It extends the MapReduce model to efficiently cover more types of computations and, by keeping intermediate results in memory, can be up to around 100 times faster than Hadoop MapReduce when batch processing large data sets.

Spark can create distributed datasets from any file stored in the Hadoop distributed filesystem (HDFS) or other storage systems supported by the Hadoop APIs (including your local filesystem, Amazon S3, Cassandra, Hive, HBase, etc.). Spark does not require Hadoop; it simply has support for storage systems implementing the Hadoop APIs. Spark supports text files, SequenceFiles, and any other Hadoop InputFormat.
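For example, the same sc.textFile call works across storage back ends; the URIs below are placeholders and assume the corresponding connectors/credentials are configured:

local_rdd = sc.textFile("file:///tmp/data.txt")             # local filesystem
hdfs_rdd = sc.textFile("hdfs://namenode:9000/data/input")   # HDFS
s3_rdd = sc.textFile("s3a://my-bucket/data/input")          # Amazon S3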

More differences and Spark details are here: https://www.edureka.co/blog/spark-tutorial/



Install Hadoop in Stand-Alone Mode on Ubuntu 16.04

Once installed run it as:
/usr/local/hadoop/bin/hadoop


Scikit-Learn ML Examples:
http://scikit-learn.org/stable/auto_examples/index.html#
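A minimal example in the spirit of that gallery (a sketch using the built-in iris dataset, not taken from any specific gallery page):

from sklearn.datasets import load_iris
from sklearn.model_selection import train_test_split
from sklearn.linear_model import LogisticRegression

# Train/test split plus a simple classifier on the iris dataset.
X, y = load_iris(return_X_y=True)
X_train, X_test, y_train, y_test = train_test_split(X, y, test_size=0.3, random_state=0)

clf = LogisticRegression().fit(X_train, y_train)
print("Test accuracy: {:.3f}".format(clf.score(X_test, y_test)))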