Machine Learning Tools


Everything About Data

This is an incomplete list of the machine learning tools available as of July 2016. I categorized them into open source tools and commercial tools; however, the open source tools usually have a commercialized version with support, and the commercial tools tend to include a free version that you can download and try out. Click the product links to learn more.

Open Source


Spark MLlib

  • MLlib is Apache Spark’s scalable machine learning library.
    • Initial contribution from AMPLab, UC Berkeley
    • Shipped with Spark since version 0.8
    • Over 30 contributors
    • Includes many common machine learning and statistical algorithms
    • Supports Scala, Java and Python programming languages
  • Pros
    • Powerful processing performance of Spark (up to 100x faster than Hadoop MapReduce in memory and 10x faster on disk).
    • Runs on Hadoop, Mesos, or standalone.
    • Easy to code (with Scala).
  • Cons
    • Spark requires experienced engineers.
  • Online Resources: http://spark.apache.org/mllib/
  • Algorithms
    • Basic Statistics
      • Summary, Correlation, Sampling, Hypothesis testing, and…
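
To make the Basic Statistics category more concrete, here is a minimal sketch in plain Python of two of the items listed above: summary statistics and Pearson correlation. It does not use MLlib's API; the function names `summarize` and `pearson` and the sample numbers are made up for illustration only.

```python
import math


def summarize(xs):
    """Return count, mean, sample variance, min and max for a list of numbers."""
    n = len(xs)
    if n == 0:
        raise ValueError("need at least one value")
    mean = sum(xs) / n
    # Sample variance (divides by n - 1); defined as 0.0 for a single value.
    variance = sum((x - mean) ** 2 for x in xs) / (n - 1) if n > 1 else 0.0
    return {"count": n, "mean": mean, "variance": variance,
            "min": min(xs), "max": max(xs)}


def pearson(xs, ys):
    """Return the Pearson correlation coefficient of two equal-length lists."""
    if len(xs) != len(ys) or len(xs) < 2:
        raise ValueError("need two lists of equal length >= 2")
    mx = sum(xs) / len(xs)
    my = sum(ys) / len(ys)
    cov = sum((x - mx) * (y - my) for x, y in zip(xs, ys))
    sx = math.sqrt(sum((x - mx) ** 2 for x in xs))
    sy = math.sqrt(sum((y - my) ** 2 for y in ys))
    if sx == 0 or sy == 0:
        raise ValueError("correlation is undefined for a constant list")
    return cov / (sx * sy)


if __name__ == "__main__":
    # Hypothetical sample data, not taken from the original post.
    hours = [1.0, 2.0, 3.0, 4.0]
    score = [2.0, 4.1, 5.9, 8.2]
    print(summarize(hours))
    print(pearson(hours, score))
```

A distributed library computes the same quantities over partitioned data; the arithmetic here is only the single-machine version.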


Advanced Analytics Reference Architecture


Everything About Data

Building data platforms and delivering advanced analytical services in the new age of data intelligence can be a daunting task. The sheer number of tools and methodologies we could use does not make it any easier. A reference architecture is therefore needed to provide guidelines for process design and best practices for advanced analytics, so that we not only meet the business requirements but also bring more value to the business.

1. Architectural Guidance

  • The architecture should cover all building blocks including the following: Data Infrastructure, Data Engineering, Traditional Business Intelligence, and Advanced Analytics. Within Advanced Analytics, we should include machine learning, deep learning, data science, predictive analytics, and the operationalization of models.
  • One of the first steps should be finding the gaps between the current infrastructure, tools, and technologies and the end-state environment.
  • We need to create a unified approach to both structured and unstructured data. It’s perfectly fine to…


Install Hadoop and Spark on a Mac


Everything About Data

Hadoop performs best on a cluster of multiple nodes/servers; however, it runs perfectly well on a single machine, even a Mac, so we can use it for development. Spark is a popular tool for processing data in Hadoop. The purpose of this post is to show you the steps to install Hadoop and Spark on a Mac.

Operating System: Mac OS X El Capitan 10.11.3
Hadoop Version: 2.7.2
Spark Version: 1.6.1

Pre-requisites

1. Install Java

Open a terminal window to check what Java version is installed.
$ java -version

If Java is not installed, go to https://java.com/en/download/ to download and install the latest JDK. If Java is installed, use the following command in a terminal window to find the Java home path.
$ /usr/libexec/java_home

Next, we need to set the JAVA_HOME environment variable on the Mac.
$ echo export "JAVA_HOME=$(/usr/libexec/java_home)" >> ~/.bash_profile
$ source ~/.bash_profile
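
As a quick sanity check after setting the variable, a small script can confirm that JAVA_HOME is set and points at a real directory. This is only a sketch in Python; the function name `java_home_status` and its three return values are assumptions made for this example and are not part of Hadoop or Spark.

```python
import os


def java_home_status(env=None):
    """Return "unset", "missing" or "ok" for the JAVA_HOME entry in env.

    env defaults to the current process environment; passing a plain dict
    makes the check easy to try without touching the real environment.
    """
    env = os.environ if env is None else env
    path = env.get("JAVA_HOME", "")
    if not path:
        return "unset"      # variable not defined (or empty)
    if not os.path.isdir(path):
        return "missing"    # defined, but the directory does not exist
    return "ok"


if __name__ == "__main__":
    print("JAVA_HOME:", java_home_status())
```

Run it from a new terminal window so that the updated ~/.bash_profile has been loaded.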

2. Enable SSH as Hadoop requires it.

Go to System Preferences -> Sharing and check “Remote Login”.

Generate SSH…


Introduction to Azure HDInsight


AugmentIQ

What is Azure HDInsight
Azure HDInsight is an Apache Hadoop distribution powered by the cloud. It uses the Hortonworks Data Platform (HDP) Hadoop distribution. HDInsight deploys and provisions managed Apache Hadoop clusters in the cloud, providing a software framework designed to process, analyze, and report on big data with high reliability and availability. The Azure HDInsight service uses Azure Blob Storage as the default file system. Storing data in Azure Blob Storage has the advantage that the data is retained even after the cluster is deleted and can be reused. One can also use HDFS as the default file system by changing the configuration accordingly.

What is WASB in HDInsight
Windows Azure Storage Blob (WASB) is an extension built on top of the HDFS APIs. The WASBS variation uses SSL certificates for improved security. In many ways it “is” HDFS. However, WASB creates a layer of abstraction that enables separation of storage. This separation is…
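
For illustration, WASB locations are commonly written as URIs of the form `wasb://<container>@<account>.blob.core.windows.net/<path>`, with `wasbs://` for the SSL variation. The sketch below builds such a URI in Python; the helper name `wasb_uri` and the container, account, and path values are hypothetical placeholders, not values from the original post.

```python
def wasb_uri(container, account, path="", secure=False):
    """Build a WASB (or WASBS) URI for a blob path.

    container and account are the storage container and storage account
    names; secure=True selects the wasbs:// scheme.
    """
    if not container or not account:
        raise ValueError("container and account are required")
    scheme = "wasbs" if secure else "wasb"
    # Drop leading slashes so the result has exactly one slash after the host.
    clean_path = path.lstrip("/")
    return f"{scheme}://{container}@{account}.blob.core.windows.net/{clean_path}"


if __name__ == "__main__":
    # Placeholder names for illustration only.
    print(wasb_uri("mycontainer", "mystorageaccount", "/example/data.csv"))
    print(wasb_uri("mycontainer", "mystorageaccount", "example/data.csv", secure=True))
```

Because the storage account appears in the URI itself, the same path can be read by any cluster that has access to that account.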
