The Data Science Process_Part 2 (Practice)

As already discussed, the stages in the knowledge discovery process are:

  • Opportunity Assessment & Business Understanding
  • Data Understanding & Data Acquisition
  • Data Cleaning and Transformation
  • Model Building
  • Policy Construction
  • Evaluation, Residuals and Metrics
  • Model Deployment, Monitoring, Model Updates
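
As a rough illustration of how the middle stages translate into practice, here is a minimal sketch in Python. The file name "customers.csv", the "churned" column and the choice of logistic regression are hypothetical placeholders, not part of the process itself:

    # Minimal sketch of the middle stages of the process.
    # "customers.csv" and the "churned" column are hypothetical.
    import pandas as pd
    from sklearn.linear_model import LogisticRegression
    from sklearn.metrics import classification_report
    from sklearn.model_selection import train_test_split

    # Data Understanding & Data Acquisition
    df = pd.read_csv("customers.csv")

    # Data Cleaning and Transformation
    df = df.dropna()                                   # drop incomplete records
    X = pd.get_dummies(df.drop(columns=["churned"]))   # encode categoricals
    y = df["churned"]

    # Model Building
    X_train, X_test, y_train, y_test = train_test_split(X, y, test_size=0.2)
    model = LogisticRegression(max_iter=1000).fit(X_train, y_train)

    # Evaluation, Residuals and Metrics
    print(classification_report(y_test, model.predict(X_test)))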

Let’s look at each of these stages in more detail. Continue reading

The Data Science Process_Part 1 (Theory)

Data Science is not new. In fact, it’s been around for many years.

Over that time, various groups of data professionals have defined and documented methodologies for conducting a data science project. These attempts to make the process of discovering knowledge scientific all arrived at similar steps, so it is safe to say that some core principles underlie the data science process.

Continue reading

What is NoSQL?

NoSQL stands for ‘Not only SQL’, also known as ‘non-relational’. These databases were introduced specifically to handle the rise in data types, data access patterns and data availability needs.

Today’s applications need a database that offers a scalable, flexible way to manage the massive flow of data to and from a global user base efficiently and safely.
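
To make the contrast with a relational database concrete, here is a minimal sketch using pymongo against a local MongoDB server; the database, collection and field names are hypothetical:

    # Minimal document-store sketch with pymongo; assumes a MongoDB
    # server on localhost. Database and field names are hypothetical.
    from pymongo import MongoClient

    client = MongoClient("mongodb://localhost:27017/")
    users = client["appdb"]["users"]

    # Documents in one collection need not share a fixed schema.
    users.insert_one({"name": "Alice", "email": "alice@example.com"})
    users.insert_one({"name": "Bob", "devices": ["phone", "tablet"]})

    # Query by any field; no table definition or migration required.
    print(users.find_one({"name": "Bob"}))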

Continue reading

Hadoop Distributions and Offerings

Hadoop is available from either the Apache Software Foundation or from companies that offer their own Hadoop distributions.

The Hadoop ecosystem has many component parts, all of which exist as their own Apache projects. Because Hadoop has grown considerably and continues to change significantly, different versions of these open source community components are not always fully compatible with one another. This makes it considerably harder for anyone looking to get an independent start with Hadoop by downloading and compiling projects directly from Apache.

Continue reading

The Apache Hadoop Ecosystem

Hadoop is the most widely used single platform for storing and analyzing big data.

Apache projects are created to develop open source software and are supported by the Apache Software Foundation, a nonprofit organization made up of a decentralized community of developers. Open source software is usually developed in a public, collaborative way, and its source code is freely available to anyone to study, modify and distribute.

Hadoop was originally intended to serve as the infrastructure for the Nutch project in 2002. Nutch needed an architecture that could scale to billions of web pages, and it found inspiration in the Google File System; the resulting storage layer would ultimately become HDFS. In 2004 Google published a paper introducing MapReduce, and by 2006 Nutch was using both MapReduce and HDFS.
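
To give a feel for the programming model that paper described, here is a toy word-count sketch in plain Python. It only imitates the map, shuffle and reduce phases; it is not Hadoop's actual API:

    # Toy word count illustrating the MapReduce model in plain Python.
    # Not Hadoop's API: the shuffle step is done by hand here, while a
    # real framework distributes it across a cluster.
    from itertools import groupby
    from operator import itemgetter

    def map_phase(line):
        # Emit a (word, 1) pair for every word in the line.
        return [(word, 1) for word in line.split()]

    def reduce_phase(word, counts):
        # Sum all counts emitted for a single word.
        return (word, sum(counts))

    lines = ["big data is big", "hadoop stores big data"]

    # Map: apply map_phase to every input record.
    pairs = [pair for line in lines for pair in map_phase(line)]

    # Shuffle: group intermediate pairs by key.
    pairs.sort(key=itemgetter(0))

    # Reduce: combine the grouped values for each key.
    for word, group in groupby(pairs, key=itemgetter(0)):
        print(reduce_phase(word, (count for _, count in group)))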

Continue reading