ETL tools can play a major role in your analytics project
ETL tools are an important part of any data analytics or machine learning project, since the required data is usually spread across different data sources. ETL (extract, transform, load) is therefore a necessary part of the process. In this post we explore the best open source ETL tools available.
The ETL process basically involves:
- the extraction of data from homogeneous or heterogeneous data sources,
- the transformation of the data into the format or structure needed for querying and analytics,
- the loading of the data into the final target (database, operational data store, data mart, data warehouse, CSV file, etc.)
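To make the three phases concrete, here is a minimal sketch using only the Python standard library; the CSV layout, table name, and column names are illustrative, not taken from any particular tool.

```python
import csv
import io
import sqlite3

# Extract: read rows from a source (here an in-memory CSV standing in
# for a real data source).
raw = "id,amount\n1, 10 \n2, 20 \n"
rows = list(csv.DictReader(io.StringIO(raw)))

# Transform: clean the values and convert them to the types the
# target schema expects.
cleaned = [(int(r["id"]), float(r["amount"].strip())) for r in rows]

# Load: write the cleaned rows into the final target (an SQLite store).
con = sqlite3.connect(":memory:")
con.execute("CREATE TABLE sales (id INTEGER, amount REAL)")
con.executemany("INSERT INTO sales VALUES (?, ?)", cleaned)
total = con.execute("SELECT SUM(amount) FROM sales").fetchone()[0]
print(total)  # 30.0
```

Real ETL tools add scheduling, error handling, and parallelism on top of this basic extract/transform/load shape.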
All three phases usually execute in parallel in ETL tools, since data extraction takes time. The power and versatility of the available tools vary greatly. Fortunately there are some very good open source ETL tools, some of which have only become available recently.
Pentaho is an open source solution with very strong ETL and data integration capabilities. In addition, it provides OLAP processing, visualization, and advanced predictive capabilities, and it integrates with Hadoop. Talend is also a good option. Pentaho, however, has much more robust visualization tools, which can be of great use for getting value out of the data once it has been warehoused. In any case, both Pentaho and Talend are reliable, open source, performant, user friendly and cross-platform.
Luigi is another good option, and it is not limited to Hadoop clusters. Luigi is Python based, backed by Spotify, used by many startups around the world, and extensible. For simple cases, Pentaho may be a better option, whereas for more complex cases Luigi may be the best alternative.
Airflow is another open source ETL tool to consider. It is a workflow management platform that was recently open sourced by Airbnb. It is coded in Python, actively developed, and has become a serious option for managing batch tasks. It is basically an orchestrator for ETL pipelines. It not only schedules and executes your tasks but also helps manage the sequence and dependencies of those tasks through workflows. As its documentation says,
“The basic setup of Airflow is fairly easy to complete if you have some technical chops. If you have some experience with Python, writing your first jobs in Airflow shouldn’t be a problem. Once you are set up, the web interface is user-friendly and can provide the status of each task that has run. You can trigger new tasks to run from the interface and visually see all the dependencies between the pipelines. Airflow is a very complete solution. Out of the box, it can connect to a wide variety of databases. It can alert you by email if a task fails, write a message on Slack when a task has finished running, etc. You might not need everything it has to offer out of the box. Also, Airflow is designed to scale. The workers can be distributed across different nodes. Generally, there shouldn’t be much hardcore computing done by the workers, but being able to scale the workers will provide some headroom for your tasks to run smoothly. The setup is more advanced if you want to scale workers.”
Airflow is still young and there are some rough edges, but it goes a lot further than Luigi. Not to forget: if you don’t have an engineer on your team, or someone who is technical, Airflow is not a good choice. While the setup is easy, you need someone who can manage and maintain your Airflow instance. Also, if you are running in a Windows environment, you might run into trouble setting up Airflow.
The fact that Luigi and Airflow are both Python based and extensible makes them my top choices. Just make sure that there is a technical person on the team. I suggest Pentaho or Talend when technical people are lacking, and for simple cases.
Originally published on my blog Cyzne.com under the same title