Ashwin Aravind

Apache Drill is a schema-free SQL engine that can be plugged into different types of distributed data stores, including HDFS, MongoDB, Amazon S3, HBase, and text files. Through its ODBC/JDBC interfaces, Drill can be connected to business intelligence applications such as Tableau, MicroStrategy, and TIBCO Spotfire for data exploration. Drill is similar to Cloudera Impala in several respects, but it stands out in some key capabilities.

Schema free – Unlike Impala, Drill does not require a schema to be defined in the Hive Metastore, which makes it very flexible.

Querying data stores other than HDFS – The Drill engine can be configured to query several other data stores and formats: MongoDB, Amazon S3, HBase, JSON files, Parquet, text files, directories, and sequence files.
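As a rough illustration of the schema-free point, the sketch below builds the request body for a query that reads a JSON file in place, with no table definition created beforehand. It assumes Drill's REST endpoint (`POST /query.json`) and the `dfs` storage plugin. The host, port, and file path are hypothetical placeholders, and `build_query_payload` and `run_query` are helper names invented for this example.

```python
import json
import urllib.request


def build_query_payload(file_path: str, limit: int = 10) -> dict:
    """Build the JSON body for a Drill REST query against a raw file.

    The file is addressed directly through the dfs storage plugin, so no
    schema has to be registered first. file_path is a hypothetical example.
    """
    sql = f"SELECT * FROM dfs.`{file_path}` LIMIT {limit}"
    return {"queryType": "SQL", "query": sql}


def run_query(payload: dict, base_url: str = "http://localhost:8047") -> dict:
    """POST the payload to a running Drillbit (assumed to listen at base_url)."""
    request = urllib.request.Request(
        f"{base_url}/query.json",
        data=json.dumps(payload).encode("utf-8"),
        headers={"Content-Type": "application/json"},
        method="POST",
    )
    with urllib.request.urlopen(request) as response:
        return json.load(response)


# Only the payload is built here; run_query needs a live Drill cluster.
payload = build_query_payload("/data/logs/events.json", limit=5)
print(json.dumps(payload))
```

Calling `run_query(payload)` would send the query to whichever Drillbit answers at the assumed address. Everything before that call works without a cluster.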

High-Level Architecture

Apache Drill consists of Drillbit processes, which run on all the nodes in the cluster. The cluster membership of all Drillbits is maintained by ZooKeeper.
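Because ZooKeeper tracks which Drillbits are part of the cluster, a client does not have to name a specific node; it can point at the ZooKeeper quorum instead. The sketch below assembles a JDBC connection string in the `jdbc:drill:zk=...` form. The host names are hypothetical, `drill_jdbc_url` is a helper invented for this example, and the directory and cluster ID values are assumptions that must match the cluster's own configuration.

```python
def drill_jdbc_url(zk_hosts, directory="drill", cluster_id="drillbits1"):
    """Build a Drill JDBC URL that locates Drillbits through ZooKeeper.

    zk_hosts is a list of "host:port" strings for the ZooKeeper quorum.
    directory and cluster_id are assumed values; they must match what the
    cluster is actually configured with.
    """
    if not zk_hosts:
        raise ValueError("at least one ZooKeeper host is required")
    quorum = ",".join(zk_hosts)
    return f"jdbc:drill:zk={quorum}/{directory}/{cluster_id}"


# Hypothetical three-node ZooKeeper quorum.
url = drill_jdbc_url(["zk1:2181", "zk2:2181", "zk3:2181"])
print(url)
```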

The client (ODBC/JDBC driver or APIs) submits the SQL query to a Drillbit on any of the nodes. This Drillbit acts as the Foreman and converts the SQL into a logical plan and then a physical plan. The physical plan is converted into a multi-level execution plan and executed in parallel on the nodes. The nodes return their results to the Foreman Drillbit, which returns the final result to the client. Because any node running a Drillbit can take the Foreman role, the user can submit a query to any of them.

Drill is optimized for querying large datasets.

Data sources and storage plugins supported by Drill:

HBase, Hive, MapR-DB, file system, RDBMS, MongoDB, Amazon S3
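The query flow described earlier (a client submits to any Drillbit, which becomes the Foreman, fans the work out, and merges the results) can be pictured with a toy model. This is not Drill's planner or execution engine: threads stand in for nodes, a simple filter stands in for a query fragment, and every name in the code is made up for illustration.

```python
from concurrent.futures import ThreadPoolExecutor

# Toy data: each "node" holds a slice of the rows. Node names are made up.
NODE_DATA = {
    "node-a": [3, 8, 15],
    "node-b": [4, 23],
    "node-c": [16, 42, 1],
}


def run_fragment(node: str, threshold: int) -> list:
    """Work done on one node: filter that node's local rows."""
    return [row for row in NODE_DATA[node] if row > threshold]


def foreman(threshold: int) -> list:
    """Fan the filter out to every node in parallel, then merge the results."""
    nodes = sorted(NODE_DATA)
    with ThreadPoolExecutor(max_workers=len(nodes)) as pool:
        partials = list(pool.map(lambda n: run_fragment(n, threshold), nodes))
    merged = [row for part in partials for row in part]
    return sorted(merged)


# Roughly "SELECT value FROM t WHERE value > 10 ORDER BY value" in this toy.
result = foreman(threshold=10)
print(result)
```

The point of the sketch is the shape of the flow: the per-node work is independent, so it can run in parallel, and only the merge step happens in one place.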

Drill supports the following input formats for data:

Avro, CSV (comma-separated values), TSV (tab-separated values), PSV (pipe-separated values), Parquet, MapR-DB*, Hadoop sequence files
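The three delimited text formats in this list differ only in the character that separates fields. The sketch below shows that with Python's standard `csv` module; it does not use Drill, and the `DELIMITERS` mapping and `read_rows` helper are illustrative names chosen for this example.

```python
import csv
import io

# Delimiter implied by each file extension; an illustrative mapping.
DELIMITERS = {"csv": ",", "tsv": "\t", "psv": "|"}


def read_rows(text: str, extension: str) -> list:
    """Parse delimited text using the delimiter implied by the extension."""
    delimiter = DELIMITERS[extension]
    return list(csv.reader(io.StringIO(text), delimiter=delimiter))


# The same two-row table written in each of the three formats.
csv_rows = read_rows("id,name\n1,drill\n", "csv")
tsv_rows = read_rows("id\tname\n1\tdrill\n", "tsv")
psv_rows = read_rows("id|name\n1|drill\n", "psv")
print(csv_rows == tsv_rows == psv_rows)
```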