Import CSV files into Apache Hive
A common task for anyone working with big data is to load a CSV file into Apache Hive. Apache Hive is a SQL-like data warehouse that sits on top of Hadoop's distributed file system (HDFS).
In this post, we give a step-by-step approach to import a CSV file into Apache Hive. We assume that Apache Hive has been properly configured along with HDFS.
The main steps are the following:
- Copy the CSV file data.csv into HDFS.
- Create a Hive table to store the contents of data.csv.
- Import data.csv into that table.
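As a running example, data.csv might contain rows like the following (a hypothetical three-column file with a name, an address, and a price; note that a literal comma inside a field is escaped with a backslash, matching the escape character used later in this post):

```csv
John Doe,12 Main St\, Springfield,250000
Jane Roe,34 Oak Ave,180000
```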
Copy file into HDFS
This is pretty straightforward:
# Copy data.csv into HDFS
hadoop fs -put /path/on/localsystem/data.csv /path/on/hdfs/data.csv
Create Hive Table
Start the Hive console and create a table named example:
hive
hive> CREATE TABLE example(name STRING, address STRING, price INT)
    > ROW FORMAT DELIMITED FIELDS TERMINATED BY ',' ESCAPED BY '\\'
    > STORED AS TEXTFILE;
The above command creates a table named example with columns name, address, and price. It stores the table as a text file whose records are rows with comma-separated fields; a literal comma inside a field value must be escaped with '\'.
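To make the delimiter and escape semantics concrete, here is a small Python sketch (using the standard csv module, not Hive itself) of how a row with an escaped comma splits into fields under a comma delimiter and backslash escape, mirroring the ROW FORMAT above:

```python
import csv
import io

# One record in the expected format: three fields, with a literal
# comma inside the address escaped by a backslash.
raw_line = "John Doe,12 Main St\\, Springfield,250000\n"

# csv.reader with escapechar='\\' and no quoting mimics the
# DELIMITED ... TERMINATED BY ',' ESCAPED BY '\\' parsing rules.
reader = csv.reader(io.StringIO(raw_line), delimiter=",",
                    escapechar="\\", quoting=csv.QUOTE_NONE)
fields = next(reader)
print(fields)  # ['John Doe', '12 Main St, Springfield', '250000']
```

The escaped comma ends up inside the address field instead of splitting it, which is why the escape character matters for free-text columns such as addresses.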
Note that the columns of example must appear in the same order as the fields in /path/on/hdfs/data.csv.
Load data into Hive Table
To load the data into the table example, simply type:
hive> LOAD DATA INPATH '/path/on/hdfs/data.csv' OVERWRITE INTO TABLE example;
Note that LOAD DATA INPATH moves the file rather than copying it, so /path/on/hdfs/data.csv will no longer exist at its original location after the load. The OVERWRITE keyword also replaces any data already stored in the table.
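To check that the load succeeded, you can query the table from the same console; the row count and column values should match data.csv:

```
hive> SELECT * FROM example LIMIT 5;
hive> SELECT COUNT(*) FROM example;
```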