Reference site: http://sqoop.apache.org/
Reference documentation: http://sqoop.apache.org/docs/1.4.7/SqoopUserGuide.html
The import tool imports a single table from an RDBMS into HDFS. Each row of the table corresponds to a record in a text file, or to a record in a binary format such as Avro or SequenceFiles.
```
$ sqoop import (generic-args) (import-args)
$ sqoop-import (generic-args) (import-args)
```
Argument | Description |
---|---|
--connect <jdbc-uri> | Specify JDBC connect string |
--connection-manager <class-name> | Specify connection manager class to use |
--driver <class-name> | Manually specify JDBC driver class to use |
--hadoop-mapred-home <dir> | Override $HADOOP_MAPRED_HOME |
--help | Print usage instructions |
--password-file | Set path for a file containing the authentication password (see the example after this table) |
-P | Read password from console |
--password <password> | Set authentication password |
--username <username> | Set authentication username |
--verbose | Print more information while working |
--connection-param-file <filename> | Optional properties file that provides connection parameters |
--relaxed-isolation | Set connection transaction isolation to read uncommitted for the mappers. |
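A usage note on passwords: instead of passing --password in clear text on the command line, the password can be read from a file with --password-file. A minimal sketch, assuming a hypothetical local path /home/simon/.mysql-pass (per the user guide, the file may live on the local filesystem or on HDFS):

```
# Create the password file without a trailing newline and lock down its
# permissions (the path is hypothetical)
echo -n "password" > /home/simon/.mysql-pass
chmod 400 /home/simon/.mysql-pass

sqoop import --connect jdbc:mysql://127.0.0.1:3306/miodatabase \
    --username root \
    --password-file file:///home/simon/.mysql-pass \
    --table miatabella
```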
MySQL recap
As the reference DB for these tests I use MySQL; below are the commands to create a database, create a table, and insert some data.
```
mysql -u root -p

CREATE DATABASE miodatabase;
USE miodatabase;
CREATE TABLE miatabella (
    id INT NOT NULL PRIMARY KEY AUTO_INCREMENT,
    nome VARCHAR(20),
    cognome VARCHAR(30),
    anni INT,
    compleanno DATE);
INSERT INTO `miatabella` (`id`,`nome`,`cognome`,`anni`,`compleanno`) VALUES (NULL, "Simon", "Usai", 42, '1976-04-27');
```
To connect to a DB you must specify the driver and the connection URL:
```
sqoop import --driver com.microsoft.jdbc.sqlserver.SQLServerDriver \
    --connect <connect-string> ...

sqoop import --connect 'jdbc:postgresql://<postgres_server_url>:<postgres_port>/<db_name>'
```
Important: the JDBC driver must be made available to the tool we are using; here are several ways to do that:
from the terminal:

```
export HADOOP_CLASSPATH="/home/myuser/postgresql-9.4.1209.jar"
```
or as a parameter:
-libjars <comma separated list of jars>: specify comma-separated jar files to include in the classpath. The -files, -libjars, and -archives arguments are not typically used with Sqoop, but they are included as part of Hadoop's internal argument-parsing system (see the example after this list).
or, on a Cloudera installation (see https://www.cloudera.com/documentation/enterprise/5-6-x/topics/cdh_ig_jdbc_driver_install.html):
```
mkdir -p /var/lib/sqoop
chown sqoop:sqoop /var/lib/sqoop
chmod 755 /var/lib/sqoop
cp ojdbc6.jar /var/lib/sqoop/    # for example, the Oracle driver
```
or directly in Sqoop's lib directory:

```
/.../sqoop-1.4.6.bin__hadoop-2.0.4-alpha/lib
```
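As an example of the -libjars route mentioned above, a sketch reusing the PostgreSQL jar and the placeholders from the earlier examples. Note that the generic Hadoop options (-files, -libjars, -archives) must appear right after the tool name, before any Sqoop-specific argument:

```
# -libjars comes immediately after "import", before the Sqoop arguments
sqoop import -libjars /home/myuser/postgresql-9.4.1209.jar \
    --connect 'jdbc:postgresql://<postgres_server_url>:<postgres_port>/<db_name>' \
    --username myuser -P \
    --table miatabella
```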
Argument | Description |
---|---|
--append | Append data to an existing dataset in HDFS |
--as-avrodatafile | Imports data to Avro Data Files |
--as-sequencefile | Imports data to SequenceFiles |
--as-textfile | Imports data as plain text (default) |
--as-parquetfile | Imports data to Parquet Files |
--boundary-query <statement> | Boundary query to use for creating splits |
--columns <col,col,col…> | Columns to import from table |
--delete-target-dir | Delete the import target directory if it exists (for the examples, this lets us re-run the same script) |
--direct | Use direct connector if exists for the database |
--fetch-size <n> | Number of entries to read from database at once. |
--inline-lob-limit <n> | Set the maximum size for an inline LOB |
-m,--num-mappers <n> | Use n map tasks to import in parallel |
-e,--query <statement> | Import the results of statement (see the free-form query example after this table). |
--split-by <column-name> | Column of the table used to split work units. Cannot be used with --autoreset-to-one-mapper option. |
--split-limit <n> | Upper limit for each split size. This only applies to Integer and Date columns. For date or timestamp fields it is calculated in seconds. |
--autoreset-to-one-mapper | Import should use one mapper if a table has no primary key and no split-by column is provided. Cannot be used with --split-by <col> option. |
--table <table-name> | Table to read |
--target-dir <dir> | HDFS destination dir |
--temporary-rootdir <dir> | HDFS directory for temporary files created during import (overrides default “_sqoop”) |
--warehouse-dir <dir> | HDFS parent for table destination |
--where <where clause> | WHERE clause to use during import |
-z,--compress | Enable compression |
--compression-codec <c> | Use Hadoop codec (default gzip) |
--null-string <null-string> | The string to be written for a null value for string columns |
--null-non-string <null-string> | The string to be written for a null value for non-string columns |
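Several of these options only make sense together. A sketch of a free-form query import with -e/--query, based on the user guide: the statement must contain the literal token $CONDITIONS (which Sqoop replaces with each mapper's split predicate), and with more than one mapper --split-by is mandatory; the target directory name here is hypothetical:

```
sqoop import --connect jdbc:mysql://127.0.0.1:3306/miodatabase \
    --driver com.mysql.jdbc.Driver \
    --username root --password password \
    --query 'SELECT id, nome, cognome FROM miatabella WHERE $CONDITIONS' \
    --split-by id \
    -m 2 \
    --target-dir /tmp/miatabella_query \
    --delete-target-dir
```

Note the single quotes around the query: with double quotes the shell would try to expand $CONDITIONS itself.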
Start Hadoop: /.../hadoop-2.7.4/sbin/start-all.sh
To check the logs of the MapReduce jobs: http://localhost:8088/cluster/apps
```
sqoop import -libjars /home/simon/sqldeveloper/jdbc/lib/mysql-connector-java-5.1.14-bin.jar \
    --connect jdbc:mysql://127.0.0.1:3306/miodatabase \
    --driver com.mysql.jdbc.Driver \
    --username root --password password \
    --table miatabella \
    --fields-terminated-by ";" \
    --target-dir /tmp/miatabella \
    -m 1 \
    --columns "nome,cognome" \
    --append
```
we get the following result:
```
simon@Itaca ~: hdfs dfs -ls /tmp/miatabella
18/06/12 16:06:08 WARN util.NativeCodeLoader: Unable to load native-hadoop library for your platform... using builtin-java classes where applicable
Found 2 items
-rw-r--r--   1 simon supergroup          0 2018-06-12 16:05 /tmp/miatabella/_SUCCESS
-rw-r--r--   1 simon supergroup         22 2018-06-12 16:05 /tmp/miatabella/part-m-00000

simon@Itaca ~: hdfs dfs -cat /tmp/miatabella/part-m-00000
18/06/12 16:06:36 WARN util.NativeCodeLoader: Unable to load native-hadoop library for your platform... using builtin-java classes where applicable
Simon;Usai
```
Hadoop compression codecs
Codec | File extension | Codec class |
---|---|---|
bzip2 | .bz2 | org.apache.hadoop.io.compress.BZip2Codec |
default | .deflate | org.apache.hadoop.io.compress.DefaultCodec |
deflate | .deflate | org.apache.hadoop.io.compress.DeflateCodec |
gzip | .gz | org.apache.hadoop.io.compress.GzipCodec |
lz4 | .lz4 | org.apache.hadoop.io.compress.Lz4Codec |
snappy | .snappy | org.apache.hadoop.io.compress.SnappyCodec |
Not all codecs are available by default (if one is missing, add its jar as we did for the JDBC drivers). With -z alone, the default compression is gzip.
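As a quick check of the default, a sketch with -z alone on the same table as above; the part files should come out with a .gz extension:

```
sqoop import --connect jdbc:mysql://127.0.0.1:3306/miodatabase \
    --driver com.mysql.jdbc.Driver \
    --username root --password password \
    --table miatabella \
    --fields-terminated-by ";" \
    --target-dir /tmp/miatabella \
    --delete-target-dir \
    -z
```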
lz4 compression
```
sqoop import --connect jdbc:mysql://127.0.0.1:3306/miodatabase \
    --driver com.mysql.jdbc.Driver \
    --username root \
    --password password \
    --table miatabella \
    --fields-terminated-by ";" \
    --target-dir /tmp/miatabella \
    --delete-target-dir \
    --compression-codec lz4
```
As you can see, files with the .lz4 extension are created (two files because we did not use -m 1, so there is one file per map task allocated by the tool):
```
simon@Itaca ~/bigdata/hadoop-2.7.4/sbin: hdfs dfs -ls /tmp/miatabella
18/06/13 11:53:22 WARN util.NativeCodeLoader: Unable to load native-hadoop library for your platform... using builtin-java classes where applicable
Found 3 items
-rw-r--r--   1 simon supergroup          0 2018-06-13 11:52 /tmp/miatabella/_SUCCESS
-rw-r--r--   1 simon supergroup         37 2018-06-13 11:52 /tmp/miatabella/part-m-00000.lz4
-rw-r--r--   1 simon supergroup         37 2018-06-13 11:52 /tmp/miatabella/part-m-00001.lz4
```
Let's try an Avro import
Note: if you try the Avro format with --compression-codec lz4 you get: Error: org.apache.avro.AvroRuntimeException: Unrecognized codec: lz4. Evidently, not every file format supports every compression codec.
```
sqoop import --connect jdbc:mysql://127.0.0.1:3306/miodatabase \
    --driver com.mysql.jdbc.Driver \
    --username root \
    --password password \
    --table miatabella \
    --fields-terminated-by ";" \
    --target-dir /tmp/miatabella \
    --delete-target-dir \
    --as-avrodatafile \
    --compression-codec bzip2
```
```
simon@Itaca ~/bigdata/hadoop-2.7.4/sbin: hdfs dfs -ls /tmp/miatabella
18/06/13 12:08:14 WARN util.NativeCodeLoader: Unable to load native-hadoop library for your platform... using builtin-java classes where applicable
Found 3 items
-rw-r--r--   1 simon supergroup          0 2018-06-13 12:07 /tmp/miatabella/_SUCCESS
-rw-r--r--   1 simon supergroup        645 2018-06-13 12:07 /tmp/miatabella/part-m-00000.avro
-rw-r--r--   1 simon supergroup        645 2018-06-13 12:07 /tmp/miatabella/part-m-00001.avro
```
target-dir vs. warehouse-dir
```
sqoop import --connect jdbc:mysql://127.0.0.1:3306/miodatabase \
    --driver com.mysql.jdbc.Driver \
    --username root \
    --password password \
    --table miatabella \
    --fields-terminated-by ";" \
    --warehouse-dir /tmp/miatabella \
    --delete-target-dir
```
Essentially, the --warehouse-dir option designates the given directory as the root of the import, and the output directory inside it takes the name of the table (admittedly the example is misleading, since the names chosen here do not highlight the difference well: the result lands in /tmp/miatabella/miatabella).
```
hdfs dfs -ls /tmp/miatabella
18/06/13 12:55:19 WARN util.NativeCodeLoader: Unable to load native-hadoop library for your platform... using builtin-java classes where applicable
Found 4 items
-rw-r--r--   1 simon supergroup          0 2018-06-13 12:54 /tmp/miatabella/_SUCCESS
drwxr-xr-x   - simon supergroup          0 2018-06-13 12:55 /tmp/miatabella/miatabella
-rw-r--r--   1 simon supergroup        726 2018-06-13 12:54 /tmp/miatabella/part-m-00000.avro
-rw-r--r--   1 simon supergroup        726 2018-06-13 12:54 /tmp/miatabella/part-m-00001.avro
```
WHERE
Let's insert more data:

```
INSERT INTO `miatabella` (`id`,`nome`,`cognome`,`anni`,`compleanno`) VALUES (NULL, "Aldo", "Plutarco", 42, '1980-01-21');
```
```
sqoop import --connect jdbc:mysql://127.0.0.1:3306/miodatabase \
    --driver com.mysql.jdbc.Driver \
    --username root \
    --password password \
    --table miatabella \
    --fields-terminated-by ";" \
    --target-dir /tmp/miatabella \
    --delete-target-dir \
    --where "nome='Aldo'"
```
As expected (though never to be taken for granted), we get:
```
simon@Itaca ~/bigdata/hadoop-2.7.4/sbin: hdfs dfs -cat /tmp/miatabella/part-m-00000
18/06/13 15:16:52 WARN util.NativeCodeLoader: Unable to load native-hadoop library for your platform... using builtin-java classes where applicable
3;Aldo;Plutarco;42;1980-01-21
```
Argument | Description |
---|---|
--enclosed-by <char> | Sets a required field enclosing character |
--escaped-by <char> | Sets the escape character |
--fields-terminated-by <char> | Sets the field separator character |
--lines-terminated-by <char> | Sets the end-of-line character |
--mysql-delimiters | Uses MySQL’s default delimiter set: fields: , lines: \n escaped-by: \ optionally-enclosed-by: ' |
--optionally-enclosed-by <char> | Sets a field enclosing character |
Supported escape characters are: \b (backspace), \n (newline), \r (carriage return), \t (tab), \" (double quote), \' (single quote), \\ (backslash), and \0 (NUL, which disables enclosing/escaping).

Input parsing arguments:
Argument | Description |
---|---|
--input-enclosed-by <char> | Sets a required field encloser |
--input-escaped-by <char> | Sets the input escape character |
--input-fields-terminated-by <char> | Sets the input field separator |
--input-lines-terminated-by <char> | Sets the input end-of-line character |
--input-optionally-enclosed-by <char> | Sets a field enclosing character |
--enclosed-by
```
sqoop import --connect jdbc:mysql://127.0.0.1:3306/miodatabase \
    --driver com.mysql.jdbc.Driver \
    --username root \
    --password password \
    --table miatabella \
    --target-dir /tmp/miatabella \
    --delete-target-dir \
    -m 1 \
    --fields-terminated-by ";" \
    --enclosed-by "\""
```
we get:
```
simon@Itaca ~/bigdata: hdfs dfs -cat /tmp/miatabella/part-m-00000
18/06/14 12:32:25 WARN util.NativeCodeLoader: Unable to load native-hadoop library for your platform... using builtin-java classes where applicable
"4";"Simon";"Usai";"42";"1976-04-27"
"5";"Simon";"Usai";"42";"1976-04-27"
"6";"Simon";"Usai";"42";"1976-04-27"
```
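For comparison, a sketch with --optionally-enclosed-by (syntax as in the user guide): the enclosing character should be emitted only around fields that actually contain delimiter characters, rather than around every field:

```
sqoop import --connect jdbc:mysql://127.0.0.1:3306/miodatabase \
    --driver com.mysql.jdbc.Driver \
    --username root --password password \
    --table miatabella \
    --target-dir /tmp/miatabella \
    --delete-target-dir \
    -m 1 \
    --fields-terminated-by ";" \
    --optionally-enclosed-by '\"'
```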
--lines-terminated-by
```
sqoop import --connect jdbc:mysql://127.0.0.1:3306/miodatabase \
    --driver com.mysql.jdbc.Driver \
    --username root \
    --password password \
    --table miatabella \
    --target-dir /tmp/miatabella \
    --delete-target-dir \
    -m 1 \
    --fields-terminated-by ";" \
    --enclosed-by "\"" \
    --lines-terminated-by "\\n"
```
we get:
```
simon@Itaca ~/bigdata: hdfs dfs -cat /tmp/miatabella/part-m-00000
18/06/14 12:32:25 WARN util.NativeCodeLoader: Unable to load native-hadoop library for your platform... using builtin-java classes where applicable
"4";"Simon";"Usai";"42";"1976-04-27"
"5";"Simon";"Usai";"42";"1976-04-27"
"6";"Simon";"Usai";"42";"1976-04-27"
```
NOTE: if instead of "\n" we use "\r" (carriage return) and we are on a Linux system, we get:
```
simon@Itaca ~/bigdata: hdfs dfs -cat /tmp/miatabella/part-m-00000 > test.txt
simon@Itaca ~/bigdata: vi test.txt
"4";"Simon";"Usai";"42";"1976-04-27"^M"5";"Simon";"Usai";"42";"1976-04-27"^M"6";"Simon";"Usai";"42";"1976-04-27"^M
```