Reference site: http://sqoop.apache.org/

Reference documentation: http://sqoop.apache.org/docs/1.4.7/SqoopUserGuide.html

SQOOP-IMPORT

The import tool imports a single table from an RDBMS into HDFS. Each row of the table corresponds to a line in a text file, or to a record in a binary format such as Avro or SequenceFiles.

$ sqoop import (generic-args) (import-args)
$ sqoop-import (generic-args) (import-args)

Common arguments

Argument Description
--connect <jdbc-uri> Specify JDBC connect string
--connection-manager <class-name> Specify connection manager class to use
--driver <class-name> Manually specify JDBC driver class to use
--hadoop-mapred-home <dir> Override $HADOOP_MAPRED_HOME
--help Print usage instructions
--password-file Set path for a file containing the authentication password
-P Read password from console
--password <password> Set authentication password
--username <username> Set authentication username
--verbose Print more information while working
--connection-param-file <filename> Optional properties file that provides connection parameters
--relaxed-isolation Set connection transaction isolation to read uncommitted for the mappers.
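
For example, a minimal sketch combining some of the arguments above (it uses the MySQL database and table created in the next section; the --password-file path is just a placeholder on HDFS, and it assumes the JDBC driver is already on the classpath, see further below):

sqoop import \
    --connect jdbc:mysql://127.0.0.1:3306/miodatabase \
    --username root \
    --password-file /user/simon/mysql.password \
    --table miatabella \
    --target-dir /tmp/miatabella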

MySQL recap

I use MySQL as the reference database for the tests; below are the commands to create a database, a table and insert some data.

mysql -u root -p

CREATE DATABASE miodatabase;

USE miodatabase;

CREATE TABLE miatabella (
    id INT NOT NULL PRIMARY KEY AUTO_INCREMENT,
    nome VARCHAR(20),
    cognome VARCHAR(30),
    anni INT,
    compleanno DATE);

INSERT INTO `miatabella` (`id`,`nome`,`cognome`,`anni`,`compleanno`) VALUES (NULL, "Simon", "Usai",42, '1976-04-27');
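
To quickly verify that the row is actually there before importing it:

SELECT * FROM miatabella;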

To connect to a database you must specify the driver and the connection URL:

sqoop import --driver com.microsoft.jdbc.sqlserver.SQLServerDriver \
    --connect <connect-string> ...

sqoop import \
   --connect 'jdbc:postgresql://<postgres_server_url>:<postgres_port>/<db_name>'

Important: the JDBC driver must be made available to the tool we are using; here are several ways to do this:

from the terminal:

export HADOOP_CLASSPATH="/home/myuser/postgresql-9.4.1209.jar"

or as a parameter:

-libjars <comma separated list of jars> Specify comma-separated jar files to include in the classpath.

The -files, -libjars, and -archives arguments are not typically used with Sqoop, but they are included as part of Hadoop’s internal argument-parsing system.
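
Note that -libjars (like -files and -archives) is a generic Hadoop option, so it must come right after the tool name, before the Sqoop-specific arguments; a sketch (the jar path is just an example):

sqoop import -libjars /path/to/mysql-connector-java-5.1.14-bin.jar \
    --connect jdbc:mysql://127.0.0.1:3306/miodatabase ...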

or, on a Cloudera installation: https://www.cloudera.com/documentation/enterprise/5-6-x/topics/cdh_ig_jdbc_driver_install.html

mkdir -p /var/lib/sqoop
chown sqoop:sqoop /var/lib/sqoop 
chmod 755 /var/lib/sqoop
cp ojdbc6.jar /var/lib/sqoop/ (for example, the Oracle driver)

or directly into Sqoop's lib directory:

/.../sqoop-1.4.6.bin__hadoop-2.0.4-alpha/lib

Import control arguments

Argument Description
--append Append data to an existing dataset in HDFS
--as-avrodatafile Imports data to Avro Data Files
--as-sequencefile Imports data to SequenceFiles
--as-textfile Imports data as plain text (default)
--as-parquetfile Imports data to Parquet Files
--boundary-query <statement> Boundary query to use for creating splits
--columns <col,col,col…> Columns to import from table
--delete-target-dir Delete the import target directory if it exists (for these examples, it lets us re-run the same command)
--direct Use direct connector if exists for the database
--fetch-size <n> Number of entries to read from database at once.
--inline-lob-limit <n> Set the maximum size for an inline LOB
-m,--num-mappers <n> Use n map tasks to import in parallel
-e,--query <statement> Import the results of statement.
--split-by <column-name> Column of the table used to split work units. Cannot be used with --autoreset-to-one-mapper option.
--split-limit <n> Upper Limit for each split size. This only applies to Integer and Date columns. For date or timestamp fields it is calculated in seconds.
--autoreset-to-one-mapper Import should use one mapper if a table has no primary key and no split-by column is provided. Cannot be used with --split-by <col> option.
--table <table-name> Table to read
--target-dir <dir> HDFS destination dir
--temporary-rootdir <dir> HDFS directory for temporary files created during import (overrides default “_sqoop”)
--warehouse-dir <dir> HDFS parent for table destination
--where <where clause> WHERE clause to use during import
-z,–compress Enable compression
--compression-codec <c> Use Hadoop codec (default gzip)
--null-string <null-string> The string to be written for a null value for string columns
--null-non-string <null-string> The string to be written for a null value for non-string columns
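
Among the arguments above, -e/--query deserves a quick example: with a free-form query no --table is given and the statement must contain the literal token $CONDITIONS (which Sqoop replaces with the split conditions, and requires even with a single mapper). A minimal sketch using the table created earlier; the single quotes keep the shell from expanding $CONDITIONS:

sqoop import --connect jdbc:mysql://127.0.0.1:3306/miodatabase \
--driver com.mysql.jdbc.Driver \
--username root --password password \
--query 'SELECT nome, cognome FROM miatabella WHERE $CONDITIONS' \
-m 1 \
--target-dir /tmp/miatabella_query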

Examples:

Start Hadoop: /.../hadoop-2.7.4/sbin/start-all.sh

To check the logs of the MapReduce jobs: http://localhost:8088/cluster/apps
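
To make sure the Hadoop daemons are actually up, jps should list at least the NameNode, DataNode, ResourceManager and NodeManager processes:

jps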

sqoop import -libjars /home/simon/sqldeveloper/jdbc/lib/mysql-connector-java-5.1.14-bin.jar \ 
--connect jdbc:mysql://127.0.0.1:3306/miodatabase \
--driver com.mysql.jdbc.Driver \
--username root --password password \
--table miatabella \
--fields-terminated-by ";" \
--target-dir /tmp/miatabella \
-m 1 \
--columns "nome,cognome" \
--append

we get the following result:

simon@Itaca ~: hdfs dfs -ls /tmp/miatabella
18/06/12 16:06:08 WARN util.NativeCodeLoader: Unable to load native-hadoop library for your platform... using builtin-java classes where applicable
Found 2 items
-rw-r--r--   1 simon supergroup          0 2018-06-12 16:05 /tmp/miatabella/_SUCCESS
-rw-r--r--   1 simon supergroup         22 2018-06-12 16:05 /tmp/miatabella/part-m-00000

simon@Itaca ~: hdfs dfs -cat /tmp/miatabella/part-m-00000
18/06/12 16:06:36 WARN util.NativeCodeLoader: Unable to load native-hadoop library for your platform... using builtin-java classes where applicable
Simon;Usai

Hadoop compression codecs

bzip2 .bz2 org.apache.hadoop.io.compress.BZip2Codec
default .deflate org.apache.hadoop.io.compress.DefaultCodec
deflate .deflate org.apache.hadoop.io.compress.DeflateCodec
gzip .gz org.apache.hadoop.io.compress.GzipCodec
lz4 .lz4 org.apache.hadoop.io.compress.Lz4Codec
snappy .snappy org.apache.hadoop.io.compress.SnappyCodec

Not all codecs are available by default (if necessary, add the jar just as for the JDBC drivers). By default, -z compresses with gzip.
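
The codec can also be given by its full class name from the table above, for example:

sqoop import --connect jdbc:mysql://127.0.0.1:3306/miodatabase ... --compression-codec org.apache.hadoop.io.compress.BZip2Codec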

lz4 compression

sqoop import  --connect jdbc:mysql://127.0.0.1:3306/miodatabase \
--driver com.mysql.jdbc.Driver \
--username root \
--password password \
--table miatabella \
--fields-terminated-by ";" \
--target-dir /tmp/miatabella \
--delete-target-dir \
--compression-codec lz4

As you can see, files with the .lz4 extension are created (two files because we did not use -m 1, so there is one file per map task allocated by the tool):

simon@Itaca ~/bigdata/hadoop-2.7.4/sbin:  hdfs dfs -ls /tmp/miatabella
18/06/13 11:53:22 WARN util.NativeCodeLoader: Unable to load native-hadoop library for your platform... using builtin-java classes where applicable
Found 3 items
-rw-r--r--   1 simon supergroup          0 2018-06-13 11:52 /tmp/miatabella/_SUCCESS
-rw-r--r--   1 simon supergroup         37 2018-06-13 11:52 /tmp/miatabella/part-m-00000.lz4
-rw-r--r--   1 simon supergroup         37 2018-06-13 11:52 /tmp/miatabella/part-m-00001.lz4
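
To read the compressed files directly, hdfs dfs -text decodes them through the matching codec (assuming the lz4 native library is also available on the client):

hdfs dfs -text /tmp/miatabella/part-m-00000.lz4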

Let's try an Avro import

Note: if you try the Avro format with --compression-codec lz4 you get Error: org.apache.avro.AvroRuntimeException: Unrecognized codec: lz4, which shows that not every file format supports every compression codec.

sqoop import  --connect jdbc:mysql://127.0.0.1:3306/miodatabase \
--driver com.mysql.jdbc.Driver \
--username root \
--password password \
--table miatabella \
--fields-terminated-by ";" \
--target-dir /tmp/miatabella \
--delete-target-dir \
--as-avrodatafile \
--compression-codec bzip2

simon@Itaca ~/bigdata/hadoop-2.7.4/sbin: hdfs dfs -ls /tmp/miatabella
18/06/13 12:08:14 WARN util.NativeCodeLoader: Unable to load native-hadoop library for your platform... using builtin-java classes where applicable
Found 3 items
-rw-r--r--   1 simon supergroup          0 2018-06-13 12:07 /tmp/miatabella/_SUCCESS
-rw-r--r--   1 simon supergroup        645 2018-06-13 12:07 /tmp/miatabella/part-m-00000.avro
-rw-r--r--   1 simon supergroup        645 2018-06-13 12:07 /tmp/miatabella/part-m-00001.avro
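
The content of the Avro files can be inspected, for example, with the avro-tools jar (the version in the jar name is only an example; use the one matching your Avro release):

hdfs dfs -get /tmp/miatabella/part-m-00000.avro .
java -jar avro-tools-1.8.2.jar tojson part-m-00000.avro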

target-dir Vs. warehouse-dir

sqoop import  --connect jdbc:mysql://127.0.0.1:3306/miodatabase \
--driver com.mysql.jdbc.Driver \
--username root \
--password password \
--table miatabella \
--fields-terminated-by ";" \
--warehouse-dir /tmp/miatabella \
--delete-target-dir

Essentially, with the --warehouse-dir option you choose the root folder of the import, and the output directory takes the name of the table (admittedly this example is misleading, since the names chosen do not make the difference obvious: the data ends up in /tmp/miatabella/miatabella).

hdfs dfs -ls /tmp/miatabella
18/06/13 12:55:19 WARN util.NativeCodeLoader: Unable to load native-hadoop library for your platform... using builtin-java classes where applicable
Found 4 items
-rw-r--r--   1 simon supergroup          0 2018-06-13 12:54 /tmp/miatabella/_SUCCESS
drwxr-xr-x   - simon supergroup          0 2018-06-13 12:55 /tmp/miatabella/miatabella
-rw-r--r--   1 simon supergroup        726 2018-06-13 12:54 /tmp/miatabella/part-m-00000.avro
-rw-r--r--   1 simon supergroup        726 2018-06-13 12:54 /tmp/miatabella/part-m-00001.avro

WHERE

Let's insert some more data:

INSERT INTO `miatabella` (`id`,`nome`,`cognome`,`anni`,`compleanno`) VALUES (NULL, "Aldo", "Plutarco",42, '1980-01-21');

sqoop import  --connect jdbc:mysql://127.0.0.1:3306/miodatabase \
--driver com.mysql.jdbc.Driver \
--username root \
--password password \
--table miatabella \
--fields-terminated-by ";" \
--target-dir /tmp/miatabella \
--delete-target-dir \
--where "nome='Aldo'"

As expected (but never to be taken for granted), we get:

simon@Itaca ~/bigdata/hadoop-2.7.4/sbin: hdfs dfs -cat /tmp/miatabella/part-m-00000
18/06/13 15:16:52 WARN util.NativeCodeLoader: Unable to load native-hadoop library for your platform... using builtin-java classes where applicable
3;Aldo;Plutarco;42;1980-01-21

Output line formatting arguments:

Argument Description
--enclosed-by <char> Sets a required field enclosing character
--escaped-by <char> Sets the escape character
--fields-terminated-by <char> Sets the field separator character
--lines-terminated-by <char> Sets the end-of-line character
--mysql-delimiters Uses MySQL’s default delimiter set: fields: , lines: \n escaped-by: \ optionally-enclosed-by: '
--optionally-enclosed-by <char> Sets a field enclosing character

Supported escape characters are:

  • \b (backspace)
  • \n (newline)
  • \r (carriage return)
  • \t (tab)
  • \" (double-quote)
  • \' (single-quote)
  • \\ (backslash)
  • \0 (NUL) - This will insert NUL characters between fields or lines, or will disable enclosing/escaping if used for one of the --enclosed-by, --optionally-enclosed-by, or --escaped-by arguments.
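
Unlike --enclosed-by, which quotes every field, --optionally-enclosed-by only encloses the fields that actually contain delimiter characters; a sketch on the same table:

sqoop import --connect jdbc:mysql://127.0.0.1:3306/miodatabase \
--driver com.mysql.jdbc.Driver \
--username root \
--password password \
--table miatabella \
--target-dir /tmp/miatabella \
--delete-target-dir \
-m 1 \
--fields-terminated-by ";" \
--optionally-enclosed-by "\"" \
--escaped-by "\\"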

Input parsing arguments:

Argument Description
--input-enclosed-by <char> Sets a required field encloser
--input-escaped-by <char> Sets the input escape character
--input-fields-terminated-by <char> Sets the input field separator
--input-lines-terminated-by <char> Sets the input end-of-line character
--input-optionally-enclosed-by <char> Sets a field enclosing character

--enclosed-by

sqoop import  --connect jdbc:mysql://127.0.0.1:3306/miodatabase \
--driver com.mysql.jdbc.Driver \
--username root \
--password password \
--table miatabella \
--target-dir /tmp/miatabella \
--delete-target-dir \
--m 1 \
--fields-terminated-by ";" \
--enclosed-by "\""

we get:

simon@Itaca ~/bigdata: hdfs dfs -cat /tmp/miatabella/part-m-00000
18/06/14 12:32:25 WARN util.NativeCodeLoader: Unable to load native-hadoop library for your platform... using builtin-java classes where applicable
"4";"Simon";"Usai";"42";"1976-04-27"
"5";"Simon";"Usai";"42";"1976-04-27"
"6";"Simon";"Usai";"42";"1976-04-27"

--lines-terminated-by

sqoop import  --connect jdbc:mysql://127.0.0.1:3306/miodatabase \
--driver com.mysql.jdbc.Driver \
--username root \
--password password \
--table miatabella \
--target-dir /tmp/miatabella \
--delete-target-dir \
--m 1 \
--fields-terminated-by ";" \
--enclosed-by "\"" \
--lines-terminated-by "\\n" 

we get:

simon@Itaca ~/bigdata: hdfs dfs -cat /tmp/miatabella/part-m-00000
18/06/14 12:32:25 WARN util.NativeCodeLoader: Unable to load native-hadoop library for your platform... using builtin-java classes where applicable
"4";"Simon";"Usai";"42";"1976-04-27"
"5";"Simon";"Usai";"42";"1976-04-27"
"6";"Simon";"Usai";"42";"1976-04-27"

NOTE: if instead of "\n" we use "\r" (carriage return) and we are in a Linux environment, we get:

simon@Itaca ~/bigdata: hdfs dfs -cat /tmp/miatabella/part-m-00000 > test.txt
simon@Itaca ~/bigdata: vi test.txt 

"4";"Simon";"Usai";"42";"1976-04-27"^M"5";"Simon";"Usai";"42";"1976-04-27"^M"6";"Simon";"Usai";"42";"1976-04-27"^M