Big Data Local Environment

Local setup, on an Ubuntu 16.04 machine, of an environment for development testing; the installed applications are the ones covered in the sections below.

Prerequisite: install Java and configure the environment variables.

If something does not work, check the logs... and the ports: sudo netstat -tulpn | grep 22

Example of setting the environment variables:

alias ll='ls -lah'
alias gg='git status -s'
alias python=python3

export SBT_OPTS="-Xmx2G -XX:+UseConcMarkSweepGC -XX:+CMSClassUnloadingEnabled -XX:MaxPermSize=2G -Xss2M  -Duser.timezone=GMT"
#export JAVA_HOME=/home/simon/programmi/java/jdk1.8.0_171_64
export JAVA_HOME=/home/simon/programmi/java/openjdk-java-se-8u40-ri
export M2_HOME=/home/simon/programmi/apache-maven-3.6.1        
export M2=$M2_HOME/bin
export MAVEN_OPTS="-Xms256m -Xmx512m"
export SPARK_HOME=/home/simon/programmi/spark-2.4.3-bin-hadoop2.7

export HADOOP_HOME=/home/simon/programmi/hadoop-3.1.2
export SQOOP_HOME=/home/sqoop-1.4.7.bin__hadoop-2.6.0
export HADOOP_INSTALL=$HADOOP_HOME
export HADOOP_MAPRED_HOME=$HADOOP_HOME
export HADOOP_COMMON_HOME=$HADOOP_HOME
export HADOOP_HDFS_HOME=$HADOOP_HOME
export YARN_HOME=$HADOOP_HOME
export HADOOP_COMMON_LIB_NATIVE_DIR=$HADOOP_HOME/lib/native

export HIVE_HOME=/home/simon/programmi/apache-hive-3.1.2-bin
export HBASE_HOME=/home/simon/programmi/hbase-2.2.1


export PYENV_ROOT=$HOME/.pyenv
export PATH=$PYENV_ROOT/bin:$PATH

export PATH=$JAVA_HOME/bin:$M2:$HADOOP_HOME/sbin:$HADOOP_HOME/bin:$SPARK_HOME/bin:$SQOOP_HOME/bin:$HIVE_HOME/bin:$HBASE_HOME/bin:$PATH
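
After reloading the shell, a quick sanity check that everything is on the PATH; a minimal sketch, assuming the directories above actually exist on your machine:

source ~/.bashrc        # reload the environment
java -version           # the JDK pointed to by JAVA_HOME
mvn -v                  # Maven also prints the JAVA_HOME it uses
hadoop version
spark-submit --version
hbase version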

Zookeeper

As a first step I use Hbase's Zookeeper, which Hbase starts embedded on startup! But to have the client scripts available, Zookeeper has to be downloaded from https://zookeeper.apache.org/

Basic commands:

cd  /home/simon/programmi/apache-zookeeper-3.5.5-bin/bin
./zkCli.sh -server localhost:2181 ls /<hive.server2.zookeeper.namespace>
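
A minimal interactive session sketch (the /demo znode is just a made-up example):

./zkCli.sh -server localhost:2181
ls /                     # list the znodes under the root
create /demo "hello"     # create a znode with some data
get /demo                # read the data back
delete /demo             # clean up
quit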

Kafka

Kafka can be downloaded from https://kafka.apache.org/downloads

Kafka's configuration file is server.properties, where we set:

listeners = PLAINTEXT://localhost:9092
delete.topic.enable=true

The configuration file of the bundled Zookeeper is zookeeper.properties, where we set:

dataDir=/tmp/zookeeper
clientPort=2181

The basic Kafka commands (https://kafka.apache.org/quickstart), to be run from the Kafka installation folder, are:

bin/zookeeper-server-start.sh config/zookeeper.properties
bin/kafka-server-start.sh config/server.properties
bin/kafka-topics.sh --create --bootstrap-server localhost:9092 --replication-factor 1 --partitions 1 --topic test
bin/kafka-topics.sh --list --bootstrap-server localhost:9092
bin/kafka-console-producer.sh --broker-list localhost:9092 --topic test
bin/kafka-console-consumer.sh --bootstrap-server localhost:9092 --topic test --from-beginning
bin/kafka-topics.sh --describe --bootstrap-server localhost:9092 --topic test
bin/kafka-run-class.sh kafka.tools.GetOffsetShell --broker-list localhost:9092 --topic test --time -2   # --time -2 = earliest offset, --time -1 = latest
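
A minimal end-to-end smoke test on the test topic, assuming both the Zookeeper and the Kafka server above are already running:

echo "hello kafka" | bin/kafka-console-producer.sh --broker-list localhost:9092 --topic test
bin/kafka-console-consumer.sh --bootstrap-server localhost:9092 --topic test --from-beginning --max-messages 1 --timeout-ms 10000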

Notes on consumers: https://stackoverflow.com/questions/38024514/understanding-kafka-topics-and-partitions

Hbase

Download link: https://hbase.apache.org/

The configuration file is hbase-site.xml, where we set:

<?xml version="1.0"?>
<?xml-stylesheet type="text/xsl" href="configuration.xsl"?>

<configuration>
  <property>
    <name>hbase.rootdir</name>
    <value>file:///home/simon/programmi/hbase-2.2.1/data</value>
  </property>

  <property>
    <name>hbase.unsafe.stream.capability.enforce</name>
    <value>false</value>
  </property>

  <property>
    <name>hbase.zookeeper.property.dataDir</name>
    <value>/home/simon/programmi/hbase-2.2.1/zookeeper</value>
  </property>
</configuration>

Hbase starts with an embedded Zookeeper! From the bin folder:

./start-hbase.sh
./stop-hbase.sh
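
To verify that it came up, a quick check (jps ships with the JDK; 16010 is the default master web UI port in Hbase 2.x):

jps | grep HMaster                                      # the master process should appear
curl -s http://localhost:16010 >/dev/null && echo "master UI is up"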

The Hbase shell is started with the hbase shell command from the bin folder; the basic commands are:

status
version
table_help
whoami
create <tablename>, <columnfamilyname>
list
describe <tablename>
disable <tablename>
disable_all <"matching regex">
enable <tablename>
show_filters
drop <tablename>
drop_all <"regex">
count <'tablename'>, CACHE => 1000
put <'tablename'>, <'rowname'>, <'columnname'>, <'value'>
get <'tablename'>, <'rowname'>, {<additional parameters>}   // {TIMERANGE => [ts1, ts2]} {COLUMN => ['c1', 'c2', 'c3']}
delete <'tablename'>, <'rowname'>, <'columnname'>
truncate <tablename>
scan <'tablename'>, {optional parameters}
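
A concrete session sketch with the placeholders filled in (the table and column names are made up):

create 'test', 'cf'                     # table 'test' with one column family 'cf'
put 'test', 'row1', 'cf:a', 'value1'    # write a single cell
get 'test', 'row1'                      # read the row back
scan 'test', {LIMIT => 10}              # scan at most 10 rows
count 'test', CACHE => 1000
disable 'test'                          # a table must be disabled before dropping it
drop 'test'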

Hadoop

Download Hadoop (Hdfs + Yarn) from https://hadoop.apache.org/

Hadoop works over the network and, as a prerequisite, the user who launches the processes must be able to log in via ssh without a password.

The .ssh folder (with permissions drwxr-xr-x 2 simon simon 4096 set 24 14:14 .ssh/) must contain the following files; the permissions on all the files are crucial:

  • id_rsa the private key
  • id_rsa.pub the public key
  • authorized_keys a copy of the public key
sudo apt-get install openssh-server openssh-client
ssh-keygen -t rsa
cat ~/.ssh/id_rsa.pub >> ~/.ssh/authorized_keys
ssh localhost    # should log in without a password (sudo systemctl restart sshd)
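
If the passwordless login still asks for a password, the permissions are the usual culprit; a typical fix, assuming the default sshd StrictModes:

chmod 700 ~/.ssh                  # only the owner may enter the folder
chmod 600 ~/.ssh/id_rsa           # the private key must not be readable by others
chmod 644 ~/.ssh/id_rsa.pub ~/.ssh/authorized_keys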

In the etc/hadoop folder we configure the files below (the folders referenced inside the xml files must be created first; see the sketch after the configs):

core-site.xml

<?xml version="1.0" encoding="UTF-8"?>
<?xml-stylesheet type="text/xsl" href="configuration.xsl"?>
<configuration>
        <property>
        <name>fs.defaultFS</name>
        <value>hdfs://localhost:9000</value>
        </property>
        <property>
        <name>hadoop.tmp.dir</name>
        <value>/home/simon/programmi/hadoop-3.1.2/hadooptmpdata</value>
        </property>
</configuration>

hdfs-site.xml

<?xml version="1.0" encoding="UTF-8"?>
<?xml-stylesheet type="text/xsl" href="configuration.xsl"?>
<configuration>
        <property>
        <name>dfs.replication</name>
        <value>1</value>
        </property>
        <property>
        <name>dfs.name.dir</name>
        <value>file:///home/simon/programmi/hadoop-3.1.2/hdfs/namenode</value>
        </property>
        <property>
        <name>dfs.data.dir</name>
        <value>file:///home/simon/programmi/hadoop-3.1.2/hdfs/datanode</value>
        </property>
</configuration>

mapred-site.xml

<?xml version="1.0"?>
<?xml-stylesheet type="text/xsl" href="configuration.xsl"?>

<configuration>
<property>
<name>mapreduce.framework.name</name>
<value>yarn</value>
</property>
</configuration>

yarn-site.xml

<?xml version="1.0"?>
<configuration>

<!-- Site specific YARN configuration properties -->
<property>
<name>yarn.nodemanager.aux-services</name>
<value>mapreduce_shuffle</value>
</property>
</configuration>
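
The directories referenced by the configs above have to exist before formatting; a sketch of their creation, with the paths copied from the xml files:

mkdir -p /home/simon/programmi/hadoop-3.1.2/hadooptmpdata
mkdir -p /home/simon/programmi/hadoop-3.1.2/hdfs/namenode
mkdir -p /home/simon/programmi/hadoop-3.1.2/hdfs/datanode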

Initialize the namenode from the bin folder: hdfs namenode -format

From the sbin folder: ./start-all.sh

Namenode Web UI http://localhost:9870 (in Hadoop 3.x the default port moved from 50070 to 9870)
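
Once everything is started, a quick check that all five daemons are alive and Hdfs answers:

jps                 # expect NameNode, DataNode, SecondaryNameNode, ResourceManager, NodeManager
hdfs dfs -ls /      # a trivial Hdfs command against the new filesystem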