Below is a series of examples of loading files (from the local file system or HDFS) with the Spark 1.6 shell.
Note:
the test files used below live under /home/simon/resources/ (employees.csv, people.json, users.parquet, users.avro).
Loading a table over JDBC (note: in the Spark 1.6 shell the entry point is sqlContext; spark.read is the Spark 2.x equivalent). The JDBC driver jar must be on the shell's classpath, e.g. via --jars.

// PostgreSQL
val jdbcDF = sqlContext.read
  .format("jdbc")
  .option("url", "jdbc:postgresql:dbserver")
  .option("dbtable", "schema.tablename")
  .option("user", "username")
  .option("password", "password")
  .load()

// MySQL
val dataframe_mysql = sqlContext.read
  .format("jdbc")
  .option("url", "jdbc:mysql://sn2:3306/test")
  .option("driver", "com.mysql.jdbc.Driver")
  .option("dbtable", "emp")
  .option("user", "root")
  .option("password", "mapr")
  .load()
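Once loaded, a JDBC-backed DataFrame can be queried like any other. A minimal sketch, assuming the emp table from the MySQL example above; the empno column is hypothetical and used only for illustration:

// Register the DataFrame as a temp table and query it with SQL
// (registerTempTable is the Spark 1.6 API; Spark 2.x renamed it
// createOrReplaceTempView). "empno" is a hypothetical column.
dataframe_mysql.registerTempTable("emp")
val filtered = sqlContext.sql("SELECT * FROM emp WHERE empno > 7500")
filtered.show()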
val rdd = sc.textFile("/home/simon/resources/employees.csv")

scala> val csv = sqlContext.read.format("com.databricks.spark.csv").option("header", "false").option("delimiter", ",").load("/home/simon/resources/employees.csv")
csv: org.apache.spark.sql.DataFrame = [C0: string, C1: string, C2: string, C3: string, C4: string, C5: string]

scala> csv.show(1)
+-----+----------+------+-------+---+----------+
|   C0|        C1|    C2|     C3| C4|        C5|
+-----+----------+------+-------+---+----------+
|10001|1953-09-02|Georgi|Facello|  M|1986-06-26|
+-----+----------+------+-------+---+----------+
only showing top 1 row
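Because employees.csv has no header row, spark-csv falls back to generic string columns (C0..C5). A minimal sketch of supplying an explicit schema instead, so the columns get names and types; the column names (emp_no, birth_date, ...) are assumptions based on the sample row shown above:

import org.apache.spark.sql.types._

// Hypothetical schema matching the six fields of employees.csv
val employeesSchema = StructType(Seq(
  StructField("emp_no", IntegerType, nullable = false),
  StructField("birth_date", StringType, nullable = true),
  StructField("first_name", StringType, nullable = true),
  StructField("last_name", StringType, nullable = true),
  StructField("gender", StringType, nullable = true),
  StructField("hire_date", StringType, nullable = true)))

val csvTyped = sqlContext.read
  .format("com.databricks.spark.csv")
  .option("header", "false")
  .option("delimiter", ",")
  .schema(employeesSchema)
  .load("/home/simon/resources/employees.csv")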
scala> val json = sqlContext.read.json("/home/simon/resources/people.json")
json: org.apache.spark.sql.DataFrame = [age: bigint, name: string]

scala> json.show(1)
+----+-------+
| age|   name|
+----+-------+
|null|Michael|
+----+-------+
only showing top 1 row
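The write side is symmetric. A minimal sketch of filtering the DataFrame and saving it back as JSON; the output path /tmp/people_out is an arbitrary example:

// Drop rows with a null age and write the result out as JSON
// (each partition becomes a part-* file under the output directory)
json.filter("age IS NOT NULL").write.json("/tmp/people_out")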
scala> val parquet = sqlContext.read.parquet("/home/simon/resources/users.parquet")
parquet: org.apache.spark.sql.DataFrame = [name: string, favorite_color: string, favorite_numbers: array<int>]

scala> parquet.show(1)
18/07/13 15:11:05 WARN ParquetRecordReader: Can not initialize counter due to context is not a instance of TaskInputOutputContext, but is org.apache.hadoop.mapreduce.task.TaskAttemptContextImpl
+------+--------------+----------------+
|  name|favorite_color|favorite_numbers|
+------+--------------+----------------+
|Alyssa|          null|  [3, 9, 15, 20]|
+------+--------------+----------------+
only showing top 1 row
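Parquet can be written just as easily, which makes it handy for round-tripping data between formats. A minimal sketch; the output path is an arbitrary example:

// Project two columns and save them as a new Parquet directory
parquet.select("name", "favorite_color").write.parquet("/tmp/namesAndColors.parquet")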
scala> import com.databricks.spark.avro._
import com.databricks.spark.avro._

scala> val avro = sqlContext.read.format("com.databricks.spark.avro").load("/home/simon/resources/users.avro")
avro: org.apache.spark.sql.DataFrame = [name: string, favorite_color: string, favorite_numbers: array<int>]

scala> avro.show(1)
+------+--------------+----------------+
|  name|favorite_color|favorite_numbers|
+------+--------------+----------------+
|Alyssa|          null|  [3, 9, 15, 20]|
+------+--------------+----------------+
only showing top 1 row
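With the import above, spark-avro also adds avro(...) shorthands on the reader and writer (per the databricks/spark-avro README; this assumes the shell was started with the package on the classpath, e.g. --packages com.databricks:spark-avro_2.10:2.0.1). The output path is an arbitrary example:

// Read and write Avro via the implicit shorthand methods
val avro2 = sqlContext.read.avro("/home/simon/resources/users.avro")
avro2.write.avro("/tmp/users_out_avro")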