hadoop - Hive creates multiple small files for each insert in HDFS

The following has been achieved:

  1. A Kafka producer pulling data from Twitter using Spark Streaming.
  2. A Kafka consumer ingesting the data into a Hive external table (on HDFS).

This is working fine so far. There is only one issue I am facing: when my app inserts data into the Hive table, it creates a small file for each row, with one row of data per file.

Below is the code:

    import org.apache.spark.SparkContext

    // Define the topic to read from
    val topic = "topic_twitter"
    val groupid = "group-1"
    val consumer = kafkaconsumer(topic, groupid, "localhost:2181")

    // Create the SparkContext
    val sparkcontext = new SparkContext("local[2]", "kafkaconsumer")

    // Create the HiveContext
    val hivecontext = new org.apache.spark.sql.hive.HiveContext(sparkcontext)

    hivecontext.sql("create external table if not exists twitter_data (tweetid bigint, tweettext string, username string, tweettimestamp string, userlang string)")
    hivecontext.sql("create external table if not exists demo (foo string)")

The Hive demo table is already populated with one single record. The Kafka consumer loops through the data for topic = "topic_twitter", processes each row, and populates the Hive table:

    // One insert statement is built and executed per consumed row
    val hivesql = "insert into table twitter_data select stack(1," +
      tweetid        + "," +
      tweettext      + "," +
      username       + "," +
      tweettimestamp + "," +
      userlang       + ") from demo limit 1"

    hivecontext.sql(hivesql)

Below are images from the Hadoop environment, showing the twitter_data and demo Hive tables in HDFS.

The last 10 files created in HDFS: [image]

As you can see, the file size is not more than 200 KB. Is there a way to merge these files into one file?

[take 2] OK, so you can't really "stream" data into Hive. But you can add a periodic compaction post-processing job...

  • create your table with 3 partitions, e.g. (role='activea'), (role='activeb'), (role='archive')
  • point your Spark inserts to (role='activea')
  • at some point, switch to (role='activeb')
  • then dump every record that you have collected in the "a" partition into "archive", hoping that the Hive default config will do a good job of limiting fragmentation

    insert into table twitter_data partition (role='archive') select ... from twitter_data where role='activea' ;
    truncate table twitter_data partition (role='activea') ;

  • at some point, switch back to "a", etc.
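
The rotation in the list above can be sketched in Scala as follows. This is a minimal sketch, not the answerer's code: the object name PartitionRotation, the functions other and compactionSql, and the explicit column list are hypothetical names introduced for illustration. Only the table name, the role partition values, and the two statements come from the post. It only builds the HiveQL strings, so it runs without a cluster.

```scala
// Hypothetical helper, not from the original post: it only builds the HiveQL
// strings for one compaction round and does not talk to Hive itself.
object PartitionRotation {

  // The writer alternates between these two collecting partitions.
  def other(current: String): String = current match {
    case "activea" => "activeb"
    case "activeb" => "activea"
    case unknown   => throw new IllegalArgumentException(s"unknown partition: $unknown")
  }

  // Statements that copy the idle partition into 'archive' and then empty it.
  // `columns` must list the data columns only, not the partition column `role`.
  def compactionSql(table: String, idle: String, columns: Seq[String]): Seq[String] = {
    val cols = columns.mkString(", ")
    Seq(
      s"insert into table $table partition (role='archive') " +
        s"select $cols from $table where role='$idle'",
      s"truncate table $table partition (role='$idle')"
    )
  }
}
```

A driver would first point the Spark inserts at PartitionRotation.other(current), and only then pass each returned statement to hivecontext.sql for the partition that has just become idle.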

One last word: if Hive still creates too many files on each compaction job, then try tweaking some parameters in your session, just before the insert, e.g.

    set hive.merge.mapfiles=true;
    set hive.merge.mapredfiles=true;
    set hive.merge.smallfiles.avgsize=1024000000;
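
Since the inserts in the question go through hivecontext.sql, the same settings could in principle be issued from the Scala application just before the compaction insert. The sketch below is only an illustration under assumptions: MergeSettings and setStatements are hypothetical names, and the post does not confirm that Spark's own execution path honors these Hive merge properties (they are described for Hive map-only and map-reduce jobs), so the compaction may need to run through Hive itself.

```scala
// Hypothetical helper, not from the original post: turns the three merge
// properties quoted above into "set key=value" statements.
object MergeSettings {

  // Property names and values exactly as given in the answer.
  val settings: Seq[(String, String)] = Seq(
    "hive.merge.mapfiles"           -> "true",
    "hive.merge.mapredfiles"        -> "true",
    "hive.merge.smallfiles.avgsize" -> "1024000000"
  )

  // One statement per property, in the order listed above.
  def setStatements: Seq[String] =
    settings.map { case (key, value) => s"set $key=$value" }
}
```

Each returned string would be passed to hivecontext.sql in the same session, before the insert into the archive partition.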

