hadoop - Hive creates multiple small files for each insert into HDFS
The following has been achieved:
- Kafka producer pulling data from Twitter using Spark Streaming.
- Kafka consumer ingesting the data into a Hive external table (on HDFS).
This is working fine so far. There is one issue I am facing: while the app inserts data into the Hive table, it creates a small file, with each row of data per file.
Below is the code:
// Define the topic to read
val topic = "topic_twitter"
val groupId = "group-1"
val consumer = KafkaConsumer(topic, groupId, "localhost:2181")

// Create the SparkContext
val sparkContext = new SparkContext("local[2]", "KafkaConsumer")

// Create the HiveContext
val hiveContext = new org.apache.spark.sql.hive.HiveContext(sparkContext)

hiveContext.sql("CREATE EXTERNAL TABLE IF NOT EXISTS twitter_data (tweetId BIGINT, tweetText STRING, userName STRING, tweetTimeStamp STRING, userLang STRING)")
hiveContext.sql("CREATE EXTERNAL TABLE IF NOT EXISTS demo (foo STRING)")

The Hive demo table is populated with one single record. The Kafka consumer loops through the data for topic = "topic_twitter", and for each row it processes, it populates the Hive table:
val hiveSql = "INSERT INTO TABLE twitter_data SELECT STACK(1, " + tweetId + "," + tweetText + "," + userName + "," + tweetTimeStamp + "," + userLang + ") FROM demo LIMIT 1"
hiveContext.sql(hiveSql)

[Screenshots of the Hadoop environment: HDFS file listings for the twitter_data and demo tables]
As you can see, none of the files is larger than 200 KB. Is there a way to merge these files into a single file?
[Take 2] OK, so you can't really "stream" data into Hive. But you can add a periodic compaction post-processing job...
- create your table with 3 partitions, e.g. (role='collectA'), (role='collectB'), (role='archive')
- point your Spark inserts to (role='activeA') (a sketch of the DDL and the redirected insert follows the SQL below)
- at some point, switch over to (role='activeB')
- then dump every record you have collected in the "A" partition into "archive", hoping that the Hive default config will do a decent job of limiting fragmentation
INSERT INTO TABLE twitter_data PARTITION (role='archive')
SELECT ... FROM twitter_data WHERE role='activeA';
TRUNCATE TABLE twitter_data PARTITION (role='activeA');

At that point, switch back to "A", etc.
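For illustration, here is a minimal Scala sketch of how the partitioned table and the redirected insert might look. It reuses the hiveContext and the tweet variables from the question; the partition column name role and the activeRole variable are assumptions for this sketch, not confirmed details:

// Assumed DDL: same columns as the question's table, plus a 'role' partition column
hiveContext.sql(
  "CREATE EXTERNAL TABLE IF NOT EXISTS twitter_data " +
  "(tweetId BIGINT, tweetText STRING, userName STRING, tweetTimeStamp STRING, userLang STRING) " +
  "PARTITIONED BY (role STRING)")

// Spark always writes into the currently active collection partition;
// the SELECT payload is the same as in the question, only the PARTITION clause is new
val activeRole = "activeA"   // switched to "activeB" when a compaction run starts
val insertSql = "INSERT INTO TABLE twitter_data PARTITION (role='" + activeRole + "') " +
  "SELECT STACK(1, " + tweetId + "," + tweetText + "," + userName + "," +
  tweetTimeStamp + "," + userLang + ") FROM demo LIMIT 1"
hiveContext.sql(insertSql)

The periodic compaction job then folds the idle partition into 'archive' with the INSERT / TRUNCATE pair shown above.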
One last word: if Hive still creates too many files on each compaction job, then try tweaking some parameters in your session, just before the INSERT, e.g.
set hive.merge.mapfiles = true;
set hive.merge.mapredfiles = true;
set hive.merge.smallfiles.avgsize = 1024000000;
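If the compaction runs from the same Spark application, these settings can presumably be applied through the HiveContext session before issuing the compaction INSERT. A sketch, assuming the hiveContext from the question (the exact effect of the merge settings depends on your Hive/Spark versions):

// Session-level hints asking Hive to merge small output files after the job
hiveContext.sql("SET hive.merge.mapfiles=true")
hiveContext.sql("SET hive.merge.mapredfiles=true")
// Target average output file size, here roughly 1 GB
hiveContext.sql("SET hive.merge.smallfiles.avgsize=1024000000")
// ...then run the compaction INSERT / TRUNCATE shown earlier in the same session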