hadoop - Hive creates multiple small files for each insert in HDFS
The following has been achieved so far:
- a Kafka producer pulling data from Twitter using Spark Streaming.
- a Kafka consumer ingesting that data into a Hive external table (on HDFS).

This has been working fine so far. There is one issue I am facing: whenever the app inserts data into the Hive table, it creates a small file, one file per row of data.
Below is the code:
// define the topics to read
val topic = "topic_twitter"
val groupId = "group-1"
val consumer = KafkaConsumer(topic, groupId, "localhost:2181")

// create SparkContext
val sparkContext = new SparkContext("local[2]", "KafkaConsumer")

// create HiveContext
val hiveContext = new org.apache.spark.sql.hive.HiveContext(sparkContext)

hiveContext.sql("CREATE EXTERNAL TABLE IF NOT EXISTS twitter_data (tweetId BIGINT, tweetText STRING, userName STRING, tweetTimeStamp STRING, userLang STRING)")
hiveContext.sql("CREATE EXTERNAL TABLE IF NOT EXISTS demo (foo STRING)")
The Hive demo table is populated with one single record. The Kafka consumer loops through the data for topic = "topic_twitter", processes each row, and populates the Hive table:
val hiveSql = "INSERT INTO TABLE twitter_data SELECT STACK(1, " +
  tweetId + ", " + tweetText + ", " + userName + ", " +
  tweetTimeStamp + ", " + userLang + ") FROM demo LIMIT 1"
hiveContext.sql(hiveSql)
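For context, that per-row INSERT sits inside the consumer loop, so every Kafka message triggers its own Hive job and therefore its own output file. A minimal sketch of that shape (the messages collection and the parseTweet helper are hypothetical stand-ins for the actual consumer code):

// hypothetical shape of the consumer loop: one INSERT per message means one small file per tweet
case class Tweet(tweetId: Long, tweetText: String, userName: String,
                 tweetTimeStamp: String, userLang: String)

def parseTweet(message: String): Tweet = ??? // parsing of the Kafka payload, omitted here

for (message <- messages) { // messages: the records consumed from "topic_twitter"
  val t = parseTweet(message)
  val hiveSql = "INSERT INTO TABLE twitter_data SELECT STACK(1, " +
    t.tweetId + ", '" + t.tweetText + "', '" + t.userName + "', '" +
    t.tweetTimeStamp + "', '" + t.userLang + "') FROM demo LIMIT 1"
  hiveContext.sql(hiveSql) // each call runs a separate job and writes a new small file
}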
Below are screenshots of the Hadoop environment showing the files created for the twitter_data and demo tables.
As you can see, the file sizes are no more than 200 KB each. Is there a way to merge these files into one file?
[take 2] OK, so you can't really "stream" data into Hive. But you can add a periodic compaction post-processing job...
- create your table with 3 partitions, e.g.

(role='activea')
,(role='activeb')
,(role='archive')

- point your Spark inserts to

(role='activea')

- at some point, switch over to

(role='activeb')

then dump every record you have collected in the "a" partition into "archive", hoping that the Hive default config will do a decent job of limiting the fragmentation:

INSERT INTO TABLE twitter_data PARTITION (role='archive')
SELECT ... FROM twitter_data WHERE role='activea'
;
TRUNCATE TABLE twitter_data PARTITION (role='activea')
;

- at some point, switch back to "a", etc.
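Driven from the same Spark application, one compaction pass could look roughly like the sketch below. This assumes twitter_data is partitioned by role; the explicit column list stands in for the ... placeholder above, and TRUNCATE ... PARTITION only works on a managed (non-external) table, otherwise the partition directory has to be cleared on HDFS instead.

// one compaction pass, assuming the writer has already been switched from 'activea' to 'activeb'
// 1) rewrite everything collected under 'activea' into the archive partition
hiveContext.sql(
  "INSERT INTO TABLE twitter_data PARTITION (role='archive') " +
  "SELECT tweetId, tweetText, userName, tweetTimeStamp, userLang " +
  "FROM twitter_data WHERE role='activea'")
// 2) empty the 'activea' partition so it is ready for the next switch-over
hiveContext.sql("TRUNCATE TABLE twitter_data PARTITION (role='activea')")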
One last word: if Hive still creates too many files on each compaction job, then try tweaking some parameters in your session, just before the INSERT, e.g.
set hive.merge.mapfiles = true;
set hive.merge.mapredfiles = true;
set hive.merge.smallfiles.avgsize = 1024000000;
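If the compaction INSERT is issued through the Spark HiveContext as sketched above, one way to try this is to set the same properties on that session first (a sketch only; whether Spark's insert path honours the hive.merge.* settings depends on the Spark and Hive versions in use):

// apply the merge settings on the current Hive session before running the compaction INSERT
hiveContext.sql("SET hive.merge.mapfiles=true")
hiveContext.sql("SET hive.merge.mapredfiles=true")
hiveContext.sql("SET hive.merge.smallfiles.avgsize=1024000000")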