If I have KBs of data every few seconds, is using Kinesis Firehose with a Lambda function to perform a transformation and using Redshift as the target necessarily better than just doing the same except with S3 instead of Kinesis? I know Kinesis is intended for real-time processing but is there actually a benefit to using it rather than just using S3 and having files dropped into S3 trigger a lambda function for processing and storing into Redshift? They seem equivalent other than that Kinesis is associated with real-time processing while S3 is not.
Question posted in Amazon Web Services
The official Amazon Web Services documentation can be found here.
3 Answers
Amazon Kinesis Data Firehose can combine streams of data into fewer, larger objects in Amazon S3 based on size or time. This makes it easier to store in S3 and load into Redshift.
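The size-or-time buffering is the core of what Firehose does. A minimal sketch of the idea in pure Python, where `sink` stands in for the S3 PUT and the thresholds mirror Firehose's configurable buffer hints (all names here are illustrative, not a real AWS API):

```python
import time

class Buffer:
    """Accumulate small records and flush them as one larger object,
    whichever of the size or time threshold is hit first -- an
    illustrative stand-in for Firehose's buffer-size/buffer-interval hints."""

    def __init__(self, max_bytes=5 * 1024 * 1024, max_seconds=300, sink=None):
        self.max_bytes = max_bytes      # Firehose lets you tune this (MBs)
        self.max_seconds = max_seconds  # and this (seconds)
        self.sink = sink or (lambda blob: None)  # where the S3 PUT would go
        self.records, self.size, self.started = [], 0, None

    def add(self, record: bytes):
        if self.started is None:
            self.started = time.monotonic()
        self.records.append(record)
        self.size += len(record)
        if (self.size >= self.max_bytes
                or time.monotonic() - self.started >= self.max_seconds):
            self.flush()

    def flush(self):
        if self.records:
            # One large object instead of many tiny ones
            self.sink(b"".join(self.records))
        self.records, self.size, self.started = [], 0, None

# Usage: two small records come out as a single combined object
flushed = []
buf = Buffer(max_bytes=10, max_seconds=9999, sink=flushed.append)
buf.add(b"kb-of-")
buf.add(b"data")  # hits the 10-byte threshold, so the buffer flushes
```

This is exactly the piece you would otherwise have to build and operate yourself in the S3-plus-Lambda design.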
Amazon Redshift performs poorly if you are continually using INSERT on a few rows of data, compared to using COPY on a larger set of data (which allows parallel loading too).

Amazon Kinesis Firehose will take care of the full process of receiving data through to inserting it into Redshift. If you want to do it yourself, you'll presumably trigger an AWS Lambda function for each object, and you'll need to write code to insert the data into Redshift and handle errors. It's really a matter of balancing cost against convenience.
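The difference matters because a single COPY pulls every object under an S3 prefix in parallel across the cluster's slices, while row-by-row INSERTs serialize through the leader node. A sketch of the two kinds of statement a hand-rolled loader might issue (the table, bucket, and IAM role names are made up):

```python
def copy_statement(table, s3_prefix, iam_role):
    """One COPY loads every object under the prefix, in parallel."""
    return (
        f"COPY {table} "
        f"FROM '{s3_prefix}' "
        f"IAM_ROLE '{iam_role}' "
        f"FORMAT AS JSON 'auto';"
    )

def insert_statements(table, rows):
    """The slow alternative: one INSERT per row through the leader node."""
    return [
        f"INSERT INTO {table} VALUES ({', '.join(map(repr, row))});"
        for row in rows
    ]

sql = copy_statement(
    "events",
    "s3://my-bucket/batch/",                       # hypothetical bucket
    "arn:aws:iam::123456789012:role/redshift-load" # hypothetical role
)
```

Firehose issues the COPY-style load for you; with the Lambda route you write, run, and error-handle this code yourself.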
The biggest benefit of Kinesis Firehose is the buffering functionality. This part bundles the incoming events into S3 files of reasonable size, because loading many small files into Redshift can be very inefficient. So your Lambda process would also need to bundle records into Redshift-loadable files of significant size.
Good question.
If you don't use Firehose, the real issues will be cost and performance.
Cost
S3 storage is cheap, but each PUT request also has a cost. With many small files your storage cost stays low, but you pay for far more PUT requests. So you would want to accumulate many records into bigger files before putting them into S3. Firehose does that for you; otherwise you need to write something yourself and run it somewhere.
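The PUT-request trade-off is easy to see with rough numbers. The $0.005 per 1,000 PUTs figure below is S3 Standard's published request price at the time of writing; verify against current pricing before relying on it:

```python
PUT_PRICE_PER_1000 = 0.005  # USD; S3 Standard PUT price -- check current pricing

def monthly_put_cost(objects_per_second: float) -> float:
    """Rough monthly S3 PUT-request cost for a given object rate."""
    puts_per_month = objects_per_second * 60 * 60 * 24 * 30
    return puts_per_month / 1000 * PUT_PRICE_PER_1000

# One small object every 2 seconds vs. one buffered object every 5 minutes:
unbuffered = monthly_put_cost(1 / 2)    # ~1.3M PUTs/month -> about $6.48
buffered   = monthly_put_cost(1 / 300)  # ~8,640 PUTs/month -> about $0.04
```

The dollar amounts are small at KB-scale traffic, but the same ratio (buffering cuts request count by two orders of magnitude here) holds as volume grows.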
Performance
Almost all data lakes perform better with less frequent, larger inserts, and Redshift is no different. So you would again want to combine all the small files into one big file and then load/insert it into Redshift. Firehose does this for you as well: it accumulates the small files and creates bigger ones.