Querying DynamoDB data with Athena
Which Category is your question related to? Custom
What AWS Services are you utilizing? API, Lambda, Auth
Provide additional details e.g. code snippets
I have a requirement to run complex analytics over the data we are storing in DynamoDB: specifically, joining data across Amplify-generated tables in a multi-tenant environment to find the top-performing “factors” per account, per team and per user (among other, more complex requirements). The specifics of how the data connects in DynamoDB aren’t really important here; what I am struggling with is the best way to get my DynamoDB data into place for querying with Athena (and likely, in the future, with QuickSight and automated analysis).
There are guides for how to provide the results to the user via AppSync (https://aws.amazon.com/blogs/mobile/visualizing-big-data-with-aws-appsync-amazon-athena-and-aws-amplify/), but I can’t seem to find much to help with getting my data to S3 in the first place.
So this brings me to the question(s): what method would be best, how would I go about doing it, and how should I format the data in S3? The options I am considering are the following:
- DynamoDB Stream (@model backed) => Lambda => S3 (sketched below)
- DynamoDB Stream (@model backed) => Lambda => Firehose => S3
- Glue ETL => S3
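To make the first option concrete, here is a minimal sketch of a stream-triggered Lambda that flattens new item images into newline-delimited JSON under a date-partitioned S3 prefix, a layout Athena can query directly. The bucket, environment variable and prefix layout are illustrative assumptions, not anything Amplify generates:

```typescript
// Minimal sketch: DynamoDB Stream (@model backed) => Lambda => S3.
// ANALYTICS_BUCKET is an assumed environment variable, not an Amplify default.
import { S3Client, PutObjectCommand } from "@aws-sdk/client-s3";
import { unmarshall } from "@aws-sdk/util-dynamodb";
import type { DynamoDBStreamEvent } from "aws-lambda";

const s3 = new S3Client({});
const BUCKET = process.env.ANALYTICS_BUCKET!;

export const handler = async (event: DynamoDBStreamEvent): Promise<void> => {
  // Convert each INSERT/MODIFY image from DynamoDB JSON to plain JSON rows.
  const rows = event.Records
    .filter((r) => r.dynamodb?.NewImage)
    .map((r) => JSON.stringify(unmarshall(r.dynamodb!.NewImage as any)));

  if (rows.length === 0) return;

  // dt=YYYY-MM-DD prefixes let Athena prune partitions by date.
  const dt = new Date().toISOString().slice(0, 10);
  const key = `raw/dt=${dt}/${event.Records[0].eventID}.json`;

  await s3.send(
    new PutObjectCommand({
      Bucket: BUCKET,
      Key: key,
      Body: rows.join("\n"),
      ContentType: "application/x-ndjson",
    })
  );
};
```

Keying the object on the batch’s first eventID means a retried batch overwrites its earlier write rather than duplicating it.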
Has anyone else gone through a similar scenario? Are there any docs out there that I’ve missed? I have reached the edge of my experience in getting to this point, so before I embark on another learning curve, I thought it would be best to get some advice.
Thanks in advance!
Just as a note, I am not interested in storing pre-calculated metrics at this point as a lot of the analysis will be exploratory at first. So doing calculations in a Lambda resolver or storing post-calculated metrics off the back of a Dynamo stream is a no-no for us right now.
@jonperryxlm Apologies for the late response. We don’t support ETL solutions out of the box today with the Amplify CLI; this is an interesting use case and I’ll mark it as a feature request for our team to consider. Having said that, we do support the first half of your ask, the DynamoDB Stream (@model backed) => Lambda integration, and in the Lambda you can then perform your desired ETL operation, publishing the results either directly to S3 for further analysis or through Firehose => S3. For managing that infrastructure within the CLI itself, have you considered using custom stacks? See https://docs.amplify.aws/cli/usage/customcf. Also, please let us know if you run into any issues with the DynamoDB Stream (@model backed) => Lambda integration; you can find more info about it here: https://docs.amplify.aws/cli/usage/lambda-triggers#dynamodb-lambda-triggers
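For the Firehose variant mentioned above, the same handler can hand records to a Kinesis Data Firehose delivery stream and let Firehose handle the buffering and S3 writes. A hedged sketch; the DELIVERY_STREAM_NAME environment variable is an assumption:

```typescript
// Sketch of DynamoDB Stream (@model backed) => Lambda => Firehose => S3.
// DELIVERY_STREAM_NAME is an assumed environment variable.
import { FirehoseClient, PutRecordBatchCommand } from "@aws-sdk/client-firehose";
import { unmarshall } from "@aws-sdk/util-dynamodb";
import type { DynamoDBStreamEvent } from "aws-lambda";

const firehose = new FirehoseClient({});
const STREAM = process.env.DELIVERY_STREAM_NAME!;

export const handler = async (event: DynamoDBStreamEvent): Promise<void> => {
  const records = event.Records
    .filter((r) => r.dynamodb?.NewImage)
    .map((r) => ({
      // Firehose concatenates record payloads, so terminate each one with a
      // newline to keep the delivered S3 objects valid NDJSON for Athena.
      Data: Buffer.from(
        JSON.stringify(unmarshall(r.dynamodb!.NewImage as any)) + "\n"
      ),
    }));

  if (records.length === 0) return;

  // PutRecordBatch accepts up to 500 records per call; typical stream batch
  // sizes fit in one call, otherwise chunk the array first.
  await firehose.send(
    new PutRecordBatchCommand({ DeliveryStreamName: STREAM, Records: records })
  );
};
```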
@houmark How did you solve the duplicate data problem? If I run the job more than once, the data doubles.
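One common way to stop re-runs from doubling the data is to make the export idempotent: derive each S3 key deterministically from the item’s own primary key, so a second run overwrites the previous object instead of appending a new one. A minimal sketch, assuming each @model item exposes an `id` field:

```typescript
// Idempotent write sketch: a stable, id-derived key means re-running the
// export overwrites the earlier copy rather than creating a duplicate row.
// Assumes the item has an "id" field, as Amplify @model tables do.
import { S3Client, PutObjectCommand } from "@aws-sdk/client-s3";

const s3 = new S3Client({});

export async function exportItem(
  bucket: string,
  item: { id: string; [key: string]: unknown }
): Promise<void> {
  await s3.send(
    new PutObjectCommand({
      Bucket: bucket,
      Key: `raw/items/${item.id}.json`, // stable key => overwrite, not append
      Body: JSON.stringify(item),
      ContentType: "application/json",
    })
  );
}
```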