Remind’s data engineering team strives to provide the whole company with access to the data they need, as big as 10 million daily events, and empower them to directly make decisions. They initially started with Redshift as its source of truth resource for data, and AWS S3 to optimize for cost. While S3 is used for long-term storage of historical data in JSON format, Redshift only stores the most valuable data, not older than 3 months. The company uses Interana to run custom queries on their JSON files on S3, but they’ve also recently started using AWS Athena, as a fully managed Presto system – to query both S3 and Redshift databases. The move for Athena also triggered a change in the data format – from JSON to Parquet, which they say was the hardest step in building up their data platform. An EMR/Hive system is responsible for doing the needed data transformations between S3 and Athena. In the data ingestion part of the story, Remind gathers data through their APIs from both mobile devices and personal computers, as the company business targets schools, parents, and students. This data is then passed to a streaming Kinesis Firehose system, before streaming it out to S3 and Redshift.
Remind’s future plans are probably focused on facilitating data format conversions using AWS Glue. This step would allow them to replace EMR/Hive from their architecture and use Spark SQL instead of Athena for diverse ETL tasks.