Coursera is an education company that partners with the top universities and organizations in the world to offer online courses. They started building their data architecture somewhere around 2013, as both numbers of users and available courses increased. As of late 2017, Coursera provides courses to 27 million worldwide users.
Coursera collects data from their users through API calls coming from mobile and web apps, their production DBs, and logs gathered from monitoring. A backend service called “eventing” periodically uploads all received events to S3 and continuously publishes events to Kafka. The engineering team has selected Redshift a central warehouse, offering much lower operational cost when compared with Spark or Hadoop at the time.
On the analytics end, the engineering team created an internal web-based query page where people across the company can write SQL queries to the warehouse and get the needed information. Of course, there are analytics dashboard across the company which are refreshed on a daily basis. Finally, many decisions made in Coursera are based on machine learning algorithms, such as A/B testing, course recommendations, understanding student dropouts and others.