[SOLVED] Which would be a quicker (and better) tool for querying data stored in the Parquet format – Spark SQL, Athena or ElasticSearch?

Issue

I am currently building an ETL pipeline that outputs tables of data (on the order of 100+ GB) to a downstream interactive dashboard, which allows the data to be filtered dynamically (based on pre-defined and indexed filters).

I have zeroed in on PySpark / Spark for the initial ETL phase.
Next, this processed data will be summarised (simple counts, averages, etc.) and then visualised in the interactive dashboard.

For the interactive querying part, I was wondering which tool might work best with my structured, transactional data (stored in Parquet format):

  1. Spark SQL (in-memory dynamic querying)
  2. AWS Athena (serverless SQL querying, based on Presto)
  3. Elasticsearch (search engine)
  4. Redis (key-value DB)

Feel free to suggest alternative tools, if you know of a better option.

Solution

Based on the information you’ve provided, I am going to make several assumptions:

  1. You are on AWS (hence Elasticsearch and Athena being options). Therefore, I will steer you to AWS documentation.
  2. As you have pre-defined and indexed filters, you have well-ordered, structured data.

Going through the options listed:

  1. Spark SQL – If you are already considering Spark and are already on AWS, you can leverage Amazon EMR (Elastic MapReduce).
  2. AWS Athena (serverless SQL querying, based on Presto) – Athena is a powerful tool. It lets you query data stored on S3, which is quite cost-effective. However, building workflows in Athena can require a bit of work, as you’ll spend a lot of time managing files on S3. Historically, Athena could only produce CSV output, so it often worked best as the final stage in a big data pipeline. However, with support for CTAS (CREATE TABLE AS SELECT) statements, you can now output data in multiple formats, such as Parquet, with multiple compression algorithms.
  3. Elasticsearch (search engine) – It is not really a query tool, so it is likely not part of the core of this pipeline.
  4. Redis (key-value DB) – Redis is an in-memory key-value data store. It is generally used to provide small bits of information to be rapidly consumed by applications, in use cases such as caching and session management. Therefore, it does not seem to fit your use case. If you want some hands-on experience with Redis, I recommend Try Redis.
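
To make the CTAS point concrete, here is a minimal sketch of a helper that assembles an Athena CTAS statement writing a summarised table back to S3 as Parquet. The table names, column names, and bucket path are hypothetical placeholders, not anything from your pipeline. The `format`, `parquet_compression`, and `external_location` table properties are the ones I would expect to use here, but check them against the current Athena documentation before relying on them.

```python
def build_ctas_statement(target_table, source_table, group_columns,
                         measure_column, s3_location, compression="SNAPPY"):
    """Return an Athena CTAS statement that stores counts and averages as Parquet.

    All identifiers are interpolated as-is, so only pass trusted,
    hard-coded names (this sketch does no sanitising).
    """
    group_list = ", ".join(group_columns)
    return (
        f"CREATE TABLE {target_table}\n"
        "WITH (\n"
        "  format = 'PARQUET',\n"
        f"  parquet_compression = '{compression}',\n"
        f"  external_location = '{s3_location}'\n"
        ")\n"
        "AS\n"
        f"SELECT {group_list},\n"
        "       COUNT(*) AS row_count,\n"
        f"       AVG({measure_column}) AS avg_{measure_column}\n"
        f"FROM {source_table}\n"
        f"GROUP BY {group_list}"
    )


if __name__ == "__main__":
    # Hypothetical names, for illustration only.
    print(build_ctas_statement(
        target_table="orders_summary",
        source_table="orders",
        group_columns=["region", "order_date"],
        measure_column="amount",
        s3_location="s3://example-bucket/orders_summary/",
    ))
```

The resulting string can be pasted into the Athena console or submitted through whichever client you already use. Pre-aggregating like this keeps the dashboard querying a small summary table rather than the full 100+ GB.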

I would also look into Amazon Redshift.

For further reading, see Big Data Analytics Options on AWS.

As @Damien_The_Unbeliever recommended, there will be no substitute for your own prototyping and benchmarking.
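
For the benchmarking itself, a small timing harness is usually enough to start. The sketch below assumes you wrap each candidate (a Spark SQL query, an Athena call, and so on) in a zero-argument Python callable. The function name `benchmark` and the stand-in workload are my own placeholders, not part of any of these tools.

```python
import statistics
import time


def benchmark(run_query, repeats=5):
    """Call run_query() `repeats` times and return wall-clock stats in seconds."""
    timings = []
    for _ in range(repeats):
        start = time.perf_counter()
        run_query()
        timings.append(time.perf_counter() - start)
    return {
        "min": min(timings),
        "median": statistics.median(timings),
        "max": max(timings),
    }


if __name__ == "__main__":
    # Stand-in workload; replace with a callable that runs a real dashboard query.
    stats = benchmark(lambda: sum(range(100_000)), repeats=5)
    print(stats)
```

Run the same representative dashboard filters against each option and compare the medians. The first run is often slower (cold caches, cluster start-up), so it is worth looking at the minimum and maximum as well.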

Answered By – Zerodf

Answer Checked By – Willingham (BugsFixing Volunteer)
