I like programming and anime.

I manage the bot /u/mahoro@lemmy.ml

  • 1 Post
  • 3 Comments
Joined 1 year ago
Cake day: June 12th, 2023


  • Do you use it? When?

    Parquet is mainly used for big-data batch processing. It’s a columnar file format optimized for large aggregation queries. It’s not human-readable, so you need a library like Apache Arrow to read or write it.

    I would use parquet in the following circumstances (or combination of circumstances):

    • The data is very large
    • I’m integrating this into an analytical query engine (Presto, etc.)
    • I’m transporting data that needs to land in an analytical data warehouse (Snowflake, BigQuery, etc.)
    • The data is consumed by data scientists, machine learning engineers, or other data engineers

    Since the data is stored column by column, a query like select sum(sales) from revenue only has to read the sales column instead of every full row, so it is much cheaper and faster on Parquet than on CSV.
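
    To see why the layout matters, here is a toy model in plain Python. It is not how Parquet is actually implemented (real files add row groups, encodings, and compression), and the values are invented. It only counts how many fields each layout has to touch to compute sum(sales).

```python
# Toy model only: real Parquet adds row groups, encoding, and compression.
# Column names mirror the query in the text; the values are made up.
rows = [
    ("2024-01-01", "north", 120.0),
    ("2024-01-02", "south", 80.5),
    ("2024-01-03", "north", 42.25),
]

# Row-oriented layout (CSV-like): each full row is read to reach one field.
row_fields_read = 0
row_total = 0.0
for row in rows:
    row_fields_read += len(row)  # date, region, and sales are all read
    row_total += row[2]          # only sales is actually needed

# Column-oriented layout (Parquet-like): each column is stored together.
columns = {
    "date": [r[0] for r in rows],
    "region": [r[1] for r in rows],
    "sales": [r[2] for r in rows],
}
col_fields_read = len(columns["sales"])  # only the sales column is read
col_total = sum(columns["sales"])

print(row_fields_read, col_fields_read)
```

    Both layouts give the same total, but the row layout reads every field of every row while the column layout reads only the sales values. The gap widens as the table gets more columns.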

    The big advantage of CSV is portability. As a file format, CSV has been around forever, so it is used in a lot of places where Parquet can’t be.