# tpcds

**Repository Path**: mirrors_Stratio/tpcds

## Basic Information

- **Project Name**: tpcds
- **Description**: TPC-DS benchmarks including data generation with Spark and queries with Spark
- **Primary Language**: Unknown
- **License**: Not specified
- **Default Branch**: master
- **Homepage**: None
- **GVP Project**: No

## Statistics

- **Stars**: 0
- **Forks**: 0
- **Created**: 2020-08-18
- **Last Updated**: 2026-02-28

## Categories & Tags

**Categories**: Uncategorized

**Tags**: None

## README

Usage:

To compile, invoke

    mvn clean package

For convenience, set `TPCDS_WORKLOAD_GEN` to the directory where this git repository is checked out, e.g.:

    export TPCDS_WORKLOAD_GEN=~/tpcds

To generate data with Spark:

    bin/spark-submit --class edu.brown.cs.systems.tpcds.spark.SparkTPCDSDataGenerator ${TPCDS_WORKLOAD_GEN}/target/spark-workloadgen-5.0-jar-with-dependencies.jar

To run the workload generator:

    bin/spark-submit --class edu.brown.cs.systems.tpcds.spark.SparkTPCDSWorkloadGenerator ${TPCDS_WORKLOAD_GEN}/target/spark-workloadgen-5.0-jar-with-dependencies.jar

The TPC-DS data set has a variety of configuration options. Most of these are inherited from Databricks spark-sql-perf, which we use to generate the TPC-DS data. The options of interest are as follows:

- `scaleFactor` specifies the dataset size. A scale factor of n generates approximately n GB of data. Most data formats compress this quite effectively, so the data will appear smaller on disk (e.g., Parquet or ORC can compress by a factor of approximately 4).
- `dataLocation` specifies the location of the dataset. Typically this will be in HDFS, and you can specify HDFS file locations as normal (e.g., hdfs://:/).
- `dataFormat` specifies the format in which to store the data. "parquet" and "orc" are good choices with high compression; "text" is also supported.

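
As a rough illustration of the sizing rule for `scaleFactor`, the following Python sketch estimates the raw and on-disk size of a data set. The helper name and the ratio table are assumptions made for illustration only: the factor of 4 for Parquet and ORC is the approximate figure quoted above, and a ratio of 1 for `text` simply assumes no compression. Real sizes depend on the data and the codec in use.

```python
# Hypothetical helper, not part of this repository.
# Ratios are assumptions: ~4x for Parquet/ORC (the approximate figure
# quoted in the README), 1x for uncompressed text.
ASSUMED_COMPRESSION = {"parquet": 4.0, "orc": 4.0, "text": 1.0}


def estimate_size_gb(scale_factor, data_format="parquet"):
    """Return (raw_gb, on_disk_gb) for a given TPC-DS scale factor."""
    if scale_factor <= 0:
        raise ValueError("scale_factor must be positive")
    ratio = ASSUMED_COMPRESSION[data_format]
    raw_gb = float(scale_factor)  # scale factor n is roughly n GB of raw data
    return raw_gb, raw_gb / ratio


if __name__ == "__main__":
    for sf in (1, 100, 1000):
        raw, disk = estimate_size_gb(sf, "parquet")
        print(f"scaleFactor={sf}: ~{raw:g} GB raw, ~{disk:g} GB on disk")
```

An estimate like this is only useful for capacity planning before a run; it says nothing about HDFS replication, which multiplies the on-disk footprint further.
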
The full (default) configuration options are as follows:

    tpcds {
      scaleFactor = 1
      dataLocation = "hdfs://127.0.0.1:9000/tpcds"
      dataFormat = "parquet"
      overwrite = false
      partitionTables = true
      useDoubleForDecimal = false
      clusterByPartitionColumns = false
      filterOutNullPartitionValues = false
      numPartitions = 1000
      usePartitionColumns = false
    }

We have provided a couple of useful command line utilities, which are generated into the folder `target/appassembler/bin`:

- `list-queries` lists the available queries. It takes zero or one arguments: with zero arguments, it lists the available benchmarks; with one argument, it either lists the queries in a benchmark or prints a query. Queries are grouped into benchmarks. Since multiple people have implemented variants of the original TPC-DS queries, we have included several of these variants here. The impala-tpcds-modified-queries are a set of 20 selected queries that several previous works have used for benchmarking with Spark.
- `dsdgen` is a wrapper around the dsdgen utility that TPC provides. This package comes with precompiled dsdgen binaries for Linux and Mac, which we use for data generation.
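
Returning to the `tpcds { ... }` configuration block, the Python sketch below renders such a block from the documented defaults with selected keys overridden. The helper `render_tpcds_config` is hypothetical and not part of this repository. How the generators actually load their configuration is not described here, so treat the output as text to place in whatever configuration file your setup uses.

```python
import json

# Hypothetical helper, not part of this repository.
# Defaults are copied from the documented default configuration.
DEFAULTS = {
    "scaleFactor": 1,
    "dataLocation": "hdfs://127.0.0.1:9000/tpcds",
    "dataFormat": "parquet",
    "overwrite": False,
    "partitionTables": True,
    "useDoubleForDecimal": False,
    "clusterByPartitionColumns": False,
    "filterOutNullPartitionValues": False,
    "numPartitions": 1000,
    "usePartitionColumns": False,
}


def _format_value(value):
    # bool must be checked before int, because bool is a subclass of int.
    if isinstance(value, bool):
        return "true" if value else "false"
    if isinstance(value, str):
        return json.dumps(value)  # double-quoted, with escaping
    return str(value)


def render_tpcds_config(**overrides):
    """Return the tpcds block as text, with the given keys overridden."""
    unknown = set(overrides) - set(DEFAULTS)
    if unknown:
        raise KeyError(f"unknown option(s): {sorted(unknown)}")
    merged = {**DEFAULTS, **overrides}
    lines = ["tpcds {"]
    lines += [f"  {key} = {_format_value(val)}" for key, val in merged.items()]
    lines.append("}")
    return "\n".join(lines)


if __name__ == "__main__":
    print(render_tpcds_config(scaleFactor=100, dataFormat="orc", overwrite=True))
```

Rejecting unknown keys is a deliberate choice in this sketch: a misspelled option name raises an error instead of being written out and silently ignored.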