# tpcds

**Repository Path**: mirrors_Stratio/tpcds

## Basic Information

- **Project Name**: tpcds
- **Description**: TPC-DS benchmarks including data generation with Spark and queries with Spark
- **Primary Language**: Unknown
- **License**: Not specified
- **Default Branch**: master
- **Homepage**: None
- **GVP Project**: No

## Statistics

- **Stars**: 0
- **Forks**: 0
- **Created**: 2020-08-18
- **Last Updated**: 2026-02-28

## Categories & Tags

**Categories**: Uncategorized

**Tags**: None

## README

Usage:

To compile, invoke

	mvn clean package

For convenience, set TPCDS_WORKLOAD_GEN to the directory where this git repository is checked out, eg:

    export TPCDS_WORKLOAD_GEN=~/tpcds

To generate data with spark

	bin/spark-submit --class edu.brown.cs.systems.tpcds.spark.SparkTPCDSDataGenerator ${TPCDS_WORKLOAD_GEN}/target/spark-workloadgen-5.0-jar-with-dependencies.jar
	
To run:

	bin/spark-submit --class edu.brown.cs.systems.tpcds.spark.SparkTPCDSWorkloadGenerator ${TPCDS_WORKLOAD_GEN}/target/spark-workloadgen-5.0-jar-with-dependencies.jar

To configure the TPC-DS data set, there are a variety of configuration options.  Most of these are inherited from Databricks spark-sql-perf, which we use to generate the TPC-DS data.

The options of interest are as follows:

 - scaleFactor specifies the dataset size.  A scale factor of n generates approximately n GB of data.  Most data formats compress this quite effectively, so on disk the data will appear smaller (eg, Parquet or Orc can compress by a factor of approximately 4).
 - dataLocation specifics the location of the dataset.  Typically this will be in HDFS, and you can specify HDFS file locations as normal (eg, hdfs://<hostname>:<port>/<path>)
 - dataFormat specifies the format to store the data.  "parquet" and "orc" are good choices with high compression; "text" is also supported.

The full (default) configuration options are as follows:

	tpcds {
    	scaleFactor = 1
    	dataLocation = "hdfs://127.0.0.1:9000/tpcds"
    	dataFormat = "parquet"
    	overwrite = false
    	partitionTables = true
    	useDoubleForDecimal = false
    	clusterByPartitionColumns = false
    	filterOutNullPartitionValues = false
    	numPartitions = 1000
    	usePartitionColumns = false
    }
	
We have provided a couple of useful command line utilities, which are generated into the folder `target/appassembler/bin`:

 - list-queries lists the available queries.  It takes zero or one arguments; with zero arguments, it lists the available benchmarks; with 1 argument, it either lists a benchmark, or prints a query.  Queries are broken down into benchmarks.  Since multiple people have implemented variants of the original TPC-DS queries, we have included multiple of these variants here.  The impala-tpcds-modified-queries are a set of 20 selected queries that several work has used for benchmarking previously with Spark.
 - dsdgen is a wrapper around the dsdgen utility that TPC provides.  This package comes with precompiled dsdgen binaries for Linux and Mac, which we use for data generation.