# entropy **Repository Path**: zhoub86/entropy ## Basic Information - **Project Name**: entropy - **Description**: Entropy estimator - **Primary Language**: Unknown - **License**: Not specified - **Default Branch**: master - **Homepage**: None - **GVP Project**: No ## Statistics - **Stars**: 0 - **Forks**: 0 - **Created**: 2021-03-18 - **Last Updated**: 2021-03-18 ## Categories & Tags **Categories**: Uncategorized **Tags**: None ## README Entropy estimators ================= We provide implementation of our [rate-optimal estimators](http://ieeexplore.ieee.org/abstract/document/7444171/) and some classical estimators in both Python and C++. This is a tutorial for both software users and developers whoever want to use entropy estimators! Table of contents ================= * [Table of contents](#table-of-contents) * [Python](#python) * [C++](#c) * [References](#reference) Python ===== This Python code works on Python3, and uses [Numpy](http://www.numpy.org), the fundamental package for scientific computing with Python. Make sure Python3 and Numpy are properly installed before using this code. * [Instruction for installing Python3](https://docs.python.org/3/using/index.html) * [Instruction for installing Numpy](https://www.scipy.org/install.html) Basic script ------- Here is an example on how to use entropy estimators in Python: ```python >>> from entropy import * >>> entropy = Entropy(k=100000) >>> fin = [[1,2910],[2,45]] >>> entropy.estimate(fin) 15.794990467501666 ``` The entropy estimate output ```15.794990``` is in bits. * ```from entropy import *``` imports all functions from *entropy.py*; * ```entropy = Entropy(k=100000)``` initializes an entropy estimator with alphabet size 100,000, an upper bound on the support size. We can use a conservative upper bound and the estimator is insensitive to that. In the case no upper bound is unavailable, an alternative option is to use the sample size. * ```fin = [[1,2910],[2,45]]``` is the fingerprint, represented by a list of tuples. This means 2910 symobols appeared exactly once, and 45 symbols appeared exactly twice. * ```entropy.estimate(fin)``` produces the entropy estimate using our rate-optimal estimator. The unit of output is bits. Functions ------ ### Samples statistics The entropy estimator requires an input of fingerprint. In case of raw data, we have functions for this statistics: * ```sample_to_fin(sample)``` returns fingerprints of samples * input: a list of samples * output: a list of tuples (fingerprints) * ```hist_to_fin(hist)``` returns fingerprints from histogram (number of appearances/frequency counts) * input: a list of frequencies * output: a list of tuples (fingerprints) ### Other estimators We implemented classical entropy estimators, and outputs are all in bits: * ```entropy.estimate_plug(fin)```: plug-in estimator * ```entropy.estimate_Miller_Madow(fin)```: Miller-Madow estimator Comprehensive script --------- We provide ```main.py``` as an example script to tune parameters in our entropy estimator. It can also be used directly as a estimator software working on data files! Type ```python3 main.py -k 100000 -fin fin_sample.txt``` or ```python3 main.py -k 100000 -hist hist_sample.txt``` to experiment on the fingerprint "fin\_sample.txt" or histogram "hist\_sample.txt" respectively, which are both the statistic of 3,000 samples generated by the uniform distribution over 100,000 symbols. ### Program arguments #### Main arguments: * ```-k int```: set alphabet size. * ```-fin str```: set fingerprint input file. Each line of file consists of two numbers: frequency *j*, the number of symbols that appears exactly *j* times. * ```-hist str```: set histogram input file. Each line of file consists of only one number: frequency *j*. Symbols are not needed. * *k* must be provided. Either *fin* or *hist* must be provided. If both *fin* and *hist* are provided, only *fin* will be read. #### Optional arguments: * ```-L int```: set polynomial degree. Default *L=1.6 log k*. * ```-M float```: set the right endpoint of approximation interval. Default *M=3.5 log k*. * ```-N int```: set the threshold to apply the polynomial estimator. Default *N=1.6 log k*. * The parameters above can be combined, e.g., ```./entropy -k 100000 -fin fin_sample.txt -L 18 -M 40 -N 18```. * ``` --help``` or ```-h```: see the list of arguments. C++ ===== Compile and run --------- Check out all source code, including the Makefile and .txt files. Type ```make``` to compile the sources, you will get executable file *"entropy"*. Type ```./entropy -k=100000 -fin=fin_sample.txt``` or ```./entropy -k=100000 -hist=hist_sample.txt``` to experiment on the fingerprint "fin\_sample.txt" or histogram "hist\_sample.txt" respectively, which are both the statistic of 3,000 samples generated by the uniform distribution over 100,000 symbols. ### Program arguments #### Main arguments: * ```-k=number```: set alphabet size. * ```-fin=filename```: set fingerprint input file. Each line of file consists of two numbers: frequency *j*, the number of symbols that appears exactly *j* times. * ```-hist=filename```: set histogram input file. Each line of file consists of only one number: frequency *j*. Symbols are not needed. * *k* must be provided. Either *fin* or *hist* must be provided. If both *fin* and *hist* are provided, only *fin* will be read. #### Optional arguments: * ```-L=number```: set polynomial degree. Default *L=1.6 log k*. * ```-M=number```: set the right endpoint of approximation interval. Default *M=3.5 log k*. * ```-N=number```: set the threshold to apply the polynomial estimator. Default *N=1.6 log k*. * The parameters above can be combined, e.g., ```./entropy -k=100000 -fin=fin_sample.txt -L=18 -M=40 -N=18```. * Type ```./entropy -help``` or ```./entropy -h``` to see the list of arguments. ## More for developers For developer who want to write a new test scratch to test the entropy estimator, follow the example *main_test.cpp*. After ```make``` you will also see the executable file *"test"*. ### Entropy estimator class: *Entropy* Work flow: 1. Set alphabet size *k* and several parameters: * **setDegree(int L)**: the degree of polynomial is L. * **setInterval(int M)**: the approximation interval is [0,M/n], where n is the sample size. * **setThreshold(int N)**: the threshold to use polynomial estimator is when the histogram is at most N. 2. Input fingerprint or histogram. Use one of the following: * **setFin( filename )**: each line of file consists of two numbers: frequency *j*, the number of symbols that appears exactly *j* times. * **setFin( freq, count )**: input two vectors of the same length: *freq[i]* represents frequency, and *count[i]* counts the number of symbols that appear exactly *freq[i]* times. * **setHist( filename )**: each line of file consists of one number: frequency *j*. Symbols are not needed. * **setHist( freq )**: input one vector with frequencies. 3. Output various estimates: * **estimate()**: our polynomial estimator * **estimate_plug()**: plug-in estimator * **estimate\_Miller\_Madow()**: Miller-Madow estimator ### Synthetic data experiments For those who want to generate synthetic samples within C++, you can refer to the standard [random number generator facilities](http://www.cplusplus.com/reference/random/). There are examples to generate random numbers according to different types of distributions. If the distribution is not yet provided, you can use the general [discrete distribution](http://www.cplusplus.com/reference/random/discrete\_distribution/). There are several ways to [construct discrete distributions](http://www.cplusplus.com/reference/random/discrete\_distribution/discrete\_distribution/). Reference =================== For detailed explanation of parameters, please refer to our paper [Minimax Rates of Entropy Estimation on Large Alphabets via Best Polynomial Approximation](http://ieeexplore.ieee.org/abstract/document/7444171/). The parameters described in the paper are: *L=c0 log k, M=c1 log k, N=c2 log k*. As in the paper, the default values are *L=1.6 log k, M=3.5 log k, N=1.6 log k*.