# entropy
**Repository Path**: zhoub86/entropy
## Basic Information
- **Project Name**: entropy
- **Description**: Entropy estimator
- **Primary Language**: Unknown
- **License**: Not specified
- **Default Branch**: master
- **Homepage**: None
- **GVP Project**: No
## Statistics
- **Stars**: 0
- **Forks**: 0
- **Created**: 2021-03-18
- **Last Updated**: 2021-03-18
## Categories & Tags
**Categories**: Uncategorized
**Tags**: None
## README
Entropy estimators
=================
We provide implementation of our [rate-optimal estimators](http://ieeexplore.ieee.org/abstract/document/7444171/) and some classical estimators in both Python and C++.
This is a tutorial for both software users and developers who want to use the entropy estimators!
Table of contents
=================
* [Table of contents](#table-of-contents)
* [Python](#python)
* [C++](#c)
* [Reference](#reference)
Python
=====
This Python code works on Python3, and uses [Numpy](http://www.numpy.org), the fundamental package for scientific computing with Python.
Make sure Python3 and Numpy are properly installed before using this code.
* [Instruction for installing Python3](https://docs.python.org/3/using/index.html)
* [Instruction for installing Numpy](https://www.scipy.org/install.html)
Basic script
-------
Here is an example on how to use entropy estimators in Python:
```python
>>> from entropy import *
>>> entropy = Entropy(k=100000)
>>> fin = [[1,2910],[2,45]]
>>> entropy.estimate(fin)
15.794990467501666
```
The entropy estimate output ```15.794990``` is in bits.
* ```from entropy import *``` imports all functions from *entropy.py*;
* ```entropy = Entropy(k=100000)``` initializes an entropy estimator with alphabet size 100,000, an upper bound on the support size. A conservative upper bound works well, since the estimator is insensitive to overestimates. If no upper bound is available, the sample size is a reasonable alternative.
* ```fin = [[1,2910],[2,45]]``` is the fingerprint, represented by a list of pairs. This means 2910 symbols appeared exactly once, and 45 symbols appeared exactly twice.
* ```entropy.estimate(fin)``` produces the entropy estimate using our rate-optimal estimator. The unit of output is bits.
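For intuition about what the fingerprint encodes, the classical plug-in (empirical) estimate can be computed from it in a few lines. This is a sketch of the baseline estimator, not the repository's rate-optimal one:

```python
import math

def plug_in_from_fin(fin):
    """Plug-in (empirical) entropy in bits from a fingerprint.

    fin is a list of [j, c] pairs: c symbols appeared exactly j times.
    A sketch of the classical baseline, not the rate-optimal estimator.
    """
    n = sum(j * c for j, c in fin)        # total sample size
    ent = 0.0
    for j, c in fin:
        p = j / n                         # empirical probability of one such symbol
        ent += c * (-p * math.log2(p))    # c symbols share this frequency
    return ent

fin = [[1, 2910], [2, 45]]
print(plug_in_from_fin(fin))              # ≈ 11.52 bits
```

With only 3,000 samples over 100,000 symbols, the plug-in estimate (≈ 11.5 bits) falls far below the true entropy log2(100000) ≈ 16.6 bits; this severely undersampled regime is exactly where the rate-optimal estimator (which returned 15.79 above) is designed to help.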
Functions
------
### Samples statistics
The entropy estimator takes a fingerprint as input.
For raw data, we provide functions to compute this statistic:
* ```sample_to_fin(sample)``` returns fingerprints of samples
* input: a list of samples
* output: a list of tuples (fingerprints)
* ```hist_to_fin(hist)``` returns fingerprints from histogram (number of appearances/frequency counts)
* input: a list of frequencies
* output: a list of tuples (fingerprints)
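Based on the inputs and outputs listed above, these helpers presumably behave like the following sketch (the implementations are assumptions, not the library's code):

```python
from collections import Counter

def hist_to_fin(hist):
    """Fingerprint from a histogram (list of per-symbol frequencies)."""
    fin = Counter(hist)                       # frequency -> number of symbols
    return sorted([j, c] for j, c in fin.items())

def sample_to_fin(sample):
    """Fingerprint from raw samples: count each symbol, then count the counts."""
    return hist_to_fin(Counter(sample).values())

print(sample_to_fin(['a', 'b', 'b', 'c']))    # [[1, 2], [2, 1]]
```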
### Other estimators
We also implement classical entropy estimators; all outputs are in bits:
* ```entropy.estimate_plug(fin)```: plug-in estimator
* ```entropy.estimate_Miller_Madow(fin)```: Miller-Madow estimator
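As a reference point, the Miller-Madow estimator is the plug-in estimate plus a (S-1)/(2n) bias correction in nats (here converted to bits), where S is the number of distinct observed symbols. A minimal sketch; the library's exact output may differ slightly:

```python
import math

def miller_madow_from_fin(fin):
    """Miller-Madow entropy estimate in bits from a fingerprint.

    Sketch: plug-in estimate plus the (S-1)/(2n) correction, where
    S = number of distinct observed symbols and n = sample size.
    """
    n = sum(j * c for j, c in fin)            # sample size
    S = sum(c for _, c in fin)                # distinct observed symbols
    plug = sum(c * (-(j / n) * math.log2(j / n)) for j, c in fin)
    return plug + (S - 1) / (2 * n * math.log(2))   # correction, nats -> bits

fin = [[1, 2910], [2, 45]]
print(miller_madow_from_fin(fin))             # ≈ 12.23 bits
```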
Comprehensive script
---------
We provide ```main.py``` as an example script to tune parameters in our entropy estimator.
It can also be used directly as estimation software on data files!
Type ```python3 main.py -k 100000 -fin fin_sample.txt``` or ```python3 main.py -k 100000 -hist hist_sample.txt``` to experiment on the fingerprint "fin\_sample.txt" or the histogram "hist\_sample.txt", respectively; both are statistics of 3,000 samples drawn from the uniform distribution over 100,000 symbols.
### Program arguments
#### Main arguments:
* ```-k int```: set alphabet size.
* ```-fin str```: set fingerprint input file.
Each line of the file consists of two numbers: a frequency *j* and the number of symbols that appear exactly *j* times.
* ```-hist str```: set histogram input file.
Each line of the file consists of a single number: a frequency *j*. Symbol labels are not needed.
* *k* must be provided.
Either *fin* or *hist* must be provided.
If both *fin* and *hist* are provided, only *fin* will be read.
#### Optional arguments:
* ```-L int```: set polynomial degree. Default *L=1.6 log k*.
* ```-M float```: set the right endpoint of approximation interval. Default *M=3.5 log k*.
* ```-N int```: set the threshold to apply the polynomial estimator. Default *N=1.6 log k*.
* The parameters above can be combined, e.g., ```python3 main.py -k 100000 -fin fin_sample.txt -L 18 -M 40 -N 18```.
* ``` --help``` or ```-h```: see the list of arguments.
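A hypothetical sketch of how such a command line could be parsed with argparse; the actual *main.py* may be organized differently, and here the arguments are pre-filled instead of read from the shell:

```python
import argparse

# Hypothetical re-creation of main.py's command-line interface.
parser = argparse.ArgumentParser(description='Entropy estimation')
parser.add_argument('-k', type=int, required=True, help='alphabet size')
parser.add_argument('-fin', type=str, help='fingerprint input file')
parser.add_argument('-hist', type=str, help='histogram input file')
parser.add_argument('-L', type=int, help='polynomial degree')
parser.add_argument('-M', type=float, help='right endpoint of approximation interval')
parser.add_argument('-N', type=int, help='threshold for the polynomial estimator')

# Simulated command line: python3 main.py -k 100000 -fin fin_sample.txt
args = parser.parse_args(['-k', '100000', '-fin', 'fin_sample.txt'])
if args.fin is None and args.hist is None:
    parser.error('either -fin or -hist must be provided')
print(args.k, args.fin)
```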
C++
=====
Compile and run
---------
Check out all source code, including the Makefile and .txt files.
Type ```make``` to compile the sources; this produces the executable *"entropy"*.
Type ```./entropy -k=100000 -fin=fin_sample.txt``` or ```./entropy -k=100000 -hist=hist_sample.txt``` to experiment on the fingerprint "fin\_sample.txt" or the histogram "hist\_sample.txt", respectively; both are statistics of 3,000 samples drawn from the uniform distribution over 100,000 symbols.
### Program arguments
#### Main arguments:
* ```-k=number```: set alphabet size.
* ```-fin=filename```: set fingerprint input file.
Each line of the file consists of two numbers: a frequency *j* and the number of symbols that appear exactly *j* times.
* ```-hist=filename```: set histogram input file.
Each line of the file consists of a single number: a frequency *j*. Symbol labels are not needed.
* *k* must be provided.
Either *fin* or *hist* must be provided.
If both *fin* and *hist* are provided, only *fin* will be read.
#### Optional arguments:
* ```-L=number```: set polynomial degree. Default *L=1.6 log k*.
* ```-M=number```: set the right endpoint of approximation interval. Default *M=3.5 log k*.
* ```-N=number```: set the threshold to apply the polynomial estimator. Default *N=1.6 log k*.
* The parameters above can be combined, e.g., ```./entropy -k=100000 -fin=fin_sample.txt -L=18 -M=40 -N=18```.
* Type ```./entropy -help``` or ```./entropy -h``` to see the list of arguments.
## More for developers
For developers who want to write a new test script for the entropy estimator, follow the example *main_test.cpp*.
After running ```make``` you will also see the executable file *"test"*.
### Entropy estimator class: *Entropy*
Workflow:
1. Set alphabet size *k* and several parameters:
* **setDegree(int L)**: sets the polynomial degree to *L*.
* **setInterval(int M)**: sets the approximation interval to [0, M/n], where *n* is the sample size.
* **setThreshold(int N)**: applies the polynomial estimator to symbols whose frequency is at most *N*.
2. Input fingerprint or histogram. Use one of the following:
* **setFin( filename )**: each line of the file consists of two numbers: a frequency *j* and the number of symbols that appear exactly *j* times.
* **setFin( freq, count )**: input two vectors of the same length: *freq[i]* represents frequency, and *count[i]* counts the number of symbols that appear exactly *freq[i]* times.
* **setHist( filename )**: each line of the file consists of a single number: a frequency *j*. Symbol labels are not needed.
* **setHist( freq )**: input one vector with frequencies.
3. Output various estimates:
* **estimate()**: our polynomial estimator
* **estimate_plug()**: plug-in estimator
* **estimate\_Miller\_Madow()**: Miller-Madow estimator
### Synthetic data experiments
For those who want to generate synthetic samples within C++, you can refer to the standard [random number generator facilities](http://www.cplusplus.com/reference/random/).
There are examples to generate random numbers according to different types of distributions.
If the distribution is not yet provided, you can use the general [discrete distribution](http://www.cplusplus.com/reference/random/discrete_distribution/).
There are several ways to [construct discrete distributions](http://www.cplusplus.com/reference/random/discrete_distribution/discrete_distribution/).
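On the Python side, an analogous experiment can be sketched with the standard library: draw uniform samples and reduce them to a fingerprint. This generator is an illustration only, not part of the repository:

```python
import random
from collections import Counter

# Draw 3000 samples from the uniform distribution over 100000 symbols,
# mirroring how the sample files were generated (a sketch, not repo code).
random.seed(0)
k, n = 100000, 3000
sample = [random.randrange(k) for _ in range(n)]

hist = list(Counter(sample).values())     # frequency of each observed symbol
fin = sorted(Counter(hist).items())       # fingerprint: (frequency, count) pairs
print(fin[:3])
```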
Reference
===================
For detailed explanation of parameters, please refer to our paper [Minimax Rates of Entropy Estimation on Large Alphabets via Best Polynomial Approximation](http://ieeexplore.ieee.org/abstract/document/7444171/).
The parameters described in the paper are: *L=c0 log k, M=c1 log k, N=c2 log k*.
As in the paper, the default values are *L=1.6 log k, M=3.5 log k, N=1.6 log k*.
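Assuming natural logarithms and rounding down (an assumption consistent with the ```-L 18 -M 40 -N 18``` examples above), the defaults for *k* = 100,000 work out as follows:

```python
import math

# Default parameters for alphabet size k = 100000, assuming natural log
# and rounding down; reproduces the -L 18 -M 40 -N 18 example values.
k = 100000
L = int(1.6 * math.log(k))   # polynomial degree
M = int(3.5 * math.log(k))   # right endpoint of approximation interval
N = int(1.6 * math.log(k))   # threshold for the polynomial estimator
print(L, M, N)               # 18 40 18
```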