# wiki2vec

**Repository Path**: mirrors_alvations/wiki2vec

## Basic Information

- **Project Name**: wiki2vec
- **Description**: Generating Vectors for DBpedia Entities via Word2Vec and Wikipedia Dumps
- **Primary Language**: Unknown
- **License**: Not specified
- **Default Branch**: master
- **Homepage**: None
- **GVP Project**: No

## Statistics

- **Stars**: 0
- **Forks**: 0
- **Created**: 2020-09-24
- **Last Updated**: 2026-02-28

## Categories & Tags

**Categories**: Uncategorized

**Tags**: None

## README

# Wiki2Vec

Utilities for creating Word2Vec vectors for DBpedia entities from a Wikipedia dump.

Along with the release of [Word2Vec](http://code.google.com/p/word2vec/), the Google team published vectors for Freebase entities trained on Wikipedia. These vectors are useful for a variety of tasks. This tool lets you generate such vectors yourself. Instead of `mids`, entities are addressed via `DbpediaIds`, which correspond to Wikipedia article titles. Vectors are generated for (i) words appearing inside Wikipedia and (ii) topics, i.e. `dbpedia/Barack_Obama`.

## Quick usage:

- The automated script sets up and runs everything on Ubuntu 14.04. For other platforms, check `Going the long way`.
- Run `sudo sh prepare.sh Locale PathToOutputFolder`, i.e.:
  - `sudo sh prepare.sh es_ES /mnt/data/` will work on the Spanish Wikipedia
  - `sudo sh prepare.sh en_US /mnt/data/` will work on the English Wikipedia
  - `sudo sh prepare.sh da_DA /mnt/data/` will work on the Danish Wikipedia
- Running `prepare` will:
  - Download the latest Wikipedia dump for the given language
  - Clean the dump, stem it and tokenize it
  - Create a `language.corpus` file in `outputFolder`; this corpus can be fed to any word2vec tool to generate vectors.
- Once you have `language.corpus`, go to `resources/gensim` and run `wiki2vec.sh pathToCorpus pathToOutputFolder` followed by the word2vec parameters; this will install all required dependencies for Gensim and build word2vec vectors. i.e. `wiki2vec.sh corpus output 50 500 10` discards words with fewer than 50 occurrences, generates vectors of size 500, and uses a window of 10 words around each occurrence.

------

The `prepare.sh` script installs:

- Java 7
- Sbt
- Apache Spark

The `wiki2vec.sh` script installs:

- python-pip
- build-essential
- liblapack-dev
- gfortran
- zlib1g-dev
- python-dev
- cython
- numpy
- scipy
- gensim

## Going the long way

### Compile

- Get sbt
- Make sure `JAVA_HOME` is pointing to Java 7
- Run `sbt assembly`

### Readable Wikipedia

Wikipedia dumps are stored in XML format, which is difficult to process in parallel because the XML file has to be streamed and the articles extracted on the go. A readable Wikipedia dump is a transformation of the dump into a line-oriented format that is easy to pipeline into tools such as Spark or Hadoop.

Every line in a readable Wikipedia dump follows the format:

`Dbpedia Title` `\t` `Article's Text`

The class `org.idio.wikipedia.dumps.ReadableWiki` takes a `multistream-xml.bz2` Wikipedia dump and outputs a readable Wikipedia, i.e.:

`java -Xmx10G -Xms10G -cp wiki2vec-assembly-1.0.jar org.idio.wikipedia.dumps.ReadableWiki path-to-wiki-dump/eswiki-20150105-pages-articles-multistream.xml.bz2 pathToReadableWiki/eswiki-readable-20150105.lines`
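The readable dump can also be consumed directly from a scripting language. Below is a minimal sketch, assuming one article per line with the title and text separated by a tab (the separator shown above); the file name is just the example from the command above:

```python
# Minimal sketch: iterate over a readable-wiki dump, assuming one article per
# line with the DBpedia title and the article text separated by a tab.
def read_articles(path):
    with open(path, encoding="utf-8") as lines:
        for line in lines:
            title, _, text = line.rstrip("\n").partition("\t")
            yield title, text

# Example usage (file name taken from the example command above):
for title, text in read_articles("eswiki-readable-20150105.lines"):
    print(title, len(text.split()))
```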
### Word2Vec Corpus

Creates a tokenized corpus which can be fed into tools such as Gensim to create Word2Vec vectors for DBpedia entities.

- Every Wikipedia link to an article within the wiki is replaced by `DbpediaId/DbpediaIDToLink`, i.e. if an article's text contains:

  ```
  [[ Barack Obama | B.O ]] is the president of [[USA]]
  ```

  it is transformed into:

  ```
  DbpediaID/Barack_Obama B.O is the president of DbpediaID/USA
  ```

- Articles are tokenized (at the moment in a very naive way).

#### Getting a Word2Vec Corpus

1. Make sure you have a `Readable Wikipedia`
2. Download Spark: http://d3kbcqa49mib13.cloudfront.net/spark-1.2.0-bin-hadoop2.4.tgz
3. In your Spark folder do:

   ```
   bin/spark-submit --class "org.idio.wikipedia.word2vec.Word2VecCorpus" target/scala-2.10/wiki2vec-assembly-1.0.jar /PathToYourReadableWiki/readableWiki.lines /Path/To/RedirectsFile /PathToOut/Word2vecReadyWikipediaCorpus
   ```

4. Feed your corpus to a word2vec tool (see the Gensim sketch at the end of this README)

### Stemming

By default the word2vec corpus is always stemmed. If you don't want that to happen:

#### If using the automated scripts

Pass `None` as an extra argument: `sudo sh prepare.sh es_ES /mnt/data/ None` will work on the Spanish Wikipedia and won't stem words.

#### If you are manually running the tools

Pass `None` as an extra argument when calling Spark:

```
bin/spark-submit --class "org.idio.wikipedia.word2vec.Word2VecCorpus" target/scala-2.10/wiki2vec-assembly-1.0.jar /PathToYourReadableWiki/readableWiki.lines /Path/To/RedirectsFile /PathToOut/Word2vecReadyWikipediaCorpus None
```

## Word2Vec tools:

- [Gensim](https://radimrehurek.com/gensim/)
- [DeepLearning4j](https://github.com/SkymindIO/deeplearning4j): Feb 2014, gets stuck in infinite loops on a big corpus
- [Spark's word2vec](https://github.com/apache/spark/blob/master/mllib/src/main/scala/org/apache/spark/mllib/feature/Word2Vec.scala): Feb 2014, `number of dimensions` * `vocabulary size` has to be less than a certain value, otherwise an exception is thrown. [issue](http://mail-archives.apache.org/mod_mbox/spark-issues/201412.mbox/%3CJIRA.12761684.1418621192000.36769.1418759475999@Atlassian.JIRA%3E)

## ToDo:

- Remove hard-coded Spark params
- Handle Wikipedia redirections
- Intra-article co-reference resolution
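## Example: feeding the corpus to Gensim

As a rough illustration of the "feed your corpus to a word2vec tool" step, here is a minimal sketch of training vectors on the generated corpus with Gensim directly. This is not the `resources/gensim` script: the corpus file name is hypothetical, and the parameters mirror the quick-usage example (min count 50, vector size 500, window 10).

```python
# Minimal sketch: train word2vec on the generated corpus with Gensim.
# Note: the keyword argument is `vector_size` in gensim >= 4.0 (`size` in older releases).
from gensim.models import Word2Vec
from gensim.models.word2vec import LineSentence

corpus = LineSentence("en.corpus")  # one pre-tokenized article per line (hypothetical file name)
model = Word2Vec(corpus, vector_size=500, window=10, min_count=50, workers=4)
model.save("en.word2vec.model")

# Entity vectors are looked up by their DbpediaId token, e.g.:
print(model.wv.most_similar("DbpediaID/Barack_Obama", topn=5))
```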