In our case, the main container is `theme-default-content` and the selectors for the titles and sub-titles are `h1`, `h2`...

💡 _To better understand the selectors, go to [this section](#more-about-the-selectors)._

🔨 _There are many other fields you can set in the config file to adapt the scraper to your needs. Check out [this section](#all-the-config-file-settings)._

### Run the Scraper

#### From Source Code

This project supports Python 3.8. The [`pipenv` command](https://pipenv.readthedocs.io/en/latest/install/#installing-pipenv) must be installed.

Set both environment variables `MEILISEARCH_HOST_URL` and `MEILISEARCH_API_KEY`.
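As a minimal sketch, using the example values from the [first step](#run-your-meilisearch-instance), you could export them in your shell before running the scraper:

```shell
# Point the scraper at your MeiliSearch instance (example values)
export MEILISEARCH_HOST_URL=http://localhost:7700
# In production, prefer the private key over the master key
export MEILISEARCH_API_KEY=myMasterKey
```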
Following on from the example in the [first step](#run-your-meilisearch-instance), they are respectively `http://localhost:7700` and `myMasterKey`.

Then, run:

```bash
$ pipenv install
$ pipenv run ./docs_scraper
```
In a production environment, we recommend providing the private key instead of the master key, as it is safer and has enough permissions to perform such requests.

_More about [MeiliSearch authentication](https://docs.meilisearch.com/guides/advanced_guides/authentication.html)._

## 🖌 And for the front-end search bar?

After having scraped your documentation, you might need a search bar to improve your user experience!

On the front-end side:

- If your website is a VuePress application, check out the [vuepress-plugin-meilisearch](https://github.com/meilisearch/vuepress-plugin-meilisearch) repository.
- For all kinds of documentation, check out the [docs-searchbar.js](https://github.com/meilisearch/docs-searchbar.js) library.

**Both of these libraries provide a front-end search bar perfectly adapted for documentation.**

## 🛠 More Configurations

### More About the Selectors

#### Bases

Put simply, selectors are needed to tell the scraper "I want to get the content in this HTML tag".
The way you target this HTML tag is called a **selector**. A selector can be:

- a class (e.g. `.main-content`)
- an id (e.g. `#main-article`)
- an HTML tag (e.g. `h1`)

With a more concrete example:

```json
"lvl0": {
    "selector": ".navbar-nav .active",
    "global": true,
    "default_value": "Documentation"
},
```

`.navbar-nav .active` means "take the content in the class `active` that is itself in the class `navbar-nav`".

`global: true` means you want the same `lvl0` (so, the same main title) for all the content extracted from the same page.

`"default_value": "Documentation"` will be the displayed value if no content in `.navbar-nav .active` was found.

NB: You can set the `global` and `default_value` attributes for every selector level (`lvlX`), not only for `lvl0`.

#### The Levels

You can notice different levels of selectors (0 to 6 maximum) in the config file. They correspond to different levels of titles: your data will be displayed with a main title (`lvl0`), sub-titles (`lvl1`), sub-sub-titles (`lvl2`) and so on...

### All the Config File Settings

#### `index_uid`

The `index_uid` field is the identifier of the index in your MeiliSearch instance in which your website content is stored. The scraping tool will create a new index if it does not exist.

```json
{
  "index_uid": "example"
}
```

#### `start_urls`

This array contains the list of URLs that will be used to start scraping your website.
The scraper will recursively follow any links (`<a>` tags) from those pages. It will not follow links that are on another domain.

```json
{
  "start_urls": ["https://www.example.com/docs"]
}
```

##### Using Page Rank

This parameter gives more weight to some pages and helps boost the records built from them.
Pages with the highest `page_rank` will be returned before pages with a lower `page_rank`.

```json
{
  "start_urls": [
    {
      "url": "http://www.example.com/docs/concepts/",
      "page_rank": 5
    },
    {
      "url": "http://www.example.com/docs/contributors/",
      "page_rank": 1
    }
  ]
}
```

In this example, records built from the Concepts page will be ranked higher than records extracted from the Contributors page.

#### `stop_urls` (optional)

The scraper will not follow links that match `stop_urls`.

```json
{
  "start_urls": ["https://www.example.com/docs"],
  "stop_urls": ["https://www.example.com/about-us"]
}
```

#### `selectors_key` (optional)

This allows you to use custom selectors per page. If the markup of your website is so different from one page to another that you can't have generic selectors, you can namespace your selectors and specify which set of selectors should be applied to specific pages.

```json
{
  "start_urls": [
    "http://www.example.com/docs/",
    {
      "url": "http://www.example.com/docs/concepts/",
      "selectors_key": "concepts"
    },
    {
      "url": "http://www.example.com/docs/contributors/",
      "selectors_key": "contributors"
    }
  ],
  "selectors": {
    "default": {
      "lvl0": ".main h1",
      "lvl1": ".main h2",
      "lvl2": ".main h3",
      "lvl3": ".main h4",
      "lvl4": ".main h5",
      "text": ".main p"
    },
    "concepts": {
      "lvl0": ".header h2",
      "lvl1": ".main h1.title",
      "lvl2": ".main h2.title",
      "lvl3": ".main h3.title",
      "lvl4": ".main h5.title",
      "text": ".main p"
    },
    "contributors": {
      "lvl0": ".main h1",
      "lvl1": ".contributors .name",
      "lvl2": ".contributors .title",
      "text": ".contributors .description"
    }
  }
}
```

Here, all documentation pages will use the selectors defined in `selectors.default`, while pages under `./concepts` will use `selectors.concepts` and those under `./contributors` will use `selectors.contributors`.

#### `scrape_start_urls` (optional)

By default, the scraper will extract content from the pages defined in `start_urls`.
If you do not have any valuable content on your `start_urls`, or if they duplicate another page, you should set this to `false`.

```json
{
  "scrape_start_urls": false
}
```

#### `sitemap_urls` (optional)

You can pass an array of URLs pointing to your sitemap(s) files. If this value is set, the scraper will try to read URLs from your sitemap(s).

```json
{
  "sitemap_urls": ["http://www.example.com/docs/sitemap.xml"]
}
```

#### `sitemap_alternate_links` (optional)

Sitemaps can contain alternative links for URLs. Those are other versions of the same page, in a different language, or with a different URL. By default docs-scraper will ignore those URLs. Set this to `true` if you want those other versions to be scraped as well.

```json
{
  "sitemap_urls": ["http://www.example.com/docs/sitemap.xml"],
  "sitemap_alternate_links": true
}
```

With the above configuration and the `sitemap.xml` below, both `http://www.example.com/docs/` and `http://www.example.com/docs/de/` will be scraped.

```html
<url>
  <loc>http://www.example.com/docs/</loc>
  <xhtml:link rel="alternate" hreflang="de" href="http://www.example.com/docs/de/"/>
</url>
```
#### `custom_settings` (optional)

This field lets you apply index settings, such as stop-words, to your MeiliSearch index. Here is the [dedicated page about stop-words](https://docs.meilisearch.com/guides/advanced_guides/stop_words.html) in the official documentation.
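For illustration, a stop-words list in the MeiliSearch settings format looks like this (the words chosen here are only examples):

```json
{
  "stopWords": ["a", "an", "the", "of"]
}
```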
You can find more complete lists of English stop-words [like this one](https://gist.github.com/sebleier/554280).

#### `min_indexed_level` (optional)

The default value is 0. By increasing it, you can choose not to index some records if they don't have enough `lvlX` fields filled in. For example, with `min_indexed_level: 2`, the scraper only indexes temporary records that have at least `lvl0`, `lvl1` and `lvl2` set.

This is useful when your documentation has pages that share the same `lvl0` and `lvl1`, for example. In that case, you don't want to index all the shared records, but only keep the content that differs across pages.

```json
{
  "min_indexed_level": 2
}
```

#### `only_content_level` (optional)

When `only_content_level` is set to `true`, the scraper won't create records for the `lvlX` selectors.
If used, `min_indexed_level` is ignored.

```json
{
  "only_content_level": true
}
```

### Authentication

__WARNING:__ Please be aware that the scraper will send authentication headers to every scraped site, so use `allowed_domains` to adjust the scope accordingly!

#### Basic HTTP

Basic HTTP authentication is supported by setting these environment variables:

- `DOCS_SCRAPER_BASICAUTH_USERNAME`
- `DOCS_SCRAPER_BASICAUTH_PASSWORD`

#### Cloudflare Access: Identity and Access Management

If you need to scrape sites protected by Cloudflare Access, you have to set the appropriate HTTP headers. The values for these headers are taken from the environment variables `CF_ACCESS_CLIENT_ID` and `CF_ACCESS_CLIENT_SECRET`.

In the case of Google Cloud Identity-Aware Proxy, please specify these environment variables:

- `IAP_AUTH_CLIENT_ID`: the [client ID of the application](https://console.cloud.google.com/apis/credentials) you are connecting to
- `IAP_AUTH_SERVICE_ACCOUNT_JSON`: generate it in [Actions](https://console.cloud.google.com/iam-admin/serviceaccounts) -> Create key -> JSON

### Installing Chrome Headless

Websites that need JavaScript for rendering are passed through ChromeDriver.
[Download the version](http://chromedriver.chromium.org/downloads) suited to your OS and then set the environment variable `CHROMEDRIVER_PATH`.

## 🤖 Compatibility with MeiliSearch

This package is compatible with the following MeiliSearch versions:

- `v0.13.0`
- `v0.12.X`
- `v0.11.X`
- `v0.10.X`

## ⚙️ Development Workflow and Contributing

Any new contribution is more than welcome in this project! If you want to know more about the development workflow or want to contribute, please visit our [contributing guidelines](/CONTRIBUTING.md) for detailed instructions!

## Credits

Based on [Algolia's docsearch scraper repository](https://github.com/algolia/docsearch-scraper) from [this commit](https://github.com/curquiza/docsearch-scraper/commit/aab0888989b3f7a4f534979f0148f471b7c435ee).
Due to the many changes made to this repository compared to the original one, we do not maintain it as an official fork.
**MeiliSearch** provides and maintains many **SDKs and Integration tools** like this one. We want to provide everyone with an **amazing search experience for any kind of project**. If you want to contribute, make suggestions, or just know what's going on right now, visit us in the [integration-guides](https://github.com/meilisearch/integration-guides) repository.