.. _indexing-ref:

Full-text Search
----------------

RhodeCode provides full-text search capabilities to search inside file content,
commit messages, and file paths. Indexing is not enabled by default; to use
full-text search, you must first build an index.

By default RhodeCode is configured to use `Whoosh`_ to index |repos| and
provide full-text search. `Whoosh`_ works well for a small amount of data and
shouldn't be used for large code bases or a large number of repositories.

|RCE| also provides support for `ElasticSearch 6`_ as a backend for more advanced
and scalable search. See :ref:`enable-elasticsearch` for details.

Indexing
^^^^^^^^

To run the indexer you need to have an |authtoken| with admin rights to all |repos|.

To index repositories stored in RhodeCode, you can set the indexer up in a
number of ways, for example:

* Call the indexer via a cron job. We recommend running this once at night.
  If you need changes indexed sooner, you can run the indexer several
  times during the day. The indexer has a locking mechanism that prevents
  two instances from running at once, so it is safe to run it as often as every hour.
* Set the indexer to loop indefinitely and reindex as soon as it has finished its previous cycle.
* Hook the indexer up with your CI server to reindex after each push.

The indexer works by indexing new commits added since the last run, and comparing
file changes to index only new or modified files.
If you wish to build a brand new index from scratch each time, use the ``force``
option in the configuration file, or run it with the ``--force`` flag.

.. important::

   You need to have |RCT| installed, see :ref:`install-tools`. Since |RCE|
   3.5.0 they are installed by default and available with community/enterprise installations.

To set up indexing, use the following steps:

1. :ref:`config-rhoderc`, if running tools remotely.
2. :ref:`run-index`
3. :ref:`set-index`
4. :ref:`advanced-indexing`

.. _config-rhoderc:

Configure the ``.rhoderc`` File
^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^

.. note::

   Optionally, it's possible to use the indexer without the ``.rhoderc`` file. Instead of
   executing it with ``--instance-name=enterprise-1``, provide the host and token
   directly: ``--api-host=http://127.0.0.1:10000 --api-key=<auth-token-goes-here>``

|RCT| uses the :file:`/home/{user}/.rhoderc` file for connection details
to |RCE| instances. If this file is not automatically created,
you can configure it using the following example. You need to configure the
details for each instance you want to index.

.. code-block:: bash

    # Check the instance details
    # of the instance you want to index
    $ rccontrol status

     - NAME: enterprise-1
     - STATUS: RUNNING
     - TYPE: Enterprise
     - VERSION: 4.1.0
     - URL: http://127.0.0.1:10003

To get your API token, on the |RCE| interface go to
:menuselection:`username --> My Account --> Auth tokens`

.. code-block:: ini

    # Configure .rhoderc with matching details
    # This allows the indexer to connect to the instance
    [instance:enterprise-1]
    api_host = http://127.0.0.1:10000
    api_key = <auth token goes here>

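
As an illustration of how the ``[instance:<name>]`` section maps to connection
details, the following minimal Python sketch reads a ``.rhoderc``-style file with
the standard-library ``configparser`` module. The helper name ``load_instance`` is
hypothetical and not part of |RCT|; it is only meant as a quick way to check that a
hand-written file parses as expected.

```python
import configparser
import os


def load_instance(name, path=None):
    """Return (api_host, api_key) for one instance from a .rhoderc-style file.

    Hypothetical helper for illustration; not part of RhodeCode Tools.
    """
    path = path or os.path.expanduser("~/.rhoderc")
    # Disable interpolation so '%' characters in tokens are read literally.
    parser = configparser.ConfigParser(interpolation=None)
    if not parser.read(path):
        raise FileNotFoundError(path)
    section = "instance:" + name
    if not parser.has_section(section):
        raise KeyError("no [%s] section in %s" % (section, path))
    return parser.get(section, "api_host"), parser.get(section, "api_key")
```

Calling ``load_instance("enterprise-1")`` against the example file would return the
host and token that ``--instance-name=enterprise-1`` refers to.
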

.. _run-index:

Run the Indexer
^^^^^^^^^^^^^^^

Run the indexer using the following command, and specify the instance you want to index:

.. code-block:: bash

    # Using the default, simplest indexing of all repositories
    $ /home/user/.rccontrol/enterprise-1/profile/bin/rhodecode-index \
        --instance-name=enterprise-1

    # Using a custom mapping file with indexing rules, and the ElasticSearch 6 backend
    $ /home/user/.rccontrol/enterprise-1/profile/bin/rhodecode-index \
        --instance-name=enterprise-1 \
        --mapping=/home/user/.rccontrol/enterprise-1/search_mapping.ini \
        --es-version=6 --engine-location=http://elasticsearch-host:9200

    # Using a custom mapping file and invocation without ``.rhoderc``
    $ /home/user/.rccontrol/enterprise-1/profile/bin/rhodecode-index \
        --api-host=http://rhodecodecode.myserver.com --api-key=xxxxx \
        --mapping=/home/user/.rccontrol/enterprise-1/search_mapping.ini

    # From inside a virtualenv on your local machine or CI server.
    (venv)$ rhodecode-index --instance-name=enterprise-1

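
When the indexer is triggered from a script (for example from a CI job), it can
help to assemble the command line in one place. The following is a minimal Python
sketch that only uses the flags shown in the examples above; the function name
``build_index_command`` and its parameters are hypothetical and not part of |RCT|.

```python
import shlex


def build_index_command(instance_name, mapping=None, es_version=None,
                        engine_location=None, executable="rhodecode-index"):
    """Assemble a rhodecode-index argument list from the documented flags.

    Hypothetical helper for illustration; not part of RhodeCode Tools.
    """
    cmd = [executable, "--instance-name=%s" % instance_name]
    if mapping:
        cmd.append("--mapping=%s" % mapping)
    if es_version is not None:
        cmd.append("--es-version=%s" % es_version)
    if engine_location:
        cmd.append("--engine-location=%s" % engine_location)
    return cmd


if __name__ == "__main__":
    cmd = build_index_command(
        "enterprise-1",
        mapping="/home/user/.rccontrol/enterprise-1/search_mapping.ini",
        es_version=6,
        engine_location="http://elasticsearch-host:9200",
    )
    # Print the command for review instead of executing it.
    print(" ".join(shlex.quote(part) for part in cmd))
```

Returning a list rather than a single string means the result can be handed to
``subprocess.run(cmd, check=True)`` without going through a shell.
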

.. note::

   If you index frequently, the index may become fragmented. This most often shows up
   as an error about ``too many open files``. To fix this, run the indexer with the
   ``--optimize`` flag, e.g. ``rhodecode-index --instance-name=enterprise-1 --optimize``.
   This should be executed regularly; once a week is recommended.

.. _set-index:

Schedule the Indexer
^^^^^^^^^^^^^^^^^^^^

To schedule the indexer, configure the crontab file to run the indexer inside
your |RCT| virtualenv using the following steps.

1. Open the crontab file, using ``crontab -e``.
2. Add the indexer to the crontab, and schedule it to run as regularly as you
   wish.
3. Save the file.

.. code-block:: bash

    $ crontab -e

    # The virtualenv can be called using its full path, so for example you can
    # put this example into the crontab. Keep each crontab entry on a single line.

    # Run the indexer daily at 4am using the default mapping settings
    0 4 * * * /home/ubuntu/.virtualenv/rhodecode-venv/bin/rhodecode-index --instance-name=enterprise-1

    # Run the indexer every Sunday at 3am using default mapping
    0 3 * * 0 /home/ubuntu/.virtualenv/rhodecode-venv/bin/rhodecode-index --instance-name=enterprise-1

    # Run the indexer every 15 minutes
    # using a specially configured mapping file
    */15 * * * * ~/.rccontrol/enterprise-4/profile/bin/rhodecode-index --instance-name=enterprise-4 --mapping=/home/user/.rccontrol/enterprise-4/search_mapping.ini

.. _advanced-indexing:

Advanced Indexing
^^^^^^^^^^^^^^^^^

Force Re-Indexing single repository
+++++++++++++++++++++++++++++++++++

It is often necessary to re-index a whole repository because of repository changes,
or to remove indexed secrets or files. The indexer has a ``--repo-name=`` flag
that limits execution to a single repository. For example, to force a re-index of a
single repository, run::

    rhodecode-index --instance-name=enterprise-1 --force --repo-name=rhodecode-vcsserver

Removing repositories from index
++++++++++++++++++++++++++++++++

The indexer automatically removes renamed repositories and builds the index for the new names.
In the same way, if a repository listed in mapping.ini is not reported as existing by the
server, it's removed from the index.
If you wish to remove an indexed repository manually, use the following call::

    rhodecode-index --instance-name=enterprise-1 --remove-only --repo-name=rhodecode-vcsserver

Using search_mapping.ini file for advanced index rules
++++++++++++++++++++++++++++++++++++++++++++++++++++++

By default rhodecode-index runs for all repositories and all files, with parsing limits
defined by the CLI default arguments. You can change those limits by calling it with
different flags such as ``--max-filesize=2048kb`` or ``--repo-limit=10``.

For more advanced execution logic, it's possible to use a configuration file that
defines detailed rules for which repositories should be indexed and how.

|RCT| provides an example index configuration file called :file:`search_mapping.ini`.
This file is created by default during installation and is located at:

* :file:`/home/{user}/.rccontrol/{instance-id}/search_mapping.ini`, using default |RCT|.
* :file:`~/venv/lib/python2.7/site-packages/rhodecode_tools/templates/mapping.ini`,
  when using ``virtualenv``.

.. note::

   If you need to create the :file:`search_mapping.ini` file manually, use the |RCT|
   ``rhodecode-index --create-mapping path/to/search_mapping.ini`` command.
   For details, see the :ref:`tools-cli` section.

To run the indexer with a mapping file, provide it using the ``--mapping`` flag::

    rhodecode-index --instance-name=enterprise-1 --mapping=/my/path/search_mapping.ini

Here's a detailed example of using the :file:`search_mapping.ini` file.

.. code-block:: ini

    [__DEFAULT__]
    ; Create index on commits data, and files data in this order. Available options
    ; are `commits`, `files`
    index_types = commits,files

    ; Commit fetch limit. The size of the chunks in which commits are fetched
    ; via the API and parsed. This lets the server transfer smaller chunks and be less loaded
    commit_fetch_limit = 1000

    ; Commit process limit. Limits the number of commits the indexer should fetch and
    ; store inside the full text search index, e.g. if a repo has 2000 commits and the
    ; limit is 1000, the first run will process commits 0-1000 and the
    ; second run commits 1000-2000. Helps reduce memory usage, default is 50000
    ; (set -1 for unlimited)
    commit_process_limit = 20000

    ; Limit of how many repositories each run can process, default is -1 (unlimited)
    ; in case of 1000s of repositories it's better to execute in chunks to not overload
    ; the server.
    repo_limit = -1

    ; Default patterns for indexing files and content of files. Binary files
    ; are skipped by default.

    ; Add to index those comma separated files; globs syntax
    ; e.g. index_files = *.py, *.c, *.h, *.js
    index_files = *,

    ; Do not add to index those comma separated files, this excludes
    ; both search by name and content; globs syntax
    ; e.g. skip_files = *.key, *.sql, *.xml, *.pem, *.crt
    skip_files = ,

    ; Add to index content of those comma separated files; globs syntax
    ; e.g. index_files_content = *.h, *.obj
    index_files_content = *,

    ; Do not add to index content of those comma separated files; globs syntax
    ; Binary files are not indexed by default.
    ; e.g. skip_files_content = *.min.js, *.xml, *.dump, *.log
    skip_files_content = ,

    ; Force rebuilding an index from scratch. Each repository will be rebuilt from
    ; scratch with a global flag. Use --repo-name=NAME --force to rebuild single repo
    force = false

    ; Maximum file size that the indexer will process; files above that limit will
    ; not have their content indexed.
    ; Possible options are KB (kilobytes), MB (megabytes), e.g. 1MB or 1024KB
    max_filesize = 10MB

    [__INDEX_RULES__]
    ; Ordered match rules for repositories. A list of all repositories will be fetched
    ; using API and this list will be filtered using those rules.
    ; Syntax for entry: `glob_pattern_OR_full_repo_name = 0 OR 1` where 0=exclude, 1=include
    ; When this ordered list is traversed first match will return the include/exclude marker
    ; For example:
    ;    upstream/binary_repo = 0
    ;    upstream/subrepo/xml_files = 0
    ;    upstream/* = 1
    ;    special-repo = 1
    ;    * = 0
    ; This will index all repositories under upstream/*, but skip upstream/binary_repo
    ; and upstream/subrepo/xml_files, last * = 0 means skip all other matches

    ; == EXPLICIT REPOSITORY INDEXING ==
    ; If defined this will skip using __INDEX_RULES__, and will not use API to fetch
    ; list of repositories, it will explicitly take names defined with [NAME] format and
    ; try to build the index, to build index just for repo_name_1 and special-repo use:
    ;    [repo_name_1]
    ;    [special-repo]

    ; == PER REPOSITORY CONFIGURATION ==
    ; This allows overriding the global configuration per repository.
    ; example to set specific file limit, and skip certain files for repository special-repo
    ; Note that the CLI flags don't override the conf settings.
    ;    [conf:special-repo]
    ;    max_filesize = 5mb
    ;    skip_files = *.xml, *.sql

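
To make the ``max_filesize`` values concrete, here is a minimal Python sketch that
converts a size string such as ``10MB`` or ``2048kb`` to bytes. It assumes binary
units (1MB equals 1024KB), which is what the ``1MB or 1024KB`` comment in the
mapping file suggests. The function ``parse_filesize`` is hypothetical and is not
the parser the indexer itself uses.

```python
import re

# Assumed multipliers: binary units, so that 1MB == 1024KB.
_UNITS = {"KB": 1024, "MB": 1024 * 1024}


def parse_filesize(value):
    """Convert a size such as '10MB' or '1024KB' to a number of bytes.

    Hypothetical helper for illustration; not the indexer's own parser.
    """
    match = re.fullmatch(r"\s*(\d+)\s*(KB|MB)\s*", value, flags=re.IGNORECASE)
    if match is None:
        raise ValueError("unsupported size: %r" % value)
    number, unit = match.groups()
    return int(number) * _UNITS[unit.upper()]
```

Under this assumption, ``parse_filesize("1MB")`` and ``parse_filesize("1024KB")``
give the same result, and lower-case values such as ``5mb`` are accepted as well.
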

If you have thousands of repositories, it can be tricky to write the include/exclude rules at first.
There's a special flag to test the mapping file rules and list the repositories that would
be indexed. Run the indexer with ``--show-matched-repos`` to list only the
repositories matched by the .ini file rules::

    rhodecode-index --instance-name=enterprise-1 --show-matched-repos --mapping=/my/path/search_mapping.ini

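
The first-match behaviour described for ``[__INDEX_RULES__]`` can also be reasoned
about offline. The following minimal Python sketch evaluates an ordered rule list
with the standard-library ``fnmatch`` module. It is an approximation under stated
assumptions: the indexer's exact glob semantics may differ, repositories matching
no rule are treated as excluded, and the names ``is_indexed`` and ``RULES`` are
hypothetical.

```python
from fnmatch import fnmatchcase


def is_indexed(repo_name, rules):
    """Return True if the first matching rule includes the repository.

    `rules` is an ordered list of (glob_pattern_or_repo_name, flag) pairs,
    where flag 1 means include and 0 means exclude. Repositories that match
    no rule are treated as excluded here, which is an assumption.
    """
    for pattern, flag in rules:
        if fnmatchcase(repo_name, pattern):
            return flag == 1
    return False


# The example rules from the mapping file comments, in the same order.
RULES = [
    ("upstream/binary_repo", 0),
    ("upstream/subrepo/xml_files", 0),
    ("upstream/*", 1),
    ("special-repo", 1),
    ("*", 0),
]
```

Because the list is traversed in order, the specific exclusions must come before
the broader ``upstream/*`` include; swapping them would index
``upstream/binary_repo`` as well.
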

.. _enable-elasticsearch:

Enabling ElasticSearch
^^^^^^^^^^^^^^^^^^^^^^

ElasticSearch is available in the EE edition only. It provides more scalable and more advanced
search capabilities. While Whoosh is fine for up to 1-2GB of data, beyond that amount it
starts slowing down, and can cause other problems.
ElasticSearch 6 also provides a much more advanced query language.
It allows advanced filtering by file paths and extensions, OR statements, ranges, etc.
Check the query language examples in the search field for some advanced query language usage.

1. Open the :file:`rhodecode.ini` file for the instance you wish to edit. The
   default location is
   :file:`/home/{user}/.rccontrol/{instance-id}/rhodecode.ini`
2. Find the search configuration section:

.. code-block:: ini

    ###################################
    ## SEARCH INDEXING CONFIGURATION ##
    ###################################

    search.module = rhodecode.lib.index.whoosh
    search.location = %(here)s/data/index

and change it to:

.. code-block:: ini

    search.module = rc_elasticsearch
    search.location = http://localhost:9200
    ## specify Elastic Search version, 6 for latest or 2 for legacy
    search.es_version = 6

where ``search.location`` points to the ElasticSearch server,
by default running on port 9200.

The indexer invocation also needs to change. Provide the ``--es-version=`` and
``--engine-location=`` parameters to define the ElasticSearch server location and its version.
For example::

    rhodecode-index --instance-name=enterprise-1 --es-version=6 --engine-location=http://localhost:9200


.. _Whoosh: https://pypi.python.org/pypi/Whoosh/
.. _ElasticSearch 6: https://www.elastic.co/