.. _full-text-search-setup:

Full-text Search
----------------

RhodeCode provides full-text search capabilities for searching file content,
commit messages, and file paths. Indexing is not enabled by default; to use
full-text search, you must first build an index.

By default RhodeCode is configured to use `Whoosh`_ to index |repos| and
provide full-text search. `Whoosh`_ works well for small amounts of data, but
should not be used for large code bases or large numbers of repositories.

|RCE| also provides support for `ElasticSearch 6`_ as a backend for more advanced
and scalable search.


Auth Token generation
^^^^^^^^^^^^^^^^^^^^^

The RhodeCode indexer runs on top of the |RCE| API and requires an |authtoken|.
To run the indexer you need an |authtoken| with *admin* rights to all |repos| that the indexer should
process.

To get your API token, click the icon with your user in the top right corner of the |RCE| interface
and go to :menuselection:`your-username --> My Account --> Auth tokens`

1. Enter a description for the |authtoken|
2. Select an expiration date if desired
3. Select the ``api calls`` role for the token
4. Click :guilabel:`Add`
5. Click on the obfuscated generated token, and copy it.


Indexing
^^^^^^^^

To index repositories stored in RhodeCode, you can set the indexer up in a
number of ways, for example:

* Call the indexer via a cron job. We recommend running this once at night.
  If you need content indexed sooner, you can run the indexer several
  times during the day. The indexer has a locking mechanism that prevents
  two instances from running at once, so it is safe to run it as often as every hour.
* Hook the indexer up with your CI server to reindex after each push.
* Set the indexer to loop indefinitely and reindex as soon as it has finished its previous cycle.
  This gives near-instant indexing, with content available in search seconds after changes happen.

The indexer works by indexing new commits added since the last run, and comparing
file changes to index only new or modified files across each invocation.
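
The looping option from the list above could be sketched as a small wrapper
like the one below. ``index_loop``, its two arguments, and the pause between
cycles are illustrative assumptions, not indexer features; the command line
only reuses flags shown elsewhere in this document.

.. code-block:: bash

    #!/usr/bin/env bash
    # Sketch only: re-run the indexer as soon as the previous cycle finishes.
    # index_loop and its arguments are hypothetical, not part of rhodecode-index.
    INDEX_CMD=(./rcstack cli cmd rhodecode-index --no-tty
               --config=/etc/rhodecode/conf/.rhoderc --instance-name=rc-idx)

    # index_loop CYCLES PAUSE_SECONDS  (CYCLES=0 means loop forever)
    index_loop() {
        local cycles=${1:-0} pause=${2:-30} n=0
        while [ "$cycles" -eq 0 ] || [ "$n" -lt "$cycles" ]; do
            # Keep looping even if a single run fails, but report the failure
            "${INDEX_CMD[@]}" || echo "indexer exited with status $?" >&2
            n=$((n + 1))
            sleep "$pause"
        done
    }

    # Usage: loop forever with a 30 second pause between cycles
    #   index_loop 0 30

Because the indexer's locking prevents two instances from running at once, a
wrapper like this should not clash with a scheduled run.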

.. note::

    If you wish to build a brand new index from scratch each time, use the ``force``
    option in the configuration file, or run the indexer with the ``--force`` flag.


To set up indexing, use the following steps:

1. :ref:`config-rhoderc`
2. :ref:`run-index`
3. :ref:`set-index`
4. :ref:`advanced-indexing`


.. _config-rhoderc:

Configure the ``.rhoderc`` File
^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^

.. note::

    It is also possible to use the indexer without the ``.rhoderc`` file. Instead of
    executing it with ``--instance-name=rc-idx``, provide the host and token
    directly: ``--api-host=https://your-host.example.com --api-key=<auth-token-goes-here>``


.. note::

    In some cases the domain may only be resolvable via custom DNS. You can always refer to the
    instance by its Docker name and port (``http://rhodecode:10020``) instead of the hostname, for example:

    .. code-block:: bash

        ./rcstack cli cmd rhodecode-index --api-host=http://rhodecode:10020 --api-key=xxx


The indexer uses the :file:`/home/{user}/.rhoderc` file for connection details
to |RCE| instances. You need to configure the details for each instance you want to index.


.. code-block:: bash

    ./rcstack cli cmd rhodecode-setup-config \
    --filename=/etc/rhodecode/conf/.rhoderc \
    --instance-name=rc-idx api_host=https://your-host.example.com,api_key=<auth-token-goes-here>


Here is an example of the generated config, which you can also mount as a file into the Docker image.

.. code-block:: ini

    # Configure .rhoderc with matching details
    # This allows the indexer to connect to the instance
    [instance:rc-idx]
    api_host = https://your-host.example.com
    api_key = <auth token goes here>


.. _run-index:


Run the Indexer
^^^^^^^^^^^^^^^

Run the indexer using the following command, and specify the instance you want to index:

.. code-block:: bash

   # Using the default, simplest indexing of all repositories
   $ ./rcstack cli cmd rhodecode-index \
      --no-tty --config=/etc/rhodecode/conf/.rhoderc \
      --instance-name=rc-idx

   # Using a custom mapping file and invocation without .rhoderc
   $ ./rcstack cli cmd rhodecode-index \
      --no-tty \
      --api-host=https://your-host.example.com --api-key=xxxxx \
      --mapping=/etc/rhodecode/conf/search_mapping.ini

   # Using a custom mapping file with indexing rules, and using elasticsearch 6 backend
   $ ./rcstack cli cmd rhodecode-index \
      --no-tty --config=/etc/rhodecode/conf/.rhoderc \
      --instance-name=rc-idx \
      --mapping=/etc/rhodecode/conf/search_mapping.ini \
      --es-version=6 --engine-location=http://elasticsearch:9200

   # For advanced usage, check the --help flag to see what other CLI options are available
   $ ./rcstack cli cmd rhodecode-index --help

.. note::

   Frequent indexing with the Whoosh backend may fragment the index. The most common
   symptom is a ``too many open files`` error. To fix this, run the indexer with the
   ``--optimize`` flag, for example:

   .. code-block:: bash

       $ ./rcstack cli cmd rhodecode-index --instance-name=rc-idx --optimize

   This should be executed regularly; once a week is recommended. When using ElasticSearch this step can be skipped.


.. _set-index:

Schedule the Indexer
^^^^^^^^^^^^^^^^^^^^

To schedule the indexer, configure the crontab file to run the indexer inside
your |RCT| virtualenv using the following steps.

1. Open the crontab file, using ``crontab -e``.
2. Add the indexer to the crontab, and schedule it to run as regularly as you
   wish.
3. Save the file.

.. code-block:: bash

    $ crontab -e

    # cron starts jobs in the user's home directory, so replace ./rcstack
    # below with the full path to your rcstack script

    # Run the indexer daily at 4am using the default mapping settings, --no-tty is required for non interactive calls
    0 4 * * * ./rcstack cli cmd rhodecode-index --no-tty --config=/etc/rhodecode/conf/.rhoderc --instance-name=rc-idx

    # Run the indexer every Sunday at 3am using default mapping
    0 3 * * 0 ./rcstack cli cmd rhodecode-index --no-tty --config=/etc/rhodecode/conf/.rhoderc --instance-name=rc-idx

    # Run the indexer every 15 minutes
    # using a specially configured mapping file
    */15 * * * * ./rcstack cli cmd rhodecode-index --no-tty --config=/etc/rhodecode/conf/.rhoderc --instance-name=rc-idx --mapping=/etc/rhodecode/conf/search_mapping.ini
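
If you use the Whoosh backend, the weekly ``--optimize`` run recommended earlier
could be scheduled in the same way. This is a sketch: the day and time are
arbitrary, and it assumes ``--optimize`` can be combined with the same
connection flags used in the entries above.

.. code-block:: bash

    # Optimize the Whoosh index every Saturday at 5am (schedule is an assumption)
    0 5 * * 6 ./rcstack cli cmd rhodecode-index --no-tty --config=/etc/rhodecode/conf/.rhoderc --instance-name=rc-idx --optimize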

.. _advanced-indexing:

Advanced Indexing
^^^^^^^^^^^^^^^^^


Force Re-Indexing single repository
+++++++++++++++++++++++++++++++++++

It is often necessary to re-index a whole repository, for example after repository changes
or to remove indexed secrets or files. The ``--repo-name=`` flag
limits the indexer's execution to a single repository. For example, to force a re-index of a
single repository, use the following call:

.. code-block:: bash

    ./rcstack cli cmd rhodecode-index --no-tty --config=/etc/rhodecode/conf/.rhoderc --instance-name=rc-idx --force --repo-name=rhodecode-vcsserver

Limiting indexing to small number of repos
++++++++++++++++++++++++++++++++++++++++++

To limit memory usage and system load, you can restrict the number of repositories processed on each call.
The ``--repo-limit=`` flag limits the indexer's execution to N repositories.

.. code-block:: bash

    ./rcstack cli cmd rhodecode-index --no-tty --config=/etc/rhodecode/conf/.rhoderc --instance-name=rc-idx --repo-limit=10


Removing repositories from index
++++++++++++++++++++++++++++++++

The indexer automatically removes renamed repositories and builds the index for their new names.
In the same way, if a repository listed in mapping.ini is not reported as existing by the
server, it is removed from the index.
If you wish to remove an indexed repository manually, use the following call:

.. code-block:: bash

    ./rcstack cli cmd rhodecode-index --no-tty --config=/etc/rhodecode/conf/.rhoderc --instance-name=rc-idx --remove-only --repo-name=rhodecode-vcsserver


Using search_mapping.ini file for advanced index rules
++++++++++++++++++++++++++++++++++++++++++++++++++++++

By default rhodecode-index processes all repositories and all files, with parsing limits
defined by the CLI default arguments. You can change those limits by calling it with
different flags such as ``--max-filesize=2048kb`` or ``--repo-limit=10``.

For more advanced execution logic, you can use a configuration file that
defines detailed rules for which repositories should be indexed, and how.

To create the :file:`search_mapping.ini` file manually, use the command below:

.. code-block:: bash

    ./rcstack cli cmd rhodecode-index --config=/etc/rhodecode/conf/.rhoderc --instance-name=rc-idx \
    --create-mapping=/etc/rhodecode/conf/search_mapping.ini


Now the indexer can be executed with the ``--mapping`` flag.


Here is a detailed example of a :file:`search_mapping.ini` file.

.. code-block:: ini

    [__DEFAULT__]
    ; Create index on commits data, and files data in this order. Available options
    ; are `commits`, `files`
    index_types = commits,files

    ; Commit fetch limit. The number of commits fetched via the API and parsed
    ; per chunk. This lets the server transfer smaller chunks and be less loaded
    commit_fetch_limit = 1000

    ; Commit process limit. Limit the number of commits indexer should fetch, and
    ; store inside the full text search index. eg. if repo has 2000 commits, and
    ; limit is 1000, on the first run it will process commits 0-1000 and on the
    ; second 1000-2000 commits. Help reduce memory usage, default is 50000
    ; (set -1 for unlimited)
    commit_process_limit = 20000

    ; Limit of how many repositories each run can process, default is -1 (unlimited)
    ; in case of 1000s of repositories it's better to execute in chunks to not overload
    ; the server.
    repo_limit = -1

    ; Default patterns for indexing files and content of files. Binary files
    ; are skipped by default.

    ; Add to index those comma separated files; globs syntax
    ; e.g index_files = *.py, *.c, *.h, *.js
    index_files = *,

    ; Do not add to index those comma separated files, this excludes
    ; both search by name and content; globs syntax
    ; e.g skip_files = *.key, *.sql, *.xml, *.pem, *.crt
    skip_files = ,

    ; Add to index content of those comma separated files; globs syntax
    ; e.g index_files_content = *.h, *.obj
    index_files_content = *,

    ; Do not add to index content of those comma separated files; globs syntax
    ; Binary files are not indexed by default.
    ; e.g skip_files_content = *.min.js, *.xml, *.dump, *.log
    skip_files_content = ,

    ; Force rebuilding an index from scratch. Each repository will be rebuilt from
    ; scratch with a global flag. Use --repo-name=NAME --force to rebuild single repo
    force = false

    ; maximum file size that indexer will use, files above that limit are not going
    ; to have their content indexed.
    ; Possible options are KB (kilobytes), MB (megabytes), eg 1MB or 1024KB
    max_filesize = 10MB


    [__INDEX_RULES__]
    ; Ordered match rules for repositories. A list of all repositories will be fetched
    ; using API and this list will be filtered using those rules.
    ; Syntax for entry: `glob_pattern_OR_full_repo_name = 0 OR 1` where 0=exclude, 1=include
    ; When this ordered list is traversed first match will return the include/exclude marker
    ; For example:
    ;    upstream/binary_repo = 0
    ;    upstream/subrepo/xml_files = 0
    ;    upstream/* = 1
    ;    special-repo = 1
    ;    * = 0
    ; This will index all repositories under upstream/*, but skip upstream/binary_repo
    ; and upstream/subrepo/xml_files, last * = 0 means skip all other matches


    ; == EXPLICIT REPOSITORY INDEXING ==
    ; If defined this will skip using __INDEX_RULES__, and will not use API to fetch
    ; list of repositories, it will explicitly take names defined with [NAME] format and
    ; try to build the index, to build index just for repo_name_1 and special-repo use:
    ;    [repo_name_1]
    ;    [special-repo]

    ; == PER REPOSITORY CONFIGURATION ==
    ; This allows overriding the global configuration per repository.
    ; Example: set a specific file size limit and skip certain files for repository special-repo.
    ; The CLI flags do not override the conf settings.
    ;    [conf:special-repo]
    ;    max_filesize = 5mb
    ;    skip_files = *.xml, *.sql


With thousands of repositories it can be tricky to get the include/exclude rules right at first.
There is a special flag to test the mapping file rules and list the repositories that would
be indexed. Run the indexer with ``--show-matched-repos`` to list only the
repositories matched by the rules in the .ini file.

.. code-block:: bash

    ./rcstack cli cmd rhodecode-index --no-tty --config=/etc/rhodecode/conf/.rhoderc --instance-name=rc-idx --show-matched-repos --mapping=/etc/rhodecode/conf/search_mapping.ini


.. _enable-elasticsearch:

Enabling ElasticSearch
^^^^^^^^^^^^^^^^^^^^^^

ElasticSearch is available in the EE edition only. It provides much more scalable and more advanced
search capabilities. While Whoosh is fine for up to 1-2GB of data, beyond that amount it
starts slowing down, and can cause other problems.
ElasticSearch 6 also provides a much more advanced query language.
It allows advanced filtering by file paths and extensions, OR statements, ranges, etc.
Please check the query language examples in the search field for some advanced query language usage.


1. Open the :file:`rhodecode.ini` file for the instance you wish to edit. The
   default location is :file:`config/_shared/rhodecode.ini`
2. Find the search configuration section:

.. code-block:: ini

    ###################################
    ## SEARCH INDEXING CONFIGURATION ##
    ###################################

    search.module = rhodecode.lib.index.whoosh
    search.location = %(here)s/data/index

and change it to:

.. code-block:: ini

    search.module = rc_elasticsearch
    search.location = http://elasticsearch:9200
    ## specify Elastic Search version, 6 for latest or 2 for legacy
    search.es_version = 6

where ``search.location`` points to the ElasticSearch server,
which by default runs on port 9200.

The indexer invocation also needs to change. Provide the ``--es-version=`` and
``--engine-location=`` parameters to define the ElasticSearch server location and its version.
For example::

    --instance-name=rc-idx --es-version=6 --engine-location=http://elasticsearch:9200
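
Before switching backends, it can help to confirm which major version the
server reports, so that it matches ``--es-version``. The sketch below is not
part of rhodecode-index: ``es_major_version`` is a hypothetical helper, and it
assumes ``curl`` is installed and that the server root URL returns the usual
ElasticSearch JSON containing a ``"number"`` field under ``version``.

.. code-block:: bash

    #!/usr/bin/env bash
    # Sketch: read the major version reported by an ElasticSearch server.
    # es_major_version is a hypothetical helper, not part of rhodecode-index.

    # Extract the major version from the JSON returned by the server root URL
    # (it contains an entry such as: "number" : "6.8.23").
    es_major_version() {
        sed -n 's/.*"number" *: *"\([0-9][0-9]*\)\..*/\1/p' | head -n 1
    }

    # Usage, assuming the server from the examples above is reachable:
    #   curl -fsS http://elasticsearch:9200 | es_major_version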


.. _Whoosh: https://pypi.python.org/pypi/Whoosh/
.. _ElasticSearch 6: https://www.elastic.co/