From 7aac5afedf865a305b1e06714b34cccfa8b52b87 2024-01-30 22:51:32
From: RhodeCode Admin <admin@rhodecode.com>
Date: 2024-01-30 22:51:32
Subject: [PATCH] feat(docs): added comprehensive full text search doc guide

---

diff --git a/docs/source/usage/full-text-search-setup.rst b/docs/source/usage/full-text-search-setup.rst
new file mode 100644
index 0000000..83dd412
--- /dev/null
+++ b/docs/source/usage/full-text-search-setup.rst
@@ -0,0 +1,390 @@
+.. _full-text-search-setup:
+
+Full-text Search
+----------------
+
+RhodeCode provides full-text search capabilities for searching inside file content,
+commit messages, and file paths. Indexing is not enabled by default; to use
+full-text search, you must first build an index.
+
+By default RhodeCode is configured to use `Whoosh`_ to index |repos| and
+provide full-text search. `Whoosh`_ works well for a small amount of data, but
+should not be used for large code bases or a large number of repositories.
+
+|RCE| also provides support for `ElasticSearch 6`_ as a backend for more advanced
+and scalable search.
+
+
+Auth Token generation
+^^^^^^^^^^^^^^^^^^^^^
+
+The RhodeCode indexer runs on top of the |RCE| API and requires an |authtoken|.
+To run the indexer you need an |authtoken| with *admin* rights to all of the |repos| that the indexer should
+process.
+
+To get your API token, click the icon with your user in the top-right corner of the |RCE| interface,
+then go to :menuselection:`your-username --> My Account --> Auth tokens`
+
+1. Enter a description for the |authtoken|.
+2. Select an expiration date, if desired.
+3. Select the `api calls` role for the token.
+4. Click :guilabel:`Add`.
+5. Click the obfuscated generated token, and copy it.
+
+
+Indexing
+^^^^^^^^
+
+To index repositories stored in RhodeCode, you have the option to set the indexer up in a
+number of ways, for example:
+
+* Call the indexer via a cron job. We recommend running this once at night.
+  If you need new content indexed sooner, you can run the indexer a few
+  times during the day. The indexer has a locking mechanism that prevents
+  two instances from running at once, so it is safe to run it as often as every hour.
+* Hook the indexer up with your CI server to reindex after each push.
+* Set the indexer to loop indefinitely and reindex as soon as the previous cycle has finished.
+  This gives near-instant indexing: content becomes searchable seconds after a change happens.
+
+The indexer works by indexing new commits added since the last run, and comparing
+file changes to index only new or modified files across each invocation.
+
+.. note::
+
+    If you wish to build a brand new index from scratch each time, use the ``force``
+    option in the configuration file, or run the indexer with the ``--force`` flag.
+
+
+To set up indexing, use the following steps:
+
+1. :ref:`config-rhoderc`
+2. :ref:`run-index`
+3. :ref:`set-index`
+4. :ref:`advanced-indexing`
+
+
+.. _config-rhoderc:
+
+Configure the ``.rhoderc`` File
+^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^
+
+.. note::
+
+    It is also possible to use the indexer without the ``.rhoderc`` file. Instead of
+    executing with `--instance-name=rc-idx`, provide the host and token
+    directly: `--api-host=https://your-host.example.com --api-key=<auth-token-goes-here>`
+
+
+.. note::
+
+    In some cases the domain is only resolvable via a custom DNS. You can always refer to the
+    instance by its docker name and port (`http://rhodecode:10020`) instead of the hostname, for example:
+
+    .. code-block:: bash
+
+        ./rcstack cli cmd rhodecode-index --api-host=http://rhodecode:10020 --api-key=xxx
+
+
+The indexer uses the :file:`/home/{user}/.rhoderc` file for connection details
+to |RCE| instances. You need to configure the details for each instance you want to index.
+
+
+.. code-block:: bash
+
+    ./rcstack cli cmd rhodecode-setup-config \
+    --filename=/etc/rhodecode/conf/.rhoderc \
+    --instance-name=rc-idx api_host=https://your-host.example.com,api_key=<auth-token-goes-here>
+
+
+Here is an example of a generated config, which you can also mount as a file into the docker image.
+
+.. code-block:: ini
+
+    # Configure .rhoderc with matching details
+    # This allows the indexer to connect to the instance
+    [instance:rc-idx]
+    api_host = https://your-host.example.com
+    api_key = <auth token goes here>
+
+
+.. _run-index:
+
+
+Run the Indexer
+^^^^^^^^^^^^^^^
+
+Run the indexer using the following command, and specify the instance you want to index:
+
+.. code-block:: bash
+
+   # Using the default, simplest indexing of all repositories
+   $ ./rcstack cli cmd rhodecode-index \
+      --no-tty --config=/etc/rhodecode/conf/.rhoderc \
+      --instance-name=rc-idx
+
+   # Using a custom mapping file, invoked without .rhoderc
+   $ ./rcstack cli cmd rhodecode-index \
+      --no-tty \
+      --api-host=https://your-host.example.com --api-key=xxxxx \
+      --mapping=/etc/rhodecode/conf/search_mapping.ini
+
+   # Using a custom mapping file with indexing rules, and the ElasticSearch 6 backend
+   $ ./rcstack cli cmd rhodecode-index \
+      --no-tty --config=/etc/rhodecode/conf/.rhoderc \
+      --instance-name=rc-idx \
+      --mapping=/etc/rhodecode/conf/search_mapping.ini \
+      --es-version=6 --engine-location=http://elasticsearch:9200
+
+   # For advanced usage, check the --help flag to see which other CLI options are available
+   $ ./rcstack cli cmd rhodecode-index --help
+
+.. note::
+
+   Frequent indexing with the Whoosh backend may fragment the index. The most common symptom
+   is a `too many open files` error. To fix this, run the indexer with the `--optimize` flag, e.g.:
+
+   .. code-block:: bash
+
+       $ ./rcstack cli cmd rhodecode-index --instance-name=rc-idx --optimize
+
+   This should be executed regularly; once a week is recommended. When using ElasticSearch this step can be skipped.
+
+
+.. _set-index:
+
+Schedule the Indexer
+^^^^^^^^^^^^^^^^^^^^
+
+To schedule the indexer, configure the crontab file to run the indexer inside
+your |RCT| virtualenv using the following steps.
+
+1. Open the crontab file, using ``crontab -e``.
+2. Add the indexer to the crontab, and schedule it to run as regularly as you
+   wish.
+3. Save the file.
+
+.. code-block:: bash
+
+    $ crontab -e
+
+    # The virtualenv can be called using its full path, so for example you can
+    # put this example into the crontab
+
+    # Run the indexer daily at 4am using the default mapping settings, --no-tty is required for non-interactive calls
+    0 4 * * * ./rcstack cli cmd rhodecode-index --no-tty --config=/etc/rhodecode/conf/.rhoderc --instance-name=rc-idx
+
+    # Run the indexer every Sunday at 3am using the default mapping
+    0 3 * * 0 ./rcstack cli cmd rhodecode-index --no-tty --config=/etc/rhodecode/conf/.rhoderc --instance-name=rc-idx
+
+    # Run the indexer every 15 minutes
+    # using a specially configured mapping file
+    */15 * * * * ./rcstack cli cmd rhodecode-index --no-tty --config=/etc/rhodecode/conf/.rhoderc --instance-name=rc-idx --mapping=/etc/rhodecode/conf/search_mapping.ini
+
+.. _advanced-indexing:
+
+Advanced Indexing
+^^^^^^^^^^^^^^^^^
+
+
+Force Re-Indexing single repository
++++++++++++++++++++++++++++++++++++
+
+Sometimes a whole repository needs to be re-indexed, for example after repository changes,
+or to remove indexed secrets or files. The `--repo-name=` flag
+limits the indexer to a single repository. For example, to force a re-index of a
+single repository, run:
+
+.. code-block:: bash
+
+    ./rcstack cli cmd rhodecode-index --no-tty --config=/etc/rhodecode/conf/.rhoderc --instance-name=rc-idx --force --repo-name=rhodecode-vcsserver
+
+Limiting indexing to small number of repos
+++++++++++++++++++++++++++++++++++++++++++
+
+To reduce memory usage and system load, you can limit the number of repositories processed on each call.
+The `--repo-limit=` flag limits the indexer run to N repositories.
+
+.. code-block:: bash
+
+    ./rcstack cli cmd rhodecode-index --no-tty --config=/etc/rhodecode/conf/.rhoderc --instance-name=rc-idx --repo-limit=10
+
+
+Removing repositories from index
+++++++++++++++++++++++++++++++++
+
+The indexer automatically removes renamed repositories and builds the index for the new names.
+In the same way, if a repository listed in mapping.ini is not reported as existing by the
+server, it is removed from the index.
+To remove an indexed repository manually, use the `--remove-only` flag together with
+`--repo-name=`:
+.. code-block:: bash
+
+    ./rcstack cli cmd rhodecode-index --no-tty --config=/etc/rhodecode/conf/.rhoderc --instance-name=rc-idx --remove-only --repo-name=rhodecode-vcsserver
+
+
+Using search_mapping.ini file for advanced index rules
+++++++++++++++++++++++++++++++++++++++++++++++++++++++
+
+By default rhodecode-index processes all repositories and all files, with parsing limits
+defined by the default CLI arguments. You can change those limits by passing
+flags such as `--max-filesize=2048kb` or `--repo-limit=10`.
+
+For more advanced execution logic it's possible to use a configuration file that
+defines detailed rules for which repositories are indexed, and how.
+
+To create the :file:`search_mapping.ini` file manually, use the following command:
+
+.. code-block:: bash
+
+    ./rcstack cli cmd rhodecode-index --config=/etc/rhodecode/conf/.rhoderc --instance-name=rc-idx \
+    --create-mapping=/etc/rhodecode/conf/search_mapping.ini
+
+
+The indexer can now be executed with the `--mapping` flag.
+
+
+Here's a detailed example of using :file:`search_mapping.ini` file.
+
+.. code-block:: ini
+
+    [__DEFAULT__]
+    ; Create index on commits data, and files data in this order. Available options
+    ; are `commits`, `files`
+    index_types = commits,files
+
+    ; Commit fetch limit. The size of the chunks in which commits are fetched
+    ; via the API and parsed. This lets the server transfer smaller chunks and be less loaded
+    commit_fetch_limit = 1000
+
+    ; Commit process limit. Limits the number of commits the indexer fetches and
+    ; stores inside the full text search index in one run, e.g. if a repo has 2000 commits and
+    ; the limit is 1000, the first run will process commits 0-1000 and the
+    ; second run commits 1000-2000. Helps reduce memory usage, default is 50000
+    ; (set -1 for unlimited)
+    commit_process_limit = 20000
+
+    ; Limit of how many repositories each run can process, default is -1 (unlimited)
+    ; in case of 1000s of repositories it's better to execute in chunks to not overload
+    ; the server.
+    repo_limit = -1
+
+    ; Default patterns for indexing files and content of files. Binary files
+    ; are skipped by default.
+
+    ; Add to index those comma separated files; globs syntax
+    ; e.g index_files = *.py, *.c, *.h, *.js
+    index_files = *,
+
+    ; Do not add to index those comma separated files, this excludes
+    ; both search by name and content; globs syntax
+    ; e.g skip_files = *.key, *.sql, *.xml, *.pem, *.crt
+    skip_files = ,
+
+    ; Add to index content of those comma separated files; globs syntax
+    ; e.g index_files_content = *.h, *.obj
+    index_files_content = *,
+
+    ; Do not add to index content of those comma separated files; globs syntax
+    ; Binary files are not indexed by default.
+    ; e.g skip_files_content = *.min.js, *.xml, *.dump, *.log
+    skip_files_content = ,
+
+    ; Force rebuilding an index from scratch. Each repository will be rebuilt from
+    ; scratch with a global flag. Use --repo-name=NAME --force to rebuild a single repo
+    force = false
+
+    ; Maximum file size that the indexer will process; files above that limit will not
+    ; have their content indexed.
+    ; Possible options are KB (kilobytes), MB (megabytes), eg 1MB or 1024KB
+    max_filesize = 10MB
+
+
+    [__INDEX_RULES__]
+    ; Ordered match rules for repositories. A list of all repositories will be fetched
+    ; using API and this list will be filtered using those rules.
+    ; Syntax for entry: `glob_pattern_OR_full_repo_name = 0 OR 1` where 0=exclude, 1=include
+    ; When this ordered list is traversed first match will return the include/exclude marker
+    ; For example:
+    ;    upstream/binary_repo = 0
+    ;    upstream/subrepo/xml_files = 0
+    ;    upstream/* = 1
+    ;    special-repo = 1
+    ;    * = 0
+    ; This will index all repositories under upstream/*, but skip upstream/binary_repo
+    ; and upstream/subrepo/xml_files, last * = 0 means skip all other matches
+
+
+    ; == EXPLICIT REPOSITORY INDEXING ==
+    ; If defined this will skip using __INDEX_RULES__, and will not use API to fetch
+    ; list of repositories, it will explicitly take names defined with [NAME] format and
+    ; try to build the index, to build index just for repo_name_1 and special-repo use:
+    ;    [repo_name_1]
+    ;    [special-repo]
+
+    ; == PER REPOSITORY CONFIGURATION ==
+    ; This allows overriding the global configuration per repository.
+    ; Example: set a specific file size limit and skip certain files for repository special-repo.
+    ; Note that the CLI flags do not override the conf settings.
+    ;    [conf:special-repo]
+    ;    max_filesize = 5mb
+    ;    skip_files = *.xml, *.sql
+
+
+
+With thousands of repositories it can be tricky to get the include/exclude rules right at first.
+There's a special flag to test the mapping file rules and list repositories that would
+be indexed. Run the indexer with `--show-matched-repos` to only list the
+repositories matched by the rules in the .ini file:
+
+.. code-block:: bash
+
+    ./rcstack cli cmd rhodecode-index --no-tty --config=/etc/rhodecode/conf/.rhoderc --instance-name=rc-idx --show-matched-repos --mapping=/etc/rhodecode/conf/search_mapping.ini
+
+
+.. _enable-elasticsearch:
+
+Enabling ElasticSearch
+^^^^^^^^^^^^^^^^^^^^^^
+
+ElasticSearch is available in the EE edition only. It provides more scalable and more advanced
+search capabilities. While Whoosh is fine for up to 1-2GB of data, beyond that amount it
+starts slowing down, and can cause other problems.
+ElasticSearch 6 also provides a much more advanced query language.
+It allows advanced filtering by file paths and extensions, OR statements, ranges, etc.
+Please check the query language examples in the search field for some advanced query language usage.
+
+
+1. Open the :file:`rhodecode.ini` file for the instance you wish to edit. The
+   default location is :file:`config/_shared/rhodecode.ini`
+2. Find the search configuration section:
+
+.. code-block:: ini
+
+    ###################################
+    ## SEARCH INDEXING CONFIGURATION ##
+    ###################################
+
+    search.module = rhodecode.lib.index.whoosh
+    search.location = %(here)s/data/index
+
+and change it to:
+
+.. code-block:: ini
+
+    search.module = rc_elasticsearch
+    search.location = http://elasticsearch:9200
+    ## specify Elastic Search version, 6 for latest or 2 for legacy
+    search.es_version = 6
+
+where ``search.location`` points to the ElasticSearch server,
+which by default runs on port 9200.
+
+The indexer invocation also needs to change. Provide the `--es-version=` and
+`--engine-location=` parameters to define the ElasticSearch server location and its version.
+For example::
+
+    --instance-name=rc-idx --es-version=6 --engine-location=http://elasticsearch:9200
+
+
+.. _Whoosh: https://pypi.python.org/pypi/Whoosh/
+.. _ElasticSearch 6: https://www.elastic.co/