From 7aac5afedf865a305b1e06714b34cccfa8b52b87 2024-01-30 22:51:32
From: RhodeCode Admin
Date: 2024-01-30 22:51:32
Subject: [PATCH] feat(docs): added comprehensive full text search doc guide

---

diff --git a/docs/source/usage/full-text-search-setup.rst b/docs/source/usage/full-text-search-setup.rst
new file mode 100644
index 0000000..83dd412
--- /dev/null
+++ b/docs/source/usage/full-text-search-setup.rst
@@ -0,0 +1,390 @@
+.. _full-text-search-setup:
+
+Full-text Search
+----------------
+
+RhodeCode provides full-text search capabilities for searching inside file content,
+commit messages, and file paths. Indexing is not enabled by default; to use
+full-text search, building an index is a prerequisite.
+
+By default RhodeCode is configured to use `Whoosh`_ to index |repos| and
+provide full-text search. `Whoosh`_ works well for a small amount of data and
+shouldn't be used for large code-bases or large numbers of repositories.
+
+|RCE| also provides support for `ElasticSearch 6`_ as a backend for more advanced
+and scalable search.
+
+
+Auth Token generation
+^^^^^^^^^^^^^^^^^^^^^
+
+The RhodeCode indexer runs on top of the |RCE| API and requires an |authtoken| before continuing.
+To run the indexer you need an |authtoken| with *admin* rights to all of the |repos| that the indexer should
+process.
+
+To get your API token, click the icon with your user in the top right corner of the |RCE| interface
+and go to :menuselection:`your-username --> My Account --> Auth tokens`
+
+1. Put a description for the |authtoken|
+2. Select expiration date if desired
+3. Select the ``api calls`` role for the token
+4. Click :guilabel:`Add`
+5. Click on the obfuscated generated token, and copy it.
+
+
+Indexing
+^^^^^^^^
+
+To index repositories stored in RhodeCode, you have the option to set the indexer up in a
+number of ways, for example:
+
+* Call the indexer via a cron job. We recommend running this once at night.
+  If you need content indexed sooner, it's possible to index a few
+  times during the day. The indexer has a locking mechanism that won't allow
+  two instances of the indexer to run at once, so it's safe to run it even every hour.
+* Hook the indexer up with your CI server to reindex after each push.
+* Set the indexer to loop infinitely and reindex as soon as it has finished its previous cycle.
+  This gives near-instant indexing, with content available seconds after changes happen.
+
+The indexer works by indexing new commits added since the last run, and comparing
+file changes to index only new or modified files across each invocation.
+
+.. note::
+
+    If you wish to build a brand new index from scratch each time, use the ``force``
+    option in the configuration file, or run it with the ``--force`` flag.
+
+
+To set up indexing, use the following steps:
+
+1. :ref:`config-rhoderc`
+2. :ref:`run-index`
+3. :ref:`set-index`
+4. :ref:`advanced-indexing`
+
+
+.. _config-rhoderc:
+
+Configure the ``.rhoderc`` File
+^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^
+
+.. note::
+
+    Optionally it's possible to use the indexer without the ``.rhoderc``. Instead of
+    executing with ``--instance-name=rc-idx``, provide the host and token
+    directly: ``--api-host=https://your-host.example.com --api-key=``
+
+
+.. note::
+
+    In some cases the domain may only be resolvable via a custom DNS. You can always refer to the
+    instance by its Docker name and port (``http://rhodecode:10020``) instead of the hostname, for example:
+
+    .. code-block:: bash
+
+        ./rcstack cli cmd rhodecode-index --api-host=http://rhodecode:10020 --api-key=xxx
+
+
+The indexer uses the :file:`/home/{user}/.rhoderc` file for connection details
+to |RCE| instances. You need to configure the details for each instance you want to index.
+
+
+.. code-block:: bash
+
+    ./rcstack cli cmd rhodecode-setup-config \
+        --filename=/etc/rhodecode/conf/.rhoderc \
+        --instance-name=rc-idx api_host=https://your-host.example.com,api_key=
+
+
+Here's an example of a generated config, which you might also mount as a file into the Docker image.
+
+.. code-block:: ini
+
+    # Configure .rhoderc with matching details
+    # This allows the indexer to connect to the instance
+    [instance:rc-idx]
+    api_host = https://your-host.example.com
+    api_key =
+
+
+.. _run-index:
+
+
+Run the Indexer
+^^^^^^^^^^^^^^^
+
+Run the indexer using the following command, and specify the instance you want to index:
+
+.. code-block:: bash
+
+    # Using default simple indexing of all repositories
+    $ ./rcstack cli cmd rhodecode-index \
+        --no-tty --config=/etc/rhodecode/conf/.rhoderc \
+        --instance-name=rc-idx
+
+    # Using a custom mapping file and invocation without .rhoderc
+    $ ./rcstack cli cmd rhodecode-index \
+        --no-tty \
+        --api-host=https://your-host.example.com --api-key=xxxxx \
+        --mapping=/etc/rhodecode/conf/search_mapping.ini
+
+    # Using a custom mapping file with indexing rules, and using elasticsearch 6 backend
+    $ ./rcstack cli cmd rhodecode-index \
+        --no-tty --config=/etc/rhodecode/conf/.rhoderc \
+        --instance-name=rc-idx \
+        --mapping=/etc/rhodecode/conf/search_mapping.ini \
+        --es-version=6 --engine-location=http://elasticsearch:9200
+
+    # For more advanced usage, check the --help flag to see what other CLI options are available
+    $ ./rcstack cli cmd rhodecode-index --help
+
+.. note::
+
+    When indexing frequently with the Whoosh backend, the index may become fragmented. The most common symptom
+    is an error about ``too many open files``. To fix this, run the indexer with the ``--optimize`` flag, e.g.:
+
+    .. code-block:: bash
+
+        $ ./rcstack cli cmd rhodecode-index --instance-name=rc-idx --optimize
+
+    This should be executed regularly; once a week is recommended. When using ElasticSearch this step can be skipped.
+
+
+.. _set-index:
+
+Schedule the Indexer
+^^^^^^^^^^^^^^^^^^^^
+
+To schedule the indexer, configure the crontab file to run the indexer inside
+your |RCT| virtualenv using the following steps.
+
+1. Open the crontab file, using ``crontab -e``.
+2. Add the indexer to the crontab, and schedule it to run as regularly as you
+   wish.
+3. Save the file.
+
+.. code-block:: bash
+
+    $ crontab -e
+
+    # The command can be called using its full path (cron does not start in the
+    # rcstack directory), so for example you can put these entries into the crontab
+
+    # Run the indexer daily at 4am using the default mapping settings, --no-tty is required for non-interactive calls
+    0 4 * * * ./rcstack cli cmd rhodecode-index --no-tty --config=/etc/rhodecode/conf/.rhoderc --instance-name=rc-idx
+
+    # Run the indexer every Sunday at 3am using default mapping
+    0 3 * * 0 ./rcstack cli cmd rhodecode-index --no-tty --config=/etc/rhodecode/conf/.rhoderc --instance-name=rc-idx
+
+    # Run the indexer every 15 minutes
+    # using a specially configured mapping file
+    */15 * * * * ./rcstack cli cmd rhodecode-index --no-tty --config=/etc/rhodecode/conf/.rhoderc --instance-name=rc-idx --mapping=/etc/rhodecode/conf/search_mapping.ini
+
+.. _advanced-indexing:
+
+Advanced Indexing
+^^^^^^^^^^^^^^^^^
+
+
+Force Re-Indexing single repository
++++++++++++++++++++++++++++++++++++
+
+It is often necessary to re-index a whole repository because of repository changes,
+or to remove indexed secrets or files. There's a special ``--repo-name=`` flag
+for the indexer that limits execution to a single repository. For example, to force a re-index of a
+single repository, a call such as this can be made:
+
+.. code-block:: bash
+
+    ./rcstack cli cmd rhodecode-index --no-tty --config=/etc/rhodecode/conf/.rhoderc --instance-name=rc-idx --force --repo-name=rhodecode-vcsserver
+
+Limiting indexing to small number of repos
+++++++++++++++++++++++++++++++++++++++++++
+
+To limit memory usage and system load, you can cap the number of repositories processed on each call.
+There's a special ``--repo-limit=`` flag for the indexer that limits execution to N repositories.
+
+.. code-block:: bash
+
+    ./rcstack cli cmd rhodecode-index --no-tty --config=/etc/rhodecode/conf/.rhoderc --instance-name=rc-idx --repo-limit=10
+
+
+Removing repositories from index
+++++++++++++++++++++++++++++++++
+
+The indexer automatically removes renamed repositories and builds an index for the new names.
+In the same way, if a repository listed in mapping.ini is not reported as existing by the
+server, it is removed from the index.
+If you wish to remove an indexed repository manually, a call such as this allows that:
+
+.. code-block:: bash
+
+    ./rcstack cli cmd rhodecode-index --no-tty --config=/etc/rhodecode/conf/.rhoderc --instance-name=rc-idx --remove-only --repo-name=rhodecode-vcsserver
+
+
+Using search_mapping.ini file for advanced index rules
+++++++++++++++++++++++++++++++++++++++++++++++++++++++
+
+By default rhodecode-index processes all repositories and all files, with parsing limits
+defined by the CLI default arguments. You can change those limits by passing
+flags such as ``--max-filesize=2048kb`` or ``--repo-limit=10``.
+
+For more advanced execution logic it's possible to use a configuration file that
+defines detailed rules for which repositories are indexed and how.
+
+To create the :file:`search_mapping.ini` file manually, use the command below:
+
+.. code-block:: bash
+
+    ./rcstack cli cmd rhodecode-index --config=/etc/rhodecode/conf/.rhoderc --instance-name=rc-idx \
+        --create-mapping=/etc/rhodecode/conf/search_mapping.ini
+
+
+The indexer can now be executed with the ``--mapping`` flag.
+
+
+Here's a detailed example of a :file:`search_mapping.ini` file.
+
+.. code-block:: ini
+
+    [__DEFAULT__]
+    ; Create index on commits data, and files data in this order. Available options
+    ; are `commits`, `files`
+    index_types = commits,files
+
+    ; Commit fetch limit. The chunk size in which commits are fetched
+    ; via the API and parsed. This lets the server transfer smaller chunks and reduces its load
+    commit_fetch_limit = 1000
+
+    ; Commit process limit. Limit the number of commits the indexer should fetch, and
+    ; store inside the full text search index. e.g. if a repo has 2000 commits, and
+    ; limit is 1000, on the first run it will process commits 0-1000 and on the
+    ; second 1000-2000 commits. Helps reduce memory usage, default is 50000
+    ; (set -1 for unlimited)
+    commit_process_limit = 20000
+
+    ; Limit of how many repositories each run can process, default is -1 (unlimited)
+    ; in case of 1000s of repositories it's better to execute in chunks to not overload
+    ; the server.
+    repo_limit = -1
+
+    ; Default patterns for indexing files and content of files. Binary files
+    ; are skipped by default.
+
+    ; Add to index those comma separated files; glob syntax
+    ; e.g index_files = *.py, *.c, *.h, *.js
+    index_files = *,
+
+    ; Do not add to index those comma separated files, this excludes
+    ; both search by name and content; glob syntax
+    ; e.g skip_files = *.key, *.sql, *.xml, *.pem, *.crt
+    skip_files = ,
+
+    ; Add to index content of those comma separated files; glob syntax
+    ; e.g index_files_content = *.h, *.obj
+    index_files_content = *,
+
+    ; Do not add to index content of those comma separated files; glob syntax
+    ; Binary files are not indexed by default.
+    ; e.g skip_files_content = *.min.js, *.xml, *.dump, *.log
+    skip_files_content = ,
+
+    ; Force rebuilding an index from scratch. Each repository will be rebuilt from
+    ; scratch with a global flag. Use --repo-name=NAME --force to rebuild a single repo
+    force = false
+
+    ; Maximum file size that the indexer will use; files above that limit are not going
+    ; to have their content indexed.
+    ; Possible options are KB (kilobytes), MB (megabytes), e.g. 1MB or 1024KB
+    max_filesize = 10MB
+
+
+    [__INDEX_RULES__]
+    ; Ordered match rules for repositories. A list of all repositories will be fetched
+    ; using API and this list will be filtered using those rules.
+    ; Syntax for entry: `glob_pattern_OR_full_repo_name = 0 OR 1` where 0=exclude, 1=include
+    ; When this ordered list is traversed, the first match returns the include/exclude marker
+    ; For example:
+    ;   upstream/binary_repo = 0
+    ;   upstream/subrepo/xml_files = 0
+    ;   upstream/* = 1
+    ;   special-repo = 1
+    ;   * = 0
+    ; This will index special-repo and all repositories under upstream/*, but skip upstream/binary_repo
+    ; and upstream/subrepo/xml_files; the last * = 0 means skip all other matches
+
+
+    ; == EXPLICIT REPOSITORY INDEXING ==
+    ; If defined this will skip using __INDEX_RULES__, and will not use API to fetch
+    ; list of repositories. It will explicitly take names defined with [NAME] format and
+    ; try to build the index. To build index just for repo_name_1 and special-repo use:
+    ;   [repo_name_1]
+    ;   [special-repo]
+
+    ; == PER REPOSITORY CONFIGURATION ==
+    ; This allows overriding the global configuration per repository.
+    ; Example to set a specific file limit, and skip certain files for repository special-repo;
+    ; the CLI flags don't override the conf settings.
+    ;   [conf:special-repo]
+    ;   max_filesize = 5mb
+    ;   skip_files = *.xml, *.sql
+
+
+
+With 1000s of repositories it can be tricky to write the include/exclude rules at first.
+There's a special flag to test the mapping file rules and list repositories that would
+be indexed. Run the indexer with ``--show-matched-repos`` to list only the
+repositories matched by the rules defined in the .ini file:
+
+.. code-block:: bash
+
+    ./rcstack cli cmd rhodecode-index --no-tty --config=/etc/rhodecode/conf/.rhoderc --instance-name=rc-idx --show-matched-repos --mapping=/etc/rhodecode/conf/search_mapping.ini
+
+
+.. _enable-elasticsearch:
+
+Enabling ElasticSearch
+^^^^^^^^^^^^^^^^^^^^^^
+
+ElasticSearch is available in the EE edition only. It provides much more scalable and advanced
+search capabilities. While Whoosh is fine for up to 1-2GB of data, beyond that amount it
+starts slowing down, and can cause other problems.
+ElasticSearch 6 also provides a much more advanced query language.
+It allows advanced filtering by file paths and extensions, OR statements, ranges, etc.
+Please check the query language examples in the search field for some advanced query language usage.
+
+
+1. Open the :file:`rhodecode.ini` file for the instance you wish to edit. The
+   default location is :file:`config/_shared/rhodecode.ini`
+2. Find the search configuration section:
+
+.. code-block:: ini
+
+    ###################################
+    ## SEARCH INDEXING CONFIGURATION ##
+    ###################################
+
+    search.module = rhodecode.lib.index.whoosh
+    search.location = %(here)s/data/index
+
+and change it to:
+
+.. code-block:: ini
+
+    search.module = rc_elasticsearch
+    search.location = http://elasticsearch:9200
+    ## specify Elastic Search version, 6 for latest or 2 for legacy
+    search.es_version = 6
+
+where ``search.location`` points to the ElasticSearch server,
+by default running on port 9200.
+
+The indexer invocation also needs to change. Please provide the ``--es-version=`` and
+``--engine-location=`` parameters to define the ElasticSearch server location and its version.
+For example::
+
+    --instance-name=rc-idx --es-version=6 --engine-location=http://elasticsearch:9200
+
+
+.. _Whoosh: https://pypi.python.org/pypi/Whoosh/
+.. _ElasticSearch 6: https://www.elastic.co/