feat(docs): added comprehensive full text search doc guide
@@ -0,0 +1,390 b''
1 .. _full-text-search-setup:
2
3 Full-text Search
4 ----------------
5
6 RhodeCode provides full-text search capabilities for searching inside file content,
7 commit messages, and file paths. Indexing is not enabled by default; to use
8 full-text search, you must first build an index.
9
10 By default RhodeCode is configured to use `Whoosh`_ to index |repos| and
11 provide full-text search. `Whoosh`_ works well for a small amount of data and
12 should not be used for large code bases or a large number of repositories.
13
14 |RCE| also provides support for `ElasticSearch 6`_ as a backend for more advanced
15 and scalable search.
16
17
18 Auth Token generation
19 ^^^^^^^^^^^^^^^^^^^^^
20
21 The RhodeCode indexer runs on top of the |RCE| API and requires an |authtoken|.
22 To run the indexer you need an |authtoken| with *admin* rights to all |repos| that the indexer should
23 process.
24
25 To get your API token, click the icon with your user in the top right corner of the |RCE| interface
26 and go to :menuselection:`your-username --> My Account --> Auth tokens`, then:
27
28 1. Enter a description for the |authtoken|
29 2. Select an expiration date if desired
30 3. Select the `api calls` role for the token
31 4. Click :guilabel:`Add`
32 5. Click on the obfuscated generated token, and copy it.
33
34
35 Indexing
36 ^^^^^^^^
37
38 You can set up the indexer to index repositories stored in RhodeCode in a
39 number of ways, for example:
40
41 * Call the indexer via a cron job. We recommend running this once at night.
42 If you need content indexed sooner, you can run the indexer several
43 times during the day. The indexer has a locking mechanism that prevents
44 two instances from running at once, so it is safe to run it as often as every hour.
45 * Hook the indexer up with your CI server to reindex after each push.
46 * Set the indexer to loop indefinitely and reindex as soon as it has finished its previous cycle.
47 This gives near-instant indexing, with content available seconds after changes happen.
48
49 The indexer works by indexing new commits added since the last run, and comparing
50 file changes to index only new or modified files across each invocation.
51
52 .. note::
53
54 If you wish to build a brand new index from scratch each time, use the ``force``
55 option in the configuration file, or run the indexer with the ``--force`` flag.
56
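The looping option listed above can be approximated with a small wrapper. This is a minimal sketch, not a built-in feature of the indexer: the function name ``reindex_loop`` and the ``INDEX_PAUSE`` and ``MAX_CYCLES`` variables are hypothetical names introduced here for illustration.

```bash
#!/usr/bin/env bash
# Hypothetical wrapper: rerun a command as soon as the previous cycle ends.
# INDEX_PAUSE (seconds between cycles) and MAX_CYCLES (0 = loop forever)
# are assumptions of this sketch, not indexer options.
reindex_loop() {
  local pause="${INDEX_PAUSE:-60}" max="${MAX_CYCLES:-0}" cycles=0
  while true; do
    # Run the given command; report a failure but keep looping.
    "$@" || echo "indexer run failed with status $?" >&2
    cycles=$((cycles + 1))
    if [ "$max" -gt 0 ] && [ "$cycles" -ge "$max" ]; then
      return 0
    fi
    sleep "$pause"
  done
}
```

It could be called as ``reindex_loop ./rcstack cli cmd rhodecode-index --no-tty --config=/etc/rhodecode/conf/.rhoderc --instance-name=rc-idx``. Because the indexer only processes new commits on each run, short pauses should stay cheap once the initial index is built.
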
57
58 To set up indexing, use the following steps:
59
60 1. :ref:`config-rhoderc`
61 2. :ref:`run-index`
62 3. :ref:`set-index`
63 4. :ref:`advanced-indexing`
64
65
66 .. _config-rhoderc:
67
68 Configure the ``.rhoderc`` File
69 ^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^
70
71 .. note::
72
73 It's also possible to use the indexer without the ``.rhoderc`` file. Instead of
74 executing with `--instance-name=rc-idx`, provide the host and token
75 directly: `--api-host=https://your-host.example.com --api-key=<auth-token-goes-here>`
76
77
78 .. note::
79
80 In some cases the domain may only be resolvable via a custom DNS. You can always refer to the
81 instance by its Docker name and port (`http://rhodecode:10020`) instead of the hostname, for example:
82
83 .. code-block:: bash
84
85 ./rcstack cli cmd rhodecode-index --api-host=http://rhodecode:10020 --api-key=xxx
86
87
88 The indexer uses the :file:`/home/{user}/.rhoderc` file for connection details
89 to |RCE| instances. You need to configure the details for each instance you want to index.
90
91
92 .. code-block:: bash
93
94 ./rcstack cli cmd rhodecode-setup-config \
95 --filename=/etc/rhodecode/conf/.rhoderc \
96 --instance-name=rc-idx api_host=https://your-host.example.com,api_key=<auth-token-goes-here>
97
98
99 Here's an example of the generated config, which you can also mount as a file into the Docker container.
100
101 .. code-block:: ini
102
103 # Configure .rhoderc with matching details
104 # This allows the indexer to connect to the instance
105 [instance:rc-idx]
106 api_host = https://your-host.example.com
107 api_key = <auth token goes here>
108
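If you prefer to write the file by hand rather than with ``rhodecode-setup-config``, the config shown above can be produced by a few lines of shell. This is a sketch under assumptions: ``write_rhoderc`` is a hypothetical helper, and restricting the file to its owner is a suggestion made here because the file holds an |authtoken|.

```bash
#!/usr/bin/env bash
# Hypothetical helper: write_rhoderc FILE INSTANCE API_HOST API_KEY
# Produces the same [instance:NAME] section as the example config.
write_rhoderc() {
  local file="$1" instance="$2" api_host="$3" api_key="$4"
  # Create the file with a restrictive umask, since it stores a token.
  (
    umask 077
    printf '[instance:%s]\napi_host = %s\napi_key = %s\n' \
      "$instance" "$api_host" "$api_key" > "$file"
  )
  # Also tighten permissions in case the file already existed.
  chmod 600 "$file"
}
```

For example: ``write_rhoderc /etc/rhodecode/conf/.rhoderc rc-idx https://your-host.example.com "<auth-token-goes-here>"``.
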
109
110 .. _run-index:
111
112
113 Run the Indexer
114 ^^^^^^^^^^^^^^^
115
116 Run the indexer using the following command, and specify the instance you want to index:
117
118 .. code-block:: bash
119
120 # Using the default, simplest indexing of all repositories
121 $ ./rcstack cli cmd rhodecode-index \
122 --no-tty --config=/etc/rhodecode/conf/.rhoderc \
123 --instance-name=rc-idx
124
125 # Using a custom mapping file and invocation without ``.rhoderc``
126 $ ./rcstack cli cmd rhodecode-index \
127 --no-tty \
128 --api-host=https://your-host.example.com --api-key=xxxxx \
129 --mapping=/etc/rhodecode/conf/search_mapping.ini
130
131 # Using a custom mapping file with indexing rules, and using elasticsearch 6 backend
132 $ ./rcstack cli cmd rhodecode-index \
133 --no-tty --config=/etc/rhodecode/conf/.rhoderc \
134 --instance-name=rc-idx \
135 --mapping=/etc/rhodecode/conf/search_mapping.ini \
136 --es-version=6 --engine-location=http://elasticsearch:9200
137
138 # For advanced usage, check the --help flag to see what other CLI options are available
139 $ ./rcstack cli cmd rhodecode-index --help
140
141 .. note::
142
143 When indexing often with the Whoosh backend, the index may become fragmented. The most common symptom
144 is a `too many open files` error. To fix this, run the indexer with the `--optimize` flag, e.g.:
145
146 .. code-block:: bash
147
148 $ ./rcstack cli cmd rhodecode-index --instance-name=rc-idx --optimize
149
150 Run this regularly; once a week is recommended. When using ElasticSearch this step can be skipped.
151
152
153 .. _set-index:
154
155 Schedule the Indexer
156 ^^^^^^^^^^^^^^^^^^^^
157
158 To schedule the indexer, configure the crontab file to run the indexer inside
159 your |RCT| virtualenv using the following steps.
160
161 1. Open the crontab file, using ``crontab -e``.
162 2. Add the indexer to the crontab, and schedule it to run as regularly as you
163 wish.
164 3. Save the file.
165
166 .. code-block:: bash
167
168 $ crontab -e
169
170 # The virtualenv can be called using its full path, so for example you can
171 # put this example into the crontab
172
173 # Run the indexer daily at 4am using the default mapping settings, --no-tty is required for non-interactive calls
174 0 4 * * * ./rcstack cli cmd rhodecode-index --no-tty --config=/etc/rhodecode/conf/.rhoderc --instance-name=rc-idx
175
176 # Run the indexer every Sunday at 3am using default mapping
177 0 3 * * 0 ./rcstack cli cmd rhodecode-index --no-tty --config=/etc/rhodecode/conf/.rhoderc --instance-name=rc-idx
178
179 # Run the indexer every 15 minutes
180 # using a specially configured mapping file
181 */15 * * * * ./rcstack cli cmd rhodecode-index --no-tty --config=/etc/rhodecode/conf/.rhoderc --instance-name=rc-idx --mapping=/etc/rhodecode/conf/search_mapping.ini
182
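Cron normally starts jobs from the user's home directory, so a relative ``./rcstack`` path only resolves if rcstack is located there. One way around this is a small wrapper that changes into the right directory first. This is a sketch: ``run_index``, the ``RCSTACK_DIR`` variable, and its ``/opt/rcstack`` default are hypothetical and need to be adapted to your installation.

```bash
#!/usr/bin/env bash
# Hypothetical wrapper for cron: run rhodecode-index from the directory
# that contains ./rcstack. RCSTACK_DIR and its default are assumptions.
run_index() {
  local rcstack_dir="${RCSTACK_DIR:-/opt/rcstack}"
  (
    # The subshell keeps the directory change local to this call.
    cd "$rcstack_dir" || exit 1
    ./rcstack cli cmd rhodecode-index --no-tty \
      --config=/etc/rhodecode/conf/.rhoderc \
      --instance-name=rc-idx "$@"
  )
}
```

Extra flags are passed through, so ``run_index --mapping=/etc/rhodecode/conf/search_mapping.ini`` runs the indexer with a custom mapping file.
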
183 .. _advanced-indexing:
184
185 Advanced Indexing
186 ^^^^^^^^^^^^^^^^^
187
188
189 Force Re-Indexing single repository
190 +++++++++++++++++++++++++++++++++++
191
192 It is often necessary to re-index a whole repository because of repository changes,
193 or to remove indexed secrets or files. The indexer has a `--repo-name=` flag
194 that limits execution to a single repository. For example, to force a re-index of a
195 single repository, run:
196
197 .. code-block:: bash
198
199 ./rcstack cli cmd rhodecode-index --no-tty --config=/etc/rhodecode/conf/.rhoderc --instance-name=rc-idx --force --repo-name=rhodecode-vcsserver
200
201 Limiting indexing to small number of repos
202 ++++++++++++++++++++++++++++++++++++++++++
203
204 To limit memory usage and system load, you can cap the number of repositories processed on each call.
205 The indexer has a `--repo-limit=` flag that limits execution to N repositories.
206
207 .. code-block:: bash
208
209 ./rcstack cli cmd rhodecode-index --no-tty --config=/etc/rhodecode/conf/.rhoderc --instance-name=rc-idx --repo-limit=10
210
211
212 Removing repositories from index
213 ++++++++++++++++++++++++++++++++
214
215 The indexer automatically removes renamed repositories and builds the index for their new names.
216 In the same way, if a repository listed in mapping.ini is not reported as existing by the
217 server, it is removed from the index.
218 To remove an indexed repository manually, run:
219
220 .. code-block:: bash
221
222 ./rcstack cli cmd rhodecode-index --no-tty --config=/etc/rhodecode/conf/.rhoderc --instance-name=rc-idx --remove-only --repo-name=rhodecode-vcsserver
223
224
225 Using search_mapping.ini file for advanced index rules
226 ++++++++++++++++++++++++++++++++++++++++++++++++++++++
227
228 By default rhodecode-index runs for all repositories and all files, with parsing limits
229 defined by the CLI default arguments. You can change those limits by passing
230 flags such as `--max-filesize=2048kb` or `--repo-limit=10`.
231
232 For more advanced execution logic, you can use a configuration file that
233 defines detailed rules for which repositories should be indexed, and how.
234
235 To create the :file:`search_mapping.ini` file, use the command below:
236
237 .. code-block:: bash
238
239 ./rcstack cli cmd rhodecode-index --config=/etc/rhodecode/conf/.rhoderc --instance-name=rc-idx \
240 --create-mapping=/etc/rhodecode/conf/search_mapping.ini
241
242
243 Now the indexer can be executed with the `--mapping` flag.
244
245
246 Here's a detailed example of a :file:`search_mapping.ini` file.
247
248 .. code-block:: ini
249
250 [__DEFAULT__]
251 ; Create index on commits data, and files data in this order. Available options
252 ; are `commits`, `files`
253 index_types = commits,files
254
255 ; Commit fetch limit. The chunk size in which commits are fetched
256 ; via the API and parsed. This lets the server transfer smaller chunks and stay less loaded
257 commit_fetch_limit = 1000
258
259 ; Commit process limit. Limit the number of commits indexer should fetch, and
260 ; store inside the full text search index. eg. if repo has 2000 commits, and
261 ; limit is 1000, on the first run it will process commits 0-1000 and on the
262 ; second 1000-2000 commits. Helps reduce memory usage, default is 50000
263 ; (set -1 for unlimited)
264 commit_process_limit = 20000
265
266 ; Limit of how many repositories each run can process, default is -1 (unlimited)
267 ; in case of 1000s of repositories it's better to execute in chunks to not overload
268 ; the server.
269 repo_limit = -1
270
271 ; Default patterns for indexing files and content of files. Binary files
272 ; are skipped by default.
273
274 ; Add to index those comma separated files; globs syntax
275 ; e.g index_files = *.py, *.c, *.h, *.js
276 index_files = *,
277
278 ; Do not add to index those comma separated files, this excludes
279 ; both search by name and content; globs syntax
280 ; e.g skip_files = *.key, *.sql, *.xml, *.pem, *.crt
281 skip_files = ,
282
283 ; Add to index content of those comma separated files; globs syntax
284 ; e.g index_files_content = *.h, *.obj
285 index_files_content = *,
286
287 ; Do not add to index content of those comma separated files; globs syntax
288 ; Binary files are not indexed by default.
289 ; e.g skip_files_content = *.min.js, *.xml, *.dump, *.log
290 skip_files_content = ,
291
292 ; Force rebuilding an index from scratch. Each repository will be rebuilt from
293 ; scratch with a global flag. Use --repo-name=NAME --force to rebuild a single repo
294 force = false
295
296 ; Maximum file size that the indexer will use; files above that limit are not going
297 ; to have their content indexed.
298 ; Possible options are KB (kilobytes), MB (megabytes), e.g. 1MB or 1024KB
299 max_filesize = 10MB
300
301
302 [__INDEX_RULES__]
303 ; Ordered match rules for repositories. A list of all repositories will be fetched
304 ; using API and this list will be filtered using those rules.
305 ; Syntax for entry: `glob_pattern_OR_full_repo_name = 0 OR 1` where 0=exclude, 1=include
306 ; When this ordered list is traversed, the first match determines the include/exclude marker
307 ; For example:
308 ; upstream/binary_repo = 0
309 ; upstream/subrepo/xml_files = 0
310 ; upstream/* = 1
311 ; special-repo = 1
312 ; * = 0
313 ; This will index all repositories under upstream/*, but skip upstream/binary_repo
314 ; and upstream/subrepo/xml_files; the last * = 0 means skip all other matches
315
316
317 ; == EXPLICIT REPOSITORY INDEXING ==
318 ; If defined this will skip using __INDEX_RULES__, and will not use API to fetch
319 ; list of repositories, it will explicitly take names defined with [NAME] format and
320 ; try to build the index, to build index just for repo_name_1 and special-repo use:
321 ; [repo_name_1]
322 ; [special-repo]
323
324 ; == PER REPOSITORY CONFIGURATION ==
325 ; This allows overriding the global configuration per repository.
326 ; example to set specific file limit, and skip certain files for repository special-repo
327 ; the CLI flags don't override the conf settings.
328 ; [conf:special-repo]
329 ; max_filesize = 5mb
330 ; skip_files = *.xml, *.sql
331
332
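The first-match behaviour of ``[__INDEX_RULES__]`` described in the comments above can be illustrated with plain shell globbing. This is only a sketch of the idea, not the indexer's implementation: ``RULES`` and ``rule_for_repo`` are hypothetical names, excluding a repository that matches no rule is an assumption, and bash's ``*`` also matches ``/``, which may or may not agree with the indexer's own glob handling.

```bash
#!/usr/bin/env bash
# Ordered rules as "pattern=flag", mirroring the commented example:
# 1 = include, 0 = exclude.
RULES=(
  "upstream/binary_repo=0"
  "upstream/subrepo/xml_files=0"
  "upstream/*=1"
  "special-repo=1"
  "*=0"
)

# Print the flag of the first rule whose glob matches the repository name.
# Prints 0 when no rule matches (an assumption of this sketch).
rule_for_repo() {
  local repo="$1" rule pattern flag
  for rule in "${RULES[@]}"; do
    pattern="${rule%=*}"   # text before the last "="
    flag="${rule##*=}"     # text after the last "="
    # An unquoted right-hand side makes [[ ]] treat it as a glob pattern.
    if [[ "$repo" == $pattern ]]; then
      echo "$flag"
      return 0
    fi
  done
  echo 0
}
```

Because the rules are evaluated in order, the specific exclusions must come before the broader ``upstream/*`` include. Moving ``upstream/*=1`` to the top of the list would cause ``upstream/binary_repo`` to be included.
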
333
334 With thousands of repositories it can be tricky to write the include/exclude rules at first.
335 There's a special flag to test the mapping file rules and list the repositories that would
336 be indexed. Run the indexer with `--show-matched-repos` to list only the
337 repositories matched by the rules defined in the .ini file:
338
339 .. code-block:: bash
340
341 ./rcstack cli cmd rhodecode-index --no-tty --config=/etc/rhodecode/conf/.rhoderc --instance-name=rc-idx --show-matched-repos --mapping=/etc/rhodecode/conf/search_mapping.ini
342
343
344 .. _enable-elasticsearch:
345
346 Enabling ElasticSearch
347 ^^^^^^^^^^^^^^^^^^^^^^
348
349 ElasticSearch is available in the EE edition only. It provides much more scalable and more advanced
350 search capabilities. While Whoosh is fine for up to 1-2GB of data, beyond that amount it
351 starts slowing down, and can cause other problems.
352 ElasticSearch 6 also provides a much more advanced query language.
353 It allows advanced filtering by file paths and extensions, OR statements, ranges, etc.
354 Please check the query language examples in the search field for some advanced query language usage.
355
356
357 1. Open the :file:`rhodecode.ini` file for the instance you wish to edit. The
358 default location is :file:`config/_shared/rhodecode.ini`
359 2. Find the search configuration section:
360
361 .. code-block:: ini
362
363 ###################################
364 ## SEARCH INDEXING CONFIGURATION ##
365 ###################################
366
367 search.module = rhodecode.lib.index.whoosh
368 search.location = %(here)s/data/index
369
370 and change it to:
371
372 .. code-block:: ini
373
374 search.module = rc_elasticsearch
375 search.location = http://elasticsearch:9200
376 ## specify Elastic Search version, 6 for latest or 2 for legacy
377 search.es_version = 6
378
379 where ``search.location`` points to the ElasticSearch server,
380 by default running on port 9200.
381
382 The indexer invocation also needs to change. Provide the `--es-version=` and
383 `--engine-location=` parameters to define the ElasticSearch server location and its version.
384 For example::
385
386 --instance-name=rc-idx --es-version=6 --engine-location=http://elasticsearch:9200
387
388
389 .. _Whoosh: https://pypi.python.org/pypi/Whoosh/
390 .. _ElasticSearch 6: https://www.elastic.co/