@@ -0,0 +1,390 b'' | |||||
|
1 | .. _full-text-search-setup: | |||
|
2 | ||||
|
3 | Full-text Search | |||
|
4 | ---------------- | |||
|
5 | ||||
|
6 | RhodeCode provides full-text search capabilities to search inside file content, | |||
|
7 | commit messages, and file paths. Indexing is not enabled by default; to use | |||
|
8 | full-text search, you must first build an index. | |||
|
9 | ||||
|
10 | By default RhodeCode is configured to use `Whoosh`_ to index |repos| and | |||
|
11 | provide full-text search. `Whoosh`_ works well for a small amount of data and | |||
|
12 | should not be used for large code bases or large numbers of repositories. | |||
|
13 | ||||
|
14 | |RCE| also provides support for `ElasticSearch 6`_ as a backend for more advanced | |||
|
15 | and scalable search. | |||
|
16 | ||||
|
17 | ||||
|
18 | Auth Token generation | |||
|
19 | ^^^^^^^^^^^^^^^^^^^^^ | |||
|
20 | ||||
|
21 | The RhodeCode indexer runs on top of the |RCE| API and requires an |authtoken|. | |||
|
22 | To run the indexer you need an |authtoken| with *admin* rights to all of the |repos| that the indexer should | |||
|
23 | process. | |||
|
24 | ||||
|
25 | To get your API token, click the icon with your user in the top right corner of the |RCE| interface, | |||
|
26 | go to :menuselection:`your-username --> My Account --> Auth tokens`, and then: | |||
|
27 | ||||
|
28 | 1. Enter a description for the |authtoken|. | |||
|
29 | 2. Select an expiration date, if desired. | |||
|
30 | 3. Select the `api calls` role for the token. | |||
|
31 | 4. Click :guilabel:`Add` | |||
|
32 | 5. Click on the obfuscated generated token, and copy it. | |||
|
33 | ||||
|
34 | ||||
|
35 | Indexing | |||
|
36 | ^^^^^^^^ | |||
|
37 | ||||
|
38 | To index repositories stored in RhodeCode, you have the option to set the indexer up in a | |||
|
39 | number of ways, for example: | |||
|
40 | ||||
|
41 | * Call the indexer via a cron job. We recommend running this once at night. | |||
|
42 | If you need content indexed sooner, you can run the indexer several | |||
|
43 | times during the day. The indexer has a locking mechanism that prevents | |||
|
44 | two instances from running at once, so it is safe to run it as often as every hour. | |||
|
45 | * Hook the indexer up with your CI server to reindex after each push. | |||
|
46 | * Set the indexer to loop indefinitely and reindex as soon as it has finished its previous cycle. | |||
|
47 | This gives near-instant indexing: content becomes searchable seconds after changes happen. | |||
|
48 | ||||
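The looping option from the list above could be sketched as a small wrapper
function. This is a minimal sketch and not part of RhodeCode: the names
``run_cycles``, ``INDEX_CMD`` and ``PAUSE_SECONDS`` are hypothetical, and only the
indexer flags are taken from this document.

.. code-block:: bash

    # Hypothetical wrapper: re-run the indexer as soon as the previous cycle ends.
    # INDEX_CMD and PAUSE_SECONDS are illustrative names, not RhodeCode settings.
    INDEX_CMD=${INDEX_CMD:-"./rcstack cli cmd rhodecode-index --no-tty --config=/etc/rhodecode/conf/.rhoderc --instance-name=rc-idx"}
    PAUSE_SECONDS=${PAUSE_SECONDS:-5}

    # run_cycles N: run the indexer N times in a row; N=0 means loop forever.
    run_cycles() {
        local max=${1:-0}
        local done_cycles=0
        while [ "$max" -eq 0 ] || [ "$done_cycles" -lt "$max" ]; do
            # A failed cycle is reported on stderr but does not stop the loop.
            $INDEX_CMD || echo "indexer exited with status $?" >&2
            done_cycles=$((done_cycles + 1))
            sleep "$PAUSE_SECONDS"
        done
    }

Calling ``run_cycles 0`` loops until the process is stopped, while ``run_cycles 3``
runs three cycles and returns, which can be handy for a dry run.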
|
49 | The indexer works by indexing new commits added since the last run, and comparing | |||
|
50 | file changes to index only new or modified files across each invocation. | |||
|
51 | ||||
|
52 | .. note:: | |||
|
53 | ||||
|
54 | If you wish to build a brand new index from scratch each time, use the ``force`` | |||
|
55 | option in the configuration file, or run the indexer with the ``--force`` flag. | |||
|
56 | ||||
|
57 | ||||
|
58 | To set up indexing, use the following steps: | |||
|
59 | ||||
|
60 | 1. :ref:`config-rhoderc` | |||
|
61 | 2. :ref:`run-index` | |||
|
62 | 3. :ref:`set-index` | |||
|
63 | 4. :ref:`advanced-indexing` | |||
|
64 | ||||
|
65 | ||||
|
66 | .. _config-rhoderc: | |||
|
67 | ||||
|
68 | Configure the ``.rhoderc`` File | |||
|
69 | ^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^ | |||
|
70 | ||||
|
71 | .. note:: | |||
|
72 | ||||
|
73 | Optionally, the indexer can be used without the ``.rhoderc`` file. Instead of | |||
|
74 | passing `--instance-name=rc-idx`, provide the host and token | |||
|
75 | directly: `--api-host=https://your-host.example.com --api-key=<auth-token-goes-here>` | |||
|
76 | ||||
|
77 | ||||
|
78 | .. note:: | |||
|
79 | ||||
|
80 | In some cases the domain may only be resolvable via custom DNS. You can always refer to the | |||
|
81 | instance by its Docker name and port (`http://rhodecode:10020`) instead of the hostname, for example: | |||
|
82 | ||||
|
83 | .. code-block:: bash | |||
|
84 | ||||
|
85 | ./rcstack cli cmd rhodecode-index --api-host=http://rhodecode:10020 --api-key=xxx | |||
|
86 | ||||
|
87 | ||||
|
88 | The indexer uses the :file:`/home/{user}/.rhoderc` file for connection details | |||
|
89 | to |RCE| instances. You need to configure the details for each instance you want to index. | |||
|
90 | ||||
|
91 | ||||
|
92 | .. code-block:: bash | |||
|
93 | ||||
|
94 | ./rcstack cli cmd rhodecode-setup-config \ | |||
|
95 | --filename=/etc/rhodecode/conf/.rhoderc \ | |||
|
96 | --instance-name=rc-idx api_host=https://your-host.example.com,api_key=<auth-token-goes-here> | |||
|
97 | ||||
|
98 | ||||
|
99 | Here's an example of the generated config, which you can also mount as a file into the Docker image. | |||
|
100 | ||||
|
101 | .. code-block:: ini | |||
|
102 | ||||
|
103 | # Configure .rhoderc with matching details | |||
|
104 | # This allows the indexer to connect to the instance | |||
|
105 | [instance:rc-idx] | |||
|
106 | api_host = https://your-host.example.com | |||
|
107 | api_key = <auth token goes here> | |||
|
108 | ||||
|
109 | ||||
|
110 | .. _run-index: | |||
|
111 | ||||
|
112 | ||||
|
113 | Run the Indexer | |||
|
114 | ^^^^^^^^^^^^^^^ | |||
|
115 | ||||
|
116 | Run the indexer using the following command, and specify the instance you want to index: | |||
|
117 | ||||
|
118 | .. code-block:: bash | |||
|
119 | ||||
|
120 | # Using the default, simplest indexing of all repositories | |||
|
121 | $ ./rcstack cli cmd rhodecode-index \ | |||
|
122 | --no-tty --config=/etc/rhodecode/conf/.rhoderc \ | |||
|
123 | --instance-name=rc-idx | |||
|
124 | ||||
|
125 | # Using a custom mapping file and invocation without ``.rhoderc`` | |||
|
126 | $ ./rcstack cli cmd rhodecode-index \ | |||
|
127 | --no-tty \ | |||
|
128 | --api-host=https://your-host.example.com --api-key=xxxxx \ | |||
|
129 | --mapping=/etc/rhodecode/conf/search_mapping.ini | |||
|
130 | ||||
|
131 | # Using a custom mapping file with indexing rules, and using elasticsearch 6 backend | |||
|
132 | $ ./rcstack cli cmd rhodecode-index \ | |||
|
133 | --no-tty --config=/etc/rhodecode/conf/.rhoderc \ | |||
|
134 | --instance-name=rc-idx \ | |||
|
135 | --mapping=/etc/rhodecode/conf/search_mapping.ini \ | |||
|
136 | --es-version=6 --engine-location=http://elasticsearch:9200 | |||
|
137 | ||||
|
138 | # For more advanced usage, check the --help flag to see what other CLI options are available | |||
|
139 | $ ./rcstack cli cmd rhodecode-index --help | |||
|
140 | ||||
|
141 | .. note:: | |||
|
142 | ||||
|
143 | When indexing frequently with the Whoosh backend, the index may become fragmented. The most common symptom | |||
|
144 | is an error about `too many open files`. To fix this, run the indexer with the `--optimize` flag, e.g. | |||
|
145 | ||||
|
146 | .. code-block:: bash | |||
|
147 | ||||
|
148 | $ ./rcstack cli cmd rhodecode-index --instance-name=rc-idx --optimize | |||
|
149 | ||||
|
150 | Run this regularly; once a week is recommended. When using ElasticSearch this step can be skipped. | |||
|
151 | ||||
|
152 | ||||
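A weekly optimization run could be scheduled with a crontab entry along these
lines. The schedule (Saturday at 2am) is an arbitrary assumption for illustration;
the flags are the ones shown in this document.

.. code-block:: bash

    # Optimize the Whoosh index every Saturday at 2am (example schedule only)
    0 2 * * 6 ./rcstack cli cmd rhodecode-index --no-tty --config=/etc/rhodecode/conf/.rhoderc --instance-name=rc-idx --optimize
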
|
153 | .. _set-index: | |||
|
154 | ||||
|
155 | Schedule the Indexer | |||
|
156 | ^^^^^^^^^^^^^^^^^^^^ | |||
|
157 | ||||
|
158 | To schedule the indexer, configure the crontab file to run the indexer inside | |||
|
159 | your |RCT| virtualenv using the following steps. | |||
|
160 | ||||
|
161 | 1. Open the crontab file, using ``crontab -e``. | |||
|
162 | 2. Add the indexer to the crontab, and schedule it to run as regularly as you | |||
|
163 | wish. | |||
|
164 | 3. Save the file. | |||
|
165 | ||||
|
166 | .. code-block:: bash | |||
|
167 | ||||
|
168 | $ crontab -e | |||
|
169 | ||||
|
170 | # The virtualenv can be called using its full path, so for example you can | |||
|
171 | # put this example into the crontab | |||
|
172 | ||||
|
173 | # Run the indexer daily at 4am using the default mapping settings, --no-tty is required for non-interactive calls | |||
|
174 | 0 4 * * * ./rcstack cli cmd rhodecode-index --no-tty --config=/etc/rhodecode/conf/.rhoderc --instance-name=rc-idx | |||
|
175 | ||||
|
176 | # Run the indexer every Sunday at 3am using default mapping | |||
|
177 | 0 3 * * 0 ./rcstack cli cmd rhodecode-index --no-tty --config=/etc/rhodecode/conf/.rhoderc --instance-name=rc-idx | |||
|
178 | ||||
|
179 | # Run the indexer every 15 minutes | |||
|
180 | # using a specially configured mapping file | |||
|
181 | */15 * * * * ./rcstack cli cmd rhodecode-index --no-tty --config=/etc/rhodecode/conf/.rhoderc --instance-name=rc-idx --mapping=/etc/rhodecode/conf/search_mapping.ini | |||
|
182 | ||||
|
183 | .. _advanced-indexing: | |||
|
184 | ||||
|
185 | Advanced Indexing | |||
|
186 | ^^^^^^^^^^^^^^^^^ | |||
|
187 | ||||
|
188 | ||||
|
189 | Force Re-Indexing single repository | |||
|
190 | +++++++++++++++++++++++++++++++++++ | |||
|
191 | ||||
|
192 | It is often necessary to re-index a whole repository, for example after repository changes | |||
|
193 | or to remove indexed secrets or files. The indexer has a special `--repo-name=` flag | |||
|
194 | that limits execution to a single repository. For example, to force a re-index of a | |||
|
195 | single repository, run: | |||
|
196 | ||||
|
197 | .. code-block:: bash | |||
|
198 | ||||
|
199 | ./rcstack cli cmd rhodecode-index --no-tty --config=/etc/rhodecode/conf/.rhoderc --instance-name=rc-idx --force --repo-name=rhodecode-vcsserver | |||
|
200 | ||||
|
201 | Limiting indexing to small number of repos | |||
|
202 | ++++++++++++++++++++++++++++++++++++++++++ | |||
|
203 | ||||
|
204 | To reduce memory usage and system load, you can limit the number of repositories processed on each call. | |||
|
205 | The indexer has a special `--repo-limit=` flag that limits execution to N repositories. | |||
|
206 | ||||
|
207 | .. code-block:: bash | |||
|
208 | ||||
|
209 | ./rcstack cli cmd rhodecode-index --no-tty --config=/etc/rhodecode/conf/.rhoderc --instance-name=rc-idx --repo-limit=10 | |||
|
210 | ||||
|
211 | ||||
|
212 | Removing repositories from index | |||
|
213 | ++++++++++++++++++++++++++++++++ | |||
|
214 | ||||
|
215 | The indexer automatically removes renamed repositories and builds the index for the new names. | |||
|
216 | In the same way, if a repository listed in mapping.ini is not reported as existing by the | |||
|
217 | server, it is removed from the index. | |||
|
218 | If you wish to remove an indexed repository manually, use a call such as: | |||
|
219 | ||||
|
220 | .. code-block:: bash | |||
|
221 | ||||
|
222 | ./rcstack cli cmd rhodecode-index --no-tty --config=/etc/rhodecode/conf/.rhoderc --instance-name=rc-idx --remove-only --repo-name=rhodecode-vcsserver | |||
|
223 | ||||
|
224 | ||||
|
225 | Using search_mapping.ini file for advanced index rules | |||
|
226 | ++++++++++++++++++++++++++++++++++++++++++++++++++++++ | |||
|
227 | ||||
|
228 | By default rhodecode-index processes all repositories and all files, with parsing limits | |||
|
229 | defined by the default CLI arguments. You can change those limits by passing | |||
|
230 | different flags such as `--max-filesize=2048kb` or `--repo-limit=10`. | |||
|
231 | ||||
|
232 | For more advanced execution logic, it's possible to use a configuration file that | |||
|
233 | defines detailed rules for which repositories should be indexed, and how. | |||
|
234 | ||||
|
235 | To create the :file:`search_mapping.ini` file manually, use the following command: | |||
|
236 | ||||
|
237 | .. code-block:: bash | |||
|
238 | ||||
|
239 | ./rcstack cli cmd rhodecode-index --config=/etc/rhodecode/conf/.rhoderc --instance-name=rc-idx \ | |||
|
240 | --create-mapping=/etc/rhodecode/conf/search_mapping.ini | |||
|
241 | ||||
|
242 | ||||
|
243 | Now the indexer can be executed with the `--mapping` flag. | |||
|
244 | ||||
|
245 | ||||
|
246 | Here's a detailed example of using :file:`search_mapping.ini` file. | |||
|
247 | ||||
|
248 | .. code-block:: ini | |||
|
249 | ||||
|
250 | [__DEFAULT__] | |||
|
251 | ; Create index on commits data, and files data in this order. Available options | |||
|
252 | ; are `commits`, `files` | |||
|
253 | index_types = commits,files | |||
|
254 | ||||
|
255 | ; Commit fetch limit. In what amount of chunks commits should be fetched | |||
|
256 | ; via api and parsed. This allows server to transfer smaller chunks and be less loaded | |||
|
257 | commit_fetch_limit = 1000 | |||
|
258 | ||||
|
259 | ; Commit process limit. Limit the number of commits indexer should fetch, and | |||
|
260 | ; store inside the full text search index. eg. if repo has 2000 commits, and | |||
|
261 | ; limit is 1000, on the first run it will process commits 0-1000 and on the | |||
|
262 | ; second 1000-2000 commits. Helps reduce memory usage, default is 50000 | |||
|
263 | ; (set -1 for unlimited) | |||
|
264 | commit_process_limit = 20000 | |||
|
265 | ||||
|
266 | ; Limit of how many repositories each run can process, default is -1 (unlimited) | |||
|
267 | ; in case of 1000s of repositories it's better to execute in chunks to not overload | |||
|
268 | ; the server. | |||
|
269 | repo_limit = -1 | |||
|
270 | ||||
|
271 | ; Default patterns for indexing files and content of files. Binary files | |||
|
272 | ; are skipped by default. | |||
|
273 | ||||
|
274 | ; Add to index those comma separated files; globs syntax | |||
|
275 | ; e.g index_files = *.py, *.c, *.h, *.js | |||
|
276 | index_files = *, | |||
|
277 | ||||
|
278 | ; Do not add to index those comma separated files, this excludes | |||
|
279 | ; both search by name and content; globs syntax | |||
|
280 | ; e.g index_files = *.key, *.sql, *.xml, *.pem, *.crt | |||
|
281 | skip_files = , | |||
|
282 | ||||
|
283 | ; Add to index content of those comma separated files; globs syntax | |||
|
284 | ; e.g index_files_content = *.h, *.obj | |||
|
285 | index_files_content = *, | |||
|
286 | ||||
|
287 | ; Do not add to index content of those comma separated files; globs syntax | |||
|
288 | ; Binary files are not indexed by default. | |||
|
289 | ; e.g skip_files_content = *.min.js, *.xml, *.dump, *.log | |||
|
290 | skip_files_content = , | |||
|
291 | ||||
|
292 | ; Force rebuilding an index from scratch. With this global flag each repository | |||
|
293 | ; will be rebuilt from scratch. Use --repo-name=NAME --force to rebuild single repo | |||
|
294 | force = false | |||
|
295 | ||||
|
296 | ; Maximum file size that the indexer will process; files above that limit are not going | |||
|
297 | ; to have their content indexed. | |||
|
298 | ; Possible options are KB (kilobytes), MB (megabytes), eg 1MB or 1024KB | |||
|
299 | max_filesize = 10MB | |||
|
300 | ||||
|
301 | ||||
|
302 | [__INDEX_RULES__] | |||
|
303 | ; Ordered match rules for repositories. A list of all repositories will be fetched | |||
|
304 | ; using API and this list will be filtered using those rules. | |||
|
305 | ; Syntax for entry: `glob_pattern_OR_full_repo_name = 0 OR 1` where 0=exclude, 1=include | |||
|
306 | ; When this ordered list is traversed first match will return the include/exclude marker | |||
|
307 | ; For example: | |||
|
308 | ; upstream/binary_repo = 0 | |||
|
309 | ; upstream/subrepo/xml_files = 0 | |||
|
310 | ; upstream/* = 1 | |||
|
311 | ; special-repo = 1 | |||
|
312 | ; * = 0 | |||
|
313 | ; This will index all repositories under upstream/*, but skip upstream/binary_repo | |||
|
314 | ; and upstream/subrepo/xml_files, last * = 0 means skip all other matches | |||
|
315 | ||||
|
316 | ||||
|
317 | ; == EXPLICIT REPOSITORY INDEXING == | |||
|
318 | ; If defined this will skip using __INDEX_RULES__, and will not use API to fetch | |||
|
319 | ; list of repositories, it will explicitly take names defined with [NAME] format and | |||
|
320 | ; try to build the index, to build index just for repo_name_1 and special-repo use: | |||
|
321 | ; [repo_name_1] | |||
|
322 | ; [special-repo] | |||
|
323 | ||||
|
324 | ; == PER REPOSITORY CONFIGURATION == | |||
|
325 | ; This allows overriding the global configuration per repository. | |||
|
326 | ; example to set specific file limit, and skip certain files for repository special-repo | |||
|
327 | ; CLI flags do not override these conf settings. | |||
|
328 | ; [conf:special-repo] | |||
|
329 | ; max_filesize = 5mb | |||
|
330 | ; skip_files = *.xml, *.sql | |||
|
331 | ||||
|
332 | ||||
|
333 | ||||
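The ordered, first-match behaviour described in the ``[__INDEX_RULES__]`` comments
can be illustrated with a small shell sketch. It only mirrors the semantics of the
commented example rules, under the assumption that patterns behave like shell
globs where ``*`` also matches ``/``. It is not the indexer's implementation, and
the function name ``rule_for`` is hypothetical.

.. code-block:: bash

    # rule_for REPO: print 1 (include) or 0 (exclude) for a repository name.
    # Patterns are checked top to bottom and the first match decides,
    # mirroring the example rules from the [__INDEX_RULES__] comments.
    rule_for() {
        case "$1" in
            upstream/binary_repo)       echo 0 ;;
            upstream/subrepo/xml_files) echo 0 ;;
            upstream/*)                 echo 1 ;;
            special-repo)               echo 1 ;;
            *)                          echo 0 ;;
        esac
    }

Because the two specific ``upstream/...`` exclusions come before ``upstream/*``,
they win for those repositories. Moving them below the glob would make them
unreachable, which is why rule order matters.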
|
334 | With thousands of repositories it can be tricky to get the include/exclude rules right at first. | |||
|
335 | There's a special flag to test the mapping file rules and list the repositories that would | |||
|
336 | be indexed. Run the indexer with `--show-matched-repos` to list only the | |||
|
337 | repositories matched by the rules in the .ini file: | |||
|
338 | ||||
|
339 | .. code-block:: bash | |||
|
340 | ||||
|
341 | ./rcstack cli cmd rhodecode-index --no-tty --config=/etc/rhodecode/conf/.rhoderc --instance-name=rc-idx --show-matched-repos --mapping=/etc/rhodecode/conf/search_mapping.ini | |||
|
342 | ||||
|
343 | ||||
|
344 | .. _enable-elasticsearch: | |||
|
345 | ||||
|
346 | Enabling ElasticSearch | |||
|
347 | ^^^^^^^^^^^^^^^^^^^^^^ | |||
|
348 | ||||
|
349 | ElasticSearch is available in the EE edition only. It provides more scalable and more advanced | |||
|
350 | search capabilities. While Whoosh is fine for up to 1-2GB of data, beyond that amount it | |||
|
351 | starts slowing down, and can cause other problems. | |||
|
352 | ElasticSearch 6 also provides a much more advanced query language. | |||
|
353 | It allows advanced filtering by file paths and extensions, OR statements, ranges, etc. | |||
|
354 | Please check the query language examples in the search field for some advanced usage. | |||
|
355 | ||||
|
356 | ||||
|
357 | 1. Open the :file:`rhodecode.ini` file for the instance you wish to edit. The | |||
|
358 | default location is :file:`config/_shared/rhodecode.ini` | |||
|
359 | 2. Find the search configuration section: | |||
|
360 | ||||
|
361 | .. code-block:: ini | |||
|
362 | ||||
|
363 | ################################### | |||
|
364 | ## SEARCH INDEXING CONFIGURATION ## | |||
|
365 | ################################### | |||
|
366 | ||||
|
367 | search.module = rhodecode.lib.index.whoosh | |||
|
368 | search.location = %(here)s/data/index | |||
|
369 | ||||
|
370 | and change it to: | |||
|
371 | ||||
|
372 | .. code-block:: ini | |||
|
373 | ||||
|
374 | search.module = rc_elasticsearch | |||
|
375 | search.location = http://elasticsearch:9200 | |||
|
376 | ## specify Elastic Search version, 6 for latest or 2 for legacy | |||
|
377 | search.es_version = 6 | |||
|
378 | ||||
|
379 | where ``search.location`` points to the ElasticSearch server, | |||
|
380 | which by default runs on port 9200. | |||
|
381 | ||||
|
382 | The indexer invocation also needs to change. Provide the `--es-version=` and | |||
|
383 | `--engine-location=` parameters to define the ElasticSearch server location and its version. | |||
|
384 | For example:: | |||
|
385 | ||||
|
386 | --instance-name=rc-idx --es-version=6 --engine-location=http://elasticsearch:9200 | |||
|
387 | ||||
|
388 | ||||
|
389 | .. _Whoosh: https://pypi.python.org/pypi/Whoosh/ | |||
|
390 | .. _ElasticSearch 6: https://www.elastic.co/ |