docs: updated indexer documentation
marcink -
r3482:4b2dd92b default
@@ -1,363 +1,359 b''
1 1 .. _indexing-ref:
2 2
3 3 Full-text Search
4 4 ----------------
5 5
6 RhodeCode provides full-text search capabilities for searching inside file content,
7 commit messages, and file paths. Indexing is not enabled by default; to use
8 full-text search, you must first build an index.
9
6 10 By default RhodeCode is configured to use `Whoosh`_ to index |repos| and
7 provide full-text search.
11 provide full-text search. `Whoosh`_ works well for a small amount of data and
12 shouldn't be used for large code bases or a large number of repositories.
8 13
9 |RCE| also provides support for `Elasticsearch 6`_ as a backend more for advanced
14 |RCE| also provides support for `ElasticSearch 6`_ as a backend for more advanced
10 15 and scalable search. See :ref:`enable-elasticsearch` for details.
11 16
12 17 Indexing
13 18 ^^^^^^^^
14 19
15 20 To run the indexer you need to have an |authtoken| with admin rights to all |repos|.
16 21
17 To index new content added, you have the option to set the indexer up in a
22 To index repositories stored in RhodeCode, you have the option to set the indexer up in a
18 23 number of ways, for example:
19 24
20 25 * Call the indexer via a cron job. We recommend running this once at night.
21 26 In case you need everything indexed immediately, it's possible to index a few
22 times during the day.
27 times during the day. The indexer has a special locking mechanism that won't allow
28 two instances of the indexer to run at once, so it's safe to run it as often as every hour.
23 29 * Set the indexer to infinitely loop and reindex as soon as it has run its previous cycle.
24 30 * Hook the indexer up with your CI server to reindex after each push.
25 31
26 The indexer works by indexing new commits added since the last run. If you
27 wish to build a brand new index from scratch each time,
28 use the ``force`` option in the configuration file.
32 The indexer works by indexing new commits added since the last run, and comparing
33 file changes to index only new or modified files.
34 If you wish to build a brand new index from scratch each time, use the ``force``
35 option in the configuration file, or run it with the ``--force`` flag.
29 36
30 37 .. important::
31 38
32 39 You need to have |RCT| installed, see :ref:`install-tools`. Since |RCE|
33 40 3.5.0 they are installed by default and available with community/enterprise installations.
34 41
35 42 To set up indexing, use the following steps:
36 43
37 44 1. :ref:`config-rhoderc`, if running tools remotely.
38 45 2. :ref:`run-index`
39 46 3. :ref:`set-index`
40 47 4. :ref:`advanced-indexing`
41 48
42 49 .. _config-rhoderc:
43 50
44 51 Configure the ``.rhoderc`` File
45 52 ^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^
46 53
47 54 .. note::
48 55
49 56 It's also possible to use the indexer without the ``.rhoderc`` file. Instead of
50 57 executing with `--instance-name=enterprise-1`, provide the host and token
51 directly: `--api-host=http://127.0.0.1:10000 --api-key=<auth token goes here>
58 directly: `--api-host=http://127.0.0.1:10000 --api-key=<auth-token-goes-here>`
52 59
53 60
54 61 |RCT| uses the :file:`/home/{user}/.rhoderc` file for connection details
55 62 to |RCE| instances. If this file is not automatically created,
56 63 you can configure it using the following example. You need to configure the
57 64 details for each instance you want to index.
58 65
59 66 .. code-block:: bash
60 67
61 68 # Check the instance details
62 69 # of the instance you want to index
63 70 $ rccontrol status
64 71
65 72 - NAME: enterprise-1
66 73 - STATUS: RUNNING
67 74 - TYPE: Enterprise
68 75 - VERSION: 4.1.0
69 76 - URL: http://127.0.0.1:10003
70 77
71 78 To get your API Token, on the |RCE| interface go to
72 79 :menuselection:`username --> My Account --> Auth tokens`
73 80
74 81 .. code-block:: ini
75 82
76 83 # Configure .rhoderc with matching details
77 84 # This allows the indexer to connect to the instance
78 85 [instance:enterprise-1]
79 86 api_host = http://127.0.0.1:10000
80 87 api_key = <auth token goes here>
81 88
82 89
83 90 .. _run-index:
84 91
85 92 Run the Indexer
86 93 ^^^^^^^^^^^^^^^
87 94
88 95 Run the indexer using the following command, and specify the instance you want to index:
89 96
90 97 .. code-block:: bash
91 98
92 # Using default installation
99 # Using the default, simplest indexing of all repositories
93 100 $ /home/user/.rccontrol/enterprise-1/profile/bin/rhodecode-index \
94 101 --instance-name=enterprise-1
95 102
96 # Using a custom mapping file
103 # Using a custom mapping file with indexing rules, and the ElasticSearch 6 backend
97 104 $ /home/user/.rccontrol/enterprise-1/profile/bin/rhodecode-index \
98 105 --instance-name=enterprise-1 \
99 --mapping=/home/user/.rccontrol/enterprise-1/search_mapping.ini
106 --mapping=/home/user/.rccontrol/enterprise-1/search_mapping.ini \
107 --es-version=6 --engine-location=http://elasticsearch-host:9200
100 108
101 109 # Using a custom mapping file and invocation without ``.rhoderc``
102 110 $ /home/user/.rccontrol/enterprise-1/profile/bin/rhodecode-index \
103 111 --api-host=http://rhodecodecode.myserver.com --api-key=xxxxx \
104 112 --mapping=/home/user/.rccontrol/enterprise-1/search_mapping.ini
105 113
106 114 # From inside a virtualenv on your local machine or CI server.
107 115 (venv)$ rhodecode-index --instance-name=enterprise-1
108 116
109 117
110 118 .. note::
111 119
112 120 With frequent indexing the index may become fragmented. The most common symptom
113 121 is an error about `too many open files`. To fix this, run the indexer with the
114 122 ``--optimize`` flag, e.g. `rhodecode-index --instance-name=enterprise-1 --optimize`.
115 123 This should be executed regularly; once a week is recommended.
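
A weekly optimization run could be scheduled with cron. This is only a sketch: the schedule (Sundays at 2am) is an arbitrary assumption, and the binary path is the default-installation path used in the examples above, so adjust both to your setup.

.. code-block:: bash

    # Hypothetical crontab entry: optimize the index every Sunday at 2am
    0 2 * * 0 /home/user/.rccontrol/enterprise-1/profile/bin/rhodecode-index --instance-name=enterprise-1 --optimize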
116 124
117 125
118 126 .. _set-index:
119 127
120 128 Schedule the Indexer
121 129 ^^^^^^^^^^^^^^^^^^^^
122 130
123 131 To schedule the indexer, configure the crontab file to run the indexer inside
124 132 your |RCT| virtualenv using the following steps.
125 133
126 134 1. Open the crontab file, using ``crontab -e``.
127 135 2. Add the indexer to the crontab, and schedule it to run as regularly as you
128 136 wish.
129 137 3. Save the file.
130 138
131 139 .. code-block:: bash
132 140
133 141 $ crontab -e
134 142
135 143 # The virtualenv can be called using its full path, so for example you can
136 144 # put this example into the crontab
137 145
138 146 # Run the indexer daily at 4am using the default mapping settings
139 147 0 4 * * * /home/ubuntu/.virtualenv/rhodecode-venv/bin/rhodecode-index \
140 148 --instance-name=enterprise-1
141 149
142 150 # Run the indexer every Sunday at 3am using default mapping
143 151 0 3 * * 0 /home/ubuntu/.virtualenv/rhodecode-venv/bin/rhodecode-index \
144 152 --instance-name=enterprise-1
145 153
146 154 # Run the indexer every 15 minutes
147 155 # using a specially configured mapping file
148 156 */15 * * * * ~/.rccontrol/enterprise-4/profile/bin/rhodecode-index \
149 157 --instance-name=enterprise-4 \
150 158 --mapping=/home/user/.rccontrol/enterprise-4/search_mapping.ini
151 159
152 160 .. _advanced-indexing:
153 161
154 162 Advanced Indexing
155 163 ^^^^^^^^^^^^^^^^^
156 164
157 165
158 166 Force Re-Indexing single repository
159 167 +++++++++++++++++++++++++++++++++++
160 168
161 169 It's often necessary to re-index a whole repository because of repository changes,
162 170 or to remove indexed secrets or files. The indexer has a special `--repo-name=` flag
163 171 that limits execution to a single repository. For example, to force a re-index of a
164 172 single repository, run::
165 173
166 174 rhodecode-index --instance-name=enterprise-1 --force --repo-name=rhodecode-vcsserver
167 175
168 176
169 177 Removing repositories from index
170 178 ++++++++++++++++++++++++++++++++
171 179
172 180 The indexer automatically removes renamed repositories and builds an index for the new names.
181 In the same way, if a repository listed in mapping.ini is not reported as existing by the
182 server, it's removed from the index.
173 183 To remove an indexed repository manually, run::
174 184
175 185 rhodecode-index --instance-name=enterprise-1 --remove-only --repo-name=rhodecode-vcsserver
176 186
177 187
178 188 Using search_mapping.ini file for advanced index rules
179 189 ++++++++++++++++++++++++++++++++++++++++++++++++++++++
180 190
181 191 By default rhodecode-index runs for all repositories, all files with parsing limits
182 192 defined by the CLI default arguments. You can change those limits by calling with
183 different flags such as `--max-filesize 2048kb` or `--repo-limit 10`
193 different flags such as `--max-filesize=2048kb` or `--repo-limit=10`
184 194
185 195 For more advanced execution logic it's possible to use a configuration file that
186 196 defines detailed rules for which repositories should be indexed, and how.
187 197
188 198 |RCT| provides an example index configuration file called :file:`search_mapping.ini`.
189 199 This file is created by default during installation and is located at:
190 200
191 201 * :file:`/home/{user}/.rccontrol/{instance-id}/search_mapping.ini`, using default |RCT|.
192 202 * :file:`~/venv/lib/python2.7/site-packages/rhodecode_tools/templates/mapping.ini`,
193 203 when using ``virtualenv``.
194 204
195 205 .. note::
196 206
197 207 If you need to create the :file:`search_mapping.ini` file manually, use the |RCT|
198 208 ``rhodecode-index --create-mapping path/to/search_mapping.ini`` command.
199 209 For details, see the :ref:`tools-cli` section.
200 210
201 211 To run the indexer with a mapping file, provide it using the `--mapping` flag::
202 212
203 213 rhodecode-index --instance-name=enterprise-1 --mapping=/my/path/search_mapping.ini
204 214
205 215
206 216 Here's a detailed example of using :file:`search_mapping.ini` file.
207 217
208 218 .. code-block:: ini
209 219
210 220 [__DEFAULT__]
211 221 ; Create index on commits data, and files data in this order. Available options
212 222 ; are `commits`, `files`
213 223 index_types = commits,files
214 224
215 225 ; Commit fetch limit. In what amount of chunks commits should be fetched
216 226 ; via api and parsed. This allows server to transfer smaller chunks and be less loaded
217 227 commit_fetch_limit = 1000
218 228
219 229 ; Commit process limit. Limit the number of commits indexer should fetch, and
220 230 ; store inside the full text search index. eg. if repo has 2000 commits, and
221 231 ; limit is 1000, on the first run it will process commits 0-1000 and on the
222 232 ; second 1000-2000 commits. Help reduce memory usage, default is 50000
223 233 ; (set -1 for unlimited)
224 commit_process_limit = 50000
234 commit_process_limit = 20000
225 235
226 236 ; Limit of how many repositories each run can process, default is -1 (unlimited)
227 237 ; in case of 1000s of repositories it's better to execute in chunks to not overload
228 238 ; the server.
229 239 repo_limit = -1
230 240
231 241 ; Default patterns for indexing files and content of files. Binary files
232 242 ; are skipped by default.
233 243
234 244 ; Add to index those comma separated files; globs syntax
235 245 ; e.g index_files = *.py, *.c, *.h, *.js
236 246 index_files = *,
237 247
238 248 ; Do not add to index those comma separated files, this excludes
239 249 ; both search by name and content; globs syntax
240 ; e.g index_files = *.key, *.sql, *.xml
250 ; e.g skip_files = *.key, *.sql, *.xml, *.pem, *.crt
241 251 skip_files = ,
242 252
243 253 ; Add to index content of those comma separated files; globs syntax
244 254 ; e.g index_files_content = *.h, *.obj
245 255 index_files_content = *,
246 256
247 257 ; Do not add to index content of those comma separated files; globs syntax
248 ; e.g index_files = *.exe, *.bin, *.log, *.dump
258 ; Binary files are not indexed by default.
259 ; e.g skip_files_content = *.min.js, *.xml, *.dump, *.log
249 260 skip_files_content = ,
250 261
251 262 ; Force rebuilding an index from scratch. Each repository will be rebuilt from
252 263 ; scratch with a global flag. Use --repo-name=NAME --force to rebuild single repo
253 264 force = false
254 265
255 266 ; Maximum file size that the indexer will use; files above that limit are not going
256 267 ; to have their content indexed.
257 268 ; Possible options are KB (kilobytes), MB (megabytes), eg 1MB or 1024KB
258 max_filesize = 2MB
269 max_filesize = 10MB
259 270
260 271
261 272 [__INDEX_RULES__]
262 273 ; Ordered match rules for repositories. A list of all repositories will be fetched
263 274 ; using API and this list will be filtered using those rules.
264 275 ; Syntax for entry: `glob_pattern_OR_full_repo_name = 0 OR 1` where 0=exclude, 1=include
265 276 ; When this ordered list is traversed first match will return the include/exclude marker
266 277 ; For example:
267 278 ; upstream/binary_repo = 0
268 279 ; upstream/subrepo/xml_files = 0
269 280 ; upstream/* = 1
270 281 ; special-repo = 1
271 282 ; * = 0
272 283 ; This will index all repositories under upstream/*, but skip upstream/binary_repo
273 284 ; and upstream/subrepo/xml_files, last * = 0 means skip all other matches
274 285
275 ; Another example:
276 ; *-fork = 0
277 ; * = 1
278 ; This will index all repositories, except those that have -fork as suffix.
279
280 rhodecode-vcsserver = 1
281 rhodecode-enterprise-ce = 1
282 upstream/mozilla/firefox-repo = 0
283 upstream/git-binaries = 0
284 upstream/* = 1
285 * = 0
286 286
287 287 ; == EXPLICIT REPOSITORY INDEXING ==
288 288 ; If defined this will skip using __INDEX_RULES__, and will not use API to fetch
289 289 ; list of repositories, it will explicitly take names defined with [NAME] format and
290 290 ; try to build the index, to build index just for repo_name_1 and special-repo use:
291 291 ; [repo_name_1]
292 292 ; [special-repo]
293 293
294 294 ; == PER REPOSITORY CONFIGURATION ==
295 295 ; This allows overriding the global configuration per repository.
296 296 ; example to set specific file limit, and skip certain files for repository special-repo
297 ; the CLI flags don't override the conf settings.
297 298 ; [conf:special-repo]
298 299 ; max_filesize = 5mb
299 300 ; skip_files = *.xml, *.sql
300 ; index_types = files,
301 301
302 [conf:rhodecode-vcsserver]
303 index_types = files,
304 max_filesize = 5mb
305 skip_files = *.xml, *.sql
306 index_files = *.py, *.c, *.h, *.js
307 302
308 303
309 304 In case of 1000s of repositories it can be tricky to write the include/exclude rules at first.
310 305 There's a special flag to test the mapping file rules and list repositories that would
311 be indexed. Run the indexer with `--show-matched-repos` to list only the match rules::
306 be indexed. Run the indexer with `--show-matched-repos` to list only the
307 repositories matched by the rules defined in the .ini file::
312 308
313 309 rhodecode-index --instance-name=enterprise-1 --show-matched-repos --mapping=/my/path/search_mapping.ini
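
As a sketch of how the ordered rules interact, consider the fragment below. The repository names are hypothetical and chosen only for illustration. Because the first matching rule decides, the more specific exclude has to come before the broader include; if ``projects/* = 1`` were listed first, repositories under ``projects/archive/`` would be matched by it and indexed.

.. code-block:: ini

    [__INDEX_RULES__]
    ; hypothetical names; the first matching rule wins
    ; skip archived projects
    projects/archive/* = 0
    ; index every other repository under projects/
    projects/* = 1
    ; skip everything else
    * = 0

Running the indexer with `--show-matched-repos` against such a file is a quick way to check that the rule order does what you intended before building the index.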
314 310
315 311
316 312 .. _enable-elasticsearch:
317 313
318 Enabling Elasticsearch
314 Enabling ElasticSearch
319 315 ^^^^^^^^^^^^^^^^^^^^^^
320 316
321 Elasticsearch is available in EE edition only. It provides much scalable and more advanced
322 search capabilities. While Whoosh is fine for upto 1-2GB of data beyond that amount of
323 data it starts slowing down, and can cause other problems. Elasticsearch 6 also provides
324 much more advanced query language allowing advanced filtering by file paths, extensions
325 OR statements, ranges etc. Please check query language examples in the search field for
326 some advanced query language usage.
317 ElasticSearch is available in the EE edition only. It provides much more scalable and more advanced
318 search capabilities. While Whoosh is fine for up to 1-2GB of data, beyond that amount it
319 starts slowing down, and can cause other problems.
320 ElasticSearch 6 also provides a much more advanced query language.
321 It allows advanced filtering by file paths and extensions, OR statements, ranges, etc.
322 Please check the query language examples in the search field for some advanced query language usage.
327 323
328 324
329 325 1. Open the :file:`rhodecode.ini` file for the instance you wish to edit. The
330 326 default location is
331 327 :file:`/home/{user}/.rccontrol/{instance-id}/rhodecode.ini`
332 328 2. Find the search configuration section:
333 329
334 330 .. code-block:: ini
335 331
336 332 ###################################
337 333 ## SEARCH INDEXING CONFIGURATION ##
338 334 ###################################
339 335
340 336 search.module = rhodecode.lib.index.whoosh
341 337 search.location = %(here)s/data/index
342 338
343 339 and change it to:
344 340
345 341 .. code-block:: ini
346 342
347 343 search.module = rc_elasticsearch
348 344 search.location = http://localhost:9200
349 345 ## specify Elastic Search version, 6 for latest or 2 for legacy
350 346 search.es_version = 6
351 347
352 where ``search.location`` points to the elasticsearch server
348 where ``search.location`` points to the ElasticSearch server
353 349 by default running on port 9200.
354 350
355 351 The indexer invocation also needs to change. Please provide the ``--es-version=`` and
356 --engine-location= parameters to define elasticsearch server location and it's version.
352 ``--engine-location=`` parameters to define the ElasticSearch server location and its version.
357 353 For example::
358 354
359 355 rhodecode-index --instance-name=enterprise-1 --es-version=6 --engine-location=http://localhost:9200
360 356
361 357
362 358 .. _Whoosh: https://pypi.python.org/pypi/Whoosh/
363 .. _Elasticsearch 6: https://www.elastic.co/ No newline at end of file
359 .. _ElasticSearch 6: https://www.elastic.co/