There are a variety of reasons why a knowledge document can cause an index to terminate. It is not always black and white, and each case needs to be reviewed individually. The list below covers some of the common causes of indexing issues. Keep in mind that 99.9% of all problems with indexing lie with ServiceCenter / Service Manager and not the search engine.
The version of Service Manager in use determines which search engine Knowledge Management uses. Service Manager 9.2x and earlier use the Verity K2 Search Engine. Service Manager 9.30 and newer use the SOLR Search Engine.
A. Attachments
Probably the number one cause is attachments. Out of the box, HP excludes a small set of file extensions: jpg;bmp;gif;exe;unl. This means that when Service Manager parses a knowledge document, it ignores any attachment with one of those file extensions. This list is rather small and should be expanded to include: png;psd;tif;js;jsp;dll. There are probably others, but these have been the most common.
If additional extensions are not listed, it does not necessarily mean the index will fail. In some cases Service Manager parses the record and submits the document to the search engine, but the search engine rejects the document and does not index it. In earlier versions of ServiceCenter / Service Manager (i.e. 6.2.x through 9.2.x), a rejection by the search engine was something we had to actively search for by reviewing <search engine>\data\services\KM-INDEX\log\status.log. The errors in this log give direction on how to approach a document that was rejected by the search engine.
Items to check when configuring to support attachments:
1. What types of attachments are most likely to be attached to a knowledge document?
2. How large is the attachment? If it is a large Word document, then perhaps the External document type should be used so that the actual Word document becomes the knowledge document instead of an attachment with a link.
Here is a list of the extensions that will be indexed by the Verity K2 Search Engine: .pdf, .rtf, .mif, .wpd, .htm, .html, .shtml, .asp, .cgi, .php, .sml, .xml, .txt, .text, .c, .h, .cpp, .cxx, .pl, .mbx, .doc, .xls, .ppt, .aw, .eml. Anything not on this list will, more than likely, be rejected by the search engine even if Service Manager parses the contents of the attachment and transmits that data to the indexer. Additionally, password-protected files will not be indexed by the search engine.
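The attachment checks described so far can be sketched as a small triage helper. This is only an illustration built from the two lists quoted in this article. The function name and the three result labels are hypothetical and are not part of Service Manager or Verity K2.

```python
from pathlib import PurePath

# Out-of-the-box "Skip these extensions" value quoted in this article.
SKIPPED_BY_SERVICE_MANAGER = set("jpg;bmp;gif;exe;unl".split(";"))

# Extensions this article lists as indexable by the Verity K2 Search Engine.
K2_INDEXABLE = {
    "pdf", "rtf", "mif", "wpd", "htm", "html", "shtml", "asp", "cgi", "php",
    "sml", "xml", "txt", "text", "c", "h", "cpp", "cxx", "pl", "mbx",
    "doc", "xls", "ppt", "aw", "eml",
}


def classify_attachment(filename: str) -> str:
    """Guess how an attachment is likely to be treated (hypothetical labels)."""
    # Compare on the lower-cased extension without its leading dot.
    ext = PurePath(filename).suffix.lower().lstrip(".")
    if ext in SKIPPED_BY_SERVICE_MANAGER:
        return "skipped"          # Service Manager ignores the attachment
    if ext in K2_INDEXABLE:
        return "indexable"        # on the K2 list above
    return "likely rejected"      # candidate for the skip list


for name in ("logo.JPG", "manual.pdf", "screenshot.png"):
    print(name, "->", classify_attachment(name))
```

Running a helper like this over the attachment names in a knowledge library is one way to find extensions that are candidates for the skip list.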
NOTE: There is currently a 40 character limit at the database level for the “Skip these extensions” field (skipexts in the database). When you combine the out of the box extensions with the expanded list of extensions recommended in this article, the character count exceeds 40. Reindexing of the knowledge libraries may fail due to the system truncating what is included in that field.
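The arithmetic behind this note can be checked with a short sketch. The 40 character limit and both extension lists come from this article. The helper names are hypothetical, and the sketch assumes the field stores the extensions as one semicolon-separated string.

```python
SKIPEXTS_LIMIT = 40  # character limit stated in the note above


def merge_skip_extensions(current: str, extra: str) -> str:
    """Join two semicolon-separated lists, dropping duplicates and keeping order."""
    merged = []
    for ext in (current + ";" + extra).split(";"):
        ext = ext.strip().lower()
        if ext and ext not in merged:
            merged.append(ext)
    return ";".join(merged)


def fits_skipexts(value: str, limit: int = SKIPEXTS_LIMIT) -> bool:
    """True when the value can be stored without being truncated."""
    return len(value) <= limit


combined = merge_skip_extensions("jpg;bmp;gif;exe;unl", "png;psd;tif;js;jsp;dll")
print(combined, len(combined), fits_skipexts(combined))
```

The combined value is 42 characters long, two more than the limit, so at least one extension has to be left out unless the field is widened.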
The SOLR Search Engine uses the Tika parser. To see which types of files SOLR can index, review the list within Service Manager called “kmmimetypes”.
B. Documents with invalid characters
ServiceCenter / Service Manager and both the Verity K2 and SOLR Search Engines accept standard HTML and XML tags. Special characters need to be wrapped via standard HTML. In many cases proprietary tags will cause problems. The most common are the Microsoft Office ‘mso’ tags, which can be inserted when a user copies and pastes from an application like Microsoft Word. What SHOULD happen when copy and paste is used is that a cleanup window (the “Tidy” routine) pops up, telling the user to paste the contents of their document into the Tidy window so that the proprietary tags are cleaned up and replaced with standard tags. This should happen for anything that contains non-standard formatting tags.
Non-standard formatting tags will not cause the index to fail. What usually happens is that all data up to the point of the special character is indexed, and everything after that special character is not. So, if the term being searched for happens to come after that special character, that document will not appear in the hitlist.
Determining this can be complex. If you feel you are running across such a scenario, the first thing to try is to send the document in question to the workflow, edit the document, and then click the Source button so you can see what the special character is wrapped in. If you see MSO anywhere, then this is part of the problem with this document.
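When many documents need to be reviewed, looking for MSO in the source can be approximated with a script. The sketch below assumes the document source has already been exported as an HTML string. The regular expression is a heuristic chosen for this example (any token beginning with "mso", such as MsoNormal or mso-bidi-font-size), not a list taken from the product, and the function name is hypothetical.

```python
import re

# Heuristic: any token that starts with "mso", in any letter case.
MSO_PATTERN = re.compile(r"\bmso[\w-]*", re.IGNORECASE)


def find_mso_markup(html_source: str) -> list[tuple[int, str]]:
    """Return (character offset, matched text) for each Office-style marker."""
    return [(m.start(), m.group(0)) for m in MSO_PATTERN.finditer(html_source)]


sample = '<p class="MsoNormal" style="mso-bidi-font-size:10pt">Reset steps for Samson</p>'
for offset, marker in find_mso_markup(sample):
    print(offset, marker)
```

The offset of the first hit gives a rough idea of where indexing of the document may have stopped. A word such as "Samson" is not reported, because the pattern only matches at the start of a token.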
C. Knowledge Documents with LOTS of data or massive tables (This applies only to systems using Verity K2 Search Engine)
Documents, incidents, and interactions will fail to index when a single field contains more than 32k of data. In most cases the index will fail with a Signal 11 or kill the client outright. This problem DOES NOT occur in Service Manager 7.11 and later, because the KMAdmin script was updated to take larger data into account.
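A pre-check for oversized fields might look like the sketch below. It rests on assumptions: "32k" is read as 32,768 bytes of UTF-8 text, the record is a plain dictionary of field name to text, and the field names shown are invented for the example. How Service Manager actually measures a field may differ.

```python
FIELD_LIMIT_BYTES = 32 * 1024  # assumption: "32k" read as 32,768 bytes


def oversized_fields(record: dict[str, str], limit: int = FIELD_LIMIT_BYTES) -> dict[str, int]:
    """Return the fields whose UTF-8 size exceeds the limit, with their sizes."""
    sizes = {name: len(value.encode("utf-8")) for name, value in record.items()}
    return {name: size for name, size in sizes.items() if size > limit}


# Hypothetical record: field names are for illustration only.
record = {"title": "Reset a password", "answer": "x" * 40000}
print(oversized_fields(record))
```

Any field reported by such a check is a candidate for splitting or trimming before the record is indexed on a Verity K2 system older than 7.11.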
D. File locking bug (This applies only to systems using Verity K2 Search Engine)
There is an issue with KMUpdate locking the file to be indexed in versions 6.2.1 through 6.2.5, 7.0.1, and 7.0.2. When KMUpdate wakes up to index the kmknowledgebaseupdate records for a specific file, it must lock that file. If an admin is reviewing that specific library via Manage Knowledgebases at the same time, KMUpdate SHOULD NOT index that library, but it does, and as a result it never closes the index.
What should happen is that while KMUpdate is running, KMAdmin should not be able to perform any function on that library, and vice versa. This is resolved in ServiceCenter 6.2.6+ with the 6.2.6 KMUnloads and in Service Manager 7.10+.
Once this problem occurs, the only way to clear it is to stop and restart both the search engine and the ServiceCenter server application.
Overall, most everything will index without issue. As mentioned above, many index issues need to be examined on a case-by-case basis.