I'm building a fulltext index using the following code:
CREATE FULLTEXT INDEX "IND_<TABLE>" ON "<SCHEMA>"."<TABLE>"("LONGTEXT")
ASYNC
FAST PREPROCESS OFF
SEARCH ONLY OFF
TEXT ANALYSIS ON
LANGUAGE DETECTION ('EN')
CONFIGURATION 'LINGANALYSIS_BASIC';
One document in the table has the word "Berry" in its 'LONGTEXT' column. The word appears only once in the document, but in $TA_IND_<TABLE> it shows up in multiple records, with TA_COUNTER values in the thousands. I expected only one record for this document.
I'm trying to figure out whether a bug in HANA's tokenization (I'm on SPS10) is causing poor Berry to appear multiple times.
Example rows from $TA_IND_<TABLE>:
| DOCID | TA_COUNTER | TA_NORMALIZED |
|---|---|---|
| 000400062454 | 2,012 | berry |
| 000400062454 | 2,033 | berry |
I find it hard to believe that "berry" was found 2,012 or 2,033 times, and even harder to believe that it was found twice in different locations.
From the SAP HANA Text Analysis Developer documentation:
"TA_COUNTER: The token counter counts all tokens across the document. The order is only unique for a given processing type (hence the TA_RULE as the key)."
I checked the parent table as well, and there is only one reference to the 000400062454 document identifier.
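Going by the documentation's description, TA_COUNTER may be a running token position within the document rather than an occurrence count. If so, each row would be a separate token found at that position. A query along these lines could help check that for this one document. It is only a sketch: it assumes the key column is named DOCID as in the example rows, that the $TA table lives in the same schema as the source table, and that the standard $TA columns (TA_RULE, TA_TOKEN, TA_TYPE, TA_SENTENCE, TA_OFFSET) are present.

```sql
-- Sketch only: list every text-analysis row for the one document that
-- normalizes to 'berry', in token order per processing type.
-- Assumptions: key column is DOCID; replace <SCHEMA> and <TABLE> with real names.
SELECT "DOCID",
       "TA_RULE",        -- processing type; TA_COUNTER is only unique within it
       "TA_COUNTER",     -- token counter under inspection
       "TA_TOKEN",       -- the token as it appears in the text
       "TA_NORMALIZED",
       "TA_TYPE",
       "TA_SENTENCE",
       "TA_OFFSET"       -- character offset of the token in the document
FROM "<SCHEMA>"."$TA_IND_<TABLE>"
WHERE "DOCID" = '000400062454'
  AND "TA_NORMALIZED" = 'berry'
ORDER BY "TA_RULE", "TA_COUNTER";
```

If the two rows come back with different TA_OFFSET values, they would point at two different places in the text. If they share an offset but differ in TA_RULE or TA_TYPE, the same token may simply have been reported by more than one processing type.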