Channel: SCN: Message List - SAP HANA Developer Center

Text Analysis Token Counting Bug?


I'm building a fulltext index using the following code:

 

   CREATE FULLTEXT INDEX "IND_<TABLE>" ON "<SCHEMA>"."<TABLE>"("LONGTEXT")
   ASYNC
   FAST PREPROCESS OFF
   SEARCH ONLY OFF
   TEXT ANALYSIS ON
   LANGUAGE DETECTION ('EN')
   CONFIGURATION 'LINGANALYSIS_BASIC';

 

One document in the table has the word "Berry" in its 'LONGTEXT' column. "Berry" appears only once in that document, but it shows up in multiple records of $TA_IND_<TABLE>, with TA_COUNTER values in the thousands. I expected only one record for this word in this document.
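To make the symptom easy to reproduce, here is a minimal sketch of a query that counts the index rows for the normalized token per document. It assumes that "DOCID" is the key column carried over from the source table (as in the sample output) and that the $TA table exposes the usual "TA_RULE", "TA_COUNTER" and "TA_NORMALIZED" columns. The "<SCHEMA>" and "<TABLE>" placeholders are the same ones used in the index definition.

```sql
-- Count the $TA rows for the normalized token 'berry', per document and rule.
-- Assumption: "DOCID" is the key column copied from the source table.
SELECT "DOCID",
       "TA_RULE",
       COUNT(*)          AS "ROW_COUNT",
       MIN("TA_COUNTER") AS "FIRST_COUNTER",
       MAX("TA_COUNTER") AS "LAST_COUNTER"
FROM "<SCHEMA>"."$TA_IND_<TABLE>"
WHERE "TA_NORMALIZED" = 'berry'
GROUP BY "DOCID", "TA_RULE"
ORDER BY "DOCID", "TA_RULE";
```

A "ROW_COUNT" greater than 1 for a single "DOCID" and "TA_RULE" would confirm that the token was recorded more than once for that document.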

 

I'm trying to figure out whether a tokenization bug in HANA (I'm on SPS10) is causing poor Berry to appear multiple times.

 

Example rows from $TA_IND_<TABLE>:

DOCID         TA_COUNTER  TA_NORMALIZED
000400062454  2,012       berry
000400062454  2,033       berry

 

I find it hard to believe that 2,012 or 2,033 occurrences of "berry" were found, and even harder to believe that it was found twice in different locations.

 

From the SAP HANA Text Analysis Developer documentation:

TA_COUNTER: The token counter counts all tokens across the document. The order is only unique for a given processing type (hence the TA_RULE as the key).

I checked the parent table as well, and there is only one reference to the 000400062454 document identifier.
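One possible reading of that description is that TA_COUNTER is a running position of the token within the document rather than a count of occurrences. A rough way to check this is to list the tokens around the two reported counter values and see whether they are consecutive tokens of the text. This is only a sketch under assumptions: "DOCID" is taken to be a character key column from the source table, and "TA_TOKEN" and "TA_OFFSET" are assumed to be available in the $TA table on this revision. The window of 2005 to 2040 is arbitrary and only chosen to cover both values.

```sql
-- List the tokens surrounding counters 2,012 and 2,033 for the one document.
-- Assumptions: "DOCID" is a character key column; "TA_TOKEN" and "TA_OFFSET"
-- exist in the $TA table. The BETWEEN window is an arbitrary choice.
SELECT "TA_RULE",
       "TA_COUNTER",
       "TA_TOKEN",
       "TA_NORMALIZED",
       "TA_OFFSET"
FROM "<SCHEMA>"."$TA_IND_<TABLE>"
WHERE "DOCID" = '000400062454'
  AND "TA_COUNTER" BETWEEN 2005 AND 2040
ORDER BY "TA_RULE", "TA_COUNTER";
```

If the rows come back as consecutive words of the document, the two "berry" rows would be tokens number 2,012 and 2,033, and comparing their "TA_TOKEN" and "TA_OFFSET" values against the source text should show which surface forms were normalized to "berry".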

