Channel: SCN: Message List - SAP HANA Developer Center

Re: HANA Special Chars (Charset / Collation)


I am also having problems with this.

So far, what I have found is that if you create a VARCHAR field but try to insert a special character that is not in the 7-bit ASCII table (NOT the 8-bit extended ASCII), then HANA will automatically store that character in a Unicode encoding that takes more space.

 

.. at least, that is what it appears to do; we cannot be sure.

 

According to unicode.org:

http://www.unicode.org/faq/utf_bom.html

 

UTF-8 can take between 1 and 4 bytes per character.
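As a quick sanity check outside HANA, those 1-to-4-byte widths are easy to verify, for example in Python (just an illustration, nothing HANA-specific):

```python
# Number of UTF-8 bytes needed for sample characters
for ch in ('a', 'é', '章', '😀'):
    print(ch, len(ch.encode('utf-8')))
# a takes 1 byte, é takes 2, 章 takes 3, 😀 takes 4
```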

 

What I think happens with HANA is that it stores the VARCHAR type as UTF-8 and NVARCHAR as UTF-8 or UTF-16.

When it stores a VARCHAR, it calculates the size of the field assuming it will contain only ASCII characters, so 1 character takes only 1 byte. Hence it allocates 20 bytes for a length of 20 characters.

Since the VARCHAR back end is UTF-8, you can store Unicode characters in it; however, special characters take more than 1 byte per character and break HANA's estimate of the field size.

 

Example:

CREATE TABLE varchartest(varcol varchar(10), nvarcol nvarchar(10));

 

--works: plain ASCII characters, 1 byte each.

insert into varchartest(varcol,nvarcol) values ('1234567890','1234567890');

 

--does not work: inserted value too large for column (10 characters × 2 bytes = 20 bytes).

insert into varchartest(varcol,nvarcol) values ('éééééééééé','éééééééééé');

--does not work either: 6 × 2 bytes = 12 bytes, still over the 10-byte limit.

insert into varchartest(varcol,nvarcol) values ('éééééé','éééééééééé');

 

--This one works: the é character must be 2 bytes wide, so 2 bytes × 5 characters = 10 bytes, exactly the size of the varchar column.

insert into varchartest(varcol,nvarcol) values ('ééééé','éééééééééé');

 

--still works: à is also 2 bytes; one more character in the varchar column would not fit.

insert into varchartest(varcol,nvarcol) values ('ààààà','àààààààààà');

 

--does not work: 4 Chinese characters (4 × 3 bytes = 12 bytes) do not fit into the 10-byte varchar field.

insert into varchartest(varcol,nvarcol) values ('章章章章','章章章章章章章章章章');

 

--works: we can fit 3 Chinese characters (3 × 3 bytes = 9 bytes).

insert into varchartest(varcol,nvarcol) values ('章章章','章章章章章章章章章章');

 

With this, we know that a Chinese character takes the space of 3 ASCII characters (3 bytes in UTF-8).

 

--This also gives the length of the string as it would be stored in the database.

select length(CAST('章' AS text)) from dummy;   -- returns 3: the UTF-8 byte length

 

If you SELECT * from the table, you will see that the Unicode text in the varchar fields was saved correctly.
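If this byte-based interpretation is right, the pass/fail pattern of the inserts above can be reproduced outside HANA. A minimal sketch in Python (the helper `fits_varchar` is my own name, not a HANA API):

```python
def fits_varchar(value: str, declared_len: int) -> bool:
    """Mimic the observed behaviour: a VARCHAR(n) column appears to
    hold at most n *bytes* of UTF-8, not n characters."""
    return len(value.encode('utf-8')) <= declared_len

print(fits_varchar('1234567890', 10))  # plain ASCII: 10 bytes, fits
print(fits_varchar('é' * 6, 10))       # 12 bytes, rejected
print(fits_varchar('é' * 5, 10))       # 10 bytes, fits
print(fits_varchar('章' * 4, 10))      # 12 bytes, rejected
print(fits_varchar('章' * 3, 10))      # 9 bytes, fits
```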

 

 

Anyway, this is a big problem for my team:

 

1. We cannot use NVARCHAR because we use legacy applications that cannot handle Unicode strings.

2. We only want the 8-bit extended ASCII characters, yet those are exactly the ones that trigger this problem.

If we could set the character set of the table or schema to one that includes our characters, the declared lengths would always be correct. However, we cannot do that.

 

Bad solution:

3. I guess we will multiply the declared length of each varchar field by 2 (or 4) to prevent users from crashing the insert when the real character count is correct but the database thinks otherwise.
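That over-allocation could be sketched like this (a rough helper of my own, not a HANA feature; the default factor of 2 assumes only extended-ASCII / Latin-1 data, which needs at most 2 UTF-8 bytes per character):

```python
def declared_varchar_len(char_len: int, max_bytes_per_char: int = 2) -> int:
    # Extended ASCII (Latin-1) characters need at most 2 UTF-8 bytes each;
    # pass 4 if arbitrary Unicode must fit.
    return char_len * max_bytes_per_char

print(declared_varchar_len(20))     # declare VARCHAR(40) for 20 Latin-1 characters
print(declared_varchar_len(20, 4))  # declare VARCHAR(80) for full Unicode
```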

