Showing posts with label Unicode. Show all posts

Thursday, February 13, 2014

COMP6021 Class 12

the one with the UTF-8






We reviewed the data from our experiments in the lab when we saved English & Japanese text using different formats.




I reviewed the UTF-8 variable length coding system. (screen capture failed)

Exercise
I gave students a UTF-8 message to decode. 49 E2 99 A5 E6 97 A5 E6 9C AC
Students were given copies of the relevant areas of the Unicode code points.
http://www.rikai.com/library/kanjitables/kanji_codes.unicode.shtml
http://www.alanwood.net/unicode/miscellaneous_symbols.html


Although they got off to a slow start, I was very pleased that by the end of the class all students had figured it out. This is the kind of thing that would make for a good exam question.

Solution
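Write each byte in binary. The first byte (49) starts with 0, so it is a single-byte character: I.
01001001 11100010:10011001:10100101 11100110:10010111:10100101 11100110:10011100:10101100

The remaining bytes form three 3-byte sequences.
11100010:10011001:10100101 11100110:10010111:10100101 11100110:10011100:10101100

Remove the marker bits from the leading bytes (1110) and the continuation bytes (10), then join the bits that are left.
I 0010011001100101 0110010111100101 0110011100101100

Convert each 16-bit group to hexadecimal to get the code points.
I 2665 65E5 672C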

I ♥日本
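
The same steps can be sketched in Python. This is only an illustration, not something we used in class: the function name `decode_utf8` is made up for this example, and it skips checks a real decoder needs (overlong encodings, surrogates, values above U+10FFFF).

```python
def decode_utf8(data):
    """Decode UTF-8 bytes into a list of Unicode code points, by hand."""
    code_points = []
    i = 0
    while i < len(data):
        lead = data[i]
        if lead < 0x80:              # 0xxxxxxx: single byte, 7 payload bits
            value, extra = lead, 0
        elif lead >> 5 == 0b110:     # 110xxxxx: 1 continuation byte follows
            value, extra = lead & 0x1F, 1
        elif lead >> 4 == 0b1110:    # 1110xxxx: 2 continuation bytes follow
            value, extra = lead & 0x0F, 2
        elif lead >> 3 == 0b11110:   # 11110xxx: 3 continuation bytes follow
            value, extra = lead & 0x07, 3
        else:
            raise ValueError("invalid leading byte: %02X" % lead)
        tail = data[i + 1 : i + 1 + extra]
        if len(tail) != extra:
            raise ValueError("truncated sequence")
        for cont in tail:
            if cont >> 6 != 0b10:    # continuation bytes must be 10xxxxxx
                raise ValueError("invalid continuation byte: %02X" % cont)
            value = (value << 6) | (cont & 0x3F)  # append 6 payload bits
        code_points.append(value)
        i += 1 + extra
    return code_points


message = bytes.fromhex("49 E2 99 A5 E6 97 A5 E6 9C AC")
points = decode_utf8(message)
text = "".join(chr(p) for p in points)

print(" ".join("%04X" % p for p in points))  # 0049 2665 65E5 672C
print(text == message.decode("utf-8"))       # True
```

The last line checks the hand-rolled result against Python's built-in decoder, which is what you would use in practice.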

Wednesday, February 12, 2014

COMP6021 Class 11 Unicode Lab

the one with the text files

I asked students to download an English-language book, paste it into Notepad, and save it using each of the four formats available: ANSI, Unicode, Unicode big endian, UTF-8.

I then asked them to do the same with a Japanese book. But some students' machines lost the plot.

I asked students to generate a 1000 character English-language document and save it with the four formats.

I then asked them to replace 200 of the characters with Japanese and do the same. Again some machines freaked out.




The values Colin got are shown in the table below. What conclusions can we draw? In particular, consider what happened in the mixed English & Japanese text.

If you have another text editor on your own machine try the same experiments with different formats.

File sizes in bytes:

Document                       ANSI     Unicode  Unicode BE  UTF-8
The Hound of the Baskervilles  320,692  641,386  641,386     326,691
Rashomon                       6,315*   12,632   12,632      18,222
Thousand English               1,080    2,162    2,162       1,083
Thousand Mixed                 1,080*   2,124    2,124       1,274



* Japanese text not preserved; displayed as gibberish.
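
If you want to check your conclusions without Notepad, here is a rough Python sketch that estimates the file size for each format. It rests on assumptions that are mine, not measured in the lab: "ANSI" is taken to be Windows-1252 (it depends on the machine's code page), "Unicode" and "Unicode big endian" are taken to be UTF-16 with a 2-byte byte order mark, and UTF-8 is taken to be saved with a 3-byte byte order mark. The function name `notepad_sizes` and the sample texts are invented for the example.

```python
import codecs


def notepad_sizes(text):
    """Estimate file sizes in bytes for the four Notepad formats."""
    return {
        # Assumption: ANSI = Windows-1252; characters it cannot
        # represent are replaced with "?" (one byte each).
        "ANSI": len(text.encode("cp1252", errors="replace")),
        # Assumption: 2-byte byte order mark + 2 bytes per character
        # (true for characters in the Basic Multilingual Plane).
        "Unicode": len(codecs.BOM_UTF16_LE + text.encode("utf-16-le")),
        "Unicode BE": len(codecs.BOM_UTF16_BE + text.encode("utf-16-be")),
        # Assumption: 3-byte byte order mark + 1 to 4 bytes per character.
        "UTF-8": len(codecs.BOM_UTF8 + text.encode("utf-8")),
    }


english = "a" * 1000                 # 1000 ASCII characters
mixed = "a" * 800 + "\u65e5" * 200   # 800 ASCII + 200 copies of U+65E5

print(notepad_sizes(english))
# {'ANSI': 1000, 'Unicode': 2002, 'Unicode BE': 2002, 'UTF-8': 1003}
print(notepad_sizes(mixed))
# {'ANSI': 1000, 'Unicode': 2002, 'Unicode BE': 2002, 'UTF-8': 1403}
```

In this sketch the UTF-16 size does not change when Japanese is mixed in, while the UTF-8 size grows by two bytes for every character replaced, and ANSI keeps its size only by throwing the Japanese away.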


Feel free to leave comments below


At the end of class I got each student to download the three images needed for the assignment. More about that another time.

COMP6021 Class 10 Unicode

the one with the Unicode



We looked at standard tables for storing text. We started with ASCII and then saw some of the various extensions, DOS Code Pages, and various ISO formats before arriving at Unicode. We started to look at some of the ways Unicode code points can be represented with bits. UTF-7, UTF-32 were easy. We ignored UTF-16. UTF-8 started easy but got more complicated. We will continue with UTF-8 in the next class.
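
To see why a fixed-width form is easy and UTF-8 is less so, here is a small Python sketch that prints the bytes for a few code points in UTF-32 (big endian, no byte order mark) and in UTF-8. The helper names `hex_bytes` and `show_encodings` and the choice of sample characters are mine, purely for illustration.

```python
def hex_bytes(data):
    """Format bytes as space-separated uppercase hex pairs."""
    return " ".join("%02X" % b for b in data)


def show_encodings(ch):
    """Return the code point, UTF-32 (big endian) and UTF-8 bytes of ch."""
    return (
        "U+%04X" % ord(ch),
        hex_bytes(ch.encode("utf-32-be")),  # always 4 bytes: the code point itself
        hex_bytes(ch.encode("utf-8")),      # 1 to 4 bytes, depending on the value
    )


for ch in ["A", "\u00e9", "\u2665", "\u65e5"]:
    print(" | ".join(show_encodings(ch)))
# U+0041 | 00 00 00 41 | 41
# U+00E9 | 00 00 00 E9 | C3 A9
# U+2665 | 00 00 26 65 | E2 99 A5
# U+65E5 | 00 00 65 E5 | E6 97 A5
```

The UTF-32 column is just the code point padded to four bytes. The UTF-8 column matches ASCII for the first row and then grows, with the code point's bits spread across leading and continuation bytes.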

COMP6021 Characters, Symbols and the Unicode Miracle - Computerphile