Colin Manning Munster Technological University: Unicode


Thursday, February 13, 2014

COMP6021 Class 12

the one with the UTF-8

We reviewed the data from our experiments in the lab when we saved English & Japanese text using different formats.

I reviewed the UTF-8 variable-length coding system. (screen capture failed)

Exercise
I gave students a UTF-8 message to decode: 49 E2 99 A5 E6 97 A5 E6 9C AC
Students were given copies of the relevant areas of the Unicode code points.
http://www.rikai.com/library/kanjitables/kanji_codes.unicode.shtml
http://www.alanwood.net/unicode/miscellaneous_symbols.html

Although they got off to a slow start, I was very pleased that by the end all students had figured it out. This is the kind of thing that would make for a good exam question.

Solution
01001001 11100010:10011001:10100101 11100110:10010111:10100101 11100110:10011100:10101100

I 11100010:10011001:10100101 11100110:10010111:10100101 11100110:10011100:10101100
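Remove the marker bits for leading and continuation bytes (1110 at the start of each three-byte leading byte, 10 at the start of each continuation byte):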

I 0010011001100101 0110010111100101 0110011100101100

I 2665 65E5 672C

I ♥日本
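The same steps can be sketched in Python. This is a minimal illustration rather than a full decoder: the function name `decode_utf8` is made up for this example, and it skips the checks a production decoder needs (overlong forms, surrogates, values above U+10FFFF).

```python
def decode_utf8(data: bytes) -> list[int]:
    """Turn UTF-8 bytes into code points by stripping the marker bits."""
    code_points = []
    i = 0
    while i < len(data):
        lead = data[i]
        if lead & 0b10000000 == 0:               # 0xxxxxxx: one byte (ASCII)
            value, extra = lead, 0
        elif lead & 0b11100000 == 0b11000000:    # 110xxxxx: 1 continuation byte
            value, extra = lead & 0b00011111, 1
        elif lead & 0b11110000 == 0b11100000:    # 1110xxxx: 2 continuation bytes
            value, extra = lead & 0b00001111, 2
        elif lead & 0b11111000 == 0b11110000:    # 11110xxx: 3 continuation bytes
            value, extra = lead & 0b00000111, 3
        else:
            raise ValueError(f"invalid leading byte at offset {i}")

        tail = data[i + 1 : i + 1 + extra]
        if len(tail) != extra:
            raise ValueError("message ends in the middle of a character")
        for cont in tail:
            if cont & 0b11000000 != 0b10000000:  # must look like 10xxxxxx
                raise ValueError("expected a continuation byte")
            value = (value << 6) | (cont & 0b00111111)

        code_points.append(value)
        i += 1 + extra
    return code_points


message = bytes.fromhex("49 E2 99 A5 E6 97 A5 E6 9C AC")
points = decode_utf8(message)
print([f"U+{p:04X}" for p in points])   # ['U+0049', 'U+2665', 'U+65E5', 'U+672C']
print("".join(chr(p) for p in points))  # I♥日本
```

Python's built-in `message.decode("utf-8")` gives the same string, which is a handy way to check a hand-decoded answer.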

Wednesday, February 12, 2014

COMP6021 Class 11 Unicode Lab

the one with the text files

I asked students to download an English-language book, paste it into Notepad, and save it using each of the four formats available: ANSI, Unicode, Unicode big endian, UTF-8.

I then asked them to do the same with a Japanese book. But some students' machines lost the plot.

I asked students to generate a 1000-character English-language document and save it with the four formats.

I then asked them to replace 200 of the characters with Japanese and do the same. Again some machines freaked out.

The values Colin got are shown in the table below. What conclusions can we draw? In particular, consider what happened in the mixed English & Japanese text.

If you have another text editor on your own machine try the same experiments with different formats.

Document	ANSI	Unicode	Unicode BE	UTF-8
The Hound of the Baskervilles	320,692	641,386	641,386	326,691
Rashomon	6,315*	12,632	12,632	18,222
Thousand English	1,080	2,162	2,162	1,083
Thousand Mixed	1,080*	2,124	2,124	1,274

* Japanese text not preserved. Displayed gibberish.
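If you want to repeat the experiment without Notepad, here is a rough Python sketch. It assumes Notepad's "ANSI" is Windows code page 1252 (true on Western European systems, not everywhere), that "Unicode" and "Unicode big endian" are UTF-16 with a 2-byte byte order mark, and that "UTF-8" is saved with a 3-byte byte order mark. The function name `encoded_sizes` and the sample texts are made up for illustration, so the numbers will not match the table above.

```python
import codecs


def encoded_sizes(text: str) -> dict[str, int]:
    """Size in bytes of `text` under each of the four assumed Notepad formats."""
    return {
        # Characters outside cp1252 become '?', which is how Japanese gets lost.
        "ANSI": len(text.encode("cp1252", errors="replace")),
        "Unicode": len(codecs.BOM_UTF16_LE + text.encode("utf-16-le")),
        "Unicode BE": len(codecs.BOM_UTF16_BE + text.encode("utf-16-be")),
        # "utf-8-sig" writes the 3-byte byte order mark before the text.
        "UTF-8": len(text.encode("utf-8-sig")),
    }


english = "a" * 1000             # hypothetical 1000-character English text
mixed = "a" * 800 + "日" * 200    # hypothetical text with 200 Japanese characters

print(encoded_sizes(english))
# {'ANSI': 1000, 'Unicode': 2002, 'Unicode BE': 2002, 'UTF-8': 1003}
print(encoded_sizes(mixed))
# {'ANSI': 1000, 'Unicode': 2002, 'Unicode BE': 2002, 'UTF-8': 1403}
```

In this sketch the UTF-16 sizes stay the same when Japanese is mixed in, while UTF-8 grows by two extra bytes for each Japanese character.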

Feel free to leave comments below

At the end of class I got each student to download the three images needed for the assignment. More about that another time.

COMP6021 Class 10 Unicode

the one with the Unicode

We looked at standard tables for storing text. We started with ASCII and then saw some of the various extensions, DOS Code Pages, and various ISO formats before arriving at Unicode. We started to look at some of the ways Unicode code points can be represented with bits. UTF-7 and UTF-32 were easy. We ignored UTF-16. UTF-8 started easy but got more complicated. We will continue with UTF-8 in the next class.
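To see the difference between a fixed-width and a variable-width representation, here is a small Python sketch that prints a few code points as UTF-32 and as UTF-8. The helper name `show_encodings` and the choice of sample characters are just for illustration, and the `hex(" ")` separator needs Python 3.8 or later.

```python
def show_encodings(ch: str) -> tuple[str, str, str]:
    """Return (code point, UTF-32 bytes, UTF-8 bytes) for one character, as hex."""
    code_point = f"U+{ord(ch):04X}"
    utf32 = ch.encode("utf-32-be").hex(" ")  # always 4 bytes per character
    utf8 = ch.encode("utf-8").hex(" ")       # 1 to 4 bytes per character
    return code_point, utf32, utf8


for ch in "I♥日":
    code_point, utf32, utf8 = show_encodings(ch)
    print(f"{code_point}  UTF-32: {utf32}  UTF-8: {utf8}")
# U+0049  UTF-32: 00 00 00 49  UTF-8: 49
# U+2665  UTF-32: 00 00 26 65  UTF-8: e2 99 a5
# U+65E5  UTF-32: 00 00 65 e5  UTF-8: e6 97 a5
```

UTF-32 is just the code point padded out to 32 bits, which is why it is easy. UTF-8 needs the leading and continuation byte markers, which is where it gets more complicated.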
