### BEGIN
Linguistic Society of America (LSA) 2005, Oakland, CA
http://www.lsadc.org/annmeet/sessions.html#sunmorn
Workshop: Unicode for Linguists: How To Type, Send, and Archive
Linguistic Texts by Computer
Sunday, 9 January 9:00 AM - 12:00 PM
Room: Simmons 3/4, Oakland Convention Center.
Organizer: Deborah Anderson (UC-Berkeley)
Participants: Charles A. Bigelow (Bigelow & Holmes), William Bright,
Peter Constable (Microsoft), Richard Cook (UC-Berkeley), Kenneth
Whistler (Sybase, Inc.)
--- Workshop Schedule ---
9:00 - 9:05 Introductory remarks by William Bright
9:05 - 9:50 "Introduction to Unicode for Linguists"
by Peter Constable
9:55 - 10:25 "Unicode Fonts for Linguists (with a demonstration)"
by Charles Bigelow
10:30 - 11:00 "Conversion of Legacy Linguistic Transcription data to
Unicode 4.0" by Richard Cook
11:05 - 11:20 "Developing Tools with IPA-encoded Unicode: Towards a
Phonologically-Based Search Engine" by Edward Garrett
11:25 - 11:40 "The Future of Unicode"
by Deborah Anderson and Ken Whistler
11:40 - 12:00 Discussion and Q & A period with panel participants.
---
COOK
Dr. Richard Cook (Project Manager and Systems Administrator, the
Sino-Tibetan Etymological Dictionary and Thesaurus at UC Berkeley;
Unicode editorial committee member, representative to ISO/IEC
JTC1/SC2/WG2/IRG; co-author of the CDL specification; Post-Doctoral
Researcher and programmer for the World Color Survey.)
Richard Cook will discuss Unicode 4.0 linguistic transcription
support, and conversion of legacy data to Unicode 4.0. The use of
custom(izable) tools for converting legacy data to Unicode 4.0 will
be demonstrated.
Title:
Conversion of legacy linguistic transcription data to Unicode 4.0
Abstract:
This presentation describes the process of converting legacy data to
Unicode, including character set mapping and the encoding of
previously unencoded characters, and demonstrates the use of
conversion tools. Terminology is introduced in context, including the
following: legacy data, code point, encoding, custom encoding,
standard encoding, USV (Unicode scalar value), mapping, character set,
encoding form.
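
As a rough illustration of several of these terms, the sketch below
takes one transcription character and shows its code point (written
as a USV), its character name, and its byte sequences in three
encoding forms. The choice of U+014B (LATIN SMALL LETTER ENG) is an
assumption made for the example and is not drawn from the STEDT
character set.

```python
import unicodedata

# Example character chosen for illustration only: U+014B, the velar nasal.
ch = "\u014B"

# The code point, written in the conventional U+XXXX notation for a USV.
usv = f"U+{ord(ch):04X}"

# The character name assigned by the Unicode Standard.
name = unicodedata.name(ch)

# The same code point serialized in three Unicode encoding forms
# (big-endian, without a byte order mark).
utf8 = ch.encode("utf-8")
utf16 = ch.encode("utf-16-be")
utf32 = ch.encode("utf-32-be")

print(usv, name)
print("UTF-8: ", utf8.hex(" "))
print("UTF-16:", utf16.hex(" "))
print("UTF-32:", utf32.hex(" "))
```

The code point stays the same throughout; only the byte sequence
differs between encoding forms.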
Unicode conversion of the STEDT Project's legacy data serves as a
specific example. The STEDT Project (federally funded at UC Berkeley
since 1987) began migrating its million-record Sino-Tibetan (ST)
lexical relational database system to Unicode in earnest in the late
1990s. The justification for this conversion hinges on attaining
universal permanent data access. STEDT data was originally input and
archived using a custom-encoded Apple Macintosh character set refined
over the years on the basis of the transcription characters appearing
in specific lexical print (and handwritten) sources. Data input using
other encodings was migrated to this custom encoding. Source
transcriptions would sometimes be normalized in order to render them
in the custom encoding, though in general the character set was
well suited to capturing ST transcriptional conventions. The
custom encoding was not, however, well suited to smooth data
interchange and archiving.
The conversion process involved the following steps: (0) Realizing
that it needed to be done and could only be done in stages, and
learning how to do it; (1) Mapping to Unicode 3.0; (2) Formal
proposal of unencoded characters for inclusion in Unicode/ISO 10646;
(3) Mapping to Unicode 4.0; (4) Creation of tools for conversion of
the custom-encoded data; (5) Creation of tools for use of the new
Unicode data. Complete standard mapping of this character set became
possible only with the advent of Unicode 4.0, with the encoding of
certain characters peculiar to ST usage that had previously escaped
the notice of standardization bodies.
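
The mapping and conversion steps above can be sketched as a table
lookup. This is a minimal sketch and not the STEDT Project's actual
tooling: the table LEGACY_TO_UNICODE, its byte values, and the
function convert are all hypothetical, and the sketch assumes a
single-byte legacy encoding whose lower half (0x00-0x7F) is plain
ASCII. Bytes with no mapping are replaced by U+FFFD and their offsets
are reported, which is one way to find characters that still need a
mapping or an encoding proposal.

```python
from typing import List, Tuple

# Hypothetical mapping table: legacy byte value -> Unicode string.
# These assignments are invented for illustration only.
LEGACY_TO_UNICODE = {
    0x80: "\u014B",   # LATIN SMALL LETTER ENG
    0x81: "\u0254",   # LATIN SMALL LETTER OPEN O
    0x82: "\u0259",   # LATIN SMALL LETTER SCHWA
    0x83: "a\u0303",  # one legacy byte -> base letter + COMBINING TILDE
}

def convert(data: bytes) -> Tuple[str, List[int]]:
    """Convert legacy bytes to a Unicode string.

    Returns the converted text and the offsets of bytes that had no
    mapping (each replaced by U+FFFD REPLACEMENT CHARACTER).
    """
    out: List[str] = []
    unmapped: List[int] = []
    for offset, byte in enumerate(data):
        if byte < 0x80:
            # Assumption: the lower half of the legacy set is ASCII.
            out.append(chr(byte))
        elif byte in LEGACY_TO_UNICODE:
            out.append(LEGACY_TO_UNICODE[byte])
        else:
            out.append("\uFFFD")
            unmapped.append(offset)
    return "".join(out), unmapped

text, unmapped = convert(b"k\x82\x80\xff")
print(text.encode("utf-8"), unmapped)
```

Reporting unmapped offsets instead of silently dropping bytes keeps
the original record recoverable until the table is complete.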
Step 5, which dictated aspects of the actions taken in step 4, is
ongoing and addresses a whole set of related issues, including: (A) relative
primacy and maintenance of the data in original and converted forms;
(B) database application data requirements; (C) font support.
Unicode now serves as the standard interface to STEDT's lexical
data, and with application and font support coming soon to a computer
near you, Unicode will provide linguists worldwide with access to this
valuable data for years to come.
--------------------
Dr. Richard S. Cook
STEDT Project
Linguistics Dept.
UC Berkeley
http://stedt.berkeley.edu
### END