### BEGIN

Linguistic Society of America (LSA) 2005, Oakland, CA
http://www.lsadc.org/annmeet/sessions.html#sunmorn

Workshop: Unicode for Linguists: How To Type, Send, and Archive Linguistic Texts by Computer
Sunday, 9 January, 9:00 AM - 12:00 PM
Room: Simmons 3/4, Oakland Convention Center

Organizer: Deborah Anderson (UC-Berkeley)
Participants: Charles A. Bigelow (Bigelow & Holmes), William Bright, Peter Constable (Microsoft), Richard Cook (UC-Berkeley), Kenneth Whistler (Sybase, Inc.)

--- Workshop Schedule ---

9:00 - 9:05 Introductory remarks by William Bright
9:05 - 9:50 "Introduction to Unicode for Linguists" by Peter Constable
9:55 - 10:25 "Unicode Fonts for Linguists (with a demonstration)" by Charles Bigelow
10:30 - 11:00 "Conversion of Legacy Linguistic Transcription data to Unicode 4.0" by Richard Cook
11:05 - 11:20 "Developing Tools with IPA-encoded Unicode: Towards a Phonologically-Based Search Engine" by Edward Garrett
11:25 - 11:40 "The Future of Unicode" by Deborah Anderson and Ken Whistler
11:40 - 12:00 Discussion and Q & A period with panel participants

--- COOK

Dr. Richard Cook (Project Manager and Systems Administrator, the Sino-Tibetan Etymological Dictionary and Thesaurus at UC Berkeley; Unicode editorial committee member, representative to ISO/IEC JTC1/SC2/WG2/IRG; co-author of the CDL specification; Post-Doctoral Researcher and programmer for the World Color Survey.)

Richard Cook will discuss Unicode 4.0 linguistic transcription support, and conversion of legacy data to Unicode 4.0. The use of custom(izable) tools for converting legacy data to Unicode 4.0 will be demonstrated.

Title: Conversion of legacy linguistic transcription data to Unicode 4.0

Abstract: This presentation describes the process of converting legacy data to Unicode encoding, including character set mapping, encoding unencoded characters, and demonstration of conversion tool use.
Terminology is introduced in context, including the following: legacy data, code point, encoding, custom encoding, standard encoding, USV, mapping, character set, encoding form. Unicode conversion of the STEDT Project's legacy data serves as a specific example.

The STEDT Project (federally funded at UC Berkeley since 1987) began migrating its million-record Sino-Tibetan (ST) lexical relational database system to Unicode in earnest in the late 1990s. The justification for this conversion hinges on attaining universal permanent data access. STEDT data was originally input and archived using a custom-encoded Apple Macintosh character set, refined over the years on the basis of the transcription characters appearing in specific lexical print (and handwritten) sources. Data input using other encodings was migrated to this custom encoding. Source transcriptions would sometimes be normalized in order to render them in the custom encoding, though in general the character set was well suited to capturing ST transcriptional conventions. The custom encoding was not, however, well suited to smooth data interchange and archiving.

The conversion process involved the following steps:
(0) Realizing that it needed to be done and could only be done in stages, and learning how to do it;
(1) Mapping to Unicode 3.0;
(2) Formal proposal of unencoded characters for inclusion in Unicode/ISO 10646;
(3) Mapping to Unicode 4.0;
(4) Creation of tools for conversion of the custom-encoded data;
(5) Creation of tools for use of the new Unicode data.

Complete standard mapping of this character set became possible only with the advent of Unicode 4.0, with the encoding of certain characters peculiar to ST usage that had previously escaped the notice of standardization bodies.
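
As a rough illustration of the mapping and conversion steps (1, 3, and 4) above, here is a minimal Python sketch of a table-driven converter. It is not the STEDT tool: the byte assignments in LEGACY_TO_USV and the function names convert_legacy and describe are hypothetical, invented for this example, and the sketch assumes a single-byte custom encoding whose lower half (below 0x80) is plain ASCII.

```python
# Hypothetical mapping from custom-encoded byte values to Unicode scalar
# values (USVs). These assignments are invented for illustration only;
# they are NOT the actual STEDT character set mapping.
LEGACY_TO_USV = {
    0x80: 0x0259,  # LATIN SMALL LETTER SCHWA
    0x81: 0x014B,  # LATIN SMALL LETTER ENG
    0x82: 0x0294,  # LATIN LETTER GLOTTAL STOP
}


def convert_legacy(data: bytes) -> str:
    """Convert custom-encoded bytes to a Unicode string.

    Bytes below 0x80 are assumed to be ASCII. Any other byte must appear in
    the mapping table; otherwise a ValueError reports its value and offset,
    so that unmapped characters are flagged rather than silently dropped.
    """
    out = []
    for offset, byte in enumerate(data):
        if byte < 0x80:
            out.append(chr(byte))
        elif byte in LEGACY_TO_USV:
            out.append(chr(LEGACY_TO_USV[byte]))
        else:
            raise ValueError(
                f"unmapped legacy byte 0x{byte:02X} at offset {offset}"
            )
    return "".join(out)


def describe(text: str) -> list:
    """List each character's code point in U+XXXX notation."""
    return [f"U+{ord(ch):04X}" for ch in text]


if __name__ == "__main__":
    # "n", schwa, eng in the hypothetical custom encoding.
    legacy = bytes([0x6E, 0x80, 0x81])
    text = convert_legacy(legacy)
    print(text, describe(text))
    # The same characters serialized in two Unicode encoding forms.
    print(text.encode("utf-8").hex(), text.encode("utf-16-be").hex())
```

Raising an error on an unmapped byte, rather than substituting a placeholder, reflects the situation described in step 2: a character with no standard code point has to be identified and handled explicitly before the mapping can be called complete.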
Step 5, which dictates aspects of the actions taken in step 4, is ongoing and addresses a set of related issues, including: (A) relative primacy and maintenance of the data in original and converted forms; (B) database application data requirements; (C) font support. Unicode now serves as the standard interface to STEDT's lexical data, and with application and font support coming soon to a computer near you, Unicode will provide linguists worldwide with access to this valuable data for years to come.

--------------------
Dr. Richard S. Cook
STEDT Project
Linguistics Dept.
UC Berkeley
http://stedt.berkeley.edu

### END