Dolist:
    integrate viewgraphs from ss.vg talk
    get CSCW reviewers: Malone, Grudin

A Structured Approach to Sharing Semi-Structured Information

Gary Perlman
J. Edward Swan, II

Keywords: information sharing systems interoperability

Abstract:
Semi-structured information is not as well defined as database records but is more structured than full text. Semi-structured records are similar to forms, attribute lists, and frames. They contain a series of fields with arbitrary names and (possibly typed) values, in which the order of fields is arbitrary and often irregular, except that multiple instances of the same field are ordered. Examples of semi-structured information include bibliographic records, electronic mail messages and news articles, survey questionnaires, schedules of events, and personal databases such as address books, household inventories, and grade rosters.
Semi-structured records can be extended by allowing field values to contain pointers to other records; textual identifiers of other records can effectively implement hierarchical and hypertext network structures. For example, bibliographic records can point to their references, news articles can point to the articles they discuss, and hierarchical structures can be created by having superordinates point to their subordinates.
Given the diversity of semi-structured information and its apparent suitability for many information sharing tasks, it is not surprising that there is much software for managing it. What is surprising is that the software used for one domain (e.g., mail) is usually not used for another domain (e.g., bibliographies), even if the domains are very similar (e.g., many people use different programs for mail and news), although there are some notable exceptions (e.g., emacs mail/news readers, and the MH mail handler, which uses UNIX file manipulation programs to manipulate mail messages).
One reason for the use of different software tools is that operations in one domain may not make sense in another domain, at least not at first glance. Another reason is the difference in storage formats; each domain has its own peculiar markup for the same basic structure, making software reuse difficult.
The SST (Semi-Structured Toolkit) attempts to provide an integrated environment for semi-structured information. At the heart of the SST is a table-driven parser/generator that reads and writes a wide variety of semi-structured record (SSR) formats, representing them internally in a generic structure for manipulation. Because SSRs have a simple structure, dynamic input parsing and output generation can be done efficiently. Common manipulations on semi-structured records include matching, sorting, editing, selecting, formatting, and viewing.
By creating a cross-tabulation of operations by application domains, we have found many holes where an operation supplied in one application program is not supplied in another, even though the missing operation would be useful. For example, the Berkeley Mail program does not allow search operations (although many other mail programs do), and it does not let users reorder messages according to key fields such as date, sender, and subject. It may be the exception rather than the rule that functionality is domain specific; the number of functions in a domain-specific application can often be doubled and still include only intuitive functions. For example, the generation of tables of contents (based on a hierarchy of fields that are displayed only when their values change), common in bibliographic applications, is readily applied to sorted mail and news.
The SST is implemented in ANSI C and runs in various forms on UNIX (command line and X), Macintosh, and DOS.
Work is under way on an interface, PipeFitter, that allows novices to link the tools in UNIX-like pipelines to create applications for exploring bibliographies, mail and news, and a variety of other application domains (this work is being done by Lynn Snider, and will also be applicable to many UNIX filter programs). Other work is developing widgets for viewing SSRs in a variety of dynamically manipulable formats (this work is being done by J. Edward Swan, II). There is also an effort to "productize" the SST at three levels (C function libraries, UNIX filter programs, and a graphical connector) so that it can be used by others.
The first release of the SST and PipeFitter will be in conjunction with the HCI Bibliography, a free-access extended bibliography on Human-Computer Interaction being compiled at The Ohio State University. The SST is being used for manipulating bibliographic records, for handling mail requests for information about the project, and for managing information (including pictures) about people in HCI.
[1] The Problem: Poor Information Sharing
    {goals:}
[1.1] Incompatible File Formats -> Poor Information Sharing
[1.2] The Myth of Incompatible Functionality
[1.3] Low Functionality
[1.4] Obstacles to Technology Transfer
    {need to distinguish between technological solutions and their transfer / acceptance}

[2] Our Solution: The SST to operate on general SS Information
    {goals: need to convince people of:
        the generality: applies to many data types
        the novelty: we are doing something new here, not "yet another X for Y"
    }
    {this section is a survey of SS info and tools and general operations}
[2.1] A Survey of Examples of and Operations on SS Information
    Applicable to many types of information
    Diversity indicates the importance of SS Info
    see J.TOOIS.87.5.393 on hetero data access
[2.1.1] Example: Calendars
    (may want to use other example for first one because Malone suggested this)
    no standard for information sharing
    events
        start time [stop time]
        start date [stop date]
        weekday
        who
        what
        where
[2.1.2] Example: Bibliographic Information
    refer | scribe | procite (standard but incompatible formats)
    many types of publications with required and optional fields
[2.1.3] Example: mail and news
    mostly compatible systems
    many required fields, few optional fields
[2.1.4] Example: personal databases
    questionnaires
        question type default answer
    grade sheets
    addresses
    household inventories
[2.1.5] other SS info
    setopt
        various parameters
    UNIX files and their attributes
        operations: mv, cp, ln, rm, vi, chown, touch, etc
[2.2] Summary of Types/Applications/Systems and Operations
    {point out that with many applications of SS info, there exist many different systems.}
[2.3] Reconsideration of SS Information after the Survey
    systems       A  B  C  D  E
    applics       U  V  W  X  Y  Z
    combinations: AU BV CW DX EY EZ
    (no reuse of systems - why?)
        file formats
        inapplicable functions?
            no - many shared functions among systems
                most of each program is generic function
                many prima facie specific functions can be applied to other domains
        inapplicable to much data
            no - SS info is all over the place, but different formats make systems non-interoperable
        tech transfer
            diff't work for same function - change of procedure
            extra work for same function
            loss of skills investment
            lack of function in new system
            loss of function
            loss of data investment
            loss of time - usability, interoperability
[2.4] Formalization of SS Information
    {this may better belong in the intro, but it is better to introduce examples first.}
    {maybe we need to separate the survey of data types from the survey of applications and their operations.}
    Similar to forms, attribute lists, and frames (Malone's SS messages)
        - note that Malone wrote that Lens has broader applications - to calendars
    More structured than full text
        - tabular formats of typed fields. Operations on specific fields.
    Less structured than database records
        - implicit markup, partially ordered, sparse
[2.4.1] Definition of SS Information
A semi-structured record (or less formally, a record) is a data structure that is a partially ordered list of semi-structured fields. Within a semi-structured record, the order of differently named fields does not matter, but the order of identically named fields is maintained.
A semi-structured field (or less formally, a field) is a data structure with a field-name and a field-value.
A field-name (or less formally, a name) is a string of characters containing letters, digits, and a limited set of punctuation characters: the underscore, the hyphen or minus sign, and the period or decimal point.
A field-value (or less formally, a value) is a string of characters that may span more than one line (according to a string-continuation method).
A string-continuation method is a lexical convention that determines whether a string spans more than one line.
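The record and field definitions above can be illustrated with a minimal sketch. It is written in Python rather than the SST's ANSI C, and it is not the SST's internal representation: a record is assumed to be a list of (name, value) pairs, and the names is_valid_name, values_by_name, and same_record are hypothetical, chosen only for this illustration.

```python
import re
from collections import defaultdict

# Field names: letters, digits, underscore, hyphen, and period (section 2.4.1).
NAME_PATTERN = re.compile(r"[A-Za-z0-9_.-]+")


def is_valid_name(name):
    """Return True if the string is a legal field-name."""
    return NAME_PATTERN.fullmatch(name) is not None


def values_by_name(record):
    """Group a record's values by field name.

    The order of values is kept within each name, and the relative order
    of differently named fields is discarded. This is the partial order
    described in the definition.
    """
    grouped = defaultdict(list)
    for name, value in record:
        if not is_valid_name(name):
            raise ValueError(f"invalid field name: {name!r}")
        grouped[name].append(value)
    return dict(grouped)


def same_record(left, right):
    """Compare two records, ignoring the order of differently named fields."""
    return values_by_name(left) == values_by_name(right)
```

Under these assumptions, moving a title field before or after two author fields leaves a record unchanged, while swapping the two author fields produces a different record.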
String-continuation methods include:
    backslash-termination
    delimiting the start and end of a string with quote-pairs
    beginning the continuing line with a tab (tab indentation)
    default continuation until the start of a new semi-structured field
    default continuation until the end of a semi-structured record
A quote-pair is a pair of characters used to delimit the start and end of a string. There are seven pairs of characters: "+", `+', '+', {+}, (+), [+], <+>.
[2.4.2] Syntax Specification for SS Information
    {This may belong in architecture as part of a technological solution}
    embedded markup vs. implicit {this issue may be better as part of the data survey}
    flexible (multiple formats)
    types (persons, paper titles, dates)
    use of application-specific schemas to augment embedded markup
        e.g., longer version of "A" field is "author" which is of type "name"
[2.4.3] Operations on SS information
    {this section is large could be part of task analysis by surveying systems}

[3] The Architecture of the SS Toolkit
    {this may be better called "the technology of the solution"}
    (this section is not too novel - just good practice)
    (the evolution of the functions may be interesting: reduction of union of systems)
    SS Toolkit
    cross-products of applics by operations
        domain dependent        domain independent
        primary operations      secondary operations
                  <----- generalization
    table of primary and secondary operations
    integration of indexing
        for efficiency
        for hypertext
    outlining views
[3.1] The C Function Library Level
The SS Function Library consists of two levels of functions:
Level 0: The SS Abstract Data Type
This level hides the secrets of the internal representation of the data structures used by the SS Function Library. It defines the data structures and provides an abstract interface through constants, macros, and functions.
Access functions (defined in ss.h) include:
    ssnew ()
    ssfree (record)
    ssgetnfields (record)
    sssetnfields (record, n)
    ssgetname (record, i)
    sssetname (record, i, name)
    ssgetvalue (record, i)
    sssetvalue (record, i, value)
Level 1: The SS Functions
This level implements the major functions of the SS Library. Functions are categorized by their operands and operations.
I/O Functions include:
    record = ssread (format, file)
    sswrite (record, format, file)
    ssfparse (buffer, method, name, value)
    ssformat (record, template, buffer)     *** modified
Editing Functions include:
    ssfinsert (record, location, field)
    ssfdelete (record, location)            *** may only mark a field as deleted
    ssforder (record, namelist)
    ssfeval (record, location)
Searching and Sorting Functions include:
    sscompare (record, record, namelist, criteria)
    ssmerge (record, record, options)
    sexprinit (expression)
    ssmatch (record, sexpr)
    ssindex (file, primary, secondary, options)
    sssearch (file, sexpr, primary, secondary, options)
    options include stop and go lists
[3.2] The UNIX Filter Command Set Level
All SS commands are UNIX-style filters. This means that they interpret a series of options and then process a series of files. If no files are specified, or if the file name "-" appears in the file list, then the standard input is read. The SS commands write their output to the standard output and their diagnostics and error messages to the standard error. They exit with a non-zero status on failure and zero on success.
The SS commands include:
    ssfinsert   inserts fields into records
    ssfdelete   deletes fields from records
    ssfmove     orders fields within records
    ssfeval     evaluates expressions in fields
    ssmerge     merges fields from records
    sssearch    searches for matching records
    ssformat    formats the fields in records
    sssort      sorts records according to sorting fields
The SS command line options include:
    -f name
        Field Name on which to Operate. This option can be repeated for many commands.
    -H
        Provide Online Help for the Command.
        This help includes options and current values, limits, and program version information.
    -X string
        Storage Specification String. This option specifies the input (and output) format for the command, allowing commands to read (and optionally write) different database formats. If the string is a file, then the format string is read from that file; otherwise the string is interpreted directly.
    -X match
        Match Criteria. This option specifies the criteria by which names and values in records will be matched. Criteria include case-sensitivity and substring.
    -r
        Reverse. This option reverses the operation of the command.
    -X template
        Format Template. This option contains a small language for flexible formatting of the contents of records to produce output that is not necessarily a record. If the template is a file, then the format template is read from that file; otherwise the template is interpreted directly. In a formatting template string, conversion sequences follow many of the conventions of the printf(3) function. The general form of the conversion sequence for a field is:
            %[[-]w][.d]name[{word}]
        All conversion sequences begin with a % character and specify a named field to be converted. The field value is padded to at least w characters and truncated to at most d characters. Within that width the value is right-justified by default, and left-justified if a - sign is used. Before formatting, a single space-separated word can be selected from the value by its number. All characters that are not part of a conversion sequence are output verbatim.
    -X location
        Field Location. This option specifies the location at which an action will take place. The location can be a field number or a field name.
[3.3] The Graphical Interface Level
[3.3.1] PipeFitter?
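A pipeline of the kind PipeFitter is meant to assemble can be sketched in a few lines: search, then sort, then format with a template from section 3.2. The sketch is an illustration in Python, not the SST's C implementation. The functions search, sort_by, format_record, and pipeline are hypothetical stand-ins for sssearch, sssort, and ssformat, and records are assumed to be lists of (name, value) pairs. Two further assumptions go beyond the text: names inside a template start with a letter and contain only letters, digits, and underscores (so that trailing punctuation stays literal), and word numbers in {word} count from 1.

```python
import re

# Conversion sequence from section 3.2: %[[-]w][.d]name[{word}]
# Assumption: template names start with a letter and use only letters,
# digits, and underscores, so punctuation after a name is output verbatim.
CONVERSION = re.compile(
    r"%(?:(-)?(\d+))?(?:\.(\d+))?([A-Za-z][A-Za-z0-9_]*)(?:\{(\d+)\})?"
)


def first_value(record, name):
    """Return the first value of the named field, or "" if it is absent."""
    for field_name, value in record:
        if field_name == name:
            return value
    return ""


def search(records, name, text):
    """Stand-in for sssearch: case-insensitive substring match on one field."""
    needle = text.lower()
    for record in records:
        if any(n == name and needle in v.lower() for n, v in record):
            yield record


def sort_by(records, name):
    """Stand-in for sssort: order records by the first value of one field."""
    return sorted(records, key=lambda record: first_value(record, name))


def format_record(record, template):
    """Stand-in for ssformat: expand conversion sequences in a template."""
    def expand(match):
        left, width, limit, name, word = match.groups()
        value = first_value(record, name)
        if word is not None:        # assumed 1-based word selection
            words = value.split()
            index = int(word) - 1
            value = words[index] if 0 <= index < len(words) else ""
        if limit is not None:       # truncate to at most d characters
            value = value[:int(limit)]
        if width is not None:       # pad to at least w characters
            value = value.ljust(int(width)) if left else value.rjust(int(width))
        return value
    return CONVERSION.sub(expand, template)


def pipeline(records, name, text, sort_field, template):
    """search | sort | format, in the manner of a UNIX pipeline."""
    matched = search(records, name, text)
    return [format_record(r, template) for r in sort_by(matched, sort_field)]
```

With a record whose A field is "Gary Perlman", the template "%8.4A|" yields the value truncated to "Gary" and right-justified in eight columns, and "%A{2}" selects "Perlman". Because each stage takes records and returns records, other stages could be inserted between search and format without changing the rest.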
[3.3.2] The SST Programming Language
    (this may belong in a section on recreating applications)

[4] How the SST Helps with the Problems
    syntax spec + programming language
[4.1] Recreating Existing Applications with the SST
    users info file systems formats functions syntax prog lang
    we are NOT working on a new mail/news system
    we are NOT working on a new system to do X with format Y
        (that would make your skills/data investment obsolete)
    we are making a system to do anything with any SS format

[5] Screen Shots and Examples
    {need before and after shots of seamless information environments:
    Old formats:
        bib in refer format
        news in rn
        biography in ?? format
        mail in elm
    We see current work by others as devising better UI's for each application domain, not a better information environment. So, each window in the above is being replaced with another incompatible window.
    Q: What % of files in UNIX are ss records?
    New format:
        bib in ssview
        news in ssview
        bio in ssview
        mail in ssview
    }

[6] Experiences and Evaluation
    Transfer of learning
    Increased functionality
    Increased information sharing

[7] Unsolved Problems and Future Work
[7.1] Is SS Info pervasive because of Human Cognition? - ref: Minsky

[8] References

%A Paul Heckel
%T The Elements of Friendly Software Design: The New Edition
%I Sybex Inc.
%C Alameda, CA
%D 1991

%A Zloof
%T QBE

%A Malone
%T Information Lens

%A Minsky
%T Frames

%A Bobrow
%T PIE A Personal Information Environment
%B 2nd or 3rd Cognitive Science Conference