Language Processing Technologies for Electronic Rulemaking: A Project Highlight

Stuart Shulman, University of Pittsburgh, 121 University Place, Suite 600, Pittsburgh, PA 15260, +1-412-624-3776, shulman@pitt.edu
Jamie Callan, Language Technologies Institute, Carnegie Mellon University, Pittsburgh, PA 15213-8213, +1-412-268-4525, callan@cmu.edu
Eduard Hovy, USC-ISI, 4676 Admiralty Way, Marina del Rey, CA 90292-6695, +1-310-448-8731, hovy@isi.edu
Stephen Zavestoski, University of San Francisco, 2130 Fulton St., San Francisco, CA 94117-1080, +1-415-422-5485, smzavestoski@usfca.edu

ABSTRACT
In this project, we are developing new text processing tools that help people perform advanced analysis of large collections of text commentary. This problem is increasingly faced by the United States federal government's regulation writers, who formulate the rules and regulations that define the details of laws enacted by Congress. Our research focuses on text clustering, text searching using information retrieval, near-duplicate detection, opinion identification, stakeholder characterization, and extractive summarization, as well as the impact of such tools on the process of rulemaking itself. Versions of a Rule-Writer's Workbench will be built by Computer Science researchers at ISI and CMU, deployed annually for experimental use by our government partners, and evaluated by social science researchers from the Library and Information Science and Sociology departments at the Universities of Pittsburgh and San Francisco, respectively. This three-year project started in October 2004 and is funded under the National Science Foundation's Digital Government program.

Categories and Subject Descriptors
H.3.3 Information Search and Retrieval

General Terms
Algorithms, Human Factors

Keywords
Natural language processing, information retrieval, near-duplicate detection, opinion recognition, text annotation, regulations, rulemaking, federal government

1. MOTIVATION AND BACKGROUND
Since its formation, the eRulemaking Research Group has participated in and organized workshops, issued a focus-group-based stakeholder report, and made presentations to federal agencies as well as to an international academic community. We launched an eRulemaking text data testbed, created a group web site from which over 18,000 pages of written and presentation content have been viewed or downloaded since its inception in late May 2004, and collaborated with five federal agencies on the successful submission of a proposal to the NSF's Digital Government program. This multi-year study expands and upgrades the scope and function of the eRulemaking testbed and systematically feeds information back from users in and out of government via workshops, focus groups, Web surveys, reports, publications, and regular national and international presentations.

Most proposed regulations attract relatively few comments, but high-profile regulations can attract hundreds of thousands of comments from the public. When the volume of comments is large, it is usually because one or more stakeholder organizations publicized the proposed regulation and created a letter template (a "form letter") to which people can attach their names. A decade ago, form letters were not difficult to process. Except for the signature, the sender's address, and possibly a brief handwritten note, they were exact duplicates of the original form letter. Form letters could be identified and sorted easily, so that the original could be considered and the duplicate copies counted, reported, and then largely ignored. Now agencies accept comments via e-mail and the Web. With the assistance of an expanding array of for-profit web services, it is easy (though not cheap) for interest groups to ask members of the public to modify a form letter to reflect their particular viewpoints. As a result, the task of identifying distinct and substantive viewpoints is now dramatically more difficult.
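To make the near-duplicate problem concrete, one common family of techniques compares comments by the overlap of their word shingles (short runs of consecutive words). The following is a minimal illustrative sketch, not the project's actual system; the comment texts and the 0.6 similarity threshold are invented for the example.

```python
# Near-duplicate detection via word-shingle overlap (Jaccard similarity).
# Illustrative sketch only: comments whose shingle sets overlap heavily
# are treated as variants of the same form letter; low-overlap comments
# are treated as substantive originals. Texts and threshold are made up.

def shingles(text, k=3):
    """Return the set of k-word shingles of a comment."""
    words = text.lower().split()
    return {" ".join(words[i:i + k]) for i in range(len(words) - k + 1)}

def jaccard(a, b):
    """Jaccard similarity |a & b| / |a | b| of two shingle sets."""
    if not a and not b:
        return 1.0
    return len(a & b) / len(a | b)

def near_duplicate(c1, c2, threshold=0.6):
    """True if two comments look like variants of one form letter."""
    return jaccard(shingles(c1), shingles(c2)) >= threshold

form = ("Please protect our air by adopting strict mercury "
        "limits for power plants.")
edited = ("Please protect our air by adopting strict mercury "
          "limits now.")
original = ("I run a small clinic and see mercury poisoning "
            "cases every year.")

assert near_duplicate(form, edited)        # lightly edited form letter
assert not near_duplicate(form, original)  # substantive original comment
```

Exact pairwise comparison scales quadratically with the number of comments; real systems typically approximate the same shingle-set similarity with sketches such as MinHash so that hundreds of thousands of comments can be grouped efficiently.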
Agencies have experimented with simple heuristics based, for example, on the length of a comment or its e-mail routing path, but manual sorting, often by private-sector contractors, remains the predominant method. For example, the Environmental Protection Agency (EPA) recently printed over 500,000 e-mails on its proposed mercury regulation and had 15 staff members sort, by sight, the duplicates, triplicates, and all manner of variants, as well as the occasional substantive, original non-form-letter comment. Our research explores the use of information extraction and information retrieval to develop tools that assist rule-writers and