Here is a selection of my links to topics and resources related to Internet searching and personal search engines / agents, distributed indexing and archiving, concordancing and corpus linguistics, copyright and digital libraries, programming, and useful utilities. This page is still in an initial stage of development.
From the preliminary version of a Web corpus with 97,198,272 tokens and 525,509 types
- PIE Web Corpus 2006 – 100 or more HTML
- HTML version of list of 30,524 types occurring 100 or more times
- PIE Web Corpus 2006 – 100 or more TAB
- Tab-separated text version of list of 30,524 types occurring 100 or more times
- PIE Web Corpus 2006 – 10 or more TAB
- Tab-separated text version of list of 104,675 types occurring 10 or more times
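The thresholded lists above (types occurring 10 or more, or 100 or more, times) can be illustrated with a minimal sketch. This is not the procedure used to build the PIE Web Corpus lists; the tokenization rule (lowercased runs of letters) and the function names `type_frequencies`, `frequency_list`, and `to_tsv` are assumptions made only for illustration.

```python
import re
from collections import Counter


def type_frequencies(text):
    """Count occurrences of each type (distinct word form).

    Assumption: a token is a run of letters, folded to lowercase.
    The real corpus may have been tokenized differently.
    """
    tokens = re.findall(r"[^\W\d_]+", text.lower())
    return Counter(tokens)


def frequency_list(counts, min_count):
    """Return (type, count) pairs occurring at least min_count times,
    most frequent first, ties broken alphabetically."""
    rows = [(t, n) for t, n in counts.items() if n >= min_count]
    rows.sort(key=lambda row: (-row[1], row[0]))
    return rows


def to_tsv(rows):
    """Render rows as tab-separated text, one type per line."""
    return "\n".join(f"{t}\t{n}" for t, n in rows)


sample = "the cat and the dog and the bird"
counts = type_frequencies(sample)
print(sum(counts.values()), "tokens,", len(counts), "types")
print(to_tsv(frequency_list(counts, 2)))
```

Here the token count is the total number of word occurrences and the type count is the number of distinct word forms, so raising `min_count` shortens the list in the same way the 100-or-more list is shorter than the 10-or-more list.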
Links for information only -- software not evaluated yet
- Alkaline
- Search engine precompiled for various flavors of Unix, Linux and Windows, free for non-commercial use. Mentions indexing up to 500K pages.
- ASPSeek
- Freeware search engine for Linux. Crawls the Web, indexes pages, and provides user search facilities; claims to be capable of searching millions of pages. Open source.
- DataparkSearch
- Open source search engine akin to ASPSeek and mnoGoSearch, but under more active development than the former (April 2004).
- mnoGoSearch
- Search engine released under the GNU General Public License; runs under Unix, Linux, or Windows. Appears to have both free and paid versions. Includes support for various languages.
- Perlfect Search
- Freeware Perl search engine script; runs under Unix, Linux, or Windows.
- WebSPHINX
- A Customizable Personal Web Crawler; GNU license freeware; Java (= multi-platform).
- ht://Dig
- Complete WWW indexing and searching system; GNU license freeware; requires Unix or Linux. Would be useful for monitoring a selection of sites or as the basis for a specialized search engine.
Links for information only -- software not evaluated yet
- Grub.org
- "Grub provides a free for download, free to run, distributed crawling client, which is used to create an infrastructure (database + volunteers) that will eventually provide URL update status information for nearly every web page on the Internet. Grub's distributed crawler network will enable websites, content providers, and individuals to notify others that changes have occurred in their content, all in real time."
- Herodotus
- Timo Burkard, Herodotus: A Peer-to-Peer Web Archival System, MIT Master's Thesis, June, 2002. (.PDF file)
- Building a Distributed Full-Text Index for the Web
- Paper from the Tenth International World Wide Web Conference 1-5 May 2001, Hong Kong by Sergey Melnik, Sriram Raghavan, Beverly Yang, Hector Garcia-Molina, Computer Science Department, Stanford University.
KWiCFinder could be extended to support creation of special-purpose online corpora from online documents as outlined in this paper. Clarification of the copyright issues is an essential prerequisite to such an initiative.
- Internet Archive Copyright Links
- Links to various resources, including the National Academy Press' book The Digital Dilemma: Intellectual Property in the Information Age, the Association for Computing Machinery's Intellectual Property page, and the Archive's amici curiae brief to the Supreme Court in the "Sonny Bono" Copyright Term Extension Act case.
- Internet Archive "How People Envision Using Internet Libraries"
- Discusses the many reasons for archiving the Internet and for digital libraries in general.
- Kenneth D. Crews' Copyright Information Center
- Legal opinions on topics related to intellectual property in a university framework, from Indiana University-Purdue University Indianapolis. Prof. Crews is the principal investigator of copyright issues on the National Science Foundation-funded Digital Music Library project, whose Copyright Page links to papers and opinions on the issue of digital libraries.
- PowerBasic
- Superfast Windows 32 implementation of Basic which does all the "heavy lifting" for KWiCFinder. Very active and supportive user community. Highly recommended for programmers who want the power of C without the arcane
- BK ReplaceEm
- Powerful search and replace utility which can operate on groups of files; supports regular expressions; freeware. Valuable for webmasters and programmers.
Author: William H. Fletcher
Version: 8 December 2006
URL: http://KWiCFinder.com/RelatedLinks.html