June 30: Introduction to the Mighty Search Engine
*Pre-class blogtalk* Recap of last week (Semantic Web)
*Exploration* Constructing facets and topics for our blogs
*Lecture* Intro to search engines; mini-exploration of Zipf's law

*Pre-class blogtalk*
Last week was a little overwhelming...here's a recap:
Is the Semantic Web hype? (link on June 27th)
W3C's mandate of RDF makes it more difficult to use with XML. This defeats the purpose.
RDF's syntax is still too ambiguous.
"The fact that the programmer and the interpreter of the computer output use the symbols to stand for objects in the world is totally beyond the scope of the computer. The computer, to repeat, has a syntax but no semantics." John Searle
The computer knows nothing about data except the "physical" characteristics of the ones and zeroes.
RDF can be confusing even to comp sci PhDs.
Amy J. Warner, Librarian. "Guidelines & Standards for Taxonomies and Other IA Controlled Vocabularies"
Google "Amy Warner" filetype:ppt
Worked with Lou Rosenfeld at an IA consulting firm (which is no more)
Epicurious.com is another good example of faceted classification
Warner's continuum of vocabulary control: synonym control (USE, UF) -> hierarchical relationships (BT, NT) -> associative relationships (RT)
Steps in Controlled Vocabulary Construction
--Group terms by subject (facet analysis)
--Link synonyms and variants (synonym rings)
--ID broader/narrower terms (taxonomies/hierarchies)
--ID related terms (thesauri)
Subject (aboutness) will only be one facet. (Librarian brains might get stuck on the aboutness idea.)

*Exploration*
Back to our MT blogs
We're trying to figure out how to use "category" and "keywords" to gain better access to the content in our blogs.
Every entry belongs to a blog.
Scale: in a small group of blogs, some facets may seem less important (license type), but as size grows, every facet becomes more useful.
MT does not allow much access to comment content. A search engine can help remedy this -- extraction indexing.
Fine line between semi-automated indexing and what search engines do.
Each facet is a separate index in a search engine.
Constructing a comment index: MT can provide some automated help in this endeavor. Extra structure around semantics will have to be done by humans (us).
Better facets are "controlled fields" (e.g., not title).
Facets of an entry (* = supplied by MT; (M) = manual)
--Blog*
--Keywords (M)
--Category (M)
--Creator*
--Comments #*: creator* & body :(
--Trackback: URL
--Date: published & last commented on
Topic trees (hierarchical): every end point belongs to a supercategory. By contrast, facets have no connections. They are disconnected.
Paul: Faceted systems are flatter than hierarchical systems.
Facets: all together now (from the whiteboard)
comments
--#
--creator
date
--of publication
--of latest comment
creator
trackbacks
--#
--URL
category
keywords (raises the question -- how do you deal with keywords' interaction with category?)
Could also fold trackback into comments, since the syntax (# and creator) is the same for both. It depends on whether you're more interested in the origin/type of "commentaries," or in keeping the facets shallower. Remember your user! What makes more sense: seeing at a glance how much commentary there is, or seeing the origins of the feedback at a glance?
How Trackback works
Trackback is essentially an index to another blogger's post on something in your blog: a notification that a certain entry has been linked in another blog. Copy and paste the URL from the trackback box to see who's pinged your blog. XML-RPC (remote procedure call) is what makes it work.
Nittier gritty of the category and keyword
Categories: how about 2: info seeking and info offering
Pro: makes a good distinction
Con: what if one post does both? "Mixed" is not a good category.
Toby sayz: Information Anarchy
Keywords: many keywords in every post; they should be related to one another in a certain way
Yoda master or bureaucracy: "you have to shoot for substantial consensus"
Category and Keywords should serve clearly different purposes -- and keep in mind the mechanics of choosing a category (as a post-er) -- think of the pulldown menu dropping through the floor
jill/txt discovered a correlation between her mood when posting and the number of comments (fewer comments on bad-mood posts)
Let's rename the facets for more usability: category = type; keywords = topics
Steve: precedent set by the courts protects blogs and other electronic communications from libel claims because they're less like publishing, and fall under free speech. They're exempt from the fact-checking that other info conduits are subject to.
Tone? Genre? Mood? Purpose? Intent? Objective?
discussion / derogatory / informative

*Lecture*
Search engines and web indexing (see also tonight's ppt)
The exact opposite of what we've been doing. They are a program, and a simple program at that.
Print indexing
Once upon a time it was thought that the search engine would replace the human indexer.
Print indexing doesn't scale very well to the web. It is prohibitively expensive to create such web indexes in a way that works well.
In print, the document and its presentation are one and the same.
Web indexing
Done by a program that makes decisions very quickly, which is good for solving large-scale problems.
But...the web is far more complicated. There is a difference between the intellectual content and its presentation:
--templates
--dynamic page creation
--personalization (cookies)
These areas are just not indexable in the conventional sense. Do individuals get their own indexes because of personalization? Yikes!
What to index?
Technology answers this question, in part. Search engines are limited to the "presentation" aspect of the content.
Spiders
Programs that imitate browsers.
They look through the text and gather links, then file and follow them in turn.
Limited by bandwidth/speed.
Foiled by text boxes! Can't read them. Blocked from all the content lying behind them.
Indexers
An indexer sits at the other end of the bandwidth "straw" and acts like a huge threshing machine: tokenizing the text (breaking it down into individual words); eliminating stopwords; adding token and document info to the indices.
Google, AltaVista, and Yahoo have very similar spider programs, but the indexers are their proprietary heart.
Index
The catalog of terms, optimized for retrieval.
Frequently there is more than one index: it's easier to search two quick, small indexes than one large index.
Results are completely dependent on the corpus indexed.
Paul: to clarify, you're not searching something "live," you're searching the "threshed" material.
J: Yes, and the document itself is reduced to a document ID (marker for retrieval).
Toby: how often is the DB refreshed?
J: That depends on the company running the search engine.
Paul: can they index documents by type?
J: Not necessarily; Google can't tell what a blog is, but it can tell what a frequently-updated page is.
Structure of an index
Elements: term, position, weight, document.
Each element is indexed separately to increase speed of retrieval.
Zipf's law
Familiar curve describing the distribution of words within a corpus.
There is a large number of terms occurring in very few documents (idiosyncratic terms); there is a small number of terms occurring in almost all documents (stopwords). In the middle are the focal terms, or terms with resolving power.
Paul: an example of an idiosyncratic term would be an ISBN or serial no.
J: or supercalifragilisticexpialidocious
Query engine
For how it works, see ppt slide.
PageRank: Google's Secret
A special trick saved for the moment of query.
Reverse citation indexing.
Blogs skew the page rank a bit.
"Link graph" as a way to model the web: mathematical ways to understand paths from place to place.
Limits of search engines
Invisible web
--That is, the "deep web" too
--Until AI, we're stuck at about one percent of the web being indexed.
--Dynamic pages become spider traps, which spiders tend to avoid if they're smart enough
Bandwidth
--More does not necessarily solve the problem
Stupidity (human and computer)
End of ppt
Mini-exploration: little search engine for 875's blogs
db.worddump is a human-readable (ASCII) database of terms used. Search; eliminate duplication; and voila! Results!
View into Zipf's law: 1368 occurrences of the word "blog" in our, um, frequently updated Web pages that serve as publicly accessible personal journals. Thanks, Webopedia.
∞ | June 30, 2003 in lis 875: where this blog began | Comments (0) | TrackBack
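A rough sketch of the "threshing machine" from the June 30 notes above, for anyone who wants to poke at it: tokenize, drop stopwords, file each term under its document ID and position, then rank raw term counts to peek at the Zipf curve. The stopword list, the two sample documents, and every function name here are made up for illustration; this is not how db.worddump or any real search engine is built.

```python
import re
from collections import Counter, defaultdict

# Assumed, tiny stopword list; real engines use much longer ones.
STOPWORDS = {"the", "a", "of", "and", "is", "in", "to"}

def tokenize(text):
    """Break text into lowercase word tokens."""
    return re.findall(r"[a-z0-9']+", text.lower())

def build_index(docs):
    """Map term -> document ID -> list of token positions, skipping stopwords."""
    index = defaultdict(lambda: defaultdict(list))
    for doc_id, text in docs.items():
        for position, term in enumerate(tokenize(text)):
            if term in STOPWORDS:
                continue
            index[term][doc_id].append(position)
    return index

def term_frequencies(docs):
    """Count every token (stopwords included) and rank from most to least common."""
    counts = Counter()
    for text in docs.values():
        counts.update(tokenize(text))
    return counts.most_common()

# Hypothetical two-document corpus.
docs = {
    1: "The blog is a blog of the class",
    2: "The search engine indexes the blog",
}
index = build_index(docs)
ranked = term_frequencies(docs)
```

Looking up `index["blog"]` gives document IDs and positions rather than the documents themselves, which is the "you're searching the threshed material" point. On a bigger corpus, `ranked` should show a few very common words at the top and a long tail of one-offs at the bottom.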
June 26: XTM: Rebuilding the Tower of Babel
Summary
*Pre-lecture blogtalk* XFML; concept of the killer app; RIAA vs. P2P
*Lecture* Ranganathan Recap; Flamenco and faceted classification
*Exploration* XFML documents; mapping topics and facets in groups

*Pre-lecture blogtalk*
Quick intro to XFML
XFML=eXchangeable Faceted Metadata Language
It is a file kept on a server as an alternate view of the homepage -- a faceted index into site content.
History of the idea of the Semantic Web
The Semantic Web relies on machine-readable metadata and RDF.
The idea is trademarked by W3C; because RDF was used, RDF "got the blessing" of W3C. W3C is a "small-time" standards organization compared to bigger groups like the ISO, which don't necessarily agree that RDF is the right choice. (Did I get that right, Jonathan?)
RDF is an XML derivative language. The plan was for it to serve all sorts of semantic purposes, as one of a large set of tools -- but the early adopters and innovators didn't get it to launch past the proverbial chasm in the technology adoption curve.
"Killer app"=killer application: refers to the application/task that makes a technology really useful.
RDF is not an end-user application. It's a model (think triple syntax). Its goal is to make browsers and other technology smarter.
The relationship between data and metadata is recursive, folding back on itself many times. This complicates the organization of information.
Brief discussion of the RIAA P2P issue
Slashdot's article calculating the revenue lost by music companies says they overestimate the impact of P2P sharing.
Paul: the boom in CD sales created by people replacing their collections in the new format quickly ended, and sales will never be as high as they were in the early days of the CD.
Also, profits were higher because production costs are low for reissuing an old recording as a CD.
Nichole: Since a CD costs the same to produce regardless of the number of songs, the record industry stopped issuing singles.
Cris: There are fewer new artists today. Besides, people have always traded tapes.
J: Trading tunes builds the fan base.
Paul: In India, trading is a big part of the music scene, or "cassette culture." Artists earn most by performing.
Steve: Who gets the money anyway? Not the artist.
J: Basic laws of marketing say the decision is: support one artist in a big way, or lots of artists in a little way? The industry has chosen the less diverse option.

*Lecture*
Consider the MARC record to be a DTD. Each field has a value.
RSS tags=fields in a MARC record
RSS can create much more specific kinds of records than MARC does (i.e., RDF language is tailored to the medium).
MARC is designed to be flexible enough to cover all media (and realia too).
Ranganathan Recap: Faceted Classification
PMEST: Personality, Matter, Energy, Space, Time
Why didn't faceted classification catch on?
Some of the same drawbacks as Dewey, with none of the benefits. The facets must be mutually exclusive, which is hard to do. It is also harder to search faceted systems without today's technology.
Ex: City & Location = bad facets, because they are not mutually exclusive.
Also, technology at the time did not support the hierarchy-free sorting that faceted classification lends itself to so well. When you can atomize and re-collect things quickly in a relational DB or the like, it's much better.
Tower Records example: www.towerrecords.com
Has a faceted system for records.
Disadvantage of facets -- they tend to work better with a homogeneous collection (but doesn't everything?)
Facets allow you to combine limiting with searching.
Try limiting by different factors on the Towerrecords.com web site.
Flamenco Project at Berkeley example (cool!)
Reinforces the importance of mutually exclusive categories.
Parametric searches=done within a field, searching an index of terms found in those fields only.
Within a few steps in a faceted system, a search can be narrowed rather quickly. Gets you to the browsing stage fast.
Each facet can have a hierarchy under it, as long as no parents are shared.
Facets allow many paths to a destination.
Metadata drives all these kinds of classification systems.
facetmap.com
If Flamenco has built this with a DB, it wouldn't help the developer unless he stole it -- but you can mark up your data in exactly the same way and create the same kind of thing.

*Exploration*
XFML document
facet id=
occurrence=instance of a document
XFML dynamically generates indexes when demanded.
Rationalizing the category labels
Goal: browsable index of our collective sites
XFML is the medium. Why? It's simple; online tools are available.
This is a CMS (content management system). Decide what kind of metadata you want to include.
Over the weekend, look at each other's blogs, determine the topics discussed in each, and decide on two or more mutually exclusive facets to group them under. More discussion of this task on Monday.
∞ | June 27, 2003 in lis 875: where this blog began | Comments (1) | TrackBack
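A small sketch of the "combine limiting with searching" idea from the June 26 notes above: each record carries one value per facet, and every click narrows the set while the remaining facets show what is left to browse. The catalog, facet names, and function names are all invented for illustration; this is not how Flamenco, Tower Records, or XFML tools actually work.

```python
from collections import Counter

# Hypothetical catalog: titles and facet values are placeholders.
RECORDS = [
    {"title": "Album A", "genre": "jazz", "format": "CD", "decade": "1950s"},
    {"title": "Album B", "genre": "jazz", "format": "LP", "decade": "1950s"},
    {"title": "Album C", "genre": "rock", "format": "CD", "decade": "1960s"},
    {"title": "Album D", "genre": "rock", "format": "LP", "decade": "1960s"},
]

def narrow(records, **selected):
    """Keep only the records whose facets match every selected value."""
    return [
        r for r in records
        if all(r.get(facet) == value for facet, value in selected.items())
    ]

def facet_counts(records, facet):
    """Count how many of the remaining records fall under each value of a facet."""
    return Counter(r[facet] for r in records)

# Two steps down one path: genre first, then format.
jazz = narrow(RECORDS, genre="jazz")
jazz_cd = narrow(jazz, format="CD")
```

Going format first and genre second lands on the same record, which is the "many paths to a destination" point; `facet_counts(jazz, "format")` is the kind of tally a browsing interface would show beside each remaining choice.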
Music industry bares its teeth?
Breaking news in the P2P world. No time to reflect yet! Must post! The syllabus keeps jumping ahead of itself!∞ | June 26, 2003 in lis 875: where this blog began | Comments (7) | TrackBack
June 25: RSS: Making connections
*Pre-class blogtalk* How to change the CSS for MT blogs.
*Lecture* Content management and research blogs
*Exploration* RSS and newsreaders

*Pre-class blogtalk*
How to change the CSS for this blog. Please test and tell me if this is right!
3 files comprise what you see in the blog: CSS, HTML content, and background.
View source code for the root blog (www.relativepath.org/875): styles-site.css in the head of the document is the stylesheet.
Cut and paste to direct your browser to: relativepath.org/875/styles-site.css
# (pound sign) in CSS means it's an id. For example, anything with an id = banner has the given attributes.
Download and edit a local copy of the CSS:
Save file to desktop as index.html
Open this local file in browser
The absolute path in your browser location window should now read like a filepath: C:\Documents and Settings ... etc
Open the local file in a text editor (notepad) to edit and change one line in the header:
Change this:
<link rel="stylesheet" href="http://www.relativepath.org/875/styles-site.css" type="text/css" />
to this:
<link rel="stylesheet" href="styles-site.css" type="text/css" />
Use a text editor to play with the styles. Tinker, tinker, tinker. Try a find and replace with color codes.
Save. Reload. Voila!
Important! Upload as custom.css.

*Lecture*
Postponement of Topic Maps until later, a little too deep for now. But check out ontopia, via Kate's blog.
Content management
The technology curve: if a specification can't get the momentum together to pass the chasm, it dies. RSS is in danger of failing because of this.
There is some animosity towards Dave Winer as a result of his "milking of change."
Readings: Hackos brings the expertise of librarianship to the IT world.
Content management systems are important in any context where information is used.
Prism as a way of marking up large amounts of news data, for use in a content management system. Uses Dublin Core and RDF to create a robust system.
Research blogs
jill/txt/ runs a research blog from a humanities perspective.
See also UMinn blog collective.
"Marx's illusion of (?)": have blogs been around forever?
Blogs are at the early pragmatist stage.
Whither blogging?
There has been an exponential increase in the number of blogs. Why? Blogging makes it much easier to update web sites, and lubricates the transition of info from RL to the web.
Emerging technologies: photo blogging. Narcissistic but cool.
Embedded technology (thinking toasters and smart houses)
Why does reproduction drive technology, and why is food always the example in creating content management systems?

*Exploration*
Deeper into RSS and newsreaders
In the beginning was MCF=meta content framework, under the auspices of Apple. "Hotsauce" looked out for new web content.
Tim Bray and friends developed RDF.
Netscape developed RSS .90
Attempt at transcribing what was on the whiteboard:
RSS .90 evolved into .91 using XML
.91 features (XML)
Channel=newspaper or media source; a container for news.
title, link, description
language, copyright, managing editor, web master, rating, pub date, last build date, docs
skip days - days
skip hours - hours
image
text input (search box)
items (the most important part. Without these three elements, it is invalid): title, link, description
.92 features (when Dave Winer took over) (XML)
Channel: (same metadata)
skip days
skip hours
cloud=(Winerism) describes a collection of RSS feeds. Contains MD that allows software to request that the cloud notify it when changes are made to the feed.
image text image
item: title, link, description; source, enclosure (makes references to media files), category (belongs to item)
2.0 features -- released a couple weeks before 1.0 (XML)
Channel (same metadata) plus title, generator, category
skip hours, skip days
image (optional)
text input
item: (same as .92) author, guid (global unique identifier), pubdate
1.0 features (XML and RDF)
Channel: title, link, description, items, image
image
textinput
item (title, link, description)
The root of all these models is a single channel.
Cris: How is this different from Yahoo?
J: Yahoo is a portal, not so much a channel. There is no central site -- Yahoo directs away from itself.
Discussion of handout in RSS code
Download Newzcrawler -- try it at home! One can blog from this interface as well -- neat!
Syndic8.org -- master of the feeds
Summary: there are several ways to make the most of RSS, including desktop clients like Newzcrawler & their kin.
∞ | June 25, 2003 in lis 875: where this blog began | Comments (1) | TrackBack
Foray into notetaking
Hi all, I'm going to try posting my notes from class sessions on a regular basis in the category "class notes." Check out the archives for older entries. I committed a blog-pas and backdated the notes, but honest to blog, I took the notes in real time. I want to send out a plea for y'all to add to, clarify and correct when necessary -- and we'll end up with a splendid record of what we've been up to. Thanks!∞ | June 25, 2003 in lis 875: where this blog began | Comments (3) | TrackBack
Technology adoption
I love this little curve. It describes the technology adoption life cycle -- which hasn't been mentioned explicitly in class or blog yet, I think. If it's accurate, it helps winnow real digital divide problems from social lag and voluntary non-use of technology. It might also provide some context for the strong feelings in the blog community about the power law. I wonder if one could define the point on the curve at which a small user group becomes large enough to lose the camaraderie and close-knit feel that frequently accompanies early adoption.∞ | June 25, 2003 in lis 875: where this blog began | Comments (3) | TrackBack
June 24: XML
Summary
*Pre-lecture blogtalk* Migrating from HTML to RSS; what library people and CS people are good at; history of RSS specification
*Lecture* PowerPoint slides courtesy Lars Marius Garshol
*Exploration* XML tags

*Pre-lecture blogtalk*
Katy: is it hard to migrate from HTML to RSS?
Jonathan: they are different standards. XML and HTML both come from SGML. Not all HTML is XML compliant.
HTML's problem: too flexible. Hard for a machine to process. Browsers have to deal with ambiguity in bad markup, because that allows more "bad documents" to be viewed (lower standards). Not as flexible as SGML, but its flexibility makes specifying anything harder.
XML is like HTML, only simpler, because you use it to define other languages.
Really important: RSS is a standard whose job is primarily to give MD about XHTML, or any resource.
Write a log in WML: convert old HTML content to XHTML. Don't convert HTML to RSS because they're complementary.
HTML allows fancy things: headers, formatting, etc. RSS can't generate a whole document with its 3 elements (s-o-p).
RSS has roots in markup of scientific documents, hence the priorities it has.
Three Bobs: How are these bits of information identified outside of their own namespace? These systems must be defined on a common ground. (What is that?)
XML topic maps can help. So can Published Subject Indicators (PSIs). PSI=URI, but some Publisher maintains a "set of names" (i.e., an authority file?), like when the gov't manages SSNs.
But the solution needs to be implemented to work (duh).
Things CS does well that Lib'ns should learn: scaling math to models that work, as in P2P, DNS. If this task were distributed, it could be done. (There is no computer in the sky.)
Things Lib'ns do well that CS can learn: army ant method of attack. BORG mind, collaboration, group work.
Yes, this is hubris, but if CS and Lib'ns get it together, WE CAN DO IT! Especially if a few catalogers get famous and make waves in the CS community.
Steve: libs lack the funds that CS and tech companies have.
Jonathan: using the scraps of slacktime during the day can do wonders. And there are lots of lib'ns.
Jumping ahead to RSS. Blogs can be subject gateways -- their great strength.
The history of RSS specification:
Stalemate between two camps thus far: Semantic Web (RDF) vs. Dave Winer (RSS)
RSS=Really Simple Syndication or RDF Site Syndication. Dave says: RDF has nothing to do with RSS.
Dave asserts the right to influence things because one of the origins of RSS is Microsoft CDF/MCF (Meta Content Framework), which drives syndication -- that is, spiders will draw users' attention to new postings. And Winer ran the Scriptingnews blog, and was consulted by Netscape to develop this syndication. He told them to drop RDF. RSS .91 is in XML syntax, but does not use the RDF triple form.
Netscape dropped the project and Dave adopted it (with Userland Software Co.) and modified RSS to meet his needs.
SO, a bunch of non-hardcore, not-Tim Bray-following people broke out and reformulated the syndication to extend functionality and become an open specification in the commons.
Sam Ruby, Python hacker, hosted a discussion (launched a wiki) to nail down a philosophical model (nothing to do with RDF, RSS, XML) to define: What is a blog entry? They want to translate this model into a workable computer model.
Cris: How are RDF and RSS different? They are all document types.
RDF=Resource, Property, Hypertext reference are the primary components.
RSS=title, description, URL, more specific things about a SPECIFIC thing to be syndicated online. Portable, easy for software to read. Not meant to be read in a browser like HTML and XHTML are. It is for programs to pull and download things from other servers, to be translated from RSS, mushed up, and presented to the user. It's an MD exchange program.
RDF people want to do RDF in RSS b/c it's easy to model RSS in RDF.
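To make the "programs pull it, mush it up, and present it" description a bit more concrete, here is a minimal sketch that reads the three item elements (title, link, description) out of a feed using Python's standard library XML parser. The feed text, its URLs, and the function name are made up for illustration, and a real newsreader would fetch the feed over the network and cope with the other versions and optional elements.

```python
import xml.etree.ElementTree as ET

# Hypothetical feed in the plain-XML (non-RDF) style; contents are invented.
FEED = """<rss version="0.91">
  <channel>
    <title>Example class blog</title>
    <link>http://example.org/blog/</link>
    <description>Made-up feed for illustration</description>
    <item>
      <title>First post</title>
      <link>http://example.org/blog/1</link>
      <description>Notes on XML</description>
    </item>
    <item>
      <title>Second post</title>
      <link>http://example.org/blog/2</link>
      <description>Notes on RSS</description>
    </item>
  </channel>
</rss>"""

def read_items(feed_xml):
    """Return one dict per <item>, holding its title, link, and description."""
    root = ET.fromstring(feed_xml)
    fields = ("title", "link", "description")
    return [
        {field: item.findtext(field, default="") for field in fields}
        for item in root.iter("item")
    ]

items = read_items(FEED)
for entry in items:
    # A newsreader would render this; here we just print a one-line summary.
    print(f'{entry["title"]}: {entry["link"]}')
```

Nothing in the feed says how it should look on screen; the reading program decides that, which is why RSS is not meant to be opened in a browser the way HTML is.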
Last word: the split between RDF and RSS is important because if everyone is speaking the same language, more complicated systems can be developed. If not, development is stunted. It's simple and firm enough that it can be a foundation for further growth. (TCP/IP example: a universal, simple tool that made the Web possible.)
Remember the difficulties with slow social change: attaining critical mass is a process which requires patience.
"Mosaic for Dummies" book: a blast from the past.

*Lecture*
(See PP slides, courtesy Lars Marius Garshol)
What are smart data? What are dumb data?
Smart data on the web
---The state of data on the web means it's easy to view, hard to search.
The trouble with this
---Functionally, HTML is so dumb, it's nearly useless.
---Google is a huge network of single desktop computers, rather than a single mainframe. Computing power is so cheap that they treat 'em like toilet paper.
The need for exchange
---Need for communication (of course) stimulates standardization. Being too flexible has problems: too universal=too hard to write and process.
A critique of HTML
---way too specific vocabulary, not enough layout. HTML is a kind of data model that represents a document (in the Platonic sense) but not really.
How XML solves this problem
---Define your own tags, and use automatic validation of your definition.
Dumb data
---XML is so flexible that it's being used for data-to-data transfers as well as marked-up documents.
Smart data
---Steve: where does XML hide the missing display info?
---Jonathan: Browsers are loaded with defaults. CSS also provides a display framework. So, Browser sees XML, says: "What?" XML says "Look at my style sheet" and whoop there you go.
(XSLT digression: Styling information for XML.
2 parts: formatting, and translator of one kind of XML into another)
XML background
---DTD (Document Type Definition) - some parts left out
Elements
---For every start tag there must be an end tag (unless it's an empty tag)
---"Attach semantics to a piece of the document": lets you associate something meaningful with that content
Attributes
---Annotation, not vehicle for information: a stand-alone structure
Other stuff
Internationalization
---XML assumes Unicode to accomplish this
Document types
---SGML tried to define a class of documents (memos, articles, etc.)
DTDs
---the grammar of a document language: what is proper and what is not
---there are programs that just validate grammar
A sample DTD
XML standards: a quick guide
REC-xml 1.0
---Borrows heavily from SGML; focused on validating and non-validating types of code
Linking
---Like hrefs, but more robust. Point to an element within a document (still underway, not functional yet)
Style
---CSS2 and XSL
Programming
---SAX, DOM
Processing languages: hierarchical, nested tree, or linear (event-driven model)?
Various
---Namespaces, forms, topic maps
---Paul: ISO standard for topic maps? Yes.
Stronger validation
Using XML...beyond our scope.

*Exploration*
Models of a blog entry
Author=who
Permalink=where
Timestamp=when
Content: media type, language, data=what
Should this schema be applied only to blogs, or to internet content in general?
Permalink and timestamp are the most important tools to make a blog useful to the user.
The approach is modular. Things get complicated if you try to pile it all on, but you can do better by adding functionality in modules.
Components:
Metadata: title, summary, contributors
Categorization: Category, Subject
Security: PGP key, URL of key, Signature
Licence: Licence type
Relations: container, other entries, other things (URL)
Are they trying to model not just a syndication format but a content format?
Is this what Movable Type, Blogger, etc. are going to be expected to do, to do what they do? If it's only Metadata, it's a simpler question.
Jessica: Some blogs lack these elements?
J: Yes. Most blogs have datestamps, but whether they represent the creation, post, or modify date is sometimes unclear.
Pro-content management for librarians! Web publishing closes the gap between creation and distribution. Relevant for digital libraries, and for management of larger websites.
Posting, in J's mind, is the only firm foundation on which you can stand the blog. Last Modified is movable; created is not as useful.
Exercise in XML tags (not repeated here)
Issues:
important to define language of metadata
form of author is also critical
plink is a hypertext reference
date is for the date standard -- more specific info about date can be contained in metadata
content is a container of elements
language and media type are attributes of the data
DTDaware
DTDs assume you're dealing with one document at a time; XML is "loosey-goosey"
Back to RDF: it creates MD about docs.
Would you like to buy an encyclopedia?
Oh! You lost me. Help me out, y'all!
∞ | June 24, 2003 in lis 875: where this blog began | Comments (0) | TrackBack
Compost update
I was feeling a little nostalgic for my very first blog, which I foolishly (?) deleted. So in tribute to the juvenilia of my ujournal, here's a little post to let the universe know that last week I threw some wilty, 2-week-old bok choi on top of the dirt in ye olde vermicompost bin, and was too lazy to bury it. Two days later I opened up the lid and all the teeny buds had blossomed into stunning yellow flowers. Tenacious little buggers.∞ | June 24, 2003 in lis 875: where this blog began | Comments (4) | TrackBack
June 23: Introduction to metadata standards
Summary
*Pre-lecture blogtalk* Blogs, forums, and wikis; copyright and copyleft
*Lecture* Data vs. information recap; ceci n'est pas une pipe; syntax and grammar; metadata vs. AI; introducing XML, RDF, and DC
*Exploration* Semantic web
Deep thought of the day: Why do we feel guilty for not posting entries? We're hardwired not to ignore social spaces.

*Pre-lecture blogtalk*
Professional vs. personal blogs and their effect on authority
Blogs, forums, and wikis -- different takes on the same idea
Are librarians threatened by blogs?
Authenticity: how can you prove you are who you write about? Does literary merit trump "truth"?
See Jonathan Delacour
Shelley Powers: Burning Bird
What of Copyright? Internationally: no protection
But does lack of copyright protection prevent collaborative/personal artmaking online?
Intro to copyleft:
Machine code is in your computer. No human can read it. Get a compiler and ... Now they're at the level of code.
Richard Stallman (kind of a nutbar) developed copyleft in order to escape the tragedy of the commons, which made it difficult to collaborate on projects. Trust is implicit in long-term, group projects.
Copyleft=this material is free to use, modify, etc., but you must distribute any derivative work with the source code in the future.
Think of it as a fence around the commons. You can go in, but you can't take stuff out. Hence, the more code is in the commons, the more valuable the code is. New hackers can stand on the shoulders of giants.
"Everything is free except the ability to destroy the foundation of that freedom."
LINUX is a good example -- reverse engineered from UNIX (clone in techspeak=not a copy). Makes the expensive free.

*Lecture*
Data vs. information recap
Data=no human understanding. Again, the dark world of the CPU.
Datum vs. data sets. A datum is senseless in isolation; data presupposes a grouping intended for a purpose.
Sentence diagramming is just as much metadata as markup.
Ceci n'est pas une pipe.
"heh heh, I'm a famous surrealist painter and you're in a gallery and this will blow your mind."
"Bathe in the human before we immerse ourselves in the machine"
Nominalist vs. idealist (?), Platonic ideals. Fry a robot brain while you're at it.
Bridging the Gap
Syntax: How we group data and signal its intended use (well-formedness). Has to be orderly or it's impossible to tell what you're saying.
Grammar: How we group information and signal its intended meaning or interpretation. Has to make sense to the human.
Distinction is kind of medieval, but grammar refers to individual words: sometimes it makes sense to use them, sometimes not.
Metadata: encoding for information
Metadata vs. AI
Automatic classification: current algorithms are limited (mostly statistical, i.e., just math); true AI would indeed obviate the need for metadata; it would also obviate the need for humans (save us, John Connor!)
Manual organization: a social problem; difficult to program, difficult to maintain; not very scalable.
Race between Semantic Web and Skynet
Basic Metadata syntax
embedded
linked -- a question of addressing
In both cases, there is a question of namespaces (the context for metadata).
Introducing XML
It is cool. Why? The universal syntax for metadata. A single metadata parser can read anything in XML. XML is data and metadata blended.
But what about the grammar? Things are written in XML. Things are NOT XML. I.e., HTML can be written in XML; (can MARC be written in XML?)
Three kinds of metadata grammar
Structural: markup. The relationship between bits of data.
Descriptive: keywords and info about the info
Administrative: manipulates the data, e.g. draft and publish status in MT
Metadata grammars
Why do we have these? Because a community needs them. Specific needs are met by different types of metadata.
Community-specific: (EAD, EdNA)
General-purpose, or "glue" (DC, RDF)
Introducing RDF
Expressed as XML
Magic triple: subject->predicate->object
This is not a pipe!
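For the triple-curious, a tiny sketch of the subject->predicate->object idea using nothing but Python tuples. The URIs, the vocabulary under example.org, and the helper name are all invented for illustration; real RDF is normally written in a serialization such as RDF/XML and handled with a dedicated library, which this does not attempt.

```python
from typing import NamedTuple

class Triple(NamedTuple):
    """One statement: a subject, a predicate (the relationship), and an object."""
    subject: str
    predicate: str
    obj: str

# Hypothetical identifiers; nothing needs to live at these addresses.
ENTRY = "http://example.org/blog/entry/1"
EX = "http://example.org/terms/"

triples = [
    Triple(ENTRY, EX + "title", "Introduction to metadata standards"),
    Triple(ENTRY, EX + "creator", "A. Student"),
    Triple(ENTRY, EX + "date", "2003-06-23"),
]

def objects(statements, subject, predicate):
    """Return every object stated for the given subject and predicate."""
    return [
        t.obj for t in statements
        if t.subject == subject and t.predicate == predicate
    ]

creators = objects(triples, ENTRY, EX + "creator")
```

Each tuple is a statement about the entry, not the entry itself, which is the pipe joke again: the description of the thing is not the thing.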
Introducing DC 15 fields extensible glues different domain-specific vocabularies together expressible in RDF, HTML, XHTML or XML. Different syntactical ways of expressing the same goal. The 15 elements of DC Took two years to develop. Trying to be as universal as MARC, without the knots. "Ancient concept" of distinct creator and publisher in online world -- will it be weird when these differences are obsolete? Relation field: to an index, or to nearby pages *Exploration* Semantic web intro: see www.amk.ca/talks/semweb-intro/ Some examples of modeling info objects via RDF Browsers don't do that much from an IT point of view -- they just present info for human consumption. Screenscraping is one method of cutting to the metadata chase: but requires a high level of detail in programming Now, we want information retrieval to be more automated -- we want AI or MD to do it for us. Resource Description Framework (RDF) -- very low level description (see s->p->o above) RDF Schema lets you describe controlled vocabularies and use them to describe things Web Ontology Language (OWL): lets you describe relationships between vocabularies The higher the level, the more powerful the semantic web, as it allows more communication and interaction between info. Creating new contexts for the info described. Overview of RDF RDF="specification that defines a model for representing the world, and a syntax for serializing and exchanging that model." RDF can be used without the fear that your work will have been in vain. A more self-describing approach, while more complicated itself, becomes more universal and easier to use to describe the info So where is the MD? 
In the headers of a document, followed by a blank line Syntax can be limited: the pairs of tag and content must all refer to the same item, but this can be worked around by providing multiple definitions (example: a file for reviews and one for authors) like a relational database See RDF graph at www.amk.ca/talks/semweb-intro/ Don't get weirded out by RDF: It is the way it is because it describes things in the electronic world - pointers to electronic data URI (identifier) and URL (locator): the difference. URL is a locator. It can be found using this bit of data. URI is an identifier, a label that says this bit of data is unique. It doesn't matter what it means, it just IS. URI can be made up. URL must be resolvable to a place on the web. The Triplet Shuffle: Circles, lines with labels, and boxes. Circle=bit of data. Line=relationship to Box=terminal. Where you fall off the road. That's all. Difference between RDF and relational databases Toby: looks like entity relationship diagrams in databases. But DBs are weak when relationships change because each change=destroying and rebuilding tables. Steve: Is RDF more robust because each bit of data is unique? I.e., one and only one ISBN, author, etc. Bag in RDFspeak=unordered list. This creates anonymous resources, where there is no subject/object. There can be one-to-many relationships. RDF is better than relational DBs because it is more rigorous than they are; relational DBs carry all that cumbersome redundancy. If you remember nothing else from this lecture, think of metadata and data as a chain of events that need to get grounded in reality for anything interesting to happen. resources, properties, and literals -- read the web site in more depth Distinction between syntax and model: there are many ways of writing the graph -- XML, etc. Jumping into another example. 
www.ukoln.ac.uk/metadata/resources/rdf/examples/2/ 2 versions of a journal: Use the ISSN to ground a metadata framework for one entity with two manifestations: print and electronic. Nailing down the data. Electronic world brooks no vagueness. In this example, the print version establishes the basis for (IsBasisFor) the electronic version. XML allows you to declare namespaces, to allow you to understand what the URIs are identifying.∞ | June 23, 2003 in lis 875: where this blog began | Comments (0) | TrackBack
What do we need to know?
Last weekend at my little brother's high school graduation dinner, I was telling the folks about this class and related projects. My mom asked if it would be possible to get by in the future without knowing "all this computer stuff." That sneaky digital divide pops up again. So I ask you all, without putting a finer point on it: how would you answer?∞ | June 23, 2003 in lis 875: where this blog began | Comments (11) | TrackBack
Daily home astrology report
Whoops, dropped out of the blogosphere for a bit there. The obvious flipside to using these blogs to expand the classroom is the guilt inflicted by neglecting them. And the very public display of a lack of participation. (blush) It was a fine weekend, though -- a dear friend came down from the Twin Cities and we went to Chicago to see a documentary about They Might Be Giants. I'd recommend it, though whether it will show in Madison is a gamble. My friend and I did talk about blogging and other things meta-related, including some of the identity and content ("what in the world would I write") issues touched on in class. She's working on some promotional HTML stuff for the musician Stuart Davis (shameless plug), which looks great, and she assured me that she picked up most of her skills through sheer tinkering. Her next plan is to learn about CSS, so maybe by learning in tandem we'll help each other out. One of my goals this summer is to help develop a web site for the LITA student group, and put an end to the irony implicit in the lack of said site. Hope my reach has not exceeded my grasp.∞ | June 23, 2003 in media | Comments (4) | TrackBack
Monkey news
I heard this morning on NPR about an article in Nature: Dr. David C. Page, a biologist at the Whitehead Institute in Cambridge, Mass., and his intrepid team have discovered that the heretofore mostly ignored human Y-chromosome has quietly been recombining with itself for millions of years and is now composed of vast swaths of palindromes. The story took a slightly hokey turn when it was pointed out that when the Y-chromosome is included in the comparison, men share as much genetic material with male chimpanzees as with human women. Since I'm not sure that nature always trumps nurture, I can't say this "explains" anything about human behavior -- but it's good fodder for the cheap shots. See also this article by Nicholas Wade of the NY Times.
∞ | June 19, 2003 in lis 875: where this blog began | Comments (0) | TrackBack
June 19: Web Publishing (pretty scant)
Pre-lecture blogtalk Ethics: first amendment spying 4 elements of a good blog post (via DiveintoMark, via ?): Permalinks. Designed to prevent link breakage Purl=Persistent URL Dates. Author. Content. Trackbacks http://www.dailynews.com/Stories/0,1413,200~20954~1463579,00.html http://discover.npr.org/features/feature.jhtml?wfId=1303260 Exploration Behind the curtain at Movable Type∞ | June 19, 2003 in lis 875: where this blog began | Comments (0) | TrackBack
What this material needs...
...is a good coloring book. Is there such a thing? I could use some visuals to help the last couple lectures gel.∞ | June 18, 2003 in lis 875: where this blog began | Comments (5) | TrackBack
Hakon Lie...
...is indeed the name of the inventor of CSS. I didn't do anything special to get this not-too-old article.∞ | June 18, 2003 in lis 875: where this blog began | Comments (4) | TrackBack
June 18: Web Machinery
Summary Pre-lecture blogtalk Further explanation on markup and code; CSS and HTML. Lecture jump to powerpoint Static and dynamic HTML; Client/Server interactivity; CGI; browsers and forms; modular servers vs. application servers Pre-lecture blogtalk Further explanation of markup and code ascii and unicode programming languages are in plain text compiler and interpreter (similar but different) Quick demo of how to change CSS in your blog, but a better example (with corrections from classmates!) can be found in entry for June 25. Lecture Static and dynamic HTML Static=pre-cooked dynamic=made to order Client side Functionality: Javascript and ECMAscript Presentation: CSS Structure: XSLT Client/Server interactivity Browser says to servers, "Act on this data" Servers say to browsers, "Sure, hold this." CGI Common Gateway Interface A protocol for exchange of info: from HTTP to an arbitrary program on the server Common use: HTTP passes data to a program, which manipulates the data, and returns HTML to the browser. Browsers and forms HTTP provides three ways of sending data to a server: GET, POST, COOKIE GET=fetching ordinary pages (after "?" parameters) POST=sends a packet of data Cookie=initiated by the server, stored on client's machine to save effort of server & (ampersand) is a field separator Modular Servers Apache is the best. It's open source. Core is steady, and modules (programs) "snap on." These modules can be written in different coding languages, but are compatible with the core Modules can shift URLs, change content on the fly, etc. Modules vs. Apps Modules are hard to "pipeline" Application servers are more suited to collaborative problemsolving (think BORG) Dynamic templating Code is embedded in the tags. 
PHP and ASP work with Apache ASP works with IIS JSP works in Java Servlet engine such as JBoss and Tomcat Problems with templates It's hard to separate logic from markup Logic that exists in multiple places is hard to change Middleware What are these "http servers on steroids" doing? They connect to data sources, manipulate data, send and receive data Databases and directories: are not necessarily different. A directory is a lightweight, read-only database Beyond templates See powerpoint for a nifty diagram of an XML pipeline∞ | June 18, 2003 in lis 875: where this blog began | Comments (0) | TrackBack
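A small illustration for the "Browsers and forms" notes in the June 18 entry above: GET puts its parameters after the "?", and "&" separates the fields. The sketch below, in Python, pulls the fields out of a URL the way a CGI program would receive them. The host, script name, and field names are invented for the example; only the standard-library `urllib.parse` calls are real.

```python
from urllib.parse import urlsplit, parse_qs, urlencode

# Hypothetical URL; host, script, and fields are made up for this sketch.
url = "http://example.org/cgi-bin/search.cgi?q=metadata&lang=en"

def get_fields(url):
    """Take everything after '?' and split it into field=value pairs on '&'."""
    query = urlsplit(url).query            # the part after "?"
    # parse_qs returns a list of values per field; keep the first of each.
    return {name: values[0] for name, values in parse_qs(query).items()}

fields = get_fields(url)
print(fields)

# A POST carries the same kind of field=value pairs, but as a packet of
# data in the request body instead of in the URL.
body = urlencode(fields)
print(body)
```

Either way the program on the server ends up with the same named fields; the difference is only where the browser put them.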
Don't try this at home
I thought I'd get a little early-morning entry going, but my first attempt at posting a comment from my iMac at home failed (the error message said something about the rebuild failing). The display's a little funky on this Mac -- do you other Mac users find that as well? This brings me around to the logistics of blogging. I don't have a DSL connection at home, choosing instead to take advantage of the fast computers at school and (ahem) work. I presume that many people, for this reason and others, just aren't interested in sinking tons of dough into equipment and premium internet service. Creating a virtual presence for yourself, be it a blog, a web site, or even a wish list on Amazon, would be difficult without a good setup and the time to use it. Blogging, like any gadget-intensive hobby, could get very expensive very quickly. But is the digital divide between those who can blog from home and those who use borrowed technology worth worrying about?∞ | June 18, 2003 in lis 875: where this blog began | Comments (5) | TrackBack
June 17: Web Fundamentals
Summary *Pre-lecture blogtalk* privacy, reputation, journal vs. log, hosted vs. personal *Lecture* jump to powerpoint Web structure elements; data vs. information; definitions of CPU, OS; history of code; networks; protocols; seven layers of the OSI model; markup languages; what's happening now? *Exploration* FTP by hand Python demo Pre-lecture blogtalk privacy Can blogging get you into trouble with your s.o.? concerns related to identity and pseudonyms fair use and copyright of content ideas of trust and anonymity: how are they related? Can an anonymous blogger be trusted? Perhaps if over time a reputation is established. reputation how is your professional life affected by your postings? predigesting info and how it affects your reputation (posts=power) power law: Winer et al.: There are many people with few links; few people with many links. links=cash in this respect, the blogworld parallels academia blogger community reacts in protest to power law, perhaps out of nostalgia for "good old days" of fewer blogs technorati.com (?) journal v log blog as navel gazing v micro-portal with subject specialty "fake leukemia girl" skokal (sp?) scandal hosted v personal hosted services add features to all hosted blogs at once. Users don't have to download updates. Lecture: Web structure beginning with a confession of dependence on PowerPoint, and the distortion this can lend to the thought process, a la Tufte 3 elements of the Web: tags, code, packets but first: data v information data = raw; info = processed data and datum datum = one piece. That means, data always come in sets Kierkegaard's madman, who ran around naming things CPU lives in a dark dark world. Without cards, inputs, outputs, etc., nothing. 32-bit or 64-bit words binary logic OS connects CPU to monitor, hard drive, network, etc. 
encoder/decoder of info serves all the other programs (computers=clouds on clouds analogy) Code: a history Binary (Object code) Machine code (Hex [base 16] or Octal [base 8]) Compiled language (C, C++, Java) language talks to compiler that interprets into hex Interpreted/scripting language (Perl, Python) compiles as you go, making for even human-friendlier interaction Virtual machines enables code to live on any machine. Big deal before browser wars. Networks no real difference between a CPU talking to a hard drive and another computer. BUT they need to speak the same language Protocols negotiate the exchange of info between different systems "diplomacy" and trust most hacker attacks take advantage of protocols (handshakes) FTP, TCP/IP, HTTP, NNTP OSI model (see powerpoint for a pretty chart) FTP wraps itself in TCP (what am I) packet wraps itself in IP (what am I about) wraps itself in ethernet (where am I going) and back out. seven layers: FTP and telnet: 7. application 6. presentation 5. session (exchange of handshakes and transactions) 4. transport (TCP and DNS) 3. network (IP) Outer envelope: 2. data link 1. physical Markup languages HTML tags are just that--labels! Programming languages v Markup languages P: changes what CPU does; defines an explicit process M: does not affect CPU; conveys explicit meaning (a semantic) What's happening now? sophisticated network protocols lead to distributed computing smarter markup leads to ease of incorporation of new human knowledge into information ecology faster CPU cycles lead to richer interpretation of data sets Exploration FTP by hand The power of coding is embodied in patience. Gives power over computer if you can talk to it directly. Python demo∞ | June 17, 2003 in lis 875: where this blog began | Comments (0) | TrackBack
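The envelope idea from the OSI notes in the June 17 entry above (FTP wraps itself in TCP, which wraps itself in IP, which wraps itself in ethernet, "and back out") can be sketched as nested wrappers. This is a toy in Python and not the demo from class: the bracketed "headers" are invented strings, whereas real protocol headers are binary structures, and the file name is hypothetical.

```python
# Toy envelopes only; real TCP/IP/ethernet headers are binary, not strings.
LAYERS = ["TCP", "IP", "ETHERNET"]  # innermost to outermost, as in the notes

def wrap(message, layers=LAYERS):
    """Put an application message inside one envelope per lower layer."""
    for name in layers:
        message = f"[{name}]{message}[/{name}]"
    return message

def unwrap(frame, layers=LAYERS):
    """Peel the envelopes off again, outermost first."""
    for name in reversed(layers):
        head, tail = f"[{name}]", f"[/{name}]"
        if not (frame.startswith(head) and frame.endswith(tail)):
            raise ValueError(f"missing {name} envelope")
        frame = frame[len(head):-len(tail)]
    return frame

# RETR is the FTP command for fetching a file; "notes.txt" is made up.
frame = wrap("RETR notes.txt")
print(frame)
print(unwrap(frame))
```

Each layer only reads its own envelope and hands the contents up or down, which is why the sending and receiving machines have to agree on the order of the wrapping.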
First thoughts
When I told my coworkers I was using up my vacation time to take a four-week class on metadata, most of them groaned sympathetically, but I don't think they recognize how much more fun this will be than loafing. That or they haven't grasped how my aspirations to geekdom permeate my being. I'm excited to have a place to talk about blogs and blogging. In fact I've been thinking about blogging a lot lately. My previous exposure to the blog community is limited. For "Digital Divides and Differences," a class taught by Greg Downey last fall, we all maintained an anonymous blog. At the end of the semester we traded with the freshman section and drew conclusions about the authors. Participation was lopsided -- the freshmen disclosed a lot of personal information, and we grad students maintained our detachment. At that time I had a limited grasp of the potential for blogs, having seen only the personal, diary-style entries that clog free sites like ujournal and so cast my own blog in their image. At the same time I was concerned about privacy, so I only wrote about my worm compost bin and its slimy denizens. I recently read "The weblog book" by Rebecca Blood of Rebecca's Pocket. It got me thinking about the wider applications of blogs and what in the world I'd have to contribute to the internet if I had my own. I anticipate that this course will soothe my fevered brain and give me reasons to jump in with two feet -- if indeed a tenth of everyone will have a blog within a few years. Can't wait to hear what y'all have to say in your own...∞ | June 16, 2003 in lis 875: where this blog began | Comments (3) | TrackBack
Welcome to 875!
Greetings. This is what they call a 'blog' nowadays. We'll be using it to keep a historical record of what we learn during the course of this class, to converse with one another about class topics whenever we feel like it, and to understand the limits and innovations that blogs embody vis a vis the web at large. Fundamentally, blogs are just a genre of web site, pioneered by sites like "Slashdot":http://www.slashdot.org and "Peterme":http://www.peterme.org. As it happens, the structure of a blog is fairly stable and lends itself to automation--and so automated they became! Blogs nowadays usually refer to any site that is being managed 'under the hood' by blogging software. The software we are using is called "Movable Type":http://www.movabletype.org, and is pretty much the most popular stuff out there. Rather than prejudice you by telling you about my favorite blogs (yet), I recommend starting with a few of the weblog search engines and aggregators out there. We'll go over what these sites are up to in class today. Daypop: "http://www.daypop.com/":http://www.daypop.com/ Blogdex: "http://blogdex.media.mit.edu/":http://blogdex.media.mit.edu/ Popdex: "http://www.popdex.com/":http://www.popdex.com/ Technorati: "http://www.technorati.com/":http://www.technorati.com/ You could also start out with the two queens of librarian blogging, "Jessamyn West":http://www.librarian.net/ and "Jenny Levine":http://www.theshiftedlibrarian.com/. Peer into their archives, and look at Jenny's impressive "blogroll":http://www.theshiftedlibrarian.com/stories/2002/05/25/sitesIReadInMyAggregator.html (a list of blogs she reads regularly).∞ | June 16, 2003 in lis 875: where this blog began | Comments (0) | TrackBack




