
Edward Bilodeau's Weblog

Currently thinking about: information architecture, web construction, LCSH, union catalogs and other bibliographic tools.

Published: 2005-11-03T17:46:24-08:00


Class prep


About 15 minutes until show time, and I'm sitting in the caf, enjoying the last of a coffee before getting another. The deck for tonight's lecture on metadata has come together fairly well. How well I can deliver the material remains to be seen. But I'm confident!

I enjoy teaching. It is the part of my job that I like the best. I have a notepad on the table next to me covered with notes for the new class I'm teaching next semester: web development. It's an undergraduate course, part of a software development certificate. Basically, all the HTML/CSS stuff I pulled out of my grad course, amplified. I've decided (and gotten permission from the program director) to focus less on server-side development and instead drive home larger points: web architecture, values (accessibility, internationalization, platform independence, etc.), standards-based development, usability, etc. I plan on stopping short of AJAX as well, although I will cover the fundamentals of client-side scripting. So, a new course. Lots to plan, lots to brush up on. Should be fun.

But tonight it's all about metadata. And the final project. And the final exam. Hmm, time is ticking. I need some time to boot up the classroom and get my laptop plugged in and running. But first, I need another coffee. The big cup this time, svp!

Yahoo directory not being maintained


Why are there only two resources under XHTML in the Yahoo directory? I can only assume that they are not bothering to maintain or grow the directory in any meaningful way. Or maybe they have shifted their staff over to tagging everything?



November is a busy time of the year, the time when you realize that what you do now will decide how much pressure you are going to be under come December. As a result, I've been trying to focus as much as possible on getting work done, which has meant less time for tangential ramblings in this space.

...where we await the Great Pumpkin!



You knew they had to grow them all somewhere!

Remind me again: Why are we paying taxes?


Matthew Good points out how backwards our government aid policy is: "We can send DART all over the world but can’t seem to manage sending it mere hours north of our own capital. Of course, there are those that would respond to that by claiming that DART is for disaster relief and that what has happened in Kashechewan is not a disaster. And such a response would quite accurately describe precisely why it is."

Our new HipsterPDA hard drive



Is it wrong to be excited and thrilled to receive a new filing cabinet?

Recall and precision


(Since Rosenfeld and Morville didn't get it right...)

Recall and precision are concepts that relate to the ability of a search algorithm to retrieve relevant documents from a collection of documents. More specifically, recall is a measure of how many of the relevant documents were retrieved, while precision is a measure of how many of the retrieved documents were in fact relevant.

These concepts can be defined operationally as follows. Given

D = number of documents retrieved
R = number of relevant documents retrieved
N = number of relevant documents in the collection

then

Recall = R / N
Precision = R / D

For example, say that in order to evaluate the effectiveness of three search algorithms, we assemble a collection of 100 documents, 30 of which are considered relevant to our test query.

Search algorithm #1 retrieves all 100 documents, so we can calculate

Recall = 30 / 30 = 100%
Precision = 30 / 100 = 30%

While the recall for this algorithm is excellent, the precision is quite low. In other words, while it did retrieve all the relevant documents, it also retrieved a large number of non-relevant documents. Someone using this search algorithm would have to manually sort through many non-relevant documents before finding those that are relevant.

Search algorithm #2 retrieves 70 documents, including all 30 relevant documents. We calculate

Recall = 30 / 30 = 100%
Precision = 30 / 70 = 43%

This improves over the first algorithm in that it still retrieves all relevant documents, but leaves out some non-relevant ones. Recall is perfect, and precision has improved.

It is important to note that this situation rarely occurs in practice. Typically, changes made to search algorithms to improve precision involve establishing criteria for rejecting documents, a process that inevitably causes some relevant documents to be labeled as non-relevant and therefore excluded from the search results. In other words, as we improve precision, we reduce the ability of our algorithm to recall all relevant documents. This is illustrated in the following example:

Search algorithm #3 retrieves 50 documents, including 20 relevant documents. We calculate

Recall = 20 / 30 = 67%
Precision = 20 / 50 = 40%

We see here that the precision of the search has been improved over the first case (40% versus 30%), even though fewer of the relevant documents have been retrieved (i.e. only 20 of the 30 relevant documents known to be in the collection, a recall of 67%). From the user's point of view, precision has increased because they have fewer non-relevant documents to sort through.

All search algorithms have to make a tradeoff between recall and precision. The importance of each will depend on the needs of the user. If someone is looking for every single relevant document in the collection (e.g. "find me every news article written about our company"), recall will be more important than precision. Precision will be more important for someone looking for a few relevant documents (e.g. "find me a few articles on protecting my computer from spyware").

When do you need to use these formulas? Unless you are an information science or computer science researcher developing search algorithms, probably never. The problem with using the formulas in practice is that unless you are working with an artificial collection of documents, it is almost impossible to come up with an accurate count of relevant documents. Several factors contribute to this. First, you are usually searching a very large number of documents, so it is not feasible to go through them all and count the number of relevant documents for a given search. Second, relevance is subjective, making the relevance of any given document difficult to assess in any general way. Third, relevance is not a true/false measurement, but rather something that is measured in degrees of relevance, usually relative to other documents (i.e. "this document is more relev[...]
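
For anyone who wants to plug in some numbers, here is a minimal Python sketch of the two formulas as defined in this post (Recall = R / N, Precision = R / D), run against the three example algorithms. The function names and the runs table are made up for illustration and do not come from any library, and the sketch assumes the counts are non-zero.

```python
def recall(relevant_retrieved, relevant_in_collection):
    """Recall = R / N: share of the relevant documents that were retrieved."""
    return relevant_retrieved / relevant_in_collection


def precision(relevant_retrieved, retrieved):
    """Precision = R / D: share of the retrieved documents that are relevant."""
    return relevant_retrieved / retrieved


# N: relevant documents in the 100-document test collection.
N = 30

# Hypothetical table of the three example runs: name -> (D, R).
runs = {
    "#1": (100, 30),
    "#2": (70, 30),
    "#3": (50, 20),
}

for name, (d, r) in runs.items():
    print(f"Search algorithm {name}: "
          f"recall = {recall(r, N):.0%}, precision = {precision(r, d):.0%}")
```

The printed percentages should match the worked examples: 100% and 30% for the first algorithm, 100% and 43% for the second, 67% and 40% for the third.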

Blogger word verification update


Google has finally done what they should have done before rolling out their spamblocking features in blogger, and given people a way to whitelist their blogs. The fact they implemented a buggy version of the technology that shut out long-time users like myself shows (a) how poorly thought out their work is, and (b) where their priorities lie. No one likes being treated as collateral damage.



While preparing for my lecture this evening, I noticed that the formulas for precision and recall given in the polar bear book (2nd edition!) are wrong. Plug in some numbers and you'll see.

Just goes to show you how often the formulas are actually used in practice (i.e. never). The concepts, though, are central to search engine evaluation.

No sort by popularity?!?


As far as I can tell, none of the big sites (delicious, digg, feedster, furl, technorati) allow you to sort a search by popularity. Bloglines appears to, but it either doesn't seem to find anything (I think it isn't looking at much data) or the results are really bad.

Does anyone know why these sites haven't yet provided this obvious search feature?



Very busy yesterday and today. This morning I have class (bibliographic sources, which I am in right now), while this afternoon I have to take 10^6 screenshots for tonight's lecture.

FWIW: Exam advice


  1. Study early and often. Don't cram. The pathways through your memory will be too weak to survive the flash of fear and anxiety you will experience when you start reading the exam questions.
  2. Don't try to memorize everything. Try to understand the concepts, theories, etc. You'll have a better chance trying to apply general knowledge than trying to apply memorized facts. If you don't memorize the right thing, for example, you're out of luck.
  3. Stay calm and confident, both while you are studying and during the exam. Give yourself pep talks, positive reinforcement, whatever you need to do to build up your confidence. Although it may seem like you don't know enough, you'll be surprised how much you know when faced with the exam questions. Knowledge has a way of lying dormant until it is really needed.
  4. Don't study the day of the exam. As the song goes, if you don't know it by now, you'll never know it.
  5. Don't hang out with your classmates beforehand to talk about the pending exam (see #3). They will generally be in various states of delirium, and are not likely to make any positive contributions to your preparation.
  6. When the invigilator tells you to start, don't. Take ten deep breaths. Relax. It is just a test. You won't miss those 30 seconds at the end, but you do need to relax at the beginning.
  7. Read the exam all the way through once. Get your brain working on all the problems as soon as possible.
  8. Relax.
  9. Read the exam again, jotting down notes of things that come to mind as you go.
  10. Relax.
  11. Allocate your time according to the point value of each question. Keep track of the time and be ready to cut your losses as necessary.
  12. Do your best.

GLIS 607 Mid-term


This morning I had my mid-term exam for GLIS 607 Organization of Information. The exam covered descriptive cataloging, main and added entries, and authority files. It was an open book exam, which is good, because (a) there are a large number of complex rules that generally only get committed to memory after years of practice, and (b) there is no way that I would have been able to remember everything! I had good notes and examples with me as well, so I was able to cope. It was still challenging, but overall, I'm happy with the way it went.