About 15 minutes to class, and I'm sitting in the caf, enjoying the last of a coffee before getting another. The deck for tonight's lecture on metadata has come together fairly well. How well I can deliver the material remains to be seen. But I'm confident!
I enjoy teaching. It is the part of my job that I like the best. I have a notepad on the table next to me covered with notes for the new class I'm teaching next semester: web development. It's an undergraduate course, part of a software development certificate. Basically, all the HTML/CSS stuff I pulled out of my grad course, amplified. I've decided (and gotten permission from the program director) to focus less on server-side development and instead drive home larger points: web architecture, values (accessibility, internationalization, platform independence, etc.), standards-based development, and usability. I plan on stopping short of AJAX as well, although I will cover the fundamentals of client-side scripting. So, a new course. Lots to plan, lots to brush up on. Should be fun.
But tonight it's all about metadata. And the final project. And the final exam. Hmm, time is ticking. I need some time to boot up the classroom and get my laptop plugged in and running. But first, I need another coffee. The big cup this time, svp!
Why are there only two resources under XHTML in the Yahoo directory? I can only assume that they are not bothering to maintain or grow the directory in any meaningful way. Or maybe they have shifted their staff over to tagging everything?
November is a busy time of the year, the time when you realize that what you do now will decide how much pressure you are going to be under come December. As a result, I've been trying to focus as much as possible on getting work done, which has meant less time for tangential ramblings in this space.
You knew they had to grow them all somewhere!
Matthew Good points out how backwards our government aid policy is: "We can send DART all over the world but can’t seem to manage sending it mere hours north of our own capital. Of course, there are those that would respond to that by claiming that DART is for disaster relief and that what has happened in Kashechewan is not a disaster. And such a response would quite accurately describe precisely why it is."
Is it wrong to be excited and thrilled to receive a new filing cabinet?
2005-10-28T16:33:27-08:00
(Since Rosenfeld and Morville didn't get it right...)
Recall and precision are concepts that relate to the ability of a search algorithm to retrieve relevant documents from a collection of documents. More specifically, recall is a measure of how many of the relevant documents were retrieved, while precision is a measure of how many of the retrieved documents were in fact relevant.
These concepts can be defined operationally as follows. Given
D = number of documents retrieved
R = number of relevant documents retrieved
N = number of relevant documents in the collection
we have
Recall = R / N
Precision = R / D
For example, say that in order to evaluate the effectiveness of three search algorithms, we assemble a collection of 100 documents, 30 of which are considered relevant to our test query.
Search algorithm #1 retrieves all 100 documents, so we can calculate
Recall = 30 / 30 = 100%
Precision = 30 / 100 = 30%
While the recall for this algorithm is excellent, the precision is quite low. In other words, while it did retrieve all the relevant documents, it also retrieved a large number of non-relevant documents. Someone using this search algorithm would have to manually sort through many non-relevant documents before finding those that are relevant.
Search algorithm #2 retrieves 70 documents, including all 30 relevant documents. We calculate
Recall = 30 / 30 = 100%
Precision = 30 / 70 = 43%
This improves over the first algorithm in that it still retrieves all relevant documents, but leaves out some non-relevant ones. Recall is perfect, and precision has improved. It is important to note that this situation rarely occurs in practice. Typically, changes made to search algorithms to improve precision involve establishing criteria for rejecting documents, a process that inevitably causes some relevant documents to be labeled as non-relevant and therefore excluded from the search results.
In other words, as we improve precision, we reduce the ability of our algorithm to recall all relevant documents. This is illustrated in the following example:
Search algorithm #3 retrieves 50 documents, including 20 relevant documents. We calculate
Recall = 20 / 30 = 67%
Precision = 20 / 50 = 40%
We see here that the precision of the search has been improved over the first case, even though fewer of the relevant documents have been retrieved (i.e. only 20 of the 30 relevant documents known to be in the collection, a recall of 67%). From the user's point of view, precision has increased because they have fewer non-relevant documents to sort through.
All search algorithms have to make a tradeoff between recall and precision. The importance of each will depend on the needs of the user. If someone is looking for every single relevant document in the collection (e.g. "find me every news article written about our company"), recall will be more important than precision. Precision will be more important for someone looking for a few relevant documents (e.g. "find me a few articles on protecting my computer from spyware").
When do you need to use these formulas? Unless you are an information science or computer science researcher developing search algorithms, probably never. The problem with using the formulas in practice is that unless you are working with an artificial collection of documents, it is almost impossible to come up with an accurate count of relevant documents. Several factors contribute to this. First, you are usually searching a very large number of documents, so it is not feasible to go through them all and count the number of relevant documents for a given search. Second, relevance is subjective, making the relevance of any given document difficult to assess in any general way. Third, relevance is not a true/false measurement, but rather something that is measured in degrees of relevance, usually relative to other documents (i.e. "
this document is more relev[...]
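For anyone who would rather plug in the numbers with a script than by hand, here is a minimal Python sketch of the two formulas, run against the three example algorithms from this post. The function name `recall_precision` and its parameter names are my own invention for illustration, and the sketch assumes you already know how many relevant documents are in the collection, which, as noted, is rarely the case outside an artificial test set.

```python
def recall_precision(retrieved, relevant_retrieved, relevant_in_collection):
    """Return (recall, precision) as fractions between 0 and 1.

    retrieved              -- D, number of documents retrieved
    relevant_retrieved     -- R, number of relevant documents retrieved
    relevant_in_collection -- N, number of relevant documents in the collection
    (Hypothetical names, chosen to mirror the definitions in the post.)
    """
    if retrieved <= 0 or relevant_in_collection <= 0:
        raise ValueError("D and N must both be greater than zero")
    recall = relevant_retrieved / relevant_in_collection
    precision = relevant_retrieved / retrieved
    return recall, precision


# The test collection from the post: 100 documents, 30 of them relevant.
N = 30

# (D, R) for each of the three example search algorithms.
examples = {
    "algorithm 1": (100, 30),
    "algorithm 2": (70, 30),
    "algorithm 3": (50, 20),
}

results = {
    name: recall_precision(d, r, N) for name, (d, r) in examples.items()
}

for name, (recall, precision) in results.items():
    print(f"{name}: recall = {recall:.0%}, precision = {precision:.0%}")
```

The percentages printed should match the ones worked out by hand above (100% and 30%, 100% and 43%, 67% and 40%).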
Google has finally done what they should have done before rolling out their spam-blocking features in Blogger, and given people a way to whitelist their blogs. The fact that they implemented a buggy version of the technology that shut out long-time users like myself shows (a) how poorly thought out their work is, and (b) where their priorities lie. No one likes being treated as collateral damage.
While preparing for my lecture this evening, I noticed that the formulas for precision and recall given in the polar bear book (2nd edition!) are wrong. Plug in some numbers and you'll see.
Just goes to show you how often the formulas are actually used in practice (i.e. never). The concepts, though, are central to search engine evaluation.
As far as I can tell, none of the big sites (delicious, digg, feedster, furl, technorati) allow you to sort a search by popularity. Bloglines appears to, but it either doesn't seem to find anything (I think it isn't looking at much data) or the results are really bad.
Does anyone know why these sites haven't yet provided this obvious search feature?
Very busy yesterday and today. This morning I have class (bibliographic sources, which I am in right now), while this afternoon I have to take 10^6 screenshots for tonight's lecture.
This morning I had my mid-term exam for GLIS 607 Organization of Information. The exam covered descriptive cataloging, main and added entries, and authority files. It was an open book exam, which is good, because (a) there are a large number of complex rules that generally only get committed to memory after years of practice, and (b) there is no way that I would have been able to remember everything! I had good notes and examples with me as well, so I was able to cope. It was still challenging, but overall, I'm happy with the way it went.