Emdashes—Modern Times Between the Lines


Gabba Gabba Hey! Are the New Yorker Archives Full-Text Searchable?

Filed under: The Squib Report

Martin Schneider writes:

I just noticed something weird: You can get hits from old New Yorker articles on Google.

It may not be immediately apparent how significant this is. Since The New Yorker began steadily—aggressively, even—increasing its electronic profile in 2000, one of the natural consequences has been that you can access the materials by searching on them.

But there have always been arbitrary constraints: Anything since 2000 is likelier to be searchable because the magazine was putting a lot of its content on its website, which is logical. Before that, you might be out of luck. The Complete New Yorker DVD set came out in 2005, which vastly increased the user's ability to search The New Yorker's past. But the search was a keyword search that also (I think; I've never quite gotten a handle on this) folded in The New Yorker's own internal abstracts and possibly some other text, but never full-text search or anything close to it. The Digital Edition, unveiled a mere three months ago, also doesn't incorporate full text. (The Digital Edition lives at http://archives.newyorker.com/, which will become relevant shortly.)

So here's what happened. You know the "site:" operator in Google? You use it if you want to limit a search to a single website. I was fiddling around, searching for the term "Ramones" on newyorker.com, and I realized that my hits weren't limited to www.newyorker.com; I also got stuff from archives.newyorker.com. Here is the search I ran:

site:newyorker.com ramones

Google's gotten subtle and variable enough that different people might get slightly different results, but on my machine, it returns 198 hits. Scrolling down, the first (counting....) twenty-six hits are from www.newyorker.com, and just about all of them appear to be recent, that is, since 2000. That material was posted to the magazine's website.

But the twenty-seventh hit is not from www.newyorker.com. It's from archives.newyorker.com. And it dates from 1991. The title reads, "The New Yorker Digital Reader : Jan 07, 1991." I don't know for sure, but it looks like every hit after that might be from archives.newyorker.com. (I guess this is a good moment to observe that you have to be a subscriber to the magazine to benefit from this quirk. In case you don't know, I'll reiterate that any print subscriber automatically receives free access to all old issues on the Digital Reader.)

And yes, if you're wondering, these results are completely different from the hits you would get from the other New Yorker resources. On the CNY DVD set, a search for "Ramones" returns 6 results. (I only have one update installed on my version, FYI.) On the website, the same search returns 162 hits, but a great many of them are for "Ramon" and have nothing to do with our beloved Forest Hills punk gods.

Most of these hits for the Ramones seem to be listings, which makes some sense. Readers tend to forget the sheer volume of verbiage that each week's listings section represents. Those would provide a huge amount of content that is nowhere else accessible. Now you can document Jerry Orbach's storied career as a Broadway crooner! Among other things.

I don't actually think these results are coming from a proper full-text archive. I think these are OCR (optical character recognition) results. I worked extensively with OCR in the late 1990s, so I kind of know it when I see it. One of the hits in Google provides the following preview:

he Ramones-who are, after Patti Smith, per haps the most successful act to pass through these ... \\rho have all taken Ramone clS their stage name,

"\\rho" is obviously "who," and "clS" is obviously "as." That's OCR output, right there. So I guess the results will be imperfect. Good, but imperfect. (It stands to reason that if The New Yorker had their archives OCR'd, then it would capture advertisement content as well. Basically the nature of magazine layout would make this very hairy, but you'd still get some decent results, as the Ramones search shows.)
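
For anyone who wants to poke at this kind of output, here is a minimal Python sketch of why a search over OCR'd text can still land near misses. It is not how Google or The New Yorker handles the archive (I have no idea what they do). It just splits the preview above into words and keeps the ones that closely resemble a target term. The function name fuzzy_hits and the 0.8 cutoff are my own assumptions, picked for illustration.

```python
import re
from difflib import SequenceMatcher

# The Google preview quoted above, OCR errors and all.
SNIPPET = (
    r"he Ramones-who are, after Patti Smith, per haps the most successful "
    r"act to pass through these ... \\rho have all taken Ramone clS their "
    r"stage name,"
)


def fuzzy_hits(text: str, target: str, cutoff: float = 0.8) -> list[str]:
    """Return the words in `text` whose similarity to `target` is >= cutoff.

    Hypothetical helper for illustration; the cutoff is an arbitrary choice.
    """
    hits = []
    for word in re.findall(r"[A-Za-z]+", text):
        # ratio() is 1.0 for identical strings and falls toward 0.0
        # as the two strings share fewer characters in order.
        score = SequenceMatcher(None, word.lower(), target.lower()).ratio()
        if score >= cutoff:
            hits.append(word)
    return hits


matches = fuzzy_hits(SNIPPET, "ramones")
print(matches)
```

An exact match on "Ramones" would find only the first mention in that preview; the looser comparison also picks up the singular "Ramone." Words the OCR has mangled more badly are too far gone for a trick this simple.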

You can limit the search to just those archives hits by doing this:

site:archives.newyorker.com ramones
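
If you would rather generate these searches than type them, here is a small Python sketch that builds the two queries used in this post as Google search URLs. The helper name site_search_url is made up for this example, and the https://www.google.com/search?q=... address is the ordinary public search URL as far as I know, not any kind of official API.

```python
from urllib.parse import urlencode


def site_search_url(site: str, terms: str) -> str:
    """Build a Google search URL limited to one site via the site: operator.

    Hypothetical helper; it only formats a URL and never contacts Google.
    """
    query = f"site:{site} {terms}"
    # urlencode percent-encodes the colon and turns the space into "+".
    return "https://www.google.com/search?" + urlencode({"q": query})


# Whole domain: hits from www.newyorker.com and archives.newyorker.com.
whole_domain = site_search_url("newyorker.com", "ramones")

# Archive subdomain only: just the Digital Reader hits.
archives_only = site_search_url("archives.newyorker.com", "ramones")

print(whole_domain)
print(archives_only)
```

The only difference between the two is how much of the hostname follows "site:". Naming the bare domain pulls in every subdomain, which is how the archive hits showed up in the first place.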

Okay, that's enough on this subject for now. Please do write in if you discover anything interesting about this!


Awesome find! Though I’m afraid of how much time I’ll waste with this, now…

2008 Webby Awards Official Honoree