docster: instant document delivery
(c) April 2000 by Daniel Chudnov
You may reproduce this article in any format and for any purpose, but only in its entirety, including this statement.
Background: the attack of napster
Have you seen napster yet? If not, take a look. Napster is two things: one part distributed filesystem and one part global music copying tool. It works incredibly efficiently and is very easy for users.
A typical session with napster might go like this:
That's it. Do not go to the record store. Do not respond to the monthly selection. Look for what you want to hear, click, download, listen. And everybody's doing it. So many people are using napster, in fact, that several college campus network administrators are cutting out all napster traffic because the traffic is flooding their internet pipes.
Why is napster so successful? Because it's simple. Behind the scenes, it works quite simply also. You have mp3 files on your machine, and your machine is on the net. When you connect (usually by just starting your client application), the napster server knows what files are on your machine (if you tell it where to look). And the napster server knows what files are on the machines of the other two or three thousand people logged in at the same time. There's a song you want to hear? Search the napster server... it knows who has it, and napster will send you a copy of some other bloke's copy of that song.
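The arrangement just described, where a server knows who has which files but holds none of them itself, can be sketched in a few lines. This is only an illustration of the idea, not napster's actual protocol: the class name `IndexServer`, its methods, and the second user name and file names are all made up for the example.

```python
class IndexServer:
    """Hypothetical central index: it records who has which file, but stores no files."""

    def __init__(self):
        self.files_by_user = {}  # user name -> list of file names that user shares

    def connect(self, user, filenames):
        # A client announces its shared files when it logs in.
        self.files_by_user[user] = list(filenames)

    def disconnect(self, user):
        # When a user logs off, their files vanish from the index.
        self.files_by_user.pop(user, None)

    def search(self, term):
        # Brute-force substring match over every connected user's file names.
        term = term.lower()
        hits = []
        for user, names in self.files_by_user.items():
            for name in names:
                if term in name.lower():
                    hits.append((user, name))
        return hits


server = IndexServer()
server.connect("hongkongfooey", ["La Vida Loca.mp3", "Enter Sandman.mp3"])
server.connect("otherbloke", ["la vida loca (live).mp3"])
results = server.search("vida loca")  # two users hold a matching file
```

The search returns pairs of (user, file name); the download itself would then happen between the two users' machines, never through the server.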
When I connect to napster (usually late at night on weekends; the university I'm at doesn't allow napster traffic during business hours, a reasonable restriction), there are normally about 2,500 users logged in and over 600,000 songs (files) available. Probably 80% of these files are duplicated at least once, and 20% probably account for 80% of the traffic and so on, but I've found some fairly obscure, groovy stuff. The key thing is that if I can think of a song, I can probably find it, even though napster applies no organizational scheme to its catalog of connected songs.
What of it?
The questions any librarian would ask at this point are obvious: "what about copyright?" and "doesn't it need to be organized?" The simple answer for question one is that while the napster folks state that they seek "to comply with applicable laws and regulations governing copyright," napster is widely used for making copies of songs in blatant violation of copyright. I know... I've done it. This doesn't seem to stop thousands of folks from using it; evidently folks aren't losing sleep over it. Napster is being sued, but the service hasn't been slowed at all yet. To put it simply, we all know that it's wrong, but somehow this is still too new for many people to dismiss as morally corrupt, so plenty of users remain.
(Note (2000-4-14): today the news broke about Metallica suing napster and my employer. Maybe it's time to post that Kirk Hammett pick I got at that 1989-7-4 Pine Knob show on ebay. :)
As for organization, it doesn't seem to matter. Nobody's organized a bit of it. Catalogers should turn red when seeing how poor (read: absent) the indexing is. 100% brute force, not even stop word removal and no clear record editing. Applying a few simple techniques to this problem would make searching for songs more reliable but not really any easier, because people are already mostly finding what they want to find and that's adequate for most. If you don't believe this, ask your nearest university network administrator.
And napster isn't just about music. As is well stated in this cnet article (and in this week's Nature, 13 April 2000, "Music software to come to genome aid?" by Declan Butler), this is a groundbreaking model of information delivery. It changes things. It's a killer app if ever there was one. The napster model shows that it's simple to share movies, music, anything that can live in a file on a connected box. All you need is a simple protocol and some clients that can speak that protocol. And fast net connections and big, cheap hard drives aren't going away anytime soon.
Some might wonder how napster is different from what the web already provides. The difference might seem minor, but cuts backwards through everything librarians know about giving people access to information. The difference is that while anybody can put anything up on the web and share it with friends, few people can provide the necessary overhead. Even if you can run a web server, for instance, a certain amount of centralized description or searching is still necessary (directory sites, search engines, etc.) before anyone can find your files.
With napster, however, you only have to be connected and willing to share your files. Napster does the rest by keeping track of what you've got so others can find it. You don't need to do anything to let thousands of other people copy files from your machine via napster.
So put aside security and copyright and organization concerns for a moment and consider... does this remind you of anything? Hmmm... I've got a song/movie/file that I've enjoyed and other people might like it too. Maybe other people have things I would enjoy. I wouldn't mind letting other people have my song/movie/file if I could also use theirs in return. This kind of cooperation could work. But how can I be sure that such cooperation would continue? And how would we organize it all?
Ever hear of a lending library?
Paper shall set you free
Funny how napster doesn't care about dublin core or MARC. It doesn't need a circulation module. It doesn't even matter what kind of computer you have, as long as you have a working client and decent bandwidth. Think of the implications of applying this model in our libraries. With all the advances in standardization of e-print archives and such (see the Open Archives initiative), we already have high hopes about the future of online publishing. With that solved, maybe the napster model could help us deal with our favorite legacy format: bound journals.
Have you ever worked in or near a busy InterLibrary Loan office? Do you know that sometimes we libraries photocopy and use Ariel to send the same document ten times in a month for patrons in other libraries? It seems terribly wrong that we've got to do this work over and over when we could just keep a copy on a hard drive, but we know well that the legal precedent today prevents us from creating such centralized storage. Heck, often we can't even fill an ILL request out of our already-digital ejournal collections because we sign restrictive licenses. So we have to go back to the stacks and photocopy and scan it through again instead of clickclickclicking to a lovely pdf for which we've paid so heftily.
But looking at napster, there's a key thing to consider about how its file sharing model might apply to document delivery. In napster, there is no centralized storage. In napster clients (gnapster, at least), search results list other users' copies of the song you want. You click to download. In the background napster echoes to its status line "requesting La Vida Loca from user hongkongfooey" and sometimes hongkongfooey says no... which is okay, too, because you can probably ask somebody else for it. When somebody (more precisely, their napster client, depending on how they've set their preferences, as napster doesn't wait for human approval if the right options are already set) okays a file transfer, they are giving you a copy, not napster.
In walks docster
Imagine all the researchers you know, with a new bibliographic management tool that combined file storage with a napster-like communications protocol -- docster. Instead of just citations, docster also stores the files themselves and retains a connection between the citation metadata and each corresponding file. Somewhere in the ether is a docster server to which those researchers connect. They're reading one of their articles, and they find a new reference they want to pull up. What to do? Just query docster for it. Docster will figure out who else among those connected has a copy of that article, and if it's found, requests and saves a copy for our friendly researcher.
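To make the "citation plus file" idea concrete, here is a minimal sketch of how a docster client might tie citation metadata to a stored file, and how a server might find who holds a wanted article. Everything in it is an assumption for illustration: the `Citation` fields, the matching rule in `key()`, the `find_holders` function, and the sample journal, path, and researcher names.

```python
from dataclasses import dataclass


@dataclass(frozen=True)
class Citation:
    # Hypothetical minimal metadata; a real client would carry much more.
    journal: str
    year: int
    volume: str
    first_page: str

    def key(self):
        # Assumed matching rule: normalized journal title plus year, volume, first page.
        return (self.journal.strip().lower(), self.year, self.volume, self.first_page)


@dataclass
class Holding:
    citation: Citation  # the bibliographic record
    path: str           # where the file lives on the researcher's own machine


def find_holders(wanted, collections):
    """Return the connected researchers whose collections contain the wanted citation."""
    return [
        owner
        for owner, holdings in collections.items()
        if any(h.citation.key() == wanted.key() for h in holdings)
    ]


collections = {
    "researcher_b": [
        Holding(Citation("Journal of Examples", 1973, "12", "345"),
                "/papers/example-1973.pdf"),
    ],
    "researcher_c": [],
}
wanted = Citation(" journal of examples ", 1973, "12", "345")
holders = find_holders(wanted, collections)
```

Matching on a normalized key rather than on a file name is the one place where this sketch departs from napster's brute-force approach; it relies on the citation metadata that a bibliographic tool already keeps.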
Of course, we cannot do this. Libraries depend too much on copyright to attack the system so directly. But what if we focused instead on altering the napster model enough to make it explicitly copyright-compliant? After all, many cases of one researcher giving another a copy of an article are a fair use of that article. Fair use provides us with this possibility and it's not a giant leap to argue that perhaps coordinated copying through such a centralized server could constitute fair use, especially if docster didn't compete with commercial interests.
Well, it's still a big leap, but think of the benefits. Say there's an article from 1973 that's suddenly all the rage. It doesn't exist online yet, so a patron request comes to you from some other library, and you've got the journal, so you fill the request. But forty-eight other researchers want that article too. If that first patron uses docster, any of those other folks also using docster can just grab the file from the first requestor. If others don't use docster, they can request a copy from their local libraries, who -- I hope -- do use docster. Nobody has to go scan that article again, and suddenly there is redundant digital storage (see also LOCKSS).
Let the librarians librarize
Still, though, we're not doing enough to enforce copyright. Currently, a research library filling tens of thousands of document requests for its patrons per year makes copyright payments to the CCC whenever its fills require them. This system keeps publishers happy but keeps librarians chasing our collective tails. And even though systems like EFTS have automated even the copyright payment transfers, we're still continuing our massively parallel redundant (wasteful) copying and scanning efforts.
But because we're so good at making sure we make payments, we could leverage that structure within docster. We could federate the docster servers at the institutional (or consortial) level. For the several hundred or thousand researchers in a given field with a departmental library at a big institution, their docster requests go through their library. The requests, that is, not the files. It might look like this:
In this transaction (which might take only a few seconds for queries and download time) both libraries know about the article being requested, but X Library can keep A's identity private. Likewise Y Library can keep B's identity private. Thus the transaction might consist of identification at the institutional level, ensuring the privacy of both parties. But if a copyright payment needs to be made, X Library can pass that through to EFTS for clearance and then charge Researcher A's grant number (assuming, of course, that Researcher A knowingly signed up for the service). Y Library didn't have to pull anything from the stacks, and Researcher B might have been cooking dinner through the whole thing. Neither library ever stored or transmitted a copy directly; rather they only determined who had a copy (Researcher B), and had a copy sent to the requestor (Researcher A).
And the paper publisher gets paid. Everybody's happy.
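The federated transaction described above can be sketched roughly as follows. This is a toy model under stated assumptions, not a protocol design: the `Library` class, the `request` function, and the payment queue are invented names, and the actual file transfer between the two researchers' machines is left out entirely. The point is only to show what each library learns.

```python
from dataclasses import dataclass, field


@dataclass
class Library:
    name: str
    holdings: dict = field(default_factory=dict)  # researcher -> set of citation keys
    payments: list = field(default_factory=list)  # (researcher, key) awaiting clearance

    def holder_of(self, citation_key):
        # Each library only ever looks up its own researchers.
        for researcher, keys in self.holdings.items():
            if citation_key in keys:
                return researcher
        return None


def request(citation_key, requester, home, peers, needs_payment):
    """Route a request through the requester's home library to peer libraries."""
    for peer in peers:
        holder = peer.holder_of(citation_key)
        if holder is None:
            continue
        if needs_payment:
            # The home library queues the copyright payment (e.g. for EFTS).
            home.payments.append((requester, citation_key))
        # After delivery the requester holds a copy and can fill later requests.
        home.holdings.setdefault(requester, set()).add(citation_key)
        return {
            "home_sees": (requester, peer.name),  # own patron + the other institution
            "peer_sees": (holder, home.name),     # own patron + the other institution
        }
    return None


x = Library("X Library")
y = Library("Y Library", holdings={"researcher_b": {"article-1973"}})
outcome = request("article-1973", "researcher_a", x, [y], needs_payment=True)
```

Each library sees its own patron and the other institution's name, never the other patron, and neither library stores the file.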
Concrete steps and benefits
The necessary infrastructure for making this work is mostly in place. There are variants of the napster protocol under development (see the cnet piece now if you didn't read it before ;), none of which would require significant modifications. Some sort of federation protocol would need to be established, but it wouldn't be any more complicated than the routing cell structure (target library priority lists) implemented for Docline. Docster client software would need to be integrated with bibliographic metadata, but that's an easy hack too.
To address security concerns, it might be necessary to carefully define the protocol so that it would not compromise any individual user's machine. Additionally some sort of basic certification authority might need to be used to verify the identities of source and target institutions. While these are not trivial tasks, there are well-known approaches to each.
Think of the time we would save, and the speed at which articles would move around. For any article that had ever been filled into the docster environment (and that still lives on a connected machine), there would be no more placing a request, verifying the cite, placing the order, claiming the order, pulling from stacks, copying, Arieling, Prosperoing, emailing, etc., not to mention all the logging and tracking we rekey when moving requests from one system to the next when we don't have an ILL automation system (and even if we do). Your happy researcher would just need to type in a search and -- hopefully -- download what she needs. If not, you or your neighborly peer library make the copy and send. Once. And nobody else has to again.
Indeed it is easy to imagine building some local search functions into docster clients to avoid even making a request whenever possible. A local fulltext holdings database might be queried first (through the help of something like jake), then a local OPAC for print holdings. If these steps fail, a request could be automatically broadcast, and any request that bounces due to bad data or lacking files could be corrected, rebroadcast, or sent through existing systems like OCLC, RLIN, or Docline, and then Ariel'd back and delivered through the local docster server. The next such request would hit the now docster-available file (if the first requestor keeps his machine online).
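The order of lookups in the paragraph above amounts to a simple fallback chain. A minimal sketch, assuming each source can be reduced to a yes/no lookup; the `resolve` function, the step labels, and the sample keys are invented for the example and do not correspond to any real system's interface.

```python
def resolve(citation_key, steps):
    """Try each source in order and return the label of the first one that can fill the request."""
    for label, has_item in steps:
        if has_item(citation_key):
            return label
    return None


# Stand-ins for real lookups (a full-text holdings database, an OPAC, a docster query).
local_fulltext = {"ejournal-article"}
local_print = {"bound-volume-article"}
docster_peers = {"article-1973"}

steps = [
    ("local full text", local_fulltext.__contains__),
    ("local print holdings", local_print.__contains__),
    ("docster broadcast", docster_peers.__contains__),
    # Last resort: hand the request to existing ILL systems, which always accept it.
    ("traditional ILL request", lambda key: True),
]

route = resolve("article-1973", steps)
```

Because the chain stops at the first hit, a request only reaches the slower and costlier steps when the cheaper ones have nothing to offer.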
Why wouldn't researchers create workarounds and bypass the libraries altogether? Well, technically, they already can use napster-like services for this, and obviously there's nothing to stop them from doing so. But libraries would play several vital roles in this equation: first, our patrons trust us with private information because we've safeguarded their privacy carefully and reliably for years; second, libraries can provide many value-added services such as integrating docster searches with additional functions as described above; third, the clientele we serve (at least where I work, a medical library) includes researchers working on deadlines and clinical staff saving patients' lives. Many of these patrons are fortunate enough to not have to care about costs; they wouldn't think twice about paying a reasonable charge for immediate, accurate delivery of crucial information.
Beyond the fact that building EFTS payments into this model would help our accounting procedures become even more automated, consider the copyright question one more time. As docster grows, more and more articles would be fed into the system. Some of those articles will be old. Some will even be old enough to qualify as public domain. As each year passes, a slew of articles will pass into such exalted status. For these public domain articles, nobody has to do any accounting at all.
Put this together with today's increasingly online publishing, and a window begins to close -- the window between today's e-prints (which will increasingly follow the open archives specifications and be accordingly easy to access) and yesterday's older print archives (which will increasingly be public domain). In between is a growing pool of documents available through docster, instantly accessible in complete compliance with (read: payment for) copyright.
It certainly wouldn't take very long to construct and conduct a limited trial. If we approach docster from day one as a good faith effort to comply with copyright while creating efficiencies, we might be challenged by publishers but we'll at least have a good case going for us that we're not taking any money away. And best of all, we'd certainly have our patrons on our side. In all likelihood, the revenue publishers receive would increase significantly. Any librarian will tell you that the minute access gets easier, more people want access.
Think it through. I'll say it again: if you still don't believe this can happen, ask your nearest network administrator about napster traffic.
I am very grateful to KB, AB, MW, RKM, RMS, MG, AO, and KP for their insight and feedback on early drafts, and in particular to JS for allowing me to test this idea out in public and on an unsuspecting crowd. -dc