Monthly Archives: February 2011

Converting a plaintext bibliography to Endnote/RIS format with help from Linux/Terminal

[Update 16/7/2011: See my more recent post on the topic, Launching Ref2RIS – convert your typed bibliography to Endnote format, which makes things even easier.]

You won’t want to do this unless you’ve got literally hundreds of references. Any fewer, and these suggestions are way easier.

1. Format references so they’re each on their own line – no blank lines.

2. Use Word’s “Find Special” capabilities to replace a phrase in italics with {it}a phrase in italics{endit} and a phrase in bold with {b}a phrase in bold{endb}.  (Similarly if the citations contain underlines.)

3. Save as plaintext – say, source.txt.  Now the fun begins…  My own source text contains 600-odd lines in ACS style, like this:

Bamford, C. H.; Tipper, C. F. H. {it}Comprehensive Chemical Kinetics{endit}; Elsevier: New York, {b}1977{endb}. 
House, D. A.{it}Chem. Rev.{endit} {b}1962{endb}, {it}62{endit}, 185

4. Open up Terminal or some other Linux command line.

5. Endnote records are separated by a line

ER  - 

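– that’s two spaces before the hyphen and one after.  (All these details come from Endnote’s help pages.) This is the easy part: type in the following two lines exactly as shown (the backslash at the end of the first line is how you tell sed to start a new line, so the separator ends up on a line of its own)

sed -e 's/$/\
ER  - /' source.txt > source1.txt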

6. The start of each Endnote record tells you what kind of citation it is – eg a book, journal etc.  To mark as a book every line that includes a colon (ie the one separating the publisher from the city of publication) type in

sed -e 's/^\(.*:\)/TY  - BOOK@@\1/' source1.txt > source2.txt

Note 1: The “@@” is in there as a sign that you’ll need to replace this with a new line later; but we want to keep everything on one line for now.
Note 2: This is a good example of why this whole method is highly suspect, because it’ll also catch citations which have a colon in the article title or in a typo or whatever.  So if you can think of a better sign that a citation is a book then use that instead of the colon.

Alternatively, you could type in

sed -e 's/^\(.*{it}[0-9][0-9]*{endit}\)/TY  - JOUR@@\1/' source1.txt > source2.txt

to find every line that contains {it}[some number]{endit} which, in my source, is the best indicator that I’m dealing with a journal.  The same caveats apply – you’ll get both false positives and false negatives.
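
Coming back to Note 2: one possibly stricter sign of a book, at least in a source formatted like mine, is a "; Publisher: City, {b}year{endb}" group rather than any colon at all. The sketch below assumes that layout; tag_books is a made-up helper name and the second sample line is an invented reference, included only to show a colon in a title being ignored.

```shell
#!/bin/sh
# Sketch: tag a line as a book only if it contains
# "; Publisher: City, {b}YEAR{endb}" (an assumed ACS-style layout).
# "tag_books" is a hypothetical helper name; it reads standard input.
tag_books() {
    sed -e '/; [^;:]*: [^;:]*, {b}[0-9][0-9]*{endb}/s/^/TY  - BOOK@@/'
}

# First line: the book from my source. Second line: an invented journal
# reference with a colon in its title, which should be left untagged.
printf '%s\n' \
    'Bamford, C. H.; Tipper, C. F. H. {it}Comprehensive Chemical Kinetics{endit}; Elsevier: New York, {b}1977{endb}.' \
    'Smith, J. {it}Kinetics: A Review{endit} {b}1980{endb}, {it}5{endit}, 12' |
    tag_books
```

On real files this would be run as tag_books < source1.txt > source2.txt. It will still miss books whose publisher details are laid out differently, so the same caveats about false negatives apply.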

Anyway, keep doing what seems best given your source, and fix up the inevitable mistakes by hand until each reference line starts with TY  - something.  If you want to give up and just assume that everything that isn’t already assigned as something must be a journal then try

sed -e '/^TY  - /b' -e '/^ER  -/b' -e 's/^/TY  - JOUR@@/' source2.txt > source3.txt

I now have source looking like:

TY  - BOOK@@Bamford, C. H.; Tipper, C. F. H. {it}Comprehensive Chemical Kinetics{endit}; Elsevier: New York, {b}1977{endb}. 
ER  -
TY  - JOUR@@House, D. A.{it}Chem. Rev.{endit} {b}1962{endb}, {it}62{endit}, 185
ER  -
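
Before moving on it may be worth checking that nothing slipped through without a type. A minimal sketch, assuming records look like the ones above; untagged is a made-up helper name.

```shell
#!/bin/sh
# Sketch: print every line that is neither a record type (TY) line nor a
# record separator (ER) line, i.e. references that still need a TY tag.
# "untagged" is a hypothetical helper name; it reads standard input.
untagged() {
    sed -e '/^TY  - /d' -e '/^ER  -/d'
}

# Example with two records, one of which has not been tagged yet.
printf '%s\n' \
    'TY  - BOOK@@Bamford, C. H.; Tipper, C. F. H. {it}Comprehensive Chemical Kinetics{endit}; Elsevier: New York, {b}1977{endb}.' \
    'ER  - ' \
    'House, D. A.{it}Chem. Rev.{endit} {b}1962{endb}, {it}62{endit}, 185' \
    'ER  - ' |
    untagged
# prints only the "House, D. A." line
```

Running untagged < source3.txt | wc -l on the real file should report 0 once every reference has a type.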

7. Now we keep playing with patterns.  (You may be able to do large chunks of this with regular find/replace, but for illustrative purposes I’ll keep using Terminal.)

For example, in my source the authors are nicely set off: they come after “@@” and before the first “{it}” (or “in {it}”), and if there’s more than one of them they’re separated by “;”.  So a few commands:

sed -e 's/@@\(.* in {it}\)/@@A1  - \1/' source3.txt > source4.txt
sed -e '/@@A1  - /!s/@@\(.* {it}\)/@@A1  - \1/' source4.txt > source5.txt
(The /@@A1  - /! at the front skips lines the first command has already marked, so they don't get a second A1.)
sed -e 's/;\(.*;\)/@@A1  - \1/' source5.txt > source6.txt
(This one I had to repeat a few times, going on to source7.txt and source8.txt, depending how many authors could be cited in one reference; there's supposed to be a way to do it globally but my unix fu is not strong.)
sed -e 's/;\(.*{it}\)/@@A1  - \1/' source8.txt > source9.txt
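
For what it's worth, the repeated semicolon command can probably be done in one go with a sed loop: a label (:a) followed by the substitution and then ta, which jumps back to the label whenever the substitution changed something. This is a sketch rather than something tested against all 600-odd lines, and split_authors is a made-up helper name.

```shell
#!/bin/sh
# Sketch: repeat the "; -> @@A1  - " substitution until it stops matching.
#   :a  defines a label
#   ta  branches back to the label if the last s command made a change
# "split_authors" is a hypothetical helper name; it reads standard input.
split_authors() {
    sed -e ':a' -e 's/;\(.*;\)/@@A1  - \1/' -e 'ta'
}

# Example: one already-typed book line with two authors.
printf '%s\n' \
    'TY  - BOOK@@A1  - Bamford, C. H.; Tipper, C. F. H. {it}Comprehensive Chemical Kinetics{endit}; Elsevier: New York, {b}1977{endb}.' |
    split_authors
```

Like the original command, each pass only replaces a semicolon that has another semicolon somewhere after it, so the last separator on a journal line is left alone and still needs the separate {it} command.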

Journal titles:

sed -e 's/^\(TY  - JOUR.*\)\({it}.*{endit} {b}\)/\1@@JO  - \2/' source9.txt > source10.txt

Years:

sed -e 's/\({b}[0-9]*{endb}\)/@@Y1  - \1/' source10.txt > source11.txt

And so forth.  You pretty soon start to see why the first suggestion on most lists of ways to convert plaintext citations into RIS format is always “Just type it in / search for it again by hand”.  The method above is really only suitable if you’ve got literally hundreds of citations. (I have 639, plus or minus.)
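
As one more instance of "and so forth": in a source like mine, a journal's volume and first page come straight after the year, as ", {it}62{endit}, 185". The sketch below assumes exactly that layout and uses the RIS tags VL (volume) and SP (start page); tag_volume_pages is a made-up helper name.

```shell
#!/bin/sh
# Sketch: turn "{endb}, {it}VOLUME{endit}, PAGE" into VL and SP fields,
# keeping "@@" as the placeholder for a future line break.
# "tag_volume_pages" is a hypothetical helper name; it reads standard input.
tag_volume_pages() {
    sed -e 's/{endb}, {it}\([0-9][0-9]*\){endit}, \([0-9][0-9]*\)/{endb}@@VL  - \1@@SP  - \2/'
}

# Example: the journal line from my source after the earlier steps.
printf '%s\n' \
    'TY  - JOUR@@A1  - House, D. A.@@JO  - {it}Chem. Rev.{endit} @@Y1  - {b}1962{endb}, {it}62{endit}, 185' |
    tag_volume_pages
```

Lines that don't follow the pattern (books, or journals with page ranges written some other way) pass through unchanged and need their own patterns or a fix by hand.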

8. Eventually you’ll be at a point where you can do a simple find/replace to change @@ to a new line and nuke all the {it} and so forth.  This will be a great relief.
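
If you'd rather stay in Terminal for this last step too, something like the following should do it. It's a sketch that assumes the only markers left are {it}, {endit}, {b}, {endb} and @@; finish_ris is a made-up helper name.

```shell
#!/bin/sh
# Sketch: strip the formatting markers, then turn every "@@" into a real
# line break. The backslash followed by a line break inside the last
# expression is the portable way to put a newline in a sed replacement.
# "finish_ris" is a hypothetical helper name; it reads standard input.
finish_ris() {
    sed -e 's/{it}//g' -e 's/{endit}//g' \
        -e 's/{b}//g' -e 's/{endb}//g' \
        -e 's/@@/\
/g'
}

# Example: one fully tagged journal record.
printf '%s\n' \
    'TY  - JOUR@@A1  - House, D. A.@@JO  - {it}Chem. Rev.{endit}@@Y1  - {b}1962{endb}@@VL  - 62@@SP  - 185' \
    'ER  - ' |
    finish_ris
# prints seven lines: TY, A1, JO, Y1, VL, SP, then ER
```

On the real file this would be finish_ris < source11.txt > source12.txt. Any underline markers you added in step 2 would need their own s///g expressions.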

9. Rename your final saved file from source12.txt to source12.ris and open with Endnote.

10. Bonus material:  if this was a bibliography to a paper using numbered citations in order using eg [1], then in that paper you can do a find/replace on [ -> { and ] -> }, then tell the Endnote plugin to format citations, and voila, the best magic ever.  (If the paper uses author/date citations then you’ll have to link them by hand, sorry.)

A rule about rhetorical questions

At intermediate and high school we learned the basics of debating. One technique we learned about was the rhetorical question; and we also learned an important rule for its use: Don’t ask a rhetorical question if there’s a chance your audience will respond with the ‘wrong’ answer.

@libsmatter reported from an ALIA panel:

What if when our budgets were cut we asked – “so – what do you want us to stop doing?”

which I used to agree with. And I still agree that if our budgets keep getting cut then we’ll have to cut services. But that doesn’t mean the argument will make everyone say, “Oh, right. Um, we didn’t think of that. Here, have an extra million dollars.”

Because if an institution wants/needs/thinks it needs to cut the library’s budget, it can really easily reply, “You need to keep providing the same service. Be more efficient. Work smarter. And if you can’t figure out how to do that for yourselves then well, we’ll send in our favourite efficiency experts and cut your staffing for you.”

And if we’re not prepared to accept that answer then we should be very careful about asking that question.