Knowledge Management – Eric Blue's Digital Home

20 Years of Blogging and Coding: Reflections on an Unexpected Journey

ericblue — Sun, 08 Sep 2024 00:31:48 +0000

This week marks the 20th anniversary of when I first started writing online and published my first blog post. In 2010, I released my first open-source project: an unofficial Fitbit API written in Perl.

At the time, I had no grand vision—just curiosity and a desire to document my thoughts and pursue my passion across a range of topics. Fast forward to today, and I’m both surprised and deeply honored to see how my work has found its way into academia. My writings and open-source projects have been referenced by researchers and featured in books, academic theses, research papers, art exhibits (even before I found my own way into art years later), and research projects across the globe—26 citations and references from universities and organizations across 11 countries.

These contributions have touched on diverse topics, including the Quantified Self, Knowledge Management, Mind Mapping, EEG (Electroencephalography), Sleep Behavior and Monitoring, Wearable Fitness Technology, Information Visualization, and Learning. It’s amazing and humbling to see how these projects have been used in so many different ways, inspiring further research and creative endeavors. Many of these articles and open-source projects were spur-of-the-moment, late-night or weekend tinkerings, while others were multi-year journeys exploring niche areas of tech that fascinated me.

Some of the research projects and published papers include processing EEG data to control prosthetic limbs, improving software development processes through mind mapping, exploring personal knowledge management with my multi-year side project (the Personal Memex), and even an art exhibit visualizing brainwave data.

Although I don’t blog as much these days—most of my writing now happens on social media—I’m still actively creating and sharing open-source projects. It’s incredible to see that the work I started years ago continues to be used and appreciated by others. My hope is that these contributions will keep finding new homes in unexpected places, inspiring and supporting others in their journeys.

To celebrate this milestone, I’ve published a summary and a list of all these cited works on GitHub. So, as I continue to create and share, I encourage others to do the same. Follow your curiosity, embrace your passions, and put your work out there. You never know the lasting impact it could have on people and projects you never imagined!

GitHub: https://github.com/ericblue/my-cited-works

Firefox Scrapbook Hacks – Viewing and Saving Webpages from Anywhere!

ericblue — Sun, 03 Apr 2011 20:12:00 +0000

This weekend I decided to wrap up a couple cool knowledge management “hacks” and share some code on GitHub. I primarily use the Firefox Scrapbook plugin to save all web pages of interest and use it as a general “digital snippet” repository. Since I started using Scrapbook in 2006 there have been a number of online services that have come along to offer this functionality (namely Evernote, Zotero, and countless others). Some of these services make it very easy to universally access and save webpages between multiple devices. As part of my usual DIY philosophy, I’ve made an effort to stick with Scrapbook and build the missing features myself. This is in large part due to data ownership (it’s my data and I don’t want to be tied to a single service/company), plus it’s fun to tinker and make these useful “hacks”.
In Dec ’09 I shared a blog post about how to synchronize the scrapbook data between multiple computers. This was the first major step to sharing data between multiple devices, but still lacked some of the ubiquity that I desired. In a nutshell I’ve made 2 major enhancements to Scrapbook:

An email ‘bridge’ to Scrapbook so I can email links from any device (PC, iPhone, iPad) and have them saved by Scrapbook
A centralized web-interface to browse/search/filter my scrapbook data.

I’ll start off with the less visually stunning hack (the email bridge), which is by far the craftier of the two.
Hack #1 – Scrapbook Email Interface
When I began synchronizing my Scrapbook data between my 2-3 computers, it solved a huge problem: being able to save webpages from anywhere. Since 2009 a lot has changed, and devices like the iPhone and iPad (yes, I’m an Apple fanboy to a degree) have changed the way we consume news. Recently I’ve been using apps on the iPad like Zite and Flipboard to consolidate my Twitter, Facebook, and Google Reader feeds into a single personalized newspaper. This means that now > 50% of my reading time is spent on a device that has no visibility into my Scrapbook data. I simply wanted a way to email a link (a feature built natively into these apps) and have it automagically saved into my Scrapbook folder. I could have cut corners and written a script to hand-edit the Scrapbook RDF files and save the web page using something like wget or curl. But it just wouldn’t be the same… I want the webpage saved EXACTLY as Firefox would normally render and save it.
This poses a bit of a technical challenge, since Scrapbook runs inside Firefox and there’s no native way to interface with a plugin running inside a browser. After researching a number of approaches, I came across 2 Firefox plugins that let you build interfaces inside Firefox (HTTP, telnet, etc.) that actually let you control the browser and execute JavaScript. Of the 2 plugins, POW and MozRepl, I decided to go with POW (Plain Old Webserver). Both plugins are wicked cool in the sense that they’re non-traditional and very powerful. POW runs a webserver inside Firefox and lets you run your ‘server-side’ scripts as JavaScript. I’ve basically written a server process that runs INSIDE the client and executes XPCOM/JavaScript to control the web browser windows and invoke the Scrapbook plugin API directly.
The setup process is simple:

Set up and install the POW and Scrapbook plugins in your browser
Configure POW to run on a desired port and create a new directory /scrapbook/
Copy index.sjs (server-side JavaScript) to this new /scrapbook/ directory
Set up a new mailbox or alias (e.g. yourusername+scrapbook@gmail.com)
Either run scrapbook2email.pl manually or run it as a cron job every couple of minutes
Simply send emails to your new Scrapbook address, run the email script, and watch your pages be saved automatically

At a high-level this is accomplished with 2 scripts:
Email Interface script (Perl)
This script uses IMAP to retrieve scrapbook email requests from a designated folder. Along with doing basic sender/recipient validation, the script also handles both plain text and multipart messages. Once the email request is parsed, the script extracts the link to the web page to be saved. It then contacts the POW server and passes along the requested URL (e.g. http://127.0.0.1:6670/scrapbook/?url=http://yourwebpagetobesaved.com/?articleID=3q4e3332). Note that this version of the script requires that Firefox/POW be running and makes no attempt to launch it for you.
For a copy of the script click here (GitHub).
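For readers who don’t speak Perl, here is a rough Python sketch of the two core steps described above: pulling the first link out of a plain text or multipart email, and turning it into a request for the POW bridge. This is not the actual script. The names first_link and pow_request_url are hypothetical, the IMAP retrieval and sender validation are left out, and percent-encoding the target URL is my own assumption (the bridge on the other end would need to decode it).

```python
import re
from email import message_from_string
from urllib.parse import quote

# Assumption: POW listens on the address used in the example URL above.
POW_ENDPOINT = "http://127.0.0.1:6670/scrapbook/"

# Deliberately simple link matcher: stops at whitespace, quotes and angle brackets.
URL_PATTERN = re.compile(r"https?://[^\s<>\"']+")


def first_link(raw_email):
    """Return the first http(s) link found in the text parts of an email, or None."""
    message = message_from_string(raw_email)
    for part in message.walk():
        # Multipart containers and attachments are skipped; only text parts are searched.
        if part.get_content_maintype() != "text":
            continue
        payload = part.get_payload(decode=True)
        if payload is None:
            continue
        charset = part.get_content_charset() or "utf-8"
        match = URL_PATTERN.search(payload.decode(charset, errors="replace"))
        if match:
            return match.group(0)
    return None


def pow_request_url(page_url):
    """Build the URL that asks the POW bridge to capture page_url."""
    # Percent-encoding keeps the target's own query string from being split off.
    return POW_ENDPOINT + "?url=" + quote(page_url, safe="")
```

In a full version, each message fetched over IMAP would be passed to first_link, and the result of pow_request_url would be requested with any HTTP client while Firefox/POW is running.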
Scrapbook/POW Bridge (Server-Side Javascript)
This script does the heavy lifting, and is essentially running at the other end of the POW server URL (http://127.0.0.1:6670/scrapbook/). Once the requested URL is detected the browser will spawn a new tab, automatically execute the Scrapbook Capture request, and save the webpage to a new top-level folder (e.g. Unfiled/MM-DD-YYYY). This script was tested with Scrapbook v.1.3.7.
For a copy of the script click here (GitHub).
It’s nifty now to email a link to my Scrapbook Bot and within a couple of minutes see a little notification popup in Firefox indicating my page was saved.
Hack #2 – Scrapbook Browser
This code was actually written back in Dec ’09 after I wrote the synchronize blog post (and around the time I wrote the Document Viewer); however, I haven’t shared it until now. What I’ve done is write a simple Perl/jQuery web app that uses Simile’s Exhibit to view Scrapbook data in a tile, table, or timeline. This interface also has a file/folder view so you can browse snippets just like you can through the native Scrapbook plugin interface within Firefox.
Here are some screenshots:
Tile View

Timeline View

Table View

Folder View

To download the code click here (GitHub).

Learning Faster – Automatically Extract Highlighted Text from PDF Documents

ericblue — Fri, 17 Dec 2010 07:46:07 +0000

Overview
I never really considered myself a “highlighter” until a couple years ago. Back in school I would, on occasion, highlight some interesting passages while doing homework or reading books and jot them down later. More often than not, though, many of those highlights would go to waste. After all, what good is highlighting interesting bits of text if you don’t use them later? My highlight compulsion increased about 6 years ago when I dove head first into mindmapping and started experimenting with a technique called MMOST (Mind Map Organic Study Technique). In a nutshell, MMOST is a strategy for quickly digesting books and summarizing what you’ve learned into a mindmap so you can recall or reference it at a later date. For a great intro to the MMOST technique, check out the post on How to Understand a Business Book in Four Hours. What does highlighting have to do with MMOST? While I’m reading a book I’ll highlight the passages that stick out to me and use those as the basis for creating the mindmap summary. It can take a lot of time, but the process of highlighting, reviewing, and creating the mindmap can significantly improve your recall and what you get out of a book (or any research project).
Another big change happened earlier this year when I started using an iPad. I’ve been gradually accumulating more digital books (using PDFs and purchasing books through Amazon using Kindle). After using Kindle for a short time I was blown away by the feature that lets you highlight book passages and get summaries of the highlighted text and page number (the direct URL is http://kindle.amazon.com/your_highlights). This is REALLY useful for accelerating the summarizing process and the beauty of it is that it’s automatic – the extraction just works! Around the time I started using Kindle for iPad I discovered a fantastic PDF document reader called GoodReader.

GoodReader is a full-featured document reader with some powerful features. Not only can you take all of your documents on the go, you can access them remotely using WebDAV, Google Docs, Dropbox, email, and other online services. Starting a couple months ago it got even better by supporting PDF highlighting and annotations. I thought to myself, “Hey, it would be great if I could somehow extract all my highlighted text just like Kindle. I could TRIPLE the number of books I read and create summaries for almost all of them!” It turns out this IS possible, but it is nowhere near as simple as I initially hoped. I dove down the deep rabbit hole of reviewing the ~ 1,000 page Adobe PDF specification, hacked and tinkered with Perl and Java code, reviewed numerous open source and commercial offerings, and have emerged (slightly scathed but wiser) with some good solutions.
The Challenge
I won’t get into the nitty-gritty details here, but what would seem a simple operation of extracting highlighted text from a PDF turns out to be exceedingly difficult depending on what strategy you use. In fact, as near as I can tell, there is no existing open source or commercial solution that can reliably extract the text with 100% accuracy from all documents. The main challenge with PDF is that it isn’t a markup language like HTML that will explicitly tell you how text should be rendered. For example:

This is an example sentence that I would like to highlight.

The PDF format, while parsable, uses concepts like dictionaries, objects, streams and coordinate systems that tell PDF readers how to correctly render the doc. What this means is that things like annotations (notes) and highlights are rendered separately from the text itself. The best way to visualize this is to think of the highlighted PDF as having 2 distinct layers: the top layer is the highlight itself and the bottom layer is the text. The straightforward strategy is to simply say: “Find the X,Y coordinates of the region of highlight, then find the X,Y coordinates of all text in that same region and simply copy it”. Well, the unfortunate complexity is that in order to find the coordinates of the text you also have to take into consideration the font type and size of the font. After many hours of hacking with only minimal success, I’ve concluded that this method is not currently possible without a lot of additional coding. And, unless somebody can point me in the right direction, I haven’t found any open source or commercial offerings that do this. OK, so you’re probably wondering why I’ve made you read this much of the post only to tell you it’s not technically possible. It is possible, just using a slightly different method.
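To make the “two layers” idea concrete, here is a small Python sketch of the straightforward coordinate strategy described above. It assumes the hard part is already done: that you somehow have a bounding box for every word on the page, which is exactly where font type and size come into play. The Box type, the words_in_highlight name and the sample coordinates are all hypothetical and not taken from any PDF library.

```python
from typing import NamedTuple


class Box(NamedTuple):
    """Axis-aligned rectangle in PDF user space (origin at the bottom-left)."""
    x0: float
    y0: float
    x1: float
    y1: float


def words_in_highlight(highlight, words):
    """Return the words whose box centre lies inside the highlight rectangle.

    `words` is a list of (text, Box) pairs in reading order.
    """
    selected = []
    for text, box in words:
        cx = (box.x0 + box.x1) / 2
        cy = (box.y0 + box.y1) / 2
        if highlight.x0 <= cx <= highlight.x1 and highlight.y0 <= cy <= highlight.y1:
            selected.append(text)
    return " ".join(selected)


# Made-up word boxes for part of one line of text, plus a word on the line below.
words = [
    ("This", Box(10, 700, 30, 712)),
    ("is", Box(34, 700, 42, 712)),
    ("an", Box(46, 700, 58, 712)),
    ("example", Box(62, 700, 100, 712)),
    ("below", Box(10, 680, 40, 692)),
]
print(words_in_highlight(Box(32, 698, 102, 714), words))
```

The overlap test itself is trivial. Everything difficult is hidden in producing accurate word boxes, which is why this strategy turned out to be so much work in practice.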
The Solutions
It turns out that you can automatically extract the highlight with 100% accuracy, but there is a caveat that requires a little more manual work. It sounds much more painful than it really is. The trick is to not only highlight the passage of text, but also copy the text and paste it as an annotation (note) on top of the highlight. For GoodReader it’s simply a matter of a couple extra clicks. And for people who use Adobe Acrobat or Acrobat Reader, there is an option in most versions to automatically copy/paste text into a note whenever you select text to highlight (Go to Settings -> Commenting Preferences -> “Copy selected text into Highlight, Cross-Out, and Underline comment pop-ups.”). Here’s how you accomplish this using GoodReader as of v3.2.0:

Select the text you would like to highlight and select Copy. As soon as you click Copy, the menu option above the text will remain.
Next select the Highlight option. At this point the text will now be highlighted.
Tap the highlighted text and select the Open option. A note dialogue will appear.
Hold down for 2 seconds on the note until the Paste option appears and select it. Click Save.

Basically 6 quick clicks/taps and you’re done. It’s not ideal, but certainly a good trade-off if it means you get to extract automatically and have 100% reliability. Now, there are a couple options for easily extracting your highlights.
Option 1 – Use a PDF Reader to create highlight summaries
If you have the money, Adobe Acrobat has many features that let you view and print all of your annotations (notes, highlights, etc.). Although it isn’t prohibitively expensive, most people (myself included) don’t really want to spend money if they can find a comparable free or open source solution. Adobe Acrobat Reader (the free version most people use) does allow you to view the highlights in a summary pane, but doesn’t allow you to extract and print them. (You’ll notice that if you don’t create the annotated note with your highlight, the entry will show blank.) The best free PDF viewer that I experimented with is Foxit Reader, and it allows you to easily create a PDF summary of your highlights. Simply go to Comments -> Summary Comments and you’ll be prompted to save a new PDF file that only contains the highlighted text along with the page number.

Option 2 – Programmatically extract highlights
For those inclined to hack, there are a couple open source options for parsing PDF files. I first started experimenting with a great Perl module called CAM::PDF. After a few weekends of tinkering around and subsequently needing to dig into the official Adobe PDF specification, I realized how complicated PDF parsing, rendering, and text extraction can be. CAM::PDF does make it easy to parse the overall structure of the document and extract text for an entire page, but it is very difficult to extract text at exact coordinates (for a number of technical reasons). At this point I was still trying to solve the problem with the original strategy of extracting text by x,y coordinates, and after researching for countless hours I realized my open source options were limited. My next step was to experiment with PDFBox, an Apache open source Java PDF library. After some searching I was very excited to at least scratch the surface and get preliminary results of text extraction based on the highlight x,y coordinates. I soon discovered that needing to take the font style, orientation, and spacing into consideration to grab the exact text would prove to be time consuming. I haven’t yet found other examples, or reached out on the mailing list, but I’m sure with sufficient determination and time this could be done. Not wanting to devote this amount of time right now to solve this problem, I opted to go for the pragmatic solution of saving the note and extracting that. For those interested, I’ve attached some very simple test code that will extract the annotated comment, and I’ve included commented-out code for doing very basic (and not yet accurate) extraction based on region/coordinates. When I have more time I may make this a standalone executable so you can run it from the command-line and bulk extract highlights from multiple documents:
[codesyntax lang="java"]
import java.awt.geom.Rectangle2D;
import java.io.File;
import java.util.List;

import org.apache.pdfbox.pdmodel.PDDocument;
import org.apache.pdfbox.pdmodel.PDPage;
import org.apache.pdfbox.pdmodel.common.PDRectangle;
import org.apache.pdfbox.pdmodel.interactive.annotation.PDAnnotation;
import org.apache.pdfbox.util.PDFTextStripperByArea;

public class ExtractHighlights {

    public static void main(String[] args) {
        try {
            PDDocument pddDocument = PDDocument.load(new File("sample.pdf"));
            List allPages = pddDocument.getDocumentCatalog().getAllPages();

            for (int i = 0; i < allPages.size(); i++) {
                int pageNum = i + 1;
                PDPage page = (PDPage) allPages.get(i);
                List la = page.getAnnotations();

                // Skip pages without any annotations
                if (la.size() < 1) {
                    continue;
                }

                System.out.println("Total annotations = " + la.size());
                System.out.println("\nProcess Page " + pageNum + "...");

                // Just get the first annotation for testing
                // (the cast is needed because the list is untyped)
                PDAnnotation pdfAnnot = (PDAnnotation) la.get(0);
                System.out.println("Annot type = " + pdfAnnot.getSubtype());
                System.out.println("Modified date = " + pdfAnnot.getModifiedDate());
                System.out.println("Rectangle = " + pdfAnnot.getRectangle());

                // Sample code taken from Canoo unit test - extractAnnotations
                // See https://svn.canoo.com/trunk/webtest/src/main/java/com/canoo/webtest/plugins/pdftest/htmlunit/pdfbox/PdfBoxPDFPage.java
                // Experimental - Not completely working since rectangle doesn't take font size/spacing into account
                // PDFTextStripperByArea stripper = new PDFTextStripperByArea();
                // stripper.setSortByPosition(true);
                //
                // PDRectangle rect = pdfAnnot.getRectangle();
                // float x = rect.getLowerLeftX() - 1;
                // float y = rect.getUpperRightY() - 1;
                // float width = rect.getWidth() + 2;
                // float height = rect.getHeight() + rect.getHeight() / 4;
                // int rotation = page.findRotation();
                // if (rotation == 0) {
                //     PDRectangle pageSize = page.findMediaBox();
                //     y = pageSize.getHeight() - y;
                // }
                //
                // Rectangle2D.Float awtRect = new Rectangle2D.Float(x, y, width, height);
                // stripper.addRegion(Integer.toString(0), awtRect);
                // stripper.extractRegions(page);
                //
                // System.out.println("Getting text from region = " + awtRect + "\n");
                // System.out.println(stripper.getTextForRegion(Integer.toString(0)));

                // The pragmatic route: read the text that was pasted into the note
                System.out.println("Getting text from comment = " + pdfAnnot.getContents());
            }

            pddDocument.close();
        } catch (Exception ex) {
            ex.printStackTrace();
        }
    }
}
[/codesyntax]
Of all the APIs I reviewed PDFBox appears to be one of the best: enumerating through the annotations is easy, extracting the note is just as simple, and the basic API is there to extract highlights with no need for the note (just be prepared to dig in and do some work). I also spent some time researching Adobe’s Javascript API and saw some forum posts where a person had mentioned they wrote a JavaScript plugin for Adobe Acrobat Reader that extracted the highlight without the need for the notes. However, I could not find a working example. With further research I’m sure this could be another option.
For the short-term, my practical solution is to use Foxit Reader to create the highlight summaries. Foxit works under Wine (Linux) and I’ve been able to share my GoodReader docs over WiFi and mount that GoodReader share as a WebDAV folder. This means that once I’m done reading and highlighting a PDF I can easily open it up in Foxit Reader without needing to copy anything, generate the highlight summary, and save it back to my Documents folder. Longer-term I’ll probably elaborate on the PDFBox code and write a program to automatically extract the highlights and save them as text, XML, or HTML.
Other Links of Interest

My PDF Bookmarks from Del.icio.us (TONS of good links found during research)
Python – Scrape Highlighted (Not portable, but uses a combo of Python, AppleScript and SkimPDF for Mac)
Python – PDF Miner
PDF Can Opener (Inspects PDF docs)
Acrobat Exhibit Highlighter (Some highlight tools using Javascript to enhance Acrobat)
Topic Grazer (Windows – helps with text extraction)

Happy Highlighting!

Example Document Browser Code

ericblue — Fri, 12 Feb 2010 18:51:56 +0000

Since I posted my article last month on How To Create Your Own Personal Document Viewer, I’ve had a few inquiries on how people could have a similar setup themselves. I thought it might be helpful to .zip up the docbrowser project and show some of the code that does the conversions using the utilities I illustrated in the article. Disclaimer: This code is by no means my finest work (it was hacked together on a Sat. afternoon), but it gets the job done. At a high-level the code is very simple:

Determine the doc extension and perform the appropriate conversion (.doc, .pdf, .xls) or redirect to an external app (mindmapviewer or Google Books)
Assign conversion commands to be executed for each doc type
Before displaying a doc, look up the converted doc in the cache to speed up render time (using an MD5 hash of the title)
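
The three steps above can be sketched roughly as follows. This is a Python illustration rather than the CGI script in the .zip. The tool names come from the earlier article, but pairing each with an output extension, the function names (cache_path, plan_conversion), and using the file name as the “title” are assumptions made for the example. The redirect to an external app is left out.

```python
import hashlib
import os

# Tool names come from the article; pairing each one with an output
# extension is an assumption made for this sketch.
CONVERTERS = {
    ".pdf": ("pdf2swf", ".swf"),
    ".doc": ("wvHtml", ".html"),
    ".xls": ("xlHtml", ".html"),
    ".rtf": ("UnRTF", ".html"),
}


def cache_path(title, out_ext, cache_dir):
    """Name the cached file after the MD5 hash of the document title."""
    digest = hashlib.md5(title.encode("utf-8")).hexdigest()
    return os.path.join(cache_dir, digest + out_ext)


def plan_conversion(doc_path, cache_dir):
    """Return (tool, cached_file, needs_conversion) for a document.

    Raises ValueError for extensions without a registered converter.
    """
    title = os.path.basename(doc_path)
    ext = os.path.splitext(title)[1].lower()
    if ext not in CONVERTERS:
        raise ValueError("no converter registered for " + repr(ext))
    tool, out_ext = CONVERTERS[ext]
    cached = cache_path(title, out_ext, cache_dir)
    # Only convert when no cached copy exists yet.
    return tool, cached, not os.path.exists(cached)
```

One design caveat: hashing only the title means two different documents with the same file name would share a cache entry, so hashing the full path (or the file contents) may be safer for a larger collection.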

I’ve created a .zip file (4.1 MB) of the entire Doc Browser sample code. It contains the simple CGI conversion script, along with jQueryFileTree for rendering the doc tree, FlexPaper, and some sample documents.

How To Create Your Own Personal Document Viewer (Like Scribd or Google Books)

ericblue — Sun, 03 Jan 2010 07:35:21 +0000

Overview
Like most people, I have a large number of personal documents in a variety of formats (PDF, Excel, Word, RTF, PowerPoint, etc.). For the typical user, organizing these documents in a ‘My Documents’ folder and having MS Office/Open Office/Adobe Acrobat installed simply gets the job done. However, I’ve been looking for some sort of “Web 2.0” solution to view my documents while I’m on the go. And, since my knowledge manager is web-based, I’d like a way to browse and embed personal documents directly in my wiki without needing any special software.
I’ve been impressed with services like Scribd (think YouTube for Documents). Most people have probably already used Scribd, but in case you haven’t, this service allows you to upload your documents (a variety of formats are supported) and view them online in Flash format. The beauty of this service is that you can also share documents and embed them directly inside your website/blog/wiki. While this works great for sharing certain types of documents, it’s not really appropriate for uploading my entire collection of documents (especially since many contain personal information). So, I decided to figure out how to create my own hosted document/book viewer like Scribd or Google Books.
Example
The following embedded document browser was actually fairly straightforward to make. In a nutshell, the viewer takes a PDF file that is converted to Flash (using SWFTools – pdf2swf), and then uses an open source Flash viewer called FlexPaper to help with navigation.

The navigation bar is fairly straightforward. You can page up/down, go directly to a given page, zoom, print, and even select a thumbnail mode. It does currently lack the ability to view full screen, ~~search~~ (Search was JUST added to version 1.1) or select text, so I created an additional option to view in HTML (using wvHtml) and view the frame full screen.
Open Source To The Rescue
When I first started exploring ways to view all my docs in a web interface, I didn’t initially focus on Flash. I figured it would be too difficult to have the end product look like Scribd (I was way wrong). So, I evaluated a number of Linux command-line utilities to convert documents on the fly. The following is a decent list of applications that can help with any of your conversion needs:

wvWare – A library for converting Word docs. The utility I used most was wvHtml to convert from .doc directly to .html.
xlHtml – Converts Excel spreadsheets to HTML.
PDFtoHtml – Converts PDF documents to HTML
UnRTF – Converts RTF to text or HTML
SWFTools – A collection of utilities to generate and work with SWF (Flash) files

There are apparently some ways to convert between various formats using Open Office on the command-line (e.g. JODConvert, PyODConverter, Unoconv, etc.). However, I haven’t yet spent time evaluating these approaches since my current setup seems to be working pretty well.
DocBrowser Project

I put up a very preliminary Document Browser prototype at http://eric-blue.com/projects/docbrowser/. The interface uses jQuery and jQueryFileTree to make the entire document folder available for browsing just like Windows Explorer.
The doc viewer pane uses the Flash-based interface like the iFrame above for all .PDF docs. And, the conversion script will render the output in HTML according to the doc type (.doc, .xls, .rtf) using the tools listed above. I’ve even added support for Mind Manager mindmaps using my web-based mindmap viewer to do conversions into Freemind flash on the fly.
Overall, I’m happy with the end result. I’ve set up a customized version of the document browser to run on my personal web server at home. I can now successfully view my documents from my laptop while I’m on the road, and I’ve been able to embed documents directly in my wiki so I don’t have to spend time hunting for the right doc.
Other Interesting Links

Open source flash viewers –

FlexPaper and SWF Viewer/zViewer

PSView (Online viewer for PDF, Postscript, Word) – http://view.samurajdata.se/
Vuzit (Online document viewer) and API – http://vuzit.com/

Update: Sample code has been posted here http://eric-blue.com/2010/02/12/example-document-browser-code/

Knowledge To Go: Put Your Wiki On Your iPhone

ericblue — Mon, 14 Dec 2009 05:39:14 +0000

Building my own personal knowledge manager has been quite a journey. Over the last couple years I’ve taken a “piecemeal” approach and slowly built up the features of my system one component at a time. One major feature that has always been on my mind is data portability. Last week I wrote an article on how to sync your digital scrapbook between multiple computers and even sync to your wiki. This feature had me thinking about how I could take portability to the next level.
Being able to access your personal information/knowledge from multiple places is the ultimate realization of total information ubiquity. Being able to access all of your personal bookmarks, notes, contact information, journal entries, and research data from any computer is obviously useful. Being able to access all of your personal knowledge from a handheld device like an iPhone is absolutely exciting! Without sounding totally nostalgic, this type of portability is in large part a modern-day realization of what Vannevar Bush had envisioned in his article on the Memex (“As We May Think”).

“Consider a future device for individual use, which is a sort of mechanized private file and library. It needs a name, and, to coin one at random, “memex” will do. A memex is a device in which an individual stores all his books, records, and communications, and which is mechanized so that it may be consulted with exceeding speed and flexibility. It is an enlarged intimate supplement to his memory.”

Currently, my personal wiki and other research data (bookmarks, pdfs, mindmaps, etc) are stored on my private server (accessible only behind my firewall). Unless I enable SSH access, my wiki content is not generally available from the Internet and I have no way to easily access remotely. I’ve been thinking for a while on the best approach for making this data completely portable. After some experimentation, I’ve found an easy method for making my personal wiki completely accessible in offline mode right on my iPhone. At a high-level all you really need to do are 2 things:

1) Find software that can take a snapshot of your wiki content and make it available for offline viewing
2) Find software that lets you save a copy of your snapshot wiki, store on the iPhone, and view in a web browser (actually both on the phone itself and another PC)

Creating a backup of your wiki
There are a lot of applications out there that act as ‘spiders’ that crawl your website and save local copies of your pages so you can view them in offline mode (no need for an Internet connection). After trying a handful, one of the better applications I tested was HTTrack (available for Windows and Linux). I should note that I really did try to make this work with Scrapbook. To date I’ve used Scrapbook to capture copies of pages with no problems. However, it turns out that backing up a wiki pushes it to its limit… Scrapbook only does one serial HTTP connection at a time, doesn’t have a configurable delay between requests (the default is 1 sec and this takes too long), its filtering options are not extensive enough, and the process of dynamically updating the HTML to support relative links took way too long. In the end, HTTrack ended up being the best solution for a complete wiki backup.
HTTrack is a highly configurable crawler that allows you to create a complete snapshot of your wiki (Mediawiki in my case). Crawling a wiki turns out to be a little more complicated than your typical website. Because wikis offer a number of functions (editing of pages, viewing history, printing and exporting in other formats) there are certain links that should not be included in the backup. After some trial and error, I discovered that since I used Semantic Mediawiki I needed to be even more careful with the links I wanted to include (many of the Special and Property pages took FOREVER to index).
I tried the Windows version of HTTrack (even under Wine on Linux) and the web client version as well. However, I was not completely impressed with how it worked. What I wanted was a command-line script to run the backup. Luckily, I found a couple websites that have used HTTrack for this purpose and decided to adapt their approach for my own needs. Here is a copy of the script I used to create the offline snapshot of my wiki:
[codesyntax lang=”text”]
#! /bin/sh
# Inspired by blogpost from http://www-public.it-sudparis.eu/~berger_o/weblog/2008/05/30/offline-backup-mediawiki-with-httrack/
# -w mirror web sites (--mirror)
# -O backup directory
# -%P extended parsing, attempt to parse all links, even in unknown tags or Javascript (%P0 don't use) (--extended-parsing[=N])
# -N0 Saves files like in site Site-structure (default)
# -s0 follow robots.txt and meta robots tags (0=never,1=sometimes,* 2=always) (--robots[=N])
# -p7 Expert options, priority mode: 7 > get html files before, then treat other files
# -S Expert option, stay on the same directory
# -a Expert option, stay on the same address
# -K0 keep original links (e.g. http://www.adr/link) (K0 *relative link, K absolute links, K3 absolute URI links) (--keep-links[=N])
# -A25000 maximum transfer rate in bytes/seconds (1000=1kb/s max) (--max-rate[=N])
# -F user-agent field (-F "user-agent name") (--user-agent )
# -%s update hacks: various hacks to limit re-transfers when updating (identical size, bogus response..) (--updatehack)
# -x Build option, replace external html links by error pages
# -%x Build option, do not include any password for external password protected websites (%x0 include) (--no-passwords)
site=wiki:8080/memex
topurl=http://$site
backupdir=~/websites/memex
httrack -c4 -w $topurl/Special:Allpages \
-O "$backupdir" -%P -N0 -s0 -p7 -S -a -K0 \
-F "Mozilla/4.5 (compatible; HTTrack 3.0x; Windows 98)" \
-%s -x -%x \
"+*$site/index.php?*" \
"+*$site/mindmap*" \
"-*Special*" \
"-*Property*" \
"-$site/index.php?title=Property:*" \
"-$site/index.php?title=Special:*" \
"-*$site/Discussion:*" \
"-*$site/Help*" \
"-*/docs/*" \
"-*/wikifiles/*" \
"-*month=*&year=*" \
"-*action=edit" \
"-*action=formedit" \
"-*action=history" \
"-*printable=yes" \
"-*oldid=*" \
"+*$site/images/*" \
"+*.css" \
"+*.js"
[/codesyntax]
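The order of the +/- filters in the script matters. Here is a minimal sketch of why, assuming that later rules take priority over earlier ones (which is how HTTrack's filter documentation describes it). The function `filter_decision` and the page title `Main_Page` are hypothetical, and shell globbing only approximates HTTrack's own wildcard matching, so treat this as an illustration rather than a reimplementation:

```shell
#!/bin/sh
# Simplified emulation of HTTrack-style scan rules (an approximation, not
# HTTrack's real matcher): rules are checked in order and the LAST match wins.
# Prints keep, skip, or default (no rule matched; other options would decide).
filter_decision() {
    url=$1
    shift
    decision=default
    for rule in "$@"; do
        pattern=${rule#?}          # rule without its leading + or -
        sign=${rule%"$pattern"}    # the leading + or -
        case $url in
            $pattern)
                if [ "$sign" = "+" ]; then decision=keep; else decision=skip; fi
                ;;
        esac
    done
    printf '%s\n' "$decision"
}

site=wiki:8080/memex

# A normal page view matches only the "+" rule.
filter_decision "http://$site/index.php?title=Main_Page" \
    "+*$site/index.php?*" "-*action=edit"

# An edit link matches both rules; the later "-" rule overrides the "+" rule.
filter_decision "http://$site/index.php?title=Main_Page&action=edit" \
    "+*$site/index.php?*" "-*action=edit"
```

Under that assumption, the first call prints keep and the second prints skip. It is also why the broad index.php include comes first in the script and the images, CSS and JS includes come last.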
The best feature of HTTrack is that it will download all content, including JavaScript and Flash, and rewrite all links to make them relative. This way the entire website can be viewed offline and made portable. Overall, my entire wiki backup was ~30MB of markup and content (excluding all audio and video, of course). In the future I need to come up with a solution for exporting my mindmaps: since the iPhone does not yet support Flash, I’ll need some other way to allow for embedded viewing of my mindmap content. Anyway, at this point all of my content is ready for copying to my iPhone.
Storing my Wiki on my iPhone

OK, this is the really nifty part. The one thing I really missed about my old 60GB iPod Video was the ability to mount it over USB and use it just like an external hard drive. I used to haul around TONS of my data and could easily share it between Windows, Linux and Mac. Unfortunately, when the iPhone and iPod Touch came out, you could no longer mount the device and copy files (without hackery, of course). Luckily there are a number of apps that let you use your iPhone as a storage device. One of the BEST applications out there is an app called Air Sharing.

With Air Sharing, you can:

Mount your iPhone or iPod touch as a wireless drive on a Mac, Windows, or Linux computer, over Wi-Fi, or connect from your computer’s web browser.
Drag-drop files between your iPhone or iPod touch and your computers.
View documents in many common formats.

What’s really useful is that you can mount your iPhone using WebDAV and transfer files just like a regular drive. The incredibly cool bonus is that you can also access your content from another computer. If you’re connected to the same Wi-Fi network, you can use any PC to browse (e.g. http://iphone-local:8080/wiki/) and access your content just like it was on the original server. For an added layer of security, while you’re on the go you can set up an ad hoc wireless network and connect privately between your computer and the iPhone to access your personal knowledge base.
Of course, accessing the content on your iPhone from another PC is an added bonus. The real power in this solution is the ability to browse your wiki on the iPhone without needing any Internet access (3G or Wi-Fi). Simply open up the Air Sharing app, browse directly to your wiki folder and click on index.html. Voilà! You’re browsing your personal wiki just like usual.
I exported the majority of the text content from my wiki (preserving the original formatting, with JavaScript support). In fact, I even shared my digital scrapbook that I blogged about last week. You can also choose to export your entire document collection and multimedia files (video, MP3s, etc.). This is incredibly useful for taking your knowledge on the go and having all of your data RIGHT at your fingertips. Here are some screenshots of my personal knowledge manager wiki right on my iPhone:
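Once the phone is mounted as a drive, copying the snapshot is an ordinary file copy. The following is only a sketch: `copy_mirror` is a hypothetical helper, and the mount point in the usage comment is an assumption that will differ depending on how your operating system mounts WebDAV shares.

```shell
#!/bin/sh
# Copy an HTTrack mirror onto a mounted drive (for example a WebDAV mount).
# Both paths are supplied by the caller; nothing here is specific to any app.
copy_mirror() {
    src=$1
    dest=$2
    # Refuse to copy a directory that does not look like a finished mirror.
    if [ ! -f "$src/index.html" ]; then
        echo "no index.html in $src; is this an HTTrack mirror?" >&2
        return 1
    fi
    # "$src/." copies the contents of src (hidden files included) into dest.
    mkdir -p "$dest" && cp -R "$src/." "$dest/"
}

# Hypothetical usage; /mnt/iphone is an assumed mount point:
# copy_mirror ~/websites/memex /mnt/iphone/wiki
```

Checking for index.html first is a cheap guard against copying an empty or half-finished mirror onto the device.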
All Articles

Workout Journal

Learning

Documents

How to Synchronize Your Digital Scrapbook

ericblue — Mon, 07 Dec 2009 07:18:23 +0000

I had originally planned on calling this article ‘How to Use Cloud Computing to Synchronize Your Digital Scrapbook For Research and Integrate Into Your Personal Knowledge Management Wiki for Extra Credit’, but I figured that would be a bit too much. Luckily I am going to give info on how to do both of these things so stay with me!
Background
For my own personal knowledge management setup, I’m very interested in tracking a number of different ‘things’:
* Documents – PDFs, word documents, mindmaps, etc.
* Notes – Journal entries, book summaries, personal notes (think wiki text)
* Links – Bookmarks (personal or social sites like del.icio.us)
* Multimedia – Audio / Video
* Snippets – Captured web pages (full or partially snipped content)
When I first mentioned my ‘Digital Scrapbook’, I wasn’t dropping any hints about having crafty hobbies. I generally refer to my system for storing Snippets as my Scrapbook. This name is no doubt in large part due to the fact that I’ve been using the popular Firefox plugin ScrapBook to manage my digital snippets for a few years now.

ScrapBook is a fantastic solution for storing local copies of web pages for research (with highlighting, editing, and annotation), saving snips of important sections of sites, recording purchase confirmations or receipts, and saving your travel itineraries. One major thing it has been lacking, though, is the ability to synchronize or share the Scrapbook with other computers. I use multiple computers (a couple of laptops, Mac and Windows, and a central Linux desktop), so my goal is to have consistent and up-to-date data across all systems. And, up until now, I’ve had no way to integrate this saved data into my wiki-based knowledge management system.
I started investigating a solution for this a number of months ago and stumbled across a related (and powerful) research tool called Zotero. I haven’t had a chance to use Zotero in depth, but one new feature in the beta version that stuck out to me was the ability to synchronize your data with a remote server. On the surface this feature looks good (and probably is for most people: data syncs to the Zotero server, with WebDAV support for documents), but I was looking for a solution where I have more control over where the data is hosted. Although I’m usually not concerned with hosting my data with most providers, I often save private financial information in my Scrapbook (credit reports, financial statements, account numbers, etc.), so I’d like to have control over where the data is saved and how it’s encrypted. Further research eventually sparked a few ideas for a solution.
Synchronizing and Sharing ScrapBook Data
I decided to explore a setup using file sharing/sync services after reading an article on syncing Scrapbook using Dropbox. I had never used Dropbox before, and after giving it a brief test drive it looked very promising. Hey, you get a 2GB account for free, so that’s definitely an added bonus! Although Dropbox has some killer features (a big one being an iPhone app to access your files), I opted to experiment with another sync service. I’ve been using JungleDisk for a couple of years as my Amazon S3-backed offsite backup solution, and was curious whether it could be used for this. After downloading the latest version (3.0.2 for Linux) I discovered that it now supports file/directory synchronization between computers. After about 10-15 minutes of setup and file syncing I had a working solution between my laptop and desktop computers. Here’s what you’ll need to do:

Step 1: Download and install the latest version of the Scrapbook plugin for Firefox on your 1st computer. For a good quick intro/tutorial to Scrapbook, check out this video from Lifehacker.
Step 2: Set up an alternate Scrapbook location that resides outside of your Firefox profile directory (Preferences -> Organize -> Save data to).
Step 3: Set up your preferred sync solution and point it at the directory from Step 2. I preferred JungleDisk for my setup, but there are other services like Dropbox, Box.net, SugarSync, etc. Check out the Activity Owner wiki for a detailed list of sync services. And, although I haven’t personally tried it yet, I’m sure there are other non-hosted open source sync solutions like Unison (cross-platform) that could be used.
Step 4: For your 2nd (and any subsequent) computer, repeat steps 1 through 3.
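After the steps above, one quick sanity check is to compare two copies of the Scrapbook directory. This is only a sketch, assuming both copies are reachable from one machine (for example over a network share); `scrapbook_in_sync` is a hypothetical helper and the paths in the usage comment are made up.

```shell
#!/bin/sh
# Report whether two Scrapbook data directories have identical contents.
# Returns 0 when they match; otherwise prints the differing paths and
# returns non-zero. Relies on diff -q, which GNU and BSD diff both support.
scrapbook_in_sync() {
    diff -r -q "$1" "$2"
}

# Hypothetical usage; both paths are assumptions:
# scrapbook_in_sync ~/sync/scrapbook /mnt/laptop/sync/scrapbook && echo "in sync"
```

This does not replace the sync service. It only confirms that the directory each Firefox install points at ended up with the same files.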

Wiki Integration (Extra Credit)
OK, for me this was the icing on the cake. Since my Scrapbook data is now on the same computer as my wiki, I thought it would be nifty to somehow integrate it directly into some of my wiki pages. I found out that Scrapbook can export your Scrapbook hierarchy as a tree in HTML (from the Scrapbook Sidebar: Tools -> Output Tree as HTML). Although this isn’t completely automatic (yet), it gave me the content I needed to add to my wiki. Now, since wikis by their very nature don’t typically allow you to embed other HTML pages, I needed to find a way to make this work.

Step 1: Set up a directory on your webserver to serve content from your Scrapbook directory (set up in Step 2 above) (e.g. http://yourwebsite/scrapbook). This can be on the same server as your wiki or on another one; it doesn’t really matter.
Step 2: Verify the output of the directory tree looks good. If you enabled frames, the URL should be something like http://yourwebsite/scrapbook/tree/frame.html.
Step 3: For MediaWiki users there are various ways to directly embed pages in your wiki content. I found that the AnySite extension did the trick for me. Enable the extension, pick a wiki page where you want to display your ScrapBook data and you are set! For example, here is my content:
* Link to [http://wiki:8080/wikifiles/scrapbook/tree/frame.html ScrapBook Tree]
http://wiki:8080/wikifiles/scrapbook/tree/frame.html
[[Category:Documents]]
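For Step 1, one simple way to expose the Scrapbook directory is a symlink inside a directory your web server already serves. This is a minimal sketch under stated assumptions: `publish_scrapbook` is a hypothetical helper, the paths in the usage comment are made up, and the web server must be configured to follow symlinks (on Apache, the FollowSymLinks option).

```shell
#!/bin/sh
# Expose a Scrapbook data directory under a web server's document root
# by symlinking it, after checking that the exported tree exists.
publish_scrapbook() {
    scrapbook_dir=$1   # directory chosen in ScrapBook's "Save data to" setting
    web_dir=$2         # directory your web server already serves (assumed)
    # tree/frame.html is only present after exporting the tree with frames.
    if [ ! -f "$scrapbook_dir/tree/frame.html" ]; then
        echo "tree/frame.html not found; export the tree as HTML first" >&2
        return 1
    fi
    # Leave an existing scrapbook entry alone instead of overwriting it.
    [ -e "$web_dir/scrapbook" ] || ln -s "$scrapbook_dir" "$web_dir/scrapbook"
}

# Hypothetical usage; both paths are assumptions:
# publish_scrapbook ~/sync/scrapbook /var/www/wikifiles
```

Because the link points at the synced directory, anything the sync service pulls in shows up on the web server without a separate copy step. Only the HTML tree export still has to be re-run by hand.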

Total Recall, Personal Informatics and Life Logging

ericblue — Mon, 19 Oct 2009 05:38:51 +0000

It’s been a little while since my last post, and I figured it was time to get back into my blogging groove. I recently came across a few interesting links that I thought I would share. The two topics I want to discuss are Personal Informatics and Life Logging. I found it fascinating that both of these topics, complex and mysterious sounding on their own, are very much related to my primary research project: My Personal Memex.
For those not familiar with the concept of the Memex:

“Consider a future device for individual use, which is a sort of mechanized private file and library. It needs a name, and, to coin one at random, “memex” will do. A memex is a device in which an individual stores all his books, records, and communications, and which is mechanized so that it may be consulted with exceeding speed and flexibility. It is an enlarged intimate supplement to his memory.” — Vannevar Bush (1945)

Total Recall
Fast forward sixty-four years, and we finally have the technology (hardware and software) to make the Memex a reality. My project has been primarily focused on fulfilling a small portion of the original idea, and has only scratched the surface. The Memex fully realized would be a system that completely (and automatically) digitizes experiences, memories, and interactions with the environment. The capability for Total Recall, offloading human memory to a digital space, is not too far away (think Cyborgs). Now for the fun!

Dante Cyborg from Flickr

Personal Informatics
According to Johnny Holland, Personal Informatics is:

“… characterized as the monitoring and displaying of information about our daily activities through intelligent devices, services and systems. This information allows us to see trends and opportunities for change that we would otherwise miss. With the rise in network and RFID technology we are pointing to a time where personal informatics can play an important role in our lives. If people can access this information about their daily routines, and interact with their own personal data currently invisible to them: would they make more informed decisions?”

The example in this category that I wanted to share is a new product called Fitbit. The Fitbit accurately tracks your calories burned, steps taken, distance traveled and sleep quality. The Fitbit contains a 3D motion sensor like the one found in the Nintendo Wii. The Fitbit tracks your motion in three dimensions and converts this into useful information about your daily activities.

You can wear the Fitbit on your waist, in your pocket or on undergarments. At night, you can wear the Fitbit clipped to the included wristband in order to track your sleep. Anytime you walk by the included wireless base station, data from your Fitbit is silently uploaded in the background to Fitbit.com.

Life Logging

A new camera promises to capture your whole life in digital form! For consumers, the gadget will provide an easy way to become a “lifelogger” – someone who attempts to electronically record as much of their life as possible. Microsoft researcher Gordon Bell has made his life an experiment in lifelogging, recording everything from phone calls to TV viewing, and uses a SenseCam wherever he goes.

A camera you can wear as a pendant to record every moment of your life will soon be launched by a UK-based firm.

Originally invented to help jog the memories of people with Alzheimer’s disease, it might one day be used by consumers to create “lifelogs” that archive their entire lives.

Worn on a cord around the neck, the camera takes pictures automatically as often as once every 30 seconds. It also uses an accelerometer and light sensors to snap an image when a person enters a new environment, and an infrared sensor to take one when it detects the body heat of a person in front of the wearer. It can fit 30,000 images onto its 1-gigabyte memory.

The ViconRevue was originally developed as the SenseCam by Microsoft Research Cambridge, UK, for researchers studying Alzheimer’s and other dementias. Studies showed that reviewing the events of the day using SenseCam photos could help some people improve long-term recall.

For an intriguing, in-depth article on Gordon Bell, check out the article on Fast Company from 2007.

Information Visualization Toolkits for Mind Mapping

ericblue — Fri, 05 Jun 2009 05:25:22 +0000

The other week, I wrote a blog post, The Visual Wiki: A New Metaphor For Knowledge Access and Management. Until I read the paper in depth, I hadn’t realized that it was about a project I had blogged about last year: Thinkbase – A Visual Semantic Wiki. In a nutshell:

“Thinkbase is a new way to navigate and explore information on the web. It is what we call a ‘Visual Wiki’. It is based on Freebase, an open, shared database of the world’s knowledge – in other words a Semantic Wiki. Thinkbase uses a visualization tool (Thinkmap) to create an interactive visual representation of the semantic relationships in Freebase.”

The other similar project mentioned in the research paper was ThinkPedia. While Thinkbase offers visual navigation for Freebase, ThinkPedia does the same for Wikipedia content. One of the sub-projects of my Personal Memex project intends to offer visual navigation in very much the same way as both of these applications. The engine that these projects use for visual navigation, Thinkmap, is VERY impressive. Unfortunately, it requires a commercial license (~5k), and in keeping with the spirit of my open source model, I need to find something that is free.
With that said, I’ve started to research various visualization toolkits/APIs that offer some type of visual navigation. This navigation is very mindmap-like or concept-map-like in nature. There are some variations: some are force-directed graphs while others are hyperbolic. My research is still very much underway, but I’ve been collecting my links and have assembled them into a mindmap. I’ve broken down the categories based on open source vs. commercial (for illustrative purposes), and platform (Java, Flash, or JavaScript).
Stay tuned on my progress in this area over the coming months since this will more than likely be my primary “pet technology project” for the summer.

Freebase Parallax: Set-based Browsing Interface

ericblue — Sat, 23 May 2009 16:02:36 +0000

I found a very interesting project from David François Huynh, developer of some impressive projects over at Simile. Parallax offers a new way to browse and explore data on Freebase, one of the largest open and shared (structured) databases of knowledge on the web.

Freebase Parallax: A new way to browse and explore data from David Huynh on Vimeo.

I also discovered a somewhat related research project at Stanford called Vispedia.