Fragment: Indexing Local Jupyter Notebooks for Search

It’s been some time since I last explored this (eg here and here), and as far as I know other solutions have appeared since, but a question still remains as to how to search effectively over a set of notebooks.

Partial alternative solutions that may be worth noting include:

  • nbscan for searching over notebooks from the command-line;
  • nbgallery, which bakes in Solr/sunspot (it’d be really nice if the nbgallery search tools could be easily decoupled so that search could be added, as an extension, to an arbitrary Jupyter notebook, or JupyterHub, server…);
  • this simple search engine with autocomplete by Simon Willison.

There is also the lunr-based search of Jupyter Book (related issue). (The more recent elasticlunr JavaScript search engine also looks interesting… perhaps even more so than lunr.js…)

[UPDATE: This is new to me, and I’ve not had a chance to try it: Find your Jupyter notebooks with ElasticSearch – elastic search recipe.]

One thing I have often wondered about when it comes to building a notebook search engine index is how to crawl / index freshly updated notebooks.

One way would presumably be to regularly crawl the directory path in which notebooks live looking for notebook files that have a changed timestamp compared to the last time they were indexed; another might be to set up some sort of watcher on the operating system that calls the indexer whenever it spots a file being updated (maybe something like fswatch?).
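As a minimal sketch of the first approach (regularly crawling for changed timestamps), something like the following might do. The function name `find_changed_notebooks` and the `last_indexed` dictionary are my own hypothetical choices, and I'm assuming the record of previously seen modification times is kept by the caller (for example, saved to a JSON file between runs).

```python
from pathlib import Path


def find_changed_notebooks(root, last_indexed):
    """Return notebooks under root whose timestamp differs from the last crawl.

    last_indexed (hypothetical structure) maps a notebook path, as a string,
    to the modification time recorded when it was last indexed. It is
    updated in place so the next crawl only reports fresh changes.
    """
    changed = []
    for path in sorted(Path(root).rglob("*.ipynb")):
        # Skip the autosave copies Jupyter keeps in checkpoint directories.
        if ".ipynb_checkpoints" in path.parts:
            continue
        mtime = path.stat().st_mtime
        if last_indexed.get(str(path)) != mtime:
            changed.append(path)
            last_indexed[str(path)] = mtime
    return changed
```

Each path returned would then be handed to whatever indexer is in use. A crawl like this could be run on a schedule, whereas a watcher-based approach would call the indexer directly whenever a file change event fired.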

Another way might be to use something like the pgcontents contents manager to save (or process) notebooks into a search engine index database. (For other examples of Jupyter notebook contents managers, see this Tracking Jupyter round-up. I wonder, is there a SQLite contents manager that can save notebooks directly into SQLite? Would the pgcontents extension handle that with little or no modification, other than to the supplied database connection string?) If notebooks were saved both as notebook files to disk and into a database for indexing as part of the search engine, how would the indexed notebook be linked back to the notebook on disk so that search results could link to it?
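On the question of linking an indexed notebook back to the file on disk, one possibility is simply to store the file path alongside the indexed text. The following is a rough sketch rather than anything pgcontents or nbgallery actually does: it assumes nbformat v4 notebooks (cells in a top-level `cells` list) and a Python `sqlite3` build that includes the FTS5 extension, and the table and function names are all hypothetical.

```python
import json
import sqlite3


def notebook_text(nb_path):
    """Concatenate the source of every cell in an nbformat v4 notebook."""
    with open(nb_path, encoding="utf-8") as f:
        nb = json.load(f)
    chunks = []
    for cell in nb.get("cells", []):
        source = cell.get("source", "")
        # Cell source may be stored as a single string or a list of lines.
        chunks.append(source if isinstance(source, str) else "".join(source))
    return "\n".join(chunks)


def open_index(db_path=":memory:"):
    """Open (or create) a full-text index; assumes FTS5 is available."""
    conn = sqlite3.connect(db_path)
    # The path is stored but not tokenised, so it only serves as a link back.
    conn.execute(
        "CREATE VIRTUAL TABLE IF NOT EXISTS notebooks "
        "USING fts5(path UNINDEXED, content)"
    )
    return conn


def index_notebook(conn, nb_path):
    """Replace any existing index entry for this notebook with fresh text."""
    path = str(nb_path)
    conn.execute("DELETE FROM notebooks WHERE path = ?", (path,))
    conn.execute(
        "INSERT INTO notebooks (path, content) VALUES (?, ?)",
        (path, notebook_text(path)),
    )
    conn.commit()


def search(conn, query):
    """Return the paths of matching notebooks, best match first."""
    rows = conn.execute(
        "SELECT path FROM notebooks WHERE notebooks MATCH ? ORDER BY rank",
        (query,),
    )
    return [row[0] for row in rows]
```

Because each search result is just a path, it could be turned into a link to the notebook on whatever server is serving that directory. Deleting before inserting means that re-indexing an updated notebook does not leave stale copies in the index.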

Thinks: how is nbgallery architected? Where are notebooks saved to? How is the Solr search engine index managed?

More generally, I wonder: are there any simple, Python-based full-text search engines with local filesystem crawlers/monitors/indexers out there?

PS Other search engines to have a look at:

PPS updating lunr.js – thread: https://github.com/olivernn/lunr.js/issues/284, https://www.npmjs.com/package/lunr-mutable-indexes . Maybe also https://github.com/lucaong/minisearch

Author: Tony Hirst

I'm a Senior Lecturer at The Open University, with an interest in #opendata policy and practice, as well as general web tinkering...