Working with Notebooks
#
1. Google Colab
Notebooks can be a useful tool in many areas. In our case, we use them primarily for evaluating data sets on Elasticsearch. Since Python currently has the strongest ecosystem for working with NLP, we chose a Python Notebook option.
Why use a "Notebook"?
A Notebook is an easy way to combine runnable code, text, mathematical equations, tables, and many other helpful visualizations. Working with a Notebook makes it easier to understand code and display additional information in one document. Because the code can be run bit by bit, debugging is easier and the result is clearer for others to follow.
Running code can use a lot of memory, so it helps to execute it in small pieces rather than all at once, and Notebooks are well suited to this. Depending on what you want to accomplish and how much memory you have, there are two excellent options:
Google Colab and Jupyter Notebooks
Both are similarly structured, but there are small differences in how they handle data. For our use cases we chose Google Colab.
To get started with a Google Colab Notebook, all you need is your browser. If you use the Notebook with a Google Drive account, it saves automatically; otherwise, you have to save changes manually, for example as a commit on GitHub. Information on that can be found in this notebook.
For further information check out the Google Colab help notebooks.
#
2. Cells
A notebook consists of several parts referred to as “cells”. These can be in either text or code format. Text cells are useful for holding information and structuring the notebook. Text is formatted in Markdown, so you can use headlines to create a table of contents, which is then displayed on the left.
Cells containing code are compatible with several programming languages and require the same syntax as the language used. For Python, you have to import libraries the same way you normally would. A code cell only responds when there is output: either something the code printed or an error. If nothing appears, everything went smoothly.
The cells can be moved within the notebook at any time, but you should keep in mind that some cells are dependent on others regarding variables or imports. Basically, you can execute the cells, and thus the code snippets, in any order so long as you keep track of dependencies.
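The dependency between cells can be illustrated with a small sketch. This is not how Colab runs cells internally; it only imitates the idea by executing strings against one shared namespace. The cell names and the `run_cells` helper are made up for this illustration.

```python
# Each string stands in for one notebook cell. All cells share one namespace,
# just as cells in a notebook share the same Python session.
cells = {
    "imports": "import math",
    "define": "radius = 2.0",
    "use": "area = math.pi * radius ** 2",
}


def run_cells(order):
    """Run the named cells in the given order.

    Returns the shared namespace on success, or an error message
    if a cell needs a name that has not been defined yet.
    """
    namespace = {}
    for name in order:
        try:
            exec(cells[name], namespace)
        except NameError as err:
            return f"cell '{name}' failed: {err}"
    return namespace


# Dependencies respected: imports and variables exist before they are used.
ok = run_cells(["imports", "define", "use"])
print(round(ok["area"], 2))

# Dependencies ignored: the "use" cell runs before "imports" and "define".
broken = run_cells(["use", "imports", "define"])
print(broken)
```

The second call fails because the cell that uses `math` and `radius` runs before the cells that provide them, which is the situation you get in a notebook after reordering cells or reconnecting without rerunning the earlier ones.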
#
3. Install packages
Since Google Colab is hosted on Google's servers, you have to install the necessary packages and download data every time you reconnect the notebook.
To install packages that aren't already installed on the Colab server, for example the Python package for Elasticsearch, simply run !pip install followed by the package you want to install:
!pip install elasticsearch -q
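Because the runtime is reset on reconnect, it can be handy to check which modules are already available before installing anything. The following is a minimal sketch using only the standard library; the helper name `is_installed` is hypothetical. Note that it checks the import name, which is not always identical to the name used with pip.

```python
import importlib.util


def is_installed(module_name):
    """Return True if a top-level module can be imported in this runtime."""
    return importlib.util.find_spec(module_name) is not None


# "json" ships with Python; "elasticsearch" is only present after installation.
for name in ["json", "elasticsearch"]:
    status = "available" if is_installed(name) else "missing, install it with !pip"
    print(f"{name}: {status}")
```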
#
4. Get Data
As already mentioned, data must also be downloaded again with each new connection. Download and other shell commands follow the same syntax as in a Linux terminal, prefixed with an exclamation mark. You can find the downloaded files on the left when you click on the folder icon. The files can be referenced like this:
!wget http://ir.dcs.gla.ac.uk/resources/test_collections/cran/cran.tar.gz
!tar -xf cran.tar.gz
PATH_TO_CRAN_TXT = '/content/cran.all.1400'
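Once the file is in place, it has to be split into individual documents. The sketch below assumes the usual layout of the Cranfield collection file, where each document starts with `.I <id>` followed by the sections `.T` (title), `.A` (author), `.B` (bibliography), and `.W` (text). The function name `parse_cran`, the field names, and the sample text are invented for this illustration and are not taken from the collection.

```python
import re

# Assumed section markers of the Cranfield file and the field names used here.
FIELD_NAMES = {"T": "title", "A": "author", "B": "bibliography", "W": "text"}


def parse_cran(raw):
    """Split raw Cranfield-style text into a list of document dicts."""
    docs = []
    current = None  # document currently being filled
    field = None    # section the following lines belong to
    for line in raw.splitlines():
        id_match = re.match(r"^\.I\s+(\d+)\s*$", line)
        if id_match:
            # A new ".I <id>" line starts a new document.
            current = {"id": int(id_match.group(1))}
            current.update({name: "" for name in FIELD_NAMES.values()})
            docs.append(current)
            field = None
        elif current is not None and line.strip() in {".T", ".A", ".B", ".W"}:
            # A section marker switches the field that is being filled.
            field = FIELD_NAMES[line.strip()[1]]
        elif current is not None and field is not None:
            # Any other line is content; join wrapped lines with a space.
            current[field] = (current[field] + " " + line.strip()).strip()
    return docs


# Invented sample in the assumed layout, not real collection data.
sample = """\
.I 1
.T
first sample title spanning
two lines .
.A
doe,j.
.B
sample journal, 1958.
.W
body text of the first sample document .
.I 2
.T
second sample title .
.A
roe,r.
.B
sample report, 1960.
.W
body text of the second
sample document .
"""

docs = parse_cran(sample)
print(len(docs), docs[0]["title"])
```

In the notebook you would read the real file instead of the sample, for example with `parse_cran(open(PATH_TO_CRAN_TXT).read())`.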
#
5. Set up Elasticsearch instance
If you don't want to host Elasticsearch on your own server, it is also possible to initialize Elasticsearch in Google Colab. To do this, it is first necessary to download Elasticsearch:
# download and unpack Elasticsearch
!wget https://artifacts.elastic.co/downloads/elasticsearch/elasticsearch-7.9.1-linux-x86_64.tar.gz -q
!tar -xzf elasticsearch-7.9.1-linux-x86_64.tar.gz
# hand the files to the daemon user, since Elasticsearch refuses to run as root
!chown -R daemon:daemon elasticsearch-7.9.1
Then you can start an Elasticsearch server and set up an instance:
# start server
import os
from subprocess import Popen, PIPE, STDOUT
# run the server as the daemon user (uid 1) instead of root
es_server = Popen(['elasticsearch-7.9.1/bin/elasticsearch'],
                  stdout=PIPE, stderr=STDOUT,
                  preexec_fn=lambda: os.setuid(1))
# wait a bit then test
!curl -X GET "localhost:9200/"
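# client-side
# pin the client to 7.x so that it matches the 7.9.1 server started above
!pip install "elasticsearch>=7,<8" -q
from elasticsearch import Elasticsearch
from datetime import datetime
# without arguments the 7.x client connects to localhost:9200
es = Elasticsearch()
es.ping()  # got True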
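The server needs some time to start, so a check that runs too early will fail. Instead of waiting for a fixed time, you can poll until the server answers. The following is a minimal sketch; the helper `wait_until` and its default values are hypothetical, and the example uses a stand-in check so that it runs without a server.

```python
import time


def wait_until(check, timeout=60.0, interval=2.0):
    """Call check() repeatedly until it returns True or the timeout expires.

    Exceptions raised by check() are treated as "not ready yet".
    Returns True if the check succeeded in time, otherwise False.
    """
    deadline = time.monotonic() + timeout
    while True:
        try:
            if check():
                return True
        except Exception:
            pass  # for example, connection refused while the server boots
        if time.monotonic() >= deadline:
            return False
        time.sleep(interval)


# Stand-in for a real readiness check: succeeds on the third attempt.
attempts = {"count": 0}


def fake_ping():
    attempts["count"] += 1
    return attempts["count"] >= 3


ready = wait_until(fake_ping, timeout=5.0, interval=0.01)
print(ready, attempts["count"])  # True 3
```

In the notebook, the client's own check can be passed in directly, for example `wait_until(es.ping)`.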
You can check whether everything worked with a short example search request. To do this we create an index, add a document, and search it:
doc = {
    'author': 'kimchy',
    'text': 'Elasticsearch: cool. bonsai cool.',
    'timestamp': datetime.now(),
}
res = es.index(index="test-index", id=1, body=doc)
response = es.cat.indices()
print(response)
res = es.get(index="test-index", id=1)
print(res['_source'])
es.indices.refresh(index="test-index")
res = es.search(index="test-index", body={"query": {"match_all": {}}})
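The search result is a nested dictionary, with the matching documents listed under `hits.hits`. The sketch below shows one way to pull out the ID, score, and source of each hit. The helper `summarize_hits` is hypothetical, and `sample_response` is a hand-written dictionary in the shape an Elasticsearch 7 search response has, not output captured from a server.

```python
def summarize_hits(response):
    """Return (id, score, source) tuples for every hit in a search response."""
    return [
        (hit["_id"], hit["_score"], hit["_source"])
        for hit in response["hits"]["hits"]
    ]


# Hand-written stand-in for the dictionary returned by es.search(...).
sample_response = {
    "took": 1,
    "timed_out": False,
    "hits": {
        "total": {"value": 1, "relation": "eq"},
        "max_score": 1.0,
        "hits": [
            {
                "_index": "test-index",
                "_id": "1",
                "_score": 1.0,
                "_source": {
                    "author": "kimchy",
                    "text": "Elasticsearch: cool. bonsai cool.",
                },
            }
        ],
    },
}

for doc_id, score, source in summarize_hits(sample_response):
    print(doc_id, score, source["author"])
```

With a running instance, the same function can be applied to the result of the search request, for example `summarize_hits(res)`.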
Our Elasticsearch Setup Guide provides more details on how to set up Elasticsearch on your own server.
#
6. Summary
Now that you know how to download data, get a local Elasticsearch instance running, and work in Google Colab, you can start experimenting on your own. Next, check out our First Experiment.
Acknowledgements:
Thanks to Kenny Hall for proofreading this article.