Lobster Wordcloud from Academic Paper

Mon, 22 Jun 2020 00:00:00 +0000

Creating a Word Cloud from an Academic Paper!

Word clouds are a fun and creative way to show the most frequently used words in a book, document, website or basically anything containing words. I wanted to create a word cloud from one of the papers published from my PhD. To take it a step further, I wanted to turn it into a lobster cloud, because the paper is on a species of spiny lobster. Here is a walkthrough of how I did this…

First, let’s see what we will need to import. Academic papers are usually laid out in columns, so I used the pdf_layout_scanner package by Yusuke Shinyama to convert the PDF into text that Python can work with. It extracts text from PDFs with multiple columns.

from pdf_layout_scanner import layout_scanner

Next, we import the Natural Language Toolkit (nltk) library, which we will use to tokenize the text from the PDF. We will also remove stop words, which are commonly used words such as ‘the’, ‘a’, ‘in’ etc. Both rely on NLTK data that has to be downloaded once with nltk.download.

from nltk import word_tokenize
from nltk.corpus import stopwords

Then we want to import the WordCloud library.

from wordcloud import WordCloud

Another important library we will need is Pillow (PIL). Pillow is used for opening, manipulating and saving different image file formats.

from PIL import Image

Finally, we will import numpy, pandas and matplotlib.

import numpy as np
import pandas as pd
import matplotlib.pyplot as plt

Step 1: Parse the pdf file using layout_scanner

pages = layout_scanner.get_pages('lobster.pdf')

Check how many pages we are working with.

print(len(pages))

Create a variable called text to store the first 23 pages of the paper (the rest are references or images).

text = pages[0:23]

type(text)

list

Our text is currently a list of page strings, which we have to convert into a single string. The joined string is 52251 characters long (len on a string counts characters, not words).

text2 = ' '.join(text) #convert list to string simplest method

len(text2)
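Since len() on a string counts characters rather than words, it can help to compare the two counts side by side. Here is a minimal sketch using made-up page strings (not the text of the paper); splitting on whitespace is only a rough stand-in for proper tokenization.

```python
# Hypothetical stand-in for the list returned by the PDF parser:
# one string per page.
sample_pages = ["Spiny lobsters are found", "in warm seas."]

# Join the pages into a single string, as in the step above.
joined = ' '.join(sample_pages)

n_chars = len(joined)          # counts characters, spaces included
n_words = len(joined.split())  # rough word count: split on whitespace

print(n_chars, n_words)
```

The character count will always be much larger than the word count, so the two numbers should not be compared directly.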

After tokenizing and removing stop words, we are left with 6711 tokens.

stop_words = set(stopwords.words('english'))
text_tokens = word_tokenize(text2)

filtered_words = [w for w in text_tokens if w not in stop_words]
len(filtered_words)

filtered_words = ' '.join(filtered_words)
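One thing worth knowing about this filter: NLTK's English stop word list is all lowercase, so capitalized words such as 'The' pass through, and word_tokenize keeps punctuation marks as separate tokens. The sketch below illustrates the effect with the standard library only. The stop word set and the regex tokenizer are small stand-ins that I made up for the example, not the NLTK versions used above.

```python
import re

# Tiny hand-written stop word set, for illustration only;
# NLTK's English list is much longer.
demo_stop_words = {'the', 'a', 'in', 'of', 'is'}

sample = "The lobster is in the reef. A lobster of the reef!"

# Simple regex tokenizer (a stand-in for word_tokenize): it keeps runs of
# letters only, so punctuation is dropped.
tokens = re.findall(r"[A-Za-z]+", sample)

# Case-sensitive lookup: 'The' and 'A' survive the filter.
case_sensitive = [w for w in tokens if w not in demo_stop_words]

# Lowercasing each token before the lookup removes them as well.
case_insensitive = [w for w in tokens if w.lower() not in demo_stop_words]

print(case_sensitive)
print(case_insensitive)
```

Applying the lowercase lookup to the real text would change the token count reported above, so treat it as an optional refinement.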

Now for the fun part! To make the wordcloud in the shape of a lobster, we will need a silhouette image of a lobster in .png format, ideally with a transparent background. A great resource for finding silhouettes of biological creatures is PhyloPic. Download the image to your working folder and assign the file name to a variable.

LOB_FILE = 'Spiny2.png'

Here is an example of how an ordinary word cloud looks:

word_cloud = WordCloud().generate(filtered_words)

plt.imshow(word_cloud, interpolation = 'bilinear')
plt.axis('off')
plt.show()
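The size of each word in the cloud reflects how often it occurs in the text, so a quick frequency count is a handy way to preview which words will dominate. Here is a minimal sketch using only the standard library on a made-up string. WordCloud does its own counting internally, so this is purely for inspection.

```python
from collections import Counter

# Made-up sample text; in the walkthrough this would be filtered_words.
sample_text = "lobster reef lobster growth reef lobster"

# Count how often each whitespace-separated word occurs.
word_counts = Counter(sample_text.split())

# The two most frequent words and their counts.
top_words = word_counts.most_common(2)
print(top_words)
```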

Now we will use Pillow to read the image and do some manipulations. We have to create an image mask from the lobster image, which will act as the canvas for the wordcloud.

icon = Image.open(LOB_FILE)
# create a blank white image the same size as the icon
image_mask = Image.new(mode='RGB', size=icon.size, color=(255, 255, 255))
# paste the icon onto the white canvas, using the icon itself as the transparency mask
image_mask.paste(icon, mask=icon)
rgb_array = np.array(image_mask)  # convert the image object into an array
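To see what this masking step produces, here is a small sketch that builds a tiny synthetic icon in memory instead of reading a file. The 4x4 icon with a black square in the middle is made up for the example. As I understand the WordCloud documentation, pure white pixels in the mask are treated as off-limits and everything else is free to draw on.

```python
import numpy as np
from PIL import Image

# Synthetic 4x4 RGBA icon: fully transparent background...
demo_icon = Image.new('RGBA', (4, 4), (0, 0, 0, 0))
# ...with an opaque black 2x2 square in the middle (box is left, upper, right, lower).
demo_icon.paste((0, 0, 0, 255), (1, 1, 3, 3))

# White canvas of the same size.
demo_mask = Image.new('RGB', demo_icon.size, (255, 255, 255))
# Paste the icon, using its own alpha channel as the transparency mask,
# so transparent pixels leave the white canvas untouched.
demo_mask.paste(demo_icon, mask=demo_icon)

demo_array = np.array(demo_mask)  # shape is (height, width, 3)
print(demo_array.shape)
print(demo_array[:, :, 0])  # red channel: 255 = white (masked out), 0 = drawable
```

The black square is the only region where words would be placed; the white border around it stays empty.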


word_cloud = WordCloud(mask = rgb_array, background_color = 'white',
                      max_words = 1000, colormap = 'ocean', max_font_size = 300)
word_cloud.generate(filtered_words.upper())

plt.figure(figsize=[20, 20])
plt.imshow(word_cloud, interpolation = 'bilinear')
plt.axis('off')
plt.show()

text manipulation | Sohana Singh

And there you have it! A beautiful lobster wordcloud created from an academic paper.