<?xml version="1.0" encoding="utf-8" standalone="yes" ?>
<rss version="2.0" xmlns:atom="http://www.w3.org/2005/Atom">
  <channel>
    <title>text manipulation | Sohana Singh</title>
    <link>/tags/text-manipulation/</link>
      <atom:link href="/tags/text-manipulation/index.xml" rel="self" type="application/rss+xml" />
    <description>text manipulation</description>
    <generator>Source Themes Academic (https://sourcethemes.com/academic/)</generator><language>en-us</language><lastBuildDate>Mon, 22 Jun 2020 00:00:00 +0000</lastBuildDate>
    <image>
      <url>/img/icon-192.png</url>
      <title>text manipulation</title>
      <link>/tags/text-manipulation/</link>
    </image>
    
    <item>
      <title>Lobster Wordcloud from Academic Paper</title>
      <link>/project/lobstercloud/</link>
      <pubDate>Mon, 22 Jun 2020 00:00:00 +0000</pubDate>
      <guid>/project/lobstercloud/</guid>
      <description>

&lt;h1 id=&#34;creating-a-word-cloud-from-an-academic-paper&#34;&gt;Creating a Word Cloud from an Academic Paper!&lt;/h1&gt;

&lt;h4 id=&#34;word-clouds-are-a-fun-and-creative-way-to-show-what-the-most-used-words-are-in-a-book-document-website-or-basically-anything-containing-words-i-wanted-to-create-a-word-cloud-from-one-of-my-papers-that-were-published-from-my-phd-to-take-it-a-step-further-i-wanted-to-turn-it-into-a-lobster-cloud-because-the-paper-is-on-a-species-of-spiny-lobster-here-is-a-walkthrough-of-how-i-did-this&#34;&gt;Word clouds are a fun and creative way to show the most frequently used words in a book, document, website or basically anything containing words. I wanted to create a word cloud from one of the papers published from my PhD. To take it a step further, I wanted to turn it into a lobster cloud, because the paper is about a species of spiny lobster. Here is a walkthrough of how I did this&amp;hellip;&lt;/h4&gt;

&lt;h4 id=&#34;first-let-s-see-what-we-will-need-to-import-academic-papers-are-usually-in-a-column-format-so-i-used-the-pdf-layout-scanner-https-pypi-org-project-pdf-layout-scanner-package-by-yusuke-shinyama-to-import-the-pdf-into-a-format-that-can-be-read-by-the-computer-it-extracts-text-from-pdf-s-with-multiple-columns&#34;&gt;First, let&amp;rsquo;s see what we need to import. Academic papers are usually laid out in columns, so I used the &lt;a href=&#34;https://pypi.org/project/PDF-Layout-Scanner/&#34; target=&#34;_blank&#34;&gt;pdf_layout_scanner&lt;/a&gt; package by Yusuke Shinyama, which extracts text from PDFs with multiple columns, to read the PDF into plain text that Python can work with.&lt;/h4&gt;

&lt;pre&gt;&lt;code class=&#34;language-python&#34;&gt;from pdf_layout_scanner import layout_scanner
&lt;/code&gt;&lt;/pre&gt;

&lt;h4 id=&#34;next-we-import-the-natural-language-toolkit-https-www-nltk-org-or-nltk-library-which-we-will-use-to-tokenize-words-from-the-pdf-we-will-also-remove-stop-words-which-are-commonly-used-words-such-as-the-a-in-etc&#34;&gt;Next, we import the &lt;a href=&#34;https://www.nltk.org/&#34; target=&#34;_blank&#34;&gt;Natural Language Toolkit&lt;/a&gt; (nltk) library, which we will use to tokenize the text from the PDF. We will also use it to remove stop words, which are commonly used words such as &amp;lsquo;the&amp;rsquo;, &amp;lsquo;a&amp;rsquo; and &amp;lsquo;in&amp;rsquo;.&lt;/h4&gt;

&lt;pre&gt;&lt;code class=&#34;language-python&#34;&gt;from nltk import word_tokenize
from nltk.corpus import stopwords
&lt;/code&gt;&lt;/pre&gt;

&lt;h4 id=&#34;then-we-want-to-import-the-wordcloud-http-amueller-github-io-word-cloud-library&#34;&gt;Then we want to import the &lt;a href=&#34;http://amueller.github.io/word_cloud/&#34; target=&#34;_blank&#34;&gt;WordCloud&lt;/a&gt; library.&lt;/h4&gt;

&lt;pre&gt;&lt;code class=&#34;language-python&#34;&gt;from wordcloud import WordCloud
&lt;/code&gt;&lt;/pre&gt;

&lt;h4 id=&#34;another-important-library-we-will-need-is-the-pillow-pil-https-python-pillow-org-pillow-is-used-for-opening-manipulating-and-saving-different-image-file-formats&#34;&gt;Another important library we will need is &lt;a href=&#34;https://python-pillow.org/&#34; target=&#34;_blank&#34;&gt;Pillow (PIL)&lt;/a&gt;. Pillow is used for opening, manipulating and saving different image file formats.&lt;/h4&gt;

&lt;pre&gt;&lt;code class=&#34;language-python&#34;&gt;from PIL import Image
&lt;/code&gt;&lt;/pre&gt;

&lt;h4 id=&#34;finally-we-will-import-numpy-pandas-and-matplotlib&#34;&gt;Finally, we will import numpy, pandas and matplotlib&lt;/h4&gt;

&lt;pre&gt;&lt;code class=&#34;language-python&#34;&gt;import numpy as np
import pandas as pd
import matplotlib.pyplot as plt
&lt;/code&gt;&lt;/pre&gt;

&lt;h4 id=&#34;step-1-parse-the-pdf-file-using-layout-scanner&#34;&gt;Step 1: Parse the pdf file using layout_scanner&lt;/h4&gt;

&lt;pre&gt;&lt;code class=&#34;language-python&#34;&gt;pages = layout_scanner.get_pages(&#39;lobster.pdf&#39;)
&lt;/code&gt;&lt;/pre&gt;

&lt;h4 id=&#34;check-how-many-pages-we-working-with&#34;&gt;Check how many pages we are working with&lt;/h4&gt;

&lt;pre&gt;&lt;code class=&#34;language-python&#34;&gt;print(len(pages))
&lt;/code&gt;&lt;/pre&gt;

&lt;pre&gt;&lt;code&gt;44
&lt;/code&gt;&lt;/pre&gt;

&lt;h4 id=&#34;created-a-variable-called-text-to-store-the-first-23-pages-of-the-paper-the-rest-are-references-or-images&#34;&gt;Create a variable called text to store the first 23 pages of the paper (the rest are references or images)&lt;/h4&gt;

&lt;pre&gt;&lt;code class=&#34;language-python&#34;&gt;text = pages[0:23]
&lt;/code&gt;&lt;/pre&gt;

&lt;pre&gt;&lt;code class=&#34;language-python&#34;&gt;type(text)
&lt;/code&gt;&lt;/pre&gt;

&lt;pre&gt;&lt;code&gt;list
&lt;/code&gt;&lt;/pre&gt;

&lt;h4 id=&#34;our-text-is-currently-a-list-which-we-will-have-to-convert-into-a-string-we-have-52251-words&#34;&gt;Our text is currently a list of page strings, which we have to join into a single string. The resulting string is 52251 characters long (len counts characters, not words)&lt;/h4&gt;

&lt;pre&gt;&lt;code class=&#34;language-python&#34;&gt;text2 = &#39; &#39;.join(text) # join the list of page strings into one string
&lt;/code&gt;&lt;/pre&gt;

&lt;pre&gt;&lt;code class=&#34;language-python&#34;&gt;len(text2)
&lt;/code&gt;&lt;/pre&gt;

&lt;pre&gt;&lt;code&gt;52251
&lt;/code&gt;&lt;/pre&gt;
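
&lt;h4 id=&#34;sketch-characters-versus-words&#34;&gt;As a rough sketch of the difference between characters and words: calling len on a string counts characters, while splitting on whitespace gives an approximate word count. The two sample page strings below are made up for illustration and are not taken from the paper.&lt;/h4&gt;

&lt;pre&gt;&lt;code class=&#34;language-python&#34;&gt;# Hypothetical stand-in for the list of page strings returned by the parser
sample_pages = [&#39;Spiny lobsters are&#39;, &#39;found on rocky reefs&#39;]

sample_text = &#39; &#39;.join(sample_pages)

n_characters = len(sample_text)      # counts every character, including spaces
n_words = len(sample_text.split())   # rough word count: split on whitespace

print(n_characters, n_words)
&lt;/code&gt;&lt;/pre&gt;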

&lt;h4 id=&#34;after-tokenizing-and-removing-stop-words-we-see-that-the-number-of-words-is-now-reduced-to-6711&#34;&gt;After tokenizing and removing stop words, we are left with 6711 tokens&lt;/h4&gt;

&lt;pre&gt;&lt;code class=&#34;language-python&#34;&gt;stop_words = set(stopwords.words(&#39;english&#39;))
text_tokens = word_tokenize(text2)

# keep only the tokens that are not in the stop-word set
filtered_words = [w for w in text_tokens if w not in stop_words]
len(filtered_words)
&lt;/code&gt;&lt;/pre&gt;

&lt;pre&gt;&lt;code&gt;6711
&lt;/code&gt;&lt;/pre&gt;

&lt;pre&gt;&lt;code class=&#34;language-python&#34;&gt;filtered_words = &#39; &#39;.join(filtered_words)
&lt;/code&gt;&lt;/pre&gt;
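
&lt;h4 id=&#34;sketch-checking-word-frequencies&#34;&gt;Word size in a word cloud depends on how often each word appears, so it can help to look at the most common tokens before plotting. The sketch below uses collections.Counter from the standard library. The token list and the tiny stop-word set are invented for illustration and stand in for the NLTK output above. It also lowercases each token before the stop-word check, since the NLTK English stop words are all lowercase and a capitalised &amp;lsquo;The&amp;rsquo; would otherwise slip through.&lt;/h4&gt;

&lt;pre&gt;&lt;code class=&#34;language-python&#34;&gt;from collections import Counter

# Invented sample tokens and stop words, standing in for the real NLTK output
sample_tokens = [&#39;The&#39;, &#39;lobster&#39;, &#39;and&#39;, &#39;the&#39;, &#39;reef&#39;, &#39;lobster&#39;, &#39;larvae&#39;, &#39;Lobster&#39;]
sample_stop_words = {&#39;the&#39;, &#39;and&#39;, &#39;a&#39;, &#39;in&#39;}

# Lowercase before the membership test so &#39;The&#39; is treated like &#39;the&#39;
kept = [w.lower() for w in sample_tokens if w.lower() not in sample_stop_words]

# Count occurrences and list the most frequent words first
counts = Counter(kept)
top_words = counts.most_common(2)
print(top_words)
&lt;/code&gt;&lt;/pre&gt;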

&lt;h4 id=&#34;now-for-the-fun-part-to-make-the-wordcloud-in-the-shape-of-a-lobster-we-will-need-a-vector-png-image-of-a-lobster-the-best-resource-for-finding-pictures-of-biological-creatures-is-phylopic-http-phylopic-org-download-the-image-to-your-working-folder-and-assign-it-to-a-variable&#34;&gt;Now for the fun part! To make the word cloud in the shape of a lobster, we will need a .png silhouette of a lobster. A great resource for finding silhouettes of organisms is &lt;a href=&#34;http://phylopic.org/&#34; target=&#34;_blank&#34;&gt;PhyloPic&lt;/a&gt;. Download the image to your working folder and assign its file name to a variable.&lt;/h4&gt;

&lt;pre&gt;&lt;code class=&#34;language-python&#34;&gt;LOB_FILE = &#39;Spiny2.png&#39;
&lt;/code&gt;&lt;/pre&gt;

&lt;h4 id=&#34;here-is-an-example-of-how-an-ordinary-word-cloud-looks&#34;&gt;Here is an example of how an ordinary word cloud looks&lt;/h4&gt;

&lt;pre&gt;&lt;code class=&#34;language-python&#34;&gt;word_cloud = WordCloud().generate(filtered_words)

plt.imshow(word_cloud, interpolation = &#39;bilinear&#39;)
plt.axis(&#39;off&#39;)
plt.show()
&lt;/code&gt;&lt;/pre&gt;

&lt;p&gt;&lt;img src=&#34;LobsterCloud_28_0.png&#34; alt=&#34;png&#34; /&gt;&lt;/p&gt;

&lt;h4 id=&#34;now-we-will-use-pillow-to-read-the-image-and-do-some-manipulations-we-have-to-create-an-image-mask-from-the-lobster-image-which-will-be-a-canvas-for-the-wordcloud&#34;&gt;Now we will use Pillow to read the image and do some manipulations. We have to create an image mask from the lobster image, which will act as the canvas for the word cloud.&lt;/h4&gt;

&lt;pre&gt;&lt;code class=&#34;language-python&#34;&gt;icon = Image.open(LOB_FILE)
# create a blank white image the same size as the icon
image_mask = Image.new(mode=&#39;RGB&#39;, size = icon.size, color = (255, 255, 255))
# paste the icon onto the white canvas, using the icon itself as the paste mask
image_mask.paste(icon, mask = icon)
rgb_array = np.array(image_mask) # convert the image object into an array


word_cloud = WordCloud(mask = rgb_array, background_color = &#39;white&#39;,
                      max_words = 1000, colormap = &#39;ocean&#39;, max_font_size = 300)
word_cloud.generate(filtered_words.upper())

plt.figure(figsize=[20, 20])
plt.imshow(word_cloud, interpolation = &#39;bilinear&#39;)
plt.axis(&#39;off&#39;)
plt.show()
&lt;/code&gt;&lt;/pre&gt;

&lt;p&gt;&lt;img src=&#34;LobsterCloud_30_0.png&#34; alt=&#34;png&#34; /&gt;&lt;/p&gt;
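
&lt;h4 id=&#34;sketch-how-the-mask-limits-where-words-are-drawn&#34;&gt;To see what the mask does: according to the WordCloud documentation, pure white pixels in the mask are treated as masked out, and words are only drawn on the remaining pixels. The sketch below builds a tiny made-up mask with numpy, instead of loading the lobster image, and measures how much of it is drawable.&lt;/h4&gt;

&lt;pre&gt;&lt;code class=&#34;language-python&#34;&gt;import numpy as np

# Made-up 4x4 RGB mask: start fully white (masked out everywhere)
toy_mask = np.full((4, 4, 3), 255, dtype=np.uint8)

# Paint a 2x2 black square in the middle; this is the only drawable region
toy_mask[1:3, 1:3] = 0

# A pixel is drawable if any of its colour channels is not pure white
drawable = (toy_mask != 255).any(axis=-1)
drawable_fraction = drawable.mean()
print(drawable_fraction)
&lt;/code&gt;&lt;/pre&gt;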

&lt;h4 id=&#34;and-there-you-have-it-a-beautiful-lobster-wordcloud-created-from-an-academic-paper&#34;&gt;And there you have it! A beautiful lobster wordcloud created from an academic paper&lt;/h4&gt;
</description>
    </item>
    
  </channel>
</rss>
