python latex example
Written by Daniel Herber on November 29, 2017.
A word cloud is an image composed of words used in a particular text, in which the size of each word indicates its frequency.
Creating a word cloud from a $\LaTeX$ document can be challenging for two reasons: many $\LaTeX$-specific commands and environments should be ignored when counting words, and larger documents are often split across many subfiles.
This post describes one method for creating word clouds from $\LaTeX$ documents that handles many of these issues.
---
## Requirements
Below are the software/tools needed to create a word cloud from your $\LaTeX$ document:
- detex
    - See [[link]](https://code.google.com/archive/p/opendetex/downloads) for a Windows-compatible binary
    - See [[link]](https://github.com/pkubowicz/opendetex) for the current opendetex project
- Python installation and the ability to install packages
    - Download [[link]](https://www.python.org/downloads/)
- Word cloud tool
    - I prefer [www.wordclouds.com](https://www.wordclouds.com)
    - Other options include [[link]](https://www.jasondavies.com/wordcloud/) and [[link]](https://github.com/amueller/word_cloud) (the latter is for Python)
### detex Syntax
The program detex is a filter to strip TeX commands from a .tex file [[link]](https://www.freebsd.org/cgi/man.cgi?query=detex).
The usage syntax is:
```
detex [ -clnstw ] [ -e environment-list ] [ filename[.tex] ... ]
```
- ```-c```: in LaTeX mode, echo the arguments to \cite, \ref, and \pageref macros
- ```-l```: force LaTeX mode
- ```-n```: do not process \input or \include commands
- ```-s```: force the old functionality of detex
- ```-t```: force TeX mode regardless of input content
- ```-w```: output a word list
- ```-e```: specify a comma-separated environment-list of environments to ignore (there seems to be a limit on the number of environments that can be ignored)
Finally, make sure you are running the command from the location of ```filename[.tex]```.
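As a minimal sketch of how these options combine, the helper below assembles a detex call as a Python argument list. The function name `build_detex_command` and its parameters are hypothetical and only for illustration; the flags themselves are the ones listed above.

```python
def build_detex_command(tex_file, skip_envs=(), word_list=True,
                        force_latex=True, follow_includes=True):
    """Assemble a detex argument list from the options described above."""
    args = ['detex']
    if word_list:
        args.append('-w')  # output a word list
    if force_latex:
        args.append('-l')  # force LaTeX mode
    if not follow_includes:
        args.append('-n')  # do not process \input or \include
    if skip_envs:
        # -e takes a single comma-separated list of environments to ignore
        args += ['-e', ','.join(skip_envs)]
    args.append(tex_file)
    return args


cmd = build_detex_command('inputfile.tex', skip_envs=['align', 'tikzpicture'])
print(cmd)
# ['detex', '-w', '-l', '-e', 'align,tikzpicture', 'inputfile.tex']
```

A list like this can be passed to `subprocess.check_output` together with `cwd` set to the folder containing the .tex file.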
---
## Code
Below is some Python code that generates a word list that can then be used in your favorite word cloud generator.
It runs the detex command on a given $\LaTeX$ file, cleans up the text, counts the words, and prints the result.
This code can be downloaded at [[link]](blogs/python/post_1/detexwordcount.py).
```python
### imports
import os # miscellaneous operating system interfaces
import subprocess # spawn new processes
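import nltk # natural language toolkit, http://www.nltk.org/
# note: the stopwords and wordnet data must be downloaded once beforehand,
# e.g., nltk.download('stopwords') and nltk.download('wordnet')
from nltk.corpus import stopwords # high-frequency words like the, to and also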
from collections import Counter # dict subclass for counting hashable objects
### user options
# environments to skip during the detex process
enviros = 'align,vAlgorithm,tikzpicture,subfigure,gather,gathered,lstlisting,tabular,subequations,blankenv'
# working directory (i.e., location of main .tex file)
path = os.path.normpath('C:/filelocation/')
# name of the main .tex file
inputfile = 'inputfile.tex'
### detex the files
# create the command as an argument list (runs without a shell on any platform)
command = ['detex', '-w', '-l', '-e', enviros, inputfile]
# run the detex command from the working directory
text = subprocess.check_output(command, cwd=path)
### clean up the text
# break the text into words
content = text.split()
# change the encoding
content = [x.decode("utf-8") for x in content]
# lowercase
content = [x.lower() for x in content]
# remove certain characters
content = [x.replace(')', '') for x in content]
content = [x.replace('(', '') for x in content]
content = [x.replace('-', '') for x in content]
content = [x.replace('"', '') for x in content]
# remove the high-frequency words first, so the lemmatizer cannot alter
# a stopword before it is checked (also drop any empty strings)
s = set(stopwords.words('english'))
content = [w for w in content if w and w not in s]
# lemmatize the remaining words (bring to a common form)
wnl = nltk.WordNetLemmatizer()
content = [wnl.lemmatize(word) for word in content]
# create a list of words and their frequency
c = Counter(content)
# print the list of words
for key, value in c.most_common():
    print(value, key)
```
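Many word cloud tools can import a list of words with their weights from a file. The sketch below shows one way to turn a `Counter` into such a list. The helper names, the `;` delimiter, the count-then-word column order, and the file name `wordcounts.csv` are all assumptions, so check what your word cloud tool expects before using it.

```python
from collections import Counter


def format_counts(counter, delimiter=';'):
    """Return one 'count<delimiter>word' line per word, most frequent first."""
    return [f'{count}{delimiter}{word}' for word, count in counter.most_common()]


def write_counts(counter, out_path, delimiter=';'):
    """Write the formatted lines to a text file (hypothetical helper)."""
    with open(out_path, 'w', encoding='utf-8') as f:
        f.write('\n'.join(format_counts(counter, delimiter)) + '\n')


# small hand-made example standing in for the Counter built above
c = Counter(['equation', 'hello', 'equation', 'hello', 'text', 'world'])
print(format_counts(c))
# ['2;equation', '2;hello', '1;text', '1;world']

# to save the list, e.g.: write_counts(c, 'wordcounts.csv')
```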
---
## Examples
#### Simple Example
Consider the following .tex file:
```tex
\documentclass{article}
\newenvironment*{blankenv}{}{} % blank environment to skip during detex
\usepackage{amsmath}
\begin{document}
\begin{blankenv}
Test
\end{blankenv}
Hello World! Here are some equations:
\begin{align} % this equation is skipped
\int_0^1 \sin(x) dx, y = 2
\end{align}
Hello, this is some text with an equation: $\sin(x)$. % the equation is skipped
\end{document}
```
The output using the code above is:
| Counts | Word |
| ----: | :---- |
| 2 | equation |
| 2 | hello |
| 1 | text |
| 1 | world |
Some notes on this output:
- All punctuation was ignored
- All words are in lowercase
- All high-frequency words (stopwords such as "here", "are", and "this") were ignored
- Comments were ignored
- Both equations (align and inline) were ignored
- The blank environment (blankenv) was used to skip sections of the text
- The words "equations" and "equation" were combined into a single entry by the lemmatizer
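
To make these notes concrete, here is a simplified sketch that reproduces the counts in the table without detex or NLTK. The word list is typed by hand rather than captured from detex, the stopword set is a tiny stand-in for NLTK's English list, and `naive_singular` is a crude, hypothetical substitute for the WordNet lemmatizer, so treat it as an illustration of the steps and not as the code used above.

```python
from collections import Counter

# words of the example document after comments, equations, and blankenv
# are skipped (typed by hand for illustration, not captured detex output)
raw_words = ['Hello', 'World', 'Here', 'are', 'some', 'equations',
             'Hello', 'this', 'is', 'some', 'text', 'with', 'an', 'equation']

# tiny stand-in for NLTK's English stopword list
STOPWORDS = {'here', 'are', 'some', 'this', 'is', 'with', 'an'}


def naive_singular(word):
    """Crude stand-in for lemmatization: strip one trailing 's'."""
    if len(word) > 3 and word.endswith('s') and not word.endswith('ss'):
        return word[:-1]
    return word


words = [w.lower() for w in raw_words]             # lowercase
words = [w for w in words if w not in STOPWORDS]   # drop high-frequency words
words = [naive_singular(w) for w in words]         # merge plural and singular
counts = Counter(words)

for word, count in counts.most_common():
    print(count, word)
```

The resulting counts match the table: two each for "equation" and "hello", and one each for "text" and "world".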
#### My Dissertation
I used this method on my dissertation to create the word cloud below.
The code for my dissertation is available at [[link]](https://github.com/danielrherber/phd-dissertation-daniel-r-herber).
The word cloud was created using [www.wordclouds.com](https://www.wordclouds.com).
[](blogs/python/post_1/wordcloud.svg){data-lightbox="blog_imgs"}