Listing all the unique words in a piece of text

This lesson demonstrates how to list each unique word in a piece of text.

The uniqueWords function

The uniqueWords function takes one parameter, pString. The function uses a repeat loop to check each word in turn, building an array variable named tWordsList. Each element of tWordsList is associated with a different word: the element's key is the word, and the element's content is a number counting how many times that word has been seen so far. For example, if the first word of the string is "Cans", then after the first word is processed, the array tWordsList contains one element, named "Cans", which contains the number 1.

When a word is processed, the handler adds 1 to the element corresponding to that word. If there is no array element with that name already, one is created automatically by the add command. In general, changing a variable, a chunk in a variable, or an element in an array variable creates the variable, chunk, or element automatically, if it doesn't already exist. If there is already an element with that name, that is, if the word already exists in the array, 1 is added to that existing element.
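The automatic creation of array elements can be seen in isolation with a short sketch. The handler and the variable name tCounts below are illustrative only and are not part of the lesson's sample stack:

on mouseUp
	local tCounts
	-- No element named "Cans" exists yet, so add creates it and sets it to 1
	add 1 to tCounts["Cans"]
	-- The element now exists, so add increments it to 2
	add 1 to tCounts["Cans"]
	answer tCounts["Cans"]
end mouseUp

Placed in a button script, this should display 2 when the button is clicked.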

After all the words have been processed, the function exits the repeat loop. At this point, the array variable tWordsList contains one element for each unique word, and each element's key is the word itself. The expression the keys of tWordsList therefore returns a list of all the unique words in the string, one per line.

LiveCode chunk expressions

This form of word-by-word processing is possible because LiveCode uses chunk expressions to manage text. A chunk expression is a way of describing a specific portion of a container. LiveCode can directly address individual words, characters, lines, and items (delimited by any character).

In this example, we use the repeat for each chunk form of the repeat control structure:

repeat for each word tWord in pString

This repeat structure loops through each word in the parameter pString, putting the current word into a variable called tWord. You can also loop through other chunk types in a repeat structure, processing each character, line, or item.
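As a sketch of the same form applied to a different chunk type, the function below loops over lines instead of words. The function name countNonEmptyLines and its variables are made up for this illustration:

function countNonEmptyLines pString
	local tCount
	put 0 into tCount
	-- tLine receives each line of pString in turn
	repeat for each line tLine in pString
		if tLine is not empty then add 1 to tCount
	end repeat
	return tCount
end countNonEmptyLines

Replacing line with item or char in the repeat statement processes items or characters in the same way.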

The uniqueWords function code

function uniqueWords pString
	local tWordsList
	repeat for each word tWord in pString
		add 1 to tWordsList[tWord]
	end repeat

	return the keys of tWordsList
end uniqueWords
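
One possible way to call the function is from a button script, as sketched below. The sample sentence and the variable name tWords are assumptions for this example. Because the order of an array's keys is not guaranteed, the list is sorted before it is displayed:

on mouseUp
	local tWords
	put uniqueWords("the cat saw the dog") into tWords
	-- the keys returns one key per line, so the result can be sorted as lines
	sort lines of tWords
	answer tWords
end mouseUp

This should show the four words cat, dog, saw and the, each on its own line, with the repeated "the" listed only once.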

A note on efficiency

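This example uses the repeat for each word form of the repeat control structure. When looping over chunks of a string, this form is the fastest, because it moves through the string once. The following repeat structure is functionally equivalent to the one in this example, but is much slower, because evaluating word x of pString means counting words from the start of the string on every iteration: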

repeat with x = 1 to the number of words in pString
	add 1 to tWordsList[word x of pString]
end repeat

A note on trueWord

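You may notice that the code as written returns words with punctuation attached: for example, uniqueWords("a, b") returns "a," and "b". This is because a LiveCode "word" is any run of characters delimited by spaces, tabs, or returns (or a string enclosed in double quotes), rather than a word in the everyday sense. As of LiveCode 7, you can instead use the trueWord chunk, which is more discerning in that regard: it splits text at word boundaries and leaves out surrounding punctuation.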
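
A minimal sketch of a trueWord variant is shown below. It requires LiveCode 7 or later, and the function name uniqueTrueWords is invented here; it is not part of the lesson's sample stack:

function uniqueTrueWords pString
	local tWordsList
	-- Same loop as uniqueWords, but splitting on trueWord boundaries
	repeat for each trueWord tWord in pString
		add 1 to tWordsList[tWord]
	end repeat

	return the keys of tWordsList
end uniqueTrueWords

With this version, uniqueTrueWords("a, b") should return "a" and "b" without the trailing comma.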

3 Comments

Richard M Kriesel

"A note on efficiency" above has a bug at "tWordsList" which should be "word" instead.

Richard M Kriesel

Since uniqueWords("a, b") returns words "a," and "b" this lesson ought to recognize that "word" here refers to a LiveCode word, and that "trueword" may better serve the user's needs.

Panos Merakos

Hello Richard,
Thank you for your comments.

We have now fixed the error in the code in "A note on efficiency". Thanks for spotting it.

Regarding the use of "trueWord" instead of "word": yes, you are correct. However, "trueWord" was introduced in LC 7, and since a few users still use versions earlier than LC 7, I will not update the example script and the sample stack, but I will add a note to the lesson.
