find-hidden-word-text - find hidden text in MS Word documents

find-hidden-word-text word.doc > hidden.txt

This is a command-line UNIX tool to ease the task of discovering hidden text
in MS Word documents.

More specifically, it is an implementation of Method 2 from Simon Byers'
paper, _Scalable Exploitation of, and Responses to Information Leakage
Through Hidden Data in Published Documents_, at
<URL:http://www.user-agent.org/word_docs.pdf>.

This tool goes a little further in that it also removes some common 'noise'
strings, such as 'Word.Document.8', 'Title', 'PAGE' and 'Microsoft Word
Document'. It also removes any string that does not contain at least
one whitespace character.
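
The comparison and filtering described above can be sketched in a few lines
of Perl. This is only an illustration under stated assumptions, not the
tool's actual code: the helper name hidden_candidates, the short noise list
and the sample strings are all hypothetical, and the visible text and raw
strings are passed in directly instead of being read from external programs.

```perl
use strict;
use warnings;

# Hypothetical helper (not part of the tool): given the visible text of a
# document and the raw printable strings found in the same file, return the
# strings that look hidden.
sub hidden_candidates {
    my ($visible, @raw) = @_;

    # Collapse whitespace in the visible text so that line wrapping does
    # not defeat the substring test below.
    $visible =~ s/\s+/ /g;

    # Illustrative noise list only; an assumption made for this sketch.
    my %noise = map { $_ => 1 } ('PAGE', 'Title', 'Microsoft Word Document');

    my @hidden;
    foreach my $raw (@raw) {
        my $s = $raw;
        $s =~ s/\s+/ /g;                    # collapse whitespace runs
        $s =~ s/^ //;                       # trim leading space
        $s =~ s/ $//;                       # trim trailing space
        next if $s !~ / /;                  # require at least one space
        next if $noise{$s};                 # known boilerplate
        next if index($visible, $s) >= 0;   # already visible to the reader
        push @hidden, $s;
    }
    return @hidden;
}

my @found = hidden_candidates(
    "Quarterly report\nAll figures are final.",
    'Quarterly report', 'Microsoft Word Document', 'PAGE',
    'draft: do not publish', 'Word.Document.8',
);
print "$_\n" for @found;    # prints "draft: do not publish"
```

The sketch uses index() for the substring test, which avoids compiling a
fresh regular expression for every candidate string.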

This tool requires antiword and the UNIX strings utility to be installed.

Justin Mason, C<jm dash wordtext at jmason dot org>
if (scalar @ARGV > 1) { $print_names = 1; }

foreach my $file (@ARGV) {
  # list-form open: the file name is never interpreted by a shell
  open (IN, '-|', 'antiword', '-t', $file) or die "cannot run antiword: $!";
  my $aw = join ('', <IN>);
  close IN or die "antiword -t $file failed";

  # list-form open again, so odd characters in $file are passed through safely
  open (IN, '-|', 'strings', $file) or die "cannot run strings: $!";
  my $str = join ('', <IN>);

  # normalize the antiword version

  # get each string from strings, and see if we can find it in the "visible"

  foreach (split (/\n/, $str)) {
    s/\s+/ /g; s/^ //gs; s/ $//gs;
    next if ($aw =~ /\Q$_\E/);

    # skip almost-entirely non-alpha 4-byte snippets
    #next if /^(?:\W\w\W\W|\W\W\w\W|\w\W{3,3}|\W{4,4}|\W{3,3}\w)$/;

    next if (!/ /);     # no spaces!

    # skip 4-to-6-byte snippets with 1 nonalpha and no spaces
    #next if (/^\S{4,6}$/ && /\W/);

    # common word droppings
    next if /^\s*PAGE\s*$/;
    #next if /^Word\.Document\.\d$/;
    next if /^Microsoft Word 9\.0$/;
    next if /^Microsoft Word Document$/;
    #next if /^MSWordDoc$/;
    next if /^Click to edit Master text styles$/;
    next if /^Click to edit Master title style$/;
    next if /^Embedded OLE Servers$/;

# output the strings and their counts
foreach (sort keys %count) {