Parse HTML and Extract Text Content
This example shows how to parse HTML code and extract the text content from particular elements.
Parse HTML Code
Read HTML code from the URL https://www.mathworks.com/help/textanalytics using webread.
url = "https://www.mathworks.com/help/textanalytics";
code = webread(url);
Parse the HTML code using htmlTree.
tree = htmlTree(code);
View the HTML element name of the tree.
tree.Name
ans = "HTML"
View the child elements of the tree. The children are subtrees of tree.
tree.Children
ans =
  4×1 htmlTree:
    " "
    Text Analytics Toolbox Documentation
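The same parsing steps can be tried without a network connection. The following is a minimal sketch, assuming Text Analytics Toolbox is installed. The HTML snippet is made up for illustration and is not taken from the MathWorks page.

```matlab
% Hypothetical HTML snippet so the example runs offline.
code = "<html><head><title>Demo Page</title></head>" + ...
    "<body><p class='intro'>Hello, world.</p></body></html>";

% Parse the string into an htmlTree object.
tree = htmlTree(code);

% Name of the root element.
rootName = tree.Name

% Child elements of the root; each child is itself an htmlTree.
kids = tree.Children;
for k = 1:numel(kids)
    disp(kids(k).Name)
end
```

Because each child is an htmlTree, the same properties (Name, Children) can be read again on any child to walk further down the document.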
Find the hyperlink elements, which use the "a" tag in HTML, and create a word cloud from their text.
subtrees = findElement(tree,"a");
str = extractHTMLText(subtrees);
figure
wordcloud(str);
title("Hyperlinks")
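Hyperlink text is often more useful alongside the link target. As a rough sketch (the HTML fragment and the variable names linkText and linkTarget are assumptions for illustration), the text and the href attribute of each "a" element can be collected side by side:

```matlab
% Hypothetical HTML fragment with two hyperlinks.
code = "<p>See <a href='/help/a'>Page A</a> and <a href='/help/b'>Page B</a>.</p>";
tree = htmlTree(code);

% Find all hyperlink elements.
links = findElement(tree,"a");

% Visible text and target of each hyperlink.
linkText = extractHTMLText(links);
linkTarget = getAttribute(links,"href");

% Show the pairs in a table, one row per hyperlink.
T = table(linkText(:),linkTarget(:),'VariableNames',{'Text','Target'})
```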
Get HTML Attributes
Get the class attributes from the paragraph elements in the HTML tree.
subtrees = findElement(tree,'p');
attr = "class";
str = getAttribute(subtrees,attr)
str = 21×1 string array
    "add_margin_5"
    "category_desc"
    "category_desc"
    "category_desc"
    "category_desc"
    "text-center"
    "copyright"
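The attribute values can also be used to filter elements after the fact. The sketch below assumes that getAttribute returns a missing value for elements that do not set the attribute. The HTML fragment and the names cls, hasClass, and isDesc are hypothetical.

```matlab
% Hypothetical fragment: four paragraphs, one without a class attribute.
code = "<p class='category_desc'>First description</p>" + ...
    "<p class='text-center'>Centered text</p>" + ...
    "<p>No class here</p>" + ...
    "<p class='category_desc'>Second description</p>";
tree = htmlTree(code);

% Class attribute of every paragraph element.
subtrees = findElement(tree,"p");
cls = getAttribute(subtrees,"class");

% Flag paragraphs that set a class, then those with the class of interest.
hasClass = ~ismissing(cls);
isDesc = hasClass & cls == "category_desc";

% Extract text only from the matching paragraphs.
descText = extractHTMLText(subtrees(isDesc))
```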
Create a word cloud from the text contained in paragraph elements with class "category_desc".
subtrees = findElement(tree,'p.category_desc');
str = extractHTMLText(subtrees);
figure
wordcloud(str);
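The extracted strings can also feed further text analysis. As one possible follow-up, and not part of the original example, the sketch below tokenizes the paragraph text with tokenizedDocument and lists the most frequent words using bagOfWords and topkwords. The HTML fragment is made up for illustration.

```matlab
% Hypothetical fragment with two matching paragraphs and one other paragraph.
code = "<p class='category_desc'>Analyze and model text data</p>" + ...
    "<p class='category_desc'>Import text data and prepare it for analysis</p>" + ...
    "<p class='other'>Unrelated paragraph</p>";
tree = htmlTree(code);

% Select paragraphs by tag and class, then extract their text.
subtrees = findElement(tree,"p.category_desc");
str = extractHTMLText(subtrees);

% Tokenize each paragraph and count word occurrences.
documents = tokenizedDocument(str);
bag = bagOfWords(documents);

% List the five most frequent words across the selected paragraphs.
topWords = topkwords(bag,5)
```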
See Also
tokenizedDocument