Parse HTML and Extract Text Content

This example shows how to parse HTML code and extract the text content from particular elements.

Parse HTML Code

Read HTML code from the URL https://www.mathworks.com/help/textanalytics using webread.

url = "https://www.mathworks.com/help/textanalytics";
code = webread(url);

Parse the HTML code using htmlTree.

tree = htmlTree(code);

View the HTML element name of the tree.

tree.Name

ans = 
"HTML"

View the child elements of the tree. The children are subtrees of tree.

tree.Children

ans = 
  4×1 htmlTree:
    " "
    Text Analytics Toolbox Documentation
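
To try the parsing step without a network connection, the same calls can be run on an inline HTML string. This is a minimal sketch: the HTML snippet and the variable names rootName and numChildren are made up for illustration and are not part of the original example.

```matlab
% Hypothetical inline HTML, used instead of webread so the sketch runs offline.
code = "<html><body><h1>Title</h1><p>First paragraph.</p></body></html>";

% Parse the string into an htmlTree (requires Text Analytics Toolbox).
tree = htmlTree(code);

% Inspect the root element name and count its direct children.
rootName = tree.Name;
numChildren = numel(tree.Children);
```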

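Find all the hyperlinks in the HTML tree using findElement. In HTML, hyperlinks use the "A" tag.

subtrees = findElement(tree,'A');

Create a word cloud from the text of the hyperlinks.

str = extractHTMLText(subtrees);
figure
wordcloud(str);
title("Hyperlinks")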
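
The hyperlink step can be sketched on a small inline document, which makes it easier to see what the element search and text extraction return. The HTML content, the example.com URLs, and the variable names below are assumptions chosen for illustration.

```matlab
% Hypothetical HTML with two hyperlinks (doubled quotes escape " inside a string).
code = "<html><body>" + ...
    "<a href=""https://example.com/a"">First link</a>" + ...
    "<a href=""https://example.com/b"">Second link</a>" + ...
    "</body></html>";
tree = htmlTree(code);

% Find every anchor element, then pull out the visible text and the link targets.
links = findElement(tree,"A");
linkText = extractHTMLText(links);
linkTargets = getAttribute(links,"href");
```

Each row of linkText and linkTargets corresponds to one element of links, so the two arrays can be paired up row by row.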

Get HTML Attributes

Get the class attributes from the paragraph elements in the HTML tree.

subtrees = findElement(tree,'P');
attr = "class";
str = getAttribute(subtrees,attr)

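str = 21×1 string array
    <missing>
    <missing>
    "add_margin_5"
    <missing>
    <missing>
    <missing>
    <missing>
    <missing>
    "category_desc"
    "category_desc"
    "category_desc"
    "category_desc"
    <missing>
    <missing>
    <missing>
    "text-center"
    <missing>
    <missing>
    <missing>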
    "pg电子麻将胡了 copyright"
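
Because not every paragraph carries a class attribute, it can help to separate the elements that have one from those that do not. The sketch below assumes that getAttribute returns a missing value for elements lacking the attribute; the mask also excludes empty strings in case that assumption does not hold. The HTML snippet and variable names are hypothetical.

```matlab
% Hypothetical HTML: one paragraph with a class attribute, one without.
code = "<html><body>" + ...
    "<p class=""intro"">Intro text.</p>" + ...
    "<p>No class here.</p>" + ...
    "</body></html>";
tree = htmlTree(code);

paragraphs = findElement(tree,"P");
classes = getAttribute(paragraphs,"class");

% Assumption: absent attributes come back as missing. The extra comparison
% with "" keeps the mask correct if an empty string is returned instead.
hasClass = ~ismissing(classes) & classes ~= "";

% Keep only the paragraphs that declare a class.
withClass = paragraphs(hasClass);
```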

Create a word cloud from the text contained in paragraph elements with class "category_desc".

subtrees = findElement(tree,'P.category_desc');
str = extractHTMLText(subtrees);
figure
wordcloud(str);
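
The element-plus-class selector used above can be checked on a small inline document. This is a sketch under assumed inputs: the HTML snippet and the class name "other" are invented for illustration, and only the "category_desc" class comes from the example.

```matlab
% Hypothetical HTML with two paragraphs that differ only in their class.
code = "<html><body>" + ...
    "<p class=""category_desc"">Described category.</p>" + ...
    "<p class=""other"">Something else.</p>" + ...
    "</body></html>";
tree = htmlTree(code);

% "P.category_desc" selects paragraph elements whose class is category_desc.
selected = findElement(tree,"P.category_desc");
selectedText = extractHTMLText(selected);
```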
