Add entity tags to documents

Since R2019a

Syntax

updatedDocuments = addEntityDetails(documents)

updatedDocuments = addEntityDetails(documents,Name,Value)

Description

Use addEntityDetails to detect person names, locations, organizations, and other named entities in text, and to add the corresponding entity tags to documents. This process is known as named entity recognition (NER).

The function supports English, Japanese, German, and Korean text.

updatedDocuments = addEntityDetails(documents) detects the named entities in documents. The function adds details only to tokens whose entity details are missing. To get the entity details from updatedDocuments, use the tokenDetails function.

updatedDocuments = addEntityDetails(documents,Name,Value) also specifies additional options using one or more name-value arguments.

Tip

Use addEntityDetails before using the lower, upper, normalizeWords, removeWords, and removeStopWords functions, because addEntityDetails uses information that these functions remove.
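The ordering in the tip can be sketched as follows. This is a minimal illustration, assuming Text Analytics Toolbox is installed; the sample sentence is made up for the example, and the entity tags the model assigns are not shown because they depend on the built-in model.

```matlab
% Illustrative input (hypothetical sample text).
str = "Mary moved to Natick, Massachusetts.";
documents = tokenizedDocument(str);

% Tag entities first, while the original casing and words are intact.
documents = addEntityDetails(documents);

% Only then apply preprocessing that changes or removes tokens.
documents = lower(documents);
documents = removeStopWords(documents);

% Inspect the details of the remaining tokens.
tdetails = tokenDetails(documents);
```

Reversing the order, that is, lowercasing or removing words before calling addEntityDetails, would hide casing and context that the entity detection relies on.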

Examples

Add Named Entity Tags to Documents

Create a tokenized document array.

str = [
    "Mary moved to Natick, Massachusetts."
    "John uses MATLAB at MathWorks."];
documents = tokenizedDocument(str);

Add the entity details to the documents using the addEntityDetails function. This function detects the named entities in the text and adds the details to the table returned by the tokenDetails function. View the updated token details.

documents = addEntityDetails(documents);
tdetails = tokenDetails(documents)

tdetails=13×8 table
         Token         DocumentNumber    SentenceNumber    LineNumber       Type        Language    PartOfSpeech       Entity   
    _______________    ______________    ______________    __________    ___________    ________    ____________    ____________
    "Mary"                   1                 1               1         letters           en       proper-noun     person      
    "moved"                  1                 1               1         letters           en       verb            non-entity  
    "to"                     1                 1               1         letters           en       adposition      non-entity  
    "Natick"                 1                 1               1         letters           en       proper-noun     location    
    ","                      1                 1               1         punctuation       en       punctuation     non-entity  
    "Massachusetts"          1                 1               1         letters           en       proper-noun     location    
    "."                      1                 1               1         punctuation       en       punctuation     non-entity  
    "John"                   2                 1               1         letters           en       proper-noun     person      
    "uses"                   2                 1               1         letters           en       verb            non-entity  
    "MATLAB"                 2                 1               1         letters           en       proper-noun     other       
    "at"                     2                 1               1         letters           en       adposition      non-entity  
    "MathWorks"              2                 1               1         letters           en       proper-noun     organization
    "."                      2                 1               1         punctuation       en       punctuation     non-entity

View the words tagged with the entities "person", "location", "organization", or "other". These are the words that are not tagged with "non-entity".

idx = tdetails.Entity ~= "non-entity";
tdetails.Token(idx)

ans = 6x1 string
    "Mary"
    "Natick"
    "Massachusetts"
    "John"
    "MATLAB"
    "MathWorks"

Add Named Entity Tags to Japanese Text

Tokenize Japanese text using tokenizedDocument.

str = [
    "マリーさんはボストンからニューヨークに引っ越しました。"
    "駅へ鈴木さんを迎えに行きます。"
    "東京は大阪より大きいですか？"
    "東京に行った時、新宿や渋谷などいろいろな所を訪れました。"];
documents = tokenizedDocument(str);

For Japanese text, the software automatically adds named entity tags, so you do not need to use the addEntityDetails function. The software detects person names, locations, organizations, and other named entities. To view the entity details, use the tokenDetails function.

tdetails = tokenDetails(documents);
head(tdetails)

       Token        DocumentNumber    LineNumber     Type      Language    PartOfSpeech       Lemma          Entity  
    ____________    ______________    __________    _______    ________    ____________    ____________    __________
    "マリー"               1               1         letters       ja       proper-noun     "マリー"         person    
    "さん"                1               1         letters       ja       noun            "さん"           person    
    "は"                  1               1         letters       ja       adposition      "は"            non-entity
    "ボストン"             1               1         letters       ja       proper-noun     "ボストン"        location  
    "から"                1               1         letters       ja       adposition      "から"           non-entity
    "ニューヨーク"          1               1         letters       ja       proper-noun     "ニューヨーク"    location  
    "に"                  1               1         letters       ja       adposition      "に"            non-entity
    "引っ越し"             1               1         letters       ja       verb            "引っ越す"        non-entity

View the words tagged with the entities "person", "location", "organization", or "other". These are the words that are not tagged with "non-entity".

idx = tdetails.Entity ~= "non-entity";
tdetails(idx,:).Token

ans = 11x1 string
    "マリー"
    "さん"
    "ボストン"
    "ニューヨーク"
    "鈴木"
    "さん"
    "東京"
    "大阪"
    "東京"
    "新宿"
    "渋谷"

Add Named Entity Tags to German Text

Tokenize German text using tokenizedDocument.

str = [
    "Ernst zog von Frankfurt nach Berlin."
    "Besuchen Sie Volkswagen in Wolfsburg."];
documents = tokenizedDocument(str);

To add entity tags to German text, use the addEntityDetails function. This function detects person names, locations, organizations, and other named entities.

documents = addEntityDetails(documents);

To view the entity details, use the tokenDetails function.

tdetails = tokenDetails(documents);
head(tdetails)

       Token       DocumentNumber    SentenceNumber    LineNumber       Type        Language    PartOfSpeech      Entity  
    ___________    ______________    ______________    __________    ___________    ________    ____________    __________
    "Ernst"              1                 1               1         letters           de       proper-noun     person    
    "zog"                1                 1               1         letters           de       verb            non-entity
    "von"                1                 1               1         letters           de       adposition      non-entity
    "Frankfurt"          1                 1               1         letters           de       proper-noun     location  
    "nach"               1                 1               1         letters           de       adposition      non-entity
    "Berlin"             1                 1               1         letters           de       proper-noun     location  
    "."                  1                 1               1         punctuation       de       punctuation     non-entity
    "Besuchen"           2                 1               1         letters           de       verb            non-entity

View the words tagged with the entities "person", "location", "organization", or "other". These are the words that are not tagged with "non-entity".

idx = tdetails.Entity ~= "non-entity";
tdetails(idx,:)

ans=5×8 table
       Token        DocumentNumber    SentenceNumber    LineNumber     Type      Language    PartOfSpeech       Entity   
    ____________    ______________    ______________    __________    _______    ________    ____________    ____________
    "Ernst"               1                 1               1         letters       de       proper-noun     person      
    "Frankfurt"           1                 1               1         letters       de       proper-noun     location    
    "Berlin"              1                 1               1         letters       de       proper-noun     location    
    "Volkswagen"          2                 1               1         letters       de       noun            organization
    "Wolfsburg"           2                 1               1         letters       de       proper-noun     location

Input Arguments

`documents` — input documents
`tokenizeddocument` array

Input documents, specified as a tokenizedDocument array.

Name-Value Arguments

Specify optional pairs of arguments as Name1=Value1,...,NameN=ValueN, where Name is the argument name and Value is the corresponding value. Name-value arguments must appear after other arguments, but the order of the pairs does not matter.

Before R2021a, use commas to separate each name and value, and enclose Name in quotes.

Example: DiscardKnownValues=true specifies to discard previously computed details and recompute them.

`retokenizemethod` — method to retokenize documents
`"entity"` (default) | `"none"`

Method to retokenize documents, specified as one of the following:

"entity" – Transform the tokens for named entity recognition. The function merges tokens from the same entity into a single token.
"none" – Do not retokenize the documents.

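The two retokenization settings described above can be compared side by side. This is a minimal sketch, assuming Text Analytics Toolbox R2021a or later for the Name=Value syntax; the sample sentence and the variable names merged and unmerged are made up for the illustration. Whether any tokens are actually merged depends on what the model detects, so no output is shown.

```matlab
% Illustrative input containing multiword names (hypothetical sample text).
documents = tokenizedDocument("Mary Smith moved to New York.");

% Default behavior: tokens from the same entity can be merged into one token.
merged = addEntityDetails(documents,RetokenizeMethod="entity");

% Keep the original tokenization and only add the entity tags.
unmerged = addEntityDetails(documents,RetokenizeMethod="none");

% Compare the resulting token details.
tokenDetails(merged)
tokenDetails(unmerged)
```

Using "none" is the safer choice when later steps depend on the original token boundaries, for example when aligning tokens with annotations produced elsewhere.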
`discardknownvalues` — option to discard previously computed details
`false` (default) | `true`

Option to discard previously computed details and recompute them, specified as true or false.

Data Types: logical
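As a rough illustration of this option, the following sketch tags a document twice. It assumes Text Analytics Toolbox R2021a or later for the Name=Value syntax, and the sample sentence is made up for the example.

```matlab
% Illustrative input (hypothetical sample text).
documents = tokenizedDocument("John uses MATLAB at MathWorks.");

% First call: adds entity details to tokens that do not have them yet.
documents = addEntityDetails(documents);

% A second call with default options leaves the existing details in place.
% To force the details to be recomputed, discard the known values.
% Before R2021a: addEntityDetails(documents,"DiscardKnownValues",true)
documents = addEntityDetails(documents,DiscardKnownValues=true);

tdetails = tokenDetails(documents);
```

Recomputing is mainly useful when the documents were tagged earlier under different settings and the existing tags should not be kept.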

`model` — ner model
`"auto"` (default) | `hmmentitymodel` object

Since R2023a

NER model, specified as one of these values:

"auto" – Use the built-in NER model.
hmmEntityModel object – Use the specified custom NER model. To train a custom NER model, use the trainHMMEntityModel function.

Output Arguments

`updateddocuments` — updated documents
`tokenizeddocument` array

Updated documents, returned as a tokenizedDocument array. To get the token details from updatedDocuments, use the tokenDetails function.

Algorithms

Language Details

tokenizedDocument objects contain details about the tokens, including language details. The language details of the input documents determine the behavior of addEntityDetails. By default, the tokenizedDocument function automatically detects the language of the input text. To specify the language details manually, use the Language option of tokenizedDocument. To view the token details, use the tokenDetails function.
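Setting the language manually, as described above, might look like the following. This is a minimal sketch, assuming Text Analytics Toolbox R2021a or later for the Name=Value syntax; the sample sentence and the variable names are made up for the illustration.

```matlab
% Illustrative German input (hypothetical sample text).
str = "Besuchen Sie Volkswagen in Wolfsburg.";

% Option 1: let tokenizedDocument detect the language automatically.
docsAuto = tokenizedDocument(str);

% Option 2: state the language explicitly so that addEntityDetails
% treats the text as German.
docsDe = tokenizedDocument(str,Language="de");
docsDe = addEntityDetails(docsDe);

% The Language column of the token details records the language in use.
tdetails = tokenDetails(docsDe);
unique(tdetails.Language)
```

Specifying the language explicitly can help with short or mixed-language strings, where automatic detection has little text to work with.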

Version History

Introduced in R2019a

R2023a: Specify custom NER model

To specify a custom NER model, use the Model name-value argument. To train a custom NER model, use the trainHMMEntityModel function.
