NYT Dataset, LDC2008T19, The New York Times Annotated Corpus Dataset
2023-09-07 06:37

The New York Times Annotated Corpus

Item Name:The New York Times Annotated Corpus

Author(s):Evan Sandhaus

LDC Catalog No.:LDC2008T19

ISBN:1-58563-486-7

ISLRN:429-488-225-160-9

DOI:

Release Date:October 17, 2008

Member Year(s):2008

DCMI Type(s):Text

Data Source(s):newswire

Application(s):summarization, metadata extraction, information retrieval, information extraction

Language(s):English

Language ID(s):eng

License(s):The New York Times Annotated Corpus Agreement

Online Documentation:LDC2008T19 documents

Licensing Instructions:Subscription & Standard Members, and Non-Members

Citation:Sandhaus, Evan. The New York Times Annotated Corpus LDC2008T19. Web Download. Philadelphia: Linguistic Data Consortium, 2008.

Related Works:hasAnnotation LDC2014T27 Benchmarks for Open Relation Extraction; LDC2018T12 Concretely Annotated New York Times

Introduction

The New York Times Annotated Corpus contains over 1.8 million articles written and published by the New York Times between January 1, 1987 and June 19, 2007 with article metadata provided by the New York Times Newsroom, the New York Times Indexing Service and the online production staff at nytimes.com. The corpus includes:

Over 1.8 million articles (excluding wire services articles that appeared during the covered period).

Over 650,000 article summaries written by library scientists.

Over 1,500,000 articles manually tagged by library scientists with tags drawn from a normalized indexing vocabulary of people, organizations, locations and topic descriptors.

Over 275,000 algorithmically-tagged articles that have been hand verified by the online production staff at nytimes.com.

Java tools for parsing corpus documents from .xml into a memory resident object.

As part of the New York Times' indexing procedures, most articles are manually summarized and tagged by a staff of library scientists. This collection contains over 650,000 article-summary pairs, which may prove useful in the development and evaluation of algorithms for automated document summarization. In addition, over 1.5 million documents have at least one tag. Articles are tagged for persons, places, organizations, titles and topics using a controlled vocabulary that is applied consistently across articles. For instance, if one article mentions "Bill Clinton" and another refers to "President William Jefferson Clinton", both articles will be tagged with "CLINTON, BILL".
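The effect of a controlled vocabulary can be illustrated with a small lookup. This is only a sketch: in the corpus the tags were assigned by library scientists, not by a table lookup, and the alias table, `normalize_mention` and `tag_article` below are hypothetical names introduced for illustration.

```python
from typing import Iterable, List, Optional

# Hypothetical alias table (an assumption for this sketch). The real
# indexing vocabulary is maintained by the New York Times and is not
# reproduced here.
ALIASES = {
    "bill clinton": "CLINTON, BILL",
    "president william jefferson clinton": "CLINTON, BILL",
}


def normalize_mention(mention: str) -> Optional[str]:
    """Map a surface form to its normalized tag, or None if unknown."""
    # Lowercase and collapse whitespace so trivial variants share a key.
    key = " ".join(mention.lower().split())
    return ALIASES.get(key)


def tag_article(mentions: Iterable[str]) -> List[str]:
    """Return the sorted, de-duplicated normalized tags for one article."""
    tags = {normalize_mention(m) for m in mentions}
    tags.discard(None)  # drop mentions that are not in the vocabulary
    return sorted(tags)


if __name__ == "__main__":
    print(tag_article(["Bill Clinton"]))
    print(tag_article(["President William Jefferson Clinton"]))
```

Because both surface forms resolve to the same tag, a query on the normalized tag retrieves both articles, which is the property that makes such tags useful for information retrieval experiments.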

The New York Times has established a community website for researchers working on the data set and encourages feedback and discussion about the corpus.

Data

The text in this corpus is formatted in News Industry Text Format (NITF) developed by the International Press Telecommunications Council, an independent association of news agencies and publishers. NITF is an XML specification that provides a standardized representation for the content and structure of discrete news articles. NITF encompasses structural markup such as bylines, headlines and paragraphs. The format also provides management attributes for categorizing articles into topics, summarization usage restrictions and revision histories. The goals of NITF are to answer the essential questions inherent in news articles: Who, What, When, Where and Why.

Who: Who owns the copyright, who has rights to republish the article and who the article is about.

What: The subjects reported, the named entities inside the article and the events it describes.

When: When the article was written, when it was issued and when it was revised.

Where: Where the article was written, where the events took place and where it was delivered.

Why: The metadata describing the newsworthiness of the article.
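As a rough illustration of how an NITF document might be read into a memory-resident object, the following Python sketch parses a small hand-written sample. It is not the Java tooling shipped with the corpus. The sample document is invented for this example; its element names follow NITF conventions, but the exact layout of corpus files is an assumption here and should be checked against the corpus documentation. The `Article` class and `parse_nitf` function are hypothetical names.

```python
import xml.etree.ElementTree as ET
from dataclasses import dataclass, field
from typing import List, Optional

# Hand-written sample (an assumption, not an actual corpus document).
SAMPLE_NITF = """<nitf>
  <head>
    <title>Example Headline</title>
    <docdata>
      <doc-id id-string="0000000"/>
      <identified-content>
        <person class="indexing_service">CLINTON, BILL</person>
        <location class="indexing_service">WASHINGTON (DC)</location>
      </identified-content>
    </docdata>
    <pubdata date.publication="19870101T000000"/>
  </head>
  <body>
    <body.head>
      <hedline><hl1>Example Headline</hl1></hedline>
      <byline>By A. REPORTER</byline>
    </body.head>
    <body.content>
      <block class="full_text">
        <p>First paragraph.</p>
        <p>Second paragraph.</p>
      </block>
    </body.content>
  </body>
</nitf>
"""


@dataclass
class Article:
    """Hypothetical in-memory view of one article."""
    doc_id: Optional[str] = None            # identifier of the document
    headline: Optional[str] = None          # What
    byline: Optional[str] = None            # Who (wrote it)
    publication_date: Optional[str] = None  # When
    people: List[str] = field(default_factory=list)     # Who (it is about)
    locations: List[str] = field(default_factory=list)  # Where
    paragraphs: List[str] = field(default_factory=list)


def _text(node: Optional[ET.Element]) -> Optional[str]:
    """Return stripped element text, or None if the element is missing or empty."""
    return node.text.strip() if node is not None and node.text else None


def _texts(root: ET.Element, path: str) -> List[str]:
    """Collect the stripped text of every non-empty element matching path."""
    return [n.text.strip() for n in root.iterfind(path) if n.text]


def parse_nitf(xml_text: str) -> Article:
    """Parse an NITF string into an Article, tolerating missing elements."""
    root = ET.fromstring(xml_text)
    doc_id = root.find("head/docdata/doc-id")
    pubdata = root.find("head/pubdata")
    return Article(
        doc_id=doc_id.get("id-string") if doc_id is not None else None,
        headline=_text(root.find("body/body.head/hedline/hl1")),
        byline=_text(root.find("body/body.head/byline")),
        publication_date=(
            pubdata.get("date.publication") if pubdata is not None else None
        ),
        people=_texts(root, "head/docdata/identified-content/person"),
        locations=_texts(root, "head/docdata/identified-content/location"),
        paragraphs=_texts(root, "body/body.content/block/p"),
    )


if __name__ == "__main__":
    print(parse_nitf(SAMPLE_NITF))
```

Every field is optional or defaults to an empty list, so a document that lacks a byline or identified content still parses instead of raising an error.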

Samples

Please view the following sample.

Updates

A revised manual is now available.
