Tuesday, May 29, 2012

EDRM Processing - Metrics

Processing - Metrics

One of the biggest challenges in dealing with electronic data is estimating the volume when all that is known is the total GB to process. Because the overall volume has a significant impact on the project as a whole, it is important to understand the circumstances that drive that estimate.

Means of Measuring

Pages

In many cases, the overall review time and cost for a project can be determined by the total number of pages that will be reviewed and eventually produced. The more you know about the collection, the better this can be estimated. If you can separate the total volume and identify the amounts of email data, application data, and non-printable data, you can get a more accurate estimate than you would from volume alone.

Number of Documents

Another important driver of document review effort is the number of documents that will be reviewed, so estimating it yields a valuable statistic. Although there are quick ways to identify the number of documents in the collection, it is more challenging to quickly identify the documents that will be removed by the culling process.

Culling Rate

The amount of deduplication can vary greatly based on the nature of the data (backups, live data, or a combination), the scope of the deduplication (within or across custodian), and the custodian retention habits.
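
To illustrate how the scope of deduplication changes the culling rate, here is a minimal sketch. It assumes duplicates are detected by comparing MD5 hashes of file contents; the `Doc` record, the function names, and the sample data are hypothetical and not part of the EDRM material.

```python
import hashlib
from collections import namedtuple

# Hypothetical record type: a custodian name plus the raw bytes of one file.
Doc = namedtuple("Doc", ["custodian", "content"])

def deduplicate(docs, scope="across"):
    """Keep the first copy of each file; scope is 'within' or 'across' custodians."""
    if scope not in ("within", "across"):
        raise ValueError("scope must be 'within' or 'across'")
    seen = set()
    kept = []
    for doc in docs:
        digest = hashlib.md5(doc.content).hexdigest()
        # Within-custodian scope only treats copies held by the same person as duplicates.
        key = (doc.custodian, digest) if scope == "within" else digest
        if key not in seen:
            seen.add(key)
            kept.append(doc)
    return kept

def dedup_rate(docs, scope="across"):
    """Fraction of documents removed by deduplication."""
    return 1 - len(deduplicate(docs, scope)) / len(docs)

docs = [
    Doc("alice", b"quarterly report"),
    Doc("alice", b"quarterly report"),   # duplicate within alice's data
    Doc("bob", b"quarterly report"),     # duplicate only across custodians
    Doc("bob", b"meeting notes"),
]
print(dedup_rate(docs, "within"))  # 0.25
print(dedup_rate(docs, "across"))  # 0.5
```

On the same four files, deduplicating across custodians removes twice as much as deduplicating within each custodian, which is why the scope needs to be settled before estimating.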

Searching/filtering is another important aspect to consider when estimating the overall volume that will be delivered for review. Depending on the number of terms and the nature of the documents, the results can vary greatly.

Non-Printable Files

Non-printable files are documents that in general will not be delivered or reviewed. Therefore it is important to exclude them from the document/GB/page estimates in order to yield more accurate results.

Industry Benchmark Survey

The table below lists some industry averages that can be used as a tool for guidance for estimating a document collection:

Benchmark	Value
	High	Median	Low
Images *[1] per GB	78,671	47,213	18,534
Images per file email	11	4	2
Images per file app files	63	10	3
Files per GB email	36,530	22,572	9,934
Files per GB app files	20,305	15,791	7,553
GB per custodian email	5	2	1
GB per custodian app files	4	1	0
Culling Rate Percentages
Deduplication	51%	21%	6%
Searching/Filtering	64%	61%	23%
Non-printable files	22%	5%	2%
Processing Speeds
Process time per GB native	117	33	11
Process time per GB image	35	32	23
Process time to first deliverable	53	35	21
Process time by file type	4	3	2
Process time by file type	6	4	3
Process time by file type	2	3	2
Quality
First pass quality yield % *[2]	57%	78%	73%
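
As a rough illustration of how these benchmarks might be combined, the sketch below estimates the files and page images left for review from a GB figure. It rests on two assumptions that the table does not state: each culling percentage is the share removed at that step, and the steps are applied one after another. The function and variable names are hypothetical.

```python
# Median values copied from the benchmark table above.
FILES_PER_GB = {"email": 22572, "app": 15791}
IMAGES_PER_FILE = {"email": 4, "app": 10}
MEDIAN_CULL_RATES = {"dedup": 0.21, "search_filter": 0.61, "non_printable": 0.05}

def estimate_review_volume(gb_by_type, cull_rates=MEDIAN_CULL_RATES):
    """Estimate files and page images remaining for review after culling.

    Assumption: each culling rate is the fraction removed at that step,
    and the steps are applied one after another.
    """
    surviving = 1.0
    for rate in cull_rates.values():
        surviving *= 1 - rate
    files = 0.0
    images = 0.0
    for kind, gb in gb_by_type.items():
        kept = gb * FILES_PER_GB[kind] * surviving
        files += kept
        images += kept * IMAGES_PER_FILE[kind]
    return round(files), round(images)

# One custodian at the median: 2 GB of email and 1 GB of application files.
files, images = estimate_review_volume({"email": 2, "app": 1})
print(files, images)
```

Swapping in the high or low values from the table gives a range rather than a single figure, which is usually the more honest way to present an estimate.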

Paper-to-Electronic Estimate Conversion Table

Boxes of Documents	Approximate Total Pages	Megabytes, Gigabytes, Terabytes
1	2,500	50 Megabytes
10	25,000	500 Megabytes
20	50,000	1 Gigabyte
100	250,000	5 Gigabytes
200	500,000	10 Gigabytes
300	750,000	15 Gigabytes
400	1,000,000	20 Gigabytes
500	1,250,000	25 Gigabytes
1,000	2,500,000	50 Gigabytes
2,000	5,000,000	100 Gigabytes
5,000	12,500,000	250 Gigabytes
10,000	25,000,000	500 Gigabytes
20,000	50,000,000	1 Terabyte
40,000	100,000,000	2 Terabytes
60,000	150,000,000	3 Terabytes
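
The table follows a fixed ratio of about 2,500 pages and 50 megabytes per box, with decimal units (1,000 MB per GB). A small sketch of that arithmetic, with a hypothetical function name, covers box counts the table does not list:

```python
PAGES_PER_BOX = 2500   # from the table: 1 box is about 2,500 pages
MB_PER_BOX = 50        # from the table: 1 box is about 50 megabytes

def boxes_to_estimate(boxes):
    """Return (pages, size string) for a number of boxes of paper."""
    pages = boxes * PAGES_PER_BOX
    megabytes = boxes * MB_PER_BOX
    # The table uses decimal units: 1,000 MB per GB and 1,000 GB per TB.
    if megabytes >= 1_000_000:
        size = f"{megabytes / 1_000_000:g} TB"
    elif megabytes >= 1_000:
        size = f"{megabytes / 1_000:g} GB"
    else:
        size = f"{megabytes:g} MB"
    return pages, size

for boxes in (10, 300, 40_000):
    print(boxes, *boxes_to_estimate(boxes))
```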

Footnotes

*[1] Images are counted one per page, so that a 4-page multi-page TIFF would count as 4 images.
*[2] The percentage of data that runs through without intervention or exception handling.

EDRM Processing - Metrics. Retrieved May 29, 2012

Source: http://edrm.net/wiki2/index.php/Processing_-_Metrics#Means_of_Measuring

Monday, May 21, 2012

EDRM Processing - Data Conversion

Processing - Data Conversion

Once a variety of search strategies have been implemented as provided above, the documents and/or identified information may be staged for review depending on the instructions of the legal team. Those instructions typically are a function of the technical and human resources available to the firm or client. A client has a number of options for conducting this next phase, ranging from review of documents in their original native format to review of materials in quasi-paper formats, such as TIFF (Tagged Image File Format, usually represented by a file extension of .tif) or PDF (Portable Document Format, usually represented by a file extension of .pdf). Using these quasi-paper formats - image formats - became standard because the formats could be used both for review and as a production format that was considered unalterable. To accommodate a more efficient review, these image formats often are accompanied by files containing the text of the document. Document review systems built around quasi-paper formats often connect the document’s images to text and metadata from the document as well as, in some situations, to copies of the files themselves.

An alternative to converting electronic documents to images prior to review is to perform an initial review of documents in their native format. It has been estimated that 80% or more of reviewed material ultimately is deemed irrelevant to the legal matter, resulting in wasted conversion fees. If a converted format is preferred for production, this approach enables the review team to only convert what is relevant, non-privileged or otherwise to be produced.
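
The effect of reviewing natively first can be shown with simple arithmetic. The sketch below takes the 80% figure from the paragraph above as its default; the collection size and the function name are hypothetical.

```python
def pages_to_convert(total_pages, irrelevant_share=0.80, native_first=True):
    """Pages that must be imaged under each workflow."""
    if native_first:
        # Only material that survives native review is converted.
        return round(total_pages * (1 - irrelevant_share))
    return total_pages

total = 1_000_000
upfront = pages_to_convert(total, native_first=False)
after_review = pages_to_convert(total, native_first=True)
print(upfront, after_review, upfront - after_review)  # 1000000 200000 800000
```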

To accommodate native file review, many service and software processing providers have developed technologies to provide reviewers the ability to review native files after the metadata has been preserved and linked to the document. Some allow native files to be opened and viewed in their native application, while others allow documents to be viewed by using viewer technology. The technology that is chosen must be determined by the requirements of the case and the processing constraints (scope and schedule).

In addition to native file and image review, there has been a recent interest in native file productions. Historically, paper productions represented the most common method of providing document collections to the opposing side. As lawyers and judges have become more educated on the benefits of electronic productions, from both cost and review perspectives, image productions have become more prevalent. Today, some regulatory bodies have been requiring productions to be made in native format. Under the proposed amendments to the Federal Rules of Civil Procedure, parties are to discuss the intended form or forms of production during the initial Rule 26(f) discovery conference. Determining production formats at an early stage of the discovery process may influence the review formats needed. Additionally, the requirement that files be made available consistent with the manner in which they were maintained (or “as ordinarily maintained”) further can be interpreted to support a native file production. Recent case law relating specifically to Excel spreadsheets provides an indication of this trend. Review and production formats will be decided by the court, if not the parties, based on an analysis of how the information is kept and how the production format aids in ensuring the just, speedy, and efficient conclusion of a matter.

Whether processing documents to image for review or production purposes, it is important to understand the details. The goal of creating a printable image of an electronic document is to render the document in a non-modifiable form that allows all document contents to be reviewed. In processing documents to an image format, some software and service providers use viewer technology to determine the rendering of the file. Viewer technology allows a variety of application files to be viewed without using the native applications. This can be useful in avoiding significant application license fees and increasing the speed with which the contents of a file can be viewed. These efficiencies are gained at the expense of completeness. No viewer renders all of the underlying application data. Other software and service providers use the native applications to render the information contained within the file.

User-created information can be nested within file types in ways that are not immediately apparent to the reader. For example, in Word documents, comments can be stored in a document, but the print range can be set to include comments or not. Similarly, comments in Excel spreadsheets may not be easily seen without specifically formatting the print range to include those items. Also, in spreadsheets, entire pages of a worksheet may be hidden or protected. It is crucial to unhide and unprotect this information to reveal all the contents within the file for review purposes.

Frequently, users protect files or sub components of files (e.g., sheets or cells in a spreadsheet). It is important to unprotect such files by cracking passwords. This process must occur prior to the application of any culling strategies, including keyword search, if the responsive dataset is to be complete. Files that are protected and are not successfully cracked should be segregated and reported to the client.

Once the image format has been created, the images can be delivered along with the text that has been extracted for each file and its metadata information. As compared with paper productions, the electronic information deemed responsive to the searches and culling strategies is bundled into something called a “batch load” and subsequently delivered to the client or law firm. Once there, the electronic package will be placed into a discovery document management system, such as Concordance or Summation, which allows the litigation team to run multiple search queries to identify responsive documents and prepare the strategy for the case. Each image in the collection is given a unique identifier, typically a Bates Number. This information can also be packaged for production by a processing software or service provider. Production sets can include images with Bates Numbers for tracking purposes and various endorsements based on the specific case matter, or native files with their metadata preserved.

EDRM Processing - Data Conversion. Retrieved May 21, 2012

Source: http://edrm.net/wiki2/index.php/Processing_-_Data_Conversion

Monday, May 14, 2012

Helpful Electronic Discovery Reference Model Definitions

The purpose of this document is to outline standards for production of electronically stored information in discovery. The intent is for these standards to be easily communicated by attorneys at a meet and confer by referring to the category of production. The following definitions are provided regarding the forms of production (See the EDRM Production Guide for further clarification on the forms of production, http://edrm.net/resources/guides/edrm-framework-guides/production):

Native Format – Files are produced in the format in which they were originally created (Example: .docx produced in .docx; .pdf produced in .pdf, etc.)
Near-Native Format – Files are extracted or converted into another searchable format (Example: e-mails produced in .htm, .mht, or .rtf; Databases produced in .txt or .csv format)
Image (Near Paper) Format – Electronic files are converted to image format or paper is scanned to image format
Paper – Electronic files are printed to paper or paper files remain in paper format

The categories of production identified below include A1, A2, B1, B2, C1, C2, D and E. The descriptions of the standards are followed by a Quick Guide to Components of Productions A-D, a chart containing the Characteristics of Productions A-D and a chart containing the required metadata and other information fields. In addition to agreeing to one of these standards, the requesting party should tell the producing party which review tool they will be using. This information is needed to properly identify the components and formats required to successfully load the information into a review tool.

A. Native/Near-Native Production

E-mail, databases and proprietary files are produced in a near native format. Attachments and loose files are produced in native format. Only files requiring redaction are tiffed.

Includes searchable text for redacted files:
1. Each native/near-native file name matches the DocID. (I.e. DocID = ABC0000123; Filename = ABC0000123.doc for MS Word document.)
2. Each searchable native/near native file has an extracted text file in .txt format named with the DocID of the corresponding file. Each non-searchable file containing text has a multipage OCR text file named with the DocID of the corresponding file. (I.e. DocID = ABC0000123; Filename = ABC0000123.txt.)
3. Each file requiring redaction has group IV single page tifs. Each file requiring redaction has a unique bates number applied to images matching the DocID or Bates number. The same number may be applied to each page within a document or the numbers can increment by page.
4. OCR for redacted files in multipage .txt format. Each file named the same as the DocID/Bates number of the corresponding document. (I.e. Image Filename = ABC0000123.tif; OCR Filename = ABC0000123.txt.)
5. Load file(s) for native/near-native, images, extracted text and OCR files in EDRM xml or common format such as that required by Concordance or Summation.
6. Data file including, at a minimum, the standard EDRM extracted metadata and other information fields to the extent they exist (see chart below). This data may be included in load file or produced as a separate text delimited file.
Does not include searchable text for redacted files:
1. Each native/near-native file name matches the DocID. (I.e. DocID = ABC0000123; Filename = ABC0000123.doc for MS Word document.)
2. Each searchable native/near native file has an extracted text file in .txt format named with the DocID of the corresponding file. Each non-searchable file containing text has a multipage OCR text file named with the DocID of the corresponding file. (I.e. DocID = ABC0000123; Filename = ABC0000123.txt.)
3. Each file requiring redaction has group IV single page tifs. Each file requiring redaction has a unique bates number applied to images matching the DocID or Bates number. The same number may be applied to each page within a document or the numbers can increment by page.
4. Load file(s) for native/near-native, images, extracted text and OCR files in EDRM xml or common format such as that required by Concordance or Summation.
5. Data file including, at a minimum, the standard EDRM extracted metadata and other information fields to the extent they exist (see chart below). This data may be included in load file or produced as a separate text delimited file.
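
The naming rules above can be sketched in a few lines. This is only an illustration: the function names, the three-letter prefix, and the seven-digit zero padding are assumptions read off the ABC0000123 example, and naming each single-page tif after its page's Bates number is one of the two numbering options the standard allows.

```python
def production_names(doc_id, native_ext):
    """File names for one produced document, all keyed on the DocID."""
    return {
        "native": f"{doc_id}.{native_ext}",
        "text": f"{doc_id}.txt",
    }

def page_bates_numbers(prefix, start, page_count, width=7):
    """Bates numbers that increment by page, e.g. ABC0000123, ABC0000124, ..."""
    return [f"{prefix}{start + i:0{width}d}" for i in range(page_count)]

print(production_names("ABC0000123", "doc"))
# {'native': 'ABC0000123.doc', 'text': 'ABC0000123.txt'}

# Single-page tifs for a three-page redacted document.
print([f"{b}.tif" for b in page_bates_numbers("ABC", 123, 3)])
# ['ABC0000123.tif', 'ABC0000124.tif', 'ABC0000125.tif']
```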

B. Image (Near-Paper)/Native/Near-Native Production

Most files are converted to image format (tif, pdf, etc.), with the exception of files such as MS Excel spreadsheets that are not usable in image format, and/or paper is scanned to image format and OCR’d.

Includes searchable text for redacted files:
1. Most Native/near native files are converted to group IV single page tif. Each file has a unique bates number applied to images matching the DocID or Bates number.
2. Each searchable native/near native file has an extracted text file in .txt format named with the DocID of the corresponding file. Each non-searchable file containing text has a multipage OCR text file named with the DocID of the corresponding file. (I.e. DocID = ABC0000123; Filename = ABC0000123.txt.)
3. Spreadsheets and files that are not usable in .tif format are produced in native or near-native format and named the same as the Doc ID. (I.e. DocID = ABC0000123; Filename = ABC0000123.xls for MS Excel document.)
4. OCR for redacted files in multipage .txt format. Each file named the same as the DocID/Bates number of the corresponding document. (I.e. Image Filename = ABC0000123.tif; OCR Filename = ABC0000123.txt.)
5. Load file(s) for native/near-native, images, extracted text and OCR files in EDRM xml or common format such as that required by Concordance or Summation.
6. Data file including, at a minimum, the standard EDRM extracted metadata and other information fields to the extent they exist (see chart below). This data may be included in load file or produced as a separate text delimited file.
Does not include searchable text for redacted files:
1. Most Native/near native files are converted to group IV single page tif. Each file has a unique bates number applied to images matching the DocID or Bates number.
2. Each searchable native/near native file has an extracted text file in .txt format named with the DocID of the corresponding file. Each non-searchable file containing text has a multipage OCR text file named with the DocID of the corresponding file. (I.e. DocID = ABC0000123; Filename = ABC0000123.txt.)
3. Spreadsheets and files that are not usable in .tif format will be produced in native or near-native format and named the same as the Doc ID. (I.e. DocID = ABC0000123; Filename = ABC0000123.xls for MS Excel document.)
4. Load file(s) for native/near-native, images, extracted text and OCR files in EDRM xml or common format such as that required by Concordance or Summation.
5. Data file including, at a minimum, the standard EDRM extracted metadata and other information fields to the extent they exist (see chart below). This data may be included in load file or produced as a separate text delimited file.

C. Image Production

All files are converted to image format (tif, pdf, etc.) and/or paper is scanned to image format and OCR’d.

Includes searchable text for redacted files:
1. All Native/near native files are converted to group IV single page tif. Each file has a unique bates number applied to images matching the DocID or Bates number.
2. All images are black & white except for those that require color for interpretation. Color images are produced in .jpg format unless otherwise agreed.
3. Container files such as .zip or .rar may be converted to .tif format with a table of contents or referenced in the “folder” field containing the path to the original native file as it existed at the time of collection.
4. Each searchable native/near native file has an extracted text file in .txt format named with the DocID of the corresponding file. Each non-searchable file containing text has a multipage OCR text file named with the DocID of the corresponding file. (I.e. DocID = ABC0000123; Filename = ABC0000123.txt.)
5. OCR for redacted files in multipage .txt format. Each file named the same as the DocID/Bates number of the corresponding document. (I.e. Image Filename = ABC0000123.tif; OCR Filename = ABC0000123.txt.)
6. Load file(s) for image files, extracted text and OCR in EDRM xml or common format such as that required by Concordance or Summation.
7. Data file including, at a minimum, the standard EDRM extracted metadata and other information fields to the extent they exist (see chart below). This data may be included in load file or produced as a separate text delimited file.
Does not include searchable text for redacted files:
1. All Native/near native files are converted to group IV single page tif. Each file has a unique bates number applied to images matching the DocID or Bates number.
2. Each searchable native/near native file has an extracted text file in .txt format named with the DocID of the corresponding file. Each non-searchable file containing text has a multipage OCR text file named with the DocID of the corresponding file. (I.e. DocID = ABC0000123; Filename = ABC0000123.txt.)
3. Load file(s) for image files, extracted text and OCR in EDRM xml or common format such as that required by Concordance or Summation.
4. Data file including, at a minimum, the standard EDRM extracted metadata and other information fields to the extent they exist (see chart below). This data may be included in load file or produced as a separate text delimited file.

D. Custom

Images, Load File, Data file and no searchable text
Images only
Paper
Other

E. On-line Production

Files presented for production via online review tool. Formats, fields, loads and exports to be negotiated on a case by case basis.

Quick Guide to Components of Productions A-D

Production	Native	Near Native	Images	Extracted Text	OCR Text	Searchable Text for Redacted Files	Load File	Data File
A1	x	x	x	x	x	x	x	x
A2	x	x	x	x	x		x	x
B1	x	x	x	x	x	x	x	x
B2	x	x	x	x	x		x	x
C1			x	x	x	x	x	x
C2			x	x	x		x	x
D1			x				x	x
D2			x
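
For anyone who wants to query the Quick Guide programmatically, for example to confirm what a negotiated production category should contain, the table can be transcribed into a small lookup. The structure and function names below are hypothetical; the values are copied from the rows above.

```python
COLUMNS = ["Native", "Near Native", "Images", "Extracted Text", "OCR Text",
           "Searchable Text for Redacted Files", "Load File", "Data File"]

# One row per production, transcribed from the Quick Guide (1 = "x").
QUICK_GUIDE = {
    "A1": [1, 1, 1, 1, 1, 1, 1, 1],
    "A2": [1, 1, 1, 1, 1, 0, 1, 1],
    "B1": [1, 1, 1, 1, 1, 1, 1, 1],
    "B2": [1, 1, 1, 1, 1, 0, 1, 1],
    "C1": [0, 0, 1, 1, 1, 1, 1, 1],
    "C2": [0, 0, 1, 1, 1, 0, 1, 1],
    "D1": [0, 0, 1, 0, 0, 0, 1, 1],
    "D2": [0, 0, 1, 0, 0, 0, 0, 0],
}

def components(production):
    """Components included in a given production category."""
    return [c for c, flag in zip(COLUMNS, QUICK_GUIDE[production]) if flag]

def productions_with(component):
    """Production categories that include a given component."""
    i = COLUMNS.index(component)
    return [p for p, row in QUICK_GUIDE.items() if row[i]]

print(components("D1"))
# ['Images', 'Load File', 'Data File']
print(productions_with("Searchable Text for Redacted Files"))
# ['A1', 'B1', 'C1']
```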

Characteristics of Productions A-D

Characteristics	A1	A2	B1	B2	C1	C2	D1	D2	D3
Increase costs for image conversion			x	x	x	x	x	x	x
Increase turn around time for image conversion of majority of data set			x	x	x	x	x	x	x
Increase cost and turn around time for OCRing redacted files	x		x		x
Files are not searchable							x	x	x
Files such as spreadsheets and small databases are not in a format conducive for review					x	x	x	x	x
Cannot individually number or endorse pages for document control	x	x	x	x
Cannot brand pages with confidentiality endorsements	x	x	x	x
Risk of accidental alteration is greater than with image format	x	x	x	x
Metadata may be hidden and not fully reviewed prior to production	x	x	x	x
May require native application or provision of client’s proprietary software to open files	x	x	x	x
Cost of conversion and printing									x
No link back to native file								x	x
No database or text for searching								x	x

Metadata and Other Information Fields

Fields for email (Not All Inclusive)	Description
ATTACHMENTIDS	Docids of attachment(s) to email/edoc
BATES RANGE	Begin and end bates number of a document if it differs from DocID; this can be provided in one bates range field or 2 separate fields for the beginning and ending number
BCC	Names of persons blind copied on an email
CC	Names of persons copied on an email
CUSTODIAN	Name of person from whom the file was obtained
DATERECEIVED	Date email was received
DATESENT	Date email was sent
DOCEXT	Extension of native document
DOCID	Unique number assigned to each file or first page
DOCLINK	Full relative path to the current location of the native or near-native document used to link metadata to native or near native file
FILENAME	Name of the original native file as it existed at the time of collection
FOLDER	File path/folder structure for the original native file as it existed at the time of collection
FROM	Name of person sending an email
HASH	Identifying value of an electronic record – used for deduplication and authentication; hash value is typically MD5 or SHA1
PARENTID	DocId of the parent document
RCRDTYPE	Indicates document type, i.e., email; attachment; edoc; scanned; etc.
SUBJECT	Subject line of an email
TIMERECEIVED	Time email was received in user’s mailbox
TIMESENT	Time email was sent
TO	Name(s) of person(s) receiving email

Fields for edocs & Attachments (Not All Inclusive)	Description
ATTACHMENTIDS	DocIds of attachment(s) to email/edoc
AUTHORS	Name of person creating document
BATES RANGE	Begin and end bates number of a document if it differs from DocID; this can be provided in one bates range field or 2 separate fields for the beginning and ending number
CUSTODIAN	Name of person from whom the file was obtained
DATECREATED	Date document was created
DATESAVED	Date document was last saved
DOCEXT	Extension of native document
DOCID	Unique number assigned to each file or first page
DOCLINK	Full relative path to the current location of the native or near-native document used to link metadata to native or near native file
DOCTITLE	Title given to native file
FILENAME	Name of the original native file as it existed at the time of collection
FOLDER	File path/folder structure for the original native file as it existed at the time of collection
HASH	Identifying value of an electronic record – used for deduplication and authentication; hash value is typically MD5 or SHA1
PARENTID	DocId of the parent document
RCRDTYPE	Indicates document type, i.e., email; attachment; email attachment (email); edoc; scanned; etc.
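
The standard allows the metadata to be delivered as a separate delimited text file. A minimal sketch of writing one with Python's `csv` module follows. The pipe delimiter, the subset of fields, and the sample record are assumptions for illustration only; the delimiters a given review tool expects should be confirmed with the requesting party.

```python
import csv
import io

# A subset of the email fields listed above; the sample values are made up.
FIELDS = ["DOCID", "PARENTID", "CUSTODIAN", "FROM", "TO", "SUBJECT", "DATESENT", "DOCLINK"]

records = [
    {"DOCID": "ABC0000123", "CUSTODIAN": "Doe, Jane", "FROM": "Jane Doe",
     "TO": "John Roe", "SUBJECT": "Draft | v2", "DATESENT": "2012-05-14",
     "DOCLINK": "NATIVES/ABC0000123.htm"},
]

def write_data_file(rows, out, delimiter="|"):
    """Write a header line and one delimited line per document."""
    # restval="" leaves fields that do not exist for a document empty.
    writer = csv.DictWriter(out, fieldnames=FIELDS, delimiter=delimiter,
                            restval="", lineterminator="\n")
    writer.writeheader()
    writer.writerows(rows)

buffer = io.StringIO()
write_data_file(records, buffer)
print(buffer.getvalue())
```

The `csv` writer quotes any value that contains the delimiter, so a subject line such as "Draft | v2" survives a round trip through the file.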

EDRM Production Standards. (2011), Retrieved May 14, 2012
Lead author: Julie Brown (Vorys, Sater, Seymour and Pease LLP)
Updated February 10, 2011
Source: http://www.edrm.net/resources/standards/production
