Semi Structured Data Examples: Unlock Their Potential

Your team probably isn’t stuck because it lacks data. It’s stuck because the data arrives in formats your systems don’t love.
Invoices land as PDFs with different layouts. Purchase orders show up as email attachments. Shipping paperwork gets scanned on a warehouse floor. A bank statement export looks machine-readable until you try to map it into your ERP. None of that fits neatly into rows and columns at the start, but it isn’t random either. It has labels, patterns, sections, metadata, and repeatable fields. That’s semi structured data.
This matters more than is often recognized. Approximately 90% of enterprise data is unstructured, and semi-structured formats have become the practical bridge between rigid databases and messy real-world documents. In operations, that bridge is where the work happens. It’s where AP teams key in invoice fields, where freight staff retype manifest details, and where procurement teams clean supplier files before they can load anything into an ERP.
If you’re evaluating ERP Systems In Australia, this is the part that usually gets underestimated. ERP projects often assume the source data will arrive clean. In practice, the hard part is converting semi-structured inputs into something the ERP can trust.
Below are eight semi structured data examples that show up constantly in business workflows. For each one, I’ll break down why it counts as semi-structured, where teams usually get burned during parsing, and how to turn it into a structured asset with a practical automation approach. The goal isn’t just to define formats. It’s to help you stop treating incoming documents as admin clutter and start treating them as system-ready data.
1. XML (eXtensible Markup Language)
XML is one of the clearest semi structured data examples because it looks organized immediately, but it still leaves room for variation. You get tags, nested elements, attributes, and metadata. You don’t always get consistency across trading partners, versions, or ERP exports.
In procurement, logistics, and finance, XML usually appears in document exchange rather than in human workflow. A supplier sends an invoice in UBL XML. A bank publishes statement feeds in XML. SAP outputs a record in XML for downstream processing. Every one of those files has structure, but the field names, namespaces, optional elements, and hierarchy can differ enough to break a brittle import.

Where XML goes wrong in operations
The problem usually isn’t that XML is hard to parse. It’s that teams assume one XML file represents all XML files.
A freight forwarder might receive shipment messages from multiple parties that all use different tag conventions. A finance team might get bank statement XML that technically validates, but the downstream accounting import expects a different structure. An AP team might have invoice XML with missing optional elements, which is valid for the sender but unusable for the receiver.
Common trouble spots include:
- Namespace collisions: Different schemas can use similar tag names that mean different things.
- Optional fields: One sender includes tax breakdowns and another doesn’t.
- Nested repetition: Line items, charges, and references can repeat in ways flat systems don’t handle well.
- Schema drift: A partner updates its XSD, and your import fails quietly.
**Practical rule:** Validate XML before you transform it. Bad XML pushed downstream becomes a reconciliation problem, not just a parsing problem.
What works when you automate XML
Teams get better results when they treat XML as an integration format, not as a final business format. Parse it, validate it against the correct schema, normalize naming, and then map it into a canonical structure your ERP or TMS can ingest.
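For example, here is a minimal sketch of that parse-validate-normalize flow in Python using lxml. The file names, the UBL 2 namespaces, and the element paths are illustrative; substitute your trading partner’s actual schema and paths.

```python
# Minimal sketch: validate first, then map into a canonical structure.
# Requires lxml (pip install lxml). Namespaces below are UBL 2 examples.
from lxml import etree

NS = {
    "cbc": "urn:oasis:names:specification:ubl:schema:xsd:CommonBasicComponents-2",
    "cac": "urn:oasis:names:specification:ubl:schema:xsd:CommonAggregateComponents-2",
}

def parse_invoice_xml(xml_path: str, xsd_path: str) -> dict:
    # 1. Validate: reject bad files before they become reconciliation problems.
    schema = etree.XMLSchema(etree.parse(xsd_path))
    doc = etree.parse(xml_path)
    if not schema.validate(doc):
        raise ValueError(f"Schema validation failed: {schema.error_log}")

    # 2. Map into your own canonical field names.
    def text(path):  # returns None for optional elements that are absent
        node = doc.find(path, namespaces=NS)
        return node.text if node is not None else None

    return {
        "invoice_number": text("cbc:ID"),
        "issue_date": text("cbc:IssueDate"),
        "total_payable": text("cac:LegalMonetaryTotal/cbc:PayableAmount"),
    }
```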
If your source is still a PDF and your target is XML, the useful move is to standardize the output rather than mirror the document’s layout. That’s where tools built for document extraction help. A practical example is converting semi-structured documents into system-ready XML using a workflow like convert PDF to XML.
Pretty-printing also helps during troubleshooting. Human-readable XML shortens the feedback loop between operations staff and technical teams because people can inspect whether the tax amount, PO number, or consignee data landed in the right node.
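If you are on Python 3.9 or later, the standard library can pretty-print in a few lines; the file names here are placeholders.

```python
# Pretty-print an XML file so operations staff can inspect it by eye.
import xml.etree.ElementTree as ET

tree = ET.parse("invoice.xml")  # hypothetical input file
ET.indent(tree, space="  ")     # available in Python 3.9+
tree.write("invoice_pretty.xml", encoding="utf-8", xml_declaration=True)
```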
2. JSON (JavaScript Object Notation)
If XML dominated older integrations, JSON dominates modern ones. It’s one of the most widely used formats for semi-structured data globally, and its use spans APIs, banking, social media, digital marketing, and IoT workflows, as outlined in AltexSoft’s overview of semi-structured data.
That popularity makes JSON one of the most practical semi structured data examples for business teams. It’s readable, portable, and flexible. A parsed invoice can become a JSON object. A TMS can return shipment events as JSON. A workflow tool can pass parsed fields to your accounting stack in JSON payloads.
Why JSON works so well
JSON gives you just enough structure without forcing every record into the same shape. A customer object can hold addresses, contacts, and an array of orders. A shipment object can include stops, references, and accessorials. A parser can return fields that exist on one document and leave absent fields empty or omitted on another.
That flexibility is exactly why operations teams like it. It handles real-world inconsistency better than rigid tabular formats.
It’s also why teams create a mess with it.
The trade-off most teams miss
JSON is forgiving at the file level and unforgiving at the pipeline level. If one system sends invoiceNumber and another sends invoice_no, both are valid JSON. Your ERP mapping still breaks.
The usual parsing issues are less about syntax and more about discipline:
- Inconsistent key names: Different teams label the same field in different ways.
- Over-nesting: Data becomes technically elegant but operationally painful to query.
- Mixed data types: One sender treats totals as numbers, another as strings.
- Sparse records: Optional fields make validation and downstream mapping harder.
A document extraction workflow often works best when JSON is the handoff layer. DigiParser-style output is useful here because it can turn invoices, bills of lading, or purchase orders into a stable schema that API-based systems can accept.
Keep the raw JSON for traceability. Map a cleaned version for the ERP. Mixing those two purposes in one payload creates avoidable headaches.
A practical automation tip
Use one canonical field dictionary across all outputs. Decide once what your keys are called, how dates are formatted, and whether nested arrays should be flattened before import. Then enforce that standard at the parser or middleware layer, not inside the ERP.
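A minimal sketch of that enforcement step follows, with example alias names and date formats; your field dictionary will differ.

```python
# Canonical field dictionary enforced at the middleware layer.
# Alias names, date formats, and field names are examples only.
from datetime import datetime

FIELD_ALIASES = {
    "invoiceNumber": "invoice_number",
    "invoice_no": "invoice_number",
    "invoiceDate": "invoice_date",
    "inv_date": "invoice_date",
    "total": "total_amount",
    "grandTotal": "total_amount",
}

DATE_FORMATS = ("%Y-%m-%d", "%d/%m/%Y", "%m/%d/%Y")

def normalize(record: dict) -> dict:
    out = {}
    for key, value in record.items():
        canon = FIELD_ALIASES.get(key, key)
        if canon == "invoice_date" and isinstance(value, str):
            for fmt in DATE_FORMATS:  # settle on one date format, once
                try:
                    value = datetime.strptime(value, fmt).date().isoformat()
                    break
                except ValueError:
                    continue
        if canon == "total_amount":
            value = float(value)  # one sender sends "1280.50", another 1280.5
        out[canon] = value
    return out

# Both payloads now map to the same ERP-ready shape:
print(normalize({"invoiceNumber": "INV-1001", "grandTotal": "1280.50"}))
print(normalize({"invoice_no": "INV-1002", "total": 990}))
```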
That’s the difference between “we output JSON” and “we have a usable data pipeline.”
3. CSV (Comma-Separated Values)
CSV looks simple, and that’s exactly why teams underestimate it.
CSV is often considered fully structured because it opens cleanly in Excel. In practice, many business CSV files are semi-structured at the edges. They often contain variable line item counts, inconsistent delimiters, embedded commas, extra header rows, summary footers, and blank columns that only exist for one export type.
That’s why CSV belongs on any serious list of semi structured data examples.
Where CSV causes operational friction
The most common use case is straightforward. You extract invoice fields, bill of lading details, resume attributes, or bank transactions, then push them into a CSV because the receiving system accepts flat files. That works well until one file contains a multiline address, a product description with commas, or an export generated in a different encoding.
The issues aren’t glamorous, but they’re expensive in time:
- Header mismatches: Imports fail because a column name changed slightly.
- Escaping problems: Commas and line breaks split fields incorrectly.
- Encoding issues: Special characters render badly in downstream systems.
- Mixed row structures: Summary lines get imported as data rows.
Why teams still rely on it
CSV remains popular because every business system can do something with it. It’s the universal compromise format. It’s also a useful bridge when a team isn’t ready for a deeper API integration.
A finance team might batch extracted invoice line items into CSV for an accounting upload. A logistics team might export shipment references for a TMS import. An HR team might convert parsed candidate data into a CSV for an HRIS.
That bridge is useful, but it needs guardrails.
The practical way to make CSV reliable
Start with strict headers and controlled exports. If your receiving system expects invoice_date, don’t let another workflow send date_of_invoice. Quote fields that contain commas or new lines. Use UTF-8 consistently. Test imports with representative files before rolling out high-volume automation.
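Here is what those guardrails can look like using Python’s standard csv module. The expected headers are examples; use your receiving system’s exact names.

```python
# CSV guardrails: fixed headers, explicit quoting, UTF-8, and a header
# check before import. Field names are illustrative.
import csv

EXPECTED_HEADERS = ["invoice_number", "invoice_date", "supplier_name", "total_amount"]

def write_import_file(rows: list[dict], path: str) -> None:
    with open(path, "w", newline="", encoding="utf-8") as f:
        writer = csv.DictWriter(f, fieldnames=EXPECTED_HEADERS,
                                quoting=csv.QUOTE_ALL, extrasaction="ignore")
        writer.writeheader()
        writer.writerows(rows)  # commas and line breaks stay safely quoted

def check_headers(path: str) -> None:
    with open(path, newline="", encoding="utf-8") as f:
        found = next(csv.reader(f))
        if found != EXPECTED_HEADERS:
            raise ValueError(f"Header mismatch: expected {EXPECTED_HEADERS}, got {found}")
```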
If your source documents are PDFs, an effective pattern is to extract first and shape the result into a clean import file, following a guide like how to convert PDF to CSV.
One more point matters here. CSV is often treated as the end state when it should be treated as a transport layer. Keep a richer source of truth, then generate CSV for the systems that still need it. That protects you when the ERP field map changes later.
4. EDI (Electronic Data Interchange) Documents
EDI is where semi-structured data gets very business-critical, very fast.
A lot of teams assume EDI is “structured” because the standards are formal. In reality, EDI documents are semi-structured in the way that matters operationally. They use standardized segments and rules, but they also include optional loops, partner-specific implementations, and contextual meaning that depends on the transaction set.
That’s why EDI imports often work perfectly in one partner relationship and fail in another.
What EDI looks like in the field
In freight, procurement, and retail supply chains, you’ll run into EDIFACT and X12 constantly. Examples include ORDERS messages for purchase orders, DESADV for shipment notices, and invoice transactions for billing and payment processing. The file may look cryptic to a human, but it’s still carrying recognizable business entities such as buyer, seller, line items, units, charges, and dates.
The challenge is never just reading the message. The challenge is interpreting it correctly for your own systems and workflows.
Why parsing EDI is harder than it appears
Teams usually get caught by partner variation. The standard says one thing. The implementation guide says another. The actual trading partner behavior says something else.
A few recurring pain points:
- Mandatory versus optional confusion: A field may be optional in the standard but required for your workflow.
- Segment reuse: The same segment can mean different things depending on context.
- Translator dependence: Legacy EDI maps often sit with one vendor or one internal specialist.
- Error handling gaps: Rejected messages don’t always feed back cleanly to operations.
EDI projects fail quietly. The file arrives, the translator accepts it, and the business process still breaks because one reference didn’t map where staff expected.
What works in practice
Map EDI to transaction intent, not just to field names. If a segment is carrying shipment references, ask where those references must land in your TMS. If an invoice includes allowances and charges, decide whether finance needs them summarized or itemized before import.
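To make “transaction intent” concrete, here is a deliberately tiny illustration of pulling business meaning out of EDIFACT segments. It ignores UNA service string advice, release characters, and partner-specific guides, all of which real translators must handle; the segment meanings follow common INVOIC usage and should be confirmed against your partner’s implementation guide.

```python
# Toy illustration only: map EDIFACT segments to business intent.
# BGM 380 is commonly a commercial invoice; DTM qualifier 137 is
# commonly the document date. Verify both against your partner guide.
SAMPLE = "UNH+1+INVOIC:D:96A:UN'BGM+380+INV-2024-0042+9'DTM+137:20240115:102'"

def segments(message: str):
    return [seg.split("+") for seg in message.strip("'").split("'")]

intent = {}
for seg in segments(SAMPLE):
    if seg[0] == "BGM":                    # document identification
        intent["document_type"] = seg[1]
        intent["invoice_number"] = seg[2]
    elif seg[0] == "DTM":                  # date/time/period
        qualifier, value, _fmt = seg[1].split(":")
        if qualifier == "137":
            intent["invoice_date"] = value

print(intent)
# {'document_type': '380', 'invoice_number': 'INV-2024-0042', 'invoice_date': '20240115'}
```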
For mixed partner ecosystems, a hybrid model often works better than ideological purity. Larger partners can stay on EDI. Smaller suppliers can submit PDFs, emails, or portal exports that a parsing platform converts into structured output. That keeps your inbound data model consistent without forcing every supplier into the same technical standard.
5. Email and Attachment Headers (MIME Format)
A lot of business operations still run through inboxes. That’s not elegant, but it’s real.
Email is one of the most overlooked semi structured data examples because people focus on the attachment and ignore the envelope. The body text, subject line, sender address, timestamps, reply chain, and MIME metadata all contain structure. When a supplier emails an invoice, the attachment matters. So does the subject, the sender domain, and the date received.
Why email is semi-structured, not just communication
Email has defined fields and transport rules, but the content remains flexible. One vendor writes “Invoice Attached.” Another sends a reply to an old thread with three attachments and no clear note. A customer sends a PO confirmation in the body and a separate PDF with line-level detail.
That mixture is exactly what makes email operationally messy and automation-friendly at the same time.
Typical business uses include:
- Supplier invoices: Forwarded to a processing inbox for extraction.
- PO acknowledgments: Received from vendors with details in the body and attachment.
- Bank statements: Sent as monthly PDFs from financial institutions.
- Resume intake: Submitted to a hiring inbox with varying attachment formats.
The practical parsing challenge
The issue isn’t only extracting the attachment. It’s deciding what context should travel with it.
If an AP team processes emailed invoices, the sender address may help classify the vendor. The subject line may include an invoice number. The body might mention a branch location or cost center that isn’t visible in the PDF. MIME headers can also help preserve the audit trail, which matters for compliance and dispute handling.
A dedicated ingestion workflow works better than asking staff to monitor shared inboxes manually. For example, a purpose-built email parser lets teams route incoming messages through a defined pipeline instead of treating each email as a standalone admin task.
What works best
Use routing rules early. Filter by sender, subject pattern, or mailbox alias before the parser sees the file. Archive the original email with its metadata even if the extracted fields move into your ERP. And don’t rely on body text alone for business-critical values when an attachment should be the source of truth.
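A minimal sketch of the ingestion step using Python’s standard email package follows. The mailbox fetch itself (IMAP, a mail API, a forwarding rule) is out of scope, and the attachment filter is an example.

```python
# Parse a raw email into a record that carries context with the attachment.
from email import message_from_bytes, policy

def ingest(raw_message: bytes) -> dict:
    msg = message_from_bytes(raw_message, policy=policy.default)
    record = {
        "sender": msg["From"],       # helps classify the vendor
        "subject": msg["Subject"],   # may carry an invoice number
        "received": msg["Date"],     # preserves the audit trail
        "attachments": [],
    }
    for part in msg.iter_attachments():
        name = part.get_filename() or "unnamed"
        if name.lower().endswith((".pdf", ".xml", ".csv")):
            record["attachments"].append({"name": name, "bytes": part.get_content()})
    return record
```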
This is one of the easiest automation wins because the document flow is already digital. The main job is to stop letting email remain the system of record.
6. PDF and Image Scans with OCR Data
A carrier emails a signed proof of delivery as a crooked phone photo. A supplier uploads a scanned invoice with a stamp over the total. A field rep submits a receipt from a dimly lit restaurant. Operations teams see these documents every day, and they all create the same problem. The file looks readable to a person, but the data inside is inconsistent, positional, and hard for systems to trust.
A PDF or scanned image becomes semi-structured once OCR extracts text, coordinates, page zones, and detected labels. That partial structure is what makes invoices, bills of lading, receipts, statements, and onboarding forms usable in automation. The content is still messy, but it is no longer a blank image.

Why this format matters in real operations
In our work with finance, logistics, and HR teams, PDF and image files often remain the handoff format between companies, branches, drivers, and back-office staff. The sender may have a structured system on their side, but what arrives is still a scan, exported PDF, or mobile photo.
The business goal is not better document storage. The goal is to turn a document into fields your downstream systems can use. That usually means vendor name, invoice number, line items, tax, delivery date, consignee, or approval status. If those values cannot be extracted reliably, the process falls back to manual review.
This is also where teams underestimate variation. Two invoice PDFs from the same supplier can differ if one came from an ERP export and the other was printed, stamped, signed, and rescanned.
Why OCR pipelines fail
The main failure point is not OCR alone. It is the combination of text recognition, layout interpretation, and field mapping.
A scan can be technically readable and still produce bad data. Totals may be captured as line items. Multi-page tables may split incorrectly. Header values may drift into the wrong field because the document uses columns, sidebars, or handwritten notes. Low image quality makes that worse, but clean images do not remove the layout problem.
Teams handling these files should plan for three layers of control (a minimal validation sketch follows the list):
- Input quality controls: Set standards for scan angle, contrast, file type, and page completeness.
- Confidence-based review: Route low-confidence fields to staff instead of auto-posting them.
- Cross-field validation: Check totals against line items, dates against expected ranges, and supplier names against master records.
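As a minimal sketch of the second and third controls, assuming a hypothetical parser output with per-field confidences:

```python
# Route low-confidence fields to review and cross-check totals against
# line items. The field/confidence structure is hypothetical; adapt it
# to your parser's actual output.
REVIEW_THRESHOLD = 0.85

def needs_review(extracted: dict) -> list[str]:
    reasons = []
    # Confidence-based review: flag uncertain fields instead of auto-posting.
    for field, info in extracted["fields"].items():
        if info["confidence"] < REVIEW_THRESHOLD:
            reasons.append(f"low confidence on {field} ({info['confidence']:.2f})")
    # Cross-field validation: the stated total should match the line items.
    line_sum = sum(item["amount"] for item in extracted["line_items"])
    total = extracted["fields"]["total_amount"]["value"]
    if abs(line_sum - total) > 0.01:
        reasons.append(f"line items sum to {line_sum}, total reads {total}")
    return reasons  # empty list means the document can flow straight through
```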
What works in production
Template-only extraction works for stable forms. It struggles once suppliers change layouts or staff upload photos instead of system-generated PDFs. In those environments, teams get better results from a parser that combines OCR with document classification, field rules, and exception handling.
For example, DigiParser can take scanned invoices or image-based shipping documents, extract the key fields, and return structured output for an ERP, TMS, or spreadsheet workflow. The practical win is not just reading the text. It is separating reliable fields from uncertain ones so staff review only the exceptions.
The same principle applies outside classic AP workflows. Teams dealing with submissions generated through HTML form mailto functionality often receive attached PDFs, screenshots, or scanned forms in shared inboxes. Once those attachments enter the process, the challenge shifts from message routing to document extraction quality.
A useful benchmark comes from resume processing. In a government recruitment case study, Withum described using AI-based document extraction to process highly variable resumes with 99%+ field-level accuracy on a private dataset of 10,000+ resumes. The lesson is practical. When documents vary in structure, a flexible extraction workflow with validation beats brittle page-specific rules.
What works best
Start by classifying the document before extracting fields. Keep the original file for audit and dispute handling. Validate related fields together instead of trusting single values in isolation. And measure exception rates by document source, because one supplier, branch, or mobile workflow usually creates a disproportionate share of review work.
Teams that handle PDFs and scans well do not chase perfect OCR. They build a controlled process that turns inconsistent documents into structured records with clear review points.
7. HTML and Web Form Data
Web pages are structured enough to work with and messy enough to break your process. Such is their nature.
HTML and web form submissions are classic semi structured data examples because they contain markup, labels, tables, attributes, and form fields, but the actual content can vary wildly. A carrier portal might show quote results in one table today and in a card layout tomorrow. A supplier portal might require a login and then load order data dynamically in the browser.
Where operations teams encounter it
This format shows up anywhere a person copies data from a website into another system. Freight teams pull quote details from carrier sites. Buyers check supplier portals for order status. HR teams collect applications through online forms. Office teams read tracking pages and manually update customers.
The data is there. The problem is that it’s embedded inside presentation logic.
That creates two very different tasks. One is extraction. The other is maintenance.
Why web data pipelines become fragile
HTML parsing breaks when the page changes, even slightly. A CSS class gets renamed. A table becomes a nested component. A site shifts to JavaScript-heavy rendering. A field label changes from “Consignee” to “Delivery Party,” and suddenly your scraper misses it.
For that reason, web capture works best when teams are disciplined about selectors, validation, and monitoring. Useful tactics include:
- Use stable selectors: Prefer semantic attributes over cosmetic class names.
- Handle dynamic content: Browser automation is often required for script-rendered pages.
- Cache results: Don’t hammer the same portal repeatedly for the same data.
- Respect site rules: Terms of service and access controls still apply.
A related issue appears with forms that trigger email-based workflows. If you’re dealing with browser submissions and email handoffs together, understanding HTML form mailto functionality helps explain why some “simple” form setups create messy intake data for operations teams.
Websites are poor source systems. Capture what you need, normalize it fast, and move it into a system you control.
The practical automation angle
Treat HTML as a capture surface, not a storage standard. Pull the key fields, timestamp the retrieval, and convert them into a canonical internal format right away. If you leave the logic tied to page structure for too long, routine website updates become business interruptions.
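Here is a minimal capture-and-normalize sketch using requests and BeautifulSoup (pip install requests beautifulsoup4). The URL, the data-field attributes, and the page structure are hypothetical; the point is the pattern of stable selectors plus an immediate, timestamped handoff.

```python
# Capture key fields from a portal page, timestamp the retrieval, and
# hand off a canonical record right away.
from datetime import datetime, timezone

import requests
from bs4 import BeautifulSoup

resp = requests.get("https://portal.example.com/quotes/12345", timeout=30)
soup = BeautifulSoup(resp.text, "html.parser")

quote = {
    # Attribute selectors survive restyling better than ".blue-card-v2".
    "carrier": soup.select_one("[data-field='carrier']").get_text(strip=True),
    "rate": soup.select_one("[data-field='total-rate']").get_text(strip=True),
    "retrieved_at": datetime.now(timezone.utc).isoformat(),
}
# Move `quote` into your canonical internal format immediately; never let
# downstream logic depend on the page's structure.
```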
8. Database Query Results and ERP/TMS Export Formats
It's at this point that many automation projects get humbled. The data comes from your own systems, but it’s still awkward.
ERP and TMS exports often look structured because they originate from databases. Yet many exported files are semi-structured in practice. They include metadata headers, subtotal rows, fixed-width layouts, nested report sections, footers, and report-specific formatting choices that make downstream automation harder than expected.
Why internal exports still create friction
A procurement team exports purchase orders from SAP for supplier review. Finance exports an AP aging report from Oracle. A logistics team pulls a shipment manifest from a TMS. Each report contains business-critical data, but not always in a system-ready form.
The problem is that report design and data design are different jobs. Reports are built for human review. Pipelines need consistency.
That gap is bigger than many teams expect. There’s a known lack of clear guidance on the financial and operational trade-offs of integrating semi-structured outputs such as JSON or Avro into legacy ERP, TMS, and accounting environments that expect rigid schemas, as noted in this discussion of a gap in current semi-structured data guidance. In real terms, that means the export may be easy to produce and still expensive to operationalize.
A practical example from HR data
An arXiv study on automated analysis of semi-structured resume data processed 5,000 resumes and extracted 12 key fields with 92% precision and 88% recall. It also increased recruiter throughput from 200 resumes per day to 800 and improved shortlist accuracy from 65% to 89%.
That study focuses on resume pipelines, but the core lesson applies to ERP and report exports. Once you define a target schema and apply extraction plus validation consistently, staff stop spending time on repetitive review and can focus on exceptions.
What works when bridging legacy systems
Don’t feed raw report exports directly into downstream workflows unless you’ve tested every variation. Normalize first, as shown in the checklist and sketch below.
- Document field mappings: Reports change. Your mapping record should outlast the person who built it.
- Use delta logic where possible: Reprocessing full exports creates duplicate handling problems.
- Separate metadata from data rows: Headers and totals should not flow into transaction tables.
- Prefer APIs when available: When exports remain necessary, standardize them at the ingestion layer.
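A minimal sketch of that normalization step for a hypothetical report-style CSV export, using only the standard library; the column names, metadata row count, and subtotal markers are examples.

```python
# Normalize a report export: skip metadata headers, drop subtotal and
# footer rows, and map report columns to canonical names.
import csv

COLUMN_MAP = {"Doc. No.": "invoice_number", "Net Amt": "net_amount", "Vendor": "supplier_name"}
METADATA_ROWS = 3  # e.g. report title, run date, company code

def normalize_export(path: str) -> list[dict]:
    with open(path, newline="", encoding="utf-8") as f:
        rows = list(csv.reader(f))[METADATA_ROWS:]
    headers = rows[0]
    records = []
    for row in rows[1:]:
        # Skip subtotal and footer lines that reports mix into data rows.
        if not row or row[0].startswith(("Subtotal", "Total", "*")):
            continue
        raw = dict(zip(headers, row))
        records.append({COLUMN_MAP.get(k, k): v for k, v in raw.items()})
    return records
```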
If your ERP strategy is still evolving, it helps to understand the broader context of Enterprise Resource Planning (ERP). But for day-to-day operations, the rule is simpler. Internal system exports are not automatically clean inputs. Treat them with the same discipline you’d apply to supplier files.
8-Example Semi-Structured Data Comparison
| Format | Implementation Complexity 🔄 | Resource Requirements ⚡ | Expected Outcomes ⭐ | Ideal Use Cases 📊 | Key Advantages 💡 |
|---|---|---|---|---|---|
| XML (eXtensible Markup Language) | High, requires XSDs, namespaces and nested parsing | Higher storage and parse time due to verbosity; mature tooling available | Reliable, schema-validated structured data for compliance | ERP/EDI integrations, invoices, procurement, finance exchanges | Strong schema validation, human-readable tags, broad enterprise support |
| JSON (JavaScript Object Notation) | Low–Medium, simple syntax; optional JSON Schema for validation | Low storage and fast parsing; ideal for APIs and streaming | Efficient, compact payloads for real-time exchange | REST APIs, webhooks, modern integrations, automation workflows | Fast parsing, native web support, small payloads |
| CSV (Comma-Separated Values) | Low, straightforward row/column format; minimal rules | Very low storage/processing; fastest for bulk tabular data | Quick bulk imports/exports and easy human review | Spreadsheet exports, batch ERP imports, simple reporting | Universal support, easy for non-technical users, minimal overhead |
| EDI (Electronic Data Interchange) Documents | High, strict segment logic, translation tables, trading partner setup | High initial setup and specialized translation software; ongoing maintenance | Robust, legally auditable B2B exchanges with low manual intervention | Large-scale B2B logistics, retailers, carriers requiring standards | Industry-standard reliability, audit trails, reduces manual entry |
| Email + MIME Attachments | Medium, MIME parsing and attachment extraction; handle forwarded chains | Moderate storage and parsing; needs secure email infrastructure | Hands-free, always-on ingestion with rich metadata context | Forwarded invoices/POs to dedicated inboxes, HR submissions | Familiar UX, supports many attachment types, built-in audit trail |
| PDF & Image Scans with OCR | Medium–High, OCR, layout analysis, confidence scoring, model tuning | High CPU/GPU and storage; higher latency for large batches | Structured data extracted from unstructured/legacy documents | Scanned invoices, receipts, faxed POs, handwritten forms | Handles legacy and handwritten docs, no template setup required |
| HTML & Web Form Data | Medium, static parsing simple; dynamic sites require browser automation | Moderate, headless browsers for JS-heavy pages increase resource use | Structured capture leveraging semantic tags and embedded metadata | Carrier quotes, procurement portals, e‑commerce and web forms | Semantic cues aid extraction, real-time scraping possible |
| Database Query Results & ERP/TMS Exports | Medium, predictable formats but vendor-specific mappings required | Moderate–High for large exports; requires credentials and secure access | Authoritative, reconcilable data for synchronization and reporting | Nightly exports, delta loads, system-to-system synchronization | Source of truth data, metadata for validation, reliable structure |
From Chaos to Clarity: Your Automation Blueprint
Semi-structured data isn’t a side issue. It’s the operating reality for teams that run purchasing, freight, AP, HR, and back-office workflows. Documents arrive in XML, JSON, CSV, EDI, email, PDFs, HTML portals, and internal exports. Every format carries enough structure to be useful, but not enough consistency to drop safely into an ERP, TMS, or accounting system without work.
That’s why manual rekeying survives for so long. It fills the gap between business documents and system requirements. Staff read the invoice, find the PO number, type the total, fix the supplier name, split line items, and move on to the next file. It works, but it doesn’t scale well, and it introduces avoidable errors.
The practical shift is to stop asking whether a format is “structured enough” and start asking a better question. What is the cleanest path from this source format to a trusted internal schema? Once you think that way, the right approach becomes clearer.
For XML and JSON, the priority is usually normalization. The file already has machine-readable structure, but key names, nesting, and optional elements need control. For CSV, reliability depends on disciplined headers, encoding, and quoting. For EDI, success depends on mapping transaction intent, not just segment syntax. For email, the win comes from capturing metadata and routing rules early. For PDFs and scans, template-free extraction and exception handling matter more than perfect document quality. For HTML and web forms, stability comes from fast normalization and ongoing monitoring. For ERP and TMS exports, the biggest mistake is assuming internal reports are ready for downstream automation without cleanup.
Two trade-offs matter in almost every implementation.
First, flexibility versus standardization. Semi-structured data is valuable because it adapts to changing documents and source systems. But the downstream business system still needs a stable shape. That means you need a canonical schema somewhere in the pipeline, even if your inputs remain flexible.
Second, automation versus review. Teams often swing too far in one direction. They either keep humans in every step, which kills throughput, or they remove review entirely, which creates posting and reconciliation problems. The better model is selective review. Let automation handle the routine fields and send uncertain records, unusual layouts, or cross-field mismatches to staff.
The companies that get the best results usually do three things well. They preserve the original file for auditability. They define one target schema for each workflow. And they treat validation as part of the ingestion process, not as a cleanup job for finance or operations later.
If you’re trying to make this practical, DigiParser is one relevant option for turning documents, emails, and scans into structured outputs such as CSV, Excel, or JSON for downstream use. The important point isn’t the brand name by itself. It’s the operating model: one extraction layer across many semi-structured inputs, with consistent outputs that your core systems can accept.
That’s how semi structured data examples stop being a taxonomy exercise and become an automation plan. You don’t need every source to become perfect. You need a repeatable way to convert imperfect inputs into trusted business data.
If your team is still copying fields out of PDFs, inboxes, and portal exports, DigiParser is worth a look. It’s built to extract data from semi-structured business documents and return structured outputs your ERP, TMS, accounting system, or workflow tools can use.
Transform Your Document Processing
Start automating your document workflows with DigiParser's AI-powered solution.