Project author: skjolber

Description:
World's fastest CSV parser / databinding for Java
Language: Java
Repository: git://github.com/skjolber/sesseltjonna-csv.git
Created: 2018-09-03T22:12:25Z
Community: https://github.com/skjolber/sesseltjonna-csv


sesseltjonna-csv: CSV processing insanity

sesseltjonna-csv is an advanced, high-performance CSV library with developer-friendly configuration options.

Projects using this library will benefit from:

  • friendly builder-pattern for creating CSV-parser with data-binding
    • syntactic sugar
    • per-field customization options
  • under-the-hood dynamic code generator for optimal performance

In a nutshell, a very specific parser is generated per unique CSV file header (i.e. column order), which yields extremely fast processing while allowing for per-field customizations. The overhead of doing this is low and pays off surprisingly fast.

The library also hosts traditional CSV parsers (statically typed) for those wanting to work directly on String arrays.

Bugs, feature suggestions and help requests can be filed with the issue-tracker.

Obtain

The project is implemented in Java and built using Maven, and is available on the central Maven repository.


Maven coordinates

Add

```xml
<properties>
    <sesseltjonna-csv.version>1.0.26</sesseltjonna-csv.version>
</properties>
```

and

```xml
<dependency>
    <groupId>com.github.skjolber.sesseltjonna-csv</groupId>
    <artifactId>databinder</artifactId>
    <version>${sesseltjonna-csv.version}</version>
</dependency>
```

or

```xml
<dependency>
    <groupId>com.github.skjolber.sesseltjonna-csv</groupId>
    <artifactId>parser</artifactId>
    <version>${sesseltjonna-csv.version}</version>
</dependency>
```



Gradle coordinates

For

```groovy
ext {
    sesseltjonnaCsvVersion = '1.0.26'
}
```

add

```groovy
implementation("com.github.skjolber.sesseltjonna-csv:databinder:${sesseltjonnaCsvVersion}")
```
or

```groovy
implementation("com.github.skjolber.sesseltjonna-csv:parser:${sesseltjonnaCsvVersion}")
```

Usage - databinding

Use the builder to configure your parser.

```java
CsvMapper<Trip> mapper = CsvMapper.builder(Trip.class)
    .stringField("route_id")
        .quoted()
        .optional()
    .stringField("service_id")
        .required()
    .build();
```

where each field must be either required or optional. The necessary Trip setters will be deduced from the field names (see further down for customization).
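
For reference, a Trip bean that would satisfy the mapper above might look like the following sketch. The class itself is not part of the library; the assumption illustrated here is that the header name route_id maps to a setter named setRouteId (and service_id to setServiceId).

```java
// Hypothetical Trip bean for the mapper above. The builder is assumed to
// derive the setter name from the header: "route_id" -> setRouteId,
// "service_id" -> setServiceId.
class Trip {

    private String routeId;
    private String serviceId;

    public void setRouteId(String routeId) { this.routeId = routeId; }
    public String getRouteId() { return routeId; }

    public void setServiceId(String serviceId) { this.serviceId = serviceId; }
    public String getServiceId() { return serviceId; }
}
```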

Then create a CsvReader using

```java
Reader reader = ...; // your input
CsvReader<Trip> csvReader = mapper.create(reader);
```

and parse until null using

```java
do {
    Trip trip = csvReader.next();
    if (trip == null) {
        break;
    }
    // your code here
} while (true);
```

To run some custom logic before applying values, add your own consumer:

```java
CsvMapper<City> mapping = CsvMapper.builder(City.class)
    .longField("Population")
        .consumer((city, n) -> city.setPopulation(n * 1000))
        .optional()
    .build();
```

or with custom (explicit) setters:

```java
CsvMapper<Trip> mapper = CsvMapper.builder(Trip.class)
    .stringField("route_id")
        .setter(Trip::setRouteId)
        .quoted()
        .optional()
    .stringField("service_id")
        .setter(Trip::setServiceId)
        .required()
    .build();
```

Intermediate processor

The library supports an intermediate processor for handling complex references: when a column value maps to a child or parent object, it can be resolved at parse time or during post-processing. For example, a Country can be resolved while parsing a City, using an instance of MyCountryLookup. First the mapper:

```java
CsvMapper2<City, MyCountryLookup> mapping = CsvMapper2.builder(City.class, MyCountryLookup.class)
    .longField("Country")
        .consumer((city, lookup, country) -> city.setCountry(lookup.getCountry(country)))
        .optional()
    .build();
```

Then supply an instance of the intermediate processor when creating the CsvReader:

```java
MyCountryLookup lookup = ...;
CsvReader<City> csvReader = mapping.create(reader, lookup);
```

Using this feature can be essential when parsing multiple CSV files in parallel (or even fragments of the same file in parallel) with entities referencing each other: store the values in intermediate processors and resolve the references in a post-processing step.
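
The lookup itself is ordinary application code. A minimal sketch (MyCountryLookup and Country are the hypothetical names from the example above; a thread-safe map is used here so that one instance can back readers running in parallel):

```java
import java.util.Map;
import java.util.concurrent.ConcurrentHashMap;

// Hypothetical Country entity referenced by City in the example above.
class Country {
    private final long id;
    Country(long id) { this.id = id; }
    long getId() { return id; }
}

// Minimal sketch of an intermediate processor: a thread-safe id -> Country
// cache, so CsvReader instances running in parallel can share one lookup.
class MyCountryLookup {
    private final Map<Long, Country> countries = new ConcurrentHashMap<>();

    Country getCountry(long id) {
        return countries.computeIfAbsent(id, Country::new);
    }
}
```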

Usage - traditional parser

Create a CsvReader<String[]> using

```java
Reader input = ...; // your input
CsvReader<String[]> csvReader = StringArrayCsvReader.builder().build(input);

String[] next;
do {
    next = csvReader.next();
    if (next == null) {
        break;
    }
    // your code here
} while (true);
```

Note that the String-array itself is reused between lines. The column indexes can be rearranged by using the builder withColumnMapping(..) methods, which should be useful when doing your own (efficient) hand-coded data-binding.
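
Since the array is reused, a hand-coded binding step has to copy the values out before the next line is read. A sketch (the Trip class and the column order are assumptions for illustration, not part of the library):

```java
// Hand-coded data-binding from a parsed row to a bean. The column order is
// assumed fixed here (route_id at index 0, service_id at index 1); the
// builder's withColumnMapping(..) methods can be used to arrange this.
class Trip {
    String routeId;
    String serviceId;
}

class TripBinder {

    static Trip bind(String[] row) {
        Trip trip = new Trip();
        // copy the values out -- the array is reused for the next line
        trip.routeId = row[0];
        trip.serviceId = row[1];
        return trip;
    }
}
```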

Performance

The dynamically generated instances are extremely fast (on par with a parser tailored very specifically to the file being parsed). The underlying assumption is that the number of different CSV headers is limited for a given application or format, so that parsing is effectively performed by an already JIT-compiled class rather than a freshly generated class for each file.

To maximize performance (such as response time), it is always necessary to pre-warm the JVM, regardless of the underlying implementation.

JMH benchmark results.

If the parser runs alone on a multi-core system, the ParallelReader from the SimpleFlatMapper might further improve performance by approximately 50%.

Class-loading / footprint

Performance note for single-shot scenarios and CsvMapper: if a custom setter is specified, the library invokes it once (via ByteBuddy) to determine the underlying method, so some additional class-loading will take place.

Compatibility

The following rules / restrictions apply, mostly for keeping in sync with RFC-4180:

  • Quoted fields must be declared as quoted (in the builder) and can contain all characters.
  • The first character of a quoted field must be a quote. If not, the value is treated as a plain field.
  • Non-quoted fields must not contain the newline (or the separator).
  • Each field is either required or optional (no empty string is ever propagated to the target). Missing values result in a CsvException.
  • All lines must contain the same number of columns.
  • Corrupt files can result in a CsvException.
  • Newline and carriage return + newline line endings are supported (and auto-detected).
    • Line endings must be consistent throughout the document.
  • Columns which have no mapping are skipped (ignored).

Also note that

  • The default mode assumes the first line is the header. If the column order is fixed, a default parser can be created instead.
  • The read buffer / maximum line length is 64K by default, which is enough to hold a lot of lines; if not, increase the buffer size.

See it in action

See the project gtfs-databinding for a full working example.

Contact

If you have any questions, comments or feature requests, please open an issue.

Contributions are welcome, especially those with unit tests ;)

License

Apache 2.0

History

  • 1.0.26: Maintenance release
  • 1.0.25: Maintenance release
  • 1.0.24: Maintenance release
  • 1.0.23: Allow for access to the underlying buffer, for error correction purposes - see issue #38.
  • 1.0.22: Parse first line of dynamic parser with static parser.
  • 1.0.21: Make CsvReader AutoCloseable. Remove (unused) ASM dependencies from static parser. Bump dependencies
  • 1.0.20: Fix AbstractCsvReader; better names and EOF fix.
  • 1.0.19: Improve JDK9+ support using the ModiTect plugin; fix parsing of a single line without a linebreak.
  • 1.0.18: Add default module names for JDK9+, renamed packages accordingly.
  • 1.0.17: Improve parse of quoted columns