Show HN: Paradict – Streamable multi-format serialization with schema https://ift.tt/G2Mb9lD
Show HN: Paradict – Streamable multi-format serialization with schema Hi HN ! I'm Alex, a tech enthusiast. I'm excited to show you Paradict ( https://ift.tt/qlCui0m ), my solution for streamable multi-format serialization. Although JSON, YAML, and TOML are all human-readable, they serve different purposes. For example, TOML is specifically designed for configuration files while JSON is used as a data interchange format. Sometimes an initiative to create a binary version of JSON arises and as far as I know, it ends with an unidirectional mapping of datatypes. There is no silver bullet, yet one coherent solution built from scratch that addresses multi-format (binary and textual) serialization and configuration files would be a step forward. Earlier this year, I accidentally designed a textual data format to represent complex data structures inside a document divided into sections. The project, namely Jesth (Just Extract Sections Then Hack'em), generated an interesting discussion on HN ( https://ift.tt/jMuS72Z ). Out of curiosity, I ran some benchmarks using Jesth, JSON and MessagePack, with and without Gzip compression against a large JSON file downloaded from the web. The benchmarking gave me insights that led to the decision to evolve Jesth's ideas into a new multi-format serialization solution. I designed and built Paradict from scratch to serialize and deserialize a dictionary data structure. Although Paradict's root data structure is a dictionary, lists, sets, and dictionaries can be nested within it at arbitrary depth. A Paradict dictionary can be populated with strings, binary data, integers, floats, complex numbers, booleans, dates, times, datetimes, comments, extension objects, and grids (matrices). There is also a schema-based validation mechanism that can contain programmatic checkers. The binary serialization format is designed with compactness in mind such as Pi with its first two decimal places, the Golden ratio with its first two decimal places, and the date of the funeral of Pope Benedict XVI would each be encoded on two bytes (not counting their respective 1-byte tag which starts each Paradict binary datum). This binary format has two levels of granularity for continuous data stream processing: a datum at the low level, which is in some cases a 2-tuple composed of a tag and its payload, and the message at the high level which is a dictionary data structure. The textual serialization format has two modes: data and config modes. Config mode implicitly treats dictionary keys as strings, removing the need to surround them with quotes, and unlike the colon (:) between a key-value pair in data mode, it uses the equal sign (=) as separator. This textual format has two levels of granularity for continuous data stream processing: a single line of text at the low level and the message at the high level which is a dictionary data structure. Here is a valid Paradict configuration document that contains a "user" section: [user] # no comment id = 42 name = 'alex' birthday = 2042-12-25T16:20:59Z photo = (bin) 54 68 69 73 20 69 73 20 6E 6F 74 20 61 20 70 68 6F 74 6F 67 72 61 70 68 weight_matrix = (grid) 1 0 1 0 0 1 0 1 1 0 1 0 books = (dict) romance = (list) 'Happy Place' 'Romantic Comedy' sci_fi = (list) 'Dune' 'Neuromancer' epitaph = (text) According to the law of conservation of energy, no a bit of you is gone; you are just less orderly. --- Under the hood, Paradict uses Braq ( https://ift.tt/Mde1yBk ), the most obvious way to section a document (as shown just above), and Ustrid ( https://ift.tt/O5B6VNw ), to uniquely generate string identifiers. Paradict is available on PyPI and you can learn more by reading its README, browsing the source code or playing with its tests. Let me know what you think about all this ! https://ift.tt/qlCui0m December 18, 2023 at 10:00PM
Comments
Post a Comment