[ 🌹 ]红花会 : PG ONE - 他 🔥 Chinese Hip Hop China Rap 中文说唱 / 饶舌 [ AUDIO ONLY ]

1 Comment
From: zhongtv
Duration: 03:37

Follow Him on Weibo @PG_ONE

Song Title: 他
Artist: PG ONE of 红花会
Release Date: 2016-11-07

#1 STOP FOR CHINESE HIP HOP 中国说唱

Subscribe to ZHONG.TV and stay updated with the best Chinese Hip Hop music on the web. http://www.youtube.com/subscription_center?add_user=zhongtv

ZHONG.TV on FACEBOOK : http://www.facebook.com/ZhongTV
ZHONG.TV on TWITTER: http://www.twitter.com/ZhongTV
ZHONG.TV on SINA WEIBO: http://www.weibo.com/zhongtvnews
ZHONG.TV on YOUKU http://u.youku.com/ZHONGTV
ZHONG.TV WEBSITE http://www.zhong.tv

quad
2244 days ago
I love this song; pity his nationalism.


jwz
1 Share
quad
2244 days ago

Require Credit For Your Software

1 Comment
a general-purpose legal tool to require credit for open work
quad
2244 days ago
I respect building legal structures to incentivise more open software in the world. I’m sceptically hopeful about credit/reputation economies moving free software forward.

Yes, we have no imagination

1 Comment

The Hong Kong government – enfeebled, emasculated, eviscerated and more lobotomized than ever – draws on all its powers of creativity to Be Seen To Do Something.

In a dazzling display of originality, it draws on the same script used under three former chief executives and announces that the city (bursting at full capacity with shortages of labour and accommodation) faces imminent economic collapse. Going further into out-of-the-box lateral-thinking wackiness, it proposes a range of extremely tired one-off sweeteners including the immensely stale free-electricity-for-everyone, an extra month’s welfare payment for the poor, tax waivers, a rent-free month for public-housing tenants and subsidies for school students, plus even more stunningly inane little quasi-handouts for smaller businesses.

The SCMP, trying to be delicate about the crass and hackneyed freebies, reports that ‘different sectors found them underwhelming’.

Financial Secretary Paul Chan insists that the package has nothing to do with the massive protests that have rocked Hong Kong. Everyone else insists that it is totally a reaction to unrest – but will, just as totally, have zero impact at all in calming things. (Chan could have befuddled his critics by boldly agreeing that, yes – he had no intention whatever of using the sweeteners to calm things. But our senior officials are being sparer than usual with their wittiness these days.)

I declare the weekend open with a hopefully more-impressive package of sweeteners.

From HK Free Press: a review of front-line protesters’ materiel, and 360-degree video view of the demonstration at Tai Po.

An interesting selection of mayhem-porn from the authentically no-frills neighbourhood of Shamshuipo.

For graphic arts fans: Badiucao’s latest great cartoon (check mug and T-shirt offer), and a reminder that Hong Kong’s anti-government movement is so broad-based that we now have to like not only civil servants, but dog people – posters for protesting pets.

Willy Lam asks Will Xi Send the PLA In? In a nutshell: probably not. However, he has interesting thoughts on what happens further ahead, including the installation of an administration of local CCP loyalists under Liaison Office direction.

Time magazine does a good wrap-up of the whole situation in Hong Kong – just in case anyone hasn’t been following things up to now.

And SCMP Magazine does an in-depth article on – not handbags, not some fancy new restaurant, but… tear gas.

The China Media Project – which normally has the patience of a saint – finally gets rather exasperated with the bone-headedness of People’s Daily

This is a Party-state that claims to have benevolent global ambitions, to offer a “China Solution” to issues facing the world – and yet it cannot speak a human language. It cannot admit any subtlety on complex issues. 

Another China Heritage ‘Hong Kong Apostasy’ translation – thoughts on the protests by businesswoman Canny Leung.

And lest we get too wrapped up in the idea that Mainlanders’ views are shaped only by official propaganda, here’s some perspective.

For a complete break and a badly needed dose of pure relaxing sanity: this (don't listen at work, or in front of kids – in fact, best not listen to it at all, seriously. Not sure why I'm putting it here.)

quad
2244 days ago
The linked articles are all gold.

"Mulan" is a masculine, non-Sinitic name

1 Share

There is much hullabaloo over the new "Mulan" trailer:

Question:  What does she say at 1:29?

After watching Mulan gallop across the horizon for a couple of seconds, the very first thing we see is something that really bothered me:  a mammoth Hakka (people [Han], language [Sinitic]) round house in mountainous, heavily wooded south China.  Called tǔlóu 土楼 ("earthen structures") and mostly dating from the second half of the second millennium AD, these houses are situated far away in time, space, and culture from the northern, scrubby borderlands whence came non-Sinitic Mulan a millennium earlier.

It seems that this movie is going to be much worse even than the previous animated film about the heroine, which was bad enough.  Chinese viewers are complaining that the current Mulan is full of Oriental stereotypes.  Westerners are upset that the film is pandering to PRC nationalism.  A small minority of scholars are disappointed that the heroine is untrue to ethnic and linguistic reality.

There's precious little historical evidence concerning Mulan upon which to base all the stories, plays, and now full-length films about her.  Fundamentally, what we know about Mulan comes from an anonymous, medium-length (65 lines in the translation of Arthur Waley [1889-1966]) ballad dating to around the 5th-6th century.  You can read the ballad here:  Victor H. Mair, ed., The Columbia Anthology of Traditional Chinese Literature (New York:  Columbia University Press, 1994), pp. 474-476; quoted in its entirety below.  It is preceded by this note:

Mulan (old [Middle Sinitic / MS] pronunciation Muklan) was a member of the Särbi (Hsien-pei) people.  This celebrated ballad tells of her resolve to take her father's place in fending off the encroaching Jou-jan nomads.  She is often compared with Joan of Arc, although the two do not share much more in common than the fact that they were both women warriors.  The people and places in the ballad are all from the far northern borderlands of China, and it is likely that this remarkable work was first conceived in one of the languages of that land of nomads.

Mulan was supposed to have lived during the Northern Wei Dynasty (386-534), which was ruled over by the Tuoba (Tabgach) clan of the Xianbei, a people having nomadic origins on the Eurasian Steppe (more about the ethnolinguistics of the Xianbei below).  Groups of the Xianbei moved down into what we now know of as Northern China and, over a period of several centuries, founded a number of statelets, kingdoms, and dynasties there.

—————

What's all the dissension about?

"The Mulan trailer is a dismal sign Disney is bowing to China's nationalistic agenda:  Mulan has been transformed from life-affirming epic to patriotic saga, showing Hollywood is prioritising box office success",  Jingan Young, The Guardian (7/8/19)

VHM:  This is a good article.  Click on the title to read it if you have time and interest.

I hope that everyone reading this post realizes that Mulan was not even Chinese (Sinitic / Han).  She was of Xiānbēi (Wade-Giles Hsien-pei) 鮮卑 (*Särbi [this is a reconstruction; we really don't know exactly what the ethnonym of the Xianbei sounded like in their own language]) extraction —  for the etymology, see here.  Most scholars think that the Xianbei spoke a Proto-Turkic or Para-Mongolic language more or less closely related to Khitan (see here).

Louis Ligeti already wrote about this in 1970: "Le Tabghach, un dialect de la langue sien-pi" Mongolian Studies, ed. L. Ligeti (Budapest: Akadémiai Kiadó, 1970): 265-308.  See now: Andrew Shimunek in Languages of Ancient Southern Mongolia and North China. A Historical-Comparative Study of the Serbi or Xianbei Branch of the Serbi-Mongolic Language Family, with an Analysis of Northeastern Frontier Chinese and Old Tibetan Phonology (Wiesbaden: Harrassowitz, 2017), with a just published review of it by András Róna-Tas in Archivum Eurasiae Medii Aevi 24 (2018): 315-335.  In any event, the Xianbei weren't "Chinese" and they didn't speak "Chinese" / Sinitic.

The world is so indoctrinated by Chinese propaganda — going back centuries before the PRC (but it's much worse now) — that everybody thinks Mulan is the Chinese version of Joan of Arc.  Even the kids at our local swim club put on an elaborate pageant glorifying the pseudo-Chinese heroine fighting against the evil northern barbarians!  And there must be hundreds of other similar fake history stories being written and plays being performed about Mulan each year all around the world.  It's as though Gavin Menzies' wild fantasies (e.g., 1421, 1434) of Chinese pre-Columbian maritime discovery were taken at face value (must read the Australian scholar Geoff Wade's thorough debunking):

The 1421 website has now gone offline, but a historical version can be found here.  See also here.

Thus responsible historians have called Menzies to account, despite the fact that he is alive and highly litigious.  If Mulan were still with us and had lawyers to speak on her behalf, I'm confident that she would welcome those scholars who are brave enough to set the record straight against those who distort her story.

Thinking of Mulan as "Chinese" (Sinitic / Han) is like considering everyone and everything in Eastern Central Asia (ECA) (Uyghurstan / Xinjiang) as "Chinese" (Sinitic / Han), when, before about 1,500 years ago, most people in ECA were Indo-European (Tocharians, Iranians, Indians) and, after that, until quite recently (indeed, even now), most people in ECA are not "Chinese" (Sinitic / Han), but rather Turkic.

Thinking of Mulan as an overtly feminine warrior is also off the mark.  Judging from the trailer, there will be plenty of fighting scenes where she looks very much like a woman.  But listen to the penultimate quatrain of the ballad, which describes her meeting with her fellow soldiers after she had returned home from the war:

Chūmén kàn huǒbàn,

Huǒbàn jiē jīnghuáng.

Tóngxíng shí'èr nián,

Bùzhī Mùlán shì nǚláng.

出門看火伴,

火伴皆驚惶。

同行十二年,

不知木蘭是女郎。

She left the house and met her messmates on the road;

Her messmates were startled out of their wits.

They had marched with her for twelve years of war

And never known that Mulan was a girl.

Thus, Mulan fought as a man, not as a woman.  Her fellow soldiers had no idea that she was a woman.  This is not so strange as you may think.  Indeed, it is a common trope in Chinese popular literature for a woman to assume the guise of a man in order to accomplish feats that her natural gender would have denied her, such as standing in as a conscripted soldier for her ailing or elderly father.  Even more interesting, women were not permitted to take the examinations to become scholars or officials, so some girls disguised themselves as men to study the classics and sit for the civil service exams.  There are quite a few funny scenes where the male fellow students of a girl disguised as one of them are perplexed by her toilet habits.  Naturally, there are also many touching love stories that develop out of such situations, but only after many years of gender ruse and "they triumphs".

Even more telling about Mulan's gender and ethnic identity is that, written in Sinographs, as it would have been when transcribed from Xianbei language, her name appears as Mùlán 木蘭, which means "Magnolia" (in particular, red or lily magnolia [Magnolia liliiflora]) and is conspicuously feminine.  Still today, Mùlán 木蘭 is a common given name for Chinese women.  Such a pretty, feminine name simply would not have worked for a dozen years of war among exclusively male soldiery.

This is where the outstanding historical research of Sanping Chen comes in.  In chapter 2, "From Mulan to Unicorn", of his Multicultural China in the Early Middle Ages (Philadelphia:  University of Pennsylvania Press, 2012), pp. 39-59, 197-201, Chen shows that Mulan's name in the eponymous ballad dedicated to her emerged from the same Turco-Mongol milieu as that described above for her Xianbei background, not from a "Chinese" (Sinitic / Han) linguistic environment.

In particular, Chen demonstrates that Mulan (MS Muklan) — together with its cognates — was a favored male name of military men of the Xianbei Tuoba (Tabgach) and other Turco-Mongol groups in the north.  Without going into all of the historical, philological, phonological, and other linguistic evidence that Chen adduces, I will only mention that he situates the probable source of the Xianbei word transcribed as Mulan in a group of Altaic words having to do with cervids, especially stags.

Chen states:

Indeed, in his excellent Etymological Dictionary of the Altaic Languages, Sergei Starostin proposed an Altaic root *mulaI, "a kind of deer", with Tungusic *mul- and Proto-Mongolian *maral, "mountain deer", and Proto-Turkic *bulan, "elk."  This root, especially the Proto-Turkic form, would be a near perfect fit for the Tuoba name Mulan.

Chen separately devotes a considerable amount of attention to another Altaic word, bulān or buklān, meaning "elk", "stag", "moose", "deer", and, according to the great 11th century Turkic lexicographer,  Mahmud ibn Hussayn ibn Muhammed al-Kashgari, "unicorn"!

Note the Cantonese and Minnan pronunciations of mùlán 木蘭 (see the linked source).

We must remember that "The Ballad of Mulan" is the closest thing we get to a historical account of the heroine, and it is not very historical at that.  All the other later versions of the tale, up to and including the two Disney movies, are legends and imaginative fiction that grow increasingly improbable with the passage of time.  They are embellishments and elaborations of a tale which from its very beginning had only a tenuous connection to history.  A strange phenomenon is that, the further removed from reality a given rendition is, the more strongly attached to the embroidered version are its devotees.

What we know about the history of "The Ballad of Mulan" ("Mùlán cí 木蘭辭") is summarized in this passage from Wikipedia:

The Ballad of Mulan was first transcribed in the Musical Records of Old and New (Chinese: 古今樂錄; pinyin: Gǔjīn Yuèlù [VHM:  this anthology itself has not survived, but parts of it are quoted in later texts]) in the 6th century. The earliest extant text of the poem comes from an 11th- or 12th-century anthology known as the Music Bureau Collection (Chinese: 樂府詩; pinyin: Yuèfǔshī). Its author, Guo Maoqian, explicitly mentions the Musical Records of Old and New as his source for the poem. As a ballad, the lines do not necessarily have equal numbers of syllables. The poem consists of 31 couplets, and is mostly composed of five-character phrases, with a few extending to seven or nine.

There was no further treatment of the legend after the two 12th-century poems until the late Ming, when the playwright Xu Wei (d. 1593) dramatized the tale in two acts as "The Female Mulan" (雌木蘭) or, more fully, "The Heroine Mulan Goes to War in Her Father's Place" (Chinese: 雌木蘭替父從軍; pinyin: Cí-Mùlán Tì Fù Cóngjūn).

For additional later renditions of the legend, see Mulan: Five Versions of a Classic Chinese Legend, with Related Texts, tr., ed., and intro. by Shiamin Kwa and Wilt L. Idema (Indianapolis and Cambridge:  Hackett, 2010).

Given what we know from Sanping Chen about the strongly masculine character of the Xianbei name Mulan (MS Muklan), it is ironic that Xu Wei emphasizes the femininity of the Sinographic form Mùlán 木蘭 by adding the prefix cí 雌 ("female").

Since the ballad is not too long, but is extremely important as the sole primary source of the legends about Mulan that grew up over the centuries, I offer it in its entirety here:

Click, click, forever click, click;

Mulan sits at the door and weaves.

Listen, and you will not hear the shuttle's sound,

But only hear a girl's sobs and sighs.

"Oh tell me, lady, are you thinking of your love,

Oh tell me, lady, are you longing for your dear?"

"Oh no, oh no, I am not thinking of my love,

Oh no, oh no, I am not longing for my dear.

But last night I read the battle-roll;

The Khan has ordered a great levy of men.

The battle-roll was written in twelve books,

And in each book stood my father's name.

My father's sons are not grown men,

And of all my brothers, none is older than me.

Oh let me to the market to buy saddle and horse,

And ride with the soldiers to take my father's place."

In the eastern market she's bought a gallant horse,

In the western market she's bought saddle and cloth.

In the southern market she's bought snaffle and reins,

In the northern market she's bought a long whip.

In the morning she stole from her father's and mother's house;

At night she was camping by the Yellow River's side.

She could not hear her father and mother calling to her by her name,

But only the song of the Yellow River as its hurrying waters hissed and swirled through the night.

At dawn they left the River and went on their way;

At dusk they came to the Black Mountain's side.

She could not hear her father and mother calling to her by her name,

She could only hear the muffled voices of foreign* horsemen riding on the hills of Yen.

A thousand tricents** she tramped on the errands of war,

Frontiers and hills she crossed like a bird in flight.

Through the northern air echoed the watchman's tap;

The wintry light gleamed on coats of mail.

The captain had fought a hundred fights, and died;

The warriors in ten years had won their rest.

They went home; they saw the Son of Heaven's face;

The Son of Heaven*** was seated in the Hall of Light.

Dispensing enfeoffments and accolades by the dozens;

And of prize money a hundred thousand strings.

Then spoke the Khan and asked Mulan what they wanted.

"Oh, Mulan asks not to be made

A Counsellor at the Khan's court;

I only wish to borrow a camel that can march

A thousand tricents a day,

To take me back to my home."

When her father and mother heard that she had come,

They went to the outer town wall and led her back to the house.

When her little sister heard that she had come,

She went to the door and rouged her face afresh.

When her little brother heard that his sister had come,

He sharpened his knife and darted like a flash

Towards the pigs and sheep.

She opened the gate that leads to the eastern tower,

She sat on her bed that stood in the western tower.

She cast aside her heavy soldier's cloak,

And wore again her old-time dress.

She stood at the window and bound her cloudy hair;

She went to the mirror and fastened her yellow combs.

She left the house and met her messmates on the road;

Her messmates were startled out of their wits.

They had marched with her for twelve years of war

And never known that Mulan was a girl.

For the male hare has a lilting, lolloping gait,

And the female hare has a wild and roving eye;

But set them both scampering side by side,

Who could distinguish between male and female?

(from the version in The Columbia Anthology of Traditional Chinese Literature, with emendations by VHM)

*hú 胡, which Waley loosely translates as "Scythian"; in this instance it is likely referring to the Rouran, who incidentally were the first people to use the title Khan / Qaghan for their ruler.

**lǐ 里, which Waley renders as "league(s)", but a league is three statute miles; that's much too long.  A lǐ 里 is 300 paces, roughly a third of a mile (a mile being a thousand paces), hence my neologism "tricent", which is no longer so "neo-", since I've been using it for several decades.

***Here and in the preceding line, the ruler is referred to as tiānzǐ 天子 ("Son of Heaven"), not as huángdì 皇帝 ("august thearch"), which would have been suitable for an emperor in the Confucian order.

"The Ballad of Mulan" is also available in an English translation by Hans H. Frankel, from his The  Flowering  Plum  and  the  Palace  Lady:  Interpretations  of  Chinese  Poetry (New  Haven:  Yale  University Press, 1976), pp. 68-72, in this pdf from Columbia University (Asia for Educators).  It is preceded by this short, but richly informative and commendably cogent, introduction:

This poem was composed in the fifth or sixth century CE.  At the time, China was divided between north and south.  The rulers of the northern dynasties were from non-Han ethnic groups, most of them from Turkic peoples such as the Toba (Tuoba, also known as Xianbei), whose Northern Wei dynasty ruled most of northern China from 386–534. This background explains why the character Mulan refers to the Son of Heaven as "Khan" — the title given to rulers among the pastoral nomadic people of the north, including the Xianbei — one of the many reasons why the images conveyed in  the movie "Mulan" of a stereotypically Confucian Chinese civilization fighting against the barbaric "Huns" to the north are inaccurate.

Another English translation of "The Ballad of Mulan", this one by Jack Yuan, may be found here in Wikisource.

No matter which translation one consults, one cannot miss the repeated references to the ruler as the Khan / Qaghan (可汗 ["ruler; sovereign"]) and to Mulan's wish, after the fighting is over, to be rewarded with a camel to take her back home.  Such references are suitable for life among the Xianbei during the 5th-6th century.

Enough!  If people want to watch "Mulan" as some sort of wǔxiá / mou5-hap6 / bú-kiap / vú-hia̍p 武俠 ("martial hero / heroine") thriller, so be it, but please don't confuse this film with history.

Forgive me, but, judging from the trailer, this Disney film about Mulan is yītāhútú 一塌糊塗 ("one big mess").


[Thanks to Geoff Wade, Peter Golden, Alexander Vovin, and Juha Janhunen]

quad
2270 days ago

Filter before you parse: faster analytics on raw data with Sparser

1 Comment and 2 Shares

Filter before you parse: faster analytics on raw data with Sparser, Palkar et al., VLDB’18

We’ve been parsing JSON for over 15 years. So it’s surprising and wonderful that with a fresh look at the problem the authors of this paper have been able to deliver an order-of-magnitude speed-up with Sparser in about 4Kloc.

The classic approach to JSON parsing is to use a state-machine based parsing algorithm. This is the approach used by e.g. RapidJSON. Such algorithms are sequential and can’t easily exploit the SIMD capabilities of modern CPUs. State of the art JSON parsers such as Mison are designed to match the capabilities of modern hardware. Mison uses SIMD instructions to find special characters such as brackets and colons and build a structural index over a raw json string.

… we found that Mison can parse highly nested in-memory data at over 2 GB/s per core, over 5x faster than RapidJSON, the fastest traditional state-machine based parser available.

How can we parse JSON even faster? The key lies in re-framing the question. The fastest way to parse a JSON file is not to parse it at all. Zero ms is a hard lower bound ;). In other words, if you can quickly determine that a JSON file (or Avro, or Parquet, …) can't possibly contain what you're looking for, then you can avoid parsing it in the first place. That's similar to the way we might use a bloom filter to rule out the presence of a certain key or value in a file (guaranteeing no false negatives, though we might get false positives). Sparser is intended for use in situations where we are interacting directly with raw unstructured or semi-structured data, but where a pre-computed index or similar data structure either isn't available or is too expensive to compute given the anticipated access frequency.

In such a context we’re going to need a fast online test with no false negatives. Comparing state-of-the-art parsers to the raw hardware capabilities suggests there’s some headroom to work with:

Even with these new techniques, however, we still observe a large memory-compute performance gap: a single core can scan a raw bytestream of JSON data 10x faster than Mison parses it. Perhaps surprisingly, similar gaps can occur even when parsing binary formats that require byte-level processing, such as Avro and Parquet.

Imagine I have a big file containing tweet data. I want to find all tweets mentioning the hashtag ‘#themorningpaper’. Instead of feeding the file straight into a JSON parser, I could just do a simple grep first. If grep finds nothing, we don’t need to parse as there won’t be any matches. Sparser doesn’t work exactly like this, but it’s pretty close! In place of grep, it uses a collection of raw filters, designed with mechanical sympathy in mind to make them really efficient on modern hardware. A cost optimiser figures out the best performing combination of filters for a given query predicate and data set. When scanning through really big files, the cost optimiser is re-run whenever parsing throughput drops by some threshold (20% in the implementation).

Sparser can make a big difference to execution times. Across a variety of workloads in the evaluation, Sparser achieved up to a 22x speed-up compared to Mison. This is a big deal, because serialising (and de-serialising) is a significant contributor to overall execution times in big data analytic workloads. So much so, that when integrated into a Spark system, end-to-end application performance improved by up to 9x.

Efficient raw filters

Raw filters (RF) operate over raw bytestreams and can produce false positives but no false negatives. They are designed to be SIMD efficient. There are two raw filter types: substring search and key-value search.

Say we have a predicate like this: name = "Athena" AND text = "My submission to VLDB". Substring search just looks for records that contain a substring sequence from the target values. For efficiency reasons it considers 2, 4, and 8-byte wide strings. Sticking with 4-byte substrings, we have several we could potentially use for matching, e.g. ‘Athe’, ‘ubmi’ or ‘VLDB’. Using VLDB as an example, the string is repeated eight times in a 32-byte vector register. We need 4 one-byte shifts to cover all possible matching positions in an input sequence:

Note that the example in the figure above is actually a false positive for the input predicate. That’s ok. We don’t mind a few of those getting through.

The main advantage of the substring search RF is that, for a well-chosen sequence, the operator can reject inputs at nearly the speed of streaming data through a CPU core.
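
To make this concrete, here is a minimal sketch of the 4-byte substring raw filter using AVX2 intrinsics. This is not Sparser's actual code: the function name is made up, the four one-byte shifts are applied to the load offset rather than to the register, and the last few bytes of the record fall back to a plain scalar scan.

    #include <immintrin.h>   /* AVX2 intrinsics; compile with -mavx2 */
    #include <stdbool.h>
    #include <stdint.h>
    #include <string.h>

    /* Hypothetical 4-byte substring raw filter. The needle is broadcast into all
     * eight 32-bit lanes of a 256-bit register and compared against the record at
     * four shifted offsets, covering every possible byte position of a match.
     * Returns true if the record MIGHT contain the needle (false positives are
     * fine), false only when it definitely does not. */
    static bool substring_rf_4byte(const char *rec, size_t len, const char needle[4])
    {
        uint32_t word;
        memcpy(&word, needle, 4);
        __m256i packed = _mm256_set1_epi32((int32_t)word);        /* needle x 8 */

        for (size_t shift = 0; shift < 4; shift++) {
            for (size_t i = shift; i + 32 <= len; i += 32) {
                __m256i chunk = _mm256_loadu_si256((const __m256i *)(rec + i));
                __m256i eq    = _mm256_cmpeq_epi32(chunk, packed);
                if (_mm256_movemask_epi8(eq) != 0)
                    return true;                                   /* candidate: hand to the parser */
            }
        }
        /* Scalar check for the tail bytes the vector loop could not cover. */
        for (size_t i = (len > 35) ? len - 35 : 0; i + 4 <= len; i++)
            if (memcmp(rec + i, needle, 4) == 0)
                return true;
        return false;                                              /* reject without parsing */
    }

For the running example, a cascade would call this with "VLDB" or "Athe" and pass only the records that return true on to the full JSON parser.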

Key-value search looks for all co-occurrences of a key and a corresponding value within a record. The operator takes three parameters: a key, a value, and a set of one-byte delimiters (e.g. a comma). After finding an occurrence of the key, the operator searches for the value and stops searching at the first occurring delimiter character. The key, value, and stopping point can all be searched for using the packed vector technique we looked at for substrings.

Whereas substring searches support both equality and LIKE, key-value filters do not support LIKE. This prevents false negatives from getting through.
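
Below is a scalar sketch of the key-value raw filter; Sparser applies the same packed-vector search to the key, the value, and the delimiters, but a plain-C version is enough to show the logic. The function and parameter names are my own, not the paper's.

    #include <stdbool.h>
    #include <string.h>

    /* Hypothetical key-value raw filter: after each occurrence of `key`, look for
     * `value` before the first delimiter character. A true result may be a false
     * positive; a false result means the record cannot satisfy key = value. */
    static bool kv_rf(const char *rec, size_t len,
                      const char *key, const char *value, const char *delims)
    {
        size_t klen = strlen(key), vlen = strlen(value);
        const char *end = rec + len;
        const char *p = rec;

        while ((p = memchr(p, key[0], (size_t)(end - p))) != NULL) {
            if ((size_t)(end - p) < klen)
                break;
            if (memcmp(p, key, klen) == 0) {
                /* Scan from the end of the key up to the first delimiter. */
                for (const char *q = p + klen;
                     q < end && strchr(delims, *q) == NULL; q++) {
                    if ((size_t)(end - q) >= vlen && memcmp(q, value, vlen) == 0)
                        return true;        /* key and value co-occur: keep the record */
                }
            }
            p++;
        }
        return false;                       /* safe to discard without parsing */
    }

For the running example, the optimiser might instantiate this as kv_rf(rec, len, "name", "Athena", ",}") over the raw JSON bytes.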

Optimising filter cascades

Sticking with the example predicate name = "Athena" AND text = "My submission to VLDB", there are multiple raw filters we could consider, and multiple ways to order those filters. For example, if “VLDB” is highly selective it might be good to run a substring filter on VLDB first, and then feed the results into a key-value filter looking for name = "Athena". But if ‘VLDB’ occurs frequently in the dataset, we might be better off doing the key-value filtering first, and the substring search second. Or maybe we should try alternative substring searches in combination or instead, e.g. ‘submissi’. The optimum arrangement of filters in an RF cascade depends on the underlying data, the performance cost of running the individual raw filters, and their selectivity. We also have to contend with predicates such as (name = "Athena" AND text = "Greetings") OR name = "Jupiter", which are converted into DNF form before processing.

The first stage in the process is to compile a set of candidate RFs to consider based on clauses in the input query. Each simple predicate component of a predicate in DNF form is turned into substring and key-value RFs as appropriate. A substring RF is produced for each 4- and 8-byte substring of each token in the predicate expression, plus one searching for the token in its entirety. Key-value RFs will be generated for JSON, but for formats such as Avro and Parquet where the key name is unlikely to be present in the binary stream these are skipped. For the simple predicate name = "Athena" we end up with e.g.:

  • Athe
  • then
  • hena
  • Athena
  • key = name, value = Athena, delimiters = ,

Since these can only produce false positives, if any of these RFs fails, the record can't match. For conjunctive clauses, we can simply take the union of all the simple predicate RFs in the clause. If any of them fail, the record can't match. For disjunctions (DNF is the disjunction of conjunctions), we require that an RF from each conjunction must fail in order to prevent false negatives.
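
As a sketch of that discard rule (the struct and function names here are assumptions for illustration, not Sparser's API): a record may be skipped only when every conjunctive clause has at least one failing RF; if all the RFs of any one clause pass, the record has to go to the full parser.

    #include <stdbool.h>
    #include <stddef.h>

    /* Hypothetical representation of a predicate in DNF: each clause holds the raw
     * filters generated from its simple predicates. */
    typedef bool (*raw_filter_fn)(const char *rec, size_t len);
    struct clause { raw_filter_fn *rfs; size_t n_rfs; };

    /* True only when the record provably cannot match: every clause has at least
     * one failing RF (and RFs never produce false negatives). */
    static bool may_discard(const char *rec, size_t len,
                            const struct clause *clauses, size_t n_clauses)
    {
        for (size_t c = 0; c < n_clauses; c++) {
            bool clause_ruled_out = false;
            for (size_t i = 0; i < clauses[c].n_rfs; i++) {
                if (!clauses[c].rfs[i](rec, len)) {
                    clause_ruled_out = true;    /* this conjunction cannot be satisfied */
                    break;
                }
            }
            if (!clause_ruled_out)
                return false;                   /* clause might still match: must parse */
        }
        return true;                            /* every clause ruled out: skip the parser */
    }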

Now Sparser draws a sample of records from the input and executes (independently) all of the RFs generated in the first step. It stores the passthrough rates of each RF in a compact matrix structure as well as recording the runtime costs of each RF and the runtime cost for the full parser.

After sampling, the optimizer has a populated matrix representing the records in the sample that passed for each RF, the average running time of each RF, and the average running time of the full parser.

Next, up to 32 candidate RF cascades are generated. A cascade is a binary tree where non-leaf nodes are RFs and leaf nodes are decisions (parse or discard). Sparser generates trees up to depth D = 4. If there are more than 32 possible trees, then 32 are selected at random by picking a random RF generated from each token in round-robin fashion.

Now Sparser estimates the costs of the candidate cascades using the matrix it populated during the sampling step. Since the matrix stores the result of each pass/fail for an RF as a single bit, the passthrough rate of RF i is simply the number of 1’s in the ith row of the matrix. The joint passthrough rate of any two RFs is the bitwise AND of their respective rows.

The key advantage to this approach is that these bitwise operations have SIMD support in modern hardware and complete in 1-3 cycles on 256-bit values on modern CPUs (roughly 1ns on a 3GHz processor).

Using this bit-matrix technique, the optimiser adds at most 1.2% overhead in the benchmark queries, including the time for sampling and scoring.
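
Here is a sketch of how a two-filter cascade might be scored against the sampled bit matrix. The cost formula (run RF a on every sampled record, RF b only on records that passed a, and the full parser only on records that passed both) is my reading of the optimiser, and the popcount builtin is a GCC/Clang extension rather than anything Sparser-specific.

    #include <stddef.h>
    #include <stdint.h>

    /* row_a and row_b are rows of the sampled bit matrix: bit j is 1 if sampled
     * record j passed that RF. The cost_* values are average per-record runtimes
     * measured during the sampling step. */
    static double cascade_cost(const uint64_t *row_a, const uint64_t *row_b,
                               size_t n_words, size_t n_records,
                               double cost_a, double cost_b, double cost_parse)
    {
        size_t pass_a = 0, pass_ab = 0;
        for (size_t w = 0; w < n_words; w++) {
            pass_a  += (size_t)__builtin_popcountll(row_a[w]);              /* passthrough of a */
            pass_ab += (size_t)__builtin_popcountll(row_a[w] & row_b[w]);   /* joint passthrough */
        }
        double p_a  = (double)pass_a  / (double)n_records;
        double p_ab = (double)pass_ab / (double)n_records;
        /* Expected per-record cost of the cascade "a, then b, then full parse". */
        return cost_a + p_a * cost_b + p_ab * cost_parse;
    }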

Periodic resampling

Sparser periodically recalibrates the cascade to account for data skew or sorting in the underlying input file. Consider an RF that filters by date and an input file sorted by date – it will either be highly selective or not selective at all depending on the portion of the file currently being processed.

Sparser maintains an exponentially weighted moving average of its own parsing throughput. In our implementation, we update this average on every 100MB block of input data. If the average throughput deviates significantly (e.g. 20% in our implementation), Sparser reruns its optimizer to select a new RF cascade.
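
A minimal sketch of that trigger follows, assuming a simple exponentially weighted moving average; the paper gives the 100MB block size and the 20% threshold, but not the smoothing weight, so alpha here is an assumption.

    #include <stdbool.h>

    /* Called once per 100MB block of input; returns true when the block's
     * throughput deviates from the running average by more than 20%, i.e. when
     * the RF cascade should be re-optimised on a fresh sample. */
    struct throughput_monitor {
        double avg_mbps;   /* exponentially weighted moving average */
        double alpha;      /* weight of the newest block, e.g. 0.2 (assumed) */
    };

    static bool should_reoptimise(struct throughput_monitor *m, double block_mbps)
    {
        if (m->avg_mbps == 0.0) {          /* first block: seed the average */
            m->avg_mbps = block_mbps;
            return false;
        }
        double deviation = (block_mbps - m->avg_mbps) / m->avg_mbps;
        m->avg_mbps = m->alpha * block_mbps + (1.0 - m->alpha) * m->avg_mbps;
        return deviation > 0.20 || deviation < -0.20;
    }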

Experimental results

Sparser is implemented in roughly 4000 lines of C, and supports mapping query predicates for text logs, JSON, Avro, Parquet, and PCAP. The team also integrated Sparser with Spark using the Data Sources API. Sparser is evaluated across a variety of workloads, datasets, and data formats.

Here you can see the end-to-end improvements when processing 68GB of JSON tweets using Spark:

Avro and Parquet formats get a big boost too:

I’m short on space to cover the evaluation in detail, but here are the highlights:

  • With raw filtering, Sparser improves on state-of-the-art JSON parsers by up to 22x. For distributed workloads it improves the end-to-end time by up to 9x.
  • Parsing of binary formats such as Avro and Parquet is accelerated by up to 5x. For queries over unstructured text logs, Sparser reduces the runtime by up to 4x.
  • Sparser selects RF cascades that are within 10% of the global optimum while incurring only a 1.2% runtime overhead.

In a periodic resampling experiment using just a date-based predicate, the resampling and re-optimisation process improved throughput by 25x compared to sticking with the initially selected RF cascade for the whole job.

See the blog post from the authors and a link to the code here.



quad
2589 days ago
This might be useful in my near future…